git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "Scheffenegger, Richard" <Richard.Scheffenegger@netapp.com>
To: "René Scharfe" <l.s.r@web.de>,
	"Junio C Hamano" <gitster@pobox.com>,
	"brian m. carlson" <sandals@crustytoothpaste.net>
Cc: "git@vger.kernel.org" <git@vger.kernel.org>
Subject: RE: git --archive
Date: Sat, 24 Sep 2022 11:34:57 +0000	[thread overview]
Message-ID: <PH0PR06MB7639DAA5CA112495E3EB43AB86509@PH0PR06MB7639.namprd06.prod.outlook.com> (raw)
In-Reply-To: <8eb5131e-5ae1-79bd-df0c-bf0b2ec8583f@web.de>

Hi Rene,

>  git archive compresses a small file by reading it fully and writing the result in one go.  It streams big files, though, i.e. reads, compresses and writes them in small pieces.  That won't work as easily if multiple files are compressed in parallel.

The ask was not to parallelize the compression step (scale CPU efficiency locally), but to heat up the *remote filesystem* metadata and data caches by massive parallel/multithreaded reads (if one so desires), in order for the stricty sequential, one-after-another existing archiving steps to complete faster.

> Giving up on deterministic order would reduce the memory usage for keeping compressed small files.  

That was just one idea. Retaining a deterministic order to have stable secure hashes of the resulting files is a valuable property.

> That only matters if the product of core.bigFileThreshold (default value 512 MB, the number of parallel threads and the compre>ssion ratio exceeds the available memory.  The same effect could be achieved by using temporary files.  We'd still have to keep up to core.bigFileThreshold times the number of threads of uncompressed data in memory, though.

Again, none of the data read during the preparatory, highly concurrent step would need to linger around anywhere. Just heating up *remote* filesystem metadata and data caches is sufficient to dramatically reduce access and read latency - which is the primary determining factor of how long it takes to prepare an archive (e.g 500 sec for 80k /1.3 GB uncompressed / 250 M compressed files when data partially/mostly resides on cold storage).

> If I/O latency instead of CPU usage is the limiting factor and prefetching would help then starting git grep or git archive in the background might work.  If the order of visited blobs needs to be randomized then perhaps something like this would be better:
>
>   git ls-tree -r HEAD | awk '{print $3}' | sort | git cat-file --batch >/dev/null

Isn't the 2nd git, receiving input from stdin, running single-threaded?

Maybe

Git ls-tree -r HEAD | awk '{print $3}' | sort | split -d -l 100 -a 4 - splitted ; for i in $(ls splitted????) ; do "git cat-file --batch > /dev/null &"; done; rm -f splitted????

To parallelize the reading of the objects?

Otherwise, the main problem of lacking concurrency in reading in all the objected while metadata/data caches are cold would only moved from the archive step to the cat-file step, while overall completion would not be any faster. 

> No idea how to randomize the order of tree object visits.

To heat up data caches, the order of objects visited is not relevant, the order or IOs issued to the actual object is relevant. Trivial sequential reads (from start to end) typically get marked for cache evicition after having been delivered once to the client - that cache memory becomes available for immediate overwrite. To increase their "stickiness" in caches, the object reads would need to be performed in a pseudo-random fashion, e.g. if the IO block size is 1MB, accessing blocks in in an order like 10,1,9,4,7,3,8,2,6,5 would have them marked for longer cache retention (in remote filesystem servers).

Richard


  reply	other threads:[~2022-09-24 11:35 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-09-22  8:57 git --archive Scheffenegger, Richard
2022-09-22 20:13 ` Junio C Hamano
2022-09-22 20:35   ` Scheffenegger, Richard
2022-09-23  0:49     ` brian m. carlson
2022-09-23 16:30       ` Junio C Hamano
2022-09-23 16:51         ` Scheffenegger, Richard
2022-09-24  8:58         ` René Scharfe
2022-09-24 11:34           ` Scheffenegger, Richard [this message]
2022-09-24 13:19             ` René Scharfe
2022-09-24 18:07               ` René Scharfe
2022-09-25  8:17                 ` René Scharfe
2022-09-24 19:44               ` Scheffenegger, Richard

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=PH0PR06MB7639DAA5CA112495E3EB43AB86509@PH0PR06MB7639.namprd06.prod.outlook.com \
    --to=richard.scheffenegger@netapp.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=l.s.r@web.de \
    --cc=sandals@crustytoothpaste.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).