RE: git --archive - Scheffenegger, Richard

From: "Scheffenegger, Richard" <Richard.Scheffenegger@netapp.com>
To: "René Scharfe" <l.s.r@web.de>,
	"Junio C Hamano" <gitster@pobox.com>,
	"brian m. carlson" <sandals@crustytoothpaste.net>
Cc: "git@vger.kernel.org" <git@vger.kernel.org>
Subject: RE: git --archive
Date: Sat, 24 Sep 2022 19:44:22 +0000
Message-ID: <PH0PR06MB763997C841605AEA42D8E8A986509@PH0PR06MB7639.namprd06.prod.outlook.com>
In-Reply-To: <712ffe78-c3e3-dacf-c3e3-f339385e9bb4@web.de>

>> Maybe
>>
>> git ls-tree -r HEAD | awk '{print $3}' | sort |
>>   split -d -l 100 -a 4 - splitted
>> for i in splitted????; do git cat-file --batch <"$i" >/dev/null & done
>> wait; rm -f splitted????
>>
>> To parallelize the reading of the objects?
>
> Sure, but in a repository with 100000 files you'd end up with 1000 parallel processes, which may be a few too many.  Splitting the list into similar-sized parts based on a given degree of parallelism is probably more practical.

Given the high overhead of starting new processes, and with most objects probably being small, I would expect a fair number of them to run to completion before the loop has finished. However, saving the process startup cost and internalizing something like this into the git archive code (perhaps with some logic to detect local vs. remote filesystems, e.g. by measuring initial response latency) would help more. Then, yes, restricting the concurrency to something reasonable, like a couple of hundred threads, maybe 1000-2000, would be good. (And indeed, different storage systems have different sweet spots for concurrency, where the overall completion time is minimal. 1 is certainly not it 😉)
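
As a stop-gap outside of git itself, the degree of parallelism can be capped with "xargs -P" instead of starting one background job per chunk. The following is only a minimal sketch: the helper name warm_objects and the example values 8 and 100 are made up for illustration, and -P is an extension found in GNU and BSD xargs rather than part of POSIX.

```shell
# Hypothetical helper (not part of git): read object IDs on stdin and feed
# them, in batches of $2 IDs, to at most $1 concurrent
# "git cat-file --batch" processes. The object contents are discarded;
# the only purpose is to pull them through the storage cache.
warm_objects() {
    max_procs=$1
    batch_size=$2
    xargs -n "$batch_size" -P "$max_procs" sh -c '
        printf "%s\n" "$@" | git cat-file --batch >/dev/null
    ' sh
}

# Usage, inside a repository (8 and 100 are arbitrary example values):
#   git ls-tree -r HEAD | awk '{print $3}' | warm_objects 8 100
```

This still pays the process startup cost once per batch, so it does not replace doing the same thing inside git archive.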

> It could be done by relying on the randomness of the object IDs and partitioning by a sub-string.  Or perhaps using pseudo-random numbers is sufficient:
>
>   git ls-tree -r HEAD |
>   awk '{print $3}' |
>   sort |
>   awk -v pieces=8 -v prefix=file '
>      {
>         piece = int(rand() * pieces)
>         filename = prefix piece
>         print $0 > filename
>      }'
>
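
The first variant mentioned above, partitioning by a sub-string of the object ID, could look roughly like the sketch below. It uses the first hex digit, which gives up to 16 pieces of roughly equal size because the digits of a hash are close to uniformly distributed. The helper name partition_by_id and the "part." prefix are assumptions made for this example only.

```shell
# Hypothetical helper: read object IDs on stdin and write each one to the
# file "<prefix><first hex digit of the ID>", e.g. part.0 ... part.f.
partition_by_id() {
    prefix=${1:-part.}
    awk -v prefix="$prefix" '{ print $0 > (prefix substr($1, 1, 1)) }'
}

# Usage, inside a repository:
#   git ls-tree -r HEAD | awk '{print $3}' | partition_by_id part.
```

Taking a longer prefix, or grouping several digits into one piece, would give a different number of pieces.
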
> So how much does such a warmup help in your case?

Cold-tier data with no metadata in cache: IO latency in the mid double-digit milliseconds. Warmed-up metadata cache: high single-digit to low double-digit milliseconds (~2-4x improvement). Warmed up and fully in the external storage cache: sub-millisecond latency (10-100x faster).

In other words - warming up the cache in a pre-phase with (very) high concurrency, accessing as many objects in parallel as possible, can hit the throughput limit (10G or 25G Ethernet), even though each individual IO would still take some 10-50 ms. Once the data is fully (externally) cached, a single-threaded application issuing one IO at a time will certainly not reach the Ethernet throughput limit, but it will complete much faster than when simply accessing the data cold. The only "cost" is that the data is effectively transferred twice - unless some more clever local marshalling mechanism is implemented (but that is beyond the scope of my ask).

>>> No idea how to randomize the order of tree object visits.
>>
>> To heat up data caches, the order of objects visited is not relevant, 
>> the order of IOs issued to the actual object is relevant.
>
> What's the difference?

Trivially sequential read access (reading blocks 1, 2, 3, 4, 5), while triggering some prefetching, will also mark the cache for these blocks for quick reuse - not LRU reuse. While this will warm up the metadata cache (directories, inodes), by the time the read-tree task in archive mode (one object after another, each read from start to end) comes by, the data blocks would need to be retrieved again from stable storage.

When the object data is accessed with some pseudo-random IO pattern, these blocks will be marked for LRU reuse in most storage systems and thus stay in cache (typically measuring a few TB nowadays, implemented as SRAM, DRAM, SCM (Intel Optane while it lasted) or native NVMe). So after the first phase of high-concurrency, pseudo-random reads, the simplistic tree traversal and sequential read of objects would be served from the cache, with IO latency in the sub-millisecond range - around 1-2 orders of magnitude faster. As the completion time of a task with concurrency 1 is just the serial concatenation of all the IO latencies (to a very good approximation), the completion time for "git archive" would be similarly reduced - from e.g. 400 sec down to 4-8 sec...
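
The arithmetic behind such an estimate can be sketched as follows. The IO count of 10000 and the per-IO latencies are illustrative assumptions chosen to land on the figures above, not measurements, and estimate_seconds is a made-up helper name.

```shell
# Serial completion time ~= number of IOs * latency per IO.
# All inputs below are illustrative assumptions, not measured values.
estimate_seconds() {
    # $1 = number of IOs, $2 = latency per IO in milliseconds
    awk -v n="$1" -v ms="$2" 'BEGIN { printf "%.1f\n", n * ms / 1000 }'
}

estimate_seconds 10000 40     # cold, 40 ms per IO:     prints 400.0
estimate_seconds 10000 0.8    # cached, 0.8 ms per IO:  prints 8.0
estimate_seconds 10000 0.4    # cached, 0.4 ms per IO:  prints 4.0
```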

> NB: When I wrote "tree objects" I meant the type of objects from Git's object store (made up of packs and loose files) that represent sub-directories, and with "visit" I meant reading them to traverse the hierarchy of Git blobs and trees.
> 
> Here's an idea after all: Using "git ls-tree" without "-r" and handling recursing in the prefetch script would allow traversing trees in a different order and even in parallel.  Not sure how to limit parallelism to a sane degree.

Ideally, the "git archive" command itself would ultimately take care of more effective (higher-concurrency) reads in its own code 😊

Best regards,
  Richard
