* git --archive
@ 2022-09-22  8:57 Scheffenegger, Richard
  2022-09-22 20:13 ` Junio C Hamano
  0 siblings, 1 reply; 12+ messages in thread
From: Scheffenegger, Richard @ 2022-09-22  8:57 UTC (permalink / raw)
  To: git@vger.kernel.org

Hi group,

Unless I’m mistaken, the procedure to create a zip archive appears to be a recursive collection of all relevant objects, which are then written out sequentially, in a single thread.

Is this assessment correct?

I was wondering if a highly concurrent fetching phase could be optionally added…

Background is the proliferation of cold-archive / high-latency storage – traversing a tree in a recursive manner, single-threaded, will accumulate the maximum amount of wallclock time.

A first improvement could be to run all the recursive traversals in parallel (multi-threaded) – with dispatched threads issuing prefetch read calls (to warm up the various caches). In a second phase, traverse the (now warmed-up) tree single-threaded and append the objects to the archive…
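
As a crude shell illustration of that first phase (one cat-file process per object standing in for prefetch threads inside git, and 32 as an arbitrary degree of parallelism):

git ls-tree -r --object-only HEAD | xargs -P 32 -I {} git cat-file blob {} >/dev/null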

Subsequently, if sufficient memory can be allocated, the asynchronously dispatched reads could be kept in memory, and as soon as the reads for the next in-order object have completed (individually, these reads may not complete in order), the objects could be appended to the archive…

Best regards,

Richard Scheffenegger



* Re: git --archive
  2022-09-22  8:57 git --archive Scheffenegger, Richard
@ 2022-09-22 20:13 ` Junio C Hamano
  2022-09-22 20:35   ` Scheffenegger, Richard
  0 siblings, 1 reply; 12+ messages in thread
From: Junio C Hamano @ 2022-09-22 20:13 UTC (permalink / raw)
  To: Scheffenegger, Richard; +Cc: git@vger.kernel.org

"Scheffenegger, Richard" <Richard.Scheffenegger@netapp.com> writes:

> Unless I’m mistaken, the procedure to create a zip archive appears to be a recursive collection of all relevant objects, which are then written out sequentially, in a single thread.
>
> Is this assessment correct?
>
> I was wondering if a highly concurrent fetching phase could be optionally added…

The details matter here, I think.  Enumerating and slurping down the
contents to be archived out of the repository/object store to the
core can indeed be made parallel, but the end product being a zip
archive or a tarball, which is a fairly serialized output format,
there is only so much you can hold in core, and it is not clear how
you plan to do this without filling all the memory.



* RE: git --archive
  2022-09-22 20:13 ` Junio C Hamano
@ 2022-09-22 20:35   ` Scheffenegger, Richard
  2022-09-23  0:49     ` brian m. carlson
  0 siblings, 1 reply; 12+ messages in thread
From: Scheffenegger, Richard @ 2022-09-22 20:35 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git@vger.kernel.org

Hi Junio,

>> Unless I’m mistaken, the procedure to create a zip archive appears to be a recursive collection of all relevant objects, which are then written out sequentially, in a single thread.
>>
>> Is this assessment correct?
>>
>> I was wondering if a highly concurrent fetching phase could be 
>> optionally added…
>
> The details matter here, I think.  Enumerating and slurping down the contents to be archived out of the repository/object store to the core can indeed be made parallel, but the end product being a zip archive or a tarball, which is a fairly serialized output format, there is only so much you can hold in core, and it is not clear how you plan to do this without filling all the memory.

"core" (presumably referring to the OS kernel memory for IO caching) is not the only cache in play here. 

As mentioned, the use case is repositories living on storage systems measuring in at around 500 TB on the filesystem - not uncommon at hyperscalers and in large commercial-scale development environments.

Those external filesystems (the filesystems do NOT live in "core") tend to have some more or less sophisticated form of tiering, destaging cold (infrequently accessed) data out to high-latency devices. Amazon Glacier is an extreme example here (on-demand tape access, where a robot has to fetch a cartridge and put it into a tape drive whenever data stored there is needed).

In a more realistic scenario, you may have 100 TB in SSD, and a few PB in spinning rust, externally connected to the primary tier.

An initial phase that simply fetches as many objects and sub-trees as possible in parallel (ideally with IOs to the objects issued in a non-sequential order, so that they and their metadata are not evicted immediately after the first fetch) would heat up those external caches (and to some extent the "core" filesystem page cache too, but that is of minor concern). When the tree is then walked in a fully sequential, recursive way, just like currently, the hot metadata and some hot data would dramatically cut the accumulated latency.


A second phase, fetching the tree in parallel and sending it out serially, would work even better.
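
Roughly, as a sketch of that second phase - with temporary files standing in for memory, and plain concatenation standing in for the archive writer:

dir=$(mktemp -d)
git ls-tree -r --object-only HEAD >"$dir/order"
# fetch all blobs in parallel, each into its own temporary file...
xargs -P 8 -I {} sh -c 'git cat-file blob {} >"$0/{}"' "$dir" <"$dir/order"
# ...then append them in the original, deterministic order
while read -r oid; do cat "$dir/$oid"; done <"$dir/order" >payload.out
rm -r "$dir"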

Also, at least for ZIP (not so much for TAR), objects residing in different subdirectories can be stored in any order - they only need to be referenced properly in the central directory. Thus whenever a subthread has completed reading a (sufficiently small) object into (git program) memory, it could be handed immediately to the ZIP writer thread. The result would be that small and hot files (which can be read in quickly) end up at the beginning of the zip file, while the parallel threads are already reading in larger and colder objects - the absolute wait time within the worker thread reading those objects may be slightly higher, but as many objects are read in parallel, the absolute time to create the archive would be minimized.

In short - with ZIP there is no real need for the recursive traversal of the objects and trees to deliver them in sequence at all (besides, the sequence may be determined by the underlying filesystem, which is not necessarily guaranteed to provide a trivially sorted list of any kind - only that the ordering of files within a directory is stable).

Hope this clarifies the background of this ask.

Best regards,
  Richard



* Re: git --archive
  2022-09-22 20:35   ` Scheffenegger, Richard
@ 2022-09-23  0:49     ` brian m. carlson
  2022-09-23 16:30       ` Junio C Hamano
  0 siblings, 1 reply; 12+ messages in thread
From: brian m. carlson @ 2022-09-23  0:49 UTC (permalink / raw)
  To: Scheffenegger, Richard; +Cc: Junio C Hamano, git@vger.kernel.org

On 2022-09-22 at 20:35:08, Scheffenegger, Richard wrote:
> Also, at least for ZIP (not so much for TAR), objects residing in
> different subdirectories can be stored in any order - they only need
> to be referenced properly in the central directory. Thus whenever a
> subthread has completed reading a (sufficiently small) object into
> (git program) memory, it could be handed immediately to the ZIP
> writer thread. The result would be that small and hot files (which
> can be read in quickly) end up at the beginning of the zip file,
> while the parallel threads are already reading in larger and colder
> objects - the absolute wait time within the worker thread reading
> those objects may be slightly higher, but as many objects are read
> in parallel, the absolute time to create the archive would be
> minimized.

Maybe they can technically be stored in any order, but people don't want
git archive to produce non-deterministic archives.  I'm one of the folks
responsible for the service at GitHub that serves archives (which uses
git archive under the hood) and people become very unhappy when the
archives are not bit-for-bit identical, even though neither Git nor
GitHub guarantee that.  That's because people want to use those archives
with cryptographic hashes like SHA-256, and if the file changes, the
hash breaks.  (We tell them to generate a tarball as part of the release
process and upload it as a release asset instead.)
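
For example, something along these lines at release time, where the
tag and file names are placeholders:

git archive --format=tar.gz --prefix=project-1.0/ -o project-1.0.tar.gz v1.0
sha256sum project-1.0.tar.gz >project-1.0.tar.gz.sha256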

What Git does implicitly guarantee is that the result is deterministic:
that is, given the same repository and the same version of Git, that the
archive is identical.  The encoding may change across versions, but not
within a version.  I feel like it would be very difficult to achieve the
speedups you want and still produce a deterministic archive.
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA


* Re: git --archive
  2022-09-23  0:49     ` brian m. carlson
@ 2022-09-23 16:30       ` Junio C Hamano
  2022-09-23 16:51         ` Scheffenegger, Richard
  2022-09-24  8:58         ` René Scharfe
  0 siblings, 2 replies; 12+ messages in thread
From: Junio C Hamano @ 2022-09-23 16:30 UTC (permalink / raw)
  To: brian m. carlson; +Cc: Scheffenegger, Richard, git@vger.kernel.org

"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> Maybe they can technically be stored in any order, but people don't want
> git archive to produce non-deterministic archives...
> ...  I feel like it would be very difficult to achieve the
> speedups you want and still produce a deterministic archive.

I am not going to work on it myself, but I think the only possible
parallelism would come from making the reading of F(n+1) and
subsequent objects overlap the writing of F(n), given a deterministic
order of files in the resulting archive.  When we decide which file
should come first and learn that it is F(0), it probably comes from
the tree object of the root level, and it is very likely that we
would already know what F(1) and F(2) are by that time, so it should
be possible to dispatch reading and applying content filtering on
F(1) and keep the result in core, while we are still writing F(0)
out.
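
A toy illustration in shell, with plain concatenation standing in
for the real archive writer (prefetching the next blob in the
background while the current one is written; ignores submodules and
content filters):

git ls-tree -r --object-only HEAD |
{
        read -r cur || exit 0
        while read -r next
        do
                # start reading F(n+1) in the background...
                git cat-file blob "$next" >/dev/null &
                # ...while F(n) is written out
                git cat-file blob "$cur"
                wait
                cur=$next
        done
        git cat-file blob "$cur"
} >payload.out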

Thanks.


* RE: git --archive
  2022-09-23 16:30       ` Junio C Hamano
@ 2022-09-23 16:51         ` Scheffenegger, Richard
  2022-09-24  8:58         ` René Scharfe
  1 sibling, 0 replies; 12+ messages in thread
From: Scheffenegger, Richard @ 2022-09-23 16:51 UTC (permalink / raw)
  To: Junio C Hamano, brian m. carlson; +Cc: git@vger.kernel.org




> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
>
>> Maybe they can technically be stored in any order, but people don't 
>> want git archive to produce non-deterministic archives...
>> ...  I feel like it would be very difficult to achieve the speedups 
>> you want and still produce a deterministic archive.
>
> I am not going to work on it myself, but I think the only possible parallelism would come from making the reading of F(n+1) and subsequent objects overlap the writing of F(n), given a deterministic order of files in the resulting archive.  When we decide which file should come first and learn that it is F(0), it probably comes from the tree object of the root level, and it is very likely that we would already know what F(1) and F(2) are by that time, so it should be possible to dispatch reading and applying content filtering on F(1) and keep the result in core, while we are still writing F(0) out.
>
> Thanks.

Yes. But even before any changes to the actual tree traversal that collects the objects one by one as currently, a "simple" parallelized, recursive walk over all objects, pseudo-randomly reading a fraction of the data (mostly directories, but also files, to update all the (externally) cached inode metadata), should help. As long as this stage is highly parallelized, its cost (in time) would be recovered by a much faster single-threaded tree recursion just as it exists currently.

That is not to say that the above method wouldn't be a significant improvement on top of that 😊


* Re: git --archive
  2022-09-23 16:30       ` Junio C Hamano
  2022-09-23 16:51         ` Scheffenegger, Richard
@ 2022-09-24  8:58         ` René Scharfe
  2022-09-24 11:34           ` Scheffenegger, Richard
  1 sibling, 1 reply; 12+ messages in thread
From: René Scharfe @ 2022-09-24  8:58 UTC (permalink / raw)
  To: Junio C Hamano, brian m. carlson, Scheffenegger, Richard
  Cc: git@vger.kernel.org

On 23.09.22 at 18:30, Junio C Hamano wrote:
> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
>
>> Maybe they can technically be stored in any order, but people don't want
>> git archive to produce non-deterministic archives...
>> ...  I feel like it would be very difficult to achieve the
>> speedups you want and still produce a deterministic archive.
>
> I am not going to work on it myself, but I think the only possible
> parallelism would come from making the reading of F(n+1) and
> subsequent objects overlap the writing of F(n), given a deterministic
> order of files in the resulting archive.  When we decide which file
> should come first and learn that it is F(0), it probably comes from
> the tree object of the root level, and it is very likely that we
> would already know what F(1) and F(2) are by that time, so it should
> be possible to dispatch reading and applying content filtering on
> F(1) and keep the result in core, while we are still writing F(0)
> out.

That's what git grep does.  It can be seen as a very lossy compression
with output printed in a deterministic order.

git archive compresses a small file by reading it fully and writing
the result in one go.  It streams big files, though, i.e. reads,
compresses and writes them in small pieces.  That won't work as
easily if multiple files are compressed in parallel.

Allowing multiple streams would require storing their results in
temporary files.  Perhaps it would already help to allow only a
single stream and to start it only when its turn for output has come,
though.

Giving up on deterministic order would reduce the memory usage for
keeping compressed small files.  That only matters if the product of
core.bigFileThreshold (default value 512 MB), the number of parallel
threads and the compression ratio exceeds the available memory - e.g.
8 threads times 512 MB times a compression ratio of 0.5 would be up
to 2 GB.  The same effect could be achieved by using temporary files.
We'd still have to keep up to core.bigFileThreshold times the number
of threads of uncompressed data in memory, though.

If I/O latency instead of CPU usage is the limiting factor and
prefetching would help, then starting git grep or git archive in the
background might work.  If the order of visited blobs needs to be
randomized then perhaps something like this would be better:

   git ls-tree -r HEAD | awk '{print $3}' | sort | git cat-file --batch >/dev/null

No idea how to randomize the order of tree object visits.

René



* RE: git --archive
  2022-09-24  8:58         ` René Scharfe
@ 2022-09-24 11:34           ` Scheffenegger, Richard
  2022-09-24 13:19             ` René Scharfe
  0 siblings, 1 reply; 12+ messages in thread
From: Scheffenegger, Richard @ 2022-09-24 11:34 UTC (permalink / raw)
  To: René Scharfe, Junio C Hamano, brian m. carlson; +Cc: git@vger.kernel.org

Hi René,

>  git archive compresses a small file by reading it fully and writing the result in one go.  It streams big files, though, i.e. reads, compresses and writes them in small pieces.  That won't work as easily if multiple files are compressed in parallel.

The ask was not to parallelize the compression step (scaling CPU efficiency locally), but to heat up the *remote filesystem* metadata and data caches with massively parallel/multithreaded reads (if one so desires), in order for the strictly sequential, one-after-another existing archiving steps to complete faster.

> Giving up on deterministic order would reduce the memory usage for keeping compressed small files.  

That was just one idea. Retaining a deterministic order to have stable secure hashes of the resulting files is a valuable property.

> That only matters if the product of core.bigFileThreshold (default value 512 MB), the number of parallel threads and the compression ratio exceeds the available memory.  The same effect could be achieved by using temporary files.  We'd still have to keep up to core.bigFileThreshold times the number of threads of uncompressed data in memory, though.

Again, none of the data read during the preparatory, highly concurrent step would need to linger around anywhere. Just heating up the *remote* filesystem metadata and data caches is sufficient to dramatically reduce access and read latency - which is the primary factor determining how long it takes to prepare an archive (e.g. 500 sec for 80k files / 1.3 GB uncompressed / 250 MB compressed when the data partially/mostly resides on cold storage).

> If I/O latency instead of CPU usage is the limiting factor and prefetching would help then starting git grep or git archive in the background might work.  If the order of visited blobs needs to be randomized then perhaps something like this would be better:
>
>   git ls-tree -r HEAD | awk '{print $3}' | sort | git cat-file --batch >/dev/null

Isn't the 2nd git, receiving input from stdin, running single-threaded?

Maybe

git ls-tree -r HEAD | awk '{print $3}' | sort | split -d -l 100 -a 4 - splitted ; for i in splitted???? ; do git cat-file --batch <"$i" >/dev/null & done ; wait ; rm -f splitted????

To parallelize the reading of the objects?

Otherwise, the main problem - lacking concurrency in reading in all the objects while metadata/data caches are cold - would only be moved from the archive step to the cat-file step, and overall completion would not be any faster.

> No idea how to randomize the order of tree object visits.

To heat up data caches, the order of objects visited is not relevant; the order of IOs issued to the actual object is relevant. Trivial sequential reads (from start to end) typically get marked for cache eviction after having been delivered once to the client - that cache memory becomes available for immediate overwrite. To increase their "stickiness" in caches, the object reads would need to be performed in a pseudo-random fashion, e.g. if the IO block size is 1 MB, accessing blocks in an order like 10,1,9,4,7,3,8,2,6,5 would have them marked for longer cache retention (in remote filesystem servers).
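
As a crude illustration of such an access pattern on a single file (big.file is a placeholder, with a 1 MB IO block size):

for b in 10 1 9 4 7 3 8 2 6 5
do
        dd if=big.file of=/dev/null bs=1M count=1 skip="$b" 2>/dev/null
done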

Richard



* Re: git --archive
  2022-09-24 11:34           ` Scheffenegger, Richard
@ 2022-09-24 13:19             ` René Scharfe
  2022-09-24 18:07               ` René Scharfe
  2022-09-24 19:44               ` Scheffenegger, Richard
  0 siblings, 2 replies; 12+ messages in thread
From: René Scharfe @ 2022-09-24 13:19 UTC (permalink / raw)
  To: Scheffenegger, Richard, Junio C Hamano, brian m. carlson
  Cc: git@vger.kernel.org

On 24.09.22 at 13:34, Scheffenegger, Richard wrote:
>
>> If I/O latency instead of CPU usage is the limiting factor and
>> prefetching would help then starting git grep or git archive in the
>> background might work.  If the order of visited blobs needs to be
>> randomized then perhaps something like this would be better:
>>
>> git ls-tree -r HEAD | awk '{print $3}' | sort | git cat-file
>> --batch >/dev/null
>
> Isn't the 2nd git, receiving input from stdin, running
> single-threaded?

Yes.

> Maybe
>
> git ls-tree -r HEAD | awk '{print $3}' | sort | split -d -l 100 -a 4
> - splitted ; for i in splitted???? ; do git cat-file --batch <"$i" >/dev/null &
> done ; wait ; rm -f splitted????
>
> To parallelize the reading of the objects?

Sure, but in a repository with 100000 files you'd end up with 1000
parallel processes, which may be a few too many.  Splitting the list
into similar-sized parts based on a given degree of parallelism is
probably more practical.

It could be done by relying on the randomness of the object IDs and
partitioning by a sub-string.  Or perhaps using pseudo-random numbers
is sufficient:

   git ls-tree -r HEAD |
   awk '{print $3}' |
   sort |
   awk -v pieces=8 -v prefix=file '
      {
         piece = int(rand() * pieces)
         filename = prefix piece
         print $0 > filename
      }'
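
The pieces could then be prefetched in parallel, e.g.:

   for f in file[0-7]
   do
      git cat-file --batch <"$f" >/dev/null &
   done
   wait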

So how much does such a warmup help in your case?

>> No idea how to randomize the order of tree object visits.
>
> To heat up data caches, the order of objects visited is not relevant;
> the order of IOs issued to the actual object is relevant.

What's the difference?

NB: When I wrote "tree objects" I meant the type of objects from Git's
object store (made up of packs and loose files) that represent
sub-directories, and with "visit" I meant reading them to traverse the
hierarchy of Git blobs and trees.

Here's an idea after all: Using "git ls-tree" without "-r" and
handling the recursion in the prefetch script would allow traversing
trees in a different order and even in parallel.  Not sure how to
limit parallelism to a sane degree.

René


* Re: git --archive
  2022-09-24 13:19             ` René Scharfe
@ 2022-09-24 18:07               ` René Scharfe
  2022-09-25  8:17                 ` René Scharfe
  2022-09-24 19:44               ` Scheffenegger, Richard
  1 sibling, 1 reply; 12+ messages in thread
From: René Scharfe @ 2022-09-24 18:07 UTC (permalink / raw)
  To: Scheffenegger, Richard, Junio C Hamano, brian m. carlson
  Cc: git@vger.kernel.org

On 24.09.22 at 15:19, René Scharfe wrote:
> It could be done by relying on the randomness of the object IDs and
> partitioning by a sub-string.  Or perhaps using pseudo-random numbers
> is sufficient:
>
>    git ls-tree -r HEAD |
>    awk '{print $3}' |

No need for awk here, of course; "git ls-tree -r --object-only HEAD"
does the same.  Just saying.

> Here's an idea after all: Using "git ls-tree" without "-r" and
> handling the recursion in the prefetch script would allow traversing
> trees in a different order and even in parallel.  Not sure how to
> limit parallelism to a sane degree.

How about something like this?  xargs -P provides a controlled degree of
parallelism.  Sorting by object ID (i.e. hash value) should provide a
fairly random order.  Does this thing work for you?


treeish=HEAD
parallelism=8

dir=$(mktemp -d)
echo "$treeish" >"$dir/trees"

# Traverse all sub-trees in randomized order and collect all blob IDs.
while test -s "$dir/trees"
do
        sort <"$dir/trees" >"$dir/trees.current"
        rm "$dir/trees"
        xargs -P "$parallelism" -L 1 git ls-tree <"$dir/trees.current" |
        awk -v dir="$dir" -v pieces="$parallelism" '
                $2 == "tree" {print $3 > (dir "/trees")}
                $2 == "blob" {print $3 >> (dir "/blobs" int(rand() * pieces))}
        '
done

# Prefetch all blobs in randomized order.
replstr="%"
command="sort $replstr | git cat-file --batch >/dev/null"
ls "$dir/blobs"* | xargs -P "$parallelism" -I "$replstr" -L 1 sh -c "$command"

rm "$dir/trees.current" "$dir/blobs"*
rmdir "$dir"


* RE: git --archive
  2022-09-24 13:19             ` René Scharfe
  2022-09-24 18:07               ` René Scharfe
@ 2022-09-24 19:44               ` Scheffenegger, Richard
  1 sibling, 0 replies; 12+ messages in thread
From: Scheffenegger, Richard @ 2022-09-24 19:44 UTC (permalink / raw)
  To: René Scharfe, Junio C Hamano, brian m. carlson; +Cc: git@vger.kernel.org

>> Maybe
>>
>> git ls-tree -r HEAD | awk '{print $3}' | sort | split -d -l 100 -a 4
>> - splitted ; for i in splitted???? ; do git cat-file --batch <"$i" >/dev/null &
>> done ; wait ; rm -f splitted????
>>
>> To parallelize the reading of the objects?
>
> Sure, but in a repository with 100000 files you'd end up with 1000 parallel processes, which may be a few too many.  Splitting the list into similar-sized parts based on a given degree of parallelism is probably more practical.

With the high overhead of starting up new processes, and most objects probably being small, I would expect a fair number to run to completion before the loop is finished. However, saving on process startup cost etc. and internalizing something like this into the git archive code (perhaps with some code to detect local vs remote filesystems, e.g. by measuring initial response latency) would help more. Then, yes, restricting the concurrency to something reasonable like a couple hundred threads, maybe 1000-2000, would be good. (And indeed, different storage systems have different sweet spots for concurrency - where the overall completion time is minimal. 1 is certainly not it 😉)

> It could be done by relying on the randomness of the object IDs and partitioning by a sub-string.  Or perhaps using pseudo-random numbers is sufficient:
>
>   git ls-tree -r HEAD |
>   awk '{print $3}' |
>   sort |
>   awk -v pieces=8 -v prefix=file '
>      {
>         piece = int(rand() * pieces)
>         filename = prefix piece
>         print $0 > filename
>      }'
>
> So how much does such a warmup help in your case?

Cold-tier data with no metadata in cache - IO latency in the mid double-digit milliseconds. Warmed-up metadata cache - high single-digit to low double-digit milliseconds (~2-4x improvement). Warmed up and fully in the external storage cache - sub-millisecond latency (10-100x faster).

In other words - warming up the cache in a pre-phase with (very) high concurrency - accessing as many objects in parallel as possible - can hit the throughput limit (10G, 25G Eth), but individually each IO would still take some 10-50 ms. However, once the data is fully (externally) cached, a single-threaded application issuing one IO at a time will certainly not reach the Eth throughput limit, but it will complete much faster than simply accessing the data cold. The only "cost" is that the data is effectively transferred twice - unless some more clever local marshalling mechanism is implemented (but that is beyond the scope of my ask).


>>> No idea how to randomize the order of tree object visits.
>>
>> To heat up data caches, the order of objects visited is not relevant;
>> the order of IOs issued to the actual object is relevant.
>
> What's the difference?

Trivial sequential read access (reading blocks 1, 2, 3, 4, 5), while triggering some prefetching, will also mark the cache for these blocks for quick reuse - not for LRU retention. While this will warm up the metadata cache (directories, inodes), by the time the read-tree task in archive mode (one object after another, read from start to end) comes by, the blocks would need to be retrieved again from stable storage.

When there is a pseudo-random IO pattern when accessing the object data, these blocks will be marked for LRU reuse in most storage systems, and thus stay in cache (typically measuring a few TB nowadays too, implemented as SRAM, DRAM, SCM (Intel Optane while it lasted) or native NVMe). So after the first phase of high-concurrency, pseudo-random reads, the simplistic tree traversal and sequential read of objects would be served from the cache, with IO latency in the sub-millisecond range - around 1-2 orders of magnitude faster. As the completion time of a task with concurrency 1 is, to a very good approximation, just the serial concatenation of all the IO latencies, the completion time for "git archive" would be similarly reduced - from e.g. 400 sec down to 4-8 sec (80k objects at ~5 ms each is ~400 sec; at ~0.1 ms each, ~8 sec)...

> NB: When I wrote "tree objects" I meant the type of objects from Git's object store (made up of packs and loose files) that represent sub-directories, and with "visit" I meant reading them to traverse the hierarchy of Git blobs and trees.
> 
> Here's an idea after all: Using "git ls-tree" without "-r" and handling the recursion in the prefetch script would allow traversing trees in a different order and even in parallel.  Not sure how to limit parallelism to a sane degree.

Ideally, the "git archive" command could take care of more effective (higher concurrency) in the code ultimately 😊

Best regards,
  Richard



* Re: git --archive
  2022-09-24 18:07               ` René Scharfe
@ 2022-09-25  8:17                 ` René Scharfe
  0 siblings, 0 replies; 12+ messages in thread
From: René Scharfe @ 2022-09-25  8:17 UTC (permalink / raw)
  To: Scheffenegger, Richard, Junio C Hamano, brian m. carlson
  Cc: git@vger.kernel.org

On 24.09.22 at 20:07, René Scharfe wrote:
> On 24.09.22 at 15:19, René Scharfe wrote:
>> It could be done by relying on the randomness of the object IDs and
>> partitioning by a sub-string.  Or perhaps using pseudo-random numbers
>> is sufficient:
>>
>>    git ls-tree -r HEAD |
>>    awk '{print $3}' |
>
> No need for awk here, of course; "git ls-tree -r --object-only HEAD"
> does the same.  Just saying.
>
>> Here's an idea after all: Using "git ls-tree" without "-r" and
>> handling the recursion in the prefetch script would allow traversing
>> trees in a different order and even in parallel.  Not sure how to
>> limit parallelism to a sane degree.
>
> How about something like this?  xargs -P provides a controlled degree of
> parallelism.  Sorting by object ID (i.e. hash value) should provide a
> fairly random order.  Does this thing work for you?
>
>
> treeish=HEAD
> parallelism=8
>
> dir=$(mktemp -d)
> echo "$treeish" >"$dir/trees"
>
> # Traverse all sub-trees in randomized order and collect all blob IDs.
> while test -s "$dir/trees"
> do
>         sort <"$dir/trees" >"$dir/trees.current"
>         rm "$dir/trees"
>         xargs -P "$parallelism" -L 1 git ls-tree <"$dir/trees.current" |
>         awk -v dir="$dir" -v pieces="$parallelism" '
>                 $2 == "tree" {print $3 > (dir "/trees")}
>                 $2 == "blob" {print $3 >> (dir "/blobs" int(rand() * pieces))}

This will exhaust the file descriptor limit (ulimit -n) for really high
degrees of parallelism.  Here's a version that scales beyond that.  It
takes ca. five seconds on my machine against Git's own repository on an
SSD, so its process overhead should still be bearable for your purpose.


treeish=HEAD
parallelism=1000

dir=$(mktemp -d)
echo "$treeish" >"$dir/trees"

while test -s "$dir/trees"
do
        sort <"$dir/trees" >"$dir/trees.current"
        rm "$dir/trees"
        xargs -P "$parallelism" -L 1 git ls-tree <"$dir/trees.current" |
        awk -v dir="$dir" -v pieces="$parallelism" '
                $2 == "tree" {print $3 > (dir "/trees")}
                $2 == "blob" {print int(rand()*pieces)+1, $3 >> (dir "/blobs")}
        '
done

replstr="%"
command="awk '\$1 == \"%\" {print \$2}' | sort | git cat-file --batch >/dev/null"
seq "$parallelism" | xargs -P "$parallelism" -I "$replstr" -L 1 sh -c "$command"

rm "$dir/trees.current" "$dir/blobs"
rmdir "$dir"
