From: "René Scharfe" <>
To: "Scheffenegger, Richard" <>,
	Junio C Hamano <>,
	"brian m. carlson" <>
Cc: "" <>
Subject: Re: git --archive
Date: Sat, 24 Sep 2022 15:19:12 +0200
Message-ID: <>
In-Reply-To: <>

On 24.09.22 at 13:34, Scheffenegger, Richard wrote:
>> If I/O latency instead of CPU usage is the limiting factor and
>> prefetching would help then starting git grep or git archive in the
>> background might work.  If the order of visited blobs needs to be
>> randomized then perhaps something like this would be better:
>> git ls-tree -r HEAD | awk '{print $3}' | sort | git cat-file
>> --batch >/dev/null
> Isn't the 2nd git, receiving input from stdin, running
> single-threaded?


> Maybe
> Git ls-tree -r HEAD | awk '{print $3}' | sort | split -d -l 100 -a 4
> - splitted ; for i in $(ls splitted????) ; do "git cat-file --batch >
> /dev/null &"; done; rm -f splitted????
> To parallelize the reading of the objects?

Sure, but in a repository with 100000 files you'd end up with 1000
parallel processes, which may be a few too many.  Splitting the list
into similar-sized parts based on a given degree of parallelism is
probably more practical.

It could be done by relying on the randomness of the object IDs and
partitioning by a sub-string.  Or perhaps using pseudo-random numbers
is sufficient:

   git ls-tree -r HEAD |
   awk '{print $3}' |
   sort |
   awk -v pieces=8 -v prefix=file '
      {
         piece = int(rand() * pieces)
         filename = prefix piece
         print $0 > filename
      }'
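
The sub-string variant could look roughly like the following.  This is
only a sketch: it keys on the first hex digit of each object ID, so the
number of pieces is fixed at up to 16, and the function name
partition_by_first_digit and the "part." prefix are made up for
illustration.

```shell
# Hypothetical helper: read object IDs (one per line) on stdin and
# append each one to the file <prefix><first hex digit of the ID>.
# $1 = file name prefix, e.g. "part."
partition_by_first_digit() {
    awk -v prefix="$1" '
        {
            filename = prefix substr($1, 1, 1)
            print $1 > filename
        }'
}
```

In a repository it might be used like this, with one reader per piece:

   git ls-tree -r HEAD | awk '{print $3}' | partition_by_first_digit part.
   for f in part.?; do git cat-file --batch <"$f" >/dev/null & done; wait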

So how much does such a warmup help in your case?

>> No idea how to randomize the order of tree object visits.
> To heat up data caches, the order of objects visited is not relevant,
> the order or IOs issued to the actual object is relevant.

What's the difference?

NB: When I wrote "tree objects" I meant the type of objects from Git's
object store (made up of packs and loose files) that represent
sub-directories, and with "visit" I meant reading them to traverse the
hierarchy of Git blobs and trees.

Here's an idea after all: Using "git ls-tree" without "-r" and handling
the recursion in the prefetch script would allow traversing trees in a
different order and even in parallel.  I'm not sure how to limit the
parallelism to a sane degree, though.
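
A rough sketch of the non-recursive traversal, assuming a POSIX shell.
It walks the hierarchy level by level and stays sequential; the list of
trees for each level could in principle be shuffled or handed to several
workers.  The function name list_blobs_by_level is made up for
illustration, and its variables are global because POSIX sh has no
"local".

```shell
# Hypothetical helper: print the IDs of all blobs reachable from a
# tree-ish, one level of the hierarchy at a time, without "ls-tree -r".
# $1 = tree-ish, e.g. HEAD
list_blobs_by_level() {
    level=$(git rev-parse "$1^{tree}") || return 1
    while [ -n "$level" ]; do
        next=
        for tree in $level; do
            entries=$(git ls-tree "$tree") || return 1
            # Blobs of this tree are printed right away ...
            printf '%s\n' "$entries" | awk '$2 == "blob" {print $3}'
            # ... and its sub-trees are queued for the next level.
            next="$next $(printf '%s\n' "$entries" |
                          awk '$2 == "tree" {print $3}')"
        done
        level=$next
    done
}
```

Its output can be fed to "git cat-file --batch" like the output of the
"ls-tree -r" pipeline.  Entries of type "commit" (submodules) are
skipped.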

