git@vger.kernel.org mailing list mirror (one of many)
From: "René Scharfe" <l.s.r@web.de>
To: "Scheffenegger, Richard" <Richard.Scheffenegger@netapp.com>,
	Junio C Hamano <gitster@pobox.com>,
	"brian m. carlson" <sandals@crustytoothpaste.net>
Cc: "git@vger.kernel.org" <git@vger.kernel.org>
Subject: Re: git --archive
Date: Sat, 24 Sep 2022 15:19:12 +0200
Message-ID: <712ffe78-c3e3-dacf-c3e3-f339385e9bb4@web.de>
In-Reply-To: <PH0PR06MB7639DAA5CA112495E3EB43AB86509@PH0PR06MB7639.namprd06.prod.outlook.com>

On 24.09.22 at 13:34, Scheffenegger, Richard wrote:
>
>> If I/O latency instead of CPU usage is the limiting factor and
>> prefetching would help then starting git grep or git archive in the
>> background might work.  If the order of visited blobs needs to be
>> randomized then perhaps something like this would be better:
>>
>> git ls-tree -r HEAD | awk '{print $3}' | sort | git cat-file
>> --batch >/dev/null
>
> Isn't the 2nd git, receiving input from stdin, running
> single-threaded?

Yes.

> Maybe
>
> git ls-tree -r HEAD | awk '{print $3}' | sort | split -d -l 100 -a 4
> - splitted ; for i in splitted???? ; do git cat-file --batch <"$i"
> >/dev/null & done ; wait ; rm -f splitted????
>
> To parallelize the reading of the objects?

Sure, but in a repository with 100000 files you'd end up with 1000
parallel processes, which may be a few too many.  Splitting the list
into similar-sized parts based on a given degree of parallelism is
probably more practical.

It could be done by relying on the randomness of the object IDs and
partitioning by a sub-string.  Or perhaps using pseudo-random numbers
is sufficient:

   git ls-tree -r HEAD |
   awk '{print $3}' |
   sort |
   awk -v pieces=8 -v prefix=file '
      {
         piece = int(rand() * pieces)
         filename = prefix piece
         print $0 > filename
      }'
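
The sub-string variant could look roughly like this.  It is only a
sketch: the function name partition_by_first_digit and its two
arguments are made up for illustration, and it assumes lowercase
hexadecimal object IDs on stdin, one per line, as printed by the
first three stages of the pipeline above.  The first hex digit of
each ID, taken modulo the number of pieces, selects the output file:

```shell
# Hypothetical helper, not part of Git: split object IDs read from stdin
# into PIECES files named PREFIX0, PREFIX1, ... based on the first hex
# digit of each ID.  Assumes lowercase hex IDs and PIECES <= 16.
partition_by_first_digit() {
    pieces=$1
    prefix=$2
    awk -v pieces="$pieces" -v prefix="$prefix" '
        {
            # index() is 1-based, so "0" maps to 0 and "f" maps to 15.
            digit = index("0123456789abcdef", substr($0, 1, 1)) - 1
            filename = prefix (digit % pieces)
            print $0 > filename
        }'
}
```

The pieces only come out similar in size if the piece count divides
16, e.g. 2, 4, 8 or 16.  Each resulting file could then be fed to its
own "git cat-file --batch >/dev/null" process in the background,
followed by "wait".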

So how much does such a warmup help in your case?

>> No idea how to randomize the order of tree object visits.
>
> To heat up data caches, the order of objects visited is not relevant,
> the order of IOs issued to the actual object is relevant.

What's the difference?

NB: When I wrote "tree objects" I meant the type of objects from Git's
object store (made up of packs and loose files) that represent
sub-directories, and with "visit" I meant reading them to traverse the
hierarchy of Git blobs and trees.

Here's an idea after all: Using "git ls-tree" without "-r" and handling
recursing in the prefetch script would allow traversing trees in a
different order and even in parallel.  Not sure how to limit parallelism
to a sane degree.
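
One way to keep the parallelism bounded, as a rough sketch rather
than a tested tool: handle only the first level of recursion in the
script and hand each top-level sub-tree to a worker, with "xargs -P"
capping how many workers run at once.  The function name
prefetch_subtrees and its arguments are made up for illustration;
"-P" is not in POSIX but is supported by GNU and BSD xargs.  Each
worker still uses "git ls-tree -r" below its sub-tree, so this
parallelizes across top-level directories only:

```shell
# Hypothetical helper, not part of Git: read all blobs reachable from
# tree-ish REV, walking top-level sub-trees with at most JOBS workers.
prefetch_subtrees() {
    rev=$1
    jobs=$2

    # Blobs that sit directly in the top-level tree.
    git ls-tree "$rev" |
    awk '$2 == "blob" {print $3}' |
    git cat-file --batch >/dev/null

    # One worker per top-level sub-tree; sorting by object ID makes the
    # visiting order independent of the directory names.
    git ls-tree "$rev" |
    awk '$2 == "tree" {print $3}' |
    sort |
    xargs -P "$jobs" -n 1 sh -c '
        # GNU xargs runs the command once even for empty input.
        [ -n "$1" ] || exit 0
        git ls-tree -r "$1" |
        awk "\$2 == \"blob\" {print \$3}" |
        git cat-file --batch >/dev/null
    ' sh
}
```

Inside a repository it would be called as "prefetch_subtrees HEAD 4".
A tree with one huge sub-directory and many small ones would keep a
single worker busy for most of the run, so this is a crude limit at
best.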

René
