From: "René Scharfe" <l.s.r@web.de>
To: "Scheffenegger, Richard" <Richard.Scheffenegger@netapp.com>,
Junio C Hamano <gitster@pobox.com>,
"brian m. carlson" <sandals@crustytoothpaste.net>
Cc: "git@vger.kernel.org" <git@vger.kernel.org>
Subject: Re: git --archive
Date: Sat, 24 Sep 2022 15:19:12 +0200
Message-ID: <712ffe78-c3e3-dacf-c3e3-f339385e9bb4@web.de>
In-Reply-To: <PH0PR06MB7639DAA5CA112495E3EB43AB86509@PH0PR06MB7639.namprd06.prod.outlook.com>
On 24.09.22 at 13:34, Scheffenegger, Richard wrote:
>
>> If I/O latency instead of CPU usage is the limiting factor and
>> prefetching would help then starting git grep or git archive in the
>> background might work. If the order of visited blobs needs to be
>> randomized then perhaps something like this would be better:
>>
>> git ls-tree -r HEAD | awk '{print $3}' | sort | git cat-file
>> --batch >/dev/null
>
> Isn't the 2nd git, receiving input from stdin, running
> single-threaded?
Yes.
> Maybe
>
> Git ls-tree -r HEAD | awk '{print $3}' | sort | split -d -l 100 -a 4
> - splitted ; for i in $(ls splitted????) ; do "git cat-file --batch >
> /dev/null &"; done; rm -f splitted????
>
> To parallelize the reading of the objects?
Sure, but in a repository with 100000 files you'd end up with 1000
parallel processes, which may be a few too many. Splitting the list
into similar-sized parts based on a given degree of parallelism is
probably more practical.
It could be done by relying on the randomness of the object IDs and
partitioning by a sub-string. Or perhaps using pseudo-random numbers
is sufficient:
	git ls-tree -r HEAD |
	awk '{print $3}' |
	sort |
	awk -v pieces=8 -v prefix=file '
		{
			piece = int(rand() * pieces)
			filename = prefix piece
			print $0 > filename
		}'
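The sub-string variant mentioned above could look roughly like the following sketch. It keys on the first hex digit of each object ID, so it produces up to 16 pieces. The function name partition_by_first_digit and the file name prefix are made up for illustration.

```shell
# Hypothetical helper, not part of Git: read object IDs from stdin and
# write each one to the file <prefix><first hex digit of the ID>.
partition_by_first_digit() {
	awk -v prefix="$1" '{ print $0 > (prefix substr($0, 1, 1)) }'
}

# Intended use inside a repository (creates piece0 .. piecef):
#   git ls-tree -r HEAD | awk '{print $3}' | partition_by_first_digit piece
```

Since object IDs are hash values, the pieces should come out similar in size for large repositories.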
So how much does such a warmup help in your case?
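For measuring that, the pieces written by the awk script above could be read in parallel along these lines. This is only a sketch; prefetch_pieces is a made-up name, and it assumes each piece file contains one object ID per line.

```shell
# Hypothetical helper, not part of Git: start one "git cat-file --batch"
# reader per piece file in the background, then wait for all of them.
prefetch_pieces() {
	for piece in "$@"; do
		git cat-file --batch <"$piece" >/dev/null &
	done
	wait
}

# Intended use inside a repository, after creating file0 .. file7:
#   prefetch_pieces file[0-7]
```

The number of piece files passed in is the degree of parallelism, so it stays at 8 here no matter how many files the repository has.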
>> No idea how to randomize the order of tree object visits.
>
> To heat up data caches, the order of objects visited is not relevant,
> the order or IOs issued to the actual object is relevant.
What's the difference?
NB: When I wrote "tree objects" I meant the type of objects from Git's
object store (made up of packs and loose files) that represent
sub-directories, and with "visit" I meant reading them to traverse the
hierarchy of Git blobs and trees.
Here's an idea after all: Using "git ls-tree" without "-r" and handling
the recursion in the prefetch script would allow traversing trees in a
different order and even in parallel. Not sure how to limit parallelism
to a sane degree.
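A rough sketch of that idea, without any parallelism: it walks the hierarchy one level at a time with non-recursive "git ls-tree" calls and prints the blob IDs it finds. The name walk_trees is made up, and visiting level by level is just one possible order; the list of trees could be shuffled or split before each round.

```shell
# Hypothetical helper, not part of Git: print the IDs of all blobs reachable
# from a tree-ish, handling the recursion in the script instead of "-r".
walk_trees() {
	# Resolve the starting point (e.g. HEAD) to its top-level tree object.
	level=$(git rev-parse --verify "${1}^{tree}") || return
	while [ -n "$level" ]; do
		# List every tree of the current level, one directory at a time.
		entries=$(for tree in $level; do git ls-tree "$tree"; done)
		# Blobs are the objects a prefetch would want to read.
		printf '%s\n' "$entries" | awk '$2 == "blob" { print $3 }'
		# Sub-trees make up the next level; submodule entries are skipped.
		level=$(printf '%s\n' "$entries" | awk '$2 == "tree" { print $3 }')
	done
}

# Intended use inside a repository:
#   walk_trees HEAD | git cat-file --batch >/dev/null
```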
René