Re: git --archive - René Scharfe

From: "René Scharfe" <l.s.r@web.de>
To: Junio C Hamano <gitster@pobox.com>,
	"brian m. carlson" <sandals@crustytoothpaste.net>,
	"Scheffenegger, Richard" <Richard.Scheffenegger@netapp.com>
Cc: "git@vger.kernel.org" <git@vger.kernel.org>
Subject: Re: git --archive
Date: Sat, 24 Sep 2022 10:58:15 +0200
Message-ID: <8eb5131e-5ae1-79bd-df0c-bf0b2ec8583f@web.de>
In-Reply-To: <xmqqedw2vysc.fsf@gitster.g>

On 23.09.22 at 18:30, Junio C Hamano wrote:
> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
>
>> Maybe they can technically be stored in any order, but people don't want
>> git archive to produce non-deterministic archives...
>> ...  I feel like it would be very difficult to achieve the
>> speedups you want and still produce a deterministic archive.
>
> I am not going to work on it myself, but I think the only possible
> parallelism would come from making the reading for F(n+1) and
> subsequent objects overlap writing of F(n), given a deterministic
> order of files in the resulting archive.  When we decide which file
> should come first, and learn that it is F(0), it probably comes from the
> tree object of the root level, and it is very likely that we would
> already know what F(1) and F(2) are by that time, so it should be
> possible to dispatch reading and applying content filtering on F(1)
> and keeping the result in core, while we are still writing F(0) out.

That's what git grep does.  It can be seen as a very lossy compression
with output printed in a deterministic order.
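
A rough sketch of that pattern in Python, just to illustrate the idea.
This is not how git grep or git archive are implemented, and
compress_in_order() is a made-up name.  Workers process several blobs
concurrently, but the results are handed back in input order:

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_in_order(blobs, workers=4):
    """Compress blobs in parallel, yield the results in input order.

    Hypothetical helper for illustration, not git code.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map() runs the calls concurrently, but returns the
        # results in the order of the inputs, so the output sequence
        # stays deterministic.
        yield from pool.map(zlib.compress, blobs)


if __name__ == "__main__":
    blobs = [b"a" * 1000, b"b" * 10, b"c" * 100000]
    for original, packed in zip(blobs, compress_in_order(blobs)):
        print(len(original), "->", len(packed))
```

Note that map() submits all blobs right away, so finished results pile
up in memory until the writer gets to them.  A real implementation would
have to limit how far ahead it reads.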

git archive compresses a small file by reading it fully and writing the
result in one go.  It streams big files, though, i.e. reads, compresses and writes
them in small pieces.  That won't work as easily if multiple files are compressed
in parallel.
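
The two modes look roughly like this, sketched in Python with zlib.  The
function names and the chunk size are my assumptions for the example;
git archive's actual code is C and looks different:

```python
import io
import zlib


def compress_whole(data):
    """Small file: compress the fully read content in one go."""
    return zlib.compress(data)


def compress_stream(src, dst, chunk_size=16384):
    """Big file: read, compress and write in small pieces.

    Only about one chunk is held in memory at a time, but dst is
    occupied until the whole file has been written.
    """
    packer = zlib.compressobj()
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(packer.compress(chunk))
    # Emit whatever the compressor still buffers internally.
    dst.write(packer.flush())


if __name__ == "__main__":
    data = b"some repetitive content\n" * 100000
    out = io.BytesIO()
    compress_stream(io.BytesIO(data), out)
    print(len(data), "->", len(out.getvalue()))
```

The streaming variant writes to the output while it is still reading,
which is why two such streams cannot simply share one archive file.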

Allowing multiple streams would require storing their results in temporary
files.  Perhaps it would already help to allow only a single stream and to
start it only when its turn to be written has come, though.

Giving up on deterministic order would reduce the memory usage for keeping
compressed small files.  That only matters if the product of core.bigFileThreshold
(default value 512 MB), the number of parallel threads and the compression ratio
exceeds the available memory.  The same effect could be achieved by using
temporary files.  We'd still have to keep up to core.bigFileThreshold times the
number of threads of uncompressed data in memory, though.
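
As a back-of-the-envelope calculation in Python: worst_case_memory() is
a made-up helper, and the compression ratio (compressed size divided by
uncompressed size) of 0.5 is just an assumed value for the example:

```python
MIB = 1024 * 1024


def worst_case_memory(threads, big_file_threshold=512 * MIB,
                      compression_ratio=0.5):
    """Return (uncompressed, compressed) worst-case sizes in bytes.

    Assumes every thread holds a file just below the threshold and
    that compression_ratio is compressed size / uncompressed size.
    """
    uncompressed = big_file_threshold * threads
    compressed = int(uncompressed * compression_ratio)
    return uncompressed, compressed


if __name__ == "__main__":
    uncompressed, compressed = worst_case_memory(threads=8)
    print("uncompressed:", uncompressed // MIB, "MiB")
    print("compressed:  ", compressed // MIB, "MiB")
```

With eight threads and the default threshold that is up to 4 GiB of
uncompressed data, independent of the order in which results are written.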

If I/O latency rather than CPU usage is the limiting factor and prefetching would
help, then starting git grep or git archive in the background might work.  If the
order of visited blobs needs to be randomized, then perhaps something like this
would be better:

   git ls-tree -r HEAD | awk '{print $3}' | sort | git cat-file --batch >/dev/null
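
Sorting by object ID works as a cheap shuffle because the ID is a hash of
the content and has nothing to do with the path.  A small Python sketch
of that, assuming a SHA-1 repository; blob_id() and visit_order() are
names I made up for the example:

```python
import hashlib


def blob_id(content):
    """Return the SHA-1 object ID of a blob with the given content."""
    # A blob is hashed as "blob <size>\0" followed by its content.
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def visit_order(files):
    """Return the paths ordered by the object ID of their content."""
    return sorted(files, key=lambda path: blob_id(files[path]))


if __name__ == "__main__":
    files = {"a": b"", "b": b"hello\n"}
    for path in visit_order(files):
        print(blob_id(files[path]), path)
```

Here "b" is visited before "a", even though the tree lists "a" first.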

No idea how to randomize the order of tree object visits.

René
