From: "René Scharfe" <l.s.r@web.de>
To: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Cc: git@vger.kernel.org, Junio C Hamano <gitster@pobox.com>,
Johannes Schindelin <johannes.schindelin@gmx.de>,
Rohit Ashiwal <rohit.ashiwal265@gmail.com>,
Jeff King <peff@peff.net>,
"brian m . carlson" <sandals@crustytoothpaste.net>
Subject: Re: [PATCH v4 4/6] archive-tar: add internal gzip implementation
Date: Thu, 16 Jun 2022 20:55:18 +0200 [thread overview]
Message-ID: <3ed80afd-34b3-afd8-5ffb-0187a4475ee1@web.de> (raw)
In-Reply-To: <220615.86wndhwt9a.gmgdl@evledraar.gmail.com>
Am 15.06.22 um 22:32 schrieb Ævar Arnfjörð Bjarmason:
>
> On Wed, Jun 15 2022, René Scharfe wrote:
>
>> Git uses zlib for its own object store, but calls gzip when creating tgz
>> archives. Add an option to perform the gzip compression for the latter
>> using zlib, without depending on the external gzip binary.
>>
>> Plug it in by making write_block a function pointer and switching to a
>> compressing variant if the filter command has the magic value "git
>> archive gzip". Does that indirection slow down tar creation? Not
>> really, at least not in this test:
>>
>> $ hyperfine -w3 -L rev HEAD,origin/main -p 'git checkout {rev} && make' \
>> './git -C ../linux archive --format=tar HEAD # {rev}'
>
> Shameless plug: https://lore.kernel.org/git/211201.86r1aw9gbd.gmgdl@evledraar.gmail.com/
>
> I.e. a "hyperfine" wrapper I wrote to make exactly this sort of thing
> easier.
>
> You'll find that you need less or no --warmup with it, since the
> checkout flip-flopping and re-making (and resulting FS and other cache
> eviction) will go away, as we'll use different "git worktree"'s for the
> two "rev".
OK, but requiring hyperfine alone is burden enough for reviewers.
I had a try anyway and it took me a while to realize that git-hyperfine
requires setting the Git config option hyperfine.run-dir band that it
ignores it on my system. Had to hard-code it in the script.
> (Also, putting those on a ramdisk really helps)
>
>> Benchmark #1: ./git -C ../linux archive --format=tar HEAD # HEAD
>> Time (mean ± σ): 4.044 s ± 0.007 s [User: 3.901 s, System: 0.137 s]
>> Range (min … max): 4.038 s … 4.059 s 10 runs
>>
>> Benchmark #2: ./git -C ../linux archive --format=tar HEAD # origin/main
>> Time (mean ± σ): 4.047 s ± 0.009 s [User: 3.903 s, System: 0.138 s]
>> Range (min … max): 4.038 s … 4.066 s 10 runs
>>
>> How does tgz creation perform?
>>
>> $ hyperfine -w3 -L command 'gzip -cn','git archive gzip' \
>> './git -c tar.tgz.command="{command}" -C ../linux archive --format=tgz HEAD'
>> Benchmark #1: ./git -c tar.tgz.command="gzip -cn" -C ../linux archive --format=tgz HEAD
>> Time (mean ± σ): 20.404 s ± 0.006 s [User: 23.943 s, System: 0.401 s]
>> Range (min … max): 20.395 s … 20.414 s 10 runs
>>
>> Benchmark #2: ./git -c tar.tgz.command="git archive gzip" -C ../linux archive --format=tgz HEAD
>> Time (mean ± σ): 23.807 s ± 0.023 s [User: 23.655 s, System: 0.145 s]
>> Range (min … max): 23.782 s … 23.857 s 10 runs
>>
>> Summary
>> './git -c tar.tgz.command="gzip -cn" -C ../linux archive --format=tgz HEAD' ran
>> 1.17 ± 0.00 times faster than './git -c tar.tgz.command="git archive gzip" -C ../linux archive --format=tgz HEAD'
>>
>> So the internal implementation takes 17% longer on the Linux repo, but
>> uses 2% less CPU time. That's because the external gzip can run in
>> parallel on its own processor, while the internal one works sequentially
>> and avoids the inter-process communication overhead.
>>
>> What are the benefits? Only an internal sequential implementation can
>> offer this eco mode, and it allows avoiding the gzip(1) requirement.
>
> I had been keeping one eye on this series, but didn't look at it in any
> detail.
>
> I found this after reading 6/6, which I think in any case could really
> use some "why" summary, which seems to mostly be covered here.
>
> I.e. it's unclear if the "drop the dependency on gzip(1)" in 6/6 is a
> reference to the GZIP test dependency, or that our users are unlikely to
> have "gzip(1)" on their systems.
It's to avoid a run dependency; the build/test dependency remains.
> If it's the latter I'd much rather (as a user) take a 17% wallclock
> improvement over a 2% cost of CPU. I mostly care about my own time, not
> that of the CPU.
Understandable, and you can set tar.tgz.command='gzip -cn' to get the
old behavior. Saving energy is a better default, though.
The runtime in the real world probably includes lots more I/O time. The
tests above are repeated and warmed up to get consistent measurements,
but big repos are probably not fully kept in memory like that.
> Can't we have our 6/6 cake much easier and eat it too by learning a
> "fallback" mode, i.e. we try to invoke gzip, and if that doesn't work
> use the "internal" one?
Interesting idea, but I think the existing config option suffices. E.g.
a distro could set it in the system-wide config file if/when gzip is
installed.
> Re the "eco mode": I also wonder how much of the overhead you're seeing
> for both that 17% and 2% would go away if you pin both processes to the
> same CPU, I can't recall the command offhand, but IIRC taskset or
> numactl can do that. I.e. is this really measuring IPC overhead, or
> I-CPU overhead on your system?
I'd expect that running git archive and gzip at the same CPU core takes
more wall-clock time than using zlib because inflating the object files
and deflating the archive are done sequentially in both scenarios.
Can't test it on macOS because it doesn't offer a way to pin programs to
a certain core, but e.g. someone with access to a Linux system can check
that using taskset(1).
René
next prev parent reply other threads:[~2022-06-16 18:56 UTC|newest]
Thread overview: 74+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-04-12 23:04 [PATCH 0/2] Avoid spawning gzip in git archive Johannes Schindelin via GitGitGadget
2019-04-12 23:04 ` [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die() Rohit Ashiwal via GitGitGadget
2019-04-13 1:34 ` Jeff King
2019-04-13 5:51 ` Junio C Hamano
2019-04-14 4:36 ` Rohit Ashiwal
2019-04-26 14:29 ` Johannes Schindelin
2019-04-26 23:44 ` Junio C Hamano
2019-04-29 21:32 ` Johannes Schindelin
2019-05-01 18:09 ` Jeff King
2019-05-02 20:29 ` René Scharfe
2019-05-05 5:25 ` Junio C Hamano
2019-05-06 5:07 ` Jeff King
2019-04-14 4:34 ` Rohit Ashiwal
2019-04-14 10:33 ` Junio C Hamano
2019-04-26 14:28 ` Johannes Schindelin
2019-05-01 18:07 ` Jeff King
2019-04-12 23:04 ` [PATCH 2/2] archive: avoid spawning `gzip` Rohit Ashiwal via GitGitGadget
2019-04-13 1:51 ` Jeff King
2019-04-13 22:01 ` René Scharfe
2019-04-15 21:35 ` Jeff King
2019-04-26 14:51 ` Johannes Schindelin
2019-04-27 9:59 ` René Scharfe
2019-04-27 17:39 ` René Scharfe
2019-04-29 21:25 ` Johannes Schindelin
2019-05-01 17:45 ` René Scharfe
2019-05-01 18:18 ` Jeff King
2019-06-10 10:44 ` René Scharfe
2019-06-13 19:16 ` Jeff King
2019-04-13 22:16 ` brian m. carlson
2019-04-15 21:36 ` Jeff King
2019-04-26 14:54 ` Johannes Schindelin
2019-05-02 20:20 ` Ævar Arnfjörð Bjarmason
2019-05-03 20:49 ` Johannes Schindelin
2019-05-03 20:52 ` Jeff King
2019-04-26 14:47 ` Johannes Schindelin
[not found] ` <pull.145.v2.git.gitgitgadget@gmail.com>
[not found] ` <4ea94a8784876c3a19e387537edd81a957fc692c.1556321244.git.gitgitgadget@gmail.com>
2019-05-02 20:29 ` [PATCH v2 3/4] archive: optionally use zlib directly for gzip compression René Scharfe
[not found] ` <ac2b2488a1b42b3caf8a84594c48eca796748e59.1556321244.git.gitgitgadget@gmail.com>
2019-05-02 20:30 ` [PATCH v2 2/4] archive-tar: mark RECORDSIZE/BLOCKSIZE as unsigned René Scharfe
2019-05-08 11:45 ` Johannes Schindelin
2019-05-08 23:04 ` Jeff King
2019-05-09 14:06 ` Johannes Schindelin
2019-05-09 18:38 ` Jeff King
2019-05-10 17:18 ` René Scharfe
2019-05-10 21:20 ` Jeff King
2022-06-12 6:00 ` [PATCH v3 0/5] Avoid spawning gzip in git archive René Scharfe
2022-06-12 6:03 ` [PATCH v3 1/5] archive: rename archiver data field to filter_command René Scharfe
2022-06-12 6:05 ` [PATCH v3 2/5] archive-tar: factor out write_block() René Scharfe
2022-06-12 6:08 ` [PATCH v3 3/5] archive-tar: add internal gzip implementation René Scharfe
2022-06-13 19:10 ` Junio C Hamano
2022-06-12 6:18 ` [PATCH v3 4/5] archive-tar: use OS_CODE 3 (Unix) for internal gzip René Scharfe
2022-06-12 6:19 ` [PATCH v3 5/5] archive-tar: use internal gzip by default René Scharfe
2022-06-13 21:55 ` Junio C Hamano
2022-06-14 11:27 ` Johannes Schindelin
2022-06-14 15:47 ` René Scharfe
2022-06-14 15:56 ` René Scharfe
2022-06-14 16:29 ` Johannes Schindelin
2022-06-14 20:04 ` René Scharfe
2022-06-15 16:41 ` Junio C Hamano
2022-06-14 11:28 ` [PATCH v3 0/5] Avoid spawning gzip in git archive Johannes Schindelin
2022-06-14 20:05 ` René Scharfe
2022-06-30 18:55 ` Johannes Schindelin
2022-07-01 16:05 ` Johannes Schindelin
2022-07-01 16:27 ` Jeff King
2022-07-01 17:47 ` Junio C Hamano
2022-06-15 16:53 ` [PATCH v4 0/6] " René Scharfe
2022-06-15 16:58 ` [PATCH v4 1/6] archive: update format documentation René Scharfe
2022-06-15 16:59 ` [PATCH v4 2/6] archive: rename archiver data field to filter_command René Scharfe
2022-06-15 17:01 ` [PATCH v4 3/6] archive-tar: factor out write_block() René Scharfe
2022-06-15 17:02 ` [PATCH v4 4/6] archive-tar: add internal gzip implementation René Scharfe
2022-06-15 20:32 ` Ævar Arnfjörð Bjarmason
2022-06-16 18:55 ` René Scharfe [this message]
2022-06-24 11:13 ` Ævar Arnfjörð Bjarmason
2022-06-24 20:24 ` René Scharfe
2022-06-15 17:04 ` [PATCH v4 5/6] archive-tar: use OS_CODE 3 (Unix) for internal gzip René Scharfe
2022-06-15 17:05 ` [PATCH v4 6/6] archive-tar: use internal gzip by default René Scharfe
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: http://vger.kernel.org/majordomo-info.html
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=3ed80afd-34b3-afd8-5ffb-0187a4475ee1@web.de \
--to=l.s.r@web.de \
--cc=avarab@gmail.com \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=johannes.schindelin@gmx.de \
--cc=peff@peff.net \
--cc=rohit.ashiwal265@gmail.com \
--cc=sandals@crustytoothpaste.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).