git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: "René Scharfe" <l.s.r@web.de>
Cc: git@vger.kernel.org, Junio C Hamano <gitster@pobox.com>,
	Johannes Schindelin <johannes.schindelin@gmx.de>,
	Rohit Ashiwal <rohit.ashiwal265@gmail.com>,
	Jeff King <peff@peff.net>,
	"brian m . carlson" <sandals@crustytoothpaste.net>
Subject: Re: [PATCH v4 4/6] archive-tar: add internal gzip implementation
Date: Fri, 24 Jun 2022 13:13:20 +0200	[thread overview]
Message-ID: <220624.8635fujn3o.gmgdl@evledraar.gmail.com> (raw)
In-Reply-To: <3ed80afd-34b3-afd8-5ffb-0187a4475ee1@web.de>


On Thu, Jun 16 2022, René Scharfe wrote:

> Am 15.06.22 um 22:32 schrieb Ævar Arnfjörð Bjarmason:
>> [...]
> Understandable, and you can set tar.tgz.command='gzip -cn' to get the
> old behavior.  Saving energy is a better default, though.

I disagree with that in general, a big reason for why git won out over
other VCS's is that it wasn't as slow. I think we should primarily be
interested in the time a user might end up staring at the screen.

I understand the concern to have "git archive" just work, e.g. if you
uninstall gzip(1) (although that seems rather obscure, but perhaps this
is for more minimal setups).

I don't think saving energy is a virtue, *maybe* it is, but maybe your
computer is powered by hydro, solar or nuclear instead of coal, so even
if we're taking global energy policy into account for changes to git
it's highly context dependant.

In any case, this is also true for pretty much any other git command
that might spawn processes or threads, e.g. "git grep":

	$ hyperfine -w3 -L cpus 0,0-7 'taskset --cpu-list {cpus} ./git grep foo.*bar' -r 10
	Benchmark 1: taskset --cpu-list 0 ./git grep foo.*bar
	  Time (mean ± σ):      39.3 ms ±   1.2 ms    [User: 20.0 ms, System: 18.6 ms]
	  Range (min … max):    38.2 ms …  41.8 ms    10 runs
	
	Benchmark 2: taskset --cpu-list 0-7 ./git grep foo.*bar
	  Time (mean ± σ):      28.1 ms ±   1.3 ms    [User: 43.5 ms, System: 51.0 ms]
	  Range (min … max):    26.6 ms …  31.2 ms    10 runs
	
	Summary
	  'taskset --cpu-list 0-7 ./git grep foo.*bar' ran
	    1.40 ± 0.08 times faster than 'taskset --cpu-list 0 ./git grep foo.*bar'

Here we use less than 1/2 the user/system time when I pin it to 1 cpu,
but we're 40% slower.

So this is a bit of a digression, but this particular thing seems much
better left to the OS or your hardware's CPU throttling policy. To the
extent that we care perhaps more fitting would be to have a global
core.wrapper-cmd option or something, so you could pass all git commands
through "taskset" (or your local equivalent), or just use shell aliases.

> The runtime in the real world probably includes lots more I/O time.  The
> tests above are repeated and warmed up to get consistent measurements,
> but big repos are probably not fully kept in memory like that.
>
>> Can't we have our 6/6 cake much easier and eat it too by learning a
>> "fallback" mode, i.e. we try to invoke gzip, and if that doesn't work
>> use the "internal" one?
>
> Interesting idea, but I think the existing config option suffices.  E.g.
> a distro could set it in the system-wide config file if/when gzip is
> installed.

I think in practice distros are unlikely to have such triggers for
"package X is installed, let's set config Y". I mean, e.g. Debian can do
that with its packaging system, but it's expecting a lot. Why not flip
the default depending on if start_command() fails?

>> Re the "eco mode": I also wonder how much of the overhead you're seeing
>> for both that 17% and 2% would go away if you pin both processes to the
>> same CPU, I can't recall the command offhand, but IIRC taskset or
>> numactl can do that. I.e. is this really measuring IPC overhead, or
>> I-CPU overhead on your system?
>
> I'd expect that running git archive and gzip at the same CPU core takes
> more wall-clock time than using zlib because inflating the object files
> and deflating the archive are done sequentially in both scenarios.
> Can't test it on macOS because it doesn't offer a way to pin programs to
> a certain core, but e.g. someone with access to a Linux system can check
> that using taskset(1).

Here's a benchmark, this is your hyperfine command, just with taskset
added. It's an 8-core box, so 0-7 is "all CPUs" (I think...):

	hyperfine -w3 \
		-L cpus 0,0-7 \
		-L command 'gzip -cn','git archive gzip' \
		'taskset --cpu-list {cpus} ./git -c tar.tgz.command="{command}" archive --format=tgz HEAD'

Which gives me:

	Benchmark 1: taskset --cpu-list 0 ./git -c tar.tgz.command="gzip -cn" archive --format=tgz HEAD
	  Time (mean ± σ):      1.561 s ±  0.029 s    [User: 1.503 s, System: 0.058 s]
	  Range (min … max):    1.522 s …  1.622 s    10 runs
	 
	Benchmark 2: taskset --cpu-list 0-7 ./git -c tar.tgz.command="gzip -cn" archive --format=tgz HEAD
	  Time (mean ± σ):      1.337 s ±  0.029 s    [User: 1.535 s, System: 0.075 s]
	  Range (min … max):    1.298 s …  1.388 s    10 runs
	 
	Benchmark 3: taskset --cpu-list 0 ./git -c tar.tgz.command="git archive gzip" archive --format=tgz HEAD
	  Time (mean ± σ):      1.493 s ±  0.032 s    [User: 1.453 s, System: 0.040 s]
	  Range (min … max):    1.462 s …  1.572 s    10 runs
	 
	Benchmark 4: taskset --cpu-list 0-7 ./git -c tar.tgz.command="git archive gzip" archive --format=tgz HEAD
	  Time (mean ± σ):      1.503 s ±  0.026 s    [User: 1.466 s, System: 0.036 s]
	  Range (min … max):    1.469 s …  1.542 s    10 runs
	 
	Summary
	  'taskset --cpu-list 0-7 ./git -c tar.tgz.command="gzip -cn" archive --format=tgz HEAD' ran
	    1.12 ± 0.03 times faster than 'taskset --cpu-list 0 ./git -c tar.tgz.command="git archive gzip" archive --format=tgz HEAD'
	    1.12 ± 0.03 times faster than 'taskset --cpu-list 0-7 ./git -c tar.tgz.command="git archive gzip" archive --format=tgz HEAD'
	    1.17 ± 0.03 times faster than 'taskset --cpu-list 0 ./git -c tar.tgz.command="gzip -cn" archive --format=tgz HEAD'

Whic I think should control for the IPC overhead v.s. the advantage of
multicore. I.e. we're faster with "gzip -cn" on multicore, but the
internal implementation has an advantage when it comes to 

I tried out the fallback method, memory leaks aside (needs to do a
proper cleanup) this seems to work. Most of the diff is moving the
existing code into a function:

diff --git a/archive-tar.c b/archive-tar.c
index 3d77e0f7509..a1b08812ee3 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -458,14 +458,36 @@ static void tgz_write_block(const void *data)
 	tgz_deflate(Z_NO_FLUSH);
 }
 
+static const char default_gzip_command[] = "gzip -cn";
 static const char internal_gzip_command[] = "git archive gzip";
 
-static int write_tar_filter_archive(const struct archiver *ar,
-				    struct archiver_args *args)
+static int do_internal_gzip(const struct archiver *ar,
+			    struct archiver_args *args)
 {
 #if ZLIB_VERNUM >= 0x1221
 	struct gz_header_s gzhead = { .os = 3 }; /* Unix, for reproducibility */
 #endif
+	int r;
+
+	write_block = tgz_write_block;
+	git_deflate_init_gzip(&gzstream, args->compression_level);
+#if ZLIB_VERNUM >= 0x1221
+	if (deflateSetHeader(&gzstream.z, &gzhead) != Z_OK)
+		BUG("deflateSetHeader() called too late");
+#endif
+	gzstream.next_out = outbuf;
+	gzstream.avail_out = sizeof(outbuf);
+
+	r = write_tar_archive(ar, args);
+
+	tgz_deflate(Z_FINISH);
+	git_deflate_end(&gzstream);
+	return r;
+}
+
+static int write_tar_filter_archive(const struct archiver *ar,
+				    struct archiver_args *args)
+{
 	struct strbuf cmd = STRBUF_INIT;
 	struct child_process filter = CHILD_PROCESS_INIT;
 	int r;
@@ -473,33 +495,24 @@ static int write_tar_filter_archive(const struct archiver *ar,
 	if (!ar->filter_command)
 		BUG("tar-filter archiver called with no filter defined");
 
-	if (!strcmp(ar->filter_command, internal_gzip_command)) {
-		write_block = tgz_write_block;
-		git_deflate_init_gzip(&gzstream, args->compression_level);
-#if ZLIB_VERNUM >= 0x1221
-		if (deflateSetHeader(&gzstream.z, &gzhead) != Z_OK)
-			BUG("deflateSetHeader() called too late");
-#endif
-		gzstream.next_out = outbuf;
-		gzstream.avail_out = sizeof(outbuf);
-
-		r = write_tar_archive(ar, args);
-
-		tgz_deflate(Z_FINISH);
-		git_deflate_end(&gzstream);
-		return r;
-	}
+	if (!strcmp(ar->filter_command, internal_gzip_command))
+		return do_internal_gzip(ar, args);
 
 	strbuf_addstr(&cmd, ar->filter_command);
 	if (args->compression_level >= 0)
 		strbuf_addf(&cmd, " -%d", args->compression_level);
 
-	strvec_push(&filter.args, cmd.buf);
-	filter.use_shell = 1;
+	strvec_split(&filter.args, cmd.buf);
 	filter.in = -1;
 
-	if (start_command(&filter) < 0)
+	if (start_command(&filter) < 0) {
+		if (!strcmp(ar->filter_command, default_gzip_command)) {
+			warning_errno(_("could not start '%s' filter, falling back to '%s'"),
+				      ar->filter_command, internal_gzip_command);
+			return do_internal_gzip(ar, args);
+		}
 		die_errno(_("unable to start '%s' filter"), cmd.buf);
+	}
 	close(1);
 	if (dup2(filter.in, 1) < 0)
 		die_errno(_("unable to redirect descriptor"));

  reply	other threads:[~2022-06-24 11:51 UTC|newest]

Thread overview: 74+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-04-12 23:04 [PATCH 0/2] Avoid spawning gzip in git archive Johannes Schindelin via GitGitGadget
2019-04-12 23:04 ` [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die() Rohit Ashiwal via GitGitGadget
2019-04-13  1:34   ` Jeff King
2019-04-13  5:51     ` Junio C Hamano
2019-04-14  4:36       ` Rohit Ashiwal
2019-04-26 14:29       ` Johannes Schindelin
2019-04-26 23:44         ` Junio C Hamano
2019-04-29 21:32           ` Johannes Schindelin
2019-05-01 18:09             ` Jeff King
2019-05-02 20:29               ` René Scharfe
2019-05-05  5:25               ` Junio C Hamano
2019-05-06  5:07                 ` Jeff King
2019-04-14  4:34     ` Rohit Ashiwal
2019-04-14 10:33       ` Junio C Hamano
2019-04-26 14:28     ` Johannes Schindelin
2019-05-01 18:07       ` Jeff King
2019-04-12 23:04 ` [PATCH 2/2] archive: avoid spawning `gzip` Rohit Ashiwal via GitGitGadget
2019-04-13  1:51   ` Jeff King
2019-04-13 22:01     ` René Scharfe
2019-04-15 21:35       ` Jeff King
2019-04-26 14:51         ` Johannes Schindelin
2019-04-27  9:59           ` René Scharfe
2019-04-27 17:39             ` René Scharfe
2019-04-29 21:25               ` Johannes Schindelin
2019-05-01 17:45                 ` René Scharfe
2019-05-01 18:18                   ` Jeff King
2019-06-10 10:44                     ` René Scharfe
2019-06-13 19:16                       ` Jeff King
2019-04-13 22:16     ` brian m. carlson
2019-04-15 21:36       ` Jeff King
2019-04-26 14:54       ` Johannes Schindelin
2019-05-02 20:20         ` Ævar Arnfjörð Bjarmason
2019-05-03 20:49           ` Johannes Schindelin
2019-05-03 20:52             ` Jeff King
2019-04-26 14:47     ` Johannes Schindelin
     [not found] ` <pull.145.v2.git.gitgitgadget@gmail.com>
     [not found]   ` <4ea94a8784876c3a19e387537edd81a957fc692c.1556321244.git.gitgitgadget@gmail.com>
2019-05-02 20:29     ` [PATCH v2 3/4] archive: optionally use zlib directly for gzip compression René Scharfe
     [not found]   ` <ac2b2488a1b42b3caf8a84594c48eca796748e59.1556321244.git.gitgitgadget@gmail.com>
2019-05-02 20:30     ` [PATCH v2 2/4] archive-tar: mark RECORDSIZE/BLOCKSIZE as unsigned René Scharfe
2019-05-08 11:45       ` Johannes Schindelin
2019-05-08 23:04         ` Jeff King
2019-05-09 14:06           ` Johannes Schindelin
2019-05-09 18:38             ` Jeff King
2019-05-10 17:18               ` René Scharfe
2019-05-10 21:20                 ` Jeff King
2022-06-12  6:00 ` [PATCH v3 0/5] Avoid spawning gzip in git archive René Scharfe
2022-06-12  6:03   ` [PATCH v3 1/5] archive: rename archiver data field to filter_command René Scharfe
2022-06-12  6:05   ` [PATCH v3 2/5] archive-tar: factor out write_block() René Scharfe
2022-06-12  6:08   ` [PATCH v3 3/5] archive-tar: add internal gzip implementation René Scharfe
2022-06-13 19:10     ` Junio C Hamano
2022-06-12  6:18   ` [PATCH v3 4/5] archive-tar: use OS_CODE 3 (Unix) for internal gzip René Scharfe
2022-06-12  6:19   ` [PATCH v3 5/5] archive-tar: use internal gzip by default René Scharfe
2022-06-13 21:55     ` Junio C Hamano
2022-06-14 11:27       ` Johannes Schindelin
2022-06-14 15:47         ` René Scharfe
2022-06-14 15:56           ` René Scharfe
2022-06-14 16:29           ` Johannes Schindelin
2022-06-14 20:04             ` René Scharfe
2022-06-15 16:41               ` Junio C Hamano
2022-06-14 11:28   ` [PATCH v3 0/5] Avoid spawning gzip in git archive Johannes Schindelin
2022-06-14 20:05     ` René Scharfe
2022-06-30 18:55       ` Johannes Schindelin
2022-07-01 16:05         ` Johannes Schindelin
2022-07-01 16:27           ` Jeff King
2022-07-01 17:47             ` Junio C Hamano
2022-06-15 16:53 ` [PATCH v4 0/6] " René Scharfe
2022-06-15 16:58   ` [PATCH v4 1/6] archive: update format documentation René Scharfe
2022-06-15 16:59   ` [PATCH v4 2/6] archive: rename archiver data field to filter_command René Scharfe
2022-06-15 17:01   ` [PATCH v4 3/6] archive-tar: factor out write_block() René Scharfe
2022-06-15 17:02   ` [PATCH v4 4/6] archive-tar: add internal gzip implementation René Scharfe
2022-06-15 20:32     ` Ævar Arnfjörð Bjarmason
2022-06-16 18:55       ` René Scharfe
2022-06-24 11:13         ` Ævar Arnfjörð Bjarmason [this message]
2022-06-24 20:24           ` René Scharfe
2022-06-15 17:04   ` [PATCH v4 5/6] archive-tar: use OS_CODE 3 (Unix) for internal gzip René Scharfe
2022-06-15 17:05   ` [PATCH v4 6/6] archive-tar: use internal gzip by default René Scharfe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=220624.8635fujn3o.gmgdl@evledraar.gmail.com \
    --to=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=johannes.schindelin@gmx.de \
    --cc=l.s.r@web.de \
    --cc=peff@peff.net \
    --cc=rohit.ashiwal265@gmail.com \
    --cc=sandals@crustytoothpaste.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).