[PATCH 0/2] Avoid spawning gzip in git archive

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* [PATCH 0/2] Avoid spawning gzip in git archive
@ 2019-04-12 23:04 Johannes Schindelin via GitGitGadget
  2019-04-12 23:04 ` [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die() Rohit Ashiwal via GitGitGadget
                   ` (4 more replies)
  0 siblings, 5 replies; 74+ messages in thread
From: Johannes Schindelin via GitGitGadget @ 2019-04-12 23:04 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano

When creating .tar.gz archives with git archive, we let gzip handle the
compression part. But that is not even necessary, as we already require zlib
(to compress our loose objects). It is also unfortunate, as it requires gzip 
to be in the PATH (which is not the case e.g. with MinGit for Windows, which
tries to bundle just the bare minimum of files to make Git work
non-interactively, for use with 3rd-party applications requiring Git).

This patch series resolves this conundrum by teaching git archive the trick
to gzip-compress in-process.

Rohit Ashiwal (2):
  archive: replace write_or_die() calls with write_block_or_die()
  archive: avoid spawning `gzip`

 archive-tar.c | 54 ++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 41 insertions(+), 13 deletions(-)

base-commit: 8104ec994ea3849a968b4667d072fedd1e688642
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-145%2Fdscho%2Fdont-spawn-gzip-in-archive-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-145/dscho/dont-spawn-gzip-in-archive-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/145
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die()
  2019-04-12 23:04 [PATCH 0/2] Avoid spawning gzip in git archive Johannes Schindelin via GitGitGadget
@ 2019-04-12 23:04 ` Rohit Ashiwal via GitGitGadget
  2019-04-13  1:34   ` Jeff King
  2019-04-12 23:04 ` [PATCH 2/2] archive: avoid spawning `gzip` Rohit Ashiwal via GitGitGadget
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 74+ messages in thread
From: Rohit Ashiwal via GitGitGadget @ 2019-04-12 23:04 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Rohit Ashiwal

From: Rohit Ashiwal <rohit.ashiwal265@gmail.com>

MinGit for Windows comes without `gzip` bundled inside, git-archive uses
`gzip -cn` to compress tar files but for this to work, gzip needs to be
present on the host system.

In the next commit, we will change the gzip compression so that we no
longer spawn `gzip` but let zlib perform the compression in the same
process instead.

In preparation for this, we consolidate all the block writes into a
single function.

This closes https://github.com/git-for-windows/git/issues/1970

Signed-off-by: Rohit Ashiwal <rohit.ashiwal265@gmail.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 archive-tar.c | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/archive-tar.c b/archive-tar.c
index 4aabd566fb..ba37dad27c 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -17,6 +17,8 @@ static unsigned long offset;
 
 static int tar_umask = 002;
 
+static gzFile gzip;
+
 static int write_tar_filter_archive(const struct archiver *ar,
 				    struct archiver_args *args);
 
@@ -38,11 +40,21 @@ static int write_tar_filter_archive(const struct archiver *ar,
 #define USTAR_MAX_MTIME 077777777777ULL
 #endif
 
+/* writes out the whole block, or dies if fails */
+static void write_block_or_die(const char *block) {
+	if (gzip) {
+		if (gzwrite(gzip, block, (unsigned) BLOCKSIZE) != BLOCKSIZE)
+			die(_("gzwrite failed"));
+	} else {
+		write_or_die(1, block, BLOCKSIZE);
+	}
+}
+
 /* writes out the whole block, but only if it is full */
 static void write_if_needed(void)
 {
 	if (offset == BLOCKSIZE) {
-		write_or_die(1, block, BLOCKSIZE);
+		write_block_or_die(block);
 		offset = 0;
 	}
 }
@@ -66,7 +78,7 @@ static void do_write_blocked(const void *data, unsigned long size)
 		write_if_needed();
 	}
 	while (size >= BLOCKSIZE) {
-		write_or_die(1, buf, BLOCKSIZE);
+		write_block_or_die(buf);
 		size -= BLOCKSIZE;
 		buf += BLOCKSIZE;
 	}
@@ -101,10 +113,10 @@ static void write_trailer(void)
 {
 	int tail = BLOCKSIZE - offset;
 	memset(block + offset, 0, tail);
-	write_or_die(1, block, BLOCKSIZE);
+	write_block_or_die(block);
 	if (tail < 2 * RECORDSIZE) {
 		memset(block, 0, offset);
-		write_or_die(1, block, BLOCKSIZE);
+		write_block_or_die(block);
 	}
 }
 
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH 2/2] archive: avoid spawning `gzip`
  2019-04-12 23:04 [PATCH 0/2] Avoid spawning gzip in git archive Johannes Schindelin via GitGitGadget
  2019-04-12 23:04 ` [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die() Rohit Ashiwal via GitGitGadget
@ 2019-04-12 23:04 ` Rohit Ashiwal via GitGitGadget
  2019-04-13  1:51   ` Jeff King
       [not found] ` <pull.145.v2.git.gitgitgadget@gmail.com>
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 74+ messages in thread
From: Rohit Ashiwal via GitGitGadget @ 2019-04-12 23:04 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Rohit Ashiwal

From: Rohit Ashiwal <rohit.ashiwal265@gmail.com>

As we already link to the zlib library, we can perform the compression
without even requiring gzip on the host machine.

Signed-off-by: Rohit Ashiwal <rohit.ashiwal265@gmail.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 archive-tar.c | 34 +++++++++++++++++++++++++---------
 1 file changed, 25 insertions(+), 9 deletions(-)

diff --git a/archive-tar.c b/archive-tar.c
index ba37dad27c..5979ed14b7 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -466,18 +466,34 @@ static int write_tar_filter_archive(const struct archiver *ar,
 	filter.use_shell = 1;
 	filter.in = -1;
 
-	if (start_command(&filter) < 0)
-		die_errno(_("unable to start '%s' filter"), argv[0]);
-	close(1);
-	if (dup2(filter.in, 1) < 0)
-		die_errno(_("unable to redirect descriptor"));
-	close(filter.in);
+	if (!strcmp("gzip -cn", ar->data)) {
+		char outmode[4] = "wb\0";
+
+		if (args->compression_level >= 0 && args->compression_level <= 9)
+			outmode[2] = '0' + args->compression_level;
+
+		gzip = gzdopen(fileno(stdout), outmode);
+		if (!gzip)
+			die(_("Could not gzdopen stdout"));
+	} else {
+		if (start_command(&filter) < 0)
+			die_errno(_("unable to start '%s' filter"), argv[0]);
+		close(1);
+		if (dup2(filter.in, 1) < 0)
+			die_errno(_("unable to redirect descriptor"));
+		close(filter.in);
+	}
 
 	r = write_tar_archive(ar, args);
 
-	close(1);
-	if (finish_command(&filter) != 0)
-		die(_("'%s' filter reported error"), argv[0]);
+	if (gzip) {
+		if (gzclose(gzip) != Z_OK)
+			die(_("gzclose failed"));
+	} else {
+		close(1);
+		if (finish_command(&filter) != 0)
+			die(_("'%s' filter reported error"), argv[0]);
+	}
 
 	strbuf_release(&cmd);
 	return r;
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die()
  2019-04-12 23:04 ` [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die() Rohit Ashiwal via GitGitGadget
@ 2019-04-13  1:34   ` Jeff King
  2019-04-13  5:51     ` Junio C Hamano
                       ` (2 more replies)
  0 siblings, 3 replies; 74+ messages in thread
From: Jeff King @ 2019-04-13  1:34 UTC (permalink / raw)
  To: Rohit Ashiwal via GitGitGadget
  Cc: Johannes Schindelin, git, Junio C Hamano, Rohit Ashiwal

On Fri, Apr 12, 2019 at 04:04:39PM -0700, Rohit Ashiwal via GitGitGadget wrote:

> From: Rohit Ashiwal <rohit.ashiwal265@gmail.com>
> 
> MinGit for Windows comes without `gzip` bundled inside, git-archive uses
> `gzip -cn` to compress tar files but for this to work, gzip needs to be
> present on the host system.
> 
> In the next commit, we will change the gzip compression so that we no
> longer spawn `gzip` but let zlib perform the compression in the same
> process instead.
> 
> In preparation for this, we consolidate all the block writes into a
> single function.

Sounds like a good preparatory step. This part confused me, though:

> @@ -38,11 +40,21 @@ static int write_tar_filter_archive(const struct archiver *ar,
>  #define USTAR_MAX_MTIME 077777777777ULL
>  #endif
>  
> +/* writes out the whole block, or dies if fails */
> +static void write_block_or_die(const char *block) {
> +	if (gzip) {
> +		if (gzwrite(gzip, block, (unsigned) BLOCKSIZE) != BLOCKSIZE)
> +			die(_("gzwrite failed"));
> +	} else {
> +		write_or_die(1, block, BLOCKSIZE);
> +	}
> +}

What is gzwrite()? At first I thought this was an out-of-sequence bit of
the series, but it turns out that this is a zlib.h interface. So the
idea (I think) is that here we introduce a "gzip" variable that is
always false, and this first conditional arm is effectively dead code.
And then in a later patch we'd set up "gzip" and it would become
not-dead.

I think it would be less confusing if this just factored out
write_block_or_die(), which starts as a thin wrapper and then grows the
gzip parts in the next patch.

A few nits on the code itself:

> +static gzFile gzip;
> [...]
> +       if (gzip) {

Is it OK for us to ask about the truthiness of this opaque type? That
works if it's really a pointer behind the scenes, but it seems like it
would be equally OK for zlib to declare it as a struct.

It looks OK in my version of zlib, and that library tends to be fairly
conservative so I wouldn't be surprised if it was that way back to the
beginning and remains that way for eternity. But it feels like a bad
pattern.

> +		if (gzwrite(gzip, block, (unsigned) BLOCKSIZE) != BLOCKSIZE)

This cast is interesting. All of the matching write_or_die() calls are
promoting it to a size_t, which is also unsigned.

BLOCKSIZE is a constant. Should we be defining it with a "U" in the first place?

I doubt it matters much either way from a correctness perspective. I
just wonder when I see a cast like that if we're going to get unexpected
truncation or similar.

-Peff

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 2/2] archive: avoid spawning `gzip`
  2019-04-12 23:04 ` [PATCH 2/2] archive: avoid spawning `gzip` Rohit Ashiwal via GitGitGadget
@ 2019-04-13  1:51   ` Jeff King
  2019-04-13 22:01     ` René Scharfe
                       ` (2 more replies)
  0 siblings, 3 replies; 74+ messages in thread
From: Jeff King @ 2019-04-13  1:51 UTC (permalink / raw)
  To: Rohit Ashiwal via GitGitGadget
  Cc: Johannes Schindelin, git, Junio C Hamano, Rohit Ashiwal

On Fri, Apr 12, 2019 at 04:04:40PM -0700, Rohit Ashiwal via GitGitGadget wrote:

> From: Rohit Ashiwal <rohit.ashiwal265@gmail.com>
> 
> As we already link to the zlib library, we can perform the compression
> without even requiring gzip on the host machine.

Very cool. It's nice to drop a dependency, and this should be a bit more
efficient, too.

> diff --git a/archive-tar.c b/archive-tar.c
> index ba37dad27c..5979ed14b7 100644
> --- a/archive-tar.c
> +++ b/archive-tar.c
> @@ -466,18 +466,34 @@ static int write_tar_filter_archive(const struct archiver *ar,
>  	filter.use_shell = 1;
>  	filter.in = -1;
>  
> -	if (start_command(&filter) < 0)
> -		die_errno(_("unable to start '%s' filter"), argv[0]);
> -	close(1);
> -	if (dup2(filter.in, 1) < 0)
> -		die_errno(_("unable to redirect descriptor"));
> -	close(filter.in);
> +	if (!strcmp("gzip -cn", ar->data)) {

I wondered how you were going to kick this in, since users can define
arbitrary filters. I think it's kind of neat to automagically convert
"gzip -cn" (which also happens to be the default). But I think we should
mention that in the Documentation, in case somebody tries to use a
custom version of gzip and wonders why it isn't kicking in.

Likewise, it might make sense in the tests to put a poison gzip in the
$PATH so that we can be sure we're using our internal code, and not just
calling out to gzip (on platforms that have it, of course).

The alternative is that we could use a special token like ":zlib" or
something to indicate that the internal implementation should be used
(and then tweak the baked-in default, too). That might be less
surprising for users, but most people would still get the benefit since
they'd be using the default config.

> +		char outmode[4] = "wb\0";

This looks sufficiently magical that it might merit a comment. I had to
look in the zlib header file to learn that this is just a normal
stdio-style mode. But we can't just do:

  gzip = gzdopen(fd, "wb");

because we want to (maybe) append a compression level. It's also
slightly confusing that it explicitly includes a NUL, but later:

> +		if (args->compression_level >= 0 && args->compression_level <= 9)
> +			outmode[2] = '0' + args->compression_level;

we may overwrite that and assume that outmode[3] is also a NUL. Which it
is, because of how C initialization works. But that means we also do not
need the "\0" in the initializer.

Dropping that may make it slightly less jarring (any time I see a
backslash escape in an initializer, I assume I'm in for some binary
trickery, but this turns out to be much more mundane).

I'd also consider just using a strbuf:

  struct strbuf outmode = STRBUF_INIT;

  strbuf_addstr(&outmode, "wb");
  if (args->compression_level >= 0 && args->compression_level <= 9)
	strbuf_addch(&outmode, '0' + args->compression_level);

That's overkill in a sense, but it saves us having to deal with
manually-counted offsets, and this code is only run once per program
invocation, so the efficiency shouldn't matter.

> +		gzip = gzdopen(fileno(stdout), outmode);
> +		if (!gzip)
> +			die(_("Could not gzdopen stdout"));

Is there a way to get a more specific error from zlib? I'm less
concerned about gzdopen here (which should never fail), and more about
the writing and closing steps. I don't see anything good for gzdopen(),
but...

> +	if (gzip) {
> +		if (gzclose(gzip) != Z_OK)
> +			die(_("gzclose failed"));

...according to zlib.h, here the returned int is meaningful. And if
Z_ERRNO, we should probably use die_errno() to give a better message.

> [...]

That was a lot of little nits, but the overall shape of the patch looks
good to me (and I think the goal is obviously good). Thanks for working
on it.

-Peff

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die()
  2019-04-13  1:34   ` Jeff King
@ 2019-04-13  5:51     ` Junio C Hamano
  2019-04-14  4:36       ` Rohit Ashiwal
  2019-04-26 14:29       ` Johannes Schindelin
  2019-04-14  4:34     ` Rohit Ashiwal
  2019-04-26 14:28     ` Johannes Schindelin
  2 siblings, 2 replies; 74+ messages in thread
From: Junio C Hamano @ 2019-04-13  5:51 UTC (permalink / raw)
  To: Jeff King
  Cc: Rohit Ashiwal via GitGitGadget, Johannes Schindelin, git,
	Rohit Ashiwal

Jeff King <peff@peff.net> writes:

>> +/* writes out the whole block, or dies if fails */
>> +static void write_block_or_die(const char *block) {
>> +	if (gzip) {
>> +		if (gzwrite(gzip, block, (unsigned) BLOCKSIZE) != BLOCKSIZE)
>> +			die(_("gzwrite failed"));
>> +	} else {
>> +		write_or_die(1, block, BLOCKSIZE);
>> +	}
>> +}

I agree everything you said you your two review messages.

One thing you did not mention but I found disturbing was that this
does not take size argument but hardcodes BLOCKSIZE.  If the patch
were removing use of BLOCKSIZE in its callers (because everybody
uses the same constant), this would not have bothered me, but as the
caller passes BLOCKSIZE to all its callees except this one, I found
that the interface optimizes for a wrong thing (i.e. reducing
one-time pain of writing this single patch of having to repeat
BLOCKSIZE in all calls to this function).  This function should be
updated to take the size_t and have its caller(s) pass BLOCKSIZE.

Thanks for a review, and thanks Rohit for starting to get rid of
external dependency on gzip binary.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 2/2] archive: avoid spawning `gzip`
  2019-04-13  1:51   ` Jeff King
@ 2019-04-13 22:01     ` René Scharfe
  2019-04-15 21:35       ` Jeff King
  2019-04-13 22:16     ` brian m. carlson
  2019-04-26 14:47     ` Johannes Schindelin
  2 siblings, 1 reply; 74+ messages in thread
From: René Scharfe @ 2019-04-13 22:01 UTC (permalink / raw)
  To: Jeff King, Rohit Ashiwal via GitGitGadget
  Cc: Johannes Schindelin, git, Junio C Hamano, Rohit Ashiwal

Am 13.04.2019 um 03:51 schrieb Jeff King:
> On Fri, Apr 12, 2019 at 04:04:40PM -0700, Rohit Ashiwal via GitGitGadget wrote:
>
>> From: Rohit Ashiwal <rohit.ashiwal265@gmail.com>
>>
>> As we already link to the zlib library, we can perform the compression
>> without even requiring gzip on the host machine.
>
> Very cool. It's nice to drop a dependency, and this should be a bit more
> efficient, too.

Getting rid of dependencies is good, and using zlib is the obvious way to
generate .tgz files. Last time I tried something like that, a separate gzip
process was faster, though -- at least on Linux [1].  How does this one
fare?

Doing compression in its own thread may be a good idea.

René


[1] http://public-inbox.org/git/4AAAC8CE.8020302@lsrfire.ath.cx/

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 2/2] archive: avoid spawning `gzip`
  2019-04-13  1:51   ` Jeff King
  2019-04-13 22:01     ` René Scharfe
@ 2019-04-13 22:16     ` brian m. carlson
  2019-04-15 21:36       ` Jeff King
  2019-04-26 14:54       ` Johannes Schindelin
  2019-04-26 14:47     ` Johannes Schindelin
  2 siblings, 2 replies; 74+ messages in thread
From: brian m. carlson @ 2019-04-13 22:16 UTC (permalink / raw)
  To: Jeff King
  Cc: Rohit Ashiwal via GitGitGadget, Johannes Schindelin, git,
	Junio C Hamano, Rohit Ashiwal

[-- Attachment #1: Type: text/plain, Size: 1539 bytes --]

On Fri, Apr 12, 2019 at 09:51:02PM -0400, Jeff King wrote:
> I wondered how you were going to kick this in, since users can define
> arbitrary filters. I think it's kind of neat to automagically convert
> "gzip -cn" (which also happens to be the default). But I think we should
> mention that in the Documentation, in case somebody tries to use a
> custom version of gzip and wonders why it isn't kicking in.
> 
> Likewise, it might make sense in the tests to put a poison gzip in the
> $PATH so that we can be sure we're using our internal code, and not just
> calling out to gzip (on platforms that have it, of course).
> 
> The alternative is that we could use a special token like ":zlib" or
> something to indicate that the internal implementation should be used
> (and then tweak the baked-in default, too). That might be less
> surprising for users, but most people would still get the benefit since
> they'd be using the default config.

I agree that a special value (or NULL, if that's possible) would be
nicer here. That way, if someone does specify a custom gzip, we honor
it, and it serves to document the code better. For example, if someone
symlinked pigz to gzip and used "gzip -cn", then they might not get the
parallelization benefits they expected.

I'm fine overall with the idea of bringing the compression into the
binary using zlib, provided that we preserve the "-n" behavior
(producing reproducible archives).
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 868 bytes --]

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die()
  2019-04-13  1:34   ` Jeff King
  2019-04-13  5:51     ` Junio C Hamano
@ 2019-04-14  4:34     ` Rohit Ashiwal
  2019-04-14 10:33       ` Junio C Hamano
  2019-04-26 14:28     ` Johannes Schindelin
  2 siblings, 1 reply; 74+ messages in thread
From: Rohit Ashiwal @ 2019-04-14  4:34 UTC (permalink / raw)
  To: peff; +Cc: git, gitgitgadget, gitster, johannes.schindelin, rohit.ashiwal265

Hey Peff!

On 2019-04-13  1:34 UTC Jeff King <peff@peff.net> wrote:

> What is gzwrite()?
> [...]
> I think it would be less confusing if this just factored out
> write_block_or_die(), which starts as a thin wrapper and then grows the
> gzip parts in the next patch.

You are right, it might appear to someone as a bit confusing, but I feel
like, this is the right commit to put it.

> Is it OK for us to ask about the truthiness of this opaque type? That
> works if it's really a pointer behind the scenes, but it seems like it
> would be equally OK for zlib to declare it as a struct.

It would be perfectly sane on zlib's part to make gzFile a struct, and if
so happens, I'll be there to refactor the code.

Regards
Rohit


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die()
  2019-04-13  5:51     ` Junio C Hamano
@ 2019-04-14  4:36       ` Rohit Ashiwal
  2019-04-26 14:29       ` Johannes Schindelin
  1 sibling, 0 replies; 74+ messages in thread
From: Rohit Ashiwal @ 2019-04-14  4:36 UTC (permalink / raw)
  To: gitster; +Cc: git, gitgitgadget, johannes.schindelin, peff, rohit.ashiwal265

Hey jch!

I'll change the signature of the function in next revision.

Thanks
Rohit


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die()
  2019-04-14  4:34     ` Rohit Ashiwal
@ 2019-04-14 10:33       ` Junio C Hamano
  0 siblings, 0 replies; 74+ messages in thread
From: Junio C Hamano @ 2019-04-14 10:33 UTC (permalink / raw)
  To: Rohit Ashiwal; +Cc: peff, git, gitgitgadget, johannes.schindelin

Rohit Ashiwal <rohit.ashiwal265@gmail.com> writes:

> On 2019-04-13  1:34 UTC Jeff King <peff@peff.net> wrote:
>
>> What is gzwrite()?
>> [...]
>> I think it would be less confusing if this just factored out
>> write_block_or_die(), which starts as a thin wrapper and then grows the
>> gzip parts in the next patch.
>
> You are right, it might appear to someone as a bit confusing, but I feel
> like, this is the right commit to put it.

Often, the original author is the worst judge about the patch series
organization, because s/he has been staring at his or her own
patches too long and knows too much about them.  Unless the author
is very experienced and is good at pretending to be the first-time
reader when proofreading his or her own patch, that is.

FWIW, I tend to agree with Peff that the organization would become
much easier to follow with "first refactor without new feature, and
in gzip related step add gzip thing".

>> Is it OK for us to ask about the truthiness of this opaque type? That
>> works if it's really a pointer behind the scenes, but it seems like it
>> would be equally OK for zlib to declare it as a struct.

Or a small integer indexing into an internal array the library keeps
track of ;-) At that point, truthiness would be completely gone, and
the compiler would not help catching "if (opaque)" as a syntax error
(if the library implements the opaque thing as a structure, then we
will be saved).

> It would be perfectly sane on zlib's part to make gzFile a struct, and if
> so happens, I'll be there to refactor the code.

We do not trust any single developer enough with "I'll do so when
needed"---in practice, it will often be done by somebody else, and
more importantly, we would want anybody to be able to take things
over, instead of relying on any one "indispensable contributor".

If it is reasonable to expect that things can easily be broken by an
external factor, I'd prefer to see us being defensive from day one.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 2/2] archive: avoid spawning `gzip`
  2019-04-13 22:01     ` René Scharfe
@ 2019-04-15 21:35       ` Jeff King
  2019-04-26 14:51         ` Johannes Schindelin
  0 siblings, 1 reply; 74+ messages in thread
From: Jeff King @ 2019-04-15 21:35 UTC (permalink / raw)
  To: René Scharfe
  Cc: Rohit Ashiwal via GitGitGadget, Johannes Schindelin, git,
	Junio C Hamano, Rohit Ashiwal

On Sun, Apr 14, 2019 at 12:01:10AM +0200, René Scharfe wrote:

> >> As we already link to the zlib library, we can perform the compression
> >> without even requiring gzip on the host machine.
> >
> > Very cool. It's nice to drop a dependency, and this should be a bit more
> > efficient, too.
> 
> Getting rid of dependencies is good, and using zlib is the obvious way to
> generate .tgz files. Last time I tried something like that, a separate gzip
> process was faster, though -- at least on Linux [1].  How does this one
> fare?

I'd expect a separate gzip to be faster in wall-clock time for a
multi-core machine, but overall consume more CPU. I'm slightly surprised
that your timings show that it actually wins on total CPU, too.

Here are best-of-five times for "git archive --format=tar.gz HEAD" on
linux.git (the machine is a quad-core):

  [before, separate gzip]
  real	0m21.501s
  user	0m26.148s
  sys	0m0.619s

  [after, internal gzwrite]
  real	0m25.156s
  user	0m25.059s
  sys	0m0.096s

which does show what I expect (longer overall, but less total CPU).

Which one you prefer depends on your situation, of course. A user on a
workstation with multiple cores probably cares most about end-to-end
latency and using all of their available horsepower. A server hosting
repositories and receiving many unrelated requests probably cares more
about total CPU (though the differences there are small enough that it
may not even be worth having a config knob to un-parallelize it).

> Doing compression in its own thread may be a good idea.

Yeah. It might even make the patch simpler, since I'd expect it to be
implemented with start_async() and a descriptor, making it look just
like a gzip pipe to the caller. :)

-Peff

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 2/2] archive: avoid spawning `gzip`
  2019-04-13 22:16     ` brian m. carlson
@ 2019-04-15 21:36       ` Jeff King
  2019-04-26 14:54       ` Johannes Schindelin
  1 sibling, 0 replies; 74+ messages in thread
From: Jeff King @ 2019-04-15 21:36 UTC (permalink / raw)
  To: brian m. carlson, Rohit Ashiwal via GitGitGadget,
	Johannes Schindelin, git, Junio C Hamano, Rohit Ashiwal

On Sat, Apr 13, 2019 at 10:16:46PM +0000, brian m. carlson wrote:

> > The alternative is that we could use a special token like ":zlib" or
> > something to indicate that the internal implementation should be used
> > (and then tweak the baked-in default, too). That might be less
> > surprising for users, but most people would still get the benefit since
> > they'd be using the default config.
> 
> I agree that a special value (or NULL, if that's possible) would be
> nicer here. That way, if someone does specify a custom gzip, we honor
> it, and it serves to document the code better. For example, if someone
> symlinked pigz to gzip and used "gzip -cn", then they might not get the
> parallelization benefits they expected.

Thanks for spelling that out. I had a vague feeling somebody might be
surprised, but I didn't know if people actually did stuff like
symlinking pigz to gzip (though it makes perfect sense to do so).

> I'm fine overall with the idea of bringing the compression into the
> binary using zlib, provided that we preserve the "-n" behavior
> (producing reproducible archives).

I just assumed that gzwrite() would have the "-n" behavior, but it's
definitely worth double-checking.

-Peff

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die()
  2019-04-13  1:34   ` Jeff King
  2019-04-13  5:51     ` Junio C Hamano
  2019-04-14  4:34     ` Rohit Ashiwal
@ 2019-04-26 14:28     ` Johannes Schindelin
  2019-05-01 18:07       ` Jeff King
  2 siblings, 1 reply; 74+ messages in thread
From: Johannes Schindelin @ 2019-04-26 14:28 UTC (permalink / raw)
  To: Jeff King
  Cc: Rohit Ashiwal via GitGitGadget, git, Junio C Hamano,
	Rohit Ashiwal

Hi Peff,

On Fri, 12 Apr 2019, Jeff King wrote:

> On Fri, Apr 12, 2019 at 04:04:39PM -0700, Rohit Ashiwal via GitGitGadget wrote:
>
> > From: Rohit Ashiwal <rohit.ashiwal265@gmail.com>
> >
> > MinGit for Windows comes without `gzip` bundled inside, git-archive uses
> > `gzip -cn` to compress tar files but for this to work, gzip needs to be
> > present on the host system.
> >
> > In the next commit, we will change the gzip compression so that we no
> > longer spawn `gzip` but let zlib perform the compression in the same
> > process instead.
> >
> > In preparation for this, we consolidate all the block writes into a
> > single function.
>
> Sounds like a good preparatory step. This part confused me, though:
>
> > @@ -38,11 +40,21 @@ static int write_tar_filter_archive(const struct archiver *ar,
> >  #define USTAR_MAX_MTIME 077777777777ULL
> >  #endif
> >
> > +/* writes out the whole block, or dies if fails */
> > +static void write_block_or_die(const char *block) {
> > +	if (gzip) {
> > +		if (gzwrite(gzip, block, (unsigned) BLOCKSIZE) != BLOCKSIZE)
> > +			die(_("gzwrite failed"));
> > +	} else {
> > +		write_or_die(1, block, BLOCKSIZE);
> > +	}
> > +}
>
> What is gzwrite()? At first I thought this was an out-of-sequence bit of
> the series, but it turns out that this is a zlib.h interface. So the
> idea (I think) is that here we introduce a "gzip" variable that is
> always false, and this first conditional arm is effectively dead code.
> And then in a later patch we'd set up "gzip" and it would become
> not-dead.
>
> I think it would be less confusing if this just factored out
> write_block_or_die(), which starts as a thin wrapper and then grows the
> gzip parts in the next patch.

Yes, I missed this in my pre-submission review. Sorry about that!

> A few nits on the code itself:
>
> > +static gzFile gzip;
> > [...]
> > +       if (gzip) {
>
> Is it OK for us to ask about the truthiness of this opaque type? That
> works if it's really a pointer behind the scenes, but it seems like it
> would be equally OK for zlib to declare it as a struct.
>
> It looks OK in my version of zlib, and that library tends to be fairly
> conservative so I wouldn't be surprised if it was that way back to the
> beginning and remains that way for eternity. But it feels like a bad
> pattern.

It is even part of the public API that `gzFile` is `typedef`'d to a
pointer. So I think in the interest of simplicity, I'll leave it at that
(but I'll mention this in the commit message).

> > +		if (gzwrite(gzip, block, (unsigned) BLOCKSIZE) != BLOCKSIZE)
>
> This cast is interesting. All of the matching write_or_die() calls are
> promoting it to a size_t, which is also unsigned.
>
> BLOCKSIZE is a constant. Should we be defining it with a "U" in the
> first place?

Yep, good idea.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die()
  2019-04-13  5:51     ` Junio C Hamano
  2019-04-14  4:36       ` Rohit Ashiwal
@ 2019-04-26 14:29       ` Johannes Schindelin
  2019-04-26 23:44         ` Junio C Hamano
  1 sibling, 1 reply; 74+ messages in thread
From: Johannes Schindelin @ 2019-04-26 14:29 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Rohit Ashiwal via GitGitGadget, git, Rohit Ashiwal

Hi Junio,

On Sat, 13 Apr 2019, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
>
> >> +/* writes out the whole block, or dies if fails */
> >> +static void write_block_or_die(const char *block) {
> >> +	if (gzip) {
> >> +		if (gzwrite(gzip, block, (unsigned) BLOCKSIZE) != BLOCKSIZE)
> >> +			die(_("gzwrite failed"));
> >> +	} else {
> >> +		write_or_die(1, block, BLOCKSIZE);
> >> +	}
> >> +}
>
> I agree everything you said you your two review messages.
>
> One thing you did not mention but I found disturbing was that this
> does not take size argument but hardcodes BLOCKSIZE.

That is very much on purpose, as this code really is specific to the `tar`
file format, which has a fixed, well-defined block size. It would make it
easier to introduce a bug if that was a parameter.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 2/2] archive: avoid spawning `gzip`
  2019-04-13  1:51   ` Jeff King
  2019-04-13 22:01     ` René Scharfe
  2019-04-13 22:16     ` brian m. carlson
@ 2019-04-26 14:47     ` Johannes Schindelin
  2 siblings, 0 replies; 74+ messages in thread
From: Johannes Schindelin @ 2019-04-26 14:47 UTC (permalink / raw)
  To: Jeff King
  Cc: Rohit Ashiwal via GitGitGadget, git, Junio C Hamano,
	Rohit Ashiwal

[-- Attachment #1: Type: text/plain, Size: 4555 bytes --]

Hi Peff,

On Fri, 12 Apr 2019, Jeff King wrote:

> On Fri, Apr 12, 2019 at 04:04:40PM -0700, Rohit Ashiwal via GitGitGadget wrote:
>
> > From: Rohit Ashiwal <rohit.ashiwal265@gmail.com>
> >
> > As we already link to the zlib library, we can perform the compression
> > without even requiring gzip on the host machine.
>
> Very cool. It's nice to drop a dependency, and this should be a bit more
> efficient, too.

Sadly, no, as René intuited and as your testing shows: there seems to be a
~15% penalty for compressing in the same thread as producing the data to
be compressed.

Given that it reduces the number of dependencies, and given that it might
be better to rely on the external command `pigz -cn` if speed is really a
matter, I still think it makes sense to switch the default, though.

> > diff --git a/archive-tar.c b/archive-tar.c
> > index ba37dad27c..5979ed14b7 100644
> > --- a/archive-tar.c
> > +++ b/archive-tar.c
> > @@ -466,18 +466,34 @@ static int write_tar_filter_archive(const struct archiver *ar,
> >  	filter.use_shell = 1;
> >  	filter.in = -1;
> >
> > -	if (start_command(&filter) < 0)
> > -		die_errno(_("unable to start '%s' filter"), argv[0]);
> > -	close(1);
> > -	if (dup2(filter.in, 1) < 0)
> > -		die_errno(_("unable to redirect descriptor"));
> > -	close(filter.in);
> > +	if (!strcmp("gzip -cn", ar->data)) {
>
> I wondered how you were going to kick this in, since users can define
> arbitrary filters. I think it's kind of neat to automagically convert
> "gzip -cn" (which also happens to be the default). But I think we should
> mention that in the Documentation, in case somebody tries to use a
> custom version of gzip and wonders why it isn't kicking in.
>
> Likewise, it might make sense in the tests to put a poison gzip in the
> $PATH so that we can be sure we're using our internal code, and not just
> calling out to gzip (on platforms that have it, of course).
>
> The alternative is that we could use a special token like ":zlib" or
> something to indicate that the internal implementation should be used
> (and then tweak the baked-in default, too). That might be less
> surprising for users, but most people would still get the benefit since
> they'd be using the default config.

I went with `:zlib`.

> > +		char outmode[4] = "wb\0";
>
> This looks sufficiently magical that it might merit a comment. I had to
> look in the zlib header file to learn that this is just a normal
> stdio-style mode. But we can't just do:
>
>   gzip = gzdopen(fd, "wb");
>
> because we want to (maybe) append a compression level. It's also
> slightly confusing that it explicitly includes a NUL, but later:
>
> > +		if (args->compression_level >= 0 && args->compression_level <= 9)
> > +			outmode[2] = '0' + args->compression_level;
>
> we may overwrite that and assume that outmode[3] is also a NUL. Which it
> is, because of how C initialization works. But that means we also do not
> need the "\0" in the initializer.
>
> Dropping that may make it slightly less jarring (any time I see a
> backslash escape in an initializer, I assume I'm in for some binary
> trickery, but this turns out to be much more mundane).
>
> I'd also consider just using a strbuf:
>
>   struct strbuf outmode = STRBUF_INIT;
>
>   strbuf_addstr(&outmode, "wb");
>   if (args->compression_level >= 0 && args->compression_level <= 9)
> 	strbuf_addch(&outmode, '0' + args->compression_level);
>
> That's overkill in a sense, but it saves us having to deal with
> manually-counted offsets, and this code is only run once per program
> invocation, so the efficiency shouldn't matter.

I'll change that, too, as it seems that `pigz` allows compression levels
higher than 9, in which case we need `strbuf_addf()` anyway. I will not
adjust the condition `<= 9`, of course, as zlib is still limited that way.

> > +		gzip = gzdopen(fileno(stdout), outmode);
> > +		if (!gzip)
> > +			die(_("Could not gzdopen stdout"));
>
> Is there a way to get a more specific error from zlib? I'm less
> concerned about gzdopen here (which should never fail), and more about
> the writing and closing steps. I don't see anything good for gzdopen(),
> but...

Sadly, I did not find anything there.

> > +	if (gzip) {
> > +		if (gzclose(gzip) != Z_OK)
> > +			die(_("gzclose failed"));
>
> ...according to zlib.h, here the returned int is meaningful. And if
> Z_ERRNO, we should probably use die_errno() to give a better message.

Okay.

Thanks,
Dscho

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 2/2] archive: avoid spawning `gzip`
  2019-04-15 21:35       ` Jeff King
@ 2019-04-26 14:51         ` Johannes Schindelin
  2019-04-27  9:59           ` René Scharfe
  0 siblings, 1 reply; 74+ messages in thread
From: Johannes Schindelin @ 2019-04-26 14:51 UTC (permalink / raw)
  To: Jeff King
  Cc: René Scharfe, Rohit Ashiwal via GitGitGadget, git,
	Junio C Hamano, Rohit Ashiwal

[-- Attachment #1: Type: text/plain, Size: 2596 bytes --]

Hi Peff,

On Mon, 15 Apr 2019, Jeff King wrote:

> On Sun, Apr 14, 2019 at 12:01:10AM +0200, René Scharfe wrote:
>
> > >> As we already link to the zlib library, we can perform the compression
> > >> without even requiring gzip on the host machine.
> > >
> > > Very cool. It's nice to drop a dependency, and this should be a bit more
> > > efficient, too.
> >
> > Getting rid of dependencies is good, and using zlib is the obvious way to
> > generate .tgz files. Last time I tried something like that, a separate gzip
> > process was faster, though -- at least on Linux [1].  How does this one
> > fare?
>
> I'd expect a separate gzip to be faster in wall-clock time for a
> multi-core machine, but overall consume more CPU. I'm slightly surprised
> that your timings show that it actually wins on total CPU, too.

If performance is really a concern, you'll be much better off using `pigz`
than `gzip`.

> Here are best-of-five times for "git archive --format=tar.gz HEAD" on
> linux.git (the machine is a quad-core):
>
>   [before, separate gzip]
>   real	0m21.501s
>   user	0m26.148s
>   sys	0m0.619s
>
>   [after, internal gzwrite]
>   real	0m25.156s
>   user	0m25.059s
>   sys	0m0.096s
>
> which does show what I expect (longer overall, but less total CPU).
>
> Which one you prefer depends on your situation, of course. A user on a
> workstation with multiple cores probably cares most about end-to-end
> latency and using all of their available horsepower. A server hosting
> repositories and receiving many unrelated requests probably cares more
> about total CPU (though the differences there are small enough that it
> may not even be worth having a config knob to un-parallelize it).

I am a bit sad that this is so noticeable. Nevertheless, I think that
dropping the dependency is worth it, in particular given that `gzip` is
not exactly fast to begin with (you really should switch to `pigz` or to a
faster compression if you are interested in speed).

> > Doing compression in its own thread may be a good idea.
>
> Yeah. It might even make the patch simpler, since I'd expect it to be
> implemented with start_async() and a descriptor, making it look just
> like a gzip pipe to the caller. :)

Sadly, it does not really look like it is simpler.

And when going into the direction of multi-threaded compression anyway,
the `pigz` trick of compressing 32kB chunks in parallel is an interesting
idea, too.

All of this, however, is outside of the purview of this (still relatively
simple) patch series.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 2/2] archive: avoid spawning `gzip`
  2019-04-13 22:16     ` brian m. carlson
  2019-04-15 21:36       ` Jeff King
@ 2019-04-26 14:54       ` Johannes Schindelin
  2019-05-02 20:20         ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 74+ messages in thread
From: Johannes Schindelin @ 2019-04-26 14:54 UTC (permalink / raw)
  To: brian m. carlson
  Cc: Jeff King, Rohit Ashiwal via GitGitGadget, git, Junio C Hamano,
	Rohit Ashiwal

Hi brian,

On Sat, 13 Apr 2019, brian m. carlson wrote:

> On Fri, Apr 12, 2019 at 09:51:02PM -0400, Jeff King wrote:
> > I wondered how you were going to kick this in, since users can define
> > arbitrary filters. I think it's kind of neat to automagically convert
> > "gzip -cn" (which also happens to be the default). But I think we should
> > mention that in the Documentation, in case somebody tries to use a
> > custom version of gzip and wonders why it isn't kicking in.
> >
> > Likewise, it might make sense in the tests to put a poison gzip in the
> > $PATH so that we can be sure we're using our internal code, and not just
> > calling out to gzip (on platforms that have it, of course).
> >
> > The alternative is that we could use a special token like ":zlib" or
> > something to indicate that the internal implementation should be used
> > (and then tweak the baked-in default, too). That might be less
> > surprising for users, but most people would still get the benefit since
> > they'd be using the default config.
>
> I agree that a special value (or NULL, if that's possible) would be
> nicer here. That way, if someone does specify a custom gzip, we honor
> it, and it serves to document the code better. For example, if someone
> symlinked pigz to gzip and used "gzip -cn", then they might not get the
> parallelization benefits they expected.

I went with `:zlib`. The `NULL` value would not really work, as there is
no way to specify that via `archive.tgz.command`.

About the symlinked thing: I do not really want to care to support such
hacks. If you want a different compressor than the default (which can
change), you should specify it specifically.

> I'm fine overall with the idea of bringing the compression into the
> binary using zlib, provided that we preserve the "-n" behavior
> (producing reproducible archives).

Thanks for voicing this concern. I had a look at zlib's source code, and
it looks like it requires an extra function call (that we don't call) to
make the resulting file non-reproducible. In other words, it has the
opposite default behavior from `gzip`.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die()
  2019-04-26 14:29       ` Johannes Schindelin
@ 2019-04-26 23:44         ` Junio C Hamano
  2019-04-29 21:32           ` Johannes Schindelin
  0 siblings, 1 reply; 74+ messages in thread
From: Junio C Hamano @ 2019-04-26 23:44 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Jeff King, Rohit Ashiwal via GitGitGadget, git, Rohit Ashiwal

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

>> >> +/* writes out the whole block, or dies if fails */
>> >> +static void write_block_or_die(const char *block) {
>> >> +	if (gzip) {
>> >> +		if (gzwrite(gzip, block, (unsigned) BLOCKSIZE) != BLOCKSIZE)
>> >> +			die(_("gzwrite failed"));
>> >> +	} else {
>> >> +		write_or_die(1, block, BLOCKSIZE);
>> >> +	}
>> >> +}
>>
>> I agree everything you said you your two review messages.
>>
>> One thing you did not mention but I found disturbing was that this
>> does not take size argument but hardcodes BLOCKSIZE.
>
> That is very much on purpose, as this code really is specific to the `tar`
> file format, which has a fixed, well-defined block size. It would make it
> easier to introduce a bug if that was a parameter.

I am not so sure for two reasons.

One is that its caller is full of BLOCKSIZE constants passed as
parameters (instead of calling a specialized function that hardcodes
the BLOCKSIZE without taking it as a parameter), and this being a
file-scope static, it does not really matter with respect to an
accidental bug of mistakenly changing BLOCKSIZE either in the caller
or callee.

Another is that I am not sure how your "fixed format" argument
meshes with the "-b blocksize" parameter to affect the tar/pax
output.  The format may be fixed, but it is parameterized.  If
we ever need to grow the ability to take "-b", having the knowledge
that our current code is limited to the fixed BLOCKSIZE in a single
function (i.e. the caller of this function , not the callee) would 
be less error prone.

These two are in addition to the uniformity of the abstraction
concerns I raised in my original review comment.

So, sorry, I do not think your response makes much sense.

Thanks.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 2/2] archive: avoid spawning `gzip`
  2019-04-26 14:51         ` Johannes Schindelin
@ 2019-04-27  9:59           ` René Scharfe
  2019-04-27 17:39             ` René Scharfe
  0 siblings, 1 reply; 74+ messages in thread
From: René Scharfe @ 2019-04-27  9:59 UTC (permalink / raw)
  To: Johannes Schindelin, Jeff King
  Cc: Rohit Ashiwal via GitGitGadget, git, Junio C Hamano,
	Rohit Ashiwal

Am 26.04.19 um 16:51 schrieb Johannes Schindelin:> Hi Peff,
>
> On Mon, 15 Apr 2019, Jeff King wrote:
>
>> On Sun, Apr 14, 2019 at 12:01:10AM +0200, René Scharfe wrote:
>>
>>>>> As we already link to the zlib library, we can perform the compression
>>>>> without even requiring gzip on the host machine.
>>>>
>>>> Very cool. It's nice to drop a dependency, and this should be a bit more
>>>> efficient, too.
>>>
>>> Getting rid of dependencies is good, and using zlib is the obvious way to
>>> generate .tgz files. Last time I tried something like that, a separate gzip
>>> process was faster, though -- at least on Linux [1].  How does this one
>>> fare?
>>
>> I'd expect a separate gzip to be faster in wall-clock time for a
>> multi-core machine, but overall consume more CPU. I'm slightly surprised
>> that your timings show that it actually wins on total CPU, too.

My initial expectation back then was that moving data between processes
is costly and that compressing in-process would improve the overall
performance.  Your expectation is more in line with what I then actually
saw.  The difference in total CPU time wasn't that big, perhaps just
noise.

> If performance is really a concern, you'll be much better off using `pigz`
> than `gzip`.

Performance is always a concern, but on the other hand I didn't see any
complaints about slow archiving so far.

>> Here are best-of-five times for "git archive --format=tar.gz HEAD" on
>> linux.git (the machine is a quad-core):
>>
>>    [before, separate gzip]
>>    real	0m21.501s
>>    user	0m26.148s
>>    sys	0m0.619s
>>
>>    [after, internal gzwrite]
>>    real	0m25.156s
>>    user	0m25.059s
>>    sys	0m0.096s
>>
>> which does show what I expect (longer overall, but less total CPU).

I get similar numbers with hyperfine:

Benchmark #1: git archive --format=tar.gz HEAD >/dev/null
  Time (mean ± σ):     16.683 s ±  0.451 s    [User: 20.230 s, System: 0.375 s]
  Range (min … max):   16.308 s … 17.852 s    10 runs

Benchmark #2: ~/src/git/git-archive --format=tar.gz HEAD >/dev/null
  Time (mean ± σ):     19.898 s ±  0.228 s    [User: 19.825 s, System: 0.073 s]
  Range (min … max):   19.627 s … 20.355 s    10 runs

Benchmark #3: git archive --format=zip HEAD >/dev/null
  Time (mean ± σ):     16.449 s ±  0.075 s    [User: 16.340 s, System: 0.109 s]
  Range (min … max):   16.326 s … 16.611 s    10 runs

#1 is git v2.21.0, #2 is with the two patches applied, #3 is v2.21.0
again, but with zip output, just to put things into perspective.

>> Which one you prefer depends on your situation, of course. A user on a
>> workstation with multiple cores probably cares most about end-to-end
>> latency and using all of their available horsepower. A server hosting
>> repositories and receiving many unrelated requests probably cares more
>> about total CPU (though the differences there are small enough that it
>> may not even be worth having a config knob to un-parallelize it).
>
> I am a bit sad that this is so noticeable. Nevertheless, I think that
> dropping the dependency is worth it, in particular given that `gzip` is
> not exactly fast to begin with (you really should switch to `pigz` or to a
> faster compression if you are interested in speed).

We could import pigz verbatim, it's just 11K LOCs total. :)

>>> Doing compression in its own thread may be a good idea.
>>
>> Yeah. It might even make the patch simpler, since I'd expect it to be
>> implemented with start_async() and a descriptor, making it look just
>> like a gzip pipe to the caller. :)
>
> Sadly, it does not really look like it is simpler.

I have to agree -- at least I was unable to pull off the stdout
plumbing trick.  Is there a way?  But it doesn't look too bad, and
the performance is closer to using the real gzip:

Benchmark #1: ~/src/git/git-archive --format=tar.gz HEAD >/dev/null
  Time (mean ± σ):     17.300 s ±  0.198 s    [User: 20.825 s, System: 0.356 s]
  Range (min … max):   17.042 s … 17.638 s    10 runs

This is with the following patch:

---
 archive-tar.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 59 insertions(+), 4 deletions(-)

diff --git a/archive-tar.c b/archive-tar.c
index 3e53aac1e6..c889b84c2c 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -38,11 +38,13 @@ static int write_tar_filter_archive(const struct archiver *ar,
 #define USTAR_MAX_MTIME 077777777777ULL
 #endif

+static int out_fd = 1;
+
 /* writes out the whole block, but only if it is full */
 static void write_if_needed(void)
 {
 	if (offset == BLOCKSIZE) {
-		write_or_die(1, block, BLOCKSIZE);
+		write_or_die(out_fd, block, BLOCKSIZE);
 		offset = 0;
 	}
 }
@@ -66,7 +68,7 @@ static void do_write_blocked(const void *data, unsigned long size)
 		write_if_needed();
 	}
 	while (size >= BLOCKSIZE) {
-		write_or_die(1, buf, BLOCKSIZE);
+		write_or_die(out_fd, buf, BLOCKSIZE);
 		size -= BLOCKSIZE;
 		buf += BLOCKSIZE;
 	}
@@ -101,10 +103,10 @@ static void write_trailer(void)
 {
 	int tail = BLOCKSIZE - offset;
 	memset(block + offset, 0, tail);
-	write_or_die(1, block, BLOCKSIZE);
+	write_or_die(out_fd, block, BLOCKSIZE);
 	if (tail < 2 * RECORDSIZE) {
 		memset(block, 0, offset);
-		write_or_die(1, block, BLOCKSIZE);
+		write_or_die(out_fd, block, BLOCKSIZE);
 	}
 }

@@ -434,6 +436,56 @@ static int write_tar_archive(const struct archiver *ar,
 	return err;
 }

+static int internal_gzip(int in, int out, void *data)
+{
+	int *levelp = data;
+	gzFile gzip = gzdopen(1, "wb");
+	if (!gzip)
+		die(_("gzdopen failed"));
+	if (gzsetparams(gzip, *levelp, Z_DEFAULT_STRATEGY) != Z_OK)
+		die(_("unable to set compression level"));
+
+	for (;;) {
+		char buf[BLOCKSIZE];
+		ssize_t read = xread(in, buf, sizeof(buf));
+		if (read < 0)
+			die_errno(_("read failed"));
+		if (read == 0)
+			break;
+		if (gzwrite(gzip, buf, read) != read)
+			die(_("gzwrite failed"));
+	}
+
+	if (gzclose(gzip) != Z_OK)
+		die(_("gzclose failed"));
+	close(in);
+	return 0;
+}
+
+static int write_tar_gzip_archive(const struct archiver *ar,
+				  struct archiver_args *args)
+{
+	struct async filter;
+	int r;
+
+	memset(&filter, 0, sizeof(filter));
+	filter.proc = internal_gzip;
+	filter.data = &args->compression_level;
+	filter.in = -1;
+
+	if (start_async(&filter))
+		die(_("unable to fork off internal gzip"));
+	out_fd = filter.in;
+
+	r = write_tar_archive(ar, args);
+
+	close(out_fd);
+	if (finish_async(&filter))
+		die(_("error in internal gzip"));
+
+	return r;
+}
+
 static int write_tar_filter_archive(const struct archiver *ar,
 				    struct archiver_args *args)
 {
@@ -445,6 +497,9 @@ static int write_tar_filter_archive(const struct archiver *ar,
 	if (!ar->data)
 		BUG("tar-filter archiver called with no filter defined");

+	if (!strcmp(ar->data, "gzip -cn"))
+		return write_tar_gzip_archive(ar, args);
+
 	strbuf_addstr(&cmd, ar->data);
 	if (args->compression_level >= 0)
 		strbuf_addf(&cmd, " -%d", args->compression_level);
--
2.21.0

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: [PATCH 2/2] archive: avoid spawning `gzip`
  2019-04-27  9:59           ` René Scharfe
@ 2019-04-27 17:39             ` René Scharfe
  2019-04-29 21:25               ` Johannes Schindelin
  0 siblings, 1 reply; 74+ messages in thread
From: René Scharfe @ 2019-04-27 17:39 UTC (permalink / raw)
  To: Johannes Schindelin, Jeff King
  Cc: Rohit Ashiwal via GitGitGadget, git, Junio C Hamano,
	Rohit Ashiwal

Am 27.04.19 um 11:59 schrieb René Scharfe:> Am 26.04.19 um 16:51 schrieb Johannes Schindelin:
>>
>> On Mon, 15 Apr 2019, Jeff King wrote:
>>
>>> On Sun, Apr 14, 2019 at 12:01:10AM +0200, René Scharfe wrote:
>>>
>>>> Doing compression in its own thread may be a good idea.
>>>
>>> Yeah. It might even make the patch simpler, since I'd expect it to be
>>> implemented with start_async() and a descriptor, making it look just
>>> like a gzip pipe to the caller. :)
>>
>> Sadly, it does not really look like it is simpler.
>
> I have to agree -- at least I was unable to pull off the stdout
> plumbing trick.

The simplest solution is of course to not touch the archive code.  The
patch below makes that possible:

Benchmark #1: ~/src/git/git -c tar.tgz.command=~/src/git/git-gzip archive --format=tgz HEAD >/dev/null
  Time (mean ± σ):     17.256 s ±  0.299 s    [User: 20.380 s, System: 0.294 s]
  Range (min … max):   16.940 s … 17.804 s    10 runs

Curious to see how it looks like on other systems and platforms.

And perhaps the buffer size needs to be tuned.

-- >8 --
Subject: [PATCH] add git gzip

Add a cheap gzip lookalike based on zlib for systems that don't have
(or want) the real thing.  It can be used e.g. to generate tgz files
using git archive and its configuration options tar.tgz.command and
tar.tar.gz.command, without any other external dependency.

Signed-off-by: Rene Scharfe <l.s.r@web.de>
---
 .gitignore       |  1 +
 Makefile         |  1 +
 builtin.h        |  1 +
 builtin/gzip.c   | 64 ++++++++++++++++++++++++++++++++++++++++++++++++
 command-list.txt |  1 +
 git.c            |  1 +
 6 files changed, 69 insertions(+)
 create mode 100644 builtin/gzip.c

diff --git a/.gitignore b/.gitignore
index 44c74402c8..e550868219 100644
--- a/.gitignore
+++ b/.gitignore
@@ -71,6 +71,7 @@
 /git-gc
 /git-get-tar-commit-id
 /git-grep
+/git-gzip
 /git-hash-object
 /git-help
 /git-http-backend
diff --git a/Makefile b/Makefile
index 9f1b6e8926..2b34f1a4aa 100644
--- a/Makefile
+++ b/Makefile
@@ -1075,6 +1075,7 @@ BUILTIN_OBJS += builtin/fsck.o
 BUILTIN_OBJS += builtin/gc.o
 BUILTIN_OBJS += builtin/get-tar-commit-id.o
 BUILTIN_OBJS += builtin/grep.o
+BUILTIN_OBJS += builtin/gzip.o
 BUILTIN_OBJS += builtin/hash-object.o
 BUILTIN_OBJS += builtin/help.o
 BUILTIN_OBJS += builtin/index-pack.o
diff --git a/builtin.h b/builtin.h
index b78ab6e30b..abc34cc9d0 100644
--- a/builtin.h
+++ b/builtin.h
@@ -170,6 +170,7 @@ extern int cmd_fsck(int argc, const char **argv, const char *prefix);
 extern int cmd_gc(int argc, const char **argv, const char *prefix);
 extern int cmd_get_tar_commit_id(int argc, const char **argv, const char *prefix);
 extern int cmd_grep(int argc, const char **argv, const char *prefix);
+extern int cmd_gzip(int argc, const char **argv, const char *prefix);
 extern int cmd_hash_object(int argc, const char **argv, const char *prefix);
 extern int cmd_help(int argc, const char **argv, const char *prefix);
 extern int cmd_index_pack(int argc, const char **argv, const char *prefix);
diff --git a/builtin/gzip.c b/builtin/gzip.c
new file mode 100644
index 0000000000..90a98c44ce
--- /dev/null
+++ b/builtin/gzip.c
@@ -0,0 +1,64 @@
+#include "cache.h"
+#include "builtin.h"
+#include "parse-options.h"
+
+static const char * const gzip_usage[] = {
+	N_("git gzip [-NUM]"),
+	NULL
+};
+
+static int level_callback(const struct option *opt, const char *arg, int unset)
+{
+	int *levelp = opt->value;
+	int value;
+	const char *endp;
+
+	if (unset)
+		BUG("switch -NUM cannot be negated");
+
+	value = strtol(arg, (char **)&endp, 10);
+	if (*endp)
+		BUG("switch -NUM cannot be non-numeric");
+
+	*levelp = value;
+	return 0;
+}
+
+#define BUFFERSIZE (64 * 1024)
+
+int cmd_gzip(int argc, const char **argv, const char *prefix)
+{
+	gzFile gz;
+	int level = Z_DEFAULT_COMPRESSION;
+	struct option options[] = {
+		OPT_NUMBER_CALLBACK(&level, N_("compression level"),
+				    level_callback),
+		OPT_END()
+	};
+
+	argc = parse_options(argc, argv, prefix, options, gzip_usage, 0);
+	if (argc > 0)
+		usage_with_options(gzip_usage, options);
+
+	gz = gzdopen(1, "wb");
+	if (!gz)
+		die(_("unable to gzdopen stdout"));
+
+	if (gzsetparams(gz, level, Z_DEFAULT_STRATEGY) != Z_OK)
+		die(_("unable to set compression level %d"), level);
+
+	for (;;) {
+		char buf[BUFFERSIZE];
+		ssize_t read_bytes = xread(0, buf, sizeof(buf));
+		if (read_bytes < 0)
+			die_errno(_("unable to read from stdin"));
+		if (read_bytes == 0)
+			break;
+		if (gzwrite(gz, buf, read_bytes) != read_bytes)
+			die(_("gzwrite failed"));
+	}
+
+	if (gzclose(gz) != Z_OK)
+		die(_("gzclose failed"));
+	return 0;
+}
diff --git a/command-list.txt b/command-list.txt
index 3a9af104b5..755848842c 100644
--- a/command-list.txt
+++ b/command-list.txt
@@ -99,6 +99,7 @@ git-gc                                  mainporcelain
 git-get-tar-commit-id                   plumbinginterrogators
 git-grep                                mainporcelain           info
 git-gui                                 mainporcelain
+git-gzip                                purehelpers
 git-hash-object                         plumbingmanipulators
 git-help                                ancillaryinterrogators          complete
 git-http-backend                        synchingrepositories
diff --git a/git.c b/git.c
index 50da125c60..48f7fc6c56 100644
--- a/git.c
+++ b/git.c
@@ -510,6 +510,7 @@ static struct cmd_struct commands[] = {
 	{ "gc", cmd_gc, RUN_SETUP },
 	{ "get-tar-commit-id", cmd_get_tar_commit_id, NO_PARSEOPT },
 	{ "grep", cmd_grep, RUN_SETUP_GENTLY },
+	{ "gzip", cmd_gzip },
 	{ "hash-object", cmd_hash_object },
 	{ "help", cmd_help },
 	{ "index-pack", cmd_index_pack, RUN_SETUP_GENTLY | NO_PARSEOPT },
--
2.21.0

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: [PATCH 2/2] archive: avoid spawning `gzip`
  2019-04-27 17:39             ` René Scharfe
@ 2019-04-29 21:25               ` Johannes Schindelin
  2019-05-01 17:45                 ` René Scharfe
  0 siblings, 1 reply; 74+ messages in thread
From: Johannes Schindelin @ 2019-04-29 21:25 UTC (permalink / raw)
  To: René Scharfe
  Cc: Jeff King, Rohit Ashiwal via GitGitGadget, git, Junio C Hamano,
	Rohit Ashiwal

[-- Attachment #1: Type: text/plain, Size: 1709 bytes --]

Hi René,

On Sat, 27 Apr 2019, René Scharfe wrote:

> Am 27.04.19 um 11:59 schrieb René Scharfe:> Am 26.04.19 um 16:51 schrieb
> Johannes Schindelin:
> >>
> >> On Mon, 15 Apr 2019, Jeff King wrote:
> >>
> >>> On Sun, Apr 14, 2019 at 12:01:10AM +0200, René Scharfe wrote:
> >>>
> >>>> Doing compression in its own thread may be a good idea.
> >>>
> >>> Yeah. It might even make the patch simpler, since I'd expect it to
> >>> be implemented with start_async() and a descriptor, making it look
> >>> just like a gzip pipe to the caller. :)
> >>
> >> Sadly, it does not really look like it is simpler.
> >
> > I have to agree -- at least I was unable to pull off the stdout
> > plumbing trick.
>
> The simplest solution is of course to not touch the archive code.

We could do that, of course, and we could avoid adding a new command that
we have to support for eternity by introducing a command mode for `git
archive` instead (think: `git archive --gzip -9`), and marking that
command mode clearly as an internal implementation detail.

But since the performance is still not quite on par with `gzip`, I would
actually rather not, and really, just punt on that one, stating that
people interested in higher performance should use `pigz`.

And who knows, maybe nobody will complain at all about the performance?
It's not like `gzip` is really, really fast (IIRC LZO blows gzip out of
the water, speed-wise).

And if we get "bug" reports about this, we

1) have a very easy workaround:

	git config --global archive.tgz.command 'gzip -cn'

2) could always implement a pigz-like multi-threading solution.

I strongly expect a YAGNI here, though.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die()
  2019-04-26 23:44         ` Junio C Hamano
@ 2019-04-29 21:32           ` Johannes Schindelin
  2019-05-01 18:09             ` Jeff King
  0 siblings, 1 reply; 74+ messages in thread
From: Johannes Schindelin @ 2019-04-29 21:32 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Rohit Ashiwal via GitGitGadget, git, Rohit Ashiwal

Hi Junio,

On Sat, 27 Apr 2019, Junio C Hamano wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
>
> >> >> +/* writes out the whole block, or dies if fails */
> >> >> +static void write_block_or_die(const char *block) {
> >> >> +	if (gzip) {
> >> >> +		if (gzwrite(gzip, block, (unsigned) BLOCKSIZE) != BLOCKSIZE)
> >> >> +			die(_("gzwrite failed"));
> >> >> +	} else {
> >> >> +		write_or_die(1, block, BLOCKSIZE);
> >> >> +	}
> >> >> +}
> >>
> >> I agree everything you said you your two review messages.
> >>
> >> One thing you did not mention but I found disturbing was that this
> >> does not take size argument but hardcodes BLOCKSIZE.
> >
> > That is very much on purpose, as this code really is specific to the `tar`
> > file format, which has a fixed, well-defined block size. It would make it
> > easier to introduce a bug if that was a parameter.
>
> I am not so sure for two reasons.
>
> One is that its caller is full of BLOCKSIZE constants passed as
> parameters (instead of calling a specialized function that hardcodes
> the BLOCKSIZE without taking it as a parameter), and this being a
> file-scope static, it does not really matter with respect to an
> accidental bug of mistakenly changing BLOCKSIZE either in the caller
> or callee.

I guess I can try to find some time next week to clean up those callers.
But honestly, I do not really think that this cleanup falls squarely into
the goal of this here patch series.

> Another is that I am not sure how your "fixed format" argument
> meshes with the "-b blocksize" parameter to affect the tar/pax
> output.  The format may be fixed, but it is parameterized.  If
> we ever need to grow the ability to take "-b", having the knowledge
> that our current code is limited to the fixed BLOCKSIZE in a single
> function (i.e. the caller of this function , not the callee) would
> be less error prone.

This argument would hold a lot more water if the following lines were not
part of archive-tar.c:

	#define RECORDSIZE      (512)
	#define BLOCKSIZE       (RECORDSIZE * 20)

	static char block[BLOCKSIZE];

If you can tell me how the `-b` (run-time) parameter can affect the
(compile-time) `BLOCKSIZE` constant, maybe I can start to understand your
concern.

:-)

Ciao,
Dscho

P.S.: I just looked, and I do not even see a `-b` option of `git archive`,
so I suspect that you talked about the generic tar file format? I was not
talking about each and every implementation of the tar file format here, I
was talking about the tar file format that Git generates.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 2/2] archive: avoid spawning `gzip`
  2019-04-29 21:25               ` Johannes Schindelin
@ 2019-05-01 17:45                 ` René Scharfe
  2019-05-01 18:18                   ` Jeff King
  0 siblings, 1 reply; 74+ messages in thread
From: René Scharfe @ 2019-05-01 17:45 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Jeff King, Rohit Ashiwal via GitGitGadget, git, Junio C Hamano,
	Rohit Ashiwal

Hello Dscho,

Am 29.04.19 um 23:25 schrieb Johannes Schindelin:
> On Sat, 27 Apr 2019, René Scharfe wrote:
>> The simplest solution is of course to not touch the archive code.
>
> We could do that, of course, and we could avoid adding a new command that
> we have to support for eternity by introducing a command mode for `git
> archive` instead (think: `git archive --gzip -9`), and marking that
> command mode clearly as an internal implementation detail.

adding gzip as the 142nd git command and 18th pure helper *would* be a
bit embarrassing, in particular for a command that's not directly
related to version control and readily available on all platforms.
Exposing it as a (hidden?) archive sub-command might be better.

> But since the performance is still not quite on par with `gzip`, I would
> actually rather not, and really, just punt on that one, stating that
> people interested in higher performance should use `pigz`.

Here are my performance numbers for generating .tar.gz files again:

master, using gzip(1):
  Time (mean ± σ):     16.683 s ±  0.451 s    [User: 20.230 s, System: 0.375 s]
  Range (min … max):   16.308 s … 17.852 s    10 runs

using zlib sequentially:
  Time (mean ± σ):     19.898 s ±  0.228 s    [User: 19.825 s, System: 0.073 s]
  Range (min … max):   19.627 s … 20.355 s    10 runs

using zlib asynchronously:
  Time (mean ± σ):     17.300 s ±  0.198 s    [User: 20.825 s, System: 0.356 s]
  Range (min … max):   17.042 s … 17.638 s    10 runs

using a gzip-lookalike:
  Time (mean ± σ):     17.256 s ±  0.299 s    [User: 20.380 s, System: 0.294 s]
  Range (min … max):   16.940 s … 17.804 s    10 runs

The last two have comparable system time, ca. 1% more user time and
ca. 5% longer duration.  The second one has much better system time
and 2% less user time and 19% longer duration.  Hmm.

> And who knows, maybe nobody will complain at all about the performance?

Probably.  And popular tarballs would be cached anyway, I guess.

So I'll send comments on your series later this week.

René

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die()
  2019-04-26 14:28     ` Johannes Schindelin
@ 2019-05-01 18:07       ` Jeff King
  0 siblings, 0 replies; 74+ messages in thread
From: Jeff King @ 2019-05-01 18:07 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Rohit Ashiwal via GitGitGadget, git, Junio C Hamano,
	Rohit Ashiwal

On Fri, Apr 26, 2019 at 10:28:12AM -0400, Johannes Schindelin wrote:

> > > +static gzFile gzip;
> > > [...]
> > > +       if (gzip) {
> >
> > Is it OK for us to ask about the truthiness of this opaque type? That
> > works if it's really a pointer behind the scenes, but it seems like it
> > would be equally OK for zlib to declare it as a struct.
> >
> > It looks OK in my version of zlib, and that library tends to be fairly
> > conservative so I wouldn't be surprised if it was that way back to the
> > beginning and remains that way for eternity. But it feels like a bad
> > pattern.
> 
> It is even part of the public API that `gzFile` is `typedef`'d to a
> pointer. So I think in the interest of simplicity, I'll leave it at that
> (but I'll mention this in the commit message).

I think that's probably OK. My biggest concern is that we'd notice if
our assumption changes, but I think modern compilers would generally
complain about checking a tautological truth value.

-Peff

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die()
  2019-04-29 21:32           ` Johannes Schindelin
@ 2019-05-01 18:09             ` Jeff King
  2019-05-02 20:29               ` René Scharfe
  2019-05-05  5:25               ` Junio C Hamano
  0 siblings, 2 replies; 74+ messages in thread
From: Jeff King @ 2019-05-01 18:09 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Junio C Hamano, Rohit Ashiwal via GitGitGadget, git,
	Rohit Ashiwal

On Mon, Apr 29, 2019 at 05:32:50PM -0400, Johannes Schindelin wrote:

> > Another is that I am not sure how your "fixed format" argument
> > meshes with the "-b blocksize" parameter to affect the tar/pax
> > output.  The format may be fixed, but it is parameterized.  If
> > we ever need to grow the ability to take "-b", having the knowledge
> > that our current code is limited to the fixed BLOCKSIZE in a single
> > function (i.e. the caller of this function , not the callee) would
> > be less error prone.
> 
> This argument would hold a lot more water if the following lines were not
> part of archive-tar.c:
> 
> 	#define RECORDSIZE      (512)
> 	#define BLOCKSIZE       (RECORDSIZE * 20)
> 
> 	static char block[BLOCKSIZE];
> 
> If you can tell me how the `-b` (run-time) parameter can affect the
> (compile-time) `BLOCKSIZE` constant, maybe I can start to understand your
> concern.

FWIW, I agree with you here. These patches are not making anything worse
(and may even make them better, since we'd probably need to swap out the
BLOCKSIZE constant for a run-time "blocksize" variable in fewer places).

-Peff

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 2/2] archive: avoid spawning `gzip`
  2019-05-01 17:45                 ` René Scharfe
@ 2019-05-01 18:18                   ` Jeff King
  2019-06-10 10:44                     ` René Scharfe
  0 siblings, 1 reply; 74+ messages in thread
From: Jeff King @ 2019-05-01 18:18 UTC (permalink / raw)
  To: René Scharfe
  Cc: Johannes Schindelin, Rohit Ashiwal via GitGitGadget, git,
	Junio C Hamano, Rohit Ashiwal

On Wed, May 01, 2019 at 07:45:05PM +0200, René Scharfe wrote:

> > But since the performance is still not quite on par with `gzip`, I would
> > actually rather not, and really, just punt on that one, stating that
> > people interested in higher performance should use `pigz`.
> 
> Here are my performance numbers for generating .tar.gz files again:
> 
> master, using gzip(1):
>   Time (mean ± σ):     16.683 s ±  0.451 s    [User: 20.230 s, System: 0.375 s]
>   Range (min … max):   16.308 s … 17.852 s    10 runs
> 
> using zlib sequentially:
>   Time (mean ± σ):     19.898 s ±  0.228 s    [User: 19.825 s, System: 0.073 s]
>   Range (min … max):   19.627 s … 20.355 s    10 runs
> 
> using zlib asynchronously:
>   Time (mean ± σ):     17.300 s ±  0.198 s    [User: 20.825 s, System: 0.356 s]
>   Range (min … max):   17.042 s … 17.638 s    10 runs
> 
> using a gzip-lookalike:
>   Time (mean ± σ):     17.256 s ±  0.299 s    [User: 20.380 s, System: 0.294 s]
>   Range (min … max):   16.940 s … 17.804 s    10 runs
> 
> The last two have comparable system time, ca. 1% more user time and
> ca. 5% longer duration.  The second one has much better system time
> and 2% less user time and 19% longer duration.  Hmm.

I think the start_async() one seems like a good option. It reclaims most
of the (wall-clock) performance, isn't very much code, and doesn't leave
any ugly user-visible traces.

I'd be fine to see it come later, though, on top of the patches Dscho is
sending. Even though changing to sequential zlib is technically a change
in behavior, the existing behavior wasn't really planned. And given the
wall-clock versus CPU time tradeoff, it's not entirely clear that one
solution is better than the other.

> > And who knows, maybe nobody will complain at all about the performance?
> 
> Probably.  And popular tarballs would be cached anyway, I guess.

At GitHub we certainly do cache the git-archive output. We'd also be
just fine with the sequential solution. We generally turn down
pack.threads to 1, and keep our CPUs busy by serving multiple users
anyway.

So whatever has the lowest overall CPU time is generally preferable, but
the times are close enough that I don't think we'd care much either way
(and it's probably not worth having a config option or similar).

-Peff

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 2/2] archive: avoid spawning `gzip`
  2019-04-26 14:54       ` Johannes Schindelin
@ 2019-05-02 20:20         ` Ævar Arnfjörð Bjarmason
  2019-05-03 20:49           ` Johannes Schindelin
  0 siblings, 1 reply; 74+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-05-02 20:20 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: brian m. carlson, Jeff King, Rohit Ashiwal via GitGitGadget, git,
	Junio C Hamano, Rohit Ashiwal

On Fri, Apr 26 2019, Johannes Schindelin wrote:

> Hi brian,
>
> On Sat, 13 Apr 2019, brian m. carlson wrote:
>
>> On Fri, Apr 12, 2019 at 09:51:02PM -0400, Jeff King wrote:
>> > I wondered how you were going to kick this in, since users can define
>> > arbitrary filters. I think it's kind of neat to automagically convert
>> > "gzip -cn" (which also happens to be the default). But I think we should
>> > mention that in the Documentation, in case somebody tries to use a
>> > custom version of gzip and wonders why it isn't kicking in.
>> >
>> > Likewise, it might make sense in the tests to put a poison gzip in the
>> > $PATH so that we can be sure we're using our internal code, and not just
>> > calling out to gzip (on platforms that have it, of course).
>> >
>> > The alternative is that we could use a special token like ":zlib" or
>> > something to indicate that the internal implementation should be used
>> > (and then tweak the baked-in default, too). That might be less
>> > surprising for users, but most people would still get the benefit since
>> > they'd be using the default config.
>>
>> I agree that a special value (or NULL, if that's possible) would be
>> nicer here. That way, if someone does specify a custom gzip, we honor
>> it, and it serves to document the code better. For example, if someone
>> symlinked pigz to gzip and used "gzip -cn", then they might not get the
>> parallelization benefits they expected.
>
> I went with `:zlib`. The `NULL` value would not really work, as there is
> no way to specify that via `archive.tgz.command`.
>
> About the symlinked thing: I do not really want to care to support such
> hacks.

It's the standard way by which a lot of systems do this, e.g. on my
Debian box:

    $ find /{,s}bin /usr/{,s}bin -type l -exec file {} \;|grep /etc/alternatives|wc -l
    108

To write this E-Mail I'm invoking one such symlink :)

> If you want a different compressor than the default (which can
> change), you should specify it specifically.

You might want to do so system-wide, or for each program at a time.

I don't care about this for gzip myself, just pointing out it *is* a
thing people use.

>> I'm fine overall with the idea of bringing the compression into the
>> binary using zlib, provided that we preserve the "-n" behavior
>> (producing reproducible archives).
>
> Thanks for voicing this concern. I had a look at zlib's source code, and
> it looks like it requires an extra function call (that we don't call) to
> make the resulting file non-reproducible. In other words, it has the
> opposite default behavior from `gzip`.

Just commenting on the overall thread: I like René's "new built-in"
patch best.

You mentioned "new command that we have to support for eternity". I
think calling it "git gzip" is a bad idea. We'd make it "git
archive--gzip" or "git archive--helper", and we could hide building it
behind some compat flag.

Then we'd carry no if/else internal/external code, and the portability
issue that started this would be addressed, no?

As a bonus we could also drop the "GZIP" prereq from the test suite
entirely and just put that "gzip" in $PATH for the purposes of the
tests.

I spied on your yet-to-be-submitted patches and you could drop GZIP from
the "git archive" tests, but we'd still need it in
t/t5562-http-backend-content-length.sh, but not if we had a "gzip"
compat helper.

There's also a long-standing bug/misfeature in git-archive that I wonder
about: When you combine --format with --remote you can only generate
e.g. tar.gz if the remote is OK with it, if it says no you can't even if
it supports "tar" and you could do the "gz" part locally. Would such a
patch be harder with :zlib than if we always just spewed out to external
"gzip" after satisfying some criteria?

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 3/4] archive: optionally use zlib directly for gzip compression
       [not found]   ` <4ea94a8784876c3a19e387537edd81a957fc692c.1556321244.git.gitgitgadget@gmail.com>
@ 2019-05-02 20:29     ` René Scharfe
  0 siblings, 0 replies; 74+ messages in thread
From: René Scharfe @ 2019-05-02 20:29 UTC (permalink / raw)
  To: Rohit Ashiwal via GitGitGadget, git
  Cc: Jeff King, brian m. carlson, Junio C Hamano, Rohit Ashiwal

Am 27.04.19 um 01:27 schrieb Rohit Ashiwal via GitGitGadget:
> From: Rohit Ashiwal <rohit.ashiwal265@gmail.com>
>
> As we already link to the zlib library, we can perform the compression
> without even requiring gzip on the host machine.
>
> Note: the `-n` flag that `git archive` passed to `gzip` wants to ensure
> that a reproducible file is written, i.e. no filename or mtime will be
> recorded in the compressed output. This is already the default for
> zlib's `gzopen()` function (if the file name or mtime should be
> recorded, the `deflateSetHeader()` function would have to be called
> instead).
>
> Note also that the `gzFile` datatype is defined as a pointer in
> `zlib.h`, i.e. we can rely on the fact that it can be `NULL`.
>
> At this point, this new mode is hidden behind the pseudo command
> `:zlib`: assign this magic string to the `archive.tgz.command` config
> setting to enable it.

Technically the patch emits the gzip format using the gz* functions.
Raw zlib output with deflate* would be slightly different.  So I'd
rather use "gzip" instead of "zlib" in the magic string.

And I'm not sure about the colon as the only magic marker.  Perhaps
throw in a "git " or "git-" instead or in addition?

> @@ -459,18 +464,40 @@ static int write_tar_filter_archive(const struct archiver *ar,
>  	filter.use_shell = 1;
>  	filter.in = -1;
>
> -	if (start_command(&filter) < 0)
> -		die_errno(_("unable to start '%s' filter"), argv[0]);
> -	close(1);
> -	if (dup2(filter.in, 1) < 0)
> -		die_errno(_("unable to redirect descriptor"));
> -	close(filter.in);
> +	if (!strcmp(":zlib", ar->data)) {
> +		struct strbuf mode = STRBUF_INIT;
> +
> +		strbuf_addstr(&mode, "wb");
> +
> +		if (args->compression_level >= 0 && args->compression_level <= 9)
> +			strbuf_addf(&mode, "%d", args->compression_level);

Using gzsetparams() to set the compression level numerically after gzdopen()
instead of baking it into the mode string feels cleaner.

> +
> +		gzip = gzdopen(fileno(stdout), mode.buf);
> +		if (!gzip)
> +			die(_("Could not gzdopen stdout"));
> +		strbuf_release(&mode);
> +	} else {
> +		if (start_command(&filter) < 0)
> +			die_errno(_("unable to start '%s' filter"), argv[0]);
> +		close(1);
> +		if (dup2(filter.in, 1) < 0)
> +			die_errno(_("unable to redirect descriptor"));
> +		close(filter.in);
> +	}
>
>  	r = write_tar_archive(ar, args);
>
> -	close(1);
> -	if (finish_command(&filter) != 0)
> -		die(_("'%s' filter reported error"), argv[0]);
> +	if (gzip) {
> +		int ret = gzclose(gzip);
> +		if (ret == Z_ERRNO)
> +			die_errno(_("gzclose failed"));
> +		else if (ret != Z_OK)
> +			die(_("gzclose failed (%d)"), ret);
> +	} else {
> +		close(1);
> +		if (finish_command(&filter) != 0)
> +			die(_("'%s' filter reported error"), argv[0]);
> +	}
>
>  	strbuf_release(&cmd);
>  	return r;
>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die()
  2019-05-01 18:09             ` Jeff King
@ 2019-05-02 20:29               ` René Scharfe
  2019-05-05  5:25               ` Junio C Hamano
  1 sibling, 0 replies; 74+ messages in thread
From: René Scharfe @ 2019-05-02 20:29 UTC (permalink / raw)
  To: Jeff King, Johannes Schindelin
  Cc: Junio C Hamano, Rohit Ashiwal via GitGitGadget, git,
	Rohit Ashiwal

Am 01.05.19 um 20:09 schrieb Jeff King:
> On Mon, Apr 29, 2019 at 05:32:50PM -0400, Johannes Schindelin wrote:
>
>>> Another is that I am not sure how your "fixed format" argument
>>> meshes with the "-b blocksize" parameter to affect the tar/pax
>>> output.  The format may be fixed, but it is parameterized.  If
>>> we ever need to grow the ability to take "-b", having the knowledge
>>> that our current code is limited to the fixed BLOCKSIZE in a single
>>> function (i.e. the caller of this function , not the callee) would
>>> be less error prone.
>>
>> This argument would hold a lot more water if the following lines were not
>> part of archive-tar.c:
>>
>> 	#define RECORDSIZE      (512)
>> 	#define BLOCKSIZE       (RECORDSIZE * 20)
>>
>> 	static char block[BLOCKSIZE];
>>
>> If you can tell me how the `-b` (run-time) parameter can affect the
>> (compile-time) `BLOCKSIZE` constant, maybe I can start to understand your
>> concern.
>
> FWIW, I agree with you here. These patches are not making anything worse
> (and may even make them better, since we'd probably need to swap out the
> BLOCKSIZE constant for a run-time "blocksize" variable in fewer places).

The block size is mostly relevant for writing tar archives to magnetic
tapes.  You can do that with git archive and a tape drive that supports
the blocking factor 20, which is the default for GNU tar and thus should
be quite common.  You may get higher performance with a higher blocking
factor, if supported.

But so far this didn't come up on the mailing list, and I'd be surprised
if people really wrote snapshots of git archives directly to tape.  So
I'm not too worried about this define ever becoming a user-settable
option.  Sealing the constant into a function a bit feels dirty, though.
Mixing code and data makes the code more brittle.

Another example of that is the hard-coded file descriptor in the same
function, by the way.  It's a lot of busywork to undo in order to gain
the ability to write to some other fd, for the questionable convenience
of not having to pass that parameter along the call chain.  My bad.

But anyway, I worry more about the fact that blocking is not needed when
gzip'ing; gzwrite can be fed pieces of any size, not just 20 KB chunks.
The tar writer just needs to round up the archive size to a multiple of
20 KB and pad with NUL bytes at the end, in order to produce the same
uncompressed output as non-compressing tar.

If we'd wanted to be tape-friendly, then we'd have to block the gzip'ed
output instead of the uncompressed tar file, but I'm not suggesting
doing that.

Note to self: I wonder if moving the blocking part out into an
asynchronous function could simplify the code.

René

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 2/4] archive-tar: mark RECORDSIZE/BLOCKSIZE as unsigned
       [not found]   ` <ac2b2488a1b42b3caf8a84594c48eca796748e59.1556321244.git.gitgitgadget@gmail.com>
@ 2019-05-02 20:30     ` René Scharfe
  2019-05-08 11:45       ` Johannes Schindelin
  0 siblings, 1 reply; 74+ messages in thread
From: René Scharfe @ 2019-05-02 20:30 UTC (permalink / raw)
  To: Johannes Schindelin via GitGitGadget, git
  Cc: Jeff King, brian m. carlson, Junio C Hamano, Johannes Schindelin

Am 27.04.19 um 01:27 schrieb Johannes Schindelin via GitGitGadget:
> From: Johannes Schindelin <johannes.schindelin@gmx.de>
>
> They really are unsigned, and we are using e.g. BLOCKSIZE as `size_t`
> parameter to pass to `write_or_die()`.

True, but the compiler converts that value correctly to size_t without
complaint already, doesn't it?  What am I missing?

>
> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> ---
>  archive-tar.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/archive-tar.c b/archive-tar.c
> index af9ea70733..be06c8b205 100644
> --- a/archive-tar.c
> +++ b/archive-tar.c
> @@ -9,7 +9,7 @@
>  #include "streaming.h"
>  #include "run-command.h"
>
> -#define RECORDSIZE	(512)
> +#define RECORDSIZE	(512u)
>  #define BLOCKSIZE	(RECORDSIZE * 20)
>
>  static char block[BLOCKSIZE];
>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 2/2] archive: avoid spawning `gzip`
  2019-05-02 20:20         ` Ævar Arnfjörð Bjarmason
@ 2019-05-03 20:49           ` Johannes Schindelin
  2019-05-03 20:52             ` Jeff King
  0 siblings, 1 reply; 74+ messages in thread
From: Johannes Schindelin @ 2019-05-03 20:49 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: brian m. carlson, Jeff King, Rohit Ashiwal via GitGitGadget, git,
	Junio C Hamano, Rohit Ashiwal

[-- Attachment #1: Type: text/plain, Size: 5284 bytes --]

Hi Ævar,

On Thu, 2 May 2019, Ævar Arnfjörð Bjarmason wrote:

> On Fri, Apr 26 2019, Johannes Schindelin wrote:
>
> > On Sat, 13 Apr 2019, brian m. carlson wrote:
> >
> >> On Fri, Apr 12, 2019 at 09:51:02PM -0400, Jeff King wrote:
> >> > I wondered how you were going to kick this in, since users can
> >> > define arbitrary filters. I think it's kind of neat to
> >> > automagically convert "gzip -cn" (which also happens to be the
> >> > default). But I think we should mention that in the Documentation,
> >> > in case somebody tries to use a custom version of gzip and wonders
> >> > why it isn't kicking in.
> >> >
> >> > Likewise, it might make sense in the tests to put a poison gzip in
> >> > the $PATH so that we can be sure we're using our internal code, and
> >> > not just calling out to gzip (on platforms that have it, of
> >> > course).
> >> >
> >> > The alternative is that we could use a special token like ":zlib"
> >> > or something to indicate that the internal implementation should be
> >> > used (and then tweak the baked-in default, too). That might be less
> >> > surprising for users, but most people would still get the benefit
> >> > since they'd be using the default config.
> >>
> >> I agree that a special value (or NULL, if that's possible) would be
> >> nicer here. That way, if someone does specify a custom gzip, we honor
> >> it, and it serves to document the code better. For example, if
> >> someone symlinked pigz to gzip and used "gzip -cn", then they might
> >> not get the parallelization benefits they expected.
> >
> > I went with `:zlib`. The `NULL` value would not really work, as there
> > is no way to specify that via `archive.tgz.command`.
> >
> > About the symlinked thing: I do not really want to care to support
> > such hacks.
>
> It's the standard way by which a lot of systems do this, e.g. on my
> Debian box:
>
>     $ find /{,s}bin /usr/{,s}bin -type l -exec file {} \;|grep /etc/alternatives|wc -l
>     108
>
> To write this E-Mail I'm invoking one such symlink :)

I am well aware of the way Debian-based systems handle alternatives, and I
myself also use something similar to write this E-Mail (but it is not a
symlink, it is a Git alias).

But that's not the hack that I was talking about.

The hack I meant was: if you symlink `gzip` to `pigz` in your `PATH` *and
then expect `git archive --format=tgz` to pick that up*.

As far as I am concerned, the fact that `git archive --format=tgz` spawns
`gzip` to perform the compression is an implementation detail, and not
something that users should feel they can rely on.

> > If you want a different compressor than the default (which can
> > change), you should specify it specifically.
>
> You might want to do so system-wide, or for each program at a time.
>
> I don't care about this for gzip myself, just pointing out it *is* a
> thing people use.

Sure.

> >> I'm fine overall with the idea of bringing the compression into the
> >> binary using zlib, provided that we preserve the "-n" behavior
> >> (producing reproducible archives).
> >
> > Thanks for voicing this concern. I had a look at zlib's source code,
> > and it looks like it requires an extra function call (that we don't
> > call) to make the resulting file non-reproducible. In other words, it
> > has the opposite default behavior from `gzip`.
>
> Just commenting on the overall thread: I like René's "new built-in"
> patch best.

I guess we now have to diverging votes: yours for the `git archive --gzip`
"built-in" and Peff's for the async code ;-)

> You mentioned "new command that we have to support for eternity". I
> think calling it "git gzip" is a bad idea. We'd make it "git
> archive--gzip" or "git archive--helper", and we could hide building it
> behind some compat flag.
>
> Then we'd carry no if/else internal/external code, and the portability
> issue that started this would be addressed, no?

Sure.

The async version would leave the door wide open for implementing pigz'
trick to multi-thread the compression, though.

> As a bonus we could also drop the "GZIP" prereq from the test suite
> entirely and just put that "gzip" in $PATH for the purposes of the
> tests.
>
> I spied on your yet-to-be-submitted patches and you could drop GZIP from
> the "git archive" tests, but we'd still need it in
> t/t5562-http-backend-content-length.sh, but not if we had a "gzip"
> compat helper.

We need it at least once for *decompressing* the `--format=tgz` output in
order to compare it to the `--format=tar` output. Besides, I think it is
really important to keep the test that verifies that the output is correct
(i.e. that gzip can decompress it).

> There's also a long-standing bug/misfeature in git-archive that I wonder
> about: When you combine --format with --remote you can only generate
> e.g. tar.gz if the remote is OK with it, if it says no you can't even if
> it supports "tar" and you could do the "gz" part locally. Would such a
> patch be harder with :zlib than if we always just spewed out to external
> "gzip" after satisfying some criteria?

I think it would be precisely the same: you'd still use the same "filter"
code path.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 2/2] archive: avoid spawning `gzip`
  2019-05-03 20:49           ` Johannes Schindelin
@ 2019-05-03 20:52             ` Jeff King
  0 siblings, 0 replies; 74+ messages in thread
From: Jeff King @ 2019-05-03 20:52 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Ævar Arnfjörð Bjarmason, brian m. carlson,
	Rohit Ashiwal via GitGitGadget, git, Junio C Hamano,
	Rohit Ashiwal

On Fri, May 03, 2019 at 10:49:17PM +0200, Johannes Schindelin wrote:

> I am well aware of the way Debian-based systems handle alternatives, and I
> myself also use something similar to write this E-Mail (but it is not a
> symlink, it is a Git alias).
> 
> But that's not the hack that I was talking about.
> 
> The hack I meant was: if you symlink `gzip` to `pigz` in your `PATH` *and
> then expect `git archive --format=tgz` to pick that up*.
> 
> As far as I am concerned, the fact that `git archive --format=tgz` spawns
> `gzip` to perform the compression is an implementation detail, and not
> something that users should feel they can rely on.

I'd agree with you more if we didn't document a user-facing config
variable that claims to run "gzip" from the system.

> > Just commenting on the overall thread: I like René's "new built-in"
> > patch best.
> 
> I guess we now have to diverging votes: yours for the `git archive --gzip`
> "built-in" and Peff's for the async code ;-)

For the record, I am fine with any of the solutions (including just
doing the single-thread bit you already have and letting René do what he
likes on top).

-Peff

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die()
  2019-05-01 18:09             ` Jeff King
  2019-05-02 20:29               ` René Scharfe
@ 2019-05-05  5:25               ` Junio C Hamano
  2019-05-06  5:07                 ` Jeff King
  1 sibling, 1 reply; 74+ messages in thread
From: Junio C Hamano @ 2019-05-05  5:25 UTC (permalink / raw)
  To: Jeff King
  Cc: Johannes Schindelin, Rohit Ashiwal via GitGitGadget, git,
	Rohit Ashiwal

Jeff King <peff@peff.net> writes:

> FWIW, I agree with you here. These patches are not making anything worse
> (and may even make them better, since we'd probably need to swap out the
> BLOCKSIZE constant for a run-time "blocksize" variable in fewer places).

It's just that leaving the interface uneven is an easy way to
introduce an unnecessary bug, e.g.

	-type function(args) {
	+type function(args, size_t blocksize) {
		decls;
	-	helper_one(BLOCKSIZE, other, args);
	+	helper_one(blocksize, other, args);
		helper_two(its, args);
	-	helper_three(BLOCKSIZE, even, more, args);
	+	helper_three(blocksize, even, more, args);
	 }

when this caller is away from the implementation of helper_two()
that hardcodes the assumption that this callchain only uses
BLOCKSIZE and in an implicit way.

And that can easily be avoided by defensively making helper_two() to
take BLOCKSIZE as an argument as everybody else in the caller does.

I do not actually care too deeply, though.  Hopefully whoever adds
"-b" would be careful enough to follow all callchain, and at least
look at all the callees that are file-scope static, and the one I
have trouble with _is_ a file-scope static.

Or maybe nobody does "-b", in which case this ticking time bomb will
not trigger, so we'd be OK.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die()
  2019-05-05  5:25               ` Junio C Hamano
@ 2019-05-06  5:07                 ` Jeff King
  0 siblings, 0 replies; 74+ messages in thread
From: Jeff King @ 2019-05-06  5:07 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Johannes Schindelin, Rohit Ashiwal via GitGitGadget, git,
	Rohit Ashiwal

On Sun, May 05, 2019 at 02:25:59PM +0900, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > FWIW, I agree with you here. These patches are not making anything worse
> > (and may even make them better, since we'd probably need to swap out the
> > BLOCKSIZE constant for a run-time "blocksize" variable in fewer places).
> 
> It's just that leaving the interface uneven is an easy way to
> introduce an unnecessary bug, e.g.
> 
> 	-type function(args) {
> 	+type function(args, size_t blocksize) {
> 		decls;
> 	-	helper_one(BLOCKSIZE, other, args);
> 	+	helper_one(blocksize, other, args);
> 		helper_two(its, args);
> 	-	helper_three(BLOCKSIZE, even, more, args);
> 	+	helper_three(blocksize, even, more, args);
> 	 }
> 
> when this caller is away from the implementation of helper_two()
> that hardcodes the assumption that this callchain only uses
> BLOCKSIZE and in an implicit way.
> 
> And that can easily be avoided by defensively making helper_two() to
> take BLOCKSIZE as an argument as everybody else in the caller does.
> 
> I do not actually care too deeply, though.  Hopefully whoever adds
> "-b" would be careful enough to follow all callchain, and at least
> look at all the callees that are file-scope static, and the one I
> have trouble with _is_ a file-scope static.

Right, my assumption was that the first step in the conversion would be
somebody doing s/BLOCKSIZE/global_blocksize_variable/. But that is just
a guess.

> Or maybe nobody does "-b", in which case this ticking time bomb will
> not trigger, so we'd be OK.

Yes. I suspect we're probably going down an unproductive tangent. :)

-Peff

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 2/4] archive-tar: mark RECORDSIZE/BLOCKSIZE as unsigned
  2019-05-02 20:30     ` [PATCH v2 2/4] archive-tar: mark RECORDSIZE/BLOCKSIZE as unsigned René Scharfe
@ 2019-05-08 11:45       ` Johannes Schindelin
  2019-05-08 23:04         ` Jeff King
  0 siblings, 1 reply; 74+ messages in thread
From: Johannes Schindelin @ 2019-05-08 11:45 UTC (permalink / raw)
  To: René Scharfe
  Cc: Johannes Schindelin via GitGitGadget, git, Jeff King,
	brian m. carlson, Junio C Hamano

[-- Attachment #1: Type: text/plain, Size: 1462 bytes --]

Hi René,

On Thu, 2 May 2019, René Scharfe wrote:

> Am 27.04.19 um 01:27 schrieb Johannes Schindelin via GitGitGadget:
> > From: Johannes Schindelin <johannes.schindelin@gmx.de>
> >
> > They really are unsigned, and we are using e.g. BLOCKSIZE as `size_t`
> > parameter to pass to `write_or_die()`.
>
> True, but the compiler converts that value correctly to size_t without
> complaint already, doesn't it?  What am I missing?

Are you talking about a specific compiler? It sure sounds as if you did.

I really do not want to fall into the "you can build Git with *any*
compiler, as long as that compiler happens to be GCC, oh, and as long it
is version X" trap.

We *already* rely on GCC's optimization in way too many places for my
liking, e.g. when we adapted the `hasheq()` code *specifically* to make
GCC's particular optimization strategies to kick in.

Or the way we defined the `SWAP()` macro: it depends on GCC's ability to
see through the veil and out-guess the code, deducing its intent rather
than what it *says* ("Do As I Want, Not As I Say", anyone?). We *do* want
to swap registers when possible (instead of forcing register variables to
be written to memory just for the sake of being swapped, as our code says
rather explicitly).

Essentially, we build a cruise ship of a dependency on GCC here. Which
should not make anybody happy (except maybe the GCC folks).

Let's not make things worse.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 2/4] archive-tar: mark RECORDSIZE/BLOCKSIZE as unsigned
  2019-05-08 11:45       ` Johannes Schindelin
@ 2019-05-08 23:04         ` Jeff King
  2019-05-09 14:06           ` Johannes Schindelin
  0 siblings, 1 reply; 74+ messages in thread
From: Jeff King @ 2019-05-08 23:04 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: René Scharfe, Johannes Schindelin via GitGitGadget, git,
	brian m. carlson, Junio C Hamano

On Wed, May 08, 2019 at 01:45:25PM +0200, Johannes Schindelin wrote:

> Hi René,
> 
> On Thu, 2 May 2019, René Scharfe wrote:
> 
> > Am 27.04.19 um 01:27 schrieb Johannes Schindelin via GitGitGadget:
> > > From: Johannes Schindelin <johannes.schindelin@gmx.de>
> > >
> > > They really are unsigned, and we are using e.g. BLOCKSIZE as `size_t`
> > > parameter to pass to `write_or_die()`.
> >
> > True, but the compiler converts that value correctly to size_t without
> > complaint already, doesn't it?  What am I missing?
> 
> Are you talking about a specific compiler? It sure sounds as if you did.
> 
> I really do not want to fall into the "you can build Git with *any*
> compiler, as long as that compiler happens to be GCC, oh, and as long it
> is version X" trap.

I don't this this has anything to do with gcc. The point is that we
already have this line:

  write_or_die(fd, buf, BLOCKSIZE);

which does not cast and nobody has complained, even though the signed
constant is implicitly converted to a size_t. So adding another line
like:

  gzwrite(gzip, block, BLOCKSIZE);

would in theory be treated the same (gzwrite takes an "unsigned").

The conversion from signed to unsigned is well defined in ANSI C, and
I'd expect a compiler to either complain about neither or both (and the
latter probably with warnings like -Wconversion cranked up).

But of course if you have data otherwise, we can revise that. Was the
cast added out of caution, or to squelch a compiler warning?

-Peff

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 2/4] archive-tar: mark RECORDSIZE/BLOCKSIZE as unsigned
  2019-05-08 23:04         ` Jeff King
@ 2019-05-09 14:06           ` Johannes Schindelin
  2019-05-09 18:38             ` Jeff King
  0 siblings, 1 reply; 74+ messages in thread
From: Johannes Schindelin @ 2019-05-09 14:06 UTC (permalink / raw)
  To: Jeff King
  Cc: René Scharfe, Johannes Schindelin via GitGitGadget, git,
	brian m. carlson, Junio C Hamano

[-- Attachment #1: Type: text/plain, Size: 1920 bytes --]

Hi Peff,

On Wed, 8 May 2019, Jeff King wrote:

> On Wed, May 08, 2019 at 01:45:25PM +0200, Johannes Schindelin wrote:
>
> > Hi René,
> >
> > On Thu, 2 May 2019, René Scharfe wrote:
> >
> > > Am 27.04.19 um 01:27 schrieb Johannes Schindelin via GitGitGadget:
> > > > From: Johannes Schindelin <johannes.schindelin@gmx.de>
> > > >
> > > > They really are unsigned, and we are using e.g. BLOCKSIZE as `size_t`
> > > > parameter to pass to `write_or_die()`.
> > >
> > > True, but the compiler converts that value correctly to size_t without
> > > complaint already, doesn't it?  What am I missing?
> >
> > Are you talking about a specific compiler? It sure sounds as if you did.
> >
> > I really do not want to fall into the "you can build Git with *any*
> > compiler, as long as that compiler happens to be GCC, oh, and as long it
> > is version X" trap.
>
> I don't this this has anything to do with gcc. The point is that we
> already have this line:
>
>   write_or_die(fd, buf, BLOCKSIZE);
>
> which does not cast and nobody has complained,

I mistook this part of your reply in
https://public-inbox.org/git/20190413013451.GB2040@sigill.intra.peff.net/
as precisely such a complaint:

	BLOCKSIZE is a constant. Should we be defining it with a "U" in
	the first place?

Thanks,
Dscho

> even though the signed
> constant is implicitly converted to a size_t. So adding another line
> like:
>
>   gzwrite(gzip, block, BLOCKSIZE);
>
> would in theory be treated the same (gzwrite takes an "unsigned").
>
> The conversion from signed to unsigned is well defined in ANSI C, and
> I'd expect a compiler to either complain about neither or both (and the
> latter probably with warnings like -Wconversion cranked up).
>
> But of course if you have data otherwise, we can revise that. Was the
> cast added out of caution, or to squelch a compiler warning?
>
> -Peff
>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 2/4] archive-tar: mark RECORDSIZE/BLOCKSIZE as unsigned
  2019-05-09 14:06           ` Johannes Schindelin
@ 2019-05-09 18:38             ` Jeff King
  2019-05-10 17:18               ` René Scharfe
  0 siblings, 1 reply; 74+ messages in thread
From: Jeff King @ 2019-05-09 18:38 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: René Scharfe, Johannes Schindelin via GitGitGadget, git,
	brian m. carlson, Junio C Hamano

On Thu, May 09, 2019 at 04:06:22PM +0200, Johannes Schindelin wrote:

> > I don't this this has anything to do with gcc. The point is that we
> > already have this line:
> >
> >   write_or_die(fd, buf, BLOCKSIZE);
> >
> > which does not cast and nobody has complained,
> 
> I mistook this part of your reply in
> https://public-inbox.org/git/20190413013451.GB2040@sigill.intra.peff.net/
> as precisely such a complaint:
> 
> 	BLOCKSIZE is a constant. Should we be defining it with a "U" in
> 	the first place?

Ah, sorry to introduce confusion. I mostly meant "if we need to cast,
why not just define as unsigned in the first place?". But I think René
was pointing out that we do not even need to cast, and I am fine with
that approach.

I do dream of a world where we do not have a bunch of implicit
conversions (both signedness but also truncation) in our code base, and
can compile cleanly with -Wconversion We know that this case is
perfectly fine, but I am sure there are many that are not. However, I'm
not sure if we'll ever get there, and in the meantime I don't think it's
worth worrying too much about individual cases like this.

-Peff

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 2/4] archive-tar: mark RECORDSIZE/BLOCKSIZE as unsigned
  2019-05-09 18:38             ` Jeff King
@ 2019-05-10 17:18               ` René Scharfe
  2019-05-10 21:20                 ` Jeff King
  0 siblings, 1 reply; 74+ messages in thread
From: René Scharfe @ 2019-05-10 17:18 UTC (permalink / raw)
  To: Jeff King, Johannes Schindelin
  Cc: Johannes Schindelin via GitGitGadget, git, brian m. carlson,
	Junio C Hamano

Am 09.05.19 um 20:38 schrieb Jeff King:
> I do dream of a world where we do not have a bunch of implicit
> conversions (both signedness but also truncation) in our code base, and
> can compile cleanly with -Wconversion We know that this case is
> perfectly fine, but I am sure there are many that are not. However, I'm
> not sure if we'll ever get there, and in the meantime I don't think it's
> worth worrying too much about individual cases like this.

Here's a rough take on how to silence that warning for archive-tar.c using
GCC 8.3.  Some of the changes are worth polishing and submitting.  Some
are silly.  The one for regexec_buf() is scary; I don't see a clean way of
dealing with that size_t to int conversion.

---
 archive-tar.c     | 54 +++++++++++++++++++++++++++++++----------------
 cache.h           | 10 ++++-----
 git-compat-util.h | 21 +++++++++++++++---
 hash.h            |  2 +-
 strbuf.h          |  2 +-
 5 files changed, 61 insertions(+), 28 deletions(-)

diff --git a/archive-tar.c b/archive-tar.c
index 3e53aac1e6..bfd91782ab 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -15,7 +15,7 @@
 static char block[BLOCKSIZE];
 static unsigned long offset;

-static int tar_umask = 002;
+static mode_t tar_umask = 002;

 static int write_tar_filter_archive(const struct archiver *ar,
 				    struct archiver_args *args);
@@ -99,7 +99,7 @@ static void write_blocked(const void *data, unsigned long size)
  */
 static void write_trailer(void)
 {
-	int tail = BLOCKSIZE - offset;
+	size_t tail = BLOCKSIZE - offset;
 	memset(block + offset, 0, tail);
 	write_or_die(1, block, BLOCKSIZE);
 	if (tail < 2 * RECORDSIZE) {
@@ -127,12 +127,13 @@ static int stream_blocked(const struct object_id *oid)
 		readlen = read_istream(st, buf, sizeof(buf));
 		if (readlen <= 0)
 			break;
-		do_write_blocked(buf, readlen);
+		do_write_blocked(buf, (size_t)readlen);
 	}
 	close_istream(st);
-	if (!readlen)
-		finish_record();
-	return readlen;
+	if (readlen < 0)
+		return -1;
+	finish_record();
+	return 0;
 }

 /*
@@ -142,9 +143,9 @@ static int stream_blocked(const struct object_id *oid)
  * string and appends it to a struct strbuf.
  */
 static void strbuf_append_ext_header(struct strbuf *sb, const char *keyword,
-				     const char *value, unsigned int valuelen)
+				     const char *value, size_t valuelen)
 {
-	int len, tmp;
+	size_t len, tmp;

 	/* "%u %s=%s\n" */
 	len = 1 + 1 + strlen(keyword) + 1 + valuelen + 1;
@@ -152,7 +153,7 @@ static void strbuf_append_ext_header(struct strbuf *sb, const char *keyword,
 		len++;

 	strbuf_grow(sb, len);
-	strbuf_addf(sb, "%u %s=", len, keyword);
+	strbuf_addf(sb, "%"PRIuMAX" %s=", (uintmax_t)len, keyword);
 	strbuf_add(sb, value, valuelen);
 	strbuf_addch(sb, '\n');
 }
@@ -168,7 +169,9 @@ static void strbuf_append_ext_header_uint(struct strbuf *sb,
 	int len;

 	len = xsnprintf(buf, sizeof(buf), "%"PRIuMAX, value);
-	strbuf_append_ext_header(sb, keyword, buf, len);
+	if (len < 0)
+		BUG("unable to convert %"PRIuMAX" to decimal", value);
+	strbuf_append_ext_header(sb, keyword, buf, (size_t)len);
 }

 static unsigned int ustar_header_chksum(const struct ustar_header *header)
@@ -177,7 +180,7 @@ static unsigned int ustar_header_chksum(const struct ustar_header *header)
 	unsigned int chksum = 0;
 	while (p < (const unsigned char *)header->chksum)
 		chksum += *p++;
-	chksum += sizeof(header->chksum) * ' ';
+	chksum += (unsigned int)sizeof(header->chksum) * ' ';
 	p += sizeof(header->chksum);
 	while (p < (const unsigned char *)header + sizeof(struct ustar_header))
 		chksum += *p++;
@@ -355,12 +358,14 @@ static void write_global_extended_header(struct archiver_args *args)
 }

 static struct archiver **tar_filters;
-static int nr_tar_filters;
-static int alloc_tar_filters;
+static size_t nr_tar_filters;
+static size_t alloc_tar_filters;

-static struct archiver *find_tar_filter(const char *name, int len)
+static struct archiver *find_tar_filter(const char *name, size_t len)
 {
 	int i;
+	if (len < 1)
+		return NULL;
 	for (i = 0; i < nr_tar_filters; i++) {
 		struct archiver *ar = tar_filters[i];
 		if (!strncmp(ar->name, name, len) && !ar->name[len])
@@ -369,14 +374,27 @@ static struct archiver *find_tar_filter(const char *name, int len)
 	return NULL;
 }

+static int parse_config_key2(const char *var, const char *section,
+			     const char **subsection, size_t *subsection_len,
+			     const char **key)
+{
+	int rc, len;
+
+	rc = parse_config_key(var, section, subsection, &len, key);
+	if (!rc && len < 0)
+		return -1;
+	*subsection_len = (size_t)len;
+	return rc;
+}
+
 static int tar_filter_config(const char *var, const char *value, void *data)
 {
 	struct archiver *ar;
 	const char *name;
 	const char *type;
-	int namelen;
+	size_t namelen;

-	if (parse_config_key(var, "tar", &name, &namelen, &type) < 0 || !name)
+	if (parse_config_key2(var, "tar", &name, &namelen, &type) < 0 || !name)
 		return 0;

 	ar = find_tar_filter(name, namelen);
@@ -400,7 +418,7 @@ static int tar_filter_config(const char *var, const char *value, void *data)
 		if (git_config_bool(var, value))
 			ar->flags |= ARCHIVER_REMOTE;
 		else
-			ar->flags &= ~ARCHIVER_REMOTE;
+			ar->flags &= ~(unsigned int)ARCHIVER_REMOTE;
 		return 0;
 	}

@@ -414,7 +432,7 @@ static int git_tar_config(const char *var, const char *value, void *cb)
 			tar_umask = umask(0);
 			umask(tar_umask);
 		} else {
-			tar_umask = git_config_int(var, value);
+			tar_umask = (mode_t)git_config_ulong(var, value);
 		}
 		return 0;
 	}
diff --git a/cache.h b/cache.h
index 67cc2e1806..a791034260 100644
--- a/cache.h
+++ b/cache.h
@@ -241,7 +241,7 @@ static inline void copy_cache_entry(struct cache_entry *dst,
 				    const struct cache_entry *src)
 {
 	unsigned int state = dst->ce_flags & CE_HASHED;
-	int mem_pool_allocated = dst->mem_pool_allocated;
+	unsigned int mem_pool_allocated = dst->mem_pool_allocated;

 	/* Don't copy hash chain and name */
 	memcpy(&dst->ce_stat_data, &src->ce_stat_data,
@@ -249,7 +249,7 @@ static inline void copy_cache_entry(struct cache_entry *dst,
 			offsetof(struct cache_entry, ce_stat_data));

 	/* Restore the hash state */
-	dst->ce_flags = (dst->ce_flags & ~CE_HASHED) | state;
+	dst->ce_flags = (dst->ce_flags & ~(unsigned int)CE_HASHED) | state;

 	/* Restore the mem_pool_allocated flag */
 	dst->mem_pool_allocated = mem_pool_allocated;
@@ -1314,7 +1314,7 @@ extern int check_and_freshen_file(const char *fn, int freshen);
 extern const signed char hexval_table[256];
 static inline unsigned int hexval(unsigned char c)
 {
-	return hexval_table[c];
+	return (unsigned int)hexval_table[c];
 }

 /*
@@ -1323,8 +1323,8 @@ static inline unsigned int hexval(unsigned char c)
  */
 static inline int hex2chr(const char *s)
 {
-	unsigned int val = hexval(s[0]);
-	return (val & ~0xf) ? val : (val << 4) | hexval(s[1]);
+	int val = hexval_table[(unsigned char)s[0]];
+	return (val < 0) ? val : (val << 4) | hexval_table[(unsigned char)s[1]];
 }

 /* Convert to/from hex/sha1 representation */
diff --git a/git-compat-util.h b/git-compat-util.h
index 4386b3e1c8..cf33e84c96 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -1068,7 +1068,7 @@ static inline int strtoul_ui(char const *s, int base, unsigned int *result)
 	ul = strtoul(s, &p, base);
 	if (errno || *p || p == s || (unsigned int) ul != ul)
 		return -1;
-	*result = ul;
+	*result = (unsigned int)ul;
 	return 0;
 }

@@ -1081,7 +1081,7 @@ static inline int strtol_i(char const *s, int base, int *result)
 	ul = strtol(s, &p, base);
 	if (errno || *p || p == s || (int) ul != ul)
 		return -1;
-	*result = ul;
+	*result = (int)ul;
 	return 0;
 }

@@ -1119,7 +1119,22 @@ static inline int regexec_buf(const regex_t *preg, const char *buf, size_t size,
 {
 	assert(nmatch > 0 && pmatch);
 	pmatch[0].rm_so = 0;
-	pmatch[0].rm_eo = size;
+	pmatch[0].rm_eo = (regoff_t)size;
+	if (pmatch[0].rm_eo != size) {
+		if (((regoff_t)-1) < 0) {
+			if (sizeof(regoff_t) == sizeof(int))
+				pmatch[0].rm_eo = (regoff_t)INT_MAX;
+			else if (sizeof(regoff_t) == sizeof(long))
+				pmatch[0].rm_eo = (regoff_t)LONG_MAX;
+			else
+				die("unable to determine maximum value of regoff_t");
+		} else {
+			pmatch[0].rm_eo = (regoff_t)-1;
+		}
+		warning("buffer too big (%"PRIuMAX"), "
+			"will search only the first %"PRIuMAX" bytes",
+			(uintmax_t)size, (uintmax_t)pmatch[0].rm_eo);
+	}
 	return regexec(preg, buf, nmatch, pmatch, eflags | REG_STARTEND);
 }

diff --git a/hash.h b/hash.h
index 661c9f2281..7056f89eb4 100644
--- a/hash.h
+++ b/hash.h
@@ -134,7 +134,7 @@ int hash_algo_by_id(uint32_t format_id);
 /* Identical, except based on the length. */
 int hash_algo_by_length(int len);
 /* Identical, except for a pointer to struct git_hash_algo. */
-static inline int hash_algo_by_ptr(const struct git_hash_algo *p)
+static inline ptrdiff_t hash_algo_by_ptr(const struct git_hash_algo *p)
 {
 	return p - hash_algos;
 }
diff --git a/strbuf.h b/strbuf.h
index c8d98dfb95..30659f2d5d 100644
--- a/strbuf.h
+++ b/strbuf.h
@@ -225,7 +225,7 @@ int strbuf_cmp(const struct strbuf *first, const struct strbuf *second);
 /**
  * Add a single character to the buffer.
  */
-static inline void strbuf_addch(struct strbuf *sb, int c)
+static inline void strbuf_addch(struct strbuf *sb, char c)
 {
 	if (!strbuf_avail(sb))
 		strbuf_grow(sb, 1);
--
2.21.0

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 2/4] archive-tar: mark RECORDSIZE/BLOCKSIZE as unsigned
  2019-05-10 17:18               ` René Scharfe
@ 2019-05-10 21:20                 ` Jeff King
  0 siblings, 0 replies; 74+ messages in thread
From: Jeff King @ 2019-05-10 21:20 UTC (permalink / raw)
  To: René Scharfe
  Cc: Johannes Schindelin, Johannes Schindelin via GitGitGadget, git,
	brian m. carlson, Junio C Hamano

On Fri, May 10, 2019 at 07:18:44PM +0200, René Scharfe wrote:

> Am 09.05.19 um 20:38 schrieb Jeff King:
> > I do dream of a world where we do not have a bunch of implicit
> > conversions (both signedness but also truncation) in our code base, and
> > can compile cleanly with -Wconversion We know that this case is
> > perfectly fine, but I am sure there are many that are not. However, I'm
> > not sure if we'll ever get there, and in the meantime I don't think it's
> > worth worrying too much about individual cases like this.
> 
> Here's a rough take on how to silence that warning for archive-tar.c using
> GCC 8.3.  Some of the changes are worth polishing and submitting.  Some
> are silly.  The one for regexec_buf() is scary; I don't see a clean way of
> dealing with that size_t to int conversion.

This is actually slightly less tedious than I had imagined it to be, but
still pretty bad. I dunno. If somebody wants to tackle it, I do think it
would make the world a better place. But I'm not sure if it is worth the
effort involved.

>  static void write_trailer(void)
>  {
> -	int tail = BLOCKSIZE - offset;
> +	size_t tail = BLOCKSIZE - offset;

These kinds of int/size_t conversions are the ones I think are the most
valuable (because the size_t's are often used to allocate or access
arrays, and truncated or negative values there can cause other security
problems). _Most_ of them are harmless, of course, but it's hard to
separate the important ones from the mundane.

> @@ -414,7 +432,7 @@ static int git_tar_config(const char *var, const char *value, void *cb)
>  			tar_umask = umask(0);
>  			umask(tar_umask);
>  		} else {
> -			tar_umask = git_config_int(var, value);
> +			tar_umask = (mode_t)git_config_ulong(var, value);
>  		}

It's nice that the cast here shuts up the compiler, and I agree it is
not likely to be a problem in this instance. But we'd probably want some
kind of "safe cast" helper. To some degree, if you put 2^64-1 in your
"umask" value you get what you deserve, but it would be nice if we could
detect such nonsense (less for this case, but more for others where we
do cast).

> @@ -1119,7 +1119,22 @@ static inline int regexec_buf(const regex_t *preg, const char *buf, size_t size,
>  {
>  	assert(nmatch > 0 && pmatch);
>  	pmatch[0].rm_so = 0;
> -	pmatch[0].rm_eo = size;
> +	pmatch[0].rm_eo = (regoff_t)size;
> +	if (pmatch[0].rm_eo != size) {
> +		if (((regoff_t)-1) < 0) {
> +			if (sizeof(regoff_t) == sizeof(int))
> +				pmatch[0].rm_eo = (regoff_t)INT_MAX;
> +			else if (sizeof(regoff_t) == sizeof(long))
> +				pmatch[0].rm_eo = (regoff_t)LONG_MAX;
> +			else
> +				die("unable to determine maximum value of regoff_t");
> +		} else {
> +			pmatch[0].rm_eo = (regoff_t)-1;
> +		}
> +		warning("buffer too big (%"PRIuMAX"), "
> +			"will search only the first %"PRIuMAX" bytes",
> +			(uintmax_t)size, (uintmax_t)pmatch[0].rm_eo);
> +	}
>  	return regexec(preg, buf, nmatch, pmatch, eflags | REG_STARTEND);
>  }

I think a helper could make things less awful here, too. Our xsize_t()
is sort of like this, but of course it dies. But I think it would be
possible to write a macro to let you do:

  if (ASSIGN_CAST(pmatch[0].rm_eo, size))
	warning(...);

This is definitely a rabbit-hole that I've been afraid to go down. :)

-Peff

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 2/2] archive: avoid spawning `gzip`
  2019-05-01 18:18                   ` Jeff King
@ 2019-06-10 10:44                     ` René Scharfe
  2019-06-13 19:16                       ` Jeff King
  0 siblings, 1 reply; 74+ messages in thread
From: René Scharfe @ 2019-06-10 10:44 UTC (permalink / raw)
  To: Jeff King
  Cc: Johannes Schindelin, Rohit Ashiwal via GitGitGadget, git,
	Junio C Hamano, Rohit Ashiwal

Am 01.05.19 um 20:18 schrieb Jeff King:
> On Wed, May 01, 2019 at 07:45:05PM +0200, René Scharfe wrote:
>
>>> But since the performance is still not quite on par with `gzip`, I would
>>> actually rather not, and really, just punt on that one, stating that
>>> people interested in higher performance should use `pigz`.
>>
>> Here are my performance numbers for generating .tar.gz files again:

OK, tried one more version, with pthreads (patch at the end).  Also
redid all measurements for better comparability; everything is faster
now for some reason (perhaps due to a compiler update? clang version
7.0.1-8 now):

master, using gzip(1):
Benchmark #1: git archive --format=tgz HEAD
  Time (mean ± σ):     15.697 s ±  0.246 s    [User: 19.213 s, System: 0.386 s]
  Range (min … max):   15.405 s … 16.103 s    10 runs

using zlib sequentially:
Benchmark #1: git archive --format=tgz HEAD
  Time (mean ± σ):     19.191 s ±  0.408 s    [User: 19.091 s, System: 0.100 s]
  Range (min … max):   18.802 s … 19.877 s    10 runs

using a gzip-lookalike:
Benchmark #1: git archive --format=tgz HEAD
  Time (mean ± σ):     16.289 s ±  0.218 s    [User: 19.485 s, System: 0.337 s]
  Range (min … max):   16.020 s … 16.555 s    10 runs

using zlib with start_async:
Benchmark #1: git archive --format=tgz HEAD
  Time (mean ± σ):     16.516 s ±  0.334 s    [User: 20.282 s, System: 0.383 s]
  Range (min … max):   16.166 s … 17.283 s    10 runs

using zlib in a separate thread (that's the new one):
Benchmark #1: git archive --format=tgz HEAD
  Time (mean ± σ):     16.310 s ±  0.237 s    [User: 20.075 s, System: 0.173 s]
  Range (min … max):   15.983 s … 16.790 s    10 runs

> I think the start_async() one seems like a good option. It reclaims most
> of the (wall-clock) performance, isn't very much code, and doesn't leave
> any ugly user-visible traces.

The pthreads numbers look a bit better still.  The patch is huge though,
because it duplicates almost everything.  It was easier that way; a real
patch series would extract functions that can be used both with static
and allocated headers first, and keep everything in archive-tar.c.

> I'd be fine to see it come later, though, on top of the patches Dscho is
> sending. Even though changing to sequential zlib is technically a change
> in behavior, the existing behavior wasn't really planned. And given the
> wall-clock versus CPU time tradeoff, it's not entirely clear that one
> solution is better than the other.

The current behavior is not an accident; the synchronous method was
rejected in 2009 because it was slower [1].  Redid the measurements
with v1.6.5-rc0 and the old patch [2], but they would only compile with
gcc (Debian 8.3.0-6) for me, so it's not directly comparable to the
numbers above:

v1.6.5-rc0:
Benchmark #1: ../git/git-archive HEAD | gzip
  Time (mean ± σ):     16.051 s ±  0.486 s    [User: 19.514 s, System: 0.341 s]
  Range (min … max):   15.416 s … 17.001 s    10 runs

v1.6.5-rc0 + [2]:
Benchmark #1: ../git/git-archive --format=tar.gz HEAD
  Time (mean ± σ):     19.684 s ±  0.374 s    [User: 19.601 s, System: 0.060 s]
  Range (min … max):   19.082 s … 20.177 s    10 runs

User time is still slightly higher, but the difference is in the noise.

[1] http://public-inbox.org/git/4AAAC8CE.8020302@lsrfire.ath.cx/
[2] http://public-inbox.org/git/4AA97B61.6030301@lsrfire.ath.cx/

>>> And who knows, maybe nobody will complain at all about the performance?
>>
>> Probably.  And popular tarballs would be cached anyway, I guess.
>
> At GitHub we certainly do cache the git-archive output. We'd also be
> just fine with the sequential solution. We generally turn down
> pack.threads to 1, and keep our CPUs busy by serving multiple users
> anyway.
>
> So whatever has the lowest overall CPU time is generally preferable, but
> the times are close enough that I don't think we'd care much either way
> (and it's probably not worth having a config option or similar).

Moving back to 2009 and reducing the number of utilized cores both feels
weird, but the sequential solution *is* the most obvious, easiest and
(by a narrow margin) lightest one if gzip(1) is not an option anymore.

Anyway, the threading patch:

---
 Makefile      |   1 +
 archive-tar.c |  11 +-
 archive-tgz.c | 452 ++++++++++++++++++++++++++++++++++++++++++++++++++
 archive.h     |   4 +
 4 files changed, 465 insertions(+), 3 deletions(-)
 create mode 100644 archive-tgz.c

diff --git a/Makefile b/Makefile
index 8a7e235352..ed649ac18d 100644
--- a/Makefile
+++ b/Makefile
@@ -834,6 +834,7 @@ LIB_OBJS += alloc.o
 LIB_OBJS += apply.o
 LIB_OBJS += archive.o
 LIB_OBJS += archive-tar.o
+LIB_OBJS += archive-tgz.o
 LIB_OBJS += archive-zip.o
 LIB_OBJS += argv-array.o
 LIB_OBJS += attr.o
diff --git a/archive-tar.c b/archive-tar.c
index 3e53aac1e6..929eb58235 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -15,7 +15,9 @@
 static char block[BLOCKSIZE];
 static unsigned long offset;

-static int tar_umask = 002;
+int tar_umask = 002;
+
+static const char internal_gzip[] = "git archive gzip";

 static int write_tar_filter_archive(const struct archiver *ar,
 				    struct archiver_args *args);
@@ -445,6 +447,9 @@ static int write_tar_filter_archive(const struct archiver *ar,
 	if (!ar->data)
 		BUG("tar-filter archiver called with no filter defined");

+	if (!strcmp(ar->data, internal_gzip))
+		return write_tgz_archive(ar, args);
+
 	strbuf_addstr(&cmd, ar->data);
 	if (args->compression_level >= 0)
 		strbuf_addf(&cmd, " -%d", args->compression_level);
@@ -483,9 +488,9 @@ void init_tar_archiver(void)
 	int i;
 	register_archiver(&tar_archiver);

-	tar_filter_config("tar.tgz.command", "gzip -cn", NULL);
+	tar_filter_config("tar.tgz.command", internal_gzip, NULL);
 	tar_filter_config("tar.tgz.remote", "true", NULL);
-	tar_filter_config("tar.tar.gz.command", "gzip -cn", NULL);
+	tar_filter_config("tar.tar.gz.command", internal_gzip, NULL);
 	tar_filter_config("tar.tar.gz.remote", "true", NULL);
 	git_config(git_tar_config, NULL);
 	for (i = 0; i < nr_tar_filters; i++) {
diff --git a/archive-tgz.c b/archive-tgz.c
new file mode 100644
index 0000000000..ae219e1cc0
--- /dev/null
+++ b/archive-tgz.c
@@ -0,0 +1,452 @@
+#include "cache.h"
+#include "config.h"
+#include "tar.h"
+#include "archive.h"
+#include "object-store.h"
+#include "streaming.h"
+
+#define RECORDSIZE	(512)
+#define BLOCKSIZE	(RECORDSIZE * 20)
+
+static gzFile gzip;
+static size_t offset;
+
+/*
+ * This is the max value that a ustar size header can specify, as it is fixed
+ * at 11 octal digits. POSIX specifies that we switch to extended headers at
+ * this size.
+ *
+ * Likewise for the mtime (which happens to use a buffer of the same size).
+ */
+#if ULONG_MAX == 0xFFFFFFFF
+#define USTAR_MAX_SIZE ULONG_MAX
+#else
+#define USTAR_MAX_SIZE 077777777777UL
+#endif
+#if TIME_MAX == 0xFFFFFFFF
+#define USTAR_MAX_MTIME TIME_MAX
+#else
+#define USTAR_MAX_MTIME 077777777777ULL
+#endif
+
+static void tgz_write(const void *data, size_t size)
+{
+	const char *p = data;
+	while (size) {
+		size_t to_write = size;
+		if (to_write > UINT_MAX)
+			to_write = UINT_MAX;
+		if (gzwrite(gzip, p, to_write) != to_write)
+			die(_("gzwrite failed"));
+		p += to_write;
+		size -= to_write;
+		offset = (offset + to_write) % BLOCKSIZE;
+	}
+}
+
+static void tgz_finish_record(void)
+{
+	size_t tail = offset % RECORDSIZE;
+	if (tail) {
+		size_t to_seek = RECORDSIZE - tail;
+		if (gzseek(gzip, to_seek, SEEK_CUR) < 0)
+			die(_("gzseek failed"));
+		offset = (offset + to_seek) % BLOCKSIZE;
+	}
+}
+
+static void tgz_write_trailer(void)
+{
+	size_t to_seek = BLOCKSIZE - offset;
+	if (to_seek < 2 * RECORDSIZE)
+		to_seek += BLOCKSIZE;
+	if (gzseek(gzip, to_seek, SEEK_CUR) < 0)
+		 die(_("gzseek failed"));
+	if (gzflush(gzip, Z_FINISH) != Z_OK)
+		die(_("gzflush failed"));
+}
+
+struct work_item {
+	void *buffer;
+	size_t size;
+	int finish_record;
+};
+
+#define TODO_SIZE 64
+struct work_item todo[TODO_SIZE];
+static int todo_start;
+static int todo_end;
+static int todo_done;
+static int all_work_added;
+static pthread_mutex_t tar_mutex;
+static pthread_t thread;
+
+static void tar_lock(void)
+{
+	pthread_mutex_lock(&tar_mutex);
+}
+
+static void tar_unlock(void)
+{
+	pthread_mutex_unlock(&tar_mutex);
+}
+
+static pthread_cond_t cond_add;
+static pthread_cond_t cond_write;
+static pthread_cond_t cond_result;
+
+static void add_work(void *buffer, size_t size, int finish_record)
+{
+	tar_lock();
+
+	while ((todo_end + 1) % ARRAY_SIZE(todo) == todo_done)
+		pthread_cond_wait(&cond_write, &tar_mutex);
+
+	todo[todo_end].buffer = buffer;
+	todo[todo_end].size = size;
+	todo[todo_end].finish_record = finish_record;
+
+	todo_end = (todo_end + 1) % ARRAY_SIZE(todo);
+
+	pthread_cond_signal(&cond_add);
+	tar_unlock();
+}
+
+static struct work_item *get_work(void)
+{
+	struct work_item *ret = NULL;
+
+	tar_lock();
+	while (todo_start == todo_end && !all_work_added)
+		pthread_cond_wait(&cond_add, &tar_mutex);
+
+	if (todo_start != todo_end || !all_work_added) {
+		ret = &todo[todo_start];
+		todo_start = (todo_start + 1) % ARRAY_SIZE(todo);
+	}
+	tar_unlock();
+	return ret;
+}
+
+static void work_done(void)
+{
+	tar_lock();
+	todo_done = (todo_done + 1) % ARRAY_SIZE(todo);
+	pthread_cond_signal(&cond_write);
+
+	if (all_work_added && todo_done == todo_end)
+		pthread_cond_signal(&cond_result);
+	tar_unlock();
+}
+
+static void *run(void *arg)
+{
+	for (;;) {
+		struct work_item *w = get_work();
+		if (!w)
+			break;
+		tgz_write(w->buffer, w->size);
+		free(w->buffer);
+		if (w->finish_record)
+			tgz_finish_record();
+		work_done();
+	}
+	return NULL;
+}
+
+static void start_output_thread(void)
+{
+	int err;
+
+	pthread_mutex_init(&tar_mutex, NULL);
+	pthread_cond_init(&cond_add, NULL);
+	pthread_cond_init(&cond_write, NULL);
+	pthread_cond_init(&cond_result, NULL);
+
+	memset(todo, 0, sizeof(todo));
+
+	err = pthread_create(&thread, NULL, run, NULL);
+	if (err)
+		die(_("failed to create thread: %s"), strerror(err));
+}
+
+static void wait_for_output_thread(void)
+{
+	tar_lock();
+	all_work_added = 1;
+
+	while (todo_done != todo_end)
+		pthread_cond_wait(&cond_result, &tar_mutex);
+
+	pthread_cond_broadcast(&cond_add);
+	tar_unlock();
+
+	pthread_join(thread, NULL);
+
+	pthread_mutex_destroy(&tar_mutex);
+	pthread_cond_destroy(&cond_add);
+	pthread_cond_destroy(&cond_write);
+	pthread_cond_destroy(&cond_result);
+}
+
+static int stream_blob(const struct object_id *oid)
+{
+	struct git_istream *st;
+	enum object_type type;
+	unsigned long sz;
+	ssize_t readlen;
+	size_t chunk_size = BLOCKSIZE * 10;
+
+	st = open_istream(oid, &type, &sz, NULL);
+	if (!st)
+		return error(_("cannot stream blob %s"), oid_to_hex(oid));
+	for (;;) {
+		char *buf = xmalloc(chunk_size);
+		readlen = read_istream(st, buf, chunk_size);
+		if (readlen <= 0)
+			break;
+		sz -= readlen;
+		add_work(buf, readlen, !sz);
+	}
+	close_istream(st);
+	return readlen;
+}
+
+/*
+ * pax extended header records have the format "%u %s=%s\n".  %u contains
+ * the size of the whole string (including the %u), the first %s is the
+ * keyword, the second one is the value.  This function constructs such a
+ * string and appends it to a struct strbuf.
+ */
+static void strbuf_append_ext_header(struct strbuf *sb, const char *keyword,
+				     const char *value, unsigned int valuelen)
+{
+	int len, tmp;
+
+	/* "%u %s=%s\n" */
+	len = 1 + 1 + strlen(keyword) + 1 + valuelen + 1;
+	for (tmp = len; tmp > 9; tmp /= 10)
+		len++;
+
+	strbuf_grow(sb, len);
+	strbuf_addf(sb, "%u %s=", len, keyword);
+	strbuf_add(sb, value, valuelen);
+	strbuf_addch(sb, '\n');
+}
+
+/*
+ * Like strbuf_append_ext_header, but for numeric values.
+ */
+static void strbuf_append_ext_header_uint(struct strbuf *sb,
+					  const char *keyword,
+					  uintmax_t value)
+{
+	char buf[40]; /* big enough for 2^128 in decimal, plus NUL */
+	int len;
+
+	len = xsnprintf(buf, sizeof(buf), "%"PRIuMAX, value);
+	strbuf_append_ext_header(sb, keyword, buf, len);
+}
+
+static unsigned int ustar_header_chksum(const struct ustar_header *header)
+{
+	const unsigned char *p = (const unsigned char *)header;
+	unsigned int chksum = 0;
+	while (p < (const unsigned char *)header->chksum)
+		chksum += *p++;
+	chksum += sizeof(header->chksum) * ' ';
+	p += sizeof(header->chksum);
+	while (p < (const unsigned char *)header + sizeof(struct ustar_header))
+		chksum += *p++;
+	return chksum;
+}
+
+static size_t get_path_prefix(const char *path, size_t pathlen, size_t maxlen)
+{
+	size_t i = pathlen;
+	if (i > 1 && path[i - 1] == '/')
+		i--;
+	if (i > maxlen)
+		i = maxlen;
+	do {
+		i--;
+	} while (i > 0 && path[i] != '/');
+	return i;
+}
+
+static void prepare_header(struct archiver_args *args,
+			   struct ustar_header *header,
+			   unsigned int mode, unsigned long size)
+{
+	xsnprintf(header->mode, sizeof(header->mode), "%07o", mode & 07777);
+	xsnprintf(header->size, sizeof(header->size), "%011"PRIoMAX , S_ISREG(mode) ? (uintmax_t)size : (uintmax_t)0);
+	xsnprintf(header->mtime, sizeof(header->mtime), "%011lo", (unsigned long) args->time);
+
+	xsnprintf(header->uid, sizeof(header->uid), "%07o", 0);
+	xsnprintf(header->gid, sizeof(header->gid), "%07o", 0);
+	strlcpy(header->uname, "root", sizeof(header->uname));
+	strlcpy(header->gname, "root", sizeof(header->gname));
+	xsnprintf(header->devmajor, sizeof(header->devmajor), "%07o", 0);
+	xsnprintf(header->devminor, sizeof(header->devminor), "%07o", 0);
+
+	memcpy(header->magic, "ustar", 6);
+	memcpy(header->version, "00", 2);
+
+	xsnprintf(header->chksum, sizeof(header->chksum), "%07o", ustar_header_chksum(header));
+}
+
+static void write_extended_header(struct archiver_args *args,
+				  const struct object_id *oid,
+				  struct strbuf *extended_header)
+{
+	size_t size;
+	char *buffer = strbuf_detach(extended_header, &size);
+	struct ustar_header *header = xcalloc(1, sizeof(*header));
+	unsigned int mode;
+	*header->typeflag = TYPEFLAG_EXT_HEADER;
+	mode = 0100666;
+	xsnprintf(header->name, sizeof(header->name), "%s.paxheader",
+		  oid_to_hex(oid));
+	prepare_header(args, header, mode, size);
+	add_work(header, sizeof(*header), 1);
+	add_work(buffer, size, 1);
+}
+
+static int write_tar_entry(struct archiver_args *args,
+			   const struct object_id *oid,
+			   const char *path, size_t pathlen,
+			   unsigned int mode)
+{
+	struct ustar_header *header = xcalloc(1, sizeof(*header));
+	struct strbuf ext_header = STRBUF_INIT;
+	unsigned int old_mode = mode;
+	unsigned long size, size_in_header;
+	void *buffer;
+	int err = 0;
+
+	if (S_ISDIR(mode) || S_ISGITLINK(mode)) {
+		*header->typeflag = TYPEFLAG_DIR;
+		mode = (mode | 0777) & ~tar_umask;
+	} else if (S_ISLNK(mode)) {
+		*header->typeflag = TYPEFLAG_LNK;
+		mode |= 0777;
+	} else if (S_ISREG(mode)) {
+		*header->typeflag = TYPEFLAG_REG;
+		mode = (mode | ((mode & 0100) ? 0777 : 0666)) & ~tar_umask;
+	} else {
+		return error(_("unsupported file mode: 0%o (SHA1: %s)"),
+			     mode, oid_to_hex(oid));
+	}
+	if (pathlen > sizeof(header->name)) {
+		size_t plen = get_path_prefix(path, pathlen,
+					      sizeof(header->prefix));
+		size_t rest = pathlen - plen - 1;
+		if (plen > 0 && rest <= sizeof(header->name)) {
+			memcpy(header->prefix, path, plen);
+			memcpy(header->name, path + plen + 1, rest);
+		} else {
+			xsnprintf(header->name, sizeof(header->name), "%s.data",
+				  oid_to_hex(oid));
+			strbuf_append_ext_header(&ext_header, "path",
+						 path, pathlen);
+		}
+	} else
+		memcpy(header->name, path, pathlen);
+
+	if (S_ISREG(mode) && !args->convert &&
+	    oid_object_info(args->repo, oid, &size) == OBJ_BLOB &&
+	    size > big_file_threshold)
+		buffer = NULL;
+	else if (S_ISLNK(mode) || S_ISREG(mode)) {
+		enum object_type type;
+		buffer = object_file_to_archive(args, path, oid, old_mode, &type, &size);
+		if (!buffer)
+			return error(_("cannot read %s"), oid_to_hex(oid));
+	} else {
+		buffer = NULL;
+		size = 0;
+	}
+
+	if (S_ISLNK(mode)) {
+		if (size > sizeof(header->linkname)) {
+			xsnprintf(header->linkname, sizeof(header->linkname),
+				  "see %s.paxheader", oid_to_hex(oid));
+			strbuf_append_ext_header(&ext_header, "linkpath",
+						 buffer, size);
+		} else
+			memcpy(header->linkname, buffer, size);
+	}
+
+	size_in_header = size;
+	if (S_ISREG(mode) && size > USTAR_MAX_SIZE) {
+		size_in_header = 0;
+		strbuf_append_ext_header_uint(&ext_header, "size", size);
+	}
+
+	prepare_header(args, header, mode, size_in_header);
+
+	if (ext_header.len > 0) {
+		write_extended_header(args, oid, &ext_header);
+	}
+	add_work(header, sizeof(*header), 1);
+	if (S_ISREG(mode) && size > 0) {
+		if (buffer)
+			add_work(buffer, size, 1);
+		else
+			err = stream_blob(oid);
+	} else
+		free(buffer);
+	return err;
+}
+
+static void write_global_extended_header(struct archiver_args *args)
+{
+	const struct object_id *oid = args->commit_oid;
+	struct strbuf ext_header = STRBUF_INIT;
+	struct ustar_header *header = xcalloc(1, sizeof(*header));
+	unsigned int mode;
+	size_t size;
+	char *buffer;
+
+	if (oid)
+		strbuf_append_ext_header(&ext_header, "comment",
+					 oid_to_hex(oid),
+					 the_hash_algo->hexsz);
+	if (args->time > USTAR_MAX_MTIME) {
+		strbuf_append_ext_header_uint(&ext_header, "mtime",
+					      args->time);
+		args->time = USTAR_MAX_MTIME;
+	}
+
+	if (!ext_header.len)
+		return;
+
+	buffer = strbuf_detach(&ext_header, &size);
+	*header->typeflag = TYPEFLAG_GLOBAL_HEADER;
+	mode = 0100666;
+	xsnprintf(header->name, sizeof(header->name), "pax_global_header");
+	prepare_header(args, header, mode, size);
+	add_work(header, sizeof(*header), 1);
+	add_work(buffer, size, 1);
+}
+
+int write_tgz_archive(const struct archiver *ar, struct archiver_args *args)
+{
+	int level = args->compression_level;
+	int err = 0;
+
+	gzip = gzdopen(1, "wb");
+	if (!gzip)
+		return error(_("gzdopen failed"));
+	if (gzsetparams(gzip, level, Z_DEFAULT_STRATEGY) != Z_OK)
+		return error(_("unable to set compression level %d"), level);
+
+	start_output_thread();
+	write_global_extended_header(args);
+	err = write_archive_entries(args, write_tar_entry);
+	if (err)
+		return err;
+	wait_for_output_thread();
+	tgz_write_trailer();
+	return err;
+}
diff --git a/archive.h b/archive.h
index e60e3dd31c..b00afa1a9f 100644
--- a/archive.h
+++ b/archive.h
@@ -45,6 +45,10 @@ void init_tar_archiver(void);
 void init_zip_archiver(void);
 void init_archivers(void);

+int tar_umask;
+
+int write_tgz_archive(const struct archiver *ar, struct archiver_args *args);
+
 typedef int (*write_archive_entry_fn_t)(struct archiver_args *args,
 					const struct object_id *oid,
 					const char *path, size_t pathlen,
--
2.22.0

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: [PATCH 2/2] archive: avoid spawning `gzip`
  2019-06-10 10:44                     ` René Scharfe
@ 2019-06-13 19:16                       ` Jeff King
  0 siblings, 0 replies; 74+ messages in thread
From: Jeff King @ 2019-06-13 19:16 UTC (permalink / raw)
  To: René Scharfe
  Cc: Johannes Schindelin, Rohit Ashiwal via GitGitGadget, git,
	Junio C Hamano, Rohit Ashiwal

On Mon, Jun 10, 2019 at 12:44:54PM +0200, René Scharfe wrote:

> Am 01.05.19 um 20:18 schrieb Jeff King:
> > On Wed, May 01, 2019 at 07:45:05PM +0200, René Scharfe wrote:
> >
> >>> But since the performance is still not quite on par with `gzip`, I would
> >>> actually rather not, and really, just punt on that one, stating that
> >>> people interested in higher performance should use `pigz`.
> >>
> >> Here are my performance numbers for generating .tar.gz files again:
> 
> OK, tried one more version, with pthreads (patch at the end).  Also
> redid all measurements for better comparability; everything is faster
> now for some reason (perhaps due to a compiler update? clang version
> 7.0.1-8 now):

Hmm. Interesting that using pthreads is still slower than just shelling
out to gzip:

> master, using gzip(1):
> Benchmark #1: git archive --format=tgz HEAD
>   Time (mean ± σ):     15.697 s ±  0.246 s    [User: 19.213 s, System: 0.386 s]
>   Range (min … max):   15.405 s … 16.103 s    10 runs
> [...]
> using zlib in a separate thread (that's the new one):
> Benchmark #1: git archive --format=tgz HEAD
>   Time (mean ± σ):     16.310 s ±  0.237 s    [User: 20.075 s, System: 0.173 s]
>   Range (min … max):   15.983 s … 16.790 s    10 runs

I wonder if zlib is just slower. Or if the cost of context switching
is somehow higher than just dumping big chunks over a pipe. In
particular, our gzip-alike is still faster than pthreads:

> using a gzip-lookalike:
> Benchmark #1: git archive --format=tgz HEAD
>   Time (mean ± σ):     16.289 s ±  0.218 s    [User: 19.485 s, System: 0.337 s]
>   Range (min … max):   16.020 s … 16.555 s    10 runs

though it looks like the timings do overlap.

> > At GitHub we certainly do cache the git-archive output. We'd also be
> > just fine with the sequential solution. We generally turn down
> > pack.threads to 1, and keep our CPUs busy by serving multiple users
> > anyway.
> >
> > So whatever has the lowest overall CPU time is generally preferable, but
> > the times are close enough that I don't think we'd care much either way
> > (and it's probably not worth having a config option or similar).
> 
> Moving back to 2009 and reducing the number of utilized cores both feels
> weird, but the sequential solution *is* the most obvious, easiest and
> (by a narrow margin) lightest one if gzip(1) is not an option anymore.

It sounds like we resolved to give the "internal gzip" its own name
(whether it's a gzip-alike command, or a special name we recognize to
trigger the internal code). So maybe we could continue to default to
"gzip -cn", but platforms could do otherwise when shipping gzip there is
a pain (i.e. Windows, but maybe also anybody else who wants to set
NO_EXTERNAL_GZIP or detect it from autoconf).

-Peff

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH v3 0/5] Avoid spawning gzip in git archive
  2019-04-12 23:04 [PATCH 0/2] Avoid spawning gzip in git archive Johannes Schindelin via GitGitGadget
                   ` (2 preceding siblings ...)
       [not found] ` <pull.145.v2.git.gitgitgadget@gmail.com>
@ 2022-06-12  6:00 ` René Scharfe
  2022-06-12  6:03   ` [PATCH v3 1/5] archive: rename archiver data field to filter_command René Scharfe
                     ` (5 more replies)
  2022-06-15 16:53 ` [PATCH v4 0/6] " René Scharfe
  4 siblings, 6 replies; 74+ messages in thread
From: René Scharfe @ 2022-06-12  6:00 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Johannes Schindelin, Rohit Ashiwal,
	Ævar Arnfjörð Bjarmason, Jeff King,
	brian m . carlson

It's been a while, let's try again.

Changes:
- Use our own zlib helpers instead of the gz* functions of zlib,
- ... which allows us to set the OS_CODE header consistently.
- Pseudo-command "git archive gzip" to select the internal
  implementation in config.
- Use a function pointer to plug in the internal gzip.
- Tests.
- Discuss performance in commit message.

  archive: rename archiver data field to filter_command
  archive-tar: factor out write_block()
  archive-tar: add internal gzip implementation
  archive-tar: use OS_CODE 3 (Unix) for internal gzip
  archive-tar: use internal gzip by default

 Documentation/git-archive.txt |  3 +-
 archive-tar.c                 | 79 ++++++++++++++++++++++++++++++-----
 archive.h                     |  2 +-
 t/t5000-tar-tree.sh           | 28 ++++++++++---
 4 files changed, 93 insertions(+), 19 deletions(-)

--
2.36.1

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH v3 1/5] archive: rename archiver data field to filter_command
  2022-06-12  6:00 ` [PATCH v3 0/5] Avoid spawning gzip in git archive René Scharfe
@ 2022-06-12  6:03   ` René Scharfe
  2022-06-12  6:05   ` [PATCH v3 2/5] archive-tar: factor out write_block() René Scharfe
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 74+ messages in thread
From: René Scharfe @ 2022-06-12  6:03 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Johannes Schindelin, Rohit Ashiwal,
	Ævar Arnfjörð Bjarmason, Jeff King,
	brian m . carlson

The void pointer "data" in struct archiver is only used to store filter
commands to pass tar archives to, like gzip.  Rename it accordingly and
also turn it into a char pointer to document the fact that it's a string
reference.

Signed-off-by: René Scharfe <l.s.r@web.de>
---
 archive-tar.c | 10 +++++-----
 archive.h     |  2 +-
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/archive-tar.c b/archive-tar.c
index 042feb66d2..2717e34a1d 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -383,8 +383,8 @@ static int tar_filter_config(const char *var, const char *value, void *data)
 	if (!strcmp(type, "command")) {
 		if (!value)
 			return config_error_nonbool(var);
-		free(ar->data);
-		ar->data = xstrdup(value);
+		free(ar->filter_command);
+		ar->filter_command = xstrdup(value);
 		return 0;
 	}
 	if (!strcmp(type, "remote")) {
@@ -432,10 +432,10 @@ static int write_tar_filter_archive(const struct archiver *ar,
 	struct child_process filter = CHILD_PROCESS_INIT;
 	int r;

-	if (!ar->data)
+	if (!ar->filter_command)
 		BUG("tar-filter archiver called with no filter defined");

-	strbuf_addstr(&cmd, ar->data);
+	strbuf_addstr(&cmd, ar->filter_command);
 	if (args->compression_level >= 0)
 		strbuf_addf(&cmd, " -%d", args->compression_level);

@@ -478,7 +478,7 @@ void init_tar_archiver(void)
 	git_config(git_tar_config, NULL);
 	for (i = 0; i < nr_tar_filters; i++) {
 		/* omit any filters that never had a command configured */
-		if (tar_filters[i]->data)
+		if (tar_filters[i]->filter_command)
 			register_archiver(tar_filters[i]);
 	}
 }
diff --git a/archive.h b/archive.h
index 49fab71aaf..08bed3ed3a 100644
--- a/archive.h
+++ b/archive.h
@@ -43,7 +43,7 @@ struct archiver {
 	const char *name;
 	int (*write_archive)(const struct archiver *, struct archiver_args *);
 	unsigned flags;
-	void *data;
+	char *filter_command;
 };
 void register_archiver(struct archiver *);

--
2.36.1

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH v3 2/5] archive-tar: factor out write_block()
  2022-06-12  6:00 ` [PATCH v3 0/5] Avoid spawning gzip in git archive René Scharfe
  2022-06-12  6:03   ` [PATCH v3 1/5] archive: rename archiver data field to filter_command René Scharfe
@ 2022-06-12  6:05   ` René Scharfe
  2022-06-12  6:08   ` [PATCH v3 3/5] archive-tar: add internal gzip implementation René Scharfe
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 74+ messages in thread
From: René Scharfe @ 2022-06-12  6:05 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Johannes Schindelin, Rohit Ashiwal,
	Ævar Arnfjörð Bjarmason, Jeff King,
	brian m . carlson

All tar archive writes have the same size and are done to the same file
descriptor.  Move them to a common function, write_block(), to reduce
code duplication and make it easy to change the destination.

Original-patch-by: Rohit Ashiwal <rohit.ashiwal265@gmail.com>
Signed-off-by: René Scharfe <l.s.r@web.de>
---
 archive-tar.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/archive-tar.c b/archive-tar.c
index 2717e34a1d..4e6a3deb80 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -38,11 +38,16 @@ static int write_tar_filter_archive(const struct archiver *ar,
 #define USTAR_MAX_MTIME 077777777777ULL
 #endif

+static void write_block(const void *buf)
+{
+	write_or_die(1, buf, BLOCKSIZE);
+}
+
 /* writes out the whole block, but only if it is full */
 static void write_if_needed(void)
 {
 	if (offset == BLOCKSIZE) {
-		write_or_die(1, block, BLOCKSIZE);
+		write_block(block);
 		offset = 0;
 	}
 }
@@ -66,7 +71,7 @@ static void do_write_blocked(const void *data, unsigned long size)
 		write_if_needed();
 	}
 	while (size >= BLOCKSIZE) {
-		write_or_die(1, buf, BLOCKSIZE);
+		write_block(buf);
 		size -= BLOCKSIZE;
 		buf += BLOCKSIZE;
 	}
@@ -101,10 +106,10 @@ static void write_trailer(void)
 {
 	int tail = BLOCKSIZE - offset;
 	memset(block + offset, 0, tail);
-	write_or_die(1, block, BLOCKSIZE);
+	write_block(block);
 	if (tail < 2 * RECORDSIZE) {
 		memset(block, 0, offset);
-		write_or_die(1, block, BLOCKSIZE);
+		write_block(block);
 	}
 }

--
2.36.1

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH v3 3/5] archive-tar: add internal gzip implementation
  2022-06-12  6:00 ` [PATCH v3 0/5] Avoid spawning gzip in git archive René Scharfe
  2022-06-12  6:03   ` [PATCH v3 1/5] archive: rename archiver data field to filter_command René Scharfe
  2022-06-12  6:05   ` [PATCH v3 2/5] archive-tar: factor out write_block() René Scharfe
@ 2022-06-12  6:08   ` René Scharfe
  2022-06-13 19:10     ` Junio C Hamano
  2022-06-12  6:18   ` [PATCH v3 4/5] archive-tar: use OS_CODE 3 (Unix) for internal gzip René Scharfe
                     ` (2 subsequent siblings)
  5 siblings, 1 reply; 74+ messages in thread
From: René Scharfe @ 2022-06-12  6:08 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Johannes Schindelin, Rohit Ashiwal,
	Ævar Arnfjörð Bjarmason, Jeff King,
	brian m . carlson

Git uses zlib for its own object store, but calls gzip when creating tgz
archives.  Add an option to perform the gzip compression for the latter
using zlib, without depending on the external gzip binary.

Plug it in by making write_block a function pointer and switching to a
compressing variant if the filter command has the magic value "git
archive gzip".  Does that indirection slow down tar creation?  Not
really, at least not in this test:

$ hyperfine -w3 -L rev HEAD,origin/main -p 'git checkout {rev} && make' \
'./git -C ../linux archive --format=tar HEAD # {rev}'
Benchmark #1: ./git -C ../linux archive --format=tar HEAD # HEAD
  Time (mean ± σ):      4.044 s ±  0.007 s    [User: 3.901 s, System: 0.137 s]
  Range (min … max):    4.038 s …  4.059 s    10 runs

Benchmark #2: ./git -C ../linux archive --format=tar HEAD # origin/main
  Time (mean ± σ):      4.047 s ±  0.009 s    [User: 3.903 s, System: 0.138 s]
  Range (min … max):    4.038 s …  4.066 s    10 runs

How does tgz creation perform?

$ hyperfine -w3 -L command 'gzip -cn','git archive gzip' \
'./git -c tar.tgz.command="{command}" -C ../linux archive --format=tgz HEAD'
Benchmark #1: ./git -c tar.tgz.command="gzip -cn" -C ../linux archive --format=tgz HEAD
  Time (mean ± σ):     20.404 s ±  0.006 s    [User: 23.943 s, System: 0.401 s]
  Range (min … max):   20.395 s … 20.414 s    10 runs

Benchmark #2: ./git -c tar.tgz.command="git archive gzip" -C ../linux archive --format=tgz HEAD
  Time (mean ± σ):     23.807 s ±  0.023 s    [User: 23.655 s, System: 0.145 s]
  Range (min … max):   23.782 s … 23.857 s    10 runs

Summary
  './git -c tar.tgz.command="gzip -cn" -C ../linux archive --format=tgz HEAD' ran
    1.17 ± 0.00 times faster than './git -c tar.tgz.command="git archive gzip" -C ../linux archive --format=tgz HEAD'

So the internal implementation takes 17% longer on the Linux repo, but
uses 2% less CPU time.  That's because the external gzip can run in
parallel on its own processor, while the internal one works sequentially
and avoids the inter-process communication overhead.

What are the benefits?  Only an internal sequential implementation can
offer this eco mode, and it allows avoiding the gzip(1) requirement.

This implementation uses the helper functions from our zlib.c instead of
the convenient gz* functions from zlib, because the latter doesn't give
the control over the generated gzip header that the next patch requires.

Original-patch-by: Rohit Ashiwal <rohit.ashiwal265@gmail.com>
Signed-off-by: René Scharfe <l.s.r@web.de>
---
 Documentation/git-archive.txt |  3 ++-
 archive-tar.c                 | 45 ++++++++++++++++++++++++++++++++++-
 t/t5000-tar-tree.sh           | 16 +++++++++++++
 3 files changed, 62 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-archive.txt b/Documentation/git-archive.txt
index 56989a2f34..5b017c2bdc 100644
--- a/Documentation/git-archive.txt
+++ b/Documentation/git-archive.txt
@@ -148,7 +148,8 @@ tar.<format>.command::
 	format is given.
 +
 The "tar.gz" and "tgz" formats are defined automatically and default to
-`gzip -cn`. You may override them with custom commands.
+`gzip -cn`. You may override them with custom commands. An internal gzip
+implementation can be used by specifying the value `git archive gzip`.

 tar.<format>.remote::
 	If true, enable `<format>` for use by remote clients via
diff --git a/archive-tar.c b/archive-tar.c
index 4e6a3deb80..53d0ef685c 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -38,11 +38,13 @@ static int write_tar_filter_archive(const struct archiver *ar,
 #define USTAR_MAX_MTIME 077777777777ULL
 #endif

-static void write_block(const void *buf)
+static void tar_write_block(const void *buf)
 {
 	write_or_die(1, buf, BLOCKSIZE);
 }

+static void (*write_block)(const void *) = tar_write_block;
+
 /* writes out the whole block, but only if it is full */
 static void write_if_needed(void)
 {
@@ -430,6 +432,34 @@ static int write_tar_archive(const struct archiver *ar,
 	return err;
 }

+static git_zstream gzstream;
+static unsigned char outbuf[16384];
+
+static void tgz_deflate(int flush)
+{
+	while (gzstream.avail_in || flush == Z_FINISH) {
+		int status = git_deflate(&gzstream, flush);
+		if (!gzstream.avail_out || status == Z_STREAM_END) {
+			write_or_die(1, outbuf, gzstream.next_out - outbuf);
+			gzstream.next_out = outbuf;
+			gzstream.avail_out = sizeof(outbuf);
+			if (status == Z_STREAM_END)
+				break;
+		}
+		if (status != Z_OK && status != Z_BUF_ERROR)
+			die(_("deflate error (%d)"), status);
+	}
+}
+
+static void tgz_write_block(const void *data)
+{
+	gzstream.next_in = (void *)data;
+	gzstream.avail_in = BLOCKSIZE;
+	tgz_deflate(Z_NO_FLUSH);
+}
+
+static const char internal_gzip_command[] = "git archive gzip";
+
 static int write_tar_filter_archive(const struct archiver *ar,
 				    struct archiver_args *args)
 {
@@ -440,6 +470,19 @@ static int write_tar_filter_archive(const struct archiver *ar,
 	if (!ar->filter_command)
 		BUG("tar-filter archiver called with no filter defined");

+	if (!strcmp(ar->filter_command, internal_gzip_command)) {
+		write_block = tgz_write_block;
+		git_deflate_init_gzip(&gzstream, args->compression_level);
+		gzstream.next_out = outbuf;
+		gzstream.avail_out = sizeof(outbuf);
+
+		r = write_tar_archive(ar, args);
+
+		tgz_deflate(Z_FINISH);
+		git_deflate_end(&gzstream);
+		return r;
+	}
+
 	strbuf_addstr(&cmd, ar->filter_command);
 	if (args->compression_level >= 0)
 		strbuf_addf(&cmd, " -%d", args->compression_level);
diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh
index 7f8d2ab0a7..9ac0ec67fe 100755
--- a/t/t5000-tar-tree.sh
+++ b/t/t5000-tar-tree.sh
@@ -374,6 +374,22 @@ test_expect_success GZIP 'remote tar.gz can be disabled' '
 		>remote.tar.gz
 '

+test_expect_success 'git archive --format=tgz (internal gzip)' '
+	test_config tar.tgz.command "git archive gzip" &&
+	git archive --format=tgz HEAD >internal_gzip.tgz
+'
+
+test_expect_success 'git archive --format=tar.gz (internal gzip)' '
+	test_config tar.tar.gz.command "git archive gzip" &&
+	git archive --format=tar.gz HEAD >internal_gzip.tar.gz &&
+	test_cmp_bin internal_gzip.tgz internal_gzip.tar.gz
+'
+
+test_expect_success GZIP 'extract tgz file (internal gzip)' '
+	gzip -d -c <internal_gzip.tgz >internal_gzip.tar &&
+	test_cmp_bin b.tar internal_gzip.tar
+'
+
 test_expect_success 'archive and :(glob)' '
 	git archive -v HEAD -- ":(glob)**/sh" >/dev/null 2>actual &&
 	cat >expect <<EOF &&
--
2.36.1

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH v3 4/5] archive-tar: use OS_CODE 3 (Unix) for internal gzip
  2022-06-12  6:00 ` [PATCH v3 0/5] Avoid spawning gzip in git archive René Scharfe
                     ` (2 preceding siblings ...)
  2022-06-12  6:08   ` [PATCH v3 3/5] archive-tar: add internal gzip implementation René Scharfe
@ 2022-06-12  6:18   ` René Scharfe
  2022-06-12  6:19   ` [PATCH v3 5/5] archive-tar: use internal gzip by default René Scharfe
  2022-06-14 11:28   ` [PATCH v3 0/5] Avoid spawning gzip in git archive Johannes Schindelin
  5 siblings, 0 replies; 74+ messages in thread
From: René Scharfe @ 2022-06-12  6:18 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Johannes Schindelin, Rohit Ashiwal,
	Ævar Arnfjörð Bjarmason, Jeff King,
	brian m . carlson

gzip(1) encodes the OS it runs on in the 10th byte of its output. It
uses the following OS_CODE values according to its tailor.h [1]:

        0 - MS-DOS
        3 - UNIX
        5 - Atari ST
        6 - OS/2
       10 - TOPS-20
       11 - Windows NT

The gzip.exe that comes with Git for Windows uses OS_CODE 3 for some
reason, so this value is used on practically all supported platforms
when generating tgz archives using gzip(1).

Zlib uses a bigger set of values according to its zutil.h [2], aligned
with section 4.4.2 of the ZIP specification, APPNOTE.txt [3]:

         0 - MS-DOS
         1 - Amiga
         3 - UNIX
         4 - VM/CMS
         5 - Atari ST
         6 - OS/2
         7 - Macintosh
         8 - Z-System
        10 - Windows NT
        11 - MVS (OS/390 - Z/OS)
        13 - Acorn Risc
        16 - BeOS
        18 - OS/400
        19 - OS X (Darwin)

Thus the internal gzip implementation in archive-tar.c sets different
OS_CODE header values on major platforms Windows and macOS.  Git for
Windows uses its own zlib-based variant since v2.20.1 by default and
thus embeds OS_CODE 10 in tgz archives.

The tar archive for a commit is generated consistently on all systems
(by the same Git version).  The OS_CODE in the gzip header does not
influence extraction.  Avoid leaking OS information and make tgz
archives constistent and reproducable (with the same Git and libz
versions) by using OS_CODE 3 everywhere.

NB: The function deflateSetHeader() was introduced by zlib 1.2.2.1,
released 2004-10-31.

At least on macOS 12.4 this produces the same output as gzip(1) for the
examples I tried:

   # before
   $ git -c tar.tgz.command='git archive gzip' archive --format=tgz v2.36.0 | shasum
   3abbffb40b7c63cf9b7d91afc682f11682f80759  -

   # with this patch
   $ git -c tar.tgz.command='git archive gzip' archive --format=tgz v2.36.0 | shasum
   dc6dc6ba9636d522799085d0d77ab6a110bcc141  -

   $ git archive --format=tar v2.36.0 | gzip -cn | shasum
   dc6dc6ba9636d522799085d0d77ab6a110bcc141  -

[1] https://git.savannah.gnu.org/cgit/gzip.git/tree/tailor.h
[2] https://github.com/madler/zlib/blob/master/zutil.h
[3] https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT

Signed-off-by: René Scharfe <l.s.r@web.de>
---
Perhaps makes sense for remote-curl as well (out of scope of this
series)?

 archive-tar.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/archive-tar.c b/archive-tar.c
index 53d0ef685c..bf7e321e0e 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -460,6 +460,14 @@ static void tgz_write_block(const void *data)

 static const char internal_gzip_command[] = "git archive gzip";

+static void tgz_set_os(git_zstream *strm, int os)
+{
+#if ZLIB_VERNUM >= 0x1221
+	struct gz_header_s gzhead = { .os = os };
+	deflateSetHeader(&strm->z, &gzhead);
+#endif
+}
+
 static int write_tar_filter_archive(const struct archiver *ar,
 				    struct archiver_args *args)
 {
@@ -473,6 +481,7 @@ static int write_tar_filter_archive(const struct archiver *ar,
 	if (!strcmp(ar->filter_command, internal_gzip_command)) {
 		write_block = tgz_write_block;
 		git_deflate_init_gzip(&gzstream, args->compression_level);
+		tgz_set_os(&gzstream, 3); /* Unix, for reproducibility */
 		gzstream.next_out = outbuf;
 		gzstream.avail_out = sizeof(outbuf);

--
2.36.1

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH v3 5/5] archive-tar: use internal gzip by default
  2022-06-12  6:00 ` [PATCH v3 0/5] Avoid spawning gzip in git archive René Scharfe
                     ` (3 preceding siblings ...)
  2022-06-12  6:18   ` [PATCH v3 4/5] archive-tar: use OS_CODE 3 (Unix) for internal gzip René Scharfe
@ 2022-06-12  6:19   ` René Scharfe
  2022-06-13 21:55     ` Junio C Hamano
  2022-06-14 11:28   ` [PATCH v3 0/5] Avoid spawning gzip in git archive Johannes Schindelin
  5 siblings, 1 reply; 74+ messages in thread
From: René Scharfe @ 2022-06-12  6:19 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Johannes Schindelin, Rohit Ashiwal,
	Ævar Arnfjörð Bjarmason, Jeff King,
	brian m . carlson

Drop the dependency on gzip(1) and use our internal implementation to
create tar.gz and tgz files.

Signed-off-by: René Scharfe <l.s.r@web.de>
---
 Documentation/git-archive.txt |  4 ++--
 archive-tar.c                 |  4 ++--
 t/t5000-tar-tree.sh           | 32 ++++++++++++++++----------------
 3 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/Documentation/git-archive.txt b/Documentation/git-archive.txt
index 5b017c2bdc..9de12896fc 100644
--- a/Documentation/git-archive.txt
+++ b/Documentation/git-archive.txt
@@ -148,8 +148,8 @@ tar.<format>.command::
 	format is given.
 +
 The "tar.gz" and "tgz" formats are defined automatically and default to
-`gzip -cn`. You may override them with custom commands. An internal gzip
-implementation can be used by specifying the value `git archive gzip`.
+the magic value `git archive gzip`, which invokes an internal
+implementation of gzip. You may override them with custom commands.

 tar.<format>.remote::
 	If true, enable `<format>` for use by remote clients via
diff --git a/archive-tar.c b/archive-tar.c
index bf7e321e0e..60669eb7b9 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -528,9 +528,9 @@ void init_tar_archiver(void)
 	int i;
 	register_archiver(&tar_archiver);

-	tar_filter_config("tar.tgz.command", "gzip -cn", NULL);
+	tar_filter_config("tar.tgz.command", internal_gzip_command, NULL);
 	tar_filter_config("tar.tgz.remote", "true", NULL);
-	tar_filter_config("tar.tar.gz.command", "gzip -cn", NULL);
+	tar_filter_config("tar.tar.gz.command", internal_gzip_command, NULL);
 	tar_filter_config("tar.tar.gz.remote", "true", NULL);
 	git_config(git_tar_config, NULL);
 	for (i = 0; i < nr_tar_filters; i++) {
diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh
index 9ac0ec67fe..1a68e89a55 100755
--- a/t/t5000-tar-tree.sh
+++ b/t/t5000-tar-tree.sh
@@ -339,21 +339,21 @@ test_expect_success 'only enabled filters are available remotely' '
 	test_cmp_bin remote.bar config.bar
 '

-test_expect_success GZIP 'git archive --format=tgz' '
+test_expect_success 'git archive --format=tgz' '
 	git archive --format=tgz HEAD >j.tgz
 '

-test_expect_success GZIP 'git archive --format=tar.gz' '
+test_expect_success 'git archive --format=tar.gz' '
 	git archive --format=tar.gz HEAD >j1.tar.gz &&
 	test_cmp_bin j.tgz j1.tar.gz
 '

-test_expect_success GZIP 'infer tgz from .tgz filename' '
+test_expect_success 'infer tgz from .tgz filename' '
 	git archive --output=j2.tgz HEAD &&
 	test_cmp_bin j.tgz j2.tgz
 '

-test_expect_success GZIP 'infer tgz from .tar.gz filename' '
+test_expect_success 'infer tgz from .tar.gz filename' '
 	git archive --output=j3.tar.gz HEAD &&
 	test_cmp_bin j.tgz j3.tar.gz
 '
@@ -363,31 +363,31 @@ test_expect_success GZIP 'extract tgz file' '
 	test_cmp_bin b.tar j.tar
 '

-test_expect_success GZIP 'remote tar.gz is allowed by default' '
+test_expect_success 'remote tar.gz is allowed by default' '
 	git archive --remote=. --format=tar.gz HEAD >remote.tar.gz &&
 	test_cmp_bin j.tgz remote.tar.gz
 '

-test_expect_success GZIP 'remote tar.gz can be disabled' '
+test_expect_success 'remote tar.gz can be disabled' '
 	git config tar.tar.gz.remote false &&
 	test_must_fail git archive --remote=. --format=tar.gz HEAD \
 		>remote.tar.gz
 '

-test_expect_success 'git archive --format=tgz (internal gzip)' '
-	test_config tar.tgz.command "git archive gzip" &&
-	git archive --format=tgz HEAD >internal_gzip.tgz
+test_expect_success GZIP 'git archive --format=tgz (external gzip)' '
+	test_config tar.tgz.command "gzip -cn" &&
+	git archive --format=tgz HEAD >external_gzip.tgz
 '

-test_expect_success 'git archive --format=tar.gz (internal gzip)' '
-	test_config tar.tar.gz.command "git archive gzip" &&
-	git archive --format=tar.gz HEAD >internal_gzip.tar.gz &&
-	test_cmp_bin internal_gzip.tgz internal_gzip.tar.gz
+test_expect_success GZIP 'git archive --format=tar.gz (external gzip)' '
+	test_config tar.tar.gz.command "gzip -cn" &&
+	git archive --format=tar.gz HEAD >external_gzip.tar.gz &&
+	test_cmp_bin external_gzip.tgz external_gzip.tar.gz
 '

-test_expect_success GZIP 'extract tgz file (internal gzip)' '
-	gzip -d -c <internal_gzip.tgz >internal_gzip.tar &&
-	test_cmp_bin b.tar internal_gzip.tar
+test_expect_success GZIP 'extract tgz file (external gzip)' '
+	gzip -d -c <external_gzip.tgz >external_gzip.tar &&
+	test_cmp_bin b.tar external_gzip.tar
 '

 test_expect_success 'archive and :(glob)' '
--
2.36.1

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: [PATCH v3 3/5] archive-tar: add internal gzip implementation
  2022-06-12  6:08   ` [PATCH v3 3/5] archive-tar: add internal gzip implementation René Scharfe
@ 2022-06-13 19:10     ` Junio C Hamano
  0 siblings, 0 replies; 74+ messages in thread
From: Junio C Hamano @ 2022-06-13 19:10 UTC (permalink / raw)
  To: René Scharfe
  Cc: git, Johannes Schindelin, Rohit Ashiwal,
	Ævar Arnfjörð Bjarmason, Jeff King,
	brian m . carlson

René Scharfe <l.s.r@web.de> writes:

> @@ -148,7 +148,8 @@ tar.<format>.command::
>  	format is given.
>  +
>  The "tar.gz" and "tgz" formats are defined automatically and default to
> -`gzip -cn`. You may override them with custom commands.
> +`gzip -cn`. You may override them with custom commands. An internal gzip
> +implementation can be used by specifying the value `git archive gzip`.
>

The new sentence didn't click and I was lost in figuring out what is
set to which value and where, before looking at a test the patch
adds.

I think it is not entirely a fault of this patch, but the badness is
already in the original.  I wouldn't have been as confused if it
were like so:

    The "tar.gz" and "tgz" formats are defined automatically to use
    `gzip -cn` as the command by default.  An internal gzip
    implementation can be used by specifying the value `git archive
    gzip` for these two formats.

for example.

Thanks.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v3 5/5] archive-tar: use internal gzip by default
  2022-06-12  6:19   ` [PATCH v3 5/5] archive-tar: use internal gzip by default René Scharfe
@ 2022-06-13 21:55     ` Junio C Hamano
  2022-06-14 11:27       ` Johannes Schindelin
  0 siblings, 1 reply; 74+ messages in thread
From: Junio C Hamano @ 2022-06-13 21:55 UTC (permalink / raw)
  To: René Scharfe
  Cc: git, Johannes Schindelin, Rohit Ashiwal,
	Ævar Arnfjörð Bjarmason, Jeff King,
	brian m . carlson

René Scharfe <l.s.r@web.de> writes:

> -test_expect_success GZIP 'git archive --format=tar.gz' '
> +test_expect_success 'git archive --format=tar.gz' '
>  	git archive --format=tar.gz HEAD >j1.tar.gz &&
>  	test_cmp_bin j.tgz j1.tar.gz
>  '

Curiously, this breaks for me.  It is understandable if we are not
producing byte-for-byte identical output with internal gzip.

With the following hack I can force the step pass, so it seems that
the two invocations of internal gzip are not emitting identical
result for the tar stream taken out of HEAD^{tree} object?

diff --git c/t/t5000-tar-tree.sh w/t/t5000-tar-tree.sh
index 1a68e89a55..c0a2cb92d4 100755
--- c/t/t5000-tar-tree.sh
+++ w/t/t5000-tar-tree.sh
@@ -340,14 +340,16 @@ test_expect_success 'only enabled filters are available remotely' '
 '
 
 test_expect_success 'git archive --format=tgz' '
-	git archive --format=tgz HEAD >j.tgz
+	git -c tar.tgz.command="gzip -cn" archive --format=tgz HEAD >j.tgz
 '
 
 test_expect_success 'git archive --format=tar.gz' '
-	git archive --format=tar.gz HEAD >j1.tar.gz &&
+	git -c tar.tar.gz.command="gzip -cn" archive --format=tar.gz HEAD >j1.tar.gz &&
 	test_cmp_bin j.tgz j1.tar.gz
 '
 
+exit
+
 test_expect_success 'infer tgz from .tgz filename' '
 	git archive --output=j2.tgz HEAD &&
 	test_cmp_bin j.tgz j2.tgz


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: [PATCH v3 5/5] archive-tar: use internal gzip by default
  2022-06-13 21:55     ` Junio C Hamano
@ 2022-06-14 11:27       ` Johannes Schindelin
  2022-06-14 15:47         ` René Scharfe
  0 siblings, 1 reply; 74+ messages in thread
From: Johannes Schindelin @ 2022-06-14 11:27 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: René Scharfe, git, Rohit Ashiwal,
	Ævar Arnfjörð Bjarmason, Jeff King,
	brian m . carlson

[-- Attachment #1: Type: text/plain, Size: 4352 bytes --]

Hi Junio,

On Mon, 13 Jun 2022, Junio C Hamano wrote:

> René Scharfe <l.s.r@web.de> writes:
>
> > -test_expect_success GZIP 'git archive --format=tar.gz' '
> > +test_expect_success 'git archive --format=tar.gz' '
> >  	git archive --format=tar.gz HEAD >j1.tar.gz &&
> >  	test_cmp_bin j.tgz j1.tar.gz
> >  '
>
> Curiously, this breaks for me.  It is understandable if we are not
> producing byte-for-byte identical output with internal gzip.

Indeed, I can reproduce this, too. In particular, `j.tgz` and `j1.tar.gz`
differ like this in my test run:

-00000000  1f 8b 08 1a 00 2e ca 09  00 03 04 00 89 45 fc 83 |.............E..|
+00000000  1f 8b 08 1a 00 35 2a 10  00 03 04 00 89 45 fc 83 |.....5*......E..|

and

-00000010  7d fc 00 f1 d0 ec b7 63  8c 30 cc 9b e6 db b6 6d |}......c.0.....m|
+00000010  7d fc 00 54 ff ec b7 63  8c 30 cc 9b e6 db b6 6d |}..T...c.0.....m|

According to https://datatracker.ietf.org/doc/html/rfc1952#page-5, the
difference in the first line is the mtime. For reference, this is the
version with `git -c tar.tgz.command="gzip -cn" archive --format=tgz
HEAD`:

00000000  1f 8b 08 00 00 00 00 00  00 03 ec b7 63 8c 30 cc |............c.0.|

In other words, `gzip` forces the `mtim` member to all zeros, which makes
sense.

The recorded mtimes are a bit funny, according to
https://wolf-tungsten.github.io/gzip-analyzer/, they are 1975-03-17
00:36:32 and 1978-08-05 22:45:36, respectively...

And the mtime actually changes all the time.

What's even more funny: if I comment out the `deflateSetHeader()`, the
mtime header field is left at all-zeros. This is on Ubuntu 18.04 with
zlib1g 1:1.2.11.dfsg-0ubuntu2.

So I dug in a bit deeper and what do you know, the `deflateHeader()`
function is implemented like this
(https://github.com/madler/zlib/blob/21767c654d31/deflate.c#L557-L565):

	int ZEXPORT deflateSetHeader (strm, head)
	    z_streamp strm;
	    gz_headerp head;
	{
	    if (deflateStateCheck(strm) || strm->state->wrap != 2)
		return Z_STREAM_ERROR;
	    strm->state->gzhead = head;
	    return Z_OK;
	}

Now, the caller is implemented like this:

	static void tgz_set_os(git_zstream *strm, int os)
	{
	#if ZLIB_VERNUM >= 0x1221
		struct gz_header_s gzhead = { .os = os };
		deflateSetHeader(&strm->z, &gzhead);
	#endif
	}

The biggest problem is not that the return value of `deflateSetHeader()`
is ignored. The biggest problem is that it passes the address of a heap
variable to the `deflateSetHeader()` function, which then stores it away
in another struct that lives beyond the point when we return from
`tgz_set_os()`.

In other words, this is the very issue I pointed out as GCC not catching:
https://lore.kernel.org/git/nycvar.QRO.7.76.6.2205272235220.349@tvgsbejvaqbjf.bet/

The solution is to move the heap variable back into a scope that matches
the lifetime of the compression:

-- snip --
diff --git a/archive-tar.c b/archive-tar.c
index 60669eb7b9c..3d77e0f7509 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -460,17 +460,12 @@ static void tgz_write_block(const void *data)

 static const char internal_gzip_command[] = "git archive gzip";

-static void tgz_set_os(git_zstream *strm, int os)
-{
-#if ZLIB_VERNUM >= 0x1221
-	struct gz_header_s gzhead = { .os = os };
-	deflateSetHeader(&strm->z, &gzhead);
-#endif
-}
-
 static int write_tar_filter_archive(const struct archiver *ar,
 				    struct archiver_args *args)
 {
+#if ZLIB_VERNUM >= 0x1221
+	struct gz_header_s gzhead = { .os = 3 }; /* Unix, for reproducibility */
+#endif
 	struct strbuf cmd = STRBUF_INIT;
 	struct child_process filter = CHILD_PROCESS_INIT;
 	int r;
@@ -481,7 +476,10 @@ static int write_tar_filter_archive(const struct archiver *ar,
 	if (!strcmp(ar->filter_command, internal_gzip_command)) {
 		write_block = tgz_write_block;
 		git_deflate_init_gzip(&gzstream, args->compression_level);
-		tgz_set_os(&gzstream, 3); /* Unix, for reproducibility */
+#if ZLIB_VERNUM >= 0x1221
+		if (deflateSetHeader(&gzstream.z, &gzhead) != Z_OK)
+			BUG("deflateSetHeader() called too late");
+#endif
 		gzstream.next_out = outbuf;
 		gzstream.avail_out = sizeof(outbuf);

-- snap --

With this, the test passes for me.

René, would you mind squashing this into your patch series?

Thank you,
Dscho

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: [PATCH v3 0/5] Avoid spawning gzip in git archive
  2022-06-12  6:00 ` [PATCH v3 0/5] Avoid spawning gzip in git archive René Scharfe
                     ` (4 preceding siblings ...)
  2022-06-12  6:19   ` [PATCH v3 5/5] archive-tar: use internal gzip by default René Scharfe
@ 2022-06-14 11:28   ` Johannes Schindelin
  2022-06-14 20:05     ` René Scharfe
  5 siblings, 1 reply; 74+ messages in thread
From: Johannes Schindelin @ 2022-06-14 11:28 UTC (permalink / raw)
  To: René Scharfe
  Cc: git, Junio C Hamano, Rohit Ashiwal,
	Ævar Arnfjörð Bjarmason, Jeff King,
	brian m . carlson

[-- Attachment #1: Type: text/plain, Size: 21805 bytes --]

Hi René,

On Sun, 12 Jun 2022, René Scharfe wrote:

> It's been a while, let's try again.

Thank you for picking this up again!

> Changes:
> - Use our own zlib helpers instead of the gz* functions of zlib,
> - ... which allows us to set the OS_CODE header consistently.
> - Pseudo-command "git archive gzip" to select the internal
>   implementation in config.
> - Use a function pointer to plug in the internal gzip.
> - Tests.
> - Discuss performance in commit message.

Makes sense. Here is the range-diff:

-- snip --
-:  ----------- > 1:  9847267888e archive: rename archiver data field to filter_command
1:  7d50f52e490 ! 2:  a98ef655af9 archive: factor out writing blocks into a separate function
    @@
      ## Metadata ##
    -Author: Rohit Ashiwal <rohit.ashiwal265@gmail.com>
    +Author: René Scharfe <l.s.r@web.de>

      ## Commit message ##
    -    archive: factor out writing blocks into a separate function
    +    archive-tar: factor out write_block()

    -    The `git archive --format=tgz` command spawns `gzip` to perform the
    -    actual compression. However, the MinGit flavor of Git for Windows
    -    comes without `gzip` bundled inside.
    +    All tar archive writes have the same size and are done to the same file
    +    descriptor.  Move them to a common function, write_block(), to reduce
    +    code duplication and make it easy to change the destination.

    -    To help with that, we will teach `git archive` to let zlib perform the
    -    actual compression.
    -
    -    In preparation for this, we consolidate all the block writes into the
    -    function `write_block_or_die()`.
    -
    -    Note: .tar files have a well-defined, fixed block size. For that reason,
    -    it does not make any sense to pass anything but a fully-populated,
    -    full-length block to the `write_block_or_die()` function, and we can
    -    save ourselves some future trouble by not even allowing to pass an
    -    incorrect `size` parameter to it.
    -
    -    Signed-off-by: Rohit Ashiwal <rohit.ashiwal265@gmail.com>
    -    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
    +    Original-patch-by: Rohit Ashiwal <rohit.ashiwal265@gmail.com>
    +    Signed-off-by: René Scharfe <l.s.r@web.de>
    +    Signed-off-by: Junio C Hamano <gitster@pobox.com>

      ## archive-tar.c ##
     @@ archive-tar.c: static int write_tar_filter_archive(const struct archiver *ar,
      #define USTAR_MAX_MTIME 077777777777ULL
      #endif

    -+/* writes out the whole block, or dies if fails */
    -+static void write_block_or_die(const char *block) {
    -+	write_or_die(1, block, BLOCKSIZE);
    ++static void write_block(const void *buf)
    ++{
    ++	write_or_die(1, buf, BLOCKSIZE);
     +}
     +
      /* writes out the whole block, but only if it is full */
    @@ archive-tar.c: static int write_tar_filter_archive(const struct archiver *ar,
      {
      	if (offset == BLOCKSIZE) {
     -		write_or_die(1, block, BLOCKSIZE);
    -+		write_block_or_die(block);
    ++		write_block(block);
      		offset = 0;
      	}
      }
    @@ archive-tar.c: static void do_write_blocked(const void *data, unsigned long size
      	}
      	while (size >= BLOCKSIZE) {
     -		write_or_die(1, buf, BLOCKSIZE);
    -+		write_block_or_die(buf);
    ++		write_block(buf);
      		size -= BLOCKSIZE;
      		buf += BLOCKSIZE;
      	}
    @@ archive-tar.c: static void write_trailer(void)
      	int tail = BLOCKSIZE - offset;
      	memset(block + offset, 0, tail);
     -	write_or_die(1, block, BLOCKSIZE);
    -+	write_block_or_die(block);
    ++	write_block(block);
      	if (tail < 2 * RECORDSIZE) {
      		memset(block, 0, offset);
     -		write_or_die(1, block, BLOCKSIZE);
    -+		write_block_or_die(block);
    ++		write_block(block);
      	}
      }

2:  ac2b2488a1b < -:  ----------- archive-tar: mark RECORDSIZE/BLOCKSIZE as unsigned
3:  4ea94a87848 ! 3:  5e3c0d79589 archive: optionally use zlib directly for gzip compression
    @@
      ## Metadata ##
    -Author: Rohit Ashiwal <rohit.ashiwal265@gmail.com>
    +Author: René Scharfe <l.s.r@web.de>

      ## Commit message ##
    -    archive: optionally use zlib directly for gzip compression
    +    archive-tar: add internal gzip implementation

    -    As we already link to the zlib library, we can perform the compression
    -    without even requiring gzip on the host machine.
    +    Git uses zlib for its own object store, but calls gzip when creating tgz
    +    archives.  Add an option to perform the gzip compression for the latter
    +    using zlib, without depending on the external gzip binary.

    -    Note: the `-n` flag that `git archive` passed to `gzip` wants to ensure
    -    that a reproducible file is written, i.e. no filename or mtime will be
    -    recorded in the compressed output. This is already the default for
    -    zlib's `gzopen()` function (if the file name or mtime should be
    -    recorded, the `deflateSetHeader()` function would have to be called
    -    instead).
    +    Plug it in by making write_block a function pointer and switching to a
    +    compressing variant if the filter command has the magic value "git
    +    archive gzip".  Does that indirection slow down tar creation?  Not
    +    really, at least not in this test:

    -    Note also that the `gzFile` datatype is defined as a pointer in
    -    `zlib.h`, i.e. we can rely on the fact that it can be `NULL`.
    +    $ hyperfine -w3 -L rev HEAD,origin/main -p 'git checkout {rev} && make' \
    +    './git -C ../linux archive --format=tar HEAD # {rev}'
    +    Benchmark #1: ./git -C ../linux archive --format=tar HEAD # HEAD
    +      Time (mean ± σ):      4.044 s ±  0.007 s    [User: 3.901 s, System: 0.137 s]
    +      Range (min … max):    4.038 s …  4.059 s    10 runs

    -    At this point, this new mode is hidden behind the pseudo command
    -    `:zlib`: assign this magic string to the `archive.tgz.command` config
    -    setting to enable it.
    +    Benchmark #2: ./git -C ../linux archive --format=tar HEAD # origin/main
    +      Time (mean ± σ):      4.047 s ±  0.009 s    [User: 3.903 s, System: 0.138 s]
    +      Range (min … max):    4.038 s …  4.066 s    10 runs

    -    Signed-off-by: Rohit Ashiwal <rohit.ashiwal265@gmail.com>
    -    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
    +    How does tgz creation perform?

    - ## archive-tar.c ##
    -@@ archive-tar.c: static unsigned long offset;
    -
    - static int tar_umask = 002;
    -
    -+static gzFile gzip;
    -+
    - static int write_tar_filter_archive(const struct archiver *ar,
    - 				    struct archiver_args *args);
    +    $ hyperfine -w3 -L command 'gzip -cn','git archive gzip' \
    +    './git -c tar.tgz.command="{command}" -C ../linux archive --format=tgz HEAD'
    +    Benchmark #1: ./git -c tar.tgz.command="gzip -cn" -C ../linux archive --format=tgz HEAD
    +      Time (mean ± σ):     20.404 s ±  0.006 s    [User: 23.943 s, System: 0.401 s]
    +      Range (min … max):   20.395 s … 20.414 s    10 runs
    +
    +    Benchmark #2: ./git -c tar.tgz.command="git archive gzip" -C ../linux archive --format=tgz HEAD
    +      Time (mean ± σ):     23.807 s ±  0.023 s    [User: 23.655 s, System: 0.145 s]
    +      Range (min … max):   23.782 s … 23.857 s    10 runs
    +
    +    Summary
    +      './git -c tar.tgz.command="gzip -cn" -C ../linux archive --format=tgz HEAD' ran
    +        1.17 ± 0.00 times faster than './git -c tar.tgz.command="git archive gzip" -C ../linux archive --format=tgz HEAD'
    +
    +    So the internal implementation takes 17% longer on the Linux repo, but
    +    uses 2% less CPU time.  That's because the external gzip can run in
    +    parallel on its own processor, while the internal one works sequentially
    +    and avoids the inter-process communication overhead.
    +
    +    What are the benefits?  Only an internal sequential implementation can
    +    offer this eco mode, and it allows avoiding the gzip(1) requirement.
    +
    +    This implementation uses the helper functions from our zlib.c instead of
    +    the convenient gz* functions from zlib, because the latter doesn't give
    +    the control over the generated gzip header that the next patch requires.
    +
    +    Original-patch-by: Rohit Ashiwal <rohit.ashiwal265@gmail.com>
    +    Signed-off-by: René Scharfe <l.s.r@web.de>
    +    Signed-off-by: Junio C Hamano <gitster@pobox.com>
    +
    + ## Documentation/git-archive.txt ##
    +@@ Documentation/git-archive.txt: tar.<format>.command::
    + 	format is given.
    + +
    + The "tar.gz" and "tgz" formats are defined automatically and default to
    +-`gzip -cn`. You may override them with custom commands.
    ++`gzip -cn`. You may override them with custom commands. An internal gzip
    ++implementation can be used by specifying the value `git archive gzip`.

    + tar.<format>.remote::
    + 	If true, enable `<format>` for use by remote clients via
    +
    + ## archive-tar.c ##
     @@ archive-tar.c: static int write_tar_filter_archive(const struct archiver *ar,
    + #define USTAR_MAX_MTIME 077777777777ULL
    + #endif

    - /* writes out the whole block, or dies if fails */
    - static void write_block_or_die(const char *block) {
    --	write_or_die(1, block, BLOCKSIZE);
    -+	if (!gzip)
    -+		write_or_die(1, block, BLOCKSIZE);
    -+	else if (gzwrite(gzip, block, (unsigned) BLOCKSIZE) != BLOCKSIZE)
    -+		die(_("gzwrite failed"));
    +-static void write_block(const void *buf)
    ++static void tar_write_block(const void *buf)
    + {
    + 	write_or_die(1, buf, BLOCKSIZE);
      }

    ++static void (*write_block)(const void *) = tar_write_block;
    ++
      /* writes out the whole block, but only if it is full */
    -@@ archive-tar.c: static int write_tar_filter_archive(const struct archiver *ar,
    - 	filter.use_shell = 1;
    - 	filter.in = -1;
    + static void write_if_needed(void)
    + {
    +@@ archive-tar.c: static int write_tar_archive(const struct archiver *ar,
    + 	return err;
    + }

    --	if (start_command(&filter) < 0)
    --		die_errno(_("unable to start '%s' filter"), argv[0]);
    --	close(1);
    --	if (dup2(filter.in, 1) < 0)
    --		die_errno(_("unable to redirect descriptor"));
    --	close(filter.in);
    -+	if (!strcmp(":zlib", ar->data)) {
    -+		struct strbuf mode = STRBUF_INIT;
    ++static git_zstream gzstream;
    ++static unsigned char outbuf[16384];
     +
    -+		strbuf_addstr(&mode, "wb");
    ++static void tgz_deflate(int flush)
    ++{
    ++	while (gzstream.avail_in || flush == Z_FINISH) {
    ++		int status = git_deflate(&gzstream, flush);
    ++		if (!gzstream.avail_out || status == Z_STREAM_END) {
    ++			write_or_die(1, outbuf, gzstream.next_out - outbuf);
    ++			gzstream.next_out = outbuf;
    ++			gzstream.avail_out = sizeof(outbuf);
    ++			if (status == Z_STREAM_END)
    ++				break;
    ++		}
    ++		if (status != Z_OK && status != Z_BUF_ERROR)
    ++			die(_("deflate error (%d)"), status);
    ++	}
    ++}
     +
    -+		if (args->compression_level >= 0 && args->compression_level <= 9)
    -+			strbuf_addf(&mode, "%d", args->compression_level);
    ++static void tgz_write_block(const void *data)
    ++{
    ++	gzstream.next_in = (void *)data;
    ++	gzstream.avail_in = BLOCKSIZE;
    ++	tgz_deflate(Z_NO_FLUSH);
    ++}
     +
    -+		gzip = gzdopen(fileno(stdout), mode.buf);
    -+		if (!gzip)
    -+			die(_("Could not gzdopen stdout"));
    -+		strbuf_release(&mode);
    -+	} else {
    -+		if (start_command(&filter) < 0)
    -+			die_errno(_("unable to start '%s' filter"), argv[0]);
    -+		close(1);
    -+		if (dup2(filter.in, 1) < 0)
    -+			die_errno(_("unable to redirect descriptor"));
    -+		close(filter.in);
    -+	}
    -
    - 	r = write_tar_archive(ar, args);
    ++static const char internal_gzip_command[] = "git archive gzip";
    ++
    + static int write_tar_filter_archive(const struct archiver *ar,
    + 				    struct archiver_args *args)
    + {
    +@@ archive-tar.c: static int write_tar_filter_archive(const struct archiver *ar,
    + 	if (!ar->filter_command)
    + 		BUG("tar-filter archiver called with no filter defined");

    --	close(1);
    --	if (finish_command(&filter) != 0)
    --		die(_("'%s' filter reported error"), argv[0]);
    -+	if (gzip) {
    -+		int ret = gzclose(gzip);
    -+		if (ret == Z_ERRNO)
    -+			die_errno(_("gzclose failed"));
    -+		else if (ret != Z_OK)
    -+			die(_("gzclose failed (%d)"), ret);
    -+	} else {
    -+		close(1);
    -+		if (finish_command(&filter) != 0)
    -+			die(_("'%s' filter reported error"), argv[0]);
    ++	if (!strcmp(ar->filter_command, internal_gzip_command)) {
    ++		write_block = tgz_write_block;
    ++		git_deflate_init_gzip(&gzstream, args->compression_level);
    ++		gzstream.next_out = outbuf;
    ++		gzstream.avail_out = sizeof(outbuf);
    ++
    ++		r = write_tar_archive(ar, args);
    ++
    ++		tgz_deflate(Z_FINISH);
    ++		git_deflate_end(&gzstream);
    ++		return r;
     +	}
    ++
    + 	strbuf_addstr(&cmd, ar->filter_command);
    + 	if (args->compression_level >= 0)
    + 		strbuf_addf(&cmd, " -%d", args->compression_level);
    +
    + ## t/t5000-tar-tree.sh ##
    +@@ t/t5000-tar-tree.sh: test_expect_success GZIP 'remote tar.gz can be disabled' '
    + 		>remote.tar.gz
    + '

    - 	strbuf_release(&cmd);
    - 	return r;
    ++test_expect_success 'git archive --format=tgz (internal gzip)' '
    ++	test_config tar.tgz.command "git archive gzip" &&
    ++	git archive --format=tgz HEAD >internal_gzip.tgz
    ++'
    ++
    ++test_expect_success 'git archive --format=tar.gz (internal gzip)' '
    ++	test_config tar.tar.gz.command "git archive gzip" &&
    ++	git archive --format=tar.gz HEAD >internal_gzip.tar.gz &&
    ++	test_cmp_bin internal_gzip.tgz internal_gzip.tar.gz
    ++'
    ++
    ++test_expect_success GZIP 'extract tgz file (internal gzip)' '
    ++	gzip -d -c <internal_gzip.tgz >internal_gzip.tar &&
    ++	test_cmp_bin b.tar internal_gzip.tar
    ++'
    ++
    + test_expect_success 'archive and :(glob)' '
    + 	git archive -v HEAD -- ":(glob)**/sh" >/dev/null 2>actual &&
    + 	cat >expect <<EOF &&
-:  ----------- > 4:  af27bea4fc3 archive-tar: use OS_CODE 3 (Unix) for internal gzip
4:  0e5826e3f25 ! 5:  62038b8e911 archive: use the internal zlib-based gzip compression by default
    @@
      ## Metadata ##
    -Author: Johannes Schindelin <Johannes.Schindelin@gmx.de>
    +Author: René Scharfe <l.s.r@web.de>

      ## Commit message ##
    -    archive: use the internal zlib-based gzip compression by default
    +    archive-tar: use internal gzip by default

    -    We just introduced support for compressing `.tar.gz` archives in the
    -    `git archive` process itself, using zlib directly instead of spawning
    -    `gzip`.
    +    Drop the dependency on gzip(1) and use our internal implementation to
    +    create tar.gz and tgz files.

    -    While this takes less CPU time overall, on multi-core machines, this is
    -    slightly slower in terms of wall clock time (it seems to be in the
    -    ballpark of 15%).
    -
    -    It does reduce the number of dependencies by one, though, which makes it
    -    desirable to turn that mode on by default.
    -
    -    Changing the default benefits most notably the MinGit flavor of Git for
    -    Windows (which intends to support 3rd-party applications that want to
    -    use Git and want to bundle a minimal set of files for that purpose, i.e.
    -    stripping out all non-essential files such as interactive commands,
    -    Perl, and yes, also `gzip`).
    -
    -    We also can now remove the `GZIP` prerequisite from quite a number of
    -    test cases in `t/t5000-tar-tree.sh`.
    -
    -    This closes https://github.com/git-for-windows/git/issues/1970
    -
    -    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
    +    Signed-off-by: René Scharfe <l.s.r@web.de>
    +    Signed-off-by: Junio C Hamano <gitster@pobox.com>

      ## Documentation/git-archive.txt ##
     @@ Documentation/git-archive.txt: tar.<format>.command::
      	format is given.
      +
      The "tar.gz" and "tgz" formats are defined automatically and default to
    --`gzip -cn`. You may override them with custom commands.
    -+`:zlib`, triggering an in-process gzip compression. You may override
    -+them with custom commands, e.g. `gzip -cn` or `pigz -cn`.
    +-`gzip -cn`. You may override them with custom commands. An internal gzip
    +-implementation can be used by specifying the value `git archive gzip`.
    ++the magic value `git archive gzip`, which invokes an internal
    ++implementation of gzip. You may override them with custom commands.

      tar.<format>.remote::
      	If true, enable `<format>` for use by remote clients via
    @@ archive-tar.c: void init_tar_archiver(void)
      	register_archiver(&tar_archiver);

     -	tar_filter_config("tar.tgz.command", "gzip -cn", NULL);
    -+	tar_filter_config("tar.tgz.command", ":zlib", NULL);
    ++	tar_filter_config("tar.tgz.command", internal_gzip_command, NULL);
      	tar_filter_config("tar.tgz.remote", "true", NULL);
     -	tar_filter_config("tar.tar.gz.command", "gzip -cn", NULL);
    -+	tar_filter_config("tar.tar.gz.command", ":zlib", NULL);
    ++	tar_filter_config("tar.tar.gz.command", internal_gzip_command, NULL);
      	tar_filter_config("tar.tar.gz.remote", "true", NULL);
      	git_config(git_tar_config, NULL);
      	for (i = 0; i < nr_tar_filters; i++) {
    @@ t/t5000-tar-tree.sh: test_expect_success 'only enabled filters are available rem
      	git archive --output=j3.tar.gz HEAD &&
      	test_cmp_bin j.tgz j3.tar.gz
      '
    -
    -+test_expect_success 'use `archive.tgz.command=:zlib` explicitly' '
    -+	git -c archive.tgz.command=:zlib archive --output=j4.tgz HEAD &&
    -+	test_cmp_bin j.tgz j4.tgz
    -+'
    -+
    - test_expect_success GZIP 'extract tgz file' '
    - 	gzip -d -c <j.tgz >j.tar &&
    +@@ t/t5000-tar-tree.sh: test_expect_success GZIP 'extract tgz file' '
      	test_cmp_bin b.tar j.tar
      '

    @@ t/t5000-tar-tree.sh: test_expect_success 'only enabled filters are available rem
      	git config tar.tar.gz.remote false &&
      	test_must_fail git archive --remote=. --format=tar.gz HEAD \
      		>remote.tar.gz
    + '
    +
    +-test_expect_success 'git archive --format=tgz (internal gzip)' '
    +-	test_config tar.tgz.command "git archive gzip" &&
    +-	git archive --format=tgz HEAD >internal_gzip.tgz
    ++test_expect_success GZIP 'git archive --format=tgz (external gzip)' '
    ++	test_config tar.tgz.command "gzip -cn" &&
    ++	git archive --format=tgz HEAD >external_gzip.tgz
    + '
    +
    +-test_expect_success 'git archive --format=tar.gz (internal gzip)' '
    +-	test_config tar.tar.gz.command "git archive gzip" &&
    +-	git archive --format=tar.gz HEAD >internal_gzip.tar.gz &&
    +-	test_cmp_bin internal_gzip.tgz internal_gzip.tar.gz
    ++test_expect_success GZIP 'git archive --format=tar.gz (external gzip)' '
    ++	test_config tar.tar.gz.command "gzip -cn" &&
    ++	git archive --format=tar.gz HEAD >external_gzip.tar.gz &&
    ++	test_cmp_bin external_gzip.tgz external_gzip.tar.gz
    + '
    +
    +-test_expect_success GZIP 'extract tgz file (internal gzip)' '
    +-	gzip -d -c <internal_gzip.tgz >internal_gzip.tar &&
    +-	test_cmp_bin b.tar internal_gzip.tar
    ++test_expect_success GZIP 'extract tgz file (external gzip)' '
    ++	gzip -d -c <external_gzip.tgz >external_gzip.tar &&
    ++	test_cmp_bin b.tar external_gzip.tar
    + '
    +
    + test_expect_success 'archive and :(glob)' '
-- snap --

All of these changes look sensible to me, and the performance
implications, while at first glance unfavorable because wallclock time
increases even as CPU time decreases, are actually quite good. As Peff
said in
https://lore.kernel.org/git/20190501181807.GC4109@sigill.intra.peff.net/t/#u

> [...] whatever has the lowest overall CPU time is generally preferable
> [...]

By the way, the main reason why I did not work more is that in
http://madler.net/pipermail/zlib-devel_madler.net/2019-December/003308.html,
Mark Adler (the zlib maintainer) announced that...

> [...] There are many well-tested performance improvements in zlib
> waiting in the wings that will be incorporated over the next several
> months. [...]

This was in December 2019. And now it's June 2022 and I kind of wonder
whether those promised improvements will still come.

In the meantime, however, a viable alternative seems to have cropped up:
https://github.com/zlib-ng/zlib-ng. Essentially, it looks as if it is what
zlib should have become after above-quoted announcement.

In particular the CPU intrinsics support (think MMX, SSE2/3, etc) seem to
be very interesting and I would not be completely surprised if building
Git with your patches and linking against zlib-ng would paint a very
favorable picture not only in terms of CPU time but also in terms of
wallclock time. Sadly, I have not been able to set aside time to look into
that angle, but maybe I can peak your interest?

Thanks,
Dscho

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v3 5/5] archive-tar: use internal gzip by default
  2022-06-14 11:27       ` Johannes Schindelin
@ 2022-06-14 15:47         ` René Scharfe
  2022-06-14 15:56           ` René Scharfe
  2022-06-14 16:29           ` Johannes Schindelin
  0 siblings, 2 replies; 74+ messages in thread
From: René Scharfe @ 2022-06-14 15:47 UTC (permalink / raw)
  To: Johannes Schindelin, Junio C Hamano
  Cc: git, Rohit Ashiwal, Ævar Arnfjörð Bjarmason,
	Jeff King, brian m . carlson

Am 14.06.22 um 13:27 schrieb Johannes Schindelin:
> Hi Junio,
>
> On Mon, 13 Jun 2022, Junio C Hamano wrote:
>
>> René Scharfe <l.s.r@web.de> writes:
>>
>>> -test_expect_success GZIP 'git archive --format=tar.gz' '
>>> +test_expect_success 'git archive --format=tar.gz' '
>>>  	git archive --format=tar.gz HEAD >j1.tar.gz &&
>>>  	test_cmp_bin j.tgz j1.tar.gz
>>>  '
>>
>> Curiously, this breaks for me.  It is understandable if we are not
>> producing byte-for-byte identical output with internal gzip.

Makes sense in retrospect, there's no reason the output of gzip(1) and
zlib would have to be the same exactly.  It just happened to be so on my
platform, so the tests deceptively passed for me.  I think we simply
have to drop those that try to compare compressed files made by
different tools -- we can still check if their content can be extracted
and matches.

> Indeed, I can reproduce this, too. In particular, `j.tgz` and `j1.tar.gz`
> differ like this in my test run:
>
> -00000000  1f 8b 08 1a 00 2e ca 09  00 03 04 00 89 45 fc 83 |.............E..|
> +00000000  1f 8b 08 1a 00 35 2a 10  00 03 04 00 89 45 fc 83 |.....5*......E..|
>
> and
>
> -00000010  7d fc 00 f1 d0 ec b7 63  8c 30 cc 9b e6 db b6 6d |}......c.0.....m|
> +00000010  7d fc 00 54 ff ec b7 63  8c 30 cc 9b e6 db b6 6d |}..T...c.0.....m|
>
> According to https://datatracker.ietf.org/doc/html/rfc1952#page-5, the
> difference in the first line is the mtime. For reference, this is the
> version with `git -c tar.tgz.command="gzip -cn" archive --format=tgz
> HEAD`:
>
> 00000000  1f 8b 08 00 00 00 00 00  00 03 ec b7 63 8c 30 cc |............c.0.|
>
> In other words, `gzip` forces the `mtim` member to all zeros, which makes
> sense.

And that's what zlib does as well for me:

   $ ./git-archive --format=tgz HEAD | hexdump -C | head -1
   00000000  1f 8b 08 00 00 00 00 00  00 03 ec bd 59 73 1c 49  |..........?Ys.I|

>
> The recorded mtimes are a bit funny, according to
> https://wolf-tungsten.github.io/gzip-analyzer/, they are 1975-03-17
> 00:36:32 and 1978-08-05 22:45:36, respectively...
>
> And the mtime actually changes all the time.
>
> What's even more funny: if I comment out the `deflateSetHeader()`, the
> mtime header field is left at all-zeros. This is on Ubuntu 18.04 with
> zlib1g 1:1.2.11.dfsg-0ubuntu2.
>
> So I dug in a bit deeper and what do you know, the `deflateHeader()`
> function is implemented like this
> (https://github.com/madler/zlib/blob/21767c654d31/deflate.c#L557-L565):
>
> 	int ZEXPORT deflateSetHeader (strm, head)
> 	    z_streamp strm;
> 	    gz_headerp head;
> 	{
> 	    if (deflateStateCheck(strm) || strm->state->wrap != 2)
> 		return Z_STREAM_ERROR;
> 	    strm->state->gzhead = head;
> 	    return Z_OK;
> 	}
>
> Now, the caller is implemented like this:
>
> 	static void tgz_set_os(git_zstream *strm, int os)
> 	{
> 	#if ZLIB_VERNUM >= 0x1221
> 		struct gz_header_s gzhead = { .os = os };
> 		deflateSetHeader(&strm->z, &gzhead);
> 	#endif
> 	}
>
> The biggest problem is not that the return value of `deflateSetHeader()`
> is ignored. The biggest problem is that it passes the address of a heap
> variable to the `deflateSetHeader()` function, which then stores it away
> in another struct that lives beyond the point when we return from
> `tgz_set_os()`.

Ah, you mean the address of an automatic variable on the stack, but I
get it.  D'oh!

>
> In other words, this is the very issue I pointed out as GCC not catching:
> https://lore.kernel.org/git/nycvar.QRO.7.76.6.2205272235220.349@tvgsbejvaqbjf.bet/
>
> The solution is to move the heap variable back into a scope that matches
> the lifetime of the compression:
>
> -- snip --
> diff --git a/archive-tar.c b/archive-tar.c
> index 60669eb7b9c..3d77e0f7509 100644
> --- a/archive-tar.c
> +++ b/archive-tar.c
> @@ -460,17 +460,12 @@ static void tgz_write_block(const void *data)
>
>  static const char internal_gzip_command[] = "git archive gzip";
>
> -static void tgz_set_os(git_zstream *strm, int os)
> -{
> -#if ZLIB_VERNUM >= 0x1221
> -	struct gz_header_s gzhead = { .os = os };
> -	deflateSetHeader(&strm->z, &gzhead);
> -#endif
> -}
> -
>  static int write_tar_filter_archive(const struct archiver *ar,
>  				    struct archiver_args *args)
>  {
> +#if ZLIB_VERNUM >= 0x1221
> +	struct gz_header_s gzhead = { .os = 3 }; /* Unix, for reproducibility */
> +#endif
>  	struct strbuf cmd = STRBUF_INIT;
>  	struct child_process filter = CHILD_PROCESS_INIT;
>  	int r;
> @@ -481,7 +476,10 @@ static int write_tar_filter_archive(const struct archiver *ar,
>  	if (!strcmp(ar->filter_command, internal_gzip_command)) {
>  		write_block = tgz_write_block;
>  		git_deflate_init_gzip(&gzstream, args->compression_level);
> -		tgz_set_os(&gzstream, 3); /* Unix, for reproducibility */
> +#if ZLIB_VERNUM >= 0x1221
> +		if (deflateSetHeader(&gzstream.z, &gzhead) != Z_OK)
> +			BUG("deflateSetHeader() called too late");
> +#endif
>  		gzstream.next_out = outbuf;
>  		gzstream.avail_out = sizeof(outbuf);
>
> -- snap --

Good find, thank you!  A shorter solution would be to make gzhead static.

>
> With this, the test passes for me.
>
> René, would you mind squashing this into your patch series?
>
> Thank you,
> Dscho

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v3 5/5] archive-tar: use internal gzip by default
  2022-06-14 15:47         ` René Scharfe
@ 2022-06-14 15:56           ` René Scharfe
  2022-06-14 16:29           ` Johannes Schindelin
  1 sibling, 0 replies; 74+ messages in thread
From: René Scharfe @ 2022-06-14 15:56 UTC (permalink / raw)
  To: Johannes Schindelin, Junio C Hamano
  Cc: git, Rohit Ashiwal, Ævar Arnfjörð Bjarmason,
	Jeff King, brian m . carlson

Am 14.06.22 um 17:47 schrieb René Scharfe:
> Am 14.06.22 um 13:27 schrieb Johannes Schindelin:
>> Hi Junio,
>>
>> On Mon, 13 Jun 2022, Junio C Hamano wrote:
>>
>>> René Scharfe <l.s.r@web.de> writes:
>>>
>>>> -test_expect_success GZIP 'git archive --format=tar.gz' '
>>>> +test_expect_success 'git archive --format=tar.gz' '
>>>>  	git archive --format=tar.gz HEAD >j1.tar.gz &&
>>>>  	test_cmp_bin j.tgz j1.tar.gz
>>>>  '
>>>
>>> Curiously, this breaks for me.  It is understandable if we are not
>>> producing byte-for-byte identical output with internal gzip.
>
> Makes sense in retrospect, there's no reason the output of gzip(1) and
> zlib would have to be the same exactly.  It just happened to be so on my
> platform, so the tests deceptively passed for me.  I think we simply
> have to drop those that try to compare compressed files made by
> different tools -- we can still check if their content can be extracted
> and matches.

I have to take that back, I was confused -- the tests are fine.  There are no
comparisons between gzip-generated and zlib-generated files.  There's just
the gzhead use-after-return error that Dscho discovered.

René

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v3 5/5] archive-tar: use internal gzip by default
  2022-06-14 15:47         ` René Scharfe
  2022-06-14 15:56           ` René Scharfe
@ 2022-06-14 16:29           ` Johannes Schindelin
  2022-06-14 20:04             ` René Scharfe
  1 sibling, 1 reply; 74+ messages in thread
From: Johannes Schindelin @ 2022-06-14 16:29 UTC (permalink / raw)
  To: René Scharfe
  Cc: Junio C Hamano, git, Rohit Ashiwal,
	Ævar Arnfjörð Bjarmason, Jeff King,
	brian m . carlson

[-- Attachment #1: Type: text/plain, Size: 2017 bytes --]

Hi René,

On Tue, 14 Jun 2022, René Scharfe wrote:

> Am 14.06.22 um 13:27 schrieb Johannes Schindelin:
>
> > The solution is to move the heap variable back into a scope that matches
> > the lifetime of the compression:
> >
> > -- snip --
> > diff --git a/archive-tar.c b/archive-tar.c
> > index 60669eb7b9c..3d77e0f7509 100644
> > --- a/archive-tar.c
> > +++ b/archive-tar.c
> > @@ -460,17 +460,12 @@ static void tgz_write_block(const void *data)
> >
> >  static const char internal_gzip_command[] = "git archive gzip";
> >
> > -static void tgz_set_os(git_zstream *strm, int os)
> > -{
> > -#if ZLIB_VERNUM >= 0x1221
> > -	struct gz_header_s gzhead = { .os = os };
> > -	deflateSetHeader(&strm->z, &gzhead);
> > -#endif
> > -}
> > -
> >  static int write_tar_filter_archive(const struct archiver *ar,
> >  				    struct archiver_args *args)
> >  {
> > +#if ZLIB_VERNUM >= 0x1221
> > +	struct gz_header_s gzhead = { .os = 3 }; /* Unix, for reproducibility */
> > +#endif
> >  	struct strbuf cmd = STRBUF_INIT;
> >  	struct child_process filter = CHILD_PROCESS_INIT;
> >  	int r;
> > @@ -481,7 +476,10 @@ static int write_tar_filter_archive(const struct archiver *ar,
> >  	if (!strcmp(ar->filter_command, internal_gzip_command)) {
> >  		write_block = tgz_write_block;
> >  		git_deflate_init_gzip(&gzstream, args->compression_level);
> > -		tgz_set_os(&gzstream, 3); /* Unix, for reproducibility */
> > +#if ZLIB_VERNUM >= 0x1221
> > +		if (deflateSetHeader(&gzstream.z, &gzhead) != Z_OK)
> > +			BUG("deflateSetHeader() called too late");
> > +#endif
> >  		gzstream.next_out = outbuf;
> >  		gzstream.avail_out = sizeof(outbuf);
> >
> > -- snap --
>
> Good find, thank you!  A shorter solution would be to make gzhead static.

I should have said that I had considered this, but decided against it
because it would introduce yet another issue: it would render the code
needlessly un-multi-threadable. And that can be avoided _really_ easily.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v3 5/5] archive-tar: use internal gzip by default
  2022-06-14 16:29           ` Johannes Schindelin
@ 2022-06-14 20:04             ` René Scharfe
  2022-06-15 16:41               ` Junio C Hamano
  0 siblings, 1 reply; 74+ messages in thread
From: René Scharfe @ 2022-06-14 20:04 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Junio C Hamano, git, Rohit Ashiwal,
	Ævar Arnfjörð Bjarmason, Jeff King,
	brian m . carlson

Am 14.06.22 um 18:29 schrieb Johannes Schindelin:
> Hi René,
>
> On Tue, 14 Jun 2022, René Scharfe wrote:
>
>> Am 14.06.22 um 13:27 schrieb Johannes Schindelin:
>>
>>> The solution is to move the heap variable back into a scope that matches
>>> the lifetime of the compression:
>>>
>>> -- snip --
>>> diff --git a/archive-tar.c b/archive-tar.c
>>> index 60669eb7b9c..3d77e0f7509 100644
>>> --- a/archive-tar.c
>>> +++ b/archive-tar.c
>>> @@ -460,17 +460,12 @@ static void tgz_write_block(const void *data)
>>>
>>>  static const char internal_gzip_command[] = "git archive gzip";
>>>
>>> -static void tgz_set_os(git_zstream *strm, int os)
>>> -{
>>> -#if ZLIB_VERNUM >= 0x1221
>>> -	struct gz_header_s gzhead = { .os = os };
>>> -	deflateSetHeader(&strm->z, &gzhead);
>>> -#endif
>>> -}
>>> -
>>>  static int write_tar_filter_archive(const struct archiver *ar,
>>>  				    struct archiver_args *args)
>>>  {
>>> +#if ZLIB_VERNUM >= 0x1221
>>> +	struct gz_header_s gzhead = { .os = 3 }; /* Unix, for reproducibility */
>>> +#endif
>>>  	struct strbuf cmd = STRBUF_INIT;
>>>  	struct child_process filter = CHILD_PROCESS_INIT;
>>>  	int r;
>>> @@ -481,7 +476,10 @@ static int write_tar_filter_archive(const struct archiver *ar,
>>>  	if (!strcmp(ar->filter_command, internal_gzip_command)) {
>>>  		write_block = tgz_write_block;
>>>  		git_deflate_init_gzip(&gzstream, args->compression_level);
>>> -		tgz_set_os(&gzstream, 3); /* Unix, for reproducibility */
>>> +#if ZLIB_VERNUM >= 0x1221
>>> +		if (deflateSetHeader(&gzstream.z, &gzhead) != Z_OK)
>>> +			BUG("deflateSetHeader() called too late");
>>> +#endif
>>>  		gzstream.next_out = outbuf;
>>>  		gzstream.avail_out = sizeof(outbuf);
>>>
>>> -- snap --
>>
>> Good find, thank you!  A shorter solution would be to make gzhead static.
>
> I should have said that I had considered this, but decided against it
> because it would introduce yet another issue: it would render the code
> needlessly un-multi-threadable. And that can be avoided _really_ easily.

archive-tar.c (and archive-zip.c) use other static variables, so a
static gzhead won't break or block anything in this regard.  There was
no interest in running it in parallel threads so far AFAIK, and it's
hard for me to imagine the usefulness of creating multiple .tgz files at
the same time.

The doubled ZLIB_VERNUM is unsightly and I'm not sure the BUG check is
useful -- I omitted error checking because there is no recurse for us if
deflateSetHeader() doesn't work, and on ancient zlib versions we
silently continue anyway.

But that's all just minor quibbling -- I'll include your changes in the
next version, they look fine overall.

René

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v3 0/5] Avoid spawning gzip in git archive
  2022-06-14 11:28   ` [PATCH v3 0/5] Avoid spawning gzip in git archive Johannes Schindelin
@ 2022-06-14 20:05     ` René Scharfe
  2022-06-30 18:55       ` Johannes Schindelin
  0 siblings, 1 reply; 74+ messages in thread
From: René Scharfe @ 2022-06-14 20:05 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: git, Junio C Hamano, Rohit Ashiwal,
	Ævar Arnfjörð Bjarmason, Jeff King,
	brian m . carlson

Am 14.06.22 um 13:28 schrieb Johannes Schindelin:
>
> By the way, the main reason why I did not work more is that in
> http://madler.net/pipermail/zlib-devel_madler.net/2019-December/003308.html,
> Mark Adler (the zlib maintainer) announced that...
>
>> [...] There are many well-tested performance improvements in zlib
>> waiting in the wings that will be incorporated over the next several
>> months. [...]
>
> This was in December 2019. And now it's June 2022 and I kind of wonder
> whether those promised improvements will still come.
>
> In the meantime, however, a viable alternative seems to have cropped up:
> https://github.com/zlib-ng/zlib-ng. Essentially, it looks as if it is what
> zlib should have become after above-quoted announcement.
>
> In particular the CPU intrinsics support (think MMX, SSE2/3, etc) seem to
> be very interesting and I would not be completely surprised if building
> Git with your patches and linking against zlib-ng would paint a very
> favorable picture not only in terms of CPU time but also in terms of
> wallclock time. Sadly, I have not been able to set aside time to look into
> that angle, but maybe I can peak your interest?
I was unable to preload zlib-ng using DYLD_INSERT_LIBRARIES on macOS
12.4 so far.  The included demo proggy looks impressive, though:

$ hyperfine -w3 -L gzip gzip,../zlib-ng/minigzip "git -C ../linux archive --format=tar HEAD | {gzip} -c"
Benchmark #1: git -C ../linux archive --format=tar HEAD | gzip -c
  Time (mean ± σ):     20.424 s ±  0.006 s    [User: 23.964 s, System: 0.432 s]
  Range (min … max):   20.414 s … 20.434 s    10 runs

Benchmark #2: git -C ../linux archive --format=tar HEAD | ../zlib-ng/minigzip -c
  Time (mean ± σ):     12.158 s ±  0.006 s    [User: 13.908 s, System: 0.376 s]
  Range (min … max):   12.145 s … 12.166 s    10 runs

Summary
  'git -C ../linux archive --format=tar HEAD | ../zlib-ng/minigzip -c' ran
    1.68 ± 0.00 times faster than 'git -C ../linux archive --format=tar HEAD | gzip -c'

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v3 5/5] archive-tar: use internal gzip by default
  2022-06-14 20:04             ` René Scharfe
@ 2022-06-15 16:41               ` Junio C Hamano
  0 siblings, 0 replies; 74+ messages in thread
From: Junio C Hamano @ 2022-06-15 16:41 UTC (permalink / raw)
  To: René Scharfe
  Cc: Johannes Schindelin, git, Rohit Ashiwal,
	Ævar Arnfjörð Bjarmason, Jeff King,
	brian m . carlson

René Scharfe <l.s.r@web.de> writes:

> archive-tar.c (and archive-zip.c) use other static variables, so a
> static gzhead won't break or block anything in this regard.  There was
> no interest in running it in parallel threads so far AFAIK, and it's
> hard for me to imagine the usefulness of creating multiple .tgz files at
> the same time.

;-)  FWIW I had exactly the same reaction.  

If this were a lot isolated piece of helper function, perhaps, but
no, reentrancy for this helper function specifically written for
this code path is not a very good argument.

The code structure the (not very good) argument tried to suport,
however, is a good thing to have regardless, and if the code is
written in such a way from the beginning, there is no reason to
reject it.  If it opens the door to a unified way to deal with all
the other static global variables (e.g. have a "archiver_state"
structure that collects all of them, with this one included, and
pass it around, or something), that would be great.

Short of that, I do not care too much either way.  As this topic is
not even in 'next', "once written in one way, it is not worth the
code churn to rewrite it in the other way, when these two ways are
both reasonable" does not yet apply here, so ...

> But that's all just minor quibbling -- I'll include your changes in the
> next version, they look fine overall.

... that's fine, too.

Thanks, both.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH v4 0/6] Avoid spawning gzip in git archive
  2019-04-12 23:04 [PATCH 0/2] Avoid spawning gzip in git archive Johannes Schindelin via GitGitGadget
                   ` (3 preceding siblings ...)
  2022-06-12  6:00 ` [PATCH v3 0/5] Avoid spawning gzip in git archive René Scharfe
@ 2022-06-15 16:53 ` René Scharfe
  2022-06-15 16:58   ` [PATCH v4 1/6] archive: update format documentation René Scharfe
                     ` (5 more replies)
  4 siblings, 6 replies; 74+ messages in thread
From: René Scharfe @ 2022-06-15 16:53 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Johannes Schindelin, Rohit Ashiwal,
	Ævar Arnfjörð Bjarmason, Jeff King,
	brian m . carlson

Changes since v3:
- Use deflateSetHeader() correctly, thanks to Dscho.
- New patch to update the format-related documentation.

  archive: update format documentation
  archive: rename archiver data field to filter_command
  archive-tar: factor out write_block()
  archive-tar: add internal gzip implementation
  archive-tar: use OS_CODE 3 (Unix) for internal gzip
  archive-tar: use internal gzip by default

 Documentation/git-archive.txt | 21 +++++-----
 archive-tar.c                 | 77 ++++++++++++++++++++++++++++++-----
 archive.h                     |  2 +-
 t/t5000-tar-tree.sh           | 28 ++++++++++---
 4 files changed, 100 insertions(+), 28 deletions(-)

Range-Diff vs. v3:
-:  ---------- > 1:  67369ed452 archive: update format documentation
1:  73ccd190bd = 2:  6a7cce50ef archive: rename archiver data field to filter_command
2:  352cff7163 = 3:  c86e82bee8 archive-tar: factor out write_block()
3:  4e7cf97631 ! 4:  6196b0e39d archive-tar: add internal gzip implementation
    @@ Commit message

      ## Documentation/git-archive.txt ##
     @@ Documentation/git-archive.txt: tar.<format>.command::
    - 	format is given.
    + 	to the command (e.g., `-9`).
      +
    - The "tar.gz" and "tgz" formats are defined automatically and default to
    --`gzip -cn`. You may override them with custom commands.
    -+`gzip -cn`. You may override them with custom commands. An internal gzip
    -+implementation can be used by specifying the value `git archive gzip`.
    + The `tar.gz` and `tgz` formats are defined automatically and use the
    +-command `gzip -cn` by default.
    ++command `gzip -cn` by default. An internal gzip implementation can be
    ++used by specifying the value `git archive gzip`.

      tar.<format>.remote::
    - 	If true, enable `<format>` for use by remote clients via
    + 	If true, enable the format for use by remote clients via

      ## archive-tar.c ##
     @@ archive-tar.c: static int write_tar_filter_archive(const struct archiver *ar,
4:  cb2bbe9f6d < -:  ---------- archive-tar: use OS_CODE 3 (Unix) for internal gzip
-:  ---------- > 5:  19d286af6a archive-tar: use OS_CODE 3 (Unix) for internal gzip
5:  5dd968ced1 ! 6:  74683137af archive-tar: use internal gzip by default
    @@ Commit message

      ## Documentation/git-archive.txt ##
     @@ Documentation/git-archive.txt: tar.<format>.command::
    - 	format is given.
    + 	to the command (e.g., `-9`).
      +
    - The "tar.gz" and "tgz" formats are defined automatically and default to
    --`gzip -cn`. You may override them with custom commands. An internal gzip
    --implementation can be used by specifying the value `git archive gzip`.
    -+the magic value `git archive gzip`, which invokes an internal
    -+implementation of gzip. You may override them with custom commands.
    + The `tar.gz` and `tgz` formats are defined automatically and use the
    +-command `gzip -cn` by default. An internal gzip implementation can be
    +-used by specifying the value `git archive gzip`.
    ++magic command `git archive gzip` by default, which invokes an internal
    ++implementation of gzip.

      tar.<format>.remote::
    - 	If true, enable `<format>` for use by remote clients via
    + 	If true, enable the format for use by remote clients via

      ## archive-tar.c ##
     @@ archive-tar.c: void init_tar_archiver(void)
--
2.36.1

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH v4 1/6] archive: update format documentation
  2022-06-15 16:53 ` [PATCH v4 0/6] " René Scharfe
@ 2022-06-15 16:58   ` René Scharfe
  2022-06-15 16:59   ` [PATCH v4 2/6] archive: rename archiver data field to filter_command René Scharfe
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 74+ messages in thread
From: René Scharfe @ 2022-06-15 16:58 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Johannes Schindelin, Rohit Ashiwal,
	Ævar Arnfjörð Bjarmason, Jeff King,
	brian m . carlson

Mention all formats in the --format section, use backtick quoting for
literal values throughout, clarify the description of the configuration
option.

Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: René Scharfe <l.s.r@web.de>
---
 Documentation/git-archive.txt | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/Documentation/git-archive.txt b/Documentation/git-archive.txt
index 56989a2f34..ff3f7b0344 100644
--- a/Documentation/git-archive.txt
+++ b/Documentation/git-archive.txt
@@ -34,10 +34,12 @@ OPTIONS
 -------

 --format=<fmt>::
-	Format of the resulting archive: 'tar' or 'zip'. If this option
+	Format of the resulting archive. Possible values are `tar`,
+	`zip`, `tar.gz`, `tgz`, and any format defined using the
+	configuration option `tar.<format>.command`. If `--format`
 	is not given, and the output file is specified, the format is
-	inferred from the filename if possible (e.g. writing to "foo.zip"
-	makes the output to be in the zip format). Otherwise the output
+	inferred from the filename if possible (e.g. writing to `foo.zip`
+	makes the output to be in the `zip` format). Otherwise the output
 	format is `tar`.

 -l::
@@ -143,17 +145,15 @@ tar.<format>.command::
 	is executed using the shell with the generated tar file on its
 	standard input, and should produce the final output on its
 	standard output. Any compression-level options will be passed
-	to the command (e.g., "-9"). An output file with the same
-	extension as `<format>` will be use this format if no other
-	format is given.
+	to the command (e.g., `-9`).
 +
-The "tar.gz" and "tgz" formats are defined automatically and default to
-`gzip -cn`. You may override them with custom commands.
+The `tar.gz` and `tgz` formats are defined automatically and use the
+command `gzip -cn` by default.

 tar.<format>.remote::
-	If true, enable `<format>` for use by remote clients via
+	If true, enable the format for use by remote clients via
 	linkgit:git-upload-archive[1]. Defaults to false for
-	user-defined formats, but true for the "tar.gz" and "tgz"
+	user-defined formats, but true for the `tar.gz` and `tgz`
 	formats.

 [[ATTRIBUTES]]
--
2.36.1

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH v4 2/6] archive: rename archiver data field to filter_command
  2022-06-15 16:53 ` [PATCH v4 0/6] " René Scharfe
  2022-06-15 16:58   ` [PATCH v4 1/6] archive: update format documentation René Scharfe
@ 2022-06-15 16:59   ` René Scharfe
  2022-06-15 17:01   ` [PATCH v4 3/6] archive-tar: factor out write_block() René Scharfe
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 74+ messages in thread
From: René Scharfe @ 2022-06-15 16:59 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Johannes Schindelin, Rohit Ashiwal,
	Ævar Arnfjörð Bjarmason, Jeff King,
	brian m . carlson

The void pointer "data" in struct archiver is only used to store filter
commands to pass tar archives to, like gzip.  Rename it accordingly and
also turn it into a char pointer to document the fact that it's a string
reference.

Signed-off-by: René Scharfe <l.s.r@web.de>
---
 archive-tar.c | 10 +++++-----
 archive.h     |  2 +-
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/archive-tar.c b/archive-tar.c
index 042feb66d2..2717e34a1d 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -383,8 +383,8 @@ static int tar_filter_config(const char *var, const char *value, void *data)
 	if (!strcmp(type, "command")) {
 		if (!value)
 			return config_error_nonbool(var);
-		free(ar->data);
-		ar->data = xstrdup(value);
+		free(ar->filter_command);
+		ar->filter_command = xstrdup(value);
 		return 0;
 	}
 	if (!strcmp(type, "remote")) {
@@ -432,10 +432,10 @@ static int write_tar_filter_archive(const struct archiver *ar,
 	struct child_process filter = CHILD_PROCESS_INIT;
 	int r;

-	if (!ar->data)
+	if (!ar->filter_command)
 		BUG("tar-filter archiver called with no filter defined");

-	strbuf_addstr(&cmd, ar->data);
+	strbuf_addstr(&cmd, ar->filter_command);
 	if (args->compression_level >= 0)
 		strbuf_addf(&cmd, " -%d", args->compression_level);

@@ -478,7 +478,7 @@ void init_tar_archiver(void)
 	git_config(git_tar_config, NULL);
 	for (i = 0; i < nr_tar_filters; i++) {
 		/* omit any filters that never had a command configured */
-		if (tar_filters[i]->data)
+		if (tar_filters[i]->filter_command)
 			register_archiver(tar_filters[i]);
 	}
 }
diff --git a/archive.h b/archive.h
index 49fab71aaf..08bed3ed3a 100644
--- a/archive.h
+++ b/archive.h
@@ -43,7 +43,7 @@ struct archiver {
 	const char *name;
 	int (*write_archive)(const struct archiver *, struct archiver_args *);
 	unsigned flags;
-	void *data;
+	char *filter_command;
 };
 void register_archiver(struct archiver *);

--
2.36.1

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH v4 3/6] archive-tar: factor out write_block()
  2022-06-15 16:53 ` [PATCH v4 0/6] " René Scharfe
  2022-06-15 16:58   ` [PATCH v4 1/6] archive: update format documentation René Scharfe
  2022-06-15 16:59   ` [PATCH v4 2/6] archive: rename archiver data field to filter_command René Scharfe
@ 2022-06-15 17:01   ` René Scharfe
  2022-06-15 17:02   ` [PATCH v4 4/6] archive-tar: add internal gzip implementation René Scharfe
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 74+ messages in thread
From: René Scharfe @ 2022-06-15 17:01 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Johannes Schindelin, Rohit Ashiwal,
	Ævar Arnfjörð Bjarmason, Jeff King,
	brian m . carlson

All tar archive writes have the same size and are done to the same file
descriptor.  Move them to a common function, write_block(), to reduce
code duplication and make it easy to change the destination.

Original-patch-by: Rohit Ashiwal <rohit.ashiwal265@gmail.com>
Signed-off-by: René Scharfe <l.s.r@web.de>
---
 archive-tar.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/archive-tar.c b/archive-tar.c
index 2717e34a1d..4e6a3deb80 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -38,11 +38,16 @@ static int write_tar_filter_archive(const struct archiver *ar,
 #define USTAR_MAX_MTIME 077777777777ULL
 #endif

+static void write_block(const void *buf)
+{
+	write_or_die(1, buf, BLOCKSIZE);
+}
+
 /* writes out the whole block, but only if it is full */
 static void write_if_needed(void)
 {
 	if (offset == BLOCKSIZE) {
-		write_or_die(1, block, BLOCKSIZE);
+		write_block(block);
 		offset = 0;
 	}
 }
@@ -66,7 +71,7 @@ static void do_write_blocked(const void *data, unsigned long size)
 		write_if_needed();
 	}
 	while (size >= BLOCKSIZE) {
-		write_or_die(1, buf, BLOCKSIZE);
+		write_block(buf);
 		size -= BLOCKSIZE;
 		buf += BLOCKSIZE;
 	}
@@ -101,10 +106,10 @@ static void write_trailer(void)
 {
 	int tail = BLOCKSIZE - offset;
 	memset(block + offset, 0, tail);
-	write_or_die(1, block, BLOCKSIZE);
+	write_block(block);
 	if (tail < 2 * RECORDSIZE) {
 		memset(block, 0, offset);
-		write_or_die(1, block, BLOCKSIZE);
+		write_block(block);
 	}
 }

--
2.36.1

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH v4 4/6] archive-tar: add internal gzip implementation
  2022-06-15 16:53 ` [PATCH v4 0/6] " René Scharfe
                     ` (2 preceding siblings ...)
  2022-06-15 17:01   ` [PATCH v4 3/6] archive-tar: factor out write_block() René Scharfe
@ 2022-06-15 17:02   ` René Scharfe
  2022-06-15 20:32     ` Ævar Arnfjörð Bjarmason
  2022-06-15 17:04   ` [PATCH v4 5/6] archive-tar: use OS_CODE 3 (Unix) for internal gzip René Scharfe
  2022-06-15 17:05   ` [PATCH v4 6/6] archive-tar: use internal gzip by default René Scharfe
  5 siblings, 1 reply; 74+ messages in thread
From: René Scharfe @ 2022-06-15 17:02 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Johannes Schindelin, Rohit Ashiwal,
	Ævar Arnfjörð Bjarmason, Jeff King,
	brian m . carlson

Git uses zlib for its own object store, but calls gzip when creating tgz
archives.  Add an option to perform the gzip compression for the latter
using zlib, without depending on the external gzip binary.

Plug it in by making write_block a function pointer and switching to a
compressing variant if the filter command has the magic value "git
archive gzip".  Does that indirection slow down tar creation?  Not
really, at least not in this test:

$ hyperfine -w3 -L rev HEAD,origin/main -p 'git checkout {rev} && make' \
'./git -C ../linux archive --format=tar HEAD # {rev}'
Benchmark #1: ./git -C ../linux archive --format=tar HEAD # HEAD
  Time (mean ± σ):      4.044 s ±  0.007 s    [User: 3.901 s, System: 0.137 s]
  Range (min … max):    4.038 s …  4.059 s    10 runs

Benchmark #2: ./git -C ../linux archive --format=tar HEAD # origin/main
  Time (mean ± σ):      4.047 s ±  0.009 s    [User: 3.903 s, System: 0.138 s]
  Range (min … max):    4.038 s …  4.066 s    10 runs

How does tgz creation perform?

$ hyperfine -w3 -L command 'gzip -cn','git archive gzip' \
'./git -c tar.tgz.command="{command}" -C ../linux archive --format=tgz HEAD'
Benchmark #1: ./git -c tar.tgz.command="gzip -cn" -C ../linux archive --format=tgz HEAD
  Time (mean ± σ):     20.404 s ±  0.006 s    [User: 23.943 s, System: 0.401 s]
  Range (min … max):   20.395 s … 20.414 s    10 runs

Benchmark #2: ./git -c tar.tgz.command="git archive gzip" -C ../linux archive --format=tgz HEAD
  Time (mean ± σ):     23.807 s ±  0.023 s    [User: 23.655 s, System: 0.145 s]
  Range (min … max):   23.782 s … 23.857 s    10 runs

Summary
  './git -c tar.tgz.command="gzip -cn" -C ../linux archive --format=tgz HEAD' ran
    1.17 ± 0.00 times faster than './git -c tar.tgz.command="git archive gzip" -C ../linux archive --format=tgz HEAD'

So the internal implementation takes 17% longer on the Linux repo, but
uses 2% less CPU time.  That's because the external gzip can run in
parallel on its own processor, while the internal one works sequentially
and avoids the inter-process communication overhead.

What are the benefits?  Only an internal sequential implementation can
offer this eco mode, and it allows avoiding the gzip(1) requirement.

This implementation uses the helper functions from our zlib.c instead of
the convenient gz* functions from zlib, because the latter doesn't give
the control over the generated gzip header that the next patch requires.

Original-patch-by: Rohit Ashiwal <rohit.ashiwal265@gmail.com>
Signed-off-by: René Scharfe <l.s.r@web.de>
---
 Documentation/git-archive.txt |  3 ++-
 archive-tar.c                 | 45 ++++++++++++++++++++++++++++++++++-
 t/t5000-tar-tree.sh           | 16 +++++++++++++
 3 files changed, 62 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-archive.txt b/Documentation/git-archive.txt
index ff3f7b0344..b2d1b63d31 100644
--- a/Documentation/git-archive.txt
+++ b/Documentation/git-archive.txt
@@ -148,7 +148,8 @@ tar.<format>.command::
 	to the command (e.g., `-9`).
 +
 The `tar.gz` and `tgz` formats are defined automatically and use the
-command `gzip -cn` by default.
+command `gzip -cn` by default. An internal gzip implementation can be
+used by specifying the value `git archive gzip`.

 tar.<format>.remote::
 	If true, enable the format for use by remote clients via
diff --git a/archive-tar.c b/archive-tar.c
index 4e6a3deb80..53d0ef685c 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -38,11 +38,13 @@ static int write_tar_filter_archive(const struct archiver *ar,
 #define USTAR_MAX_MTIME 077777777777ULL
 #endif

-static void write_block(const void *buf)
+static void tar_write_block(const void *buf)
 {
 	write_or_die(1, buf, BLOCKSIZE);
 }

+static void (*write_block)(const void *) = tar_write_block;
+
 /* writes out the whole block, but only if it is full */
 static void write_if_needed(void)
 {
@@ -430,6 +432,34 @@ static int write_tar_archive(const struct archiver *ar,
 	return err;
 }

+static git_zstream gzstream;
+static unsigned char outbuf[16384];
+
+static void tgz_deflate(int flush)
+{
+	while (gzstream.avail_in || flush == Z_FINISH) {
+		int status = git_deflate(&gzstream, flush);
+		if (!gzstream.avail_out || status == Z_STREAM_END) {
+			write_or_die(1, outbuf, gzstream.next_out - outbuf);
+			gzstream.next_out = outbuf;
+			gzstream.avail_out = sizeof(outbuf);
+			if (status == Z_STREAM_END)
+				break;
+		}
+		if (status != Z_OK && status != Z_BUF_ERROR)
+			die(_("deflate error (%d)"), status);
+	}
+}
+
+static void tgz_write_block(const void *data)
+{
+	gzstream.next_in = (void *)data;
+	gzstream.avail_in = BLOCKSIZE;
+	tgz_deflate(Z_NO_FLUSH);
+}
+
+static const char internal_gzip_command[] = "git archive gzip";
+
 static int write_tar_filter_archive(const struct archiver *ar,
 				    struct archiver_args *args)
 {
@@ -440,6 +470,19 @@ static int write_tar_filter_archive(const struct archiver *ar,
 	if (!ar->filter_command)
 		BUG("tar-filter archiver called with no filter defined");

+	if (!strcmp(ar->filter_command, internal_gzip_command)) {
+		write_block = tgz_write_block;
+		git_deflate_init_gzip(&gzstream, args->compression_level);
+		gzstream.next_out = outbuf;
+		gzstream.avail_out = sizeof(outbuf);
+
+		r = write_tar_archive(ar, args);
+
+		tgz_deflate(Z_FINISH);
+		git_deflate_end(&gzstream);
+		return r;
+	}
+
 	strbuf_addstr(&cmd, ar->filter_command);
 	if (args->compression_level >= 0)
 		strbuf_addf(&cmd, " -%d", args->compression_level);
diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh
index 7f8d2ab0a7..9ac0ec67fe 100755
--- a/t/t5000-tar-tree.sh
+++ b/t/t5000-tar-tree.sh
@@ -374,6 +374,22 @@ test_expect_success GZIP 'remote tar.gz can be disabled' '
 		>remote.tar.gz
 '

+test_expect_success 'git archive --format=tgz (internal gzip)' '
+	test_config tar.tgz.command "git archive gzip" &&
+	git archive --format=tgz HEAD >internal_gzip.tgz
+'
+
+test_expect_success 'git archive --format=tar.gz (internal gzip)' '
+	test_config tar.tar.gz.command "git archive gzip" &&
+	git archive --format=tar.gz HEAD >internal_gzip.tar.gz &&
+	test_cmp_bin internal_gzip.tgz internal_gzip.tar.gz
+'
+
+test_expect_success GZIP 'extract tgz file (internal gzip)' '
+	gzip -d -c <internal_gzip.tgz >internal_gzip.tar &&
+	test_cmp_bin b.tar internal_gzip.tar
+'
+
 test_expect_success 'archive and :(glob)' '
 	git archive -v HEAD -- ":(glob)**/sh" >/dev/null 2>actual &&
 	cat >expect <<EOF &&
--
2.36.1

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH v4 5/6] archive-tar: use OS_CODE 3 (Unix) for internal gzip
  2022-06-15 16:53 ` [PATCH v4 0/6] " René Scharfe
                     ` (3 preceding siblings ...)
  2022-06-15 17:02   ` [PATCH v4 4/6] archive-tar: add internal gzip implementation René Scharfe
@ 2022-06-15 17:04   ` René Scharfe
  2022-06-15 17:05   ` [PATCH v4 6/6] archive-tar: use internal gzip by default René Scharfe
  5 siblings, 0 replies; 74+ messages in thread
From: René Scharfe @ 2022-06-15 17:04 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Johannes Schindelin, Rohit Ashiwal,
	Ævar Arnfjörð Bjarmason, Jeff King,
	brian m . carlson

gzip(1) encodes the OS it runs on in the 10th byte of its output. It
uses the following OS_CODE values according to its tailor.h [1]:

        0 - MS-DOS
        3 - UNIX
        5 - Atari ST
        6 - OS/2
       10 - TOPS-20
       11 - Windows NT

The gzip.exe that comes with Git for Windows uses OS_CODE 3 for some
reason, so this value is used on practically all supported platforms
when generating tgz archives using gzip(1).

Zlib uses a bigger set of values according to its zutil.h [2], aligned
with section 4.4.2 of the ZIP specification, APPNOTE.txt [3]:

         0 - MS-DOS
         1 - Amiga
         3 - UNIX
         4 - VM/CMS
         5 - Atari ST
         6 - OS/2
         7 - Macintosh
         8 - Z-System
        10 - Windows NT
        11 - MVS (OS/390 - Z/OS)
        13 - Acorn Risc
        16 - BeOS
        18 - OS/400
        19 - OS X (Darwin)

Thus the internal gzip implementation in archive-tar.c sets different
OS_CODE header values on major platforms Windows and macOS.  Git for
Windows uses its own zlib-based variant since v2.20.1 by default and
thus embeds OS_CODE 10 in tgz archives.

The tar archive for a commit is generated consistently on all systems
(by the same Git version).  The OS_CODE in the gzip header does not
influence extraction.  Avoid leaking OS information and make tgz
archives constistent and reproducable (with the same Git and libz
versions) by using OS_CODE 3 everywhere.

At least on macOS 12.4 this produces the same output as gzip(1) for the
examples I tried:

   # before
   $ git -c tar.tgz.command='git archive gzip' archive --format=tgz v2.36.0 | shasum
   3abbffb40b7c63cf9b7d91afc682f11682f80759  -

   # with this patch
   $ git -c tar.tgz.command='git archive gzip' archive --format=tgz v2.36.0 | shasum
   dc6dc6ba9636d522799085d0d77ab6a110bcc141  -

   $ git archive --format=tar v2.36.0 | gzip -cn | shasum
   dc6dc6ba9636d522799085d0d77ab6a110bcc141  -

[1] https://git.savannah.gnu.org/cgit/gzip.git/tree/tailor.h
[2] https://github.com/madler/zlib/blob/master/zutil.h
[3] https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT

Helped-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: René Scharfe <l.s.r@web.de>
---
 archive-tar.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/archive-tar.c b/archive-tar.c
index 53d0ef685c..efba78118b 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -463,6 +463,9 @@ static const char internal_gzip_command[] = "git archive gzip";
 static int write_tar_filter_archive(const struct archiver *ar,
 				    struct archiver_args *args)
 {
+#if ZLIB_VERNUM >= 0x1221
+	struct gz_header_s gzhead = { .os = 3 }; /* Unix, for reproducibility */
+#endif
 	struct strbuf cmd = STRBUF_INIT;
 	struct child_process filter = CHILD_PROCESS_INIT;
 	int r;
@@ -473,6 +476,10 @@ static int write_tar_filter_archive(const struct archiver *ar,
 	if (!strcmp(ar->filter_command, internal_gzip_command)) {
 		write_block = tgz_write_block;
 		git_deflate_init_gzip(&gzstream, args->compression_level);
+#if ZLIB_VERNUM >= 0x1221
+		if (deflateSetHeader(&gzstream.z, &gzhead) != Z_OK)
+			BUG("deflateSetHeader() called too late");
+#endif
 		gzstream.next_out = outbuf;
 		gzstream.avail_out = sizeof(outbuf);

--
2.36.1

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH v4 6/6] archive-tar: use internal gzip by default
  2022-06-15 16:53 ` [PATCH v4 0/6] " René Scharfe
                     ` (4 preceding siblings ...)
  2022-06-15 17:04   ` [PATCH v4 5/6] archive-tar: use OS_CODE 3 (Unix) for internal gzip René Scharfe
@ 2022-06-15 17:05   ` René Scharfe
  5 siblings, 0 replies; 74+ messages in thread
From: René Scharfe @ 2022-06-15 17:05 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Johannes Schindelin, Rohit Ashiwal,
	Ævar Arnfjörð Bjarmason, Jeff King,
	brian m . carlson

Drop the dependency on gzip(1) and use our internal implementation to
create tar.gz and tgz files.

Signed-off-by: René Scharfe <l.s.r@web.de>
---
 Documentation/git-archive.txt |  4 ++--
 archive-tar.c                 |  4 ++--
 t/t5000-tar-tree.sh           | 32 ++++++++++++++++----------------
 3 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/Documentation/git-archive.txt b/Documentation/git-archive.txt
index b2d1b63d31..60c040988b 100644
--- a/Documentation/git-archive.txt
+++ b/Documentation/git-archive.txt
@@ -148,8 +148,8 @@ tar.<format>.command::
 	to the command (e.g., `-9`).
 +
 The `tar.gz` and `tgz` formats are defined automatically and use the
-command `gzip -cn` by default. An internal gzip implementation can be
-used by specifying the value `git archive gzip`.
+magic command `git archive gzip` by default, which invokes an internal
+implementation of gzip.

 tar.<format>.remote::
 	If true, enable the format for use by remote clients via
diff --git a/archive-tar.c b/archive-tar.c
index efba78118b..3d77e0f750 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -526,9 +526,9 @@ void init_tar_archiver(void)
 	int i;
 	register_archiver(&tar_archiver);

-	tar_filter_config("tar.tgz.command", "gzip -cn", NULL);
+	tar_filter_config("tar.tgz.command", internal_gzip_command, NULL);
 	tar_filter_config("tar.tgz.remote", "true", NULL);
-	tar_filter_config("tar.tar.gz.command", "gzip -cn", NULL);
+	tar_filter_config("tar.tar.gz.command", internal_gzip_command, NULL);
 	tar_filter_config("tar.tar.gz.remote", "true", NULL);
 	git_config(git_tar_config, NULL);
 	for (i = 0; i < nr_tar_filters; i++) {
diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh
index 9ac0ec67fe..1a68e89a55 100755
--- a/t/t5000-tar-tree.sh
+++ b/t/t5000-tar-tree.sh
@@ -339,21 +339,21 @@ test_expect_success 'only enabled filters are available remotely' '
 	test_cmp_bin remote.bar config.bar
 '

-test_expect_success GZIP 'git archive --format=tgz' '
+test_expect_success 'git archive --format=tgz' '
 	git archive --format=tgz HEAD >j.tgz
 '

-test_expect_success GZIP 'git archive --format=tar.gz' '
+test_expect_success 'git archive --format=tar.gz' '
 	git archive --format=tar.gz HEAD >j1.tar.gz &&
 	test_cmp_bin j.tgz j1.tar.gz
 '

-test_expect_success GZIP 'infer tgz from .tgz filename' '
+test_expect_success 'infer tgz from .tgz filename' '
 	git archive --output=j2.tgz HEAD &&
 	test_cmp_bin j.tgz j2.tgz
 '

-test_expect_success GZIP 'infer tgz from .tar.gz filename' '
+test_expect_success 'infer tgz from .tar.gz filename' '
 	git archive --output=j3.tar.gz HEAD &&
 	test_cmp_bin j.tgz j3.tar.gz
 '
@@ -363,31 +363,31 @@ test_expect_success GZIP 'extract tgz file' '
 	test_cmp_bin b.tar j.tar
 '

-test_expect_success GZIP 'remote tar.gz is allowed by default' '
+test_expect_success 'remote tar.gz is allowed by default' '
 	git archive --remote=. --format=tar.gz HEAD >remote.tar.gz &&
 	test_cmp_bin j.tgz remote.tar.gz
 '

-test_expect_success GZIP 'remote tar.gz can be disabled' '
+test_expect_success 'remote tar.gz can be disabled' '
 	git config tar.tar.gz.remote false &&
 	test_must_fail git archive --remote=. --format=tar.gz HEAD \
 		>remote.tar.gz
 '

-test_expect_success 'git archive --format=tgz (internal gzip)' '
-	test_config tar.tgz.command "git archive gzip" &&
-	git archive --format=tgz HEAD >internal_gzip.tgz
+test_expect_success GZIP 'git archive --format=tgz (external gzip)' '
+	test_config tar.tgz.command "gzip -cn" &&
+	git archive --format=tgz HEAD >external_gzip.tgz
 '

-test_expect_success 'git archive --format=tar.gz (internal gzip)' '
-	test_config tar.tar.gz.command "git archive gzip" &&
-	git archive --format=tar.gz HEAD >internal_gzip.tar.gz &&
-	test_cmp_bin internal_gzip.tgz internal_gzip.tar.gz
+test_expect_success GZIP 'git archive --format=tar.gz (external gzip)' '
+	test_config tar.tar.gz.command "gzip -cn" &&
+	git archive --format=tar.gz HEAD >external_gzip.tar.gz &&
+	test_cmp_bin external_gzip.tgz external_gzip.tar.gz
 '

-test_expect_success GZIP 'extract tgz file (internal gzip)' '
-	gzip -d -c <internal_gzip.tgz >internal_gzip.tar &&
-	test_cmp_bin b.tar internal_gzip.tar
+test_expect_success GZIP 'extract tgz file (external gzip)' '
+	gzip -d -c <external_gzip.tgz >external_gzip.tar &&
+	test_cmp_bin b.tar external_gzip.tar
 '

 test_expect_success 'archive and :(glob)' '
--
2.36.1

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: [PATCH v4 4/6] archive-tar: add internal gzip implementation
  2022-06-15 17:02   ` [PATCH v4 4/6] archive-tar: add internal gzip implementation René Scharfe
@ 2022-06-15 20:32     ` Ævar Arnfjörð Bjarmason
  2022-06-16 18:55       ` René Scharfe
  0 siblings, 1 reply; 74+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-06-15 20:32 UTC (permalink / raw)
  To: René Scharfe
  Cc: git, Junio C Hamano, Johannes Schindelin, Rohit Ashiwal,
	Jeff King, brian m . carlson

On Wed, Jun 15 2022, René Scharfe wrote:

> Git uses zlib for its own object store, but calls gzip when creating tgz
> archives.  Add an option to perform the gzip compression for the latter
> using zlib, without depending on the external gzip binary.
>
> Plug it in by making write_block a function pointer and switching to a
> compressing variant if the filter command has the magic value "git
> archive gzip".  Does that indirection slow down tar creation?  Not
> really, at least not in this test:
>
> $ hyperfine -w3 -L rev HEAD,origin/main -p 'git checkout {rev} && make' \
> './git -C ../linux archive --format=tar HEAD # {rev}'

Shameless plug: https://lore.kernel.org/git/211201.86r1aw9gbd.gmgdl@evledraar.gmail.com/

I.e. a "hyperfine" wrapper I wrote to make exactly this sort of thing
easier.

You'll find that you need less or no --warmup with it, since the
checkout flip-flopping and re-making (and resulting FS and other cache
eviction) will go away, as we'll use different "git worktree"'s for the
two "rev".

(Also, putting those on a ramdisk really helps)

> Benchmark #1: ./git -C ../linux archive --format=tar HEAD # HEAD
>   Time (mean ± σ):      4.044 s ±  0.007 s    [User: 3.901 s, System: 0.137 s]
>   Range (min … max):    4.038 s …  4.059 s    10 runs
>
> Benchmark #2: ./git -C ../linux archive --format=tar HEAD # origin/main
>   Time (mean ± σ):      4.047 s ±  0.009 s    [User: 3.903 s, System: 0.138 s]
>   Range (min … max):    4.038 s …  4.066 s    10 runs
>
> How does tgz creation perform?
>
> $ hyperfine -w3 -L command 'gzip -cn','git archive gzip' \
> './git -c tar.tgz.command="{command}" -C ../linux archive --format=tgz HEAD'
> Benchmark #1: ./git -c tar.tgz.command="gzip -cn" -C ../linux archive --format=tgz HEAD
>   Time (mean ± σ):     20.404 s ±  0.006 s    [User: 23.943 s, System: 0.401 s]
>   Range (min … max):   20.395 s … 20.414 s    10 runs
>
> Benchmark #2: ./git -c tar.tgz.command="git archive gzip" -C ../linux archive --format=tgz HEAD
>   Time (mean ± σ):     23.807 s ±  0.023 s    [User: 23.655 s, System: 0.145 s]
>   Range (min … max):   23.782 s … 23.857 s    10 runs
>
> Summary
>   './git -c tar.tgz.command="gzip -cn" -C ../linux archive --format=tgz HEAD' ran
>     1.17 ± 0.00 times faster than './git -c tar.tgz.command="git archive gzip" -C ../linux archive --format=tgz HEAD'
>
> So the internal implementation takes 17% longer on the Linux repo, but
> uses 2% less CPU time.  That's because the external gzip can run in
> parallel on its own processor, while the internal one works sequentially
> and avoids the inter-process communication overhead.
>
> What are the benefits?  Only an internal sequential implementation can
> offer this eco mode, and it allows avoiding the gzip(1) requirement.

I had been keeping one eye on this series, but didn't look at it in any
detail.

I found this after reading 6/6, which I think in any case could really
use some "why" summary, which seems to mostly be covered here.

I.e. it's unclear if the "drop the dependency on gzip(1)" in 6/6 is a
reference to the GZIP test dependency, or that our users are unlikely to
have "gzip(1)" on their systems.

If it's the latter I'd much rather (as a user) take a 17% wallclock
improvement over a 2% cost of CPU. I mostly care about my own time, not
that of the CPU.

Can't we have our 6/6 cake much easier and eat it too by learning a
"fallback" mode, i.e. we try to invoke gzip, and if that doesn't work
use the "internal" one?

Re the "eco mode": I also wonder how much of the overhead you're seeing
for both that 17% and 2% would go away if you pin both processes to the
same CPU, I can't recall the command offhand, but IIRC taskset or
numactl can do that. I.e. is this really measuring IPC overhead, or
I-CPU overhead on your system?

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v4 4/6] archive-tar: add internal gzip implementation
  2022-06-15 20:32     ` Ævar Arnfjörð Bjarmason
@ 2022-06-16 18:55       ` René Scharfe
  2022-06-24 11:13         ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 74+ messages in thread
From: René Scharfe @ 2022-06-16 18:55 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Junio C Hamano, Johannes Schindelin, Rohit Ashiwal,
	Jeff King, brian m . carlson

Am 15.06.22 um 22:32 schrieb Ævar Arnfjörð Bjarmason:
>
> On Wed, Jun 15 2022, René Scharfe wrote:
>
>> Git uses zlib for its own object store, but calls gzip when creating tgz
>> archives.  Add an option to perform the gzip compression for the latter
>> using zlib, without depending on the external gzip binary.
>>
>> Plug it in by making write_block a function pointer and switching to a
>> compressing variant if the filter command has the magic value "git
>> archive gzip".  Does that indirection slow down tar creation?  Not
>> really, at least not in this test:
>>
>> $ hyperfine -w3 -L rev HEAD,origin/main -p 'git checkout {rev} && make' \
>> './git -C ../linux archive --format=tar HEAD # {rev}'
>
> Shameless plug: https://lore.kernel.org/git/211201.86r1aw9gbd.gmgdl@evledraar.gmail.com/
>
> I.e. a "hyperfine" wrapper I wrote to make exactly this sort of thing
> easier.
>
> You'll find that you need less or no --warmup with it, since the
> checkout flip-flopping and re-making (and resulting FS and other cache
> eviction) will go away, as we'll use different "git worktree"'s for the
> two "rev".

OK, but requiring hyperfine alone is burden enough for reviewers.

I had a try anyway and it took me a while to realize that git-hyperfine
requires setting the Git config option hyperfine.run-dir band that it
ignores it on my system.  Had to hard-code it in the script.

> (Also, putting those on a ramdisk really helps)
>
>> Benchmark #1: ./git -C ../linux archive --format=tar HEAD # HEAD
>>   Time (mean ± σ):      4.044 s ±  0.007 s    [User: 3.901 s, System: 0.137 s]
>>   Range (min … max):    4.038 s …  4.059 s    10 runs
>>
>> Benchmark #2: ./git -C ../linux archive --format=tar HEAD # origin/main
>>   Time (mean ± σ):      4.047 s ±  0.009 s    [User: 3.903 s, System: 0.138 s]
>>   Range (min … max):    4.038 s …  4.066 s    10 runs
>>
>> How does tgz creation perform?
>>
>> $ hyperfine -w3 -L command 'gzip -cn','git archive gzip' \
>> './git -c tar.tgz.command="{command}" -C ../linux archive --format=tgz HEAD'
>> Benchmark #1: ./git -c tar.tgz.command="gzip -cn" -C ../linux archive --format=tgz HEAD
>>   Time (mean ± σ):     20.404 s ±  0.006 s    [User: 23.943 s, System: 0.401 s]
>>   Range (min … max):   20.395 s … 20.414 s    10 runs
>>
>> Benchmark #2: ./git -c tar.tgz.command="git archive gzip" -C ../linux archive --format=tgz HEAD
>>   Time (mean ± σ):     23.807 s ±  0.023 s    [User: 23.655 s, System: 0.145 s]
>>   Range (min … max):   23.782 s … 23.857 s    10 runs
>>
>> Summary
>>   './git -c tar.tgz.command="gzip -cn" -C ../linux archive --format=tgz HEAD' ran
>>     1.17 ± 0.00 times faster than './git -c tar.tgz.command="git archive gzip" -C ../linux archive --format=tgz HEAD'
>>
>> So the internal implementation takes 17% longer on the Linux repo, but
>> uses 2% less CPU time.  That's because the external gzip can run in
>> parallel on its own processor, while the internal one works sequentially
>> and avoids the inter-process communication overhead.
>>
>> What are the benefits?  Only an internal sequential implementation can
>> offer this eco mode, and it allows avoiding the gzip(1) requirement.
>
> I had been keeping one eye on this series, but didn't look at it in any
> detail.
>
> I found this after reading 6/6, which I think in any case could really
> use some "why" summary, which seems to mostly be covered here.
>
> I.e. it's unclear if the "drop the dependency on gzip(1)" in 6/6 is a
> reference to the GZIP test dependency, or that our users are unlikely to
> have "gzip(1)" on their systems.

It's to avoid a run dependency; the build/test dependency remains.

> If it's the latter I'd much rather (as a user) take a 17% wallclock
> improvement over a 2% cost of CPU. I mostly care about my own time, not
> that of the CPU.

Understandable, and you can set tar.tgz.command='gzip -cn' to get the
old behavior.  Saving energy is a better default, though.

The runtime in the real world probably includes lots more I/O time.  The
tests above are repeated and warmed up to get consistent measurements,
but big repos are probably not fully kept in memory like that.

> Can't we have our 6/6 cake much easier and eat it too by learning a
> "fallback" mode, i.e. we try to invoke gzip, and if that doesn't work
> use the "internal" one?

Interesting idea, but I think the existing config option suffices.  E.g.
a distro could set it in the system-wide config file if/when gzip is
installed.

> Re the "eco mode": I also wonder how much of the overhead you're seeing
> for both that 17% and 2% would go away if you pin both processes to the
> same CPU, I can't recall the command offhand, but IIRC taskset or
> numactl can do that. I.e. is this really measuring IPC overhead, or
> I-CPU overhead on your system?

I'd expect that running git archive and gzip at the same CPU core takes
more wall-clock time than using zlib because inflating the object files
and deflating the archive are done sequentially in both scenarios.
Can't test it on macOS because it doesn't offer a way to pin programs to
a certain core, but e.g. someone with access to a Linux system can check
that using taskset(1).

René

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v4 4/6] archive-tar: add internal gzip implementation
  2022-06-16 18:55       ` René Scharfe
@ 2022-06-24 11:13         ` Ævar Arnfjörð Bjarmason
  2022-06-24 20:24           ` René Scharfe
  0 siblings, 1 reply; 74+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-06-24 11:13 UTC (permalink / raw)
  To: René Scharfe
  Cc: git, Junio C Hamano, Johannes Schindelin, Rohit Ashiwal,
	Jeff King, brian m . carlson


On Thu, Jun 16 2022, René Scharfe wrote:

> Am 15.06.22 um 22:32 schrieb Ævar Arnfjörð Bjarmason:
>> [...]
> Understandable, and you can set tar.tgz.command='gzip -cn' to get the
> old behavior.  Saving energy is a better default, though.

I disagree with that in general, a big reason for why git won out over
other VCS's is that it wasn't as slow. I think we should primarily be
interested in the time a user might end up staring at the screen.

I understand the concern to have "git archive" just work, e.g. if you
uninstall gzip(1) (although that seems rather obscure, but perhaps this
is for more minimal setups).

I don't think saving energy is a virtue, *maybe* it is, but maybe your
computer is powered by hydro, solar or nuclear instead of coal, so even
if we're taking global energy policy into account for changes to git
it's highly context dependant.

In any case, this is also true for pretty much any other git command
that might spawn processes or threads, e.g. "git grep":

	$ hyperfine -w3 -L cpus 0,0-7 'taskset --cpu-list {cpus} ./git grep foo.*bar' -r 10
	Benchmark 1: taskset --cpu-list 0 ./git grep foo.*bar
	  Time (mean ± σ):      39.3 ms ±   1.2 ms    [User: 20.0 ms, System: 18.6 ms]
	  Range (min … max):    38.2 ms …  41.8 ms    10 runs
	
	Benchmark 2: taskset --cpu-list 0-7 ./git grep foo.*bar
	  Time (mean ± σ):      28.1 ms ±   1.3 ms    [User: 43.5 ms, System: 51.0 ms]
	  Range (min … max):    26.6 ms …  31.2 ms    10 runs
	
	Summary
	  'taskset --cpu-list 0-7 ./git grep foo.*bar' ran
	    1.40 ± 0.08 times faster than 'taskset --cpu-list 0 ./git grep foo.*bar'

Here we use less than 1/2 the user/system time when I pin it to 1 cpu,
but we're 40% slower.

So this is a bit of a digression, but this particular thing seems much
better left to the OS or your hardware's CPU throttling policy. To the
extent that we care perhaps more fitting would be to have a global
core.wrapper-cmd option or something, so you could pass all git commands
through "taskset" (or your local equivalent), or just use shell aliases.

> The runtime in the real world probably includes lots more I/O time.  The
> tests above are repeated and warmed up to get consistent measurements,
> but big repos are probably not fully kept in memory like that.
>
>> Can't we have our 6/6 cake much easier and eat it too by learning a
>> "fallback" mode, i.e. we try to invoke gzip, and if that doesn't work
>> use the "internal" one?
>
> Interesting idea, but I think the existing config option suffices.  E.g.
> a distro could set it in the system-wide config file if/when gzip is
> installed.

I think in practice distros are unlikely to have such triggers for
"package X is installed, let's set config Y". I mean, e.g. Debian can do
that with its packaging system, but it's expecting a lot. Why not flip
the default depending on if start_command() fails?

>> Re the "eco mode": I also wonder how much of the overhead you're seeing
>> for both that 17% and 2% would go away if you pin both processes to the
>> same CPU, I can't recall the command offhand, but IIRC taskset or
>> numactl can do that. I.e. is this really measuring IPC overhead, or
>> I-CPU overhead on your system?
>
> I'd expect that running git archive and gzip at the same CPU core takes
> more wall-clock time than using zlib because inflating the object files
> and deflating the archive are done sequentially in both scenarios.
> Can't test it on macOS because it doesn't offer a way to pin programs to
> a certain core, but e.g. someone with access to a Linux system can check
> that using taskset(1).

Here's a benchmark, this is your hyperfine command, just with taskset
added. It's an 8-core box, so 0-7 is "all CPUs" (I think...):

	hyperfine -w3 \
		-L cpus 0,0-7 \
		-L command 'gzip -cn','git archive gzip' \
		'taskset --cpu-list {cpus} ./git -c tar.tgz.command="{command}" archive --format=tgz HEAD'

Which gives me:

	Benchmark 1: taskset --cpu-list 0 ./git -c tar.tgz.command="gzip -cn" archive --format=tgz HEAD
	  Time (mean ± σ):      1.561 s ±  0.029 s    [User: 1.503 s, System: 0.058 s]
	  Range (min … max):    1.522 s …  1.622 s    10 runs
	 
	Benchmark 2: taskset --cpu-list 0-7 ./git -c tar.tgz.command="gzip -cn" archive --format=tgz HEAD
	  Time (mean ± σ):      1.337 s ±  0.029 s    [User: 1.535 s, System: 0.075 s]
	  Range (min … max):    1.298 s …  1.388 s    10 runs
	 
	Benchmark 3: taskset --cpu-list 0 ./git -c tar.tgz.command="git archive gzip" archive --format=tgz HEAD
	  Time (mean ± σ):      1.493 s ±  0.032 s    [User: 1.453 s, System: 0.040 s]
	  Range (min … max):    1.462 s …  1.572 s    10 runs
	 
	Benchmark 4: taskset --cpu-list 0-7 ./git -c tar.tgz.command="git archive gzip" archive --format=tgz HEAD
	  Time (mean ± σ):      1.503 s ±  0.026 s    [User: 1.466 s, System: 0.036 s]
	  Range (min … max):    1.469 s …  1.542 s    10 runs
	 
	Summary
	  'taskset --cpu-list 0-7 ./git -c tar.tgz.command="gzip -cn" archive --format=tgz HEAD' ran
	    1.12 ± 0.03 times faster than 'taskset --cpu-list 0 ./git -c tar.tgz.command="git archive gzip" archive --format=tgz HEAD'
	    1.12 ± 0.03 times faster than 'taskset --cpu-list 0-7 ./git -c tar.tgz.command="git archive gzip" archive --format=tgz HEAD'
	    1.17 ± 0.03 times faster than 'taskset --cpu-list 0 ./git -c tar.tgz.command="gzip -cn" archive --format=tgz HEAD'

Whic I think should control for the IPC overhead v.s. the advantage of
multicore. I.e. we're faster with "gzip -cn" on multicore, but the
internal implementation has an advantage when it comes to 

I tried out the fallback method, memory leaks aside (needs to do a
proper cleanup) this seems to work. Most of the diff is moving the
existing code into a function:

diff --git a/archive-tar.c b/archive-tar.c
index 3d77e0f7509..a1b08812ee3 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -458,14 +458,36 @@ static void tgz_write_block(const void *data)
 	tgz_deflate(Z_NO_FLUSH);
 }
 
+static const char default_gzip_command[] = "gzip -cn";
 static const char internal_gzip_command[] = "git archive gzip";
 
-static int write_tar_filter_archive(const struct archiver *ar,
-				    struct archiver_args *args)
+static int do_internal_gzip(const struct archiver *ar,
+			    struct archiver_args *args)
 {
 #if ZLIB_VERNUM >= 0x1221
 	struct gz_header_s gzhead = { .os = 3 }; /* Unix, for reproducibility */
 #endif
+	int r;
+
+	write_block = tgz_write_block;
+	git_deflate_init_gzip(&gzstream, args->compression_level);
+#if ZLIB_VERNUM >= 0x1221
+	if (deflateSetHeader(&gzstream.z, &gzhead) != Z_OK)
+		BUG("deflateSetHeader() called too late");
+#endif
+	gzstream.next_out = outbuf;
+	gzstream.avail_out = sizeof(outbuf);
+
+	r = write_tar_archive(ar, args);
+
+	tgz_deflate(Z_FINISH);
+	git_deflate_end(&gzstream);
+	return r;
+}
+
+static int write_tar_filter_archive(const struct archiver *ar,
+				    struct archiver_args *args)
+{
 	struct strbuf cmd = STRBUF_INIT;
 	struct child_process filter = CHILD_PROCESS_INIT;
 	int r;
@@ -473,33 +495,24 @@ static int write_tar_filter_archive(const struct archiver *ar,
 	if (!ar->filter_command)
 		BUG("tar-filter archiver called with no filter defined");
 
-	if (!strcmp(ar->filter_command, internal_gzip_command)) {
-		write_block = tgz_write_block;
-		git_deflate_init_gzip(&gzstream, args->compression_level);
-#if ZLIB_VERNUM >= 0x1221
-		if (deflateSetHeader(&gzstream.z, &gzhead) != Z_OK)
-			BUG("deflateSetHeader() called too late");
-#endif
-		gzstream.next_out = outbuf;
-		gzstream.avail_out = sizeof(outbuf);
-
-		r = write_tar_archive(ar, args);
-
-		tgz_deflate(Z_FINISH);
-		git_deflate_end(&gzstream);
-		return r;
-	}
+	if (!strcmp(ar->filter_command, internal_gzip_command))
+		return do_internal_gzip(ar, args);
 
 	strbuf_addstr(&cmd, ar->filter_command);
 	if (args->compression_level >= 0)
 		strbuf_addf(&cmd, " -%d", args->compression_level);
 
-	strvec_push(&filter.args, cmd.buf);
-	filter.use_shell = 1;
+	strvec_split(&filter.args, cmd.buf);
 	filter.in = -1;
 
-	if (start_command(&filter) < 0)
+	if (start_command(&filter) < 0) {
+		if (!strcmp(ar->filter_command, default_gzip_command)) {
+			warning_errno(_("could not start '%s' filter, falling back to '%s'"),
+				      ar->filter_command, internal_gzip_command);
+			return do_internal_gzip(ar, args);
+		}
 		die_errno(_("unable to start '%s' filter"), cmd.buf);
+	}
 	close(1);
 	if (dup2(filter.in, 1) < 0)
 		die_errno(_("unable to redirect descriptor"));

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: [PATCH v4 4/6] archive-tar: add internal gzip implementation
  2022-06-24 11:13         ` Ævar Arnfjörð Bjarmason
@ 2022-06-24 20:24           ` René Scharfe
  0 siblings, 0 replies; 74+ messages in thread
From: René Scharfe @ 2022-06-24 20:24 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Junio C Hamano, Johannes Schindelin, Rohit Ashiwal,
	Jeff King, brian m . carlson

Am 24.06.22 um 13:13 schrieb Ævar Arnfjörð Bjarmason:
>
> On Thu, Jun 16 2022, René Scharfe wrote:
>
>> Am 15.06.22 um 22:32 schrieb Ævar Arnfjörð Bjarmason:
>>> [...]
>> Understandable, and you can set tar.tgz.command='gzip -cn' to get the
>> old behavior.  Saving energy is a better default, though.
>
> I disagree with that in general, a big reason for why git won out over
> other VCS's is that it wasn't as slow. I think we should primarily be
> interested in the time a user might end up staring at the screen.
>
> I understand the concern to have "git archive" just work, e.g. if you
> uninstall gzip(1) (although that seems rather obscure, but perhaps this
> is for more minimal setups).

The previous attempt came from/via Git on Windows.

> I don't think saving energy is a virtue, *maybe* it is, but maybe your
> computer is powered by hydro, solar or nuclear instead of coal, so even
> if we're taking global energy policy into account for changes to git
> it's highly context dependant.

Or a device runs on battery power and saving energy keeps it running a
bit longer.  Or it's housed in a data center and saving energy helps
reduce cooling requirements.

> In any case, this is also true for pretty much any other git command
> that might spawn processes or threads, e.g. "git grep":
>
> 	$ hyperfine -w3 -L cpus 0,0-7 'taskset --cpu-list {cpus} ./git grep foo.*bar' -r 10
> 	Benchmark 1: taskset --cpu-list 0 ./git grep foo.*bar
> 	  Time (mean ± σ):      39.3 ms ±   1.2 ms    [User: 20.0 ms, System: 18.6 ms]
> 	  Range (min … max):    38.2 ms …  41.8 ms    10 runs
>
> 	Benchmark 2: taskset --cpu-list 0-7 ./git grep foo.*bar
> 	  Time (mean ± σ):      28.1 ms ±   1.3 ms    [User: 43.5 ms, System: 51.0 ms]
> 	  Range (min … max):    26.6 ms …  31.2 ms    10 runs
>
> 	Summary
> 	  'taskset --cpu-list 0-7 ./git grep foo.*bar' ran
> 	    1.40 ± 0.08 times faster than 'taskset --cpu-list 0 ./git grep foo.*bar'
>
> Here we use less than 1/2 the user/system time when I pin it to 1 cpu,
> but we're 40% slower.
>
> So this is a bit of a digression, but this particular thing seems much
> better left to the OS or your hardware's CPU throttling policy. To the
> extent that we care perhaps more fitting would be to have a global
> core.wrapper-cmd option or something, so you could pass all git commands
> through "taskset" (or your local equivalent), or just use shell aliases.

Not sure what conclusion to draw from these numbers.  Perhaps that
computation is not the bottleneck here (increasing the number of cores by
700% increases speed only by 40%)?  That coordination overhead makes up a
big percentage and there might be room for improvement/tuning?

In any case, I agree we should leave scheduling decisions at runtime to
the OS.

>> The runtime in the real world probably includes lots more I/O time.  The
>> tests above are repeated and warmed up to get consistent measurements,
>> but big repos are probably not fully kept in memory like that.

On top of that I guess only few people create tgz files at all.  Most of
them I would expect to be created automatically (and cached) by sites
like kernel.org.  So I imagine people rather create tar.xz, tar.zst or
zip archives these days.  Or use git at both ends (push/pull), as they
should. ;-)  I have no data to support this guess, though.

But yeah, the tradeoff sounds a bit weird: Give 17% duration, get 2% CPU
time back -- sounds like a ripoff.  In your example below it's 12%
longer duration for 5% saved CPU time, which sounds a bit better, but
still not terribly attractive.

Look at it from a different angle: This basic sequential implementation
is better for non-interactive tgz creation due to its slightly lower
CPU usage, which we cannot achieve with any parallel process setup.
It's easier to deploy because it doesn't need gzip.  Its runtime hit
isn't *that* hard, and people interested primarily in speed should
parallelize the expensive part, deflate, not run the cheap tar creation
parallel to a single-threaded deflate.  I.e. they should already run
pigz (https://zlib.net/pigz/).

$ hyperfine -L gz gzip,pigz -w3 'git -C ../linux archive --format=tar HEAD | {gz} -cn'
Benchmark 1: git -C ../linux archive --format=tar HEAD | gzip -cn
  Time (mean ± σ):     20.764 s ±  0.007 s    [User: 24.119 s, System: 0.606 s]
  Range (min … max):   20.758 s … 20.781 s    10 runs

Benchmark 2: git -C ../linux archive --format=tar HEAD | pigz -cn
  Time (mean ± σ):      6.077 s ±  0.023 s    [User: 29.708 s, System: 1.599 s]
  Range (min … max):    6.037 s …  6.125 s    10 runs

Summary
  'git -C ../linux archive --format=tar HEAD | pigz -cn' ran
    3.42 ± 0.01 times faster than 'git -C ../linux archive --format=tar HEAD | gzip -cn'

>>> Can't we have our 6/6 cake much easier and eat it too by learning a
>>> "fallback" mode, i.e. we try to invoke gzip, and if that doesn't work
>>> use the "internal" one?
>>
>> Interesting idea, but I think the existing config option suffices.  E.g.
>> a distro could set it in the system-wide config file if/when gzip is
>> installed.
>
> I think in practice distros are unlikely to have such triggers for
> "package X is installed, let's set config Y". I mean, e.g. Debian can do
> that with its packaging system, but it's expecting a lot.

I don't *expect* any reaction either way, but packagers *can* go with a
custom config if they see the need.

> Why not flip
> the default depending on if start_command() fails?

Because it's harder to test and support due to its more complicated
behavior, and I don't see why it would be needed.

>>> Re the "eco mode": I also wonder how much of the overhead you're seeing
>>> for both that 17% and 2% would go away if you pin both processes to the
>>> same CPU, I can't recall the command offhand, but IIRC taskset or
>>> numactl can do that. I.e. is this really measuring IPC overhead, or
>>> I-CPU overhead on your system?
>>
>> I'd expect that running git archive and gzip at the same CPU core takes
>> more wall-clock time than using zlib because inflating the object files
>> and deflating the archive are done sequentially in both scenarios.
>> Can't test it on macOS because it doesn't offer a way to pin programs to
>> a certain core, but e.g. someone with access to a Linux system can check
>> that using taskset(1).
>
> Here's a benchmark, this is your hyperfine command, just with taskset
> added. It's an 8-core box, so 0-7 is "all CPUs" (I think...):
>
> 	hyperfine -w3 \
> 		-L cpus 0,0-7 \
> 		-L command 'gzip -cn','git archive gzip' \
> 		'taskset --cpu-list {cpus} ./git -c tar.tgz.command="{command}" archive --format=tgz HEAD'
>
> Which gives me:
>
> 	Benchmark 1: taskset --cpu-list 0 ./git -c tar.tgz.command="gzip -cn" archive --format=tgz HEAD
> 	  Time (mean ± σ):      1.561 s ±  0.029 s    [User: 1.503 s, System: 0.058 s]
> 	  Range (min … max):    1.522 s …  1.622 s    10 runs
>
> 	Benchmark 2: taskset --cpu-list 0-7 ./git -c tar.tgz.command="gzip -cn" archive --format=tgz HEAD
> 	  Time (mean ± σ):      1.337 s ±  0.029 s    [User: 1.535 s, System: 0.075 s]
> 	  Range (min … max):    1.298 s …  1.388 s    10 runs
>
> 	Benchmark 3: taskset --cpu-list 0 ./git -c tar.tgz.command="git archive gzip" archive --format=tgz HEAD
> 	  Time (mean ± σ):      1.493 s ±  0.032 s    [User: 1.453 s, System: 0.040 s]
> 	  Range (min … max):    1.462 s …  1.572 s    10 runs
>
> 	Benchmark 4: taskset --cpu-list 0-7 ./git -c tar.tgz.command="git archive gzip" archive --format=tgz HEAD
> 	  Time (mean ± σ):      1.503 s ±  0.026 s    [User: 1.466 s, System: 0.036 s]
> 	  Range (min … max):    1.469 s …  1.542 s    10 runs
>
> 	Summary
> 	  'taskset --cpu-list 0-7 ./git -c tar.tgz.command="gzip -cn" archive --format=tgz HEAD' ran
> 	    1.12 ± 0.03 times faster than 'taskset --cpu-list 0 ./git -c tar.tgz.command="git archive gzip" archive --format=tgz HEAD'
> 	    1.12 ± 0.03 times faster than 'taskset --cpu-list 0-7 ./git -c tar.tgz.command="git archive gzip" archive --format=tgz HEAD'
> 	    1.17 ± 0.03 times faster than 'taskset --cpu-list 0 ./git -c tar.tgz.command="gzip -cn" archive --format=tgz HEAD'
>
> Whic I think should control for the IPC overhead v.s. the advantage of
> multicore. I.e. we're faster with "gzip -cn" on multicore, but the
> internal implementation has an advantage when it comes to

Right, #1, #3 and #4 all run sequentially, but #1 has the pipe overhead
to deal with as well, which adds 5 percentage points to its runtime.

René

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v3 0/5] Avoid spawning gzip in git archive
  2022-06-14 20:05     ` René Scharfe
@ 2022-06-30 18:55       ` Johannes Schindelin
  2022-07-01 16:05         ` Johannes Schindelin
  0 siblings, 1 reply; 74+ messages in thread
From: Johannes Schindelin @ 2022-06-30 18:55 UTC (permalink / raw)
  To: René Scharfe
  Cc: git, Junio C Hamano, Rohit Ashiwal,
	Ævar Arnfjörð Bjarmason, Jeff King,
	brian m . carlson

[-- Attachment #1: Type: text/plain, Size: 3860 bytes --]

Hi René,

On Tue, 14 Jun 2022, René Scharfe wrote:

> Am 14.06.22 um 13:28 schrieb Johannes Schindelin:
> >
> > By the way, the main reason why I did not work more is that in
> > http://madler.net/pipermail/zlib-devel_madler.net/2019-December/003308.html,
> > Mark Adler (the zlib maintainer) announced that...
> >
> >> [...] There are many well-tested performance improvements in zlib
> >> waiting in the wings that will be incorporated over the next several
> >> months. [...]
> >
> > This was in December 2019. And now it's June 2022 and I kind of wonder
> > whether those promised improvements will still come.
> >
> > In the meantime, however, a viable alternative seems to have cropped up:
> > https://github.com/zlib-ng/zlib-ng. Essentially, it looks as if it is what
> > zlib should have become after above-quoted announcement.
> >
> > In particular the CPU intrinsics support (think MMX, SSE2/3, etc) seem to
> > be very interesting and I would not be completely surprised if building
> > Git with your patches and linking against zlib-ng would paint a very
> > favorable picture not only in terms of CPU time but also in terms of
> > wallclock time. Sadly, I have not been able to set aside time to look into
> > that angle, but maybe I can peak your interest?
> I was unable to preload zlib-ng using DYLD_INSERT_LIBRARIES on macOS
> 12.4 so far.  The included demo proggy looks impressive, though:
>
> $ hyperfine -w3 -L gzip gzip,../zlib-ng/minigzip "git -C ../linux archive --format=tar HEAD | {gzip} -c"
> Benchmark #1: git -C ../linux archive --format=tar HEAD | gzip -c
>   Time (mean ± σ):     20.424 s ±  0.006 s    [User: 23.964 s, System: 0.432 s]
>   Range (min … max):   20.414 s … 20.434 s    10 runs
>
> Benchmark #2: git -C ../linux archive --format=tar HEAD | ../zlib-ng/minigzip -c
>   Time (mean ± σ):     12.158 s ±  0.006 s    [User: 13.908 s, System: 0.376 s]
>   Range (min … max):   12.145 s … 12.166 s    10 runs
>
> Summary
>   'git -C ../linux archive --format=tar HEAD | ../zlib-ng/minigzip -c' ran
>     1.68 ± 0.00 times faster than 'git -C ../linux archive --format=tar HEAD | gzip -c'

Intriguing.

I finally managed to play around with building and packaging zlib-ng [*1*]
(since I want to use it as a drop-in replacement for zlib, I think it is
best to configure it with `--zlib-compat`, that way I do not have to
fiddle with any equivalent of `LD_PRELOAD`). Here are my numbers:

	zlib-ng: 14.409 s ± 0.209 s
	zlib:    26.843 s ± 0.636 s

These are pretty good, which made me think that they might actually even
help regular Git operations (because we zlib every loose object).

So I tried to `fast-import` some 2500 commits from linux.git into a fresh
repository, and the zlib-ng version takes ~51s and the zlib version takes
~58s. At first I thought that it might be noise, but the trend seems to be
steady. It's not a huge improvement, of course, but I think that might be
because most of the time is spent parsing.

I then tried to test the performance focusing on writing loose object, by
using p0008 (increasing the number of files from 50 to 1500 and
restricting it to fsyncMethod=none).

Unfortunately, the numbers are not really conclusive. I do see minor
speed-ups with zlib-ng, mostly, in the single digit percentages, though
occasionally in the other direction. In other words, there is no clear-cut
change, just a vague tendency. My guess: Git writes too small files (their
contents are of the form "$basedir$test_tick.$counter") and zlib-ng's
superior performance does not come to bear.

Still, for larger workloads, zlib-ng seems to offer a quite nice and
substantial performance improvement over zlib.

Ciao,
Dscho

Footnote *1*: https://github.com/msys2/MINGW-packages/compare/master...dscho:zlib-ng

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v3 0/5] Avoid spawning gzip in git archive
  2022-06-30 18:55       ` Johannes Schindelin
@ 2022-07-01 16:05         ` Johannes Schindelin
  2022-07-01 16:27           ` Jeff King
  0 siblings, 1 reply; 74+ messages in thread
From: Johannes Schindelin @ 2022-07-01 16:05 UTC (permalink / raw)
  To: René Scharfe
  Cc: git, Junio C Hamano, Rohit Ashiwal,
	Ævar Arnfjörð Bjarmason, Jeff King,
	brian m . carlson

[-- Attachment #1: Type: text/plain, Size: 3490 bytes --]

Me again,

On Thu, 30 Jun 2022, Johannes Schindelin wrote:

> I finally managed to play around with building and packaging zlib-ng
> [*1*] (since I want to use it as a drop-in replacement for zlib, I think
> it is best to configure it with `--zlib-compat`, that way I do not have
> to fiddle with any equivalent of `LD_PRELOAD`). Here are my numbers:
>
> 	zlib-ng: 14.409 s ± 0.209 s
> 	zlib:    26.843 s ± 0.636 s
>
> These are pretty good, which made me think that they might actually even
> help regular Git operations (because we zlib every loose object).
>
> So I tried to `fast-import` some 2500 commits from linux.git into a fresh
> repository, and the zlib-ng version takes ~51s and the zlib version takes
> ~58s. At first I thought that it might be noise, but the trend seems to be
> steady. It's not a huge improvement, of course, but I think that might be
> because most of the time is spent parsing.
>
> I then tried to test the performance focusing on writing loose object, by
> using p0008 (increasing the number of files from 50 to 1500 and
> restricting it to fsyncMethod=none).
>
> Unfortunately, the numbers are not really conclusive. I do see minor
> speed-ups with zlib-ng, mostly, in the single digit percentages, though
> occasionally in the other direction. In other words, there is no clear-cut
> change, just a vague tendency. My guess: Git writes too small files (their
> contents are of the form "$basedir$test_tick.$counter") and zlib-ng's
> superior performance does not come to bear.
>
> Still, for larger workloads, zlib-ng seems to offer a quite nice and
> substantial performance improvement over zlib.

Stolee pointed out to me that objects inside pack files are also
zlib-compressed, and that measuring the speed of `git rev-list --objects
--all --count` might therefore be a better test.

And this is where things get a little messy: in the context of Git for
Windows, my local measurements indicate that zlib is better, with ~41
seconds using zlib vs ~52 seconds using zlib-ng (but the latter has a
rather large variance).

These measurements were done with a relatively straight-forward build of
zlib-ng v2.0.6, and on a hunch I then tried to build the tip of zlib-ng's
`develop` branch (which was much less straight-forward) and now get
virtually the same speed with that `rev-list` command.

But then I repeated the `archive` measurement with the `develop` version
of zlib-ng, and while it was still substantially faster than zlib, it was
slightly slower than zlib-ng v2.0.6 (zlib: ~26 seconds, zlib-ng v2.0.6:
~14 seconds, zlib-ng develop: ~16 seconds). Still, much, much faster than
using `-c tar.tgz.command="gzip -cn"` at ~24 seconds.

So: the picture is messy. The latest official release of zlib-ng seems to
offer performance wins using `archive` but slight losses using `rev-list.
Upgrading to the latest revision of zlib-ng offers slightly smaller
performance wins using `archive` and equivalent performance using
`rev-list`. Both blow `gzip -cn` out of the water, thanks to using MMX or
whatever my laptop's CPU offers.

The take-away as far as Git for Windows is concerned: It seems not _quite_
the time yet to switch from zlib to zlib-ng, I want to wait until there is
an official zlib-ng release with favorable speed.

Ciao,
Dscho

P.S.: I pushed a WIP update to this branch:

> Footnote *1*: https://github.com/msys2/MINGW-packages/compare/master...dscho:zlib-ng

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v3 0/5] Avoid spawning gzip in git archive
  2022-07-01 16:05         ` Johannes Schindelin
@ 2022-07-01 16:27           ` Jeff King
  2022-07-01 17:47             ` Junio C Hamano
  0 siblings, 1 reply; 74+ messages in thread
From: Jeff King @ 2022-07-01 16:27 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: René Scharfe, git, Junio C Hamano, Rohit Ashiwal,
	Ævar Arnfjörð Bjarmason, brian m . carlson

On Fri, Jul 01, 2022 at 06:05:59PM +0200, Johannes Schindelin wrote:

> Stolee pointed out to me that objects inside pack files are also
> zlib-compressed, and that measuring the speed of `git rev-list --objects
> --all --count` might therefore be a better test.

That will spend quite a lot of time doing hash-lookups for each tree
entry. A better raw zlib test might be:

  git cat-file --batch --batch-all-objects --unordered >/dev/null

which will just dump each object, and should mostly be zlib and delta
reconstruction (the --unordered is important to hit the deltas in the
right order).

> And this is where things get a little messy: in the context of Git for
> Windows, my local measurements indicate that zlib is better, with ~41
> seconds using zlib vs ~52 seconds using zlib-ng (but the latter has a
> rather large variance).

That is a surprising slow-down between the two. I'd expect the command
above to show even more pronounced results, though, as it's spending
less time doing non-zlib things. But it's still just inflating (as
opposed to git-archive, which is both inflating and deflating).

-Peff

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v3 0/5] Avoid spawning gzip in git archive
  2022-07-01 16:27           ` Jeff King
@ 2022-07-01 17:47             ` Junio C Hamano
  0 siblings, 0 replies; 74+ messages in thread
From: Junio C Hamano @ 2022-07-01 17:47 UTC (permalink / raw)
  To: Jeff King
  Cc: Johannes Schindelin, René Scharfe, git, Rohit Ashiwal,
	Ævar Arnfjörð Bjarmason, brian m . carlson

Jeff King <peff@peff.net> writes:

> That will spend quite a lot of time doing hash-lookups for each tree
> entry. A better raw zlib test might be:
>
>   git cat-file --batch --batch-all-objects --unordered >/dev/null
>
> which will just dump each object, and should mostly be zlib and delta
> reconstruction (the --unordered is important to hit the deltas in the
> right order).

;-)

I like --unordered has the meaning "use the order Git likes" (which
is probably the packfile offset order, which we optimize for
minimizing seek during delta reconstruction).


^ permalink raw reply	[flat|nested] 74+ messages in thread

end of thread, other threads:[~2022-07-01 17:49 UTC | newest]

Thread overview: 74+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-12 23:04 [PATCH 0/2] Avoid spawning gzip in git archive Johannes Schindelin via GitGitGadget
2019-04-12 23:04 ` [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die() Rohit Ashiwal via GitGitGadget
2019-04-13  1:34   ` Jeff King
2019-04-13  5:51     ` Junio C Hamano
2019-04-14  4:36       ` Rohit Ashiwal
2019-04-26 14:29       ` Johannes Schindelin
2019-04-26 23:44         ` Junio C Hamano
2019-04-29 21:32           ` Johannes Schindelin
2019-05-01 18:09             ` Jeff King
2019-05-02 20:29               ` René Scharfe
2019-05-05  5:25               ` Junio C Hamano
2019-05-06  5:07                 ` Jeff King
2019-04-14  4:34     ` Rohit Ashiwal
2019-04-14 10:33       ` Junio C Hamano
2019-04-26 14:28     ` Johannes Schindelin
2019-05-01 18:07       ` Jeff King
2019-04-12 23:04 ` [PATCH 2/2] archive: avoid spawning `gzip` Rohit Ashiwal via GitGitGadget
2019-04-13  1:51   ` Jeff King
2019-04-13 22:01     ` René Scharfe
2019-04-15 21:35       ` Jeff King
2019-04-26 14:51         ` Johannes Schindelin
2019-04-27  9:59           ` René Scharfe
2019-04-27 17:39             ` René Scharfe
2019-04-29 21:25               ` Johannes Schindelin
2019-05-01 17:45                 ` René Scharfe
2019-05-01 18:18                   ` Jeff King
2019-06-10 10:44                     ` René Scharfe
2019-06-13 19:16                       ` Jeff King
2019-04-13 22:16     ` brian m. carlson
2019-04-15 21:36       ` Jeff King
2019-04-26 14:54       ` Johannes Schindelin
2019-05-02 20:20         ` Ævar Arnfjörð Bjarmason
2019-05-03 20:49           ` Johannes Schindelin
2019-05-03 20:52             ` Jeff King
2019-04-26 14:47     ` Johannes Schindelin
     [not found] ` <pull.145.v2.git.gitgitgadget@gmail.com>
     [not found]   ` <4ea94a8784876c3a19e387537edd81a957fc692c.1556321244.git.gitgitgadget@gmail.com>
2019-05-02 20:29     ` [PATCH v2 3/4] archive: optionally use zlib directly for gzip compression René Scharfe
     [not found]   ` <ac2b2488a1b42b3caf8a84594c48eca796748e59.1556321244.git.gitgitgadget@gmail.com>
2019-05-02 20:30     ` [PATCH v2 2/4] archive-tar: mark RECORDSIZE/BLOCKSIZE as unsigned René Scharfe
2019-05-08 11:45       ` Johannes Schindelin
2019-05-08 23:04         ` Jeff King
2019-05-09 14:06           ` Johannes Schindelin
2019-05-09 18:38             ` Jeff King
2019-05-10 17:18               ` René Scharfe
2019-05-10 21:20                 ` Jeff King
2022-06-12  6:00 ` [PATCH v3 0/5] Avoid spawning gzip in git archive René Scharfe
2022-06-12  6:03   ` [PATCH v3 1/5] archive: rename archiver data field to filter_command René Scharfe
2022-06-12  6:05   ` [PATCH v3 2/5] archive-tar: factor out write_block() René Scharfe
2022-06-12  6:08   ` [PATCH v3 3/5] archive-tar: add internal gzip implementation René Scharfe
2022-06-13 19:10     ` Junio C Hamano
2022-06-12  6:18   ` [PATCH v3 4/5] archive-tar: use OS_CODE 3 (Unix) for internal gzip René Scharfe
2022-06-12  6:19   ` [PATCH v3 5/5] archive-tar: use internal gzip by default René Scharfe
2022-06-13 21:55     ` Junio C Hamano
2022-06-14 11:27       ` Johannes Schindelin
2022-06-14 15:47         ` René Scharfe
2022-06-14 15:56           ` René Scharfe
2022-06-14 16:29           ` Johannes Schindelin
2022-06-14 20:04             ` René Scharfe
2022-06-15 16:41               ` Junio C Hamano
2022-06-14 11:28   ` [PATCH v3 0/5] Avoid spawning gzip in git archive Johannes Schindelin
2022-06-14 20:05     ` René Scharfe
2022-06-30 18:55       ` Johannes Schindelin
2022-07-01 16:05         ` Johannes Schindelin
2022-07-01 16:27           ` Jeff King
2022-07-01 17:47             ` Junio C Hamano
2022-06-15 16:53 ` [PATCH v4 0/6] " René Scharfe
2022-06-15 16:58   ` [PATCH v4 1/6] archive: update format documentation René Scharfe
2022-06-15 16:59   ` [PATCH v4 2/6] archive: rename archiver data field to filter_command René Scharfe
2022-06-15 17:01   ` [PATCH v4 3/6] archive-tar: factor out write_block() René Scharfe
2022-06-15 17:02   ` [PATCH v4 4/6] archive-tar: add internal gzip implementation René Scharfe
2022-06-15 20:32     ` Ævar Arnfjörð Bjarmason
2022-06-16 18:55       ` René Scharfe
2022-06-24 11:13         ` Ævar Arnfjörð Bjarmason
2022-06-24 20:24           ` René Scharfe
2022-06-15 17:04   ` [PATCH v4 5/6] archive-tar: use OS_CODE 3 (Unix) for internal gzip René Scharfe
2022-06-15 17:05   ` [PATCH v4 6/6] archive-tar: use internal gzip by default René Scharfe

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).