git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / Atom feed
* [PATCH 0/2] Avoid spawning gzip in git archive
@ 2019-04-12 23:04 Johannes Schindelin via GitGitGadget
  2019-04-12 23:04 ` [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die() Rohit Ashiwal via GitGitGadget
                   ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Johannes Schindelin via GitGitGadget @ 2019-04-12 23:04 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano

When creating .tar.gz archives with git archive, we let gzip handle the
compression part. But that is not even necessary, as we already require zlib
(to compress our loose objects). It is also unfortunate, as it requires gzip 
to be in the PATH (which is not the case e.g. with MinGit for Windows, which
tries to bundle just the bare minimum of files to make Git work
non-interactively, for use with 3rd-party applications requiring Git).

This patch series resolves this conundrum by teaching git archive the trick
to gzip-compress in-process.

Rohit Ashiwal (2):
  archive: replace write_or_die() calls with write_block_or_die()
  archive: avoid spawning `gzip`

 archive-tar.c | 54 ++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 41 insertions(+), 13 deletions(-)


base-commit: 8104ec994ea3849a968b4667d072fedd1e688642
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-145%2Fdscho%2Fdont-spawn-gzip-in-archive-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-145/dscho/dont-spawn-gzip-in-archive-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/145
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die()
  2019-04-12 23:04 [PATCH 0/2] Avoid spawning gzip in git archive Johannes Schindelin via GitGitGadget
@ 2019-04-12 23:04 ` Rohit Ashiwal via GitGitGadget
  2019-04-13  1:34   ` Jeff King
  2019-04-12 23:04 ` [PATCH 2/2] archive: avoid spawning `gzip` Rohit Ashiwal via GitGitGadget
       [not found] ` <pull.145.v2.git.gitgitgadget@gmail.com>
  2 siblings, 1 reply; 43+ messages in thread
From: Rohit Ashiwal via GitGitGadget @ 2019-04-12 23:04 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Rohit Ashiwal

From: Rohit Ashiwal <rohit.ashiwal265@gmail.com>

MinGit for Windows comes without `gzip` bundled inside, git-archive uses
`gzip -cn` to compress tar files but for this to work, gzip needs to be
present on the host system.

In the next commit, we will change the gzip compression so that we no
longer spawn `gzip` but let zlib perform the compression in the same
process instead.

In preparation for this, we consolidate all the block writes into a
single function.

This closes https://github.com/git-for-windows/git/issues/1970

Signed-off-by: Rohit Ashiwal <rohit.ashiwal265@gmail.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 archive-tar.c | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/archive-tar.c b/archive-tar.c
index 4aabd566fb..ba37dad27c 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -17,6 +17,8 @@ static unsigned long offset;
 
 static int tar_umask = 002;
 
+static gzFile gzip;
+
 static int write_tar_filter_archive(const struct archiver *ar,
 				    struct archiver_args *args);
 
@@ -38,11 +40,21 @@ static int write_tar_filter_archive(const struct archiver *ar,
 #define USTAR_MAX_MTIME 077777777777ULL
 #endif
 
+/* writes out the whole block, or dies if fails */
+static void write_block_or_die(const char *block) {
+	if (gzip) {
+		if (gzwrite(gzip, block, (unsigned) BLOCKSIZE) != BLOCKSIZE)
+			die(_("gzwrite failed"));
+	} else {
+		write_or_die(1, block, BLOCKSIZE);
+	}
+}
+
 /* writes out the whole block, but only if it is full */
 static void write_if_needed(void)
 {
 	if (offset == BLOCKSIZE) {
-		write_or_die(1, block, BLOCKSIZE);
+		write_block_or_die(block);
 		offset = 0;
 	}
 }
@@ -66,7 +78,7 @@ static void do_write_blocked(const void *data, unsigned long size)
 		write_if_needed();
 	}
 	while (size >= BLOCKSIZE) {
-		write_or_die(1, buf, BLOCKSIZE);
+		write_block_or_die(buf);
 		size -= BLOCKSIZE;
 		buf += BLOCKSIZE;
 	}
@@ -101,10 +113,10 @@ static void write_trailer(void)
 {
 	int tail = BLOCKSIZE - offset;
 	memset(block + offset, 0, tail);
-	write_or_die(1, block, BLOCKSIZE);
+	write_block_or_die(block);
 	if (tail < 2 * RECORDSIZE) {
 		memset(block, 0, offset);
-		write_or_die(1, block, BLOCKSIZE);
+		write_block_or_die(block);
 	}
 }
 
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 2/2] archive: avoid spawning `gzip`
  2019-04-12 23:04 [PATCH 0/2] Avoid spawning gzip in git archive Johannes Schindelin via GitGitGadget
  2019-04-12 23:04 ` [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die() Rohit Ashiwal via GitGitGadget
@ 2019-04-12 23:04 ` Rohit Ashiwal via GitGitGadget
  2019-04-13  1:51   ` Jeff King
       [not found] ` <pull.145.v2.git.gitgitgadget@gmail.com>
  2 siblings, 1 reply; 43+ messages in thread
From: Rohit Ashiwal via GitGitGadget @ 2019-04-12 23:04 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Rohit Ashiwal

From: Rohit Ashiwal <rohit.ashiwal265@gmail.com>

As we already link to the zlib library, we can perform the compression
without even requiring gzip on the host machine.

Signed-off-by: Rohit Ashiwal <rohit.ashiwal265@gmail.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 archive-tar.c | 34 +++++++++++++++++++++++++---------
 1 file changed, 25 insertions(+), 9 deletions(-)

diff --git a/archive-tar.c b/archive-tar.c
index ba37dad27c..5979ed14b7 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -466,18 +466,34 @@ static int write_tar_filter_archive(const struct archiver *ar,
 	filter.use_shell = 1;
 	filter.in = -1;
 
-	if (start_command(&filter) < 0)
-		die_errno(_("unable to start '%s' filter"), argv[0]);
-	close(1);
-	if (dup2(filter.in, 1) < 0)
-		die_errno(_("unable to redirect descriptor"));
-	close(filter.in);
+	if (!strcmp("gzip -cn", ar->data)) {
+		char outmode[4] = "wb\0";
+
+		if (args->compression_level >= 0 && args->compression_level <= 9)
+			outmode[2] = '0' + args->compression_level;
+
+		gzip = gzdopen(fileno(stdout), outmode);
+		if (!gzip)
+			die(_("Could not gzdopen stdout"));
+	} else {
+		if (start_command(&filter) < 0)
+			die_errno(_("unable to start '%s' filter"), argv[0]);
+		close(1);
+		if (dup2(filter.in, 1) < 0)
+			die_errno(_("unable to redirect descriptor"));
+		close(filter.in);
+	}
 
 	r = write_tar_archive(ar, args);
 
-	close(1);
-	if (finish_command(&filter) != 0)
-		die(_("'%s' filter reported error"), argv[0]);
+	if (gzip) {
+		if (gzclose(gzip) != Z_OK)
+			die(_("gzclose failed"));
+	} else {
+		close(1);
+		if (finish_command(&filter) != 0)
+			die(_("'%s' filter reported error"), argv[0]);
+	}
 
 	strbuf_release(&cmd);
 	return r;
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die()
  2019-04-12 23:04 ` [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die() Rohit Ashiwal via GitGitGadget
@ 2019-04-13  1:34   ` Jeff King
  2019-04-13  5:51     ` Junio C Hamano
                       ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Jeff King @ 2019-04-13  1:34 UTC (permalink / raw)
  To: Rohit Ashiwal via GitGitGadget
  Cc: Johannes Schindelin, git, Junio C Hamano, Rohit Ashiwal

On Fri, Apr 12, 2019 at 04:04:39PM -0700, Rohit Ashiwal via GitGitGadget wrote:

> From: Rohit Ashiwal <rohit.ashiwal265@gmail.com>
> 
> MinGit for Windows comes without `gzip` bundled inside, git-archive uses
> `gzip -cn` to compress tar files but for this to work, gzip needs to be
> present on the host system.
> 
> In the next commit, we will change the gzip compression so that we no
> longer spawn `gzip` but let zlib perform the compression in the same
> process instead.
> 
> In preparation for this, we consolidate all the block writes into a
> single function.

Sounds like a good preparatory step. This part confused me, though:

> @@ -38,11 +40,21 @@ static int write_tar_filter_archive(const struct archiver *ar,
>  #define USTAR_MAX_MTIME 077777777777ULL
>  #endif
>  
> +/* writes out the whole block, or dies if fails */
> +static void write_block_or_die(const char *block) {
> +	if (gzip) {
> +		if (gzwrite(gzip, block, (unsigned) BLOCKSIZE) != BLOCKSIZE)
> +			die(_("gzwrite failed"));
> +	} else {
> +		write_or_die(1, block, BLOCKSIZE);
> +	}
> +}

What is gzwrite()? At first I thought this was an out-of-sequence bit of
the series, but it turns out that this is a zlib.h interface. So the
idea (I think) is that here we introduce a "gzip" variable that is
always false, and this first conditional arm is effectively dead code.
And then in a later patch we'd set up "gzip" and it would become
not-dead.

I think it would be less confusing if this just factored out
write_block_or_die(), which starts as a thin wrapper and then grows the
gzip parts in the next patch.

A few nits on the code itself:

> +static gzFile gzip;
> [...]
> +       if (gzip) {

Is it OK for us to ask about the truthiness of this opaque type? That
works if it's really a pointer behind the scenes, but it seems like it
would be equally OK for zlib to declare it as a struct.

It looks OK in my version of zlib, and that library tends to be fairly
conservative so I wouldn't be surprised if it was that way back to the
beginning and remains that way for eternity. But it feels like a bad
pattern.

> +		if (gzwrite(gzip, block, (unsigned) BLOCKSIZE) != BLOCKSIZE)

This cast is interesting. All of the matching write_or_die() calls are
promoting it to a size_t, which is also unsigned.

BLOCKSIZE is a constant. Should we be defining it with a "U" in the first place?

I doubt it matters much either way from a correctness perspective. I
just wonder when I see a cast like that if we're going to get unexpected
truncation or similar.

-Peff

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] archive: avoid spawning `gzip`
  2019-04-12 23:04 ` [PATCH 2/2] archive: avoid spawning `gzip` Rohit Ashiwal via GitGitGadget
@ 2019-04-13  1:51   ` Jeff King
  2019-04-13 22:01     ` René Scharfe
                       ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Jeff King @ 2019-04-13  1:51 UTC (permalink / raw)
  To: Rohit Ashiwal via GitGitGadget
  Cc: Johannes Schindelin, git, Junio C Hamano, Rohit Ashiwal

On Fri, Apr 12, 2019 at 04:04:40PM -0700, Rohit Ashiwal via GitGitGadget wrote:

> From: Rohit Ashiwal <rohit.ashiwal265@gmail.com>
> 
> As we already link to the zlib library, we can perform the compression
> without even requiring gzip on the host machine.

Very cool. It's nice to drop a dependency, and this should be a bit more
efficient, too.

> diff --git a/archive-tar.c b/archive-tar.c
> index ba37dad27c..5979ed14b7 100644
> --- a/archive-tar.c
> +++ b/archive-tar.c
> @@ -466,18 +466,34 @@ static int write_tar_filter_archive(const struct archiver *ar,
>  	filter.use_shell = 1;
>  	filter.in = -1;
>  
> -	if (start_command(&filter) < 0)
> -		die_errno(_("unable to start '%s' filter"), argv[0]);
> -	close(1);
> -	if (dup2(filter.in, 1) < 0)
> -		die_errno(_("unable to redirect descriptor"));
> -	close(filter.in);
> +	if (!strcmp("gzip -cn", ar->data)) {

I wondered how you were going to kick this in, since users can define
arbitrary filters. I think it's kind of neat to automagically convert
"gzip -cn" (which also happens to be the default). But I think we should
mention that in the Documentation, in case somebody tries to use a
custom version of gzip and wonders why it isn't kicking in.

Likewise, it might make sense in the tests to put a poison gzip in the
$PATH so that we can be sure we're using our internal code, and not just
calling out to gzip (on platforms that have it, of course).

The alternative is that we could use a special token like ":zlib" or
something to indicate that the internal implementation should be used
(and then tweak the baked-in default, too). That might be less
surprising for users, but most people would still get the benefit since
they'd be using the default config.

> +		char outmode[4] = "wb\0";

This looks sufficiently magical that it might merit a comment. I had to
look in the zlib header file to learn that this is just a normal
stdio-style mode. But we can't just do:

  gzip = gzdopen(fd, "wb");

because we want to (maybe) append a compression level. It's also
slightly confusing that it explicitly includes a NUL, but later:

> +		if (args->compression_level >= 0 && args->compression_level <= 9)
> +			outmode[2] = '0' + args->compression_level;

we may overwrite that and assume that outmode[3] is also a NUL. Which it
is, because of how C initialization works. But that means we also do not
need the "\0" in the initializer.

Dropping that may make it slightly less jarring (any time I see a
backslash escape in an initializer, I assume I'm in for some binary
trickery, but this turns out to be much more mundane).

I'd also consider just using a strbuf:

  struct strbuf outmode = STRBUF_INIT;

  strbuf_addstr(&outmode, "wb");
  if (args->compression_level >= 0 && args->compression_level <= 9)
	strbuf_addch(&outmode, '0' + args->compression_level);

That's overkill in a sense, but it saves us having to deal with
manually-counted offsets, and this code is only run once per program
invocation, so the efficiency shouldn't matter.

> +		gzip = gzdopen(fileno(stdout), outmode);
> +		if (!gzip)
> +			die(_("Could not gzdopen stdout"));

Is there a way to get a more specific error from zlib? I'm less
concerned about gzdopen here (which should never fail), and more about
the writing and closing steps. I don't see anything good for gzdopen(),
but...

> +	if (gzip) {
> +		if (gzclose(gzip) != Z_OK)
> +			die(_("gzclose failed"));

...according to zlib.h, here the returned int is meaningful. And if
Z_ERRNO, we should probably use die_errno() to give a better message.

> [...]

That was a lot of little nits, but the overall shape of the patch looks
good to me (and I think the goal is obviously good). Thanks for working
on it.

-Peff

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die()
  2019-04-13  1:34   ` Jeff King
@ 2019-04-13  5:51     ` Junio C Hamano
  2019-04-14  4:36       ` Rohit Ashiwal
  2019-04-26 14:29       ` Johannes Schindelin
  2019-04-14  4:34     ` Rohit Ashiwal
  2019-04-26 14:28     ` Johannes Schindelin
  2 siblings, 2 replies; 43+ messages in thread
From: Junio C Hamano @ 2019-04-13  5:51 UTC (permalink / raw)
  To: Jeff King
  Cc: Rohit Ashiwal via GitGitGadget, Johannes Schindelin, git, Rohit Ashiwal

Jeff King <peff@peff.net> writes:

>> +/* writes out the whole block, or dies if fails */
>> +static void write_block_or_die(const char *block) {
>> +	if (gzip) {
>> +		if (gzwrite(gzip, block, (unsigned) BLOCKSIZE) != BLOCKSIZE)
>> +			die(_("gzwrite failed"));
>> +	} else {
>> +		write_or_die(1, block, BLOCKSIZE);
>> +	}
>> +}

I agree everything you said you your two review messages.

One thing you did not mention but I found disturbing was that this
does not take size argument but hardcodes BLOCKSIZE.  If the patch
were removing use of BLOCKSIZE in its callers (because everybody
uses the same constant), this would not have bothered me, but as the
caller passes BLOCKSIZE to all its callees except this one, I found
that the interface optimizes for a wrong thing (i.e. reducing
one-time pain of writing this single patch of having to repeat
BLOCKSIZE in all calls to this function).  This function should be
updated to take the size_t and have its caller(s) pass BLOCKSIZE.

Thanks for a review, and thanks Rohit for starting to get rid of
external dependency on gzip binary.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] archive: avoid spawning `gzip`
  2019-04-13  1:51   ` Jeff King
@ 2019-04-13 22:01     ` René Scharfe
  2019-04-15 21:35       ` Jeff King
  2019-04-13 22:16     ` brian m. carlson
  2019-04-26 14:47     ` Johannes Schindelin
  2 siblings, 1 reply; 43+ messages in thread
From: René Scharfe @ 2019-04-13 22:01 UTC (permalink / raw)
  To: Jeff King, Rohit Ashiwal via GitGitGadget
  Cc: Johannes Schindelin, git, Junio C Hamano, Rohit Ashiwal

Am 13.04.2019 um 03:51 schrieb Jeff King:
> On Fri, Apr 12, 2019 at 04:04:40PM -0700, Rohit Ashiwal via GitGitGadget wrote:
>
>> From: Rohit Ashiwal <rohit.ashiwal265@gmail.com>
>>
>> As we already link to the zlib library, we can perform the compression
>> without even requiring gzip on the host machine.
>
> Very cool. It's nice to drop a dependency, and this should be a bit more
> efficient, too.

Getting rid of dependencies is good, and using zlib is the obvious way to
generate .tgz files. Last time I tried something like that, a separate gzip
process was faster, though -- at least on Linux [1].  How does this one
fare?

Doing compression in its own thread may be a good idea.

René


[1] http://public-inbox.org/git/4AAAC8CE.8020302@lsrfire.ath.cx/

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] archive: avoid spawning `gzip`
  2019-04-13  1:51   ` Jeff King
  2019-04-13 22:01     ` René Scharfe
@ 2019-04-13 22:16     ` brian m. carlson
  2019-04-15 21:36       ` Jeff King
  2019-04-26 14:54       ` Johannes Schindelin
  2019-04-26 14:47     ` Johannes Schindelin
  2 siblings, 2 replies; 43+ messages in thread
From: brian m. carlson @ 2019-04-13 22:16 UTC (permalink / raw)
  To: Jeff King
  Cc: Rohit Ashiwal via GitGitGadget, Johannes Schindelin, git,
	Junio C Hamano, Rohit Ashiwal

[-- Attachment #1: Type: text/plain, Size: 1539 bytes --]

On Fri, Apr 12, 2019 at 09:51:02PM -0400, Jeff King wrote:
> I wondered how you were going to kick this in, since users can define
> arbitrary filters. I think it's kind of neat to automagically convert
> "gzip -cn" (which also happens to be the default). But I think we should
> mention that in the Documentation, in case somebody tries to use a
> custom version of gzip and wonders why it isn't kicking in.
> 
> Likewise, it might make sense in the tests to put a poison gzip in the
> $PATH so that we can be sure we're using our internal code, and not just
> calling out to gzip (on platforms that have it, of course).
> 
> The alternative is that we could use a special token like ":zlib" or
> something to indicate that the internal implementation should be used
> (and then tweak the baked-in default, too). That might be less
> surprising for users, but most people would still get the benefit since
> they'd be using the default config.

I agree that a special value (or NULL, if that's possible) would be
nicer here. That way, if someone does specify a custom gzip, we honor
it, and it serves to document the code better. For example, if someone
symlinked pigz to gzip and used "gzip -cn", then they might not get the
parallelization benefits they expected.

I'm fine overall with the idea of bringing the compression into the
binary using zlib, provided that we preserve the "-n" behavior
(producing reproducible archives).
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 868 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die()
  2019-04-13  1:34   ` Jeff King
  2019-04-13  5:51     ` Junio C Hamano
@ 2019-04-14  4:34     ` Rohit Ashiwal
  2019-04-14 10:33       ` Junio C Hamano
  2019-04-26 14:28     ` Johannes Schindelin
  2 siblings, 1 reply; 43+ messages in thread
From: Rohit Ashiwal @ 2019-04-14  4:34 UTC (permalink / raw)
  To: peff; +Cc: git, gitgitgadget, gitster, johannes.schindelin, rohit.ashiwal265

Hey Peff!

On 2019-04-13  1:34 UTC Jeff King <peff@peff.net> wrote:

> What is gzwrite()?
> [...]
> I think it would be less confusing if this just factored out
> write_block_or_die(), which starts as a thin wrapper and then grows the
> gzip parts in the next patch.

You are right, it might appear to someone as a bit confusing, but I feel
like, this is the right commit to put it.

> Is it OK for us to ask about the truthiness of this opaque type? That
> works if it's really a pointer behind the scenes, but it seems like it
> would be equally OK for zlib to declare it as a struct.

It would be perfectly sane on zlib's part to make gzFile a struct, and if
so happens, I'll be there to refactor the code.

Regards
Rohit


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die()
  2019-04-13  5:51     ` Junio C Hamano
@ 2019-04-14  4:36       ` Rohit Ashiwal
  2019-04-26 14:29       ` Johannes Schindelin
  1 sibling, 0 replies; 43+ messages in thread
From: Rohit Ashiwal @ 2019-04-14  4:36 UTC (permalink / raw)
  To: gitster; +Cc: git, gitgitgadget, johannes.schindelin, peff, rohit.ashiwal265

Hey jch!

I'll change the signature of the function in next revision.

Thanks
Rohit


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die()
  2019-04-14  4:34     ` Rohit Ashiwal
@ 2019-04-14 10:33       ` Junio C Hamano
  0 siblings, 0 replies; 43+ messages in thread
From: Junio C Hamano @ 2019-04-14 10:33 UTC (permalink / raw)
  To: Rohit Ashiwal; +Cc: peff, git, gitgitgadget, johannes.schindelin

Rohit Ashiwal <rohit.ashiwal265@gmail.com> writes:

> On 2019-04-13  1:34 UTC Jeff King <peff@peff.net> wrote:
>
>> What is gzwrite()?
>> [...]
>> I think it would be less confusing if this just factored out
>> write_block_or_die(), which starts as a thin wrapper and then grows the
>> gzip parts in the next patch.
>
> You are right, it might appear to someone as a bit confusing, but I feel
> like, this is the right commit to put it.

Often, the original author is the worst judge about the patch series
organization, because s/he has been staring at his or her own
patches too long and knows too much about them.  Unless the author
is very experienced and is good at pretending to be the first-time
reader when proofreading his or her own patch, that is.

FWIW, I tend to agree with Peff that the organization would become
much easier to follow with "first refactor without new feature, and
in gzip related step add gzip thing".

>> Is it OK for us to ask about the truthiness of this opaque type? That
>> works if it's really a pointer behind the scenes, but it seems like it
>> would be equally OK for zlib to declare it as a struct.

Or a small integer indexing into an internal array the library keeps
track of ;-) At that point, truthiness would be completely gone, and
the compiler would not help catching "if (opaque)" as a syntax error
(if the library implements the opaque thing as a structure, then we
will be saved).

> It would be perfectly sane on zlib's part to make gzFile a struct, and if
> so happens, I'll be there to refactor the code.

We do not trust any single developer enough with "I'll do so when
needed"---in practice, it will often be done by somebody else, and
more importantly, we would want anybody to be able to take things
over, instead of relying on any one "indispensable contributor".

If it is reasonable to expect that things can easily be broken by an
external factor, I'd prefer to see us being defensive from day one.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] archive: avoid spawning `gzip`
  2019-04-13 22:01     ` René Scharfe
@ 2019-04-15 21:35       ` Jeff King
  2019-04-26 14:51         ` Johannes Schindelin
  0 siblings, 1 reply; 43+ messages in thread
From: Jeff King @ 2019-04-15 21:35 UTC (permalink / raw)
  To: René Scharfe
  Cc: Rohit Ashiwal via GitGitGadget, Johannes Schindelin, git,
	Junio C Hamano, Rohit Ashiwal

On Sun, Apr 14, 2019 at 12:01:10AM +0200, René Scharfe wrote:

> >> As we already link to the zlib library, we can perform the compression
> >> without even requiring gzip on the host machine.
> >
> > Very cool. It's nice to drop a dependency, and this should be a bit more
> > efficient, too.
> 
> Getting rid of dependencies is good, and using zlib is the obvious way to
> generate .tgz files. Last time I tried something like that, a separate gzip
> process was faster, though -- at least on Linux [1].  How does this one
> fare?

I'd expect a separate gzip to be faster in wall-clock time for a
multi-core machine, but overall consume more CPU. I'm slightly surprised
that your timings show that it actually wins on total CPU, too.

Here are best-of-five times for "git archive --format=tar.gz HEAD" on
linux.git (the machine is a quad-core):

  [before, separate gzip]
  real	0m21.501s
  user	0m26.148s
  sys	0m0.619s

  [after, internal gzwrite]
  real	0m25.156s
  user	0m25.059s
  sys	0m0.096s

which does show what I expect (longer overall, but less total CPU).

Which one you prefer depends on your situation, of course. A user on a
workstation with multiple cores probably cares most about end-to-end
latency and using all of their available horsepower. A server hosting
repositories and receiving many unrelated requests probably cares more
about total CPU (though the differences there are small enough that it
may not even be worth having a config knob to un-parallelize it).

> Doing compression in its own thread may be a good idea.

Yeah. It might even make the patch simpler, since I'd expect it to be
implemented with start_async() and a descriptor, making it look just
like a gzip pipe to the caller. :)

-Peff

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] archive: avoid spawning `gzip`
  2019-04-13 22:16     ` brian m. carlson
@ 2019-04-15 21:36       ` Jeff King
  2019-04-26 14:54       ` Johannes Schindelin
  1 sibling, 0 replies; 43+ messages in thread
From: Jeff King @ 2019-04-15 21:36 UTC (permalink / raw)
  To: brian m. carlson, Rohit Ashiwal via GitGitGadget,
	Johannes Schindelin, git, Junio C Hamano, Rohit Ashiwal

On Sat, Apr 13, 2019 at 10:16:46PM +0000, brian m. carlson wrote:

> > The alternative is that we could use a special token like ":zlib" or
> > something to indicate that the internal implementation should be used
> > (and then tweak the baked-in default, too). That might be less
> > surprising for users, but most people would still get the benefit since
> > they'd be using the default config.
> 
> I agree that a special value (or NULL, if that's possible) would be
> nicer here. That way, if someone does specify a custom gzip, we honor
> it, and it serves to document the code better. For example, if someone
> symlinked pigz to gzip and used "gzip -cn", then they might not get the
> parallelization benefits they expected.

Thanks for spelling that out. I had a vague feeling somebody might be
surprised, but I didn't know if people actually did stuff like
symlinking pigz to gzip (though it makes perfect sense to do so).

> I'm fine overall with the idea of bringing the compression into the
> binary using zlib, provided that we preserve the "-n" behavior
> (producing reproducible archives).

I just assumed that gzwrite() would have the "-n" behavior, but it's
definitely worth double-checking.

-Peff

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die()
  2019-04-13  1:34   ` Jeff King
  2019-04-13  5:51     ` Junio C Hamano
  2019-04-14  4:34     ` Rohit Ashiwal
@ 2019-04-26 14:28     ` Johannes Schindelin
  2019-05-01 18:07       ` Jeff King
  2 siblings, 1 reply; 43+ messages in thread
From: Johannes Schindelin @ 2019-04-26 14:28 UTC (permalink / raw)
  To: Jeff King
  Cc: Rohit Ashiwal via GitGitGadget, git, Junio C Hamano, Rohit Ashiwal

Hi Peff,

On Fri, 12 Apr 2019, Jeff King wrote:

> On Fri, Apr 12, 2019 at 04:04:39PM -0700, Rohit Ashiwal via GitGitGadget wrote:
>
> > From: Rohit Ashiwal <rohit.ashiwal265@gmail.com>
> >
> > MinGit for Windows comes without `gzip` bundled inside, git-archive uses
> > `gzip -cn` to compress tar files but for this to work, gzip needs to be
> > present on the host system.
> >
> > In the next commit, we will change the gzip compression so that we no
> > longer spawn `gzip` but let zlib perform the compression in the same
> > process instead.
> >
> > In preparation for this, we consolidate all the block writes into a
> > single function.
>
> Sounds like a good preparatory step. This part confused me, though:
>
> > @@ -38,11 +40,21 @@ static int write_tar_filter_archive(const struct archiver *ar,
> >  #define USTAR_MAX_MTIME 077777777777ULL
> >  #endif
> >
> > +/* writes out the whole block, or dies if fails */
> > +static void write_block_or_die(const char *block) {
> > +	if (gzip) {
> > +		if (gzwrite(gzip, block, (unsigned) BLOCKSIZE) != BLOCKSIZE)
> > +			die(_("gzwrite failed"));
> > +	} else {
> > +		write_or_die(1, block, BLOCKSIZE);
> > +	}
> > +}
>
> What is gzwrite()? At first I thought this was an out-of-sequence bit of
> the series, but it turns out that this is a zlib.h interface. So the
> idea (I think) is that here we introduce a "gzip" variable that is
> always false, and this first conditional arm is effectively dead code.
> And then in a later patch we'd set up "gzip" and it would become
> not-dead.
>
> I think it would be less confusing if this just factored out
> write_block_or_die(), which starts as a thin wrapper and then grows the
> gzip parts in the next patch.

Yes, I missed this in my pre-submission review. Sorry about that!

> A few nits on the code itself:
>
> > +static gzFile gzip;
> > [...]
> > +       if (gzip) {
>
> Is it OK for us to ask about the truthiness of this opaque type? That
> works if it's really a pointer behind the scenes, but it seems like it
> would be equally OK for zlib to declare it as a struct.
>
> It looks OK in my version of zlib, and that library tends to be fairly
> conservative so I wouldn't be surprised if it was that way back to the
> beginning and remains that way for eternity. But it feels like a bad
> pattern.

It is even part of the public API that `gzFile` is `typedef`'d to a
pointer. So I think in the interest of simplicity, I'll leave it at that
(but I'll mention this in the commit message).

> > +		if (gzwrite(gzip, block, (unsigned) BLOCKSIZE) != BLOCKSIZE)
>
> This cast is interesting. All of the matching write_or_die() calls are
> promoting it to a size_t, which is also unsigned.
>
> BLOCKSIZE is a constant. Should we be defining it with a "U" in the
> first place?

Yep, good idea.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die()
  2019-04-13  5:51     ` Junio C Hamano
  2019-04-14  4:36       ` Rohit Ashiwal
@ 2019-04-26 14:29       ` Johannes Schindelin
  2019-04-26 23:44         ` Junio C Hamano
  1 sibling, 1 reply; 43+ messages in thread
From: Johannes Schindelin @ 2019-04-26 14:29 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Rohit Ashiwal via GitGitGadget, git, Rohit Ashiwal

Hi Junio,

On Sat, 13 Apr 2019, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
>
> >> +/* writes out the whole block, or dies if fails */
> >> +static void write_block_or_die(const char *block) {
> >> +	if (gzip) {
> >> +		if (gzwrite(gzip, block, (unsigned) BLOCKSIZE) != BLOCKSIZE)
> >> +			die(_("gzwrite failed"));
> >> +	} else {
> >> +		write_or_die(1, block, BLOCKSIZE);
> >> +	}
> >> +}
>
> I agree everything you said you your two review messages.
>
> One thing you did not mention but I found disturbing was that this
> does not take size argument but hardcodes BLOCKSIZE.

That is very much on purpose, as this code really is specific to the `tar`
file format, which has a fixed, well-defined block size. It would make it
easier to introduce a bug if that was a parameter.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] archive: avoid spawning `gzip`
  2019-04-13  1:51   ` Jeff King
  2019-04-13 22:01     ` René Scharfe
  2019-04-13 22:16     ` brian m. carlson
@ 2019-04-26 14:47     ` Johannes Schindelin
  2 siblings, 0 replies; 43+ messages in thread
From: Johannes Schindelin @ 2019-04-26 14:47 UTC (permalink / raw)
  To: Jeff King
  Cc: Rohit Ashiwal via GitGitGadget, git, Junio C Hamano, Rohit Ashiwal

[-- Attachment #1: Type: text/plain, Size: 4555 bytes --]

Hi Peff,

On Fri, 12 Apr 2019, Jeff King wrote:

> On Fri, Apr 12, 2019 at 04:04:40PM -0700, Rohit Ashiwal via GitGitGadget wrote:
>
> > From: Rohit Ashiwal <rohit.ashiwal265@gmail.com>
> >
> > As we already link to the zlib library, we can perform the compression
> > without even requiring gzip on the host machine.
>
> Very cool. It's nice to drop a dependency, and this should be a bit more
> efficient, too.

Sadly, no, as René intuited and as your testing shows: there seems to be a
~15% penalty for compressing in the same thread as producing the data to
be compressed.

Given that it reduces the number of dependencies, and given that it might
be better to rely on the external command `pigz -cn` if speed is really a
matter, I still think it makes sense to switch the default, though.

> > diff --git a/archive-tar.c b/archive-tar.c
> > index ba37dad27c..5979ed14b7 100644
> > --- a/archive-tar.c
> > +++ b/archive-tar.c
> > @@ -466,18 +466,34 @@ static int write_tar_filter_archive(const struct archiver *ar,
> >  	filter.use_shell = 1;
> >  	filter.in = -1;
> >
> > -	if (start_command(&filter) < 0)
> > -		die_errno(_("unable to start '%s' filter"), argv[0]);
> > -	close(1);
> > -	if (dup2(filter.in, 1) < 0)
> > -		die_errno(_("unable to redirect descriptor"));
> > -	close(filter.in);
> > +	if (!strcmp("gzip -cn", ar->data)) {
>
> I wondered how you were going to kick this in, since users can define
> arbitrary filters. I think it's kind of neat to automagically convert
> "gzip -cn" (which also happens to be the default). But I think we should
> mention that in the Documentation, in case somebody tries to use a
> custom version of gzip and wonders why it isn't kicking in.
>
> Likewise, it might make sense in the tests to put a poison gzip in the
> $PATH so that we can be sure we're using our internal code, and not just
> calling out to gzip (on platforms that have it, of course).
>
> The alternative is that we could use a special token like ":zlib" or
> something to indicate that the internal implementation should be used
> (and then tweak the baked-in default, too). That might be less
> surprising for users, but most people would still get the benefit since
> they'd be using the default config.

I went with `:zlib`.

> > +		char outmode[4] = "wb\0";
>
> This looks sufficiently magical that it might merit a comment. I had to
> look in the zlib header file to learn that this is just a normal
> stdio-style mode. But we can't just do:
>
>   gzip = gzdopen(fd, "wb");
>
> because we want to (maybe) append a compression level. It's also
> slightly confusing that it explicitly includes a NUL, but later:
>
> > +		if (args->compression_level >= 0 && args->compression_level <= 9)
> > +			outmode[2] = '0' + args->compression_level;
>
> we may overwrite that and assume that outmode[3] is also a NUL. Which it
> is, because of how C initialization works. But that means we also do not
> need the "\0" in the initializer.
>
> Dropping that may make it slightly less jarring (any time I see a
> backslash escape in an initializer, I assume I'm in for some binary
> trickery, but this turns out to be much more mundane).
>
> I'd also consider just using a strbuf:
>
>   struct strbuf outmode = STRBUF_INIT;
>
>   strbuf_addstr(&outmode, "wb");
>   if (args->compression_level >= 0 && args->compression_level <= 9)
> 	strbuf_addch(&outmode, '0' + args->compression_level);
>
> That's overkill in a sense, but it saves us having to deal with
> manually-counted offsets, and this code is only run once per program
> invocation, so the efficiency shouldn't matter.

I'll change that, too, as it seems that `pigz` allows compression levels
higher than 9, in which case we need `strbuf_addf()` anyway. I will not
adjust the condition `<= 9`, of course, as zlib is still limited that way.

> > +		gzip = gzdopen(fileno(stdout), outmode);
> > +		if (!gzip)
> > +			die(_("Could not gzdopen stdout"));
>
> Is there a way to get a more specific error from zlib? I'm less
> concerned about gzdopen here (which should never fail), and more about
> the writing and closing steps. I don't see anything good for gzdopen(),
> but...

Sadly, I did not find anything there.

> > +	if (gzip) {
> > +		if (gzclose(gzip) != Z_OK)
> > +			die(_("gzclose failed"));
>
> ...according to zlib.h, here the returned int is meaningful. And if
> Z_ERRNO, we should probably use die_errno() to give a better message.

Okay.

Thanks,
Dscho

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] archive: avoid spawning `gzip`
  2019-04-15 21:35       ` Jeff King
@ 2019-04-26 14:51         ` Johannes Schindelin
  2019-04-27  9:59           ` René Scharfe
  0 siblings, 1 reply; 43+ messages in thread
From: Johannes Schindelin @ 2019-04-26 14:51 UTC (permalink / raw)
  To: Jeff King
  Cc: René Scharfe, Rohit Ashiwal via GitGitGadget, git,
	Junio C Hamano, Rohit Ashiwal

[-- Attachment #1: Type: text/plain, Size: 2596 bytes --]

Hi Peff,

On Mon, 15 Apr 2019, Jeff King wrote:

> On Sun, Apr 14, 2019 at 12:01:10AM +0200, René Scharfe wrote:
>
> > >> As we already link to the zlib library, we can perform the compression
> > >> without even requiring gzip on the host machine.
> > >
> > > Very cool. It's nice to drop a dependency, and this should be a bit more
> > > efficient, too.
> >
> > Getting rid of dependencies is good, and using zlib is the obvious way to
> > generate .tgz files. Last time I tried something like that, a separate gzip
> > process was faster, though -- at least on Linux [1].  How does this one
> > fare?
>
> I'd expect a separate gzip to be faster in wall-clock time for a
> multi-core machine, but overall consume more CPU. I'm slightly surprised
> that your timings show that it actually wins on total CPU, too.

If performance is really a concern, you'll be much better off using `pigz`
than `gzip`.

> Here are best-of-five times for "git archive --format=tar.gz HEAD" on
> linux.git (the machine is a quad-core):
>
>   [before, separate gzip]
>   real	0m21.501s
>   user	0m26.148s
>   sys	0m0.619s
>
>   [after, internal gzwrite]
>   real	0m25.156s
>   user	0m25.059s
>   sys	0m0.096s
>
> which does show what I expect (longer overall, but less total CPU).
>
> Which one you prefer depends on your situation, of course. A user on a
> workstation with multiple cores probably cares most about end-to-end
> latency and using all of their available horsepower. A server hosting
> repositories and receiving many unrelated requests probably cares more
> about total CPU (though the differences there are small enough that it
> may not even be worth having a config knob to un-parallelize it).

I am a bit sad that this is so noticeable. Nevertheless, I think that
dropping the dependency is worth it, in particular given that `gzip` is
not exactly fast to begin with (you really should switch to `pigz` or to a
faster compression if you are interested in speed).

> > Doing compression in its own thread may be a good idea.
>
> Yeah. It might even make the patch simpler, since I'd expect it to be
> implemented with start_async() and a descriptor, making it look just
> like a gzip pipe to the caller. :)

Sadly, it does not really look like it is simpler.

And when going into the direction of multi-threaded compression anyway,
the `pigz` trick of compressing 32kB chunks in parallel is an interesting
idea, too.

All of this, however, is outside of the purview of this (still relatively
simple) patch series.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] archive: avoid spawning `gzip`
  2019-04-13 22:16     ` brian m. carlson
  2019-04-15 21:36       ` Jeff King
@ 2019-04-26 14:54       ` Johannes Schindelin
  2019-05-02 20:20         ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 43+ messages in thread
From: Johannes Schindelin @ 2019-04-26 14:54 UTC (permalink / raw)
  To: brian m. carlson
  Cc: Jeff King, Rohit Ashiwal via GitGitGadget, git, Junio C Hamano,
	Rohit Ashiwal

Hi brian,

On Sat, 13 Apr 2019, brian m. carlson wrote:

> On Fri, Apr 12, 2019 at 09:51:02PM -0400, Jeff King wrote:
> > I wondered how you were going to kick this in, since users can define
> > arbitrary filters. I think it's kind of neat to automagically convert
> > "gzip -cn" (which also happens to be the default). But I think we should
> > mention that in the Documentation, in case somebody tries to use a
> > custom version of gzip and wonders why it isn't kicking in.
> >
> > Likewise, it might make sense in the tests to put a poison gzip in the
> > $PATH so that we can be sure we're using our internal code, and not just
> > calling out to gzip (on platforms that have it, of course).
> >
> > The alternative is that we could use a special token like ":zlib" or
> > something to indicate that the internal implementation should be used
> > (and then tweak the baked-in default, too). That might be less
> > surprising for users, but most people would still get the benefit since
> > they'd be using the default config.
>
> I agree that a special value (or NULL, if that's possible) would be
> nicer here. That way, if someone does specify a custom gzip, we honor
> it, and it serves to document the code better. For example, if someone
> symlinked pigz to gzip and used "gzip -cn", then they might not get the
> parallelization benefits they expected.

I went with `:zlib`. The `NULL` value would not really work, as there is
no way to specify that via `archive.tgz.command`.

About the symlinked thing: I do not really want to care to support such
hacks. If you want a different compressor than the default (which can
change), you should specify it specifically.

> I'm fine overall with the idea of bringing the compression into the
> binary using zlib, provided that we preserve the "-n" behavior
> (producing reproducible archives).

Thanks for voicing this concern. I had a look at zlib's source code, and
it looks like it requires an extra function call (that we don't call) to
make the resulting file non-reproducible. In other words, it has the
opposite default behavior from `gzip`.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die()
  2019-04-26 14:29       ` Johannes Schindelin
@ 2019-04-26 23:44         ` Junio C Hamano
  2019-04-29 21:32           ` Johannes Schindelin
  0 siblings, 1 reply; 43+ messages in thread
From: Junio C Hamano @ 2019-04-26 23:44 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Jeff King, Rohit Ashiwal via GitGitGadget, git, Rohit Ashiwal

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

>> >> +/* writes out the whole block, or dies if fails */
>> >> +static void write_block_or_die(const char *block) {
>> >> +	if (gzip) {
>> >> +		if (gzwrite(gzip, block, (unsigned) BLOCKSIZE) != BLOCKSIZE)
>> >> +			die(_("gzwrite failed"));
>> >> +	} else {
>> >> +		write_or_die(1, block, BLOCKSIZE);
>> >> +	}
>> >> +}
>>
>> I agree everything you said you your two review messages.
>>
>> One thing you did not mention but I found disturbing was that this
>> does not take size argument but hardcodes BLOCKSIZE.
>
> That is very much on purpose, as this code really is specific to the `tar`
> file format, which has a fixed, well-defined block size. It would make it
> easier to introduce a bug if that was a parameter.

I am not so sure for two reasons.

One is that its caller is full of BLOCKSIZE constants passed as
parameters (instead of calling a specialized function that hardcodes
the BLOCKSIZE without taking it as a parameter), and this being a
file-scope static, it does not really matter with respect to an
accidental bug of mistakenly changing BLOCKSIZE either in the caller
or callee.

Another is that I am not sure how your "fixed format" argument
meshes with the "-b blocksize" parameter to affect the tar/pax
output.  The format may be fixed, but it is parameterized.  If
we ever need to grow the ability to take "-b", having the knowledge
that our current code is limited to the fixed BLOCKSIZE in a single
function (i.e. the caller of this function , not the callee) would 
be less error prone.

These two are in addition to the uniformity of the abstraction
concerns I raised in my original review comment.

So, sorry, I do not think your response makes much sense.

Thanks.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] archive: avoid spawning `gzip`
  2019-04-26 14:51         ` Johannes Schindelin
@ 2019-04-27  9:59           ` René Scharfe
  2019-04-27 17:39             ` René Scharfe
  0 siblings, 1 reply; 43+ messages in thread
From: René Scharfe @ 2019-04-27  9:59 UTC (permalink / raw)
  To: Johannes Schindelin, Jeff King
  Cc: Rohit Ashiwal via GitGitGadget, git, Junio C Hamano, Rohit Ashiwal

Am 26.04.19 um 16:51 schrieb Johannes Schindelin:> Hi Peff,
>
> On Mon, 15 Apr 2019, Jeff King wrote:
>
>> On Sun, Apr 14, 2019 at 12:01:10AM +0200, René Scharfe wrote:
>>
>>>>> As we already link to the zlib library, we can perform the compression
>>>>> without even requiring gzip on the host machine.
>>>>
>>>> Very cool. It's nice to drop a dependency, and this should be a bit more
>>>> efficient, too.
>>>
>>> Getting rid of dependencies is good, and using zlib is the obvious way to
>>> generate .tgz files. Last time I tried something like that, a separate gzip
>>> process was faster, though -- at least on Linux [1].  How does this one
>>> fare?
>>
>> I'd expect a separate gzip to be faster in wall-clock time for a
>> multi-core machine, but overall consume more CPU. I'm slightly surprised
>> that your timings show that it actually wins on total CPU, too.

My initial expectation back then was that moving data between processes
is costly and that compressing in-process would improve the overall
performance.  Your expectation is more in line with what I then actually
saw.  The difference in total CPU time wasn't that big, perhaps just
noise.

> If performance is really a concern, you'll be much better off using `pigz`
> than `gzip`.

Performance is always a concern, but on the other hand I didn't see any
complaints about slow archiving so far.

>> Here are best-of-five times for "git archive --format=tar.gz HEAD" on
>> linux.git (the machine is a quad-core):
>>
>>    [before, separate gzip]
>>    real	0m21.501s
>>    user	0m26.148s
>>    sys	0m0.619s
>>
>>    [after, internal gzwrite]
>>    real	0m25.156s
>>    user	0m25.059s
>>    sys	0m0.096s
>>
>> which does show what I expect (longer overall, but less total CPU).

I get similar numbers with hyperfine:

Benchmark #1: git archive --format=tar.gz HEAD >/dev/null
  Time (mean ± σ):     16.683 s ±  0.451 s    [User: 20.230 s, System: 0.375 s]
  Range (min … max):   16.308 s … 17.852 s    10 runs

Benchmark #2: ~/src/git/git-archive --format=tar.gz HEAD >/dev/null
  Time (mean ± σ):     19.898 s ±  0.228 s    [User: 19.825 s, System: 0.073 s]
  Range (min … max):   19.627 s … 20.355 s    10 runs

Benchmark #3: git archive --format=zip HEAD >/dev/null
  Time (mean ± σ):     16.449 s ±  0.075 s    [User: 16.340 s, System: 0.109 s]
  Range (min … max):   16.326 s … 16.611 s    10 runs

#1 is git v2.21.0, #2 is with the two patches applied, #3 is v2.21.0
again, but with zip output, just to put things into perspective.

>> Which one you prefer depends on your situation, of course. A user on a
>> workstation with multiple cores probably cares most about end-to-end
>> latency and using all of their available horsepower. A server hosting
>> repositories and receiving many unrelated requests probably cares more
>> about total CPU (though the differences there are small enough that it
>> may not even be worth having a config knob to un-parallelize it).
>
> I am a bit sad that this is so noticeable. Nevertheless, I think that
> dropping the dependency is worth it, in particular given that `gzip` is
> not exactly fast to begin with (you really should switch to `pigz` or to a
> faster compression if you are interested in speed).

We could import pigz verbatim, it's just 11K LOCs total. :)

>>> Doing compression in its own thread may be a good idea.
>>
>> Yeah. It might even make the patch simpler, since I'd expect it to be
>> implemented with start_async() and a descriptor, making it look just
>> like a gzip pipe to the caller. :)
>
> Sadly, it does not really look like it is simpler.

I have to agree -- at least I was unable to pull off the stdout
plumbing trick.  Is there a way?  But it doesn't look too bad, and
the performance is closer to using the real gzip:

Benchmark #1: ~/src/git/git-archive --format=tar.gz HEAD >/dev/null
  Time (mean ± σ):     17.300 s ±  0.198 s    [User: 20.825 s, System: 0.356 s]
  Range (min … max):   17.042 s … 17.638 s    10 runs

This is with the following patch:

---
 archive-tar.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 59 insertions(+), 4 deletions(-)

diff --git a/archive-tar.c b/archive-tar.c
index 3e53aac1e6..c889b84c2c 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -38,11 +38,13 @@ static int write_tar_filter_archive(const struct archiver *ar,
 #define USTAR_MAX_MTIME 077777777777ULL
 #endif

+static int out_fd = 1;
+
 /* writes out the whole block, but only if it is full */
 static void write_if_needed(void)
 {
 	if (offset == BLOCKSIZE) {
-		write_or_die(1, block, BLOCKSIZE);
+		write_or_die(out_fd, block, BLOCKSIZE);
 		offset = 0;
 	}
 }
@@ -66,7 +68,7 @@ static void do_write_blocked(const void *data, unsigned long size)
 		write_if_needed();
 	}
 	while (size >= BLOCKSIZE) {
-		write_or_die(1, buf, BLOCKSIZE);
+		write_or_die(out_fd, buf, BLOCKSIZE);
 		size -= BLOCKSIZE;
 		buf += BLOCKSIZE;
 	}
@@ -101,10 +103,10 @@ static void write_trailer(void)
 {
 	int tail = BLOCKSIZE - offset;
 	memset(block + offset, 0, tail);
-	write_or_die(1, block, BLOCKSIZE);
+	write_or_die(out_fd, block, BLOCKSIZE);
 	if (tail < 2 * RECORDSIZE) {
 		memset(block, 0, offset);
-		write_or_die(1, block, BLOCKSIZE);
+		write_or_die(out_fd, block, BLOCKSIZE);
 	}
 }

@@ -434,6 +436,56 @@ static int write_tar_archive(const struct archiver *ar,
 	return err;
 }

+static int internal_gzip(int in, int out, void *data)
+{
+	int *levelp = data;
+	gzFile gzip = gzdopen(1, "wb");
+	if (!gzip)
+		die(_("gzdopen failed"));
+	if (gzsetparams(gzip, *levelp, Z_DEFAULT_STRATEGY) != Z_OK)
+		die(_("unable to set compression level"));
+
+	for (;;) {
+		char buf[BLOCKSIZE];
+		ssize_t read = xread(in, buf, sizeof(buf));
+		if (read < 0)
+			die_errno(_("read failed"));
+		if (read == 0)
+			break;
+		if (gzwrite(gzip, buf, read) != read)
+			die(_("gzwrite failed"));
+	}
+
+	if (gzclose(gzip) != Z_OK)
+		die(_("gzclose failed"));
+	close(in);
+	return 0;
+}
+
+static int write_tar_gzip_archive(const struct archiver *ar,
+				  struct archiver_args *args)
+{
+	struct async filter;
+	int r;
+
+	memset(&filter, 0, sizeof(filter));
+	filter.proc = internal_gzip;
+	filter.data = &args->compression_level;
+	filter.in = -1;
+
+	if (start_async(&filter))
+		die(_("unable to fork off internal gzip"));
+	out_fd = filter.in;
+
+	r = write_tar_archive(ar, args);
+
+	close(out_fd);
+	if (finish_async(&filter))
+		die(_("error in internal gzip"));
+
+	return r;
+}
+
 static int write_tar_filter_archive(const struct archiver *ar,
 				    struct archiver_args *args)
 {
@@ -445,6 +497,9 @@ static int write_tar_filter_archive(const struct archiver *ar,
 	if (!ar->data)
 		BUG("tar-filter archiver called with no filter defined");

+	if (!strcmp(ar->data, "gzip -cn"))
+		return write_tar_gzip_archive(ar, args);
+
 	strbuf_addstr(&cmd, ar->data);
 	if (args->compression_level >= 0)
 		strbuf_addf(&cmd, " -%d", args->compression_level);
--
2.21.0

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] archive: avoid spawning `gzip`
  2019-04-27  9:59           ` René Scharfe
@ 2019-04-27 17:39             ` René Scharfe
  2019-04-29 21:25               ` Johannes Schindelin
  0 siblings, 1 reply; 43+ messages in thread
From: René Scharfe @ 2019-04-27 17:39 UTC (permalink / raw)
  To: Johannes Schindelin, Jeff King
  Cc: Rohit Ashiwal via GitGitGadget, git, Junio C Hamano, Rohit Ashiwal

Am 27.04.19 um 11:59 schrieb René Scharfe:> Am 26.04.19 um 16:51 schrieb Johannes Schindelin:
>>
>> On Mon, 15 Apr 2019, Jeff King wrote:
>>
>>> On Sun, Apr 14, 2019 at 12:01:10AM +0200, René Scharfe wrote:
>>>
>>>> Doing compression in its own thread may be a good idea.
>>>
>>> Yeah. It might even make the patch simpler, since I'd expect it to be
>>> implemented with start_async() and a descriptor, making it look just
>>> like a gzip pipe to the caller. :)
>>
>> Sadly, it does not really look like it is simpler.
>
> I have to agree -- at least I was unable to pull off the stdout
> plumbing trick.

The simplest solution is of course to not touch the archive code.  The
patch below makes that possible:

Benchmark #1: ~/src/git/git -c tar.tgz.command=~/src/git/git-gzip archive --format=tgz HEAD >/dev/null
  Time (mean ± σ):     17.256 s ±  0.299 s    [User: 20.380 s, System: 0.294 s]
  Range (min … max):   16.940 s … 17.804 s    10 runs

Curious to see how it looks like on other systems and platforms.

And perhaps the buffer size needs to be tuned.

-- >8 --
Subject: [PATCH] add git gzip

Add a cheap gzip lookalike based on zlib for systems that don't have
(or want) the real thing.  It can be used e.g. to generate tgz files
using git archive and its configuration options tar.tgz.command and
tar.tar.gz.command, without any other external dependency.

Signed-off-by: Rene Scharfe <l.s.r@web.de>
---
 .gitignore       |  1 +
 Makefile         |  1 +
 builtin.h        |  1 +
 builtin/gzip.c   | 64 ++++++++++++++++++++++++++++++++++++++++++++++++
 command-list.txt |  1 +
 git.c            |  1 +
 6 files changed, 69 insertions(+)
 create mode 100644 builtin/gzip.c

diff --git a/.gitignore b/.gitignore
index 44c74402c8..e550868219 100644
--- a/.gitignore
+++ b/.gitignore
@@ -71,6 +71,7 @@
 /git-gc
 /git-get-tar-commit-id
 /git-grep
+/git-gzip
 /git-hash-object
 /git-help
 /git-http-backend
diff --git a/Makefile b/Makefile
index 9f1b6e8926..2b34f1a4aa 100644
--- a/Makefile
+++ b/Makefile
@@ -1075,6 +1075,7 @@ BUILTIN_OBJS += builtin/fsck.o
 BUILTIN_OBJS += builtin/gc.o
 BUILTIN_OBJS += builtin/get-tar-commit-id.o
 BUILTIN_OBJS += builtin/grep.o
+BUILTIN_OBJS += builtin/gzip.o
 BUILTIN_OBJS += builtin/hash-object.o
 BUILTIN_OBJS += builtin/help.o
 BUILTIN_OBJS += builtin/index-pack.o
diff --git a/builtin.h b/builtin.h
index b78ab6e30b..abc34cc9d0 100644
--- a/builtin.h
+++ b/builtin.h
@@ -170,6 +170,7 @@ extern int cmd_fsck(int argc, const char **argv, const char *prefix);
 extern int cmd_gc(int argc, const char **argv, const char *prefix);
 extern int cmd_get_tar_commit_id(int argc, const char **argv, const char *prefix);
 extern int cmd_grep(int argc, const char **argv, const char *prefix);
+extern int cmd_gzip(int argc, const char **argv, const char *prefix);
 extern int cmd_hash_object(int argc, const char **argv, const char *prefix);
 extern int cmd_help(int argc, const char **argv, const char *prefix);
 extern int cmd_index_pack(int argc, const char **argv, const char *prefix);
diff --git a/builtin/gzip.c b/builtin/gzip.c
new file mode 100644
index 0000000000..90a98c44ce
--- /dev/null
+++ b/builtin/gzip.c
@@ -0,0 +1,64 @@
+#include "cache.h"
+#include "builtin.h"
+#include "parse-options.h"
+
+static const char * const gzip_usage[] = {
+	N_("git gzip [-NUM]"),
+	NULL
+};
+
+static int level_callback(const struct option *opt, const char *arg, int unset)
+{
+	int *levelp = opt->value;
+	int value;
+	const char *endp;
+
+	if (unset)
+		BUG("switch -NUM cannot be negated");
+
+	value = strtol(arg, (char **)&endp, 10);
+	if (*endp)
+		BUG("switch -NUM cannot be non-numeric");
+
+	*levelp = value;
+	return 0;
+}
+
+#define BUFFERSIZE (64 * 1024)
+
+int cmd_gzip(int argc, const char **argv, const char *prefix)
+{
+	gzFile gz;
+	int level = Z_DEFAULT_COMPRESSION;
+	struct option options[] = {
+		OPT_NUMBER_CALLBACK(&level, N_("compression level"),
+				    level_callback),
+		OPT_END()
+	};
+
+	argc = parse_options(argc, argv, prefix, options, gzip_usage, 0);
+	if (argc > 0)
+		usage_with_options(gzip_usage, options);
+
+	gz = gzdopen(1, "wb");
+	if (!gz)
+		die(_("unable to gzdopen stdout"));
+
+	if (gzsetparams(gz, level, Z_DEFAULT_STRATEGY) != Z_OK)
+		die(_("unable to set compression level %d"), level);
+
+	for (;;) {
+		char buf[BUFFERSIZE];
+		ssize_t read_bytes = xread(0, buf, sizeof(buf));
+		if (read_bytes < 0)
+			die_errno(_("unable to read from stdin"));
+		if (read_bytes == 0)
+			break;
+		if (gzwrite(gz, buf, read_bytes) != read_bytes)
+			die(_("gzwrite failed"));
+	}
+
+	if (gzclose(gz) != Z_OK)
+		die(_("gzclose failed"));
+	return 0;
+}
diff --git a/command-list.txt b/command-list.txt
index 3a9af104b5..755848842c 100644
--- a/command-list.txt
+++ b/command-list.txt
@@ -99,6 +99,7 @@ git-gc                                  mainporcelain
 git-get-tar-commit-id                   plumbinginterrogators
 git-grep                                mainporcelain           info
 git-gui                                 mainporcelain
+git-gzip                                purehelpers
 git-hash-object                         plumbingmanipulators
 git-help                                ancillaryinterrogators          complete
 git-http-backend                        synchingrepositories
diff --git a/git.c b/git.c
index 50da125c60..48f7fc6c56 100644
--- a/git.c
+++ b/git.c
@@ -510,6 +510,7 @@ static struct cmd_struct commands[] = {
 	{ "gc", cmd_gc, RUN_SETUP },
 	{ "get-tar-commit-id", cmd_get_tar_commit_id, NO_PARSEOPT },
 	{ "grep", cmd_grep, RUN_SETUP_GENTLY },
+	{ "gzip", cmd_gzip },
 	{ "hash-object", cmd_hash_object },
 	{ "help", cmd_help },
 	{ "index-pack", cmd_index_pack, RUN_SETUP_GENTLY | NO_PARSEOPT },
--
2.21.0

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] archive: avoid spawning `gzip`
  2019-04-27 17:39             ` René Scharfe
@ 2019-04-29 21:25               ` Johannes Schindelin
  2019-05-01 17:45                 ` René Scharfe
  0 siblings, 1 reply; 43+ messages in thread
From: Johannes Schindelin @ 2019-04-29 21:25 UTC (permalink / raw)
  To: René Scharfe
  Cc: Jeff King, Rohit Ashiwal via GitGitGadget, git, Junio C Hamano,
	Rohit Ashiwal

[-- Attachment #1: Type: text/plain, Size: 1709 bytes --]

Hi René,


On Sat, 27 Apr 2019, René Scharfe wrote:

> Am 27.04.19 um 11:59 schrieb René Scharfe:> Am 26.04.19 um 16:51 schrieb
> Johannes Schindelin:
> >>
> >> On Mon, 15 Apr 2019, Jeff King wrote:
> >>
> >>> On Sun, Apr 14, 2019 at 12:01:10AM +0200, René Scharfe wrote:
> >>>
> >>>> Doing compression in its own thread may be a good idea.
> >>>
> >>> Yeah. It might even make the patch simpler, since I'd expect it to
> >>> be implemented with start_async() and a descriptor, making it look
> >>> just like a gzip pipe to the caller. :)
> >>
> >> Sadly, it does not really look like it is simpler.
> >
> > I have to agree -- at least I was unable to pull off the stdout
> > plumbing trick.
>
> The simplest solution is of course to not touch the archive code.

We could do that, of course, and we could avoid adding a new command that
we have to support for eternity by introducing a command mode for `git
archive` instead (think: `git archive --gzip -9`), and marking that
command mode clearly as an internal implementation detail.

But since the performance is still not quite on par with `gzip`, I would
actually rather not, and really, just punt on that one, stating that
people interested in higher performance should use `pigz`.

And who knows, maybe nobody will complain at all about the performance?
It's not like `gzip` is really, really fast (IIRC LZO blows gzip out of
the water, speed-wise).

And if we get "bug" reports about this, we

1) have a very easy workaround:

	git config --global archive.tgz.command 'gzip -cn'

2) could always implement a pigz-like multi-threading solution.

I strongly expect a YAGNI here, though.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die()
  2019-04-26 23:44         ` Junio C Hamano
@ 2019-04-29 21:32           ` Johannes Schindelin
  2019-05-01 18:09             ` Jeff King
  0 siblings, 1 reply; 43+ messages in thread
From: Johannes Schindelin @ 2019-04-29 21:32 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Rohit Ashiwal via GitGitGadget, git, Rohit Ashiwal

Hi Junio,

On Sat, 27 Apr 2019, Junio C Hamano wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
>
> >> >> +/* writes out the whole block, or dies if fails */
> >> >> +static void write_block_or_die(const char *block) {
> >> >> +	if (gzip) {
> >> >> +		if (gzwrite(gzip, block, (unsigned) BLOCKSIZE) != BLOCKSIZE)
> >> >> +			die(_("gzwrite failed"));
> >> >> +	} else {
> >> >> +		write_or_die(1, block, BLOCKSIZE);
> >> >> +	}
> >> >> +}
> >>
> >> I agree everything you said you your two review messages.
> >>
> >> One thing you did not mention but I found disturbing was that this
> >> does not take size argument but hardcodes BLOCKSIZE.
> >
> > That is very much on purpose, as this code really is specific to the `tar`
> > file format, which has a fixed, well-defined block size. It would make it
> > easier to introduce a bug if that was a parameter.
>
> I am not so sure for two reasons.
>
> One is that its caller is full of BLOCKSIZE constants passed as
> parameters (instead of calling a specialized function that hardcodes
> the BLOCKSIZE without taking it as a parameter), and this being a
> file-scope static, it does not really matter with respect to an
> accidental bug of mistakenly changing BLOCKSIZE either in the caller
> or callee.

I guess I can try to find some time next week to clean up those callers.
But honestly, I do not really think that this cleanup falls squarely into
the goal of this here patch series.

> Another is that I am not sure how your "fixed format" argument
> meshes with the "-b blocksize" parameter to affect the tar/pax
> output.  The format may be fixed, but it is parameterized.  If
> we ever need to grow the ability to take "-b", having the knowledge
> that our current code is limited to the fixed BLOCKSIZE in a single
> function (i.e. the caller of this function , not the callee) would
> be less error prone.

This argument would hold a lot more water if the following lines were not
part of archive-tar.c:

	#define RECORDSIZE      (512)
	#define BLOCKSIZE       (RECORDSIZE * 20)

	static char block[BLOCKSIZE];

If you can tell me how the `-b` (run-time) parameter can affect the
(compile-time) `BLOCKSIZE` constant, maybe I can start to understand your
concern.

:-)

Ciao,
Dscho

P.S.: I just looked, and I do not even see a `-b` option of `git archive`,
so I suspect that you talked about the generic tar file format? I was not
talking about each and every implementation of the tar file format here, I
was talking about the tar file format that Git generates.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] archive: avoid spawning `gzip`
  2019-04-29 21:25               ` Johannes Schindelin
@ 2019-05-01 17:45                 ` René Scharfe
  2019-05-01 18:18                   ` Jeff King
  0 siblings, 1 reply; 43+ messages in thread
From: René Scharfe @ 2019-05-01 17:45 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Jeff King, Rohit Ashiwal via GitGitGadget, git, Junio C Hamano,
	Rohit Ashiwal

Hello Dscho,

Am 29.04.19 um 23:25 schrieb Johannes Schindelin:
> On Sat, 27 Apr 2019, René Scharfe wrote:
>> The simplest solution is of course to not touch the archive code.
>
> We could do that, of course, and we could avoid adding a new command that
> we have to support for eternity by introducing a command mode for `git
> archive` instead (think: `git archive --gzip -9`), and marking that
> command mode clearly as an internal implementation detail.

adding gzip as the 142nd git command and 18th pure helper *would* be a
bit embarrassing, in particular for a command that's not directly
related to version control and readily available on all platforms.
Exposing it as a (hidden?) archive sub-command might be better.

> But since the performance is still not quite on par with `gzip`, I would
> actually rather not, and really, just punt on that one, stating that
> people interested in higher performance should use `pigz`.

Here are my performance numbers for generating .tar.gz files again:

master, using gzip(1):
  Time (mean ± σ):     16.683 s ±  0.451 s    [User: 20.230 s, System: 0.375 s]
  Range (min … max):   16.308 s … 17.852 s    10 runs

using zlib sequentially:
  Time (mean ± σ):     19.898 s ±  0.228 s    [User: 19.825 s, System: 0.073 s]
  Range (min … max):   19.627 s … 20.355 s    10 runs

using zlib asynchronously:
  Time (mean ± σ):     17.300 s ±  0.198 s    [User: 20.825 s, System: 0.356 s]
  Range (min … max):   17.042 s … 17.638 s    10 runs

using a gzip-lookalike:
  Time (mean ± σ):     17.256 s ±  0.299 s    [User: 20.380 s, System: 0.294 s]
  Range (min … max):   16.940 s … 17.804 s    10 runs

The last two have comparable system time, ca. 1% more user time and
ca. 5% longer duration.  The second one has much better system time
and 2% less user time and 19% longer duration.  Hmm.

> And who knows, maybe nobody will complain at all about the performance?

Probably.  And popular tarballs would be cached anyway, I guess.

So I'll send comments on your series later this week.

René

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die()
  2019-04-26 14:28     ` Johannes Schindelin
@ 2019-05-01 18:07       ` Jeff King
  0 siblings, 0 replies; 43+ messages in thread
From: Jeff King @ 2019-05-01 18:07 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Rohit Ashiwal via GitGitGadget, git, Junio C Hamano, Rohit Ashiwal

On Fri, Apr 26, 2019 at 10:28:12AM -0400, Johannes Schindelin wrote:

> > > +static gzFile gzip;
> > > [...]
> > > +       if (gzip) {
> >
> > Is it OK for us to ask about the truthiness of this opaque type? That
> > works if it's really a pointer behind the scenes, but it seems like it
> > would be equally OK for zlib to declare it as a struct.
> >
> > It looks OK in my version of zlib, and that library tends to be fairly
> > conservative so I wouldn't be surprised if it was that way back to the
> > beginning and remains that way for eternity. But it feels like a bad
> > pattern.
> 
> It is even part of the public API that `gzFile` is `typedef`'d to a
> pointer. So I think in the interest of simplicity, I'll leave it at that
> (but I'll mention this in the commit message).

I think that's probably OK. My biggest concern is that we'd notice if
our assumption changes, but I think modern compilers would generally
complain about checking a tautological truth value.

-Peff

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die()
  2019-04-29 21:32           ` Johannes Schindelin
@ 2019-05-01 18:09             ` Jeff King
  2019-05-02 20:29               ` René Scharfe
  2019-05-05  5:25               ` Junio C Hamano
  0 siblings, 2 replies; 43+ messages in thread
From: Jeff King @ 2019-05-01 18:09 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Junio C Hamano, Rohit Ashiwal via GitGitGadget, git, Rohit Ashiwal

On Mon, Apr 29, 2019 at 05:32:50PM -0400, Johannes Schindelin wrote:

> > Another is that I am not sure how your "fixed format" argument
> > meshes with the "-b blocksize" parameter to affect the tar/pax
> > output.  The format may be fixed, but it is parameterized.  If
> > we ever need to grow the ability to take "-b", having the knowledge
> > that our current code is limited to the fixed BLOCKSIZE in a single
> > function (i.e. the caller of this function , not the callee) would
> > be less error prone.
> 
> This argument would hold a lot more water if the following lines were not
> part of archive-tar.c:
> 
> 	#define RECORDSIZE      (512)
> 	#define BLOCKSIZE       (RECORDSIZE * 20)
> 
> 	static char block[BLOCKSIZE];
> 
> If you can tell me how the `-b` (run-time) parameter can affect the
> (compile-time) `BLOCKSIZE` constant, maybe I can start to understand your
> concern.

FWIW, I agree with you here. These patches are not making anything worse
(and may even make them better, since we'd probably need to swap out the
BLOCKSIZE constant for a run-time "blocksize" variable in fewer places).

-Peff

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] archive: avoid spawning `gzip`
  2019-05-01 17:45                 ` René Scharfe
@ 2019-05-01 18:18                   ` Jeff King
  2019-06-10 10:44                     ` René Scharfe
  0 siblings, 1 reply; 43+ messages in thread
From: Jeff King @ 2019-05-01 18:18 UTC (permalink / raw)
  To: René Scharfe
  Cc: Johannes Schindelin, Rohit Ashiwal via GitGitGadget, git,
	Junio C Hamano, Rohit Ashiwal

On Wed, May 01, 2019 at 07:45:05PM +0200, René Scharfe wrote:

> > But since the performance is still not quite on par with `gzip`, I would
> > actually rather not, and really, just punt on that one, stating that
> > people interested in higher performance should use `pigz`.
> 
> Here are my performance numbers for generating .tar.gz files again:
> 
> master, using gzip(1):
>   Time (mean ± σ):     16.683 s ±  0.451 s    [User: 20.230 s, System: 0.375 s]
>   Range (min … max):   16.308 s … 17.852 s    10 runs
> 
> using zlib sequentially:
>   Time (mean ± σ):     19.898 s ±  0.228 s    [User: 19.825 s, System: 0.073 s]
>   Range (min … max):   19.627 s … 20.355 s    10 runs
> 
> using zlib asynchronously:
>   Time (mean ± σ):     17.300 s ±  0.198 s    [User: 20.825 s, System: 0.356 s]
>   Range (min … max):   17.042 s … 17.638 s    10 runs
> 
> using a gzip-lookalike:
>   Time (mean ± σ):     17.256 s ±  0.299 s    [User: 20.380 s, System: 0.294 s]
>   Range (min … max):   16.940 s … 17.804 s    10 runs
> 
> The last two have comparable system time, ca. 1% more user time and
> ca. 5% longer duration.  The second one has much better system time
> and 2% less user time and 19% longer duration.  Hmm.

I think the start_async() one seems like a good option. It reclaims most
of the (wall-clock) performance, isn't very much code, and doesn't leave
any ugly user-visible traces.

I'd be fine to see it come later, though, on top of the patches Dscho is
sending. Even though changing to sequential zlib is technically a change
in behavior, the existing behavior wasn't really planned. And given the
wall-clock versus CPU time tradeoff, it's not entirely clear that one
solution is better than the other.

> > And who knows, maybe nobody will complain at all about the performance?
> 
> Probably.  And popular tarballs would be cached anyway, I guess.

At GitHub we certainly do cache the git-archive output. We'd also be
just fine with the sequential solution. We generally turn down
pack.threads to 1, and keep our CPUs busy by serving multiple users
anyway.

So whatever has the lowest overall CPU time is generally preferable, but
the times are close enough that I don't think we'd care much either way
(and it's probably not worth having a config option or similar).

-Peff

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] archive: avoid spawning `gzip`
  2019-04-26 14:54       ` Johannes Schindelin
@ 2019-05-02 20:20         ` Ævar Arnfjörð Bjarmason
  2019-05-03 20:49           ` Johannes Schindelin
  0 siblings, 1 reply; 43+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-05-02 20:20 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: brian m. carlson, Jeff King, Rohit Ashiwal via GitGitGadget, git,
	Junio C Hamano, Rohit Ashiwal


On Fri, Apr 26 2019, Johannes Schindelin wrote:

> Hi brian,
>
> On Sat, 13 Apr 2019, brian m. carlson wrote:
>
>> On Fri, Apr 12, 2019 at 09:51:02PM -0400, Jeff King wrote:
>> > I wondered how you were going to kick this in, since users can define
>> > arbitrary filters. I think it's kind of neat to automagically convert
>> > "gzip -cn" (which also happens to be the default). But I think we should
>> > mention that in the Documentation, in case somebody tries to use a
>> > custom version of gzip and wonders why it isn't kicking in.
>> >
>> > Likewise, it might make sense in the tests to put a poison gzip in the
>> > $PATH so that we can be sure we're using our internal code, and not just
>> > calling out to gzip (on platforms that have it, of course).
>> >
>> > The alternative is that we could use a special token like ":zlib" or
>> > something to indicate that the internal implementation should be used
>> > (and then tweak the baked-in default, too). That might be less
>> > surprising for users, but most people would still get the benefit since
>> > they'd be using the default config.
>>
>> I agree that a special value (or NULL, if that's possible) would be
>> nicer here. That way, if someone does specify a custom gzip, we honor
>> it, and it serves to document the code better. For example, if someone
>> symlinked pigz to gzip and used "gzip -cn", then they might not get the
>> parallelization benefits they expected.
>
> I went with `:zlib`. The `NULL` value would not really work, as there is
> no way to specify that via `archive.tgz.command`.
>
> About the symlinked thing: I do not really want to care to support such
> hacks.

It's the standard way by which a lot of systems do this, e.g. on my
Debian box:

    $ find /{,s}bin /usr/{,s}bin -type l -exec file {} \;|grep /etc/alternatives|wc -l
    108

To write this E-Mail I'm invoking one such symlink :)

> If you want a different compressor than the default (which can
> change), you should specify it specifically.

You might want to do so system-wide, or for each program at a time.

I don't care about this for gzip myself, just pointing out it *is* a
thing people use.

>> I'm fine overall with the idea of bringing the compression into the
>> binary using zlib, provided that we preserve the "-n" behavior
>> (producing reproducible archives).
>
> Thanks for voicing this concern. I had a look at zlib's source code, and
> it looks like it requires an extra function call (that we don't call) to
> make the resulting file non-reproducible. In other words, it has the
> opposite default behavior from `gzip`.

Just commenting on the overall thread: I like René's "new built-in"
patch best.

You mentioned "new command that we have to support for eternity". I
think calling it "git gzip" is a bad idea. We'd make it "git
archive--gzip" or "git archive--helper", and we could hide building it
behind some compat flag.

Then we'd carry no if/else internal/external code, and the portability
issue that started this would be addressed, no?

As a bonus we could also drop the "GZIP" prereq from the test suite
entirely and just put that "gzip" in $PATH for the purposes of the
tests.

I spied on your yet-to-be-submitted patches and you could drop GZIP from
the "git archive" tests, but we'd still need it in
t/t5562-http-backend-content-length.sh, but not if we had a "gzip"
compat helper.

There's also a long-standing bug/misfeature in git-archive that I wonder
about: When you combine --format with --remote you can only generate
e.g. tar.gz if the remote is OK with it, if it says no you can't even if
it supports "tar" and you could do the "gz" part locally. Would such a
patch be harder with :zlib than if we always just spewed out to external
"gzip" after satisfying some criteria?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 3/4] archive: optionally use zlib directly for gzip compression
       [not found]   ` <4ea94a8784876c3a19e387537edd81a957fc692c.1556321244.git.gitgitgadget@gmail.com>
@ 2019-05-02 20:29     ` René Scharfe
  0 siblings, 0 replies; 43+ messages in thread
From: René Scharfe @ 2019-05-02 20:29 UTC (permalink / raw)
  To: Rohit Ashiwal via GitGitGadget, git
  Cc: Jeff King, brian m. carlson, Junio C Hamano, Rohit Ashiwal

Am 27.04.19 um 01:27 schrieb Rohit Ashiwal via GitGitGadget:
> From: Rohit Ashiwal <rohit.ashiwal265@gmail.com>
>
> As we already link to the zlib library, we can perform the compression
> without even requiring gzip on the host machine.
>
> Note: the `-n` flag that `git archive` passed to `gzip` wants to ensure
> that a reproducible file is written, i.e. no filename or mtime will be
> recorded in the compressed output. This is already the default for
> zlib's `gzopen()` function (if the file name or mtime should be
> recorded, the `deflateSetHeader()` function would have to be called
> instead).
>
> Note also that the `gzFile` datatype is defined as a pointer in
> `zlib.h`, i.e. we can rely on the fact that it can be `NULL`.
>
> At this point, this new mode is hidden behind the pseudo command
> `:zlib`: assign this magic string to the `archive.tgz.command` config
> setting to enable it.

Technically the patch emits the gzip format using the gz* functions.
Raw zlib output with deflate* would be slightly different.  So I'd
rather use "gzip" instead of "zlib" in the magic string.

And I'm not sure about the colon as the only magic marker.  Perhaps
throw in a "git " or "git-" instead or in addition?

> @@ -459,18 +464,40 @@ static int write_tar_filter_archive(const struct archiver *ar,
>  	filter.use_shell = 1;
>  	filter.in = -1;
>
> -	if (start_command(&filter) < 0)
> -		die_errno(_("unable to start '%s' filter"), argv[0]);
> -	close(1);
> -	if (dup2(filter.in, 1) < 0)
> -		die_errno(_("unable to redirect descriptor"));
> -	close(filter.in);
> +	if (!strcmp(":zlib", ar->data)) {
> +		struct strbuf mode = STRBUF_INIT;
> +
> +		strbuf_addstr(&mode, "wb");
> +
> +		if (args->compression_level >= 0 && args->compression_level <= 9)
> +			strbuf_addf(&mode, "%d", args->compression_level);

Using gzsetparams() to set the compression level numerically after gzdopen()
instead of baking it into the mode string feels cleaner.

> +
> +		gzip = gzdopen(fileno(stdout), mode.buf);
> +		if (!gzip)
> +			die(_("Could not gzdopen stdout"));
> +		strbuf_release(&mode);
> +	} else {
> +		if (start_command(&filter) < 0)
> +			die_errno(_("unable to start '%s' filter"), argv[0]);
> +		close(1);
> +		if (dup2(filter.in, 1) < 0)
> +			die_errno(_("unable to redirect descriptor"));
> +		close(filter.in);
> +	}
>
>  	r = write_tar_archive(ar, args);
>
> -	close(1);
> -	if (finish_command(&filter) != 0)
> -		die(_("'%s' filter reported error"), argv[0]);
> +	if (gzip) {
> +		int ret = gzclose(gzip);
> +		if (ret == Z_ERRNO)
> +			die_errno(_("gzclose failed"));
> +		else if (ret != Z_OK)
> +			die(_("gzclose failed (%d)"), ret);
> +	} else {
> +		close(1);
> +		if (finish_command(&filter) != 0)
> +			die(_("'%s' filter reported error"), argv[0]);
> +	}
>
>  	strbuf_release(&cmd);
>  	return r;
>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die()
  2019-05-01 18:09             ` Jeff King
@ 2019-05-02 20:29               ` René Scharfe
  2019-05-05  5:25               ` Junio C Hamano
  1 sibling, 0 replies; 43+ messages in thread
From: René Scharfe @ 2019-05-02 20:29 UTC (permalink / raw)
  To: Jeff King, Johannes Schindelin
  Cc: Junio C Hamano, Rohit Ashiwal via GitGitGadget, git, Rohit Ashiwal

Am 01.05.19 um 20:09 schrieb Jeff King:
> On Mon, Apr 29, 2019 at 05:32:50PM -0400, Johannes Schindelin wrote:
>
>>> Another is that I am not sure how your "fixed format" argument
>>> meshes with the "-b blocksize" parameter to affect the tar/pax
>>> output.  The format may be fixed, but it is parameterized.  If
>>> we ever need to grow the ability to take "-b", having the knowledge
>>> that our current code is limited to the fixed BLOCKSIZE in a single
>>> function (i.e. the caller of this function , not the callee) would
>>> be less error prone.
>>
>> This argument would hold a lot more water if the following lines were not
>> part of archive-tar.c:
>>
>> 	#define RECORDSIZE      (512)
>> 	#define BLOCKSIZE       (RECORDSIZE * 20)
>>
>> 	static char block[BLOCKSIZE];
>>
>> If you can tell me how the `-b` (run-time) parameter can affect the
>> (compile-time) `BLOCKSIZE` constant, maybe I can start to understand your
>> concern.
>
> FWIW, I agree with you here. These patches are not making anything worse
> (and may even make them better, since we'd probably need to swap out the
> BLOCKSIZE constant for a run-time "blocksize" variable in fewer places).

The block size is mostly relevant for writing tar archives to magnetic
tapes.  You can do that with git archive and a tape drive that supports
the blocking factor 20, which is the default for GNU tar and thus should
be quite common.  You may get higher performance with a higher blocking
factor, if supported.

But so far this didn't come up on the mailing list, and I'd be surprised
if people really wrote snapshots of git archives directly to tape.  So
I'm not too worried about this define ever becoming a user-settable
option.  Sealing the constant into a function a bit feels dirty, though.
Mixing code and data makes the code more brittle.

Another example of that is the hard-coded file descriptor in the same
function, by the way.  It's a lot of busywork to undo in order to gain
the ability to write to some other fd, for the questionable convenience
of not having to pass that parameter along the call chain.  My bad.

But anyway, I worry more about the fact that blocking is not needed when
gzip'ing; gzwrite can be fed pieces of any size, not just 20 KB chunks.
The tar writer just needs to round up the archive size to a multiple of
20 KB and pad with NUL bytes at the end, in order to produce the same
uncompressed output as non-compressing tar.

If we'd wanted to be tape-friendly, then we'd have to block the gzip'ed
output instead of the uncompressed tar file, but I'm not suggesting
doing that.

Note to self: I wonder if moving the blocking part out into an
asynchronous function could simplify the code.

René

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 2/4] archive-tar: mark RECORDSIZE/BLOCKSIZE as unsigned
       [not found]   ` <ac2b2488a1b42b3caf8a84594c48eca796748e59.1556321244.git.gitgitgadget@gmail.com>
@ 2019-05-02 20:30     ` René Scharfe
  2019-05-08 11:45       ` Johannes Schindelin
  0 siblings, 1 reply; 43+ messages in thread
From: René Scharfe @ 2019-05-02 20:30 UTC (permalink / raw)
  To: Johannes Schindelin via GitGitGadget, git
  Cc: Jeff King, brian m. carlson, Junio C Hamano, Johannes Schindelin

Am 27.04.19 um 01:27 schrieb Johannes Schindelin via GitGitGadget:
> From: Johannes Schindelin <johannes.schindelin@gmx.de>
>
> They really are unsigned, and we are using e.g. BLOCKSIZE as `size_t`
> parameter to pass to `write_or_die()`.

True, but the compiler converts that value correctly to size_t without
complaint already, doesn't it?  What am I missing?

>
> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> ---
>  archive-tar.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/archive-tar.c b/archive-tar.c
> index af9ea70733..be06c8b205 100644
> --- a/archive-tar.c
> +++ b/archive-tar.c
> @@ -9,7 +9,7 @@
>  #include "streaming.h"
>  #include "run-command.h"
>
> -#define RECORDSIZE	(512)
> +#define RECORDSIZE	(512u)
>  #define BLOCKSIZE	(RECORDSIZE * 20)
>
>  static char block[BLOCKSIZE];
>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] archive: avoid spawning `gzip`
  2019-05-02 20:20         ` Ævar Arnfjörð Bjarmason
@ 2019-05-03 20:49           ` Johannes Schindelin
  2019-05-03 20:52             ` Jeff King
  0 siblings, 1 reply; 43+ messages in thread
From: Johannes Schindelin @ 2019-05-03 20:49 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: brian m. carlson, Jeff King, Rohit Ashiwal via GitGitGadget, git,
	Junio C Hamano, Rohit Ashiwal

[-- Attachment #1: Type: text/plain, Size: 5284 bytes --]

Hi Ævar,

On Thu, 2 May 2019, Ævar Arnfjörð Bjarmason wrote:

> On Fri, Apr 26 2019, Johannes Schindelin wrote:
>
> > On Sat, 13 Apr 2019, brian m. carlson wrote:
> >
> >> On Fri, Apr 12, 2019 at 09:51:02PM -0400, Jeff King wrote:
> >> > I wondered how you were going to kick this in, since users can
> >> > define arbitrary filters. I think it's kind of neat to
> >> > automagically convert "gzip -cn" (which also happens to be the
> >> > default). But I think we should mention that in the Documentation,
> >> > in case somebody tries to use a custom version of gzip and wonders
> >> > why it isn't kicking in.
> >> >
> >> > Likewise, it might make sense in the tests to put a poison gzip in
> >> > the $PATH so that we can be sure we're using our internal code, and
> >> > not just calling out to gzip (on platforms that have it, of
> >> > course).
> >> >
> >> > The alternative is that we could use a special token like ":zlib"
> >> > or something to indicate that the internal implementation should be
> >> > used (and then tweak the baked-in default, too). That might be less
> >> > surprising for users, but most people would still get the benefit
> >> > since they'd be using the default config.
> >>
> >> I agree that a special value (or NULL, if that's possible) would be
> >> nicer here. That way, if someone does specify a custom gzip, we honor
> >> it, and it serves to document the code better. For example, if
> >> someone symlinked pigz to gzip and used "gzip -cn", then they might
> >> not get the parallelization benefits they expected.
> >
> > I went with `:zlib`. The `NULL` value would not really work, as there
> > is no way to specify that via `archive.tgz.command`.
> >
> > About the symlinked thing: I do not really want to care to support
> > such hacks.
>
> It's the standard way by which a lot of systems do this, e.g. on my
> Debian box:
>
>     $ find /{,s}bin /usr/{,s}bin -type l -exec file {} \;|grep /etc/alternatives|wc -l
>     108
>
> To write this E-Mail I'm invoking one such symlink :)

I am well aware of the way Debian-based systems handle alternatives, and I
myself also use something similar to write this E-Mail (but it is not a
symlink, it is a Git alias).

But that's not the hack that I was talking about.

The hack I meant was: if you symlink `gzip` to `pigz` in your `PATH` *and
then expect `git archive --format=tgz` to pick that up*.

As far as I am concerned, the fact that `git archive --format=tgz` spawns
`gzip` to perform the compression is an implementation detail, and not
something that users should feel they can rely on.

> > If you want a different compressor than the default (which can
> > change), you should specify it specifically.
>
> You might want to do so system-wide, or for each program at a time.
>
> I don't care about this for gzip myself, just pointing out it *is* a
> thing people use.

Sure.

> >> I'm fine overall with the idea of bringing the compression into the
> >> binary using zlib, provided that we preserve the "-n" behavior
> >> (producing reproducible archives).
> >
> > Thanks for voicing this concern. I had a look at zlib's source code,
> > and it looks like it requires an extra function call (that we don't
> > call) to make the resulting file non-reproducible. In other words, it
> > has the opposite default behavior from `gzip`.
>
> Just commenting on the overall thread: I like René's "new built-in"
> patch best.

I guess we now have to diverging votes: yours for the `git archive --gzip`
"built-in" and Peff's for the async code ;-)

> You mentioned "new command that we have to support for eternity". I
> think calling it "git gzip" is a bad idea. We'd make it "git
> archive--gzip" or "git archive--helper", and we could hide building it
> behind some compat flag.
>
> Then we'd carry no if/else internal/external code, and the portability
> issue that started this would be addressed, no?

Sure.

The async version would leave the door wide open for implementing pigz'
trick to multi-thread the compression, though.

> As a bonus we could also drop the "GZIP" prereq from the test suite
> entirely and just put that "gzip" in $PATH for the purposes of the
> tests.
>
> I spied on your yet-to-be-submitted patches and you could drop GZIP from
> the "git archive" tests, but we'd still need it in
> t/t5562-http-backend-content-length.sh, but not if we had a "gzip"
> compat helper.

We need it at least once for *decompressing* the `--format=tgz` output in
order to compare it to the `--format=tar` output. Besides, I think it is
really important to keep the test that verifies that the output is correct
(i.e. that gzip can decompress it).

> There's also a long-standing bug/misfeature in git-archive that I wonder
> about: When you combine --format with --remote you can only generate
> e.g. tar.gz if the remote is OK with it, if it says no you can't even if
> it supports "tar" and you could do the "gz" part locally. Would such a
> patch be harder with :zlib than if we always just spewed out to external
> "gzip" after satisfying some criteria?

I think it would be precisely the same: you'd still use the same "filter"
code path.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] archive: avoid spawning `gzip`
  2019-05-03 20:49           ` Johannes Schindelin
@ 2019-05-03 20:52             ` Jeff King
  0 siblings, 0 replies; 43+ messages in thread
From: Jeff King @ 2019-05-03 20:52 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Ævar Arnfjörð Bjarmason, brian m. carlson,
	Rohit Ashiwal via GitGitGadget, git, Junio C Hamano,
	Rohit Ashiwal

On Fri, May 03, 2019 at 10:49:17PM +0200, Johannes Schindelin wrote:

> I am well aware of the way Debian-based systems handle alternatives, and I
> myself also use something similar to write this E-Mail (but it is not a
> symlink, it is a Git alias).
> 
> But that's not the hack that I was talking about.
> 
> The hack I meant was: if you symlink `gzip` to `pigz` in your `PATH` *and
> then expect `git archive --format=tgz` to pick that up*.
> 
> As far as I am concerned, the fact that `git archive --format=tgz` spawns
> `gzip` to perform the compression is an implementation detail, and not
> something that users should feel they can rely on.

I'd agree with you more if we didn't document a user-facing config
variable that claims to run "gzip" from the system.

> > Just commenting on the overall thread: I like René's "new built-in"
> > patch best.
> 
> I guess we now have to diverging votes: yours for the `git archive --gzip`
> "built-in" and Peff's for the async code ;-)

For the record, I am fine with any of the solutions (including just
doing the single-thread bit you already have and letting René do what he
likes on top).

-Peff

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die()
  2019-05-01 18:09             ` Jeff King
  2019-05-02 20:29               ` René Scharfe
@ 2019-05-05  5:25               ` Junio C Hamano
  2019-05-06  5:07                 ` Jeff King
  1 sibling, 1 reply; 43+ messages in thread
From: Junio C Hamano @ 2019-05-05  5:25 UTC (permalink / raw)
  To: Jeff King
  Cc: Johannes Schindelin, Rohit Ashiwal via GitGitGadget, git, Rohit Ashiwal

Jeff King <peff@peff.net> writes:

> FWIW, I agree with you here. These patches are not making anything worse
> (and may even make them better, since we'd probably need to swap out the
> BLOCKSIZE constant for a run-time "blocksize" variable in fewer places).

It's just that leaving the interface uneven is an easy way to
introduce an unnecessary bug, e.g.

	-type function(args) {
	+type function(args, size_t blocksize) {
		decls;
	-	helper_one(BLOCKSIZE, other, args);
	+	helper_one(blocksize, other, args);
		helper_two(its, args);
	-	helper_three(BLOCKSIZE, even, more, args);
	+	helper_three(blocksize, even, more, args);
	 }

when this caller is away from the implementation of helper_two()
that hardcodes the assumption that this callchain only uses
BLOCKSIZE and in an implicit way.

And that can easily be avoided by defensively making helper_two() to
take BLOCKSIZE as an argument as everybody else in the caller does.

I do not actually care too deeply, though.  Hopefully whoever adds
"-b" would be careful enough to follow all callchain, and at least
look at all the callees that are file-scope static, and the one I
have trouble with _is_ a file-scope static.

Or maybe nobody does "-b", in which case this ticking time bomb will
not trigger, so we'd be OK.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die()
  2019-05-05  5:25               ` Junio C Hamano
@ 2019-05-06  5:07                 ` Jeff King
  0 siblings, 0 replies; 43+ messages in thread
From: Jeff King @ 2019-05-06  5:07 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Johannes Schindelin, Rohit Ashiwal via GitGitGadget, git, Rohit Ashiwal

On Sun, May 05, 2019 at 02:25:59PM +0900, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > FWIW, I agree with you here. These patches are not making anything worse
> > (and may even make them better, since we'd probably need to swap out the
> > BLOCKSIZE constant for a run-time "blocksize" variable in fewer places).
> 
> It's just that leaving the interface uneven is an easy way to
> introduce an unnecessary bug, e.g.
> 
> 	-type function(args) {
> 	+type function(args, size_t blocksize) {
> 		decls;
> 	-	helper_one(BLOCKSIZE, other, args);
> 	+	helper_one(blocksize, other, args);
> 		helper_two(its, args);
> 	-	helper_three(BLOCKSIZE, even, more, args);
> 	+	helper_three(blocksize, even, more, args);
> 	 }
> 
> when this caller is away from the implementation of helper_two()
> that hardcodes the assumption that this callchain only uses
> BLOCKSIZE and in an implicit way.
> 
> And that can easily be avoided by defensively making helper_two() to
> take BLOCKSIZE as an argument as everybody else in the caller does.
> 
> I do not actually care too deeply, though.  Hopefully whoever adds
> "-b" would be careful enough to follow all callchain, and at least
> look at all the callees that are file-scope static, and the one I
> have trouble with _is_ a file-scope static.

Right, my assumption was that the first step in the conversion would be
somebody doing s/BLOCKSIZE/global_blocksize_variable/. But that is just
a guess.

> Or maybe nobody does "-b", in which case this ticking time bomb will
> not trigger, so we'd be OK.

Yes. I suspect we're probably going down an unproductive tangent. :)

-Peff

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 2/4] archive-tar: mark RECORDSIZE/BLOCKSIZE as unsigned
  2019-05-02 20:30     ` [PATCH v2 2/4] archive-tar: mark RECORDSIZE/BLOCKSIZE as unsigned René Scharfe
@ 2019-05-08 11:45       ` Johannes Schindelin
  2019-05-08 23:04         ` Jeff King
  0 siblings, 1 reply; 43+ messages in thread
From: Johannes Schindelin @ 2019-05-08 11:45 UTC (permalink / raw)
  To: René Scharfe
  Cc: Johannes Schindelin via GitGitGadget, git, Jeff King,
	brian m. carlson, Junio C Hamano

[-- Attachment #1: Type: text/plain, Size: 1462 bytes --]

Hi René,

On Thu, 2 May 2019, René Scharfe wrote:

> Am 27.04.19 um 01:27 schrieb Johannes Schindelin via GitGitGadget:
> > From: Johannes Schindelin <johannes.schindelin@gmx.de>
> >
> > They really are unsigned, and we are using e.g. BLOCKSIZE as `size_t`
> > parameter to pass to `write_or_die()`.
>
> True, but the compiler converts that value correctly to size_t without
> complaint already, doesn't it?  What am I missing?

Are you talking about a specific compiler? It sure sounds as if you did.

I really do not want to fall into the "you can build Git with *any*
compiler, as long as that compiler happens to be GCC, oh, and as long it
is version X" trap.

We *already* rely on GCC's optimization in way too many places for my
liking, e.g. when we adapted the `hasheq()` code *specifically* to make
GCC's particular optimization strategies to kick in.

Or the way we defined the `SWAP()` macro: it depends on GCC's ability to
see through the veil and out-guess the code, deducing its intent rather
than what it *says* ("Do As I Want, Not As I Say", anyone?). We *do* want
to swap registers when possible (instead of forcing register variables to
be written to memory just for the sake of being swapped, as our code says
rather explicitly).

Essentially, we build a cruise ship of a dependency on GCC here. Which
should not make anybody happy (except maybe the GCC folks).

Let's not make things worse.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 2/4] archive-tar: mark RECORDSIZE/BLOCKSIZE as unsigned
  2019-05-08 11:45       ` Johannes Schindelin
@ 2019-05-08 23:04         ` Jeff King
  2019-05-09 14:06           ` Johannes Schindelin
  0 siblings, 1 reply; 43+ messages in thread
From: Jeff King @ 2019-05-08 23:04 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: René Scharfe, Johannes Schindelin via GitGitGadget, git,
	brian m. carlson, Junio C Hamano

On Wed, May 08, 2019 at 01:45:25PM +0200, Johannes Schindelin wrote:

> Hi René,
> 
> On Thu, 2 May 2019, René Scharfe wrote:
> 
> > Am 27.04.19 um 01:27 schrieb Johannes Schindelin via GitGitGadget:
> > > From: Johannes Schindelin <johannes.schindelin@gmx.de>
> > >
> > > They really are unsigned, and we are using e.g. BLOCKSIZE as `size_t`
> > > parameter to pass to `write_or_die()`.
> >
> > True, but the compiler converts that value correctly to size_t without
> > complaint already, doesn't it?  What am I missing?
> 
> Are you talking about a specific compiler? It sure sounds as if you did.
> 
> I really do not want to fall into the "you can build Git with *any*
> compiler, as long as that compiler happens to be GCC, oh, and as long it
> is version X" trap.

I don't this this has anything to do with gcc. The point is that we
already have this line:

  write_or_die(fd, buf, BLOCKSIZE);

which does not cast and nobody has complained, even though the signed
constant is implicitly converted to a size_t. So adding another line
like:

  gzwrite(gzip, block, BLOCKSIZE);

would in theory be treated the same (gzwrite takes an "unsigned").

The conversion from signed to unsigned is well defined in ANSI C, and
I'd expect a compiler to either complain about neither or both (and the
latter probably with warnings like -Wconversion cranked up).

But of course if you have data otherwise, we can revise that. Was the
cast added out of caution, or to squelch a compiler warning?

-Peff

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 2/4] archive-tar: mark RECORDSIZE/BLOCKSIZE as unsigned
  2019-05-08 23:04         ` Jeff King
@ 2019-05-09 14:06           ` Johannes Schindelin
  2019-05-09 18:38             ` Jeff King
  0 siblings, 1 reply; 43+ messages in thread
From: Johannes Schindelin @ 2019-05-09 14:06 UTC (permalink / raw)
  To: Jeff King
  Cc: René Scharfe, Johannes Schindelin via GitGitGadget, git,
	brian m. carlson, Junio C Hamano

[-- Attachment #1: Type: text/plain, Size: 1920 bytes --]

Hi Peff,

On Wed, 8 May 2019, Jeff King wrote:

> On Wed, May 08, 2019 at 01:45:25PM +0200, Johannes Schindelin wrote:
>
> > Hi René,
> >
> > On Thu, 2 May 2019, René Scharfe wrote:
> >
> > > Am 27.04.19 um 01:27 schrieb Johannes Schindelin via GitGitGadget:
> > > > From: Johannes Schindelin <johannes.schindelin@gmx.de>
> > > >
> > > > They really are unsigned, and we are using e.g. BLOCKSIZE as `size_t`
> > > > parameter to pass to `write_or_die()`.
> > >
> > > True, but the compiler converts that value correctly to size_t without
> > > complaint already, doesn't it?  What am I missing?
> >
> > Are you talking about a specific compiler? It sure sounds as if you did.
> >
> > I really do not want to fall into the "you can build Git with *any*
> > compiler, as long as that compiler happens to be GCC, oh, and as long it
> > is version X" trap.
>
> I don't this this has anything to do with gcc. The point is that we
> already have this line:
>
>   write_or_die(fd, buf, BLOCKSIZE);
>
> which does not cast and nobody has complained,

I mistook this part of your reply in
https://public-inbox.org/git/20190413013451.GB2040@sigill.intra.peff.net/
as precisely such a complaint:

	BLOCKSIZE is a constant. Should we be defining it with a "U" in
	the first place?

Thanks,
Dscho

> even though the signed
> constant is implicitly converted to a size_t. So adding another line
> like:
>
>   gzwrite(gzip, block, BLOCKSIZE);
>
> would in theory be treated the same (gzwrite takes an "unsigned").
>
> The conversion from signed to unsigned is well defined in ANSI C, and
> I'd expect a compiler to either complain about neither or both (and the
> latter probably with warnings like -Wconversion cranked up).
>
> But of course if you have data otherwise, we can revise that. Was the
> cast added out of caution, or to squelch a compiler warning?
>
> -Peff
>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 2/4] archive-tar: mark RECORDSIZE/BLOCKSIZE as unsigned
  2019-05-09 14:06           ` Johannes Schindelin
@ 2019-05-09 18:38             ` Jeff King
  2019-05-10 17:18               ` René Scharfe
  0 siblings, 1 reply; 43+ messages in thread
From: Jeff King @ 2019-05-09 18:38 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: René Scharfe, Johannes Schindelin via GitGitGadget, git,
	brian m. carlson, Junio C Hamano

On Thu, May 09, 2019 at 04:06:22PM +0200, Johannes Schindelin wrote:

> > I don't this this has anything to do with gcc. The point is that we
> > already have this line:
> >
> >   write_or_die(fd, buf, BLOCKSIZE);
> >
> > which does not cast and nobody has complained,
> 
> I mistook this part of your reply in
> https://public-inbox.org/git/20190413013451.GB2040@sigill.intra.peff.net/
> as precisely such a complaint:
> 
> 	BLOCKSIZE is a constant. Should we be defining it with a "U" in
> 	the first place?

Ah, sorry to introduce confusion. I mostly meant "if we need to cast,
why not just define as unsigned in the first place?". But I think René
was pointing out that we do not even need to cast, and I am fine with
that approach.

I do dream of a world where we do not have a bunch of implicit
conversions (both signedness but also truncation) in our code base, and
can compile cleanly with -Wconversion We know that this case is
perfectly fine, but I am sure there are many that are not. However, I'm
not sure if we'll ever get there, and in the meantime I don't think it's
worth worrying too much about individual cases like this.

-Peff

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 2/4] archive-tar: mark RECORDSIZE/BLOCKSIZE as unsigned
  2019-05-09 18:38             ` Jeff King
@ 2019-05-10 17:18               ` René Scharfe
  2019-05-10 21:20                 ` Jeff King
  0 siblings, 1 reply; 43+ messages in thread
From: René Scharfe @ 2019-05-10 17:18 UTC (permalink / raw)
  To: Jeff King, Johannes Schindelin
  Cc: Johannes Schindelin via GitGitGadget, git, brian m. carlson,
	Junio C Hamano

Am 09.05.19 um 20:38 schrieb Jeff King:
> I do dream of a world where we do not have a bunch of implicit
> conversions (both signedness but also truncation) in our code base, and
> can compile cleanly with -Wconversion We know that this case is
> perfectly fine, but I am sure there are many that are not. However, I'm
> not sure if we'll ever get there, and in the meantime I don't think it's
> worth worrying too much about individual cases like this.

Here's a rough take on how to silence that warning for archive-tar.c using
GCC 8.3.  Some of the changes are worth polishing and submitting.  Some
are silly.  The one for regexec_buf() is scary; I don't see a clean way of
dealing with that size_t to int conversion.

---
 archive-tar.c     | 54 +++++++++++++++++++++++++++++++----------------
 cache.h           | 10 ++++-----
 git-compat-util.h | 21 +++++++++++++++---
 hash.h            |  2 +-
 strbuf.h          |  2 +-
 5 files changed, 61 insertions(+), 28 deletions(-)

diff --git a/archive-tar.c b/archive-tar.c
index 3e53aac1e6..bfd91782ab 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -15,7 +15,7 @@
 static char block[BLOCKSIZE];
 static unsigned long offset;

-static int tar_umask = 002;
+static mode_t tar_umask = 002;

 static int write_tar_filter_archive(const struct archiver *ar,
 				    struct archiver_args *args);
@@ -99,7 +99,7 @@ static void write_blocked(const void *data, unsigned long size)
  */
 static void write_trailer(void)
 {
-	int tail = BLOCKSIZE - offset;
+	size_t tail = BLOCKSIZE - offset;
 	memset(block + offset, 0, tail);
 	write_or_die(1, block, BLOCKSIZE);
 	if (tail < 2 * RECORDSIZE) {
@@ -127,12 +127,13 @@ static int stream_blocked(const struct object_id *oid)
 		readlen = read_istream(st, buf, sizeof(buf));
 		if (readlen <= 0)
 			break;
-		do_write_blocked(buf, readlen);
+		do_write_blocked(buf, (size_t)readlen);
 	}
 	close_istream(st);
-	if (!readlen)
-		finish_record();
-	return readlen;
+	if (readlen < 0)
+		return -1;
+	finish_record();
+	return 0;
 }

 /*
@@ -142,9 +143,9 @@ static int stream_blocked(const struct object_id *oid)
  * string and appends it to a struct strbuf.
  */
 static void strbuf_append_ext_header(struct strbuf *sb, const char *keyword,
-				     const char *value, unsigned int valuelen)
+				     const char *value, size_t valuelen)
 {
-	int len, tmp;
+	size_t len, tmp;

 	/* "%u %s=%s\n" */
 	len = 1 + 1 + strlen(keyword) + 1 + valuelen + 1;
@@ -152,7 +153,7 @@ static void strbuf_append_ext_header(struct strbuf *sb, const char *keyword,
 		len++;

 	strbuf_grow(sb, len);
-	strbuf_addf(sb, "%u %s=", len, keyword);
+	strbuf_addf(sb, "%"PRIuMAX" %s=", (uintmax_t)len, keyword);
 	strbuf_add(sb, value, valuelen);
 	strbuf_addch(sb, '\n');
 }
@@ -168,7 +169,9 @@ static void strbuf_append_ext_header_uint(struct strbuf *sb,
 	int len;

 	len = xsnprintf(buf, sizeof(buf), "%"PRIuMAX, value);
-	strbuf_append_ext_header(sb, keyword, buf, len);
+	if (len < 0)
+		BUG("unable to convert %"PRIuMAX" to decimal", value);
+	strbuf_append_ext_header(sb, keyword, buf, (size_t)len);
 }

 static unsigned int ustar_header_chksum(const struct ustar_header *header)
@@ -177,7 +180,7 @@ static unsigned int ustar_header_chksum(const struct ustar_header *header)
 	unsigned int chksum = 0;
 	while (p < (const unsigned char *)header->chksum)
 		chksum += *p++;
-	chksum += sizeof(header->chksum) * ' ';
+	chksum += (unsigned int)sizeof(header->chksum) * ' ';
 	p += sizeof(header->chksum);
 	while (p < (const unsigned char *)header + sizeof(struct ustar_header))
 		chksum += *p++;
@@ -355,12 +358,14 @@ static void write_global_extended_header(struct archiver_args *args)
 }

 static struct archiver **tar_filters;
-static int nr_tar_filters;
-static int alloc_tar_filters;
+static size_t nr_tar_filters;
+static size_t alloc_tar_filters;

-static struct archiver *find_tar_filter(const char *name, int len)
+static struct archiver *find_tar_filter(const char *name, size_t len)
 {
 	int i;
+	if (len < 1)
+		return NULL;
 	for (i = 0; i < nr_tar_filters; i++) {
 		struct archiver *ar = tar_filters[i];
 		if (!strncmp(ar->name, name, len) && !ar->name[len])
@@ -369,14 +374,27 @@ static struct archiver *find_tar_filter(const char *name, int len)
 	return NULL;
 }

+static int parse_config_key2(const char *var, const char *section,
+			     const char **subsection, size_t *subsection_len,
+			     const char **key)
+{
+	int rc, len;
+
+	rc = parse_config_key(var, section, subsection, &len, key);
+	if (!rc && len < 0)
+		return -1;
+	*subsection_len = (size_t)len;
+	return rc;
+}
+
 static int tar_filter_config(const char *var, const char *value, void *data)
 {
 	struct archiver *ar;
 	const char *name;
 	const char *type;
-	int namelen;
+	size_t namelen;

-	if (parse_config_key(var, "tar", &name, &namelen, &type) < 0 || !name)
+	if (parse_config_key2(var, "tar", &name, &namelen, &type) < 0 || !name)
 		return 0;

 	ar = find_tar_filter(name, namelen);
@@ -400,7 +418,7 @@ static int tar_filter_config(const char *var, const char *value, void *data)
 		if (git_config_bool(var, value))
 			ar->flags |= ARCHIVER_REMOTE;
 		else
-			ar->flags &= ~ARCHIVER_REMOTE;
+			ar->flags &= ~(unsigned int)ARCHIVER_REMOTE;
 		return 0;
 	}

@@ -414,7 +432,7 @@ static int git_tar_config(const char *var, const char *value, void *cb)
 			tar_umask = umask(0);
 			umask(tar_umask);
 		} else {
-			tar_umask = git_config_int(var, value);
+			tar_umask = (mode_t)git_config_ulong(var, value);
 		}
 		return 0;
 	}
diff --git a/cache.h b/cache.h
index 67cc2e1806..a791034260 100644
--- a/cache.h
+++ b/cache.h
@@ -241,7 +241,7 @@ static inline void copy_cache_entry(struct cache_entry *dst,
 				    const struct cache_entry *src)
 {
 	unsigned int state = dst->ce_flags & CE_HASHED;
-	int mem_pool_allocated = dst->mem_pool_allocated;
+	unsigned int mem_pool_allocated = dst->mem_pool_allocated;

 	/* Don't copy hash chain and name */
 	memcpy(&dst->ce_stat_data, &src->ce_stat_data,
@@ -249,7 +249,7 @@ static inline void copy_cache_entry(struct cache_entry *dst,
 			offsetof(struct cache_entry, ce_stat_data));

 	/* Restore the hash state */
-	dst->ce_flags = (dst->ce_flags & ~CE_HASHED) | state;
+	dst->ce_flags = (dst->ce_flags & ~(unsigned int)CE_HASHED) | state;

 	/* Restore the mem_pool_allocated flag */
 	dst->mem_pool_allocated = mem_pool_allocated;
@@ -1314,7 +1314,7 @@ extern int check_and_freshen_file(const char *fn, int freshen);
 extern const signed char hexval_table[256];
 static inline unsigned int hexval(unsigned char c)
 {
-	return hexval_table[c];
+	return (unsigned int)hexval_table[c];
 }

 /*
@@ -1323,8 +1323,8 @@ static inline unsigned int hexval(unsigned char c)
  */
 static inline int hex2chr(const char *s)
 {
-	unsigned int val = hexval(s[0]);
-	return (val & ~0xf) ? val : (val << 4) | hexval(s[1]);
+	int val = hexval_table[(unsigned char)s[0]];
+	return (val < 0) ? val : (val << 4) | hexval_table[(unsigned char)s[1]];
 }

 /* Convert to/from hex/sha1 representation */
diff --git a/git-compat-util.h b/git-compat-util.h
index 4386b3e1c8..cf33e84c96 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -1068,7 +1068,7 @@ static inline int strtoul_ui(char const *s, int base, unsigned int *result)
 	ul = strtoul(s, &p, base);
 	if (errno || *p || p == s || (unsigned int) ul != ul)
 		return -1;
-	*result = ul;
+	*result = (unsigned int)ul;
 	return 0;
 }

@@ -1081,7 +1081,7 @@ static inline int strtol_i(char const *s, int base, int *result)
 	ul = strtol(s, &p, base);
 	if (errno || *p || p == s || (int) ul != ul)
 		return -1;
-	*result = ul;
+	*result = (int)ul;
 	return 0;
 }

@@ -1119,7 +1119,22 @@ static inline int regexec_buf(const regex_t *preg, const char *buf, size_t size,
 {
 	assert(nmatch > 0 && pmatch);
 	pmatch[0].rm_so = 0;
-	pmatch[0].rm_eo = size;
+	pmatch[0].rm_eo = (regoff_t)size;
+	if (pmatch[0].rm_eo != size) {
+		if (((regoff_t)-1) < 0) {
+			if (sizeof(regoff_t) == sizeof(int))
+				pmatch[0].rm_eo = (regoff_t)INT_MAX;
+			else if (sizeof(regoff_t) == sizeof(long))
+				pmatch[0].rm_eo = (regoff_t)LONG_MAX;
+			else
+				die("unable to determine maximum value of regoff_t");
+		} else {
+			pmatch[0].rm_eo = (regoff_t)-1;
+		}
+		warning("buffer too big (%"PRIuMAX"), "
+			"will search only the first %"PRIuMAX" bytes",
+			(uintmax_t)size, (uintmax_t)pmatch[0].rm_eo);
+	}
 	return regexec(preg, buf, nmatch, pmatch, eflags | REG_STARTEND);
 }

diff --git a/hash.h b/hash.h
index 661c9f2281..7056f89eb4 100644
--- a/hash.h
+++ b/hash.h
@@ -134,7 +134,7 @@ int hash_algo_by_id(uint32_t format_id);
 /* Identical, except based on the length. */
 int hash_algo_by_length(int len);
 /* Identical, except for a pointer to struct git_hash_algo. */
-static inline int hash_algo_by_ptr(const struct git_hash_algo *p)
+static inline ptrdiff_t hash_algo_by_ptr(const struct git_hash_algo *p)
 {
 	return p - hash_algos;
 }
diff --git a/strbuf.h b/strbuf.h
index c8d98dfb95..30659f2d5d 100644
--- a/strbuf.h
+++ b/strbuf.h
@@ -225,7 +225,7 @@ int strbuf_cmp(const struct strbuf *first, const struct strbuf *second);
 /**
  * Add a single character to the buffer.
  */
-static inline void strbuf_addch(struct strbuf *sb, int c)
+static inline void strbuf_addch(struct strbuf *sb, char c)
 {
 	if (!strbuf_avail(sb))
 		strbuf_grow(sb, 1);
--
2.21.0

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 2/4] archive-tar: mark RECORDSIZE/BLOCKSIZE as unsigned
  2019-05-10 17:18               ` René Scharfe
@ 2019-05-10 21:20                 ` Jeff King
  0 siblings, 0 replies; 43+ messages in thread
From: Jeff King @ 2019-05-10 21:20 UTC (permalink / raw)
  To: René Scharfe
  Cc: Johannes Schindelin, Johannes Schindelin via GitGitGadget, git,
	brian m. carlson, Junio C Hamano

On Fri, May 10, 2019 at 07:18:44PM +0200, René Scharfe wrote:

> Am 09.05.19 um 20:38 schrieb Jeff King:
> > I do dream of a world where we do not have a bunch of implicit
> > conversions (both signedness but also truncation) in our code base, and
> > can compile cleanly with -Wconversion We know that this case is
> > perfectly fine, but I am sure there are many that are not. However, I'm
> > not sure if we'll ever get there, and in the meantime I don't think it's
> > worth worrying too much about individual cases like this.
> 
> Here's a rough take on how to silence that warning for archive-tar.c using
> GCC 8.3.  Some of the changes are worth polishing and submitting.  Some
> are silly.  The one for regexec_buf() is scary; I don't see a clean way of
> dealing with that size_t to int conversion.

This is actually slightly less tedious than I had imagined it to be, but
still pretty bad. I dunno. If somebody wants to tackle it, I do think it
would make the world a better place. But I'm not sure if it is worth the
effort involved.

>  static void write_trailer(void)
>  {
> -	int tail = BLOCKSIZE - offset;
> +	size_t tail = BLOCKSIZE - offset;

These kinds of int/size_t conversions are the ones I think are the most
valuable (because the size_t's are often used to allocate or access
arrays, and truncated or negative values there can cause other security
problems). _Most_ of them are harmless, of course, but it's hard to
separate the important ones from the mundane.

> @@ -414,7 +432,7 @@ static int git_tar_config(const char *var, const char *value, void *cb)
>  			tar_umask = umask(0);
>  			umask(tar_umask);
>  		} else {
> -			tar_umask = git_config_int(var, value);
> +			tar_umask = (mode_t)git_config_ulong(var, value);
>  		}

It's nice that the cast here shuts up the compiler, and I agree it is
not likely to be a problem in this instance. But we'd probably want some
kind of "safe cast" helper. To some degree, if you put 2^64-1 in your
"umask" value you get what you deserve, but it would be nice if we could
detect such nonsense (less for this case, but more for others where we
do cast).

> @@ -1119,7 +1119,22 @@ static inline int regexec_buf(const regex_t *preg, const char *buf, size_t size,
>  {
>  	assert(nmatch > 0 && pmatch);
>  	pmatch[0].rm_so = 0;
> -	pmatch[0].rm_eo = size;
> +	pmatch[0].rm_eo = (regoff_t)size;
> +	if (pmatch[0].rm_eo != size) {
> +		if (((regoff_t)-1) < 0) {
> +			if (sizeof(regoff_t) == sizeof(int))
> +				pmatch[0].rm_eo = (regoff_t)INT_MAX;
> +			else if (sizeof(regoff_t) == sizeof(long))
> +				pmatch[0].rm_eo = (regoff_t)LONG_MAX;
> +			else
> +				die("unable to determine maximum value of regoff_t");
> +		} else {
> +			pmatch[0].rm_eo = (regoff_t)-1;
> +		}
> +		warning("buffer too big (%"PRIuMAX"), "
> +			"will search only the first %"PRIuMAX" bytes",
> +			(uintmax_t)size, (uintmax_t)pmatch[0].rm_eo);
> +	}
>  	return regexec(preg, buf, nmatch, pmatch, eflags | REG_STARTEND);
>  }

I think a helper could make things less awful here, too. Our xsize_t()
is sort of like this, but of course it dies. But I think it would be
possible to write a macro to let you do:

  if (ASSIGN_CAST(pmatch[0].rm_eo, size))
	warning(...);

This is definitely a rabbit-hole that I've been afraid to go down. :)

-Peff

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] archive: avoid spawning `gzip`
  2019-05-01 18:18                   ` Jeff King
@ 2019-06-10 10:44                     ` René Scharfe
  2019-06-13 19:16                       ` Jeff King
  0 siblings, 1 reply; 43+ messages in thread
From: René Scharfe @ 2019-06-10 10:44 UTC (permalink / raw)
  To: Jeff King
  Cc: Johannes Schindelin, Rohit Ashiwal via GitGitGadget, git,
	Junio C Hamano, Rohit Ashiwal

Am 01.05.19 um 20:18 schrieb Jeff King:
> On Wed, May 01, 2019 at 07:45:05PM +0200, René Scharfe wrote:
>
>>> But since the performance is still not quite on par with `gzip`, I would
>>> actually rather not, and really, just punt on that one, stating that
>>> people interested in higher performance should use `pigz`.
>>
>> Here are my performance numbers for generating .tar.gz files again:

OK, tried one more version, with pthreads (patch at the end).  Also
redid all measurements for better comparability; everything is faster
now for some reason (perhaps due to a compiler update? clang version
7.0.1-8 now):

master, using gzip(1):
Benchmark #1: git archive --format=tgz HEAD
  Time (mean ± σ):     15.697 s ±  0.246 s    [User: 19.213 s, System: 0.386 s]
  Range (min … max):   15.405 s … 16.103 s    10 runs

using zlib sequentially:
Benchmark #1: git archive --format=tgz HEAD
  Time (mean ± σ):     19.191 s ±  0.408 s    [User: 19.091 s, System: 0.100 s]
  Range (min … max):   18.802 s … 19.877 s    10 runs

using a gzip-lookalike:
Benchmark #1: git archive --format=tgz HEAD
  Time (mean ± σ):     16.289 s ±  0.218 s    [User: 19.485 s, System: 0.337 s]
  Range (min … max):   16.020 s … 16.555 s    10 runs

using zlib with start_async:
Benchmark #1: git archive --format=tgz HEAD
  Time (mean ± σ):     16.516 s ±  0.334 s    [User: 20.282 s, System: 0.383 s]
  Range (min … max):   16.166 s … 17.283 s    10 runs

using zlib in a separate thread (that's the new one):
Benchmark #1: git archive --format=tgz HEAD
  Time (mean ± σ):     16.310 s ±  0.237 s    [User: 20.075 s, System: 0.173 s]
  Range (min … max):   15.983 s … 16.790 s    10 runs

> I think the start_async() one seems like a good option. It reclaims most
> of the (wall-clock) performance, isn't very much code, and doesn't leave
> any ugly user-visible traces.

The pthreads numbers look a bit better still.  The patch is huge though,
because it duplicates almost everything.  It was easier that way; a real
patch series would extract functions that can be used both with static
and allocated headers first, and keep everything in archive-tar.c.

> I'd be fine to see it come later, though, on top of the patches Dscho is
> sending. Even though changing to sequential zlib is technically a change
> in behavior, the existing behavior wasn't really planned. And given the
> wall-clock versus CPU time tradeoff, it's not entirely clear that one
> solution is better than the other.

The current behavior is not an accident; the synchronous method was
rejected in 2009 because it was slower [1].  Redid the measurements
with v1.6.5-rc0 and the old patch [2], but they would only compile with
gcc (Debian 8.3.0-6) for me, so it's not directly comparable to the
numbers above:

v1.6.5-rc0:
Benchmark #1: ../git/git-archive HEAD | gzip
  Time (mean ± σ):     16.051 s ±  0.486 s    [User: 19.514 s, System: 0.341 s]
  Range (min … max):   15.416 s … 17.001 s    10 runs

v1.6.5-rc0 + [2]:
Benchmark #1: ../git/git-archive --format=tar.gz HEAD
  Time (mean ± σ):     19.684 s ±  0.374 s    [User: 19.601 s, System: 0.060 s]
  Range (min … max):   19.082 s … 20.177 s    10 runs

User time is still slightly higher, but the difference is in the noise.

[1] http://public-inbox.org/git/4AAAC8CE.8020302@lsrfire.ath.cx/
[2] http://public-inbox.org/git/4AA97B61.6030301@lsrfire.ath.cx/

>>> And who knows, maybe nobody will complain at all about the performance?
>>
>> Probably.  And popular tarballs would be cached anyway, I guess.
>
> At GitHub we certainly do cache the git-archive output. We'd also be
> just fine with the sequential solution. We generally turn down
> pack.threads to 1, and keep our CPUs busy by serving multiple users
> anyway.
>
> So whatever has the lowest overall CPU time is generally preferable, but
> the times are close enough that I don't think we'd care much either way
> (and it's probably not worth having a config option or similar).

Moving back to 2009 and reducing the number of utilized cores both feels
weird, but the sequential solution *is* the most obvious, easiest and
(by a narrow margin) lightest one if gzip(1) is not an option anymore.

Anyway, the threading patch:

---
 Makefile      |   1 +
 archive-tar.c |  11 +-
 archive-tgz.c | 452 ++++++++++++++++++++++++++++++++++++++++++++++++++
 archive.h     |   4 +
 4 files changed, 465 insertions(+), 3 deletions(-)
 create mode 100644 archive-tgz.c

diff --git a/Makefile b/Makefile
index 8a7e235352..ed649ac18d 100644
--- a/Makefile
+++ b/Makefile
@@ -834,6 +834,7 @@ LIB_OBJS += alloc.o
 LIB_OBJS += apply.o
 LIB_OBJS += archive.o
 LIB_OBJS += archive-tar.o
+LIB_OBJS += archive-tgz.o
 LIB_OBJS += archive-zip.o
 LIB_OBJS += argv-array.o
 LIB_OBJS += attr.o
diff --git a/archive-tar.c b/archive-tar.c
index 3e53aac1e6..929eb58235 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -15,7 +15,9 @@
 static char block[BLOCKSIZE];
 static unsigned long offset;

-static int tar_umask = 002;
+int tar_umask = 002;
+
+static const char internal_gzip[] = "git archive gzip";

 static int write_tar_filter_archive(const struct archiver *ar,
 				    struct archiver_args *args);
@@ -445,6 +447,9 @@ static int write_tar_filter_archive(const struct archiver *ar,
 	if (!ar->data)
 		BUG("tar-filter archiver called with no filter defined");

+	if (!strcmp(ar->data, internal_gzip))
+		return write_tgz_archive(ar, args);
+
 	strbuf_addstr(&cmd, ar->data);
 	if (args->compression_level >= 0)
 		strbuf_addf(&cmd, " -%d", args->compression_level);
@@ -483,9 +488,9 @@ void init_tar_archiver(void)
 	int i;
 	register_archiver(&tar_archiver);

-	tar_filter_config("tar.tgz.command", "gzip -cn", NULL);
+	tar_filter_config("tar.tgz.command", internal_gzip, NULL);
 	tar_filter_config("tar.tgz.remote", "true", NULL);
-	tar_filter_config("tar.tar.gz.command", "gzip -cn", NULL);
+	tar_filter_config("tar.tar.gz.command", internal_gzip, NULL);
 	tar_filter_config("tar.tar.gz.remote", "true", NULL);
 	git_config(git_tar_config, NULL);
 	for (i = 0; i < nr_tar_filters; i++) {
diff --git a/archive-tgz.c b/archive-tgz.c
new file mode 100644
index 0000000000..ae219e1cc0
--- /dev/null
+++ b/archive-tgz.c
@@ -0,0 +1,452 @@
+#include "cache.h"
+#include "config.h"
+#include "tar.h"
+#include "archive.h"
+#include "object-store.h"
+#include "streaming.h"
+
+#define RECORDSIZE	(512)
+#define BLOCKSIZE	(RECORDSIZE * 20)
+
+static gzFile gzip;
+static size_t offset;
+
+/*
+ * This is the max value that a ustar size header can specify, as it is fixed
+ * at 11 octal digits. POSIX specifies that we switch to extended headers at
+ * this size.
+ *
+ * Likewise for the mtime (which happens to use a buffer of the same size).
+ */
+#if ULONG_MAX == 0xFFFFFFFF
+#define USTAR_MAX_SIZE ULONG_MAX
+#else
+#define USTAR_MAX_SIZE 077777777777UL
+#endif
+#if TIME_MAX == 0xFFFFFFFF
+#define USTAR_MAX_MTIME TIME_MAX
+#else
+#define USTAR_MAX_MTIME 077777777777ULL
+#endif
+
+static void tgz_write(const void *data, size_t size)
+{
+	const char *p = data;
+	while (size) {
+		size_t to_write = size;
+		if (to_write > UINT_MAX)
+			to_write = UINT_MAX;
+		if (gzwrite(gzip, p, to_write) != to_write)
+			die(_("gzwrite failed"));
+		p += to_write;
+		size -= to_write;
+		offset = (offset + to_write) % BLOCKSIZE;
+	}
+}
+
+static void tgz_finish_record(void)
+{
+	size_t tail = offset % RECORDSIZE;
+	if (tail) {
+		size_t to_seek = RECORDSIZE - tail;
+		if (gzseek(gzip, to_seek, SEEK_CUR) < 0)
+			die(_("gzseek failed"));
+		offset = (offset + to_seek) % BLOCKSIZE;
+	}
+}
+
+static void tgz_write_trailer(void)
+{
+	size_t to_seek = BLOCKSIZE - offset;
+	if (to_seek < 2 * RECORDSIZE)
+		to_seek += BLOCKSIZE;
+	if (gzseek(gzip, to_seek, SEEK_CUR) < 0)
+		 die(_("gzseek failed"));
+	if (gzflush(gzip, Z_FINISH) != Z_OK)
+		die(_("gzflush failed"));
+}
+
+struct work_item {
+	void *buffer;
+	size_t size;
+	int finish_record;
+};
+
+#define TODO_SIZE 64
+struct work_item todo[TODO_SIZE];
+static int todo_start;
+static int todo_end;
+static int todo_done;
+static int all_work_added;
+static pthread_mutex_t tar_mutex;
+static pthread_t thread;
+
+static void tar_lock(void)
+{
+	pthread_mutex_lock(&tar_mutex);
+}
+
+static void tar_unlock(void)
+{
+	pthread_mutex_unlock(&tar_mutex);
+}
+
+static pthread_cond_t cond_add;
+static pthread_cond_t cond_write;
+static pthread_cond_t cond_result;
+
+static void add_work(void *buffer, size_t size, int finish_record)
+{
+	tar_lock();
+
+	while ((todo_end + 1) % ARRAY_SIZE(todo) == todo_done)
+		pthread_cond_wait(&cond_write, &tar_mutex);
+
+	todo[todo_end].buffer = buffer;
+	todo[todo_end].size = size;
+	todo[todo_end].finish_record = finish_record;
+
+	todo_end = (todo_end + 1) % ARRAY_SIZE(todo);
+
+	pthread_cond_signal(&cond_add);
+	tar_unlock();
+}
+
+static struct work_item *get_work(void)
+{
+	struct work_item *ret = NULL;
+
+	tar_lock();
+	while (todo_start == todo_end && !all_work_added)
+		pthread_cond_wait(&cond_add, &tar_mutex);
+
+	if (todo_start != todo_end || !all_work_added) {
+		ret = &todo[todo_start];
+		todo_start = (todo_start + 1) % ARRAY_SIZE(todo);
+	}
+	tar_unlock();
+	return ret;
+}
+
+static void work_done(void)
+{
+	tar_lock();
+	todo_done = (todo_done + 1) % ARRAY_SIZE(todo);
+	pthread_cond_signal(&cond_write);
+
+	if (all_work_added && todo_done == todo_end)
+		pthread_cond_signal(&cond_result);
+	tar_unlock();
+}
+
+static void *run(void *arg)
+{
+	for (;;) {
+		struct work_item *w = get_work();
+		if (!w)
+			break;
+		tgz_write(w->buffer, w->size);
+		free(w->buffer);
+		if (w->finish_record)
+			tgz_finish_record();
+		work_done();
+	}
+	return NULL;
+}
+
+static void start_output_thread(void)
+{
+	int err;
+
+	pthread_mutex_init(&tar_mutex, NULL);
+	pthread_cond_init(&cond_add, NULL);
+	pthread_cond_init(&cond_write, NULL);
+	pthread_cond_init(&cond_result, NULL);
+
+	memset(todo, 0, sizeof(todo));
+
+	err = pthread_create(&thread, NULL, run, NULL);
+	if (err)
+		die(_("failed to create thread: %s"), strerror(err));
+}
+
+static void wait_for_output_thread(void)
+{
+	tar_lock();
+	all_work_added = 1;
+
+	while (todo_done != todo_end)
+		pthread_cond_wait(&cond_result, &tar_mutex);
+
+	pthread_cond_broadcast(&cond_add);
+	tar_unlock();
+
+	pthread_join(thread, NULL);
+
+	pthread_mutex_destroy(&tar_mutex);
+	pthread_cond_destroy(&cond_add);
+	pthread_cond_destroy(&cond_write);
+	pthread_cond_destroy(&cond_result);
+}
+
+static int stream_blob(const struct object_id *oid)
+{
+	struct git_istream *st;
+	enum object_type type;
+	unsigned long sz;
+	ssize_t readlen;
+	size_t chunk_size = BLOCKSIZE * 10;
+
+	st = open_istream(oid, &type, &sz, NULL);
+	if (!st)
+		return error(_("cannot stream blob %s"), oid_to_hex(oid));
+	for (;;) {
+		char *buf = xmalloc(chunk_size);
+		readlen = read_istream(st, buf, chunk_size);
+		if (readlen <= 0)
+			break;
+		sz -= readlen;
+		add_work(buf, readlen, !sz);
+	}
+	close_istream(st);
+	return readlen;
+}
+
+/*
+ * pax extended header records have the format "%u %s=%s\n".  %u contains
+ * the size of the whole string (including the %u), the first %s is the
+ * keyword, the second one is the value.  This function constructs such a
+ * string and appends it to a struct strbuf.
+ */
+static void strbuf_append_ext_header(struct strbuf *sb, const char *keyword,
+				     const char *value, unsigned int valuelen)
+{
+	int len, tmp;
+
+	/* "%u %s=%s\n" */
+	len = 1 + 1 + strlen(keyword) + 1 + valuelen + 1;
+	for (tmp = len; tmp > 9; tmp /= 10)
+		len++;
+
+	strbuf_grow(sb, len);
+	strbuf_addf(sb, "%u %s=", len, keyword);
+	strbuf_add(sb, value, valuelen);
+	strbuf_addch(sb, '\n');
+}
+
+/*
+ * Like strbuf_append_ext_header, but for numeric values.
+ */
+static void strbuf_append_ext_header_uint(struct strbuf *sb,
+					  const char *keyword,
+					  uintmax_t value)
+{
+	char buf[40]; /* big enough for 2^128 in decimal, plus NUL */
+	int len;
+
+	len = xsnprintf(buf, sizeof(buf), "%"PRIuMAX, value);
+	strbuf_append_ext_header(sb, keyword, buf, len);
+}
+
+static unsigned int ustar_header_chksum(const struct ustar_header *header)
+{
+	const unsigned char *p = (const unsigned char *)header;
+	unsigned int chksum = 0;
+	while (p < (const unsigned char *)header->chksum)
+		chksum += *p++;
+	chksum += sizeof(header->chksum) * ' ';
+	p += sizeof(header->chksum);
+	while (p < (const unsigned char *)header + sizeof(struct ustar_header))
+		chksum += *p++;
+	return chksum;
+}
+
+static size_t get_path_prefix(const char *path, size_t pathlen, size_t maxlen)
+{
+	size_t i = pathlen;
+	if (i > 1 && path[i - 1] == '/')
+		i--;
+	if (i > maxlen)
+		i = maxlen;
+	do {
+		i--;
+	} while (i > 0 && path[i] != '/');
+	return i;
+}
+
+static void prepare_header(struct archiver_args *args,
+			   struct ustar_header *header,
+			   unsigned int mode, unsigned long size)
+{
+	xsnprintf(header->mode, sizeof(header->mode), "%07o", mode & 07777);
+	xsnprintf(header->size, sizeof(header->size), "%011"PRIoMAX , S_ISREG(mode) ? (uintmax_t)size : (uintmax_t)0);
+	xsnprintf(header->mtime, sizeof(header->mtime), "%011lo", (unsigned long) args->time);
+
+	xsnprintf(header->uid, sizeof(header->uid), "%07o", 0);
+	xsnprintf(header->gid, sizeof(header->gid), "%07o", 0);
+	strlcpy(header->uname, "root", sizeof(header->uname));
+	strlcpy(header->gname, "root", sizeof(header->gname));
+	xsnprintf(header->devmajor, sizeof(header->devmajor), "%07o", 0);
+	xsnprintf(header->devminor, sizeof(header->devminor), "%07o", 0);
+
+	memcpy(header->magic, "ustar", 6);
+	memcpy(header->version, "00", 2);
+
+	xsnprintf(header->chksum, sizeof(header->chksum), "%07o", ustar_header_chksum(header));
+}
+
+static void write_extended_header(struct archiver_args *args,
+				  const struct object_id *oid,
+				  struct strbuf *extended_header)
+{
+	size_t size;
+	char *buffer = strbuf_detach(extended_header, &size);
+	struct ustar_header *header = xcalloc(1, sizeof(*header));
+	unsigned int mode;
+	*header->typeflag = TYPEFLAG_EXT_HEADER;
+	mode = 0100666;
+	xsnprintf(header->name, sizeof(header->name), "%s.paxheader",
+		  oid_to_hex(oid));
+	prepare_header(args, header, mode, size);
+	add_work(header, sizeof(*header), 1);
+	add_work(buffer, size, 1);
+}
+
+static int write_tar_entry(struct archiver_args *args,
+			   const struct object_id *oid,
+			   const char *path, size_t pathlen,
+			   unsigned int mode)
+{
+	struct ustar_header *header = xcalloc(1, sizeof(*header));
+	struct strbuf ext_header = STRBUF_INIT;
+	unsigned int old_mode = mode;
+	unsigned long size, size_in_header;
+	void *buffer;
+	int err = 0;
+
+	if (S_ISDIR(mode) || S_ISGITLINK(mode)) {
+		*header->typeflag = TYPEFLAG_DIR;
+		mode = (mode | 0777) & ~tar_umask;
+	} else if (S_ISLNK(mode)) {
+		*header->typeflag = TYPEFLAG_LNK;
+		mode |= 0777;
+	} else if (S_ISREG(mode)) {
+		*header->typeflag = TYPEFLAG_REG;
+		mode = (mode | ((mode & 0100) ? 0777 : 0666)) & ~tar_umask;
+	} else {
+		return error(_("unsupported file mode: 0%o (SHA1: %s)"),
+			     mode, oid_to_hex(oid));
+	}
+	if (pathlen > sizeof(header->name)) {
+		size_t plen = get_path_prefix(path, pathlen,
+					      sizeof(header->prefix));
+		size_t rest = pathlen - plen - 1;
+		if (plen > 0 && rest <= sizeof(header->name)) {
+			memcpy(header->prefix, path, plen);
+			memcpy(header->name, path + plen + 1, rest);
+		} else {
+			xsnprintf(header->name, sizeof(header->name), "%s.data",
+				  oid_to_hex(oid));
+			strbuf_append_ext_header(&ext_header, "path",
+						 path, pathlen);
+		}
+	} else
+		memcpy(header->name, path, pathlen);
+
+	if (S_ISREG(mode) && !args->convert &&
+	    oid_object_info(args->repo, oid, &size) == OBJ_BLOB &&
+	    size > big_file_threshold)
+		buffer = NULL;
+	else if (S_ISLNK(mode) || S_ISREG(mode)) {
+		enum object_type type;
+		buffer = object_file_to_archive(args, path, oid, old_mode, &type, &size);
+		if (!buffer)
+			return error(_("cannot read %s"), oid_to_hex(oid));
+	} else {
+		buffer = NULL;
+		size = 0;
+	}
+
+	if (S_ISLNK(mode)) {
+		if (size > sizeof(header->linkname)) {
+			xsnprintf(header->linkname, sizeof(header->linkname),
+				  "see %s.paxheader", oid_to_hex(oid));
+			strbuf_append_ext_header(&ext_header, "linkpath",
+						 buffer, size);
+		} else
+			memcpy(header->linkname, buffer, size);
+	}
+
+	size_in_header = size;
+	if (S_ISREG(mode) && size > USTAR_MAX_SIZE) {
+		size_in_header = 0;
+		strbuf_append_ext_header_uint(&ext_header, "size", size);
+	}
+
+	prepare_header(args, header, mode, size_in_header);
+
+	if (ext_header.len > 0) {
+		write_extended_header(args, oid, &ext_header);
+	}
+	add_work(header, sizeof(*header), 1);
+	if (S_ISREG(mode) && size > 0) {
+		if (buffer)
+			add_work(buffer, size, 1);
+		else
+			err = stream_blob(oid);
+	} else
+		free(buffer);
+	return err;
+}
+
+static void write_global_extended_header(struct archiver_args *args)
+{
+	const struct object_id *oid = args->commit_oid;
+	struct strbuf ext_header = STRBUF_INIT;
+	struct ustar_header *header = xcalloc(1, sizeof(*header));
+	unsigned int mode;
+	size_t size;
+	char *buffer;
+
+	if (oid)
+		strbuf_append_ext_header(&ext_header, "comment",
+					 oid_to_hex(oid),
+					 the_hash_algo->hexsz);
+	if (args->time > USTAR_MAX_MTIME) {
+		strbuf_append_ext_header_uint(&ext_header, "mtime",
+					      args->time);
+		args->time = USTAR_MAX_MTIME;
+	}
+
+	if (!ext_header.len)
+		return;
+
+	buffer = strbuf_detach(&ext_header, &size);
+	*header->typeflag = TYPEFLAG_GLOBAL_HEADER;
+	mode = 0100666;
+	xsnprintf(header->name, sizeof(header->name), "pax_global_header");
+	prepare_header(args, header, mode, size);
+	add_work(header, sizeof(*header), 1);
+	add_work(buffer, size, 1);
+}
+
+int write_tgz_archive(const struct archiver *ar, struct archiver_args *args)
+{
+	int level = args->compression_level;
+	int err = 0;
+
+	gzip = gzdopen(1, "wb");
+	if (!gzip)
+		return error(_("gzdopen failed"));
+	if (gzsetparams(gzip, level, Z_DEFAULT_STRATEGY) != Z_OK)
+		return error(_("unable to set compression level %d"), level);
+
+	start_output_thread();
+	write_global_extended_header(args);
+	err = write_archive_entries(args, write_tar_entry);
+	if (err)
+		return err;
+	wait_for_output_thread();
+	tgz_write_trailer();
+	return err;
+}
diff --git a/archive.h b/archive.h
index e60e3dd31c..b00afa1a9f 100644
--- a/archive.h
+++ b/archive.h
@@ -45,6 +45,10 @@ void init_tar_archiver(void);
 void init_zip_archiver(void);
 void init_archivers(void);

+int tar_umask;
+
+int write_tgz_archive(const struct archiver *ar, struct archiver_args *args);
+
 typedef int (*write_archive_entry_fn_t)(struct archiver_args *args,
 					const struct object_id *oid,
 					const char *path, size_t pathlen,
--
2.22.0

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] archive: avoid spawning `gzip`
  2019-06-10 10:44                     ` René Scharfe
@ 2019-06-13 19:16                       ` Jeff King
  0 siblings, 0 replies; 43+ messages in thread
From: Jeff King @ 2019-06-13 19:16 UTC (permalink / raw)
  To: René Scharfe
  Cc: Johannes Schindelin, Rohit Ashiwal via GitGitGadget, git,
	Junio C Hamano, Rohit Ashiwal

On Mon, Jun 10, 2019 at 12:44:54PM +0200, René Scharfe wrote:

> Am 01.05.19 um 20:18 schrieb Jeff King:
> > On Wed, May 01, 2019 at 07:45:05PM +0200, René Scharfe wrote:
> >
> >>> But since the performance is still not quite on par with `gzip`, I would
> >>> actually rather not, and really, just punt on that one, stating that
> >>> people interested in higher performance should use `pigz`.
> >>
> >> Here are my performance numbers for generating .tar.gz files again:
> 
> OK, tried one more version, with pthreads (patch at the end).  Also
> redid all measurements for better comparability; everything is faster
> now for some reason (perhaps due to a compiler update? clang version
> 7.0.1-8 now):

Hmm. Interesting that using pthreads is still slower than just shelling
out to gzip:

> master, using gzip(1):
> Benchmark #1: git archive --format=tgz HEAD
>   Time (mean ± σ):     15.697 s ±  0.246 s    [User: 19.213 s, System: 0.386 s]
>   Range (min … max):   15.405 s … 16.103 s    10 runs
> [...]
> using zlib in a separate thread (that's the new one):
> Benchmark #1: git archive --format=tgz HEAD
>   Time (mean ± σ):     16.310 s ±  0.237 s    [User: 20.075 s, System: 0.173 s]
>   Range (min … max):   15.983 s … 16.790 s    10 runs

I wonder if zlib is just slower. Or if the cost of context switching
is somehow higher than just dumping big chunks over a pipe. In
particular, our gzip-alike is still faster than pthreads:

> using a gzip-lookalike:
> Benchmark #1: git archive --format=tgz HEAD
>   Time (mean ± σ):     16.289 s ±  0.218 s    [User: 19.485 s, System: 0.337 s]
>   Range (min … max):   16.020 s … 16.555 s    10 runs

though it looks like the timings do overlap.

> > At GitHub we certainly do cache the git-archive output. We'd also be
> > just fine with the sequential solution. We generally turn down
> > pack.threads to 1, and keep our CPUs busy by serving multiple users
> > anyway.
> >
> > So whatever has the lowest overall CPU time is generally preferable, but
> > the times are close enough that I don't think we'd care much either way
> > (and it's probably not worth having a config option or similar).
> 
> Moving back to 2009 and reducing the number of utilized cores both feels
> weird, but the sequential solution *is* the most obvious, easiest and
> (by a narrow margin) lightest one if gzip(1) is not an option anymore.

It sounds like we resolved to give the "internal gzip" its own name
(whether it's a gzip-alike command, or a special name we recognize to
trigger the internal code). So maybe we could continue to default to
"gzip -cn", but platforms could do otherwise when shipping gzip there is
a pain (i.e. Windows, but maybe also anybody else who wants to set
NO_EXTERNAL_GZIP or detect it from autoconf).

-Peff

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, back to index

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-12 23:04 [PATCH 0/2] Avoid spawning gzip in git archive Johannes Schindelin via GitGitGadget
2019-04-12 23:04 ` [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die() Rohit Ashiwal via GitGitGadget
2019-04-13  1:34   ` Jeff King
2019-04-13  5:51     ` Junio C Hamano
2019-04-14  4:36       ` Rohit Ashiwal
2019-04-26 14:29       ` Johannes Schindelin
2019-04-26 23:44         ` Junio C Hamano
2019-04-29 21:32           ` Johannes Schindelin
2019-05-01 18:09             ` Jeff King
2019-05-02 20:29               ` René Scharfe
2019-05-05  5:25               ` Junio C Hamano
2019-05-06  5:07                 ` Jeff King
2019-04-14  4:34     ` Rohit Ashiwal
2019-04-14 10:33       ` Junio C Hamano
2019-04-26 14:28     ` Johannes Schindelin
2019-05-01 18:07       ` Jeff King
2019-04-12 23:04 ` [PATCH 2/2] archive: avoid spawning `gzip` Rohit Ashiwal via GitGitGadget
2019-04-13  1:51   ` Jeff King
2019-04-13 22:01     ` René Scharfe
2019-04-15 21:35       ` Jeff King
2019-04-26 14:51         ` Johannes Schindelin
2019-04-27  9:59           ` René Scharfe
2019-04-27 17:39             ` René Scharfe
2019-04-29 21:25               ` Johannes Schindelin
2019-05-01 17:45                 ` René Scharfe
2019-05-01 18:18                   ` Jeff King
2019-06-10 10:44                     ` René Scharfe
2019-06-13 19:16                       ` Jeff King
2019-04-13 22:16     ` brian m. carlson
2019-04-15 21:36       ` Jeff King
2019-04-26 14:54       ` Johannes Schindelin
2019-05-02 20:20         ` Ævar Arnfjörð Bjarmason
2019-05-03 20:49           ` Johannes Schindelin
2019-05-03 20:52             ` Jeff King
2019-04-26 14:47     ` Johannes Schindelin
     [not found] ` <pull.145.v2.git.gitgitgadget@gmail.com>
     [not found]   ` <4ea94a8784876c3a19e387537edd81a957fc692c.1556321244.git.gitgitgadget@gmail.com>
2019-05-02 20:29     ` [PATCH v2 3/4] archive: optionally use zlib directly for gzip compression René Scharfe
     [not found]   ` <ac2b2488a1b42b3caf8a84594c48eca796748e59.1556321244.git.gitgitgadget@gmail.com>
2019-05-02 20:30     ` [PATCH v2 2/4] archive-tar: mark RECORDSIZE/BLOCKSIZE as unsigned René Scharfe
2019-05-08 11:45       ` Johannes Schindelin
2019-05-08 23:04         ` Jeff King
2019-05-09 14:06           ` Johannes Schindelin
2019-05-09 18:38             ` Jeff King
2019-05-10 17:18               ` René Scharfe
2019-05-10 21:20                 ` Jeff King

git@vger.kernel.org list mirror (unofficial, one of many)

Archives are clonable:
	git clone --mirror https://public-inbox.org/git
	git clone --mirror http://ou63pmih66umazou.onion/git
	git clone --mirror http://czquwvybam4bgbro.onion/git
	git clone --mirror http://hjrcffqmbrq6wope.onion/git

Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.version-control.git
	nntp://ou63pmih66umazou.onion/inbox.comp.version-control.git
	nntp://czquwvybam4bgbro.onion/inbox.comp.version-control.git
	nntp://hjrcffqmbrq6wope.onion/inbox.comp.version-control.git
	nntp://news.gmane.org/gmane.comp.version-control.git

 note: .onion URLs require Tor: https://www.torproject.org/

AGPL code for this site: git clone https://public-inbox.org/ public-inbox