git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / Atom feed
* [PATCH 0/2] Avoid spawning gzip in git archive
@ 2019-04-12 23:04 Johannes Schindelin via GitGitGadget
  2019-04-12 23:04 ` [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die() Rohit Ashiwal via GitGitGadget
  2019-04-12 23:04 ` [PATCH 2/2] archive: avoid spawning `gzip` Rohit Ashiwal via GitGitGadget
  0 siblings, 2 replies; 13+ messages in thread
From: Johannes Schindelin via GitGitGadget @ 2019-04-12 23:04 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano

When creating .tar.gz archives with git archive, we let gzip handle the
compression part. But that is not even necessary, as we already require zlib
(to compress our loose objects). It is also unfortunate, as it requires gzip 
to be in the PATH (which is not the case e.g. with MinGit for Windows, which
tries to bundle just the bare minimum of files to make Git work
non-interactively, for use with 3rd-party applications requiring Git).

This patch series resolves this conundrum by teaching git archive the trick
to gzip-compress in-process.

Rohit Ashiwal (2):
  archive: replace write_or_die() calls with write_block_or_die()
  archive: avoid spawning `gzip`

 archive-tar.c | 54 ++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 41 insertions(+), 13 deletions(-)


base-commit: 8104ec994ea3849a968b4667d072fedd1e688642
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-145%2Fdscho%2Fdont-spawn-gzip-in-archive-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-145/dscho/dont-spawn-gzip-in-archive-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/145
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die()
  2019-04-12 23:04 [PATCH 0/2] Avoid spawning gzip in git archive Johannes Schindelin via GitGitGadget
@ 2019-04-12 23:04 ` Rohit Ashiwal via GitGitGadget
  2019-04-13  1:34   ` Jeff King
  2019-04-12 23:04 ` [PATCH 2/2] archive: avoid spawning `gzip` Rohit Ashiwal via GitGitGadget
  1 sibling, 1 reply; 13+ messages in thread
From: Rohit Ashiwal via GitGitGadget @ 2019-04-12 23:04 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Rohit Ashiwal

From: Rohit Ashiwal <rohit.ashiwal265@gmail.com>

MinGit for Windows comes without `gzip` bundled inside, git-archive uses
`gzip -cn` to compress tar files but for this to work, gzip needs to be
present on the host system.

In the next commit, we will change the gzip compression so that we no
longer spawn `gzip` but let zlib perform the compression in the same
process instead.

In preparation for this, we consolidate all the block writes into a
single function.

This closes https://github.com/git-for-windows/git/issues/1970

Signed-off-by: Rohit Ashiwal <rohit.ashiwal265@gmail.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 archive-tar.c | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/archive-tar.c b/archive-tar.c
index 4aabd566fb..ba37dad27c 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -17,6 +17,8 @@ static unsigned long offset;
 
 static int tar_umask = 002;
 
+static gzFile gzip;
+
 static int write_tar_filter_archive(const struct archiver *ar,
 				    struct archiver_args *args);
 
@@ -38,11 +40,21 @@ static int write_tar_filter_archive(const struct archiver *ar,
 #define USTAR_MAX_MTIME 077777777777ULL
 #endif
 
+/* writes out the whole block, or dies if fails */
+static void write_block_or_die(const char *block) {
+	if (gzip) {
+		if (gzwrite(gzip, block, (unsigned) BLOCKSIZE) != BLOCKSIZE)
+			die(_("gzwrite failed"));
+	} else {
+		write_or_die(1, block, BLOCKSIZE);
+	}
+}
+
 /* writes out the whole block, but only if it is full */
 static void write_if_needed(void)
 {
 	if (offset == BLOCKSIZE) {
-		write_or_die(1, block, BLOCKSIZE);
+		write_block_or_die(block);
 		offset = 0;
 	}
 }
@@ -66,7 +78,7 @@ static void do_write_blocked(const void *data, unsigned long size)
 		write_if_needed();
 	}
 	while (size >= BLOCKSIZE) {
-		write_or_die(1, buf, BLOCKSIZE);
+		write_block_or_die(buf);
 		size -= BLOCKSIZE;
 		buf += BLOCKSIZE;
 	}
@@ -101,10 +113,10 @@ static void write_trailer(void)
 {
 	int tail = BLOCKSIZE - offset;
 	memset(block + offset, 0, tail);
-	write_or_die(1, block, BLOCKSIZE);
+	write_block_or_die(block);
 	if (tail < 2 * RECORDSIZE) {
 		memset(block, 0, offset);
-		write_or_die(1, block, BLOCKSIZE);
+		write_block_or_die(block);
 	}
 }
 
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 2/2] archive: avoid spawning `gzip`
  2019-04-12 23:04 [PATCH 0/2] Avoid spawning gzip in git archive Johannes Schindelin via GitGitGadget
  2019-04-12 23:04 ` [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die() Rohit Ashiwal via GitGitGadget
@ 2019-04-12 23:04 ` Rohit Ashiwal via GitGitGadget
  2019-04-13  1:51   ` Jeff King
  1 sibling, 1 reply; 13+ messages in thread
From: Rohit Ashiwal via GitGitGadget @ 2019-04-12 23:04 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Rohit Ashiwal

From: Rohit Ashiwal <rohit.ashiwal265@gmail.com>

As we already link to the zlib library, we can perform the compression
without even requiring gzip on the host machine.

Signed-off-by: Rohit Ashiwal <rohit.ashiwal265@gmail.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 archive-tar.c | 34 +++++++++++++++++++++++++---------
 1 file changed, 25 insertions(+), 9 deletions(-)

diff --git a/archive-tar.c b/archive-tar.c
index ba37dad27c..5979ed14b7 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -466,18 +466,34 @@ static int write_tar_filter_archive(const struct archiver *ar,
 	filter.use_shell = 1;
 	filter.in = -1;
 
-	if (start_command(&filter) < 0)
-		die_errno(_("unable to start '%s' filter"), argv[0]);
-	close(1);
-	if (dup2(filter.in, 1) < 0)
-		die_errno(_("unable to redirect descriptor"));
-	close(filter.in);
+	if (!strcmp("gzip -cn", ar->data)) {
+		char outmode[4] = "wb\0";
+
+		if (args->compression_level >= 0 && args->compression_level <= 9)
+			outmode[2] = '0' + args->compression_level;
+
+		gzip = gzdopen(fileno(stdout), outmode);
+		if (!gzip)
+			die(_("Could not gzdopen stdout"));
+	} else {
+		if (start_command(&filter) < 0)
+			die_errno(_("unable to start '%s' filter"), argv[0]);
+		close(1);
+		if (dup2(filter.in, 1) < 0)
+			die_errno(_("unable to redirect descriptor"));
+		close(filter.in);
+	}
 
 	r = write_tar_archive(ar, args);
 
-	close(1);
-	if (finish_command(&filter) != 0)
-		die(_("'%s' filter reported error"), argv[0]);
+	if (gzip) {
+		if (gzclose(gzip) != Z_OK)
+			die(_("gzclose failed"));
+	} else {
+		close(1);
+		if (finish_command(&filter) != 0)
+			die(_("'%s' filter reported error"), argv[0]);
+	}
 
 	strbuf_release(&cmd);
 	return r;
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die()
  2019-04-12 23:04 ` [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die() Rohit Ashiwal via GitGitGadget
@ 2019-04-13  1:34   ` Jeff King
  2019-04-13  5:51     ` Junio C Hamano
  2019-04-14  4:34     ` Rohit Ashiwal
  0 siblings, 2 replies; 13+ messages in thread
From: Jeff King @ 2019-04-13  1:34 UTC (permalink / raw)
  To: Rohit Ashiwal via GitGitGadget
  Cc: Johannes Schindelin, git, Junio C Hamano, Rohit Ashiwal

On Fri, Apr 12, 2019 at 04:04:39PM -0700, Rohit Ashiwal via GitGitGadget wrote:

> From: Rohit Ashiwal <rohit.ashiwal265@gmail.com>
> 
> MinGit for Windows comes without `gzip` bundled inside, git-archive uses
> `gzip -cn` to compress tar files but for this to work, gzip needs to be
> present on the host system.
> 
> In the next commit, we will change the gzip compression so that we no
> longer spawn `gzip` but let zlib perform the compression in the same
> process instead.
> 
> In preparation for this, we consolidate all the block writes into a
> single function.

Sounds like a good preparatory step. This part confused me, though:

> @@ -38,11 +40,21 @@ static int write_tar_filter_archive(const struct archiver *ar,
>  #define USTAR_MAX_MTIME 077777777777ULL
>  #endif
>  
> +/* writes out the whole block, or dies if fails */
> +static void write_block_or_die(const char *block) {
> +	if (gzip) {
> +		if (gzwrite(gzip, block, (unsigned) BLOCKSIZE) != BLOCKSIZE)
> +			die(_("gzwrite failed"));
> +	} else {
> +		write_or_die(1, block, BLOCKSIZE);
> +	}
> +}

What is gzwrite()? At first I thought this was an out-of-sequence bit of
the series, but it turns out that this is a zlib.h interface. So the
idea (I think) is that here we introduce a "gzip" variable that is
always false, and this first conditional arm is effectively dead code.
And then in a later patch we'd set up "gzip" and it would become
not-dead.

I think it would be less confusing if this just factored out
write_block_or_die(), which starts as a thin wrapper and then grows the
gzip parts in the next patch.

A few nits on the code itself:

> +static gzFile gzip;
> [...]
> +       if (gzip) {

Is it OK for us to ask about the truthiness of this opaque type? That
works if it's really a pointer behind the scenes, but it seems like it
would be equally OK for zlib to declare it as a struct.

It looks OK in my version of zlib, and that library tends to be fairly
conservative so I wouldn't be surprised if it was that way back to the
beginning and remains that way for eternity. But it feels like a bad
pattern.

> +		if (gzwrite(gzip, block, (unsigned) BLOCKSIZE) != BLOCKSIZE)

This cast is interesting. All of the matching write_or_die() calls are
promoting it to a size_t, which is also unsigned.

BLOCKSIZE is a constant. Should we be defining it with a "U" in the first place?

I doubt it matters much either way from a correctness perspective. I
just wonder when I see a cast like that if we're going to get unexpected
truncation or similar.

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] archive: avoid spawning `gzip`
  2019-04-12 23:04 ` [PATCH 2/2] archive: avoid spawning `gzip` Rohit Ashiwal via GitGitGadget
@ 2019-04-13  1:51   ` Jeff King
  2019-04-13 22:01     ` René Scharfe
  2019-04-13 22:16     ` brian m. carlson
  0 siblings, 2 replies; 13+ messages in thread
From: Jeff King @ 2019-04-13  1:51 UTC (permalink / raw)
  To: Rohit Ashiwal via GitGitGadget
  Cc: Johannes Schindelin, git, Junio C Hamano, Rohit Ashiwal

On Fri, Apr 12, 2019 at 04:04:40PM -0700, Rohit Ashiwal via GitGitGadget wrote:

> From: Rohit Ashiwal <rohit.ashiwal265@gmail.com>
> 
> As we already link to the zlib library, we can perform the compression
> without even requiring gzip on the host machine.

Very cool. It's nice to drop a dependency, and this should be a bit more
efficient, too.

> diff --git a/archive-tar.c b/archive-tar.c
> index ba37dad27c..5979ed14b7 100644
> --- a/archive-tar.c
> +++ b/archive-tar.c
> @@ -466,18 +466,34 @@ static int write_tar_filter_archive(const struct archiver *ar,
>  	filter.use_shell = 1;
>  	filter.in = -1;
>  
> -	if (start_command(&filter) < 0)
> -		die_errno(_("unable to start '%s' filter"), argv[0]);
> -	close(1);
> -	if (dup2(filter.in, 1) < 0)
> -		die_errno(_("unable to redirect descriptor"));
> -	close(filter.in);
> +	if (!strcmp("gzip -cn", ar->data)) {

I wondered how you were going to kick this in, since users can define
arbitrary filters. I think it's kind of neat to automagically convert
"gzip -cn" (which also happens to be the default). But I think we should
mention that in the Documentation, in case somebody tries to use a
custom version of gzip and wonders why it isn't kicking in.

Likewise, it might make sense in the tests to put a poison gzip in the
$PATH so that we can be sure we're using our internal code, and not just
calling out to gzip (on platforms that have it, of course).

The alternative is that we could use a special token like ":zlib" or
something to indicate that the internal implementation should be used
(and then tweak the baked-in default, too). That might be less
surprising for users, but most people would still get the benefit since
they'd be using the default config.

> +		char outmode[4] = "wb\0";

This looks sufficiently magical that it might merit a comment. I had to
look in the zlib header file to learn that this is just a normal
stdio-style mode. But we can't just do:

  gzip = gzdopen(fd, "wb");

because we want to (maybe) append a compression level. It's also
slightly confusing that it explicitly includes a NUL, but later:

> +		if (args->compression_level >= 0 && args->compression_level <= 9)
> +			outmode[2] = '0' + args->compression_level;

we may overwrite that and assume that outmode[3] is also a NUL. Which it
is, because of how C initialization works. But that means we also do not
need the "\0" in the initializer.

Dropping that may make it slightly less jarring (any time I see a
backslash escape in an initializer, I assume I'm in for some binary
trickery, but this turns out to be much more mundane).

I'd also consider just using a strbuf:

  struct strbuf outmode = STRBUF_INIT;

  strbuf_addstr(&outmode, "wb");
  if (args->compression_level >= 0 && args->compression_level <= 9)
	strbuf_addch(&outmode, '0' + args->compression_level);

That's overkill in a sense, but it saves us having to deal with
manually-counted offsets, and this code is only run once per program
invocation, so the efficiency shouldn't matter.

> +		gzip = gzdopen(fileno(stdout), outmode);
> +		if (!gzip)
> +			die(_("Could not gzdopen stdout"));

Is there a way to get a more specific error from zlib? I'm less
concerned about gzdopen here (which should never fail), and more about
the writing and closing steps. I don't see anything good for gzdopen(),
but...

> +	if (gzip) {
> +		if (gzclose(gzip) != Z_OK)
> +			die(_("gzclose failed"));

...according to zlib.h, here the returned int is meaningful. And if
Z_ERRNO, we should probably use die_errno() to give a better message.

> [...]

That was a lot of little nits, but the overall shape of the patch looks
good to me (and I think the goal is obviously good). Thanks for working
on it.

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die()
  2019-04-13  1:34   ` Jeff King
@ 2019-04-13  5:51     ` Junio C Hamano
  2019-04-14  4:36       ` Rohit Ashiwal
  2019-04-14  4:34     ` Rohit Ashiwal
  1 sibling, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2019-04-13  5:51 UTC (permalink / raw)
  To: Jeff King
  Cc: Rohit Ashiwal via GitGitGadget, Johannes Schindelin, git, Rohit Ashiwal

Jeff King <peff@peff.net> writes:

>> +/* writes out the whole block, or dies if fails */
>> +static void write_block_or_die(const char *block) {
>> +	if (gzip) {
>> +		if (gzwrite(gzip, block, (unsigned) BLOCKSIZE) != BLOCKSIZE)
>> +			die(_("gzwrite failed"));
>> +	} else {
>> +		write_or_die(1, block, BLOCKSIZE);
>> +	}
>> +}

I agree everything you said you your two review messages.

One thing you did not mention but I found disturbing was that this
does not take size argument but hardcodes BLOCKSIZE.  If the patch
were removing use of BLOCKSIZE in its callers (because everybody
uses the same constant), this would not have bothered me, but as the
caller passes BLOCKSIZE to all its callees except this one, I found
that the interface optimizes for a wrong thing (i.e. reducing
one-time pain of writing this single patch of having to repeat
BLOCKSIZE in all calls to this function).  This function should be
updated to take the size_t and have its caller(s) pass BLOCKSIZE.

Thanks for a review, and thanks Rohit for starting to get rid of
external dependency on gzip binary.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] archive: avoid spawning `gzip`
  2019-04-13  1:51   ` Jeff King
@ 2019-04-13 22:01     ` René Scharfe
  2019-04-15 21:35       ` Jeff King
  2019-04-13 22:16     ` brian m. carlson
  1 sibling, 1 reply; 13+ messages in thread
From: René Scharfe @ 2019-04-13 22:01 UTC (permalink / raw)
  To: Jeff King, Rohit Ashiwal via GitGitGadget
  Cc: Johannes Schindelin, git, Junio C Hamano, Rohit Ashiwal

Am 13.04.2019 um 03:51 schrieb Jeff King:
> On Fri, Apr 12, 2019 at 04:04:40PM -0700, Rohit Ashiwal via GitGitGadget wrote:
>
>> From: Rohit Ashiwal <rohit.ashiwal265@gmail.com>
>>
>> As we already link to the zlib library, we can perform the compression
>> without even requiring gzip on the host machine.
>
> Very cool. It's nice to drop a dependency, and this should be a bit more
> efficient, too.

Getting rid of dependencies is good, and using zlib is the obvious way to
generate .tgz files. Last time I tried something like that, a separate gzip
process was faster, though -- at least on Linux [1].  How does this one
fare?

Doing compression in its own thread may be a good idea.

René


[1] http://public-inbox.org/git/4AAAC8CE.8020302@lsrfire.ath.cx/

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] archive: avoid spawning `gzip`
  2019-04-13  1:51   ` Jeff King
  2019-04-13 22:01     ` René Scharfe
@ 2019-04-13 22:16     ` brian m. carlson
  2019-04-15 21:36       ` Jeff King
  1 sibling, 1 reply; 13+ messages in thread
From: brian m. carlson @ 2019-04-13 22:16 UTC (permalink / raw)
  To: Jeff King
  Cc: Rohit Ashiwal via GitGitGadget, Johannes Schindelin, git,
	Junio C Hamano, Rohit Ashiwal

[-- Attachment #1: Type: text/plain, Size: 1539 bytes --]

On Fri, Apr 12, 2019 at 09:51:02PM -0400, Jeff King wrote:
> I wondered how you were going to kick this in, since users can define
> arbitrary filters. I think it's kind of neat to automagically convert
> "gzip -cn" (which also happens to be the default). But I think we should
> mention that in the Documentation, in case somebody tries to use a
> custom version of gzip and wonders why it isn't kicking in.
> 
> Likewise, it might make sense in the tests to put a poison gzip in the
> $PATH so that we can be sure we're using our internal code, and not just
> calling out to gzip (on platforms that have it, of course).
> 
> The alternative is that we could use a special token like ":zlib" or
> something to indicate that the internal implementation should be used
> (and then tweak the baked-in default, too). That might be less
> surprising for users, but most people would still get the benefit since
> they'd be using the default config.

I agree that a special value (or NULL, if that's possible) would be
nicer here. That way, if someone does specify a custom gzip, we honor
it, and it serves to document the code better. For example, if someone
symlinked pigz to gzip and used "gzip -cn", then they might not get the
parallelization benefits they expected.

I'm fine overall with the idea of bringing the compression into the
binary using zlib, provided that we preserve the "-n" behavior
(producing reproducible archives).
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 868 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die()
  2019-04-13  1:34   ` Jeff King
  2019-04-13  5:51     ` Junio C Hamano
@ 2019-04-14  4:34     ` Rohit Ashiwal
  2019-04-14 10:33       ` Junio C Hamano
  1 sibling, 1 reply; 13+ messages in thread
From: Rohit Ashiwal @ 2019-04-14  4:34 UTC (permalink / raw)
  To: peff; +Cc: git, gitgitgadget, gitster, johannes.schindelin, rohit.ashiwal265

Hey Peff!

On 2019-04-13  1:34 UTC Jeff King <peff@peff.net> wrote:

> What is gzwrite()?
> [...]
> I think it would be less confusing if this just factored out
> write_block_or_die(), which starts as a thin wrapper and then grows the
> gzip parts in the next patch.

You are right, it might appear to someone as a bit confusing, but I feel
like, this is the right commit to put it.

> Is it OK for us to ask about the truthiness of this opaque type? That
> works if it's really a pointer behind the scenes, but it seems like it
> would be equally OK for zlib to declare it as a struct.

It would be perfectly sane on zlib's part to make gzFile a struct, and if
so happens, I'll be there to refactor the code.

Regards
Rohit


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die()
  2019-04-13  5:51     ` Junio C Hamano
@ 2019-04-14  4:36       ` Rohit Ashiwal
  0 siblings, 0 replies; 13+ messages in thread
From: Rohit Ashiwal @ 2019-04-14  4:36 UTC (permalink / raw)
  To: gitster; +Cc: git, gitgitgadget, johannes.schindelin, peff, rohit.ashiwal265

Hey jch!

I'll change the signature of the function in next revision.

Thanks
Rohit


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die()
  2019-04-14  4:34     ` Rohit Ashiwal
@ 2019-04-14 10:33       ` Junio C Hamano
  0 siblings, 0 replies; 13+ messages in thread
From: Junio C Hamano @ 2019-04-14 10:33 UTC (permalink / raw)
  To: Rohit Ashiwal; +Cc: peff, git, gitgitgadget, johannes.schindelin

Rohit Ashiwal <rohit.ashiwal265@gmail.com> writes:

> On 2019-04-13  1:34 UTC Jeff King <peff@peff.net> wrote:
>
>> What is gzwrite()?
>> [...]
>> I think it would be less confusing if this just factored out
>> write_block_or_die(), which starts as a thin wrapper and then grows the
>> gzip parts in the next patch.
>
> You are right, it might appear to someone as a bit confusing, but I feel
> like, this is the right commit to put it.

Often, the original author is the worst judge about the patch series
organization, because s/he has been staring at his or her own
patches too long and knows too much about them.  Unless the author
is very experienced and is good at pretending to be the first-time
reader when proofreading his or her own patch, that is.

FWIW, I tend to agree with Peff that the organization would become
much easier to follow with "first refactor without new feature, and
in gzip related step add gzip thing".

>> Is it OK for us to ask about the truthiness of this opaque type? That
>> works if it's really a pointer behind the scenes, but it seems like it
>> would be equally OK for zlib to declare it as a struct.

Or a small integer indexing into an internal array the library keeps
track of ;-) At that point, truthiness would be completely gone, and
the compiler would not help catching "if (opaque)" as a syntax error
(if the library implements the opaque thing as a structure, then we
will be saved).

> It would be perfectly sane on zlib's part to make gzFile a struct, and if
> so happens, I'll be there to refactor the code.

We do not trust any single developer enough with "I'll do so when
needed"---in practice, it will often be done by somebody else, and
more importantly, we would want anybody to be able to take things
over, instead of relying on any one "indispensable contributor".

If it is reasonable to expect that things can easily be broken by an
external factor, I'd prefer to see us being defensive from day one.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] archive: avoid spawning `gzip`
  2019-04-13 22:01     ` René Scharfe
@ 2019-04-15 21:35       ` Jeff King
  0 siblings, 0 replies; 13+ messages in thread
From: Jeff King @ 2019-04-15 21:35 UTC (permalink / raw)
  To: René Scharfe
  Cc: Rohit Ashiwal via GitGitGadget, Johannes Schindelin, git,
	Junio C Hamano, Rohit Ashiwal

On Sun, Apr 14, 2019 at 12:01:10AM +0200, René Scharfe wrote:

> >> As we already link to the zlib library, we can perform the compression
> >> without even requiring gzip on the host machine.
> >
> > Very cool. It's nice to drop a dependency, and this should be a bit more
> > efficient, too.
> 
> Getting rid of dependencies is good, and using zlib is the obvious way to
> generate .tgz files. Last time I tried something like that, a separate gzip
> process was faster, though -- at least on Linux [1].  How does this one
> fare?

I'd expect a separate gzip to be faster in wall-clock time for a
multi-core machine, but overall consume more CPU. I'm slightly surprised
that your timings show that it actually wins on total CPU, too.

Here are best-of-five times for "git archive --format=tar.gz HEAD" on
linux.git (the machine is a quad-core):

  [before, separate gzip]
  real	0m21.501s
  user	0m26.148s
  sys	0m0.619s

  [after, internal gzwrite]
  real	0m25.156s
  user	0m25.059s
  sys	0m0.096s

which does show what I expect (longer overall, but less total CPU).

Which one you prefer depends on your situation, of course. A user on a
workstation with multiple cores probably cares most about end-to-end
latency and using all of their available horsepower. A server hosting
repositories and receiving many unrelated requests probably cares more
about total CPU (though the differences there are small enough that it
may not even be worth having a config knob to un-parallelize it).

> Doing compression in its own thread may be a good idea.

Yeah. It might even make the patch simpler, since I'd expect it to be
implemented with start_async() and a descriptor, making it look just
like a gzip pipe to the caller. :)

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] archive: avoid spawning `gzip`
  2019-04-13 22:16     ` brian m. carlson
@ 2019-04-15 21:36       ` Jeff King
  0 siblings, 0 replies; 13+ messages in thread
From: Jeff King @ 2019-04-15 21:36 UTC (permalink / raw)
  To: brian m. carlson, Rohit Ashiwal via GitGitGadget,
	Johannes Schindelin, git, Junio C Hamano, Rohit Ashiwal

On Sat, Apr 13, 2019 at 10:16:46PM +0000, brian m. carlson wrote:

> > The alternative is that we could use a special token like ":zlib" or
> > something to indicate that the internal implementation should be used
> > (and then tweak the baked-in default, too). That might be less
> > surprising for users, but most people would still get the benefit since
> > they'd be using the default config.
> 
> I agree that a special value (or NULL, if that's possible) would be
> nicer here. That way, if someone does specify a custom gzip, we honor
> it, and it serves to document the code better. For example, if someone
> symlinked pigz to gzip and used "gzip -cn", then they might not get the
> parallelization benefits they expected.

Thanks for spelling that out. I had a vague feeling somebody might be
surprised, but I didn't know if people actually did stuff like
symlinking pigz to gzip (though it makes perfect sense to do so).

> I'm fine overall with the idea of bringing the compression into the
> binary using zlib, provided that we preserve the "-n" behavior
> (producing reproducible archives).

I just assumed that gzwrite() would have the "-n" behavior, but it's
definitely worth double-checking.

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, back to index

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-12 23:04 [PATCH 0/2] Avoid spawning gzip in git archive Johannes Schindelin via GitGitGadget
2019-04-12 23:04 ` [PATCH 1/2] archive: replace write_or_die() calls with write_block_or_die() Rohit Ashiwal via GitGitGadget
2019-04-13  1:34   ` Jeff King
2019-04-13  5:51     ` Junio C Hamano
2019-04-14  4:36       ` Rohit Ashiwal
2019-04-14  4:34     ` Rohit Ashiwal
2019-04-14 10:33       ` Junio C Hamano
2019-04-12 23:04 ` [PATCH 2/2] archive: avoid spawning `gzip` Rohit Ashiwal via GitGitGadget
2019-04-13  1:51   ` Jeff King
2019-04-13 22:01     ` René Scharfe
2019-04-15 21:35       ` Jeff King
2019-04-13 22:16     ` brian m. carlson
2019-04-15 21:36       ` Jeff King

git@vger.kernel.org list mirror (unofficial, one of many)

Archives are clonable:
	git clone --mirror https://public-inbox.org/git
	git clone --mirror http://ou63pmih66umazou.onion/git
	git clone --mirror http://czquwvybam4bgbro.onion/git
	git clone --mirror http://hjrcffqmbrq6wope.onion/git

Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.version-control.git
	nntp://ou63pmih66umazou.onion/inbox.comp.version-control.git
	nntp://czquwvybam4bgbro.onion/inbox.comp.version-control.git
	nntp://hjrcffqmbrq6wope.onion/inbox.comp.version-control.git
	nntp://news.gmane.org/gmane.comp.version-control.git

 note: .onion URLs require Tor: https://www.torproject.org/
       or Tor2web: https://www.tor2web.org/

AGPL code for this site: git clone https://public-inbox.org/ public-inbox