git@vger.kernel.org mailing list mirror (one of many)
* git 2.38.0: Change in `git archive` output
@ 2022-10-16 21:57 kpcyrd
  2022-10-16 23:21 ` brian m. carlson
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: kpcyrd @ 2022-10-16 21:57 UTC (permalink / raw)
  To: rb-general, arch-dev-public, git; +Cc: gitster, l.s.r

hello,

multiple people in Arch Linux noticed the output of our `git archive` 
command doesn't match the tarball served by github anymore.

First I suspected an update in our gzip package until I found this line 
in the git 2.38.0 release notes:

 > * Teach "git archive" to (optionally and then by default) avoid
 >   spawning an external "gzip" process when creating ".tar.gz" (and
 >   ".tgz") archives.

I've then found this commit that could be considered a breaking change 
in `git archive`:

https://github.com/git/git/commit/4f4be00d302bc52d0d9d5a3d4738bb525066c710

I don't know if there's some kind of gzip standard that could be used to 
align the git internal gzip implementation with gnu gzip.

I'm not saying this is necessarily a bug or regression but it makes it 
harder to reproduce github tarballs from a git repository. Just sharing
what I've debugged. :)

cheers,
kpcyrd


* Re: git 2.38.0: Change in `git archive` output
  2022-10-16 21:57 git 2.38.0: Change in `git archive` output kpcyrd
@ 2022-10-16 23:21 ` brian m. carlson
  2022-10-17  0:02 ` Jeff King
  2022-10-17 19:14 ` Konstantin Ryabitsev
  2 siblings, 0 replies; 6+ messages in thread
From: brian m. carlson @ 2022-10-16 23:21 UTC (permalink / raw)
  To: kpcyrd; +Cc: rb-general, arch-dev-public, git, gitster, l.s.r


On 2022-10-16 at 21:57:40, kpcyrd wrote:
> hello,

Hey,

> multiple people in Arch Linux noticed the output of our `git archive`
> command doesn't match the tarball served by github anymore.
> 
> First I suspected an update in our gzip package until I found this line in
> the git 2.38.0 release notes:
> 
> > * Teach "git archive" to (optionally and then by default) avoid
> >   spawning an external "gzip" process when creating ".tar.gz" (and
> >   ".tgz") archives.
> 
> I've then found this commit that could be considered a breaking change in
> `git archive`:
> 
> https://github.com/git/git/commit/4f4be00d302bc52d0d9d5a3d4738bb525066c710
> 
> I don't know if there's some kind of gzip standard that could be used to
> align the git internal gzip implementation with gnu gzip.
> 
> I'm not saying this is necessarily a bug or regression but it makes it
> harder to reproduce github tarballs from a git repository. Just sharing
> what I've debugged. :)

This isn't a bug, because Git doesn't guarantee that archives produced
by different versions will be bit-for-bit identical.  It does guarantee
that an archive of the same commit or tree using the same version of Git
and associated tools running with the same configuration and environment
will be consistent (that is, a given version of Git produces
deterministic archives).
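
A minimal sketch of that guarantee, assuming a throwaway repository created
just for the check (the directory, file name, and identity values are
placeholders, not anything from this thread). It archives the same commit
twice as an uncompressed tar and compares the bytes:

```shell
#!/bin/sh
# Sketch: one Git version archiving the same commit twice should
# produce byte-identical output.
set -eu
work=$(mktemp -d)               # throwaway directory (placeholder)
cd "$work"
git init -q repo
cd repo
seq 1000 >file
git add file
# Placeholder identity so the commit works without any global config.
git -c user.name=example -c user.email=example@example.invalid \
    -c commit.gpgsign=false commit -q -m foo

# Archiving a commit (rather than a bare tree) stamps entries with the
# commit time, so the result does not depend on the wall clock.
git archive --format=tar HEAD >first.tar
git archive --format=tar HEAD >second.tar

if cmp -s first.tar second.tar; then
    archive_status=identical
else
    archive_status=different
fi
echo "archives are $archive_status"
```

This only exercises a single Git version on a single machine; it says nothing
about output staying stable across versions.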

I will also point out that GitHub doesn't guarantee bit-for-bit
identical archives.  It does currently use git archive under the hood,
but that could change at any moment without notice.

Zip files also contain two sets of timestamps: local and UTC, and
therefore there's an additional element in which archives can differ
depending on the time zone.  In addition, using the export-subst
functionality can result in short object IDs of different lengths
depending on the number of objects in the repository, so archives can
differ for that reason as well.

So it's not the case that you can expect identical archives from Git and
GitHub.  If you need to compute a hash over an archive, you need to
store the archive somewhere (on GitHub, that would be as a release
asset).
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA



* Re: git 2.38.0: Change in `git archive` output
  2022-10-16 21:57 git 2.38.0: Change in `git archive` output kpcyrd
  2022-10-16 23:21 ` brian m. carlson
@ 2022-10-17  0:02 ` Jeff King
  2022-10-17  0:51   ` brian m. carlson
  2022-10-17 19:14 ` Konstantin Ryabitsev
  2 siblings, 1 reply; 6+ messages in thread
From: Jeff King @ 2022-10-17  0:02 UTC (permalink / raw)
  To: kpcyrd; +Cc: rb-general, arch-dev-public, git, gitster, l.s.r

On Sun, Oct 16, 2022 at 11:57:40PM +0200, kpcyrd wrote:

> multiple people in Arch Linux noticed the output of our `git archive`
> command doesn't match the tarball served by github anymore.
> 
> First I suspected an update in our gzip package until I found this line in
> the git 2.38.0 release notes:
> 
> > * Teach "git archive" to (optionally and then by default) avoid
> >   spawning an external "gzip" process when creating ".tar.gz" (and
> >   ".tgz") archives.
> 
> I've then found this commit that could be considered a breaking change in
> `git archive`:
> 
> https://github.com/git/git/commit/4f4be00d302bc52d0d9d5a3d4738bb525066c710
> 
> I don't know if there's some kind of gzip standard that could be used to
> align the git internal gzip implementation with gnu gzip.

Interesting. For a small input, they seem to produce the same file for
me:

  git init repo
  cd repo
  seq 1000 >file
  git add file
  git commit -m foo
 
  git -c tar.tar.gz.command='git archive gzip' \
    archive --format=tar.gz HEAD >internal.tar.gz
  git -c tar.tar.gz.command='gzip -cn' \
    archive --format=tar.gz HEAD >external.tar.gz
  cmp internal.tar.gz external.tar.gz && echo ok

but if I instead do "seq 10000", then the files differ. I didn't dig
into the actual binary to see the source of the change. It might be
something we can tweak (e.g., if it's how a header is represented, or if
we can change the zlib parameters to find the same compressions).

> I'm not saying this is necessarily a bug or regression but it makes it
> harder to reproduce github tarballs from a git repository. Just sharing
> what I've debugged. :)

I don't think we make promises about stable output from "git archive".
We've fixed bugs in the tar-generating side before that led to changes.
But if we can easily make them the same, that might be worth doing.

In the meantime, you can use the config option I showed above to get the
old, external behavior. At some point GitHub will probably update their
version, though, at which point you'd want the internal one (they may also
try to retain the old one, though; lots of distro/packaging projects get
broken when GitHub's archives aren't byte-for-byte identical).
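
To avoid passing -c on every invocation, the same setting can be stored in
the repository config. A sketch, assuming an external gzip is on PATH and
using a throwaway repository (the paths and identity values are
placeholders); it sets the command for both the tar.gz and tgz formats:

```shell
#!/bin/sh
# Sketch: persist the external-gzip filter instead of using "git -c".
set -eu
work=$(mktemp -d)               # throwaway directory (placeholder)
cd "$work"
git init -q repo
cd repo
seq 10000 >file
git add file
# Placeholder identity so the commit works without any global config.
git -c user.name=example -c user.email=example@example.invalid \
    -c commit.gpgsign=false commit -q -m foo

# Store the old, external behavior for both automatic formats.
git config tar.tar.gz.command 'gzip -cn'
git config tar.tgz.command 'gzip -cn'

git archive --format=tar.gz HEAD >external.tar.gz

# Sanity checks: the setting is stored and the result is valid gzip.
configured=$(git config tar.tar.gz.command)
if gzip -t external.tar.gz; then
    gzip_ok=yes
else
    gzip_ok=no
fi
echo "filter: $configured, valid gzip: $gzip_ok"
```

Using "git config --global" instead would apply the setting to every
repository for the current user. Setting the value back to
'git archive gzip' selects the internal implementation again.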

-Peff


* Re: git 2.38.0: Change in `git archive` output
  2022-10-17  0:02 ` Jeff King
@ 2022-10-17  0:51   ` brian m. carlson
  2022-10-17 17:03     ` Jeff King
  0 siblings, 1 reply; 6+ messages in thread
From: brian m. carlson @ 2022-10-17  0:51 UTC (permalink / raw)
  To: Jeff King; +Cc: kpcyrd, rb-general, arch-dev-public, git, gitster, l.s.r


On 2022-10-17 at 00:02:19, Jeff King wrote:
> Interesting. For a small input, they seem to produce the same file for
> me:
> 
>   git init repo
>   cd repo
>   seq 1000 >file
>   git add file
>   git commit -m foo
> 
>   git -c tar.tar.gz.command='git archive gzip' \
>     archive --format=tar.gz HEAD >internal.tar.gz
>   git -c tar.tar.gz.command='gzip -cn' \
>     archive --format=tar.gz HEAD >external.tar.gz
>   cmp internal.tar.gz external.tar.gz && echo ok
> 
> but if I instead do "seq 10000", then the files differ. I didn't dig
> into the actual binary to see the source of the change. It might be
> something we can tweak (e.g., if it's how a header is represented, or if
> we can change the zlib parameters to find the same compressions).

I will say that trying to make two compression implementations produce
identical output is likely futile because it's almost always the case
that there are multiple equally valid ways to encode the same data.  Most
implementations are going to prefer improving size over consistency, so
there's little incentive to copy the same algorithm across
implementations. I believe even GNU gzip has changed its output in the
past as better optimizations were implemented.

I mean, don't let me stop you from trying to tweak things to see if you
can make it work, but in general I think it's likely that some
divergence is going to occur between implementations no matter what.

> I don't think we make promises about stable output from "git archive".
> We've fixed bugs in the tar-generating side before that led to changes.
> But if we can easily make them the same, that might be worth doing.

Since this is on the reproducible builds list, I would be interested in
working with tar implementations to specify a profile of the pax format
that _is_ standardized, stable, and consistent and that Git and other
implementations could use to produce bit-for-bit identical tar archives
across versions, since this is a thing lots of people seem to want.  (If
that's of interest, please contact me off list.)

However, I don't think that trying to do that with compression formats
is likely to be productive, so users who care about reproducibility
would need to compare the uncompressed output.
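
A rough sketch of that comparison, assuming a gzip that honors compression
levels is on PATH. The file named payload is a hypothetical stand-in for an
uncompressed tar stream; it is compressed at two levels, and the check is
done on the decompressed bytes rather than on the .gz files:

```shell
#!/bin/sh
# Sketch: compare archives by their uncompressed content, since the
# compressed bytes depend on the compressor and its settings.
set -eu
work=$(mktemp -d)               # throwaway directory (placeholder)
cd "$work"
seq 10000 >payload              # stand-in for an uncompressed tar stream

# Same input, two compression levels; -n omits name and timestamp.
gzip -cn -1 payload >fast.gz
gzip -cn -9 payload >best.gz

# The compressed files are expected to differ, but that is not guaranteed.
if cmp -s fast.gz best.gz; then
    compressed=identical
else
    compressed=different
fi

# The decompressed streams are what a reproducibility check should compare.
gzip -dc fast.gz >fast.out
gzip -dc best.gz >best.out
if cmp -s fast.out best.out; then
    payload_status=identical
else
    payload_status=different
fi
echo "compressed: $compressed, uncompressed: $payload_status"
```

For real archives the same idea applies to checksums: hash the output of
"gzip -dc archive.tar.gz" instead of the .tar.gz file itself.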
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA



* Re: git 2.38.0: Change in `git archive` output
  2022-10-17  0:51   ` brian m. carlson
@ 2022-10-17 17:03     ` Jeff King
  0 siblings, 0 replies; 6+ messages in thread
From: Jeff King @ 2022-10-17 17:03 UTC (permalink / raw)
  To: brian m. carlson; +Cc: kpcyrd, rb-general, arch-dev-public, git, gitster, l.s.r

On Mon, Oct 17, 2022 at 12:51:25AM +0000, brian m. carlson wrote:

> > but if I instead do "seq 10000", then the files differ. I didn't dig
> > into the actual binary to see the source of the change. It might be
> > something we can tweak (e.g., if it's how a header is represented, or if
> > we can change the zlib parameters to find the same compressions).
> 
> I will say that trying to make two compression implementations produce
> identical output is likely futile because it's almost always the case
> that there are multiple equally valid ways to encode the same data.  Most
> implementations are going to prefer improving size over consistency, so
> there's little incentive to copy the same algorithm across
> implementations. I believe even GNU gzip has changed its output in the
> past as better optimizations were implemented.
> 
> I mean, don't let me stop you from trying to tweak things to see if you
> can make it work, but in general I think it's likely that some
> divergence is going to occur between implementations no matter what.

Yeah, I definitely don't think it's something we ought to be promising,
or to put a lot of work into. But if there's low-hanging fruit to reduce
immediate pain in practice, it seems worth considering.

-Peff


* Re: git 2.38.0: Change in `git archive` output
  2022-10-16 21:57 git 2.38.0: Change in `git archive` output kpcyrd
  2022-10-16 23:21 ` brian m. carlson
  2022-10-17  0:02 ` Jeff King
@ 2022-10-17 19:14 ` Konstantin Ryabitsev
  2 siblings, 0 replies; 6+ messages in thread
From: Konstantin Ryabitsev @ 2022-10-17 19:14 UTC (permalink / raw)
  To: kpcyrd; +Cc: rb-general, arch-dev-public, git, gitster, l.s.r

On Sun, Oct 16, 2022 at 11:57:40PM +0200, kpcyrd wrote:
> I don't know if there's some kind of gzip standard that could be used to
> align the git internal gzip implementation with gnu gzip.
> 
> I'm not saying this is necessarily a bug or regression but it makes it
> harder to reproduce github tarballs from a git repository. Just sharing
> what I've debugged. :)

I've previously complained about the output of .tar format changing, but I
think it's too much to expect the *compressed* output to remain the same. In
environments where CPU time is more expensive than bandwidth, it's entirely
normal to expect the administrator to arbitrarily adjust compression levels
without any warning, or, on the contrary, pre-generate and cache
zopfli-compressed archives.

-K

