git@vger.kernel.org mailing list mirror (one of many)
From: "brian m. carlson" <sandals@crustytoothpaste.net>
To: Jeff King <peff@peff.net>
Cc: kpcyrd <kpcyrd@archlinux.org>,
	rb-general@lists.reproducible-builds.org,
	arch-dev-public@lists.archlinux.org, git@vger.kernel.org,
	gitster@pobox.com, l.s.r@web.de
Subject: Re: git 2.38.0: Change in `git archive` output
Date: Mon, 17 Oct 2022 00:51:25 +0000
Message-ID: <Y0ynDbG8CxwAt4Fj@tapette.crustytoothpaste.net>
In-Reply-To: <Y0ybi66K40+uH+im@coredump.intra.peff.net>

On 2022-10-17 at 00:02:19, Jeff King wrote:
> Interesting. For a small input, they seem to produce the same file for
> me:
> 
>   git init repo
>   cd repo
>   seq 1000 >file
>   git add file
>   git commit -m foo
> 
>   git -c tar.tar.gz.command='git archive gzip' \
>     archive --format=tar.gz HEAD >internal.tar.gz
>   git -c tar.tar.gz.command='gzip -cn' \
>     archive --format=tar.gz HEAD >external.tar.gz
>   cmp internal.tar.gz external.tar.gz && echo ok
> 
> but if I instead do "seq 10000", then the files differ. I didn't dig
> into the actual binary to see the source of the change. It might be
> something we can tweak (e.g., if it's how a header is represented, or if
> we can change the zlib parameters to find the same compressions).

I will say that trying to make two compression implementations produce
identical output is likely futile, because there are almost always
multiple equally valid ways to encode the same data.  Most
implementations are going to prefer improving size over consistency, so
there's little incentive to copy the same algorithm across
implementations.  I believe even GNU gzip has changed its output in the
past as better optimizations were implemented.
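
As a rough illustration of that point, the sketch below compresses the
same input twice and checks that the compressed bytes differ while both
streams still decode to the original data.  It is only an assumption-laden
stand-in: it uses one gzip binary at two compression levels (-1 and -9)
rather than two separate implementations, and the temporary file names
are made up for the example.

```shell
# Compress the same input at two levels; -n keeps the file name and
# timestamp out of the gzip header so they cannot cause a difference.
tmp=$(mktemp -d)
seq 10000 >"$tmp/data"
gzip -1 -cn "$tmp/data" >"$tmp/fast.gz"
gzip -9 -cn "$tmp/data" >"$tmp/best.gz"

# Compare the compressed bytes of the two encodings.
if cmp -s "$tmp/fast.gz" "$tmp/best.gz"; then
  compressed=same
else
  compressed=differ
fi

# Check that both encodings decompress to the original input.
if gzip -dc "$tmp/fast.gz" | cmp -s - "$tmp/data" &&
   gzip -dc "$tmp/best.gz" | cmp -s - "$tmp/data"; then
  roundtrip=ok
else
  roundtrip=broken
fi

echo "compressed bytes: $compressed, decompressed data: $roundtrip"
rm -rf "$tmp"
```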

I mean, don't let me stop you from trying to tweak things to see if you
can make it work, but in general I think it's likely that some
divergence is going to occur between implementations no matter what.

> I don't think we make promises about stable output from "git archive".
> We've fixed bugs in the tar-generating side before that led to changes.
> But if we can easily make them the same, that might be worth doing.

Since this is on the reproducible builds list, I would be interested in
working with tar implementations to specify a profile of the pax format
that _is_ standardized, stable, and consistent and that Git and other
implementations could use to produce bit-for-bit identical tar archives
across versions, since this is a thing lots of people seem to want.  (If
that's of interest, please contact me off list.)

However, I don't think that trying to do that with compression formats
is likely to be productive, so users who care about reproducibility
would need to compare the uncompressed output.
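
One possible way to do that comparison is sketched below.  The helper
name same_payload is hypothetical (it is not part of Git or gzip); it
decompresses two gzip files into a temporary directory and compares the
resulting bytes, ignoring how each file was compressed.

```shell
# Hypothetical helper: succeed (exit 0) when two gzip files decompress
# to identical bytes, exit 1 when the payloads differ, and exit 2 when
# either file cannot be decompressed.  The body runs in a subshell so
# its variables do not leak into the caller.
same_payload() (
  tmpdir=$(mktemp -d) || exit 2
  if gzip -dc "$1" >"$tmpdir/a" && gzip -dc "$2" >"$tmpdir/b"; then
    cmp -s "$tmpdir/a" "$tmpdir/b"
    status=$?
  else
    status=2
  fi
  rm -rf "$tmpdir"
  exit "$status"
)
```

With the files from the quoted example, this would be called as
"same_payload internal.tar.gz external.tar.gz && echo ok".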
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA


