git@vger.kernel.org mailing list mirror (one of many)
From: "brian m. carlson" <sandals@crustytoothpaste.net>
To: Eli Schwartz <eschwartz93@gmail.com>
Cc: Git List <git@vger.kernel.org>
Subject: Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution
Date: Tue, 31 Jan 2023 09:54:58 +0000
Message-ID: <Y9jlWYLzZ/yy4NqD@tapette.crustytoothpaste.net>
In-Reply-To: <a812a664-67ea-c0ba-599f-cb79e2d96694@gmail.com>

On 2023-01-31 at 00:06:44, Eli Schwartz wrote:
> Nevertheless, I've seen the sentiment a few times that git doesn't like
> committing to output stability of git-archive, because it isn't
> officially documented (but it's not entirely clear what the benefits of
> changing are). And yet, git endeavors to do so, in order to prevent
> unnecessary breakage of people who embody Hyrum's Law and need that
> stability.

I'm one of the GitHub employees who chimed in there, and I'm also a Git
contributor in my own time (and I am speaking here only in my personal
capacity, since this is a personal address).  I made a change some years
back to the archive format to fix the permissions on pax headers when
extracted as files, and kernel.org was relying on that and broke.  Linus
yelled at me because of that.

Since then, I've been very opposed to us implying output format
consistency without explicitly guaranteeing it.  I had sent some patches
before that documented this explicitly, but I don't think they ever got
picked up.  I very much don't want people to come to rely on our
behaviour unless we explicitly guarantee it.

> What does everyone think about offering versioned git-archive outputs?
> This could be user-selectable as an option to `git archive`, but the
> main goal would be to select a good versioned output format depending on
> what is being archived. So:
> 
> - first things first, un-default the internal compressor again
> - implement a v2 archive format, where the internal compressor is the
>   default -- no other changes
> - teach git to select an archive format based on the date of the object
>   being archived
>   - when given a commit/tag ID to archive, check which support frame the
>     committer date falls inside
>   - for tree IDs, always use the latest format (it always uses the
>     current date anyway)
> - schedule a date, for the sake of argument, 6 months after the next
>   scheduled release date of git version X.Y in which this change goes
>   live; bake this into the git sources as a transition date, all commits
>   or tags generated after this date fall into the next format support
>   frame

I am actually very much in favour of providing a standard, deterministic
version of pax (the extended tar format) that we use and documenting it
as a standard so that other archive tools can use that.  That is, we
document some canonical tar format that is bit-for-bit identical that we
(and hopefully GNU tar and libarchive) will agree should be used to
serialize files for software interchange.  I don't think this should be
dependent on the date at all, but I do believe it should be versioned
and tested, and the version number embedded as a pax header.  I think
this would be valuable for simply having reproducible archives in
general, including for things like Docker containers, Debian packages,
Rust crates, and more, and I'm happy to work with others on such a
format, as I've said in the past on the list.  People can opt-in to
whatever format they want when creating an archive and continue to use
that forever if they like.
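
To make the idea above concrete, here is a minimal sketch in Python of a
deterministic pax archive that carries a format version in a pax global
header.  It is only an illustration under stated assumptions: the
keyword EXAMPLE.archive-format-version and the helper names are
hypothetical, and nothing here is an agreed format or anything Git
implements.  Determinism comes from sorting the member names and fixing
every metadata field.

```python
import io
import tarfile

# Hypothetical vendor keyword; the thread does not name one.
VERSION_KEY = "EXAMPLE.archive-format-version"


def build_archive(files, version="1"):
    """Serialize {name: bytes} into a pax tar with fixed metadata."""
    buf = io.BytesIO()
    with tarfile.open(
        fileobj=buf,
        mode="w",
        format=tarfile.PAX_FORMAT,
        pax_headers={VERSION_KEY: version},  # written as a global header
    ) as tf:
        for name in sorted(files):  # fixed member order
            data = files[name]
            info = tarfile.TarInfo(name)
            info.size = len(data)
            info.mtime = 0  # no timestamps leak into the output
            info.mode = 0o644
            info.uid = info.gid = 0
            info.uname = info.gname = ""
            tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_version(archive):
    """Return the version recorded in the pax global header, if any."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:") as tf:
        tf.getmembers()  # make sure all headers have been parsed
        return tf.pax_headers.get(VERSION_KEY)
```

Two calls with the same files produce identical bytes regardless of the
order in which the files were supplied, and a reader can check the
version before deciding how strictly to compare.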

Part of the reason I think this is valuable is that once SHA-1 and
SHA-256 interoperability is present, git archive will change the
contents of the archive: it will embed a SHA-256 hash into the file
instead of a SHA-1 hash, because that's what's in the repository.
Thus, we can't produce an archive that's deterministic in the face of
SHA-1/SHA-256 interoperability, and we need to create a new format that
doesn't have that data embedded in it.
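
A small Python sketch of why an embedded object ID defeats bit-for-bit
comparison.  It assumes the ID is stored as a "comment" record in a pax
global header, which is where git archive puts the commit ID for tar
output; the IDs below are placeholders, not real hashes, and the helper
names are made up for this example.

```python
import io
import tarfile


def archive_with_id(object_id, payload=b"hello\n"):
    """Build a one-file pax archive that records object_id as a comment."""
    buf = io.BytesIO()
    with tarfile.open(
        fileobj=buf,
        mode="w",
        format=tarfile.PAX_FORMAT,
        pax_headers={"comment": object_id},
    ) as tf:
        info = tarfile.TarInfo("README")
        info.size = len(payload)
        tf.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def embedded_id(archive):
    """Read the comment record back out of the global header."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:") as tf:
        tf.getmembers()  # make sure all headers have been parsed
        return tf.pax_headers.get("comment")


# Placeholder IDs with SHA-1 and SHA-256 lengths (40 and 64 hex digits).
sha1_style = archive_with_id("a" * 40)
sha256_style = archive_with_id("b" * 64)
```

The two archives hold the same file content, yet their bytes differ
because the header records a different ID.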

Having said that, I don't think this should be based on the timestamp of
the file, since that means that two otherwise identical archives
differing in timestamp aren't ever going to be the same, and we do see
people who import or vendor other projects.  Nor do I think we should
attempt to provide consistent compression, since I believe the output of
things like zlib has changed in the past, and we can't continually carry
an old, potentially insecure version of zlib just because the output
changed.  People should be able to implement compression using gzip,
zlib, pigz, miniz_oxide, or whatever if they want, since people
implement Git in many different languages, and we won't want to force
people using memory-safe languages like Go and Rust to explicitly use
zlib for archives.

That may mean that it's important for people to actually decompress the
archive before checking hashes if they want deterministic behaviour, and
I'm okay with that.  You already have to do that if you're verifying the
signature on Git tarballs, since only the uncompressed tar archive is
signed, so I don't think this is out of the question.
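
As an illustration of checking the hash after decompression, here is a
minimal Python sketch.  The byte string is a stand-in for a real tar
stream, and the two gzip outputs use different compression levels and
header timestamps to mimic two different compressors; none of this is
taken from Git itself.

```python
import gzip
import hashlib


def sha256_of_decompressed(gz_bytes):
    """Hash the uncompressed stream, ignoring how it was compressed."""
    return hashlib.sha256(gzip.decompress(gz_bytes)).hexdigest()


# Stand-in for an uncompressed tar archive.
tar_bytes = b"stand-in for an uncompressed tar stream\n" * 100

# Two compressed forms of the same data; the gzip headers alone differ
# because the mtime fields are not the same.
one = gzip.compress(tar_bytes, compresslevel=1, mtime=0)
two = gzip.compress(tar_bytes, compresslevel=9, mtime=1)
```

Comparing sha256_of_decompressed(one) with sha256_of_decompressed(two)
gives the same digest even though the compressed bytes differ, which is
the property a verifier would rely on.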
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA