From: "brian m. carlson" <sandals@crustytoothpaste.net>
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org, "Eli Schwartz" <eschwartz93@gmail.com>,
"René Scharfe" <l.s.r@web.de>,
"Konstantin Ryabitsev" <konstantin@linuxfoundation.org>,
"Michal Suchánek" <msuchanek@suse.de>,
"Raymond E . Pasco" <ray@ameretat.dev>,
demerphq <demerphq@gmail.com>, "Theodore Ts'o" <tytso@mit.edu>,
"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Subject: Re: [RFC PATCH 1/1] Document a fixed tar format for interoperability
Date: Tue, 7 Feb 2023 22:34:14 +0000
Message-ID: <Y+LR5rlFTqyLfoeF@tapette.crustytoothpaste.net>
In-Reply-To: <xmqq8rha5wno.fsf@gitster.g>
On 2023-02-06 at 21:08:59, Junio C Hamano wrote:
> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
> "is identical to what"? Ditto for the one in the previous
> paragraph. The first paragraph is better in that there is "between
> versions", even though it would be easier to grok if we made it more
> clear that we are talking about versions of the software that is
> used to create the archive, not the version of contents being
> archived.
>
> Our goal is that serializing the same tree object or the same commit
> object result in bit-for-bit identical result, no matter which
> version of Git is used, and no matter what platform the Git used to
> create the archive was built on. Mentioning both what we take an
> archive out of (i.e. tree or commit) and we can use different
> versions of Git to create archives, in the description would make it
> easier to grok.
I can update that to reflect things more accurately.
> > +Goals and Rationale
> > +-------------------
> > +
> > +The goals for this format are that it is first and foremost reproducible, that
> > +identical trees produce identical results, that it is simple and easy to
> > +implement correctly, and that it is useful in general. While we don't consider
> > +functionality needs beyond Git's at the moment (such as hardlinks, xattrs, or
> > +sparse files), there is intense interest in reproducible builds, and so it makes
> > +sense to design something that can see general use for software interchange.
>
> Perfect.
>
> > +Because the goal is strict reproducibility, this format doesn't honor
> > +`tar.umask` or other options that can produce different output. It serializes
> > +all timestamps as the Epoch, which produces identical results whether the tree
> > +is serialized as a tree, commit, or tag. This is consistent with the behaviour
> > +of some other tar serializers, including the default for modern Rust crates, and
> > +is not believed to pose any interoperability problems.
>
> > +Object IDs are not included in this version of the format because this produces
> > +non-identical data when identical data is serialized with different hash
> > +algorithms.
>
> Declaring that we'll always peel a tag or a commit down to a tree is
> one sure way to avoid having to worry about object name hashes, but
> aren't we discarding too much utility by doing so?
>
> This is probably debatable. The commit object name embedded in the
> extended header of an archive makes it trivial to identify what
> version the archive _claims_ to have been taken from (you could also
> embed it in the filename that stores archive, but the use of the
> embedded metainfo makes it more robust against file names). And
> running "git archive" twice, with different versions of Git on
> different architectures, should be reproducible as long as both
> invokers expressed their desire to see the commit object name in the
> archive by passing the commit, not its tree, to the command, and
> they are using the same hash algorithm.
It's true that it makes it easy to look up, but I can say I've never
used that functionality. I think very few people actually know it
exists.
> Having said all that, I think stripping the commit object name (or
> tags) is a better design. Imagine that I see I created a tarball
> earlier and published its hash, but later lost the tarball. By not
> allowing any commit object name in the archive, it would force me to
> somehow name the tarball in such a way that I can tell which commit
> I used to create it, e.g. "git-e83c516331.tar". Other people can
> notice the filename and without having seen the bytes in it, they
> can try running "git archive e83c516331" in their repository and see
> the output matches the hash I published earlier. Having commit or
> tag embedded in the archive would make it harder to do this kind of
> things.
Most people do this anyway (except with a tag name), so I don't think
it's a big deal to have this as the primary mechanism.
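
As a rough illustration of the check described above (regenerate the
archive, hash it, compare with the published value), here is a minimal
Python sketch. The thread does not say which hash is published; SHA-256
and the function name `archive_matches` are assumptions made only for
this example.

```python
import hashlib


def archive_matches(archive_bytes: bytes, published_sha256: str) -> bool:
    """Return True if the archive's SHA-256 digest equals the published one.

    SHA-256 is an assumption; any hash works as long as both sides agree.
    """
    digest = hashlib.sha256(archive_bytes).hexdigest()
    return digest == published_sha256.strip().lower()
```

The bytes passed in would be whatever "git archive" wrote for the commit
named in the tarball's filename.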
> By the way, other potentially interesting points are:
>
> - Do we want to ignore "export-subst" for stability?
I think that would be a good idea. I'll add it in v2.
> - "git archive" can be invoked with pathspec to archive only a
> subset of paths.
True. I don't think that's a problem as long as we generate paths
correctly. I'll be sure to add tests for it, though.
> > +Introduction to the Underlying Format
> > +-------------------------------------
> > ...
> > +A global extended header sets metadata for the entire file, and a per-file
> > +extended header applies to only the to which it corresponds. A per-file
>
> "only the to which" -> "only the file to which"
Will fix.
> > +While pax extensions are widely supported by most modern versions of tar
> > +(including versions on Windows and all major open-source OSes), some older
> > +archivers and non-tar implementations which do not understand them typically
> > +extract the extended headers as regular files. Thus, it's helpful to have these
> > +entries have reasonable permissions and unique names.
>
> Surely, and to make things reproducible, they shouldn't just be
> reasonable and unique. They should be exactly as we define in the
> specification.
Yes, of course. This is more to indicate why we've made the decisions
to name them as they are and give them the permissions we did.
> > +Every file serialized in the archive is serialized in lexicographical order by
> > +its bytes. A directory is always serialized before its contents, and a
>
> "by its bytes" -> "by the bytes in its filename" or something?
> Surely we do not sort by contents ;-)
Good point. We should avoid ambiguity.
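
To make the intended ordering concrete, here is a small Python sketch
that sorts names by the bytes of their UTF-8 encoding. It assumes the
comparison is done on full path names without a trailing slash, which
is one reading of the text above; `sort_paths_bytewise` is a
hypothetical helper, not anything in Git.

```python
def sort_paths_bytewise(paths):
    """Sort path names by the raw bytes of their UTF-8 encoding."""
    return sorted(paths, key=lambda p: p.encode("utf-8"))


# "a" is a prefix of "a/b", so the directory sorts before its contents.
# "a-b" lands between them because "-" (0x2d) is less than "/" (0x2f).
print(sort_paths_bytewise(["b", "a/b", "a", "a-b"]))
```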
> > +directory is never serialized with a trailing slash. If a system uses a Unicode
> > +encoding other than UTF-8, it encodes filenames as UTF-8.
>
> This is a bit hard to grok. Do you mean there may be UTF-16 system
> where the data in our tree objects, whose paths are recorded in UTF-8,
> but "git checkout" of the tree may result in files in the native
> filename on that system, i.e. UTF-16 not UTF-8? And even on such a
> system, running "git archive" would record paths in the archive in
> UTF-8 (i.e. the same as what was in the tree object)? Or do you
> mean something stronger, like on a Latin-1 system with Latin-1
> project that used Latin-1 as pathnames even in the tree objects,
> when "git archive" produces an archive, the paths in it shall be
> transcoded from the original Latin-1 pathnames to UTF-8?
This means if, on Windows, someone uses --add-file or
--add-virtual-file, those paths will be encoded in UTF-8, not UTF-16.
> > +Version Number
> > +--------------
> > +
> > +The version number for this version is `ctar-v1`.
> > +
> > +Extended Headers
> > +----------------
> > +
> > +Global Extended Header
> > +~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +The global extended header (record `g`) shall contain one header:
> > +`CTAR.version`, which contains the version number specified above.
> > +
> > +The contents of the ustar header for the global extended header are as below,
> > +except that the `name` field contains `pax_global_header`.
>
> "as below" meaning...? The same as what is listed in "Per-File
> Extended Header"? There is no `name` field listed there, though.
I'll make a clearer reference.
> > +Per-File Extended Header
> > +~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +Each file has a per-file extended header.
> > +
> > +The following per-file extended header fields are included:
> > +
> > +|===
> > +| Field Name | When Present | Value
> > +
> > +| `atime` | always | `0`
> > +| `mtime` | always | `0`
> > +| `size` | always | size of the data in bytes
> > +| `path` | always | full path name of the file
>
> These are length-prefixed data, so we do not have to worry about
> overly long pathnames or symlinks?
Correct. This data can be arbitrarily long, provided the metadata
describing it can still be encoded in a ustar header, so the limit is
several gigabytes or so. I don't think anybody considers that a
practical limitation on filenames or other metadata.
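
For readers unfamiliar with the length prefix: a pax extended header
record has the form "LEN KEY=VALUE\n", where LEN is the decimal length
of the whole record, including the digits of LEN itself. A minimal
Python sketch of that encoding follows; `pax_record` is a hypothetical
name used only for illustration.

```python
def pax_record(key: str, value: bytes) -> bytes:
    """Encode one pax extended header record as b"LEN KEY=VALUE\\n"."""
    body = b" " + key.encode("utf-8") + b"=" + value + b"\n"
    length = len(body)
    # The decimal length counts its own digits, so repeat until stable.
    while True:
        total = len(body) + len(str(length))
        if total == length:
            break
        length = total
    return str(length).encode("ascii") + body
```

Because the value's extent comes from the prefix rather than from a
fixed-width field, a long path or symlink target needs no special
handling.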
> "we because" -> "because"
Will fix.
> > +we avoid explicitly declaring them as such and rely on the default archiver
> > +behavior, which may be more sensible.
>
> So, do we or do we not store hdrcharset? Producing Git does not know
> if the pathnames stored in the tree it is asked to produce archive
> for are not in UTF-8, so it assumes everything is in UTF-8 hence
> does not see the need to add hdrcharset?
pax says that these values are UTF-8 if not specified. If they're
clearly not UTF-8, we use `hdrcharset` and say they're binary. If they
look like valid UTF-8, we don't use `hdrcharset` and pretend they are in
fact UTF-8, in case somebody just likes causing discord by using
Windows-1252 that looks like UTF-8.
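
The decision can be sketched in a few lines of Python. This assumes
"clearly not UTF-8" means "fails strict UTF-8 decoding" and that the
value written is pax's "BINARY" spelling; `hdrcharset_for` is a
hypothetical helper.

```python
from typing import Optional


def hdrcharset_for(raw: bytes) -> Optional[str]:
    """Return "BINARY" if raw is not valid UTF-8, else None (no record)."""
    try:
        raw.decode("utf-8")  # strict decoding rejects malformed sequences
    except UnicodeDecodeError:
        return "BINARY"
    return None
```

Note that the Windows-1252 encoding of "Ã©" is the byte pair C3 A9,
which is also valid UTF-8, so a name like that gets no `hdrcharset`
record. That is the "looks like UTF-8" case mentioned above.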
> In other words, we just store the contents of the blob that
> represents the symbolic link there? I wonder if we do anything
> special if a blob, that is pointed at in an entry in a tree whose
> mode bits are 120000, has NUL in it (should we teach fsck to flag
> it, for example)?
This is the destination of the symlink, yes. We can simply check for
NUL and abort; I don't think that's an unreasonable behaviour in any
case.
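
A sketch of that check in Python, with `symlink_target` as a
hypothetical helper and ValueError standing in for however the real
code would abort:

```python
def symlink_target(blob: bytes) -> bytes:
    """Return the blob as the link target, refusing any embedded NUL."""
    if b"\x00" in blob:
        raise ValueError("symlink target contains a NUL byte")
    return blob
```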
> The order of entries needs to be specified when we aim for
> bit-for-bit reproducibility, no?
Yes. That's specified in the next section, where we say this:

  When encoding the data for an extended header, all entries are sorted in order
  by the byte values of their keys as encoded in UTF-8. Duplicate keys are not
  permitted.

I'll make a reference to that section and describe it more clearly.
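
As an illustration of that rule, a minimal Python sketch that rejects
duplicate keys and orders entries by the UTF-8 bytes of their keys.
`ordered_entries` is a hypothetical name, and the (key, value) pair
representation is an assumption made for the example.

```python
def ordered_entries(entries):
    """Return (key, value) pairs sorted by the UTF-8 bytes of each key.

    Raises ValueError if the same key appears more than once.
    """
    entries = list(entries)
    seen = set()
    for key, _value in entries:
        if key in seen:
            raise ValueError(f"duplicate extended header key: {key}")
        seen.add(key)
    return sorted(entries, key=lambda kv: kv[0].encode("utf-8"))
```

For the four fields in the table quoted earlier, this yields the order
atime, mtime, path, size.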
> "the header block" -> "the ustar header block" to match the next
> section, probably.
I'll update that.
> These are barebone header fields, not extended headers. Do we want
> to refer to some canonical sources so that readers understand that
> unlike the extended headers we are talking about fixed-length fields?
> The description above talks about "padding", but that of course
> applies to fixed width columns.
Correct. I'll mention that these are the values in the ustar header for
the extended header. I'll also put some references in to the
documentation.
--
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA