On 2023-02-06 at 21:08:59, Junio C Hamano wrote:
> "brian m. carlson" writes:
> "is identical to what"? Ditto for the one in the previous
> paragraph. The first paragraph is better in that there is "between
> versions", even though it would be easier to grok if we made it more
> clear that we are talking about versions of the software that is
> used to create the archive, not the version of contents being
> archived.
>
> Our goal is that serializing the same tree object or the same commit
> object result in bit-for-bit identical result, no matter which
> version of Git is used, and no matter what platform the Git used to
> create the archive was built on. Mentioning both what we take an
> archive out of (i.e. tree or commit) and we can use different
> versions of Git to create archives, in the description would make it
> easier to grok.

I can update that to reflect things more accurately.

> > +Goals and Rationale
> > +-------------------
> > +
> > +The goals for this format are that it is first and foremost reproducible, that
> > +identical trees produce identical results, that it is simple and easy to
> > +implement correctly, and that it is useful in general. While we don't consider
> > +functionality needs beyond Git's at the moment (such as hardlinks, xattrs, or
> > +sparse files), there is intense interest in reproducible builds, and so it makes
> > +sense to design something that can see general use for software interchange.
>
> Perfect.
>
> > +Because the goal is strict reproducibility, this format doesn't honor
> > +`tar.umask` or other options that can produce different output. It serializes
> > +all timestamps as the Epoch, which produces identical results whether the tree
> > +is serialized as a tree, commit, or tag. This is consistent with the behaviour
> > +of some other tar serializers, including the default for modern Rust crates, and
> > +is not believed to pose any interoperability problems.
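
As a rough illustration of the idea in the paragraph quoted just above (fixed
timestamps, no umask), here is a minimal Python sketch using the standard
`tarfile` module. It is not the proposed format and not Git's implementation;
the function name `build_archive`, the hard-coded `0o644` mode, and the zeroed
owner fields are assumptions made only for the example.

```python
import io
import tarfile


def build_archive(files: dict[str, bytes]) -> bytes:
    """Serialize {path: contents} to an uncompressed pax tar with fixed metadata."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
        # Sort by the UTF-8 bytes of the path so dict insertion order cannot leak in.
        for path in sorted(files, key=lambda p: p.encode("utf-8")):
            data = files[path]
            info = tarfile.TarInfo(name=path)
            info.size = len(data)
            info.mtime = 0        # the Epoch, whatever the commit or file time was
            info.mode = 0o644     # assumed fixed mode; no umask is consulted
            info.uid = info.gid = 0
            info.uname = info.gname = ""
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()
```

Calling `build_archive` twice on the same contents, even with the entries
supplied in a different order, should give byte-identical output, which is the
property the quoted paragraph is after.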
>
> > +Object IDs are not included in this version of the format because this produces
> > +non-identical data when identical data is serialized with different hash
> > +algorithms.
>
> Declaring that we'll always peel a tag or a commit down to a tree is
> one sure way to avoid having to worry about object name hashes, but
> aren't we discarding too much utility by doing so?
>
> This is probably debatable. The commit object name embedded in the
> extended header of an archive makes it trivial to identify what
> version the archive _claims_ to have been taken from (you could also
> embed it in the filename that stores archive, but the use of the
> embedded metainfo makes it more robust against file names). And
> running "git archive" twice, with different versions of Git on
> different architectures, should be reproducible as long as both
> invokers expressed their desire to see the commit object name in the
> archive by passing the commit, not its tree, to the command, and
> they are using the same hash algorithm.

It's true that it makes it easy to look up, but I can say I've never used
that functionality. I think very few people actually know it exists.

> Having said all that, I think stripping the commit object name (or
> tags) is a better design. Imagine that I see I created a tarball
> earlier and published its hash, but later lost the tarball. By not
> allowing any commit object name in the archive, it would force me to
> somehow name the tarball in such a way that I can tell which commit
> I used to create it, e.g. "git-e83c516331.tar". Other people can
> notice the filename and without having seen the bytes in it, they
> can try running "git archive e83c516331" in their repository and see
> the output matches the hash I published earlier. Having commit or
> tag embedded in the archive would make it harder to do this kind of
> things.

Most people do this anyway (except with a tag name), so I don't think it's
a big deal to have this as the primary mechanism.
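
The workflow described in the quoted paragraph (recover the revision from the
file name, regenerate the archive, compare against the published hash) could
be sketched roughly as follows. Everything here is illustrative: the helper
names are hypothetical, SHA-256 is assumed as the published digest, the file
name pattern is taken only from the "git-e83c516331.tar" example, and
`--format=tar` is a stand-in because this thread does not name the option for
the new format.

```python
import hashlib
import re
import subprocess


def rev_from_filename(name: str) -> str:
    """Pull a hex revision out of a name shaped like 'git-e83c516331.tar'."""
    m = re.search(r"-([0-9a-f]{7,64})\.tar$", name)
    if m is None:
        raise ValueError(f"no revision found in {name!r}")
    return m.group(1)


def archive_bytes(repo: str, rev: str) -> bytes:
    """Regenerate the archive for rev by running git archive in repo."""
    result = subprocess.run(
        ["git", "-C", repo, "archive", "--format=tar", rev],
        check=True,
        capture_output=True,
    )
    return result.stdout


def matches_published(data: bytes, published_sha256: str) -> bool:
    """Compare the SHA-256 of data with a published hex digest."""
    return hashlib.sha256(data).hexdigest() == published_sha256.strip().lower()
```

A caller would combine these as
`matches_published(archive_bytes(repo, rev_from_filename(name)), digest)`;
the comparison is only meaningful if the archive format is reproducible in the
first place.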
> By the way, other potentially interesting points are:
>
> - Do we want to ignore "export-subst" for stability?

I think that would be a good idea. I'll add it in v2.

> - "git archive" can be invoked with pathspec to archive only a
> subset of paths.

True. I don't think that's a problem as long as we generate paths
correctly. I'll be sure to add tests for it, though.

> > +Introduction to the Underlying Format
> > +-------------------------------------
> > ...
> > +A global extended header sets metadata for the entire file, and a per-file
> > +extended header applies to only the to which it corresponds. A per-file
>
> "only the to which" -> "only the file to which"

Will fix.

> > +While pax extensions are widely supported by most modern versions of tar
> > +(including versions on Windows and all major open-source OSes), some older
> > +archivers and non-tar implementations which do not understand them typically
> > +extract the extended headers as regular files. Thus, it's helpful to have these
> > +entries have reasonable permissions and unique names.
>
> Surely, and to make things reproducible, they shouldn't just be
> reasonable and unique. They should be exactly as we define in the
> specification.

Yes, of course. This is more to indicate why we've made the decisions to
name them as they are and give them the permissions we did.

> > +Every file serialized in the archive is serialized in lexicographical order by
> > +its bytes. A directory is always serialized before its contents, and a
>
> "by its bytes" -> "by the bytes in its filename" or something?
> Surely we do not sort by contents ;-)

Good point. We should avoid ambiguity.

> > +directory is never serialized with a trailing slash. If a system uses a Unicode
> > +encoding other than UTF-8, it encodes filenames as UTF-8.
>
> This is a bit hard to grok.
> Do you mean there may be UTF-16 system
> where the data in our tree objects, whose paths are recorded in UTF-8,
> but "git checkout" of the tree may result in files in the native
> filename on that system, i.e. UTF-16 not UTF-8? And even on such a
> system, running "git archive" would record paths in the archive in
> UTF-8 (i.e. the same as what was in the tree object)? Or do you
> mean something stronger, like on a Latin-1 system with Latin-1
> project that used Latin-1 as pathnames even in the tree objects,
> when "git archive" produces an archive, the paths in it shall be
> transcoded from the original Latin-1 pathnames to UTF-8?

This means if, on Windows, someone uses --add-file or --add-virtual-file,
those paths will be encoded in UTF-8, not UTF-16.

> > +Version Number
> > +--------------
> > +
> > +The version number for this version is `ctar-v1`.
> > +
> > +Extended Headers
> > +----------------
> > +
> > +Global Extended Header
> > +~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +The global extended header (record `g`) shall contain one header:
> > +`CTAR.version`, which contains the version number specified above.
> > +
> > +The contents of the ustar header for the global extended header are as below,
> > +except that the `name` field contains `pax_global_header`.
>
> "as below" meaning...? The same as what is listed in "Per-File
> Extended Header"? There is no `name` field listed there, though.

I'll make a clearer reference.

> > +Per-File Extended Header
> > +~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +Each file has a per-file extended header.
> > +
> > +The following per-file extended header fields are included:
> > +
> > +|===
> > +| Field Name | When Present | Value
> > +
> > +| `atime` | always | `0`
> > +| `mtime` | always | `0`
> > +| `size` | always | size of the data in bytes
> > +| `path` | always | full path name of the file
>
> These are length-prefixed data, so we do not have to worry about
> overly long pathnames or symlinks?

Correct.
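
To make the length-prefix point concrete: a pax extended header record has the
form `<length> <key>=<value>\n`, where the decimal length counts the whole
record, including its own digits. A minimal Python sketch of that encoding,
applied to the four always-present fields in the quoted table, might look like
this. The function names are hypothetical and this is not code from the patch.

```python
def pax_record(key: str, value: bytes) -> bytes:
    """Encode one pax extended header record: b'<length> <key>=<value>\\n'."""
    body = b" " + key.encode("utf-8") + b"=" + value + b"\n"
    # The length prefix counts its own digits, so grow it until it is stable.
    total = len(body) + len(str(len(body)))
    while len(str(total)) + len(body) != total:
        total = len(str(total)) + len(body)
    return str(total).encode("ascii") + body


def per_file_records(path: bytes, size: int) -> bytes:
    """Records for the always-present fields, in the order the table lists them."""
    return b"".join([
        pax_record("atime", b"0"),
        pax_record("mtime", b"0"),
        pax_record("size", str(size).encode("ascii")),
        pax_record("path", path),
    ])
```

Because the value is delimited by the length rather than by a fixed-width
field, a path of several hundred bytes is encoded exactly like a short one.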
This data can be arbitrarily long as long as all the metadata can be
encoded in a ustar header, so the limit is several gigabytes or so. I
don't think anybody thinks of that as a practical limitation on filenames
or other metadata.

> "we because" -> "because"

Will fix.

> > +we avoid explicitly declaring them as such and rely on the default archiver
> > +behavior, which may be more sensible.
>
> So, do we or do we not store hdrcharset? Producing Git does not know
> if the pathnames stored in the tree it is asked to produce archive
> for are not in UTF-8, so it assumes everything is in UTF-8 hence
> does not see the need to add hdrcharset?

pax says that these values are UTF-8 if not specified. If they're clearly
not UTF-8, we use `hdrcharset` and say they're binary. If they look like
valid UTF-8, we don't use `hdrcharset` and pretend they are in fact UTF-8,
in case somebody just likes causing discord by using Windows-1252 that
looks like UTF-8.

> In other words, we just store the contents of the blob that
> represents the symbolic link there? I wonder if we do anything
> special if a blob, that is pointed at in an entry in a tree whose
> mode bits are 120000, has NUL in it (should we teach fsck to flag
> it, for example)?

This is the destination of the symlink, yes. We can simply check for NUL
and abort; I don't think that's an unreasonable behaviour in any case.

> The order of entries need to be specified when we aim for
> bit-for-bit reproduceability, no?

Yes. That's specified in the next section, where we say this:

  When encoding the data for an extended header, all entries are sorted in
  order by the byte values of their keys as encoded in UTF-8. Duplicate
  keys are not permitted.

I'll make a reference to that section and describe it more clearly.

> "the header block" -> "the ustar header block" to match the next
> section, probably.

I'll update that.

> These are barebone header fields, not extended headers.
> Do we want
> to refer to some canonical sources so that readers understand that
> unlike the extended headres we are talking about fixed-length fields?
> The description above talks about "padding", but that of course
> applies to fixed width columns.

Correct. I'll mention that these are the values in the ustar header for
the extended header. I'll also put some references in to the
documentation.
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA