git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* force deterministic trees on git push - exact sort-order of filenames
@ 2021-10-15 16:04 milan hauth
  2021-10-15 17:47 ` René Scharfe
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: milan hauth @ 2021-10-15 16:04 UTC (permalink / raw)
  To: git

fact: the sort-order of filenames in a tree is not strictly regulated

proposal: enforce the exact sort-order of tree-items,
to make trees deterministic and reproducible.
the git server could refuse a 'git push',
when the tree is invalid

use case: reconstruct a git tree only from files.

current situation:
blobs are deterministic, trees are non-deterministic

backward compatibility:
rewriting git history is usually not desired.
so this new rule would apply only to new commits
after a certain 'deadline', set by the git server



sample trees from [1] and [2]:

git cat-file -p 2b75a7dbb76138587dbe50a5a6af8a6eedbaf66b | grep id_ed25519
100644 blob f914b3f712fc56ab212b53cb9055e1291b5c77a2    id_ed25519
100644 blob 40de4a8ac6027f64ac85f687bea7049467b428a2    id_ed25519.pub

git cat-file -p c8a72e628d0ca0a174a1a4241e6c7314a4660f0f | grep example
100644 blob fde6f3cbd19addb8ce84ffe32ab4d040e8b09c18    example.pem
040000 tree 6b0ee97865059ac965590e0ff5272fb76b6fd2c8    example



the first tree is sorted as expected

printf 'id_ed25519\nid_ed25519.pub\n' | sort
id_ed25519
id_ed25519.pub



the expected sort-order for the second tree:

printf 'example.pem\nexample\n' | sort
example
example.pem

the "unexpected" sort-order can be produced with [3]:

```py
import functools
def cmp(s, t):
  """
  Alter lexicographic sort order
  to make longer keys go *before* any of their prefixes
  """
  for p, q in zip(s, t):
    if p < q: return -1
    if q < p: return 1
  if len(s) > len(t): return -1
  elif len(t) > len(s): return 1
  return 0

arr = [ 'example', 'example.pem' ]
print("\n".join(sorted(arr, key=functools.cmp_to_key(cmp))))
```

result:

example.pem
example



[1] sample tree 1
https://api.github.com/repos/NixOS/nixpkgs/git/trees/2b75a7dbb76138587dbe50a5a6af8a6eedbaf66b

[2] sample tree 2
https://api.github.com/repos/NixOS/nixpkgs/git/trees/c8a72e628d0ca0a174a1a4241e6c7314a4660f0f

[3] python impl of unexpected sort
https://stackoverflow.com/questions/42899405/sort-a-list-with-longest-items-first

[4] related issue (wontfix)
https://github.com/dulwich/dulwich/issues/905

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: force deterministic trees on git push - exact sort-order of filenames
  2021-10-15 16:04 force deterministic trees on git push - exact sort-order of filenames milan hauth
@ 2021-10-15 17:47 ` René Scharfe
  2021-10-15 18:03 ` Junio C Hamano
  2021-10-17 16:10 ` Ævar Arnfjörð Bjarmason
  2 siblings, 0 replies; 6+ messages in thread
From: René Scharfe @ 2021-10-15 17:47 UTC (permalink / raw)
  To: milan hauth, git

Am 15.10.21 um 18:04 schrieb milan hauth:
> fact: the sort-order of filenames in a tree is not strictly regulated

Actually it is.

> proposal: enforce the exact sort-order of tree-items,
> to make trees deterministic and reproducible.

That's a good idea.  It was implemented in e83c516331 (Initial revision
of "git", the information manager from hell, 2005-04-07), whose README
says:

  "A tree object is a list of permission/name/blob data, sorted by name.
   In other words the tree object is uniquely determined by the set
   contents, and so two separate but identical trees will always share
   the exact same object."

> the git server could refuse a 'git push',
> when the tree is invalid

It should.

> sample trees from [1] and [2]:
>
> git cat-file -p 2b75a7dbb76138587dbe50a5a6af8a6eedbaf66b | grep id_ed25519
> 100644 blob f914b3f712fc56ab212b53cb9055e1291b5c77a2    id_ed25519
> 100644 blob 40de4a8ac6027f64ac85f687bea7049467b428a2    id_ed25519.pub
>
> git cat-file -p c8a72e628d0ca0a174a1a4241e6c7314a4660f0f | grep example
> 100644 blob fde6f3cbd19addb8ce84ffe32ab4d040e8b09c18    example.pem
> 040000 tree 6b0ee97865059ac965590e0ff5272fb76b6fd2c8    example

Tree entries are sorted by name, except that the names of a sub-trees
(like "example" above) get an implicit slash (/) appended.  Slash
(ASCII character 47) sorts after dot (ASCII character 46).

Are you able to generate and push different trees objects that
reference the same entries (same names, hashes, modes) in different
order?

René

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: force deterministic trees on git push - exact sort-order of filenames
  2021-10-15 16:04 force deterministic trees on git push - exact sort-order of filenames milan hauth
  2021-10-15 17:47 ` René Scharfe
@ 2021-10-15 18:03 ` Junio C Hamano
  2021-10-15 18:26   ` milan hauth
  2021-10-17 16:10 ` Ævar Arnfjörð Bjarmason
  2 siblings, 1 reply; 6+ messages in thread
From: Junio C Hamano @ 2021-10-15 18:03 UTC (permalink / raw)
  To: milan hauth; +Cc: git

milan hauth <milahu@gmail.com> writes:

> fact: the sort-order of filenames in a tree is not strictly regulated

Untrue, sorry.

The rule is "a tree entry that describes a tree object sorts as if
its pathname component has a trailing slash '/'".  All your sample
trees should satisify that rule (or the implementation of Git that
created such a tree is broken).

In hindsight, if we used the collation order that puts '/' before
all other bytes when sorting the index entries and the tree entries,
it would have made a lot of implemenentation details clean and easier
to work with, but that would have been feasible in early 2005, and
not now.


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: force deterministic trees on git push - exact sort-order of filenames
  2021-10-15 18:03 ` Junio C Hamano
@ 2021-10-15 18:26   ` milan hauth
  0 siblings, 0 replies; 6+ messages in thread
From: milan hauth @ 2021-10-15 18:26 UTC (permalink / raw)
  To: Junio C Hamano, René Scharfe; +Cc: git

aah!

thanks for clarifying this implementation detail

so git trees are deterministic already.
this makes my life much easier : )

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: force deterministic trees on git push - exact sort-order of filenames
  2021-10-15 16:04 force deterministic trees on git push - exact sort-order of filenames milan hauth
  2021-10-15 17:47 ` René Scharfe
  2021-10-15 18:03 ` Junio C Hamano
@ 2021-10-17 16:10 ` Ævar Arnfjörð Bjarmason
  2021-10-18  5:38   ` Junio C Hamano
  2 siblings, 1 reply; 6+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-10-17 16:10 UTC (permalink / raw)
  To: milan hauth; +Cc: git


On Fri, Oct 15 2021, milan hauth wrote:

> backward compatibility:
> rewriting git history is usually not desired.
> so this new rule would apply only to new commits
> after a certain 'deadline', set by the git server

Others have commented on the status quo, but just on this: "git fsck"
will report these, and if it doesn't that's a bug. Grep for
"TREE_NOT_SORTED" in git.git for the code.

But in general this sort of plan for disallowing "bad" data doesn't
really work all that well. People want to e.g. switch hosting providers,
and will need to re-push old bad data with a new push.

I suppose there could be more strictures in the fsck code to allow for
that use-case, but still make some things that are mere warnings now
hard errors (or "fatal").

E.g. allowing it based on a more thorough inspection of the history, or
treat commits differently if their envelope timestamp is past some
cut-off (which right now we don't care about).

For a self-hosted installation with some specific old bad data I was
also able to turn on tighter strictures and whitelist specific OIDs with
fsck.skipList, I don't know if any of the big providers use that.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: force deterministic trees on git push - exact sort-order of filenames
  2021-10-17 16:10 ` Ævar Arnfjörð Bjarmason
@ 2021-10-18  5:38   ` Junio C Hamano
  0 siblings, 0 replies; 6+ messages in thread
From: Junio C Hamano @ 2021-10-18  5:38 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: milan hauth, git

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> I suppose there could be more strictures in the fsck code to allow for
> that use-case, but still make some things that are mere warnings now
> hard errors (or "fatal").

There is a group of configuration variables fsck.* to let you pass
specific types of errors go without erroring out on them, and also
fsck.skiplist lets you ignore errors on specific objects, but both
need to ask people on the other side to flip them on.

I'd imagine that hosting providers can make it self-service for
repository owners, though, but if they employ shared object store
across repositories, it might become a bit cumbersome to implement
correctly.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2021-10-18  5:38 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-15 16:04 force deterministic trees on git push - exact sort-order of filenames milan hauth
2021-10-15 17:47 ` René Scharfe
2021-10-15 18:03 ` Junio C Hamano
2021-10-15 18:26   ` milan hauth
2021-10-17 16:10 ` Ævar Arnfjörð Bjarmason
2021-10-18  5:38   ` Junio C Hamano

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).