* Finer timestamps and serialization in git @ 2019-05-15 19:16 Eric S. Raymond 2019-05-15 20:16 ` Derrick Stolee 2019-05-15 20:20 ` Ævar Arnfjörð Bjarmason 0 siblings, 2 replies; 33+ messages in thread From: Eric S. Raymond @ 2019-05-15 19:16 UTC (permalink / raw) To: git The recent increase in vulnerability in SHA-1 means, I hope, that you are planning for the day when git needs to change to something like an elliptic-curve hash. This means you're going to have a major format break. Such is life. Since this is going to have to happen anyway, let me request two functional changes in git. Neither will be at all difficult, but the first one is also a thing that cannot be done without a format break, which is why I have not suggested them before. They come from lots of (often painful) experience with repository conversions via reposurgeon. 1. Finer granularity on commit timestamps. 2. Timestamps unique per repository The coarse resolution of git timestamps, and the lack of uniqueness, are at the bottom of several problems that are persistently irritating when I do repository conversions and surgery. The most obvious issue, though a relatively superficial one, is that I have to throw away information whenever I convert a repository from a system with finer-grained time. Notably this is the case with Subversion, which keeps time to milliseconds. This is probably the only respect in which its data model remains superior to git's. :-) The deeper problem is that I want something from Git that I cannot have with 1-second granularity. That is: a unique timestamp on each commit in a repository. The only way to be certain of this is for git to delay accepting integration of a patch until it can issue a unique time mark for it - obviously impractical if the quantum is one second, but not if it's a millisecond or microsecond. Why do I want this? There are a number of reasons, all related to a mathematical concept called "total ordering". 
At present, commits in a Git repository only have partial ordering. One consequence is that action stamps - the committer/date pairs I use as VCS-independent commit identifications in reposurgeon - are not unique. When a patch sequence is applied, it can easily happen fast enough to give several successive commits the same committer-ID and timestamp. Of course the commit hash remains a unique commit ID. But it can't easily be parsed and followed by a human, which is a UX problem when it's used as a commit stamp in change comments. More deeply, the lack of total ordering means that repository graphs don't have a single canonical serialized form. This sounds abstract but it means there are surgical operations I can't regression-test properly. My colleague Edward Cree has found cases where git fast-export can issue a stream dump for which git fast-import won't necessarily re-color certain interior nodes the same way when it's read back in and I'm pretty sure the absence of total ordering on the branch tips is at the bottom of that. I'm willing to write patches if this direction is accepted. I've figured out how to make fast-import streams upward-compatible with finer-grained timestamps. -- <a href="http://www.catb.org/~esr/">Eric S. Raymond</a> ^ permalink raw reply [flat|nested] 33+ messages in thread
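The collision ESR describes is easy to reproduce: with 1-second granularity, two commits applied within the same second by the same committer yield identical action stamps. A minimal sketch of the idea, assuming a committer-email-plus-UTC-date stamp format; the `action_stamp` helper and the email address are illustrative, not reposurgeon's actual code:

```python
from datetime import datetime, timezone

def action_stamp(email, epoch_seconds):
    # Reposurgeon-style action stamp: committer email + commit date.
    # Illustrative helper, not reposurgeon's actual implementation.
    when = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return "<%s!%s>" % (email, when.strftime("%Y-%m-%dT%H:%M:%SZ"))

# A patch sequence applied fast enough lands in the same second:
a = action_stamp("committer@example.com", 1557948240)
b = action_stamp("committer@example.com", 1557948240)
assert a == b  # identical stamps for two distinct commits
```

With sub-second granularity and a uniqueness guarantee, the two stamps above would differ, which is the whole point of the request.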
* Re: Finer timestamps and serialization in git 2019-05-15 19:16 Finer timestamps and serialization in git Eric S. Raymond @ 2019-05-15 20:16 ` Derrick Stolee 2019-05-15 20:28 ` Jason Pyeron 2019-05-15 23:32 ` Eric S. Raymond 2019-05-15 20:20 ` Ævar Arnfjörð Bjarmason 1 sibling, 2 replies; 33+ messages in thread From: Derrick Stolee @ 2019-05-15 20:16 UTC (permalink / raw) To: Eric S. Raymond, git On 5/15/2019 3:16 PM, Eric S. Raymond wrote: > The deeper problem is that I want something from Git that I cannot > have with 1-second granularity. That is: a unique timestamp on each > commit in a repository. This is impossible in a distributed version control system like Git (where the commits are immutable). No matter your precision, there is a chance that two different machines commit at the exact same moment and then those commits are merged into the same branch. Even when you specify a committer, there are many environments where a set of parallel machines are creating commits with the same identity. > Why do I want this? There are a number of reasons, all related to a > mathematical concept called "total ordering". At present, commits in > a Git repository only have partial ordering. This is true of any directed acyclic graph. If you want a total ordering that is completely unambiguous, then you should think about maintaining a linear commit history by requiring rebasing instead of merging. > One consequence is that > action stamps - the committer/date pairs I use as VCS-independent commit > identifications in reposurgeon - are not unique. When a patch sequence > is applied, it can easily happen fast enough to give several successive > commits the same committer-ID and timestamp. Sorting by committer/date pairs sounds like an unhelpful idea, as that does not take any graph topology into account. It happens that a commit can actually have an _earlier_ commit date than its parent. 
> More deeply, the lack of total ordering means that repository graphs > don't have a single canonical serialized form. This sounds abstract > but it means there are surgical operations I can't regression-test > properly. My colleague Edward Cree has found cases where git fast-export > can issue a stream dump for which git fast-import won't necessarily > re-color certain interior nodes the same way when it's read back in > and I'm pretty sure the absence of total ordering on the branch tips > is at the bottom of that. If you use `git rev-list --topo-order` with a fixed set of refs to start, then the total ordering given is well-defined (and it is a linear extension of the partial order given by the commit graph). However, this ordering is not stable: adding another merge commit may swap the order between two commits lower in the order. > I'm willing to write patches if this direction is accepted. I've figured > out how to make fast-import streams upward-compatible with finer-grained > timestamps. Changing the granularity of timestamps requires changing the commit format, which is probably a non-starter. More universally-useful suggestions have been blocked due to keeping the file format consistent. Thanks, -Stolee ^ permalink raw reply [flat|nested] 33+ messages in thread
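Stolee's point, that a topological order is a linear extension of the commit graph's partial order but not a uniquely determined one, can be seen on a toy DAG: a simple diamond already admits two valid orders, so nothing intrinsic to the graph pins down which one an exporter emits. A small sketch (not git's actual algorithm; the commit names are hypothetical):

```python
def topo_orders(parents):
    # Enumerate all linear extensions (valid topological orders) of a
    # commit DAG given as {commit: [list of parents]}.
    # Illustrative helper, not git's rev-list implementation.
    children = {c: set() for c in parents}
    indeg = {c: len(parents[c]) for c in parents}
    for c, ps in parents.items():
        for p in ps:
            children[p].add(c)
    def rec(order, indeg):
        ready = sorted(c for c, d in indeg.items() if d == 0)
        if not ready:
            yield tuple(order)
            return
        for c in ready:
            nd = dict(indeg)
            nd[c] = -1  # mark emitted
            for k in children[c]:
                nd[k] -= 1
            yield from rec(order + [c], nd)
    return list(rec([], indeg))

# A diamond: A <- B, A <- C, and D merges B and C.
dag = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
orders = topo_orders(dag)
assert len(orders) == 2  # B and C may be emitted in either order
```

Both orders are "correct"; only an extra tie-breaking rule (such as the unique timestamps ESR asks for) would select one canonically.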
* RE: Finer timestamps and serialization in git 2019-05-15 20:16 ` Derrick Stolee @ 2019-05-15 20:28 ` Jason Pyeron 2019-05-15 21:14 ` Derrick Stolee 2019-05-15 23:40 ` Eric S. Raymond 2019-05-15 23:32 ` Eric S. Raymond 1 sibling, 2 replies; 33+ messages in thread From: Jason Pyeron @ 2019-05-15 20:28 UTC (permalink / raw) To: git; +Cc: 'Derrick Stolee', 'Eric S. Raymond' (please don’t cc me) > -----Original Message----- > From: Derrick Stolee > Sent: Wednesday, May 15, 2019 4:16 PM > > On 5/15/2019 3:16 PM, Eric S. Raymond wrote: <snip/> I disagree with many of Eric's reasons - and agree with most of Derrick's refutation. But > > Changing the granularity of timestamps requires changing the commit format, > which is probably a non-starter. is not necessarily true. If we take the below example: committer Name <user@domain> 1557948240 -0400 and we follow the rule that: 1. any trailing zero after the decimal point MUST be omitted 2. if there are no digits after the decimal point, it MUST be omitted This would allow: committer Name <user@domain> 1557948240 -0400 committer Name <user@domain> 1557948240.12 -0400 but the following are never allowed: committer Name <user@domain> 1557948240. -0400 committer Name <user@domain> 1557948240.000000 -0400 By following these rules, all previous commits' hashes are unchanged. Future commits made exactly on the second will look like the old commit format. Commits coming from "older" tools will produce valid and mergeable objects. The loss of precision has frustrated us several times as well. Respectfully, Jason Pyeron ^ permalink raw reply [flat|nested] 33+ messages in thread
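Jason's two rules can be sketched as a small formatter. This is a hypothetical helper, not anything git implements; git does store the timestamp as a decimal string inside the commit object, but with no fractional part:

```python
def format_commit_timestamp(seconds, fraction=0):
    # Format a committer timestamp under the proposed rules:
    # 1. trailing zeros after the decimal point are omitted;
    # 2. the decimal point itself is omitted when nothing follows it.
    # Sketch only; not git behavior.
    if fraction:
        frac = ("%.9f" % fraction)[2:].rstrip("0")  # digits after the point
        return "%d.%s" % (seconds, frac)
    return "%d" % seconds

assert format_commit_timestamp(1557948240) == "1557948240"
assert format_commit_timestamp(1557948240, 0.12) == "1557948240.12"
# Never produced: "1557948240." or "1557948240.000000"
assert format_commit_timestamp(1557948240, 0.000000) == "1557948240"
```

Under these rules a whole-second timestamp is byte-identical to today's format, which is why existing commit hashes are unaffected.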
* Re: Finer timestamps and serialization in git 2019-05-15 20:28 ` Jason Pyeron @ 2019-05-15 21:14 ` Derrick Stolee 2019-05-15 22:07 ` Ævar Arnfjörð Bjarmason 2019-05-16 0:28 ` Eric S. Raymond 2019-05-15 23:40 ` Eric S. Raymond 1 sibling, 2 replies; 33+ messages in thread From: Derrick Stolee @ 2019-05-15 21:14 UTC (permalink / raw) To: Jason Pyeron, git; +Cc: 'Eric S. Raymond' On 5/15/2019 4:28 PM, Jason Pyeron wrote: > (please don’t cc me) Ok. I'll "To" you. > and we follow the rule that: > > 1. any trailing zero after the decimal point MUST be omitted > 2. if there are no digits after the decimal point, it MUST be omitted > > This would allow: > > committer Name <user@domain> 1557948240 -0400 > committer Name <user@domain> 1557948240.12 -0400 This kind of change would probably break old clients trying to read commits from new clients. Ævar's suggestion [1] of additional headers should not create incompatibilities. > By following these rules, all previous commits' hashes are unchanged. Future commits made exactly on the second will look like the old commit format. Commits coming from "older" tools will produce valid and mergeable objects. The loss of precision has frustrated us several times as well. What problem are you trying to solve where commit date is important? The only use I have for them is "how long has it been since someone made this change?" A question like "when was this change introduced?" is much less important than "in which version was this first released?" This "in which version" is a graph reachability question, not a date question. I think any attempt to understand Git commits using commit date without using the underlying graph topology (commit->parent relationships) is fundamentally broken and won't scale to even moderately-sized teams. I don't even use "git log" without a "--topo-order" or "--graph" option because using a date order puts unrelated changes next to each other. 
--topo-order guarantees that a path of commits with only one parent and only one child appears in consecutive order. Thanks, -Stolee P.S. All of my (overly strong) opinions on using commit date are made more valid when you realize anyone can set GIT_COMMITTER_DATE to get an arbitrary commit date. [1] https://public-inbox.org/git/871s0zwjv0.fsf@evledraar.gmail.com/T/#t ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Finer timestamps and serialization in git 2019-05-15 21:14 ` Derrick Stolee @ 2019-05-15 22:07 ` Ævar Arnfjörð Bjarmason 2019-05-16 0:28 ` Eric S. Raymond 1 sibling, 0 replies; 33+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2019-05-15 22:07 UTC (permalink / raw) To: Derrick Stolee; +Cc: Jason Pyeron, git, 'Eric S. Raymond' On Wed, May 15 2019, Derrick Stolee wrote: > On 5/15/2019 4:28 PM, Jason Pyeron wrote: >> (please don’t cc me) > > Ok. I'll "To" you. I'm a rebel! >> and we follow the rule that: >> >> 1. any trailing zero after the decimal point MUST be omitted >> 2. if there are no digits after the decimal point, it MUST be omitted >> >> This would allow: >> >> committer Name <user@domain> 1557948240 -0400 >> committer Name <user@domain> 1557948240.12 -0400 > > This kind of change would probably break old clients trying to read > commits from new clients. Ævar's suggestion [1] of additional headers > should not create incompatibilities. Yes, exactly. Obviously patching git to do this is rather easy, here's an initial try: diff --git a/date.c b/date.c index 8126146c50..0a97e1d877 100644 --- a/date.c +++ b/date.c @@ -762,3 +762,3 @@ static void date_string(timestamp_t date, int offset, struct strbuf *buf) } - strbuf_addf(buf, "%"PRItime" %c%02d%02d", date, sign, offset / 60, offset % 60); + strbuf_addf(buf, "%"PRItime".12345 %c%02d%02d", date, sign, offset / 60, offset % 60); } diff --git a/usage.c b/usage.c index 2fdb20086b..7760b78cb6 100644 --- a/usage.c +++ b/usage.c @@ -267,2 +267,3 @@ NORETURN void BUG_fl(const char *file, int line, const char *fmt, ...) va_list ap; + return; va_start(ap, fmt); We don't need BUG() right? 
:) Now let's commit with that git, that gives me a commit object with a sub-second timestamp like: $ git cat-file -p HEAD tree 4d5fcadc293a348e88f777dc0920f11e7d71441c author Ævar Arnfjörð Bjarmason <avarab@gmail.com> 1557955656.12345 +0200 committer Ævar Arnfjörð Bjarmason <avarab@gmail.com> 1557955656.12345 +0200 Works so far, yay! And now fsck fails: error in commit 31b3e9b88c36f75b3375471d9f5b449165c9ff93: badDate: invalid author/committer line - bad date And any sane git hosting site will refuse this, e.g. trying to push this to github: remote: error: object 31b3e9b88c36f75b3375471d9f5b449165c9ff93: badDate: invalid author/committer line - bad date remote: fatal: fsck error in packed object And that's *just* dealing with the git.git client, any such format changes also need to consider what happens to jgit, libgit2 etc. etc. Once you make such changes to the format you've created your own version-control system. It's no longer git. >> By following these rules, all previous commits' hash are unchanged. Future commits made on the top of the second will look like old commit formats. Commits coming from "older" tools will produce valid and mergeable objects. The loss precision has frustrated us several times as well. > > What problem are you trying to solve where commit date is important? > The only use I have for them is "how long has it been since someone > made this change?" A question like "when was this change introduced?" > is much less important than "in which version was this first released?" > This "in which version" is a graph reachability question, not a date > question. > > I think any attempt to understand Git commits using commit date without > using the underling graph topology (commit->parent relationships) is > fundamentally broken and won't scale to even moderately-sized teams. > I don't even use "git log" without a "--topo-order" or "--graph" option > because using a date order puts unrelated changes next to each other. 
> --topo-order guarantees that a path of commits with only one parent > and only one child appears in consecutive order. > > Thanks, > -Stolee > > P.S. All of my (overly strong) opinions on using commit date are made > more valid when you realize anyone can set GIT_COMMITTER_DATE to get > an arbitrary commit date. > > [1] https://public-inbox.org/git/871s0zwjv0.fsf@evledraar.gmail.com/T/#t ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Finer timestamps and serialization in git 2019-05-15 21:14 ` Derrick Stolee 2019-05-15 22:07 ` Ævar Arnfjörð Bjarmason @ 2019-05-16 0:28 ` Eric S. Raymond 2019-05-16 1:25 ` Derrick Stolee 1 sibling, 1 reply; 33+ messages in thread From: Eric S. Raymond @ 2019-05-16 0:28 UTC (permalink / raw) To: Derrick Stolee; +Cc: Jason Pyeron, git Derrick Stolee <stolee@gmail.com>: > What problem are you trying to solve where commit date is important? I don't know what Jason's are. I know what mine are. A. Portable commit identifiers 1. When I in-migrate a repository from (say) Subversion with reposurgeon, I want to be able to patch change comments so that (say) r2367 becomes a unique reference to its corresponding commit. I do not want the kludge of appending a relic SVN-ID header to be *required*, though some customers may choose that. Requiring that is an orthogonality violation. 2. Because I think in decadal timescales about infrastructure, I want my commit references to be in a format that won't break when the history is forward-migrated to the *next* VCS. That pretty much eliminates any form of opaque hash. (Git itself will have a weaker version of this problem when you change hash formats.) 3. Accordingly, I invented action stamps. This is an action stamp: <esr@thyrsus.com!2019-05-15T20:01:15Z>. One reason I want timestamp uniqueness is for action-stamp uniqueness. B. Unique canonical form of import-stream representation. Reposurgeon is a very complex piece of software with subtle failure modes. I have a strong need to be able to regression-test its operation. Right now there are important cases in which I can't do that because (a) the order in which it writes commits and (b) how it colors branches, are both phase-of-moon dependent. That is, the algorithms may be deterministic but they're not documented and seem to be dependent on variables that are hidden from me. Before import streams can have a canonical output order without hidden variables (e.g. 
depending only on visible metadata) in practice, that needs to be possible in principle. I've thought about this a lot and not only are unique commit timestamps the most natural way to make it possible, they're the only way consistent with the reality that commit comments may be altered for various good reasons during repository translation. > P.S. All of my (overly strong) opinions on using commit date are made > more valid when you realize anyone can set GIT_COMMITTER_DATE to get > an arbitrary commit date. In the way I would write things, you can *request* that date, but in case of a collision you might actually get one a few microseconds off that preserves its order relationship with your other commits. -- <a href="http://www.catb.org/~esr/">Eric S. Raymond</a> ^ permalink raw reply [flat|nested] 33+ messages in thread
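The "request a date, get a nearby unique one" behavior ESR describes amounts to a per-repository allocator that nudges colliding timestamps forward. A sketch of that idea in microseconds; the class is hypothetical and nothing like it exists in git:

```python
class TimestampAllocator:
    # Issue microsecond timestamps unique within one repository,
    # bumping forward on collision so relative order is preserved.
    # Hypothetical sketch of the proposed behavior, not a git feature.
    def __init__(self):
        self.issued = set()

    def allocate(self, requested_us):
        t = requested_us
        while t in self.issued:
            t += 1  # a few microseconds off, as ESR describes
        self.issued.add(t)
        return t

alloc = TimestampAllocator()
base = 1557948240_000000  # 2019-05-15 19:24:00 UTC in microseconds
assert alloc.allocate(base) == base
assert alloc.allocate(base) == base + 1  # collision resolved
assert alloc.allocate(base) == base + 2
```

Note that this only works with a single point of serialization per repository, which is exactly Stolee's objection: in a distributed system there is no such central authority at commit time.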
* Re: Finer timestamps and serialization in git 2019-05-16 0:28 ` Eric S. Raymond @ 2019-05-16 1:25 ` Derrick Stolee 2019-05-20 15:05 ` Michal Suchánek 0 siblings, 1 reply; 33+ messages in thread From: Derrick Stolee @ 2019-05-16 1:25 UTC (permalink / raw) To: esr; +Cc: Jason Pyeron, git On 5/15/2019 8:28 PM, Eric S. Raymond wrote: > Derrick Stolee <stolee@gmail.com>: >> What problem are you trying to solve where commit date is important? > > I don't know what Jason's are. I know what mine are. > > A. Portable commit identifiers > > 1. When I in-migrate a repository from (say) Subversion with > reposurgeon, I want to be able to patch change comments so that (say) > r2367 becomes a unique reference to its corresponding commit. I do > not want the kludge of appending a relic SVN-ID header to be *required*, > though some customers may choose that. Requirung that is an orthogonality > violation. Instead of using the free-form nature of a commit message to include links to an external VCS, you want a first-class data type in Git to provide this data? Not only is that backwards, it makes the link between the Git repo and the SVN repo weaker. How would you distinguish between a commit generated from the old SVN repo and a commit that was created directly in the Git repo without performing a lookup to the SVN repo based on (committer, timestamp)? > 2. Because I think in decadal timescales about infrastructure, I want > my commit references to be in a format that won't break when the history > is forward-migrated to the *next* VCS. That pretty much eliminates any > from of opaque hash. (Git itself will have a weaker version of this problem > when you change hash formats.) > > 3. Accordingly, I invented action stamps. This is an action stamp: > <esr@thyrsus.com!2019-05-15T20:01:15Z>. One reason I want timestamp > uniqueness is for action-stamp uniqueness. Looks like you have an excellent format for a backwards-facing link. 
Gerrit uses a commit-msg hook [1] to insert "Change-Id" tags into commit messages. You could probably do something similar. If you have control over _every_ client interacting with the repo, you could even have this interact with a central authority to give a unique stamp. > B. Unique canonical form of import-stream representation. > > Reposurgeon is a very complex piece of software with subtle failure > modes. I have a strong need to be able to regression-test its > operation. Right now there are important cases in which I can't do > that because (a) the order in which it writes commits and (b) how it > colors branches, are both phase-of-moon dependent. That is, the > algorithms may be deterministic but they're not documented and seem to > be dependent on variables that are hidden from me. > > Before import streams can have a canonical output order without hidden > variables (e.g. depending only on visible metadata) in practice, that > needs to be possible in principle. I've thought about this a lot and > not only are unique commit timestamps the most natural way to make > it possible, they're the only way conistent with the reality that > commit comments may be altered for various good reasons during > repository translation. If you are trying to debug or test something, why don't you serialize the input you are using for your test? >> P.S. All of my (overly strong) opinions on using commit date are made >> more valid when you realize anyone can set GIT_COMMITTER_DATE to get >> an arbitrary commit date. > > In the way I would write things, you can *request* that date, but in > case of a collision you might actually get one a few microseconds off > that preserves its order relationship with your other commits. As mentioned above, you need to make this request at the time the commit is created, and you'll need to communicate with a central authority. That goes against the distributed nature of Git. 
In my opinion, Git already gives you the flexibility to achieve the goals you are looking for. But changing a core data type to make your goals slightly more convenient is not a valuable exercise. -Stolee [1] https://gerrit-review.googlesource.com/Documentation/cmd-hook-commit-msg.html ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Finer timestamps and serialization in git 2019-05-16 1:25 ` Derrick Stolee @ 2019-05-20 15:05 ` Michal Suchánek 2019-05-20 16:36 ` Eric S. Raymond 0 siblings, 1 reply; 33+ messages in thread From: Michal Suchánek @ 2019-05-20 15:05 UTC (permalink / raw) To: Derrick Stolee; +Cc: esr, Jason Pyeron, git On Wed, 15 May 2019 21:25:46 -0400 Derrick Stolee <stolee@gmail.com> wrote: > On 5/15/2019 8:28 PM, Eric S. Raymond wrote: > > Derrick Stolee <stolee@gmail.com>: > >> What problem are you trying to solve where commit date is important? > > B. Unique canonical form of import-stream representation. > > > > Reposurgeon is a very complex piece of software with subtle failure > > modes. I have a strong need to be able to regression-test its > > operation. Right now there are important cases in which I can't do > > that because (a) the order in which it writes commits and (b) how it > > colors branches, are both phase-of-moon dependent. That is, the > > algorithms may be deterministic but they're not documented and seem to > > be dependent on variables that are hidden from me. > > > > Before import streams can have a canonical output order without hidden > > variables (e.g. depending only on visible metadata) in practice, that > > needs to be possible in principle. I've thought about this a lot and > > not only are unique commit timestamps the most natural way to make > > it possible, they're the only way conistent with the reality that > > commit comments may be altered for various good reasons during > > repository translation. > > If you are trying to debug or test something, why don't you serialize > the input you are using for your test? And that's the problem. Serialization of a git repository is not stable because there is no total ordering on commits. And for testing you need to serialize some 'before' and 'after' state and they can be totally different. 
Not because the repository state is totally different but because the serialization of the state is not stable. Thanks Michal ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Finer timestamps and serialization in git 2019-05-20 15:05 ` Michal Suchánek @ 2019-05-20 16:36 ` Eric S. Raymond 2019-05-20 17:22 ` Derrick Stolee 0 siblings, 1 reply; 33+ messages in thread From: Eric S. Raymond @ 2019-05-20 16:36 UTC (permalink / raw) To: Michal Suchánek; +Cc: Derrick Stolee, Jason Pyeron, git Michal Suchánek <msuchanek@suse.de>: > On Wed, 15 May 2019 21:25:46 -0400 > Derrick Stolee <stolee@gmail.com> wrote: > > > On 5/15/2019 8:28 PM, Eric S. Raymond wrote: > > > Derrick Stolee <stolee@gmail.com>: > > >> What problem are you trying to solve where commit date is important? > > > > B. Unique canonical form of import-stream representation. > > > > > > Reposurgeon is a very complex piece of software with subtle failure > > > modes. I have a strong need to be able to regression-test its > > > operation. Right now there are important cases in which I can't do > > > that because (a) the order in which it writes commits and (b) how it > > > colors branches, are both phase-of-moon dependent. That is, the > > > algorithms may be deterministic but they're not documented and seem to > > > be dependent on variables that are hidden from me. > > > > > > Before import streams can have a canonical output order without hidden > > > variables (e.g. depending only on visible metadata) in practice, that > > > needs to be possible in principle. I've thought about this a lot and > > > not only are unique commit timestamps the most natural way to make > > > it possible, they're the only way conistent with the reality that > > > commit comments may be altered for various good reasons during > > > repository translation. > > > > If you are trying to debug or test something, why don't you serialize > > the input you are using for your test? > > And that's the problem. Serialization of a git repository is not stable > because there is no total ordering on commits. 
And for testing you need > to serialize some 'before' and 'after' state and they can be totally > different. Not because the repository state is totally different but > because the serialization of the state is not stable. Yes, msuchanek is right - that is exactly the problem. Very well put. git fast-import streams *are* the serialization; they're what reposurgeon ingests and emits. The concrete problem I have is that there is no stable correspondence between a repository and one canonical fast-import serialization of it. That is a bigger pain in the ass than you will be able to imagine unless and until you try writing surgical tools yourself and discover that you can't write tests for them. -- <a href="http://www.catb.org/~esr/">Eric S. Raymond</a> ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Finer timestamps and serialization in git 2019-05-20 16:36 ` Eric S. Raymond @ 2019-05-20 17:22 ` Derrick Stolee 2019-05-20 21:32 ` Eric S. Raymond 0 siblings, 1 reply; 33+ messages in thread From: Derrick Stolee @ 2019-05-20 17:22 UTC (permalink / raw) To: esr, Michal Suchánek; +Cc: Jason Pyeron, git On 5/20/2019 12:36 PM, Eric S. Raymond wrote: > Michal Suchánek <msuchanek@suse.de>: >> On Wed, 15 May 2019 21:25:46 -0400 >> Derrick Stolee <stolee@gmail.com> wrote: >> >>> On 5/15/2019 8:28 PM, Eric S. Raymond wrote: >>>> Derrick Stolee <stolee@gmail.com>: >>>>> What problem are you trying to solve where commit date is important? >> >>>> B. Unique canonical form of import-stream representation. >>>> >>>> Reposurgeon is a very complex piece of software with subtle failure >>>> modes. I have a strong need to be able to regression-test its >>>> operation. Right now there are important cases in which I can't do >>>> that because (a) the order in which it writes commits and (b) how it >>>> colors branches, are both phase-of-moon dependent. That is, the >>>> algorithms may be deterministic but they're not documented and seem to >>>> be dependent on variables that are hidden from me. >>>> >>>> Before import streams can have a canonical output order without hidden >>>> variables (e.g. depending only on visible metadata) in practice, that >>>> needs to be possible in principle. I've thought about this a lot and >>>> not only are unique commit timestamps the most natural way to make >>>> it possible, they're the only way conistent with the reality that >>>> commit comments may be altered for various good reasons during >>>> repository translation. >>> >>> If you are trying to debug or test something, why don't you serialize >>> the input you are using for your test? >> >> And that's the problem. Serialization of a git repository is not stable >> because there is no total ordering on commits. 
And for testing you need >> to serialize some 'before' and 'after' state and they can be totally >> different. Not because the repository state is totally different but >> because the serialization of the state is not stable. > > Yes, msuchanek is right - that is exactly the problem. Very well put. > > git fast-import streams *are* the serialization; they're what reposurgeon > ingests and emits. The concrete problem I have is that there is no stable > correspondence between a repository and one canonical fast-import > serialization of it. > > That is a bigger pain in the ass than you will be able to imagine unless > and until you try writing surgical tools yourself and discover that you > can't write tests for them. What it sounds like you are doing is piping a 'git fast-import' process into reposurgeon, and testing that reposurgeon does the same thing every time. Of course this won't be consistent if 'git fast-import' isn't consistent. But what you should do instead is store a fixed file from one run of 'git fast-import' and send that file to reposurgeon for the repeated test. Don't rely on fast-import being consistent and instead use fixed input for your test. If reposurgeon is providing the input to _and_ consuming the output from 'git fast-import', then yes you will need to have at least one integration test that runs the full pipeline. But for regression tests covering complicated logic in reposurgeon, you're better off splitting the test (or mocking out 'git fast-import' with something that provides consistent output given fixed input). -Stolee ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Finer timestamps and serialization in git 2019-05-20 17:22 ` Derrick Stolee @ 2019-05-20 21:32 ` Eric S. Raymond 0 siblings, 0 replies; 33+ messages in thread From: Eric S. Raymond @ 2019-05-20 21:32 UTC (permalink / raw) To: Derrick Stolee; +Cc: Michal Suchánek, Jason Pyeron, git Derrick Stolee <stolee@gmail.com>: > What it sounds like you are doing is piping a 'git fast-import' process into > reposurgeon, and testing that reposurgeon does the same thing every time. > Of course this won't be consistent if 'git fast-import' isn't consistent. It's not actually import that fails to have consistent behavior, it's export. That is, if I fast-import a given stream, I get indistinguishable in-core commit DAGs every time. (It would be pretty alarming if this weren't true!) What I have no guarantee of is the other direction. In a multibranch repo, fast-export writes out branches in an order I cannot predict and which appears from the outside to be randomly variable. > But what you should do instead is store a fixed file from one run of > 'git fast-import' and send that file to reposurgeon for the repeated test. > Don't rely on fast-import being consistent and instead use fixed input for > your test. > > If reposurgeon is providing the input to _and_ consuming the output from > 'git fast-import', then yes you will need to have at least one integration > test that runs the full pipeline. But for regression tests covering complicated > logic in reposurgeon, you're better off splitting the test (or mocking out > 'git fast-import' with something that provides consistent output given > fixed input). And I'd do that... but the problem is more fundamental than you seem to understand. git fast-export can't ship a consistent output order because it doesn't retain metadata sufficient to totally order child branches. This is why I wanted unique timestamps. That would solve the problem, branch child commits of any node would be ordered by their commit date. 
But I had a realization just now. A much smaller change would do it. Suppose branch creations had creation stamps with a weak uniqueness property; for any given parent node, the creation stamps of all branches originating there are guaranteed to be unique? If that were true, there would be an implied total ordering of the repository. The rules for writing out a totally ordered dump would go like this: 1. At any given step there is a set of active branches and a cursor on each such branch. Each cursor points at a commit and caches the creation stamp of the current branch. 2. Look at the set of commits under the cursors. Write the oldest one. If multiple commits have the same commit date, break ties by their branch creation stamps. 3. Bump that cursor forward. If you're at a branch creation, it becomes multiple cursors, one for each child branch. If you're at a join, some cursors go away. Here's the clever bit - you make the creation stamp nothing but a counter that says "This was the Nth branch creation." And it is set by these rules: 4. If the branch creation stamp is undefined at branch creation time, number it in any way you like as long as each stamp is unique. A defined, documented order would be nice but is not necessary for streams to round-trip. 5. When writing an export stream, you always utter a reset at the point of branch creation. 6. When reading an import stream, the ordinal for a new branch is defined as the number of resets you have seen. Rules 5 and 6 together guarantee that branch creation ordinals round-trip through export streams. Thus, streams round-trip and I can have my regression tests with no change to git's visible interface at all! I could write this code. -- <a href="http://www.catb.org/~esr/">Eric S. Raymond</a> ^ permalink raw reply [flat|nested] 33+ messages in thread
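For metadata that is already linearized, rules 1 through 3 above reduce to sorting commits by the pair (commit date, branch-creation ordinal), with the ordinal breaking same-second ties. A toy sketch of just that tie-break; the field names are hypothetical, and a real implementation would walk per-branch cursors through the DAG as the rules describe:

```python
def canonical_order(commits):
    # Totally order commits by (commit date, branch-creation ordinal):
    # emit the oldest commit first, and break same-second ties by the
    # ordinal assigned when the commit's branch was created.
    # Sketch with hypothetical field names, not reposurgeon/git code.
    return sorted(commits, key=lambda c: (c["date"], c["branch_ordinal"]))

commits = [
    {"id": "c1", "date": 100, "branch_ordinal": 0},
    {"id": "c3", "date": 100, "branch_ordinal": 1},  # same second, later branch
    {"id": "c2", "date": 101, "branch_ordinal": 0},
]
order = [c["id"] for c in canonical_order(commits)]
assert order == ["c1", "c3", "c2"]
```

Because the ordinal is just "Nth reset seen in the stream" under rules 5 and 6, this sort key is recoverable from the stream itself, which is what makes the round-trip property hold.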
* Re: Finer timestamps and serialization in git 2019-05-15 20:28 ` Jason Pyeron 2019-05-15 21:14 ` Derrick Stolee @ 2019-05-15 23:40 ` Eric S. Raymond 2019-05-19 0:16 ` Philip Oakley 1 sibling, 1 reply; 33+ messages in thread From: Eric S. Raymond @ 2019-05-15 23:40 UTC (permalink / raw) To: Jason Pyeron; +Cc: git, 'Derrick Stolee' Jason Pyeron <jpyeron@pdinc.us>: > If we take the below example: > > committer Name <user@domain> 1557948240 -0400 > > and we follow the rule that: > > 1. any trailing zero after the decimal point MUST be omitted > 2. if there are no digits after the decimal point, it MUST be omitted > > This would allow: > > committer Name <user@domain> 1557948240 -0400 > committer Name <user@domain> 1557948240.12 -0400 > > but the following are never allowed: > > committer Name <user@domain> 1557948240. -0400 > committer Name <user@domain> 1557948240.000000 -0400 > > By following these rules, all previous commits' hashes are unchanged. Future commits made on the top of the second will look like old commit formats. Commits coming from "older" tools will produce valid and mergeable objects. The loss of precision has frustrated us several times as well. Yes, that's almost exactly what I came up with. I was concerned with upward compatibility in fast-export streams, which reposurgeon ingests and emits. But I don't quite understand your claim that there's no format breakage here, unless you're implying to me that timestamps are already stored in the git file system as variable-length strings. Do they really never get translated into time_t? Good news if so. -- <a href="http://www.catb.org/~esr/">Eric S. Raymond</a> ^ permalink raw reply [flat|nested] 33+ messages in thread
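Jason's two MUST rules can be condensed into a few lines (a hedged sketch; `format_commit_time` and its fraction-as-a-digit-string argument are illustrative inventions, not a proposed git API):

```python
def format_commit_time(epoch_seconds, fraction=""):
    """Render a commit timestamp under the proposed rules.

    Rule 1: any trailing zero after the decimal point is omitted.
    Rule 2: if no digits remain after the decimal point, the point
    itself is omitted, so whole-second times serialize exactly as
    they always have and existing commit hashes are untouched.
    """
    frac = fraction.rstrip("0")  # rule 1: strip trailing zeros
    # rule 2: emit the bare integer when no fractional digits remain
    return f"{epoch_seconds}.{frac}" if frac else str(epoch_seconds)
```

Under these rules `1557948240.` and `1557948240.000000` can never be emitted, only `1557948240`, which is why commits made on whole-second boundaries stay byte-identical to today's format.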
* Re: Finer timestamps and serialization in git 2019-05-15 23:40 ` Eric S. Raymond @ 2019-05-19 0:16 ` Philip Oakley 2019-05-19 4:09 ` Eric S. Raymond 0 siblings, 1 reply; 33+ messages in thread From: Philip Oakley @ 2019-05-19 0:16 UTC (permalink / raw) To: esr, Jason Pyeron; +Cc: git, 'Derrick Stolee' On 16/05/2019 00:40, Eric S. Raymond wrote: > Jason Pyeron <jpyeron@pdinc.us>: >> If we take the below example: >> >> committer Name <user@domain> 1557948240 -0400 >> >> and we follow the rule that: >> >> 1. any trailing zero after the decimal point MUST be omitted >> 2. if there are no digits after the decimal point, it MUST be omitted >> >> This would allow: >> >> committer Name <user@domain> 1557948240 -0400 >> committer Name <user@domain> 1557948240.12 -0400 >> >> but the following are never allowed: >> >> committer Name <user@domain> 1557948240. -0400 >> committer Name <user@domain> 1557948240.000000 -0400 >> >> By following these rules, all previous commits' hashes are unchanged. Future commits made on the top of the second will look like old commit formats. Commits coming from "older" tools will produce valid and mergeable objects. The loss of precision has frustrated us several times as well. > Yes, that's almost exactly what I came up with. I was concerned with upward > compatibility in fast-export streams, which reposurgeon ingests and emits. > > But I don't quite understand your claim that there's no format > breakage here, unless you're implying to me that timestamps are already > stored in the git file system as variable-length strings. Do they > really never get translated into time_t? Good news if so. Maybe just take some of the object ID bits as being the fractional time timestamp. They are effectively random, so should do a reasonable job of distinguishing commits in a repeatable manner, even with full round tripping via older git versions (as long as the sha1 replicates...)
As I understand it the commit timestamp is actually free text within the commit object (try `git cat-file -p <commit_object>`), so the issue is whether the particular git version is ready to accept the additional 'dot' fractional time notation (future versions could be extended, but I think old ones would reject them if I understand the test up thread - which would compromise backward compatibility and round tripping). -- Philip ^ permalink raw reply [flat|nested] 33+ messages in thread
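Philip's suggestion amounts to deriving a deterministic sub-second tie-breaker from the effectively random object ID rather than storing anything new. A sketch, assuming a hypothetical `pseudo_fraction` helper:

```python
def pseudo_fraction(oid_hex, bits=64):
    """Take the leading `bits` of an object ID (hex string) as a
    notional fractional-time component.  Nothing new is stored in the
    commit, so the scheme round-trips through older git unchanged;
    the cost is that uniqueness is only probabilistic."""
    nibbles = bits // 4  # one hex digit per 4 bits
    return int(oid_hex[:nibbles], 16)
```

Two commits with identical whole-second timestamps would then be ordered by their `pseudo_fraction` values, repeatably on every machine that sees the same objects.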
* Re: Finer timestamps and serialization in git 2019-05-19 0:16 ` Philip Oakley @ 2019-05-19 4:09 ` Eric S. Raymond 2019-05-19 10:07 ` Philip Oakley 0 siblings, 1 reply; 33+ messages in thread From: Eric S. Raymond @ 2019-05-19 4:09 UTC (permalink / raw) To: Philip Oakley; +Cc: Jason Pyeron, git, 'Derrick Stolee' Philip Oakley <philipoakley@iee.org>: > > But I don't quite understand your claim that there's no format > > breakage here, unless you're implying to me that timestamps are already > > stored in the git file system as variable-length strings. Do they > > really never get translated into time_t? Good news if so. > Maybe just take some of the object ID bits as being the fractional time > timestamp. They are effectively random, so should do a reasonable job of > distinguishing commits in a repeatable manner, even with full round tripping > via older git versions (as long as the sha1 replicates...) Huh. That's an interesting idea. Doesn't absolutely guarantee uniqueness, but even with the birthday effect the probability of collisions could be pulled arbitrarily low. > As I understand it the commit timestamp is actually free text within the > commit object (try `git cat-file -p <commit_object>`), so the issue is > whether the particular git version is ready to accept the additional 'dot' > fractional time notation (future versions could be extended, but I think old > ones would reject them if I understand the test up thread - which would > compromise backward compatibility and round tripping). Nobody seems to want to grapple with the fact that changing hash formats is as large a problem or larger, in exactly the same way. I'm not saying that changing the timestamp granularity justifies a format break. I'm saying that *since you're going to have one anyway*, the option to increase timestamp precision at the same time should not be missed. -- <a href="http://www.catb.org/~esr/">Eric S. Raymond</a> ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Finer timestamps and serialization in git 2019-05-19 4:09 ` Eric S. Raymond @ 2019-05-19 10:07 ` Philip Oakley 0 siblings, 0 replies; 33+ messages in thread From: Philip Oakley @ 2019-05-19 10:07 UTC (permalink / raw) To: esr; +Cc: Jason Pyeron, git, 'Derrick Stolee' Hi Eric, On 19/05/2019 05:09, Eric S. Raymond wrote: > Philip Oakley <philipoakley@iee.org>: >>> But I don't quite understand your claim that there's no format >>> breakage here, unless you're implying to me that timestamps are already >>> stored in the git file system as variable-length strings. Do they >>> really never get translated into time_t? Good news if so. >> Maybe just take some of the object ID bits as being the fractional time >> timestamp. They are effectively random, so should do a reasonable job of >> distinguishing commits in a repeatable manner, even with full round tripping >> via older git versions (as long as the sha1 replicates...) > Huh. That's an interesting idea. Doesn't absolutely guarantee uniqueness, > but even with the birthday effect the probability of collisions could be pulled > arbitrarily low. depends how many bits are in the 'nano-second' resolution long word ;-) see also > >> As I understand it the commit timestamp is actually free text within the >> commit object (try `git cat-file -p <commit_object>`), so the issue is >> whether the particular git version is ready to accept the additional 'dot' >> fractional time notation (future versions could be extended, but I think old >> ones would reject them if I understand the test up thread - which would >> compromise backward compatibility and round tripping). > Nobody seems to want to grapple with the fact that changing hash formats is > as large a problem or larger, in exactly the same way. > > I'm not saying that changing the timestamp granularity justifies a format > break. I'm saying that *since you're going to have one anyway*, the option > to increase timestamp precision at the same time should not be missed. 
It is probably the round tripping issue with a non-fixed format (for the time string) that will scupper the idea, plus the focus being primarily on the DAG as the fundamental lineage (which only gives partial order, which can be an issue for other VCS systems that are based on incremental changes rather than snapshots). The transition is well underway; see thread: https://public-inbox.org/git/20190212012256.1005924-1-sandals@crustytoothpaste.net/ for a patch series. The plan is at: https://github.com/git/git/blob/master/Documentation/technical/hash-function-transition.txt <https://github.com/git/git/blob/v2.19.0-rc0/Documentation/technical/hash-function-transition.txt>, some discussions at thread: https://public-inbox.org/git/878t4xfaes.fsf@evledraar.gmail.com/ etc. The timestamp problem is known; see yesterday's thread: https://public-inbox.org/git/20190518005412.n45pj5p2rrtm2bfj@glandium.org/ Given that the object ID should be immutable for a round trip, using 64 bits from the sha1-oid as a notional 'nano-second' time does give a reasonable birthday attack resistance of ~32 bits (i.e. >1M commits with identical whole second timestamps). [or choose the sha-256 once the transition is well underway] -- Philip ^ permalink raw reply [flat|nested] 33+ messages in thread
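Philip's "~32 bits of birthday attack resistance" figure can be sanity-checked with the standard approximation p ≈ 1 - exp(-n(n-1)/2^(bits+1)) for n commits sharing the same whole-second timestamp (a sketch; the function name is an invention for illustration):

```python
import math

def collision_probability(n, bits=64):
    """Birthday-bound approximation: the chance that any two of n
    commits with identical whole-second timestamps also collide in a
    `bits`-wide pseudo-fraction taken from the object ID."""
    return 1.0 - math.exp(-n * (n - 1) / (2.0 * 2 ** bits))
```

Even a million same-second commits collide with probability well under one in a million; the 50% point is only approached near 2^32 commits, which is where the "~32 bits" resistance estimate comes from.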
* Re: Finer timestamps and serialization in git 2019-05-15 20:16 ` Derrick Stolee 2019-05-15 20:28 ` Jason Pyeron @ 2019-05-15 23:32 ` Eric S. Raymond 2019-05-16 1:14 ` Derrick Stolee 2019-05-16 9:50 ` Ævar Arnfjörð Bjarmason 1 sibling, 2 replies; 33+ messages in thread From: Eric S. Raymond @ 2019-05-15 23:32 UTC (permalink / raw) To: Derrick Stolee; +Cc: git Derrick Stolee <stolee@gmail.com>: > On 5/15/2019 3:16 PM, Eric S. Raymond wrote: > > The deeper problem is that I want something from Git that I cannot > > have with 1-second granularity. That is: a unique timestamp on each > > commit in a repository. > > This is impossible in a distributed version control system like Git > (where the commits are immutable). No matter your precision, there is > a chance that two machines commit at the exact same moment on two different > machines and then those commits are merged into the same branch. It's easy to work around that problem. Each git daemon has to single-thread its handling of incoming commits at some level, because you need a lock on the file system to guarantee consistent updates to it. So if a commit comes in that would be the same as the date of the previous commit on the current branch, you bump the incoming commit timestamp. That's the simple case. The complicated case is checking for date collisions on *other* branches. But there are ways to make that fast, too. There's a very obvious one involving a presort that is O(log2 n) in the number of commits. I wouldn't have brought this up in the first place if I didn't have a pretty clear idea how to do it in code! > Even when you specify a committer, there are many environments where a set > of parallel machines are creating commits with the same identity. If those commit sets become the same commit in the final graph, this is not a problem for total ordering. > > Why do I want this? There are a number of reasons, all related to a > > mathematical concept called "total ordering". 
At present, commits in > > a Git repository only have partial ordering. > > This is true of any directed acyclic graph. If you want a total ordering > that is completely unambiguous, then you should think about maintaining > a linear commit history by requiring rebasing instead of merging. Excuse me, but your premise is incorrect. A git DAG isn't just "any" DAG. The presence of timestamps makes a total ordering possible. (I was a theoretical mathematician in a former life. This is all very familiar ground to me.) > > One consequence is that > > action stamps - the committer/date pairs I use as VCS-independent commit > > identifications in reposurgeon - are not unique. When a patch sequence > > is applied, it can easily happen fast enough to give several successive > > commits the same committer-ID and timestamp. > > Sorting by committer/date pairs sounds like an unhelpful idea, as that > does not take any graph topology into account. It happens that commits > can actually have an _earlier_ commit date than their parents. Yes, I'm aware of that. The uniqueness properties that make a total ordering desirable are not actually dependent on timestamp order coinciding with topo order. > Changing the granularity of timestamps requires changing the commit format, > which is probably a non-starter. That's why I started by noting that you're going to have to break the format anyway to move to an ECDSA hash (or whatever you end up using). I'm saying that *since you'll need to do that anyway*, it's a good time to think about making timestamps finer-grained and unique. -- <a href="http://www.catb.org/~esr/">Eric S. Raymond</a> ^ permalink raw reply [flat|nested] 33+ messages in thread
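Eric's proposed collision rule - bump an incoming commit's timestamp until it is unique within the repository - can be sketched in a few lines (illustrative only; the integer-microsecond ticks and the `seen` set standing in for a presorted timestamp index are assumptions, and the sketch sidesteps the objection raised elsewhere in the thread that bumping changes the object ID):

```python
def assign_unique_timestamp(incoming_us, seen):
    """Bump an incoming commit's timestamp (integer microseconds)
    forward one tick at a time until it collides with nothing already
    issued in this repository, then record it as taken."""
    ts = incoming_us
    while ts in seen:
        ts += 1  # one-microsecond bump
    seen.add(ts)
    return ts
```

With microsecond granularity the bump is almost always zero or one tick; with one-second granularity the same rule would routinely shift timestamps by visible amounts, which is the motivation for finer resolution.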
* Re: Finer timestamps and serialization in git 2019-05-15 23:32 ` Eric S. Raymond @ 2019-05-16 1:14 ` Derrick Stolee 2019-05-16 9:50 ` Ævar Arnfjörð Bjarmason 1 sibling, 0 replies; 33+ messages in thread From: Derrick Stolee @ 2019-05-16 1:14 UTC (permalink / raw) To: esr; +Cc: git On 5/15/2019 7:32 PM, Eric S. Raymond wrote: > Derrick Stolee <stolee@gmail.com>: >> On 5/15/2019 3:16 PM, Eric S. Raymond wrote: >>> The deeper problem is that I want something from Git that I cannot >>> have with 1-second granularity. That is: a unique timestamp on each >>> commit in a repository. >> >> This is impossible in a distributed version control system like Git >> (where the commits are immutable). No matter your precision, there is >> a chance that two machines commit at the exact same moment on two different >> machines and then those commits are merged into the same branch. > > It's easy to work around that problem. Each git daemon has to single-thread > its handling of incoming commits at some level, because you need a lock on the > file system to guarantee consistent updates to it. > > So if a commit comes in that would be the same as the date of the > previous commit on the current branch, you bump the incoming commit timestamp. This changes the commit, causing it to have a different object id, and now the client that pushed that commit disagrees with your machine on the history. > That's the simple case. The complicated case is checking for date > collisions on *other* branches. But there are ways to make that fast, > too. There's a very obvious one involving a presort that is O(log2 > n) in the number of commits. > > I wouldn't have brought this up in the first place if I didn't have a > pretty clear idea how to do it in code! > >> Even when you specify a committer, there are many environments where a set >> of parallel machines are creating commits with the same identity. 
> > If those commit sets become the same commit in the final graph, this is > not a problem for total ordering. > >>> Why do I want this? There are a number of reasons, all related to a >>> mathematical concept called "total ordering". At present, commits in >>> a Git repository only have partial ordering. >> >> This is true of any directed acyclic graph. If you want a total ordering >> that is completely unambiguous, then you should think about maintaining >> a linear commit history by requiring rebasing instead of merging. > > Excuse me, but your premise is incorrect. A git DAG isn't just "any" DAG. > The presence of timestamps makes a total ordering possible. > > (I was a theoretical mathematician in a former life. This is all very > familiar ground to me.) Same. But you seem to have a fundamental misunderstanding about the immutability of commits, which is core to how Git works. If you change a commit, then you get a new object id and now distributed copies don't agree on the history. >>> One consequence is that >>> action stamps - the committer/date pairs I use as VCS-independent commit >>> identifications in reposurgeon - are not unique. When a patch sequence >>> is applied, it can easily happen fast enough to give several successive >>> commits the same committer-ID and timestamp. >> >> Sorting by committer/date pairs sounds like an unhelpful idea, as that >> does not take any graph topology into account. It happens that commits >> can actually have an _earlier_ commit date than their parents. > > Yes, I'm aware of that. The uniqueness properties that make a total > ordering desirable are not actually dependent on timestamp order > coinciding with topo order. > >> Changing the granularity of timestamps requires changing the commit format, >> which is probably a non-starter. > > That's why I started by noting that you're going to have to break the > format anyway to move to an ECDSA hash (or whatever you end up using). 
> > I'm saying that *since you'll need to do that anyway*, it's a good time > to think about making timestamps finer-grained and unique. That change is difficult enough as it is. I don't think your goals justify making this more complicated. You are also not considering: * The in-memory data type now needs to be a floating-point type, or an even larger integer type using a different set of units. * This data type now affects our priority queues for commit walks, how we store the commit date in the commit-graph file, how we compute relative dates for 'git log' pretty formats. -Stolee ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Finer timestamps and serialization in git 2019-05-15 23:32 ` Eric S. Raymond 2019-05-16 1:14 ` Derrick Stolee @ 2019-05-16 9:50 ` Ævar Arnfjörð Bjarmason 2019-05-19 23:15 ` Jakub Narebski 1 sibling, 1 reply; 33+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2019-05-16 9:50 UTC (permalink / raw) To: esr; +Cc: Derrick Stolee, git On Thu, May 16 2019, Eric S. Raymond wrote: > Derrick Stolee <stolee@gmail.com>: >> On 5/15/2019 3:16 PM, Eric S. Raymond wrote: >> > The deeper problem is that I want something from Git that I cannot >> > have with 1-second granularity. That is: a unique timestamp on each >> > commit in a repository. >> >> This is impossible in a distributed version control system like Git >> (where the commits are immutable). No matter your precision, there is >> a chance that two machines commit at the exact same moment on two different >> machines and then those commits are merged into the same branch. > > It's easy to work around that problem. Each git daemon has to single-thread > its handling of incoming commits at some level, because you need a lock on the > file system to guarantee consistent updates to it. You don't need a daemon now to write commits to a repository. You can just add stuff to the object store, and then later flip the SHA-1 on a reference; we lock those individual references, but this sort of thing would require a global write lock. This would introduce huge concurrency caveats that are non-issues now. Dumb clients matter. Now you can e.g. have two libgit2 processes writing to ref A and B respectively in the same repo, and they never have to know about each other or care about IPC. Also, even if you have daemons accepting pushes they can now be on different computers sharing things over e.g. an NFS filesystem. Now you need some FS-based serialization protocol for commits and their timestamps. 
> So if a commit comes in that would be the same as the date of the > previous commit on the current branch, you bump the incoming commit timestamp. > That's the simple case. The complicated case is checking for date > collisions on *other* branches. But there are ways to make that fast, > too. There's a very obvious one involving a presort that is O(log2 > n) in the number of commits. What Derrick mentioned downthread of this "I rebase your pushes" being fundamentally un-git applies, but let's assume we can somehow get past that for the sake of argument. The model you're trying to impose here of "within a repo I want to serialize all X" just doesn't play with how git views the world. Git cares about graphs being serialized; it doesn't care about arbitrary sets of graphs. E.g. let's say I push a commit X to github, and now I want to push the same history to gitlab, I might be thwarted because they have some side-ref they themselves make (e.g. the PR or MR refs) which conflicts with this "timestamps must monotonically increase across all branches in a repo" view of the world. The only thing that matters in git in this regard is how individual refs behave; we then by convention tend to have a 1=1 mapping between those sets of refs and a repository, but in a lot of cases it's many=1. E.g. in cases where such a hosting site might have one underlying repo store exposed to multiple users via ref namespace prefixes. > I wouldn't have brought this up in the first place if I didn't have a > pretty clear idea how to do it in code! > >> Even when you specify a committer, there are many environments where a set >> of parallel machines are creating commits with the same identity. > > If those commit sets become the same commit in the final graph, this is > not a problem for total ordering. > >> > Why do I want this? There are a number of reasons, all related to a >> > mathematical concept called "total ordering". 
At present, commits in >> > a Git repository only have partial ordering. >> >> This is true of any directed acyclic graph. If you want a total ordering >> that is completely unambiguous, then you should think about maintaining >> a linear commit history by requiring rebasing instead of merging. > > Excuse me, but your premise is incorrect. A git DAG isn't just "any" DAG. > The presence of timestamps makes a total ordering possible. > > (I was a theoretical mathematician in a former life. This is all very > familiar ground to me.) > >> > One consequence is that >> > action stamps - the committer/date pairs I use as VCS-independent commit >> > identifications in reposurgeon - are not unique. When a patch sequence >> > is applied, it can easily happen fast enough to give several successive >> > commits the same committer-ID and timestamp. >> >> Sorting by committer/date pairs sounds like an unhelpful idea, as that >> does not take any graph topology into account. It happens that commits >> can actually have an _earlier_ commit date than their parents. > > Yes, I'm aware of that. The uniqueness properties that make a total > ordering desirable are not actually dependent on timestamp order > coinciding with topo order. > >> Changing the granularity of timestamps requires changing the commit format, >> which is probably a non-starter. > > That's why I started by noting that you're going to have to break the > format anyway to move to an ECDSA hash (or whatever you end up using). > > I'm saying that *since you'll need to do that anyway*, it's a good time > to think about making timestamps finer-grained and unique. We should really discuss proposed format changes separately from tacking them onto the SHA-256 transition, because as I noted upthread your premise that you need a format change for this isn't true. *If* this was a good idea it's something you can add to commit objects. 
And yeah, git-interpret-trailers is a bit of a kludge, which is why I mentioned you can add new headers to the format; this is e.g. how GPG signed commits work. Of course whether it makes any sense to add such a thing to the format is another matter; I'm not at all convinced, but that's a separate discussion from how it would be done. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Finer timestamps and serialization in git 2019-05-16 9:50 ` Ævar Arnfjörð Bjarmason @ 2019-05-19 23:15 ` Jakub Narebski 2019-05-20 0:45 ` Eric S. Raymond 0 siblings, 1 reply; 33+ messages in thread From: Jakub Narebski @ 2019-05-19 23:15 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason; +Cc: esr, Derrick Stolee, git Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: > On Thu, May 16 2019, Eric S. Raymond wrote: >> Derrick Stolee <stolee@gmail.com>: >>> On 5/15/2019 3:16 PM, Eric S. Raymond wrote: >>>> The deeper problem is that I want something from Git that I cannot >>>> have with 1-second granularity. That is: a unique timestamp on each >>>> commit in a repository. >>> >>> This is impossible in a distributed version control system like Git >>> (where the commits are immutable). No matter your precision, there is >>> a chance that two machines commit at the exact same moment on two different >>> machines and then those commits are merged into the same branch. >> >> It's easy to work around that problem. Each git daemon has to single-thread >> its handling of incoming commits at some level, because you need a lock on the >> file system to guarantee consistent updates to it. As far as I understand it this would slow down receiving new commits tremendously. Currently great care is taken to not have to parse the commit object during fetch or push if it is not necessary (thanks to things such as reachability bitmaps, see e.g. [1]). With this restriction you would need to parse each commit to get at commit timestamp and committer, check if the committer+timestamp is unique, and bump it if it is not. Also, bumping the timestamp means that the commit changed, means that its contents-based ID changed, means that all commits that follow it need to have their contents changed... And now you need to rewrite many commits. 
And you also break the assumptions that the same commits have the same contents (including date) and the same ID in different repositories (some of which may include additional branches, some of which may have been part of a network of related repositories, etc.). [1]: https://github.blog/2015-09-22-counting-objects/ http://githubengineering.com/counting-objects/ > You don't need a daemon now to write commits to a repository. You can > just add stuff to the object store, and then later flip the SHA-1 on a > reference; we lock those individual references, but this sort of thing > would require a global write lock. This would introduce huge concurrency > caveats that are non-issues now. > > Dumb clients matter. Now you can e.g. have two libgit2 processes writing > to ref A and B respectively in the same repo, and they never have to > know about each other or care about IPC. > > Also, even if you have daemons accepting pushes they can now be on > different computers sharing things over e.g. an NFS filesystem. Now you > need some FS-based serialization protocol for commits and their > timestamps. Also, performance matters. Especially for large repositories, and for a large number of repositories. >> So if a commit comes in that would be the same as the date of the >> previous commit on the current branch, you bump the incoming commit timestamp. You do realize that dates may not be monotonic (because of imperfections in clock synchronization), thus the fact that the date is different from the parent's does not mean that it is different from an ancestor's. >> That's the simple case. The complicated case is checking for date >> collisions on *other* branches. But there are ways to make that fast, >> too. There's a very obvious one involving a presort that is O(log2 >> n) in the number of commits. I don't think the performance hit you would get would be acceptable. [...] >>>> Why do I want this? There are a number of reasons, all related to a >>>> mathematical concept called "total ordering". 
At present, commits in >>>> a Git repository only have partial ordering. >>> >>> This is true of any directed acyclic graph. If you want a total ordering >>> that is completely unambiguous, then you should think about maintaining >>> a linear commit history by requiring rebasing instead of merging. >> >> Excuse me, but your premise is incorrect. A git DAG isn't just "any" DAG. >> The presence of timestamps makes a total ordering possible. >> >> (I was a theoretical mathematician in a former life. This is all very >> familiar ground to me.) Maybe in theory, when all clocks are synchronized. But not in practice. Shit happens. Just recently Mike Hommey wrote about the case he has to deal with: MH> I'm hitting another corner case in some other "weird" history, where MH> I have 500k commits all with the same date. [2]: https://public-inbox.org/git/20190518005412.n45pj5p2rrtm2bfj@glandium.org/t/#u -- Jakub Narębski ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Finer timestamps and serialization in git 2019-05-19 23:15 ` Jakub Narebski @ 2019-05-20 0:45 ` Eric S. Raymond 2019-05-20 9:43 ` Jakub Narebski 0 siblings, 1 reply; 33+ messages in thread From: Eric S. Raymond @ 2019-05-20 0:45 UTC (permalink / raw) To: Jakub Narebski Cc: Ævar Arnfjörð Bjarmason, Derrick Stolee, git Jakub Narebski <jnareb@gmail.com>: > As far as I understand it this would slow down receiving new commits > tremendously. Currently great care is taken to not have to parse the > commit object during fetch or push if it is not necessary (thanks to > things such as reachability bitmaps, see e.g. [1]). > > With this restriction you would need to parse each commit to get at > commit timestamp and committer, check if the committer+timestamp is > unique, and bump it if it is not. So, I'd want to measure that rather than simply assuming it's a blocker. Clocks are very cheap these days. > Also, bumping the timestamp means that the commit changed, means that its > contents-based ID changed, means that all commits that follow it need > to have their contents changed... And now you need to rewrite many > commits. What "commits that follow it?" By hypothesis, the incoming commit's timestamp is bumped (if it's bumped) when it's first added to a branch or branches, before there are following commits in the DAG. > And you also break the assumptions that the same commits have > the same contents (including date) and the same ID in different > repositories (some of which may include additional branches, some of > which may have been part of a network of related repositories, etc.). Wait...unless I completely misunderstand the hash-chain model, doesn't the hash of a commit depend on the hashes of its parents? If that's the case, commits cannot have portable hashes. If it's not, please correct me. But if it's not, how does your first objection make sense? > > You don't need a daemon now to write commits to a repository. 
You can > > just add stuff to the object store, and then later flip the SHA-1 on a > > reference; we lock those individual references, but this sort of thing > > would require a global write lock. This would introduce huge concurrency > > caveats that are non-issues now. > > > > Dumb clients matter. Now you can e.g. have two libgit2 processes writing > > to ref A and B respectively in the same repo, and they never have to > > know about each other or care about IPC. How do they know they're not writing to the same ref? What keeps *that* operation atomic? > You do realize that dates may not be monotonic (because of imperfections > in clock synchronization), thus the fact that the date is different from > the parent's does not mean that it is different from an ancestor's. Good point. That means the O(log2 n) version of the check has to be done all the time. Unfortunate. > >> That's the simple case. The complicated case is checking for date > >> collisions on *other* branches. But there are ways to make that fast, > >> too. There's a very obvious one involving a presort that is O(log2 > >> n) in the number of commits. > > I don't think the performance hit you would get would be acceptable. Again, it's bad practice to assume rather than measure. Human intuitions about this sort of thing are notoriously unreliable. > >> Excuse me, but your premise is incorrect. A git DAG isn't just "any" DAG. > >> The presence of timestamps makes a total ordering possible. > >> > >> (I was a theoretical mathematician in a former life. This is all very > >> familiar ground to me.) > > Maybe in theory, when all clocks are synchronized. My assertion does not depend on synchronized clocks, because it doesn't have to. If the timestamps in your repo are unique, there *is* a total ordering - by timestamp. What you don't get is guaranteed consistency with the topo ordering - that is you get no guarantee that a child's timestamp is greater than its parents'. That really would require a common timebase. 
But I don't need that stronger property, because the purpose of totally ordering the repo is to guarantee the uniqueness of action stamps. For that, all I need is to be able to generate a unique cookie for each commit that can be inserted in its action stamp. For my use cases that cookie should *not* be a hash, because hashes always break N years down. It should be an eternally stable product of the commit metadata. -- <a href="http://www.catb.org/~esr/">Eric S. Raymond</a> ^ permalink raw reply [flat|nested] 33+ messages in thread
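[Editor's note: a minimal sketch of the uniqueness rule ESR is asking for. This is a hypothetical helper, not reposurgeon's actual code: treat an action stamp as committer plus timestamp, and bump a colliding timestamp forward by one microsecond until the stamp is unique within the repository.]

```python
# Sketch of per-repository unique action stamps (hypothetical helper,
# NOT reposurgeon code). A stamp is committer + timestamp; a colliding
# stamp is bumped forward one microsecond until it is unique.
from datetime import datetime, timedelta, timezone

def unique_action_stamp(committer, when, seen):
    """Return an action stamp unique within this repo, bumping on collision."""
    while (committer, when) in seen:
        when += timedelta(microseconds=1)  # bump into the next free slot
    seen.add((committer, when))
    return "%s!%s" % (committer, when.strftime("%Y-%m-%dT%H:%M:%S.%fZ"))

seen = set()
t = datetime(2019, 5, 15, 20, 1, 15, tzinfo=timezone.utc)
a = unique_action_stamp("esr@thyrsus.com", t, seen)
b = unique_action_stamp("esr@thyrsus.com", t, seen)  # same second: gets bumped
```

With microsecond granularity the second stamp lands one tick later, so the two stamps differ even though the wall-clock second is identical.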
* Re: Finer timestamps and serialization in git 2019-05-20 0:45 ` Eric S. Raymond @ 2019-05-20 9:43 ` Jakub Narebski 2019-05-20 10:08 ` Ævar Arnfjörð Bjarmason ` (2 more replies) 0 siblings, 3 replies; 33+ messages in thread From: Jakub Narebski @ 2019-05-20 9:43 UTC (permalink / raw) To: Eric S. Raymond Cc: Ævar Arnfjörð Bjarmason, Derrick Stolee, git "Eric S. Raymond" <esr@thyrsus.com> writes: > Jakub Narebski <jnareb@gmail.com>: >> As far as I understand it this would slow down receiving new commits >> tremendously. Currently great care is taken to not have to parse the >> commit object during fetch or push if it is not necessary (thanks to >> things such as reachability bitmaps, see e.g. [1]). >> >> With this restriction you would need to parse each commit to get at >> commit timestamp and committer, check if the committer+timestamp is >> unique, and bump it if it is not. > > So, I'd want to measure that rather than simply assuming it's a blocker. > Clocks are very cheap these days. Clocks may be cheap, but parsing is not. You can receive new commits in the repository by creating them, and from another repository (via push or fetch). In the second case you often get many commits at once. In [1] it is described how using a "bitmap index" you can avoid parsing commits when deciding which objects to send to the client; they can be directly copied to the client (added to the packfile that is sent to the client). Thanks to this reachability bitmap (bit vector) the time to clone the Linux repository decreased from 57 seconds to 1.6 seconds. It is not a direct correspondence, but there most probably would be the same problem with requiring fractional timestamp+committer identity to be unique on the receiving side. [1]: https://githubengineering.com/counting-objects/ >> Also, bumping timestamp means that the commit changed, means that its >> contents-based ID changed, means that all commits that follow it need >> to have their contents changed...
And now you need to rewrite many >> commits. > > What "commits that follow it?" By hypothesis, the incoming commit's > timestamp is bumped (if it's bumped) when it's first added to a branch > or branches, before there are following commits in the DAG. Errr... the main problem is with the distributed nature of Git, i.e. when two repositories create different commits with the same committer+timestamp value. You receive commits on fetch or push, and you receive many commits at once. Say you have two repositories, and the history looks like this:

repo A: 1<---2<---a<---x<---c<---d <- master

repo B: 1<---2<---X<---3<---4 <- master

When you push from repo A to repo B, or fetch in repo B from repo A, you would get the following DAG of revisions:

repo B: 1<---2<---X<---3<---4 <- master
             \
              \--a<---x<---c<---d <- repo_A/master

Now let's assume that commits X and x have the same committer and the same fractional timestamp, while being different commits. Then you would need to bump the timestamp of 'x', changing the commit. This means that 'c' needs to be rewritten too, and 'd' also:

repo B: 1<---2<---X<---3<---4 <- master
             \
              \--a<---x'<--c'<--d' <- repo_A/master

And now for the final nail in the coffin of the Bazaar-esque idea of changing commits on arrival. Say that repository A created new commits, and pushed them to B. You would need to rewrite all future commits from this repository too, and you would always fetch all commits starting from the first "bumped" one:

repo A: 1<---2<---a<---x<---c<---d<---E <- master

This means a transfer of [<---x<---c<---d<---E], instead of [<--E], because 'x', 'c', and 'd' are missing in repo B.

repo B: 1<---2<---X<---3<---4 <- master
             \
              \--a<---x'<--c'<--d'<--E' <- repo_A/master

And there is yet another problem.
Let's assume that repo B created some history on top of bump-rewritten commits:

repo B: 1<---2<---X<---3<---4 <- master
             \
              \--a<---x'<--c'<--d'<--E' <- repo_A/master
                             \
                              \--5 <- next

Then if in repo A you fetch from repo B (remember, in Git there is no concept of a central repository), you would get the following history:

             /--X'<--3'<--4' <- repo_B/master
            /
repo A: 1<---2<---a<---x<---c<---d<---E <- master
             \
              \---x'<--c'
                      \
                       \--5 <- repo_B/master

(because 'X' is now incoming, it needs to be "bumped", therefore changing 3' and 4'). The history without all this rewriting looks like this:

             /--X<---3<---4' <- repo_B/master
            /
repo A: 1<---2<---a<---x<---c<---d<---E <- master
                            \
                             \--5 <- repo_B/master

Notice the difference? >> And you also break the assumptions that the same commits have >> the same contents (including date) and the same ID in different >> repositories (some of which may include additional branches, some of >> which may have been part of a network of related repositories, etc.). See repo A and repo B in the above example. > Wait...unless I completely misunderstand the hash-chain model, doesn't the > hash of a commit depend on the hashes of its parents? If that's the case, > commits cannot have portable hashes. If it's not, please correct me. > > But if it's not, how does your first objection make sense? The hash of a commit depends on the hashes of its parents (Merkle tree). That is why signing a commit (or a tag pointing to the commit) signs the whole history of the commit. >>> You don't need a daemon now to write commits to a repository. You can >>> just add stuff to the object store, and then later flip the SHA-1 on a >>> reference, we lock those individual references, but this sort of thing >>> would require a global write lock. This would introduce huge concurrency >>> caveats that are non-issues now. >>> >>> Dumb clients matter. Now you can e.g.
have two libgit2 processes writing >>> to ref A and B respectively in the same repo, and they never have to >>> know about each other or care about IPC. > > How do they know they're not writing to the same ref? What keeps > *that* operation atomic? Because different refs are stored in different files (at least for "live" refs that are stored in loose ref format). The lock is taken per ref (to update the ref and its reflog in sync); there is no need to take a global lock on all refs. >> You do realize that dates may not be monotonic (because of imperfections >> in clock synchronization), thus the fact that the date is different from >> the parent's does not mean that it is different from an ancestor's. > > Good point. That means the O(log2 n) version of the check has to be done > all the time. Unfortunate. Especially with around 1 million commits (Linux kernel, Chromium, AOSP), or even 3M commits (MS Windows repository). >>>>> That's the simple case. The complicated case is checking for date >>>>> collisions on *other* branches. But there are ways to make that fast, >>>>> too. There's a very obvious one involving a presort that is O(log2 >>>>> n) in the number of commits. >>> >>> I don't think the performance hit you would get would be acceptable. >> >> Again, it's bad practice to assume rather than measure. Human intuitions >> about this sort of thing are notoriously unreliable. Techniques created to handle very large repositories (with respect to the number of commits) that make it possible for Git to avoid parsing commit objects, namely the bitmap index (for 'git fetch'/'clone') and the serialized commit graph (for 'git log'), lead to _significant_ performance improvements. The performance changes from "waiting for Git to finish" to "done in the blink of an eye" (well, almost). >>>>> Excuse me, but your premise is incorrect.
This is all very >>>> familiar ground to me.) >> >> Maybe in theory, when all clock are synchronized. > > My assertion does not depend on synchronized clocks, because it doesn't have to. > > If the timestamps in your repo are unique, there *is* a total ordering - > by timestamp. What you don't get is guaranteed consistency with the > topo ordering - that is you get no guarantee that a child's timestamp > is greater than its parents'. That really would require a common > timebase. > > But I don't need that stronger property, because the purpose of > totally ordering the repo is to guarantee the uniqueness of action > stamps. For that, all I need is to be able to generate a unique cookie > for each commit that can be inserted in its action stamp. For cookie to be unique among all forks / clones of the same repository you need either centralized naming server, or for the cookie to be based on contents of the commit (i.e. be a hash function). > For my use cases > that cookie should *not* be a hash, because hashes always break N years > down. It should be an eternally stable product of the commit metadata. Well, the idea for SHA-1 <--> NewHash == SHA-256 transition is to avoid having a flag day, and providing full interoperability between repositories and Git installations using the old hash ad using new hash^1. This will be done internally by using SHA-1 <--> SHA-256 mapping. So after the transition all you need is to publish this mapping somewhere, be it with Internet Archive or Software Heritage. Problem solved. P.S. Could you explain to me how one can use action stamp, e.g. <esr@thyrsus.com!2019-05-15T20:01:15.473209800Z>, to quickly find the commit it refers to? With SHA-1 id you have either filesystem pathname or the index file for pack to find it _fast_. Footnotes: ---------- 1. That is why where would be no "major format break", thus no place for incompatibile format changes. Best, -- Jakub Narębski ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Finer timestamps and serialization in git 2019-05-20 9:43 ` Jakub Narebski @ 2019-05-20 10:08 ` Ævar Arnfjörð Bjarmason 2019-05-20 12:40 ` Jeff King 2019-05-20 14:14 ` Eric S. Raymond 2 siblings, 0 replies; 33+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2019-05-20 10:08 UTC (permalink / raw) To: Jakub Narebski; +Cc: Eric S. Raymond, Derrick Stolee, git On Mon, May 20 2019, Jakub Narebski wrote: > "Eric S. Raymond" <esr@thyrsus.com> writes: >> Jakub Narebski <jnareb@gmail.com>: > >>> As far as I understand it this would slow down receiving new commits >>> tremendously. Currently great care is taken to not have to parse the >>> commit object during fetch or push if it is not necessary (thanks to >>> things such as reachability bitmaps, see e.g. [1]). >>> >>> With this restriction you would need to parse each commit to get at >>> commit timestamp and committer, check if the committer+timestamp is >>> unique, and bump it if it is not. >> >> So, I'd want to measure that rather than simply assuming it's a blocker. >> Clocks are very cheap these days. > > Clocks may be cheap, but parsing is not. > > You can receive new commits in the repository by creating them, and from > other repository (via push or fetch). In the second case you often get > many commits at once. > > In [1] it is described how using "bitmap index" you can avoid parsing > commits when deciding which objects to send to the client; they can be > directly copied to the client (added to the packfile that is sent to > client). Thanks to this reachability bitmap (bit vector) the time to > clone Linux repository decreased from 57 seconds to 1.6 seconds. > > It is not a direct correspondence, but there most probably would be the > same problem with requiring fractional timestamp+committer identity to > be unique on the receiving side. 
> > [1]: https://githubengineering.com/counting-objects/ We're in violent agreement about the general viability of ESR's proposed plan, but just a side-note on this point. I don't think this is right. I.e. I don't think a hypothetical version of git that guarantees monotonically increasing timestamps will be slow in *this* regard. For accepting pushes we already unpack all the commits / content / hash it to perform fsck checks, which is why screwing with the commit timestamp will fail on push: https://public-inbox.org/git/87zhnnv0b8.fsf@evledraar.gmail.com/ Same on the client with fetches; although transfer.fsckObjects isn't on there, we do most of the work anyway for hashing & basic validation purposes. The bitmaps wouldn't be affected because they're computed after-the-fact on the basis of reachability, whereas validating increasing timestamps for a single branch is cheap: you just look at each A..B push incrementally and see if the timestamps are increasing and past A's parent. It's trickier if you're trying to make the same guarantee for *all* ref updates in a given repo (and locking caveats etc. have been discussed elsewhere), but not *that* much of a PITA. We'd need to compare "new" packs/loose objects against the new push, and an obvious shortcut in such a schema, if you required a global lock anyway, would be for the process taking the lock to write out "this is the current max timestamp" when finished. In *this* case that's a long way down the journey into crazytown :) But it is interesting to think about in general, because with e.g. the commit-graph we have a set of commits that are "optimized" in some side-index, so it becomes useful for many algorithms to be able to ask "what is the current set of unoptimized commits". Once you have that, and can keep the size of it down with "gc", many algorithms that require graph traversal become possible, because your O(n) of needing to consider the "n" unoptimized commits is small enough vs.
the bulk of "optimized" commits as to not matter.

^ permalink raw reply [flat|nested] 33+ messages in thread
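[Editor's note: the per-branch check Ævar describes - walk the pushed range and verify committer timestamps never move backwards - can be sketched as follows. This is a hypothetical standalone check, not an actual git hook.]

```python
# Sketch of an incremental monotonicity check over a pushed A..B range
# (hypothetical, not an actual git hook). Each entry is the commit's
# committer timestamp paired with its parent's timestamp (None for the
# range boundary); a child may never claim to be older than its parent.
def timestamps_monotonic(commits):
    """commits: pushed range, oldest first, as (ts, parent_ts_or_None)."""
    for ts, parent_ts in commits:
        if parent_ts is not None and ts < parent_ts:
            return False  # clock went backwards relative to the parent
    return True

ok = timestamps_monotonic([(100, None), (101, 100), (101, 101)])
bad = timestamps_monotonic([(100, None), (99, 100)])  # child older than parent
```

Note that, per Jakub's earlier objection, equal timestamps and unsynchronized clocks mean this check alone cannot guarantee strict ordering against *ancestors*, only against direct parents.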
* Re: Finer timestamps and serialization in git 2019-05-20 9:43 ` Jakub Narebski 2019-05-20 10:08 ` Ævar Arnfjörð Bjarmason @ 2019-05-20 12:40 ` Jeff King 2019-05-20 14:14 ` Eric S. Raymond 2 siblings, 0 replies; 33+ messages in thread From: Jeff King @ 2019-05-20 12:40 UTC (permalink / raw) To: Jakub Narebski Cc: Eric S. Raymond, Ævar Arnfjörð Bjarmason, Derrick Stolee, git On Mon, May 20, 2019 at 11:43:14AM +0200, Jakub Narebski wrote: > You can receive new commits in the repository by creating them, and from > other repository (via push or fetch). In the second case you often get > many commits at once. > > In [1] it is described how using "bitmap index" you can avoid parsing > commits when deciding which objects to send to the client; they can be > directly copied to the client (added to the packfile that is sent to > client). Thanks to this reachability bitmap (bit vector) the time to > clone Linux repository decreased from 57 seconds to 1.6 seconds. No, this is mixing up sending and receiving. On the sending side, we try very hard not to open up objects if we can avoid it (using tricks like reachability bitmaps helps us quickly decide what to send, and reusing the on-disk packfile data lets us send out objects without decompressing them). But on the receiving side, we do not trust the sender at all. The protocol specifically does not send the sha1 of any object. The receiver instead inflates every object it gets and computes the object hash itself. And then on top of that, we traverse the commit graph to make sure that the server sent us all of the objects we need to have a complete graph. So adding any extra object-quality checks on the receiving side would not really change that equation. But I do otherwise agree with your mail that the general idea of having the receiver _change_ the incoming objects is going to lead to a world of headaches. -Peff ^ permalink raw reply [flat|nested] 33+ messages in thread
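[Editor's note: the receiver-side hashing Peff describes is easy to demonstrate. Git's object ID is a hash over a "<type> <size>\0" header plus the raw body, which the receiver recomputes itself rather than trusting anything the sender claims. A sketch using the current SHA-1 scheme:]

```python
# How a receiver recomputes an object ID itself: git hashes the header
# "<type> <size>\0" plus the raw body; IDs are never taken from the wire.
import hashlib

def git_object_id(obj_type, body):
    """Compute a git object ID over header + body, as the receiver does."""
    header = b"%s %d\x00" % (obj_type, len(body))
    return hashlib.sha1(header + body).hexdigest()

# The well-known ID of the empty blob:
assert git_object_id(b"blob", b"") == "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
```

Because the ID is derived from the bytes received, a sender who tampered with a commit's timestamp (or anything else) simply produces a different object, which is why "screwing with the commit timestamp will fail" the fsck checks mentioned upthread.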
* Re: Finer timestamps and serialization in git 2019-05-20 9:43 ` Jakub Narebski 2019-05-20 10:08 ` Ævar Arnfjörð Bjarmason 2019-05-20 12:40 ` Jeff King @ 2019-05-20 14:14 ` Eric S. Raymond 2019-05-20 14:41 ` Michal Suchánek ` (2 more replies) 2 siblings, 3 replies; 33+ messages in thread From: Eric S. Raymond @ 2019-05-20 14:14 UTC (permalink / raw) To: Jakub Narebski Cc: Ævar Arnfjörð Bjarmason, Derrick Stolee, git Jakub Narebski <jnareb@gmail.com>: > > What "commits that follow it?" By hypothesis, the incoming commit's > > timestamp is bumped (if it's bumped) when it's first added to a branch > > or branches, before there are following commits in the DAG. > > Errr... the main problem is with the distributed nature of Git, i.e. when > two repositories create different commits with the same > committer+timestamp value. You receive commits on fetch or push, and > you receive many commits at once. > > Say you have two repositories, and the history looks like this:
>
> repo A: 1<---2<---a<---x<---c<---d <- master
>
> repo B: 1<---2<---X<---3<---4 <- master
>
> When you push from repo A to repo B, or fetch in repo B from repo A you > would get the following DAG of revisions
>
> repo B: 1<---2<---X<---3<---4 <- master
>              \
>               \--a<---x<---c<---d <- repo_A/master
>
> Now let's assume that commits X and x have the same committer and the > same fractional timestamp, while being different commits. Then you > would need to bump the timestamp of 'x', changing the commit. This means > that 'c' needs to be rewritten too, and 'd' also:
>
> repo B: 1<---2<---X<---3<---4 <- master
>              \
>               \--a<---x'<--c'<--d' <- repo_A/master

Of course that's true. But you were talking as though all those commits have to be modified *after they're in the DAG*, and that's not the case.
The only way c would need to be modified is if bumping x's timestamp caused an actual collision with c's. I don't see any conceptual problem with this. You appear to me to be confusing two issues. Yes, bumping timestamps would mean that all hashes downstream in the Merkle tree would be generated differently, even when there's no timestamp collision, but so what? The hash of a commit isn't portable to begin with - it can't be, because AFAIK there's no guarantee that the ancestry parts of the DAG in two repositories where copies of it live contain all the same commits and topo relationships. > And now for the final nail in the coffin of the Bazaar-esque idea of > changing commits on arrival. Say that repository A created new commits, > and pushed them to B. You would need to rewrite all future commits from > this repository too, and you would always fetch all commits starting > from the first "bumped" I don't see how the second clause of your last sentence follows from the first unless commit hashes really are supposed to be portable across repositories. And I don't see how that can be so given that 'git am' exists and a branch can thus be rooted at a different place after it is transported and integrated. > The hash of a commit depends on the hashes of its parents (Merkle tree). > That is why signing a commit (or a tag pointing to the commit) signs the > whole history of the commit. That's what I thought. > > How do they know they're not writing to the same ref? What keeps > *that* operation atomic? > > Because different refs are stored in different files (at least for > "live" refs that are stored in loose ref format). The lock is taken per > ref (to update the ref and its reflog in sync); there is no need to take > a global lock on all refs. OK, that makes sense. > For a cookie to be unique among all forks / clones of the same repository > you need either a centralized naming server, or for the cookie to be based > on the contents of the commit (i.e. to be a hash function).
I don't need uniqueness across all forks, only uniqueness *within the repo*. I want this for two reasons: (1) so that action stamps are unique, (2) so that there is a unique canonical ordering of commits in a fast-export stream. (Without that second property there are surgical cases I can't regression-test.) > > For my use cases > > that cookie should *not* be a hash, because hashes always break N years > > down. It should be an eternally stable product of the commit metadata. > > Well, the idea for the SHA-1 <--> NewHash == SHA-256 transition is to avoid > having a flag day, and providing full interoperability between > repositories and Git installations using the old hash and those using the new > hash^1. This will be done internally by using a SHA-1 <--> SHA-256 > mapping. So after the transition all you need is to publish this > mapping somewhere, be it with the Internet Archive or Software Heritage. > Problem solved. I don't see it. How does this prevent old clients from barfing on new repositories? > P.S. Could you explain to me how one can use an action stamp, e.g. > <esr@thyrsus.com!2019-05-15T20:01:15.473209800Z>, to quickly find the > commit it refers to? With a SHA-1 id you have either a filesystem pathname > or the pack index file to find it _fast_. For the purposes that make action stamps important I don't really care about performance much (though there are fairly obvious ways to achieve it). My goal is to ensure that revision histories (e.g. in their import-stream format) are forward-portable to future VCSes without requiring any data outside the stream itself. Please remember that I'm accustomed to maintaining infrastructure on decadal timescales - I wrote code in the 1980s that is still in wide use and I expect some of the code I'm writing now to be still in use thirty years from now. This gives me a different perspective on the fragility of things like SHA-1 hashes.
From a decadal-scale POV any particular crypto-hash format is unstable garbage, and having such hashes in change comments is a maintainability disaster waiting to happen. Action stamps are specifically designed so that they're pointers to commits that don't require anything but the target commit's import/export-stream metadata to resolve. Your idea of an archived hash registry makes me extremely nervous; I think it's too fragile to trust. So let me back up a step. I will cheerfully drop advocating bumping timestamps if anyone can show me a different way to define a per-commit reference cookie that (a) is unique within its repo, and (b) only requires metadata visible in the fast-export representation of the commit. -- <a href="http://www.catb.org/~esr/">Eric S. Raymond</a> ^ permalink raw reply [flat|nested] 33+ messages in thread
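[Editor's note: one of the "fairly obvious ways" to make action-stamp lookup fast, sketched hypothetically - the commit data shown is invented for illustration. A single pass over commit metadata builds a side index from stamp to hash, after which each lookup is a dictionary probe.]

```python
# Hypothetical side index for resolving action stamps quickly: one pass
# over (hash, committer, timestamp) metadata, then O(1) lookups. The
# commit hash shown is an invented placeholder, not a real object ID.
def build_stamp_index(commits):
    """commits: iterable of (commit_hash, committer, timestamp_string)."""
    index = {}
    for commit_hash, committer, ts in commits:
        index["%s!%s" % (committer, ts)] = commit_hash
    return index

idx = build_stamp_index([
    ("9f3a...", "esr@thyrsus.com", "2019-05-15T20:01:15.473209800Z"),
])
```

This answers Jakub's P.S. at the cost of a derived index (analogous to a pack .idx): the stamp itself stays an eternally stable product of the metadata, and the perishable hash lives only in the locally rebuildable index.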
* Re: Finer timestamps and serialization in git 2019-05-20 14:14 ` Eric S. Raymond @ 2019-05-20 14:41 ` Michal Suchánek 2019-05-20 22:18 ` Philip Oakley 2019-05-20 21:38 ` Elijah Newren 2019-05-21 0:08 ` Jakub Narebski 2 siblings, 1 reply; 33+ messages in thread From: Michal Suchánek @ 2019-05-20 14:41 UTC (permalink / raw) To: Eric S. Raymond Cc: Jakub Narebski, Ævar Arnfjörð Bjarmason, Derrick Stolee, git On Mon, 20 May 2019 10:14:17 -0400 "Eric S. Raymond" <esr@thyrsus.com> wrote: > Jakub Narebski <jnareb@gmail.com>: > > > What "commits that follow it?" By hypothesis, the incoming commit's > > > timestamp is bumped (if it's bumped) when it's first added to a branch > > > or branches, before there are following commits in the DAG. > > > > Errr... the main problem is with the distributed nature of Git, i.e. when > > two repositories create different commits with the same > > committer+timestamp value. You receive commits on fetch or push, and > > you receive many commits at once. > > > > Say you have two repositories, and the history looks like this:
> >
> > repo A: 1<---2<---a<---x<---c<---d <- master
> >
> > repo B: 1<---2<---X<---3<---4 <- master
> >
> > When you push from repo A to repo B, or fetch in repo B from repo A you > > would get the following DAG of revisions
> >
> > repo B: 1<---2<---X<---3<---4 <- master
> >              \
> >               \--a<---x<---c<---d <- repo_A/master
> >
> > Now let's assume that commits X and x have the same committer and the > > same fractional timestamp, while being different commits. Then you > > would need to bump the timestamp of 'x', changing the commit. This means > > that 'c' needs to be rewritten too, and 'd' also:
> >
> > repo B: 1<---2<---X<---3<---4 <- master
> >              \
> >               \--a<---x'<--c'<--d' <- repo_A/master
>
> Of course that's true. But you were talking as though all those commits > have to be modified *after they're in the DAG*, and that's not the case.
> If any timestamp has to be modified, it only has to happen *once*, at the
> time its commit enters the repo.

And that's where you get it wrong. Git is *distributed*. There is more than one repository. Each repository has its own DAG that is completely unrelated to the other repositories and their DAGs. So when you take your history and push it to another repository and the timestamps change as the result, what ends up in the other repository is not the history you pushed. So the repositories diverge and you no longer know what is what.

> Actually, in the normal case only x would need to be modified. The only
> way c would need to be modified is if bumping x's timestamp caused an
> actual collision with c's.
>
> I don't see any conceptual problem with this. You appear to me to be
> confusing two issues. Yes, bumping timestamps would mean that all
> hashes downstream in the Merkle tree would be generated differently,
> even when there's no timestamp collision, but so what? The hash of a
> commit isn't portable to begin with - it can't be, because AFAIK
> there's no guarantee that the ancestry parts of the DAG in two
> repositories where copies of it live contain all the same commits and
> topo relationships.

If you push from one repository to another repository now, you get the exact same history with the exact same hashes. So the hashes are portable across repositories that share history. With your proposed change hashes can be modified on push/pull, so repositories no longer share history and hashes become non-portable. That's why it is a bad idea.

The commits are currently identified by the hash, so it must not change during push/pull. Changing the identifier to something else (e.g. a content hash without (some) metadata) might be useful to make the identifier more stable, but will bring other problems when you need two different identifiers for the same content to include it in two unrelated histories.

Thanks

Michal

^ permalink raw reply [flat|nested] 33+ messages in thread
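Suchánek's point — that a single timestamp bump rewrites every descendant identifier — falls directly out of hash chaining, and a toy content-addressed commit chain (a simplified stand-in for Git's real object format) makes it concrete:

```python
import hashlib

def commit_id(parent_id, committer, timestamp_ns):
    # Toy stand-in for Git's object hashing: the id covers the parent id,
    # so any change to an ancestor changes every descendant id too.
    data = "parent %s\ncommitter %s %d\n" % (parent_id, committer, timestamp_ns)
    return hashlib.sha256(data.encode()).hexdigest()[:12]

def build_chain(timestamps):
    # Linear history: each commit's id depends on its parent's id.
    ids, parent = [], "root"
    for ts in timestamps:
        parent = commit_id(parent, "esr@thyrsus.com", ts)
        ids.append(parent)
    return ids

original = build_chain([100, 200, 300])
bumped   = build_chain([101, 200, 300])   # bump only the first timestamp

# The bump rewrites *every* commit after it, even though only one
# timestamp changed -- which is why the two repositories would stop
# sharing any history from that point on.
assert original[0] != bumped[0]
assert original[1] != bumped[1] and original[2] != bumped[2]
```

This is the "x' / c' / d'" cascade in the diagram above: x' is a different commit than x, so everything built on it differs as well.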
* Re: Finer timestamps and serialization in git 2019-05-20 14:41 ` Michal Suchánek @ 2019-05-20 22:18 ` Philip Oakley 0 siblings, 0 replies; 33+ messages in thread
From: Philip Oakley @ 2019-05-20 22:18 UTC (permalink / raw)
To: Michal Suchánek, Eric S. Raymond
Cc: Jakub Narebski, Ævar Arnfjörð Bjarmason, Derrick Stolee, git

Hi,

On 20/05/2019 15:41, Michal Suchánek wrote:
>> But you were talking as though all those commits
>> have to be modified *after they're in the DAG*, and that's not the case.
>> If any timestamp has to be modified, it only has to happen *once*, at the
>> time its commit enters the repo.
> And that's where you get it wrong. Git is *distributed*. There is more
> than one repository. Each repository has its own DAG

So far so good. In fact it is the change to 'distributed' that has ruined Eric's action stamps, which assume that the 'time' came from a single central server.

> that is completely
> unrelated to the other repositories and their DAGs.

This bit will confuse. It is only the new commits in the different repositories that are 'unrelated'. Their common history commits are identical sha1 values, and the DAG links back to their common root commit(s).

> So when you take
> your history and push it to another repository and the timestamps
> change as the result what ends up in the other repository is not the
> history you pushed. So the repositories diverge and you no longer know
> what is what.
>
If the sender tweaks their timestamps at commit time, then no one 'knows'. It's just a minor bit of clock drift/slop. But once they have a cascaded history which has been published (and used) you are locked into that, as noted previously.

The significant change is the loss of the central server and the referential nature of its clock time stamp. If the action stamp is just a useful temporary intermediary in a transfer then cheats are possible (e.g. some randomising hash of a definitive part of the commit).
But if the action stamps are meant to be permanent and re-generatable for a round trip from a central change-set-based server to Git, and then back again, repeatably, without divergence, loss, or change, then it is not going to happen reliably. To do so requires the creation of a fixed total order (by design - single clock) from commits that are only partially ordered (by design! - a DAG rather than multiple unsynchronized user clocks). For backward compatibility Git only has (and only needs) 1-second resolution.

The multi-decade/century VCS idea of a master artifact and then near copies (since kaolin and linen drawings, blueprints, ..) with central _control_ is being replaced by zero-cost perfect replication, authentication by hash, and the distribution of control (of artifact entry into the VCS) from managers to _users_. Managers simply select and decide on the artifact quality and authorize the use of a hash.

Most folks haven't really looked below the surface of what it is that makes Git and DVCS so successful, and it's not just the Linus effect. The previous certainties (e.g. the idea of a total order to allow logging by change-set) have gone.

--
Philip

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Finer timestamps and serialization in git 2019-05-20 14:14 ` Eric S. Raymond 2019-05-20 14:41 ` Michal Suchánek @ 2019-05-20 21:38 ` Elijah Newren 2019-05-20 23:12 ` Eric S. Raymond 2019-05-21 0:08 ` Jakub Narebski 2 siblings, 1 reply; 33+ messages in thread From: Elijah Newren @ 2019-05-20 21:38 UTC (permalink / raw) To: esr Cc: Jakub Narebski, Ævar Arnfjörð Bjarmason, Derrick Stolee, Git Mailing List Hi, On Mon, May 20, 2019 at 11:09 AM Eric S. Raymond <esr@thyrsus.com> wrote: > > For cookie to be unique among all forks / clones of the same repository > > you need either centralized naming server, or for the cookie to be based > > on contents of the commit (i.e. be a hash function). > > I don't need uniquess across all forks, only uniqueness *within the repo*. You've lost me. In other places you stated you didn't want to use the commit hash, and now you say this. If you only care about uniqueness within the current copy of the repo and don't care about uniqueness across forks (i.e. clones or copies that exist now or in the future -- including copies stored using SHA256), then what's wrong with using the commit hash? > I want this for two reasons: (1) so that action stamps are unique, (2) > so that there is a unique canonical ordering of commits in a fast export > stream. A stable ordering of commits in a fast-export stream might be a cool feature. But I don't know how to define one, other than perhaps sort first by commit-depth (maybe optionally adding a few additional intermediate sorting criteria), and then finally sort by commit hash as a tiebreaker. Without the fallback to commit hash, you fall back on normal traversal order which isn't stable (it depends on e.g. order of branches listed on the command line to fast-export, or if using --all, what new branch you just added that comes alphabetically before others). I suspect that solution might run afoul of your dislike for commit hashes, though, so I'm not sure it'd work for you. 
> (Without that second property there are surgical cases I can't > regression-test.) > > > > For my use case > > > that cookie should *not* be a hash, because hashes always break N years > > > down. It should be an eternally stable product of the commit metadata. > > > > Well, the idea for SHA-1 <--> NewHash == SHA-256 transition is to avoid > > having a flag day, and providing full interoperability between > > repositories and Git installations using the old hash ad using new > > hash^1. This will be done internally by using SHA-1 <--> SHA-256 > > mapping. So after the transition all you need is to publish this > > mapping somewhere, be it with Internet Archive or Software Heritage. > > Problem solved. > > I don't see it. How does this prevent old clients from barfing on new > repositories? Depends on range of time for "old". The plan as I understood it (which is suspect): make git version which understand both SHA-1 and SHA-256 (which I think is already done, though I haven't followed closely), wait some time, allow people to opt in to converting, allow more time, consider ways of nudging people to switch. You are right that clients older than any version that understands SHA-256 would barf on the new repositories. > So let me back up a step. I will cheerfully drop advocating bumping > timestamps if anyone can tell me how a different way to define a per-commit > reference cookie that (a) is unique within its repo, and (b) only requires > metadata visible in the fast-export representation of the commit. Does passing --show-original-ids option to fast-export and using the resulting original-oid field as the cookie count? ^ permalink raw reply [flat|nested] 33+ messages in thread
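The ordering Newren sketches above — commit depth first, commit hash as the tiebreaker — can be prototyped over a parent map in a few lines (a sketch of the idea; git fast-export implements no such mode):

```python
# Toy DAG: commit -> list of parents.  Short hex names stand in for hashes.
parents = {
    "a1": [],
    "b2": ["a1"], "c3": ["a1"],   # b2 and c3 have equal depth; the hash breaks the tie
    "d4": ["b2", "c3"],
}

def depth(commit, dag):
    # Depth = longest path back to a root.  Since depth(child) is always
    # greater than depth(parent), sorting by depth keeps parents before
    # children.  Plain recursion is fine for a toy-sized DAG.
    return 1 + max((depth(p, dag) for p in dag[commit]), default=0)

def stable_order(dag):
    # Sort by (depth, hash): the hash tiebreaker makes the order
    # independent of traversal details (branch listing order, --all,
    # which branch was added last), which is what plain fast-export
    # output order is sensitive to.
    return sorted(dag, key=lambda c: (depth(c, dag), c))

print(stable_order(parents))  # ['a1', 'b2', 'c3', 'd4']
```

This is only canonical up to the tiebreaker, which is exactly Newren's caveat: dropping the hash fallback leaves you with unstable traversal order.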
* Re: Finer timestamps and serialization in git 2019-05-20 21:38 ` Elijah Newren @ 2019-05-20 23:12 ` Eric S. Raymond 0 siblings, 0 replies; 33+ messages in thread From: Eric S. Raymond @ 2019-05-20 23:12 UTC (permalink / raw) To: Elijah Newren Cc: Jakub Narebski, Ævar Arnfjörð Bjarmason, Derrick Stolee, Git Mailing List Elijah Newren <newren@gmail.com>: > Hi, > > On Mon, May 20, 2019 at 11:09 AM Eric S. Raymond <esr@thyrsus.com> wrote: > > > > For cookie to be unique among all forks / clones of the same repository > > > you need either centralized naming server, or for the cookie to be based > > > on contents of the commit (i.e. be a hash function). > > > > I don't need uniquess across all forks, only uniqueness *within the repo*. > > You've lost me. In other places you stated you didn't want to use the > commit hash, and now you say this. If you only care about uniqueness > within the current copy of the repo and don't care about uniqueness > across forks (i.e. clones or copies that exist now or in the future -- > including copies stored using SHA256), then what's wrong with using > the commit hash? Because it's not self-describing, can't be computed solely from visible commit metadata, and relies on complex external assumptions about how the hash is computed which break when your VCS changes hash algorithms. These are dealbreakers because one of my major objectives is forward portability of these IDs forever. And I mean *forever*. It should be possible for someone in the year 40,000, in between assaulting planets for the God-Emperor, to look at an import stream and deduce how to resolve the cookies to their commits without seeing git's code or knowing anything about its hash algorithms. I think maybe the reason I'm having so much trouble getting this across is that git insiders are used to thinking of import streams as transient things. Because I do a lot of repo migrations, I have a very different view of them. 
I built reposurgeon on the realization that they're a general transport format for revision histories, and that has forward value independent of the existence of git. If a stream contained fully forward-portable action stamps, it would be forward-portable forever. Hashes in commit comments are the *only* blocker to that. Take this from a person who has spent way too much time patching Subversion IDs like r1234 during repository conversions. It would take so little to make this work. Existing stream format is *almost there*. > A stable ordering of commits in a fast-export stream might be a cool > feature. But I don't know how to define one, other than perhaps sort > first by commit-depth (maybe optionally adding a few additional > intermediate sorting criteria), and then finally sort by commit hash > as a tiebreaker. Without the fallback to commit hash, you fall back > on normal traversal order which isn't stable (it depends on e.g. order > of branches listed on the command line to fast-export, or if using > --all, what new branch you just added that comes alphabetically before > others). > > I suspect that solution might run afoul of your dislike for commit > hashes, though, so I'm not sure it'd work for you. It does. See above. > > So let me back up a step. I will cheerfully drop advocating bumping > > timestamps if anyone can tell me how a different way to define a per-commit > > reference cookie that (a) is unique within its repo, and (b) only requires > > metadata visible in the fast-export representation of the commit. > > Does passing --show-original-ids option to fast-export and using the > resulting original-oid field as the cookie count? I was not aware of this option. Looking...no wonder, it's not on my system man page. Must be recent. OK. Wow. That is *useful*, and I am going to upgrade reposurgeon to read it. With that I can do automatic commit-reference rewriting. I don't consider it a complete solution. 
The problem is that OID is a consistent property that can be used to resolve cookies, but there's no guarantee that it's a *preserved* property that survives multiple round trips and changes in hash functions.

So the right way to use it is to pick it up, do reference-cookie resolution, and then mung the reference cookies to a format that is stable forever. I don't know what that format should be yet. I have a message in composition about this.

--
<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>

^ permalink raw reply [flat|nested] 33+ messages in thread
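The `original-oid` lines under discussion look like this in a stream from `git fast-export --show-original-ids`, and picking them up for cookie resolution is a few lines of parsing (the sample stream is illustrative, not real repository data):

```python
sample_stream = """\
commit refs/heads/master
mark :1
original-oid 3f786850e387550fdab836ed7e6dc881de23001b
author Eric S. Raymond <esr@thyrsus.com> 1557950475 -0400
committer Eric S. Raymond <esr@thyrsus.com> 1557950475 -0400
data 14
first revision
"""

def collect_original_oids(stream):
    # Map fast-import marks to the exporter's original object ids.
    # In the stream format, `original-oid` follows the `mark` line of
    # the object it belongs to.
    oids, mark = {}, None
    for line in stream.splitlines():
        if line.startswith("mark :"):
            mark = line.split(":", 1)[1]
        elif line.startswith("original-oid ") and mark is not None:
            oids[mark] = line.split(" ", 1)[1]
    return oids

print(collect_original_oids(sample_stream))
# {'1': '3f786850e387550fdab836ed7e6dc881de23001b'}
```

As Raymond notes, this gives a table for resolving hash references on import; it does not by itself make those references survive a later change of hash function.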
* Re: Finer timestamps and serialization in git 2019-05-20 14:14 ` Eric S. Raymond 2019-05-20 14:41 ` Michal Suchánek 2019-05-20 21:38 ` Elijah Newren @ 2019-05-21 0:08 ` Jakub Narebski 2019-05-21 1:05 ` Eric S. Raymond 2 siblings, 1 reply; 33+ messages in thread
From: Jakub Narebski @ 2019-05-21 0:08 UTC (permalink / raw)
To: Eric S. Raymond
Cc: Ævar Arnfjörð Bjarmason, Derrick Stolee, git

"Eric S. Raymond" <esr@thyrsus.com> writes:

> Jakub Narebski <jnareb@gmail.com>:
>>> What "commits that follow it?" By hypothesis, the incoming commit's
>>> timestamp is bumped (if it's bumped) when it's first added to a branch
>>> or branches, before there are following commits in the DAG.
>>
>> Errr... the main problem is with distributed nature of Git, i.e. when
>> two repositories create different commits with the same
>> committer+timestamp value. You receive commits on fetch or push, and
>> you receive many commits at once.
>>
>> Say you have two repositories, and the history looks like this:
>>
>> repo A: 1<---2<---a<---x<---c<---d <- master
>>
>> repo B: 1<---2<---X<---3<---4 <- master
>>
>> When you push from repo A to repo B, or fetch in repo B from repo A you
>> would get the following DAG of revisions
>>
>> repo B: 1<---2<---X<---3<---4 <- master
>>              \
>>               \--a<---x<---c<---d <- repo_A/master
>>
>> Now let's assume that commits X and x have the same committer and the
>> same fractional timestamp, while being different commits. Then you
>> would need to bump timestamp of 'x', changing the commit. This means
>> that 'c' needs to be rewritten too, and 'd' also:
>>
>> repo B: 1<---2<---X<---3<---4 <- master
>>              \
>>               \--a<---x'<--c'<--d' <- repo_A/master
>
> Of course that's true. But you were talking as though all those commits
> have to be modified *after they're in the DAG*, and that's not the case.
> If any timestamp has to be modified, it only has to happen *once*, at the
> time its commit enters the repo.
The time commit 'x' was created in repo A there was no need to bump the timestamp. Same with commit 'X' in repo B (well, unless there is a central serialization server - which would not fly). It is only after the push from repo A to repo B that we have two commits, 'x' and 'X', with the same timestamp.

> Actually, in the normal case only x would need to be modified. The only
> way c would need to be modified is if bumping x's timestamp caused an
> actual collision with c's.
>
> I don't see any conceptual problem with this. You appear to me to be
> confusing two issues. Yes, bumping timestamps would mean that all
> hashes downstream in the Merkle tree would be generated differently,
> even when there's no timestamp collision, but so what? The hash of a
> commit isn't portable to begin with - it can't be, because AFAIK
> there's no guarantee that the ancestry parts of the DAG in two
> repositories where copies of it live contain all the same commits and
> topo relationships.

Errr... how did you get the idea that the hash of a commit is not portable??? Same contents means same hash, i.e. same object identifier. Two repositories can have part of their history in common (for example different forks of the same repository, like the different "trees" of the Linux kernel), sharing part of the DAG. Same commits, same topo relationships. That's how _distributed_ version control works.

[I think we may have been talking past each other.]

>> And now for the final nail in the coffin of the Bazaar-esque idea of
>> changing commits on arrival. Say that repository A created new commits,
>> and pushed them to B. You would need to rewrite all future commits from
>> this repository too, and you would always fetch all commits starting
>> from the first "bumped"
>
> I don't see how the second clause of your last sentence follows from the
> first unless commit hashes really are supposed to be portable across
> repositories.
> And I don't see how that can be so given that 'git am'
> exists and a branch can thus be rooted at a different place after
> it is transported and integrated.

'git rebase', 'git rebase --interactive' and 'git am' create different commits; that is why their result is called "history rewriting" (it is actually creating an altered copy, and garbage-collecting the old pre-copy and pre-change version). Anyway, the recommended practice is to not rewrite published history (where somebody could have bookmarked it). Note also that this copying preserves the author date, not the committer date; also commits can be deleted, split and merged during a "rewrite".

Fetch and push do not use 'git am', and they preserve commits and their identities. That is how they can be effective and performant.

>> Hash of a commit depends on hashes of its parents (Merkle tree). That is
>> why signing a commit (or a tag pointing to the commit) signs the whole
>> history of a commit.
>
> That's what I thought.

[...]

>> For cookie to be unique among all forks / clones of the same repository
>> you need either a centralized naming server, or for the cookie to be based
>> on contents of the commit (i.e. be a hash function).
>
> I don't need uniqueness across all forks, only uniqueness *within the repo*.

Err, what? So the proposed "action stamp" identifier is even more useless? If you can't use <esr@thyrsus.com!2019-05-15T20:01:15.473209800Z> to uniquely name a revision, so that every person that has that commit can know which commit it is, what's the use? Is "action stamp" meant to be some local identifier, like Mercurial's Subversion-like revision number, good only for the local repository?

> I want this for two reasons: (1) so that action stamps are unique, (2)
> so that there is a unique canonical ordering of commits in a fast export
> stream.
>
> (Without that second property there are surgical cases I can't
> regression-test.)

You can always use the object identifier (hash) for tiebreaking in the second use case.
>>> For my use cases >>> that cookie should *not* be a hash, because hashes always break N years >>> down. It should be an eternally stable product of the commit metadata. >> >> Well, the idea for SHA-1 <--> NewHash == SHA-256 transition is to avoid >> having a flag day, and providing full interoperability between >> repositories and Git installations using the old hash ad using new >> hash^1. This will be done internally by using SHA-1 <--> SHA-256 >> mapping. So after the transition all you need is to publish this >> mapping somewhere, be it with Internet Archive or Software Heritage. >> Problem solved. > > I don't see it. How does this prevent old clients from barfing on new > repositories? The SHA-1 <--> SHA-256 interoperation is on the client-server level; one can use old Git that uses SHA-1 from repository that uses SHA-256, and vice versa. >> P.S. Could you explain to me how one can use action stamp, e.g. >> <esr@thyrsus.com!2019-05-15T20:01:15.473209800Z>, to quickly find the >> commit it refers to? With SHA-1 id you have either filesystem pathname >> or the index file for pack to find it _fast_. > > For the purposes that make action stamps important I don't really care > about performance much (though there are fairly obvious ways to > achieve it). What ways? > My goal is to ensure that revision histories (e.g. in > their import-stream format) are forward-portable to future VCSes > without requiring any data outside the stream itself. In Git you can store "action stamp" in extra extension headers in commit objects (as was already proposed in this thread). Best, -- Jakub Narębski ^ permalink raw reply [flat|nested] 33+ messages in thread
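Narebski's core point — same contents, same hash, in any repository — is plain content-addressing, which two lines demonstrate (the `commit <size>\0` framing mirrors Git's loose-object header, but this is a toy, not Git's full object encoding):

```python
import hashlib

def oid(payload):
    # Content-addressing: the identifier is a pure function of the bytes.
    # Two repositories that hold the same commit bytes therefore agree on
    # its id with no coordination at all -- this is what makes hashes
    # portable across repositories that share history.
    return hashlib.sha1(b"commit %d\x00" % len(payload) + payload).hexdigest()

commit_bytes = b"tree t\nparent p\ncommitter esr 1557950475\n\nmsg\n"
repo_a = oid(commit_bytes)   # computed independently in repo A
repo_b = oid(commit_bytes)   # ... and in repo B
assert repo_a == repo_b      # same bytes, same id, everywhere
```

It also shows the flip side Suchánek raised: any scheme that mutates a commit's bytes on arrival (e.g. bumping its timestamp) necessarily changes its id and breaks this agreement.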
* Re: Finer timestamps and serialization in git 2019-05-21 0:08 ` Jakub Narebski @ 2019-05-21 1:05 ` Eric S. Raymond 0 siblings, 0 replies; 33+ messages in thread From: Eric S. Raymond @ 2019-05-21 1:05 UTC (permalink / raw) To: Jakub Narebski Cc: Ævar Arnfjörð Bjarmason, Derrick Stolee, git Jakub Narebski <jnareb@gmail.com>: > Errr... how did you get that the hash of a commit is not portable??? OK. You're telling me that premise was wrong. Thank you, accepted. I've since had a better idea. Expect mail soon. -- <a href="http://www.catb.org/~esr/">Eric S. Raymond</a> ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Finer timestamps and serialization in git 2019-05-15 19:16 Finer timestamps and serialization in git Eric S. Raymond 2019-05-15 20:16 ` Derrick Stolee @ 2019-05-15 20:20 ` Ævar Arnfjörð Bjarmason 2019-05-16 0:35 ` Eric S. Raymond 2019-05-16 4:14 ` Jeff King 1 sibling, 2 replies; 33+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-05-15 20:20 UTC (permalink / raw)
To: Eric S. Raymond; +Cc: git

On Wed, May 15 2019, Eric S. Raymond wrote:

> The recent increase in vulnerability in SHA-1 means, I hope, that you
> are planning for the day when git needs to change to something like
> an elliptic-curve hash. This means you're going to have a major
> format break. Such is life.

Note that most users of Git (default build options) won't be vulnerable to the latest attack (or SHAttered), see https://public-inbox.org/git/875zqbx5yz.fsf@evledraar.gmail.com/T/#u

But yes, the plan is to move to SHA-256. See https://github.com/git/git/blob/next/Documentation/technical/hash-function-transition.txt

> Since this is going to have to happen anyway

The SHA-1 <-> SHA-256 transition is planned to happen, but there are some strong opinions that this should be *only* for munging the content for hashing, not adding new stuff while we're at it (even if optional). See: https://public-inbox.org/git/87ftyyedqd.fsf@evledraar.gmail.com/

> let me request two
> functional changes in git. Neither will be at all difficult, but the
> first one is also a thing that cannot be done without a format break,
> which is why I have not suggested them before. They come from lots of
> (often painful) experience with repository conversions via
> reposurgeon.
>
> 1. Finer granularity on commit timestamps.

If you wanted milli/micro/nano-second timestamps for commit objects or whatever other new info then it doesn't need to break the commit header format. You put key-values in the commit message and read them back out via git-interpret-trailers.
Or even put it in the header itself, e.g.:

    author <name> <epoch> <tz>
    committer <name> <epoch> <tz>
    x-author-ns <nanosecond part of author>
    x-committer-ns <nanosecond part of committer>

Of course nobody would understand that new thing from day one, but that's nothing compared to breaking the existing header format.

> 2. Timestamps unique per repository
>
> The coarse resolution of git timestamps, and the lack of uniqueness,
> are at the bottom of several problems that are persistently irritating
> when I do repository conversions and surgery.
>
> The most obvious issue, though a relatively superficial one, is that I have
> to throw away information whenever I convert a repository from a system with
> finer-grained time. Notably this is the case with Subversion, which keeps
> time to milliseconds. This is probably the only respect in which its data
> model remains superior to git's. :-)

Should be solved by putting it in the commit as noted above, just not in the very narrow part of the object that's reserved and not going to change.

More generally plenty of *->git importers write some extra data in the commits, usually in the commit message. Try e.g. cloning a SVN repo with "git svn clone" and see what it does.

> The deeper problem is that I want something from Git that I cannot
> have with 1-second granularity. That is: a unique timestamp on each
> commit in a repository. The only way to be certain of this is for git
> to delay accepting integration of a patch until it can issue a unique
> time mark for it - obviously impractical if the quantum is one second,
> but not if it's a millisecond or microsecond.
>
> Why do I want this? There are a number of reasons, all related to a
> mathematical concept called "total ordering". At present, commits in
> a Git repository only have partial ordering. One consequence is that
> action stamps - the committer/date pairs I use as VCS-independent commit
> identifications in reposurgeon - are not unique.
> When a patch sequence
> is applied, it can easily happen fast enough to give several successive
> commits the same committer-ID and timestamp.
>
> Of course the commit hash remains a unique commit ID. But it can't
> easily be parsed and followed by a human, which is a UX problem when
> it's used as a commit stamp in change comments.

You cannot get a guaranteed "total order" of any sort in anything like git's current object model without taking a global lock on all write operations. Otherwise how would two concurrent ref updates / object writes be guaranteed not to get the same timestamp? Unlikely with nanosecond accuracy, but not impossible.

Even if you solve that, take two such repositories and "git merge --allow-unrelated-histories" them together. Now what's the order?

These issues are solved by defining ordering in terms of the graph, and writing this information after-the-fact. That's already part of git. See https://github.com/git/git/blob/next/Documentation/technical/commit-graph.txt and https://devblogs.microsoft.com/devops/supercharging-the-git-commit-graph-ii-file-format/

> More deeply, the lack of total ordering means that repository graphs
> don't have a single canonical serialized form. This sounds abstract
> but it means there are surgical operations I can't regression-test
> properly. My colleague Edward Cree has found cases where git fast-export
> can issue a stream dump for which git fast-import won't necessarily
> re-color certain interior nodes the same way when it's read back in
> and I'm pretty sure the absence of total ordering on the branch tips
> is at the bottom of that.

Can you clarify what you mean by this? You run fast-import twice and get different results, is that it? If so that sounds like a bug.

> I'm willing to write patches if this direction is accepted. I've figured
> out how to make fast-import streams upward-compatible with finer-grained
> timestamps.

^ permalink raw reply [flat|nested] 33+ messages in thread
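The "ordering in terms of the graph" Bjarmason points to is what the commit-graph's generation numbers provide: a root commit has generation 1, every other commit one more than the maximum of its parents'. A sketch of that rule (not of the commit-graph file format):

```python
def generations(parents):
    # Generation number rule from the commit-graph design: gen(root) = 1,
    # gen(c) = 1 + max(gen(p) for p in parents of c).  A graph-derived
    # partial-order key, computed after the fact, with no clock involved.
    gen = {}
    def g(c):
        if c not in gen:
            gen[c] = 1 + max((g(p) for p in parents[c]), default=0)
        return gen[c]
    for c in parents:
        g(c)
    return gen

# Two unrelated histories merged together (Bjarmason's example): the
# merge commit 'm' still gets a well-defined generation number.
parents = {"r1": [], "r2": [], "a": ["r1"], "b": ["r2"], "m": ["a", "b"]}
print(generations(parents))
# {'r1': 1, 'r2': 1, 'a': 2, 'b': 2, 'm': 3}
```

Note this yields only a partial order — `a` and `b` share generation 2 — which is exactly the property that makes it robust under merges where a unique-timestamp scheme breaks down.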
* Re: Finer timestamps and serialization in git 2019-05-15 20:20 ` Ævar Arnfjörð Bjarmason @ 2019-05-16 0:35 ` Eric S. Raymond 2019-05-16 4:14 ` Jeff King 1 sibling, 0 replies; 33+ messages in thread From: Eric S. Raymond @ 2019-05-16 0:35 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason; +Cc: git Ævar Arnfjörð Bjarmason <avarab@gmail.com>: > You put it key-values in the commit message and read it back out via > git-interpret-trailers. Speaking as a person who has done a lot of repository migrations, this makes me shudder. It's fragile, kludgy, and does not maintain proper separation of concerns. The feature I *didn't* ask for at the next format break is a user-modifiable key-value store per commit that is *not* in the commit comment. Bzr has this. It's useful. -- <a href="http://www.catb.org/~esr/">Eric S. Raymond</a> ^ permalink raw reply [flat|nested] 33+ messages in thread
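For concreteness, the trailer mechanism Raymond is objecting to would work roughly like this — a nanosecond remainder stashed as a trailer and recombined with the one-second header timestamp (the `x-committer-ns` name is the thread's hypothetical, not anything Git defines):

```python
commit_message = """\
Fix the frobnicator

Longer explanation here.

x-committer-ns: 473209800
"""

def committer_time_ns(header_epoch, message):
    # Recombine the 1-second header timestamp with the sub-second part
    # stored in a trailer (the kind of key-value that
    # `git interpret-trailers --parse` surfaces).  Falls back to 0 ns
    # when the trailer is absent.
    ns = 0
    for line in message.splitlines():
        if line.startswith("x-committer-ns:"):
            ns = int(line.split(":", 1)[1])
    return header_epoch * 10**9 + ns

print(committer_time_ns(1557950475, commit_message))  # 1557950475473209800
```

Raymond's objection stands apart from the mechanics: the data round-trips, but it lives in the human-readable comment rather than in a separate metadata store.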
* Re: Finer timestamps and serialization in git 2019-05-15 20:20 ` Ævar Arnfjörð Bjarmason 2019-05-16 0:35 ` Eric S. Raymond @ 2019-05-16 4:14 ` Jeff King 1 sibling, 0 replies; 33+ messages in thread From: Jeff King @ 2019-05-16 4:14 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason; +Cc: Eric S. Raymond, git On Wed, May 15, 2019 at 10:20:03PM +0200, Ævar Arnfjörð Bjarmason wrote: > > Since this is going to have to happen anyway > > The SHA-1 <-> SHA-256 transition is planned to happen, but there's some > strong opinions that this should be *only* for munging the content for > hashing, not adding new stuff while we're at it (even if optional). See > : https://public-inbox.org/git/87ftyyedqd.fsf@evledraar.gmail.com/ One reason for this is that the transition plan calls for being able to convert between the sha1 and sha256 representations losslessly (which makes interoperability possible and avoids a flag day). So even if the sha256 format understood floating-point timestamps in the committer header, we'd have to have some way of representing that same information in the sha1 format. Which implies putting it into a new header, as you described below. And if it's in a new header in sha1, then is there any real advantage in having it somewhere else in the sha256 version? I dunno. Maybe a little, as eventually all of the sha1 formats would die off, after everybody has transitioned. -Peff ^ permalink raw reply [flat|nested] 33+ messages in thread
end of thread, other threads:[~2019-05-21 1:05 UTC | newest] Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2019-05-15 19:16 Finer timestamps and serialization in git Eric S. Raymond 2019-05-15 20:16 ` Derrick Stolee 2019-05-15 20:28 ` Jason Pyeron 2019-05-15 21:14 ` Derrick Stolee 2019-05-15 22:07 ` Ævar Arnfjörð Bjarmason 2019-05-16 0:28 ` Eric S. Raymond 2019-05-16 1:25 ` Derrick Stolee 2019-05-20 15:05 ` Michal Suchánek 2019-05-20 16:36 ` Eric S. Raymond 2019-05-20 17:22 ` Derrick Stolee 2019-05-20 21:32 ` Eric S. Raymond 2019-05-15 23:40 ` Eric S. Raymond 2019-05-19 0:16 ` Philip Oakley 2019-05-19 4:09 ` Eric S. Raymond 2019-05-19 10:07 ` Philip Oakley 2019-05-15 23:32 ` Eric S. Raymond 2019-05-16 1:14 ` Derrick Stolee 2019-05-16 9:50 ` Ævar Arnfjörð Bjarmason 2019-05-19 23:15 ` Jakub Narebski 2019-05-20 0:45 ` Eric S. Raymond 2019-05-20 9:43 ` Jakub Narebski 2019-05-20 10:08 ` Ævar Arnfjörð Bjarmason 2019-05-20 12:40 ` Jeff King 2019-05-20 14:14 ` Eric S. Raymond 2019-05-20 14:41 ` Michal Suchánek 2019-05-20 22:18 ` Philip Oakley 2019-05-20 21:38 ` Elijah Newren 2019-05-20 23:12 ` Eric S. Raymond 2019-05-21 0:08 ` Jakub Narebski 2019-05-21 1:05 ` Eric S. Raymond 2019-05-15 20:20 ` Ævar Arnfjörð Bjarmason 2019-05-16 0:35 ` Eric S. Raymond 2019-05-16 4:14 ` Jeff King