* Finer timestamps and serialization in git @ 2019-05-15 19:16 Eric S. Raymond 2019-05-15 20:16 ` Derrick Stolee 2019-05-15 20:20 ` Ævar Arnfjörð Bjarmason 0 siblings, 2 replies; 33+ messages in thread From: Eric S. Raymond @ 2019-05-15 19:16 UTC (permalink / raw) To: git The recent increase in vulnerability in SHA-1 means, I hope, that you are planning for the day when git needs to change to something like an elliptic-curve hash. This means you're going to have a major format break. Such is life. Since this is going to have to happen anyway, let me request two functional changes in git. Neither will be at all difficult, but the first one is also a thing that cannot be done without a format break, which is why I have not suggested them before. They come from lots of (often painful) experience with repository conversions via reposurgeon. 1. Finer granularity on commit timestamps. 2. Timestamps unique per repository The coarse resolution of git timestamps, and the lack of uniqueness, are at the bottom of several problems that are persistently irritating when I do repository conversions and surgery. The most obvious issue, though a relatively superficial one, is that I have to throw away information whenever I convert a repository from a system with finer-grained time. Notably this is the case with Subversion, which keeps time to milliseconds. This is probably the only respect in which its data model remains superior to git's. :-) The deeper problem is that I want something from Git that I cannot have with 1-second granularity. That is: a unique timestamp on each commit in a repository. The only way to be certain of this is for git to delay accepting integration of a patch until it can issue a unique time mark for it - obviously impractical if the quantum is one second, but not if it's a millisecond or microsecond. Why do I want this? There are a number of reasons, all related to a mathematical concept called "total ordering". 
At present, commits in a Git repository only have partial ordering. One consequence is that action stamps - the committer/date pairs I use as VCS-independent commit identifications in reposurgeon - are not unique. When a patch sequence is applied, it can easily happen fast enough to give several successive commits the same committer-ID and timestamp. Of course the commit hash remains a unique commit ID. But it can't easily be parsed and followed by a human, which is a UX problem when it's used as a commit stamp in change comments. More deeply, the lack of total ordering means that repository graphs don't have a single canonical serialized form. This sounds abstract but it means there are surgical operations I can't regression-test properly. My colleague Edward Cree has found cases where git fast-export can issue a stream dump for which git fast-import won't necessarily re-color certain interior nodes the same way when it's read back in and I'm pretty sure the absence of total ordering on the branch tips is at the bottom of that. I'm willing to write patches if this direction is accepted. I've figured out how to make fast-import streams upward-compatible with finer-grained timestamps. -- <a href="http://www.catb.org/~esr/">Eric S. Raymond</a> ^ permalink raw reply [flat|nested] 33+ messages in thread
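The collision ESR describes is easy to reproduce: with 1-second granularity, two commits applied within the same second by the same committer yield identical action stamps. A minimal sketch of the idea, assuming a committer-email-plus-UTC-date stamp format; the `action_stamp` helper and the email address are illustrative, not reposurgeon's actual code:

```python
from datetime import datetime, timezone

def action_stamp(email, epoch_seconds):
    # Reposurgeon-style action stamp: committer email + commit date.
    # Illustrative helper, not reposurgeon's actual implementation.
    when = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return "<%s!%s>" % (email, when.strftime("%Y-%m-%dT%H:%M:%SZ"))

# A patch sequence applied fast enough lands in the same second:
a = action_stamp("committer@example.com", 1557948240)
b = action_stamp("committer@example.com", 1557948240)
assert a == b  # identical stamps for two distinct commits
```

With sub-second granularity and a uniqueness guarantee, the two stamps above would differ, which is the whole point of the request.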
* Re: Finer timestamps and serialization in git 2019-05-15 19:16 Finer timestamps and serialization in git Eric S. Raymond @ 2019-05-15 20:16 ` Derrick Stolee 2019-05-15 20:28 ` Jason Pyeron 2019-05-15 23:32 ` Eric S. Raymond 2019-05-15 20:20 ` Ævar Arnfjörð Bjarmason 1 sibling, 2 replies; 33+ messages in thread From: Derrick Stolee @ 2019-05-15 20:16 UTC (permalink / raw) To: Eric S. Raymond, git On 5/15/2019 3:16 PM, Eric S. Raymond wrote: > The deeper problem is that I want something from Git that I cannot > have with 1-second granularity. That is: a unique timestamp on each > commit in a repository. This is impossible in a distributed version control system like Git (where the commits are immutable). No matter your precision, there is a chance that two different machines commit at the exact same moment and then those commits are merged into the same branch. Even when you specify a committer, there are many environments where a set of parallel machines are creating commits with the same identity. > Why do I want this? There are a number of reasons, all related to a > mathematical concept called "total ordering". At present, commits in > a Git repository only have partial ordering. This is true of any directed acyclic graph. If you want a total ordering that is completely unambiguous, then you should think about maintaining a linear commit history by requiring rebasing instead of merging. > One consequence is that > action stamps - the committer/date pairs I use as VCS-independent commit > identifications in reposurgeon - are not unique. When a patch sequence > is applied, it can easily happen fast enough to give several successive > commits the same committer-ID and timestamp. Sorting by committer/date pairs sounds like an unhelpful idea, as that does not take any graph topology into account. It happens that a commit can actually have an _earlier_ commit date than its parent. 
> More deeply, the lack of total ordering means that repository graphs > don't have a single canonical serialized form. This sounds abstract > but it means there are surgical operations I can't regression-test > properly. My colleague Edward Cree has found cases where git fast-export > can issue a stream dump for which git fast-import won't necessarily > re-color certain interior nodes the same way when it's read back in > and I'm pretty sure the absence of total ordering on the branch tips > is at the bottom of that. If you use `git rev-list --topo-order` with a fixed set of refs to start, then the total ordering given is well-defined (and it is a linear extension of the partial order given by the commit graph). However, this ordering is not stable: adding another merge commit may swap the order between two commits lower in the order. > I'm willing to write patches if this direction is accepted. I've figured > out how to make fast-import streams upward-compatible with finer-grained > timestamps. Changing the granularity of timestamps requires changing the commit format, which is probably a non-starter. More universally-useful suggestions have been blocked due to keeping the file format consistent. Thanks, -Stolee ^ permalink raw reply [flat|nested] 33+ messages in thread
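Stolee's point, that a topological order is a linear extension of the commit graph's partial order but not a uniquely determined one, can be seen on a toy DAG: a simple diamond already admits two valid orders, so nothing intrinsic to the graph pins down which one an exporter emits. A small sketch (not git's actual algorithm; the commit names are hypothetical):

```python
def topo_orders(parents):
    # Enumerate all linear extensions (valid topological orders) of a
    # commit DAG given as {commit: [list of parents]}.
    # Illustrative helper, not git's rev-list implementation.
    children = {c: set() for c in parents}
    indeg = {c: len(parents[c]) for c in parents}
    for c, ps in parents.items():
        for p in ps:
            children[p].add(c)
    def rec(order, indeg):
        ready = sorted(c for c, d in indeg.items() if d == 0)
        if not ready:
            yield tuple(order)
            return
        for c in ready:
            nd = dict(indeg)
            nd[c] = -1  # mark emitted
            for k in children[c]:
                nd[k] -= 1
            yield from rec(order + [c], nd)
    return list(rec([], indeg))

# A diamond: A <- B, A <- C, and D merges B and C.
dag = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
orders = topo_orders(dag)
assert len(orders) == 2  # B and C may be emitted in either order
```

Both orders are "correct"; only an extra tie-breaking rule (such as the unique timestamps ESR asks for) would select one canonically.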
* RE: Finer timestamps and serialization in git 2019-05-15 20:16 ` Derrick Stolee @ 2019-05-15 20:28 ` Jason Pyeron 2019-05-15 21:14 ` Derrick Stolee 2019-05-15 23:40 ` Eric S. Raymond 2019-05-15 23:32 ` Eric S. Raymond 1 sibling, 2 replies; 33+ messages in thread From: Jason Pyeron @ 2019-05-15 20:28 UTC (permalink / raw) To: git; +Cc: 'Derrick Stolee', 'Eric S. Raymond' (please don’t cc me) > -----Original Message----- > From: Derrick Stolee > Sent: Wednesday, May 15, 2019 4:16 PM > > On 5/15/2019 3:16 PM, Eric S. Raymond wrote: <snip/> I disagree with many of Eric's reasons - and agree with most of Derrick's refutation. But > > Changing the granularity of timestamps requires changing the commit format, > which is probably a non-starter. is not necessarily true. If we take the below example: committer Name <user@domain> 1557948240 -0400 and we follow the rule that: 1. any trailing zero after the decimal point MUST be omitted 2. if there are no digits after the decimal point, it MUST be omitted This would allow: committer Name <user@domain> 1557948240 -0400 committer Name <user@domain> 1557948240.12 -0400 but the following are never allowed: committer Name <user@domain> 1557948240. -0400 committer Name <user@domain> 1557948240.000000 -0400 By following these rules, all previous commits' hashes are unchanged. Future commits made exactly on the second will look like the old commit format. Commits coming from "older" tools will produce valid and mergeable objects. The loss of precision has frustrated us several times as well. Respectfully, Jason Pyeron ^ permalink raw reply [flat|nested] 33+ messages in thread
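Jason's two rules can be sketched as a small formatter. This is a hypothetical helper, not anything git implements; git does store the timestamp as a decimal string inside the commit object, but with no fractional part:

```python
def format_commit_timestamp(seconds, fraction=0):
    # Format a committer timestamp under the proposed rules:
    # 1. trailing zeros after the decimal point are omitted;
    # 2. the decimal point itself is omitted when nothing follows it.
    # Sketch only; not git behavior.
    if fraction:
        frac = ("%.9f" % fraction)[2:].rstrip("0")  # digits after the point
        return "%d.%s" % (seconds, frac)
    return "%d" % seconds

assert format_commit_timestamp(1557948240) == "1557948240"
assert format_commit_timestamp(1557948240, 0.12) == "1557948240.12"
# Never produced: "1557948240." or "1557948240.000000"
assert format_commit_timestamp(1557948240, 0.000000) == "1557948240"
```

Under these rules a whole-second timestamp is byte-identical to today's format, which is why existing commit hashes are unaffected.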
* Re: Finer timestamps and serialization in git 2019-05-15 20:28 ` Jason Pyeron @ 2019-05-15 21:14 ` Derrick Stolee 2019-05-15 22:07 ` Ævar Arnfjörð Bjarmason 2019-05-16 0:28 ` Eric S. Raymond 2019-05-15 23:40 ` Eric S. Raymond 1 sibling, 2 replies; 33+ messages in thread From: Derrick Stolee @ 2019-05-15 21:14 UTC (permalink / raw) To: Jason Pyeron, git; +Cc: 'Eric S. Raymond' On 5/15/2019 4:28 PM, Jason Pyeron wrote: > (please don’t cc me) Ok. I'll "To" you. > and we follow the rule that: > > 1. any trailing zero after the decimal point MUST be omitted > 2. if there are no digits after the decimal point, it MUST be omitted > > This would allow: > > committer Name <user@domain> 1557948240 -0400 > committer Name <user@domain> 1557948240.12 -0400 This kind of change would probably break old clients trying to read commits from new clients. Ævar's suggestion [1] of additional headers should not create incompatibilities. > By following these rules, all previous commits' hashes are unchanged. Future commits made exactly on the second will look like the old commit format. Commits coming from "older" tools will produce valid and mergeable objects. The loss of precision has frustrated us several times as well. What problem are you trying to solve where commit date is important? The only use I have for them is "how long has it been since someone made this change?" A question like "when was this change introduced?" is much less important than "in which version was this first released?" This "in which version" is a graph reachability question, not a date question. I think any attempt to understand Git commits using commit date without using the underlying graph topology (commit->parent relationships) is fundamentally broken and won't scale to even moderately-sized teams. I don't even use "git log" without a "--topo-order" or "--graph" option because using a date order puts unrelated changes next to each other. 
--topo-order guarantees that a path of commits with only one parent and only one child appears in consecutive order. Thanks, -Stolee P.S. All of my (overly strong) opinions on using commit date are made more valid when you realize anyone can set GIT_COMMITTER_DATE to get an arbitrary commit date. [1] https://public-inbox.org/git/871s0zwjv0.fsf@evledraar.gmail.com/T/#t ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Finer timestamps and serialization in git 2019-05-15 21:14 ` Derrick Stolee @ 2019-05-15 22:07 ` Ævar Arnfjörð Bjarmason 2019-05-16 0:28 ` Eric S. Raymond 1 sibling, 0 replies; 33+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2019-05-15 22:07 UTC (permalink / raw) To: Derrick Stolee; +Cc: Jason Pyeron, git, 'Eric S. Raymond' On Wed, May 15 2019, Derrick Stolee wrote: > On 5/15/2019 4:28 PM, Jason Pyeron wrote: >> (please don’t cc me) > > Ok. I'll "To" you. I'm a rebel! >> and we follow the rule that: >> >> 1. any trailing zero after the decimal point MUST be omitted >> 2. if there are no digits after the decimal point, it MUST be omitted >> >> This would allow: >> >> committer Name <user@domain> 1557948240 -0400 >> committer Name <user@domain> 1557948240.12 -0400 > > This kind of change would probably break old clients trying to read > commits from new clients. Ævar's suggestion [1] of additional headers > should not create incompatibilities. Yes, exactly. Obviously patching git to do this is rather easy, here's an initial try: diff --git a/date.c b/date.c index 8126146c50..0a97e1d877 100644 --- a/date.c +++ b/date.c @@ -762,3 +762,3 @@ static void date_string(timestamp_t date, int offset, struct strbuf *buf) } - strbuf_addf(buf, "%"PRItime" %c%02d%02d", date, sign, offset / 60, offset % 60); + strbuf_addf(buf, "%"PRItime".12345 %c%02d%02d", date, sign, offset / 60, offset % 60); } diff --git a/usage.c b/usage.c index 2fdb20086b..7760b78cb6 100644 --- a/usage.c +++ b/usage.c @@ -267,2 +267,3 @@ NORETURN void BUG_fl(const char *file, int line, const char *fmt, ...) va_list ap; + return; va_start(ap, fmt); We don't need BUG() right? 
:) Now let's commit with that git, that gives me a commit object with a sub-second timestamp like: $ git cat-file -p HEAD tree 4d5fcadc293a348e88f777dc0920f11e7d71441c author Ævar Arnfjörð Bjarmason <avarab@gmail.com> 1557955656.12345 +0200 committer Ævar Arnfjörð Bjarmason <avarab@gmail.com> 1557955656.12345 +0200 Works so far, yay! And now fsck fails: error in commit 31b3e9b88c36f75b3375471d9f5b449165c9ff93: badDate: invalid author/committer line - bad date And any sane git hosting site will refuse this, e.g. trying to push this to github: remote: error: object 31b3e9b88c36f75b3375471d9f5b449165c9ff93: badDate: invalid author/committer line - bad date remote: fatal: fsck error in packed object And that's *just* dealing with the git.git client, any such format changes also need to consider what happens to jgit, libgit2 etc. etc. Once you make such changes to the format you've created your own version-control system. It's no longer git. >> By following these rules, all previous commits' hash are unchanged. Future commits made on the top of the second will look like old commit formats. Commits coming from "older" tools will produce valid and mergeable objects. The loss precision has frustrated us several times as well. > > What problem are you trying to solve where commit date is important? > The only use I have for them is "how long has it been since someone > made this change?" A question like "when was this change introduced?" > is much less important than "in which version was this first released?" > This "in which version" is a graph reachability question, not a date > question. > > I think any attempt to understand Git commits using commit date without > using the underling graph topology (commit->parent relationships) is > fundamentally broken and won't scale to even moderately-sized teams. > I don't even use "git log" without a "--topo-order" or "--graph" option > because using a date order puts unrelated changes next to each other. 
> --topo-order guarantees that a path of commits with only one parent > and only one child appears in consecutive order. > > Thanks, > -Stolee > > P.S. All of my (overly strong) opinions on using commit date are made > more valid when you realize anyone can set GIT_COMMITTER_DATE to get > an arbitrary commit date. > > [1] https://public-inbox.org/git/871s0zwjv0.fsf@evledraar.gmail.com/T/#t ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Finer timestamps and serialization in git 2019-05-15 21:14 ` Derrick Stolee 2019-05-15 22:07 ` Ævar Arnfjörð Bjarmason @ 2019-05-16 0:28 ` Eric S. Raymond 2019-05-16 1:25 ` Derrick Stolee 1 sibling, 1 reply; 33+ messages in thread From: Eric S. Raymond @ 2019-05-16 0:28 UTC (permalink / raw) To: Derrick Stolee; +Cc: Jason Pyeron, git Derrick Stolee <stolee@gmail.com>: > What problem are you trying to solve where commit date is important? I don't know what Jason's are. I know what mine are. A. Portable commit identifiers 1. When I in-migrate a repository from (say) Subversion with reposurgeon, I want to be able to patch change comments so that (say) r2367 becomes a unique reference to its corresponding commit. I do not want the kludge of appending a relic SVN-ID header to be *required*, though some customers may choose that. Requiring that is an orthogonality violation. 2. Because I think in decadal timescales about infrastructure, I want my commit references to be in a format that won't break when the history is forward-migrated to the *next* VCS. That pretty much eliminates any form of opaque hash. (Git itself will have a weaker version of this problem when you change hash formats.) 3. Accordingly, I invented action stamps. This is an action stamp: <esr@thyrsus.com!2019-05-15T20:01:15Z>. One reason I want timestamp uniqueness is for action-stamp uniqueness. B. Unique canonical form of import-stream representation. Reposurgeon is a very complex piece of software with subtle failure modes. I have a strong need to be able to regression-test its operation. Right now there are important cases in which I can't do that because (a) the order in which it writes commits and (b) how it colors branches, are both phase-of-moon dependent. That is, the algorithms may be deterministic but they're not documented and seem to be dependent on variables that are hidden from me. Before import streams can have a canonical output order without hidden variables (e.g. 
depending only on visible metadata) in practice, that needs to be possible in principle. I've thought about this a lot and not only are unique commit timestamps the most natural way to make it possible, they're the only way consistent with the reality that commit comments may be altered for various good reasons during repository translation. > P.S. All of my (overly strong) opinions on using commit date are made > more valid when you realize anyone can set GIT_COMMITTER_DATE to get > an arbitrary commit date. In the way I would write things, you can *request* that date, but in case of a collision you might actually get one a few microseconds off that preserves its order relationship with your other commits. -- <a href="http://www.catb.org/~esr/">Eric S. Raymond</a> ^ permalink raw reply [flat|nested] 33+ messages in thread
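The "request a date, get a nearby unique one" behavior ESR describes amounts to a per-repository allocator that nudges colliding timestamps forward. A sketch of that idea in microseconds; the class is hypothetical and nothing like it exists in git:

```python
class TimestampAllocator:
    # Issue microsecond timestamps unique within one repository,
    # bumping forward on collision so relative order is preserved.
    # Hypothetical sketch of the proposed behavior, not a git feature.
    def __init__(self):
        self.issued = set()

    def allocate(self, requested_us):
        t = requested_us
        while t in self.issued:
            t += 1  # a few microseconds off, as ESR describes
        self.issued.add(t)
        return t

alloc = TimestampAllocator()
base = 1557948240_000000  # 2019-05-15 19:24:00 UTC in microseconds
assert alloc.allocate(base) == base
assert alloc.allocate(base) == base + 1  # collision resolved
assert alloc.allocate(base) == base + 2
```

Note that this only works with a single point of serialization per repository, which is exactly Stolee's objection: in a distributed system there is no such central authority at commit time.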
* Re: Finer timestamps and serialization in git 2019-05-16 0:28 ` Eric S. Raymond @ 2019-05-16 1:25 ` Derrick Stolee 2019-05-20 15:05 ` Michal Suchánek 0 siblings, 1 reply; 33+ messages in thread From: Derrick Stolee @ 2019-05-16 1:25 UTC (permalink / raw) To: esr; +Cc: Jason Pyeron, git On 5/15/2019 8:28 PM, Eric S. Raymond wrote: > Derrick Stolee <stolee@gmail.com>: >> What problem are you trying to solve where commit date is important? > > I don't know what Jason's are. I know what mine are. > > A. Portable commit identifiers > > 1. When I in-migrate a repository from (say) Subversion with > reposurgeon, I want to be able to patch change comments so that (say) > r2367 becomes a unique reference to its corresponding commit. I do > not want the kludge of appending a relic SVN-ID header to be *required*, > though some customers may choose that. Requirung that is an orthogonality > violation. Instead of using the free-form nature of a commit message to include links to an external VCS, you want a first-class data type in Git to provide this data? Not only is that backwards, it makes the link between the Git repo and the SVN repo weaker. How would you distinguish between a commit generated from the old SVN repo and a commit that was created directly in the Git repo without performing a lookup to the SVN repo based on (committer, timestamp)? > 2. Because I think in decadal timescales about infrastructure, I want > my commit references to be in a format that won't break when the history > is forward-migrated to the *next* VCS. That pretty much eliminates any > from of opaque hash. (Git itself will have a weaker version of this problem > when you change hash formats.) > > 3. Accordingly, I invented action stamps. This is an action stamp: > <esr@thyrsus.com!2019-05-15T20:01:15Z>. One reason I want timestamp > uniqueness is for action-stamp uniqueness. Looks like you have an excellent format for a backwards-facing link. 
Gerrit uses a commit-msg hook [1] to insert "Change-Id" tags into commit messages. You could probably do something similar. If you have control over _every_ client interacting with the repo, you could even have this interact with a central authority to give a unique stamp. > B. Unique canonical form of import-stream representation. > > Reposurgeon is a very complex piece of software with subtle failure > modes. I have a strong need to be able to regression-test its > operation. Right now there are important cases in which I can't do > that because (a) the order in which it writes commits and (b) how it > colors branches, are both phase-of-moon dependent. That is, the > algorithms may be deterministic but they're not documented and seem to > be dependent on variables that are hidden from me. > > Before import streams can have a canonical output order without hidden > variables (e.g. depending only on visible metadata) in practice, that > needs to be possible in principle. I've thought about this a lot and > not only are unique commit timestamps the most natural way to make > it possible, they're the only way conistent with the reality that > commit comments may be altered for various good reasons during > repository translation. If you are trying to debug or test something, why don't you serialize the input you are using for your test? >> P.S. All of my (overly strong) opinions on using commit date are made >> more valid when you realize anyone can set GIT_COMMITTER_DATE to get >> an arbitrary commit date. > > In the way I would write things, you can *request* that date, but in > case of a collision you might actually get one a few microseconds off > that preserves its order relationship with your other commits. As mentioned above, you need to make this request at the time the commit is created, and you'll need to communicate with a central authority. That goes against the distributed nature of Git. 
In my opinion, Git already gives you the flexibility to achieve the goals you are looking for. But changing a core data type to make your goals slightly more convenient is not a valuable exercise. -Stolee [1] https://gerrit-review.googlesource.com/Documentation/cmd-hook-commit-msg.html ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Finer timestamps and serialization in git 2019-05-16 1:25 ` Derrick Stolee @ 2019-05-20 15:05 ` Michal Suchánek 2019-05-20 16:36 ` Eric S. Raymond 0 siblings, 1 reply; 33+ messages in thread From: Michal Suchánek @ 2019-05-20 15:05 UTC (permalink / raw) To: Derrick Stolee; +Cc: esr, Jason Pyeron, git On Wed, 15 May 2019 21:25:46 -0400 Derrick Stolee <stolee@gmail.com> wrote: > On 5/15/2019 8:28 PM, Eric S. Raymond wrote: > > Derrick Stolee <stolee@gmail.com>: > >> What problem are you trying to solve where commit date is important? > > B. Unique canonical form of import-stream representation. > > > > Reposurgeon is a very complex piece of software with subtle failure > > modes. I have a strong need to be able to regression-test its > > operation. Right now there are important cases in which I can't do > > that because (a) the order in which it writes commits and (b) how it > > colors branches, are both phase-of-moon dependent. That is, the > > algorithms may be deterministic but they're not documented and seem to > > be dependent on variables that are hidden from me. > > > > Before import streams can have a canonical output order without hidden > > variables (e.g. depending only on visible metadata) in practice, that > > needs to be possible in principle. I've thought about this a lot and > > not only are unique commit timestamps the most natural way to make > > it possible, they're the only way conistent with the reality that > > commit comments may be altered for various good reasons during > > repository translation. > > If you are trying to debug or test something, why don't you serialize > the input you are using for your test? And that's the problem. Serialization of a git repository is not stable because there is no total ordering on commits. And for testing you need to serialize some 'before' and 'after' state and they can be totally different. 
Not because the repository state is totally different but because the serialization of the state is not stable. Thanks Michal ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Finer timestamps and serialization in git 2019-05-20 15:05 ` Michal Suchánek @ 2019-05-20 16:36 ` Eric S. Raymond 2019-05-20 17:22 ` Derrick Stolee 0 siblings, 1 reply; 33+ messages in thread From: Eric S. Raymond @ 2019-05-20 16:36 UTC (permalink / raw) To: Michal Suchánek; +Cc: Derrick Stolee, Jason Pyeron, git Michal Suchánek <msuchanek@suse.de>: > On Wed, 15 May 2019 21:25:46 -0400 > Derrick Stolee <stolee@gmail.com> wrote: > > > On 5/15/2019 8:28 PM, Eric S. Raymond wrote: > > > Derrick Stolee <stolee@gmail.com>: > > >> What problem are you trying to solve where commit date is important? > > > > B. Unique canonical form of import-stream representation. > > > > > > Reposurgeon is a very complex piece of software with subtle failure > > > modes. I have a strong need to be able to regression-test its > > > operation. Right now there are important cases in which I can't do > > > that because (a) the order in which it writes commits and (b) how it > > > colors branches, are both phase-of-moon dependent. That is, the > > > algorithms may be deterministic but they're not documented and seem to > > > be dependent on variables that are hidden from me. > > > > > > Before import streams can have a canonical output order without hidden > > > variables (e.g. depending only on visible metadata) in practice, that > > > needs to be possible in principle. I've thought about this a lot and > > > not only are unique commit timestamps the most natural way to make > > > it possible, they're the only way conistent with the reality that > > > commit comments may be altered for various good reasons during > > > repository translation. > > > > If you are trying to debug or test something, why don't you serialize > > the input you are using for your test? > > And that's the problem. Serialization of a git repository is not stable > because there is no total ordering on commits. 
And for testing you need > to serialize some 'before' and 'after' state and they can be totally > different. Not because the repository state is totally different but > because the serialization of the state is not stable. Yes, msuchanek is right - that is exactly the problem. Very well put. git fast-import streams *are* the serialization; they're what reposurgeon ingests and emits. The concrete problem I have is that there is no stable correspondence between a repository and one canonical fast-import serialization of it. That is a bigger pain in the ass than you will be able to imagine unless and until you try writing surgical tools yourself and discover that you can't write tests for them. -- <a href="http://www.catb.org/~esr/">Eric S. Raymond</a> ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Finer timestamps and serialization in git 2019-05-20 16:36 ` Eric S. Raymond @ 2019-05-20 17:22 ` Derrick Stolee 2019-05-20 21:32 ` Eric S. Raymond 0 siblings, 1 reply; 33+ messages in thread From: Derrick Stolee @ 2019-05-20 17:22 UTC (permalink / raw) To: esr, Michal Suchánek; +Cc: Jason Pyeron, git On 5/20/2019 12:36 PM, Eric S. Raymond wrote: > Michal Suchánek <msuchanek@suse.de>: >> On Wed, 15 May 2019 21:25:46 -0400 >> Derrick Stolee <stolee@gmail.com> wrote: >> >>> On 5/15/2019 8:28 PM, Eric S. Raymond wrote: >>>> Derrick Stolee <stolee@gmail.com>: >>>>> What problem are you trying to solve where commit date is important? >> >>>> B. Unique canonical form of import-stream representation. >>>> >>>> Reposurgeon is a very complex piece of software with subtle failure >>>> modes. I have a strong need to be able to regression-test its >>>> operation. Right now there are important cases in which I can't do >>>> that because (a) the order in which it writes commits and (b) how it >>>> colors branches, are both phase-of-moon dependent. That is, the >>>> algorithms may be deterministic but they're not documented and seem to >>>> be dependent on variables that are hidden from me. >>>> >>>> Before import streams can have a canonical output order without hidden >>>> variables (e.g. depending only on visible metadata) in practice, that >>>> needs to be possible in principle. I've thought about this a lot and >>>> not only are unique commit timestamps the most natural way to make >>>> it possible, they're the only way conistent with the reality that >>>> commit comments may be altered for various good reasons during >>>> repository translation. >>> >>> If you are trying to debug or test something, why don't you serialize >>> the input you are using for your test? >> >> And that's the problem. Serialization of a git repository is not stable >> because there is no total ordering on commits. 
And for testing you need >> to serialize some 'before' and 'after' state and they can be totally >> different. Not because the repository state is totally different but >> because the serialization of the state is not stable. > > Yes, msuchanek is right - that is exactly the problem. Very well put. > > git fast-import streams *are* the serialization; they're what reposurgeon > ingests and emits. The concrete problem I have is that there is no stable > correspondence between a repository and one canonical fast-import > serialization of it. > > That is a bigger pain in the ass than you will be able to imagine unless > and until you try writing surgical tools yourself and discover that you > can't write tests for them. What it sounds like you are doing is piping a 'git fast-import' process into reposurgeon, and testing that reposurgeon does the same thing every time. Of course this won't be consistent if 'git fast-import' isn't consistent. But what you should do instead is store a fixed file from one run of 'git fast-import' and send that file to reposurgeon for the repeated test. Don't rely on fast-import being consistent and instead use fixed input for your test. If reposurgeon is providing the input to _and_ consuming the output from 'git fast-import', then yes you will need to have at least one integration test that runs the full pipeline. But for regression tests covering complicated logic in reposurgeon, you're better off splitting the test (or mocking out 'git fast-import' with something that provides consistent output given fixed input). -Stolee ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Finer timestamps and serialization in git 2019-05-20 17:22 ` Derrick Stolee @ 2019-05-20 21:32 ` Eric S. Raymond 0 siblings, 0 replies; 33+ messages in thread From: Eric S. Raymond @ 2019-05-20 21:32 UTC (permalink / raw) To: Derrick Stolee; +Cc: Michal Suchánek, Jason Pyeron, git Derrick Stolee <stolee@gmail.com>: > What it sounds like you are doing is piping a 'git fast-import' process into > reposurgeon, and testing that reposurgeon does the same thing every time. > Of course this won't be consistent if 'git fast-import' isn't consistent. It's not actually import that fails to have consistent behavior, it's export. That is, if I fast-import a given stream, I get indistinguishable in-core commit DAGs every time. (It would be pretty alarming if this weren't true!) What I have no guarantee of is the other direction. In a multibranch repo, fast-export writes out branches in an order I cannot predict and which appears from the outside to be randomly variable. > But what you should do instead is store a fixed file from one run of > 'git fast-import' and send that file to reposurgeon for the repeated test. > Don't rely on fast-import being consistent and instead use fixed input for > your test. > > If reposurgeon is providing the input to _and_ consuming the output from > 'git fast-import', then yes you will need to have at least one integration > test that runs the full pipeline. But for regression tests covering complicated > logic in reposurgeon, you're better off splitting the test (or mocking out > 'git fast-import' with something that provides consistent output given > fixed input). And I'd do that... but the problem is more fundamental than you seem to understand. git fast-export can't ship a consistent output order because it doesn't retain metadata sufficient to totally order child branches. This is why I wanted unique timestamps. That would solve the problem, branch child commits of any node would be ordered by their commit date. 
But I had a realization just now. A much smaller change would do it. Suppose branch creations had creation stamps with a weak uniqueness property; for any given parent node, the creation stamps of all branches originating there are guaranteed to be unique? If that were true, there would be an implied total ordering of the repository. The rules for writing out a totally ordered dump would go like this: 1. At any given step there is a set of active branches and a cursor on each such branch. Each cursor points at a commit and caches the creation stamp of the current branch. 2. Look at the set of commits under the cursors. Write the oldest one. If multiple commits have the same commit date, break ties by their branch creation stamps. 3. Bump that cursor forward. If you're at a branch creation, it becomes multiple cursors, one for each child branch. If you're at a join, some cursors go away. Here's the clever bit - you make the creation stamp nothing but a counter that says "This was the Nth branch creation." And it is set by these rules: 4. If the branch creation stamp is undefined at branch creation time, number it in any way you like as long as each stamp is unique. A defined, documented order would be nice but is not necessary for streams to round-trip. 5. When writing an export stream, you always utter a reset at the point of branch creation. 6. When reading an import stream, the ordinal for a new branch is defined as the number of resets you have seen. Rules 5 and 6 together guarantee that branch creation ordinals round-trip through export streams. Thus, streams round-trip and I can have my regression tests with no change to git's visible interface at all! I could write this code. -- <a href="http://www.catb.org/~esr/">Eric S. Raymond</a> ^ permalink raw reply [flat|nested] 33+ messages in thread
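For metadata that is already linearized, rules 1 through 3 above reduce to sorting commits by the pair (commit date, branch-creation ordinal), with the ordinal breaking same-second ties. A toy sketch of just that tie-break; the field names are hypothetical, and a real implementation would walk per-branch cursors through the DAG as the rules describe:

```python
def canonical_order(commits):
    # Totally order commits by (commit date, branch-creation ordinal):
    # emit the oldest commit first, and break same-second ties by the
    # ordinal assigned when the commit's branch was created.
    # Sketch with hypothetical field names, not reposurgeon/git code.
    return sorted(commits, key=lambda c: (c["date"], c["branch_ordinal"]))

commits = [
    {"id": "c1", "date": 100, "branch_ordinal": 0},
    {"id": "c3", "date": 100, "branch_ordinal": 1},  # same second, later branch
    {"id": "c2", "date": 101, "branch_ordinal": 0},
]
order = [c["id"] for c in canonical_order(commits)]
assert order == ["c1", "c3", "c2"]
```

Because the ordinal is just "Nth reset seen in the stream" under rules 5 and 6, this sort key is recoverable from the stream itself, which is what makes the round-trip property hold.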
* Re: Finer timestamps and serialization in git 2019-05-15 20:28 ` Jason Pyeron 2019-05-15 21:14 ` Derrick Stolee @ 2019-05-15 23:40 ` Eric S. Raymond 2019-05-19 0:16 ` Philip Oakley 1 sibling, 1 reply; 33+ messages in thread From: Eric S. Raymond @ 2019-05-15 23:40 UTC (permalink / raw) To: Jason Pyeron; +Cc: git, 'Derrick Stolee' Jason Pyeron <jpyeron@pdinc.us>: > If we take the below example: > > committer Name <user@domain> 1557948240 -0400 > > and we follow the rule that: > > 1. any trailing zero after the decimal point MUST be omitted > 2. if there are no digits after the decimal point, it MUST be omitted > > This would allow: > > committer Name <user@domain> 1557948240 -0400 > committer Name <user@domain> 1557948240.12 -0400 > > but the following are never allowed: > > committer Name <user@domain> 1557948240. -0400 > committer Name <user@domain> 1557948240.000000 -0400 > > By following these rules, all previous commits' hashes are unchanged. Future commits made on the top of the second will look like old commit formats. Commits coming from "older" tools will produce valid and mergeable objects. The loss of precision has frustrated us several times as well. Yes, that's almost exactly what I came up with. I was concerned with upward compatibility in fast-export streams, which reposurgeon ingests and emits. But I don't quite understand your claim that there's no format breakage here, unless you're implying to me that timestamps are already stored in the git file system as variable-length strings. Do they really never get translated into time_t? Good news if so. -- <a href="http://www.catb.org/~esr/">Eric S. Raymond</a> ^ permalink raw reply [flat|nested] 33+ messages in thread
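Jason's two MUST rules can be condensed into a few lines (a hedged sketch; `format_commit_time` and its fraction-as-a-digit-string argument are illustrative inventions, not a proposed git API):

```python
def format_commit_time(epoch_seconds, fraction=""):
    """Render a commit timestamp under the proposed rules.

    Rule 1: any trailing zero after the decimal point is omitted.
    Rule 2: if no digits remain after the decimal point, the point
    itself is omitted, so whole-second times serialize exactly as
    they always have and existing commit hashes are untouched.
    """
    frac = fraction.rstrip("0")  # rule 1: strip trailing zeros
    # rule 2: emit the bare integer when no fractional digits remain
    return f"{epoch_seconds}.{frac}" if frac else str(epoch_seconds)
```

Under these rules `1557948240.` and `1557948240.000000` can never be emitted, only `1557948240`, which is why commits made on whole-second boundaries stay byte-identical to today's format.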
* Re: Finer timestamps and serialization in git 2019-05-15 23:40 ` Eric S. Raymond @ 2019-05-19 0:16 ` Philip Oakley 2019-05-19 4:09 ` Eric S. Raymond 0 siblings, 1 reply; 33+ messages in thread From: Philip Oakley @ 2019-05-19 0:16 UTC (permalink / raw) To: esr, Jason Pyeron; +Cc: git, 'Derrick Stolee' On 16/05/2019 00:40, Eric S. Raymond wrote: > Jason Pyeron <jpyeron@pdinc.us>: >> If we take the below example: >> >> committer Name <user@domain> 1557948240 -0400 >> >> and we follow the rule that: >> >> 1. any trailing zero after the decimal point MUST be omitted >> 2. if there are no digits after the decimal point, it MUST be omitted >> >> This would allow: >> >> committer Name <user@domain> 1557948240 -0400 >> committer Name <user@domain> 1557948240.12 -0400 >> >> but the following are never allowed: >> >> committer Name <user@domain> 1557948240. -0400 >> committer Name <user@domain> 1557948240.000000 -0400 >> >> By following these rules, all previous commits' hashes are unchanged. Future commits made on the top of the second will look like old commit formats. Commits coming from "older" tools will produce valid and mergeable objects. The loss of precision has frustrated us several times as well. > Yes, that's almost exactly what I came up with. I was concerned with upward > compatibility in fast-export streams, which reposurgeon ingests and emits. > > But I don't quite understand your claim that there's no format > breakage here, unless you're implying to me that timestamps are already > stored in the git file system as variable-length strings. Do they > really never get translated into time_t? Good news if so. Maybe just take some of the object ID bits as being the fractional time timestamp. They are effectively random, so should do a reasonable job of distinguishing commits in a repeatable manner, even with full round tripping via older git versions (as long as the sha1 replicates...)
As I understand it the commit timestamp is actually free text within the commit object (try `git cat-file -p <commit_object>`), so the issue is whether the particular git version is ready to accept the additional 'dot' fractional time notation (future versions could be extended, but I think old ones would reject them if I understand the test up thread - which would compromise backward compatibility and round tripping). -- Philip ^ permalink raw reply [flat|nested] 33+ messages in thread
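Philip's suggestion amounts to deriving a deterministic sub-second tie-breaker from the effectively random object ID rather than storing anything new. A sketch, assuming a hypothetical `pseudo_fraction` helper:

```python
def pseudo_fraction(oid_hex, bits=64):
    """Take the leading `bits` of an object ID (hex string) as a
    notional fractional-time component.  Nothing new is stored in the
    commit, so the scheme round-trips through older git unchanged;
    the cost is that uniqueness is only probabilistic."""
    nibbles = bits // 4  # one hex digit per 4 bits
    return int(oid_hex[:nibbles], 16)
```

Two commits with identical whole-second timestamps would then be ordered by their `pseudo_fraction` values, repeatably on every machine that sees the same objects.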
* Re: Finer timestamps and serialization in git 2019-05-19 0:16 ` Philip Oakley @ 2019-05-19 4:09 ` Eric S. Raymond 2019-05-19 10:07 ` Philip Oakley 0 siblings, 1 reply; 33+ messages in thread From: Eric S. Raymond @ 2019-05-19 4:09 UTC (permalink / raw) To: Philip Oakley; +Cc: Jason Pyeron, git, 'Derrick Stolee' Philip Oakley <philipoakley@iee.org>: > > But I don't quite understand your claim that there's no format > > breakage here, unless you're implying to me that timestamps are already > > stored in the git file system as variable-length strings. Do they > > really never get translated into time_t? Good news if so. > Maybe just take some of the object ID bits as being the fractional time > timestamp. They are effectively random, so should do a reasonable job of > distinguishing commits in a repeatable manner, even with full round tripping > via older git versions (as long as the sha1 replicates...) Huh. That's an interesting idea. Doesn't absolutely guarantee uniqueness, but even with the birthday effect the probability of collisions could be pulled arbitrarily low. > As I understand it the commit timestamp is actually free text within the > commit object (try `git cat-file -p <commit_object>`), so the issue is > whether the particular git version is ready to accept the additional 'dot' > fractional time notation (future versions could be extended, but I think old > ones would reject them if I understand the test up thread - which would > compromise backward compatibility and round tripping). Nobody seems to want to grapple with the fact that changing hash formats is as large a problem or larger, in exactly the same way. I'm not saying that changing the timestamp granularity justifies a format break. I'm saying that *since you're going to have one anyway*, the option to increase timestamp precision at the same time should not be missed. -- <a href="http://www.catb.org/~esr/">Eric S. Raymond</a> ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Finer timestamps and serialization in git 2019-05-19 4:09 ` Eric S. Raymond @ 2019-05-19 10:07 ` Philip Oakley 0 siblings, 0 replies; 33+ messages in thread From: Philip Oakley @ 2019-05-19 10:07 UTC (permalink / raw) To: esr; +Cc: Jason Pyeron, git, 'Derrick Stolee' Hi Eric, On 19/05/2019 05:09, Eric S. Raymond wrote: > Philip Oakley <philipoakley@iee.org>: >>> But I don't quite understand your claim that there's no format >>> breakage here, unless you're implying to me that timestamps are already >>> stored in the git file system as variable-length strings. Do they >>> really never get translated into time_t? Good news if so. >> Maybe just take some of the object ID bits as being the fractional time >> timestamp. They are effectively random, so should do a reasonable job of >> distinguishing commits in a repeatable manner, even with full round tripping >> via older git versions (as long as the sha1 replicates...) > Huh. That's an interesting idea. Doesn't absolutely guarantee uniqueness, > but even with the birthday effect the probability of collisions could be pulled > arbitrarily low. depends how many bits are in the 'nano-second' resolution long word ;-) see also > >> As I understand it the commit timestamp is actually free text within the >> commit object (try `git cat-file -p <commit_object>`), so the issue is >> whether the particular git version is ready to accept the additional 'dot' >> fractional time notation (future versions could be extended, but I think old >> ones would reject them if I understand the test up thread - which would >> compromise backward compatibility and round tripping). > Nobody seems to want to grapple with the fact that changing hash formats is > as large a problem or larger, in exactly the same way. > > I'm not saying that changing the timestamp granularity justifies a format > break. I'm saying that *since you're going to have one anyway*, the option > to increase timestamp precision at the same time should not be missed. 
It is probably the round tripping issue with a non-fixed format (for the time string) that will scupper the idea, plus the focus being primarily on the DAG as the fundamental lineage (which only gives partial order, which can be an issue for other VCS systems that are based on incremental changes rather than snapshots). The transition is well underway; see thread: https://public-inbox.org/git/20190212012256.1005924-1-sandals@crustytoothpaste.net/ for a patch series. The plan is at: https://github.com/git/git/blob/master/Documentation/technical/hash-function-transition.txt <https://github.com/git/git/blob/v2.19.0-rc0/Documentation/technical/hash-function-transition.txt>, some discussions at thread: https://public-inbox.org/git/878t4xfaes.fsf@evledraar.gmail.com/ etc. The timestamp problem is known; see yesterday's thread: https://public-inbox.org/git/20190518005412.n45pj5p2rrtm2bfj@glandium.org/ Given that the object ID should be immutable for a round trip, using 64 bits from the sha1-oid as a notional 'nano-second' time does give a reasonable birthday attack resistance of ~32 bits (i.e. >1M commits with identical whole second timestamps). [or choose the sha-256 once the transition is well underway] -- Philip ^ permalink raw reply [flat|nested] 33+ messages in thread
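Philip's "~32 bits of birthday attack resistance" figure can be sanity-checked with the standard approximation p ≈ 1 - exp(-n(n-1)/2^(bits+1)) for n commits sharing the same whole-second timestamp (a sketch; the function name is an invention for illustration):

```python
import math

def collision_probability(n, bits=64):
    """Birthday-bound approximation: the chance that any two of n
    commits with identical whole-second timestamps also collide in a
    `bits`-wide pseudo-fraction taken from the object ID."""
    return 1.0 - math.exp(-n * (n - 1) / (2.0 * 2 ** bits))
```

Even a million same-second commits collide with probability well under one in a million; the 50% point is only approached near 2^32 commits, which is where the "~32 bits" resistance estimate comes from.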
* Re: Finer timestamps and serialization in git 2019-05-15 20:16 ` Derrick Stolee 2019-05-15 20:28 ` Jason Pyeron @ 2019-05-15 23:32 ` Eric S. Raymond 2019-05-16 1:14 ` Derrick Stolee 2019-05-16 9:50 ` Ævar Arnfjörð Bjarmason 1 sibling, 2 replies; 33+ messages in thread From: Eric S. Raymond @ 2019-05-15 23:32 UTC (permalink / raw) To: Derrick Stolee; +Cc: git Derrick Stolee <stolee@gmail.com>: > On 5/15/2019 3:16 PM, Eric S. Raymond wrote: > > The deeper problem is that I want something from Git that I cannot > > have with 1-second granularity. That is: a unique timestamp on each > > commit in a repository. > > This is impossible in a distributed version control system like Git > (where the commits are immutable). No matter your precision, there is > a chance that two machines commit at the exact same moment on two different > machines and then those commits are merged into the same branch. It's easy to work around that problem. Each git daemon has to single-thread its handling of incoming commits at some level, because you need a lock on the file system to guarantee consistent updates to it. So if a commit comes in that would be the same as the date of the previous commit on the current branch, you bump the incoming commit timestamp. That's the simple case. The complicated case is checking for date collisions on *other* branches. But there are ways to make that fast, too. There's a very obvious one involving a presort that is O(log2 n) in the number of commits. I wouldn't have brought this up in the first place if I didn't have a pretty clear idea how to do it in code! > Even when you specify a committer, there are many environments where a set > of parallel machines are creating commits with the same identity. If those commit sets become the same commit in the final graph, this is not a problem for total ordering. > > Why do I want this? There are a number of reasons, all related to a > > mathematical concept called "total ordering". 
At present, commits in > > a Git repository only have partial ordering. > > This is true of any directed acyclic graph. If you want a total ordering > that is completely unambiguous, then you should think about maintaining > a linear commit history by requiring rebasing instead of merging. Excuse me, but your premise is incorrect. A git DAG isn't just "any" DAG. The presence of timestamps makes a total ordering possible. (I was a theoretical mathematician in a former life. This is all very familiar ground to me.) > > One consequence is that > > action stamps - the committer/date pairs I use as VCS-independent commit > > identifications in reposurgeon - are not unique. When a patch sequence > > is applied, it can easily happen fast enough to give several successive > > commits the same committer-ID and timestamp. > > Sorting by committer/date pairs sounds like an unhelpful idea, as that > does not take any graph topology into account. It happens that commits > can actually have an _earlier_ commit date than their parents. Yes, I'm aware of that. The uniqueness properties that make a total ordering desirable are not actually dependent on timestamp order coinciding with topo order. > Changing the granularity of timestamps requires changing the commit format, > which is probably a non-starter. That's why I started by noting that you're going to have to break the format anyway to move to an ECDSA hash (or whatever you end up using). I'm saying that *since you'll need to do that anyway*, it's a good time to think about making timestamps finer-grained and unique. -- <a href="http://www.catb.org/~esr/">Eric S. Raymond</a> ^ permalink raw reply [flat|nested] 33+ messages in thread
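Eric's proposed collision rule - bump an incoming commit's timestamp until it is unique within the repository - can be sketched in a few lines (illustrative only; the integer-microsecond ticks and the `seen` set standing in for a presorted timestamp index are assumptions, and the sketch sidesteps the objection raised elsewhere in the thread that bumping changes the object ID):

```python
def assign_unique_timestamp(incoming_us, seen):
    """Bump an incoming commit's timestamp (integer microseconds)
    forward one tick at a time until it collides with nothing already
    issued in this repository, then record it as taken."""
    ts = incoming_us
    while ts in seen:
        ts += 1  # one-microsecond bump
    seen.add(ts)
    return ts
```

With microsecond granularity the bump is almost always zero or one tick; with one-second granularity the same rule would routinely shift timestamps by visible amounts, which is the motivation for finer resolution.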
* Re: Finer timestamps and serialization in git 2019-05-15 23:32 ` Eric S. Raymond @ 2019-05-16 1:14 ` Derrick Stolee 2019-05-16 9:50 ` Ævar Arnfjörð Bjarmason 1 sibling, 0 replies; 33+ messages in thread From: Derrick Stolee @ 2019-05-16 1:14 UTC (permalink / raw) To: esr; +Cc: git On 5/15/2019 7:32 PM, Eric S. Raymond wrote: > Derrick Stolee <stolee@gmail.com>: >> On 5/15/2019 3:16 PM, Eric S. Raymond wrote: >>> The deeper problem is that I want something from Git that I cannot >>> have with 1-second granularity. That is: a unique timestamp on each >>> commit in a repository. >> >> This is impossible in a distributed version control system like Git >> (where the commits are immutable). No matter your precision, there is >> a chance that two machines commit at the exact same moment on two different >> machines and then those commits are merged into the same branch. > > It's easy to work around that problem. Each git daemon has to single-thread > its handling of incoming commits at some level, because you need a lock on the > file system to guarantee consistent updates to it. > > So if a commit comes in that would be the same as the date of the > previous commit on the current branch, you bump the incoming commit timestamp. This changes the commit, causing it to have a different object id, and now the client that pushed that commit disagrees with your machine on the history. > That's the simple case. The complicated case is checking for date > collisions on *other* branches. But there are ways to make that fast, > too. There's a very obvious one involving a presort that is O(log2 > n) in the number of commits. > > I wouldn't have brought this up in the first place if I didn't have a > pretty clear idea how to do it in code! > >> Even when you specify a committer, there are many environments where a set >> of parallel machines are creating commits with the same identity. 
> > If those commit sets become the same commit in the final graph, this is > not a problem for total ordering. > >>> Why do I want this? There are a number of reasons, all related to a >>> mathematical concept called "total ordering". At present, commits in >>> a Git repository only have partial ordering. >> >> This is true of any directed acyclic graph. If you want a total ordering >> that is completely unambiguous, then you should think about maintaining >> a linear commit history by requiring rebasing instead of merging. > > Excuse me, but your premise is incorrect. A git DAG isn't just "any" DAG. > The presence of timestamps makes a total ordering possible. > > (I was a theoretical mathematician in a former life. This is all very > familiar ground to me.) Same. But you seem to have a fundamental misunderstanding about the immutability of commits, which is core to how Git works. If you change a commit, then you get a new object id and now distributed copies don't agree on the history. >>> One consequence is that >>> action stamps - the committer/date pairs I use as VCS-independent commit >>> identifications in reposurgeon - are not unique. When a patch sequence >>> is applied, it can easily happen fast enough to give several successive >>> commits the same committer-ID and timestamp. >> >> Sorting by committer/date pairs sounds like an unhelpful idea, as that >> does not take any graph topology into account. It happens that commits >> can actually have an _earlier_ commit date than their parents. > > Yes, I'm aware of that. The uniqueness properties that make a total > ordering desirable are not actually dependent on timestamp order > coinciding with topo order. > >> Changing the granularity of timestamps requires changing the commit format, >> which is probably a non-starter. > > That's why I started by noting that you're going to have to break the > format anyway to move to an ECDSA hash (or whatever you end up using). 
> > I'm saying that *since you'll need to do that anyway*, it's a good time > to think about making timestamps finer-grained and unique. That change is difficult enough as it is. I don't think your goals justify making this more complicated. You are also not considering: * The in-memory data type now needs to be a floating-point type, or an even larger integer type using a different set of units. * This data type now affects our priority queues for commit walks, how we store the commit date in the commit-graph file, how we compute relative dates for 'git log' pretty formats. -Stolee ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Finer timestamps and serialization in git 2019-05-15 23:32 ` Eric S. Raymond 2019-05-16 1:14 ` Derrick Stolee @ 2019-05-16 9:50 ` Ævar Arnfjörð Bjarmason 2019-05-19 23:15 ` Jakub Narebski 1 sibling, 1 reply; 33+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2019-05-16 9:50 UTC (permalink / raw) To: esr; +Cc: Derrick Stolee, git On Thu, May 16 2019, Eric S. Raymond wrote: > Derrick Stolee <stolee@gmail.com>: >> On 5/15/2019 3:16 PM, Eric S. Raymond wrote: >> > The deeper problem is that I want something from Git that I cannot >> > have with 1-second granularity. That is: a unique timestamp on each >> > commit in a repository. >> >> This is impossible in a distributed version control system like Git >> (where the commits are immutable). No matter your precision, there is >> a chance that two machines commit at the exact same moment on two different >> machines and then those commits are merged into the same branch. > > It's easy to work around that problem. Each git daemon has to single-thread > its handling of incoming commits at some level, because you need a lock on the > file system to guarantee consistent updates to it. You don't need a daemon now to write commits to a repository. You can just add stuff to the object store, and then later flip the SHA-1 on a reference; we lock those individual references, but this sort of thing would require a global write lock. This would introduce huge concurrency caveats that are non-issues now. Dumb clients matter. Now you can e.g. have two libgit2 processes writing to ref A and B respectively in the same repo, and they never have to know about each other or care about IPC. Also, even if you have daemons accepting pushes they can now be on different computers sharing things over e.g. an NFS filesystem. Now you need some FS-based serialization protocol for commits and their timestamps. 
> So if a commit comes in that would be the same as the date of the > previous commit on the current branch, you bump the incoming commit timestamp. > That's the simple case. The complicated case is checking for date > collisions on *other* branches. But there are ways to make that fast, > too. There's a very obvious one involving a presort that is O(log2 > n) in the number of commits. What Derrick mentioned downthread of this "I rebase your pushes" being fundamentally un-git applies, but let's assume we can somehow get past that for the sake of argument. The model you're trying to impose here of "within a repo I want to serialize all X" just doesn't play with how git views the world. Git cares about graphs being serialized; it doesn't care about arbitrary sets of graphs. E.g. let's say I push a commit X to github, and now I want to push the same history to gitlab, I might be thwarted because they have some side-ref they themselves make (e.g. the PR or MR refs) which conflicts with this "timestamps must monotonically increase across all branches in a repo" view of the world. The only thing that matters in git in this regard is how individual refs behave; we then by convention tend to have a 1=1 mapping between those sets of refs and a repository, but in a lot of cases it's many=1. E.g. in cases where such a hosting site might have one underlying repo store exposed to multiple users via ref namespace prefixes. > I wouldn't have brought this up in the first place if I didn't have a > pretty clear idea how to do it in code! > >> Even when you specify a committer, there are many environments where a set >> of parallel machines are creating commits with the same identity. > > If those commit sets become the same commit in the final graph, this is > not a problem for total ordering. > >> > Why do I want this? There are a number of reasons, all related to a >> > mathematical concept called "total ordering". 
At present, commits in >> > a Git repository only have partial ordering. >> >> This is true of any directed acyclic graph. If you want a total ordering >> that is completely unambiguous, then you should think about maintaining >> a linear commit history by requiring rebasing instead of merging. > > Excuse me, but your premise is incorrect. A git DAG isn't just "any" DAG. > The presence of timestamps makes a total ordering possible. > > (I was a theoretical mathematician in a former life. This is all very > familiar ground to me.) > >> > One consequence is that >> > action stamps - the committer/date pairs I use as VCS-independent commit >> > identifications in reposurgeon - are not unique. When a patch sequence >> > is applied, it can easily happen fast enough to give several successive >> > commits the same committer-ID and timestamp. >> >> Sorting by committer/date pairs sounds like an unhelpful idea, as that >> does not take any graph topology into account. It happens that commits >> can actually have an _earlier_ commit date than their parents. > > Yes, I'm aware of that. The uniqueness properties that make a total > ordering desirable are not actually dependent on timestamp order > coinciding with topo order. > >> Changing the granularity of timestamps requires changing the commit format, >> which is probably a non-starter. > > That's why I started by noting that you're going to have to break the > format anyway to move to an ECDSA hash (or whatever you end up using). > > I'm saying that *since you'll need to do that anyway*, it's a good time > to think about making timestamps finer-grained and unique. We should really discuss proposed format changes separately from tacking them onto the SHA-256 transition, because as I noted upthread your premise that you need a format change for this isn't true. *If* this was a good idea it's something you can add to commit objects. 
And yeah, git-interpret-trailers is a bit of a kludge, which is why I mentioned you can add new headers to the format; this is e.g. how GPG signed commits work. Of course whether it makes any sense to add such a thing to the format is another matter; I'm not at all convinced, but that's a separate discussion from how it would be done. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Finer timestamps and serialization in git 2019-05-16 9:50 ` Ævar Arnfjörð Bjarmason @ 2019-05-19 23:15 ` Jakub Narebski 2019-05-20 0:45 ` Eric S. Raymond 0 siblings, 1 reply; 33+ messages in thread From: Jakub Narebski @ 2019-05-19 23:15 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason; +Cc: esr, Derrick Stolee, git Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: > On Thu, May 16 2019, Eric S. Raymond wrote: >> Derrick Stolee <stolee@gmail.com>: >>> On 5/15/2019 3:16 PM, Eric S. Raymond wrote: >>>> The deeper problem is that I want something from Git that I cannot >>>> have with 1-second granularity. That is: a unique timestamp on each >>>> commit in a repository. >>> >>> This is impossible in a distributed version control system like Git >>> (where the commits are immutable). No matter your precision, there is >>> a chance that two machines commit at the exact same moment on two different >>> machines and then those commits are merged into the same branch. >> >> It's easy to work around that problem. Each git daemon has to single-thread >> its handling of incoming commits at some level, because you need a lock on the >> file system to guarantee consistent updates to it. As far as I understand it this would slow down receiving new commits tremendously. Currently great care is taken to not have to parse the commit object during fetch or push if it is not necessary (thanks to things such as reachability bitmaps, see e.g. [1]). With this restriction you would need to parse each commit to get at commit timestamp and committer, check if the committer+timestamp is unique, and bump it if it is not. Also, bumping the timestamp means that the commit changed, means that its contents-based ID changed, means that all commits that follow it need to have their contents changed... And now you need to rewrite many commits. 
And you also break the assumptions that the same commits have the same contents (including date) and the same ID in different repositories (some of which may include additional branches, some of which may have been part of a network of related repositories, etc.). [1]: https://github.blog/2015-09-22-counting-objects/ http://githubengineering.com/counting-objects/ > You don't need a daemon now to write commits to a repository. You can > just add stuff to the object store, and then later flip the SHA-1 on a > reference; we lock those individual references, but this sort of thing > would require a global write lock. This would introduce huge concurrency > caveats that are non-issues now. > > Dumb clients matter. Now you can e.g. have two libgit2 processes writing > to ref A and B respectively in the same repo, and they never have to > know about each other or care about IPC. > > Also, even if you have daemons accepting pushes they can now be on > different computers sharing things over e.g. an NFS filesystem. Now you > need some FS-based serialization protocol for commits and their > timestamps. Also, performance matters. Especially for large repositories, and for a large number of repositories. >> So if a commit comes in that would be the same as the date of the >> previous commit on the current branch, you bump the incoming commit timestamp. You do realize that dates may not be monotonic (because of imperfections in clock synchronization), thus the fact that the date is different from the parent's does not mean that it is different from an ancestor's. >> That's the simple case. The complicated case is checking for date >> collisions on *other* branches. But there are ways to make that fast, >> too. There's a very obvious one involving a presort that is O(log2 >> n) in the number of commits. I don't think the performance hit you would get would be acceptable. [...] >>>> Why do I want this? There are a number of reasons, all related to a >>>> mathematical concept called "total ordering". 
At present, commits in >>>> a Git repository only have partial ordering. >>> >>> This is true of any directed acyclic graph. If you want a total ordering >>> that is completely unambiguous, then you should think about maintaining >>> a linear commit history by requiring rebasing instead of merging. >> >> Excuse me, but your premise is incorrect. A git DAG isn't just "any" DAG. >> The presence of timestamps makes a total ordering possible. >> >> (I was a theoretical mathematician in a former life. This is all very >> familiar ground to me.) Maybe in theory, when all clocks are synchronized. But not in practice. Shit happens. Just recently Mike Hommey wrote about the case he has to deal with: MH> I'm hitting another corner case in some other "weird" history, where MH> I have 500k commits all with the same date. [2]: https://public-inbox.org/git/20190518005412.n45pj5p2rrtm2bfj@glandium.org/t/#u -- Jakub Narębski ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Finer timestamps and serialization in git 2019-05-19 23:15 ` Jakub Narebski @ 2019-05-20 0:45 ` Eric S. Raymond 2019-05-20 9:43 ` Jakub Narebski 0 siblings, 1 reply; 33+ messages in thread From: Eric S. Raymond @ 2019-05-20 0:45 UTC (permalink / raw) To: Jakub Narebski Cc: Ævar Arnfjörð Bjarmason, Derrick Stolee, git Jakub Narebski <jnareb@gmail.com>: > As far as I understand it this would slow down receiving new commits > tremendously. Currently great care is taken to not have to parse the > commit object during fetch or push if it is not necessary (thanks to > things such as reachability bitmaps, see e.g. [1]). > > With this restriction you would need to parse each commit to get at > commit timestamp and committer, check if the committer+timestamp is > unique, and bump it if it is not. So, I'd want to measure that rather than simply assuming it's a blocker. Clocks are very cheap these days. > Also, bumping the timestamp means that the commit changed, means that its > contents-based ID changed, means that all commits that follow it need > to have their contents changed... And now you need to rewrite many > commits. What "commits that follow it?" By hypothesis, the incoming commit's timestamp is bumped (if it's bumped) when it's first added to a branch or branches, before there are following commits in the DAG. > And you also break the assumptions that the same commits have > the same contents (including date) and the same ID in different > repositories (some of which may include additional branches, some of > which may have been part of a network of related repositories, etc.). Wait...unless I completely misunderstand the hash-chain model, doesn't the hash of a commit depend on the hashes of its parents? If that's the case, commits cannot have portable hashes. If it's not, please correct me. But if it's not, how does your first objection make sense? > > You don't need a daemon now to write commits to a repository. 
You can > > just add stuff to the object store, and then later flip the SHA-1 on a > > reference; we lock those individual references, but this sort of thing > > would require a global write lock. This would introduce huge concurrency > > caveats that are non-issues now. > > > > Dumb clients matter. Now you can e.g. have two libgit2 processes writing > > to ref A and B respectively in the same repo, and they never have to > > know about each other or care about IPC. How do they know they're not writing to the same ref? What keeps *that* operation atomic? > You do realize that dates may not be monotonic (because of imperfections > in clock synchronization), thus the fact that the date is different from > the parent's does not mean that it is different from an ancestor's. Good point. That means the O(log2 n) version of the check has to be done all the time. Unfortunate. > >> That's the simple case. The complicated case is checking for date > >> collisions on *other* branches. But there are ways to make that fast, > >> too. There's a very obvious one involving a presort that is O(log2 > >> n) in the number of commits. > > I don't think the performance hit you would get would be acceptable. Again, it's bad practice to assume rather than measure. Human intuitions about this sort of thing are notoriously unreliable. > >> Excuse me, but your premise is incorrect. A git DAG isn't just "any" DAG. > >> The presence of timestamps makes a total ordering possible. > >> > >> (I was a theoretical mathematician in a former life. This is all very > >> familiar ground to me.) > > Maybe in theory, when all clocks are synchronized. My assertion does not depend on synchronized clocks, because it doesn't have to. If the timestamps in your repo are unique, there *is* a total ordering - by timestamp. What you don't get is guaranteed consistency with the topo ordering - that is you get no guarantee that a child's timestamp is greater than its parents'. That really would require a common timebase. 
But I don't need that stronger property, because the purpose of totally ordering the repo is to guarantee the uniqueness of action stamps. For that, all I need is to be able to generate a unique cookie for each commit that can be inserted in its action stamp. For my use cases that cookie should *not* be a hash, because hashes always break N years down. It should be an eternally stable product of the commit metadata. -- <a href="http://www.catb.org/~esr/">Eric S. Raymond</a> ^ permalink raw reply [flat|nested] 33+ messages in thread
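[Editor's note: a minimal sketch of the uniqueness rule ESR is asking for. This is a hypothetical helper, not reposurgeon's actual code: treat an action stamp as committer plus timestamp, and bump a colliding timestamp forward by one microsecond until the stamp is unique within the repository.]

```python
# Sketch of per-repository unique action stamps (hypothetical helper,
# NOT reposurgeon code). A stamp is committer + timestamp; a colliding
# stamp is bumped forward one microsecond until it is unique.
from datetime import datetime, timedelta, timezone

def unique_action_stamp(committer, when, seen):
    """Return an action stamp unique within this repo, bumping on collision."""
    while (committer, when) in seen:
        when += timedelta(microseconds=1)  # bump into the next free slot
    seen.add((committer, when))
    return "%s!%s" % (committer, when.strftime("%Y-%m-%dT%H:%M:%S.%fZ"))

seen = set()
t = datetime(2019, 5, 15, 20, 1, 15, tzinfo=timezone.utc)
a = unique_action_stamp("esr@thyrsus.com", t, seen)
b = unique_action_stamp("esr@thyrsus.com", t, seen)  # same second: gets bumped
```

With microsecond granularity the second stamp lands one tick later, so the two stamps differ even though the wall-clock second is identical.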
* Re: Finer timestamps and serialization in git 2019-05-20 0:45 ` Eric S. Raymond @ 2019-05-20 9:43 ` Jakub Narebski 2019-05-20 10:08 ` Ævar Arnfjörð Bjarmason ` (2 more replies) 0 siblings, 3 replies; 33+ messages in thread From: Jakub Narebski @ 2019-05-20 9:43 UTC (permalink / raw) To: Eric S. Raymond Cc: Ævar Arnfjörð Bjarmason, Derrick Stolee, git "Eric S. Raymond" <esr@thyrsus.com> writes: > Jakub Narebski <jnareb@gmail.com>: >> As far as I understand it this would slow down receiving new commits >> tremendously. Currently great care is taken to not have to parse the >> commit object during fetch or push if it is not necessary (thanks to >> things such as reachability bitmaps, see e.g. [1]). >> >> With this restriction you would need to parse each commit to get at >> commit timestamp and committer, check if the committer+timestamp is >> unique, and bump it if it is not. > > So, I'd want to measure that rather than simply assuming it's a blocker. > Clocks are very cheap these days. Clocks may be cheap, but parsing is not. You can receive new commits in the repository by creating them, and from another repository (via push or fetch). In the second case you often get many commits at once. In [1] it is described how using a "bitmap index" you can avoid parsing commits when deciding which objects to send to the client; they can be directly copied to the client (added to the packfile that is sent to the client). Thanks to this reachability bitmap (bit vector) the time to clone the Linux repository decreased from 57 seconds to 1.6 seconds. It is not a direct correspondence, but there most probably would be the same problem with requiring fractional timestamp+committer identity to be unique on the receiving side. [1]: https://githubengineering.com/counting-objects/ >> Also, bumping timestamp means that the commit changed, means that its >> contents-based ID changed, means that all commits that follow it need >> to have their contents changed...
And now you need to rewrite many >> commits. > > What "commits that follow it?" By hypothesis, the incoming commit's > timestamp is bumped (if it's bumped) when it's first added to a branch > or branches, before there are following commits in the DAG. Errr... the main problem is with the distributed nature of Git, i.e. when two repositories create different commits with the same committer+timestamp value. You receive commits on fetch or push, and you receive many commits at once. Say you have two repositories, and the history looks like this:

repo A: 1<---2<---a<---x<---c<---d <- master

repo B: 1<---2<---X<---3<---4 <- master

When you push from repo A to repo B, or fetch in repo B from repo A, you would get the following DAG of revisions:

repo B: 1<---2<---X<---3<---4 <- master
             \
              \--a<---x<---c<---d <- repo_A/master

Now let's assume that commits X and x have the same committer and the same fractional timestamp, while being different commits. Then you would need to bump the timestamp of 'x', changing the commit. This means that 'c' needs to be rewritten too, and 'd' also:

repo B: 1<---2<---X<---3<---4 <- master
             \
              \--a<---x'<--c'<--d' <- repo_A/master

And now for the final nail in the coffin of the Bazaar-esque idea of changing commits on arrival. Say that repository A created new commits, and pushed them to B. You would need to rewrite all future commits from this repository too, and you would always fetch all commits starting from the first "bumped" one:

repo A: 1<---2<---a<---x<---c<---d<---E <- master

This means a transfer of [<---x<---c<---d<---E], instead of [<--E], because 'x', 'c', and 'd' are missing in repo B.

repo B: 1<---2<---X<---3<---4 <- master
             \
              \--a<---x'<--c'<--d'<--E' <- repo_A/master

And there is yet another problem.
Let's assume that repo B created some history on top of bump-rewritten commits:

repo B: 1<---2<---X<---3<---4 <- master
             \
              \--a<---x'<--c'<--d'<--E' <- repo_A/master
                             \
                              \--5 <- next

Then if in repo A you fetch from repo B (remember, in Git there is no concept of a central repository), you would get the following history:

             /--X'<--3'<--4' <- repo_B/master
            /
repo A: 1<---2<---a<---x<---c<---d<---E <- master
             \
              \---x'<--c'
                      \
                       \--5 <- repo_B/master

(because 'X' is now incoming, it needs to be "bumped", therefore changing 3' and 4'). The history without all this rewriting looks like this:

             /--X<---3<---4' <- repo_B/master
            /
repo A: 1<---2<---a<---x<---c<---d<---E <- master
                            \
                             \--5 <- repo_B/master

Notice the difference? >> And you also break the assumptions that the same commits have >> the same contents (including date) and the same ID in different >> repositories (some of which may include additional branches, some of >> which may have been part of a network of related repositories, etc.). See repo A and repo B in the above example. > Wait...unless I completely misunderstand the hash-chain model, doesn't the > hash of a commit depend on the hashes of its parents? If that's the case, > commits cannot have portable hashes. If it's not, please correct me. > > But if it's not, how does your first objection make sense? The hash of a commit depends on the hashes of its parents (Merkle tree). That is why signing a commit (or a tag pointing to the commit) signs the whole history of the commit. >>> You don't need a daemon now to write commits to a repository. You can >>> just add stuff to the object store, and then later flip the SHA-1 on a >>> reference, we lock those individual references, but this sort of thing >>> would require a global write lock. This would introduce huge concurrency >>> caveats that are non-issues now. >>> >>> Dumb clients matter. Now you can e.g.
have two libgit2 processes writing >>> to ref A and B respectively in the same repo, and they never have to >>> know about each other or care about IPC. > > How do they know they're not writing to the same ref? What keeps > *that* operation atomic? Because different refs are stored in different files (at least for "live" refs that are stored in loose ref format). The lock is taken per ref (to update the ref and its reflog in sync); there is no need to take a global lock on all refs. >> You do realize that dates may not be monotonic (because of imperfections >> in clock synchronization), thus the fact that the date is different from >> the parent's does not mean that it is different from an ancestor's. > > Good point. That means the O(log2 n) version of the check has to be done > all the time. Unfortunate. Especially with around 1 million commits (Linux kernel, Chromium, AOSP), or even 3M commits (MS Windows repository). >>>>> That's the simple case. The complicated case is checking for date >>>>> collisions on *other* branches. But there are ways to make that fast, >>>>> too. There's a very obvious one involving a presort that is O(log2 >>>>> n) in the number of commits. >>> >>> I don't think the performance hit you would get would be acceptable. >> >> Again, it's bad practice to assume rather than measure. Human intuitions >> about this sort of thing are notoriously unreliable. Techniques created to handle very large repositories (with respect to the number of commits) that make it possible for Git to avoid parsing commit objects, namely the bitmap index (for 'git fetch'/'clone') and the serialized commit graph (for 'git log'), lead to _significant_ performance improvements. The performance changes from "waiting for Git to finish" to "done in the blink of an eye" (well, almost). >>>>> Excuse me, but your premise is incorrect.
This is all very >>>> familiar ground to me.) >> >> Maybe in theory, when all clock are synchronized. > > My assertion does not depend on synchronized clocks, because it doesn't have to. > > If the timestamps in your repo are unique, there *is* a total ordering - > by timestamp. What you don't get is guaranteed consistency with the > topo ordering - that is you get no guarantee that a child's timestamp > is greater than its parents'. That really would require a common > timebase. > > But I don't need that stronger property, because the purpose of > totally ordering the repo is to guarantee the uniqueness of action > stamps. For that, all I need is to be able to generate a unique cookie > for each commit that can be inserted in its action stamp. For cookie to be unique among all forks / clones of the same repository you need either centralized naming server, or for the cookie to be based on contents of the commit (i.e. be a hash function). > For my use cases > that cookie should *not* be a hash, because hashes always break N years > down. It should be an eternally stable product of the commit metadata. Well, the idea for SHA-1 <--> NewHash == SHA-256 transition is to avoid having a flag day, and providing full interoperability between repositories and Git installations using the old hash ad using new hash^1. This will be done internally by using SHA-1 <--> SHA-256 mapping. So after the transition all you need is to publish this mapping somewhere, be it with Internet Archive or Software Heritage. Problem solved. P.S. Could you explain to me how one can use action stamp, e.g. <esr@thyrsus.com!2019-05-15T20:01:15.473209800Z>, to quickly find the commit it refers to? With SHA-1 id you have either filesystem pathname or the index file for pack to find it _fast_. Footnotes: ---------- 1. That is why where would be no "major format break", thus no place for incompatibile format changes. Best, -- Jakub Narębski ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Finer timestamps and serialization in git 2019-05-20 9:43 ` Jakub Narebski @ 2019-05-20 10:08 ` Ævar Arnfjörð Bjarmason 2019-05-20 12:40 ` Jeff King 2019-05-20 14:14 ` Eric S. Raymond 2 siblings, 0 replies; 33+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2019-05-20 10:08 UTC (permalink / raw) To: Jakub Narebski; +Cc: Eric S. Raymond, Derrick Stolee, git On Mon, May 20 2019, Jakub Narebski wrote: > "Eric S. Raymond" <esr@thyrsus.com> writes: >> Jakub Narebski <jnareb@gmail.com>: > >>> As far as I understand it this would slow down receiving new commits >>> tremendously. Currently great care is taken to not have to parse the >>> commit object during fetch or push if it is not necessary (thanks to >>> things such as reachability bitmaps, see e.g. [1]). >>> >>> With this restriction you would need to parse each commit to get at >>> commit timestamp and committer, check if the committer+timestamp is >>> unique, and bump it if it is not. >> >> So, I'd want to measure that rather than simply assuming it's a blocker. >> Clocks are very cheap these days. > > Clocks may be cheap, but parsing is not. > > You can receive new commits in the repository by creating them, and from > other repository (via push or fetch). In the second case you often get > many commits at once. > > In [1] it is described how using "bitmap index" you can avoid parsing > commits when deciding which objects to send to the client; they can be > directly copied to the client (added to the packfile that is sent to > client). Thanks to this reachability bitmap (bit vector) the time to > clone Linux repository decreased from 57 seconds to 1.6 seconds. > > It is not a direct correspondence, but there most probably would be the > same problem with requiring fractional timestamp+committer identity to > be unique on the receiving side. 
> > [1]: https://githubengineering.com/counting-objects/ We're in violent agreement about the general viability of ESR's proposed plan, but just a side-note on this point. I don't think this is right. I.e. I don't think a hypothetical version of git that guarantees monotonically increasing timestamps will be slow in *this* regard. For accepting pushes we already unpack all the commits / content / hash it to perform fsck checks, which is why screwing with the commit timestamp will fail on push: https://public-inbox.org/git/87zhnnv0b8.fsf@evledraar.gmail.com/ Same on the client with fetches; although transfer.fsckObjects isn't on there, we do most of the work anyway for hashing & basic validation purposes. The bitmaps wouldn't be affected because they're computed after-the-fact on the basis of reachability, whereas validating increasing timestamps for a single branch is cheap: you just look at each A..B push incrementally and see if the timestamps are increasing and past A's parent. It's trickier if you're trying to make the same guarantee for *all* ref updates in a given repo (and locking caveats etc. have been discussed elsewhere), but not *that* much of a PITA. We'd need to compare "new" packs/loose objects against the new push, and an obvious shortcut in such a schema, if you required a global lock anyway, would be for the process taking the lock to write out "this is the current max timestamp" when finished. In *this* case that's a long way down the journey into crazytown :) But it is interesting to think about in general, because with e.g. the commit-graph we have a set of commits that are "optimized" in some side-index, so it becomes useful for many algorithms to be able to ask "what is the current set of unoptimized commits". Once you have that, and can keep the size of it down with "gc", many algorithms that require graph traversal become possible, because your O(n) of needing to consider the "n" unoptimized commits is small enough vs.
the bulk of "optimized" commits as to not matter.

^ permalink raw reply [flat|nested] 33+ messages in thread
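[Editor's note: the per-branch check Ævar describes - walk the pushed range and verify committer timestamps never move backwards - can be sketched as follows. This is a hypothetical standalone check, not an actual git hook.]

```python
# Sketch of an incremental monotonicity check over a pushed A..B range
# (hypothetical, not an actual git hook). Each entry is the commit's
# committer timestamp paired with its parent's timestamp (None for the
# range boundary); a child may never claim to be older than its parent.
def timestamps_monotonic(commits):
    """commits: pushed range, oldest first, as (ts, parent_ts_or_None)."""
    for ts, parent_ts in commits:
        if parent_ts is not None and ts < parent_ts:
            return False  # clock went backwards relative to the parent
    return True

ok = timestamps_monotonic([(100, None), (101, 100), (101, 101)])
bad = timestamps_monotonic([(100, None), (99, 100)])  # child older than parent
```

Note that, per Jakub's earlier objection, equal timestamps and unsynchronized clocks mean this check alone cannot guarantee strict ordering against *ancestors*, only against direct parents.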
* Re: Finer timestamps and serialization in git 2019-05-20 9:43 ` Jakub Narebski 2019-05-20 10:08 ` Ævar Arnfjörð Bjarmason @ 2019-05-20 12:40 ` Jeff King 2019-05-20 14:14 ` Eric S. Raymond 2 siblings, 0 replies; 33+ messages in thread From: Jeff King @ 2019-05-20 12:40 UTC (permalink / raw) To: Jakub Narebski Cc: Eric S. Raymond, Ævar Arnfjörð Bjarmason, Derrick Stolee, git On Mon, May 20, 2019 at 11:43:14AM +0200, Jakub Narebski wrote: > You can receive new commits in the repository by creating them, and from > other repository (via push or fetch). In the second case you often get > many commits at once. > > In [1] it is described how using "bitmap index" you can avoid parsing > commits when deciding which objects to send to the client; they can be > directly copied to the client (added to the packfile that is sent to > client). Thanks to this reachability bitmap (bit vector) the time to > clone Linux repository decreased from 57 seconds to 1.6 seconds. No, this is mixing up sending and receiving. On the sending side, we try very hard not to open up objects if we can avoid it (using tricks like reachability bitmaps helps us quickly decide what to send, and reusing the on-disk packfile data lets us send out objects without decompressing them). But on the receiving side, we do not trust the sender at all. The protocol specifically does not send the sha1 of any object. The receiver instead inflates every object it gets and computes the object hash itself. And then on top of that, we traverse the commit graph to make sure that the server sent us all of the objects we need to have a complete graph. So adding any extra object-quality checks on the receiving side would not really change that equation. But I do otherwise agree with your mail that the general idea of having the receiver _change_ the incoming objects is going to lead to a world of headaches. -Peff ^ permalink raw reply [flat|nested] 33+ messages in thread
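[Editor's note: the receiver-side hashing Peff describes is easy to demonstrate. Git's object ID is a hash over a "<type> <size>\0" header plus the raw body, which the receiver recomputes itself rather than trusting anything the sender claims. A sketch using the current SHA-1 scheme:]

```python
# How a receiver recomputes an object ID itself: git hashes the header
# "<type> <size>\0" plus the raw body; IDs are never taken from the wire.
import hashlib

def git_object_id(obj_type, body):
    """Compute a git object ID over header + body, as the receiver does."""
    header = b"%s %d\x00" % (obj_type, len(body))
    return hashlib.sha1(header + body).hexdigest()

# The well-known ID of the empty blob:
assert git_object_id(b"blob", b"") == "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
```

Because the ID is derived from the bytes received, a sender who tampered with a commit's timestamp (or anything else) simply produces a different object, which is why "screwing with the commit timestamp will fail" the fsck checks mentioned upthread.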
* Re: Finer timestamps and serialization in git 2019-05-20 9:43 ` Jakub Narebski 2019-05-20 10:08 ` Ævar Arnfjörð Bjarmason 2019-05-20 12:40 ` Jeff King @ 2019-05-20 14:14 ` Eric S. Raymond 2019-05-20 14:41 ` Michal Suchánek ` (2 more replies) 2 siblings, 3 replies; 33+ messages in thread From: Eric S. Raymond @ 2019-05-20 14:14 UTC (permalink / raw) To: Jakub Narebski Cc: Ævar Arnfjörð Bjarmason, Derrick Stolee, git Jakub Narebski <jnareb@gmail.com>: > > What "commits that follow it?" By hypothesis, the incoming commit's > > timestamp is bumped (if it's bumped) when it's first added to a branch > > or branches, before there are following commits in the DAG. > > Errr... the main problem is with the distributed nature of Git, i.e. when > two repositories create different commits with the same > committer+timestamp value. You receive commits on fetch or push, and > you receive many commits at once. > > Say you have two repositories, and the history looks like this:
>
> repo A: 1<---2<---a<---x<---c<---d <- master
>
> repo B: 1<---2<---X<---3<---4 <- master
>
> When you push from repo A to repo B, or fetch in repo B from repo A you > would get the following DAG of revisions
>
> repo B: 1<---2<---X<---3<---4 <- master
>              \
>               \--a<---x<---c<---d <- repo_A/master
>
> Now let's assume that commits X and x have the same committer and the > same fractional timestamp, while being different commits. Then you > would need to bump the timestamp of 'x', changing the commit. This means > that 'c' needs to be rewritten too, and 'd' also:
>
> repo B: 1<---2<---X<---3<---4 <- master
>              \
>               \--a<---x'<--c'<--d' <- repo_A/master

Of course that's true. But you were talking as though all those commits have to be modified *after they're in the DAG*, and that's not the case.
The only way c would need to be modified is if bumping x's timestamp caused an actual collision with c's. I don't see any conceptual problem with this. You appear to me to be confusing two issues. Yes, bumping timestamps would mean that all hashes downstream in the Merkle tree would be generated differently, even when there's no timestamp collision, but so what? The hash of a commit isn't portable to begin with - it can't be, because AFAIK there's no guarantee that the ancestry parts of the DAG in two repositories where copies of it live contain all the same commits and topo relationships. > And now for the final nail in the coffin of the Bazaar-esque idea of > changing commits on arrival. Say that repository A created new commits, > and pushed them to B. You would need to rewrite all future commits from > this repository too, and you would always fetch all commits starting > from the first "bumped" I don't see how the second clause of your last sentence follows from the first unless commit hashes really are supposed to be portable across repositories. And I don't see how that can be so given that 'git am' exists and a branch can thus be rooted at a different place after it is transported and integrated. > The hash of a commit depends on the hashes of its parents (Merkle tree). > That is why signing a commit (or a tag pointing to the commit) signs the > whole history of the commit. That's what I thought. > > How do they know they're not writing to the same ref? What keeps > *that* operation atomic? > > Because different refs are stored in different files (at least for > "live" refs that are stored in loose ref format). The lock is taken per > ref (to update the ref and its reflog in sync); there is no need to take > a global lock on all refs. OK, that makes sense. > For a cookie to be unique among all forks / clones of the same repository > you need either a centralized naming server, or for the cookie to be based > on the contents of the commit (i.e. to be a hash function).
I don't need uniqueness across all forks, only uniqueness *within the repo*. I want this for two reasons: (1) so that action stamps are unique, (2) so that there is a unique canonical ordering of commits in a fast-export stream. (Without that second property there are surgical cases I can't regression-test.) > > For my use cases > > that cookie should *not* be a hash, because hashes always break N years > > down. It should be an eternally stable product of the commit metadata. > > Well, the idea for the SHA-1 <--> NewHash == SHA-256 transition is to avoid > having a flag day, and providing full interoperability between > repositories and Git installations using the old hash and those using the new > hash^1. This will be done internally by using a SHA-1 <--> SHA-256 > mapping. So after the transition all you need is to publish this > mapping somewhere, be it with the Internet Archive or Software Heritage. > Problem solved. I don't see it. How does this prevent old clients from barfing on new repositories? > P.S. Could you explain to me how one can use an action stamp, e.g. > <esr@thyrsus.com!2019-05-15T20:01:15.473209800Z>, to quickly find the > commit it refers to? With a SHA-1 id you have either a filesystem pathname > or the pack index file to find it _fast_. For the purposes that make action stamps important I don't really care about performance much (though there are fairly obvious ways to achieve it). My goal is to ensure that revision histories (e.g. in their import-stream format) are forward-portable to future VCSes without requiring any data outside the stream itself. Please remember that I'm accustomed to maintaining infrastructure on decadal timescales - I wrote code in the 1980s that is still in wide use and I expect some of the code I'm writing now to be still in use thirty years from now. This gives me a different perspective on the fragility of things like SHA-1 hashes.
From a decadal-scale POV any particular crypto-hash format is unstable garbage, and having such hashes in change comments is a maintainability disaster waiting to happen. Action stamps are specifically designed so that they're pointers to commits that don't require anything but the target commit's import/export-stream metadata to resolve. Your idea of an archived hash registry makes me extremely nervous; I think it's too fragile to trust. So let me back up a step. I will cheerfully drop advocating bumping timestamps if anyone can show me a different way to define a per-commit reference cookie that (a) is unique within its repo, and (b) only requires metadata visible in the fast-export representation of the commit. -- <a href="http://www.catb.org/~esr/">Eric S. Raymond</a> ^ permalink raw reply [flat|nested] 33+ messages in thread
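[Editor's note: one of the "fairly obvious ways" to make action-stamp lookup fast, sketched hypothetically - the commit data shown is invented for illustration. A single pass over commit metadata builds a side index from stamp to hash, after which each lookup is a dictionary probe.]

```python
# Hypothetical side index for resolving action stamps quickly: one pass
# over (hash, committer, timestamp) metadata, then O(1) lookups. The
# commit hash shown is an invented placeholder, not a real object ID.
def build_stamp_index(commits):
    """commits: iterable of (commit_hash, committer, timestamp_string)."""
    index = {}
    for commit_hash, committer, ts in commits:
        index["%s!%s" % (committer, ts)] = commit_hash
    return index

idx = build_stamp_index([
    ("9f3a...", "esr@thyrsus.com", "2019-05-15T20:01:15.473209800Z"),
])
```

This answers Jakub's P.S. at the cost of a derived index (analogous to a pack .idx): the stamp itself stays an eternally stable product of the metadata, and the perishable hash lives only in the locally rebuildable index.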
* Re: Finer timestamps and serialization in git 2019-05-20 14:14 ` Eric S. Raymond @ 2019-05-20 14:41 ` Michal Suchánek 2019-05-20 22:18 ` Philip Oakley 2019-05-20 21:38 ` Elijah Newren 2019-05-21 0:08 ` Jakub Narebski 2 siblings, 1 reply; 33+ messages in thread From: Michal Suchánek @ 2019-05-20 14:41 UTC (permalink / raw) To: Eric S. Raymond Cc: Jakub Narebski, Ævar Arnfjörð Bjarmason, Derrick Stolee, git On Mon, 20 May 2019 10:14:17 -0400 "Eric S. Raymond" <esr@thyrsus.com> wrote: > Jakub Narebski <jnareb@gmail.com>: > > > What "commits that follow it?" By hypothesis, the incoming commit's > > > timestamp is bumped (if it's bumped) when it's first added to a branch > > > or branches, before there are following commits in the DAG. > > > > Errr... the main problem is with the distributed nature of Git, i.e. when > > two repositories create different commits with the same > > committer+timestamp value. You receive commits on fetch or push, and > > you receive many commits at once. > > > > Say you have two repositories, and the history looks like this:
> >
> > repo A: 1<---2<---a<---x<---c<---d <- master
> >
> > repo B: 1<---2<---X<---3<---4 <- master
> >
> > When you push from repo A to repo B, or fetch in repo B from repo A you > > would get the following DAG of revisions
> >
> > repo B: 1<---2<---X<---3<---4 <- master
> >              \
> >               \--a<---x<---c<---d <- repo_A/master
> >
> > Now let's assume that commits X and x have the same committer and the > > same fractional timestamp, while being different commits. Then you > > would need to bump the timestamp of 'x', changing the commit. This means > > that 'c' needs to be rewritten too, and 'd' also:
> >
> > repo B: 1<---2<---X<---3<---4 <- master
> >              \
> >               \--a<---x'<--c'<--d' <- repo_A/master
>
> Of course that's true. But you were talking as though all those commits > have to be modified *after they're in the DAG*, and that's not the case.
> If any timestamp has to be modified, it only has to happen *once*, at the
> time its commit enters the repo.

And that's where you get it wrong. Git is *distributed*. There is more than one repository. Each repository has its own DAG that is completely unrelated to the other repositories and their DAGs. So when you take your history and push it to another repository and the timestamps change as the result, what ends up in the other repository is not the history you pushed. So the repositories diverge and you no longer know what is what.

> Actually, in the normal case only x would need to be modified. The only
> way c would need to be modified is if bumping x's timestamp caused an
> actual collision with c's.
>
> I don't see any conceptual problem with this. You appear to me to be
> confusing two issues. Yes, bumping timestamps would mean that all
> hashes downstream in the Merkle tree would be generated differently,
> even when there's no timestamp collision, but so what? The hash of a
> commit isn't portable to begin with - it can't be, because AFAIK
> there's no guarantee that the ancestry parts of the DAG in two
> repositories where copies of it live contain all the same commits and
> topo relationships.

If you push from one repository to another repository now, you get the exact same history with the exact same hashes. So the hashes are portable across repositories that share history. With your proposed change hashes can be modified on push/pull, so repositories no longer share history and hashes become non-portable. That's why it is a bad idea.

The commits are currently identified by the hash, so it must not change during push/pull. Changing the identifier to something else (e.g. a content hash without (some) metadata) might be useful to make the identifier more stable, but will bring other problems when you need two different identifiers for the same content to include it in two unrelated histories.

Thanks

Michal

^ permalink raw reply [flat|nested] 33+ messages in thread
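Suchánek's point — that a single timestamp bump rewrites every descendant identifier — falls directly out of hash chaining, and a toy content-addressed commit chain (a simplified stand-in for Git's real object format) makes it concrete:

```python
import hashlib

def commit_id(parent_id, committer, timestamp_ns):
    # Toy stand-in for Git's object hashing: the id covers the parent id,
    # so any change to an ancestor changes every descendant id too.
    data = "parent %s\ncommitter %s %d\n" % (parent_id, committer, timestamp_ns)
    return hashlib.sha256(data.encode()).hexdigest()[:12]

def build_chain(timestamps):
    # Linear history: each commit's id depends on its parent's id.
    ids, parent = [], "root"
    for ts in timestamps:
        parent = commit_id(parent, "esr@thyrsus.com", ts)
        ids.append(parent)
    return ids

original = build_chain([100, 200, 300])
bumped   = build_chain([101, 200, 300])   # bump only the first timestamp

# The bump rewrites *every* commit after it, even though only one
# timestamp changed -- which is why the two repositories would stop
# sharing any history from that point on.
assert original[0] != bumped[0]
assert original[1] != bumped[1] and original[2] != bumped[2]
```

This is the "x' / c' / d'" cascade in the diagram above: x' is a different commit than x, so everything built on it differs as well.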
* Re: Finer timestamps and serialization in git 2019-05-20 14:41 ` Michal Suchánek @ 2019-05-20 22:18 ` Philip Oakley 0 siblings, 0 replies; 33+ messages in thread
From: Philip Oakley @ 2019-05-20 22:18 UTC (permalink / raw)
To: Michal Suchánek, Eric S. Raymond
Cc: Jakub Narebski, Ævar Arnfjörð Bjarmason, Derrick Stolee, git

Hi,

On 20/05/2019 15:41, Michal Suchánek wrote:
>> But you were talking as though all those commits
>> have to be modified *after they're in the DAG*, and that's not the case.
>> If any timestamp has to be modified, it only has to happen *once*, at the
>> time its commit enters the repo.
> And that's where you get it wrong. Git is *distributed*. There is more
> than one repository. Each repository has its own DAG

So far so good. In fact it is the change to 'distributed' that has ruined Eric's action stamps, which assume that the 'time' came from a single central server.

> that is completely
> unrelated to the other repositories and their DAGs.

This bit will confuse. It is only the new commits in the different repositories that are 'unrelated'. Their common history commits are identical sha1 values, and the DAG links back to their common root commit(s).

> So when you take
> your history and push it to another repository and the timestamps
> change as the result what ends up in the other repository is not the
> history you pushed. So the repositories diverge and you no longer know
> what is what.
>
If the sender tweaks their timestamps at commit time, then no one 'knows'. It's just a minor bit of clock drift/slop. But once they have a cascaded history which has been published (and used) you are locked into that, as noted previously.

The significant change is the loss of the central server and the referential nature of its clock time stamp. If the action stamp is just a useful temporary intermediary in a transfer then cheats are possible (e.g. some randomising hash of a definitive part of the commit).
But if the action stamps are meant to be permanent and re-generatable for a round trip from a central change-set-based server to Git, and then back again, repeatably, without divergence, loss, or change, then it is not going to happen reliably. To do so requires the creation of a fixed total order (by design - single clock) from commits that are only partially ordered (by design! - a DAG rather than multiple unsynchronized user clocks). For backward compatibility Git only has (and only needs) 1-second resolution.

The multi-decade/century VCS idea of a master artifact and then near copies (since kaolin and linen drawings, blueprints, ..) with central _control_ is being replaced by zero-cost perfect replication, authentication by hash, and the distribution of control (of artifact entry into the VCS) from managers to _users_. Managers simply select and decide on the artifact quality and authorize the use of a hash.

Most folks haven't really looked below the surface of what it is that makes Git and DVCS so successful, and it's not just the Linus effect. The previous certainties (e.g. the idea of a total order to allow logging by change-set) have gone.

--
Philip

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Finer timestamps and serialization in git 2019-05-20 14:14 ` Eric S. Raymond 2019-05-20 14:41 ` Michal Suchánek @ 2019-05-20 21:38 ` Elijah Newren 2019-05-20 23:12 ` Eric S. Raymond 2019-05-21 0:08 ` Jakub Narebski 2 siblings, 1 reply; 33+ messages in thread From: Elijah Newren @ 2019-05-20 21:38 UTC (permalink / raw) To: esr Cc: Jakub Narebski, Ævar Arnfjörð Bjarmason, Derrick Stolee, Git Mailing List Hi, On Mon, May 20, 2019 at 11:09 AM Eric S. Raymond <esr@thyrsus.com> wrote: > > For cookie to be unique among all forks / clones of the same repository > > you need either centralized naming server, or for the cookie to be based > > on contents of the commit (i.e. be a hash function). > > I don't need uniquess across all forks, only uniqueness *within the repo*. You've lost me. In other places you stated you didn't want to use the commit hash, and now you say this. If you only care about uniqueness within the current copy of the repo and don't care about uniqueness across forks (i.e. clones or copies that exist now or in the future -- including copies stored using SHA256), then what's wrong with using the commit hash? > I want this for two reasons: (1) so that action stamps are unique, (2) > so that there is a unique canonical ordering of commits in a fast export > stream. A stable ordering of commits in a fast-export stream might be a cool feature. But I don't know how to define one, other than perhaps sort first by commit-depth (maybe optionally adding a few additional intermediate sorting criteria), and then finally sort by commit hash as a tiebreaker. Without the fallback to commit hash, you fall back on normal traversal order which isn't stable (it depends on e.g. order of branches listed on the command line to fast-export, or if using --all, what new branch you just added that comes alphabetically before others). I suspect that solution might run afoul of your dislike for commit hashes, though, so I'm not sure it'd work for you. 
> (Without that second property there are surgical cases I can't > regression-test.) > > > > For my use case > > > that cookie should *not* be a hash, because hashes always break N years > > > down. It should be an eternally stable product of the commit metadata. > > > > Well, the idea for SHA-1 <--> NewHash == SHA-256 transition is to avoid > > having a flag day, and providing full interoperability between > > repositories and Git installations using the old hash ad using new > > hash^1. This will be done internally by using SHA-1 <--> SHA-256 > > mapping. So after the transition all you need is to publish this > > mapping somewhere, be it with Internet Archive or Software Heritage. > > Problem solved. > > I don't see it. How does this prevent old clients from barfing on new > repositories? Depends on range of time for "old". The plan as I understood it (which is suspect): make git version which understand both SHA-1 and SHA-256 (which I think is already done, though I haven't followed closely), wait some time, allow people to opt in to converting, allow more time, consider ways of nudging people to switch. You are right that clients older than any version that understands SHA-256 would barf on the new repositories. > So let me back up a step. I will cheerfully drop advocating bumping > timestamps if anyone can tell me how a different way to define a per-commit > reference cookie that (a) is unique within its repo, and (b) only requires > metadata visible in the fast-export representation of the commit. Does passing --show-original-ids option to fast-export and using the resulting original-oid field as the cookie count? ^ permalink raw reply [flat|nested] 33+ messages in thread
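The ordering Newren sketches above — commit depth first, commit hash as the tiebreaker — can be prototyped over a parent map in a few lines (a sketch of the idea; git fast-export implements no such mode):

```python
# Toy DAG: commit -> list of parents.  Short hex names stand in for hashes.
parents = {
    "a1": [],
    "b2": ["a1"], "c3": ["a1"],   # b2 and c3 have equal depth; the hash breaks the tie
    "d4": ["b2", "c3"],
}

def depth(commit, dag):
    # Depth = longest path back to a root.  Since depth(child) is always
    # greater than depth(parent), sorting by depth keeps parents before
    # children.  Plain recursion is fine for a toy-sized DAG.
    return 1 + max((depth(p, dag) for p in dag[commit]), default=0)

def stable_order(dag):
    # Sort by (depth, hash): the hash tiebreaker makes the order
    # independent of traversal details (branch listing order, --all,
    # which branch was added last), which is what plain fast-export
    # output order is sensitive to.
    return sorted(dag, key=lambda c: (depth(c, dag), c))

print(stable_order(parents))  # ['a1', 'b2', 'c3', 'd4']
```

This is only canonical up to the tiebreaker, which is exactly Newren's caveat: dropping the hash fallback leaves you with unstable traversal order.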
* Re: Finer timestamps and serialization in git 2019-05-20 21:38 ` Elijah Newren @ 2019-05-20 23:12 ` Eric S. Raymond 0 siblings, 0 replies; 33+ messages in thread From: Eric S. Raymond @ 2019-05-20 23:12 UTC (permalink / raw) To: Elijah Newren Cc: Jakub Narebski, Ævar Arnfjörð Bjarmason, Derrick Stolee, Git Mailing List Elijah Newren <newren@gmail.com>: > Hi, > > On Mon, May 20, 2019 at 11:09 AM Eric S. Raymond <esr@thyrsus.com> wrote: > > > > For cookie to be unique among all forks / clones of the same repository > > > you need either centralized naming server, or for the cookie to be based > > > on contents of the commit (i.e. be a hash function). > > > > I don't need uniquess across all forks, only uniqueness *within the repo*. > > You've lost me. In other places you stated you didn't want to use the > commit hash, and now you say this. If you only care about uniqueness > within the current copy of the repo and don't care about uniqueness > across forks (i.e. clones or copies that exist now or in the future -- > including copies stored using SHA256), then what's wrong with using > the commit hash? Because it's not self-describing, can't be computed solely from visible commit metadata, and relies on complex external assumptions about how the hash is computed which break when your VCS changes hash algorithms. These are dealbreakers because one of my major objectives is forward portability of these IDs forever. And I mean *forever*. It should be possible for someone in the year 40,000, in between assaulting planets for the God-Emperor, to look at an import stream and deduce how to resolve the cookies to their commits without seeing git's code or knowing anything about its hash algorithms. I think maybe the reason I'm having so much trouble getting this across is that git insiders are used to thinking of import streams as transient things. Because I do a lot of repo migrations, I have a very different view of them. 
I built reposurgeon on the realization that they're a general transport format for revision histories, and that has forward value independent of the existence of git. If a stream contained fully forward-portable action stamps, it would be forward-portable forever. Hashes in commit comments are the *only* blocker to that. Take this from a person who has spent way too much time patching Subversion IDs like r1234 during repository conversions. It would take so little to make this work. Existing stream format is *almost there*. > A stable ordering of commits in a fast-export stream might be a cool > feature. But I don't know how to define one, other than perhaps sort > first by commit-depth (maybe optionally adding a few additional > intermediate sorting criteria), and then finally sort by commit hash > as a tiebreaker. Without the fallback to commit hash, you fall back > on normal traversal order which isn't stable (it depends on e.g. order > of branches listed on the command line to fast-export, or if using > --all, what new branch you just added that comes alphabetically before > others). > > I suspect that solution might run afoul of your dislike for commit > hashes, though, so I'm not sure it'd work for you. It does. See above. > > So let me back up a step. I will cheerfully drop advocating bumping > > timestamps if anyone can tell me how a different way to define a per-commit > > reference cookie that (a) is unique within its repo, and (b) only requires > > metadata visible in the fast-export representation of the commit. > > Does passing --show-original-ids option to fast-export and using the > resulting original-oid field as the cookie count? I was not aware of this option. Looking...no wonder, it's not on my system man page. Must be recent. OK. Wow. That is *useful*, and I am going to upgrade reposurgeon to read it. With that I can do automatic commit-reference rewriting. I don't consider it a complete solution. 
The problem is that OID is a consistent property that can be used to resolve cookies, but there's no guarantee that it's a *preserved* property that survives multiple round trips and changes in hash functions.

So the right way to use it is to pick it up, do reference-cookie resolution, and then mung the reference cookies to a format that is stable forever. I don't know what that format should be yet. I have a message in composition about this.

--
<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>

^ permalink raw reply [flat|nested] 33+ messages in thread
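The `original-oid` lines under discussion look like this in a stream from `git fast-export --show-original-ids`, and picking them up for cookie resolution is a few lines of parsing (the sample stream is illustrative, not real repository data):

```python
sample_stream = """\
commit refs/heads/master
mark :1
original-oid 3f786850e387550fdab836ed7e6dc881de23001b
author Eric S. Raymond <esr@thyrsus.com> 1557950475 -0400
committer Eric S. Raymond <esr@thyrsus.com> 1557950475 -0400
data 14
first revision
"""

def collect_original_oids(stream):
    # Map fast-import marks to the exporter's original object ids.
    # In the stream format, `original-oid` follows the `mark` line of
    # the object it belongs to.
    oids, mark = {}, None
    for line in stream.splitlines():
        if line.startswith("mark :"):
            mark = line.split(":", 1)[1]
        elif line.startswith("original-oid ") and mark is not None:
            oids[mark] = line.split(" ", 1)[1]
    return oids

print(collect_original_oids(sample_stream))
# {'1': '3f786850e387550fdab836ed7e6dc881de23001b'}
```

As Raymond notes, this gives a table for resolving hash references on import; it does not by itself make those references survive a later change of hash function.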
* Re: Finer timestamps and serialization in git 2019-05-20 14:14 ` Eric S. Raymond 2019-05-20 14:41 ` Michal Suchánek 2019-05-20 21:38 ` Elijah Newren @ 2019-05-21 0:08 ` Jakub Narebski 2019-05-21 1:05 ` Eric S. Raymond 2 siblings, 1 reply; 33+ messages in thread
From: Jakub Narebski @ 2019-05-21 0:08 UTC (permalink / raw)
To: Eric S. Raymond
Cc: Ævar Arnfjörð Bjarmason, Derrick Stolee, git

"Eric S. Raymond" <esr@thyrsus.com> writes:

> Jakub Narebski <jnareb@gmail.com>:
>>> What "commits that follow it?" By hypothesis, the incoming commit's
>>> timestamp is bumped (if it's bumped) when it's first added to a branch
>>> or branches, before there are following commits in the DAG.
>>
>> Errr... the main problem is with distributed nature of Git, i.e. when
>> two repositories create different commits with the same
>> committer+timestamp value. You receive commits on fetch or push, and
>> you receive many commits at once.
>>
>> Say you have two repositories, and the history looks like this:
>>
>> repo A: 1<---2<---a<---x<---c<---d <- master
>>
>> repo B: 1<---2<---X<---3<---4 <- master
>>
>> When you push from repo A to repo B, or fetch in repo B from repo A you
>> would get the following DAG of revisions
>>
>> repo B: 1<---2<---X<---3<---4 <- master
>>              \
>>               \--a<---x<---c<---d <- repo_A/master
>>
>> Now let's assume that commits X and x have the same committer and the
>> same fractional timestamp, while being different commits. Then you
>> would need to bump timestamp of 'x', changing the commit. This means
>> that 'c' needs to be rewritten too, and 'd' also:
>>
>> repo B: 1<---2<---X<---3<---4 <- master
>>              \
>>               \--a<---x'<--c'<--d' <- repo_A/master
>
> Of course that's true. But you were talking as though all those commits
> have to be modified *after they're in the DAG*, and that's not the case.
> If any timestamp has to be modified, it only has to happen *once*, at the
> time its commit enters the repo.
The time commit 'x' was created in repo A there was no need to bump the timestamp. Same with commit 'X' in repo B (well, unless there is a central serialization server - which would not fly). It is only after the push from repo A to repo B that we have two commits, 'x' and 'X', with the same timestamp.

> Actually, in the normal case only x would need to be modified. The only
> way c would need to be modified is if bumping x's timestamp caused an
> actual collision with c's.
>
> I don't see any conceptual problem with this. You appear to me to be
> confusing two issues. Yes, bumping timestamps would mean that all
> hashes downstream in the Merkle tree would be generated differently,
> even when there's no timestamp collision, but so what? The hash of a
> commit isn't portable to begin with - it can't be, because AFAIK
> there's no guarantee that the ancestry parts of the DAG in two
> repositories where copies of it live contain all the same commits and
> topo relationships.

Errr... how did you get the idea that the hash of a commit is not portable??? Same contents means same hash, i.e. same object identifier. Two repositories can have part of their history in common (for example different forks of the same repository, like the different "trees" of the Linux kernel), sharing part of the DAG. Same commits, same topo relationships. That's how _distributed_ version control works.

[I think we may have been talking past each other.]

>> And now for the final nail in the coffin of the Bazaar-esque idea of
>> changing commits on arrival. Say that repository A created new commits,
>> and pushed them to B. You would need to rewrite all future commits from
>> this repository too, and you would always fetch all commits starting
>> from the first "bumped"
>
> I don't see how the second clause of your last sentence follows from the
> first unless commit hashes really are supposed to be portable across
> repositories.
> And I don't see how that can be so given that 'git am'
> exists and a branch can thus be rooted at a different place after
> it is transported and integrated.

'git rebase', 'git rebase --interactive' and 'git am' create different commits; that is why their result is called "history rewriting" (it is actually creating an altered copy, and garbage-collecting the old pre-copy and pre-change version). Anyway, the recommended practice is to not rewrite published history (where somebody could have bookmarked it). Note also that this copying preserves the author date, not the committer date; also commits can be deleted, split and merged during a "rewrite".

Fetch and push do not use 'git am', and they preserve commits and their identities. That is how they can be effective and performant.

>> Hash of a commit depends on hashes of its parents (Merkle tree). That is
>> why signing a commit (or a tag pointing to the commit) signs the whole
>> history of a commit.
>
> That's what I thought.

[...]

>> For cookie to be unique among all forks / clones of the same repository
>> you need either a centralized naming server, or for the cookie to be based
>> on contents of the commit (i.e. be a hash function).
>
> I don't need uniqueness across all forks, only uniqueness *within the repo*.

Err, what? So the proposed "action stamp" identifier is even more useless? If you can't use <esr@thyrsus.com!2019-05-15T20:01:15.473209800Z> to uniquely name a revision, so that every person that has that commit can know which commit it is, what's the use? Is "action stamp" meant to be some local identifier, like Mercurial's Subversion-like revision number, good only for the local repository?

> I want this for two reasons: (1) so that action stamps are unique, (2)
> so that there is a unique canonical ordering of commits in a fast export
> stream.
>
> (Without that second property there are surgical cases I can't
> regression-test.)

You can always use the object identifier (hash) for tiebreaking in the second use case.
>>> For my use cases >>> that cookie should *not* be a hash, because hashes always break N years >>> down. It should be an eternally stable product of the commit metadata. >> >> Well, the idea for SHA-1 <--> NewHash == SHA-256 transition is to avoid >> having a flag day, and providing full interoperability between >> repositories and Git installations using the old hash ad using new >> hash^1. This will be done internally by using SHA-1 <--> SHA-256 >> mapping. So after the transition all you need is to publish this >> mapping somewhere, be it with Internet Archive or Software Heritage. >> Problem solved. > > I don't see it. How does this prevent old clients from barfing on new > repositories? The SHA-1 <--> SHA-256 interoperation is on the client-server level; one can use old Git that uses SHA-1 from repository that uses SHA-256, and vice versa. >> P.S. Could you explain to me how one can use action stamp, e.g. >> <esr@thyrsus.com!2019-05-15T20:01:15.473209800Z>, to quickly find the >> commit it refers to? With SHA-1 id you have either filesystem pathname >> or the index file for pack to find it _fast_. > > For the purposes that make action stamps important I don't really care > about performance much (though there are fairly obvious ways to > achieve it). What ways? > My goal is to ensure that revision histories (e.g. in > their import-stream format) are forward-portable to future VCSes > without requiring any data outside the stream itself. In Git you can store "action stamp" in extra extension headers in commit objects (as was already proposed in this thread). Best, -- Jakub Narębski ^ permalink raw reply [flat|nested] 33+ messages in thread
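Narebski's core point — same contents, same hash, in any repository — is plain content-addressing, which two lines demonstrate (the `commit <size>\0` framing mirrors Git's loose-object header, but this is a toy, not Git's full object encoding):

```python
import hashlib

def oid(payload):
    # Content-addressing: the identifier is a pure function of the bytes.
    # Two repositories that hold the same commit bytes therefore agree on
    # its id with no coordination at all -- this is what makes hashes
    # portable across repositories that share history.
    return hashlib.sha1(b"commit %d\x00" % len(payload) + payload).hexdigest()

commit_bytes = b"tree t\nparent p\ncommitter esr 1557950475\n\nmsg\n"
repo_a = oid(commit_bytes)   # computed independently in repo A
repo_b = oid(commit_bytes)   # ... and in repo B
assert repo_a == repo_b      # same bytes, same id, everywhere
```

It also shows the flip side Suchánek raised: any scheme that mutates a commit's bytes on arrival (e.g. bumping its timestamp) necessarily changes its id and breaks this agreement.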
* Re: Finer timestamps and serialization in git 2019-05-21 0:08 ` Jakub Narebski @ 2019-05-21 1:05 ` Eric S. Raymond 0 siblings, 0 replies; 33+ messages in thread From: Eric S. Raymond @ 2019-05-21 1:05 UTC (permalink / raw) To: Jakub Narebski Cc: Ævar Arnfjörð Bjarmason, Derrick Stolee, git Jakub Narebski <jnareb@gmail.com>: > Errr... how did you get that the hash of a commit is not portable??? OK. You're telling me that premise was wrong. Thank you, accepted. I've since had a better idea. Expect mail soon. -- <a href="http://www.catb.org/~esr/">Eric S. Raymond</a> ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Finer timestamps and serialization in git 2019-05-15 19:16 Finer timestamps and serialization in git Eric S. Raymond 2019-05-15 20:16 ` Derrick Stolee @ 2019-05-15 20:20 ` Ævar Arnfjörð Bjarmason 2019-05-16 0:35 ` Eric S. Raymond 2019-05-16 4:14 ` Jeff King 1 sibling, 2 replies; 33+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-05-15 20:20 UTC (permalink / raw)
To: Eric S. Raymond; +Cc: git

On Wed, May 15 2019, Eric S. Raymond wrote:

> The recent increase in vulnerability in SHA-1 means, I hope, that you
> are planning for the day when git needs to change to something like
> an elliptic-curve hash. This means you're going to have a major
> format break. Such is life.

Note that most users of Git (default build options) won't be vulnerable to the latest attack (or SHAttered), see https://public-inbox.org/git/875zqbx5yz.fsf@evledraar.gmail.com/T/#u

But yes, the plan is to move to SHA-256. See https://github.com/git/git/blob/next/Documentation/technical/hash-function-transition.txt

> Since this is going to have to happen anyway

The SHA-1 <-> SHA-256 transition is planned to happen, but there are some strong opinions that this should be *only* for munging the content for hashing, not adding new stuff while we're at it (even if optional). See: https://public-inbox.org/git/87ftyyedqd.fsf@evledraar.gmail.com/

> let me request two
> functional changes in git. Neither will be at all difficult, but the
> first one is also a thing that cannot be done without a format break,
> which is why I have not suggested them before. They come from lots of
> (often painful) experience with repository conversions via
> reposurgeon.
>
> 1. Finer granularity on commit timestamps.

If you wanted milli/micro/nano-second timestamps for commit objects or whatever other new info then it doesn't need to break the commit header format. You put key-values in the commit message and read them back out via git-interpret-trailers.
Or even put it in the header itself, e.g.:

    author <name> <epoch> <tz>
    committer <name> <epoch> <tz>
    x-author-ns <nanosecond part of author>
    x-committer-ns <nanosecond part of committer>

Of course nobody would understand that new thing from day one, but that's nothing compared to breaking the existing header format.

> 2. Timestamps unique per repository
>
> The coarse resolution of git timestamps, and the lack of uniqueness,
> are at the bottom of several problems that are persistently irritating
> when I do repository conversions and surgery.
>
> The most obvious issue, though a relatively superficial one, is that I have
> to throw away information whenever I convert a repository from a system with
> finer-grained time. Notably this is the case with Subversion, which keeps
> time to milliseconds. This is probably the only respect in which its data
> model remains superior to git's. :-)

Should be solved by putting it in the commit as noted above, just not in the very narrow part of the object that's reserved and not going to change.

More generally plenty of *->git importers write some extra data in the commits, usually in the commit message. Try e.g. cloning a SVN repo with "git svn clone" and see what it does.

> The deeper problem is that I want something from Git that I cannot
> have with 1-second granularity. That is: a unique timestamp on each
> commit in a repository. The only way to be certain of this is for git
> to delay accepting integration of a patch until it can issue a unique
> time mark for it - obviously impractical if the quantum is one second,
> but not if it's a millisecond or microsecond.
>
> Why do I want this? There are a number of reasons, all related to a
> mathematical concept called "total ordering". At present, commits in
> a Git repository only have partial ordering. One consequence is that
> action stamps - the committer/date pairs I use as VCS-independent commit
> identifications in reposurgeon - are not unique.
> When a patch sequence
> is applied, it can easily happen fast enough to give several successive
> commits the same committer-ID and timestamp.
>
> Of course the commit hash remains a unique commit ID. But it can't
> easily be parsed and followed by a human, which is a UX problem when
> it's used as a commit stamp in change comments.

You cannot get a guaranteed "total order" of any sort in anything like git's current object model without taking a global lock on all write operations. Otherwise how would two concurrent ref updates / object writes be guaranteed not to get the same timestamp? Unlikely with nanosecond accuracy, but not impossible.

Even if you solve that, take two such repositories and "git merge --allow-unrelated-histories" them together. Now what's the order?

These issues are solved by defining ordering in terms of the graph, and writing this information after-the-fact. That's already part of git. See https://github.com/git/git/blob/next/Documentation/technical/commit-graph.txt and https://devblogs.microsoft.com/devops/supercharging-the-git-commit-graph-ii-file-format/

> More deeply, the lack of total ordering means that repository graphs
> don't have a single canonical serialized form. This sounds abstract
> but it means there are surgical operations I can't regression-test
> properly. My colleague Edward Cree has found cases where git fast-export
> can issue a stream dump for which git fast-import won't necessarily
> re-color certain interior nodes the same way when it's read back in
> and I'm pretty sure the absence of total ordering on the branch tips
> is at the bottom of that.

Can you clarify what you mean by this? You run fast-import twice and get different results, is that it? If so that sounds like a bug.

> I'm willing to write patches if this direction is accepted. I've figured
> out how to make fast-import streams upward-compatible with finer-grained
> timestamps.

^ permalink raw reply [flat|nested] 33+ messages in thread
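The "ordering in terms of the graph" Bjarmason points to is what the commit-graph's generation numbers provide: a root commit has generation 1, every other commit one more than the maximum of its parents'. A sketch of that rule (not of the commit-graph file format):

```python
def generations(parents):
    # Generation number rule from the commit-graph design: gen(root) = 1,
    # gen(c) = 1 + max(gen(p) for p in parents of c).  A graph-derived
    # partial-order key, computed after the fact, with no clock involved.
    gen = {}
    def g(c):
        if c not in gen:
            gen[c] = 1 + max((g(p) for p in parents[c]), default=0)
        return gen[c]
    for c in parents:
        g(c)
    return gen

# Two unrelated histories merged together (Bjarmason's example): the
# merge commit 'm' still gets a well-defined generation number.
parents = {"r1": [], "r2": [], "a": ["r1"], "b": ["r2"], "m": ["a", "b"]}
print(generations(parents))
# {'r1': 1, 'r2': 1, 'a': 2, 'b': 2, 'm': 3}
```

Note this yields only a partial order — `a` and `b` share generation 2 — which is exactly the property that makes it robust under merges where a unique-timestamp scheme breaks down.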
* Re: Finer timestamps and serialization in git 2019-05-15 20:20 ` Ævar Arnfjörð Bjarmason @ 2019-05-16 0:35 ` Eric S. Raymond 2019-05-16 4:14 ` Jeff King 1 sibling, 0 replies; 33+ messages in thread From: Eric S. Raymond @ 2019-05-16 0:35 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason; +Cc: git Ævar Arnfjörð Bjarmason <avarab@gmail.com>: > You put it key-values in the commit message and read it back out via > git-interpret-trailers. Speaking as a person who has done a lot of repository migrations, this makes me shudder. It's fragile, kludgy, and does not maintain proper separation of concerns. The feature I *didn't* ask for at the next format break is a user-modifiable key-value store per commit that is *not* in the commit comment. Bzr has this. It's useful. -- <a href="http://www.catb.org/~esr/">Eric S. Raymond</a> ^ permalink raw reply [flat|nested] 33+ messages in thread
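For concreteness, the trailer mechanism Raymond is objecting to would work roughly like this — a nanosecond remainder stashed as a trailer and recombined with the one-second header timestamp (the `x-committer-ns` name is the thread's hypothetical, not anything Git defines):

```python
commit_message = """\
Fix the frobnicator

Longer explanation here.

x-committer-ns: 473209800
"""

def committer_time_ns(header_epoch, message):
    # Recombine the 1-second header timestamp with the sub-second part
    # stored in a trailer (the kind of key-value that
    # `git interpret-trailers --parse` surfaces).  Falls back to 0 ns
    # when the trailer is absent.
    ns = 0
    for line in message.splitlines():
        if line.startswith("x-committer-ns:"):
            ns = int(line.split(":", 1)[1])
    return header_epoch * 10**9 + ns

print(committer_time_ns(1557950475, commit_message))  # 1557950475473209800
```

Raymond's objection stands apart from the mechanics: the data round-trips, but it lives in the human-readable comment rather than in a separate metadata store.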
* Re: Finer timestamps and serialization in git 2019-05-15 20:20 ` Ævar Arnfjörð Bjarmason 2019-05-16 0:35 ` Eric S. Raymond @ 2019-05-16 4:14 ` Jeff King 1 sibling, 0 replies; 33+ messages in thread From: Jeff King @ 2019-05-16 4:14 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason; +Cc: Eric S. Raymond, git On Wed, May 15, 2019 at 10:20:03PM +0200, Ævar Arnfjörð Bjarmason wrote: > > Since this is going to have to happen anyway > > The SHA-1 <-> SHA-256 transition is planned to happen, but there's some > strong opinions that this should be *only* for munging the content for > hashing, not adding new stuff while we're at it (even if optional). See > : https://public-inbox.org/git/87ftyyedqd.fsf@evledraar.gmail.com/ One reason for this is that the transition plan calls for being able to convert between the sha1 and sha256 representations losslessly (which makes interoperability possible and avoids a flag day). So even if the sha256 format understood floating-point timestamps in the committer header, we'd have to have some way of representing that same information in the sha1 format. Which implies putting it into a new header, as you described below. And if it's in a new header in sha1, then is there any real advantage in having it somewhere else in the sha256 version? I dunno. Maybe a little, as eventually all of the sha1 formats would die off, after everybody has transitioned. -Peff ^ permalink raw reply [flat|nested] 33+ messages in thread
end of thread, other threads:[~2019-05-21 1:05 UTC | newest] Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2019-05-15 19:16 Finer timestamps and serialization in git Eric S. Raymond 2019-05-15 20:16 ` Derrick Stolee 2019-05-15 20:28 ` Jason Pyeron 2019-05-15 21:14 ` Derrick Stolee 2019-05-15 22:07 ` Ævar Arnfjörð Bjarmason 2019-05-16 0:28 ` Eric S. Raymond 2019-05-16 1:25 ` Derrick Stolee 2019-05-20 15:05 ` Michal Suchánek 2019-05-20 16:36 ` Eric S. Raymond 2019-05-20 17:22 ` Derrick Stolee 2019-05-20 21:32 ` Eric S. Raymond 2019-05-15 23:40 ` Eric S. Raymond 2019-05-19 0:16 ` Philip Oakley 2019-05-19 4:09 ` Eric S. Raymond 2019-05-19 10:07 ` Philip Oakley 2019-05-15 23:32 ` Eric S. Raymond 2019-05-16 1:14 ` Derrick Stolee 2019-05-16 9:50 ` Ævar Arnfjörð Bjarmason 2019-05-19 23:15 ` Jakub Narebski 2019-05-20 0:45 ` Eric S. Raymond 2019-05-20 9:43 ` Jakub Narebski 2019-05-20 10:08 ` Ævar Arnfjörð Bjarmason 2019-05-20 12:40 ` Jeff King 2019-05-20 14:14 ` Eric S. Raymond 2019-05-20 14:41 ` Michal Suchánek 2019-05-20 22:18 ` Philip Oakley 2019-05-20 21:38 ` Elijah Newren 2019-05-20 23:12 ` Eric S. Raymond 2019-05-21 0:08 ` Jakub Narebski 2019-05-21 1:05 ` Eric S. Raymond 2019-05-15 20:20 ` Ævar Arnfjörð Bjarmason 2019-05-16 0:35 ` Eric S. Raymond 2019-05-16 4:14 ` Jeff King