* Revision walking, commit dates, slop @ 2019-05-18 0:54 Mike Hommey 2019-05-18 1:50 ` SZEDER Gábor 0 siblings, 1 reply; 23+ messages in thread From: Mike Hommey @ 2019-05-18 0:54 UTC (permalink / raw) To: git Hi, There are established corner cases, where in a repo where commit dates are not monotonically increasing, revision walking can go horribly wrong. This was discussed in the past in e.g. https://public-inbox.org/git/20150521061553.GA29269@glandium.org/ The only (simple) workable way, given the current algorithm, to get an accurate view off rev-list is to essentially make slop infinite. This works fine, at the expense of runtime. Now, ignoring any modification for the above, I'm hitting another corner case in some other "weird" history, where I have 500k commits all with the same date. With such a commit dag, something as trivial as `git rev-list HEAD~..HEAD` goes through all commits from the root commit to HEAD, which takes multiple seconds, when the (obvious) output is one commit. It looks like the only way revision walking stops going through all the ancestry is through slop, and slop is essentially made infinite by the fact all commits have the same date (because of the date check in still_interesting(). By extension, this means the workaound for the first corner case above, which is to make slop infinite, essentially makes all rev walking go through the entire ancestry of the commits given on the command line. It feels like some cases of everybody_uninteresting should shorcut slop entirely, but considering the only way for slop to decrease at all is when everybody_uninteresting returns true, that would seem like a wrong assumption. But I'm also not sure what slop helps with in the first place (but I don't have a clear view of the broader picture of how the entire revision walking works). Anyways, a rather easy way to witness this happening is to create a dummy repo like: git init foo cd foo for i in $(seq 1 50); do echo $i > a; git add a; git commit -a -m $i; done The something as simple as `git rev-list HEAD~..HEAD` will go through all 50 commits (assuming the script above created commits in the same second, which it did on my machine) By the time both HEAD~ and HEAD have been processed, the revision walking should have enough information to determine that it doesn't need to go further, but still does. Even with something like HEAD~2..HEAD, after the first round of processing parents it should be able to see there's not going to be any more interesting commits. I'm willing to dig into this, but if someone familiar with the algorithm could give me some hints as to what I might be missing in the big picture, that would be helpful. Cheers, Mike ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Revision walking, commit dates, slop 2019-05-18 0:54 Revision walking, commit dates, slop Mike Hommey @ 2019-05-18 1:50 ` SZEDER Gábor 2019-05-18 3:58 ` Mike Hommey 2019-05-21 2:00 ` Revision walking, commit dates, slop Jonathan Nieder 0 siblings, 2 replies; 23+ messages in thread From: SZEDER Gábor @ 2019-05-18 1:50 UTC (permalink / raw) To: Mike Hommey; +Cc: git On Sat, May 18, 2019 at 09:54:12AM +0900, Mike Hommey wrote: > There are established corner cases, where in a repo where commit dates > are not monotonically increasing, revision walking can go horribly > wrong. This was discussed in the past in e.g. > https://public-inbox.org/git/20150521061553.GA29269@glandium.org/ > > The only (simple) workable way, given the current algorithm, to get an > accurate view off rev-list is to essentially make slop infinite. This > works fine, at the expense of runtime. > > Now, ignoring any modification for the above, I'm hitting another corner > case in some other "weird" history, where I have 500k commits all with > the same date. With such a commit dag, something as trivial as > `git rev-list HEAD~..HEAD` goes through all commits from the root commit > to HEAD, which takes multiple seconds, when the (obvious) output is one > commit. > > It looks like the only way revision walking stops going through all the > ancestry is through slop, and slop is essentially made infinite by the > fact all commits have the same date (because of the date check in > still_interesting(). By extension, this means the workaound for the > first corner case above, which is to make slop infinite, essentially > makes all rev walking go through the entire ancestry of the commits > given on the command line. > > It feels like some cases of everybody_uninteresting should shorcut slop > entirely, but considering the only way for slop to decrease at all is > when everybody_uninteresting returns true, that would seem like a wrong > assumption. But I'm also not sure what slop helps with in the first > place (but I don't have a clear view of the broader picture of how the > entire revision walking works). > > Anyways, a rather easy way to witness this happening is to create a > dummy repo like: > git init foo > cd foo > for i in $(seq 1 50); do > echo $i > a; > git add a; > git commit -a -m $i; > done > > The something as simple as `git rev-list HEAD~..HEAD` will go through > all 50 commits (assuming the script above created commits in the same > second, which it did on my machine) > > By the time both HEAD~ and HEAD have been processed, the revision > walking should have enough information to determine that it doesn't need > to go further, but still does. Even with something like HEAD~2..HEAD, > after the first round of processing parents it should be able to see > there's not going to be any more interesting commits. > > I'm willing to dig into this, but if someone familiar with the > algorithm could give me some hints as to what I might be missing in the > big picture, that would be helpful. All the above is without commit-graph, I presume? If so, then you should give it a try, as it might bring immediate help in your pathological repo. With 5k commit in the same second (enforced via 'export GIT_COMMITTER_DATE=$(date); for i in {1..5000} ...') I get: $ best-of-five -q git rev-list HEAD~..HEAD 0.069 $ git commit-graph write --reachableComputing commit graph generation numbers: 100% (5000/5000), done. $ best-of-five -q git rev-list HEAD~..HEAD 0.004 ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Revision walking, commit dates, slop 2019-05-18 1:50 ` SZEDER Gábor @ 2019-05-18 3:58 ` Mike Hommey 2019-05-18 4:17 ` Mike Hommey 2019-05-21 2:00 ` Revision walking, commit dates, slop Jonathan Nieder 1 sibling, 1 reply; 23+ messages in thread From: Mike Hommey @ 2019-05-18 3:58 UTC (permalink / raw) To: SZEDER Gábor; +Cc: git On Sat, May 18, 2019 at 03:50:05AM +0200, SZEDER Gábor wrote: > On Sat, May 18, 2019 at 09:54:12AM +0900, Mike Hommey wrote: > > There are established corner cases, where in a repo where commit dates > > are not monotonically increasing, revision walking can go horribly > > wrong. This was discussed in the past in e.g. > > https://public-inbox.org/git/20150521061553.GA29269@glandium.org/ > > > > The only (simple) workable way, given the current algorithm, to get an > > accurate view off rev-list is to essentially make slop infinite. This > > works fine, at the expense of runtime. > > > > Now, ignoring any modification for the above, I'm hitting another corner > > case in some other "weird" history, where I have 500k commits all with > > the same date. With such a commit dag, something as trivial as > > `git rev-list HEAD~..HEAD` goes through all commits from the root commit > > to HEAD, which takes multiple seconds, when the (obvious) output is one > > commit. > > > > It looks like the only way revision walking stops going through all the > > ancestry is through slop, and slop is essentially made infinite by the > > fact all commits have the same date (because of the date check in > > still_interesting(). By extension, this means the workaound for the > > first corner case above, which is to make slop infinite, essentially > > makes all rev walking go through the entire ancestry of the commits > > given on the command line. > > > > It feels like some cases of everybody_uninteresting should shorcut slop > > entirely, but considering the only way for slop to decrease at all is > > when everybody_uninteresting returns true, that would seem like a wrong > > assumption. But I'm also not sure what slop helps with in the first > > place (but I don't have a clear view of the broader picture of how the > > entire revision walking works). > > > > Anyways, a rather easy way to witness this happening is to create a > > dummy repo like: > > git init foo > > cd foo > > for i in $(seq 1 50); do > > echo $i > a; > > git add a; > > git commit -a -m $i; > > done > > > > The something as simple as `git rev-list HEAD~..HEAD` will go through > > all 50 commits (assuming the script above created commits in the same > > second, which it did on my machine) > > > > By the time both HEAD~ and HEAD have been processed, the revision > > walking should have enough information to determine that it doesn't need > > to go further, but still does. Even with something like HEAD~2..HEAD, > > after the first round of processing parents it should be able to see > > there's not going to be any more interesting commits. > > > > I'm willing to dig into this, but if someone familiar with the > > algorithm could give me some hints as to what I might be missing in the > > big picture, that would be helpful. > > All the above is without commit-graph, I presume? If so, then you > should give it a try, as it might bring immediate help in your > pathological repo. With 5k commit in the same second (enforced via > 'export GIT_COMMITTER_DATE=$(date); for i in {1..5000} ...') I get: > > $ best-of-five -q git rev-list HEAD~..HEAD > 0.069 > $ git commit-graph write --reachableComputing commit graph generation > numbers: 100% (5000/5000), done. > $ best-of-five -q git rev-list HEAD~..HEAD > 0.004 I'm not observing any difference from using commit-graph, whether in time or in the number of commits that are looked at in limit_list(). Mike ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Revision walking, commit dates, slop 2019-05-18 3:58 ` Mike Hommey @ 2019-05-18 4:17 ` Mike Hommey 2019-05-18 12:01 ` SZEDER Gábor 2019-05-20 1:33 ` Derrick Stolee 0 siblings, 2 replies; 23+ messages in thread From: Mike Hommey @ 2019-05-18 4:17 UTC (permalink / raw) To: SZEDER Gábor; +Cc: git On Sat, May 18, 2019 at 12:58:28PM +0900, Mike Hommey wrote: > On Sat, May 18, 2019 at 03:50:05AM +0200, SZEDER Gábor wrote: > > On Sat, May 18, 2019 at 09:54:12AM +0900, Mike Hommey wrote: > > > There are established corner cases, where in a repo where commit dates > > > are not monotonically increasing, revision walking can go horribly > > > wrong. This was discussed in the past in e.g. > > > https://public-inbox.org/git/20150521061553.GA29269@glandium.org/ > > > > > > The only (simple) workable way, given the current algorithm, to get an > > > accurate view off rev-list is to essentially make slop infinite. This > > > works fine, at the expense of runtime. > > > > > > Now, ignoring any modification for the above, I'm hitting another corner > > > case in some other "weird" history, where I have 500k commits all with > > > the same date. With such a commit dag, something as trivial as > > > `git rev-list HEAD~..HEAD` goes through all commits from the root commit > > > to HEAD, which takes multiple seconds, when the (obvious) output is one > > > commit. > > > > > > It looks like the only way revision walking stops going through all the > > > ancestry is through slop, and slop is essentially made infinite by the > > > fact all commits have the same date (because of the date check in > > > still_interesting(). By extension, this means the workaound for the > > > first corner case above, which is to make slop infinite, essentially > > > makes all rev walking go through the entire ancestry of the commits > > > given on the command line. > > > > > > It feels like some cases of everybody_uninteresting should shorcut slop > > > entirely, but considering the only way for slop to decrease at all is > > > when everybody_uninteresting returns true, that would seem like a wrong > > > assumption. But I'm also not sure what slop helps with in the first > > > place (but I don't have a clear view of the broader picture of how the > > > entire revision walking works). > > > > > > Anyways, a rather easy way to witness this happening is to create a > > > dummy repo like: > > > git init foo > > > cd foo > > > for i in $(seq 1 50); do > > > echo $i > a; > > > git add a; > > > git commit -a -m $i; > > > done > > > > > > The something as simple as `git rev-list HEAD~..HEAD` will go through > > > all 50 commits (assuming the script above created commits in the same > > > second, which it did on my machine) > > > > > > By the time both HEAD~ and HEAD have been processed, the revision > > > walking should have enough information to determine that it doesn't need > > > to go further, but still does. Even with something like HEAD~2..HEAD, > > > after the first round of processing parents it should be able to see > > > there's not going to be any more interesting commits. > > > > > > I'm willing to dig into this, but if someone familiar with the > > > algorithm could give me some hints as to what I might be missing in the > > > big picture, that would be helpful. > > > > All the above is without commit-graph, I presume? If so, then you > > should give it a try, as it might bring immediate help in your > > pathological repo. With 5k commit in the same second (enforced via > > 'export GIT_COMMITTER_DATE=$(date); for i in {1..5000} ...') I get: > > > > $ best-of-five -q git rev-list HEAD~..HEAD > > 0.069 > > $ git commit-graph write --reachableComputing commit graph generation > > numbers: 100% (5000/5000), done. > > $ best-of-five -q git rev-list HEAD~..HEAD > > 0.004 > > I'm not observing any difference from using commit-graph, whether in > time or in the number of commits that are looked at in limit_list(). -c core.commitGraph=true does make a difference in time, but not in the number of commits looked at in limit_list(). So it's only faster because each iteration of the loop is faster. It means it's still dependent on the depth of the dag, and the larger the repo will grow, the slower it will get. Mike ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Revision walking, commit dates, slop 2019-05-18 4:17 ` Mike Hommey @ 2019-05-18 12:01 ` SZEDER Gábor 2019-05-19 22:28 ` Jakub Narebski 2019-05-20 1:33 ` Derrick Stolee 1 sibling, 1 reply; 23+ messages in thread From: SZEDER Gábor @ 2019-05-18 12:01 UTC (permalink / raw) To: Mike Hommey; +Cc: git, Kacper Kornet On Sat, May 18, 2019 at 01:17:06PM +0900, Mike Hommey wrote: > On Sat, May 18, 2019 at 12:58:28PM +0900, Mike Hommey wrote: > > On Sat, May 18, 2019 at 03:50:05AM +0200, SZEDER Gábor wrote: > > > On Sat, May 18, 2019 at 09:54:12AM +0900, Mike Hommey wrote: > > > > There are established corner cases, where in a repo where commit dates > > > > are not monotonically increasing, revision walking can go horribly > > > > wrong. This was discussed in the past in e.g. > > > > https://public-inbox.org/git/20150521061553.GA29269@glandium.org/ > > > > > > > > The only (simple) workable way, given the current algorithm, to get an > > > > accurate view off rev-list is to essentially make slop infinite. This > > > > works fine, at the expense of runtime. > > > > > > > > Now, ignoring any modification for the above, I'm hitting another corner > > > > case in some other "weird" history, where I have 500k commits all with > > > > the same date. With such a commit dag, something as trivial as > > > > `git rev-list HEAD~..HEAD` goes through all commits from the root commit > > > > to HEAD, which takes multiple seconds, when the (obvious) output is one > > > > commit. > > > > > > > > It looks like the only way revision walking stops going through all the > > > > ancestry is through slop, and slop is essentially made infinite by the > > > > fact all commits have the same date (because of the date check in > > > > still_interesting(). By extension, this means the workaound for the > > > > first corner case above, which is to make slop infinite, essentially > > > > makes all rev walking go through the entire ancestry of the commits > > > > given on the command line. > > > > > > > > It feels like some cases of everybody_uninteresting should shorcut slop > > > > entirely, but considering the only way for slop to decrease at all is > > > > when everybody_uninteresting returns true, that would seem like a wrong > > > > assumption. But I'm also not sure what slop helps with in the first > > > > place (but I don't have a clear view of the broader picture of how the > > > > entire revision walking works). > > > > > > > > Anyways, a rather easy way to witness this happening is to create a > > > > dummy repo like: > > > > git init foo > > > > cd foo > > > > for i in $(seq 1 50); do > > > > echo $i > a; > > > > git add a; > > > > git commit -a -m $i; > > > > done > > > > > > > > The something as simple as `git rev-list HEAD~..HEAD` will go through > > > > all 50 commits (assuming the script above created commits in the same > > > > second, which it did on my machine) > > > > > > > > By the time both HEAD~ and HEAD have been processed, the revision > > > > walking should have enough information to determine that it doesn't need > > > > to go further, but still does. Even with something like HEAD~2..HEAD, > > > > after the first round of processing parents it should be able to see > > > > there's not going to be any more interesting commits. > > > > > > > > I'm willing to dig into this, but if someone familiar with the > > > > algorithm could give me some hints as to what I might be missing in the > > > > big picture, that would be helpful. > > > > > > All the above is without commit-graph, I presume? If so, then you > > > should give it a try, as it might bring immediate help in your > > > pathological repo. With 5k commit in the same second (enforced via > > > 'export GIT_COMMITTER_DATE=$(date); for i in {1..5000} ...') I get: > > > > > > $ best-of-five -q git rev-list HEAD~..HEAD > > > 0.069 > > > $ git commit-graph write --reachableComputing commit graph generation > > > numbers: 100% (5000/5000), done. > > > $ best-of-five -q git rev-list HEAD~..HEAD > > > 0.004 > > > > I'm not observing any difference from using commit-graph, whether in > > time or in the number of commits that are looked at in limit_list(). > > -c core.commitGraph=true does make a difference in time, but not in the > number of commits looked at in limit_list(). So it's only faster because > each iteration of the loop is faster. It means it's still dependent on > the depth of the dag, and the larger the repo will grow, the slower it > will get. Oh, indeed. Well, at least you'll waste about an order of magnitude less processor time until you figure out how to fix it :) Btw, once upon a time this was fast, but it became slow with commit c19d1b4e84 (Fix revision walk for commits with the same dates, 2013-03-22). ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Revision walking, commit dates, slop 2019-05-18 12:01 ` SZEDER Gábor @ 2019-05-19 22:28 ` Jakub Narebski 0 siblings, 0 replies; 23+ messages in thread From: Jakub Narebski @ 2019-05-19 22:28 UTC (permalink / raw) To: SZEDER Gábor; +Cc: Mike Hommey, git, Kacper Kornet, Derrick Stolee SZEDER Gábor <szeder.dev@gmail.com> writes: > On Sat, May 18, 2019 at 01:17:06PM +0900, Mike Hommey wrote: >> On Sat, May 18, 2019 at 12:58:28PM +0900, Mike Hommey wrote: >>> On Sat, May 18, 2019 at 03:50:05AM +0200, SZEDER Gábor wrote: >>>> On Sat, May 18, 2019 at 09:54:12AM +0900, Mike Hommey wrote: [...] >>>>> There are established corner cases, where in a repo where commit dates >>>>> are not monotonically increasing, revision walking can go horribly >>>>> wrong. This was discussed in the past in e.g. >>>>> https://public-inbox.org/git/20150521061553.GA29269@glandium.org/ >>>>> >>>>> The only (simple) workable way, given the current algorithm, to get an >>>>> accurate view off rev-list is to essentially make slop infinite. This >>>>> works fine, at the expense of runtime. >>>>> >>>>> Now, ignoring any modification for the above, I'm hitting another corner >>>>> case in some other "weird" history, where I have 500k commits all with >>>>> the same date. With such a commit dag, something as trivial as >>>>> `git rev-list HEAD~..HEAD` goes through all commits from the root commit >>>>> to HEAD, which takes multiple seconds, when the (obvious) output is one >>>>> commit. [...] >>>> All the above is without commit-graph, I presume? If so, then you >>>> should give it a try, as it might bring immediate help in your >>>> pathological repo. With 5k commit in the same second (enforced via >>>> 'export GIT_COMMITTER_DATE=$(date); for i in {1..5000} ...') I get: >>>> >>>> $ best-of-five -q git rev-list HEAD~..HEAD >>>> 0.069 >>>> $ git commit-graph write --reachableComputing commit graph generation >>>> numbers: 100% (5000/5000), done. >>>> $ best-of-five -q git rev-list HEAD~..HEAD >>>> 0.004 >>> >>> I'm not observing any difference from using commit-graph, whether in >>> time or in the number of commits that are looked at in limit_list(). >> >> -c core.commitGraph=true does make a difference in time, but not in the >> number of commits looked at in limit_list(). So it's only faster because >> each iteration of the loop is faster. It means it's still dependent on >> the depth of the dag, and the larger the repo will grow, the slower it >> will get. > > Oh, indeed. Well, at least you'll waste about an order of magnitude > less processor time until you figure out how to fix it :) > > Btw, once upon a time this was fast, but it became slow with commit > c19d1b4e84 (Fix revision walk for commits with the same dates, > 2013-03-22). This might be related to the fact, that with generation numbers v1 (i.e. topological levels) there were cases when using those generation numbers for ordering caused performance regressions (e.g. for some parts of Linux kernel history). There was an idea of replacing them with generation numbers v2 [1], which if I remember correctly was chosen to be corrected date (that is, committer date or one more than maximum of dates of parents). This is incompatibile change; it turned out however that while commit-graph format is versioned, unfortunately Git fails hard if commit-graph version is too new, instead of falling back to not using commit graph. [1]: https://github.com/derrickstolee/gen-test Stolee, could you tell us what is the current status on generation numbers v2? Thanks in advance. Best, -- Jakub Narębski ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Revision walking, commit dates, slop 2019-05-18 4:17 ` Mike Hommey 2019-05-18 12:01 ` SZEDER Gábor @ 2019-05-20 1:33 ` Derrick Stolee 2019-05-20 11:02 ` Jakub Narebski 2019-05-21 13:14 ` [PATCH] revision: use generation for A..B --topo-order queries Derrick Stolee 1 sibling, 2 replies; 23+ messages in thread From: Derrick Stolee @ 2019-05-20 1:33 UTC (permalink / raw) To: Mike Hommey, SZEDER Gábor; +Cc: git On 5/18/2019 12:17 AM, Mike Hommey wrote: > On Sat, May 18, 2019 at 12:58:28PM +0900, Mike Hommey wrote: >> On Sat, May 18, 2019 at 03:50:05AM +0200, SZEDER Gábor wrote: >>> >>> All the above is without commit-graph, I presume? If so, then you >>> should give it a try, as it might bring immediate help in your >>> pathological repo. With 5k commit in the same second (enforced via >>> 'export GIT_COMMITTER_DATE=$(date); for i in {1..5000} ...') I get: >>> >>> $ best-of-five -q git rev-list HEAD~..HEAD >>> 0.069 >>> $ git commit-graph write --reachableComputing commit graph generation >>> numbers: 100% (5000/5000), done. >>> $ best-of-five -q git rev-list HEAD~..HEAD >>> 0.004 >> >> I'm not observing any difference from using commit-graph, whether in >> time or in the number of commits that are looked at in limit_list(). > > -c core.commitGraph=true does make a difference in time, but not in the > number of commits looked at in limit_list(). So it's only faster because > each iteration of the loop is faster. It means it's still dependent on > the depth of the dag, and the larger the repo will grow, the slower it > will get. The plan is to use the commit-graph's generation numbers for these A..B queries, but due to some cases when commit date is a _better_ heuristic than generation numbers, we have not enabled them for A..B. You'll see that 'git rev-list --topo-order -n 1 HEAD` will be much faster with the commit-graph, but adding '--topo-order' to your 'HEAD~1..HEAD' query should not change the time at all. See [1] for the discussion about "generation number v2" which will allow us to use a better heuristic in these cases. Thanks, -Stolee [1] https://public-inbox.org/git/6367e30a-1b3a-4fe9-611b-d931f51effef@gmail.com/ ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Revision walking, commit dates, slop 2019-05-20 1:33 ` Derrick Stolee @ 2019-05-20 11:02 ` Jakub Narebski 2019-05-20 11:20 ` Derrick Stolee 2019-05-21 13:14 ` [PATCH] revision: use generation for A..B --topo-order queries Derrick Stolee 1 sibling, 1 reply; 23+ messages in thread From: Jakub Narebski @ 2019-05-20 11:02 UTC (permalink / raw) To: Derrick Stolee; +Cc: Mike Hommey, SZEDER Gábor, git Derrick Stolee <stolee@gmail.com> writes: > On 5/18/2019 12:17 AM, Mike Hommey wrote: >> On Sat, May 18, 2019 at 12:58:28PM +0900, Mike Hommey wrote: >>> On Sat, May 18, 2019 at 03:50:05AM +0200, SZEDER Gábor wrote: >>>> >>>> All the above is without commit-graph, I presume? If so, then you >>>> should give it a try, as it might bring immediate help in your >>>> pathological repo. With 5k commit in the same second (enforced via >>>> 'export GIT_COMMITTER_DATE=$(date); for i in {1..5000} ...') I get: >>>> >>>> $ best-of-five -q git rev-list HEAD~..HEAD >>>> 0.069 >>>> $ git commit-graph write --reachableComputing commit graph generation >>>> numbers: 100% (5000/5000), done. >>>> $ best-of-five -q git rev-list HEAD~..HEAD >>>> 0.004 >>> >>> I'm not observing any difference from using commit-graph, whether in >>> time or in the number of commits that are looked at in limit_list(). >> >> -c core.commitGraph=true does make a difference in time, but not in the >> number of commits looked at in limit_list(). So it's only faster because >> each iteration of the loop is faster. It means it's still dependent on >> the depth of the dag, and the larger the repo will grow, the slower it >> will get. > > The plan is to use the commit-graph's generation numbers for these A..B > queries, but due to some cases when commit date is a _better_ heuristic > than generation numbers, we have not enabled them for A..B. You'll see > that 'git rev-list --topo-order -n 1 HEAD` will be much faster with the > commit-graph, but adding '--topo-order' to your 'HEAD~1..HEAD' query > should not change the time at all. > > See [1] for the discussion about "generation number v2" which will allow > us to use a better heuristic in these cases. > > [1] https://public-inbox.org/git/6367e30a-1b3a-4fe9-611b-d931f51effef@gmail.com/ Are there any blockers that prevent the switch to this "generation number v2"? - Is it a problem with insufficient data to choose the correct numbering as "generation number v2' (there can be only one)? - Is it a problem with selected "generation number v2" being incompatibile with gen v2, and Git failing when new version of commit-graph is used instead of softly just not using commit-graph? - Or is it something else? Best, -- Jakub Narębski ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Revision walking, commit dates, slop 2019-05-20 11:02 ` Jakub Narebski @ 2019-05-20 11:20 ` Derrick Stolee 2019-05-20 13:42 ` Jakub Narebski 0 siblings, 1 reply; 23+ messages in thread From: Derrick Stolee @ 2019-05-20 11:20 UTC (permalink / raw) To: Jakub Narebski; +Cc: Mike Hommey, SZEDER Gábor, git On 5/20/2019 7:02 AM, Jakub Narebski wrote: > > Are there any blockers that prevent the switch to this > "generation number v2"? > > - Is it a problem with insufficient data to choose the correct numbering > as "generation number v2' (there can be only one)? > - Is it a problem with selected "generation number v2" being > incompatibile with gen v2, and Git failing when new version of > commit-graph is used instead of softly just not using commit-graph? > - Or is it something else? The switch becomes a bit complicated. First, the plan was to version the commit-graph file to v2, and that would include a byte in the header for the "reachability index version" [1]. Since older clients fail hard on a newer file version, we are switching to instead including the reachability index version as a value in a more flexible "metadata chunk" [2]. Using the generation number column for the corrected commit-date offsets (assuming we also guarantee the offset is strictly increasing from parent to child), these new values will be backwards- compatible _except_ for 'git commit-graph verify'. Second, we need to pull the reachability index value into a commit slab. The generation value is currently 32 bits, but we will expand that to 64 as it stores a timestamp. The commit struct is a bit bloated already, so this will reduce the required memory space even when not using the commit-graph. But, it requires some refactoring, including every place where we pass a "min_generation" needs to change type and name. Third and finally, we need to calculate the new values and change the read logic to sum the offset and commit-date (when the metadata chunk says we are using corrected commit date). While none of this is _incredibly_ hard to do, it does require a bit of care. It's on my list to get to at some point, but making the file incremental is higher priority to me. Thanks, -Stolee [1] https://public-inbox.org/git/pull.112.git.gitgitgadget@gmail.com/ [2] https://public-inbox.org/git/87h8acivkh.fsf@evledraar.gmail.com/ ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Revision walking, commit dates, slop 2019-05-20 11:20 ` Derrick Stolee @ 2019-05-20 13:42 ` Jakub Narebski 2019-05-20 23:27 ` Jakub Narebski 2019-06-25 7:51 ` Jakub Narebski 0 siblings, 2 replies; 23+ messages in thread From: Jakub Narebski @ 2019-05-20 13:42 UTC (permalink / raw) To: Derrick Stolee; +Cc: Mike Hommey, SZEDER Gábor, git Derrick Stolee <stolee@gmail.com> writes: > On 5/20/2019 7:02 AM, Jakub Narebski wrote: >> >> Are there any blockers that prevent the switch to this >> "generation number v2"? >> >> - Is it a problem with insufficient data to choose the correct numbering >> as "generation number v2' (there can be only one)? >> - Is it a problem with selected "generation number v2" being >> incompatibile with gen v2, and Git failing when new version of >> commit-graph is used instead of softly just not using commit-graph? >> - Or is it something else? Thanks for the explanation. > The switch becomes a bit complicated. > > First, the plan was to version the commit-graph file to v2, and that would > include a byte in the header for the "reachability index version" [1]. Since > older clients fail hard on a newer file version, we are switching to instead > including the reachability index version as a value in a more flexible > "metadata chunk" [2]. Ugh, that is bad. The version number inn the commit-graph file format was supposed to make it possible to easily change the format; now it looks like we are stuck with workarounds until the released version that dies on never commit-graph file format version innstead of softly not utilizing the commit-graph file dies off. How this issue got missed in review... If we cannot change the format, all that is left is ading new chunks, and changes that conform to commit-graph file version 1. > Using the generation number column for the corrected > commit-date offsets (assuming we also guarantee the offset is strictly > increasing from parent to child), these new values will be backwards- > compatible _except_ for 'git commit-graph verify'. O.K., so the "generation number v2 (legacy)" would be incremental and backward-compatibile in use (though not in generation and validation). Do I understand it correctly how it is calculated: corrected_date(C) = max(committer_date(C), max_{P ∈ parents(C)}(corrected_date(P)) + 1) offset(C) = corrected_date(C) - committer_date(C) gen_v2(C) = max(offset(C), max_{P ∈ parents(C)}(gen_v2(P)) + 1) Do you have benchmark for this "monotonically offset corrected commit date" generation number in https://github.com/derrickstolee/git/commits/reach-perf and https://github.com/derrickstolee/gen-test ? Also, what would happen if different versions of Git tried to add to commit-graph in interleaved way, either with rewrite or incremental? > Second, we need to pull the reachability index value into a commit slab. Is commit slab documented somewhere in Documentation/technical/, or just in comments in commit-slab.h? As I understand it, commit slab is Git-specific implementition of inside-out object storage for commit data (i.e. struct of arrays instead of array of structs), isn't it? I wonder if using commit slab improves cache utilization... > The generation value is currently 32 bits, but we will expand that to > 64 as it stores a timestamp. The commit struct is a bit bloated already, > so this will reduce the required memory space even when not using the > commit-graph. But, it requires some refactoring, including every place > where we pass a "min_generation" needs to change type and name. Could this be done with Coccinelle's spatch, similar to e.g. contrib/coccinelle/commit.cocci? > > Third and finally, we need to calculate the new values and change the > read logic to sum the offset and commit-date (when the metadata chunk > says we are using corrected commit date). Right. > While none of this is _incredibly_ hard to do, it does require a bit > of care. It's on my list to get to at some point, but making the file > incremental is higher priority to me. All right, I can understand that. > [1] https://public-inbox.org/git/pull.112.git.gitgitgadget@gmail.com/ > [2] https://public-inbox.org/git/87h8acivkh.fsf@evledraar.gmail.com/ Best, -- Jakub Narębski ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Revision walking, commit dates, slop 2019-05-20 13:42 ` Jakub Narebski @ 2019-05-20 23:27 ` Jakub Narebski 2019-05-21 1:20 ` Derrick Stolee 2019-06-25 7:51 ` Jakub Narebski 1 sibling, 1 reply; 23+ messages in thread From: Jakub Narebski @ 2019-05-20 23:27 UTC (permalink / raw) To: Derrick Stolee; +Cc: Mike Hommey, SZEDER Gábor, git Jakub Narebski <jnareb@gmail.com> writes: > Derrick Stolee <stolee@gmail.com> writes: >> On 5/20/2019 7:02 AM, Jakub Narebski wrote: >>> >>> Are there any blockers that prevent the switch to this >>> "generation number v2"? [...] >> Using the generation number column for the corrected >> commit-date offsets (assuming we also guarantee the offset is strictly >> increasing from parent to child), these new values will be backwards- >> compatible _except_ for 'git commit-graph verify'. > > O.K., so the "generation number v2 (legacy)" would be incremental and > backward-compatibile in use (though not in generation and validation). > > Do I understand it correctly how it is calculated: > > corrected_date(C) = max(committer_date(C), > max_{P ∈ parents(C)}(corrected_date(P)) + 1) This should probably read offset_date(P) = committer_date(P) + gen_v2(P) corrected_date(C) = max(committer_date(C), max_{P ∈ parents(C)}(offset_date(P)) + 1) > offset(C) = corrected_date(C) - committer_date(C) > gen_v2(C) = max(offset(C), max_{P ∈ parents(C)}(gen_v2(P)) + 1) > > Do you have benchmark for this "monotonically offset corrected commit > date" generation number in https://github.com/derrickstolee/git/commits/reach-perf > and https://github.com/derrickstolee/gen-test ? -- Jakub Narębski ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Revision walking, commit dates, slop 2019-05-20 23:27 ` Jakub Narebski @ 2019-05-21 1:20 ` Derrick Stolee 2019-05-22 18:29 ` Jakub Narebski 0 siblings, 1 reply; 23+ messages in thread From: Derrick Stolee @ 2019-05-21 1:20 UTC (permalink / raw) To: Jakub Narebski; +Cc: Mike Hommey, SZEDER Gábor, git On 5/20/2019 7:27 PM, Jakub Narebski wrote: > Jakub Narebski <jnareb@gmail.com> writes: >> Derrick Stolee <stolee@gmail.com> writes: >>> On 5/20/2019 7:02 AM, Jakub Narebski wrote: >>>> >>>> Are there any blockers that prevent the switch to this >>>> "generation number v2"? > [...] > >>> Using the generation number column for the corrected >>> commit-date offsets (assuming we also guarantee the offset is strictly >>> increasing from parent to child), these new values will be backwards- >>> compatible _except_ for 'git commit-graph verify'. >> >> O.K., so the "generation number v2 (legacy)" would be incremental and >> backward-compatibile in use (though not in generation and validation). >> >> Do I understand it correctly how it is calculated: >> >> corrected_date(C) = max(committer_date(C), >> max_{P ∈ parents(C)}(corrected_date(P)) + 1) > > This should probably read > > offset_date(P) = committer_date(P) + gen_v2(P) > corrected_date(C) = max(committer_date(C), > max_{P ∈ parents(C)}(offset_date(P)) + 1) The final definition needs two conditions on the offset of a commit C for every parent P: 1. committer_date(C) + offset(C) > committer_date(P) + offset(P) 2. offset(C) > offset(P) Condition (1) will give us the performance benefits related to the committer-date heuristic. Condition (2) will give us backwards-compatibility with generation numbers. Thanks, -Stolee ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Revision walking, commit dates, slop 2019-05-21 1:20 ` Derrick Stolee @ 2019-05-22 18:29 ` Jakub Narebski 2019-05-22 19:06 ` Derrick Stolee 0 siblings, 1 reply; 23+ messages in thread From: Jakub Narebski @ 2019-05-22 18:29 UTC (permalink / raw) To: Derrick Stolee; +Cc: Mike Hommey, SZEDER Gábor, git Derrick Stolee <stolee@gmail.com> writes: > On 5/20/2019 7:27 PM, Jakub Narebski wrote: >> Jakub Narebski <jnareb@gmail.com> writes: >>> Derrick Stolee <stolee@gmail.com> writes: >>>> On 5/20/2019 7:02 AM, Jakub Narebski wrote: >>>>> >>>>> Are there any blockers that prevent the switch to this >>>>> "generation number v2"? >> [...] >> >>>> Using the generation number column for the corrected >>>> commit-date offsets (assuming we also guarantee the offset is strictly >>>> increasing from parent to child), these new values will be backwards- >>>> compatible _except_ for 'git commit-graph verify'. >>> >>> O.K., so the "generation number v2 (legacy)" would be incremental and >>> backward-compatibile in use (though not in generation and validation). >>> >>> Do I understand it correctly how it is calculated: >>> >>> corrected_date(C) = max(committer_date(C), >>> max_{P ∈ parents(C)}(corrected_date(P)) + 1) >> >> This should probably read >> >> offset_date(P) = committer_date(P) + gen_v2(P) >> corrected_date(C) = max(committer_date(C), >> max_{P ∈ parents(C)}(offset_date(P)) + 1) Restating it yet again: A. corrected_date(C) = max(committer_date(C), max_P(committer_date(P) + offset(P)) + 1) B. offset(C) = max(corrected_date(C) - committer_date(C), max_P(offset(P)) + 1) > The final definition needs two conditions on the offset of a commit C for > every parent P: > > 1. committer_date(C) + offset(C) > committer_date(P) + offset(P) > 2. offset(C) > offset(P) The equation (B) ensures the (2) condition, i.e offset(C) > offset(P). The equation (A) ensures that condition (1) is fulfulled, because from (B) we have corrected_date(C) <= committer_date(C) + offset(C) This from (B) and (A( we get: committer_date(C) + offset(C) >= corrected_date(C) > > committer_date(P) + offset(P) > Condition (1) will give us the performance benefits related to the > committer-date heuristic. Condition (2) will give us backwards-compatibility > with generation numbers. Well, we should check/test if performance benefits of "offset date" ("corrected date with rising offset") truly holds. Best, -- Jakub Narębski ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Revision walking, commit dates, slop 2019-05-22 18:29 ` Jakub Narebski @ 2019-05-22 19:06 ` Derrick Stolee 2019-05-23 21:04 ` Jakub Narebski 0 siblings, 1 reply; 23+ messages in thread From: Derrick Stolee @ 2019-05-22 19:06 UTC (permalink / raw) To: Jakub Narebski; +Cc: Mike Hommey, SZEDER Gábor, git On 5/22/2019 2:29 PM, Jakub Narebski wrote: > Derrick Stolee <stolee@gmail.com> writes: >> On 5/20/2019 7:27 PM, Jakub Narebski wrote: > Restating it yet again: > > A. corrected_date(C) = max(committer_date(C), > max_P(committer_date(P) + offset(P)) + 1) > > B. offset(C) = max(corrected_date(C) - committer_date(C), > max_P(offset(P)) + 1) The problem with this definition is that it "defines" the corrected date, and then _adjusts_ it by updating the offset. I consider corrected_date(C) = committer_date(C) + offset(C) to be part of the definition. You could restate the definition as follows: corrected_date = max(committer_date(C) + max_P(offset(P)) + 1, max_P(corrected_date(P))) or, equivalently corrected_date = max(committer_date(C) + max_P(offset(P)) + 1, max_P(committer_date(P) + offset(P))) This definition, in a single step, satisfies the conditions below: > >> The final definition needs two conditions on the offset of a commit C for >> every parent P: >> >> 1. committer_date(C) + offset(C) > committer_date(P) + offset(P) >> 2. offset(C) > offset(P) Plus, the "+ 1" in the first step takes into account that "0" is a special offset value in the commit-graph file format meaning "not computed". > Well, we should check/test if performance benefits of "offset date" > ("corrected date with rising offset") truly holds. Yes, a full performance test will be required. I have full confidence that the monotonic offset requirement will have only positive effect. That is, it will not affect the case where committer-date was better than generation number, but will help the cases where all the committer-dates are equal. Thanks, -Stolee ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Revision walking, commit dates, slop 2019-05-22 19:06 ` Derrick Stolee @ 2019-05-23 21:04 ` Jakub Narebski 0 siblings, 0 replies; 23+ messages in thread From: Jakub Narebski @ 2019-05-23 21:04 UTC (permalink / raw) To: Derrick Stolee; +Cc: Mike Hommey, SZEDER Gábor, git Derrick Stolee <stolee@gmail.com> writes: > On 5/22/2019 2:29 PM, Jakub Narebski wrote: >> Derrick Stolee <stolee@gmail.com> writes: >>> On 5/20/2019 7:27 PM, Jakub Narebski wrote: >> >> Restating it yet again: >> >> A. corrected_date(C) = max(committer_date(C), >> max_P(committer_date(P) + offset(P)) + 1) >> >> B. offset(C) = max(corrected_date(C) - committer_date(C), >> max_P(offset(P)) + 1) > > The problem with this definition is that it "defines" the corrected date, and > then _adjusts_ it by updating the offset. For me equation (A) was just an intermediate step; we might call it adjusted_date(C) or temp_date(C) instead. Note that with the following adjustment corrected_date(C) = max(committer_date(C), max_P(corrected_date(P)) + 1) it is +1 corrected "V3: Corrected Commit Date" from gen-test. > I consider > > corrected_date(C) = committer_date(C) + offset(C) > > to be part of the definition. You could restate the definition as follows: > > corrected_date = max(committer_date(C) + max_P(offset(P)) + 1, > max_P(corrected_date(P))) > > or, equivalently > > corrected_date = max(committer_date(C) + max_P(offset(P)) + 1, > max_P(committer_date(P) + offset(P))) > > This definition, in a single step, satisfies the conditions below: > >> >>> The final definition needs two conditions on the offset of a commit C for >>> every parent P: >>> >>> 1. committer_date(C) + offset(C) > committer_date(P) + offset(P) >>> 2. offset(C) > offset(P) I think it is easier to prove the conditions (1) and (2) using two-step definition (A) + (B), as I have shown in previous email. Also, what we need to calculate and store is offset(C), not offset_date(C) (i.e. corrected_date(C), if you prefer). > Plus, the "+ 1" in the first step takes into account that "0" is a special offset > value in the commit-graph file format meaning "not computed". That assumes that max_P(function(P)) over empty set of P is taken to be zero. Also, I think two step definition has the same property. >> Well, we should check/test if performance benefits of "offset date" >> ("corrected date with rising offset") truly holds. > > Yes, a full performance test will be required. I have full confidence that the > monotonic offset requirement will have only positive effect. That is, it will > not affect the case where committer-date was better than generation number, but will > help the cases where all the committer-dates are equal. I worry that monotonic offset corrected date would behave like topological levels (i.e. like current generation number v1) in more cases rather than like commit date heuristics. P.S. there is _theoretical_ problem with all date-offset based generation numbers (slightly more likely to occur for monotonical offset), namely that in the commit-graph format v1 we have 34 bits for commit date timestamp, but only 30 bits for offset; and there might be the case where offset do not fit in 30 bits. It would be however very unlikely, e.g. commit date of 2028, then commit date of 1970 (timestamp equal to zero). Regards, -- Jakub Narębski ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Revision walking, commit dates, slop 2019-05-20 13:42 ` Jakub Narebski 2019-05-20 23:27 ` Jakub Narebski @ 2019-06-25 7:51 ` Jakub Narebski 2019-06-25 10:54 ` Derrick Stolee 1 sibling, 1 reply; 23+ messages in thread From: Jakub Narebski @ 2019-06-25 7:51 UTC (permalink / raw) To: Derrick Stolee; +Cc: Mike Hommey, SZEDER Gábor, git Jakub Narebski <jnareb@gmail.com> writes: > Derrick Stolee <stolee@gmail.com> writes: >> On 5/20/2019 7:02 AM, Jakub Narebski wrote: >>> >>> Are there any blockers that prevent the switch to this >>> "generation number v2"? >>> >>> - Is it a problem with insufficient data to choose the correct numbering >>> as "generation number v2' (there can be only one)? >>> - Is it a problem with selected "generation number v2" being >>> incompatibile with gen v2, and Git failing when new version of >>> commit-graph is used instead of softly just not using commit-graph? >>> - Or is it something else? [...] >> Using the generation number column for the corrected >> commit-date offsets (assuming we also guarantee the offset is strictly >> increasing from parent to child), these new values will be backwards- >> compatible _except_ for 'git commit-graph verify'. > > O.K., so the "generation number v2 (legacy)" would be incremental and > backward-compatibile in use (though not in generation and validation). > > Do I understand it correctly how it is calculated: > > corrected_date(C) = max(committer_date(C), > max_{P ∈ parents(C)}(corrected_date(P)) + 1) > offset(C) = corrected_date(C) - committer_date(C) > gen_v2(C) = max(offset(C), max_{P ∈ parents(C)}(gen_v2(P)) + 1) Do you remember who first came up with this idea for backward compatibile corrected commit date offsets (monotonically offset corrected date)? > Do you have benchmark for this "monotonically offset corrected commit > date" generation number in https://github.com/derrickstolee/git/commits/reach-perf > and https://github.com/derrickstolee/gen-test ? I guess this will have to wait... Best, -- Jakub Narębski ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Revision walking, commit dates, slop 2019-06-25 7:51 ` Jakub Narebski @ 2019-06-25 10:54 ` Derrick Stolee 2019-09-18 8:43 ` [RFC/PATCH] commit-graph: generation v5 (backward compatible date ceiling) Jakub Narebski 0 siblings, 1 reply; 23+ messages in thread From: Derrick Stolee @ 2019-06-25 10:54 UTC (permalink / raw) To: Jakub Narebski; +Cc: Mike Hommey, SZEDER Gábor, git On 6/25/2019 3:51 AM, Jakub Narebski wrote: > Jakub Narebski <jnareb@gmail.com> writes: >> Derrick Stolee <stolee@gmail.com> writes: >>> On 5/20/2019 7:02 AM, Jakub Narebski wrote: >>>> >>>> Are there any blockers that prevent the switch to this >>>> "generation number v2"? >>>> >>>> - Is it a problem with insufficient data to choose the correct numbering >>>> as "generation number v2' (there can be only one)? >>>> - Is it a problem with selected "generation number v2" being >>>> incompatibile with gen v2, and Git failing when new version of >>>> commit-graph is used instead of softly just not using commit-graph? >>>> - Or is it something else? > [...] > >>> Using the generation number column for the corrected >>> commit-date offsets (assuming we also guarantee the offset is strictly >>> increasing from parent to child), these new values will be backwards- >>> compatible _except_ for 'git commit-graph verify'. >> >> O.K., so the "generation number v2 (legacy)" would be incremental and >> backward-compatibile in use (though not in generation and validation). >> >> Do I understand it correctly how it is calculated: >> >> corrected_date(C) = max(committer_date(C), >> max_{P ∈ parents(C)}(corrected_date(P)) + 1) >> offset(C) = corrected_date(C) - committer_date(C) >> gen_v2(C) = max(offset(C), max_{P ∈ parents(C)}(gen_v2(P)) + 1) > > Do you remember who first came up with this idea for backward > compatibile corrected commit date offsets (monotonically offset > corrected date)? I remember saying that the "corrected commit date" that I had suggested was weak because it was not backwards-compatible with generation numbers if you are only looking at the offsets. I don't remember who suggested simply increasing the offset so they do become backwards-compatible. >> Do you have benchmark for this "monotonically offset corrected commit >> date" generation number in https://github.com/derrickstolee/git/commits/reach-perf >> and https://github.com/derrickstolee/gen-test ? > > I guess this will have to wait... I have not had time to revisit this topic and re-run performance numbers, sorry. -Stolee ^ permalink raw reply [flat|nested] 23+ messages in thread
* [RFC/PATCH] commit-graph: generation v5 (backward compatible date ceiling) 2019-06-25 10:54 ` Derrick Stolee @ 2019-09-18 8:43 ` Jakub Narebski 2019-09-18 12:26 ` Derrick Stolee 0 siblings, 1 reply; 23+ messages in thread From: Jakub Narebski @ 2019-09-18 8:43 UTC (permalink / raw) To: Derrick Stolee; +Cc: Mike Hommey, SZEDER Gábor, git Derrick Stolee <stolee@gmail.com> writes: > On 6/25/2019 3:51 AM, Jakub Narebski wrote: >> Jakub Narebski <jnareb@gmail.com> writes: >>> Derrick Stolee <stolee@gmail.com> writes: [...] >>> O.K., so the "generation number v2 (legacy)" would be incremental and >>> backward-compatibile in use (though not in generation and validation). [...] >>> Do you have benchmark for this "monotonically offset corrected commit >>> date" generation number in https://github.com/derrickstolee/git/commits/reach-perf >>> and https://github.com/derrickstolee/gen-test ? >> >> I guess this will have to wait... > > I have not had time to revisit this topic and re-run performance > numbers, sorry. I have created pull requests against `reach-perf` branch of derrickstolee/git repository[1], and companion pull request against gen-test repository[2] with proposed prototype of backward-compatible corrected commit date (with monotonic offsets). Could you please run the tests for this generation number v5? I was not able to do so. It was my first time trying to compile Git on MS Windows, and while there were no problems compiling `master` (well, except for compilation taking a long time), I was unable to do it for `reach-perf` branch because of independent of change compilation errors. It would be good to test also the legacy mode, i.e. old Git (using generation number v0, i.e. topological levels (shortest path from node to sink, plus one) with all backward-compatible generation numbers, or at least generation number v5. Do I understand it correctly that for the time being because of backward-compatibility concerns instead of incrementing the version number there would be used some kind of "metadata" chunk (at least until version of Git that fails hard, instead of turning off commit-graph feature softly, on unknown version numbers dies out). Is there some proposal how such chunk should look like? [1]: https://github.com/derrickstolee/git/pull/10 [2]: https://github.com/derrickstolee/gen-test/pull/1 --- >8 ---- >8 ---- >8 ---- >8 ---- >8 ---- >8 ---- >8 ---- >8 ---- From: Jakub Narębski <jnareb@gmail.com> Subject: [PATCH] commit-graph: generation v5 (backward compatible date ceiling) This generation number option is backward-compatible (with reachability index v0) version of the generation number v3, that is the corrected commit date. It takes the simple approach of faking "commit dates" being in the right order by adding an offset to the commit date (which is stored in the generation number column), so the modified commit date is always at least one more than the maximum modified commit date of all ancestors. Additionally the offset itself is adjusted in such way that the offset of a commit is always at least one more than the offsets of all ancestors. Two following conditions are satisfied on the offset of a commit C for every parent P: 1. committer_date(C) + offset(C) > committer_date(P) + offset(P) 2. offset(C) > offset(P) This reachability index is locally computable, which means it is compatible with incremental commit graph feature. It has the same problem as v3, namely that we have a "maximum clock skew" based on the maximum generation number we can store, only more so. Note: it includes incidental whitespace cleanups for v3 code. Signed-off-by: Jakub Narębski <jnareb@gmail.com> --- Note that the diff looks a bit strange in the part adding the code for the new compute_generation_numbers_5() subroutine, because it got entangled in whitespace cleanup (which I shouldn't do, I'm sorry). Relevant part of README.md from `gen-test` repository: -------- **V5: Corrected Commit Date with Strictly Monotonic Offset.** For a commit C, let its _offset commit date_ (denoted by odate(C)) be a commit date plus some offset, i.e. odate(C) = date(C) + offset(C), such that: 1. Offset commit date is greater than the maximum of the commit date of C and the offset commit dates of its parents. If odate(A) < odate(B), then A cannot reach B. 2. Offset of a commit is one more than the maximum offset of a parent, or more If offset(A) < offset(B), then A cannot reach B. This is backward-compatible version of V3: Corrected Commit Date. ### Comparing Reachability Index Versions Viability [...] | Index | Compatible? | Immutable? | Local? | |---------------------------|-------------|------------|--------| | Minimum Generation Number | Yes | Yes | Yes | | (Epoch, Date) pairs | Yes | Yes | Yes | | Maximum Generation Number | Yes | No | No | | Corrected Commit Date | No | Yes | Yes | | FELINE index | Yes | No | No | | Offset Commit Date *NEW* | Yes | Yes | Yes | _Note:_ The corrected commit date uses the generation number column to store an offset of "how much do I need to add to my commit date to get my corrected commit date?" The values stored in that column are then not backwards-compatible. _Note:_ The corrected commit date with strictly monotonic offset also uses the generation number column to store the date offset, but the offset alone can be used as generation number (as reachability index) itself. commit-graph.c | 126 +++++++++++++++++++++++++++++++++++++------------ 1 file changed, 95 insertions(+), 31 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index 91863d4895..633b4b24f8 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -99,10 +99,9 @@ int compare_generations(struct generation *a, struct generation *b) return 1; return 0; - case 3: - ta = a->date + a->value1; - tb = b->date + b->value1; - + case 3: /* V3: Corrected Commit Date */ + case 5: /* V5: Strictly Monotonic Corrected Commit Date */ + /* handle special cases, i.e. commits outside commit graph */ if (a->value1 == GENERATION_NUMBER_INFINITY) { if (b->value1 == GENERATION_NUMBER_INFINITY) return 0; @@ -112,6 +111,10 @@ int compare_generations(struct generation *a, struct generation *b) return -1; } + /* corrected commit date = date + offset (correction) */ + ta = a->date + a->value1; + tb = b->date + b->value1; + if (ta < tb) return -1; if (ta > tb) @@ -162,6 +165,7 @@ void get_generation_version_from_commit(const struct commit *c, case 1: case 3: + case 5: gen->value1 = c->generation; gen->date = c->date; break; @@ -212,9 +216,10 @@ void set_generation_below_commit(const struct commit *c, struct generation *g) break; case 3: + case 5: /* ??? */ if (g->value1 + g->date >= gc.value1 + gc.date) { g->value1 = 0; - g->date = gc.value1 + gc.date; + g->date = gc.value1 + gc.date; } break; @@ -363,7 +368,7 @@ struct commit_graph *load_commit_graph_one(const char *graph_file) else graph->chunk_large_edges = data + chunk_offset; break; - + case GRAPH_CHUNKID_FELINE: if (graph->chunk_feline_gen) chunk_repeated = 1; @@ -971,27 +976,27 @@ static void compute_generation_numbers_3(struct packed_commit_list *commits) int i; struct commit_list *list = NULL; - for (i = 0; i < commits->nr; i++) { - if (commits->list[i]->generation != GENERATION_NUMBER_INFINITY && - commits->list[i]->generation != GENERATION_NUMBER_ZERO) - continue; + for (i = 0; i < commits->nr; i++) { + if (commits->list[i]->generation != GENERATION_NUMBER_INFINITY && + commits->list[i]->generation != GENERATION_NUMBER_ZERO) + continue; - commit_list_insert(commits->list[i], &list); + commit_list_insert(commits->list[i], &list); - while (list) { - struct commit *current = list->item; - struct commit_list *parent; - int all_parents_computed = 1; + while (list) { + struct commit *current = list->item; + struct commit_list *parent; + int all_parents_computed = 1; - timestamp_t max_timestamp = current->date; + timestamp_t max_timestamp = current->date; - for (parent = current->parents; parent; parent = parent->next) { - if (parent->item->generation == GENERATION_NUMBER_INFINITY || - parent->item->generation == GENERATION_NUMBER_ZERO) { - all_parents_computed = 0; - commit_list_insert(parent->item, &list); - break; - } else { + for (parent = current->parents; parent; parent = parent->next) { + if (parent->item->generation == GENERATION_NUMBER_INFINITY || + parent->item->generation == GENERATION_NUMBER_ZERO) { + all_parents_computed = 0; + commit_list_insert(parent->item, &list); + break; + } else { struct generation pg; timestamp_t pt; get_generation_version_from_commit(parent->item, 3, &pg); @@ -1001,19 +1006,75 @@ static void compute_generation_numbers_3(struct packed_commit_list *commits) if (pt > max_timestamp) max_timestamp = pt + 1; } - } + } + + if (all_parents_computed) { + current->generation = (uint32_t)(max_timestamp - current->date) + 1; + pop_commit(&list); + + if (current->generation > GENERATION_NUMBER_MAX) + die(_("generation number gap is too high!")); + } + } + } +} + +static void compute_generation_numbers_5(struct packed_commit_list *commits) +{ + int i; + struct commit_list *list = NULL; + + for (i = 0; i < commits->nr; i++) { + /* skip already computed generation numbers */ + if (commits->list[i]->generation != GENERATION_NUMBER_INFINITY && + commits->list[i]->generation != GENERATION_NUMBER_ZERO) + continue; + + commit_list_insert(commits->list[i], &list); + + while (list) { + struct commit *current = list->item; + struct commit_list *parent; + int all_parents_computed = 1; + + timestamp_t max_timestamp = current->date; + uint32_t max_generation = 0; + + for (parent = current->parents; parent; parent = parent->next) { + if (parent->item->generation == GENERATION_NUMBER_INFINITY || + parent->item->generation == GENERATION_NUMBER_ZERO) { + all_parents_computed = 0; + commit_list_insert(parent->item, &list); + break; + + } else { + struct generation pg; + timestamp_t pt; + get_generation_version_from_commit(parent->item, 5, &pg); + + pt = pg.value1 + pg.date; + + if (pt > max_timestamp) + max_timestamp = pt + 1; + if (pg.value1 > max_generation) + max_generation = pg.value1; + } + } - if (all_parents_computed) { + if (all_parents_computed) { current->generation = (uint32_t)(max_timestamp - current->date) + 1; - pop_commit(&list); + if (current->generation < max_generation + 1) + current->generation = max_generation + 1; + pop_commit(&list); - if (current->generation > GENERATION_NUMBER_MAX) - die(_("generation number gap is too high!")); - } - } - } + if (current->generation > GENERATION_NUMBER_MAX) + die(_("generation number gap is too high!")); + } + } + } } + static void compute_generation_numbers(struct packed_commit_list *commits, int generation_version) { @@ -1038,6 +1099,9 @@ static void compute_generation_numbers(struct packed_commit_list *commits, case 4: /* compute at write time */ return; + case 5: + compute_generation_numbers_5(commits); + return; } } -- 2.23.0.windows.1 ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [RFC/PATCH] commit-graph: generation v5 (backward compatible date ceiling) 2019-09-18 8:43 ` [RFC/PATCH] commit-graph: generation v5 (backward compatible date ceiling) Jakub Narebski @ 2019-09-18 12:26 ` Derrick Stolee 0 siblings, 0 replies; 23+ messages in thread From: Derrick Stolee @ 2019-09-18 12:26 UTC (permalink / raw) To: Jakub Narebski; +Cc: Mike Hommey, SZEDER Gábor, git On 9/18/2019 4:43 AM, Jakub Narebski wrote: > Derrick Stolee <stolee@gmail.com> writes: >> On 6/25/2019 3:51 AM, Jakub Narebski wrote: >>> Jakub Narebski <jnareb@gmail.com> writes: >>>> Derrick Stolee <stolee@gmail.com> writes: > [...] >>>> O.K., so the "generation number v2 (legacy)" would be incremental and >>>> backward-compatibile in use (though not in generation and validation). > [...] >>>> Do you have benchmark for this "monotonically offset corrected commit >>>> date" generation number in https://github.com/derrickstolee/git/commits/reach-perf >>>> and https://github.com/derrickstolee/gen-test ? >>> >>> I guess this will have to wait... >> >> I have not had time to revisit this topic and re-run performance >> numbers, sorry. > > I have created pull requests against `reach-perf` branch of > derrickstolee/git repository[1], and companion pull request against > gen-test repository[2] with proposed prototype of backward-compatible > corrected commit date (with monotonic offsets). > > Could you please run the tests for this generation number v5? I'll try to get to this, but it may take a few days. > I was not > able to do so. It was my first time trying to compile Git on MS > Windows, and while there were no problems compiling `master` (well, > except for compilation taking a long time), I was unable to do it for > `reach-perf` branch because of independent of change compilation errors. I don't think I tested this on Windows. I do most of my performance tests in Linux, especially when presenting numbers to the mailing list. The gen-test repo has a bunch of shell scripts that I used for testing in that environment. Thanks, -Stolee ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH] revision: use generation for A..B --topo-order queries 2019-05-20 1:33 ` Derrick Stolee 2019-05-20 11:02 ` Jakub Narebski @ 2019-05-21 13:14 ` Derrick Stolee 2019-05-21 13:59 ` [PATCH 2/2] revision: keep topo-walk free of unintersting commits Derrick Stolee 1 sibling, 1 reply; 23+ messages in thread From: Derrick Stolee @ 2019-05-21 13:14 UTC (permalink / raw) To: git; +Cc: szeder.dev, jnareb, jrnieder, mh, Derrick Stolee If a commit-graph exists with computed generation numbers, then a 'git rev-list --topo-order -n <N> <rev>' query will use those generation numbers to reduce the number of commits walked before writing N commits. One caveat put in b454241 (revision.c: generation-based topo-order algorithm, 2018-11-01) was to not enable the new algorithm for queries with a revision range "A..B". The logic was placed to walk from "A" and mark those commits as uninteresting, but the performance was actually worse than the existing logic in some cases. The root cause of this performance degradation is that generation numbers _increase_ the number of commits we walk relative to the existing heuristic of walking by commit date. While generation numbers actually guarantee that the algorithm is correct, the existing logic is very rarely wrong and that added requirement is not worth the cost. This motivates the planned "corrected commit date" to replace generation numbers in a future version of Git. The current change enables the logic to use whatever reachability index is currently in the commit-graph (generation numbers or corrected commit date). The limited flag in struct rev_info forces a full walk of the commit history (after discovering the A..B range). Previosuly, it is enabled whenever we see an uninteresting commit. We prevent enabling the parameter when we are planning to use the reachability index for a topo-order. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- Mike, If you have the chance, then please apply this patch (on v2.22.0-rc1) and re-run your test. This will confirm if my thoughts on this matter are correct. Thanks, -Stolee revision.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/revision.c b/revision.c index d4aaf0ef25..be6ccf5786 100644 --- a/revision.c +++ b/revision.c @@ -436,7 +436,9 @@ static struct commit *handle_commit(struct rev_info *revs, die("unable to parse commit %s", name); if (flags & UNINTERESTING) { mark_parents_uninteresting(commit); - revs->limited = 1; + + if (!revs->topo_order || !generation_numbers_enabled(the_repository)) + revs->limited = 1; } if (revs->sources) { char **slot = revision_sources_at(revs->sources, commit); -- 2.22.0.rc1 ^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH 2/2] revision: keep topo-walk free of unintersting commits 2019-05-21 13:14 ` [PATCH] revision: use generation for A..B --topo-order queries Derrick Stolee @ 2019-05-21 13:59 ` Derrick Stolee 2019-05-22 2:19 ` Mike Hommey 0 siblings, 1 reply; 23+ messages in thread From: Derrick Stolee @ 2019-05-21 13:59 UTC (permalink / raw) To: git; +Cc: jnareb, jrnieder, szeder.dev, mh, Derrick Stolee When updating the topo-order walk in b454241 (revision.c: generation-based topo-order algorithm, 2018-11-01), the logic was a huge rewrite of the walk logic. In that massive change, we accidentally included the UNINTERESTING commits in expand_topo_walk(). This means that a simple query like git rev-list --topo-order HEAD~1..HEAD will expand the topo walk for all commits reachable from HEAD, and not just one commit. This change should speed up these cases, but there is still a need for corrected commit-date for some A..B queries. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- Sorry for the patch-spam, but I took a moment to check this command on the Git repo, and was able to reproduce the slowness. That didn't make sense to me, so I added some log messages to expand_topo_walk() and notices we were walking the UNINITERESTING commits. This is part of the reason the new logic is slower for A..B commands, but not the whole reason. You'll want this patch as well for a test. Thanks, -Stolee revision.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/revision.c b/revision.c index be6ccf5786..621feb9df7 100644 --- a/revision.c +++ b/revision.c @@ -3265,6 +3265,9 @@ static void expand_topo_walk(struct rev_info *revs, struct commit *commit) struct commit *parent = p->item; int *pi; + if (parent->object.flags & UNINTERESTING) + continue; + if (parse_commit_gently(parent, 1) < 0) continue; -- 2.22.0.rc1 ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] revision: keep topo-walk free of unintersting commits 2019-05-21 13:59 ` [PATCH 2/2] revision: keep topo-walk free of unintersting commits Derrick Stolee @ 2019-05-22 2:19 ` Mike Hommey 0 siblings, 0 replies; 23+ messages in thread From: Mike Hommey @ 2019-05-22 2:19 UTC (permalink / raw) To: Derrick Stolee; +Cc: git, jnareb, jrnieder, szeder.dev, Derrick Stolee On Tue, May 21, 2019 at 09:59:53AM -0400, Derrick Stolee wrote: > When updating the topo-order walk in b454241 (revision.c: generation-based > topo-order algorithm, 2018-11-01), the logic was a huge rewrite of the > walk logic. In that massive change, we accidentally included the > UNINTERESTING commits in expand_topo_walk(). This means that a simple > query like > > git rev-list --topo-order HEAD~1..HEAD > > will expand the topo walk for all commits reachable from HEAD, and not > just one commit. > > This change should speed up these cases, but there is still a need > for corrected commit-date for some A..B queries. > > Signed-off-by: Derrick Stolee <dstolee@microsoft.com> > --- > > Sorry for the patch-spam, but I took a moment to check this command > on the Git repo, and was able to reproduce the slowness. That didn't > make sense to me, so I added some log messages to expand_topo_walk() > and notices we were walking the UNINITERESTING commits. This is part > of the reason the new logic is slower for A..B commands, but not the > whole reason. > > You'll want this patch as well for a test. Both patches help, thanks. Mike ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Revision walking, commit dates, slop 2019-05-18 1:50 ` SZEDER Gábor 2019-05-18 3:58 ` Mike Hommey @ 2019-05-21 2:00 ` Jonathan Nieder 1 sibling, 0 replies; 23+ messages in thread From: Jonathan Nieder @ 2019-05-21 2:00 UTC (permalink / raw) To: SZEDER Gábor; +Cc: Mike Hommey, git SZEDER Gábor wrote: > On Sat, May 18, 2019 at 09:54:12AM +0900, Mike Hommey wrote: >> There are established corner cases, where in a repo where commit dates >> are not monotonically increasing, [...] > All the above is without commit-graph, I presume? If so, then you > should give it a try, as it might bring immediate help in your > pathological repo. With 5k commit in the same second (enforced via > 'export GIT_COMMITTER_DATE=$(date); for i in {1..5000} ...') I get: Just to emphasize this point: one field in the commit-graph file is a generation number (or more generally, a "reachability index" in the standard jargon). The reason I'm excited about having this field is that it will allow us to stop playing games with slop. So please join forces with Stolee and help us get to that future sooner. :) Thanks, Jonathan ^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread, other threads:[~2019-09-18 12:26 UTC | newest] Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2019-05-18 0:54 Revision walking, commit dates, slop Mike Hommey 2019-05-18 1:50 ` SZEDER Gábor 2019-05-18 3:58 ` Mike Hommey 2019-05-18 4:17 ` Mike Hommey 2019-05-18 12:01 ` SZEDER Gábor 2019-05-19 22:28 ` Jakub Narebski 2019-05-20 1:33 ` Derrick Stolee 2019-05-20 11:02 ` Jakub Narebski 2019-05-20 11:20 ` Derrick Stolee 2019-05-20 13:42 ` Jakub Narebski 2019-05-20 23:27 ` Jakub Narebski 2019-05-21 1:20 ` Derrick Stolee 2019-05-22 18:29 ` Jakub Narebski 2019-05-22 19:06 ` Derrick Stolee 2019-05-23 21:04 ` Jakub Narebski 2019-06-25 7:51 ` Jakub Narebski 2019-06-25 10:54 ` Derrick Stolee 2019-09-18 8:43 ` [RFC/PATCH] commit-graph: generation v5 (backward compatible date ceiling) Jakub Narebski 2019-09-18 12:26 ` Derrick Stolee 2019-05-21 13:14 ` [PATCH] revision: use generation for A..B --topo-order queries Derrick Stolee 2019-05-21 13:59 ` [PATCH 2/2] revision: keep topo-walk free of unintersting commits Derrick Stolee 2019-05-22 2:19 ` Mike Hommey 2019-05-21 2:00 ` Revision walking, commit dates, slop Jonathan Nieder
Code repositories for project(s) associated with this public inbox https://80x24.org/mirrors/git.git This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).