* [RFC] Generation Number v2 @ 2018-10-29 16:55 Derrick Stolee 2018-10-29 19:22 ` Stefan Beller ` (3 more replies) 0 siblings, 4 replies; 17+ messages in thread From: Derrick Stolee @ 2018-10-29 16:55 UTC (permalink / raw) To: Git List, Jeff King, Junio C Hamano, Jakub Narębski, Derrick Stolee, Ævar Arnfjörð Bjarmason We've discussed in several places how to improve upon generation numbers. This RFC is a report based on my investigation into a few new options, and how they compare for Git's purposes on several existing open-source repos. You can find this report and the associated test scripts at https://github.com/derrickstolee/gen-test. I have some more work to do in order to make this fully reproducible (including a single script to clone all of the repos I used, compute the commit-graphs using the different index versions, and run the tests). TL;DR: I recommend we pursue one of the following two options: 1. Maximum Generation Number. 2. Corrected Commit Date. Each of these perform well, but have different implementation drawbacks. I'd like to hear what everyone thinks about those drawbacks. Please also let me know about any additional tests that I could run. Now that I've got a lot of test scripts built up, I can re-run the test suite pretty quickly. Thanks, -Stolee Generation Number Performance Test ================================== Git uses the commit history to answer many basic questions, such as computing a merge-base. Some of these algorithms can benefit from a _reachability index_, which is a condition that allows us to answer "commit A cannot reach commit B" in some cases. These require pre- computing some values that we can load and use at run-time. Git already has a notion of _generation number_, stores it in the commit- graph file, and uses it in several reachability algorithms. You can read more about generation numbers and how to use them in algorithms in [this blog post](https://blogs.msdn.microsoft.com/devops/2018/07/09/supercharging-the-git-commit-graph-iii-generations/). However, [generation numbers do not always improve our algorithms](https://public-inbox.org/git/pull.28.git.gitgitgadget@gmail.com/T/#u). Specifically, some algorithms in Git already use commit date as a heuristic reachability index. This has some problems, though, since commit date can be incorrect for several reasons (clock skew between machines, purposefully setting GIT_COMMIT_DATE to the author date, etc.). However, the speed boost by using commit date as a cutoff was so important in these cases, that the potential for incorrect answers was considered acceptable. When these algorithms were converted to use generation numbers, we _added_ the extra constraint that the algorithms are _never incorrect_. Unfortunately, this led to some cases where performance was worse than before. There are known cases where `git merge-base A B` or `git log --topo-order A..B` are worse when using generation numbers than when using commit dates. This report investigates four replacements for generation numbers, and compares the number of walked commits to the existing algorithms (both using generation numbers and not using them at all). We can use this data to make decisions for the future of the feature. ### Implementation The reachability indexes below are implemented in [the `reach-perf` branch in https://github.com/derrickstolee/git](https://github.com/derrickstolee/git/tree/reach-perf). This implementation is in a very rough condition, as it is intended to be a prototype, not production-quality. Using this implementation, you can compute commit-graph files for the specified reachability index using `git commit-graph write --reachable --version=<V>`. The `git` client will read the version from the file, so our tests store each version as `.git/objects/info/commit-graph.<V>` and copy the necessary file to `.git/objects/info/commit-graph` before testing. The algorithms count the number of commits walked, as to avoid the noise that would occur with time-based performance reporting. We use the (in progress) trace2 library for this. To find the values reported, use the `GIT_TR2_PERFORMANCE` environment variable. To ignore reachability indexes entirely and use the old algorithms (reported here as "OLD" values) use the environment variable `GIT_TEST_OLD_PAINT=1`. Reachability Index Versions --------------------------- **V0: (Minimum) Generation Number.** The _generation number_ of a commit is exactly one more than the maximum generation number of a parent (by convention, the generation number of a commit with no parents is 1). This is the definition currently used by Git (2.19.0 and later). Given two commits A and B, we can then use the following reachability index: If gen(A) < gen(B), then A cannot reach B. _Commentary:_ One issue with generation numbers is that some algorithms in Git use commit date as a heuristic, and allow incorrect answers if there is enough clock skew. Using that heuristic, the algorithms can walk fewer commits than the algorithms using generation number. The other reachability indexes introduced below attempt to fix this problem. **V1: (Epoch, Date) Pairs.** For each commit, store a pair of values called the _epoch_ and the _date_. The date is the normal commit date. The _epoch_ of a commit is the minimum X such that X is at least the maximum epoch of each parent, and at least one more than the epoch of any parent whose date is larger than the date of this commit (i.e. there is clock skew between this commit and this parent). In this way, we can use the following reachability index: If epoch(A) < epoch(B), then A cannot reach B. If epoch(A) == epoch(B) and date(A) < date(B), then A cannot reach B. **V2: Maximum Generation Numbers.** The _maximum generation number_ of a commit (denoted by maxgen(A)) is defined using its _children_ and the total number of commits in the repo. If A is a commit with no children (that is, there is no commit B with A as a parent) then maxgen(A) is equal to the number of commits in the repo. For all other commits A, maxgen(A) is one less than the minimum maxgen(B) among children B. This can be computed by initializing maxgen(C) to the number of commits, then walking the commit graph in topological order, assigning maxgen(P) = min { maxgen(P), maxgen(C) - 1 } for each parent P of the currently-walking commit C. We then have the same reachability index as minimum generation number: If maxgen(A) < maxgen(B), then A cannot reach B. _Commentary:_ The known examples where (minimum) generation numbers perform worse than commit date heuristics involve commits with recent commit dates but whose parents have very low generation number compared to most recent commits. In a way, minimum generation numbers increase the odds that we can say "A cannot reach B" when A is fixed and B is selected at random. Maximum generation numbers increase the odds that we can say "A cannot reach B" when B is fixed and A is selected at random. This helps us when we are exploring during a reachability algorithm and have explored few commits and want to say that the large set of unexplored commits cannot reach any of the explored commits. **V3: Corrected Commit Date.** For a commit C, let its _corrected commit date_ (denoted by cdate(C)) be the maximum of the commit date of C and the commit dates of its parents. If cdate(A) < cdate(B), then A cannot reach B. **V4: FELINE Index.** The FELINE index is a two-dimentional reachability index as defined in [Reachability Queries in Very Large Graphs: A Fast Refined Online Search Approach](https://openproceedings.org/EDBT/2014/paper_166.pdf) by Veloso, Cerf, Jeira, and Zaki. The index is not deterministically defined, but instead is defined in the following way: 1. Compute a reverse topological order of the commits. Let felineX(C) be the position in which C appears in this list. Since this is a reverse topological order, felineX(C) > felineX(P) for each parent P of C. 2. Compute a reverse topological order of the commits using Kahn's algorithm, but when selecting among a list of commits with in-degree zero, prioritize the commit by minimizing felineX. Let felineY(C) be the position in which C appears in this list. Essentially, the felineY order is selected with the goal of swapping positions of topologically-independent commits relative to the felinX ordering. The resulting reachability index is as follows: If felineX(A) < felineY(B), then A cannot reach B. If felineY(A) < felineY(B), then A cannot reach B. _Commentary:_ In terms of comparing two commits directly, this index is quite strong. However, when we are performing our reachability algorithms that care about reachable sets (git log --graph), or boundaries between reachable sets (git merge-base, git log --graph A..B) we need to track a single pair (minX,minY) for comparion. In order to not miss anything during our search, we need to update this pair to be the minimum felineX(A) and minimum felineY(B) among all explored commits A and B. That is, the pair (minX, minY) is below our entire explored set. This can be a disadvantage for these algorithms. ### Comparing Reachability Index Versions Viability Before considering how well these indexes perform during our algorithm runs, let's take a moment to consider the implementation details and how they relate to the existing commit-graph file format and existing Git clients. * **Compatible?** In our test implementation, we use a previously unused byte of data in the commit-graph format to indicate which reachability index version we are using. Existing clients ignore this value, so we will want to consider if these new indexes are _backwards compatible_. That is, will they still report correct values if they ignore this byte and use the generation number column from the commit-graph file assuming the values are minimum generation numbers? * **Immutable?** Git objects are _immutable_. If you change an object you actually create a new object with a new object ID. Are the values we store for these reachability indexes also immutable? * **Local?** Are these values **locally computable**? That is, do we only need to look at the parents of a commit (assuming those parents have computed values) in order to determine the value at that commit? | Index | Compatible? | Immutable? | Local? | |---------------------------|-------------|------------|--------| | Minimum Generation Number | Yes | Yes | Yes | | (Epoch, Date) pairs | Yes | Yes | Yes | | Maximum Generation Number | Yes | No | No | | Corrected Commit Date | No | Yes | Yes | | FELINE index | Yes | No | No | _Note:_ The corrected commit date uses the generation number column to store an offset of "how much do I need to add to my commit date to get my corrected commit date?" The values stored in that column are then not backwards-compatible. _Note:_ The FELINE index requires storing two values instead of just one. One of these values could be stored in the generation number column and the other in an optional chunk, hence it could be backwards compatible. (This is not how it is implemented in our example implementation.) Data ---- We focused on three types of performance tests that test the indexes in different ways. Each test lists the `git` command that is used, and the table lists which repository is used and which inputs. ### Test 1: `git log --topo-order -N` This test focuses on the number of commits that are parsed during a `git log --topo-order` before writing `N` commits to output. You can reproduce this test using `topo-order-tests.sh` and see all the data in `topo-order-summary.txt`. The values reported here are a sampling of the data, ignoring tests where all values were the same or extremely close in value. | Repo | N | V0 | V1 | V2 | V3 | V4 | |--------------|-------|--------|--------|--------|--------|-------| | android-base | 100 | 5,487 | 8,534 | 6,937 | 6,419 | 6,453 | | android-base | 1000 | 36,029 | 44,030 | 41,493 | 41,206 |45,431 | | chromium | 100 | 101 |424,406 | 101 | 101 | 101 | | gerrit | 100 | 8,212 | 8,533 | 164 | 159 | 162 | | gerrit | 1000 | 8,512 | 8,533 | 1,990 | 1,973 | 3,766 | | Linux | 100 | 12,458 | 12,444 | 13,683 | 13,123 |13,124 | | Linux | 1000 | 24,436 | 26,247 | 27,878 | 26,430 |27,875 | | Linux | 10000 | 30,364 | 28,891 | 27,878 | 26,430 |27,875 | | electron | 1000 | 19,927 | 18,733 | 1,072 | 18,214 |18,214 | | Ffmpeg | 10000 | 32,154 | 47,429 | 10,435 | 11,054 |11,054 | | jgit | 1000 | 1,550 | 6,264 | 1,067 | 1,060 | 1,233 | | julia | 10000 | 43,043 | 43,043 | 10,201 | 23,505 |23,828 | | odoo | 1000 | 17,175 | 9,714 | 4,043 | 4,046 | 4,111 | | php-src | 1000 | 19,014 | 27,530 | 1,311 | 1,305 | 1,320 | | rails | 100 | 1,420 | 2,041 | 1,757 | 1,428 | 1,441 | | rails | 1000 | 7,952 | 10,145 | 10,053 | 8,373 | 8,373 | | swift | 1000 | 1,914 | 4,004 | 2,071 | 1,939 | 1,940 | | tensorflow | 1000 | 10,019 | 39,221 | 6,711 | 10,051 |10,051 | | TypeScript | 1000 | 2,873 | 12,014 | 3,049 | 2,876 | 2,876 | ### Test 2: `git log --topo-order -10 A..B` This test focuses on the number of commits that are parsed during a `git log --topo-order A..B` before writing ten commits to output. Since we fix a very small set of output commits, we care more about the part of the walk that determines which commits are reachable from `B` but not reachable from `A`. This part of the walk uses commit date as a heuristic in the existing implementation. You can reproduce this test using `topo-compare-tests.sh` and see all the data in `topo-compare-summary.txt`. The values reported here are a sampling of the data, ignoring tests where all values were the same or extremely close in value. _Note:_ For some of the rows, the `A` and `B` values may be swapped from what is expected. This is due to (1) a bug in the reference implementation that doesn't short-circuit the walk when `A` can reach `B`, and (2) data-entry errors by the author. The bug can be fixed, but would have introduced strange-looking rows in this table. | Repo | A | B | OLD | V0 | V1 | V2 | V3 | V4 | |--------------|--------------|--------------|---------|--------|--------|--------|---------|--------| | android-base | 53c1972bc8f | 92f18ac3e39 | 39,403 | 1,544 | 6,957 | 26 | 1,015 | 1,098 | | gerrit | c4311f7642 | 777a8cd1e0 | 6,457 | 7,836 | 10,869 | 415 | 414 | 445 | | electron | 7da7dd85e | addf069f2 | 18,164 | 945 | 6,528 | 17 | 17 | 18 | | julia | 7faee1b201 | e2022b9f0f | 22,800 | 4,221 | 12,710 | 377 | 213 | 213 | | julia | ae69259cd9 | c8b5402afc | 1,864 | 1,859 | 13,287 | 12 | 1,859 | 1,859 | | Linux | 69973b830859 | c470abd4fde4 | 111,692 | 77,263 | 96,598 | 80,238 | 76,332 | 76,495 | | Linux | c3b92c878736 | 19f949f52599 | 167,418 | 5,736 | 4,684 | 9,675 | 3,887 | 3,923 | | Linux | c8d2bc9bc39e | 69973b830859 | 44,940 | 4,056 | 16,636 | 10,405 | 3,475 | 4,022 | | odoo | 4a31f55d0a0 | 93fb2b4a616 | 25,139 | 19,528 | 20,418 | 19,874 | 19,634 | 27,247 | | swift | 4046359efd | b34b6a14c7 | 13,411 | 662 | 321 | 12 | 80 | 134 | | tensorflow | ec6d17219c | fa1db5eb0d | 10,373 | 4,762 | 36,272 | 174 | 3,631 | 3,632 | | TypeScript | 35ea2bea76 | 123edced90 | 3,450 | 267 | 10,386 | 27 | 259 | 259 | ### Test 3: `git merge-base A B` This test focuses on the number of commits that are parsed during a `git merge-base A B`. This part of the walk uses commit date as a heuristic in the existing implementation. You can reproduce this test using `merge-base-tests.sh` and see all the data in `merge-base-summary.txt`. The values reported here are a sampling of the data, ignoring tests where all values were the same or extremely close in value. | Repo | A | B | OLD | V0 | V1 | V2 | V3 | V4 | |--------------|--------------|--------------|---------|---------|---------|---------|---------|---------| | android-base | 53c1972bc8f | 92f18ac3e39 | 81,999 | 109,025 | 81,885 | 77,475 | 81,999 | 82,001 | | gerrit | c4311f7642 | 777a8cd1e0 | 6,468 | 7,995 | 6,566 | 6,478 | 6,468 | 6,468 | | electron | 7da7dd85e | addf069f2 | 18,160 | 19,871 | 18,670 | 2,231 | 18,160 | 18,160 | | julia | 7faee1b201 | e2022b9f0f | 22,803 | 42,339 | 42,212 | 6,803 | 22,803 | 22,803 | | julia | c8b5402afc | ae69259cd9 | 7,076 | 42,909 | 42,770 | 2,690 | 7,076 | 7,076 | | Linux | 69973b830859 | c470abd4fde4 | 44,984 | 47,457 | 44,679 | 38,461 | 44,984 | 44,984 | | Linux | c3b92c878736 | 19f949f52599 | 111,740 | 111,027 | 111,196 | 107,835 | 111,771 | 111,368 | | Linux | c8d2bc9bc39e | 69973b830859 | 167,468 | 635,579 | 630,138 | 33,716 | 167,496 | 153,774 | | odoo | 4a31f55d0a0 | 93fb2b4a616 | 25,150 | 27,259 | 23,977 | 24,041 | 23,974 | 26,829 | | swift | 4046359efd | b34b6a14c7 | 13,434 | 13,254 | 13,940 | 16,023 | 13,127 | 15,008 | | tensorflow | ec6d17219c | fa1db5eb0d | 10,377 | 10,448 | 10,377 | 8,460 | 10,377 | 10,377 | | TypeScript | 35ea2bea76 | 123edced90 | 3,464 | 3,439 | 3,464 | 3,581 | 3,464 | 3,464 | Conclusions ----------- Based on the performance results alone, we should remove minimum generation numbers, (epoch, date) pairs, and FELINE index from consideration. There are enough examples of these indexes performing poorly. In contrast, maximum generation numbers and corrected commit dates both performed quite well. They are frequently the top two performing indexes, and rarely significantly different. The trade-off here now seems to be: which _property_ is more important, locally-computable or backwards-compatible? * Maximum generation number is backwards-compatible but not locally-computable or immutable. * Corrected commit-date is locally-computable and immutable, but not backwards-compatible. _Editor's Note:_ Every time I think about this trade-off, I can't come to a hard conclusion about which is better. Instead, I'll leave that up for discussion. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Generation Number v2 2018-10-29 16:55 [RFC] Generation Number v2 Derrick Stolee @ 2018-10-29 19:22 ` Stefan Beller 2018-10-29 20:06 ` Derrick Stolee 2018-11-01 20:06 ` Jakub Narebski 2018-10-29 20:25 ` Derrick Stolee ` (2 subsequent siblings) 3 siblings, 2 replies; 17+ messages in thread From: Stefan Beller @ 2018-10-29 19:22 UTC (permalink / raw) To: Derrick Stolee Cc: git, Jeff King, Junio C Hamano, Jakub Narębski, Derrick Stolee, Ævar Arnfjörð Bjarmason > Based on the performance results alone, we should remove minimum > generation numbers, (epoch, date) pairs, and FELINE index from > consideration. There are enough examples of these indexes performing > poorly. > > In contrast, maximum generation numbers and corrected commit > dates both performed quite well. They are frequently the top > two performing indexes, and rarely significantly different. > > The trade-off here now seems to be: which _property_ is more important, > locally-computable or backwards-compatible? > > * Maximum generation number is backwards-compatible but not > locally-computable or immutable. These maximum generation numbers sound like the reverse of the generation numbers as they are currently implemented, i.e. we count all commits between the commit A and all heads. How would this impact creation of a commit? The current generation numbers can be lazily updated or not updated at all. In my understanding of the maximum generation numbers, a new commit would make these maximum generation numbers invalid (i.e. they have to be recomputed). Are there ways out by deferring the computation of maximum generation numbers while still being able to work with some commits that are un-accounted for? When recomputing these numbers, the current generation number (V0) has the property that already existing numbers can be re-used as-is. We only need to compute the numbers for new commits, and then insert this to the appropriate data structures (which is a separate problem, one could imagine a split generation numbers file like the split index) For the V2 maximum generation numbers, would we need to rewrite the numbers for all commits once we recompute them? Assuming that is true, it sounds like the benchmark doesn't cover the whole costs associated with V2, which is why the exceptional performance can be explained. (Note: The table as read in https://public-inbox.org/git/6367e30a-1b3a-4fe9-611b-d931f51effef@gmail.com/ seems line broken, using gmails web interface is not good for ASCII art and patches, git-send-email would fare better) > > * Corrected commit-date is locally-computable and immutable, > but not backwards-compatible. How are these dates not backwards incompatible? We don't have to expose these dates to the user, but - just like generation numbers - could store them and use them but not tell the user about it. We would need to be really careful to not expose them at all as they look like the real dates, so that could make for an awkward bug report. The approach of "corrected commit date" sounds like we could have a very lazy approach, i.e. no extra data structures needed for many commits (as the corrected date equals the real date) and only need to store the corrections for some commits. Such an approach however would not make it easy to know if we operate on corrected dates, or if we even checked them for correctness before. So if we'd have an additional row in the generation numbers file telling the corrected date, then we should be able to be backwards compatible? ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Generation Number v2 2018-10-29 19:22 ` Stefan Beller @ 2018-10-29 20:06 ` Derrick Stolee 2018-11-01 20:06 ` Jakub Narebski 1 sibling, 0 replies; 17+ messages in thread From: Derrick Stolee @ 2018-10-29 20:06 UTC (permalink / raw) To: Stefan Beller Cc: git, Jeff King, Junio C Hamano, Jakub Narębski, Derrick Stolee, Ævar Arnfjörð Bjarmason On 10/29/2018 3:22 PM, Stefan Beller wrote: >> Based on the performance results alone, we should remove minimum >> generation numbers, (epoch, date) pairs, and FELINE index from >> consideration. There are enough examples of these indexes performing >> poorly. >> >> In contrast, maximum generation numbers and corrected commit >> dates both performed quite well. They are frequently the top >> two performing indexes, and rarely significantly different. >> >> The trade-off here now seems to be: which _property_ is more important, >> locally-computable or backwards-compatible? >> >> * Maximum generation number is backwards-compatible but not >> locally-computable or immutable. > These maximum generation numbers sound like the reverse of > the generation numbers as they are currently implemented, i.e. > we count all commits between the commit A and all heads. > How would this impact creation of a commit? The indexes listed here would be computed at the same time as a commit-graph (and only change on commit-graph writes). This idea of having something depend on the "global state" isn't ideal (in my opinion) but it certainly works for the tests specified below. > The current generation numbers can be lazily updated or not > updated at all. In my understanding of the maximum generation > numbers, a new commit would make these maximum generation > numbers invalid (i.e. they have to be recomputed). New commits would change the value you would get by the definition (on rewrite of the commit-graph) but don't violate the reachability index property (maxgen(A) < maxgen(B) => A can't reach B). So that doesn't effect their usefulness. > Are there ways out by deferring the computation of maximum > generation numbers while still being able to work with some > commits that are un-accounted for? Creating a commit that doesn't exist in the commit-graph file results in a commit with generation number INFINITY. The new commits can't be reached by commits with finite generation number. > When recomputing these numbers, the current generation number > (V0) has the property that already existing numbers can be re-used > as-is. We only need to compute the numbers for new commits, > and then insert this to the appropriate data structures (which is > a separate problem, one could imagine a split generation > numbers file like the split index) > > For the V2 maximum generation numbers, would we need to > rewrite the numbers for all commits once we recompute them? > Assuming that is true, it sounds like the benchmark doesn't > cover the whole costs associated with V2, which is why the > exceptional performance can be explained. You're right, we need to recompute the entire list of maximum generation number values. However, you are a bit hung up on "maxgen" being a fixed function that is correct at all times. We could compute the "maxgen" for the set of commits that are written to the "split" commit-graph while leaving the base the same. There is one tricky part here: we need to initialize the values starting at "num commits in the combined commit-graph" (commits in the base and split portion). The minimum possible value in the split portion will be larger than the maximum possible value in the base portion. Since the (future) implementation of split commit-graphs would have all commit-parent edges pointing down into the base and not from the base into the split, this will continue to satisfy our reachability index property. As for the cost: we need to do a topo-order walk and then a quick minimum-value check across all parents. This is O(N), and the constant is small. This may be a cost worth paying. > (Note: The table as read in > https://public-inbox.org/git/6367e30a-1b3a-4fe9-611b-d931f51effef@gmail.com/ > seems line broken, using gmails web interface is not good > for ASCII art and patches, git-send-email would fare better) Sorry for that. I copy-pasted into Thunderbird. Hopefully most users can redirect to the GitHub-rendered markdown if they have trouble. > >> * Corrected commit-date is locally-computable and immutable, >> but not backwards-compatible. > How are these dates not backwards incompatible? > We don't have to expose these dates to the user, but > - just like generation numbers - could store them and use them > but not tell the user about it. > > We would need to be really careful to not expose them at all > as they look like the real dates, so that could make for an > awkward bug report. By "backwards-compatible" I mean that we could store them in the commit-graph file without breaking existing clients (or by introducing extra data). We could easily add these corrected commit dates as an optional chunk, but that adds 8 bytes per commit, and we already use 8 bytes for the (generation, date) pairs. The solution I made in my prototype is to replace the generation number with an offset for how much to add to the commit-date to get the corrected commit-date. However, an old client would interpret these offsets as generations and have incorrect output. A potential way to get around this is to consider the pair (offset, date) and define the offset(C) to be the minimum of min { offset(P), date(P) - date(C) }. This would satisfy the requirements for the V0 reachability index. I'm making a note to implement _this_ version and give it a test. > > The approach of "corrected commit date" sounds like we could > have a very lazy approach, i.e. no extra data structures needed > for many commits (as the corrected date equals the real date) > and only need to store the corrections for some commits. > Such an approach however would not make it easy to know > if we operate on corrected dates, or if we even checked them > for correctness before. > > So if we'd have an additional row in the generation numbers file > telling the corrected date, then we should be able to be backwards > compatible? Yeah, an optional chunk with the corrected date would work, but be wasteful. Thanks, -Stolee ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Generation Number v2 2018-10-29 19:22 ` Stefan Beller 2018-10-29 20:06 ` Derrick Stolee @ 2018-11-01 20:06 ` Jakub Narebski 2018-11-02 9:30 ` Jakub Narebski 1 sibling, 1 reply; 17+ messages in thread From: Jakub Narebski @ 2018-11-01 20:06 UTC (permalink / raw) To: Stefan Beller Cc: Derrick Stolee, git, Jeff King, Junio C Hamano, Derrick Stolee, Ævar Arnfjörð Bjarmason Stefan Beller <sbeller@google.com> writes: >> Based on the performance results alone, we should remove minimum >> generation numbers, (epoch, date) pairs, and FELINE index from >> consideration. There are enough examples of these indexes performing >> poorly. >> >> In contrast, maximum generation numbers and corrected commit >> dates both performed quite well. They are frequently the top >> two performing indexes, and rarely significantly different. >> >> The trade-off here now seems to be: which _property_ is more important, >> locally-computable or backwards-compatible? >> >> * Maximum generation number is backwards-compatible but not >> locally-computable or immutable. > > These maximum generation numbers sound like the reverse of > the generation numbers as they are currently implemented, i.e. > we count all commits between the commit A and all heads. Actually maximum generation number = number of commits - reverse generation number > How would this impact creation of a commit? > > The current generation numbers can be lazily updated or not > updated at all. In my understanding of the maximum generation > numbers, a new commit would make these maximum generation > numbers invalid (i.e. they have to be recomputed). > > Are there ways out by deferring the computation of maximum > generation numbers while still being able to work with some > commits that are un-accounted for? As I understand it, new commits added since commit-graph file was created would get simply INFINITY maximum/maximal generation numbers (if we were using reverse generation numbers, new commits would get ZERO for generation number). > When recomputing these numbers, the current generation number > (V0) has the property that already existing numbers can be re-used > as-is. We only need to compute the numbers for new commits, > and then insert this to the appropriate data structures (which is > a separate problem, one could imagine a split generation > numbers file like the split index) Sidenote: I wonder if there is some on-disk data structure with similarly fast read, but easier update. > For the V2 maximum generation numbers, would we need to > rewrite the numbers for all commits once we recompute them? > Assuming that is true, it sounds like the benchmark doesn't > cover the whole costs associated with V2, which is why the > exceptional performance can be explained. Let's check it using a simple example First, (minimum) parent-based generation numbers before and after extending the commit graph: 1 2 3 4 5 6 7 new 1 2 3 4 5 - - old .---.-----.---.-----.---*---* \ \ 3 4 5 6 new \ 3 4 5 6 old \-.---.-----.---. \ \ 5 new \ - old \-* Next, maximum generation numbers. We start with 9 commits, and we end up with 12 commits after the change 6 7 8 9 10 11 12 new 4 5 7 8 9 - - old .---.-----.---.-----.---*---* \ \ 9 10 11 12 new \ 6 7 8 9 old \-.---.-----.---. \ \ 12 new \ - old \-* As you can see all maximum generation numbers got rewritten. Though if instead using the number of commits, we use the maximum generation number, or in other words the length of the longest path, we get the following: 1 2 3 4 5 6 7 new 1 2 4 5 6 - - old .---.-----.---.-----.---*---* \ \ 4 5 6 7 new \ 3 4 5 6 old \-.---.-----.---. \ \ 7 new \ - old \-* A bit better, but still much change in numbers. [...] >> >> * Corrected commit-date is locally-computable and immutable, >> but not backwards-compatible. > > How are these dates not backwards incompatible? > We don't have to expose these dates to the user, but > - just like generation numbers - could store them and use them > but not tell the user about it. > > We would need to be really careful to not expose them at all > as they look like the real dates, so that could make for an > awkward bug report. > > The approach of "corrected commit date" sounds like we could > have a very lazy approach, i.e. no extra data structures needed > for many commits (as the corrected date equals the real date) > and only need to store the corrections for some commits. > Such an approach however would not make it easy to know > if we operate on corrected dates, or if we even checked them > for correctness before. > > So if we'd have an additional row in the generation numbers file > telling the corrected date, then we should be able to be backwards > compatible? Here, from what I understand, being "backwards compatibile" means being able to use commit-graph file using the new format by old client, trating the data as if they were generation numbers and not data offsets. Best, -- Jakub Narębski ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Generation Number v2 2018-11-01 20:06 ` Jakub Narebski @ 2018-11-02 9:30 ` Jakub Narebski 2018-11-03 17:27 ` Jakub Narebski 0 siblings, 1 reply; 17+ messages in thread From: Jakub Narebski @ 2018-11-02 9:30 UTC (permalink / raw) To: Stefan Beller Cc: Derrick Stolee, git, Jeff King, Junio C Hamano, Derrick Stolee, Ævar Arnfjörð Bjarmason Jakub Narebski <jnareb@gmail.com> writes: > Stefan Beller <sbeller@google.com> writes: [...] >> How would this impact creation of a commit? >> >> The current generation numbers can be lazily updated or not >> updated at all. In my understanding of the maximum generation >> numbers, a new commit would make these maximum generation >> numbers invalid (i.e. they have to be recomputed). [...] >> For the V2 maximum generation numbers, would we need to >> rewrite the numbers for all commits once we recompute them? >> Assuming that is true, it sounds like the benchmark doesn't >> cover the whole costs associated with V2, which is why the >> exceptional performance can be explained. > > Let's check it using a simple example > > First, (minimum) parent-based generation numbers before and after > extending the commit graph: > > 1 2 3 4 5 6 7 new > 1 2 3 4 5 - - old > .---.-----.---.-----.---*---* > \ > \ 3 4 5 6 new > \ 3 4 5 6 old > \-.---.-----.---. > \ > \ 5 new > \ - old > \-* Let me check yet another idea, using (minimum) parent-based V0 generation numbers (counting distance from the sink / root) as a starting number for source / heads commits. 1 2 3 4 5 6 7 new 1 2 3 4 5 - - old .---.-----.---.-----.---*---* \ \ 3 4 5 6 new \ 3 4 5 6 old \-.---.-----.---. \ \ 5 new \ - old \-* Well, on this example it looks like this variant of maximum generation numbers can be done incrementally, but let's check another example 1 2 3 4 5 6 7 8 9 new 1 2 3 4 5 6 7 8 - old .---.-----.---.---.---.-----.---.---* \ / \ 3 4 / 5 6 7 8 new \ 5 6 / - - - - old \-.---.---------/---*---*---*---* It looks like it doesn; give as good results as I thought. Less values are changed, but you would still need to recalculate them, unless it can be proven that they do not need it. > > Next, maximum generation numbers. We start with 9 commits, and we end > up with 12 commits after the change > > 6 7 8 9 10 11 12 new > 4 5 7 8 9 - - old > .---.-----.---.-----.---*---* > \ > \ 9 10 11 12 new > \ 6 7 8 9 old > \-.---.-----.---. > \ > \ 12 new > \ - old > \-* > > > As you can see all maximum generation numbers got rewritten. > > Though if instead using the number of commits, we use the maximum > generation number, or in other words the length of the longest path, we > get the following: > > 1 2 3 4 5 6 7 new > 1 2 4 5 6 - - old > .---.-----.---.-----.---*---* > \ > \ 4 5 6 7 new > \ 3 4 5 6 old > \-.---.-----.---. > \ > \ 7 new > \ - old > \-* > > A bit better, but still much change in numbers. -- Jakub Narębski ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Generation Number v2 2018-11-02 9:30 ` Jakub Narebski @ 2018-11-03 17:27 ` Jakub Narebski 0 siblings, 0 replies; 17+ messages in thread From: Jakub Narebski @ 2018-11-03 17:27 UTC (permalink / raw) To: Stefan Beller Cc: Derrick Stolee, git, Jeff King, Junio C Hamano, Derrick Stolee, Ævar Arnfjörð Bjarmason Jakub Narebski <jnareb@gmail.com> writes: > Jakub Narebski <jnareb@gmail.com> writes: >> Stefan Beller <sbeller@google.com> writes: > [...] >>> How would this impact creation of a commit? >>> >>> The current generation numbers can be lazily updated or not >>> updated at all. In my understanding of the maximum generation >>> numbers, a new commit would make these maximum generation >>> numbers invalid (i.e. they have to be recomputed). > [...] >>> For the V2 maximum generation numbers, would we need to >>> rewrite the numbers for all commits once we recompute them? >>> Assuming that is true, it sounds like the benchmark doesn't >>> cover the whole costs associated with V2, which is why the >>> exceptional performance can be explained. >> >> Let's check it using a simple example >> >> First, (minimum) parent-based generation numbers before and after >> extending the commit graph: >> >> 1 2 3 4 5 6 7 new >> 1 2 3 4 5 - - old >> .---.-----.---.-----.---*---* >> \ >> \ 3 4 5 6 new >> \ 3 4 5 6 old >> \-.---.-----.---. >> \ >> \ 5 new >> \ - old >> \-* > > Let me check yet another idea, using (minimum) parent-based V0 generation > numbers (counting distance from the sink / root) as a starting number > for source / heads commits. [...] > [...] but let's check another example > > 1 2 3 4 5 6 7 8 9 new > 1 2 3 4 5 6 7 8 - old > .---.-----.---.---.---.-----.---.---* > \ / > \ 3 4 / 5 6 7 8 new > \ 5 6 / - - - - old > \-.---.---------/---*---*---*---* But let's do this correctly. 1 2 3 4 5 6 7 8 9 new 1 2 3 4 5 6 7 8 - old .---.-----.---.------.---.-----.---.---* \ / \ 3 4 / new \ 5 6 / old \-.---.------------/ \ \ 5 6 7 8 new \ - - - - old \--*---*-----*---* Well, it looks as if I draw it incorrectly, but performed calculations right. You may need to modify / change some data, but it looks as if it is not that much of a problem. The new version of the maximum generation numbers looks like it gives the same results as generation numbers for the "longest" path, and update may affect only the side-branches that were added to. All branches merged into the trunk, and not added to should be safe with respect to updating. Can anyone here prove a thing about update of those modified maximum generation numbers? Thanks in advance. Best, -- Jakub Narębski ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Generation Number v2 2018-10-29 16:55 [RFC] Generation Number v2 Derrick Stolee 2018-10-29 19:22 ` Stefan Beller @ 2018-10-29 20:25 ` Derrick Stolee 2018-11-01 22:13 ` Jakub Narebski 2018-10-30 3:59 ` Junio C Hamano 2018-11-01 12:27 ` Jakub Narebski 3 siblings, 1 reply; 17+ messages in thread From: Derrick Stolee @ 2018-10-29 20:25 UTC (permalink / raw) To: Git List, Jeff King, Junio C Hamano, Jakub Narębski, Derrick Stolee, Ævar Arnfjörð Bjarmason, Stefan Beller Here is a re-formatted version of the tables I introduced earlier. The tables were too wide for public-inbox to render correctly (when paired with my email client). Hopefully this bulleted-list format works better. Thanks, Stefan, for pointing out the rendering problems! ### Test 1: `git log --topo-order -N` This test focuses on the number of commits that are parsed during a `git log --topo-order` before writing `N` commits to output. You can reproduce this test using `topo-order-tests.sh` and see all the data in `topo-order-summary.txt`. The values reported here are a sampling of the data, ignoring tests where all values were the same or extremely close in value. android-base, N = 100 V0: 5487 V1: 8534 V2: 6937 V3: 6419 V4: 6453 android-base, N = 1000 V0: 36029 V1: 44030 V2: 41493 V3: 41206 V4: 45431 chromium, N = 100 V0: 101 V1: 424406 V2: 101 V3: 101 V4: 101 gerrit, N = 100 V0: 8212 V1: 8533 V2: 164 V3: 159 V4: 162 gerrit, N = 1000 V0: 8512 V1: 8533 V2: 1990 V3: 1973 V4: 3766 Linux, N = 100 V0: 12458 V1: 12444 V2: 13683 V3: 13123 V4: 13124 Linux, N = 1000 V0: 24436 V1: 26247 V2: 27878 V3: 26430 V4: 27875 Linux, N = 10000 V0: 30364 V1: 28891 V2: 27878 V3: 26430 V4: 27875 electron, N = 1000 V0: 19927 V1: 18733 V2: 1072 V3: 18214 V4: 18214 Ffmpeg, N = 10000 V0: 32154 V1: 47429 V2: 10435 V3: 11054 V4: 11054 jgit, N = 1000 V0: 1550 V1: 6264 V2: 1067 V3: 1060 V4: 1233 julia, N = 10000 V0: 43043 V1: 43043 V2: 10201 V3: 23505 V4: 23828 odoo, N = 1000 V0: 17175 V1: 9714 V2: 4043 V3: 4046 V4: 4111 php-src, N = 1000 V0: 19014 V1: 27530 V2: 1311 V3: 1305 V4: 1320 rails, N = 100 V0: 1420 V1: 2041 V2: 1757 V3: 1428 V4: 1441 rails, N = 1000 V0: 7952 V1: 10145 V2: 10053 V3: 8373 V4: 8373 swift, N = 1000 V0: 1914 V1: 4004 V2: 2071 V3: 1939 V4: 1940 tensorflow, N = 1000 V0: 10019 V1: 39221 V2: 6711 V3: 10051 V4: 10051 TypeScript, N = 1000 V0: 2873 V1: 12014 V2: 3049 V3: 2876 V4: 2876 ### Test 2: `git log --topo-order -10 A..B` This test focuses on the number of commits that are parsed during a `git log --topo-order A..B` before writing ten commits to output. Since we fix a very small set of output commits, we care more about the part of the walk that determines which commits are reachable from `B` but not reachable from `A`. This part of the walk uses commit date as a heuristic in the existing implementation. You can reproduce this test using `topo-compare-tests.sh` and see all the data in `topo-compare-summary.txt`. The values reported here are a sampling of the data, ignoring tests where all values were the same or extremely close in value. _Note:_ For some of the rows, the `A` and `B` values may be swapped from what is expected. This is due to (1) a bug in the reference implementation that doesn't short-circuit the walk when `A` can reach `B`, and (2) data-entry errors by the author. The bug can be fixed, but would have introduced strange-looking rows in this table. android-base 53c1972bc8f 92f18ac3e39 OLD: 39403 V0: 1544 V1: 6957 V2: 26 V3: 1015 V4: 1098 gerrit c4311f7642 777a8cd1e0 OLD: 6457 V0: 7836 V1: 10869 V2: 415 V3: 414 V4: 445 electron 7da7dd85e addf069f2 OLD: 18164 V0: 945 V1: 6528 V2: 17 V3: 17 V4: 18 julia 7faee1b201 e2022b9f0f OLD: 22800 V0: 4221 V1: 12710 V2: 377 V3: 213 V4: 213 julia ae69259cd9 c8b5402afc OLD: 1864 V0: 1859 V1: 13287 V2: 12 V3: 1859 V4: 1859 Linux 69973b830859 c470abd4fde4 OLD: 111692 V0: 77263 V1: 96598 V2: 80238 V3: 76332 V4: 76495 Linux c3b92c878736 19f949f52599 OLD: 167418 V0: 5736 V1: 4684 V2: 9675 V3: 3887 V4: 3923 Linux c8d2bc9bc39e 69973b830859 OLD: 44940 V0: 4056 V1: 16636 V2: 10405 V3: 3475 V4: 4022 odoo 4a31f55d0a0 93fb2b4a616 OLD: 25139 V0: 19528 V1: 20418 V2: 19874 V3: 19634 V4: 27247 swift 4046359efd b34b6a14c7 OLD: 13411 V0: 662 V1: 321 V2: 12 V3: 80 V4: 134 tensorflow ec6d17219c fa1db5eb0d OLD: 10373 V0: 4762 V1: 36272 V2: 174 V3: 3631 V4: 3632 TypeScript 35ea2bea76 123edced90 OLD: 3450 V0: 267 V1: 10386 V2: 27 V3: 259 V4: 259 ### Test 3: `git merge-base A B` This test focuses on the number of commits that are parsed during a `git merge-base A B`. This part of the walk uses commit date as a heuristic in the existing implementation. You can reproduce this test using `merge-base-tests.sh` and see all the data in `merge-base-summary.txt`. The values reported here are a sampling of the data, ignoring tests where all values were the same or extremely close in value. android-base 53c1972bc8f 92f18ac3e39 OLD: 81999 V0: 109025 V1: 81885 V2: 77475 V3: 81999 V4: 82001 gerrit c4311f7642 777a8cd1e0 OLD: 6468 V0: 7995 V1: 6566 V2: 6478 V3: 6468 V4: 6468 electron 7da7dd85e addf069f2 OLD: 18160 V0: 19871 V1: 18670 V2: 2231 V3: 18160 V4: 18160 julia 7faee1b201 e2022b9f0f OLD: 22803 V0: 42339 V1: 42212 V2: 6803 V3: 22803 V4: 22803 julia c8b5402afc ae69259cd9 OLD: 7076 V0: 42909 V1: 42770 V2: 2690 V3: 7076 V4: 7076 Linux 69973b830859 c470abd4fde4 OLD: 44984 V0: 47457 V1: 44679 V2: 38461 V3: 44984 V4: 44984 Linux c3b92c878736 19f949f52599 OLD: 111740 V0: 111027 V1: 111196 V2: 107835 V3: 111771 V4: 111368 Linux c8d2bc9bc39e 69973b830859 OLD: 167468 V0: 635579 V1: 630138 V2: 33716 V3: 167496 V4: 153774 odoo 4a31f55d0a0 93fb2b4a616 OLD: 25150 V0: 27259 V1: 23977 V2: 24041 V3: 23974 V4: 26829 swift 4046359efd b34b6a14c7 OLD: 13434 V0: 13254 V1: 13940 V2: 16023 V3: 13127 V4: 15008 tensorflow ec6d17219c fa1db5eb0d OLD: 10377 V0: 10448 V1: 10377 V2: 8460 V3: 10377 V4: 10377 TypeScript 35ea2bea76 123edced90 OLD: 3464 V0: 3439 V1: 3464 V2: 3581 V3: 3464 V4: 3464 ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Generation Number v2 2018-10-29 20:25 ` Derrick Stolee @ 2018-11-01 22:13 ` Jakub Narebski 0 siblings, 0 replies; 17+ messages in thread From: Jakub Narebski @ 2018-11-01 22:13 UTC (permalink / raw) To: Derrick Stolee Cc: Git List, Jeff King, Junio C Hamano, Derrick Stolee, Ævar Arnfjörð Bjarmason, Stefan Beller Derrick Stolee <stolee@gmail.com> writes: > Here is a re-formatted version of the tables I introduced earlier. > The tables were too wide for public-inbox to render correctly (when > paired with my email client). Hopefully this bulleted-list format > works better. Thanks, Stefan, for pointing out the rendering > problems! > > ### Test 1: `git log --topo-order -N` > > This test focuses on the number of commits that are parsed during > a `git log --topo-order` before writing `N` commits to output. > > You can reproduce this test using `topo-order-tests.sh` and > see all the data in `topo-order-summary.txt`. The values reported > here are a sampling of the data, ignoring tests where all values > were the same or extremely close in value. > > android-base, N = 100 > V0: 5487 > V1: 8534 > V2: 6937 > V3: 6419 > V4: 6453 [...] > ### Test 2: `git log --topo-order -10 A..B` [...] > android-base 53c1972bc8f 92f18ac3e39 > OLD: 39403 > V0: 1544 > V1: 6957 > V2: 26 > V3: 1015 > V4: 1098 Two minor issues. First, now that the data is not in the table format, where every bit of horizontal space matters, why not spell the names of new proposed "generation numbers" (or rather reachability orderings) in full, instead of using V0...V4 shortcuts? It is not as if we were short of space. Second, why OLD data is present in tests 2 and 3, but not in test 1? Best, -- Jakub Narębski ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Generation Number v2 2018-10-29 16:55 [RFC] Generation Number v2 Derrick Stolee 2018-10-29 19:22 ` Stefan Beller 2018-10-29 20:25 ` Derrick Stolee @ 2018-10-30 3:59 ` Junio C Hamano 2018-10-31 12:30 ` Derrick Stolee 2018-10-31 12:54 ` Ævar Arnfjörð Bjarmason 2018-11-01 12:27 ` Jakub Narebski 3 siblings, 2 replies; 17+ messages in thread From: Junio C Hamano @ 2018-10-30 3:59 UTC (permalink / raw) To: Derrick Stolee Cc: Git List, Jeff King, Jakub Narębski, Derrick Stolee, Ævar Arnfjörð Bjarmason Derrick Stolee <stolee@gmail.com> writes: > **V3: Corrected Commit Date.** > For a commit C, let its _corrected commit date_ (denoted by cdate(C)) > be the maximum of the commit date of C and the commit dates of its > parents. "maximum of the commit date of C and the corrected commit dates of its parents"? We've talked about exactly this one in the past (long before any of Microsoft folks other than Dscho came to the scene) but without an implementation, or detailed design and analysis. I am very happy to see the idea among other possibilities to be considered again. This time around, we may finally come up with something better than the "commit dates with SLOP" thing ;-). > Essentially, the felineY order is selected with the goal of swapping > positions of topologically-independent commits relative to the felinX > ordering. The resulting reachability index is as follows: > > If felineX(A) < felineY(B), then A cannot reach B. > If felineY(A) < felineY(B), then A cannot reach B. I presume that the first line is a typo and you compare the same X index? > * **Compatible?** In our test implementation, we use a previously unused > byte of data in the commit-graph format to indicate which reachability > index version we are using. Existing clients ignore this value, so we > will want to consider if these new indexes are _backwards compatible_. > That is, will they still report correct values if they ignore this byte > and use the generation number column from the commit-graph file assuming > the values are minimum generation numbers? I personally consider that the current commit-graph with generation numbers experimental, so I am not sure how much we care about this. Having said that. By the above definition, any new index that is wider than the current generation number cannot be compatible (can we even tell the existing clients how wide each elements in the ix array is?) In any case, perhaps the first thing to do is to update the clients so that they stop ignoring the version number field, and instead work without generation number when there is no version of reach.ix available in the file? That way, a better reachablility index can be chosen freely without having to worry about the compatibility. > * **Immutable?** Git objects are _immutable_. If you change an object you > actually create a new object with a new object ID. Are the values we > store > for these reachability indexes also immutable? Even if we do not embed the reachability ix in commit objects, having an immutable value is probably a must if we want to make them incrementally computable, so this is a very good property to have. Unless there is a clever idea to incrementally compute a mutable reach.ix, my gut instinct says that this property is a must. Another thing, perhaps related to "local" below, is if exactly the same reach.ix is computed by anybody, given an identical commit history graph (perhaps "reproducibility"?). I think most of the candidates you listed are reproducible without a fixed tie-breaker, but I am not sure about felineY() thing. > * **Local?** Are these values **locally computable**? That is, do we only > need to look at the parents of a commit (assuming those parents have > computed values) in order to determine the value at that commit? A subset of non-local reachability ix, for example, the ones that need to know what each commit's children are, cannot be immutable, as adding new objects to the graph (either with locally committing, or transferring objects from other repositories) would affect the ix; is this true for all non-local reachability ix, I wonder? > We focused on three types of performance tests that test the indexes > in different ways. Each test lists the `git` command that is used, > and the table lists which repository is used and which inputs. > > ### Test 1: `git log --topo-order -N` > > This test focuses on the number of commits that are parsed during > a `git log --topo-order` before writing `N` commits to output. A devil's advocate comment. Your patches seem to be very focused on this "unlimited" case for the past few weeks, but I am not so sure if that is a case worth optimizing for. If "git log --topo-order -N HEAD~M.." (for some number M) gives us a similar result as unlimited case but with much less effort, wouldn't it be good enough that lets us concentrate on other use cases instead? > Based on the performance results alone, we should remove minimum > generation numbers, (epoch, date) pairs, and FELINE index from > consideration. There are enough examples of these indexes performing > poorly. OK. > In contrast, maximum generation numbers and corrected commit > dates both performed quite well. They are frequently the top > two performing indexes, and rarely significantly different. > > The trade-off here now seems to be: which _property_ is more important, > locally-computable or backwards-compatible? Nice summary. As I already said, I personally do not think being compatible with currently deployed clients is important at all (primarily because I still consider the whole thing experimental), and there is a clear way forward once we correct the mistake of not having a version number in the file format that tells the updated clients to ignore the generation numbers. For longer term viability, we should pick something that is immutable, reproducible, computable with minimum input---all of which would lead to being incrementally computable, I would think. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Generation Number v2 2018-10-30 3:59 ` Junio C Hamano @ 2018-10-31 12:30 ` Derrick Stolee 2018-11-02 13:33 ` Jakub Narebski 2018-10-31 12:54 ` Ævar Arnfjörð Bjarmason 1 sibling, 1 reply; 17+ messages in thread From: Derrick Stolee @ 2018-10-31 12:30 UTC (permalink / raw) To: Junio C Hamano Cc: Git List, Jeff King, Jakub Narębski, Derrick Stolee, Ævar Arnfjörð Bjarmason On 10/29/2018 11:59 PM, Junio C Hamano wrote: > Derrick Stolee <stolee@gmail.com> writes: > >> **V3: Corrected Commit Date.** >> For a commit C, let its _corrected commit date_ (denoted by cdate(C)) >> be the maximum of the commit date of C and the commit dates of its >> parents. > "maximum of the commit date of C and the corrected commit dates of > its parents"? That's what I mean. Thanks. > > We've talked about exactly this one in the past (long before any of > Microsoft folks other than Dscho came to the scene) but without an > implementation, or detailed design and analysis. I am very happy to > see the idea among other possibilities to be considered again. This > time around, we may finally come up with something better than the > "commit dates with SLOP" thing ;-). > >> Essentially, the felineY order is selected with the goal of swapping >> positions of topologically-independent commits relative to the felinX >> ordering. The resulting reachability index is as follows: >> >> If felineX(A) < felineY(B), then A cannot reach B. >> If felineY(A) < felineY(B), then A cannot reach B. > I presume that the first line is a typo and you compare the same X index? Yes, sorry for the typos. I fixed them in the report on GitHub. > >> * **Compatible?** In our test implementation, we use a previously unused >> byte of data in the commit-graph format to indicate which reachability >> index version we are using. Existing clients ignore this value, so we >> will want to consider if these new indexes are _backwards compatible_. >> That is, will they still report correct values if they ignore this byte >> and use the generation number column from the commit-graph file assuming >> the values are minimum generation numbers? > I personally consider that the current commit-graph with generation > numbers experimental, so I am not sure how much we care about this. > > Having said that. > > By the above definition, any new index that is wider than the > current generation number cannot be compatible (can we even tell the > existing clients how wide each elements in the ix array is?) > > In any case, perhaps the first thing to do is to update the clients > so that they stop ignoring the version number field, and instead > work without generation number when there is no version of reach.ix > available in the file? That way, a better reachablility index can > be chosen freely without having to worry about the compatibility. I can work on that. It should be as simple as setting commit->generation to GENERATION_NUMBER_ZERO in fill_commit_in_graph when the graph has a different version. > >> * **Immutable?** Git objects are _immutable_. If you change an object you >> actually create a new object with a new object ID. Are the values we >> store >> for these reachability indexes also immutable? > Even if we do not embed the reachability ix in commit objects, > having an immutable value is probably a must if we want to make them > incrementally computable, so this is a very good property to have. > Unless there is a clever idea to incrementally compute a mutable > reach.ix, my gut instinct says that this property is a must. > > Another thing, perhaps related to "local" below, is if exactly the > same reach.ix is computed by anybody, given an identical commit > history graph (perhaps "reproducibility"?). I think most of the > candidates you listed are reproducible without a fixed tie-breaker, > but I am not sure about felineY() thing. > >> * **Local?** Are these values **locally computable**? That is, do we only >> need to look at the parents of a commit (assuming those parents have >> computed values) in order to determine the value at that commit? > A subset of non-local reachability ix, for example, the ones that > need to know what each commit's children are, cannot be immutable, > as adding new objects to the graph (either with locally committing, > or transferring objects from other repositories) would affect the > ix; is this true for all non-local reachability ix, I wonder? As a thought experiment, we could define a function size(C) to be the numberof commits reachable from C. This is not locally-computable from the size values at C's parents due to the inclusion-exclusion principle. We would need to compute it by walking the reachable set and counting (resulting in quadratic performance overall) but is immutable. Since the performance cost is so expensive (unlike the linear costs in the other non-local versions) I didn't include it in my comparison. > >> We focused on three types of performance tests that test the indexes >> in different ways. Each test lists the `git` command that is used, >> and the table lists which repository is used and which inputs. >> >> ### Test 1: `git log --topo-order -N` >> >> This test focuses on the number of commits that are parsed during >> a `git log --topo-order` before writing `N` commits to output. > A devil's advocate comment. Your patches seem to be very focused on > this "unlimited" case for the past few weeks, but I am not so sure > if that is a case worth optimizing for. If "git log --topo-order -N > HEAD~M.." (for some number M) gives us a similar result as unlimited > case but with much less effort, wouldn't it be good enough that lets > us concentrate on other use cases instead? I mostly included these statistics to make sure we didn't add a regression in this case. Note that I didn't report the OLD values in this table, because that would be an unfair comparison. >> Based on the performance results alone, we should remove minimum >> generation numbers, (epoch, date) pairs, and FELINE index from >> consideration. There are enough examples of these indexes performing >> poorly. > OK. > >> In contrast, maximum generation numbers and corrected commit >> dates both performed quite well. They are frequently the top >> two performing indexes, and rarely significantly different. >> >> The trade-off here now seems to be: which _property_ is more important, >> locally-computable or backwards-compatible? > Nice summary. > > As I already said, I personally do not think being compatible with > currently deployed clients is important at all (primarily because I > still consider the whole thing experimental), and there is a clear > way forward once we correct the mistake of not having a version > number in the file format that tells the updated clients to ignore > the generation numbers. For longer term viability, we should pick > something that is immutable, reproducible, computable with minimum > input---all of which would lead to being incrementally computable, I > would think. This is good reasoning. The "reproducible" property is also important for support reasons, too! Sounds like the corrected commit date is the best way forward. Aside: I spent some time thinking about making the corrected commit dates backward compatible by ensuring the offsets are monotonic in the commit history (so we could store the offset as commit->generation and the existing generation comparisons would still work). However, it performs poorly on the Linux repository 'git merge-base v4.8 v4.9' example. Thanks, -Stolee ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Generation Number v2 2018-10-31 12:30 ` Derrick Stolee @ 2018-11-02 13:33 ` Jakub Narebski 0 siblings, 0 replies; 17+ messages in thread From: Jakub Narebski @ 2018-11-02 13:33 UTC (permalink / raw) To: Derrick Stolee Cc: Junio C Hamano, Git List, Jeff King, Derrick Stolee, Ævar Arnfjörð Bjarmason Derrick Stolee <stolee@gmail.com> writes: > On 10/29/2018 11:59 PM, Junio C Hamano wrote: >> Derrick Stolee <stolee@gmail.com> writes: [...] >>> * **Compatible?** In our test implementation, we use a previously unused >>> byte of data in the commit-graph format to indicate which reachability >>> index version we are using. Existing clients ignore this value, so we >>> will want to consider if these new indexes are _backwards compatible_. >>> That is, will they still report correct values if they ignore this byte >>> and use the generation number column from the commit-graph file assuming >>> the values are minimum generation numbers? >> >> I personally consider that the current commit-graph with generation >> numbers experimental, so I am not sure how much we care about this. >> >> Having said that. >> >> By the above definition, any new index that is wider than the >> current generation number cannot be compatible (can we even tell the >> existing clients how wide each elements in the ix array is?) >> >> In any case, perhaps the first thing to do is to update the clients >> so that they stop ignoring the version number field, and instead >> work without generation number when there is no version of reach.ix >> available in the file? That way, a better reachablility index can >> be chosen freely without having to worry about the compatibility. > > I can work on that. It should be as simple as setting commit->generation to > GENERATION_NUMBER_ZERO in fill_commit_in_graph when the graph > has a different version. That is a very good idea, but we have here a chicken-and-egg problem. The only way to denote that the generation numbers / reachability index have changed (for reading/using: that it changed in backwards incompatibile way; for update: that it changed at all) is to change the format version: 1-byte version number: Currently, the only valid version is 1. But if we assume that different commit-graph format version simply means that commit->generation is to be set to GENERATION_NUMBER_ZERO, then we block possible changes to the structure of the commit-graph file. [Reads further in thread] Ah, I see that's why you want to introduce [1]: DS> DS> 3. Reachability Index versioning [1]: https://public-inbox.org/git/6902dbff-d9f6-e897-2c20-d0cb47a50795@gmail.com/ >>> * **Immutable?** Git objects are _immutable_. If you change an object you >>> actually create a new object with a new object ID. Are the values we store >>> for these reachability indexes also immutable? >> >> Even if we do not embed the reachability ix in commit objects, >> having an immutable value is probably a must if we want to make them >> incrementally computable, so this is a very good property to have. >> Unless there is a clever idea to incrementally compute a mutable >> reach.ix, my gut instinct says that this property is a must. I think that the reachability index can be mutable and non-local, but still being able to be incrementally updated, though perhaps it would require not only calculations for new commits, but recalculations for some limited subset of commits existing in the commit-graph file. I'm trying to modify exact definition of maximal generation numbers to check if I can find out a version that exhibits nearly-incremental updates property. >> Another thing, perhaps related to "local" below, is if exactly the >> same reach.ix is computed by anybody, given an identical commit >> history graph (perhaps "reproducibility"?). I think most of the >> candidates you listed are reproducible without a fixed tie-breaker, >> but I am not sure about felineY() thing. felineY() is deterministic (or can be made deterministic) for given felineX() (i.e. given [reverse] topological order). >>> * **Local?** Are these values **locally computable**? That is, do we only >>> need to look at the parents of a commit (assuming those parents have >>> computed values) in order to determine the value at that commit? >> >> A subset of non-local reachability ix, for example, the ones that >> need to know what each commit's children are, cannot be immutable, >> as adding new objects to the graph (either with locally committing, >> or transferring objects from other repositories) would affect the >> ix; is this true for all non-local reachability ix, I wonder? > > As a thought experiment, we could define a function size(C) to be the > number of commits reachable from C. Let's call it reach(C), instead ;-) > This is not locally-computable > from the size values at C's parents due to the inclusion-exclusion > principle. We would need to compute it by walking the reachable set > and counting (resulting in quadratic performance overall) but is > immutable. Since the performance cost is so expensive (unlike the > linear costs in the other non-local versions) I didn't include it > in my comparison. Let's define Reach(C) as set of all commits reachable from commit C. Then reach(C) = ||Reach(C)||, where ||S|| is number of elements in the set S. If commit A can reach commit B, then B ∈ Reach(A), but also Reach(B) ⊂ Reach(A), thus reach(B) < reach(A). Therefore if reach(A) <= reach(B) then A cannot reach commit B -- reach(C) can be used as reachability index. However the performance cost of calculating reach(C), i.e. full traversal of the commit graph without ability for incremental update (well, you can do incremental update starting from the commits that have reachability bitmap [2]) is prohibitive. Never mind the fact that this index would have, I think, the same performance problems as (minimum) generaton numbers. [2]: https://githubengineering.com/counting-objects/ SIDENOTE 1: as gen(C) can be thought of as the lower boundary on reach(C), similarly sumgen(C) defined as being 1 for parent-less commits, and being (sum_{P ∈ parents(C)} sumgen(P)) + 1 otherwise being the upper boundary on reach(C). This sumgen(C) can also be used as generation number / reachability index. gen(C) <= reach(C) <= sumgen(C) SIDENOTE 2: The idea of using a proxy for Reach(B) ⊆ Reach(A) as a negative-cut filter (negative-cut reachability index) is used in two types of reachability indices: one is Independent Permutation Labelling (IP+) [3] approach, the other is Bloom Filter Labelling (BFL) [4]. Both algorithms are local, and with given permutation (which can be replaced by object ID, I think) for IP+ and given set of hash functions etc. for BFL they are both immutable. [3] Hao Wei, Jeffrey Xu Yu, Can Lu, Ruoming Jin "Reachability Querying: An Independent Permutation Labeling Approach", Proceedings of the VLDB Endowment, Vol. 7, No. 12 (2014) [4] Jiao Su, Qing Zhu, Hao Wei, Jeffrey Xu Yu "Reachability Querying: Can It Be Even Faster?", IEEE Transactions on Knowledge and Data Engineering (2016) Best, -- Jakub Narębski ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Generation Number v2 2018-10-30 3:59 ` Junio C Hamano 2018-10-31 12:30 ` Derrick Stolee @ 2018-10-31 12:54 ` Ævar Arnfjörð Bjarmason 2018-10-31 13:04 ` Derrick Stolee 1 sibling, 1 reply; 17+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2018-10-31 12:54 UTC (permalink / raw) To: Junio C Hamano Cc: Derrick Stolee, Git List, Jeff King, Jakub Narębski, Derrick Stolee On Tue, Oct 30 2018, Junio C Hamano wrote: > Derrick Stolee <stolee@gmail.com> writes: >> In contrast, maximum generation numbers and corrected commit >> dates both performed quite well. They are frequently the top >> two performing indexes, and rarely significantly different. >> >> The trade-off here now seems to be: which _property_ is more important, >> locally-computable or backwards-compatible? > > Nice summary. > > As I already said, I personally do not think being compatible with > currently deployed clients is important at all (primarily because I > still consider the whole thing experimental), and there is a clear > way forward once we correct the mistake of not having a version > number in the file format that tells the updated clients to ignore > the generation numbers. For longer term viability, we should pick > something that is immutable, reproducible, computable with minimum > input---all of which would lead to being incrementally computable, I > would think. I think it depends on what we mean by backwards compatibility. None of our docs are saying this is experimental right now, just that it's opt-in like so many other git-config(1) options. So if we mean breaking backwards compatibility in that we'll write a new file or clobber the existing one with a version older clients can't use as an optimization, fine. But it would be bad to produce a hard error on older clients, but avoiding that seems as easy as just creating a "commit-graph2" file in .git/objects/info/. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Generation Number v2 2018-10-31 12:54 ` Ævar Arnfjörð Bjarmason @ 2018-10-31 13:04 ` Derrick Stolee 2018-11-02 17:44 ` Jakub Narebski 0 siblings, 1 reply; 17+ messages in thread From: Derrick Stolee @ 2018-10-31 13:04 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason, Junio C Hamano Cc: Git List, Jeff King, Jakub Narębski, Derrick Stolee On 10/31/2018 8:54 AM, Ævar Arnfjörð Bjarmason wrote: > On Tue, Oct 30 2018, Junio C Hamano wrote: > >> Derrick Stolee <stolee@gmail.com> writes: >>> In contrast, maximum generation numbers and corrected commit >>> dates both performed quite well. They are frequently the top >>> two performing indexes, and rarely significantly different. >>> >>> The trade-off here now seems to be: which _property_ is more important, >>> locally-computable or backwards-compatible? >> Nice summary. >> >> As I already said, I personally do not think being compatible with >> currently deployed clients is important at all (primarily because I >> still consider the whole thing experimental), and there is a clear >> way forward once we correct the mistake of not having a version >> number in the file format that tells the updated clients to ignore >> the generation numbers. For longer term viability, we should pick >> something that is immutable, reproducible, computable with minimum >> input---all of which would lead to being incrementally computable, I >> would think. > I think it depends on what we mean by backwards compatibility. None of > our docs are saying this is experimental right now, just that it's > opt-in like so many other git-config(1) options. > > So if we mean breaking backwards compatibility in that we'll write a new > file or clobber the existing one with a version older clients can't use > as an optimization, fine. > > But it would be bad to produce a hard error on older clients, but > avoiding that seems as easy as just creating a "commit-graph2" file in > .git/objects/info/. Well, we have a 1-byte version number following the "CGPH" header in the commit-graph file, and clients will ignore the commit-graph file if that number is not "1". My hope for backwards-compatibility was to avoid incrementing this value and instead use the unused 8th byte. However, it appears that we are destined to increment that version number, anyway. Here is my list for what needs to be in the next version of the commit-graph file format: 1. A four-byte hash version. 2. File incrementality (split commit-graph). 3. Reachability Index versioning Most of these changes will happen in the file header. The chunks themselves don't need to change, but some chunks may be added that only make sense in v2 commit-graphs. Thanks, -Stolee ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Generation Number v2 2018-10-31 13:04 ` Derrick Stolee @ 2018-11-02 17:44 ` Jakub Narebski 0 siblings, 0 replies; 17+ messages in thread From: Jakub Narebski @ 2018-11-02 17:44 UTC (permalink / raw) To: Derrick Stolee Cc: Ævar Arnfjörð Bjarmason, Junio C Hamano, Git List, Jeff King, Derrick Stolee Derrick Stolee <stolee@gmail.com> writes: > On 10/31/2018 8:54 AM, Ævar Arnfjörð Bjarmason wrote: >> On Tue, Oct 30 2018, Junio C Hamano wrote: >>> Derrick Stolee <stolee@gmail.com> writes: >>>> >>>> In contrast, maximum generation numbers and corrected commit >>>> dates both performed quite well. They are frequently the top >>>> two performing indexes, and rarely significantly different. >>>> >>>> The trade-off here now seems to be: which _property_ is more important, >>>> locally-computable or backwards-compatible? >>> >>> Nice summary. >>> >>> As I already said, I personally do not think being compatible with >>> currently deployed clients is important at all (primarily because I >>> still consider the whole thing experimental), and there is a clear >>> way forward once we correct the mistake of not having a version >>> number in the file format that tells the updated clients to ignore >>> the generation numbers. For longer term viability, we should pick >>> something that is immutable, reproducible, computable with minimum >>> input---all of which would lead to being incrementally computable, I >>> would think. >> >> I think it depends on what we mean by backwards compatibility. None of >> our docs are saying this is experimental right now, just that it's >> opt-in like so many other git-config(1) options. >> >> So if we mean breaking backwards compatibility in that we'll write a new >> file or clobber the existing one with a version older clients can't use >> as an optimization, fine. >> >> But it would be bad to produce a hard error on older clients, but >> avoiding that seems as easy as just creating a "commit-graph2" file in >> .git/objects/info/. > > Well, we have a 1-byte version number following the "CGPH" header in > the commit-graph file, and clients will ignore the commit-graph file > if that number is not "1". My hope for backwards-compatibility was > to avoid incrementing this value and instead use the unused 8th byte. How? Some of the considered new generation numbers were backwards compatibile in the sense that for graph traversal the old code for (minimum) generation numbers could be used. But that is for reading. If the old client tried to update new generation number using old generation number algorithm (assuming locality), it would write some chimaera that may or may not work. > However, it appears that we are destined to increment that version > number, anyway. Here is my list for what needs to be in the next > version of the commit-graph file format: As a reminder, here is how the commit-graph format looks now [1] HEADER: 4-byte signature: CGPH 1-byte version number: 1 1-byte Hash Version: 1 = SHA-1 1-byte number (C) of "chunks": 3 or 4 1-byte (reserved for later use) <ignored> CHUNK LOOKUP: (C + 1) * 12 bytes listing the table of contents for the chunks CHUNK DATA: OIDF OIDL CDAT EDGE [optional, depends on graph structure] TRAILER: checksum [1]: Documentation/technical/commit-graph-format.txt > 1. A four-byte hash version. Why do you need a four-byte hash version? 255 possible hash functions is not enough? > 2. File incrementality (split commit-graph). This would probably require a segment lookup part, and one chunk of each type per segment (or even one chunk lookup table per segment). It may or may not need generation number which can support segmenting (here maximal generation numbers have the advantage); but you can assume that if segment(A) > segment(B) then reach.ix(A) > reach.ix(B), i.e. lexicographical-like sort. > 3. Reachability Index versioning I assume that you mean here versioning for the reachability index that is stored in the CDAT section (if it is in a separate chunk, it should be easy to version). The easiest thing to do when encountering unknown or not supported generation number (perhaps the commit-graph file was created in the past by earlier version of Git, perahs it came from server with newer Git, or perhaps the same repository is sometimes accessed using newer and sometimes older Git version), is to set commit->generation_number to GENERATION_NUMBER_ZERO, as you wrote in [2]. We could encode backward-compatibility (for using) somewhat, but I wonder how much it would be of use. [2]: https://public-inbox.org/git/61a829ce-0d29-81c9-880e-7aef1bec916e@gmail.com/ > Most of these changes will happen in the file header. The chunks > themselves don't need to change, but some chunks may be added that > only make sense in v2 commit-graphs. What I would like to see in v2 commit-graph format would be some kind convention for naming / categorizing chunks, so that Git would be able to handle unknown chunks gracefully. In PNG format there are various types of chunks. Quoting Wikipedia: Chunk types [in PNG format] are given a four-letter case sensitive ASCII type/name; compare FourCC. The case of the different letters in the name (bit 5 of the numeric value of the character) is a bit field that provides the decoder with some information on the nature of chunks it does not recognize. The case of the first letter indicates whether the chunk is critical or not. If the first letter is uppercase, the chunk is critical; if not, the chunk is ancillary. Critical chunks contain information that is necessary to read the file. If a decoder encounters a critical chunk it does not recognize, it must abort reading the file or supply the user with an appropriate warning. Currently this is solved with the format version. For given format version some chunks are necessary (critical); if there would be another type of chunk that must be understood, we can simply increase format version. The case of the second letter indicates whether the chunk is "public" (either in the specification or the registry of special-purpose public chunks) or "private" (not standardised). Uppercase is public and lowercase is private. This ensures that public and private chunk names can never conflict with each other (although two private chunk names could conflict). For Git this could be "official" and "experimental"... though I don't see people sharing commit-graph files with experimental chunks that can be used only by some local unpublished fork of Git. The third letter must be uppercase to conform to the PNG specification. It is reserved for future expansion. Decoders should treat a chunk with a lower case third letter the same as any other unrecognised chunk. In short: reserved. This could be the fate of all of those 4 flags that we do not need. The case of the fourth letter indicates whether the chunk is safe to copy by editors that do not recognize it. If lowercase, the chunk may be safely copied regardless of the extent of modifications to the file. If uppercase, it may only be copied if the modifications have not touched any critical chunks. If the data contained in the chunks is immutable, then it could be copied when updating commit-graph file... but what to do with new commits, with what to fill the data for a new commit for a copied unknown chunk? All zeros? All ones? Some value defined in the chunk itself? Best, -- Jakub Narębski ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Generation Number v2 2018-10-29 16:55 [RFC] Generation Number v2 Derrick Stolee ` (2 preceding siblings ...) 2018-10-30 3:59 ` Junio C Hamano @ 2018-11-01 12:27 ` Jakub Narebski 2018-11-01 13:29 ` Derrick Stolee 3 siblings, 1 reply; 17+ messages in thread From: Jakub Narebski @ 2018-11-01 12:27 UTC (permalink / raw) To: Derrick Stolee Cc: Git List, Jeff King, Junio C Hamano, Derrick Stolee, Ævar Arnfjörð Bjarmason [I have noticed that in some places I wrote A..B instead of B..A. Sorry about that] Derrick Stolee <stolee@gmail.com> writes: > We've discussed in several places how to improve upon generation > numbers. This RFC is a report based on my investigation into a > few new options, and how they compare for Git's purposes on > several existing open-source repos. > > You can find this report and the associated test scripts at > https://github.com/derrickstolee/gen-test. Which use prototype test implementation from the `reach-perf` branch in https://github.com/derrickstolee/git. > I have some more work to do in order to make this fully > reproducible (including a single script to clone all of the > repos I used, compute the commit-graphs using the different > index versions, and run the tests). > > TL;DR: I recommend we pursue one of the following two options: > > 1. Maximum Generation Number. > 2. Corrected Commit Date. > > Each of these perform well, but have different implementation > drawbacks. I'd like to hear what everyone thinks about those > drawbacks. I agree with Junio that incremental computation is more important than backwards compatibility (that the old clients can work with the new commit-graph without changes). We would have compatibility breaking anyway with the planned move from SHA-1 to NewHash aka. SHA-256. > Please also let me know about any additional tests that I could > run. Now that I've got a lot of test scripts built up, I can re-run > the test suite pretty quickly. I would add straighforward 1-to-N and N-to-1 reachability tests in the form of `git branch / tag --contains` and `git branch / tag --merged`, and perhaps also ahead-behind calculations used by `git status`, and N-to-M reachability tests used by tag following code in push and fetch. Possibly also A...B walk, if it is not implemented via calculating merge-base. Maybe even --ancestry-path walk, though I am not sure how important best performance for rhis case is (we would want good performance, but perhaps best is not needed for rarely used command). See explanation below. > Generation Number Performance Test > ================================== > > Git uses the commit history to answer many basic questions, such as > computing a merge-base. Some of these algorithms can benefit from a > _reachability index_, which is a condition that allows us to answer > "commit A cannot reach commit B" in some cases. These require pre- > computing some values that we can load and use at run-time. Git > already has a notion of _generation number_, stores it in the commit- > graph file, and uses it in several reachability algorithms. Note that there are other kinds of reachability indices. First, there are reachability indices that can answer the full reachability query (if commit A can reach commit B, or if commit A cannot reach commit B) directly, without walking the commit graph at all: so called label-only approach. For example one could store for each commit the compressed list of all commits reachable from it (transitive closure compression). Those, I think (but I have not checked), would be of not much use for Git, as the size of the index grows stronger than linear with the number of commits, as grows the time to compute such index. So probably of no use to Git, at least not directly (Git uses so called "bitmap index", see e.g. [1], which stores reachability bit-vector as compressed bitmap... but not for all commits, only for a small subset). Second, beside negative-cut reachability indexes, that can answer with certainity that "commit A cannot reach commit B", like the generation numbers (also known as level, or topological level), there also positive-cut indexes (usually if not always based on the spanning tree or trees for the DAG), that can answer when "commit A can reach commit B". I think that those can lead to significant speedups for at least some types of commit traversal and reachability queries that Git needs to answer. But currently no algorithms has a provision for using positive-cut filter index. Anyway, such index would be auxiliary thing, see the next point. Third, we require more than having reachability index in the sense of having some kind of _labelling_, often composite, that can answer either "commit A cannot reach commit B" or "commit A can reach commit B", depending on the type. Git for most operations needs more, namely an actual _ordering_, that is a weak order (or to be more exact a total preorder, i.e. there can be two different commits with the same "generation number" or index, but always either idx(A) ≲ idx(B) or idx(B) ≲ idx(A)) and not only partial ordering (where there can be incomparable elements, i.e neither idx(A) ≼ idx(B) nor idx(B) ≼ idx(A)). This is because it needs to be used in priority queue to decide which commit to travel next; more on that later. This is also why there can be only one such "main" reachability _index_. [1]: https://githubengineering.com/counting-objects/ > You can read more about generation numbers and how to use them in > algorithms in [this blog > post](https://blogs.msdn.microsoft.com/devops/2018/07/09/supercharging-the-git-commit-graph-iii-generations/). > > However, [generation numbers do not always improve our > algorithms](https://public-inbox.org/git/pull.28.git.gitgitgadget@gmail.com/T/#u). > Specifically, some algorithms in Git already use commit date as a > heuristic reachability index. This has some problems, though, since > commit date can be incorrect for several reasons (clock skew between > machines, purposefully setting GIT_COMMIT_DATE to the author date, > etc.). That's what we use the "slop" mechanism for: Git would walk a few commits more than necessary than if there were no clock skew (if commit dates were monotonic), and assume that the skew is not to severe to finish early. But as we are adding more early termination condition, this introduces more places where things can go wrong for Git to return incorrect results. Thus the need for strict, non-heuristic reachability index. > However, the speed boost by using commit date as a cutoff was so > important in these cases, that the potential for incorrect answers was > considered acceptable. Yes, and there were few places where date as cutoff was used; we have since added more place where generation number / reachbility index is used as cutoff. Thus more potential for incorrect answers if using heuristics. > When these algorithms were converted to use generation numbers, we > _added_ the extra constraint that the algorithms are _never incorrect_. Right. > Unfortunately, this led to some cases where performance was worse than > before. There are known cases where `git merge-base A B` or > `git log --topo-order A..B` are worse when using generation numbers > than when using commit dates. In Git there are at least four different types of graph traversal queries, with somewhat different requirements, and affected differently by various reachability indexes. ## Pure reachability queries First there are pure reachability queries, when we are interested only in nswering he question whether commit A can reach commit B or not. We are nt interested in list of commits between A and B; if reachability index, be it negative-cut or positive-cut, can answer the question, it is all we need. If we need to walk (perform online search), then we are interested in just being able to arrive; we need to walk only one path. This query can be performed one-to-one, one-to-many (which of commits from the list can be reached from commit A), many-to-one (which commits from the list can reach commit B), and many-to-many -- the latest for example with recently added get_reachable_subset(). These types of queries are done by "git branch --contains" (and "--no-contains") and "git branch --merged" (and "--no-merged"), the same for "git tag"; if I remember it correctly those were among first operations that were sped up by commit-graph mechanism. Many-to-many queries are done for tag following in "git push" and "git fetch". Note that those queries are missing from gen-test. ## Topological ordering Here reachability index is used to find out in-degree of considered commits (starting from a set of starting points). Here we don't really need to walk all paths, but we need to walk all possible ways of arriving at given commit. Though I guess the [new] incremental algorithm could be modified to only check for in-degree being zero or non-zero. Would it make it slower or faster; I guess in absence of good positive-cut reachability filter it would make it slower. This operation is benchmarked both standalone i.e. `git log --topo-order -N` and in combination with A..B walk in `git log --topo-order -10 A..B`. ## (Po)set difference, or B..A range In this type of query, with at least one "negative" / "negated" commit, we want to find out all commits that are reachable from commit A that are not reachable from commit B. Note that A (commit in positive set) is treated differently with respect to traversal from B (commit in negative set). We need to travel (and possibly list) all paths from A, while when traveling from B the only thing important is reachability, not the actual path or paths. We want to travel as few commits as possible when walking from B, and here one generation numbers are better than the other in the presence of multiple paths. The situation in nontrivial case could look as follows: ---.---.---o---*---*---*----*---*-----*---A \ \---x-----x-----x-----B In such case, the algorithm paints down from positive set i.e. commit A, and from negative set i.e. commit B, and walk down from A till we get to commits painted down from B. We want to walk those commits somewhat in sync, using priority-queue based variant of breadth-first search (well, kind of), so that we don't walk more commits from A than necessary, only to notice later that these are actually reachable from B, and neither we want to walk more commits from B than necessary, going past the boundary commit. Sidenote: I think that positive-cut auxiliary reachability index (reachability filter) can help speed up this operation, making it possible to mark commit as reachable from the negative set (from commit B) and stop early without actually having to walk from B... though the more in sync walks from A and B would be, the less it would probably help. This is tested using `git log --topo-order -10 A..B`. ## Merge base, and A...B walk In this case we either find one or all commits that are reachable from both A and B, or (which is related query) find all commits reachable from A that are not reachable from B and vice versa, that is all commits reachable from B that are not reachable from A. In this case we walk from both A and B the same way: the operation is symmetric. In the merge base calculations we need to walk only one path if there are many equivaent ones; in A...B walk we have to actually walk them. Thus merge base calculation would exhibit, we can concurr, the same behaviour with respect to reachability indexes as A..B case. This is tested using `git merge-base A B`; finding all merge bases, and A...B walk are not tested yet. Note that IIRC it was merge-base calculation where the problem of performance regression when using generation number (topological level) as an ordering was first encountered. The performance regression happen for situation like this (the graph is taken from original commit fixing the issue, just rephrased). 0 1 2 3 4 5 6 7 8 9 10 11 4 δgen(c) .---M---*---*---*---*---*---*---*---*--*----A B |\ 2 / / δgen(c) | \---------------------------------x-/ / \ 2 3 / δgen(c) \---------------------------------#---------#--/ ------------------------------------------------------------> time Here optimal solution would be walk the branch / path with commits marked with 'x' from commit A, while walking the branch / path with commits marked '#' from commit B, and after walking 4 commits notice that the commit M is reachable from both A and B -- thus we have found [one of] the merge base. Using (minimum) generation numbers we would walk the path marked with '*' first, unnecessarily. (IIUC that for a commit we mark also its parents, isn't it?). Using corrected commit date we would walk at most only a few commits on the *-marked branch. 0 1 2 3 4 5 6 7 8 9 10 11 11 δmaxgen(c) .---M---*---*---*---*---*---*---*---*--*----A B |\ 10 / / δmaxgen(c) | \---------------------------------x-/ / \ 9 10 / δmaxgen(c) \---------------------------------#---------#--/ ------------------------------------------------------------> time Using maximum generation numbers we would also walk at most few *-marked commits. Sidenote: on the figure below you can easily see easy correspondence between maxgen(c) and revgen(c) indices, see below. 12 11 10 9 8 7 6 5 4 3 2 1 1 revgen(c) .---M---*---*---*---*---*---*---*---*--*----A B |\ 2 / / revgen(c) | \---------------------------------x-/ / \ 3 2 / revgen(c) \---------------------------------#---------#--/ ## Strict ancestry, or ancestry path When given a range of commits to display (e.g. A..B), only display commits that exist directly on the ancestry chain between B and A, i.e. commits that are both descendants of B, and ancestors of A. In other words, find all paths from B to A (if they exist). In this case we want to walk all the paths, heavily pruning. This should be less dependent on reachability index quality as an index (as an ordering), and more on pruning ability of the filter. This case is not tested in gen-test, but I wonder how often this feature is actualy used (if it would be worth adding it to the benchmark, and if so, with what weight behind it). > This report investigates four replacements for generation numbers, and > compares the number of walked commits to the existing algorithms (both > using generation numbers and not using them at all). We can use this > data to make decisions for the future of the feature. > > ### Implementation > > The reachability indexes below are implemented in > [the `reach-perf` branch in > https://github.com/derrickstolee/git](https://github.com/derrickstolee/git/tree/reach-perf). > This implementation is in a very rough condition, as it is intended to > be a prototype, not production-quality. > > Using this implementation, you can compute commit-graph files for the > specified reachability index using `git commit-graph write --reachable > --version=<V>`. > The `git` client will read the version from the file, so our tests > store each version as `.git/objects/info/commit-graph.<V>` and copy > the necessary file to `.git/objects/info/commit-graph` before testing. > > The algorithms count the number of commits walked, as to avoid the > noise that would occur with time-based performance reporting. We use > the (in progress) trace2 library for this. To find the values reported, > use the `GIT_TR2_PERFORMANCE` environment variable. Where we can read more about this trace2 library? Thanks in advance. > > To ignore reachability indexes entirely and use the old algorithms > (reported here as "OLD" values) use the environment variable > `GIT_TEST_OLD_PAINT=1`. > > > Reachability Index Versions > --------------------------- > > **V0: (Minimum) Generation Number.** > The _generation number_ of a commit is exactly one more than the maximum > generation number of a parent (by convention, the generation number of a > commit with no parents is 1). This is the definition currently used by > Git (2.19.0 and later). Given two commits A and B, we can then use the > following reachability index: > > If gen(A) < gen(B), then A cannot reach B. > > _Commentary:_ One issue with generation numbers is that some algorithms > in Git use commit date as a heuristic, and allow incorrect answers if > there is enough clock skew. Using that heuristic, the algorithms can walk > fewer commits than the algorithms using generation number. The other > reachability indexes introduced below attempt to fix this problem. This is the existing solution. > **V1: (Epoch, Date) Pairs.** > For each commit, store a pair of values called the _epoch_ and the _date_. > The date is the normal commit date. The _epoch_ of a commit is the minimum > X such that X is at least the maximum epoch of each parent, and at least > one more than the epoch of any parent whose date is larger than the date > of this commit (i.e. there is clock skew between this commit and this > parent). In this way, we can use the following reachability index: > > If epoch(A) < epoch(B), then A cannot reach B. > If epoch(A) == epoch(B) and date(A) < date(B), then A cannot reach B. I wonder what makes it perform worse than corrected date aka V3. > **V2: Maximum Generation Numbers.** > The _maximum generation number_ of a commit (denoted by maxgen(A)) is > defined using its _children_ and the total number of commits in the repo. > If A is a commit with no children (that is, there is no commit B with > A as a parent) then maxgen(A) is equal to the number of commits in the repo. > For all other commits A, maxgen(A) is one less than the minimum maxgen(B) > among children B. This can be computed by initializing maxgen(C) to the > number of commits, then walking the commit graph in topological order, > assigning maxgen(P) = min { maxgen(P), maxgen(C) - 1 } for each parent P > of the currently-walking commit C. We then have the same reachability > index as minimum generation number: > > If maxgen(A) < maxgen(B), then A cannot reach B. If I understand it correctly, this is the same as reverse generation numbers, or reverse topological levels; in other words genertion numbers on reversed graph -- only transformed: maxgen(A) == number_of_commits - (revgen(A) - 1) We define revgen(A) in the following way: - for head tips (for source nodes), i.e. commits with in-degree of 0, with no children, have revgen(A) = 1 (revgen(A) = 0 is left for commits outside commit-graph, which translates to INFINITY for maxgen(A)). - otherwise, it is 1 more than maximum revgen(C) of its children They are equivalent, but maxgen(A) is "backward compatibile", that is the rechability condition is exactly the same as for ordinary generation numbers: If gen(A) < gen(B), then A cannot reach B. If maxgen(A) < maxgen(B), then A cannot reach B. But If revgen(A) > revgen(B), then A cannot reach B. > > _Commentary:_ The known examples where (minimum) generation numbers perform > worse than commit date heuristics involve commits with recent commit dates > but whose parents have very low generation number compared to most recent > commits. In a way, minimum generation numbers increase the odds that we > can say "A cannot reach B" when A is fixed and B is selected at random. > Maximum generation numbers increase the odds that we can say "A cannot > reach B" when B is fixed and A is selected at random. This helps us when > we are exploring during a reachability algorithm and have explored few > commits and want to say that the large set of unexplored commits cannot > reach any of the explored commits. I guess that the situation where shortcut path have recent date, while having less commits in path (and thus lover ordinary generation number) is more common in real commit graphs because one might want to base a commit (for example a bugfix) on an old commit, but commits usually get merged quickly, and not left for long time. If they re left for long time, they usually needs correction (via rebase) before being merged, and again we have commit with date close to the date of merge, leading to a shortcut. Does this reasoning looks correct to you? > **V3: Corrected Commit Date.** > For a commit C, let its _corrected commit date_ (denoted by cdate(C)) > be the maximum of the commit date of C and the [corrected] commit dates of its > parents. Wouldn't it be better to use the maximum of the commit date of C, and one second more than the corrected commit dates of its parents? This way corrected dates would be strictly monotonic along the chain. Sidenote: in theory, sufficiently bad clock skew (like e.g. wrong year in the future) could screw large swatches of the commit graph. In practice, in real-life repositories, this doesn't happen, I guess. > If cdate(A) < cdate(B), then A cannot reach B. > > **V4: FELINE Index.** > The FELINE index is a two-dimentional reachability index as defined in > [Reachability Queries in Very Large Graphs: A Fast Refined Online > Search Approach](https://openproceedings.org/EDBT/2014/paper_166.pdf) > by Veloso, Cerf, Jeira, and Zaki. The index is not deterministically > defined, but instead is defined in the following way: > > 1. Compute a reverse topological order of the commits. Let felineX(C) > be the position in which C appears in this list. Since this is > a reverse topological order, felineX(C) > felineX(P) for each parent > P of C. Well, it depends on the definition of the topological order if it is tolopogical order or reverse topological order. Anyway, the FELINE paper includes discussion about FELINE-I, i.e. inversed feline index. From FELINE and FELINE-I, one of those is what was used here. > 2. Compute a reverse topological order of the commits using Kahn's > algorithm, but when selecting among a list of commits with in-degree > zero, prioritize the commit by minimizing felineX. Let felineY(C) > be the position in which C appears in this list. Note that the second step, as given in the paper, is actually deterministic (assuming that we follow the recommendations; well, up to a point). It needs some topological order for felineX, though. > Essentially, the felineY order is selected with the goal of swapping > positions of topologically-independent commits relative to the felinX > ordering. The resulting reachability index is as follows: > > If felineX(A) < felineX(B), then A cannot reach B. > If felineY(A) < felineY(B), then A cannot reach B. > > _Commentary:_ In terms of comparing two commits directly, this index > is quite strong. However, when we are performing our reachability > algorithms that care about reachable sets (git log --graph), or > boundaries between reachable sets (git merge-base, git log --graph A..B) > we need to track a single pair (minX,minY) for comparion. In order to not > miss anything during our search, we need to update this pair to be > the minimum felineX(A) and minimum felineY(B) among all explored commits > A and B. That is, the pair (minX, minY) is below our entire explored set. > This can be a disadvantage for these algorithms. Actually the problem is that FELINE index is partial order, not an ordering; not something that we can use in a priority queue. But the single [reverse] topological ordering is a reachability index in itself, and it is an ordering. Thus we can use felineX as an index in priority queue (or we can use felineY). One thing to notice is that FELINE index performs well on real world large graphs which have quite different structure from commit graphs; among others they have either very large number of sources (head tips) or sinks (root nodes), or both; those numbers grow with the number of nodes. For commit graphs they don't perform as well; well, at least ccording to my explorations on that matter in the Google Colaboratory notebook [2]. [2]: https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg > > ### Comparing Reachability Index Versions Viability > > Before considering how well these indexes perform during our algorithm > runs, let's take a moment to consider the implementation details and > how they relate to the existing commit-graph file format and existing > Git clients. > > * **Compatible?** In our test implementation, we use a previously unused > byte of data in the commit-graph format to indicate which reachability > index version we are using. Existing clients ignore this value, so we > will want to consider if these new indexes are _backwards compatible_. > That is, will they still report correct values if they ignore this byte > and use the generation number column from the commit-graph file assuming > the values are minimum generation numbers? In other words "backward compatibility" for a reachability index means that the reachability condition is exactly the same as it was: If index(A) < index(B), then A cannot reach B. and also values of the reachability index are written in place of generation numbers in the commit-graph. > > * **Immutable?** Git objects are _immutable_. If you change an object you > actually create a new object with a new object ID. Are the values we > store for these reachability indexes also immutable? > > * **Local?** Are these values **locally computable**? That is, do we only > need to look at the parents of a commit (assuming those parents have > computed values) in order to determine the value at that commit? Those two features imply that the number of commits that we need to walk to be able to update reachability index after appending new commits is O(new commits), which means O(changes). I would allow for simple transformations of other values, e.g. adding a constant; we need to rewrite commit-graph file anyway. I wonder if there are reachability indexes that are either immutable but not local, or not immutable but local. Doesn't being local imply immutability, with some common-sense assumptions (like no random choice for parent-less commits)? > | Index | Compatible? | Immutable? | Local? | > |---------------------------|-------------|------------|--------| > | Minimum Generation Number | Yes | Yes | Yes | > | (Epoch, Date) pairs | Yes | Yes | Yes | > | Maximum Generation Number | Yes | No | No | > | Corrected Commit Date | No | Yes | Yes | > | FELINE index | Yes | No | No | > > _Note:_ The corrected commit date uses the generation number column > to store an offset of "how much do I need to add to my commit date > to get my corrected commit date?" The values stored in that column > are then not backwards-compatible. Additional note: this offset / delta based approach to storing corrected commit data is a form of lightweight compression, allowing it to fit safely in 30 bits (while the commit date uses 32+2 = 34 bits itself). Assuming that we want to keep this feature from the original commit-graph file format, that is. > _Note:_ The FELINE index requires storing two values instead of just > one. One of these values could be stored in the generation number > column and the other in an optional chunk, hence it could be backwards > compatible. (This is not how it is implemented in our example > implementation.) So the table above refers to this hypothetical implementation when stating that FELINE is backwards compatibile, but current test implementation is not, isn't it? > > Data > ---- > > We focused on three types of performance tests that test the indexes > in different ways. Each test lists the `git` command that is used, > and the table lists which repository is used and which inputs. > > ### Test 1: `git log --topo-order -N` > > This test focuses on the number of commits that are parsed during > a `git log --topo-order` before writing `N` commits to output. > > You can reproduce this test using `topo-order-tests.sh` and > see all the data in `topo-order-summary.txt`. The values reported > here are a sampling of the data, ignoring tests where all values > were the same or extremely close in value. > > > | Repo | N | V0 | V1 | V2 | V3 | V4 | > |--------------|-------|--------|--------|--------|--------|-------| > | android-base | 100 | 5,487 | 8,534 | 6,937 | 6,419 | 6,453 | > | android-base | 1000 | 36,029 | 44,030 | 41,493 | 41,206 |45,431 | > | chromium | 100 | 101 |424,406 | 101 | 101 | 101 | > | gerrit | 100 | 8,212 | 8,533 | 164 | 159 | 162 | > | gerrit | 1000 | 8,512 | 8,533 | 1,990 | 1,973 | 3,766 | > | Linux | 100 | 12,458 | 12,444 | 13,683 | 13,123 |13,124 | > | Linux | 1000 | 24,436 | 26,247 | 27,878 | 26,430 |27,875 | > | Linux | 10000 | 30,364 | 28,891 | 27,878 | 26,430 |27,875 | > | electron | 1000 | 19,927 | 18,733 | 1,072 | 18,214 |18,214 | > | Ffmpeg | 10000 | 32,154 | 47,429 | 10,435 | 11,054 |11,054 | > | jgit | 1000 | 1,550 | 6,264 | 1,067 | 1,060 | 1,233 | > | julia | 10000 | 43,043 | 43,043 | 10,201 | 23,505 |23,828 | > | odoo | 1000 | 17,175 | 9,714 | 4,043 | 4,046 | 4,111 | > | php-src | 1000 | 19,014 | 27,530 | 1,311 | 1,305 | 1,320 | > | rails | 100 | 1,420 | 2,041 | 1,757 | 1,428 | 1,441 | > | rails | 1000 | 7,952 | 10,145 | 10,053 | 8,373 | 8,373 | > | swift | 1000 | 1,914 | 4,004 | 2,071 | 1,939 | 1,940 | > | tensorflow | 1000 | 10,019 | 39,221 | 6,711 | 10,051 |10,051 | > | TypeScript | 1000 | 2,873 | 12,014 | 3,049 | 2,876 | 2,876 | Do I understand it correctly that the range of N for a given project is limited by the depth of the project history (hence maximum N of 10000 for Linux kernel, but only 100 for chromium)? I wonder what the OLD numbers are for these operations. > > ### Test 2: `git log --topo-order -10 A..B` > > This test focuses on the number of commits that are parsed during > a `git log --topo-order A..B` before writing ten commits to output. > Since we fix a very small set of output commits, we care more about > the part of the walk that determines which commits are reachable > from `B` but not reachable from `A`. This part of the walk uses > commit date as a heuristic in the existing implementation. [...] > ### Test 3: `git merge-base A B` > > This test focuses on the number of commits that are parsed during > a `git merge-base A B`. This part of the walk uses commit date as > a heuristic in the existing implementation. [...] > Conclusions > ----------- > > Based on the performance results alone, we should remove minimum > generation numbers, (epoch, date) pairs, and FELINE index from > consideration. There are enough examples of these indexes performing > poorly. > > In contrast, maximum generation numbers and corrected commit > dates both performed quite well. They are frequently the top > two performing indexes, and rarely significantly different. > > The trade-off here now seems to be: which _property_ is more important, > locally-computable or backwards-compatible? > > * Maximum generation number is backwards-compatible but not > locally-computable or immutable. > > * Corrected commit-date is locally-computable and immutable, > but not backwards-compatible. > > _Editor's Note:_ Every time I think about this trade-off, I can't > come to a hard conclusion about which is better. Instead, I'll > leave that up for discussion. In my opinion being able to update commit-graph data fast is more important than being backward-compatibile, i.e. beig able to use commit-data file generated by new clients (and servers) by old clients. Thank you for all your work, -- Jakub Narębski ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Generation Number v2 2018-11-01 12:27 ` Jakub Narebski @ 2018-11-01 13:29 ` Derrick Stolee 2018-11-03 12:33 ` Jakub Narebski 0 siblings, 1 reply; 17+ messages in thread From: Derrick Stolee @ 2018-11-01 13:29 UTC (permalink / raw) To: Jakub Narebski Cc: Git List, Jeff King, Junio C Hamano, Derrick Stolee, Ævar Arnfjörð Bjarmason On 11/1/2018 8:27 AM, Jakub Narebski wrote: > [I have noticed that in some places I wrote A..B instead of B..A. Sorry > about that] > > Derrick Stolee <stolee@gmail.com> writes: > >> Please also let me know about any additional tests that I could >> run. Now that I've got a lot of test scripts built up, I can re-run >> the test suite pretty quickly. > I would add straighforward 1-to-N and N-to-1 reachability tests in the > form of `git branch / tag --contains` and `git branch / tag --merged`, > and perhaps also ahead-behind calculations used by `git status`, and > N-to-M reachability tests used by tag following code in push and fetch. > > Possibly also A...B walk, if it is not implemented via calculating > merge-base. I believe this uses paint_down_to_common(), but looks at the PARENT1 and PARENT2 flags afterward to determine the left/right/both relationships. > Maybe even --ancestry-path walk, though I am not sure how important best > performance for rhis case is (we would want good performance, but > perhaps best is not needed for rarely used command). Currently, the implementation of --ancestry-path does not use a reachability index. > > See explanation below. > > >> Generation Number Performance Test >> ================================== >> >> Git uses the commit history to answer many basic questions, such as >> computing a merge-base. Some of these algorithms can benefit from a >> _reachability index_, which is a condition that allows us to answer >> "commit A cannot reach commit B" in some cases. These require pre- >> computing some values that we can load and use at run-time. Git >> already has a notion of _generation number_, stores it in the commit- >> graph file, and uses it in several reachability algorithms. > Note that there are other kinds of reachability indices. > > First, there are reachability indices that can answer the full > reachability query (if commit A can reach commit B, or if commit A > cannot reach commit B) directly, without walking the commit graph at > all: so called label-only approach. For example one could store for > each commit the compressed list of all commits reachable from it > (transitive closure compression). > > Those, I think (but I have not checked), would be of not much use for > Git, as the size of the index grows stronger than linear with the number > of commits, as grows the time to compute such index. So probably of no > use to Git, at least not directly (Git uses so called "bitmap index", > see e.g. [1], which stores reachability bit-vector as compressed > bitmap... but not for all commits, only for a small subset). > > > Second, beside negative-cut reachability indexes, that can answer with > certainity that "commit A cannot reach commit B", like the generation > numbers (also known as level, or topological level), there also > positive-cut indexes (usually if not always based on the spanning tree > or trees for the DAG), that can answer when "commit A can reach commit > B". > > I think that those can lead to significant speedups for at least some > types of commit traversal and reachability queries that Git needs to > answer. But currently no algorithms has a provision for using > positive-cut filter index. Anyway, such index would be auxiliary thing, > see the next point. > > > Third, we require more than having reachability index in the sense of > having some kind of _labelling_, often composite, that can answer either > "commit A cannot reach commit B" or "commit A can reach commit B", > depending on the type. Git for most operations needs more, namely an > actual _ordering_, that is a weak order (or to be more exact a total > preorder, i.e. there can be two different commits with the same > "generation number" or index, but always either idx(A) ≲ idx(B) or > idx(B) ≲ idx(A)) and not only partial ordering (where there can be > incomparable elements, i.e neither idx(A) ≼ idx(B) nor idx(B) ≼ idx(A)). > This is because it needs to be used in priority queue to decide which > commit to travel next; more on that later. This is also why there can > be only one such "main" reachability _index_. > > [1]: https://githubengineering.com/counting-objects/ Thanks for the details. At the moment, I'm only interested in improving our negative-cut reachability index, as we have algorithms that can take advantage of them (and can compare their performance directly). > >> You can read more about generation numbers and how to use them in >> algorithms in [this blog >> post](https://blogs.msdn.microsoft.com/devops/2018/07/09/supercharging-the-git-commit-graph-iii-generations/). >> >> However, [generation numbers do not always improve our >> algorithms](https://public-inbox.org/git/pull.28.git.gitgitgadget@gmail.com/T/#u). >> Specifically, some algorithms in Git already use commit date as a >> heuristic reachability index. This has some problems, though, since >> commit date can be incorrect for several reasons (clock skew between >> machines, purposefully setting GIT_COMMIT_DATE to the author date, >> etc.). > That's what we use the "slop" mechanism for: Git would walk a few > commits more than necessary than if there were no clock skew (if commit > dates were monotonic), and assume that the skew is not to severe to > finish early. > > But as we are adding more early termination condition, this introduces > more places where things can go wrong for Git to return incorrect > results. Thus the need for strict, non-heuristic reachability index. > >> However, the speed boost by using commit date as a cutoff was so >> important in these cases, that the potential for incorrect answers was >> considered acceptable. > Yes, and there were few places where date as cutoff was used; we have > since added more place where generation number / reachbility index is > used as cutoff. Thus more potential for incorrect answers if using > heuristics. > >> When these algorithms were converted to use generation numbers, we >> _added_ the extra constraint that the algorithms are _never incorrect_. > Right. > >> Unfortunately, this led to some cases where performance was worse than >> before. There are known cases where `git merge-base A B` or >> `git log --topo-order A..B` are worse when using generation numbers >> than when using commit dates. > In Git there are at least four different types of graph traversal > queries, with somewhat different requirements, and affected differently > by various reachability indexes. > > > ## Pure reachability queries > > First there are pure reachability queries, when we are interested only > in nswering he question whether commit A can reach commit B or not. We > are nt interested in list of commits between A and B; if reachability > index, be it negative-cut or positive-cut, can answer the question, it > is all we need. If we need to walk (perform online search), then we are > interested in just being able to arrive; we need to walk only one path. > > This query can be performed one-to-one, one-to-many (which of commits > from the list can be reached from commit A), many-to-one (which commits > from the list can reach commit B), and many-to-many -- the latest for > example with recently added get_reachable_subset(). > > These types of queries are done by "git branch --contains" (and > "--no-contains") and "git branch --merged" (and "--no-merged"), the same > for "git tag"; if I remember it correctly those were among first > operations that were sped up by commit-graph mechanism. Many-to-many > queries are done for tag following in "git push" and "git fetch". > > Note that those queries are missing from gen-test. These are definitely a good idea to pursue further testing. The only thing I canthink of right now is that the tests can be hard to set up, but perhaps`git branch --remote --contains` or `git tag --contains` would be interesting. > ## Topological ordering > > Here reachability index is used to find out in-degree of considered > commits (starting from a set of starting points). Here we don't really > need to walk all paths, but we need to walk all possible ways of > arriving at given commit. You make a good point here about the difference between "all paths to C" and"all children of C", since the number of paths can grow exponentially (and frequentlydoes in Git) but the number of children is linear (and small in practice). > Though I guess the [new] incremental algorithm could be modified to only > check for in-degree being zero or non-zero. Would it make it slower or > faster; I guess in absence of good positive-cut reachability filter it > would make it slower. I don't understand how a positive-cut filter helps in this case. The point of the in-degreecalculation is to say "we already output all commits that can reach this commit" beforewe output another commit. How could a positive-cut filter help here? > This operation is benchmarked both standalone i.e. `git log --topo-order -N` > and in combination with A..B walk in `git log --topo-order -10 A..B`. > > > ## (Po)set difference, or B..A range > > In this type of query, with at least one "negative" / "negated" commit, > we want to find out all commits that are reachable from commit A that > are not reachable from commit B. > > Note that A (commit in positive set) is treated differently with respect > to traversal from B (commit in negative set). We need to travel (and > possibly list) all paths from A, while when traveling from B the only > thing important is reachability, not the actual path or paths. We want > to travel as few commits as possible when walking from B, and here one > generation numbers are better than the other in the presence of multiple > paths. > > The situation in nontrivial case could look as follows: > > ---.---.---o---*---*---*----*---*-----*---A > \ > \---x-----x-----x-----B > > In such case, the algorithm paints down from positive set i.e. commit A, > and from negative set i.e. commit B, and walk down from A till we get to > commits painted down from B. > > We want to walk those commits somewhat in sync, using priority-queue > based variant of breadth-first search (well, kind of), so that we don't > walk more commits from A than necessary, only to notice later that these > are actually reachable from B, and neither we want to walk more commits > from B than necessary, going past the boundary commit. > > Sidenote: I think that positive-cut auxiliary reachability index > (reachability filter) can help speed up this operation, making it > possible to mark commit as reachable from the negative set (from commit > B) and stop early without actually having to walk from B... though the > more in sync walks from A and B would be, the less it would probably > help. > > This is tested using `git log --topo-order -10 A..B`. > > > ## Merge base, and A...B walk > > In this case we either find one or all commits that are reachable from > both A and B, or (which is related query) find all commits reachable > from A that are not reachable from B and vice versa, that is all commits > reachable from B that are not reachable from A. > > In this case we walk from both A and B the same way: the operation is > symmetric. In the merge base calculations we need to walk only one path > if there are many equivaent ones; in A...B walk we have to actually walk > them. Thus merge base calculation would exhibit, we can concurr, the > same behaviour with respect to reachability indexes as A..B case. > > This is tested using `git merge-base A B`; finding all merge bases, and > A...B walk are not tested yet. > > Note that IIRC it was merge-base calculation where the problem of > performance regression when using generation number (topological level) > as an ordering was first encountered. The performance regression happen > for situation like this (the graph is taken from original commit fixing > the issue, just rephrased). > > 0 1 2 3 4 5 6 7 8 9 10 11 4 δgen(c) > .---M---*---*---*---*---*---*---*---*--*----A B > |\ 2 / / δgen(c) > | \---------------------------------x-/ / > \ 2 3 / δgen(c) > \---------------------------------#---------#--/ > > > ------------------------------------------------------------> > time > > Here optimal solution would be walk the branch / path with commits > marked with 'x' from commit A, while walking the branch / path with > commits marked '#' from commit B, and after walking 4 commits notice > that the commit M is reachable from both A and B -- thus we have found > [one of] the merge base. > > Using (minimum) generation numbers we would walk the path marked with > '*' first, unnecessarily. (IIUC that for a commit we mark also its > parents, isn't it?). > > Using corrected commit date we would walk at most only a few commits on > the *-marked branch. > > 0 1 2 3 4 5 6 7 8 9 10 11 11 δmaxgen(c) > .---M---*---*---*---*---*---*---*---*--*----A B > |\ 10 / / δmaxgen(c) > | \---------------------------------x-/ / > \ 9 10 / δmaxgen(c) > \---------------------------------#---------#--/ > > > ------------------------------------------------------------> > time > > Using maximum generation numbers we would also walk at most few *-marked > commits. > > Sidenote: on the figure below you can easily see easy correspondence > between maxgen(c) and revgen(c) indices, see below. > > 12 11 10 9 8 7 6 5 4 3 2 1 1 revgen(c) > .---M---*---*---*---*---*---*---*---*--*----A B > |\ 2 / / revgen(c) > | \---------------------------------x-/ / > \ 3 2 / revgen(c) > \---------------------------------#---------#--/ > > > ## Strict ancestry, or ancestry path > > When given a range of commits to display (e.g. A..B), only display > commits that exist directly on the ancestry chain between B and A, > i.e. commits that are both descendants of B, and ancestors of A. In > other words, find all paths from B to A (if they exist). > > In this case we want to walk all the paths, heavily pruning. This > should be less dependent on reachability index quality as an index (as > an ordering), and more on pruning ability of the filter. > > This case is not tested in gen-test, but I wonder how often this feature > is actualy used (if it would be worth adding it to the benchmark, and if > so, with what weight behind it). As mentioned earlier, the --ancestry-path algorithm does not currently use a negative-cut filter, so we would not gather any data on this at the moment. However, I imagine the algorithm is similar to the typical A..B case, as we needto discover the (po)set difference, construct the reversed digraph on thosecommits, and walk backwards from A. Assuming the difference is much smallerthan the entire commit graph, then the main cost is the walk that discovers thedifference, and hence is covered somewhat by the `git log --topo-order A..B` tests. >> This report investigates four replacements for generation numbers, and >> compares the number of walked commits to the existing algorithms (both >> using generation numbers and not using them at all). We can use this >> data to make decisions for the future of the feature. >> >> ### Implementation >> >> The reachability indexes below are implemented in >> [the `reach-perf` branch in >> https://github.com/derrickstolee/git](https://github.com/derrickstolee/git/tree/reach-perf). >> This implementation is in a very rough condition, as it is intended to >> be a prototype, not production-quality. >> >> Using this implementation, you can compute commit-graph files for the >> specified reachability index using `git commit-graph write --reachable >> --version=<V>`. >> The `git` client will read the version from the file, so our tests >> store each version as `.git/objects/info/commit-graph.<V>` and copy >> the necessary file to `.git/objects/info/commit-graph` before testing. >> >> The algorithms count the number of commits walked, as to avoid the >> noise that would occur with time-based performance reporting. We use >> the (in progress) trace2 library for this. To find the values reported, >> use the `GIT_TR2_PERFORMANCE` environment variable. > Where we can read more about this trace2 library? Thanks in advance. https://public-inbox.org/git/pull.29.git.gitgitgadget@gmail.com/ [PATCH 0/8] WIP: trace2: a new trace facility >> To ignore reachability indexes entirely and use the old algorithms >> (reported here as "OLD" values) use the environment variable >> `GIT_TEST_OLD_PAINT=1`. >> >> >> Reachability Index Versions >> --------------------------- >> >> **V0: (Minimum) Generation Number.** >> The _generation number_ of a commit is exactly one more than the maximum >> generation number of a parent (by convention, the generation number of a >> commit with no parents is 1). This is the definition currently used by >> Git (2.19.0 and later). Given two commits A and B, we can then use the >> following reachability index: >> >> If gen(A) < gen(B), then A cannot reach B. >> >> _Commentary:_ One issue with generation numbers is that some algorithms >> in Git use commit date as a heuristic, and allow incorrect answers if >> there is enough clock skew. Using that heuristic, the algorithms can walk >> fewer commits than the algorithms using generation number. The other >> reachability indexes introduced below attempt to fix this problem. > This is the existing solution. > >> **V1: (Epoch, Date) Pairs.** >> For each commit, store a pair of values called the _epoch_ and the _date_. >> The date is the normal commit date. The _epoch_ of a commit is the minimum >> X such that X is at least the maximum epoch of each parent, and at least >> one more than the epoch of any parent whose date is larger than the date >> of this commit (i.e. there is clock skew between this commit and this >> parent). In this way, we can use the following reachability index: >> >> If epoch(A) < epoch(B), then A cannot reach B. >> If epoch(A) == epoch(B) and date(A) < date(B), then A cannot reach B. > I wonder what makes it perform worse than corrected date aka V3. In the example `git merge-base v4.8 v4.9` in the Linux repo, the topology includestwo commits (say, C1 and C2) of low generation number but high commit-date. Thesecommits also have low epoch. However, there are 607 commits in the repo whosecommit date is smaller than a parent's commit date. This means that the epoch in themain trunk of the repo can be as high as 607, while the epoch for C1 and C2 is likely in the single digits. This means that we need to walk all commits with epoch greaterthan min { epoch(C1), epoch(C2) } before we explore C1 and C2 and terminate the walk. In V3, the commits C1 and C2 have high corrected commit date, higher than any of thecommits that require positive offset to overcome the clock skew with their parents. Thisallows the walk to be very similar to the old algorithm, as seen in this `git merge-base A B`test summary: Linux c8d2bc9bc39e 69973b830859 OLD: 167468 V0: 635579 V1: 630138 V2: 33716 V3: 167496 V4: 153774 > >> **V2: Maximum Generation Numbers.** >> The _maximum generation number_ of a commit (denoted by maxgen(A)) is >> defined using its _children_ and the total number of commits in the repo. >> If A is a commit with no children (that is, there is no commit B with >> A as a parent) then maxgen(A) is equal to the number of commits in the repo. >> For all other commits A, maxgen(A) is one less than the minimum maxgen(B) >> among children B. This can be computed by initializing maxgen(C) to the >> number of commits, then walking the commit graph in topological order, >> assigning maxgen(P) = min { maxgen(P), maxgen(C) - 1 } for each parent P >> of the currently-walking commit C. We then have the same reachability >> index as minimum generation number: >> >> If maxgen(A) < maxgen(B), then A cannot reach B. > If I understand it correctly, this is the same as reverse generation > numbers, or reverse topological levels; in other words genertion numbers > on reversed graph -- only transformed: > > maxgen(A) == number_of_commits - (revgen(A) - 1) > > We define revgen(A) in the following way: > - for head tips (for source nodes), i.e. commits with in-degree of 0, > with no children, have revgen(A) = 1 (revgen(A) = 0 is left for > commits outside commit-graph, which translates to INFINITY for > maxgen(A)). > - otherwise, it is 1 more than maximum revgen(C) of its children > > They are equivalent, but maxgen(A) is "backward compatibile", that is > the rechability condition is exactly the same as for ordinary generation > numbers: > > If gen(A) < gen(B), then A cannot reach B. > If maxgen(A) < maxgen(B), then A cannot reach B. > > But > > If revgen(A) > revgen(B), then A cannot reach B. Yes, these are the same idea, and the reason to phrase it as I did is to keep theinequality in the same direction. > >> _Commentary:_ The known examples where (minimum) generation numbers perform >> worse than commit date heuristics involve commits with recent commit dates >> but whose parents have very low generation number compared to most recent >> commits. In a way, minimum generation numbers increase the odds that we >> can say "A cannot reach B" when A is fixed and B is selected at random. >> Maximum generation numbers increase the odds that we can say "A cannot >> reach B" when B is fixed and A is selected at random. This helps us when >> we are exploring during a reachability algorithm and have explored few >> commits and want to say that the large set of unexplored commits cannot >> reach any of the explored commits. > I guess that the situation where shortcut path have recent date, while > having less commits in path (and thus lover ordinary generation number) > is more common in real commit graphs because one might want to base a > commit (for example a bugfix) on an old commit, but commits usually get > merged quickly, and not left for long time. If they re left for long > time, they usually needs correction (via rebase) before being merged, > and again we have commit with date close to the date of merge, leading > to a shortcut. > > Does this reasoning looks correct to you? These "bug fixes based on old commits" cases are few and far between, so are notenough to explain how often maximum generation number works well. Instead, Iwould say that it works well because the typical Git pattern is to merge into a singletrunk. In the case of projects like Linux, there are multiple "trunks" that are run bylieutenants and are merged into a "super trunk"; even in these cases, the numberof trunks is much smaller than the number of topics. Let's focus on a single-trunkmodel for the discussion below. When working with a single trunk, most merge commits are in the first-parent historyof that trunk. As we travel down those trunks, the maximum generation is decreasing.This means that the number of commits that can have equal generation is limited bythe number of commits in a single topic, which is typically small. This is opposed tothe minimum generation, which can be unbounded. For example, if everyone basestheir Git patches on the latest release, then if we walk into those topics, then weneed to walk all the way to that release to determine reachability. Here is a picture, representing each topic as an interval of consecutive generationvalues: # Minimum Generation Numbers: v2.19.0----------M1----M2----M3----M4 |\ / / / / | [topic1]-/ / / / |\ / / / | [topic2]-----// / |\ / / | [topic3]---------/ / \ / [topic4]-------------/ =====generation====> # Maximum Generation Numbers: v2.19.0----------M1----M2----M3----M4 |\ / | | | | [topic1]-/ | | | |\ | | | | \-------[topic2]/ | | |\ | | | \-------------[topic3]/ | \ | \-------------------[topic4]/ =====generation====> >> **V3: Corrected Commit Date.** >> For a commit C, let its _corrected commit date_ (denoted by cdate(C)) >> be the maximum of the commit date of C and the [corrected] commit dates of its >> parents. > Wouldn't it be better to use the maximum of the commit date of C, and > one second more than the corrected commit dates of its parents? This > way corrected dates would be strictly monotonic along the chain. > > Sidenote: in theory, sufficiently bad clock skew (like e.g. wrong year > in the future) could screw large swatches of the commit graph. In > practice, in real-life repositories, this doesn't happen, I guess. Yes, you're right. Congyi Wu from the Azure Repos server team pointed this out to me privately, as well. His example was someone merging a commit with date 3000-01-01, and making all later commits be useless. Adding one to the commit dates of the parents "fixes" this issue by reverting our negative-cut index to be equivalent to minimum generation number for all commits with this commit in their history. > >> If cdate(A) < cdate(B), then A cannot reach B. >> >> **V4: FELINE Index.** >> The FELINE index is a two-dimentional reachability index as defined in >> [Reachability Queries in Very Large Graphs: A Fast Refined Online >> Search Approach](https://openproceedings.org/EDBT/2014/paper_166.pdf) >> by Veloso, Cerf, Jeira, and Zaki. The index is not deterministically >> defined, but instead is defined in the following way: >> >> 1. Compute a reverse topological order of the commits. Let felineX(C) >> be the position in which C appears in this list. Since this is >> a reverse topological order, felineX(C) > felineX(P) for each parent >> P of C. > Well, it depends on the definition of the topological order if it is > tolopogical order or reverse topological order. Anyway, the FELINE > paper includes discussion about FELINE-I, i.e. inversed feline index. > From FELINE and FELINE-I, one of those is what was used here. > >> 2. Compute a reverse topological order of the commits using Kahn's >> algorithm, but when selecting among a list of commits with in-degree >> zero, prioritize the commit by minimizing felineX. Let felineY(C) >> be the position in which C appears in this list. > Note that the second step, as given in the paper, is actually > deterministic (assuming that we follow the recommendations; well, up to > a point). It needs some topological order for felineX, though. > >> Essentially, the felineY order is selected with the goal of swapping >> positions of topologically-independent commits relative to the felinX >> ordering. The resulting reachability index is as follows: >> >> If felineX(A) < felineX(B), then A cannot reach B. >> If felineY(A) < felineY(B), then A cannot reach B. >> >> _Commentary:_ In terms of comparing two commits directly, this index >> is quite strong. However, when we are performing our reachability >> algorithms that care about reachable sets (git log --graph), or >> boundaries between reachable sets (git merge-base, git log --graph A..B) >> we need to track a single pair (minX,minY) for comparion. In order to not >> miss anything during our search, we need to update this pair to be >> the minimum felineX(A) and minimum felineY(B) among all explored commits >> A and B. That is, the pair (minX, minY) is below our entire explored set. >> This can be a disadvantage for these algorithms. > Actually the problem is that FELINE index is partial order, not an > ordering; not something that we can use in a priority queue. > > But the single [reverse] topological ordering is a reachability index in > itself, and it is an ordering. Thus we can use felineX as an index in > priority queue (or we can use felineY). > > One thing to notice is that FELINE index performs well on real world > large graphs which have quite different structure from commit graphs; > among others they have either very large number of sources (head tips) > or sinks (root nodes), or both; those numbers grow with the number of > nodes. For commit graphs they don't perform as well; well, at least > ccording to my explorations on that matter in the Google Colaboratory > notebook [2]. > > [2]: https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg > >> ### Comparing Reachability Index Versions Viability >> >> Before considering how well these indexes perform during our algorithm >> runs, let's take a moment to consider the implementation details and >> how they relate to the existing commit-graph file format and existing >> Git clients. >> >> * **Compatible?** In our test implementation, we use a previously unused >> byte of data in the commit-graph format to indicate which reachability >> index version we are using. Existing clients ignore this value, so we >> will want to consider if these new indexes are _backwards compatible_. >> That is, will they still report correct values if they ignore this byte >> and use the generation number column from the commit-graph file assuming >> the values are minimum generation numbers? > In other words "backward compatibility" for a reachability index means > that the reachability condition is exactly the same as it was: > > If index(A) < index(B), then A cannot reach B. > > and also values of the reachability index are written in place of > generation numbers in the commit-graph. > >> * **Immutable?** Git objects are _immutable_. If you change an object you >> actually create a new object with a new object ID. Are the values we >> store for these reachability indexes also immutable? >> >> * **Local?** Are these values **locally computable**? That is, do we only >> need to look at the parents of a commit (assuming those parents have >> computed values) in order to determine the value at that commit? > Those two features imply that the number of commits that we need to walk > to be able to update reachability index after appending new commits is > O(new commits), which means O(changes). I would allow for simple > transformations of other values, e.g. adding a constant; we need to > rewrite commit-graph file anyway. > > I wonder if there are reachability indexes that are either immutable but > not local, or not immutable but local. Doesn't being local imply > immutability, with some common-sense assumptions (like no random choice > for parent-less commits)? Local probably implies immutable, but the reverse does not. The example I gave elsewhere in the thread was "number of commits reachable from C". You can't determine this directly from the values of your parents, but it doesn't change even if you add more commits to the graph. > >> | Index | Compatible? | Immutable? | Local? | >> |---------------------------|-------------|------------|--------| >> | Minimum Generation Number | Yes | Yes | Yes | >> | (Epoch, Date) pairs | Yes | Yes | Yes | >> | Maximum Generation Number | Yes | No | No | >> | Corrected Commit Date | No | Yes | Yes | >> | FELINE index | Yes | No | No | >> >> _Note:_ The corrected commit date uses the generation number column >> to store an offset of "how much do I need to add to my commit date >> to get my corrected commit date?" The values stored in that column >> are then not backwards-compatible. > Additional note: this offset / delta based approach to storing corrected > commit data is a form of lightweight compression, allowing it to fit > safely in 30 bits (while the commit date uses 32+2 = 34 bits itself). > > Assuming that we want to keep this feature from the original > commit-graph file format, that is. > >> _Note:_ The FELINE index requires storing two values instead of just >> one. One of these values could be stored in the generation number >> column and the other in an optional chunk, hence it could be backwards >> compatible. (This is not how it is implemented in our example >> implementation.) > So the table above refers to this hypothetical implementation when > stating that FELINE is backwards compatibile, but current test > implementation is not, isn't it? Correct. > >> Data >> ---- >> >> We focused on three types of performance tests that test the indexes >> in different ways. Each test lists the `git` command that is used, >> and the table lists which repository is used and which inputs. >> >> ### Test 1: `git log --topo-order -N` >> >> This test focuses on the number of commits that are parsed during >> a `git log --topo-order` before writing `N` commits to output. >> >> You can reproduce this test using `topo-order-tests.sh` and >> see all the data in `topo-order-summary.txt`. The values reported >> here are a sampling of the data, ignoring tests where all values >> were the same or extremely close in value. >> >> >> | Repo | N | V0 | V1 | V2 | V3 | V4 | >> |--------------|-------|--------|--------|--------|--------|-------| >> | android-base | 100 | 5,487 | 8,534 | 6,937 | 6,419 | 6,453 | >> | android-base | 1000 | 36,029 | 44,030 | 41,493 | 41,206 |45,431 | >> | chromium | 100 | 101 |424,406 | 101 | 101 | 101 | >> | gerrit | 100 | 8,212 | 8,533 | 164 | 159 | 162 | >> | gerrit | 1000 | 8,512 | 8,533 | 1,990 | 1,973 | 3,766 | >> | Linux | 100 | 12,458 | 12,444 | 13,683 | 13,123 |13,124 | >> | Linux | 1000 | 24,436 | 26,247 | 27,878 | 26,430 |27,875 | >> | Linux | 10000 | 30,364 | 28,891 | 27,878 | 26,430 |27,875 | >> | electron | 1000 | 19,927 | 18,733 | 1,072 | 18,214 |18,214 | >> | Ffmpeg | 10000 | 32,154 | 47,429 | 10,435 | 11,054 |11,054 | >> | jgit | 1000 | 1,550 | 6,264 | 1,067 | 1,060 | 1,233 | >> | julia | 10000 | 43,043 | 43,043 | 10,201 | 23,505 |23,828 | >> | odoo | 1000 | 17,175 | 9,714 | 4,043 | 4,046 | 4,111 | >> | php-src | 1000 | 19,014 | 27,530 | 1,311 | 1,305 | 1,320 | >> | rails | 100 | 1,420 | 2,041 | 1,757 | 1,428 | 1,441 | >> | rails | 1000 | 7,952 | 10,145 | 10,053 | 8,373 | 8,373 | >> | swift | 1000 | 1,914 | 4,004 | 2,071 | 1,939 | 1,940 | >> | tensorflow | 1000 | 10,019 | 39,221 | 6,711 | 10,051 |10,051 | >> | TypeScript | 1000 | 2,873 | 12,014 | 3,049 | 2,876 | 2,876 | > Do I understand it correctly that the range of N for a given project is > limited by the depth of the project history (hence maximum N of 10000 > for Linux kernel, but only 100 for chromium)? I ran the tests with N equal to 100, 1000, and 10000 on all repos, but only included results for values that were interesting. For chromium, something about the topology let all versions (except V1) report N + 1, so I didn't include the other values. > I wonder what the OLD numbers are for these operations. The OLD numbers are equal to the number of commits reachable from HEAD in every case. I didn't think this was interesting to report. You can see the numbers for yourself in the output data file: https://github.com/derrickstolee/gen-test/blob/master/topo-order-summary.txt > >> ### Test 2: `git log --topo-order -10 A..B` >> >> This test focuses on the number of commits that are parsed during >> a `git log --topo-order A..B` before writing ten commits to output. >> Since we fix a very small set of output commits, we care more about >> the part of the walk that determines which commits are reachable >> from `B` but not reachable from `A`. This part of the walk uses >> commit date as a heuristic in the existing implementation. > [...] > >> ### Test 3: `git merge-base A B` >> >> This test focuses on the number of commits that are parsed during >> a `git merge-base A B`. This part of the walk uses commit date as >> a heuristic in the existing implementation. > [...] > >> Conclusions >> ----------- >> >> Based on the performance results alone, we should remove minimum >> generation numbers, (epoch, date) pairs, and FELINE index from >> consideration. There are enough examples of these indexes performing >> poorly. >> >> In contrast, maximum generation numbers and corrected commit >> dates both performed quite well. They are frequently the top >> two performing indexes, and rarely significantly different. >> >> The trade-off here now seems to be: which _property_ is more important, >> locally-computable or backwards-compatible? >> >> * Maximum generation number is backwards-compatible but not >> locally-computable or immutable. >> >> * Corrected commit-date is locally-computable and immutable, >> but not backwards-compatible. >> >> _Editor's Note:_ Every time I think about this trade-off, I can't >> come to a hard conclusion about which is better. Instead, I'll >> leave that up for discussion. > In my opinion being able to update commit-graph data fast is more > important than being backward-compatibile, i.e. beig able to use > commit-data file generated by new clients (and servers) by old clients. > > > Thank you for all your work, Thanks for the thoughtful discussion! -Stolee ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Generation Number v2 2018-11-01 13:29 ` Derrick Stolee @ 2018-11-03 12:33 ` Jakub Narebski 0 siblings, 0 replies; 17+ messages in thread From: Jakub Narebski @ 2018-11-03 12:33 UTC (permalink / raw) To: Derrick Stolee Cc: Git List, Jeff King, Junio C Hamano, Derrick Stolee, Ævar Arnfjörð Bjarmason Derrick Stolee <stolee@gmail.com> writes: > On 11/1/2018 8:27 AM, Jakub Narebski wrote: >> Derrick Stolee <stolee@gmail.com> writes: >> >>> Please also let me know about any additional tests that I could >>> run. Now that I've got a lot of test scripts built up, I can re-run >>> the test suite pretty quickly. >> >> I would add straighforward 1-to-N and N-to-1 reachability tests in the >> form of `git branch / tag --contains` and `git branch / tag --merged`, >> and perhaps also ahead-behind calculations used by `git status`, and >> N-to-M reachability tests used by tag following code in push and fetch. 1-to-N and N-to-1 can be done with `git branch --merged` / `--contains`. >> Possibly also A...B walk, if it is not implemented via calculating >> merge-base. > > I believe this uses paint_down_to_common(), but looks at the PARENT1 and > PARENT2 flags afterward to determine the left/right/both relationships. Right, so I guess this is the same internal mechanism that `git merge-base` utilizes, just used a bit differently. Thus benchmarking `git merge-base` should cover also A...B performance, I guess. >> Maybe even --ancestry-path walk, though I am not sure how important best >> performance for rhis case is (we would want good performance, but >> perhaps best is not needed for rarely used command). > > Currently, the implementation of --ancestry-path does not use a > reachability index. Well, using reachability index would certainly give it a boost. [...] >>> Generation Number Performance Test >>> ================================== >>> >>> Git uses the commit history to answer many basic questions, such as >>> computing a merge-base. Some of these algorithms can benefit from a >>> _reachability index_, which is a condition that allows us to answer >>> "commit A cannot reach commit B" in some cases. These require pre- >>> computing some values that we can load and use at run-time. Git >>> already has a notion of _generation number_, stores it in the commit- >>> graph file, and uses it in several reachability algorithms. >> >> Note that there are other kinds of reachability indices. >> >> First, there are reachability indices that can answer the full >> reachability query (if commit A can reach commit B, or if commit A >> cannot reach commit B) directly, without walking the commit graph at >> all: so called label-only approach. For example one could store for >> each commit the compressed list of all commits reachable from it >> (transitive closure compression). >> >> Those, I think (but I have not checked), would be of not much use for >> Git, as the size of the index grows stronger than linear with the number >> of commits, as grows the time to compute such index. So probably of no >> use to Git, at least not directly (Git uses so called "bitmap index", >> see e.g. [1], which stores reachability bit-vector as compressed >> bitmap... but not for all commits, only for a small subset). >> >> >> Second, beside negative-cut reachability indexes, that can answer with >> certainity that "commit A cannot reach commit B", like the generation >> numbers (also known as level, or topological level), there also >> positive-cut indexes (usually if not always based on the spanning tree >> or trees for the DAG), that can answer when "commit A can reach commit >> B". >> >> I think that those can lead to significant speedups for at least some >> types of commit traversal and reachability queries that Git needs to >> answer. But currently no algorithms has a provision for using >> positive-cut filter index. Anyway, such index would be auxiliary thing, >> see the next point. >> >> >> Third, we require more than having reachability index in the sense of >> having some kind of _labelling_, often composite, that can answer either >> "commit A cannot reach commit B" or "commit A can reach commit B", >> depending on the type. Git for most operations needs more, namely an >> actual _ordering_, that is a weak order (or to be more exact a total >> preorder, i.e. there can be two different commits with the same >> "generation number" or index, but always either idx(A) ≲ idx(B) or >> idx(B) ≲ idx(A)) and not only partial ordering (where there can be >> incomparable elements, i.e neither idx(A) ≼ idx(B) nor idx(B) ≼ idx(A)). >> This is because it needs to be used in priority queue to decide which >> commit to travel next; more on that later. This is also why there can >> be only one such "main" reachability _index_. >> >> [1]: https://githubengineering.com/counting-objects/ > > Thanks for the details. At the moment, I'm only interested in improving our > negative-cut reachability index, as we have algorithms that can take > advantage of them (and can compare their performance directly). Right, I can agree with that. Positive-cut reachability index use would need to be added separately. What I wanted to emphasize is the need for a _number_ (or a total preorder), not simply an _index_ (a label, perhaps a composite one). This means, as I wrote, that there would be only one main "generation number". It also limits the number of possible reachability indexes to consider. [...] >> In Git there are at least four different types of graph traversal >> queries, with somewhat different requirements, and affected differently >> by various reachability indexes. >> >> >> ## Pure reachability queries >> >> First there are pure reachability queries, when we are interested only >> in nswering he question whether commit A can reach commit B or not. We >> are nt interested in list of commits between A and B; if reachability >> index, be it negative-cut or positive-cut, can answer the question, it >> is all we need. If we need to walk (perform online search), then we are >> interested in just being able to arrive; we need to walk only one path. >> >> This query can be performed one-to-one, one-to-many (which of commits >> from the list can be reached from commit A), many-to-one (which commits >> from the list can reach commit B), and many-to-many -- the latest for >> example with recently added get_reachable_subset(). >> >> These types of queries are done by "git branch --contains" (and >> "--no-contains") and "git branch --merged" (and "--no-merged"), the same >> for "git tag"; if I remember it correctly those were among first >> operations that were sped up by commit-graph mechanism. Many-to-many >> queries are done for tag following in "git push" and "git fetch". >> >> Note that those queries are missing from gen-test. > > These are definitely a good idea to pursue further testing. The only > thing I can think of right now is that the tests can be hard to set up, > but perhaps`git branch --remote --contains` or `git tag --contains` > would be interesting. Also `git tab --merged`, for the "reverse" check; then this should cover both sides of N-to-M queries used by `git fetch` (add_missing_tags() via new get_reachable_subset()). In any case, if all else fails, we can add new test-something script... >> ## Topological ordering >> >> Here reachability index is used to find out in-degree of considered >> commits (starting from a set of starting points). Here we don't really >> need to walk all paths, but we need to walk all possible ways of >> arriving at given commit. > > You make a good point here about the difference between "all paths to C" > and "all children of C", since the number of paths can grow exponentially > (and frequently does in Git) but the number of children is linear (and > small in practice). Sidenote: I wonder how the distribution of number of parent (out-degree), numbers of childern (in-degree), and number of paths to a commit looks like. >> Though I guess the [new] incremental algorithm could be modified to only >> check for in-degree being zero or non-zero. Would it make it slower or >> faster; I guess in absence of good positive-cut reachability filter it >> would make it slower. > > I don't understand how a positive-cut filter helps in this case. The point > of the in-degreecalculation is to say "we already output all commits that > can reach this commit" beforewe output another commit. How could a > positive-cut filter help here? That was more of an idle thought than a concrete proposal. With the new 3-phase algorithm, we populate topo_queue with commits of an in-degree of zero, calculating this on-degree if necessary. I thought that those pre-candidates could have been eliminated early with positive-cut index, instead of having to compute in-degree in full. [...] >> ## Strict ancestry, or ancestry path >> >> When given a range of commits to display (e.g. A..B), only display >> commits that exist directly on the ancestry chain between B and A, >> i.e. commits that are both descendants of B, and ancestors of A. In >> other words, find all paths from B to A (if they exist). >> >> In this case we want to walk all the paths, heavily pruning. This >> should be less dependent on reachability index quality as an index (as >> an ordering), and more on pruning ability of the filter. >> >> This case is not tested in gen-test, but I wonder how often this feature >> is actualy used (if it would be worth adding it to the benchmark, and if >> so, with what weight behind it). > > As mentioned earlier, the --ancestry-path algorithm does not currently use a > negative-cut filter, so we would not gather any data on this at the moment. > > However, I imagine the algorithm is similar to the typical A..B case, as we > need to discover the (po)set difference, construct the reversed digraph on > those commits, and walk backwards from A. Assuming the difference is much > smaller than the entire commit graph, then the main cost is the walk that > discovers the difference, and hence is covered somewhat by the `git log > --topo-order A..B` tests. I would say that --ancestry-path is a little like A..B walk, a little like the in-degree computing in --topo-order (because in in-degree computing we are also interested only in paths that arrive at given commit). [...] >>> **V1: (Epoch, Date) Pairs.** >>> >>> For each commit, store a pair of values called the _epoch_ and the _date_. >>> The date is the normal commit date. The _epoch_ of a commit is the minimum >>> X such that X is at least the maximum epoch of each parent, and at least >>> one more than the epoch of any parent whose date is larger than the date >>> of this commit (i.e. there is clock skew between this commit and this >>> parent). In this way, we can use the following reachability index: >>> >>> If epoch(A) < epoch(B), then A cannot reach B. >>> If epoch(A) == epoch(B) and date(A) < date(B), then A cannot reach B. >> >> I wonder what makes it perform worse than corrected date aka V3. > > In the example `git merge-base v4.8 v4.9` in the Linux repo, the topology > includes two commits (say, C1 and C2) of low generation number but high > commit-date. These commits also have low epoch. However, there are 607 > commits in the repo whose commit date is smaller than a parent's commit > date. This means that the epoch in themain trunk of the repo can be as > high as 607, while the epoch for C1 and C2 is likely in the single digits. > This means that we need to walk all commits with epoch greater than > min { epoch(C1), epoch(C2) } before we explore C1 and C2 and terminate > the walk. > > In V3, the commits C1 and C2 have high corrected commit date, higher than > any of the commits that require positive offset to overcome the clock skew > with their parents. This allows the walk to be very similar to the old > algorithm, as seen in this `git merge-base A B` test summary: > > Linux c8d2bc9bc39e 69973b830859 [I have added names of algorithms / generationn numbers here] > OLD: 167468 ----- no reachability index > V0: 635579 gen(C) (Minimum) Generation Number > V1: 630138 epoch(C) (Epoch, Date) Pairs > V2: 33716 maxgen(C) Maximum Generation Numbers > V3: 167496 cdate(C) Corrected Commit Date > V4: 153774 felineX(C) FELINE Index Ah, so the problem with (Epoch, Date) pairs is that the trunk acquires high epoch, while side branches (which may be shortcuts) often have lower epoch. Makes sense, if maybe not obvious without such examination. >>> **V2: Maximum Generation Numbers.** >>> The _maximum generation number_ of a commit (denoted by maxgen(A)) is >>> defined using its _children_ and the total number of commits in the repo. >>> If A is a commit with no children (that is, there is no commit B with >>> A as a parent) then maxgen(A) is equal to the number of commits in the repo. >>> For all other commits A, maxgen(A) is one less than the minimum maxgen(B) >>> among children B. This can be computed by initializing maxgen(C) to the >>> number of commits, then walking the commit graph in topological order, >>> assigning maxgen(P) = min { maxgen(P), maxgen(C) - 1 } for each parent P >>> of the currently-walking commit C. We then have the same reachability >>> index as minimum generation number: >>> >>> If maxgen(A) < maxgen(B), then A cannot reach B. >> >> If I understand it correctly, this is the same as reverse generation >> numbers, or reverse topological levels; in other words genertion numbers >> on reversed graph -- only transformed: >> >> maxgen(A) == number_of_commits - (revgen(A) - 1) Actually using number_of_commits may not be the best choice here with respect to number of commits that need to have maxgen(C) be recalculated on update... >> We define revgen(A) in the following way: >> - for head tips (for source nodes), i.e. commits with in-degree of 0, >> with no children, have revgen(A) = 1 (revgen(A) = 0 is left for >> commits outside commit-graph, which translates to INFINITY for >> maxgen(A)). >> - otherwise, it is 1 more than maximum revgen(C) of its children >> >> They are equivalent, but maxgen(A) is "backward compatibile", that is >> the rechability condition is exactly the same as for ordinary generation >> numbers: >> >> If gen(A) < gen(B), then A cannot reach B. >> If maxgen(A) < maxgen(B), then A cannot reach B. >> >> But >> >> If revgen(A) > revgen(B), then A cannot reach B. > > Yes, these are the same idea, and the reason to phrase it as I did is to > keep the inequality in the same direction. ...and if using gen(C) as a starting point, you avoid rewriting whole DAG on update. Definition of maxgen'(C): - if A is commit without children (that is, there is no commit B with A as a parent), then maxgen'(A) = gen(A) - otherwise, maxgen'(A) = -1 + min_{B has A as parent}(maxgen'(B)) Update algorithm: 1. compute gen(A) for all new commits, which is O(changes) 2. select commit A with highest gen(A) that doesn't have maxgen'(A): [such commit does not have any children, because of gen() properties] and add it to queue Q 2.1. take commit C from the queue Q 2.2. for each parent P of commit C 2.2.1. if maxgen'(P) <= maxgen'(C) - 1, continue 2.2.2. update maxgen'(P) to maxgen'(C) - 1 2.2.3. add P to queue Q I have tested this version of maxgen(C) algorithm in a very few cases in other post in this thread [3], and it looks like the amount of commits that need to have maxgen() recalculated is limited. [3]: https://public-inbox.org/git/86ftwjzv1h.fsf@gmail.com/ [...] >>> _Commentary:_ The known examples where (minimum) generation numbers perform >>> worse than commit date heuristics involve commits with recent commit dates >>> but whose parents have very low generation number compared to most recent >>> commits. In a way, minimum generation numbers increase the odds that we >>> can say "A cannot reach B" when A is fixed and B is selected at random. >>> Maximum generation numbers increase the odds that we can say "A cannot >>> reach B" when B is fixed and A is selected at random. This helps us when >>> we are exploring during a reachability algorithm and have explored few >>> commits and want to say that the large set of unexplored commits cannot >>> reach any of the explored commits. >>> >> I guess that the situation where shortcut path have recent date, while >> having less commits in path (and thus lover ordinary generation number) >> is more common in real commit graphs because one might want to base a >> commit (for example a bugfix) on an old commit, but commits usually get >> merged quickly, and not left for long time. If they re left for long >> time, they usually needs correction (via rebase) before being merged, >> and again we have commit with date close to the date of merge, leading >> to a shortcut. >> >> Does this reasoning looks correct to you? > > These "bug fixes based on old commits" cases are few and far between, so > are not enough to explain how often maximum generation number works well. > Instead, I would say that it works well because the typical Git pattern is > to merge into a single trunk. In the case of projects like Linux, there are > multiple "trunks" that are run by lieutenants and are merged into a "super > trunk"; even in these cases, the number of trunks is much smaller than the > number of topics. Let's focus on a single-trunk model for the > discussion below. > > When working with a single trunk, most merge commits are in the first-parent > historyof that trunk. As we travel down those trunks, the maximum generation > is decreasing. This means that the number of commits that can have equal > generation is limited bythe number of commits in a single topic, which is > typically small. This is opposed to the minimum generation, which can be > unbounded. For example, if everyone bases their Git patches on the latest > release, then if we walk into those topics, then we need to walk all the way > to that release to determine reachability. > > Here is a picture, representing each topic as an interval of consecutive > generation values: > > # Minimum Generation Numbers: > > v2.19.0----------M1----M2----M3----M4 > |\ / / / / > | [topic1]-/ / / / > |\ / / / > | [topic2]-----/ / / > |\ / / > | [topic3]---------/ / > \ / > [topic4]-------------/ > > =====generation====> > > # Maximum Generation Numbers: > > v2.19.0----------M1----M2----M3----M4 > |\ / | | | > | [topic1]-/ | | | > |\ | | | > | \-------[topic2]/ | | > |\ | | > | \-------------[topic3]/ | > \ | > \-------------------[topic4]/ > > =====generation====> All right, so the problem with (minimum) generation numbers performance will be exhibited in any project that utilizes the topic branch workflow. Nice to know the case. >>> **V3: Corrected Commit Date.** >>> For a commit C, let its _corrected commit date_ (denoted by cdate(C)) >>> be the maximum of the commit date of C and the [corrected] commit dates of its >>> parents. >> >> Wouldn't it be better to use the maximum of the commit date of C, and >> one second more than the corrected commit dates of its parents? This >> way corrected dates would be strictly monotonic along the chain. >> >> Sidenote: in theory, sufficiently bad clock skew (like e.g. wrong year >> in the future) could screw large swatches of the commit graph. In >> practice, in real-life repositories, this doesn't happen, I guess. > > Yes, you're right. Congyi Wu from the Azure Repos server team pointed > this out to me privately, as well. His example was someone merging a > commit with date 3000-01-01, and making all later commits be > useless. Adding one to the commit dates of the parents "fixes" this > issue by reverting our negative-cut index to be equivalent to minimum > generation number for all commits with this commit in their history. I wonder if the problem with bad clock skew with the date far in the future could be solved by using instead for such cases (where date(parent) > date(commit) + large Delta) the average of the commit and of the grandparents (parents of parent commit with such large positive clock skew). Though this wouldn't be strictly speaking immutable... Another edge case, which I hope is not a problem in real-life repositories, is clock skew very far in the past, and storing offsets in 30-bits. If there was some malformed commit which has 0 as commit date epoch (Unix time), i.e. 1970-01-01, then the delta / offset may not fit in 30 bits that we use for storing generation numbers; there are 32+2 = 34 bits for comit date. Something to think about. [...] >>> ### Comparing Reachability Index Versions Viability >>> >>> Before considering how well these indexes perform during our algorithm >>> runs, let's take a moment to consider the implementation details and >>> how they relate to the existing commit-graph file format and existing >>> Git clients. >>> >>> * **Compatible?** In our test implementation, we use a previously unused >>> byte of data in the commit-graph format to indicate which reachability >>> index version we are using. Existing clients ignore this value, so we >>> will want to consider if these new indexes are _backwards compatible_. >>> That is, will they still report correct values if they ignore this byte >>> and use the generation number column from the commit-graph file assuming >>> the values are minimum generation numbers? >> >> In other words "backward compatibility" for a reachability index means >> that the reachability condition is exactly the same as it was: >> >> If index(A) < index(B), then A cannot reach B. >> >> and also values of the reachability index are written in place of >> generation numbers in the commit-graph. Note that it is backwards-compatibility only for use of generation numbers, but of course not for an update. >>> * **Immutable?** Git objects are _immutable_. If you change an object you >>> actually create a new object with a new object ID. Are the values we >>> store for these reachability indexes also immutable? >>> >>> * **Local?** Are these values **locally computable**? That is, do we only >>> need to look at the parents of a commit (assuming those parents have >>> computed values) in order to determine the value at that commit? >>> >> Those two features imply that the number of commits that we need to walk >> to be able to update reachability index after appending new commits is >> O(new commits), which means O(changes). I would allow for simple >> transformations of other values, e.g. adding a constant; we need to >> rewrite commit-graph file anyway. >> >> I wonder if there are reachability indexes that are either immutable but >> not local, or not immutable but local. Doesn't being local imply >> immutability, with some common-sense assumptions (like no random choice >> for parent-less commits)? > > Local probably implies immutable, but the reverse does not. The > example I gave elsewhere in the thread was "number of commits > reachable from C". You can't determine this directly from the values > of your parents, but it doesn't change even if you add more commits to > the graph. Right, reach(C) is immutable but not local. Local may not mean immutable if the starting points, i.e. parentless commits, get assingled not-immutable values. But I don't see how that could happen without some randomness. >>> Data >>> ---- >>> >>> We focused on three types of performance tests that test the indexes >>> in different ways. Each test lists the `git` command that is used, >>> and the table lists which repository is used and which inputs. >>> >>> ### Test 1: `git log --topo-order -N` [...] >>> >>> | Repo | N | V0 | V1 | V2 | V3 | V4 | >>> |--------------|-------|--------|--------|--------|--------|-------| >>> | android-base | 100 | 5,487 | 8,534 | 6,937 | 6,419 | 6,453 | >>> | android-base | 1000 | 36,029 | 44,030 | 41,493 | 41,206 |45,431 | >>> | chromium | 100 | 101 |424,406 | 101 | 101 | 101 | >>> | gerrit | 100 | 8,212 | 8,533 | 164 | 159 | 162 | >>> | gerrit | 1000 | 8,512 | 8,533 | 1,990 | 1,973 | 3,766 | >>> | Linux | 100 | 12,458 | 12,444 | 13,683 | 13,123 |13,124 | >>> | Linux | 1000 | 24,436 | 26,247 | 27,878 | 26,430 |27,875 | >>> | Linux | 10000 | 30,364 | 28,891 | 27,878 | 26,430 |27,875 | >>> | electron | 1000 | 19,927 | 18,733 | 1,072 | 18,214 |18,214 | >>> | Ffmpeg | 10000 | 32,154 | 47,429 | 10,435 | 11,054 |11,054 | >>> | jgit | 1000 | 1,550 | 6,264 | 1,067 | 1,060 | 1,233 | >>> | julia | 10000 | 43,043 | 43,043 | 10,201 | 23,505 |23,828 | >>> | odoo | 1000 | 17,175 | 9,714 | 4,043 | 4,046 | 4,111 | >>> | php-src | 1000 | 19,014 | 27,530 | 1,311 | 1,305 | 1,320 | >>> | rails | 100 | 1,420 | 2,041 | 1,757 | 1,428 | 1,441 | >>> | rails | 1000 | 7,952 | 10,145 | 10,053 | 8,373 | 8,373 | >>> | swift | 1000 | 1,914 | 4,004 | 2,071 | 1,939 | 1,940 | >>> | tensorflow | 1000 | 10,019 | 39,221 | 6,711 | 10,051 |10,051 | >>> | TypeScript | 1000 | 2,873 | 12,014 | 3,049 | 2,876 | 2,876 | >> >> Do I understand it correctly that the range of N for a given project is >> limited by the depth of the project history (hence maximum N of 10000 >> for Linux kernel, but only 100 for chromium)? > > I ran the tests with N equal to 100, 1000, and 10000 on all repos, but only > included results for values that were interesting. For chromium, something > about the topology let all versions (except V1) report N + 1, so I > didn't include the other values. Ah, that information (only interesting cases were included) would be something good to know. I wonder if the relative performance of different generation numbers is correlated with the topology of the commit graph, for example with increment patterns to intergation patterns ratio in the commit metagraph (like --simplify-by-decoration), like in [4] and related works. [4] Marco Biazzini, Martin Monperrus, Benoit Baudry "On Analyzing the Topology of Commit Histories in Decentralized Version Control Systems", DOI: 10.1109/ICSME.2014.48 (2014) >> I wonder what the OLD numbers are for these operations. > > The OLD numbers are equal to the number of commits reachable from HEAD > in every case. I didn't think this was interesting to report. You can > see the numbers for yourself in the output data file: > > https://github.com/derrickstolee/gen-test/blob/master/topo-order-summary.txt Thanks for the info. -- Jakub Narębski ^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2018-11-03 17:27 UTC | newest] Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2018-10-29 16:55 [RFC] Generation Number v2 Derrick Stolee 2018-10-29 19:22 ` Stefan Beller 2018-10-29 20:06 ` Derrick Stolee 2018-11-01 20:06 ` Jakub Narebski 2018-11-02 9:30 ` Jakub Narebski 2018-11-03 17:27 ` Jakub Narebski 2018-10-29 20:25 ` Derrick Stolee 2018-11-01 22:13 ` Jakub Narebski 2018-10-30 3:59 ` Junio C Hamano 2018-10-31 12:30 ` Derrick Stolee 2018-11-02 13:33 ` Jakub Narebski 2018-10-31 12:54 ` Ævar Arnfjörð Bjarmason 2018-10-31 13:04 ` Derrick Stolee 2018-11-02 17:44 ` Jakub Narebski 2018-11-01 12:27 ` Jakub Narebski 2018-11-01 13:29 ` Derrick Stolee 2018-11-03 12:33 ` Jakub Narebski
Code repositories for project(s) associated with this public inbox https://80x24.org/mirrors/git.git This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).