[RFC] Generation Number v2

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* [RFC] Generation Number v2
@ 2018-10-29 16:55 Derrick Stolee
  2018-10-29 19:22 ` Stefan Beller
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Derrick Stolee @ 2018-10-29 16:55 UTC (permalink / raw)
  To: Git List, Jeff King, Junio C Hamano, Jakub Narębski,
	Derrick Stolee, Ævar Arnfjörð Bjarmason

We've discussed in several places how to improve upon generation
numbers. This RFC is a report based on my investigation into a
few new options, and how they compare for Git's purposes on
several existing open-source repos.

You can find this report and the associated test scripts at
https://github.com/derrickstolee/gen-test.

I have some more work to do in order to make this fully
reproducible (including a single script to clone all of the
repos I used, compute the commit-graphs using the different
index versions, and run the tests).

TL;DR: I recommend we pursue one of the following two options:

1. Maximum Generation Number.
2. Corrected Commit Date.

Each of these perform well, but have different implementation
drawbacks. I'd like to hear what everyone thinks about those
drawbacks.

Please also let me know about any additional tests that I could
run. Now that I've got a lot of test scripts built up, I can re-run
the test suite pretty quickly.

Thanks,
-Stolee

Generation Number Performance Test
==================================

Git uses the commit history to answer many basic questions, such as
computing a merge-base. Some of these algorithms can benefit from a
_reachability index_, which is a condition that allows us to answer
"commit A cannot reach commit B" in some cases. These require pre-
computing some values that we can load and use at run-time. Git
already has a notion of _generation number_, stores it in the commit-
graph file, and uses it in several reachability algorithms.

You can read more about generation numbers and how to use them in
algorithms in [this blog 
post](https://blogs.msdn.microsoft.com/devops/2018/07/09/supercharging-the-git-commit-graph-iii-generations/).

However, [generation numbers do not always improve our 
algorithms](https://public-inbox.org/git/pull.28.git.gitgitgadget@gmail.com/T/#u).
Specifically, some algorithms in Git already use commit date as a
heuristic reachability index. This has some problems, though, since
commit date can be incorrect for several reasons (clock skew between
machines, purposefully setting GIT_COMMIT_DATE to the author date, etc.).
However, the speed boost by using commit date as a cutoff was so
important in these cases, that the potential for incorrect answers was
considered acceptable.

When these algorithms were converted to use generation numbers, we
_added_ the extra constraint that the algorithms are _never incorrect_.
Unfortunately, this led to some cases where performance was worse than
before. There are known cases where `git merge-base A B` or
`git log --topo-order A..B` are worse when using generation numbers
than when using commit dates.

This report investigates four replacements for generation numbers, and
compares the number of walked commits to the existing algorithms (both
using generation numbers and not using them at all). We can use this
data to make decisions for the future of the feature.

### Implementation

The reachability indexes below are implemented in
[the `reach-perf` branch in 
https://github.com/derrickstolee/git](https://github.com/derrickstolee/git/tree/reach-perf).
This implementation is in a very rough condition, as it is intended to
be a prototype, not production-quality.

Using this implementation, you can compute commit-graph files for the
specified reachability index using `git commit-graph write --reachable 
--version=<V>`.
The `git` client will read the version from the file, so our tests
store each version as `.git/objects/info/commit-graph.<V>` and copy
the necessary file to `.git/objects/info/commit-graph` before testing.

The algorithms count the number of commits walked, as to avoid the
noise that would occur with time-based performance reporting. We use
the (in progress) trace2 library for this. To find the values reported,
use the `GIT_TR2_PERFORMANCE` environment variable.

To ignore reachability indexes entirely and use the old algorithms
(reported here as "OLD" values) use the environment variable
`GIT_TEST_OLD_PAINT=1`.

Reachability Index Versions
---------------------------

**V0: (Minimum) Generation Number.**
The _generation number_ of a commit is exactly one more than the maximum
generation number of a parent (by convention, the generation number of a
commit with no parents is 1). This is the definition currently used by
Git (2.19.0 and later). Given two commits A and B, we can then use the
following reachability index:

     If gen(A) < gen(B), then A cannot reach B.

_Commentary:_ One issue with generation numbers is that some algorithms
in Git use commit date as a heuristic, and allow incorrect answers if
there is enough clock skew. Using that heuristic, the algorithms can walk
fewer commits than the algorithms using generation number. The other
reachability indexes introduced below attempt to fix this problem.

**V1: (Epoch, Date) Pairs.**
For each commit, store a pair of values called the _epoch_ and the _date_.
The date is the normal commit date. The _epoch_ of a commit is the minimum
X such that X is at least the maximum epoch of each parent, and at least
one more than the epoch of any parent whose date is larger than the date
of this commit (i.e. there is clock skew between this commit and this
parent). In this way, we can use the following reachability index:

    If epoch(A) < epoch(B), then A cannot reach B.
    If epoch(A) == epoch(B) and date(A) < date(B), then A cannot reach B.

**V2: Maximum Generation Numbers.**
The _maximum generation number_ of a commit (denoted by maxgen(A)) is
defined using its _children_ and the total number of commits in the repo.
If A is a commit with no children (that is, there is no commit B with
A as a parent) then maxgen(A) is equal to the number of commits in the repo.
For all other commits A, maxgen(A) is one less than the minimum maxgen(B)
among children B. This can be computed by initializing maxgen(C) to the
number of commits, then walking the commit graph in topological order,
assigning maxgen(P) = min { maxgen(P), maxgen(C) - 1 } for each parent P
of the currently-walking commit C. We then have the same reachability
index as minimum generation number:

   If maxgen(A) < maxgen(B), then A cannot reach B.

_Commentary:_ The known examples where (minimum) generation numbers perform
worse than commit date heuristics involve commits with recent commit dates
but whose parents have very low generation number compared to most recent
commits. In a way, minimum generation numbers increase the odds that we
can say "A cannot reach B" when A is fixed and B is selected at random.
Maximum generation numbers increase the odds that we can say "A cannot
reach B" when B is fixed and A is selected at random. This helps us when
we are exploring during a reachability algorithm and have explored few
commits and want to say that the large set of unexplored commits cannot
reach any of the explored commits.

**V3: Corrected Commit Date.**
For a commit C, let its _corrected commit date_ (denoted by cdate(C))
be the maximum of the commit date of C and the commit dates of its
parents.

   If cdate(A) < cdate(B), then A cannot reach B.

**V4: FELINE Index.**
The FELINE index is a two-dimentional reachability index as defined in
[Reachability Queries in Very Large Graphs: A Fast Refined Online Search 
Approach](https://openproceedings.org/EDBT/2014/paper_166.pdf)
by Veloso, Cerf, Jeira, and Zaki. The index is not deterministically
defined, but instead is defined in the following way:

1. Compute a reverse topological order of the commits. Let felineX(C)
    be the position in which C appears in this list. Since this is
    a reverse topological order, felineX(C) > felineX(P) for each parent
    P of C.

2. Compute a reverse topological order of the commits using Kahn's
    algorithm, but when selecting among a list of commits with in-degree
    zero, prioritize the commit by minimizing felineX. Let felineY(C)
    be the position in which C appears in this list.

Essentially, the felineY order is selected with the goal of swapping
positions of topologically-independent commits relative to the felinX
ordering. The resulting reachability index is as follows:

    If felineX(A) < felineY(B), then A cannot reach B.
    If felineY(A) < felineY(B), then A cannot reach B.

_Commentary:_ In terms of comparing two commits directly, this index
is quite strong. However, when we are performing our reachability
algorithms that care about reachable sets (git log --graph), or
boundaries between reachable sets (git merge-base, git log --graph A..B)
we need to track a single pair (minX,minY) for comparion. In order to not
miss anything during our search, we need to update this pair to be
the minimum felineX(A) and minimum felineY(B) among all explored commits
A and B. That is, the pair (minX, minY) is below our entire explored set.
This can be a disadvantage for these algorithms.

### Comparing Reachability Index Versions Viability

Before considering how well these indexes perform during our algorithm
runs, let's take a moment to consider the implementation details and
how they relate to the existing commit-graph file format and existing
Git clients.

* **Compatible?** In our test implementation, we use a previously unused
   byte of data in the commit-graph format to indicate which reachability
   index version we are using. Existing clients ignore this value, so we
   will want to consider if these new indexes are _backwards compatible_.
   That is, will they still report correct values if they ignore this byte
   and use the generation number column from the commit-graph file assuming
   the values are minimum generation numbers?

* **Immutable?** Git objects are _immutable_. If you change an object you
   actually create a new object with a new object ID. Are the values we 
store
   for these reachability indexes also immutable?

* **Local?** Are these values **locally computable**? That is, do we only
   need to look at the parents of a commit (assuming those parents have
   computed values) in order to determine the value at that commit?

| Index                     | Compatible? | Immutable? | Local? |
|---------------------------|-------------|------------|--------|
| Minimum Generation Number | Yes         | Yes        | Yes    |
| (Epoch, Date) pairs       | Yes         | Yes        | Yes    |
| Maximum Generation Number | Yes         | No         | No     |
| Corrected Commit Date     | No          | Yes        | Yes    |
| FELINE index              | Yes         | No         | No     |

_Note:_ The corrected commit date uses the generation number column
to store an offset of "how much do I need to add to my commit date
to get my corrected commit date?" The values stored in that column
are then not backwards-compatible.

_Note:_ The FELINE index requires storing two values instead of just
one. One of these values could be stored in the generation number
column and the other in an optional chunk, hence it could be backwards
compatible. (This is not how it is implemented in our example
implementation.)

Data
----

We focused on three types of performance tests that test the indexes
in different ways. Each test lists the `git` command that is used,
and the table lists which repository is used and which inputs.

### Test 1: `git log --topo-order -N`

This test focuses on the number of commits that are parsed during
a `git log --topo-order` before writing `N` commits to output.

You can reproduce this test using `topo-order-tests.sh` and
see all the data in `topo-order-summary.txt`. The values reported
here are a sampling of the data, ignoring tests where all values
were the same or extremely close in value.

| Repo         | N     | V0     | V1     | V2     | V3 | V4    |
|--------------|-------|--------|--------|--------|--------|-------|
| android-base | 100   |  5,487 |  8,534 |  6,937 |  6,419 | 6,453 |
| android-base | 1000  | 36,029 | 44,030 | 41,493 | 41,206 |45,431 |
| chromium     | 100   |    101 |424,406 |    101 |    101 |   101 |
| gerrit       | 100   |  8,212 |  8,533 |    164 |    159 |   162 |
| gerrit       | 1000  |  8,512 |  8,533 |  1,990 |  1,973 | 3,766 |
| Linux        | 100   | 12,458 | 12,444 | 13,683 | 13,123 |13,124 |
| Linux        | 1000  | 24,436 | 26,247 | 27,878 | 26,430 |27,875 |
| Linux        | 10000 | 30,364 | 28,891 | 27,878 | 26,430 |27,875 |
| electron     | 1000  | 19,927 | 18,733 |  1,072 | 18,214 |18,214 |
| Ffmpeg       | 10000 | 32,154 | 47,429 | 10,435 | 11,054 |11,054 |
| jgit         | 1000  |  1,550 |  6,264 |  1,067 |  1,060 | 1,233 |
| julia        | 10000 | 43,043 | 43,043 | 10,201 | 23,505 |23,828 |
| odoo         | 1000  | 17,175 |  9,714 |  4,043 |  4,046 | 4,111 |
| php-src      | 1000  | 19,014 | 27,530 |  1,311 |  1,305 | 1,320 |
| rails        | 100   |  1,420 |  2,041 |  1,757 |  1,428 | 1,441 |
| rails        | 1000  |  7,952 | 10,145 | 10,053 |  8,373 | 8,373 |
| swift        | 1000  |  1,914 |  4,004 |  2,071 |  1,939 | 1,940 |
| tensorflow   | 1000  | 10,019 | 39,221 |  6,711 | 10,051 |10,051 |
| TypeScript   | 1000  |  2,873 | 12,014 |  3,049 |  2,876 | 2,876 |

### Test 2: `git log --topo-order -10 A..B`

This test focuses on the number of commits that are parsed during
a `git log --topo-order A..B` before writing ten commits to output.
Since we fix a very small set of output commits, we care more about
the part of the walk that determines which commits are reachable
from `B` but not reachable from `A`. This part of the walk uses
commit date as a heuristic in the existing implementation.

You can reproduce this test using `topo-compare-tests.sh` and
see all the data in `topo-compare-summary.txt`. The values reported
here are a sampling of the data, ignoring tests where all values
were the same or extremely close in value.

_Note:_ For some of the rows, the `A` and `B` values may be
swapped from what is expected. This is due to (1) a bug in the
reference implementation that doesn't short-circuit the walk
when `A` can reach `B`, and (2) data-entry errors by the author.
The bug can be fixed, but would have introduced strange-looking
rows in this table.

| Repo         | A            | B            | OLD     | V0     | V1     
| V2     | V3      | V4     |
|--------------|--------------|--------------|---------|--------|--------|--------|---------|--------|
| android-base | 53c1972bc8f  | 92f18ac3e39  |  39,403 | 1,544 |  6,957 
|     26 |   1,015 |  1,098 |
| gerrit       | c4311f7642   | 777a8cd1e0   |   6,457 | 7,836 | 10,869 
|    415 |     414 |    445 |
| electron     | 7da7dd85e    | addf069f2    |  18,164 | 945 |  6,528 
|     17 |      17 |     18 |
| julia        | 7faee1b201   | e2022b9f0f   |  22,800 | 4,221 | 12,710 
|    377 |     213 |    213 |
| julia        | ae69259cd9   | c8b5402afc   |   1,864 | 1,859 | 13,287 
|     12 |   1,859 |  1,859 |
| Linux        | 69973b830859 | c470abd4fde4 | 111,692 | 77,263 | 96,598 
| 80,238 |  76,332 | 76,495 |
| Linux        | c3b92c878736 | 19f949f52599 | 167,418 | 5,736 |  4,684 
|  9,675 |   3,887 |  3,923 |
| Linux        | c8d2bc9bc39e | 69973b830859 |  44,940 | 4,056 | 16,636 
| 10,405 |   3,475 |  4,022 |
| odoo         | 4a31f55d0a0  | 93fb2b4a616  |  25,139 | 19,528 | 20,418 
| 19,874 |  19,634 | 27,247 |
| swift        | 4046359efd   | b34b6a14c7   |  13,411 | 662 |    321 
|     12 |      80 |    134 |
| tensorflow   | ec6d17219c   | fa1db5eb0d   |  10,373 | 4,762 | 36,272 
|    174 |   3,631 |  3,632 |
| TypeScript   | 35ea2bea76   | 123edced90   |   3,450 | 267 | 10,386 
|     27 |     259 |    259 |

### Test 3: `git merge-base A B`

This test focuses on the number of commits that are parsed during
a `git merge-base A B`. This part of the walk uses commit date as
a heuristic in the existing implementation.

You can reproduce this test using `merge-base-tests.sh` and
see all the data in `merge-base-summary.txt`. The values reported
here are a sampling of the data, ignoring tests where all values
were the same or extremely close in value.

| Repo         | A            | B            | OLD     | V0      | 
V1      | V2      | V3      | V4      |
|--------------|--------------|--------------|---------|---------|---------|---------|---------|---------|
| android-base | 53c1972bc8f  | 92f18ac3e39  |  81,999 | 109,025 |  
81,885 |  77,475 |  81,999 |  82,001 |
| gerrit       | c4311f7642   | 777a8cd1e0   |   6,468 | 7,995 |   6,566 
|   6,478 |   6,468 |   6,468 |
| electron     | 7da7dd85e    | addf069f2    |  18,160 | 19,871 |  
18,670 |   2,231 |  18,160 |  18,160 |
| julia        | 7faee1b201   | e2022b9f0f   |  22,803 | 42,339 |  
42,212 |   6,803 |  22,803 |  22,803 |
| julia        | c8b5402afc   | ae69259cd9   |   7,076 | 42,909 |  
42,770 |   2,690 |   7,076 |   7,076 |
| Linux        | 69973b830859 | c470abd4fde4 |  44,984 | 47,457 |  
44,679 |  38,461 |  44,984 |  44,984 |
| Linux        | c3b92c878736 | 19f949f52599 | 111,740 | 111,027 | 
111,196 | 107,835 | 111,771 | 111,368 |
| Linux        | c8d2bc9bc39e | 69973b830859 | 167,468 | 635,579 | 
630,138 |  33,716 | 167,496 | 153,774 |
| odoo         | 4a31f55d0a0  | 93fb2b4a616  |  25,150 | 27,259 |  
23,977 |  24,041 |  23,974 |  26,829 |
| swift        | 4046359efd   | b34b6a14c7   |  13,434 | 13,254 |  
13,940 |  16,023 |  13,127 |  15,008 |
| tensorflow   | ec6d17219c   | fa1db5eb0d   |  10,377 | 10,448 |  
10,377 |   8,460 |  10,377 |  10,377 |
| TypeScript   | 35ea2bea76   | 123edced90   |   3,464 | 3,439 |   3,464 
|   3,581 |   3,464 |   3,464 |

Conclusions
-----------

Based on the performance results alone, we should remove minimum
generation numbers, (epoch, date) pairs, and FELINE index from
consideration. There are enough examples of these indexes performing
poorly.

In contrast, maximum generation numbers and corrected commit
dates both performed quite well. They are frequently the top
two performing indexes, and rarely significantly different.

The trade-off here now seems to be: which _property_ is more important,
locally-computable or backwards-compatible?

* Maximum generation number is backwards-compatible but not
   locally-computable or immutable.

* Corrected commit-date is locally-computable and immutable,
   but not backwards-compatible.

_Editor's Note:_ Every time I think about this trade-off, I can't
come to a hard conclusion about which is better. Instead, I'll
leave that up for discussion.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Generation Number v2
  2018-10-29 16:55 [RFC] Generation Number v2 Derrick Stolee
@ 2018-10-29 19:22 ` Stefan Beller
  2018-10-29 20:06   ` Derrick Stolee
  2018-11-01 20:06   ` Jakub Narebski
  2018-10-29 20:25 ` Derrick Stolee
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 17+ messages in thread
From: Stefan Beller @ 2018-10-29 19:22 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Jeff King, Junio C Hamano, Jakub Narębski,
	Derrick Stolee, Ævar Arnfjörð Bjarmason

> Based on the performance results alone, we should remove minimum
> generation numbers, (epoch, date) pairs, and FELINE index from
> consideration. There are enough examples of these indexes performing
> poorly.
>
> In contrast, maximum generation numbers and corrected commit
> dates both performed quite well. They are frequently the top
> two performing indexes, and rarely significantly different.
>
> The trade-off here now seems to be: which _property_ is more important,
> locally-computable or backwards-compatible?
>
> * Maximum generation number is backwards-compatible but not
>    locally-computable or immutable.

These maximum generation numbers sound like the reverse of
the generation numbers as they are currently implemented, i.e.
we count all commits between the commit A and all heads.
How would this impact creation of a commit?

The current generation numbers can be lazily updated or not
updated at all. In my understanding of the maximum generation
numbers, a new commit would make these maximum generation
numbers invalid (i.e. they have to be recomputed).

Are there ways out by deferring the computation of maximum
generation numbers while still being able to work with some
commits that are un-accounted for?

When recomputing these numbers, the current generation number
(V0) has the property that already existing numbers can be re-used
as-is. We only need to compute the numbers for new commits,
and then insert this to the appropriate data structures (which is
a separate problem, one could imagine a split generation
numbers file like the split index)

For the V2 maximum generation numbers, would we need to
rewrite the numbers for all commits once we recompute them?
Assuming that is true, it sounds like the benchmark doesn't
cover the whole costs associated with V2, which is why the
exceptional performance can be explained.

(Note: The table as read in
https://public-inbox.org/git/6367e30a-1b3a-4fe9-611b-d931f51effef@gmail.com/
seems line broken, using gmails web interface is not good
for ASCII art and patches, git-send-email would fare better)

>
> * Corrected commit-date is locally-computable and immutable,
>    but not backwards-compatible.

How are these dates not backwards incompatible?
We don't have to expose these dates to the user, but
- just like generation numbers - could store them and use them
but not tell the user about it.

We would need to be really careful to not expose them at all
as they look like the real dates, so that could make for an
awkward bug report.

The approach of "corrected commit date" sounds like we could
have a very lazy approach, i.e. no extra data structures needed
for many commits (as the corrected date equals the real date)
and only need to store the corrections for some commits.
Such an approach however would not make it easy to know
if we operate on corrected dates, or if we even checked them
for correctness before.

So if we'd have an additional row in the generation numbers file
telling the corrected date, then we should be able to be backwards
compatible?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Generation Number v2
  2018-10-29 19:22 ` Stefan Beller
@ 2018-10-29 20:06   ` Derrick Stolee
  2018-11-01 20:06   ` Jakub Narebski
  1 sibling, 0 replies; 17+ messages in thread
From: Derrick Stolee @ 2018-10-29 20:06 UTC (permalink / raw)
  To: Stefan Beller
  Cc: git, Jeff King, Junio C Hamano, Jakub Narębski,
	Derrick Stolee, Ævar Arnfjörð Bjarmason

On 10/29/2018 3:22 PM, Stefan Beller wrote:
>> Based on the performance results alone, we should remove minimum
>> generation numbers, (epoch, date) pairs, and FELINE index from
>> consideration. There are enough examples of these indexes performing
>> poorly.
>>
>> In contrast, maximum generation numbers and corrected commit
>> dates both performed quite well. They are frequently the top
>> two performing indexes, and rarely significantly different.
>>
>> The trade-off here now seems to be: which _property_ is more important,
>> locally-computable or backwards-compatible?
>>
>> * Maximum generation number is backwards-compatible but not
>>     locally-computable or immutable.
> These maximum generation numbers sound like the reverse of
> the generation numbers as they are currently implemented, i.e.
> we count all commits between the commit A and all heads.
> How would this impact creation of a commit?

The indexes listed here would be computed at the same time as a 
commit-graph (and only change on commit-graph writes).

This idea of having something depend on the "global state" isn't ideal 
(in my opinion) but it certainly works for the tests specified below.

> The current generation numbers can be lazily updated or not
> updated at all. In my understanding of the maximum generation
> numbers, a new commit would make these maximum generation
> numbers invalid (i.e. they have to be recomputed).

New commits would change the value you would get by the definition (on 
rewrite of the commit-graph) but don't violate the reachability index 
property (maxgen(A) < maxgen(B) =>
A can't reach B). So that doesn't effect their usefulness.

> Are there ways out by deferring the computation of maximum
> generation numbers while still being able to work with some
> commits that are un-accounted for?

Creating a commit that doesn't exist in the commit-graph file results in 
a commit with generation number INFINITY. The new commits can't be 
reached by commits with finite generation number.

> When recomputing these numbers, the current generation number
> (V0) has the property that already existing numbers can be re-used
> as-is. We only need to compute the numbers for new commits,
> and then insert this to the appropriate data structures (which is
> a separate problem, one could imagine a split generation
> numbers file like the split index)
>
> For the V2 maximum generation numbers, would we need to
> rewrite the numbers for all commits once we recompute them?
> Assuming that is true, it sounds like the benchmark doesn't
> cover the whole costs associated with V2, which is why the
> exceptional performance can be explained.
You're right, we need to recompute the entire list of maximum generation 
number values. However, you are a bit hung up on "maxgen" being a fixed 
function that is correct at all times. We could compute the "maxgen" for 
the set of commits that are written to the "split" commit-graph while 
leaving the base the same. There is one tricky part here: we need to 
initialize the values starting at "num commits in the combined 
commit-graph" (commits in the base and split portion). The minimum 
possible value in the split portion will be larger than the maximum 
possible value in the base portion. Since the (future) implementation of 
split commit-graphs would have all commit-parent edges pointing down 
into the base and not from the base into the split, this will continue 
to satisfy our reachability index property.

As for the cost: we need to do a topo-order walk and then a quick 
minimum-value check across all parents. This is O(N), and the constant 
is small. This may be a cost worth paying.

> (Note: The table as read in
> https://public-inbox.org/git/6367e30a-1b3a-4fe9-611b-d931f51effef@gmail.com/
> seems line broken, using gmails web interface is not good
> for ASCII art and patches, git-send-email would fare better)

Sorry for that. I copy-pasted into Thunderbird. Hopefully most users can 
redirect to the GitHub-rendered markdown if they have trouble.

>
>> * Corrected commit-date is locally-computable and immutable,
>>     but not backwards-compatible.
> How are these dates not backwards incompatible?
> We don't have to expose these dates to the user, but
> - just like generation numbers - could store them and use them
> but not tell the user about it.
>
> We would need to be really careful to not expose them at all
> as they look like the real dates, so that could make for an
> awkward bug report.

By "backwards-compatible" I mean that we could store them in the 
commit-graph file without breaking existing clients (or by introducing 
extra data).

We could easily add these corrected commit dates as an optional chunk, 
but that adds 8 bytes per commit, and we already use 8 bytes for the 
(generation, date) pairs.

The solution I made in my prototype is to replace the generation number 
with an offset for how much to add to the commit-date to get the 
corrected commit-date. However, an old client would interpret these 
offsets as generations and have incorrect output.

A potential way to get around this is to consider the pair (offset, 
date) and define the offset(C) to be the minimum of min { offset(P), 
date(P) - date(C) }. This would satisfy the requirements for the V0 
reachability index. I'm making a note to implement _this_ version and 
give it a test.

>
> The approach of "corrected commit date" sounds like we could
> have a very lazy approach, i.e. no extra data structures needed
> for many commits (as the corrected date equals the real date)
> and only need to store the corrections for some commits.
> Such an approach however would not make it easy to know
> if we operate on corrected dates, or if we even checked them
> for correctness before.
>
> So if we'd have an additional row in the generation numbers file
> telling the corrected date, then we should be able to be backwards
> compatible?
Yeah, an optional chunk with the corrected date would work, but be wasteful.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Generation Number v2
  2018-10-29 19:22 ` Stefan Beller
  2018-10-29 20:06   ` Derrick Stolee
@ 2018-11-01 20:06   ` Jakub Narebski
  2018-11-02  9:30     ` Jakub Narebski
  1 sibling, 1 reply; 17+ messages in thread
From: Jakub Narebski @ 2018-11-01 20:06 UTC (permalink / raw)
  To: Stefan Beller
  Cc: Derrick Stolee, git, Jeff King, Junio C Hamano, Derrick Stolee,
	Ævar Arnfjörð Bjarmason

Stefan Beller <sbeller@google.com> writes:

>> Based on the performance results alone, we should remove minimum
>> generation numbers, (epoch, date) pairs, and FELINE index from
>> consideration. There are enough examples of these indexes performing
>> poorly.
>>
>> In contrast, maximum generation numbers and corrected commit
>> dates both performed quite well. They are frequently the top
>> two performing indexes, and rarely significantly different.
>>
>> The trade-off here now seems to be: which _property_ is more important,
>> locally-computable or backwards-compatible?
>>
>> * Maximum generation number is backwards-compatible but not
>>    locally-computable or immutable.
>
> These maximum generation numbers sound like the reverse of
> the generation numbers as they are currently implemented, i.e.
> we count all commits between the commit A and all heads.

Actually

  maximum generation number =
  number of commits  -  reverse generation number

> How would this impact creation of a commit?
>
> The current generation numbers can be lazily updated or not
> updated at all. In my understanding of the maximum generation
> numbers, a new commit would make these maximum generation
> numbers invalid (i.e. they have to be recomputed).
>
> Are there ways out by deferring the computation of maximum
> generation numbers while still being able to work with some
> commits that are un-accounted for?

As I understand it, new commits added since commit-graph file was
created would get simply INFINITY maximum/maximal generation numbers (if
we were using reverse generation numbers, new commits would get ZERO for
generation number).

> When recomputing these numbers, the current generation number
> (V0) has the property that already existing numbers can be re-used
> as-is. We only need to compute the numbers for new commits,
> and then insert this to the appropriate data structures (which is
> a separate problem, one could imagine a split generation
> numbers file like the split index)

Sidenote: I wonder if there is some on-disk data structure with
similarly fast read, but easier update.

> For the V2 maximum generation numbers, would we need to
> rewrite the numbers for all commits once we recompute them?
> Assuming that is true, it sounds like the benchmark doesn't
> cover the whole costs associated with V2, which is why the
> exceptional performance can be explained.

Let's check it using a simple example

First, (minimum) parent-based generation numbers before and after
extending the commit graph:

  1   2     3   4     5   6   7    new
  1   2     3   4     5   -   -    old
  .---.-----.---.-----.---*---*
       \
        \   3   4     5   6        new
         \  3   4     5   6        old
          \-.---.-----.---.
                 \
                  \   5            new
                   \  -            old
                    \-*


Next, maximum generation numbers.  We start with 9 commits, and we end
up with 12 commits after the change

  6   7     8   9     10  11  12   new
  4   5     7   8     9   -   -    old
  .---.-----.---.-----.---*---*
       \
        \   9   10    11  12       new
         \  6   7     8   9        old
          \-.---.-----.---.
                 \
                  \   12           new
                   \  -            old
                    \-*


As you can see all maximum generation numbers got rewritten.

Though if instead using the number of commits, we use the maximum
generation number, or in other words the length of the longest path, we
get the following:

  1   2     3   4     5   6   7    new
  1   2     4   5     6   -   -    old
  .---.-----.---.-----.---*---*
       \
        \   4   5     6   7        new
         \  3   4     5   6        old
          \-.---.-----.---.
                 \
                  \   7            new
                   \  -            old
                    \-*

A bit better, but still much change in numbers.

[...]
>>
>> * Corrected commit-date is locally-computable and immutable,
>>    but not backwards-compatible.
>
> How are these dates not backwards incompatible?
> We don't have to expose these dates to the user, but
> - just like generation numbers - could store them and use them
> but not tell the user about it.
>
> We would need to be really careful to not expose them at all
> as they look like the real dates, so that could make for an
> awkward bug report.
>
> The approach of "corrected commit date" sounds like we could
> have a very lazy approach, i.e. no extra data structures needed
> for many commits (as the corrected date equals the real date)
> and only need to store the corrections for some commits.
> Such an approach however would not make it easy to know
> if we operate on corrected dates, or if we even checked them
> for correctness before.
>
> So if we'd have an additional row in the generation numbers file
> telling the corrected date, then we should be able to be backwards
> compatible?

Here, from what I understand, being "backwards compatibile" means being
able to use commit-graph file using the new format by old client,
trating the data as if they were generation numbers and not data
offsets.

Best,
--
Jakub Narębski

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Generation Number v2
  2018-11-01 20:06   ` Jakub Narebski
@ 2018-11-02  9:30     ` Jakub Narebski
  2018-11-03 17:27       ` Jakub Narebski
  0 siblings, 1 reply; 17+ messages in thread
From: Jakub Narebski @ 2018-11-02  9:30 UTC (permalink / raw)
  To: Stefan Beller
  Cc: Derrick Stolee, git, Jeff King, Junio C Hamano, Derrick Stolee,
	Ævar Arnfjörð Bjarmason

Jakub Narebski <jnareb@gmail.com> writes:
> Stefan Beller <sbeller@google.com> writes:
[...]
>> How would this impact creation of a commit?
>>
>> The current generation numbers can be lazily updated or not
>> updated at all. In my understanding of the maximum generation
>> numbers, a new commit would make these maximum generation
>> numbers invalid (i.e. they have to be recomputed).
[...]
>> For the V2 maximum generation numbers, would we need to
>> rewrite the numbers for all commits once we recompute them?
>> Assuming that is true, it sounds like the benchmark doesn't
>> cover the whole costs associated with V2, which is why the
>> exceptional performance can be explained.
>
> Let's check it using a simple example
>
> First, (minimum) parent-based generation numbers before and after
> extending the commit graph:
>
>   1   2     3   4     5   6   7    new
>   1   2     3   4     5   -   -    old
>   .---.-----.---.-----.---*---*
>        \
>         \   3   4     5   6        new
>          \  3   4     5   6        old
>           \-.---.-----.---.
>                  \
>                   \   5            new
>                    \  -            old
>                     \-*

Let me check yet another idea, using (minimum) parent-based V0 generation
numbers (counting distance from the sink / root) as a starting number
for source / heads commits.

    1   2     3   4     5   6   7    new
    1   2     3   4     5   -   -    old
    .---.-----.---.-----.---*---*
         \
          \   3   4     5   6        new
           \  3   4     5   6        old
            \-.---.-----.---.
                   \
                    \   5            new
                     \  -            old
                      \-*


Well, on this example it looks like this variant of maximum generation
numbers can be done incrementally, but let's check another example

   1   2     3   4   5   6     7   8   9       new
   1   2     3   4   5   6     7   8   -       old
   .---.-----.---.---.---.-----.---.---*
        \                     /
         \   3   4           / 5   6   7   8   new
          \  5   6          /  -   -   -   -   old
           \-.---.---------/---*---*---*---*

It looks like it doesn; give as good results as I thought.  Less values
are changed, but you would still need to recalculate them, unless it can
be proven that they do not need it.

>
> Next, maximum generation numbers.  We start with 9 commits, and we end
> up with 12 commits after the change
>
>   6   7     8   9     10  11  12   new
>   4   5     7   8     9   -   -    old
>   .---.-----.---.-----.---*---*
>        \
>         \   9   10    11  12       new
>          \  6   7     8   9        old
>           \-.---.-----.---.
>                  \
>                   \   12           new
>                    \  -            old
>                     \-*
>
>
> As you can see all maximum generation numbers got rewritten.
>
> Though if instead using the number of commits, we use the maximum
> generation number, or in other words the length of the longest path, we
> get the following:
>
>   1   2     3   4     5   6   7    new
>   1   2     4   5     6   -   -    old
>   .---.-----.---.-----.---*---*
>        \
>         \   4   5     6   7        new
>          \  3   4     5   6        old
>           \-.---.-----.---.
>                  \
>                   \   7            new
>                    \  -            old
>                     \-*
>
> A bit better, but still much change in numbers.

-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Generation Number v2
  2018-11-02  9:30     ` Jakub Narebski
@ 2018-11-03 17:27       ` Jakub Narebski
  0 siblings, 0 replies; 17+ messages in thread
From: Jakub Narebski @ 2018-11-03 17:27 UTC (permalink / raw)
  To: Stefan Beller
  Cc: Derrick Stolee, git, Jeff King, Junio C Hamano, Derrick Stolee,
	Ævar Arnfjörð Bjarmason

Jakub Narebski <jnareb@gmail.com> writes:
> Jakub Narebski <jnareb@gmail.com> writes:
>> Stefan Beller <sbeller@google.com> writes:
> [...]
>>> How would this impact creation of a commit?
>>>
>>> The current generation numbers can be lazily updated or not
>>> updated at all. In my understanding of the maximum generation
>>> numbers, a new commit would make these maximum generation
>>> numbers invalid (i.e. they have to be recomputed).
> [...]
>>> For the V2 maximum generation numbers, would we need to
>>> rewrite the numbers for all commits once we recompute them?
>>> Assuming that is true, it sounds like the benchmark doesn't
>>> cover the whole costs associated with V2, which is why the
>>> exceptional performance can be explained.
>>
>> Let's check it using a simple example
>>
>> First, (minimum) parent-based generation numbers before and after
>> extending the commit graph:
>>
>>   1   2     3   4     5   6   7    new
>>   1   2     3   4     5   -   -    old
>>   .---.-----.---.-----.---*---*
>>        \
>>         \   3   4     5   6        new
>>          \  3   4     5   6        old
>>           \-.---.-----.---.
>>                  \
>>                   \   5            new
>>                    \  -            old
>>                     \-*
>
> Let me check yet another idea, using (minimum) parent-based V0 generation
> numbers (counting distance from the sink / root) as a starting number
> for source / heads commits.
[...]

> [...] but let's check another example
>
>    1   2     3   4   5   6     7   8   9       new
>    1   2     3   4   5   6     7   8   -       old
>    .---.-----.---.---.---.-----.---.---*
>         \                     /
>          \   3   4           / 5   6   7   8   new
>           \  5   6          /  -   -   -   -   old
>            \-.---.---------/---*---*---*---*

But let's do this correctly.


   1   2     3   4      5   6     7   8   9      new
   1   2     3   4      5   6     7   8   -      old
   .---.-----.---.------.---.-----.---.---*
        \                        /
         \   3   4              /                new
          \  5   6             /                 old
           \-.---.------------/
                  \
                   \    5   6     7   8          new
                    \   -   -     -   -          old
                     \--*---*-----*---*

Well, it looks as if I draw it incorrectly, but performed calculations
right.  You may need to modify / change some data, but it looks as if it
is not that much of a problem.

The new version of the maximum generation numbers looks like it gives
the same results as generation numbers for the "longest" path, and
update may affect only the side-branches that were added to.  All
branches merged into the trunk, and not added to should be safe with
respect to updating.

Can anyone here prove a thing about update of those modified maximum
generation numbers?  Thanks in advance.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Generation Number v2
  2018-10-29 16:55 [RFC] Generation Number v2 Derrick Stolee
  2018-10-29 19:22 ` Stefan Beller
@ 2018-10-29 20:25 ` Derrick Stolee
  2018-11-01 22:13   ` Jakub Narebski
  2018-10-30  3:59 ` Junio C Hamano
  2018-11-01 12:27 ` Jakub Narebski
  3 siblings, 1 reply; 17+ messages in thread
From: Derrick Stolee @ 2018-10-29 20:25 UTC (permalink / raw)
  To: Git List, Jeff King, Junio C Hamano, Jakub Narębski,
	Derrick Stolee, Ævar Arnfjörð Bjarmason,
	Stefan Beller

Here is a re-formatted version of the tables I introduced earlier.
The tables were too wide for public-inbox to render correctly (when
paired with my email client). Hopefully this bulleted-list format
works better. Thanks, Stefan, for pointing out the rendering
problems!

### Test 1: `git log --topo-order -N`

This test focuses on the number of commits that are parsed during
a `git log --topo-order` before writing `N` commits to output.

You can reproduce this test using `topo-order-tests.sh` and
see all the data in `topo-order-summary.txt`. The values reported
here are a sampling of the data, ignoring tests where all values
were the same or extremely close in value.

android-base, N = 100
     V0: 5487
     V1: 8534
     V2: 6937
     V3: 6419
     V4: 6453
android-base, N = 1000
     V0: 36029
     V1: 44030
     V2: 41493
     V3: 41206
     V4: 45431
chromium, N = 100
     V0:    101
     V1: 424406
     V2:    101
     V3:    101
     V4:    101
gerrit, N = 100
     V0: 8212
     V1: 8533
     V2:  164
     V3:  159
     V4:  162
gerrit, N = 1000
     V0: 8512
     V1: 8533
     V2: 1990
     V3: 1973
     V4: 3766
Linux, N = 100
     V0: 12458
     V1: 12444
     V2: 13683
     V3: 13123
     V4: 13124
Linux, N = 1000
     V0: 24436
     V1: 26247
     V2: 27878
     V3: 26430
     V4: 27875
Linux, N = 10000
     V0: 30364
     V1: 28891
     V2: 27878
     V3: 26430
     V4: 27875
electron, N = 1000
     V0: 19927
     V1: 18733
     V2:  1072
     V3: 18214
     V4: 18214
Ffmpeg, N = 10000
     V0: 32154
     V1: 47429
     V2: 10435
     V3: 11054
     V4: 11054
jgit, N = 1000
     V0: 1550
     V1: 6264
     V2: 1067
     V3: 1060
     V4: 1233
julia, N = 10000
     V0: 43043
     V1: 43043
     V2: 10201
     V3: 23505
     V4: 23828
odoo, N = 1000
     V0: 17175
     V1:  9714
     V2:  4043
     V3:  4046
     V4:  4111
php-src, N = 1000
     V0: 19014
     V1: 27530
     V2:  1311
     V3:  1305
     V4:  1320
rails, N = 100
     V0: 1420
     V1: 2041
     V2: 1757
     V3: 1428
     V4: 1441
rails, N = 1000
     V0:  7952
     V1: 10145
     V2: 10053
     V3:  8373
     V4:  8373
swift, N = 1000
     V0: 1914
     V1: 4004
     V2: 2071
     V3: 1939
     V4: 1940
tensorflow, N = 1000
     V0: 10019
     V1: 39221
     V2:  6711
     V3: 10051
     V4: 10051
TypeScript, N = 1000
     V0:  2873
     V1: 12014
     V2:  3049
     V3:  2876
     V4:  2876


### Test 2: `git log --topo-order -10 A..B`

This test focuses on the number of commits that are parsed during
a `git log --topo-order A..B` before writing ten commits to output.
Since we fix a very small set of output commits, we care more about
the part of the walk that determines which commits are reachable
from `B` but not reachable from `A`. This part of the walk uses
commit date as a heuristic in the existing implementation.

You can reproduce this test using `topo-compare-tests.sh` and
see all the data in `topo-compare-summary.txt`. The values reported
here are a sampling of the data, ignoring tests where all values
were the same or extremely close in value.

_Note:_ For some of the rows, the `A` and `B` values may be
swapped from what is expected. This is due to (1) a bug in the
reference implementation that doesn't short-circuit the walk
when `A` can reach `B`, and (2) data-entry errors by the author.
The bug can be fixed, but would have introduced strange-looking
rows in this table.

android-base 53c1972bc8f 92f18ac3e39
   OLD: 39403
    V0:  1544
    V1:  6957
    V2:    26
    V3:  1015
    V4:  1098
gerrit c4311f7642 777a8cd1e0
   OLD:  6457
    V0:  7836
    V1: 10869
    V2:   415
    V3:   414
    V4:   445
electron 7da7dd85e addf069f2
   OLD: 18164
    V0:   945
    V1:  6528
    V2:    17
    V3:    17
    V4:    18
julia 7faee1b201 e2022b9f0f
   OLD: 22800
    V0:  4221
    V1: 12710
    V2:   377
    V3:   213
    V4:   213
julia ae69259cd9 c8b5402afc
   OLD:  1864
    V0:  1859
    V1: 13287
    V2:    12
    V3:  1859
    V4:  1859
Linux 69973b830859 c470abd4fde4
   OLD: 111692
    V0:  77263
    V1:  96598
    V2:  80238
    V3:  76332
    V4:  76495
Linux c3b92c878736 19f949f52599
   OLD: 167418
    V0:   5736
    V1:   4684
    V2:   9675
    V3:   3887
    V4:   3923
Linux c8d2bc9bc39e 69973b830859
   OLD: 44940
    V0:  4056
    V1: 16636
    V2: 10405
    V3:  3475
    V4:  4022
odoo 4a31f55d0a0 93fb2b4a616
   OLD: 25139
    V0: 19528
    V1: 20418
    V2: 19874
    V3: 19634
    V4: 27247
swift 4046359efd b34b6a14c7
   OLD: 13411
    V0:   662
    V1:   321
    V2:    12
    V3:    80
    V4:   134
tensorflow ec6d17219c fa1db5eb0d
   OLD: 10373
    V0:  4762
    V1: 36272
    V2:   174
    V3:  3631
    V4:  3632
TypeScript 35ea2bea76 123edced90
   OLD:  3450
    V0:   267
    V1: 10386
    V2:    27
    V3:   259
    V4:   259


### Test 3: `git merge-base A B`

This test focuses on the number of commits that are parsed during
a `git merge-base A B`. This part of the walk uses commit date as
a heuristic in the existing implementation.

You can reproduce this test using `merge-base-tests.sh` and
see all the data in `merge-base-summary.txt`. The values reported
here are a sampling of the data, ignoring tests where all values
were the same or extremely close in value.

android-base 53c1972bc8f 92f18ac3e39
   OLD:  81999
    V0: 109025
    V1:  81885
    V2:  77475
    V3:  81999
    V4:  82001
gerrit c4311f7642 777a8cd1e0
   OLD: 6468
    V0: 7995
    V1: 6566
    V2: 6478
    V3: 6468
    V4: 6468
electron 7da7dd85e addf069f2
   OLD: 18160
    V0: 19871
    V1: 18670
    V2:  2231
    V3: 18160
    V4: 18160
julia 7faee1b201 e2022b9f0f
   OLD: 22803
    V0: 42339
    V1: 42212
    V2:  6803
    V3: 22803
    V4: 22803
julia c8b5402afc ae69259cd9
   OLD:  7076
    V0: 42909
    V1: 42770
    V2:  2690
    V3:  7076
    V4:  7076
Linux 69973b830859 c470abd4fde4
   OLD: 44984
    V0: 47457
    V1: 44679
    V2: 38461
    V3: 44984
    V4: 44984
Linux c3b92c878736 19f949f52599
   OLD: 111740
    V0: 111027
    V1: 111196
    V2: 107835
    V3: 111771
    V4: 111368
Linux c8d2bc9bc39e 69973b830859
   OLD: 167468
    V0: 635579
    V1: 630138
    V2:  33716
    V3: 167496
    V4: 153774
odoo 4a31f55d0a0 93fb2b4a616
   OLD: 25150
    V0: 27259
    V1: 23977
    V2: 24041
    V3: 23974
    V4: 26829
swift 4046359efd b34b6a14c7
   OLD: 13434
    V0: 13254
    V1: 13940
    V2: 16023
    V3: 13127
    V4: 15008
tensorflow ec6d17219c fa1db5eb0d
   OLD: 10377
    V0: 10448
    V1: 10377
    V2:  8460
    V3: 10377
    V4: 10377
TypeScript 35ea2bea76 123edced90
   OLD: 3464
    V0: 3439
    V1: 3464
    V2: 3581
    V3: 3464
    V4: 3464


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Generation Number v2
  2018-10-29 20:25 ` Derrick Stolee
@ 2018-11-01 22:13   ` Jakub Narebski
  0 siblings, 0 replies; 17+ messages in thread
From: Jakub Narebski @ 2018-11-01 22:13 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git List, Jeff King, Junio C Hamano, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Stefan Beller

Derrick Stolee <stolee@gmail.com> writes:

> Here is a re-formatted version of the tables I introduced earlier.
> The tables were too wide for public-inbox to render correctly (when
> paired with my email client). Hopefully this bulleted-list format
> works better. Thanks, Stefan, for pointing out the rendering
> problems!
>
> ### Test 1: `git log --topo-order -N`
>
> This test focuses on the number of commits that are parsed during
> a `git log --topo-order` before writing `N` commits to output.
>
> You can reproduce this test using `topo-order-tests.sh` and
> see all the data in `topo-order-summary.txt`. The values reported
> here are a sampling of the data, ignoring tests where all values
> were the same or extremely close in value.
>
> android-base, N = 100
>     V0: 5487
>     V1: 8534
>     V2: 6937
>     V3: 6419
>     V4: 6453
[...]

> ### Test 2: `git log --topo-order -10 A..B`
[...]
> android-base 53c1972bc8f 92f18ac3e39
>   OLD: 39403
>    V0:  1544
>    V1:  6957
>    V2:    26
>    V3:  1015
>    V4:  1098

Two minor issues.  First, now that the data is not in the table format,
where every bit of horizontal space matters, why not spell the names of
new proposed "generation numbers" (or rather reachability orderings) in
full, instead of using V0...V4 shortcuts?  It is not as if we were short
of space.

Second, why OLD data is present in tests 2 and 3, but not in test 1?

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Generation Number v2
  2018-10-29 16:55 [RFC] Generation Number v2 Derrick Stolee
  2018-10-29 19:22 ` Stefan Beller
  2018-10-29 20:25 ` Derrick Stolee
@ 2018-10-30  3:59 ` Junio C Hamano
  2018-10-31 12:30   ` Derrick Stolee
  2018-10-31 12:54   ` Ævar Arnfjörð Bjarmason
  2018-11-01 12:27 ` Jakub Narebski
  3 siblings, 2 replies; 17+ messages in thread
From: Junio C Hamano @ 2018-10-30  3:59 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git List, Jeff King, Jakub Narębski, Derrick Stolee,
	Ævar Arnfjörð Bjarmason

Derrick Stolee <stolee@gmail.com> writes:

> **V3: Corrected Commit Date.**
> For a commit C, let its _corrected commit date_ (denoted by cdate(C))
> be the maximum of the commit date of C and the commit dates of its
> parents.

"maximum of the commit date of C and the corrected commit dates of
its parents"?

We've talked about exactly this one in the past (long before any of
Microsoft folks other than Dscho came to the scene) but without an
implementation, or detailed design and analysis.  I am very happy to
see the idea among other possibilities to be considered again.  This
time around, we may finally come up with something better than the
"commit dates with SLOP" thing ;-).

> Essentially, the felineY order is selected with the goal of swapping
> positions of topologically-independent commits relative to the felinX
> ordering. The resulting reachability index is as follows:
>
>    If felineX(A) < felineY(B), then A cannot reach B.
>    If felineY(A) < felineY(B), then A cannot reach B.

I presume that the first line is a typo and you compare the same X index?

> * **Compatible?** In our test implementation, we use a previously unused
>   byte of data in the commit-graph format to indicate which reachability
>   index version we are using. Existing clients ignore this value, so we
>   will want to consider if these new indexes are _backwards compatible_.
>   That is, will they still report correct values if they ignore this byte
>   and use the generation number column from the commit-graph file assuming
>   the values are minimum generation numbers?

I personally consider that the current commit-graph with generation
numbers experimental, so I am not sure how much we care about this.

Having said that.

By the above definition, any new index that is wider than the
current generation number cannot be compatible (can we even tell the
existing clients how wide each elements in the ix array is?)

In any case, perhaps the first thing to do is to update the clients
so that they stop ignoring the version number field, and instead
work without generation number when there is no version of reach.ix
available in the file?  That way, a better reachablility index can
be chosen freely without having to worry about the compatibility.

> * **Immutable?** Git objects are _immutable_. If you change an object you
>   actually create a new object with a new object ID. Are the values we
> store
>   for these reachability indexes also immutable?

Even if we do not embed the reachability ix in commit objects,
having an immutable value is probably a must if we want to make them
incrementally computable, so this is a very good property to have.
Unless there is a clever idea to incrementally compute a mutable
reach.ix, my gut instinct says that this property is a must.

Another thing, perhaps related to "local" below, is if exactly the
same reach.ix is computed by anybody, given an identical commit
history graph (perhaps "reproducibility"?).  I think most of the
candidates you listed are reproducible without a fixed tie-breaker,
but I am not sure about felineY() thing.

> * **Local?** Are these values **locally computable**? That is, do we only
>   need to look at the parents of a commit (assuming those parents have
>   computed values) in order to determine the value at that commit?

A subset of non-local reachability ix, for example, the ones that
need to know what each commit's children are, cannot be immutable,
as adding new objects to the graph (either with locally committing,
or transferring objects from other repositories) would affect the
ix; is this true for all non-local reachability ix, I wonder?

> We focused on three types of performance tests that test the indexes
> in different ways. Each test lists the `git` command that is used,
> and the table lists which repository is used and which inputs.
>
> ### Test 1: `git log --topo-order -N`
>
> This test focuses on the number of commits that are parsed during
> a `git log --topo-order` before writing `N` commits to output.

A devil's advocate comment.  Your patches seem to be very focused on
this "unlimited" case for the past few weeks, but I am not so sure
if that is a case worth optimizing for.  If "git log --topo-order -N
HEAD~M.." (for some number M) gives us a similar result as unlimited
case but with much less effort, wouldn't it be good enough that lets
us concentrate on other use cases instead?

> Based on the performance results alone, we should remove minimum
> generation numbers, (epoch, date) pairs, and FELINE index from
> consideration. There are enough examples of these indexes performing
> poorly.

OK.

> In contrast, maximum generation numbers and corrected commit
> dates both performed quite well. They are frequently the top
> two performing indexes, and rarely significantly different.
>
> The trade-off here now seems to be: which _property_ is more important,
> locally-computable or backwards-compatible?

Nice summary.  

As I already said, I personally do not think being compatible with
currently deployed clients is important at all (primarily because I
still consider the whole thing experimental), and there is a clear
way forward once we correct the mistake of not having a version
number in the file format that tells the updated clients to ignore
the generation numbers.  For longer term viability, we should pick
something that is immutable, reproducible, computable with minimum
input---all of which would lead to being incrementally computable, I
would think.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Generation Number v2
  2018-10-30  3:59 ` Junio C Hamano
@ 2018-10-31 12:30   ` Derrick Stolee
  2018-11-02 13:33     ` Jakub Narebski
  2018-10-31 12:54   ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 17+ messages in thread
From: Derrick Stolee @ 2018-10-31 12:30 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Git List, Jeff King, Jakub Narębski, Derrick Stolee,
	Ævar Arnfjörð Bjarmason

On 10/29/2018 11:59 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
>
>> **V3: Corrected Commit Date.**
>> For a commit C, let its _corrected commit date_ (denoted by cdate(C))
>> be the maximum of the commit date of C and the commit dates of its
>> parents.
> "maximum of the commit date of C and the corrected commit dates of
> its parents"?

That's what I mean. Thanks.

>
> We've talked about exactly this one in the past (long before any of
> Microsoft folks other than Dscho came to the scene) but without an
> implementation, or detailed design and analysis.  I am very happy to
> see the idea among other possibilities to be considered again.  This
> time around, we may finally come up with something better than the
> "commit dates with SLOP" thing ;-).
>
>> Essentially, the felineY order is selected with the goal of swapping
>> positions of topologically-independent commits relative to the felinX
>> ordering. The resulting reachability index is as follows:
>>
>>     If felineX(A) < felineY(B), then A cannot reach B.
>>     If felineY(A) < felineY(B), then A cannot reach B.
> I presume that the first line is a typo and you compare the same X index?

Yes, sorry for the typos. I fixed them in the report on GitHub.

>
>> * **Compatible?** In our test implementation, we use a previously unused
>>    byte of data in the commit-graph format to indicate which reachability
>>    index version we are using. Existing clients ignore this value, so we
>>    will want to consider if these new indexes are _backwards compatible_.
>>    That is, will they still report correct values if they ignore this byte
>>    and use the generation number column from the commit-graph file assuming
>>    the values are minimum generation numbers?
> I personally consider that the current commit-graph with generation
> numbers experimental, so I am not sure how much we care about this.
>
> Having said that.
>
> By the above definition, any new index that is wider than the
> current generation number cannot be compatible (can we even tell the
> existing clients how wide each elements in the ix array is?)
>
> In any case, perhaps the first thing to do is to update the clients
> so that they stop ignoring the version number field, and instead
> work without generation number when there is no version of reach.ix
> available in the file?  That way, a better reachablility index can
> be chosen freely without having to worry about the compatibility.

I can work on that. It should be as simple as setting commit->generation to
GENERATION_NUMBER_ZERO in fill_commit_in_graph when the graph
has a different version.

>
>> * **Immutable?** Git objects are _immutable_. If you change an object you
>>    actually create a new object with a new object ID. Are the values we
>> store
>>    for these reachability indexes also immutable?
> Even if we do not embed the reachability ix in commit objects,
> having an immutable value is probably a must if we want to make them
> incrementally computable, so this is a very good property to have.
> Unless there is a clever idea to incrementally compute a mutable
> reach.ix, my gut instinct says that this property is a must.
>
> Another thing, perhaps related to "local" below, is if exactly the
> same reach.ix is computed by anybody, given an identical commit
> history graph (perhaps "reproducibility"?).  I think most of the
> candidates you listed are reproducible without a fixed tie-breaker,
> but I am not sure about felineY() thing.
>
>> * **Local?** Are these values **locally computable**? That is, do we only
>>    need to look at the parents of a commit (assuming those parents have
>>    computed values) in order to determine the value at that commit?
> A subset of non-local reachability ix, for example, the ones that
> need to know what each commit's children are, cannot be immutable,
> as adding new objects to the graph (either with locally committing,
> or transferring objects from other repositories) would affect the
> ix; is this true for all non-local reachability ix, I wonder?

As a thought experiment, we could define a function size(C) to be the
numberof commits reachable from C. This is not locally-computable
from the size values at C's parents due to the inclusion-exclusion
principle. We would need to compute it by walking the reachable set
and counting (resulting in quadratic performance overall) but is
immutable. Since the performance cost is so expensive (unlike the
linear costs in the other non-local versions) I didn't include it
in my comparison.

>
>> We focused on three types of performance tests that test the indexes
>> in different ways. Each test lists the `git` command that is used,
>> and the table lists which repository is used and which inputs.
>>
>> ### Test 1: `git log --topo-order -N`
>>
>> This test focuses on the number of commits that are parsed during
>> a `git log --topo-order` before writing `N` commits to output.
> A devil's advocate comment.  Your patches seem to be very focused on
> this "unlimited" case for the past few weeks, but I am not so sure
> if that is a case worth optimizing for.  If "git log --topo-order -N
> HEAD~M.." (for some number M) gives us a similar result as unlimited
> case but with much less effort, wouldn't it be good enough that lets
> us concentrate on other use cases instead?

I mostly included these statistics to make sure we didn't add a 
regression in
this case. Note that I didn't report the OLD values in this table, 
because that
would be an unfair comparison.

>> Based on the performance results alone, we should remove minimum
>> generation numbers, (epoch, date) pairs, and FELINE index from
>> consideration. There are enough examples of these indexes performing
>> poorly.
> OK.
>
>> In contrast, maximum generation numbers and corrected commit
>> dates both performed quite well. They are frequently the top
>> two performing indexes, and rarely significantly different.
>>
>> The trade-off here now seems to be: which _property_ is more important,
>> locally-computable or backwards-compatible?
> Nice summary.
>
> As I already said, I personally do not think being compatible with
> currently deployed clients is important at all (primarily because I
> still consider the whole thing experimental), and there is a clear
> way forward once we correct the mistake of not having a version
> number in the file format that tells the updated clients to ignore
> the generation numbers.  For longer term viability, we should pick
> something that is immutable, reproducible, computable with minimum
> input---all of which would lead to being incrementally computable, I
> would think.

This is good reasoning. The "reproducible" property is also important for
support reasons, too! Sounds like the corrected commit date is the best
way forward.

Aside: I spent some time thinking about making the corrected commit dates
backward compatible by ensuring the offsets are monotonic in the commit
history (so we could store the offset as commit->generation and the existing
generation comparisons would still work). However, it performs poorly on the
Linux repository 'git merge-base v4.8 v4.9' example.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Generation Number v2
  2018-10-31 12:30   ` Derrick Stolee
@ 2018-11-02 13:33     ` Jakub Narebski
  0 siblings, 0 replies; 17+ messages in thread
From: Jakub Narebski @ 2018-11-02 13:33 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Junio C Hamano, Git List, Jeff King, Derrick Stolee,
	Ævar Arnfjörð Bjarmason

Derrick Stolee <stolee@gmail.com> writes:
> On 10/29/2018 11:59 PM, Junio C Hamano wrote:
>> Derrick Stolee <stolee@gmail.com> writes:
[...]
>>> * **Compatible?** In our test implementation, we use a previously unused
>>>    byte of data in the commit-graph format to indicate which reachability
>>>    index version we are using. Existing clients ignore this value, so we
>>>    will want to consider if these new indexes are _backwards compatible_.
>>>    That is, will they still report correct values if they ignore this byte
>>>    and use the generation number column from the commit-graph file assuming
>>>    the values are minimum generation numbers?
>>
>> I personally consider that the current commit-graph with generation
>> numbers experimental, so I am not sure how much we care about this.
>>
>> Having said that.
>>
>> By the above definition, any new index that is wider than the
>> current generation number cannot be compatible (can we even tell the
>> existing clients how wide each elements in the ix array is?)
>>
>> In any case, perhaps the first thing to do is to update the clients
>> so that they stop ignoring the version number field, and instead
>> work without generation number when there is no version of reach.ix
>> available in the file?  That way, a better reachablility index can
>> be chosen freely without having to worry about the compatibility.
>
> I can work on that. It should be as simple as setting commit->generation to
> GENERATION_NUMBER_ZERO in fill_commit_in_graph when the graph
> has a different version.

That is a very good idea, but we have here a chicken-and-egg problem.
The only way to denote that the generation numbers / reachability index
have changed (for reading/using: that it changed in backwards
incompatibile way; for update: that it changed at all) is to change the
format version:

  1-byte version number:
      Currently, the only valid version is 1.

But if we assume that different commit-graph format version simply means
that commit->generation is to be set to GENERATION_NUMBER_ZERO, then we
block possible changes to the structure of the commit-graph file.

[Reads further in thread]

Ah, I see that's why you want to introduce [1]:

DS>
DS> 3. Reachability Index versioning

[1]: https://public-inbox.org/git/6902dbff-d9f6-e897-2c20-d0cb47a50795@gmail.com/

>>> * **Immutable?** Git objects are _immutable_. If you change an object you
>>>    actually create a new object with a new object ID. Are the values we store
>>>    for these reachability indexes also immutable?
>>
>> Even if we do not embed the reachability ix in commit objects,
>> having an immutable value is probably a must if we want to make them
>> incrementally computable, so this is a very good property to have.
>> Unless there is a clever idea to incrementally compute a mutable
>> reach.ix, my gut instinct says that this property is a must.

I think that the reachability index can be mutable and non-local, but
still being able to be incrementally updated, though perhaps it would
require not only calculations for new commits, but recalculations for
some limited subset of commits existing in the commit-graph file.

I'm trying to modify exact definition of maximal generation numbers to
check if I can find out a version that exhibits nearly-incremental
updates property.

>> Another thing, perhaps related to "local" below, is if exactly the
>> same reach.ix is computed by anybody, given an identical commit
>> history graph (perhaps "reproducibility"?).  I think most of the
>> candidates you listed are reproducible without a fixed tie-breaker,
>> but I am not sure about felineY() thing.

felineY() is deterministic (or can be made deterministic) for given
felineX() (i.e. given [reverse] topological order).

>>> * **Local?** Are these values **locally computable**? That is, do we only
>>>    need to look at the parents of a commit (assuming those parents have
>>>    computed values) in order to determine the value at that commit?
>>
>> A subset of non-local reachability ix, for example, the ones that
>> need to know what each commit's children are, cannot be immutable,
>> as adding new objects to the graph (either with locally committing,
>> or transferring objects from other repositories) would affect the
>> ix; is this true for all non-local reachability ix, I wonder?
>
> As a thought experiment, we could define a function size(C) to be the
> number of commits reachable from C.

Let's call it reach(C), instead ;-)

>                                    This is not locally-computable
> from the size values at C's parents due to the inclusion-exclusion
> principle. We would need to compute it by walking the reachable set
> and counting (resulting in quadratic performance overall) but is
> immutable. Since the performance cost is so expensive (unlike the
> linear costs in the other non-local versions) I didn't include it
> in my comparison.

Let's define Reach(C) as set of all commits reachable from commit C.
Then reach(C) = ||Reach(C)||, where ||S|| is number of elements in the
set S.

If commit A can reach commit B, then B ∈ Reach(A), but also Reach(B) ⊂
Reach(A), thus reach(B) < reach(A).  Therefore if reach(A) <= reach(B)
then A cannot reach commit B -- reach(C) can be used as reachability
index.

However the performance cost of calculating reach(C), i.e. full
traversal of the commit graph without ability for incremental update
(well, you can do incremental update starting from the commits that have
reachability bitmap [2]) is prohibitive.  Never mind the fact that this
index would have, I think, the same performance problems as (minimum)
generaton numbers.

[2]: https://githubengineering.com/counting-objects/

SIDENOTE 1: as gen(C) can be thought of as the lower boundary on
reach(C), similarly sumgen(C) defined as being 1 for parent-less
commits, and being (sum_{P ∈ parents(C)} sumgen(P)) + 1 otherwise being
the upper boundary on reach(C).  This sumgen(C) can also be used as
generation number / reachability index.

   gen(C) <= reach(C) <= sumgen(C)

SIDENOTE 2: The idea of using a proxy for Reach(B) ⊆ Reach(A) as a
negative-cut filter (negative-cut reachability index) is used in two
types of reachability indices: one is Independent Permutation Labelling
(IP+) [3] approach, the other is Bloom Filter Labelling (BFL) [4].  Both
algorithms are local, and with given permutation (which can be replaced
by object ID, I think) for IP+ and given set of hash functions etc. for
BFL they are both immutable.

[3] Hao Wei, Jeffrey Xu Yu, Can Lu, Ruoming Jin "Reachability Querying:
    An Independent Permutation Labeling Approach", Proceedings of the
    VLDB Endowment, Vol. 7, No. 12 (2014)

[4] Jiao Su, Qing Zhu, Hao Wei, Jeffrey Xu Yu "Reachability Querying:
    Can It Be Even Faster?", IEEE Transactions on Knowledge and Data
    Engineering (2016)

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Generation Number v2
  2018-10-30  3:59 ` Junio C Hamano
  2018-10-31 12:30   ` Derrick Stolee
@ 2018-10-31 12:54   ` Ævar Arnfjörð Bjarmason
  2018-10-31 13:04     ` Derrick Stolee
  1 sibling, 1 reply; 17+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-10-31 12:54 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Derrick Stolee, Git List, Jeff King, Jakub Narębski,
	Derrick Stolee


On Tue, Oct 30 2018, Junio C Hamano wrote:

> Derrick Stolee <stolee@gmail.com> writes:
>> In contrast, maximum generation numbers and corrected commit
>> dates both performed quite well. They are frequently the top
>> two performing indexes, and rarely significantly different.
>>
>> The trade-off here now seems to be: which _property_ is more important,
>> locally-computable or backwards-compatible?
>
> Nice summary.
>
> As I already said, I personally do not think being compatible with
> currently deployed clients is important at all (primarily because I
> still consider the whole thing experimental), and there is a clear
> way forward once we correct the mistake of not having a version
> number in the file format that tells the updated clients to ignore
> the generation numbers.  For longer term viability, we should pick
> something that is immutable, reproducible, computable with minimum
> input---all of which would lead to being incrementally computable, I
> would think.

I think it depends on what we mean by backwards compatibility. None of
our docs are saying this is experimental right now, just that it's
opt-in like so many other git-config(1) options.

So if we mean breaking backwards compatibility in that we'll write a new
file or clobber the existing one with a version older clients can't use
as an optimization, fine.

But it would be bad to produce a hard error on older clients, but
avoiding that seems as easy as just creating a "commit-graph2" file in
.git/objects/info/.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Generation Number v2
  2018-10-31 12:54   ` Ævar Arnfjörð Bjarmason
@ 2018-10-31 13:04     ` Derrick Stolee
  2018-11-02 17:44       ` Jakub Narebski
  0 siblings, 1 reply; 17+ messages in thread
From: Derrick Stolee @ 2018-10-31 13:04 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, Junio C Hamano
  Cc: Git List, Jeff King, Jakub Narębski, Derrick Stolee

On 10/31/2018 8:54 AM, Ævar Arnfjörð Bjarmason wrote:
> On Tue, Oct 30 2018, Junio C Hamano wrote:
>
>> Derrick Stolee <stolee@gmail.com> writes:
>>> In contrast, maximum generation numbers and corrected commit
>>> dates both performed quite well. They are frequently the top
>>> two performing indexes, and rarely significantly different.
>>>
>>> The trade-off here now seems to be: which _property_ is more important,
>>> locally-computable or backwards-compatible?
>> Nice summary.
>>
>> As I already said, I personally do not think being compatible with
>> currently deployed clients is important at all (primarily because I
>> still consider the whole thing experimental), and there is a clear
>> way forward once we correct the mistake of not having a version
>> number in the file format that tells the updated clients to ignore
>> the generation numbers.  For longer term viability, we should pick
>> something that is immutable, reproducible, computable with minimum
>> input---all of which would lead to being incrementally computable, I
>> would think.
> I think it depends on what we mean by backwards compatibility. None of
> our docs are saying this is experimental right now, just that it's
> opt-in like so many other git-config(1) options.
>
> So if we mean breaking backwards compatibility in that we'll write a new
> file or clobber the existing one with a version older clients can't use
> as an optimization, fine.
>
> But it would be bad to produce a hard error on older clients, but
> avoiding that seems as easy as just creating a "commit-graph2" file in
> .git/objects/info/.

Well, we have a 1-byte version number following the "CGPH" header in
the commit-graph file, and clients will ignore the commit-graph file
if that number is not "1". My hope for backwards-compatibility was
to avoid incrementing this value and instead use the unused 8th byte.

However, it appears that we are destined to increment that version
number, anyway. Here is my list for what needs to be in the next
version of the commit-graph file format:

1. A four-byte hash version.

2. File incrementality (split commit-graph).

3. Reachability Index versioning

Most of these changes will happen in the file header. The chunks
themselves don't need to change, but some chunks may be added that
only make sense in v2 commit-graphs.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Generation Number v2
  2018-10-31 13:04     ` Derrick Stolee
@ 2018-11-02 17:44       ` Jakub Narebski
  0 siblings, 0 replies; 17+ messages in thread
From: Jakub Narebski @ 2018-11-02 17:44 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Ævar Arnfjörð Bjarmason, Junio C Hamano, Git List,
	Jeff King, Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:
> On 10/31/2018 8:54 AM, Ævar Arnfjörð Bjarmason wrote:
>> On Tue, Oct 30 2018, Junio C Hamano wrote:
>>> Derrick Stolee <stolee@gmail.com> writes:
>>>>
>>>> In contrast, maximum generation numbers and corrected commit
>>>> dates both performed quite well. They are frequently the top
>>>> two performing indexes, and rarely significantly different.
>>>>
>>>> The trade-off here now seems to be: which _property_ is more important,
>>>> locally-computable or backwards-compatible?
>>>
>>> Nice summary.
>>>
>>> As I already said, I personally do not think being compatible with
>>> currently deployed clients is important at all (primarily because I
>>> still consider the whole thing experimental), and there is a clear
>>> way forward once we correct the mistake of not having a version
>>> number in the file format that tells the updated clients to ignore
>>> the generation numbers.  For longer term viability, we should pick
>>> something that is immutable, reproducible, computable with minimum
>>> input---all of which would lead to being incrementally computable, I
>>> would think.
>>
>> I think it depends on what we mean by backwards compatibility. None of
>> our docs are saying this is experimental right now, just that it's
>> opt-in like so many other git-config(1) options.
>>
>> So if we mean breaking backwards compatibility in that we'll write a new
>> file or clobber the existing one with a version older clients can't use
>> as an optimization, fine.
>>
>> But it would be bad to produce a hard error on older clients, but
>> avoiding that seems as easy as just creating a "commit-graph2" file in
>> .git/objects/info/.
>
> Well, we have a 1-byte version number following the "CGPH" header in
> the commit-graph file, and clients will ignore the commit-graph file
> if that number is not "1". My hope for backwards-compatibility was
> to avoid incrementing this value and instead use the unused 8th byte.

How?  Some of the considered new generation numbers were backwards
compatibile in the sense that for graph traversal the old code for
(minimum) generation numbers could be used.  But that is for reading.
If the old client tried to update new generation number using old
generation number algorithm (assuming locality), it would write some
chimaera that may or may not work.

> However, it appears that we are destined to increment that version
> number, anyway. Here is my list for what needs to be in the next
> version of the commit-graph file format:

As a reminder, here is how the commit-graph format looks now [1]

  HEADER:
    4-byte signature:                  CGPH
    1-byte version number:             1
    1-byte Hash Version:               1 = SHA-1
    1-byte number (C) of "chunks":     3 or 4
    1-byte (reserved for later use)    <ignored>
  CHUNK LOOKUP:
    (C + 1) * 12 bytes listing the table of contents for the chunks
  CHUNK DATA:
    OIDF
    OIDL
    CDAT
    EDGE [optional, depends on graph structure]
  TRAILER:
    checksum

[1]: Documentation/technical/commit-graph-format.txt

> 1. A four-byte hash version.

Why do you need a four-byte hash version?  255 possible hash functions
is not enough?

> 2. File incrementality (split commit-graph).

This would probably require a segment lookup part, and one chunk of each
type per segment (or even one chunk lookup table per segment).  It may
or may not need generation number which can support segmenting (here
maximal generation numbers have the advantage); but you can assume that
if segment(A) > segment(B) then reach.ix(A) > reach.ix(B),
i.e. lexicographical-like sort.

> 3. Reachability Index versioning

I assume that you mean here versioning for the reachability index that
is stored in the CDAT section (if it is in a separate chunk, it should
be easy to version).

The easiest thing to do when encountering unknown or not supported
generation number (perhaps the commit-graph file was created in the past
by earlier version of Git, perahs it came from server with newer Git, or
perhaps the same repository is sometimes accessed using newer and
sometimes older Git version), is to set commit->generation_number to
GENERATION_NUMBER_ZERO, as you wrote in [2].

We could encode backward-compatibility (for using) somewhat, but I
wonder how much it would be of use.

[2]: https://public-inbox.org/git/61a829ce-0d29-81c9-880e-7aef1bec916e@gmail.com/

> Most of these changes will happen in the file header. The chunks
> themselves don't need to change, but some chunks may be added that
> only make sense in v2 commit-graphs.

What I would like to see in v2 commit-graph format would be some kind
convention for naming / categorizing chunks, so that Git would be able
to handle unknown chunks gracefully.

In PNG format there are various types of chunks.  Quoting Wikipedia:

  Chunk types [in PNG format] are given a four-letter case sensitive
  ASCII type/name; compare FourCC. The case of the different letters in
  the name (bit 5 of the numeric value of the character) is a bit field
  that provides the decoder with some information on the nature of
  chunks it does not recognize.

  The case of the first letter indicates whether the chunk is critical
  or not. If the first letter is uppercase, the chunk is critical; if
  not, the chunk is ancillary. Critical chunks contain information that
  is necessary to read the file. If a decoder encounters a critical
  chunk it does not recognize, it must abort reading the file or supply
  the user with an appropriate warning.

Currently this is solved with the format version.  For given format
version some chunks are necessary (critical); if there would be another
type of chunk that must be understood, we can simply increase format
version. 

  The case of the second letter indicates whether the chunk is "public"
  (either in the specification or the registry of special-purpose public
  chunks) or "private" (not standardised). Uppercase is public and
  lowercase is private. This ensures that public and private chunk names
  can never conflict with each other (although two private chunk names
  could conflict).

For Git this could be "official" and "experimental"... though I don't
see people sharing commit-graph files with experimental chunks that can
be used only by some local unpublished fork of Git.

  The third letter must be uppercase to conform to the PNG
  specification. It is reserved for future expansion. Decoders should
  treat a chunk with a lower case third letter the same as any other
  unrecognised chunk.

In short: reserved.  This could be the fate of all of those 4 flags that
we do not need.

  The case of the fourth letter indicates whether the chunk is safe to
  copy by editors that do not recognize it. If lowercase, the chunk may
  be safely copied regardless of the extent of modifications to the
  file. If uppercase, it may only be copied if the modifications have
  not touched any critical chunks.

If the data contained in the chunks is immutable, then it could be
copied when updating commit-graph file... but what to do with new
commits, with what to fill the data for a new commit for a copied
unknown chunk?  All zeros?  All ones?  Some value defined in the chunk
itself?

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Generation Number v2
  2018-10-29 16:55 [RFC] Generation Number v2 Derrick Stolee
                   ` (2 preceding siblings ...)
  2018-10-30  3:59 ` Junio C Hamano
@ 2018-11-01 12:27 ` Jakub Narebski
  2018-11-01 13:29   ` Derrick Stolee
  3 siblings, 1 reply; 17+ messages in thread
From: Jakub Narebski @ 2018-11-01 12:27 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git List, Jeff King, Junio C Hamano, Derrick Stolee,
	Ævar Arnfjörð Bjarmason

[I have noticed that in some places I wrote A..B instead of B..A.  Sorry
 about that]

Derrick Stolee <stolee@gmail.com> writes:

> We've discussed in several places how to improve upon generation
> numbers. This RFC is a report based on my investigation into a
> few new options, and how they compare for Git's purposes on
> several existing open-source repos.
>
> You can find this report and the associated test scripts at
> https://github.com/derrickstolee/gen-test.

Which use prototype test implementation from the `reach-perf` branch in
https://github.com/derrickstolee/git.

> I have some more work to do in order to make this fully
> reproducible (including a single script to clone all of the
> repos I used, compute the commit-graphs using the different
> index versions, and run the tests).
>
> TL;DR: I recommend we pursue one of the following two options:
>
> 1. Maximum Generation Number.
> 2. Corrected Commit Date.
>
> Each of these perform well, but have different implementation
> drawbacks. I'd like to hear what everyone thinks about those
> drawbacks.

I agree with Junio that incremental computation is more important than
backwards compatibility (that the old clients can work with the new
commit-graph without changes).  We would have compatibility breaking
anyway with the planned move from SHA-1 to NewHash aka. SHA-256.

> Please also let me know about any additional tests that I could
> run. Now that I've got a lot of test scripts built up, I can re-run
> the test suite pretty quickly.

I would add straighforward 1-to-N and N-to-1 reachability tests in the
form of `git branch / tag --contains` and `git branch / tag --merged`,
and perhaps also ahead-behind calculations used by `git status`, and
N-to-M reachability tests used by tag following code in push and fetch.

Possibly also A...B walk, if it is not implemented via calculating
merge-base.

Maybe even --ancestry-path walk, though I am not sure how important best
performance for rhis case is (we would want good performance, but
perhaps best is not needed for rarely used command).

See explanation below.

> Generation Number Performance Test
> ==================================
>
> Git uses the commit history to answer many basic questions, such as
> computing a merge-base. Some of these algorithms can benefit from a
> _reachability index_, which is a condition that allows us to answer
> "commit A cannot reach commit B" in some cases. These require pre-
> computing some values that we can load and use at run-time. Git
> already has a notion of _generation number_, stores it in the commit-
> graph file, and uses it in several reachability algorithms.

Note that there are other kinds of reachability indices.

First, there are reachability indices that can answer the full
reachability query (if commit A can reach commit B, or if commit A
cannot reach commit B) directly, without walking the commit graph at
all: so called label-only approach.  For example one could store for
each commit the compressed list of all commits reachable from it
(transitive closure compression).

Those, I think (but I have not checked), would be of not much use for
Git, as the size of the index grows stronger than linear with the number
of commits, as grows the time to compute such index.  So probably of no
use to Git, at least not directly (Git uses so called "bitmap index",
see e.g. [1], which stores reachability bit-vector as compressed
bitmap... but not for all commits, only for a small subset).

Second, beside negative-cut reachability indexes, that can answer with
certainity that "commit A cannot reach commit B", like the generation
numbers (also known as level, or topological level), there also
positive-cut indexes (usually if not always based on the spanning tree
or trees for the DAG), that can answer when "commit A can reach commit
B".

I think that those can lead to significant speedups for at least some
types of commit traversal and reachability queries that Git needs to
answer.  But currently no algorithms has a provision for using
positive-cut filter index.  Anyway, such index would be auxiliary thing,
see the next point.

Third, we require more than having reachability index in the sense of
having some kind of _labelling_, often composite, that can answer either
"commit A cannot reach commit B" or "commit A can reach commit B",
depending on the type.  Git for most operations needs more, namely an
actual _ordering_, that is a weak order (or to be more exact a total
preorder, i.e. there can be two different commits with the same
"generation number" or index, but always either idx(A) ≲ idx(B) or
idx(B) ≲ idx(A)) and not only partial ordering (where there can be
incomparable elements, i.e neither idx(A) ≼ idx(B) nor idx(B) ≼ idx(A)).
This is because it needs to be used in priority queue to decide which
commit to travel next; more on that later.  This is also why there can
be only one such "main" reachability _index_.

[1]: https://githubengineering.com/counting-objects/

> You can read more about generation numbers and how to use them in
> algorithms in [this blog
> post](https://blogs.msdn.microsoft.com/devops/2018/07/09/supercharging-the-git-commit-graph-iii-generations/).
>
> However, [generation numbers do not always improve our
> algorithms](https://public-inbox.org/git/pull.28.git.gitgitgadget@gmail.com/T/#u).
> Specifically, some algorithms in Git already use commit date as a
> heuristic reachability index. This has some problems, though, since
> commit date can be incorrect for several reasons (clock skew between
> machines, purposefully setting GIT_COMMIT_DATE to the author date,
> etc.).

That's what we use the "slop" mechanism for: Git would walk a few
commits more than necessary than if there were no clock skew (if commit
dates were monotonic), and assume that the skew is not to severe to
finish early.

But as we are adding more early termination condition, this introduces
more places where things can go wrong for Git to return incorrect
results.  Thus the need for strict, non-heuristic reachability index.

> However, the speed boost by using commit date as a cutoff was so
> important in these cases, that the potential for incorrect answers was
> considered acceptable.

Yes, and there were few places where date as cutoff was used; we have
since added more place where generation number / reachbility index is
used as cutoff.  Thus more potential for incorrect answers if using
heuristics.

> When these algorithms were converted to use generation numbers, we
> _added_ the extra constraint that the algorithms are _never incorrect_.

Right.

> Unfortunately, this led to some cases where performance was worse than
> before. There are known cases where `git merge-base A B` or
> `git log --topo-order A..B` are worse when using generation numbers
> than when using commit dates.

In Git there are at least four different types of graph traversal
queries, with somewhat different requirements, and affected differently
by various reachability indexes.

## Pure reachability queries

First there are pure reachability queries, when we are interested only
in nswering he question whether commit A can reach commit B or not.  We
are nt interested in list of commits between A and B; if reachability
index, be it negative-cut or positive-cut, can answer the question, it
is all we need.  If we need to walk (perform online search), then we are
interested in just being able to arrive; we need to walk only one path.

This query can be performed one-to-one, one-to-many (which of commits
from the list can be reached from commit A), many-to-one (which commits
from the list can reach commit B), and many-to-many -- the latest for
example with recently added get_reachable_subset().

These types of queries are done by "git branch --contains" (and
"--no-contains") and "git branch --merged" (and "--no-merged"), the same
for "git tag"; if I remember it correctly those were among first
operations that were sped up by commit-graph mechanism.  Many-to-many
queries are done for tag following in "git push" and "git fetch".

Note that those queries are missing from gen-test.

## Topological ordering

Here reachability index is used to find out in-degree of considered
commits (starting from a set of starting points).  Here we don't really
need to walk all paths, but we need to walk all possible ways of
arriving at given commit.

Though I guess the [new] incremental algorithm could be modified to only
check for in-degree being zero or non-zero.  Would it make it slower or
faster; I guess in absence of good positive-cut reachability filter it
would make it slower.

This operation is benchmarked both standalone i.e. `git log --topo-order -N`
and in combination with A..B walk in `git log --topo-order -10 A..B`.

## (Po)set difference, or B..A range

In this type of query, with at least one "negative" / "negated" commit,
we want to find out all commits that are reachable from commit A that
are not reachable from commit B.

Note that A (commit in positive set) is treated differently with respect
to traversal from B (commit in negative set).  We need to travel (and
possibly list) all paths from A, while when traveling from B the only
thing important is reachability, not the actual path or paths.  We want
to travel as few commits as possible when walking from B, and here one
generation numbers are better than the other in the presence of multiple
paths.

The situation in nontrivial case could look as follows:

  ---.---.---o---*---*---*----*---*-----*---A
              \
               \---x-----x-----x-----B

In such case, the algorithm paints down from positive set i.e. commit A,
and from negative set i.e. commit B, and walk down from A till we get to
commits painted down from B.

We want to walk those commits somewhat in sync, using priority-queue
based variant of breadth-first search (well, kind of), so that we don't
walk more commits from A than necessary, only to notice later that these
are actually reachable from B, and neither we want to walk more commits
from B than necessary, going past the boundary commit.

Sidenote: I think that positive-cut auxiliary reachability index
(reachability filter) can help speed up this operation, making it
possible to mark commit as reachable from the negative set (from commit
B) and stop early without actually having to walk from B... though the
more in sync walks from A and B would be, the less it would probably
help.

This is tested using `git log --topo-order -10 A..B`.

## Merge base, and A...B walk

In this case we either find one or all commits that are reachable from
both A and B, or (which is related query) find all commits reachable
from A that are not reachable from B and vice versa, that is all commits
reachable from B that are not reachable from A.

In this case we walk from both A and B the same way: the operation is
symmetric.  In the merge base calculations we need to walk only one path
if there are many equivaent ones; in A...B walk we have to actually walk
them.  Thus merge base calculation would exhibit, we can concurr, the
same behaviour with respect to reachability indexes as A..B case.

This is tested using `git merge-base A B`; finding all merge bases, and
A...B walk are not tested yet.

Note that IIRC it was merge-base calculation where the problem of
performance regression when using generation number (topological level)
as an ordering was first encountered.  The performance regression happen
for situation like this (the graph is taken from original commit fixing
the issue, just rephrased).

   0   1   2   3   4   5   6   7   8   9  10   11          4    δgen(c)
   .---M---*---*---*---*---*---*---*---*--*----A           B
       |\                                  2  /           /     δgen(c)
       | \---------------------------------x-/           /
       \                                  2         3   /       δgen(c)
        \---------------------------------#---------#--/

    ------------------------------------------------------------>
                                                          time

Here optimal solution would be walk the branch / path with commits
marked with 'x' from commit A, while walking the branch / path with
commits marked '#' from commit B, and after walking 4 commits notice
that the commit M is reachable from both A and B -- thus we have found
[one of] the merge base.

Using (minimum) generation numbers we would walk the path marked with
'*' first, unnecessarily.  (IIUC that for a commit we mark also its
parents, isn't it?).

Using corrected commit date we would walk at most only a few commits on
the *-marked branch.

   0   1   2   3   4   5   6   7   8   9  10   11          11   δmaxgen(c)
   .---M---*---*---*---*---*---*---*---*--*----A           B
       |\                                  10 /           /     δmaxgen(c)
       | \---------------------------------x-/           /
       \                                  9         10  /       δmaxgen(c)
        \---------------------------------#---------#--/

    ------------------------------------------------------------>
                                                          time

Using maximum generation numbers we would also walk at most few *-marked
commits.

Sidenote: on the figure below you can easily see easy correspondence
between maxgen(c) and revgen(c) indices, see below.

   12  11  10  9   8   7   6   5   4   3  2    1           1    revgen(c)
   .---M---*---*---*---*---*---*---*---*--*----A           B
       |\                                  2  /           /     revgen(c)
       | \---------------------------------x-/           /
       \                                  3         2   /       revgen(c)
        \---------------------------------#---------#--/

## Strict ancestry, or ancestry path

When given a range of commits to display (e.g. A..B), only display
commits that exist directly on the ancestry chain between B and A,
i.e. commits that are both descendants of B, and ancestors of A.  In
other words, find all paths from B to A (if they exist).

In this case we want to walk all the paths, heavily pruning.  This
should be less dependent on reachability index quality as an index (as
an ordering), and more on pruning ability of the filter.

This case is not tested in gen-test, but I wonder how often this feature
is actualy used (if it would be worth adding it to the benchmark, and if
so, with what weight behind it).

> This report investigates four replacements for generation numbers, and
> compares the number of walked commits to the existing algorithms (both
> using generation numbers and not using them at all). We can use this
> data to make decisions for the future of the feature.
>
> ### Implementation
>
> The reachability indexes below are implemented in
> [the `reach-perf` branch in
> https://github.com/derrickstolee/git](https://github.com/derrickstolee/git/tree/reach-perf).
> This implementation is in a very rough condition, as it is intended to
> be a prototype, not production-quality.
>
> Using this implementation, you can compute commit-graph files for the
> specified reachability index using `git commit-graph write --reachable
> --version=<V>`.
> The `git` client will read the version from the file, so our tests
> store each version as `.git/objects/info/commit-graph.<V>` and copy
> the necessary file to `.git/objects/info/commit-graph` before testing.
>
> The algorithms count the number of commits walked, as to avoid the
> noise that would occur with time-based performance reporting. We use
> the (in progress) trace2 library for this. To find the values reported,
> use the `GIT_TR2_PERFORMANCE` environment variable.

Where we can read more about this trace2 library?  Thanks in advance.

>
> To ignore reachability indexes entirely and use the old algorithms
> (reported here as "OLD" values) use the environment variable
> `GIT_TEST_OLD_PAINT=1`.
>
>
> Reachability Index Versions
> ---------------------------
>
> **V0: (Minimum) Generation Number.**
> The _generation number_ of a commit is exactly one more than the maximum
> generation number of a parent (by convention, the generation number of a
> commit with no parents is 1). This is the definition currently used by
> Git (2.19.0 and later). Given two commits A and B, we can then use the
> following reachability index:
>
>     If gen(A) < gen(B), then A cannot reach B.
>
> _Commentary:_ One issue with generation numbers is that some algorithms
> in Git use commit date as a heuristic, and allow incorrect answers if
> there is enough clock skew. Using that heuristic, the algorithms can walk
> fewer commits than the algorithms using generation number. The other
> reachability indexes introduced below attempt to fix this problem.

This is the existing solution.

> **V1: (Epoch, Date) Pairs.**
> For each commit, store a pair of values called the _epoch_ and the _date_.
> The date is the normal commit date. The _epoch_ of a commit is the minimum
> X such that X is at least the maximum epoch of each parent, and at least
> one more than the epoch of any parent whose date is larger than the date
> of this commit (i.e. there is clock skew between this commit and this
> parent). In this way, we can use the following reachability index:
>
>    If epoch(A) < epoch(B), then A cannot reach B.
>    If epoch(A) == epoch(B) and date(A) < date(B), then A cannot reach B.

I wonder what makes it perform worse than corrected date aka V3.

> **V2: Maximum Generation Numbers.**
> The _maximum generation number_ of a commit (denoted by maxgen(A)) is
> defined using its _children_ and the total number of commits in the repo.
> If A is a commit with no children (that is, there is no commit B with
> A as a parent) then maxgen(A) is equal to the number of commits in the repo.
> For all other commits A, maxgen(A) is one less than the minimum maxgen(B)
> among children B. This can be computed by initializing maxgen(C) to the
> number of commits, then walking the commit graph in topological order,
> assigning maxgen(P) = min { maxgen(P), maxgen(C) - 1 } for each parent P
> of the currently-walking commit C. We then have the same reachability
> index as minimum generation number:
>
>   If maxgen(A) < maxgen(B), then A cannot reach B.

If I understand it correctly, this is the same as reverse generation
numbers, or reverse topological levels; in other words genertion numbers
on reversed graph -- only transformed:

   maxgen(A) == number_of_commits - (revgen(A) - 1)

We define revgen(A) in the following way:
- for head tips (for source nodes), i.e. commits with in-degree of 0,
  with no children, have revgen(A) = 1 (revgen(A) = 0 is left for
  commits outside commit-graph, which translates to INFINITY for
  maxgen(A)).
- otherwise, it is 1 more than maximum revgen(C) of its children

They are equivalent, but maxgen(A) is "backward compatibile", that is
the rechability condition is exactly the same as for ordinary generation
numbers:

    If gen(A) < gen(B), then A cannot reach B.
    If maxgen(A) < maxgen(B), then A cannot reach B.

But

    If revgen(A) > revgen(B), then A cannot reach B.

>
> _Commentary:_ The known examples where (minimum) generation numbers perform
> worse than commit date heuristics involve commits with recent commit dates
> but whose parents have very low generation number compared to most recent
> commits. In a way, minimum generation numbers increase the odds that we
> can say "A cannot reach B" when A is fixed and B is selected at random.
> Maximum generation numbers increase the odds that we can say "A cannot
> reach B" when B is fixed and A is selected at random. This helps us when
> we are exploring during a reachability algorithm and have explored few
> commits and want to say that the large set of unexplored commits cannot
> reach any of the explored commits.

I guess that the situation where shortcut path have recent date, while
having less commits in path (and thus lover ordinary generation number)
is more common in real commit graphs because one might want to base a
commit (for example a bugfix) on an old commit, but commits usually get
merged quickly, and not left for long time.  If they re left for long
time, they usually needs correction (via rebase) before being merged,
and again we have commit with date close to the date of merge, leading
to a shortcut.

Does this reasoning looks correct to you?

> **V3: Corrected Commit Date.**
> For a commit C, let its _corrected commit date_ (denoted by cdate(C))
> be the maximum of the commit date of C and the [corrected] commit dates of its
> parents.

Wouldn't it be better to use the maximum of the commit date of C, and
one second more than the corrected commit dates of its parents?  This
way corrected dates would be strictly monotonic along the chain.

Sidenote: in theory, sufficiently bad clock skew (like e.g. wrong year
in the future) could screw large swatches of the commit graph.  In
practice, in real-life repositories, this doesn't happen, I guess.

>   If cdate(A) < cdate(B), then A cannot reach B.
>
> **V4: FELINE Index.**
> The FELINE index is a two-dimentional reachability index as defined in
> [Reachability Queries in Very Large Graphs: A Fast Refined Online
> Search Approach](https://openproceedings.org/EDBT/2014/paper_166.pdf)
> by Veloso, Cerf, Jeira, and Zaki. The index is not deterministically
> defined, but instead is defined in the following way:
>
> 1. Compute a reverse topological order of the commits. Let felineX(C)
>    be the position in which C appears in this list. Since this is
>    a reverse topological order, felineX(C) > felineX(P) for each parent
>    P of C.

Well, it depends on the definition of the topological order if it is
tolopogical order or reverse topological order.  Anyway, the FELINE
paper includes discussion about FELINE-I, i.e. inversed feline index.
From FELINE and FELINE-I, one of those is what was used here.

> 2. Compute a reverse topological order of the commits using Kahn's
>    algorithm, but when selecting among a list of commits with in-degree
>    zero, prioritize the commit by minimizing felineX. Let felineY(C)
>    be the position in which C appears in this list.

Note that the second step, as given in the paper, is actually
deterministic (assuming that we follow the recommendations; well, up to
a point).  It needs some topological order for felineX, though.

> Essentially, the felineY order is selected with the goal of swapping
> positions of topologically-independent commits relative to the felinX
> ordering. The resulting reachability index is as follows:
>
>    If felineX(A) < felineX(B), then A cannot reach B.
>    If felineY(A) < felineY(B), then A cannot reach B.
>
> _Commentary:_ In terms of comparing two commits directly, this index
> is quite strong. However, when we are performing our reachability
> algorithms that care about reachable sets (git log --graph), or
> boundaries between reachable sets (git merge-base, git log --graph A..B)
> we need to track a single pair (minX,minY) for comparion. In order to not
> miss anything during our search, we need to update this pair to be
> the minimum felineX(A) and minimum felineY(B) among all explored commits
> A and B. That is, the pair (minX, minY) is below our entire explored set.
> This can be a disadvantage for these algorithms.

Actually the problem is that FELINE index is partial order, not an
ordering; not something that we can use in a priority queue.

But the single [reverse] topological ordering is a reachability index in
itself, and it is an ordering.  Thus we can use felineX as an index in
priority queue (or we can use felineY).

One thing to notice is that FELINE index performs well on real world
large graphs which have quite different structure from commit graphs;
among others they have either very large number of sources (head tips)
or sinks (root nodes), or both; those numbers grow with the number of
nodes.  For commit graphs they don't perform as well; well, at least
ccording to my explorations on that matter in the Google Colaboratory
notebook [2].

[2]: https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg

>
> ### Comparing Reachability Index Versions Viability
>
> Before considering how well these indexes perform during our algorithm
> runs, let's take a moment to consider the implementation details and
> how they relate to the existing commit-graph file format and existing
> Git clients.
>
> * **Compatible?** In our test implementation, we use a previously unused
>   byte of data in the commit-graph format to indicate which reachability
>   index version we are using. Existing clients ignore this value, so we
>   will want to consider if these new indexes are _backwards compatible_.
>   That is, will they still report correct values if they ignore this byte
>   and use the generation number column from the commit-graph file assuming
>   the values are minimum generation numbers?

In other words "backward compatibility" for a reachability index means
that the reachability condition is exactly the same as it was:

    If index(A) < index(B), then A cannot reach B.

and also values of the reachability index are written in place of
generation numbers in the commit-graph.

>
> * **Immutable?** Git objects are _immutable_. If you change an object you
>   actually create a new object with a new object ID. Are the values we
>   store for these reachability indexes also immutable?
>
> * **Local?** Are these values **locally computable**? That is, do we only
>   need to look at the parents of a commit (assuming those parents have
>   computed values) in order to determine the value at that commit?

Those two features imply that the number of commits that we need to walk
to be able to update reachability index after appending new commits is
O(new commits), which means O(changes).  I would allow for simple
transformations of other values, e.g. adding a constant; we need to
rewrite commit-graph file anyway.

I wonder if there are reachability indexes that are either immutable but
not local, or not immutable but local.  Doesn't being local imply
immutability, with some common-sense assumptions (like no random choice
for parent-less commits)?

> | Index                     | Compatible? | Immutable? | Local? |
> |---------------------------|-------------|------------|--------|
> | Minimum Generation Number | Yes         | Yes        | Yes    |
> | (Epoch, Date) pairs       | Yes         | Yes        | Yes    |
> | Maximum Generation Number | Yes         | No         | No     |
> | Corrected Commit Date     | No          | Yes        | Yes    |
> | FELINE index              | Yes         | No         | No     |
>
> _Note:_ The corrected commit date uses the generation number column
> to store an offset of "how much do I need to add to my commit date
> to get my corrected commit date?" The values stored in that column
> are then not backwards-compatible.

Additional note: this offset / delta based approach to storing corrected
commit data is a form of lightweight compression, allowing it to fit
safely in 30 bits (while the commit date uses 32+2 = 34 bits itself).

Assuming that we want to keep this feature from the original
commit-graph file format, that is.

> _Note:_ The FELINE index requires storing two values instead of just
> one. One of these values could be stored in the generation number
> column and the other in an optional chunk, hence it could be backwards
> compatible. (This is not how it is implemented in our example
> implementation.)

So the table above refers to this hypothetical implementation when
stating that FELINE is backwards compatibile, but current test
implementation is not, isn't it?

>
> Data
> ----
>
> We focused on three types of performance tests that test the indexes
> in different ways. Each test lists the `git` command that is used,
> and the table lists which repository is used and which inputs.
>
> ### Test 1: `git log --topo-order -N`
>
> This test focuses on the number of commits that are parsed during
> a `git log --topo-order` before writing `N` commits to output.
>
> You can reproduce this test using `topo-order-tests.sh` and
> see all the data in `topo-order-summary.txt`. The values reported
> here are a sampling of the data, ignoring tests where all values
> were the same or extremely close in value.
>
>
> | Repo         | N     | V0     | V1     | V2     | V3     | V4    |
> |--------------|-------|--------|--------|--------|--------|-------|
> | android-base | 100   |  5,487 |  8,534 |  6,937 |  6,419 | 6,453 |
> | android-base | 1000  | 36,029 | 44,030 | 41,493 | 41,206 |45,431 |
> | chromium     | 100   |    101 |424,406 |    101 |    101 |   101 |
> | gerrit       | 100   |  8,212 |  8,533 |    164 |    159 |   162 |
> | gerrit       | 1000  |  8,512 |  8,533 |  1,990 |  1,973 | 3,766 |
> | Linux        | 100   | 12,458 | 12,444 | 13,683 | 13,123 |13,124 |
> | Linux        | 1000  | 24,436 | 26,247 | 27,878 | 26,430 |27,875 |
> | Linux        | 10000 | 30,364 | 28,891 | 27,878 | 26,430 |27,875 |
> | electron     | 1000  | 19,927 | 18,733 |  1,072 | 18,214 |18,214 |
> | Ffmpeg       | 10000 | 32,154 | 47,429 | 10,435 | 11,054 |11,054 |
> | jgit         | 1000  |  1,550 |  6,264 |  1,067 |  1,060 | 1,233 |
> | julia        | 10000 | 43,043 | 43,043 | 10,201 | 23,505 |23,828 |
> | odoo         | 1000  | 17,175 |  9,714 |  4,043 |  4,046 | 4,111 |
> | php-src      | 1000  | 19,014 | 27,530 |  1,311 |  1,305 | 1,320 |
> | rails        | 100   |  1,420 |  2,041 |  1,757 |  1,428 | 1,441 |
> | rails        | 1000  |  7,952 | 10,145 | 10,053 |  8,373 | 8,373 |
> | swift        | 1000  |  1,914 |  4,004 |  2,071 |  1,939 | 1,940 |
> | tensorflow   | 1000  | 10,019 | 39,221 |  6,711 | 10,051 |10,051 |
> | TypeScript   | 1000  |  2,873 | 12,014 |  3,049 |  2,876 | 2,876 |

Do I understand it correctly that the range of N for a given project is
limited by the depth of the project history (hence maximum N of 10000
for Linux kernel, but only 100 for chromium)?

I wonder what the OLD numbers are for these operations.

>
> ### Test 2: `git log --topo-order -10 A..B`
>
> This test focuses on the number of commits that are parsed during
> a `git log --topo-order A..B` before writing ten commits to output.
> Since we fix a very small set of output commits, we care more about
> the part of the walk that determines which commits are reachable
> from `B` but not reachable from `A`. This part of the walk uses
> commit date as a heuristic in the existing implementation.
[...]

> ### Test 3: `git merge-base A B`
>
> This test focuses on the number of commits that are parsed during
> a `git merge-base A B`. This part of the walk uses commit date as
> a heuristic in the existing implementation.
[...]

> Conclusions
> -----------
>
> Based on the performance results alone, we should remove minimum
> generation numbers, (epoch, date) pairs, and FELINE index from
> consideration. There are enough examples of these indexes performing
> poorly.
>
> In contrast, maximum generation numbers and corrected commit
> dates both performed quite well. They are frequently the top
> two performing indexes, and rarely significantly different.
>
> The trade-off here now seems to be: which _property_ is more important,
> locally-computable or backwards-compatible?
>
> * Maximum generation number is backwards-compatible but not
>   locally-computable or immutable.
>
> * Corrected commit-date is locally-computable and immutable,
>   but not backwards-compatible.
>
> _Editor's Note:_ Every time I think about this trade-off, I can't
> come to a hard conclusion about which is better. Instead, I'll
> leave that up for discussion.

In my opinion being able to update commit-graph data fast is more
important than being backward-compatibile, i.e. beig able to use
commit-data file generated by new clients (and servers) by old clients.

Thank you for all your work,
--
Jakub Narębski

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Generation Number v2
  2018-11-01 12:27 ` Jakub Narebski
@ 2018-11-01 13:29   ` Derrick Stolee
  2018-11-03 12:33     ` Jakub Narebski
  0 siblings, 1 reply; 17+ messages in thread
From: Derrick Stolee @ 2018-11-01 13:29 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Git List, Jeff King, Junio C Hamano, Derrick Stolee,
	Ævar Arnfjörð Bjarmason

On 11/1/2018 8:27 AM, Jakub Narebski wrote:
> [I have noticed that in some places I wrote A..B instead of B..A.  Sorry
>   about that]
>
> Derrick Stolee <stolee@gmail.com> writes:
>
>> Please also let me know about any additional tests that I could
>> run. Now that I've got a lot of test scripts built up, I can re-run
>> the test suite pretty quickly.
> I would add straighforward 1-to-N and N-to-1 reachability tests in the
> form of `git branch / tag --contains` and `git branch / tag --merged`,
> and perhaps also ahead-behind calculations used by `git status`, and
> N-to-M reachability tests used by tag following code in push and fetch.
>
> Possibly also A...B walk, if it is not implemented via calculating
> merge-base.

I believe this uses paint_down_to_common(), but looks at the PARENT1 and
PARENT2 flags afterward to determine the left/right/both relationships.

> Maybe even --ancestry-path walk, though I am not sure how important best
> performance for rhis case is (we would want good performance, but
> perhaps best is not needed for rarely used command).

Currently, the implementation of --ancestry-path does not use a
reachability index.

>
> See explanation below.
>
>
>> Generation Number Performance Test
>> ==================================
>>
>> Git uses the commit history to answer many basic questions, such as
>> computing a merge-base. Some of these algorithms can benefit from a
>> _reachability index_, which is a condition that allows us to answer
>> "commit A cannot reach commit B" in some cases. These require pre-
>> computing some values that we can load and use at run-time. Git
>> already has a notion of _generation number_, stores it in the commit-
>> graph file, and uses it in several reachability algorithms.
> Note that there are other kinds of reachability indices.
>
> First, there are reachability indices that can answer the full
> reachability query (if commit A can reach commit B, or if commit A
> cannot reach commit B) directly, without walking the commit graph at
> all: so called label-only approach.  For example one could store for
> each commit the compressed list of all commits reachable from it
> (transitive closure compression).
>
> Those, I think (but I have not checked), would be of not much use for
> Git, as the size of the index grows stronger than linear with the number
> of commits, as grows the time to compute such index.  So probably of no
> use to Git, at least not directly (Git uses so called "bitmap index",
> see e.g. [1], which stores reachability bit-vector as compressed
> bitmap... but not for all commits, only for a small subset).
>
>
> Second, beside negative-cut reachability indexes, that can answer with
> certainity that "commit A cannot reach commit B", like the generation
> numbers (also known as level, or topological level), there also
> positive-cut indexes (usually if not always based on the spanning tree
> or trees for the DAG), that can answer when "commit A can reach commit
> B".
>
> I think that those can lead to significant speedups for at least some
> types of commit traversal and reachability queries that Git needs to
> answer.  But currently no algorithms has a provision for using
> positive-cut filter index.  Anyway, such index would be auxiliary thing,
> see the next point.
>
>
> Third, we require more than having reachability index in the sense of
> having some kind of _labelling_, often composite, that can answer either
> "commit A cannot reach commit B" or "commit A can reach commit B",
> depending on the type.  Git for most operations needs more, namely an
> actual _ordering_, that is a weak order (or to be more exact a total
> preorder, i.e. there can be two different commits with the same
> "generation number" or index, but always either idx(A) ≲ idx(B) or
> idx(B) ≲ idx(A)) and not only partial ordering (where there can be
> incomparable elements, i.e neither idx(A) ≼ idx(B) nor idx(B) ≼ idx(A)).
> This is because it needs to be used in priority queue to decide which
> commit to travel next; more on that later.  This is also why there can
> be only one such "main" reachability _index_.
>
> [1]: https://githubengineering.com/counting-objects/

Thanks for the details. At the moment, I'm only interested in improving our
negative-cut reachability index, as we have algorithms that can take 
advantage
of them (and can compare their performance directly).

>
>> You can read more about generation numbers and how to use them in
>> algorithms in [this blog
>> post](https://blogs.msdn.microsoft.com/devops/2018/07/09/supercharging-the-git-commit-graph-iii-generations/).
>>
>> However, [generation numbers do not always improve our
>> algorithms](https://public-inbox.org/git/pull.28.git.gitgitgadget@gmail.com/T/#u).
>> Specifically, some algorithms in Git already use commit date as a
>> heuristic reachability index. This has some problems, though, since
>> commit date can be incorrect for several reasons (clock skew between
>> machines, purposefully setting GIT_COMMIT_DATE to the author date,
>> etc.).
> That's what we use the "slop" mechanism for: Git would walk a few
> commits more than necessary than if there were no clock skew (if commit
> dates were monotonic), and assume that the skew is not to severe to
> finish early.
>
> But as we are adding more early termination condition, this introduces
> more places where things can go wrong for Git to return incorrect
> results.  Thus the need for strict, non-heuristic reachability index.
>
>> However, the speed boost by using commit date as a cutoff was so
>> important in these cases, that the potential for incorrect answers was
>> considered acceptable.
> Yes, and there were few places where date as cutoff was used; we have
> since added more place where generation number / reachbility index is
> used as cutoff.  Thus more potential for incorrect answers if using
> heuristics.
>
>> When these algorithms were converted to use generation numbers, we
>> _added_ the extra constraint that the algorithms are _never incorrect_.
> Right.
>
>> Unfortunately, this led to some cases where performance was worse than
>> before. There are known cases where `git merge-base A B` or
>> `git log --topo-order A..B` are worse when using generation numbers
>> than when using commit dates.
> In Git there are at least four different types of graph traversal
> queries, with somewhat different requirements, and affected differently
> by various reachability indexes.
>
>
> ## Pure reachability queries
>
> First there are pure reachability queries, when we are interested only
> in nswering he question whether commit A can reach commit B or not.  We
> are nt interested in list of commits between A and B; if reachability
> index, be it negative-cut or positive-cut, can answer the question, it
> is all we need.  If we need to walk (perform online search), then we are
> interested in just being able to arrive; we need to walk only one path.
>
> This query can be performed one-to-one, one-to-many (which of commits
> from the list can be reached from commit A), many-to-one (which commits
> from the list can reach commit B), and many-to-many -- the latest for
> example with recently added get_reachable_subset().
>
> These types of queries are done by "git branch --contains" (and
> "--no-contains") and "git branch --merged" (and "--no-merged"), the same
> for "git tag"; if I remember it correctly those were among first
> operations that were sped up by commit-graph mechanism.  Many-to-many
> queries are done for tag following in "git push" and "git fetch".
>
> Note that those queries are missing from gen-test.

These are definitely a good idea to pursue further testing. The only
thing I canthink of right now is that the tests can be hard to set up,
but perhaps`git branch --remote --contains` or `git tag --contains`
would be interesting.

> ## Topological ordering
>
> Here reachability index is used to find out in-degree of considered
> commits (starting from a set of starting points).  Here we don't really
> need to walk all paths, but we need to walk all possible ways of
> arriving at given commit.

You make a good point here about the difference between "all paths to C"
and"all children of C", since the number of paths can grow exponentially
(and frequentlydoes in Git) but the number of children is linear (and
small in practice).

> Though I guess the [new] incremental algorithm could be modified to only
> check for in-degree being zero or non-zero.  Would it make it slower or
> faster; I guess in absence of good positive-cut reachability filter it
> would make it slower.

I don't understand how a positive-cut filter helps in this case. The point
of the in-degreecalculation is to say "we already output all commits that
can reach this commit" beforewe output another commit. How could a
positive-cut filter help here?

> This operation is benchmarked both standalone i.e. `git log --topo-order -N`
> and in combination with A..B walk in `git log --topo-order -10 A..B`.
>
>
> ## (Po)set difference, or B..A range
>
> In this type of query, with at least one "negative" / "negated" commit,
> we want to find out all commits that are reachable from commit A that
> are not reachable from commit B.
>
> Note that A (commit in positive set) is treated differently with respect
> to traversal from B (commit in negative set).  We need to travel (and
> possibly list) all paths from A, while when traveling from B the only
> thing important is reachability, not the actual path or paths.  We want
> to travel as few commits as possible when walking from B, and here one
> generation numbers are better than the other in the presence of multiple
> paths.
>
> The situation in nontrivial case could look as follows:
>
>    ---.---.---o---*---*---*----*---*-----*---A
>                \
>                 \---x-----x-----x-----B
>
> In such case, the algorithm paints down from positive set i.e. commit A,
> and from negative set i.e. commit B, and walk down from A till we get to
> commits painted down from B.
>
> We want to walk those commits somewhat in sync, using priority-queue
> based variant of breadth-first search (well, kind of), so that we don't
> walk more commits from A than necessary, only to notice later that these
> are actually reachable from B, and neither we want to walk more commits
> from B than necessary, going past the boundary commit.
>
> Sidenote: I think that positive-cut auxiliary reachability index
> (reachability filter) can help speed up this operation, making it
> possible to mark commit as reachable from the negative set (from commit
> B) and stop early without actually having to walk from B... though the
> more in sync walks from A and B would be, the less it would probably
> help.
>
> This is tested using `git log --topo-order -10 A..B`.
>
>
> ## Merge base, and A...B walk
>
> In this case we either find one or all commits that are reachable from
> both A and B, or (which is related query) find all commits reachable
> from A that are not reachable from B and vice versa, that is all commits
> reachable from B that are not reachable from A.
>
> In this case we walk from both A and B the same way: the operation is
> symmetric.  In the merge base calculations we need to walk only one path
> if there are many equivaent ones; in A...B walk we have to actually walk
> them.  Thus merge base calculation would exhibit, we can concurr, the
> same behaviour with respect to reachability indexes as A..B case.
>
> This is tested using `git merge-base A B`; finding all merge bases, and
> A...B walk are not tested yet.
>
> Note that IIRC it was merge-base calculation where the problem of
> performance regression when using generation number (topological level)
> as an ordering was first encountered.  The performance regression happen
> for situation like this (the graph is taken from original commit fixing
> the issue, just rephrased).
>
>     0   1   2   3   4   5   6   7   8   9  10   11          4    δgen(c)
>     .---M---*---*---*---*---*---*---*---*--*----A           B
>         |\                                  2  /           /     δgen(c)
>         | \---------------------------------x-/           /
>         \                                  2         3   /       δgen(c)
>          \---------------------------------#---------#--/
>
>
>      ------------------------------------------------------------>
>                                                            time
>
> Here optimal solution would be walk the branch / path with commits
> marked with 'x' from commit A, while walking the branch / path with
> commits marked '#' from commit B, and after walking 4 commits notice
> that the commit M is reachable from both A and B -- thus we have found
> [one of] the merge base.
>
> Using (minimum) generation numbers we would walk the path marked with
> '*' first, unnecessarily.  (IIUC that for a commit we mark also its
> parents, isn't it?).
>
> Using corrected commit date we would walk at most only a few commits on
> the *-marked branch.
>
>     0   1   2   3   4   5   6   7   8   9  10   11          11   δmaxgen(c)
>     .---M---*---*---*---*---*---*---*---*--*----A           B
>         |\                                  10 /           /     δmaxgen(c)
>         | \---------------------------------x-/           /
>         \                                  9         10  /       δmaxgen(c)
>          \---------------------------------#---------#--/
>
>
>      ------------------------------------------------------------>
>                                                            time
>
> Using maximum generation numbers we would also walk at most few *-marked
> commits.
>
> Sidenote: on the figure below you can easily see easy correspondence
> between maxgen(c) and revgen(c) indices, see below.
>
>     12  11  10  9   8   7   6   5   4   3  2    1           1    revgen(c)
>     .---M---*---*---*---*---*---*---*---*--*----A           B
>         |\                                  2  /           /     revgen(c)
>         | \---------------------------------x-/           /
>         \                                  3         2   /       revgen(c)
>          \---------------------------------#---------#--/
>
>
> ## Strict ancestry, or ancestry path
>
> When given a range of commits to display (e.g. A..B), only display
> commits that exist directly on the ancestry chain between B and A,
> i.e. commits that are both descendants of B, and ancestors of A.  In
> other words, find all paths from B to A (if they exist).
>
> In this case we want to walk all the paths, heavily pruning.  This
> should be less dependent on reachability index quality as an index (as
> an ordering), and more on pruning ability of the filter.
>
> This case is not tested in gen-test, but I wonder how often this feature
> is actualy used (if it would be worth adding it to the benchmark, and if
> so, with what weight behind it).

As mentioned earlier, the --ancestry-path algorithm does not currently use a
negative-cut filter, so we would not gather any data on this at the moment.

However, I imagine the algorithm is similar to the typical A..B case, as we
needto discover the (po)set difference, construct the reversed digraph on
thosecommits, and walk backwards from A. Assuming the difference is much
smallerthan the entire commit graph, then the main cost is the walk that
discovers thedifference, and hence is covered somewhat by the `git log
--topo-order A..B` tests.

>> This report investigates four replacements for generation numbers, and
>> compares the number of walked commits to the existing algorithms (both
>> using generation numbers and not using them at all). We can use this
>> data to make decisions for the future of the feature.
>>
>> ### Implementation
>>
>> The reachability indexes below are implemented in
>> [the `reach-perf` branch in
>> https://github.com/derrickstolee/git](https://github.com/derrickstolee/git/tree/reach-perf).
>> This implementation is in a very rough condition, as it is intended to
>> be a prototype, not production-quality.
>>
>> Using this implementation, you can compute commit-graph files for the
>> specified reachability index using `git commit-graph write --reachable
>> --version=<V>`.
>> The `git` client will read the version from the file, so our tests
>> store each version as `.git/objects/info/commit-graph.<V>` and copy
>> the necessary file to `.git/objects/info/commit-graph` before testing.
>>
>> The algorithms count the number of commits walked, as to avoid the
>> noise that would occur with time-based performance reporting. We use
>> the (in progress) trace2 library for this. To find the values reported,
>> use the `GIT_TR2_PERFORMANCE` environment variable.
> Where we can read more about this trace2 library?  Thanks in advance.

https://public-inbox.org/git/pull.29.git.gitgitgadget@gmail.com/
[PATCH 0/8] WIP: trace2: a new trace facility

>> To ignore reachability indexes entirely and use the old algorithms
>> (reported here as "OLD" values) use the environment variable
>> `GIT_TEST_OLD_PAINT=1`.
>>
>>
>> Reachability Index Versions
>> ---------------------------
>>
>> **V0: (Minimum) Generation Number.**
>> The _generation number_ of a commit is exactly one more than the maximum
>> generation number of a parent (by convention, the generation number of a
>> commit with no parents is 1). This is the definition currently used by
>> Git (2.19.0 and later). Given two commits A and B, we can then use the
>> following reachability index:
>>
>>      If gen(A) < gen(B), then A cannot reach B.
>>
>> _Commentary:_ One issue with generation numbers is that some algorithms
>> in Git use commit date as a heuristic, and allow incorrect answers if
>> there is enough clock skew. Using that heuristic, the algorithms can walk
>> fewer commits than the algorithms using generation number. The other
>> reachability indexes introduced below attempt to fix this problem.
> This is the existing solution.
>
>> **V1: (Epoch, Date) Pairs.**
>> For each commit, store a pair of values called the _epoch_ and the _date_.
>> The date is the normal commit date. The _epoch_ of a commit is the minimum
>> X such that X is at least the maximum epoch of each parent, and at least
>> one more than the epoch of any parent whose date is larger than the date
>> of this commit (i.e. there is clock skew between this commit and this
>> parent). In this way, we can use the following reachability index:
>>
>>     If epoch(A) < epoch(B), then A cannot reach B.
>>     If epoch(A) == epoch(B) and date(A) < date(B), then A cannot reach B.
> I wonder what makes it perform worse than corrected date aka V3.

In the example `git merge-base v4.8 v4.9` in the Linux repo, the topology
includestwo commits (say, C1 and C2) of low generation number but high
commit-date. Thesecommits also have low epoch. However, there are 607
commits in the repo whosecommit date is smaller than a parent's commit
date. This means that the epoch in themain trunk of the repo can be as
high as 607, while the epoch for C1 and C2 is likely in the single digits.
This means that we need to walk all commits with epoch greaterthan
min { epoch(C1), epoch(C2) } before we explore C1 and C2 and terminate
the walk.

In V3, the commits C1 and C2 have high corrected commit date, higher than
any of thecommits that require positive offset to overcome the clock skew
with their parents. Thisallows the walk to be very similar to the old
algorithm, as seen in this `git merge-base A B`test summary:

Linux c8d2bc9bc39e 69973b830859
   OLD: 167468
    V0: 635579
    V1: 630138
    V2:  33716
    V3: 167496
    V4: 153774

>
>> **V2: Maximum Generation Numbers.**
>> The _maximum generation number_ of a commit (denoted by maxgen(A)) is
>> defined using its _children_ and the total number of commits in the repo.
>> If A is a commit with no children (that is, there is no commit B with
>> A as a parent) then maxgen(A) is equal to the number of commits in the repo.
>> For all other commits A, maxgen(A) is one less than the minimum maxgen(B)
>> among children B. This can be computed by initializing maxgen(C) to the
>> number of commits, then walking the commit graph in topological order,
>> assigning maxgen(P) = min { maxgen(P), maxgen(C) - 1 } for each parent P
>> of the currently-walking commit C. We then have the same reachability
>> index as minimum generation number:
>>
>>    If maxgen(A) < maxgen(B), then A cannot reach B.
> If I understand it correctly, this is the same as reverse generation
> numbers, or reverse topological levels; in other words genertion numbers
> on reversed graph -- only transformed:
>
>     maxgen(A) == number_of_commits - (revgen(A) - 1)
>
> We define revgen(A) in the following way:
> - for head tips (for source nodes), i.e. commits with in-degree of 0,
>    with no children, have revgen(A) = 1 (revgen(A) = 0 is left for
>    commits outside commit-graph, which translates to INFINITY for
>    maxgen(A)).
> - otherwise, it is 1 more than maximum revgen(C) of its children
>
> They are equivalent, but maxgen(A) is "backward compatibile", that is
> the rechability condition is exactly the same as for ordinary generation
> numbers:
>
>      If gen(A) < gen(B), then A cannot reach B.
>      If maxgen(A) < maxgen(B), then A cannot reach B.
>
> But
>
>      If revgen(A) > revgen(B), then A cannot reach B.

Yes, these are the same idea, and the reason to phrase it as I did is to
keep theinequality in the same direction.

>
>> _Commentary:_ The known examples where (minimum) generation numbers perform
>> worse than commit date heuristics involve commits with recent commit dates
>> but whose parents have very low generation number compared to most recent
>> commits. In a way, minimum generation numbers increase the odds that we
>> can say "A cannot reach B" when A is fixed and B is selected at random.
>> Maximum generation numbers increase the odds that we can say "A cannot
>> reach B" when B is fixed and A is selected at random. This helps us when
>> we are exploring during a reachability algorithm and have explored few
>> commits and want to say that the large set of unexplored commits cannot
>> reach any of the explored commits.
> I guess that the situation where shortcut path have recent date, while
> having less commits in path (and thus lover ordinary generation number)
> is more common in real commit graphs because one might want to base a
> commit (for example a bugfix) on an old commit, but commits usually get
> merged quickly, and not left for long time.  If they re left for long
> time, they usually needs correction (via rebase) before being merged,
> and again we have commit with date close to the date of merge, leading
> to a shortcut.
>
> Does this reasoning looks correct to you?

These "bug fixes based on old commits" cases are few and far between, so
are notenough to explain how often maximum generation number works well.
Instead, Iwould say that it works well because the typical Git pattern is
to merge into a singletrunk. In the case of projects like Linux, there are
multiple "trunks" that are run bylieutenants and are merged into a "super
trunk"; even in these cases, the numberof trunks is much smaller than the
number of topics. Let's focus on a single-trunkmodel for the discussion 
below.

When working with a single trunk, most merge commits are in the first-parent
historyof that trunk. As we travel down those trunks, the maximum generation
is decreasing.This means that the number of commits that can have equal
generation is limited bythe number of commits in a single topic, which is
typically small. This is opposed tothe minimum generation, which can be
unbounded. For example, if everyone basestheir Git patches on the latest
release, then if we walk into those topics, then weneed to walk all the way
to that release to determine reachability.

Here is a picture, representing each topic as an interval of consecutive
generationvalues:

# Minimum Generation Numbers:

v2.19.0----------M1----M2----M3----M4
     |\          /     /     /     /
     | [topic1]-/     /     /     /
     |\              /     /     /
     | [topic2]-----//     /
     |\     /     /
     | [topic3]---------/     /
      \                      /
       [topic4]-------------/

     =====generation====>

# Maximum Generation Numbers:

v2.19.0----------M1----M2----M3----M4
     |\          /      |     |     |
     | [topic1]-/       |     |     |
     |\                 |     |     |
     | \-------[topic2]/      |     |
     |\                       |     |
     | \-------------[topic3]/      |
      \                             |
       \-------------------[topic4]/

     =====generation====>

>> **V3: Corrected Commit Date.**
>> For a commit C, let its _corrected commit date_ (denoted by cdate(C))
>> be the maximum of the commit date of C and the [corrected] commit dates of its
>> parents.
> Wouldn't it be better to use the maximum of the commit date of C, and
> one second more than the corrected commit dates of its parents?  This
> way corrected dates would be strictly monotonic along the chain.
>
> Sidenote: in theory, sufficiently bad clock skew (like e.g. wrong year
> in the future) could screw large swatches of the commit graph.  In
> practice, in real-life repositories, this doesn't happen, I guess.

Yes, you're right. Congyi Wu from the Azure Repos server team pointed 
this out to
me privately, as well. His example was someone merging a commit with date
3000-01-01, and making all later commits be useless. Adding one to the 
commit dates
of the parents "fixes" this issue by reverting our negative-cut index to 
be equivalent
to minimum generation number for all commits with this commit in their 
history.

>
>>    If cdate(A) < cdate(B), then A cannot reach B.
>>
>> **V4: FELINE Index.**
>> The FELINE index is a two-dimentional reachability index as defined in
>> [Reachability Queries in Very Large Graphs: A Fast Refined Online
>> Search Approach](https://openproceedings.org/EDBT/2014/paper_166.pdf)
>> by Veloso, Cerf, Jeira, and Zaki. The index is not deterministically
>> defined, but instead is defined in the following way:
>>
>> 1. Compute a reverse topological order of the commits. Let felineX(C)
>>     be the position in which C appears in this list. Since this is
>>     a reverse topological order, felineX(C) > felineX(P) for each parent
>>     P of C.
> Well, it depends on the definition of the topological order if it is
> tolopogical order or reverse topological order.  Anyway, the FELINE
> paper includes discussion about FELINE-I, i.e. inversed feline index.
>  From FELINE and FELINE-I, one of those is what was used here.
>
>> 2. Compute a reverse topological order of the commits using Kahn's
>>     algorithm, but when selecting among a list of commits with in-degree
>>     zero, prioritize the commit by minimizing felineX. Let felineY(C)
>>     be the position in which C appears in this list.
> Note that the second step, as given in the paper, is actually
> deterministic (assuming that we follow the recommendations; well, up to
> a point).  It needs some topological order for felineX, though.
>
>> Essentially, the felineY order is selected with the goal of swapping
>> positions of topologically-independent commits relative to the felinX
>> ordering. The resulting reachability index is as follows:
>>
>>     If felineX(A) < felineX(B), then A cannot reach B.
>>     If felineY(A) < felineY(B), then A cannot reach B.
>>
>> _Commentary:_ In terms of comparing two commits directly, this index
>> is quite strong. However, when we are performing our reachability
>> algorithms that care about reachable sets (git log --graph), or
>> boundaries between reachable sets (git merge-base, git log --graph A..B)
>> we need to track a single pair (minX,minY) for comparion. In order to not
>> miss anything during our search, we need to update this pair to be
>> the minimum felineX(A) and minimum felineY(B) among all explored commits
>> A and B. That is, the pair (minX, minY) is below our entire explored set.
>> This can be a disadvantage for these algorithms.
> Actually the problem is that FELINE index is partial order, not an
> ordering; not something that we can use in a priority queue.
>
> But the single [reverse] topological ordering is a reachability index in
> itself, and it is an ordering.  Thus we can use felineX as an index in
> priority queue (or we can use felineY).
>
> One thing to notice is that FELINE index performs well on real world
> large graphs which have quite different structure from commit graphs;
> among others they have either very large number of sources (head tips)
> or sinks (root nodes), or both; those numbers grow with the number of
> nodes.  For commit graphs they don't perform as well; well, at least
> ccording to my explorations on that matter in the Google Colaboratory
> notebook [2].
>
> [2]: https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg
>
>> ### Comparing Reachability Index Versions Viability
>>
>> Before considering how well these indexes perform during our algorithm
>> runs, let's take a moment to consider the implementation details and
>> how they relate to the existing commit-graph file format and existing
>> Git clients.
>>
>> * **Compatible?** In our test implementation, we use a previously unused
>>    byte of data in the commit-graph format to indicate which reachability
>>    index version we are using. Existing clients ignore this value, so we
>>    will want to consider if these new indexes are _backwards compatible_.
>>    That is, will they still report correct values if they ignore this byte
>>    and use the generation number column from the commit-graph file assuming
>>    the values are minimum generation numbers?
> In other words "backward compatibility" for a reachability index means
> that the reachability condition is exactly the same as it was:
>
>      If index(A) < index(B), then A cannot reach B.
>
> and also values of the reachability index are written in place of
> generation numbers in the commit-graph.
>
>> * **Immutable?** Git objects are _immutable_. If you change an object you
>>    actually create a new object with a new object ID. Are the values we
>>    store for these reachability indexes also immutable?
>>
>> * **Local?** Are these values **locally computable**? That is, do we only
>>    need to look at the parents of a commit (assuming those parents have
>>    computed values) in order to determine the value at that commit?
> Those two features imply that the number of commits that we need to walk
> to be able to update reachability index after appending new commits is
> O(new commits), which means O(changes).  I would allow for simple
> transformations of other values, e.g. adding a constant; we need to
> rewrite commit-graph file anyway.
>
> I wonder if there are reachability indexes that are either immutable but
> not local, or not immutable but local.  Doesn't being local imply
> immutability, with some common-sense assumptions (like no random choice
> for parent-less commits)?

Local probably implies immutable, but the reverse does not. The example 
I gave
elsewhere in the thread was "number of commits reachable from C". You can't
determine this directly from the values of your parents, but it doesn't 
change
even if you add more commits to the graph.

>
>> | Index                     | Compatible? | Immutable? | Local? |
>> |---------------------------|-------------|------------|--------|
>> | Minimum Generation Number | Yes         | Yes        | Yes    |
>> | (Epoch, Date) pairs       | Yes         | Yes        | Yes    |
>> | Maximum Generation Number | Yes         | No         | No     |
>> | Corrected Commit Date     | No          | Yes        | Yes    |
>> | FELINE index              | Yes         | No         | No     |
>>
>> _Note:_ The corrected commit date uses the generation number column
>> to store an offset of "how much do I need to add to my commit date
>> to get my corrected commit date?" The values stored in that column
>> are then not backwards-compatible.
> Additional note: this offset / delta based approach to storing corrected
> commit data is a form of lightweight compression, allowing it to fit
> safely in 30 bits (while the commit date uses 32+2 = 34 bits itself).
>
> Assuming that we want to keep this feature from the original
> commit-graph file format, that is.
>
>> _Note:_ The FELINE index requires storing two values instead of just
>> one. One of these values could be stored in the generation number
>> column and the other in an optional chunk, hence it could be backwards
>> compatible. (This is not how it is implemented in our example
>> implementation.)
> So the table above refers to this hypothetical implementation when
> stating that FELINE is backwards compatibile, but current test
> implementation is not, isn't it?

Correct.

>
>> Data
>> ----
>>
>> We focused on three types of performance tests that test the indexes
>> in different ways. Each test lists the `git` command that is used,
>> and the table lists which repository is used and which inputs.
>>
>> ### Test 1: `git log --topo-order -N`
>>
>> This test focuses on the number of commits that are parsed during
>> a `git log --topo-order` before writing `N` commits to output.
>>
>> You can reproduce this test using `topo-order-tests.sh` and
>> see all the data in `topo-order-summary.txt`. The values reported
>> here are a sampling of the data, ignoring tests where all values
>> were the same or extremely close in value.
>>
>>
>> | Repo         | N     | V0     | V1     | V2     | V3     | V4    |
>> |--------------|-------|--------|--------|--------|--------|-------|
>> | android-base | 100   |  5,487 |  8,534 |  6,937 |  6,419 | 6,453 |
>> | android-base | 1000  | 36,029 | 44,030 | 41,493 | 41,206 |45,431 |
>> | chromium     | 100   |    101 |424,406 |    101 |    101 |   101 |
>> | gerrit       | 100   |  8,212 |  8,533 |    164 |    159 |   162 |
>> | gerrit       | 1000  |  8,512 |  8,533 |  1,990 |  1,973 | 3,766 |
>> | Linux        | 100   | 12,458 | 12,444 | 13,683 | 13,123 |13,124 |
>> | Linux        | 1000  | 24,436 | 26,247 | 27,878 | 26,430 |27,875 |
>> | Linux        | 10000 | 30,364 | 28,891 | 27,878 | 26,430 |27,875 |
>> | electron     | 1000  | 19,927 | 18,733 |  1,072 | 18,214 |18,214 |
>> | Ffmpeg       | 10000 | 32,154 | 47,429 | 10,435 | 11,054 |11,054 |
>> | jgit         | 1000  |  1,550 |  6,264 |  1,067 |  1,060 | 1,233 |
>> | julia        | 10000 | 43,043 | 43,043 | 10,201 | 23,505 |23,828 |
>> | odoo         | 1000  | 17,175 |  9,714 |  4,043 |  4,046 | 4,111 |
>> | php-src      | 1000  | 19,014 | 27,530 |  1,311 |  1,305 | 1,320 |
>> | rails        | 100   |  1,420 |  2,041 |  1,757 |  1,428 | 1,441 |
>> | rails        | 1000  |  7,952 | 10,145 | 10,053 |  8,373 | 8,373 |
>> | swift        | 1000  |  1,914 |  4,004 |  2,071 |  1,939 | 1,940 |
>> | tensorflow   | 1000  | 10,019 | 39,221 |  6,711 | 10,051 |10,051 |
>> | TypeScript   | 1000  |  2,873 | 12,014 |  3,049 |  2,876 | 2,876 |
> Do I understand it correctly that the range of N for a given project is
> limited by the depth of the project history (hence maximum N of 10000
> for Linux kernel, but only 100 for chromium)?

I ran the tests with N equal to 100, 1000, and 10000 on all repos, but only
included results for values that were interesting. For chromium, something
about the topology let all versions (except V1) report N + 1, so I 
didn't include
the other values.

> I wonder what the OLD numbers are for these operations.

The OLD numbers are equal to the number of commits reachable from HEAD
in every case. I didn't think this was interesting to report. You can 
see the
numbers for yourself in the output data file:

https://github.com/derrickstolee/gen-test/blob/master/topo-order-summary.txt

>
>> ### Test 2: `git log --topo-order -10 A..B`
>>
>> This test focuses on the number of commits that are parsed during
>> a `git log --topo-order A..B` before writing ten commits to output.
>> Since we fix a very small set of output commits, we care more about
>> the part of the walk that determines which commits are reachable
>> from `B` but not reachable from `A`. This part of the walk uses
>> commit date as a heuristic in the existing implementation.
> [...]
>
>> ### Test 3: `git merge-base A B`
>>
>> This test focuses on the number of commits that are parsed during
>> a `git merge-base A B`. This part of the walk uses commit date as
>> a heuristic in the existing implementation.
> [...]
>
>> Conclusions
>> -----------
>>
>> Based on the performance results alone, we should remove minimum
>> generation numbers, (epoch, date) pairs, and FELINE index from
>> consideration. There are enough examples of these indexes performing
>> poorly.
>>
>> In contrast, maximum generation numbers and corrected commit
>> dates both performed quite well. They are frequently the top
>> two performing indexes, and rarely significantly different.
>>
>> The trade-off here now seems to be: which _property_ is more important,
>> locally-computable or backwards-compatible?
>>
>> * Maximum generation number is backwards-compatible but not
>>    locally-computable or immutable.
>>
>> * Corrected commit-date is locally-computable and immutable,
>>    but not backwards-compatible.
>>
>> _Editor's Note:_ Every time I think about this trade-off, I can't
>> come to a hard conclusion about which is better. Instead, I'll
>> leave that up for discussion.
> In my opinion being able to update commit-graph data fast is more
> important than being backward-compatibile, i.e. beig able to use
> commit-data file generated by new clients (and servers) by old clients.
>
>
> Thank you for all your work,

Thanks for the thoughtful discussion!

-Stolee

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Generation Number v2
  2018-11-01 13:29   ` Derrick Stolee
@ 2018-11-03 12:33     ` Jakub Narebski
  0 siblings, 0 replies; 17+ messages in thread
From: Jakub Narebski @ 2018-11-03 12:33 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git List, Jeff King, Junio C Hamano, Derrick Stolee,
	Ævar Arnfjörð Bjarmason

Derrick Stolee <stolee@gmail.com> writes:
> On 11/1/2018 8:27 AM, Jakub Narebski wrote:
>> Derrick Stolee <stolee@gmail.com> writes:
>>
>>> Please also let me know about any additional tests that I could
>>> run. Now that I've got a lot of test scripts built up, I can re-run
>>> the test suite pretty quickly.
>>
>> I would add straighforward 1-to-N and N-to-1 reachability tests in the
>> form of `git branch / tag --contains` and `git branch / tag --merged`,
>> and perhaps also ahead-behind calculations used by `git status`, and
>> N-to-M reachability tests used by tag following code in push and fetch.

1-to-N and N-to-1 can be done with `git branch --merged` / `--contains`.

>> Possibly also A...B walk, if it is not implemented via calculating
>> merge-base.
>
> I believe this uses paint_down_to_common(), but looks at the PARENT1 and
> PARENT2 flags afterward to determine the left/right/both relationships.

Right, so I guess this is the same internal mechanism that `git
merge-base` utilizes, just used a bit differently.  Thus benchmarking
`git merge-base` should cover also A...B performance, I guess.

>> Maybe even --ancestry-path walk, though I am not sure how important best
>> performance for rhis case is (we would want good performance, but
>> perhaps best is not needed for rarely used command).
>
> Currently, the implementation of --ancestry-path does not use a
> reachability index.

Well, using reachability index would certainly give it a boost.

[...]
>>> Generation Number Performance Test
>>> ==================================
>>>
>>> Git uses the commit history to answer many basic questions, such as
>>> computing a merge-base. Some of these algorithms can benefit from a
>>> _reachability index_, which is a condition that allows us to answer
>>> "commit A cannot reach commit B" in some cases. These require pre-
>>> computing some values that we can load and use at run-time. Git
>>> already has a notion of _generation number_, stores it in the commit-
>>> graph file, and uses it in several reachability algorithms.
>>
>> Note that there are other kinds of reachability indices.
>>
>> First, there are reachability indices that can answer the full
>> reachability query (if commit A can reach commit B, or if commit A
>> cannot reach commit B) directly, without walking the commit graph at
>> all: so called label-only approach.  For example one could store for
>> each commit the compressed list of all commits reachable from it
>> (transitive closure compression).
>>
>> Those, I think (but I have not checked), would be of not much use for
>> Git, as the size of the index grows stronger than linear with the number
>> of commits, as grows the time to compute such index.  So probably of no
>> use to Git, at least not directly (Git uses so called "bitmap index",
>> see e.g. [1], which stores reachability bit-vector as compressed
>> bitmap... but not for all commits, only for a small subset).
>>
>>
>> Second, beside negative-cut reachability indexes, that can answer with
>> certainity that "commit A cannot reach commit B", like the generation
>> numbers (also known as level, or topological level), there also
>> positive-cut indexes (usually if not always based on the spanning tree
>> or trees for the DAG), that can answer when "commit A can reach commit
>> B".
>>
>> I think that those can lead to significant speedups for at least some
>> types of commit traversal and reachability queries that Git needs to
>> answer.  But currently no algorithms has a provision for using
>> positive-cut filter index.  Anyway, such index would be auxiliary thing,
>> see the next point.
>>
>>
>> Third, we require more than having reachability index in the sense of
>> having some kind of _labelling_, often composite, that can answer either
>> "commit A cannot reach commit B" or "commit A can reach commit B",
>> depending on the type.  Git for most operations needs more, namely an
>> actual _ordering_, that is a weak order (or to be more exact a total
>> preorder, i.e. there can be two different commits with the same
>> "generation number" or index, but always either idx(A) ≲ idx(B) or
>> idx(B) ≲ idx(A)) and not only partial ordering (where there can be
>> incomparable elements, i.e neither idx(A) ≼ idx(B) nor idx(B) ≼ idx(A)).
>> This is because it needs to be used in priority queue to decide which
>> commit to travel next; more on that later.  This is also why there can
>> be only one such "main" reachability _index_.
>>
>> [1]: https://githubengineering.com/counting-objects/
>
> Thanks for the details. At the moment, I'm only interested in improving our
> negative-cut reachability index, as we have algorithms that can take
> advantage of them (and can compare their performance directly).

Right, I can agree with that.  Positive-cut reachability index use would
need to be added separately.

What I wanted to emphasize is the need for a _number_ (or a total
preorder), not simply an _index_ (a label, perhaps a composite one).
This means, as I wrote, that there would be only one main "generation
number".  It also limits the number of possible reachability indexes to
consider.

[...]
>> In Git there are at least four different types of graph traversal
>> queries, with somewhat different requirements, and affected differently
>> by various reachability indexes.
>>
>>
>> ## Pure reachability queries
>>
>> First there are pure reachability queries, when we are interested only
>> in nswering he question whether commit A can reach commit B or not.  We
>> are nt interested in list of commits between A and B; if reachability
>> index, be it negative-cut or positive-cut, can answer the question, it
>> is all we need.  If we need to walk (perform online search), then we are
>> interested in just being able to arrive; we need to walk only one path.
>>
>> This query can be performed one-to-one, one-to-many (which of commits
>> from the list can be reached from commit A), many-to-one (which commits
>> from the list can reach commit B), and many-to-many -- the latest for
>> example with recently added get_reachable_subset().
>>
>> These types of queries are done by "git branch --contains" (and
>> "--no-contains") and "git branch --merged" (and "--no-merged"), the same
>> for "git tag"; if I remember it correctly those were among first
>> operations that were sped up by commit-graph mechanism.  Many-to-many
>> queries are done for tag following in "git push" and "git fetch".
>>
>> Note that those queries are missing from gen-test.
>
> These are definitely a good idea to pursue further testing. The only
> thing I can think of right now is that the tests can be hard to set up,
> but perhaps`git branch --remote --contains` or `git tag --contains`
> would be interesting.

Also `git tab --merged`, for the "reverse" check; then this should cover
both sides of N-to-M queries used by `git fetch` (add_missing_tags() via
new get_reachable_subset()).

In any case, if all else fails, we can add new test-something script...

>> ## Topological ordering
>>
>> Here reachability index is used to find out in-degree of considered
>> commits (starting from a set of starting points).  Here we don't really
>> need to walk all paths, but we need to walk all possible ways of
>> arriving at given commit.
>
> You make a good point here about the difference between "all paths to C"
> and "all children of C", since the number of paths can grow exponentially
> (and frequently does in Git) but the number of children is linear (and
> small in practice).

Sidenote: I wonder how the distribution of number of parent
(out-degree), numbers of childern (in-degree), and number of paths to a
commit looks like.

>> Though I guess the [new] incremental algorithm could be modified to only
>> check for in-degree being zero or non-zero.  Would it make it slower or
>> faster; I guess in absence of good positive-cut reachability filter it
>> would make it slower.
>
> I don't understand how a positive-cut filter helps in this case. The point
> of the in-degreecalculation is to say "we already output all commits that
> can reach this commit" beforewe output another commit. How could a
> positive-cut filter help here?

That was more of an idle thought than a concrete proposal.

With the new 3-phase algorithm, we populate topo_queue with commits of
an in-degree of zero, calculating this on-degree if necessary.  I
thought that those pre-candidates could have been eliminated early with
positive-cut index, instead of having to compute in-degree in full.

[...]
>> ## Strict ancestry, or ancestry path
>>
>> When given a range of commits to display (e.g. A..B), only display
>> commits that exist directly on the ancestry chain between B and A,
>> i.e. commits that are both descendants of B, and ancestors of A.  In
>> other words, find all paths from B to A (if they exist).
>>
>> In this case we want to walk all the paths, heavily pruning.  This
>> should be less dependent on reachability index quality as an index (as
>> an ordering), and more on pruning ability of the filter.
>>
>> This case is not tested in gen-test, but I wonder how often this feature
>> is actualy used (if it would be worth adding it to the benchmark, and if
>> so, with what weight behind it).
>
> As mentioned earlier, the --ancestry-path algorithm does not currently use a
> negative-cut filter, so we would not gather any data on this at the moment.
>
> However, I imagine the algorithm is similar to the typical A..B case, as we
> need to discover the (po)set difference, construct the reversed digraph on
> those commits, and walk backwards from A. Assuming the difference is much
> smaller than the entire commit graph, then the main cost is the walk that
> discovers the difference, and hence is covered somewhat by the `git log
> --topo-order A..B` tests.

I would say that --ancestry-path is a little like A..B walk, a little
like the in-degree computing in --topo-order (because in in-degree
computing we are also interested only in paths that arrive at given
commit).

[...]
>>> **V1: (Epoch, Date) Pairs.**
>>>
>>> For each commit, store a pair of values called the _epoch_ and the _date_.
>>> The date is the normal commit date. The _epoch_ of a commit is the minimum
>>> X such that X is at least the maximum epoch of each parent, and at least
>>> one more than the epoch of any parent whose date is larger than the date
>>> of this commit (i.e. there is clock skew between this commit and this
>>> parent). In this way, we can use the following reachability index:
>>>
>>>     If epoch(A) < epoch(B), then A cannot reach B.
>>>     If epoch(A) == epoch(B) and date(A) < date(B), then A cannot reach B.
>>
>> I wonder what makes it perform worse than corrected date aka V3.
>
> In the example `git merge-base v4.8 v4.9` in the Linux repo, the topology
> includes two commits (say, C1 and C2) of low generation number but high
> commit-date. These commits also have low epoch. However, there are 607
> commits in the repo whose commit date is smaller than a parent's commit
> date. This means that the epoch in themain trunk of the repo can be as
> high as 607, while the epoch for C1 and C2 is likely in the single digits.
> This means that we need to walk all commits with epoch greater than
> min { epoch(C1), epoch(C2) } before we explore C1 and C2 and terminate
> the walk.
>
> In V3, the commits C1 and C2 have high corrected commit date, higher than
> any of the commits that require positive offset to overcome the clock skew
> with their parents. This allows the walk to be very similar to the old
> algorithm, as seen in this `git merge-base A B` test summary:
>
> Linux c8d2bc9bc39e 69973b830859
[I have added names of algorithms / generationn numbers here]

>   OLD: 167468        -----         no reachability index
>    V0: 635579        gen(C)        (Minimum) Generation Number
>    V1: 630138        epoch(C)      (Epoch, Date) Pairs
>    V2:  33716        maxgen(C)     Maximum Generation Numbers
>    V3: 167496        cdate(C)      Corrected Commit Date
>    V4: 153774        felineX(C)    FELINE Index

Ah, so the problem with (Epoch, Date) pairs is that the trunk acquires
high epoch, while side branches (which may be shortcuts) often have
lower epoch.  Makes sense, if maybe not obvious without such
examination.

>>> **V2: Maximum Generation Numbers.**
>>> The _maximum generation number_ of a commit (denoted by maxgen(A)) is
>>> defined using its _children_ and the total number of commits in the repo.
>>> If A is a commit with no children (that is, there is no commit B with
>>> A as a parent) then maxgen(A) is equal to the number of commits in the repo.
>>> For all other commits A, maxgen(A) is one less than the minimum maxgen(B)
>>> among children B. This can be computed by initializing maxgen(C) to the
>>> number of commits, then walking the commit graph in topological order,
>>> assigning maxgen(P) = min { maxgen(P), maxgen(C) - 1 } for each parent P
>>> of the currently-walking commit C. We then have the same reachability
>>> index as minimum generation number:
>>>
>>>    If maxgen(A) < maxgen(B), then A cannot reach B.
>>
>> If I understand it correctly, this is the same as reverse generation
>> numbers, or reverse topological levels; in other words genertion numbers
>> on reversed graph -- only transformed:
>>
>>     maxgen(A) == number_of_commits - (revgen(A) - 1)

Actually using number_of_commits may not be the best choice here with
respect to number of commits that need to have maxgen(C) be recalculated
on update...

>> We define revgen(A) in the following way:
>> - for head tips (for source nodes), i.e. commits with in-degree of 0,
>>    with no children, have revgen(A) = 1 (revgen(A) = 0 is left for
>>    commits outside commit-graph, which translates to INFINITY for
>>    maxgen(A)).
>> - otherwise, it is 1 more than maximum revgen(C) of its children
>>
>> They are equivalent, but maxgen(A) is "backward compatibile", that is
>> the rechability condition is exactly the same as for ordinary generation
>> numbers:
>>
>>      If gen(A) < gen(B), then A cannot reach B.
>>      If maxgen(A) < maxgen(B), then A cannot reach B.
>>
>> But
>>
>>      If revgen(A) > revgen(B), then A cannot reach B.
>
> Yes, these are the same idea, and the reason to phrase it as I did is to
> keep the inequality in the same direction.

...and if using gen(C) as a starting point, you avoid rewriting whole
DAG on update.

Definition of maxgen'(C):

  - if A is commit without children (that is, there is no commit B with
    A as a parent), then maxgen'(A) = gen(A)
  - otherwise, maxgen'(A) = -1 + min_{B has A as parent}(maxgen'(B))

Update algorithm:

  1. compute gen(A) for all new commits, which is O(changes)
  2. select commit A with highest gen(A) that doesn't have maxgen'(A):
     [such commit does not have any children, because of gen() properties]
     and add it to queue Q
     2.1. take commit C from the queue Q
     2.2. for each parent P of commit C
          2.2.1. if maxgen'(P) <= maxgen'(C) - 1, continue
          2.2.2. update maxgen'(P) to maxgen'(C) - 1
          2.2.3. add P to queue Q

I have tested this version of maxgen(C) algorithm in a very few cases in
other post in this thread [3], and it looks like the amount of commits
that need to have maxgen() recalculated is limited.

[3]: https://public-inbox.org/git/86ftwjzv1h.fsf@gmail.com/

[...]
>>> _Commentary:_ The known examples where (minimum) generation numbers perform
>>> worse than commit date heuristics involve commits with recent commit dates
>>> but whose parents have very low generation number compared to most recent
>>> commits. In a way, minimum generation numbers increase the odds that we
>>> can say "A cannot reach B" when A is fixed and B is selected at random.
>>> Maximum generation numbers increase the odds that we can say "A cannot
>>> reach B" when B is fixed and A is selected at random. This helps us when
>>> we are exploring during a reachability algorithm and have explored few
>>> commits and want to say that the large set of unexplored commits cannot
>>> reach any of the explored commits.
>>>
>> I guess that the situation where shortcut path have recent date, while
>> having less commits in path (and thus lover ordinary generation number)
>> is more common in real commit graphs because one might want to base a
>> commit (for example a bugfix) on an old commit, but commits usually get
>> merged quickly, and not left for long time.  If they re left for long
>> time, they usually needs correction (via rebase) before being merged,
>> and again we have commit with date close to the date of merge, leading
>> to a shortcut.
>>
>> Does this reasoning looks correct to you?
>
> These "bug fixes based on old commits" cases are few and far between, so
> are not enough to explain how often maximum generation number works well.
> Instead, I would say that it works well because the typical Git pattern is
> to merge into a single trunk. In the case of projects like Linux, there are
> multiple "trunks" that are run by lieutenants and are merged into a "super
> trunk"; even in these cases, the number of trunks is much smaller than the
> number of topics. Let's focus on a single-trunk model for the
> discussion below.
>
> When working with a single trunk, most merge commits are in the first-parent
> historyof that trunk. As we travel down those trunks, the maximum generation
> is decreasing. This means that the number of commits that can have equal
> generation is limited bythe number of commits in a single topic, which is
> typically small. This is opposed to the minimum generation, which can be
> unbounded. For example, if everyone bases their Git patches on the latest
> release, then if we walk into those topics, then we need to walk all the way
> to that release to determine reachability.
>
> Here is a picture, representing each topic as an interval of consecutive
> generation values:
>
> # Minimum Generation Numbers:
>
> v2.19.0----------M1----M2----M3----M4
>     |\          /     /     /     /
>     | [topic1]-/     /     /     /
>     |\              /     /     /
>     | [topic2]-----/     /     /
>     |\                  /     /
>     | [topic3]---------/     /
>      \                      /
>       [topic4]-------------/
>
>     =====generation====>
>
> # Maximum Generation Numbers:
>
> v2.19.0----------M1----M2----M3----M4
>     |\          /      |     |     |
>     | [topic1]-/       |     |     |
>     |\                 |     |     |
>     | \-------[topic2]/      |     |
>     |\                       |     |
>     | \-------------[topic3]/      |
>      \                             |
>       \-------------------[topic4]/
>
>     =====generation====>

All right, so the problem with (minimum) generation numbers performance
will be exhibited in any project that utilizes the topic branch
workflow.

Nice to know the case.

>>> **V3: Corrected Commit Date.**
>>> For a commit C, let its _corrected commit date_ (denoted by cdate(C))
>>> be the maximum of the commit date of C and the [corrected] commit dates of its
>>> parents.
>>
>> Wouldn't it be better to use the maximum of the commit date of C, and
>> one second more than the corrected commit dates of its parents?  This
>> way corrected dates would be strictly monotonic along the chain.
>>
>> Sidenote: in theory, sufficiently bad clock skew (like e.g. wrong year
>> in the future) could screw large swatches of the commit graph.  In
>> practice, in real-life repositories, this doesn't happen, I guess.
>
> Yes, you're right. Congyi Wu from the Azure Repos server team pointed
> this out to me privately, as well. His example was someone merging a
> commit with date 3000-01-01, and making all later commits be
> useless. Adding one to the commit dates of the parents "fixes" this
> issue by reverting our negative-cut index to be equivalent to minimum
> generation number for all commits with this commit in their history.

I wonder if the problem with bad clock skew with the date far in the
future could be solved by using instead for such cases (where
date(parent) > date(commit) + large Delta) the average of the commit and
of the grandparents (parents of parent commit with such large positive
clock skew).  Though this wouldn't be strictly speaking immutable...

Another edge case, which I hope is not a problem in real-life
repositories, is clock skew very far in the past, and storing offsets in
30-bits.  If there was some malformed commit which has 0 as commit date
epoch (Unix time), i.e. 1970-01-01, then the delta / offset may not fit
in 30 bits that we use for storing generation numbers; there are 32+2 =
34 bits for comit date.  Something to think about.

[...]
>>> ### Comparing Reachability Index Versions Viability
>>>
>>> Before considering how well these indexes perform during our algorithm
>>> runs, let's take a moment to consider the implementation details and
>>> how they relate to the existing commit-graph file format and existing
>>> Git clients.
>>>
>>> * **Compatible?** In our test implementation, we use a previously unused
>>>    byte of data in the commit-graph format to indicate which reachability
>>>    index version we are using. Existing clients ignore this value, so we
>>>    will want to consider if these new indexes are _backwards compatible_.
>>>    That is, will they still report correct values if they ignore this byte
>>>    and use the generation number column from the commit-graph file assuming
>>>    the values are minimum generation numbers?
>>
>> In other words "backward compatibility" for a reachability index means
>> that the reachability condition is exactly the same as it was:
>>
>>      If index(A) < index(B), then A cannot reach B.
>>
>> and also values of the reachability index are written in place of
>> generation numbers in the commit-graph.

Note that it is backwards-compatibility only for use of generation
numbers, but of course not for an update.

>>> * **Immutable?** Git objects are _immutable_. If you change an object you
>>>    actually create a new object with a new object ID. Are the values we
>>>    store for these reachability indexes also immutable?
>>>
>>> * **Local?** Are these values **locally computable**? That is, do we only
>>>    need to look at the parents of a commit (assuming those parents have
>>>    computed values) in order to determine the value at that commit?
>>>
>> Those two features imply that the number of commits that we need to walk
>> to be able to update reachability index after appending new commits is
>> O(new commits), which means O(changes).  I would allow for simple
>> transformations of other values, e.g. adding a constant; we need to
>> rewrite commit-graph file anyway.
>>
>> I wonder if there are reachability indexes that are either immutable but
>> not local, or not immutable but local.  Doesn't being local imply
>> immutability, with some common-sense assumptions (like no random choice
>> for parent-less commits)?
>
> Local probably implies immutable, but the reverse does not. The
> example I gave elsewhere in the thread was "number of commits
> reachable from C". You can't determine this directly from the values
> of your parents, but it doesn't change even if you add more commits to
> the graph.

Right, reach(C) is immutable but not local.

Local may not mean immutable if the starting points, i.e. parentless
commits, get assingled not-immutable values.  But I don't see how that
could happen without some randomness.

>>> Data
>>> ----
>>>
>>> We focused on three types of performance tests that test the indexes
>>> in different ways. Each test lists the `git` command that is used,
>>> and the table lists which repository is used and which inputs.
>>>
>>> ### Test 1: `git log --topo-order -N`
[...]
>>>
>>> | Repo         | N     | V0     | V1     | V2     | V3     | V4    |
>>> |--------------|-------|--------|--------|--------|--------|-------|
>>> | android-base | 100   |  5,487 |  8,534 |  6,937 |  6,419 | 6,453 |
>>> | android-base | 1000  | 36,029 | 44,030 | 41,493 | 41,206 |45,431 |
>>> | chromium     | 100   |    101 |424,406 |    101 |    101 |   101 |
>>> | gerrit       | 100   |  8,212 |  8,533 |    164 |    159 |   162 |
>>> | gerrit       | 1000  |  8,512 |  8,533 |  1,990 |  1,973 | 3,766 |
>>> | Linux        | 100   | 12,458 | 12,444 | 13,683 | 13,123 |13,124 |
>>> | Linux        | 1000  | 24,436 | 26,247 | 27,878 | 26,430 |27,875 |
>>> | Linux        | 10000 | 30,364 | 28,891 | 27,878 | 26,430 |27,875 |
>>> | electron     | 1000  | 19,927 | 18,733 |  1,072 | 18,214 |18,214 |
>>> | Ffmpeg       | 10000 | 32,154 | 47,429 | 10,435 | 11,054 |11,054 |
>>> | jgit         | 1000  |  1,550 |  6,264 |  1,067 |  1,060 | 1,233 |
>>> | julia        | 10000 | 43,043 | 43,043 | 10,201 | 23,505 |23,828 |
>>> | odoo         | 1000  | 17,175 |  9,714 |  4,043 |  4,046 | 4,111 |
>>> | php-src      | 1000  | 19,014 | 27,530 |  1,311 |  1,305 | 1,320 |
>>> | rails        | 100   |  1,420 |  2,041 |  1,757 |  1,428 | 1,441 |
>>> | rails        | 1000  |  7,952 | 10,145 | 10,053 |  8,373 | 8,373 |
>>> | swift        | 1000  |  1,914 |  4,004 |  2,071 |  1,939 | 1,940 |
>>> | tensorflow   | 1000  | 10,019 | 39,221 |  6,711 | 10,051 |10,051 |
>>> | TypeScript   | 1000  |  2,873 | 12,014 |  3,049 |  2,876 | 2,876 |
>>
>> Do I understand it correctly that the range of N for a given project is
>> limited by the depth of the project history (hence maximum N of 10000
>> for Linux kernel, but only 100 for chromium)?
>
> I ran the tests with N equal to 100, 1000, and 10000 on all repos, but only
> included results for values that were interesting. For chromium, something
> about the topology let all versions (except V1) report N + 1, so I
> didn't include the other values.

Ah, that information (only interesting cases were included) would be
something good to know.

I wonder if the relative performance of different generation numbers is
correlated with the topology of the commit graph, for example with
increment patterns to intergation patterns ratio in the commit metagraph
(like --simplify-by-decoration), like in [4] and related works.

[4] Marco Biazzini, Martin Monperrus, Benoit Baudry "On Analyzing the
    Topology of Commit Histories in Decentralized Version Control
    Systems", DOI: 10.1109/ICSME.2014.48 (2014)

>> I wonder what the OLD numbers are for these operations.
>
> The OLD numbers are equal to the number of commits reachable from HEAD
> in every case. I didn't think this was interesting to report. You can
> see the numbers for yourself in the output data file:
>
> https://github.com/derrickstolee/gen-test/blob/master/topo-order-summary.txt

Thanks for the info.

-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2018-11-03 17:27 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-29 16:55 [RFC] Generation Number v2 Derrick Stolee
2018-10-29 19:22 ` Stefan Beller
2018-10-29 20:06   ` Derrick Stolee
2018-11-01 20:06   ` Jakub Narebski
2018-11-02  9:30     ` Jakub Narebski
2018-11-03 17:27       ` Jakub Narebski
2018-10-29 20:25 ` Derrick Stolee
2018-11-01 22:13   ` Jakub Narebski
2018-10-30  3:59 ` Junio C Hamano
2018-10-31 12:30   ` Derrick Stolee
2018-11-02 13:33     ` Jakub Narebski
2018-10-31 12:54   ` Ævar Arnfjörð Bjarmason
2018-10-31 13:04     ` Derrick Stolee
2018-11-02 17:44       ` Jakub Narebski
2018-11-01 12:27 ` Jakub Narebski
2018-11-01 13:29   ` Derrick Stolee
2018-11-03 12:33     ` Jakub Narebski

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).