git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jakub Narebski <jnareb@gmail.com>
To: Derrick Stolee <stolee@gmail.com>
Cc: "Git List" <git@vger.kernel.org>, "Jeff King" <peff@peff.net>,
	"Junio C Hamano" <gitster@pobox.com>,
	"Derrick Stolee" <dstolee@microsoft.com>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Subject: Re: [RFC] Generation Number v2
Date: Thu, 01 Nov 2018 13:27:30 +0100	[thread overview]
Message-ID: <86efc50yq5.fsf@gmail.com> (raw)
In-Reply-To: <6367e30a-1b3a-4fe9-611b-d931f51effef@gmail.com> (Derrick Stolee's message of "Mon, 29 Oct 2018 12:55:33 -0400")

[I have noticed that in some places I wrote A..B instead of B..A.  Sorry
 about that]

Derrick Stolee <stolee@gmail.com> writes:

> We've discussed in several places how to improve upon generation
> numbers. This RFC is a report based on my investigation into a
> few new options, and how they compare for Git's purposes on
> several existing open-source repos.
>
> You can find this report and the associated test scripts at
> https://github.com/derrickstolee/gen-test.

Which use prototype test implementation from the `reach-perf` branch in
https://github.com/derrickstolee/git.

> I have some more work to do in order to make this fully
> reproducible (including a single script to clone all of the
> repos I used, compute the commit-graphs using the different
> index versions, and run the tests).
>
> TL;DR: I recommend we pursue one of the following two options:
>
> 1. Maximum Generation Number.
> 2. Corrected Commit Date.
>
> Each of these perform well, but have different implementation
> drawbacks. I'd like to hear what everyone thinks about those
> drawbacks.

I agree with Junio that incremental computation is more important than
backwards compatibility (that the old clients can work with the new
commit-graph without changes).  We would have compatibility breaking
anyway with the planned move from SHA-1 to NewHash aka. SHA-256.

> Please also let me know about any additional tests that I could
> run. Now that I've got a lot of test scripts built up, I can re-run
> the test suite pretty quickly.

I would add straighforward 1-to-N and N-to-1 reachability tests in the
form of `git branch / tag --contains` and `git branch / tag --merged`,
and perhaps also ahead-behind calculations used by `git status`, and
N-to-M reachability tests used by tag following code in push and fetch.

Possibly also A...B walk, if it is not implemented via calculating
merge-base.

Maybe even --ancestry-path walk, though I am not sure how important best
performance for rhis case is (we would want good performance, but
perhaps best is not needed for rarely used command).

See explanation below.


> Generation Number Performance Test
> ==================================
>
> Git uses the commit history to answer many basic questions, such as
> computing a merge-base. Some of these algorithms can benefit from a
> _reachability index_, which is a condition that allows us to answer
> "commit A cannot reach commit B" in some cases. These require pre-
> computing some values that we can load and use at run-time. Git
> already has a notion of _generation number_, stores it in the commit-
> graph file, and uses it in several reachability algorithms.

Note that there are other kinds of reachability indices.

First, there are reachability indices that can answer the full
reachability query (if commit A can reach commit B, or if commit A
cannot reach commit B) directly, without walking the commit graph at
all: so called label-only approach.  For example one could store for
each commit the compressed list of all commits reachable from it
(transitive closure compression).

Those, I think (but I have not checked), would be of not much use for
Git, as the size of the index grows stronger than linear with the number
of commits, as grows the time to compute such index.  So probably of no
use to Git, at least not directly (Git uses so called "bitmap index",
see e.g. [1], which stores reachability bit-vector as compressed
bitmap... but not for all commits, only for a small subset).


Second, beside negative-cut reachability indexes, that can answer with
certainity that "commit A cannot reach commit B", like the generation
numbers (also known as level, or topological level), there also
positive-cut indexes (usually if not always based on the spanning tree
or trees for the DAG), that can answer when "commit A can reach commit
B".

I think that those can lead to significant speedups for at least some
types of commit traversal and reachability queries that Git needs to
answer.  But currently no algorithms has a provision for using
positive-cut filter index.  Anyway, such index would be auxiliary thing,
see the next point.


Third, we require more than having reachability index in the sense of
having some kind of _labelling_, often composite, that can answer either
"commit A cannot reach commit B" or "commit A can reach commit B",
depending on the type.  Git for most operations needs more, namely an
actual _ordering_, that is a weak order (or to be more exact a total
preorder, i.e. there can be two different commits with the same
"generation number" or index, but always either idx(A) ≲ idx(B) or
idx(B) ≲ idx(A)) and not only partial ordering (where there can be
incomparable elements, i.e neither idx(A) ≼ idx(B) nor idx(B) ≼ idx(A)).
This is because it needs to be used in priority queue to decide which
commit to travel next; more on that later.  This is also why there can
be only one such "main" reachability _index_.

[1]: https://githubengineering.com/counting-objects/

> You can read more about generation numbers and how to use them in
> algorithms in [this blog
> post](https://blogs.msdn.microsoft.com/devops/2018/07/09/supercharging-the-git-commit-graph-iii-generations/).
>
> However, [generation numbers do not always improve our
> algorithms](https://public-inbox.org/git/pull.28.git.gitgitgadget@gmail.com/T/#u).
> Specifically, some algorithms in Git already use commit date as a
> heuristic reachability index. This has some problems, though, since
> commit date can be incorrect for several reasons (clock skew between
> machines, purposefully setting GIT_COMMIT_DATE to the author date,
> etc.).

That's what we use the "slop" mechanism for: Git would walk a few
commits more than necessary than if there were no clock skew (if commit
dates were monotonic), and assume that the skew is not to severe to
finish early.

But as we are adding more early termination condition, this introduces
more places where things can go wrong for Git to return incorrect
results.  Thus the need for strict, non-heuristic reachability index.

> However, the speed boost by using commit date as a cutoff was so
> important in these cases, that the potential for incorrect answers was
> considered acceptable.

Yes, and there were few places where date as cutoff was used; we have
since added more place where generation number / reachbility index is
used as cutoff.  Thus more potential for incorrect answers if using
heuristics.

> When these algorithms were converted to use generation numbers, we
> _added_ the extra constraint that the algorithms are _never incorrect_.

Right.

> Unfortunately, this led to some cases where performance was worse than
> before. There are known cases where `git merge-base A B` or
> `git log --topo-order A..B` are worse when using generation numbers
> than when using commit dates.

In Git there are at least four different types of graph traversal
queries, with somewhat different requirements, and affected differently
by various reachability indexes.


## Pure reachability queries

First there are pure reachability queries, when we are interested only
in nswering he question whether commit A can reach commit B or not.  We
are nt interested in list of commits between A and B; if reachability
index, be it negative-cut or positive-cut, can answer the question, it
is all we need.  If we need to walk (perform online search), then we are
interested in just being able to arrive; we need to walk only one path.

This query can be performed one-to-one, one-to-many (which of commits
from the list can be reached from commit A), many-to-one (which commits
from the list can reach commit B), and many-to-many -- the latest for
example with recently added get_reachable_subset().

These types of queries are done by "git branch --contains" (and
"--no-contains") and "git branch --merged" (and "--no-merged"), the same
for "git tag"; if I remember it correctly those were among first
operations that were sped up by commit-graph mechanism.  Many-to-many
queries are done for tag following in "git push" and "git fetch".

Note that those queries are missing from gen-test.


## Topological ordering

Here reachability index is used to find out in-degree of considered
commits (starting from a set of starting points).  Here we don't really
need to walk all paths, but we need to walk all possible ways of
arriving at given commit.

Though I guess the [new] incremental algorithm could be modified to only
check for in-degree being zero or non-zero.  Would it make it slower or
faster; I guess in absence of good positive-cut reachability filter it
would make it slower.

This operation is benchmarked both standalone i.e. `git log --topo-order -N`
and in combination with A..B walk in `git log --topo-order -10 A..B`.


## (Po)set difference, or B..A range

In this type of query, with at least one "negative" / "negated" commit,
we want to find out all commits that are reachable from commit A that
are not reachable from commit B.

Note that A (commit in positive set) is treated differently with respect
to traversal from B (commit in negative set).  We need to travel (and
possibly list) all paths from A, while when traveling from B the only
thing important is reachability, not the actual path or paths.  We want
to travel as few commits as possible when walking from B, and here one
generation numbers are better than the other in the presence of multiple
paths.

The situation in nontrivial case could look as follows:

  ---.---.---o---*---*---*----*---*-----*---A
              \
               \---x-----x-----x-----B

In such case, the algorithm paints down from positive set i.e. commit A,
and from negative set i.e. commit B, and walk down from A till we get to
commits painted down from B.

We want to walk those commits somewhat in sync, using priority-queue
based variant of breadth-first search (well, kind of), so that we don't
walk more commits from A than necessary, only to notice later that these
are actually reachable from B, and neither we want to walk more commits
from B than necessary, going past the boundary commit.

Sidenote: I think that positive-cut auxiliary reachability index
(reachability filter) can help speed up this operation, making it
possible to mark commit as reachable from the negative set (from commit
B) and stop early without actually having to walk from B... though the
more in sync walks from A and B would be, the less it would probably
help.

This is tested using `git log --topo-order -10 A..B`.


## Merge base, and A...B walk

In this case we either find one or all commits that are reachable from
both A and B, or (which is related query) find all commits reachable
from A that are not reachable from B and vice versa, that is all commits
reachable from B that are not reachable from A.

In this case we walk from both A and B the same way: the operation is
symmetric.  In the merge base calculations we need to walk only one path
if there are many equivaent ones; in A...B walk we have to actually walk
them.  Thus merge base calculation would exhibit, we can concurr, the
same behaviour with respect to reachability indexes as A..B case.

This is tested using `git merge-base A B`; finding all merge bases, and
A...B walk are not tested yet.

Note that IIRC it was merge-base calculation where the problem of
performance regression when using generation number (topological level)
as an ordering was first encountered.  The performance regression happen
for situation like this (the graph is taken from original commit fixing
the issue, just rephrased).

   0   1   2   3   4   5   6   7   8   9  10   11          4    δgen(c)
   .---M---*---*---*---*---*---*---*---*--*----A           B
       |\                                  2  /           /     δgen(c)
       | \---------------------------------x-/           /
       \                                  2         3   /       δgen(c)
        \---------------------------------#---------#--/


    ------------------------------------------------------------>
                                                          time

Here optimal solution would be walk the branch / path with commits
marked with 'x' from commit A, while walking the branch / path with
commits marked '#' from commit B, and after walking 4 commits notice
that the commit M is reachable from both A and B -- thus we have found
[one of] the merge base.

Using (minimum) generation numbers we would walk the path marked with
'*' first, unnecessarily.  (IIUC that for a commit we mark also its
parents, isn't it?).

Using corrected commit date we would walk at most only a few commits on
the *-marked branch.

   0   1   2   3   4   5   6   7   8   9  10   11          11   δmaxgen(c)
   .---M---*---*---*---*---*---*---*---*--*----A           B
       |\                                  10 /           /     δmaxgen(c)
       | \---------------------------------x-/           /
       \                                  9         10  /       δmaxgen(c)
        \---------------------------------#---------#--/


    ------------------------------------------------------------>
                                                          time

Using maximum generation numbers we would also walk at most few *-marked
commits.

Sidenote: on the figure below you can easily see easy correspondence
between maxgen(c) and revgen(c) indices, see below.

   12  11  10  9   8   7   6   5   4   3  2    1           1    revgen(c)
   .---M---*---*---*---*---*---*---*---*--*----A           B
       |\                                  2  /           /     revgen(c)
       | \---------------------------------x-/           /
       \                                  3         2   /       revgen(c)
        \---------------------------------#---------#--/


## Strict ancestry, or ancestry path

When given a range of commits to display (e.g. A..B), only display
commits that exist directly on the ancestry chain between B and A,
i.e. commits that are both descendants of B, and ancestors of A.  In
other words, find all paths from B to A (if they exist).

In this case we want to walk all the paths, heavily pruning.  This
should be less dependent on reachability index quality as an index (as
an ordering), and more on pruning ability of the filter.

This case is not tested in gen-test, but I wonder how often this feature
is actualy used (if it would be worth adding it to the benchmark, and if
so, with what weight behind it).


> This report investigates four replacements for generation numbers, and
> compares the number of walked commits to the existing algorithms (both
> using generation numbers and not using them at all). We can use this
> data to make decisions for the future of the feature.
>
> ### Implementation
>
> The reachability indexes below are implemented in
> [the `reach-perf` branch in
> https://github.com/derrickstolee/git](https://github.com/derrickstolee/git/tree/reach-perf).
> This implementation is in a very rough condition, as it is intended to
> be a prototype, not production-quality.
>
> Using this implementation, you can compute commit-graph files for the
> specified reachability index using `git commit-graph write --reachable
> --version=<V>`.
> The `git` client will read the version from the file, so our tests
> store each version as `.git/objects/info/commit-graph.<V>` and copy
> the necessary file to `.git/objects/info/commit-graph` before testing.
>
> The algorithms count the number of commits walked, as to avoid the
> noise that would occur with time-based performance reporting. We use
> the (in progress) trace2 library for this. To find the values reported,
> use the `GIT_TR2_PERFORMANCE` environment variable.

Where we can read more about this trace2 library?  Thanks in advance.

>
> To ignore reachability indexes entirely and use the old algorithms
> (reported here as "OLD" values) use the environment variable
> `GIT_TEST_OLD_PAINT=1`.
>
>
> Reachability Index Versions
> ---------------------------
>
> **V0: (Minimum) Generation Number.**
> The _generation number_ of a commit is exactly one more than the maximum
> generation number of a parent (by convention, the generation number of a
> commit with no parents is 1). This is the definition currently used by
> Git (2.19.0 and later). Given two commits A and B, we can then use the
> following reachability index:
>
>     If gen(A) < gen(B), then A cannot reach B.
>
> _Commentary:_ One issue with generation numbers is that some algorithms
> in Git use commit date as a heuristic, and allow incorrect answers if
> there is enough clock skew. Using that heuristic, the algorithms can walk
> fewer commits than the algorithms using generation number. The other
> reachability indexes introduced below attempt to fix this problem.

This is the existing solution.

> **V1: (Epoch, Date) Pairs.**
> For each commit, store a pair of values called the _epoch_ and the _date_.
> The date is the normal commit date. The _epoch_ of a commit is the minimum
> X such that X is at least the maximum epoch of each parent, and at least
> one more than the epoch of any parent whose date is larger than the date
> of this commit (i.e. there is clock skew between this commit and this
> parent). In this way, we can use the following reachability index:
>
>    If epoch(A) < epoch(B), then A cannot reach B.
>    If epoch(A) == epoch(B) and date(A) < date(B), then A cannot reach B.

I wonder what makes it perform worse than corrected date aka V3.

> **V2: Maximum Generation Numbers.**
> The _maximum generation number_ of a commit (denoted by maxgen(A)) is
> defined using its _children_ and the total number of commits in the repo.
> If A is a commit with no children (that is, there is no commit B with
> A as a parent) then maxgen(A) is equal to the number of commits in the repo.
> For all other commits A, maxgen(A) is one less than the minimum maxgen(B)
> among children B. This can be computed by initializing maxgen(C) to the
> number of commits, then walking the commit graph in topological order,
> assigning maxgen(P) = min { maxgen(P), maxgen(C) - 1 } for each parent P
> of the currently-walking commit C. We then have the same reachability
> index as minimum generation number:
>
>   If maxgen(A) < maxgen(B), then A cannot reach B.

If I understand it correctly, this is the same as reverse generation
numbers, or reverse topological levels; in other words genertion numbers
on reversed graph -- only transformed:

   maxgen(A) == number_of_commits - (revgen(A) - 1)

We define revgen(A) in the following way:
- for head tips (for source nodes), i.e. commits with in-degree of 0,
  with no children, have revgen(A) = 1 (revgen(A) = 0 is left for
  commits outside commit-graph, which translates to INFINITY for
  maxgen(A)).
- otherwise, it is 1 more than maximum revgen(C) of its children

They are equivalent, but maxgen(A) is "backward compatibile", that is
the rechability condition is exactly the same as for ordinary generation
numbers:

    If gen(A) < gen(B), then A cannot reach B.
    If maxgen(A) < maxgen(B), then A cannot reach B.

But

    If revgen(A) > revgen(B), then A cannot reach B.

>
> _Commentary:_ The known examples where (minimum) generation numbers perform
> worse than commit date heuristics involve commits with recent commit dates
> but whose parents have very low generation number compared to most recent
> commits. In a way, minimum generation numbers increase the odds that we
> can say "A cannot reach B" when A is fixed and B is selected at random.
> Maximum generation numbers increase the odds that we can say "A cannot
> reach B" when B is fixed and A is selected at random. This helps us when
> we are exploring during a reachability algorithm and have explored few
> commits and want to say that the large set of unexplored commits cannot
> reach any of the explored commits.

I guess that the situation where shortcut path have recent date, while
having less commits in path (and thus lover ordinary generation number)
is more common in real commit graphs because one might want to base a
commit (for example a bugfix) on an old commit, but commits usually get
merged quickly, and not left for long time.  If they re left for long
time, they usually needs correction (via rebase) before being merged,
and again we have commit with date close to the date of merge, leading
to a shortcut.

Does this reasoning looks correct to you?

> **V3: Corrected Commit Date.**
> For a commit C, let its _corrected commit date_ (denoted by cdate(C))
> be the maximum of the commit date of C and the [corrected] commit dates of its
> parents.

Wouldn't it be better to use the maximum of the commit date of C, and
one second more than the corrected commit dates of its parents?  This
way corrected dates would be strictly monotonic along the chain.

Sidenote: in theory, sufficiently bad clock skew (like e.g. wrong year
in the future) could screw large swatches of the commit graph.  In
practice, in real-life repositories, this doesn't happen, I guess.

>   If cdate(A) < cdate(B), then A cannot reach B.
>
> **V4: FELINE Index.**
> The FELINE index is a two-dimentional reachability index as defined in
> [Reachability Queries in Very Large Graphs: A Fast Refined Online
> Search Approach](https://openproceedings.org/EDBT/2014/paper_166.pdf)
> by Veloso, Cerf, Jeira, and Zaki. The index is not deterministically
> defined, but instead is defined in the following way:
>
> 1. Compute a reverse topological order of the commits. Let felineX(C)
>    be the position in which C appears in this list. Since this is
>    a reverse topological order, felineX(C) > felineX(P) for each parent
>    P of C.

Well, it depends on the definition of the topological order if it is
tolopogical order or reverse topological order.  Anyway, the FELINE
paper includes discussion about FELINE-I, i.e. inversed feline index.
From FELINE and FELINE-I, one of those is what was used here.

> 2. Compute a reverse topological order of the commits using Kahn's
>    algorithm, but when selecting among a list of commits with in-degree
>    zero, prioritize the commit by minimizing felineX. Let felineY(C)
>    be the position in which C appears in this list.

Note that the second step, as given in the paper, is actually
deterministic (assuming that we follow the recommendations; well, up to
a point).  It needs some topological order for felineX, though.

> Essentially, the felineY order is selected with the goal of swapping
> positions of topologically-independent commits relative to the felinX
> ordering. The resulting reachability index is as follows:
>
>    If felineX(A) < felineX(B), then A cannot reach B.
>    If felineY(A) < felineY(B), then A cannot reach B.
>
> _Commentary:_ In terms of comparing two commits directly, this index
> is quite strong. However, when we are performing our reachability
> algorithms that care about reachable sets (git log --graph), or
> boundaries between reachable sets (git merge-base, git log --graph A..B)
> we need to track a single pair (minX,minY) for comparion. In order to not
> miss anything during our search, we need to update this pair to be
> the minimum felineX(A) and minimum felineY(B) among all explored commits
> A and B. That is, the pair (minX, minY) is below our entire explored set.
> This can be a disadvantage for these algorithms.

Actually the problem is that FELINE index is partial order, not an
ordering; not something that we can use in a priority queue.

But the single [reverse] topological ordering is a reachability index in
itself, and it is an ordering.  Thus we can use felineX as an index in
priority queue (or we can use felineY).

One thing to notice is that FELINE index performs well on real world
large graphs which have quite different structure from commit graphs;
among others they have either very large number of sources (head tips)
or sinks (root nodes), or both; those numbers grow with the number of
nodes.  For commit graphs they don't perform as well; well, at least
ccording to my explorations on that matter in the Google Colaboratory
notebook [2].

[2]: https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg

>
> ### Comparing Reachability Index Versions Viability
>
> Before considering how well these indexes perform during our algorithm
> runs, let's take a moment to consider the implementation details and
> how they relate to the existing commit-graph file format and existing
> Git clients.
>
> * **Compatible?** In our test implementation, we use a previously unused
>   byte of data in the commit-graph format to indicate which reachability
>   index version we are using. Existing clients ignore this value, so we
>   will want to consider if these new indexes are _backwards compatible_.
>   That is, will they still report correct values if they ignore this byte
>   and use the generation number column from the commit-graph file assuming
>   the values are minimum generation numbers?

In other words "backward compatibility" for a reachability index means
that the reachability condition is exactly the same as it was:

    If index(A) < index(B), then A cannot reach B.

and also values of the reachability index are written in place of
generation numbers in the commit-graph.

>
> * **Immutable?** Git objects are _immutable_. If you change an object you
>   actually create a new object with a new object ID. Are the values we
>   store for these reachability indexes also immutable?
>
> * **Local?** Are these values **locally computable**? That is, do we only
>   need to look at the parents of a commit (assuming those parents have
>   computed values) in order to determine the value at that commit?

Those two features imply that the number of commits that we need to walk
to be able to update reachability index after appending new commits is
O(new commits), which means O(changes).  I would allow for simple
transformations of other values, e.g. adding a constant; we need to
rewrite commit-graph file anyway.

I wonder if there are reachability indexes that are either immutable but
not local, or not immutable but local.  Doesn't being local imply
immutability, with some common-sense assumptions (like no random choice
for parent-less commits)?

> | Index                     | Compatible? | Immutable? | Local? |
> |---------------------------|-------------|------------|--------|
> | Minimum Generation Number | Yes         | Yes        | Yes    |
> | (Epoch, Date) pairs       | Yes         | Yes        | Yes    |
> | Maximum Generation Number | Yes         | No         | No     |
> | Corrected Commit Date     | No          | Yes        | Yes    |
> | FELINE index              | Yes         | No         | No     |
>
> _Note:_ The corrected commit date uses the generation number column
> to store an offset of "how much do I need to add to my commit date
> to get my corrected commit date?" The values stored in that column
> are then not backwards-compatible.

Additional note: this offset / delta based approach to storing corrected
commit data is a form of lightweight compression, allowing it to fit
safely in 30 bits (while the commit date uses 32+2 = 34 bits itself).

Assuming that we want to keep this feature from the original
commit-graph file format, that is.

> _Note:_ The FELINE index requires storing two values instead of just
> one. One of these values could be stored in the generation number
> column and the other in an optional chunk, hence it could be backwards
> compatible. (This is not how it is implemented in our example
> implementation.)

So the table above refers to this hypothetical implementation when
stating that FELINE is backwards compatibile, but current test
implementation is not, isn't it?

>
> Data
> ----
>
> We focused on three types of performance tests that test the indexes
> in different ways. Each test lists the `git` command that is used,
> and the table lists which repository is used and which inputs.
>
> ### Test 1: `git log --topo-order -N`
>
> This test focuses on the number of commits that are parsed during
> a `git log --topo-order` before writing `N` commits to output.
>
> You can reproduce this test using `topo-order-tests.sh` and
> see all the data in `topo-order-summary.txt`. The values reported
> here are a sampling of the data, ignoring tests where all values
> were the same or extremely close in value.
>
>
> | Repo         | N     | V0     | V1     | V2     | V3     | V4    |
> |--------------|-------|--------|--------|--------|--------|-------|
> | android-base | 100   |  5,487 |  8,534 |  6,937 |  6,419 | 6,453 |
> | android-base | 1000  | 36,029 | 44,030 | 41,493 | 41,206 |45,431 |
> | chromium     | 100   |    101 |424,406 |    101 |    101 |   101 |
> | gerrit       | 100   |  8,212 |  8,533 |    164 |    159 |   162 |
> | gerrit       | 1000  |  8,512 |  8,533 |  1,990 |  1,973 | 3,766 |
> | Linux        | 100   | 12,458 | 12,444 | 13,683 | 13,123 |13,124 |
> | Linux        | 1000  | 24,436 | 26,247 | 27,878 | 26,430 |27,875 |
> | Linux        | 10000 | 30,364 | 28,891 | 27,878 | 26,430 |27,875 |
> | electron     | 1000  | 19,927 | 18,733 |  1,072 | 18,214 |18,214 |
> | Ffmpeg       | 10000 | 32,154 | 47,429 | 10,435 | 11,054 |11,054 |
> | jgit         | 1000  |  1,550 |  6,264 |  1,067 |  1,060 | 1,233 |
> | julia        | 10000 | 43,043 | 43,043 | 10,201 | 23,505 |23,828 |
> | odoo         | 1000  | 17,175 |  9,714 |  4,043 |  4,046 | 4,111 |
> | php-src      | 1000  | 19,014 | 27,530 |  1,311 |  1,305 | 1,320 |
> | rails        | 100   |  1,420 |  2,041 |  1,757 |  1,428 | 1,441 |
> | rails        | 1000  |  7,952 | 10,145 | 10,053 |  8,373 | 8,373 |
> | swift        | 1000  |  1,914 |  4,004 |  2,071 |  1,939 | 1,940 |
> | tensorflow   | 1000  | 10,019 | 39,221 |  6,711 | 10,051 |10,051 |
> | TypeScript   | 1000  |  2,873 | 12,014 |  3,049 |  2,876 | 2,876 |

Do I understand it correctly that the range of N for a given project is
limited by the depth of the project history (hence maximum N of 10000
for Linux kernel, but only 100 for chromium)?

I wonder what the OLD numbers are for these operations.

>
> ### Test 2: `git log --topo-order -10 A..B`
>
> This test focuses on the number of commits that are parsed during
> a `git log --topo-order A..B` before writing ten commits to output.
> Since we fix a very small set of output commits, we care more about
> the part of the walk that determines which commits are reachable
> from `B` but not reachable from `A`. This part of the walk uses
> commit date as a heuristic in the existing implementation.
[...]

> ### Test 3: `git merge-base A B`
>
> This test focuses on the number of commits that are parsed during
> a `git merge-base A B`. This part of the walk uses commit date as
> a heuristic in the existing implementation.
[...]

> Conclusions
> -----------
>
> Based on the performance results alone, we should remove minimum
> generation numbers, (epoch, date) pairs, and FELINE index from
> consideration. There are enough examples of these indexes performing
> poorly.
>
> In contrast, maximum generation numbers and corrected commit
> dates both performed quite well. They are frequently the top
> two performing indexes, and rarely significantly different.
>
> The trade-off here now seems to be: which _property_ is more important,
> locally-computable or backwards-compatible?
>
> * Maximum generation number is backwards-compatible but not
>   locally-computable or immutable.
>
> * Corrected commit-date is locally-computable and immutable,
>   but not backwards-compatible.
>
> _Editor's Note:_ Every time I think about this trade-off, I can't
> come to a hard conclusion about which is better. Instead, I'll
> leave that up for discussion.

In my opinion being able to update commit-graph data fast is more
important than being backward-compatibile, i.e. beig able to use
commit-data file generated by new clients (and servers) by old clients.


Thank you for all your work,
--
Jakub Narębski

  parent reply	other threads:[~2018-11-01 12:27 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-10-29 16:55 [RFC] Generation Number v2 Derrick Stolee
2018-10-29 19:22 ` Stefan Beller
2018-10-29 20:06   ` Derrick Stolee
2018-11-01 20:06   ` Jakub Narebski
2018-11-02  9:30     ` Jakub Narebski
2018-11-03 17:27       ` Jakub Narebski
2018-10-29 20:25 ` Derrick Stolee
2018-11-01 22:13   ` Jakub Narebski
2018-10-30  3:59 ` Junio C Hamano
2018-10-31 12:30   ` Derrick Stolee
2018-11-02 13:33     ` Jakub Narebski
2018-10-31 12:54   ` Ævar Arnfjörð Bjarmason
2018-10-31 13:04     ` Derrick Stolee
2018-11-02 17:44       ` Jakub Narebski
2018-11-01 12:27 ` Jakub Narebski [this message]
2018-11-01 13:29   ` Derrick Stolee
2018-11-03 12:33     ` Jakub Narebski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=86efc50yq5.fsf@gmail.com \
    --to=jnareb@gmail.com \
    --cc=avarab@gmail.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=peff@peff.net \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).