list mirror (unofficial, one of many)
 help / color / mirror / code / Atom feed
From: Derrick Stolee <>
To: Jakub Narebski <>
Cc:, "Derrick Stolee" <>,
	"Jeff King" <>,
	"Ævar Arnfjörð Bjarmason" <>
Subject: Re: [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc.
Date: Mon, 14 May 2018 09:20:29 -0400	[thread overview]
Message-ID: <> (raw)
In-Reply-To: <>

On 5/12/2018 10:00 AM, Jakub Narebski wrote:
> Derrick Stolee <> writes:
>> On 5/4/2018 3:40 PM, Jakub Narebski wrote:
>>> With early parts of commit-graph feature (ds/commit-graph and
>>> ds/lazy-load-trees) close to being merged into "master", see
>>> I think it would be good idea to think what other data could be added
>>> there to make Git even faster.
>> Before thinking about adding more data to the commit-graph, I think
>> instead we need to finish taking advantage of the data that is already
>> there. This means landing the generation number patch [1] (I think
>> this is close, so I'll send a v6 this week if there is no new
>> feedback.) and the auto-compute patch [2] (this could use more
>> feedback, but I'll send a v1 based on the RFC feedback if no one
>> chimes in).
>> [1]
>>      [PATCH v5 00/11] Compute and consume generation numbers
>> [2]
>>      [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc'
> Right, so the RFC might be a bit premature; I wanted the discussion to
> be out there to think about when adding new uses of existing features.
> DIGRESSION: it is commendable that you are sending patches in small,
> easy digestible chunks / patch series.  It is much easier to review 10+
> series than 80+ behemoth (though I understand it is not always possible
> to subdivide patch series into smaller self-contained sub-series).
> On the other hand, it would be nice to have some roadmap about series
> and features to be sent in the future, if possible.  Something like what
> was done when 'git rebase --interactive' was getting builtinified: moved
> (in parts) to C.  It would be great to have such roadmap with milestones
> achieved and milestones to be done in the cover letter for series.

I suppose that is what I intended in the "Future Work" section of 
Documentation/technical/commit-graph.txt. It gives a set of things that 
need to be done in order to make this a default feature, not just an 
expert-level feature. When I wrote the section, I was focused entirely 
on "commit-graph version 1.0" and I didn't know how long that would 
take. The series have been getting interest and review (in great part to 
your interest, Jakub) so they have been moving to 'next' faster than I 

I'll plan on writing a more detailed list of future directions, but for 
now I'll list the things I know about and how I see them in priority order:

Commit-graph v1.0:
* ds/generation-numbers
* 'verify' and fsck/gc integration
* correct behavior with shallow clones, replace-objects, and grafts

Commit-graph v1.1:
* Place commit-graph storage in the_repository
* 'git tag --merged' use generation numbers
* 'git log --graph' use generation numbers

Commit-graph v1.X:
* Protocol v2 verb for sending commit-graph
* Bloom filters for changed paths

>> The big wins remaining from this data are `git tag --merged` and `git
>> log --graph`. The `tag` scenario is probably easier: this can be done
>> by replacing the revision-walk underlying the call to use
>> paint_down_to_common() instead. Requires adding an external method to
>> commit.c, but not too much code.
> I wonder if there is some significant reason behind `git tag --merged`
> using its own codepath, beside historical reasons.  Maybe performance is
> better with current code?
> Utilizing paint_down_to_common() there, beside reducing amount of code
> you would have to modify, would also unify code (and possibly reduce
> amount of code).  That's very nice.
>> The tougher challenge is `git log --graph`. The revision walk
>> machinery currently uses two precompute phases before iterating
>> results to the pager: limit_list() and sort_in_topological_order();
>> these correspond to two phases of Kahn's algorithm for topo-sort
>> (compute in-degrees, then walk by peeling commits with in-degree
>> zero). This requires O(N) time, where N is the number of reachable
>> commits. Instead, we could make this be O(W) time to output one page
>> of results, where W is (roughly) the number of reachable commits with
>> generation number above the last reported result.
> A reminder: Kahn's algorithm (described for example in [1] and [3])
> looks like this:
>    L ← Empty list that will contain the sorted elements
>    S ← Collection of all nodes with no incoming edge
>    while S is non-empty do
>        remove a node 'n' from S
>        add 'n' to tail of L
>        for each parent 'm' of 'n' do
>            decrease the in-degree of 'm'
>            if 'm' has in-degree of 0 then
>                insert 'm' into S
> [1]:'s_algorithm
> [2]:
>> In order to take advantage of this approach, the two phases of Kahn's
>> algorithm need to be done in-line with reporting results to the
>> pager. This means keeping two queues: one is a priority queue by
>> generation number that computes in-degrees,
> This simple solition of using priority queue by generation number won't
> work, I think.  In-degree is computed from heads down, generation number
> is computed from roots up.
> For example in the graph below
>     *<---*<---*<---*<---*<---*<---*<---*<---*<---A
>           \
>            \--*<---B
> both A and B have in-degree of 0, but gen(B) << gen(A).
> But I might be wrong.
>>                                             the other is a priority
>> queue (by commit-date or a visit-order value to do the --topo-order
>> priority) that peels the in-degree-zero commits (and decrements the
>> in-degree of their parents). I have not begun this refactoring effort
>> because appears complicated to me, and it will be hard to tease out
>> the logic without affecting other consumers of the revision-walk
>> machinery.
>> I would love it if someone picked up the `git log --graph` task, since
>> it will be a few weeks before I have the time to focus on it.
> Let's assume that we have extra data (indexes such as generation number)
> that can be used for positive-cut (we know that A can reach B) and
> negative-cut (we know that A cannot reach B) filters.  Generation number
> aka. topological level can be used in negative-cut filter.
> NOTE: I have not looked at current Git code that does topological
> sorting, as to not be suggested by the existing implementation.
> How the indexes-amplified incremental Kahn's algorithm could look like:
> First we need to find at least one node / vertex / commit with an
> in-degree of zero.  We are given a list of commits to start from, but
> they may not all have in-degree of zero - they may be dependent, or in
> other words some of them may be reachable from others and be
> irrelevant.  Here negative-cut and positive-cut filters can help.
> If the order of commits on command line does not matter, we can start
> from any distinct commit with highest generation number - we know that
> it cannot be reached from other heads / refs and thus has in-degree of
> zero.
>    L ← Empty list that will contain the sorted elements
>    S ← Collection of all nodes with no incoming edge
>    R ← Collection of starting points (refs)
>    while S is non-empty do
>        remove a node 'n' from S
>        add 'n' to tail of L
>        for each parent 'm' of 'n' do
>            if 'm' cannot be reached from R then
>                # it has in-degree of 0
>                insert 'm' into S
>            else if 'm' can be reached from R then
>                # it has in-degree greater than 0
>                insert 'm' into R
>            else
>                walk from each 'r of R,
>                until we know if 'm' is reachable from R
>                then insert it into S or R
>                perhaps computing in-degrees,
>                or marking commits as reachable...
> Does it looks correct?  I can try to make this pseudocode into actual
> algorithm.

I'm not following your pseudocode very well, so instead I'll provide a 
more concrete description of what I mentioned before:

Here are some data structures.

IDV : A dictionary storing the currently-computed in-degree value of a 
commit 'c' as 'IDV[c]'. Assume all commits start with in-degree 0.
IDQ : A queue, for storing the "in-degree walk". It is a priority queue 
ordered by generation number.
TOQ : A queue, for storing the "topo-order walk". It is a priority queue 
ordered by "visit order" (see algorithm)

Here are some methods.

AdvanceInDegreeWalk(IDQ, IDV, gen):
     while !IDX.Empty && IDQ.Max >= gen:
         c = IDX.Dequeue

         for each parent p of c:
             IDQ.Enqueue(p, generation(p))

     Initialize IDQ, IDV, TOQ

     min_generation = GENERATION_NUMBER_INFINITY
     visit_order = 0

     for each commit c in R:
         min_generation = min(min_generation, generation(c))
         IDQ.Enqueue(c, generation(c))

     AdvanceInDegreeWalk(IDQ, IDV, min_generation)

     for each commit c in R:
         if IDV[c] == 0:
             TOQ.Enqueue(c, visit_order++)

     return (IDQ, IDV, TOQ, visit_order)

GetNextInTopoOrder(IDQ, IDV, TOQ, ref visit_order):
     if TOQ.Empty:
         return null

     c = TOQ.Dequeue()

     for each parent p of c:
         AdvanceInDegreeWalk(IDQ, IDV, generation(p))
         if IDV[p] == 0:
             TOQ.Enqueue(p, visit_order++)

     return c

(I mention "ref visit_order" to be sure we are passing-by-reference. In 
a full implementation, the walk details would be grouped into a struct.)

This algorithm is relatively simple, but the hard part is teasing the 
revision walk machinery to initialize the data by calling 
InitializeTopoOrder(R) and to call GetNextInTopoOrder() whenever it 
needs a new element of the walk. That is, it is hard to be sure we are 
not causing side-effects as we make that transformation.

>> Without completing the benefits we get from generation numbers, these
>> investigations into other reachability indexes will be incomplete as
>> they are comparing benefits without all consumers taking advantage of
>> a reachability index.
> On one hand side, you are right: if we try to investigate of some
> reachability index is worth it by checking if it *improves performance
> of actual git operations* without all consumers taking advantage of a
> reachability index would be incomplete.
> On the other hand we can still perform synthetic tests: how much less
> commits we walk when checking that A can reach B on real commit graphs
> (like I did in mentioned Google Colaboratory notebook [3]).  This
> assumes that the cost of accessing commit data (and possibly also
> indexes data) dominates, and the cost of using reachability indexes is
> negligible.
> [3]:
> On the gripping hand the construction of algorithms in those future
> steps would be, I think, affected by what types of indexes would we
> have: negative-cut filters, positive-cut filters, reachability bitmaps,
> Bloom filters for changed files, eic.
>> [...]
>>> Bloom filter for changed paths
>>> ------------------------------
>>> The goal of this chunk is to speed up checking if the file or directory
>>> was changed in given commit, for queries such as "git log -- <file>" or
>>> "git blame <file>".  This is something that according to "Git Merge
>>> contributor summit notes" [2] is already present in VSTS (Visual Studio
>>> Team Services - the server counterpart of GVFS: Git Virtual File System)
>>> at Microsoft:
>>> AV> - VSTS adds bloom filters to know which paths have changed on the commit
>>> AV> - tree-same check in the bloom filter is fast; speeds up file history checks
>>> AV> - might be useful in the client as well, since limited-traversal is common
>>> AV> - if the file history is _very_ sparse, then bloom filter is useful
>>> AV> - but needs pre-compute, so useful to do once
>>> AV> - first make the client do it, then think about how to serve it centrally
>>> [2]:
>>> I think it was what Derrick Stolee was talking about at the end of his
>>> part of "Making Git for Windows" presentation at Git Merge 2018:
>>> This was also mentioned in subthread of "Re: [PATCH v2 0/4] Lazy-load
>>> trees when reading commit-graph", starting from [3]
>>> [3]:
>> Again, the benefits of Bloom filters should only be measured after
>> already taking advantage of a reachability index during `git
>> log`. However, you could get performance benefits from Bloom filters
>> in a normal `git log` (no topo-order).
> I wonder how much they improve performance of "git blame"...
>> The tricky part about this feature is that the decisions we made in
>> our C# implementation for the VSTS Git server may be very different
>> than the needs for the C implementation of the Git client. Questions
>> like "how do we handle merge commits?" may have different answers,
>> which can only be discovered by implementing the feature.
>> (The answer for VSTS is that we only store Bloom filters containing
>> the list of changed paths against the first parent. The second parent
>> frequently has too many different paths, and if we are computing
>> file-history simplification we have already determined the first
>> parent is _not_ TREESAME, which requires verifying the difference by
>> parsing trees against the first parent.)
> Thanks for the information.  I think for now it is sufficient level of
> the detail.
>> I'm happy to provide more information on how we built this feature if
>> someone is writing a patch. Otherwise, I plan to implement it after
>> finishing the parts I think are higher priority.
> All right, I understand that.  Time is a scarse resource.
> I think that, beside writing patches for Git, exploring how various
> pieces of data and various indexes affect walking commit graphs is also
> important.  My explorations shown that, for example, that FELINE index
> is not good fit for relatively "narrow" graphs of revision history.
> Exploring this in Python / Jupyter is easier than trying to write a
> exploratory patch for Git, in my opinion.  Just IMVHO.

You are right. Ruling out possibilities is the best outcome these 
prototypes can have. Your work has saved someone a lot of time in 
investigating that direction.


  reply	other threads:[~2018-05-14 13:20 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-05-04 19:40 Jakub Narebski
2018-05-04 20:07 ` Ævar Arnfjörð Bjarmason
2018-05-04 20:36 ` Ævar Arnfjörð Bjarmason
2018-05-05 13:28   ` Jakub Narebski
2018-05-06 23:55 ` [RFC] Other chunks for commit-graph, part 2 - reachability indexes Jakub Narebski
2018-05-07 14:26 ` [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc Derrick Stolee
2018-05-12 14:00   ` Jakub Narebski
2018-05-14 13:20     ` Derrick Stolee [this message]
2018-05-14 20:58       ` Jakub Narebski
2018-05-15 10:01         ` Johannes Schindelin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

  List information:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \ \ \ \ \ \ \ \
    --subject='Re: [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc.' \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Code repositories for project(s) associated with this inbox:

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).