Re: [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc.

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

From: Derrick Stolee <stolee@gmail.com>
To: Jakub Narebski <jnareb@gmail.com>
Cc: git@vger.kernel.org, "Derrick Stolee" <dstolee@microsoft.com>,
	"Jeff King" <peff@peff.net>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Subject: Re: [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc.
Date: Mon, 14 May 2018 09:20:29 -0400	[thread overview]
Message-ID: <fca023a3-f849-4be8-db42-7027bfe7b0dc@gmail.com> (raw)
In-Reply-To: <86r2mh2cur.fsf@gmail.com>

On 5/12/2018 10:00 AM, Jakub Narebski wrote:
> Derrick Stolee <stolee@gmail.com> writes:
>> On 5/4/2018 3:40 PM, Jakub Narebski wrote:
>>> With early parts of commit-graph feature (ds/commit-graph and
>>> ds/lazy-load-trees) close to being merged into "master", see
>>> https://public-inbox.org/git/xmqq4ljtz87g.fsf@gitster-ct.c.googlers.com/
>>> I think it would be good idea to think what other data could be added
>>> there to make Git even faster.
>> Before thinking about adding more data to the commit-graph, I think
>> instead we need to finish taking advantage of the data that is already
>> there. This means landing the generation number patch [1] (I think
>> this is close, so I'll send a v6 this week if there is no new
>> feedback.) and the auto-compute patch [2] (this could use more
>> feedback, but I'll send a v1 based on the RFC feedback if no one
>> chimes in).
>>
>> [1]
>> https://public-inbox.org/git/20180501124652.155781-1-dstolee@microsoft.com/
>>      [PATCH v5 00/11] Compute and consume generation numbers
>>
>> [2]
>> https://public-inbox.org/git/20180417181028.198397-1-dstolee@microsoft.com/
>>      [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc'
> Right, so the RFC might be a bit premature; I wanted the discussion to
> be out there to think about when adding new uses of existing features.
>
>
> DIGRESSION: it is commendable that you are sending patches in small,
> easy digestible chunks / patch series.  It is much easier to review 10+
> series than 80+ behemoth (though I understand it is not always possible
> to subdivide patch series into smaller self-contained sub-series).
>
> On the other hand, it would be nice to have some roadmap about series
> and features to be sent in the future, if possible.  Something like what
> was done when 'git rebase --interactive' was getting builtinified: moved
> (in parts) to C.  It would be great to have such roadmap with milestones
> achieved and milestones to be done in the cover letter for series.

I suppose that is what I intended in the "Future Work" section of 
Documentation/technical/commit-graph.txt. It gives a set of things that 
need to be done in order to make this a default feature, not just an 
expert-level feature. When I wrote the section, I was focused entirely 
on "commit-graph version 1.0" and I didn't know how long that would 
take. The series have been getting interest and review (in great part to 
your interest, Jakub) so they have been moving to 'next' faster than I 
anticipated.

I'll plan on writing a more detailed list of future directions, but for 
now I'll list the things I know about and how I see them in priority order:

Commit-graph v1.0:
* ds/generation-numbers
* 'verify' and fsck/gc integration
* correct behavior with shallow clones, replace-objects, and grafts

Commit-graph v1.1:
* Place commit-graph storage in the_repository
* 'git tag --merged' use generation numbers
* 'git log --graph' use generation numbers

Commit-graph v1.X:
* Protocol v2 verb for sending commit-graph
* Bloom filters for changed paths

>
>> The big wins remaining from this data are `git tag --merged` and `git
>> log --graph`. The `tag` scenario is probably easier: this can be done
>> by replacing the revision-walk underlying the call to use
>> paint_down_to_common() instead. Requires adding an external method to
>> commit.c, but not too much code.
> I wonder if there is some significant reason behind `git tag --merged`
> using its own codepath, beside historical reasons.  Maybe performance is
> better with current code?
>
> Utilizing paint_down_to_common() there, beside reducing amount of code
> you would have to modify, would also unify code (and possibly reduce
> amount of code).  That's very nice.
>
>> The tougher challenge is `git log --graph`. The revision walk
>> machinery currently uses two precompute phases before iterating
>> results to the pager: limit_list() and sort_in_topological_order();
>> these correspond to two phases of Kahn's algorithm for topo-sort
>> (compute in-degrees, then walk by peeling commits with in-degree
>> zero). This requires O(N) time, where N is the number of reachable
>> commits. Instead, we could make this be O(W) time to output one page
>> of results, where W is (roughly) the number of reachable commits with
>> generation number above the last reported result.
> A reminder: Kahn's algorithm (described for example in [1] and [3])
> looks like this:
>
>    L ← Empty list that will contain the sorted elements
>    S ← Collection of all nodes with no incoming edge
>    while S is non-empty do
>        remove a node 'n' from S
>        add 'n' to tail of L
>        for each parent 'm' of 'n' do
>            decrease the in-degree of 'm'
>            if 'm' has in-degree of 0 then
>                insert 'm' into S
>
> [1]: https://en.wikipedia.org/wiki/Topological_sorting#Kahn's_algorithm
> [2]: https://www.geeksforgeeks.org/topological-sorting-indegree-based-solution/
>
>> In order to take advantage of this approach, the two phases of Kahn's
>> algorithm need to be done in-line with reporting results to the
>> pager. This means keeping two queues: one is a priority queue by
>> generation number that computes in-degrees,
> This simple solition of using priority queue by generation number won't
> work, I think.  In-degree is computed from heads down, generation number
> is computed from roots up.
>
> For example in the graph below
>
>     *<---*<---*<---*<---*<---*<---*<---*<---*<---A
>           \
>            \--*<---B
>
> both A and B have in-degree of 0, but gen(B) << gen(A).
>
> But I might be wrong.
>
>>                                             the other is a priority
>> queue (by commit-date or a visit-order value to do the --topo-order
>> priority) that peels the in-degree-zero commits (and decrements the
>> in-degree of their parents). I have not begun this refactoring effort
>> because appears complicated to me, and it will be hard to tease out
>> the logic without affecting other consumers of the revision-walk
>> machinery.
>>
>> I would love it if someone picked up the `git log --graph` task, since
>> it will be a few weeks before I have the time to focus on it.
> Let's assume that we have extra data (indexes such as generation number)
> that can be used for positive-cut (we know that A can reach B) and
> negative-cut (we know that A cannot reach B) filters.  Generation number
> aka. topological level can be used in negative-cut filter.
>
> NOTE: I have not looked at current Git code that does topological
> sorting, as to not be suggested by the existing implementation.
>
>
> How the indexes-amplified incremental Kahn's algorithm could look like:
>
> First we need to find at least one node / vertex / commit with an
> in-degree of zero.  We are given a list of commits to start from, but
> they may not all have in-degree of zero - they may be dependent, or in
> other words some of them may be reachable from others and be
> irrelevant.  Here negative-cut and positive-cut filters can help.
>
> If the order of commits on command line does not matter, we can start
> from any distinct commit with highest generation number - we know that
> it cannot be reached from other heads / refs and thus has in-degree of
> zero.
>
>    L ← Empty list that will contain the sorted elements
>    S ← Collection of all nodes with no incoming edge
>    R ← Collection of starting points (refs)
>    while S is non-empty do
>        remove a node 'n' from S
>        add 'n' to tail of L
>        for each parent 'm' of 'n' do
>            if 'm' cannot be reached from R then
>                # it has in-degree of 0
>                insert 'm' into S
>            else if 'm' can be reached from R then
>                # it has in-degree greater than 0
>                insert 'm' into R
>            else
>                walk from each 'r of R,
>                until we know if 'm' is reachable from R
>                then insert it into S or R
>
>                perhaps computing in-degrees,
>                or marking commits as reachable...
>
>
> Does it looks correct?  I can try to make this pseudocode into actual
> algorithm.

I'm not following your pseudocode very well, so instead I'll provide a 
more concrete description of what I mentioned before:

Here are some data structures.

IDV : A dictionary storing the currently-computed in-degree value of a 
commit 'c' as 'IDV[c]'. Assume all commits start with in-degree 0.
IDQ : A queue, for storing the "in-degree walk". It is a priority queue 
ordered by generation number.
TOQ : A queue, for storing the "topo-order walk". It is a priority queue 
ordered by "visit order" (see algorithm)

Here are some methods.

AdvanceInDegreeWalk(IDQ, IDV, gen):
     while !IDX.Empty && IDQ.Max >= gen:
         c = IDX.Dequeue

         for each parent p of c:
             IDV[p]++;
             IDQ.Enqueue(p, generation(p))

InitializeTopoOrder(R):
     Initialize IDQ, IDV, TOQ

     min_generation = GENERATION_NUMBER_INFINITY
     visit_order = 0

     for each commit c in R:
         min_generation = min(min_generation, generation(c))
         IDQ.Enqueue(c, generation(c))

     AdvanceInDegreeWalk(IDQ, IDV, min_generation)

     for each commit c in R:
         if IDV[c] == 0:
             TOQ.Enqueue(c, visit_order++)

     return (IDQ, IDV, TOQ, visit_order)

GetNextInTopoOrder(IDQ, IDV, TOQ, ref visit_order):
     if TOQ.Empty:
         return null

     c = TOQ.Dequeue()

     for each parent p of c:
         IDV[p]--
         AdvanceInDegreeWalk(IDQ, IDV, generation(p))
         if IDV[p] == 0:
             TOQ.Enqueue(p, visit_order++)

     return c

(I mention "ref visit_order" to be sure we are passing-by-reference. In 
a full implementation, the walk details would be grouped into a struct.)

This algorithm is relatively simple, but the hard part is teasing the 
revision walk machinery to initialize the data by calling 
InitializeTopoOrder(R) and to call GetNextInTopoOrder() whenever it 
needs a new element of the walk. That is, it is hard to be sure we are 
not causing side-effects as we make that transformation.

>
>> Without completing the benefits we get from generation numbers, these
>> investigations into other reachability indexes will be incomplete as
>> they are comparing benefits without all consumers taking advantage of
>> a reachability index.
> On one hand side, you are right: if we try to investigate of some
> reachability index is worth it by checking if it *improves performance
> of actual git operations* without all consumers taking advantage of a
> reachability index would be incomplete.
>
> On the other hand we can still perform synthetic tests: how much less
> commits we walk when checking that A can reach B on real commit graphs
> (like I did in mentioned Google Colaboratory notebook [3]).  This
> assumes that the cost of accessing commit data (and possibly also
> indexes data) dominates, and the cost of using reachability indexes is
> negligible.
>
> [3]: https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg
>
> On the gripping hand the construction of algorithms in those future
> steps would be, I think, affected by what types of indexes would we
> have: negative-cut filters, positive-cut filters, reachability bitmaps,
> Bloom filters for changed files, eic.
>
>> [...]
>>> Bloom filter for changed paths
>>> ------------------------------
>>>
>>> The goal of this chunk is to speed up checking if the file or directory
>>> was changed in given commit, for queries such as "git log -- <file>" or
>>> "git blame <file>".  This is something that according to "Git Merge
>>> contributor summit notes" [2] is already present in VSTS (Visual Studio
>>> Team Services - the server counterpart of GVFS: Git Virtual File System)
>>> at Microsoft:
>>>
>>> AV> - VSTS adds bloom filters to know which paths have changed on the commit
>>> AV> - tree-same check in the bloom filter is fast; speeds up file history checks
>>> AV> - might be useful in the client as well, since limited-traversal is common
>>> AV> - if the file history is _very_ sparse, then bloom filter is useful
>>> AV> - but needs pre-compute, so useful to do once
>>> AV> - first make the client do it, then think about how to serve it centrally
>>>
>>> [2]: https://public-inbox.org/git/alpine.DEB.2.20.1803091557510.23109@alexmv-linux/
>>>
>>> I think it was what Derrick Stolee was talking about at the end of his
>>> part of "Making Git for Windows" presentation at Git Merge 2018:
>>> https://youtu.be/oOMzi983Qmw?t=1835
>>>
>>> This was also mentioned in subthread of "Re: [PATCH v2 0/4] Lazy-load
>>> trees when reading commit-graph", starting from [3]
>>> [3]: https://public-inbox.org/git/86y3hyeu6c.fsf@gmail.com/
>> Again, the benefits of Bloom filters should only be measured after
>> already taking advantage of a reachability index during `git
>> log`. However, you could get performance benefits from Bloom filters
>> in a normal `git log` (no topo-order).
> I wonder how much they improve performance of "git blame"...
>
>> The tricky part about this feature is that the decisions we made in
>> our C# implementation for the VSTS Git server may be very different
>> than the needs for the C implementation of the Git client. Questions
>> like "how do we handle merge commits?" may have different answers,
>> which can only be discovered by implementing the feature.
>>
>> (The answer for VSTS is that we only store Bloom filters containing
>> the list of changed paths against the first parent. The second parent
>> frequently has too many different paths, and if we are computing
>> file-history simplification we have already determined the first
>> parent is _not_ TREESAME, which requires verifying the difference by
>> parsing trees against the first parent.)
> Thanks for the information.  I think for now it is sufficient level of
> the detail.
>
>> I'm happy to provide more information on how we built this feature if
>> someone is writing a patch. Otherwise, I plan to implement it after
>> finishing the parts I think are higher priority.
> All right, I understand that.  Time is a scarse resource.
>
>
> I think that, beside writing patches for Git, exploring how various
> pieces of data and various indexes affect walking commit graphs is also
> important.  My explorations shown that, for example, that FELINE index
> is not good fit for relatively "narrow" graphs of revision history.
> Exploring this in Python / Jupyter is easier than trying to write a
> exploratory patch for Git, in my opinion.  Just IMVHO.

You are right. Ruling out possibilities is the best outcome these 
prototypes can have. Your work has saved someone a lot of time in 
investigating that direction.

Thanks,
-Stolee

next prev parent reply	other threads:[~2018-05-14 13:20 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-05-04 19:40 [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc Jakub Narebski
2018-05-04 20:07 ` Ævar Arnfjörð Bjarmason
2018-05-04 20:36 ` Ævar Arnfjörð Bjarmason
2018-05-05 13:28   ` Jakub Narebski
2018-05-06 23:55 ` [RFC] Other chunks for commit-graph, part 2 - reachability indexes Jakub Narebski
2018-05-07 14:26 ` [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc Derrick Stolee
2018-05-12 14:00   ` Jakub Narebski
2018-05-14 13:20     ` Derrick Stolee [this message]
2018-05-14 20:58       ` Jakub Narebski
2018-05-15 10:01         ` Johannes Schindelin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=fca023a3-f849-4be8-db42-7027bfe7b0dc@gmail.com \
    --to=stolee@gmail.com \
    --cc=avarab@gmail.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=jnareb@gmail.com \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).