git@vger.kernel.org mailing list mirror (one of many)
From: Derrick Stolee <stolee@gmail.com>
To: Jakub Narebski <jnareb@gmail.com>, git@vger.kernel.org
Cc: Christian Couder <christian.couder@gmail.com>
Subject: Re: [RFC] Possible idea for GSoC 2020
Date: Mon, 16 Mar 2020 08:44:54 -0400
Message-ID: <7d6a84c7-6b16-c2a9-11a1-3397422064d1@gmail.com>
In-Reply-To: <86mu8o8dsf.fsf@gmail.com>

On 3/10/2020 10:50 AM, Jakub Narebski wrote:
> Hello,
> 
> Here below is a possible proposal for a more difficult Google Summer of
> Code 2020 project.
> 
> A few questions:
> - is it too late to propose a new project idea for GSoC 2020?
> - is it too difficult of a project for GSoC?
> 
> Best,
> 
>   Jakub Narębski
> 
> --------------------------------------------------
> 
> ### Graph labelling for speeding up git commands
> 
>  - Language: C
>  - Difficulty: hard / difficult
>  - Possible mentors: Jakub Narębski
> 
> Git uses various clever methods for making operations on very large
> repositories faster, from bitmap indices for git-fetch[1], to generation
> numbers (also known as topological levels) in the commit-graph file for
> commit graph traversal operations like `git log --graph`[2].
> 
> One possible improvement that can make Git even faster is using min-post
> intervals labelling.  The basis of this labelling is post-visit order of
> a depth-first search traversal tree of a commit graph, let's call it
> 'post(v)'.
> 
> If for each commit 'v' we would compute and store in the commit-graph
> file two numbers: 'post(v)' and the minimum of 'post(u)' for all commits
> reachable from 'v', let's call the latter 'min_graph(v)', then the
> following condition is true:
> 
>   if 'v' can reach 'u', then min_graph(v) <= post(u) <= post(v)

I haven't thought too hard about it, but I'm assuming that if v is not
in a commit-graph file, then post(v) would be "infinite" and min_graph(v)
would be zero.

We already have the second inequality (f(u) <= f(v)) where the function
'f' is the generation of v. The success of this approach over generation
numbers relies entirely on how often the inequality min_graph(v) <= post(u)
fails when gen(u) <= gen(v) holds.
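
To make that comparison concrete, here is a minimal sketch in C of how
post(v), min_graph(v) and a topological-level generation could be computed
on a toy history, together with the two "cannot reach" filters. Everything
named toy_* is hypothetical and exists only for this illustration; it is
not Git's struct commit or commit-graph code, and the recursive walk is
only acceptable because the graph is tiny.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MAX_PARENTS 2
#define TOY_NONE (-1)

/* Hypothetical stand-in for a commit; not Git's struct commit. */
struct toy_commit {
	int parent[TOY_MAX_PARENTS]; /* indices into the array, or TOY_NONE */
	uint32_t post;               /* post-visit number, 0 = not labelled */
	uint32_t min_graph;          /* min post() over everything reachable */
	uint32_t generation;         /* topological level, roots are 1 */
};

/* Recursive DFS from v; parents are finished before v gets its number. */
static void toy_label(struct toy_commit *g, int v, uint32_t *counter)
{
	uint32_t min = UINT32_MAX, gen = 0;
	int i;

	assert(v >= 0);
	if (g[v].post)
		return; /* already labelled */
	for (i = 0; i < TOY_MAX_PARENTS; i++) {
		int p = g[v].parent[i];
		if (p == TOY_NONE)
			continue;
		toy_label(g, p, counter);
		if (g[p].min_graph < min)
			min = g[p].min_graph;
		if (g[p].generation > gen)
			gen = g[p].generation;
	}
	g[v].post = (*counter)++;
	g[v].min_graph = g[v].post < min ? g[v].post : min;
	g[v].generation = gen + 1;
}

/* 1 if generation alone proves that v cannot reach u. */
static int toy_gen_rules_out(const struct toy_commit *g, int v, int u)
{
	return g[u].generation > g[v].generation;
}

/* 1 if min_graph(v) <= post(u) <= post(v) fails, so v cannot reach u. */
static int toy_interval_rules_out(const struct toy_commit *g, int v, int u)
{
	return g[u].post < g[v].min_graph || g[u].post > g[v].post;
}

/* Two roots (0 and 2) joined by merge 5:  0 <- 1 <- 5 -> 4 -> 3 -> 2 */
static void toy_build(struct toy_commit g[6])
{
	static const int parents[6][TOY_MAX_PARENTS] = {
		{ TOY_NONE, TOY_NONE }, { 0, TOY_NONE }, { TOY_NONE, TOY_NONE },
		{ 2, TOY_NONE }, { 3, TOY_NONE }, { 1, 4 },
	};
	uint32_t counter = 1;
	int v, i;

	for (v = 0; v < 6; v++) {
		for (i = 0; i < TOY_MAX_PARENTS; i++)
			g[v].parent[i] = parents[v][i];
		g[v].post = 0;
	}
	toy_label(g, 5, &counter); /* single tip: the merge commit */
}
```

On this history commit 4 cannot reach commit 1, yet gen(1) = 2 is below
gen(4) = 3, so the generation check is inconclusive. min_graph(4) = 3 is
above post(1) = 2, so the interval check answers "no" without a walk. How
often real histories look like this is exactly the open question.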

> If for each commit 'v' we would compute and store in the commit-graph
> file two numbers: 'post(v)' and the minimum of 'post(u)' for commits
> that were visited during the part of depth-first search that started
> from 'v' (which is the minimum of post-order number for subtree of a
> spanning tree that starts at 'v').  Let's call the latter 'min_tree(v)'.
> Then the following condition is true:
> 
>   if min_tree(v) <= post(u) <= post(v), then 'v' can reach 'u'

How many places in Git do we ask "can v reach u?" and how many would
return immediately without needing a walk in this new approach? My
guess is that we will have a very narrow window where this query
returns a positive result.
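
A small sketch of the positive test, under the assumption that min_tree(v)
is recorded as the next free post-order number at the moment the DFS
enters v (every commit finished between entering and leaving v lies in the
subtree of v). The dfs_* names are made up for this example and are not
part of Git.

```c
#include <assert.h>
#include <stdint.h>

#define DFS_MAX_PARENTS 2
#define DFS_NONE (-1)

/* Hypothetical toy commit; not Git's struct commit. */
struct dfs_commit {
	int parent[DFS_MAX_PARENTS]; /* indices into the array, or DFS_NONE */
	uint32_t post;               /* post-visit number, 0 = not visited */
	uint32_t min_tree;           /* smallest post() in the DFS subtree */
};

static void dfs_label(struct dfs_commit *g, int v, uint32_t *counter)
{
	int i;

	assert(v >= 0);
	if (g[v].post)
		return; /* reached again through a non-tree edge */
	/* The first commit to finish below v receives this number. */
	g[v].min_tree = *counter;
	for (i = 0; i < DFS_MAX_PARENTS; i++)
		if (g[v].parent[i] != DFS_NONE)
			dfs_label(g, g[v].parent[i], counter);
	g[v].post = (*counter)++;
}

/*
 * 1 if the labels alone prove that v can reach u.
 * 0 means "unknown": the caller still has to walk.
 */
static int dfs_proves_reach(const struct dfs_commit *g, int v, int u)
{
	return g[u].post &&
	       g[v].min_tree <= g[u].post && g[u].post <= g[v].post;
}

/* Diamond: root 0, children 1 and 2, merge 3 with parents 1 and 2. */
static void dfs_build(struct dfs_commit g[4])
{
	static const int parents[4][DFS_MAX_PARENTS] = {
		{ DFS_NONE, DFS_NONE }, { 0, DFS_NONE },
		{ 0, DFS_NONE }, { 1, 2 },
	};
	uint32_t counter = 1;
	int v, i;

	for (v = 0; v < 4; v++) {
		for (i = 0; i < DFS_MAX_PARENTS; i++)
			g[v].parent[i] = parents[v][i];
		g[v].post = 0;
	}
	dfs_label(g, 3, &counter);
}
```

In the diamond, commit 2 reaches the root only through a non-tree edge
(the root was already visited via commit 1), so min_tree(2) = 3 is above
post(0) = 1 and the test stays silent even though 2 does reach 0. Only
pairs connected inside the spanning tree get the immediate "yes".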

I believe we discussed this concept briefly when planning "generation
number v2" and the main concern I have with this plan is that the
values are not stable. The values of post(v) and min_tree(v) depend
on the entire graph as a whole, not just what is reachable from v
(and preferably only the parents of v).
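
As a tiny illustration of that instability, the sketch below labels the
same three-commit history twice, visiting the two tips in opposite order.
The ord_* names are hypothetical and the single-parent commit struct is a
simplification for the example, not Git's data structure.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy commit with at most one parent. */
struct ord_commit {
	int parent;          /* parent index, or -1 for a root */
	uint32_t post;       /* post-visit number, 0 = not visited */
	uint32_t generation; /* topological level, roots are 1 */
};

static void ord_visit(struct ord_commit *g, int v, uint32_t *counter)
{
	uint32_t gen = 0;

	assert(v >= 0);
	if (g[v].post)
		return;
	if (g[v].parent >= 0) {
		ord_visit(g, g[v].parent, counter);
		gen = g[g[v].parent].generation;
	}
	g[v].post = (*counter)++;
	g[v].generation = gen + 1;
}

/*
 * Label the history "root 0 with children 1 and 2", starting the
 * DFS from the two tips in the order given by the caller.
 */
static void ord_label(struct ord_commit g[3], int first_tip, int second_tip)
{
	uint32_t counter = 1;

	g[0] = (struct ord_commit){ -1, 0, 0 };
	g[1] = (struct ord_commit){ 0, 0, 0 };
	g[2] = (struct ord_commit){ 0, 0, 0 };
	ord_visit(g, first_tip, &counter);
	ord_visit(g, second_tip, &counter);
}
```

Calling ord_label(a, 1, 2) gives post(1) = 2, while ord_label(b, 2, 1)
gives post(1) = 3, although nothing reachable from commit 1 changed. The
generation of commit 1 is 2 in both runs, because it depends only on the
parents.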

Before starting to implement this, I would consider how such labels
could be computed across incremental commit-graph boundaries. That is,
if I'm only adding a layer of commits to the commit-graph without
modifying the existing layers of the commit-graph chain, can I still
compute values with these properties? How expensive is it? Do I need
to walk the entire reachable set of commits?
 
> The task would be to implement computing such labelling (or a more
> involved variant of it[3][4]), storing it in commit-graph file, and
> using it for speeding up git commands (starting from a single chosen
> command) such as:
> 
>  - git merge-base --is-ancestor A B
>  - git branch --contains A
>  - git tag --contains A
>  - git branch --merged A
>  - git tag --merged A
>  - git merge-base --all A B
>  - git log --topo-order

Such a complicated two-dimensional system would need to
justify itself by being measurably faster than the one-dimensional
system (generation numbers alone) in these example commands.

The point of generation number v2 [1] was to allow moving to "exact"
algorithms for things like merge-base where we still use commit time
as a heuristic, and could be wrong because of special data shapes.
We don't use generation numbers in these examples because using only
generation numbers can lead to a large increase in the number of commits
walked. The example we saw in the Linux kernel repository was a bug
fix created on top of a very old commit, so there was a commit of
low generation with very high commit-date that caused extra walking.
(See [2] for a detailed description of the data shape.)

My _prediction_ is that the two-dimensional system will be more
complicated to write and use, and will not have any measurable
difference. I'd be happy to be wrong, but I also would not send
anyone down this direction only to find out I'm right and that
effort was wasted.

My recommendation is that a GSoC student update the
generation number to "v2" based on the definition you made in [1].
That proposal is also more likely to be effective in Git because
it makes use of extra heuristic information (commit date) to
assist the types of algorithms we care about.

In that case, the "difficult" part is moving the "generation"
member of struct commit into a slab before making it a 64-bit
value. (This is likely necessary for your plan, anyway.) Updating
the generation number to v2 is relatively straightforward after
that, as someone can follow all places that reference or compute
generation numbers and apply a diff.
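
For readers unfamiliar with the slab idea, here is a rough sketch of what
"moving the member into a slab" means: the value leaves the commit struct
and lives in a side table keyed by a small per-commit index, which makes
widening it to 64 bits cheap. The gen_slab names below are hypothetical
and this is deliberately NOT the interface of Git's commit-slab.h, only a
self-contained model of the concept.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical side table of 64-bit generation values. */
struct gen_slab {
	uint64_t *value; /* one slot per commit index */
	size_t alloc;    /* number of slots currently allocated */
};

/*
 * Return the slot for commit number `index`, growing the table on
 * demand. New slots start at 0, standing in for "not computed yet".
 * Returns NULL if the allocation fails.
 */
static uint64_t *gen_slab_at(struct gen_slab *s, size_t index)
{
	assert(s);
	if (index >= s->alloc) {
		size_t new_alloc = s->alloc ? s->alloc : 16;
		uint64_t *grown;

		while (new_alloc <= index)
			new_alloc *= 2;
		grown = realloc(s->value, new_alloc * sizeof(*grown));
		if (!grown)
			return NULL;
		/* Zero only the newly added tail. */
		memset(grown + s->alloc, 0,
		       (new_alloc - s->alloc) * sizeof(*grown));
		s->value = grown;
		s->alloc = new_alloc;
	}
	return &s->value[index];
}

/* Release the table and reset it to the empty state. */
static void gen_slab_clear(struct gen_slab *s)
{
	free(s->value);
	s->value = NULL;
	s->alloc = 0;
}
```

A caller would start from `struct gen_slab s = { NULL, 0 };`, write with
`*gen_slab_at(&s, idx) = value;` and read the same way. Commits that never
need a generation number cost no memory beyond their place in the index
range.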

Thanks,
-Stolee

[1] https://lore.kernel.org/git/86o8ziatb2.fsf_-_@gmail.com/
    [RFC/PATCH] commit-graph: generation v5 (backward compatible date ceiling)

[2] https://lore.kernel.org/git/efa3720fb40638e5d61c6130b55e3348d8e4339e.1535633886.git.gitgitgadget@gmail.com/
    [PATCH 1/1] commit: don't use generation numbers if not needed

