git@vger.kernel.org mailing list mirror (one of many)
* Re: [RFC] Possible idea for GSoC 2020
@ 2020-03-17 18:00 Abhishek Kumar
  2020-03-19 12:50 ` Jakub Narebski
  0 siblings, 1 reply; 33+ messages in thread
From: Abhishek Kumar @ 2020-03-17 18:00 UTC (permalink / raw)
  To: jnareb; +Cc: christian.couder, git, stolee

Greetings Jakub

> So perhaps we should expand "Commit graph labeling for speeding up git
> commands" idea, splitting it into two possible ways of working in this
> project: the more technical 'Generation number v2', and 'Interval labels
> for commit graph' which is more of research project?  Which should be
> put first, then?

I would suggest working on generation number v2 first because:
- We ship improved performance *sooner*.
- I find it easier to shift from simpler to more complex systems.

On a personal note, I would be able to do a better job working on generation
number v2 than on interval labels. While reading through the papers mentioned
was fun and full of a-ha moments, I find building things more fulfilling. I
would be glad if either you or Derrick opted to mentor it.

You could read my proposal at the link below. It is very rough and I haven't
proofread it yet. I will send out a more formal proposal once the direction
of this project is decided.

https://github.com/abhishekkumar2718/GSoC20/blob/master/graph_labelling.md

**Too long, didn't read**

- Commit graphs are small to medium-sized (compared to problem sizes observed in
graph-theory literature) sparse graphs. They have unusual properties compared
to more conventional graphs that can be exploited for better performance.

- Most of git's reachability queries are negative (the answer is "not
reachable"), so a negative-cut filter improves performance more than a
positive-cut filter.

- Implementing and maintaining a two-dimensional reachability index is hard
and does not offer justifiable performance gains.

- We plan to use corrected commit date as the generation number v2 because it is
locally computable, immutable and can be incrementally updated.

- If git ever considers a two-dimensional reachability index, either post-order
DFS, GRAIL, or an index based on commit date would be a good starting place
to explore.
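
To illustrate the corrected commit date mentioned in the list above, here is a
minimal sketch. It assumes the definition used in Git's commit-graph
documentation: a commit with no parents keeps its own commit date, and any other
commit takes the maximum of its own commit date and one more than the largest
corrected commit date among its parents. The function name, the dictionary
layout, and the example dates are hypothetical and are not taken from Git's
source.

```python
def corrected_commit_dates(parents, commit_date):
    """Map each commit to its corrected commit date.

    parents:     dict commit -> list of parent commits (hypothetical layout)
    commit_date: dict commit -> committer timestamp in seconds
    """
    corrected = {}
    for start in parents:
        stack = [start]
        while stack:
            v = stack[-1]
            pending = [p for p in parents[v] if p not in corrected]
            if pending:
                # Parents must be labelled before their children.
                stack.extend(pending)
                continue
            stack.pop()
            # Own date, or one more than the largest parent value.
            corrected[v] = max(
                [commit_date[v]] + [corrected[p] + 1 for p in parents[v]]
            )
    return corrected


# Hypothetical history: B was committed on a machine with a skewed clock,
# so its recorded date (90) is earlier than that of its parent A (100).
parents = {"A": [], "B": ["A"], "C": ["B"], "D": ["A"]}
commit_date = {"A": 100, "B": 90, "C": 200, "D": 150}
corrected = corrected_commit_dates(parents, commit_date)
print(corrected)
```

In this example B is bumped to 101 while the other commits keep their recorded
dates, so every commit ends up with a strictly larger value than each of its
parents. Each value depends only on the commit's own date and its parents'
values, which is why adding new commits never changes the values already
computed.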

In the proposal I go into more detail about GRAIL, FERRARI and PReaCH, briefly
explaining how they work, their advantages, and their disadvantages.

> Note that for example "Convert scripts to builtins" idea is in similar
> situation: it is also many projects in one.

Regards
Abhishek

[parent not found: <CAHk66ftQqFqP-4kd4-8cHtCMEofSUvbeSQ24pcCCrkz7+2JG1w@mail.gmail.com>]
* Re: [RFC] Possible idea for GSoC 2020
@ 2020-03-17 17:00 Abhishek Kumar
  2020-03-17 18:05 ` Jakub Narebski
  0 siblings, 1 reply; 33+ messages in thread
From: Abhishek Kumar @ 2020-03-17 17:00 UTC (permalink / raw)
  To: stolee; +Cc: christian.couder, git, jnareb

> Having such a complicated two-dimensional system would need to
> justify itself by being measurably faster than that one-dimensional
> system in these example commands.
>
> [...]
>
> My _prediction_ is that the two-dimensional system will be more
> complicated to write and use, and will not have any measurable
> difference. I'd be happy to be wrong, but I also would not send
> anyone down this direction only to find out I'm right and that
> effort was wasted.

Agreed. I have been through the papers on the variants involved, and on graphs
comparable to some of the largest git repositories, performance improves by
only fifty nanoseconds for a random query.

Additionally:
1. They require significantly more space per commit.
2. They require significantly more preprocessing time.

> My recommendation is that a GSoC student update the
> generation number to "v2" based on the definition you made in [1].
> That proposal is also more likely to be effective in Git because
> it makes use of extra heuristic information (commit date) to
> assist the types of algorithms we care about.

> In that case, the "difficult" part is moving the "generation"
> member of struct commit into a slab before making it a 64-bit
> value. (This is likely necessary for your plan, anyway.) Updating
> the generation number to v2 is relatively straight-forward after
> that, as someone can follow all places that reference or compute
> generation numbers and apply a diff

Thanks for the recommendation. Reading about how this fits in more
with REU on the other thread, I too agree that updating generation
number to use corrected commit date would be more appropriate for a GSoC
project.

Regards
Abhishek

* Re: [RFC] Possible idea for GSoC 2020
@ 2020-03-13 17:30 Abhishek Kumar
  0 siblings, 0 replies; 33+ messages in thread
From: Abhishek Kumar @ 2020-03-13 17:30 UTC (permalink / raw)
  To: jnareb; +Cc: christian.couder, git

Jakub Narebski <jnareb@gmail.com> writes:

> I have prepared slides for "Graph operations in Git version control
> system" (PDF), mainly describing what was already done to improve their
> performance, but they also include a few thoughts about the future (like
> additional graph reachability labelings)... unfortunately the slides are
> in Polish, not in English.

> If there is interest, I could translate them, and put the result
> somewhere accessible.

I was going through resources and drafting up a proposal. The slides would be
a great help.

Could you please translate them, if it's not too much trouble?

> Or I could try to make this information into blog post -- this topic
> would really gain from using images (like Derrick Stolee series of
> articles on commit-graph).

Yes, thank you very much. Derrick's articles have been very useful so far.
I would be glad to help you out in any way that I can.

Regards
Abhishek

* [RFC] Possible idea for GSoC 2020
@ 2020-03-10 14:50 Jakub Narebski
  2020-03-11 19:03 ` Junio C Hamano
                   ` (3 more replies)
  0 siblings, 4 replies; 33+ messages in thread
From: Jakub Narebski @ 2020-03-10 14:50 UTC (permalink / raw)
  To: git; +Cc: Christian Couder

Hello,

Below is a possible proposal for a more difficult Google Summer of
Code 2020 project.

A few questions:
- Is it too late to propose a new project idea for GSoC 2020?
- Is it too difficult a project for GSoC?

Best,

  Jakub Narębski

--------------------------------------------------

### Graph labelling for speeding up git commands

 - Language: C
 - Difficulty: hard / difficult
 - Possible mentors: Jakub Narębski

Git uses various clever methods for making operations on very large
repositories faster, from bitmap indices for git-fetch[1], to generation
numbers (also known as topological levels) in the commit-graph file for
commit graph traversal operations like `git log --graph`[2].
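
As a rough illustration of how generation numbers (topological levels) speed up
reachability checks, here is a minimal sketch. It assumes the usual definition:
a commit with no parents has level 1, and any other commit has one more than
the largest level among its parents. The function names and the toy graph are
hypothetical and do not mirror Git's C implementation.

```python
from collections import deque


def topological_levels(parents):
    """Return gen[v]: 1 for a root, else 1 + max level of the parents."""
    gen = {}
    for start in parents:
        stack = [start]
        while stack:
            v = stack[-1]
            pending = [p for p in parents[v] if p not in gen]
            if pending:
                stack.extend(pending)  # label parents first
                continue
            stack.pop()
            gen[v] = 1 + max((gen[p] for p in parents[v]), default=0)
    return gen


def can_reach(parents, gen, v, u):
    """Walk from v towards the roots, pruning commits that cannot lead to u."""
    if v == u:
        return True
    if gen[v] <= gen[u]:
        return False  # negative cut: an ancestor always has a smaller level
    seen = {v}
    queue = deque([v])
    while queue:
        c = queue.popleft()
        for p in parents[c]:
            if p == u:
                return True
            # Only commits with a level above gen[u] can still reach u.
            if p not in seen and gen[p] > gen[u]:
                seen.add(p)
                queue.append(p)
    return False


# Hypothetical history: D merges B and C, E is a second child of C.
parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"], "E": ["C"]}
gen = topological_levels(parents)
print(gen)
print(can_reach(parents, gen, "D", "A"), can_reach(parents, gen, "E", "B"))
```

The level comparison can only answer "no" early; when it is inconclusive the
walk still runs, but it never descends below the level of the target commit.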

One possible improvement that can make Git even faster is using min-post
intervals labelling.  The basis of this labelling is post-visit order of
a depth-first search traversal tree of a commit graph, let's call it
'post(v)'.

If, for each commit 'v', we compute and store in the commit-graph file
two numbers, 'post(v)' and the minimum of 'post(u)' over all commits 'u'
reachable from 'v' (let's call the latter 'min_graph(v)'), then the
following condition is true:

  if 'v' can reach 'u', then min_graph(v) <= post(u) <= post(v)

Similarly, if for each commit 'v' we compute and store in the
commit-graph file two numbers, 'post(v)' and the minimum of 'post(u)'
over the commits 'u' that were visited during the part of the
depth-first search that started from 'v' (that is, the minimum
post-order number in the subtree of the spanning tree rooted at 'v';
let's call the latter 'min_tree(v)'), then the following condition is
true:

  if min_tree(v) <= post(u) <= post(v), then 'v' can reach 'u'
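
The two conditions above can be sketched as follows. This is only an
illustration under stated assumptions: the graph is given as a dictionary from
each commit to its parents, the depth-first search is recursive (fine for a toy
graph, not for a real repository), and the names 'min_post_labels' and
'classify' are hypothetical.

```python
def min_post_labels(parents, tips):
    """Compute post(v), min_tree(v) and min_graph(v) with one DFS."""
    post, min_tree, min_graph = {}, {}, {}
    counter = 0

    def dfs(v):
        nonlocal counter
        tree_mins, graph_mins = [], []
        for p in parents[v]:
            if p not in post:
                dfs(p)
                tree_mins.append(min_tree[p])  # p is in v's DFS subtree
            graph_mins.append(min_graph[p])    # p is reachable either way
        counter += 1
        post[v] = counter
        min_tree[v] = min(tree_mins + [post[v]])
        min_graph[v] = min(graph_mins + [post[v]])

    for t in tips:
        if t not in post:
            dfs(t)
    return post, min_tree, min_graph


def classify(post, min_tree, min_graph, v, u):
    """Answer 'can v reach u?' from the labels alone, when possible."""
    if min_tree[v] <= post[u] <= post[v]:
        return "yes"       # positive cut
    if not (min_graph[v] <= post[u] <= post[v]):
        return "no"        # negative cut
    return "unknown"       # labels are inconclusive; a graph walk is needed


# Hypothetical history: D merges B and C, E is a second child of C.
parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"], "E": ["C"]}
post, min_tree, min_graph = min_post_labels(parents, ["D", "E"])
for v, u in [("D", "A"), ("B", "C"), ("E", "D"), ("C", "A")]:
    print(v, u, classify(post, min_tree, min_graph, v, u))
```

On this graph the labels answer "yes" for D reaching A and "no" for B reaching
C, but stay inconclusive for E and D (actually unreachable) and for C and A
(actually reachable, but only through a non-tree edge), so a real
implementation would still need a fallback traversal for such pairs.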

The task would be to implement computing such labelling (or a more
involved variant of it[3][4]), storing it in commit-graph file, and
using it for speeding up git commands (starting from a single chosen
command) such as:

 - git merge-base --is-ancestor A B
 - git branch --contains A
 - git tag --contains A
 - git branch --merged A
 - git tag --merged A
 - git merge-base --all A B
 - git log --topo-order

References:

1. <http://githubengineering.com/counting-objects/>
2. <https://devblogs.microsoft.com/devops/supercharging-the-git-commit-graph-iii-generations/>
3. <https://arxiv.org/abs/1404.4465>
4. <https://github.com/steps/Ferrari>

See also discussion in:

<https://public-inbox.org/git/86tvl0zhos.fsf@gmail.com/t/>


end of thread, other threads:[~2020-03-27 18:32 UTC | newest]

Thread overview: 33+ messages
2020-03-17 18:00 [RFC] Possible idea for GSoC 2020 Abhishek Kumar
2020-03-19 12:50 ` Jakub Narebski
     [not found] <CAHk66ftQqFqP-4kd4-8cHtCMEofSUvbeSQ24pcCCrkz7+2JG1w@mail.gmail.com>
2020-03-27 18:31 ` Jakub Narębski
  -- strict thread matches above, loose matches on Subject: below --
2020-03-17 17:00 Abhishek Kumar
2020-03-17 18:05 ` Jakub Narebski
2020-03-13 17:30 Abhishek Kumar
2020-03-10 14:50 Jakub Narebski
2020-03-11 19:03 ` Junio C Hamano
2020-03-13 10:56   ` Jakub Narebski
2020-03-15 14:26     ` Jakub Narebski
2020-03-17 12:24       ` Kaartic Sivaraam
2020-03-17 12:39         ` Kaartic Sivaraam
2020-03-17 14:22         ` Jakub Narebski
2020-03-11 20:29 ` Christian Couder
2020-03-11 21:30   ` Jakub Narebski
2020-03-11 21:48     ` Christian Couder
2020-03-12 12:17       ` Jakub Narebski
2020-03-12 13:08         ` Jakub Narebski
2020-03-13 10:59           ` Jakub Narebski
2020-03-13 13:08 ` Philip Oakley
2020-03-13 14:34   ` Jakub Narebski
2020-03-15 18:57     ` Philip Oakley
2020-03-15 21:14       ` Jakub Narebski
2020-03-16 14:47         ` Philip Oakley
2020-03-16 12:44 ` Derrick Stolee
2020-03-17  3:13   ` Jakub Narebski
2020-03-17  7:24     ` Christian Couder
2020-03-17 11:49       ` Derrick Stolee
2020-03-17 14:18       ` Jakub Narebski
2020-03-17 17:04         ` Christian Couder
2020-03-18 13:55           ` Jakub Narebski
2020-03-18 15:25             ` Derrick Stolee
2020-03-19 12:52               ` Jakub Narebski
