git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [RFC] Possible idea for GSoC 2020
@ 2020-03-10 14:50 Jakub Narebski
  2020-03-11 19:03 ` Junio C Hamano
                   ` (3 more replies)
  0 siblings, 4 replies; 33+ messages in thread
From: Jakub Narebski @ 2020-03-10 14:50 UTC (permalink / raw)
  To: git; +Cc: Christian Couder

Hello,

Here below is a possible proposal for a more difficult Google Summer of
Code 2020 project.

A few questions:
- is it too late to propose a new project idea for GSoC 2020?
- is it too difficult of a project for GSoC?

Best,

  Jakub Narębski

--------------------------------------------------

### Graph labelling for speeding up git commands

 - Language: C
 - Difficulty: hard / difficult
 - Possible mentors: Jakub Narębski

Git uses various clever methods for making operations on very large
repositories faster, from bitmap indices for git-fetch[1], to generation
numbers (also known as topological levels) in the commit-graph file for
commit graph traversal operations like `git log --graph`[2].

One possible improvement that can make Git even faster is using min-post
intervals labelling.  The basis of this labelling is post-visit order of
a depth-first search traversal tree of a commit graph, let's call it
'post(v)'.

If for each commit 'v' we would compute and store in the commit-graph
file two numbers: 'post(v)' and the minimum of 'post(u)' for all commits
reachable from 'v', let's call the latter 'min_graph(v)', then the
following condition is true:

  if 'v' can reach 'u', then min_graph(v) <= post(u) <= post(v)

If for each commit 'v' we would compute and store in the commit-graph
file two numbers: 'post(v)' and the minimum of 'post(u)' for commits
that were visited during the part of depth-first search that started
from 'v' (which is the minimum of post-order number for subtree of a
spanning tree that starts at 'v').  Let's call the later 'min_tree(v)'.
Then the following condition is true:

  if min_tree(v) <= post(u) <= post(v), then 'v' can reach 'u'

The task would be to implement computing such labelling (or a more
involved variant of it[3][4]), storing it in commit-graph file, and
using it for speeding up git commands (starting from a single chosen
command) such as:

 - git merge-base --is-ancestor A B
 - git branch --contains A
 - git tag --contains A
 - git branch --merged A
 - git tag --merged A
 - git merge-base --all A B
 - git log --topo-sort

References:

1. <http://githubengineering.com/counting-objects/>
2. <https://devblogs.microsoft.com/devops/supercharging-the-git-commit-graph-iii-generations/>
3. <https://arxiv.org/abs/1404.4465>
4. <https://github.com/steps/Ferrari>

See also discussion in:

<https://public-inbox.org/git/86tvl0zhos.fsf@gmail.com/t/>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Possible idea for GSoC 2020
  2020-03-10 14:50 [RFC] Possible idea for GSoC 2020 Jakub Narebski
@ 2020-03-11 19:03 ` Junio C Hamano
  2020-03-13 10:56   ` Jakub Narebski
  2020-03-11 20:29 ` Christian Couder
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 33+ messages in thread
From: Junio C Hamano @ 2020-03-11 19:03 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git, Christian Couder

Jakub Narebski <jnareb@gmail.com> writes:

> A few questions:
> - is it too late to propose a new project idea for GSoC 2020?
> - is it too difficult of a project for GSoC?
> ...
> ### Graph labelling for speeding up git commands
>
>  - Language: C
>  - Difficulty: hard / difficult
>  - Possible mentors: Jakub Narębski

I am not running the GSoC or participating in it in any way other
than just being a reviewer-maintainer of the project, but I would
appreciate a well-thought-out write-up very much.

> Git uses various clever methods for making operations on very large
> repositories faster, from bitmap indices for git-fetch[1], to generation
> numbers (also known as topological levels) in the commit-graph file for
> commit graph traversal operations like `git log --graph`[2].
>
> One possible improvement that can make Git even faster is using min-post
> intervals labelling.  The basis of this labelling is post-visit order of
> a depth-first search traversal tree of a commit graph, let's call it
> 'post(v)'.
>
> If for each commit 'v' we would compute and store in the commit-graph
> file two numbers: 'post(v)' and the minimum of 'post(u)' for all commits
> reachable from 'v', let's call the latter 'min_graph(v)', then the
> following condition is true:
>
>   if 'v' can reach 'u', then min_graph(v) <= post(u) <= post(v)
>
> If for each commit 'v' we would compute and store in the commit-graph
> file two numbers: 'post(v)' and the minimum of 'post(u)' for commits
> that were visited during the part of depth-first search that started
> from 'v' (which is the minimum of post-order number for subtree of a
> spanning tree that starts at 'v').  Let's call the later 'min_tree(v)'.
> Then the following condition is true:
>
>   if min_tree(v) <= post(u) <= post(v), then 'v' can reach 'u'
>
> The task would be to implement computing such labelling (or a more
> involved variant of it[3][4]), storing it in commit-graph file, and
> using it for speeding up git commands (starting from a single chosen
> command) such as:
>
>  - git merge-base --is-ancestor A B
>  - git branch --contains A
>  - git tag --contains A
>  - git branch --merged A
>  - git tag --merged A
>  - git merge-base --all A B
>  - git log --topo-sort
>
> References:
>
> 1. <http://githubengineering.com/counting-objects/>
> 2. <https://devblogs.microsoft.com/devops/supercharging-the-git-commit-graph-iii-generations/>
> 3. <https://arxiv.org/abs/1404.4465>
> 4. <https://github.com/steps/Ferrari>
>
> See also discussion in:
>
> <https://public-inbox.org/git/86tvl0zhos.fsf@gmail.com/t/>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Possible idea for GSoC 2020
  2020-03-10 14:50 [RFC] Possible idea for GSoC 2020 Jakub Narebski
  2020-03-11 19:03 ` Junio C Hamano
@ 2020-03-11 20:29 ` Christian Couder
  2020-03-11 21:30   ` Jakub Narebski
  2020-03-13 13:08 ` Philip Oakley
  2020-03-16 12:44 ` Derrick Stolee
  3 siblings, 1 reply; 33+ messages in thread
From: Christian Couder @ 2020-03-11 20:29 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git

Hi,

On Tue, Mar 10, 2020 at 3:50 PM Jakub Narebski <jnareb@gmail.com> wrote:
>
> Hello,
>
> Here below is a possible proposal for a more difficult Google Summer of
> Code 2020 project.
>
> A few questions:
> - is it too late to propose a new project idea for GSoC 2020?

I don't think so. Students have until March 16 to submit a proposal,
so they could still submit one based on this even if it's late.

> - is it too difficult of a project for GSoC?

I guess it depends on how much you as a mentor would help the student
getting started.

I think it's interesting and worth adding it anyway. Can you add it to
SoC-2020-Ideas.md?

Thanks for the idea,
Christian.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Possible idea for GSoC 2020
  2020-03-11 20:29 ` Christian Couder
@ 2020-03-11 21:30   ` Jakub Narebski
  2020-03-11 21:48     ` Christian Couder
  0 siblings, 1 reply; 33+ messages in thread
From: Jakub Narebski @ 2020-03-11 21:30 UTC (permalink / raw)
  To: Christian Couder; +Cc: git, Derrick Stolee

Hi,

Christian Couder <christian.couder@gmail.com> writes:
> On Tue, Mar 10, 2020 at 3:50 PM Jakub Narebski <jnareb@gmail.com> wrote:
>>
>> Here below is a possible proposal for a more difficult Google Summer of
>> Code 2020 project.
>>
>> A few questions:
>> - is it too late to propose a new project idea for GSoC 2020?
>
> I don't think so. Students have until March 16 to submit a proposal,
> so they could still submit one based on this even if it's late.

If I understand Google Summer of Code 2020 Timeline correctly
https://developers.google.com/open-source/gsoc/timeline
March 16 is the date when students might start to submit
proposals, and the deadline is March 20.

>> - is it too difficult of a project for GSoC?
>
> I guess it depends on how much you as a mentor would help the student
> getting started.
>
> I think it's interesting and worth adding it anyway. Can you add it to
> SoC-2020-Ideas.md?

All right, I'll do it.


P.S. I wonder if Derrick Stolee would agree to be co-mentor (if he is
not too busy with working on Scalar).

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Possible idea for GSoC 2020
  2020-03-11 21:30   ` Jakub Narebski
@ 2020-03-11 21:48     ` Christian Couder
  2020-03-12 12:17       ` Jakub Narebski
  0 siblings, 1 reply; 33+ messages in thread
From: Christian Couder @ 2020-03-11 21:48 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: git, Derrick Stolee, Heba Waly, Junio C Hamano, Jonathan Tan,
	Emily Shaffer

On Wed, Mar 11, 2020 at 10:30 PM Jakub Narebski <jnareb@gmail.com> wrote:

> >> A few questions:
> >> - is it too late to propose a new project idea for GSoC 2020?
> >
> > I don't think so. Students have until March 16 to submit a proposal,
> > so they could still submit one based on this even if it's late.
>
> If I understand Google Summer of Code 2020 Timeline correctly
> https://developers.google.com/open-source/gsoc/timeline
> March 16 is the date when students might start to submit
> proposals, and the deadline is March 20.

Yeah, sorry about the confusion, the timeline says:

March 16 18:00 UTC Student application period begins
March 31 18:00 UTC Student application deadline

So technically we could add ideas until March 31, but I think it makes
no sense, especially as we ask students to submit proposals well in
advance. So we should give ourselves March 16 as our actual deadline
to add ideas.

> >> - is it too difficult of a project for GSoC?
> >
> > I guess it depends on how much you as a mentor would help the student
> > getting started.
> >
> > I think it's interesting and worth adding it anyway. Can you add it to
> > SoC-2020-Ideas.md?
>
> All right, I'll do it.

Great, thanks!

> P.S. I wonder if Derrick Stolee would agree to be co-mentor (if he is
> not too busy with working on Scalar).

I am adding Heba in Cc as she said she would be ok to co-mentor. Emily
and Jonathan T. might be interested too, so I'm adding them too.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Possible idea for GSoC 2020
  2020-03-11 21:48     ` Christian Couder
@ 2020-03-12 12:17       ` Jakub Narebski
  2020-03-12 13:08         ` Jakub Narebski
  0 siblings, 1 reply; 33+ messages in thread
From: Jakub Narebski @ 2020-03-12 12:17 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Derrick Stolee, Heba Waly, Junio C Hamano, Jonathan Tan,
	Emily Shaffer

Hello,

Christian Couder <christian.couder@gmail.com> writes:
> On Wed, Mar 11, 2020 at 10:30 PM Jakub Narebski <jnareb@gmail.com> wrote:
>
>>>> A few questions:
>>>> - is it too late to propose a new project idea for GSoC 2020?
>>>
>>> I don't think so. Students have until March 16 to submit a proposal,
>>> so they could still submit one based on this even if it's late.
>>
>> If I understand Google Summer of Code 2020 Timeline correctly
>> https://developers.google.com/open-source/gsoc/timeline
>> March 16 is the date when students might start to submit
>> proposals, and the deadline is March 20.
>
> Yeah, sorry about the confusion, the timeline says:
>
> March 16 18:00 UTC Student application period begins
> March 31 18:00 UTC Student application deadline
>
> So technically we could add ideas until March 31, but I think it makes
> no sense, especially as we ask students to submit proposals well in
> advance. So we should give ourselves March 16 as our actual deadline
> to add ideas.

Right.

>>>> - is it too difficult of a project for GSoC?
>>>
>>> I guess it depends on how much you as a mentor would help the student
>>> getting started.
>>>
>>> I think it's interesting and worth adding it anyway. Can you add it to
>>> SoC-2020-Ideas.md?
>>
>> All right, I'll do it.
>
> Great, thanks!

Added with some modifications in commit 6b5edec6557
https://github.com/git/git.github.io/commit/6b5edec6557930888441c626f9272dc2be0c769e

>> P.S. I wonder if Derrick Stolee would agree to be co-mentor (if he is
>> not too busy with working on Scalar).
>
> I am adding Heba in Cc as she said she would be ok to co-mentor. Emily
> and Jonathan T. might be interested too, so I'm adding them too.

Added Heba as possible co-maintainer in SoC-2020-Ideas.md.
If I hear from Stolee, Emily or Jonathan T, I can add them too.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Possible idea for GSoC 2020
  2020-03-12 12:17       ` Jakub Narebski
@ 2020-03-12 13:08         ` Jakub Narebski
  2020-03-13 10:59           ` Jakub Narebski
  0 siblings, 1 reply; 33+ messages in thread
From: Jakub Narebski @ 2020-03-12 13:08 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Derrick Stolee, Heba Waly, Junio C Hamano, Jonathan Tan,
	Emily Shaffer

Jakub Narebski <jnareb@gmail.com> writes:
> Christian Couder <christian.couder@gmail.com> writes:
>> On Wed, Mar 11, 2020 at 10:30 PM Jakub Narebski <jnareb@gmail.com> wrote:
[...]
>>>> I think it's interesting and worth adding it anyway. Can you add it to
>>>> SoC-2020-Ideas.md?
>>>
>>> All right, I'll do it.
>>
>> Great, thanks!
>
> Added with some modifications in commit 6b5edec6557
> https://github.com/git/git.github.io/commit/6b5edec6557930888441c626f9272dc2be0c769e

And it failed to build.  Was it caused by accidentally deleting two
trailing empty lines at the end of the file?

page_build_id=168669405

-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Possible idea for GSoC 2020
  2020-03-11 19:03 ` Junio C Hamano
@ 2020-03-13 10:56   ` Jakub Narebski
  2020-03-15 14:26     ` Jakub Narebski
  0 siblings, 1 reply; 33+ messages in thread
From: Jakub Narebski @ 2020-03-13 10:56 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Christian Couder, Derrick Stolee, Heba Waly, Jonathan Tan,
	Emily Shaffer

Cc: Stolee, Heba, Jonathan T., Emily Shaffer. 

Junio C Hamano <gitster@pobox.com> writes:
> Jakub Narebski <jnareb@gmail.com> writes:
>
>> A few questions:
>> - is it too late to propose a new project idea for GSoC 2020?
>> - is it too difficult of a project for GSoC?
>> ...
>> ### Graph labelling for speeding up git commands
>>
>>  - Language: C
>>  - Difficulty: hard / difficult
>>  - Possible mentors: Jakub Narębski
>
> I am not running the GSoC or participating in it in any way other
> than just being a reviewer-maintainer of the project, but I would
> appreciate a well-thought-out write-up very much.

I have prepared slides for "Graph operations in Git version control
system" (PDF), mainly describing what was already done to improve their
performance, but they also include a few thoughts about the future (like
additional graph reachability labelings)... unfortunately the slides are
in Polish, not in English.

If there is interest, I could translate them, and put the result
somewhere accessible.

Or I could try to make this information into blog post -- this topic
would really gain from using images (like Derrick Stolee series of
articles on commit-graph).

>> Git uses various clever methods for making operations on very large
>> repositories faster, from bitmap indices for git-fetch[1], to generation
>> numbers (also known as topological levels) in the commit-graph file for
>> commit graph traversal operations like `git log --graph`[2].
>>
>> One possible improvement that can make Git even faster is using min-post
>> intervals labelling.  The basis of this labelling is post-visit order of
>> a depth-first search traversal tree of a commit graph, let's call it
>> 'post(v)'.
>>
>> If for each commit 'v' we would compute and store in the commit-graph
>> file two numbers: 'post(v)' and the minimum of 'post(u)' for all commits
>> reachable from 'v', let's call the latter 'min_graph(v)', then the
>> following condition is true:
>>
>>   if 'v' can reach 'u', then min_graph(v) <= post(u) <= post(v)
>>
>> If for each commit 'v' we would compute and store in the commit-graph
>> file two numbers: 'post(v)' and the minimum of 'post(u)' for commits
>> that were visited during the part of depth-first search that started
>> from 'v' (which is the minimum of post-order number for subtree of a
>> spanning tree that starts at 'v').  Let's call the later 'min_tree(v)'.
>> Then the following condition is true:
>>
>>   if min_tree(v) <= post(u) <= post(v), then 'v' can reach 'u'
>>
>> The task would be to implement computing such labelling (or a more
>> involved variant of it[3][4]), storing it in commit-graph file, and
>> using it for speeding up git commands (starting from a single chosen
>> command) such as:
>>
>>  - git merge-base --is-ancestor A B
>>  - git branch --contains A
>>  - git tag --contains A
>>  - git branch --merged A
>>  - git tag --merged A
>>  - git merge-base --all A B
>>  - git log --topo-sort
>>
>> References:
>>
>> 1. <http://githubengineering.com/counting-objects/>
>> 2. <https://devblogs.microsoft.com/devops/supercharging-the-git-commit-graph-iii-generations/>
>> 3. <https://arxiv.org/abs/1404.4465>
>> 4. <https://github.com/steps/Ferrari>
>>
>> See also discussion in:
>>
>> <https://public-inbox.org/git/86tvl0zhos.fsf@gmail.com/t/>

P.S. A bit more expanded writeup now available
at https://git.github.io/SoC-2020-Ideas/

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Possible idea for GSoC 2020
  2020-03-12 13:08         ` Jakub Narebski
@ 2020-03-13 10:59           ` Jakub Narebski
  0 siblings, 0 replies; 33+ messages in thread
From: Jakub Narebski @ 2020-03-13 10:59 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Derrick Stolee, Heba Waly, Junio C Hamano, Jonathan Tan,
	Emily Shaffer

Jakub Narebski <jnareb@gmail.com> writes:
> Jakub Narebski <jnareb@gmail.com> writes:
>> Christian Couder <christian.couder@gmail.com> writes:
>>> On Wed, Mar 11, 2020 at 10:30 PM Jakub Narebski <jnareb@gmail.com> wrote:
> [...]
>>>>> I think it's interesting and worth adding it anyway. Can you add it to
>>>>> SoC-2020-Ideas.md?
>>>>
>>>> All right, I'll do it.
>>>
>>> Great, thanks!
>>
>> Added with some modifications in commit 6b5edec6557
>> https://github.com/git/git.github.io/commit/6b5edec6557930888441c626f9272dc2be0c769e
>
> And it failed to build.  Was it caused by accidentally deleting two
> trailing empty lines at the end of the file?
>
> page_build_id=168669405

Fixed now (somehow).

https://git.github.io/SoC-2020-Ideas/#commit-graph-labeling-for-speeding-up-git-commands

-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Possible idea for GSoC 2020
  2020-03-10 14:50 [RFC] Possible idea for GSoC 2020 Jakub Narebski
  2020-03-11 19:03 ` Junio C Hamano
  2020-03-11 20:29 ` Christian Couder
@ 2020-03-13 13:08 ` Philip Oakley
  2020-03-13 14:34   ` Jakub Narebski
  2020-03-16 12:44 ` Derrick Stolee
  3 siblings, 1 reply; 33+ messages in thread
From: Philip Oakley @ 2020-03-13 13:08 UTC (permalink / raw)
  To: Jakub Narebski, git; +Cc: Christian Couder

Hi Jakub,

On 10/03/2020 14:50, Jakub Narebski wrote:
> Hello,
>
> Here below is a possible proposal for a more difficult Google Summer of
> Code 2020 project.
>
> A few questions:
> - is it too late to propose a new project idea for GSoC 2020?
> - is it too difficult of a project for GSoC?
>
> Best,
>
>   Jakub Narębski
>
> --------------------------------------------------
>
> ### Graph labelling for speeding up git commands
>
>  - Language: C
>  - Difficulty: hard / difficult
>  - Possible mentors: Jakub Narębski
>
> Git uses various clever methods for making operations on very large
> repositories faster, from bitmap indices for git-fetch[1], to generation
> numbers (also known as topological levels) in the commit-graph file for
> commit graph traversal operations like `git log --graph`[2].
>
> One possible improvement that can make Git even faster is using min-post
> intervals labelling.  The basis of this labelling is post-visit order of
> a depth-first search traversal tree of a commit graph, let's call it
> 'post(v)'.
This, the post(v) number, may need a bit more explanation for those not
familiar with graph theory terminology, so that they can get up to speed
easily about the method, and it's (hopeful) simplicity.

 It isn't clear to me if it is a count along a single string-of-pearls
between two branch - merge points, or depth from origin, or whether it
can span large chunks of the DAG? Ref 3. has the same problem of
starting from an assumed level of knowledge and understanding that may
put of non-academics (I'm thinking that the proposed method is this
PReaCH [3]).

It's my understanding that 'v' is usually 1...n in the literature, but
in reality it just means 'unique label' (and ultimately able to be
indexed in a data table). In Git we use the object id as the unique
label, so the 1..n is just an abstraction/convenience. The other problem
that can happen is if the terminologies of Git doesn't quite match those
of the descriptions, such as which end is 'root', or being 'mutually
reachable' in a directed acyclic graph.

The Wikipedia article on contraction hierarchies [6] did give some
useful clues for more lay readers.

> If for each commit 'v' we would compute and store in the commit-graph
> file two numbers: 'post(v)' and the minimum of 'post(u)' for all commits
> reachable from 'v', let's call the latter 'min_graph(v)', then the
> following condition is true:
>
>   if 'v' can reach 'u', then min_graph(v) <= post(u) <= post(v)
>
> If for each commit 'v' we would compute and store in the commit-graph
> file two numbers: 'post(v)' and the minimum of 'post(u)' for commits
> that were visited during the part of depth-first search that started
> from 'v' (which is the minimum of post-order number for subtree of a
> spanning tree that starts at 'v').  Let's call the later 'min_tree(v)'.
> Then the following condition is true:
>
>   if min_tree(v) <= post(u) <= post(v), then 'v' can reach 'u'
>
> The task would be to implement computing such labelling (or a more
> involved variant of it[3][4]), storing it in commit-graph file, and
> using it for speeding up git commands (starting from a single chosen
> command) such as:
>
>  - git merge-base --is-ancestor A B
>  - git branch --contains A
>  - git tag --contains A
>  - git branch --merged A
>  - git tag --merged A
>  - git merge-base --all A B
>  - git log --topo-sort
>
> References:
>
> 1. <http://githubengineering.com/counting-objects/>
> 2. <https://devblogs.microsoft.com/devops/supercharging-the-git-commit-graph-iii-generations/>
> 3. <https://arxiv.org/abs/1404.4465>
> 4. <https://github.com/steps/Ferrari>
>
> See also discussion in:
>
> <https://public-inbox.org/git/86tvl0zhos.fsf@gmail.com/t/>  [5]
Philip

[6] https://en.wikipedia.org/wiki/Contraction_hierarchies

[I've been off-list for 2+ ,months, so still catching up]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Possible idea for GSoC 2020
  2020-03-13 13:08 ` Philip Oakley
@ 2020-03-13 14:34   ` Jakub Narebski
  2020-03-15 18:57     ` Philip Oakley
  0 siblings, 1 reply; 33+ messages in thread
From: Jakub Narebski @ 2020-03-13 14:34 UTC (permalink / raw)
  To: Philip Oakley
  Cc: git, Christian Couder, Derrick Stolee, Heba Waly, Junio C Hamano,
	Jonathan Tan, Emily Shaffer

Hello Philip,

Philip Oakley <philipoakley@iee.email> writes:
> On 10/03/2020 14:50, Jakub Narebski wrote:
[...]
>> ### Graph labelling for speeding up git commands
>>
>>  - Language: C
>>  - Difficulty: hard / difficult
>>  - Possible mentors: Jakub Narębski
>>
>> Git uses various clever methods for making operations on very large
>> repositories faster, from bitmap indices for git-fetch[1], to generation
>> numbers (also known as topological levels) in the commit-graph file for
>> commit graph traversal operations like `git log --graph`[2].
>>
>> One possible improvement that can make Git even faster is using min-post
>> intervals labelling.  The basis of this labelling is post-visit order of
>> a depth-first search traversal tree of a commit graph, let's call it
>> 'post(v)'.
>>
> This, the post(v) number, may need a bit more explanation for those not
> familiar with graph theory terminology, so that they can get up to speed
> easily about the method, and it's (hopeful) simplicity.

All right, I see that it might be not clear for someone who is not
familiar with graph theory terminology.  The post(v) order is the order
you encounter commits, assuming that you mark commit 'v' as visited
after all its parents have been visited.

The positive-cut labeling works also for pre(v) order, where we number
commits from top, starting from heads, marking commit 'v' as visited
before any of its parents (you just need to switch from min-post to
pre-max interval).

>  It isn't clear to me if it is a count along a single string-of-pearls
> between two branch - merge points, or depth from origin, or whether it
> can span large chunks of the DAG? Ref 3. has the same problem of
> starting from an assumed level of knowledge and understanding that may
> put of non-academics (I'm thinking that the proposed method is this
> PReaCH [3]).

The basic method is something simpler, common to all those methods.
It is described as
- method from "3.3 Pruning Based on DFS Numbering" in PReaCH[3] paper
  (only one of full intervals from Figure 3 there), with modifications
- method from "Interval Indexing" paragraph in "I. Introduction"
  in FERRARI[4] paper, but using only a single interval (strict or
  approximate)
- Fig. 4 Min-Post Labeling, in "GRAIL: A Scalable Index for Reachability
  Queries in Very Large Graphs" (2012) paper
- "3.4.1 Positive-Cut Filter" subsubsection in "3.4 Optimizations"
  in FELINE paper i.e. "Reachability Queries in Very Large Graphs:
  A Fast Refined Online Search Approach" (2014)

> It's my understanding that 'v' is usually 1...n in the literature, but
> in reality it just means 'unique label' (and ultimately able to be
> indexed in a data table). In Git we use the object id as the unique
> label, so the 1..n is just an abstraction/convenience. The other problem
> that can happen is if the terminologies of Git doesn't quite match those
> of the descriptions, such as which end is 'root', or being 'mutually
> reachable' in a directed acyclic graph.

Yes, when reading various graph papers, I need to translate 'root',
'leaf', 'child' from graph-theory terminology to git terminology.

But 'v' being a node (a commit in a commit graph of revisions) is not
one of them.

>
> The Wikipedia article on contraction hierarchies [6] did give some
> useful clues for more lay readers.

Ooops.  Actually in my opinion Reachability Contraction Hierarchies
(RCH) reachability index from PReaCH[3] paper would not work for Git, as
it assumes bidirectional BFS search.

I have cited PReaCH[3] paper for its Pruning Based on DFS Numbering
ideas.

>> If for each commit 'v' we would compute and store in the commit-graph
>> file two numbers: 'post(v)' and the minimum of 'post(u)' for all commits
>> reachable from 'v', let's call the latter 'min_graph(v)', then the
>> following condition is true:
>>
>>   if 'v' can reach 'u', then min_graph(v) <= post(u) <= post(v)
>>
>> If for each commit 'v' we would compute and store in the commit-graph
>> file two numbers: 'post(v)' and the minimum of 'post(u)' for commits
>> that were visited during the part of depth-first search that started
>> from 'v' (which is the minimum of post-order number for subtree of a
>> spanning tree that starts at 'v').  Let's call the later 'min_tree(v)'.
>> Then the following condition is true:
>>
>>   if min_tree(v) <= post(u) <= post(v), then 'v' can reach 'u'
>>
>> The task would be to implement computing such labelling (or a more
>> involved variant of it[3][4]), storing it in commit-graph file, and
>> using it for speeding up git commands (starting from a single chosen
>> command) such as:
>>
>>  - git merge-base --is-ancestor A B
>>  - git branch --contains A
>>  - git tag --contains A
>>  - git branch --merged A
>>  - git tag --merged A
>>  - git merge-base --all A B
>>  - git log --topo-sort
>>
>> References:
>>
>> 1. <http://githubengineering.com/counting-objects/>
>> 2. <https://devblogs.microsoft.com/devops/supercharging-the-git-commit-graph-iii-generations/>
>> 3. <https://arxiv.org/abs/1404.4465>
>> 4. <https://github.com/steps/Ferrari>
>>
>> See also discussion in:
>>
>> <https://public-inbox.org/git/86tvl0zhos.fsf@gmail.com/t/>  [5]
>
> [6] https://en.wikipedia.org/wiki/Contraction_hierarchies
>
> [I've been off-list for 2+ ,months, so still catching up]

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Possible idea for GSoC 2020
@ 2020-03-13 17:30 Abhishek Kumar
  0 siblings, 0 replies; 33+ messages in thread
From: Abhishek Kumar @ 2020-03-13 17:30 UTC (permalink / raw)
  To: jnareb; +Cc: christian.couder, git

Jakub Narebski <jnareb@gmail.com> writes:

> I have prepared slides for "Graph operations in Git version control
> system" (PDF), mainly describing what was already done to improve their
> performance, but they also include a few thoughts about the future (like
> additional graph reachability labelings)... unfortunately the slides are
> in Polish, not in English.

> If there is interest, I could translate them, and put the result
> somewhere accessible.

I was going through resources and drafting up a proposal. The slides would be
a great help.

Could you please translate them, if it's not too much trouble?

> Or I could try to make this information into blog post -- this topic
> would really gain from using images (like Derrick Stolee series of
> articles on commit-graph).

Yes, thank you very much. Derrick's articles have been very useful so far.
I would be glad to help you out in any way that I can.

Regards
Abhishek

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Possible idea for GSoC 2020
  2020-03-13 10:56   ` Jakub Narebski
@ 2020-03-15 14:26     ` Jakub Narebski
  2020-03-17 12:24       ` Kaartic Sivaraam
  0 siblings, 1 reply; 33+ messages in thread
From: Jakub Narebski @ 2020-03-15 14:26 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Christian Couder, Derrick Stolee, Heba Waly, Jonathan Tan,
	Emily Shaffer, Abhishek Kumar

Cc:  Abhishek Kumar

Jakub Narebski <jnareb@gmail.com> writes:

> Cc: Stolee, Heba, Jonathan T., Emily Shaffer. 
>
> Junio C Hamano <gitster@pobox.com> writes:
>> Jakub Narebski <jnareb@gmail.com> writes:
[...]
>>> ### Graph labelling for speeding up git commands
>>>
>>>  - Language: C
>>>  - Difficulty: hard / difficult
>>>  - Possible mentors: Jakub Narębski
>>
>> I am not running the GSoC or participating in it in any way other
>> than just being a reviewer-maintainer of the project, but I would
>> appreciate a well-thought-out write-up very much.
>
> I have prepared slides for "Graph operations in Git version control
> system" (PDF), mainly describing what was already done to improve their
> performance, but they also include a few thoughts about the future (like
> additional graph reachability labelings)... unfortunately the slides are
> in Polish, not in English.
>
> If there is interest, I could translate them, and put the result
> somewhere accessible.

Here it is, traanslated into English, but otherwise almost exactly as I
have presented it on December 2019.  Those slides includes much of
introductory information, so one would be interested probably in few
last slides (the "Future work" section).

  https://drive.google.com/file/d/1psMBVfcRHcZeJ7AewGpdoymrEfFVdXoK/view?usp=sharing

I will be extending those slides with more information about interval
labeling, and then I will update the file, and I can post it also on
SlideShare (or other site, if one can recommend it).

> Or I could try to make this information into blog post -- this topic
> would really gain from using images (like Derrick Stolee series of
> articles on commit-graph).

This would take a bit, I'll try to do it when I would have some more
free time.

Regards,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Possible idea for GSoC 2020
  2020-03-13 14:34   ` Jakub Narebski
@ 2020-03-15 18:57     ` Philip Oakley
  2020-03-15 21:14       ` Jakub Narebski
  0 siblings, 1 reply; 33+ messages in thread
From: Philip Oakley @ 2020-03-15 18:57 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: git, Christian Couder, Derrick Stolee, Heba Waly, Junio C Hamano,
	Jonathan Tan, Emily Shaffer, Abhishek Kumar

Hi Jakub,
some comments on potential misreading by being over-pedantic...

added Abhishek for GSoC comments

On 13/03/2020 14:34, Jakub Narebski wrote:
> Hello Philip,
>
> Philip Oakley <philipoakley@iee.email> writes:
>> On 10/03/2020 14:50, Jakub Narebski wrote:
> [...]
>>> ### Graph labelling for speeding up git commands
>>>
>>>  - Language: C
>>>  - Difficulty: hard / difficult
>>>  - Possible mentors: Jakub Narębski
>>>
>>> Git uses various clever methods for making operations on very large
>>> repositories faster, from bitmap indices for git-fetch[1], to generation
>>> numbers (also known as topological levels) in the commit-graph file for
>>> commit graph traversal operations like `git log --graph`[2].
>>>
>>> One possible improvement that can make Git even faster is using min-post
>>> intervals labelling.  The basis of this labelling is post-visit order of
>>> a depth-first search traversal tree of a commit graph, let's call it
>>> 'post(v)'.
So post(v) is the (increasing) number for the visited commits, starting
with 1 at the initial branch tip (not root!), as the graph is traversed,
depth first, with backtracking to the most recent commit that still has
unvisited parents when either the root commit, or an already visited
commit is found. (phew, that took a lot of words;-)

e.g. (with 'a' as a branch tip)

a - b - c - d - e - f - g
         \m/     \w - x/         (m spans c-d;  w-x spans e-g)

potential orderings, based on alternate parental choices.
abcdefgwxm
abcdewxgfm
abcmdefgwx
abcmdewxgf

All 4 orderings are equally valid implementations.(?)
>> This, the post(v) number, may need a bit more explanation for those not
>> familiar with graph theory terminology, so that they can get up to speed
>> easily about the method, and it's (hopeful) simplicity.
> All right, I see that it might be not clear for someone who is not
> familiar with graph theory terminology.  The post(v) order is the order
> you encounter commits, assuming that you mark commit 'v' as visited
> after all its parents have been visited.

The 'all' can be misunderstood as meaning 'skip' a commit if it has
multiple parents, until the last visit. I don't think that's what is
meant. (or is it?)

>
> The positive-cut labeling works also for pre(v) order, where we number
> commits from top, starting from heads, marking commit 'v' as visited
> before any of its parents (you just need to switch from min-post to
> pre-max interval).
I'm not sure I understood that. Is this the same as my comment above.

>
>>  It isn't clear to me if it is a count along a single string-of-pearls
>> between two branch - merge points, or depth from origin, or whether it
>> can span large chunks of the DAG? Ref 3. has the same problem of
>> starting from an assumed level of knowledge and understanding that may
>> put of non-academics (I'm thinking that the proposed method is this
>> PReaCH [3]).
> The basic method is something simpler, common to all those methods.
> It is described as
> - method from "3.3 Pruning Based on DFS Numbering" in PReaCH[3] paper
>   (only one of full intervals from Figure 3 there), with modifications
> - method from "Interval Indexing" paragraph in "I. Introduction"
>   in FERRARI[4] paper, but using only a single interval (strict or
>   approximate)
> - Fig. 4 Min-Post Labeling, in "GRAIL: A Scalable Index for Reachability
>   Queries in Very Large Graphs" (2012) paper
> - "3.4.1 Positive-Cut Filter" subsubsection in "3.4 Optimizations"
>   in FELINE paper i.e. "Reachability Queries in Very Large Graphs:
>   A Fast Refined Online Search Approach" (2014)
>
>> It's my understanding that 'v' is usually 1...n in the literature, but
>> in reality it just means 'unique label' (and ultimately able to be
>> indexed in a data table). In Git we use the object id as the unique
>> label, so the 1..n is just an abstraction/convenience. The other problem
>> that can happen is if the terminologies of Git doesn't quite match those
>> of the descriptions, such as which end is 'root', or being 'mutually
>> reachable' in a directed acyclic graph.
> Yes, when reading various graph papers, I need to translate 'root',
> 'leaf', 'child' from graph-theory terminology to git terminology.
>
> But 'v' being a node (a commit in a commit graph of revisions) is not
> one of them.
I'd agree that 'node' is ok, it's just the available values for 'v' that
can be presumptuous.
(i.e. the academics already labelling v->1..n, that often just happens
to be the order they end up with:: `Given <type of graph> with nodes
1..n, then <some assertions based on n>` The PReaCH paper does that.)

Plus our DAGs are, in a sense, backward (only children know who their
parents are!), so all the counting levels feel wrong (we count edges
from the wrong end)
>
>> The Wikipedia article on contraction hierarchies [6] did give some
>> useful clues for more lay readers.
> Ooops.  Actually in my opinion Reachability Contraction Hierarchies
> (RCH) reachability index from PReaCH[3] paper would not work for Git, as
> it assumes bidirectional BFS search.
>
> I have cited PReaCH[3] paper for its Pruning Based on DFS Numbering
> ideas.
>
>>> If for each commit 'v' we would compute and store in the commit-graph
>>> file two numbers: 'post(v)' and the minimum of 'post(u)' for all commits
>>> reachable from 'v', let's call the latter 'min_graph(v)', then the
>>> following condition is true:
>>>
>>>   if 'v' can reach 'u', then min_graph(v) <= post(u) <= post(v)
>>>
>>> If for each commit 'v' we would compute and store in the commit-graph
>>> file two numbers: 'post(v)' and the minimum of 'post(u)' for commits
>>> that were visited during the part of depth-first search that started
>>> from 'v' (which is the minimum of post-order number for subtree of a
>>> spanning tree that starts at 'v').  Let's call the later 'min_tree(v)'.
>>> Then the following condition is true:
>>>
>>>   if min_tree(v) <= post(u) <= post(v), then 'v' can reach 'u'
>>>
>>> The task would be to implement computing such labelling (or a more
>>> involved variant of it[3][4]), storing it in commit-graph file, and
>>> using it for speeding up git commands (starting from a single chosen
>>> command) such as:
>>>
>>>  - git merge-base --is-ancestor A B
>>>  - git branch --contains A
>>>  - git tag --contains A
>>>  - git branch --merged A
>>>  - git tag --merged A
>>>  - git merge-base --all A B
>>>  - git log --topo-sort
>>>
>>> References:
>>>
>>> 1. <http://githubengineering.com/counting-objects/>
>>> 2. <https://devblogs.microsoft.com/devops/supercharging-the-git-commit-graph-iii-generations/>
>>> 3. <https://arxiv.org/abs/1404.4465>
>>> 4. <https://github.com/steps/Ferrari>
>>>
>>> See also discussion in:
>>>
>>> <https://public-inbox.org/git/86tvl0zhos.fsf@gmail.com/t/>  [5]
>> [6] https://en.wikipedia.org/wiki/Contraction_hierarchies
>>
>> [I've been off-list for 2+ ,months, so still catching up]
> Best,
Thanks for the feedback
Philip

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Possible idea for GSoC 2020
  2020-03-15 18:57     ` Philip Oakley
@ 2020-03-15 21:14       ` Jakub Narebski
  2020-03-16 14:47         ` Philip Oakley
  0 siblings, 1 reply; 33+ messages in thread
From: Jakub Narebski @ 2020-03-15 21:14 UTC (permalink / raw)
  To: Philip Oakley
  Cc: git, Christian Couder, Derrick Stolee, Heba Waly, Junio C Hamano,
	Jonathan Tan, Emily Shaffer, Abhishek Kumar

Hello, Philip

Thank you for your continued interest in this topic.

Philip Oakley <philipoakley@iee.email> writes:
> On 13/03/2020 14:34, Jakub Narebski wrote:
>> Philip Oakley <philipoakley@iee.email> writes:
>>> On 10/03/2020 14:50, Jakub Narebski wrote:
>> [...]
>>>> ### Graph labelling for speeding up git commands
>>>>
>>>>  - Language: C
>>>>  - Difficulty: hard / difficult
>>>>  - Possible mentors: Jakub Narębski
>>>>
>>>> Git uses various clever methods for making operations on very large
>>>> repositories faster, from bitmap indices for git-fetch[1], to generation
>>>> numbers (also known as topological levels) in the commit-graph file for
>>>> commit graph traversal operations like `git log --graph`[2].
>>>>
>>>> One possible improvement that can make Git even faster is using min-post
>>>> intervals labelling.  The basis of this labelling is post-visit order of
>>>> a depth-first search traversal tree of a commit graph, let's call it
>>>> 'post(v)'.
>
> So post(v) is the (increasing) number for the visited commits, starting
> with 1 at the initial branch tip (not root!), as the graph is traversed,
> depth first, with backtracking to the most recent commit that still has
> unvisited parents when either the root commit, or an already visited
> commit is found. (phew, that took a lot of words;-)

No, if we start at branch tip (source node), and number in the order of
visiting during DFS traversal, with node numbered _before_ its out
neighbours (parents in Git terms), that would be pre(v) numbering.

For post(v) you number nodes (commits) in the order of backtracking,
that is with 1 at some parentless root commit, not 1 at some branch
tip.  This can be done e.g. during computing of generation numbers.

An example of such post(v) labeling can be found on slides 57-58/64 and
59/64 in v1.1 version of my slides:
  https://drive.google.com/open?id=1psMBVfcRHcZeJ7AewGpdoymrEfFVdXoK

> e.g. (with 'a' as a branch tip)
>
> a - b - c - d - e - f - g
>          \m/     \w - x/         (m spans c-d;  w-x spans e-g)

We usually order commits left-to-right, or bottom-to-top, with most
recent commits and branch tips respectively on the far right, or on the
top.  This is the reverse notation.

> potential orderings, based on alternate parental choices.
>
> abcdefgwxm
> abcdewxgfm
> abcmdefgwx
> abcmdewxgf
>
> All 4 orderings are equally valid implementations.(?)

Yes, this is true.  There are various heuristics to select the best
ordering, but as most of those require additional work to be done
upfront (e.g. computing in-degrees, or topological sort, or generation
numbers i.e. topological levels), we can start with simple rule: walk
the graph in parents order.

>>> This, the post(v) number, may need a bit more explanation for those not
>>> familiar with graph theory terminology, so that they can get up to speed
>>> easily about the method, and it's (hopeful) simplicity.
>>>
>> All right, I see that it might be not clear for someone who is not
>> familiar with graph theory terminology.  The post(v) order is the order
>> you encounter commits, assuming that you mark commit 'v' as visited
>> after all its parents have been visited.
>
> The 'all' can be misunderstood as meaning 'skip' a commit if it has
> multiple parents, until the last visit. I don't think that's what is
> meant. (or is it?)

During the depth-first search (DFS) traversal of the commit graph we
walk all reachable commits.  What I meant here is that the node (commit)
gets it post(v) number only after all of its parents have been visited,
and we backtracked to this commit.

Again, I will refer to the frame 59/64 in v1.1 version of my slides:
  https://drive.google.com/open?id=1psMBVfcRHcZeJ7AewGpdoymrEfFVdXoK

>>
>> The positive-cut labeling works also for pre(v) order, where we number
>> commits from top, starting from heads, marking commit 'v' as visited
>> before any of its parents (you just need to switch from min-post to
>> pre-max interval).
>>
> I'm not sure I understood that. Is this the same as my comment above.

What I wanted to say here, is that if you label commits with pre(v)
number, that is number them in the order of visiting, commit before its
parents, then we can use pre-max interval instead.  The interval would
be [pre(v), max_{p \in parents(v)} pre(p)], and the positive-cut filter
would work the same: if one interval is contained in the other, then
there is path from one commit to the other.

>>>  It isn't clear to me if it is a count along a single string-of-pearls
>>> between two branch - merge points, or depth from origin, or whether it
>>> can span large chunks of the DAG? Ref 3. has the same problem of
>>> starting from an assumed level of knowledge and understanding that may
>>> put of non-academics (I'm thinking that the proposed method is this
>>> PReaCH [3]).
>>
>> The basic method is something simpler, common to all those methods.
>> It is described as
>> - method from "3.3 Pruning Based on DFS Numbering" in PReaCH[3] paper
>>   (only one of full intervals from Figure 3 there), with modifications
>> - method from "Interval Indexing" paragraph in "I. Introduction"
>>   in FERRARI[4] paper, but using only a single interval (strict or
>>   approximate)
>> - Fig. 4 Min-Post Labeling, in "GRAIL: A Scalable Index for Reachability
>>   Queries in Very Large Graphs" (2012) paper
>> - "3.4.1 Positive-Cut Filter" subsubsection in "3.4 Optimizations"
>>   in FELINE paper i.e. "Reachability Queries in Very Large Graphs:
>>   A Fast Refined Online Search Approach" (2014)
>>
>>> It's my understanding that 'v' is usually 1...n in the literature, but
>>> in reality it just means 'unique label' (and ultimately able to be
>>> indexed in a data table). In Git we use the object id as the unique
>>> label, so the 1..n is just an abstraction/convenience. The other problem
>>> that can happen is if the terminologies of Git doesn't quite match those
>>> of the descriptions, such as which end is 'root', or being 'mutually
>>> reachable' in a directed acyclic graph.
>>
>> Yes, when reading various graph papers, I need to translate 'root',
>> 'leaf', 'child' from graph-theory terminology to git terminology.
>>
>> But 'v' being a node (a commit in a commit graph of revisions) is not
>> one of them.
>
> I'd agree that 'node' is ok, it's just the available values for 'v' that
> can be presumptuous.
> (i.e. the academics already labelling v->1..n, that often just happens
> to be the order they end up with:: `Given <type of graph> with nodes
> 1..n, then <some assertions based on n>` The PReaCH paper does that.)

No, in academics paper usually define graph as a set of vertices V and
set of edges E.  Numbering vertices (nodes) is something extra.

> Plus our DAGs are, in a sense, backward (only children know who their
> parents are!), so all the counting levels feel wrong (we count edges
> from the wrong end)

No, our DAGs are not backward, only our terminology is different.
On the other hand we cannot assume that we can get reverse graph
easily (many graph reachability algorithms do assume that).

  graph theory        |  git documentation     |  meaning
  ------------------------------------------------------------------
  roots               |  branch tips, heads    |  no incoming edges
  leaves              |  root commits          |  no outgoing edges
  children            |  parents               |  [out] neighbours

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Possible idea for GSoC 2020
  2020-03-10 14:50 [RFC] Possible idea for GSoC 2020 Jakub Narebski
                   ` (2 preceding siblings ...)
  2020-03-13 13:08 ` Philip Oakley
@ 2020-03-16 12:44 ` Derrick Stolee
  2020-03-17  3:13   ` Jakub Narebski
  3 siblings, 1 reply; 33+ messages in thread
From: Derrick Stolee @ 2020-03-16 12:44 UTC (permalink / raw)
  To: Jakub Narebski, git; +Cc: Christian Couder

On 3/10/2020 10:50 AM, Jakub Narebski wrote:
> Hello,
> 
> Here below is a possible proposal for a more difficult Google Summer of
> Code 2020 project.
> 
> A few questions:
> - is it too late to propose a new project idea for GSoC 2020?
> - is it too difficult of a project for GSoC?
> 
> Best,
> 
>   Jakub Narębski
> 
> --------------------------------------------------
> 
> ### Graph labelling for speeding up git commands
> 
>  - Language: C
>  - Difficulty: hard / difficult
>  - Possible mentors: Jakub Narębski
> 
> Git uses various clever methods for making operations on very large
> repositories faster, from bitmap indices for git-fetch[1], to generation
> numbers (also known as topological levels) in the commit-graph file for
> commit graph traversal operations like `git log --graph`[2].
> 
> One possible improvement that can make Git even faster is using min-post
> intervals labelling.  The basis of this labelling is post-visit order of
> a depth-first search traversal tree of a commit graph, let's call it
> 'post(v)'.
> 
> If for each commit 'v' we would compute and store in the commit-graph
> file two numbers: 'post(v)' and the minimum of 'post(u)' for all commits
> reachable from 'v', let's call the latter 'min_graph(v)', then the
> following condition is true:
> 
>   if 'v' can reach 'u', then min_graph(v) <= post(u) <= post(v)

I haven't thought too hard about it, but I'm assuming that if v is not
in a commit-graph file, then post(v) would be "infinite" and min_graph(v)
would be zero.

We already have the second inequality (f(u) <= f(v)) where the function
'f' is the generation of v. The success of this approach over generation
numbers relies entirely on how often the inequality min_graph(v) <= post(u)
fails when gen(u) <= gen(v) holds.

> If for each commit 'v' we would compute and store in the commit-graph
> file two numbers: 'post(v)' and the minimum of 'post(u)' for commits
> that were visited during the part of depth-first search that started
> from 'v' (which is the minimum of post-order number for subtree of a
> spanning tree that starts at 'v').  Let's call the later 'min_tree(v)'.
> Then the following condition is true:
> 
>   if min_tree(v) <= post(u) <= post(v), then 'v' can reach 'u'

How many places in Git do we ask "can v reach u?" and how many would
return immediately without needing a walk in this new approach? My
guess is that we will have a very narrow window where this query
returns a positive result.

I believe we discussed this concept briefly when planning "generation
number v2" and the main concern I have with this plan is that the
values are not stable. The value of post(v) and min_tree(v) depend
on the entire graph as a whole, not just what is reachable from v
(and preferably only the parents of v).

Before starting to implement this, I would consider how such labels
could be computed across incremental commit-graph boundaries. That is,
if I'm only adding a layer of commits to the commit-graph without
modifying the existing layers of the commit-graph chain, can I still
compute values with these properties? How expensive is it? Do I need
to walk the entire reachable set of commits?
 
> The task would be to implement computing such labelling (or a more
> involved variant of it[3][4]), storing it in commit-graph file, and
> using it for speeding up git commands (starting from a single chosen
> command) such as:
> 
>  - git merge-base --is-ancestor A B
>  - git branch --contains A
>  - git tag --contains A
>  - git branch --merged A
>  - git tag --merged A
>  - git merge-base --all A B
>  - git log --topo-sort

Having such a complicated two-dimensional system would need to
justify itself by being measurably faster than that one-dimensional
system in these example commands.

The point of generation number v2 [1] was to allow moving to "exact"
algorithms for things like merge-base where we still use commit time
as a heuristic, and could be wrong because of special data shapes.
We don't use generation number in these examples because using only
generation number can lead to a large increase in number of commits
walked. The example we saw in the Linux kernel repository was a bug
fix created on top of a very old commit, so there was a commit of
low generation with very high commit-date that caused extra walking.
(See [2] for a detailed description of the data shape.)

My _prediction_ is that the two-dimensional system will be more
complicated to write and use, and will not have any measurable
difference. I'd be happy to be wrong, but I also would not send
anyone down this direction only to find out I'm right and that
effort was wasted.

My recommendation is that a GSoC student update the
generation number to "v2" based on the definition you made in [1].
That proposal is also more likely to be effective in Git because
it makes use of extra heuristic information (commit date) to
assist the types of algorithms we care about.

In that case, the "difficult" part is moving the "generation"
member of struct commit into a slab before making it a 64-bit
value. (This is likely necessary for your plan, anyway.) Updating
the generation number to v2 is relatively straight-forward after
that, as someone can follow all places that reference or compute
generation numbers and apply a diff

Thanks,
-Stolee

[1] https://lore.kernel.org/git/86o8ziatb2.fsf_-_@gmail.com/
    [RFC/PATCH] commit-graph: generation v5 (backward compatible date ceiling)

[2] https://lore.kernel.org/git/efa3720fb40638e5d61c6130b55e3348d8e4339e.1535633886.git.gitgitgadget@gmail.com/
    [PATCH 1/1] commit: don't use generation numbers if not needed

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Possible idea for GSoC 2020
  2020-03-15 21:14       ` Jakub Narebski
@ 2020-03-16 14:47         ` Philip Oakley
  0 siblings, 0 replies; 33+ messages in thread
From: Philip Oakley @ 2020-03-16 14:47 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: git, Christian Couder, Derrick Stolee, Heba Waly, Junio C Hamano,
	Jonathan Tan, Emily Shaffer, Abhishek Kumar

Hi Jakub,
Thanks for the explanations. Some comments in line.
On 15/03/2020 21:14, Jakub Narebski wrote:
> Hello, Philip
>
> Thank you for your continued interest in this topic.
>
> Philip Oakley <philipoakley@iee.email> writes:
>> On 13/03/2020 14:34, Jakub Narebski wrote:
>>> Philip Oakley <philipoakley@iee.email> writes:
>>>> On 10/03/2020 14:50, Jakub Narebski wrote:
>>> [...]
>>>>> ### Graph labelling for speeding up git commands
>>>>>
>>>>>  - Language: C
>>>>>  - Difficulty: hard / difficult
>>>>>  - Possible mentors: Jakub Narębski
>>>>>
>>>>> Git uses various clever methods for making operations on very large
>>>>> repositories faster, from bitmap indices for git-fetch[1], to generation
>>>>> numbers (also known as topological levels) in the commit-graph file for
>>>>> commit graph traversal operations like `git log --graph`[2].
>>>>>
>>>>> One possible improvement that can make Git even faster is using min-post
>>>>> intervals labelling.  The basis of this labelling is post-visit order of
>>>>> a depth-first search traversal tree of a commit graph, let's call it
>>>>> 'post(v)'.
>> So post(v) is the (increasing) number for the visited commits, starting
>> with 1 at the initial branch tip (not root!), as the graph is traversed,
>> depth first, with backtracking to the most recent commit that still has
>> unvisited parents when either the root commit, or an already visited
>> commit is found. (phew, that took a lot of words;-)
> No, if we start at branch tip (source node), and number in the order of
> visiting during DFS traversal, with node numbered _before_ its out
> neighbours (parents in Git terms), that would be pre(v) numbering.
So I've described pre(v) numbering.
>
> For post(v) you number nodes (commits) in the order of backtracking,
> that is with 1 at some parentless root commit, not 1 at some branch
> tip.  This can be done e.g. during computing of generation numbers.
I didn't easily understand the wording.
>
> An example of such post(v) labeling can be found on slides 57-58/64 and
> 59/64 in v1.1 version of my slides:
>   https://drive.google.com/open?id=1psMBVfcRHcZeJ7AewGpdoymrEfFVdXoK
Ah, I had an older version (picked up from the other thread) that only
had 60 slides (84 pdf pages). I had got all the way to the end.
I've have a better read of that.

>> e.g. (with 'a' as a branch tip)
>>
>> a - b - c - d - e - f - g
>>          \m/     \w - x/         (m spans c-d;  w-x spans e-g)
> We usually order commits left-to-right, or bottom-to-top, with most
> recent commits and branch tips respectively on the far right, or on the
> top.  This is the reverse notation.
True. It was for the convenience of following the Git DAG linkage. It's
easy to make mistakes with right to left reading, especially for those
who aren't yet sure what they are looking at.
>
>> potential orderings, based on alternate parental choices.
>>
>> abcdefgwxm
>> abcdewxgfm
>> abcmdefgwx
>> abcmdewxgf
>>
>> All 4 orderings are equally valid implementations.(?)
> Yes, this is true.  There are various heuristics to select the best
> ordering, but as most of those require additional work to be done
> upfront (e.g. computing in-degrees, or topological sort, or generation
> numbers i.e. topological levels), we can start with simple rule: walk
> the graph in parents order.
Thanks for confirming.
>>>> This, the post(v) number, may need a bit more explanation for those not
>>>> familiar with graph theory terminology, so that they can get up to speed
>>>> easily about the method, and it's (hopeful) simplicity.
>>>>
>>> All right, I see that it might be not clear for someone who is not
>>> familiar with graph theory terminology.  The post(v) order is the order
>>> you encounter commits, assuming that you mark commit 'v' as visited
>>> after all its parents have been visited.
>> The 'all' can be misunderstood as meaning 'skip' a commit if it has
>> multiple parents, until the last visit. I don't think that's what is
>> meant. (or is it?)
> During the depth-first search (DFS) traversal of the commit graph we
> walk all reachable commits.  What I meant here is that the node (commit)
> gets it post(v) number only after all of its parents have been visited,
> and we backtracked to this commit.
>
> Again, I will refer to the frame 59/64 in v1.1 version of my slides:
>   https://drive.google.com/open?id=1psMBVfcRHcZeJ7AewGpdoymrEfFVdXoK
>
>>> The positive-cut labeling works also for pre(v) order, where we number
>>> commits from top, starting from heads, marking commit 'v' as visited
>>> before any of its parents (you just need to switch from min-post to
>>> pre-max interval).
>>>
>> I'm not sure I understood that. Is this the same as my comment above.
> What I wanted to say here, is that if you label commits with pre(v)
> number, that is number them in the order of visiting, commit before its
> parents, then we can use pre-max interval instead.  The interval would
> be [pre(v), max_{p \in parents(v)} pre(p)], and the positive-cut filter
> would work the same: if one interval is contained in the other, then
> there is path from one commit to the other.
>
>>>>  It isn't clear to me if it is a count along a single string-of-pearls
>>>> between two branch - merge points, or depth from origin, or whether it
>>>> can span large chunks of the DAG? Ref 3. has the same problem of
>>>> starting from an assumed level of knowledge and understanding that may
>>>> put of non-academics (I'm thinking that the proposed method is this
>>>> PReaCH [3]).
>>> The basic method is something simpler, common to all those methods.
>>> It is described as
>>> - method from "3.3 Pruning Based on DFS Numbering" in PReaCH[3] paper
>>>   (only one of full intervals from Figure 3 there), with modifications
>>> - method from "Interval Indexing" paragraph in "I. Introduction"
>>>   in FERRARI[4] paper, but using only a single interval (strict or
>>>   approximate)
>>> - Fig. 4 Min-Post Labeling, in "GRAIL: A Scalable Index for Reachability
>>>   Queries in Very Large Graphs" (2012) paper
>>> - "3.4.1 Positive-Cut Filter" subsubsection in "3.4 Optimizations"
>>>   in FELINE paper i.e. "Reachability Queries in Very Large Graphs:
>>>   A Fast Refined Online Search Approach" (2014)
>>>
>>>> It's my understanding that 'v' is usually 1...n in the literature, but
>>>> in reality it just means 'unique label' (and ultimately able to be
>>>> indexed in a data table). In Git we use the object id as the unique
>>>> label, so the 1..n is just an abstraction/convenience. The other problem
>>>> that can happen is if the terminologies of Git doesn't quite match those
>>>> of the descriptions, such as which end is 'root', or being 'mutually
>>>> reachable' in a directed acyclic graph.
>>> Yes, when reading various graph papers, I need to translate 'root',
>>> 'leaf', 'child' from graph-theory terminology to git terminology.
>>>
>>> But 'v' being a node (a commit in a commit graph of revisions) is not
>>> one of them.
>> I'd agree that 'node' is ok, it's just the available values for 'v' that
>> can be presumptuous.
>> (i.e. the academics already labelling v->1..n, that often just happens
>> to be the order they end up with:: `Given <type of graph> with nodes
>> 1..n, then <some assertions based on n>` The PReaCH paper does that.)
> No, in academics paper usually define graph as a set of vertices V and
> set of edges E.  Numbering vertices (nodes) is something extra.
Usually it is mid paper where they do this about some property of an
order (or at least _appear_ to do). I think they omit the "given an
order" in f->1.n with (order) properties x,y,z, then we see p,q,r
properties.
>> Plus our DAGs are, in a sense, backward (only children know who their
>> parents are!), so all the counting levels feel wrong (we count edges
>> from the wrong end)
> No, our DAGs are not backward, only our terminology is different.
> On the other hand we cannot assume that we can get reverse graph
> easily (many graph reachability algorithms do assume that).
>
>   graph theory        |  git documentation     |  meaning
>   ------------------------------------------------------------------
>   roots               |  branch tips, heads    |  no incoming edges
>   leaves              |  root commits          |  no outgoing edges
>   children            |  parents               |  [out] neighbours
Yes, I was surprised at the root-tip-root swap. It's a bit of a
pushmi-pullyu type of animal, but backwards;-).

> Best,
Thanks. Back to reading the updated slide deck

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Possible idea for GSoC 2020
  2020-03-16 12:44 ` Derrick Stolee
@ 2020-03-17  3:13   ` Jakub Narebski
  2020-03-17  7:24     ` Christian Couder
  0 siblings, 1 reply; 33+ messages in thread
From: Jakub Narebski @ 2020-03-17  3:13 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: git, Christian Couder, Heba Waly, Junio C Hamano, Jonathan Tan,
	Emily Shaffer, Abhishek Kumar

Added other people who might be interested to Cc.

Derrick Stolee <stolee@gmail.com> writes:
> On 3/10/2020 10:50 AM, Jakub Narebski wrote:

>> Here below is a possible proposal for a more difficult Google Summer of
>> Code 2020 project.
>> 
>> A few questions:
>> - is it too late to propose a new project idea for GSoC 2020?
>> - is it too difficult of a project for GSoC?
>> 
>> Best,
>> 
>>   Jakub Narębski

https://git.github.io/SoC-2020-Ideas/#commit-graph-labeling-for-speeding-up-git-commands

>> --------------------------------------------------
>> 
>> ### Graph labelling for speeding up git commands
>> 
>>  - Language: C
>>  - Difficulty: hard / difficult
>>  - Possible mentors: Jakub Narębski
>> 
>> Git uses various clever methods for making operations on very large
>> repositories faster, from bitmap indices for git-fetch[1], to generation
>> numbers (also known as topological levels) in the commit-graph file for
>> commit graph traversal operations like `git log --graph`[2].
>> 
>> One possible improvement that can make Git even faster is using min-post
>> intervals labelling.  The basis of this labelling is post-visit order of
>> a depth-first search traversal tree of a commit graph, let's call it
>> 'post(v)'.
>> 
>> If for each commit 'v' we would compute and store in the commit-graph
>> file two numbers: 'post(v)' and the minimum of 'post(u)' for all commits
>> reachable from 'v', let's call the latter 'min_graph(v)', then the
>> following condition is true:
>> 
>>   if 'v' can reach 'u', then min_graph(v) <= post(u) <= post(v)
>
> I haven't thought too hard about it, but I'm assuming that if v is not
> in a commit-graph file, then post(v) would be "infinite" and min_graph(v)
> would be zero.

Or we can simply add a check if v is in a commit-graph file; this might
be not a good idea from the performance point of view.

> We already have the second inequality (f(u) <= f(v)) where the function
> 'f' is the generation of v. The success of this approach over generation
> numbers relies entirely on how often the inequality min_graph(v) <= post(u)
> fails when gen(u) <= gen(v) holds.

True.  It may turn out that additional negative-cut filters do not bring
enough performance improvements over topological levels or corrected
commit date (or monotonically increasing corrected commit date) to be
worth it.

I think they can help in wide commit graphs (many concurrently developed
branches with many commits and few merges), and when there is orphan
branch (like 'todo' in the git.git, or 'gh-pages' for storing
per-project GitHub Pages) that is somehow entangled in query.

>> If for each commit 'v' we would compute and store in the commit-graph
>> file two numbers: 'post(v)' and the minimum of 'post(u)' for commits
>> that were visited during the part of depth-first search that started
>> from 'v' (which is the minimum of post-order number for subtree of a
>> spanning tree that starts at 'v').  Let's call the later 'min_tree(v)'.
>> Then the following condition is true:
>> 
>>   if min_tree(v) <= post(u) <= post(v), then 'v' can reach 'u'
>
> How many places in Git do we ask "can v reach u?" and how many would
> return immediately without needing a walk in this new approach? My
> guess is that we will have a very narrow window where this query
> returns a positive result.

As I wrote below, such positive-cut filter would be directly helpful in
performing the following commands:

 - `git merge-base --is-ancestor`
 - `git branch --contains`
 - `git tag --contains`
 - `git branch --merged`
 - `git tag --merged`

It would be also useful for tag autofollow in git-fetch; is is N-to-M
equivalent to 1-to-N / N-to-1 `--contains` queries.

I am quite sure that positive-cut filter would make `--ancestry-path`
walk faster.

I think, but I am not sure, that positive-cut filter can make parts of
topological sort and merge base algorithms at least a tiny bit faster.

> I believe we discussed this concept briefly when planning "generation
> number v2" and the main concern I have with this plan is that the
> values are not stable. The value of post(v) and min_tree(v) depend
> on the entire graph as a whole, not just what is reachable from v
> (and preferably only the parents of v).
>
> Before starting to implement this, I would consider how such labels
> could be computed across incremental commit-graph boundaries. That is,
> if I'm only adding a layer of commits to the commit-graph without
> modifying the existing layers of the commit-graph chain, can I still
> compute values with these properties? How expensive is it? Do I need
> to walk the entire reachable set of commits?

I think it would be possible to compute post(v) and min_tree(v) using
incremental updates, and to make it compatibile with incremental
commit-graph format (with the commit-graph chain).  But I have not
proven it.

This incremental update would find _a_ post(v) number and min-post
interval labeling, but that might not be exactly the same as one would
get if it was computed in one shot, from scratch.  The result may be
suboptimal, but perhaps it would be still good enough.

I have added the explanation of this idea to v1.2 of my slides:

https://drive.google.com/open?id=1psMBVfcRHcZeJ7AewGpdoymrEfFVdXoK
https://drive.google.com/file/d/1psMBVfcRHcZeJ7AewGpdoymrEfFVdXoK/view?usp=sharing

https://www.slideshare.net/JakubNarbski/graph-operations-in-git-version-control-system-how-the-performance-was-improved-for-large-repositories-how-can-it-be-further-improved

There you can find on slides 65/58 and later (page 90 out of 99)
*illustrated* explanation of the idea.

Excerpt:
========

For each layer in the \texttt{commit-graph} chain store (in relevant chunks):
 - min--post interval labels
 - list of tips (heads) in the commit-graph; possibly also list of their intervals
 - for each layer in chain, except for base layer,
   store ajustments for 'post(v)' labels for previous layer
   (only needed for tips)
   
    - adjusting values of labels consist of adding a constant value to
      both sides of interval for a whole subtrees starting from
      old branch tips

HTH

>> The task would be to implement computing such labelling (or a more
>> involved variant of it[3][4]), storing it in commit-graph file, and
>> using it for speeding up git commands (starting from a single chosen
>> command) such as:
>> 
>>  - git merge-base --is-ancestor A B
>>  - git branch --contains A
>>  - git tag --contains A
>>  - git branch --merged A
>>  - git tag --merged A
>>  - git merge-base --all A B
>>  - git log --topo-sort
>
> Having such a complicated two-dimensional system would need to
> justify itself by being measurably faster than that one-dimensional
> system in these example commands.

That is true.  Git is certainly fast already thank to serialized commit
graph and generation numbers.  Such extra positive-cut / negative-cut
filter subsystem would be maintenance burdern (though I think it could
be nicely isolated with a good API).

On the other hand repositories re gettting larger and larger...

>
> The point of generation number v2 [1] was to allow moving to "exact"
> algorithms for things like merge-base where we still use commit time
> as a heuristic, and could be wrong because of special data shapes.
> We don't use generation number in these examples because using only
> generation number can lead to a large increase in number of commits
> walked. The example we saw in the Linux kernel repository was a bug
> fix created on top of a very old commit, so there was a commit of
> low generation with very high commit-date that caused extra walking.
> (See [2] for a detailed description of the data shape.)
>
> My _prediction_ is that the two-dimensional system will be more
> complicated to write and use, and will not have any measurable
> difference. I'd be happy to be wrong, but I also would not send
> anyone down this direction only to find out I'm right and that
> effort was wasted.

That might be a problem.

This is a bit of a "moonshot" / research project, moreso than others.
Though it would be still valuable, in my opionion, even if the code
wouldn't ultimately get merged and added into Git.

>
> My recommendation is that a GSoC student update the
> generation number to "v2" based on the definition you made in [1].
> That proposal is also more likely to be effective in Git because
> it makes use of extra heuristic information (commit date) to
> assist the types of algorithms we care about.
>
> In that case, the "difficult" part is moving the "generation"
> member of struct commit into a slab before making it a 64-bit
> value. (This is likely necessary for your plan, anyway.) Updating
> the generation number to v2 is relatively straight-forward after
> that, as someone can follow all places that reference or compute
> generation numbers and apply a diff

Good idea!  Though I am not sure if it is not too late to add it to the
https://git.github.io/SoC-2020-Ideas/ as the self imposed deadline of
March 16 (where students can start submitting proposals to GSoC) has
just passed.  Christian, what do you think?

Would you agree, Stolee, to be a _possible_ mentor or co-mentor for
"Generation number v2" project?

> [1] https://lore.kernel.org/git/86o8ziatb2.fsf_-_@gmail.com/
>     [RFC/PATCH] commit-graph: generation v5 (backward compatible date ceiling)
>
> [2]
> https://lore.kernel.org/git/efa3720fb40638e5d61c6130b55e3348d8e4339e.1535633886.git.gitgitgadget@gmail.com/
>     [PATCH 1/1] commit: don't use generation numbers if not needed

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Possible idea for GSoC 2020
  2020-03-17  3:13   ` Jakub Narebski
@ 2020-03-17  7:24     ` Christian Couder
  2020-03-17 11:49       ` Derrick Stolee
  2020-03-17 14:18       ` Jakub Narebski
  0 siblings, 2 replies; 33+ messages in thread
From: Christian Couder @ 2020-03-17  7:24 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Derrick Stolee, git, Heba Waly, Junio C Hamano, Jonathan Tan,
	Emily Shaffer, Abhishek Kumar

On Tue, Mar 17, 2020 at 4:13 AM Jakub Narebski <jnareb@gmail.com> wrote:

[...]

> >> ### Graph labelling for speeding up git commands

[...]

> > We already have the second inequality (f(u) <= f(v)) where the function
> > 'f' is the generation of v. The success of this approach over generation
> > numbers relies entirely on how often the inequality min_graph(v) <= post(u)
> > fails when gen(u) <= gen(v) holds.
>
> True.  It may turn out that additional negative-cut filters do not bring
> enough performance improvements over topological levels or corrected
> commit date (or monotonically increasing corrected commit date) to be
> worth it.
>
> I think they can help in wide commit graphs (many concurrently developed
> branches with many commits and few merges), and when there is orphan
> branch (like 'todo' in the git.git, or 'gh-pages' for storing
> per-project GitHub Pages) that is somehow entangled in query.
>
> >> If for each commit 'v' we would compute and store in the commit-graph
> >> file two numbers: 'post(v)' and the minimum of 'post(u)' for commits
> >> that were visited during the part of depth-first search that started
> >> from 'v' (which is the minimum of post-order number for subtree of a
> >> spanning tree that starts at 'v').  Let's call the later 'min_tree(v)'.
> >> Then the following condition is true:
> >>
> >>   if min_tree(v) <= post(u) <= post(v), then 'v' can reach 'u'
> >
> > How many places in Git do we ask "can v reach u?" and how many would
> > return immediately without needing a walk in this new approach? My
> > guess is that we will have a very narrow window where this query
> > returns a positive result.
>
> As I wrote below, such positive-cut filter would be directly helpful in
> performing the following commands:
>
>  - `git merge-base --is-ancestor`
>  - `git branch --contains`
>  - `git tag --contains`
>  - `git branch --merged`
>  - `git tag --merged`
>
> It would be also useful for tag autofollow in git-fetch; is is N-to-M
> equivalent to 1-to-N / N-to-1 `--contains` queries.
>
> I am quite sure that positive-cut filter would make `--ancestry-path`
> walk faster.
>
> I think, but I am not sure, that positive-cut filter can make parts of
> topological sort and merge base algorithms at least a tiny bit faster.

Is there an easy way to check that it would provide significant
performance improvements at least in some cases? Can we ask the
student to do that at the beginning of the GSoC?

> > I believe we discussed this concept briefly when planning "generation
> > number v2" and the main concern I have with this plan is that the
> > values are not stable. The value of post(v) and min_tree(v) depend
> > on the entire graph as a whole, not just what is reachable from v
> > (and preferably only the parents of v).
> >
> > Before starting to implement this, I would consider how such labels
> > could be computed across incremental commit-graph boundaries. That is,
> > if I'm only adding a layer of commits to the commit-graph without
> > modifying the existing layers of the commit-graph chain, can I still
> > compute values with these properties? How expensive is it? Do I need
> > to walk the entire reachable set of commits?
>
> I think it would be possible to compute post(v) and min_tree(v) using
> incremental updates, and to make it compatibile with incremental
> commit-graph format (with the commit-graph chain).  But I have not
> proven it.

Would it be difficult to prove? What would be required? And again can
we ask the student to do that at the beginning of the GSoC?

[...]

> > The point of generation number v2 [1] was to allow moving to "exact"
> > algorithms for things like merge-base where we still use commit time
> > as a heuristic, and could be wrong because of special data shapes.
> > We don't use generation number in these examples because using only
> > generation number can lead to a large increase in number of commits
> > walked. The example we saw in the Linux kernel repository was a bug
> > fix created on top of a very old commit, so there was a commit of
> > low generation with very high commit-date that caused extra walking.
> > (See [2] for a detailed description of the data shape.)
> >
> > My _prediction_ is that the two-dimensional system will be more
> > complicated to write and use, and will not have any measurable
> > difference. I'd be happy to be wrong, but I also would not send
> > anyone down this direction only to find out I'm right and that
> > effort was wasted.
>
> That might be a problem.
>
> This is a bit of a "moonshot" / research project, moreso than others.
> Though it would be still valuable, in my opionion, even if the code
> wouldn't ultimately get merged and added into Git.

I agree that it feels like a "moonshot" / research project.

> > My recommendation is that a GSoC student update the
> > generation number to "v2" based on the definition you made in [1].
> > That proposal is also more likely to be effective in Git because
> > it makes use of extra heuristic information (commit date) to
> > assist the types of algorithms we care about.
> >
> > In that case, the "difficult" part is moving the "generation"
> > member of struct commit into a slab before making it a 64-bit
> > value. (This is likely necessary for your plan, anyway.) Updating
> > the generation number to v2 is relatively straight-forward after
> > that, as someone can follow all places that reference or compute
> > generation numbers and apply a diff
>
> Good idea!  Though I am not sure if it is not too late to add it to the
> https://git.github.io/SoC-2020-Ideas/ as the self imposed deadline of
> March 16 (where students can start submitting proposals to GSoC) has
> just passed.  Christian, what do you think?

Would that be a different project idea or part of your "Graph labeling
for speeding up git commands" project idea?

I am very reluctant to add new project ideas at that time. I don't
think student will have time to properly research it and get it
reviewed.

It could be part of your research project though, to check if that
approach is better or good enough compared to what you suggest in the
current version of your project.

> Would you agree, Stolee, to be a _possible_ mentor or co-mentor for
> "Generation number v2" project?

At this point I think it might be best if you are both willing to
co-mentor a "moonshot" / research project to find what is the best way
forward by bench-marking the different approaches that you both
suggest for different commands/use cases.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Possible idea for GSoC 2020
  2020-03-17  7:24     ` Christian Couder
@ 2020-03-17 11:49       ` Derrick Stolee
  2020-03-17 14:18       ` Jakub Narebski
  1 sibling, 0 replies; 33+ messages in thread
From: Derrick Stolee @ 2020-03-17 11:49 UTC (permalink / raw)
  To: Christian Couder, Jakub Narebski
  Cc: git, Heba Waly, Junio C Hamano, Jonathan Tan, Emily Shaffer,
	Abhishek Kumar

On 3/17/2020 3:24 AM, Christian Couder wrote:
> On Tue, Mar 17, 2020 at 4:13 AM Jakub Narebski <jnareb@gmail.com> wrote:
> It could be part of your research project though, to check if that
> approach is better or good enough compared to what you suggest in the
> current version of your project.
> 
>> Would you agree, Stolee, to be a _possible_ mentor or co-mentor for
>> "Generation number v2" project?
> 
> At this point I think it might be best if you are both willing to
> co-mentor a "moonshot" / research project to find what is the best way
> forward by bench-marking the different approaches that you both
> suggest for different commands/use cases.
 
If a student wants to take this on with the full expectation that they
will mostly be doing research and gathering data which _might_ result
in an acceptable patch, then I could co-mentor.

With that expectation in mind, this project becomes closer to an REU
(Research Experience for Undergraduates) than a software engineering
internship. Are we sure that fits into GSoC?

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Possible idea for GSoC 2020
  2020-03-15 14:26     ` Jakub Narebski
@ 2020-03-17 12:24       ` Kaartic Sivaraam
  2020-03-17 12:39         ` Kaartic Sivaraam
  2020-03-17 14:22         ` Jakub Narebski
  0 siblings, 2 replies; 33+ messages in thread
From: Kaartic Sivaraam @ 2020-03-17 12:24 UTC (permalink / raw)
  To: Jakub Narebski, Junio C Hamano
  Cc: git, Christian Couder, Derrick Stolee, Heba Waly, Jonathan Tan,
	Emily Shaffer, Abhishek Kumar

Hi Jakub,

On 15-03-2020 19:56, Jakub Narebski wrote:
> Jakub Narebski <jnareb@gmail.com> writes:
>>
>> I have prepared slides for "Graph operations in Git version control
>> system" (PDF), mainly describing what was already done to improve their
>> performance, but they also include a few thoughts about the future (like
>> additional graph reachability labelings)... unfortunately the slides are
>> in Polish, not in English.
>>
>> If there is interest, I could translate them, and put the result
>> somewhere accessible.
> 
> Here it is, traanslated into English, but otherwise almost exactly as I
> have presented it on December 2019.  Those slides includes much of
> introductory information, so one would be interested probably in few
> last slides (the "Future work" section).
> 
>    https://drive.google.com/file/d/1psMBVfcRHcZeJ7AewGpdoymrEfFVdXoK/view?usp=sharing
> 
> I will be extending those slides with more information about interval
> labeling, and then I will update the file, and I can post it also on
> SlideShare (or other site, if one can recommend it).
>

Just wanted to note that Speaker deck[1] is a nice alternative to 
SlideShare. I've not uploaded slides to it but have seen people sharing 
slides using Speaker deck. I find the user interface to be more neat and 
minimalistic than SlideShare. FWIW, their mobile interface a lot better 
than the SlideShare's interface.

[1]: https://speakerdeck.com/

-- 
Sivaraam

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Possible idea for GSoC 2020
  2020-03-17 12:24       ` Kaartic Sivaraam
@ 2020-03-17 12:39         ` Kaartic Sivaraam
  2020-03-17 14:22         ` Jakub Narebski
  1 sibling, 0 replies; 33+ messages in thread
From: Kaartic Sivaraam @ 2020-03-17 12:39 UTC (permalink / raw)
  To: Jakub Narebski, Junio C Hamano
  Cc: git, Christian Couder, Derrick Stolee, Heba Waly, Jonathan Tan,
	Emily Shaffer, Abhishek Kumar

On 17-03-2020 17:54, Kaartic Sivaraam wrote:
>> I will be extending those slides with more information about interval
>> labeling, and then I will update the file, and I can post it also on
>> SlideShare (or other site, if one can recommend it).
>>
> 
> Just wanted to note that Speaker deck[1] is a nice alternative to 
> SlideShare. I've not uploaded slides to it but have seen people sharing 
> slides using Speaker deck. I find the user interface to be more neat and 
> minimalistic than SlideShare. FWIW, their mobile interface a lot better 
> than the SlideShare's interface.
> 

Oh, and forgot to mention that you can also share links to a specific 
slide like this:

https://speakerdeck.com/lemire/parsing-json-really-quickly-lessons-learned?slide=13

I'm really surprised to realize that SlideShare doesn't have an option 
to do this.

-- 
Sivaraam

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Possible idea for GSoC 2020
  2020-03-17  7:24     ` Christian Couder
  2020-03-17 11:49       ` Derrick Stolee
@ 2020-03-17 14:18       ` Jakub Narebski
  2020-03-17 17:04         ` Christian Couder
  1 sibling, 1 reply; 33+ messages in thread
From: Jakub Narebski @ 2020-03-17 14:18 UTC (permalink / raw)
  To: Christian Couder
  Cc: Derrick Stolee, git, Heba Waly, Junio C Hamano, Jonathan Tan,
	Emily Shaffer, Abhishek Kumar

Christian Couder <christian.couder@gmail.com> writes:
> On Tue, Mar 17, 2020 at 4:13 AM Jakub Narebski <jnareb@gmail.com> wrote:
>
> [...]
>
>>>> ### Graph labelling for speeding up git commands
>
> [...]
>
>>> We already have the second inequality (f(u) <= f(v)) where the function
>>> 'f' is the generation of v. The success of this approach over generation
>>> numbers relies entirely on how often the inequality min_graph(v) <= post(u)
>>> fails when gen(u) <= gen(v) holds.
>>
>> True.  It may turn out that additional negative-cut filters do not bring
>> enough performance improvements over topological levels or corrected
>> commit date (or monotonically increasing corrected commit date) to be
>> worth it.
>>
>> I think they can help in wide commit graphs (many concurrently developed
>> branches with many commits and few merges), and when there is orphan
>> branch (like 'todo' in the git.git, or 'gh-pages' for storing
>> per-project GitHub Pages) that is somehow entangled in query.
>>
>>>> If for each commit 'v' we would compute and store in the commit-graph
>>>> file two numbers: 'post(v)' and the minimum of 'post(u)' for commits
>>>> that were visited during the part of depth-first search that started
>>>> from 'v' (which is the minimum of post-order number for subtree of a
>>>> spanning tree that starts at 'v').  Let's call the later 'min_tree(v)'.
>>>> Then the following condition is true:
>>>>
>>>>   if min_tree(v) <= post(u) <= post(v), then 'v' can reach 'u'
>>>
>>> How many places in Git do we ask "can v reach u?" and how many would
>>> return immediately without needing a walk in this new approach? My
>>> guess is that we will have a very narrow window where this query
>>> returns a positive result.
>>
>> As I wrote below, such positive-cut filter would be directly helpful in
>> performing the following commands:
>>
>>  - `git merge-base --is-ancestor`
>>  - `git branch --contains`
>>  - `git tag --contains`
>>  - `git branch --merged`
>>  - `git tag --merged`
>>
>> It would be also useful for tag autofollow in git-fetch; is is N-to-M
>> equivalent to 1-to-N / N-to-1 `--contains` queries.
>>
>> I am quite sure that positive-cut filter would make `--ancestry-path`
>> walk faster.
>>
>> I think, but I am not sure, that positive-cut filter can make parts of
>> topological sort and merge base algorithms at least a tiny bit faster.
>
> Is there an easy way to check that it would provide significant
> performance improvements at least in some cases? Can we ask the
> student to do that at the beginning of the GSoC?

The "Reachability labels for version control graphs.ipynb" Jupyter
Notebook on Google Colaboratory was created to answer this question
(originally for the FELINE reachability index).  Among others it can
min-post interval labels and topological levels (generation numbers),
use them for reachability queries, and load Linux kernel
commit-graph. The exploration didn't get finished, but it would be not
that difficult, I think, to at least find the amount of false negatives
for min-post interval labeling for git.git or Linux kernel repo.

  https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg

As Jupyter Notebook, it is run in the web browser.  It can either use
local runtime, or run on Google Cloud runtime.  On the other hand it
requires at least some knowledge of Python...

>>> I believe we discussed this concept briefly when planning "generation
>>> number v2" and the main concern I have with this plan is that the
>>> values are not stable. The value of post(v) and min_tree(v) depend
>>> on the entire graph as a whole, not just what is reachable from v
>>> (and preferably only the parents of v).
>>>
>>> Before starting to implement this, I would consider how such labels
>>> could be computed across incremental commit-graph boundaries. That is,
>>> if I'm only adding a layer of commits to the commit-graph without
>>> modifying the existing layers of the commit-graph chain, can I still
>>> compute values with these properties? How expensive is it? Do I need
>>> to walk the entire reachable set of commits?
>>
>> I think it would be possible to compute post(v) and min_tree(v) using
>> incremental updates, and to make it compatibile with incremental
>> commit-graph format (with the commit-graph chain).  But I have not
>> proven it.
>
> Would it be difficult to prove? What would be required? And again can
> we ask the student to do that at the beginning of the GSoC?

Formal mathematical proof by induction would be not that difficult.  The
problem is, I think, in finding all possible classes of initial spanning
forest arrangements, and all possible classes of commit-graph growth by
one commit -- and examining how this affect the spanning forest.

> [...]
>
>>> The point of generation number v2 [1] was to allow moving to "exact"
>>> algorithms for things like merge-base where we still use commit time
>>> as a heuristic, and could be wrong because of special data shapes.
>>> We don't use generation number in these examples because using only
>>> generation number can lead to a large increase in number of commits
>>> walked. The example we saw in the Linux kernel repository was a bug
>>> fix created on top of a very old commit, so there was a commit of
>>> low generation with very high commit-date that caused extra walking.
>>> (See [2] for a detailed description of the data shape.)
>>>
>>> My _prediction_ is that the two-dimensional system will be more
>>> complicated to write and use, and will not have any measurable
>>> difference. I'd be happy to be wrong, but I also would not send
>>> anyone down this direction only to find out I'm right and that
>>> effort was wasted.
>>
>> That might be a problem.
>>
>> This is a bit of a "moonshot" / research project, moreso than others.
>> Though it would be still valuable, in my opionion, even if the code
>> wouldn't ultimately get merged and added into Git.
>
> I agree that it feels like a "moonshot" / research project.
>
>>> My recommendation is that a GSoC student update the
>>> generation number to "v2" based on the definition you made in [1].
>>> That proposal is also more likely to be effective in Git because
>>> it makes use of extra heuristic information (commit date) to
>>> assist the types of algorithms we care about.
>>>
>>> In that case, the "difficult" part is moving the "generation"
>>> member of struct commit into a slab before making it a 64-bit
>>> value. (This is likely necessary for your plan, anyway.) Updating
>>> the generation number to v2 is relatively straight-forward after
>>> that, as someone can follow all places that reference or compute
>>> generation numbers and apply a diff
>>
>> Good idea!  Though I am not sure if it is not too late to add it to the
>> https://git.github.io/SoC-2020-Ideas/ as the self imposed deadline of
>> March 16 (where students can start submitting proposals to GSoC) has
>> just passed.  Christian, what do you think?
>
> Would that be a different project idea or part of your "Graph labeling
> for speeding up git commands" project idea?
>
> I am very reluctant to add new project ideas at that time. I don't
> think student will have time to properly research it and get it
> reviewed.
>
> It could be part of your research project though, to check if that
> approach is better or good enough compared to what you suggest in the
> current version of your project.
>
>> Would you agree, Stolee, to be a _possible_ mentor or co-mentor for
>> "Generation number v2" project?
>
> At this point I think it might be best if you are both willing to
> co-mentor a "moonshot" / research project to find what is the best way
> forward by bench-marking the different approaches that you both
> suggest for different commands/use cases.

So perhaps we should expand "Commit graph labeling for speeding up git
commands" idea, splitting it into two possible ways of working in this
project: the more technical 'Generation number v2', and 'Interval labels
for commit graph' which is more of research project?  Which should be
put first, then?

Note that for example "Convert scripts to builtins" idea is in similar
situation: it is also many projects in one.

Regards,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Possible idea for GSoC 2020
  2020-03-17 12:24       ` Kaartic Sivaraam
  2020-03-17 12:39         ` Kaartic Sivaraam
@ 2020-03-17 14:22         ` Jakub Narebski
  1 sibling, 0 replies; 33+ messages in thread
From: Jakub Narebski @ 2020-03-17 14:22 UTC (permalink / raw)
  To: Kaartic Sivaraam
  Cc: Junio C Hamano, git, Christian Couder, Derrick Stolee, Heba Waly,
	Jonathan Tan, Emily Shaffer, Abhishek Kumar

Hello,

Kaartic Sivaraam <kaartic.sivaraam@gmail.com> writes:
> On 15-03-2020 19:56, Jakub Narebski wrote:
>> Jakub Narebski <jnareb@gmail.com> writes:
>>>
>>> I have prepared slides for "Graph operations in Git version control
>>> system" (PDF), mainly describing what was already done to improve their
>>> performance, but they also include a few thoughts about the future (like
>>> additional graph reachability labelings)... unfortunately the slides are
>>> in Polish, not in English.
>>>
>>> If there is interest, I could translate them, and put the result
>>> somewhere accessible.
>>
>> Here it is, traanslated into English, but otherwise almost exactly as I
>> have presented it on December 2019.  Those slides includes much of
>> introductory information, so one would be interested probably in few
>> last slides (the "Future work" section).
>>
>>    https://drive.google.com/file/d/1psMBVfcRHcZeJ7AewGpdoymrEfFVdXoK/view?usp=sharing
>>
>> I will be extending those slides with more information about interval
>> labeling, and then I will update the file, and I can post it also on
>> SlideShare (or other site, if one can recommend it).
>>
>
> Just wanted to note that Speaker deck[1] is a nice alternative to
> SlideShare. I've not uploaded slides to it but have seen people
> sharing slides using Speaker deck. I find the user interface to be
> more neat and minimalistic than SlideShare. FWIW, their mobile
> interface a lot better than the SlideShare's interface.
>
> [1]: https://speakerdeck.com/

Thanks for the suggestion.  The slides (which are at version 1.2) are
now also available on SlideShare[2] and on SpeakerDeck[3], in addition
to being on Google Drive:

[2]: https://www.slideshare.net/JakubNarbski/graph-operations-in-git-version-control-system-how-the-performance-was-improved-for-large-repositories-how-can-it-be-further-improved
[3]: https://speakerdeck.com/jnareb/graph-operations-in-git-slides-2019

The only disadvantage with SpeakerDeck is that bitmap images (PNG, JPEG)
are a bit blocky/pixelated in HTML-ized version of slides.

P.S. I am not sure if I put my slids in correct category...
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Possible idea for GSoC 2020
@ 2020-03-17 17:00 Abhishek Kumar
  2020-03-17 18:05 ` Jakub Narebski
  0 siblings, 1 reply; 33+ messages in thread
From: Abhishek Kumar @ 2020-03-17 17:00 UTC (permalink / raw)
  To: stolee; +Cc: christian.couder, git, jnareb

> Having such a complicated two-dimensional system would need to
> justify itself by being measurably faster than that one-dimensional
> system in these example commands.
>
> [...]
>
> My _prediction_ is that the two-dimensional system will be more
> complicated to write and use, and will not have any measurable
> difference. I'd be happy to be wrong, but I also would not send
> anyone down this direction only to find out I'm right and that
> effort was wasted.

Agreed. I have been through the papers of the involved variants and on graphs
comparable to some of the largest git repositories, the performance improves by
fifty nanoseconds for a random query.

Additionally:
1. They require significantly more space per commit.
2. They require significantly more preprocessing time.

> My recommendation is that a GSoC student update the
> generation number to "v2" based on the definition you made in [1].
> That proposal is also more likely to be effective in Git because
> it makes use of extra heuristic information (commit date) to
> assist the types of algorithms we care about.

> In that case, the "difficult" part is moving the "generation"
> member of struct commit into a slab before making it a 64-bit
> value. (This is likely necessary for your plan, anyway.) Updating
> the generation number to v2 is relatively straight-forward after
> that, as someone can follow all places that reference or compute
> generation numbers and apply a diff

Thanks for the recommendation. Reading about how this fits in more
with REU on the other thread, I too agree that updating generation
number to use corrected commit date would be more appropriate for a GSoC
project.

Regards
Abhishek

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Possible idea for GSoC 2020
  2020-03-17 14:18       ` Jakub Narebski
@ 2020-03-17 17:04         ` Christian Couder
  2020-03-18 13:55           ` Jakub Narebski
  0 siblings, 1 reply; 33+ messages in thread
From: Christian Couder @ 2020-03-17 17:04 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Derrick Stolee, git, Heba Waly, Junio C Hamano, Jonathan Tan,
	Emily Shaffer, Abhishek Kumar

On Tue, Mar 17, 2020 at 3:18 PM Jakub Narebski <jnareb@gmail.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
> > On Tue, Mar 17, 2020 at 4:13 AM Jakub Narebski <jnareb@gmail.com> wrote:

> >>>> If for each commit 'v' we would compute and store in the commit-graph
> >>>> file two numbers: 'post(v)' and the minimum of 'post(u)' for commits
> >>>> that were visited during the part of depth-first search that started
> >>>> from 'v' (which is the minimum of post-order number for subtree of a
> >>>> spanning tree that starts at 'v').  Let's call the later 'min_tree(v)'.
> >>>> Then the following condition is true:
> >>>>
> >>>>   if min_tree(v) <= post(u) <= post(v), then 'v' can reach 'u'
> >>>
> >>> How many places in Git do we ask "can v reach u?" and how many would
> >>> return immediately without needing a walk in this new approach? My
> >>> guess is that we will have a very narrow window where this query
> >>> returns a positive result.
> >>
> >> As I wrote below, such positive-cut filter would be directly helpful in
> >> performing the following commands:
> >>
> >>  - `git merge-base --is-ancestor`
> >>  - `git branch --contains`
> >>  - `git tag --contains`
> >>  - `git branch --merged`
> >>  - `git tag --merged`
> >>
> >> It would be also useful for tag autofollow in git-fetch; is is N-to-M
> >> equivalent to 1-to-N / N-to-1 `--contains` queries.
> >>
> >> I am quite sure that positive-cut filter would make `--ancestry-path`
> >> walk faster.
> >>
> >> I think, but I am not sure, that positive-cut filter can make parts of
> >> topological sort and merge base algorithms at least a tiny bit faster.
> >
> > Is there an easy way to check that it would provide significant
> > performance improvements at least in some cases? Can we ask the
> > student to do that at the beginning of the GSoC?
>
> The "Reachability labels for version control graphs.ipynb" Jupyter
> Notebook on Google Colaboratory was created to answer this question
> (originally for the FELINE reachability index).  Among others it can
> min-post interval labels and topological levels (generation numbers),
> use them for reachability queries, and load Linux kernel
> commit-graph. The exploration didn't get finished, but it would be not
> that difficult, I think, to at least find the amount of false negatives
> for min-post interval labeling for git.git or Linux kernel repo.
>
>   https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg
>
> As Jupyter Notebook, it is run in the web browser.  It can either use
> local runtime, or run on Google Cloud runtime.  On the other hand it
> requires at least some knowledge of Python...

Ok, if that's a possible approach to the project, then please add it
to the description.

> >> I think it would be possible to compute post(v) and min_tree(v) using
> >> incremental updates, and to make it compatibile with incremental
> >> commit-graph format (with the commit-graph chain).  But I have not
> >> proven it.
> >
> > Would it be difficult to prove? What would be required? And again can
> > we ask the student to do that at the beginning of the GSoC?
>
> Formal mathematical proof by induction would be not that difficult.  The
> problem is, I think, in finding all possible classes of initial spanning
> forest arrangements, and all possible classes of commit-graph growth by
> one commit -- and examining how this affect the spanning forest.

Ok.

> >>> The point of generation number v2 [1] was to allow moving to "exact"
> >>> algorithms for things like merge-base where we still use commit time
> >>> as a heuristic, and could be wrong because of special data shapes.
> >>> We don't use generation number in these examples because using only
> >>> generation number can lead to a large increase in number of commits
> >>> walked. The example we saw in the Linux kernel repository was a bug
> >>> fix created on top of a very old commit, so there was a commit of
> >>> low generation with very high commit-date that caused extra walking.
> >>> (See [2] for a detailed description of the data shape.)
> >>>
> >>> My _prediction_ is that the two-dimensional system will be more
> >>> complicated to write and use, and will not have any measurable
> >>> difference. I'd be happy to be wrong, but I also would not send
> >>> anyone down this direction only to find out I'm right and that
> >>> effort was wasted.
> >>
> >> That might be a problem.
> >>
> >> This is a bit of a "moonshot" / research project, moreso than others.
> >> Though it would be still valuable, in my opionion, even if the code
> >> wouldn't ultimately get merged and added into Git.
> >
> > I agree that it feels like a "moonshot" / research project.
> >
> >>> My recommendation is that a GSoC student update the
> >>> generation number to "v2" based on the definition you made in [1].
> >>> That proposal is also more likely to be effective in Git because
> >>> it makes use of extra heuristic information (commit date) to
> >>> assist the types of algorithms we care about.
> >>>
> >>> In that case, the "difficult" part is moving the "generation"
> >>> member of struct commit into a slab before making it a 64-bit
> >>> value. (This is likely necessary for your plan, anyway.) Updating
> >>> the generation number to v2 is relatively straight-forward after
> >>> that, as someone can follow all places that reference or compute
> >>> generation numbers and apply a diff
> >>
> >> Good idea!  Though I am not sure if it is not too late to add it to the
> >> https://git.github.io/SoC-2020-Ideas/ as the self imposed deadline of
> >> March 16 (where students can start submitting proposals to GSoC) has
> >> just passed.  Christian, what do you think?
> >
> > Would that be a different project idea or part of your "Graph labeling
> > for speeding up git commands" project idea?
> >
> > I am very reluctant to add new project ideas at that time. I don't
> > think student will have time to properly research it and get it
> > reviewed.
> >
> > It could be part of your research project though, to check if that
> > approach is better or good enough compared to what you suggest in the
> > current version of your project.
> >
> >> Would you agree, Stolee, to be a _possible_ mentor or co-mentor for
> >> "Generation number v2" project?
> >
> > At this point I think it might be best if you are both willing to
> > co-mentor a "moonshot" / research project to find what is the best way
> > forward by bench-marking the different approaches that you both
> > suggest for different commands/use cases.
>
> So perhaps we should expand "Commit graph labeling for speeding up git
> commands" idea, splitting it into two possible ways of working in this
> project: the more technical 'Generation number v2', and 'Interval labels
> for commit graph' which is more of research project?

Ok for that.

>  Which should be
> put first, then?

Which ever you prefer. If you say that any part could be done
separately, that doesn't matter much.

Best,
Christian.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Possible idea for GSoC 2020
@ 2020-03-17 18:00 Abhishek Kumar
  2020-03-19 12:50 ` Jakub Narebski
  0 siblings, 1 reply; 33+ messages in thread
From: Abhishek Kumar @ 2020-03-17 18:00 UTC (permalink / raw)
  To: jnareb; +Cc: christian.couder, git, stolee

Greetings Jakub

> So perhaps we should expand "Commit graph labeling for speeding up git
> commands" idea, splitting it into two possible ways of working in this
> project: the more technical 'Generation number v2', and 'Interval labels
> for commit graph' which is more of research project?  Which should be
> put first, then?

I would suggest working on generation number v2 first because:
- We ship improved performance *sooner*.
- I find it easier to shift from simpler to more complex systems.

On a personal note, I would be able to do a better job at working on generation
number v2 than on interval labels. While reading through the papers mentioned
was fun and full of a-ha moments, I find building things more
fulfilling. I would be
glad if either you or Derrick opt for mentoring it.

You could read my proposal at the link below. It is very rough and I haven't
proofread it yet. I will send out a more formal proposal once the direction
of this project is decided.

https://github.com/abhishekkumar2718/GSoC20/blob/master/graph_labelling.md

**Too long, didn't read**

- Commit graphs are small to medium-sized (compared to problem sizes observed in
graph-theory literature) sparse graphs. They have unusual properties compared
to more conventional graphs that can be exploited for better performance.

- Most of git's reachability queries are negative and using a negative-cut
filter improves performance more than a positive-cut filter.

- Implementing and maintaining a two-dimensional reachability index is hard
and does not offer justifiable performance gains.

- We plan to use corrected commit date as the generation number v2 because it is
locally computable, immutable and can be incrementally updated.

- If git ever considers a two-dimensional reachability index, either post order
DFS, GRAIL or an index based on commit date would be good starting
places to explore.

I go in more details about GRAIL, FERRARI and PReaCH, explaining
briefly how they work, their advantages, and disadvantages.

> Note that for example "Convert scripts to builtins" idea is in similar
> situation: it is also many projects in one.

Regards
Abhishek

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Possible idea for GSoC 2020
  2020-03-17 17:00 Abhishek Kumar
@ 2020-03-17 18:05 ` Jakub Narebski
  0 siblings, 0 replies; 33+ messages in thread
From: Jakub Narebski @ 2020-03-17 18:05 UTC (permalink / raw)
  To: Abhishek Kumar
  Cc: git, Christian Couder, Derrick Stolee, Heba Waly, Junio C Hamano,
	Jonathan Tan, Emily Shaffer, Abhishek Kumar

Abhishek Kumar <abhishekkumar8222@gmail.com> writes:

>> Having such a complicated two-dimensional system would need to
>> justify itself by being measurably faster than that one-dimensional
>> system in these example commands.
>>
>> [...]
>>
>> My _prediction_ is that the two-dimensional system will be more
>> complicated to write and use, and will not have any measurable
>> difference. I'd be happy to be wrong, but I also would not send
>> anyone down this direction only to find out I'm right and that
>> effort was wasted.
>
> Agreed. I have been through the papers of the involved variants and on graphs
> comparable to some of the largest git repositories, the performance improves by
> fifty nanoseconds for a random query.

I would recommend extending results for other types of large graphs to
the commit graphs with care.  The characteristics of those graphs are
quite different from characteristics of commit graph: they usually are
scale-free graphs, with low maximum level, and low connectivity: the
probability of two random nodes being connected in order of 10^-3 or
10^-4; see e.g. https://speakerdeck.com/jnareb/graph-operations-in-git-and-how-to-make-them-faster?slide=99

The last one, called R-ratio, means that testing on random query
actually tests mainly negative-cut filters.  That is why some papers
provide either separate numbers for negative and for positive queries,
or separate numbers for random and for balanced queries.

> Additionally:
> 1. They require significantly more space per commit.

This depends on the type of the algorithm: is it Label-Only (answering
reachability queries without consulting graph), or Label+Graph
(augmented online search algorithms):
https://speakerdeck.com/jnareb/graph-operations-in-git-and-how-to-make-them-faster?slide=78

The CDAT chunk in current version of commit-graph format takes H+16
bytes per commit (where H is the size of object id hash).  From those
H+16 bytes 30 bits (slightly less that 4 bytes) are used for current
reachability label: the topological level aka generation number.
https://speakerdeck.com/jnareb/graph-operations-in-git-and-how-to-make-them-faster?slide=45

The proposed min-post interval label would take 8 bytes per commit, that
is 4 bytes per single number in interval.  That is not much, provided
that we get visible performance improvements for at least some often
used git commands.


> 2. They require significantly more preprocessing time.

This again depends on the type of algorithm: Label-Only or Label+G.

In the case of min-post interval labels, they can be computed together
with generation number, during the same commit-graph walk.  The amount
of calculations required to compute min-post interval is not much.
Therefore I think it would be not unreasonable cost.

>> My recommendation is that a GSoC student update the
>> generation number to "v2" based on the definition you made in [1].
>> That proposal is also more likely to be effective in Git because
>> it makes use of extra heuristic information (commit date) to
>> assist the types of algorithms we care about.
>
>> In that case, the "difficult" part is moving the "generation"
>> member of struct commit into a slab before making it a 64-bit
>> value. (This is likely necessary for your plan, anyway.) Updating
>> the generation number to v2 is relatively straight-forward after
>> that, as someone can follow all places that reference or compute
>> generation numbers and apply a diff
>
> Thanks for the recommendation. Reading about how this fits in more
> with REU on the other thread, I too agree that updating generation
> number to use corrected commit date would be more appropriate for a GSoC
> project.

You can read about monotonically offset corrected commit date for
example on my slides, from page 64 (the problem) to 75 (references):
https://speakerdeck.com/jnareb/graph-operations-in-git-and-how-to-make-them-faster?slide=64

I'll try to update the proposal on SoC 2020 Ideas page soon, perhaps
tomorrow when I will have more free time.

https://git.github.io/SoC-2020-Ideas/

In the meantime I will refer to what is written at the top of it:

> **Students:** Please consider these ideas as starting points for
> generating proposals. We are also more than happy to receive proposals
> for other ideas related to Git. Make sure you have read the “Note
> about refactoring projects versus projects that implement new
> features” in the general application information page though.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Possible idea for GSoC 2020
  2020-03-17 17:04         ` Christian Couder
@ 2020-03-18 13:55           ` Jakub Narebski
  2020-03-18 15:25             ` Derrick Stolee
  0 siblings, 1 reply; 33+ messages in thread
From: Jakub Narebski @ 2020-03-18 13:55 UTC (permalink / raw)
  To: Christian Couder
  Cc: Derrick Stolee, git, Heba Waly, Junio C Hamano, Jonathan Tan,
	Emily Shaffer, Abhishek Kumar

Christian Couder <christian.couder@gmail.com> writes:
> On Tue, Mar 17, 2020 at 3:18 PM Jakub Narebski <jnareb@gmail.com> wrote:
>> Christian Couder <christian.couder@gmail.com> writes:
>>> On Tue, Mar 17, 2020 at 4:13 AM Jakub Narebski <jnareb@gmail.com> wrote:
[...]
>>>> As I wrote below, such positive-cut filter would be directly helpful in
>>>> performing the following commands:
>>>>
>>>>  - `git merge-base --is-ancestor`
>>>>  - `git branch --contains`
>>>>  - `git tag --contains`
>>>>  - `git branch --merged`
>>>>  - `git tag --merged`
>>>>
>>>> It would be also useful for tag autofollow in git-fetch; is is N-to-M
>>>> equivalent to 1-to-N / N-to-1 `--contains` queries.
>>>>
>>>> I am quite sure that positive-cut filter would make `--ancestry-path`
>>>> walk faster.
>>>>
>>>> I think, but I am not sure, that positive-cut filter can make parts of
>>>> topological sort and merge base algorithms at least a tiny bit faster.
>>>
>>> Is there an easy way to check that it would provide significant
>>> performance improvements at least in some cases? Can we ask the
>>> student to do that at the beginning of the GSoC?
>>
>> The "Reachability labels for version control graphs.ipynb" Jupyter
>> Notebook on Google Colaboratory was created to answer this question
>> (originally for the FELINE reachability index).  Among others it can
>> min-post interval labels and topological levels (generation numbers),
>> use them for reachability queries, and load Linux kernel
>> commit-graph. The exploration didn't get finished, but it would be not
>> that difficult, I think, to at least find the amount of false negatives
>> for min-post interval labeling for git.git or Linux kernel repo.
>>
>>   https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg
>>
>> As Jupyter Notebook, it is run in the web browser.  It can either use
>> local runtime, or run on Google Cloud runtime.  On the other hand it
>> requires at least some knowledge of Python...
>
> Ok, if that's a possible approach to the project, then please add it
> to the description.

Added as a new paragraph in the 'interval labels' subtask description
https://github.com/git/git.github.io/commit/ac526b18b97e53a431767ce147f27eaf6af0aa76

[...]
>>>>> My recommendation is that a GSoC student update the
>>>>> generation number to "v2" based on the definition you made in [1].
>>>>> That proposal is also more likely to be effective in Git because
>>>>> it makes use of extra heuristic information (commit date) to
>>>>> assist the types of algorithms we care about.
>>>>>
>>>>> In that case, the "difficult" part is moving the "generation"
>>>>> member of struct commit into a slab before making it a 64-bit
>>>>> value. (This is likely necessary for your plan, anyway.) Updating
>>>>> the generation number to v2 is relatively straight-forward after
>>>>> that, as someone can follow all places that reference or compute
>>>>> generation numbers and apply a diff
>>>>
>>>> Good idea!  Though I am not sure if it is not too late to add it to the
>>>> https://git.github.io/SoC-2020-Ideas/ as the self imposed deadline of
>>>> March 16 (where students can start submitting proposals to GSoC) has
>>>> just passed.  Christian, what do you think?
>>>
>>> Would that be a different project idea or part of your "Graph labeling
>>> for speeding up git commands" project idea?
>>>
>>> I am very reluctant to add new project ideas at that time. I don't
>>> think student will have time to properly research it and get it
>>> reviewed.
[...]

>> So perhaps we should expand "Commit graph labeling for speeding up git
>> commands" idea, splitting it into two possible ways of working in this
>> project: the more technical 'Generation number v2', and 'Interval labels
>> for commit graph' which is more of research project?
>
> Ok for that.
>
>>  Which should be
>> put first, then?
>
> Which ever you prefer. If you say that any part could be done
> separately, that doesn't matter much.

I have added 'Generation number v2' as one of alternative ways of
working on the more generic "Commit graph labeling for speeding up git
commands" idea -- as first task, because it fit better the narrative:
https://github.com/git/git.github.io/commit/a6d59820709417081c334a5120342987ff046f1a

Could you (or Stolee) check current proposal, so that it can be merged
in?  Thanks in advance.

  https://github.com/git/git.github.io/blob/soc-2020-idea-jnareb/SoC-2020-Ideas.md

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Possible idea for GSoC 2020
  2020-03-18 13:55           ` Jakub Narebski
@ 2020-03-18 15:25             ` Derrick Stolee
  2020-03-19 12:52               ` Jakub Narebski
  0 siblings, 1 reply; 33+ messages in thread
From: Derrick Stolee @ 2020-03-18 15:25 UTC (permalink / raw)
  To: Jakub Narebski, Christian Couder
  Cc: git, Heba Waly, Junio C Hamano, Jonathan Tan, Emily Shaffer,
	Abhishek Kumar

On 3/18/2020 9:55 AM, Jakub Narebski wrote:
> I have added 'Generation number v2' as one of alternative ways of
> working on the more generic "Commit graph labeling for speeding up git
> commands" idea -- as first task, because it fit better the narrative:
> https://github.com/git/git.github.io/commit/a6d59820709417081c334a5120342987ff046f1a
> 
> Could you (or Stolee) check current proposal, so that it can be merged
> in?  Thanks in advance.
> 
>   https://github.com/git/git.github.io/blob/soc-2020-idea-jnareb/SoC-2020-Ideas.md

Thanks for the updated write-up. I think the narrative is helpful, describing how
we landed on the definition of "generation number v2" before going into the interval
methods.

The only comment I have is about this statement at the end, which seems to be a
carry-over from your perspective of wanting intervals instead of v2:

	Before starting on this task, at the beginning of the GSoC, it might be
	[a (sic)] good idea to check that interval labels would provide significant
	performance improvements at least in some cases (and if it is not the case,
	switch to working on generation numbers v2).

The final parenthetical (switch to working on...) is a bit presumptive. Instead,
please recommend an exploration period to determine which methods have which
performance improvements using prototypes and/or the Python notebook. I'm usually
of the opinion that a prototype is more informative as it compares the results in
context, and the student would learn about the code that needs to change before
creating review-quality patches.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Possible idea for GSoC 2020
  2020-03-17 18:00 Abhishek Kumar
@ 2020-03-19 12:50 ` Jakub Narebski
  0 siblings, 0 replies; 33+ messages in thread
From: Jakub Narebski @ 2020-03-19 12:50 UTC (permalink / raw)
  To: Abhishek Kumar; +Cc: Christian Couder, git, Derrick Stolee

Abhishek Kumar <abhishekkumar8222@gmail.com> writes:

> Greetings Jakub
>
>> So perhaps we should expand "Commit graph labeling for speeding up git
>> commands" idea, splitting it into two possible ways of working in this
>> project: the more technical 'Generation number v2', and 'Interval labels
>> for commit graph' which is more of research project?  Which should be
>> put first, then?
>
> I would suggest working on generation number v2 first because:
> - We ship improved performance *sooner*.
> - I find it easier to shift from simpler to more complex systems.

It would be also more of an engineering project, than research one.

> On a personal note, I would be able to do a better job at working on generation
> number v2 than on interval labels. While reading through the papers mentioned
> was fun and full of a-ha moments, I find building things more
> fulfilling. I would be glad if either you or Derrick opt for mentoring it.

All right.  I have added 'Generation number v2' as possible way of
working on the "Commit graph labeling for speeding up git commands"
idea.

https://git.github.io/SoC-2020-Ideas/#commit-graph-labeling-for-speeding-up-git-commands


> You could read my proposal at the link below. It is very rough and I haven't
> proofread it yet. I will send out a more formal proposal once the direction
> of this project is decided.
>
> https://github.com/abhishekkumar2718/GSoC20/blob/master/graph_labelling.md
>
> **Too long, didn't read**

I will comment only on the TLDR version below.

> - Commit graphs are small to medium-sized (compared to problem sizes observed in
> graph-theory literature) sparse graphs. They have unusual properties compared
> to more conventional graphs that can be exploited for better performance.

Right.

> - Most of git's reachability queries are negative and using a negative-cut
> filter improves performance more than a positive-cut filter.

Not exactly.  Most time consuming git graph queries are those that
return a subgraph or otherwise need to walk the commit graph (like
topological sorting, or ahead/behind numbers).  Here in most cases
negative-cut filters, that reduce the number of commits walked, improve
performance more (with rare exceptions, like --ancestry-path query).

That is what I think, but it is not something I have proven...

> - Implementing and maintaining a two-dimensional reachability index is hard
> and does not offer justifiable performance gains.

I think it would be better to not mention graph reachability algorithms
at all, and just talk about generation number v2 as straight performance
improvement over generation number v1.

> - We plan to use corrected commit date as the generation number v2 because it is
> locally computable, immutable and can be incrementally updated.

Unfortunately because of an oopsie with respect to commit-graph file
format versioning, at least for the time being generation number v2
needs to be also backward compatibile with generation number v1.  See
for example:
  https://git.github.io/rev_news/2019/06/28/edition-52/#reviews
  https://speakerdeck.com/jnareb/graph-operations-in-git-and-how-to-make-them-faster?slide=74

>
> - If git ever considers a two-dimensional reachability index, either post order
> DFS, GRAIL or an index based on commit date would be good starting
> places to explore.
>
> I go in more details about GRAIL, FERRARI and PReaCH, explaining
> briefly how they work, their advantages, and disadvantages.

Again, I don't think that is necessary.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Possible idea for GSoC 2020
  2020-03-18 15:25             ` Derrick Stolee
@ 2020-03-19 12:52               ` Jakub Narebski
  0 siblings, 0 replies; 33+ messages in thread
From: Jakub Narebski @ 2020-03-19 12:52 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Christian Couder, git, Heba Waly, Junio C Hamano, Jonathan Tan,
	Emily Shaffer, Abhishek Kumar

Derrick Stolee <stolee@gmail.com> writes:
> On 3/18/2020 9:55 AM, Jakub Narebski wrote:

>> I have added 'Generation number v2' as one of alternative ways of
>> working on the more generic "Commit graph labeling for speeding up git
>> commands" idea -- as first task, because it fit better the narrative:
>> https://github.com/git/git.github.io/commit/a6d59820709417081c334a5120342987ff046f1a
>> 
>> Could you (or Stolee) check current proposal, so that it can be merged
>> in?  Thanks in advance.
>> 
>>   https://github.com/git/git.github.io/blob/soc-2020-idea-jnareb/SoC-2020-Ideas.md
>
> Thanks for the updated write-up. I think the narrative is helpful, describing how
> we landed on the definition of "generation number v2" before going into the interval
> methods.
>
> The only comment I have is about this statement at the end, which seems to be a
> carry-over from your perspective of wanting intervals instead of v2:
>
> 	Before starting on this task, at the beginning of the GSoC, it might be
> 	[a (sic)] good idea to check that interval labels would provide significant
> 	performance improvements at least in some cases (and if it is not the case,
> 	switch to working on generation numbers v2).
>
> The final parenthetical (switch to working on...) is a bit presumptive. Instead,
> please recommend an exploration period to determine which methods have which
> performance improvements using prototypes and/or the Python notebook. I'm usually
> of the opinion that a prototype is more informative as it compares the results in
> context, and the student would learn about the code that needs to change before
> creating review-quality patches.

Thanks, I took it into account in the final version

  https://git.github.io/SoC-2020-Ideas/

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC] Possible idea for GSoC 2020
       [not found] <CAHk66ftQqFqP-4kd4-8cHtCMEofSUvbeSQ24pcCCrkz7+2JG1w@mail.gmail.com>
@ 2020-03-27 18:31 ` Jakub Narębski
  0 siblings, 0 replies; 33+ messages in thread
From: Jakub Narębski @ 2020-03-27 18:31 UTC (permalink / raw)
  To: Abhishek Kumar
  Cc: git, Junio C Hamano, Derrick Stolee, Christian Couder,
	Jonathan Tan, Emily Shaffer, Heba Waly

"Hello Abhishek,

Somehow I have missed replying to this email.

On Wed, 18 Mar 2020 at 17:46, Abhishek Kumar
<abhishekkumar8222@gmail.com> wrote:
[...]
> >>> My _prediction_ is that the two-dimensional system will be more
> >>> complicated to write and use, and will not have any measurable
> >>> difference. I'd be happy to be wrong, but I also would not send
> >>> anyone down this direction only to find out I'm right and that
> >>> effort was wasted.
> >>
> >> Agreed. I have been through the papers of the involved variants and on graphs
> >> comparable to some of the largest git repositories, the performance improves by
> >> fifty nanoseconds for a random query.
> >
> > I would recommend extending results for other types of large graphs to
> > the commit graphs with care.  The characteristics of those graphs are
> > quite different from characteristics of commit graph: they usually are
> > scale-free graphs, with low maximum level, and low connectivity: the
> > probability of two random nodes being connected in order of 10^-3 or
> > 10^-4; see e.g. https://speakerdeck.com/jnareb/graph-operations-in-git-and-how-to-make-them-faster?slide=99
>
> > The last one, called R-ratio, means that testing on random query
> > actually tests mainly negative-cut filters.  That is why some papers
> > provide either separate numbers for negative and for positive queries,
> > or separate numbers for random and for balanced queries.
>
> I do agree that we should be careful while extending results but am skeptical
> about the performance difference. If anything, general results could help us
> eyeball the sort of improvement we can expect.

The fact is that for example in FELINE paper authors add min-post positive-cut
filter to their own index to improve performance for positive or
balanced queries.

"Reachability Queries in Very Large Graphs: A Fast Refined Online
Search Approach" (2014)
http://openprocedings.org/EDBT/2014/paper_166.pdf

> Of course, there is only one definitive source of truth -
> implementing indices and benchmarking performance.

Right.

> Speaking of special characteristics, are there any indexes designed for maximum
> performance with such graphs?

I was not able to find any reachability labelings that are intended for
commit graphs; even papers examining characteristics of commit graphs
as graphs are sparse. One that I have found is

Marco Biazzini, Martin Monperrus, Benoit Baudry
"On Analyzing the Topology of Commit Histories in
Decentralized Version Control Systems" (2014)
https://hal.archives-ouvertes.fr/hal-01063789

> > > Additionally:
> > > 1. They require significantly more space per commit.
>
> > This depends on the type of the algorithm: is it Label-Only (answering
> > reachability queries without consulting graph), or Label+Graph
> > (augmented online search algorithms):
> > https://speakerdeck.com/jnareb/graph-operations-in-git-and-how-to-make-them-faster?slide=78
> >
> > The CDAT chunk in current version of commit-graph format takes H+16
> > bytes per commit (where H is the size of object id hash).  From those
> > H+16 bytes 30 bits (slightly less that 4 bytes) are used for current
> > reachability label: the topological level aka generation number.
> > https://speakerdeck.com/jnareb/graph-operations-in-git-and-how-to-make-them-faster?slide=45
> >
> > The proposed min-post interval label would take 8 bytes per commit, that
> > is 4 bytes per single number in interval.  That is not much, provided
> > that we get visible performance improvements for at least some often
> > used git commands.
>
> Agreed. Both min-post and GRAIL use 8 bytes each.

Actually GRAIL uses k*8 bytes, where recommended value of k
is k=5 for dense graphs and k=2 for sparse graphs, and k is number
of random spanning trees which is number of intervals (negative-cut).

GRAIL = Graph Reachability indexing via rAndomized Interval
Labeling.

>                                                                                     Even FERRARI's
> performance would reach the flat end of diminishing returns if we
> assign three intervals (or 25 bytes) i.e six interval ends and three bits
> for recording whether intervals are exact or approximate.

Actually the current commit-graph format can store at most
(1 << 30) + (1 << 29) + (1 << 28) - 1 (around 1.8 billion) commits.
We can use most significant bit of one end of interval to record
whether interval is exact or approximate (in current format in
CDAT it is used to record whether there are more than 2 parents).
This means that k intervals for FERRARI take also k*8 bytes.
Again they recommend values of k=5 for dense, k=3 or k=2
for sparse. Also there is FERRARI-G variant, with k intervals
on average (global limit) - though I am not sure if it is something
that we can use.

>
> But PReaCH requires 8m + 65 bytes for each commit, which is a huge ask.

Let's take into account only "Pruning Based on DFS Numbering" from
there - I don't think reachability contraction hierarchies labeling
would be good fit for commit-graphs, because they assume bidi-BFS.
Reverse graph is not something that can be incrementally updated.

Maximum number of 4 byte numbers (one word for each end of interval,
and also one word for position of node) would be 5 or 6, depending
on whether we would store only p_tree, or [min(p_tree), post(p_tree)]
interval directly. Even 6*4 bytes per commit is not that huge of a task,
given that current CDAT is H + 2*4 + 8, where H is hash length.

> [An additional 4m bytes from current commit-graph chunk format since we
> do not store children nodes needed for the bi-directional nature of CHs.]
>
> >> 2. They require significantly more preprocessing time.
> >
> > This again depends on the type of algorithm: Label-Only or Label+G.
> >
> > In the case of min-post interval labels, they can be computed together
> > with generation number, during the same commit-graph walk. The amount
> > of calculations required to compute min-post interval is not much.
> > Therefore I think it would be not unreasonable cost.
>
> Also agreed. I do consider min-post interval labels and GRAIL to be some of
> more reasonable choices.
>
> But FERRARI would have marginally better performance than GRAIL and the
> five DFS passes made by PReaCH during preprocessing make it unsuitable.

Actually all five PReaCH DFS-labels can be computed during
_single_ DFS pass; implemented in Colaboratory:
https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg#scrollTo=4l0Ld0Jklq1o
as find_dfs_intervals_extra().

[...]
> >> My recommendation is that a GSoC student update the
> >> generation number to "v2" based on the definition you made in [1].
> >> That proposal is also more likely to be effective in Git because
> >> it makes use of extra heuristic information (commit date) to
> >> assist the types of algorithms we care about.
>
> Hear me out on this but topological levels can be
> considered a special case of all corrected commit
> dates occurring one time unit apart.

Right.

> Storing corrected dates instead of topological levels
> for min-post interval labels might actually have the
> best performance of all.

But you cannot use corrected dates for post(v);
topological levels and post-visit order in DFS
are different beasts.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2020-03-27 18:32 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-10 14:50 [RFC] Possible idea for GSoC 2020 Jakub Narebski
2020-03-11 19:03 ` Junio C Hamano
2020-03-13 10:56   ` Jakub Narebski
2020-03-15 14:26     ` Jakub Narebski
2020-03-17 12:24       ` Kaartic Sivaraam
2020-03-17 12:39         ` Kaartic Sivaraam
2020-03-17 14:22         ` Jakub Narebski
2020-03-11 20:29 ` Christian Couder
2020-03-11 21:30   ` Jakub Narebski
2020-03-11 21:48     ` Christian Couder
2020-03-12 12:17       ` Jakub Narebski
2020-03-12 13:08         ` Jakub Narebski
2020-03-13 10:59           ` Jakub Narebski
2020-03-13 13:08 ` Philip Oakley
2020-03-13 14:34   ` Jakub Narebski
2020-03-15 18:57     ` Philip Oakley
2020-03-15 21:14       ` Jakub Narebski
2020-03-16 14:47         ` Philip Oakley
2020-03-16 12:44 ` Derrick Stolee
2020-03-17  3:13   ` Jakub Narebski
2020-03-17  7:24     ` Christian Couder
2020-03-17 11:49       ` Derrick Stolee
2020-03-17 14:18       ` Jakub Narebski
2020-03-17 17:04         ` Christian Couder
2020-03-18 13:55           ` Jakub Narebski
2020-03-18 15:25             ` Derrick Stolee
2020-03-19 12:52               ` Jakub Narebski
  -- strict thread matches above, loose matches on Subject: below --
2020-03-13 17:30 Abhishek Kumar
2020-03-17 17:00 Abhishek Kumar
2020-03-17 18:05 ` Jakub Narebski
2020-03-17 18:00 Abhishek Kumar
2020-03-19 12:50 ` Jakub Narebski
     [not found] <CAHk66ftQqFqP-4kd4-8cHtCMEofSUvbeSQ24pcCCrkz7+2JG1w@mail.gmail.com>
2020-03-27 18:31 ` Jakub Narębski

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).