Re: [RFC] Possible idea for GSoC 2020

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

From: Jakub Narebski <jnareb@gmail.com>
To: Abhishek Kumar <abhishekkumar8222@gmail.com>
Cc: git@vger.kernel.org,
	Christian Couder <christian.couder@gmail.com>,
	Derrick Stolee <stolee@gmail.com>,
	Heba Waly <heba.waly@gmail.com>,
	Junio C Hamano <gitster@pobox.com>,
	Jonathan Tan <jonathantanmy@google.com>,
	Emily Shaffer <emilyshaffer@google.com>,
	Abhishek Kumar <abhishekkumar8222@gmail.com>
Subject: Re: [RFC] Possible idea for GSoC 2020
Date: Tue, 17 Mar 2020 19:05:37 +0100	[thread overview]
Message-ID: <86sgi66emm.fsf@gmail.com> (raw)
In-Reply-To: <CAHk66ftx=XTSeVcPe09yA9KMDpHwiKFLKa62cCBFufjeenAbaQ@mail.gmail.com> (Abhishek Kumar's message of "Tue, 17 Mar 2020 22:30:00 +0530")

Abhishek Kumar <abhishekkumar8222@gmail.com> writes:

>> Having such a complicated two-dimensional system would need to
>> justify itself by being measurably faster than that one-dimensional
>> system in these example commands.
>>
>> [...]
>>
>> My _prediction_ is that the two-dimensional system will be more
>> complicated to write and use, and will not have any measurable
>> difference. I'd be happy to be wrong, but I also would not send
>> anyone down this direction only to find out I'm right and that
>> effort was wasted.
>
> Agreed. I have been through the papers of the involved variants and on graphs
> comparable to some of the largest git repositories, the performance improves by
> fifty nanoseconds for a random query.

I would recommend extending results for other types of large graphs to
the commit graphs with care.  The characteristics of those graphs are
quite different from characteristics of commit graph: they usually are
scale-free graphs, with low maximum level, and low connectivity: the
probability of two random nodes being connected in order of 10^-3 or
10^-4; see e.g. https://speakerdeck.com/jnareb/graph-operations-in-git-and-how-to-make-them-faster?slide=99

The last one, called R-ratio, means that testing on random query
actually tests mainly negative-cut filters.  That is why some papers
provide either separate numbers for negative and for positive queries,
or separate numbers for random and for balanced queries.

> Additionally:
> 1. They require significantly more space per commit.

This depends on the type of the algorithm: is it Label-Only (answering
reachability queries without consulting graph), or Label+Graph
(augmented online search algorithms):
https://speakerdeck.com/jnareb/graph-operations-in-git-and-how-to-make-them-faster?slide=78

The CDAT chunk in current version of commit-graph format takes H+16
bytes per commit (where H is the size of object id hash).  From those
H+16 bytes 30 bits (slightly less that 4 bytes) are used for current
reachability label: the topological level aka generation number.
https://speakerdeck.com/jnareb/graph-operations-in-git-and-how-to-make-them-faster?slide=45

The proposed min-post interval label would take 8 bytes per commit, that
is 4 bytes per single number in interval.  That is not much, provided
that we get visible performance improvements for at least some often
used git commands.

> 2. They require significantly more preprocessing time.

This again depends on the type of algorithm: Label-Only or Label+G.

In the case of min-post interval labels, they can be computed together
with generation number, during the same commit-graph walk.  The amount
of calculations required to compute min-post interval is not much.
Therefore I think it would be not unreasonable cost.

>> My recommendation is that a GSoC student update the
>> generation number to "v2" based on the definition you made in [1].
>> That proposal is also more likely to be effective in Git because
>> it makes use of extra heuristic information (commit date) to
>> assist the types of algorithms we care about.
>
>> In that case, the "difficult" part is moving the "generation"
>> member of struct commit into a slab before making it a 64-bit
>> value. (This is likely necessary for your plan, anyway.) Updating
>> the generation number to v2 is relatively straight-forward after
>> that, as someone can follow all places that reference or compute
>> generation numbers and apply a diff
>
> Thanks for the recommendation. Reading about how this fits in more
> with REU on the other thread, I too agree that updating generation
> number to use corrected commit date would be more appropriate for a GSoC
> project.

You can read about monotonically offset corrected commit date for
example on my slides, from page 64 (the problem) to 75 (references):
https://speakerdeck.com/jnareb/graph-operations-in-git-and-how-to-make-them-faster?slide=64

I'll try to update the proposal on SoC 2020 Ideas page soon, perhaps
tomorrow when I will have more free time.

https://git.github.io/SoC-2020-Ideas/

In the meantime I will refer to what is written at the top of it:

> **Students:** Please consider these ideas as starting points for
> generating proposals. We are also more than happy to receive proposals
> for other ideas related to Git. Make sure you have read the “Note
> about refactoring projects versus projects that implement new
> features” in the general application information page though.

Best,
-- 
Jakub Narębski

next prev parent reply	other threads:[~2020-03-17 18:05 UTC|newest]

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-03-17 17:00 [RFC] Possible idea for GSoC 2020 Abhishek Kumar
2020-03-17 18:05 ` Jakub Narebski [this message]
     [not found] <CAHk66ftQqFqP-4kd4-8cHtCMEofSUvbeSQ24pcCCrkz7+2JG1w@mail.gmail.com>
2020-03-27 18:31 ` Jakub Narębski
  -- strict thread matches above, loose matches on Subject: below --
2020-03-17 18:00 Abhishek Kumar
2020-03-19 12:50 ` Jakub Narebski
2020-03-13 17:30 Abhishek Kumar
2020-03-10 14:50 Jakub Narebski
2020-03-11 19:03 ` Junio C Hamano
2020-03-13 10:56   ` Jakub Narebski
2020-03-15 14:26     ` Jakub Narebski
2020-03-17 12:24       ` Kaartic Sivaraam
2020-03-17 12:39         ` Kaartic Sivaraam
2020-03-17 14:22         ` Jakub Narebski
2020-03-11 20:29 ` Christian Couder
2020-03-11 21:30   ` Jakub Narebski
2020-03-11 21:48     ` Christian Couder
2020-03-12 12:17       ` Jakub Narebski
2020-03-12 13:08         ` Jakub Narebski
2020-03-13 10:59           ` Jakub Narebski
2020-03-13 13:08 ` Philip Oakley
2020-03-13 14:34   ` Jakub Narebski
2020-03-15 18:57     ` Philip Oakley
2020-03-15 21:14       ` Jakub Narebski
2020-03-16 14:47         ` Philip Oakley
2020-03-16 12:44 ` Derrick Stolee
2020-03-17  3:13   ` Jakub Narebski
2020-03-17  7:24     ` Christian Couder
2020-03-17 11:49       ` Derrick Stolee
2020-03-17 14:18       ` Jakub Narebski
2020-03-17 17:04         ` Christian Couder
2020-03-18 13:55           ` Jakub Narebski
2020-03-18 15:25             ` Derrick Stolee
2020-03-19 12:52               ` Jakub Narebski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=86sgi66emm.fsf@gmail.com \
    --to=jnareb@gmail.com \
    --cc=abhishekkumar8222@gmail.com \
    --cc=christian.couder@gmail.com \
    --cc=emilyshaffer@google.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=heba.waly@gmail.com \
    --cc=jonathantanmy@google.com \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).