From: Jakub Narebski <jnareb@gmail.com>
To: Abhishek Kumar <abhishekkumar8222@gmail.com>
Cc: git@vger.kernel.org,
Christian Couder <christian.couder@gmail.com>,
Derrick Stolee <stolee@gmail.com>,
Heba Waly <heba.waly@gmail.com>,
Junio C Hamano <gitster@pobox.com>,
Jonathan Tan <jonathantanmy@google.com>,
Emily Shaffer <emilyshaffer@google.com>,
Abhishek Kumar <abhishekkumar8222@gmail.com>
Subject: Re: [RFC] Possible idea for GSoC 2020
Date: Tue, 17 Mar 2020 19:05:37 +0100 [thread overview]
Message-ID: <86sgi66emm.fsf@gmail.com> (raw)
In-Reply-To: <CAHk66ftx=XTSeVcPe09yA9KMDpHwiKFLKa62cCBFufjeenAbaQ@mail.gmail.com> (Abhishek Kumar's message of "Tue, 17 Mar 2020 22:30:00 +0530")
Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
>> Having such a complicated two-dimensional system would need to
>> justify itself by being measurably faster than that one-dimensional
>> system in these example commands.
>>
>> [...]
>>
>> My _prediction_ is that the two-dimensional system will be more
>> complicated to write and use, and will not have any measurable
>> difference. I'd be happy to be wrong, but I also would not send
>> anyone down this direction only to find out I'm right and that
>> effort was wasted.
>
> Agreed. I have been through the papers of the involved variants and on graphs
> comparable to some of the largest git repositories, the performance improves by
> fifty nanoseconds for a random query.
I would recommend extending results for other types of large graphs to
the commit graphs with care. The characteristics of those graphs are
quite different from characteristics of commit graph: they usually are
scale-free graphs, with low maximum level, and low connectivity: the
probability of two random nodes being connected in order of 10^-3 or
10^-4; see e.g. https://speakerdeck.com/jnareb/graph-operations-in-git-and-how-to-make-them-faster?slide=99
The last one, called R-ratio, means that testing on random query
actually tests mainly negative-cut filters. That is why some papers
provide either separate numbers for negative and for positive queries,
or separate numbers for random and for balanced queries.
> Additionally:
> 1. They require significantly more space per commit.
This depends on the type of the algorithm: is it Label-Only (answering
reachability queries without consulting graph), or Label+Graph
(augmented online search algorithms):
https://speakerdeck.com/jnareb/graph-operations-in-git-and-how-to-make-them-faster?slide=78
The CDAT chunk in current version of commit-graph format takes H+16
bytes per commit (where H is the size of object id hash). From those
H+16 bytes 30 bits (slightly less that 4 bytes) are used for current
reachability label: the topological level aka generation number.
https://speakerdeck.com/jnareb/graph-operations-in-git-and-how-to-make-them-faster?slide=45
The proposed min-post interval label would take 8 bytes per commit, that
is 4 bytes per single number in interval. That is not much, provided
that we get visible performance improvements for at least some often
used git commands.
> 2. They require significantly more preprocessing time.
This again depends on the type of algorithm: Label-Only or Label+G.
In the case of min-post interval labels, they can be computed together
with generation number, during the same commit-graph walk. The amount
of calculations required to compute min-post interval is not much.
Therefore I think it would be not unreasonable cost.
>> My recommendation is that a GSoC student update the
>> generation number to "v2" based on the definition you made in [1].
>> That proposal is also more likely to be effective in Git because
>> it makes use of extra heuristic information (commit date) to
>> assist the types of algorithms we care about.
>
>> In that case, the "difficult" part is moving the "generation"
>> member of struct commit into a slab before making it a 64-bit
>> value. (This is likely necessary for your plan, anyway.) Updating
>> the generation number to v2 is relatively straight-forward after
>> that, as someone can follow all places that reference or compute
>> generation numbers and apply a diff
>
> Thanks for the recommendation. Reading about how this fits in more
> with REU on the other thread, I too agree that updating generation
> number to use corrected commit date would be more appropriate for a GSoC
> project.
You can read about monotonically offset corrected commit date for
example on my slides, from page 64 (the problem) to 75 (references):
https://speakerdeck.com/jnareb/graph-operations-in-git-and-how-to-make-them-faster?slide=64
I'll try to update the proposal on SoC 2020 Ideas page soon, perhaps
tomorrow when I will have more free time.
https://git.github.io/SoC-2020-Ideas/
In the meantime I will refer to what is written at the top of it:
> **Students:** Please consider these ideas as starting points for
> generating proposals. We are also more than happy to receive proposals
> for other ideas related to Git. Make sure you have read the “Note
> about refactoring projects versus projects that implement new
> features” in the general application information page though.
Best,
--
Jakub Narębski
next prev parent reply other threads:[~2020-03-17 18:05 UTC|newest]
Thread overview: 33+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-03-17 17:00 [RFC] Possible idea for GSoC 2020 Abhishek Kumar
2020-03-17 18:05 ` Jakub Narebski [this message]
[not found] <CAHk66ftQqFqP-4kd4-8cHtCMEofSUvbeSQ24pcCCrkz7+2JG1w@mail.gmail.com>
2020-03-27 18:31 ` Jakub Narębski
-- strict thread matches above, loose matches on Subject: below --
2020-03-17 18:00 Abhishek Kumar
2020-03-19 12:50 ` Jakub Narebski
2020-03-13 17:30 Abhishek Kumar
2020-03-10 14:50 Jakub Narebski
2020-03-11 19:03 ` Junio C Hamano
2020-03-13 10:56 ` Jakub Narebski
2020-03-15 14:26 ` Jakub Narebski
2020-03-17 12:24 ` Kaartic Sivaraam
2020-03-17 12:39 ` Kaartic Sivaraam
2020-03-17 14:22 ` Jakub Narebski
2020-03-11 20:29 ` Christian Couder
2020-03-11 21:30 ` Jakub Narebski
2020-03-11 21:48 ` Christian Couder
2020-03-12 12:17 ` Jakub Narebski
2020-03-12 13:08 ` Jakub Narebski
2020-03-13 10:59 ` Jakub Narebski
2020-03-13 13:08 ` Philip Oakley
2020-03-13 14:34 ` Jakub Narebski
2020-03-15 18:57 ` Philip Oakley
2020-03-15 21:14 ` Jakub Narebski
2020-03-16 14:47 ` Philip Oakley
2020-03-16 12:44 ` Derrick Stolee
2020-03-17 3:13 ` Jakub Narebski
2020-03-17 7:24 ` Christian Couder
2020-03-17 11:49 ` Derrick Stolee
2020-03-17 14:18 ` Jakub Narebski
2020-03-17 17:04 ` Christian Couder
2020-03-18 13:55 ` Jakub Narebski
2020-03-18 15:25 ` Derrick Stolee
2020-03-19 12:52 ` Jakub Narebski
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: http://vger.kernel.org/majordomo-info.html
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=86sgi66emm.fsf@gmail.com \
--to=jnareb@gmail.com \
--cc=abhishekkumar8222@gmail.com \
--cc=christian.couder@gmail.com \
--cc=emilyshaffer@google.com \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=heba.waly@gmail.com \
--cc=jonathantanmy@google.com \
--cc=stolee@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).