git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Derrick Stolee <stolee@gmail.com>
To: Jakub Narebski <jnareb@gmail.com>
Cc: git@vger.kernel.org, git@jeffhostetler.com, peff@peff.net,
	jonathantanmy@google.com, szeder.dev@gmail.com,
	sbeller@google.com, gitster@pobox.com,
	Derrick Stolee <dstolee@microsoft.com>
Subject: Re: [PATCH v4 00/13] Serialized Git Commit Graph
Date: Mon, 2 Apr 2018 09:02:11 -0400	[thread overview]
Message-ID: <f5d8934e-afc2-39d4-b7d5-e5ba5e5021a1@gmail.com> (raw)
In-Reply-To: <867eptkeeq.fsf@gmail.com>

On 3/30/2018 7:10 AM, Jakub Narebski wrote:
> I hope that I am addressing the most recent version of this series.

Hi Jakub. Thanks for the interest in this patch series.

The most-recent version is v6 [1], but I will re-roll to v7 soon (after 
v2.17.0 is marked).

[1] 
https://public-inbox.org/git/20180314192736.70602-1-dstolee@microsoft.com/T/#u

> Derrick Stolee <stolee@gmail.com> writes:
>
>> As promised [1], this patch contains a way to serialize the commit graph.
>> The current implementation defines a new file format to store the graph
>> structure (parent relationships) and basic commit metadata (commit date,
>> root tree OID) in order to prevent parsing raw commits while performing
>> basic graph walks. For example, we do not need to parse the full commit
>> when performing these walks:
>>
>> * 'git log --topo-order -1000' walks all reachable commits to avoid
>>    incorrect topological orders, but only needs the commit message for
>>    the top 1000 commits.
>>
>> * 'git merge-base <A> <B>' may walk many commits to find the correct
>>    boundary between the commits reachable from A and those reachable
>>    from B. No commit messages are needed.
>>
>> * 'git branch -vv' checks ahead/behind status for all local branches
>>    compared to their upstream remote branches. This is essentially as
>>    hard as computing merge bases for each.
>>
>> The current patch speeds up these calculations by injecting a check in
>> parse_commit_gently() to check if there is a graph file and using that
>> to provide the required metadata to the struct commit.
> That's nice.
>
> What are the assumptions about the serialized commit graph format? Does
> it need to be:
>   - extensible without rewriting (e.g. append-only)?
>   - like the above, but may need rewriting for optimal performance?
>   - extending it needs to rewrite whole file?
>
> Excuse me if it waas already asked and answered.

It is not extensible without rewriting. Reducing write time was not a 
main goal, since the graph will be written only occasionally during data 
management phases (like 'gc' or 'repack'; this integration is not 
implemented yet).

>
>> The file format has room to store generation numbers, which will be
>> provided as a patch after this framework is merged. Generation numbers
>> are referenced by the design document but not implemented in order to
>> make the current patch focus on the graph construction process. Once
>> that is stable, it will be easier to add generation numbers and make
>> graph walks aware of generation numbers one-by-one.
> As the serialized commit graph format is versioned, I wonder if it would
> be possible to speed up graph walks even more by adding to it FELINE
> index (pair of numbers) from "Reachability Queries in Very Large Graphs:
> A Fast Refined Olnine Search Approach" (2014) - available at
> http://openproceedings.org/EDBT/2014/paper_166.pdf
>
> The implementation would probably need adjustments to make it
> unambiguous and unambiguously extensible; unless there is place for
> indices that are local-only and need to be recalculated from scratch
> when graph changes (to cover all graph).

The chunk-based format is intended to allow extra indexes like the one 
you recommend, without needing to increase the version number. Using an 
optional chunk allows older versions of Git to read the file without 
error, since the data is "extra", and newer versions can take advantage 
of the acceleration.

At one point, I was investigating these reachability indexes (I read 
"SCARAB: Scaling Reachability Computation on Large Graphs" by Jihn, 
Ruan, Dey, and Xu [2]) but find the question that these indexes target 
to be lacking for most of the Git uses. That is, they ask the boolean 
question "Can X reach Y?". More often, Git needs to answer "What is the 
set of commits reachable from X but not from Y" or "Topologically sort 
commits reachable from X" or "How many commits are in each part of the 
symmetric difference between reachable from X or reachable from Y?"

The case for "Can X reach Y?" is mostly for commands like 'git branch 
--contains', when 'git fetch' checks for forced-updates of branches, or 
when the server decides enough negotiation has occurred during a 'git 
fetch'. While these may be worth investigating, they also benefit 
greatly from the accelerated graph walk introduced in the current format.

I would be happy to review any effort to extend the commit-graph format 
to include such indexes, as long as the performance benefits outweigh 
the complexity to create them.

[2] 
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.719.8396&rep=rep1&type=pdf

>
>> Here are some performance results for a copy of the Linux repository
>> where 'master' has 704,766 reachable commits and is behind 'origin/master'
>> by 19,610 commits.
>>
>> | Command                          | Before | After  | Rel % |
>> |----------------------------------|--------|--------|-------|
>> | log --oneline --topo-order -1000 |  5.9s  |  0.7s  | -88%  |
>> | branch -vv                       |  0.42s |  0.27s | -35%  |
>> | rev-list --all                   |  6.4s  |  1.0s  | -84%  |
>> | rev-list --all --objects         | 32.6s  | 27.6s  | -15%  |
> That's the "Rel %" of "Before", that is delta/before, isn't it?

I do mean the relative change.

>
>> To test this yourself, run the following on your repo:
>>
>>    git config core.commitGraph true
>>    git show-ref -s | git commit-graph write --set-latest --stdin-commits
>>
>> The second command writes a commit graph file containing every commit
>> reachable from your refs. Now, all git commands that walk commits will
>> check your graph first before consulting the ODB. You can run your own
>> performance comparisions by toggling the 'core.commitgraph' setting.
> Good.  It is nicely similar to how bitmap indices (of reachability) are
> handled.
>
> I just wonder what happens in the (rare) presence of grafts (old
> mechanism), or "git replace"-d commits...

In the design document, I mention that the current implementation does 
not work with grafts (it will ignore them). A later patch will refactor 
the graft code so we can access it from the commit-graph parsing of a 
commit without copy-pasting the code out of parse_commit_gently().

The commit-graph is only a compact representation of the object 
database. If a commit is replaced with 'git replace' before 'git 
commit-graph write' then the commit-graph write will write the replaced 
object. I haven't tested what happens when a commit-graph is written and 
then a commit is replaced, but my guess is that the replacement does not 
occur until a full parse is attempted (i.e. when reading author or 
commit message information). This will lead to unknown results.

Thanks for pointing out the interaction with 'git replace'. I have items 
to fix grafts and replaced commits before integrating commit-graph 
writes into automatic actions like 'gc.auto'.

Thanks,
-Stolee

  reply	other threads:[~2018-04-02 13:02 UTC|newest]

Thread overview: 146+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-01-30 21:39 [PATCH v2 00/14] Serialized Git Commit Graph Derrick Stolee
2018-01-30 21:39 ` [PATCH v2 01/14] commit-graph: add format document Derrick Stolee
2018-02-01 21:44   ` Jonathan Tan
2018-01-30 21:39 ` [PATCH v2 02/14] graph: add commit graph design document Derrick Stolee
2018-01-31  2:19   ` Stefan Beller
2018-01-30 21:39 ` [PATCH v2 03/14] commit-graph: create git-commit-graph builtin Derrick Stolee
2018-02-02  0:53   ` SZEDER Gábor
2018-01-30 21:39 ` [PATCH v2 04/14] commit-graph: implement construct_commit_graph() Derrick Stolee
2018-02-01 22:23   ` Jonathan Tan
2018-02-01 23:46   ` SZEDER Gábor
2018-02-02 15:32   ` SZEDER Gábor
2018-02-05 16:06     ` Derrick Stolee
2018-02-07 15:08       ` SZEDER Gábor
2018-02-07 15:10         ` Derrick Stolee
2018-01-30 21:39 ` [PATCH v2 05/14] commit-graph: implement git-commit-graph --write Derrick Stolee
2018-02-01 23:33   ` Jonathan Tan
2018-02-02 18:36     ` Stefan Beller
2018-02-02 22:48       ` Junio C Hamano
2018-02-03  1:58         ` Derrick Stolee
2018-02-03  9:28           ` Jeff King
2018-02-05 18:48             ` Junio C Hamano
2018-02-06 18:55               ` Derrick Stolee
2018-02-01 23:48   ` SZEDER Gábor
2018-02-05 18:07     ` Derrick Stolee
2018-02-02  1:47   ` SZEDER Gábor
2018-01-30 21:39 ` [PATCH v2 06/14] commit-graph: implement git-commit-graph --read Derrick Stolee
2018-01-31  2:22   ` Stefan Beller
2018-02-02  0:02   ` SZEDER Gábor
2018-02-02  0:23   ` Jonathan Tan
2018-02-05 19:29     ` Derrick Stolee
2018-01-30 21:39 ` [PATCH v2 07/14] commit-graph: implement git-commit-graph --update-head Derrick Stolee
2018-02-02  1:35   ` SZEDER Gábor
2018-02-05 21:01     ` Derrick Stolee
2018-02-02  2:45   ` SZEDER Gábor
2018-01-30 21:39 ` [PATCH v2 08/14] commit-graph: implement git-commit-graph --clear Derrick Stolee
2018-02-02  4:01   ` SZEDER Gábor
2018-01-30 21:39 ` [PATCH v2 09/14] commit-graph: teach git-commit-graph --delete-expired Derrick Stolee
2018-02-02 15:04   ` SZEDER Gábor
2018-01-30 21:39 ` [PATCH v2 10/14] commit-graph: add core.commitgraph setting Derrick Stolee
2018-01-31 22:44   ` Igor Djordjevic
2018-02-02 16:01   ` SZEDER Gábor
2018-01-30 21:39 ` [PATCH v2 11/14] commit: integrate commit graph with commit parsing Derrick Stolee
2018-02-02  1:51   ` Jonathan Tan
2018-02-06 14:53     ` Derrick Stolee
2018-01-30 21:39 ` [PATCH v2 12/14] commit-graph: read only from specific pack-indexes Derrick Stolee
2018-01-30 21:39 ` [PATCH v2 13/14] commit-graph: close under reachability Derrick Stolee
2018-01-30 21:39 ` [PATCH v2 14/14] commit-graph: build graph from starting commits Derrick Stolee
2018-01-30 21:47 ` [PATCH v2 00/14] Serialized Git Commit Graph Stefan Beller
2018-02-01  2:34   ` Stefan Beller
2018-02-08 20:37 ` [PATCH v3 " Derrick Stolee
2018-02-08 20:37   ` [PATCH v3 01/14] commit-graph: add format document Derrick Stolee
2018-02-08 21:21     ` Junio C Hamano
2018-02-08 21:33       ` Derrick Stolee
2018-02-08 23:16         ` Junio C Hamano
2018-02-08 20:37   ` [PATCH v3 02/14] graph: add commit graph design document Derrick Stolee
2018-02-08 20:37   ` [PATCH v3 03/14] commit-graph: create git-commit-graph builtin Derrick Stolee
2018-02-08 21:27     ` Junio C Hamano
2018-02-08 21:36       ` Derrick Stolee
2018-02-08 23:21         ` Junio C Hamano
2018-02-08 20:37   ` [PATCH v3 04/14] commit-graph: implement write_commit_graph() Derrick Stolee
2018-02-08 22:14     ` Junio C Hamano
2018-02-15 18:19     ` Junio C Hamano
2018-02-15 18:23       ` Derrick Stolee
2018-02-08 20:37   ` [PATCH v3 05/14] commit-graph: implement 'git-commit-graph write' Derrick Stolee
2018-02-13 21:57     ` Jonathan Tan
2018-02-08 20:37   ` [PATCH v3 06/14] commit-graph: implement 'git-commit-graph read' Derrick Stolee
2018-02-08 23:38     ` Junio C Hamano
2018-02-08 20:37   ` [PATCH v3 07/14] commit-graph: update graph-head during write Derrick Stolee
2018-02-12 18:56     ` Junio C Hamano
2018-02-12 20:37       ` Junio C Hamano
2018-02-12 21:24         ` Derrick Stolee
2018-02-13 22:38     ` Jonathan Tan
2018-02-08 20:37   ` [PATCH v3 08/14] commit-graph: implement 'git-commit-graph clear' Derrick Stolee
2018-02-13 22:49     ` Jonathan Tan
2018-02-08 20:37   ` [PATCH v3 09/14] commit-graph: implement --delete-expired Derrick Stolee
2018-02-08 20:37   ` [PATCH v3 10/14] commit-graph: add core.commitGraph setting Derrick Stolee
2018-02-08 20:37   ` [PATCH v3 11/14] commit: integrate commit graph with commit parsing Derrick Stolee
2018-02-14  0:12     ` Jonathan Tan
2018-02-14 18:08       ` Derrick Stolee
2018-02-15 18:25     ` Junio C Hamano
2018-02-08 20:37   ` [PATCH v3 12/14] commit-graph: close under reachability Derrick Stolee
2018-02-08 20:37   ` [PATCH v3 13/14] commit-graph: read only from specific pack-indexes Derrick Stolee
2018-02-08 20:37   ` [PATCH v3 14/14] commit-graph: build graph from starting commits Derrick Stolee
2018-02-09 13:02     ` SZEDER Gábor
2018-02-09 13:45       ` Derrick Stolee
2018-02-14 18:15   ` [PATCH v3 00/14] Serialized Git Commit Graph Derrick Stolee
2018-02-14 18:27     ` Stefan Beller
2018-02-14 19:11       ` Derrick Stolee
2018-02-19 18:53     ` [PATCH v4 00/13] " Derrick Stolee
2018-02-19 18:53       ` [PATCH v4 01/13] commit-graph: add format document Derrick Stolee
2018-02-20 20:49         ` Junio C Hamano
2018-02-21 19:23         ` Stefan Beller
2018-02-21 19:45           ` Derrick Stolee
2018-02-21 19:48             ` Stefan Beller
2018-03-30 13:25         ` Jakub Narebski
2018-04-02 13:09           ` Derrick Stolee
2018-04-02 14:09             ` Jakub Narebski
2018-02-19 18:53       ` [PATCH v4 02/13] graph: add commit graph design document Derrick Stolee
2018-02-20 21:42         ` Junio C Hamano
2018-02-23 15:44           ` Derrick Stolee
2018-02-21 19:34         ` Stefan Beller
2018-02-19 18:53       ` [PATCH v4 03/13] commit-graph: create git-commit-graph builtin Derrick Stolee
2018-02-20 21:51         ` Junio C Hamano
2018-02-21 18:58           ` Junio C Hamano
2018-02-23 16:07             ` Derrick Stolee
2018-02-26 16:25         ` SZEDER Gábor
2018-02-26 17:08           ` Derrick Stolee
2018-02-19 18:53       ` [PATCH v4 04/13] commit-graph: implement write_commit_graph() Derrick Stolee
2018-02-20 22:57         ` Junio C Hamano
2018-02-23 17:23           ` Derrick Stolee
2018-02-23 19:30             ` Junio C Hamano
2018-02-23 19:48               ` Junio C Hamano
2018-02-23 20:02               ` Derrick Stolee
2018-02-26 16:10         ` SZEDER Gábor
2018-02-28 18:47         ` Junio C Hamano
2018-02-19 18:53       ` [PATCH v4 05/13] commit-graph: implement 'git-commit-graph write' Derrick Stolee
2018-02-21 19:25         ` Junio C Hamano
2018-02-19 18:53       ` [PATCH v4 06/13] commit-graph: implement git commit-graph read Derrick Stolee
2018-02-21 20:11         ` Junio C Hamano
2018-02-22 18:25           ` Junio C Hamano
2018-02-19 18:53       ` [PATCH v4 07/13] commit-graph: implement --set-latest Derrick Stolee
2018-02-22 18:31         ` Junio C Hamano
2018-02-23 17:53           ` Derrick Stolee
2018-02-19 18:53       ` [PATCH v4 08/13] commit-graph: implement --delete-expired Derrick Stolee
2018-02-21 21:34         ` Stefan Beller
2018-02-23 17:43           ` Derrick Stolee
2018-02-22 18:48         ` Junio C Hamano
2018-02-23 17:59           ` Derrick Stolee
2018-02-23 19:33             ` Junio C Hamano
2018-02-23 19:41               ` Derrick Stolee
2018-02-23 19:51                 ` Junio C Hamano
2018-02-19 18:53       ` [PATCH v4 09/13] commit-graph: add core.commitGraph setting Derrick Stolee
2018-02-19 18:53       ` [PATCH v4 10/13] commit-graph: close under reachability Derrick Stolee
2018-02-19 18:53       ` [PATCH v4 11/13] commit: integrate commit graph with commit parsing Derrick Stolee
2018-02-19 18:53       ` [PATCH v4 12/13] commit-graph: read only from specific pack-indexes Derrick Stolee
2018-02-21 22:25         ` Stefan Beller
2018-02-23 19:19           ` Derrick Stolee
2018-02-19 18:53       ` [PATCH v4 13/13] commit-graph: build graph from starting commits Derrick Stolee
2018-03-30 11:10       ` [PATCH v4 00/13] Serialized Git Commit Graph Jakub Narebski
2018-04-02 13:02         ` Derrick Stolee [this message]
2018-04-02 14:46           ` Jakub Narebski
2018-04-02 15:02             ` Derrick Stolee
2018-04-02 17:35               ` Stefan Beller
2018-04-02 17:54                 ` Derrick Stolee
2018-04-02 18:02                   ` Stefan Beller
2018-04-07 22:37               ` Jakub Narebski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=f5d8934e-afc2-39d4-b7d5-e5ba5e5021a1@gmail.com \
    --to=stolee@gmail.com \
    --cc=dstolee@microsoft.com \
    --cc=git@jeffhostetler.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jnareb@gmail.com \
    --cc=jonathantanmy@google.com \
    --cc=peff@peff.net \
    --cc=sbeller@google.com \
    --cc=szeder.dev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).