From: Derrick Stolee <stolee@gmail.com>
To: git@vger.kernel.org
Cc: gitster@pobox.com, sbeller@google.com, szeder.dev@gmail.com,
ramsay@ramsayjones.plus.com, git@jeffhostetler.com,
peff@peff.net, Derrick Stolee <dstolee@microsoft.com>
Subject: [PATCH v7 04/14] graph: add commit graph design document
Date: Mon, 2 Apr 2018 16:34:17 -0400 [thread overview]
Message-ID: <20180402203427.170177-5-dstolee@microsoft.com> (raw)
In-Reply-To: <20180402203427.170177-1-dstolee@microsoft.com>
From: Derrick Stolee <dstolee@microsoft.com>
Add Documentation/technical/commit-graph.txt with details of the planned
commit graph feature, including future plans.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
Documentation/technical/commit-graph.txt | 163 +++++++++++++++++++++++
1 file changed, 163 insertions(+)
create mode 100644 Documentation/technical/commit-graph.txt
diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
new file mode 100644
index 0000000000..0550c6d0dc
--- /dev/null
+++ b/Documentation/technical/commit-graph.txt
@@ -0,0 +1,163 @@
+Git Commit Graph Design Notes
+=============================
+
+Git walks the commit graph for many reasons, including:
+
+1. Listing and filtering commit history.
+2. Computing merge bases.
+
+These operations can become slow as the commit count grows. The merge
+base calculation shows up in many user-facing commands, such as 'merge-base'
+or 'status' and can take minutes to compute depending on history shape.
+
+There are two main costs here:
+
+1. Decompressing and parsing commits.
+2. Walking the entire graph to satisfy topological order constraints.
+
+The commit graph file is a supplemental data structure that accelerates
+commit graph walks. If a user downgrades or disables the 'core.commitGraph'
+config setting, then the existing ODB is sufficient. The file is stored
+as "commit-graph" either in the .git/objects/info directory or in the info
+directory of an alternate.
+
+The commit graph file stores the commit graph structure along with some
+extra metadata to speed up graph walks. By listing commit OIDs in lexi-
+cographic order, we can identify an integer position for each commit and
+refer to the parents of a commit using those integer positions. We use
+binary search to find initial commits and then use the integer positions
+for fast lookups during the walk.
+
+A consumer may load the following info for a commit from the graph:
+
+1. The commit OID.
+2. The list of parents, along with their integer position.
+3. The commit date.
+4. The root tree OID.
+5. The generation number (see definition below).
+
+Values 1-4 satisfy the requirements of parse_commit_gently().
+
+Define the "generation number" of a commit recursively as follows:
+
+ * A commit with no parents (a root commit) has generation number one.
+
+ * A commit with at least one parent has generation number one more than
+ the largest generation number among its parents.
+
+Equivalently, the generation number of a commit A is one more than the
+length of a longest path from A to a root commit. The recursive definition
+is easier to use for computation and observing the following property:
+
+ If A and B are commits with generation numbers N and M, respectively,
+ and N <= M, then A cannot reach B. That is, we know without searching
+ that B is not an ancestor of A because it is further from a root commit
+ than A.
+
+ Conversely, when checking if A is an ancestor of B, then we only need
+ to walk commits until all commits on the walk boundary have generation
+ number at most N. If we walk commits using a priority queue seeded by
+ generation numbers, then we always expand the boundary commit with highest
+ generation number and can easily detect the stopping condition.
+
+This property can be used to significantly reduce the time it takes to
+walk commits and determine topological relationships. Without generation
+numbers, the general heuristic is the following:
+
+ If A and B are commits with commit time X and Y, respectively, and
+ X < Y, then A _probably_ cannot reach B.
+
+This heuristic is currently used whenever the computation is allowed to
+violate topological relationships due to clock skew (such as "git log"
+with default order), but is not used when the topological order is
+required (such as merge base calculations, "git log --graph").
+
+In practice, we expect some commits to be created recently and not stored
+in the commit graph. We can treat these commits as having "infinite"
+generation number and walk until reaching commits with known generation
+number.
+
+Design Details
+--------------
+
+- The commit graph file is stored in a file named 'commit-graph' in the
+ .git/objects/info directory. This could be stored in the info directory
+ of an alternate.
+
+- The core.commitGraph config setting must be on to consume graph files.
+
+- The file format includes parameters for the object ID hash function,
+ so a future change of hash algorithm does not require a change in format.
+
+Future Work
+-----------
+
+- The commit graph feature currently does not honor commit grafts. This can
+ be remedied by duplicating or refactoring the current graft logic.
+
+- The 'commit-graph' subcommand does not have a "verify" mode that is
+ necessary for integration with fsck.
+
+- The file format includes room for precomputed generation numbers. These
+ are not currently computed, so all generation numbers will be marked as
+ 0 (or "uncomputed"). A later patch will include this calculation.
+
+- After computing and storing generation numbers, we must make graph
+ walks aware of generation numbers to gain the performance benefits they
+ enable. This will mostly be accomplished by swapping a commit-date-ordered
+ priority queue with one ordered by generation number. The following
+ operations are important candidates:
+
+ - paint_down_to_common()
+ - 'log --topo-order'
+
+- Currently, parse_commit_gently() requires filling in the root tree
+ object for a commit. This passes through lookup_tree() and consequently
+ lookup_object(). Also, it calls lookup_commit() when loading the parents.
+ These method calls check the ODB for object existence, even if the
+ consumer does not need the content. For example, we do not need the
+ tree contents when computing merge bases. Now that commit parsing is
+ removed from the computation time, these lookup operations are the
+ slowest operations keeping graph walks from being fast. Consider
+ loading these objects without verifying their existence in the ODB and
+ only loading them fully when consumers need them. Consider a method
+ such as "ensure_tree_loaded(commit)" that fully loads a tree before
+ using commit->tree.
+
+- The current design uses the 'commit-graph' subcommand to generate the graph.
+ When this feature stabilizes enough to recommend to most users, we should
+ add automatic graph writes to common operations that create many commits.
+ For example, one could compute a graph on 'clone', 'fetch', or 'repack'
+ commands.
+
+- A server could provide a commit graph file as part of the network protocol
+ to avoid extra calculations by clients. This feature is only of benefit if
+ the user is willing to trust the file, because verifying the file is correct
+ is as hard as computing it from scratch.
+
+Related Links
+-------------
+[0] https://bugs.chromium.org/p/git/issues/detail?id=8
+ Chromium work item for: Serialized Commit Graph
+
+[1] https://public-inbox.org/git/20110713070517.GC18566@sigill.intra.peff.net/
+ An abandoned patch that introduced generation numbers.
+
+[2] https://public-inbox.org/git/20170908033403.q7e6dj7benasrjes@sigill.intra.peff.net/
+ Discussion about generation numbers on commits and how they interact
+ with fsck.
+
+[3] https://public-inbox.org/git/20170908034739.4op3w4f2ma5s65ku@sigill.intra.peff.net/
+ More discussion about generation numbers and not storing them inside
+ commit objects. A valuable quote:
+
+ "I think we should be moving more in the direction of keeping
+ repo-local caches for optimizations. Reachability bitmaps have been
+ a big performance win. I think we should be doing the same with our
+ properties of commits. Not just generation numbers, but making it
+ cheap to access the graph structure without zlib-inflating whole
+ commit objects (i.e., packv4 or something like the "metapacks" I
+ proposed a few years ago)."
+
+[4] https://public-inbox.org/git/20180108154822.54829-1-git@jeffhostetler.com/T/#u
+ A patch to remove the ahead-behind calculation from 'status'.
--
2.17.0.14.gba1221a8ce
next prev parent reply other threads:[~2018-04-02 20:35 UTC|newest]
Thread overview: 110+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-02-27 2:32 [PATCH v5 00/13] Serialized Git Commit Graph Derrick Stolee
2018-02-27 2:32 ` [PATCH v5 01/13] commit-graph: add format document Derrick Stolee
2018-02-27 2:32 ` [PATCH v5 02/13] graph: add commit graph design document Derrick Stolee
2018-02-27 2:32 ` [PATCH v5 03/13] commit-graph: create git-commit-graph builtin Derrick Stolee
2018-02-27 2:32 ` [PATCH v5 04/13] csum-file: add CSUM_KEEP_OPEN flag Derrick Stolee
2018-03-12 13:55 ` Derrick Stolee
2018-03-13 21:42 ` Junio C Hamano
2018-03-14 2:26 ` Derrick Stolee
2018-03-14 17:00 ` Junio C Hamano
2018-02-27 2:32 ` [PATCH v5 05/13] commit-graph: implement write_commit_graph() Derrick Stolee
2018-02-27 2:33 ` [PATCH v5 06/13] commit-graph: implement 'git-commit-graph write' Derrick Stolee
2018-02-27 2:33 ` [PATCH v5 07/13] commit-graph: implement git commit-graph read Derrick Stolee
2018-02-27 2:33 ` [PATCH v5 08/13] commit-graph: add core.commitGraph setting Derrick Stolee
2018-02-27 2:33 ` [PATCH v5 09/13] commit-graph: close under reachability Derrick Stolee
2018-02-27 2:33 ` [PATCH v5 10/13] commit: integrate commit graph with commit parsing Derrick Stolee
2018-02-27 2:33 ` [PATCH v5 11/13] commit-graph: read only from specific pack-indexes Derrick Stolee
2018-02-27 20:15 ` Stefan Beller
2018-02-27 2:33 ` [PATCH v5 12/13] commit-graph: build graph from starting commits Derrick Stolee
2018-02-27 2:33 ` [PATCH v5 13/13] commit-graph: implement "--additive" option Derrick Stolee
2018-02-27 18:50 ` [PATCH v5 00/13] Serialized Git Commit Graph Stefan Beller
2018-03-14 19:27 ` [PATCH v6 00/14] " Derrick Stolee
2018-03-14 19:27 ` [PATCH v6 01/14] csum-file: rename hashclose() to finalize_hashfile() Derrick Stolee
2018-03-14 19:27 ` [PATCH v6 02/14] csum-file: refactor finalize_hashfile() method Derrick Stolee
2018-03-14 19:27 ` [PATCH v6 03/14] commit-graph: add format document Derrick Stolee
2018-03-14 19:27 ` [PATCH v6 04/14] graph: add commit graph design document Derrick Stolee
2018-03-14 19:27 ` [PATCH v6 05/14] commit-graph: create git-commit-graph builtin Derrick Stolee
2018-03-14 19:27 ` [PATCH v6 06/14] commit-graph: implement write_commit_graph() Derrick Stolee
2018-03-14 19:27 ` [PATCH v6 07/14] commit-graph: implement 'git-commit-graph write' Derrick Stolee
2018-03-18 13:25 ` Ævar Arnfjörð Bjarmason
2018-03-19 13:12 ` Derrick Stolee
2018-03-19 14:36 ` Ævar Arnfjörð Bjarmason
2018-03-19 18:27 ` Derrick Stolee
2018-03-19 18:48 ` Ævar Arnfjörð Bjarmason
2018-03-14 19:27 ` [PATCH v6 08/14] commit-graph: implement git commit-graph read Derrick Stolee
2018-03-14 19:27 ` [PATCH v6 09/14] commit-graph: add core.commitGraph setting Derrick Stolee
2018-03-14 19:27 ` [PATCH v6 10/14] commit-graph: close under reachability Derrick Stolee
2018-03-14 19:27 ` [PATCH v6 11/14] commit: integrate commit graph with commit parsing Derrick Stolee
2018-03-14 19:27 ` [PATCH v6 12/14] commit-graph: read only from specific pack-indexes Derrick Stolee
2018-03-15 22:50 ` SZEDER Gábor
2018-03-19 13:13 ` Derrick Stolee
2018-03-14 19:27 ` [PATCH v6 13/14] commit-graph: build graph from starting commits Derrick Stolee
2018-03-14 19:27 ` [PATCH v6 14/14] commit-graph: implement "--additive" option Derrick Stolee
2018-03-14 20:10 ` [PATCH v6 00/14] Serialized Git Commit Graph Ramsay Jones
2018-03-14 20:43 ` Junio C Hamano
2018-03-15 17:23 ` Johannes Schindelin
2018-03-15 18:41 ` Junio C Hamano
2018-03-15 21:51 ` Ramsay Jones
2018-03-16 11:50 ` Johannes Schindelin
2018-03-16 17:27 ` Junio C Hamano
2018-03-19 11:41 ` Johannes Schindelin
2018-03-16 16:28 ` Lars Schneider
2018-03-19 13:10 ` Derrick Stolee
2018-03-16 15:06 ` Ævar Arnfjörð Bjarmason
2018-03-16 16:38 ` SZEDER Gábor
2018-03-16 18:33 ` Junio C Hamano
2018-03-16 19:48 ` SZEDER Gábor
2018-03-16 20:06 ` Jeff King
2018-03-16 20:19 ` Jeff King
2018-03-19 12:55 ` Derrick Stolee
2018-03-20 1:17 ` Derrick Stolee
2018-03-16 20:49 ` Jeff King
2018-04-02 20:34 ` [PATCH v7 " Derrick Stolee
2018-04-02 20:34 ` [PATCH v7 01/14] csum-file: rename hashclose() to finalize_hashfile() Derrick Stolee
2018-04-02 20:34 ` [PATCH v7 02/14] csum-file: refactor finalize_hashfile() method Derrick Stolee
2018-04-07 22:59 ` Jakub Narebski
2018-04-02 20:34 ` [PATCH v7 03/14] commit-graph: add format document Derrick Stolee
2018-04-07 23:49 ` Jakub Narebski
2018-04-02 20:34 ` Derrick Stolee [this message]
2018-04-08 11:06 ` [PATCH v7 04/14] graph: add commit graph design document Jakub Narebski
2018-04-02 20:34 ` [PATCH v7 05/14] commit-graph: create git-commit-graph builtin Derrick Stolee
2018-04-02 20:34 ` [PATCH v7 06/14] commit-graph: implement write_commit_graph() Derrick Stolee
2018-04-02 20:34 ` [PATCH v7 07/14] commit-graph: implement git-commit-graph write Derrick Stolee
2018-04-08 11:59 ` Jakub Narebski
2018-04-02 20:34 ` [PATCH v7 08/14] commit-graph: implement git commit-graph read Derrick Stolee
2018-04-02 21:33 ` Junio C Hamano
2018-04-03 11:49 ` Derrick Stolee
2018-04-08 12:59 ` Jakub Narebski
2018-04-02 20:34 ` [PATCH v7 09/14] commit-graph: add core.commitGraph setting Derrick Stolee
2018-04-08 13:39 ` Jakub Narebski
2018-04-02 20:34 ` [PATCH v7 10/14] commit-graph: close under reachability Derrick Stolee
2018-04-02 20:34 ` [PATCH v7 11/14] commit: integrate commit graph with commit parsing Derrick Stolee
2018-04-02 20:34 ` [PATCH v7 12/14] commit-graph: read only from specific pack-indexes Derrick Stolee
2018-04-02 20:34 ` [PATCH v7 13/14] commit-graph: build graph from starting commits Derrick Stolee
2018-04-08 13:50 ` Jakub Narebski
2018-04-02 20:34 ` [PATCH v7 14/14] commit-graph: implement "--additive" option Derrick Stolee
2018-04-05 8:27 ` SZEDER Gábor
2018-04-10 12:55 ` [PATCH v8 00/14] Serialized Git Commit Graph Derrick Stolee
2018-04-10 12:55 ` [PATCH v8 01/14] csum-file: rename hashclose() to finalize_hashfile() Derrick Stolee
2018-04-10 12:55 ` [PATCH v8 02/14] csum-file: refactor finalize_hashfile() method Derrick Stolee
2018-04-10 12:55 ` [PATCH v8 03/14] commit-graph: add format document Derrick Stolee
2018-04-10 19:10 ` Stefan Beller
2018-04-10 19:18 ` Derrick Stolee
2018-04-11 20:58 ` Jakub Narebski
2018-04-12 11:28 ` Derrick Stolee
2018-04-13 22:07 ` Jakub Narebski
2018-04-10 12:55 ` [PATCH v8 04/14] graph: add commit graph design document Derrick Stolee
2018-04-15 22:48 ` Jakub Narebski
2018-04-10 12:55 ` [PATCH v8 05/14] commit-graph: create git-commit-graph builtin Derrick Stolee
2018-04-10 12:56 ` [PATCH v8 06/14] commit-graph: implement write_commit_graph() Derrick Stolee
2018-04-10 12:56 ` [PATCH v8 07/14] commit-graph: implement git-commit-graph write Derrick Stolee
2018-04-10 12:56 ` [PATCH v8 08/14] commit-graph: implement git commit-graph read Derrick Stolee
2018-04-14 22:15 ` Jakub Narebski
2018-04-15 3:26 ` Eric Sunshine
2018-04-10 12:56 ` [PATCH v8 09/14] commit-graph: add core.commitGraph setting Derrick Stolee
2018-04-14 18:33 ` Jakub Narebski
2018-04-10 12:56 ` [PATCH v8 10/14] commit-graph: close under reachability Derrick Stolee
2018-04-10 12:56 ` [PATCH v8 11/14] commit: integrate commit graph with commit parsing Derrick Stolee
2018-04-10 12:56 ` [PATCH v8 12/14] commit-graph: read only from specific pack-indexes Derrick Stolee
2018-04-10 12:56 ` [PATCH v8 13/14] commit-graph: build graph from starting commits Derrick Stolee
2018-04-10 12:56 ` [PATCH v8 14/14] commit-graph: implement "--append" option Derrick Stolee
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: http://vger.kernel.org/majordomo-info.html
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20180402203427.170177-5-dstolee@microsoft.com \
--to=stolee@gmail.com \
--cc=dstolee@microsoft.com \
--cc=git@jeffhostetler.com \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=peff@peff.net \
--cc=ramsay@ramsayjones.plus.com \
--cc=sbeller@google.com \
--cc=szeder.dev@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).