From: Derrick Stolee <stolee@gmail.com>
To: git@vger.kernel.org
Cc: gitster@pobox.com, peff@peff.net, git@jeffhostetler.com,
sbeller@google.com, dstolee@microsoft.com
Subject: [PATCH 01/14] graph: add packed graph design document
Date: Thu, 25 Jan 2018 09:02:18 -0500
Message-ID: <20180125140231.65604-2-dstolee@microsoft.com>
In-Reply-To: <20180125140231.65604-1-dstolee@microsoft.com>

Add Documentation/technical/packed-graph.txt with details of the planned
packed graph feature, including future plans.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
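
As a rough illustration of the generation number definition and the
stopping condition described in the design notes in this patch, here is
a minimal Python sketch. It is not the implementation planned for Git
(which is written in C). The function names compute_generations() and
is_ancestor(), and the representation of the graph as a list indexed by
integer ID whose entries list the parents' integer IDs, are assumptions
made only for this example. The sketch also assumes the graph is closed
under reachability, which the design notes list as future work.

```python
import heapq


def compute_generations(parents):
    """Return the generation number of every commit.

    parents[i] lists the integer IDs of the parents of commit i
    (hypothetical layout). A root commit gets generation number 1.
    """
    gen = [0] * len(parents)  # 0 means "uncomputed", as in the notes
    for start in range(len(parents)):
        if gen[start]:
            continue
        stack = [start]  # iterative DFS, so deep histories do not recurse
        while stack:
            c = stack[-1]
            pending = [p for p in parents[c] if gen[p] == 0]
            if pending:
                stack.extend(pending)
            else:
                # One more than the largest generation among the parents.
                gen[c] = 1 + max((gen[p] for p in parents[c]), default=0)
                stack.pop()
    return gen


def is_ancestor(parents, gen, a, b):
    """Return True if commit a is reachable from commit b (or a == b)."""
    if a == b:
        return True
    heap = [(-gen[b], b)]  # max-heap on generation number via negation
    seen = {b}
    while heap:
        neg_gen, c = heapq.heappop(heap)
        if -neg_gen <= gen[a]:
            # Every commit left on the boundary has generation <= gen[a]
            # and is not a itself, so none of them can reach a.
            return False
        for p in parents[c]:
            if p == a:
                return True
            if p not in seen:
                seen.add(p)
                heapq.heappush(heap, (-gen[p], p))
    return False


# Example history: 0 is a root, 3 merges 1 and 2, 5 is a side branch off 2.
example = [[], [0], [0], [1, 2], [3], [2]]
example_gen = compute_generations(example)   # [1, 2, 2, 3, 4, 3]
print(is_ancestor(example, example_gen, 1, 4))  # True
print(is_ancestor(example, example_gen, 1, 5))  # False
```

In the second query the walk from commit 5 stops after expanding one
commit, because the only commit left on the boundary has generation
number 2, which is not larger than that of commit 1.
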
Documentation/technical/packed-graph.txt | 185 +++++++++++++++++++++++++++++++
1 file changed, 185 insertions(+)
create mode 100644 Documentation/technical/packed-graph.txt
diff --git a/Documentation/technical/packed-graph.txt b/Documentation/technical/packed-graph.txt
new file mode 100644
index 0000000000..fcc0c83874
--- /dev/null
+++ b/Documentation/technical/packed-graph.txt
@@ -0,0 +1,185 @@
+Git Packed Graph Design Notes
+=============================
+
+Git walks the commit graph for many reasons, including:
+
+1. Listing and filtering commit history.
+2. Computing merge bases.
+
+These operations can become slow as the commit count grows above 100K.
+The merge base calculation shows up in many user-facing commands, such
+as 'status' and 'fetch', and can take minutes to compute depending on
+data shape. There are two main costs here:
+
+1. Decompressing and parsing commits.
+2. Walking the entire graph to avoid topological order mistakes.
+
+The packed graph is a file that stores the commit graph structure along
+with some extra metadata to speed up graph walks. This format allows a
+consumer to load the following info for a commit:
+
+1. The commit OID.
+2. The list of parents.
+3. The commit date.
+4. The root tree OID.
+5. An integer ID for fast lookups in the graph.
+6. The generation number (see definition below).
+
+Values 1-4 satisfy the requirements of parse_commit_gently().
+
+By providing an integer ID we can avoid lookups in the graph as we walk
+commits. Specifically, we need to provide the integer IDs of the parent
+commits so that we can navigate directly to their information on request.
+
+Define the "generation number" of a commit recursively as follows:
+ * A commit with no parents (a root commit) has generation number 1.
+ * A commit with at least one parent has generation number 1 more than
+ the largest generation number among its parents.
+Equivalently, the generation number is one more than the length of a
+longest path from the commit to a root commit. The recursive definition
+is easier to use for computation and for proving the following property:
+
+ If A and B are distinct commits with generation numbers N and M,
+ respectively, and N <= M, then A cannot reach B. That is, we know
+ without searching that B is not an ancestor of A because B is at least
+ as far from a root commit as A.
+
+ Conversely, when checking if A is an ancestor of B, we only need to
+ walk commits until all commits on the walk boundary have generation
+ number at most N. If we walk commits using a priority queue ordered by
+ generation number, then we always expand the boundary commit with the
+ highest generation number and can easily detect the stopping condition.
+
+This property can be used to significantly reduce the time it takes to
+walk commits and determine topological relationships. Without generation
+numbers, the general heuristic is the following:
+
+ If A and B are commits with commit times X and Y, respectively, and
+ X < Y, then A _probably_ cannot reach B.
+
+This heuristic is currently used whenever the computation is allowed to
+make mistakes with topological order (such as "git log" with default
+order), but is not used when the topological order is required (such as
+merge base calculations and "git log --graph").
+
+Design Details
+--------------
+
+- A graph file is stored in a file named 'graph-<oid>.graph' in the pack
+ directory. This could be stored in an alternate.
+
+- The most-recent graph file OID is stored in a 'graph-head' file for
+ immediate access and storing backup graphs. This could be stored in an
+ alternate, and refers to a 'graph-<oid>.graph' file in the same pack
+ directory.
+
+- The core.graph config setting must be on to create or consume graph files.
+
+- The graph file is only a supplemental structure. If a user downgrades
+ or disables the 'core.graph' config setting, then the existing ODB is
+ sufficient.
+
+- The file format includes parameters for the object id length
+ and hash algorithm, so a future change of hash algorithm does
+ not require a change in format.
+
+Current Limitations
+-------------------
+
+- Only one graph file is used at a time. This allows the integer ID to
+ seek into the single graph file. It is possible to extend the model
+ to multiple graph files, but that is currently not part of the design.
+
+- .graph files are managed only by the 'graph' builtin. These are not
+ updated automatically during clone or fetch.
+
+- There is no '--verify' option for the 'graph' builtin to verify the
+ contents of the graph file.
+
+- The graph only considers commits existing in packfiles and does not
+ walk to fill in reachable commits. [Small]
+
+- When rewriting the graph, we do not check that each commit still exists
+ in the ODB, so the graph may list commits removed by garbage collection.
+
+- Generation numbers are not computed in the current version. The file
+ format supports storing them, along with a mechanism to upgrade from
+ a file without generation numbers to one that uses them.
+
+Future Work
+-----------
+
+- The file format includes room for precomputed generation numbers. These
+ are not currently computed, so all generation numbers will be marked as
+ 0 (or "uncomputed"). A later patch will include this calculation.
+
+- The current implementation of the 'graph' builtin walks all packed objects
+ to find a complete list of commits in packfiles. If some commits are
+ stored as loose objects, then these do not appear in the graph. This is
+ handled gracefully by the file format, but it would cause incorrect
+ generation number calculations. We should implement the construct_graph()
+ function in a way that walks all commits reachable from some starting set
+ so that it can use complete information for generation numbers. (Some
+ care must be taken around shallow clones.)
+
+- The graph is not currently integrated with grafts.
+
+- After computing and storing generation numbers, we must make graph
+ walks aware of generation numbers to gain performance benefits. This
+ will mostly be accomplished by replacing a commit-date-ordered priority
+ queue with one ordered by generation number. The following operations
+ are important candidates:
+
+ - paint_down_to_common()
+ - 'log --topo-order'
+
+- The graph currently only adds commits to a previously existing graph.
+ When writing a new graph, we could check that the ODB still contains
+ the commits and choose to remove the commits that are deleted from the
+ ODB. For performance reasons, this check should remain optional.
+
+- Currently, parse_commit_gently() requires filling in the root tree
+ object for a commit. This passes through lookup_tree() and consequently
+ lookup_object(). Also, it calls lookup_commit() when loading the parents.
+ These function calls check the ODB for object existence, even if the
+ consumer does not need the content. For example, we do not need the
+ tree contents when computing merge bases. Now that commit parsing is
+ removed from the computation time, these lookup operations are the
+ slowest operations keeping graph walks from being fast. Consider
+ loading these objects without verifying their existence in the ODB and
+ only loading them fully when consumers need them. Consider a function
+ such as "ensure_tree_loaded(commit)" that fully loads a tree before
+ using commit->tree.
+
+- The current design uses the 'graph' builtin to generate the graph. When
+ this feature stabilizes enough to recommend to most users, we should
+ add automatic graph writes to common operations that create many commits.
+ For example, one could compute a graph on 'clone' and 'fetch' commands.
+
+Related Links
+-------------
+
+[0] https://bugs.chromium.org/p/git/issues/detail?id=8
+ Chromium work item for: Serialized Commit Graph
+
+[1] https://public-inbox.org/git/20110713070517.GC18566@sigill.intra.peff.net/
+ An abandoned patch that introduced generation numbers.
+
+[2] https://public-inbox.org/git/20170908033403.q7e6dj7benasrjes@sigill.intra.peff.net/
+ Discussion about generation numbers on commits and how they interact
+ with fsck.
+
+[3] https://public-inbox.org/git/20170907094718.b6kuzp2uhvkmwcso@sigill.intra.peff.net/t/#m7a2ea7b355aeda962e6b86404bcbadc648abfbba
+ More discussion about generation numbers and not storing them inside
+ commit objects. A valuable quote:
+
+ "I think we should be moving more in the direction of keeping
+ repo-local caches for optimizations. Reachability bitmaps have been
+ a big performance win. I think we should be doing the same with our
+ properties of commits. Not just generation numbers, but making it
+ cheap to access the graph structure without zlib-inflating whole
+ commit objects (i.e., packv4 or something like the "metapacks" I
+ proposed a few years ago)."
+
+[4] https://public-inbox.org/git/20180108154822.54829-1-git@jeffhostetler.com/T/#u
+ A patch to remove the ahead-behind calculation from 'status'.
--
2.16.0