list mirror (unofficial, one of many)
 help / color / mirror / code / Atom feed
* [PATCH 00/17] [RFC] Commit-graph: Write incremental files
@ 2019-05-08 15:53 Derrick Stolee via GitGitGadget
  2019-05-08 15:53 ` [PATCH 01/17] commit-graph: fix the_repository reference Derrick Stolee via GitGitGadget
                   ` (18 more replies)
  0 siblings, 19 replies; 136+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2019-05-08 15:53 UTC (permalink / raw)
  To: git; +Cc: peff, avarab, git, jrnieder, steadmon, Junio C Hamano

This patch series is marked as RFC quality because it is missing some key
features and tests, but hopefully starts a concrete discussion of how the
incremental commit-graph writes can work. If this is a good direction, then
it would replace ds/commit-graph-format-v2.

The commit-graph is a valuable performance feature for repos with large
commit histories, but suffers from the same problem as git repack: it
rewrites the entire file every time. This can be slow when there are
millions of commits, especially after we stopped reading from the
commit-graph file during a write in 43d3561 (commit-graph write: don't die
if the existing graph is corrupt).

Instead, create a "stack" of commit-graphs where the existing commit-graph 
file is "level 0" and other levels are in files 
$OBJDIR/info/commit-graphs/commit-graph-N for positive N. Each level is
closed under reachability with its lower levels, and the idea of "graph
position" now considers the concatenation of the commit orders from each
level. See PATCH 12 for more details.

When writing, we don't always want to add a new level to the stack. This
would eventually result in performance degradation, especially when
searching for a commit (before we know its graph position). We decide to
merge levels of the stack when the new commits we will write satisfy two

 1. The expected size of the new file is more than half the size of the tip
    of the stack.
 2. The new file contains more than 64,000 commits.

The first condition alone would prevent more than a logarithmic number of
levels. The second condition is a stop-gap to prevent performance issues
when another process starts reading the commit-graph stack as we are merging
a large stack of commit-graph files. The reading process could be in a state
where the new file is not ready, but the levels above the new file were
already deleted. Thus, the commits that were merged down must be parsed from

The performance is necessarily amortized across multiple writes, so I tested
by writing commit-graphs from the (non-rc) tags in the Linux repo. My test
included 72 tags, and wrote everything reachable from the tag using 
--stdin-commits. Here are the overall perf numbers:

git commit-graph write --stdin-commits:         8m 12s
git commit-graph write --stdin-commits --split:    45s

The test using --split included at least six full collapses to the full
commit-graph. I believe the commit-graph stack had at most three levels
during this test.

This series is long because I felt the need to refactor write_commit_graph()
before making such a sweeping change to the format.

 * Patches 1-4: these are small changes which either fix issues or just
   provide clean-up. These are mostly borrowed from
 * Patches 5-11: these provide a non-functional refactor of
   write_commit_graph() into several methods using a "struct
   write_commit_graph_context" to share across the methods.
 * Patches 12-16: Implement the split commit-graph feature.
 * Patch 17: Demonstrate the value by writing a split commit-graph during 
   git fetch when the new config setting fetch.writeCommitGraph is true.

TODO: There are several things missing that need to be added before this
series is ready for full review and merging:

 1. The documentation for git commit-graph needs updating for the --split 
 2. We likely want config settings for the merge strategy. This is mentioned
    in the design doc, and could be saved for later.
 3. We want to update the git commit-graph verify subcommand to understand
    the commit-graph stack and optionally only verify the tip of the stack.
    This allows faster (amortized) verification if we are verifying
    immediately after writes and trusting the files at rest.
 4. It would be helpful to add a new optional chunk that contains the
    trailing hash for the lower level of the commit-graph stack. This chunk
    would only be for the commit-graph-N files, and would provide a simple
    way to check that the stack is valid on read, in case we are still
    worried about other processes reading/writing in the wrong order.
 5. Currently, --split essentially implies --append since we either (a)
    don't change the existing stack and only add commits, or (b) add all
    existing commits while merging files. However, if you would use --append 
    with --split, the append logic will trigger a merge with the current tip
    (at minimum). Some care should be taken to make this more clear.

Thanks, -Stolee

commit-graph write: don't die if the existing graph is corrupt

Derrick Stolee (17):
  commit-graph: fix the_repository reference
  commit-graph: return with errors during write
  commit-graph: collapse parameters into flags
  commit-graph: remove Future Work section
  commit-graph: create write_commit_graph_context
  commit-graph: extract fill_oids_from_packs()
  commit-graph: extract fill_oids_from_commit_hex()
  commit-graph: extract fill_oids_from_all_packs()
  commit-graph: extract count_distinct_commits()
  commit-graph: extract copy_oids_to_commits()
  commit-graph: extract write_commit_graph_file()
  Documentation: describe split commit-graphs
  commit-graph: lay groundwork for incremental files
  commit-graph: load split commit-graph files
  commit-graph: write split commit-graph files
  commit-graph: add --split option
  fetch: add fetch.writeCommitGraph config setting

 Documentation/technical/commit-graph.txt | 157 +++-
 builtin/commit-graph.c                   |  31 +-
 builtin/commit.c                         |   5 +-
 builtin/fetch.c                          |  17 +
 builtin/gc.c                             |   7 +-
 commit-graph.c                           | 946 ++++++++++++++++-------
 commit-graph.h                           |  19 +-
 commit.c                                 |   2 +-
 t/                  |  28 +-
 9 files changed, 887 insertions(+), 325 deletions(-)

base-commit: 93b4405ffe4ad9308740e7c1c71383bfc369baaa
Fetch-It-Via: git fetch pr-184/derrickstolee/graph/incremental-v1

^ permalink raw reply	[flat|nested] 136+ messages in thread