git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com>
To: git@vger.kernel.org
Cc: me@ttaylorr.com, szeder.dev@gmail.com, l.s.r@web.de,
	Derrick Stolee <dstolee@microsoft.com>
Subject: [PATCH v4 00/10] More commit-graph/Bloom filter improvements
Date: Wed, 01 Jul 2020 13:27:20 +0000	[thread overview]
Message-ID: <pull.659.v4.git.1593610050.gitgitgadget@gmail.com> (raw)
In-Reply-To: <pull.659.v3.git.1593174636.gitgitgadget@gmail.com>

This builds on sg/commit-graph-cleanups, which took several patches from
Szeder's series [1] and applied them almost directly to a more-recent
version of Git [2].

[1] https://lore.kernel.org/git/20200529085038.26008-1-szeder.dev@gmail.com/
[2] 
https://lore.kernel.org/git/pull.650.git.1591362032.gitgitgadget@gmail.com/

This series adds a few extra improvements, several of which are rooted in
Szeder's original series. I maintained his authorship and sign-off, even
though the patches did not apply or cherry-pick at all.

(In v2, I have removed the range-diff comparison to Szeder's series, so look
at the v1 cover letter for that.)

The patches have been significantly reordered. René pointed out (and Szeder
discovered in the old thread) that we are not re-using the
bloom_filter_settings from the existing commit-graph when writing a new one.

 1. commit-graph: place bloom_settings in context
 2. commit-graph: change test to die on parse, not load

These are mostly the same, except we now use a pointer to the settings in
the commit-graph write context.

 3. bloom: get_bloom_filter() cleanups

This new patch is a subtle change in behavior that will become relevant in
the very next patch. In fact, if we swap patch 3 and 4, then
t4216-log-bloom.sh fails with a segfault due to a NULL filter.

 4. commit-graph: persist existence of changed-paths

This patch is now updated to use the existing changed-path filter settings.

 5. commit-graph: unify the signatures of all write_graph_chunk_*()
    functions
 6. commit-graph: simplify chunk writes into loop
 7. commit-graph: check chunk sizes after writing

These are all the same as before.

 8. revision.c: fix whitespace

This patch is the cleanup part of Taylor's patch.

 9. revision: empty pathspecs should not use Bloom filters

Here is Taylor's fix for empty pathspecs.

 10. commit-graph: check all leading directories in changed path Bloom
     filters
 11. bloom: enforce a minimum size of 8 bytes

Finally, we get these performance patches. Patch 10 is updated to have the
better logic around directory separators and empty paths. Also, the list of
Bloom keys is ordered with the deepest path first. That has some tiny
performance benefits for deep paths since we can short-circuit the multi-key
checks more often. That code path is much faster than the tree parsing, so
it is hard to measure any change.

Updates in V3:

 * Responded to René's feedback.
 * Fixed the test in Patch 4 to use GIT_TEST_ variables and extend the
   GIT_TRACE2 depth to work with 'seen' branch.

Update in V4;

 * Fixed the bug with "too large" commits. Test is added. The fixup! I sent
   earlier doesn't actually squash cleanly, so I resolved the conflicts
   during the rebase.

Thanks, -Stolee

Derrick Stolee (5):
  commit-graph: place bloom_settings in context
  commit-graph: change test to die on parse, not load
  bloom: fix logic in get_bloom_filter()
  commit-graph: persist existence of changed-paths
  revision.c: fix whitespace

SZEDER Gábor (4):
  commit-graph: unify the signatures of all write_graph_chunk_*()
    functions
  commit-graph: simplify chunk writes into loop
  commit-graph: check chunk sizes after writing
  commit-graph: check all leading directories in changed path Bloom
    filters

Taylor Blau (1):
  revision: empty pathspecs should not use Bloom filters

 Documentation/git-commit-graph.txt |   5 +-
 bloom.c                            |  14 ++-
 builtin/commit-graph.c             |   5 +-
 commit-graph.c                     | 146 +++++++++++++++++++++--------
 commit-graph.h                     |   3 +-
 revision.c                         |  63 +++++++++----
 revision.h                         |   6 +-
 t/t4216-log-bloom.sh               |  45 ++++++++-
 t/t5318-commit-graph.sh            |   2 +-
 9 files changed, 215 insertions(+), 74 deletions(-)


base-commit: 7fbfe07ab4d4e58c0971dac73001b89f180a0af3
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-659%2Fderrickstolee%2Fbloom-2-v4
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-659/derrickstolee/bloom-2-v4
Pull-Request: https://github.com/gitgitgadget/git/pull/659

Range-diff vs v3:

  1:  57002040bc =  1:  57002040bc commit-graph: place bloom_settings in context
  2:  6b63f9bd8a =  2:  6b63f9bd8a commit-graph: change test to die on parse, not load
  3:  2f809499ab !  3:  3c532ebabc bloom: fix logic in get_bloom_filter()
     @@ Commit message
      
          Also clean up some style issues while we are here.
      
     +    One side-effect of returning a NULL filter is that the filters that are
     +    reported as "too large" will now be reported as NULL insead of length
     +    zero. This case was not properly covered before, so add a test. Further,
     +    remote the counting of the zero-length filters from revision.c and the
     +    trace2 logs.
     +
          Helped-by: René Scharfe <l.s.r@web.de>
     +    Helped-by: SZEDER Gábor <szeder.dev@gmail.com>
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
       ## bloom.c ##
     @@ bloom.c: struct bloom_filter *get_bloom_filter(struct repository *r,
       
       	repo_diff_setup(r, &diffopt);
       	diffopt.flags.recursive = 1;
     +
     + ## commit-graph.c ##
     +@@ commit-graph.c: static void write_graph_chunk_bloom_indexes(struct hashfile *f,
     + 
     + 	while (list < last) {
     + 		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
     +-		cur_pos += filter->len;
     ++		size_t len = filter ? filter->len : 0;
     ++		cur_pos += len;
     + 		display_progress(progress, ++i);
     + 		hashwrite_be32(f, cur_pos);
     + 		list++;
     +@@ commit-graph.c: static void write_graph_chunk_bloom_data(struct hashfile *f,
     + 
     + 	while (list < last) {
     + 		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
     ++		size_t len = filter ? filter->len : 0;
     + 		display_progress(progress, ++i);
     +-		hashwrite(f, filter->data, filter->len * sizeof(unsigned char));
     ++
     ++		if (len)
     ++			hashwrite(f, filter->data, len * sizeof(unsigned char));
     + 		list++;
     + 	}
     + 
     +
     + ## revision.c ##
     +@@ revision.c: static unsigned int count_bloom_filter_maybe;
     + static unsigned int count_bloom_filter_definitely_not;
     + static unsigned int count_bloom_filter_false_positive;
     + static unsigned int count_bloom_filter_not_present;
     +-static unsigned int count_bloom_filter_length_zero;
     + 
     + static void trace2_bloom_filter_statistics_atexit(void)
     + {
     +@@ revision.c: static void trace2_bloom_filter_statistics_atexit(void)
     + 
     + 	jw_object_begin(&jw, 0);
     + 	jw_object_intmax(&jw, "filter_not_present", count_bloom_filter_not_present);
     +-	jw_object_intmax(&jw, "zero_length_filter", count_bloom_filter_length_zero);
     + 	jw_object_intmax(&jw, "maybe", count_bloom_filter_maybe);
     + 	jw_object_intmax(&jw, "definitely_not", count_bloom_filter_definitely_not);
     + 	jw_object_intmax(&jw, "false_positive", count_bloom_filter_false_positive);
     +@@ revision.c: static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
     + 		return -1;
     + 	}
     + 
     +-	if (!filter->len) {
     +-		count_bloom_filter_length_zero++;
     +-		return -1;
     +-	}
     +-
     + 	result = bloom_filter_contains(filter,
     + 				       revs->bloom_key,
     + 				       revs->bloom_filter_settings);
     +
     + ## t/t4216-log-bloom.sh ##
     +@@ t/t4216-log-bloom.sh: setup () {
     + 
     + test_bloom_filters_used () {
     + 	log_args=$1
     +-	bloom_trace_prefix="statistics:{\"filter_not_present\":0,\"zero_length_filter\":0,\"maybe\""
     ++	bloom_trace_prefix="statistics:{\"filter_not_present\":0,\"maybe\""
     + 	setup "$log_args" &&
     + 	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" &&
     + 	test_cmp log_wo_bloom log_w_bloom &&
     +@@ t/t4216-log-bloom.sh: test_expect_success 'setup - add commit-graph to the chain with Bloom filters' '
     + 
     + test_bloom_filters_used_when_some_filters_are_missing () {
     + 	log_args=$1
     +-	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":8,\"definitely_not\":6"
     ++	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"maybe\":8,\"definitely_not\":6"
     + 	setup "$log_args" &&
     + 	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" &&
     + 	test_cmp log_wo_bloom log_w_bloom
     +@@ t/t4216-log-bloom.sh: test_expect_success 'Use Bloom filters if they exist in the latest but not all c
     + 	test_bloom_filters_used_when_some_filters_are_missing "-- A/B"
     + '
     + 
     ++test_expect_success 'correctly report changes over limit' '
     ++	git init 513changes &&
     ++	(
     ++		cd 513changes &&
     ++		for i in $(test_seq 1 513)
     ++		do
     ++			echo $i >file$i.txt || return 1
     ++		done &&
     ++		git add . &&
     ++		git commit -m "files" &&
     ++		git commit-graph write --reachable --changed-paths &&
     ++		for i in $(test_seq 1 513)
     ++		do
     ++			git -c core.commitGraph=false log -- file$i.txt >expect &&
     ++			git log -- file$i.txt >actual &&
     ++			test_cmp expect actual || return 1
     ++		done
     ++	)
     ++'
     ++
     + test_done
     + \ No newline at end of file
  4:  33e22d05cb !  4:  f1e3a8516e commit-graph: persist existence of changed-paths
     @@ t/t4216-log-bloom.sh: test_expect_success 'Use Bloom filters if they exist in th
      +	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15}" trace2-auto.txt
      +'
      +
     - test_done
     - \ No newline at end of file
     + test_expect_success 'correctly report changes over limit' '
     + 	git init 513changes &&
     + 	(
  5:  81c45d5260 =  5:  c079921473 commit-graph: unify the signatures of all write_graph_chunk_*() functions
  6:  8828dcd906 =  6:  5ed0ce20a4 commit-graph: simplify chunk writes into loop
  7:  ddbf297755 =  7:  b982c9bf80 commit-graph: check chunk sizes after writing
  8:  8b63706141 =  8:  af750d8887 revision.c: fix whitespace
  9:  7d6163305a =  9:  a95de3cceb revision: empty pathspecs should not use Bloom filters
 10:  40061233ca ! 10:  9c4a00ab08 commit-graph: check all leading directories in changed path Bloom filters
     @@ t/t4216-log-bloom.sh: test_expect_success 'setup - add commit-graph to the chain
       
       test_bloom_filters_used_when_some_filters_are_missing () {
       	log_args=$1
     --	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":8,\"definitely_not\":6"
     -+	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":6,\"definitely_not\":8"
     +-	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"maybe\":8,\"definitely_not\":6"
     ++	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"maybe\":6,\"definitely_not\":8"
       	setup "$log_args" &&
       	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" &&
       	test_cmp log_wo_bloom log_w_bloom

-- 
gitgitgadget

  parent reply	other threads:[~2020-07-01 13:27 UTC|newest]

Thread overview: 76+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-06-15 20:14 [PATCH 0/8] More commit-graph/Bloom filter improvements Derrick Stolee via GitGitGadget
2020-06-15 20:14 ` [PATCH 1/8] commit-graph: place bloom_settings in context Derrick Stolee via GitGitGadget
2020-06-18 20:30   ` René Scharfe
2020-06-19 12:58     ` Derrick Stolee
2020-06-15 20:14 ` [PATCH 2/8] commit-graph: unify the signatures of all write_graph_chunk_*() functions SZEDER Gábor via GitGitGadget
2020-06-18 20:30   ` René Scharfe
2020-06-15 20:14 ` [PATCH 3/8] commit-graph: simplify chunk writes into loop SZEDER Gábor via GitGitGadget
2020-06-18 20:30   ` René Scharfe
2020-06-15 20:14 ` [PATCH 4/8] commit-graph: check chunk sizes after writing SZEDER Gábor via GitGitGadget
2020-06-15 20:14 ` [PATCH 5/8] commit-graph: check all leading directories in changed path Bloom filters SZEDER Gábor via GitGitGadget
2020-06-18 20:31   ` René Scharfe
2020-06-19  9:14     ` René Scharfe
2020-06-19 17:17   ` Taylor Blau
2020-06-19 17:19     ` Taylor Blau
2020-06-23 13:47     ` Derrick Stolee
2020-06-15 20:14 ` [PATCH 6/8] bloom: enforce a minimum size of 8 bytes Derrick Stolee via GitGitGadget
2020-06-15 20:14 ` [PATCH 7/8] commit-graph: change test to die on parse, not load Derrick Stolee via GitGitGadget
2020-06-15 20:14 ` [PATCH 8/8] commit-graph: persist existence of changed-paths Derrick Stolee via GitGitGadget
2020-06-17 21:21 ` [PATCH 0/8] More commit-graph/Bloom filter improvements Junio C Hamano
2020-06-18  1:46   ` Derrick Stolee
2020-06-23 17:46 ` [PATCH v2 00/11] " Derrick Stolee via GitGitGadget
2020-06-23 17:47   ` [PATCH v2 01/11] commit-graph: place bloom_settings in context Derrick Stolee via GitGitGadget
2020-06-23 17:47   ` [PATCH v2 02/11] commit-graph: change test to die on parse, not load Derrick Stolee via GitGitGadget
2020-06-23 17:47   ` [PATCH v2 03/11] bloom: get_bloom_filter() cleanups Derrick Stolee via GitGitGadget
2020-06-25  7:24     ` René Scharfe
2020-06-23 17:47   ` [PATCH v2 04/11] commit-graph: persist existence of changed-paths Derrick Stolee via GitGitGadget
2020-06-23 17:47   ` [PATCH v2 05/11] commit-graph: unify the signatures of all write_graph_chunk_*() functions SZEDER Gábor via GitGitGadget
2020-06-25  7:25     ` René Scharfe
2020-06-23 17:47   ` [PATCH v2 06/11] commit-graph: simplify chunk writes into loop SZEDER Gábor via GitGitGadget
2020-06-25  7:25     ` René Scharfe
2020-06-25 14:59       ` Derrick Stolee
2020-06-23 17:47   ` [PATCH v2 07/11] commit-graph: check chunk sizes after writing SZEDER Gábor via GitGitGadget
2020-06-25  7:25     ` René Scharfe
2020-06-25 15:02       ` Derrick Stolee
2020-06-23 17:47   ` [PATCH v2 08/11] revision.c: fix whitespace Derrick Stolee via GitGitGadget
2020-06-23 17:47   ` [PATCH v2 09/11] revision: empty pathspecs should not use Bloom filters Taylor Blau via GitGitGadget
2020-06-23 17:47   ` [PATCH v2 10/11] commit-graph: check all leading directories in changed path " SZEDER Gábor via GitGitGadget
2020-06-25  7:25     ` René Scharfe
2020-06-25 15:05       ` Derrick Stolee
2020-06-26  6:34         ` SZEDER Gábor
2020-06-26 14:42           ` Derrick Stolee
2020-06-23 17:47   ` [PATCH v2 11/11] bloom: enforce a minimum size of 8 bytes Derrick Stolee via GitGitGadget
2020-06-24 23:11   ` [PATCH v2 00/11] More commit-graph/Bloom filter improvements Junio C Hamano
2020-06-24 23:32     ` Derrick Stolee
2020-06-25  0:38       ` Junio C Hamano
2020-06-25 13:38         ` Derrick Stolee
2020-06-25 16:34           ` Junio C Hamano
2020-06-26 12:30   ` [PATCH v3 00/10] " Derrick Stolee via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 01/10] commit-graph: place bloom_settings in context Derrick Stolee via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 02/10] commit-graph: change test to die on parse, not load Derrick Stolee via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 03/10] bloom: fix logic in get_bloom_filter() Derrick Stolee via GitGitGadget
2020-06-27 16:33       ` SZEDER Gábor
2020-06-29 13:02         ` Derrick Stolee
2020-06-26 12:30     ` [PATCH v3 04/10] commit-graph: persist existence of changed-paths Derrick Stolee via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 05/10] commit-graph: unify the signatures of all write_graph_chunk_*() functions SZEDER Gábor via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 06/10] commit-graph: simplify chunk writes into loop SZEDER Gábor via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 07/10] commit-graph: check chunk sizes after writing SZEDER Gábor via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 08/10] revision.c: fix whitespace Derrick Stolee via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 09/10] revision: empty pathspecs should not use Bloom filters Taylor Blau via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 10/10] commit-graph: check all leading directories in changed path " SZEDER Gábor via GitGitGadget
2020-07-01 13:27     ` Derrick Stolee via GitGitGadget [this message]
2020-07-01 13:27       ` [PATCH v4 01/10] commit-graph: place bloom_settings in context Derrick Stolee via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 02/10] commit-graph: change test to die on parse, not load Derrick Stolee via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 03/10] bloom: fix logic in get_bloom_filter() Derrick Stolee via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 04/10] commit-graph: persist existence of changed-paths Derrick Stolee via GitGitGadget
2020-10-15 13:21         ` SZEDER Gábor
2020-10-15 21:41           ` Taylor Blau
2020-10-16  2:18             ` Derrick Stolee
2020-10-16  3:18               ` Taylor Blau
2020-10-16 13:52                 ` Derrick Stolee
2020-07-01 13:27       ` [PATCH v4 05/10] commit-graph: unify the signatures of all write_graph_chunk_*() functions SZEDER Gábor via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 06/10] commit-graph: simplify chunk writes into loop SZEDER Gábor via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 07/10] commit-graph: check chunk sizes after writing SZEDER Gábor via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 08/10] revision.c: fix whitespace Derrick Stolee via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 09/10] revision: empty pathspecs should not use Bloom filters Taylor Blau via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 10/10] commit-graph: check all leading directories in changed path " SZEDER Gábor via GitGitGadget

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=pull.659.v4.git.1593610050.gitgitgadget@gmail.com \
    --to=gitgitgadget@gmail.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=l.s.r@web.de \
    --cc=me@ttaylorr.com \
    --cc=szeder.dev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).