git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Taylor Blau <me@ttaylorr.com>
To: "SZEDER Gábor" <szeder.dev@gmail.com>
Cc: Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>,
	git@vger.kernel.org, me@ttaylorr.com, l.s.r@web.de,
	Derrick Stolee <dstolee@microsoft.com>
Subject: Re: [PATCH v4 04/10] commit-graph: persist existence of changed-paths
Date: Thu, 15 Oct 2020 17:41:11 -0400	[thread overview]
Message-ID: <20201015214111.GA2176154@nand.local> (raw)
In-Reply-To: <20201015132147.GB24954@szeder.dev>

On Thu, Oct 15, 2020 at 03:21:47PM +0200, SZEDER Gábor wrote:
> As pointed out in my original bug report [1], modified path Bloom
> filters are computed with hardcoded settings in
> bloom.c:get_bloom_filter().  Since this patch does not touch bloom.c
> at all, it still computes Bloom filters with those hardcoded settings,
> and, consequently, despite the commit message's claims, it does not
> persist the settings in the existing commit-graph.
>
> [1] https://public-inbox.org/git/20200619140230.GB22200@szeder.dev/

Right. It is worth mentioning here (as you do below) that this was fixed
as of 9a7a9ed10d (bloom: use provided 'struct bloom_filter_settings',
2020-09-16).

> > Use the trace2 API to signal the settings used during the write, and
> > check that output in a test after manually adjusting the correct bytes
> > in the commit-graph file.
>
> This test is insufficient, as it only checks what settings trace2
> believes the Bloom filters are computed with, not what settings they
> are actually computed with; that's why it succeeded while the bug
> whose absence it was supposed to ensure was still there.
>
> More robust tests should instead look at what actually gets written to
> the commit-graph, and how that is interpreted during pathspec-limited
> revision walks.

Thanks for the test! That saved me a little bit of work trying to set up
the scenario you're describing in a reproducible way. I think that the
third test can be fixed relatively easily. The crux of the issue there
is that we are computing Bloom filters for commits, some of which
already had Bloom filters computed in an earlier layer, but with
different settings than the one that we're using to write the current
layer.

So, we need to be more aggressively checking the Bloom filter settings
in any layer we want to reuse a Bloom filter out of before reusing it
verbatim in the current layer. The patch below the scissors line is
sufficient to do that, and it causes the third test to start passing.

...But, teaching the revision machinery how to handle a walk through
commits in multiple commit-graph layers with incompatible Bloom filter
settings is trickier. Right now we compute all of the Bloom keys up
front using the Bloom filter settings in the top layer. I think we'd
probably want to change this to lazily compute those keys by:

  - Keeping around the contents of the Bloom keys, i.e., the pathspecs
    themselves.

  - Keeping a hashmap from Bloom filter settings to computed Bloom keys
    corresponding to those settings.

  - Lazily filling in that hashmap as we traverse through different
    commits.

At least, that's the best way that I can think to do it. It does get
kind of slow, though; we'd have to scan every commit graph layer at each
commit in the worst case to find the actual 'struct commit_graph *' in
order to get its Bloom filter settings. So, I think that's sort of
show-stoppingly slow, and that we should probably think more deeply
about it before taking up that direction.

Maybe Stolee has some better thoughts about how we can quickly go from a
commit to either (a) the commit-graph struct that that commit is stored
in, or (b) the Bloom filter settings belonging to that struct.

Thanks,
Taylor

--- >8 ---

Subject: [PATCH] bloom: recompute filters with incompatible settings
---
 bloom.c        | 21 +++++++++++++++++++--
 commit-graph.c |  4 ++--
 2 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/bloom.c b/bloom.c
index 68c73200a5..30da128474 100644
--- a/bloom.c
+++ b/bloom.c
@@ -30,7 +30,8 @@ static inline unsigned char get_bitmask(uint32_t pos)

 static int load_bloom_filter_from_graph(struct commit_graph *g,
 					struct bloom_filter *filter,
-					struct commit *c)
+					struct commit *c,
+					const struct bloom_filter_settings *settings)
 {
 	uint32_t lex_pos, start_index, end_index;
 	uint32_t graph_pos = commit_graph_position(c);
@@ -42,6 +43,21 @@ static int load_bloom_filter_from_graph(struct commit_graph *g,
 	if (!g->chunk_bloom_indexes)
 		return 0;

+	if (settings) {
+		struct bloom_filter_settings *graph_settings = g->bloom_filter_settings;
+		/*
+		 * Check that the settings used to generate new Bloom filters is
+		 * compatible with the settings Bloom filters in this
+		 * commit-graph layer were generated with.
+		 */
+		if (settings->hash_version != graph_settings->hash_version)
+			return 0;
+		if (settings->num_hashes != graph_settings->num_hashes)
+			return 0;
+		if (settings->bits_per_entry != graph_settings->bits_per_entry)
+			return 0;
+	}
+
 	lex_pos = graph_pos - g->num_commits_in_base;

 	end_index = get_be32(g->chunk_bloom_indexes + 4 * lex_pos);
@@ -205,7 +221,8 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 	if (!filter->data) {
 		load_commit_graph_info(r, c);
 		if (commit_graph_position(c) != COMMIT_NOT_FROM_GRAPH)
-			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c);
+			load_bloom_filter_from_graph(r->objects->commit_graph,
+						     filter, c, settings);
 	}

 	if (filter->data && filter->len)
diff --git a/commit-graph.c b/commit-graph.c
index cb042bdba8..afe14af2a3 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1188,7 +1188,7 @@ static int write_graph_chunk_bloom_indexes(struct hashfile *f,
 	uint32_t cur_pos = 0;

 	while (list < last) {
-		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
+		struct bloom_filter *filter = get_or_compute_bloom_filter(ctx->r, *list, 0, ctx->bloom_settings, NULL);
 		size_t len = filter ? filter->len : 0;
 		cur_pos += len;
 		display_progress(ctx->progress, ++ctx->progress_cnt);
@@ -1228,7 +1228,7 @@ static int write_graph_chunk_bloom_data(struct hashfile *f,
 	hashwrite_be32(f, ctx->bloom_settings->bits_per_entry);

 	while (list < last) {
-		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
+		struct bloom_filter *filter = get_or_compute_bloom_filter(ctx->r, *list, 0, ctx->bloom_settings, NULL);
 		size_t len = filter ? filter->len : 0;

 		display_progress(ctx->progress, ++ctx->progress_cnt);
--
2.29.0.rc1.dirty


  reply	other threads:[~2020-10-15 21:41 UTC|newest]

Thread overview: 76+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-06-15 20:14 [PATCH 0/8] More commit-graph/Bloom filter improvements Derrick Stolee via GitGitGadget
2020-06-15 20:14 ` [PATCH 1/8] commit-graph: place bloom_settings in context Derrick Stolee via GitGitGadget
2020-06-18 20:30   ` René Scharfe
2020-06-19 12:58     ` Derrick Stolee
2020-06-15 20:14 ` [PATCH 2/8] commit-graph: unify the signatures of all write_graph_chunk_*() functions SZEDER Gábor via GitGitGadget
2020-06-18 20:30   ` René Scharfe
2020-06-15 20:14 ` [PATCH 3/8] commit-graph: simplify chunk writes into loop SZEDER Gábor via GitGitGadget
2020-06-18 20:30   ` René Scharfe
2020-06-15 20:14 ` [PATCH 4/8] commit-graph: check chunk sizes after writing SZEDER Gábor via GitGitGadget
2020-06-15 20:14 ` [PATCH 5/8] commit-graph: check all leading directories in changed path Bloom filters SZEDER Gábor via GitGitGadget
2020-06-18 20:31   ` René Scharfe
2020-06-19  9:14     ` René Scharfe
2020-06-19 17:17   ` Taylor Blau
2020-06-19 17:19     ` Taylor Blau
2020-06-23 13:47     ` Derrick Stolee
2020-06-15 20:14 ` [PATCH 6/8] bloom: enforce a minimum size of 8 bytes Derrick Stolee via GitGitGadget
2020-06-15 20:14 ` [PATCH 7/8] commit-graph: change test to die on parse, not load Derrick Stolee via GitGitGadget
2020-06-15 20:14 ` [PATCH 8/8] commit-graph: persist existence of changed-paths Derrick Stolee via GitGitGadget
2020-06-17 21:21 ` [PATCH 0/8] More commit-graph/Bloom filter improvements Junio C Hamano
2020-06-18  1:46   ` Derrick Stolee
2020-06-23 17:46 ` [PATCH v2 00/11] " Derrick Stolee via GitGitGadget
2020-06-23 17:47   ` [PATCH v2 01/11] commit-graph: place bloom_settings in context Derrick Stolee via GitGitGadget
2020-06-23 17:47   ` [PATCH v2 02/11] commit-graph: change test to die on parse, not load Derrick Stolee via GitGitGadget
2020-06-23 17:47   ` [PATCH v2 03/11] bloom: get_bloom_filter() cleanups Derrick Stolee via GitGitGadget
2020-06-25  7:24     ` René Scharfe
2020-06-23 17:47   ` [PATCH v2 04/11] commit-graph: persist existence of changed-paths Derrick Stolee via GitGitGadget
2020-06-23 17:47   ` [PATCH v2 05/11] commit-graph: unify the signatures of all write_graph_chunk_*() functions SZEDER Gábor via GitGitGadget
2020-06-25  7:25     ` René Scharfe
2020-06-23 17:47   ` [PATCH v2 06/11] commit-graph: simplify chunk writes into loop SZEDER Gábor via GitGitGadget
2020-06-25  7:25     ` René Scharfe
2020-06-25 14:59       ` Derrick Stolee
2020-06-23 17:47   ` [PATCH v2 07/11] commit-graph: check chunk sizes after writing SZEDER Gábor via GitGitGadget
2020-06-25  7:25     ` René Scharfe
2020-06-25 15:02       ` Derrick Stolee
2020-06-23 17:47   ` [PATCH v2 08/11] revision.c: fix whitespace Derrick Stolee via GitGitGadget
2020-06-23 17:47   ` [PATCH v2 09/11] revision: empty pathspecs should not use Bloom filters Taylor Blau via GitGitGadget
2020-06-23 17:47   ` [PATCH v2 10/11] commit-graph: check all leading directories in changed path " SZEDER Gábor via GitGitGadget
2020-06-25  7:25     ` René Scharfe
2020-06-25 15:05       ` Derrick Stolee
2020-06-26  6:34         ` SZEDER Gábor
2020-06-26 14:42           ` Derrick Stolee
2020-06-23 17:47   ` [PATCH v2 11/11] bloom: enforce a minimum size of 8 bytes Derrick Stolee via GitGitGadget
2020-06-24 23:11   ` [PATCH v2 00/11] More commit-graph/Bloom filter improvements Junio C Hamano
2020-06-24 23:32     ` Derrick Stolee
2020-06-25  0:38       ` Junio C Hamano
2020-06-25 13:38         ` Derrick Stolee
2020-06-25 16:34           ` Junio C Hamano
2020-06-26 12:30   ` [PATCH v3 00/10] " Derrick Stolee via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 01/10] commit-graph: place bloom_settings in context Derrick Stolee via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 02/10] commit-graph: change test to die on parse, not load Derrick Stolee via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 03/10] bloom: fix logic in get_bloom_filter() Derrick Stolee via GitGitGadget
2020-06-27 16:33       ` SZEDER Gábor
2020-06-29 13:02         ` Derrick Stolee
2020-06-26 12:30     ` [PATCH v3 04/10] commit-graph: persist existence of changed-paths Derrick Stolee via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 05/10] commit-graph: unify the signatures of all write_graph_chunk_*() functions SZEDER Gábor via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 06/10] commit-graph: simplify chunk writes into loop SZEDER Gábor via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 07/10] commit-graph: check chunk sizes after writing SZEDER Gábor via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 08/10] revision.c: fix whitespace Derrick Stolee via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 09/10] revision: empty pathspecs should not use Bloom filters Taylor Blau via GitGitGadget
2020-06-26 12:30     ` [PATCH v3 10/10] commit-graph: check all leading directories in changed path " SZEDER Gábor via GitGitGadget
2020-07-01 13:27     ` [PATCH v4 00/10] More commit-graph/Bloom filter improvements Derrick Stolee via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 01/10] commit-graph: place bloom_settings in context Derrick Stolee via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 02/10] commit-graph: change test to die on parse, not load Derrick Stolee via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 03/10] bloom: fix logic in get_bloom_filter() Derrick Stolee via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 04/10] commit-graph: persist existence of changed-paths Derrick Stolee via GitGitGadget
2020-10-15 13:21         ` SZEDER Gábor
2020-10-15 21:41           ` Taylor Blau [this message]
2020-10-16  2:18             ` Derrick Stolee
2020-10-16  3:18               ` Taylor Blau
2020-10-16 13:52                 ` Derrick Stolee
2020-07-01 13:27       ` [PATCH v4 05/10] commit-graph: unify the signatures of all write_graph_chunk_*() functions SZEDER Gábor via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 06/10] commit-graph: simplify chunk writes into loop SZEDER Gábor via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 07/10] commit-graph: check chunk sizes after writing SZEDER Gábor via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 08/10] revision.c: fix whitespace Derrick Stolee via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 09/10] revision: empty pathspecs should not use Bloom filters Taylor Blau via GitGitGadget
2020-07-01 13:27       ` [PATCH v4 10/10] commit-graph: check all leading directories in changed path " SZEDER Gábor via GitGitGadget

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20201015214111.GA2176154@nand.local \
    --to=me@ttaylorr.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=l.s.r@web.de \
    --cc=szeder.dev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).