git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Taylor Blau <me@ttaylorr.com>
To: "SZEDER Gábor" <szeder.dev@gmail.com>
Cc: git@vger.kernel.org, peff@peff.net, dstolee@microsoft.com,
	gitster@pobox.com
Subject: Re: [PATCH v3 14/14] builtin/commit-graph.c: introduce '--max-new-filters=<n>'
Date: Thu, 3 Sep 2020 14:49:20 -0400	[thread overview]
Message-ID: <20200903184920.GA8946@nand.local> (raw)
In-Reply-To: <20200901143650.GA22631@szeder.dev>

On Tue, Sep 01, 2020 at 04:36:50PM +0200, SZEDER Gábor wrote:
> Something seems to be wrong in this patch, though I haven't looked
> closer.  Consider this test with a bit of makeshift tracing:
>
>   ---  >8  ---
>
> diff --git a/bloom.c b/bloom.c
> index 8d07209c6b..1a0dec35cd 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -222,6 +222,7 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
>  	if (!compute_if_not_present)
>  		return NULL;
>
> +	printf("get_or_compute_bloom_filter(): diff: %s\n", oid_to_hex(&c->object.oid));
>  	repo_diff_setup(r, &diffopt);
>  	diffopt.flags.recursive = 1;
>  	diffopt.detect_rename = 0;
> diff --git a/t/t9999-test.sh b/t/t9999-test.sh
> new file mode 100755
> index 0000000000..0833e6ff7e
> --- /dev/null
> +++ b/t/t9999-test.sh
> @@ -0,0 +1,25 @@
> +#!/bin/sh
> +
> +test_description='test'
> +
> +. ./test-lib.sh
> +
> +test_expect_success 'test' '
> +	for i in 1 2 3 4 5 6
> +	do
> +		git commit -q --allow-empty -m $i || return 1
> +	done &&
> +	git log --oneline &&
> +
> +	# We have 6 commits and compute 2 Bloom filters per execution,
> +	# so 3 executions should be enough...  but, alas, it isnt.
> +	for i in 1 2 3 4 5
> +	do
> +		# No --split=replace!
> +		git commit-graph write --reachable --changed-paths --max-new-filters=2 || return 1
> +	done &&
> +
> +	git commit-graph write --reachable --changed-paths --max-new-filters=4
> +'
> +
> +test_done
>
>   ---  >8  ---
>
> It's output looks like:
>
>   [...]
>   + git log --oneline
>   13fcefa (HEAD -> master) 6
>   0c71516 5
>   a82a61c 4
>   54c6b2c 3
>   fc99def 2
>   a3a8cd3 1
>   + git commit-graph write --reachable --changed-paths --max-new-filters=2
>   get_or_compute_bloom_filter(): diff: 0c71516945bf0a23813e80205961d29ebc1020e0
>   get_or_compute_bloom_filter(): diff: 13fcefa4bb859a15c4edc6bb01b8b6c91b4f32b6
>   + git commit-graph write --reachable --changed-paths --max-new-filters=2
>   get_or_compute_bloom_filter(): diff: 54c6b2cd4fb50066683a197cc6d677689618505a
>   get_or_compute_bloom_filter(): diff: a3a8cd3c82028671bf51502d77277baf14a2f528
>   + git commit-graph write --reachable --changed-paths --max-new-filters=2
>   get_or_compute_bloom_filter(): diff: 0c71516945bf0a23813e80205961d29ebc1020e0
>   get_or_compute_bloom_filter(): diff: 13fcefa4bb859a15c4edc6bb01b8b6c91b4f32b6
>   + git commit-graph write --reachable --changed-paths --max-new-filters=2
>   get_or_compute_bloom_filter(): diff: 54c6b2cd4fb50066683a197cc6d677689618505a
>   get_or_compute_bloom_filter(): diff: a3a8cd3c82028671bf51502d77277baf14a2f528
>   + git commit-graph write --reachable --changed-paths --max-new-filters=2
>   get_or_compute_bloom_filter(): diff: 0c71516945bf0a23813e80205961d29ebc1020e0
>   get_or_compute_bloom_filter(): diff: 13fcefa4bb859a15c4edc6bb01b8b6c91b4f32b6
>   + git commit-graph write --reachable --changed-paths
>   get_or_compute_bloom_filter(): diff: 54c6b2cd4fb50066683a197cc6d677689618505a
>   get_or_compute_bloom_filter(): diff: a3a8cd3c82028671bf51502d77277baf14a2f528
>   get_or_compute_bloom_filter(): diff: a82a61c79b2b07c4440e292613e11a69e33ef7a2
>   get_or_compute_bloom_filter(): diff: fc99def8b1df27bcab7d1f4b7ced73239f9bd7ec
>
> See how the third write with '--max-new-filters=2' computes the
> filters that have already been computed by the first write instead of
> those two that have never been computed?  And then how the fourth
> write computes filters that have already been computed by the second
> write?  And ultimately we'll need a write without '--max-new-filters' (or
> with '--max-new-filters=<large-enough>') to compute all remaining
> filters.

Ouch. This definitely has to do with the empty commits, since swapping
out your 'git commit ... --allow-empty' for a 'test_commit' produces the
output that you'd expect.

> With '--split=replace' it appears to work as expected.

This is definitely the critical bit. The crux of the issue is that
'copy_oids_to_commits()' handles split and non-split graphs differently.
The critical bits here are:

  * 43d3561805 (commit-graph write: don't die if the existing graph is
    corrupt, 2019-03-25) which forces the relevant data to *not* be
    loaded from an existing commit-graph, and

  * 8a6ac287b2 (builtin/commit-graph.c: introduce split strategy
    'replace', 2020-04-13), which does load data from an existing
    commit-graph with '--split=replace'.

When writing a graph with '--split=replace', commits are loaded from the
graph, which includes setting their '->graph_pos' (or rather setting
this data in a commit slab, which is I guess how it's done these days).
Without '--split=replace', the graph position will never be set.

So, by the time we get to 'get_bloom_filter_large_in_graph', the graph
position is 'COMMIT_NOT_FROM_GRAPH', which in turn forces us to
recompute the filter from scratch, since we assume that being
'NOT_FROM_GRAPH' implies that we won't find it in any 'BFXL' chunk.

Regardless of whether or not we should be trusting the parentage
information on-disk, recomputing the Bloom filters from scratch is
simply too expensive (and the opposite of the point of this series). So,
doing the following to force 'get_bloom_filter_large_in_graph' to lookup
the Bloom and BFXL data in a commit graph by forcibly loading its graph
position is the right thing to do.

This is sufficient to get us unstuck:

diff --git a/commit-graph.c b/commit-graph.c
index bec4e5b725..243c7253ff 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -956,8 +956,8 @@ int get_bloom_filter_large_in_graph(struct commit_graph *g,
                                    const struct commit *c,
                                    uint32_t max_changed_paths)
 {
-       uint32_t graph_pos = commit_graph_position(c);
-       if (graph_pos == COMMIT_NOT_FROM_GRAPH)
+       uint32_t graph_pos;
+       if (!find_commit_in_graph(c, g, &graph_pos))
                return 0;

        while (g && graph_pos < g->num_commits_in_base)

...but adding something like:

diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index a7a3b41919..571676cef2 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -310,4 +310,29 @@ test_expect_success 'Bloom generation backfills previously-skipped filters' '
        )
 '

+test_expect_success 'Bloom generation backfills empty commits' '
+       git init empty &&
+       test_when_finished "rm -fr empty" &&
+       (
+               cd empty &&
+               for i in $(test_seq 1 6)
+               do
+                       git commit --allow-empty -m "$i"
+               done &&
+
+               # Generate Bloom filters for empty commits 1-6, two at a time.
+               test_bloom_filters_computed "--reachable --changed-paths --max-new-filters=2" \
+                       0 2 2 &&
+               test_bloom_filters_computed "--reachable --changed-paths --max-new-filters=2" \
+                       2 2 2 &&
+               test_bloom_filters_computed "--reachable --changed-paths --max-new-filters=2" \
+                       4 2 2 &&
+
+               # Finally, make sure that once all commits have filters, that
+               # none are subsequently recomputed.
+               test_bloom_filters_computed "--reachable --changed-paths --max-new-filters=2" \
+                       6 0 0
+       )
+'
+
 test_done

is a good idea to harden what you wrote in your t9999 into an actual
test to prevent against regression. I'll fold both of those into this
patch.

Thanks for the bug report. It led to an interesting investigation as a
result :).

Thanks,
Taylor

  reply	other threads:[~2020-09-03 18:49 UTC|newest]

Thread overview: 117+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-08-03 18:57 [PATCH 00/10] more miscellaneous Bloom filter improvements Taylor Blau
2020-08-03 18:57 ` [PATCH 01/10] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
2020-08-04  7:24   ` Jeff King
2020-08-04 20:08     ` Taylor Blau
2020-08-03 18:57 ` [PATCH 02/10] commit-graph: pass a 'struct repository *' in more places Taylor Blau
2020-08-03 18:57 ` [PATCH 03/10] t4216: use an '&&'-chain Taylor Blau
2020-08-03 18:57 ` [PATCH 04/10] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
2020-08-03 18:57 ` [PATCH 05/10] commit-graph: respect 'commitgraph.readChangedPaths' Taylor Blau
2020-08-03 18:57 ` [PATCH 06/10] commit-graph.c: sort index into commits list Taylor Blau
2020-08-04 12:31   ` Derrick Stolee
2020-08-04 20:10     ` Taylor Blau
2020-08-03 18:57 ` [PATCH 07/10] commit-graph: add large-filters bitmap chunk Taylor Blau
2020-08-03 18:59   ` Taylor Blau
2020-08-04 12:57   ` Derrick Stolee
2020-08-03 18:57 ` [PATCH 08/10] bloom: split 'get_bloom_filter()' in two Taylor Blau
2020-08-04 13:00   ` Derrick Stolee
2020-08-04 20:12     ` Taylor Blau
2020-08-03 18:57 ` [PATCH 09/10] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
2020-08-03 18:57 ` [PATCH 10/10] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
2020-08-04 13:03   ` Derrick Stolee
2020-08-04 20:14     ` Taylor Blau
2020-08-05 17:01 ` [PATCH v2 00/14] more miscellaneous Bloom filter improvements Taylor Blau
2020-08-05 17:01   ` [PATCH v2 01/14] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
2020-08-05 17:02   ` [PATCH v2 02/14] t4216: use an '&&'-chain Taylor Blau
2020-08-05 17:02   ` [PATCH v2 03/14] commit-graph: pass a 'struct repository *' in more places Taylor Blau
2020-08-05 17:02   ` [PATCH v2 04/14] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
2020-08-05 17:02   ` [PATCH v2 05/14] commit-graph: respect 'commitGraph.readChangedPaths' Taylor Blau
2020-08-05 17:02   ` [PATCH v2 06/14] commit-graph.c: store maximum changed paths Taylor Blau
2020-08-05 17:02   ` [PATCH v2 07/14] bloom: split 'get_bloom_filter()' in two Taylor Blau
2020-08-05 17:02   ` [PATCH v2 08/14] bloom: use provided 'struct bloom_filter_settings' Taylor Blau
2020-08-05 17:02   ` [PATCH v2 09/14] bloom/diff: properly short-circuit on max_changes Taylor Blau
2020-08-05 17:02   ` [PATCH v2 10/14] commit-graph.c: sort index into commits list Taylor Blau
2020-08-05 17:02   ` [PATCH v2 11/14] csum-file.h: introduce 'hashwrite_be64()' Taylor Blau
2020-08-05 17:02   ` [PATCH v2 12/14] commit-graph: add large-filters bitmap chunk Taylor Blau
2020-08-05 21:01     ` Junio C Hamano
2020-08-05 21:17       ` Taylor Blau
2020-08-05 22:21         ` Junio C Hamano
2020-08-05 22:25           ` Taylor Blau
2020-08-11 13:48             ` Taylor Blau
2020-08-11 18:59               ` Junio C Hamano
2020-08-05 17:03   ` [PATCH v2 13/14] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
2020-08-05 17:03   ` [PATCH v2 14/14] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
2020-08-11 20:51 ` [PATCH v3 00/14] more miscellaneous Bloom filter improvements Taylor Blau
2020-08-11 20:51   ` [PATCH v3 01/14] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
2020-08-11 21:18     ` SZEDER Gábor
2020-08-11 21:21       ` Taylor Blau
2020-08-11 21:27         ` SZEDER Gábor
2020-08-11 21:34           ` Taylor Blau
2020-08-11 23:55             ` SZEDER Gábor
2020-08-12 11:48               ` Derrick Stolee
2020-08-14 20:17                 ` Taylor Blau
2020-08-11 20:51   ` [PATCH v3 02/14] t4216: use an '&&'-chain Taylor Blau
2020-08-11 20:51   ` [PATCH v3 03/14] commit-graph: pass a 'struct repository *' in more places Taylor Blau
2020-08-11 20:51   ` [PATCH v3 04/14] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
2020-08-11 20:51   ` [PATCH v3 05/14] commit-graph: respect 'commitGraph.readChangedPaths' Taylor Blau
2020-08-11 20:51   ` [PATCH v3 06/14] commit-graph.c: store maximum changed paths Taylor Blau
2020-08-11 20:51   ` [PATCH v3 07/14] bloom: split 'get_bloom_filter()' in two Taylor Blau
2020-08-11 20:51   ` [PATCH v3 11/14] csum-file.h: introduce 'hashwrite_be64()' Taylor Blau
2020-08-11 20:51   ` [PATCH v3 08/14] bloom: use provided 'struct bloom_filter_settings' Taylor Blau
2020-08-11 20:51   ` [PATCH v3 09/14] bloom/diff: properly short-circuit on max_changes Taylor Blau
2020-08-11 20:52   ` [PATCH v3 10/14] commit-graph.c: sort index into commits list Taylor Blau
2020-08-11 20:52   ` [PATCH v3 12/14] commit-graph: add large-filters bitmap chunk Taylor Blau
2020-08-11 21:11     ` Derrick Stolee
2020-08-11 21:18       ` Taylor Blau
2020-08-11 22:05         ` Taylor Blau
2020-08-19 13:35     ` SZEDER Gábor
2020-09-02 20:23       ` Taylor Blau
2020-09-01 14:35     ` SZEDER Gábor
2020-09-02 20:40       ` Taylor Blau
2020-08-11 20:52   ` [PATCH v3 13/14] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
2020-08-19  9:56     ` SZEDER Gábor
2020-09-02 21:02       ` Taylor Blau
2020-08-11 20:52   ` [PATCH v3 14/14] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
2020-08-12 11:49     ` SZEDER Gábor
2020-08-14 20:20       ` Taylor Blau
2020-08-17 22:50         ` SZEDER Gábor
2020-09-02 21:03           ` Taylor Blau
2020-08-12 12:29     ` Derrick Stolee
2020-08-14 20:10       ` Taylor Blau
2020-08-18 22:23     ` SZEDER Gábor
2020-09-03 16:35       ` Taylor Blau
2020-08-19  8:20     ` SZEDER Gábor
2020-09-03 16:42       ` Taylor Blau
2020-09-04  8:50         ` SZEDER Gábor
2020-09-01 14:36     ` SZEDER Gábor
2020-09-03 18:49       ` Taylor Blau [this message]
2020-09-03 21:45   ` [PATCH v3 00/14] more miscellaneous Bloom filter improvements Junio C Hamano
2020-09-03 22:33     ` Taylor Blau
2020-09-03 22:45 ` [PATCH v4 " Taylor Blau
2020-09-03 22:46   ` [PATCH v4 01/14] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
2020-09-03 22:46   ` [PATCH v4 02/14] t4216: use an '&&'-chain Taylor Blau
2020-09-03 22:46   ` [PATCH v4 03/14] commit-graph: pass a 'struct repository *' in more places Taylor Blau
2020-09-03 22:46   ` [PATCH v4 04/14] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
2020-09-03 22:46   ` [PATCH v4 05/14] commit-graph: respect 'commitGraph.readChangedPaths' Taylor Blau
2020-09-03 22:46   ` [PATCH v4 06/14] commit-graph.c: store maximum changed paths Taylor Blau
2020-09-03 22:46   ` [PATCH v4 07/14] bloom: split 'get_bloom_filter()' in two Taylor Blau
2020-09-05 17:22     ` Jakub Narębski
2020-09-05 17:38       ` Taylor Blau
2020-09-05 17:50         ` Jakub Narębski
2020-09-05 18:01           ` Taylor Blau
2020-09-05 18:18             ` Jakub Narębski
2020-09-05 18:38               ` Taylor Blau
2020-09-05 18:55                 ` Taylor Blau
2020-09-05 19:04                   ` SZEDER Gábor
2020-09-05 19:49                     ` Taylor Blau
2020-09-06 21:52                       ` Junio C Hamano
2020-09-03 22:46   ` [PATCH v4 08/14] bloom: use provided 'struct bloom_filter_settings' Taylor Blau
2020-09-03 22:46   ` [PATCH v4 09/14] bloom/diff: properly short-circuit on max_changes Taylor Blau
2020-09-03 22:46   ` [PATCH v4 10/14] commit-graph.c: sort index into commits list Taylor Blau
2020-09-03 22:46   ` [PATCH v4 11/14] csum-file.h: introduce 'hashwrite_be64()' Taylor Blau
2020-09-04 20:18     ` René Scharfe
2020-09-04 20:22       ` Taylor Blau
2020-09-03 22:46   ` [PATCH v4 12/14] commit-graph: add large-filters bitmap chunk Taylor Blau
2020-09-03 22:46   ` [PATCH v4 13/14] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
2020-09-04 15:20     ` Taylor Blau
2020-09-03 22:47   ` [PATCH v4 14/14] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
2020-09-04 14:39   ` [PATCH v4 00/14] more miscellaneous Bloom filter improvements Derrick Stolee

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200903184920.GA8946@nand.local \
    --to=me@ttaylorr.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=peff@peff.net \
    --cc=szeder.dev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).