git@vger.kernel.org list mirror (unofficial, one of many)
* [PATCH 00/10] more miscellaneous Bloom filter improvements
@ 2020-08-03 18:57 Taylor Blau
  2020-08-03 18:57 ` [PATCH 01/10] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
                   ` (12 more replies)
  0 siblings, 13 replies; 117+ messages in thread
From: Taylor Blau @ 2020-08-03 18:57 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee

Hi,

Here are some patches that I have been sitting on while rolling out
changed-path Bloom filters at GitHub. I was initially going to send this
as four separate topics, but they really make the most sense when sent
all together.

Here's an overview of what's contained:

  - The first patch fixes a bug where Bloom filters are only read if the
    most-recent layer in a commit-graph-chain has Bloom filters.

  - Patches 2-5 introduce 'commitgraph.readChangedPaths', a new
    configuration option for debugging changed-path Bloom filters. When
    disabled, the commit-graph machinery pretends as if there are no
    BIDX/BDAT chunks.

    This is useful when testing behavior with/without Bloom filters
    without having to regenerate the commit graph.

  - Patches 6-7 introduce a new 'BFXL' chunk, which is a bitmap
    indicating which Bloom filters were too large to compute (i.e., had
    more than 512 changed paths). This is a prerequisite for the final
    feature, '--max-new-filters'.

  - Patches 8-10 introduce '--max-new-filters <n>', which allows
    callers to limit the number of Bloom filters that they are willing
    to compute from scratch when generating commit-graphs with
    --changed-paths.

The BFXL chunk is a prerequisite for '--max-new-filters' because of an
unfortunate overloading where filters with (1) no changed paths, (2) too
many changed paths, and (3) ones that weren't computed at all are all
represented as a length-0 filter by the BIDX chunk.

The problem is that in repositories with many commits that have too many
changed paths to store Bloom filters, specifying '--max-new-filters'
recomputes those same large filters on each commit-graph write, only to
throw them away. Knowing which filters are too large allows us to skip
computing filters that we know are a waste.
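
To make the overloading concrete, here is a minimal C sketch. It is not
Git's code: the helper names and the bit order within each byte are
assumptions chosen for illustration. It shows how a one-bit-per-commit
bitmap lets a writer tell "known to be too large" apart from the other
zero-length cases.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helpers: one bit per commit position, set when an earlier
 * write already found that commit's filter to be too large.
 */
static void mark_too_large(uint8_t *bitmap, size_t pos)
{
	bitmap[pos / 8] |= (uint8_t)(1u << (pos % 8));
}

static int known_too_large(const uint8_t *bitmap, size_t pos)
{
	return (bitmap[pos / 8] >> (pos % 8)) & 1;
}

/*
 * A zero-length filter alone is ambiguous (empty, too large, or never
 * computed). With the bitmap, a writer can at least skip the commits it
 * already knows are too large.
 */
static int worth_computing(size_t stored_len, const uint8_t *bitmap, size_t pos)
{
	if (stored_len > 0)
		return 0; /* a usable filter is already stored */
	return !known_too_large(bitmap, pos);
}
```

Note that a cleared bit still leaves "no changed paths" and "not
computed" indistinguishable in this sketch; only the too-large case is
resolved.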

Thanks in advance for your review!

Taylor Blau (10):
  commit-graph: introduce 'get_bloom_filter_settings()'
  commit-graph: pass a 'struct repository *' in more places
  t4216: use an '&&'-chain
  t/helper/test-read-graph.c: prepare repo settings
  commit-graph: respect 'commitgraph.readChangedPaths'
  commit-graph.c: sort index into commits list
  commit-graph: add large-filters bitmap chunk
  bloom: split 'get_bloom_filter()' in two
  commit-graph: rename 'split_commit_graph_opts'
  builtin/commit-graph.c: introduce '--max-new-filters=<n>'

 Documentation/config.txt                      |   2 +
 Documentation/config/commitgraph.txt          |   8 +
 Documentation/git-commit-graph.txt            |   4 +
 .../technical/commit-graph-format.txt         |   9 +
 blame.c                                       |   8 +-
 bloom.c                                       |  34 ++-
 bloom.h                                       |   9 +-
 builtin/commit-graph.c                        |  61 +++--
 commit-graph.c                                | 214 +++++++++++++-----
 commit-graph.h                                |  17 +-
 fuzz-commit-graph.c                           |   5 +-
 line-log.c                                    |   2 +-
 repo-settings.c                               |   3 +
 repository.h                                  |   1 +
 revision.c                                    |   4 +-
 t/helper/test-bloom.c                         |   2 +-
 t/helper/test-read-graph.c                    |   3 +-
 t/t4216-log-bloom.sh                          |  54 ++++-
 t/t5324-split-commit-graph.sh                 |  13 ++
 19 files changed, 357 insertions(+), 96 deletions(-)
 create mode 100644 Documentation/config/commitgraph.txt

--
2.28.0.rc1.13.ge78abce653


* [PATCH 01/10] commit-graph: introduce 'get_bloom_filter_settings()'
  2020-08-03 18:57 [PATCH 00/10] more miscellaneous Bloom filter improvements Taylor Blau
@ 2020-08-03 18:57 ` Taylor Blau
  2020-08-04  7:24   ` Jeff King
  2020-08-03 18:57 ` [PATCH 02/10] commit-graph: pass a 'struct repository *' in more places Taylor Blau
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 117+ messages in thread
From: Taylor Blau @ 2020-08-03 18:57 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee

Many places in the code need a pointer to the commit-graph's
'struct bloom_filter_settings', and they typically take the value
from the top-most commit-graph.

In the non-split case, this works as expected. In the split case,
however, things get a little tricky. Not all layers in a chain of
incremental commit-graphs are required to themselves have Bloom data,
and so whether or not some part of the code uses Bloom filters depends
entirely on whether or not the top-most level of the commit-graph chain
has Bloom filters.

This has been the behavior since Bloom filters were introduced, and has
been codified into the tests since a759bfa9ee (t4216: add end to end
tests for git log with Bloom filters, 2020-04-06). In fact, t4216.130
requires that Bloom filters are not used in exactly the case described
earlier.

There is no reason that this needs to be the case, since it is perfectly
valid for commits in an earlier layer to have Bloom filters when commits
in a newer layer do not.

Since Bloom settings are guaranteed to be the same for any layer in a
chain that has Bloom data, it is sufficient to traverse the
'->base_graph' pointer until either (1) a non-null 'struct
bloom_filter_settings *' is found, or (2) until we are at the root of
the commit-graph chain.
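
As a rough, self-contained illustration of that traversal (using
cut-down stand-ins for Git's structs, and a hypothetical function name
'find_settings()' so as not to be confused with the real helper):

```c
#include <stddef.h>

/* Simplified stand-ins for Git's structs; only the fields used here. */
struct bloom_filter_settings {
	unsigned int hash_version;
};

struct commit_graph {
	struct bloom_filter_settings *bloom_filter_settings;
	struct commit_graph *base_graph;
};

/*
 * Walk from the top-most layer towards the root of the chain and return
 * the first non-NULL settings pointer, or NULL if no layer has one.
 */
static struct bloom_filter_settings *find_settings(struct commit_graph *g)
{
	for (; g; g = g->base_graph)
		if (g->bloom_filter_settings)
			return g->bloom_filter_settings;
	return NULL;
}
```

Returning the first match is enough only because every layer that has
Bloom data is assumed to share the same settings.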

Introduce a 'get_bloom_filter_settings()' function that does just this,
and use it instead of purely dereferencing the top-most graph's
'->bloom_filter_settings' pointer.

While we're at it, add an additional test in t5324 to guard against code
in the commit-graph writing machinery that doesn't correctly handle a
NULL 'struct bloom_filter *'.

Co-authored-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 blame.c                       |  6 ++++--
 bloom.c                       |  6 +++---
 commit-graph.c                | 11 +++++++++++
 commit-graph.h                |  2 ++
 revision.c                    |  2 +-
 t/t4216-log-bloom.sh          |  9 ++++++---
 t/t5324-split-commit-graph.sh | 13 +++++++++++++
 7 files changed, 40 insertions(+), 9 deletions(-)

diff --git a/blame.c b/blame.c
index 82fa16d658..3e5f8787bc 100644
--- a/blame.c
+++ b/blame.c
@@ -2891,16 +2891,18 @@ void setup_blame_bloom_data(struct blame_scoreboard *sb,
 			    const char *path)
 {
 	struct blame_bloom_data *bd;
+	struct bloom_filter_settings *bs;
 
 	if (!sb->repo->objects->commit_graph)
 		return;
 
-	if (!sb->repo->objects->commit_graph->bloom_filter_settings)
+	bs = get_bloom_filter_settings(sb->repo);
+	if (!bs)
 		return;
 
 	bd = xmalloc(sizeof(struct blame_bloom_data));
 
-	bd->settings = sb->repo->objects->commit_graph->bloom_filter_settings;
+	bd->settings = bs;
 
 	bd->alloc = 4;
 	bd->nr = 0;
diff --git a/bloom.c b/bloom.c
index 1a573226e7..cd9380ac62 100644
--- a/bloom.c
+++ b/bloom.c
@@ -38,7 +38,7 @@ static int load_bloom_filter_from_graph(struct commit_graph *g,
 	while (graph_pos < g->num_commits_in_base)
 		g = g->base_graph;
 
-	/* The commit graph commit 'c' lives in doesn't carry bloom filters. */
+	/* The commit graph commit 'c' lives in doesn't carry Bloom filters. */
 	if (!g->chunk_bloom_indexes)
 		return 0;
 
@@ -195,8 +195,8 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 	if (!filter->data) {
 		load_commit_graph_info(r, c);
 		if (commit_graph_position(c) != COMMIT_NOT_FROM_GRAPH &&
-			r->objects->commit_graph->chunk_bloom_indexes)
-			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c);
+			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
+				return filter;
 	}
 
 	if (filter->data)
diff --git a/commit-graph.c b/commit-graph.c
index e51c91dd5b..d4b06811be 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -660,6 +660,17 @@ int generation_numbers_enabled(struct repository *r)
 	return !!first_generation;
 }
 
+struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r)
+{
+	struct commit_graph *g = r->objects->commit_graph;
+	while (g) {
+		if (g->bloom_filter_settings)
+			return g->bloom_filter_settings;
+		g = g->base_graph;
+	}
+	return NULL;
+}
+
 static void close_commit_graph_one(struct commit_graph *g)
 {
 	if (!g)
diff --git a/commit-graph.h b/commit-graph.h
index 09a97030dc..0677dd1031 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -87,6 +87,8 @@ struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size);
  */
 int generation_numbers_enabled(struct repository *r);
 
+struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r);
+
 enum commit_graph_write_flags {
 	COMMIT_GRAPH_WRITE_APPEND     = (1 << 0),
 	COMMIT_GRAPH_WRITE_PROGRESS   = (1 << 1),
diff --git a/revision.c b/revision.c
index 6de29cdf7a..e244beed05 100644
--- a/revision.c
+++ b/revision.c
@@ -684,7 +684,7 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 	if (!revs->repo->objects->commit_graph)
 		return;
 
-	revs->bloom_filter_settings = revs->repo->objects->commit_graph->bloom_filter_settings;
+	revs->bloom_filter_settings = get_bloom_filter_settings(revs->repo);
 	if (!revs->bloom_filter_settings)
 		return;
 
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index c21cc160f3..c9f9bdf1ba 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -60,7 +60,7 @@ setup () {
 
 test_bloom_filters_used () {
 	log_args=$1
-	bloom_trace_prefix="statistics:{\"filter_not_present\":0,\"maybe\""
+	bloom_trace_prefix="statistics:{\"filter_not_present\":${2:-0},\"maybe\""
 	setup "$log_args" &&
 	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" &&
 	test_cmp log_wo_bloom log_w_bloom &&
@@ -134,8 +134,11 @@ test_expect_success 'setup - add commit-graph to the chain without Bloom filters
 	test_line_count = 2 .git/objects/info/commit-graphs/commit-graph-chain
 '
 
-test_expect_success 'Do not use Bloom filters if the latest graph does not have Bloom filters.' '
-	test_bloom_filters_not_used "-- A/B"
+test_expect_success 'use Bloom filters even if the latest graph does not have Bloom filters' '
+	# Ensure that the number of empty filters is equal to the number of
+	# filters in the latest graph layer to prove that they are loaded (and
+	# ignored).
+	test_bloom_filters_used "-- A/B" 3
 '
 
 test_expect_success 'setup - add commit-graph to the chain with Bloom filters' '
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 9b850ea907..5bdfd53ef9 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -425,4 +425,17 @@ done <<\EOF
 0600 -r--------
 EOF
 
+test_expect_success '--split=replace with partial Bloom data' '
+	rm -rf $graphdir $infodir/commit-graph &&
+	git reset --hard commits/3 &&
+	git rev-list -1 HEAD~2 >a &&
+	git rev-list -1 HEAD~1 >b &&
+	git commit-graph write --split=no-merge --stdin-commits --changed-paths <a &&
+	git commit-graph write --split=no-merge --stdin-commits <b &&
+	git commit-graph write --split=replace --stdin-commits --changed-paths <c &&
+	ls $graphdir/graph-*.graph >graph-files &&
+	test_line_count = 1 graph-files &&
+	verify_chain_files_exist $graphdir
+'
+
 test_done
-- 
2.28.0.rc1.13.ge78abce653



* [PATCH 02/10] commit-graph: pass a 'struct repository *' in more places
  2020-08-03 18:57 [PATCH 00/10] more miscellaneous Bloom filter improvements Taylor Blau
  2020-08-03 18:57 ` [PATCH 01/10] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
@ 2020-08-03 18:57 ` Taylor Blau
  2020-08-03 18:57 ` [PATCH 03/10] t4216: use an '&&'-chain Taylor Blau
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-08-03 18:57 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee

In a future commit, some commit-graph internals will want access to
'r->settings', but we only have the 'struct object_directory *'
corresponding to that repository.

Add an additional parameter to pass the repository around in more
places. In the next patch, we will remove the object directory (and
instead reference it with 'r->odb').

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/commit-graph.c |  2 +-
 commit-graph.c         | 18 +++++++++++-------
 commit-graph.h         |  6 ++++--
 fuzz-commit-graph.c    |  5 +++--
 4 files changed, 19 insertions(+), 12 deletions(-)

diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 523501f217..ba5584463f 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -106,7 +106,7 @@ static int graph_verify(int argc, const char **argv)
 	FREE_AND_NULL(graph_name);
 
 	if (open_ok)
-		graph = load_commit_graph_one_fd_st(fd, &st, odb);
+		graph = load_commit_graph_one_fd_st(the_repository, fd, &st, odb);
 	else
 		graph = read_commit_graph_one(the_repository, odb);
 
diff --git a/commit-graph.c b/commit-graph.c
index d4b06811be..81a6f2a8ce 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -224,7 +224,8 @@ int open_commit_graph(const char *graph_file, int *fd, struct stat *st)
 	return 1;
 }
 
-struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st,
+struct commit_graph *load_commit_graph_one_fd_st(struct repository *r,
+						 int fd, struct stat *st,
 						 struct object_directory *odb)
 {
 	void *graph_map;
@@ -240,7 +241,7 @@ struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st,
 	}
 	graph_map = xmmap(NULL, graph_size, PROT_READ, MAP_PRIVATE, fd, 0);
 	close(fd);
-	ret = parse_commit_graph(graph_map, graph_size);
+	ret = parse_commit_graph(r, graph_map, graph_size);
 
 	if (ret)
 		ret->odb = odb;
@@ -280,7 +281,8 @@ static int verify_commit_graph_lite(struct commit_graph *g)
 	return 0;
 }
 
-struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size)
+struct commit_graph *parse_commit_graph(struct repository *r,
+					void *graph_map, size_t graph_size)
 {
 	const unsigned char *data, *chunk_lookup;
 	uint32_t i;
@@ -445,7 +447,9 @@ struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size)
 	return NULL;
 }
 
-static struct commit_graph *load_commit_graph_one(const char *graph_file,
+
+static struct commit_graph *load_commit_graph_one(struct repository *r,
+						  const char *graph_file,
 						  struct object_directory *odb)
 {
 
@@ -457,7 +461,7 @@ static struct commit_graph *load_commit_graph_one(const char *graph_file,
 	if (!open_ok)
 		return NULL;
 
-	g = load_commit_graph_one_fd_st(fd, &st, odb);
+	g = load_commit_graph_one_fd_st(r, fd, &st, odb);
 
 	if (g)
 		g->filename = xstrdup(graph_file);
@@ -469,7 +473,7 @@ static struct commit_graph *load_commit_graph_v1(struct repository *r,
 						 struct object_directory *odb)
 {
 	char *graph_name = get_commit_graph_filename(odb);
-	struct commit_graph *g = load_commit_graph_one(graph_name, odb);
+	struct commit_graph *g = load_commit_graph_one(r, graph_name, odb);
 	free(graph_name);
 
 	return g;
@@ -550,7 +554,7 @@ static struct commit_graph *load_commit_graph_chain(struct repository *r,
 		valid = 0;
 		for (odb = r->objects->odb; odb; odb = odb->next) {
 			char *graph_name = get_split_graph_filename(odb, line.buf);
-			struct commit_graph *g = load_commit_graph_one(graph_name, odb);
+			struct commit_graph *g = load_commit_graph_one(r, graph_name, odb);
 
 			free(graph_name);
 
diff --git a/commit-graph.h b/commit-graph.h
index 0677dd1031..d9acb22bac 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -75,11 +75,13 @@ struct commit_graph {
 	struct bloom_filter_settings *bloom_filter_settings;
 };
 
-struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st,
+struct commit_graph *load_commit_graph_one_fd_st(struct repository *r,
+						 int fd, struct stat *st,
 						 struct object_directory *odb);
 struct commit_graph *read_commit_graph_one(struct repository *r,
 					   struct object_directory *odb);
-struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size);
+struct commit_graph *parse_commit_graph(struct repository *r,
+					void *graph_map, size_t graph_size);
 
 /*
  * Return 1 if and only if the repository has a commit-graph
diff --git a/fuzz-commit-graph.c b/fuzz-commit-graph.c
index 430817214d..e7cf6d5b0f 100644
--- a/fuzz-commit-graph.c
+++ b/fuzz-commit-graph.c
@@ -1,7 +1,8 @@
 #include "commit-graph.h"
 #include "repository.h"
 
-struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size);
+struct commit_graph *parse_commit_graph(struct repository *r,
+					void *graph_map, size_t graph_size);
 
 int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size);
 
@@ -10,7 +11,7 @@ int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
 	struct commit_graph *g;
 
 	initialize_the_repository();
-	g = parse_commit_graph((void *)data, size);
+	g = parse_commit_graph(the_repository, (void *)data, size);
 	repo_clear(the_repository);
 	free_commit_graph(g);
 
-- 
2.28.0.rc1.13.ge78abce653



* [PATCH 03/10] t4216: use an '&&'-chain
  2020-08-03 18:57 [PATCH 00/10] more miscellaneous Bloom filter improvements Taylor Blau
  2020-08-03 18:57 ` [PATCH 01/10] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
  2020-08-03 18:57 ` [PATCH 02/10] commit-graph: pass a 'struct repository *' in more places Taylor Blau
@ 2020-08-03 18:57 ` Taylor Blau
  2020-08-03 18:57 ` [PATCH 04/10] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-08-03 18:57 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee

In a759bfa9ee (t4216: add end to end tests for git log with Bloom
filters, 2020-04-06), a 'rm' invocation was added without a
corresponding '&&' chain.

When 'trace.perf' already exists, everything works fine. However, the
function can be executed without 'trace.perf' on disk (e.g., when the
subset of tests run is altered with '--run'), and so the bare 'rm'
complains about a missing file.

To remove some noise from the test log, invoke 'rm' with '-f', at which
point it is sensible to place the 'rm -f' in an '&&'-chain, which (1) is
our usual style, and (2) avoids a broken chain in the future if more
commands are added at the beginning of the function.

Helped-by: Eric Sunshine <sunshine@sunshineco.com>
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/t4216-log-bloom.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index c9f9bdf1ba..fe19f6a60c 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -53,7 +53,7 @@ sane_unset GIT_TRACE2_PERF_BRIEF
 sane_unset GIT_TRACE2_CONFIG_PARAMS
 
 setup () {
-	rm "$TRASH_DIRECTORY/trace.perf"
+	rm -f "$TRASH_DIRECTORY/trace.perf" &&
 	git -c core.commitGraph=false log --pretty="format:%s" $1 >log_wo_bloom &&
 	GIT_TRACE2_PERF="$TRASH_DIRECTORY/trace.perf" git -c core.commitGraph=true log --pretty="format:%s" $1 >log_w_bloom
 }
-- 
2.28.0.rc1.13.ge78abce653



* [PATCH 04/10] t/helper/test-read-graph.c: prepare repo settings
  2020-08-03 18:57 [PATCH 00/10] more miscellaneous Bloom filter improvements Taylor Blau
                   ` (2 preceding siblings ...)
  2020-08-03 18:57 ` [PATCH 03/10] t4216: use an '&&'-chain Taylor Blau
@ 2020-08-03 18:57 ` Taylor Blau
  2020-08-03 18:57 ` [PATCH 05/10] commit-graph: respect 'commitgraph.readChangedPaths' Taylor Blau
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-08-03 18:57 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee

The read-graph test-tool is used by a number of the commit-graph tests to
assert various properties about a commit-graph. Previously, this program
never ran 'prepare_repo_settings()'. There was no need to do so, since
none of the commit-graph machinery was affected by the repo settings.

In the next patch, the commit-graph machinery's behavior will become
dependent on the repo settings, and so loading them before running the
rest of the test tool is critical.

As such, teach the test tool to call 'prepare_repo_settings()'.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/helper/test-read-graph.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
index 6d0c962438..5f585a1725 100644
--- a/t/helper/test-read-graph.c
+++ b/t/helper/test-read-graph.c
@@ -12,11 +12,12 @@ int cmd__read_graph(int argc, const char **argv)
 	setup_git_directory();
 	odb = the_repository->objects->odb;
 
+	prepare_repo_settings(the_repository);
+
 	graph = read_commit_graph_one(the_repository, odb);
 	if (!graph)
 		return 1;
 
-
 	printf("header: %08x %d %d %d %d\n",
 		ntohl(*(uint32_t*)graph->data),
 		*(unsigned char*)(graph->data + 4),
-- 
2.28.0.rc1.13.ge78abce653



* [PATCH 05/10] commit-graph: respect 'commitgraph.readChangedPaths'
  2020-08-03 18:57 [PATCH 00/10] more miscellaneous Bloom filter improvements Taylor Blau
                   ` (3 preceding siblings ...)
  2020-08-03 18:57 ` [PATCH 04/10] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
@ 2020-08-03 18:57 ` Taylor Blau
  2020-08-03 18:57 ` [PATCH 06/10] commit-graph.c: sort index into commits list Taylor Blau
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-08-03 18:57 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee

Git uses the 'core.commitGraph' configuration value to control whether
or not the commit graph is used when parsing commits or performing a
traversal.

Now that commit-graphs can also contain a section for changed-path Bloom
filters, administrators that already have commit-graphs may find it
convenient to use those graphs without relying on their changed-path
Bloom filters. This can happen, for example, during a staged roll-out,
or in the event of an incident.

Introduce 'commitgraph.readChangedPaths' to control whether or not Bloom
filters are read. Note that this configuration is independent from both:

  - 'core.commitGraph', to allow flexibility in using all parts of a
    commit-graph _except_ for its Bloom filters.

  - The '--changed-paths' option for 'git commit-graph write', to allow
    reading and writing Bloom filters to be controlled independently.

When the variable is set to false, pretend as if no Bloom data was
specified at all. This avoids adding additional special-casing outside
of the commit-graph internals.
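
The idea can be sketched in isolation as follows. This is a simplified
model, not the parsing code itself: 'struct settings', 'struct graph',
'record_bloom_chunks()' and 'graph_has_bloom()' are hypothetical names
used only for illustration.

```c
#include <stddef.h>

/* Hypothetical, cut-down versions of the repo settings and graph. */
struct settings {
	int read_changed_paths;
};

struct graph {
	const unsigned char *chunk_bloom_indexes;
	const unsigned char *chunk_bloom_data;
};

/*
 * Record the Bloom chunk pointers only when the setting allows it.
 * When it is disabled, the pointers stay NULL, so the graph looks
 * exactly like one that was written without Bloom filters.
 */
static void record_bloom_chunks(struct graph *g, const struct settings *s,
				const unsigned char *bidx,
				const unsigned char *bdat)
{
	g->chunk_bloom_indexes = NULL;
	g->chunk_bloom_data = NULL;
	if (!s->read_changed_paths)
		return;
	g->chunk_bloom_indexes = bidx;
	g->chunk_bloom_data = bdat;
}

/* Callers only ever ask this question; they never look at the setting. */
static int graph_has_bloom(const struct graph *g)
{
	return g->chunk_bloom_indexes && g->chunk_bloom_data;
}
```

Because the decision is made once at load time, code that already copes
with a graph lacking Bloom data needs no further changes.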

Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config.txt             | 2 ++
 Documentation/config/commitgraph.txt | 4 ++++
 commit-graph.c                       | 6 ++++--
 repo-settings.c                      | 3 +++
 repository.h                         | 1 +
 t/t4216-log-bloom.sh                 | 4 +++-
 6 files changed, 17 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/config/commitgraph.txt

diff --git a/Documentation/config.txt b/Documentation/config.txt
index ef0768b91a..78883c6e63 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -340,6 +340,8 @@ include::config/column.txt[]
 
 include::config/commit.txt[]
 
+include::config/commitgraph.txt[]
+
 include::config/credential.txt[]
 
 include::config/completion.txt[]
diff --git a/Documentation/config/commitgraph.txt b/Documentation/config/commitgraph.txt
new file mode 100644
index 0000000000..bb78e72f1b
--- /dev/null
+++ b/Documentation/config/commitgraph.txt
@@ -0,0 +1,4 @@
+commitgraph.readChangedPaths::
+	If true, then git will use the changed-path Bloom filters in the
+	commit-graph file (if it exists, and they are present). Defaults to
+	true. See linkgit:git-commit-graph[1] for more information.
diff --git a/commit-graph.c b/commit-graph.c
index 81a6f2a8ce..cb9d7fea04 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -320,6 +320,8 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 		return NULL;
 	}
 
+	prepare_repo_settings(r);
+
 	graph = alloc_commit_graph();
 
 	graph->hash_len = the_hash_algo->rawsz;
@@ -396,14 +398,14 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 		case GRAPH_CHUNKID_BLOOMINDEXES:
 			if (graph->chunk_bloom_indexes)
 				chunk_repeated = 1;
-			else
+			else if (r->settings.commit_graph_read_changed_paths)
 				graph->chunk_bloom_indexes = data + chunk_offset;
 			break;
 
 		case GRAPH_CHUNKID_BLOOMDATA:
 			if (graph->chunk_bloom_data)
 				chunk_repeated = 1;
-			else {
+			else if (r->settings.commit_graph_read_changed_paths) {
 				uint32_t hash_version;
 				graph->chunk_bloom_data = data + chunk_offset;
 				hash_version = get_be32(data + chunk_offset);
diff --git a/repo-settings.c b/repo-settings.c
index 0918408b34..9e551bc03d 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -17,9 +17,12 @@ void prepare_repo_settings(struct repository *r)
 
 	if (!repo_config_get_bool(r, "core.commitgraph", &value))
 		r->settings.core_commit_graph = value;
+	if (!repo_config_get_bool(r, "commitgraph.readchangedpaths", &value))
+		r->settings.commit_graph_read_changed_paths = value;
 	if (!repo_config_get_bool(r, "gc.writecommitgraph", &value))
 		r->settings.gc_write_commit_graph = value;
 	UPDATE_DEFAULT_BOOL(r->settings.core_commit_graph, 1);
+	UPDATE_DEFAULT_BOOL(r->settings.commit_graph_read_changed_paths, 1);
 	UPDATE_DEFAULT_BOOL(r->settings.gc_write_commit_graph, 1);
 
 	if (!repo_config_get_int(r, "index.version", &value))
diff --git a/repository.h b/repository.h
index 3c1f7d54bd..81759b7d27 100644
--- a/repository.h
+++ b/repository.h
@@ -29,6 +29,7 @@ struct repo_settings {
 	int initialized;
 
 	int core_commit_graph;
+	int commit_graph_read_changed_paths;
 	int gc_write_commit_graph;
 	int fetch_write_commit_graph;
 
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index fe19f6a60c..b3d1f596f8 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -90,7 +90,9 @@ do
 		      "--ancestry-path side..master"
 	do
 		test_expect_success "git log option: $option for path: $path" '
-			test_bloom_filters_used "$option -- $path"
+			test_bloom_filters_used "$option -- $path" &&
+			test_config commitgraph.readChangedPaths false &&
+			test_bloom_filters_not_used "$option -- $path"
 		'
 	done
 done
-- 
2.28.0.rc1.13.ge78abce653



* [PATCH 06/10] commit-graph.c: sort index into commits list
  2020-08-03 18:57 [PATCH 00/10] more miscellaneous Bloom filter improvements Taylor Blau
                   ` (4 preceding siblings ...)
  2020-08-03 18:57 ` [PATCH 05/10] commit-graph: respect 'commitgraph.readChangedPaths' Taylor Blau
@ 2020-08-03 18:57 ` Taylor Blau
  2020-08-04 12:31   ` Derrick Stolee
  2020-08-03 18:57 ` [PATCH 07/10] commit-graph: add large-filters bitmap chunk Taylor Blau
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 117+ messages in thread
From: Taylor Blau @ 2020-08-03 18:57 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee

For locality, 'compute_bloom_filters()' sorts the commits for which it
wants to compute Bloom filters in a preferred order (cf., 3d11275505
(commit-graph: examine commits by generation number, 2020-03-30) for
details).

The subsequent patch will want to recover the new graph position of each
commit. Since the 'packed_commit_list' already stores a double-pointer,
avoid a 'COPY_ARRAY' and instead keep track of an index into the
original list. (Use an integer index instead of a memory address, since
the latter would involve a needlessly confusing triple-pointer.)

Alter the two sorting routines 'commit_pos_cmp' and 'commit_gen_cmp' to
take into account the packed_commit_list they are sorting with respect
to. Since 'compute_bloom_filters()' is the only caller for each of those
comparison functions, no other call-sites need updating.
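
A stripped-down model of sorting indexes rather than pointers might look
like the following. To stay portable it uses a small hand-written
insertion sort in place of 'QSORT_S'; 'sort_indexes()', 'gen_cmp()' and
the one-field 'struct commit' are assumptions made for this sketch.

```c
#include <stddef.h>

/* Cut-down stand-ins; the real structs carry much more. */
struct commit {
	unsigned int generation;
};

struct packed_commit_list {
	struct commit **list;
	int nr;
};

typedef int (*cmp_ctx_fn)(const void *a, const void *b, void *ctx);

/* Compare two *indexes* by looking them up in the list passed as ctx. */
static int gen_cmp(const void *va, const void *vb, void *ctx)
{
	struct packed_commit_list *commits = ctx;
	const struct commit *a = commits->list[*(const int *)va];
	const struct commit *b = commits->list[*(const int *)vb];

	if (a->generation < b->generation)
		return -1;
	if (a->generation > b->generation)
		return 1;
	return 0;
}

/* Insertion sort over an int array with a context-aware comparator. */
static void sort_indexes(int *idx, int nr, cmp_ctx_fn cmp, void *ctx)
{
	int i, j;

	for (i = 1; i < nr; i++) {
		int key = idx[i];
		for (j = i; j > 0 && cmp(&idx[j - 1], &key, ctx) > 0; j--)
			idx[j] = idx[j - 1];
		idx[j] = key;
	}
}
```

After sorting, 'idx[i]' is a position in the untouched original list, so
the caller can recover both the commit and its position without any
pointer arithmetic.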

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 commit-graph.c | 44 ++++++++++++++++++++++++--------------------
 1 file changed, 24 insertions(+), 20 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index cb9d7fea04..d6ea556649 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -79,10 +79,18 @@ static void set_commit_pos(struct repository *r, const struct object_id *oid)
 	*commit_pos_at(&commit_pos, commit) = max_pos++;
 }
 
-static int commit_pos_cmp(const void *va, const void *vb)
+struct packed_commit_list {
+	struct commit **list;
+	int nr;
+	int alloc;
+};
+
+static int commit_pos_cmp(const void *va, const void *vb, void *ctx)
 {
-	const struct commit *a = *(const struct commit **)va;
-	const struct commit *b = *(const struct commit **)vb;
+	struct packed_commit_list *commits = ctx;
+
+	const struct commit *a = commits->list[*(int *)va];
+	const struct commit *b = commits->list[*(int *)vb];
 	return commit_pos_at(&commit_pos, a) -
 	       commit_pos_at(&commit_pos, b);
 }
@@ -139,10 +147,12 @@ static struct commit_graph_data *commit_graph_data_at(const struct commit *c)
 	return data;
 }
 
-static int commit_gen_cmp(const void *va, const void *vb)
+static int commit_gen_cmp(const void *va, const void *vb, void *ctx)
 {
-	const struct commit *a = *(const struct commit **)va;
-	const struct commit *b = *(const struct commit **)vb;
+	struct packed_commit_list *commits = ctx;
+
+	const struct commit *a = commits->list[*(int *)va];
+	const struct commit *b = commits->list[*(int *)vb];
 
 	uint32_t generation_a = commit_graph_generation(a);
 	uint32_t generation_b = commit_graph_generation(b);
@@ -923,12 +933,6 @@ struct tree *get_commit_tree_in_graph(struct repository *r, const struct commit
 	return get_commit_tree_in_graph_one(r, r->objects->commit_graph, c);
 }
 
-struct packed_commit_list {
-	struct commit **list;
-	int nr;
-	int alloc;
-};
-
 struct packed_oid_list {
 	struct object_id *list;
 	int nr;
@@ -1389,7 +1393,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 {
 	int i;
 	struct progress *progress = NULL;
-	struct commit **sorted_commits;
+	int *sorted_commits;
 
 	init_bloom_filters();
 
@@ -1399,15 +1403,15 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 			ctx->commits.nr);
 
 	ALLOC_ARRAY(sorted_commits, ctx->commits.nr);
-	COPY_ARRAY(sorted_commits, ctx->commits.list, ctx->commits.nr);
-
-	if (ctx->order_by_pack)
-		QSORT(sorted_commits, ctx->commits.nr, commit_pos_cmp);
-	else
-		QSORT(sorted_commits, ctx->commits.nr, commit_gen_cmp);
+	for (i = 0; i < ctx->commits.nr; i++)
+		sorted_commits[i] = i;
+	QSORT_S(sorted_commits, ctx->commits.nr,
+		ctx->order_by_pack ? commit_pos_cmp : commit_gen_cmp,
+		&ctx->commits);
 
 	for (i = 0; i < ctx->commits.nr; i++) {
-		struct commit *c = sorted_commits[i];
+		int pos = sorted_commits[i];
+		struct commit *c = ctx->commits.list[pos];
 		struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1);
 		ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
 		display_progress(progress, i + 1);
-- 
2.28.0.rc1.13.ge78abce653



* [PATCH 07/10] commit-graph: add large-filters bitmap chunk
  2020-08-03 18:57 [PATCH 00/10] more miscellaneous Bloom filter improvements Taylor Blau
                   ` (5 preceding siblings ...)
  2020-08-03 18:57 ` [PATCH 06/10] commit-graph.c: sort index into commits list Taylor Blau
@ 2020-08-03 18:57 ` Taylor Blau
  2020-08-03 18:59   ` Taylor Blau
  2020-08-04 12:57   ` Derrick Stolee
  2020-08-03 18:57 ` [PATCH 08/10] bloom: split 'get_bloom_filter()' in two Taylor Blau
                   ` (5 subsequent siblings)
  12 siblings, 2 replies; 117+ messages in thread
From: Taylor Blau @ 2020-08-03 18:57 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee

When a commit has more than 512 changed paths, the commit-graph
machinery represents it as a zero-length filter since having many
entries in the Bloom filter has undesirable effects on the false
positive rate.

In addition to these too-large filters, the commit-graph machinery also
represents commits with no filter and commits with no changed paths in
the same way.

When writing a commit-graph that aggregates several incremental
commit-graph layers (e.g., with '--split=replace'), the commit-graph
machinery first computes all of the Bloom filters that it wants to write
but does not already know about from existing graph layers. Because we
overload the zero-length filter in the above fashion, this leads to
recomputing large filters over and over again.

This is already undesirable, since it means that we are wasting
considerable effort to discover that a commit has more than 512 changed
paths, only to throw that effort away (and then repeat the process the
next time a roll-up is performed).

In a subsequent patch, we will add a '--max-new-filters' option, which
specifies an upper-bound on the number of new filters we are willing to
compute from scratch. Suppose that there are 'N' too-large filters, and
we specify '--max-new-filters=M'. If 'N >= M', it is unlikely that any
filters will be generated, since we'll spend most of our effort on
filters that we ultimately throw away. If 'N < M', filters will trickle
in over time, but only at most 'M - N' per-write.
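
To make the arithmetic above concrete, here is a minimal sketch of one
write without the new chunk. It is not code from this series: the
'enum commit_kind' values and 'count_useful_filters()' are hypothetical
names, and it assumes a non-negative budget and that every filter we
attempt to compute counts against it, whether or not the result is
kept.

```c
#include <assert.h>

/* Hypothetical classification of the commits visited during one write. */
enum commit_kind {
	HAS_FILTER,        /* a usable filter already exists in some layer */
	MISSING_SMALL,     /* no filter yet; computing one yields real data */
	MISSING_TOO_LARGE  /* looks missing, but recomputing yields length 0 */
};

/*
 * Count how many useful filters one write produces when every attempted
 * computation (including those that turn out to be too large) is charged
 * against the budget 'max_new'.
 */
static int count_useful_filters(const enum commit_kind *commits, int nr,
				int max_new)
{
	int i, computed = 0, useful = 0;

	assert(max_new >= 0); /* this sketch does not model "unlimited" */

	for (i = 0; i < nr; i++) {
		if (commits[i] == HAS_FILTER)
			continue;
		if (computed >= max_new)
			break; /* budget exhausted */
		computed++;
		if (commits[i] == MISSING_SMALL)
			useful++;
	}
	return useful;
}
```

With two too-large commits visited first and a budget of three, only
one useful filter is produced; with a budget of two, none are.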

To address this, add a new chunk which encodes a bitmap where the ith
bit is on iff the ith commit has zero or more than 512 changed paths.
When computing Bloom filters, first consult the relevant bitmap (in the
case that we are rolling up existing layers) to see if computing the
Bloom filter from scratch would be a waste of time.
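
As an illustration of the lookup described above, here is a minimal
sketch of such a bitmap check. It does not use the 'struct bitmap' from
ewah/ewok.h; 'struct large_bitmap', 'large_bitmap_get()' and
'should_compute_filter()' are hypothetical names, and the 64-bit word
size is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the new chunk's bitmap. */
#define WORD_BITS 64

struct large_bitmap {
	const uint64_t *words;
	size_t nr_words;
};

/* Returns 1 if bit 'pos' is set, 0 if it is unset or out of range. */
static int large_bitmap_get(const struct large_bitmap *bm, size_t pos)
{
	size_t word = pos / WORD_BITS;

	assert(bm);
	if (word >= bm->nr_words)
		return 0;
	return (int)((bm->words[word] >> (pos % WORD_BITS)) & 1);
}

/*
 * A set bit means that the commit's filter is already known to be
 * zero-length, so computing it from scratch would be wasted effort.
 */
static int should_compute_filter(const struct large_bitmap *bm, size_t pos)
{
	return !large_bitmap_get(bm, pos);
}
```

A writer rolling up existing layers would consult
'should_compute_filter()' before diffing a commit, and carry the bit
forward into the new layer when it returns 0.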

This patch implements a new chunk instead of extending the existing BIDX
and BDAT chunks because modifying these chunks would confuse old
clients. For example, consider setting the most-significant bit in an
entry in the BIDX chunk to indicate that that filter is too-large. New
clients would understand how to interpret this, but old clients would
treat that entry as a really large filter.
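
To illustrate the confusion, here is a rough sketch of how the two
kinds of reader would interpret such an entry. The 'TOO_LARGE_FLAG'
constant and both function names are hypothetical; the sketch only
assumes that an entry is a 32-bit end offset and that a filter's length
is the difference between its end offset and the previous one.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical marker: most-significant bit of a 32-bit index entry. */
#define TOO_LARGE_FLAG 0x80000000u

/* Old reader: every bit of the entry is part of the end offset. */
static uint32_t old_reader_filter_len(uint32_t start_index,
				      uint32_t end_entry)
{
	return end_entry - start_index;
}

/* New reader: strip the marker bit before computing the length. */
static uint32_t new_reader_filter_len(uint32_t start_index,
				      uint32_t end_entry, int *too_large)
{
	assert(too_large);
	*too_large = (end_entry & TOO_LARGE_FLAG) != 0;
	return (end_entry & ~TOO_LARGE_FLAG) - start_index;
}
```

For a zero-length filter at offset 100 with the marker set, the new
reader sees a length of 0 and the too-large flag, while the old reader
computes a length of 0x80000000 bytes.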

By avoiding the need to move to new versions of the BDAT and BIDX chunk,
we can give ourselves more time to consider whether or not other
modifications to these chunks are worthwhile without holding up this
change.

Another approach would be to introduce a new BIDX chunk (say, one
identified by 'BID2') which is identical to the existing BIDX chunk,
except the most-significant bit of each offset is interpreted as "this
filter is too big" iff looking at a BID2 chunk. This avoids having to
write a bitmap, but forces older clients to rewrite their commit-graphs
(as well as reduces the theoretical largest Bloom filters we could
write, and forces us to maintain the code necessary to translate BIDX
chunks to BID2 ones).  Separately from this patch, I implemented this
alternate approach and did not find it to be advantageous.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 .../technical/commit-graph-format.txt         |  9 ++
 commit-graph.c                                | 85 ++++++++++++++++++-
 commit-graph.h                                |  2 +
 t/t4216-log-bloom.sh                          | 20 ++++-
 4 files changed, 112 insertions(+), 4 deletions(-)

diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index 440541045d..a7191c03d3 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -123,6 +123,15 @@ CHUNK DATA:
       of length zero.
     * The BDAT chunk is present if and only if BIDX is present.
 
+  Large Bloom Filters (ID: {'B', 'F', 'X', 'L'}) [Optional]
+    * It contains a list of 'eword_t's (the length of this list is determined by
+      the width of the chunk) which is a bitmap. The 'i'th bit is set exactly
+      when the 'i'th commit in the graph has a changed-path Bloom filter with
+      zero entries (either because the commit is empty, or because it contains
+      more than 512 changed paths).
+    * The BFXL chunk is present on newer versions of Git iff the BIDX and BDAT
+      chunks are also present.
+
   Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional]
       This list of H-byte hashes describe a set of B commit-graph files that
       form a commit-graph chain. The graph position for the ith commit in this
diff --git a/commit-graph.c b/commit-graph.c
index d6ea556649..c870ffe0ed 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -41,8 +41,9 @@ void git_test_write_commit_graph_or_die(void)
 #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
 #define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
 #define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
+#define GRAPH_CHUNKID_BLOOMLARGE 0x4246584c /* "BFXL" */
 #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
-#define MAX_NUM_CHUNKS 7
+#define MAX_NUM_CHUNKS 8
 
 #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
 
@@ -429,6 +430,17 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 				graph->bloom_filter_settings->bits_per_entry = get_be32(data + chunk_offset + 8);
 			}
 			break;
+
+		case GRAPH_CHUNKID_BLOOMLARGE:
+			if (graph->bloom_large.word_alloc)
+				chunk_repeated = 1;
+			else if (r->settings.commit_graph_read_changed_paths) {
+				uint32_t alloc = get_be64(chunk_lookup + 4) - chunk_offset;
+
+				graph->bloom_large.word_alloc = alloc;
+				graph->bloom_large.words = (eword_t *)(data + chunk_offset);
+			}
+			break;
 		}
 
 		if (chunk_repeated) {
@@ -443,6 +455,8 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 		/* We need both the bloom chunks to exist together. Else ignore the data */
 		graph->chunk_bloom_indexes = NULL;
 		graph->chunk_bloom_data = NULL;
+		graph->bloom_large.words = NULL;
+		graph->bloom_large.word_alloc = 0;
 		FREE_AND_NULL(graph->bloom_filter_settings);
 	}
 
@@ -933,6 +947,21 @@ struct tree *get_commit_tree_in_graph(struct repository *r, const struct commit
 	return get_commit_tree_in_graph_one(r, r->objects->commit_graph, c);
 }
 
+static int get_bloom_filter_large_in_graph(struct commit_graph *g,
+					   const struct commit *c)
+{
+	uint32_t graph_pos = commit_graph_position(c);
+	if (graph_pos == COMMIT_NOT_FROM_GRAPH)
+		return 0;
+
+	while (g && graph_pos < g->num_commits_in_base)
+		g = g->base_graph;
+
+	if (!(g && g->bloom_large.word_alloc))
+		return 0;
+	return bitmap_get(&g->bloom_large, graph_pos - g->num_commits_in_base);
+}
+
 struct packed_oid_list {
 	struct object_id *list;
 	int nr;
@@ -969,6 +998,11 @@ struct write_commit_graph_context {
 	const struct split_commit_graph_opts *split_opts;
 	size_t total_bloom_filter_data_size;
 	const struct bloom_filter_settings *bloom_settings;
+	struct bitmap *bloom_large;
+
+	int count_bloom_filter_known_large;
+	int count_bloom_filter_found_large;
+	int count_bloom_filter_computed;
 };
 
 static int write_graph_chunk_fanout(struct hashfile *f,
@@ -1231,6 +1265,15 @@ static int write_graph_chunk_bloom_data(struct hashfile *f,
 	return 0;
 }
 
+static int write_graph_chunk_bloom_large(struct hashfile *f,
+					  struct write_commit_graph_context *ctx)
+{
+	trace2_region_enter("commit-graph", "bloom_large", ctx->r);
+	hashwrite(f, ctx->bloom_large->words, ctx->bloom_large->word_alloc * sizeof(eword_t));
+	trace2_region_leave("commit-graph", "bloom_large", ctx->r);
+	return 0;
+}
+
 static int oid_compare(const void *_a, const void *_b)
 {
 	const struct object_id *a = (const struct object_id *)_a;
@@ -1389,6 +1432,24 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 	stop_progress(&ctx->progress);
 }
 
+static void trace2_bloom_filter_write_statistics(struct write_commit_graph_context *ctx)
+{
+	struct json_writer jw = JSON_WRITER_INIT;
+
+	jw_object_begin(&jw, 0);
+	jw_object_intmax(&jw, "filter_known_large",
+			 ctx->count_bloom_filter_known_large);
+	jw_object_intmax(&jw, "filter_found_large",
+			 ctx->count_bloom_filter_found_large);
+	jw_object_intmax(&jw, "filter_computed",
+			 ctx->count_bloom_filter_computed);
+	jw_end(&jw);
+
+	trace2_data_json("commit-graph", the_repository, "bloom_statistics", &jw);
+
+	jw_release(&jw);
+}
+
 static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 {
 	int i;
@@ -1396,6 +1457,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 	int *sorted_commits;
 
 	init_bloom_filters();
+	ctx->bloom_large = bitmap_word_alloc(ctx->commits.nr / BITS_IN_EWORD);
 
 	if (ctx->report_progress)
 		progress = start_delayed_progress(
@@ -1412,11 +1474,24 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 	for (i = 0; i < ctx->commits.nr; i++) {
 		int pos = sorted_commits[i];
 		struct commit *c = ctx->commits.list[pos];
-		struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1);
-		ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
+		if (get_bloom_filter_large_in_graph(ctx->r->objects->commit_graph, c)) {
+			bitmap_set(ctx->bloom_large, pos);
+			ctx->count_bloom_filter_known_large++;
+		} else {
+			struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1);
+			if (!filter->len) {
+				bitmap_set(ctx->bloom_large, pos);
+				ctx->count_bloom_filter_found_large++;
+			}
+			ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
+			ctx->count_bloom_filter_computed++;
+		}
 		display_progress(progress, i + 1);
 	}
 
+	if (trace2_is_enabled())
+		trace2_bloom_filter_write_statistics(ctx);
+
 	free(sorted_commits);
 	stop_progress(&progress);
 }
@@ -1739,6 +1814,10 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 					  + ctx->total_bloom_filter_data_size;
 		chunks[num_chunks].write_fn = write_graph_chunk_bloom_data;
 		num_chunks++;
+		chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMLARGE;
+		chunks[num_chunks].size = sizeof(eword_t) * ctx->bloom_large->word_alloc;
+		chunks[num_chunks].write_fn = write_graph_chunk_bloom_large;
+		num_chunks++;
 	}
 	if (ctx->num_commit_graphs_after > 1) {
 		chunks[num_chunks].id = GRAPH_CHUNKID_BASE;
diff --git a/commit-graph.h b/commit-graph.h
index d9acb22bac..afbc5fa41e 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -4,6 +4,7 @@
 #include "git-compat-util.h"
 #include "object-store.h"
 #include "oidset.h"
+#include "ewah/ewok.h"
 
 #define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
 #define GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE "GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE"
@@ -71,6 +72,7 @@ struct commit_graph {
 	const unsigned char *chunk_base_graphs;
 	const unsigned char *chunk_bloom_indexes;
 	const unsigned char *chunk_bloom_data;
+	struct bitmap bloom_large;
 
 	struct bloom_filter_settings *bloom_filter_settings;
 };
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index b3d1f596f8..231dc8a3a7 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -33,7 +33,7 @@ test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
 	git commit-graph write --reachable --changed-paths
 '
 graph_read_expect () {
-	NUM_CHUNKS=5
+	NUM_CHUNKS=6
 	cat >expect <<- EOF
 	header: 43475048 1 1 $NUM_CHUNKS 0
 	num_commits: $1
@@ -195,5 +195,23 @@ test_expect_success 'correctly report changes over limit' '
 		done
 	)
 '
+test_bloom_filters_computed () {
+	commit_graph_args=$1
+	bloom_trace_prefix="{\"filter_known_large\":$2,\"filter_found_large\":$3,\"filter_computed\":$4"
+	rm -f "$TRASH_DIRECTORY/trace.event" &&
+	GIT_TRACE2_EVENT="$TRASH_DIRECTORY/trace.event" git commit-graph write $commit_graph_args &&
+	grep "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.event"
+}
+
+test_expect_success 'Bloom generation does not recompute too-large filters' '
+	(
+		cd 513changes &&
+		git commit-graph write --reachable --changed-paths \
+			--split=replace &&
+		test_commit c1 filter &&
+		test_bloom_filters_computed "--reachable --changed-paths --split=replace" \
+			1 0 1
+	)
+'
 
 test_done
-- 
2.28.0.rc1.13.ge78abce653



* [PATCH 08/10] bloom: split 'get_bloom_filter()' in two
  2020-08-03 18:57 [PATCH 00/10] more miscellaneous Bloom filter improvements Taylor Blau
                   ` (6 preceding siblings ...)
  2020-08-03 18:57 ` [PATCH 07/10] commit-graph: add large-filters bitmap chunk Taylor Blau
@ 2020-08-03 18:57 ` Taylor Blau
  2020-08-04 13:00   ` Derrick Stolee
  2020-08-03 18:57 ` [PATCH 09/10] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 117+ messages in thread
From: Taylor Blau @ 2020-08-03 18:57 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee

'get_bloom_filter' takes a flag to control whether it will compute a
Bloom filter if the requested one is missing. In the next patch, we'll
add yet another flag to this method, which would force all but one
caller to specify an extra 'NULL' parameter at the end.

Instead of doing this, split 'get_bloom_filter' into two functions:
'get_bloom_filter' and 'get_or_compute_bloom_filter'. The former only
looks up a Bloom filter (and does not compute one if it's missing,
thus dropping the 'compute_if_not_present' flag). The latter does
compute missing Bloom filters, with an additional parameter to store
whether or not it needed to do so.

This simplifies many call-sites, since the majority of existing callers
to 'get_bloom_filter' do not want missing Bloom filters to be computed
(so they can drop the parameter entirely and use the simpler version of
the function).
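
The shape of the split can be sketched in isolation as follows. This is
not the code in bloom.c: 'struct filter', 'struct slot',
'get_or_compute()' and 'get_only()' are hypothetical stand-ins, and the
"computation" is a placeholder that just assigns a fixed length.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for a filter and its cache slot. */
struct filter {
	size_t len;
};

struct slot {
	struct filter filter;
	int present;
};

/*
 * Look up the filter in 's', computing it only when asked to. If
 * 'computed' is non-NULL, it is set to 1 when a computation happened
 * and to 0 otherwise.
 */
static struct filter *get_or_compute(struct slot *s,
				     int compute_if_not_present,
				     int *computed)
{
	assert(s);
	if (computed)
		*computed = 0;

	if (s->present)
		return &s->filter;
	if (!compute_if_not_present)
		return NULL;

	s->filter.len = 8; /* placeholder for the real computation */
	s->present = 1;
	if (computed)
		*computed = 1;
	return &s->filter;
}

/* Lookup-only wrapper: never computes, so no extra arguments needed. */
#define get_only(s) get_or_compute((s), 0, NULL)
```

Callers that only read filters use the short form, and only the writer
has to care about the out-parameter.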

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 blame.c               |  2 +-
 bloom.c               | 13 ++++++++++---
 bloom.h               |  9 ++++++---
 commit-graph.c        |  6 +++---
 line-log.c            |  2 +-
 revision.c            |  2 +-
 t/helper/test-bloom.c |  2 +-
 7 files changed, 23 insertions(+), 13 deletions(-)

diff --git a/blame.c b/blame.c
index 3e5f8787bc..756285fca7 100644
--- a/blame.c
+++ b/blame.c
@@ -1275,7 +1275,7 @@ static int maybe_changed_path(struct repository *r,
 	if (commit_graph_generation(origin->commit) == GENERATION_NUMBER_INFINITY)
 		return 1;
 
-	filter = get_bloom_filter(r, origin->commit, 0);
+	filter = get_bloom_filter(r, origin->commit);
 
 	if (!filter)
 		return 1;
diff --git a/bloom.c b/bloom.c
index cd9380ac62..a8a21762f4 100644
--- a/bloom.c
+++ b/bloom.c
@@ -177,9 +177,10 @@ static int pathmap_cmp(const void *hashmap_cmp_fn_data,
 	return strcmp(e1->path, e2->path);
 }
 
-struct bloom_filter *get_bloom_filter(struct repository *r,
-				      struct commit *c,
-				      int compute_if_not_present)
+struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
+						 struct commit *c,
+						 int compute_if_not_present,
+						 int *computed)
 {
 	struct bloom_filter *filter;
 	struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
@@ -187,6 +188,9 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 	struct diff_options diffopt;
 	int max_changes = 512;
 
+	if (computed)
+		*computed = 0;
+
 	if (!bloom_filters.slab_size)
 		return NULL;
 
@@ -273,6 +277,9 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 		filter->len = 0;
 	}
 
+	if (computed)
+		*computed = 1;
+
 	free(diff_queued_diff.queue);
 	DIFF_QUEUE_CLEAR(&diff_queued_diff);
 
diff --git a/bloom.h b/bloom.h
index d8fbb0fbf1..2fdc2918f8 100644
--- a/bloom.h
+++ b/bloom.h
@@ -80,9 +80,12 @@ void add_key_to_filter(const struct bloom_key *key,
 
 void init_bloom_filters(void);
 
-struct bloom_filter *get_bloom_filter(struct repository *r,
-				      struct commit *c,
-				      int compute_if_not_present);
+struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
+						 struct commit *c,
+						 int compute_if_not_present,
+						 int *computed);
+
+#define get_bloom_filter(r, c) get_or_compute_bloom_filter((r), (c), 0, NULL)
 
 int bloom_filter_contains(const struct bloom_filter *filter,
 			  const struct bloom_key *key,
diff --git a/commit-graph.c b/commit-graph.c
index c870ffe0ed..2e765b26d5 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1214,7 +1214,7 @@ static int write_graph_chunk_bloom_indexes(struct hashfile *f,
 	uint32_t cur_pos = 0;
 
 	while (list < last) {
-		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
+		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
 		size_t len = filter ? filter->len : 0;
 		cur_pos += len;
 		display_progress(ctx->progress, ++ctx->progress_cnt);
@@ -1253,7 +1253,7 @@ static int write_graph_chunk_bloom_data(struct hashfile *f,
 	hashwrite_be32(f, ctx->bloom_settings->bits_per_entry);
 
 	while (list < last) {
-		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
+		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
 		size_t len = filter ? filter->len : 0;
 
 		display_progress(ctx->progress, ++ctx->progress_cnt);
@@ -1478,7 +1478,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 			bitmap_set(ctx->bloom_large, pos);
 			ctx->count_bloom_filter_known_large++;
 		} else {
-			struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1);
+			struct bloom_filter *filter = get_or_compute_bloom_filter(ctx->r, c, 1, NULL);
 			if (!filter->len) {
 				bitmap_set(ctx->bloom_large, pos);
 				ctx->count_bloom_filter_found_large++;
diff --git a/line-log.c b/line-log.c
index c53692834d..9e58fd185a 100644
--- a/line-log.c
+++ b/line-log.c
@@ -1159,7 +1159,7 @@ static int bloom_filter_check(struct rev_info *rev,
 		return 1;
 
 	if (!rev->bloom_filter_settings ||
-	    !(filter = get_bloom_filter(rev->repo, commit, 0)))
+	    !(filter = get_bloom_filter(rev->repo, commit)))
 		return 1;
 
 	if (!range)
diff --git a/revision.c b/revision.c
index e244beed05..73b59d2134 100644
--- a/revision.c
+++ b/revision.c
@@ -756,7 +756,7 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 	if (commit_graph_generation(commit) == GENERATION_NUMBER_INFINITY)
 		return -1;
 
-	filter = get_bloom_filter(revs->repo, commit, 0);
+	filter = get_bloom_filter(revs->repo, commit);
 
 	if (!filter) {
 		count_bloom_filter_not_present++;
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index f0aa80b98e..76a28a2199 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -39,7 +39,7 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
 	struct bloom_filter *filter;
 	setup_git_directory();
 	c = lookup_commit(the_repository, commit_oid);
-	filter = get_bloom_filter(the_repository, c, 1);
+	filter = get_or_compute_bloom_filter(the_repository, c, 1, NULL);
 	print_bloom_filter(filter);
 }
 
-- 
2.28.0.rc1.13.ge78abce653



* [PATCH 09/10] commit-graph: rename 'split_commit_graph_opts'
  2020-08-03 18:57 [PATCH 00/10] more miscellaneous Bloom filter improvements Taylor Blau
                   ` (7 preceding siblings ...)
  2020-08-03 18:57 ` [PATCH 08/10] bloom: split 'get_bloom_filter()' in two Taylor Blau
@ 2020-08-03 18:57 ` Taylor Blau
  2020-08-03 18:57 ` [PATCH 10/10] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-08-03 18:57 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee

In a subsequent commit, additional options will be added to the
commit-graph API, which themselves do not have to do with splitting.

Rename the 'split_commit_graph_opts' structure to the more-generic
'commit_graph_opts' to encompass both.

Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/commit-graph.c | 20 ++++++++++----------
 commit-graph.c         | 40 ++++++++++++++++++++--------------------
 commit-graph.h         |  6 +++---
 3 files changed, 33 insertions(+), 33 deletions(-)

diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index ba5584463f..38f5f57d15 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -119,7 +119,7 @@ static int graph_verify(int argc, const char **argv)
 }
 
 extern int read_replace_refs;
-static struct split_commit_graph_opts split_opts;
+static struct commit_graph_opts write_opts;
 
 static int write_option_parse_split(const struct option *opt, const char *arg,
 				    int unset)
@@ -187,24 +187,24 @@ static int graph_write(int argc, const char **argv)
 		OPT_BOOL(0, "changed-paths", &opts.enable_changed_paths,
 			N_("enable computation for changed paths")),
 		OPT_BOOL(0, "progress", &opts.progress, N_("force progress reporting")),
-		OPT_CALLBACK_F(0, "split", &split_opts.flags, NULL,
+		OPT_CALLBACK_F(0, "split", &write_opts.flags, NULL,
 			N_("allow writing an incremental commit-graph file"),
 			PARSE_OPT_OPTARG | PARSE_OPT_NONEG,
 			write_option_parse_split),
-		OPT_INTEGER(0, "max-commits", &split_opts.max_commits,
+		OPT_INTEGER(0, "max-commits", &write_opts.max_commits,
 			N_("maximum number of commits in a non-base split commit-graph")),
-		OPT_INTEGER(0, "size-multiple", &split_opts.size_multiple,
+		OPT_INTEGER(0, "size-multiple", &write_opts.size_multiple,
 			N_("maximum ratio between two levels of a split commit-graph")),
-		OPT_EXPIRY_DATE(0, "expire-time", &split_opts.expire_time,
+		OPT_EXPIRY_DATE(0, "expire-time", &write_opts.expire_time,
 			N_("only expire files older than a given date-time")),
 		OPT_END(),
 	};
 
 	opts.progress = isatty(2);
 	opts.enable_changed_paths = -1;
-	split_opts.size_multiple = 2;
-	split_opts.max_commits = 0;
-	split_opts.expire_time = 0;
+	write_opts.size_multiple = 2;
+	write_opts.max_commits = 0;
+	write_opts.expire_time = 0;
 
 	trace2_cmd_mode("write");
 
@@ -232,7 +232,7 @@ static int graph_write(int argc, const char **argv)
 	odb = find_odb(the_repository, opts.obj_dir);
 
 	if (opts.reachable) {
-		if (write_commit_graph_reachable(odb, flags, &split_opts))
+		if (write_commit_graph_reachable(odb, flags, &write_opts))
 			return 1;
 		return 0;
 	}
@@ -261,7 +261,7 @@ static int graph_write(int argc, const char **argv)
 			       opts.stdin_packs ? &pack_indexes : NULL,
 			       opts.stdin_commits ? &commits : NULL,
 			       flags,
-			       &split_opts))
+			       &write_opts))
 		result = 1;
 
 cleanup:
diff --git a/commit-graph.c b/commit-graph.c
index 2e765b26d5..8626024faa 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -995,7 +995,7 @@ struct write_commit_graph_context {
 		 changed_paths:1,
 		 order_by_pack:1;
 
-	const struct split_commit_graph_opts *split_opts;
+	const struct commit_graph_opts *opts;
 	size_t total_bloom_filter_data_size;
 	const struct bloom_filter_settings *bloom_settings;
 	struct bitmap *bloom_large;
@@ -1327,8 +1327,8 @@ static void close_reachable(struct write_commit_graph_context *ctx)
 {
 	int i;
 	struct commit *commit;
-	enum commit_graph_split_flags flags = ctx->split_opts ?
-		ctx->split_opts->flags : COMMIT_GRAPH_SPLIT_UNSPECIFIED;
+	enum commit_graph_split_flags flags = ctx->opts ?
+		ctx->opts->flags : COMMIT_GRAPH_SPLIT_UNSPECIFIED;
 
 	if (ctx->report_progress)
 		ctx->progress = start_delayed_progress(
@@ -1520,7 +1520,7 @@ static int add_ref_to_set(const char *refname,
 
 int write_commit_graph_reachable(struct object_directory *odb,
 				 enum commit_graph_write_flags flags,
-				 const struct split_commit_graph_opts *split_opts)
+				 const struct commit_graph_opts *opts)
 {
 	struct oidset commits = OIDSET_INIT;
 	struct refs_cb_data data;
@@ -1537,7 +1537,7 @@ int write_commit_graph_reachable(struct object_directory *odb,
 	stop_progress(&data.progress);
 
 	result = write_commit_graph(odb, NULL, &commits,
-				    flags, split_opts);
+				    flags, opts);
 
 	oidset_clear(&commits);
 	return result;
@@ -1652,8 +1652,8 @@ static uint32_t count_distinct_commits(struct write_commit_graph_context *ctx)
 static void copy_oids_to_commits(struct write_commit_graph_context *ctx)
 {
 	uint32_t i;
-	enum commit_graph_split_flags flags = ctx->split_opts ?
-		ctx->split_opts->flags : COMMIT_GRAPH_SPLIT_UNSPECIFIED;
+	enum commit_graph_split_flags flags = ctx->opts ?
+		ctx->opts->flags : COMMIT_GRAPH_SPLIT_UNSPECIFIED;
 
 	ctx->num_extra_edges = 0;
 	if (ctx->report_progress)
@@ -1951,13 +1951,13 @@ static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
 	int max_commits = 0;
 	int size_mult = 2;
 
-	if (ctx->split_opts) {
-		max_commits = ctx->split_opts->max_commits;
+	if (ctx->opts) {
+		max_commits = ctx->opts->max_commits;
 
-		if (ctx->split_opts->size_multiple)
-			size_mult = ctx->split_opts->size_multiple;
+		if (ctx->opts->size_multiple)
+			size_mult = ctx->opts->size_multiple;
 
-		flags = ctx->split_opts->flags;
+		flags = ctx->opts->flags;
 	}
 
 	g = ctx->r->objects->commit_graph;
@@ -2135,8 +2135,8 @@ static void expire_commit_graphs(struct write_commit_graph_context *ctx)
 	size_t dirnamelen;
 	timestamp_t expire_time = time(NULL);
 
-	if (ctx->split_opts && ctx->split_opts->expire_time)
-		expire_time = ctx->split_opts->expire_time;
+	if (ctx->opts && ctx->opts->expire_time)
+		expire_time = ctx->opts->expire_time;
 	if (!ctx->split) {
 		char *chain_file_name = get_chain_filename(ctx->odb);
 		unlink(chain_file_name);
@@ -2187,7 +2187,7 @@ int write_commit_graph(struct object_directory *odb,
 		       struct string_list *pack_indexes,
 		       struct oidset *commits,
 		       enum commit_graph_write_flags flags,
-		       const struct split_commit_graph_opts *split_opts)
+		       const struct commit_graph_opts *opts)
 {
 	struct write_commit_graph_context *ctx;
 	uint32_t i, count_distinct = 0;
@@ -2203,7 +2203,7 @@ int write_commit_graph(struct object_directory *odb,
 	ctx->append = flags & COMMIT_GRAPH_WRITE_APPEND ? 1 : 0;
 	ctx->report_progress = flags & COMMIT_GRAPH_WRITE_PROGRESS ? 1 : 0;
 	ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
-	ctx->split_opts = split_opts;
+	ctx->opts = opts;
 	ctx->total_bloom_filter_data_size = 0;
 
 	if (flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS)
@@ -2243,15 +2243,15 @@ int write_commit_graph(struct object_directory *odb,
 			}
 		}
 
-		if (ctx->split_opts)
-			replace = ctx->split_opts->flags & COMMIT_GRAPH_SPLIT_REPLACE;
+		if (ctx->opts)
+			replace = ctx->opts->flags & COMMIT_GRAPH_SPLIT_REPLACE;
 	}
 
 	ctx->approx_nr_objects = approximate_object_count();
 	ctx->oids.alloc = ctx->approx_nr_objects / 32;
 
-	if (ctx->split && split_opts && ctx->oids.alloc > split_opts->max_commits)
-		ctx->oids.alloc = split_opts->max_commits;
+	if (ctx->split && opts && ctx->oids.alloc > opts->max_commits)
+		ctx->oids.alloc = opts->max_commits;
 
 	if (ctx->append) {
 		prepare_commit_graph_one(ctx->r, ctx->odb);
diff --git a/commit-graph.h b/commit-graph.h
index afbc5fa41e..4cd991cf26 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -107,7 +107,7 @@ enum commit_graph_split_flags {
 	COMMIT_GRAPH_SPLIT_REPLACE          = 2
 };
 
-struct split_commit_graph_opts {
+struct commit_graph_opts {
 	int size_multiple;
 	int max_commits;
 	timestamp_t expire_time;
@@ -122,12 +122,12 @@ struct split_commit_graph_opts {
  */
 int write_commit_graph_reachable(struct object_directory *odb,
 				 enum commit_graph_write_flags flags,
-				 const struct split_commit_graph_opts *split_opts);
+				 const struct commit_graph_opts *opts);
 int write_commit_graph(struct object_directory *odb,
 		       struct string_list *pack_indexes,
 		       struct oidset *commits,
 		       enum commit_graph_write_flags flags,
-		       const struct split_commit_graph_opts *split_opts);
+		       const struct commit_graph_opts *opts);
 
 #define COMMIT_GRAPH_VERIFY_SHALLOW	(1 << 0)
 
-- 
2.28.0.rc1.13.ge78abce653



* [PATCH 10/10] builtin/commit-graph.c: introduce '--max-new-filters=<n>'
  2020-08-03 18:57 [PATCH 00/10] more miscellaneous Bloom filter improvements Taylor Blau
                   ` (8 preceding siblings ...)
  2020-08-03 18:57 ` [PATCH 09/10] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
@ 2020-08-03 18:57 ` Taylor Blau
  2020-08-04 13:03   ` Derrick Stolee
  2020-08-05 17:01 ` [PATCH v2 00/14] more miscellaneous Bloom filter improvements Taylor Blau
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 117+ messages in thread
From: Taylor Blau @ 2020-08-03 18:57 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee

Introduce a command-line flag and configuration variable to fill in the
'max_new_filters' variable introduced by the previous patch.

The command-line option '--max-new-filters' takes precedence over
'commitgraph.maxNewFilters', which is the default value.
'--no-max-new-filters' can also be provided, which sets the value back
to '-1', indicating that an unlimited number of new Bloom filters may be
generated. (OPT_INTEGER only allows setting the '--no-' variant back to
'0', hence a custom callback was used instead).
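
The behavior of such a callback can be sketched outside of the
parse-options machinery as follows. 'parse_max_new_filters()' is a
hypothetical name, and unlike a real callback it takes the destination
directly; rejecting an empty argument and returning a bare '-1' on
error are choices of this sketch only.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Parse the argument of a '--max-new-filters'-style option into '*to'.
 * 'unset' is non-zero for the '--no-' form, which selects -1 ("no
 * limit"). Returns 0 on success and -1 if 'arg' is not a number.
 * Range checking of the parsed value is omitted for brevity.
 */
static int parse_max_new_filters(const char *arg, int unset, int *to)
{
	char *end;
	long value;

	assert(to);
	if (unset) {
		*to = -1;
		return 0;
	}
	if (!arg || !*arg)
		return -1;

	value = strtol(arg, &end, 10);
	if (*end)
		return -1; /* trailing garbage, e.g. "12abc" */

	*to = (int)value;
	return 0;
}
```

Here '--no-max-new-filters' maps to '-1' rather than '0', so a caller
can tell "no limit" apart from "compute zero new filters".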

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config/commitgraph.txt |  4 +++
 Documentation/git-commit-graph.txt   |  4 +++
 bloom.c                              | 15 +++++++++++
 builtin/commit-graph.c               | 39 +++++++++++++++++++++++++---
 commit-graph.c                       | 22 +++++++++++-----
 commit-graph.h                       |  1 +
 t/t4216-log-bloom.sh                 | 19 ++++++++++++++
 7 files changed, 95 insertions(+), 9 deletions(-)

diff --git a/Documentation/config/commitgraph.txt b/Documentation/config/commitgraph.txt
index bb78e72f1b..46f48d1aa8 100644
--- a/Documentation/config/commitgraph.txt
+++ b/Documentation/config/commitgraph.txt
@@ -1,3 +1,7 @@
+commitgraph.maxNewFilters::
+	Specifies the default value for the `--max-new-filters` option of `git
+	commit-graph write` (c.f., linkgit:git-commit-graph[1]).
+
 commitgraph.readChangedPaths::
 	If true, then git will use the changed-path Bloom filters in the
 	commit-graph file (if it exists, and they are present). Defaults to
diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index 17405c73a9..e8034ff0ac 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -67,6 +67,10 @@ this option is given, future commit-graph writes will automatically assume
 that this option was intended. Use `--no-changed-paths` to stop storing this
 data.
 +
+With the `--max-new-filters=<n>` option, generate at most `n` new Bloom
+filters (if `--changed-paths` is specified). If `n` is `-1`, no limit is
+enforced. Overrides the `commitgraph.maxNewFilters` configuration.
++
 With the `--split[=<strategy>]` option, write the commit-graph as a
 chain of multiple commit-graph files stored in
 `<dir>/info/commit-graphs`. Commit-graph layers are merged based on the
diff --git a/bloom.c b/bloom.c
index a8a21762f4..1742beac9e 100644
--- a/bloom.c
+++ b/bloom.c
@@ -51,6 +51,21 @@ static int load_bloom_filter_from_graph(struct commit_graph *g,
 	else
 		start_index = 0;
 
+	if ((start_index == end_index) &&
+	    (g->bloom_large.word_alloc && !bitmap_get(&g->bloom_large, lex_pos))) {
+		/*
+		 * If the filter is zero-length, either (1) the filter has no
+		 * changes, (2) the filter has too many changes, or (3) it
+		 * wasn't computed (eg., due to '--max-new-filters').
+		 *
+		 * If either (1) or (2) is the case, the 'large' bit will be set
+		 * for this Bloom filter. If it is unset, then it wasn't
+		 * computed. In that case, return nothing, since we don't have
+		 * that filter in the graph.
+		 */
+		return 0;
+	}
+
 	filter->len = end_index - start_index;
 	filter->data = (unsigned char *)(g->chunk_bloom_data +
 					sizeof(unsigned char) * start_index +
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 38f5f57d15..aaf54a76db 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -13,7 +13,8 @@ static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph verify [--object-dir <objdir>] [--shallow] [--[no-]progress]"),
 	N_("git commit-graph write [--object-dir <objdir>] [--append] "
 	   "[--split[=<strategy>]] [--reachable|--stdin-packs|--stdin-commits] "
-	   "[--changed-paths] [--[no-]progress] <split options>"),
+	   "[--changed-paths] [--[no-]max-new-filters <n>] [--[no-]progress] "
+	   "<split options>"),
 	NULL
 };
 
@@ -25,7 +26,8 @@ static const char * const builtin_commit_graph_verify_usage[] = {
 static const char * const builtin_commit_graph_write_usage[] = {
 	N_("git commit-graph write [--object-dir <objdir>] [--append] "
 	   "[--split[=<strategy>]] [--reachable|--stdin-packs|--stdin-commits] "
-	   "[--changed-paths] [--[no-]progress] <split options>"),
+	   "[--changed-paths] [--[no-]max-new-filters <n>] [--[no-]progress] "
+	   "<split options>"),
 	NULL
 };
 
@@ -162,6 +164,23 @@ static int read_one_commit(struct oidset *commits, struct progress *progress,
 	return 0;
 }
 
+static int write_option_max_new_filters(const struct option *opt,
+					const char *arg,
+					int unset)
+{
+	int *to = opt->value;
+	if (unset)
+		*to = -1;
+	else {
+		const char *s;
+		*to = strtol(arg, (char **)&s, 10);
+		if (*s)
+			return error(_("%s expects a numerical value"),
+				     optname(opt, opt->flags));
+	}
+	return 0;
+}
+
 static int graph_write(int argc, const char **argv)
 {
 	struct string_list pack_indexes = STRING_LIST_INIT_NODUP;
@@ -197,6 +216,9 @@ static int graph_write(int argc, const char **argv)
 			N_("maximum ratio between two levels of a split commit-graph")),
 		OPT_EXPIRY_DATE(0, "expire-time", &write_opts.expire_time,
 			N_("only expire files older than a given date-time")),
+		OPT_CALLBACK_F(0, "max-new-filters", &write_opts.max_new_filters,
+			NULL, N_("maximum number of Bloom filters to compute"),
+			0, write_option_max_new_filters),
 		OPT_END(),
 	};
 
@@ -205,6 +227,7 @@ static int graph_write(int argc, const char **argv)
 	write_opts.size_multiple = 2;
 	write_opts.max_commits = 0;
 	write_opts.expire_time = 0;
+	write_opts.max_new_filters = -1;
 
 	trace2_cmd_mode("write");
 
@@ -270,6 +293,16 @@ static int graph_write(int argc, const char **argv)
 	return result;
 }
 
+static int git_commit_graph_config(const char *var, const char *value, void *cb)
+{
+	if (!strcmp(var, "commitgraph.maxnewfilters")) {
+		write_opts.max_new_filters = git_config_int(var, value);
+		return 0;
+	}
+
+	return git_default_config(var, value, cb);
+}
+
 int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 {
 	static struct option builtin_commit_graph_options[] = {
@@ -283,7 +316,7 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 		usage_with_options(builtin_commit_graph_usage,
 				   builtin_commit_graph_options);
 
-	git_config(git_default_config, NULL);
+	git_config(git_commit_graph_config, &opts);
 	argc = parse_options(argc, argv, prefix,
 			     builtin_commit_graph_options,
 			     builtin_commit_graph_usage,
diff --git a/commit-graph.c b/commit-graph.c
index 8626024faa..87ea0437e2 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1455,6 +1455,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 	int i;
 	struct progress *progress = NULL;
 	int *sorted_commits;
+	int max_new_filters;
 
 	init_bloom_filters();
 	ctx->bloom_large = bitmap_word_alloc(ctx->commits.nr / BITS_IN_EWORD);
@@ -1471,6 +1472,9 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 		ctx->order_by_pack ? commit_pos_cmp : commit_gen_cmp,
 		&ctx->commits);
 
+	max_new_filters = ctx->opts->max_new_filters >= 0 ?
+		ctx->opts->max_new_filters : ctx->commits.nr;
+
 	for (i = 0; i < ctx->commits.nr; i++) {
 		int pos = sorted_commits[i];
 		struct commit *c = ctx->commits.list[pos];
@@ -1478,13 +1482,19 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 			bitmap_set(ctx->bloom_large, pos);
 			ctx->count_bloom_filter_known_large++;
 		} else {
-			struct bloom_filter *filter = get_or_compute_bloom_filter(ctx->r, c, 1, NULL);
-			if (!filter->len) {
-				bitmap_set(ctx->bloom_large, pos);
-				ctx->count_bloom_filter_found_large++;
+			int computed = 0;
+			struct bloom_filter *filter = get_or_compute_bloom_filter(
+				ctx->r, c,
+				ctx->count_bloom_filter_computed < max_new_filters,
+				&computed);
+			if (computed) {
+				ctx->count_bloom_filter_computed++;
+				if (filter && !filter->len) {
+					bitmap_set(ctx->bloom_large, pos);
+					ctx->count_bloom_filter_found_large++;
+				}
 			}
-			ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
-			ctx->count_bloom_filter_computed++;
+			ctx->total_bloom_filter_data_size += sizeof(unsigned char) * (filter ? filter->len : 0);
 		}
 		display_progress(progress, i + 1);
 	}
diff --git a/commit-graph.h b/commit-graph.h
index 4cd991cf26..0842d13744 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -112,6 +112,7 @@ struct commit_graph_opts {
 	int max_commits;
 	timestamp_t expire_time;
 	enum commit_graph_split_flags flags;
+	int max_new_filters;
 };
 
 /*
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 231dc8a3a7..985bea23b2 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -214,4 +214,23 @@ test_expect_success 'Bloom generation does not recompute too-large filters' '
 	)
 '
 
+test_expect_success 'Bloom generation is limited by --max-new-filters' '
+	(
+		cd 513changes &&
+		test_commit c2 filter &&
+		test_commit c3 filter &&
+		test_commit c4 no-filter &&
+		test_bloom_filters_computed "--reachable --changed-paths --split=replace --max-new-filters=2" \
+			1 0 2
+	)
+'
+
+test_expect_success 'Bloom generation backfills previously-skipped filters' '
+	(
+		cd 513changes &&
+		test_bloom_filters_computed "--reachable --changed-paths --split=replace --max-new-filters=1" \
+			1 0 1
+	)
+'
+
 test_done
-- 
2.28.0.rc1.13.ge78abce653


* Re: [PATCH 07/10] commit-graph: add large-filters bitmap chunk
  2020-08-03 18:57 ` [PATCH 07/10] commit-graph: add large-filters bitmap chunk Taylor Blau
@ 2020-08-03 18:59   ` Taylor Blau
  2020-08-04 12:57   ` Derrick Stolee
  1 sibling, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-08-03 18:59 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, dstolee

On Mon, Aug 03, 2020 at 02:57:33PM -0400, Taylor Blau wrote:
> When a commit has 512 changed paths or greater, the commit-graph
> machinery represents it as a zero-length filter since having many
> entries in the Bloom filter has undesirable effects on the false
> positivity rate.

One thing that occurs to me after sending this patch is that 512 may not
be the upper-bound on the number of changed-paths forever.

Maybe we should store a uint32_t as the first four bytes of this chunk
specifying where we draw the line?
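
A rough sketch of what reading such a header could look like, assuming a
hypothetical layout where the chunk begins with a four-byte big-endian
threshold followed by the bitmap words. The names `read_be32` and
`parse_bfxl_header` are made up for illustration and are not part of this
series.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode a big-endian 32-bit value. */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Assumed layout (not what the patch writes today):
 *   <4-byte max changed paths> <bitmap bytes...>
 * Returns 0 on success, -1 if the chunk cannot hold the header.
 */
static int parse_bfxl_header(const unsigned char *chunk, size_t len,
			     uint32_t *max_changed_paths,
			     const unsigned char **bitmap,
			     size_t *bitmap_len)
{
	if (len < 4)
		return -1;
	*max_changed_paths = read_be32(chunk);
	*bitmap = chunk + 4;
	*bitmap_len = len - 4;
	return 0;
}
```

A reader that finds a threshold different from its own compiled-in value
could then decide whether to trust the "large" bits or ignore the chunk.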

Thanks,
Taylor


* Re: [PATCH 01/10] commit-graph: introduce 'get_bloom_filter_settings()'
  2020-08-03 18:57 ` [PATCH 01/10] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
@ 2020-08-04  7:24   ` Jeff King
  2020-08-04 20:08     ` Taylor Blau
  0 siblings, 1 reply; 117+ messages in thread
From: Jeff King @ 2020-08-04  7:24 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee

On Mon, Aug 03, 2020 at 02:57:09PM -0400, Taylor Blau wrote:

> diff --git a/revision.c b/revision.c
> index 6de29cdf7a..e244beed05 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -684,7 +684,7 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>  	if (!revs->repo->objects->commit_graph)
>  		return;
>  
> -	revs->bloom_filter_settings = revs->repo->objects->commit_graph->bloom_filter_settings;
> +	revs->bloom_filter_settings = get_bloom_filter_settings(revs->repo);
>  	if (!revs->bloom_filter_settings)
>  		return;

I think you could get rid of the revs->repo->objects->commit_graph check
above now, as get_bloom_filter_settings() would return NULL if there is
no commit_graph at all.
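
To illustrate why the caller-side check becomes redundant, here is a
minimal sketch of a lookup that walks a chain of layers. It assumes the
helper behaves roughly like this; `struct settings`, `struct graph_layer`
and `find_bloom_settings` are stand-ins, not git's real types or function.

```c
#include <stddef.h>

/* Stand-in types for illustration only; not git's real structs. */
struct settings {
	int num_hashes;
};

struct graph_layer {
	struct settings *bloom_settings; /* NULL if this layer has no filters */
	struct graph_layer *base;        /* next-older layer, or NULL */
};

/*
 * Walk from the newest layer towards the oldest and return the first
 * settings found. Returns NULL when there is no graph at all or when
 * no layer carries settings.
 */
static struct settings *find_bloom_settings(struct graph_layer *g)
{
	for (; g; g = g->base)
		if (g->bloom_settings)
			return g->bloom_settings;
	return NULL;
}
```

Because a NULL graph falls straight through the loop, a single NULL check
on the return value covers both "no graph" and "no filters".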

-Peff


* Re: [PATCH 06/10] commit-graph.c: sort index into commits list
  2020-08-03 18:57 ` [PATCH 06/10] commit-graph.c: sort index into commits list Taylor Blau
@ 2020-08-04 12:31   ` Derrick Stolee
  2020-08-04 20:10     ` Taylor Blau
  0 siblings, 1 reply; 117+ messages in thread
From: Derrick Stolee @ 2020-08-04 12:31 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: peff, dstolee

On 8/3/2020 2:57 PM, Taylor Blau wrote:
> For locality, 'compute_bloom_filters()' sorts the commits for which it
> wants to compute Bloom filters in a preferred order (cf., 3d11275505
> (commit-graph: examine commits by generation number, 2020-03-30) for
> details).
> 
> The subsequent patch will want to recover the new graph position of each
> commit. Since the 'packed_commit_list' already stores a double-pointer,
> avoid a 'COPY_ARRAY' and instead keep track of an index into the
> original list. (Use an integer index instead of a memory address, since
> this involves a needlessly confusing triple-pointer).

It took me a little while to grok that we are switching from sorting
a list of commit pointers to sorting a list of integers. However, that
makes a lot of sense. It preserves the commit list sorted by OID for
binary search, which you will need soon. Perhaps another change would
need that at another time, too.
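
For anyone else who needs a moment with this, a small standalone sketch of
the idea: sort an array of integer indices by a key in the original array,
leaving the original order intact. This is not the patch's code; `struct
item`, `index_cmp` and `sort_indices` are invented names, and it uses a
file-scope pointer because plain qsort() has no context argument.

```c
#include <stdlib.h>

/* Stand-in for a commit: only the sort key matters here. */
struct item {
	unsigned generation;
};

/* qsort() takes no context argument, so the sketch uses a file-scope pointer. */
static const struct item *sort_base;

static int index_cmp(const void *va, const void *vb)
{
	const struct item *a = &sort_base[*(const int *)va];
	const struct item *b = &sort_base[*(const int *)vb];

	if (a->generation < b->generation)
		return -1;
	if (a->generation > b->generation)
		return 1;
	return 0;
}

/* Fill 'order' with 0..nr-1 sorted by generation; 'items' is not modified. */
static void sort_indices(const struct item *items, int *order, int nr)
{
	int i;

	for (i = 0; i < nr; i++)
		order[i] = i;
	sort_base = items;
	qsort(order, nr, sizeof(*order), index_cmp);
}
```

After the sort, `order[i]` is still a valid position in the untouched
list, which is what makes it usable as a bit position later on.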

> Alter the two sorting routines 'commit_pos_cmp' and 'commit_gen_cmp' to
> take into account the packed_commit_list they are sorting with respect
> to. Since 'compute_bloom_filters()' is the only caller for each of those
> comparison functions, no other call-sites need updating.

Parsing the changes to these functions is the most complicated, because
of the int-to-commit indirection. I think they are correct and as easy
to read as possible.

-Stolee


* Re: [PATCH 07/10] commit-graph: add large-filters bitmap chunk
  2020-08-03 18:57 ` [PATCH 07/10] commit-graph: add large-filters bitmap chunk Taylor Blau
  2020-08-03 18:59   ` Taylor Blau
@ 2020-08-04 12:57   ` Derrick Stolee
  1 sibling, 0 replies; 117+ messages in thread
From: Derrick Stolee @ 2020-08-04 12:57 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: peff, dstolee

On 8/3/2020 2:57 PM, Taylor Blau wrote:
> When a commit has 512 changed paths or greater, the commit-graph
> machinery represents it as a zero-length filter since having many
> entries in the Bloom filter has undesirable effects on the false
> positivity rate.

You mentioned in another message that you'd like to encode the magic
"512" constant into this chunk. I think that is a good idea. Perhaps
it is worth describing it as a MAX_CHANGED_PATHS variable or something.

I don't think now is the time to expose it as a command-line argument,
but encoding it in the file format is a good idea. This is similar to
the start of the BDAT chunk starting with three values from the
bloom_filter_settings, which are not actually customizable from arguments.

For testing, I would recommend using a GIT_TEST_* variable, just like
GIT_TEST_BLOOM_SETTINGS_NUM_HASHES and GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY
are used in t4216-log-bloom.sh to test those BDAT chunk values.
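
As a sketch of the shape I have in mind (not a proposal for exact code):
read an override from the environment and fall back to the default when it
is unset or malformed. The variable name
GIT_TEST_EXAMPLE_MAX_CHANGED_PATHS and both function names below are
placeholders I made up for this example.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Parse a non-negative decimal limit. Returns 'fallback' when the
 * string is NULL, empty, has trailing garbage, or is out of range.
 */
static int limit_from_string(const char *value, int fallback)
{
	char *end;
	long n;

	if (!value || !*value)
		return fallback;
	errno = 0;
	n = strtol(value, &end, 10);
	if (errno || *end || n < 0 || n > INT_MAX)
		return fallback;
	return (int)n;
}

/* Placeholder variable name; 512 is the limit discussed in this thread. */
static int max_changed_paths(void)
{
	return limit_from_string(getenv("GIT_TEST_EXAMPLE_MAX_CHANGED_PATHS"),
				 512);
}
```

A test could then export a small value and trip the "too large" path with
a handful of files instead of 513.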

> In addition to these too-large filters, the commit-graph machinery also
> represents commits with no filter and commits with no changed paths in
> the same way.
> 
> When writing a commit-graph that aggregates several incremental
> commit-graph layers (e.g., with '--split=replace'), the commit-graph
> machinery first computes all of the Bloom filters that it wants to write
> but does not already know about from existing graph layers. Because we
> overload the zero-length filter in the above fashion, this leads to
> recomputing large filters over and over again.
> 
> This is already undesirable, since it means that we are wasting
> considerable effort to discover that a commit has at least 512 changed
> paths, only to throw that effort away (and then repeat the process the
> next time a roll-up is performed).
> 
> In a subsequent patch, we will add a '--max-new-filters' option, which
> specifies an upper-bound on the number of new filters we are willing to
> compute from scratch. Suppose that there are 'N' too-large filters, and
> we specify '--max-new-filters=M'. If 'N >= M', it is unlikely that any
> filters will be generated, since we'll spend most of our effort on
> filters that we ultimately throw away. If 'N < M', filters will trickle
> in over time, but only at most 'M - N' per-write.
> 
> To address this, add a new chunk which encodes a bitmap where the ith
> bit is on iff the ith commit has zero or at least 512 changed paths.
> When computing Bloom filters, first consult the relevant bitmap (in the
> case that we are rolling up existing layers) to see if computing the
> Bloom filter from scratch would be a waste of time.
> 
> This patch implements a new chunk instead of extending the existing BIDX
> and BDAT chunks because modifying these chunks would confuse old
> clients. For example, consider setting the most-significant bit in an
> entry in the BIDX chunk to indicate that that filter is too-large. New
> clients would understand how to interpret this, but old clients would
> treat that entry as a really large filter.
> 
> By avoiding the need to move to new versions of the BDAT and BIDX chunk,
> we can give ourselves more time to consider whether or not other
> modifications to these chunks are worthwhile without holding up this
> change.
> 
> Another approach would be to introduce a new BIDX chunk (say, one
> identified by 'BID2') which is identical to the existing BIDX chunk,
> except the most-significant bit of each offset is interpreted as "this
> filter is too big" iff looking at a BID2 chunk. This avoids having to
> write a bitmap, but forces older clients to rewrite their commit-graphs
> (as well as reduces the theoretical largest Bloom filters we could
> write, and forces us to maintain the code necessary to translate BIDX
> chunks to BID2 ones).  Separately from this patch, I implemented this
> alternate approach and did not find it to be advantageous.
> 
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  .../technical/commit-graph-format.txt         |  9 ++
>  commit-graph.c                                | 85 ++++++++++++++++++-
>  commit-graph.h                                |  2 +
>  t/t4216-log-bloom.sh                          | 20 ++++-
>  4 files changed, 112 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
> index 440541045d..a7191c03d3 100644
> --- a/Documentation/technical/commit-graph-format.txt
> +++ b/Documentation/technical/commit-graph-format.txt
> @@ -123,6 +123,15 @@ CHUNK DATA:
>        of length zero.
>      * The BDAT chunk is present if and only if BIDX is present.
>  
> +  Large Bloom Filters (ID: {'B', 'F', 'X', 'L'}) [Optional]
> +    * It contains a list of 'eword_t's (the length of this list is determined by

I'm not sure using 'eword_t' in this documentation is helpful. How large is that
type? Can we make that concrete in terms of 32- or 64-bit words? Below, you
describe that this is an uncompressed bitmap whose 'i'th bit corresponds to the
'i'th commit.
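
To make the 64-bit reading concrete, here is a small sketch of an
uncompressed bitmap over 64-bit words, assuming bit 'i' lives in word
'i / 64' at position 'i % 64'. The helper names `bit_set` and `bit_get`
are invented for the example, and on-disk byte order is deliberately left
out of it.

```c
#include <stddef.h>
#include <stdint.h>

/* Set bit i; the caller must ensure words[] has more than i / 64 entries. */
static void bit_set(uint64_t *words, size_t i)
{
	words[i / 64] |= (uint64_t)1 << (i % 64);
}

/* Return 1 if bit i is set. Positions past nr_words read as unset. */
static int bit_get(const uint64_t *words, size_t nr_words, size_t i)
{
	if (i / 64 >= nr_words)
		return 0;
	return (int)((words[i / 64] >> (i % 64)) & 1);
}
```

With this reading, commit 65 maps to bit 1 of the second word.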

> +      the width of the chunk) which is a bitmap. The 'i'th bit is set exactly
> +      when the 'i'th commit in the graph has a changed-path Bloom filter with
> +      zero entries (either because the commit is empty, or because it contains
> +      more than 512 changed paths).
As I expect you will update this in v2, try to create a name for the "512" constant
to re-use here.

> +    * The BFXL chunk is present on newer versions of Git iff the BIDX and BDAT
> +      chunks are also present.

Expand "iff" to "if and only if". Perhaps this would be simpler to have only one
direction:

	* The BFXL chunk is present only when the BIDX and BDAT chunks are
	  also present.

> +
>    Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional]
>        This list of H-byte hashes describe a set of B commit-graph files that
>        form a commit-graph chain. The graph position for the ith commit in this
> diff --git a/commit-graph.c b/commit-graph.c
> index d6ea556649..c870ffe0ed 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -41,8 +41,9 @@ void git_test_write_commit_graph_or_die(void)
>  #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
>  #define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
>  #define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
> +#define GRAPH_CHUNKID_BLOOMLARGE 0x4246584c /* "BFXL" */
>  #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
> -#define MAX_NUM_CHUNKS 7
> +#define MAX_NUM_CHUNKS 8
>  
>  #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
>  
> @@ -429,6 +430,17 @@ struct commit_graph *parse_commit_graph(struct repository *r,
>  				graph->bloom_filter_settings->bits_per_entry = get_be32(data + chunk_offset + 8);
>  			}
>  			break;
> +
> +		case GRAPH_CHUNKID_BLOOMLARGE:
> +			if (graph->bloom_large.word_alloc)
> +				chunk_repeated = 1;
> +			else if (r->settings.commit_graph_read_changed_paths) {
> +				uint32_t alloc = get_be64(chunk_lookup + 4) - chunk_offset;
> +
> +				graph->bloom_large.word_alloc = alloc;
> +				graph->bloom_large.words = (eword_t *)(data + chunk_offset);
> +			}
> +			break;
>  		}
>  
>  		if (chunk_repeated) {
> @@ -443,6 +455,8 @@ struct commit_graph *parse_commit_graph(struct repository *r,
>  		/* We need both the bloom chunks to exist together. Else ignore the data */
>  		graph->chunk_bloom_indexes = NULL;
>  		graph->chunk_bloom_data = NULL;
> +		graph->bloom_large.words = NULL;
> +		graph->bloom_large.word_alloc = 0;
>  		FREE_AND_NULL(graph->bloom_filter_settings);
>  	}
>  
> @@ -933,6 +947,21 @@ struct tree *get_commit_tree_in_graph(struct repository *r, const struct commit
>  	return get_commit_tree_in_graph_one(r, r->objects->commit_graph, c);
>  }
>  
> +static int get_bloom_filter_large_in_graph(struct commit_graph *g,
> +					   const struct commit *c)
> +{
> +	uint32_t graph_pos = commit_graph_position(c);
> +	if (graph_pos == COMMIT_NOT_FROM_GRAPH)
> +		return 0;
> +
> +	while (g && graph_pos < g->num_commits_in_base)
> +		g = g->base_graph;
> +
> +	if (!(g && g->bloom_large.word_alloc))
> +		return 0;
> +	return bitmap_get(&g->bloom_large, graph_pos - g->num_commits_in_base);
> +}
> +
>  struct packed_oid_list {
>  	struct object_id *list;
>  	int nr;
> @@ -969,6 +998,11 @@ struct write_commit_graph_context {
>  	const struct split_commit_graph_opts *split_opts;
>  	size_t total_bloom_filter_data_size;
>  	const struct bloom_filter_settings *bloom_settings;
> +	struct bitmap *bloom_large;
> +
> +	int count_bloom_filter_known_large;
> +	int count_bloom_filter_found_large;
> +	int count_bloom_filter_computed;
>  };
>  
>  static int write_graph_chunk_fanout(struct hashfile *f,
> @@ -1231,6 +1265,15 @@ static int write_graph_chunk_bloom_data(struct hashfile *f,
>  	return 0;
>  }
>  
> +static int write_graph_chunk_bloom_large(struct hashfile *f,
> +					  struct write_commit_graph_context *ctx)
> +{
> +	trace2_region_enter("commit-graph", "bloom_large", ctx->r);
> +	hashwrite(f, ctx->bloom_large->words, ctx->bloom_large->word_alloc * sizeof(eword_t));

Should we be using word_alloc here? Or can we trust that the bitmap
was allocated to the exact right size to fit our number of commits?
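
For reference, the exact number of 64-bit words needed for one bit per
commit rounds up, which is not what a plain integer division gives. A
minimal sketch, with `words_for_bits` as an invented name and 64 assumed
as the word width:

```c
#include <stddef.h>

#define BITS_PER_WORD 64

/* Number of 64-bit words needed to hold nr_bits bits (rounds up). */
static size_t words_for_bits(size_t nr_bits)
{
	return nr_bits / BITS_PER_WORD + (nr_bits % BITS_PER_WORD != 0);
}
```

For 513 commits, `513 / 64` is 8 but 9 words are required, so the written
size depends on how the bitmap grew after the initial allocation.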

We probably need to be careful about network-byte order here, right?
Or, should we treat the bitmap as an array of bytes?

I see that eword_t is a 64-bit type, which makes things a _little_
tricky. We normally would use hashwrite_be32() for writing 32-bit
words. Likely, we will want a new hashwrite_be64() that is a wrapper
around hashwrite_be32(). Notice that this could be used when writing
the chunk offsets in commit-graph.c and midx.c. For example, we
write the 64-bit offsets in write_commit_graph_file() as follows:

	for (i = 0; i <= num_chunks; i++) {
		uint32_t chunk_write[3];

		chunk_write[0] = htonl(chunk_ids[i]);
		chunk_write[1] = htonl(chunk_offsets[i] >> 32);
		chunk_write[2] = htonl(chunk_offsets[i] & 0xffffffff);
		hashwrite(f, chunk_write, 12);
	}

This would be nicer as

	for (i = 0; i <= num_chunks; i++) {
		hashwrite_be32(f, chunk_ids[i]);
		hashwrite_be64(f, chunk_offsets[i]);
	}

I know this adds complexity to an already meaty series. However, this
should help us avoid this network-byte order issue in the future while
cleaning up existing code.
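
A standalone sketch of the wrapper idea, writing into a plain byte buffer
rather than a hashfile so it can be tried in isolation. `struct sink`,
`sink_write_be32` and `sink_write_be64` are stand-ins for illustration; a
real hashwrite_be64() would call hashwrite_be32() twice in the same way.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a hashfile: bytes are appended to a fixed buffer. */
struct sink {
	unsigned char buf[64];
	size_t len;
};

/* Append v in network byte order; the caller ensures there is room. */
static void sink_write_be32(struct sink *s, uint32_t v)
{
	s->buf[s->len++] = (unsigned char)(v >> 24);
	s->buf[s->len++] = (unsigned char)(v >> 16);
	s->buf[s->len++] = (unsigned char)(v >> 8);
	s->buf[s->len++] = (unsigned char)v;
}

/* High 32 bits first, then low 32 bits, each in network byte order. */
static void sink_write_be64(struct sink *s, uint64_t v)
{
	sink_write_be32(s, (uint32_t)(v >> 32));
	sink_write_be32(s, (uint32_t)(v & 0xffffffff));
}
```

Writing each bitmap word through a helper like this would give the same
bytes on disk regardless of host endianness.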

> +	trace2_region_leave("commit-graph", "bloom_large", ctx->r);
> +	return 0;
> +}
> +
>  static int oid_compare(const void *_a, const void *_b)
>  {
>  	const struct object_id *a = (const struct object_id *)_a;
> @@ -1389,6 +1432,24 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  	stop_progress(&ctx->progress);
>  }
>  
> +static void trace2_bloom_filter_write_statistics(struct write_commit_graph_context *ctx)
> +{
> +	struct json_writer jw = JSON_WRITER_INIT;
> +
> +	jw_object_begin(&jw, 0);
> +	jw_object_intmax(&jw, "filter_known_large",
> +			 ctx->count_bloom_filter_known_large);
> +	jw_object_intmax(&jw, "filter_found_large",
> +			 ctx->count_bloom_filter_found_large);
> +	jw_object_intmax(&jw, "filter_computed",
> +			 ctx->count_bloom_filter_computed);
> +	jw_end(&jw);

Helpful stats.

> +
> +	trace2_data_json("commit-graph", the_repository, "bloom_statistics", &jw);
> +
> +	jw_release(&jw);
> +}
> +
>  static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>  {
>  	int i;
> @@ -1396,6 +1457,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>  	int *sorted_commits;
>  
>  	init_bloom_filters();
> +	ctx->bloom_large = bitmap_word_alloc(ctx->commits.nr / BITS_IN_EWORD);
>  
>  	if (ctx->report_progress)
>  		progress = start_delayed_progress(
> @@ -1412,11 +1474,24 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>  	for (i = 0; i < ctx->commits.nr; i++) {
>  		int pos = sorted_commits[i];
>  		struct commit *c = ctx->commits.list[pos];
> -		struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1);
> -		ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
> +		if (get_bloom_filter_large_in_graph(ctx->r->objects->commit_graph, c)) {
> +			bitmap_set(ctx->bloom_large, pos);
> +			ctx->count_bloom_filter_known_large++;
> +		} else {
> +			struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1);
> +			if (!filter->len) {
> +				bitmap_set(ctx->bloom_large, pos);
> +				ctx->count_bloom_filter_found_large++;
> +			}
> +			ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;

Is this sizeof(unsigned char) necessary? Or just a helpful reminder that
filter->len is a number of bytes?

> +			ctx->count_bloom_filter_computed++;
> +		}
>  		display_progress(progress, i + 1);
>  	}
>  
> +	if (trace2_is_enabled())
> +		trace2_bloom_filter_write_statistics(ctx);
> +
>  	free(sorted_commits);
>  	stop_progress(&progress);
>  }
> @@ -1739,6 +1814,10 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>  					  + ctx->total_bloom_filter_data_size;
>  		chunks[num_chunks].write_fn = write_graph_chunk_bloom_data;
>  		num_chunks++;
> +		chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMLARGE;
> +		chunks[num_chunks].size = sizeof(eword_t) * ctx->bloom_large->word_alloc;
> +		chunks[num_chunks].write_fn = write_graph_chunk_bloom_large;
> +		num_chunks++;
>  	}
>  	if (ctx->num_commit_graphs_after > 1) {
>  		chunks[num_chunks].id = GRAPH_CHUNKID_BASE;
> diff --git a/commit-graph.h b/commit-graph.h
> index d9acb22bac..afbc5fa41e 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -4,6 +4,7 @@
>  #include "git-compat-util.h"
>  #include "object-store.h"
>  #include "oidset.h"
> +#include "ewah/ewok.h"
>  
>  #define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
>  #define GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE "GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE"
> @@ -71,6 +72,7 @@ struct commit_graph {
>  	const unsigned char *chunk_base_graphs;
>  	const unsigned char *chunk_bloom_indexes;
>  	const unsigned char *chunk_bloom_data;
> +	struct bitmap bloom_large;
>  
>  	struct bloom_filter_settings *bloom_filter_settings;
>  };
> diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
> index b3d1f596f8..231dc8a3a7 100755
> --- a/t/t4216-log-bloom.sh
> +++ b/t/t4216-log-bloom.sh
> @@ -33,7 +33,7 @@ test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
>  	git commit-graph write --reachable --changed-paths
>  '
>  graph_read_expect () {
> -	NUM_CHUNKS=5
> +	NUM_CHUNKS=6
>  	cat >expect <<- EOF
>  	header: 43475048 1 1 $NUM_CHUNKS 0
>  	num_commits: $1
> @@ -195,5 +195,23 @@ test_expect_success 'correctly report changes over limit' '
>  		done
>  	)
>  '
> +test_bloom_filters_computed () {
> +	commit_graph_args=$1
> +	bloom_trace_prefix="{\"filter_known_large\":$2,\"filter_found_large\":$3,\"filter_computed\":$4"
> +	rm -f "$TRASH_DIRECTORY/trace.event" &&
> +	GIT_TRACE2_EVENT="$TRASH_DIRECTORY/trace.event" git commit-graph write $commit_graph_args &&
> +	grep "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.event"
> +}

Could this helper be moved earlier, so we can use
test_bloom_filters_computed in 'correctly report changes
over limit'? We could then check that filter_known_large is
zero while filter_found_large is 1.

These tests would be an excellent opportunity to use a
GIT_TEST_* variable to lower the 512 limit to something
smaller to make the test run a tiny bit faster ;).

> +
> +test_expect_success 'Bloom generation does not recompute too-large filters' '
> +	(
> +		cd 513changes &&
> +		git commit-graph write --reachable --changed-paths \
> +			--split=replace &&
> +		test_commit c1 filter &&
> +		test_bloom_filters_computed "--reachable --changed-paths --split=replace" \
> +			1 0 1
> +	)
> +'

Adding this test is valuable. Thanks.

-Stolee



* Re: [PATCH 08/10] bloom: split 'get_bloom_filter()' in two
  2020-08-03 18:57 ` [PATCH 08/10] bloom: split 'get_bloom_filter()' in two Taylor Blau
@ 2020-08-04 13:00   ` Derrick Stolee
  2020-08-04 20:12     ` Taylor Blau
  0 siblings, 1 reply; 117+ messages in thread
From: Derrick Stolee @ 2020-08-04 13:00 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: peff, dstolee

On 8/3/2020 2:57 PM, Taylor Blau wrote:
> 'get_bloom_filter' takes a flag to control whether it will compute a
> Bloom filter if the requested one is missing. In the next patch, we'll
> add yet another flag to this method, which would force all but one
> caller to specify an extra 'NULL' parameter at the end.
> 
> Instead of doing this, split 'get_bloom_filter' into two functions:
> 'get_bloom_filter' and 'get_or_compute_bloom_filter'. The former only
> looks up a Bloom filter (and does not compute one if it's missing,
> thus dropping the 'compute_if_not_present' flag). The latter does
> compute missing Bloom filters, with an additional parameter to store
> whether or not it needed to do so.
> 
> This simplifies many call-sites, since the majority of existing callers
> to 'get_bloom_filter' do not want missing Bloom filters to be computed
> (so they can drop the parameter entirely and use the simpler version of
> the function).

> +struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
> +						 struct commit *c,
> +						 int compute_if_not_present,
> +						 int *computed)

Could we further simplify this by letting "computed" be the indicator
for whether we should compute the filter? If "computed" is NULL, then
we won't compute it directly. This allows us to reduce the "1, NULL)"
to "NULL)" in these callers:

> +			struct bloom_filter *filter = get_or_compute_bloom_filter(ctx->r, c, 1, NULL);

> +	filter = get_or_compute_bloom_filter(the_repository, c, 1, NULL);

Thanks,
-Stolee


* Re: [PATCH 10/10] builtin/commit-graph.c: introduce '--max-new-filters=<n>'
  2020-08-03 18:57 ` [PATCH 10/10] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
@ 2020-08-04 13:03   ` Derrick Stolee
  2020-08-04 20:14     ` Taylor Blau
  0 siblings, 1 reply; 117+ messages in thread
From: Derrick Stolee @ 2020-08-04 13:03 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: peff, dstolee

On 8/3/2020 2:57 PM, Taylor Blau wrote:
> Introduce a command-line flag and configuration variable to fill in the
> 'max_new_filters' variable introduced by the previous patch.
> 
> The command-line option '--max-new-filters' takes precedence over
> 'graph.maxNewFilters', which is the default value.

commitGraph.maxNewFilters?

> '--no-max-new-filters' can also be provided, which sets the value back
> to '-1', indicating that an unlimited number of new Bloom filters may be
> generated. (OPT_INTEGER only allows setting the '--no-' variant back to
> '0', hence a custom callback was used instead).

Other than the commit message typo above, this is a good patch.

Thanks,
-Stolee


* Re: [PATCH 01/10] commit-graph: introduce 'get_bloom_filter_settings()'
  2020-08-04  7:24   ` Jeff King
@ 2020-08-04 20:08     ` Taylor Blau
  0 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-08-04 20:08 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, git, dstolee

On Tue, Aug 04, 2020 at 03:24:17AM -0400, Jeff King wrote:
> On Mon, Aug 03, 2020 at 02:57:09PM -0400, Taylor Blau wrote:
>
> > diff --git a/revision.c b/revision.c
> > index 6de29cdf7a..e244beed05 100644
> > --- a/revision.c
> > +++ b/revision.c
> > @@ -684,7 +684,7 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
> >  	if (!revs->repo->objects->commit_graph)
> >  		return;
> >
> > -	revs->bloom_filter_settings = revs->repo->objects->commit_graph->bloom_filter_settings;
> > +	revs->bloom_filter_settings = get_bloom_filter_settings(revs->repo);
> >  	if (!revs->bloom_filter_settings)
> >  		return;
>
> I think you could get rid of the revs->repo->objects->commit_graph check
> above now, as get_bloom_filter_settings() would return NULL if there is
> no commit_graph at all.

Yep, that's right -- and good catch. This used to not be the case (or if
it was, relying on it was more fragile) when
'get_bloom_filter_settings()' took a commit-graph instead of a
repository, but now it's no longer necessary.

Thanks.

> -Peff

Taylor


* Re: [PATCH 06/10] commit-graph.c: sort index into commits list
  2020-08-04 12:31   ` Derrick Stolee
@ 2020-08-04 20:10     ` Taylor Blau
  0 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-08-04 20:10 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Taylor Blau, git, peff, dstolee

On Tue, Aug 04, 2020 at 08:31:53AM -0400, Derrick Stolee wrote:
> On 8/3/2020 2:57 PM, Taylor Blau wrote:
> > For locality, 'compute_bloom_filters()' sorts the commits for which it
> > wants to compute Bloom filters in a preferred order (cf., 3d11275505
> > (commit-graph: examine commits by generation number, 2020-03-30) for
> > details).
> >
> > The subsequent patch will want to recover the new graph position of each
> > commit. Since the 'packed_commit_list' already stores a double-pointer,
> > avoid a 'COPY_ARRAY' and instead keep track of an index into the
> > original list. (Use an integer index instead of a memory address, since
> > this involves a needlessly confusing triple-pointer).
>
> It took me a little while to grok that we are switching from sorting
> a list of commit pointers to sorting a list of integers. However, that
> makes a lot of sense. It preserves the commit list sorted by OID for
> binary search, which you will need soon. Perhaps another change would
> need that at another time, too.

Yeah. I had to spend some additional time with this patch (at least back
when it was written in terms of 'struct commit ***'s) to convince myself
of its correctness, too.

I think that this is ultimately the right thing, and that it is probably
as simple as I can make it without refactoring the packed_commit_list,
which I think is squarely outside the scope of this (already-large)
series ;).
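
To make the idea concrete, here is a minimal sketch of sorting an array
of integer indices instead of the commit pointers themselves, so that
the original list keeps its order. It is not the code from the patch:
'toy_commit', 'toy_commit_list', 'toy_gen_cmp' and 'toy_sorted_indices'
are hypothetical names, the sort key is a bare 'generation' field, and
plain qsort() with a file-scope context pointer stands in for whatever
the series actually uses.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for 'struct commit'; only the sort key matters. */
struct toy_commit {
	unsigned int generation;
};

/* Hypothetical stand-in for 'packed_commit_list'. */
struct toy_commit_list {
	struct toy_commit **list;
	int nr;
};

/*
 * Plain qsort() has no context argument, so this sketch keeps the list
 * being sorted against in a file-scope pointer.
 */
static const struct toy_commit_list *sort_ctx;

static int toy_gen_cmp(const void *va, const void *vb)
{
	int a = *(const int *)va;
	int b = *(const int *)vb;
	unsigned int ga = sort_ctx->list[a]->generation;
	unsigned int gb = sort_ctx->list[b]->generation;

	if (ga != gb)
		return ga < gb ? -1 : 1;
	/* Tie-break on the index so that the order is deterministic. */
	return (a > b) - (a < b);
}

/*
 * Returns a newly allocated array of indices into 'commits->list',
 * ordered by generation. The list itself is left untouched, so it can
 * still be binary-searched in its original order.
 */
static int *toy_sorted_indices(const struct toy_commit_list *commits)
{
	int i;
	int *order;

	assert(commits->nr > 0);
	order = malloc(sizeof(*order) * commits->nr);
	if (!order)
		return NULL;
	for (i = 0; i < commits->nr; i++)
		order[i] = i;

	sort_ctx = commits;
	qsort(order, commits->nr, sizeof(*order), toy_gen_cmp);
	sort_ctx = NULL;
	return order;
}
```

A caller walks 'order[i]' and dereferences 'commits->list[order[i]]',
which gives both the commit and its position in the original list
without a second array of pointers.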

> > Alter the two sorting routines 'commit_pos_cmp' and 'commit_gen_cmp' to
> > take into account the packed_commit_list they are sorting with respect
> > to. Since 'compute_bloom_filters()' is the only caller for each of those
> > comparison functions, no other call-sites need updating.
>
> Parsing the changes to these functions is the most complicated, because
> of the int-to-commit indirection. I think they are correct and as easy
> to read as possible.
>
> -Stolee

Thanks,
Taylor


* Re: [PATCH 08/10] bloom: split 'get_bloom_filter()' in two
  2020-08-04 13:00   ` Derrick Stolee
@ 2020-08-04 20:12     ` Taylor Blau
  0 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-08-04 20:12 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Taylor Blau, git, peff, dstolee

On Tue, Aug 04, 2020 at 09:00:31AM -0400, Derrick Stolee wrote:
> On 8/3/2020 2:57 PM, Taylor Blau wrote:
> > 'get_bloom_filter' takes a flag to control whether it will compute a
> > Bloom filter if the requested one is missing. In the next patch, we'll
> > add yet another flag to this method, which would force all but one
> > caller to specify an extra 'NULL' parameter at the end.
> >
> > Instead of doing this, split 'get_bloom_filter' into two functions:
> > 'get_bloom_filter' and 'get_or_compute_bloom_filter'. The former only
> > looks up a Bloom filter (and does not compute one if it's missing,
> > thus dropping the 'compute_if_not_present' flag). The latter does
> > compute missing Bloom filters, with an additional parameter to store
> > whether or not it needed to do so.
> >
> > This simplifies many call-sites, since the majority of existing callers
> > to 'get_bloom_filter' do not want missing Bloom filters to be computed
> > (so they can drop the parameter entirely and use the simpler version of
> > the function).
>
> > +struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
> > +						 struct commit *c,
> > +						 int compute_if_not_present,
> > +						 int *computed)
>
> Could we further simplify this by letting "computed" be the indicator
> for whether we should compute the filter? If "computed" is NULL, then
> we won't compute it directly. This allows us to reduce the "1, NULL)"
> to "NULL)" in these callers:

I like what you're getting at--that is, that we shouldn't make calling
this more complicated than necessary, and right now lots of callers
always pass "1, NULL)" as the last two arguments--but I'm not sure that
I like this suggestion.

I could imagine a future caller would want to compute the Bloom filters
if missing, but not care about whether or not they were computed from
scratch. In that case, they'd need a dummy variable. Not the worst thing
in the world, but I think it's less clear.

By the way, I think that this suggestion only helps to shorten "0, NULL"
into just "NULL", not "1, NULL" (which would require a dummy variable
with your suggestion).
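
The two calling conventions under discussion can be compared with a toy
sketch. None of this is the real Bloom filter code: 'toy_filter',
'toy_get_or_compute' and 'toy_get_or_compute_alt' are hypothetical
names, and "computing" a filter just flips a flag. The first function
mirrors the explicit-flag shape, the second treats a NULL 'computed' as
"do not compute".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a stored-or-missing Bloom filter. */
struct toy_filter {
	int present;
};

/* Explicit flag plus an optional out-parameter. */
static struct toy_filter *toy_get_or_compute(struct toy_filter *slot,
					     int compute_if_not_present,
					     int *computed)
{
	assert(slot != NULL);
	if (computed)
		*computed = 0;
	if (slot->present)
		return slot;
	if (!compute_if_not_present)
		return NULL;
	slot->present = 1; /* pretend that we computed it */
	if (computed)
		*computed = 1;
	return slot;
}

/* Alternative: a NULL 'computed' doubles as "do not compute". */
static struct toy_filter *toy_get_or_compute_alt(struct toy_filter *slot,
						 int *computed)
{
	return toy_get_or_compute(slot, computed != NULL, computed);
}
```

With the second shape, a caller that wants the filter computed but does
not care whether it was new still has to pass the address of a
throwaway 'int', which is the dummy variable mentioned above.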

>
> > +			struct bloom_filter *filter = get_or_compute_bloom_filter(ctx->r, c, 1, NULL);
>
> > +	filter = get_or_compute_bloom_filter(the_repository, c, 1, NULL);
>
> Thanks,
> -Stolee

Thanks,
Taylor


* Re: [PATCH 10/10] builtin/commit-graph.c: introduce '--max-new-filters=<n>'
  2020-08-04 13:03   ` Derrick Stolee
@ 2020-08-04 20:14     ` Taylor Blau
  0 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-08-04 20:14 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Taylor Blau, git, peff, dstolee

On Tue, Aug 04, 2020 at 09:03:36AM -0400, Derrick Stolee wrote:
> On 8/3/2020 2:57 PM, Taylor Blau wrote:
> > Introduce a command-line flag and configuration variable to fill in the
> > 'max_new_filters' variable introduced by the previous patch.
> >
> > The command-line option '--max-new-filters' takes precedence over
> > 'graph.maxNewFilters', which is the default value.
>
> commitGraph.maxNewFilters?

Nice catch. There are some other places where I made this typo, and I
fixed those up locally, too.

> > '--no-max-new-filters' can also be provided, which sets the value back
> > to '-1', indicating that an unlimited number of new Bloom filters may be
> > generated. (OPT_INTEGER only allows setting the '--no-' variant back to
> > '0', hence a custom callback was used instead).
>
> Other than the commit message typo above, this is a good patch.

Thanks, and thanks for the good suggestion to write it in the first
place ;).
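
As a rough illustration of the precedence described in the commit
message, here is a toy resolver. It does not use the real option parser:
'toy_cli_state', 'toy_resolve_max_new_filters' and 'toy_may_compute' are
hypothetical names, and falling back to '-1' when neither the option nor
the configuration is given is an assumption of this sketch.

```c
#include <assert.h>

/* How the (hypothetical) command line mentioned the option, if at all. */
enum toy_cli_state {
	TOY_CLI_UNSET,   /* option not given */
	TOY_CLI_VALUE,   /* --max-new-filters=<n> */
	TOY_CLI_NEGATED  /* --no-max-new-filters */
};

/*
 * The command line wins over the configuration value. The negated form
 * resets the limit to -1 (unlimited), not to 0.
 */
static int toy_resolve_max_new_filters(enum toy_cli_state state,
				       int cli_value,
				       int has_config, int config_value)
{
	if (state == TOY_CLI_NEGATED)
		return -1;
	if (state == TOY_CLI_VALUE)
		return cli_value;
	if (has_config)
		return config_value;
	return -1; /* assumed default: no limit */
}

/* A negative limit means "unlimited"; otherwise stop at the limit. */
static int toy_may_compute(int limit, int computed_so_far)
{
	assert(computed_so_far >= 0);
	return limit < 0 || computed_so_far < limit;
}
```

The distinction between '-1' and '0' is what matters here: '0' would
mean "compute no new filters at all", which is why resetting to '0' on
negation would be the wrong behavior.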

> Thanks,
> -Stolee

Thanks,
Taylor


* [PATCH v2 00/14] more miscellaneous Bloom filter improvements
  2020-08-03 18:57 [PATCH 00/10] more miscellaneous Bloom filter improvements Taylor Blau
                   ` (9 preceding siblings ...)
  2020-08-03 18:57 ` [PATCH 10/10] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
@ 2020-08-05 17:01 ` Taylor Blau
  2020-08-05 17:01   ` [PATCH v2 01/14] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
                     ` (13 more replies)
  2020-08-11 20:51 ` [PATCH v3 00/14] more miscellaneous Bloom filter improvements Taylor Blau
  2020-09-03 22:45 ` [PATCH v4 " Taylor Blau
  12 siblings, 14 replies; 117+ messages in thread
From: Taylor Blau @ 2020-08-05 17:01 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, szeder.dev

Hi,

Here's a re-roll of my and Stolee's series containing a grab bag of
changed-path Bloom filter improvements. The contents are described in
[1], but the culmination of this series is in adding the new 'BFXL'
chunk, and allowing callers to pass '--max-new-filters <n>' to
'git commit-graph write' commands.
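
For orientation, the 'BFXL' chunk as documented in this version of the
series starts with a 32-bit maximum-changed-paths value, followed by a
bitmap of 64-bit words written in network byte order. A minimal,
standalone sketch of reading such a buffer follows. The 'toy_' helpers
are hypothetical and are not the functions used in the patches, and the
bit numbering within a word (least-significant bit first) is an
assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read big-endian integers from a byte buffer. */
static uint32_t toy_get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static uint64_t toy_get_be64(const unsigned char *p)
{
	return ((uint64_t)toy_get_be32(p) << 32) | toy_get_be32(p + 4);
}

/* Number of 64-bit words needed to hold one bit per commit. */
static size_t toy_words_for(size_t nr_commits)
{
	return nr_commits / 64 + (nr_commits % 64 ? 1 : 0);
}

/* The chunk begins with the maximum number of changed paths. */
static uint32_t toy_bfxl_max_changed_paths(const unsigned char *chunk)
{
	assert(chunk != NULL);
	return toy_get_be32(chunk);
}

/*
 * Returns 1 if bit 'pos' is set, 0 otherwise (including out of range).
 * Assumption: bit 'pos' lives in word pos / 64 at bit position pos % 64,
 * counting from the least-significant bit.
 */
static int toy_bfxl_is_large(const unsigned char *chunk, size_t chunk_len,
			     size_t pos)
{
	size_t nr_words;
	uint64_t word;

	assert(chunk != NULL);
	if (chunk_len < sizeof(uint32_t))
		return 0;
	nr_words = (chunk_len - sizeof(uint32_t)) / sizeof(uint64_t);
	if (pos / 64 >= nr_words)
		return 0;
	word = toy_get_be64(chunk + sizeof(uint32_t) +
			    (pos / 64) * sizeof(uint64_t));
	return (int)((word >> (pos % 64)) & 1);
}
```

Converting each word at read time is what lets the on-disk format stay
byte-order independent while the in-memory bitmap uses native 64-bit
words.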

Lots of little things have changed since last time, most notably a
moderate restructuring of the patches to slot in Stolee's fix for
'max_changes' not working properly in the diff machinery [2].

A range-diff is included below for convenience, although I'm not sure how
illuminating it is to read compared to starting from scratch reading the
patches one by one...

Thanks again for all of the helpful review since last time.

[1]: https://lore.kernel.org/git/cover.1596480582.git.me@ttaylorr.com/
[2]: https://lore.kernel.org/git/a08c26bb-54ec-13af-e503-fccd68727cf3@gmail.com/

Derrick Stolee (1):
  bloom/diff: properly short-circuit on max_changes

Taylor Blau (13):
  commit-graph: introduce 'get_bloom_filter_settings()'
  t4216: use an '&&'-chain
  commit-graph: pass a 'struct repository *' in more places
  t/helper/test-read-graph.c: prepare repo settings
  commit-graph: respect 'commitGraph.readChangedPaths'
  commit-graph.c: store maximum changed paths
  bloom: split 'get_bloom_filter()' in two
  bloom: use provided 'struct bloom_filter_settings'
  commit-graph.c: sort index into commits list
  csum-file.h: introduce 'hashwrite_be64()'
  commit-graph: add large-filters bitmap chunk
  commit-graph: rename 'split_commit_graph_opts'
  builtin/commit-graph.c: introduce '--max-new-filters=<n>'

 Documentation/config.txt                      |   2 +
 Documentation/config/commitgraph.txt          |   8 +
 Documentation/git-commit-graph.txt            |   4 +
 .../technical/commit-graph-format.txt         |  12 +
 blame.c                                       |   8 +-
 bloom.c                                       |  51 +++-
 bloom.h                                       |  23 +-
 builtin/commit-graph.c                        |  61 +++-
 commit-graph.c                                | 268 +++++++++++++-----
 commit-graph.h                                |  20 +-
 csum-file.h                                   |   6 +
 diff.h                                        |   2 -
 fuzz-commit-graph.c                           |   5 +-
 line-log.c                                    |   2 +-
 midx.c                                        |   3 +-
 repo-settings.c                               |   3 +
 repository.h                                  |   1 +
 revision.c                                    |   7 +-
 t/helper/test-bloom.c                         |   4 +-
 t/helper/test-read-graph.c                    |   3 +-
 t/t4216-log-bloom.sh                          | 148 ++++++++--
 t/t5324-split-commit-graph.sh                 |  13 +
 tree-diff.c                                   |   5 +-
 23 files changed, 518 insertions(+), 141 deletions(-)
 create mode 100644 Documentation/config/commitgraph.txt

Range-diff against v1:
 [ ... rebased onto 'master' ]
 1:  08479793c1 ! 11:  001f3385ff commit-graph: introduce 'get_bloom_filter_settings()'
    @@ commit-graph.h: struct commit_graph *parse_commit_graph(void *graph_map, size_t

      ## revision.c ##
     @@ revision.c: static void prepare_to_use_bloom_filter(struct rev_info *revs)
    - 	if (!revs->repo->objects->commit_graph)
    - 		return;

    + 	repo_parse_commit(revs->repo, revs->commits->item);
    +
    +-	if (!revs->repo->objects->commit_graph)
    +-		return;
    +-
     -	revs->bloom_filter_settings = revs->repo->objects->commit_graph->bloom_filter_settings;
     +	revs->bloom_filter_settings = get_bloom_filter_settings(revs->repo);
      	if (!revs->bloom_filter_settings)
 3:  d12fcc4a8d = 12:  e4d068a478 t4216: use an '&&'-chain
 2:  52f8f7424e ! 13:  afdc614c0d commit-graph: pass a 'struct repository *' in more places
    @@ Commit message
         corresponding to that repository.

         Add an additional parameter to pass the repository around in more
    -    places. In the next patch, we will remove the object directory (and
    -    instead reference it with 'r->odb').
    +    places.

         Signed-off-by: Taylor Blau <me@ttaylorr.com>

    @@ commit-graph.c: struct commit_graph *parse_commit_graph(void *graph_map, size_t
      }

     -static struct commit_graph *load_commit_graph_one(const char *graph_file,
    -+
     +static struct commit_graph *load_commit_graph_one(struct repository *r,
     +						  const char *graph_file,
      						  struct object_directory *odb)
 4:  42a0ca9549 = 14:  038e996ced t/helper/test-read-graph.c: prepare repo settings
 5:  e615077283 ! 15:  404f10319a commit-graph: respect 'commitgraph.readChangedPaths'
    @@ Metadata
     Author: Taylor Blau <me@ttaylorr.com>

      ## Commit message ##
    -    commit-graph: respect 'commitgraph.readChangedPaths'
    +    commit-graph: respect 'commitGraph.readChangedPaths'

         Git uses the 'core.commitGraph' configuration value to control whether
         or not the commit graph is used when parsing commits or performing a
    @@ Commit message
         Bloom filters. This can happen, for example, during a staged roll-out,
         or in the event of an incident.

    -    Introduce 'commitgraph.readChangedPaths' to control whether or not Bloom
    +    Introduce 'commitGraph.readChangedPaths' to control whether or not Bloom
         filters are read. Note that this configuration is independent from both:

           - 'core.commitGraph', to allow flexibility in using all parts of a
    @@ Documentation/config.txt: include::config/column.txt[]

      ## Documentation/config/commitgraph.txt (new) ##
     @@
    -+commitgraph.readChangedPaths::
    ++commitGraph.readChangedPaths::
     +	If true, then git will use the changed-path Bloom filters in the
     +	commit-graph file (if it exists, and they are present). Defaults to
     +	true. See linkgit:git-commit-graph[1] for more information.
 -:  ---------- > 16:  053991f048 commit-graph.c: store maximum changed paths
 8:  a494094c10 ! 17:  23525947c8 bloom: split 'get_bloom_filter()' in two
    @@ Commit message

         'get_bloom_filter' takes a flag to control whether it will compute a
         Bloom filter if the requested one is missing. In the next patch, we'll
    -    add yet another flag to this method, which would force all but one
    +    add yet another parameter to this method, which would force all but one
         caller to specify an extra 'NULL' parameter at the end.

         Instead of doing this, split 'get_bloom_filter' into two functions:
    @@ Commit message
         (so they can drop the parameter entirely and use the simpler version of
         the function).

    +    While we're at it, instrument the new 'get_or_compute_bloom_filter()'
    +    with two counters in the 'write_commit_graph_context' struct which store
    +    the number of filters that we computed, and the number of those which
    +    were too large to store.
    +
    +    It would be nice to drop the 'compute_if_not_present' flag entirely,
    +    since all remaining callers of 'get_or_compute_bloom_filter' pass it as
    +    '1', but this will change in a future patch and hence cannot be removed.
    +
         Signed-off-by: Taylor Blau <me@ttaylorr.com>

      ## blame.c ##
    @@ bloom.h: void add_key_to_filter(const struct bloom_key *key,
     +						 int compute_if_not_present,
     +						 int *computed);
     +
    -+#define get_bloom_filter(r, c) get_or_compute_bloom_filter((r), (c), 0, NULL)
    ++#define DEFAULT_BLOOM_MAX_CHANGES 512
    ++#define get_bloom_filter(r, c) get_or_compute_bloom_filter( \
    ++	(r), (c), 0, NULL)

      int bloom_filter_contains(const struct bloom_filter *filter,
      			  const struct bloom_key *key,

      ## commit-graph.c ##
    +@@ commit-graph.c: struct write_commit_graph_context {
    + 	const struct split_commit_graph_opts *split_opts;
    + 	size_t total_bloom_filter_data_size;
    + 	const struct bloom_filter_settings *bloom_settings;
    ++
    ++	int count_bloom_filter_found_large;
    ++	int count_bloom_filter_computed;
    + };
    +
    + static int write_graph_chunk_fanout(struct hashfile *f,
     @@ commit-graph.c: static int write_graph_chunk_bloom_indexes(struct hashfile *f,
      	uint32_t cur_pos = 0;

    @@ commit-graph.c: static int write_graph_chunk_bloom_data(struct hashfile *f,
      		size_t len = filter ? filter->len : 0;

      		display_progress(ctx->progress, ++ctx->progress_cnt);
    +@@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph_context *ctx)
    + 	stop_progress(&ctx->progress);
    + }
    +
    ++static void trace2_bloom_filter_write_statistics(struct write_commit_graph_context *ctx)
    ++{
    ++	struct json_writer jw = JSON_WRITER_INIT;
    ++
    ++	jw_object_begin(&jw, 0);
    ++	jw_object_intmax(&jw, "filter_found_large",
    ++			 ctx->count_bloom_filter_found_large);
    ++	jw_object_intmax(&jw, "filter_computed",
    ++			 ctx->count_bloom_filter_computed);
    ++	jw_end(&jw);
    ++
    ++	trace2_data_json("commit-graph", the_repository, "bloom_statistics", &jw);
    ++
    ++	jw_release(&jw);
    ++}
    ++
    + static void compute_bloom_filters(struct write_commit_graph_context *ctx)
    + {
    + 	int i;
     @@ commit-graph.c: static void compute_bloom_filters(struct write_commit_graph_context *ctx)
    - 			bitmap_set(ctx->bloom_large, pos);
    - 			ctx->count_bloom_filter_known_large++;
    - 		} else {
    --			struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1);
    -+			struct bloom_filter *filter = get_or_compute_bloom_filter(ctx->r, c, 1, NULL);
    - 			if (!filter->len) {
    - 				bitmap_set(ctx->bloom_large, pos);
    - 				ctx->count_bloom_filter_found_large++;
    + 		QSORT(sorted_commits, ctx->commits.nr, commit_gen_cmp);
    +
    + 	for (i = 0; i < ctx->commits.nr; i++) {
    ++		int computed = 0;
    + 		struct commit *c = sorted_commits[i];
    +-		struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1);
    ++		struct bloom_filter *filter = get_or_compute_bloom_filter(
    ++			ctx->r,
    ++			c,
    ++			1,
    ++			&computed);
    ++		if (computed) {
    ++			ctx->count_bloom_filter_computed++;
    ++			if (filter && !filter->len)
    ++				ctx->count_bloom_filter_found_large++;
    ++		}
    + 		ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
    + 		display_progress(progress, i + 1);
    + 	}
    +
    ++	if (trace2_is_enabled())
    ++		trace2_bloom_filter_write_statistics(ctx);
    ++
    + 	free(sorted_commits);
    + 	stop_progress(&progress);
    + }

      ## line-log.c ##
     @@ line-log.c: static int bloom_filter_check(struct rev_info *rev,
    @@ t/helper/test-bloom.c: static void get_bloom_filter_for_commit(const struct obje
      	setup_git_directory();
      	c = lookup_commit(the_repository, commit_oid);
     -	filter = get_bloom_filter(the_repository, c, 1);
    -+	filter = get_or_compute_bloom_filter(the_repository, c, 1, NULL);
    ++	filter = get_or_compute_bloom_filter(the_repository, c, 1,
    ++					     NULL);
      	print_bloom_filter(filter);
      }

 -:  ---------- > 18:  4deb724fc1 bloom: use provided 'struct bloom_filter_settings'
 -:  ---------- > 19:  d1c4bbcaa9 bloom/diff: properly short-circuit on max_changes
 6:  b31c60d712 ! 20:  e92ccafcf7 commit-graph.c: sort index into commits list
    @@ Commit message
         (commit-graph: examine commits by generation number, 2020-03-30) for
         details).

    -    The subsequent patch will want to recover the new graph position of each
    +    A future patch will want to recover the new graph position of each
         commit. Since the 'packed_commit_list' already stores a double-pointer,
         avoid a 'COPY_ARRAY' and instead keep track of an index into the
         original list. (Use an integer index instead of a memory address, since
    @@ commit-graph.c: struct tree *get_commit_tree_in_graph(struct repository *r, cons
     -	int nr;
     -	int alloc;
     -};
    --
    +
      struct packed_oid_list {
      	struct object_id *list;
    - 	int nr;
     @@ commit-graph.c: static void compute_bloom_filters(struct write_commit_graph_context *ctx)
      {
      	int i;
    @@ commit-graph.c: static void compute_bloom_filters(struct write_commit_graph_cont
     +		&ctx->commits);

      	for (i = 0; i < ctx->commits.nr; i++) {
    + 		int computed = 0;
     -		struct commit *c = sorted_commits[i];
     +		int pos = sorted_commits[i];
     +		struct commit *c = ctx->commits.list[pos];
    - 		struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1);
    - 		ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
    - 		display_progress(progress, i + 1);
    + 		struct bloom_filter *filter = get_or_compute_bloom_filter(
    + 			ctx->r,
    + 			c,
 -:  ---------- > 21:  c42d678714 csum-file.h: introduce 'hashwrite_be64()'
 7:  ab78907233 ! 22:  100b26d7c8 commit-graph: add large-filters bitmap chunk
    @@ Metadata
      ## Commit message ##
         commit-graph: add large-filters bitmap chunk

    -    When a commit has 512 changed paths or greater, the commit-graph
    -    machinery represents it as a zero-length filter since having many
    -    entries in the Bloom filter has undesirable effects on the false
    -    positivity rate.
    +    When a commit has more than a certain number of changed paths (commonly
    +    512), the commit-graph machinery represents it as a zero-length filter.
    +    This is done since having many entries in the Bloom filter has
    +    undesirable effects on the false positivity rate.

         In addition to these too-large filters, the commit-graph machinery also
         represents commits with no filter and commits with no changed paths in
    @@ Commit message
         recomputing large filters over and over again.

         This is already undesirable, since it means that we are wasting
    -    considerable effort to discover that a commit has at least 512 changed
    +    considerable effort to discover that a commit with too many changed
         paths, only to throw that effort away (and then repeat the process the
         next time a roll-up is performed).

    -    In a subsequent patch, we will add a '--max-new-filters' option, which
    -    specifies an upper-bound on the number of new filters we are willing to
    -    compute from scratch. Suppose that there are 'N' too-large filters, and
    -    we specify '--max-new-filters=M'. If 'N >= M', it is unlikely that any
    -    filters will be generated, since we'll spend most of our effort on
    -    filters that we ultimately throw away. If 'N < M', filters will trickle
    -    in over time, but only at most 'M - N' per-write.
    +    In a subsequent patch, we will add a '--max-new-filters=<n>' option,
    +    which specifies an upper-bound on the number of new filters we are
    +    willing to compute from scratch. Suppose that there are 'N' too-large
    +    filters, and we specify '--max-new-filters=M'. If 'N >= M', it is
    +    unlikely that any filters will be generated, since we'll spend most of
    +    our effort on filters that we ultimately throw away. If 'N < M', filters
    +    will trickle in over time, but only at most 'M - N' per-write.

         To address this, add a new chunk which encodes a bitmap where the ith
         bit is on iff the ith commit has zero or at least 512 changed paths.
    -    When computing Bloom filters, first consult the relevant bitmap (in the
    -    case that we are rolling up existing layers) to see if computing the
    -    Bloom filter from scratch would be a waste of time.
    +    Likewise store the maximum number of changed paths we are willing to
    +    store in order to prepare for eventually making this value more easily
    +    customizable. When computing Bloom filters, first consult the relevant
    +    bitmap (in the case that we are rolling up existing layers) to see if
    +    computing the Bloom filter from scratch would be a waste of time.

         This patch implements a new chunk instead of extending the existing BIDX
         and BDAT chunks because modifying these chunks would confuse old
    -    clients. For example, consider setting the most-significant bit in an
    -    entry in the BIDX chunk to indicate that that filter is too-large. New
    -    clients would understand how to interpret this, but old clients would
    -    treat that entry as a really large filter.
    +    clients. (Eg., setting the most-significant bit in the BIDX chunk would
    +    confuse old clients and require a version bump).
    +
    +    To allow using the existing bitmap code with 64-bit words, we write the
    +    data in network byte order from the 64-bit words. This means we also
    +    need to read the array from the commit-graph file by translating each
    +    word from network byte order using get_be64() upon first use of the
    +    bitmap. This is only used when writing the commit-graph, so this is a
    +    relatively small operation compared to the other writing code.

         By avoiding the need to move to new versions of the BDAT and BIDX chunk,
         we can give ourselves more time to consider whether or not other
    @@ Commit message
         write a bitmap, but forces older clients to rewrite their commit-graphs
         (as well as reduces the theoretical largest Bloom filters we couldl
         write, and forces us to maintain the code necessary to translate BIDX
    -    chunks to BID2 ones).  Separately from this patch, I implemented this
    +    chunks to BID2 ones). Separately from this patch, I implemented this
         alternate approach and did not find it to be advantageous.

         Signed-off-by: Taylor Blau <me@ttaylorr.com>
    @@ Documentation/technical/commit-graph-format.txt: CHUNK DATA:
          * The BDAT chunk is present if and only if BIDX is present.

     +  Large Bloom Filters (ID: {'B', 'F', 'X', 'L'}) [Optional]
    -+    * It contains a list of 'eword_t's (the length of this list is determined by
    -+      the width of the chunk) which is a bitmap. The 'i'th bit is set exactly
    -+      when the 'i'th commit in the graph has a changed-path Bloom filter with
    -+      zero entries (either because the commit is empty, or because it contains
    -+      more than 512 changed paths).
    -+    * The BFXL chunk is present on newer versions of Git iff the BIDX and BDAT
    -+      chunks are also present.
    ++    * It starts with a 32-bit unsigned integer specifying the maximum number of
    ++      changed-paths that can be stored in a single Bloom filter.
    ++    * It then contains a list of 64-bit words (the length of this list is
    ++      determined by the width of the chunk) which is a bitmap. The 'i'th bit is
    ++      set exactly when the 'i'th commit in the graph has a changed-path Bloom
    ++      filter with zero entries (either because the commit is empty, or because
    ++      it contains more than 512 changed paths).
    ++    * The BFXL chunk is present only when the BIDX and BDAT chunks are
    ++      also present.
    ++
     +
        Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional]
            This list of H-byte hashes describe a set of B commit-graph files that
            form a commit-graph chain. The graph position for the ith commit in this

    + ## bloom.h ##
    +@@ bloom.h: struct bloom_filter_settings {
    + 	 * The maximum number of changed paths per commit
    + 	 * before declaring a Bloom filter to be too-large.
    + 	 *
    +-	 * Not written to the commit-graph file.
    ++	 * Written to the 'BFXL' chunk (instead of 'BDAT').
    + 	 */
    + 	uint32_t max_changed_paths;
    + };
    +
      ## commit-graph.c ##
     @@ commit-graph.c: void git_test_write_commit_graph_or_die(void)
      #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
    @@ commit-graph.c: struct commit_graph *parse_commit_graph(struct repository *r,
      			break;
     +
     +		case GRAPH_CHUNKID_BLOOMLARGE:
    -+			if (graph->bloom_large.word_alloc)
    ++			if (graph->chunk_bloom_large_filters)
     +				chunk_repeated = 1;
     +			else if (r->settings.commit_graph_read_changed_paths) {
    -+				uint32_t alloc = get_be64(chunk_lookup + 4) - chunk_offset;
    ++				graph->bloom_large_to_alloc = get_be64(chunk_lookup + 4)
    ++							      - chunk_offset - sizeof(uint32_t);
     +
    -+				graph->bloom_large.word_alloc = alloc;
    -+				graph->bloom_large.words = (eword_t *)(data + chunk_offset);
    ++				graph->bloom_large.word_alloc = 0; /* populate when necessary */
    ++				graph->chunk_bloom_large_filters = data + chunk_offset + sizeof(uint32_t);
    ++				graph->bloom_filter_settings->max_changed_paths = get_be32(data + chunk_offset);
     +			}
     +			break;
      		}
    @@ commit-graph.c: struct commit_graph *parse_commit_graph(struct repository *r,
      		/* We need both the bloom chunks to exist together. Else ignore the data */
      		graph->chunk_bloom_indexes = NULL;
      		graph->chunk_bloom_data = NULL;
    -+		graph->bloom_large.words = NULL;
    -+		graph->bloom_large.word_alloc = 0;
    ++		graph->chunk_bloom_large_filters = NULL;
      		FREE_AND_NULL(graph->bloom_filter_settings);
      	}

    @@ commit-graph.c: struct tree *get_commit_tree_in_graph(struct repository *r, cons
     +	while (g && graph_pos < g->num_commits_in_base)
     +		g = g->base_graph;
     +
    -+	if (!(g && g->bloom_large.word_alloc))
    ++	if (!g || !g->bloom_large_to_alloc)
     +		return 0;
    ++
    ++	if (!g->bloom_large.word_alloc) {
    ++		size_t i;
    ++		g->bloom_large.word_alloc = g->bloom_large_to_alloc;
    ++		g->bloom_large.words = xmalloc(g->bloom_large_to_alloc * sizeof(eword_t));
    ++
    ++		for (i = 0; i < g->bloom_large_to_alloc; i++)
    ++			g->bloom_large.words[i] = get_be64(g->chunk_bloom_large_filters
    ++							   + i * sizeof(eword_t));
    ++	}
    ++
     +	return bitmap_get(&g->bloom_large, graph_pos - g->num_commits_in_base);
     +}
    -+
    +
      struct packed_oid_list {
      	struct object_id *list;
    - 	int nr;
     @@ commit-graph.c: struct write_commit_graph_context {
    - 	const struct split_commit_graph_opts *split_opts;
      	size_t total_bloom_filter_data_size;
      	const struct bloom_filter_settings *bloom_settings;
    -+	struct bitmap *bloom_large;
    -+
    +
     +	int count_bloom_filter_known_large;
    -+	int count_bloom_filter_found_large;
    -+	int count_bloom_filter_computed;
    + 	int count_bloom_filter_found_large;
    + 	int count_bloom_filter_computed;
    ++	struct bitmap *bloom_large;
      };

      static int write_graph_chunk_fanout(struct hashfile *f,
    @@ commit-graph.c: static int write_graph_chunk_bloom_data(struct hashfile *f,
      }

     +static int write_graph_chunk_bloom_large(struct hashfile *f,
    -+					  struct write_commit_graph_context *ctx)
    ++					 struct write_commit_graph_context *ctx)
     +{
    ++	size_t i, alloc = ctx->commits.nr / BITS_IN_EWORD;
    ++	if (ctx->commits.nr % BITS_IN_EWORD)
    ++		alloc++;
    ++	if (alloc > ctx->bloom_large->word_alloc)
    ++		BUG("write_graph_chunk_bloom_large: bitmap not large enough");
    ++
     +	trace2_region_enter("commit-graph", "bloom_large", ctx->r);
    -+	hashwrite(f, ctx->bloom_large->words, ctx->bloom_large->word_alloc * sizeof(eword_t));
    ++	hashwrite_be32(f, ctx->bloom_settings->max_changed_paths);
    ++	for (i = 0; i < ctx->bloom_large->word_alloc; i++)
    ++		hashwrite_be64(f, ctx->bloom_large->words[i]);
     +	trace2_region_leave("commit-graph", "bloom_large", ctx->r);
     +	return 0;
     +}
    @@ commit-graph.c: static int write_graph_chunk_bloom_data(struct hashfile *f,
      static int oid_compare(const void *_a, const void *_b)
      {
      	const struct object_id *a = (const struct object_id *)_a;
    -@@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph_context *ctx)
    - 	stop_progress(&ctx->progress);
    - }
    +@@ commit-graph.c: static void trace2_bloom_filter_write_statistics(struct write_commit_graph_conte
    + 	struct json_writer jw = JSON_WRITER_INIT;

    -+static void trace2_bloom_filter_write_statistics(struct write_commit_graph_context *ctx)
    -+{
    -+	struct json_writer jw = JSON_WRITER_INIT;
    -+
    -+	jw_object_begin(&jw, 0);
    + 	jw_object_begin(&jw, 0);
     +	jw_object_intmax(&jw, "filter_known_large",
     +			 ctx->count_bloom_filter_known_large);
    -+	jw_object_intmax(&jw, "filter_found_large",
    -+			 ctx->count_bloom_filter_found_large);
    -+	jw_object_intmax(&jw, "filter_computed",
    -+			 ctx->count_bloom_filter_computed);
    -+	jw_end(&jw);
    -+
    -+	trace2_data_json("commit-graph", the_repository, "bloom_statistics", &jw);
    -+
    -+	jw_release(&jw);
    -+}
    -+
    - static void compute_bloom_filters(struct write_commit_graph_context *ctx)
    - {
    - 	int i;
    + 	jw_object_intmax(&jw, "filter_found_large",
    + 			 ctx->count_bloom_filter_found_large);
    + 	jw_object_intmax(&jw, "filter_computed",
     @@ commit-graph.c: static void compute_bloom_filters(struct write_commit_graph_context *ctx)
      	int *sorted_commits;

      	init_bloom_filters();
    -+	ctx->bloom_large = bitmap_word_alloc(ctx->commits.nr / BITS_IN_EWORD);
    ++	ctx->bloom_large = bitmap_word_alloc(ctx->commits.nr / BITS_IN_EWORD + 1);

      	if (ctx->report_progress)
      		progress = start_delayed_progress(
     @@ commit-graph.c: static void compute_bloom_filters(struct write_commit_graph_context *ctx)
    + 		&ctx->commits);
    +
      	for (i = 0; i < ctx->commits.nr; i++) {
    +-		int computed = 0;
      		int pos = sorted_commits[i];
      		struct commit *c = ctx->commits.list[pos];
    --		struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1);
    --		ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
    +-		struct bloom_filter *filter = get_or_compute_bloom_filter(
    +-			ctx->r,
    +-			c,
    +-			1,
    +-			ctx->bloom_settings,
    +-			&computed);
    +-		if (computed) {
    +-			ctx->count_bloom_filter_computed++;
    +-			if (filter && !filter->len)
    +-				ctx->count_bloom_filter_found_large++;
     +		if (get_bloom_filter_large_in_graph(ctx->r->objects->commit_graph, c)) {
     +			bitmap_set(ctx->bloom_large, pos);
     +			ctx->count_bloom_filter_known_large++;
     +		} else {
    -+			struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1);
    -+			if (!filter->len) {
    -+				bitmap_set(ctx->bloom_large, pos);
    -+				ctx->count_bloom_filter_found_large++;
    ++			int computed = 0;
    ++			struct bloom_filter *filter = get_or_compute_bloom_filter(
    ++				ctx->r,
    ++				c,
    ++				1,
    ++				ctx->bloom_settings,
    ++				&computed);
    ++			if (computed) {
    ++				ctx->count_bloom_filter_computed++;
    ++				if (filter && !filter->len) {
    ++					bitmap_set(ctx->bloom_large, pos);
    ++					ctx->count_bloom_filter_found_large++;
    ++				}
     +			}
     +			ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
    -+			ctx->count_bloom_filter_computed++;
    -+		}
    + 		}
    +-		ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
      		display_progress(progress, i + 1);
      	}

    -+	if (trace2_is_enabled())
    -+		trace2_bloom_filter_write_statistics(ctx);
    -+
    - 	free(sorted_commits);
    - 	stop_progress(&progress);
    - }
     @@ commit-graph.c: static int write_commit_graph_file(struct write_commit_graph_context *ctx)
      					  + ctx->total_bloom_filter_data_size;
      		chunks[num_chunks].write_fn = write_graph_chunk_bloom_data;
      		num_chunks++;
     +		chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMLARGE;
    -+		chunks[num_chunks].size = sizeof(eword_t) * ctx->bloom_large->word_alloc;
    ++		chunks[num_chunks].size = sizeof(eword_t) * ctx->bloom_large->word_alloc
    ++					+ sizeof(uint32_t);
     +		chunks[num_chunks].write_fn = write_graph_chunk_bloom_large;
     +		num_chunks++;
      	}
      	if (ctx->num_commit_graphs_after > 1) {
      		chunks[num_chunks].id = GRAPH_CHUNKID_BASE;
    +@@ commit-graph.c: void free_commit_graph(struct commit_graph *g)
    + 	}
    + 	free(g->filename);
    + 	free(g->bloom_filter_settings);
    ++	bitmap_free(g->bloom_large);
    + 	free(g);
    + }
    +

      ## commit-graph.h ##
     @@
    @@ commit-graph.h: struct commit_graph {
      	const unsigned char *chunk_base_graphs;
      	const unsigned char *chunk_bloom_indexes;
      	const unsigned char *chunk_bloom_data;
    ++	const unsigned char *chunk_bloom_large_filters;
    ++
    ++	size_t bloom_large_to_alloc;
     +	struct bitmap bloom_large;

      	struct bloom_filter_settings *bloom_filter_settings;
    @@ t/t4216-log-bloom.sh: test_expect_success 'correctly report changes over limit'
     +
     +test_expect_success 'Bloom generation does not recompute too-large filters' '
     +	(
    -+		cd 513changes &&
    -+		git commit-graph write --reachable --changed-paths \
    ++		cd limits &&
    ++
    ++		# start from scratch and rebuild
    ++		rm -f .git/objects/info/commit-graph &&
    ++		GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=10 \
    ++			git commit-graph write --reachable --changed-paths \
     +			--split=replace &&
     +		test_commit c1 filter &&
    ++
     +		test_bloom_filters_computed "--reachable --changed-paths --split=replace" \
    -+			1 0 1
    ++			2 0 1
     +	)
     +'

 9:  309e76bb17 ! 23:  2ee0b84351 commit-graph: rename 'split_commit_graph_opts'
    @@ Metadata
      ## Commit message ##
         commit-graph: rename 'split_commit_graph_opts'

    -    In a subsequent commit, additional options will be added to the
    -    commit-graph API, which themselves do not have to do with splitting.
    +    In the subsequent commit, additional options will be added to the
    +    commit-graph API which have nothing to do with splitting.

         Rename the 'split_commit_graph_opts' structure to the more-generic
         'commit_graph_opts' to encompass both.
    @@ commit-graph.c: struct write_commit_graph_context {
     +	const struct commit_graph_opts *opts;
      	size_t total_bloom_filter_data_size;
      	const struct bloom_filter_settings *bloom_settings;
    - 	struct bitmap *bloom_large;
    +
     @@ commit-graph.c: static void close_reachable(struct write_commit_graph_context *ctx)
      {
      	int i;
    @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
     +	ctx->opts = opts;
      	ctx->total_bloom_filter_data_size = 0;

    - 	if (flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS)
    + 	bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
     @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
      			}
      		}
10:  8b670571dc ! 24:  3b66ae4a9c builtin/commit-graph.c: introduce '--max-new-filters=<n>'
    @@ Commit message
         'max_new_filters' variable introduced by the previous patch.

         The command-line option '--max-new-filters' takes precedence over
    -    'graph.maxNewFilters', which is the default value.
    +    'commitGraph.maxNewFilters', which is the default value.
         '--no-max-new-filters' can also be provided, which sets the value back
         to '-1', indicating that an unlimited number of new Bloom filters may be
         generated. (OPT_INTEGER only allows setting the '--no-' variant back to
    @@ Commit message

      ## Documentation/config/commitgraph.txt ##
     @@
    -+commitgraph.maxNewFilters::
    ++commitGraph.maxNewFilters::
     +	Specifies the default value for the `--max-new-filters` option of `git
     +	commit-graph write` (c.f., linkgit:git-commit-graph[1]).
     +
    - commitgraph.readChangedPaths::
    + commitGraph.readChangedPaths::
      	If true, then git will use the changed-path Bloom filters in the
      	commit-graph file (if it exists, and they are present). Defaults to

    @@ Documentation/git-commit-graph.txt: this option is given, future commit-graph wr
      +
     +With the `--max-new-filters=<n>` option, generate at most `n` new Bloom
     +filters (if `--changed-paths` is specified). If `n` is `-1`, no limit is
    -+enforced. Overrides the `graph.maxNewFilters` configuration.
    ++enforced. Overrides the `commitGraph.maxNewFilters` configuration.
     ++
      With the `--split[=<strategy>]` option, write the commit-graph as a
      chain of multiple commit-graph files stored in
    @@ builtin/commit-graph.c: static int graph_write(int argc, const char **argv)
      		OPT_EXPIRY_DATE(0, "expire-time", &write_opts.expire_time,
      			N_("only expire files older than a given date-time")),
     +		OPT_CALLBACK_F(0, "max-new-filters", &write_opts.max_new_filters,
    -+			NULL, N_("maximum number of Bloom filters to compute"),
    ++			NULL, N_("maximum number of changed-path Bloom filters to compute"),
     +			0, write_option_max_new_filters),
      		OPT_END(),
      	};
    @@ builtin/commit-graph.c: static int graph_write(int argc, const char **argv)

     +static int git_commit_graph_config(const char *var, const char *value, void *cb)
     +{
    -+	if (!strcmp(var, "graph.maxnewfilters")) {
    ++	if (!strcmp(var, "commitgraph.maxnewfilters")) {
     +		write_opts.max_new_filters = git_config_int(var, value);
     +		return 0;
     +	}
    @@ builtin/commit-graph.c: int cmd_commit_graph(int argc, const char **argv, const
      			     builtin_commit_graph_usage,

      ## commit-graph.c ##
    +@@ commit-graph.c: struct tree *get_commit_tree_in_graph(struct repository *r, const struct commit
    + }
    +
    + static int get_bloom_filter_large_in_graph(struct commit_graph *g,
    +-					   const struct commit *c)
    ++					   const struct commit *c,
    ++					   uint32_t max_changed_paths)
    + {
    + 	uint32_t graph_pos = commit_graph_position(c);
    + 	if (graph_pos == COMMIT_NOT_FROM_GRAPH)
     @@ commit-graph.c: static void compute_bloom_filters(struct write_commit_graph_context *ctx)
      	int i;
      	struct progress *progress = NULL;
    @@ commit-graph.c: static void compute_bloom_filters(struct write_commit_graph_cont
     +	int max_new_filters;

      	init_bloom_filters();
    - 	ctx->bloom_large = bitmap_word_alloc(ctx->commits.nr / BITS_IN_EWORD);
    + 	ctx->bloom_large = bitmap_word_alloc(ctx->commits.nr / BITS_IN_EWORD + 1);
     @@ commit-graph.c: static void compute_bloom_filters(struct write_commit_graph_context *ctx)
      		ctx->order_by_pack ? commit_pos_cmp : commit_gen_cmp,
      		&ctx->commits);
    @@ commit-graph.c: static void compute_bloom_filters(struct write_commit_graph_cont
      	for (i = 0; i < ctx->commits.nr; i++) {
      		int pos = sorted_commits[i];
      		struct commit *c = ctx->commits.list[pos];
    -@@ commit-graph.c: static void compute_bloom_filters(struct write_commit_graph_context *ctx)
    +-		if (get_bloom_filter_large_in_graph(ctx->r->objects->commit_graph, c)) {
    ++		if (get_bloom_filter_large_in_graph(ctx->r->objects->commit_graph,
    ++						    c,
    ++						    ctx->bloom_settings->max_changed_paths)) {
      			bitmap_set(ctx->bloom_large, pos);
      			ctx->count_bloom_filter_known_large++;
      		} else {
    --			struct bloom_filter *filter = get_or_compute_bloom_filter(ctx->r, c, 1, NULL);
    --			if (!filter->len) {
    --				bitmap_set(ctx->bloom_large, pos);
    --				ctx->count_bloom_filter_found_large++;
    -+			int computed = 0;
    -+			struct bloom_filter *filter = get_or_compute_bloom_filter(
    -+				ctx->r, c,
    +@@ commit-graph.c: static void compute_bloom_filters(struct write_commit_graph_context *ctx)
    + 			struct bloom_filter *filter = get_or_compute_bloom_filter(
    + 				ctx->r,
    + 				c,
    +-				1,
     +				ctx->count_bloom_filter_computed < max_new_filters,
    -+				&computed);
    -+			if (computed) {
    -+				ctx->count_bloom_filter_computed++;
    -+				if (filter && !filter->len) {
    -+					bitmap_set(ctx->bloom_large, pos);
    -+					ctx->count_bloom_filter_found_large++;
    -+				}
    + 				ctx->bloom_settings,
    + 				&computed);
    + 			if (computed) {
    +@@ commit-graph.c: static void compute_bloom_filters(struct write_commit_graph_context *ctx)
    + 					ctx->count_bloom_filter_found_large++;
    + 				}
      			}
     -			ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
    --			ctx->count_bloom_filter_computed++;
    -+			ctx->total_bloom_filter_data_size += sizeof(unsigned char) * (filter ? filter->len : 0);
    ++			if (filter)
    ++				ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
      		}
      		display_progress(progress, i + 1);
      	}
    @@ t/t4216-log-bloom.sh: test_expect_success 'Bloom generation does not recompute t

     +test_expect_success 'Bloom generation is limited by --max-new-filters' '
     +	(
    -+		cd 513changes &&
    ++		cd limits &&
     +		test_commit c2 filter &&
     +		test_commit c3 filter &&
     +		test_commit c4 no-filter &&
     +		test_bloom_filters_computed "--reachable --changed-paths --split=replace --max-new-filters=2" \
    -+			1 0 2
    ++			2 0 2
     +	)
     +'
     +
     +test_expect_success 'Bloom generation backfills previously-skipped filters' '
     +	(
    -+		cd 513changes &&
    ++		cd limits &&
     +		test_bloom_filters_computed "--reachable --changed-paths --split=replace --max-new-filters=1" \
    -+			1 0 1
    ++			2 0 1
     +	)
     +'
     +
--
2.28.0.rc1.13.ge78abce653


* [PATCH v2 01/14] commit-graph: introduce 'get_bloom_filter_settings()'
  2020-08-05 17:01 ` [PATCH v2 00/14] more miscellaneous Bloom filter improvements Taylor Blau
@ 2020-08-05 17:01   ` Taylor Blau
  2020-08-05 17:02   ` [PATCH v2 02/14] t4216: use an '&&'-chain Taylor Blau
                     ` (12 subsequent siblings)
  13 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-08-05 17:01 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, szeder.dev

Many places in the code need a pointer to the commit-graph's 'struct
bloom_filter_settings', and they typically take the value from the
top-most commit-graph.

In the non-split case, this works as expected. In the split case,
however, things get a little tricky. Not all layers in a chain of
incremental commit-graphs are required to themselves have Bloom data,
and so whether or not some part of the code uses Bloom filters depends
entirely on whether or not the top-most level of the commit-graph chain
has Bloom filters.

This has been the behavior since Bloom filters were introduced, and has
been codified into the tests since a759bfa9ee (t4216: add end to end
tests for git log with Bloom filters, 2020-04-06). In fact, t4216.130
requires that Bloom filters are not used in exactly the case described
earlier.

There is no reason that this needs to be the case, since it is perfectly
valid for commits in an earlier layer to have Bloom filters when commits
in a newer layer do not.

Since Bloom settings are guaranteed to be the same for any layer in a
chain that has Bloom data, it is sufficient to traverse the
'->base_graph' pointer until either (1) a non-NULL 'struct
bloom_filter_settings *' is found, or (2) we reach the root of the
commit-graph chain.

Introduce a 'get_bloom_filter_settings()' function that does just this,
and use it instead of purely dereferencing the top-most graph's
'->bloom_filter_settings' pointer.
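
As a rough illustration of that traversal outside of Git, here is a
minimal, self-contained sketch. The 'settings_model' and 'layer_model'
structures and the 'find_settings()' name are hypothetical stand-ins,
not Git's own types; they keep only the two fields the walk needs.

```c
#include <stddef.h>

/* Hypothetical, stripped-down stand-ins for Git's structures. */
struct settings_model {
	unsigned bits_per_entry;
};

struct layer_model {
	struct settings_model *settings; /* NULL: layer has no Bloom data */
	struct layer_model *base;        /* next-older layer, NULL at root */
};

/*
 * Walk from the newest layer towards the root and return the first
 * non-NULL settings pointer, or NULL if no layer carries Bloom data.
 */
struct settings_model *find_settings(struct layer_model *top)
{
	struct layer_model *g;

	for (g = top; g; g = g->base)
		if (g->settings)
			return g->settings;
	return NULL;
}
```

With a two-layer chain whose newest layer has no Bloom data, the sketch
returns the settings of the older layer instead of NULL, which is the
behavior change this patch is after.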

While we're at it, add an additional test in t5324 to guard against code
in the commit-graph writing machinery that doesn't correctly handle a
NULL 'struct bloom_filter *'.

Co-authored-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 blame.c                       |  6 ++++--
 bloom.c                       |  6 +++---
 commit-graph.c                | 11 +++++++++++
 commit-graph.h                |  2 ++
 revision.c                    |  5 +----
 t/t4216-log-bloom.sh          |  9 ++++++---
 t/t5324-split-commit-graph.sh | 13 +++++++++++++
 7 files changed, 40 insertions(+), 12 deletions(-)

diff --git a/blame.c b/blame.c
index 82fa16d658..3e5f8787bc 100644
--- a/blame.c
+++ b/blame.c
@@ -2891,16 +2891,18 @@ void setup_blame_bloom_data(struct blame_scoreboard *sb,
 			    const char *path)
 {
 	struct blame_bloom_data *bd;
+	struct bloom_filter_settings *bs;
 
 	if (!sb->repo->objects->commit_graph)
 		return;
 
-	if (!sb->repo->objects->commit_graph->bloom_filter_settings)
+	bs = get_bloom_filter_settings(sb->repo);
+	if (!bs)
 		return;
 
 	bd = xmalloc(sizeof(struct blame_bloom_data));
 
-	bd->settings = sb->repo->objects->commit_graph->bloom_filter_settings;
+	bd->settings = bs;
 
 	bd->alloc = 4;
 	bd->nr = 0;
diff --git a/bloom.c b/bloom.c
index 1a573226e7..cd9380ac62 100644
--- a/bloom.c
+++ b/bloom.c
@@ -38,7 +38,7 @@ static int load_bloom_filter_from_graph(struct commit_graph *g,
 	while (graph_pos < g->num_commits_in_base)
 		g = g->base_graph;
 
-	/* The commit graph commit 'c' lives in doesn't carry bloom filters. */
+	/* The commit graph commit 'c' lives in doesn't carry Bloom filters. */
 	if (!g->chunk_bloom_indexes)
 		return 0;
 
@@ -195,8 +195,8 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 	if (!filter->data) {
 		load_commit_graph_info(r, c);
 		if (commit_graph_position(c) != COMMIT_NOT_FROM_GRAPH &&
-			r->objects->commit_graph->chunk_bloom_indexes)
-			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c);
+			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
+				return filter;
 	}
 
 	if (filter->data)
diff --git a/commit-graph.c b/commit-graph.c
index e51c91dd5b..d4b06811be 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -660,6 +660,17 @@ int generation_numbers_enabled(struct repository *r)
 	return !!first_generation;
 }
 
+struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r)
+{
+	struct commit_graph *g = r->objects->commit_graph;
+	while (g) {
+		if (g->bloom_filter_settings)
+			return g->bloom_filter_settings;
+		g = g->base_graph;
+	}
+	return NULL;
+}
+
 static void close_commit_graph_one(struct commit_graph *g)
 {
 	if (!g)
diff --git a/commit-graph.h b/commit-graph.h
index 09a97030dc..0677dd1031 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -87,6 +87,8 @@ struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size);
  */
 int generation_numbers_enabled(struct repository *r);
 
+struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r);
+
 enum commit_graph_write_flags {
 	COMMIT_GRAPH_WRITE_APPEND     = (1 << 0),
 	COMMIT_GRAPH_WRITE_PROGRESS   = (1 << 1),
diff --git a/revision.c b/revision.c
index 6de29cdf7a..c45ed1076e 100644
--- a/revision.c
+++ b/revision.c
@@ -681,10 +681,7 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 
 	repo_parse_commit(revs->repo, revs->commits->item);
 
-	if (!revs->repo->objects->commit_graph)
-		return;
-
-	revs->bloom_filter_settings = revs->repo->objects->commit_graph->bloom_filter_settings;
+	revs->bloom_filter_settings = get_bloom_filter_settings(revs->repo);
 	if (!revs->bloom_filter_settings)
 		return;
 
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index c21cc160f3..c9f9bdf1ba 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -60,7 +60,7 @@ setup () {
 
 test_bloom_filters_used () {
 	log_args=$1
-	bloom_trace_prefix="statistics:{\"filter_not_present\":0,\"maybe\""
+	bloom_trace_prefix="statistics:{\"filter_not_present\":${2:-0},\"maybe\""
 	setup "$log_args" &&
 	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" &&
 	test_cmp log_wo_bloom log_w_bloom &&
@@ -134,8 +134,11 @@ test_expect_success 'setup - add commit-graph to the chain without Bloom filters
 	test_line_count = 2 .git/objects/info/commit-graphs/commit-graph-chain
 '
 
-test_expect_success 'Do not use Bloom filters if the latest graph does not have Bloom filters.' '
-	test_bloom_filters_not_used "-- A/B"
+test_expect_success 'use Bloom filters even if the latest graph does not have Bloom filters' '
+	# Ensure that the number of empty filters is equal to the number of
+	# filters in the latest graph layer to prove that they are loaded (and
+	# ignored).
+	test_bloom_filters_used "-- A/B" 3
 '
 
 test_expect_success 'setup - add commit-graph to the chain with Bloom filters' '
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 9b850ea907..5bdfd53ef9 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -425,4 +425,17 @@ done <<\EOF
 0600 -r--------
 EOF
 
+test_expect_success '--split=replace with partial Bloom data' '
+	rm -rf $graphdir $infodir/commit-graph &&
+	git reset --hard commits/3 &&
+	git rev-list -1 HEAD~2 >a &&
+	git rev-list -1 HEAD~1 >b &&
+	git commit-graph write --split=no-merge --stdin-commits --changed-paths <a &&
+	git commit-graph write --split=no-merge --stdin-commits <b &&
+	git commit-graph write --split=replace --stdin-commits --changed-paths <c &&
+	ls $graphdir/graph-*.graph >graph-files &&
+	test_line_count = 1 graph-files &&
+	verify_chain_files_exist $graphdir
+'
+
 test_done
-- 
2.28.0.rc1.13.ge78abce653



* [PATCH v2 02/14] t4216: use an '&&'-chain
  2020-08-05 17:01 ` [PATCH v2 00/14] more miscellaneous Bloom filter improvements Taylor Blau
  2020-08-05 17:01   ` [PATCH v2 01/14] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
@ 2020-08-05 17:02   ` Taylor Blau
  2020-08-05 17:02   ` [PATCH v2 03/14] commit-graph: pass a 'struct repository *' in more places Taylor Blau
                     ` (11 subsequent siblings)
  13 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-08-05 17:02 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, szeder.dev

In a759bfa9ee (t4216: add end to end tests for git log with Bloom
filters, 2020-04-06), a 'rm' invocation was added without a
corresponding '&&' chain.

When 'trace.perf' already exists, everything works fine. However, the
function can be executed without 'trace.perf' on disk (e.g., when the
subset of tests run is altered with '--run'), and so the bare 'rm'
complains about a missing file.

To remove some noise from the test log, invoke 'rm' with '-f', at which
point it is sensible to place the 'rm -f' in an '&&'-chain, which both
(1) matches our usual style, and (2) avoids a broken chain in the future
if more commands are added at the beginning of the function.

Helped-by: Eric Sunshine <sunshine@sunshineco.com>
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/t4216-log-bloom.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index c9f9bdf1ba..fe19f6a60c 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -53,7 +53,7 @@ sane_unset GIT_TRACE2_PERF_BRIEF
 sane_unset GIT_TRACE2_CONFIG_PARAMS
 
 setup () {
-	rm "$TRASH_DIRECTORY/trace.perf"
+	rm -f "$TRASH_DIRECTORY/trace.perf" &&
 	git -c core.commitGraph=false log --pretty="format:%s" $1 >log_wo_bloom &&
 	GIT_TRACE2_PERF="$TRASH_DIRECTORY/trace.perf" git -c core.commitGraph=true log --pretty="format:%s" $1 >log_w_bloom
 }
-- 
2.28.0.rc1.13.ge78abce653



* [PATCH v2 03/14] commit-graph: pass a 'struct repository *' in more places
  2020-08-05 17:01 ` [PATCH v2 00/14] more miscellaneous Bloom filter improvements Taylor Blau
  2020-08-05 17:01   ` [PATCH v2 01/14] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
  2020-08-05 17:02   ` [PATCH v2 02/14] t4216: use an '&&'-chain Taylor Blau
@ 2020-08-05 17:02   ` Taylor Blau
  2020-08-05 17:02   ` [PATCH v2 04/14] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
                     ` (10 subsequent siblings)
  13 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-08-05 17:02 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, szeder.dev

In a future commit, some commit-graph internals will want access to
'r->settings', but we only have the 'struct object_directory *'
corresponding to that repository.

Add an additional parameter to pass the repository around in more
places.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/commit-graph.c |  2 +-
 commit-graph.c         | 17 ++++++++++-------
 commit-graph.h         |  6 ++++--
 fuzz-commit-graph.c    |  5 +++--
 4 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 523501f217..ba5584463f 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -106,7 +106,7 @@ static int graph_verify(int argc, const char **argv)
 	FREE_AND_NULL(graph_name);
 
 	if (open_ok)
-		graph = load_commit_graph_one_fd_st(fd, &st, odb);
+		graph = load_commit_graph_one_fd_st(the_repository, fd, &st, odb);
 	else
 		graph = read_commit_graph_one(the_repository, odb);
 
diff --git a/commit-graph.c b/commit-graph.c
index d4b06811be..0c1030641c 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -224,7 +224,8 @@ int open_commit_graph(const char *graph_file, int *fd, struct stat *st)
 	return 1;
 }
 
-struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st,
+struct commit_graph *load_commit_graph_one_fd_st(struct repository *r,
+						 int fd, struct stat *st,
 						 struct object_directory *odb)
 {
 	void *graph_map;
@@ -240,7 +241,7 @@ struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st,
 	}
 	graph_map = xmmap(NULL, graph_size, PROT_READ, MAP_PRIVATE, fd, 0);
 	close(fd);
-	ret = parse_commit_graph(graph_map, graph_size);
+	ret = parse_commit_graph(r, graph_map, graph_size);
 
 	if (ret)
 		ret->odb = odb;
@@ -280,7 +281,8 @@ static int verify_commit_graph_lite(struct commit_graph *g)
 	return 0;
 }
 
-struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size)
+struct commit_graph *parse_commit_graph(struct repository *r,
+					void *graph_map, size_t graph_size)
 {
 	const unsigned char *data, *chunk_lookup;
 	uint32_t i;
@@ -445,7 +447,8 @@ struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size)
 	return NULL;
 }
 
-static struct commit_graph *load_commit_graph_one(const char *graph_file,
+static struct commit_graph *load_commit_graph_one(struct repository *r,
+						  const char *graph_file,
 						  struct object_directory *odb)
 {
 
@@ -457,7 +460,7 @@ static struct commit_graph *load_commit_graph_one(const char *graph_file,
 	if (!open_ok)
 		return NULL;
 
-	g = load_commit_graph_one_fd_st(fd, &st, odb);
+	g = load_commit_graph_one_fd_st(r, fd, &st, odb);
 
 	if (g)
 		g->filename = xstrdup(graph_file);
@@ -469,7 +472,7 @@ static struct commit_graph *load_commit_graph_v1(struct repository *r,
 						 struct object_directory *odb)
 {
 	char *graph_name = get_commit_graph_filename(odb);
-	struct commit_graph *g = load_commit_graph_one(graph_name, odb);
+	struct commit_graph *g = load_commit_graph_one(r, graph_name, odb);
 	free(graph_name);
 
 	return g;
@@ -550,7 +553,7 @@ static struct commit_graph *load_commit_graph_chain(struct repository *r,
 		valid = 0;
 		for (odb = r->objects->odb; odb; odb = odb->next) {
 			char *graph_name = get_split_graph_filename(odb, line.buf);
-			struct commit_graph *g = load_commit_graph_one(graph_name, odb);
+			struct commit_graph *g = load_commit_graph_one(r, graph_name, odb);
 
 			free(graph_name);
 
diff --git a/commit-graph.h b/commit-graph.h
index 0677dd1031..d9acb22bac 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -75,11 +75,13 @@ struct commit_graph {
 	struct bloom_filter_settings *bloom_filter_settings;
 };
 
-struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st,
+struct commit_graph *load_commit_graph_one_fd_st(struct repository *r,
+						 int fd, struct stat *st,
 						 struct object_directory *odb);
 struct commit_graph *read_commit_graph_one(struct repository *r,
 					   struct object_directory *odb);
-struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size);
+struct commit_graph *parse_commit_graph(struct repository *r,
+					void *graph_map, size_t graph_size);
 
 /*
  * Return 1 if and only if the repository has a commit-graph
diff --git a/fuzz-commit-graph.c b/fuzz-commit-graph.c
index 430817214d..e7cf6d5b0f 100644
--- a/fuzz-commit-graph.c
+++ b/fuzz-commit-graph.c
@@ -1,7 +1,8 @@
 #include "commit-graph.h"
 #include "repository.h"
 
-struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size);
+struct commit_graph *parse_commit_graph(struct repository *r,
+					void *graph_map, size_t graph_size);
 
 int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size);
 
@@ -10,7 +11,7 @@ int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
 	struct commit_graph *g;
 
 	initialize_the_repository();
-	g = parse_commit_graph((void *)data, size);
+	g = parse_commit_graph(the_repository, (void *)data, size);
 	repo_clear(the_repository);
 	free_commit_graph(g);
 
-- 
2.28.0.rc1.13.ge78abce653



* [PATCH v2 04/14] t/helper/test-read-graph.c: prepare repo settings
  2020-08-05 17:01 ` [PATCH v2 00/14] more miscellaneous Bloom filter improvements Taylor Blau
                     ` (2 preceding siblings ...)
  2020-08-05 17:02   ` [PATCH v2 03/14] commit-graph: pass a 'struct repository *' in more places Taylor Blau
@ 2020-08-05 17:02   ` Taylor Blau
  2020-08-05 17:02   ` [PATCH v2 05/14] commit-graph: respect 'commitGraph.readChangedPaths' Taylor Blau
                     ` (9 subsequent siblings)
  13 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-08-05 17:02 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, szeder.dev

The read-graph test-tool is used by a number of the commit-graph tests
to assert various properties about a commit-graph. Previously, this
program never ran 'prepare_repo_settings()'. There was no need to do so,
since none of the commit-graph machinery was affected by the repo
settings.

In the next patch, the commit-graph machinery's behavior will become
dependent on the repo settings, and so loading them before running the
rest of the test tool is critical.

As such, teach the test tool to call 'prepare_repo_settings()'.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/helper/test-read-graph.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
index 6d0c962438..5f585a1725 100644
--- a/t/helper/test-read-graph.c
+++ b/t/helper/test-read-graph.c
@@ -12,11 +12,12 @@ int cmd__read_graph(int argc, const char **argv)
 	setup_git_directory();
 	odb = the_repository->objects->odb;
 
+	prepare_repo_settings(the_repository);
+
 	graph = read_commit_graph_one(the_repository, odb);
 	if (!graph)
 		return 1;
 
-
 	printf("header: %08x %d %d %d %d\n",
 		ntohl(*(uint32_t*)graph->data),
 		*(unsigned char*)(graph->data + 4),
-- 
2.28.0.rc1.13.ge78abce653



* [PATCH v2 05/14] commit-graph: respect 'commitGraph.readChangedPaths'
  2020-08-05 17:01 ` [PATCH v2 00/14] more miscellaneous Bloom filter improvements Taylor Blau
                     ` (3 preceding siblings ...)
  2020-08-05 17:02   ` [PATCH v2 04/14] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
@ 2020-08-05 17:02   ` Taylor Blau
  2020-08-05 17:02   ` [PATCH v2 06/14] commit-graph.c: store maximum changed paths Taylor Blau
                     ` (8 subsequent siblings)
  13 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-08-05 17:02 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, szeder.dev

Git uses the 'core.commitGraph' configuration value to control whether
or not the commit graph is used when parsing commits or performing a
traversal.

Now that commit-graphs can also contain a section for changed-path Bloom
filters, administrators that already have commit-graphs may find it
convenient to use those graphs without relying on their changed-path
Bloom filters. This can happen, for example, during a staged roll-out,
or in the event of an incident.

Introduce 'commitGraph.readChangedPaths' to control whether or not Bloom
filters are read. Note that this configuration is independent from both:

  - 'core.commitGraph', to allow flexibility in using all parts of a
    commit-graph _except_ for its Bloom filters.

  - The '--changed-paths' option for 'git commit-graph write', to allow
    reading and writing Bloom filters to be controlled independently.

When the variable is set to false, pretend as if no Bloom data was
specified at all. This avoids adding additional special-casing outside
of the commit-graph internals.
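
To make the "pretend there is no Bloom data" idea concrete, here is a
small standalone sketch. It is only a model: 'graph_model',
'record_bloom_chunks()' and 'has_bloom_data()' are hypothetical names,
and the real parser handles each chunk as it is encountered rather than
in one call.

```c
#include <stddef.h>

/* Hypothetical model: only the fields needed to show the gating. */
struct graph_model {
	const unsigned char *chunk_bloom_indexes;
	const unsigned char *chunk_bloom_data;
};

/*
 * Record where the Bloom chunks live, but only when reading them is
 * enabled. When it is disabled both pointers stay NULL, exactly as if
 * the file had been written without those chunks.
 */
void record_bloom_chunks(struct graph_model *g, int read_changed_paths,
			 const unsigned char *bidx,
			 const unsigned char *bdat)
{
	g->chunk_bloom_indexes = NULL;
	g->chunk_bloom_data = NULL;
	if (!read_changed_paths)
		return;
	g->chunk_bloom_indexes = bidx;
	g->chunk_bloom_data = bdat;
}

/* Readers only ask whether the chunks are present. */
int has_bloom_data(const struct graph_model *g)
{
	return g->chunk_bloom_indexes && g->chunk_bloom_data;
}
```

Because the decision is made once at parse time, callers that already
check for missing chunks need no knowledge of the configuration value.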

Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config.txt             | 2 ++
 Documentation/config/commitgraph.txt | 4 ++++
 commit-graph.c                       | 6 ++++--
 repo-settings.c                      | 3 +++
 repository.h                         | 1 +
 t/t4216-log-bloom.sh                 | 4 +++-
 6 files changed, 17 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/config/commitgraph.txt

diff --git a/Documentation/config.txt b/Documentation/config.txt
index ef0768b91a..78883c6e63 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -340,6 +340,8 @@ include::config/column.txt[]
 
 include::config/commit.txt[]
 
+include::config/commitgraph.txt[]
+
 include::config/credential.txt[]
 
 include::config/completion.txt[]
diff --git a/Documentation/config/commitgraph.txt b/Documentation/config/commitgraph.txt
new file mode 100644
index 0000000000..cff0797b54
--- /dev/null
+++ b/Documentation/config/commitgraph.txt
@@ -0,0 +1,4 @@
+commitGraph.readChangedPaths::
+	If true, then git will use the changed-path Bloom filters in the
+	commit-graph file (if it exists, and they are present). Defaults to
+	true. See linkgit:git-commit-graph[1] for more information.
diff --git a/commit-graph.c b/commit-graph.c
index 0c1030641c..a516e93d71 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -320,6 +320,8 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 		return NULL;
 	}
 
+	prepare_repo_settings(r);
+
 	graph = alloc_commit_graph();
 
 	graph->hash_len = the_hash_algo->rawsz;
@@ -396,14 +398,14 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 		case GRAPH_CHUNKID_BLOOMINDEXES:
 			if (graph->chunk_bloom_indexes)
 				chunk_repeated = 1;
-			else
+			else if (r->settings.commit_graph_read_changed_paths)
 				graph->chunk_bloom_indexes = data + chunk_offset;
 			break;
 
 		case GRAPH_CHUNKID_BLOOMDATA:
 			if (graph->chunk_bloom_data)
 				chunk_repeated = 1;
-			else {
+			else if (r->settings.commit_graph_read_changed_paths) {
 				uint32_t hash_version;
 				graph->chunk_bloom_data = data + chunk_offset;
 				hash_version = get_be32(data + chunk_offset);
diff --git a/repo-settings.c b/repo-settings.c
index 0918408b34..9e551bc03d 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -17,9 +17,12 @@ void prepare_repo_settings(struct repository *r)
 
 	if (!repo_config_get_bool(r, "core.commitgraph", &value))
 		r->settings.core_commit_graph = value;
+	if (!repo_config_get_bool(r, "commitgraph.readchangedpaths", &value))
+		r->settings.commit_graph_read_changed_paths = value;
 	if (!repo_config_get_bool(r, "gc.writecommitgraph", &value))
 		r->settings.gc_write_commit_graph = value;
 	UPDATE_DEFAULT_BOOL(r->settings.core_commit_graph, 1);
+	UPDATE_DEFAULT_BOOL(r->settings.commit_graph_read_changed_paths, 1);
 	UPDATE_DEFAULT_BOOL(r->settings.gc_write_commit_graph, 1);
 
 	if (!repo_config_get_int(r, "index.version", &value))
diff --git a/repository.h b/repository.h
index 3c1f7d54bd..81759b7d27 100644
--- a/repository.h
+++ b/repository.h
@@ -29,6 +29,7 @@ struct repo_settings {
 	int initialized;
 
 	int core_commit_graph;
+	int commit_graph_read_changed_paths;
 	int gc_write_commit_graph;
 	int fetch_write_commit_graph;
 
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index fe19f6a60c..b3d1f596f8 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -90,7 +90,9 @@ do
 		      "--ancestry-path side..master"
 	do
 		test_expect_success "git log option: $option for path: $path" '
-			test_bloom_filters_used "$option -- $path"
+			test_bloom_filters_used "$option -- $path" &&
+			test_config commitgraph.readChangedPaths false &&
+			test_bloom_filters_not_used "$option -- $path"
 		'
 	done
 done
-- 
2.28.0.rc1.13.ge78abce653



* [PATCH v2 06/14] commit-graph.c: store maximum changed paths
  2020-08-05 17:01 ` [PATCH v2 00/14] more miscellaneous Bloom filter improvements Taylor Blau
                     ` (4 preceding siblings ...)
  2020-08-05 17:02   ` [PATCH v2 05/14] commit-graph: respect 'commitGraph.readChangedPaths' Taylor Blau
@ 2020-08-05 17:02   ` Taylor Blau
  2020-08-05 17:02   ` [PATCH v2 07/14] bloom: split 'get_bloom_filter()' in two Taylor Blau
                     ` (7 subsequent siblings)
  13 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-08-05 17:02 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, szeder.dev

For now, we assume that there is a fixed constant describing the
maximum number of changed paths we are willing to store in a Bloom
filter.

Prepare for that to (at least partially) not be the case by making it a
member of the 'struct bloom_filter_settings'. This will be helpful in
the subsequent patches by reducing the size of test cases that exercise
storing too many changed paths, as well as preparing for an eventual
future in which this value might change.

This patch alone does not cause newly generated Bloom filters to use
a custom upper-bound on the maximum number of changed paths a single
Bloom filter can hold; that will occur in a later patch.
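
To make the idea concrete, here is a small stand-alone sketch (not the
code in this patch; the 'toy_*' names are hypothetical, and the parsing
helper is only an assumed stand-in for what an environment lookup might
do) of keeping the limit in the settings struct with a default of 512,
and letting a test override it:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct toy_bloom_settings {
	uint32_t hash_version;
	uint32_t num_hashes;
	uint32_t bits_per_entry;
	uint32_t max_changed_paths; /* kept in memory only */
};

#define TOY_DEFAULT_MAX_CHANGES 512
#define TOY_DEFAULT_SETTINGS { 1, 7, 10, TOY_DEFAULT_MAX_CHANGES }

/*
 * Hypothetical helper: parse 'v' as an unsigned decimal number. Returns
 * 'fallback' if 'v' is NULL, empty, or followed by characters that
 * strtoul() could not consume.
 */
unsigned long toy_parse_ulong(const char *v, unsigned long fallback)
{
	char *end;
	unsigned long n;

	if (!v || !*v)
		return fallback;
	n = strtoul(v, &end, 10);
	return *end ? fallback : n;
}

/* Override the limit from environment variable 'name', if it is set. */
void toy_apply_env_override(struct toy_bloom_settings *s, const char *name)
{
	assert(s != NULL && name != NULL);
	s->max_changed_paths = (uint32_t)toy_parse_ulong(getenv(name),
							 s->max_changed_paths);
}
```

Because the limit lives next to the other settings, a caller that never
sets the variable still gets the default, and a test can shrink the
limit without creating hundreds of files.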

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 bloom.h              | 11 ++++++++++-
 commit-graph.c       |  3 +++
 t/t4216-log-bloom.sh |  4 ++--
 3 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/bloom.h b/bloom.h
index d8fbb0fbf1..0b9b59a6fe 100644
--- a/bloom.h
+++ b/bloom.h
@@ -28,9 +28,18 @@ struct bloom_filter_settings {
 	 * that contain n*b bits.
 	 */
 	uint32_t bits_per_entry;
+
+	/*
+	 * The maximum number of changed paths per commit
+	 * before declaring a Bloom filter to be too-large.
+	 *
+	 * Not written to the commit-graph file.
+	 */
+	uint32_t max_changed_paths;
 };
 
-#define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 }
+#define DEFAULT_BLOOM_MAX_CHANGES 512
+#define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10, DEFAULT_BLOOM_MAX_CHANGES }
 #define BITS_PER_WORD 8
 #define BLOOMDATA_CHUNK_HEADER_SIZE 3 * sizeof(uint32_t)
 
diff --git a/commit-graph.c b/commit-graph.c
index a516e93d71..86dd4b979e 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1194,6 +1194,7 @@ static void trace2_bloom_filter_settings(struct write_commit_graph_context *ctx)
 	jw_object_intmax(&jw, "hash_version", ctx->bloom_settings->hash_version);
 	jw_object_intmax(&jw, "num_hashes", ctx->bloom_settings->num_hashes);
 	jw_object_intmax(&jw, "bits_per_entry", ctx->bloom_settings->bits_per_entry);
+	jw_object_intmax(&jw, "max_changed_paths", ctx->bloom_settings->max_changed_paths);
 	jw_end(&jw);
 
 	trace2_data_json("bloom", ctx->r, "settings", &jw);
@@ -1662,6 +1663,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 							      bloom_settings.bits_per_entry);
 		bloom_settings.num_hashes = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_NUM_HASHES",
 							  bloom_settings.num_hashes);
+		bloom_settings.max_changed_paths = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS",
+							  bloom_settings.max_changed_paths);
 		ctx->bloom_settings = &bloom_settings;
 	}
 
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index b3d1f596f8..eb2bcc51f0 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -169,11 +169,11 @@ test_expect_success 'persist filter settings' '
 		GIT_TEST_BLOOM_SETTINGS_NUM_HASHES=9 \
 		GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY=15 \
 		git commit-graph write --reachable --changed-paths &&
-	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15}" trace2.txt &&
+	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15" trace2.txt &&
 	GIT_TRACE2_EVENT="$(pwd)/trace2-auto.txt" \
 		GIT_TRACE2_EVENT_NESTING=5 \
 		git commit-graph write --reachable --changed-paths &&
-	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15}" trace2-auto.txt
+	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15" trace2-auto.txt
 '
 
 test_expect_success 'correctly report changes over limit' '
-- 
2.28.0.rc1.13.ge78abce653



* [PATCH v2 07/14] bloom: split 'get_bloom_filter()' in two
  2020-08-05 17:01 ` [PATCH v2 00/14] more miscellaneous Bloom filter improvements Taylor Blau
                     ` (5 preceding siblings ...)
  2020-08-05 17:02   ` [PATCH v2 06/14] commit-graph.c: store maximum changed paths Taylor Blau
@ 2020-08-05 17:02   ` Taylor Blau
  2020-08-05 17:02   ` [PATCH v2 08/14] bloom: use provided 'struct bloom_filter_settings' Taylor Blau
                     ` (6 subsequent siblings)
  13 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-08-05 17:02 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, szeder.dev

'get_bloom_filter' takes a flag to control whether it will compute a
Bloom filter if the requested one is missing. In the next patch, we'll
add yet another parameter to this function, which would force all but
one caller to specify an extra 'NULL' parameter at the end.

Instead of doing this, split 'get_bloom_filter' into two functions:
'get_bloom_filter' and 'get_or_compute_bloom_filter'. The former only
looks up a Bloom filter (and does not compute one if it's missing,
thus dropping the 'compute_if_not_present' flag). The latter does
compute missing Bloom filters, with an additional parameter to store
whether or not it needed to do so.

This simplifies many call-sites, since the majority of existing callers
to 'get_bloom_filter' do not want missing Bloom filters to be computed
(so they can drop the parameter entirely and use the simpler version of
the function).

While we're at it, instrument the new 'get_or_compute_bloom_filter()'
with two counters in the 'write_commit_graph_context' struct which store
the number of filters that we computed, and the number of those which
were too large to store.

It would be nice to drop the 'compute_if_not_present' flag entirely,
since all remaining callers of 'get_or_compute_bloom_filter' pass it as
'1', but this will change in a future patch and hence cannot be removed.
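
The shape of the split can be sketched outside of Git as follows (a toy
model only: the slab, the 'toy_*' names, and the fake "computation" are
all assumptions made for the example, not the real implementation):

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_SLOTS 8

struct toy_filter {
	int present; /* has this slot been filled in? */
	size_t len;  /* 0 stands for "empty or too large" here */
};

struct toy_stats {
	int computed;
	int found_large;
};

struct toy_filter toy_slab[TOY_NR_SLOTS];

/*
 * Look up the filter in slot 'pos', computing it first if it is missing
 * and 'compute_if_not_present' is set. If 'computed' is non-NULL, it is
 * set to 1 only when this call did the computation.
 */
struct toy_filter *toy_get_or_compute(size_t pos, int compute_if_not_present,
				      int *computed)
{
	struct toy_filter *f;

	assert(pos < TOY_NR_SLOTS);
	f = &toy_slab[pos];

	if (computed)
		*computed = 0;
	if (f->present)
		return f;
	if (!compute_if_not_present)
		return NULL;

	f->present = 1;
	f->len = pos % 2 ? 4 : 0; /* stand-in for the real computation */
	if (computed)
		*computed = 1;
	return f;
}

/* Lookup-only wrapper, in the spirit of the macro added in this patch. */
#define toy_get(pos) toy_get_or_compute((pos), 0, NULL)

/* Fill every slot, counting only the filters computed by this pass. */
void toy_compute_all(struct toy_stats *st)
{
	size_t i;

	for (i = 0; i < TOY_NR_SLOTS; i++) {
		int computed = 0;
		struct toy_filter *f = toy_get_or_compute(i, 1, &computed);

		if (computed) {
			st->computed++;
			if (f && !f->len)
				st->found_large++;
		}
	}
}
```

Read-only callers use 'toy_get()' and never see the extra arguments,
while the writer gets the out-parameter it needs to keep its counters
accurate across repeated passes.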

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 blame.c               |  2 +-
 bloom.c               | 13 ++++++++++---
 bloom.h               | 11 ++++++++---
 commit-graph.c        | 38 +++++++++++++++++++++++++++++++++++---
 line-log.c            |  2 +-
 revision.c            |  2 +-
 t/helper/test-bloom.c |  3 ++-
 7 files changed, 58 insertions(+), 13 deletions(-)

diff --git a/blame.c b/blame.c
index 3e5f8787bc..756285fca7 100644
--- a/blame.c
+++ b/blame.c
@@ -1275,7 +1275,7 @@ static int maybe_changed_path(struct repository *r,
 	if (commit_graph_generation(origin->commit) == GENERATION_NUMBER_INFINITY)
 		return 1;
 
-	filter = get_bloom_filter(r, origin->commit, 0);
+	filter = get_bloom_filter(r, origin->commit);
 
 	if (!filter)
 		return 1;
diff --git a/bloom.c b/bloom.c
index cd9380ac62..a8a21762f4 100644
--- a/bloom.c
+++ b/bloom.c
@@ -177,9 +177,10 @@ static int pathmap_cmp(const void *hashmap_cmp_fn_data,
 	return strcmp(e1->path, e2->path);
 }
 
-struct bloom_filter *get_bloom_filter(struct repository *r,
-				      struct commit *c,
-				      int compute_if_not_present)
+struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
+						 struct commit *c,
+						 int compute_if_not_present,
+						 int *computed)
 {
 	struct bloom_filter *filter;
 	struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
@@ -187,6 +188,9 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 	struct diff_options diffopt;
 	int max_changes = 512;
 
+	if (computed)
+		*computed = 0;
+
 	if (!bloom_filters.slab_size)
 		return NULL;
 
@@ -273,6 +277,9 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 		filter->len = 0;
 	}
 
+	if (computed)
+		*computed = 1;
+
 	free(diff_queued_diff.queue);
 	DIFF_QUEUE_CLEAR(&diff_queued_diff);
 
diff --git a/bloom.h b/bloom.h
index 0b9b59a6fe..c4107a6415 100644
--- a/bloom.h
+++ b/bloom.h
@@ -89,9 +89,14 @@ void add_key_to_filter(const struct bloom_key *key,
 
 void init_bloom_filters(void);
 
-struct bloom_filter *get_bloom_filter(struct repository *r,
-				      struct commit *c,
-				      int compute_if_not_present);
+struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
+						 struct commit *c,
+						 int compute_if_not_present,
+						 int *computed);
+
+#define DEFAULT_BLOOM_MAX_CHANGES 512
+#define get_bloom_filter(r, c) get_or_compute_bloom_filter( \
+	(r), (c), 0, NULL)
 
 int bloom_filter_contains(const struct bloom_filter *filter,
 			  const struct bloom_key *key,
diff --git a/commit-graph.c b/commit-graph.c
index 86dd4b979e..ba2a2cfb22 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -964,6 +964,9 @@ struct write_commit_graph_context {
 	const struct split_commit_graph_opts *split_opts;
 	size_t total_bloom_filter_data_size;
 	const struct bloom_filter_settings *bloom_settings;
+
+	int count_bloom_filter_found_large;
+	int count_bloom_filter_computed;
 };
 
 static int write_graph_chunk_fanout(struct hashfile *f,
@@ -1175,7 +1178,7 @@ static int write_graph_chunk_bloom_indexes(struct hashfile *f,
 	uint32_t cur_pos = 0;
 
 	while (list < last) {
-		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
+		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
 		size_t len = filter ? filter->len : 0;
 		cur_pos += len;
 		display_progress(ctx->progress, ++ctx->progress_cnt);
@@ -1215,7 +1218,7 @@ static int write_graph_chunk_bloom_data(struct hashfile *f,
 	hashwrite_be32(f, ctx->bloom_settings->bits_per_entry);
 
 	while (list < last) {
-		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
+		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
 		size_t len = filter ? filter->len : 0;
 
 		display_progress(ctx->progress, ++ctx->progress_cnt);
@@ -1385,6 +1388,22 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 	stop_progress(&ctx->progress);
 }
 
+static void trace2_bloom_filter_write_statistics(struct write_commit_graph_context *ctx)
+{
+	struct json_writer jw = JSON_WRITER_INIT;
+
+	jw_object_begin(&jw, 0);
+	jw_object_intmax(&jw, "filter_found_large",
+			 ctx->count_bloom_filter_found_large);
+	jw_object_intmax(&jw, "filter_computed",
+			 ctx->count_bloom_filter_computed);
+	jw_end(&jw);
+
+	trace2_data_json("commit-graph", the_repository, "bloom_statistics", &jw);
+
+	jw_release(&jw);
+}
+
 static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 {
 	int i;
@@ -1407,12 +1426,25 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 		QSORT(sorted_commits, ctx->commits.nr, commit_gen_cmp);
 
 	for (i = 0; i < ctx->commits.nr; i++) {
+		int computed = 0;
 		struct commit *c = sorted_commits[i];
-		struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1);
+		struct bloom_filter *filter = get_or_compute_bloom_filter(
+			ctx->r,
+			c,
+			1,
+			&computed);
+		if (computed) {
+			ctx->count_bloom_filter_computed++;
+			if (filter && !filter->len)
+				ctx->count_bloom_filter_found_large++;
+		}
 		ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
 		display_progress(progress, i + 1);
 	}
 
+	if (trace2_is_enabled())
+		trace2_bloom_filter_write_statistics(ctx);
+
 	free(sorted_commits);
 	stop_progress(&progress);
 }
diff --git a/line-log.c b/line-log.c
index c53692834d..9e58fd185a 100644
--- a/line-log.c
+++ b/line-log.c
@@ -1159,7 +1159,7 @@ static int bloom_filter_check(struct rev_info *rev,
 		return 1;
 
 	if (!rev->bloom_filter_settings ||
-	    !(filter = get_bloom_filter(rev->repo, commit, 0)))
+	    !(filter = get_bloom_filter(rev->repo, commit)))
 		return 1;
 
 	if (!range)
diff --git a/revision.c b/revision.c
index c45ed1076e..b7ec712755 100644
--- a/revision.c
+++ b/revision.c
@@ -753,7 +753,7 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 	if (commit_graph_generation(commit) == GENERATION_NUMBER_INFINITY)
 		return -1;
 
-	filter = get_bloom_filter(revs->repo, commit, 0);
+	filter = get_bloom_filter(revs->repo, commit);
 
 	if (!filter) {
 		count_bloom_filter_not_present++;
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index f0aa80b98e..531af439c2 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -39,7 +39,8 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
 	struct bloom_filter *filter;
 	setup_git_directory();
 	c = lookup_commit(the_repository, commit_oid);
-	filter = get_bloom_filter(the_repository, c, 1);
+	filter = get_or_compute_bloom_filter(the_repository, c, 1,
+					     NULL);
 	print_bloom_filter(filter);
 }
 
-- 
2.28.0.rc1.13.ge78abce653



* [PATCH v2 08/14] bloom: use provided 'struct bloom_filter_settings'
  2020-08-05 17:01 ` [PATCH v2 00/14] more miscellaneous Bloom filter improvements Taylor Blau
                     ` (6 preceding siblings ...)
  2020-08-05 17:02   ` [PATCH v2 07/14] bloom: split 'get_bloom_filter()' in two Taylor Blau
@ 2020-08-05 17:02   ` Taylor Blau
  2020-08-05 17:02   ` [PATCH v2 09/14] bloom/diff: properly short-circuit on max_changes Taylor Blau
                     ` (5 subsequent siblings)
  13 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-08-05 17:02 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, szeder.dev

When 'get_or_compute_bloom_filter()' needs to compute a Bloom filter
from scratch, it looks to the default 'struct bloom_filter_settings' in
order to determine the maximum number of changed paths, number of bits
per entry, and so on.

All of these values have so far been constant, and so there was no need
to pass in a pointer from the caller (e.g., the one that is stored in
the 'struct write_commit_graph_context').

Start passing in a 'struct bloom_filter_settings *' instead of using the
default values to respect graph-specific settings (e.g., in the case of
setting 'GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS').

In order to have an initialized value for these settings, move its
initialization to earlier in the commit-graph write.
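
For illustration, the size calculation that depends on these settings
can be modeled in isolation like this (a sketch under assumptions:
'toy_settings' and 'toy_filter_len()' are invented names, and returning
0 for an over-limit commit merely models the length-0 filter):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_BITS_PER_WORD 8

struct toy_settings {
	uint32_t bits_per_entry;
	uint32_t max_changed_paths;
};

/*
 * Size in bytes of a filter holding 'nr_paths' entries, using the
 * caller's settings rather than compiled-in defaults. Returns 0 when
 * the commit has more paths than the settings allow.
 */
size_t toy_filter_len(size_t nr_paths, const struct toy_settings *settings)
{
	assert(settings != NULL);

	if (nr_paths > settings->max_changed_paths)
		return 0;
	/* round up to a whole number of bytes */
	return (nr_paths * settings->bits_per_entry + TOY_BITS_PER_WORD - 1)
		/ TOY_BITS_PER_WORD;
}
```

With 10 bits per entry, three paths need (30 + 7) / 8 = 4 bytes. The
same three paths yield a length of 0 if the caller's limit is lowered to
2, which a function hard-coding the defaults could not express.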

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 bloom.c               | 13 ++++++-------
 bloom.h               |  3 ++-
 commit-graph.c        | 21 ++++++++++-----------
 t/helper/test-bloom.c |  1 +
 4 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/bloom.c b/bloom.c
index a8a21762f4..0cf1962dc5 100644
--- a/bloom.c
+++ b/bloom.c
@@ -180,13 +180,12 @@ static int pathmap_cmp(const void *hashmap_cmp_fn_data,
 struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 						 struct commit *c,
 						 int compute_if_not_present,
+						 const struct bloom_filter_settings *settings,
 						 int *computed)
 {
 	struct bloom_filter *filter;
-	struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
 	int i;
 	struct diff_options diffopt;
-	int max_changes = 512;
 
 	if (computed)
 		*computed = 0;
@@ -211,7 +210,7 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 	repo_diff_setup(r, &diffopt);
 	diffopt.flags.recursive = 1;
 	diffopt.detect_rename = 0;
-	diffopt.max_changes = max_changes;
+	diffopt.max_changes = settings->max_changed_paths;
 	diff_setup_done(&diffopt);
 
 	/* ensure commit is parsed so we have parent information */
@@ -223,7 +222,7 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 		diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
 	diffcore_std(&diffopt);
 
-	if (diffopt.num_changes <= max_changes) {
+	if (diffopt.num_changes <= settings->max_changed_paths) {
 		struct hashmap pathmap;
 		struct pathmap_hash_entry *e;
 		struct hashmap_iter iter;
@@ -260,13 +259,13 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 			diff_free_filepair(diff_queued_diff.queue[i]);
 		}
 
-		filter->len = (hashmap_get_size(&pathmap) * settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
+		filter->len = (hashmap_get_size(&pathmap) * settings->bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
 		filter->data = xcalloc(filter->len, sizeof(unsigned char));
 
 		hashmap_for_each_entry(&pathmap, &iter, e, entry) {
 			struct bloom_key key;
-			fill_bloom_key(e->path, strlen(e->path), &key, &settings);
-			add_key_to_filter(&key, filter, &settings);
+			fill_bloom_key(e->path, strlen(e->path), &key, settings);
+			add_key_to_filter(&key, filter, settings);
 		}
 
 		hashmap_free_entries(&pathmap, struct pathmap_hash_entry, entry);
diff --git a/bloom.h b/bloom.h
index c4107a6415..2353c74a24 100644
--- a/bloom.h
+++ b/bloom.h
@@ -92,11 +92,12 @@ void init_bloom_filters(void);
 struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 						 struct commit *c,
 						 int compute_if_not_present,
+						 const struct bloom_filter_settings *settings,
 						 int *computed);
 
 #define DEFAULT_BLOOM_MAX_CHANGES 512
 #define get_bloom_filter(r, c) get_or_compute_bloom_filter( \
-	(r), (c), 0, NULL)
+	(r), (c), 0, NULL, NULL)
 
 int bloom_filter_contains(const struct bloom_filter *filter,
 			  const struct bloom_key *key,
diff --git a/commit-graph.c b/commit-graph.c
index ba2a2cfb22..48d4697f54 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1432,6 +1432,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 			ctx->r,
 			c,
 			1,
+			ctx->bloom_settings,
 			&computed);
 		if (computed) {
 			ctx->count_bloom_filter_computed++;
@@ -1688,17 +1689,6 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	int num_chunks = 3;
 	uint64_t chunk_offset;
 	struct object_id file_hash;
-	struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
-
-	if (!ctx->bloom_settings) {
-		bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
-							      bloom_settings.bits_per_entry);
-		bloom_settings.num_hashes = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_NUM_HASHES",
-							  bloom_settings.num_hashes);
-		bloom_settings.max_changed_paths = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS",
-							  bloom_settings.max_changed_paths);
-		ctx->bloom_settings = &bloom_settings;
-	}
 
 	if (ctx->split) {
 		struct strbuf tmp_file = STRBUF_INIT;
@@ -2144,6 +2134,7 @@ int write_commit_graph(struct object_directory *odb,
 	uint32_t i, count_distinct = 0;
 	int res = 0;
 	int replace = 0;
+	struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
 
 	if (!commit_graph_compatible(the_repository))
 		return 0;
@@ -2157,6 +2148,14 @@ int write_commit_graph(struct object_directory *odb,
 	ctx->split_opts = split_opts;
 	ctx->total_bloom_filter_data_size = 0;
 
+	bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
+						      bloom_settings.bits_per_entry);
+	bloom_settings.num_hashes = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_NUM_HASHES",
+						  bloom_settings.num_hashes);
+	bloom_settings.max_changed_paths = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS",
+							 bloom_settings.max_changed_paths);
+	ctx->bloom_settings = &bloom_settings;
+
 	if (flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS)
 		ctx->changed_paths = 1;
 	if (!(flags & COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS)) {
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index 531af439c2..4af949164c 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -40,6 +40,7 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
 	setup_git_directory();
 	c = lookup_commit(the_repository, commit_oid);
 	filter = get_or_compute_bloom_filter(the_repository, c, 1,
+					     &settings,
 					     NULL);
 	print_bloom_filter(filter);
 }
-- 
2.28.0.rc1.13.ge78abce653



* [PATCH v2 09/14] bloom/diff: properly short-circuit on max_changes
  2020-08-05 17:01 ` [PATCH v2 00/14] more miscellaneous Bloom filter improvements Taylor Blau
                     ` (7 preceding siblings ...)
  2020-08-05 17:02   ` [PATCH v2 08/14] bloom: use provided 'struct bloom_filter_settings' Taylor Blau
@ 2020-08-05 17:02   ` Taylor Blau
  2020-08-05 17:02   ` [PATCH v2 10/14] commit-graph.c: sort index into commits list Taylor Blau
                     ` (4 subsequent siblings)
  13 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-08-05 17:02 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, szeder.dev

From: Derrick Stolee <dstolee@microsoft.com>

Commit e3696980 (diff: halt tree-diff early after max_changes,
2020-03-30) intended to create a mechanism to short-circuit a diff
calculation after a certain number of paths were modified. By
incrementing a "num_changes" counter throughout the recursive
ll_diff_tree_paths(), this was supposed to match the number of changes
that would be written into the changed-path Bloom filters.
Unfortunately, this was not implemented correctly and instead misses
simple cases like file modifications. This then does not stop very
large changed-path filters from being written (unless they add or remove
many files).

To start, change the implementation in ll_diff_tree_paths() to instead
use the global diff_queued_diff struct's 'nr' member as the count. This
simplifies the logic rather than risking more mistakes in the
complicated diff code.

This has a drawback: the diff_queued_diff struct only lists the paths
corresponding to blob changes, not their leading directories. Thus,
get_or_compute_bloom_filter() needs an additional check to see if the
hashmap with the leading directories becomes too large.
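
The gap between the two counts can be seen with a small stand-alone
sketch (not the code in this patch: it uses a fixed-size array instead
of a hashmap, and all 'toy_*' names are hypothetical). It counts each
changed file plus every distinct leading directory:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_ENTRIES 64
#define TOY_MAX_PATH 128

struct toy_pathset {
	char paths[TOY_MAX_ENTRIES][TOY_MAX_PATH];
	size_t nr;
};

/* Add the first 'len' bytes of 'path' unless already present. */
void toy_pathset_add(struct toy_pathset *set, const char *path, size_t len)
{
	size_t i;

	assert(len < TOY_MAX_PATH);
	for (i = 0; i < set->nr; i++)
		if (strlen(set->paths[i]) == len &&
		    !memcmp(set->paths[i], path, len))
			return;
	assert(set->nr < TOY_MAX_ENTRIES);
	memcpy(set->paths[set->nr], path, len);
	set->paths[set->nr][len] = '\0';
	set->nr++;
}

/*
 * Count the distinct entries a filter would hold for the changed files
 * in 'files': each file itself plus every leading directory.
 */
size_t toy_count_filter_entries(const char **files, size_t nr_files)
{
	struct toy_pathset set;
	size_t i;

	set.nr = 0;
	for (i = 0; i < nr_files; i++) {
		const char *p = files[i];
		const char *slash = p;

		toy_pathset_add(&set, p, strlen(p));
		/* every prefix ending just before a '/' is a directory */
		while ((slash = strchr(slash, '/')) != NULL) {
			toy_pathset_add(&set, p, (size_t)(slash - p));
			slash++;
		}
	}
	return set.nr;
}
```

For the seven files added by the first commit of the updated test
(d/file1.txt, d/file2.txt, d/e/file1.txt, d/e/file2.txt,
mode/script.sh, foo/bar, and foo.txt), the queue holds 7 entries, but
the set above holds 11 once d, d/e, mode, and foo are included. With a
limit of 10, only the second check notices that the filter is too large.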

One reason why this was not caught by test cases was that the test in
t4216-log-bloom.sh that was supposed to check this "too many changes"
condition only checked this on the initial commit of a repository. The
old logic counted these values correctly. Update this test in a few
ways:

1. Use GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS to reduce the limit,
   allowing smaller commits to engage with this logic.

2. Create several interesting cases of edits, adds, removes, and mode
   changes (in the second commit). By testing both sides of the
   inequality with the *_MAX_CHANGED_PATHS variable, we can see that
   the count is exactly correct, so none of these changes are missed
   or over-counted.

3. Use the trace2 data value filter_found_large to verify that these
   commits are on the correct side of the limit.

Another way to verify the behavior is correct is through performance
tests. By testing on my local copies of the Git repository and the Linux
kernel repository, I could measure the effect of these short-circuits
when computing a fresh commit-graph file with changed-path Bloom filters
using the command

  GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=N time \
    git commit-graph write --reachable --changed-paths

and reporting the wall time and resulting commit-graph size.

For Git, the results are

|        |      N=1       |       N=10     |      N=512     |
|--------|----------------|----------------|----------------|
| HEAD~1 | 10.90s  9.18MB | 11.11s  9.34MB | 11.31s  9.35MB |
| HEAD   |  9.21s  8.62MB | 11.11s  9.29MB | 11.29s  9.34MB |

For Linux, the results are

|        |       N=1      |     N=20      |     N=512     |
|--------|----------------|---------------|---------------|
| HEAD~1 | 61.28s  64.3MB | 76.9s  72.6MB | 77.6s  72.6MB |
| HEAD   | 49.44s  56.3MB | 68.7s  65.9MB | 69.2s  65.9MB |

Naturally, the improvement shrinks as the limit grows, since fewer
commits trigger the short-circuit.

Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 bloom.c              |  6 +++-
 diff.h               |  2 --
 t/t4216-log-bloom.sh | 85 +++++++++++++++++++++++++++++++++++++++-----
 tree-diff.c          |  5 +--
 4 files changed, 82 insertions(+), 16 deletions(-)

diff --git a/bloom.c b/bloom.c
index 0cf1962dc5..ed54e96e57 100644
--- a/bloom.c
+++ b/bloom.c
@@ -222,7 +222,7 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 		diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
 	diffcore_std(&diffopt);
 
-	if (diffopt.num_changes <= settings->max_changed_paths) {
+	if (diff_queued_diff.nr <= settings->max_changed_paths) {
 		struct hashmap pathmap;
 		struct pathmap_hash_entry *e;
 		struct hashmap_iter iter;
@@ -259,6 +259,9 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 			diff_free_filepair(diff_queued_diff.queue[i]);
 		}
 
+		if (hashmap_get_size(&pathmap) > settings->max_changed_paths)
+			goto cleanup;
+
 		filter->len = (hashmap_get_size(&pathmap) * settings->bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
 		filter->data = xcalloc(filter->len, sizeof(unsigned char));
 
@@ -268,6 +271,7 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 			add_key_to_filter(&key, filter, settings);
 		}
 
+	cleanup:
 		hashmap_free_entries(&pathmap, struct pathmap_hash_entry, entry);
 	} else {
 		for (i = 0; i < diff_queued_diff.nr; i++)
diff --git a/diff.h b/diff.h
index e0c0af6286..1d32b71885 100644
--- a/diff.h
+++ b/diff.h
@@ -287,8 +287,6 @@ struct diff_options {
 
 	/* If non-zero, then stop computing after this many changes. */
 	int max_changes;
-	/* For internal use only. */
-	int num_changes;
 
 	int ita_invisible_in_index;
 /* white-space error highlighting */
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index eb2bcc51f0..21b67677ef 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -177,20 +177,87 @@ test_expect_success 'persist filter settings' '
 '
 
 test_expect_success 'correctly report changes over limit' '
-	git init 513changes &&
+	git init limits &&
 	(
-		cd 513changes &&
-		for i in $(test_seq 1 513)
+		cd limits &&
+		mkdir d &&
+		mkdir d/e &&
+
+		for i in $(test_seq 1 2)
 		do
-			echo $i >file$i.txt || return 1
+			printf $i >d/file$i.txt &&
+			printf $i >d/e/file$i.txt || return 1
 		done &&
-		git add . &&
+
+		mkdir mode &&
+		printf bash >mode/script.sh &&
+
+		mkdir foo &&
+		touch foo/bar &&
+		touch foo.txt &&
+
+		git add d foo foo.txt mode &&
 		git commit -m "files" &&
-		git commit-graph write --reachable --changed-paths &&
-		for i in $(test_seq 1 513)
+
+		# Commit has 7 file and 4 directory adds
+		GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=10 \
+			GIT_TRACE2_EVENT="$(pwd)/trace" \
+			git commit-graph write --reachable --changed-paths &&
+		grep "\"max_changed_paths\":10" trace &&
+		grep "\"filter_found_large\":1" trace &&
+
+		for path in $(git ls-tree -r --name-only HEAD)
 		do
-			git -c core.commitGraph=false log -- file$i.txt >expect &&
-			git log -- file$i.txt >actual &&
+			git -c commitGraph.readChangedPaths=false log \
+				-- $path >expect &&
+			git log -- $path >actual &&
+			test_cmp expect actual || return 1
+		done &&
+
+		# Make a variety of path changes
+		printf new1 >d/e/file1.txt &&
+		printf new2 >d/file2.txt &&
+		rm d/e/file2.txt &&
+		rm -r foo &&
+		printf text >foo &&
+		mkdir f &&
+		printf new1 >f/file1.txt &&
+
+		# including a mode-only change (counts as modified)
+		git update-index --chmod=+x mode/script.sh &&
+
+		git add foo d f &&
+		git commit -m "complicated" &&
+
+		# start from scratch and rebuild
+		rm -f .git/objects/info/commit-graph &&
+		GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=10 \
+			GIT_TRACE2_EVENT="$(pwd)/trace-edit" \
+			git commit-graph write --reachable --changed-paths &&
+		grep "\"max_changed_paths\":10" trace-edit &&
+		grep "\"filter_found_large\":2" trace-edit &&
+
+		for path in $(git ls-tree -r --name-only HEAD)
+		do
+			git -c commitGraph.readChangedPaths=false log \
+				-- $path >expect &&
+			git log -- $path >actual &&
+			test_cmp expect actual || return 1
+		done &&
+
+		# start from scratch and rebuild
+		rm -f .git/objects/info/commit-graph &&
+		GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=11 \
+			GIT_TRACE2_EVENT="$(pwd)/trace-update" \
+			git commit-graph write --reachable --changed-paths &&
+		grep "\"max_changed_paths\":11" trace-update &&
+		grep "\"filter_found_large\":0" trace-update &&
+
+		for path in $(git ls-tree -r --name-only HEAD)
+		do
+			git -c commitGraph.readChangedPaths=false log \
+				-- $path >expect &&
+			git log -- $path >actual &&
 			test_cmp expect actual || return 1
 		done
 	)
diff --git a/tree-diff.c b/tree-diff.c
index 6ebad1a46f..7cebbb327e 100644
--- a/tree-diff.c
+++ b/tree-diff.c
@@ -434,7 +434,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
 		if (diff_can_quit_early(opt))
 			break;
 
-		if (opt->max_changes && opt->num_changes > opt->max_changes)
+		if (opt->max_changes && diff_queued_diff.nr > opt->max_changes)
 			break;
 
 		if (opt->pathspec.nr) {
@@ -521,7 +521,6 @@ static struct combine_diff_path *ll_diff_tree_paths(
 
 			/* t↓ */
 			update_tree_entry(&t);
-			opt->num_changes++;
 		}
 
 		/* t > p[imin] */
@@ -539,7 +538,6 @@ static struct combine_diff_path *ll_diff_tree_paths(
 		skip_emit_tp:
 			/* ∀ pi=p[imin]  pi↓ */
 			update_tp_entries(tp, nparent);
-			opt->num_changes++;
 		}
 	}
 
@@ -557,7 +555,6 @@ struct combine_diff_path *diff_tree_paths(
 	const struct object_id **parents_oid, int nparent,
 	struct strbuf *base, struct diff_options *opt)
 {
-	opt->num_changes = 0;
 	p = ll_diff_tree_paths(p, oid, parents_oid, nparent, base, opt);
 
 	/*
-- 
2.28.0.rc1.13.ge78abce653



* [PATCH v2 10/14] commit-graph.c: sort index into commits list
  2020-08-05 17:01 ` [PATCH v2 00/14] more miscellaneous Bloom filter improvements Taylor Blau
                     ` (8 preceding siblings ...)
  2020-08-05 17:02   ` [PATCH v2 09/14] bloom/diff: properly short-circuit on max_changes Taylor Blau
@ 2020-08-05 17:02   ` Taylor Blau
  2020-08-05 17:02   ` [PATCH v2 11/14] csum-file.h: introduce 'hashwrite_be64()' Taylor Blau
                     ` (3 subsequent siblings)
  13 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-08-05 17:02 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, szeder.dev

For locality, 'compute_bloom_filters()' sorts the commits for which it
wants to compute Bloom filters in a preferred order (cf., 3d11275505
(commit-graph: examine commits by generation number, 2020-03-30) for
details).

A future patch will want to recover the new graph position of each
commit. Since the 'packed_commit_list' already stores a double-pointer,
avoid a 'COPY_ARRAY' and instead keep track of an index into the
original list. (Use an integer index instead of a memory address, since
the latter would involve a needlessly confusing triple-pointer.)

Alter the two sorting routines 'commit_pos_cmp' and 'commit_gen_cmp' to
take into account the packed_commit_list they are sorting with respect
to. Since 'compute_bloom_filters()' is the only caller for each of those
comparison functions, no other call-sites need updating.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
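For readers skimming the thread, the index-sorting idea above can be
sketched outside of git as follows. This is only an illustration under
stated assumptions: 'struct item_list', 'index_key_cmp()',
'sort_indexes()' and 'sorted_positions()' are hypothetical names, and a
plain insertion sort stands in for QSORT_S(). The point is that the
comparator receives pointers to indexes and looks the real items up
through the context argument, so the original list is never reordered.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for git's 'struct packed_commit_list'. */
struct item_list {
	const int *keys;	/* sort key of each item, e.g. a generation number */
	size_t nr;
};

/* Same shape as the comparators in this patch: 'va' and 'vb' point at indexes. */
static int index_key_cmp(const void *va, const void *vb, void *ctx)
{
	const struct item_list *items = ctx;
	int a = items->keys[*(const int *)va];
	int b = items->keys[*(const int *)vb];
	return (a > b) - (a < b);
}

/* Insertion sort over an index array; an assumed stand-in for QSORT_S(). */
static void sort_indexes(int *idx, size_t nr,
			 int (*cmp)(const void *, const void *, void *),
			 void *ctx)
{
	size_t i, j;
	for (i = 1; i < nr; i++) {
		int cur = idx[i];
		for (j = i; j > 0 && cmp(&idx[j - 1], &cur, ctx) > 0; j--)
			idx[j] = idx[j - 1];
		idx[j] = cur;
	}
}

/* Fill 'idx' with 0..nr-1, then order it by key; 'keys' is left untouched. */
static void sorted_positions(const int *keys, size_t nr, int *idx)
{
	struct item_list items = { keys, nr };
	size_t i;

	assert(keys && idx);
	for (i = 0; i < nr; i++)
		idx[i] = (int)i;
	sort_indexes(idx, nr, index_key_cmp, &items);
}
```

After sorting, each entry of 'idx' is still a valid position in the
original list, which is what lets a later patch map a sorted commit back
to its graph position.
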
 commit-graph.c | 43 ++++++++++++++++++++++++-------------------
 1 file changed, 24 insertions(+), 19 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 48d4697f54..0d70545149 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -79,10 +79,18 @@ static void set_commit_pos(struct repository *r, const struct object_id *oid)
 	*commit_pos_at(&commit_pos, commit) = max_pos++;
 }
 
-static int commit_pos_cmp(const void *va, const void *vb)
+struct packed_commit_list {
+	struct commit **list;
+	int nr;
+	int alloc;
+};
+
+static int commit_pos_cmp(const void *va, const void *vb, void *ctx)
 {
-	const struct commit *a = *(const struct commit **)va;
-	const struct commit *b = *(const struct commit **)vb;
+	struct packed_commit_list *commits = ctx;
+
+	const struct commit *a = commits->list[*(int *)va];
+	const struct commit *b = commits->list[*(int *)vb];
 	return commit_pos_at(&commit_pos, a) -
 	       commit_pos_at(&commit_pos, b);
 }
@@ -139,10 +147,12 @@ static struct commit_graph_data *commit_graph_data_at(const struct commit *c)
 	return data;
 }
 
-static int commit_gen_cmp(const void *va, const void *vb)
+static int commit_gen_cmp(const void *va, const void *vb, void *ctx)
 {
-	const struct commit *a = *(const struct commit **)va;
-	const struct commit *b = *(const struct commit **)vb;
+	struct packed_commit_list *commits = ctx;
+
+	const struct commit *a = commits->list[*(int *)va];
+	const struct commit *b = commits->list[*(int *)vb];
 
 	uint32_t generation_a = commit_graph_generation(a);
 	uint32_t generation_b = commit_graph_generation(b);
@@ -922,11 +932,6 @@ struct tree *get_commit_tree_in_graph(struct repository *r, const struct commit
 	return get_commit_tree_in_graph_one(r, r->objects->commit_graph, c);
 }
 
-struct packed_commit_list {
-	struct commit **list;
-	int nr;
-	int alloc;
-};
 
 struct packed_oid_list {
 	struct object_id *list;
@@ -1408,7 +1413,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 {
 	int i;
 	struct progress *progress = NULL;
-	struct commit **sorted_commits;
+	int *sorted_commits;
 
 	init_bloom_filters();
 
@@ -1418,16 +1423,16 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 			ctx->commits.nr);
 
 	ALLOC_ARRAY(sorted_commits, ctx->commits.nr);
-	COPY_ARRAY(sorted_commits, ctx->commits.list, ctx->commits.nr);
-
-	if (ctx->order_by_pack)
-		QSORT(sorted_commits, ctx->commits.nr, commit_pos_cmp);
-	else
-		QSORT(sorted_commits, ctx->commits.nr, commit_gen_cmp);
+	for (i = 0; i < ctx->commits.nr; i++)
+		sorted_commits[i] = i;
+	QSORT_S(sorted_commits, ctx->commits.nr,
+		ctx->order_by_pack ? commit_pos_cmp : commit_gen_cmp,
+		&ctx->commits);
 
 	for (i = 0; i < ctx->commits.nr; i++) {
 		int computed = 0;
-		struct commit *c = sorted_commits[i];
+		int pos = sorted_commits[i];
+		struct commit *c = ctx->commits.list[pos];
 		struct bloom_filter *filter = get_or_compute_bloom_filter(
 			ctx->r,
 			c,
-- 
2.28.0.rc1.13.ge78abce653



* [PATCH v2 11/14] csum-file.h: introduce 'hashwrite_be64()'
  2020-08-05 17:01 ` [PATCH v2 00/14] more miscellaneous Bloom filter improvements Taylor Blau
                     ` (9 preceding siblings ...)
  2020-08-05 17:02   ` [PATCH v2 10/14] commit-graph.c: sort index into commits list Taylor Blau
@ 2020-08-05 17:02   ` Taylor Blau
  2020-08-05 17:02   ` [PATCH v2 12/14] commit-graph: add large-filters bitmap chunk Taylor Blau
                     ` (2 subsequent siblings)
  13 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-08-05 17:02 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, szeder.dev

A small handful of writers who wish to encode 64-bit values in network
order have worked around the lack of such a helper by calling the 32-bit
variant twice.

The subsequent commit will add another caller who wants to write a
64-bit value. To ease their (and the existing callers') pain, introduce
a helper to do just that, and convert existing call-sites.

Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
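As a rough, standalone illustration of the byte layout this helper
produces (high 32 bits first, each half in network order), here is a
sketch that writes into a plain buffer instead of a hashfile. The names
'put_be32_bytes()', 'put_be64_bytes()' and 'get_be64_bytes()' are
hypothetical and exist only for this example; they are not part of
csum-file.h.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: store 'v' at 'out' in network (big-endian) byte order. */
static void put_be32_bytes(unsigned char *out, uint32_t v)
{
	out[0] = (unsigned char)(v >> 24);
	out[1] = (unsigned char)(v >> 16);
	out[2] = (unsigned char)(v >> 8);
	out[3] = (unsigned char)v;
}

/* High 32 bits first, then low 32 bits, mirroring hashwrite_be64(). */
static void put_be64_bytes(unsigned char *out, uint64_t v)
{
	assert(out);
	put_be32_bytes(out, (uint32_t)(v >> 32));
	put_be32_bytes(out + 4, (uint32_t)(v & 0xffffffffUL));
}

/* Inverse of put_be64_bytes(), for a round-trip check. */
static uint64_t get_be64_bytes(const unsigned char *in)
{
	uint64_t v = 0;
	int i;
	for (i = 0; i < 8; i++)
		v = (v << 8) | in[i];
	return v;
}
```
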
 commit-graph.c | 8 ++------
 csum-file.h    | 6 ++++++
 midx.c         | 3 +--
 3 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 0d70545149..8964453433 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1784,12 +1784,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 
 	chunk_offset = 8 + (num_chunks + 1) * GRAPH_CHUNKLOOKUP_WIDTH;
 	for (i = 0; i <= num_chunks; i++) {
-		uint32_t chunk_write[3];
-
-		chunk_write[0] = htonl(chunks[i].id);
-		chunk_write[1] = htonl(chunk_offset >> 32);
-		chunk_write[2] = htonl(chunk_offset & 0xffffffff);
-		hashwrite(f, chunk_write, 12);
+		hashwrite_be32(f, chunks[i].id);
+		hashwrite_be64(f, chunk_offset);
 
 		chunk_offset += chunks[i].size;
 	}
diff --git a/csum-file.h b/csum-file.h
index f9cbd317fb..b026ec7766 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -62,4 +62,10 @@ static inline void hashwrite_be32(struct hashfile *f, uint32_t data)
 	hashwrite(f, &data, sizeof(data));
 }
 
+static inline void hashwrite_be64(struct hashfile *f, uint64_t data)
+{
+	hashwrite_be32(f, data >> 32);
+	hashwrite_be32(f, data & 0xffffffffUL);
+}
+
 #endif
diff --git a/midx.c b/midx.c
index 6d1584ca51..8b33149ce4 100644
--- a/midx.c
+++ b/midx.c
@@ -775,8 +775,7 @@ static size_t write_midx_large_offsets(struct hashfile *f, uint32_t nr_large_off
 		if (!(offset >> 31))
 			continue;
 
-		hashwrite_be32(f, offset >> 32);
-		hashwrite_be32(f, offset & 0xffffffffUL);
+		hashwrite_be64(f, offset);
 		written += 2 * sizeof(uint32_t);
 
 		nr_large_offset--;
-- 
2.28.0.rc1.13.ge78abce653



* [PATCH v2 12/14] commit-graph: add large-filters bitmap chunk
  2020-08-05 17:01 ` [PATCH v2 00/14] more miscellaneous Bloom filter improvements Taylor Blau
                     ` (10 preceding siblings ...)
  2020-08-05 17:02   ` [PATCH v2 11/14] csum-file.h: introduce 'hashwrite_be64()' Taylor Blau
@ 2020-08-05 17:02   ` Taylor Blau
  2020-08-05 21:01     ` Junio C Hamano
  2020-08-05 17:03   ` [PATCH v2 13/14] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
  2020-08-05 17:03   ` [PATCH v2 14/14] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
  13 siblings, 1 reply; 117+ messages in thread
From: Taylor Blau @ 2020-08-05 17:02 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, szeder.dev

When a commit has more than a certain number of changed paths (commonly
512), the commit-graph machinery represents it as a zero-length filter.
This is done since having many entries in the Bloom filter has
undesirable effects on the false positivity rate.

In addition to these too-large filters, the commit-graph machinery also
represents commits with no filter and commits with no changed paths in
the same way.

When writing a commit-graph that aggregates several incremental
commit-graph layers (eg., with '--split=replace'), the commit-graph
machinery first computes all of the Bloom filters that it wants to write
but does not already know about from existing graph layers. Because we
overload the zero-length filter in the above fashion, this leads to
recomputing large filters over and over again.

This is already undesirable, since it means that we are wasting
considerable effort to discover that a commit has too many changed
paths, only to throw that effort away (and then repeat the process the
next time a roll-up is performed).

In a subsequent patch, we will add a '--max-new-filters=<n>' option,
which specifies an upper-bound on the number of new filters we are
willing to compute from scratch. Suppose that there are 'N' too-large
filters, and we specify '--max-new-filters=M'. If 'N >= M', it is
unlikely that any filters will be generated, since we'll spend most of
our effort on filters that we ultimately throw away. If 'N < M', filters
will trickle in over time, but only at most 'M - N' per-write.

To address this, add a new chunk which encodes a bitmap where the ith
bit is on iff the ith commit has zero or more than 512 changed paths.
Likewise store the maximum number of changed paths we are willing to
store in order to prepare for eventually making this value more easily
customizable. When computing Bloom filters, first consult the relevant
bitmap (in the case that we are rolling up existing layers) to see if
computing the Bloom filter from scratch would be a waste of time.

This patch implements a new chunk instead of extending the existing BIDX
and BDAT chunks because modifying these chunks would confuse old
clients. (Eg., setting the most-significant bit in the BIDX chunk would
confuse old clients and require a version bump).

To allow using the existing bitmap code with 64-bit words, we write the
data in network byte order from the 64-bit words. This means we also
need to read the array from the commit-graph file by translating each
word from network byte order using get_be64() upon first use of the
bitmap. This is only used when writing the commit-graph, so this is a
relatively small operation compared to the other writing code.

By avoiding the need to move to new versions of the BDAT and BIDX chunk,
we can give ourselves more time to consider whether or not other
modifications to these chunks are worthwhile without holding up this
change.

Another approach would be to introduce a new BIDX chunk (say, one
identified by 'BID2') which is identical to the existing BIDX chunk,
except the most-significant bit of each offset is interpreted as "this
filter is too big" iff looking at a BID2 chunk. This avoids having to
write a bitmap, but forces older clients to rewrite their commit-graphs
(as well as reduces the theoretical largest Bloom filters we could
write, and forces us to maintain the code necessary to translate BIDX
chunks to BID2 ones). Separately from this patch, I implemented this
alternate approach and did not find it to be advantageous.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
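To make the bitmap layout described above concrete, here is a minimal
sketch of one-bit-per-commit storage in 64-bit words. It is written
under assumptions rather than taken from the ewah code: 'WORD_BITS',
'words_for_commits()', 'bit_set()' and 'bit_get()' are hypothetical
names, and the least-significant-bit-first ordering within a word is my
reading of how the in-tree bitmap behaves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WORD_BITS 64	/* plays the role of BITS_IN_EWORD */

/* Number of 64-bit words needed to hold one bit per commit (rounds up). */
static size_t words_for_commits(size_t nr_commits)
{
	return (nr_commits + WORD_BITS - 1) / WORD_BITS;
}

/* Mark the commit at graph position 'pos' as "zero-length filter". */
static void bit_set(uint64_t *words, size_t pos)
{
	assert(words);
	words[pos / WORD_BITS] |= (uint64_t)1 << (pos % WORD_BITS);
}

/* Positions past the end of the bitmap read as unset. */
static int bit_get(const uint64_t *words, size_t nr_words, size_t pos)
{
	if (pos / WORD_BITS >= nr_words)
		return 0;
	return (words[pos / WORD_BITS] >> (pos % WORD_BITS)) & 1;
}
```

With this layout, a graph of 65 commits needs two words, and the chunk
body would be those words written in network order after the 32-bit
maximum.
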
 .../technical/commit-graph-format.txt         |  12 +++
 bloom.h                                       |   2 +-
 commit-graph.c                                | 101 +++++++++++++++---
 commit-graph.h                                |   5 +
 t/t4216-log-bloom.sh                          |  25 ++++-
 5 files changed, 130 insertions(+), 15 deletions(-)

diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index 440541045d..5f2d9ab4d7 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -123,6 +123,18 @@ CHUNK DATA:
       of length zero.
     * The BDAT chunk is present if and only if BIDX is present.
 
+  Large Bloom Filters (ID: {'B', 'F', 'X', 'L'}) [Optional]
+    * It starts with a 32-bit unsigned integer specifying the maximum number of
+      changed-paths that can be stored in a single Bloom filter.
+    * It then contains a list of 64-bit words (the length of this list is
+      determined by the width of the chunk) which is a bitmap. The 'i'th bit is
+      set exactly when the 'i'th commit in the graph has a changed-path Bloom
+      filter with zero entries (either because the commit is empty, or because
+      it contains more than 512 changed paths).
+    * The BFXL chunk is present only when the BIDX and BDAT chunks are
+      also present.
+
+
   Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional]
       This list of H-byte hashes describe a set of B commit-graph files that
       form a commit-graph chain. The graph position for the ith commit in this
diff --git a/bloom.h b/bloom.h
index 2353c74a24..73238276bc 100644
--- a/bloom.h
+++ b/bloom.h
@@ -33,7 +33,7 @@ struct bloom_filter_settings {
 	 * The maximum number of changed paths per commit
 	 * before declaring a Bloom filter to be too-large.
 	 *
-	 * Not written to the commit-graph file.
+	 * Written to the 'BFXL' chunk (instead of 'BDAT').
 	 */
 	uint32_t max_changed_paths;
 };
diff --git a/commit-graph.c b/commit-graph.c
index 8964453433..1fee49d171 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -41,8 +41,9 @@ void git_test_write_commit_graph_or_die(void)
 #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
 #define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
 #define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
+#define GRAPH_CHUNKID_BLOOMLARGE 0x4246584c /* "BFXL" */
 #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
-#define MAX_NUM_CHUNKS 7
+#define MAX_NUM_CHUNKS 8
 
 #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
 
@@ -429,6 +430,19 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 				graph->bloom_filter_settings->bits_per_entry = get_be32(data + chunk_offset + 8);
 			}
 			break;
+
+		case GRAPH_CHUNKID_BLOOMLARGE:
+			if (graph->chunk_bloom_large_filters)
+				chunk_repeated = 1;
+			else if (r->settings.commit_graph_read_changed_paths) {
+				graph->bloom_large_to_alloc = get_be64(chunk_lookup + 4)
+							      - chunk_offset - sizeof(uint32_t);
+
+				graph->bloom_large.word_alloc = 0; /* populate when necessary */
+				graph->chunk_bloom_large_filters = data + chunk_offset + sizeof(uint32_t);
+				graph->bloom_filter_settings->max_changed_paths = get_be32(data + chunk_offset);
+			}
+			break;
 		}
 
 		if (chunk_repeated) {
@@ -443,6 +457,7 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 		/* We need both the bloom chunks to exist together. Else ignore the data */
 		graph->chunk_bloom_indexes = NULL;
 		graph->chunk_bloom_data = NULL;
+		graph->chunk_bloom_large_filters = NULL;
 		FREE_AND_NULL(graph->bloom_filter_settings);
 	}
 
@@ -932,6 +947,31 @@ struct tree *get_commit_tree_in_graph(struct repository *r, const struct commit
 	return get_commit_tree_in_graph_one(r, r->objects->commit_graph, c);
 }
 
+static int get_bloom_filter_large_in_graph(struct commit_graph *g,
+					   const struct commit *c)
+{
+	uint32_t graph_pos = commit_graph_position(c);
+	if (graph_pos == COMMIT_NOT_FROM_GRAPH)
+		return 0;
+
+	while (g && graph_pos < g->num_commits_in_base)
+		g = g->base_graph;
+
+	if (!g || !g->bloom_large_to_alloc)
+		return 0;
+
+	if (!g->bloom_large.word_alloc) {
+		size_t i;
+		g->bloom_large.word_alloc = g->bloom_large_to_alloc;
+		g->bloom_large.words = xmalloc(g->bloom_large_to_alloc * sizeof(eword_t));
+
+		for (i = 0; i < g->bloom_large_to_alloc; i++)
+			g->bloom_large.words[i] = get_be64(g->chunk_bloom_large_filters
+							   + i * sizeof(eword_t));
+	}
+
+	return bitmap_get(&g->bloom_large, graph_pos - g->num_commits_in_base);
+}
 
 struct packed_oid_list {
 	struct object_id *list;
@@ -970,8 +1010,10 @@ struct write_commit_graph_context {
 	size_t total_bloom_filter_data_size;
 	const struct bloom_filter_settings *bloom_settings;
 
+	int count_bloom_filter_known_large;
 	int count_bloom_filter_found_large;
 	int count_bloom_filter_computed;
+	struct bitmap *bloom_large;
 };
 
 static int write_graph_chunk_fanout(struct hashfile *f,
@@ -1235,6 +1277,23 @@ static int write_graph_chunk_bloom_data(struct hashfile *f,
 	return 0;
 }
 
+static int write_graph_chunk_bloom_large(struct hashfile *f,
+					 struct write_commit_graph_context *ctx)
+{
+	size_t i, alloc = ctx->commits.nr / BITS_IN_EWORD;
+	if (ctx->commits.nr % BITS_IN_EWORD)
+		alloc++;
+	if (alloc > ctx->bloom_large->word_alloc)
+		BUG("write_graph_chunk_bloom_large: bitmap not large enough");
+
+	trace2_region_enter("commit-graph", "bloom_large", ctx->r);
+	hashwrite_be32(f, ctx->bloom_settings->max_changed_paths);
+	for (i = 0; i < ctx->bloom_large->word_alloc; i++)
+		hashwrite_be64(f, ctx->bloom_large->words[i]);
+	trace2_region_leave("commit-graph", "bloom_large", ctx->r);
+	return 0;
+}
+
 static int oid_compare(const void *_a, const void *_b)
 {
 	const struct object_id *a = (const struct object_id *)_a;
@@ -1398,6 +1457,8 @@ static void trace2_bloom_filter_write_statistics(struct write_commit_graph_conte
 	struct json_writer jw = JSON_WRITER_INIT;
 
 	jw_object_begin(&jw, 0);
+	jw_object_intmax(&jw, "filter_known_large",
+			 ctx->count_bloom_filter_known_large);
 	jw_object_intmax(&jw, "filter_found_large",
 			 ctx->count_bloom_filter_found_large);
 	jw_object_intmax(&jw, "filter_computed",
@@ -1416,6 +1477,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 	int *sorted_commits;
 
 	init_bloom_filters();
+	ctx->bloom_large = bitmap_word_alloc(ctx->commits.nr / BITS_IN_EWORD + 1);
 
 	if (ctx->report_progress)
 		progress = start_delayed_progress(
@@ -1430,21 +1492,28 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 		&ctx->commits);
 
 	for (i = 0; i < ctx->commits.nr; i++) {
-		int computed = 0;
 		int pos = sorted_commits[i];
 		struct commit *c = ctx->commits.list[pos];
-		struct bloom_filter *filter = get_or_compute_bloom_filter(
-			ctx->r,
-			c,
-			1,
-			ctx->bloom_settings,
-			&computed);
-		if (computed) {
-			ctx->count_bloom_filter_computed++;
-			if (filter && !filter->len)
-				ctx->count_bloom_filter_found_large++;
+		if (get_bloom_filter_large_in_graph(ctx->r->objects->commit_graph, c)) {
+			bitmap_set(ctx->bloom_large, pos);
+			ctx->count_bloom_filter_known_large++;
+		} else {
+			int computed = 0;
+			struct bloom_filter *filter = get_or_compute_bloom_filter(
+				ctx->r,
+				c,
+				1,
+				ctx->bloom_settings,
+				&computed);
+			if (computed) {
+				ctx->count_bloom_filter_computed++;
+				if (filter && !filter->len) {
+					bitmap_set(ctx->bloom_large, pos);
+					ctx->count_bloom_filter_found_large++;
+				}
+			}
+			ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
 		}
-		ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
 		display_progress(progress, i + 1);
 	}
 
@@ -1764,6 +1833,11 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 					  + ctx->total_bloom_filter_data_size;
 		chunks[num_chunks].write_fn = write_graph_chunk_bloom_data;
 		num_chunks++;
+		chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMLARGE;
+		chunks[num_chunks].size = sizeof(eword_t) * ctx->bloom_large->word_alloc
+					+ sizeof(uint32_t);
+		chunks[num_chunks].write_fn = write_graph_chunk_bloom_large;
+		num_chunks++;
 	}
 	if (ctx->num_commit_graphs_after > 1) {
 		chunks[num_chunks].id = GRAPH_CHUNKID_BASE;
@@ -2503,6 +2577,7 @@ void free_commit_graph(struct commit_graph *g)
 	}
 	free(g->filename);
 	free(g->bloom_filter_settings);
+	bitmap_free(g->bloom_large);
 	free(g);
 }
 
diff --git a/commit-graph.h b/commit-graph.h
index d9acb22bac..f4fb996dd5 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -4,6 +4,7 @@
 #include "git-compat-util.h"
 #include "object-store.h"
 #include "oidset.h"
+#include "ewah/ewok.h"
 
 #define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
 #define GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE "GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE"
@@ -71,6 +72,10 @@ struct commit_graph {
 	const unsigned char *chunk_base_graphs;
 	const unsigned char *chunk_bloom_indexes;
 	const unsigned char *chunk_bloom_data;
+	const unsigned char *chunk_bloom_large_filters;
+
+	size_t bloom_large_to_alloc;
+	struct bitmap bloom_large;
 
 	struct bloom_filter_settings *bloom_filter_settings;
 };
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 21b67677ef..6859d85369 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -33,7 +33,7 @@ test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
 	git commit-graph write --reachable --changed-paths
 '
 graph_read_expect () {
-	NUM_CHUNKS=5
+	NUM_CHUNKS=6
 	cat >expect <<- EOF
 	header: 43475048 1 1 $NUM_CHUNKS 0
 	num_commits: $1
@@ -262,5 +262,28 @@ test_expect_success 'correctly report changes over limit' '
 		done
 	)
 '
+test_bloom_filters_computed () {
+	commit_graph_args=$1
+	bloom_trace_prefix="{\"filter_known_large\":$2,\"filter_found_large\":$3,\"filter_computed\":$4"
+	rm -f "$TRASH_DIRECTORY/trace.event" &&
+	GIT_TRACE2_EVENT="$TRASH_DIRECTORY/trace.event" git commit-graph write $commit_graph_args &&
+	grep "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.event"
+}
+
+test_expect_success 'Bloom generation does not recompute too-large filters' '
+	(
+		cd limits &&
+
+		# start from scratch and rebuild
+		rm -f .git/objects/info/commit-graph &&
+		GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=10 \
+			git commit-graph write --reachable --changed-paths \
+			--split=replace &&
+		test_commit c1 filter &&
+
+		test_bloom_filters_computed "--reachable --changed-paths --split=replace" \
+			2 0 1
+	)
+'
 
 test_done
-- 
2.28.0.rc1.13.ge78abce653



* [PATCH v2 13/14] commit-graph: rename 'split_commit_graph_opts'
  2020-08-05 17:01 ` [PATCH v2 00/14] more miscellaneous Bloom filter improvements Taylor Blau
                     ` (11 preceding siblings ...)
  2020-08-05 17:02   ` [PATCH v2 12/14] commit-graph: add large-filters bitmap chunk Taylor Blau
@ 2020-08-05 17:03   ` Taylor Blau
  2020-08-05 17:03   ` [PATCH v2 14/14] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
  13 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-08-05 17:03 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, szeder.dev

In the subsequent commit, additional options will be added to the
commit-graph API which have nothing to do with splitting.

Rename the 'split_commit_graph_opts' structure to the more-generic
'commit_graph_opts' to encompass both kinds of options.

Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/commit-graph.c | 20 ++++++++++----------
 commit-graph.c         | 40 ++++++++++++++++++++--------------------
 commit-graph.h         |  6 +++---
 3 files changed, 33 insertions(+), 33 deletions(-)

diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index ba5584463f..38f5f57d15 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -119,7 +119,7 @@ static int graph_verify(int argc, const char **argv)
 }
 
 extern int read_replace_refs;
-static struct split_commit_graph_opts split_opts;
+static struct commit_graph_opts write_opts;
 
 static int write_option_parse_split(const struct option *opt, const char *arg,
 				    int unset)
@@ -187,24 +187,24 @@ static int graph_write(int argc, const char **argv)
 		OPT_BOOL(0, "changed-paths", &opts.enable_changed_paths,
 			N_("enable computation for changed paths")),
 		OPT_BOOL(0, "progress", &opts.progress, N_("force progress reporting")),
-		OPT_CALLBACK_F(0, "split", &split_opts.flags, NULL,
+		OPT_CALLBACK_F(0, "split", &write_opts.flags, NULL,
 			N_("allow writing an incremental commit-graph file"),
 			PARSE_OPT_OPTARG | PARSE_OPT_NONEG,
 			write_option_parse_split),
-		OPT_INTEGER(0, "max-commits", &split_opts.max_commits,
+		OPT_INTEGER(0, "max-commits", &write_opts.max_commits,
 			N_("maximum number of commits in a non-base split commit-graph")),
-		OPT_INTEGER(0, "size-multiple", &split_opts.size_multiple,
+		OPT_INTEGER(0, "size-multiple", &write_opts.size_multiple,
 			N_("maximum ratio between two levels of a split commit-graph")),
-		OPT_EXPIRY_DATE(0, "expire-time", &split_opts.expire_time,
+		OPT_EXPIRY_DATE(0, "expire-time", &write_opts.expire_time,
 			N_("only expire files older than a given date-time")),
 		OPT_END(),
 	};
 
 	opts.progress = isatty(2);
 	opts.enable_changed_paths = -1;
-	split_opts.size_multiple = 2;
-	split_opts.max_commits = 0;
-	split_opts.expire_time = 0;
+	write_opts.size_multiple = 2;
+	write_opts.max_commits = 0;
+	write_opts.expire_time = 0;
 
 	trace2_cmd_mode("write");
 
@@ -232,7 +232,7 @@ static int graph_write(int argc, const char **argv)
 	odb = find_odb(the_repository, opts.obj_dir);
 
 	if (opts.reachable) {
-		if (write_commit_graph_reachable(odb, flags, &split_opts))
+		if (write_commit_graph_reachable(odb, flags, &write_opts))
 			return 1;
 		return 0;
 	}
@@ -261,7 +261,7 @@ static int graph_write(int argc, const char **argv)
 			       opts.stdin_packs ? &pack_indexes : NULL,
 			       opts.stdin_commits ? &commits : NULL,
 			       flags,
-			       &split_opts))
+			       &write_opts))
 		result = 1;
 
 cleanup:
diff --git a/commit-graph.c b/commit-graph.c
index 1fee49d171..82fca07579 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1006,7 +1006,7 @@ struct write_commit_graph_context {
 		 changed_paths:1,
 		 order_by_pack:1;
 
-	const struct split_commit_graph_opts *split_opts;
+	const struct commit_graph_opts *opts;
 	size_t total_bloom_filter_data_size;
 	const struct bloom_filter_settings *bloom_settings;
 
@@ -1347,8 +1347,8 @@ static void close_reachable(struct write_commit_graph_context *ctx)
 {
 	int i;
 	struct commit *commit;
-	enum commit_graph_split_flags flags = ctx->split_opts ?
-		ctx->split_opts->flags : COMMIT_GRAPH_SPLIT_UNSPECIFIED;
+	enum commit_graph_split_flags flags = ctx->opts ?
+		ctx->opts->flags : COMMIT_GRAPH_SPLIT_UNSPECIFIED;
 
 	if (ctx->report_progress)
 		ctx->progress = start_delayed_progress(
@@ -1548,7 +1548,7 @@ static int add_ref_to_set(const char *refname,
 
 int write_commit_graph_reachable(struct object_directory *odb,
 				 enum commit_graph_write_flags flags,
-				 const struct split_commit_graph_opts *split_opts)
+				 const struct commit_graph_opts *opts)
 {
 	struct oidset commits = OIDSET_INIT;
 	struct refs_cb_data data;
@@ -1565,7 +1565,7 @@ int write_commit_graph_reachable(struct object_directory *odb,
 	stop_progress(&data.progress);
 
 	result = write_commit_graph(odb, NULL, &commits,
-				    flags, split_opts);
+				    flags, opts);
 
 	oidset_clear(&commits);
 	return result;
@@ -1680,8 +1680,8 @@ static uint32_t count_distinct_commits(struct write_commit_graph_context *ctx)
 static void copy_oids_to_commits(struct write_commit_graph_context *ctx)
 {
 	uint32_t i;
-	enum commit_graph_split_flags flags = ctx->split_opts ?
-		ctx->split_opts->flags : COMMIT_GRAPH_SPLIT_UNSPECIFIED;
+	enum commit_graph_split_flags flags = ctx->opts ?
+		ctx->opts->flags : COMMIT_GRAPH_SPLIT_UNSPECIFIED;
 
 	ctx->num_extra_edges = 0;
 	if (ctx->report_progress)
@@ -1967,13 +1967,13 @@ static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
 	int max_commits = 0;
 	int size_mult = 2;
 
-	if (ctx->split_opts) {
-		max_commits = ctx->split_opts->max_commits;
+	if (ctx->opts) {
+		max_commits = ctx->opts->max_commits;
 
-		if (ctx->split_opts->size_multiple)
-			size_mult = ctx->split_opts->size_multiple;
+		if (ctx->opts->size_multiple)
+			size_mult = ctx->opts->size_multiple;
 
-		flags = ctx->split_opts->flags;
+		flags = ctx->opts->flags;
 	}
 
 	g = ctx->r->objects->commit_graph;
@@ -2151,8 +2151,8 @@ static void expire_commit_graphs(struct write_commit_graph_context *ctx)
 	size_t dirnamelen;
 	timestamp_t expire_time = time(NULL);
 
-	if (ctx->split_opts && ctx->split_opts->expire_time)
-		expire_time = ctx->split_opts->expire_time;
+	if (ctx->opts && ctx->opts->expire_time)
+		expire_time = ctx->opts->expire_time;
 	if (!ctx->split) {
 		char *chain_file_name = get_chain_filename(ctx->odb);
 		unlink(chain_file_name);
@@ -2203,7 +2203,7 @@ int write_commit_graph(struct object_directory *odb,
 		       struct string_list *pack_indexes,
 		       struct oidset *commits,
 		       enum commit_graph_write_flags flags,
-		       const struct split_commit_graph_opts *split_opts)
+		       const struct commit_graph_opts *opts)
 {
 	struct write_commit_graph_context *ctx;
 	uint32_t i, count_distinct = 0;
@@ -2220,7 +2220,7 @@ int write_commit_graph(struct object_directory *odb,
 	ctx->append = flags & COMMIT_GRAPH_WRITE_APPEND ? 1 : 0;
 	ctx->report_progress = flags & COMMIT_GRAPH_WRITE_PROGRESS ? 1 : 0;
 	ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
-	ctx->split_opts = split_opts;
+	ctx->opts = opts;
 	ctx->total_bloom_filter_data_size = 0;
 
 	bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
@@ -2268,15 +2268,15 @@ int write_commit_graph(struct object_directory *odb,
 			}
 		}
 
-		if (ctx->split_opts)
-			replace = ctx->split_opts->flags & COMMIT_GRAPH_SPLIT_REPLACE;
+		if (ctx->opts)
+			replace = ctx->opts->flags & COMMIT_GRAPH_SPLIT_REPLACE;
 	}
 
 	ctx->approx_nr_objects = approximate_object_count();
 	ctx->oids.alloc = ctx->approx_nr_objects / 32;
 
-	if (ctx->split && split_opts && ctx->oids.alloc > split_opts->max_commits)
-		ctx->oids.alloc = split_opts->max_commits;
+	if (ctx->split && opts && ctx->oids.alloc > opts->max_commits)
+		ctx->oids.alloc = opts->max_commits;
 
 	if (ctx->append) {
 		prepare_commit_graph_one(ctx->r, ctx->odb);
diff --git a/commit-graph.h b/commit-graph.h
index f4fb996dd5..1d147b7b76 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -110,7 +110,7 @@ enum commit_graph_split_flags {
 	COMMIT_GRAPH_SPLIT_REPLACE          = 2
 };
 
-struct split_commit_graph_opts {
+struct commit_graph_opts {
 	int size_multiple;
 	int max_commits;
 	timestamp_t expire_time;
@@ -125,12 +125,12 @@ struct split_commit_graph_opts {
  */
 int write_commit_graph_reachable(struct object_directory *odb,
 				 enum commit_graph_write_flags flags,
-				 const struct split_commit_graph_opts *split_opts);
+				 const struct commit_graph_opts *opts);
 int write_commit_graph(struct object_directory *odb,
 		       struct string_list *pack_indexes,
 		       struct oidset *commits,
 		       enum commit_graph_write_flags flags,
-		       const struct split_commit_graph_opts *split_opts);
+		       const struct commit_graph_opts *opts);
 
 #define COMMIT_GRAPH_VERIFY_SHALLOW	(1 << 0)
 
-- 
2.28.0.rc1.13.ge78abce653


^ permalink raw reply	[flat|nested] 117+ messages in thread

* [PATCH v2 14/14] builtin/commit-graph.c: introduce '--max-new-filters=<n>'
  2020-08-05 17:01 ` [PATCH v2 00/14] more miscellaneous Bloom filter improvements Taylor Blau
                     ` (12 preceding siblings ...)
  2020-08-05 17:03   ` [PATCH v2 13/14] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
@ 2020-08-05 17:03   ` Taylor Blau
  13 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-08-05 17:03 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, szeder.dev

Introduce a command-line flag and configuration variable to fill in the
'max_new_filters' variable introduced by the previous patch.

The command-line option '--max-new-filters' takes precedence over
'commitGraph.maxNewFilters', which supplies the default value.
'--no-max-new-filters' can also be provided, which sets the value back
to '-1', indicating that an unlimited number of new Bloom filters may be
generated. (OPT_INTEGER only allows setting the '--no-' variant back to
'0', hence a custom callback is used instead.)

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config/commitgraph.txt |  4 +++
 Documentation/git-commit-graph.txt   |  4 +++
 bloom.c                              | 15 +++++++++++
 builtin/commit-graph.c               | 39 +++++++++++++++++++++++++---
 commit-graph.c                       | 16 +++++++++---
 commit-graph.h                       |  1 +
 t/t4216-log-bloom.sh                 | 19 ++++++++++++++
 7 files changed, 91 insertions(+), 7 deletions(-)
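
A minimal sketch of the precedence described in the commit message, kept
separate from the patch itself. The names 'parse_max_new_filters' and
'resolve_max_new_filters' are hypothetical and do not exist in git. The
sketch also simplifies in two ways: it parses the configuration value with
strtol() rather than git_config_int(), and it rejects an empty argument.

```c
#include <stdlib.h>

/*
 * Hypothetical stand-in for the option callback: 'arg' is the text given
 * to '--max-new-filters', and 'unset' is non-zero for
 * '--no-max-new-filters'. Returns 0 on success, or -1 if 'arg' is not a
 * plain base-10 integer ('*to' is left untouched in that case).
 */
static int parse_max_new_filters(const char *arg, int unset, int *to)
{
	char *end;
	long value;

	if (unset) {
		*to = -1;	/* a negative value means "no limit" */
		return 0;
	}
	if (!arg || !*arg)
		return -1;	/* assumption of this sketch: empty is an error */
	value = strtol(arg, &end, 10);
	if (*end)
		return -1;	/* trailing garbage such as "12abc" */
	*to = (int)value;
	return 0;
}

/*
 * Start from the built-in default (-1), let the configuration override
 * it, then let the command line override that. 'config_value' and
 * 'cli_arg' are NULL when the corresponding source is absent.
 */
static int resolve_max_new_filters(const char *config_value,
				   const char *cli_arg, int cli_unset,
				   int *out)
{
	*out = -1;
	if (config_value && parse_max_new_filters(config_value, 0, out))
		return -1;
	if (cli_unset)
		return parse_max_new_filters(NULL, 1, out);
	if (cli_arg)
		return parse_max_new_filters(cli_arg, 0, out);
	return 0;
}
```

With a configured value of "5", passing no option yields 5, passing
'--max-new-filters=1' yields 1, and passing '--no-max-new-filters' yields -1.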

diff --git a/Documentation/config/commitgraph.txt b/Documentation/config/commitgraph.txt
index cff0797b54..4582c39fc4 100644
--- a/Documentation/config/commitgraph.txt
+++ b/Documentation/config/commitgraph.txt
@@ -1,3 +1,7 @@
+commitGraph.maxNewFilters::
+	Specifies the default value for the `--max-new-filters` option of `git
+	commit-graph write` (c.f., linkgit:git-commit-graph[1]).
+
 commitGraph.readChangedPaths::
 	If true, then git will use the changed-path Bloom filters in the
 	commit-graph file (if it exists, and they are present). Defaults to
diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index 17405c73a9..9c887d5d79 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -67,6 +67,10 @@ this option is given, future commit-graph writes will automatically assume
 that this option was intended. Use `--no-changed-paths` to stop storing this
 data.
 +
+With the `--max-new-filters=<n>` option, generate at most `n` new Bloom
+filters (if `--changed-paths` is specified). If `n` is `-1`, no limit is
+enforced. Overrides the `commitGraph.maxNewFilters` configuration.
++
 With the `--split[=<strategy>]` option, write the commit-graph as a
 chain of multiple commit-graph files stored in
 `<dir>/info/commit-graphs`. Commit-graph layers are merged based on the
diff --git a/bloom.c b/bloom.c
index ed54e96e57..d0c0fd049d 100644
--- a/bloom.c
+++ b/bloom.c
@@ -51,6 +51,21 @@ static int load_bloom_filter_from_graph(struct commit_graph *g,
 	else
 		start_index = 0;
 
+	if ((start_index == end_index) &&
+	    (g->bloom_large.word_alloc && !bitmap_get(&g->bloom_large, lex_pos))) {
+		/*
+		 * If the filter is zero-length, either (1) the filter has no
+		 * changes, (2) the filter has too many changes, or (3) it
+		 * wasn't computed (eg., due to '--max-new-filters').
+		 *
+		 * If either (1) or (2) is the case, the 'large' bit will be set
+		 * for this Bloom filter. If it is unset, then it wasn't
+		 * computed. In that case, return nothing, since we don't have
+		 * that filter in the graph.
+		 */
+		return 0;
+	}
+
 	filter->len = end_index - start_index;
 	filter->data = (unsigned char *)(g->chunk_bloom_data +
 					sizeof(unsigned char) * start_index +
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 38f5f57d15..3500a6e1f1 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -13,7 +13,8 @@ static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph verify [--object-dir <objdir>] [--shallow] [--[no-]progress]"),
 	N_("git commit-graph write [--object-dir <objdir>] [--append] "
 	   "[--split[=<strategy>]] [--reachable|--stdin-packs|--stdin-commits] "
-	   "[--changed-paths] [--[no-]progress] <split options>"),
+	   "[--changed-paths] [--[no-]max-new-filters <n>] [--[no-]progress] "
+	   "<split options>"),
 	NULL
 };
 
@@ -25,7 +26,8 @@ static const char * const builtin_commit_graph_verify_usage[] = {
 static const char * const builtin_commit_graph_write_usage[] = {
 	N_("git commit-graph write [--object-dir <objdir>] [--append] "
 	   "[--split[=<strategy>]] [--reachable|--stdin-packs|--stdin-commits] "
-	   "[--changed-paths] [--[no-]progress] <split options>"),
+	   "[--changed-paths] [--[no-]max-new-filters <n>] [--[no-]progress] "
+	   "<split options>"),
 	NULL
 };
 
@@ -162,6 +164,23 @@ static int read_one_commit(struct oidset *commits, struct progress *progress,
 	return 0;
 }
 
+static int write_option_max_new_filters(const struct option *opt,
+					const char *arg,
+					int unset)
+{
+	int *to = opt->value;
+	if (unset)
+		*to = -1;
+	else {
+		const char *s;
+		*to = strtol(arg, (char **)&s, 10);
+		if (*s)
+			return error(_("%s expects a numerical value"),
+				     optname(opt, opt->flags));
+	}
+	return 0;
+}
+
 static int graph_write(int argc, const char **argv)
 {
 	struct string_list pack_indexes = STRING_LIST_INIT_NODUP;
@@ -197,6 +216,9 @@ static int graph_write(int argc, const char **argv)
 			N_("maximum ratio between two levels of a split commit-graph")),
 		OPT_EXPIRY_DATE(0, "expire-time", &write_opts.expire_time,
 			N_("only expire files older than a given date-time")),
+		OPT_CALLBACK_F(0, "max-new-filters", &write_opts.max_new_filters,
+			NULL, N_("maximum number of changed-path Bloom filters to compute"),
+			0, write_option_max_new_filters),
 		OPT_END(),
 	};
 
@@ -205,6 +227,7 @@ static int graph_write(int argc, const char **argv)
 	write_opts.size_multiple = 2;
 	write_opts.max_commits = 0;
 	write_opts.expire_time = 0;
+	write_opts.max_new_filters = -1;
 
 	trace2_cmd_mode("write");
 
@@ -270,6 +293,16 @@ static int graph_write(int argc, const char **argv)
 	return result;
 }
 
+static int git_commit_graph_config(const char *var, const char *value, void *cb)
+{
+	if (!strcmp(var, "commitgraph.maxnewfilters")) {
+		write_opts.max_new_filters = git_config_int(var, value);
+		return 0;
+	}
+
+	return git_default_config(var, value, cb);
+}
+
 int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 {
 	static struct option builtin_commit_graph_options[] = {
@@ -283,7 +316,7 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 		usage_with_options(builtin_commit_graph_usage,
 				   builtin_commit_graph_options);
 
-	git_config(git_default_config, NULL);
+	git_config(git_commit_graph_config, &opts);
 	argc = parse_options(argc, argv, prefix,
 			     builtin_commit_graph_options,
 			     builtin_commit_graph_usage,
diff --git a/commit-graph.c b/commit-graph.c
index 82fca07579..76b1238262 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -948,7 +948,8 @@ struct tree *get_commit_tree_in_graph(struct repository *r, const struct commit
 }
 
 static int get_bloom_filter_large_in_graph(struct commit_graph *g,
-					   const struct commit *c)
+					   const struct commit *c,
+					   uint32_t max_changed_paths)
 {
 	uint32_t graph_pos = commit_graph_position(c);
 	if (graph_pos == COMMIT_NOT_FROM_GRAPH)
@@ -1475,6 +1476,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 	int i;
 	struct progress *progress = NULL;
 	int *sorted_commits;
+	int max_new_filters;
 
 	init_bloom_filters();
 	ctx->bloom_large = bitmap_word_alloc(ctx->commits.nr / BITS_IN_EWORD + 1);
@@ -1491,10 +1493,15 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 		ctx->order_by_pack ? commit_pos_cmp : commit_gen_cmp,
 		&ctx->commits);
 
+	max_new_filters = ctx->opts->max_new_filters >= 0 ?
+		ctx->opts->max_new_filters : ctx->commits.nr;
+
 	for (i = 0; i < ctx->commits.nr; i++) {
 		int pos = sorted_commits[i];
 		struct commit *c = ctx->commits.list[pos];
-		if (get_bloom_filter_large_in_graph(ctx->r->objects->commit_graph, c)) {
+		if (get_bloom_filter_large_in_graph(ctx->r->objects->commit_graph,
+						    c,
+						    ctx->bloom_settings->max_changed_paths)) {
 			bitmap_set(ctx->bloom_large, pos);
 			ctx->count_bloom_filter_known_large++;
 		} else {
@@ -1502,7 +1509,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 			struct bloom_filter *filter = get_or_compute_bloom_filter(
 				ctx->r,
 				c,
-				1,
+				ctx->count_bloom_filter_computed < max_new_filters,
 				ctx->bloom_settings,
 				&computed);
 			if (computed) {
@@ -1512,7 +1519,8 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 					ctx->count_bloom_filter_found_large++;
 				}
 			}
-			ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
+			if (filter)
+				ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
 		}
 		display_progress(progress, i + 1);
 	}
diff --git a/commit-graph.h b/commit-graph.h
index 1d147b7b76..47d99ea4bf 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -115,6 +115,7 @@ struct commit_graph_opts {
 	int max_commits;
 	timestamp_t expire_time;
 	enum commit_graph_split_flags flags;
+	int max_new_filters;
 };
 
 /*
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 6859d85369..3aab8ffbe3 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -286,4 +286,23 @@ test_expect_success 'Bloom generation does not recompute too-large filters' '
 	)
 '
 
+test_expect_success 'Bloom generation is limited by --max-new-filters' '
+	(
+		cd limits &&
+		test_commit c2 filter &&
+		test_commit c3 filter &&
+		test_commit c4 no-filter &&
+		test_bloom_filters_computed "--reachable --changed-paths --split=replace --max-new-filters=2" \
+			2 0 2
+	)
+'
+
+test_expect_success 'Bloom generation backfills previously-skipped filters' '
+	(
+		cd limits &&
+		test_bloom_filters_computed "--reachable --changed-paths --split=replace --max-new-filters=1" \
+			2 0 1
+	)
+'
+
 test_done
-- 
2.28.0.rc1.13.ge78abce653

^ permalink raw reply	[flat|nested] 117+ messages in thread
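
The two pieces of logic in the patch above can be sketched in isolation.
This is an illustration under stated assumptions, not code from git:
'count_new_filters', 'classify_filter' and the 'filter_state' enum are
hypothetical names, and the array of flags stands in for "this commit
already has a filter available". The first function mirrors how a negative
limit is replaced by the number of commits; the second mirrors the three
zero-length cases listed in the bloom.c comment.

```c
#include <stddef.h>

/*
 * Count how many filters would be computed from scratch. A negative
 * 'max_new_filters' means "no limit", expressed (as in the patch) by
 * substituting the number of commits for the limit.
 */
static int count_new_filters(const int *already_have, int nr,
			     int max_new_filters)
{
	int limit = max_new_filters >= 0 ? max_new_filters : nr;
	int computed = 0;
	int i;

	for (i = 0; i < nr; i++) {
		if (already_have[i])
			continue;	/* nothing to compute for this commit */
		if (computed < limit)
			computed++;	/* still within the budget */
	}
	return computed;
}

enum filter_state {
	FILTER_PRESENT,		/* non-empty filter data */
	FILTER_ZERO_LENGTH,	/* usable zero-length filter: cases (1) and (2) */
	FILTER_NOT_COMPUTED	/* zero-length with the 'large' bit unset: case (3) */
};

/*
 * 'have_bitmap' is non-zero when the 'large' bitmap has been loaded, and
 * 'large_bit' is this commit's bit in it. Without a bitmap, a zero-length
 * filter is taken at face value, matching the fall-through in the patch.
 */
static enum filter_state classify_filter(size_t len, int have_bitmap,
					 int large_bit)
{
	if (len)
		return FILTER_PRESENT;
	if (have_bitmap && !large_bit)
		return FILTER_NOT_COMPUTED;
	return FILTER_ZERO_LENGTH;
}
```

For example, with one commit that already has a filter and three that do
not, a limit of 2 computes two filters, a limit of -1 computes all three,
and a limit of 0 computes none.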

* Re: [PATCH v2 12/14] commit-graph: add large-filters bitmap chunk
  2020-08-05 17:02   ` [PATCH v2 12/14] commit-graph: add large-filters bitmap chunk Taylor Blau
@ 2020-08-05 21:01     ` Junio C Hamano
  2020-08-05 21:17       ` Taylor Blau
  0 siblings, 1 reply; 117+ messages in thread
From: Junio C Hamano @ 2020-08-05 21:01 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, dstolee, szeder.dev

Taylor Blau <me@ttaylorr.com> writes:

> @@ -71,6 +72,10 @@ struct commit_graph {
>  	const unsigned char *chunk_base_graphs;
>  	const unsigned char *chunk_bloom_indexes;
>  	const unsigned char *chunk_bloom_data;
> +	const unsigned char *chunk_bloom_large_filters;
> +
> +	size_t bloom_large_to_alloc;
> +	struct bitmap bloom_large;

Hmph, is the API rich enough to allow users to release the resource
used by such an embedded bitmap?  I ask because...

> @@ -2503,6 +2577,7 @@ void free_commit_graph(struct commit_graph *g)
>  	}
>  	free(g->filename);
>  	free(g->bloom_filter_settings);
> +	bitmap_free(g->bloom_large);
>  	free(g);
>  }

... this hunk cannot be possibly correct as-is, and cannot be made
correct without changing g->bloom_large to a pointer into a heap
allocated bitmap, because bitmap_free() wants to not just release
the resource held by the bitmap but the bitmap itself.

^ permalink raw reply	[flat|nested] 117+ messages in thread
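
The objection above can be illustrated with a small, self-contained
sketch. Everything here is hypothetical ('toy_bitmap', 'toy_graph' and
their helpers are not git's types or functions). It shows the shape of the
problem: a free function that releases both the word array and the struct
itself can only be handed a heap-allocated struct, so the owner has to
hold a pointer rather than embed the struct by value.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical miniature of a word-array bitmap. */
struct toy_bitmap {
	uint64_t *words;
	size_t word_alloc;
};

/* Allocate the struct and a zeroed word array on the heap. */
static struct toy_bitmap *toy_bitmap_alloc(size_t nr_words)
{
	struct toy_bitmap *b = malloc(sizeof(*b));

	if (!b)
		return NULL;
	b->words = calloc(nr_words ? nr_words : 1, sizeof(*b->words));
	if (!b->words) {
		free(b);
		return NULL;
	}
	b->word_alloc = nr_words;
	return b;
}

/*
 * Release the word array and the struct itself. Accepting NULL, like
 * free(), is a design choice of this sketch. Passing the address of a
 * struct embedded in another object would be invalid here, because the
 * final free() must receive a pointer obtained from malloc().
 */
static void toy_bitmap_free(struct toy_bitmap *b)
{
	if (!b)
		return;
	free(b->words);
	free(b);
}

struct toy_graph {
	struct toy_bitmap *large;	/* pointer: NULL until populated */
};

static void toy_graph_release(struct toy_graph *g)
{
	toy_bitmap_free(g->large);
	g->large = NULL;
}
```

Holding a pointer also gives the owner a natural "not yet populated"
state (NULL) without a separate flag.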

* Re: [PATCH v2 12/14] commit-graph: add large-filters bitmap chunk
  2020-08-05 21:01     ` Junio C Hamano
@ 2020-08-05 21:17       ` Taylor Blau
  2020-08-05 22:21         ` Junio C Hamano
  0 siblings, 1 reply; 117+ messages in thread
From: Taylor Blau @ 2020-08-05 21:17 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, git, peff, dstolee, szeder.dev

On Wed, Aug 05, 2020 at 02:01:29PM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > @@ -71,6 +72,10 @@ struct commit_graph {
> >  	const unsigned char *chunk_base_graphs;
> >  	const unsigned char *chunk_bloom_indexes;
> >  	const unsigned char *chunk_bloom_data;
> > +	const unsigned char *chunk_bloom_large_filters;
> > +
> > +	size_t bloom_large_to_alloc;
> > +	struct bitmap bloom_large;
>
> Hmph, is the API rich enough to allow users to release the resource
> used by such an embedded bitmap?  I ask because...
>
> > @@ -2503,6 +2577,7 @@ void free_commit_graph(struct commit_graph *g)
> >  	}
> >  	free(g->filename);
> >  	free(g->bloom_filter_settings);
> > +	bitmap_free(g->bloom_large);
> >  	free(g);
> >  }
>
> ... this hunk cannot be possibly correct as-is, and cannot be made
> correct without changing g->bloom_large to a pointer into a heap
> allocated bitmap, because bitmap_free() wants to not just release
> the resource held by the bitmap but the bitmap itself.

Yuck, that's definitely wrong. Serves me right for sneaking this in
after I had run `git rebase -x 'make -j40 DEVELOPER=1 test'
upstream/master` ;-).

Below the scissors line should do the trick. It should apply cleanly at
this point in the series, but it'll produce a compilation failure on the
very last patch (fixing it is straightforward and looks like the
following diff):

diff --git a/bloom.c b/bloom.c
index d0c0fd049d..8d07209c6b 100644
--- a/bloom.c
+++ b/bloom.c
@@ -52,7 +52,7 @@ static int load_bloom_filter_from_graph(struct commit_graph *g,
                start_index = 0;

        if ((start_index == end_index) &&
-           (g->bloom_large.word_alloc && !bitmap_get(&g->bloom_large, lex_pos))) {
+           (g->bloom_large && !bitmap_get(g->bloom_large, lex_pos))) {
                /*
                 * If the filter is zero-length, either (1) the filter has no
                 * changes, (2) the filter has too many changes, or (3) it

In either case, this will fix the bad free():

--- >8 ---

Subject: [PATCH] fixup! commit-graph: add large-filters bitmap chunk

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 commit-graph.c | 18 ++++++++++--------
 commit-graph.h |  2 +-
 2 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 1fee49d171..add76f1824 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -438,7 +438,10 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 				graph->bloom_large_to_alloc = get_be64(chunk_lookup + 4)
 							      - chunk_offset - sizeof(uint32_t);

-				graph->bloom_large.word_alloc = 0; /* populate when necessary */
+				/*
+				 * leave 'bloom_large' uninitialized, and
+				 * populate when necessary
+				 */
 				graph->chunk_bloom_large_filters = data + chunk_offset + sizeof(uint32_t);
 				graph->bloom_filter_settings->max_changed_paths = get_be32(data + chunk_offset);
 			}
@@ -960,17 +963,15 @@ static int get_bloom_filter_large_in_graph(struct commit_graph *g,
 	if (!g || !g->bloom_large_to_alloc)
 		return 0;

-	if (!g->bloom_large.word_alloc) {
+	if (!g->bloom_large) {
 		size_t i;
-		g->bloom_large.word_alloc = g->bloom_large_to_alloc;
-		g->bloom_large.words = xmalloc(g->bloom_large_to_alloc * sizeof(eword_t));
-
+		g->bloom_large = bitmap_word_alloc(g->bloom_large_to_alloc);
 		for (i = 0; i < g->bloom_large_to_alloc; i++)
-			g->bloom_large.words[i] = get_be64(g->chunk_bloom_large_filters
-							   + i * sizeof(eword_t));
+			g->bloom_large->words[i] = get_be64(g->chunk_bloom_large_filters
+							    + i * sizeof(eword_t));
 	}

-	return bitmap_get(&g->bloom_large, graph_pos - g->num_commits_in_base);
+	return bitmap_get(g->bloom_large, graph_pos - g->num_commits_in_base);
 }

 struct packed_oid_list {
@@ -2360,6 +2361,7 @@ int write_commit_graph(struct object_directory *odb,
 	free(ctx->graph_name);
 	free(ctx->commits.list);
 	free(ctx->oids.list);
+	free(ctx->bloom_large);

 	if (ctx->commit_graph_filenames_after) {
 		for (i = 0; i < ctx->num_commit_graphs_after; i++) {
diff --git a/commit-graph.h b/commit-graph.h
index f4fb996dd5..b1ab86a3c8 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -75,7 +75,7 @@ struct commit_graph {
 	const unsigned char *chunk_bloom_large_filters;

 	size_t bloom_large_to_alloc;
-	struct bitmap bloom_large;
+	struct bitmap *bloom_large;

 	struct bloom_filter_settings *bloom_filter_settings;
 };
--
2.28.0.rc1.13.ge78abce653


^ permalink raw reply	[flat|nested] 117+ messages in thread
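
The lazy population step in the fixup above copies big-endian 64-bit
words out of the mapped chunk into an in-memory word array, then tests
individual bits. A rough, standalone sketch of that idea follows. The
helper names are hypothetical, 'read_be64' stands in for git's get_be64(),
and the bit order (bit 0 is the least significant bit of word 0) is an
assumption of this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one big-endian 64-bit word starting at 'p'. */
static uint64_t read_be64(const unsigned char *p)
{
	uint64_t v = 0;
	int i;

	for (i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Fill 'words' from 'nr_words' consecutive big-endian words in 'chunk'. */
static void load_words(uint64_t *words, const unsigned char *chunk,
		       size_t nr_words)
{
	size_t i;

	for (i = 0; i < nr_words; i++)
		words[i] = read_be64(chunk + i * sizeof(uint64_t));
}

/*
 * Test bit 'pos'. Positions past the end of the array read as unset
 * rather than touching memory out of bounds.
 */
static int word_bit(const uint64_t *words, size_t nr_words, size_t pos)
{
	size_t block = pos / 64;

	if (block >= nr_words)
		return 0;
	return (int)((words[block] >> (pos % 64)) & 1);
}
```

Note that the caller passes a count of words, not bytes; converting a
chunk's byte length into a word count is left to the caller in this sketch.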

* Re: [PATCH v2 12/14] commit-graph: add large-filters bitmap chunk
  2020-08-05 21:17       ` Taylor Blau
@ 2020-08-05 22:21         ` Junio C Hamano
  2020-08-05 22:25           ` Taylor Blau
  0 siblings, 1 reply; 117+ messages in thread
From: Junio C Hamano @ 2020-08-05 22:21 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, dstolee, szeder.dev

Taylor Blau <me@ttaylorr.com> writes:

> On Wed, Aug 05, 2020 at 02:01:29PM -0700, Junio C Hamano wrote:
>> Taylor Blau <me@ttaylorr.com> writes:
>>
>> > @@ -71,6 +72,10 @@ struct commit_graph {
>> >  	const unsigned char *chunk_base_graphs;
>> >  	const unsigned char *chunk_bloom_indexes;
>> >  	const unsigned char *chunk_bloom_data;
>> > +	const unsigned char *chunk_bloom_large_filters;
>> > +
>> > +	size_t bloom_large_to_alloc;
>> > +	struct bitmap bloom_large;
>>
>> Hmph, is the API rich enough to allow users to release the resource
>> used by such an embedded bitmap?  I ask because...
>>
>> > @@ -2503,6 +2577,7 @@ void free_commit_graph(struct commit_graph *g)
>> >  	}
>> >  	free(g->filename);
>> >  	free(g->bloom_filter_settings);
>> > +	bitmap_free(g->bloom_large);
>> >  	free(g);
>> >  }
>>
>> ... this hunk cannot be possibly correct as-is, and cannot be made
>> correct without changing g->bloom_large to a pointer into a heap
>> allocated bitmap, because bitmap_free() wants to not just release
>> the resource held by the bitmap but the bitmap itself.
>
> Yuck, that's definitely wrong. Serves me right for sneaking this in
> after I had run `git rebase -x 'make -j40 DEVELOPER=1 test'
> upstream/master` ;-).
>
> Below the scissors line should do the trick. It should apply cleanly at
> this point in the series, but it'll produce a compilation failure on the
> very last patch (fixing it is straightforward and looks like the
> following diff):
>
> diff --git a/bloom.c b/bloom.c
> index d0c0fd049d..8d07209c6b 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -52,7 +52,7 @@ static int load_bloom_filter_from_graph(struct commit_graph *g,
>                 start_index = 0;
>
>         if ((start_index == end_index) &&
> -           (g->bloom_large.word_alloc && !bitmap_get(&g->bloom_large, lex_pos))) {
> +           (g->bloom_large && !bitmap_get(g->bloom_large, lex_pos))) {
>                 /*
>                  * If the filter is zero-length, either (1) the filter has no
>                  * changes, (2) the filter has too many changes, or (3) it
>
> In either case, this will fix the bad free():
>
> --- >8 ---
>
> Subject: [PATCH] fixup! commit-graph: add large-filters bitmap chunk
>
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  commit-graph.c | 18 ++++++++++--------
>  commit-graph.h |  2 +-
>  2 files changed, 11 insertions(+), 9 deletions(-)
> ...
> @@ -2360,6 +2361,7 @@ int write_commit_graph(struct object_directory *odb,
>  	free(ctx->graph_name);
>  	free(ctx->commits.list);
>  	free(ctx->oids.list);
> +	free(ctx->bloom_large);

Is this correct, or shouldn't it be bitmap_free() instead?

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH v2 12/14] commit-graph: add large-filters bitmap chunk
  2020-08-05 22:21         ` Junio C Hamano
@ 2020-08-05 22:25           ` Taylor Blau
  2020-08-11 13:48             ` Taylor Blau
  0 siblings, 1 reply; 117+ messages in thread
From: Taylor Blau @ 2020-08-05 22:25 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, git, peff, dstolee, szeder.dev

On Wed, Aug 05, 2020 at 03:21:39PM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > On Wed, Aug 05, 2020 at 02:01:29PM -0700, Junio C Hamano wrote:
> >> Taylor Blau <me@ttaylorr.com> writes:
> >>
> >> > @@ -71,6 +72,10 @@ struct commit_graph {
> >> >  	const unsigned char *chunk_base_graphs;
> >> >  	const unsigned char *chunk_bloom_indexes;
> >> >  	const unsigned char *chunk_bloom_data;
> >> > +	const unsigned char *chunk_bloom_large_filters;
> >> > +
> >> > +	size_t bloom_large_to_alloc;
> >> > +	struct bitmap bloom_large;
> >>
> >> Hmph, is the API rich enough to allow users to release the resource
> >> used by such an embedded bitmap?  I ask because...
> >>
> >> > @@ -2503,6 +2577,7 @@ void free_commit_graph(struct commit_graph *g)
> >> >  	}
> >> >  	free(g->filename);
> >> >  	free(g->bloom_filter_settings);
> >> > +	bitmap_free(g->bloom_large);
> >> >  	free(g);
> >> >  }
> >>
> >> ... this hunk cannot be possibly correct as-is, and cannot be made
> >> correct without changing g->bloom_large to a pointer into a heap
> >> allocated bitmap, because bitmap_free() wants to not just release
> >> the resource held by the bitmap but the bitmap itself.
> >
> > Yuck, that's definitely wrong. Serves me right for sneaking this in
> > after I had run `git rebase -x 'make -j40 DEVELOPER=1 test'
> > upstream/master` ;-).
> >
> > Below the scissors line should do the trick. It should apply cleanly at
> > this point in the series, but it'll produce a compilation failure on the
> > very last patch (fixing it is straightforward and looks like the
> > following diff):
> >
> > diff --git a/bloom.c b/bloom.c
> > index d0c0fd049d..8d07209c6b 100644
> > --- a/bloom.c
> > +++ b/bloom.c
> > @@ -52,7 +52,7 @@ static int load_bloom_filter_from_graph(struct commit_graph *g,
> >                 start_index = 0;
> >
> >         if ((start_index == end_index) &&
> > -           (g->bloom_large.word_alloc && !bitmap_get(&g->bloom_large, lex_pos))) {
> > +           (g->bloom_large && !bitmap_get(g->bloom_large, lex_pos))) {
> >                 /*
> >                  * If the filter is zero-length, either (1) the filter has no
> >                  * changes, (2) the filter has too many changes, or (3) it
> >
> > In either case, this will fix the bad free():
> >
> > --- >8 ---
> >
> > Subject: [PATCH] fixup! commit-graph: add large-filters bitmap chunk
> >
> > Signed-off-by: Taylor Blau <me@ttaylorr.com>
> > ---
> >  commit-graph.c | 18 ++++++++++--------
> >  commit-graph.h |  2 +-
> >  2 files changed, 11 insertions(+), 9 deletions(-)
> > ...
> > @@ -2360,6 +2361,7 @@ int write_commit_graph(struct object_directory *odb,
> >  	free(ctx->graph_name);
> >  	free(ctx->commits.list);
> >  	free(ctx->oids.list);
> > +	free(ctx->bloom_large);
>
> Is this correct, or shouldn't it be bitmap_free() instead?

Ack, that should be 'bitmap_free()'. Double checking, 'bitmap_free' does
handle a 'NULL' argument like 'free', so dealing with an old
commit-graph lacking this chunk will work fine.

Thanks for catching my mistake. I'm off tomorrow, Friday, and Monday,
so my responses from now on may be intermittent, but I should have some
time.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH v2 12/14] commit-graph: add large-filters bitmap chunk
  2020-08-05 22:25           ` Taylor Blau
@ 2020-08-11 13:48             ` Taylor Blau
  2020-08-11 18:59               ` Junio C Hamano
  0 siblings, 1 reply; 117+ messages in thread
From: Taylor Blau @ 2020-08-11 13:48 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Junio C Hamano, git, peff, dstolee, szeder.dev

On Wed, Aug 05, 2020 at 06:25:31PM -0400, Taylor Blau wrote:
> On Wed, Aug 05, 2020 at 03:21:39PM -0700, Junio C Hamano wrote:
> > Taylor Blau <me@ttaylorr.com> writes:
> >
> > > On Wed, Aug 05, 2020 at 02:01:29PM -0700, Junio C Hamano wrote:
> > >> Taylor Blau <me@ttaylorr.com> writes:
> > >>
> > >> > @@ -71,6 +72,10 @@ struct commit_graph {
> > >> >  	const unsigned char *chunk_base_graphs;
> > >> >  	const unsigned char *chunk_bloom_indexes;
> > >> >  	const unsigned char *chunk_bloom_data;
> > >> > +	const unsigned char *chunk_bloom_large_filters;
> > >> > +
> > >> > +	size_t bloom_large_to_alloc;
> > >> > +	struct bitmap bloom_large;
> > >>
> > >> Hmph, is the API rich enough to allow users to release the resource
> > >> used by such an embedded bitmap?  I ask because...
> > >>
> > >> > @@ -2503,6 +2577,7 @@ void free_commit_graph(struct commit_graph *g)
> > >> >  	}
> > >> >  	free(g->filename);
> > >> >  	free(g->bloom_filter_settings);
> > >> > +	bitmap_free(g->bloom_large);
> > >> >  	free(g);
> > >> >  }
> > >>
> > >> ... this hunk cannot be possibly correct as-is, and cannot be made
> > >> correct without changing g->bloom_large to a pointer into a heap
> > >> allocated bitmap, because bitmap_free() wants to not just release
> > >> the resource held by the bitmap but the bitmap itself.
> > >
> > > Yuck, that's definitely wrong. Serves me right for sneaking this in
> > > after I had run `git rebase -x 'make -j40 DEVELOPER=1 test'
> > > upstream/master` ;-).
> > >
> > > Below the scissors line should do the trick. It should apply cleanly at
> > > this point in the series, but it'll produce a compilation failure on the
> > > very last patch (fixing it is straightforward and looks like the
> > > following diff):
> > >
> > > diff --git a/bloom.c b/bloom.c
> > > index d0c0fd049d..8d07209c6b 100644
> > > --- a/bloom.c
> > > +++ b/bloom.c
> > > @@ -52,7 +52,7 @@ static int load_bloom_filter_from_graph(struct commit_graph *g,
> > >                 start_index = 0;
> > >
> > >         if ((start_index == end_index) &&
> > > -           (g->bloom_large.word_alloc && !bitmap_get(&g->bloom_large, lex_pos))) {
> > > +           (g->bloom_large && !bitmap_get(g->bloom_large, lex_pos))) {
> > >                 /*
> > >                  * If the filter is zero-length, either (1) the filter has no
> > >                  * changes, (2) the filter has too many changes, or (3) it
> > >
> > > In either case, this will fix the bad free():
> > >
> > > --- >8 ---
> > >
> > > Subject: [PATCH] fixup! commit-graph: add large-filters bitmap chunk
> > >
> > > Signed-off-by: Taylor Blau <me@ttaylorr.com>
> > > ---
> > >  commit-graph.c | 18 ++++++++++--------
> > >  commit-graph.h |  2 +-
> > >  2 files changed, 11 insertions(+), 9 deletions(-)
> > > ...
> > > @@ -2360,6 +2361,7 @@ int write_commit_graph(struct object_directory *odb,
> > >  	free(ctx->graph_name);
> > >  	free(ctx->commits.list);
> > >  	free(ctx->oids.list);
> > > +	free(ctx->bloom_large);
> >
> > Is this correct, or shouldn't it be bitmap_free() instead?
>
> Ack, that should be 'bitmap_free()'. Double checking, 'bitmap_free' does
> handle a 'NULL' argument like 'free', so dealing with an old
> commit-graph lacking this chunk will work fine.
>
> Thanks for catching my mistake. I'm off tomorrow, Friday, and Monday,
> so my responses from now on may be intermittent, but I should have some
> time.

I'm back :). Let's squash the following into this patch (bearing in mind
that you'll have to drop a '&' in the final patch of this series as a
result of this change):

--- >8 ---

Subject: [PATCH] fixup! commit-graph: add large-filters bitmap chunk

Makes the commit-graph's 'bloom_large' bitmap a pointer so that it can
be managed with the standard bitmap APIs.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 commit-graph.c | 18 ++++++++++--------
 commit-graph.h |  2 +-
 2 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 1fee49d171..384089be87 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -438,7 +438,10 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 				graph->bloom_large_to_alloc = get_be64(chunk_lookup + 4)
 							      - chunk_offset - sizeof(uint32_t);

-				graph->bloom_large.word_alloc = 0; /* populate when necessary */
+				/*
+				 * leave 'bloom_large' uninitialized, and
+				 * populate when necessary
+				 */
 				graph->chunk_bloom_large_filters = data + chunk_offset + sizeof(uint32_t);
 				graph->bloom_filter_settings->max_changed_paths = get_be32(data + chunk_offset);
 			}
@@ -960,17 +963,15 @@ static int get_bloom_filter_large_in_graph(struct commit_graph *g,
 	if (!g || !g->bloom_large_to_alloc)
 		return 0;

-	if (!g->bloom_large.word_alloc) {
+	if (!g->bloom_large) {
 		size_t i;
-		g->bloom_large.word_alloc = g->bloom_large_to_alloc;
-		g->bloom_large.words = xmalloc(g->bloom_large_to_alloc * sizeof(eword_t));
-
+		g->bloom_large = bitmap_word_alloc(g->bloom_large_to_alloc);
 		for (i = 0; i < g->bloom_large_to_alloc; i++)
-			g->bloom_large.words[i] = get_be64(g->chunk_bloom_large_filters
-							   + i * sizeof(eword_t));
+			g->bloom_large->words[i] = get_be64(g->chunk_bloom_large_filters
+							    + i * sizeof(eword_t));
 	}

-	return bitmap_get(&g->bloom_large, graph_pos - g->num_commits_in_base);
+	return bitmap_get(g->bloom_large, graph_pos - g->num_commits_in_base);
 }

 struct packed_oid_list {
@@ -2360,6 +2361,7 @@ int write_commit_graph(struct object_directory *odb,
 	free(ctx->graph_name);
 	free(ctx->commits.list);
 	free(ctx->oids.list);
+	bitmap_free(ctx->bloom_large);

 	if (ctx->commit_graph_filenames_after) {
 		for (i = 0; i < ctx->num_commit_graphs_after; i++) {
diff --git a/commit-graph.h b/commit-graph.h
index f4fb996dd5..b1ab86a3c8 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -75,7 +75,7 @@ struct commit_graph {
 	const unsigned char *chunk_bloom_large_filters;

 	size_t bloom_large_to_alloc;
-	struct bitmap bloom_large;
+	struct bitmap *bloom_large;

 	struct bloom_filter_settings *bloom_filter_settings;
 };
--
2.28.0.rc1.13.ge78abce653


^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH v2 12/14] commit-graph: add large-filters bitmap chunk
  2020-08-11 13:48             ` Taylor Blau
@ 2020-08-11 18:59               ` Junio C Hamano
  0 siblings, 0 replies; 117+ messages in thread
From: Junio C Hamano @ 2020-08-11 18:59 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, dstolee, szeder.dev

Taylor Blau <me@ttaylorr.com> writes:

> I'm back :). Let's squash the following into this patch (bearing in mind
> that you'll have to drop a '&' in the final patch of this series as a
> result of this change):

Let's have replacement patches for those steps (or a full series is
also fine).  Don't make me squash and clean-up.

That would give us (well, mostly the contributor) a chance for the
final proofreading before sending.

Thanks.

^ permalink raw reply	[flat|nested] 117+ messages in thread

* [PATCH v3 00/14] more miscellaneous Bloom filter improvements
  2020-08-03 18:57 [PATCH 00/10] more miscellaneous Bloom filter improvements Taylor Blau
                   ` (10 preceding siblings ...)
  2020-08-05 17:01 ` [PATCH v2 00/14] more miscellaneous Bloom filter improvements Taylor Blau
@ 2020-08-11 20:51 ` Taylor Blau
  2020-08-11 20:51   ` [PATCH v3 01/14] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
                     ` (14 more replies)
  2020-09-03 22:45 ` [PATCH v4 " Taylor Blau
  12 siblings, 15 replies; 117+ messages in thread
From: Taylor Blau @ 2020-08-11 20:51 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, szeder.dev, gitster

Hi,

Here's a(nother) re-roll of Stolee's and my series to introduce the
new BFXL commit-graph chunk, along with the '--max-new-filters' option
to 'git commit-graph write'.

Not much has changed since v2, other than rebasing onto the latest
master (the fifth 2.29 batch, at the time of writing) and squashing in
a few fixups that I sent in response to my v2 series.

Hopefully this should be ready for queueing. (Stolee has looked at this
a lot off-list, but it would be great to get an ack from him on list,
too).

Derrick Stolee (1):
  bloom/diff: properly short-circuit on max_changes

Taylor Blau (13):
  commit-graph: introduce 'get_bloom_filter_settings()'
  t4216: use an '&&'-chain
  commit-graph: pass a 'struct repository *' in more places
  t/helper/test-read-graph.c: prepare repo settings
  commit-graph: respect 'commitGraph.readChangedPaths'
  commit-graph.c: store maximum changed paths
  bloom: split 'get_bloom_filter()' in two
  bloom: use provided 'struct bloom_filter_settings'
  commit-graph.c: sort index into commits list
  csum-file.h: introduce 'hashwrite_be64()'
  commit-graph: add large-filters bitmap chunk
  commit-graph: rename 'split_commit_graph_opts'
  builtin/commit-graph.c: introduce '--max-new-filters=<n>'

 Documentation/config.txt                      |   2 +
 Documentation/config/commitgraph.txt          |   8 +
 Documentation/git-commit-graph.txt            |   4 +
 .../technical/commit-graph-format.txt         |  12 +
 blame.c                                       |   8 +-
 bloom.c                                       |  51 +++-
 bloom.h                                       |  22 +-
 builtin/commit-graph.c                        |  61 +++-
 commit-graph.c                                | 274 +++++++++++++-----
 commit-graph.h                                |  19 +-
 csum-file.h                                   |   6 +
 diff.h                                        |   2 -
 fuzz-commit-graph.c                           |   5 +-
 line-log.c                                    |   2 +-
 midx.c                                        |   3 +-
 repo-settings.c                               |   3 +
 repository.h                                  |   1 +
 revision.c                                    |   7 +-
 t/helper/test-bloom.c                         |   4 +-
 t/helper/test-read-graph.c                    |   3 +-
 t/t4216-log-bloom.sh                          | 148 ++++++++--
 t/t5324-split-commit-graph.sh                 |  13 +
 tree-diff.c                                   |   5 +-
 23 files changed, 522 insertions(+), 141 deletions(-)
 create mode 100644 Documentation/config/commitgraph.txt

Range-diff against v2:
 [ ... rebase onto 'master' ... ]
 1:  001f3385ff = 34:  e714e54240 commit-graph: introduce 'get_bloom_filter_settings()'
 2:  e4d068a478 = 35:  9fc8b17d6f t4216: use an '&&'-chain
 3:  afdc614c0d = 36:  8dbe4838b7 commit-graph: pass a 'struct repository *' in more places
 4:  038e996ced = 37:  f59db1e30d t/helper/test-read-graph.c: prepare repo settings
 5:  404f10319a = 38:  daae6788c0 commit-graph: respect 'commitGraph.readChangedPaths'
 6:  053991f048 = 39:  bf498844ef commit-graph.c: store maximum changed paths
 7:  23525947c8 ! 40:  eba2794873 bloom: split 'get_bloom_filter()' in two
    @@ bloom.h: void add_key_to_filter(const struct bloom_key *key,
     +						 int compute_if_not_present,
     +						 int *computed);
     +
    -+#define DEFAULT_BLOOM_MAX_CHANGES 512
     +#define get_bloom_filter(r, c) get_or_compute_bloom_filter( \
     +	(r), (c), 0, NULL)

 8:  4deb724fc1 ! 41:  4f08177dbe bloom: use provided 'struct bloom_filter_settings'
    @@ bloom.h: void init_bloom_filters(void);
     +						 const struct bloom_filter_settings *settings,
      						 int *computed);

    - #define DEFAULT_BLOOM_MAX_CHANGES 512
      #define get_bloom_filter(r, c) get_or_compute_bloom_filter( \
     -	(r), (c), 0, NULL)
     +	(r), (c), 0, NULL, NULL)
 9:  d1c4bbcaa9 = 42:  cc1dc8b121 bloom/diff: properly short-circuit on max_changes
10:  e92ccafcf7 = 43:  23fd52c3b8 commit-graph.c: sort index into commits list
11:  c42d678714 = 44:  4800cd373e csum-file.h: introduce 'hashwrite_be64()'
12:  100b26d7c8 ! 45:  619e0c619d commit-graph: add large-filters bitmap chunk
    @@ Commit message
         To allow using the existing bitmap code with 64-bit words, we write the
         data in network byte order from the 64-bit words. This means we also
         need to read the array from the commit-graph file by translating each
    -    word from network byte order using get_be64() upon first use of the
    -    bitmap. This is only used when writing the commit-graph, so this is a
    -    relatively small operation compared to the other writing code.
    +    word from network byte order using get_be64() when loading the commit
    +    graph. (Note that this *could* be delayed until first-use, but a later
    +    patch will rely on this being initialized early, so we assume the
    +    up-front cost when parsing instead of delaying initialization).

         By avoiding the need to move to new versions of the BDAT and BIDX chunk,
         we can give ourselves more time to consider whether or not other
    @@ commit-graph.c: struct commit_graph *parse_commit_graph(struct repository *r,
     +			if (graph->chunk_bloom_large_filters)
     +				chunk_repeated = 1;
     +			else if (r->settings.commit_graph_read_changed_paths) {
    -+				graph->bloom_large_to_alloc = get_be64(chunk_lookup + 4)
    -+							      - chunk_offset - sizeof(uint32_t);
    -+
    -+				graph->bloom_large.word_alloc = 0; /* populate when necessary */
    ++				size_t alloc = get_be64(chunk_lookup + 4) - chunk_offset - sizeof(uint32_t);
     +				graph->chunk_bloom_large_filters = data + chunk_offset + sizeof(uint32_t);
     +				graph->bloom_filter_settings->max_changed_paths = get_be32(data + chunk_offset);
    ++				if (alloc) {
    ++					size_t j;
    ++					graph->bloom_large = bitmap_word_alloc(alloc);
    ++
    ++					for (j = 0; j < graph->bloom_large->word_alloc; j++)
    ++						graph->bloom_large->words[j] = get_be64(
    ++							graph->chunk_bloom_large_filters + j * sizeof(eword_t));
    ++				}
     +			}
     +			break;
      		}
    @@ commit-graph.c: struct commit_graph *parse_commit_graph(struct repository *r,
      		graph->chunk_bloom_data = NULL;
     +		graph->chunk_bloom_large_filters = NULL;
      		FREE_AND_NULL(graph->bloom_filter_settings);
    ++		bitmap_free(graph->bloom_large);
      	}

    + 	hashcpy(graph->oid.hash, graph->data + graph->data_len - graph->hash_len);
     @@ commit-graph.c: struct tree *get_commit_tree_in_graph(struct repository *r, const struct commit
      	return get_commit_tree_in_graph_one(r, r->objects->commit_graph, c);
      }
    @@ commit-graph.c: struct tree *get_commit_tree_in_graph(struct repository *r, cons
     +	while (g && graph_pos < g->num_commits_in_base)
     +		g = g->base_graph;
     +
    -+	if (!g || !g->bloom_large_to_alloc)
    ++	if (!(g && g->bloom_large))
     +		return 0;
    -+
    -+	if (!g->bloom_large.word_alloc) {
    -+		size_t i;
    -+		g->bloom_large.word_alloc = g->bloom_large_to_alloc;
    -+		g->bloom_large.words = xmalloc(g->bloom_large_to_alloc * sizeof(eword_t));
    -+
    -+		for (i = 0; i < g->bloom_large_to_alloc; i++)
    -+			g->bloom_large.words[i] = get_be64(g->chunk_bloom_large_filters
    -+							   + i * sizeof(eword_t));
    -+	}
    -+
    -+	return bitmap_get(&g->bloom_large, graph_pos - g->num_commits_in_base);
    ++	return bitmap_get(g->bloom_large, graph_pos - g->num_commits_in_base);
     +}

      struct packed_oid_list {
    @@ commit-graph.h: struct commit_graph {
      	const unsigned char *chunk_bloom_data;
     +	const unsigned char *chunk_bloom_large_filters;
     +
    -+	size_t bloom_large_to_alloc;
    -+	struct bitmap bloom_large;
    ++	struct bitmap *bloom_large;

      	struct bloom_filter_settings *bloom_filter_settings;
      };
13:  2ee0b84351 = 46:  b2e33ecba8 commit-graph: rename 'split_commit_graph_opts'
14:  3b66ae4a9c ! 47:  09f6871f66 builtin/commit-graph.c: introduce '--max-new-filters=<n>'
    @@ bloom.c: static int load_bloom_filter_from_graph(struct commit_graph *g,
      		start_index = 0;

     +	if ((start_index == end_index) &&
    -+	    (g->bloom_large.word_alloc && !bitmap_get(&g->bloom_large, lex_pos))) {
    ++	    (g->bloom_large && !bitmap_get(g->bloom_large, lex_pos))) {
     +		/*
     +		 * If the filter is zero-length, either (1) the filter has no
     +		 * changes, (2) the filter has too many changes, or (3) it
    @@ commit-graph.c: struct tree *get_commit_tree_in_graph(struct repository *r, cons
      {
      	uint32_t graph_pos = commit_graph_position(c);
      	if (graph_pos == COMMIT_NOT_FROM_GRAPH)
    +@@ commit-graph.c: static int get_bloom_filter_large_in_graph(struct commit_graph *g,
    +
    + 	if (!(g && g->bloom_large))
    + 		return 0;
    ++	if (g->bloom_filter_settings->max_changed_paths != max_changed_paths) {
    ++		/*
    ++		 * Force all commits which are subject to a different
    ++		 * 'max_changed_paths' limit to be recomputed from scratch.
    ++		 *
    ++		 * Note that this could likely be improved, but is ignored since
    ++		 * all real-world graphs set the maximum number of changed paths
    ++		 * at 512.
    ++		 */
    ++		return 0;
    ++	}
    + 	return bitmap_get(g->bloom_large, graph_pos - g->num_commits_in_base);
    + }
    +
     @@ commit-graph.c: static void compute_bloom_filters(struct write_commit_graph_context *ctx)
      	int i;
      	struct progress *progress = NULL;
--
2.28.0.rc1.13.ge78abce653
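For readers following the range-diff of patch 12/14: the commit message says the large-filters bitmap is stored as 64-bit words in network byte order and is now translated eagerly when the commit-graph is loaded. Below is a minimal standalone sketch of that idea. It is not the code from the series: 'read_be64()', 'struct word_bitmap', 'load_bitmap()' and 'bitmap_test()' are hypothetical stand-ins for Git's 'get_be64()' and its bitmap helpers, and the bit numbering (bit 0 is the least significant bit of word 0) is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for get_be64(): read 8 bytes, most significant first. */
uint64_t read_be64(const unsigned char *p)
{
	uint64_t v = 0;
	int i;

	for (i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Hypothetical, simplified bitmap: an array of host-order 64-bit words. */
struct word_bitmap {
	uint64_t *words;
	size_t word_alloc;
};

/* Eagerly translate 'nr_words' on-disk words into host byte order. */
struct word_bitmap *load_bitmap(const unsigned char *chunk, size_t nr_words)
{
	struct word_bitmap *b;
	size_t j;

	assert(chunk || !nr_words);

	b = malloc(sizeof(*b));
	if (!b)
		return NULL;
	/* Allocate at least one word so that 'words' is never NULL. */
	b->words = calloc(nr_words ? nr_words : 1, sizeof(uint64_t));
	if (!b->words) {
		free(b);
		return NULL;
	}
	b->word_alloc = nr_words;

	for (j = 0; j < nr_words; j++)
		b->words[j] = read_be64(chunk + j * sizeof(uint64_t));
	return b;
}

/* Return 1 if bit 'pos' is set; a missing bitmap or out-of-range bit reads as 0. */
int bitmap_test(const struct word_bitmap *b, size_t pos)
{
	size_t block = pos / 64;

	if (!b || block >= b->word_alloc)
		return 0;
	return (int)((b->words[block] >> (pos % 64)) & 1);
}

void free_bitmap(struct word_bitmap *b)
{
	if (b) {
		free(b->words);
		free(b);
	}
}
```

Treating a NULL bitmap as "no bits set" mirrors the '!(g && g->bloom_large)' check in the range-diff, so callers do not need a separate "was the chunk present" flag.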

^ permalink raw reply	[flat|nested] 117+ messages in thread

* [PATCH v3 01/14] commit-graph: introduce 'get_bloom_filter_settings()'
  2020-08-11 20:51 ` [PATCH v3 00/14] more miscellaneous Bloom filter improvements Taylor Blau
@ 2020-08-11 20:51   ` Taylor Blau
  2020-08-11 21:18     ` SZEDER Gábor
  2020-08-11 20:51   ` [PATCH v3 02/14] t4216: use an '&&'-chain Taylor Blau
                     ` (13 subsequent siblings)
  14 siblings, 1 reply; 117+ messages in thread
From: Taylor Blau @ 2020-08-11 20:51 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, szeder.dev, gitster

Many places in the code need a pointer to the commit-graph's
'struct bloom_filter_settings', and they typically take the value from
the top-most commit-graph.

In the non-split case, this works as expected. In the split case,
however, things get a little tricky. Not all layers in a chain of
incremental commit-graphs are required to themselves have Bloom data,
and so whether or not some part of the code uses Bloom filters depends
entirely on whether or not the top-most level of the commit-graph chain
has Bloom filters.

This has been the behavior since Bloom filters were introduced, and has
been codified into the tests since a759bfa9ee (t4216: add end to end
tests for git log with Bloom filters, 2020-04-06). In fact, t4216.130
requires that Bloom filters are not used in exactly the case described
earlier.

There is no reason that this needs to be the case, since it is perfectly
valid for commits in an earlier layer to have Bloom filters when commits
in a newer layer do not.

Since Bloom settings are guaranteed to be the same for any layer in a
chain that has Bloom data, it is sufficient to traverse the
'->base_graph' pointer until either (1) a non-null 'struct
bloom_filter_settings *' is found, or (2) until we are at the root of
the commit-graph chain.

Introduce a 'get_bloom_filter_settings()' function that does just this,
and use it instead of purely dereferencing the top-most graph's
'->bloom_filter_settings' pointer.

While we're at it, add an additional test in t5324 to guard against code
in the commit-graph writing machinery that doesn't correctly handle a
NULL 'struct bloom_filter *'.

Co-authored-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 blame.c                       |  6 ++++--
 bloom.c                       |  6 +++---
 commit-graph.c                | 11 +++++++++++
 commit-graph.h                |  2 ++
 revision.c                    |  5 +----
 t/t4216-log-bloom.sh          |  9 ++++++---
 t/t5324-split-commit-graph.sh | 13 +++++++++++++
 7 files changed, 40 insertions(+), 12 deletions(-)

diff --git a/blame.c b/blame.c
index 82fa16d658..3e5f8787bc 100644
--- a/blame.c
+++ b/blame.c
@@ -2891,16 +2891,18 @@ void setup_blame_bloom_data(struct blame_scoreboard *sb,
 			    const char *path)
 {
 	struct blame_bloom_data *bd;
+	struct bloom_filter_settings *bs;
 
 	if (!sb->repo->objects->commit_graph)
 		return;
 
-	if (!sb->repo->objects->commit_graph->bloom_filter_settings)
+	bs = get_bloom_filter_settings(sb->repo);
+	if (!bs)
 		return;
 
 	bd = xmalloc(sizeof(struct blame_bloom_data));
 
-	bd->settings = sb->repo->objects->commit_graph->bloom_filter_settings;
+	bd->settings = bs;
 
 	bd->alloc = 4;
 	bd->nr = 0;
diff --git a/bloom.c b/bloom.c
index 1a573226e7..cd9380ac62 100644
--- a/bloom.c
+++ b/bloom.c
@@ -38,7 +38,7 @@ static int load_bloom_filter_from_graph(struct commit_graph *g,
 	while (graph_pos < g->num_commits_in_base)
 		g = g->base_graph;
 
-	/* The commit graph commit 'c' lives in doesn't carry bloom filters. */
+	/* The commit graph commit 'c' lives in doesn't carry Bloom filters. */
 	if (!g->chunk_bloom_indexes)
 		return 0;
 
@@ -195,8 +195,8 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 	if (!filter->data) {
 		load_commit_graph_info(r, c);
 		if (commit_graph_position(c) != COMMIT_NOT_FROM_GRAPH &&
-			r->objects->commit_graph->chunk_bloom_indexes)
-			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c);
+			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
+				return filter;
 	}
 
 	if (filter->data)
diff --git a/commit-graph.c b/commit-graph.c
index e51c91dd5b..d4b06811be 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -660,6 +660,17 @@ int generation_numbers_enabled(struct repository *r)
 	return !!first_generation;
 }
 
+struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r)
+{
+	struct commit_graph *g = r->objects->commit_graph;
+	while (g) {
+		if (g->bloom_filter_settings)
+			return g->bloom_filter_settings;
+		g = g->base_graph;
+	}
+	return NULL;
+}
+
 static void close_commit_graph_one(struct commit_graph *g)
 {
 	if (!g)
diff --git a/commit-graph.h b/commit-graph.h
index 09a97030dc..0677dd1031 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -87,6 +87,8 @@ struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size);
  */
 int generation_numbers_enabled(struct repository *r);
 
+struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r);
+
 enum commit_graph_write_flags {
 	COMMIT_GRAPH_WRITE_APPEND     = (1 << 0),
 	COMMIT_GRAPH_WRITE_PROGRESS   = (1 << 1),
diff --git a/revision.c b/revision.c
index 3dcf689341..be600186ee 100644
--- a/revision.c
+++ b/revision.c
@@ -680,10 +680,7 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 
 	repo_parse_commit(revs->repo, revs->commits->item);
 
-	if (!revs->repo->objects->commit_graph)
-		return;
-
-	revs->bloom_filter_settings = revs->repo->objects->commit_graph->bloom_filter_settings;
+	revs->bloom_filter_settings = get_bloom_filter_settings(revs->repo);
 	if (!revs->bloom_filter_settings)
 		return;
 
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index c21cc160f3..c9f9bdf1ba 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -60,7 +60,7 @@ setup () {
 
 test_bloom_filters_used () {
 	log_args=$1
-	bloom_trace_prefix="statistics:{\"filter_not_present\":0,\"maybe\""
+	bloom_trace_prefix="statistics:{\"filter_not_present\":${2:-0},\"maybe\""
 	setup "$log_args" &&
 	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" &&
 	test_cmp log_wo_bloom log_w_bloom &&
@@ -134,8 +134,11 @@ test_expect_success 'setup - add commit-graph to the chain without Bloom filters
 	test_line_count = 2 .git/objects/info/commit-graphs/commit-graph-chain
 '
 
-test_expect_success 'Do not use Bloom filters if the latest graph does not have Bloom filters.' '
-	test_bloom_filters_not_used "-- A/B"
+test_expect_success 'use Bloom filters even if the latest graph does not have Bloom filters' '
+	# Ensure that the number of empty filters is equal to the number of
+	# filters in the latest graph layer to prove that they are loaded (and
+	# ignored).
+	test_bloom_filters_used "-- A/B" 3
 '
 
 test_expect_success 'setup - add commit-graph to the chain with Bloom filters' '
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 9b850ea907..5bdfd53ef9 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -425,4 +425,17 @@ done <<\EOF
 0600 -r--------
 EOF
 
+test_expect_success '--split=replace with partial Bloom data' '
+	rm -rf $graphdir $infodir/commit-graph &&
+	git reset --hard commits/3 &&
+	git rev-list -1 HEAD~2 >a &&
+	git rev-list -1 HEAD~1 >b &&
+	git commit-graph write --split=no-merge --stdin-commits --changed-paths <a &&
+	git commit-graph write --split=no-merge --stdin-commits <b &&
+	git commit-graph write --split=replace --stdin-commits --changed-paths <c &&
+	ls $graphdir/graph-*.graph >graph-files &&
+	test_line_count = 1 graph-files &&
+	verify_chain_files_exist $graphdir
+'
+
 test_done
-- 
2.28.0.rc1.13.ge78abce653
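As an aside to the commit message above, the split-chain case can be exercised in isolation. The sketch below is only an illustration with simplified, hypothetical types ('struct filter_settings', 'struct graph_layer', 'find_settings()'); it is not Git's 'struct commit_graph' and not part of the patch. It shows a three-layer chain in which only the oldest layer carries Bloom settings, which is the situation where looking at the top-most layer alone would find nothing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down settings; the field values below are illustrative. */
struct filter_settings {
	unsigned hash_version;
	unsigned num_hashes;
	unsigned bits_per_entry;
};

/* Hypothetical stand-in for one layer of an incremental commit-graph chain. */
struct graph_layer {
	struct filter_settings *settings; /* NULL if this layer has no Bloom data */
	struct graph_layer *base;         /* next-older layer, NULL at the root */
};

/* Walk from the newest layer towards the root; return the first settings found. */
struct filter_settings *find_settings(struct graph_layer *top)
{
	struct graph_layer *g;

	for (g = top; g; g = g->base)
		if (g->settings)
			return g->settings;
	return NULL;
}

/* Usage: only the root layer has settings, yet the lookup from the top succeeds. */
void demo_split_chain(void)
{
	struct filter_settings s = { 1, 7, 10 };
	struct graph_layer root = { &s, NULL };
	struct graph_layer middle = { NULL, &root };
	struct graph_layer top = { NULL, &middle };
	struct graph_layer lone = { NULL, NULL };

	assert(find_settings(&top) == &s);
	assert(find_settings(&lone) == NULL);
	assert(find_settings(NULL) == NULL);
}
```

Returning the first non-NULL settings is sufficient here only because, as the commit message states, every layer that has Bloom data is guaranteed to use the same settings.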


^ permalink raw reply	[flat|nested] 117+ messages in thread

* [PATCH v3 02/14] t4216: use an '&&'-chain
  2020-08-11 20:51 ` [PATCH v3 00/14] more miscellaneous Bloom filter improvements Taylor Blau
  2020-08-11 20:51   ` [PATCH v3 01/14] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
@ 2020-08-11 20:51   ` Taylor Blau
  2020-08-11 20:51   ` [PATCH v3 03/14] commit-graph: pass a 'struct repository *' in more places Taylor Blau
                     ` (12 subsequent siblings)
  14 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-08-11 20:51 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, szeder.dev, gitster

In a759bfa9ee (t4216: add end to end tests for git log with Bloom
filters, 2020-04-06), a 'rm' invocation was added without a
corresponding '&&' chain.

When 'trace.perf' already exists, everything works fine. However, the
function can be executed without 'trace.perf' on disk (e.g., when the
subset of tests run is altered with '--run'), and so the bare 'rm'
complains about a missing file.

To remove some noise from the test log, invoke 'rm' with '-f', at which
point it is sensible to place the 'rm -f' in an '&&'-chain, which both
(1) matches our usual style, and (2) avoids a broken chain in the future
if more commands are added at the beginning of the function.

Helped-by: Eric Sunshine <sunshine@sunshineco.com>
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/t4216-log-bloom.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index c9f9bdf1ba..fe19f6a60c 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -53,7 +53,7 @@ sane_unset GIT_TRACE2_PERF_BRIEF
 sane_unset GIT_TRACE2_CONFIG_PARAMS
 
 setup () {
-	rm "$TRASH_DIRECTORY/trace.perf"
+	rm -f "$TRASH_DIRECTORY/trace.perf" &&
 	git -c core.commitGraph=false log --pretty="format:%s" $1 >log_wo_bloom &&
 	GIT_TRACE2_PERF="$TRASH_DIRECTORY/trace.perf" git -c core.commitGraph=true log --pretty="format:%s" $1 >log_w_bloom
 }
-- 
2.28.0.rc1.13.ge78abce653


^ permalink raw reply	[flat|nested] 117+ messages in thread

* [PATCH v3 03/14] commit-graph: pass a 'struct repository *' in more places
  2020-08-11 20:51 ` [PATCH v3 00/14] more miscellaneous Bloom filter improvements Taylor Blau
  2020-08-11 20:51   ` [PATCH v3 01/14] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
  2020-08-11 20:51   ` [PATCH v3 02/14] t4216: use an '&&'-chain Taylor Blau
@ 2020-08-11 20:51   ` Taylor Blau
  2020-08-11 20:51   ` [PATCH v3 04/14] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
                     ` (11 subsequent siblings)
  14 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-08-11 20:51 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, szeder.dev, gitster

In a future commit, some commit-graph internals will want access to
'r->settings', but we only have the 'struct object_directory *'
corresponding to that repository.

Add an additional parameter to pass the repository around in more
places.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/commit-graph.c |  2 +-
 commit-graph.c         | 17 ++++++++++-------
 commit-graph.h         |  6 ++++--
 fuzz-commit-graph.c    |  5 +++--
 4 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 523501f217..ba5584463f 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -106,7 +106,7 @@ static int graph_verify(int argc, const char **argv)
 	FREE_AND_NULL(graph_name);
 
 	if (open_ok)
-		graph = load_commit_graph_one_fd_st(fd, &st, odb);
+		graph = load_commit_graph_one_fd_st(the_repository, fd, &st, odb);
 	else
 		graph = read_commit_graph_one(the_repository, odb);
 
diff --git a/commit-graph.c b/commit-graph.c
index d4b06811be..0c1030641c 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -224,7 +224,8 @@ int open_commit_graph(const char *graph_file, int *fd, struct stat *st)
 	return 1;
 }
 
-struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st,
+struct commit_graph *load_commit_graph_one_fd_st(struct repository *r,
+						 int fd, struct stat *st,
 						 struct object_directory *odb)
 {
 	void *graph_map;
@@ -240,7 +241,7 @@ struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st,
 	}
 	graph_map = xmmap(NULL, graph_size, PROT_READ, MAP_PRIVATE, fd, 0);
 	close(fd);
-	ret = parse_commit_graph(graph_map, graph_size);
+	ret = parse_commit_graph(r, graph_map, graph_size);
 
 	if (ret)
 		ret->odb = odb;
@@ -280,7 +281,8 @@ static int verify_commit_graph_lite(struct commit_graph *g)
 	return 0;
 }
 
-struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size)
+struct commit_graph *parse_commit_graph(struct repository *r,
+					void *graph_map, size_t graph_size)
 {
 	const unsigned char *data, *chunk_lookup;
 	uint32_t i;
@@ -445,7 +447,8 @@ struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size)
 	return NULL;
 }
 
-static struct commit_graph *load_commit_graph_one(const char *graph_file,
+static struct commit_graph *load_commit_graph_one(struct repository *r,
+						  const char *graph_file,
 						  struct object_directory *odb)
 {
 
@@ -457,7 +460,7 @@ static struct commit_graph *load_commit_graph_one(const char *graph_file,
 	if (!open_ok)
 		return NULL;
 
-	g = load_commit_graph_one_fd_st(fd, &st, odb);
+	g = load_commit_graph_one_fd_st(r, fd, &st, odb);
 
 	if (g)
 		g->filename = xstrdup(graph_file);
@@ -469,7 +472,7 @@ static struct commit_graph *load_commit_graph_v1(struct repository *r,
 						 struct object_directory *odb)
 {
 	char *graph_name = get_commit_graph_filename(odb);
-	struct commit_graph *g = load_commit_graph_one(graph_name, odb);
+	struct commit_graph *g = load_commit_graph_one(r, graph_name, odb);
 	free(graph_name);
 
 	return g;
@@ -550,7 +553,7 @@ static struct commit_graph *load_commit_graph_chain(struct repository *r,
 		valid = 0;
 		for (odb = r->objects->odb; odb; odb = odb->next) {
 			char *graph_name = get_split_graph_filename(odb, line.buf);
-			struct commit_graph *g = load_commit_graph_one(graph_name, odb);
+			struct commit_graph *g = load_commit_graph_one(r, graph_name, odb);
 
 			free(graph_name);
 
diff --git a/commit-graph.h b/commit-graph.h
index 0677dd1031..d9acb22bac 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -75,11 +75,13 @@ struct commit_graph {
 	struct bloom_filter_settings *bloom_filter_settings;
 };
 
-struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st,
+struct commit_graph *load_commit_graph_one_fd_st(struct repository *r,
+						 int fd, struct stat *st,
 						 struct object_directory *odb);
 struct commit_graph *read_commit_graph_one(struct repository *r,
 					   struct object_directory *odb);
-struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size);
+struct commit_graph *parse_commit_graph(struct repository *r,
+					void *graph_map, size_t graph_size);
 
 /*
  * Return 1 if and only if the repository has a commit-graph
diff --git a/fuzz-commit-graph.c b/fuzz-commit-graph.c
index 430817214d..e7cf6d5b0f 100644
--- a/fuzz-commit-graph.c
+++ b/fuzz-commit-graph.c
@@ -1,7 +1,8 @@
 #include "commit-graph.h"
 #include "repository.h"
 
-struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size);
+struct commit_graph *parse_commit_graph(struct repository *r,
+					void *graph_map, size_t graph_size);
 
 int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size);
 
@@ -10,7 +11,7 @@ int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
 	struct commit_graph *g;
 
 	initialize_the_repository();
-	g = parse_commit_graph((void *)data, size);
+	g = parse_commit_graph(the_repository, (void *)data, size);
 	repo_clear(the_repository);
 	free_commit_graph(g);
 
-- 
2.28.0.rc1.13.ge78abce653


^ permalink raw reply	[flat|nested] 117+ messages in thread

* [PATCH v3 04/14] t/helper/test-read-graph.c: prepare repo settings
  2020-08-11 20:51 ` [PATCH v3 00/14] more miscellaneous Bloom filter improvements Taylor Blau
                     ` (2 preceding siblings ...)
  2020-08-11 20:51   ` [PATCH v3 03/14] commit-graph: pass a 'struct repository *' in more places Taylor Blau
@ 2020-08-11 20:51   ` Taylor Blau
  2020-08-11 20:51   ` [PATCH v3 05/14] commit-graph: respect 'commitGraph.readChangedPaths' Taylor Blau
                     ` (10 subsequent siblings)
  14 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-08-11 20:51 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, szeder.dev, gitster

The read-graph test-tool is used by a number of the commit-graph tests
to assert various properties about a commit-graph. Previously, this
program never ran 'prepare_repo_settings()'. There was no need to do so,
since none of the commit-graph machinery was affected by the repo
settings.

In the next patch, the commit-graph machinery's behavior will become
dependent on the repo settings, and so loading them before running the
rest of the test tool is critical.

As such, teach the test tool to call 'prepare_repo_settings()'.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/helper/test-read-graph.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
index 6d0c962438..5f585a1725 100644
--- a/t/helper/test-read-graph.c
+++ b/t/helper/test-read-graph.c
@@ -12,11 +12,12 @@ int cmd__read_graph(int argc, const char **argv)
 	setup_git_directory();
 	odb = the_repository->objects->odb;
 
+	prepare_repo_settings(the_repository);
+
 	graph = read_commit_graph_one(the_repository, odb);
 	if (!graph)
 		return 1;
 
-
 	printf("header: %08x %d %d %d %d\n",
 		ntohl(*(uint32_t*)graph->data),
 		*(unsigned char*)(graph->data + 4),
-- 
2.28.0.rc1.13.ge78abce653


^ permalink raw reply	[flat|nested] 117+ messages in thread

* [PATCH v3 05/14] commit-graph: respect 'commitGraph.readChangedPaths'
  2020-08-11 20:51 ` [PATCH v3 00/14] more miscellaneous Bloom filter improvements Taylor Blau
                     ` (3 preceding siblings ...)
  2020-08-11 20:51   ` [PATCH v3 04/14] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
@ 2020-08-11 20:51   ` Taylor Blau
  2020-08-11 20:51   ` [PATCH v3 06/14] commit-graph.c: store maximum changed paths Taylor Blau
                     ` (9 subsequent siblings)
  14 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-08-11 20:51 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, szeder.dev, gitster

Git uses the 'core.commitGraph' configuration value to control whether
or not the commit graph is used when parsing commits or performing a
traversal.

Now that commit-graphs can also contain a section for changed-path Bloom
filters, administrators that already have commit-graphs may find it
convenient to use those graphs without relying on their changed-path
Bloom filters. This can happen, for example, during a staged roll-out,
or in the event of an incident.

Introduce 'commitGraph.readChangedPaths' to control whether or not Bloom
filters are read. Note that this configuration is independent from both:

  - 'core.commitGraph', to allow flexibility in using all parts of a
    commit-graph _except_ for its Bloom filters.

  - The '--changed-paths' option for 'git commit-graph write', to allow
    reading and writing Bloom filters to be controlled independently.

When the variable is set to 'false', pretend as if no Bloom data was
specified at all. This avoids adding additional special-casing outside
of the commit-graph internals.

Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config.txt             | 2 ++
 Documentation/config/commitgraph.txt | 4 ++++
 commit-graph.c                       | 6 ++++--
 repo-settings.c                      | 3 +++
 repository.h                         | 1 +
 t/t4216-log-bloom.sh                 | 4 +++-
 6 files changed, 17 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/config/commitgraph.txt

diff --git a/Documentation/config.txt b/Documentation/config.txt
index ef0768b91a..78883c6e63 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -340,6 +340,8 @@ include::config/column.txt[]
 
 include::config/commit.txt[]
 
+include::config/commitgraph.txt[]
+
 include::config/credential.txt[]
 
 include::config/completion.txt[]
diff --git a/Documentation/config/commitgraph.txt b/Documentation/config/commitgraph.txt
new file mode 100644
index 0000000000..cff0797b54
--- /dev/null
+++ b/Documentation/config/commitgraph.txt
@@ -0,0 +1,4 @@
+commitGraph.readChangedPaths::
+	If true, then git will use the changed-path Bloom filters in the
+	commit-graph file (if it exists, and they are present). Defaults to
+	true. See linkgit:git-commit-graph[1] for more information.
diff --git a/commit-graph.c b/commit-graph.c
index 0c1030641c..a516e93d71 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -320,6 +320,8 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 		return NULL;
 	}
 
+	prepare_repo_settings(r);
+
 	graph = alloc_commit_graph();
 
 	graph->hash_len = the_hash_algo->rawsz;
@@ -396,14 +398,14 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 		case GRAPH_CHUNKID_BLOOMINDEXES:
 			if (graph->chunk_bloom_indexes)
 				chunk_repeated = 1;
-			else
+			else if (r->settings.commit_graph_read_changed_paths)
 				graph->chunk_bloom_indexes = data + chunk_offset;
 			break;
 
 		case GRAPH_CHUNKID_BLOOMDATA:
 			if (graph->chunk_bloom_data)
 				chunk_repeated = 1;
-			else {
+			else if (r->settings.commit_graph_read_changed_paths) {
 				uint32_t hash_version;
 				graph->chunk_bloom_data = data + chunk_offset;
 				hash_version = get_be32(data + chunk_offset);
diff --git a/repo-settings.c b/repo-settings.c
index 0918408b34..9e551bc03d 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -17,9 +17,12 @@ void prepare_repo_settings(struct repository *r)
 
 	if (!repo_config_get_bool(r, "core.commitgraph", &value))
 		r->settings.core_commit_graph = value;
+	if (!repo_config_get_bool(r, "commitgraph.readchangedpaths", &value))
+		r->settings.commit_graph_read_changed_paths = value;
 	if (!repo_config_get_bool(r, "gc.writecommitgraph", &value))
 		r->settings.gc_write_commit_graph = value;
 	UPDATE_DEFAULT_BOOL(r->settings.core_commit_graph, 1);
+	UPDATE_DEFAULT_BOOL(r->settings.commit_graph_read_changed_paths, 1);
 	UPDATE_DEFAULT_BOOL(r->settings.gc_write_commit_graph, 1);
 
 	if (!repo_config_get_int(r, "index.version", &value))
diff --git a/repository.h b/repository.h
index 3c1f7d54bd..81759b7d27 100644
--- a/repository.h
+++ b/repository.h
@@ -29,6 +29,7 @@ struct repo_settings {
 	int initialized;
 
 	int core_commit_graph;
+	int commit_graph_read_changed_paths;
 	int gc_write_commit_graph;
 	int fetch_write_commit_graph;
 
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index fe19f6a60c..b3d1f596f8 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -90,7 +90,9 @@ do
 		      "--ancestry-path side..master"
 	do
 		test_expect_success "git log option: $option for path: $path" '
-			test_bloom_filters_used "$option -- $path"
+			test_bloom_filters_used "$option -- $path" &&
+			test_config commitgraph.readChangedPaths false &&
+			test_bloom_filters_not_used "$option -- $path"
 		'
 	done
 done
-- 
2.28.0.rc1.13.ge78abce653



* [PATCH v3 06/14] commit-graph.c: store maximum changed paths
  2020-08-11 20:51 ` [PATCH v3 00/14] more miscellaneous Bloom filter improvements Taylor Blau
                     ` (4 preceding siblings ...)
  2020-08-11 20:51   ` [PATCH v3 05/14] commit-graph: respect 'commitGraph.readChangedPaths' Taylor Blau
@ 2020-08-11 20:51   ` Taylor Blau
  2020-08-11 20:51   ` [PATCH v3 07/14] bloom: split 'get_bloom_filter()' in two Taylor Blau
                     ` (8 subsequent siblings)
  14 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-08-11 20:51 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, szeder.dev, gitster

For now, we assume that there is a fixed constant describing the
maximum number of changed paths we are willing to store in a Bloom
filter.

Prepare for that to (at least partially) not be the case by making it a
member of the 'struct bloom_filter_settings'. This will be helpful in
the subsequent patches by reducing the size of test cases that exercise
storing too many changed paths, as well as preparing for an eventual
future in which this value might change.

This patch alone does not cause newly generated Bloom filters to use
a custom upper-bound on the maximum number of changed paths a single
Bloom filter can hold; that will occur in a later patch.
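
As a rough illustration of the idea (not the code in this patch), the
sketch below keeps the limit as a struct member with a default of 512
and lets an environment variable override it. All 'sketch_' names and
the SKETCH_MAX_CHANGED_PATHS variable are hypothetical, and the parser
is a simplification: it falls back to the default on malformed input,
which is an assumption and not necessarily what 'git_env_ulong()' does.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical mirror of the settings struct with the new member. */
struct sketch_settings {
	uint32_t hash_version;
	uint32_t num_hashes;
	uint32_t bits_per_entry;
	uint32_t max_changed_paths; /* in-memory only in this sketch */
};

#define SKETCH_DEFAULT_MAX_CHANGES 512
#define SKETCH_DEFAULT_SETTINGS { 1, 7, 10, SKETCH_DEFAULT_MAX_CHANGES }

/*
 * Read an unsigned long from the environment variable 'name'.
 * Returns 'dflt' when the variable is unset, empty, or malformed.
 */
unsigned long sketch_env_ulong(const char *name, unsigned long dflt)
{
	const char *value;
	char *end;
	unsigned long parsed;

	assert(name != NULL);
	value = getenv(name);
	if (!value || !*value)
		return dflt;

	errno = 0;
	parsed = strtoul(value, &end, 10);
	if (errno || *end)
		return dflt;
	return parsed;
}

/* Start from the defaults, then apply the (hypothetical) override. */
struct sketch_settings sketch_load_settings(void)
{
	struct sketch_settings s = SKETCH_DEFAULT_SETTINGS;

	s.max_changed_paths = (uint32_t)sketch_env_ulong(
		"SKETCH_MAX_CHANGED_PATHS", s.max_changed_paths);
	return s;
}
```

Because the limit lives in the struct, a test can lower it without
touching any compile-time constant.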

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 bloom.h              | 11 ++++++++++-
 commit-graph.c       |  3 +++
 t/t4216-log-bloom.sh |  4 ++--
 3 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/bloom.h b/bloom.h
index d8fbb0fbf1..0b9b59a6fe 100644
--- a/bloom.h
+++ b/bloom.h
@@ -28,9 +28,18 @@ struct bloom_filter_settings {
 	 * that contain n*b bits.
 	 */
 	uint32_t bits_per_entry;
+
+	/*
+	 * The maximum number of changed paths per commit
+	 * before declaring a Bloom filter to be too-large.
+	 *
+	 * Not written to the commit-graph file.
+	 */
+	uint32_t max_changed_paths;
 };
 
-#define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 }
+#define DEFAULT_BLOOM_MAX_CHANGES 512
+#define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10, DEFAULT_BLOOM_MAX_CHANGES }
 #define BITS_PER_WORD 8
 #define BLOOMDATA_CHUNK_HEADER_SIZE 3 * sizeof(uint32_t)
 
diff --git a/commit-graph.c b/commit-graph.c
index a516e93d71..86dd4b979e 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1194,6 +1194,7 @@ static void trace2_bloom_filter_settings(struct write_commit_graph_context *ctx)
 	jw_object_intmax(&jw, "hash_version", ctx->bloom_settings->hash_version);
 	jw_object_intmax(&jw, "num_hashes", ctx->bloom_settings->num_hashes);
 	jw_object_intmax(&jw, "bits_per_entry", ctx->bloom_settings->bits_per_entry);
+	jw_object_intmax(&jw, "max_changed_paths", ctx->bloom_settings->max_changed_paths);
 	jw_end(&jw);
 
 	trace2_data_json("bloom", ctx->r, "settings", &jw);
@@ -1662,6 +1663,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 							      bloom_settings.bits_per_entry);
 		bloom_settings.num_hashes = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_NUM_HASHES",
 							  bloom_settings.num_hashes);
+		bloom_settings.max_changed_paths = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS",
+							  bloom_settings.max_changed_paths);
 		ctx->bloom_settings = &bloom_settings;
 	}
 
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index b3d1f596f8..eb2bcc51f0 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -169,11 +169,11 @@ test_expect_success 'persist filter settings' '
 		GIT_TEST_BLOOM_SETTINGS_NUM_HASHES=9 \
 		GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY=15 \
 		git commit-graph write --reachable --changed-paths &&
-	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15}" trace2.txt &&
+	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15" trace2.txt &&
 	GIT_TRACE2_EVENT="$(pwd)/trace2-auto.txt" \
 		GIT_TRACE2_EVENT_NESTING=5 \
 		git commit-graph write --reachable --changed-paths &&
-	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15}" trace2-auto.txt
+	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15" trace2-auto.txt
 '
 
 test_expect_success 'correctly report changes over limit' '
-- 
2.28.0.rc1.13.ge78abce653



* [PATCH v3 07/14] bloom: split 'get_bloom_filter()' in two
  2020-08-11 20:51 ` [PATCH v3 00/14] more miscellaneous Bloom filter improvements Taylor Blau
                     ` (5 preceding siblings ...)
  2020-08-11 20:51   ` [PATCH v3 06/14] commit-graph.c: store maximum changed paths Taylor Blau
@ 2020-08-11 20:51   ` Taylor Blau
  2020-08-11 20:51   ` [PATCH v3 11/14] csum-file.h: introduce 'hashwrite_be64()' Taylor Blau
                     ` (7 subsequent siblings)
  14 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-08-11 20:51 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, szeder.dev, gitster

'get_bloom_filter' takes a flag to control whether it will compute a
Bloom filter if the requested one is missing. In the next patch, we'll
add yet another parameter to this method, which would force all but one
caller to specify an extra 'NULL' parameter at the end.

Instead of doing this, split 'get_bloom_filter' into two functions:
'get_bloom_filter' and 'get_or_compute_bloom_filter'. The former only
looks up a Bloom filter (and does not compute one if it's missing,
thus dropping the 'compute_if_not_present' flag). The latter does
compute missing Bloom filters, with an additional parameter to store
whether or not it needed to do so.

This simplifies many call-sites, since the majority of existing callers
to 'get_bloom_filter' do not want missing Bloom filters to be computed
(so they can drop the parameter entirely and use the simpler version of
the function).

While we're at it, instrument the new 'get_or_compute_bloom_filter()'
with two counters in the 'write_commit_graph_context' struct which store
the number of filters that we computed, and the number of those which
were too large to store.

It would be nice to drop the 'compute_if_not_present' flag entirely,
since all remaining callers of 'get_or_compute_bloom_filter' pass it as
'1', but this will change in a future patch and hence cannot be removed.
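
The calling convention described above can be sketched in isolation as
follows. This is a toy model, not the code in bloom.c: the cache, the
'sketch_' names, and the placeholder "computation" are all assumptions
made for illustration. Only the shape (an optional 'computed' output
parameter and a lookup-only wrapper macro) follows the patch.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_SLOTS 4

/* Toy stand-in for a cached filter. */
struct sketch_filter {
	int present;	/* non-zero once the slot has been filled */
	size_t len;
};

static struct sketch_filter sketch_cache[SKETCH_SLOTS];

/*
 * Look up the filter at 'pos'; compute it only when asked to.
 * If 'computed' is non-NULL, it is set to 1 when this call had to
 * compute the filter and to 0 otherwise.
 */
struct sketch_filter *sketch_get_or_compute(size_t pos,
					    int compute_if_not_present,
					    int *computed)
{
	struct sketch_filter *f;

	assert(pos < SKETCH_SLOTS);
	f = &sketch_cache[pos];

	if (computed)
		*computed = 0;
	if (f->present)
		return f;
	if (!compute_if_not_present)
		return NULL;

	/* Placeholder for the real (expensive) computation. */
	f->present = 1;
	f->len = pos + 1;

	if (computed)
		*computed = 1;
	return f;
}

/* Lookup-only variant: never computes, never reports. */
#define sketch_get(pos) sketch_get_or_compute((pos), 0, NULL)
```

Callers that only read filters use the short form, while the writer
passes a pointer and increments its counters when '*computed' is set.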

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 blame.c               |  2 +-
 bloom.c               | 13 ++++++++++---
 bloom.h               | 10 +++++++---
 commit-graph.c        | 38 +++++++++++++++++++++++++++++++++++---
 line-log.c            |  2 +-
 revision.c            |  2 +-
 t/helper/test-bloom.c |  3 ++-
 7 files changed, 57 insertions(+), 13 deletions(-)

diff --git a/blame.c b/blame.c
index 3e5f8787bc..756285fca7 100644
--- a/blame.c
+++ b/blame.c
@@ -1275,7 +1275,7 @@ static int maybe_changed_path(struct repository *r,
 	if (commit_graph_generation(origin->commit) == GENERATION_NUMBER_INFINITY)
 		return 1;
 
-	filter = get_bloom_filter(r, origin->commit, 0);
+	filter = get_bloom_filter(r, origin->commit);
 
 	if (!filter)
 		return 1;
diff --git a/bloom.c b/bloom.c
index cd9380ac62..a8a21762f4 100644
--- a/bloom.c
+++ b/bloom.c
@@ -177,9 +177,10 @@ static int pathmap_cmp(const void *hashmap_cmp_fn_data,
 	return strcmp(e1->path, e2->path);
 }
 
-struct bloom_filter *get_bloom_filter(struct repository *r,
-				      struct commit *c,
-				      int compute_if_not_present)
+struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
+						 struct commit *c,
+						 int compute_if_not_present,
+						 int *computed)
 {
 	struct bloom_filter *filter;
 	struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
@@ -187,6 +188,9 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 	struct diff_options diffopt;
 	int max_changes = 512;
 
+	if (computed)
+		*computed = 0;
+
 	if (!bloom_filters.slab_size)
 		return NULL;
 
@@ -273,6 +277,9 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 		filter->len = 0;
 	}
 
+	if (computed)
+		*computed = 1;
+
 	free(diff_queued_diff.queue);
 	DIFF_QUEUE_CLEAR(&diff_queued_diff);
 
diff --git a/bloom.h b/bloom.h
index 0b9b59a6fe..baa91926db 100644
--- a/bloom.h
+++ b/bloom.h
@@ -89,9 +89,13 @@ void add_key_to_filter(const struct bloom_key *key,
 
 void init_bloom_filters(void);
 
-struct bloom_filter *get_bloom_filter(struct repository *r,
-				      struct commit *c,
-				      int compute_if_not_present);
+struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
+						 struct commit *c,
+						 int compute_if_not_present,
+						 int *computed);
+
+#define get_bloom_filter(r, c) get_or_compute_bloom_filter( \
+	(r), (c), 0, NULL)
 
 int bloom_filter_contains(const struct bloom_filter *filter,
 			  const struct bloom_key *key,
diff --git a/commit-graph.c b/commit-graph.c
index 86dd4b979e..ba2a2cfb22 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -964,6 +964,9 @@ struct write_commit_graph_context {
 	const struct split_commit_graph_opts *split_opts;
 	size_t total_bloom_filter_data_size;
 	const struct bloom_filter_settings *bloom_settings;
+
+	int count_bloom_filter_found_large;
+	int count_bloom_filter_computed;
 };
 
 static int write_graph_chunk_fanout(struct hashfile *f,
@@ -1175,7 +1178,7 @@ static int write_graph_chunk_bloom_indexes(struct hashfile *f,
 	uint32_t cur_pos = 0;
 
 	while (list < last) {
-		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
+		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
 		size_t len = filter ? filter->len : 0;
 		cur_pos += len;
 		display_progress(ctx->progress, ++ctx->progress_cnt);
@@ -1215,7 +1218,7 @@ static int write_graph_chunk_bloom_data(struct hashfile *f,
 	hashwrite_be32(f, ctx->bloom_settings->bits_per_entry);
 
 	while (list < last) {
-		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
+		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
 		size_t len = filter ? filter->len : 0;
 
 		display_progress(ctx->progress, ++ctx->progress_cnt);
@@ -1385,6 +1388,22 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 	stop_progress(&ctx->progress);
 }
 
+static void trace2_bloom_filter_write_statistics(struct write_commit_graph_context *ctx)
+{
+	struct json_writer jw = JSON_WRITER_INIT;
+
+	jw_object_begin(&jw, 0);
+	jw_object_intmax(&jw, "filter_found_large",
+			 ctx->count_bloom_filter_found_large);
+	jw_object_intmax(&jw, "filter_computed",
+			 ctx->count_bloom_filter_computed);
+	jw_end(&jw);
+
+	trace2_data_json("commit-graph", the_repository, "bloom_statistics", &jw);
+
+	jw_release(&jw);
+}
+
 static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 {
 	int i;
@@ -1407,12 +1426,25 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 		QSORT(sorted_commits, ctx->commits.nr, commit_gen_cmp);
 
 	for (i = 0; i < ctx->commits.nr; i++) {
+		int computed = 0;
 		struct commit *c = sorted_commits[i];
-		struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1);
+		struct bloom_filter *filter = get_or_compute_bloom_filter(
+			ctx->r,
+			c,
+			1,
+			&computed);
+		if (computed) {
+			ctx->count_bloom_filter_computed++;
+			if (filter && !filter->len)
+				ctx->count_bloom_filter_found_large++;
+		}
 		ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
 		display_progress(progress, i + 1);
 	}
 
+	if (trace2_is_enabled())
+		trace2_bloom_filter_write_statistics(ctx);
+
 	free(sorted_commits);
 	stop_progress(&progress);
 }
diff --git a/line-log.c b/line-log.c
index bf73ea95ac..68eeb425f8 100644
--- a/line-log.c
+++ b/line-log.c
@@ -1159,7 +1159,7 @@ static int bloom_filter_check(struct rev_info *rev,
 		return 1;
 
 	if (!rev->bloom_filter_settings ||
-	    !(filter = get_bloom_filter(rev->repo, commit, 0)))
+	    !(filter = get_bloom_filter(rev->repo, commit)))
 		return 1;
 
 	if (!range)
diff --git a/revision.c b/revision.c
index be600186ee..7f58ecc411 100644
--- a/revision.c
+++ b/revision.c
@@ -751,7 +751,7 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 	if (commit_graph_generation(commit) == GENERATION_NUMBER_INFINITY)
 		return -1;
 
-	filter = get_bloom_filter(revs->repo, commit, 0);
+	filter = get_bloom_filter(revs->repo, commit);
 
 	if (!filter) {
 		count_bloom_filter_not_present++;
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index f0aa80b98e..531af439c2 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -39,7 +39,8 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
 	struct bloom_filter *filter;
 	setup_git_directory();
 	c = lookup_commit(the_repository, commit_oid);
-	filter = get_bloom_filter(the_repository, c, 1);
+	filter = get_or_compute_bloom_filter(the_repository, c, 1,
+					     NULL);
 	print_bloom_filter(filter);
 }
 
-- 
2.28.0.rc1.13.ge78abce653



* [PATCH v3 11/14] csum-file.h: introduce 'hashwrite_be64()'
  2020-08-11 20:51 ` [PATCH v3 00/14] more miscellaneous Bloom filter improvements Taylor Blau
                     ` (6 preceding siblings ...)
  2020-08-11 20:51   ` [PATCH v3 07/14] bloom: split 'get_bloom_filter()' in two Taylor Blau
@ 2020-08-11 20:51   ` Taylor Blau
  2020-08-11 20:51   ` [PATCH v3 08/14] bloom: use provided 'struct bloom_filter_settings' Taylor Blau
                     ` (6 subsequent siblings)
  14 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-08-11 20:51 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, szeder.dev, gitster

A small handful of writers who wish to encode 64-bit values in network
order have worked around the lack of such a helper by calling the 32-bit
variant twice.

The subsequent commit will add another caller who wants to write a
64-bit value. To ease their (and the existing callers') pain, introduce
a helper to do just that, and convert existing call-sites.
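
For readers unfamiliar with the trick, here is a minimal standalone
sketch of writing a 64-bit value in network order as two 32-bit halves.
It writes into a plain byte buffer; the 'put_be32_sketch' and
'put_be64_sketch' names are hypothetical and are not part of
csum-file.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store a 32-bit value in network (big-endian) byte order. */
void put_be32_sketch(unsigned char *out, uint32_t v)
{
	assert(out != NULL);
	out[0] = (unsigned char)(v >> 24);
	out[1] = (unsigned char)(v >> 16);
	out[2] = (unsigned char)(v >> 8);
	out[3] = (unsigned char)v;
}

/* 64-bit variant: the high word goes first, then the low word. */
void put_be64_sketch(unsigned char *out, uint64_t v)
{
	put_be32_sketch(out, (uint32_t)(v >> 32));
	put_be32_sketch(out + 4, (uint32_t)(v & 0xffffffffUL));
}
```

Writing the high word first is what makes the result identical to a
single big-endian 64-bit store.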

Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 commit-graph.c | 8 ++------
 csum-file.h    | 6 ++++++
 midx.c         | 3 +--
 3 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 0d70545149..8964453433 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1784,12 +1784,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 
 	chunk_offset = 8 + (num_chunks + 1) * GRAPH_CHUNKLOOKUP_WIDTH;
 	for (i = 0; i <= num_chunks; i++) {
-		uint32_t chunk_write[3];
-
-		chunk_write[0] = htonl(chunks[i].id);
-		chunk_write[1] = htonl(chunk_offset >> 32);
-		chunk_write[2] = htonl(chunk_offset & 0xffffffff);
-		hashwrite(f, chunk_write, 12);
+		hashwrite_be32(f, chunks[i].id);
+		hashwrite_be64(f, chunk_offset);
 
 		chunk_offset += chunks[i].size;
 	}
diff --git a/csum-file.h b/csum-file.h
index f9cbd317fb..b026ec7766 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -62,4 +62,10 @@ static inline void hashwrite_be32(struct hashfile *f, uint32_t data)
 	hashwrite(f, &data, sizeof(data));
 }
 
+static inline void hashwrite_be64(struct hashfile *f, uint64_t data)
+{
+	hashwrite_be32(f, data >> 32);
+	hashwrite_be32(f, data & 0xffffffffUL);
+}
+
 #endif
diff --git a/midx.c b/midx.c
index a5fb797ede..51ca27cf34 100644
--- a/midx.c
+++ b/midx.c
@@ -775,8 +775,7 @@ static size_t write_midx_large_offsets(struct hashfile *f, uint32_t nr_large_off
 		if (!(offset >> 31))
 			continue;
 
-		hashwrite_be32(f, offset >> 32);
-		hashwrite_be32(f, offset & 0xffffffffUL);
+		hashwrite_be64(f, offset);
 		written += 2 * sizeof(uint32_t);
 
 		nr_large_offset--;
-- 
2.28.0.rc1.13.ge78abce653



* [PATCH v3 08/14] bloom: use provided 'struct bloom_filter_settings'
  2020-08-11 20:51 ` [PATCH v3 00/14] more miscellaneous Bloom filter improvements Taylor Blau
                     ` (7 preceding siblings ...)
  2020-08-11 20:51   ` [PATCH v3 11/14] csum-file.h: introduce 'hashwrite_be64()' Taylor Blau
@ 2020-08-11 20:51   ` Taylor Blau
  2020-08-11 20:51   ` [PATCH v3 09/14] bloom/diff: properly short-circuit on max_changes Taylor Blau
                     ` (5 subsequent siblings)
  14 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-08-11 20:51 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, szeder.dev, gitster

When 'get_or_compute_bloom_filter()' needs to compute a Bloom filter
from scratch, it looks to the default 'struct bloom_filter_settings' in
order to determine the maximum number of changed paths, number of bits
per entry, and so on.

All of these values have so far been constant, and so there was no need
to pass in a pointer from the caller (eg., the one that is stored in the
'struct write_commit_graph_context').

Start passing in a 'struct bloom_filter_settings *' instead of using the
default values to respect graph-specific settings (eg., in the case of
setting 'GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS').

In order to have an initialized value for these settings, move its
initialization to earlier in the commit-graph write.
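
To show why the caller's settings matter, the sketch below derives a
filter's byte length from a settings pointer instead of from hard-coded
defaults. The rounding expression matches the one in bloom.c; the
'sketch_' names and the convention of returning 0 for an over-limit
commit are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BITS_PER_WORD 8

/* Hypothetical subset of the Bloom filter settings. */
struct sketch_settings {
	uint32_t bits_per_entry;
	uint32_t max_changed_paths;
};

/*
 * Length in bytes of a filter holding 'nr_paths' entries, rounded up
 * to a whole byte. Returns 0 when 'nr_paths' exceeds the limit taken
 * from the caller-provided settings.
 */
size_t sketch_filter_len(size_t nr_paths, const struct sketch_settings *s)
{
	assert(s != NULL);
	if (nr_paths > s->max_changed_paths)
		return 0;
	return (nr_paths * s->bits_per_entry + SKETCH_BITS_PER_WORD - 1) /
		SKETCH_BITS_PER_WORD;
}
```

With the defaults (10 bits per entry, limit 512) three paths need four
bytes; with a limit of 2 the same commit counts as too large.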

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 bloom.c               | 13 ++++++-------
 bloom.h               |  3 ++-
 commit-graph.c        | 21 ++++++++++-----------
 t/helper/test-bloom.c |  1 +
 4 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/bloom.c b/bloom.c
index a8a21762f4..0cf1962dc5 100644
--- a/bloom.c
+++ b/bloom.c
@@ -180,13 +180,12 @@ static int pathmap_cmp(const void *hashmap_cmp_fn_data,
 struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 						 struct commit *c,
 						 int compute_if_not_present,
+						 const struct bloom_filter_settings *settings,
 						 int *computed)
 {
 	struct bloom_filter *filter;
-	struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
 	int i;
 	struct diff_options diffopt;
-	int max_changes = 512;
 
 	if (computed)
 		*computed = 0;
@@ -211,7 +210,7 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 	repo_diff_setup(r, &diffopt);
 	diffopt.flags.recursive = 1;
 	diffopt.detect_rename = 0;
-	diffopt.max_changes = max_changes;
+	diffopt.max_changes = settings->max_changed_paths;
 	diff_setup_done(&diffopt);
 
 	/* ensure commit is parsed so we have parent information */
@@ -223,7 +222,7 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 		diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
 	diffcore_std(&diffopt);
 
-	if (diffopt.num_changes <= max_changes) {
+	if (diffopt.num_changes <= settings->max_changed_paths) {
 		struct hashmap pathmap;
 		struct pathmap_hash_entry *e;
 		struct hashmap_iter iter;
@@ -260,13 +259,13 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 			diff_free_filepair(diff_queued_diff.queue[i]);
 		}
 
-		filter->len = (hashmap_get_size(&pathmap) * settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
+		filter->len = (hashmap_get_size(&pathmap) * settings->bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
 		filter->data = xcalloc(filter->len, sizeof(unsigned char));
 
 		hashmap_for_each_entry(&pathmap, &iter, e, entry) {
 			struct bloom_key key;
-			fill_bloom_key(e->path, strlen(e->path), &key, &settings);
-			add_key_to_filter(&key, filter, &settings);
+			fill_bloom_key(e->path, strlen(e->path), &key, settings);
+			add_key_to_filter(&key, filter, settings);
 		}
 
 		hashmap_free_entries(&pathmap, struct pathmap_hash_entry, entry);
diff --git a/bloom.h b/bloom.h
index baa91926db..3f19e3fca4 100644
--- a/bloom.h
+++ b/bloom.h
@@ -92,10 +92,11 @@ void init_bloom_filters(void);
 struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 						 struct commit *c,
 						 int compute_if_not_present,
+						 const struct bloom_filter_settings *settings,
 						 int *computed);
 
 #define get_bloom_filter(r, c) get_or_compute_bloom_filter( \
-	(r), (c), 0, NULL)
+	(r), (c), 0, NULL, NULL)
 
 int bloom_filter_contains(const struct bloom_filter *filter,
 			  const struct bloom_key *key,
diff --git a/commit-graph.c b/commit-graph.c
index ba2a2cfb22..48d4697f54 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1432,6 +1432,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 			ctx->r,
 			c,
 			1,
+			ctx->bloom_settings,
 			&computed);
 		if (computed) {
 			ctx->count_bloom_filter_computed++;
@@ -1688,17 +1689,6 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	int num_chunks = 3;
 	uint64_t chunk_offset;
 	struct object_id file_hash;
-	struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
-
-	if (!ctx->bloom_settings) {
-		bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
-							      bloom_settings.bits_per_entry);
-		bloom_settings.num_hashes = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_NUM_HASHES",
-							  bloom_settings.num_hashes);
-		bloom_settings.max_changed_paths = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS",
-							  bloom_settings.max_changed_paths);
-		ctx->bloom_settings = &bloom_settings;
-	}
 
 	if (ctx->split) {
 		struct strbuf tmp_file = STRBUF_INIT;
@@ -2144,6 +2134,7 @@ int write_commit_graph(struct object_directory *odb,
 	uint32_t i, count_distinct = 0;
 	int res = 0;
 	int replace = 0;
+	struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
 
 	if (!commit_graph_compatible(the_repository))
 		return 0;
@@ -2157,6 +2148,14 @@ int write_commit_graph(struct object_directory *odb,
 	ctx->split_opts = split_opts;
 	ctx->total_bloom_filter_data_size = 0;
 
+	bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
+						      bloom_settings.bits_per_entry);
+	bloom_settings.num_hashes = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_NUM_HASHES",
+						  bloom_settings.num_hashes);
+	bloom_settings.max_changed_paths = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS",
+							 bloom_settings.max_changed_paths);
+	ctx->bloom_settings = &bloom_settings;
+
 	if (flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS)
 		ctx->changed_paths = 1;
 	if (!(flags & COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS)) {
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index 531af439c2..4af949164c 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -40,6 +40,7 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
 	setup_git_directory();
 	c = lookup_commit(the_repository, commit_oid);
 	filter = get_or_compute_bloom_filter(the_repository, c, 1,
+					     &settings,
 					     NULL);
 	print_bloom_filter(filter);
 }
-- 
2.28.0.rc1.13.ge78abce653



* [PATCH v3 09/14] bloom/diff: properly short-circuit on max_changes
  2020-08-11 20:51 ` [PATCH v3 00/14] more miscellaneous Bloom filter improvements Taylor Blau
                     ` (8 preceding siblings ...)
  2020-08-11 20:51   ` [PATCH v3 08/14] bloom: use provided 'struct bloom_filter_settings' Taylor Blau
@ 2020-08-11 20:51   ` Taylor Blau
  2020-08-11 20:52   ` [PATCH v3 10/14] commit-graph.c: sort index into commits list Taylor Blau
                     ` (4 subsequent siblings)
  14 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-08-11 20:51 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, szeder.dev, gitster

From: Derrick Stolee <dstolee@microsoft.com>

Commit e3696980 (diff: halt tree-diff early after max_changes,
2020-03-30) intended to create a mechanism to short-circuit a diff
calculation after a certain number of paths were modified. By
incrementing a "num_changes" counter throughout the recursive
ll_diff_tree_paths(), this was supposed to match the number of changes
that would be written into the changed-path Bloom filters.
Unfortunately, this was not implemented correctly: the counter misses
simple cases like file modifications. As a result, very large
changed-path filters are still written (unless the commit adds or
removes many files).

To start, change the implementation in ll_diff_tree_paths() to instead
use the global diff_queued_diff struct's 'nr' member as the count. This
is a way to simplify the logic instead of making more mistakes in the
complicated diff code.

This has a drawback: the diff_queued_diff struct only lists the paths
corresponding to blob changes, not their leading directories. Thus,
get_or_compute_bloom_filter() needs an additional check to see if the
hashmap with the leading directories becomes too large.
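
The gap between the two counts can be seen with a small standalone
sketch that collects each changed path together with its leading
directories. It uses a fixed-size array with linear search where the
real code uses a hashmap, and every 'sketch_' name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX_ENTRIES 64
#define SKETCH_MAX_PATH 128

/* Tiny set of distinct paths (linear search; fine for a demo). */
struct sketch_pathset {
	char paths[SKETCH_MAX_ENTRIES][SKETCH_MAX_PATH];
	size_t nr;
};

/* Add the first 'len' bytes of 'path' unless already present. */
static void sketch_add(struct sketch_pathset *set, const char *path,
		       size_t len)
{
	size_t i;

	assert(len < SKETCH_MAX_PATH);
	for (i = 0; i < set->nr; i++)
		if (strlen(set->paths[i]) == len &&
		    !memcmp(set->paths[i], path, len))
			return;

	assert(set->nr < SKETCH_MAX_ENTRIES);
	memcpy(set->paths[set->nr], path, len);
	set->paths[set->nr][len] = '\0';
	set->nr++;
}

/* Add 'path' itself and each of its leading directories. */
void sketch_add_with_leading_dirs(struct sketch_pathset *set,
				  const char *path)
{
	const char *slash = path;

	while ((slash = strchr(slash, '/'))) {
		sketch_add(set, path, (size_t)(slash - path));
		slash++;
	}
	sketch_add(set, path, strlen(path));
}
```

Feeding it the seven files added by the first commit of the updated
test yields eleven entries (seven files plus the directories d, d/e,
mode and foo), which is why a limit of 10 flags that commit as too
large even though only seven blobs changed.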

One reason why this was not caught by test cases was that the test in
t4216-log-bloom.sh that was supposed to check this "too many changes"
condition only checked this on the initial commit of a repository. The
old logic counted these values correctly. Update this test in a few
ways:

1. Use GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS to reduce the limit,
   allowing smaller commits to engage with this logic.

2. Create several interesting cases of edits, adds, removes, and mode
   changes (in the second commit). By testing both sides of the
   inequality with the *_MAX_CHANGED_PATHS variable, we can see that
   the count is exactly correct, so none of these changes are missed
   or over-counted.

3. Use the trace2 data value filter_found_large to verify that these
   commits are on the correct side of the limit.

Another way to verify the behavior is correct is through performance
tests. By testing on my local copies of the Git repository and the Linux
kernel repository, I could measure the effect of these short-circuits
when computing a fresh commit-graph file with changed-path Bloom filters
using the command

  GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=N time \
    git commit-graph write --reachable --changed-paths

and reporting the wall time and resulting commit-graph size.

For Git, the results are

|        |      N=1       |       N=10     |      N=512     |
|--------|----------------|----------------|----------------|
| HEAD~1 | 10.90s  9.18MB | 11.11s  9.34MB | 11.31s  9.35MB |
| HEAD   |  9.21s  8.62MB | 11.11s  9.29MB | 11.29s  9.34MB |

For Linux, the results are

|        |       N=1      |     N=20      |     N=512     |
|--------|----------------|---------------|---------------|
| HEAD~1 | 61.28s  64.3MB | 76.9s  72.6MB | 77.6s  72.6MB |
| HEAD   | 49.44s  56.3MB | 68.7s  65.9MB | 69.2s  65.9MB |

Naturally, the improvement shrinks as the limit grows, since fewer
commits trigger the short-circuit.

Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 bloom.c              |  6 +++-
 diff.h               |  2 --
 t/t4216-log-bloom.sh | 85 +++++++++++++++++++++++++++++++++++++++-----
 tree-diff.c          |  5 +--
 4 files changed, 82 insertions(+), 16 deletions(-)

diff --git a/bloom.c b/bloom.c
index 0cf1962dc5..ed54e96e57 100644
--- a/bloom.c
+++ b/bloom.c
@@ -222,7 +222,7 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 		diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
 	diffcore_std(&diffopt);
 
-	if (diffopt.num_changes <= settings->max_changed_paths) {
+	if (diff_queued_diff.nr <= settings->max_changed_paths) {
 		struct hashmap pathmap;
 		struct pathmap_hash_entry *e;
 		struct hashmap_iter iter;
@@ -259,6 +259,9 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 			diff_free_filepair(diff_queued_diff.queue[i]);
 		}
 
+		if (hashmap_get_size(&pathmap) > settings->max_changed_paths)
+			goto cleanup;
+
 		filter->len = (hashmap_get_size(&pathmap) * settings->bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
 		filter->data = xcalloc(filter->len, sizeof(unsigned char));
 
@@ -268,6 +271,7 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 			add_key_to_filter(&key, filter, settings);
 		}
 
+	cleanup:
 		hashmap_free_entries(&pathmap, struct pathmap_hash_entry, entry);
 	} else {
 		for (i = 0; i < diff_queued_diff.nr; i++)
diff --git a/diff.h b/diff.h
index e0c0af6286..1d32b71885 100644
--- a/diff.h
+++ b/diff.h
@@ -287,8 +287,6 @@ struct diff_options {
 
 	/* If non-zero, then stop computing after this many changes. */
 	int max_changes;
-	/* For internal use only. */
-	int num_changes;
 
 	int ita_invisible_in_index;
 /* white-space error highlighting */
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index eb2bcc51f0..21b67677ef 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -177,20 +177,87 @@ test_expect_success 'persist filter settings' '
 '
 
 test_expect_success 'correctly report changes over limit' '
-	git init 513changes &&
+	git init limits &&
 	(
-		cd 513changes &&
-		for i in $(test_seq 1 513)
+		cd limits &&
+		mkdir d &&
+		mkdir d/e &&
+
+		for i in $(test_seq 1 2)
 		do
-			echo $i >file$i.txt || return 1
+			printf $i >d/file$i.txt &&
+			printf $i >d/e/file$i.txt || return 1
 		done &&
-		git add . &&
+
+		mkdir mode &&
+		printf bash >mode/script.sh &&
+
+		mkdir foo &&
+		touch foo/bar &&
+		touch foo.txt &&
+
+		git add d foo foo.txt mode &&
 		git commit -m "files" &&
-		git commit-graph write --reachable --changed-paths &&
-		for i in $(test_seq 1 513)
+
+		# Commit has 7 file and 4 directory adds
+		GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=10 \
+			GIT_TRACE2_EVENT="$(pwd)/trace" \
+			git commit-graph write --reachable --changed-paths &&
+		grep "\"max_changed_paths\":10" trace &&
+		grep "\"filter_found_large\":1" trace &&
+
+		for path in $(git ls-tree -r --name-only HEAD)
 		do
-			git -c core.commitGraph=false log -- file$i.txt >expect &&
-			git log -- file$i.txt >actual &&
+			git -c commitGraph.readChangedPaths=false log \
+				-- $path >expect &&
+			git log -- $path >actual &&
+			test_cmp expect actual || return 1
+		done &&
+
+		# Make a variety of path changes
+		printf new1 >d/e/file1.txt &&
+		printf new2 >d/file2.txt &&
+		rm d/e/file2.txt &&
+		rm -r foo &&
+		printf text >foo &&
+		mkdir f &&
+		printf new1 >f/file1.txt &&
+
+		# including a mode-only change (counts as modified)
+		git update-index --chmod=+x mode/script.sh &&
+
+		git add foo d f &&
+		git commit -m "complicated" &&
+
+		# start from scratch and rebuild
+		rm -f .git/objects/info/commit-graph &&
+		GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=10 \
+			GIT_TRACE2_EVENT="$(pwd)/trace-edit" \
+			git commit-graph write --reachable --changed-paths &&
+		grep "\"max_changed_paths\":10" trace-edit &&
+		grep "\"filter_found_large\":2" trace-edit &&
+
+		for path in $(git ls-tree -r --name-only HEAD)
+		do
+			git -c commitGraph.readChangedPaths=false log \
+				-- $path >expect &&
+			git log -- $path >actual &&
+			test_cmp expect actual || return 1
+		done &&
+
+		# start from scratch and rebuild
+		rm -f .git/objects/info/commit-graph &&
+		GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=11 \
+			GIT_TRACE2_EVENT="$(pwd)/trace-update" \
+			git commit-graph write --reachable --changed-paths &&
+		grep "\"max_changed_paths\":11" trace-update &&
+		grep "\"filter_found_large\":0" trace-update &&
+
+		for path in $(git ls-tree -r --name-only HEAD)
+		do
+			git -c commitGraph.readChangedPaths=false log \
+				-- $path >expect &&
+			git log -- $path >actual &&
 			test_cmp expect actual || return 1
 		done
 	)
diff --git a/tree-diff.c b/tree-diff.c
index 6ebad1a46f..7cebbb327e 100644
--- a/tree-diff.c
+++ b/tree-diff.c
@@ -434,7 +434,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
 		if (diff_can_quit_early(opt))
 			break;
 
-		if (opt->max_changes && opt->num_changes > opt->max_changes)
+		if (opt->max_changes && diff_queued_diff.nr > opt->max_changes)
 			break;
 
 		if (opt->pathspec.nr) {
@@ -521,7 +521,6 @@ static struct combine_diff_path *ll_diff_tree_paths(
 
 			/* t↓ */
 			update_tree_entry(&t);
-			opt->num_changes++;
 		}
 
 		/* t > p[imin] */
@@ -539,7 +538,6 @@ static struct combine_diff_path *ll_diff_tree_paths(
 		skip_emit_tp:
 			/* ∀ pi=p[imin]  pi↓ */
 			update_tp_entries(tp, nparent);
-			opt->num_changes++;
 		}
 	}
 
@@ -557,7 +555,6 @@ struct combine_diff_path *diff_tree_paths(
 	const struct object_id **parents_oid, int nparent,
 	struct strbuf *base, struct diff_options *opt)
 {
-	opt->num_changes = 0;
 	p = ll_diff_tree_paths(p, oid, parents_oid, nparent, base, opt);
 
 	/*
-- 
2.28.0.rc1.13.ge78abce653



* [PATCH v3 10/14] commit-graph.c: sort index into commits list
  2020-08-11 20:51 ` [PATCH v3 00/14] more miscellaneous Bloom filter improvements Taylor Blau
                     ` (9 preceding siblings ...)
  2020-08-11 20:51   ` [PATCH v3 09/14] bloom/diff: properly short-circuit on max_changes Taylor Blau
@ 2020-08-11 20:52   ` Taylor Blau
  2020-08-11 20:52   ` [PATCH v3 12/14] commit-graph: add large-filters bitmap chunk Taylor Blau
                     ` (3 subsequent siblings)
  14 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-08-11 20:52 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, szeder.dev, gitster

For locality, 'compute_bloom_filters()' sorts the commits for which it
wants to compute Bloom filters in a preferred order (cf., 3d11275505
(commit-graph: examine commits by generation number, 2020-03-30) for
details).

A future patch will want to recover the new graph position of each
commit. Since the 'packed_commit_list' already stores a double-pointer,
avoid a 'COPY_ARRAY' and instead keep track of an index into the
original list. (Use an integer index instead of a memory address, since
the latter would involve a needlessly confusing triple-pointer).

Alter the two sorting routines 'commit_pos_cmp' and 'commit_gen_cmp' to
take into account the packed_commit_list they are sorting with respect
to. Since 'compute_bloom_filters()' is the only caller for each of those
comparison functions, no other call-sites need updating.
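
As a rough illustration of the idea (not the code in this patch), the
sketch below sorts an array of integer indexes with respect to the list
they point into. 'struct item', 'index_gen_cmp()' and 'sort_indexes()'
are hypothetical names, the ascending order is chosen only for the
example, and plain qsort() with a file-scope pointer stands in for the
context argument that the patch passes through QSORT_S:

```c
#include <stdlib.h>

/* Hypothetical stand-in for 'struct commit'; only the sort key matters. */
struct item {
	unsigned int generation;
};

/*
 * Plain qsort() has no context argument, so this sketch stashes the
 * list in a file-scope pointer. The patch hands it to the comparison
 * function through QSORT_S instead.
 */
static const struct item *sort_ctx;

/* Compare two indexes by the generation of the items they refer to. */
int index_gen_cmp(const void *va, const void *vb)
{
	const struct item *a = &sort_ctx[*(const int *)va];
	const struct item *b = &sort_ctx[*(const int *)vb];

	if (a->generation < b->generation)
		return -1;
	if (a->generation > b->generation)
		return 1;
	return 0;
}

/* Fill 'order' with 0..nr-1, then sort it; 'list' itself is not moved. */
void sort_indexes(const struct item *list, int nr, int *order)
{
	int i;

	for (i = 0; i < nr; i++)
		order[i] = i;
	sort_ctx = list;
	qsort(order, nr, sizeof(*order), index_gen_cmp);
}
```

After sorting, 'order[i]' is still a position in the original list, which
is the property a later patch relies on.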

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 commit-graph.c | 43 ++++++++++++++++++++++++-------------------
 1 file changed, 24 insertions(+), 19 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 48d4697f54..0d70545149 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -79,10 +79,18 @@ static void set_commit_pos(struct repository *r, const struct object_id *oid)
 	*commit_pos_at(&commit_pos, commit) = max_pos++;
 }
 
-static int commit_pos_cmp(const void *va, const void *vb)
+struct packed_commit_list {
+	struct commit **list;
+	int nr;
+	int alloc;
+};
+
+static int commit_pos_cmp(const void *va, const void *vb, void *ctx)
 {
-	const struct commit *a = *(const struct commit **)va;
-	const struct commit *b = *(const struct commit **)vb;
+	struct packed_commit_list *commits = ctx;
+
+	const struct commit *a = commits->list[*(int *)va];
+	const struct commit *b = commits->list[*(int *)vb];
 	return commit_pos_at(&commit_pos, a) -
 	       commit_pos_at(&commit_pos, b);
 }
@@ -139,10 +147,12 @@ static struct commit_graph_data *commit_graph_data_at(const struct commit *c)
 	return data;
 }
 
-static int commit_gen_cmp(const void *va, const void *vb)
+static int commit_gen_cmp(const void *va, const void *vb, void *ctx)
 {
-	const struct commit *a = *(const struct commit **)va;
-	const struct commit *b = *(const struct commit **)vb;
+	struct packed_commit_list *commits = ctx;
+
+	const struct commit *a = commits->list[*(int *)va];
+	const struct commit *b = commits->list[*(int *)vb];
 
 	uint32_t generation_a = commit_graph_generation(a);
 	uint32_t generation_b = commit_graph_generation(b);
@@ -922,11 +932,6 @@ struct tree *get_commit_tree_in_graph(struct repository *r, const struct commit
 	return get_commit_tree_in_graph_one(r, r->objects->commit_graph, c);
 }
 
-struct packed_commit_list {
-	struct commit **list;
-	int nr;
-	int alloc;
-};
 
 struct packed_oid_list {
 	struct object_id *list;
@@ -1408,7 +1413,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 {
 	int i;
 	struct progress *progress = NULL;
-	struct commit **sorted_commits;
+	int *sorted_commits;
 
 	init_bloom_filters();
 
@@ -1418,16 +1423,16 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 			ctx->commits.nr);
 
 	ALLOC_ARRAY(sorted_commits, ctx->commits.nr);
-	COPY_ARRAY(sorted_commits, ctx->commits.list, ctx->commits.nr);
-
-	if (ctx->order_by_pack)
-		QSORT(sorted_commits, ctx->commits.nr, commit_pos_cmp);
-	else
-		QSORT(sorted_commits, ctx->commits.nr, commit_gen_cmp);
+	for (i = 0; i < ctx->commits.nr; i++)
+		sorted_commits[i] = i;
+	QSORT_S(sorted_commits, ctx->commits.nr,
+		ctx->order_by_pack ? commit_pos_cmp : commit_gen_cmp,
+		&ctx->commits);
 
 	for (i = 0; i < ctx->commits.nr; i++) {
 		int computed = 0;
-		struct commit *c = sorted_commits[i];
+		int pos = sorted_commits[i];
+		struct commit *c = ctx->commits.list[pos];
 		struct bloom_filter *filter = get_or_compute_bloom_filter(
 			ctx->r,
 			c,
-- 
2.28.0.rc1.13.ge78abce653



* [PATCH v3 12/14] commit-graph: add large-filters bitmap chunk
  2020-08-11 20:51 ` [PATCH v3 00/14] more miscellaneous Bloom filter improvements Taylor Blau
                     ` (10 preceding siblings ...)
  2020-08-11 20:52   ` [PATCH v3 10/14] commit-graph.c: sort index into commits list Taylor Blau
@ 2020-08-11 20:52   ` Taylor Blau
  2020-08-11 21:11     ` Derrick Stolee
                       ` (2 more replies)
  2020-08-11 20:52   ` [PATCH v3 13/14] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
                     ` (2 subsequent siblings)
  14 siblings, 3 replies; 117+ messages in thread
From: Taylor Blau @ 2020-08-11 20:52 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, szeder.dev, gitster

When a commit has more than a certain number of changed paths (commonly
512), the commit-graph machinery represents it as a zero-length filter.
This is done since having many entries in the Bloom filter has
undesirable effects on the false-positive rate.

In addition to these too-large filters, the commit-graph machinery also
represents commits with no filter and commits with no changed paths in
the same way.

When writing a commit-graph that aggregates several incremental
commit-graph layers (eg., with '--split=replace'), the commit-graph
machinery first computes all of the Bloom filters that it wants to write
but does not already know about from existing graph layers. Because we
overload the zero-length filter in the above fashion, this leads to
recomputing large filters over and over again.

This is already undesirable, since it means that we are wasting
considerable effort to discover that a commit has too many changed
paths, only to throw that effort away (and then repeat the process the
next time a roll-up is performed).

In a subsequent patch, we will add a '--max-new-filters=<n>' option,
which specifies an upper-bound on the number of new filters we are
willing to compute from scratch. Suppose that there are 'N' too-large
filters, and we specify '--max-new-filters=M'. If 'N >= M', it is
unlikely that any filters will be generated, since we'll spend most of
our effort on filters that we ultimately throw away. If 'N < M', filters
will trickle in over time, but only at most 'M - N' per-write.

To address this, add a new chunk which encodes a bitmap where the ith
bit is on iff the ith commit has zero or more than 512 changed paths.
Likewise store the maximum number of changed paths we are willing to
store in order to prepare for eventually making this value more easily
customizable. When computing Bloom filters, first consult the relevant
bitmap (in the case that we are rolling up existing layers) to see if
computing the Bloom filter from scratch would be a waste of time.

This patch implements a new chunk instead of extending the existing BIDX
and BDAT chunks because modifying these chunks would confuse old
clients. (Eg., setting the most-significant bit in the BIDX chunk would
confuse old clients and require a version bump).

To allow using the existing bitmap code with 64-bit words, we write the
data in network byte order from the 64-bit words. This means we also
need to read the array from the commit-graph file by translating each
word from network byte order using get_be64() when loading the commit
graph. (Note that this *could* be delayed until first-use, but a later
patch will rely on this being initialized early, so we assume the
up-front cost when parsing instead of delaying initialization).
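
To make the layout concrete, here is a minimal sketch of a bitmap built
from 64-bit words that are serialized in network byte order. It is not
the ewah code used by this patch; all of the function names below are
hypothetical, and the bit numbering (bit 'pos % 64' of word 'pos / 64')
is an assumption made for the example:

```c
#include <stddef.h>
#include <stdint.h>

#define BITS_PER_WORD 64

/* Number of 64-bit words needed to hold one bit per commit. */
size_t words_for_commits(size_t nr_commits)
{
	return (nr_commits + BITS_PER_WORD - 1) / BITS_PER_WORD;
}

/* Turn on the bit for the commit at position 'pos'. */
void bit_set(uint64_t *words, size_t pos)
{
	words[pos / BITS_PER_WORD] |= (uint64_t)1 << (pos % BITS_PER_WORD);
}

/* Return 1 if the bit for the commit at position 'pos' is on, else 0. */
int bit_get(const uint64_t *words, size_t pos)
{
	return (int)((words[pos / BITS_PER_WORD] >> (pos % BITS_PER_WORD)) & 1);
}

/* Store one word in network (big-endian) byte order. */
void put_word_be64(unsigned char *out, uint64_t word)
{
	int i;

	for (i = 0; i < 8; i++)
		out[i] = (unsigned char)(word >> (56 - 8 * i));
}

/* Load one word that was stored in network byte order. */
uint64_t get_word_be64(const unsigned char *in)
{
	uint64_t word = 0;
	int i;

	for (i = 0; i < 8; i++)
		word = (word << 8) | in[i];
	return word;
}
```

Writing each word with a fixed byte order keeps the on-disk chunk
independent of the host's endianness, at the cost of one conversion per
word when the graph is parsed.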

By avoiding the need to move to new versions of the BDAT and BIDX chunk,
we can give ourselves more time to consider whether or not other
modifications to these chunks are worthwhile without holding up this
change.

Another approach would be to introduce a new BIDX chunk (say, one
identified by 'BID2') which is identical to the existing BIDX chunk,
except the most-significant bit of each offset is interpreted as "this
filter is too big" iff looking at a BID2 chunk. This avoids having to
write a bitmap, but forces older clients to rewrite their commit-graphs,
reduces the theoretical largest Bloom filters we could write, and forces
us to maintain the code necessary to translate BIDX chunks to BID2
ones. Separately from this patch, I implemented this alternate approach
and did not find it to be advantageous.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 .../technical/commit-graph-format.txt         | 12 +++
 bloom.h                                       |  2 +-
 commit-graph.c                                | 96 ++++++++++++++++---
 commit-graph.h                                |  4 +
 t/t4216-log-bloom.sh                          | 25 ++++-
 5 files changed, 124 insertions(+), 15 deletions(-)

diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index 440541045d..5f2d9ab4d7 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -123,6 +123,18 @@ CHUNK DATA:
       of length zero.
     * The BDAT chunk is present if and only if BIDX is present.
 
+  Large Bloom Filters (ID: {'B', 'F', 'X', 'L'}) [Optional]
+    * It starts with a 32-bit unsigned integer specifying the maximum number of
+      changed-paths that can be stored in a single Bloom filter.
+    * It then contains a list of 64-bit words (the length of this list is
+      determined by the width of the chunk) which is a bitmap. The 'i'th bit is
+      set exactly when the 'i'th commit in the graph has a changed-path Bloom
+      filter with zero entries (either because the commit is empty, or because
+      it contains more than 512 changed paths).
+    * The BFXL chunk is present only when the BIDX and BDAT chunks are
+      also present.
+
+
   Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional]
       This list of H-byte hashes describe a set of B commit-graph files that
       form a commit-graph chain. The graph position for the ith commit in this
diff --git a/bloom.h b/bloom.h
index 3f19e3fca4..464d9b57de 100644
--- a/bloom.h
+++ b/bloom.h
@@ -33,7 +33,7 @@ struct bloom_filter_settings {
 	 * The maximum number of changed paths per commit
 	 * before declaring a Bloom filter to be too-large.
 	 *
-	 * Not written to the commit-graph file.
+	 * Written to the 'BFXL' chunk (instead of 'BDAT').
 	 */
 	uint32_t max_changed_paths;
 };
diff --git a/commit-graph.c b/commit-graph.c
index 8964453433..ea0583298c 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -41,8 +41,9 @@ void git_test_write_commit_graph_or_die(void)
 #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
 #define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
 #define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
+#define GRAPH_CHUNKID_BLOOMLARGE 0x4246584c /* "BFXL" */
 #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
-#define MAX_NUM_CHUNKS 7
+#define MAX_NUM_CHUNKS 8
 
 #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
 
@@ -429,6 +430,24 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 				graph->bloom_filter_settings->bits_per_entry = get_be32(data + chunk_offset + 8);
 			}
 			break;
+
+		case GRAPH_CHUNKID_BLOOMLARGE:
+			if (graph->chunk_bloom_large_filters)
+				chunk_repeated = 1;
+			else if (r->settings.commit_graph_read_changed_paths) {
+				size_t alloc = get_be64(chunk_lookup + 4) - chunk_offset - sizeof(uint32_t);
+				graph->chunk_bloom_large_filters = data + chunk_offset + sizeof(uint32_t);
+				graph->bloom_filter_settings->max_changed_paths = get_be32(data + chunk_offset);
+				if (alloc) {
+					size_t j;
+					graph->bloom_large = bitmap_word_alloc(alloc);
+
+					for (j = 0; j < graph->bloom_large->word_alloc; j++)
+						graph->bloom_large->words[j] = get_be64(
+							graph->chunk_bloom_large_filters + j * sizeof(eword_t));
+				}
+			}
+			break;
 		}
 
 		if (chunk_repeated) {
@@ -443,7 +462,9 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 		/* We need both the bloom chunks to exist together. Else ignore the data */
 		graph->chunk_bloom_indexes = NULL;
 		graph->chunk_bloom_data = NULL;
+		graph->chunk_bloom_large_filters = NULL;
 		FREE_AND_NULL(graph->bloom_filter_settings);
+		bitmap_free(graph->bloom_large);
 	}
 
 	hashcpy(graph->oid.hash, graph->data + graph->data_len - graph->hash_len);
@@ -932,6 +953,20 @@ struct tree *get_commit_tree_in_graph(struct repository *r, const struct commit
 	return get_commit_tree_in_graph_one(r, r->objects->commit_graph, c);
 }
 
+static int get_bloom_filter_large_in_graph(struct commit_graph *g,
+					   const struct commit *c)
+{
+	uint32_t graph_pos = commit_graph_position(c);
+	if (graph_pos == COMMIT_NOT_FROM_GRAPH)
+		return 0;
+
+	while (g && graph_pos < g->num_commits_in_base)
+		g = g->base_graph;
+
+	if (!(g && g->bloom_large))
+		return 0;
+	return bitmap_get(g->bloom_large, graph_pos - g->num_commits_in_base);
+}
 
 struct packed_oid_list {
 	struct object_id *list;
@@ -970,8 +1005,10 @@ struct write_commit_graph_context {
 	size_t total_bloom_filter_data_size;
 	const struct bloom_filter_settings *bloom_settings;
 
+	int count_bloom_filter_known_large;
 	int count_bloom_filter_found_large;
 	int count_bloom_filter_computed;
+	struct bitmap *bloom_large;
 };
 
 static int write_graph_chunk_fanout(struct hashfile *f,
@@ -1235,6 +1272,23 @@ static int write_graph_chunk_bloom_data(struct hashfile *f,
 	return 0;
 }
 
+static int write_graph_chunk_bloom_large(struct hashfile *f,
+					 struct write_commit_graph_context *ctx)
+{
+	size_t i, alloc = ctx->commits.nr / BITS_IN_EWORD;
+	if (ctx->commits.nr % BITS_IN_EWORD)
+		alloc++;
+	if (alloc > ctx->bloom_large->word_alloc)
+		BUG("write_graph_chunk_bloom_large: bitmap not large enough");
+
+	trace2_region_enter("commit-graph", "bloom_large", ctx->r);
+	hashwrite_be32(f, ctx->bloom_settings->max_changed_paths);
+	for (i = 0; i < ctx->bloom_large->word_alloc; i++)
+		hashwrite_be64(f, ctx->bloom_large->words[i]);
+	trace2_region_leave("commit-graph", "bloom_large", ctx->r);
+	return 0;
+}
+
 static int oid_compare(const void *_a, const void *_b)
 {
 	const struct object_id *a = (const struct object_id *)_a;
@@ -1398,6 +1452,8 @@ static void trace2_bloom_filter_write_statistics(struct write_commit_graph_conte
 	struct json_writer jw = JSON_WRITER_INIT;
 
 	jw_object_begin(&jw, 0);
+	jw_object_intmax(&jw, "filter_known_large",
+			 ctx->count_bloom_filter_known_large);
 	jw_object_intmax(&jw, "filter_found_large",
 			 ctx->count_bloom_filter_found_large);
 	jw_object_intmax(&jw, "filter_computed",
@@ -1416,6 +1472,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 	int *sorted_commits;
 
 	init_bloom_filters();
+	ctx->bloom_large = bitmap_word_alloc(ctx->commits.nr / BITS_IN_EWORD + 1);
 
 	if (ctx->report_progress)
 		progress = start_delayed_progress(
@@ -1430,21 +1487,28 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 		&ctx->commits);
 
 	for (i = 0; i < ctx->commits.nr; i++) {
-		int computed = 0;
 		int pos = sorted_commits[i];
 		struct commit *c = ctx->commits.list[pos];
-		struct bloom_filter *filter = get_or_compute_bloom_filter(
-			ctx->r,
-			c,
-			1,
-			ctx->bloom_settings,
-			&computed);
-		if (computed) {
-			ctx->count_bloom_filter_computed++;
-			if (filter && !filter->len)
-				ctx->count_bloom_filter_found_large++;
+		if (get_bloom_filter_large_in_graph(ctx->r->objects->commit_graph, c)) {
+			bitmap_set(ctx->bloom_large, pos);
+			ctx->count_bloom_filter_known_large++;
+		} else {
+			int computed = 0;
+			struct bloom_filter *filter = get_or_compute_bloom_filter(
+				ctx->r,
+				c,
+				1,
+				ctx->bloom_settings,
+				&computed);
+			if (computed) {
+				ctx->count_bloom_filter_computed++;
+				if (filter && !filter->len) {
+					bitmap_set(ctx->bloom_large, pos);
+					ctx->count_bloom_filter_found_large++;
+				}
+			}
+			ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
 		}
-		ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
 		display_progress(progress, i + 1);
 	}
 
@@ -1764,6 +1828,11 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 					  + ctx->total_bloom_filter_data_size;
 		chunks[num_chunks].write_fn = write_graph_chunk_bloom_data;
 		num_chunks++;
+		chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMLARGE;
+		chunks[num_chunks].size = sizeof(eword_t) * ctx->bloom_large->word_alloc
+					+ sizeof(uint32_t);
+		chunks[num_chunks].write_fn = write_graph_chunk_bloom_large;
+		num_chunks++;
 	}
 	if (ctx->num_commit_graphs_after > 1) {
 		chunks[num_chunks].id = GRAPH_CHUNKID_BASE;
@@ -2503,6 +2572,7 @@ void free_commit_graph(struct commit_graph *g)
 	}
 	free(g->filename);
 	free(g->bloom_filter_settings);
+	bitmap_free(g->bloom_large);
 	free(g);
 }
 
diff --git a/commit-graph.h b/commit-graph.h
index d9acb22bac..ddbca1b59d 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -4,6 +4,7 @@
 #include "git-compat-util.h"
 #include "object-store.h"
 #include "oidset.h"
+#include "ewah/ewok.h"
 
 #define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
 #define GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE "GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE"
@@ -71,6 +72,9 @@ struct commit_graph {
 	const unsigned char *chunk_base_graphs;
 	const unsigned char *chunk_bloom_indexes;
 	const unsigned char *chunk_bloom_data;
+	const unsigned char *chunk_bloom_large_filters;
+
+	struct bitmap *bloom_large;
 
 	struct bloom_filter_settings *bloom_filter_settings;
 };
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 21b67677ef..6859d85369 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -33,7 +33,7 @@ test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
 	git commit-graph write --reachable --changed-paths
 '
 graph_read_expect () {
-	NUM_CHUNKS=5
+	NUM_CHUNKS=6
 	cat >expect <<- EOF
 	header: 43475048 1 1 $NUM_CHUNKS 0
 	num_commits: $1
@@ -262,5 +262,28 @@ test_expect_success 'correctly report changes over limit' '
 		done
 	)
 '
+test_bloom_filters_computed () {
+	commit_graph_args=$1
+	bloom_trace_prefix="{\"filter_known_large\":$2,\"filter_found_large\":$3,\"filter_computed\":$4"
+	rm -f "$TRASH_DIRECTORY/trace.event" &&
+	GIT_TRACE2_EVENT="$TRASH_DIRECTORY/trace.event" git commit-graph write $commit_graph_args &&
+	grep "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.event"
+}
+
+test_expect_success 'Bloom generation does not recompute too-large filters' '
+	(
+		cd limits &&
+
+		# start from scratch and rebuild
+		rm -f .git/objects/info/commit-graph &&
+		GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=10 \
+			git commit-graph write --reachable --changed-paths \
+			--split=replace &&
+		test_commit c1 filter &&
+
+		test_bloom_filters_computed "--reachable --changed-paths --split=replace" \
+			2 0 1
+	)
+'
 
 test_done
-- 
2.28.0.rc1.13.ge78abce653



* [PATCH v3 13/14] commit-graph: rename 'split_commit_graph_opts'
  2020-08-11 20:51 ` [PATCH v3 00/14] more miscellaneous Bloom filter improvements Taylor Blau
                     ` (11 preceding siblings ...)
  2020-08-11 20:52   ` [PATCH v3 12/14] commit-graph: add large-filters bitmap chunk Taylor Blau
@ 2020-08-11 20:52   ` Taylor Blau
  2020-08-19  9:56     ` SZEDER Gábor
  2020-08-11 20:52   ` [PATCH v3 14/14] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
  2020-09-03 21:45   ` [PATCH v3 00/14] more miscellaneous Bloom filter improvements Junio C Hamano
  14 siblings, 1 reply; 117+ messages in thread
From: Taylor Blau @ 2020-08-11 20:52 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, szeder.dev, gitster

In the subsequent commit, additional options will be added to the
commit-graph API which have nothing to do with splitting.

Rename the 'split_commit_graph_opts' structure to the more-generic
'commit_graph_opts' to encompass both.

Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/commit-graph.c | 20 ++++++++++----------
 commit-graph.c         | 40 ++++++++++++++++++++--------------------
 commit-graph.h         |  6 +++---
 3 files changed, 33 insertions(+), 33 deletions(-)

diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index ba5584463f..38f5f57d15 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -119,7 +119,7 @@ static int graph_verify(int argc, const char **argv)
 }
 
 extern int read_replace_refs;
-static struct split_commit_graph_opts split_opts;
+static struct commit_graph_opts write_opts;
 
 static int write_option_parse_split(const struct option *opt, const char *arg,
 				    int unset)
@@ -187,24 +187,24 @@ static int graph_write(int argc, const char **argv)
 		OPT_BOOL(0, "changed-paths", &opts.enable_changed_paths,
 			N_("enable computation for changed paths")),
 		OPT_BOOL(0, "progress", &opts.progress, N_("force progress reporting")),
-		OPT_CALLBACK_F(0, "split", &split_opts.flags, NULL,
+		OPT_CALLBACK_F(0, "split", &write_opts.flags, NULL,
 			N_("allow writing an incremental commit-graph file"),
 			PARSE_OPT_OPTARG | PARSE_OPT_NONEG,
 			write_option_parse_split),
-		OPT_INTEGER(0, "max-commits", &split_opts.max_commits,
+		OPT_INTEGER(0, "max-commits", &write_opts.max_commits,
 			N_("maximum number of commits in a non-base split commit-graph")),
-		OPT_INTEGER(0, "size-multiple", &split_opts.size_multiple,
+		OPT_INTEGER(0, "size-multiple", &write_opts.size_multiple,
 			N_("maximum ratio between two levels of a split commit-graph")),
-		OPT_EXPIRY_DATE(0, "expire-time", &split_opts.expire_time,
+		OPT_EXPIRY_DATE(0, "expire-time", &write_opts.expire_time,
 			N_("only expire files older than a given date-time")),
 		OPT_END(),
 	};
 
 	opts.progress = isatty(2);
 	opts.enable_changed_paths = -1;
-	split_opts.size_multiple = 2;
-	split_opts.max_commits = 0;
-	split_opts.expire_time = 0;
+	write_opts.size_multiple = 2;
+	write_opts.max_commits = 0;
+	write_opts.expire_time = 0;
 
 	trace2_cmd_mode("write");
 
@@ -232,7 +232,7 @@ static int graph_write(int argc, const char **argv)
 	odb = find_odb(the_repository, opts.obj_dir);
 
 	if (opts.reachable) {
-		if (write_commit_graph_reachable(odb, flags, &split_opts))
+		if (write_commit_graph_reachable(odb, flags, &write_opts))
 			return 1;
 		return 0;
 	}
@@ -261,7 +261,7 @@ static int graph_write(int argc, const char **argv)
 			       opts.stdin_packs ? &pack_indexes : NULL,
 			       opts.stdin_commits ? &commits : NULL,
 			       flags,
-			       &split_opts))
+			       &write_opts))
 		result = 1;
 
 cleanup:
diff --git a/commit-graph.c b/commit-graph.c
index ea0583298c..6886f319a5 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1001,7 +1001,7 @@ struct write_commit_graph_context {
 		 changed_paths:1,
 		 order_by_pack:1;
 
-	const struct split_commit_graph_opts *split_opts;
+	const struct commit_graph_opts *opts;
 	size_t total_bloom_filter_data_size;
 	const struct bloom_filter_settings *bloom_settings;
 
@@ -1342,8 +1342,8 @@ static void close_reachable(struct write_commit_graph_context *ctx)
 {
 	int i;
 	struct commit *commit;
-	enum commit_graph_split_flags flags = ctx->split_opts ?
-		ctx->split_opts->flags : COMMIT_GRAPH_SPLIT_UNSPECIFIED;
+	enum commit_graph_split_flags flags = ctx->opts ?
+		ctx->opts->flags : COMMIT_GRAPH_SPLIT_UNSPECIFIED;
 
 	if (ctx->report_progress)
 		ctx->progress = start_delayed_progress(
@@ -1543,7 +1543,7 @@ static int add_ref_to_set(const char *refname,
 
 int write_commit_graph_reachable(struct object_directory *odb,
 				 enum commit_graph_write_flags flags,
-				 const struct split_commit_graph_opts *split_opts)
+				 const struct commit_graph_opts *opts)
 {
 	struct oidset commits = OIDSET_INIT;
 	struct refs_cb_data data;
@@ -1560,7 +1560,7 @@ int write_commit_graph_reachable(struct object_directory *odb,
 	stop_progress(&data.progress);
 
 	result = write_commit_graph(odb, NULL, &commits,
-				    flags, split_opts);
+				    flags, opts);
 
 	oidset_clear(&commits);
 	return result;
@@ -1675,8 +1675,8 @@ static uint32_t count_distinct_commits(struct write_commit_graph_context *ctx)
 static void copy_oids_to_commits(struct write_commit_graph_context *ctx)
 {
 	uint32_t i;
-	enum commit_graph_split_flags flags = ctx->split_opts ?
-		ctx->split_opts->flags : COMMIT_GRAPH_SPLIT_UNSPECIFIED;
+	enum commit_graph_split_flags flags = ctx->opts ?
+		ctx->opts->flags : COMMIT_GRAPH_SPLIT_UNSPECIFIED;
 
 	ctx->num_extra_edges = 0;
 	if (ctx->report_progress)
@@ -1962,13 +1962,13 @@ static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
 	int max_commits = 0;
 	int size_mult = 2;
 
-	if (ctx->split_opts) {
-		max_commits = ctx->split_opts->max_commits;
+	if (ctx->opts) {
+		max_commits = ctx->opts->max_commits;
 
-		if (ctx->split_opts->size_multiple)
-			size_mult = ctx->split_opts->size_multiple;
+		if (ctx->opts->size_multiple)
+			size_mult = ctx->opts->size_multiple;
 
-		flags = ctx->split_opts->flags;
+		flags = ctx->opts->flags;
 	}
 
 	g = ctx->r->objects->commit_graph;
@@ -2146,8 +2146,8 @@ static void expire_commit_graphs(struct write_commit_graph_context *ctx)
 	size_t dirnamelen;
 	timestamp_t expire_time = time(NULL);
 
-	if (ctx->split_opts && ctx->split_opts->expire_time)
-		expire_time = ctx->split_opts->expire_time;
+	if (ctx->opts && ctx->opts->expire_time)
+		expire_time = ctx->opts->expire_time;
 	if (!ctx->split) {
 		char *chain_file_name = get_chain_filename(ctx->odb);
 		unlink(chain_file_name);
@@ -2198,7 +2198,7 @@ int write_commit_graph(struct object_directory *odb,
 		       struct string_list *pack_indexes,
 		       struct oidset *commits,
 		       enum commit_graph_write_flags flags,
-		       const struct split_commit_graph_opts *split_opts)
+		       const struct commit_graph_opts *opts)
 {
 	struct write_commit_graph_context *ctx;
 	uint32_t i, count_distinct = 0;
@@ -2215,7 +2215,7 @@ int write_commit_graph(struct object_directory *odb,
 	ctx->append = flags & COMMIT_GRAPH_WRITE_APPEND ? 1 : 0;
 	ctx->report_progress = flags & COMMIT_GRAPH_WRITE_PROGRESS ? 1 : 0;
 	ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
-	ctx->split_opts = split_opts;
+	ctx->opts = opts;
 	ctx->total_bloom_filter_data_size = 0;
 
 	bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
@@ -2263,15 +2263,15 @@ int write_commit_graph(struct object_directory *odb,
 			}
 		}
 
-		if (ctx->split_opts)
-			replace = ctx->split_opts->flags & COMMIT_GRAPH_SPLIT_REPLACE;
+		if (ctx->opts)
+			replace = ctx->opts->flags & COMMIT_GRAPH_SPLIT_REPLACE;
 	}
 
 	ctx->approx_nr_objects = approximate_object_count();
 	ctx->oids.alloc = ctx->approx_nr_objects / 32;
 
-	if (ctx->split && split_opts && ctx->oids.alloc > split_opts->max_commits)
-		ctx->oids.alloc = split_opts->max_commits;
+	if (ctx->split && opts && ctx->oids.alloc > opts->max_commits)
+		ctx->oids.alloc = opts->max_commits;
 
 	if (ctx->append) {
 		prepare_commit_graph_one(ctx->r, ctx->odb);
diff --git a/commit-graph.h b/commit-graph.h
index ddbca1b59d..af08c4505d 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -109,7 +109,7 @@ enum commit_graph_split_flags {
 	COMMIT_GRAPH_SPLIT_REPLACE          = 2
 };
 
-struct split_commit_graph_opts {
+struct commit_graph_opts {
 	int size_multiple;
 	int max_commits;
 	timestamp_t expire_time;
@@ -124,12 +124,12 @@ struct split_commit_graph_opts {
  */
 int write_commit_graph_reachable(struct object_directory *odb,
 				 enum commit_graph_write_flags flags,
-				 const struct split_commit_graph_opts *split_opts);
+				 const struct commit_graph_opts *opts);
 int write_commit_graph(struct object_directory *odb,
 		       struct string_list *pack_indexes,
 		       struct oidset *commits,
 		       enum commit_graph_write_flags flags,
-		       const struct split_commit_graph_opts *split_opts);
+		       const struct commit_graph_opts *opts);
 
 #define COMMIT_GRAPH_VERIFY_SHALLOW	(1 << 0)
 
-- 
2.28.0.rc1.13.ge78abce653



* [PATCH v3 14/14] builtin/commit-graph.c: introduce '--max-new-filters=<n>'
  2020-08-11 20:51 ` [PATCH v3 00/14] more miscellaneous Bloom filter improvements Taylor Blau
                     ` (12 preceding siblings ...)
  2020-08-11 20:52   ` [PATCH v3 13/14] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
@ 2020-08-11 20:52   ` Taylor Blau
  2020-08-12 11:49     ` SZEDER Gábor
                       ` (4 more replies)
  2020-09-03 21:45   ` [PATCH v3 00/14] more miscellaneous Bloom filter improvements Junio C Hamano
  14 siblings, 5 replies; 117+ messages in thread
From: Taylor Blau @ 2020-08-11 20:52 UTC (permalink / raw)
  To: git; +Cc: peff, dstolee, szeder.dev, gitster

Introduce a command-line flag and configuration variable to fill in the
'max_new_filters' variable introduced by the previous patch.

The command-line option '--max-new-filters' takes precedence over
'commitGraph.maxNewFilters', which supplies the default value.
'--no-max-new-filters' can also be provided, which sets the value back
to '-1', indicating that an unlimited number of new Bloom filters may be
generated. (OPT_INTEGER only allows setting the '--no-' variant back to
'0', hence a custom callback was used instead).
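
The precedence described above can be sketched as a small pure function.
This is only an illustration: 'enum cli_state' and
'resolve_max_new_filters()' are hypothetical names that do not appear in
the patch, and falling back to "unlimited" when neither the option nor
the configuration is given is an assumption of the sketch:

```c
/* '-1' means that an unlimited number of new filters may be generated. */
#define MAX_NEW_FILTERS_UNLIMITED (-1)

/* How (if at all) the option appeared on the command line. */
enum cli_state {
	CLI_ABSENT,	/* option not given */
	CLI_VALUE,	/* --max-new-filters=<n> */
	CLI_NEGATED	/* --no-max-new-filters */
};

/*
 * Command line first, then configuration, then (assumed) unlimited.
 * 'have_config' is non-zero when commitGraph.maxNewFilters is set.
 */
int resolve_max_new_filters(enum cli_state cli, int cli_value,
			    int have_config, int config_value)
{
	switch (cli) {
	case CLI_VALUE:
		return cli_value;
	case CLI_NEGATED:
		return MAX_NEW_FILTERS_UNLIMITED;
	default:
		break;
	}
	return have_config ? config_value : MAX_NEW_FILTERS_UNLIMITED;
}
```

Note that the negated form yields '-1' rather than '0', which is the
behavior that OPT_INTEGER alone cannot provide.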

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config/commitgraph.txt |  4 +++
 Documentation/git-commit-graph.txt   |  4 +++
 bloom.c                              | 15 +++++++++++
 builtin/commit-graph.c               | 39 +++++++++++++++++++++++++---
 commit-graph.c                       | 27 ++++++++++++++++---
 commit-graph.h                       |  1 +
 t/t4216-log-bloom.sh                 | 19 ++++++++++++++
 7 files changed, 102 insertions(+), 7 deletions(-)

diff --git a/Documentation/config/commitgraph.txt b/Documentation/config/commitgraph.txt
index cff0797b54..4582c39fc4 100644
--- a/Documentation/config/commitgraph.txt
+++ b/Documentation/config/commitgraph.txt
@@ -1,3 +1,7 @@
+commitGraph.maxNewFilters::
+	Specifies the default value for the `--max-new-filters` option of `git
+	commit-graph write` (c.f., linkgit:git-commit-graph[1]).
+
 commitGraph.readChangedPaths::
 	If true, then git will use the changed-path Bloom filters in the
 	commit-graph file (if it exists, and they are present). Defaults to
diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index 17405c73a9..9c887d5d79 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -67,6 +67,10 @@ this option is given, future commit-graph writes will automatically assume
 that this option was intended. Use `--no-changed-paths` to stop storing this
 data.
 +
+With the `--max-new-filters=<n>` option, generate at most `n` new Bloom
+filters (if `--changed-paths` is specified). If `n` is `-1`, no limit is
+enforced. Overrides the `commitGraph.maxNewFilters` configuration.
++
 With the `--split[=<strategy>]` option, write the commit-graph as a
 chain of multiple commit-graph files stored in
 `<dir>/info/commit-graphs`. Commit-graph layers are merged based on the
diff --git a/bloom.c b/bloom.c
index ed54e96e57..8d07209c6b 100644
--- a/bloom.c
+++ b/bloom.c
@@ -51,6 +51,21 @@ static int load_bloom_filter_from_graph(struct commit_graph *g,
 	else
 		start_index = 0;
 
+	if ((start_index == end_index) &&
+	    (g->bloom_large && !bitmap_get(g->bloom_large, lex_pos))) {
+		/*
+		 * If the filter is zero-length, either (1) the filter has no
+		 * changes, (2) the filter has too many changes, or (3) it
+		 * wasn't computed (e.g., due to '--max-new-filters').
+		 *
+		 * If either (1) or (2) is the case, the 'large' bit will be set
+		 * for this Bloom filter. If it is unset, then it wasn't
+		 * computed. In that case, return nothing, since we don't have
+		 * that filter in the graph.
+		 */
+		return 0;
+	}
+
 	filter->len = end_index - start_index;
 	filter->data = (unsigned char *)(g->chunk_bloom_data +
 					sizeof(unsigned char) * start_index +
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 38f5f57d15..3500a6e1f1 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -13,7 +13,8 @@ static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph verify [--object-dir <objdir>] [--shallow] [--[no-]progress]"),
 	N_("git commit-graph write [--object-dir <objdir>] [--append] "
 	   "[--split[=<strategy>]] [--reachable|--stdin-packs|--stdin-commits] "
-	   "[--changed-paths] [--[no-]progress] <split options>"),
+	   "[--changed-paths] [--[no-]max-new-filters <n>] [--[no-]progress] "
+	   "<split options>"),
 	NULL
 };
 
@@ -25,7 +26,8 @@ static const char * const builtin_commit_graph_verify_usage[] = {
 static const char * const builtin_commit_graph_write_usage[] = {
 	N_("git commit-graph write [--object-dir <objdir>] [--append] "
 	   "[--split[=<strategy>]] [--reachable|--stdin-packs|--stdin-commits] "
-	   "[--changed-paths] [--[no-]progress] <split options>"),
+	   "[--changed-paths] [--[no-]max-new-filters <n>] [--[no-]progress] "
+	   "<split options>"),
 	NULL
 };
 
@@ -162,6 +164,23 @@ static int read_one_commit(struct oidset *commits, struct progress *progress,
 	return 0;
 }
 
+static int write_option_max_new_filters(const struct option *opt,
+					const char *arg,
+					int unset)
+{
+	int *to = opt->value;
+	if (unset)
+		*to = -1;
+	else {
+		const char *s;
+		*to = strtol(arg, (char **)&s, 10);
+		if (*s)
+			return error(_("%s expects a numerical value"),
+				     optname(opt, opt->flags));
+	}
+	return 0;
+}
+
 static int graph_write(int argc, const char **argv)
 {
 	struct string_list pack_indexes = STRING_LIST_INIT_NODUP;
@@ -197,6 +216,9 @@ static int graph_write(int argc, const char **argv)
 			N_("maximum ratio between two levels of a split commit-graph")),
 		OPT_EXPIRY_DATE(0, "expire-time", &write_opts.expire_time,
 			N_("only expire files older than a given date-time")),
+		OPT_CALLBACK_F(0, "max-new-filters", &write_opts.max_new_filters,
+			NULL, N_("maximum number of changed-path Bloom filters to compute"),
+			0, write_option_max_new_filters),
 		OPT_END(),
 	};
 
@@ -205,6 +227,7 @@ static int graph_write(int argc, const char **argv)
 	write_opts.size_multiple = 2;
 	write_opts.max_commits = 0;
 	write_opts.expire_time = 0;
+	write_opts.max_new_filters = -1;
 
 	trace2_cmd_mode("write");
 
@@ -270,6 +293,16 @@ static int graph_write(int argc, const char **argv)
 	return result;
 }
 
+static int git_commit_graph_config(const char *var, const char *value, void *cb)
+{
+	if (!strcmp(var, "commitgraph.maxnewfilters")) {
+		write_opts.max_new_filters = git_config_int(var, value);
+		return 0;
+	}
+
+	return git_default_config(var, value, cb);
+}
+
 int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 {
 	static struct option builtin_commit_graph_options[] = {
@@ -283,7 +316,7 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 		usage_with_options(builtin_commit_graph_usage,
 				   builtin_commit_graph_options);
 
-	git_config(git_default_config, NULL);
+	git_config(git_commit_graph_config, &opts);
 	argc = parse_options(argc, argv, prefix,
 			     builtin_commit_graph_options,
 			     builtin_commit_graph_usage,
diff --git a/commit-graph.c b/commit-graph.c
index 6886f319a5..4aae5471e3 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -954,7 +954,8 @@ struct tree *get_commit_tree_in_graph(struct repository *r, const struct commit
 }
 
 static int get_bloom_filter_large_in_graph(struct commit_graph *g,
-					   const struct commit *c)
+					   const struct commit *c,
+					   uint32_t max_changed_paths)
 {
 	uint32_t graph_pos = commit_graph_position(c);
 	if (graph_pos == COMMIT_NOT_FROM_GRAPH)
@@ -965,6 +966,17 @@ static int get_bloom_filter_large_in_graph(struct commit_graph *g,
 
 	if (!(g && g->bloom_large))
 		return 0;
+	if (g->bloom_filter_settings->max_changed_paths != max_changed_paths) {
+		/*
+		 * Force all commits which are subject to a different
+		 * 'max_changed_paths' limit to be recomputed from scratch.
+		 *
+		 * Note that this could likely be improved, but is ignored since
+		 * all real-world graphs set the maximum number of changed paths
+		 * at 512.
+		 */
+		return 0;
+	}
 	return bitmap_get(g->bloom_large, graph_pos - g->num_commits_in_base);
 }
 
@@ -1470,6 +1482,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 	int i;
 	struct progress *progress = NULL;
 	int *sorted_commits;
+	int max_new_filters;
 
 	init_bloom_filters();
 	ctx->bloom_large = bitmap_word_alloc(ctx->commits.nr / BITS_IN_EWORD + 1);
@@ -1486,10 +1499,15 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 		ctx->order_by_pack ? commit_pos_cmp : commit_gen_cmp,
 		&ctx->commits);
 
+	max_new_filters = ctx->opts->max_new_filters >= 0 ?
+		ctx->opts->max_new_filters : ctx->commits.nr;
+
 	for (i = 0; i < ctx->commits.nr; i++) {
 		int pos = sorted_commits[i];
 		struct commit *c = ctx->commits.list[pos];
-		if (get_bloom_filter_large_in_graph(ctx->r->objects->commit_graph, c)) {
+		if (get_bloom_filter_large_in_graph(ctx->r->objects->commit_graph,
+						    c,
+						    ctx->bloom_settings->max_changed_paths)) {
 			bitmap_set(ctx->bloom_large, pos);
 			ctx->count_bloom_filter_known_large++;
 		} else {
@@ -1497,7 +1515,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 			struct bloom_filter *filter = get_or_compute_bloom_filter(
 				ctx->r,
 				c,
-				1,
+				ctx->count_bloom_filter_computed < max_new_filters,
 				ctx->bloom_settings,
 				&computed);
 			if (computed) {
@@ -1507,7 +1525,8 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 					ctx->count_bloom_filter_found_large++;
 				}
 			}
-			ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
+			if (filter)
+				ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
 		}
 		display_progress(progress, i + 1);
 	}
diff --git a/commit-graph.h b/commit-graph.h
index af08c4505d..75ef83708b 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -114,6 +114,7 @@ struct commit_graph_opts {
 	int max_commits;
 	timestamp_t expire_time;
 	enum commit_graph_split_flags flags;
+	int max_new_filters;
 };
 
 /*
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 6859d85369..3aab8ffbe3 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -286,4 +286,23 @@ test_expect_success 'Bloom generation does not recompute too-large filters' '
 	)
 '
 
+test_expect_success 'Bloom generation is limited by --max-new-filters' '
+	(
+		cd limits &&
+		test_commit c2 filter &&
+		test_commit c3 filter &&
+		test_commit c4 no-filter &&
+		test_bloom_filters_computed "--reachable --changed-paths --split=replace --max-new-filters=2" \
+			2 0 2
+	)
+'
+
+test_expect_success 'Bloom generation backfills previously-skipped filters' '
+	(
+		cd limits &&
+		test_bloom_filters_computed "--reachable --changed-paths --split=replace --max-new-filters=1" \
+			2 0 1
+	)
+'
+
 test_done
-- 
2.28.0.rc1.13.ge78abce653
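
The option handling described in the commit message can be sketched in
isolation. This is a minimal, standalone illustration and not the code
from the patch: 'parse_max_new_filters()' and 'effective_budget()' are
hypothetical names, and unlike the callback above the sketch also rejects
an empty argument and out-of-range values. It shows the two ideas the
patch relies on: the '--no-' variant resets the limit to -1, and a
negative limit is treated as "compute a filter for every commit".

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for the option callback: parse a decimal limit,
 * or reset it to -1 ("unlimited") when the '--no-' variant was given.
 * Returns 0 on success and -1 if 'arg' is not a clean integer.
 */
int parse_max_new_filters(const char *arg, int unset, int *to)
{
	char *end;
	long val;

	assert(to != NULL);
	if (unset) {
		*to = -1;
		return 0;
	}
	if (!arg || !*arg)
		return -1;

	errno = 0;
	val = strtol(arg, &end, 10);
	/* Reject overflow, trailing garbage, and values that do not fit an int. */
	if (errno || *end || val < INT_MIN || val > INT_MAX)
		return -1;
	*to = (int)val;
	return 0;
}

/*
 * A negative limit means "no limit", so the budget becomes the number
 * of commits being written.
 */
int effective_budget(int max_new_filters, int nr_commits)
{
	return max_new_filters >= 0 ? max_new_filters : nr_commits;
}
```

With a budget resolved this way, the write loop only has to compare the
number of filters computed so far against it, which is what the
'count_bloom_filter_computed < max_new_filters' expression in the patch
does.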


* Re: [PATCH v3 12/14] commit-graph: add large-filters bitmap chunk
  2020-08-11 20:52   ` [PATCH v3 12/14] commit-graph: add large-filters bitmap chunk Taylor Blau
@ 2020-08-11 21:11     ` Derrick Stolee
  2020-08-11 21:18       ` Taylor Blau
  2020-08-19 13:35     ` SZEDER Gábor
  2020-09-01 14:35     ` SZEDER Gábor
  2 siblings, 1 reply; 117+ messages in thread
From: Derrick Stolee @ 2020-08-11 21:11 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: peff, dstolee, szeder.dev, gitster

On 8/11/20 4:52 PM, Taylor Blau wrote:
> To allow using the existing bitmap code with 64-bit words, we write the
> data in network byte order from the 64-bit words. This means we also
> need to read the array from the commit-graph file by translating each
> word from network byte order using get_be64() when loading the commit
> graph. (Note that this *could* be delayed until first-use, but a later
> patch will rely on this being initialized early, so we assume the
> up-front cost when parsing instead of delaying initialization).

I don't think this is correct to do. This means that every commit-graph
load will load the entire large bloom filter chunk into memory before
allowing a single commit to be read from the graph.

In the case of a very large commit-graph (> 1 million commits), this
would cause a significant initial cost that is not necessarily needed
for anything. For example, "git log -1" will be delayed by this action.

If I understand correctly, this bloom_large bitmap is only used when
writing Bloom filters. At that point, reading the entire bitmap from
disk into memory is inexpensive compared to the time saved by the
feature.

> @@ -429,6 +430,24 @@ struct commit_graph *parse_commit_graph(struct repository *r,
>  				graph->bloom_filter_settings->bits_per_entry = get_be32(data + chunk_offset + 8);
>  			}
>  			break;
> +
> +		case GRAPH_CHUNKID_BLOOMLARGE:
> +			if (graph->chunk_bloom_large_filters)
> +				chunk_repeated = 1;
> +			else if (r->settings.commit_graph_read_changed_paths) {

This is guarded against commitGraph.readChangedPaths, which is good,
but that's not enough to claim that we need the bloom_large bitmap
in this process.

> +				size_t alloc = get_be64(chunk_lookup + 4) - chunk_offset - sizeof(uint32_t);

If we store this inside the commit_graph struct, we can save this
size for later so...

> +				graph->chunk_bloom_large_filters = data + chunk_offset + sizeof(uint32_t);
> +				graph->bloom_filter_settings->max_changed_paths = get_be32(data + chunk_offset);

...this portion can be done only when we are about to read from the
bitmap.

> +				if (alloc) {
> +					size_t j;
> +					graph->bloom_large = bitmap_word_alloc(alloc);
> +
> +					for (j = 0; j < graph->bloom_large->word_alloc; j++)
> +						graph->bloom_large->words[j] = get_be64(
> +							graph->chunk_bloom_large_filters + j * sizeof(eword_t));
> +				}



> +			}
> +			break;
>  		}
>  
>  		if (chunk_repeated) {
> @@ -443,7 +462,9 @@ struct commit_graph *parse_commit_graph(struct repository *r,
>  		/* We need both the bloom chunks to exist together. Else ignore the data */
>  		graph->chunk_bloom_indexes = NULL;
>  		graph->chunk_bloom_data = NULL;
> +		graph->chunk_bloom_large_filters = NULL;
>  		FREE_AND_NULL(graph->bloom_filter_settings);
> +		bitmap_free(graph->bloom_large);

Perhaps this bitmap_free() needs a check to see if we
initialized it (with my recommended lazy-loading), but
maybe not?

>  	}
>  
>  	hashcpy(graph->oid.hash, graph->data + graph->data_len - graph->hash_len);
> @@ -932,6 +953,20 @@ struct tree *get_commit_tree_in_graph(struct repository *r, const struct commit
>  	return get_commit_tree_in_graph_one(r, r->objects->commit_graph, c);
>  }
>  
> +static int get_bloom_filter_large_in_graph(struct commit_graph *g,
> +					   const struct commit *c)
> +{
> +	uint32_t graph_pos = commit_graph_position(c);
> +	if (graph_pos == COMMIT_NOT_FROM_GRAPH)
> +		return 0;
> +
> +	while (g && graph_pos < g->num_commits_in_base)
> +		g = g->base_graph;
> +
> +	if (!(g && g->bloom_large))
> +		return 0;

Here is where we can check if the size of the chunk is
non-zero and if the g->bloom_large bitmap is uninitialized.
Since we are caring about this information, it is now time
to do the network-byte transition of the full bitmap.
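
A rough sketch of the lazy loading suggested here, under simplifying
assumptions: 'struct graph_sketch', 'ensure_large_loaded()' and
'large_bit()' are hypothetical names that stand in for the relevant
parts of 'struct commit_graph' and the ewah bitmap code, and the chunk
size is assumed to have been recorded (in 64-bit words) at parse time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for the relevant commit_graph fields. */
struct graph_sketch {
	const unsigned char *chunk_large; /* raw big-endian words in the file */
	size_t chunk_large_words;         /* number of 64-bit words in the chunk */
	uint64_t *large_words;            /* decoded lazily; NULL until first use */
};

/* Decode one big-endian 64-bit word (the job get_be64() does in git). */
static uint64_t read_be64(const unsigned char *p)
{
	uint64_t v = 0;
	int i;

	for (i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Decode the whole chunk on first use; returns 0 if nothing can be loaded. */
int ensure_large_loaded(struct graph_sketch *g)
{
	size_t j;

	assert(g != NULL);
	if (g->large_words)
		return 1;
	if (!g->chunk_large || !g->chunk_large_words)
		return 0;

	g->large_words = malloc(g->chunk_large_words * sizeof(uint64_t));
	if (!g->large_words)
		return 0;
	for (j = 0; j < g->chunk_large_words; j++)
		g->large_words[j] = read_be64(g->chunk_large + j * 8);
	return 1;
}

/* Query one bit, paying the decoding cost only when a caller asks. */
int large_bit(struct graph_sketch *g, size_t pos)
{
	if (!ensure_large_loaded(g) || pos / 64 >= g->chunk_large_words)
		return 0;
	return (int)((g->large_words[pos / 64] >> (pos % 64)) & 1);
}
```

The point of the design is that commands which never ask about the
bitmap, such as "git log -1", never touch the chunk beyond recording its
location and size.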

> +	return bitmap_get(g->bloom_large, graph_pos - g->num_commits_in_base);
> +}
>  
Thanks,
-Stolee


* Re: [PATCH v3 12/14] commit-graph: add large-filters bitmap chunk
  2020-08-11 21:11     ` Derrick Stolee
@ 2020-08-11 21:18       ` Taylor Blau
  2020-08-11 22:05         ` Taylor Blau
  0 siblings, 1 reply; 117+ messages in thread
From: Taylor Blau @ 2020-08-11 21:18 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Taylor Blau, git, peff, dstolee, szeder.dev, gitster

On Tue, Aug 11, 2020 at 05:11:49PM -0400, Derrick Stolee wrote:
> On 8/11/20 4:52 PM, Taylor Blau wrote:
> > To allow using the existing bitmap code with 64-bit words, we write the
> > data in network byte order from the 64-bit words. This means we also
> > need to read the array from the commit-graph file by translating each
> > word from network byte order using get_be64() when loading the commit
> > graph. (Note that this *could* be delayed until first-use, but a later
> > patch will rely on this being initialized early, so we assume the
> > up-front cost when parsing instead of delaying initialization).
>
> I don't think this is correct to do. This means that every commit-graph
> load will load the entire large bloom filter chunk into memory before
> allowing a single commit to be read from the graph.
>
> In the case of a very large commit-graph (> 1 million commits), this
> would cause a significant initial cost that is not necessarily needed
> for anything. For example, "git log -1" will be delayed by this action.
>
> If I understand correctly, this bloom_large bitmap is only used when
> writing Bloom filters. At that point, reading the entire bitmap from
> disk into memory is inexpensive compared to the time saved by the
> feature.

That's not quite correct. By this point in the patch series, we only use
this bitmap for writing, but in the final patch, we will use it in
'load_bloom_filter_from_graph()' (see that patch for the details of why,
there is an explanatory comment to that effect).

So, the way that this lazy initialization was written was subtly
incorrect, since calling 'load_bloom_filter_from_graph' would return the
wrong value depending on what was or wasn't called before it (namely
whether or not we had initialized the bitmap).
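
For reference, the reason the bitmap matters on the read side can be
sketched as a small classification function. This is an illustration of
the logic described in the comment in the final patch, not the code
itself; the enum and 'classify_filter()' are hypothetical names, and the
bitmap lookup is reduced to two integer flags.

```c
#include <assert.h>
#include <stdint.h>

enum filter_state {
	FILTER_HAS_DATA,          /* non-empty filter stored in the data chunk */
	FILTER_ZERO_LENGTH_KNOWN, /* zero-length, treated as a real filter */
	FILTER_MISSING            /* zero-length and never computed */
};

/*
 * 'start_index' and 'end_index' delimit the filter in the data chunk.
 * 'have_large_bitmap' says whether the graph has a large-filters bitmap
 * at all, and 'large_bit' is this commit's bit in it.
 */
enum filter_state classify_filter(uint32_t start_index, uint32_t end_index,
				  int have_large_bitmap, int large_bit)
{
	assert(start_index <= end_index);

	if (start_index != end_index)
		return FILTER_HAS_DATA;
	/*
	 * Zero-length with the bit unset: the filter was skipped (for
	 * example by '--max-new-filters'), so report it as absent.
	 */
	if (have_large_bitmap && !large_bit)
		return FILTER_MISSING;
	/*
	 * Either the bit is set (no changes, or too many changes), or
	 * there is no bitmap and the cases cannot be told apart; in both
	 * situations the zero-length filter is used as-is.
	 */
	return FILTER_ZERO_LENGTH_KNOWN;
}
```

Because the answer depends on the bit, any reader that classifies a
zero-length filter needs the bitmap to be populated before it asks.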

> > @@ -429,6 +430,24 @@ struct commit_graph *parse_commit_graph(struct repository *r,
> >  				graph->bloom_filter_settings->bits_per_entry = get_be32(data + chunk_offset + 8);
> >  			}
> >  			break;
> > +
> > +		case GRAPH_CHUNKID_BLOOMLARGE:
> > +			if (graph->chunk_bloom_large_filters)
> > +				chunk_repeated = 1;
> > +			else if (r->settings.commit_graph_read_changed_paths) {
>
> This is guarded against commitGraph.readChangedPaths, which is good,
> but that's not enough to claim that we need the bloom_large bitmap
> in this process.
>
> > +				size_t alloc = get_be64(chunk_lookup + 4) - chunk_offset - sizeof(uint32_t);
>
> If we store this inside the commit_graph struct, we can save this
> size for later so...
>
> > +				graph->chunk_bloom_large_filters = data + chunk_offset + sizeof(uint32_t);
> > +				graph->bloom_filter_settings->max_changed_paths = get_be32(data + chunk_offset);
>
> ...this portion can be done only when we are about to read from the
> bitmap.

Right, that all is clear.

>
> > +				if (alloc) {
> > +					size_t j;
> > +					graph->bloom_large = bitmap_word_alloc(alloc);
> > +
> > +					for (j = 0; j < graph->bloom_large->word_alloc; j++)
> > +						graph->bloom_large->words[j] = get_be64(
> > +							graph->chunk_bloom_large_filters + j * sizeof(eword_t));
> > +				}
>
>
>
> > +			}
> > +			break;
> >  		}
> >
> >  		if (chunk_repeated) {
> > @@ -443,7 +462,9 @@ struct commit_graph *parse_commit_graph(struct repository *r,
> >  		/* We need both the bloom chunks to exist together. Else ignore the data */
> >  		graph->chunk_bloom_indexes = NULL;
> >  		graph->chunk_bloom_data = NULL;
> > +		graph->chunk_bloom_large_filters = NULL;
> >  		FREE_AND_NULL(graph->bloom_filter_settings);
> > +		bitmap_free(graph->bloom_large);
>
> Perhaps this bitmap_free() needs a check to see if we
> initialized it (with my recommended lazy-loading), but
> maybe not?

It does not ('bitmap_free()' treats a NULL argument as a no-op, like
'free()').

>
> >  	}
> >
> >  	hashcpy(graph->oid.hash, graph->data + graph->data_len - graph->hash_len);
> > @@ -932,6 +953,20 @@ struct tree *get_commit_tree_in_graph(struct repository *r, const struct commit
> >  	return get_commit_tree_in_graph_one(r, r->objects->commit_graph, c);
> >  }
> >
> > +static int get_bloom_filter_large_in_graph(struct commit_graph *g,
> > +					   const struct commit *c)
> > +{
> > +	uint32_t graph_pos = commit_graph_position(c);
> > +	if (graph_pos == COMMIT_NOT_FROM_GRAPH)
> > +		return 0;
> > +
> > +	while (g && graph_pos < g->num_commits_in_base)
> > +		g = g->base_graph;
> > +
> > +	if (!(g && g->bloom_large))
> > +		return 0;
>
> Here is where we can check if the size of the chunk is
> non-zero and if the g->bloom_large bitmap is uninitialized.
> Since we are caring about this information, it is now time
> to do the network-byte transition of the full bitmap.

Yes, that's right. But again, we'll need to move this out to a
non-static helper function so that we can call it from bloom.c in the
final patch.

I'm admittedly a little unsure of what to do next, since the changes are
scoped to this and the remaining patches (really just this and the final
patch).

I guess I'll try the other approach: replacing these two patches by
sending new versions in reply, so that Junio can take those (I'll avoid
sending a fixup patch).

> > +	return bitmap_get(g->bloom_large, graph_pos - g->num_commits_in_base);
> > +}
> >
> Thanks,
> -Stolee

Thanks,
Taylor


* Re: [PATCH v3 01/14] commit-graph: introduce 'get_bloom_filter_settings()'
  2020-08-11 20:51   ` [PATCH v3 01/14] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
@ 2020-08-11 21:18     ` SZEDER Gábor
  2020-08-11 21:21       ` Taylor Blau
  0 siblings, 1 reply; 117+ messages in thread
From: SZEDER Gábor @ 2020-08-11 21:18 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, dstolee, gitster

On Tue, Aug 11, 2020 at 04:51:19PM -0400, Taylor Blau wrote:
> Many places in the code often need a pointer to the commit-graph's
> 'struct bloom_filter_settings', in which case they often take the value
> from the top-most commit-graph.
> 
> In the non-split case, this works as expected. In the split case,
> however, things get a little tricky. Not all layers in a chain of
> incremental commit-graphs are required to themselves have Bloom data,
> and so whether or not some part of the code uses Bloom filters depends
> entirely on whether or not the top-most level of the commit-graph chain
> has Bloom filters.
> 
> This has been the behavior since Bloom filters were introduced, and has
> been codified into the tests since a759bfa9ee (t4216: add end to end
> tests for git log with Bloom filters, 2020-04-06). In fact, t4216.130
> requires that Bloom filters are not used in exactly the case described
> earlier.
> 
> There is no reason that this needs to be the case, since it is perfectly
> valid for commits in an earlier layer to have Bloom filters when commits
> in a newer layer do not.
> 
> Since Bloom settings are guaranteed to be the same for any layer in a
> chain that has Bloom data,

Is it?  Where is that guaranteed?

> it is sufficient to traverse the
> '->base_graph' pointer until either (1) a non-null 'struct
> bloom_filter_settings *' is found, or (2) until we are at the root of
> the commit-graph chain.
> 
> Introduce a 'get_bloom_filter_settings()' function that does just this,
> and use it instead of purely dereferencing the top-most graph's
> '->bloom_filter_settings' pointer.
> 
> While we're at it, add an additional test in t5324 to guard against code
> in the commit-graph writing machinery that doesn't correctly handle a
> NULL 'struct bloom_filter *'.
> 
> Co-authored-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Taylor Blau <me@ttaylorr.com>


* Re: [PATCH v3 01/14] commit-graph: introduce 'get_bloom_filter_settings()'
  2020-08-11 21:18     ` SZEDER Gábor
@ 2020-08-11 21:21       ` Taylor Blau
  2020-08-11 21:27         ` SZEDER Gábor
  0 siblings, 1 reply; 117+ messages in thread
From: Taylor Blau @ 2020-08-11 21:21 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: Taylor Blau, git, peff, dstolee, gitster

On Tue, Aug 11, 2020 at 11:18:30PM +0200, SZEDER Gábor wrote:
> On Tue, Aug 11, 2020 at 04:51:19PM -0400, Taylor Blau wrote:
> > Many places in the code often need a pointer to the commit-graph's
> > 'struct bloom_filter_settings', in which case they often take the value
> > from the top-most commit-graph.
> >
> > In the non-split case, this works as expected. In the split case,
> > however, things get a little tricky. Not all layers in a chain of
> > incremental commit-graphs are required to themselves have Bloom data,
> > and so whether or not some part of the code uses Bloom filters depends
> > entirely on whether or not the top-most level of the commit-graph chain
> > has Bloom filters.
> >
> > This has been the behavior since Bloom filters were introduced, and has
> > been codified into the tests since a759bfa9ee (t4216: add end to end
> > tests for git log with Bloom filters, 2020-04-06). In fact, t4216.130
> > requires that Bloom filters are not used in exactly the case described
> > earlier.
> >
> > There is no reason that this needs to be the case, since it is perfectly
> > valid for commits in an earlier layer to have Bloom filters when commits
> > in a newer layer do not.
> >
> > Since Bloom settings are guaranteed to be the same for any layer in a
> > chain that has Bloom data,
>
> Is it?  Where is that guaranteed?

There is no mechanism whatsoever to customize these settings that is
exposed to the user (except for the undocumented 'GIT_TEST' environment
variables).

> > it is sufficient to traverse the
> > '->base_graph' pointer until either (1) a non-null 'struct
> > bloom_filter_settings *' is found, or (2) until we are at the root of
> > the commit-graph chain.
> >
> > Introduce a 'get_bloom_filter_settings()' function that does just this,
> > and use it instead of purely dereferencing the top-most graph's
> > '->bloom_filter_settings' pointer.
> >
> > While we're at it, add an additional test in t5324 to guard against code
> > in the commit-graph writing machinery that doesn't correctly handle a
> > NULL 'struct bloom_filter *'.
> >
> > Co-authored-by: Derrick Stolee <dstolee@microsoft.com>
> > Signed-off-by: Taylor Blau <me@ttaylorr.com>
Thanks,
Taylor


* Re: [PATCH v3 01/14] commit-graph: introduce 'get_bloom_filter_settings()'
  2020-08-11 21:21       ` Taylor Blau
@ 2020-08-11 21:27         ` SZEDER Gábor
  2020-08-11 21:34           ` Taylor Blau
  0 siblings, 1 reply; 117+ messages in thread
From: SZEDER Gábor @ 2020-08-11 21:27 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, dstolee, gitster

On Tue, Aug 11, 2020 at 05:21:18PM -0400, Taylor Blau wrote:
> On Tue, Aug 11, 2020 at 11:18:30PM +0200, SZEDER Gábor wrote:
> > On Tue, Aug 11, 2020 at 04:51:19PM -0400, Taylor Blau wrote:
> > > Many places in the code often need a pointer to the commit-graph's
> > > 'struct bloom_filter_settings', in which case they often take the value
> > > from the top-most commit-graph.
> > >
> > > In the non-split case, this works as expected. In the split case,
> > > however, things get a little tricky. Not all layers in a chain of
> > > incremental commit-graphs are required to themselves have Bloom data,
> > > and so whether or not some part of the code uses Bloom filters depends
> > > entirely on whether or not the top-most level of the commit-graph chain
> > > has Bloom filters.
> > >
> > > This has been the behavior since Bloom filters were introduced, and has
> > > been codified into the tests since a759bfa9ee (t4216: add end to end
> > > tests for git log with Bloom filters, 2020-04-06). In fact, t4216.130
> > > requires that Bloom filters are not used in exactly the case described
> > > earlier.
> > >
> > > There is no reason that this needs to be the case, since it is perfectly
> > > valid for commits in an earlier layer to have Bloom filters when commits
> > > in a newer layer do not.
> > >
> > > Since Bloom settings are guaranteed to be the same for any layer in a
> > > chain that has Bloom data,
> >
> > Is it?  Where is that guaranteed?
> 
> There is no mechanism whatsoever to customize these settings that is
> exposed to the user (except for the undocumented 'GIT_TEST' environment
> variables).

Let me rephrase it, then: where is it written in the commit-graph
format specification that these must be the same in all layers?

Nowhere.

> > > it is sufficient to traverse the
> > > '->base_graph' pointer until either (1) a non-null 'struct
> > > bloom_filter_settings *' is found, or (2) until we are at the root of
> > > the commit-graph chain.
> > >
> > > Introduce a 'get_bloom_filter_settings()' function that does just this,
> > > and use it instead of purely dereferencing the top-most graph's
> > > '->bloom_filter_settings' pointer.
> > >
> > > While we're at it, add an additional test in t5324 to guard against code
> > > in the commit-graph writing machinery that doesn't correctly handle a
> > > NULL 'struct bloom_filter *'.
> > >
> > > Co-authored-by: Derrick Stolee <dstolee@microsoft.com>
> > > Signed-off-by: Taylor Blau <me@ttaylorr.com>
> Thanks,
> Taylor


* Re: [PATCH v3 01/14] commit-graph: introduce 'get_bloom_filter_settings()'
  2020-08-11 21:27         ` SZEDER Gábor
@ 2020-08-11 21:34           ` Taylor Blau
  2020-08-11 23:55             ` SZEDER Gábor
  0 siblings, 1 reply; 117+ messages in thread
From: Taylor Blau @ 2020-08-11 21:34 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: Taylor Blau, git, peff, dstolee, gitster

On Tue, Aug 11, 2020 at 11:27:16PM +0200, SZEDER Gábor wrote:
> On Tue, Aug 11, 2020 at 05:21:18PM -0400, Taylor Blau wrote:
> > On Tue, Aug 11, 2020 at 11:18:30PM +0200, SZEDER Gábor wrote:
> > > On Tue, Aug 11, 2020 at 04:51:19PM -0400, Taylor Blau wrote:
> > > > Many places in the code often need a pointer to the commit-graph's
> > > > 'struct bloom_filter_settings', in which case they often take the value
> > > > from the top-most commit-graph.
> > > >
> > > > In the non-split case, this works as expected. In the split case,
> > > > however, things get a little tricky. Not all layers in a chain of
> > > > incremental commit-graphs are required to themselves have Bloom data,
> > > > and so whether or not some part of the code uses Bloom filters depends
> > > > entirely on whether or not the top-most level of the commit-graph chain
> > > > has Bloom filters.
> > > >
> > > > This has been the behavior since Bloom filters were introduced, and has
> > > > been codified into the tests since a759bfa9ee (t4216: add end to end
> > > > tests for git log with Bloom filters, 2020-04-06). In fact, t4216.130
> > > > requires that Bloom filters are not used in exactly the case described
> > > > earlier.
> > > >
> > > > There is no reason that this needs to be the case, since it is perfectly
> > > > valid for commits in an earlier layer to have Bloom filters when commits
> > > > in a newer layer do not.
> > > >
> > > > Since Bloom settings are guaranteed to be the same for any layer in a
> > > > chain that has Bloom data,
> > >
> > > Is it?  Where is that guaranteed?
> >
> > There is no mechanism whatsoever to customize these settings that is
> > exposed to the user (except for the undocumented 'GIT_TEST' environment
> > variables).
>
> Let me rephrase it, then: where is it written in the commit-graph
> format specification that these must be the same in all layers?
>
> Nowhere.

OK. We can certainly document that this is the case. For this purpose,
all we really care about is that the graph _has_ Bloom filters anywhere.
If you wanted to return the exact matching settings, you could also
provide a commit and return the settings belonging to the graph that
contains that commit.

In the case where we don't have a commit, we could use the default
settings instead.
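
To make the two options concrete, here is a rough sketch of both
lookups. Everything in it is hypothetical: the struct and function names
are simplified stand-ins rather than the commit-graph API, 'NO_POS'
plays the role of "we don't have a commit", and the values in
'default_settings' are placeholders.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinel for "no commit was provided". */
#define NO_POS UINT32_MAX

struct settings_sketch {
	uint32_t hash_version;
	uint32_t num_hashes;
	uint32_t bits_per_entry;
};

struct layer_sketch {
	struct layer_sketch *base_graph; /* next layer towards the root */
	uint32_t num_commits_in_base;    /* commits held by the layers below */
	const struct settings_sketch *bloom_filter_settings; /* NULL: no Bloom data */
};

/* Placeholder defaults, used only when no commit is available. */
const struct settings_sketch default_settings = { 1, 7, 10 };

/* What this patch does: return the first settings found in the chain. */
const struct settings_sketch *settings_any(const struct layer_sketch *g)
{
	for (; g; g = g->base_graph)
		if (g->bloom_filter_settings)
			return g->bloom_filter_settings;
	return NULL;
}

/*
 * The stricter alternative: return the settings of the layer that holds
 * graph position 'pos', which may be NULL if that layer has no Bloom
 * data. Without a commit, fall back to the defaults.
 */
const struct settings_sketch *settings_for_pos(const struct layer_sketch *g,
					       uint32_t pos)
{
	if (pos == NO_POS)
		return &default_settings;
	while (g && pos < g->num_commits_in_base)
		g = g->base_graph;
	return g ? g->bloom_filter_settings : NULL;
}
```

The first lookup answers "does this chain have Bloom filters anywhere?",
which is all this patch needs; the second would only matter if layers
could carry different settings.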

I think that we are, to some extent, dealing with a problem that doesn't
exist, since the sole method by which you would change these settings is
undocumented. So, maybe we can think more about this, but my preference
would be to leave this patch alone.

Maybe Stolee can chime in, too.

> > > > it is sufficient to traverse the
> > > > '->base_graph' pointer until either (1) a non-null 'struct
> > > > bloom_filter_settings *' is found, or (2) until we are at the root of
> > > > the commit-graph chain.
> > > >
> > > > Introduce a 'get_bloom_filter_settings()' function that does just this,
> > > > and use it instead of purely dereferencing the top-most graph's
> > > > '->bloom_filter_settings' pointer.
> > > >
> > > > While we're at it, add an additional test in t5324 to guard against code
> > > > in the commit-graph writing machinery that doesn't correctly handle a
> > > > NULL 'struct bloom_filter *'.
> > > >
> > > > Co-authored-by: Derrick Stolee <dstolee@microsoft.com>
> > > > Signed-off-by: Taylor Blau <me@ttaylorr.com>
> > Thanks,
> > Taylor
Thanks,
Taylor


* Re: [PATCH v3 12/14] commit-graph: add large-filters bitmap chunk
  2020-08-11 21:18       ` Taylor Blau
@ 2020-08-11 22:05         ` Taylor Blau
  0 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-08-11 22:05 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Derrick Stolee, git, peff, dstolee, szeder.dev, gitster

On Tue, Aug 11, 2020 at 05:18:15PM -0400, Taylor Blau wrote:
> On Tue, Aug 11, 2020 at 05:11:49PM -0400, Derrick Stolee wrote:
> > On 8/11/20 4:52 PM, Taylor Blau wrote:
> > > To allow using the existing bitmap code with 64-bit words, we write the
> > > data in network byte order from the 64-bit words. This means we also
> > > need to read the array from the commit-graph file by translating each
> > > word from network byte order using get_be64() when loading the commit
> > > graph. (Note that this *could* be delayed until first-use, but a later
> > > patch will rely on this being initialized early, so we assume the
> > > up-front cost when parsing instead of delaying initialization).
> >
> > I don't think this is correct to do. This means that every commit-graph
> > load will load the entire large bloom filter chunk into memory before
> > allowing a single commit to be read from the graph.
> >
> > In the case of a very large commit-graph (> 1 million commits), this
> > would cause a significant initial cost that is not necessarily needed
> > for anything. For example, "git log -1" will be delayed by this action.
> >
> > If I understand correctly, this bloom_large bitmap is only used when
> > writing Bloom filters. At that point, reading the entire bitmap from
> > disk into memory is inexpensive compared to the time saved by the
> > feature.
>
> That's not quite correct. By this point in the patch series, we only use
> this bitmap for writing, but in the final patch, we will use it in
> 'load_bloom_filter_from_graph()' (see that patch for the details of why,
> there is an explanatory comment to that effect).
>
> So, the way that this lazy initialization was written was subtly
> incorrect, since calling 'load_bloom_filter_from_graph' would return the
> wrong value depending on what was or wasn't called before it (namely
> whether or not we had initialized the bitmap).

Ack, this is totally wrong. I had the wrong impression of where you were
actually initializing the bitmaps, which is what motivated the change.
Whoops, totally my mistake. Here's a version of the patch that does what
it says.

There is a small adjustment that needs to be made to the final patch, so
I can send a version of that, too, if we decide that this is the
direction we want to go (and that there are no other changes to the
series). The change to the final patch basically boils down to using
'get_bloom_filter_large_in_graph()' instead of manually checking the
bitmap (so that the aforementioned function can do its initialization
ahead of time).

Thanks for catching this subtle issue, I'm glad that we got it resolved
before going further.

--- >8 ---

Subject: [PATCH] commit-graph: add large-filters bitmap chunk

When a commit has more than a certain number of changed paths (commonly
512), the commit-graph machinery represents it as a zero-length filter.
This is done since having many entries in the Bloom filter has
undesirable effects on the false-positive rate.

In addition to these too-large filters, the commit-graph machinery also
represents commits with no filter and commits with no changed paths in
the same way.

When writing a commit-graph that aggregates several incremental
commit-graph layers (eg., with '--split=replace'), the commit-graph
machinery first computes all of the Bloom filters that it wants to write
but does not already know about from existing graph layers. Because we
overload the zero-length filter in the above fashion, this leads to
recomputing large filters over and over again.

This is already undesirable, since it means that we are wasting
considerable effort to discover that a commit has too many changed
paths, only to throw that effort away (and then repeat the process the
next time a roll-up is performed).

In a subsequent patch, we will add a '--max-new-filters=<n>' option,
which specifies an upper-bound on the number of new filters we are
willing to compute from scratch. Suppose that there are 'N' too-large
filters, and we specify '--max-new-filters=M'. If 'N >= M', it is
unlikely that any filters will be generated, since we'll spend most of
our effort on filters that we ultimately throw away. If 'N < M', filters
will trickle in over time, but only at most 'M - N' per-write.
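
For illustration only (this is not code from the patch), the budget
arithmetic above can be modelled with a toy loop. The function name and
the 'too_large' array are hypothetical, and the model assumes that every
attempted filter consumes one unit of the budget whether or not its
result is kept:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model: walk commits in order, spending one unit of budget per
 * attempted filter. A too-large commit still costs budget, but its
 * filter is thrown away.
 */
static size_t filters_kept(const int *too_large, size_t nr, size_t max_new)
{
	size_t i, spent = 0, kept = 0;

	assert(too_large || !nr);
	for (i = 0; i < nr && spent < max_new; i++) {
		spent++;		/* budget is consumed either way */
		if (!too_large[i])
			kept++;		/* only these filters are written */
	}
	return kept;
}
```

With the too-large commits visited first (the worst case), N = 3 and
M = 3 keeps nothing, while M = 5 keeps only M - N = 2 filters.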

To address this, add a new chunk which encodes a bitmap where the ith
bit is on iff the ith commit has zero or more than 512 changed paths.
Likewise store the maximum number of changed paths we are willing to
store in order to prepare for eventually making this value more easily
customizable. When computing Bloom filters, first consult the relevant
bitmap (in the case that we are rolling up existing layers) to see if
computing the Bloom filter from scratch would be a waste of time.
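
As a rough sketch of how such a bit disambiguates the overloaded
zero-length filter (the enum, the helper names, and the plain uint64_t
array standing in for the bitmap are all assumptions made for this
example, not the patch's API):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum filter_state {
	FILTER_USABLE,			/* non-empty filter on disk */
	FILTER_EMPTY_OR_TOO_LARGE,	/* zero-length, bit set */
	FILTER_NOT_COMPUTED		/* zero-length, bit unset */
};

/* Assumed layout: bit 'pos' lives in word pos / 64, counted from the LSB. */
static int bit_is_set(const uint64_t *words, size_t pos)
{
	assert(words);
	return (words[pos / 64] >> (pos % 64)) & 1;
}

static enum filter_state classify(const uint64_t *large_bits, size_t pos,
				  size_t filter_len)
{
	if (filter_len)
		return FILTER_USABLE;
	return bit_is_set(large_bits, pos) ?
		FILTER_EMPTY_OR_TOO_LARGE : FILTER_NOT_COMPUTED;
}
```

Only the FILTER_NOT_COMPUTED case is worth spending effort on when
rolling up layers; the other zero-length case is already known to yield
nothing useful.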

This patch implements a new chunk instead of extending the existing BIDX
and BDAT chunks because modifying these chunks would confuse old
clients. (Eg., setting the most-significant bit in the BIDX chunk would
confuse old clients and require a version bump).

To allow using the existing bitmap code with 64-bit words, we write the
data in network byte order from the 64-bit words. This means we also
need to read the array from the commit-graph file by translating each
word from network byte order using get_be64() when loading the commit
graph. Initialize this bitmap lazily to avoid paying a linear-time cost
upon each commit-graph load even if we do not need the bitmaps
themselves.
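
A minimal, self-contained sketch of that decode-on-first-use pattern
follows. 'read_be64()' is a stand-in for get_be64(), and 'struct layer'
is a hypothetical, stripped-down substitute for 'struct commit_graph';
neither is taken from the patch:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for get_be64(): decode 8 bytes in network byte order. */
static uint64_t read_be64(const unsigned char *p)
{
	uint64_t v = 0;
	int i;

	for (i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Hypothetical layer: raw chunk bytes plus a bitmap decoded on demand. */
struct layer {
	const unsigned char *chunk;	/* on-disk words, big-endian */
	size_t nr_words;
	uint64_t *words;		/* NULL until the first lookup */
};

static int large_bit(struct layer *l, size_t pos)
{
	if (!l->words && l->nr_words) {
		size_t i;

		assert(l->chunk);
		l->words = malloc(l->nr_words * sizeof(*l->words));
		if (!l->words)
			abort();
		/* The linear-time cost is paid here, once, and only if asked. */
		for (i = 0; i < l->nr_words; i++)
			l->words[i] = read_be64(l->chunk + i * 8);
	}
	if (!l->words || pos / 64 >= l->nr_words)
		return 0;
	return (l->words[pos / 64] >> (pos % 64)) & 1;
}
```

A caller that never asks for a bit never triggers the loop, which is
the property that matters for commands that do not use Bloom filters.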

By avoiding the need to move to new versions of the BDAT and BIDX chunk,
we can give ourselves more time to consider whether or not other
modifications to these chunks are worthwhile without holding up this
change.

Another approach would be to introduce a new BIDX chunk (say, one
identified by 'BID2') which is identical to the existing BIDX chunk,
except the most-significant bit of each offset is interpreted as "this
filter is too big" iff looking at a BID2 chunk. This avoids having to
write a bitmap, but forces older clients to rewrite their commit-graphs
(as well as reduces the theoretical largest Bloom filters we could
write, and forces us to maintain the code necessary to translate BIDX
chunks to BID2 ones). Separately from this patch, I implemented this
alternate approach and did not find it to be advantageous.
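
To make the trade-off concrete, here is a small sketch of the
flag-in-the-top-bit encoding, assuming 32-bit index entries. The macro
and function names are invented for this example; it is not the
alternate implementation mentioned above:

```c
#include <assert.h>
#include <stdint.h>

#define TOO_LARGE_FLAG ((uint32_t)1 << 31)

static uint32_t encode_entry(uint32_t offset, int too_large)
{
	/* Only 31 bits remain for the offset itself. */
	assert(!(offset & TOO_LARGE_FLAG));
	return too_large ? (offset | TOO_LARGE_FLAG) : offset;
}

static uint32_t entry_offset(uint32_t entry)
{
	return entry & ~TOO_LARGE_FLAG;
}

static int entry_too_large(uint32_t entry)
{
	return (entry & TOO_LARGE_FLAG) != 0;
}
```

A reader unaware of the flag would treat a flagged entry as an offset
of at least 2^31, which is why such a change cannot be made to the
existing chunk without confusing old clients.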

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 .../technical/commit-graph-format.txt         |  12 +++
 bloom.h                                       |   2 +-
 commit-graph.c                                | 100 +++++++++++++++---
 commit-graph.h                                |  10 ++
 t/t4216-log-bloom.sh                          |  25 ++++-
 5 files changed, 134 insertions(+), 15 deletions(-)

diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index 440541045d..5f2d9ab4d7 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -123,6 +123,18 @@ CHUNK DATA:
       of length zero.
     * The BDAT chunk is present if and only if BIDX is present.

+  Large Bloom Filters (ID: {'B', 'F', 'X', 'L'}) [Optional]
+    * It starts with a 32-bit unsigned integer specifying the maximum number of
+      changed-paths that can be stored in a single Bloom filter.
+    * It then contains a list of 64-bit words (the length of this list is
+      determined by the width of the chunk) which is a bitmap. The 'i'th bit is
+      set exactly when the 'i'th commit in the graph has a changed-path Bloom
+      filter with zero entries (either because the commit is empty, or because
+      it contains more than 512 changed paths).
+    * The BFXL chunk is present only when the BIDX and BDAT chunks are
+      also present.
+
+
   Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional]
       This list of H-byte hashes describe a set of B commit-graph files that
       form a commit-graph chain. The graph position for the ith commit in this
diff --git a/bloom.h b/bloom.h
index 3f19e3fca4..464d9b57de 100644
--- a/bloom.h
+++ b/bloom.h
@@ -33,7 +33,7 @@ struct bloom_filter_settings {
 	 * The maximum number of changed paths per commit
 	 * before declaring a Bloom filter to be too-large.
 	 *
-	 * Not written to the commit-graph file.
+	 * Written to the 'BFXL' chunk (instead of 'BDAT').
 	 */
 	uint32_t max_changed_paths;
 };
diff --git a/commit-graph.c b/commit-graph.c
index 8964453433..4ccc7a3e56 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -41,8 +41,9 @@ void git_test_write_commit_graph_or_die(void)
 #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
 #define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
 #define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
+#define GRAPH_CHUNKID_BLOOMLARGE 0x4246584c /* "BFXL" */
 #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
-#define MAX_NUM_CHUNKS 7
+#define MAX_NUM_CHUNKS 8

 #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)

@@ -429,6 +430,16 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 				graph->bloom_filter_settings->bits_per_entry = get_be32(data + chunk_offset + 8);
 			}
 			break;
+
+		case GRAPH_CHUNKID_BLOOMLARGE:
+			if (graph->chunk_bloom_large_filters)
+				chunk_repeated = 1;
+			else if (r->settings.commit_graph_read_changed_paths) {
+				graph->chunk_bloom_large_filters = data + chunk_offset + sizeof(uint32_t);
+				graph->bloom_large_alloc = get_be64(chunk_lookup + 4) - chunk_offset - sizeof(uint32_t);
+				graph->bloom_filter_settings->max_changed_paths = get_be32(data + chunk_offset);
+			}
+			break;
 		}

 		if (chunk_repeated) {
@@ -443,6 +454,8 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 		/* We need both the bloom chunks to exist together. Else ignore the data */
 		graph->chunk_bloom_indexes = NULL;
 		graph->chunk_bloom_data = NULL;
+		graph->chunk_bloom_large_filters = NULL;
+		graph->bloom_large_alloc = 0;
 		FREE_AND_NULL(graph->bloom_filter_settings);
 	}

@@ -932,6 +945,32 @@ struct tree *get_commit_tree_in_graph(struct repository *r, const struct commit
 	return get_commit_tree_in_graph_one(r, r->objects->commit_graph, c);
 }

+int get_bloom_filter_large_in_graph(struct commit_graph *g,
+				    const struct commit *c)
+{
+	uint32_t graph_pos = commit_graph_position(c);
+	if (graph_pos == COMMIT_NOT_FROM_GRAPH)
+		return 0;
+
+	while (g && graph_pos < g->num_commits_in_base)
+		g = g->base_graph;
+
+	if (!g)
+		return 0;
+
+	if (!g->bloom_large && g->bloom_large_alloc) {
+		size_t i;
+		g->bloom_large = bitmap_word_alloc(g->bloom_large_alloc / sizeof(eword_t));
+
+		for (i = 0; i < g->bloom_large->word_alloc; i++)
+			g->bloom_large->words[i] = get_be64(
+				g->chunk_bloom_large_filters + i * sizeof(eword_t));
+	}
+
+	if (!g->bloom_large)
+		return 0;
+	return bitmap_get(g->bloom_large, graph_pos - g->num_commits_in_base);
+}

 struct packed_oid_list {
 	struct object_id *list;
@@ -970,8 +1009,10 @@ struct write_commit_graph_context {
 	size_t total_bloom_filter_data_size;
 	const struct bloom_filter_settings *bloom_settings;

+	int count_bloom_filter_known_large;
 	int count_bloom_filter_found_large;
 	int count_bloom_filter_computed;
+	struct bitmap *bloom_large;
 };

 static int write_graph_chunk_fanout(struct hashfile *f,
@@ -1235,6 +1276,23 @@ static int write_graph_chunk_bloom_data(struct hashfile *f,
 	return 0;
 }

+static int write_graph_chunk_bloom_large(struct hashfile *f,
+					 struct write_commit_graph_context *ctx)
+{
+	size_t i, alloc = ctx->commits.nr / BITS_IN_EWORD;
+	if (ctx->commits.nr % BITS_IN_EWORD)
+		alloc++;
+	if (alloc > ctx->bloom_large->word_alloc)
+		BUG("write_graph_chunk_bloom_large: bitmap not large enough");
+
+	trace2_region_enter("commit-graph", "bloom_large", ctx->r);
+	hashwrite_be32(f, ctx->bloom_settings->max_changed_paths);
+	for (i = 0; i < ctx->bloom_large->word_alloc; i++)
+		hashwrite_be64(f, ctx->bloom_large->words[i]);
+	trace2_region_leave("commit-graph", "bloom_large", ctx->r);
+	return 0;
+}
+
 static int oid_compare(const void *_a, const void *_b)
 {
 	const struct object_id *a = (const struct object_id *)_a;
@@ -1398,6 +1456,8 @@ static void trace2_bloom_filter_write_statistics(struct write_commit_graph_conte
 	struct json_writer jw = JSON_WRITER_INIT;

 	jw_object_begin(&jw, 0);
+	jw_object_intmax(&jw, "filter_known_large",
+			 ctx->count_bloom_filter_known_large);
 	jw_object_intmax(&jw, "filter_found_large",
 			 ctx->count_bloom_filter_found_large);
 	jw_object_intmax(&jw, "filter_computed",
@@ -1416,6 +1476,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 	int *sorted_commits;

 	init_bloom_filters();
+	ctx->bloom_large = bitmap_word_alloc(ctx->commits.nr / BITS_IN_EWORD + 1);

 	if (ctx->report_progress)
 		progress = start_delayed_progress(
@@ -1430,21 +1491,28 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 		&ctx->commits);

 	for (i = 0; i < ctx->commits.nr; i++) {
-		int computed = 0;
 		int pos = sorted_commits[i];
 		struct commit *c = ctx->commits.list[pos];
-		struct bloom_filter *filter = get_or_compute_bloom_filter(
-			ctx->r,
-			c,
-			1,
-			ctx->bloom_settings,
-			&computed);
-		if (computed) {
-			ctx->count_bloom_filter_computed++;
-			if (filter && !filter->len)
-				ctx->count_bloom_filter_found_large++;
+		if (get_bloom_filter_large_in_graph(ctx->r->objects->commit_graph, c)) {
+			bitmap_set(ctx->bloom_large, pos);
+			ctx->count_bloom_filter_known_large++;
+		} else {
+			int computed = 0;
+			struct bloom_filter *filter = get_or_compute_bloom_filter(
+				ctx->r,
+				c,
+				1,
+				ctx->bloom_settings,
+				&computed);
+			if (computed) {
+				ctx->count_bloom_filter_computed++;
+				if (filter && !filter->len) {
+					bitmap_set(ctx->bloom_large, pos);
+					ctx->count_bloom_filter_found_large++;
+				}
+			}
+			ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
 		}
-		ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
 		display_progress(progress, i + 1);
 	}

@@ -1764,6 +1832,11 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 					  + ctx->total_bloom_filter_data_size;
 		chunks[num_chunks].write_fn = write_graph_chunk_bloom_data;
 		num_chunks++;
+		chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMLARGE;
+		chunks[num_chunks].size = sizeof(eword_t) * ctx->bloom_large->word_alloc
+					+ sizeof(uint32_t);
+		chunks[num_chunks].write_fn = write_graph_chunk_bloom_large;
+		num_chunks++;
 	}
 	if (ctx->num_commit_graphs_after > 1) {
 		chunks[num_chunks].id = GRAPH_CHUNKID_BASE;
@@ -2503,6 +2576,7 @@ void free_commit_graph(struct commit_graph *g)
 	}
 	free(g->filename);
 	free(g->bloom_filter_settings);
+	bitmap_free(g->bloom_large);
 	free(g);
 }

diff --git a/commit-graph.h b/commit-graph.h
index d9acb22bac..9afb1477d5 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -4,6 +4,7 @@
 #include "git-compat-util.h"
 #include "object-store.h"
 #include "oidset.h"
+#include "ewah/ewok.h"

 #define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
 #define GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE "GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE"
@@ -50,6 +51,9 @@ void load_commit_graph_info(struct repository *r, struct commit *item);
 struct tree *get_commit_tree_in_graph(struct repository *r,
 				      const struct commit *c);

+int get_bloom_filter_large_in_graph(struct commit_graph *g,
+				    const struct commit *c);
+
 struct commit_graph {
 	const unsigned char *data;
 	size_t data_len;
@@ -71,6 +75,10 @@ struct commit_graph {
 	const unsigned char *chunk_base_graphs;
 	const unsigned char *chunk_bloom_indexes;
 	const unsigned char *chunk_bloom_data;
+	const unsigned char *chunk_bloom_large_filters;
+
+	struct bitmap *bloom_large;
+	size_t bloom_large_alloc;

 	struct bloom_filter_settings *bloom_filter_settings;
 };
@@ -83,6 +91,8 @@ struct commit_graph *read_commit_graph_one(struct repository *r,
 struct commit_graph *parse_commit_graph(struct repository *r,
 					void *graph_map, size_t graph_size);

+void prepare_commit_graph_bloom_large(struct commit_graph *g);
+
 /*
  * Return 1 if and only if the repository has a commit-graph
  * file and generation numbers are computed in that file.
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 21b67677ef..6859d85369 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -33,7 +33,7 @@ test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
 	git commit-graph write --reachable --changed-paths
 '
 graph_read_expect () {
-	NUM_CHUNKS=5
+	NUM_CHUNKS=6
 	cat >expect <<- EOF
 	header: 43475048 1 1 $NUM_CHUNKS 0
 	num_commits: $1
@@ -262,5 +262,28 @@ test_expect_success 'correctly report changes over limit' '
 		done
 	)
 '
+test_bloom_filters_computed () {
+	commit_graph_args=$1
+	bloom_trace_prefix="{\"filter_known_large\":$2,\"filter_found_large\":$3,\"filter_computed\":$4"
+	rm -f "$TRASH_DIRECTORY/trace.event" &&
+	GIT_TRACE2_EVENT="$TRASH_DIRECTORY/trace.event" git commit-graph write $commit_graph_args &&
+	grep "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.event"
+}
+
+test_expect_success 'Bloom generation does not recompute too-large filters' '
+	(
+		cd limits &&
+
+		# start from scratch and rebuild
+		rm -f .git/objects/info/commit-graph &&
+		GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=10 \
+			git commit-graph write --reachable --changed-paths \
+			--split=replace &&
+		test_commit c1 filter &&
+
+		test_bloom_filters_computed "--reachable --changed-paths --split=replace" \
+			2 0 1
+	)
+'

 test_done
--
2.28.0.rc1.13.ge78abce653


^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH v3 01/14] commit-graph: introduce 'get_bloom_filter_settings()'
  2020-08-11 21:34           ` Taylor Blau
@ 2020-08-11 23:55             ` SZEDER Gábor
  2020-08-12 11:48               ` Derrick Stolee
  0 siblings, 1 reply; 117+ messages in thread
From: SZEDER Gábor @ 2020-08-11 23:55 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, dstolee, gitster

On Tue, Aug 11, 2020 at 05:34:11PM -0400, Taylor Blau wrote:
> On Tue, Aug 11, 2020 at 11:27:16PM +0200, SZEDER Gábor wrote:
> > On Tue, Aug 11, 2020 at 05:21:18PM -0400, Taylor Blau wrote:
> > > On Tue, Aug 11, 2020 at 11:18:30PM +0200, SZEDER Gábor wrote:
> > > > On Tue, Aug 11, 2020 at 04:51:19PM -0400, Taylor Blau wrote:
> > > > > Many places in the code often need a pointer to the commit-graph's
> > > > > 'struct bloom_filter_settings', in which case they often take the value
> > > > > from the top-most commit-graph.
> > > > >
> > > > > In the non-split case, this works as expected. In the split case,
> > > > > however, things get a little tricky. Not all layers in a chain of
> > > > > incremental commit-graphs are required to themselves have Bloom data,
> > > > > and so whether or not some part of the code uses Bloom filters depends
> > > > > entirely on whether or not the top-most level of the commit-graph chain
> > > > > has Bloom filters.
> > > > >
> > > > > This has been the behavior since Bloom filters were introduced, and has
> > > > > been codified into the tests since a759bfa9ee (t4216: add end to end
> > > > > tests for git log with Bloom filters, 2020-04-06). In fact, t4216.130
> > > > > requires that Bloom filters are not used in exactly the case described
> > > > > earlier.
> > > > >
> > > > > There is no reason that this needs to be the case, since it is perfectly
> > > > > valid for commits in an earlier layer to have Bloom filters when commits
> > > > > in a newer layer do not.
> > > > >
> > > > > Since Bloom settings are guaranteed to be the same for any layer in a
> > > > > chain that has Bloom data,
> > > >
> > > > Is it?  Where is that guaranteed?
> > >
> > > There is no mechanism whatsoever to customize these settings that is
> > > exposed to the user (except for the undocumented 'GIT_TEST' environment
> > > variables).
> >
> > Let me rephrase it, then: where is it written in the commit-graph
> > format specification that these must be the same in all layers?
> >
> > Nowhere.
> 
> OK. We can certainly document that this is the case.

IMO we absolutely must document this first; ideally it should have
been carefully considered and documented right from the start.

Some thoughts about this:

  https://public-inbox.org/git/20200619140230.GB22200@szeder.dev/

> For this purpose,
> all we really care about is that the graph _has_ Bloom filters anywhere.
> If you wanted to return the exact matching settings, you could also
> provide a commit and return the settings belonging to the graph that
> contains that commit.
> 
> In the case where we don't have a commit, we could use the default
> settings instead.
> 
> I think that we are a little bit dealing with a problem that doesn't
> exist, since we do not document the sole method by which you would
> change these settings. So, maybe we can think more about this, but my
> preference would be to leave this patch alone.

Other implementations can write split commit-graphs with modified path
Bloom filters as well, and at the moment there is nothing in the specs
that tells them not to use different Bloom filter settings in
different layers.

> Maybe Stolee can chime in, too.
> 
> > > > > it is sufficient to traverse the
> > > > > '->base_graph' pointer until either (1) a non-null 'struct
> > > > > bloom_filter_settings *' is found, or (2) until we are at the root of
> > > > > the commit-graph chain.
> > > > >
> > > > > Introduce a 'get_bloom_filter_settings()' function that does just this,
> > > > > and use it instead of purely dereferencing the top-most graph's
> > > > > '->bloom_filter_settings' pointer.
> > > > >
> > > > > While we're at it, add an additional test in t5324 to guard against code
> > > > > in the commit-graph writing machinery that doesn't correctly handle a
> > > > > NULL 'struct bloom_filter *'.
> > > > >
> > > > > Co-authored-by: Derrick Stolee <dstolee@microsoft.com>
> > > > > Signed-off-by: Taylor Blau <me@ttaylorr.com>
> > > Thanks,
> > > Taylor
> Thanks,
> Taylor

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH v3 01/14] commit-graph: introduce 'get_bloom_filter_settings()'
  2020-08-11 23:55             ` SZEDER Gábor
@ 2020-08-12 11:48               ` Derrick Stolee
  2020-08-14 20:17                 ` Taylor Blau
  0 siblings, 1 reply; 117+ messages in thread
From: Derrick Stolee @ 2020-08-12 11:48 UTC (permalink / raw)
  To: SZEDER Gábor, Taylor Blau; +Cc: git, peff, dstolee, gitster

On 8/11/2020 7:55 PM, SZEDER Gábor wrote:
> On Tue, Aug 11, 2020 at 05:34:11PM -0400, Taylor Blau wrote:
>> On Tue, Aug 11, 2020 at 11:27:16PM +0200, SZEDER Gábor wrote:
>>> On Tue, Aug 11, 2020 at 05:21:18PM -0400, Taylor Blau wrote:
>>>> On Tue, Aug 11, 2020 at 11:18:30PM +0200, SZEDER Gábor wrote:
>>>>> On Tue, Aug 11, 2020 at 04:51:19PM -0400, Taylor Blau wrote:
>>>>>> Many places in the code often need a pointer to the commit-graph's
>>>>>> 'struct bloom_filter_settings', in which case they often take the value
>>>>>> from the top-most commit-graph.
>>>>>>
>>>>>> In the non-split case, this works as expected. In the split case,
>>>>>> however, things get a little tricky. Not all layers in a chain of
>>>>>> incremental commit-graphs are required to themselves have Bloom data,
>>>>>> and so whether or not some part of the code uses Bloom filters depends
>>>>>> entirely on whether or not the top-most level of the commit-graph chain
>>>>>> has Bloom filters.
>>>>>>
>>>>>> This has been the behavior since Bloom filters were introduced, and has
>>>>>> been codified into the tests since a759bfa9ee (t4216: add end to end
>>>>>> tests for git log with Bloom filters, 2020-04-06). In fact, t4216.130
>>>>>> requires that Bloom filters are not used in exactly the case described
>>>>>> earlier.
>>>>>>
>>>>>> There is no reason that this needs to be the case, since it is perfectly
>>>>>> valid for commits in an earlier layer to have Bloom filters when commits
>>>>>> in a newer layer do not.
>>>>>>
>>>>>> Since Bloom settings are guaranteed to be the same for any layer in a
>>>>>> chain that has Bloom data,
>>>>>
>>>>> Is it?  Where is that guaranteed?

Perhaps instead of "guaranteed" we could say "Git never writes
a commit-graph chain with different settings between layers."

>>>> There is no mechanism whatsoever to customize these settings that is
>>>> exposed to the user (except for the undocumented 'GIT_TEST' environment
>>>> variables).
>>>
>>> Let me rephrase it, then: where is it written in the commit-graph
>>> format specification that these must be the same in all layers?
>>>
>>> Nowhere.
>>
>> OK. We can certainly document that this is the case.
> 
> IMO we absolutely must document this first; ideally it should have
> been carefully considered and documented right from the start.

You're right. One of the major difficulties in bringing a Bloom
filter implementation to the commit-graph feature when we did was
that the split commit-graph was introduced between our initial
prototypes and when it was finally prepared for a full submission.

There certainly are gaps in the implementation and documentation.
I think Taylor is doing a great job by addressing one of those gaps
in a focused, thoughtful way.

> Some thoughts about this:
> 
>   https://public-inbox.org/git/20200619140230.GB22200@szeder.dev/

I appreciate your attention to detail. Your comments on the existing
implementation do point out some of its shortcomings, and that is a
valuable contribution.

Actually converting those thoughts into patches is a lot of work.

>> For this purpose,
>> all we really care about is that the graph _has_ Bloom filters anywhere.
>> If you wanted to return the exact matching settings, you could also
>> provide a commit and return the settings belonging to the graph that
>> contains that commit.
>>
>> In the case where we don't have a commit, we could use the default
>> settings instead.
>>
>> I think that we are a little bit dealing with a problem that doesn't
>> exist, since we do not document the sole method by which you would
>> change these settings. So, maybe we can think more about this, but my
>> preference would be to leave this patch alone.
> 
> Other implementations can write split commit-graphs with modified path
> Bloom filters as well, and at the moment there is nothing in the specs
> that tells them not to use different Bloom filter settings in
> different layers.

You are 100% correct that there is a gap in documentation. That should
be corrected at some point. (I don't consider it a blocker for this
series.)

But also: Git itself is the true test of a "correct" third-party
implementation. libgit2 and JGit try to match Git, "warts and all".
If another implementation wrote data that results in incorrect
behavior by Git, then that implementation is wrong.

Improving documentation can make those errors less likely.

We also must design with "future Git" in mind, presenting it with
enough flexibility to improve formats. The custom Bloom filter
settings do allow that flexibility, but the requirement that all
layers have identical settings exists for a reason (despite not
being documented). It is important that any commit walk that intends
to use the changed-path Bloom filters can compute the bloom keys
for the test paths only once.
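
To sketch what that requirement means in code (simplified stand-in
structs and hypothetical function names, not the actual commit-graph
types): walk the chain to find the first layer that has settings, and
check that every other layer with settings agrees with it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the real structs. */
struct settings {
	uint32_t hash_version, num_hashes, bits_per_entry;
};

struct layer {
	struct settings *settings;	/* NULL if the layer has no Bloom data */
	struct layer *base;		/* next-older layer, or NULL */
};

/* Follow ->base until some layer has settings, or the chain ends. */
static const struct settings *first_settings(const struct layer *g)
{
	for (; g; g = g->base)
		if (g->settings)
			return g->settings;
	return NULL;
}

/* Return 1 iff all layers that have settings share identical ones. */
static int settings_uniform(const struct layer *g)
{
	const struct settings *ref = first_settings(g);

	for (; g; g = g->base) {
		const struct settings *s = g->settings;

		if (!s)
			continue;
		assert(ref);	/* some layer has settings, so ref was found */
		if (s->hash_version != ref->hash_version ||
		    s->num_hashes != ref->num_hashes ||
		    s->bits_per_entry != ref->bits_per_entry)
			return 0;
	}
	return 1;
}
```

If 'settings_uniform()' holds, keys hashed once with the settings from
'first_settings()' are valid against every layer's filters.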

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH v3 14/14] builtin/commit-graph.c: introduce '--max-new-filters=<n>'
  2020-08-11 20:52   ` [PATCH v3 14/14] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
@ 2020-08-12 11:49     ` SZEDER Gábor
  2020-08-14 20:20       ` Taylor Blau
  2020-08-12 12:29     ` Derrick Stolee
                       ` (3 subsequent siblings)
  4 siblings, 1 reply; 117+ messages in thread
From: SZEDER Gábor @ 2020-08-12 11:49 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, dstolee, gitster

On Tue, Aug 11, 2020 at 04:52:14PM -0400, Taylor Blau wrote:

> diff --git a/commit-graph.c b/commit-graph.c
> index 6886f319a5..4aae5471e3 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -954,7 +954,8 @@ struct tree *get_commit_tree_in_graph(struct repository *r, const struct commit
>  }
>  
>  static int get_bloom_filter_large_in_graph(struct commit_graph *g,
> -					   const struct commit *c)
> +					   const struct commit *c,
> +					   uint32_t max_changed_paths)
>  {
>  	uint32_t graph_pos = commit_graph_position(c);
>  	if (graph_pos == COMMIT_NOT_FROM_GRAPH)
> @@ -965,6 +966,17 @@ static int get_bloom_filter_large_in_graph(struct commit_graph *g,
>  
>  	if (!(g && g->bloom_large))
>  		return 0;
> +	if (g->bloom_filter_settings->max_changed_paths != max_changed_paths) {
> +		/*
> +		 * Force all commits which are subject to a different
> +		 * 'max_changed_paths' limit to be recomputed from scratch.
> +		 *
> +		 * Note that this could likely be improved, but is ignored since
> +		 * all real-world graphs set the maximum number of changed paths
> +		 * at 512.
> +		 */
> +		return 0;
> +	}
>  	return bitmap_get(g->bloom_large, graph_pos - g->num_commits_in_base);
>  }
>  
> @@ -1470,6 +1482,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>  	int i;
>  	struct progress *progress = NULL;
>  	int *sorted_commits;
> +	int max_new_filters;
>  
>  	init_bloom_filters();
>  	ctx->bloom_large = bitmap_word_alloc(ctx->commits.nr / BITS_IN_EWORD + 1);
> @@ -1486,10 +1499,15 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>  		ctx->order_by_pack ? commit_pos_cmp : commit_gen_cmp,
>  		&ctx->commits);
>  
> +	max_new_filters = ctx->opts->max_new_filters >= 0 ?
> +		ctx->opts->max_new_filters : ctx->commits.nr;

git_test_write_commit_graph_or_die() invokes
write_commit_graph_reachable() with opts=0x0, so 'ctx->opts' is NULL,
and we get a segfault.  This breaks a lot of tests when run with
GIT_TEST_COMMIT_GRAPH=1 GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=1.
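
One possible shape for a guard, shown as a standalone sketch ('struct
opts' and the function name are hypothetical stand-ins, and this is not
necessarily how the series should fix it):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the options struct passed by callers. */
struct opts {
	int max_new_filters;	/* negative means "no limit" */
};

static int effective_max_new_filters(const struct opts *opts, int nr_commits)
{
	assert(nr_commits >= 0);
	/* Callers may pass opts == NULL, so check before dereferencing. */
	if (opts && opts->max_new_filters >= 0)
		return opts->max_new_filters;
	return nr_commits;
}
```

With 'opts == NULL' this falls back to the number of commits, that is,
no effective limit.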


^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH v3 14/14] builtin/commit-graph.c: introduce '--max-new-filters=<n>'
  2020-08-11 20:52   ` [PATCH v3 14/14] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
  2020-08-12 11:49     ` SZEDER Gábor
@ 2020-08-12 12:29     ` Derrick Stolee
  2020-08-14 20:10       ` Taylor Blau
  2020-08-18 22:23     ` SZEDER Gábor
                       ` (2 subsequent siblings)
  4 siblings, 1 reply; 117+ messages in thread
From: Derrick Stolee @ 2020-08-12 12:29 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: peff, dstolee, szeder.dev, gitster

On 8/11/2020 4:52 PM, Taylor Blau wrote:
> Introduce a command-line flag and configuration variable to fill in the
> 'max_new_filters' variable introduced by the previous patch.
> 
> The command-line option '--max-new-filters' takes precedence over
> 'commitGraph.maxNewFilters', which is the default value.
> '--no-max-new-filters' can also be provided, which sets the value back
> to '-1', indicating that an unlimited number of new Bloom filters may be
> generated. (OPT_INTEGER only allows setting the '--no-' variant back to
> '0', hence a custom callback was used instead).
> 
> Signed-off-by: Taylor Blau <me@ttaylorr.com>

...

> diff --git a/bloom.c b/bloom.c
> index ed54e96e57..8d07209c6b 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -51,6 +51,21 @@ static int load_bloom_filter_from_graph(struct commit_graph *g,
>  	else
>  		start_index = 0;
>  
> +	if ((start_index == end_index) &&
> +	    (g->bloom_large && !bitmap_get(g->bloom_large, lex_pos))) {
> +		/*
> +		 * If the filter is zero-length, either (1) the filter has no
> +		 * changes, (2) the filter has too many changes, or (3) it
> +		 * wasn't computed (eg., due to '--max-new-filters').
> +		 *
> +		 * If either (1) or (2) is the case, the 'large' bit will be set
> +		 * for this Bloom filter. If it is unset, then it wasn't
> +		 * computed. In that case, return nothing, since we don't have
> +		 * that filter in the graph.
> +		 */
> +		return 0;
> +	}
> +

Here, you are creating a distinction between an "empty" filter and
a "too large" filter at a place that I don't think is important.

For instance, this code will be triggered by "git log -- <path>"
but you only care about the filter being too large when writing the
commit-graph. I think this check for a "too large" filter should
instead be inside get_or_compute_bloom_filter(). I include a patch
below that applies on top of tb/bloom-improvements that gets to
my point (and how minor the issue might be).

Thanks,
-Stolee

--- >8 ---

From 92a7a3d0f769fad7617426730cb97584a3d07794 Mon Sep 17 00:00:00 2001
From: Derrick Stolee <dstolee@microsoft.com>
Date: Wed, 12 Aug 2020 08:20:03 -0400
Subject: [PATCH] commit-graph: lazy-load large Bloom filter bitmap

The bloom_large bitmap in struct commit_graph is currently loaded
immediately upon first parse of the commit-graph file (when the chunk
exists). This is needed because the current implementation of
get_bloom_filter_from_graph() is special-cased to return NULL when the
filter is marked as "too large".

This has a slight drawback: before we can read a single commit out of
the commit-graph file, we need to load this entire chunk into memory.
This happens even for commands that don't need changed-path Bloom
filters, such as "git log -1".

This "too large" information is only used when writing a commit-graph
file, so we can delay the check for a large filter until after we check
compute_if_not_present in get_or_compute_bloom_filter(). Also, place
that lazy-load directly in the get_bloom_filter_large_in_graph() method,
so we ensure it is ready when needed.

This may be overkill. For a repository with one million commits, this
filter size is approximately 125 *kilobytes* of data. My local
measurements found that this took between 1 and 2 milliseconds to load
into memory. Even for repositories with 10 million commits, this
difference would not be noticeable for end-user commands.

On the other hand, these handfuls of milliseconds could add up when
running a hosting service using Git, so this extra effort is probably
worth it.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 bloom.c        | 27 ++++++++-------------------
 commit-graph.c | 30 +++++++++++++++++-------------
 commit-graph.h |  8 ++++++++
 3 files changed, 33 insertions(+), 32 deletions(-)

diff --git a/bloom.c b/bloom.c
index 8d07209c6b..6d0884fa19 100644
--- a/bloom.c
+++ b/bloom.c
@@ -51,21 +51,6 @@ static int load_bloom_filter_from_graph(struct commit_graph *g,
 	else
 		start_index = 0;
 
-	if ((start_index == end_index) &&
-	    (g->bloom_large && !bitmap_get(g->bloom_large, lex_pos))) {
-		/*
-		 * If the filter is zero-length, either (1) the filter has no
-		 * changes, (2) the filter has too many changes, or (3) it
-		 * wasn't computed (eg., due to '--max-new-filters').
-		 *
-		 * If either (1) or (2) is the case, the 'large' bit will be set
-		 * for this Bloom filter. If it is unset, then it wasn't
-		 * computed. In that case, return nothing, since we don't have
-		 * that filter in the graph.
-		 */
-		return 0;
-	}
-
 	filter->len = end_index - start_index;
 	filter->data = (unsigned char *)(g->chunk_bloom_data +
 					sizeof(unsigned char) * start_index +
@@ -212,16 +197,20 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 
 	if (!filter->data) {
 		load_commit_graph_info(r, c);
-		if (commit_graph_position(c) != COMMIT_NOT_FROM_GRAPH &&
-			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
-				return filter;
+		if (commit_graph_position(c) != COMMIT_NOT_FROM_GRAPH)
+			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c);
 	}
 
-	if (filter->data)
+	if (filter && filter->len)
 		return filter;
 	if (!compute_if_not_present)
 		return NULL;
 
+	if (filter && !filter->len &&
+	    get_bloom_filter_large_in_graph(r->objects->commit_graph, c,
+	    				    settings->max_changed_paths))
+		return filter;
+
 	repo_diff_setup(r, &diffopt);
 	diffopt.flags.recursive = 1;
 	diffopt.detect_rename = 0;
diff --git a/commit-graph.c b/commit-graph.c
index 4aae5471e3..ea89f431cc 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -435,17 +435,9 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 			if (graph->chunk_bloom_large_filters)
 				chunk_repeated = 1;
 			else if (r->settings.commit_graph_read_changed_paths) {
-				size_t alloc = get_be64(chunk_lookup + 4) - chunk_offset - sizeof(uint32_t);
+				graph->bloom_large_alloc = get_be64(chunk_lookup + 4) - chunk_offset - sizeof(uint32_t);
 				graph->chunk_bloom_large_filters = data + chunk_offset + sizeof(uint32_t);
 				graph->bloom_filter_settings->max_changed_paths = get_be32(data + chunk_offset);
-				if (alloc) {
-					size_t j;
-					graph->bloom_large = bitmap_word_alloc(alloc);
-
-					for (j = 0; j < graph->bloom_large->word_alloc; j++)
-						graph->bloom_large->words[j] = get_be64(
-							graph->chunk_bloom_large_filters + j * sizeof(eword_t));
-				}
 			}
 			break;
 		}
@@ -953,9 +945,9 @@ struct tree *get_commit_tree_in_graph(struct repository *r, const struct commit
 	return get_commit_tree_in_graph_one(r, r->objects->commit_graph, c);
 }
 
-static int get_bloom_filter_large_in_graph(struct commit_graph *g,
-					   const struct commit *c,
-					   uint32_t max_changed_paths)
+int get_bloom_filter_large_in_graph(struct commit_graph *g,
+				    const struct commit *c,
+				    uint32_t max_changed_paths)
 {
 	uint32_t graph_pos = commit_graph_position(c);
 	if (graph_pos == COMMIT_NOT_FROM_GRAPH)
@@ -964,8 +956,20 @@ static int get_bloom_filter_large_in_graph(struct commit_graph *g,
 	while (g && graph_pos < g->num_commits_in_base)
 		g = g->base_graph;
 
-	if (!(g && g->bloom_large))
+	if (!g || !g->bloom_large_alloc)
 		return 0;
+
+	if (!g->bloom_large) {
+		size_t j;
+		g->bloom_large = bitmap_word_alloc(g->bloom_large_alloc);
+
+		for (j = 0; j < g->bloom_large->word_alloc; j++) {
+			const void *data = g->chunk_bloom_large_filters +
+					   j * sizeof(eword_t);
+			g->bloom_large->words[j] = get_be64(data);
+		}
+	}
+
 	if (g->bloom_filter_settings->max_changed_paths != max_changed_paths) {
 		/*
 		 * Force all commits which are subject to a different
diff --git a/commit-graph.h b/commit-graph.h
index 75ef83708b..126fd43380 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -51,6 +51,13 @@ void load_commit_graph_info(struct repository *r, struct commit *item);
 struct tree *get_commit_tree_in_graph(struct repository *r,
 				      const struct commit *c);
 
+/*
+ * Returns 1 if this commit was marked in the BFXL chunk as having more
+ * than max_changed_paths changes.
+ */
+int get_bloom_filter_large_in_graph(struct commit_graph *g,
+				    const struct commit *c,
+				    uint32_t max_changed_paths);
 struct commit_graph {
 	const unsigned char *data;
 	size_t data_len;
@@ -74,6 +81,7 @@ struct commit_graph {
 	const unsigned char *chunk_bloom_data;
 	const unsigned char *chunk_bloom_large_filters;
 
+	size_t bloom_large_alloc;
 	struct bitmap *bloom_large;
 
 	struct bloom_filter_settings *bloom_filter_settings;
-- 
2.28.0.38.gc6f546511c1




* Re: [PATCH v3 14/14] builtin/commit-graph.c: introduce '--max-new-filters=<n>'
  2020-08-12 12:29     ` Derrick Stolee
@ 2020-08-14 20:10       ` Taylor Blau
  0 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-08-14 20:10 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Taylor Blau, git, peff, dstolee, szeder.dev, gitster

On Wed, Aug 12, 2020 at 08:29:06AM -0400, Derrick Stolee wrote:
> On 8/11/2020 4:52 PM, Taylor Blau wrote:
> > Introduce a command-line flag and configuration variable to fill in the
> > 'max_new_filters' variable introduced by the previous patch.
> >
> > The command-line option '--max-new-filters' takes precedence over
> > 'commitGraph.maxNewFilters', which is the default value.
> > '--no-max-new-filters' can also be provided, which sets the value back
> > to '-1', indicating that an unlimited number of new Bloom filters may be
> > generated. (OPT_INTEGER only allows setting the '--no-' variant back to
> > '0', hence a custom callback was used instead).
> >
> > Signed-off-by: Taylor Blau <me@ttaylorr.com>
>
> ...
>
> > diff --git a/bloom.c b/bloom.c
> > index ed54e96e57..8d07209c6b 100644
> > --- a/bloom.c
> > +++ b/bloom.c
> > @@ -51,6 +51,21 @@ static int load_bloom_filter_from_graph(struct commit_graph *g,
> >  	else
> >  		start_index = 0;
> >
> > +	if ((start_index == end_index) &&
> > +	    (g->bloom_large && !bitmap_get(g->bloom_large, lex_pos))) {
> > +		/*
> > +		 * If the filter is zero-length, either (1) the filter has no
> > +		 * changes, (2) the filter has too many changes, or (3) it
> > +		 * wasn't computed (eg., due to '--max-new-filters').
> > +		 *
> > +		 * If either (1) or (2) is the case, the 'large' bit will be set
> > +		 * for this Bloom filter. If it is unset, then it wasn't
> > +		 * computed. In that case, return nothing, since we don't have
> > +		 * that filter in the graph.
> > +		 */
> > +		return 0;
> > +	}
> > +
>
> Here, you are creating a distinction between an "empty" filter and
> a "too large" filter at a place that I don't think is important.
>
> For instance, this code will be triggered by "git log -- <path>"
> but you only care about the filter being too large when writing the
> commit-graph. I think this check for a "too large" filter should
> instead be inside get_or_compute_bloom_filter(). I include a patch
> below that applies on top of tb/bloom-improvements that gets to
> my point (and how minor the issue might be).

That's a good point. I factored the lazy-loading out in basically the
same way that you did below, but didn't move this call out of
'load_bloom_filter_from_graph'. I think that you're right that this is
better at the calling end (replacing the 'start_index == end_index'
check with a '!filter->len' one instead).
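
For illustration, here is a minimal, self-contained sketch of the
lazy-loading idea: the on-disk words are decoded only when a bit is first
queried. The names (struct lazy_bitmap, read_be64, lazy_bitmap_get) are
hypothetical stand-ins, not the actual Git helpers, and the sketch assumes
the chunk length is already known as a count of 64-bit words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for an on-disk chunk plus its lazily decoded form. */
struct lazy_bitmap {
	const unsigned char *raw; /* big-endian 64-bit words, as stored on disk */
	size_t nr_words;          /* number of 64-bit words in 'raw' */
	uint64_t *words;          /* decoded copy; NULL until the first lookup */
};

/* Decode one big-endian 64-bit word, independent of host byte order. */
static uint64_t read_be64(const unsigned char *p)
{
	uint64_t v = 0;
	int i;

	for (i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Return bit 'pos'; the chunk is decoded on the first in-range call only. */
static int lazy_bitmap_get(struct lazy_bitmap *b, size_t pos)
{
	size_t j;

	/* No chunk, or position out of range: report the bit as unset. */
	if (!b->nr_words || pos / 64 >= b->nr_words)
		return 0;

	if (!b->words) {
		b->words = malloc(b->nr_words * sizeof(*b->words));
		assert(b->words);
		for (j = 0; j < b->nr_words; j++)
			b->words[j] = read_be64(b->raw + j * 8);
	}
	return (int)((b->words[pos / 64] >> (pos % 64)) & 1);
}
```

Commands that never ask about a "large" bit never pay for the decode,
which is the point of moving the check to the calling end.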

I'm going to "ignore" this patch below since I (a) already have it
locally, and (b) would rather just do this correctly the first time than
introduce and subsequently fix a performance regression.

> Thanks,
> -Stolee
>
> --- >8 ---
>
> From 92a7a3d0f769fad7617426730cb97584a3d07794 Mon Sep 17 00:00:00 2001
> From: Derrick Stolee <dstolee@microsoft.com>
> Date: Wed, 12 Aug 2020 08:20:03 -0400
> Subject: [PATCH] commit-graph: lazy-load large Bloom filter bitmap
>
> The bloom_large bitmap in struct commit_graph is currently loaded
> immediately upon first parse of the commit-graph file (when the chunk
> exists). This is needed because the current implementation of
> get_bloom_filter_from_graph() is special-cased to return NULL when the
> filter is marked as "too large".
>
> This has a slight drawback: before we can read a single commit out of
> the commit-graph file, we need to load this entire chunk into memory.
> This happens even for commands that don't need changed-path Bloom
> filters, such as "git log -1".
>
> This "too large" information is only used when writing a commit-graph
> file, so we can delay the check for a large filter until after we check
> compute_if_not_present in get_or_compute_bloom_filter(). Also, place
> that lazy-load directly in the get_bloom_filter_large_in_graph() method,
> so we ensure it is ready when needed.
>
> This may be overkill. For a repository with one million commits, this
> filter size is approximately 125 *kilobytes* of data. My local
> measurements found that this took between 1 and 2 milliseconds to load
> into memory. Even for repositories with 10 million commits, this
> difference would not be noticeable for end-user commands.
>
> On the other hand, these handfuls of milliseconds could add up when
> running a hosting service using Git, so this extra effort is probably
> worth it.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  bloom.c        | 27 ++++++++-------------------
>  commit-graph.c | 30 +++++++++++++++++-------------
>  commit-graph.h |  8 ++++++++
>  3 files changed, 33 insertions(+), 32 deletions(-)
>
> diff --git a/bloom.c b/bloom.c
> index 8d07209c6b..6d0884fa19 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -51,21 +51,6 @@ static int load_bloom_filter_from_graph(struct commit_graph *g,
>  	else
>  		start_index = 0;
>
> -	if ((start_index == end_index) &&
> -	    (g->bloom_large && !bitmap_get(g->bloom_large, lex_pos))) {
> -		/*
> -		 * If the filter is zero-length, either (1) the filter has no
> -		 * changes, (2) the filter has too many changes, or (3) it
> -		 * wasn't computed (eg., due to '--max-new-filters').
> -		 *
> -		 * If either (1) or (2) is the case, the 'large' bit will be set
> -		 * for this Bloom filter. If it is unset, then it wasn't
> -		 * computed. In that case, return nothing, since we don't have
> -		 * that filter in the graph.
> -		 */
> -		return 0;
> -	}
> -
>  	filter->len = end_index - start_index;
>  	filter->data = (unsigned char *)(g->chunk_bloom_data +
>  					sizeof(unsigned char) * start_index +
> @@ -212,16 +197,20 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
>
>  	if (!filter->data) {
>  		load_commit_graph_info(r, c);
> -		if (commit_graph_position(c) != COMMIT_NOT_FROM_GRAPH &&
> -			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
> -				return filter;
> +		if (commit_graph_position(c) != COMMIT_NOT_FROM_GRAPH)
> +			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c);
>  	}
>
> -	if (filter->data)
> +	if (filter && filter->len)
>  		return filter;
>  	if (!compute_if_not_present)
>  		return NULL;
>
> +	if (filter && !filter->len &&
> +	    get_bloom_filter_large_in_graph(r->objects->commit_graph, c,
> +	    				    settings->max_changed_paths))
> +		return filter;
> +
>  	repo_diff_setup(r, &diffopt);
>  	diffopt.flags.recursive = 1;
>  	diffopt.detect_rename = 0;
> diff --git a/commit-graph.c b/commit-graph.c
> index 4aae5471e3..ea89f431cc 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -435,17 +435,9 @@ struct commit_graph *parse_commit_graph(struct repository *r,
>  			if (graph->chunk_bloom_large_filters)
>  				chunk_repeated = 1;
>  			else if (r->settings.commit_graph_read_changed_paths) {
> -				size_t alloc = get_be64(chunk_lookup + 4) - chunk_offset - sizeof(uint32_t);
> +				graph->bloom_large_alloc = get_be64(chunk_lookup + 4) - chunk_offset - sizeof(uint32_t);
>  				graph->chunk_bloom_large_filters = data + chunk_offset + sizeof(uint32_t);
>  				graph->bloom_filter_settings->max_changed_paths = get_be32(data + chunk_offset);
> -				if (alloc) {
> -					size_t j;
> -					graph->bloom_large = bitmap_word_alloc(alloc);
> -
> -					for (j = 0; j < graph->bloom_large->word_alloc; j++)
> -						graph->bloom_large->words[j] = get_be64(
> -							graph->chunk_bloom_large_filters + j * sizeof(eword_t));
> -				}
>  			}
>  			break;
>  		}
> @@ -953,9 +945,9 @@ struct tree *get_commit_tree_in_graph(struct repository *r, const struct commit
>  	return get_commit_tree_in_graph_one(r, r->objects->commit_graph, c);
>  }
>
> -static int get_bloom_filter_large_in_graph(struct commit_graph *g,
> -					   const struct commit *c,
> -					   uint32_t max_changed_paths)
> +int get_bloom_filter_large_in_graph(struct commit_graph *g,
> +				    const struct commit *c,
> +				    uint32_t max_changed_paths)
>  {
>  	uint32_t graph_pos = commit_graph_position(c);
>  	if (graph_pos == COMMIT_NOT_FROM_GRAPH)
> @@ -964,8 +956,20 @@ static int get_bloom_filter_large_in_graph(struct commit_graph *g,
>  	while (g && graph_pos < g->num_commits_in_base)
>  		g = g->base_graph;
>
> -	if (!(g && g->bloom_large))
> +	if (!g || !g->bloom_large_alloc)
>  		return 0;
> +
> +	if (!g->bloom_large) {
> +		size_t j;
> +		g->bloom_large = bitmap_word_alloc(g->bloom_large_alloc);
> +
> +		for (j = 0; j < g->bloom_large->word_alloc; j++) {
> +			const void *data = g->chunk_bloom_large_filters +
> +					   j * sizeof(eword_t);
> +			g->bloom_large->words[j] = get_be64(data);
> +		}
> +	}
> +
>  	if (g->bloom_filter_settings->max_changed_paths != max_changed_paths) {
>  		/*
>  		 * Force all commits which are subject to a different
> diff --git a/commit-graph.h b/commit-graph.h
> index 75ef83708b..126fd43380 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -51,6 +51,13 @@ void load_commit_graph_info(struct repository *r, struct commit *item);
>  struct tree *get_commit_tree_in_graph(struct repository *r,
>  				      const struct commit *c);
>
> +/*
> + * Returns 1 if this commit was marked in the BFXL chunk as having more
> + * than max_changed_paths changes.
> + */
> +int get_bloom_filter_large_in_graph(struct commit_graph *g,
> +				    const struct commit *c,
> +				    uint32_t max_changed_paths);
>  struct commit_graph {
>  	const unsigned char *data;
>  	size_t data_len;
> @@ -74,6 +81,7 @@ struct commit_graph {
>  	const unsigned char *chunk_bloom_data;
>  	const unsigned char *chunk_bloom_large_filters;
>
> +	size_t bloom_large_alloc;
>  	struct bitmap *bloom_large;
>
>  	struct bloom_filter_settings *bloom_filter_settings;
> --
> 2.28.0.38.gc6f546511c1
>
>

Thanks,
Taylor


* Re: [PATCH v3 01/14] commit-graph: introduce 'get_bloom_filter_settings()'
  2020-08-12 11:48               ` Derrick Stolee
@ 2020-08-14 20:17                 ` Taylor Blau
  0 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-08-14 20:17 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: SZEDER Gábor, Taylor Blau, git, peff, dstolee, gitster

On Wed, Aug 12, 2020 at 07:48:55AM -0400, Derrick Stolee wrote:
> On 8/11/2020 7:55 PM, SZEDER Gábor wrote:
> > On Tue, Aug 11, 2020 at 05:34:11PM -0400, Taylor Blau wrote:
> >> On Tue, Aug 11, 2020 at 11:27:16PM +0200, SZEDER Gábor wrote:
> >>> On Tue, Aug 11, 2020 at 05:21:18PM -0400, Taylor Blau wrote:
> >>>> On Tue, Aug 11, 2020 at 11:18:30PM +0200, SZEDER Gábor wrote:
> >>>>> On Tue, Aug 11, 2020 at 04:51:19PM -0400, Taylor Blau wrote:
> >>>>>> Many places in the code often need a pointer to the commit-graph's
> >>>>>> 'struct bloom_filter_settings', in which case they often take the value
> >>>>>> from the top-most commit-graph.
> >>>>>>
> >>>>>> In the non-split case, this works as expected. In the split case,
> >>>>>> however, things get a little tricky. Not all layers in a chain of
> >>>>>> incremental commit-graphs are required to themselves have Bloom data,
> >>>>>> and so whether or not some part of the code uses Bloom filters depends
> >>>>>> entirely on whether or not the top-most level of the commit-graph chain
> >>>>>> has Bloom filters.
> >>>>>>
> >>>>>> This has been the behavior since Bloom filters were introduced, and has
> >>>>>> been codified into the tests since a759bfa9ee (t4216: add end to end
> >>>>>> tests for git log with Bloom filters, 2020-04-06). In fact, t4216.130
> >>>>>> requires that Bloom filters are not used in exactly the case described
> >>>>>> earlier.
> >>>>>>
> >>>>>> There is no reason that this needs to be the case, since it is perfectly
> >>>>>> valid for commits in an earlier layer to have Bloom filters when commits
> >>>>>> in a newer layer do not.
> >>>>>>
> >>>>>> Since Bloom settings are guaranteed to be the same for any layer in a
> >>>>>> chain that has Bloom data,
> >>>>>
> >>>>> Is it?  Where is that guaranteed?
>
> Perhaps instead of "guaranteed" we could say "Git never writes
> a commit-graph chain with different settings between layers."
>
> >>>> There is no mechanism whatsoever to customize these settings that is
> >>>> exposed to the user (except for the undocumented 'GIT_TEST' environment
> >>>> variables).
> >>>
> >>> Let me rephrase it, then: where is it written in the commit-graph
> >>> format specification that these must be the same in all layers?
> >>>
> >>> Nowhere.
> >>
> >> OK. We can certainly document that this is the case.
> >
> > IMO we absolutely must document this first; ideally it should have
> > been carefully considered and documented right from the start.
>
> You're right. One of the major difficulties in bringing a Bloom
> filter implementation to the commit-graph feature when we did was
> that the split commit-graph was introduced between our initial
> prototypes and when it was finally prepared for a full submission.
>
> There certainly are gaps in the implementation and documentation.
> I think Taylor is doing a great job by addressing one of those gaps
> in a focused, thoughtful way.
>
> > Some thougths about this:
> >
> >   https://public-inbox.org/git/20200619140230.GB22200@szeder.dev/
>
> I appreciate your attention to detail. Your comments on the existing
> implementation do point out some of its shortcomings, and that is a
> valuable contribution.
>
> Actually converting those thoughts into patches is a lot of work.
>
> >> For this purpose,
> >> all we really care about is that the graph _has_ Bloom filters anywhere.
> >> If you wanted to return the exact matching settings, you could also
> >> provide a commit and return the settings belonging to the graph that
> >> contains that commit.
> >>
> >> In the case where we don't have a commit, we could use the default
> >> settings instead.
> >>
> >> I think that we are a little bit dealing with a problem that doesn't
> >> exist, since we do not document the sole method by which you would
> >> change these settings. So, maybe we can think more about this, but my
> >> preference would be to leave this patch alone.
> >
> > Other implementations can write split commit-graphs with modified path
> > Bloom filters as well, and at the moment there is nothing in the specs
> > that tells them not to use different Bloom filter settings in
> > different layers.
>
> You are 100% correct that there is a gap in documentation. That should
> be corrected at some point. (I don't consider it a blocker for this
> series.)
>
> But also: Git itself is the true test of a "correct" third-party
> implementation. libgit2 and JGit try to match Git, "warts and all".
> If another implementation wrote data that results in incorrect
> behavior by Git, then that implementation is wrong.
>
> Improving documentation can make those errors less likely.

I agree with this reasoning. Would anybody object to moving forward with
this series without a change in documentation today (but rather down the
road)?

> We also must design with "future Git" in mind, presenting it with
> enough flexibility to improve formats. The custom Bloom filter
> settings do allow that flexibility, but the requirement that all
> layers have identical settings exists for a reason (despite not
> being documented). It is important that any commit walk that intends
> to use the changed-path Bloom filters can compute the bloom keys
> for the test paths only once.
>
> Thanks,
> -Stolee

Thanks,
Taylor


* Re: [PATCH v3 14/14] builtin/commit-graph.c: introduce '--max-new-filters=<n>'
  2020-08-12 11:49     ` SZEDER Gábor
@ 2020-08-14 20:20       ` Taylor Blau
  2020-08-17 22:50         ` SZEDER Gábor
  0 siblings, 1 reply; 117+ messages in thread
From: Taylor Blau @ 2020-08-14 20:20 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: Taylor Blau, git, peff, dstolee, gitster

On Wed, Aug 12, 2020 at 01:49:29PM +0200, SZEDER Gábor wrote:
> On Tue, Aug 11, 2020 at 04:52:14PM -0400, Taylor Blau wrote:
>
> > diff --git a/commit-graph.c b/commit-graph.c
> > index 6886f319a5..4aae5471e3 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -954,7 +954,8 @@ struct tree *get_commit_tree_in_graph(struct repository *r, const struct commit
> >  }
> >
> >  static int get_bloom_filter_large_in_graph(struct commit_graph *g,
> > -					   const struct commit *c)
> > +					   const struct commit *c,
> > +					   uint32_t max_changed_paths)
> >  {
> >  	uint32_t graph_pos = commit_graph_position(c);
> >  	if (graph_pos == COMMIT_NOT_FROM_GRAPH)
> > @@ -965,6 +966,17 @@ static int get_bloom_filter_large_in_graph(struct commit_graph *g,
> >
> >  	if (!(g && g->bloom_large))
> >  		return 0;
> > +	if (g->bloom_filter_settings->max_changed_paths != max_changed_paths) {
> > +		/*
> > +		 * Force all commits which are subject to a different
> > +		 * 'max_changed_paths' limit to be recomputed from scratch.
> > +		 *
> > +		 * Note that this could likely be improved, but is ignored since
> > +		 * all real-world graphs set the maximum number of changed paths
> > +		 * at 512.
> > +		 */
> > +		return 0;
> > +	}
> >  	return bitmap_get(g->bloom_large, graph_pos - g->num_commits_in_base);
> >  }
> >
> > @@ -1470,6 +1482,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
> >  	int i;
> >  	struct progress *progress = NULL;
> >  	int *sorted_commits;
> > +	int max_new_filters;
> >
> >  	init_bloom_filters();
> >  	ctx->bloom_large = bitmap_word_alloc(ctx->commits.nr / BITS_IN_EWORD + 1);
> > @@ -1486,10 +1499,15 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
> >  		ctx->order_by_pack ? commit_pos_cmp : commit_gen_cmp,
> >  		&ctx->commits);
> >
> > +	max_new_filters = ctx->opts->max_new_filters >= 0 ?
> > +		ctx->opts->max_new_filters : ctx->commits.nr;
>
> git_test_write_commit_graph_or_die() invokes
> write_commit_graph_reachable() with opts=0x0, so 'ctx->opts' is NULL,
> and we get segfault.  This breaks a lot of tests when run with
> GIT_TEST_COMMIT_GRAPH=1 GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=1.

Great catch, thanks. Fixing this is as simple as adding 'ctx->opts &&'
right before we dereference 'ctx->opts', since setting this variable
equal to 'ctx->commits.nr' is the right thing to do in that case.
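
As a rough sketch of that guard (the struct and function names here are
trimmed-down, hypothetical stand-ins for the real ones in commit-graph.c):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced versions of the real structures. */
struct write_opts {
	int max_new_filters; /* negative means "no limit" */
};

struct write_ctx {
	const struct write_opts *opts; /* NULL for callers that pass no opts */
	int nr_commits;
};

/*
 * Fall back to the number of commits both when no options were passed
 * and when the configured limit is negative (unlimited).
 */
static int effective_max_new_filters(const struct write_ctx *ctx)
{
	assert(ctx);
	if (ctx->opts && ctx->opts->max_new_filters >= 0)
		return ctx->opts->max_new_filters;
	return ctx->nr_commits;
}
```

The 'ctx->opts &&' test short-circuits, so the dereference is never
reached when the pointer is NULL.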

Unrelated to this comment, I am hoping to send out a final version of
this series sometime next week so that we can keep moving forward with
Bloom filter improvements.

Have you had a chance to review the rest of the patches? I'll happily
wait until you have had a chance to do so before sending v5 so that we
can avoid a v6.

Thanks,
Taylor


* Re: [PATCH v3 14/14] builtin/commit-graph.c: introduce '--max-new-filters=<n>'
  2020-08-14 20:20       ` Taylor Blau
@ 2020-08-17 22:50         ` SZEDER Gábor
  2020-09-02 21:03           ` Taylor Blau
  0 siblings, 1 reply; 117+ messages in thread
From: SZEDER Gábor @ 2020-08-17 22:50 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, dstolee, gitster

On Fri, Aug 14, 2020 at 04:20:21PM -0400, Taylor Blau wrote:
> > > @@ -1486,10 +1499,15 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
> > >  		ctx->order_by_pack ? commit_pos_cmp : commit_gen_cmp,
> > >  		&ctx->commits);
> > >
> > > +	max_new_filters = ctx->opts->max_new_filters >= 0 ?
> > > +		ctx->opts->max_new_filters : ctx->commits.nr;
> >
> > git_test_write_commit_graph_or_die() invokes
> > write_commit_graph_reachable() with opts=0x0, so 'ctx->opts' is NULL,
> > and we get segfault.  This breaks a lot of tests when run with
> > GIT_TEST_COMMIT_GRAPH=1 GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=1.
> 
> Great catch, thanks. Fixing this is as simple as adding 'ctx->opts &&'
> right before we dereference 'ctx->opts', since setting this variable
> equal to 'ctx->commits.nr' is the right thing to do in that case.

That would avoid the segfault, sure, but I would rather see all
callers of write_commit_graph{_reachable}() passing a valid opts
instance.  Just like we don't call the diff machinery with a NULL
diff_options, or the revision walking machinery with a NULL rev_info.
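
A minimal sketch of that alternative, assuming a hypothetical initializer
macro (WRITE_OPTS_INIT is an illustrative name and does not exist in the
series): every caller starts from sane defaults, and the callee may then
assume a non-NULL pointer.

```c
#include <assert.h>

/* Hypothetical, reduced options struct. */
struct write_opts {
	int max_new_filters; /* -1 means "no limit" */
};

/* Illustrative default initializer; callers would start from this. */
#define WRITE_OPTS_INIT { -1 }

/* Callers must always pass a valid instance, so no NULL check is needed. */
static int resolve_limit(const struct write_opts *opts, int nr_commits)
{
	assert(opts);
	return opts->max_new_filters >= 0 ? opts->max_new_filters : nr_commits;
}
```

A test helper would then declare 'struct write_opts opts = WRITE_OPTS_INIT;'
and pass '&opts' instead of NULL.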

> Unrelated to this comment, I am hoping to send out a final version of
> this series sometime next week so that we can keep moving forward with
> Bloom filter improvements.
> 
> Have you had a chance to review the rest of the patches? I'll happily
> wait until you have had a chance to do so before sending v5 so that we

v5?  This is v3, and I'm unable to find a v4.

> can avoid a v6.
> 
> Thanks,
> Taylor


* Re: [PATCH v3 14/14] builtin/commit-graph.c: introduce '--max-new-filters=<n>'
  2020-08-11 20:52   ` [PATCH v3 14/14] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
  2020-08-12 11:49     ` SZEDER Gábor
  2020-08-12 12:29     ` Derrick Stolee
@ 2020-08-18 22:23     ` SZEDER Gábor
  2020-09-03 16:35       ` Taylor Blau
  2020-08-19  8:20     ` SZEDER Gábor
  2020-09-01 14:36     ` SZEDER Gábor
  4 siblings, 1 reply; 117+ messages in thread
From: SZEDER Gábor @ 2020-08-18 22:23 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, dstolee, gitster

On Tue, Aug 11, 2020 at 04:52:14PM -0400, Taylor Blau wrote:
> Introduce a command-line flag and configuration variable to fill in the
> 'max_new_filters' variable introduced by the previous patch.

'max_new_filters' is introduced in this patch.

> The command-line option '--max-new-filters' takes precedence over
> 'commitGraph.maxNewFilters', which is the default value.

What "filters"?  While misnamed, the '--changed-paths' option did a
good job at hiding the implementation detail that we use Bloom filters
to speed up pathspec-limited revision walks; as far as I remember this
was a conscious design decision.  Including "filter" in the name of
the option and corresponding config variable goes against this
decision.  Furthermore, by not being specific we might end up in a bad
situation when adding some other filters to the commit-graph format.
Unfortunately, I can't offhand propose a better option name, all my
ideas were horrible.

> '--no-max-new-filters' can also be provided, which sets the value back
> to '-1', indicating that an unlimited number of new Bloom filters may be
> generated. (OPT_INTEGER only allows setting the '--no-' variant back to
> '0', hence a custom callback was used instead).
> 
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  Documentation/config/commitgraph.txt |  4 +++
>  Documentation/git-commit-graph.txt   |  4 +++
>  bloom.c                              | 15 +++++++++++
>  builtin/commit-graph.c               | 39 +++++++++++++++++++++++++---
>  commit-graph.c                       | 27 ++++++++++++++++---
>  commit-graph.h                       |  1 +
>  t/t4216-log-bloom.sh                 | 19 ++++++++++++++
>  7 files changed, 102 insertions(+), 7 deletions(-)
> 
> diff --git a/Documentation/config/commitgraph.txt b/Documentation/config/commitgraph.txt
> index cff0797b54..4582c39fc4 100644
> --- a/Documentation/config/commitgraph.txt
> +++ b/Documentation/config/commitgraph.txt
> @@ -1,3 +1,7 @@
> +commitGraph.maxNewFilters::
> +	Specifies the default value for the `--max-new-filters` option of `git
> +	commit-graph write` (c.f., linkgit:git-commit-graph[1]).
> +
>  commitGraph.readChangedPaths::
>  	If true, then git will use the changed-path Bloom filters in the
>  	commit-graph file (if it exists, and they are present). Defaults to
> diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
> index 17405c73a9..9c887d5d79 100644
> --- a/Documentation/git-commit-graph.txt
> +++ b/Documentation/git-commit-graph.txt
> @@ -67,6 +67,10 @@ this option is given, future commit-graph writes will automatically assume
>  that this option was intended. Use `--no-changed-paths` to stop storing this
>  data.
>  +
> +With the `--max-new-filters=<n>` option, generate at most `n` new Bloom
> +filters (if `--changed-paths` is specified). If `n` is `-1`, no limit is
> +enforced. Overrides the `commitGraph.maxNewFilters` configuration.

I think this description should also detail what happens with those
commits for which no modified path Bloom filters are calculated, and
which commands will calculate them (even implicitly).

> ++
>  With the `--split[=<strategy>]` option, write the commit-graph as a
>  chain of multiple commit-graph files stored in
>  `<dir>/info/commit-graphs`. Commit-graph layers are merged based on the
> diff --git a/bloom.c b/bloom.c
> index ed54e96e57..8d07209c6b 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -51,6 +51,21 @@ static int load_bloom_filter_from_graph(struct commit_graph *g,
>  	else
>  		start_index = 0;
>  
> +	if ((start_index == end_index) &&
> +	    (g->bloom_large && !bitmap_get(g->bloom_large, lex_pos))) {
> +		/*
> +		 * If the filter is zero-length, either (1) the filter has no
> +		 * changes, (2) the filter has too many changes, or (3) it
> +		 * wasn't computed (eg., due to '--max-new-filters').
> +		 *
> +		 * If either (1) or (2) is the case, the 'large' bit will be set
> +		 * for this Bloom filter. If it is unset, then it wasn't
> +		 * computed. In that case, return nothing, since we don't have
> +		 * that filter in the graph.
> +		 */
> +		return 0;
> +	}
> +
>  	filter->len = end_index - start_index;
>  	filter->data = (unsigned char *)(g->chunk_bloom_data +
>  					sizeof(unsigned char) * start_index +
> diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
> index 38f5f57d15..3500a6e1f1 100644
> --- a/builtin/commit-graph.c
> +++ b/builtin/commit-graph.c
> @@ -13,7 +13,8 @@ static char const * const builtin_commit_graph_usage[] = {
>  	N_("git commit-graph verify [--object-dir <objdir>] [--shallow] [--[no-]progress]"),
>  	N_("git commit-graph write [--object-dir <objdir>] [--append] "
>  	   "[--split[=<strategy>]] [--reachable|--stdin-packs|--stdin-commits] "
> -	   "[--changed-paths] [--[no-]progress] <split options>"),
> +	   "[--changed-paths] [--[no-]max-new-filters <n>] [--[no-]progress] "
> +	   "<split options>"),
>  	NULL
>  };
>  
> @@ -25,7 +26,8 @@ static const char * const builtin_commit_graph_verify_usage[] = {
>  static const char * const builtin_commit_graph_write_usage[] = {
>  	N_("git commit-graph write [--object-dir <objdir>] [--append] "
>  	   "[--split[=<strategy>]] [--reachable|--stdin-packs|--stdin-commits] "
> -	   "[--changed-paths] [--[no-]progress] <split options>"),
> +	   "[--changed-paths] [--[no-]max-new-filters <n>] [--[no-]progress] "
> +	   "<split options>"),
>  	NULL
>  };
>  
> @@ -162,6 +164,23 @@ static int read_one_commit(struct oidset *commits, struct progress *progress,
>  	return 0;
>  }
>  
> +static int write_option_max_new_filters(const struct option *opt,
> +					const char *arg,
> +					int unset)
> +{
> +	int *to = opt->value;
> +	if (unset)
> +		*to = -1;
> +	else {
> +		const char *s;
> +		*to = strtol(arg, (char **)&s, 10);
> +		if (*s)
> +			return error(_("%s expects a numerical value"),
> +				     optname(opt, opt->flags));
> +	}
> +	return 0;
> +}
> +
>  static int graph_write(int argc, const char **argv)
>  {
>  	struct string_list pack_indexes = STRING_LIST_INIT_NODUP;
> @@ -197,6 +216,9 @@ static int graph_write(int argc, const char **argv)
>  			N_("maximum ratio between two levels of a split commit-graph")),
>  		OPT_EXPIRY_DATE(0, "expire-time", &write_opts.expire_time,
>  			N_("only expire files older than a given date-time")),
> +		OPT_CALLBACK_F(0, "max-new-filters", &write_opts.max_new_filters,
> +			NULL, N_("maximum number of changed-path Bloom filters to compute"),
> +			0, write_option_max_new_filters),
>  		OPT_END(),
>  	};
>  
> @@ -205,6 +227,7 @@ static int graph_write(int argc, const char **argv)
>  	write_opts.size_multiple = 2;
>  	write_opts.max_commits = 0;
>  	write_opts.expire_time = 0;
> +	write_opts.max_new_filters = -1;
>  
>  	trace2_cmd_mode("write");
>  
> @@ -270,6 +293,16 @@ static int graph_write(int argc, const char **argv)
>  	return result;
>  }
>  
> +static int git_commit_graph_config(const char *var, const char *value, void *cb)
> +{
> +	if (!strcmp(var, "commitgraph.maxnewfilters")) {
> +		write_opts.max_new_filters = git_config_int(var, value);
> +		return 0;
> +	}
> +
> +	return git_default_config(var, value, cb);
> +}
> +
>  int cmd_commit_graph(int argc, const char **argv, const char *prefix)
>  {
>  	static struct option builtin_commit_graph_options[] = {
> @@ -283,7 +316,7 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
>  		usage_with_options(builtin_commit_graph_usage,
>  				   builtin_commit_graph_options);
>  
> -	git_config(git_default_config, NULL);
> +	git_config(git_commit_graph_config, &opts);
>  	argc = parse_options(argc, argv, prefix,
>  			     builtin_commit_graph_options,
>  			     builtin_commit_graph_usage,
> diff --git a/commit-graph.c b/commit-graph.c
> index 6886f319a5..4aae5471e3 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -954,7 +954,8 @@ struct tree *get_commit_tree_in_graph(struct repository *r, const struct commit
>  }
>  
>  static int get_bloom_filter_large_in_graph(struct commit_graph *g,
> -					   const struct commit *c)
> +					   const struct commit *c,
> +					   uint32_t max_changed_paths)
>  {
>  	uint32_t graph_pos = commit_graph_position(c);
>  	if (graph_pos == COMMIT_NOT_FROM_GRAPH)
> @@ -965,6 +966,17 @@ static int get_bloom_filter_large_in_graph(struct commit_graph *g,
>  
>  	if (!(g && g->bloom_large))
>  		return 0;
> +	if (g->bloom_filter_settings->max_changed_paths != max_changed_paths) {
> +		/*
> +		 * Force all commits which are subject to a different
> +		 * 'max_changed_paths' limit to be recomputed from scratch.
> +		 *
> +		 * Note that this could likely be improved, but is ignored since
> +		 * all real-world graphs set the maximum number of changed paths
> +		 * at 512.

I don't understand what the second part of this comment is trying to
say.  Besides, real-world graphs might very well contain Bloom filters
with more than 512 modified paths, because the code applying that
limit was buggy.

> +		 */
> +		return 0;
> +	}
>  	return bitmap_get(g->bloom_large, graph_pos - g->num_commits_in_base);
>  }
>  
> @@ -1470,6 +1482,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>  	int i;
>  	struct progress *progress = NULL;
>  	int *sorted_commits;
> +	int max_new_filters;
>  
>  	init_bloom_filters();
>  	ctx->bloom_large = bitmap_word_alloc(ctx->commits.nr / BITS_IN_EWORD + 1);
> @@ -1486,10 +1499,15 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>  		ctx->order_by_pack ? commit_pos_cmp : commit_gen_cmp,
>  		&ctx->commits);
>  
> +	max_new_filters = ctx->opts->max_new_filters >= 0 ?
> +		ctx->opts->max_new_filters : ctx->commits.nr;
> +
>  	for (i = 0; i < ctx->commits.nr; i++) {
>  		int pos = sorted_commits[i];
>  		struct commit *c = ctx->commits.list[pos];
> -		if (get_bloom_filter_large_in_graph(ctx->r->objects->commit_graph, c)) {
> +		if (get_bloom_filter_large_in_graph(ctx->r->objects->commit_graph,
> +						    c,
> +						    ctx->bloom_settings->max_changed_paths)) {
>  			bitmap_set(ctx->bloom_large, pos);
>  			ctx->count_bloom_filter_known_large++;
>  		} else {
> @@ -1497,7 +1515,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>  			struct bloom_filter *filter = get_or_compute_bloom_filter(
>  				ctx->r,
>  				c,
> -				1,
> +				ctx->count_bloom_filter_computed < max_new_filters,
>  				ctx->bloom_settings,
>  				&computed);
>  			if (computed) {
> @@ -1507,7 +1525,8 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>  					ctx->count_bloom_filter_found_large++;
>  				}
>  			}
> -			ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
> +			if (filter)
> +				ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
>  		}
>  		display_progress(progress, i + 1);
>  	}
> diff --git a/commit-graph.h b/commit-graph.h
> index af08c4505d..75ef83708b 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -114,6 +114,7 @@ struct commit_graph_opts {
>  	int max_commits;
>  	timestamp_t expire_time;
>  	enum commit_graph_split_flags flags;
> +	int max_new_filters;
>  };
>  
>  /*
> diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
> index 6859d85369..3aab8ffbe3 100755
> --- a/t/t4216-log-bloom.sh
> +++ b/t/t4216-log-bloom.sh
> @@ -286,4 +286,23 @@ test_expect_success 'Bloom generation does not recompute too-large filters' '
>  	)
>  '
>  
> +test_expect_success 'Bloom generation is limited by --max-new-filters' '
> +	(
> +		cd limits &&
> +		test_commit c2 filter &&
> +		test_commit c3 filter &&
> +		test_commit c4 no-filter &&
> +		test_bloom_filters_computed "--reachable --changed-paths --split=replace --max-new-filters=2" \
> +			2 0 2
> +	)
> +'
> +
> +test_expect_success 'Bloom generation backfills previously-skipped filters' '
> +	(
> +		cd limits &&
> +		test_bloom_filters_computed "--reachable --changed-paths --split=replace --max-new-filters=1" \
> +			2 0 1
> +	)
> +'
> +
>  test_done
> -- 
> 2.28.0.rc1.13.ge78abce653


* Re: [PATCH v3 14/14] builtin/commit-graph.c: introduce '--max-new-filters=<n>'
  2020-08-11 20:52   ` [PATCH v3 14/14] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
                       ` (2 preceding siblings ...)
  2020-08-18 22:23     ` SZEDER Gábor
@ 2020-08-19  8:20     ` SZEDER Gábor
  2020-09-03 16:42       ` Taylor Blau
  2020-09-01 14:36     ` SZEDER Gábor
  4 siblings, 1 reply; 117+ messages in thread
From: SZEDER Gábor @ 2020-08-19  8:20 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, dstolee, gitster

On Tue, Aug 11, 2020 at 04:52:14PM -0400, Taylor Blau wrote:
> Introduce a command-line flag and configuration variable to fill in the
> 'max_new_filters' variable introduced by the previous patch.
> 
> The command-line option '--max-new-filters' takes precedence over
> 'commitGraph.maxNewFilters', which is the default value.
> '--no-max-new-filters' can also be provided, which sets the value back
> to '-1', indicating that an unlimited number of new Bloom filters may be
> generated. (OPT_INTEGER only allows setting the '--no-' variant back to
> '0', hence a custom callback was used instead).

Forgot the most important thing: Why?  Please explain in the commit
message why this option is necessary, what problems it solves, how it
is supposed to interact with other options, and why.



* Re: [PATCH v3 13/14] commit-graph: rename 'split_commit_graph_opts'
  2020-08-11 20:52   ` [PATCH v3 13/14] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
@ 2020-08-19  9:56     ` SZEDER Gábor
  2020-09-02 21:02       ` Taylor Blau
  0 siblings, 1 reply; 117+ messages in thread
From: SZEDER Gábor @ 2020-08-19  9:56 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, dstolee, gitster

On Tue, Aug 11, 2020 at 04:52:11PM -0400, Taylor Blau wrote:
> In the subsequent commit, additional options will be added to the
> commit-graph API which have nothing to do with splitting.
> 
> Rename the 'split_commit_graph_opts' structure to the more-generic
> 'commit_graph_opts' to encompass both.

Good.  Note, however, that write_commit_graph() has a 'flags'
parameter as well, and when a feature is enabled via this 'flags',
then first a corresponding 'ctx->foo' field is set, and that
'ctx->foo' is checked while computing and writing the commit-graph.
With the generic options struct some other feature will be enabled via
the 'opts->bar' field, so simply 'ctx->opts->bar' is checked while
writing the commit-graph.

With the generic options struct there really is no need for a separate
flags parameter, the values in the flags can be stored in the options
struct, and we can eliminate this inconsistency instead of adding even
more.


> diff --git a/commit-graph.h b/commit-graph.h
> index ddbca1b59d..af08c4505d 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -109,7 +109,7 @@ enum commit_graph_split_flags {
>  	COMMIT_GRAPH_SPLIT_REPLACE          = 2
>  };
>  
> -struct split_commit_graph_opts {
> +struct commit_graph_opts {
>  	int size_multiple;
>  	int max_commits;
>  	timestamp_t expire_time;
        enum commit_graph_split_flags flags;

While this was 'struct split_commit_graph_opts *split_opts' it was
clear what kind of flags were in this 'flags' field.  Now that the
struct is generic it's not clear anymore, so perhaps it should be
renamed as well (e.g. 'split_flags'), or even turned into a couple of
bit fields.
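
For illustration, the bit-field variant could look roughly like the
sketch below.  The names 'split_no_merge' and 'split_replace' and the
helper 'commit_graph_opts_valid()' are made up here, 'timestamp_t' is
typedef'd locally only to keep the snippet standalone, and I'm
assuming that the two split strategies are mutually exclusive:

```c
#include <assert.h>
#include <stdint.h>

typedef uintmax_t timestamp_t; /* local stand-in, only for this sketch */

struct commit_graph_opts {
	int size_multiple;
	int max_commits;
	timestamp_t expire_time;
	/* hypothetical replacements for the generic 'flags' field */
	unsigned int split_no_merge : 1;
	unsigned int split_replace : 1;
};

/*
 * Returns 1 if the options are consistent, i.e. at most one split
 * strategy was requested (assumed to be mutually exclusive).
 */
int commit_graph_opts_valid(const struct commit_graph_opts *opts)
{
	assert(opts);
	return !(opts->split_no_merge && opts->split_replace);
}
```

With named fields a caller checks 'opts->split_replace' directly, and
it stays obvious that these bits are about splitting and nothing else.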

> @@ -124,12 +124,12 @@ struct split_commit_graph_opts {
>   */
>  int write_commit_graph_reachable(struct object_directory *odb,
>  				 enum commit_graph_write_flags flags,
> -				 const struct split_commit_graph_opts *split_opts);
> +				 const struct commit_graph_opts *opts);
>  int write_commit_graph(struct object_directory *odb,
>  		       struct string_list *pack_indexes,
>  		       struct oidset *commits,
>  		       enum commit_graph_write_flags flags,
> -		       const struct split_commit_graph_opts *split_opts);
> +		       const struct commit_graph_opts *opts);
>  
>  #define COMMIT_GRAPH_VERIFY_SHALLOW	(1 << 0)
>  
> -- 
> 2.28.0.rc1.13.ge78abce653
> 


* Re: [PATCH v3 12/14] commit-graph: add large-filters bitmap chunk
  2020-08-11 20:52   ` [PATCH v3 12/14] commit-graph: add large-filters bitmap chunk Taylor Blau
  2020-08-11 21:11     ` Derrick Stolee
@ 2020-08-19 13:35     ` SZEDER Gábor
  2020-09-02 20:23       ` Taylor Blau
  2020-09-01 14:35     ` SZEDER Gábor
  2 siblings, 1 reply; 117+ messages in thread
From: SZEDER Gábor @ 2020-08-19 13:35 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, dstolee, gitster

On Tue, Aug 11, 2020 at 04:52:07PM -0400, Taylor Blau wrote:
> When a commit has more than a certain number of changed paths (commonly
> 512), the commit-graph machinery represents it as a zero-length filter.
> This is done since having many entries in the Bloom filter has
> undesirable effects on the false positivity rate.

This is not the case: the false positive probability depends on the
ratio of the Bloom filter's size to the number of elements it
contains, and we size the filters proportionally to the number of
elements they contain, so the number of elements shouldn't affect the
false positive rate.

On the contrary, it's the small filters, up to around 30-35 bytes,
that tend to have a larger than expected false positive rate when
using double hashing.
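
To put numbers on the first point, here is a minimal sketch of the
textbook formula, assuming ideal independent hash functions, 10 bits
per element and 7 hashes (the function name is made up for
illustration).  It deliberately doesn't model double hashing, so it
can't show the small-filter effect:

```c
#include <assert.h>

/*
 * Textbook false positive probability of a Bloom filter holding
 * 'nr_elems' elements in m = nr_elems * bits_per_entry bits, probed
 * with k = num_hashes ideal hash functions:
 *
 *   p = (1 - (1 - 1/m)^(k * nr_elems))^k
 *
 * Computed with plain loops so that no libm is needed.
 */
double bloom_false_positive_rate(unsigned int nr_elems,
				 unsigned int bits_per_entry,
				 unsigned int num_hashes)
{
	double m, bit_still_zero = 1.0, p = 1.0;
	unsigned int i;

	assert(nr_elems && bits_per_entry && num_hashes);
	m = (double)nr_elems * bits_per_entry;

	/* probability that a given bit is still zero after all inserts */
	for (i = 0; i < num_hashes * nr_elems; i++)
		bit_still_zero *= 1.0 - 1.0 / m;
	/* a false positive needs all 'num_hashes' probed bits to be set */
	for (i = 0; i < num_hashes; i++)
		p *= 1.0 - bit_still_zero;
	return p;
}
```

Under these assumptions a 10-element filter and a 500-element filter
both come out at roughly 0.8%, i.e. the element count by itself barely
moves the rate as long as the filter is sized proportionally.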



* Re: [PATCH v3 12/14] commit-graph: add large-filters bitmap chunk
  2020-08-11 20:52   ` [PATCH v3 12/14] commit-graph: add large-filters bitmap chunk Taylor Blau
  2020-08-11 21:11     ` Derrick Stolee
  2020-08-19 13:35     ` SZEDER Gábor
@ 2020-09-01 14:35     ` SZEDER Gábor
  2020-09-02 20:40       ` Taylor Blau
  2 siblings, 1 reply; 117+ messages in thread
From: SZEDER Gábor @ 2020-09-01 14:35 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, dstolee, gitster

On Tue, Aug 11, 2020 at 04:52:07PM -0400, Taylor Blau wrote:
> When a commit has more than a certain number of changed paths (commonly
> 512), the commit-graph machinery represents it as a zero-length filter.
> This is done since having many entries in the Bloom filter has
> undesirable effects on the false positivity rate.
> 
> In addition to these too-large filters, the commit-graph machinery also
> represents commits with no filter and commits with no changed paths in
> the same way.
> 
> When writing a commit-graph that aggregates several incremental
> commit-graph layers (eg., with '--split=replace'), the commit-graph
> machinery first computes all of the Bloom filters that it wants to write
> but does not already know about from existing graph layers. Because we
> overload the zero-length filter in the above fashion, this leads to
> recomputing large filters over and over again.
> 
> This is already undesirable, since it means that we are wasting
> considerable effort to discover that a commit with too many changed
> paths, only to throw that effort away (and then repeat the process the
> next time a roll-up is performed).

Is this really the case?

If a commit has a corresponding entry in the Bloom filters chunks,
then the commit-graph machinery does know the Bloom filter associated
with that commit.  The size of that filter should not matter, i.e.
even if it is a zero-length filter, the commit-graph machinery should
know about it all the same.  And as far as I can tell it does indeed,
because load_bloom_filter_from_graph() sets a non-NULL 'filter->data'
pointer even if 'filter->len' is zero, which get_bloom_filter() treats
as "we know about this", and returns early without (re)computing the
filter.  Even the test 'Bloom generation does not recompute too-large
filters' added in this patch is expected to succeed, and my
superficial and makeshift testing seems to corroborate this; at least
I couldn't find a combination of commands and options that would
recompute any existing zero-length Bloom filters.

Am I missing something?
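
To spell out the behaviour I mean, here is a heavily simplified model
(not the actual code: the struct is a cut-down stand-in, the chunk
header offset is left out, and 'load_filter()' and 'filter_is_known()'
are names made up for this sketch):

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-in for the real 'struct bloom_filter'. */
struct bloom_filter {
	unsigned char *data;
	size_t len;
};

/*
 * Mimics the load step: 'data' always ends up pointing into the
 * chunk, even when both offsets coincide and 'len' is zero.
 */
void load_filter(struct bloom_filter *filter, unsigned char *chunk_data,
		 size_t start_index, size_t end_index)
{
	assert(start_index <= end_index);
	filter->len = end_index - start_index;
	filter->data = chunk_data + start_index;
}

/* "Known" is decided by the pointer alone; the length plays no part. */
int filter_is_known(const struct bloom_filter *filter)
{
	return filter->data != NULL;
}
```

In this model a zero-length filter loaded from the graph is "known"
just like any other, so nothing would trigger recomputing it.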

> In a subsequent patch, we will add a '--max-new-filters=<n>' option,
> which specifies an upper-bound on the number of new filters we are
> willing to compute from scratch. Suppose that there are 'N' too-large
> filters, and we specify '--max-new-filters=M'. If 'N >= M', it is
> unlikely that any filters will be generated, since we'll spend most of
> our effort on filters that we ultimately throw away. If 'N < M', filters
> will trickle in over time, but only at most 'M - N' per-write.
> 
> To address this, add a new chunk which encodes a bitmap where the ith
> bit is on iff the ith commit has zero or at least 512 changed paths.
> Likewise store the maximum number of changed paths we are willing to
> store in order to prepare for eventually making this value more easily
> customizable.

I don't understand how storing this value would make it any easier to
customize.

> When computing Bloom filters, first consult the relevant
> bitmap (in the case that we are rolling up existing layers) to see if
> computing the Bloom filter from scratch would be a waste of time.
> 
> This patch implements a new chunk instead of extending the existing BIDX
> and BDAT chunks because modifying these chunks would confuse old
> clients. (Eg., setting the most-significant bit in the BIDX chunk would
> confuse old clients and require a version bump).
> 
> To allow using the existing bitmap code with 64-bit words, we write the
> data in network byte order from the 64-bit words. This means we also
> need to read the array from the commit-graph file by translating each
> word from network byte order using get_be64() when loading the commit
> graph. (Note that this *could* be delayed until first-use, but a later
> patch will rely on this being initialized early, so we assume the
> up-front cost when parsing instead of delaying initialization).
> 
> By avoiding the need to move to new versions of the BDAT and BIDX chunk,
> we can give ourselves more time to consider whether or not other
> modifications to these chunks are worthwhile without holding up this
> change.
> 
> Another approach would be to introduce a new BIDX chunk (say, one
> identified by 'BID2') which is identical to the existing BIDX chunk,
> except the most-significant bit of each offset is interpreted as "this
> filter is too big" iff looking at a BID2 chunk. This avoids having to
> write a bitmap, but forces older clients to rewrite their commit-graphs
> (as well as reduces the theoretical largest Bloom filters we couldl

And it reduces the max possible size of the BDAT chunk, and thus the
max number of commits with Bloom filters as well.

s/couldl/could/

> write, and forces us to maintain the code necessary to translate BIDX
> chunks to BID2 ones). Separately from this patch, I implemented this
> alternate approach and did not find it to be advantageous.

Let's take a step back to reconsider what should be stored in this
bitmap for a moment.  Sure, setting a bit for each commit that doesn't
modify any paths or modifies too many makes it possible to reliably
identify commits that don't have Bloom filters yet.  But isn't that a
bit of a roundabout way?  I think it would be better to directly track
which commits don't have Bloom filters yet.  IOW what you really want
is a, say, BNCY "Bloom filter Not Computed Yet" chunk, where we set
the corresponding bit for each commit which has an entry in the BIDX
chunk but for which a Bloom filter hasn't been computed yet.

  - It's simpler and easier to explain (IMO).

  - This bitmap chunk can easily be made optional: if all Bloom
    filters have been computed, then the bitmap will contain all
    zeros.  So why bother writing it, when we can save a bit of space
    instead?

  - It avoids the unpleasantness of setting a bit in the _Large_ Bloom
    Filters chunk for commits _not_ modifying any paths.

  - Less incentive to spill implementation details to the format
    specification (e.g. 512 modified paths).
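
A rough sketch of how such a bitmap could be handled (all names are
hypothetical, and I'm assuming plain 64-bit words with bit 'pos % 64'
of word 'pos / 64' standing for the commit at position 'pos'):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mark commit 'pos' as "Bloom filter not computed yet". */
void bncy_set(uint64_t *words, size_t pos)
{
	words[pos / 64] |= (uint64_t)1 << (pos % 64);
}

/* Is commit 'pos' still waiting for its Bloom filter? */
int bncy_get(const uint64_t *words, size_t pos)
{
	return (words[pos / 64] >> (pos % 64)) & 1;
}

/*
 * The chunk only needs to be written if at least one filter is still
 * missing; an all-zero bitmap carries no information.
 */
int bncy_chunk_needed(const uint64_t *words, size_t nr_words)
{
	size_t i;

	assert(words || !nr_words);
	for (i = 0; i < nr_words; i++)
		if (words[i])
			return 1;
	return 0;
}
```

A writer would call something like 'bncy_chunk_needed()' once after
the filters have been computed, and skip the chunk entirely in the
common case where nothing was left out.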

Now, let's take another step back: is such a bitmap really necessary?
We could write a single-byte Bloom filter with no bits set for commits
not modifying any paths, and a single-byte Bloom filter with all bits
set for commits modifying too many paths.  This is compatible with the
specs, and any existing implementation should do the right thing when
reading such filters.  It would allow us to interpret zero-length
filters as "not computed yet", and, if that bitmap chunk won't be
optional, it would save space as long as fewer than 1/8 of commits
modify no or too many paths.  Unfortunately, however, all existing
zero-length Bloom filters would have to be recomputed.
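
To illustrate why a reader should do the right thing, here is a
generic membership check (a sketch only, not the actual code; the
function names are made up, and the hash values are taken as given
since how they are derived doesn't matter for this argument):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Generic Bloom filter membership test: the key may be present only
 * if every probed bit is set.
 */
int filter_maybe_contains(const unsigned char *data, size_t len,
			  const uint32_t *hashes, size_t nr_hashes)
{
	size_t i;

	assert(len > 0);
	for (i = 0; i < nr_hashes; i++) {
		uint32_t bit = hashes[i] % (uint32_t)(len * 8);
		if (!(data[bit / 8] & (1u << (bit % 8))))
			return 0; /* definitely not modified */
	}
	return 1; /* maybe modified */
}

/*
 * Space check: the bitmap costs one bit per commit, the single-byte
 * filters cost one byte per empty or too-large commit.
 */
int single_byte_filters_save_space(size_t nr_commits, size_t nr_special)
{
	return nr_special * 8 < nr_commits;
}
```

A 0x00 byte answers "definitely not" for every path and a 0xff byte
answers "maybe" for every path, which is exactly what we want for
empty and too-large commits, respectively.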


> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  .../technical/commit-graph-format.txt         | 12 +++
>  bloom.h                                       |  2 +-
>  commit-graph.c                                | 96 ++++++++++++++++---
>  commit-graph.h                                |  4 +
>  t/t4216-log-bloom.sh                          | 25 ++++-
>  5 files changed, 124 insertions(+), 15 deletions(-)
> 
> diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
> index 440541045d..5f2d9ab4d7 100644
> --- a/Documentation/technical/commit-graph-format.txt
> +++ b/Documentation/technical/commit-graph-format.txt
> @@ -123,6 +123,18 @@ CHUNK DATA:
>        of length zero.
>      * The BDAT chunk is present if and only if BIDX is present.
>  
> +  Large Bloom Filters (ID: {'B', 'F', 'X', 'L'}) [Optional]
> +    * It starts with a 32-bit unsigned integer specifying the maximum number of
> +      changed-paths that can be stored in a single Bloom filter.

Should this value be the same in all elements of a commit-graph chain?
Note that in this case having different values won't affect revision
walks using modified path Bloom filters.

> +    * It then contains a list of 64-bit words (the length of this list is
> +      determined by the width of the chunk) which is a bitmap. The 'i'th bit is
> +      set exactly when the 'i'th commit in the graph has a changed-path Bloom
> +      filter with zero entries (either because the commit is empty, or because
> +      it contains more than 512 changed paths).

Please make clear the byte order of these 64-bit words in the specs as
well.

Furthermore, that 512 path limit is an implementation detail, so it
would be better if it didn't leak into the specification of this new
chunk.
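
As for the byte order, this is the kind of thing a reader has to know
to parse the bitmap.  A minimal sketch, assuming network byte order
as the commit message says, looking only at the bitmap words (i.e.
after the leading 32-bit integer), and assuming bit 'pos % 64' of word
'pos / 64' belongs to commit 'pos'; 'read_be64()' is a portable
stand-in for get_be64(), and 'bitmap_chunk_get()' is a made-up name:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one 64-bit word stored in network (big-endian) byte order. */
uint64_t read_be64(const unsigned char *p)
{
	uint64_t v = 0;
	int i;

	for (i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Test bit 'pos' of a bitmap stored as big-endian 64-bit words. */
int bitmap_chunk_get(const unsigned char *chunk, size_t chunk_len,
		     size_t pos)
{
	assert(pos / 64 < chunk_len / 8);
	return (read_be64(chunk + (pos / 64) * 8) >> (pos % 64)) & 1;
}
```

Without the byte order spelled out in the spec, another
implementation could just as well pick the opposite one and still
claim to follow the format.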

> +    * The BFXL chunk is present only when the BIDX and BDAT chunks are
> +      also present.
> +
> +
>    Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional]
>        This list of H-byte hashes describe a set of B commit-graph files that
>        form a commit-graph chain. The graph position for the ith commit in this



> diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
> index 21b67677ef..6859d85369 100755
> --- a/t/t4216-log-bloom.sh
> +++ b/t/t4216-log-bloom.sh
> @@ -33,7 +33,7 @@ test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
>  	git commit-graph write --reachable --changed-paths
>  '
>  graph_read_expect () {
> -	NUM_CHUNKS=5
> +	NUM_CHUNKS=6
>  	cat >expect <<- EOF
>  	header: 43475048 1 1 $NUM_CHUNKS 0
>  	num_commits: $1
> @@ -262,5 +262,28 @@ test_expect_success 'correctly report changes over limit' '
>  		done
>  	)
>  '
> +test_bloom_filters_computed () {
> +	commit_graph_args=$1
> +	bloom_trace_prefix="{\"filter_known_large\":$2,\"filter_found_large\":$3,\"filter_computed\":$4"
> +	rm -f "$TRASH_DIRECTORY/trace.event" &&
> +	GIT_TRACE2_EVENT="$TRASH_DIRECTORY/trace.event" git commit-graph write $commit_graph_args &&
> +	grep "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.event"
> +}
> +
> +test_expect_success 'Bloom generation does not recompute too-large filters' '
> +	(
> +		cd limits &&
> +
> +		# start from scratch and rebuild
> +		rm -f .git/objects/info/commit-graph &&
> +		GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=10 \
> +			git commit-graph write --reachable --changed-paths \
> +			--split=replace &&
> +		test_commit c1 filter &&
> +
> +		test_bloom_filters_computed "--reachable --changed-paths --split=replace" \
> +			2 0 1
> +	)
> +'
>  
>  test_done
> -- 
> 2.28.0.rc1.13.ge78abce653
> 


* Re: [PATCH v3 14/14] builtin/commit-graph.c: introduce '--max-new-filters=<n>'
  2020-08-11 20:52   ` [PATCH v3 14/14] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
                       ` (3 preceding siblings ...)
  2020-08-19  8:20     ` SZEDER Gábor
@ 2020-09-01 14:36     ` SZEDER Gábor
  2020-09-03 18:49       ` Taylor Blau
  4 siblings, 1 reply; 117+ messages in thread
From: SZEDER Gábor @ 2020-09-01 14:36 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, dstolee, gitster

On Tue, Aug 11, 2020 at 04:52:14PM -0400, Taylor Blau wrote:
> Introduce a command-line flag and configuration variable to fill in the
> 'max_new_filters' variable introduced by the previous patch.
> 
> The command-line option '--max-new-filters' takes precedence over
> 'commitGraph.maxNewFilters', which is the default value.
> '--no-max-new-filters' can also be provided, which sets the value back
> to '-1', indicating that an unlimited number of new Bloom filters may be
> generated. (OPT_INTEGER only allows setting the '--no-' variant back to
> '0', hence a custom callback was used instead).
> 
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  Documentation/config/commitgraph.txt |  4 +++
>  Documentation/git-commit-graph.txt   |  4 +++
>  bloom.c                              | 15 +++++++++++
>  builtin/commit-graph.c               | 39 +++++++++++++++++++++++++---
>  commit-graph.c                       | 27 ++++++++++++++++---
>  commit-graph.h                       |  1 +
>  t/t4216-log-bloom.sh                 | 19 ++++++++++++++
>  7 files changed, 102 insertions(+), 7 deletions(-)
> 
> diff --git a/Documentation/config/commitgraph.txt b/Documentation/config/commitgraph.txt
> index cff0797b54..4582c39fc4 100644
> --- a/Documentation/config/commitgraph.txt
> +++ b/Documentation/config/commitgraph.txt
> @@ -1,3 +1,7 @@
> +commitGraph.maxNewFilters::
> +	Specifies the default value for the `--max-new-filters` option of `git
> +	commit-graph write` (c.f., linkgit:git-commit-graph[1]).
> +
>  commitGraph.readChangedPaths::
>  	If true, then git will use the changed-path Bloom filters in the
>  	commit-graph file (if it exists, and they are present). Defaults to
> diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
> index 17405c73a9..9c887d5d79 100644
> --- a/Documentation/git-commit-graph.txt
> +++ b/Documentation/git-commit-graph.txt
> @@ -67,6 +67,10 @@ this option is given, future commit-graph writes will automatically assume
>  that this option was intended. Use `--no-changed-paths` to stop storing this
>  data.
>  +
> +With the `--max-new-filters=<n>` option, generate at most `n` new Bloom
> +filters (if `--changed-paths` is specified). If `n` is `-1`, no limit is
> +enforced. Overrides the `commitGraph.maxNewFilters` configuration.
> ++
>  With the `--split[=<strategy>]` option, write the commit-graph as a
>  chain of multiple commit-graph files stored in
>  `<dir>/info/commit-graphs`. Commit-graph layers are merged based on the
> diff --git a/bloom.c b/bloom.c
> index ed54e96e57..8d07209c6b 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -51,6 +51,21 @@ static int load_bloom_filter_from_graph(struct commit_graph *g,
>  	else
>  		start_index = 0;
>  
> +	if ((start_index == end_index) &&
> +	    (g->bloom_large && !bitmap_get(g->bloom_large, lex_pos))) {
> +		/*
> +		 * If the filter is zero-length, either (1) the filter has no
> +		 * changes, (2) the filter has too many changes, or (3) it
> +		 * wasn't computed (eg., due to '--max-new-filters').
> +		 *
> +		 * If either (1) or (2) is the case, the 'large' bit will be set
> +		 * for this Bloom filter. If it is unset, then it wasn't
> +		 * computed. In that case, return nothing, since we don't have
> +		 * that filter in the graph.
> +		 */
> +		return 0;
> +	}
> +
>  	filter->len = end_index - start_index;
>  	filter->data = (unsigned char *)(g->chunk_bloom_data +
>  					sizeof(unsigned char) * start_index +
> diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
> index 38f5f57d15..3500a6e1f1 100644
> --- a/builtin/commit-graph.c
> +++ b/builtin/commit-graph.c
> @@ -13,7 +13,8 @@ static char const * const builtin_commit_graph_usage[] = {
>  	N_("git commit-graph verify [--object-dir <objdir>] [--shallow] [--[no-]progress]"),
>  	N_("git commit-graph write [--object-dir <objdir>] [--append] "
>  	   "[--split[=<strategy>]] [--reachable|--stdin-packs|--stdin-commits] "
> -	   "[--changed-paths] [--[no-]progress] <split options>"),
> +	   "[--changed-paths] [--[no-]max-new-filters <n>] [--[no-]progress] "
> +	   "<split options>"),
>  	NULL
>  };
>  
> @@ -25,7 +26,8 @@ static const char * const builtin_commit_graph_verify_usage[] = {
>  static const char * const builtin_commit_graph_write_usage[] = {
>  	N_("git commit-graph write [--object-dir <objdir>] [--append] "
>  	   "[--split[=<strategy>]] [--reachable|--stdin-packs|--stdin-commits] "
> -	   "[--changed-paths] [--[no-]progress] <split options>"),
> +	   "[--changed-paths] [--[no-]max-new-filters <n>] [--[no-]progress] "
> +	   "<split options>"),
>  	NULL
>  };
>  
> @@ -162,6 +164,23 @@ static int read_one_commit(struct oidset *commits, struct progress *progress,
>  	return 0;
>  }
>  
> +static int write_option_max_new_filters(const struct option *opt,
> +					const char *arg,
> +					int unset)
> +{
> +	int *to = opt->value;
> +	if (unset)
> +		*to = -1;
> +	else {
> +		const char *s;
> +		*to = strtol(arg, (char **)&s, 10);
> +		if (*s)
> +			return error(_("%s expects a numerical value"),
> +				     optname(opt, opt->flags));
> +	}
> +	return 0;
> +}
> +
>  static int graph_write(int argc, const char **argv)
>  {
>  	struct string_list pack_indexes = STRING_LIST_INIT_NODUP;
> @@ -197,6 +216,9 @@ static int graph_write(int argc, const char **argv)
>  			N_("maximum ratio between two levels of a split commit-graph")),
>  		OPT_EXPIRY_DATE(0, "expire-time", &write_opts.expire_time,
>  			N_("only expire files older than a given date-time")),
> +		OPT_CALLBACK_F(0, "max-new-filters", &write_opts.max_new_filters,
> +			NULL, N_("maximum number of changed-path Bloom filters to compute"),
> +			0, write_option_max_new_filters),
>  		OPT_END(),
>  	};
>  
> @@ -205,6 +227,7 @@ static int graph_write(int argc, const char **argv)
>  	write_opts.size_multiple = 2;
>  	write_opts.max_commits = 0;
>  	write_opts.expire_time = 0;
> +	write_opts.max_new_filters = -1;
>  
>  	trace2_cmd_mode("write");
>  
> @@ -270,6 +293,16 @@ static int graph_write(int argc, const char **argv)
>  	return result;
>  }
>  
> +static int git_commit_graph_config(const char *var, const char *value, void *cb)
> +{
> +	if (!strcmp(var, "commitgraph.maxnewfilters")) {
> +		write_opts.max_new_filters = git_config_int(var, value);
> +		return 0;
> +	}
> +
> +	return git_default_config(var, value, cb);
> +}
> +
>  int cmd_commit_graph(int argc, const char **argv, const char *prefix)
>  {
>  	static struct option builtin_commit_graph_options[] = {
> @@ -283,7 +316,7 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
>  		usage_with_options(builtin_commit_graph_usage,
>  				   builtin_commit_graph_options);
>  
> -	git_config(git_default_config, NULL);
> +	git_config(git_commit_graph_config, &opts);
>  	argc = parse_options(argc, argv, prefix,
>  			     builtin_commit_graph_options,
>  			     builtin_commit_graph_usage,
> diff --git a/commit-graph.c b/commit-graph.c
> index 6886f319a5..4aae5471e3 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -954,7 +954,8 @@ struct tree *get_commit_tree_in_graph(struct repository *r, const struct commit
>  }
>  
>  static int get_bloom_filter_large_in_graph(struct commit_graph *g,
> -					   const struct commit *c)
> +					   const struct commit *c,
> +					   uint32_t max_changed_paths)
>  {
>  	uint32_t graph_pos = commit_graph_position(c);
>  	if (graph_pos == COMMIT_NOT_FROM_GRAPH)
> @@ -965,6 +966,17 @@ static int get_bloom_filter_large_in_graph(struct commit_graph *g,
>  
>  	if (!(g && g->bloom_large))
>  		return 0;
> +	if (g->bloom_filter_settings->max_changed_paths != max_changed_paths) {
> +		/*
> +		 * Force all commits which are subject to a different
> +		 * 'max_changed_paths' limit to be recomputed from scratch.
> +		 *
> +		 * Note that this could likely be improved, but is ignored since
> +		 * all real-world graphs set the maximum number of changed paths
> +		 * at 512.
> +		 */
> +		return 0;
> +	}
>  	return bitmap_get(g->bloom_large, graph_pos - g->num_commits_in_base);
>  }
>  
> @@ -1470,6 +1482,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>  	int i;
>  	struct progress *progress = NULL;
>  	int *sorted_commits;
> +	int max_new_filters;
>  
>  	init_bloom_filters();
>  	ctx->bloom_large = bitmap_word_alloc(ctx->commits.nr / BITS_IN_EWORD + 1);
> @@ -1486,10 +1499,15 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>  		ctx->order_by_pack ? commit_pos_cmp : commit_gen_cmp,
>  		&ctx->commits);
>  
> +	max_new_filters = ctx->opts->max_new_filters >= 0 ?
> +		ctx->opts->max_new_filters : ctx->commits.nr;
> +
>  	for (i = 0; i < ctx->commits.nr; i++) {
>  		int pos = sorted_commits[i];
>  		struct commit *c = ctx->commits.list[pos];
> -		if (get_bloom_filter_large_in_graph(ctx->r->objects->commit_graph, c)) {
> +		if (get_bloom_filter_large_in_graph(ctx->r->objects->commit_graph,
> +						    c,
> +						    ctx->bloom_settings->max_changed_paths)) {
>  			bitmap_set(ctx->bloom_large, pos);
>  			ctx->count_bloom_filter_known_large++;
>  		} else {
> @@ -1497,7 +1515,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>  			struct bloom_filter *filter = get_or_compute_bloom_filter(
>  				ctx->r,
>  				c,
> -				1,
> +				ctx->count_bloom_filter_computed < max_new_filters,
>  				ctx->bloom_settings,
>  				&computed);
>  			if (computed) {
> @@ -1507,7 +1525,8 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>  					ctx->count_bloom_filter_found_large++;
>  				}
>  			}
> -			ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
> +			if (filter)
> +				ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
>  		}
>  		display_progress(progress, i + 1);
>  	}
> diff --git a/commit-graph.h b/commit-graph.h
> index af08c4505d..75ef83708b 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -114,6 +114,7 @@ struct commit_graph_opts {
>  	int max_commits;
>  	timestamp_t expire_time;
>  	enum commit_graph_split_flags flags;
> +	int max_new_filters;
>  };
>  
>  /*
> diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
> index 6859d85369..3aab8ffbe3 100755
> --- a/t/t4216-log-bloom.sh
> +++ b/t/t4216-log-bloom.sh
> @@ -286,4 +286,23 @@ test_expect_success 'Bloom generation does not recompute too-large filters' '
>  	)
>  '
>  
> +test_expect_success 'Bloom generation is limited by --max-new-filters' '
> +	(
> +		cd limits &&
> +		test_commit c2 filter &&
> +		test_commit c3 filter &&
> +		test_commit c4 no-filter &&
> +		test_bloom_filters_computed "--reachable --changed-paths --split=replace --max-new-filters=2" \
> +			2 0 2
> +	)
> +'
> +
> +test_expect_success 'Bloom generation backfills previously-skipped filters' '
> +	(
> +		cd limits &&
> +		test_bloom_filters_computed "--reachable --changed-paths --split=replace --max-new-filters=1" \
> +			2 0 1
> +	)
> +'
> +
>  test_done
> -- 
> 2.28.0.rc1.13.ge78abce653

Something seems to be wrong in this patch, though I haven't looked
closer.  Consider this test with a bit of makeshift tracing:

  ---  >8  ---

diff --git a/bloom.c b/bloom.c
index 8d07209c6b..1a0dec35cd 100644
--- a/bloom.c
+++ b/bloom.c
@@ -222,6 +222,7 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 	if (!compute_if_not_present)
 		return NULL;
 
+	printf("get_or_compute_bloom_filter(): diff: %s\n", oid_to_hex(&c->object.oid));
 	repo_diff_setup(r, &diffopt);
 	diffopt.flags.recursive = 1;
 	diffopt.detect_rename = 0;
diff --git a/t/t9999-test.sh b/t/t9999-test.sh
new file mode 100755
index 0000000000..0833e6ff7e
--- /dev/null
+++ b/t/t9999-test.sh
@@ -0,0 +1,25 @@
+#!/bin/sh
+
+test_description='test'
+
+. ./test-lib.sh
+
+test_expect_success 'test' '
+	for i in 1 2 3 4 5 6
+	do
+		git commit -q --allow-empty -m $i || return 1
+	done &&
+	git log --oneline &&
+
+	# We have 6 commits and compute 2 Bloom filters per execution,
+	# so 3 executions should be enough...  but, alas, it isnt.
+	for i in 1 2 3 4 5
+	do
+		# No --split=replace!
+		git commit-graph write --reachable --changed-paths --max-new-filters=2 || return 1
+	done &&
+
+	git commit-graph write --reachable --changed-paths --max-new-filters=4
+'
+
+test_done

  ---  >8  ---

Its output looks like:

  [...]
  + git log --oneline
  13fcefa (HEAD -> master) 6
  0c71516 5
  a82a61c 4
  54c6b2c 3
  fc99def 2
  a3a8cd3 1
  + git commit-graph write --reachable --changed-paths --max-new-filters=2
  get_or_compute_bloom_filter(): diff: 0c71516945bf0a23813e80205961d29ebc1020e0
  get_or_compute_bloom_filter(): diff: 13fcefa4bb859a15c4edc6bb01b8b6c91b4f32b6
  + git commit-graph write --reachable --changed-paths --max-new-filters=2
  get_or_compute_bloom_filter(): diff: 54c6b2cd4fb50066683a197cc6d677689618505a
  get_or_compute_bloom_filter(): diff: a3a8cd3c82028671bf51502d77277baf14a2f528
  + git commit-graph write --reachable --changed-paths --max-new-filters=2
  get_or_compute_bloom_filter(): diff: 0c71516945bf0a23813e80205961d29ebc1020e0
  get_or_compute_bloom_filter(): diff: 13fcefa4bb859a15c4edc6bb01b8b6c91b4f32b6
  + git commit-graph write --reachable --changed-paths --max-new-filters=2
  get_or_compute_bloom_filter(): diff: 54c6b2cd4fb50066683a197cc6d677689618505a
  get_or_compute_bloom_filter(): diff: a3a8cd3c82028671bf51502d77277baf14a2f528
  + git commit-graph write --reachable --changed-paths --max-new-filters=2
  get_or_compute_bloom_filter(): diff: 0c71516945bf0a23813e80205961d29ebc1020e0
  get_or_compute_bloom_filter(): diff: 13fcefa4bb859a15c4edc6bb01b8b6c91b4f32b6
  + git commit-graph write --reachable --changed-paths
  get_or_compute_bloom_filter(): diff: 54c6b2cd4fb50066683a197cc6d677689618505a
  get_or_compute_bloom_filter(): diff: a3a8cd3c82028671bf51502d77277baf14a2f528
  get_or_compute_bloom_filter(): diff: a82a61c79b2b07c4440e292613e11a69e33ef7a2
  get_or_compute_bloom_filter(): diff: fc99def8b1df27bcab7d1f4b7ced73239f9bd7ec

See how the third write with '--max-new-filters=2' computes the
filters that have already been computed by the first write instead of
those two that have never been computed?  And then how the fourth
write computes filters that have already been computed by the second
write?  And ultimately we'll need a write without '--max-new-filters' (or
with '--max-new-filters=<large-enough>') to compute all remaining
filters.

With '--split=replace' it appears to work as expected.
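
For comparison, the behavior the test expects can be modelled in a few lines. This is only a sketch under the assumption that filters computed by one write are still known to the next one; 'write_with_budget()' and the 'has_filter' array are hypothetical stand-ins, not code from the patch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model of one 'commit-graph write --max-new-filters=<n>':
 * has_filter[i] is non-zero when commit i already has a filter from an
 * earlier write. Returns the number of filters computed by this write.
 */
static int write_with_budget(unsigned char *has_filter, size_t nr,
			     int max_new_filters)
{
	/* A negative value means "no limit", as in the patch. */
	int budget = max_new_filters >= 0 ? max_new_filters : (int)nr;
	int computed = 0;
	size_t i;

	assert(has_filter || !nr);

	for (i = 0; i < nr; i++) {
		if (has_filter[i])
			continue;	/* already known, costs nothing */
		if (computed >= budget)
			continue;	/* out of budget, left for a later write */
		has_filter[i] = 1;
		computed++;
	}
	return computed;
}
```

Under that assumption, six commits and a budget of two are fully covered after three writes, and a fourth write has nothing left to compute. The trace above instead shows earlier filters being recomputed.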



* Re: [PATCH v3 12/14] commit-graph: add large-filters bitmap chunk
  2020-08-19 13:35     ` SZEDER Gábor
@ 2020-09-02 20:23       ` Taylor Blau
  0 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-09-02 20:23 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: Taylor Blau, git, peff, dstolee, gitster

(Finally getting back to this review after working on some other topics
for a while; sorry for the late response.)

On Wed, Aug 19, 2020 at 03:35:26PM +0200, SZEDER Gábor wrote:
> On Tue, Aug 11, 2020 at 04:52:07PM -0400, Taylor Blau wrote:
> > When a commit has more than a certain number of changed paths (commonly
> > 512), the commit-graph machinery represents it as a zero-length filter.
> > This is done since having many entries in the Bloom filter has
> > undesirable effects on the false positivity rate.
>
> This is not the case, the false positive probability depends on the
> ratio of the Bloom filter's size and the number of elements it
> contains, and we size the filters proportional to the number of
> elements they contain, so the number of elements shouldn't affect the
> false positive rate.

I'm not sure that I understand. I agree that the FPR depends on the
ratio between the number of elements in the filter and the filter's
"size". But, consider a Bloom filter that is too small to faithfully
represent all its elements. Such a filter would likely have all its bits
set high, in which case every query would return "maybe", and the FPR
would go up.
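
Both points can be read off the usual approximation for a Bloom filter's false-positive rate. This is a back-of-the-envelope sketch rather than anything taken from the patch: it assumes n elements, m bits, k idealized independent hash functions, and (as an assumption about the defaults) b = 10 bits per entry with k = 7.

```latex
\begin{align*}
  p &\approx \left(1 - e^{-kn/m}\right)^{k}
      && \text{general case} \\
  p &\approx \left(1 - e^{-k/b}\right)^{k}
      && \text{filter sized proportionally, } m = b\,n \\
  p &\approx \left(1 - e^{-0.7}\right)^{7} \approx 0.008
      && b = 10,\ k = 7
\end{align*}
```

With m = b n the rate does not depend on n, which is the point made in the quoted text. If m were held fixed while n grows, e^{-kn/m} tends to 0 and p tends to 1, which is the undersized case described above.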

> On the contrary, it's the small filters, up to around 30-35 bytes,
> that tend to have larger than expected false positive rate when using
> double hashing.

I agree that small filters suffer from the same, but I think this is an
"in addition" not an "on the contrary".

In either case, I don't think that this is an important detail for the
commit message. What matters is the representation (that we truncate >=
512 elements to a length-zero filter), not why (that can be found in
another commit). I'd have expected to find the rationale in ed591febb4
(bloom.c: core Bloom filter implementation for changed paths.,
2020-03-30), but I couldn't find anything there.

So, I'll drop this sentence entirely to avoid an unimportant detail.

Thanks,
Taylor


* Re: [PATCH v3 12/14] commit-graph: add large-filters bitmap chunk
  2020-09-01 14:35     ` SZEDER Gábor
@ 2020-09-02 20:40       ` Taylor Blau
  0 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-09-02 20:40 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: Taylor Blau, git, peff, dstolee, gitster

On Tue, Sep 01, 2020 at 04:35:46PM +0200, SZEDER Gábor wrote:
> On Tue, Aug 11, 2020 at 04:52:07PM -0400, Taylor Blau wrote:
> > This is already undesirable, since it means that we are wasting
> > considerable effort to discover that a commit with too many changed
> > paths, only to throw that effort away (and then repeat the process the
> > next time a roll-up is performed).
>
> Is this really the case?
>
> If a commit has a corresponding entry in the Bloom filters chunks,
> then the commit-graph machinery does know the Bloom filter associated
> with that commit.  The size of that filter should not matter, i.e.
> even if it is a zero-length filter, the commit-graph machinery should
> know about it all the same.  And as far as I can tell it does indeed,
> because load_bloom_filter_from_graph() sets a non-NULL 'filter->data'
> pointer even if 'filter->len' is zero, which get_bloom_filter() treats
> as "we know about this", and returns early without (re)computing the
> filter.  Even the test 'Bloom generation does not recompute too-large
> filters' added in this patch is expected to succeed, and my
> superficial and makeshift testing seems to corroborate this; at least
> I couldn't find a combination of commands and options that would
> recompute any existing zero-length Bloom filters.
>
> Am I missing something?

I'm not sure that this would work, or at least I think that it creates
another pair of cases that we have to disambiguate. Here's what I'm
thinking after reading what you wrote:

  - By this point in the series, we expect every commit in a
    commit-graph layer to have a corresponding entry in the 'BIDX' chunk
    (if one exists, obviously all of this is meaningless without
    specifying '--changed-paths').

  - In a future commit (specifically, "builtin/commit-graph.c: introduce
    '--max-new-filters=<n>'"), this will no longer be the case.

  - So, by that point in the series, we would have two possible reasons
    why a commit did not show up in the BIDX. Either:

      * The commit has too many changed paths, in which case we don't
        want to write it down, or

      * We had already computed too many new filters, and so didn't have
        the budget to compute the filter for some commit(s).

In the first of those two cases, we want to avoid recomputing the
too-large filter. But, in the latter case, we *do* want to compute the
filter from scratch, since we want to "backfill" commits with missing
filters by letting them trickle in over time.

I might be missing something, too, but I think that those two cases are
just as indistinguishable as what's in the commit-graph today (without
the 'BFXL' chunk).
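
The disambiguation can be sketched as follows. This is a minimal illustration, not the code in the series: 'classify_filter()', the enum, and the bit order within each 64-bit word are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum filter_state {
	FILTER_PRESENT,			/* non-empty filter stored in BDAT */
	FILTER_KNOWN_EMPTY_OR_LARGE,	/* zero-length on purpose, don't recompute */
	FILTER_NOT_COMPUTED		/* zero-length for lack of budget, backfill */
};

/* Assumes bit 'pos' lives in word pos / 64, counting from the low bit. */
static int bit_is_set(const uint64_t *words, uint32_t pos)
{
	return (int)((words[pos / 64] >> (pos % 64)) & 1);
}

/*
 * 'filter_len' is the length recorded for the commit at 'pos';
 * 'large_bitmap' is the hypothetical in-memory BFXL bitmap, or NULL
 * when the layer has no such chunk.
 */
static enum filter_state classify_filter(uint32_t filter_len,
					 const uint64_t *large_bitmap,
					 uint32_t nr_commits, uint32_t pos)
{
	assert(pos < nr_commits);

	if (filter_len)
		return FILTER_PRESENT;
	if (large_bitmap && bit_is_set(large_bitmap, pos))
		return FILTER_KNOWN_EMPTY_OR_LARGE;
	/* Without the bitmap a zero-length filter tells us nothing. */
	return FILTER_NOT_COMPUTED;
}
```

Without the bitmap, every zero-length entry falls into the last case, so either too-large filters get recomputed or skipped ones never get backfilled.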

> > To address this, add a new chunk which encodes a bitmap where the ith
> > bit is on iff the ith commit has zero or at least 512 changed paths.
> > Likewise store the maximum number of changed paths we are willing to
> > store in order to prepare for eventually making this value more easily
> > customizable.
>
> I don't understand how storing this value would make it any easier to
> customize.

It doesn't, but it's also not meant to. The idea is that if the value
ever were easy to customize, then we could keep track of which layers
were written with which threshold. This would allow you to invent fancy
rules like dropping filters when the threshold lowers, or recomputing
ones that you wouldn't have otherwise when it goes up.

> > Another approach would be to introduce a new BIDX chunk (say, one
> > identified by 'BID2') which is identical to the existing BIDX chunk,
> > except the most-significant bit of each offset is interpreted as "this
> > filter is too big" iff looking at a BID2 chunk. This avoids having to
> > write a bitmap, but forces older clients to rewrite their commit-graphs
> > (as well as reduces the theoretical largest Bloom filters we couldl
>
> And it reduces the max possible size of the BDAT chunk, and thus the
> max number of commits with Bloom filters as well.

Fair point.

> s/couldl/could/

Oops, thanks.

> > write, and forces us to maintain the code necessary to translate BIDX
> > chunks to BID2 ones). Separately from this patch, I implemented this
> > alternate approach and did not find it to be advantageous.
>
> Let's take a step back to reconsider what should be stored in this
> bitmap for a moment.  Sure, setting a bit for each commit that doesn't
> modify any paths or modifies too many makes it possible to reliably
> identify commits that don't have Bloom filters yet.  But isn't it a
> bit roundabout way?  I think it would be better to directly track
> which commits don't have Bloom filters yet.  IOW what you really want
> is a, say, BNCY "Bloom filter Not Computed Yet" chunk, where we set
> the corresponding bit for each commit which has an entry in the BIDX
> chunk but for which a Bloom filter hasn't been computed yet.
>
>   - It's simpler and easier to explain (IMO).
>
>   - This bitmap chunk can easily be made optional: if all Bloom
>     filters have been computed, then the bitmap will contain all
>     zeros.  So why bother writing it, when we can save a bit of space
>     instead?

I don't think that this is true. Omitting the chunk (because you have
computed--or tried to compute--filters for every commit in the graph)
isn't distinguishable from what exists today, so we are back at square
one there.

>   - It avoids the unpleasantness of setting a bit in the _Large_ Bloom
>     Filters chunks for commits _not_ modifying any paths.

I agree it's unpleasant, but I also don't think it's a show-stopper.

>   - Less incentive to spill implementation details to the format
>     specification (e.g. 512 modified paths).
>
> Now, let's take another step back: is such a bitmap really necessary?
> We could write a single-byte Bloom filter with no bits set for commits
> not modifying any paths, and a single-byte Bloom filter with all bits
> set for commits modifying too many paths.  This is compatible with the
> specs and any existing implementation should do the right thing when
> reading such filters, this would allow us to interpret zero-length
> filters as "not computed yet", and if that bitmap chunk won't be
> optional, then this would save space as long as less than 1/8 of
> commits modify no or too many paths.  Unfortunately, however, all
> existing zero-length Bloom filters have to be recomputed.

I think this is a semantic difference: either you store a bitmap, or a
Bloom filter containing the same data. To me, I don't think there's a
huge difference, since we're talking about 1 bit per commit. If we were
really worried, we could store them as EWAH-compressed bitmaps, but I
don't get a sense that such a concern exists.

I do feel strongly about using a non-probabilistic data structure,
though, since the point of this feature is to be able to make tighter
guarantees about the runtime of 'git commit-graph write'.
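
To put rough numbers on the space question, here is a small sketch comparing the two encodings. It assumes the chunk layout described in the quoted format text (a 32-bit header followed by whole 64-bit words) and one extra byte per empty or too-large commit for the single-byte-filter alternative; the function names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Bitmap chunk: 4-byte header plus one bit per commit, in 64-bit words. */
static size_t bfxl_chunk_bytes(size_t nr_commits)
{
	size_t words = (nr_commits + 63) / 64;
	return 4 + 8 * words;
}

/* Alternative: one extra byte for each empty or too-large commit. */
static size_t one_byte_filter_bytes(size_t nr_empty_or_large)
{
	return nr_empty_or_large;
}

/* Non-zero when the bitmap is the smaller of the two encodings. */
static int bitmap_is_smaller(size_t nr_commits, size_t nr_empty_or_large)
{
	assert(nr_empty_or_large <= nr_commits);
	return bfxl_chunk_bytes(nr_commits) <
	       one_byte_filter_bytes(nr_empty_or_large);
}
```

For 64,000 commits the bitmap costs 8,004 bytes, so the break-even sits at roughly one commit in eight, matching the 1/8 figure quoted above. Either way the difference is on the order of a bit per commit.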

>
> > Signed-off-by: Taylor Blau <me@ttaylorr.com>
> > ---
> >  .../technical/commit-graph-format.txt         | 12 +++
> >  bloom.h                                       |  2 +-
> >  commit-graph.c                                | 96 ++++++++++++++++---
> >  commit-graph.h                                |  4 +
> >  t/t4216-log-bloom.sh                          | 25 ++++-
> >  5 files changed, 124 insertions(+), 15 deletions(-)
> >
> > diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
> > index 440541045d..5f2d9ab4d7 100644
> > --- a/Documentation/technical/commit-graph-format.txt
> > +++ b/Documentation/technical/commit-graph-format.txt
> > @@ -123,6 +123,18 @@ CHUNK DATA:
> >        of length zero.
> >      * The BDAT chunk is present if and only if BIDX is present.
> >
> > +  Large Bloom Filters (ID: {'B', 'F', 'X', 'L'}) [Optional]
> > +    * It starts with a 32-bit unsigned integer specifying the maximum number of
> > +      changed-paths that can be stored in a single Bloom filter.
>
> Should this value be the same in all elements of a commit-graph chain?
> Note that in this case having different values won't affect revision
> walks using modified path Bloom filters.

I don't think it needs to be the same necessarily.

> > +    * It then contains a list of 64-bit words (the length of this list is
> > +      determined by the width of the chunk) which is a bitmap. The 'i'th bit is
> > +      set exactly when the 'i'th commit in the graph has a changed-path Bloom
> > +      filter with zero entries (either because the commit is empty, or because
> > +      it contains more than 512 changed paths).
>
> Please make clear the byte order of these 64 bit words in the specs as
> well.
>
> Furthermore, that 512 path limit is an implementation detail, so it
> would be better if it didn't leak into the specification of this new
> chunk.

Addressed both, thanks.

Thanks,
Taylor


* Re: [PATCH v3 13/14] commit-graph: rename 'split_commit_graph_opts'
  2020-08-19  9:56     ` SZEDER Gábor
@ 2020-09-02 21:02       ` Taylor Blau
  0 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-09-02 21:02 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: Taylor Blau, git, peff, dstolee, gitster

On Wed, Aug 19, 2020 at 11:56:56AM +0200, SZEDER Gábor wrote:
> On Tue, Aug 11, 2020 at 04:52:11PM -0400, Taylor Blau wrote:
> > In the subsequent commit, additional options will be added to the
> > commit-graph API which have nothing to do with splitting.
> >
> > Rename the 'split_commit_graph_opts' structure to the more-generic
> > 'commit_graph_opts' to encompass both.
>
> Good.  Note, however, that write_commit_graph() has a 'flags'
> parameter as well, and when a feature is enabled via this 'flags',
> then first a corresponding 'ctx->foo' field is set, and that
> 'ctx->foo' is checked while computing and writing the commit-graph.
> With the generic options struct some other feature will be enabled via
> the 'opts->bar' field, so simply 'ctx->opts->bar' is checked while
> writing the commit-graph.
>
> With the generic options struct there really is no need for a separate
> flags parameter, the values in the flags can be stored in the options
> struct, and we can eliminate this inconsistency instead of adding even
> more.

I like the direction that you're headed in, but I'm not entirely sure
what you're suggesting. Do you want to make the
'enum commit_graph_write_flags flags' part of the new options struct?
Break out the fields into individual bits on that struct?

I'm not opposed to either, but note that there is also already a 'flags'
field on the options structure related to splitting, so that would have
to be untangled, too.

What I'm trying to say is that I think there's more complexity here than
you're giving it credit for. I'd rather press on with what we have here,
and devote adequate time to unraveling the complexity appropriately than
try to shove in another patch that takes a half-step in the right
direction.

>
> > diff --git a/commit-graph.h b/commit-graph.h
> > index ddbca1b59d..af08c4505d 100644
> > --- a/commit-graph.h
> > +++ b/commit-graph.h
> > @@ -109,7 +109,7 @@ enum commit_graph_split_flags {
> >  	COMMIT_GRAPH_SPLIT_REPLACE          = 2
> >  };
> >
> > -struct split_commit_graph_opts {
> > +struct commit_graph_opts {
> >  	int size_multiple;
> >  	int max_commits;
> >  	timestamp_t expire_time;
>         enum commit_graph_split_flags flags;
>
> While this was 'struct split_commit_graph_opts *split_opts' it was
> clear what kind of flags were in this 'flags' field.  Now that the
> struct is generic it's not clear anymore, so perhaps it should be
> renamed as well (e.g. 'split_flags'), or even turned into a couple of
> bit fields.

This I can definitely vouch for, so I'll 's/flags/split_flags/' in the
next revision.

Thanks,
Taylor


* Re: [PATCH v3 14/14] builtin/commit-graph.c: introduce '--max-new-filters=<n>'
  2020-08-17 22:50         ` SZEDER Gábor
@ 2020-09-02 21:03           ` Taylor Blau
  0 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-09-02 21:03 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: Taylor Blau, git, peff, dstolee, gitster

On Tue, Aug 18, 2020 at 12:50:04AM +0200, SZEDER Gábor wrote:
> On Fri, Aug 14, 2020 at 04:20:21PM -0400, Taylor Blau wrote:
> > > > @@ -1486,10 +1499,15 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
> > > >  		ctx->order_by_pack ? commit_pos_cmp : commit_gen_cmp,
> > > >  		&ctx->commits);
> > > >
> > > > +	max_new_filters = ctx->opts->max_new_filters >= 0 ?
> > > > +		ctx->opts->max_new_filters : ctx->commits.nr;
> > >
> > > git_test_write_commit_graph_or_die() invokes
> > > write_commit_graph_reachable() with opts=0x0, so 'ctx->opts' is NULL,
> > > and we get segfault.  This breaks a lot of tests when run with
> > > GIT_TEST_COMMIT_GRAPH=1 GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=1.
> >
> > Great catch, thanks. Fixing this is as simple as adding 'ctx->opts &&'
> > right before we dereference 'ctx->opts', since setting this variable
> > equal to 'ctx->commits.nr' is the right thing to do in that case.
>
> That would avoid the segfault, sure, but I would rather see all
> callers of write_commit_graph{_reachable}() passing a valid opts
> instance.  Just like we don't call the diff machinery with a NULL
> diff_options, or the revision walking machinery with a NULL rev_info.

I wouldn't mind that either, but this is definitely a common pattern
throughout the commit-graph machinery. So, if/when we do get away from
it, I'd rather do so uniformly than in some spots.
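
The guard discussed above amounts to something like the following sketch. 'struct sketch_opts' and 'effective_max_new_filters()' are hypothetical names used only for illustration; the real code works on 'ctx->opts' directly.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the one field of the options struct that matters here. */
struct sketch_opts {
	int max_new_filters;
};

/*
 * Fall back to "one filter per commit" (that is, no limit) when the
 * caller passed no options at all or left the limit negative.
 */
static int effective_max_new_filters(const struct sketch_opts *opts,
				     int nr_commits)
{
	assert(nr_commits >= 0);

	if (opts && opts->max_new_filters >= 0)
		return opts->max_new_filters;
	return nr_commits;
}
```

Checking 'opts' before dereferencing it keeps callers that pass NULL, such as the test helper mentioned above, on the unlimited path.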

> > Unrelated to this comment, I am hoping to send out a final version of
> > this series sometime next week so that we can keep moving forward with
> > Bloom filter improvements.
> >
> > Have you had a chance to review the rest of the patches? I'll happily
> > wait until you have had a chance to do so before sending v5 so that we
>
> v5?  This is v3, and I'm unable to find a v4.

Sorry, I clearly had too much on my mind when I was writing this ;). I'm
hopeful that, with your careful review, v4 will be the last of this
topic.

Thanks,
Taylor


* Re: [PATCH v3 14/14] builtin/commit-graph.c: introduce '--max-new-filters=<n>'
  2020-08-18 22:23     ` SZEDER Gábor
@ 2020-09-03 16:35       ` Taylor Blau
  0 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-09-03 16:35 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: Taylor Blau, git, peff, dstolee, gitster

On Wed, Aug 19, 2020 at 12:23:29AM +0200, SZEDER Gábor wrote:
> On Tue, Aug 11, 2020 at 04:52:14PM -0400, Taylor Blau wrote:
> > Introduce a command-line flag and configuration variable to fill in the
> > 'max_new_filters' variable introduced by the previous patch.
>
> 'max_new_filters' is introduced in this patch.
>
> > The command-line option '--max-new-filters' takes precedence over
> > 'commitGraph.maxNewFilters', which is the default value.
>
> What "filters"?  While misnamed, the '--changed-paths' options did a
> good job at hiding the implementation detail that we use Bloom filters
> to speed up pathspec-limited revision walks; as far as I remember this
> was a conscious design decision.  Including "filter" in the name of
> the option and corresponding config variable goes against this
> decision.  Furthermore, by not being specific we might end up in a bad
> situation when adding some other filters to the commit-graph format.
> Unfortunately, I can't offhand propose a better option name, all my
> ideas were horrible.

I don't disagree, but I can't come up with a better name either.
maxNewChangedPaths? maxNewSomething? I don't know. Usually I think
exposing this sort of detail is smelly, but I'm not so bothered by it
here. Similarly, this thread has been around for a while, and nobody
else has suggested a better name, so, I'm inclined to keep it.

> > +With the `--max-new-filters=<n>` option, generate at most `n` new Bloom
> > +filters (if `--changed-paths` is specified). If `n` is `-1`, no limit is
> > +enforced. Overrides the `commitGraph.maxNewFilters` configuration.
>
> I think this description should also detail what happens with those
> commits for which no modified path Bloom filters are calculated, and
> which commands will calculate them (even implicitly).

OK, sure.

> >  static int get_bloom_filter_large_in_graph(struct commit_graph *g,
> > -					   const struct commit *c)
> > +					   const struct commit *c,
> > +					   uint32_t max_changed_paths)
> >  {
> >  	uint32_t graph_pos = commit_graph_position(c);
> >  	if (graph_pos == COMMIT_NOT_FROM_GRAPH)
> > @@ -965,6 +966,17 @@ static int get_bloom_filter_large_in_graph(struct commit_graph *g,
> >
> >  	if (!(g && g->bloom_large))
> >  		return 0;
> > +	if (g->bloom_filter_settings->max_changed_paths != max_changed_paths) {
> > +		/*
> > +		 * Force all commits which are subject to a different
> > +		 * 'max_changed_paths' limit to be recomputed from scratch.
> > +		 *
> > +		 * Note that this could likely be improved, but is ignored since
> > +		 * all real-world graphs set the maximum number of changed paths
> > +		 * at 512.
>
> I don't understand what the second part of this comment is trying to
> say; and real-world graphs might very well contain Bloom filters with
> more than 512 modified paths, because applying that limit was
> buggy.

I'm trying to say that there is room for us to improve when we do and
don't recompute filters when the limit on the number of changed paths
deviates between layers, but that such deviations don't currently exist
in the wild.

Bugs are another thing, but we can't tell that they exist without
recomputing every filter.

Thanks,
Taylor


* Re: [PATCH v3 14/14] builtin/commit-graph.c: introduce '--max-new-filters=<n>'
  2020-08-19  8:20     ` SZEDER Gábor
@ 2020-09-03 16:42       ` Taylor Blau
  2020-09-04  8:50         ` SZEDER Gábor
  0 siblings, 1 reply; 117+ messages in thread
From: Taylor Blau @ 2020-09-03 16:42 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: git, peff, dstolee, gitster

On Wed, Aug 19, 2020 at 10:20:21AM +0200, SZEDER Gábor wrote:
> On Tue, Aug 11, 2020 at 04:52:14PM -0400, Taylor Blau wrote:
> > Introduce a command-line flag and configuration variable to fill in the
> > 'max_new_filters' variable introduced by the previous patch.
> >
> > The command-line option '--max-new-filters' takes precedence over
> > 'commitGraph.maxNewFilters', which is the default value.
> > '--no-max-new-filters' can also be provided, which sets the value back
> > to '-1', indicating that an unlimited number of new Bloom filters may be
> > generated. (OPT_INTEGER only allows setting the '--no-' variant back to
> > '0', hence a custom callback was used instead).
>
> Forgot the most important thing: Why?  Please explain in the commit
> message why this option is necessary, what problems it solves,
> how it is supposed to interact with other options and why so.

This is already explained in detail in the patch 'commit-graph: add
large-filters bitmap chunk', although there is an error in the quoted
part of your email (which I wrote) which refers the reader to the
previous patch. The patch I'm actually referring to is the
twice-previous patch.

I'll fix that locally before re-sending.

Thanks,
Taylor


* Re: [PATCH v3 14/14] builtin/commit-graph.c: introduce '--max-new-filters=<n>'
  2020-09-01 14:36     ` SZEDER Gábor
@ 2020-09-03 18:49       ` Taylor Blau
  0 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-09-03 18:49 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: git, peff, dstolee, gitster

On Tue, Sep 01, 2020 at 04:36:50PM +0200, SZEDER Gábor wrote:
> Something seems to be wrong in this patch, though I haven't looked
> closer.  Consider this test with a bit of makeshift tracing:
>
>   ---  >8  ---
>
> diff --git a/bloom.c b/bloom.c
> index 8d07209c6b..1a0dec35cd 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -222,6 +222,7 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
>  	if (!compute_if_not_present)
>  		return NULL;
>
> +	printf("get_or_compute_bloom_filter(): diff: %s\n", oid_to_hex(&c->object.oid));
>  	repo_diff_setup(r, &diffopt);
>  	diffopt.flags.recursive = 1;
>  	diffopt.detect_rename = 0;
> diff --git a/t/t9999-test.sh b/t/t9999-test.sh
> new file mode 100755
> index 0000000000..0833e6ff7e
> --- /dev/null
> +++ b/t/t9999-test.sh
> @@ -0,0 +1,25 @@
> +#!/bin/sh
> +
> +test_description='test'
> +
> +. ./test-lib.sh
> +
> +test_expect_success 'test' '
> +	for i in 1 2 3 4 5 6
> +	do
> +		git commit -q --allow-empty -m $i || return 1
> +	done &&
> +	git log --oneline &&
> +
> +	# We have 6 commits and compute 2 Bloom filters per execution,
> +	# so 3 executions should be enough...  but, alas, it isnt.
> +	for i in 1 2 3 4 5
> +	do
> +		# No --split=replace!
> +		git commit-graph write --reachable --changed-paths --max-new-filters=2 || return 1
> +	done &&
> +
> +	git commit-graph write --reachable --changed-paths --max-new-filters=4
> +'
> +
> +test_done
>
>   ---  >8  ---
>
> Its output looks like:
>
>   [...]
>   + git log --oneline
>   13fcefa (HEAD -> master) 6
>   0c71516 5
>   a82a61c 4
>   54c6b2c 3
>   fc99def 2
>   a3a8cd3 1
>   + git commit-graph write --reachable --changed-paths --max-new-filters=2
>   get_or_compute_bloom_filter(): diff: 0c71516945bf0a23813e80205961d29ebc1020e0
>   get_or_compute_bloom_filter(): diff: 13fcefa4bb859a15c4edc6bb01b8b6c91b4f32b6
>   + git commit-graph write --reachable --changed-paths --max-new-filters=2
>   get_or_compute_bloom_filter(): diff: 54c6b2cd4fb50066683a197cc6d677689618505a
>   get_or_compute_bloom_filter(): diff: a3a8cd3c82028671bf51502d77277baf14a2f528
>   + git commit-graph write --reachable --changed-paths --max-new-filters=2
>   get_or_compute_bloom_filter(): diff: 0c71516945bf0a23813e80205961d29ebc1020e0
>   get_or_compute_bloom_filter(): diff: 13fcefa4bb859a15c4edc6bb01b8b6c91b4f32b6
>   + git commit-graph write --reachable --changed-paths --max-new-filters=2
>   get_or_compute_bloom_filter(): diff: 54c6b2cd4fb50066683a197cc6d677689618505a
>   get_or_compute_bloom_filter(): diff: a3a8cd3c82028671bf51502d77277baf14a2f528
>   + git commit-graph write --reachable --changed-paths --max-new-filters=2
>   get_or_compute_bloom_filter(): diff: 0c71516945bf0a23813e80205961d29ebc1020e0
>   get_or_compute_bloom_filter(): diff: 13fcefa4bb859a15c4edc6bb01b8b6c91b4f32b6
>   + git commit-graph write --reachable --changed-paths
>   get_or_compute_bloom_filter(): diff: 54c6b2cd4fb50066683a197cc6d677689618505a
>   get_or_compute_bloom_filter(): diff: a3a8cd3c82028671bf51502d77277baf14a2f528
>   get_or_compute_bloom_filter(): diff: a82a61c79b2b07c4440e292613e11a69e33ef7a2
>   get_or_compute_bloom_filter(): diff: fc99def8b1df27bcab7d1f4b7ced73239f9bd7ec
>
> See how the third write with '--max-new-filters=2' computes the
> filters that have already been computed by the first write instead of
> those two that have never been computed?  And then how the fourth
> write computes filters that have already been computed by the second
> write?  And ultimately we'll need a write without '--max-new-filters' (or
> with '--max-new-filters=<large-enough>') to compute all remaining
> filters.

Ouch. This definitely has to do with the empty commits, since swapping
out your 'git commit ... --allow-empty' for a 'test_commit' produces the
output that you'd expect.

> With '--split=replace' it appears to work as expected.

This is definitely the critical bit. The crux of the issue is that
'copy_oids_to_commits()' handles split and non-split graphs differently.
The relevant commits are:

  * 43d3561805 (commit-graph write: don't die if the existing graph is
    corrupt, 2019-03-25) which forces the relevant data to *not* be
    loaded from an existing commit-graph, and

  * 8a6ac287b2 (builtin/commit-graph.c: introduce split strategy
    'replace', 2020-04-13), which does load data from an existing
    commit-graph with '--split=replace'.

When writing a graph with '--split=replace', commits are loaded from the
graph, which includes setting their '->graph_pos' (or rather setting
this data in a commit slab, which is I guess how it's done these days).
Without '--split=replace', the graph position will never be set.

So, by the time we get to 'get_bloom_filter_large_in_graph', the graph
position is 'COMMIT_NOT_FROM_GRAPH', which in turn forces us to
recompute the filter from scratch, since we assume that being
'NOT_FROM_GRAPH' implies that we won't find it in any 'BFXL' chunk.

Regardless of whether or not we should be trusting the parentage
information on-disk, recomputing the Bloom filters from scratch is
simply too expensive (and the opposite of the point of this series). So,
the right thing to do is to make 'get_bloom_filter_large_in_graph' look
up the Bloom and BFXL data in a commit-graph by forcibly loading the
commit's graph position.
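
To illustrate the difference between trusting only the cached graph
position and falling back to a lookup by object ID, here is a minimal,
self-contained model. It is a sketch, not Git's code: 'commit_model',
'graph_model', 'NOT_FROM_GRAPH' and the helper names are all
hypothetical, and it assumes that the fallback is a binary search over
a sorted OID table within a single layer.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for COMMIT_NOT_FROM_GRAPH. */
#define NOT_FROM_GRAPH UINT32_MAX

struct commit_model {
	unsigned char oid[20];
	uint32_t cached_pos;	/* NOT_FROM_GRAPH when never loaded */
};

struct graph_model {
	const unsigned char (*oids)[20];	/* sorted by memcmp() order */
	uint32_t nr;
};

/* Binary search for 'oid' in the (assumed sorted) OID table. */
static int lookup_by_oid(const struct graph_model *g,
			 const unsigned char *oid, uint32_t *pos)
{
	uint32_t lo = 0, hi = g->nr;

	while (lo < hi) {
		uint32_t mid = lo + (hi - lo) / 2;
		int cmp = memcmp(oid, g->oids[mid], 20);

		if (!cmp) {
			*pos = mid;
			return 1;
		}
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return 0;
}

/* Old behavior in this model: give up unless the position is cached. */
static int find_pos_cached_only(const struct commit_model *c, uint32_t *pos)
{
	if (c->cached_pos == NOT_FROM_GRAPH)
		return 0;
	*pos = c->cached_pos;
	return 1;
}

/* New behavior in this model: fall back to searching the graph. */
static int find_pos(const struct commit_model *c,
		    const struct graph_model *g, uint32_t *pos)
{
	if (find_pos_cached_only(c, pos))
		return 1;
	return lookup_by_oid(g, c->oid, pos);
}
```

In this model, a commit that is present in the graph but whose position
was never loaded is invisible to 'find_pos_cached_only()' and found by
'find_pos()', which is the non-'--split=replace' situation described
above.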

This is sufficient to get us unstuck:

diff --git a/commit-graph.c b/commit-graph.c
index bec4e5b725..243c7253ff 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -956,8 +956,8 @@ int get_bloom_filter_large_in_graph(struct commit_graph *g,
                                    const struct commit *c,
                                    uint32_t max_changed_paths)
 {
-       uint32_t graph_pos = commit_graph_position(c);
-       if (graph_pos == COMMIT_NOT_FROM_GRAPH)
+       uint32_t graph_pos;
+       if (!find_commit_in_graph(c, g, &graph_pos))
                return 0;

        while (g && graph_pos < g->num_commits_in_base)

...but adding something like:

diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index a7a3b41919..571676cef2 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -310,4 +310,29 @@ test_expect_success 'Bloom generation backfills previously-skipped filters' '
        )
 '

+test_expect_success 'Bloom generation backfills empty commits' '
+       git init empty &&
+       test_when_finished "rm -fr empty" &&
+       (
+               cd empty &&
+               for i in $(test_seq 1 6)
+               do
+                       git commit --allow-empty -m "$i"
+               done &&
+
+               # Generate Bloom filters for empty commits 1-6, two at a time.
+               test_bloom_filters_computed "--reachable --changed-paths --max-new-filters=2" \
+                       0 2 2 &&
+               test_bloom_filters_computed "--reachable --changed-paths --max-new-filters=2" \
+                       2 2 2 &&
+               test_bloom_filters_computed "--reachable --changed-paths --max-new-filters=2" \
+                       4 2 2 &&
+
+               # Finally, make sure that once all commits have filters, that
+               # none are subsequently recomputed.
+               test_bloom_filters_computed "--reachable --changed-paths --max-new-filters=2" \
+                       6 0 0
+       )
+'
+
 test_done

is a good idea, since it hardens what you wrote in your t9999 into an
actual test to guard against regressions. I'll fold both of those into
this patch.

Thanks for the bug report. It led to an interesting investigation :).

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH v3 00/14] more miscellaneous Bloom filter improvements
  2020-08-11 20:51 ` [PATCH v3 00/14] more miscellaneous Bloom filter improvements Taylor Blau
                     ` (13 preceding siblings ...)
  2020-08-11 20:52   ` [PATCH v3 14/14] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
@ 2020-09-03 21:45   ` Junio C Hamano
  2020-09-03 22:33     ` Taylor Blau
  14 siblings, 1 reply; 117+ messages in thread
From: Junio C Hamano @ 2020-09-03 21:45 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, dstolee, szeder.dev

Taylor Blau <me@ttaylorr.com> writes:

> Here's a(nother) re-roll of mine and Stolee's series to introduce the
> new BFXL commit-graph chunk, along with the '--max-new-filters' option
> to 'git commit-graph write'.
>
> Not really much has changed since v2, other than a rebase onto the
> latest from master (the fifth 2.29 batch, at the time of writing), and
> to squash in a few fixups that I sent in response to my v2 series.

It seems we've seen more than enough comments and enthusiasm on this
round to make another round of update worthwhile, but it seems it
may take a bit more time (e.g. <20200903184920.GA8946@nand.local>)
to get issues resolved?  Just pinging, no rush.

^ permalink raw reply	[flat|nested] 117+ messages in thread

* Re: [PATCH v3 00/14] more miscellaneous Bloom filter improvements
  2020-09-03 21:45   ` [PATCH v3 00/14] more miscellaneous Bloom filter improvements Junio C Hamano
@ 2020-09-03 22:33     ` Taylor Blau
  0 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-09-03 22:33 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, peff, dstolee, szeder.dev

On Thu, Sep 03, 2020 at 02:45:50PM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
> > Not really much has changed since v2, other than a rebase onto the
> > latest from master (the fifth 2.29 batch, at the time of writing), and
> > to squash in a few fixups that I sent in response to my v2 series.
>
> It seems we've seen more than enough comments and enthusiasm on this
> round to make another round of update worthwhile, but it seems it
> may take a bit more time (e.g. <20200903184920.GA8946@nand.local>)
> to get issues resolved?  Just pinging, no rush.

Yeah. I am definitely partially to blame, since I have been distracted
for the past ~month or so working on some bitmaps-related topics that I
am hoping to send soon.

The issues in <20200903184920.GA8946@nand.local> are already resolved,
and I have v4 ready to send. But, I need to give it a final proofreading
pass before letting it hit the list.

Patches shortly.


Thanks,
Taylor

^ permalink raw reply	[flat|nested] 117+ messages in thread

* [PATCH v4 00/14] more miscellaneous Bloom filter improvements
  2020-08-03 18:57 [PATCH 00/10] more miscellaneous Bloom filter improvements Taylor Blau
                   ` (11 preceding siblings ...)
  2020-08-11 20:51 ` [PATCH v3 00/14] more miscellaneous Bloom filter improvements Taylor Blau
@ 2020-09-03 22:45 ` Taylor Blau
  2020-09-03 22:46   ` [PATCH v4 01/14] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
                     ` (14 more replies)
  12 siblings, 15 replies; 117+ messages in thread
From: Taylor Blau @ 2020-09-03 22:45 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

Hi,

Here is another reroll of my series to introduce a '--max-new-filters'
option to the 'git commit-graph write' sub-command to limit the number
of changed-path Bloom filters a single process is willing to compute
from scratch.

As a reminder (since it's been a while since v3), this is done by adding
a new 'BFXL' chunk, which specifies which Bloom filters (a) have been
computed and (b) were too large to store, and thus are encoded as
zero-length filters.
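
For readers who want a concrete picture of the chunk, the following is a
rough sketch of how a BFXL chunk could be decoded, based only on the
format text in the range-diff below (a 32-bit maximum followed by 64-bit
words in network order). The names 'bfxl_view', 'bfxl_parse()',
'bfxl_bit()' and 'classify()' are hypothetical, 'read_be32()' and
'read_be64()' are stand-ins for Git's own helpers, and the bit order
within a word (least-significant bit first) is an assumption on my part.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for get_be32()/get_be64(). */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static uint64_t read_be64(const unsigned char *p)
{
	return ((uint64_t)read_be32(p) << 32) | read_be32(p + 4);
}

/* A read-only view of one BFXL chunk (hypothetical type). */
struct bfxl_view {
	uint32_t max_changed_paths;
	const unsigned char *words;	/* big-endian 64-bit words */
	size_t nr_words;
};

/* Returns 0 on success, -1 if the chunk length is malformed. */
static int bfxl_parse(const unsigned char *chunk, size_t len,
		      struct bfxl_view *out)
{
	assert(chunk && out);
	if (len < 4 || (len - 4) % 8)
		return -1;
	out->max_changed_paths = read_be32(chunk);
	out->words = chunk + 4;
	out->nr_words = (len - 4) / 8;
	return 0;
}

/* Bit 'pos' of the bitmap; positions past the end read as unset. */
static int bfxl_bit(const struct bfxl_view *v, size_t pos)
{
	size_t word = pos / 64;

	if (word >= v->nr_words)
		return 0;
	return (int)((read_be64(v->words + word * 8) >> (pos % 64)) & 1);
}

enum filter_state {
	FILTER_PRESENT,
	FILTER_EMPTY_OR_TOO_LARGE,
	FILTER_NOT_COMPUTED
};

/* Disambiguate a zero-length filter using the BFXL bit. */
static enum filter_state classify(const struct bfxl_view *v, size_t pos,
				  size_t filter_len)
{
	if (filter_len)
		return FILTER_PRESENT;
	return bfxl_bit(v, pos) ? FILTER_EMPTY_OR_TOO_LARGE
				: FILTER_NOT_COMPUTED;
}
```

The point of the sketch is the last function: a zero-length filter with
its bit set was computed (and is either empty or too large), while one
with its bit unset was never computed and is a candidate for
'--max-new-filters' to backfill.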

Things seem to have settled down since the review in v3, so I'm hoping
that this is what will end up being queued.

Thanks again for all of your review, and sorry for the prolonged radio
silence on this topic. I've had my hands full working on something else
that I'm hoping to start testing and sending to the list soon.

Derrick Stolee (1):
  bloom/diff: properly short-circuit on max_changes

Taylor Blau (13):
  commit-graph: introduce 'get_bloom_filter_settings()'
  t4216: use an '&&'-chain
  commit-graph: pass a 'struct repository *' in more places
  t/helper/test-read-graph.c: prepare repo settings
  commit-graph: respect 'commitGraph.readChangedPaths'
  commit-graph.c: store maximum changed paths
  bloom: split 'get_bloom_filter()' in two
  bloom: use provided 'struct bloom_filter_settings'
  commit-graph.c: sort index into commits list
  csum-file.h: introduce 'hashwrite_be64()'
  commit-graph: add large-filters bitmap chunk
  commit-graph: rename 'split_commit_graph_opts'
  builtin/commit-graph.c: introduce '--max-new-filters=<n>'

 Documentation/config.txt                      |   2 +
 Documentation/config/commitgraph.txt          |   8 +
 Documentation/git-commit-graph.txt            |   6 +
 .../technical/commit-graph-format.txt         |  13 +
 blame.c                                       |   8 +-
 bloom.c                                       |  43 ++-
 bloom.h                                       |  22 +-
 builtin/commit-graph.c                        |  61 +++-
 commit-graph.c                                | 278 +++++++++++++-----
 commit-graph.h                                |  28 +-
 csum-file.h                                   |   6 +
 diff.h                                        |   2 -
 fuzz-commit-graph.c                           |   5 +-
 line-log.c                                    |   2 +-
 make: *** [Makefile                           |   0
 midx.c                                        |   3 +-
 repo-settings.c                               |   3 +
 repository.h                                  |   1 +
 revision.c                                    |   7 +-
 t/helper/test-bloom.c                         |   4 +-
 t/helper/test-read-graph.c                    |   3 +-
 t/t4216-log-bloom.sh                          | 173 +++++++++--
 t/t5324-split-commit-graph.sh                 |  13 +
 tree-diff.c                                   |   5 +-
 24 files changed, 553 insertions(+), 143 deletions(-)
 create mode 100644 Documentation/config/commitgraph.txt
 create mode 100644 make: *** [Makefile

Range-diff against v3:
[rebased onto master]
  1:  e714e54240 = 232:  97d80f109f commit-graph: introduce 'get_bloom_filter_settings()'
  2:  9fc8b17d6f = 233:  d60698d2f8 t4216: use an '&&'-chain
  3:  8dbe4838b7 = 234:  639a962a49 commit-graph: pass a 'struct repository *' in more places
  4:  f59db1e30d = 235:  ccab59dfe4 t/helper/test-read-graph.c: prepare repo settings
  5:  daae6788c0 = 236:  8aff54d83e commit-graph: respect 'commitGraph.readChangedPaths'
  6:  bf498844ef = 237:  965489d361 commit-graph.c: store maximum changed paths
  7:  eba2794873 = 238:  ba89a0cb83 bloom: split 'get_bloom_filter()' in two
  8:  4f08177dbe = 239:  89bedba089 bloom: use provided 'struct bloom_filter_settings'
  9:  cc1dc8b121 = 240:  427f129656 bloom/diff: properly short-circuit on max_changes
 10:  23fd52c3b8 = 241:  08b5f185f6 commit-graph.c: sort index into commits list
 11:  4800cd373e = 242:  d7cbd4ca1a csum-file.h: introduce 'hashwrite_be64()'
 12:  619e0c619d ! 243:  3063beb588 commit-graph: add large-filters bitmap chunk
    @@ Commit message

         When a commit has more than a certain number of changed paths (commonly
         512), the commit-graph machinery represents it as a zero-length filter.
    -    This is done since having many entries in the Bloom filter has
    -    undesirable effects on the false positivity rate.
    -
         In addition to these too-large filters, the commit-graph machinery also
         represents commits with no filter and commits with no changed paths in
         the same way.
    @@ Commit message
         data in network byte order from the 64-bit words. This means we also
         need to read the array from the commit-graph file by translating each
         word from network byte order using get_be64() when loading the commit
    -    graph. (Note that this *could* be delayed until first-use, but a later
    -    patch will rely on this being initialized early, so we assume the
    -    up-front cost when parsing instead of delaying initialization).
    +    graph. Initialize this bitmap lazily to avoid paying a linear-time cost
    +    upon each commit-graph load even if we do not need the bitmaps
    +    themselves.

         By avoiding the need to move to new versions of the BDAT and BIDX chunk,
         we can give ourselves more time to consider whether or not other
    @@ Commit message
         except the most-significant bit of each offset is interpreted as "this
         filter is too big" iff looking at a BID2 chunk. This avoids having to
         write a bitmap, but forces older clients to rewrite their commit-graphs
    -    (as well as reduces the theoretical largest Bloom filters we couldl
    +    (as well as reduces the theoretical largest Bloom filters we could
         write, and forces us to maintain the code necessary to translate BIDX
         chunks to BID2 ones). Separately from this patch, I implemented this
         alternate approach and did not find it to be advantageous.
    @@ Documentation/technical/commit-graph-format.txt: CHUNK DATA:
     +  Large Bloom Filters (ID: {'B', 'F', 'X', 'L'}) [Optional]
     +    * It starts with a 32-bit unsigned integer specifying the maximum number of
     +      changed-paths that can be stored in a single Bloom filter.
    -+    * It then contains a list of 64-bit words (the length of this list is
    -+      determined by the width of the chunk) which is a bitmap. The 'i'th bit is
    -+      set exactly when the 'i'th commit in the graph has a changed-path Bloom
    -+      filter with zero entries (either because the commit is empty, or because
    -+      it contains more than 512 changed paths).
    ++    * It then contains a list of 64-bit words in network order (the length of
    ++      this list is determined by the width of the chunk) which is a bitmap. The
    ++      'i'th bit is set exactly when the 'i'th commit in the graph has a
    ++      changed-path Bloom filter with zero entries (either because the commit is
    ++      empty, or because it contains more entries than is allowed per filter by
    ++      the layer that contains it).
     +    * The BFXL chunk is present only when the BIDX and BDAT chunks are
     +      also present.
     +
    @@ commit-graph.c: struct commit_graph *parse_commit_graph(struct repository *r,
     +			if (graph->chunk_bloom_large_filters)
     +				chunk_repeated = 1;
     +			else if (r->settings.commit_graph_read_changed_paths) {
    -+				size_t alloc = get_be64(chunk_lookup + 4) - chunk_offset - sizeof(uint32_t);
     +				graph->chunk_bloom_large_filters = data + chunk_offset + sizeof(uint32_t);
    ++				graph->bloom_large_alloc = get_be64(chunk_lookup + 4) - chunk_offset - sizeof(uint32_t);
     +				graph->bloom_filter_settings->max_changed_paths = get_be32(data + chunk_offset);
    -+				if (alloc) {
    -+					size_t j;
    -+					graph->bloom_large = bitmap_word_alloc(alloc);
    -+
    -+					for (j = 0; j < graph->bloom_large->word_alloc; j++)
    -+						graph->bloom_large->words[j] = get_be64(
    -+							graph->chunk_bloom_large_filters + j * sizeof(eword_t));
    -+				}
     +			}
     +			break;
      		}
    @@ commit-graph.c: struct commit_graph *parse_commit_graph(struct repository *r,
      		graph->chunk_bloom_indexes = NULL;
      		graph->chunk_bloom_data = NULL;
     +		graph->chunk_bloom_large_filters = NULL;
    ++		graph->bloom_large_alloc = 0;
      		FREE_AND_NULL(graph->bloom_filter_settings);
    -+		bitmap_free(graph->bloom_large);
      	}

    - 	hashcpy(graph->oid.hash, graph->data + graph->data_len - graph->hash_len);
     @@ commit-graph.c: struct tree *get_commit_tree_in_graph(struct repository *r, const struct commit
      	return get_commit_tree_in_graph_one(r, r->objects->commit_graph, c);
      }

    -+static int get_bloom_filter_large_in_graph(struct commit_graph *g,
    -+					   const struct commit *c)
    ++int get_bloom_filter_large_in_graph(struct commit_graph *g,
    ++				    const struct commit *c)
     +{
    -+	uint32_t graph_pos = commit_graph_position(c);
    -+	if (graph_pos == COMMIT_NOT_FROM_GRAPH)
    ++	uint32_t graph_pos;
    ++	if (!find_commit_in_graph(c, g, &graph_pos))
     +		return 0;
     +
     +	while (g && graph_pos < g->num_commits_in_base)
     +		g = g->base_graph;
     +
    -+	if (!(g && g->bloom_large))
    ++	if (!g)
    ++		return 0;
    ++
    ++	if (!g->bloom_large && g->bloom_large_alloc) {
    ++		size_t i;
    ++		g->bloom_large = bitmap_word_alloc(g->bloom_large_alloc);
    ++
    ++		for (i = 0; i < g->bloom_large->word_alloc; i++)
    ++			g->bloom_large->words[i] = get_be64(
    ++				g->chunk_bloom_large_filters + i * sizeof(eword_t));
    ++	}
    ++
    ++	if (!g->bloom_large)
     +		return 0;
     +	return bitmap_get(g->bloom_large, graph_pos - g->num_commits_in_base);
     +}
    @@ commit-graph.h

      #define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
      #define GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE "GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE"
    +@@ commit-graph.h: void load_commit_graph_info(struct repository *r, struct commit *item);
    + struct tree *get_commit_tree_in_graph(struct repository *r,
    + 				      const struct commit *c);
    +
    ++int get_bloom_filter_large_in_graph(struct commit_graph *g,
    ++				    const struct commit *c);
    ++
    + struct commit_graph {
    + 	const unsigned char *data;
    + 	size_t data_len;
     @@ commit-graph.h: struct commit_graph {
      	const unsigned char *chunk_base_graphs;
      	const unsigned char *chunk_bloom_indexes;
    @@ commit-graph.h: struct commit_graph {
     +	const unsigned char *chunk_bloom_large_filters;
     +
     +	struct bitmap *bloom_large;
    ++	size_t bloom_large_alloc;

      	struct bloom_filter_settings *bloom_filter_settings;
      };
    +@@ commit-graph.h: struct commit_graph *read_commit_graph_one(struct repository *r,
    + struct commit_graph *parse_commit_graph(struct repository *r,
    + 					void *graph_map, size_t graph_size);
    +
    ++void prepare_commit_graph_bloom_large(struct commit_graph *g);
    ++
    + /*
    +  * Return 1 if and only if the repository has a commit-graph
    +  * file and generation numbers are computed in that file.

      ## t/t4216-log-bloom.sh ##
     @@ t/t4216-log-bloom.sh: test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
    - 	git commit-graph write --reachable --changed-paths
    + 	EOF
      '
      graph_read_expect () {
     -	NUM_CHUNKS=5
     +	NUM_CHUNKS=6
      	cat >expect <<- EOF
    - 	header: 43475048 1 1 $NUM_CHUNKS 0
    + 	header: 43475048 1 $(test_oid oid_version) $NUM_CHUNKS 0
      	num_commits: $1
     @@ t/t4216-log-bloom.sh: test_expect_success 'correctly report changes over limit' '
      		done
 13:  b2e33ecba8 ! 244:  ee0bc109f3 commit-graph: rename 'split_commit_graph_opts'
    @@ Commit message
         commit-graph API which have nothing to do with splitting.

         Rename the 'split_commit_graph_opts' structure to the more-generic
    -    'commit_graph_opts' to encompass both.
    +    'commit_graph_opts' to encompass both. Likewise, rename the 'flags'
    +    member to instead be 'split_flags' to clarify that it only has to do
    +    with the behavior implied by '--split'.

    -    Suggsted-by: Derrick Stolee <dstolee@microsoft.com>
    +    Suggested-by: Derrick Stolee <dstolee@microsoft.com>
         Signed-off-by: Taylor Blau <me@ttaylorr.com>

      ## builtin/commit-graph.c ##
    @@ builtin/commit-graph.c: static int graph_write(int argc, const char **argv)
      			N_("enable computation for changed paths")),
      		OPT_BOOL(0, "progress", &opts.progress, N_("force progress reporting")),
     -		OPT_CALLBACK_F(0, "split", &split_opts.flags, NULL,
    -+		OPT_CALLBACK_F(0, "split", &write_opts.flags, NULL,
    ++		OPT_CALLBACK_F(0, "split", &write_opts.split_flags, NULL,
      			N_("allow writing an incremental commit-graph file"),
      			PARSE_OPT_OPTARG | PARSE_OPT_NONEG,
      			write_option_parse_split),
    @@ commit-graph.c: static void close_reachable(struct write_commit_graph_context *c
     -	enum commit_graph_split_flags flags = ctx->split_opts ?
     -		ctx->split_opts->flags : COMMIT_GRAPH_SPLIT_UNSPECIFIED;
     +	enum commit_graph_split_flags flags = ctx->opts ?
    -+		ctx->opts->flags : COMMIT_GRAPH_SPLIT_UNSPECIFIED;
    ++		ctx->opts->split_flags : COMMIT_GRAPH_SPLIT_UNSPECIFIED;

      	if (ctx->report_progress)
      		ctx->progress = start_delayed_progress(
    @@ commit-graph.c: static uint32_t count_distinct_commits(struct write_commit_graph
     -	enum commit_graph_split_flags flags = ctx->split_opts ?
     -		ctx->split_opts->flags : COMMIT_GRAPH_SPLIT_UNSPECIFIED;
     +	enum commit_graph_split_flags flags = ctx->opts ?
    -+		ctx->opts->flags : COMMIT_GRAPH_SPLIT_UNSPECIFIED;
    ++		ctx->opts->split_flags : COMMIT_GRAPH_SPLIT_UNSPECIFIED;

      	ctx->num_extra_edges = 0;
      	if (ctx->report_progress)
    @@ commit-graph.c: static void split_graph_merge_strategy(struct write_commit_graph
     +			size_mult = ctx->opts->size_multiple;

     -		flags = ctx->split_opts->flags;
    -+		flags = ctx->opts->flags;
    ++		flags = ctx->opts->split_flags;
      	}

      	g = ctx->r->objects->commit_graph;
    @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
     -		if (ctx->split_opts)
     -			replace = ctx->split_opts->flags & COMMIT_GRAPH_SPLIT_REPLACE;
     +		if (ctx->opts)
    -+			replace = ctx->opts->flags & COMMIT_GRAPH_SPLIT_REPLACE;
    ++			replace = ctx->opts->split_flags & COMMIT_GRAPH_SPLIT_REPLACE;
      	}

      	ctx->approx_nr_objects = approximate_object_count();
    @@ commit-graph.h: enum commit_graph_split_flags {
      	int size_multiple;
      	int max_commits;
      	timestamp_t expire_time;
    +-	enum commit_graph_split_flags flags;
    ++	enum commit_graph_split_flags split_flags;
    + };
    +
    + /*
     @@ commit-graph.h: struct split_commit_graph_opts {
       */
      int write_commit_graph_reachable(struct object_directory *odb,
    @@ commit-graph.h: struct split_commit_graph_opts {

      #define COMMIT_GRAPH_VERIFY_SHALLOW	(1 << 0)

    +
    + ## make: *** [Makefile (new) ##
 14:  09f6871f66 ! 245:  cd0a9da639 builtin/commit-graph.c: introduce '--max-new-filters=<n>'
    @@ Commit message
         builtin/commit-graph.c: introduce '--max-new-filters=<n>'

         Introduce a command-line flag and configuration variable to fill in the
    -    'max_new_filters' variable introduced by the previous patch.
    +    'max_new_filters' variable introduced two patches ago.

         The command-line option '--max-new-filters' takes precedence over
         'commitGraph.maxNewFilters', which is the default value.
    @@ Documentation/git-commit-graph.txt: this option is given, future commit-graph wr
      +
     +With the `--max-new-filters=<n>` option, generate at most `n` new Bloom
     +filters (if `--changed-paths` is specified). If `n` is `-1`, no limit is
    -+enforced. Overrides the `commitGraph.maxNewFilters` configuration.
    ++enforced. Commits whose filters are not calculated are stored as a
    ++length zero Bloom filter, and their bit is marked in the `BFXL` chunk.
    ++Overrides the `commitGraph.maxNewFilters` configuration.
     ++
      With the `--split[=<strategy>]` option, write the commit-graph as a
      chain of multiple commit-graph files stored in
      `<dir>/info/commit-graphs`. Commit-graph layers are merged based on the

      ## bloom.c ##
    -@@ bloom.c: static int load_bloom_filter_from_graph(struct commit_graph *g,
    - 	else
    - 		start_index = 0;
    +@@ bloom.c: struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,

    -+	if ((start_index == end_index) &&
    -+	    (g->bloom_large && !bitmap_get(g->bloom_large, lex_pos))) {
    -+		/*
    -+		 * If the filter is zero-length, either (1) the filter has no
    -+		 * changes, (2) the filter has too many changes, or (3) it
    -+		 * wasn't computed (eg., due to '--max-new-filters').
    -+		 *
    -+		 * If either (1) or (2) is the case, the 'large' bit will be set
    -+		 * for this Bloom filter. If it is unset, then it wasn't
    -+		 * computed. In that case, return nothing, since we don't have
    -+		 * that filter in the graph.
    -+		 */
    -+		return 0;
    -+	}
    + 	if (!filter->data) {
    + 		load_commit_graph_info(r, c);
    +-		if (commit_graph_position(c) != COMMIT_NOT_FROM_GRAPH &&
    +-			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
    +-				return filter;
    ++		if (commit_graph_position(c) != COMMIT_NOT_FROM_GRAPH)
    ++			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c);
    + 	}
    +
    +-	if (filter->data)
    ++	if (filter->data && filter->len)
    + 		return filter;
    + 	if (!compute_if_not_present)
    + 		return NULL;
    +
    ++	if (filter && !filter->len &&
    ++	    get_bloom_filter_large_in_graph(r->objects->commit_graph, c,
    ++					    settings->max_changed_paths))
    ++		return filter;
    ++
     +
    - 	filter->len = end_index - start_index;
    - 	filter->data = (unsigned char *)(g->chunk_bloom_data +
    - 					sizeof(unsigned char) * start_index +
    + 	repo_diff_setup(r, &diffopt);
    + 	diffopt.flags.recursive = 1;
    + 	diffopt.detect_rename = 0;

      ## builtin/commit-graph.c ##
     @@ builtin/commit-graph.c: static char const * const builtin_commit_graph_usage[] = {
    @@ commit-graph.c
     @@ commit-graph.c: struct tree *get_commit_tree_in_graph(struct repository *r, const struct commit
      }

    - static int get_bloom_filter_large_in_graph(struct commit_graph *g,
    --					   const struct commit *c)
    -+					   const struct commit *c,
    -+					   uint32_t max_changed_paths)
    + int get_bloom_filter_large_in_graph(struct commit_graph *g,
    +-				    const struct commit *c)
    ++				    const struct commit *c,
    ++				    uint32_t max_changed_paths)
      {
    - 	uint32_t graph_pos = commit_graph_position(c);
    - 	if (graph_pos == COMMIT_NOT_FROM_GRAPH)
    -@@ commit-graph.c: static int get_bloom_filter_large_in_graph(struct commit_graph *g,
    + 	uint32_t graph_pos;
    + 	if (!find_commit_in_graph(c, g, &graph_pos))
    +@@ commit-graph.c: int get_bloom_filter_large_in_graph(struct commit_graph *g,

    - 	if (!(g && g->bloom_large))
    + 	if (!g->bloom_large)
      		return 0;
     +	if (g->bloom_filter_settings->max_changed_paths != max_changed_paths) {
     +		/*
    @@ commit-graph.c: static void compute_bloom_filters(struct write_commit_graph_cont
      		ctx->order_by_pack ? commit_pos_cmp : commit_gen_cmp,
      		&ctx->commits);

    -+	max_new_filters = ctx->opts->max_new_filters >= 0 ?
    ++	max_new_filters = ctx->opts && ctx->opts->max_new_filters >= 0 ?
     +		ctx->opts->max_new_filters : ctx->commits.nr;
     +
      	for (i = 0; i < ctx->commits.nr; i++) {
    @@ commit-graph.c: static void compute_bloom_filters(struct write_commit_graph_cont
      	}

      ## commit-graph.h ##
    +@@ commit-graph.h: struct tree *get_commit_tree_in_graph(struct repository *r,
    + 				      const struct commit *c);
    +
    + int get_bloom_filter_large_in_graph(struct commit_graph *g,
    +-				    const struct commit *c);
    ++				    const struct commit *c,
    ++				    uint32_t max_changed_paths);
    +
    + struct commit_graph {
    + 	const unsigned char *data;
     @@ commit-graph.h: struct commit_graph_opts {
      	int max_commits;
      	timestamp_t expire_time;
    - 	enum commit_graph_split_flags flags;
    + 	enum commit_graph_split_flags split_flags;
     +	int max_new_filters;
      };

    @@ t/t4216-log-bloom.sh: test_expect_success 'Bloom generation does not recompute t
     +			2 0 1
     +	)
     +'
    ++
    ++test_expect_success 'Bloom generation backfills empty commits' '
    ++	git init empty &&
    ++	test_when_finished "rm -fr empty" &&
    ++	(
    ++		cd empty &&
    ++		for i in $(test_seq 1 6)
    ++		do
    ++			git commit --allow-empty -m "$i"
    ++		done &&
    ++
    ++		# Generate Bloom filters for empty commits 1-6, two at a time.
    ++		test_bloom_filters_computed "--reachable --changed-paths --max-new-filters=2" \
    ++			0 2 2 &&
    ++		test_bloom_filters_computed "--reachable --changed-paths --max-new-filters=2" \
    ++			2 2 2 &&
    ++		test_bloom_filters_computed "--reachable --changed-paths --max-new-filters=2" \
    ++			4 2 2 &&
    ++
    ++		# Finally, make sure that once all commits have filters, that
    ++		# none are subsequently recomputed.
    ++		test_bloom_filters_computed "--reachable --changed-paths --max-new-filters=2" \
    ++			6 0 0
    ++	)
    ++'
     +
      test_done
--
2.27.0.2918.gc99a27ff8f

^ permalink raw reply	[flat|nested] 117+ messages in thread

* [PATCH v4 01/14] commit-graph: introduce 'get_bloom_filter_settings()'
  2020-09-03 22:45 ` [PATCH v4 " Taylor Blau
@ 2020-09-03 22:46   ` Taylor Blau
  2020-09-03 22:46   ` [PATCH v4 02/14] t4216: use an '&&'-chain Taylor Blau
                     ` (13 subsequent siblings)
  14 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-09-03 22:46 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

Many places in the code need a pointer to the commit-graph's
'struct bloom_filter_settings', and they typically take the value
from the top-most commit-graph.

In the non-split case, this works as expected. In the split case,
however, things get a little tricky. Not all layers in a chain of
incremental commit-graphs are required to themselves have Bloom data,
and so whether or not some part of the code uses Bloom filters depends
entirely on whether or not the top-most level of the commit-graph chain
has Bloom filters.

This has been the behavior since Bloom filters were introduced, and has
been codified into the tests since a759bfa9ee (t4216: add end to end
tests for git log with Bloom filters, 2020-04-06). In fact, t4216.130
requires that Bloom filters are not used in exactly the case described
earlier.

There is no reason that this needs to be the case, since it is perfectly
valid for commits in an earlier layer to have Bloom filters when commits
in a newer layer do not.

Since Bloom settings are guaranteed to be the same for any layer in a
chain that has Bloom data, it is sufficient to traverse the
'->base_graph' pointer until either (1) a non-null 'struct
bloom_filter_settings *' is found, or (2) we are at the root of
the commit-graph chain.

Introduce a 'get_bloom_filter_settings()' function that does just this,
and use it instead of purely dereferencing the top-most graph's
'->bloom_filter_settings' pointer.

While we're at it, add an additional test in t5324 to guard against code
in the commit-graph writing machinery that doesn't correctly handle a
NULL 'struct bloom_filter *'.

Co-authored-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
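
As an illustration of the lookup described in the message above, here is
a minimal, self-contained sketch of walking a chain of graph layers until
one of them carries Bloom settings. The 'struct settings' and 'struct
layer' types, 'find_settings()' and 'demo()' are hypothetical stand-ins,
not Git's actual structures; only the shape of the loop mirrors
'get_bloom_filter_settings()'.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for 'struct bloom_filter_settings'. */
struct settings {
	unsigned hash_version;
	unsigned num_hashes;
	unsigned bits_per_entry;
};

/* Hypothetical stand-in for one layer of a split commit-graph. */
struct layer {
	struct settings *settings; /* NULL if this layer has no Bloom data */
	struct layer *base;        /* next-older layer, NULL at the root */
};

/*
 * Walk from the newest layer towards the root and return the first
 * non-NULL settings pointer, or NULL if no layer has Bloom data.
 */
static struct settings *find_settings(struct layer *top)
{
	struct layer *g;

	for (g = top; g; g = g->base) {
		if (g->settings)
			return g->settings;
	}
	return NULL;
}

/* Example: the newest layer lacks Bloom data, but its base has it. */
void demo(void)
{
	struct settings s = { 1, 7, 10 };
	struct layer root = { &s, NULL };
	struct layer top = { NULL, &root };

	assert(find_settings(&top) == &s);
	assert(find_settings(NULL) == NULL);
}
```

Returning the first match is enough in this sketch because, as noted
above, every layer that has Bloom data is assumed to share the same
settings.
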
 blame.c                       |  6 ++++--
 bloom.c                       |  6 +++---
 commit-graph.c                | 11 +++++++++++
 commit-graph.h                |  2 ++
 revision.c                    |  5 +----
 t/t4216-log-bloom.sh          |  9 ++++++---
 t/t5324-split-commit-graph.sh | 13 +++++++++++++
 7 files changed, 40 insertions(+), 12 deletions(-)

diff --git a/blame.c b/blame.c
index 1be1cd82a2..903e23af23 100644
--- a/blame.c
+++ b/blame.c
@@ -2892,16 +2892,18 @@ void setup_blame_bloom_data(struct blame_scoreboard *sb,
 			    const char *path)
 {
 	struct blame_bloom_data *bd;
+	struct bloom_filter_settings *bs;
 
 	if (!sb->repo->objects->commit_graph)
 		return;
 
-	if (!sb->repo->objects->commit_graph->bloom_filter_settings)
+	bs = get_bloom_filter_settings(sb->repo);
+	if (!bs)
 		return;
 
 	bd = xmalloc(sizeof(struct blame_bloom_data));
 
-	bd->settings = sb->repo->objects->commit_graph->bloom_filter_settings;
+	bd->settings = bs;
 
 	bd->alloc = 4;
 	bd->nr = 0;
diff --git a/bloom.c b/bloom.c
index 1a573226e7..cd9380ac62 100644
--- a/bloom.c
+++ b/bloom.c
@@ -38,7 +38,7 @@ static int load_bloom_filter_from_graph(struct commit_graph *g,
 	while (graph_pos < g->num_commits_in_base)
 		g = g->base_graph;
 
-	/* The commit graph commit 'c' lives in doesn't carry bloom filters. */
+	/* The commit graph commit 'c' lives in doesn't carry Bloom filters. */
 	if (!g->chunk_bloom_indexes)
 		return 0;
 
@@ -195,8 +195,8 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 	if (!filter->data) {
 		load_commit_graph_info(r, c);
 		if (commit_graph_position(c) != COMMIT_NOT_FROM_GRAPH &&
-			r->objects->commit_graph->chunk_bloom_indexes)
-			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c);
+			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
+				return filter;
 	}
 
 	if (filter->data)
diff --git a/commit-graph.c b/commit-graph.c
index 0ed003e218..6a36ed0b06 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -667,6 +667,17 @@ int generation_numbers_enabled(struct repository *r)
 	return !!first_generation;
 }
 
+struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r)
+{
+	struct commit_graph *g = r->objects->commit_graph;
+	while (g) {
+		if (g->bloom_filter_settings)
+			return g->bloom_filter_settings;
+		g = g->base_graph;
+	}
+	return NULL;
+}
+
 static void close_commit_graph_one(struct commit_graph *g)
 {
 	if (!g)
diff --git a/commit-graph.h b/commit-graph.h
index 09a97030dc..0677dd1031 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -87,6 +87,8 @@ struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size);
  */
 int generation_numbers_enabled(struct repository *r);
 
+struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r);
+
 enum commit_graph_write_flags {
 	COMMIT_GRAPH_WRITE_APPEND     = (1 << 0),
 	COMMIT_GRAPH_WRITE_PROGRESS   = (1 << 1),
diff --git a/revision.c b/revision.c
index 08c2ad23af..857274408c 100644
--- a/revision.c
+++ b/revision.c
@@ -680,10 +680,7 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 
 	repo_parse_commit(revs->repo, revs->commits->item);
 
-	if (!revs->repo->objects->commit_graph)
-		return;
-
-	revs->bloom_filter_settings = revs->repo->objects->commit_graph->bloom_filter_settings;
+	revs->bloom_filter_settings = get_bloom_filter_settings(revs->repo);
 	if (!revs->bloom_filter_settings)
 		return;
 
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 4bb9e9dbe2..715912ad0f 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -65,7 +65,7 @@ setup () {
 
 test_bloom_filters_used () {
 	log_args=$1
-	bloom_trace_prefix="statistics:{\"filter_not_present\":0,\"maybe\""
+	bloom_trace_prefix="statistics:{\"filter_not_present\":${2:-0},\"maybe\""
 	setup "$log_args" &&
 	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" &&
 	test_cmp log_wo_bloom log_w_bloom &&
@@ -139,8 +139,11 @@ test_expect_success 'setup - add commit-graph to the chain without Bloom filters
 	test_line_count = 2 .git/objects/info/commit-graphs/commit-graph-chain
 '
 
-test_expect_success 'Do not use Bloom filters if the latest graph does not have Bloom filters.' '
-	test_bloom_filters_not_used "-- A/B"
+test_expect_success 'use Bloom filters even if the latest graph does not have Bloom filters' '
+	# Ensure that the number of empty filters is equal to the number of
+	# filters in the latest graph layer to prove that they are loaded (and
+	# ignored).
+	test_bloom_filters_used "-- A/B" 3
 '
 
 test_expect_success 'setup - add commit-graph to the chain with Bloom filters' '
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 18216463c7..c334ee9155 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -427,4 +427,17 @@ done <<\EOF
 0600 -r--------
 EOF
 
+test_expect_success '--split=replace with partial Bloom data' '
+	rm -rf $graphdir $infodir/commit-graph &&
+	git reset --hard commits/3 &&
+	git rev-list -1 HEAD~2 >a &&
+	git rev-list -1 HEAD~1 >b &&
+	git commit-graph write --split=no-merge --stdin-commits --changed-paths <a &&
+	git commit-graph write --split=no-merge --stdin-commits <b &&
+	git commit-graph write --split=replace --stdin-commits --changed-paths <c &&
+	ls $graphdir/graph-*.graph >graph-files &&
+	test_line_count = 1 graph-files &&
+	verify_chain_files_exist $graphdir
+'
+
 test_done
-- 
2.27.0.2918.gc99a27ff8f



* [PATCH v4 02/14] t4216: use an '&&'-chain
  2020-09-03 22:45 ` [PATCH v4 " Taylor Blau
  2020-09-03 22:46   ` [PATCH v4 01/14] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
@ 2020-09-03 22:46   ` Taylor Blau
  2020-09-03 22:46   ` [PATCH v4 03/14] commit-graph: pass a 'struct repository *' in more places Taylor Blau
                     ` (12 subsequent siblings)
  14 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-09-03 22:46 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

In a759bfa9ee (t4216: add end to end tests for git log with Bloom
filters, 2020-04-06), a 'rm' invocation was added without a
corresponding '&&' chain.

When 'trace.perf' already exists, everything works fine. However, the
function can be executed without 'trace.perf' on disk (e.g., when the
subset of tests run is altered with '--run'), and so the bare 'rm'
complains about a missing file.

To remove some noise from the test log, invoke 'rm' with '-f', at which
point it is sensible to place the 'rm -f' in an '&&'-chain, which
(1) is our usual style, and (2) avoids a broken chain in the future if
more commands are added at the beginning of the function.

Helped-by: Eric Sunshine <sunshine@sunshineco.com>
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/t4216-log-bloom.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 715912ad0f..cd89c75002 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -58,7 +58,7 @@ sane_unset GIT_TRACE2_PERF_BRIEF
 sane_unset GIT_TRACE2_CONFIG_PARAMS
 
 setup () {
-	rm "$TRASH_DIRECTORY/trace.perf"
+	rm -f "$TRASH_DIRECTORY/trace.perf" &&
 	git -c core.commitGraph=false log --pretty="format:%s" $1 >log_wo_bloom &&
 	GIT_TRACE2_PERF="$TRASH_DIRECTORY/trace.perf" git -c core.commitGraph=true log --pretty="format:%s" $1 >log_w_bloom
 }
-- 
2.27.0.2918.gc99a27ff8f



* [PATCH v4 03/14] commit-graph: pass a 'struct repository *' in more places
  2020-09-03 22:45 ` [PATCH v4 " Taylor Blau
  2020-09-03 22:46   ` [PATCH v4 01/14] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
  2020-09-03 22:46   ` [PATCH v4 02/14] t4216: use an '&&'-chain Taylor Blau
@ 2020-09-03 22:46   ` Taylor Blau
  2020-09-03 22:46   ` [PATCH v4 04/14] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
                     ` (11 subsequent siblings)
  14 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-09-03 22:46 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

In a future commit, some commit-graph internals will want access to
'r->settings', but we only have the 'struct object_directory *'
corresponding to that repository.

Add an additional parameter to pass the repository around in more
places.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/commit-graph.c |  2 +-
 commit-graph.c         | 17 ++++++++++-------
 commit-graph.h         |  6 ++++--
 fuzz-commit-graph.c    |  5 +++--
 4 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 523501f217..ba5584463f 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -106,7 +106,7 @@ static int graph_verify(int argc, const char **argv)
 	FREE_AND_NULL(graph_name);
 
 	if (open_ok)
-		graph = load_commit_graph_one_fd_st(fd, &st, odb);
+		graph = load_commit_graph_one_fd_st(the_repository, fd, &st, odb);
 	else
 		graph = read_commit_graph_one(the_repository, odb);
 
diff --git a/commit-graph.c b/commit-graph.c
index 6a36ed0b06..72a838bd00 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -231,7 +231,8 @@ int open_commit_graph(const char *graph_file, int *fd, struct stat *st)
 	return 1;
 }
 
-struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st,
+struct commit_graph *load_commit_graph_one_fd_st(struct repository *r,
+						 int fd, struct stat *st,
 						 struct object_directory *odb)
 {
 	void *graph_map;
@@ -247,7 +248,7 @@ struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st,
 	}
 	graph_map = xmmap(NULL, graph_size, PROT_READ, MAP_PRIVATE, fd, 0);
 	close(fd);
-	ret = parse_commit_graph(graph_map, graph_size);
+	ret = parse_commit_graph(r, graph_map, graph_size);
 
 	if (ret)
 		ret->odb = odb;
@@ -287,7 +288,8 @@ static int verify_commit_graph_lite(struct commit_graph *g)
 	return 0;
 }
 
-struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size)
+struct commit_graph *parse_commit_graph(struct repository *r,
+					void *graph_map, size_t graph_size)
 {
 	const unsigned char *data, *chunk_lookup;
 	uint32_t i;
@@ -452,7 +454,8 @@ struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size)
 	return NULL;
 }
 
-static struct commit_graph *load_commit_graph_one(const char *graph_file,
+static struct commit_graph *load_commit_graph_one(struct repository *r,
+						  const char *graph_file,
 						  struct object_directory *odb)
 {
 
@@ -464,7 +467,7 @@ static struct commit_graph *load_commit_graph_one(const char *graph_file,
 	if (!open_ok)
 		return NULL;
 
-	g = load_commit_graph_one_fd_st(fd, &st, odb);
+	g = load_commit_graph_one_fd_st(r, fd, &st, odb);
 
 	if (g)
 		g->filename = xstrdup(graph_file);
@@ -476,7 +479,7 @@ static struct commit_graph *load_commit_graph_v1(struct repository *r,
 						 struct object_directory *odb)
 {
 	char *graph_name = get_commit_graph_filename(odb);
-	struct commit_graph *g = load_commit_graph_one(graph_name, odb);
+	struct commit_graph *g = load_commit_graph_one(r, graph_name, odb);
 	free(graph_name);
 
 	return g;
@@ -557,7 +560,7 @@ static struct commit_graph *load_commit_graph_chain(struct repository *r,
 		valid = 0;
 		for (odb = r->objects->odb; odb; odb = odb->next) {
 			char *graph_name = get_split_graph_filename(odb, line.buf);
-			struct commit_graph *g = load_commit_graph_one(graph_name, odb);
+			struct commit_graph *g = load_commit_graph_one(r, graph_name, odb);
 
 			free(graph_name);
 
diff --git a/commit-graph.h b/commit-graph.h
index 0677dd1031..d9acb22bac 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -75,11 +75,13 @@ struct commit_graph {
 	struct bloom_filter_settings *bloom_filter_settings;
 };
 
-struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st,
+struct commit_graph *load_commit_graph_one_fd_st(struct repository *r,
+						 int fd, struct stat *st,
 						 struct object_directory *odb);
 struct commit_graph *read_commit_graph_one(struct repository *r,
 					   struct object_directory *odb);
-struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size);
+struct commit_graph *parse_commit_graph(struct repository *r,
+					void *graph_map, size_t graph_size);
 
 /*
  * Return 1 if and only if the repository has a commit-graph
diff --git a/fuzz-commit-graph.c b/fuzz-commit-graph.c
index 430817214d..e7cf6d5b0f 100644
--- a/fuzz-commit-graph.c
+++ b/fuzz-commit-graph.c
@@ -1,7 +1,8 @@
 #include "commit-graph.h"
 #include "repository.h"
 
-struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size);
+struct commit_graph *parse_commit_graph(struct repository *r,
+					void *graph_map, size_t graph_size);
 
 int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size);
 
@@ -10,7 +11,7 @@ int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
 	struct commit_graph *g;
 
 	initialize_the_repository();
-	g = parse_commit_graph((void *)data, size);
+	g = parse_commit_graph(the_repository, (void *)data, size);
 	repo_clear(the_repository);
 	free_commit_graph(g);
 
-- 
2.27.0.2918.gc99a27ff8f



* [PATCH v4 04/14] t/helper/test-read-graph.c: prepare repo settings
  2020-09-03 22:45 ` [PATCH v4 " Taylor Blau
                     ` (2 preceding siblings ...)
  2020-09-03 22:46   ` [PATCH v4 03/14] commit-graph: pass a 'struct repository *' in more places Taylor Blau
@ 2020-09-03 22:46   ` Taylor Blau
  2020-09-03 22:46   ` [PATCH v4 05/14] commit-graph: respect 'commitGraph.readChangedPaths' Taylor Blau
                     ` (10 subsequent siblings)
  14 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-09-03 22:46 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

The read-graph test-tool is used by a number of the commit-graph tests to
assert various properties about a commit-graph. Previously, this program
never ran 'prepare_repo_settings()'. There was no need to do so, since
none of the commit-graph machinery was affected by the repo settings.

In the next patch, the commit-graph machinery's behavior will become
dependent on the repo settings, and so loading them before running the
rest of the test tool is critical.

As such, teach the test tool to call 'prepare_repo_settings()'.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/helper/test-read-graph.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
index 6d0c962438..5f585a1725 100644
--- a/t/helper/test-read-graph.c
+++ b/t/helper/test-read-graph.c
@@ -12,11 +12,12 @@ int cmd__read_graph(int argc, const char **argv)
 	setup_git_directory();
 	odb = the_repository->objects->odb;
 
+	prepare_repo_settings(the_repository);
+
 	graph = read_commit_graph_one(the_repository, odb);
 	if (!graph)
 		return 1;
 
-
 	printf("header: %08x %d %d %d %d\n",
 		ntohl(*(uint32_t*)graph->data),
 		*(unsigned char*)(graph->data + 4),
-- 
2.27.0.2918.gc99a27ff8f



* [PATCH v4 05/14] commit-graph: respect 'commitGraph.readChangedPaths'
  2020-09-03 22:45 ` [PATCH v4 " Taylor Blau
                     ` (3 preceding siblings ...)
  2020-09-03 22:46   ` [PATCH v4 04/14] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
@ 2020-09-03 22:46   ` Taylor Blau
  2020-09-03 22:46   ` [PATCH v4 06/14] commit-graph.c: store maximum changed paths Taylor Blau
                     ` (9 subsequent siblings)
  14 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-09-03 22:46 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

Git uses the 'core.commitGraph' configuration value to control whether
or not the commit graph is used when parsing commits or performing a
traversal.

Now that commit-graphs can also contain a section for changed-path Bloom
filters, administrators that already have commit-graphs may find it
convenient to use those graphs without relying on their changed-path
Bloom filters. This can happen, for example, during a staged roll-out,
or in the event of an incident.

Introduce 'commitGraph.readChangedPaths' to control whether or not Bloom
filters are read. Note that this configuration is independent from both:

  - 'core.commitGraph', to allow flexibility in using all parts of a
    commit-graph _except_ for its Bloom filters.

  - The '--changed-paths' option for 'git commit-graph write', to allow
    reading and writing Bloom filters to be controlled independently.

When the variable is set to false, pretend as if no Bloom data was
specified at all. This avoids adding additional special-casing outside
of the commit-graph internals.

Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
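
The "pretend there is no Bloom data" approach described above can be
sketched in isolation as follows. This is only a rough model: 'struct
settings', 'struct graph', 'record_bloom_chunks()' and
'graph_has_bloom()' are hypothetical names invented for the example and
do not exist in Git. The point is that the gate sits where the chunks are
recorded, so consumers keep their single NULL check.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant part of 'struct repo_settings'. */
struct settings {
	int read_changed_paths; /* mirrors commitGraph.readChangedPaths */
};

/* Hypothetical stand-in for the Bloom-related fields of a parsed graph. */
struct graph {
	const unsigned char *bloom_indexes; /* BIDX chunk, or NULL */
	const unsigned char *bloom_data;    /* BDAT chunk, or NULL */
};

/*
 * Record the Bloom chunks only when reading them is enabled. With the
 * setting off, both pointers stay NULL, so every later consumer sees a
 * graph that looks as if it had been written without Bloom filters.
 */
static void record_bloom_chunks(struct graph *g, const struct settings *s,
				const unsigned char *bidx,
				const unsigned char *bdat)
{
	assert(g && s);
	g->bloom_indexes = NULL;
	g->bloom_data = NULL;
	if (!s->read_changed_paths)
		return;
	g->bloom_indexes = bidx;
	g->bloom_data = bdat;
}

/* Callers need only one check, regardless of why the data is absent. */
static int graph_has_bloom(const struct graph *g)
{
	return g->bloom_indexes && g->bloom_data;
}
```

With this shape, a caller that already handles "graph without Bloom
chunks" needs no new code path for "Bloom chunks present but disabled".
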
 Documentation/config.txt             | 2 ++
 Documentation/config/commitgraph.txt | 4 ++++
 commit-graph.c                       | 6 ++++--
 repo-settings.c                      | 3 +++
 repository.h                         | 1 +
 t/t4216-log-bloom.sh                 | 4 +++-
 6 files changed, 17 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/config/commitgraph.txt

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 3042d80978..770ae79b82 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -340,6 +340,8 @@ include::config/column.txt[]
 
 include::config/commit.txt[]
 
+include::config/commitgraph.txt[]
+
 include::config/credential.txt[]
 
 include::config/completion.txt[]
diff --git a/Documentation/config/commitgraph.txt b/Documentation/config/commitgraph.txt
new file mode 100644
index 0000000000..cff0797b54
--- /dev/null
+++ b/Documentation/config/commitgraph.txt
@@ -0,0 +1,4 @@
+commitGraph.readChangedPaths::
+	If true, then git will use the changed-path Bloom filters in the
+	commit-graph file (if it exists, and they are present). Defaults to
+	true. See linkgit:git-commit-graph[1] for more information.
diff --git a/commit-graph.c b/commit-graph.c
index 72a838bd00..ea54d108b9 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -327,6 +327,8 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 		return NULL;
 	}
 
+	prepare_repo_settings(r);
+
 	graph = alloc_commit_graph();
 
 	graph->hash_len = the_hash_algo->rawsz;
@@ -403,14 +405,14 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 		case GRAPH_CHUNKID_BLOOMINDEXES:
 			if (graph->chunk_bloom_indexes)
 				chunk_repeated = 1;
-			else
+			else if (r->settings.commit_graph_read_changed_paths)
 				graph->chunk_bloom_indexes = data + chunk_offset;
 			break;
 
 		case GRAPH_CHUNKID_BLOOMDATA:
 			if (graph->chunk_bloom_data)
 				chunk_repeated = 1;
-			else {
+			else if (r->settings.commit_graph_read_changed_paths) {
 				uint32_t hash_version;
 				graph->chunk_bloom_data = data + chunk_offset;
 				hash_version = get_be32(data + chunk_offset);
diff --git a/repo-settings.c b/repo-settings.c
index aa61a35338..88ccce2036 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -17,9 +17,12 @@ void prepare_repo_settings(struct repository *r)
 
 	if (!repo_config_get_bool(r, "core.commitgraph", &value))
 		r->settings.core_commit_graph = value;
+	if (!repo_config_get_bool(r, "commitgraph.readchangedpaths", &value))
+		r->settings.commit_graph_read_changed_paths = value;
 	if (!repo_config_get_bool(r, "gc.writecommitgraph", &value))
 		r->settings.gc_write_commit_graph = value;
 	UPDATE_DEFAULT_BOOL(r->settings.core_commit_graph, 1);
+	UPDATE_DEFAULT_BOOL(r->settings.commit_graph_read_changed_paths, 1);
 	UPDATE_DEFAULT_BOOL(r->settings.gc_write_commit_graph, 1);
 
 	if (!repo_config_get_int(r, "index.version", &value))
diff --git a/repository.h b/repository.h
index 628c834367..bacf843d46 100644
--- a/repository.h
+++ b/repository.h
@@ -30,6 +30,7 @@ struct repo_settings {
 	int initialized;
 
 	int core_commit_graph;
+	int commit_graph_read_changed_paths;
 	int gc_write_commit_graph;
 	int fetch_write_commit_graph;
 
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index cd89c75002..fc7693806c 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -95,7 +95,9 @@ do
 		      "--ancestry-path side..master"
 	do
 		test_expect_success "git log option: $option for path: $path" '
-			test_bloom_filters_used "$option -- $path"
+			test_bloom_filters_used "$option -- $path" &&
+			test_config commitgraph.readChangedPaths false &&
+			test_bloom_filters_not_used "$option -- $path"
 		'
 	done
 done
-- 
2.27.0.2918.gc99a27ff8f



* [PATCH v4 06/14] commit-graph.c: store maximum changed paths
  2020-09-03 22:45 ` [PATCH v4 " Taylor Blau
                     ` (4 preceding siblings ...)
  2020-09-03 22:46   ` [PATCH v4 05/14] commit-graph: respect 'commitGraph.readChangedPaths' Taylor Blau
@ 2020-09-03 22:46   ` Taylor Blau
  2020-09-03 22:46   ` [PATCH v4 07/14] bloom: split 'get_bloom_filter()' in two Taylor Blau
                     ` (8 subsequent siblings)
  14 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-09-03 22:46 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

For now, we assume that there is a fixed constant describing the
maximum number of changed paths we are willing to store in a Bloom
filter.

Prepare for that to (at least partially) not be the case by making it a
member of the 'struct bloom_filter_settings'. This will be helpful in
the subsequent patches by reducing the size of test cases that exercise
storing too many changed paths, as well as preparing for an eventual
future in which this value might change.

This patch alone does not cause newly generated Bloom filters to use
a custom upper-bound on the maximum number of changed paths a single
Bloom filter can hold; that will occur in a later patch.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
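
To make the idea above concrete, here is a small sketch of a settings
struct that carries the limit as a member, with a default supplied by an
initializer macro and an optional override read from the environment.
'env_ulong()' and 'too_many_changes()' are hypothetical helpers written
for this example (Git itself uses 'git_env_ulong()', whose exact parsing
rules are not reproduced here), and the struct is a simplified stand-in
for 'struct bloom_filter_settings'.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define DEFAULT_MAX_CHANGES 512

/* Simplified stand-in for 'struct bloom_filter_settings'. */
struct settings {
	uint32_t hash_version;
	uint32_t num_hashes;
	uint32_t bits_per_entry;
	uint32_t max_changed_paths; /* in-memory only, never written out */
};

#define DEFAULT_SETTINGS { 1, 7, 10, DEFAULT_MAX_CHANGES }

/*
 * Hypothetical helper: return the value of environment variable 'name'
 * parsed as a decimal number, or 'fallback' if the variable is unset,
 * empty, or has trailing characters after the number.
 */
static unsigned long env_ulong(const char *name, unsigned long fallback)
{
	const char *v = getenv(name);
	char *end;
	unsigned long n;

	if (!v || !*v)
		return fallback;
	n = strtoul(v, &end, 10);
	return *end ? fallback : n;
}

/* Hypothetical helper: does this diff exceed the filter's path limit? */
static int too_many_changes(const struct settings *s, unsigned long nr_paths)
{
	assert(s);
	return nr_paths > s->max_changed_paths;
}
```

A test could then lower the limit without touching any constant, for
example with 's.max_changed_paths = env_ulong("EXAMPLE_MAX_CHANGED_PATHS",
s.max_changed_paths);', where the variable name is made up for this
sketch.
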
 bloom.h              | 11 ++++++++++-
 commit-graph.c       |  3 +++
 t/t4216-log-bloom.sh |  4 ++--
 3 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/bloom.h b/bloom.h
index d8fbb0fbf1..0b9b59a6fe 100644
--- a/bloom.h
+++ b/bloom.h
@@ -28,9 +28,18 @@ struct bloom_filter_settings {
 	 * that contain n*b bits.
 	 */
 	uint32_t bits_per_entry;
+
+	/*
+	 * The maximum number of changed paths per commit
+	 * before declaring a Bloom filter to be too-large.
+	 *
+	 * Not written to the commit-graph file.
+	 */
+	uint32_t max_changed_paths;
 };
 
-#define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 }
+#define DEFAULT_BLOOM_MAX_CHANGES 512
+#define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10, DEFAULT_BLOOM_MAX_CHANGES }
 #define BITS_PER_WORD 8
 #define BLOOMDATA_CHUNK_HEADER_SIZE 3 * sizeof(uint32_t)
 
diff --git a/commit-graph.c b/commit-graph.c
index ea54d108b9..55af498aa0 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1201,6 +1201,7 @@ static void trace2_bloom_filter_settings(struct write_commit_graph_context *ctx)
 	jw_object_intmax(&jw, "hash_version", ctx->bloom_settings->hash_version);
 	jw_object_intmax(&jw, "num_hashes", ctx->bloom_settings->num_hashes);
 	jw_object_intmax(&jw, "bits_per_entry", ctx->bloom_settings->bits_per_entry);
+	jw_object_intmax(&jw, "max_changed_paths", ctx->bloom_settings->max_changed_paths);
 	jw_end(&jw);
 
 	trace2_data_json("bloom", ctx->r, "settings", &jw);
@@ -1669,6 +1670,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 							      bloom_settings.bits_per_entry);
 		bloom_settings.num_hashes = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_NUM_HASHES",
 							  bloom_settings.num_hashes);
+		bloom_settings.max_changed_paths = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS",
+							  bloom_settings.max_changed_paths);
 		ctx->bloom_settings = &bloom_settings;
 	}
 
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index fc7693806c..47ddf2641f 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -174,11 +174,11 @@ test_expect_success 'persist filter settings' '
 		GIT_TEST_BLOOM_SETTINGS_NUM_HASHES=9 \
 		GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY=15 \
 		git commit-graph write --reachable --changed-paths &&
-	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15}" trace2.txt &&
+	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15" trace2.txt &&
 	GIT_TRACE2_EVENT="$(pwd)/trace2-auto.txt" \
 		GIT_TRACE2_EVENT_NESTING=5 \
 		git commit-graph write --reachable --changed-paths &&
-	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15}" trace2-auto.txt
+	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15" trace2-auto.txt
 '
 
 test_expect_success 'correctly report changes over limit' '
-- 
2.27.0.2918.gc99a27ff8f



* [PATCH v4 07/14] bloom: split 'get_bloom_filter()' in two
  2020-09-03 22:45 ` [PATCH v4 " Taylor Blau
                     ` (5 preceding siblings ...)
  2020-09-03 22:46   ` [PATCH v4 06/14] commit-graph.c: store maximum changed paths Taylor Blau
@ 2020-09-03 22:46   ` Taylor Blau
  2020-09-05 17:22     ` Jakub Narębski
  2020-09-03 22:46   ` [PATCH v4 08/14] bloom: use provided 'struct bloom_filter_settings' Taylor Blau
                     ` (7 subsequent siblings)
  14 siblings, 1 reply; 117+ messages in thread
From: Taylor Blau @ 2020-09-03 22:46 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

'get_bloom_filter' takes a flag to control whether it will compute a
Bloom filter if the requested one is missing. In the next patch, we'll
add yet another parameter to this function, which would force all but one
caller to specify an extra 'NULL' parameter at the end.

Instead of doing this, split 'get_bloom_filter' into two functions:
'get_bloom_filter' and 'get_or_compute_bloom_filter'. The former only
looks up a Bloom filter (and does not compute one if it's missing,
thus dropping the 'compute_if_not_present' flag). The latter does
compute missing Bloom filters, with an additional parameter to store
whether or not it needed to do so.

This simplifies many call-sites, since the majority of existing callers
to 'get_bloom_filter' do not want missing Bloom filters to be computed
(so they can drop the parameter entirely and use the simpler version of
the function).

While we're at it, instrument the new 'get_or_compute_bloom_filter()'
with two counters in the 'write_commit_graph_context' struct which store
the number of filters that we computed, and the number of those which
were too large to store.

It would be nice to drop the 'compute_if_not_present' flag entirely,
since all remaining callers of 'get_or_compute_bloom_filter' pass it as
'1', but this will change in a future patch and hence cannot be removed.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
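
The split described above (a lookup-only entry point layered on top of a
"get or compute" function that reports whether it did any work) can be
sketched on a toy cache. Everything here is hypothetical: 'struct
filter', the fixed-size 'cache', and the function and macro names are
invented for the example, and the "computation" is a placeholder. Unlike
the real code, the lookup-only path in this sketch simply returns NULL
for a missing entry.

```c
#include <assert.h>
#include <stddef.h>

#define NR_SLOTS 8

/* Hypothetical stand-in for a cached per-commit filter. */
struct filter {
	int present; /* non-zero once the slot holds a filter */
	int len;     /* illustrative payload */
};

static struct filter cache[NR_SLOTS];

/*
 * Return the filter for 'id'. If it is missing and 'compute_if_missing'
 * is set, build it first. When 'computed' is non-NULL, it is set to 1
 * only if this call had to build the filter, and to 0 otherwise.
 */
static struct filter *get_or_compute_filter(int id, int compute_if_missing,
					    int *computed)
{
	struct filter *f;

	assert(id >= 0 && id < NR_SLOTS);
	f = &cache[id];

	if (computed)
		*computed = 0;
	if (f->present)
		return f;
	if (!compute_if_missing)
		return NULL;

	f->present = 1;
	f->len = id + 1; /* placeholder for the real computation */
	if (computed)
		*computed = 1;
	return f;
}

/* Lookup-only convenience wrapper, so read-side callers pass no flags. */
#define get_filter(id) get_or_compute_filter((id), 0, NULL)
```

A writer loop can then count freshly built filters by passing
'&computed' and incrementing a counter whenever it comes back as 1, while
readers call 'get_filter(id)' and never trigger a computation.
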
 blame.c               |  2 +-
 bloom.c               | 13 ++++++++++---
 bloom.h               | 10 +++++++---
 commit-graph.c        | 38 +++++++++++++++++++++++++++++++++++---
 line-log.c            |  2 +-
 revision.c            |  2 +-
 t/helper/test-bloom.c |  3 ++-
 7 files changed, 57 insertions(+), 13 deletions(-)

diff --git a/blame.c b/blame.c
index 903e23af23..e5ba35dbd1 100644
--- a/blame.c
+++ b/blame.c
@@ -1276,7 +1276,7 @@ static int maybe_changed_path(struct repository *r,
 	if (commit_graph_generation(origin->commit) == GENERATION_NUMBER_INFINITY)
 		return 1;
 
-	filter = get_bloom_filter(r, origin->commit, 0);
+	filter = get_bloom_filter(r, origin->commit);
 
 	if (!filter)
 		return 1;
diff --git a/bloom.c b/bloom.c
index cd9380ac62..a8a21762f4 100644
--- a/bloom.c
+++ b/bloom.c
@@ -177,9 +177,10 @@ static int pathmap_cmp(const void *hashmap_cmp_fn_data,
 	return strcmp(e1->path, e2->path);
 }
 
-struct bloom_filter *get_bloom_filter(struct repository *r,
-				      struct commit *c,
-				      int compute_if_not_present)
+struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
+						 struct commit *c,
+						 int compute_if_not_present,
+						 int *computed)
 {
 	struct bloom_filter *filter;
 	struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
@@ -187,6 +188,9 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 	struct diff_options diffopt;
 	int max_changes = 512;
 
+	if (computed)
+		*computed = 0;
+
 	if (!bloom_filters.slab_size)
 		return NULL;
 
@@ -273,6 +277,9 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 		filter->len = 0;
 	}
 
+	if (computed)
+		*computed = 1;
+
 	free(diff_queued_diff.queue);
 	DIFF_QUEUE_CLEAR(&diff_queued_diff);
 
diff --git a/bloom.h b/bloom.h
index 0b9b59a6fe..baa91926db 100644
--- a/bloom.h
+++ b/bloom.h
@@ -89,9 +89,13 @@ void add_key_to_filter(const struct bloom_key *key,
 
 void init_bloom_filters(void);
 
-struct bloom_filter *get_bloom_filter(struct repository *r,
-				      struct commit *c,
-				      int compute_if_not_present);
+struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
+						 struct commit *c,
+						 int compute_if_not_present,
+						 int *computed);
+
+#define get_bloom_filter(r, c) get_or_compute_bloom_filter( \
+	(r), (c), 0, NULL)
 
 int bloom_filter_contains(const struct bloom_filter *filter,
 			  const struct bloom_key *key,
diff --git a/commit-graph.c b/commit-graph.c
index 55af498aa0..cabac7f45b 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -971,6 +971,9 @@ struct write_commit_graph_context {
 	const struct split_commit_graph_opts *split_opts;
 	size_t total_bloom_filter_data_size;
 	const struct bloom_filter_settings *bloom_settings;
+
+	int count_bloom_filter_found_large;
+	int count_bloom_filter_computed;
 };
 
 static int write_graph_chunk_fanout(struct hashfile *f,
@@ -1182,7 +1185,7 @@ static int write_graph_chunk_bloom_indexes(struct hashfile *f,
 	uint32_t cur_pos = 0;
 
 	while (list < last) {
-		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
+		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
 		size_t len = filter ? filter->len : 0;
 		cur_pos += len;
 		display_progress(ctx->progress, ++ctx->progress_cnt);
@@ -1222,7 +1225,7 @@ static int write_graph_chunk_bloom_data(struct hashfile *f,
 	hashwrite_be32(f, ctx->bloom_settings->bits_per_entry);
 
 	while (list < last) {
-		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
+		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
 		size_t len = filter ? filter->len : 0;
 
 		display_progress(ctx->progress, ++ctx->progress_cnt);
@@ -1392,6 +1395,22 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 	stop_progress(&ctx->progress);
 }
 
+static void trace2_bloom_filter_write_statistics(struct write_commit_graph_context *ctx)
+{
+	struct json_writer jw = JSON_WRITER_INIT;
+
+	jw_object_begin(&jw, 0);
+	jw_object_intmax(&jw, "filter_found_large",
+			 ctx->count_bloom_filter_found_large);
+	jw_object_intmax(&jw, "filter_computed",
+			 ctx->count_bloom_filter_computed);
+	jw_end(&jw);
+
+	trace2_data_json("commit-graph", the_repository, "bloom_statistics", &jw);
+
+	jw_release(&jw);
+}
+
 static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 {
 	int i;
@@ -1414,12 +1433,25 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 		QSORT(sorted_commits, ctx->commits.nr, commit_gen_cmp);
 
 	for (i = 0; i < ctx->commits.nr; i++) {
+		int computed = 0;
 		struct commit *c = sorted_commits[i];
-		struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1);
+		struct bloom_filter *filter = get_or_compute_bloom_filter(
+			ctx->r,
+			c,
+			1,
+			&computed);
+		if (computed) {
+			ctx->count_bloom_filter_computed++;
+			if (filter && !filter->len)
+				ctx->count_bloom_filter_found_large++;
+		}
 		ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
 		display_progress(progress, i + 1);
 	}
 
+	if (trace2_is_enabled())
+		trace2_bloom_filter_write_statistics(ctx);
+
 	free(sorted_commits);
 	stop_progress(&progress);
 }
diff --git a/line-log.c b/line-log.c
index bf73ea95ac..68eeb425f8 100644
--- a/line-log.c
+++ b/line-log.c
@@ -1159,7 +1159,7 @@ static int bloom_filter_check(struct rev_info *rev,
 		return 1;
 
 	if (!rev->bloom_filter_settings ||
-	    !(filter = get_bloom_filter(rev->repo, commit, 0)))
+	    !(filter = get_bloom_filter(rev->repo, commit)))
 		return 1;
 
 	if (!range)
diff --git a/revision.c b/revision.c
index 857274408c..f4be5d1650 100644
--- a/revision.c
+++ b/revision.c
@@ -751,7 +751,7 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 	if (commit_graph_generation(commit) == GENERATION_NUMBER_INFINITY)
 		return -1;
 
-	filter = get_bloom_filter(revs->repo, commit, 0);
+	filter = get_bloom_filter(revs->repo, commit);
 
 	if (!filter) {
 		count_bloom_filter_not_present++;
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index 5e77d56f59..9f7bb729fc 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -39,7 +39,8 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
 	struct bloom_filter *filter;
 	setup_git_directory();
 	c = lookup_commit(the_repository, commit_oid);
-	filter = get_bloom_filter(the_repository, c, 1);
+	filter = get_or_compute_bloom_filter(the_repository, c, 1,
+					     NULL);
 	print_bloom_filter(filter);
 }
 
-- 
2.27.0.2918.gc99a27ff8f



* [PATCH v4 08/14] bloom: use provided 'struct bloom_filter_settings'
  2020-09-03 22:45 ` [PATCH v4 " Taylor Blau
                     ` (6 preceding siblings ...)
  2020-09-03 22:46   ` [PATCH v4 07/14] bloom: split 'get_bloom_filter()' in two Taylor Blau
@ 2020-09-03 22:46   ` Taylor Blau
  2020-09-03 22:46   ` [PATCH v4 09/14] bloom/diff: properly short-circuit on max_changes Taylor Blau
                     ` (6 subsequent siblings)
  14 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-09-03 22:46 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

When 'get_or_compute_bloom_filter()' needs to compute a Bloom filter
from scratch, it looks to the default 'struct bloom_filter_settings' in
order to determine the maximum number of changed paths, number of bits
per entry, and so on.

All of these values have so far been constant, and so there was no need
to pass in a pointer from the caller (eg., the one that is stored in the
'struct write_commit_graph_context').

Start passing in a 'struct bloom_filter_settings *' instead of using the
default values to respect graph-specific settings (eg., in the case of
setting 'GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS').
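
As an illustration (not part of the patch), the effect of caller-provided
settings on a filter's size can be sketched as below. The struct and
function names are hypothetical, and the 8-bit word size is an assumption
made for this sketch; only the rounding formula mirrors the 'filter->len'
computation in the diff.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption for this sketch: filter data is stored in 8-bit words. */
#define SKETCH_BITS_PER_WORD 8

/* Hypothetical subset of the settings a caller would pass in. */
struct sketch_bloom_settings {
	uint32_t bits_per_entry;
	uint32_t max_changed_paths;
};

/*
 * Return the filter length in words for 'nr_paths' changed paths, or 0
 * when the commit exceeds the caller's limit (a "too large" filter).
 */
static size_t sketch_filter_len(size_t nr_paths,
				const struct sketch_bloom_settings *s)
{
	if (nr_paths > s->max_changed_paths)
		return 0;
	/* Round up to a whole number of words. */
	return (nr_paths * s->bits_per_entry + SKETCH_BITS_PER_WORD - 1) /
		SKETCH_BITS_PER_WORD;
}
```

With a limit of 512 and 10 bits per entry, a commit with 512 paths gets a
640-word filter, while the same commit under a limit of 10 gets none.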

In order to have an initialized value for these settings, move their
initialization to earlier in the commit-graph write.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 bloom.c               | 13 ++++++-------
 bloom.h               |  3 ++-
 commit-graph.c        | 21 ++++++++++-----------
 t/helper/test-bloom.c |  1 +
 4 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/bloom.c b/bloom.c
index a8a21762f4..0cf1962dc5 100644
--- a/bloom.c
+++ b/bloom.c
@@ -180,13 +180,12 @@ static int pathmap_cmp(const void *hashmap_cmp_fn_data,
 struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 						 struct commit *c,
 						 int compute_if_not_present,
+						 const struct bloom_filter_settings *settings,
 						 int *computed)
 {
 	struct bloom_filter *filter;
-	struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
 	int i;
 	struct diff_options diffopt;
-	int max_changes = 512;
 
 	if (computed)
 		*computed = 0;
@@ -211,7 +210,7 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 	repo_diff_setup(r, &diffopt);
 	diffopt.flags.recursive = 1;
 	diffopt.detect_rename = 0;
-	diffopt.max_changes = max_changes;
+	diffopt.max_changes = settings->max_changed_paths;
 	diff_setup_done(&diffopt);
 
 	/* ensure commit is parsed so we have parent information */
@@ -223,7 +222,7 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 		diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
 	diffcore_std(&diffopt);
 
-	if (diffopt.num_changes <= max_changes) {
+	if (diffopt.num_changes <= settings->max_changed_paths) {
 		struct hashmap pathmap;
 		struct pathmap_hash_entry *e;
 		struct hashmap_iter iter;
@@ -260,13 +259,13 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 			diff_free_filepair(diff_queued_diff.queue[i]);
 		}
 
-		filter->len = (hashmap_get_size(&pathmap) * settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
+		filter->len = (hashmap_get_size(&pathmap) * settings->bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
 		filter->data = xcalloc(filter->len, sizeof(unsigned char));
 
 		hashmap_for_each_entry(&pathmap, &iter, e, entry) {
 			struct bloom_key key;
-			fill_bloom_key(e->path, strlen(e->path), &key, &settings);
-			add_key_to_filter(&key, filter, &settings);
+			fill_bloom_key(e->path, strlen(e->path), &key, settings);
+			add_key_to_filter(&key, filter, settings);
 		}
 
 		hashmap_free_entries(&pathmap, struct pathmap_hash_entry, entry);
diff --git a/bloom.h b/bloom.h
index baa91926db..3f19e3fca4 100644
--- a/bloom.h
+++ b/bloom.h
@@ -92,10 +92,11 @@ void init_bloom_filters(void);
 struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 						 struct commit *c,
 						 int compute_if_not_present,
+						 const struct bloom_filter_settings *settings,
 						 int *computed);
 
 #define get_bloom_filter(r, c) get_or_compute_bloom_filter( \
-	(r), (c), 0, NULL)
+	(r), (c), 0, NULL, NULL)
 
 int bloom_filter_contains(const struct bloom_filter *filter,
 			  const struct bloom_key *key,
diff --git a/commit-graph.c b/commit-graph.c
index cabac7f45b..7ba9ae26e1 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1439,6 +1439,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 			ctx->r,
 			c,
 			1,
+			ctx->bloom_settings,
 			&computed);
 		if (computed) {
 			ctx->count_bloom_filter_computed++;
@@ -1695,17 +1696,6 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	int num_chunks = 3;
 	uint64_t chunk_offset;
 	struct object_id file_hash;
-	struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
-
-	if (!ctx->bloom_settings) {
-		bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
-							      bloom_settings.bits_per_entry);
-		bloom_settings.num_hashes = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_NUM_HASHES",
-							  bloom_settings.num_hashes);
-		bloom_settings.max_changed_paths = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS",
-							  bloom_settings.max_changed_paths);
-		ctx->bloom_settings = &bloom_settings;
-	}
 
 	if (ctx->split) {
 		struct strbuf tmp_file = STRBUF_INIT;
@@ -2151,6 +2141,7 @@ int write_commit_graph(struct object_directory *odb,
 	uint32_t i, count_distinct = 0;
 	int res = 0;
 	int replace = 0;
+	struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
 
 	if (!commit_graph_compatible(the_repository))
 		return 0;
@@ -2164,6 +2155,14 @@ int write_commit_graph(struct object_directory *odb,
 	ctx->split_opts = split_opts;
 	ctx->total_bloom_filter_data_size = 0;
 
+	bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
+						      bloom_settings.bits_per_entry);
+	bloom_settings.num_hashes = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_NUM_HASHES",
+						  bloom_settings.num_hashes);
+	bloom_settings.max_changed_paths = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS",
+							 bloom_settings.max_changed_paths);
+	ctx->bloom_settings = &bloom_settings;
+
 	if (flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS)
 		ctx->changed_paths = 1;
 	if (!(flags & COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS)) {
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index 9f7bb729fc..46e97b04eb 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -40,6 +40,7 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
 	setup_git_directory();
 	c = lookup_commit(the_repository, commit_oid);
 	filter = get_or_compute_bloom_filter(the_repository, c, 1,
+					     &settings,
 					     NULL);
 	print_bloom_filter(filter);
 }
-- 
2.27.0.2918.gc99a27ff8f



* [PATCH v4 09/14] bloom/diff: properly short-circuit on max_changes
  2020-09-03 22:45 ` [PATCH v4 " Taylor Blau
                     ` (7 preceding siblings ...)
  2020-09-03 22:46   ` [PATCH v4 08/14] bloom: use provided 'struct bloom_filter_settings' Taylor Blau
@ 2020-09-03 22:46   ` Taylor Blau
  2020-09-03 22:46   ` [PATCH v4 10/14] commit-graph.c: sort index into commits list Taylor Blau
                     ` (5 subsequent siblings)
  14 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-09-03 22:46 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

From: Derrick Stolee <dstolee@microsoft.com>

Commit e3696980 (diff: halt tree-diff early after max_changes,
2020-03-30) intended to create a mechanism to short-circuit a diff
calculation after a certain number of paths were modified. By
incrementing a "num_changes" counter throughout the recursive
ll_diff_tree_paths(), this was supposed to match the number of changes
that would be written into the changed-path Bloom filters.
Unfortunately, this was not implemented correctly and instead misses
simple cases like file modifications. This then does not stop very
large changed-path filters from being written (unless they add or remove
many files).

To start, change the implementation in ll_diff_tree_paths() to instead
use the global diff_queued_diff struct's 'nr' member as the count. This
simplifies the logic and avoids making more mistakes in the complicated
diff code.

This has a drawback: the diff_queued_diff struct only lists the paths
corresponding to blob changes, not their leading directories. Thus,
get_or_compute_bloom_filter() needs an additional check to see if the
hashmap with the leading directories becomes too large.
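
As an illustration (not part of the patch), the difference between the
number of queued file changes and the number of paths that end up in the
filter can be sketched as below. The 'path_set' type and the function
names are hypothetical; a linear scan stands in for the hashmap used by
the real code.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the hashmap of paths. */
struct path_set {
	char **items;
	size_t nr, alloc;
};

static int path_set_contains(const struct path_set *set,
			     const char *path, size_t len)
{
	size_t i;

	for (i = 0; i < set->nr; i++)
		if (strlen(set->items[i]) == len &&
		    !memcmp(set->items[i], path, len))
			return 1;
	return 0;
}

static void path_set_add(struct path_set *set, const char *path, size_t len)
{
	char *copy;

	if (path_set_contains(set, path, len))
		return;
	if (set->nr == set->alloc) {
		set->alloc = set->alloc ? 2 * set->alloc : 16;
		set->items = realloc(set->items,
				     set->alloc * sizeof(*set->items));
		if (!set->items)
			abort();
	}
	copy = malloc(len + 1);
	if (!copy)
		abort();
	memcpy(copy, path, len);
	copy[len] = '\0';
	set->items[set->nr++] = copy;
}

/* Count each changed file plus every leading directory, once each. */
static size_t count_paths_with_dirs(const char **files, size_t nr)
{
	struct path_set set = { NULL, 0, 0 };
	size_t i, total;

	for (i = 0; i < nr; i++) {
		const char *path = files[i];
		const char *slash = path;

		/* "d/e/file1.txt" contributes "d" and "d/e" ... */
		while ((slash = strchr(slash, '/'))) {
			path_set_add(&set, path, (size_t)(slash - path));
			slash++;
		}
		/* ... and the full path itself. */
		path_set_add(&set, path, strlen(path));
	}

	total = set.nr;
	for (i = 0; i < set.nr; i++)
		free(set.items[i]);
	free(set.items);
	return total;
}
```

For the first commit in the updated test (7 files under the directories
d, d/e, mode, and foo), the file count alone is 7, but the count
including leading directories is 11, which is why a limit of 10 is
exceeded.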

One reason why this was not caught by test cases was that the test in
t4216-log-bloom.sh that was supposed to check this "too many changes"
condition only checked this on the initial commit of a repository. The
old logic counted these values correctly. Update this test in a few
ways:

1. Use GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS to reduce the limit,
   allowing smaller commits to engage with this logic.

2. Create several interesting cases of edits, adds, removes, and mode
   changes (in the second commit). By testing both sides of the
   inequality with the *_MAX_CHANGED_PATHS variable, we can see that
   the count is exactly correct, so none of these changes are missed
   or over-counted.

3. Use the trace2 data value filter_found_large to verify that these
   commits are on the correct side of the limit.

Another way to verify the behavior is correct is through performance
tests. By testing on my local copies of the Git repository and the Linux
kernel repository, I could measure the effect of these short-circuits
when computing a fresh commit-graph file with changed-path Bloom filters
using the command

  GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=N time \
    git commit-graph write --reachable --changed-paths

and reporting the wall time and resulting commit-graph size.

For Git, the results are

|        |      N=1       |       N=10     |      N=512     |
|--------|----------------|----------------|----------------|
| HEAD~1 | 10.90s  9.18MB | 11.11s  9.34MB | 11.31s  9.35MB |
| HEAD   |  9.21s  8.62MB | 11.11s  9.29MB | 11.29s  9.34MB |

For Linux, the results are

|        |       N=1      |     N=20      |     N=512     |
|--------|----------------|---------------|---------------|
| HEAD~1 | 61.28s  64.3MB | 76.9s  72.6MB | 77.6s  72.6MB |
| HEAD   | 49.44s  56.3MB | 68.7s  65.9MB | 69.2s  65.9MB |

Naturally, the improvement shrinks as the limit grows, since fewer
commits trigger the short-circuit.

Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 bloom.c              |  6 +++-
 diff.h               |  2 --
 t/t4216-log-bloom.sh | 85 +++++++++++++++++++++++++++++++++++++++-----
 tree-diff.c          |  5 +--
 4 files changed, 82 insertions(+), 16 deletions(-)

diff --git a/bloom.c b/bloom.c
index 0cf1962dc5..ed54e96e57 100644
--- a/bloom.c
+++ b/bloom.c
@@ -222,7 +222,7 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 		diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
 	diffcore_std(&diffopt);
 
-	if (diffopt.num_changes <= settings->max_changed_paths) {
+	if (diff_queued_diff.nr <= settings->max_changed_paths) {
 		struct hashmap pathmap;
 		struct pathmap_hash_entry *e;
 		struct hashmap_iter iter;
@@ -259,6 +259,9 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 			diff_free_filepair(diff_queued_diff.queue[i]);
 		}
 
+		if (hashmap_get_size(&pathmap) > settings->max_changed_paths)
+			goto cleanup;
+
 		filter->len = (hashmap_get_size(&pathmap) * settings->bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
 		filter->data = xcalloc(filter->len, sizeof(unsigned char));
 
@@ -268,6 +271,7 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 			add_key_to_filter(&key, filter, settings);
 		}
 
+	cleanup:
 		hashmap_free_entries(&pathmap, struct pathmap_hash_entry, entry);
 	} else {
 		for (i = 0; i < diff_queued_diff.nr; i++)
diff --git a/diff.h b/diff.h
index e0c0af6286..1d32b71885 100644
--- a/diff.h
+++ b/diff.h
@@ -287,8 +287,6 @@ struct diff_options {
 
 	/* If non-zero, then stop computing after this many changes. */
 	int max_changes;
-	/* For internal use only. */
-	int num_changes;
 
 	int ita_invisible_in_index;
 /* white-space error highlighting */
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 47ddf2641f..e8788749bf 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -182,20 +182,87 @@ test_expect_success 'persist filter settings' '
 '
 
 test_expect_success 'correctly report changes over limit' '
-	git init 513changes &&
+	git init limits &&
 	(
-		cd 513changes &&
-		for i in $(test_seq 1 513)
+		cd limits &&
+		mkdir d &&
+		mkdir d/e &&
+
+		for i in $(test_seq 1 2)
 		do
-			echo $i >file$i.txt || return 1
+			printf $i >d/file$i.txt &&
+			printf $i >d/e/file$i.txt || return 1
 		done &&
-		git add . &&
+
+		mkdir mode &&
+		printf bash >mode/script.sh &&
+
+		mkdir foo &&
+		touch foo/bar &&
+		touch foo.txt &&
+
+		git add d foo foo.txt mode &&
 		git commit -m "files" &&
-		git commit-graph write --reachable --changed-paths &&
-		for i in $(test_seq 1 513)
+
+		# Commit has 7 file and 4 directory adds
+		GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=10 \
+			GIT_TRACE2_EVENT="$(pwd)/trace" \
+			git commit-graph write --reachable --changed-paths &&
+		grep "\"max_changed_paths\":10" trace &&
+		grep "\"filter_found_large\":1" trace &&
+
+		for path in $(git ls-tree -r --name-only HEAD)
 		do
-			git -c core.commitGraph=false log -- file$i.txt >expect &&
-			git log -- file$i.txt >actual &&
+			git -c commitGraph.readChangedPaths=false log \
+				-- $path >expect &&
+			git log -- $path >actual &&
+			test_cmp expect actual || return 1
+		done &&
+
+		# Make a variety of path changes
+		printf new1 >d/e/file1.txt &&
+		printf new2 >d/file2.txt &&
+		rm d/e/file2.txt &&
+		rm -r foo &&
+		printf text >foo &&
+		mkdir f &&
+		printf new1 >f/file1.txt &&
+
+		# including a mode-only change (counts as modified)
+		git update-index --chmod=+x mode/script.sh &&
+
+		git add foo d f &&
+		git commit -m "complicated" &&
+
+		# start from scratch and rebuild
+		rm -f .git/objects/info/commit-graph &&
+		GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=10 \
+			GIT_TRACE2_EVENT="$(pwd)/trace-edit" \
+			git commit-graph write --reachable --changed-paths &&
+		grep "\"max_changed_paths\":10" trace-edit &&
+		grep "\"filter_found_large\":2" trace-edit &&
+
+		for path in $(git ls-tree -r --name-only HEAD)
+		do
+			git -c commitGraph.readChangedPaths=false log \
+				-- $path >expect &&
+			git log -- $path >actual &&
+			test_cmp expect actual || return 1
+		done &&
+
+		# start from scratch and rebuild
+		rm -f .git/objects/info/commit-graph &&
+		GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=11 \
+			GIT_TRACE2_EVENT="$(pwd)/trace-update" \
+			git commit-graph write --reachable --changed-paths &&
+		grep "\"max_changed_paths\":11" trace-update &&
+		grep "\"filter_found_large\":0" trace-update &&
+
+		for path in $(git ls-tree -r --name-only HEAD)
+		do
+			git -c commitGraph.readChangedPaths=false log \
+				-- $path >expect &&
+			git log -- $path >actual &&
 			test_cmp expect actual || return 1
 		done
 	)
diff --git a/tree-diff.c b/tree-diff.c
index 6ebad1a46f..7cebbb327e 100644
--- a/tree-diff.c
+++ b/tree-diff.c
@@ -434,7 +434,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
 		if (diff_can_quit_early(opt))
 			break;
 
-		if (opt->max_changes && opt->num_changes > opt->max_changes)
+		if (opt->max_changes && diff_queued_diff.nr > opt->max_changes)
 			break;
 
 		if (opt->pathspec.nr) {
@@ -521,7 +521,6 @@ static struct combine_diff_path *ll_diff_tree_paths(
 
 			/* t↓ */
 			update_tree_entry(&t);
-			opt->num_changes++;
 		}
 
 		/* t > p[imin] */
@@ -539,7 +538,6 @@ static struct combine_diff_path *ll_diff_tree_paths(
 		skip_emit_tp:
 			/* ∀ pi=p[imin]  pi↓ */
 			update_tp_entries(tp, nparent);
-			opt->num_changes++;
 		}
 	}
 
@@ -557,7 +555,6 @@ struct combine_diff_path *diff_tree_paths(
 	const struct object_id **parents_oid, int nparent,
 	struct strbuf *base, struct diff_options *opt)
 {
-	opt->num_changes = 0;
 	p = ll_diff_tree_paths(p, oid, parents_oid, nparent, base, opt);
 
 	/*
-- 
2.27.0.2918.gc99a27ff8f



* [PATCH v4 10/14] commit-graph.c: sort index into commits list
  2020-09-03 22:45 ` [PATCH v4 " Taylor Blau
                     ` (8 preceding siblings ...)
  2020-09-03 22:46   ` [PATCH v4 09/14] bloom/diff: properly short-circuit on max_changes Taylor Blau
@ 2020-09-03 22:46   ` Taylor Blau
  2020-09-03 22:46   ` [PATCH v4 11/14] csum-file.h: introduce 'hashwrite_be64()' Taylor Blau
                     ` (4 subsequent siblings)
  14 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-09-03 22:46 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

For locality, 'compute_bloom_filters()' sorts the commits for which it
wants to compute Bloom filters in a preferred order (cf., 3d11275505
(commit-graph: examine commits by generation number, 2020-03-30) for
details).

A future patch will want to recover the new graph position of each
commit. Since the 'packed_commit_list' already stores a double-pointer,
avoid a 'COPY_ARRAY' and instead keep track of an index into the
original list. (Use an integer index instead of a memory address, since
the latter would involve a needlessly confusing triple-pointer).

Alter the two sorting routines 'commit_pos_cmp' and 'commit_gen_cmp' to
take into account the packed_commit_list they are sorting with respect
to. Since 'compute_bloom_filters()' is the only caller for each of those
comparison functions, no other call-sites need updating.
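
As an illustration (not part of the patch), sorting an array of indices
rather than the elements themselves can be sketched as below. Standard
qsort() has no context argument, so this sketch passes the keys through a
file-scope pointer, whereas the patch passes the list explicitly via
'QSORT_S'. All names here are hypothetical.

```c
#include <stdlib.h>

/* Hypothetical context: the keys that the index array is sorted by. */
static const unsigned *sort_keys;

static int index_cmp(const void *va, const void *vb)
{
	unsigned a = sort_keys[*(const int *)va];
	unsigned b = sort_keys[*(const int *)vb];

	/* Compare without subtraction to avoid wrap-around. */
	return (a > b) - (a < b);
}

/*
 * Fill 'order' with 0..nr-1, sorted so that keys[order[i]] is ascending.
 * 'keys' is left untouched, so each order[i] is still a position in the
 * original list.
 */
static void sort_index_by_key(int *order, const unsigned *keys, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++)
		order[i] = (int)i;
	sort_keys = keys;
	qsort(order, nr, sizeof(*order), index_cmp);
	sort_keys = NULL;
}
```

Because the original array is never reordered, the position of each
element in it can be recovered directly from the sorted index.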

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 commit-graph.c | 43 ++++++++++++++++++++++++-------------------
 1 file changed, 24 insertions(+), 19 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 7ba9ae26e1..35535f4192 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -79,10 +79,18 @@ static void set_commit_pos(struct repository *r, const struct object_id *oid)
 	*commit_pos_at(&commit_pos, commit) = max_pos++;
 }
 
-static int commit_pos_cmp(const void *va, const void *vb)
+struct packed_commit_list {
+	struct commit **list;
+	int nr;
+	int alloc;
+};
+
+static int commit_pos_cmp(const void *va, const void *vb, void *ctx)
 {
-	const struct commit *a = *(const struct commit **)va;
-	const struct commit *b = *(const struct commit **)vb;
+	struct packed_commit_list *commits = ctx;
+
+	const struct commit *a = commits->list[*(int *)va];
+	const struct commit *b = commits->list[*(int *)vb];
 	return commit_pos_at(&commit_pos, a) -
 	       commit_pos_at(&commit_pos, b);
 }
@@ -139,10 +147,12 @@ static struct commit_graph_data *commit_graph_data_at(const struct commit *c)
 	return data;
 }
 
-static int commit_gen_cmp(const void *va, const void *vb)
+static int commit_gen_cmp(const void *va, const void *vb, void *ctx)
 {
-	const struct commit *a = *(const struct commit **)va;
-	const struct commit *b = *(const struct commit **)vb;
+	struct packed_commit_list *commits = ctx;
+
+	const struct commit *a = commits->list[*(int *)va];
+	const struct commit *b = commits->list[*(int *)vb];
 
 	uint32_t generation_a = commit_graph_generation(a);
 	uint32_t generation_b = commit_graph_generation(b);
@@ -929,11 +939,6 @@ struct tree *get_commit_tree_in_graph(struct repository *r, const struct commit
 	return get_commit_tree_in_graph_one(r, r->objects->commit_graph, c);
 }
 
-struct packed_commit_list {
-	struct commit **list;
-	int nr;
-	int alloc;
-};
 
 struct packed_oid_list {
 	struct object_id *list;
@@ -1415,7 +1420,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 {
 	int i;
 	struct progress *progress = NULL;
-	struct commit **sorted_commits;
+	int *sorted_commits;
 
 	init_bloom_filters();
 
@@ -1425,16 +1430,16 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 			ctx->commits.nr);
 
 	ALLOC_ARRAY(sorted_commits, ctx->commits.nr);
-	COPY_ARRAY(sorted_commits, ctx->commits.list, ctx->commits.nr);
-
-	if (ctx->order_by_pack)
-		QSORT(sorted_commits, ctx->commits.nr, commit_pos_cmp);
-	else
-		QSORT(sorted_commits, ctx->commits.nr, commit_gen_cmp);
+	for (i = 0; i < ctx->commits.nr; i++)
+		sorted_commits[i] = i;
+	QSORT_S(sorted_commits, ctx->commits.nr,
+		ctx->order_by_pack ? commit_pos_cmp : commit_gen_cmp,
+		&ctx->commits);
 
 	for (i = 0; i < ctx->commits.nr; i++) {
 		int computed = 0;
-		struct commit *c = sorted_commits[i];
+		int pos = sorted_commits[i];
+		struct commit *c = ctx->commits.list[pos];
 		struct bloom_filter *filter = get_or_compute_bloom_filter(
 			ctx->r,
 			c,
-- 
2.27.0.2918.gc99a27ff8f



* [PATCH v4 11/14] csum-file.h: introduce 'hashwrite_be64()'
  2020-09-03 22:45 ` [PATCH v4 " Taylor Blau
                     ` (9 preceding siblings ...)
  2020-09-03 22:46   ` [PATCH v4 10/14] commit-graph.c: sort index into commits list Taylor Blau
@ 2020-09-03 22:46   ` Taylor Blau
  2020-09-04 20:18     ` René Scharfe
  2020-09-03 22:46   ` [PATCH v4 12/14] commit-graph: add large-filters bitmap chunk Taylor Blau
                     ` (3 subsequent siblings)
  14 siblings, 1 reply; 117+ messages in thread
From: Taylor Blau @ 2020-09-03 22:46 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

A small handful of writers who wish to encode 64-bit values in network
order have worked around the lack of such a helper by calling the 32-bit
variant twice.

The subsequent commit will add another caller who wants to write a
64-bit value. To ease their (and the existing callers') pain, introduce
a helper to do just that, and convert existing call-sites.
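
As an illustration (not part of the patch), the same split into two
32-bit halves can be sketched against a plain byte buffer instead of a
'struct hashfile'. The function names are hypothetical.

```c
#include <stdint.h>

/* Store a 32-bit value in network (big-endian) byte order. */
static void put_be32_sketch(unsigned char *out, uint32_t v)
{
	out[0] = (unsigned char)(v >> 24);
	out[1] = (unsigned char)(v >> 16);
	out[2] = (unsigned char)(v >> 8);
	out[3] = (unsigned char)v;
}

/* Store a 64-bit value as its high 32 bits followed by its low 32 bits. */
static void put_be64_sketch(unsigned char *out, uint64_t v)
{
	put_be32_sketch(out, (uint32_t)(v >> 32));
	put_be32_sketch(out + 4, (uint32_t)(v & 0xffffffffUL));
}
```

Writing the high half first is what makes the eight bytes come out in
network order.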

Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 commit-graph.c | 8 ++------
 csum-file.h    | 6 ++++++
 midx.c         | 3 +--
 3 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 35535f4192..01d791343a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1791,12 +1791,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 
 	chunk_offset = 8 + (num_chunks + 1) * GRAPH_CHUNKLOOKUP_WIDTH;
 	for (i = 0; i <= num_chunks; i++) {
-		uint32_t chunk_write[3];
-
-		chunk_write[0] = htonl(chunks[i].id);
-		chunk_write[1] = htonl(chunk_offset >> 32);
-		chunk_write[2] = htonl(chunk_offset & 0xffffffff);
-		hashwrite(f, chunk_write, 12);
+		hashwrite_be32(f, chunks[i].id);
+		hashwrite_be64(f, chunk_offset);
 
 		chunk_offset += chunks[i].size;
 	}
diff --git a/csum-file.h b/csum-file.h
index f9cbd317fb..b026ec7766 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -62,4 +62,10 @@ static inline void hashwrite_be32(struct hashfile *f, uint32_t data)
 	hashwrite(f, &data, sizeof(data));
 }
 
+static inline void hashwrite_be64(struct hashfile *f, uint64_t data)
+{
+	hashwrite_be32(f, data >> 32);
+	hashwrite_be32(f, data & 0xffffffffUL);
+}
+
 #endif
diff --git a/midx.c b/midx.c
index e9b2e1253a..32cc5fdc22 100644
--- a/midx.c
+++ b/midx.c
@@ -789,8 +789,7 @@ static size_t write_midx_large_offsets(struct hashfile *f, uint32_t nr_large_off
 		if (!(offset >> 31))
 			continue;
 
-		hashwrite_be32(f, offset >> 32);
-		hashwrite_be32(f, offset & 0xffffffffUL);
+		hashwrite_be64(f, offset);
 		written += 2 * sizeof(uint32_t);
 
 		nr_large_offset--;
-- 
2.27.0.2918.gc99a27ff8f



* [PATCH v4 12/14] commit-graph: add large-filters bitmap chunk
  2020-09-03 22:45 ` [PATCH v4 " Taylor Blau
                     ` (10 preceding siblings ...)
  2020-09-03 22:46   ` [PATCH v4 11/14] csum-file.h: introduce 'hashwrite_be64()' Taylor Blau
@ 2020-09-03 22:46   ` Taylor Blau
  2020-09-03 22:46   ` [PATCH v4 13/14] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
                     ` (2 subsequent siblings)
  14 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-09-03 22:46 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

When a commit has more than a certain number of changed paths (commonly
512), the commit-graph machinery represents it as a zero-length filter.
In addition to these too-large filters, the commit-graph machinery also
represents commits with no filter and commits with no changed paths in
the same way.

When writing a commit-graph that aggregates several incremental
commit-graph layers (eg., with '--split=replace'), the commit-graph
machinery first computes all of the Bloom filters that it wants to write
but does not already know about from existing graph layers. Because we
overload the zero-length filter in the above fashion, this leads to
recomputing large filters over and over again.

This is already undesirable, since it means that we are wasting
considerable effort to discover that a commit has too many changed
paths, only to throw that effort away (and then repeat the process the
next time a roll-up is performed).

In a subsequent patch, we will add a '--max-new-filters=<n>' option,
which specifies an upper-bound on the number of new filters we are
willing to compute from scratch. Suppose that there are 'N' too-large
filters, and we specify '--max-new-filters=M'. If 'N >= M', it is
unlikely that any filters will be generated, since we'll spend most of
our effort on filters that we ultimately throw away. If 'N < M', filters
will trickle in over time, but only at most 'M - N' per-write.

To address this, add a new chunk which encodes a bitmap where the ith
bit is on iff the ith commit has zero or more than 512 changed paths.
Likewise store the maximum number of changed paths we are willing to
store in order to prepare for eventually making this value more easily
customizable. When computing Bloom filters, first consult the relevant
bitmap (in the case that we are rolling up existing layers) to see if
computing the Bloom filter from scratch would be a waste of time.

This patch implements a new chunk instead of extending the existing BIDX
and BDAT chunks because modifying these chunks would confuse old
clients. (Eg., setting the most-significant bit in the BIDX chunk would
confuse old clients and require a version bump).

To allow using the existing bitmap code with 64-bit words, we write the
data in network byte order from the 64-bit words. This means we also
need to read the array from the commit-graph file by translating each
word from network byte order using get_be64() when loading the commit
graph. Initialize this bitmap lazily to avoid paying a linear-time cost
upon each commit-graph load even if we do not need the bitmaps
themselves.
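
As an illustration (not part of the patch), a round trip of a bitmap of
64-bit words through network byte order can be sketched as below. The
function names are hypothetical, and the bit numbering (bit 'pos' lives
in word 'pos / 64' at offset 'pos % 64') is an assumption made for this
sketch.

```c
#include <stddef.h>
#include <stdint.h>

static void bitmap_set_sketch(uint64_t *words, size_t pos)
{
	words[pos / 64] |= (uint64_t)1 << (pos % 64);
}

static int bitmap_get_sketch(const uint64_t *words, size_t pos)
{
	return (int)((words[pos / 64] >> (pos % 64)) & 1);
}

/* Store one word in network (big-endian) byte order. */
static void store_be64_sketch(unsigned char *out, uint64_t v)
{
	int i;

	for (i = 0; i < 8; i++)
		out[i] = (unsigned char)(v >> (56 - 8 * i));
}

/* Load one word from network (big-endian) byte order. */
static uint64_t load_be64_sketch(const unsigned char *in)
{
	uint64_t v = 0;
	int i;

	for (i = 0; i < 8; i++)
		v = (v << 8) | in[i];
	return v;
}

/* Write 'nr' words to 'out', which must hold 8 * nr bytes. */
static void bitmap_serialize_sketch(unsigned char *out,
				    const uint64_t *words, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++)
		store_be64_sketch(out + 8 * i, words[i]);
}

/* Translate 'nr' on-disk words back into host-order words. */
static void bitmap_parse_sketch(uint64_t *words,
				const unsigned char *in, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++)
		words[i] = load_be64_sketch(in + 8 * i);
}
```

The parse step touches every word, which is the linear-time cost that
lazy initialization avoids when the bitmap is never consulted.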

By avoiding the need to move to new versions of the BDAT and BIDX chunk,
we can give ourselves more time to consider whether or not other
modifications to these chunks are worthwhile without holding up this
change.

Another approach would be to introduce a new BIDX chunk (say, one
identified by 'BID2') which is identical to the existing BIDX chunk,
except the most-significant bit of each offset is interpreted as "this
filter is too big" iff looking at a BID2 chunk. This avoids having to
write a bitmap, but forces older clients to rewrite their commit-graphs
(as well as reduces the theoretical largest Bloom filters we could
write, and forces us to maintain the code necessary to translate BIDX
chunks to BID2 ones). Separately from this patch, I implemented this
alternate approach and did not find it to be advantageous.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 .../technical/commit-graph-format.txt         |  13 +++
 bloom.h                                       |   2 +-
 commit-graph.c                                | 100 +++++++++++++++---
 commit-graph.h                                |  10 ++
 t/t4216-log-bloom.sh                          |  25 ++++-
 5 files changed, 135 insertions(+), 15 deletions(-)

diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index 6ddbceba15..db212b8cc5 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -128,6 +128,19 @@ CHUNK DATA:
       of length zero.
     * The BDAT chunk is present if and only if BIDX is present.
 
+  Large Bloom Filters (ID: {'B', 'F', 'X', 'L'}) [Optional]
+    * It starts with a 32-bit unsigned integer specifying the maximum number of
+      changed-paths that can be stored in a single Bloom filter.
+    * It then contains a list of 64-bit words in network order (the length of
+      this list is determined by the width of the chunk) which is a bitmap. The
+      'i'th bit is set exactly when the 'i'th commit in the graph has a
+      changed-path Bloom filter with zero entries (either because the commit is
+      empty, or because it contains more entries than is allowed per filter by
+      the layer that contains it).
+    * The BFXL chunk is present only when the BIDX and BDAT chunks are
+      also present.
+
+
   Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional]
       This list of H-byte hashes describe a set of B commit-graph files that
       form a commit-graph chain. The graph position for the ith commit in this
diff --git a/bloom.h b/bloom.h
index 3f19e3fca4..464d9b57de 100644
--- a/bloom.h
+++ b/bloom.h
@@ -33,7 +33,7 @@ struct bloom_filter_settings {
 	 * The maximum number of changed paths per commit
 	 * before declaring a Bloom filter to be too-large.
 	 *
-	 * Not written to the commit-graph file.
+	 * Written to the 'BFXL' chunk (instead of 'BDAT').
 	 */
 	uint32_t max_changed_paths;
 };
diff --git a/commit-graph.c b/commit-graph.c
index 01d791343a..68ffa6ec35 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -41,8 +41,9 @@ void git_test_write_commit_graph_or_die(void)
 #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
 #define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
 #define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
+#define GRAPH_CHUNKID_BLOOMLARGE 0x4246584c /* "BFXL" */
 #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
-#define MAX_NUM_CHUNKS 7
+#define MAX_NUM_CHUNKS 8
 
 #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
 
@@ -436,6 +437,16 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 				graph->bloom_filter_settings->bits_per_entry = get_be32(data + chunk_offset + 8);
 			}
 			break;
+
+		case GRAPH_CHUNKID_BLOOMLARGE:
+			if (graph->chunk_bloom_large_filters)
+				chunk_repeated = 1;
+			else if (r->settings.commit_graph_read_changed_paths) {
+				graph->chunk_bloom_large_filters = data + chunk_offset + sizeof(uint32_t);
+				graph->bloom_large_alloc = get_be64(chunk_lookup + 4) - chunk_offset - sizeof(uint32_t);
+				graph->bloom_filter_settings->max_changed_paths = get_be32(data + chunk_offset);
+			}
+			break;
 		}
 
 		if (chunk_repeated) {
@@ -450,6 +461,8 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 		/* We need both the bloom chunks to exist together. Else ignore the data */
 		graph->chunk_bloom_indexes = NULL;
 		graph->chunk_bloom_data = NULL;
+		graph->chunk_bloom_large_filters = NULL;
+		graph->bloom_large_alloc = 0;
 		FREE_AND_NULL(graph->bloom_filter_settings);
 	}
 
@@ -939,6 +952,32 @@ struct tree *get_commit_tree_in_graph(struct repository *r, const struct commit
 	return get_commit_tree_in_graph_one(r, r->objects->commit_graph, c);
 }
 
+int get_bloom_filter_large_in_graph(struct commit_graph *g,
+				    const struct commit *c)
+{
+	uint32_t graph_pos;
+	if (!find_commit_in_graph(c, g, &graph_pos))
+		return 0;
+
+	while (g && graph_pos < g->num_commits_in_base)
+		g = g->base_graph;
+
+	if (!g)
+		return 0;
+
+	if (!g->bloom_large && g->bloom_large_alloc) {
+		size_t i;
+		g->bloom_large = bitmap_word_alloc(g->bloom_large_alloc);
+
+		for (i = 0; i < g->bloom_large->word_alloc; i++)
+			g->bloom_large->words[i] = get_be64(
+				g->chunk_bloom_large_filters + i * sizeof(eword_t));
+	}
+
+	if (!g->bloom_large)
+		return 0;
+	return bitmap_get(g->bloom_large, graph_pos - g->num_commits_in_base);
+}
 
 struct packed_oid_list {
 	struct object_id *list;
@@ -977,8 +1016,10 @@ struct write_commit_graph_context {
 	size_t total_bloom_filter_data_size;
 	const struct bloom_filter_settings *bloom_settings;
 
+	int count_bloom_filter_known_large;
 	int count_bloom_filter_found_large;
 	int count_bloom_filter_computed;
+	struct bitmap *bloom_large;
 };
 
 static int write_graph_chunk_fanout(struct hashfile *f,
@@ -1242,6 +1283,23 @@ static int write_graph_chunk_bloom_data(struct hashfile *f,
 	return 0;
 }
 
+static int write_graph_chunk_bloom_large(struct hashfile *f,
+					 struct write_commit_graph_context *ctx)
+{
+	size_t i, alloc = ctx->commits.nr / BITS_IN_EWORD;
+	if (ctx->commits.nr % BITS_IN_EWORD)
+		alloc++;
+	if (alloc > ctx->bloom_large->word_alloc)
+		BUG("write_graph_chunk_bloom_large: bitmap not large enough");
+
+	trace2_region_enter("commit-graph", "bloom_large", ctx->r);
+	hashwrite_be32(f, ctx->bloom_settings->max_changed_paths);
+	for (i = 0; i < ctx->bloom_large->word_alloc; i++)
+		hashwrite_be64(f, ctx->bloom_large->words[i]);
+	trace2_region_leave("commit-graph", "bloom_large", ctx->r);
+	return 0;
+}
+
 static int oid_compare(const void *_a, const void *_b)
 {
 	const struct object_id *a = (const struct object_id *)_a;
@@ -1405,6 +1463,8 @@ static void trace2_bloom_filter_write_statistics(struct write_commit_graph_conte
 	struct json_writer jw = JSON_WRITER_INIT;
 
 	jw_object_begin(&jw, 0);
+	jw_object_intmax(&jw, "filter_known_large",
+			 ctx->count_bloom_filter_known_large);
 	jw_object_intmax(&jw, "filter_found_large",
 			 ctx->count_bloom_filter_found_large);
 	jw_object_intmax(&jw, "filter_computed",
@@ -1423,6 +1483,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 	int *sorted_commits;
 
 	init_bloom_filters();
+	ctx->bloom_large = bitmap_word_alloc(ctx->commits.nr / BITS_IN_EWORD + 1);
 
 	if (ctx->report_progress)
 		progress = start_delayed_progress(
@@ -1437,21 +1498,28 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 		&ctx->commits);
 
 	for (i = 0; i < ctx->commits.nr; i++) {
-		int computed = 0;
 		int pos = sorted_commits[i];
 		struct commit *c = ctx->commits.list[pos];
-		struct bloom_filter *filter = get_or_compute_bloom_filter(
-			ctx->r,
-			c,
-			1,
-			ctx->bloom_settings,
-			&computed);
-		if (computed) {
-			ctx->count_bloom_filter_computed++;
-			if (filter && !filter->len)
-				ctx->count_bloom_filter_found_large++;
+		if (get_bloom_filter_large_in_graph(ctx->r->objects->commit_graph, c)) {
+			bitmap_set(ctx->bloom_large, pos);
+			ctx->count_bloom_filter_known_large++;
+		} else {
+			int computed = 0;
+			struct bloom_filter *filter = get_or_compute_bloom_filter(
+				ctx->r,
+				c,
+				1,
+				ctx->bloom_settings,
+				&computed);
+			if (computed) {
+				ctx->count_bloom_filter_computed++;
+				if (filter && !filter->len) {
+					bitmap_set(ctx->bloom_large, pos);
+					ctx->count_bloom_filter_found_large++;
+				}
+			}
+			ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
 		}
-		ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
 		display_progress(progress, i + 1);
 	}
 
@@ -1771,6 +1839,11 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 					  + ctx->total_bloom_filter_data_size;
 		chunks[num_chunks].write_fn = write_graph_chunk_bloom_data;
 		num_chunks++;
+		chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMLARGE;
+		chunks[num_chunks].size = sizeof(eword_t) * ctx->bloom_large->word_alloc
+					+ sizeof(uint32_t);
+		chunks[num_chunks].write_fn = write_graph_chunk_bloom_large;
+		num_chunks++;
 	}
 	if (ctx->num_commit_graphs_after > 1) {
 		chunks[num_chunks].id = GRAPH_CHUNKID_BASE;
@@ -2510,6 +2583,7 @@ void free_commit_graph(struct commit_graph *g)
 	}
 	free(g->filename);
 	free(g->bloom_filter_settings);
+	bitmap_free(g->bloom_large);
 	free(g);
 }
 
diff --git a/commit-graph.h b/commit-graph.h
index d9acb22bac..9afb1477d5 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -4,6 +4,7 @@
 #include "git-compat-util.h"
 #include "object-store.h"
 #include "oidset.h"
+#include "ewah/ewok.h"
 
 #define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
 #define GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE "GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE"
@@ -50,6 +51,9 @@ void load_commit_graph_info(struct repository *r, struct commit *item);
 struct tree *get_commit_tree_in_graph(struct repository *r,
 				      const struct commit *c);
 
+int get_bloom_filter_large_in_graph(struct commit_graph *g,
+				    const struct commit *c);
+
 struct commit_graph {
 	const unsigned char *data;
 	size_t data_len;
@@ -71,6 +75,10 @@ struct commit_graph {
 	const unsigned char *chunk_base_graphs;
 	const unsigned char *chunk_bloom_indexes;
 	const unsigned char *chunk_bloom_data;
+	const unsigned char *chunk_bloom_large_filters;
+
+	struct bitmap *bloom_large;
+	size_t bloom_large_alloc;
 
 	struct bloom_filter_settings *bloom_filter_settings;
 };
@@ -83,6 +91,8 @@ struct commit_graph *read_commit_graph_one(struct repository *r,
 struct commit_graph *parse_commit_graph(struct repository *r,
 					void *graph_map, size_t graph_size);
 
+void prepare_commit_graph_bloom_large(struct commit_graph *g);
+
 /*
  * Return 1 if and only if the repository has a commit-graph
  * file and generation numbers are computed in that file.
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index e8788749bf..fed4929af3 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -38,7 +38,7 @@ test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
 	EOF
 '
 graph_read_expect () {
-	NUM_CHUNKS=5
+	NUM_CHUNKS=6
 	cat >expect <<- EOF
 	header: 43475048 1 $(test_oid oid_version) $NUM_CHUNKS 0
 	num_commits: $1
@@ -267,5 +267,28 @@ test_expect_success 'correctly report changes over limit' '
 		done
 	)
 '
+test_bloom_filters_computed () {
+	commit_graph_args=$1
+	bloom_trace_prefix="{\"filter_known_large\":$2,\"filter_found_large\":$3,\"filter_computed\":$4"
+	rm -f "$TRASH_DIRECTORY/trace.event" &&
+	GIT_TRACE2_EVENT="$TRASH_DIRECTORY/trace.event" git commit-graph write $commit_graph_args &&
+	grep "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.event"
+}
+
+test_expect_success 'Bloom generation does not recompute too-large filters' '
+	(
+		cd limits &&
+
+		# start from scratch and rebuild
+		rm -f .git/objects/info/commit-graph &&
+		GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=10 \
+			git commit-graph write --reachable --changed-paths \
+			--split=replace &&
+		test_commit c1 filter &&
+
+		test_bloom_filters_computed "--reachable --changed-paths --split=replace" \
+			2 0 1
+	)
+'
 
 test_done
-- 
2.27.0.2918.gc99a27ff8f
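
For readers following along, here is a rough standalone sketch of how a
'BFXL' chunk laid out as in the format documentation above (a 32-bit
big-endian limit followed by 64-bit big-endian bitmap words) could be
decoded from a byte buffer. The helper names (read_be32, read_be64,
bfxl_max_changed_paths, bfxl_bit) are made up for illustration, and the
bit order within a word (bit i % 64, counted from the least significant
bit) is an assumption modelled on the ewah bitmap helpers. It is not the
code from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit big-endian (network order) value. */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Read a 64-bit big-endian (network order) value. */
static uint64_t read_be64(const unsigned char *p)
{
	return ((uint64_t)read_be32(p) << 32) | read_be32(p + 4);
}

/* The chunk starts with the per-filter limit on changed paths. */
static uint32_t bfxl_max_changed_paths(const unsigned char *chunk,
				       size_t chunk_len)
{
	assert(chunk && chunk_len >= 4);
	return read_be32(chunk);
}

/*
 * Return 1 if the bit for the commit at position 'pos' (relative to
 * the layer that owns the chunk) is set, 0 if it is clear or lies
 * beyond the end of the bitmap.
 */
static int bfxl_bit(const unsigned char *chunk, size_t chunk_len,
		    uint32_t pos)
{
	size_t nr_words, word = pos / 64;

	assert(chunk && chunk_len >= 4);
	nr_words = (chunk_len - 4) / 8;
	if (word >= nr_words)
		return 0;
	return (int)((read_be64(chunk + 4 + word * 8) >> (pos % 64)) & 1);
}
```

With a 12-byte chunk holding the limit 512 and a single word 0x09, this
sketch reports bits 0 and 3 as set and every other position as clear.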



* [PATCH v4 13/14] commit-graph: rename 'split_commit_graph_opts'
  2020-09-03 22:45 ` [PATCH v4 " Taylor Blau
                     ` (11 preceding siblings ...)
  2020-09-03 22:46   ` [PATCH v4 12/14] commit-graph: add large-filters bitmap chunk Taylor Blau
@ 2020-09-03 22:46   ` Taylor Blau
  2020-09-04 15:20     ` Taylor Blau
  2020-09-03 22:47   ` [PATCH v4 14/14] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
  2020-09-04 14:39   ` [PATCH v4 00/14] more miscellaneous Bloom filter improvements Derrick Stolee
  14 siblings, 1 reply; 117+ messages in thread
From: Taylor Blau @ 2020-09-03 22:46 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

In the subsequent commit, additional options will be added to the
commit-graph API which have nothing to do with splitting.

Rename the 'split_commit_graph_opts' structure to the more-generic
'commit_graph_opts' to encompass both. Likewise, rename the 'flags'
member to instead be 'split_flags' to clarify that it only has to do
with the behavior implied by '--split'.
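
As a small illustration of the renamed structure, the sketch below
mirrors its fields as they appear in this patch and shows the
NULL-tolerant lookup that callers perform on 'split_flags'. The
timestamp_t typedef, the numeric values of the enum members other than
COMMIT_GRAPH_SPLIT_REPLACE, and the helper effective_split_flags() are
assumptions for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for git's timestamp_t (assumed: an unsigned integer type). */
typedef uintmax_t timestamp_t;

/* Only REPLACE = 2 is visible in the patch; the rest are assumed. */
enum commit_graph_split_flags {
	COMMIT_GRAPH_SPLIT_UNSPECIFIED      = 0,
	COMMIT_GRAPH_SPLIT_MERGE_PROHIBITED = 1,
	COMMIT_GRAPH_SPLIT_REPLACE          = 2
};

struct commit_graph_opts {
	int size_multiple;
	int max_commits;
	timestamp_t expire_time;
	enum commit_graph_split_flags split_flags;
};

/*
 * Hypothetical helper: callers may pass NULL options, in which case
 * the split behavior is left unspecified.
 */
static enum commit_graph_split_flags
effective_split_flags(const struct commit_graph_opts *opts)
{
	return opts ? opts->split_flags : COMMIT_GRAPH_SPLIT_UNSPECIFIED;
}
```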

Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/commit-graph.c | 20 ++++++++++----------
 commit-graph.c         | 40 ++++++++++++++++++++--------------------
 commit-graph.h         |  8 ++++----
 make: *** [Makefile    |  0
 4 files changed, 34 insertions(+), 34 deletions(-)
 create mode 100644 make: *** [Makefile

diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index ba5584463f..f3243bd982 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -119,7 +119,7 @@ static int graph_verify(int argc, const char **argv)
 }
 
 extern int read_replace_refs;
-static struct split_commit_graph_opts split_opts;
+static struct commit_graph_opts write_opts;
 
 static int write_option_parse_split(const struct option *opt, const char *arg,
 				    int unset)
@@ -187,24 +187,24 @@ static int graph_write(int argc, const char **argv)
 		OPT_BOOL(0, "changed-paths", &opts.enable_changed_paths,
 			N_("enable computation for changed paths")),
 		OPT_BOOL(0, "progress", &opts.progress, N_("force progress reporting")),
-		OPT_CALLBACK_F(0, "split", &split_opts.flags, NULL,
+		OPT_CALLBACK_F(0, "split", &write_opts.split_flags, NULL,
 			N_("allow writing an incremental commit-graph file"),
 			PARSE_OPT_OPTARG | PARSE_OPT_NONEG,
 			write_option_parse_split),
-		OPT_INTEGER(0, "max-commits", &split_opts.max_commits,
+		OPT_INTEGER(0, "max-commits", &write_opts.max_commits,
 			N_("maximum number of commits in a non-base split commit-graph")),
-		OPT_INTEGER(0, "size-multiple", &split_opts.size_multiple,
+		OPT_INTEGER(0, "size-multiple", &write_opts.size_multiple,
 			N_("maximum ratio between two levels of a split commit-graph")),
-		OPT_EXPIRY_DATE(0, "expire-time", &split_opts.expire_time,
+		OPT_EXPIRY_DATE(0, "expire-time", &write_opts.expire_time,
 			N_("only expire files older than a given date-time")),
 		OPT_END(),
 	};
 
 	opts.progress = isatty(2);
 	opts.enable_changed_paths = -1;
-	split_opts.size_multiple = 2;
-	split_opts.max_commits = 0;
-	split_opts.expire_time = 0;
+	write_opts.size_multiple = 2;
+	write_opts.max_commits = 0;
+	write_opts.expire_time = 0;
 
 	trace2_cmd_mode("write");
 
@@ -232,7 +232,7 @@ static int graph_write(int argc, const char **argv)
 	odb = find_odb(the_repository, opts.obj_dir);
 
 	if (opts.reachable) {
-		if (write_commit_graph_reachable(odb, flags, &split_opts))
+		if (write_commit_graph_reachable(odb, flags, &write_opts))
 			return 1;
 		return 0;
 	}
@@ -261,7 +261,7 @@ static int graph_write(int argc, const char **argv)
 			       opts.stdin_packs ? &pack_indexes : NULL,
 			       opts.stdin_commits ? &commits : NULL,
 			       flags,
-			       &split_opts))
+			       &write_opts))
 		result = 1;
 
 cleanup:
diff --git a/commit-graph.c b/commit-graph.c
index 68ffa6ec35..33fcf01a7a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1012,7 +1012,7 @@ struct write_commit_graph_context {
 		 changed_paths:1,
 		 order_by_pack:1;
 
-	const struct split_commit_graph_opts *split_opts;
+	const struct commit_graph_opts *opts;
 	size_t total_bloom_filter_data_size;
 	const struct bloom_filter_settings *bloom_settings;
 
@@ -1353,8 +1353,8 @@ static void close_reachable(struct write_commit_graph_context *ctx)
 {
 	int i;
 	struct commit *commit;
-	enum commit_graph_split_flags flags = ctx->split_opts ?
-		ctx->split_opts->flags : COMMIT_GRAPH_SPLIT_UNSPECIFIED;
+	enum commit_graph_split_flags flags = ctx->opts ?
+		ctx->opts->split_flags : COMMIT_GRAPH_SPLIT_UNSPECIFIED;
 
 	if (ctx->report_progress)
 		ctx->progress = start_delayed_progress(
@@ -1554,7 +1554,7 @@ static int add_ref_to_set(const char *refname,
 
 int write_commit_graph_reachable(struct object_directory *odb,
 				 enum commit_graph_write_flags flags,
-				 const struct split_commit_graph_opts *split_opts)
+				 const struct commit_graph_opts *opts)
 {
 	struct oidset commits = OIDSET_INIT;
 	struct refs_cb_data data;
@@ -1571,7 +1571,7 @@ int write_commit_graph_reachable(struct object_directory *odb,
 	stop_progress(&data.progress);
 
 	result = write_commit_graph(odb, NULL, &commits,
-				    flags, split_opts);
+				    flags, opts);
 
 	oidset_clear(&commits);
 	return result;
@@ -1686,8 +1686,8 @@ static uint32_t count_distinct_commits(struct write_commit_graph_context *ctx)
 static void copy_oids_to_commits(struct write_commit_graph_context *ctx)
 {
 	uint32_t i;
-	enum commit_graph_split_flags flags = ctx->split_opts ?
-		ctx->split_opts->flags : COMMIT_GRAPH_SPLIT_UNSPECIFIED;
+	enum commit_graph_split_flags flags = ctx->opts ?
+		ctx->opts->split_flags : COMMIT_GRAPH_SPLIT_UNSPECIFIED;
 
 	ctx->num_extra_edges = 0;
 	if (ctx->report_progress)
@@ -1973,13 +1973,13 @@ static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
 	int max_commits = 0;
 	int size_mult = 2;
 
-	if (ctx->split_opts) {
-		max_commits = ctx->split_opts->max_commits;
+	if (ctx->opts) {
+		max_commits = ctx->opts->max_commits;
 
-		if (ctx->split_opts->size_multiple)
-			size_mult = ctx->split_opts->size_multiple;
+		if (ctx->opts->size_multiple)
+			size_mult = ctx->opts->size_multiple;
 
-		flags = ctx->split_opts->flags;
+		flags = ctx->opts->split_flags;
 	}
 
 	g = ctx->r->objects->commit_graph;
@@ -2157,8 +2157,8 @@ static void expire_commit_graphs(struct write_commit_graph_context *ctx)
 	size_t dirnamelen;
 	timestamp_t expire_time = time(NULL);
 
-	if (ctx->split_opts && ctx->split_opts->expire_time)
-		expire_time = ctx->split_opts->expire_time;
+	if (ctx->opts && ctx->opts->expire_time)
+		expire_time = ctx->opts->expire_time;
 	if (!ctx->split) {
 		char *chain_file_name = get_chain_filename(ctx->odb);
 		unlink(chain_file_name);
@@ -2209,7 +2209,7 @@ int write_commit_graph(struct object_directory *odb,
 		       struct string_list *pack_indexes,
 		       struct oidset *commits,
 		       enum commit_graph_write_flags flags,
-		       const struct split_commit_graph_opts *split_opts)
+		       const struct commit_graph_opts *opts)
 {
 	struct write_commit_graph_context *ctx;
 	uint32_t i, count_distinct = 0;
@@ -2226,7 +2226,7 @@ int write_commit_graph(struct object_directory *odb,
 	ctx->append = flags & COMMIT_GRAPH_WRITE_APPEND ? 1 : 0;
 	ctx->report_progress = flags & COMMIT_GRAPH_WRITE_PROGRESS ? 1 : 0;
 	ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
-	ctx->split_opts = split_opts;
+	ctx->opts = opts;
 	ctx->total_bloom_filter_data_size = 0;
 
 	bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
@@ -2274,15 +2274,15 @@ int write_commit_graph(struct object_directory *odb,
 			}
 		}
 
-		if (ctx->split_opts)
-			replace = ctx->split_opts->flags & COMMIT_GRAPH_SPLIT_REPLACE;
+		if (ctx->opts)
+			replace = ctx->opts->split_flags & COMMIT_GRAPH_SPLIT_REPLACE;
 	}
 
 	ctx->approx_nr_objects = approximate_object_count();
 	ctx->oids.alloc = ctx->approx_nr_objects / 32;
 
-	if (ctx->split && split_opts && ctx->oids.alloc > split_opts->max_commits)
-		ctx->oids.alloc = split_opts->max_commits;
+	if (ctx->split && opts && ctx->oids.alloc > opts->max_commits)
+		ctx->oids.alloc = opts->max_commits;
 
 	if (ctx->append) {
 		prepare_commit_graph_one(ctx->r, ctx->odb);
diff --git a/commit-graph.h b/commit-graph.h
index 9afb1477d5..fe798a4047 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -115,11 +115,11 @@ enum commit_graph_split_flags {
 	COMMIT_GRAPH_SPLIT_REPLACE          = 2
 };
 
-struct split_commit_graph_opts {
+struct commit_graph_opts {
 	int size_multiple;
 	int max_commits;
 	timestamp_t expire_time;
-	enum commit_graph_split_flags flags;
+	enum commit_graph_split_flags split_flags;
 };
 
 /*
@@ -130,12 +130,12 @@ struct split_commit_graph_opts {
  */
 int write_commit_graph_reachable(struct object_directory *odb,
 				 enum commit_graph_write_flags flags,
-				 const struct split_commit_graph_opts *split_opts);
+				 const struct commit_graph_opts *opts);
 int write_commit_graph(struct object_directory *odb,
 		       struct string_list *pack_indexes,
 		       struct oidset *commits,
 		       enum commit_graph_write_flags flags,
-		       const struct split_commit_graph_opts *split_opts);
+		       const struct commit_graph_opts *opts);
 
 #define COMMIT_GRAPH_VERIFY_SHALLOW	(1 << 0)
 
diff --git a/make: *** [Makefile b/make: *** [Makefile
new file mode 100644
index 0000000000..e69de29bb2
-- 
2.27.0.2918.gc99a27ff8f



* [PATCH v4 14/14] builtin/commit-graph.c: introduce '--max-new-filters=<n>'
  2020-09-03 22:45 ` [PATCH v4 " Taylor Blau
                     ` (12 preceding siblings ...)
  2020-09-03 22:46   ` [PATCH v4 13/14] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
@ 2020-09-03 22:47   ` Taylor Blau
  2020-09-04 14:39   ` [PATCH v4 00/14] more miscellaneous Bloom filter improvements Derrick Stolee
  14 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-09-03 22:47 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

Introduce a command-line flag and configuration variable to fill in the
'max_new_filters' variable introduced two patches ago.

The command-line option '--max-new-filters' takes precedence over
'commitGraph.maxNewFilters', which is the default value.
'--no-max-new-filters' can also be provided, which sets the value back
to '-1', indicating that an unlimited number of new Bloom filters may be
generated. (OPT_INTEGER only allows setting the '--no-' variant back to
'0', hence a custom callback was used instead).
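
The intended semantics can be sketched without the parse-options
machinery as follows. The function parse_max_new_filters() is a
hypothetical stand-in for the callback, not the code in this patch; it
is also a little stricter than the patch in that it rejects an empty
string and out-of-range values. The caller is assumed to have stored
the configured default in '*to' beforehand, so that a command-line
value simply overwrites it.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for the option callback. 'unset' models
 * '--no-max-new-filters'; otherwise 'arg' is the option's value.
 * Returns 0 on success, -1 if 'arg' is not a valid integer (in
 * which case '*to' is left untouched).
 */
static int parse_max_new_filters(const char *arg, int unset, int *to)
{
	char *end;
	long v;

	if (unset) {
		*to = -1;	/* no limit on new Bloom filters */
		return 0;
	}
	if (!arg || !*arg)
		return -1;

	errno = 0;
	v = strtol(arg, &end, 10);
	if (*end || errno == ERANGE || v < INT_MIN || v > INT_MAX)
		return -1;

	*to = (int)v;
	return 0;
}
```

Starting from a configured default of 5, passing "2" leaves the value
at 2, the negated form resets it to -1, and a malformed value such as
"2x" is rejected without changing the default.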

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config/commitgraph.txt |  4 +++
 Documentation/git-commit-graph.txt   |  6 ++++
 bloom.c                              | 13 +++++---
 builtin/commit-graph.c               | 39 ++++++++++++++++++++++--
 commit-graph.c                       | 27 ++++++++++++++---
 commit-graph.h                       |  4 ++-
 t/t4216-log-bloom.sh                 | 44 ++++++++++++++++++++++++++++
 7 files changed, 125 insertions(+), 12 deletions(-)

diff --git a/Documentation/config/commitgraph.txt b/Documentation/config/commitgraph.txt
index cff0797b54..4582c39fc4 100644
--- a/Documentation/config/commitgraph.txt
+++ b/Documentation/config/commitgraph.txt
@@ -1,3 +1,7 @@
+commitGraph.maxNewFilters::
+	Specifies the default value for the `--max-new-filters` option of `git
+	commit-graph write` (c.f., linkgit:git-commit-graph[1]).
+
 commitGraph.readChangedPaths::
 	If true, then git will use the changed-path Bloom filters in the
 	commit-graph file (if it exists, and they are present). Defaults to
diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index 17405c73a9..81a2e65903 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -67,6 +67,12 @@ this option is given, future commit-graph writes will automatically assume
 that this option was intended. Use `--no-changed-paths` to stop storing this
 data.
 +
+With the `--max-new-filters=<n>` option, generate at most `n` new Bloom
+filters (if `--changed-paths` is specified). If `n` is `-1`, no limit is
+enforced. Commits whose filters are not calculated are stored as a
+length zero Bloom filter, and their bit is marked in the `BFXL` chunk.
+Overrides the `commitGraph.maxNewFilters` configuration.
++
 With the `--split[=<strategy>]` option, write the commit-graph as a
 chain of multiple commit-graph files stored in
 `<dir>/info/commit-graphs`. Commit-graph layers are merged based on the
diff --git a/bloom.c b/bloom.c
index ed54e96e57..34503898ac 100644
--- a/bloom.c
+++ b/bloom.c
@@ -197,16 +197,21 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 
 	if (!filter->data) {
 		load_commit_graph_info(r, c);
-		if (commit_graph_position(c) != COMMIT_NOT_FROM_GRAPH &&
-			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
-				return filter;
+		if (commit_graph_position(c) != COMMIT_NOT_FROM_GRAPH)
+			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c);
 	}
 
-	if (filter->data)
+	if (filter->data && filter->len)
 		return filter;
 	if (!compute_if_not_present)
 		return NULL;
 
+	if (filter && !filter->len &&
+	    get_bloom_filter_large_in_graph(r->objects->commit_graph, c,
+					    settings->max_changed_paths))
+		return filter;
+
+
 	repo_diff_setup(r, &diffopt);
 	diffopt.flags.recursive = 1;
 	diffopt.detect_rename = 0;
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index f3243bd982..e7a1539b08 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -13,7 +13,8 @@ static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph verify [--object-dir <objdir>] [--shallow] [--[no-]progress]"),
 	N_("git commit-graph write [--object-dir <objdir>] [--append] "
 	   "[--split[=<strategy>]] [--reachable|--stdin-packs|--stdin-commits] "
-	   "[--changed-paths] [--[no-]progress] <split options>"),
+	   "[--changed-paths] [--[no-]max-new-filters <n>] [--[no-]progress] "
+	   "<split options>"),
 	NULL
 };
 
@@ -25,7 +26,8 @@ static const char * const builtin_commit_graph_verify_usage[] = {
 static const char * const builtin_commit_graph_write_usage[] = {
 	N_("git commit-graph write [--object-dir <objdir>] [--append] "
 	   "[--split[=<strategy>]] [--reachable|--stdin-packs|--stdin-commits] "
-	   "[--changed-paths] [--[no-]progress] <split options>"),
+	   "[--changed-paths] [--[no-]max-new-filters <n>] [--[no-]progress] "
+	   "<split options>"),
 	NULL
 };
 
@@ -162,6 +164,23 @@ static int read_one_commit(struct oidset *commits, struct progress *progress,
 	return 0;
 }
 
+static int write_option_max_new_filters(const struct option *opt,
+					const char *arg,
+					int unset)
+{
+	int *to = opt->value;
+	if (unset)
+		*to = -1;
+	else {
+		const char *s;
+		*to = strtol(arg, (char **)&s, 10);
+		if (*s)
+			return error(_("%s expects a numerical value"),
+				     optname(opt, opt->flags));
+	}
+	return 0;
+}
+
 static int graph_write(int argc, const char **argv)
 {
 	struct string_list pack_indexes = STRING_LIST_INIT_NODUP;
@@ -197,6 +216,9 @@ static int graph_write(int argc, const char **argv)
 			N_("maximum ratio between two levels of a split commit-graph")),
 		OPT_EXPIRY_DATE(0, "expire-time", &write_opts.expire_time,
 			N_("only expire files older than a given date-time")),
+		OPT_CALLBACK_F(0, "max-new-filters", &write_opts.max_new_filters,
+			NULL, N_("maximum number of changed-path Bloom filters to compute"),
+			0, write_option_max_new_filters),
 		OPT_END(),
 	};
 
@@ -205,6 +227,7 @@ static int graph_write(int argc, const char **argv)
 	write_opts.size_multiple = 2;
 	write_opts.max_commits = 0;
 	write_opts.expire_time = 0;
+	write_opts.max_new_filters = -1;
 
 	trace2_cmd_mode("write");
 
@@ -270,6 +293,16 @@ static int graph_write(int argc, const char **argv)
 	return result;
 }
 
+static int git_commit_graph_config(const char *var, const char *value, void *cb)
+{
+	if (!strcmp(var, "commitgraph.maxnewfilters")) {
+		write_opts.max_new_filters = git_config_int(var, value);
+		return 0;
+	}
+
+	return git_default_config(var, value, cb);
+}
+
 int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 {
 	static struct option builtin_commit_graph_options[] = {
@@ -283,7 +316,7 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 		usage_with_options(builtin_commit_graph_usage,
 				   builtin_commit_graph_options);
 
-	git_config(git_default_config, NULL);
+	git_config(git_commit_graph_config, &opts);
 	argc = parse_options(argc, argv, prefix,
 			     builtin_commit_graph_options,
 			     builtin_commit_graph_usage,
diff --git a/commit-graph.c b/commit-graph.c
index 33fcf01a7a..243c7253ff 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -953,7 +953,8 @@ struct tree *get_commit_tree_in_graph(struct repository *r, const struct commit
 }
 
 int get_bloom_filter_large_in_graph(struct commit_graph *g,
-				    const struct commit *c)
+				    const struct commit *c,
+				    uint32_t max_changed_paths)
 {
 	uint32_t graph_pos;
 	if (!find_commit_in_graph(c, g, &graph_pos))
@@ -976,6 +977,17 @@ int get_bloom_filter_large_in_graph(struct commit_graph *g,
 
 	if (!g->bloom_large)
 		return 0;
+	if (g->bloom_filter_settings->max_changed_paths != max_changed_paths) {
+		/*
+		 * Force all commits which are subject to a different
+		 * 'max_changed_paths' limit to be recomputed from scratch.
+		 *
+		 * Note that this could likely be improved, but is ignored since
+		 * all real-world graphs set the maximum number of changed paths
+		 * at 512.
+		 */
+		return 0;
+	}
 	return bitmap_get(g->bloom_large, graph_pos - g->num_commits_in_base);
 }
 
@@ -1481,6 +1493,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 	int i;
 	struct progress *progress = NULL;
 	int *sorted_commits;
+	int max_new_filters;
 
 	init_bloom_filters();
 	ctx->bloom_large = bitmap_word_alloc(ctx->commits.nr / BITS_IN_EWORD + 1);
@@ -1497,10 +1510,15 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 		ctx->order_by_pack ? commit_pos_cmp : commit_gen_cmp,
 		&ctx->commits);
 
+	max_new_filters = ctx->opts && ctx->opts->max_new_filters >= 0 ?
+		ctx->opts->max_new_filters : ctx->commits.nr;
+
 	for (i = 0; i < ctx->commits.nr; i++) {
 		int pos = sorted_commits[i];
 		struct commit *c = ctx->commits.list[pos];
-		if (get_bloom_filter_large_in_graph(ctx->r->objects->commit_graph, c)) {
+		if (get_bloom_filter_large_in_graph(ctx->r->objects->commit_graph,
+						    c,
+						    ctx->bloom_settings->max_changed_paths)) {
 			bitmap_set(ctx->bloom_large, pos);
 			ctx->count_bloom_filter_known_large++;
 		} else {
@@ -1508,7 +1526,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 			struct bloom_filter *filter = get_or_compute_bloom_filter(
 				ctx->r,
 				c,
-				1,
+				ctx->count_bloom_filter_computed < max_new_filters,
 				ctx->bloom_settings,
 				&computed);
 			if (computed) {
@@ -1518,7 +1536,8 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 					ctx->count_bloom_filter_found_large++;
 				}
 			}
-			ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
+			if (filter)
+				ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
 		}
 		display_progress(progress, i + 1);
 	}
diff --git a/commit-graph.h b/commit-graph.h
index fe798a4047..eac4efc7a6 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -52,7 +52,8 @@ struct tree *get_commit_tree_in_graph(struct repository *r,
 				      const struct commit *c);
 
 int get_bloom_filter_large_in_graph(struct commit_graph *g,
-				    const struct commit *c);
+				    const struct commit *c,
+				    uint32_t max_changed_paths);
 
 struct commit_graph {
 	const unsigned char *data;
@@ -120,6 +121,7 @@ struct commit_graph_opts {
 	int max_commits;
 	timestamp_t expire_time;
 	enum commit_graph_split_flags split_flags;
+	int max_new_filters;
 };
 
 /*
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index fed4929af3..571676cef2 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -291,4 +291,48 @@ test_expect_success 'Bloom generation does not recompute too-large filters' '
 	)
 '
 
+test_expect_success 'Bloom generation is limited by --max-new-filters' '
+	(
+		cd limits &&
+		test_commit c2 filter &&
+		test_commit c3 filter &&
+		test_commit c4 no-filter &&
+		test_bloom_filters_computed "--reachable --changed-paths --split=replace --max-new-filters=2" \
+			2 0 2
+	)
+'
+
+test_expect_success 'Bloom generation backfills previously-skipped filters' '
+	(
+		cd limits &&
+		test_bloom_filters_computed "--reachable --changed-paths --split=replace --max-new-filters=1" \
+			2 0 1
+	)
+'
+
+test_expect_success 'Bloom generation backfills empty commits' '
+	git init empty &&
+	test_when_finished "rm -fr empty" &&
+	(
+		cd empty &&
+		for i in $(test_seq 1 6)
+		do
+			git commit --allow-empty -m "$i"
+		done &&
+
+		# Generate Bloom filters for empty commits 1-6, two at a time.
+		test_bloom_filters_computed "--reachable --changed-paths --max-new-filters=2" \
+			0 2 2 &&
+		test_bloom_filters_computed "--reachable --changed-paths --max-new-filters=2" \
+			2 2 2 &&
+		test_bloom_filters_computed "--reachable --changed-paths --max-new-filters=2" \
+			4 2 2 &&
+
+		# Finally, make sure that once all commits have filters, that
+		# none are subsequently recomputed.
+		test_bloom_filters_computed "--reachable --changed-paths --max-new-filters=2" \
+			6 0 0
+	)
+'
+
 test_done
-- 
2.27.0.2918.gc99a27ff8f
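
To make the counts in the 'backfills empty commits' test easier to
follow, here is a toy model of repeated writes over a history made up
entirely of empty commits. It only tracks which commits have their
'BFXL' bit set and how many filters each write is allowed to compute.
All names (toy_graph, toy_counts, toy_write) are invented for this
sketch, and it deliberately ignores non-empty commits, layering and
the on-disk format.

```c
#include <assert.h>
#include <string.h>

#define TOY_MAX_COMMITS 64

/* Toy graph: one flag per commit, standing in for its "BFXL" bit. */
struct toy_graph {
	int nr_commits;
	unsigned char large_bit[TOY_MAX_COMMITS];
};

/* The three statistics that the test greps out of the trace. */
struct toy_counts {
	int known_large;
	int found_large;
	int computed;
};

/*
 * Simulate one write with '--max-new-filters=<max_new_filters>' over
 * a history where every commit is empty, so that each newly computed
 * filter has length zero. A negative limit means "no limit".
 */
static struct toy_counts toy_write(struct toy_graph *g, int max_new_filters)
{
	struct toy_counts c = { 0, 0, 0 };
	unsigned char next[TOY_MAX_COMMITS];
	int i, limit;

	assert(g->nr_commits >= 0 && g->nr_commits <= TOY_MAX_COMMITS);
	limit = max_new_filters >= 0 ? max_new_filters : g->nr_commits;

	memset(next, 0, sizeof(next));
	for (i = 0; i < g->nr_commits; i++) {
		if (g->large_bit[i]) {
			/* Already recorded: carry the bit over. */
			next[i] = 1;
			c.known_large++;
		} else if (c.computed < limit) {
			/* Computed now; an empty commit gives length 0. */
			c.computed++;
			c.found_large++;
			next[i] = 1;
		}
		/* Otherwise over budget: left for a later write. */
	}
	memcpy(g->large_bit, next, sizeof(next));
	return c;
}
```

Calling toy_write() four times on six commits with a limit of 2 yields
(known, found, computed) = (0, 2, 2), (2, 2, 2), (4, 2, 2) and finally
(6, 0, 0), the same sequence the test expects.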


* Re: [PATCH v3 14/14] builtin/commit-graph.c: introduce '--max-new-filters=<n>'
  2020-09-03 16:42       ` Taylor Blau
@ 2020-09-04  8:50         ` SZEDER Gábor
  0 siblings, 0 replies; 117+ messages in thread
From: SZEDER Gábor @ 2020-09-04  8:50 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, peff, dstolee, gitster

On Thu, Sep 03, 2020 at 12:42:25PM -0400, Taylor Blau wrote:
> On Wed, Aug 19, 2020 at 10:20:21AM +0200, SZEDER Gábor wrote:
> > On Tue, Aug 11, 2020 at 04:52:14PM -0400, Taylor Blau wrote:
> > > Introduce a command-line flag and configuration variable to fill in the
> > > 'max_new_filters' variable introduced by the previous patch.
> > >
> > > The command-line option '--max-new-filters' takes precedence over
> > > 'commitGraph.maxNewFilters', which is the default value.
> > > '--no-max-new-filters' can also be provided, which sets the value back
> > > to '-1', indicating that an unlimited number of new Bloom filters may be
> > > generated. (OPT_INTEGER only allows setting the '--no-' variant back to
> > > '0', hence a custom callback was used instead).
> >
> > Forgot the most important thing: Why?  Please explain in the commit
> > message why this option is necessary, what problems does it solve,
> > how it is supposed to interact with other options and why so.
> 
> This is already explained in detail in the patch 'commit-graph: add
> large-filters bitmap chunk', although there is an error in the quoted
> part of your email (which I wrote) which refers the reader to the
> previous patch. The patch I'm actually referring to is the
> twice-previous patch.

The proposed log message of that patch only briefly mentions what this
option will do, and goes into detail about what problems it _causes_
and how that new chunk is supposed to solve _those_ problems.  It does
not explain what problems this new option is supposed to solve, and
why would anyone want to use this option.

> I'll fix that locally before re-sending.
> 
> Thanks,
> Taylor

* Re: [PATCH v4 00/14] more miscellaneous Bloom filter improvements
  2020-09-03 22:45 ` [PATCH v4 " Taylor Blau
                     ` (13 preceding siblings ...)
  2020-09-03 22:47   ` [PATCH v4 14/14] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
@ 2020-09-04 14:39   ` Derrick Stolee
  14 siblings, 0 replies; 117+ messages in thread
From: Derrick Stolee @ 2020-09-04 14:39 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: dstolee, gitster, peff, szeder.dev

On 9/3/2020 6:45 PM, Taylor Blau wrote:
> Things seem to have settled down since the review in v3, so I'm hoping
> that this is what will end up being queued.

I haven't paid close attention to this topic, but I took a close
look at v4 and found nothing to improve. I'm happy with this
version.

Thanks,
-Stolee

* Re: [PATCH v4 13/14] commit-graph: rename 'split_commit_graph_opts'
  2020-09-03 22:46   ` [PATCH v4 13/14] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
@ 2020-09-04 15:20     ` Taylor Blau
  0 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-09-04 15:20 UTC (permalink / raw)
  To: git; +Cc: dstolee, gitster, peff, szeder.dev

On Thu, Sep 03, 2020 at 06:46:57PM -0400, Taylor Blau wrote:
> In the subsequent commit, additional options will be added to the
> commit-graph API which have nothing to do with splitting.

I have no idea how 'make *** [Makefile' snuck into this, but it
definitely shouldn't be there. Let's use the below version instead.

--- >8 ---

Subject: [PATCH] commit-graph: rename 'split_commit_graph_opts'

In the subsequent commit, additional options will be added to the
commit-graph API which have nothing to do with splitting.

Rename the 'split_commit_graph_opts' structure to the more-generic
'commit_graph_opts' to encompass both. Likewise, rename the 'flags'
member to instead be 'split_flags' to clarify that it only has to do
with the behavior implied by '--split'.

Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/commit-graph.c | 20 ++++++++++----------
 commit-graph.c         | 40 ++++++++++++++++++++--------------------
 commit-graph.h         |  8 ++++----
 3 files changed, 34 insertions(+), 34 deletions(-)

diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index ba5584463f..f3243bd982 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -119,7 +119,7 @@ static int graph_verify(int argc, const char **argv)
 }

 extern int read_replace_refs;
-static struct split_commit_graph_opts split_opts;
+static struct commit_graph_opts write_opts;

 static int write_option_parse_split(const struct option *opt, const char *arg,
 				    int unset)
@@ -187,24 +187,24 @@ static int graph_write(int argc, const char **argv)
 		OPT_BOOL(0, "changed-paths", &opts.enable_changed_paths,
 			N_("enable computation for changed paths")),
 		OPT_BOOL(0, "progress", &opts.progress, N_("force progress reporting")),
-		OPT_CALLBACK_F(0, "split", &split_opts.flags, NULL,
+		OPT_CALLBACK_F(0, "split", &write_opts.split_flags, NULL,
 			N_("allow writing an incremental commit-graph file"),
 			PARSE_OPT_OPTARG | PARSE_OPT_NONEG,
 			write_option_parse_split),
-		OPT_INTEGER(0, "max-commits", &split_opts.max_commits,
+		OPT_INTEGER(0, "max-commits", &write_opts.max_commits,
 			N_("maximum number of commits in a non-base split commit-graph")),
-		OPT_INTEGER(0, "size-multiple", &split_opts.size_multiple,
+		OPT_INTEGER(0, "size-multiple", &write_opts.size_multiple,
 			N_("maximum ratio between two levels of a split commit-graph")),
-		OPT_EXPIRY_DATE(0, "expire-time", &split_opts.expire_time,
+		OPT_EXPIRY_DATE(0, "expire-time", &write_opts.expire_time,
 			N_("only expire files older than a given date-time")),
 		OPT_END(),
 	};

 	opts.progress = isatty(2);
 	opts.enable_changed_paths = -1;
-	split_opts.size_multiple = 2;
-	split_opts.max_commits = 0;
-	split_opts.expire_time = 0;
+	write_opts.size_multiple = 2;
+	write_opts.max_commits = 0;
+	write_opts.expire_time = 0;

 	trace2_cmd_mode("write");

@@ -232,7 +232,7 @@ static int graph_write(int argc, const char **argv)
 	odb = find_odb(the_repository, opts.obj_dir);

 	if (opts.reachable) {
-		if (write_commit_graph_reachable(odb, flags, &split_opts))
+		if (write_commit_graph_reachable(odb, flags, &write_opts))
 			return 1;
 		return 0;
 	}
@@ -261,7 +261,7 @@ static int graph_write(int argc, const char **argv)
 			       opts.stdin_packs ? &pack_indexes : NULL,
 			       opts.stdin_commits ? &commits : NULL,
 			       flags,
-			       &split_opts))
+			       &write_opts))
 		result = 1;

 cleanup:
diff --git a/commit-graph.c b/commit-graph.c
index 68ffa6ec35..33fcf01a7a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1012,7 +1012,7 @@ struct write_commit_graph_context {
 		 changed_paths:1,
 		 order_by_pack:1;

-	const struct split_commit_graph_opts *split_opts;
+	const struct commit_graph_opts *opts;
 	size_t total_bloom_filter_data_size;
 	const struct bloom_filter_settings *bloom_settings;

@@ -1353,8 +1353,8 @@ static void close_reachable(struct write_commit_graph_context *ctx)
 {
 	int i;
 	struct commit *commit;
-	enum commit_graph_split_flags flags = ctx->split_opts ?
-		ctx->split_opts->flags : COMMIT_GRAPH_SPLIT_UNSPECIFIED;
+	enum commit_graph_split_flags flags = ctx->opts ?
+		ctx->opts->split_flags : COMMIT_GRAPH_SPLIT_UNSPECIFIED;

 	if (ctx->report_progress)
 		ctx->progress = start_delayed_progress(
@@ -1554,7 +1554,7 @@ static int add_ref_to_set(const char *refname,

 int write_commit_graph_reachable(struct object_directory *odb,
 				 enum commit_graph_write_flags flags,
-				 const struct split_commit_graph_opts *split_opts)
+				 const struct commit_graph_opts *opts)
 {
 	struct oidset commits = OIDSET_INIT;
 	struct refs_cb_data data;
@@ -1571,7 +1571,7 @@ int write_commit_graph_reachable(struct object_directory *odb,
 	stop_progress(&data.progress);

 	result = write_commit_graph(odb, NULL, &commits,
-				    flags, split_opts);
+				    flags, opts);

 	oidset_clear(&commits);
 	return result;
@@ -1686,8 +1686,8 @@ static uint32_t count_distinct_commits(struct write_commit_graph_context *ctx)
 static void copy_oids_to_commits(struct write_commit_graph_context *ctx)
 {
 	uint32_t i;
-	enum commit_graph_split_flags flags = ctx->split_opts ?
-		ctx->split_opts->flags : COMMIT_GRAPH_SPLIT_UNSPECIFIED;
+	enum commit_graph_split_flags flags = ctx->opts ?
+		ctx->opts->split_flags : COMMIT_GRAPH_SPLIT_UNSPECIFIED;

 	ctx->num_extra_edges = 0;
 	if (ctx->report_progress)
@@ -1973,13 +1973,13 @@ static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
 	int max_commits = 0;
 	int size_mult = 2;

-	if (ctx->split_opts) {
-		max_commits = ctx->split_opts->max_commits;
+	if (ctx->opts) {
+		max_commits = ctx->opts->max_commits;

-		if (ctx->split_opts->size_multiple)
-			size_mult = ctx->split_opts->size_multiple;
+		if (ctx->opts->size_multiple)
+			size_mult = ctx->opts->size_multiple;

-		flags = ctx->split_opts->flags;
+		flags = ctx->opts->split_flags;
 	}

 	g = ctx->r->objects->commit_graph;
@@ -2157,8 +2157,8 @@ static void expire_commit_graphs(struct write_commit_graph_context *ctx)
 	size_t dirnamelen;
 	timestamp_t expire_time = time(NULL);

-	if (ctx->split_opts && ctx->split_opts->expire_time)
-		expire_time = ctx->split_opts->expire_time;
+	if (ctx->opts && ctx->opts->expire_time)
+		expire_time = ctx->opts->expire_time;
 	if (!ctx->split) {
 		char *chain_file_name = get_chain_filename(ctx->odb);
 		unlink(chain_file_name);
@@ -2209,7 +2209,7 @@ int write_commit_graph(struct object_directory *odb,
 		       struct string_list *pack_indexes,
 		       struct oidset *commits,
 		       enum commit_graph_write_flags flags,
-		       const struct split_commit_graph_opts *split_opts)
+		       const struct commit_graph_opts *opts)
 {
 	struct write_commit_graph_context *ctx;
 	uint32_t i, count_distinct = 0;
@@ -2226,7 +2226,7 @@ int write_commit_graph(struct object_directory *odb,
 	ctx->append = flags & COMMIT_GRAPH_WRITE_APPEND ? 1 : 0;
 	ctx->report_progress = flags & COMMIT_GRAPH_WRITE_PROGRESS ? 1 : 0;
 	ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
-	ctx->split_opts = split_opts;
+	ctx->opts = opts;
 	ctx->total_bloom_filter_data_size = 0;

 	bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
@@ -2274,15 +2274,15 @@ int write_commit_graph(struct object_directory *odb,
 			}
 		}

-		if (ctx->split_opts)
-			replace = ctx->split_opts->flags & COMMIT_GRAPH_SPLIT_REPLACE;
+		if (ctx->opts)
+			replace = ctx->opts->split_flags & COMMIT_GRAPH_SPLIT_REPLACE;
 	}

 	ctx->approx_nr_objects = approximate_object_count();
 	ctx->oids.alloc = ctx->approx_nr_objects / 32;

-	if (ctx->split && split_opts && ctx->oids.alloc > split_opts->max_commits)
-		ctx->oids.alloc = split_opts->max_commits;
+	if (ctx->split && opts && ctx->oids.alloc > opts->max_commits)
+		ctx->oids.alloc = opts->max_commits;

 	if (ctx->append) {
 		prepare_commit_graph_one(ctx->r, ctx->odb);
diff --git a/commit-graph.h b/commit-graph.h
index 9afb1477d5..fe798a4047 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -115,11 +115,11 @@ enum commit_graph_split_flags {
 	COMMIT_GRAPH_SPLIT_REPLACE          = 2
 };

-struct split_commit_graph_opts {
+struct commit_graph_opts {
 	int size_multiple;
 	int max_commits;
 	timestamp_t expire_time;
-	enum commit_graph_split_flags flags;
+	enum commit_graph_split_flags split_flags;
 };

 /*
@@ -130,12 +130,12 @@ struct split_commit_graph_opts {
  */
 int write_commit_graph_reachable(struct object_directory *odb,
 				 enum commit_graph_write_flags flags,
-				 const struct split_commit_graph_opts *split_opts);
+				 const struct commit_graph_opts *opts);
 int write_commit_graph(struct object_directory *odb,
 		       struct string_list *pack_indexes,
 		       struct oidset *commits,
 		       enum commit_graph_write_flags flags,
-		       const struct split_commit_graph_opts *split_opts);
+		       const struct commit_graph_opts *opts);

 #define COMMIT_GRAPH_VERIFY_SHALLOW	(1 << 0)

--
2.27.0.2918.gc99a27ff8f


* Re: [PATCH v4 11/14] csum-file.h: introduce 'hashwrite_be64()'
  2020-09-03 22:46   ` [PATCH v4 11/14] csum-file.h: introduce 'hashwrite_be64()' Taylor Blau
@ 2020-09-04 20:18     ` René Scharfe
  2020-09-04 20:22       ` Taylor Blau
  0 siblings, 1 reply; 117+ messages in thread
From: René Scharfe @ 2020-09-04 20:18 UTC (permalink / raw)
  To: Taylor Blau, git; +Cc: dstolee, gitster, peff, szeder.dev

Am 04.09.20 um 00:46 schrieb Taylor Blau:
> A small handful of writers who wish to encode 64-bit values in network
> order have worked around the lack of such a helper by calling the 32-bit
> variant twice.
>
> The subsequent commit will add another caller who wants to write a
> 64-bit value. To ease their (and the existing caller's) pain, introduce
> a helper to do just that, and convert existing call-sites.
>
> Suggested-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  commit-graph.c | 8 ++------
>  csum-file.h    | 6 ++++++
>  midx.c         | 3 +--
>  3 files changed, 9 insertions(+), 8 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 35535f4192..01d791343a 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -1791,12 +1791,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>
>  	chunk_offset = 8 + (num_chunks + 1) * GRAPH_CHUNKLOOKUP_WIDTH;
>  	for (i = 0; i <= num_chunks; i++) {
> -		uint32_t chunk_write[3];
> -
> -		chunk_write[0] = htonl(chunks[i].id);
> -		chunk_write[1] = htonl(chunk_offset >> 32);
> -		chunk_write[2] = htonl(chunk_offset & 0xffffffff);
> -		hashwrite(f, chunk_write, 12);
> +		hashwrite_be32(f, chunks[i].id);
> +		hashwrite_be64(f, chunk_offset);
>
>  		chunk_offset += chunks[i].size;
>  	}
> diff --git a/csum-file.h b/csum-file.h
> index f9cbd317fb..b026ec7766 100644
> --- a/csum-file.h
> +++ b/csum-file.h
> @@ -62,4 +62,10 @@ static inline void hashwrite_be32(struct hashfile *f, uint32_t data)
>  	hashwrite(f, &data, sizeof(data));
>  }
>
> +static inline void hashwrite_be64(struct hashfile *f, uint64_t data)
> +{
> +	hashwrite_be32(f, data >> 32);
> +	hashwrite_be32(f, data & 0xffffffffUL);
> +}
> +
>  #endif
> diff --git a/midx.c b/midx.c
> index e9b2e1253a..32cc5fdc22 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -789,8 +789,7 @@ static size_t write_midx_large_offsets(struct hashfile *f, uint32_t nr_large_off
>  		if (!(offset >> 31))
>  			continue;
>
> -		hashwrite_be32(f, offset >> 32);
> -		hashwrite_be32(f, offset & 0xffffffffUL);
> +		hashwrite_be64(f, offset);
>  		written += 2 * sizeof(uint32_t);

"2 * sizeof(uint32_t)" looks slightly out of sync with the hashwrite_be64()
call now; "sizeof(uint64_t)" would be more fitting.

>
>  		nr_large_offset--;
>

There's also this potential caller:

midx.c=802=static int write_midx_internal(const char *object_dir, struct multi_pack_index *m,
midx.c:981:             hashwrite_be32(f, chunk_ids[i]);
midx.c:982:             hashwrite_be32(f, chunk_offsets[i] >> 32);
midx.c:983:             hashwrite_be32(f, chunk_offsets[i]);

Not sure it's worth a reroll, though.

(I'd probably leave those conversions for a later series.)
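
For readers following along, here is a minimal sketch of the idea behind
the helper: a 64-bit value is written in network order as two big-endian
32-bit halves, high half first. 'struct byte_sink', 'sink_put_be32()' and
'sink_put_be64()' are hypothetical stand-ins for illustration only; they
are not git's 'struct hashfile' API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for 'struct hashfile': a fixed-size byte buffer. */
struct byte_sink {
	unsigned char data[16];
	size_t len;
};

/* Append a 32-bit value, most-significant byte first (network order). */
void sink_put_be32(struct byte_sink *s, uint32_t v)
{
	assert(s->len + sizeof(v) <= sizeof(s->data));
	s->data[s->len++] = (unsigned char)(v >> 24);
	s->data[s->len++] = (unsigned char)(v >> 16);
	s->data[s->len++] = (unsigned char)(v >> 8);
	s->data[s->len++] = (unsigned char)v;
}

/* Append a 64-bit value as two big-endian 32-bit halves, high half first. */
size_t sink_put_be64(struct byte_sink *s, uint64_t v)
{
	sink_put_be32(s, (uint32_t)(v >> 32));
	sink_put_be32(s, (uint32_t)(v & 0xffffffffUL));
	/* Bytes written: sizeof(uint64_t), the same as 2 * sizeof(uint32_t). */
	return sizeof(uint64_t);
}
```

Writing 0x0102030405060708 into an empty sink leaves the bytes 01 02 03
04 05 06 07 08 in that order, which is why a caller can account for the
write as sizeof(uint64_t).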

René

* Re: [PATCH v4 11/14] csum-file.h: introduce 'hashwrite_be64()'
  2020-09-04 20:18     ` René Scharfe
@ 2020-09-04 20:22       ` Taylor Blau
  0 siblings, 0 replies; 117+ messages in thread
From: Taylor Blau @ 2020-09-04 20:22 UTC (permalink / raw)
  To: René Scharfe; +Cc: Taylor Blau, git, dstolee, gitster, peff, szeder.dev

On Fri, Sep 04, 2020 at 10:18:38PM +0200, René Scharfe wrote:
> Am 04.09.20 um 00:46 schrieb Taylor Blau:
> "2 * sizeof(uint32_t)" looks slightly out of sync with the hashwrite_be64()
> call now; "sizeof(uint64_t)" would be more fitting.

Yeah, agreed.

> >
> >  		nr_large_offset--;
> >
>
> There's also this potential caller:
>
> midx.c=802=static int write_midx_internal(const char *object_dir, struct multi_pack_index *m,
> midx.c:981:             hashwrite_be32(f, chunk_ids[i]);
> midx.c:982:             hashwrite_be32(f, chunk_offsets[i] >> 32);
> midx.c:983:             hashwrite_be32(f, chunk_offsets[i]);
>
> Not sure it's worth a reroll, though.
>
> (I'd probably leave those conversions for a later series.)

Agreed. If we were earlier on, or there wasn't already a patch that I
had swapped out for a manual fixup after sending this v4, I'd certainly
fold these in, but I think at this point it's easier to apply this
separately on top.

Thanks for pointing them out.

> René

Thanks,
Taylor

* Re: [PATCH v4 07/14] bloom: split 'get_bloom_filter()' in two
  2020-09-03 22:46   ` [PATCH v4 07/14] bloom: split 'get_bloom_filter()' in two Taylor Blau
@ 2020-09-05 17:22     ` Jakub Narębski
  2020-09-05 17:38       ` Taylor Blau
  0 siblings, 1 reply; 117+ messages in thread
From: Jakub Narębski @ 2020-09-05 17:22 UTC (permalink / raw)
  To: Taylor Blau
  Cc: git, Derrick Stolee, Junio Hamano, Jeff King, SZEDER Gábor,
	Jakub Narębski

Taylor Blau <me@ttaylorr.com> writes:

[...]
> While we're at it, instrument the new 'get_or_compute_bloom_filter()'
> with two counters in the 'write_commit_graph_context' struct which store
> the number of filters that we computed, and the number of those which
> were too large to store.

[...]
> @@ -1414,12 +1433,25 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>  		QSORT(sorted_commits, ctx->commits.nr, commit_gen_cmp);
>  
>  	for (i = 0; i < ctx->commits.nr; i++) {
> +		int computed = 0;
>  		struct commit *c = sorted_commits[i];
> -		struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1);
> +		struct bloom_filter *filter = get_or_compute_bloom_filter(
> +			ctx->r,
> +			c,
> +			1,
> +			&computed);
> +		if (computed) {
> +			ctx->count_bloom_filter_computed++;
> +			if (filter && !filter->len)
> +				ctx->count_bloom_filter_found_large++;

How do you distinguish between no changed paths stored because there
were no changes (which should not count as *_found_large), and no
changed paths stored because there were too many changes?  If I remember
it correctly in current implementation both are represented as
zero-length filter (no changed paths could have been represented as all
zeros filter, too many changed paths could have been represented as all
ones filter).

No changes to store in filter can happen not only with `--allow-empty`
(e.g. via interactive rebase), but also with merge where all changes
came from the second parent -- we are storing only changes to first
parent, if I remember it correctly.

This is a minor issue, though.

> +		}
>  		ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
>  		display_progress(progress, i + 1);
>  	}
>  
> +	if (trace2_is_enabled())
> +		trace2_bloom_filter_write_statistics(ctx);
> +
>  	free(sorted_commits);
>  	stop_progress(&progress);
>  }

Best,
-- 
Jakub Narębski

* Re: [PATCH v4 07/14] bloom: split 'get_bloom_filter()' in two
  2020-09-05 17:22     ` Jakub Narębski
@ 2020-09-05 17:38       ` Taylor Blau
  2020-09-05 17:50         ` Jakub Narębski
  0 siblings, 1 reply; 117+ messages in thread
From: Taylor Blau @ 2020-09-05 17:38 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: Taylor Blau, git, Derrick Stolee, Junio Hamano, Jeff King,
	SZEDER Gábor

On Sat, Sep 05, 2020 at 07:22:08PM +0200, Jakub Narębski wrote:
> Taylor Blau <me@ttaylorr.com> writes:
> How do you distinguish between no changed paths stored because there
> were no changes (which should not count as *_found_large), and no
> changed paths stored because there were too many changes?  If I remember
> it correctly in current implementation both are represented as
> zero-length filter (no changed paths could have been represented as all
> zeros filter, too many changed paths could have been represented as all
> ones filter).

Right, once we get handed back a filter from
'get_or_compute_bloom_filter()', we can't distinguish between (a) a
commit with too many changes to store in a single Bloom filter, and (b)
a commit with no changes at all.

It's unfortunate that callers can't pick between the two, but this
implementation is actually an improvement on the status-quo! Why?
Because right now we'll see an "empty" Bloom filter and recompute it
because it's "missing", only to discover that it has no changes.

With this patch, we'll say "this filter looks too large", and stop
computing it, because we have already gone through the effort to compute
it once (and marked it in the BFXL chunk).

Now, you could certainly argue that 'BFXL' could be called 'BFWC'
("Bloom filter was computed"), and then the "was this filter too large"
check means that the commit was (a) marked in a BFWC chunk, and (b) has
length zero. I'm not necessarily opposed, but I think that this is
probably not worth it, since we're trying to disambiguate something that
is inherently ambiguous. (That is, even with this new check, a
length-zero filter would still be thought to be "too large", since it was
computed, and has length 0).
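
To make the ambiguity concrete, here is a minimal sketch of the check as
described above, assuming a per-commit bitmap of "already computed"
marks. The names 'classify_filter()' and 'enum filter_state', and the
bit order within the bitmap, are hypothetical and chosen only for
illustration; they are not the series' actual code or on-disk layout.

```c
#include <stddef.h>
#include <stdint.h>

enum filter_state {
	FILTER_USABLE,             /* non-empty filter, can be queried */
	FILTER_EMPTY_OR_TOO_LARGE, /* computed before, but stored as length 0 */
	FILTER_NOT_COMPUTED        /* length 0 and never marked as computed */
};

/*
 * 'marked' is a hypothetical bitmap with one bit per commit position;
 * a set bit means "we already went through the effort of computing this
 * filter". Bit order (least-significant bit first) is an assumption.
 */
enum filter_state classify_filter(size_t filter_len,
				  const unsigned char *marked,
				  uint32_t pos)
{
	if (filter_len)
		return FILTER_USABLE;
	if (marked && ((marked[pos / 8] >> (pos % 8)) & 1))
		return FILTER_EMPTY_OR_TOO_LARGE;
	return FILTER_NOT_COMPUTED;
}
```

Note that the middle state cannot be split further: a marked,
length-zero filter looks the same whether the commit had no changes or
too many.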

> No changes to store in filter can happen not only with `--allow-empty`
> (e.g. via interactive rebase), but also with merge where all changes
> came from the second parent -- we are storing only changes to first
> parent, if I remember it correctly.

Agreed. And yes, Bloom filters store changes only to their commit's
first parent.

> This is a minor issue, though.

Thanks for raising it. I don't think that this is a show-stopper for
this series.

Thanks,
Taylor

* Re: [PATCH v4 07/14] bloom: split 'get_bloom_filter()' in two
  2020-09-05 17:38       ` Taylor Blau
@ 2020-09-05 17:50         ` Jakub Narębski
  2020-09-05 18:01           ` Taylor Blau
  0 siblings, 1 reply; 117+ messages in thread
From: Jakub Narębski @ 2020-09-05 17:50 UTC (permalink / raw)
  To: Taylor Blau
  Cc: git, Derrick Stolee, Junio Hamano, Jeff King, SZEDER Gábor

Hello,

On Sat, 5 Sep 2020 at 19:38, Taylor Blau <me@ttaylorr.com> wrote:
> On Sat, Sep 05, 2020 at 07:22:08PM +0200, Jakub Narębski wrote:
> > Taylor Blau <me@ttaylorr.com> writes:
> >
> > How do you distinguish between no changed paths stored because there
> > were no changes (which should not count as *_found_large), and no
> > changed paths stored because there were too many changes?  If I remember
> > it correctly in current implementation both are represented as
> > zero-length filter (no changed paths could have been represented as all
> > zeros filter, too many changed paths could have been represented as all
> > ones filter).
>
> Right, once we get handed back a filter from
> 'get_or_compute_bloom_filter()', we can't distinguish between (a) a
> commit with too many changes to store in a single Bloom filter, and (b)
> a commit with no changes at all.

We could change how we store either no-changes Bloom filter (as all
zeros minimal size filter), or too-many-changes Bloom filter (as all
ones, i.e. max unsigned value, minimal size filter). This change would
not require changing any user of Bloom filters.

> It's unfortunate that callers can't pick between the two, but this
> implementation is actually an improvement on the status-quo! Why?
> Because right now we'll see an "empty" Bloom filter and recompute it
> because it's "missing", only to discover that it has no changes.
>
> With this patch, we'll say "this filter looks too large", and stop
> computing it, because we have already gone through the effort to compute
> it once (and marked it in the BFXL chunk).

Can we use this when computing trace2 values?

[...]
> > This is a minor issue, though.
>
> Thanks for raising it. I don't think that this is a show-stopper for
> this series.

I agree.

Best,
-- 
Jakub Narębski

* Re: [PATCH v4 07/14] bloom: split 'get_bloom_filter()' in two
  2020-09-05 17:50         ` Jakub Narębski
@ 2020-09-05 18:01           ` Taylor Blau
  2020-09-05 18:18             ` Jakub Narębski
  0 siblings, 1 reply; 117+ messages in thread
From: Taylor Blau @ 2020-09-05 18:01 UTC (permalink / raw)
  To: Jakub Narębski
  Cc: Taylor Blau, git, Derrick Stolee, Junio Hamano, Jeff King,
	SZEDER Gábor

On Sat, Sep 05, 2020 at 07:50:01PM +0200, Jakub Narębski wrote:
> Hello,
>
> On Sat, 5 Sep 2020 at 19:38, Taylor Blau <me@ttaylorr.com> wrote:
> > Right, once we get handed back a filter from
> > 'get_or_compute_bloom_filter()', we can't distinguish between (a) a
> > commit with too many changes to store in a single Bloom filter, and (b)
> > a commit with no changes at all.
>
> We could change how we store either no-changes Bloom filter (as all
> zeros minimal size filter), or too-many-changes Bloom filter (as all
> ones, i.e. max unsigned value, minimal size filter). This change would
> not require changing any user of Bloom filters.

I don't think that's true. Say that we changed the empty Bloom filter to
be encoded as only having the most-significant bit set. First, we'd have
to write a Bloom filter where we didn't have to before. But the real
issue is that commit-graph files generated with new clients would
suddenly be unreadable by old clients.
>
> > It's unfortunate that callers can't pick between the two, but this
> > implementation is actually an improvement on the status-quo! Why?
> > Because right now we'll see an "empty" Bloom filter and recompute it
> > because it's "missing", only to discover that it has no changes.
> >
> > With this patch, we'll say "this filter looks too large", and stop
> > computing it, because we have already gone through the effort to compute
> > it once (and marked it in the BFXL chunk).
>
> Can we use this when computing trace2 values?

We could, but I don't think it's absolutely necessary. The test coverage
in t4216 gives us enough confidence already.

> [...]
> > > This is a minor issue, though.
> >
> > Thanks for raising it. I don't think that this is a show-stopper for
> > this series.
>
> I agree.

Thanks for your input!

> Best,
> --
> Jakub Narębski

Thanks,
Taylor

* Re: [PATCH v4 07/14] bloom: split 'get_bloom_filter()' in two
  2020-09-05 18:01           ` Taylor Blau
@ 2020-09-05 18:18             ` Jakub Narębski
  2020-09-05 18:38               ` Taylor Blau
  0 siblings, 1 reply; 117+ messages in thread
From: Jakub Narębski @ 2020-09-05 18:18 UTC (permalink / raw)
  To: Taylor Blau
  Cc: git, Derrick Stolee, Junio Hamano, Jeff King, SZEDER Gábor

Hello,

On Sat, 5 Sep 2020 at 20:01, Taylor Blau <me@ttaylorr.com> wrote:
> On Sat, Sep 05, 2020 at 07:50:01PM +0200, Jakub Narębski wrote:
> > On Sat, 5 Sep 2020 at 19:38, Taylor Blau <me@ttaylorr.com> wrote:
> > >
> > > Right, once we get handed back a filter from
> > > 'get_or_compute_bloom_filter()', we can't distinguish between (a) a
> > > commit with too many changes to store in a single Bloom filter, and (b)
> > > a commit with no changes at all.
> >
> > We could change how we store either no-changes Bloom filter (as all
> > zeros minimal size filter), or too-many-changes Bloom filter (as all
> > ones, i.e. max unsigned value, minimal size filter). This change would
> > not require changing any user of Bloom filters.
>
> I don't think that's true. Say that we changed the empty Bloom filter to
> be encoded as only having the most-significant bit set. First, we'd have
> to write a Bloom filter where we didn't have to before.

That's true.

> But the real
> issue is that commit-graph files generated with new clients would
> suddenly be unreadable by old clients.

Actually, such files would remain readable, at least in the form that I
have proposed. A Bloom filter with all bits set to zero would reply, for
every possible path, that the path is not in the set. Old clients would
therefore work without changes, which makes this a good representation of
a no-changes Bloom filter.

The Bloom filter which has all bits set to one would for every possible
path reply that the path is maybe in set. This is a good alternative
representation of too-many-changes Bloom filter. Again, old clients
would work without changes.
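
A minimal sketch of why these two special filters behave this way under
an ordinary membership query. This is a toy filter, not git's
implementation: 'toy_hash()' (a seeded FNV-1a), the probe count of 7, and
the bit layout are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_NUM_PROBES 7 /* assumed number of hash probes per key */

/* Toy seeded hash (FNV-1a); an assumption, not git's hash function. */
uint32_t toy_hash(const char *s, uint32_t seed)
{
	uint32_t h = 2166136261u ^ seed;
	for (; *s; s++) {
		h ^= (unsigned char)*s;
		h *= 16777619u;
	}
	return h;
}

/*
 * Returns 0 for "definitely not in the set", 1 for "maybe in the set",
 * and -1 when the filter has no bits at all (no information).
 */
int toy_filter_maybe_contains(const unsigned char *bits, size_t len,
			      const char *path)
{
	size_t nbits = len * 8;
	uint32_t k;

	if (!nbits)
		return -1;
	for (k = 0; k < TOY_NUM_PROBES; k++) {
		size_t bit = toy_hash(path, k) % nbits;
		if (!(bits[bit / 8] & (1u << (bit % 8))))
			return 0; /* one clear bit rules the path out */
	}
	return 1; /* every probed bit was set */
}
```

With an all-zeros filter the very first probe finds a clear bit, so every
path is reported as absent; with an all-ones filter no probe can ever
find a clear bit, so every path is reported as "maybe". Neither case
needs any special handling in the reader.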

[...]

Best,
-- 
Jakub Narębski

* Re: [PATCH v4 07/14] bloom: split 'get_bloom_filter()' in two
  2020-09-05 18:18             ` Jakub Narębski
@ 2020-09-05 18:38               ` Taylor Blau
  2020-09-05 18:55                 ` Taylor Blau
  0 siblings, 1 reply; 117+ messages in thread
From: Taylor Blau @ 2020-09-05 18:38 UTC (permalink / raw)
  To: Jakub Narębski, Derrick Stolee
  Cc: Taylor Blau, git, Junio Hamano, Jeff King, SZEDER Gábor

On Sat, Sep 05, 2020 at 08:18:40PM +0200, Jakub Narębski wrote:
> Hello,
>
> On Sat, 5 Sep 2020 at 20:01, Taylor Blau <me@ttaylorr.com> wrote:
> > But the real issue is that commit-graph files generated with new
> > clients would suddenly be unreadable by old clients.
>
> Actually it is, at least in the form that I have proposed. The Bloom filter
> which has all bits set to zero would for every possible path reply that
> the path is not in set. Old clients would therefore work without changes.
> Therefore this is good representation of no-changes Bloom filter.
>
> The Bloom filter which has all bits set to one would for every possible
> path reply that the path is maybe in set. This is a good alternative
> representation of too-many-changes Bloom filter. Again, old clients
> would work without changes.

That's a very interesting thought. To be honest, I'm not crazy about
generating our own Bloom filters that have special meaning (i.e., that
even though they are interpreted the same way as any other filter, they
are specially-crafted to carry a certain meaning). Your proposal also
means that commit-graph files are going to get bigger, which is
something that you may or may not care about.

On the other hand, it does get rid of the BFXL chunk, which certainly
isn't the most elegant thing.

I don't know. I think my biggest objection is the size: we use the BIDX
chunk today to avoid having to write the length-zero Bloom filters; your
scheme would force us to write every filter. On the other hand, we could
continue to avoid writing length-zero filters, so long as the
commit-graph indicates that it knows this optimization.
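
For context, a minimal sketch of how an index of cumulative end offsets
lets length-zero filters cost nothing in the data chunk: a filter is
empty exactly when its entry equals the previous one. The helper name
'filter_len_from_index()' is hypothetical, and byte order and any
per-chunk header are deliberately left out.

```c
#include <stdint.h>

/*
 * 'index[i]' is assumed to hold the total number of filter bytes for
 * commits 0..i (a cumulative end offset). The length of filter i is the
 * difference from the previous entry; equal adjacent entries therefore
 * mean "length zero", with no bytes written for that commit.
 */
uint32_t filter_len_from_index(const uint32_t *index, uint32_t i)
{
	uint32_t start = i ? index[i - 1] : 0;
	return index[i] - start;
}
```

For an index of {4, 4, 10, 10}, the four filters have lengths 4, 0, 6
and 0. Writing a minimal all-zeros or all-ones filter instead would make
every one of those differences non-zero.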

Stolee: what do you think?

> Best,
> --
> Jakub Narębski

Thanks,
Taylor

* Re: [PATCH v4 07/14] bloom: split 'get_bloom_filter()' in two
  2020-09-05 18:38               ` Taylor Blau
@ 2020-09-05 18:55                 ` Taylor Blau
  2020-09-05 19:04                   ` SZEDER Gábor
  0 siblings, 1 reply; 117+ messages in thread
From: Taylor Blau @ 2020-09-05 18:55 UTC (permalink / raw)
  To: Jakub Narębski, Derrick Stolee
  Cc: git, Junio Hamano, Jeff King, SZEDER Gábor

On Sat, Sep 05, 2020 at 02:38:54PM -0400, Taylor Blau wrote:
> I don't know. I think my biggest objection is the size: we use the BIDX
> chunk today to avoid having to write the length-zero Bloom filters; your
> scheme would force us to write every filter. On the other hand, we could
> continue to avoid writing length-zero filters, so long as the
> commit-graph indicates that it knows this optimization.

Thinking about it a little bit more, I'm pretty sure that this isn't as
easy as it sounds. Say that we:

  - continued to encode length-zero Bloom filters as equal adjacent
    entries in the BIDX, but reserve the length-zero filter for comm