git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / Atom feed
* [PATCH 0/9] [RFC] Changed Paths Bloom Filters
@ 2019-12-20 22:05 Garima Singh via GitGitGadget
  2019-12-20 22:05 ` [PATCH 1/9] commit-graph: add --changed-paths option to write Garima Singh via GitGitGadget
                   ` (15 more replies)
  0 siblings, 16 replies; 150+ messages in thread
From: Garima Singh via GitGitGadget @ 2019-12-20 22:05 UTC (permalink / raw)
  To: git; +Cc: stolee, szeder.dev, jonathantanmy, jeffhost, me, peff, Junio C Hamano

Hey! 

The commit graph feature brought in a lot of performance improvements across
multiple commands. However, file based history continues to be a performance
pain point, especially in large repositories. 

Adopting changed path bloom filters has been discussed on the list before,
and a prototype version was worked on by SZEDER Gábor, Jonathan Tan and Dr.
Derrick Stolee [1]. This series is based on Dr. Stolee's approach [2] and
presents an updated and more polished RFC version of the feature. 

Performance Gains: We tested the performance of git log -- path on the git
repo, the linux repo and some internal large repos, with a variety of paths
of varying depths.

On the git and linux repos: We observed a 2x to 5x speed up.

On a large internal repo with files seated 6-10 levels deep in the tree: We
observed 10x to 20x speed ups, with some paths going up to 28 times faster.

Future Work (not included in the scope of this series):

 1. Supporting multiple path based revision walk
 2. Adopting it in git blame logic. 
 3. Interactions with line log git log -L

This series is intended to start the conversation and many of the commit
messages include specific call outs for suggestions and thoughts. 

Cheers! Garima Singh

[1] https://lore.kernel.org/git/20181009193445.21908-1-szeder.dev@gmail.com/
[2] 
https://lore.kernel.org/git/61559c5b-546e-d61b-d2e1-68de692f5972@gmail.com/

Garima Singh (9):
  commit-graph: add --changed-paths option to write
  commit-graph: write changed paths bloom filters
  commit-graph: use MAX_NUM_CHUNKS
  commit-graph: document bloom filter format
  commit-graph: write changed path bloom filters to commit-graph file.
  commit-graph: test commit-graph write --changed-paths
  commit-graph: reuse existing bloom filters during write.
  revision.c: use bloom filters to speed up path based revision walks
  commit-graph: add GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS test flag

 Documentation/git-commit-graph.txt            |   5 +
 .../technical/commit-graph-format.txt         |  17 ++
 Makefile                                      |   1 +
 bloom.c                                       | 257 +++++++++++++++++
 bloom.h                                       |  51 ++++
 builtin/commit-graph.c                        |   9 +-
 ci/run-build-and-tests.sh                     |   1 +
 commit-graph.c                                | 116 +++++++-
 commit-graph.h                                |   9 +-
 revision.c                                    |  67 ++++-
 revision.h                                    |   5 +
 t/README                                      |   3 +
 t/helper/test-read-graph.c                    |   4 +
 t/t4216-log-bloom.sh                          |  77 ++++++
 t/t5318-commit-graph.sh                       |   2 +
 t/t5324-split-commit-graph.sh                 |   1 +
 t/t5325-commit-graph-bloom.sh                 | 258 ++++++++++++++++++
 17 files changed, 875 insertions(+), 8 deletions(-)
 create mode 100644 bloom.c
 create mode 100644 bloom.h
 create mode 100755 t/t4216-log-bloom.sh
 create mode 100755 t/t5325-commit-graph-bloom.sh


base-commit: b02fd2accad4d48078671adf38fe5b5976d77304
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-497%2Fgarimasi514%2FcoreGit-bloomFilters-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-497/garimasi514/coreGit-bloomFilters-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/497
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 150+ messages in thread

* [PATCH 1/9] commit-graph: add --changed-paths option to write
  2019-12-20 22:05 [PATCH 0/9] [RFC] Changed Paths Bloom Filters Garima Singh via GitGitGadget
@ 2019-12-20 22:05 ` Garima Singh via GitGitGadget
  2020-01-01 20:20   ` Jakub Narebski
  2019-12-20 22:05 ` [PATCH 2/9] commit-graph: write changed paths bloom filters Garima Singh via GitGitGadget
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 150+ messages in thread
From: Garima Singh via GitGitGadget @ 2019-12-20 22:05 UTC (permalink / raw)
  To: git
  Cc: stolee, szeder.dev, jonathantanmy, jeffhost, me, peff,
	Junio C Hamano, Garima Singh

From: Garima Singh <garima.singh@microsoft.com>

Add --changed-paths option to git commit-graph write. This option will
soon allow users to compute bloom filters for the paths changed between
a commit and its first significant parent, and write this information
into the commit-graph file.

Note: This commit does not change any behavior. It only introduces
the option and passes down the appropriate flag to the commit-graph.

RFC Notes:
1. We named the option --changed-paths to capture what the option does,
   instead of how it does it. The current implementation does this
   using bloom filters. We believe using --changed-paths however keeps
   the implementation open to other data structures.
   All thoughts and suggestions for the name and this approach are
   welcome

2. Currently, a subsequent commit in this series will add tests that
   exercise this option. I plan to split that test commit across the
   series as appropriate.

Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
 Documentation/git-commit-graph.txt | 5 +++++
 builtin/commit-graph.c             | 9 +++++++--
 commit-graph.h                     | 3 ++-
 3 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index bcd85c1976..1efe6e5c5a 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -54,6 +54,11 @@ or `--stdin-packs`.)
 With the `--append` option, include all commits that are present in the
 existing commit-graph file.
 +
+With the `--changed-paths` option, compute and write information about the
+paths changed between a commit and it's first parent. This operation can
+take a while on large repositories. It provides significant performance gains
+for getting file based history logs with `git log`
++
 With the `--split` option, write the commit-graph as a chain of multiple
 commit-graph files stored in `<dir>/info/commit-graphs`. The new commits
 not already in the commit-graph are added in a new "tip" file. This file
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index e0c6fc4bbf..9bd1e11161 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -9,7 +9,7 @@
 
 static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph verify [--object-dir <objdir>] [--shallow] [--[no-]progress]"),
-	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--[no-]progress] <split options>"),
+	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--changed-paths] [--[no-]progress] <split options>"),
 	NULL
 };
 
@@ -19,7 +19,7 @@ static const char * const builtin_commit_graph_verify_usage[] = {
 };
 
 static const char * const builtin_commit_graph_write_usage[] = {
-	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--[no-]progress] <split options>"),
+	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--changed-paths] [--[no-]progress] <split options>"),
 	NULL
 };
 
@@ -32,6 +32,7 @@ static struct opts_commit_graph {
 	int split;
 	int shallow;
 	int progress;
+	int enable_bloom_filters;
 } opts;
 
 static int graph_verify(int argc, const char **argv)
@@ -110,6 +111,8 @@ static int graph_write(int argc, const char **argv)
 			N_("start walk at commits listed by stdin")),
 		OPT_BOOL(0, "append", &opts.append,
 			N_("include all commits already in the commit-graph file")),
+		OPT_BOOL(0, "changed-paths", &opts.enable_bloom_filters,
+			N_("enable computation for changed paths")),
 		OPT_BOOL(0, "progress", &opts.progress, N_("force progress reporting")),
 		OPT_BOOL(0, "split", &opts.split,
 			N_("allow writing an incremental commit-graph file")),
@@ -143,6 +146,8 @@ static int graph_write(int argc, const char **argv)
 		flags |= COMMIT_GRAPH_WRITE_SPLIT;
 	if (opts.progress)
 		flags |= COMMIT_GRAPH_WRITE_PROGRESS;
+	if (opts.enable_bloom_filters)
+		flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;
 
 	read_replace_refs = 0;
 
diff --git a/commit-graph.h b/commit-graph.h
index 7f5c933fa2..952a4b83be 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -76,7 +76,8 @@ enum commit_graph_write_flags {
 	COMMIT_GRAPH_WRITE_PROGRESS   = (1 << 1),
 	COMMIT_GRAPH_WRITE_SPLIT      = (1 << 2),
 	/* Make sure that each OID in the input is a valid commit OID. */
-	COMMIT_GRAPH_WRITE_CHECK_OIDS = (1 << 3)
+	COMMIT_GRAPH_WRITE_CHECK_OIDS = (1 << 3),
+	COMMIT_GRAPH_WRITE_BLOOM_FILTERS = (1 << 4)
 };
 
 struct split_commit_graph_opts {
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 150+ messages in thread

* [PATCH 2/9] commit-graph: write changed paths bloom filters
  2019-12-20 22:05 [PATCH 0/9] [RFC] Changed Paths Bloom Filters Garima Singh via GitGitGadget
  2019-12-20 22:05 ` [PATCH 1/9] commit-graph: add --changed-paths option to write Garima Singh via GitGitGadget
@ 2019-12-20 22:05 ` Garima Singh via GitGitGadget
  2019-12-21 16:48   ` Philip Oakley
  2020-01-06 18:44   ` Jakub Narebski
  2019-12-20 22:05 ` [PATCH 3/9] commit-graph: use MAX_NUM_CHUNKS Garima Singh via GitGitGadget
                   ` (13 subsequent siblings)
  15 siblings, 2 replies; 150+ messages in thread
From: Garima Singh via GitGitGadget @ 2019-12-20 22:05 UTC (permalink / raw)
  To: git
  Cc: stolee, szeder.dev, jonathantanmy, jeffhost, me, peff,
	Junio C Hamano, Garima Singh

From: Garima Singh <garima.singh@microsoft.com>

The changed path bloom filters help determine which paths changed between a
commit and its first parent. We already have the "--changed-paths" option
for the "git commit-graph write" subcommand, now actually compute them under
that option. The COMMIT_GRAPH_WRITE_BLOOM_FILTERS flag enables this
computation.

RFC Notes: Here are some details about the implementation and I would love
to know your thoughts and suggestions for improvements here.

For details on what bloom filters are and how they work, please refer to
Dr. Derrick Stolee's blog post [1].
[1] https://devblogs.microsoft.com/devops/super-charging-the-git-commit-graph-iv-bloom-filters/

1. The implementation sticks to the recommended values of 7 and 10 for the
   number of hashes and the size of each entry, as described in the blog.
   The implementation while not completely open to it at the moment, is flexible
   enough to allow for tweaking these settings in the future.
   Note: The performance gains we have observed so far with these values is
   significant enough to not that we did not need to tweak these settings.
   The cover letter of this series has the details and the commit where we have
   git log use bloom filters.

2. As described in the blog and the linked technical paper therin, we do not need
   7 independent hashing functions. We use the Murmur3 hashing scheme - seed it
   twice and then combine those to procure an arbitrary number of hash values.

3. The filters are sized according to the number of changes in the each commit,
   with minimum size of one 64 bit word.

[Call for advice] We currently cap writing bloom filters for commits with
atmost 512 changed files. In the current implementation, we compute the diff,
and then just throw it away once we see it has more than 512 changes.
Any suggestiongs on how to reduce the work we are doing in this case are more
than welcome.

[Call for advice] Would the git community like this commit to be split up into
more granular commits? This commit could possibly be split out further with the
bloom.c code in its own commit, to be used by the commit-graph in a subsequent
commit. While I prefer it being contained in one commit this way, I am open to
suggestions.

[Call for advice] Would a technical document explaining the exact details of
the bloom filter implemenation and the hashing calculations be helpful? I will
be adding details into Documentation/technical/commit-graph-format.txt, but the
bloom filter code is an independent subsystem and could be used outside of the
commit-graph feature. Is it worth a separate document, or should we apply "You
Ain't Gonna Need It" principles?

[Call for advice] I plan to add unit tests for bloom.c, specifically to ensure
that the hash algorithm and bloom key calculations are stable across versions.

Signed-off-by: Garima Singh <garima.singh@microsoft.com>
Helped-by: Derrick Stolee <dstolee@microsoft.com>
---
 Makefile       |   1 +
 bloom.c        | 201 +++++++++++++++++++++++++++++++++++++++++++++++++
 bloom.h        |  46 +++++++++++
 commit-graph.c |  32 +++++++-
 4 files changed, 279 insertions(+), 1 deletion(-)
 create mode 100644 bloom.c
 create mode 100644 bloom.h

diff --git a/Makefile b/Makefile
index 42a061d3fb..9d5e26f5d6 100644
--- a/Makefile
+++ b/Makefile
@@ -838,6 +838,7 @@ LIB_OBJS += base85.o
 LIB_OBJS += bisect.o
 LIB_OBJS += blame.o
 LIB_OBJS += blob.o
+LIB_OBJS += bloom.o
 LIB_OBJS += branch.o
 LIB_OBJS += bulk-checkin.o
 LIB_OBJS += bundle.o
diff --git a/bloom.c b/bloom.c
new file mode 100644
index 0000000000..08328cc381
--- /dev/null
+++ b/bloom.c
@@ -0,0 +1,201 @@
+#include "git-compat-util.h"
+#include "bloom.h"
+#include "commit-graph.h"
+#include "object-store.h"
+#include "diff.h"
+#include "diffcore.h"
+#include "revision.h"
+#include "hashmap.h"
+
+#define BITS_PER_BLOCK 64
+
+define_commit_slab(bloom_filter_slab, struct bloom_filter);
+
+struct bloom_filter_slab bloom_filters;
+
+struct pathmap_hash_entry {
+    struct hashmap_entry entry;
+    const char path[FLEX_ARRAY];
+};
+
+static uint32_t rotate_right(uint32_t value, int32_t count)
+{
+	uint32_t mask = 8 * sizeof(uint32_t) - 1;
+	count &= mask;
+	return ((value >> count) | (value << ((-count) & mask)));
+}
+
+static uint32_t seed_murmur3(uint32_t seed, const char *data, int len)
+{
+	const uint32_t c1 = 0xcc9e2d51;
+	const uint32_t c2 = 0x1b873593;
+	const int32_t r1 = 15;
+	const int32_t r2 = 13;
+	const uint32_t m = 5;
+	const uint32_t n = 0xe6546b64;
+	int i;
+	uint32_t k1 = 0;
+	const char *tail;
+
+	int len4 = len / sizeof(uint32_t);
+
+	const uint32_t *blocks = (const uint32_t*)data;
+
+	uint32_t k;
+	for (i = 0; i < len4; i++)
+	{
+		k = blocks[i];
+		k *= c1;
+		k = rotate_right(k, r1);
+		k *= c2;
+
+		seed ^= k;
+		seed = rotate_right(seed, r2) * m + n;
+	}
+
+	tail = (data + len4 * sizeof(uint32_t));
+
+	switch (len & (sizeof(uint32_t) - 1))
+	{
+	case 3:
+		k1 ^= ((uint32_t)tail[2]) << 16;
+		/*-fallthrough*/
+	case 2:
+		k1 ^= ((uint32_t)tail[1]) << 8;
+		/*-fallthrough*/
+	case 1:
+		k1 ^= ((uint32_t)tail[0]) << 0;
+		k1 *= c1;
+		k1 = rotate_right(k1, r1);
+		k1 *= c2;
+		seed ^= k1;
+		break;
+	}
+
+	seed ^= (uint32_t)len;
+	seed ^= (seed >> 16);
+	seed *= 0x85ebca6b;
+	seed ^= (seed >> 13);
+	seed *= 0xc2b2ae35;
+	seed ^= (seed >> 16);
+
+	return seed;
+}
+
+static inline uint64_t get_bitmask(uint32_t pos)
+{
+	return ((uint64_t)1) << (pos & (BITS_PER_BLOCK - 1));
+}
+
+void fill_bloom_key(const char *data,
+		    int len,
+		    struct bloom_key *key,
+		    struct bloom_filter_settings *settings)
+{
+	int i;
+	uint32_t seed0 = 0x293ae76f;
+	uint32_t seed1 = 0x7e646e2c;
+
+	uint32_t hash0 = seed_murmur3(seed0, data, len);
+	uint32_t hash1 = seed_murmur3(seed1, data, len);
+
+	key->hashes = (uint32_t *)xcalloc(settings->num_hashes, sizeof(uint32_t));
+	for (i = 0; i < settings->num_hashes; i++)
+		key->hashes[i] = hash0 + i * hash1;
+}
+
+static void add_key_to_filter(struct bloom_key *key,
+			      struct bloom_filter *filter,
+			      struct bloom_filter_settings *settings)
+{
+	int i;
+	uint64_t mod = filter->len * BITS_PER_BLOCK;
+
+	for (i = 0; i < settings->num_hashes; i++) {
+		uint64_t hash_mod = key->hashes[i] % mod;
+		uint64_t block_pos = hash_mod / BITS_PER_BLOCK;
+
+		filter->data[block_pos] |= get_bitmask(hash_mod);
+	}
+}
+
+void load_bloom_filters(void)
+{
+	init_bloom_filter_slab(&bloom_filters);
+}
+
+struct bloom_filter *get_bloom_filter(struct repository *r,
+				      struct commit *c)
+{
+	struct bloom_filter *filter;
+	struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
+	int i;
+	struct rev_info revs;
+	const char *revs_argv[] = {NULL, "HEAD", NULL};
+
+	filter = bloom_filter_slab_at(&bloom_filters, c);
+	init_revisions(&revs, NULL);
+	revs.diffopt.flags.recursive = 1;
+
+	setup_revisions(2, revs_argv, &revs, NULL);
+
+	if (c->parents)
+		diff_tree_oid(&c->parents->item->object.oid, &c->object.oid, "", &revs.diffopt);
+	else
+		diff_tree_oid(NULL, &c->object.oid, "", &revs.diffopt);
+	diffcore_std(&revs.diffopt);
+
+	if (diff_queued_diff.nr <= 512) {
+		struct hashmap pathmap;
+		struct pathmap_hash_entry* e;
+		struct hashmap_iter iter;
+		hashmap_init(&pathmap, NULL, NULL, 0);
+
+		for (i = 0; i < diff_queued_diff.nr; i++) {
+		    const char* path = diff_queued_diff.queue[i]->two->path;
+		    const char* p = path;
+
+		    /*
+		     * Add each leading directory of the changed file, i.e. for
+		     * 'dir/subdir/file' add 'dir' and 'dir/subdir' as well, so
+		     * the Bloom filter could be used to speed up commands like
+		     * 'git log dir/subdir', too.
+		     *
+		     * Note that directories are added without the trailing '/'.
+		     */
+		    do {
+				char* last_slash = strrchr(p, '/');
+
+				FLEX_ALLOC_STR(e, path, path);
+				hashmap_entry_init(&e->entry, strhash(p));
+				hashmap_add(&pathmap, &e->entry);
+
+				if (!last_slash)
+				    last_slash = (char*)p;
+				*last_slash = '\0';
+
+		    } while (*p);
+
+		    diff_free_filepair(diff_queued_diff.queue[i]);
+		}
+
+		filter->len = (hashmap_get_size(&pathmap) * settings.bits_per_entry + BITS_PER_BLOCK - 1) / BITS_PER_BLOCK;
+		filter->data = xcalloc(filter->len, sizeof(uint64_t));
+
+		hashmap_for_each_entry(&pathmap, &iter, e, entry) {
+		    struct bloom_key key;
+		    fill_bloom_key(e->path, strlen(e->path), &key, &settings);
+		    add_key_to_filter(&key, filter, &settings);
+		}
+
+		hashmap_free_entries(&pathmap, struct pathmap_hash_entry, entry);
+	} else {
+		filter->data = NULL;
+		filter->len = 0;
+	}
+
+	free(diff_queued_diff.queue);
+	DIFF_QUEUE_CLEAR(&diff_queued_diff);
+
+	return filter;
+}
\ No newline at end of file
diff --git a/bloom.h b/bloom.h
new file mode 100644
index 0000000000..ba8ae70b67
--- /dev/null
+++ b/bloom.h
@@ -0,0 +1,46 @@
+#ifndef BLOOM_H
+#define BLOOM_H
+
+struct commit;
+struct repository;
+
+struct bloom_filter_settings {
+	uint32_t hash_version;
+	uint32_t num_hashes;
+	uint32_t bits_per_entry;
+};
+
+#define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 }
+
+/*
+ * A bloom_filter struct represents a data segment to
+ * use when testing hash values. The 'len' member
+ * dictates how many uint64_t entries are stored in
+ * 'data'.
+ */
+struct bloom_filter {
+	uint64_t *data;
+	int len;
+};
+
+/*
+ * A bloom_key represents the k hash values for a
+ * given hash input. These can be precomputed and
+ * stored in a bloom_key for re-use when testing
+ * against a bloom_filter.
+ */
+struct bloom_key {
+	uint32_t *hashes;
+};
+
+void load_bloom_filters(void);
+
+struct bloom_filter *get_bloom_filter(struct repository *r,
+				      struct commit *c);
+
+void fill_bloom_key(const char *data,
+		    int len,
+		    struct bloom_key *key,
+		    struct bloom_filter_settings *settings);
+
+#endif
diff --git a/commit-graph.c b/commit-graph.c
index e771394aff..61e60ff98a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -16,6 +16,7 @@
 #include "hashmap.h"
 #include "replace-object.h"
 #include "progress.h"
+#include "bloom.h"
 
 #define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
 #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
@@ -794,9 +795,11 @@ struct write_commit_graph_context {
 	unsigned append:1,
 		 report_progress:1,
 		 split:1,
-		 check_oids:1;
+		 check_oids:1,
+		 bloom:1;
 
 	const struct split_commit_graph_opts *split_opts;
+	uint32_t total_bloom_filter_size;
 };
 
 static void write_graph_chunk_fanout(struct hashfile *f,
@@ -1139,6 +1142,28 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 	stop_progress(&ctx->progress);
 }
 
+static void compute_bloom_filters(struct write_commit_graph_context *ctx)
+{
+	int i;
+	struct progress *progress = NULL;
+
+	load_bloom_filters();
+
+	if (ctx->report_progress)
+		progress = start_progress(
+			_("Computing commit diff Bloom filters"),
+			ctx->commits.nr);
+
+	for (i = 0; i < ctx->commits.nr; i++) {
+		struct commit *c = ctx->commits.list[i];
+		struct bloom_filter *filter = get_bloom_filter(ctx->r, c);
+		ctx->total_bloom_filter_size += sizeof(uint64_t) * filter->len;
+		display_progress(progress, i + 1);
+	}
+
+	stop_progress(&progress);
+}
+
 static int add_ref_to_list(const char *refname,
 			   const struct object_id *oid,
 			   int flags, void *cb_data)
@@ -1791,6 +1816,8 @@ int write_commit_graph(const char *obj_dir,
 	ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
 	ctx->check_oids = flags & COMMIT_GRAPH_WRITE_CHECK_OIDS ? 1 : 0;
 	ctx->split_opts = split_opts;
+	ctx->bloom = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0;
+	ctx->total_bloom_filter_size = 0;
 
 	if (ctx->split) {
 		struct commit_graph *g;
@@ -1885,6 +1912,9 @@ int write_commit_graph(const char *obj_dir,
 
 	compute_generation_numbers(ctx);
 
+	if (ctx->bloom)
+		compute_bloom_filters(ctx);
+
 	res = write_commit_graph_file(ctx);
 
 	if (ctx->split)
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 150+ messages in thread

* [PATCH 3/9] commit-graph: use MAX_NUM_CHUNKS
  2019-12-20 22:05 [PATCH 0/9] [RFC] Changed Paths Bloom Filters Garima Singh via GitGitGadget
  2019-12-20 22:05 ` [PATCH 1/9] commit-graph: add --changed-paths option to write Garima Singh via GitGitGadget
  2019-12-20 22:05 ` [PATCH 2/9] commit-graph: write changed paths bloom filters Garima Singh via GitGitGadget
@ 2019-12-20 22:05 ` Garima Singh via GitGitGadget
  2020-01-07 12:19   ` Jakub Narebski
  2019-12-20 22:05 ` [PATCH 4/9] commit-graph: document bloom filter format Garima Singh via GitGitGadget
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 150+ messages in thread
From: Garima Singh via GitGitGadget @ 2019-12-20 22:05 UTC (permalink / raw)
  To: git
  Cc: stolee, szeder.dev, jonathantanmy, jeffhost, me, peff,
	Junio C Hamano, Garima Singh

From: Garima Singh <garima.singh@microsoft.com>

This is a minor cleanup to make it easier to change the
number of chunks being written to the commit-graph in the future.

Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
 commit-graph.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 61e60ff98a..8c4941eeaa 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -24,6 +24,7 @@
 #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
 #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
 #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
+#define MAX_NUM_CHUNKS 5
 
 #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
 
@@ -1381,8 +1382,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	int fd;
 	struct hashfile *f;
 	struct lock_file lk = LOCK_INIT;
-	uint32_t chunk_ids[6];
-	uint64_t chunk_offsets[6];
+	uint32_t chunk_ids[MAX_NUM_CHUNKS + 1];
+	uint64_t chunk_offsets[MAX_NUM_CHUNKS + 1];
 	const unsigned hashsz = the_hash_algo->rawsz;
 	struct strbuf progress_title = STRBUF_INIT;
 	int num_chunks = 3;
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 150+ messages in thread

* [PATCH 4/9] commit-graph: document bloom filter format
  2019-12-20 22:05 [PATCH 0/9] [RFC] Changed Paths Bloom Filters Garima Singh via GitGitGadget
                   ` (2 preceding siblings ...)
  2019-12-20 22:05 ` [PATCH 3/9] commit-graph: use MAX_NUM_CHUNKS Garima Singh via GitGitGadget
@ 2019-12-20 22:05 ` Garima Singh via GitGitGadget
  2020-01-07 14:46   ` Jakub Narebski
  2019-12-20 22:05 ` [PATCH 5/9] commit-graph: write changed path bloom filters to commit-graph file Garima Singh via GitGitGadget
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 150+ messages in thread
From: Garima Singh via GitGitGadget @ 2019-12-20 22:05 UTC (permalink / raw)
  To: git
  Cc: stolee, szeder.dev, jonathantanmy, jeffhost, me, peff,
	Junio C Hamano, Garima Singh

From: Garima Singh <garima.singh@microsoft.com>

Update the technical documentation for commit-graph-format with BIDX
and BDAT chunk information.

RFC Notes:
1. [Call for advice] We specifically mention that we are using bloom
   filters in this technical document. Should this document also be
   made open to other data structures in the future, with versioning
   information?

2. [Call for advice] We are also not describing the explicit nature
   of how we store the bloom filter binary data. Would it be useful
   to document details about the hash algorithm, the number of hashes
   and the specific seed values we are using in a separate document,
   or perhaps in a separate section in this document?

Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
 Documentation/technical/commit-graph-format.txt | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index a4f17441ae..6497f19f08 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -17,6 +17,9 @@ metadata, including:
 - The parents of the commit, stored using positional references within
   the graph file.
 
+- The bloom filter of the commit carrying the paths that were changed between
+  the commit and it's first parent.
+
 These positional references are stored as unsigned 32-bit integers
 corresponding to the array position within the list of commit OIDs. Due
 to some special constants we use to track parents, we can store at most
@@ -93,6 +96,20 @@ CHUNK DATA:
       positions for the parents until reaching a value with the most-significant
       bit on. The other bits correspond to the position of the last parent.
 
+  Bloom Filter Index (ID: {'B', 'I', 'D', 'X'}) [Optional]
+      For each commit we store the offset of its bloom filter in the BDAT chunk
+      as follows:
+      BIDX[i] = number of 8-byte words in all the bloom filters from commit 0 to
+		commit i (inclusive)
+
+  Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]
+      * It starts with three 32 bit integers for the
+	    - version of the hash algorithm being used
+	    - the number of hashes used in the computation
+	    - the number of bits per entry
+	  * The rest of the chunk is the concatenation of all the computed bloom 
+	  filters for the commits in lexicographic order.
+
   Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional]
       This list of H-byte hashes describe a set of B commit-graph files that
       form a commit-graph chain. The graph position for the ith commit in this
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 150+ messages in thread

* [PATCH 5/9] commit-graph: write changed path bloom filters to commit-graph file.
  2019-12-20 22:05 [PATCH 0/9] [RFC] Changed Paths Bloom Filters Garima Singh via GitGitGadget
                   ` (3 preceding siblings ...)
  2019-12-20 22:05 ` [PATCH 4/9] commit-graph: document bloom filter format Garima Singh via GitGitGadget
@ 2019-12-20 22:05 ` Garima Singh via GitGitGadget
  2020-01-07 16:01   ` Jakub Narebski
  2019-12-20 22:05 ` [PATCH 6/9] commit-graph: test commit-graph write --changed-paths Garima Singh via GitGitGadget
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 150+ messages in thread
From: Garima Singh via GitGitGadget @ 2019-12-20 22:05 UTC (permalink / raw)
  To: git
  Cc: stolee, szeder.dev, jonathantanmy, jeffhost, me, peff,
	Junio C Hamano, Garima Singh

From: Garima Singh <garima.singh@microsoft.com>

Write bloom filters to the commit-graph using the format described in
Documentation/technical/commit-graph-format.txt

Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
 commit-graph.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++-
 commit-graph.h |  5 ++++
 2 files changed, 85 insertions(+), 1 deletion(-)

diff --git a/commit-graph.c b/commit-graph.c
index 8c4941eeaa..def2ade166 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -24,7 +24,9 @@
 #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
 #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
 #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
-#define MAX_NUM_CHUNKS 5
+#define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
+#define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
+#define MAX_NUM_CHUNKS 7
 
 #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
 
@@ -282,6 +284,32 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
 				chunk_repeated = 1;
 			else
 				graph->chunk_base_graphs = data + chunk_offset;
+			break;
+
+		case GRAPH_CHUNKID_BLOOMINDEXES:
+			if (graph->chunk_bloom_indexes)
+				chunk_repeated = 1;
+			else
+				graph->chunk_bloom_indexes = data + chunk_offset;
+			break;
+
+		case GRAPH_CHUNKID_BLOOMDATA:
+			if (graph->chunk_bloom_data)
+				chunk_repeated = 1;
+			else {
+				uint32_t hash_version;
+				graph->chunk_bloom_data = data + chunk_offset;
+				hash_version = get_be32(data + chunk_offset);
+
+				if (hash_version != 1)
+					break;
+
+				graph->settings = xmalloc(sizeof(struct bloom_filter_settings));
+				graph->settings->hash_version = hash_version;
+				graph->settings->num_hashes = get_be32(data + chunk_offset + 4);
+				graph->settings->bits_per_entry = get_be32(data + chunk_offset + 8);
+			}
+			break;
 		}
 
 		if (chunk_repeated) {
@@ -996,6 +1024,39 @@ static void write_graph_chunk_extra_edges(struct hashfile *f,
 	}
 }
 
+static void write_graph_chunk_bloom_indexes(struct hashfile *f,
+					    struct write_commit_graph_context *ctx)
+{
+	struct commit **list = ctx->commits.list;
+	struct commit **last = ctx->commits.list + ctx->commits.nr;
+	uint32_t cur_pos = 0;
+
+	while (list < last) {
+		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
+		cur_pos += filter->len;
+		hashwrite_be32(f, cur_pos);
+		list++;
+	}
+}
+
+static void write_graph_chunk_bloom_data(struct hashfile *f,
+					 struct write_commit_graph_context *ctx,
+					 struct bloom_filter_settings *settings)
+{
+	struct commit **first = ctx->commits.list;
+	struct commit **last = ctx->commits.list + ctx->commits.nr;
+
+	hashwrite_be32(f, settings->hash_version);
+	hashwrite_be32(f, settings->num_hashes);
+	hashwrite_be32(f, settings->bits_per_entry);
+
+	while (first < last) {
+		struct bloom_filter *filter = get_bloom_filter(ctx->r, *first);
+		hashwrite(f, filter->data, filter->len * sizeof(uint64_t));
+		first++;
+	}
+}
+
 static int oid_compare(const void *_a, const void *_b)
 {
 	const struct object_id *a = (const struct object_id *)_a;
@@ -1388,6 +1449,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	struct strbuf progress_title = STRBUF_INIT;
 	int num_chunks = 3;
 	struct object_id file_hash;
+	struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
 
 	if (ctx->split) {
 		struct strbuf tmp_file = STRBUF_INIT;
@@ -1432,6 +1494,12 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 		chunk_ids[num_chunks] = GRAPH_CHUNKID_EXTRAEDGES;
 		num_chunks++;
 	}
+	if (ctx->bloom) {
+		chunk_ids[num_chunks] = GRAPH_CHUNKID_BLOOMINDEXES;
+		num_chunks++;
+		chunk_ids[num_chunks] = GRAPH_CHUNKID_BLOOMDATA;
+		num_chunks++;
+	}
 	if (ctx->num_commit_graphs_after > 1) {
 		chunk_ids[num_chunks] = GRAPH_CHUNKID_BASE;
 		num_chunks++;
@@ -1450,6 +1518,13 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 						4 * ctx->num_extra_edges;
 		num_chunks++;
 	}
+	if (ctx->bloom) {
+		chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] + sizeof(uint32_t) * ctx->commits.nr;
+		num_chunks++;
+
+		chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] + sizeof(uint32_t) * 3 + ctx->total_bloom_filter_size;
+		num_chunks++;
+	}
 	if (ctx->num_commit_graphs_after > 1) {
 		chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
 						hashsz * (ctx->num_commit_graphs_after - 1);
@@ -1487,6 +1562,10 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	write_graph_chunk_data(f, hashsz, ctx);
 	if (ctx->num_extra_edges)
 		write_graph_chunk_extra_edges(f, ctx);
+	if (ctx->bloom) {
+		write_graph_chunk_bloom_indexes(f, ctx);
+		write_graph_chunk_bloom_data(f, ctx, &bloom_settings);
+	}
 	if (ctx->num_commit_graphs_after > 1 &&
 	    write_graph_chunk_base(f, ctx)) {
 		return -1;
diff --git a/commit-graph.h b/commit-graph.h
index 952a4b83be..2202ad91ae 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -10,6 +10,7 @@
 #define GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD "GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD"
 
 struct commit;
+struct bloom_filter_settings;
 
 char *get_commit_graph_filename(const char *obj_dir);
 int open_commit_graph(const char *graph_file, int *fd, struct stat *st);
@@ -58,6 +59,10 @@ struct commit_graph {
 	const unsigned char *chunk_commit_data;
 	const unsigned char *chunk_extra_edges;
 	const unsigned char *chunk_base_graphs;
+	const unsigned char *chunk_bloom_indexes;
+	const unsigned char *chunk_bloom_data;
+
+	struct bloom_filter_settings *settings;
 };
 
 struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st);
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 150+ messages in thread

* [PATCH 6/9] commit-graph: test commit-graph write --changed-paths
  2019-12-20 22:05 [PATCH 0/9] [RFC] Changed Paths Bloom Filters Garima Singh via GitGitGadget
                   ` (4 preceding siblings ...)
  2019-12-20 22:05 ` [PATCH 5/9] commit-graph: write changed path bloom filters to commit-graph file Garima Singh via GitGitGadget
@ 2019-12-20 22:05 ` Garima Singh via GitGitGadget
  2020-01-08  0:32   ` Jakub Narebski
  2019-12-20 22:05 ` [PATCH 7/9] commit-graph: reuse existing bloom filters during write Garima Singh via GitGitGadget
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 150+ messages in thread
From: Garima Singh via GitGitGadget @ 2019-12-20 22:05 UTC (permalink / raw)
  To: git
  Cc: stolee, szeder.dev, jonathantanmy, jeffhost, me, peff,
	Junio C Hamano, Garima Singh

From: Garima Singh <garima.singh@microsoft.com>

Add tests for the --changed-paths feature when writing
commit-graphs.

RFC Notes:
I plan to split this test across some of the earlier commits
as appropriate.

Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
 t/helper/test-read-graph.c    |   4 +
 t/t5325-commit-graph-bloom.sh | 255 ++++++++++++++++++++++++++++++++++
 2 files changed, 259 insertions(+)
 create mode 100755 t/t5325-commit-graph-bloom.sh

diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
index d2884efe0a..aff597c7a3 100644
--- a/t/helper/test-read-graph.c
+++ b/t/helper/test-read-graph.c
@@ -45,6 +45,10 @@ int cmd__read_graph(int argc, const char **argv)
 		printf(" commit_metadata");
 	if (graph->chunk_extra_edges)
 		printf(" extra_edges");
+	if (graph->chunk_bloom_indexes)
+		printf(" bloom_indexes");
+	if (graph->chunk_bloom_data)
+		printf(" bloom_data");
 	printf("\n");
 
 	UNLEAK(graph);
diff --git a/t/t5325-commit-graph-bloom.sh b/t/t5325-commit-graph-bloom.sh
new file mode 100755
index 0000000000..d7ef0e7fb3
--- /dev/null
+++ b/t/t5325-commit-graph-bloom.sh
@@ -0,0 +1,255 @@
+#!/bin/sh
+
+test_description='commit graph with bloom filters'
+. ./test-lib.sh
+
+test_expect_success 'setup repo' '
+	git init &&
+	git config core.commitGraph true &&
+	git config gc.writeCommitGraph false &&
+	infodir=".git/objects/info" &&
+	graphdir="$infodir/commit-graphs" &&
+	test_oid_init
+'
+
+graph_read_expect() {
+	OPTIONAL=""
+	NUM_CHUNKS=5
+	if test ! -z $2
+	then
+		OPTIONAL=" $2"
+		NUM_CHUNKS=$((NUM_CHUNKS + $(echo "$2" | wc -w)))
+	fi
+	cat >expect <<- EOF
+	header: 43475048 1 1 $NUM_CHUNKS 0
+	num_commits: $1
+	chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data
+	EOF
+	test-tool read-graph >output &&
+	test_cmp expect output
+}
+
+test_expect_success 'create commits and write commit-graph' '
+	for i in $(test_seq 3)
+	do
+		test_commit $i &&
+		git branch commits/$i || return 1
+	done &&
+	git commit-graph write --reachable --changed-paths &&
+	test_path_is_file $infodir/commit-graph &&
+	graph_read_expect 3
+'
+
+graph_git_two_modes() {
+	git -c core.commitGraph=true $1 >output
+	git -c core.commitGraph=false $1 >expect
+	test_cmp expect output
+}
+
+graph_git_behavior() {
+	MSG=$1
+	BRANCH=$2
+	COMPARE=$3
+	test_expect_success "check normal git operations: $MSG" '
+		graph_git_two_modes "log --oneline $BRANCH" &&
+		graph_git_two_modes "log --topo-order $BRANCH" &&
+		graph_git_two_modes "log --graph $COMPARE..$BRANCH" &&
+		graph_git_two_modes "branch -vv" &&
+		graph_git_two_modes "merge-base -a $BRANCH $COMPARE"
+	'
+}
+
+graph_git_behavior 'graph exists' commits/3 commits/1
+
+verify_chain_files_exist() {
+	for hash in $(cat $1/commit-graph-chain)
+	do
+		test_path_is_file $1/graph-$hash.graph || return 1
+	done
+}
+
+test_expect_success 'add more commits, and write a new base graph' '
+	git reset --hard commits/1 &&
+	for i in $(test_seq 4 5)
+	do
+		test_commit $i &&
+		git branch commits/$i || return 1
+	done &&
+	git reset --hard commits/2 &&
+	for i in $(test_seq 6 10)
+	do
+		test_commit $i &&
+		git branch commits/$i || return 1
+	done &&
+	git reset --hard commits/2 &&
+	git merge commits/4 &&
+	git branch merge/1 &&
+	git reset --hard commits/4 &&
+	git merge commits/6 &&
+	git branch merge/2 &&
+	git commit-graph write --reachable --changed-paths &&
+	graph_read_expect 12
+'
+
+test_expect_success 'fork and fail to base a chain on a commit-graph file' '
+	test_when_finished rm -rf fork &&
+	git clone . fork &&
+	(
+		cd fork &&
+		rm .git/objects/info/commit-graph &&
+		echo "$(pwd)/../.git/objects" >.git/objects/info/alternates &&
+		test_commit new-commit &&
+		git commit-graph write --reachable --split --changed-paths &&
+		test_path_is_file $graphdir/commit-graph-chain &&
+		test_line_count = 1 $graphdir/commit-graph-chain &&
+		verify_chain_files_exist $graphdir
+	)
+'
+
+test_expect_success 'add three more commits, write a tip graph' '
+	git reset --hard commits/3 &&
+	git merge merge/1 &&
+	git merge commits/5 &&
+	git merge merge/2 &&
+	git branch merge/3 &&
+	git commit-graph write --reachable --split --changed-paths &&
+	test_path_is_missing $infodir/commit-graph &&
+	test_path_is_file $graphdir/commit-graph-chain &&
+	ls $graphdir/graph-*.graph >graph-files &&
+	test_line_count = 2 graph-files &&
+	verify_chain_files_exist $graphdir
+'
+
+graph_git_behavior 'split commit-graph: merge 3 vs 2' merge/3 merge/2
+
+test_expect_success 'add one commit, write a tip graph' '
+	test_commit 11 &&
+	git branch commits/11 &&
+	git commit-graph write --reachable --split --changed-paths &&
+	test_path_is_missing $infodir/commit-graph &&
+	test_path_is_file $graphdir/commit-graph-chain &&
+	ls $graphdir/graph-*.graph >graph-files &&
+	test_line_count = 3 graph-files &&
+	verify_chain_files_exist $graphdir
+'
+
+graph_git_behavior 'three-layer commit-graph: commit 11 vs 6' commits/11 commits/6
+
+test_expect_success 'add one commit, write a merged graph' '
+	test_commit 12 &&
+	git branch commits/12 &&
+	git commit-graph write --reachable --split --changed-paths &&
+	test_path_is_file $graphdir/commit-graph-chain &&
+	test_line_count = 2 $graphdir/commit-graph-chain &&
+	ls $graphdir/graph-*.graph >graph-files &&
+	test_line_count = 2 graph-files &&
+	verify_chain_files_exist $graphdir
+'
+
+graph_git_behavior 'merged commit-graph: commit 12 vs 6' commits/12 commits/6
+
+test_expect_success 'create fork and chain across alternate' '
+	git clone . fork &&
+	(
+		cd fork &&
+		git config core.commitGraph true &&
+		rm -rf $graphdir &&
+		echo "$(pwd)/../.git/objects" >.git/objects/info/alternates &&
+		test_commit 13 &&
+		git branch commits/13 &&
+		git commit-graph write --reachable --split --changed-paths &&
+		test_path_is_file $graphdir/commit-graph-chain &&
+		test_line_count = 3 $graphdir/commit-graph-chain &&
+		ls $graphdir/graph-*.graph >graph-files &&
+		test_line_count = 1 graph-files &&
+		git -c core.commitGraph=true  rev-list HEAD >expect &&
+		git -c core.commitGraph=false rev-list HEAD >actual &&
+		test_cmp expect actual &&
+		test_commit 14 &&
+		git commit-graph write --reachable --split --changed-paths --object-dir=.git/objects/ &&
+		test_line_count = 3 $graphdir/commit-graph-chain &&
+		ls $graphdir/graph-*.graph >graph-files &&
+		test_line_count = 1 graph-files
+	)
+'
+
+graph_git_behavior 'alternate: commit 13 vs 6' commits/13 commits/6
+
+test_expect_success 'test merge stragety constants' '
+	git clone . merge-2 &&
+	(
+		cd merge-2 &&
+		git config core.commitGraph true &&
+		test_line_count = 2 $graphdir/commit-graph-chain &&
+		test_commit 14 &&
+		git commit-graph write --reachable --split --changed-paths --size-multiple=2 &&
+		test_line_count = 3 $graphdir/commit-graph-chain
+
+	) &&
+	git clone . merge-10 &&
+	(
+		cd merge-10 &&
+		git config core.commitGraph true &&
+		test_line_count = 2 $graphdir/commit-graph-chain &&
+		test_commit 14 &&
+		git commit-graph write --reachable --split --changed-paths --size-multiple=10 &&
+		test_line_count = 1 $graphdir/commit-graph-chain &&
+		ls $graphdir/graph-*.graph >graph-files &&
+		test_line_count = 1 graph-files
+	) &&
+	git clone . merge-10-expire &&
+	(
+		cd merge-10-expire &&
+		git config core.commitGraph true &&
+		test_line_count = 2 $graphdir/commit-graph-chain &&
+		test_commit 15 &&
+		git commit-graph write --reachable --split --changed-paths --size-multiple=10 --expire-time=1980-01-01 &&
+		test_line_count = 1 $graphdir/commit-graph-chain &&
+		ls $graphdir/graph-*.graph >graph-files &&
+		test_line_count = 3 graph-files
+	) &&
+	git clone --no-hardlinks . max-commits &&
+	(
+		cd max-commits &&
+		git config core.commitGraph true &&
+		test_line_count = 2 $graphdir/commit-graph-chain &&
+		test_commit 16 &&
+		test_commit 17 &&
+		git commit-graph write --reachable --split --changed-paths --max-commits=1 &&
+		test_line_count = 1 $graphdir/commit-graph-chain &&
+		ls $graphdir/graph-*.graph >graph-files &&
+		test_line_count = 1 graph-files
+	)
+'
+
+test_expect_success 'remove commit-graph-chain file after flattening' '
+	git clone . flatten &&
+	(
+		cd flatten &&
+		test_line_count = 2 $graphdir/commit-graph-chain &&
+		git commit-graph write --reachable &&
+		test_path_is_missing $graphdir/commit-graph-chain &&
+		ls $graphdir >graph-files &&
+		test_must_be_empty graph-files
+	)
+'
+
+graph_git_behavior 'graph exists' merge/octopus commits/12
+
+test_expect_success 'split across alternate where alternate is not split' '
+	git commit-graph write --reachable &&
+	test_path_is_file .git/objects/info/commit-graph &&
+	cp .git/objects/info/commit-graph . &&
+	git clone --no-hardlinks . alt-split &&
+	(
+		cd alt-split &&
+		rm -f .git/objects/info/commit-graph &&
+		echo "$(pwd)"/../.git/objects >.git/objects/info/alternates &&
+		test_commit 18 &&
+		git commit-graph write --reachable --split --changed-paths &&
+		test_line_count = 1 $graphdir/commit-graph-chain
+	) &&
+	test_cmp commit-graph .git/objects/info/commit-graph
+'
+
+test_done
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 150+ messages in thread

* [PATCH 7/9] commit-graph: reuse existing bloom filters during write.
  2019-12-20 22:05 [PATCH 0/9] [RFC] Changed Paths Bloom Filters Garima Singh via GitGitGadget
                   ` (5 preceding siblings ...)
  2019-12-20 22:05 ` [PATCH 6/9] commit-graph: test commit-graph write --changed-paths Garima Singh via GitGitGadget
@ 2019-12-20 22:05 ` Garima Singh via GitGitGadget
  2020-01-09 19:12   ` Jakub Narebski
  2019-12-20 22:05 ` [PATCH 8/9] revision.c: use bloom filters to speed up path based revision walks Garima Singh via GitGitGadget
                   ` (8 subsequent siblings)
  15 siblings, 1 reply; 150+ messages in thread
From: Garima Singh via GitGitGadget @ 2019-12-20 22:05 UTC (permalink / raw)
  To: git
  Cc: stolee, szeder.dev, jonathantanmy, jeffhost, me, peff,
	Junio C Hamano, Garima Singh

From: Garima Singh <garima.singh@microsoft.com>

Read previously computed bloom filters from the commit-graph file if possible
to avoid recomputing during commit-graph write.

Reading from the commit-graph is based on the format in which bloom filters are
written in the commit graph file. See method `fill_filter_from_graph` in bloom.c

For reading the bloom filter for commit at lexicographic position i:
1. Read BIDX[i] which essentially gives us the starting index in BDAT for filter
   of commit i+1 (called the next_index in the code)

2. For i>0, read BIDX[i-1] which will give us the starting index in BDAT for
   filter of commit i (called the prev_index in the code)
   For i = 0, prev_index will be 0. The first lexicographic commit's filter will
   start at BDAT.

3. The length of the filter will be next_index - prev_index, because BIDX[i]
   gives the cumulative 8-byte words including the ith commit's filter.

We toggle whether bloom filters should be recomputed based on the compute_if_null
flag.

Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
 bloom.c        | 40 ++++++++++++++++++++++++++++++++++++++--
 bloom.h        |  3 ++-
 commit-graph.c |  6 +++---
 3 files changed, 43 insertions(+), 6 deletions(-)

diff --git a/bloom.c b/bloom.c
index 08328cc381..86b1005802 100644
--- a/bloom.c
+++ b/bloom.c
@@ -1,5 +1,7 @@
 #include "git-compat-util.h"
 #include "bloom.h"
+#include "commit.h"
+#include "commit-slab.h"
 #include "commit-graph.h"
 #include "object-store.h"
 #include "diff.h"
@@ -119,13 +121,35 @@ static void add_key_to_filter(struct bloom_key *key,
 	}
 }
 
+static void fill_filter_from_graph(struct commit_graph *g,
+				   struct bloom_filter *filter,
+				   struct commit *c)
+{
+	uint32_t lex_pos, prev_index, next_index;
+
+	while (c->graph_pos < g->num_commits_in_base)
+		g = g->base_graph;
+
+	lex_pos = c->graph_pos - g->num_commits_in_base;
+
+	next_index = get_be32(g->chunk_bloom_indexes + 4 * lex_pos);
+	if (lex_pos)
+		prev_index = get_be32(g->chunk_bloom_indexes + 4 * (lex_pos - 1));
+	else
+		prev_index = 0;
+
+	filter->len = next_index - prev_index;
+	filter->data = (uint64_t *)(g->chunk_bloom_data + 8 * prev_index + 12);
+}
+
 void load_bloom_filters(void)
 {
 	init_bloom_filter_slab(&bloom_filters);
 }
 
 struct bloom_filter *get_bloom_filter(struct repository *r,
-				      struct commit *c)
+				      struct commit *c,
+				      int compute_if_null)
 {
 	struct bloom_filter *filter;
 	struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
@@ -134,6 +158,18 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 	const char *revs_argv[] = {NULL, "HEAD", NULL};
 
 	filter = bloom_filter_slab_at(&bloom_filters, c);
+
+	if (!filter->data) {
+		load_commit_graph_info(r, c);
+		if (c->graph_pos != COMMIT_NOT_FROM_GRAPH && r->objects->commit_graph->chunk_bloom_indexes) {
+			fill_filter_from_graph(r->objects->commit_graph, filter, c);
+			return filter;
+		}
+	}
+
+	if (filter->data || !compute_if_null)
+			return filter;
+
 	init_revisions(&revs, NULL);
 	revs.diffopt.flags.recursive = 1;
 
@@ -198,4 +234,4 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 	DIFF_QUEUE_CLEAR(&diff_queued_diff);
 
 	return filter;
-}
\ No newline at end of file
+}
diff --git a/bloom.h b/bloom.h
index ba8ae70b67..101d689bbd 100644
--- a/bloom.h
+++ b/bloom.h
@@ -36,7 +36,8 @@ struct bloom_key {
 void load_bloom_filters(void);
 
 struct bloom_filter *get_bloom_filter(struct repository *r,
-				      struct commit *c);
+				      struct commit *c,
+				      int compute_if_null);
 
 void fill_bloom_key(const char *data,
 		    int len,
diff --git a/commit-graph.c b/commit-graph.c
index def2ade166..0580ce75d5 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1032,7 +1032,7 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f,
 	uint32_t cur_pos = 0;
 
 	while (list < last) {
-		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
+		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
 		cur_pos += filter->len;
 		hashwrite_be32(f, cur_pos);
 		list++;
@@ -1051,7 +1051,7 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
 	hashwrite_be32(f, settings->bits_per_entry);
 
 	while (first < last) {
-		struct bloom_filter *filter = get_bloom_filter(ctx->r, *first);
+		struct bloom_filter *filter = get_bloom_filter(ctx->r, *first, 0);
 		hashwrite(f, filter->data, filter->len * sizeof(uint64_t));
 		first++;
 	}
@@ -1218,7 +1218,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 
 	for (i = 0; i < ctx->commits.nr; i++) {
 		struct commit *c = ctx->commits.list[i];
-		struct bloom_filter *filter = get_bloom_filter(ctx->r, c);
+		struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1);
 		ctx->total_bloom_filter_size += sizeof(uint64_t) * filter->len;
 		display_progress(progress, i + 1);
 	}
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 150+ messages in thread

* [PATCH 8/9] revision.c: use bloom filters to speed up path based revision walks
  2019-12-20 22:05 [PATCH 0/9] [RFC] Changed Paths Bloom Filters Garima Singh via GitGitGadget
                   ` (6 preceding siblings ...)
  2019-12-20 22:05 ` [PATCH 7/9] commit-graph: reuse existing bloom filters during write Garima Singh via GitGitGadget
@ 2019-12-20 22:05 ` Garima Singh via GitGitGadget
  2020-01-11  0:27   ` Jakub Narebski
  2019-12-20 22:05 ` [PATCH 9/9] commit-graph: add GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS test flag Garima Singh via GitGitGadget
                   ` (7 subsequent siblings)
  15 siblings, 1 reply; 150+ messages in thread
From: Garima Singh via GitGitGadget @ 2019-12-20 22:05 UTC (permalink / raw)
  To: git
  Cc: stolee, szeder.dev, jonathantanmy, jeffhost, me, peff,
	Junio C Hamano, Garima Singh

From: Garima Singh <garima.singh@microsoft.com>

If bloom filters have been written to the commit-graph file, revision walk will
use them to speed up revision walks for a particular path.
Note: The current implementation does this in the case of single pathspec
case only.

We load the bloom filters during the prepare_revision_walk step when dealing
with a single pathspec. While comparing trees in rev_compare_trees(), if the
bloom filter says that the file is not different between the two trees, we
don't need to compute the expensive diff. This is where we get our performance
gains.

Performance Gains:
We tested the performance of `git log --path` on the git repo, the linux and
some internal large repos, with a variety of paths of varying depths.

On the git and linux repos:
we observed a 2x to 5x speed up.

On a large internal repo with files seated 6-10 levels deep in the tree:
we observed 10x to 20x speed ups, with some paths going up to 28 times faster.

RFC Notes:
I plan to collect the folloowing statistics around this usage of bloom filters
and trace them out using trace2.
- number of bloom filter queries,
- number of "No" responses (file hasn't changed)
- number of "Maybe" responses (file may have changed)
- number of "Commit not parsed" cases (commit had too many changes to have a
  bloom filter written out, currently our limit is 512 diffs)

Helped-by: Derrick Stolee <dstolee@microsoft.com
Helped-by: SZEDER Gábor <szeder.dev@gmail.com>
Helped-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
 bloom.c              | 20 ++++++++++++
 bloom.h              |  4 +++
 revision.c           | 67 +++++++++++++++++++++++++++++++++++++--
 revision.h           |  5 +++
 t/t4216-log-bloom.sh | 74 ++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 168 insertions(+), 2 deletions(-)
 create mode 100755 t/t4216-log-bloom.sh

diff --git a/bloom.c b/bloom.c
index 86b1005802..0c7505d3d6 100644
--- a/bloom.c
+++ b/bloom.c
@@ -235,3 +235,23 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 
 	return filter;
 }
+
+int bloom_filter_contains(struct bloom_filter *filter,
+			  struct bloom_key *key,
+			  struct bloom_filter_settings *settings)
+{
+	int i;
+	uint64_t mod = filter->len * BITS_PER_BLOCK;
+
+	if (!mod)
+		return 1;
+
+	for (i = 0; i < settings->num_hashes; i++) {
+		uint64_t hash_mod = key->hashes[i] % mod;
+		uint64_t block_pos = hash_mod / BITS_PER_BLOCK;
+		if (!(filter->data[block_pos] & get_bitmask(hash_mod)))
+			return 0;
+	}
+
+	return 1;
+}
diff --git a/bloom.h b/bloom.h
index 101d689bbd..9bdacd0a8e 100644
--- a/bloom.h
+++ b/bloom.h
@@ -44,4 +44,8 @@ void fill_bloom_key(const char *data,
 		    struct bloom_key *key,
 		    struct bloom_filter_settings *settings);
 
+int bloom_filter_contains(struct bloom_filter *filter,
+			  struct bloom_key *key,
+			  struct bloom_filter_settings *settings);
+
 #endif
diff --git a/revision.c b/revision.c
index 39a25e7a5d..01f5330740 100644
--- a/revision.c
+++ b/revision.c
@@ -29,6 +29,7 @@
 #include "prio-queue.h"
 #include "hashmap.h"
 #include "utf8.h"
+#include "bloom.h"
 
 volatile show_early_output_fn_t show_early_output;
 
@@ -624,11 +625,34 @@ static void file_change(struct diff_options *options,
 	options->flags.has_changes = 1;
 }
 
+static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
+						 struct commit *commit,
+						 struct bloom_key *key,
+						 struct bloom_filter_settings *settings)
+{
+	struct bloom_filter *filter;
+
+	if (!revs->repo->objects->commit_graph)
+		return -1;
+	if (commit->generation == GENERATION_NUMBER_INFINITY)
+		return -1;
+	if (!key || !settings)
+		return -1;
+
+	filter = get_bloom_filter(revs->repo, commit, 0);
+
+	if (!filter || !filter->len)
+		return 1;
+
+	return bloom_filter_contains(filter, key, settings);
+}
+
 static int rev_compare_tree(struct rev_info *revs,
-			    struct commit *parent, struct commit *commit)
+			    struct commit *parent, struct commit *commit, int nth_parent)
 {
 	struct tree *t1 = get_commit_tree(parent);
 	struct tree *t2 = get_commit_tree(commit);
+	int bloom_ret = 1;
 
 	if (!t1)
 		return REV_TREE_NEW;
@@ -653,6 +677,16 @@ static int rev_compare_tree(struct rev_info *revs,
 			return REV_TREE_SAME;
 	}
 
+	if (revs->pruning.pathspec.nr == 1 && !nth_parent) {
+		bloom_ret = check_maybe_different_in_bloom_filter(revs,
+								  commit,
+								  revs->bloom_key,
+								  revs->bloom_filter_settings);
+
+		if (bloom_ret == 0)
+			return REV_TREE_SAME;
+	}
+
 	tree_difference = REV_TREE_SAME;
 	revs->pruning.flags.has_changes = 0;
 	if (diff_tree_oid(&t1->object.oid, &t2->object.oid, "",
@@ -855,7 +889,7 @@ static void try_to_simplify_commit(struct rev_info *revs, struct commit *commit)
 			die("cannot simplify commit %s (because of %s)",
 			    oid_to_hex(&commit->object.oid),
 			    oid_to_hex(&p->object.oid));
-		switch (rev_compare_tree(revs, p, commit)) {
+		switch (rev_compare_tree(revs, p, commit, nth_parent)) {
 		case REV_TREE_SAME:
 			if (!revs->simplify_history || !relevant_commit(p)) {
 				/* Even if a merge with an uninteresting
@@ -3342,6 +3376,33 @@ static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
 	}
 }
 
+static void prepare_to_use_bloom_filter(struct rev_info *revs)
+{
+	struct pathspec_item *pi;
+	const char *path;
+	size_t len;
+
+	if (!revs->commits)
+	    return;
+
+	parse_commit(revs->commits->item);
+
+	if (!revs->repo->objects->commit_graph)
+		return;
+
+	revs->bloom_filter_settings = revs->repo->objects->commit_graph->settings;
+	if (!revs->bloom_filter_settings)
+		return;
+
+	pi = &revs->pruning.pathspec.items[0];
+	path = pi->match;
+	len = strlen(path);
+
+	load_bloom_filters();
+	revs->bloom_key = xmalloc(sizeof(struct bloom_key));
+	fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);
+}
+
 int prepare_revision_walk(struct rev_info *revs)
 {
 	int i;
@@ -3391,6 +3452,8 @@ int prepare_revision_walk(struct rev_info *revs)
 		simplify_merges(revs);
 	if (revs->children.name)
 		set_children(revs);
+	if (revs->pruning.pathspec.nr == 1)
+	    prepare_to_use_bloom_filter(revs);
 	return 0;
 }
 
diff --git a/revision.h b/revision.h
index a1a804bd3d..65dc11e8f1 100644
--- a/revision.h
+++ b/revision.h
@@ -56,6 +56,8 @@ struct repository;
 struct rev_info;
 struct string_list;
 struct saved_parents;
+struct bloom_key;
+struct bloom_filter_settings;
 define_shared_commit_slab(revision_sources, char *);
 
 struct rev_cmdline_info {
@@ -291,6 +293,9 @@ struct rev_info {
 	struct revision_sources *sources;
 
 	struct topo_walk_info *topo_walk_info;
+
+	struct bloom_key *bloom_key;
+	struct bloom_filter_settings *bloom_filter_settings;
 };
 
 int ref_excluded(struct string_list *, const char *path);
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
new file mode 100755
index 0000000000..d42f077998
--- /dev/null
+++ b/t/t4216-log-bloom.sh
@@ -0,0 +1,74 @@
+#!/bin/sh
+
+test_description='git log for a path with bloom filters'
+. ./test-lib.sh
+
+test_expect_success 'setup repo' '
+	git init &&
+	git config core.commitGraph true &&
+	git config gc.writeCommitGraph false &&
+	infodir=".git/objects/info" &&
+	graphdir="$infodir/commit-graphs" &&
+	test_oid_init
+'
+
+test_expect_success 'create 9 commits and repack' '
+	test_commit c1 file1 &&
+	test_commit c2 file2 &&
+	test_commit c3 file3 &&
+	test_commit c4 file1 &&
+	test_commit c5 file2 &&
+	test_commit c6 file3 &&
+	test_commit c7 file1 &&
+	test_commit c8 file2 &&
+	test_commit c9 file3
+'
+
+printf "c7\nc4\nc1" > expect_file1
+
+test_expect_success 'log without bloom filters' '
+	git log --pretty="format:%s"  -- file1 > actual &&
+	test_cmp expect_file1 actual
+'
+
+printf "c8\nc7\nc5\nc4\nc2\nc1" > expect_file1_file2
+
+test_expect_success 'multi-path log without bloom filters' '
+	git log --pretty="format:%s"  -- file1 file2 > actual &&
+	test_cmp expect_file1_file2 actual
+'
+
+graph_read_expect() {
+	OPTIONAL=""
+	NUM_CHUNKS=5
+	if test ! -z $2
+	then
+		OPTIONAL=" $2"
+		NUM_CHUNKS=$((3 + $(echo "$2" | wc -w)))
+	fi
+	cat >expect <<- EOF
+	header: 43475048 1 1 $NUM_CHUNKS 0
+	num_commits: $1
+	chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data$OPTIONAL
+	EOF
+	test-tool read-graph >output &&
+	test_cmp expect output
+}
+
+test_expect_success 'write commit graph with bloom filters' '
+	git commit-graph write --reachable --changed-paths &&
+	test_path_is_file $infodir/commit-graph &&
+	graph_read_expect "9"
+'
+
+test_expect_success 'log using bloom filters' '
+	git log --pretty="format:%s" -- file1 > actual &&
+	test_cmp expect_file1 actual
+'
+
+test_expect_success 'multi-path log using bloom filters' '
+	git log --pretty="format:%s"  -- file1 file2 > actual &&
+	test_cmp expect_file1_file2 actual
+'
+
+test_done
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 150+ messages in thread

* [PATCH 9/9] commit-graph: add GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS test flag
  2019-12-20 22:05 [PATCH 0/9] [RFC] Changed Paths Bloom Filters Garima Singh via GitGitGadget
                   ` (7 preceding siblings ...)
  2019-12-20 22:05 ` [PATCH 8/9] revision.c: use bloom filters to speed up path based revision walks Garima Singh via GitGitGadget
@ 2019-12-20 22:05 ` Garima Singh via GitGitGadget
  2020-01-11 19:56   ` Jakub Narebski
  2019-12-20 22:14 ` [PATCH 0/9] [RFC] Changed Paths Bloom Filters Junio C Hamano
                   ` (6 subsequent siblings)
  15 siblings, 1 reply; 150+ messages in thread
From: Garima Singh via GitGitGadget @ 2019-12-20 22:05 UTC (permalink / raw)
  To: git
  Cc: stolee, szeder.dev, jonathantanmy, jeffhost, me, peff,
	Junio C Hamano, Garima Singh

From: Garima Singh <garima.singh@microsoft.com>

Add GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS test flag to the test setup suite in
order to toggle writing bloom filters when running any of the git tests. If set
to true, we will compute and write bloom filters every time a test calls
`git commit-graph write`.

The test suite passes when GIT_TEST_COMMIT_GRAPH and
GIT_COMMIT_GRAPH_BLOOM_FILTERS are enabled.

Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
 builtin/commit-graph.c        | 2 +-
 ci/run-build-and-tests.sh     | 1 +
 commit-graph.h                | 1 +
 t/README                      | 3 +++
 t/t4216-log-bloom.sh          | 3 +++
 t/t5318-commit-graph.sh       | 2 ++
 t/t5324-split-commit-graph.sh | 1 +
 t/t5325-commit-graph-bloom.sh | 3 +++
 8 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 9bd1e11161..97167959b2 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -146,7 +146,7 @@ static int graph_write(int argc, const char **argv)
 		flags |= COMMIT_GRAPH_WRITE_SPLIT;
 	if (opts.progress)
 		flags |= COMMIT_GRAPH_WRITE_PROGRESS;
-	if (opts.enable_bloom_filters)
+	if (opts.enable_bloom_filters || git_env_bool(GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS, 0))
 		flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;
 
 	read_replace_refs = 0;
diff --git a/ci/run-build-and-tests.sh b/ci/run-build-and-tests.sh
index ff0ef7f08e..19d0846d34 100755
--- a/ci/run-build-and-tests.sh
+++ b/ci/run-build-and-tests.sh
@@ -19,6 +19,7 @@ linux-gcc)
 	export GIT_TEST_OE_SIZE=10
 	export GIT_TEST_OE_DELTA_SIZE=5
 	export GIT_TEST_COMMIT_GRAPH=1
+	export GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS=1
 	export GIT_TEST_MULTI_PACK_INDEX=1
 	make test
 	;;
diff --git a/commit-graph.h b/commit-graph.h
index 2202ad91ae..d914e6abf1 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -8,6 +8,7 @@
 
 #define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
 #define GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD "GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD"
+#define GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS "GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS"
 
 struct commit;
 struct bloom_filter_settings;
diff --git a/t/README b/t/README
index caa125ba9a..399b190437 100644
--- a/t/README
+++ b/t/README
@@ -378,6 +378,9 @@ GIT_TEST_COMMIT_GRAPH=<boolean>, when true, forces the commit-graph to
 be written after every 'git commit' command, and overrides the
 'core.commitGraph' setting to true.
 
+GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS=<boolean>, when true, forces commit-graph
+write to compute and write bloom filters for every 'git commit-graph write'
+
 GIT_TEST_FSMONITOR=$PWD/t7519/fsmonitor-all exercises the fsmonitor
 code path for utilizing a file system monitor to speed up detecting
 new or changed files.
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index d42f077998..0e092b387c 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -3,6 +3,9 @@
 test_description='git log for a path with bloom filters'
 . ./test-lib.sh
 
+GIT_TEST_COMMIT_GRAPH=0
+GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS=0
+
 test_expect_success 'setup repo' '
 	git init &&
 	git config core.commitGraph true &&
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 3f03de6018..613228bb12 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -3,6 +3,8 @@
 test_description='commit graph'
 . ./test-lib.sh
 
+GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS=0
+
 test_expect_success 'setup full repo' '
 	mkdir full &&
 	cd "$TRASH_DIRECTORY/full" &&
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index c24823431f..181ca7e0cb 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -4,6 +4,7 @@ test_description='split commit graph'
 . ./test-lib.sh
 
 GIT_TEST_COMMIT_GRAPH=0
+GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS=0
 
 test_expect_success 'setup repo' '
 	git init &&
diff --git a/t/t5325-commit-graph-bloom.sh b/t/t5325-commit-graph-bloom.sh
index d7ef0e7fb3..a9c9e9fef6 100755
--- a/t/t5325-commit-graph-bloom.sh
+++ b/t/t5325-commit-graph-bloom.sh
@@ -3,6 +3,9 @@
 test_description='commit graph with bloom filters'
 . ./test-lib.sh
 
+GIT_TEST_COMMIT_GRAPH=0
+GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS=0
+
 test_expect_success 'setup repo' '
 	git init &&
 	git config core.commitGraph true &&
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH 0/9] [RFC] Changed Paths Bloom Filters
  2019-12-20 22:05 [PATCH 0/9] [RFC] Changed Paths Bloom Filters Garima Singh via GitGitGadget
                   ` (8 preceding siblings ...)
  2019-12-20 22:05 ` [PATCH 9/9] commit-graph: add GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS test flag Garima Singh via GitGitGadget
@ 2019-12-20 22:14 ` Junio C Hamano
  2019-12-22  9:26 ` Christian Couder
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 150+ messages in thread
From: Junio C Hamano @ 2019-12-20 22:14 UTC (permalink / raw)
  To: Garima Singh via GitGitGadget
  Cc: git, stolee, szeder.dev, jonathantanmy, jeffhost, me, peff

"Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:

> Adopting changed path bloom filters has been discussed on the list before,
> and a prototype version was worked on by SZEDER Gábor, Jonathan Tan and Dr.
> Derrick Stolee [1]. This series is based on Dr. Stolee's approach [2] and
> presents an updated and more polished RFC version of the feature. 

;-)

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH 2/9] commit-graph: write changed paths bloom filters
  2019-12-20 22:05 ` [PATCH 2/9] commit-graph: write changed paths bloom filters Garima Singh via GitGitGadget
@ 2019-12-21 16:48   ` Philip Oakley
  2020-01-06 18:44   ` Jakub Narebski
  1 sibling, 0 replies; 150+ messages in thread
From: Philip Oakley @ 2019-12-21 16:48 UTC (permalink / raw)
  To: Garima Singh via GitGitGadget, git
  Cc: stolee, szeder.dev, jonathantanmy, jeffhost, me, peff,
	Junio C Hamano, Garima Singh

spelling nit?

On 20/12/2019 22:05, Garima Singh via GitGitGadget wrote:
> 1. The implementation sticks to the recommended values of 7 and 10 for the
>    number of hashes and the size of each entry, as described in the blog.
>    The implementation while not completely open to it at the moment, is flexible
>    enough to allow for tweaking these settings in the future.
>    Note: The performance gains we have observed so far with these values is
>    significant enough to not that we did not need to tweak these settings.
s/not/note/ (first occurrence)
>    The cover letter of this series has the details and the commit where we have
>    git log use bloom filters.
Philip

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH 0/9] [RFC] Changed Paths Bloom Filters
  2019-12-20 22:05 [PATCH 0/9] [RFC] Changed Paths Bloom Filters Garima Singh via GitGitGadget
                   ` (9 preceding siblings ...)
  2019-12-20 22:14 ` [PATCH 0/9] [RFC] Changed Paths Bloom Filters Junio C Hamano
@ 2019-12-22  9:26 ` Christian Couder
  2019-12-22  9:38   ` Jeff King
  2019-12-22  9:30 ` Jeff King
                   ` (4 subsequent siblings)
  15 siblings, 1 reply; 150+ messages in thread
From: Christian Couder @ 2019-12-22  9:26 UTC (permalink / raw)
  To: Garima Singh via GitGitGadget
  Cc: git, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
	Jeff Hostetler, Taylor Blau, Jeff King, Junio C Hamano

Hi,

On Fri, Dec 20, 2019 at 11:07 PM Garima Singh via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> The commit graph feature brought in a lot of performance improvements across
> multiple commands. However, file based history continues to be a performance
> pain point, especially in large repositories.
>
> Adopting changed path bloom filters has been discussed on the list before,
> and a prototype version was worked on by SZEDER Gábor, Jonathan Tan and Dr.
> Derrick Stolee [1]. This series is based on Dr. Stolee's approach [2] and
> presents an updated and more polished RFC version of the feature.

Thanks for working on this!

> Performance Gains: We tested the performance of git log -- path on the git
> repo, the linux repo and some internal large repos, with a variety of paths
> of varying depths.
>
> On the git and linux repos: We observed a 2x to 5x speed up.
>
> On a large internal repo with files seated 6-10 levels deep in the tree: We
> observed 10x to 20x speed ups, with some paths going up to 28 times faster.

Very nice!

I have a question though. Are the performance gains only available
with `git log -- path` or are they already available for example when
doing a partial clone and/or a sparse checkout?

> Future Work (not included in the scope of this series):
>
>  1. Supporting multiple path based revision walk
>  2. Adopting it in git blame logic.
>  3. Interactions with line log git log -L

Great!

> This series is intended to start the conversation and many of the commit
> messages include specific call outs for suggestions and thoughts.

I think Peff said during the Virtual Contributor Summit that he was
interested in using bitmaps to speed up partial clone on the server
side. Would it make sense to use both bitmaps and bloom filters?

Thanks,
Christian.

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH 0/9] [RFC] Changed Paths Bloom Filters
  2019-12-20 22:05 [PATCH 0/9] [RFC] Changed Paths Bloom Filters Garima Singh via GitGitGadget
                   ` (10 preceding siblings ...)
  2019-12-22  9:26 ` Christian Couder
@ 2019-12-22  9:30 ` Jeff King
  2019-12-22  9:32   ` [PATCH 1/3] commit-graph: examine changed-path objects in pack order Jeff King
                     ` (4 more replies)
  2019-12-31 16:45 ` Jakub Narebski
                   ` (3 subsequent siblings)
  15 siblings, 5 replies; 150+ messages in thread
From: Jeff King @ 2019-12-22  9:30 UTC (permalink / raw)
  To: Garima Singh via GitGitGadget
  Cc: git, stolee, szeder.dev, jonathantanmy, jeffhost, me, Junio C Hamano

On Fri, Dec 20, 2019 at 10:05:11PM +0000, Garima Singh via GitGitGadget wrote:

> Adopting changed path bloom filters has been discussed on the list before,
> and a prototype version was worked on by SZEDER Gábor, Jonathan Tan and Dr.
> Derrick Stolee [1]. This series is based on Dr. Stolee's approach [2] and
> presents an updated and more polished RFC version of the feature.

Great to see progress here. I probably won't have time to review this
carefully before the new year, but I did notice some low-hanging fruit
on the generation side.

So here are a few patches to reduce the CPU and memory usage. They could
be squashed in at the appropriate spots, or perhaps taken as inspiration
if there are better solutions (especially for the first one).

I think we could go further still, by actually doing a non-recursive
diff_tree_oid(), and then recursing into sub-trees ourselves. That would
save us having to split apart each path to add the leading paths to the
hashmap (most of which will be duplicates if the commit touched "a/b/c"
and "a/b/d", etc). I doubt it would be that huge a speedup though. We
have to keep a list of the touched paths anyway (since the bloom key
parameters depend on the number of entries), and most of the time is
almost certainly spent inflating the trees in the first place. However
it might be easier to follow the code, and it would make it simpler to
stop traversing at the 512-entry limit, rather than generating a huge
diff only to throw it away.

  [1/3]: commit-graph: examine changed-path objects in pack order
  [2/3]: commit-graph: free large diffs, too
  [3/3]: commit-graph: stop using full rev_info for diffs

 bloom.c        | 18 +++++++++---------
 commit-graph.c | 34 +++++++++++++++++++++++++++++++++-
 2 files changed, 42 insertions(+), 10 deletions(-)

-Peff

^ permalink raw reply	[flat|nested] 150+ messages in thread

* [PATCH 1/3] commit-graph: examine changed-path objects in pack order
  2019-12-22  9:30 ` Jeff King
@ 2019-12-22  9:32   ` Jeff King
  2019-12-27 14:51     ` Derrick Stolee
  2019-12-22  9:32   ` [PATCH 2/3] commit-graph: free large diffs, too Jeff King
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 150+ messages in thread
From: Jeff King @ 2019-12-22  9:32 UTC (permalink / raw)
  To: Garima Singh via GitGitGadget
  Cc: git, stolee, szeder.dev, jonathantanmy, jeffhost, me, Junio C Hamano

Looking at the diff of commit objects in pack order is much faster than
in sha1 order, as it gives locality to the access of tree deltas
(whereas sha1 order is effectively random). Unfortunately the
commit-graph code sorts the commits (several times, sometimes as an oid
and sometimes a pointer-to-commit), and we ultimately traverse in sha1
order.

Instead, let's remember the position at which we see each commit, and
traverse in that order when looking at bloom filters. This drops my time
for "git commit-graph write --changed-paths" in linux.git from ~4
minutes to ~1.5 minutes.

Probably the "--reachable" code path would want something similar.

Or alternatively, we could use a different data structure (either a
hash, or maybe even just a bit in "struct commit") to keep track of
which oids we've seen, etc instead of sorting. And then we could keep
the original order.

Signed-off-by: Jeff King <peff@peff.net>
---
 commit-graph.c | 34 +++++++++++++++++++++++++++++++++-
 1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/commit-graph.c b/commit-graph.c
index 0580ce75d5..bf6c663772 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -17,6 +17,7 @@
 #include "replace-object.h"
 #include "progress.h"
 #include "bloom.h"
+#include "commit-slab.h"
 
 #define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
 #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
@@ -48,6 +49,29 @@
 /* Remember to update object flag allocation in object.h */
 #define REACHABLE       (1u<<15)
 
+/* Keep track of the order in which commits are added to our list. */
+define_commit_slab(commit_pos, int);
+static struct commit_pos commit_pos = COMMIT_SLAB_INIT(1, commit_pos);
+
+static void set_commit_pos(struct repository *r, const struct object_id *oid)
+{
+	static int32_t max_pos;
+	struct commit *commit = lookup_commit(r, oid);
+
+	if (!commit)
+		return; /* should never happen, but be lenient */
+
+	*commit_pos_at(&commit_pos, commit) = max_pos++;
+}
+
+static int commit_pos_cmp(const void *va, const void *vb)
+{
+	const struct commit *a = *(const struct commit **)va;
+	const struct commit *b = *(const struct commit **)vb;
+	return commit_pos_at(&commit_pos, a) -
+	       commit_pos_at(&commit_pos, b);
+}
+
 char *get_commit_graph_filename(const char *obj_dir)
 {
 	char *filename = xstrfmt("%s/info/commit-graph", obj_dir);
@@ -1088,6 +1112,8 @@ static int add_packed_commits(const struct object_id *oid,
 	oidcpy(&(ctx->oids.list[ctx->oids.nr]), oid);
 	ctx->oids.nr++;
 
+	set_commit_pos(ctx->r, oid);
+
 	return 0;
 }
 
@@ -1208,6 +1234,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 {
 	int i;
 	struct progress *progress = NULL;
+	struct commit **sorted_by_pos;
 
 	load_bloom_filters();
 
@@ -1216,13 +1243,18 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 			_("Computing commit diff Bloom filters"),
 			ctx->commits.nr);
 
+	ALLOC_ARRAY(sorted_by_pos, ctx->commits.nr);
+	COPY_ARRAY(sorted_by_pos, ctx->commits.list, ctx->commits.nr);
+	QSORT(sorted_by_pos, ctx->commits.nr, commit_pos_cmp);
+
 	for (i = 0; i < ctx->commits.nr; i++) {
-		struct commit *c = ctx->commits.list[i];
+		struct commit *c = sorted_by_pos[i];
 		struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1);
 		ctx->total_bloom_filter_size += sizeof(uint64_t) * filter->len;
 		display_progress(progress, i + 1);
 	}
 
+	free(sorted_by_pos);
 	stop_progress(&progress);
 }
 
-- 
2.24.1.1152.gda0b849012


^ permalink raw reply	[flat|nested] 150+ messages in thread

* [PATCH 2/3] commit-graph: free large diffs, too
  2019-12-22  9:30 ` Jeff King
  2019-12-22  9:32   ` [PATCH 1/3] commit-graph: examine changed-path objects in pack order Jeff King
@ 2019-12-22  9:32   ` Jeff King
  2019-12-27 14:52     ` Derrick Stolee
  2019-12-22  9:32   ` [PATCH 3/3] commit-graph: stop using full rev_info for diffs Jeff King
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 150+ messages in thread
From: Jeff King @ 2019-12-22  9:32 UTC (permalink / raw)
  To: Garima Singh via GitGitGadget
  Cc: git, stolee, szeder.dev, jonathantanmy, jeffhost, me, Junio C Hamano

If a diff we compute for --changed-path has more than 512 entries, we
don't bother generating a bloom filter for it. But since we don't
iterate over diff_queued_diff, we also don't free the filepairs and
filespecs from the diff before clearing the queue. Let's make sure we do
so.

This drops the peak heap usage of "commit-graph write --changed-paths"
on linux.git from ~8GB to ~4GB.

Signed-off-by: Jeff King <peff@peff.net>
---
 bloom.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/bloom.c b/bloom.c
index 0c7505d3d6..d1d3796e11 100644
--- a/bloom.c
+++ b/bloom.c
@@ -226,6 +226,8 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 
 		hashmap_free_entries(&pathmap, struct pathmap_hash_entry, entry);
 	} else {
+		for (i = 0; i < diff_queued_diff.nr; i++)
+			diff_free_filepair(diff_queued_diff.queue[i]);
 		filter->data = NULL;
 		filter->len = 0;
 	}
-- 
2.24.1.1152.gda0b849012


^ permalink raw reply	[flat|nested] 150+ messages in thread

* [PATCH 3/3] commit-graph: stop using full rev_info for diffs
  2019-12-22  9:30 ` Jeff King
  2019-12-22  9:32   ` [PATCH 1/3] commit-graph: examine changed-path objects in pack order Jeff King
  2019-12-22  9:32   ` [PATCH 2/3] commit-graph: free large diffs, too Jeff King
@ 2019-12-22  9:32   ` Jeff King
  2019-12-27 14:53     ` Derrick Stolee
  2019-12-26 14:21   ` [PATCH 0/9] [RFC] Changed Paths Bloom Filters Derrick Stolee
  2019-12-27 16:11   ` Derrick Stolee
  4 siblings, 1 reply; 150+ messages in thread
From: Jeff King @ 2019-12-22  9:32 UTC (permalink / raw)
  To: Garima Singh via GitGitGadget
  Cc: git, stolee, szeder.dev, jonathantanmy, jeffhost, me, Junio C Hamano

When we perform a diff to get the set of changed paths for a commit,
we initialize a full "struct rev_info" with setup_revisions(). But the
only part of it we use is the diff_options struct. Besides being overly
complex, this also leaks memory, as we use the fake argv to
setup_revisions() create a pending array which is never cleared.

Let's just use diff_options directly. This reduces the peak heap usage
of "git commit-graph write --changed-paths" on linux.git from ~4GB to
~1.2GB.

Signed-off-by: Jeff King <peff@peff.net>
---
 bloom.c | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/bloom.c b/bloom.c
index d1d3796e11..ea77631cc2 100644
--- a/bloom.c
+++ b/bloom.c
@@ -154,8 +154,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 	struct bloom_filter *filter;
 	struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
 	int i;
-	struct rev_info revs;
-	const char *revs_argv[] = {NULL, "HEAD", NULL};
+	struct diff_options diffopt;
 
 	filter = bloom_filter_slab_at(&bloom_filters, c);
 
@@ -170,16 +169,15 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 	if (filter->data || !compute_if_null)
 			return filter;
 
-	init_revisions(&revs, NULL);
-	revs.diffopt.flags.recursive = 1;
-
-	setup_revisions(2, revs_argv, &revs, NULL);
+	repo_diff_setup(r, &diffopt);
+	diffopt.flags.recursive = 1;
+	diff_setup_done(&diffopt);
 
 	if (c->parents)
-		diff_tree_oid(&c->parents->item->object.oid, &c->object.oid, "", &revs.diffopt);
+		diff_tree_oid(&c->parents->item->object.oid, &c->object.oid, "", &diffopt);
 	else
-		diff_tree_oid(NULL, &c->object.oid, "", &revs.diffopt);
-	diffcore_std(&revs.diffopt);
+		diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
+	diffcore_std(&diffopt);
 
 	if (diff_queued_diff.nr <= 512) {
 		struct hashmap pathmap;
-- 
2.24.1.1152.gda0b849012

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH 0/9] [RFC] Changed Paths Bloom Filters
  2019-12-22  9:26 ` Christian Couder
@ 2019-12-22  9:38   ` Jeff King
  2020-01-01 12:04     ` Jakub Narebski
  0 siblings, 1 reply; 150+ messages in thread
From: Jeff King @ 2019-12-22  9:38 UTC (permalink / raw)
  To: Christian Couder
  Cc: Garima Singh via GitGitGadget, git, Derrick Stolee,
	SZEDER Gábor, Jonathan Tan, Jeff Hostetler, Taylor Blau,
	Junio C Hamano

On Sun, Dec 22, 2019 at 10:26:20AM +0100, Christian Couder wrote:

> I have a question though. Are the performance gains only available
> with `git log -- path` or are they already available for example when
> doing a partial clone and/or a sparse checkout?

From my quick look at the code, anything that feeds a pathspec to a
revision traversal would be helped. I'm not sure if it would help for
partial/sparse traversals, though. There we actually need to know which
blobs correspond to the paths in question, not just whether any
particular commit touched them.

I also took a brief look at adding support to the custom blame-tree
implementation we use at GitHub, and got about a 6x speedup.

> > This series is intended to start the conversation and many of the commit
> > messages include specific call outs for suggestions and thoughts.
> 
> I think Peff said during the Virtual Contributor Summit that he was
> interested in using bitmaps to speed up partial clone on the server
> side. Would it make sense to use both bitmaps and bloom filters?

I think they're orthogonal. For size-based filters on blobs, you'd just
use bitmaps as normal, because you can post-process the result to check
the type and size of each object in the list (and I have patches to do
this, but they need some polishing and we're not yet running them).

For path-based filters like a sparse specification, you can't use
bitmaps at all; you have to do a real traversal. But there you still
generally get all of the commits. I guess if a commit doesn't touch any
path you're interested in, you could avoid walking into its tree at all,
which might help. I haven't given it much thought yet.

-Peff

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH 0/9] [RFC] Changed Paths Bloom Filters
  2019-12-22  9:30 ` Jeff King
                     ` (2 preceding siblings ...)
  2019-12-22  9:32   ` [PATCH 3/3] commit-graph: stop using full rev_info for diffs Jeff King
@ 2019-12-26 14:21   ` Derrick Stolee
  2019-12-29  6:03     ` Jeff King
  2019-12-27 16:11   ` Derrick Stolee
  4 siblings, 1 reply; 150+ messages in thread
From: Derrick Stolee @ 2019-12-26 14:21 UTC (permalink / raw)
  To: Jeff King, Garima Singh via GitGitGadget
  Cc: git, szeder.dev, jonathantanmy, jeffhost, me, Junio C Hamano

On 12/22/2019 4:30 AM, Jeff King wrote:
> On Fri, Dec 20, 2019 at 10:05:11PM +0000, Garima Singh via GitGitGadget wrote:
> 
>> Adopting changed path bloom filters has been discussed on the list before,
>> and a prototype version was worked on by SZEDER Gábor, Jonathan Tan and Dr.
>> Derrick Stolee [1]. This series is based on Dr. Stolee's approach [2] and
>> presents an updated and more polished RFC version of the feature.
> 
> Great to see progress here. I probably won't have time to review this
> carefully before the new year, but I did notice some low-hanging fruit
> on the generation side.
> 
> So here are a few patches to reduce the CPU and memory usage. They could
> be squashed in at the appropriate spots, or perhaps taken as inspiration
> if there are better solutions (especially for the first one).
> 
> I think we could go further still, by actually doing a non-recursive
> diff_tree_oid(), and then recursing into sub-trees ourselves. That would
> save us having to split apart each path to add the leading paths to the
> hashmap (most of which will be duplicates if the commit touched "a/b/c"
> and "a/b/d", etc). I doubt it would be that huge a speedup though. We
> have to keep a list of the touched paths anyway (since the bloom key
> parameters depend on the number of entries), and most of the time is
> almost certainly spent inflating the trees in the first place. However
> it might be easier to follow the code, and it would make it simpler to
> stop traversing at the 512-entry limit, rather than generating a huge
> diff only to throw it away.

Thanks for these improvements. This diff machinery is new to us (Garima
and myself).

Here are some recommendations (to Garima) for how to proceed with these
patches. Please let me know if anyone disagrees.

>   [1/3]: commit-graph: examine changed-path objects in pack order

This one is best kept as its own patch, as it shows a clear reason why
we want to do the sort-by-position. It would also be a complicated
patch to include this logic along with the first use of
compute_bloom_filters().

>   [2/3]: commit-graph: free large diffs, too
This one seems best to squash into "commit-graph: write changed paths
bloom filters" with a Helped-by for Peff.

>   [3/3]: commit-graph: stop using full rev_info for diffs

While I appreciate the clear benefit in the commit-message here, it
may be best to also squash this one similarly.

Of course, if we create our own diff logic with the short-circuit
capability, then perhaps these patches become obsolete. I'll spend
a little time playing with options here.

Thanks!
-Stolee

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH 1/3] commit-graph: examine changed-path objects in pack order
  2019-12-22  9:32   ` [PATCH 1/3] commit-graph: examine changed-path objects in pack order Jeff King
@ 2019-12-27 14:51     ` Derrick Stolee
  2019-12-29  6:12       ` Jeff King
  0 siblings, 1 reply; 150+ messages in thread
From: Derrick Stolee @ 2019-12-27 14:51 UTC (permalink / raw)
  To: Jeff King, Garima Singh via GitGitGadget
  Cc: git, szeder.dev, jonathantanmy, jeffhost, me, Junio C Hamano

On 12/22/2019 4:32 AM, Jeff King wrote:
> Looking at the diff of commit objects in pack order is much faster than
> in sha1 order, as it gives locality to the access of tree deltas
> (whereas sha1 order is effectively random). Unfortunately the
> commit-graph code sorts the commits (several times, sometimes as an oid
> and sometimes a pointer-to-commit), and we ultimately traverse in sha1
> order.
> 
> Instead, let's remember the position at which we see each commit, and
> traverse in that order when looking at bloom filters. This drops my time
> for "git commit-graph write --changed-paths" in linux.git from ~4
> minutes to ~1.5 minutes.

I'm doing my own perf tests on these patches, and my copy of linux.git
has four packs of varying sizes (corresponding with my rare fetches and
lack of repacks). My time goes from 3m50s to 3m00s. I was confused at
first, but then realized that I used the "--reachable" flag. In that
case, we never run set_commit_pos(), so all positions are equal and the
sort is not helpful.

I thought that inserting some set_commit_pos() calls into close_reachable()
and add_missing_parents() would give some amount of time-order to the
commits as we compute the filters. However, the time did not change at
all.

I've included the patch below for reference, anyway.

Thanks,
-Stolee

-->8--

From e7c63d8db09be81ce213ba7f112bb3d2f537bf4a Mon Sep 17 00:00:00 2001
From: Derrick Stolee <dstolee@microsoft.com>
Date: Fri, 27 Dec 2019 09:47:49 -0500
Subject: [PATCH] commit-graph: set commit positions for --reachable

When running 'git commit-graph write --changed-paths', we sort the
commits by pack-order to save time when computing the changed-paths
bloom filters. This does not help when finding the commits via the
--reachable flag.

Add calls to set_commit_pos() when walking the reachable commits,
which provides an ordering similar to a topological ordering.

Unfortunately, the performance did not improve with this change.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/commit-graph.c b/commit-graph.c
index bf6c663772..a6c4ab401e 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1126,6 +1126,8 @@ static void add_missing_parents(struct write_commit_graph_context *ctx, struct c
 			oidcpy(&ctx->oids.list[ctx->oids.nr], &(parent->item->object.oid));
 			ctx->oids.nr++;
 			parent->item->object.flags |= REACHABLE;
+
+			set_commit_pos(ctx->r, &parent->item->object.oid);
 		}
 	}
 }
@@ -1142,8 +1144,10 @@ static void close_reachable(struct write_commit_graph_context *ctx)
 	for (i = 0; i < ctx->oids.nr; i++) {
 		display_progress(ctx->progress, i + 1);
 		commit = lookup_commit(ctx->r, &ctx->oids.list[i]);
-		if (commit)
+		if (commit) {
 			commit->object.flags |= REACHABLE;
+			set_commit_pos(ctx->r, &commit->object.oid);
+		}
 	}
 	stop_progress(&ctx->progress);
 
-- 
2.25.0.rc0


^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH 2/3] commit-graph: free large diffs, too
  2019-12-22  9:32   ` [PATCH 2/3] commit-graph: free large diffs, too Jeff King
@ 2019-12-27 14:52     ` Derrick Stolee
  0 siblings, 0 replies; 150+ messages in thread
From: Derrick Stolee @ 2019-12-27 14:52 UTC (permalink / raw)
  To: Jeff King, Garima Singh via GitGitGadget
  Cc: git, szeder.dev, jonathantanmy, jeffhost, me, Junio C Hamano

On 12/22/2019 4:32 AM, Jeff King wrote:
> If a diff we compute for --changed-path has more than 512 entries, we
> don't bother generating a bloom filter for it. But since we don't
> iterate over diff_queued_diff, we also don't free the filepairs and
> filespecs from the diff before clearing the queue. Let's make sure we do
> so.
> 
> This drops the peak heap usage of "commit-graph write --changed-paths"
> on linux.git from ~8GB to ~4GB.

In my testing, the heap size went from ~10gb to ~6gb.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH 3/3] commit-graph: stop using full rev_info for diffs
  2019-12-22  9:32   ` [PATCH 3/3] commit-graph: stop using full rev_info for diffs Jeff King
@ 2019-12-27 14:53     ` Derrick Stolee
  0 siblings, 0 replies; 150+ messages in thread
From: Derrick Stolee @ 2019-12-27 14:53 UTC (permalink / raw)
  To: Jeff King, Garima Singh via GitGitGadget
  Cc: git, szeder.dev, jonathantanmy, jeffhost, me, Junio C Hamano

On 12/22/2019 4:32 AM, Jeff King wrote:
> When we perform a diff to get the set of changed paths for a commit,
> we initialize a full "struct rev_info" with setup_revisions(). But the
> only part of it we use is the diff_options struct. Besides being overly
> complex, this also leaks memory, as we use the fake argv to
> setup_revisions() create a pending array which is never cleared.
> 
> Let's just use diff_options directly. This reduces the peak heap usage
> of "git commit-graph write --changed-paths" on linux.git from ~4GB to
> ~1.2GB.

In my testing, this went from ~6gb to ~4gb.

I'm guessing that my memory difference is related to how poorly my
packs are repacked/redeltified.

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH 0/9] [RFC] Changed Paths Bloom Filters
  2019-12-22  9:30 ` Jeff King
                     ` (3 preceding siblings ...)
  2019-12-26 14:21   ` [PATCH 0/9] [RFC] Changed Paths Bloom Filters Derrick Stolee
@ 2019-12-27 16:11   ` Derrick Stolee
  2019-12-29  6:24     ` Jeff King
  4 siblings, 1 reply; 150+ messages in thread
From: Derrick Stolee @ 2019-12-27 16:11 UTC (permalink / raw)
  To: Jeff King, Garima Singh via GitGitGadget
  Cc: git, szeder.dev, jonathantanmy, jeffhost, me, Junio C Hamano

On 12/22/2019 4:30 AM, Jeff King wrote:
> On Fri, Dec 20, 2019 at 10:05:11PM +0000, Garima Singh via GitGitGadget wrote:
> 
>> Adopting changed path bloom filters has been discussed on the list before,
>> and a prototype version was worked on by SZEDER Gábor, Jonathan Tan and Dr.
>> Derrick Stolee [1]. This series is based on Dr. Stolee's approach [2] and
>> presents an updated and more polished RFC version of the feature.
> 
> Great to see progress here. I probably won't have time to review this
> carefully before the new year, but I did notice some low-hanging fruit
> on the generation side.
> 
> So here are a few patches to reduce the CPU and memory usage. They could
> be squashed in at the appropriate spots, or perhaps taken as inspiration
> if there are better solutions (especially for the first one).

I tested these patches with the Linux kernel repo and reported my results
on each patch. However, I wanted to also test on a larger internal repo
(the AzureDevOps repo), which has ~500 commits with more than 512 changes,
and generally has larger diffs than the Linux kernel repo.

| Version  | Time   | Memory |
|----------|--------|--------|
| Garima   | 16m36s | 27.0gb |
| Peff 1   | 6m32s  | 28.0gb |
| Peff 2   | 6m48s  |  5.6gb |
| Peff 3   | 6m14s  |  4.5gb |
| Shortcut | 3m47s  |  4.5gb |

For reference, I found the time and memory information using
"/usr/bin/time --verbose" in a bash script.
 
> I think we could go further still, by actually doing a non-recursive
> diff_tree_oid(), and then recursing into sub-trees ourselves. That would
> save us having to split apart each path to add the leading paths to the
> hashmap (most of which will be duplicates if the commit touched "a/b/c"
> and "a/b/d", etc). I doubt it would be that huge a speedup though. We
> have to keep a list of the touched paths anyway (since the bloom key
> parameters depend on the number of entries), and most of the time is
> almost certainly spent inflating the trees in the first place. However
> it might be easier to follow the code, and it would make it simpler to
> stop traversing at the 512-entry limit, rather than generating a huge
> diff only to throw it away.

By "Shortcut" in the table above, I mean the following patch on top of
Garima's and Peff's changes. It inserts a max_changes option into struct
diff_options to halt the diff early. This seemed like an easier change
than creating a new tree diff algorithm wholesale.

Thanks,
-Stolee

-->8--

From: Derrick Stolee <dstolee@microsoft.com>
Date: Fri, 27 Dec 2019 10:13:48 -0500
Subject: [PATCH] diff: halt tree-diff early after max_changes

When computing the changed-paths bloom filters for the commit-graph,
we limit the size of the filter by restricting the number of paths
in the diff. Instead of computing a large diff and then ignoring the
result, it is better to halt the diff computation early.

Create a new "max_changes" option in struct diff_options. If non-zero,
then halt the diff computation after discovering strictly more changed
paths. This includes paths corresponding to trees that change.

Use this max_changes option in the bloom filter calculations. This
reduces the time taken to compute the filters for the Linux kernel
repo from 2m50s to 2m35s. For a larger repo with more commits changing
many paths, the time reduces from 6 minutes to under 4 minutes.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 bloom.c     | 4 +++-
 diff.h      | 5 +++++
 tree-diff.c | 5 +++++
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/bloom.c b/bloom.c
index ea77631cc2..83dde2378b 100644
--- a/bloom.c
+++ b/bloom.c
@@ -155,6 +155,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 	struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
 	int i;
 	struct diff_options diffopt;
+	int max_changes = 512;
 
 	filter = bloom_filter_slab_at(&bloom_filters, c);
 
@@ -171,6 +172,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 
 	repo_diff_setup(r, &diffopt);
 	diffopt.flags.recursive = 1;
+	diffopt.max_changes = max_changes;
 	diff_setup_done(&diffopt);
 
 	if (c->parents)
@@ -179,7 +181,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 		diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
 	diffcore_std(&diffopt);
 
-	if (diff_queued_diff.nr <= 512) {
+	if (diff_queued_diff.nr <= max_changes) {
 		struct hashmap pathmap;
 		struct pathmap_hash_entry* e;
 		struct hashmap_iter iter;
diff --git a/diff.h b/diff.h
index 6febe7e365..9443dc1b00 100644
--- a/diff.h
+++ b/diff.h
@@ -285,6 +285,11 @@ struct diff_options {
 	/* Number of hexdigits to abbreviate raw format output to. */
 	int abbrev;
 
+	/* If non-zero, then stop computing after this many changes. */
+	int max_changes;
+	/* For internal use only. */
+	int num_changes;
+
 	int ita_invisible_in_index;
 /* white-space error highlighting */
 #define WSEH_NEW (1<<12)
diff --git a/tree-diff.c b/tree-diff.c
index 33ded7f8b3..16a21d9f34 100644
--- a/tree-diff.c
+++ b/tree-diff.c
@@ -434,6 +434,9 @@ static struct combine_diff_path *ll_diff_tree_paths(
 		if (diff_can_quit_early(opt))
 			break;
 
+		if (opt->max_changes && opt->num_changes > opt->max_changes)
+			break;
+
 		if (opt->pathspec.nr) {
 			skip_uninteresting(&t, base, opt);
 			for (i = 0; i < nparent; i++)
@@ -518,6 +521,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
 
 			/* t↓ */
 			update_tree_entry(&t);
+			opt->num_changes++;
 		}
 
 		/* t > p[imin] */
@@ -535,6 +539,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
 		skip_emit_tp:
 			/* ∀ pi=p[imin]  pi↓ */
 			update_tp_entries(tp, nparent);
+			opt->num_changes++;
 		}
 	}
 
-- 
2.25.0.rc0


^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH 0/9] [RFC] Changed Paths Bloom Filters
  2019-12-26 14:21   ` [PATCH 0/9] [RFC] Changed Paths Bloom Filters Derrick Stolee
@ 2019-12-29  6:03     ` Jeff King
  0 siblings, 0 replies; 150+ messages in thread
From: Jeff King @ 2019-12-29  6:03 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Garima Singh via GitGitGadget, git, szeder.dev, jonathantanmy,
	jeffhost, me, Junio C Hamano

On Thu, Dec 26, 2019 at 09:21:36AM -0500, Derrick Stolee wrote:

> Here are some recommendations (to Garima) for how to proceed with these
> patches. Please let me know if anyone disagrees.
> 
> >   [1/3]: commit-graph: examine changed-path objects in pack order
> 
> This one is best kept as its own patch, as it shows a clear reason why
> we want to do the sort-by-position. It would also be a complicated
> patch to include this logic along with the first use of
> compute_bloom_filters().

Yeah, I'd agree this one could be a separate patch. It does need more
work, though (as you found out, it does not cover --reachable at all).

The position counter also probably ought to be an unsigned (or even a
uint32_t, which we usually consider a maximum bound for number of
objects).

-Peff

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH 1/3] commit-graph: examine changed-path objects in pack order
  2019-12-27 14:51     ` Derrick Stolee
@ 2019-12-29  6:12       ` Jeff King
  2019-12-29  6:28         ` Jeff King
  2019-12-30 14:37         ` Derrick Stolee
  0 siblings, 2 replies; 150+ messages in thread
From: Jeff King @ 2019-12-29  6:12 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Garima Singh via GitGitGadget, git, szeder.dev, jonathantanmy,
	jeffhost, me, Junio C Hamano

On Fri, Dec 27, 2019 at 09:51:02AM -0500, Derrick Stolee wrote:

> On 12/22/2019 4:32 AM, Jeff King wrote:
> > Looking at the diff of commit objects in pack order is much faster than
> > in sha1 order, as it gives locality to the access of tree deltas
> > (whereas sha1 order is effectively random). Unfortunately the
> > commit-graph code sorts the commits (several times, sometimes as an oid
> > and sometimes a pointer-to-commit), and we ultimately traverse in sha1
> > order.
> > 
> > Instead, let's remember the position at which we see each commit, and
> > traverse in that order when looking at bloom filters. This drops my time
> > for "git commit-graph write --changed-paths" in linux.git from ~4
> > minutes to ~1.5 minutes.
> 
> I'm doing my own perf tests on these patches, and my copy of linux.git
> has four packs of varying sizes (corresponding with my rare fetches and
> lack of repacks). My time goes from 3m50s to 3m00s. I was confused at
> first, but then realized that I used the "--reachable" flag. In that
> case, we never run set_commit_pos(), so all positions are equal and the
> sort is not helpful.
> 
> I thought that inserting some set_commit_pos() calls into close_reachable()
> and add_missing_parents() would give some amount of time-order to the
> commits as we compute the filters. However, the time did not change at
> all.
> 
> I've included the patch below for reference, anyway.

Yeah, I expected that would cover it, too. But instrumenting it to dump
the position of each commit (see patch below), and then decorating "git
log" output with the positions (see script below) shows that we're all
over the map:

  *   3
  |\  
  | * 2791
  | * 5476
  | * 8520
  | * 12040
  | * 16036
  * |   2790
  |\ \  
  | * | 5475
  | * | 8519
  | * | 12039
  | * | 16035
  | * | 20517
  | * | 25527
  | |/  
  * |   5474
  |\ \  
  | * | 8518
  | * | 12038
  * | |   8517
  [...]

I think the root issue is that we never do any date-sorting on the
commits. So:

  - we hit each ref tip in lexical order; with tags, this is quite often
    the opposite of reverse-chronological

  - we traverse breadth-first, but we don't order queue at all. So if we
    see a merge X, then we'll next process X^1 and X^2, and then X^1^,
    and then X^2^, and so forth. So we keep digging equally down
    simultaneous branches, even if one branch is way shorter than the
    other. Whereas a regular Git traversal will order the queue by
    commit timestamp, so it tends to be roughly chronological (of course
    a topo-sort would work too, but that's probably overkill).

I wonder if this would be simpler if "commit-graph --reachable" just
used the regular revision machinery instead of doing its own custom
traversal.

-Peff

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH 0/9] [RFC] Changed Paths Bloom Filters
  2019-12-27 16:11   ` Derrick Stolee
@ 2019-12-29  6:24     ` Jeff King
  2019-12-30 16:04       ` Derrick Stolee
  2019-12-30 17:02       ` Junio C Hamano
  0 siblings, 2 replies; 150+ messages in thread
From: Jeff King @ 2019-12-29  6:24 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Garima Singh via GitGitGadget, git, szeder.dev, jonathantanmy,
	jeffhost, me, Junio C Hamano

On Fri, Dec 27, 2019 at 11:11:37AM -0500, Derrick Stolee wrote:

> > So here are a few patches to reduce the CPU and memory usage. They could
> > be squashed in at the appropriate spots, or perhaps taken as inspiration
> > if there are better solutions (especially for the first one).
> 
> I tested these patches with the Linux kernel repo and reported my results
> on each patch. However, I wanted to also test on a larger internal repo
> (the AzureDevOps repo), which has ~500 commits with more than 512 changes,
> and generally has larger diffs than the Linux kernel repo.
> 
> | Version  | Time   | Memory |
> |----------|--------|--------|
> | Garima   | 16m36s | 27.0gb |
> | Peff 1   | 6m32s  | 28.0gb |
> | Peff 2   | 6m48s  |  5.6gb |
> | Peff 3   | 6m14s  |  4.5gb |
> | Shortcut | 3m47s  |  4.5gb |
> 
> For reference, I found the time and memory information using
> "/usr/bin/time --verbose" in a bash script.

Thanks for giving it more exercise. My heap profiling was done with
massif, which measures the heap directly. Measuring RSS would cover
that, but will also include the mmap'd packfiles. That's probably why
your linux.git numbers were slightly higher than mine.

(massif is a really great tool if you haven't used it, as it also shows
which allocations were using the memory. But it's part of valgrind, so
it definitely doesn't run on native Windows. It might work under WSL,
though. I'm sure there are also other heap profilers on Windows).

> By "Shortcut" in the table above, I mean the following patch on top of
> Garima's and Peff's changes. It inserts a max_changes option into struct
> diff_options to halt the diff early. This seemed like an easier change
> than creating a new tree diff algorithm wholesale.

Yeah, I'm not opposed to a diff feature like this.

But be careful, because...

> diff --git a/diff.h b/diff.h
> index 6febe7e365..9443dc1b00 100644
> --- a/diff.h
> +++ b/diff.h
> @@ -285,6 +285,11 @@ struct diff_options {
>  	/* Number of hexdigits to abbreviate raw format output to. */
>  	int abbrev;
>  
> +	/* If non-zero, then stop computing after this many changes. */
> +	int max_changes;
> +	/* For internal use only. */
> +	int num_changes;

This is holding internal state in diff_options, but the same
diff_options is often used for multiple diffs (e.g., "git log --raw"
would use the same rev_info.diffopt over and over again).

So it would need to be cleared between diffs. There's a similar feature
in the "has_changes" flag, though it looks like it is cleared manually
by callers. Yuck.

This isn't a problem for commit-graph right now, but:

  - it actually could be using a single diff_options, which would be
    slightly simpler (it doesn't seem to save much CPU, though, because
    the initialization is relatively cheap)

  - it's a bit of a subtle bug to leave hanging around for the next
    person who tries to use the feature

I actually wonder if this could be rolled into the has_changes and
diff_can_quit_early() feature. This really just a generalization of that
feature (which is like setting max_changes to "1").

-Peff

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH 1/3] commit-graph: examine changed-path objects in pack order
  2019-12-29  6:12       ` Jeff King
@ 2019-12-29  6:28         ` Jeff King
  2019-12-30 14:37         ` Derrick Stolee
  1 sibling, 0 replies; 150+ messages in thread
From: Jeff King @ 2019-12-29  6:28 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Garima Singh via GitGitGadget, git, szeder.dev, jonathantanmy,
	jeffhost, me, Junio C Hamano

On Sun, Dec 29, 2019 at 01:12:46AM -0500, Jeff King wrote:

> Yeah, I expected that would cover it, too. But instrumenting it to dump
> the position of each commit (see patch below), and then decorating "git
> log" output with the positions (see script below) shows that we're all
> over the map:

I forgot the patch, of course. :)

I just dumped this trace:

---
diff --git a/commit-graph.c b/commit-graph.c
index a6c4ab401e..1cb77be45f 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -61,6 +61,7 @@ static void set_commit_pos(struct repository *r, const struct object_id *oid)
 	if (!commit)
 		return; /* should never happen, but be lenient */
 
+	trace_printf("pos %s = %d", oid_to_hex(oid), max_pos);
 	*commit_pos_at(&commit_pos, commit) = max_pos++;
 }
 

with:

  rm -f .git/objects/info/commit-graph
  GIT_TRACE=$PWD/trace.out git commit-graph write --changed-paths --reachable

and then used:

  cat >foo.pl <<\EOF
  #!/usr/bin/perl
  
  my %deco = do {
  	open(my $fh, '<', 'trace.out');
  	map { /pos (\S+) = (\d+)/ ? ($1 => $2) : () } <$fh>
  };
  while (<>) {
  	s/([0-9a-f]{40})/$deco{$1}/;
  	print;
  }
  EOF

like so:

  git log --graph --format=%H |
  perl foo.pl |
  less

-Peff

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH 1/3] commit-graph: examine changed-path objects in pack order
  2019-12-29  6:12       ` Jeff King
  2019-12-29  6:28         ` Jeff King
@ 2019-12-30 14:37         ` Derrick Stolee
  2019-12-30 14:51           ` Derrick Stolee
  1 sibling, 1 reply; 150+ messages in thread
From: Derrick Stolee @ 2019-12-30 14:37 UTC (permalink / raw)
  To: Jeff King
  Cc: Garima Singh via GitGitGadget, git, szeder.dev, jonathantanmy,
	jeffhost, me, Junio C Hamano

On 12/29/2019 1:12 AM, Jeff King wrote:
> On Fri, Dec 27, 2019 at 09:51:02AM -0500, Derrick Stolee wrote:
> 
>> On 12/22/2019 4:32 AM, Jeff King wrote:
>>> Looking at the diff of commit objects in pack order is much faster than
>>> in sha1 order, as it gives locality to the access of tree deltas
>>> (whereas sha1 order is effectively random). Unfortunately the
>>> commit-graph code sorts the commits (several times, sometimes as an oid
>>> and sometimes a pointer-to-commit), and we ultimately traverse in sha1
>>> order.
>>>
>>> Instead, let's remember the position at which we see each commit, and
>>> traverse in that order when looking at bloom filters. This drops my time
>>> for "git commit-graph write --changed-paths" in linux.git from ~4
>>> minutes to ~1.5 minutes.
>>
>> I'm doing my own perf tests on these patches, and my copy of linux.git
>> has four packs of varying sizes (corresponding with my rare fetches and
>> lack of repacks). My time goes from 3m50s to 3m00s. I was confused at
>> first, but then realized that I used the "--reachable" flag. In that
>> case, we never run set_commit_pos(), so all positions are equal and the
>> sort is not helpful.
>>
>> I thought that inserting some set_commit_pos() calls into close_reachable()
>> and add_missing_parents() would give some amount of time-order to the
>> commits as we compute the filters. However, the time did not change at
>> all.
>>
>> I've included the patch below for reference, anyway.
> 
> Yeah, I expected that would cover it, too. But instrumenting it to dump
> the position of each commit (see patch below), and then decorating "git
> log" output with the positions (see script below) shows that we're all
> over the map:
> 
>   *   3
>   |\  
>   | * 2791
>   | * 5476
>   | * 8520
>   | * 12040
>   | * 16036
>   * |   2790
>   |\ \  
>   | * | 5475
>   | * | 8519
>   | * | 12039
>   | * | 16035
>   | * | 20517
>   | * | 25527
>   | |/  
>   * |   5474
>   |\ \  
>   | * | 8518
>   | * | 12038
>   * | |   8517
>   [...]

This makes a lot of sense why the previous approach did not work. Thanks!

> I think the root issue is that we never do any date-sorting on the
> commits. So:
> 
>   - we hit each ref tip in lexical order; with tags, this is quite often
>     the opposite of reverse-chronological
> 
>   - we traverse breadth-first, but we don't order queue at all. So if we
>     see a merge X, then we'll next process X^1 and X^2, and then X^1^,
>     and then X^2^, and so forth. So we keep digging equally down
>     simultaneous branches, even if one branch is way shorter than the
>     other. Whereas a regular Git traversal will order the queue by
>     commit timestamp, so it tends to be roughly chronological (of course
>     a topo-sort would work too, but that's probably overkill).
> 
> I wonder if this would be simpler if "commit-graph --reachable" just
> used the regular revision machinery instead of doing its own custom
> traversal.

Instead, why not use our already-computed generation numbers? That seems
to improve the time a bit. (6m30s to 4m50s)

-->8--

From: Derrick Stolee <dstolee@microsoft.com>
Date: Fri, 27 Dec 2019 09:47:49 -0500
Subject: [PATCH] commit-graph: examine commits by generation number

When running 'git commit-graph write --changed-paths', we sort the
commits by pack-order to save time when computing the changed-paths
bloom filters. This does not help when finding the commits via the
--reachable flag.

If not using pack-order, then sort by generation number before
examining the diff. Commits with similar generation are more likely
to have many trees in common, making the diff faster.

On the Linux kernel repository, this change reduced the computation
time for 'git commit-graph write --reachable --changed-paths' from
6m30s to 4m50s.

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 33 ++++++++++++++++++++++++++++++---
 1 file changed, 30 insertions(+), 3 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index bf6c663772..fe4ab545f2 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -72,6 +72,25 @@ static int commit_pos_cmp(const void *va, const void *vb)
 	       commit_pos_at(&commit_pos, b);
 }
 
+static int commit_gen_cmp(const void *va, const void *vb)
+{
+	const struct commit *a = *(const struct commit **)va;
+	const struct commit *b = *(const struct commit **)vb;
+
+	/* lower generation commits first */
+	if (a->generation < b->generation)
+		return -1;
+	else if (a->generation > b->generation)
+		return 1;
+
+	/* use date as a heuristic when generations are equal */
+	if (a->date < b->date)
+		return -1;
+	else if (a->date > b->date)
+		return 1;
+	return 0;
+}
+
 char *get_commit_graph_filename(const char *obj_dir)
 {
 	char *filename = xstrfmt("%s/info/commit-graph", obj_dir);
@@ -849,7 +868,8 @@ struct write_commit_graph_context {
 		 report_progress:1,
 		 split:1,
 		 check_oids:1,
-		 bloom:1;
+		 bloom:1,
+		 order_by_pack:1;
 
 	const struct split_commit_graph_opts *split_opts;
 	uint32_t total_bloom_filter_size;
@@ -1245,7 +1265,11 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 
 	ALLOC_ARRAY(sorted_by_pos, ctx->commits.nr);
 	COPY_ARRAY(sorted_by_pos, ctx->commits.list, ctx->commits.nr);
-	QSORT(sorted_by_pos, ctx->commits.nr, commit_pos_cmp);
+
+	if (ctx->order_by_pack)
+		QSORT(sorted_by_pos, ctx->commits.nr, commit_pos_cmp);
+	else
+		QSORT(sorted_by_pos, ctx->commits.nr, commit_gen_cmp);
 
 	for (i = 0; i < ctx->commits.nr; i++) {
 		struct commit *c = sorted_by_pos[i];
@@ -1979,6 +2003,7 @@ int write_commit_graph(const char *obj_dir,
 	}
 
 	if (pack_indexes) {
+		ctx->order_by_pack = 1;
 		if ((res = fill_oids_from_packs(ctx, pack_indexes)))
 			goto cleanup;
 	}
@@ -1988,8 +2013,10 @@ int write_commit_graph(const char *obj_dir,
 			goto cleanup;
 	}
 
-	if (!pack_indexes && !commit_hex)
+	if (!pack_indexes && !commit_hex) {
+		ctx->order_by_pack = 1;
 		fill_oids_from_all_packs(ctx);
+	}
 
 	close_reachable(ctx);
 
-- 
2.25.0.rc0




^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH 1/3] commit-graph: examine changed-path objects in pack order
  2019-12-30 14:37         ` Derrick Stolee
@ 2019-12-30 14:51           ` Derrick Stolee
  0 siblings, 0 replies; 150+ messages in thread
From: Derrick Stolee @ 2019-12-30 14:51 UTC (permalink / raw)
  To: Jeff King
  Cc: Garima Singh via GitGitGadget, git, szeder.dev, jonathantanmy,
	jeffhost, me, Junio C Hamano

On 12/30/2019 9:37 AM, Derrick Stolee wrote:
> On the Linux kernel repository, this change reduced the computation
> time for 'git commit-graph write --reachable --changed-paths' from
> 6m30s to 4m50s.

I apologize, these numbers are based on the AzureDevOps repo, not the
Linux kernel repo. After re-running with the Linux kernel repo my
times improve from 3m00s to 1m37s.

-Stolee



^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH 0/9] [RFC] Changed Paths Bloom Filters
  2019-12-29  6:24     ` Jeff King
@ 2019-12-30 16:04       ` Derrick Stolee
  2019-12-30 17:02       ` Junio C Hamano
  1 sibling, 0 replies; 150+ messages in thread
From: Derrick Stolee @ 2019-12-30 16:04 UTC (permalink / raw)
  To: Jeff King
  Cc: Garima Singh via GitGitGadget, git, szeder.dev, jonathantanmy,
	jeffhost, me, Junio C Hamano

On 12/29/2019 1:24 AM, Jeff King wrote:
> On Fri, Dec 27, 2019 at 11:11:37AM -0500, Derrick Stolee wrote:
> 
>>> So here are a few patches to reduce the CPU and memory usage. They could
>>> be squashed in at the appropriate spots, or perhaps taken as inspiration
>>> if there are better solutions (especially for the first one).
>>
>> I tested these patches with the Linux kernel repo and reported my results
>> on each patch. However, I wanted to also test on a larger internal repo
>> (the AzureDevOps repo), which has ~500 commits with more than 512 changes,
>> and generally has larger diffs than the Linux kernel repo.
>>
>> | Version  | Time   | Memory |
>> |----------|--------|--------|
>> | Garima   | 16m36s | 27.0gb |
>> | Peff 1   | 6m32s  | 28.0gb |
>> | Peff 2   | 6m48s  |  5.6gb |
>> | Peff 3   | 6m14s  |  4.5gb |
>> | Shortcut | 3m47s  |  4.5gb |
>>
>> For reference, I found the time and memory information using
>> "/usr/bin/time --verbose" in a bash script.
> 
> Thanks for giving it more exercise. My heap profiling was done with
> massif, which measures the heap directly. Measuring RSS would cover
> that, but will also include the mmap'd packfiles. That's probably why
> your linux.git numbers were slightly higher than mine.

That's interesting. I initially avoided massif because it is so much
slower than /usr/bin/time. However, the inflated numbers could be
explained by that. Also, the distinction between mem_heap and
mem_heap_extra may have interesting implications. Looking online, it
seems that large mem_heap_extra implies the heap is fragmented from
many small allocations.

Here are my findings on the Linux repo:

| Version  | mem_heap | mem_heap_extra |
|----------|----------|----------------|
| Peff 1   |  6,500mb |          913mb |
| Peff 2   |  3,100mb |          286mb |
| Peff 3   |    781mb |          235mb |

These numbers more closely match your numbers (in sum of the two
columns).

> (massif is a really great tool if you haven't used it, as it also shows
> which allocations were using the memory. But it's part of valgrind, so
> it definitely doesn't run on native Windows. It might work under WSL,
> though. I'm sure there are also other heap profilers on Windows).

I am using my Linux machine for my tests. Garima is using her Windows
machine.

>> By "Shortcut" in the table above, I mean the following patch on top of
>> Garima's and Peff's changes. It inserts a max_changes option into struct
>> diff_options to halt the diff early. This seemed like an easier change
>> than creating a new tree diff algorithm wholesale.
> 
> Yeah, I'm not opposed to a diff feature like this.
> 
> But be careful, because...
> 
>> diff --git a/diff.h b/diff.h
>> index 6febe7e365..9443dc1b00 100644
>> --- a/diff.h
>> +++ b/diff.h
>> @@ -285,6 +285,11 @@ struct diff_options {
>>  	/* Number of hexdigits to abbreviate raw format output to. */
>>  	int abbrev;
>>  
>> +	/* If non-zero, then stop computing after this many changes. */
>> +	int max_changes;
>> +	/* For internal use only. */
>> +	int num_changes;
> 
> This is holding internal state in diff_options, but the same
> diff_options is often used for multiple diffs (e.g., "git log --raw"
> would use the same rev_info.diffopt over and over again).
> 
> So it would need to be cleared between diffs. There's a similar feature
> in the "has_changes" flag, though it looks like it is cleared manually
> by callers. Yuck.

You're right about this. What if we initialize it in diff_tree_paths()
before it calls the recursive ll_difF_tree_paths()?

> This isn't a problem for commit-graph right now, but:
> 
>   - it actually could be using a single diff_options, which would be
>     slightly simpler (it doesn't seem to save much CPU, though, because
>     the initialization is relatively cheap)
> 
>   - it's a bit of a subtle bug to leave hanging around for the next
>     person who tries to use the feature
> 
> I actually wonder if this could be rolled into the has_changes and
> diff_can_quit_early() feature. This really just a generalization of that
> feature (which is like setting max_changes to "1").

I thought about this at first, but it only takes a struct diff_options
right now. It does have an internally-mutated member (flags.has_changes)
but it also seems a bit wrong to add a uint32_t of the count in this.
Changing the prototype could be messy, too.

There are also multiple callers, and limiting everything to tree-diff.c
limits the impact.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH 0/9] [RFC] Changed Paths Bloom Filters
  2019-12-29  6:24     ` Jeff King
  2019-12-30 16:04       ` Derrick Stolee
@ 2019-12-30 17:02       ` Junio C Hamano
  1 sibling, 0 replies; 150+ messages in thread
From: Junio C Hamano @ 2019-12-30 17:02 UTC (permalink / raw)
  To: Jeff King
  Cc: Derrick Stolee, Garima Singh via GitGitGadget, git, szeder.dev,
	jonathantanmy, jeffhost, me

Jeff King <peff@peff.net> writes:

> This is holding internal state in diff_options, but the same
> diff_options is often used for multiple diffs (e.g., "git log --raw"
> would use the same rev_info.diffopt over and over again).
>
> So it would need to be cleared between diffs. There's a similar feature
> in the "has_changes" flag, though it looks like it is cleared manually
> by callers. Yuck.

Do you mean we want reset_per_invocation_part_of_diff_options()
helper or something?

> I actually wonder if this could be rolled into the has_changes and
> diff_can_quit_early() feature. This really just a generalization of that
> feature (which is like setting max_changes to "1").

Yeah, I wondered about the same thing, after seeing the impressive
numbers ;-)

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH 0/9] [RFC] Changed Paths Bloom Filters
  2019-12-20 22:05 [PATCH 0/9] [RFC] Changed Paths Bloom Filters Garima Singh via GitGitGadget
                   ` (11 preceding siblings ...)
  2019-12-22  9:30 ` Jeff King
@ 2019-12-31 16:45 ` Jakub Narebski
  2020-01-13 16:54   ` Garima Singh
  2020-01-21 23:40 ` Emily Shaffer
                   ` (2 subsequent siblings)
  15 siblings, 1 reply; 150+ messages in thread
From: Jakub Narebski @ 2019-12-31 16:45 UTC (permalink / raw)
  To: Garima Singh via GitGitGadget
  Cc: git, stolee, szeder.dev, jonathantanmy, jeffhost, me, peff,
	Junio C Hamano

"Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:

> Hey! 
>
> The commit graph feature brought in a lot of performance improvements across
> multiple commands. However, file based history continues to be a performance
> pain point, especially in large repositories. 
>
> Adopting changed path bloom filters has been discussed on the list before,
> and a prototype version was worked on by SZEDER Gábor, Jonathan Tan and Dr.
> Derrick Stolee [1]. This series is based on Dr. Stolee's approach [2] and
> presents an updated and more polished RFC version of the feature. 

It is nice to have this picked up for upstream, finally.  The proof of
concept works[1][2] were started more than a year ago.

On the other hand slow and steady adoption of commit-graph serialization
and then extending it (generation numbers, topological sort, incremental
update) feels like a good approach.

> Performance Gains: We tested the performance of 'git log -- <path>' on the git
> repo, the linux repo and some internal large repos, with a variety of paths
> of varying depths.
>
> On the git and linux repos: We observed a 2x to 5x speed up.
>
> On a large internal repo with files seated 6-10 levels deep in the tree: We
> observed 10x to 20x speed ups, with some paths going up to 28 times faster.

Could you provide some more statistics about this internal repository,
such as number of files, number of commits, perhaps also number of all
objects?  Thanks in advance.

I wonder why such large difference in performance 2-5x vs 10-20x.  Is it
about the depth of the file hierarchy?  How would the numbers look for
files seated closer to the root in the same large repository, like 3-5
levels deep in the tree?

> Future Work (not included in the scope of this series):
>
>  1. Supporting multiple path based revision walk

I wonder if it would ever be possible to support globbing, e.g. '*.c'

>  2. Adopting it in git blame logic.

What about 'git log --follow <path>'?

>  3. Interactions with line log git log -L
>
> This series is intended to start the conversation and many of the commit
> messages include specific call outs for suggestions and thoughts. 
>
> Cheers! Garima Singh
>
> [1] https://lore.kernel.org/git/20181009193445.21908-1-szeder.dev@gmail.com/
> [2] https://lore.kernel.org/git/61559c5b-546e-d61b-d2e1-68de692f5972@gmail.com/
>
> Garima Singh (9):
>   commit-graph: add --changed-paths option to write

This summary is not easy to understand on first glance.  Maybe:

    commit-graph: add --changed-paths option to the write subcommand

or

    commit-graph: add --changed-paths option to 'git commit-graph write'

would be better?

>   commit-graph: write changed paths bloom filters
>   commit-graph: use MAX_NUM_CHUNKS
>   commit-graph: document bloom filter format
>   commit-graph: write changed path bloom filters to commit-graph file.
>   commit-graph: test commit-graph write --changed-paths
>   commit-graph: reuse existing bloom filters during write.
>   revision.c: use bloom filters to speed up path based revision walks
>   commit-graph: add GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS test flag
>
>  Documentation/git-commit-graph.txt            |   5 +
>  .../technical/commit-graph-format.txt         |  17 ++
>  Makefile                                      |   1 +
>  bloom.c                                       | 257 +++++++++++++++++
>  bloom.h                                       |  51 ++++
>  builtin/commit-graph.c                        |   9 +-
>  ci/run-build-and-tests.sh                     |   1 +
>  commit-graph.c                                | 116 +++++++-
>  commit-graph.h                                |   9 +-
>  revision.c                                    |  67 ++++-
>  revision.h                                    |   5 +
>  t/README                                      |   3 +
>  t/helper/test-read-graph.c                    |   4 +
>  t/t4216-log-bloom.sh                          |  77 ++++++
>  t/t5318-commit-graph.sh                       |   2 +
>  t/t5324-split-commit-graph.sh                 |   1 +
>  t/t5325-commit-graph-bloom.sh                 | 258 ++++++++++++++++++
>  17 files changed, 875 insertions(+), 8 deletions(-)
>  create mode 100644 bloom.c
>  create mode 100644 bloom.h
>  create mode 100755 t/t4216-log-bloom.sh
>  create mode 100755 t/t5325-commit-graph-bloom.sh
>
>
> base-commit: b02fd2accad4d48078671adf38fe5b5976d77304
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-497%2Fgarimasi514%2FcoreGit-bloomFilters-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-497/garimasi514/coreGit-bloomFilters-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/497

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH 0/9] [RFC] Changed Paths Bloom Filters
  2019-12-22  9:38   ` Jeff King
@ 2020-01-01 12:04     ` Jakub Narebski
  0 siblings, 0 replies; 150+ messages in thread
From: Jakub Narebski @ 2020-01-01 12:04 UTC (permalink / raw)
  To: Jeff King
  Cc: Christian Couder, Garima Singh via GitGitGadget, git,
	Derrick Stolee, SZEDER Gábor, Jonathan Tan, Jeff Hostetler,
	Taylor Blau, Junio C Hamano

Jeff King <peff@peff.net> writes:

> On Sun, Dec 22, 2019 at 10:26:20AM +0100, Christian Couder wrote:
>
>> I have a question though. Are the performance gains only available
>> with `git log -- path` or are they already available for example when
>> doing a partial clone and/or a sparse checkout?
>
> From my quick look at the code, anything that feeds a pathspec to a
> revision traversal would be helped. I'm not sure if it would help for
> partial/sparse traversals, though. There we actually need to know which
> blobs correspond to the paths in question, not just whether any
> particular commit touched them.
>
> I also took a brief look at adding support to the custom blame-tree
> implementation we use at GitHub, and got about a 6x speedup.

Is there any chance of upstreaming the blame-tree algorithm, perhaps as
a separate mode for git-blame (invoked with `git blame <directory>`?
Or is the algorithm too GitHub-specific?

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH 1/9] commit-graph: add --changed-paths option to write
  2019-12-20 22:05 ` [PATCH 1/9] commit-graph: add --changed-paths option to write Garima Singh via GitGitGadget
@ 2020-01-01 20:20   ` Jakub Narebski
  0 siblings, 0 replies; 150+ messages in thread
From: Jakub Narebski @ 2020-01-01 20:20 UTC (permalink / raw)
  To: Garima Singh via GitGitGadget
  Cc: git, stolee, szeder.dev, jonathantanmy, jeffhost, me, peff,
	Junio C Hamano, Garima Singh

"Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Garima Singh <garima.singh@microsoft.com>
>
> Add --changed-paths option to git commit-graph write. This option will
> soon allow users to compute bloom filters for the paths changed between
> a commit and its first significant parent, and write this information
> into the commit-graph file.

A slightly nitpicky comment.

First, I think it is "Bloom filter", not "bloom filter" (from the name
of the person that discovered them, Burton Howard Bloom).

Second, I would rather that the commit message started with at least one
sentence of describing purpose of this new option, not going straight to
the technical details (i.e. using Bloom filters).  Or in any other way
describe that this option would make Git store some helper data that
would help find out faster if a given path was changed in given commit.

> Note: This commit does not change any behavior. It only introduces
> the option and passes down the appropriate flag to the commit-graph.

All right.

Personally, I don't have strong opinion for or against separating this
change into its own patch.

> RFC Notes:
> 1. We named the option --changed-paths to capture what the option does,
>    instead of how it does it. The current implementation does this
>    using bloom filters. We believe using --changed-paths however keeps
>    the implementation open to other data structures.
>    All thoughts and suggestions for the name and this approach are
>    welcome

It is all right name.  Another option could be for example
`git commit-graph write --changeset-info`, or something like that.

>
> 2. Currently, a subsequent commit in this series will add tests that
>    exercise this option. I plan to split that test commit across the
>    series as appropriate.

There is another thing, but one that could be left for the followup
series, namely the configuration variables for this behavior.  In the
future it should be possible to switch some configuration variable to
have this feature on by default when manually or automatically running
`git commit-graph write`.

>
> Helped-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> ---
>  Documentation/git-commit-graph.txt | 5 +++++
>  builtin/commit-graph.c             | 9 +++++++--
>  commit-graph.h                     | 3 ++-
>  3 files changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
> index bcd85c1976..1efe6e5c5a 100644
> --- a/Documentation/git-commit-graph.txt
> +++ b/Documentation/git-commit-graph.txt

It is nice to have option documented.

All right, the 'write' subcommand has the following synopsis:

  'git commit-graph write' <options> [--object-dir <dir>] [--[no-]progress]

so the is no need to adjust it when adding a new option.

> @@ -54,6 +54,11 @@ or `--stdin-packs`.)
>  With the `--append` option, include all commits that are present in the
>  existing commit-graph file.
>  +
> +With the `--changed-paths` option, compute and write information about the
> +paths changed between a commit and it's first parent. This operation can
> +take a while on large repositories. It provides significant performance gains
> +for getting file based history logs with `git log`
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This might be not entirely clear for someone that is not familiar with
Git jargon.  Perhaps it would better read as "for getting history of a
directory or a file with `git log <path>`", or something like that.

Side note: the sentence is missing its finishing full stop.

> ++
>  With the `--split` option, write the commit-graph as a chain of multiple
>  commit-graph files stored in `<dir>/info/commit-graphs`. The new commits
>  not already in the commit-graph are added in a new "tip" file. This file
> diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
> index e0c6fc4bbf..9bd1e11161 100644
> --- a/builtin/commit-graph.c
> +++ b/builtin/commit-graph.c
> @@ -9,7 +9,7 @@
>  
>  static char const * const builtin_commit_graph_usage[] = {
>  	N_("git commit-graph verify [--object-dir <objdir>] [--shallow] [--[no-]progress]"),
> -	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--[no-]progress] <split options>"),
> +	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--changed-paths] [--[no-]progress] <split options>"),
>  	NULL
>  };
>  
> @@ -19,7 +19,7 @@ static const char * const builtin_commit_graph_verify_usage[] = {
>  };
>  
>  static const char * const builtin_commit_graph_write_usage[] = {
> -	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--[no-]progress] <split options>"),
> +	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--changed-paths] [--[no-]progress] <split options>"),
>  	NULL
>  };

I was at first wondering why the duplication (not caused by your patch,
though), and then realized that it is to have usage for command, and for
individual subcommands, separately.

> @@ -32,6 +32,7 @@ static struct opts_commit_graph {
>  	int split;
>  	int shallow;
>  	int progress;
> +	int enable_bloom_filters;

Why the field is called `enable_bloom_filters`, while option is called
`--changed-paths`?  I know it is not user-visible thing, so it would be
easy to change if we ever go beyond Bloom filters, though...

So I am not against keeping it as it is currently.

>  } opts;
>  
>  static int graph_verify(int argc, const char **argv)
> @@ -110,6 +111,8 @@ static int graph_write(int argc, const char **argv)
>  			N_("start walk at commits listed by stdin")),
>  		OPT_BOOL(0, "append", &opts.append,
>  			N_("include all commits already in the commit-graph file")),
> +		OPT_BOOL(0, "changed-paths", &opts.enable_bloom_filters,
> +			N_("enable computation for changed paths")),
>  		OPT_BOOL(0, "progress", &opts.progress, N_("force progress reporting")),
>  		OPT_BOOL(0, "split", &opts.split,
>  			N_("allow writing an incremental commit-graph file")),
> @@ -143,6 +146,8 @@ static int graph_write(int argc, const char **argv)
>  		flags |= COMMIT_GRAPH_WRITE_SPLIT;
>  	if (opts.progress)
>  		flags |= COMMIT_GRAPH_WRITE_PROGRESS;
> +	if (opts.enable_bloom_filters)
> +		flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;

Minor nitpick: are we all right having this ordering of options (not for
example having opt.progress last)?

Disregarding this, it looks all right.

>  
>  	read_replace_refs = 0;
>  
> diff --git a/commit-graph.h b/commit-graph.h
> index 7f5c933fa2..952a4b83be 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -76,7 +76,8 @@ enum commit_graph_write_flags {
>  	COMMIT_GRAPH_WRITE_PROGRESS   = (1 << 1),
>  	COMMIT_GRAPH_WRITE_SPLIT      = (1 << 2),
>  	/* Make sure that each OID in the input is a valid commit OID. */
> -	COMMIT_GRAPH_WRITE_CHECK_OIDS = (1 << 3)
> +	COMMIT_GRAPH_WRITE_CHECK_OIDS = (1 << 3),
> +	COMMIT_GRAPH_WRITE_BLOOM_FILTERS = (1 << 4)

I wonder if we should add comment describing the flag, like for the one
above...

>  };
>  
>  struct split_commit_graph_opts {

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH 2/9] commit-graph: write changed paths bloom filters
  2019-12-20 22:05 ` [PATCH 2/9] commit-graph: write changed paths bloom filters Garima Singh via GitGitGadget
  2019-12-21 16:48   ` Philip Oakley
@ 2020-01-06 18:44   ` Jakub Narebski
  2020-01-13 19:48     ` Garima Singh
  1 sibling, 1 reply; 150+ messages in thread
From: Jakub Narebski @ 2020-01-06 18:44 UTC (permalink / raw)
  To: Garima Singh via GitGitGadget
  Cc: git, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
	Jeff Hostetler, Taylor Blau, Jeff King, Junio C Hamano,
	Garima Singh

"Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Garima Singh <garima.singh@microsoft.com>
>
> The changed path bloom filters help determine which paths changed between a
> commit and its first parent. We already have the "--changed-paths" option
> for the "git commit-graph write" subcommand, now actually compute them under
> that option. The COMMIT_GRAPH_WRITE_BLOOM_FILTERS flag enables this
> computation.
>
> RFC Notes: Here are some details about the implementation and I would love
> to know your thoughts and suggestions for improvements here.
>
> For details on what bloom filters are and how they work, please refer to
> Dr. Derrick Stolee's blog post [1].
> [1] https://devblogs.microsoft.com/devops/super-charging-the-git-commit-graph-iv-bloom-filters/
>
> 1. The implementation sticks to the recommended values of 7 and 10 for the
>    number of hashes and the size of each entry, as described in the blog.

Please provide references to original work for this.  Derrick Stolee
blog post references the following work:

  Flavio Bonomi, Michael Mitzenmacher, Rina Panigrahy, Sushil Singh, George Varghese
  "An Improved Construction for Counting Bloom Filters"
  http://theory.stanford.edu/~rinap/papers/esa2006b.pdf
  https://doi.org/10.1007/11841036_61

However, we do not use Counting Bloom Filters, but ordinary Bloom
Filters, if used in untypical way: instead of testing many elements
(keys) against single filter, we test single element (path) against
mainy filters.

Also, I'm not sure that values 10 bits per entry and 7 hash functions
are recommended; the work states:

  "For example, when n/m = 10 and k = 7 the false positive probability
  is just over 0.008."

Given false positive probablity we can calculate best choice for n/m and
k.

On the other hand in https://arxiv.org/abs/1912.08258 we have

  "For efficient memory usage, a Bloom filter with a false-positive
   probability ϵ should use about − log_2(ϵ) hash functions
   [Broder2004]. At a false-positive probability of 1%, seven hash
   functions are thus required."

So k=7 being optimal is somewhat confirmed.

>    The implementation while not completely open to it at the moment, is flexible
>    enough to allow for tweaking these settings in the future.

All right.

>    Note: The performance gains we have observed so far with these values is
>    significant enough to not that we did not need to tweak these settings.
                           ^^^
s/not/note/

Did you try to tweak settings, i.e. numbers of bits per entry, number
of hash functions (which is derivative of the former - at least the
optimal number), the size of the block, the cutoff threshold value?
It is not needed to be in this patch series - fine tuning is probably
better left for later.

>    The cover letter of this series has the details and the commit where we have
>    git log use bloom filters.

The second part of this sentence, from "and the commit..." is a bit
unclear.  Did you mean here that the future / subsequent commit in this
patch series that makes Git actually use Bloom filters in `git log --
<path>` will have more details in its commit message?

> 2. As described in the blog and the linked technical paper therin, we do not need
                                                             ^^^^^^
s/therin/therein/

>    7 independent hashing functions. We use the Murmur3 hashing scheme - seed it
>    twice and then combine those to procure an arbitrary number of hash values.

The "linked technical paper" in the blog post (which I would prefer to
have linked directly to in the commit message) is

  Peter C. Dillinger and Panagiotis Manolios
  "Bloom Filters in Probabilistic Verification"
  http://www.ccs.neu.edu/home/pete/pub/bloom-filters-verification.pdf
  https://doi.org/10.1007/978-3-540-30494-4_26

Sidenote: it looks like it is a reference from Wikipedia on Bloom filters.
This is according to authors the original paper with the _double hashing_
technique.

They also examine in much detail the optimal number of hash functions.

> 3. The filters are sized according to the number of changes in the each commit,
>    with minimum size of one 64 bit word.

Do I understand it correctly that the size of filter is 10*(number of
changed files) bits, rounded up to nearest multiple of 64?

How do you count renames and copies?  As two changes?

Do I understand it correctly that commit with no changes in it (which
can rarely happen) would have 64-bits i.e. 8-bytes Bloom filter of all
zeros: 0x0000000000000000?

How merges are handled?  Does the filter uses all changed files, or just
changes compared to first parent?

>
> [Call for advice] We currently cap writing bloom filters for commits with
> atmost 512 changed files. In the current implementation, we compute the diff,
> and then just throw it away once we see it has more than 512 changes.
> Any suggestiongs on how to reduce the work we are doing in this case are more
> than welcome.

This got solved in "[PATCH] diff: halt tree-diff early after max_changes"
https://public-inbox.org/git/e9a4e4ff-5466-dc39-c3f5-c9a8b8f2f11d@gmail.com/

> [Call for advice] Would the git community like this commit to be split up into
> more granular commits? This commit could possibly be split out further with the
> bloom.c code in its own commit, to be used by the commit-graph in a subsequent
> commit. While I prefer it being contained in one commit this way, I am open to
> suggestions.

I think it might be a good idea to split this commit into purely Bloom
filter implementation (bloom.c) AND unit tests for Bloom filter itself
(which would probably involve some new test-tool).

I have not read further messages in the series [edit: they don't], so I
don't know if such tests already exist or not.  One could test for
negative match, maybe also (for specific choice of hash function) for
positive and maybe even false positive match, for filter size depending
on the number of changes, for changes cap (maybe), maybe also for
no-changes scenario.


As for splitting the main part of the series, I would envision it in the
following way (which is of course only one possibility):

1. Implementation of generic-ish Bloom filter (with elements being
   strings / paths, and optimized to test single key against many
   filters, each taking small-ish space, variable size filter, limit on
   maximum number of elements).

   Technical documentation in comments in bloom.h (description of API)
   and bloom.c (details of the algorithm, with references).

   TODO: test-tool and unit tests.

2. Using per-commit Bloom filter(s) to store changeset information
   i.e. changed paths.  This would implement in-memory storage (on slab)
   and creating Bloom filter out of commit and repository information.

   Perhaps this should also get its own unit tests (that Bloom filter
   catches changed files, and excluding false positivess catches
   unchanged files).

3. Storing per-commit Bloom filters in the commit-graph file:

   a.) writing Bloom filters data to commit-graph file, which means
       designing the chunk(s) format,
   b.) verifying Bloom filter chunks, at least sanity-checks
   c.) reading Bloom filters from commit-graph file into memory

   Perhaps also some integration tests that the information is stored
   and retrieved correctly, and that verifying finds bugs in
   intentionally corrupted Bloom filter chunks.

4. Using Bloom filters to speed up `git log -- <path>` (and similar
   commands).

   It would be nice to have some functional tests, and maybe some
   performance tests, if possible.


> [Call for advice] Would a technical document explaining the exact details of
> the bloom filter implemenation and the hashing calculations be helpful? I will
> be adding details into Documentation/technical/commit-graph-format.txt, but the
> bloom filter code is an independent subsystem and could be used outside of the
> commit-graph feature. Is it worth a separate document, or should we apply "You
> Ain't Gonna Need It" principles?

As nowadays technical reference documentation is being moved from
Documentation/technical/api-*.txt to appropriate header files, maybe the
documentation of Bloom filter API (and some technical documentation and
references) be put in bloom.h?  See for example comments in strbuf.h.

> [Call for advice] I plan to add unit tests for bloom.c, specifically to ensure
> that the hash algorithm and bloom key calculations are stable across versions.

Ah, so the unit tests for bloom.c does not exist, yet...

> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> Helped-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Makefile       |   1 +
>  bloom.c        | 201 +++++++++++++++++++++++++++++++++++++++++++++++++
>  bloom.h        |  46 +++++++++++
>  commit-graph.c |  32 +++++++-
>  4 files changed, 279 insertions(+), 1 deletion(-)
>  create mode 100644 bloom.c
>  create mode 100644 bloom.h
>
> diff --git a/Makefile b/Makefile
> index 42a061d3fb..9d5e26f5d6 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -838,6 +838,7 @@ LIB_OBJS += base85.o
>  LIB_OBJS += bisect.o
>  LIB_OBJS += blame.o
>  LIB_OBJS += blob.o
> +LIB_OBJS += bloom.o
>  LIB_OBJS += branch.o
>  LIB_OBJS += bulk-checkin.o
>  LIB_OBJS += bundle.o

I'll put bloom.h first, to make it easier to review.

> diff --git a/bloom.h b/bloom.h
> new file mode 100644
> index 0000000000..ba8ae70b67
> --- /dev/null
> +++ b/bloom.h
> @@ -0,0 +1,46 @@
> +#ifndef BLOOM_H
> +#define BLOOM_H
> +
> +struct commit;
> +struct repository;

This would probably be missing if this patch was split in two:
introducing Bloom filter and saving Bloom filter in the repository
metadata (in commit-graphh file).

> +

O.K., the names of fields are descriptive enough so that this struct
doesn't need detailed description in comment (like the next one).

> +struct bloom_filter_settings {
> +	uint32_t hash_version;

Do we need full half-word for hash version?

> +	uint32_t num_hashes;

Do we need full 32-bits for number of hashes?  The "Bloom Filters in
Probabilistic Verification" paper mentioned in Stolee blog states that
no one should need number of hashes greater than k=32 - the accuracy is
so high that it doesn't matter that it is not optimal.

  "Notice one last thing about Bloom filters in verification, if $m$ is
   several gigabytes or less and $m/n$ calls for more than about 32 index
   functions, the accuracy is going to be so high that there is not much
   reason to use more than 32—for the next several years at least. In
   response to this, 3SPIN currently limits the user to $k = 32$. The
   point of this observation is that we do not have to worry about the
   runtime cost of $k$ being on the order of 64 or 100, because those
   choices do not really buy us anything over 32."

Here 'm' is the number of bits in Bloom filter, and m/n is number of
bits per element added to filter.

> +	uint32_t bits_per_entry;

All right, we wouldn't really want large Bloom filters, as we use one
filter per commit to match againts one key, not single Bloom filter to
match againts many keys.

> +};
> +
> +#define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 }
> +
> +/*
> + * A bloom_filter struct represents a data segment to
> + * use when testing hash values. The 'len' member
> + * dictates how many uint64_t entries are stored in
> + * 'data'.
> + */
> +struct bloom_filter {
> +	uint64_t *data;
> +	int len;
> +};

O.K., so it is single variable-sized (in 64-bit increments) Bloom filter
bit vector (bitmap).

> +
> +/*
> + * A bloom_key represents the k hash values for a
> + * given hash input. These can be precomputed and
> + * stored in a bloom_key for re-use when testing
> + * against a bloom_filter.
> + */
> +struct bloom_key {
> +	uint32_t *hashes;
> +};

That is smart.  I wonder however if it wouldn't be a good idea to
'typedef' a hash function return type.

I repeat myself: in Git case we have one key that we try to match
against many Bloom filters which are never updated, while in an ordinary
case many keys are matched against single Bloom filter - in many cases
updated (with keys inserted to Bloom filter).

I wonder if somebody from academia have examined such situation.
I couldn't find a good search query.


Sidenote: perhaps Xor or Xor+ filters from Graf & Lemire (2019)
https://arxiv.org/abs/1912.08258 would be better solution - they also
assume unchanging filter.  Though they are a very fresh proposal;
also construction time might be important for Git.
https://github.com/FastFilter/xor_singleheader

> +
> +void load_bloom_filters(void);
> +
> +struct bloom_filter *get_bloom_filter(struct repository *r,
> +				      struct commit *c);

Those two functions really need API documentation on how they are used,
if they are to be used in any other role, especially what is their
calling convention?  Why load_bloom_filters() doesn't take any
parameters?

Anyway, if this patch would be split into pure Bloom filter
implementation and Git use^W store of Bloom filters, then this would be
left for the latter patch.

> +
> +void fill_bloom_key(const char *data,
> +		    int len,
> +		    struct bloom_key *key,
> +		    struct bloom_filter_settings *settings);

It is a bit strange that two basic Bloom filter operations, namely
adding element to Bloom filter (and constructing Bloom filter), and
testing whether element is in Bloom filter are not part of a public
API...

This function should probably be documented, in particular the fact that
*key is an in/out parameter.  This could also be a good place to
document the mechanism itself (i.e. our implementation of Bloom filter,
with references), though it might be better to keep the details of how
it works in the bloom.c - close to the actual source (while keeping
description of API in bloom.h comments).

> +
> +#endif
> diff --git a/bloom.c b/bloom.c
> new file mode 100644
> index 0000000000..08328cc381
> --- /dev/null
> +++ b/bloom.c
> @@ -0,0 +1,201 @@
> +#include "git-compat-util.h"
> +#include "bloom.h"
> +#include "commit-graph.h"
> +#include "object-store.h"
> +#include "diff.h"
> +#include "diffcore.h"
> +#include "revision.h"
> +#include "hashmap.h"
> +
> +#define BITS_PER_BLOCK 64
> +
> +define_commit_slab(bloom_filter_slab, struct bloom_filter);
> +
> +struct bloom_filter_slab bloom_filters;

All right, so the Bloom filter data would be on slab.  This should
probably be mentioned in the commit message, like in
https://lore.kernel.org/git/61559c5b-546e-d61b-d2e1-68de692f5972@gmail.com/

Sidenote: If I remember correctly one of the unmet prerequisites for
switching to generation numbers v2 (corrected commit date with monotonic
offsets) was moving 'generation' field out of 'struct commit' and on to
slab (possibly also 'graph_pos'), and certainly having 'corrected_date'
on slab (Inside-Out Object style).  Which probably could be done with
Coccinelle script...

> +
> +struct pathmap_hash_entry {
> +    struct hashmap_entry entry;
> +    const char path[FLEX_ARRAY];
> +};

Hmmm... I wonder why use hashmap and not string_list.  This is for
adding path with leading directories to the Bloom filter, isn't it?

> +
> +static uint32_t rotate_right(uint32_t value, int32_t count)
> +{
> +	uint32_t mask = 8 * sizeof(uint32_t) - 1;
> +	count &= mask;
> +	return ((value >> count) | (value << ((-count) & mask)));
> +}

Does it actually work with count being negative?  Shouldn't 'count' be
of unsigned type, and if int32_t is needed, perhaps add an assertion (if
needed)?  I think it does not.

It looks like it is John Regehr [2] safe and compiler-friendly
implementation, with explicit 8 in place of CHAR_BIT from <limits.h>,
which should compile to "rotate" assembly instruction... it looks like
it is the case, see https://godbolt.org/z/5JP1Jb (at least for C++
compiler).

[2]: https://en.wikipedia.org/wiki/Circular_shift


I wonder if this should, in the future, be a part of 'compat/', maybe
even using compiler intrinsics for "rotate right" if available (see
https://stackoverflow.com/a/776523/46058).  But that might be outside of
the scope of this patch (perhaps outside of choosing function name).

> +

It would be nice to have reference to the source of algorithm, or to the
code that was borrowed for this in the header comment for the following
function.

I will be comparing the algorithm itself in Wikipedia
https://en.wikipedia.org/wiki/MurmurHash#Algorithm
and its implementation in C in qLibc library (BSD licensed)
https://github.com/wolkykim/qlibc/blob/03a8ce035391adf88d6d755f9a26967c16a1a567/src/utilities/qhash.c#L258

> +static uint32_t seed_murmur3(uint32_t seed, const char *data, int len)
> +{
> +	const uint32_t c1 = 0xcc9e2d51;
> +	const uint32_t c2 = 0x1b873593;
> +	const int32_t r1 = 15;
> +	const int32_t r2 = 13;

Those two: r1 and r1, probably should be both uint32_t type.

> +	const uint32_t m = 5;
> +	const uint32_t n = 0xe6546b64;
> +	int i;
> +	uint32_t k1 = 0;
> +	const char *tail;

*tail should probably be 'uint8_t', not 'char', isn't it?

> +
> +	int len4 = len / sizeof(uint32_t);

All length variables and parameters, i.e. `len`, `len4`, `i`, could
possibly be `size_t` and not `int` type.

> +
> +	const uint32_t *blocks = (const uint32_t*)data;
> +

Some implementations copy `seed` (or assume seed=0) to the local
variable named `h` or `hash`.

> +	uint32_t k;
> +	for (i = 0; i < len4; i++)
> +	{
> +		k = blocks[i];
> +		k *= c1;
> +		k = rotate_right(k, r1);

Shouldn't it be *rotate_left* (ROL), not rotate_right (ROR)???
This affects all cases / uses.

> +		k *= c2;
> +
> +		seed ^= k;
> +		seed = rotate_right(seed, r2) * m + n;
> +	}
> +
> +	tail = (data + len4 * sizeof(uint32_t));
> +

We could have reused variable `k`, like the implementation in qLibc
does, instead of introducing new `k1` variable, but this way it is more
clean.  Or name it `remainingBytes` instead of `k1`

> +	switch (len & (sizeof(uint32_t) - 1))
> +	{
> +	case 3:
> +		k1 ^= ((uint32_t)tail[2]) << 16;
> +		/*-fallthrough*/
> +	case 2:
> +		k1 ^= ((uint32_t)tail[1]) << 8;
> +		/*-fallthrough*/
> +	case 1:
> +		k1 ^= ((uint32_t)tail[0]) << 0;
> +		k1 *= c1;
> +		k1 = rotate_right(k1, r1);
> +		k1 *= c2;
> +		seed ^= k1;
> +		break;
> +	}
> +
> +	seed ^= (uint32_t)len;
> +	seed ^= (seed >> 16);
> +	seed *= 0x85ebca6b;
> +	seed ^= (seed >> 13);
> +	seed *= 0xc2b2ae35;
> +	seed ^= (seed >> 16);
> +
> +	return seed;
> +}
> +

It would be nice to have header comment describing what this function is
intended to actually do.

> +static inline uint64_t get_bitmask(uint32_t pos)
> +{
> +	return ((uint64_t)1) << (pos & (BITS_PER_BLOCK - 1));
> +}

Sidenote: I wonder if ewah/bitmap.c implements something similar.
Certainly possible consolidation, if any possible exists, should be left
for the future.

> +
> +void fill_bloom_key(const char *data,
> +		    int len,
> +		    struct bloom_key *key,
> +		    struct bloom_filter_settings *settings)
> +{
> +	int i;
> +	uint32_t seed0 = 0x293ae76f;
> +	uint32_t seed1 = 0x7e646e2c;

Where did those constants came from?  It would be nice to have a
reference either in header comment (in bloom.h or bloom.c), or in a
commit message, or both.

Note that above *constants* are each used only once.

> +
> +	uint32_t hash0 = seed_murmur3(seed0, data, len);
> +	uint32_t hash1 = seed_murmur3(seed1, data, len);

Those are constant values, so perhaps they should be `const uint32_t`.

> +
> +	key->hashes = (uint32_t *)xcalloc(settings->num_hashes, sizeof(uint32_t));
> +	for (i = 0; i < settings->num_hashes; i++)
> +		key->hashes[i] = hash0 + i * hash1;

It looks like this code implements the double hashing technique given in
Eq. (4) in http://www.ccs.neu.edu/home/pete/pub/bloom-filters-verification.pdf
that is "Bloom Filters in Probabilistic Verification".

Note that Dillinger and Manolios in this paper propose also _enhanced_
double hashing algorithm (Algorithm 2 on page 11), which has closed form
given by Eq. (6) - with better science-theoretical properties at
similar cost.


It might be a good idea to explicitly state in the header comment that
all arithmetic is performed with unsigned 32-bit integers, which means
that operations are performed modulo 2^32.  Or it might not be needed.

> +}
> +
> +static void add_key_to_filter(struct bloom_key *key,
> +			      struct bloom_filter *filter,
> +			      struct bloom_filter_settings *settings)
> +{
> +	int i;
> +	uint64_t mod = filter->len * BITS_PER_BLOCK;
> +
> +	for (i = 0; i < settings->num_hashes; i++) {
> +		uint64_t hash_mod = key->hashes[i] % mod;
> +		uint64_t block_pos = hash_mod / BITS_PER_BLOCK;

All right.  Because Bloom filters for different commits (and the same
key) may have different lengths, we can perform modulo operation only
here.  `hash_mod` is i-th hash modulo size of filter, and `block_pos` is
the block the '1' bit would go into.

> +
> +		filter->data[block_pos] |= get_bitmask(hash_mod);

I'm not quite convinced that get_bitmask() is a good name: this function
returns bitmap with hash_mod's bit set to 1.  On the other hand it
doesn't matter, because it is static (file-local) helper function.

Never mind then.

> +	}
> +}
> +
> +void load_bloom_filters(void)
> +{
> +	init_bloom_filter_slab(&bloom_filters);
> +}

Why *load* if all it does is initialize?

> +
> +struct bloom_filter *get_bloom_filter(struct repository *r,
> +				      struct commit *c)

I will not comment on this function; see Jeff King reply and Derrick
Stolee reply.

> +{
[...]
> +}
> \ No newline at end of file

Why there is no newline at the end of the file?  Accident?

> diff --git a/commit-graph.c b/commit-graph.c
> index e771394aff..61e60ff98a 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -16,6 +16,7 @@
>  #include "hashmap.h"
>  #include "replace-object.h"
>  #include "progress.h"
> +#include "bloom.h"
>  
>  #define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
>  #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
> @@ -794,9 +795,11 @@ struct write_commit_graph_context {
>  	unsigned append:1,
>  		 report_progress:1,
>  		 split:1,
> -		 check_oids:1;
> +		 check_oids:1,
> +		 bloom:1;

Very minor nitpick: why `bloom` and not `bloom_filter`?

>  
>  	const struct split_commit_graph_opts *split_opts;
> +	uint32_t total_bloom_filter_size;

All right, I guess size of all Bloom filters would fit in uint32_t, no
need for size_t, is it?

Shouldn't it be total_bloom_filters_size -- it is not a single Bloom
filter, but many (minor nitpick)?

>  };
>  
>  static void write_graph_chunk_fanout(struct hashfile *f,
> @@ -1139,6 +1142,28 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  	stop_progress(&ctx->progress);
>  }
>  
> +static void compute_bloom_filters(struct write_commit_graph_context *ctx)
> +{
> +	int i;
> +	struct progress *progress = NULL;
> +
> +	load_bloom_filters();
> +
> +	if (ctx->report_progress)
> +		progress = start_progress(
> +			_("Computing commit diff Bloom filters"),
> +			ctx->commits.nr);
> +
> +	for (i = 0; i < ctx->commits.nr; i++) {
> +		struct commit *c = ctx->commits.list[i];
> +		struct bloom_filter *filter = get_bloom_filter(ctx->r, c);
> +		ctx->total_bloom_filter_size += sizeof(uint64_t) * filter->len;

Wouldn't it be more future proof instead of using `sizeof(uint64_t)` to
use `sizeof(filter->data[0])` here?  This may be not worth it, and be
less readable (we have hard-coded use of 64-bits blocks in other places).

> +		display_progress(progress, i + 1);
> +	}
> +
> +	stop_progress(&progress);
> +}
> +
>  static int add_ref_to_list(const char *refname,
>  			   const struct object_id *oid,
>  			   int flags, void *cb_data)
> @@ -1791,6 +1816,8 @@ int write_commit_graph(const char *obj_dir,
>  	ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
>  	ctx->check_oids = flags & COMMIT_GRAPH_WRITE_CHECK_OIDS ? 1 : 0;
>  	ctx->split_opts = split_opts;
> +	ctx->bloom = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0;

All right, this flag was defined in [PATCH 1/9].

The ordering of setting `ctx` members looks a bit strange.  Now it is
neither check `flags` firsts, neither keep related stuff together (see
ctx->split vs ctx->split_opts).  This is a very minor nitpick.

> +	ctx->total_bloom_filter_size = 0;
>  
>  	if (ctx->split) {
>  		struct commit_graph *g;
> @@ -1885,6 +1912,9 @@ int write_commit_graph(const char *obj_dir,
>  
>  	compute_generation_numbers(ctx);
>  
> +	if (ctx->bloom)
> +		compute_bloom_filters(ctx);
> +
>  	res = write_commit_graph_file(ctx);
>  
>  	if (ctx->split)

Regards,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH 3/9] commit-graph: use MAX_NUM_CHUNKS
  2019-12-20 22:05 ` [PATCH 3/9] commit-graph: use MAX_NUM_CHUNKS Garima Singh via GitGitGadget
@ 2020-01-07 12:19   ` Jakub Narebski
  0 siblings, 0 replies; 150+ messages in thread
From: Jakub Narebski @ 2020-01-07 12:19 UTC (permalink / raw)
  To: Garima Singh via GitGitGadget
  Cc: git, stolee, szeder.dev, jonathantanmy, jeffhost, me, peff,
	Junio C Hamano, Garima Singh

"Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Garima Singh <garima.singh@microsoft.com>
>
> This is a minor cleanup to make it easier to change the
> number of chunks being written to the commit-graph in the future.

Very minor nit: in the whole commit message it is not stated explicitly
what MAX_NUM_CHUNKS is for, though it is very easy to guess (from the
name itself).

>
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> ---
>  commit-graph.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 61e60ff98a..8c4941eeaa 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -24,6 +24,7 @@
>  #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
>  #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
>  #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
> +#define MAX_NUM_CHUNKS 5

Minor nit: MAX_NUM_CHUNKS or MAX_CHUNKS?

>  
>  #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
>  
> @@ -1381,8 +1382,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>  	int fd;
>  	struct hashfile *f;
>  	struct lock_file lk = LOCK_INIT;
> -	uint32_t chunk_ids[6];
> -	uint64_t chunk_offsets[6];
> +	uint32_t chunk_ids[MAX_NUM_CHUNKS + 1];
> +	uint64_t chunk_offsets[MAX_NUM_CHUNKS + 1];

Looks good.  I guess we won't ever have more chunks than 5:
OIDF, OIDL, CDAT, EDGE, BASE (and they cannot repeat, and last two are
optional).

>  	const unsigned hashsz = the_hash_algo->rawsz;
>  	struct strbuf progress_title = STRBUF_INIT;
>  	int num_chunks = 3;

Good.

Looks good to me.
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH 4/9] commit-graph: document bloom filter format
  2019-12-20 22:05 ` [PATCH 4/9] commit-graph: document bloom filter format Garima Singh via GitGitGadget
@ 2020-01-07 14:46   ` Jakub Narebski
  0 siblings, 0 replies; 150+ messages in thread
From: Jakub Narebski @ 2020-01-07 14:46 UTC (permalink / raw)
  To: Garima Singh via GitGitGadget
  Cc: git, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
	Jeff Hostetler, Taylor Blau, Jeff King, Junio C Hamano,
	Garima Singh

"Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Garima Singh <garima.singh@microsoft.com>
>
> Update the technical documentation for commit-graph-format with BIDX
> and BDAT chunk information.
>
> RFC Notes:
> 1. [Call for advice] We specifically mention that we are using Bloom
>    filters in this technical document. Should this document also be
>    made open to other data structures in the future, with versioning
>    information?

I'm not sure.  In theory we might want to switch to another
probabilistic set inclusion query structure, like xor filters or cuckoo
hashing.

On one hand side we could use separate chunks (e.g. XIDX, XDAT for xor
filters), on the other hand we need only one such structure.  On the
gripping hand this can be left for the future, if needed.

Sidenote: using Bloom filters is somewhat encoded in the name of chunk
(B from Bloom filter).  I don't have a better poposal for 4-char name
(XIDX / XDAT for cXange?  CHDX / CHDT for CHange?  FIDX / FDAT for
changed Files?... I don't know).

>
> 2. [Call for advice] We are also not describing the explicit nature
>    of how we store the bloom filter binary data. Would it be useful
>    to document details about the hash algorithm, the number of hashes
>    and the specific seed values we are using in a separate document,
>    or perhaps in a separate section in this document?

I think it would be best to keep description of the commit graph format
concise.  The details about Bloom filter implementation would be better
put in Documentation/technical/commit-graph.txt in my opinion, together
with reasoning behind it (perhaps borrowing from Derrick Stolee blog
post).

This could be done as a separate patch.

>
> Helped-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> ---
>  Documentation/technical/commit-graph-format.txt | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
>
> diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
> index a4f17441ae..6497f19f08 100644
> --- a/Documentation/technical/commit-graph-format.txt
> +++ b/Documentation/technical/commit-graph-format.txt
> @@ -17,6 +17,9 @@ metadata, including:
>  - The parents of the commit, stored using positional references within
>    the graph file.
>  
> +- The bloom filter of the commit carrying the paths that were changed between
> +  the commit and it's first parent.

s/bloom/Bloom/ and s/it's/its/

I am not sure about exact wording, but I could at this time think of a
better but concise way of stating it.

> +
>  These positional references are stored as unsigned 32-bit integers
>  corresponding to the array position within the list of commit OIDs. Due
>  to some special constants we use to track parents, we can store at most
> @@ -93,6 +96,20 @@ CHUNK DATA:
>        positions for the parents until reaching a value with the most-significant
>        bit on. The other bits correspond to the position of the last parent.
>  
> +  Bloom Filter Index (ID: {'B', 'I', 'D', 'X'}) [Optional]
> +      For each commit we store the offset of its bloom filter in the BDAT chunk
> +      as follows:
> +      BIDX[i] = number of 8-byte words in all the bloom filters from commit 0 to
> +		commit i (inclusive)

I think it would be better for consistency and ease of reading to follow
the example of OID Fanout (OIDF) chunk description:

 +  Bloom Filter Index (ID: {'B', 'I', 'D', 'X'}) (N * 4 bytes) [Optional]
 +      The ith entry, BIDX[i], stores the number of 8-byte word blocks
 +      in all Bloom filters from commit 0 up to commit i (inclusive)
 +      in lexicographical order.

Maybe even add the following to make implementing it easier:

 +      Data for Bloom filter for i-th commit spans from BIDX[i-1] to
 +      BIDX[i] (plus header length), where we take BIDX[-1] to be 0.

Is it possible for (BIDX[i] - BIDX[i-1]) to be zero (no Bloom filter),
for example for commits with more than 512 changes?  Or is this case
handled by 1 8-byte word Bloom filter of all bits sets to '1', i.e.
0xffffffffffffffff?

How the case of too many changes is distingushed from the case of no
changes (`git commit --allow-empty`, or `git merge --ours`)?  Is the
case of no changes uninteresting, i.e. Bloom filter consisting of zero,
that is with all bits set to '0'?

> +
> +  Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]
> +      * It starts with three 32 bit integers for the

I would say "It starts with the header consisting of three unsigned
32-bit integers:" (but current version is not bad).

I wonder if this metadata should perhaps be put in a separate chunk,
BMET... (Bloom filter METadata).

> +	    - version of the hash algorithm being used

This number not only encodes that the base hash algorithm being used is
32-bit Murmur3 hash, but that 'k' hashes used in the computation are
created out of Murmur3 hash using double hashing technique, and
specifies two specific seed values for this double hashing technique.

[Maybe we should store those two seed values here too?]

It might be important to say that the currently supported version is
'1', and if Git encounters unknown hashing algorithm version it should
not use Bloom filter data.

Unless we store encoded _name_ of the hash algorithm, e.g. bytes
'm','u','r','3' for MurmurHash3_32... though it is about more than
a base hash.

Do we need whole 4 bytes for hash version, or is it for ease of use and
alignment?

> +	    - the number of hashes used in the computation

All right.  Perhaps we should test in the future patches that the value
different from the default of 7 would also work.

Also 8-bits / 1 byte for number of hashes (hash functions) should be
enough: as I have written in prevous reply there is no need for k > 32.

> +	    - the number of bits per entry

This is important for construction of Bloom filter, but I think it is
not necessary to use it -- so it may not be necessary to store it.

Would also fit in a single byte: we don't need exceedingly low false
positive probability.

We could use it to estimate the false positive probability, and...

> +	  * The rest of the chunk is the concatenation of all the computed bloom 
> +	  filters for the commits in lexicographic order.

  +	 * The rest of the chunk is the concatenation of all the computed Bloom 
  +	   filters for the commits in lexicographic order.

It would be, I think, a good idea to make it explicit that BDAT is
present iff BIDX is present (iff == if and only if), i.e. that either
both or neither of those chunks should be present.

> +
>    Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional]
>        This list of H-byte hashes describe a set of B commit-graph files that
>        form a commit-graph chain. The graph position for the ith commit in this

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH 5/9] commit-graph: write changed path bloom filters to commit-graph file.
  2019-12-20 22:05 ` [PATCH 5/9] commit-graph: write changed path bloom filters to commit-graph file Garima Singh via GitGitGadget
@ 2020-01-07 16:01   ` Jakub Narebski
  2020-01-14 15:14     ` Garima Singh
  0 siblings, 1 reply; 150+ messages in thread
From: Jakub Narebski @ 2020-01-07 16:01 UTC (permalink / raw)
  To: Garima Singh via GitGitGadget
  Cc: git, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
	Jeff Hostetler, Taylor Blau, Jeff King, Junio C Hamano,
	Garima Singh

"Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Garima Singh <garima.singh@microsoft.com>
>
> Write bloom filters to the commit-graph using the format described in
> Documentation/technical/commit-graph-format.txt
>
> Helped-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>

Looks good to me.

> ---
>  commit-graph.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++-
>  commit-graph.h |  5 ++++
>  2 files changed, 85 insertions(+), 1 deletion(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 8c4941eeaa..def2ade166 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -24,7 +24,9 @@
>  #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
>  #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
>  #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
> -#define MAX_NUM_CHUNKS 5
> +#define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
> +#define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
> +#define MAX_NUM_CHUNKS 7

Very minor nitpick: shouldn't we follow the order in the
commit-graph-format.txt document (i.e. "BASE" as last chunk and last
preprocessor constant)?

>  
>  #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
>  
> @@ -282,6 +284,32 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
>  				chunk_repeated = 1;
>  			else
>  				graph->chunk_base_graphs = data + chunk_offset;
> +			break;
> +
> +		case GRAPH_CHUNKID_BLOOMINDEXES:
> +			if (graph->chunk_bloom_indexes)
> +				chunk_repeated = 1;
> +			else
> +				graph->chunk_bloom_indexes = data + chunk_offset;
> +			break;

All right.

> +
> +		case GRAPH_CHUNKID_BLOOMDATA:
> +			if (graph->chunk_bloom_data)
> +				chunk_repeated = 1;
> +			else {
> +				uint32_t hash_version;
> +				graph->chunk_bloom_data = data + chunk_offset;
> +				hash_version = get_be32(data + chunk_offset);

All right, now I see why all those header values for BDAT chunk are
defined to be 32-bit integers.  For code simplicity.

> +
> +				if (hash_version != 1)
> +					break;

What does it mean for Git?  Behave as if there were no Bloom filter
data?

> +
> +				graph->settings = xmalloc(sizeof(struct bloom_filter_settings));
> +				graph->settings->hash_version = hash_version;
> +				graph->settings->num_hashes = get_be32(data + chunk_offset + 4);
> +				graph->settings->bits_per_entry = get_be32(data + chunk_offset + 8);

All right, looks O.K.

> +			}
> +			break;
>  		}
>  
>  		if (chunk_repeated) {
> @@ -996,6 +1024,39 @@ static void write_graph_chunk_extra_edges(struct hashfile *f,
>  	}
>  }
>  
> +static void write_graph_chunk_bloom_indexes(struct hashfile *f,
> +					    struct write_commit_graph_context *ctx)
> +{
> +	struct commit **list = ctx->commits.list;
> +	struct commit **last = ctx->commits.list + ctx->commits.nr;
> +	uint32_t cur_pos = 0;
> +
> +	while (list < last) {
> +		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
> +		cur_pos += filter->len;
> +		hashwrite_be32(f, cur_pos);
> +		list++;
> +	}

Why not follow the write_graph_chunk_oids() example, instead of
write_graph_chunk_data(), that is use simply:

  +	struct commit **list = ctx->commits.list;
  +	uint32_t cur_pos = 0;
  +
  +	for (count = 0; count < ctx->commits.nr; count++, list++) {
  +		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
  +		cur_pos += filter->len;
  +		hashwrite_be32(f, cur_pos);
  +	}

I guess using here

  +		cur_pos += get_bloom_filter(ctx->r, *list)->len;

would be too cryptic, and hard to debug?

Also, wouldn't we need

  +		display_progress(ctx->progress, ++ctx->progress_cnt);

before hashwrite_be32()?

> +}
> +
> +static void write_graph_chunk_bloom_data(struct hashfile *f,
> +					 struct write_commit_graph_context *ctx,
> +					 struct bloom_filter_settings *settings)
> +{
> +	struct commit **first = ctx->commits.list;

Even if we decide to use `while` loop, like write_graph_chunk_data(),
and not `for` loop, like write_graph_chunk_oids(), why the change from
`struct commit **list = ...` to `struct commit **first = ...`?

> +	struct commit **last = ctx->commits.list + ctx->commits.nr;
> +
> +	hashwrite_be32(f, settings->hash_version);
> +	hashwrite_be32(f, settings->num_hashes);
> +	hashwrite_be32(f, settings->bits_per_entry);

All right, simple.

> +
> +	while (first < last) {
> +		struct bloom_filter *filter = get_bloom_filter(ctx->r, *first);

Hmmm... wouldn't this compute Bloom filter second time?
get_bloom_filter() does work unconditionally.

Wouldn't

  +		struct bloom_filter *filter = bloom_filter_slab_at(&bloom_filters, *first);

be enough?

Or make get_bloom_filter() use *_peek() to check if Bloom filter for
given commit was already computed, and only if it returns NULL do the
work.

> +		hashwrite(f, filter->data, filter->len * sizeof(uint64_t));

Might need display_progress() before hashwrote().

> +		first++;
> +	}
> +}
> +
>  static int oid_compare(const void *_a, const void *_b)
>  {
>  	const struct object_id *a = (const struct object_id *)_a;
> @@ -1388,6 +1449,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>  	struct strbuf progress_title = STRBUF_INIT;
>  	int num_chunks = 3;
>  	struct object_id file_hash;
> +	struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
>  
>  	if (ctx->split) {
>  		struct strbuf tmp_file = STRBUF_INIT;
> @@ -1432,6 +1494,12 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>  		chunk_ids[num_chunks] = GRAPH_CHUNKID_EXTRAEDGES;
>  		num_chunks++;
>  	}
> +	if (ctx->bloom) {
> +		chunk_ids[num_chunks] = GRAPH_CHUNKID_BLOOMINDEXES;
> +		num_chunks++;
> +		chunk_ids[num_chunks] = GRAPH_CHUNKID_BLOOMDATA;
> +		num_chunks++;
> +	}

Looks all right.

>  	if (ctx->num_commit_graphs_after > 1) {
>  		chunk_ids[num_chunks] = GRAPH_CHUNKID_BASE;
>  		num_chunks++;
> @@ -1450,6 +1518,13 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>  						4 * ctx->num_extra_edges;
>  		num_chunks++;
>  	}
> +	if (ctx->bloom) {
> +		chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] + sizeof(uint32_t) * ctx->commits.nr;
> +		num_chunks++;
> +
> +		chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] + sizeof(uint32_t) * 3 + ctx->total_bloom_filter_size;
> +		num_chunks++;
> +	}

Better wrap those long lines, like above:

  +	if (ctx->bloom) {
  +		chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
  +						sizeof(uint32_t) * ctx->commits.nr;
  +		num_chunks++;
  +
  +		chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
  +						sizeof(uint32_t) * 3 + ctx->total_bloom_filter_size;
  +		num_chunks++;
  +	}

>  	if (ctx->num_commit_graphs_after > 1) {
>  		chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
>  						hashsz * (ctx->num_commit_graphs_after - 1);
> @@ -1487,6 +1562,10 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>  	write_graph_chunk_data(f, hashsz, ctx);
>  	if (ctx->num_extra_edges)
>  		write_graph_chunk_extra_edges(f, ctx);
> +	if (ctx->bloom) {
> +		write_graph_chunk_bloom_indexes(f, ctx);
> +		write_graph_chunk_bloom_data(f, ctx, &bloom_settings);
> +	}

All right.

>  	if (ctx->num_commit_graphs_after > 1 &&
>  	    write_graph_chunk_base(f, ctx)) {
>  		return -1;
> diff --git a/commit-graph.h b/commit-graph.h
> index 952a4b83be..2202ad91ae 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -10,6 +10,7 @@
>  #define GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD "GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD"
>  
>  struct commit;
> +struct bloom_filter_settings;
>  
>  char *get_commit_graph_filename(const char *obj_dir);
>  int open_commit_graph(const char *graph_file, int *fd, struct stat *st);
> @@ -58,6 +59,10 @@ struct commit_graph {
>  	const unsigned char *chunk_commit_data;
>  	const unsigned char *chunk_extra_edges;
>  	const unsigned char *chunk_base_graphs;
> +	const unsigned char *chunk_bloom_indexes;
> +	const unsigned char *chunk_bloom_data;
> +
> +	struct bloom_filter_settings *settings;

Should this be part of `struct commit_graph`?  Shouldn't we free() this
data, or is it a pointer into xmmap-ped file... no it isn't -- we
xalloc() it, so we should free() it.

I think it should be done in 'cleanup:' section of write_commit_graph(),
but I am not entirely sure.

>  };
>  
>  struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st);

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH 6/9] commit-graph: test commit-graph write --changed-paths
  2019-12-20 22:05 ` [PATCH 6/9] commit-graph: test commit-graph write --changed-paths Garima Singh via GitGitGadget
@ 2020-01-08  0:32   ` Jakub Narebski
  0 siblings, 0 replies; 150+ messages in thread
From: Jakub Narebski @ 2020-01-08  0:32 UTC (permalink / raw)
  To: Garima Singh via GitGitGadget
  Cc: git, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
	Jeff Hostetler, Taylor Blau, Jeff King, Junio C Hamano,
	Garima Singh

"Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Garima Singh <garima.singh@microsoft.com>
>
> Add tests for the --changed-paths feature when writing
> commit-graphs.

It doesn't look however as if this test is actually testing the _Bloom
filter_ functionality itself -- because this test looks like
copy'n'paste of t/t5324-split-commit-graph.sh, just with
`--changed-paths` added to the `git commit-graph write` invocation, and
added checking via enhanced test-tool that there are Bloom filter chunks
("bloom_indexes" and "bloom_data").

Please correct me if I am wrong, but this looks like a simple sanity
check for me.

>
> RFC Notes:
> I plan to split this test across some of the earlier commits
> as appropriate.

About adding tests to earlier commits in this series:

1. Testing Bloom filter functionality:
   - creating Bloom filter and adding elements to it
   - testing Bloom filter functionality
     - for element in set the answer is "maybe"
     - for element not in set the answer is "no" or "maybe"
   - automatic resizing works (6 and 7 elements)
   - it works for different number of hash functions,
     and different number of bits per element (maybe?)

2. Testing Bloom filter for commit changeset:
   - it works for commit with no changes
   - it works for merge commit with no changes to first parent
     (`git merge --strategy=ours`)
   - with number of changes that require filter size change
   - with maximal number of changes, one changed file less,
     one changed file more
   - that for file deeper in hierarchy, path/to/file, all of
     changed directories (path/to/ and path/) are also added

3. Test writing and reading commit-graph with Bloom filters
   - that after writing Bloom filters with `--changed-paths`
     the data is present in commit-graph files
   - it works correctly with split commit-graph
   - it doesn't crash if confronted with unknown settings:
     hash version different than 1, different number of hash
     functions, different number of bits per element

4. Bloom filter specific `git commit-graph verify` parts
   - fail if Bloom filter chunks appear multiple times
   - fail if only one of BIDX or BDAT chunks are present
   - fail if BIDX is not monotonic, that is if size of Bloom filter
     for a commit is negative
   - fail if BDAT size does not agree with BIDX,
     being either too small, or too large
   - check if values of number of hash functions
     and number of bits per element added are sane

5. Using Bloom filters to speed up Git operations
   - test that with and without Bloom filters (or commit-graph)
     the following operations work the same:
     - git log -- <path/to/file>
     - git log -- <path/to/directory>
     - git log -- '*.c'  # or other glob pattern
     - git log -- <file1> <file2>
     - git log --follow <file>
     - maybe also `git log --full-history -- <file>`
   - if possible, add performance tests, see `t/perf`

>
> Helped-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> ---
>  t/helper/test-read-graph.c    |   4 +
>  t/t5325-commit-graph-bloom.sh | 255 ++++++++++++++++++++++++++++++++++
>  2 files changed, 259 insertions(+)
>  create mode 100755 t/t5325-commit-graph-bloom.sh
>
> diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
> index d2884efe0a..aff597c7a3 100644
> --- a/t/helper/test-read-graph.c
> +++ b/t/helper/test-read-graph.c
> @@ -45,6 +45,10 @@ int cmd__read_graph(int argc, const char **argv)
>  		printf(" commit_metadata");
>  	if (graph->chunk_extra_edges)
>  		printf(" extra_edges");
> +	if (graph->chunk_bloom_indexes)
> +		printf(" bloom_indexes");
> +	if (graph->chunk_bloom_data)
> +		printf(" bloom_data");

All right, though it is very basic information.

>  	printf("\n");
>  
>  	UNLEAK(graph);
> diff --git a/t/t5325-commit-graph-bloom.sh b/t/t5325-commit-graph-bloom.sh
> new file mode 100755
> index 0000000000..d7ef0e7fb3
> --- /dev/null
> +++ b/t/t5325-commit-graph-bloom.sh
> @@ -0,0 +1,255 @@
> +#!/bin/sh
> +
> +test_description='commit graph with bloom filters'
> +. ./test-lib.sh
> +
> +test_expect_success 'setup repo' '
> +	git init &&
> +	git config core.commitGraph true &&
> +	git config gc.writeCommitGraph false &&
> +	infodir=".git/objects/info" &&
> +	graphdir="$infodir/commit-graphs" &&
> +	test_oid_init
> +'
> +
> +graph_read_expect() {

Style: space between function name and parentheses, i.e.

  +graph_read_expect () {

> +	OPTIONAL=""

Not used anywhere.

> +	NUM_CHUNKS=5
> +	if test ! -z $2

It might be good idea to add names to those parameters by setting some
local variables to $1 and $2; or, alternatively add comment describing
this function.

> +	then
> +		OPTIONAL=" $2"
> +		NUM_CHUNKS=$((NUM_CHUNKS + $(echo "$2" | wc -w)))
> +	fi
> +	cat >expect <<- EOF
> +	header: 43475048 1 1 $NUM_CHUNKS 0
> +	num_commits: $1
> +	chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data
> +	EOF
> +	test-tool read-graph >output &&
> +	test_cmp expect output
> +}

No comments below this point...

Best,

  Jakub Narębski


> +
> +test_expect_success 'create commits and write commit-graph' '
> +	for i in $(test_seq 3)
> +	do
> +		test_commit $i &&
> +		git branch commits/$i || return 1
> +	done &&
> +	git commit-graph write --reachable --changed-paths &&
> +	test_path_is_file $infodir/commit-graph &&
> +	graph_read_expect 3
> +'
> +
> +graph_git_two_modes() {
> +	git -c core.commitGraph=true $1 >output
> +	git -c core.commitGraph=false $1 >expect
> +	test_cmp expect output
> +}
> +
> +graph_git_behavior() {
> +	MSG=$1
> +	BRANCH=$2
> +	COMPARE=$3
> +	test_expect_success "check normal git operations: $MSG" '
> +		graph_git_two_modes "log --oneline $BRANCH" &&
> +		graph_git_two_modes "log --topo-order $BRANCH" &&
> +		graph_git_two_modes "log --graph $COMPARE..$BRANCH" &&
> +		graph_git_two_modes "branch -vv" &&
> +		graph_git_two_modes "merge-base -a $BRANCH $COMPARE"
> +	'
> +}
> +
> +graph_git_behavior 'graph exists' commits/3 commits/1
> +
> +verify_chain_files_exist() {
> +	for hash in $(cat $1/commit-graph-chain)
> +	do
> +		test_path_is_file $1/graph-$hash.graph || return 1
> +	done
> +}
> +
> +test_expect_success 'add more commits, and write a new base graph' '
> +	git reset --hard commits/1 &&
> +	for i in $(test_seq 4 5)
> +	do
> +		test_commit $i &&
> +		git branch commits/$i || return 1
> +	done &&
> +	git reset --hard commits/2 &&
> +	for i in $(test_seq 6 10)
> +	do
> +		test_commit $i &&
> +		git branch commits/$i || return 1
> +	done &&
> +	git reset --hard commits/2 &&
> +	git merge commits/4 &&
> +	git branch merge/1 &&
> +	git reset --hard commits/4 &&
> +	git merge commits/6 &&
> +	git branch merge/2 &&
> +	git commit-graph write --reachable --changed-paths &&
> +	graph_read_expect 12
> +'
> +
> +test_expect_success 'fork and fail to base a chain on a commit-graph file' '
> +	test_when_finished rm -rf fork &&
> +	git clone . fork &&
> +	(
> +		cd fork &&
> +		rm .git/objects/info/commit-graph &&
> +		echo "$(pwd)/../.git/objects" >.git/objects/info/alternates &&
> +		test_commit new-commit &&
> +		git commit-graph write --reachable --split --changed-paths &&
> +		test_path_is_file $graphdir/commit-graph-chain &&
> +		test_line_count = 1 $graphdir/commit-graph-chain &&
> +		verify_chain_files_exist $graphdir
> +	)
> +'
> +
> +test_expect_success 'add three more commits, write a tip graph' '
> +	git reset --hard commits/3 &&
> +	git merge merge/1 &&
> +	git merge commits/5 &&
> +	git merge merge/2 &&
> +	git branch merge/3 &&
> +	git commit-graph write --reachable --split --changed-paths &&
> +	test_path_is_missing $infodir/commit-graph &&
> +	test_path_is_file $graphdir/commit-graph-chain &&
> +	ls $graphdir/graph-*.graph >graph-files &&
> +	test_line_count = 2 graph-files &&
> +	verify_chain_files_exist $graphdir
> +'
> +
> +graph_git_behavior 'split commit-graph: merge 3 vs 2' merge/3 merge/2
> +
> +test_expect_success 'add one commit, write a tip graph' '
> +	test_commit 11 &&
> +	git branch commits/11 &&
> +	git commit-graph write --reachable --split --changed-paths &&
> +	test_path_is_missing $infodir/commit-graph &&
> +	test_path_is_file $graphdir/commit-graph-chain &&
> +	ls $graphdir/graph-*.graph >graph-files &&
> +	test_line_count = 3 graph-files &&
> +	verify_chain_files_exist $graphdir
> +'
> +
> +graph_git_behavior 'three-layer commit-graph: commit 11 vs 6' commits/11 commits/6
> +
> +test_expect_success 'add one commit, write a merged graph' '
> +	test_commit 12 &&
> +	git branch commits/12 &&
> +	git commit-graph write --reachable --split --changed-paths &&
> +	test_path_is_file $graphdir/commit-graph-chain &&
> +	test_line_count = 2 $graphdir/commit-graph-chain &&
> +	ls $graphdir/graph-*.graph >graph-files &&
> +	test_line_count = 2 graph-files &&
> +	verify_chain_files_exist $graphdir
> +'
> +
> +graph_git_behavior 'merged commit-graph: commit 12 vs 6' commits/12 commits/6
> +
> +test_expect_success 'create fork and chain across alternate' '
> +	git clone . fork &&
> +	(
> +		cd fork &&
> +		git config core.commitGraph true &&
> +		rm -rf $graphdir &&
> +		echo "$(pwd)/../.git/objects" >.git/objects/info/alternates &&
> +		test_commit 13 &&
> +		git branch commits/13 &&
> +		git commit-graph write --reachable --split --changed-paths &&
> +		test_path_is_file $graphdir/commit-graph-chain &&
> +		test_line_count = 3 $graphdir/commit-graph-chain &&
> +		ls $graphdir/graph-*.graph >graph-files &&
> +		test_line_count = 1 graph-files &&
> +		git -c core.commitGraph=true  rev-list HEAD >expect &&
> +		git -c core.commitGraph=false rev-list HEAD >actual &&
> +		test_cmp expect actual &&
> +		test_commit 14 &&
> +		git commit-graph write --reachable --split --changed-paths --object-dir=.git/objects/ &&
> +		test_line_count = 3 $graphdir/commit-graph-chain &&
> +		ls $graphdir/graph-*.graph >graph-files &&
> +		test_line_count = 1 graph-files
> +	)
> +'
> +
> +graph_git_behavior 'alternate: commit 13 vs 6' commits/13 commits/6
> +
> +test_expect_success 'test merge stragety constants' '
> +	git clone . merge-2 &&
> +	(
> +		cd merge-2 &&
> +		git config core.commitGraph true &&
> +		test_line_count = 2 $graphdir/commit-graph-chain &&
> +		test_commit 14 &&
> +		git commit-graph write --reachable --split --changed-paths --size-multiple=2 &&
> +		test_line_count = 3 $graphdir/commit-graph-chain
> +
> +	) &&
> +	git clone . merge-10 &&
> +	(
> +		cd merge-10 &&
> +		git config core.commitGraph true &&
> +		test_line_count = 2 $graphdir/commit-graph-chain &&
> +		test_commit 14 &&
> +		git commit-graph write --reachable --split --changed-paths --size-multiple=10 &&
> +		test_line_count = 1 $graphdir/commit-graph-chain &&
> +		ls $graphdir/graph-*.graph >graph-files &&
> +		test_line_count = 1 graph-files
> +	) &&
> +	git clone . merge-10-expire &&
> +	(
> +		cd merge-10-expire &&
> +		git config core.commitGraph true &&
> +		test_line_count = 2 $graphdir/commit-graph-chain &&
> +		test_commit 15 &&
> +		git commit-graph write --reachable --split --changed-paths --size-multiple=10 --expire-time=1980-01-01 &&
> +		test_line_count = 1 $graphdir/commit-graph-chain &&
> +		ls $graphdir/graph-*.graph >graph-files &&
> +		test_line_count = 3 graph-files
> +	) &&
> +	git clone --no-hardlinks . max-commits &&
> +	(
> +		cd max-commits &&
> +		git config core.commitGraph true &&
> +		test_line_count = 2 $graphdir/commit-graph-chain &&
> +		test_commit 16 &&
> +		test_commit 17 &&
> +		git commit-graph write --reachable --split --changed-paths --max-commits=1 &&
> +		test_line_count = 1 $graphdir/commit-graph-chain &&
> +		ls $graphdir/graph-*.graph >graph-files &&
> +		test_line_count = 1 graph-files
> +	)
> +'
> +
> +test_expect_success 'remove commit-graph-chain file after flattening' '
> +	git clone . flatten &&
> +	(
> +		cd flatten &&
> +		test_line_count = 2 $graphdir/commit-graph-chain &&
> +		git commit-graph write --reachable &&
> +		test_path_is_missing $graphdir/commit-graph-chain &&
> +		ls $graphdir >graph-files &&
> +		test_must_be_empty graph-files
> +	)
> +'
> +
> +graph_git_behavior 'graph exists' merge/octopus commits/12
> +
> +test_expect_success 'split across alternate where alternate is not split' '
> +	git commit-graph write --reachable &&
> +	test_path_is_file .git/objects/info/commit-graph &&
> +	cp .git/objects/info/commit-graph . &&
> +	git clone --no-hardlinks . alt-split &&
> +	(
> +		cd alt-split &&
> +		rm -f .git/objects/info/commit-graph &&
> +		echo "$(pwd)"/../.git/objects >.git/objects/info/alternates &&
> +		test_commit 18 &&
> +		git commit-graph write --reachable --split --changed-paths &&
> +		test_line_count = 1 $graphdir/commit-graph-chain
> +	) &&
> +	test_cmp commit-graph .git/objects/info/commit-graph
> +'
> +
> +test_done

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH 7/9] commit-graph: reuse existing bloom filters during write.
  2019-12-20 22:05 ` [PATCH 7/9] commit-graph: reuse existing bloom filters during write Garima Singh via GitGitGadget
@ 2020-01-09 19:12   ` Jakub Narebski
  0 siblings, 0 replies; 150+ messages in thread
From: Jakub Narebski @ 2020-01-09 19:12 UTC (permalink / raw)
  To: Garima Singh via GitGitGadget
  Cc: git, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
	Jeff Hostetler, Taylor Blau, Jeff King, Junio C Hamano,
	Garima Singh

"Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Garima Singh <garima.singh@microsoft.com>
>
Overly long lines in the commit message.

> Read previously computed bloom filters from the commit-graph file if possible
> to avoid recomputing during commit-graph write.

Hmmm.  This fixes (somewhat) the problem that I have noticed in previous
patch that the Bloom filter was computed at least twice, once for BIDX
chunk, once fo BDAT chunk.

I think the order should be:
 - use Bloom filter on slab, if present
 - fill it from commit graph, if saved there
 - if needed, compute it from scratch (expensive operation!)

If I understand it correctly, it now does it... though possibly with
unnecessary memory allocation if commit-graph file does not include
Bloomm filters data, and (re)computing is not requested (see later).

But I might be wrong here.

>
> Reading from the commit-graph is based on the format in which bloom filters are
> written in the commit graph file. See method `fill_filter_from_graph` in bloom.c

This description reads a bit strange; it looks like it states a truism
(we read in the format we wrote).  It think it should be rephrased in
different way for better readability.

>
> For reading the bloom filter for commit at lexicographic position i:

I think it would better read as:

  To read Bloom filter for a given commit with lexicographic position
  'i' we need to:

> 1. Read BIDX[i] which essentially gives us the starting index in BDAT for filter
>    of commit i+1 (called the next_index in the code)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                    |
                    \- I think would be not needed with better var name

I would also add that it gives the position [one past] the end of Bloom
filter data for i-th commit.

>
> 2. For i>0, read BIDX[i-1] which will give us the starting index in BDAT for
>    filter of commit i (called the prev_index in the code)

Minor nitpick: Full stops are missing.

Why it is called prev_index and next_index, while it is either
curr_index and next_index or prev_index and curr_index, or maybe even
better beg_index and end_index?

>    For i = 0, prev_index will be 0. The first lexicographic commit's filter will
>    start at BDAT.

I would state it

     For first commit, with i = 0, Bloom filter data starts at the
     beginning, just past the header in BDAT chunk.

>
> 3. The length of the filter will be next_index - prev_index, because BIDX[i]
>    gives the cumulative 8-byte words including the ith commit's filter.
>
> We toggle whether bloom filters should be recomputed based on the compute_if_null
> flag.

All right.
>
> Helped-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> ---
>  bloom.c        | 40 ++++++++++++++++++++++++++++++++++++++--
>  bloom.h        |  3 ++-
>  commit-graph.c |  6 +++---
>  3 files changed, 43 insertions(+), 6 deletions(-)
>
> diff --git a/bloom.c b/bloom.c
> index 08328cc381..86b1005802 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -1,5 +1,7 @@
>  #include "git-compat-util.h"
>  #include "bloom.h"
> +#include "commit.h"
> +#include "commit-slab.h"
>  #include "commit-graph.h"
>  #include "object-store.h"
>  #include "diff.h"
> @@ -119,13 +121,35 @@ static void add_key_to_filter(struct bloom_key *key,
>  	}
>  }
>  
> +static void fill_filter_from_graph(struct commit_graph *g,
> +				   struct bloom_filter *filter,
> +				   struct commit *c)
> +{
> +	uint32_t lex_pos, prev_index, next_index;
> +

> +	while (c->graph_pos < g->num_commits_in_base)
> +		g = g->base_graph;
> +
> +	lex_pos = c->graph_pos - g->num_commits_in_base;

This part shares common code with load_oid_from_graph(), only without
some of error checking; perhaps it might be good to extract it into a
separate helper function, e.g. `lex_index(&g, c->graph_pos)`.

Minor nitpick about the consistency of function names: why
load_oid_from_graph(), but fill_filter_from_graph(), and not
load_filter_from_graph() / load_bloom_from_graph()?

> +
> +	next_index = get_be32(g->chunk_bloom_indexes + 4 * lex_pos);
> +	if (lex_pos)

I think using

  +	if (lex_pos > 0)

or

  +	if (lex_pos >= 0)

might be easier to reason about.

> +		prev_index = get_be32(g->chunk_bloom_indexes + 4 * (lex_pos - 1));
> +	else
> +		prev_index = 0;
> +
> +	filter->len = next_index - prev_index;

The above command reads a bit strange: next - prev?  not  next - curr,
or curr - prev?  Wouldn't it be better to name it begin_index and
end_index, or beg_index and end_index for brevity?

> +	filter->data = (uint64_t *)(g->chunk_bloom_data + 8 * prev_index + 12);

Please do not use magic constants; use instead something like:

  +	filter->data = (uint64_t *)(g->chunk_bloom_data +
  +				    sizeof(uint64_t) * prev_index +
  +				    BLOOMDATA_CHUNK_HEADER_SIZE);

Perhaps using `3*sizeof(unit32_t)` instead of magic value 12 would be
enough; but having symbolic name for BDAT chunk header size is better, I
think.

> +}
> +
>  void load_bloom_filters(void)
>  {
>  	init_bloom_filter_slab(&bloom_filters);
>  }
>  
>  struct bloom_filter *get_bloom_filter(struct repository *r,
> -				      struct commit *c)
> +				      struct commit *c,
> +				      int compute_if_null)

I'm not sure about `compute_if_null` name...

>  {
>  	struct bloom_filter *filter;
>  	struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
> @@ -134,6 +158,18 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>  	const char *revs_argv[] = {NULL, "HEAD", NULL};
>  
>  	filter = bloom_filter_slab_at(&bloom_filters, c);

Note that the documentation for `slab_at(slab, commit)` is documented in
`commit-slab.h` as

 *   This function locates the data associated with the given commit in
 *   the indegree slab, and returns the pointer to it.  The location to
 *   store the data is allocated as necessary.
                       ~~~~~~~~~~~~~~~~~~~~~~

Should we worry about this possibly unnecessary allocation (if there is
no Bloom filter chunk in the commit-graph, and we are not recomputing
it)?

There is `slab_peek(slab_commit)` with the following properties:

 *   This function is similar to indegree_at(), but it will return NULL
 *   until a call to indegree_at() was made for the commit.

> +
> +	if (!filter->data) {

What does the Bloom filter for a commit with no changes looks like?
What about for a commit with more than 512 changes?  Do, in either of
those cases, filter->len is 0?  If yes, what about filter->data?

> +		load_commit_graph_info(r, c);
> +		if (c->graph_pos != COMMIT_NOT_FROM_GRAPH && r->objects->commit_graph->chunk_bloom_indexes) {

Please wrap overly long lines (109 characters seems too long; the
CodingGuidelines states:

 - We try to keep to at most 80 characters per line.

  +		if (c->graph_pos != COMMIT_NOT_FROM_GRAPH &&
  +		    r->objects->commit_graph->chunk_bloom_indexes) {

You seem to assume here that in the chain of commit-graph files either
all of them would have Bloom filters, or all of them would be missing
Bloom filters.  Isn't it however possible for only some of the
commit-graph files in chain to include Bloom filter data chunks?

In such case, the top commit-graph file may have BIDX chunk
(bloom_indexes), but the commit-graph with the commit 'c' might not have
it.  Or the top commit-graph file may be missing BIDX chunk, so Git
would recompute it even if the commit-graph file for commit 'c' includes
it.

If such situation is forbidden, how the restriction is managed?

Note: in any case, this needs to be tested!

> +			fill_filter_from_graph(r->objects->commit_graph, filter, c);
> +			return filter;
> +		}
> +	}

All right, if it is not in slab, we try to read if from the commit
graph.  Looks all right.

> +
> +	if (filter->data || !compute_if_null)
> +			return filter;
                ^^^^^^^^
                 |
                 \- one tab too many

If we have found existing filter (on slab or in the commit-graph), or if
we won't be recomputing it, return it.  O.K.

> +
>  	init_revisions(&revs, NULL);
>  	revs.diffopt.flags.recursive = 1;
>  
> @@ -198,4 +234,4 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>  	DIFF_QUEUE_CLEAR(&diff_queued_diff);
>  
>  	return filter;
> -}
> \ No newline at end of file
> +}

Accidental change.

> diff --git a/bloom.h b/bloom.h
> index ba8ae70b67..101d689bbd 100644
> --- a/bloom.h
> +++ b/bloom.h
> @@ -36,7 +36,8 @@ struct bloom_key {
>  void load_bloom_filters(void);
>  
>  struct bloom_filter *get_bloom_filter(struct repository *r,
> -				      struct commit *c);
> +				      struct commit *c,
> +				      int compute_if_null);
>

All right, this is just update of the function signature.

>  void fill_bloom_key(const char *data,
>  		    int len,
> diff --git a/commit-graph.c b/commit-graph.c
> index def2ade166..0580ce75d5 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -1032,7 +1032,7 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f,
>  	uint32_t cur_pos = 0;
>  
>  	while (list < last) {
> -		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
> +		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
>  		cur_pos += filter->len;
>  		hashwrite_be32(f, cur_pos);
>  		list++;
> @@ -1051,7 +1051,7 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
>  	hashwrite_be32(f, settings->bits_per_entry);
>  
>  	while (first < last) {
> -		struct bloom_filter *filter = get_bloom_filter(ctx->r, *first);
> +		struct bloom_filter *filter = get_bloom_filter(ctx->r, *first, 0);
>  		hashwrite(f, filter->data, filter->len * sizeof(uint64_t));
>  		first++;
>  	}

O.K., so those two do not compute Bloom filters, but they are called
from write_commit_graph_file(), which in turn is called in
write_commit_graph() *after* running compute_bloom_filters().

Looks good to me, then.

> @@ -1218,7 +1218,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>  
>  	for (i = 0; i < ctx->commits.nr; i++) {
>  		struct commit *c = ctx->commits.list[i];
> -		struct bloom_filter *filter = get_bloom_filter(ctx->r, c);
> +		struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1);
>  		ctx->total_bloom_filter_size += sizeof(uint64_t) * filter->len;
>  		display_progress(progress, i + 1);
>  	}

All right, so compute_bloom_filters() ensures that it is actually
computed (if needed).

Looks good to me.

Regards,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH 8/9] revision.c: use bloom filters to speed up path based revision walks
  2019-12-20 22:05 ` [PATCH 8/9] revision.c: use bloom filters to speed up path based revision walks Garima Singh via GitGitGadget
@ 2020-01-11  0:27   ` Jakub Narebski
  2020-01-15  0:08     ` Garima Singh
  0 siblings, 1 reply; 150+ messages in thread
From: Jakub Narebski @ 2020-01-11  0:27 UTC (permalink / raw)
  To: Garima Singh via GitGitGadget
  Cc: git, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
	Jeff Hostetler, Taylor Blau, Jeff King, Junio C Hamano,
	Garima Singh

"Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Garima Singh <garima.singh@microsoft.com>
>
> If bloom filters have been written to the commit-graph file, revision walk will
> use them to speed up revision walks for a particular path.

I'd propose the following change, to make structure of the above
sentence simpler:

  Revision walk will now use Bloom filters for commits to speed up
  revision walks for a particular path (for computing history of a
  path), if they are present in the commit-graph file.

> Note: The current implementation does this in the case of single pathspec
> case only.
>
> We load the bloom filters during the prepare_revision_walk step when dealing

I think it would flow better with the following change:

s/when dealing/, but only when dealing/

> with a single pathspec. While comparing trees in rev_compare_trees(), if the
> bloom filter says that the file is not different between the two trees, we
> don't need to compute the expensive diff. This is where we get our performance
> gains.

Maybe we should also add:

  The other answer we can get from the Bloom filter is "maybe".

>
> Performance Gains:
> We tested the performance of `git log --path` on the git repo, the linux and
> some internal large repos, with a variety of paths of varying depths.

I think you meant `git log <path>`, not `git log --path`.

>
> On the git and linux repos:
> we observed a 2x to 5x speed up.

It would be good, I think, to have some specific numbers: starting from
this version, for this path, with and without Bloom filters it takes so
long (and the improvement in %).

While at it, it might be good idea to provide _costs_: how much
additional space on disk Bloom filters take for those specific examples,
and how much extra time (and memory) it takes to compute Bloom filters.

With actual specific numbers we can estimate when it would start to be
worth it to create Bloom filter data...

>
> On a large internal repo with files seated 6-10 levels deep in the tree:
> we observed 10x to 20x speed ups, with some paths going up to 28 times
> faster.

It would be nice to see specific numbers, if showing pathnames is
possible.  In any case it would be good to have more information: what
paths give 10x, what give 20x, and what kinds give 28x speedup (what
is path depth, how many objects, etc.).

>
> RFC Notes:
> I plan to collect the folloowing statistics around this usage of bloom filters
> and trace them out using trace2.
> - number of bloom filter queries,
> - number of "No" responses (file hasn't changed)
> - number of "Maybe" responses (file may have changed)
> - number of "Commit not parsed" cases (commit had too many changes to have a
>   bloom filter written out, currently our limit is 512 diffs)

Perhaps also:
  - histogram of bloom filter sizes in 64-bit blocks
    (which is rough histogram of number of changes per commit)

Though I think all those statistics are a bit specific to the
repository, and how you use Git.

>
> Helped-by: Derrick Stolee <dstolee@microsoft.com
> Helped-by: SZEDER Gábor <szeder.dev@gmail.com>
> Helped-by: Jonathan Tan <jonathantanmy@google.com>
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> ---
>  bloom.c              | 20 ++++++++++++
>  bloom.h              |  4 +++
>  revision.c           | 67 +++++++++++++++++++++++++++++++++++++--
>  revision.h           |  5 +++
>  t/t4216-log-bloom.sh | 74 ++++++++++++++++++++++++++++++++++++++++++++
>  5 files changed, 168 insertions(+), 2 deletions(-)
>  create mode 100755 t/t4216-log-bloom.sh
>
> diff --git a/bloom.c b/bloom.c
> index 86b1005802..0c7505d3d6 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -235,3 +235,23 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>  
>  	return filter;
>  }
> +
> +int bloom_filter_contains(struct bloom_filter *filter,
> +			  struct bloom_key *key,
> +			  struct bloom_filter_settings *settings)
> +{
> +	int i;
> +	uint64_t mod = filter->len * BITS_PER_BLOCK;
> +
> +	if (!mod)
> +		return 1;

Ah, so filter->len equal to zero denotes too many changes for a Bloom
filter, and this conditional is a short-circuit test: always return
"maybe".

I wonder if we should explicitly check for filter->len = 1 and
filter->data[0] = 0, which should be empty Bloom filter -- for commit
with no changes with respect to first parent, and short-circuit
returning 0 (no file would ever belong).

> +
> +	for (i = 0; i < settings->num_hashes; i++) {
> +		uint64_t hash_mod = key->hashes[i] % mod;
> +		uint64_t block_pos = hash_mod / BITS_PER_BLOCK;

I have seen this code before... ;-)  The add_key_to_filter() function
includes almost identical code, but I am not sure if it is feasible to
eliminate this (slight) code duplication.  Probably not.

> +		if (!(filter->data[block_pos] & get_bitmask(hash_mod)))
> +			return 0;

All right, if at least one hash function does not match, return "no"...

> +	}
> +
> +	return 1;

... else return "maybe".

> +}

I am wondering however if this code should not be moved earlier in the
series, so that commit 2/9 in series actually adds fully functional
[semi-generic] Bloom filter implementation.

> diff --git a/bloom.h b/bloom.h
> index 101d689bbd..9bdacd0a8e 100644
> --- a/bloom.h
> +++ b/bloom.h
> @@ -44,4 +44,8 @@ void fill_bloom_key(const char *data,
>  		    struct bloom_key *key,
>  		    struct bloom_filter_settings *settings);
>  
> +int bloom_filter_contains(struct bloom_filter *filter,
> +			  struct bloom_key *key,
> +			  struct bloom_filter_settings *settings);
> +
>  #endif
> diff --git a/revision.c b/revision.c
> index 39a25e7a5d..01f5330740 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -29,6 +29,7 @@
>  #include "prio-queue.h"
>  #include "hashmap.h"
>  #include "utf8.h"
> +#include "bloom.h"
>  
>  volatile show_early_output_fn_t show_early_output;
>  
> @@ -624,11 +625,34 @@ static void file_change(struct diff_options *options,
>  	options->flags.has_changes = 1;
>  }
>  
> +static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
> +						 struct commit *commit,
> +						 struct bloom_key *key,
> +						 struct bloom_filter_settings *settings)

All right, this function name is certainly descriptive.  I wonder if it
wouldn't be better to use a shorter name, like maybe_different(), or
maybe_not_treesame(), so that the conditional using this function would
read naturally:

  if (!maybe_different(revs, commit, revs->bloom_key,
  		       revs->bloom_filter_settings))
  	return REV_TREE_SAME;

But this might be just a matter of taste.

> +{
> +	struct bloom_filter *filter;
> +
> +	if (!revs->repo->objects->commit_graph)
> +		return -1;
> +	if (commit->generation == GENERATION_NUMBER_INFINITY)
> +		return -1;

O.K., so we check that there is loaded commit graph, and that given
commit is in a commit graph, otherwise we return "no data".

I agree with distinguishing between "no data" value (which is <0, a
convention for denoting errors), and "maybe" value.  Currently this
distinction is not utilized, but it would help in the case of more than
one path given -- in "no data" case there is no need to check other
paths against non-existing Bloom filter.

> +	if (!key || !settings)
> +		return -1;

Why the check for non-NULL 'key' and 'settings' is the last check?

BTW. when it is possible for key or settings to be NULL?  Or is it just
defensive programming?

> +
> +	filter = get_bloom_filter(revs->repo, commit, 0);

O.K., we won't be recomputing Bloom filter if it is not present either
on slab (as "inside-out" auxiliary data for a commit), or in the
commit-graph (in Bloom filter chunks).

> +
> +	if (!filter || !filter->len)
> +		return 1;

Sidenote: bloom_filter_contains() would also indirectly check for
filter->len being zero.  Though this doesn't cost much.

Shouldn't !filter case return value of -1 i.e. "no data", rather than
return value of 1 i.e. "maybe"?

> +
> +	return bloom_filter_contains(filter, key, settings);
> +}

All right.

> +
>  static int rev_compare_tree(struct rev_info *revs,
> -			    struct commit *parent, struct commit *commit)
> +			    struct commit *parent, struct commit *commit, int nth_parent)
>  {
>  	struct tree *t1 = get_commit_tree(parent);
>  	struct tree *t2 = get_commit_tree(commit);
> +	int bloom_ret = 1;
>  
>  	if (!t1)
>  		return REV_TREE_NEW;
> @@ -653,6 +677,16 @@ static int rev_compare_tree(struct rev_info *revs,
>  			return REV_TREE_SAME;
>  	}
>  
> +	if (revs->pruning.pathspec.nr == 1 && !nth_parent) {

All right, Bloom filter stores information about changed paths with
respect to first-parent changes only, so if we are asking about is not a
first parent (where nth_parent == 0), we cannot use Bloom filter.

Currently we limit the check to single pathspec only; that is a good
start and a good simplification.

> +		bloom_ret = check_maybe_different_in_bloom_filter(revs,
> +								  commit,
> +								  revs->bloom_key,
> +								  revs->bloom_filter_settings);
> +
> +		if (bloom_ret == 0)
> +			return REV_TREE_SAME;

Pretty straightforward.

> +	}
> +
>  	tree_difference = REV_TREE_SAME;
>  	revs->pruning.flags.has_changes = 0;
>  	if (diff_tree_oid(&t1->object.oid, &t2->object.oid, "",
> @@ -855,7 +889,7 @@ static void try_to_simplify_commit(struct rev_info *revs, struct commit *commit)
>  			die("cannot simplify commit %s (because of %s)",
>  			    oid_to_hex(&commit->object.oid),
>  			    oid_to_hex(&p->object.oid));
> -		switch (rev_compare_tree(revs, p, commit)) {
> +		switch (rev_compare_tree(revs, p, commit, nth_parent)) {

All right, we need to pass information about the index of the parent;
and we have just done that.  Good.

>  		case REV_TREE_SAME:
>  			if (!revs->simplify_history || !relevant_commit(p)) {
>  				/* Even if a merge with an uninteresting
> @@ -3342,6 +3376,33 @@ static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
>  	}
>  }
>  
> +static void prepare_to_use_bloom_filter(struct rev_info *revs)

All right, I see that pointers to bloom_key and bloom_filter_settings
were added to the rev_info struct.  I understand why the former is here,
but the latter seems to be there just as a shortcut (to not owned data),
which is fine but a bit strange.

Or is the latter here to allow for Bloom filter settings to possibly
change from commit-graph file in the chain to commit-graph file, and
thus from commit to commit?

> +{
> +	struct pathspec_item *pi;

Maybe 'pathspec' (or the like) instead of short and cryptic 'pi' would
be better a better name... unless 'pi' is used in other places already.

> +	const char *path;
> +	size_t len;
> +
> +	if (!revs->commits)
> +	    return;

When revs->commits may be NULL?  I understand that we need to have this
check because we use revs->commits->item next (sidenote: can revs ever
be NULL?).

Would `git log --all -- <path>` use Bloom filters (as it theoretically
could)?

> +
> +	parse_commit(revs->commits->item);

Why parsing first commit on the list of starting commits is needed here?
Please help me understand this line.

And shouldn't we use repo_parse_commit() here?

> +
> +	if (!revs->repo->objects->commit_graph)
> +		return;
> +
> +	revs->bloom_filter_settings = revs->repo->objects->commit_graph->settings;
> +	if (!revs->bloom_filter_settings)
> +		return;

All right, so if there is no commit graph, or the commit graph does not
include Bloom filter data, there is nothing to do.

Though I worry that it would make Git do not use Bloom filter if the top
commit-graph in the chain does not include Bloom filter data, while
other commit-graph files do (and Git could have used that information to
speed up the file history query).

> +
> +	pi = &revs->pruning.pathspec.items[0];
> +	path = pi->match;
> +	len = strlen(path);


Why not the following, if we do not do any checks for 'pi' value:

  +	path = &revs->pruning.pathspec.items[0]->match;

A question: is the path in the `match` field in `struct pathspec`
normalized with respect to trailing slash (for directories)?  Bloom
filter stores pathnames for directories without trailing slash.

What I mean is if, for example, both of those would use Bloom filter
data:

  $ git log -- Documentation/
  $ git log -- Documentation

> +
> +	load_bloom_filters();
> +	revs->bloom_key = xmalloc(sizeof(struct bloom_key));
> +	fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);

All right, looks good.

Though... do we leak revs->bloom_key, as should we worry about it?

> +}
> +
>  int prepare_revision_walk(struct rev_info *revs)
>  {
>  	int i;
> @@ -3391,6 +3452,8 @@ int prepare_revision_walk(struct rev_info *revs)
>  		simplify_merges(revs);
>  	if (revs->children.name)
>  		set_children(revs);
> +	if (revs->pruning.pathspec.nr == 1)
> +	    prepare_to_use_bloom_filter(revs);

Looks good.

Minor nitpick: 4 spaces instead of tab are used for indentation (or, to
be more exact a tab followed by 4 space, instead of two tabs):

  +	if (revs->pruning.pathspec.nr == 1)
  +		prepare_to_use_bloom_filter(revs);

>  	return 0;
>  }
>  
> diff --git a/revision.h b/revision.h
> index a1a804bd3d..65dc11e8f1 100644
> --- a/revision.h
> +++ b/revision.h
> @@ -56,6 +56,8 @@ struct repository;
>  struct rev_info;
>  struct string_list;
>  struct saved_parents;
> +struct bloom_key;
> +struct bloom_filter_settings;
>  define_shared_commit_slab(revision_sources, char *);
>  
>  struct rev_cmdline_info {
> @@ -291,6 +293,9 @@ struct rev_info {
>  	struct revision_sources *sources;
>  
>  	struct topo_walk_info *topo_walk_info;
> +
> +	struct bloom_key *bloom_key;
> +	struct bloom_filter_settings *bloom_filter_settings;

It might be a good idea to add one-line comment above those newly
introduced fields.  The `struct rev_info` has many subsections of fields
described by such comments (or even block comments), like e.g.

  /* Starting list */
  /* Parents of shown commits */
  /* The end-points specified by the end user */
  /* excluding from --branches, --refs, etc. expansion */
  /* Traversal flags */
  /* diff info for patches and for paths limiting */

  /*
   * Whether the arguments parsed by setup_revisions() included any
   * "input" revisions that might still have yielded an empty pending
   * list (e.g., patterns like "--all" or "--glob").
   */

>  };
>  
>  int ref_excluded(struct string_list *, const char *path);
> diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
> new file mode 100755
> index 0000000000..d42f077998
> --- /dev/null
> +++ b/t/t4216-log-bloom.sh
> @@ -0,0 +1,74 @@
> +#!/bin/sh
> +
> +test_description='git log for a path with bloom filters'
> +. ./test-lib.sh
> +
> +test_expect_success 'setup repo' '
> +	git init &&
> +	git config core.commitGraph true &&
> +	git config gc.writeCommitGraph false &&
> +	infodir=".git/objects/info" &&
> +	graphdir="$infodir/commit-graphs" &&
> +	test_oid_init

Why do you use `test_oid_init`, when you are *not* using `test_oid`?
I guess it is because t5318-commit-graph.sh uses it, isn't it?

The `graphdir` shell variable is not used either.

> +'
> +
> +test_expect_success 'create 9 commits and repack' '
> +	test_commit c1 file1 &&
> +	test_commit c2 file2 &&
> +	test_commit c3 file3 &&
> +	test_commit c4 file1 &&
> +	test_commit c5 file2 &&
> +	test_commit c6 file3 &&
> +	test_commit c7 file1 &&
> +	test_commit c8 file2 &&
> +	test_commit c9 file3
> +'

Wouldn't it be better for this step to be a part of 'setup repo' step?
Anyway, the test name says '... and repack', but `git repack` is missing
(it should be done after last test_commit).

I think it would be good idea to test behavior of Bloom filters with
respect to directories, so at least one file should be in a subdirectory
(maybe even deeper in hierarchy).

We should also test the behavior with respect to merges, and when we are
not following the first parent.  But that might be a separate part of
this test.

> +
> +printf "c7\nc4\nc1" > expect_file1

Doing things outside test is discouraged.  We can create a separate test
that creates those expect_file* files, or it can be a part of 'create
commits' test.

Anyway, instead of doing test without Bloom filters (something that
should have been tested already by other parts of testsuite), and then
doing the same test with Bloom filter, why not compare that the result
without and with Bloom filter is the same.  The t5318-commit-graph.sh
test does this with help of graph_git_two_modes() function:

  graph_git_two_modes () {
  	git -c core.commitGraph=true  $1 >output
  	git -c core.commitGraph=false $1 >expect
  	test_cmp expect output
  }

Sidenote: I wonder if it is high time to create t/lib-commit-graph.sh
helper with, among others, this common function.

> +
> +test_expect_success 'log without bloom filters' '
> +	git log --pretty="format:%s"  -- file1 > actual &&

CodingGuidelines:57: Redirection operators should be written with space
before, but no space after them.  (Minor nitpick)

  +	git log --pretty="format:%s"  -- file1 >actual &&

> +	test_cmp expect_file1 actual
> +'
> +
> +printf "c8\nc7\nc5\nc4\nc2\nc1" > expect_file1_file2

  +printf "c8\nc7\nc5\nc4\nc2\nc1" >expect_file1_file2

> +
> +test_expect_success 'multi-path log without bloom filters' '
> +	git log --pretty="format:%s"  -- file1 file2 > actual &&

  +	git log --pretty="format:%s"  -- file1 file2 >actual &&

> +	test_cmp expect_file1_file2 actual
> +'
> +
> +graph_read_expect() {

CodingGuidelines:144: We prefer a space between the function name and
the parentheses, and no space inside the parentheses.  (Minor nitpick)

  +graph_read_expect () {

> +	OPTIONAL=""
> +	NUM_CHUNKS=5
> +	if test ! -z $2
> +	then
> +		OPTIONAL=" $2"
> +		NUM_CHUNKS=$((3 + $(echo "$2" | wc -w)))

This should be either

  +		NUM_CHUNKS=$((5 + $(echo "$2" | wc -w)))

or more future proof

  +		NUM_CHUNKS=$(($NUM_CHUNKS + $(echo "$2" | wc -w)))

We got away with this bug because we were not using octopus merges, and
there were no optional core chunks, that is we never call
graph_read_expect with second parameter in this test.

> +	fi
> +	cat >expect <<- EOF

Why there is space between "<<-" and "EOF"?

> +	header: 43475048 1 1 $NUM_CHUNKS 0
> +	num_commits: $1
> +	chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data$OPTIONAL
> +	EOF
> +	test-tool read-graph >output &&
> +	test_cmp expect output
> +}
> +
> +test_expect_success 'write commit graph with bloom filters' '
> +	git commit-graph write --reachable --changed-paths &&
> +	test_path_is_file $infodir/commit-graph &&
> +	graph_read_expect "9"
> +'

All right, this is preparatory step for further tests.

> +
> +test_expect_success 'log using bloom filters' '
> +	git log --pretty="format:%s" -- file1 > actual &&
> +	test_cmp expect_file1 actual
> +'
> +
> +test_expect_success 'multi-path log using bloom filters' '
> +	git log --pretty="format:%s"  -- file1 file2 > actual &&
> +	test_cmp expect_file1_file2 actual
> +'

With graph_git_two_modes() it would be much simpler:

  +test_expect_success 'single-path log' '
  +	git_graph_two_modes "git log --pretty=format:%s -- file1"
  +'

  +test_expect_success 'multi-path log' '
  +	git_graph_two_modes "git log --pretty=format:%s -- file1 file2"
  +'

What is missing is:
1. checking that Git is actually _using_ Bloom filters
   (which might be difficult to do)
2. testing that Bloom filters work also for history of subdirectories
   e.g. "git log -- subdir/" and "git log -- subdir"; this would of
   course require adjusting setup step
3. testing specific behaviors, like "git log --all -- file1"
4. merges with history following second parent
5. commits with no changes and/or merges with no first-parent changes
6. commit with more than 512 changed files (marked as slow test,
   and perhaps created with fast-import interface, like bulk commit
   creation in test_commit_bulk)

> +
> +test_done

Best regards,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH 9/9] commit-graph: add GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS test flag
  2019-12-20 22:05 ` [PATCH 9/9] commit-graph: add GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS test flag Garima Singh via GitGitGadget
@ 2020-01-11 19:56   ` Jakub Narebski
  2020-01-15  0:55     ` Garima Singh
  0 siblings, 1 reply; 150+ messages in thread
From: Jakub Narebski @ 2020-01-11 19:56 UTC (permalink / raw)
  To: Garima Singh via GitGitGadget
  Cc: git, stolee, szeder.dev, jonathantanmy, jeffhost, me, peff,
	Junio C Hamano, Garima Singh

"Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Garima Singh <garima.singh@microsoft.com>
>
> Add GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS test flag to the test setup suite in
> order to toggle writing bloom filters when running any of the git tests. If set
> to true, we will compute and write bloom filters every time a test calls
> `git commit-graph write`.

OK, so it works in addition to GIT_TEST_COMMIT_GRAPH.

>
> The test suite passes when GIT_TEST_COMMIT_GRAPH and
> GIT_COMMIT_GRAPH_BLOOM_FILTERS are enabled.

Good.  Very good.

No errors found by Continuous Integration setup either?

>
> Helped-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> ---
>  builtin/commit-graph.c        | 2 +-
>  ci/run-build-and-tests.sh     | 1 +
>  commit-graph.h                | 1 +
>  t/README                      | 3 +++
>  t/t4216-log-bloom.sh          | 3 +++
>  t/t5318-commit-graph.sh       | 2 ++
>  t/t5324-split-commit-graph.sh | 1 +
>  t/t5325-commit-graph-bloom.sh | 3 +++
>  8 files changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
> index 9bd1e11161..97167959b2 100644
> --- a/builtin/commit-graph.c
> +++ b/builtin/commit-graph.c
> @@ -146,7 +146,7 @@ static int graph_write(int argc, const char **argv)
>  		flags |= COMMIT_GRAPH_WRITE_SPLIT;
>  	if (opts.progress)
>  		flags |= COMMIT_GRAPH_WRITE_PROGRESS;
> -	if (opts.enable_bloom_filters)
> +	if (opts.enable_bloom_filters || git_env_bool(GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS, 0))
>  		flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;
>

Very minor nitpick: not to make this line long, I would break it at the
boolean operator, that is write

  -	if (opts.enable_bloom_filters)
  +	if (opts.enable_bloom_filters ||
  +	    git_env_bool(GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS, 0))  
   		flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;

I agree that this is a good place to put this check, by pretending that
`--changed-paths` option was given on command line.  Looks good.

>  	read_replace_refs = 0;
> diff --git a/ci/run-build-and-tests.sh b/ci/run-build-and-tests.sh
> index ff0ef7f08e..19d0846d34 100755
> --- a/ci/run-build-and-tests.sh
> +++ b/ci/run-build-and-tests.sh
> @@ -19,6 +19,7 @@ linux-gcc)
>  	export GIT_TEST_OE_SIZE=10
>  	export GIT_TEST_OE_DELTA_SIZE=5
>  	export GIT_TEST_COMMIT_GRAPH=1
> +	export GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS=1
>  	export GIT_TEST_MULTI_PACK_INDEX=1
>  	make test
>  	;;

All right, adding this to CI would certainly exercise this feature.

> diff --git a/commit-graph.h b/commit-graph.h
> index 2202ad91ae..d914e6abf1 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -8,6 +8,7 @@
>  
>  #define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
>  #define GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD "GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD"
> +#define GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS "GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS"
>

All right (the ordering is a mater of taste).

>  struct commit;
>  struct bloom_filter_settings;
> diff --git a/t/README b/t/README
> index caa125ba9a..399b190437 100644
> --- a/t/README
> +++ b/t/README
> @@ -378,6 +378,9 @@ GIT_TEST_COMMIT_GRAPH=<boolean>, when true, forces the commit-graph to
>  be written after every 'git commit' command, and overrides the
>  'core.commitGraph' setting to true.
>  
> +GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS=<boolean>, when true, forces commit-graph
> +write to compute and write bloom filters for every 'git commit-graph write'
> +

Thanks for documenting this.

Missing full stop '.' at the end of the sentence.  (Minor nit).

We might want to add ", as if '--changed-paths' option was given.", but
it is not strictly necessary.

>  GIT_TEST_FSMONITOR=$PWD/t7519/fsmonitor-all exercises the fsmonitor
>  code path for utilizing a file system monitor to speed up detecting
>  new or changed files.
> diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
> index d42f077998..0e092b387c 100755
> --- a/t/t4216-log-bloom.sh
> +++ b/t/t4216-log-bloom.sh
> @@ -3,6 +3,9 @@
>  test_description='git log for a path with bloom filters'
>  . ./test-lib.sh
>  
> +GIT_TEST_COMMIT_GRAPH=0
> +GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS=0
> +

OK, neither of those setting increase the coverage for this test.

On the other hand they won't make tests fail (or at least they
shouldn't, I think).  The t5318-commit-graph.sh doesn't include 
GIT_TEST_COMMIT_GRAPH=0, after all.

>  test_expect_success 'setup repo' '
>  	git init &&
>  	git config core.commitGraph true &&
> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index 3f03de6018..613228bb12 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -3,6 +3,8 @@
>  test_description='commit graph'
>  . ./test-lib.sh
>  
> +GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS=0
> +

All right, adding Bloom filters to commit-graph file would make test
cases utilizing graph_read_expect() fail.

>  test_expect_success 'setup full repo' '
>  	mkdir full &&
>  	cd "$TRASH_DIRECTORY/full" &&
> diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
> index c24823431f..181ca7e0cb 100755
> --- a/t/t5324-split-commit-graph.sh
> +++ b/t/t5324-split-commit-graph.sh
> @@ -4,6 +4,7 @@ test_description='split commit graph'
>  . ./test-lib.sh
>  
>  GIT_TEST_COMMIT_GRAPH=0
> +GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS=0

All right, same here: adding Bloom filters to commit-graph would make
test cases utilizing graph_read_expect() fail.

Sidenote: Here without GIT_TEST_COMMIT_GRAPH=0 tests that rely on
precise timing of writing commit-graph to create split commit-graph
would fail.

>  
>  test_expect_success 'setup repo' '
>  	git init &&
> diff --git a/t/t5325-commit-graph-bloom.sh b/t/t5325-commit-graph-bloom.sh
> index d7ef0e7fb3..a9c9e9fef6 100755
> --- a/t/t5325-commit-graph-bloom.sh
> +++ b/t/t5325-commit-graph-bloom.sh
> @@ -3,6 +3,9 @@
>  test_description='commit graph with bloom filters'
>  . ./test-lib.sh
>  
> +GIT_TEST_COMMIT_GRAPH=0
> +GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS=0
> +

This test also includes some split commit-graph test cases, so the above
is necessary.

All right.

>  test_expect_success 'setup repo' '
>  	git init &&
>  	git config core.commitGraph true &&

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH 0/9] [RFC] Changed Paths Bloom Filters
  2019-12-31 16:45 ` Jakub Narebski
@ 2020-01-13 16:54   ` Garima Singh
  2020-01-20 13:48     ` Jakub Narebski
  0 siblings, 1 reply; 150+ messages in thread
From: Garima Singh @ 2020-01-13 16:54 UTC (permalink / raw)
  To: Jakub Narebski, Garima Singh via GitGitGadget
  Cc: git, stolee, szeder.dev, jonathantanmy, jeffhost, me, peff,
	Junio C Hamano

On 12/31/2019 11:45 AM, Jakub Narebski wrote:
> "Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
>> Performance Gains: We tested the performance of 'git log -- <path>' on the git
>> repo, the linux repo and some internal large repos, with a variety of paths
>> of varying depths.
>>
>> On the git and linux repos: We observed a 2x to 5x speed up.
>>
>> On a large internal repo with files seated 6-10 levels deep in the tree: We
>> observed 10x to 20x speed ups, with some paths going up to 28 times faster.
> 
> Could you provide some more statistics about this internal repository,
> such as number of files, number of commits, perhaps also number of all
> objects?  Thanks in advance.
> 
> I wonder why such large difference in performance 2-5x vs 10-20x.  Is it
> about the depth of the file hierarchy?  How would the numbers look for
> files seated closer to the root in the same large repository, like 3-5
> levels deep in the tree?

The internal repository we saw these massive gains on has:
- 413579 commits. 
- 183303 files distributed across 34482 folders
The size on disk is about 17 GiB. 

And yes, the difference is performance gains is mostly because of how 
deep the files were in the hierarchy. How often a file has been touched
also makes a difference. The performance gains are less dramatic if the 
file has a very sparse history even if it is a deep file. 

The numbers from the git and linux repos for instance, are for files 
closer to the root, hence 2x to 5x. 

Thanks! 
Garima Singh

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH 2/9] commit-graph: write changed paths bloom filters
  2020-01-06 18:44   ` Jakub Narebski
@ 2020-01-13 19:48     ` Garima Singh
  0 siblings, 0 replies; 150+ messages in thread
From: Garima Singh @ 2020-01-13 19:48 UTC (permalink / raw)
  To: Jakub Narebski, Garima Singh via GitGitGadget
  Cc: git, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
	Jeff Hostetler, Taylor Blau, Jeff King, Junio C Hamano,
	Garima Singh


On 1/6/2020 1:44 PM, Jakub Narebski wrote:

> "Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes: 
>> 3. The filters are sized according to the number of changes in the each commit,
>>    with minimum size of one 64 bit word.
> 
> Do I understand it correctly that the size of filter is 10*(number of
> changed files) bits, rounded up to nearest multiple of 64?
>

Yes.
  
>> +
>> +struct pathmap_hash_entry {
>> +    struct hashmap_entry entry;
>> +    const char path[FLEX_ARRAY];
>> +};
> 
> Hmmm... I wonder why use hashmap and not string_list.  This is for
> adding path with leading directories to the Bloom filter, isn't it?
> 

Yes. We do not want to repeat directories in the filter.

Thanks!
Garima Singh


^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH 5/9] commit-graph: write changed path bloom filters to commit-graph file.
  2020-01-07 16:01   ` Jakub Narebski
@ 2020-01-14 15:14     ` Garima Singh
  0 siblings, 0 replies; 150+ messages in thread
From: Garima Singh @ 2020-01-14 15:14 UTC (permalink / raw)
  To: Jakub Narebski, Garima Singh via GitGitGadget
  Cc: git, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
	Jeff Hostetler, Taylor Blau, Jeff King, Junio C Hamano,
	Garima Singh


On 1/7/2020 11:01 AM, Jakub Narebski wrote:
> "Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
>> +
>> +				if (hash_version != 1)
>> +					break;
> 
> What does it mean for Git?  Behave as if there were no Bloom filter
> data?
>

Yes. We choose to use Bloom filters best effort, and in such cases, we will 
just fall back to the original code path. 

>>  	if (ctx->num_commit_graphs_after > 1 &&
>>  	    write_graph_chunk_base(f, ctx)) {
>>  		return -1;
>> diff --git a/commit-graph.h b/commit-graph.h
>> index 952a4b83be..2202ad91ae 100644
>> --- a/commit-graph.h
>> +++ b/commit-graph.h
>> @@ -10,6 +10,7 @@
>>  #define GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD "GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD"
>>  
>>  struct commit;
>> +struct bloom_filter_settings;
>>  
>>  char *get_commit_graph_filename(const char *obj_dir);
>>  int open_commit_graph(const char *graph_file, int *fd, struct stat *st);
>> @@ -58,6 +59,10 @@ struct commit_graph {
>>  	const unsigned char *chunk_commit_data;
>>  	const unsigned char *chunk_extra_edges;
>>  	const unsigned char *chunk_base_graphs;
>> +	const unsigned char *chunk_bloom_indexes;
>> +	const unsigned char *chunk_bloom_data;
>> +
>> +	struct bloom_filter_settings *settings;
> 
> Should this be part of `struct commit_graph`?  Shouldn't we free() this
> data, or is it a pointer into xmmap-ped file... no it isn't -- we
> xalloc() it, so we should free() it.
> 
> I think it should be done in 'cleanup:' section of write_commit_graph(),
> but I am not entirely sure.
> 

Thanks for calling this out! This is definitely a bug in how commit-graph.c
frees up the graph. The right way to free the graph would be to call
free_commit_graph() instead of free(graph) like many places in that file. 

Cleaning up this entire pattern would be orthogonal to this series, so 
I will follow up with a separate series that cleans it up overall. 

For now, I will free up `bloom_filter_settings` in free_commit_graph(). 

Cheers! 
Garima Singh

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH 8/9] revision.c: use bloom filters to speed up path based revision walks
  2020-01-11  0:27   ` Jakub Narebski
@ 2020-01-15  0:08     ` Garima Singh
  0 siblings, 0 replies; 150+ messages in thread
From: Garima Singh @ 2020-01-15  0:08 UTC (permalink / raw)
  To: Jakub Narebski, Garima Singh via GitGitGadget
  Cc: git, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
	Jeff Hostetler, Taylor Blau, Jeff King, Junio C Hamano,
	Garima Singh


On 1/10/2020 7:27 PM, Jakub Narebski wrote:
> "Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>  		case REV_TREE_SAME:
>>  			if (!revs->simplify_history || !relevant_commit(p)) {
>>  				/* Even if a merge with an uninteresting
>> @@ -3342,6 +3376,33 @@ static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
>>  	}
>>  }
>>  
>> +static void prepare_to_use_bloom_filter(struct rev_info *revs)
> 
> All right, I see that pointers to bloom_key and bloom_filter_settings
> were added to the rev_info struct.  I understand why the former is here,
> but the latter seems to be there just as a shortcut (to not owned data),
> which is fine but a bit strange.
> 
> Or is the latter here to allow for Bloom filter settings to possibly
> change from commit-graph file in the chain to commit-graph file, and
> thus from commit to commit?
> 

The latter. The idea is to keep the implementation open to that 
possibility. Load the bloom filter settings from the commit graph file 
you are dealing with and use those settings to fill out the bloom_key.

>> +	const char *path;
>> +	size_t len;
>> +
>> +	if (!revs->commits)
>> +	    return;
> 
> When revs->commits may be NULL?  I understand that we need to have this
> check because we use revs->commits->item next (sidenote: can revs ever
> be NULL?).
> 
> Would `git log --all -- <path>` use Bloom filters (as it theoretically
> could)?
> 

I am being defensive about revs->commits being NULL for the reason you 
called out. 

And yes, `git log --all <path>` does use Bloom filters if provided a 
single pathspec and when following the first parent.  

>> +
>> +	if (!revs->repo->objects->commit_graph)
>> +		return;
>> +
>> +	revs->bloom_filter_settings = revs->repo->objects->commit_graph->settings;
>> +	if (!revs->bloom_filter_settings)
>> +		return;
> 
> All right, so if there is no commit graph, or the commit graph does not
> include Bloom filter data, there is nothing to do.
> 
> Though I worry that it would make Git do not use Bloom filter if the top
> commit-graph in the chain does not include Bloom filter data, while
> other commit-graph files do (and Git could have used that information to
> speed up the file history query).
> 

Thanks for bringing this up! I need to test this scenario out more. 

>> +
>> +printf "c7\nc4\nc1" > expect_file1
> 
> Doing things outside test is discouraged.  We can create a separate test
> that creates those expect_file* files, or it can be a part of 'create
> commits' test.
> 
> Anyway, instead of doing test without Bloom filters (something that
> should have been tested already by other parts of testsuite), and then
> doing the same test with Bloom filter, why not compare that the result
> without and with Bloom filter is the same.  The t5318-commit-graph.sh
> test does this with help of graph_git_two_modes() function:
> 
>   graph_git_two_modes () {
>   	git -c core.commitGraph=true  $1 >output
>   	git -c core.commitGraph=false $1 >expect
>   	test_cmp expect output
>   }
> 
> Sidenote: I wonder if it is high time to create t/lib-commit-graph.sh
> helper with, among others, this common function.
>

Thank you! I have restructured the test to do almost everything you 
have suggested. 

Also, a note for this v1 RFC series: I am working on proper formal tests 
right now. I didn't want to wait to get these nailed down before sending 
out the RFC series and getting the ball rolling. 

I have taken note of all the testing suggestions you have made in your 
review. They are very helpful and I appreciate it! 

Creating t/lib-commit-graph.sh helper would be orthogonal to this series,
so I will follow up with a separate series that does this. 

Cheers! 
Garima Singh

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH 9/9] commit-graph: add GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS test flag
  2020-01-11 19:56   ` Jakub Narebski
@ 2020-01-15  0:55     ` Garima Singh
  0 siblings, 0 replies; 150+ messages in thread
From: Garima Singh @ 2020-01-15  0:55 UTC (permalink / raw)
  To: Jakub Narebski, Garima Singh via GitGitGadget
  Cc: git, stolee, szeder.dev, jonathantanmy, jeffhost, me, peff,
	Junio C Hamano, Garima Singh


On 1/11/2020 2:56 PM, Jakub Narebski wrote:
> "Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>
>> The test suite passes when GIT_TEST_COMMIT_GRAPH and
>> GIT_COMMIT_GRAPH_BLOOM_FILTERS are enabled.
> 
> Good.  Very good.
> 
> No errors found by Continuous Integration setup either?
> 

Yes, the CI test pipelines all passed.

Cheers! 
Garima Singh

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH 0/9] [RFC] Changed Paths Bloom Filters
  2020-01-13 16:54   ` Garima Singh
@ 2020-01-20 13:48     ` Jakub Narebski
  2020-01-21 16:14       ` Garima Singh
  0 siblings, 1 reply; 150+ messages in thread
From: Jakub Narebski @ 2020-01-20 13:48 UTC (permalink / raw)
  To: Garima Singh
  Cc: Garima Singh via GitGitGadget, git, stolee, szeder.dev,
	jonathantanmy, jeffhost, me, peff, Junio C Hamano

Garima Singh <garimasigit@gmail.com> writes:
> On 12/31/2019 11:45 AM, Jakub Narebski wrote:
>> "Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>>
>>> Performance Gains: We tested the performance of 'git log -- <path>' on the git
>>> repo, the linux repo and some internal large repos, with a variety of paths
>>> of varying depths.
>>>
>>> On the git and linux repos: We observed a 2x to 5x speed up.
>>>
>>> On a large internal repo with files seated 6-10 levels deep in the tree: We
>>> observed 10x to 20x speed ups, with some paths going up to 28 times faster.
>> 
>> Could you provide some more statistics about this internal repository,
>> such as number of files, number of commits, perhaps also number of all
>> objects?  Thanks in advance.
>> 
>> I wonder why such large difference in performance 2-5x vs 10-20x.  Is it
>> about the depth of the file hierarchy?  How would the numbers look for
>> files seated closer to the root in the same large repository, like 3-5
>> levels deep in the tree?
>
> The internal repository we saw these massive gains on has:
> - 413579 commits. 
> - 183303 files distributed across 34482 folders
> The size on disk is about 17 GiB. 

Thank you for the data.  Such information would be important
consideration to help to find out whether enabling Bloom filters in
given repository would be worth it.

> And yes, the difference is performance gains is mostly because of how 
> deep the files were in the hierarchy.

Right, this is understandable.  If files are diep in hierarchy, then we
have to unpack more tree objects to find out if the file was changed in
a given commit (provided that finding differences do not terminate early
thanks to hierarchical structure of tree objects).

>                                      How often a file has been touched
> also makes a difference. The performance gains are less dramatic if the 
> file has a very sparse history even if it is a deep file.

This looks a bit strange (or maybe I don't understand something).

Bloom filter can answer "no" and "maybe" to subset inclusion query.
This means that if file was *not* changed, with great probability the
answer from Bloom filter would be "no", and we would skip diff-ing
trees (which may terminate early, though).

On the other hand if file was changed by the commit, and the answer from
a Bloom filter is "maybe", then we have to perform diffing to make sure.

>
> The numbers from the git and linux repos for instance, are for files 
> closer to the root, hence 2x to 5x. 

That is quite nice speedup, anyway (git repository cannot be even
considered large; medium -- maybe).


P.S. I wonder if it would be worth to create some synthetical repository
to test performance gains of Bloom filters, perhaps in t/perf...

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH 0/9] [RFC] Changed Paths Bloom Filters
  2020-01-20 13:48     ` Jakub Narebski
@ 2020-01-21 16:14       ` Garima Singh
  2020-02-02 18:43         ` Jakub Narebski
  0 siblings, 1 reply; 150+ messages in thread
From: Garima Singh @ 2020-01-21 16:14 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Garima Singh via GitGitGadget, git, stolee, szeder.dev,
	jonathantanmy, jeffhost, me, peff, Junio C Hamano


On 1/20/2020 8:48 AM, Jakub Narebski wrote: >>                                      How often a file has been touched
>> also makes a difference. The performance gains are less dramatic if the 
>> file has a very sparse history even if it is a deep file.
> 
> This looks a bit strange (or maybe I don't understand something).
> 
> Bloom filter can answer "no" and "maybe" to subset inclusion query.
> This means that if file was *not* changed, with great probability the
> answer from Bloom filter would be "no", and we would skip diff-ing
> trees (which may terminate early, though).
> 
> On the other hand if file was changed by the commit, and the answer from
> a Bloom filter is "maybe", then we have to perform diffing to make sure.
>

Yes. What I meant by statement however is that the performance gain i.e. 
difference in performance between using and not using bloom filters, is not 
always as dramatic if the history is sparse and the trees aren't touched 
as often. So it is largely dependent on the shape of the repo and the shape
of the commit graph. 
 
>>
>> The numbers from the git and linux repos for instance, are for files 
>> closer to the root, hence 2x to 5x. 
> 
> That is quite nice speedup, anyway (git repository cannot be even
> considered large; medium -- maybe).
> 

Yeah. Git and Linux served as nice initial test beds. If you have any 
suggestions for interesting repos it would be worth running performanc 
investigations on, do let me know! 

> 
> P.S. I wonder if it would be worth to create some synthetical repository
> to test performance gains of Bloom filters, perhaps in t/perf...
> 

I will look into this after I get v1 out on the mailing list. 
Thanks! 

Cheers
Garima Singh

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH 0/9] [RFC] Changed Paths Bloom Filters
  2019-12-20 22:05 [PATCH 0/9] [RFC] Changed Paths Bloom Filters Garima Singh via GitGitGadget
                   ` (12 preceding siblings ...)
  2019-12-31 16:45 ` Jakub Narebski
@ 2020-01-21 23:40 ` Emily Shaffer
  2020-01-27 18:24   ` Garima Singh
  2020-02-01 23:32   ` Jakub Narebski
  2020-02-05 22:56 ` [PATCH v2 00/11] " Garima Singh via GitGitGadget
  2020-03-05 19:49 ` [PATCH 0/9] [RFC] " Garima Singh
  15 siblings, 2 replies; 150+ messages in thread
From: Emily Shaffer @ 2020-01-21 23:40 UTC (permalink / raw)
  To: Garima Singh via GitGitGadget
  Cc: git, stolee, szeder.dev, jonathantanmy, jeffhost, me, peff,
	Junio C Hamano

Hi,

On Fri, Dec 20, 2019 at 10:05:11PM +0000, Garima Singh via GitGitGadget wrote:
> This series is intended to start the conversation and many of the commit
> messages include specific call outs for suggestions and thoughts. 

Since it's mostly in RFC stage, I'm holding off on line by line comments
for now. I read the series (thanks for your patience) and I'll try to
leave some review thoughts in the diffspec here.

> Garima Singh (9):
>   [1/9] commit-graph: add --changed-paths option to write
I wonder if this can be combined with 2; without 2 actually the
documentation is wrong for this one, right? Although I suppose you also
mentioned 2 perhaps being too long :)

>   [2/9] commit-graph: write changed paths bloom filters
As I understand it, this one always regenerates the bloom filter pieces,
and doesn't write it down in the commit-graph file. How much longer does
that take now than before? I don't have a great feel for how often 'git
commit-graph' is run, or whether I need to be invoking it manually.

>   [3/9] commit-graph: use MAX_NUM_CHUNKS
>   [4/9] commit-graph: document bloom filter format
I suppose I might like to see this commit squashed with 5, but it's a
nit. I'm thinking it'd be handy to say "git blame commit-graph" and see
some nice doc about the format expected in the commit-graph file.

>   [5/9] commit-graph: write changed path bloom filters to commit-graph file.
Ah, so here we finally write down the result from 2 to disk in the
commit-graph file. And without 7, this gets recalculated every time we
call 'git commit-graph' still.

As for a technical doc around here, I'd really appreciate one. But I'm
speaking selfishly - I'd also be happy if I could watch a talk about
this design to make sure I understand it right :)

>   [6/9] commit-graph: test commit-graph write --changed-paths
>   [7/9] commit-graph: reuse existing bloom filters during write.
I saw an option to give up if there wasn't an existing bloom filter, but
I didn't see an option here to force recalculating. Is there a scenario
when that would be useful? What's the mitigation path if:
 - I have a commit-graph with v0 of the bloom index piece, but update to
   Git which uses v1?
 - My commit-graph file is corrupted in a way that the bloom filter
   results are incorrect and I am missing a blob change (and therefore
   not finding it during the walk)?
I think I understand that without this commit, 8 is not much speedup
because we will be recalculating the filter for each commit, rather than
using the written-down commit-graph file.

>   [8/9] revision.c: use bloom filters to speed up path based revision walks
>   [9/9] commit-graph: add GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS test flag

Speaking of making sure I understand the change right, I will also
summarize my own understanding in the hopes that I can be corrected and
others can learn too ;)

 - The general idea is that we can write down a hint at the tree level
   saying "something I own did change this commit"; when we look for an
   object later, we can skip the commit where it looks like that path
   didn't change.
 - The change is in two pieces: first, to generate the hints per tree
   (which happens during commit-graph); second, to use those hints to
   optimize a rev walk (which happens in revision.c patch 8)
 - When we calculate the hints during commit-graph, we check the diff of
   each tree compared to its recent ancestor to see if there was a
   change; if so we calculate a hash for each path and use that as a key
   for a map from hash to path. After we look through everything changed
   in the diff, we can add it to a cumulative bloom filter list (one
   filter per commit) so we have a handy in-memory idea of which paths
   changed in each commit.
 - When it's time to do the rev walk, we ask for the bloom filter for
   each commit and check if that commit's map contains the path to the
   object we're worried about; if so, then it's OK to unpack the tree
   and check if the path we are interested in actually did get changed
   during that commit.

Thanks.
 - Emily

> 
>  Documentation/git-commit-graph.txt            |   5 +
>  .../technical/commit-graph-format.txt         |  17 ++
>  Makefile                                      |   1 +
>  bloom.c                                       | 257 +++++++++++++++++
>  bloom.h                                       |  51 ++++
>  builtin/commit-graph.c                        |   9 +-
>  ci/run-build-and-tests.sh                     |   1 +
>  commit-graph.c                                | 116 +++++++-
>  commit-graph.h                                |   9 +-
>  revision.c                                    |  67 ++++-
>  revision.h                                    |   5 +
>  t/README                                      |   3 +
>  t/helper/test-read-graph.c                    |   4 +
>  t/t4216-log-bloom.sh                          |  77 ++++++
>  t/t5318-commit-graph.sh                       |   2 +
>  t/t5324-split-commit-graph.sh                 |   1 +
>  t/t5325-commit-graph-bloom.sh                 | 258 ++++++++++++++++++
>  17 files changed, 875 insertions(+), 8 deletions(-)
>  create mode 100644 bloom.c
>  create mode 100644 bloom.h
>  create mode 100755 t/t4216-log-bloom.sh
>  create mode 100755 t/t5325-commit-graph-bloom.sh
> 
> 
> base-commit: b02fd2accad4d48078671adf38fe5b5976d77304
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-497%2Fgarimasi514%2FcoreGit-bloomFilters-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-497/garimasi514/coreGit-bloomFilters-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/497
> -- 
> gitgitgadget

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH 0/9] [RFC] Changed Paths Bloom Filters
  2020-01-21 23:40 ` Emily Shaffer
@ 2020-01-27 18:24   ` Garima Singh
  2020-02-01 23:32   ` Jakub Narebski
  1 sibling, 0 replies; 150+ messages in thread
From: Garima Singh @ 2020-01-27 18:24 UTC (permalink / raw)
  To: Emily Shaffer, Garima Singh via GitGitGadget
  Cc: git, stolee, szeder.dev, jonathantanmy, jeffhost, me, peff,
	Junio C Hamano

On 1/21/2020 6:40 PM, Emily Shaffer wrote:>> Garima Singh (9):
>>   [1/9] commit-graph: add --changed-paths option to write
> I wonder if this can be combined with 2; without 2 actually the
> documentation is wrong for this one, right? Although I suppose you also
> mentioned 2 perhaps being too long :)
> 

True. The ordering of these commits has been a very subjective discussion. 
Leaving this commit isolated the way it is, does help separate the option
from the bloom filter computation and commit graph write step. 

Also for clarity, I have changed the message for 2/9 to:
`commit-graph: compute Bloom filters for changed paths`

 
>>   [2/9] commit-graph: write changed paths bloom filters
> As I understand it, this one always regenerates the bloom filter pieces,
> and doesn't write it down in the commit-graph file. How much longer does
> that take now than before? I don't have a great feel for how often 'git
> commit-graph' is run, or whether I need to be invoking it manually.
> 

Yes. Computation and writing the bloom filters to the commit-graph file
(2/9 and 5/9) are ideally one time operations per commit. The time taken
depends on the shape and size of the repo since computation involves 
running a full diff. If you look at the discussions and patches exchanged
by Dr. Stolee and Peff: it has improved greatly since this RFC patch. 
My next submission will carry these patches and more concrete numbers for
time and memory. 

Also, `git commit-graph write` is not run very frequently and is ideally
incremental. A full rewrite is usually only done in case there is a 
corruption caught by `git commit-graph verify` in which case, you delete
and rewrite. These are both manual operations. 
See the docs here: 
https://git-scm.com/docs/git-commit-graph

Note: that fetch.writeCommitGraph is now on by default in which case 
new computations happen automatically for newly fetched commits. But 
I am not adding the --changed-paths option to that yet. So computing, 
writing and using bloom filters will still be an opt-in feature 
that users need to manually run. 
 
>>   [7/9] commit-graph: reuse existing bloom filters during write.
> I saw an option to give up if there wasn't an existing bloom filter, but
> I didn't see an option here to force recalculating. 

The way to force recalculation at the moment would be to delete the commit
graph file and write it again. 

Is there a scenario when that would be useful? 

Yes, if there are algorithm or hash version changes in the changed paths logic
we would need to rewrite. Each of the cases I can think of would involve
triggering a recalculation by deleting and rewriting.

What's the mitigation path if:
>  - I have a commit-graph with v0 of the bloom index piece, but update to
>    Git which uses v1?     
   
     Take a look at 5/9, when we are parsing the commit graph: if the code 
     is expected to work with a particular version (hash_version = 1 in the 
     current version), and the commit graph has a different version, we just 
     ignore it. In the future, this is where we could extend this to support 
     multiple versions.

>  - My commit-graph file is corrupted in a way that the bloom filter
>    results are incorrect and I am missing a blob change (and therefore
>    not finding it during the walk)?
         The mitigation for any commit graph corruption is to delete and 
     rewrite. 

     If however, we are confident that the bloom filter computation itself
     is wrong, the immediate mitigations would be to deleting the commit
     graph file and rewriting without the --changed-paths option; and ofc
     report the bug so it can be investigated and fixed. :) 

> Speaking of making sure I understand the change right, I will also
> summarize my own understanding in the hopes that I can be corrected and
> others can learn too ;)
> 
>  - The general idea is that we can write down a hint at the tree level
>    saying "something I own did change this commit"; when we look for an
>    object later, we can skip the commit where it looks like that path
>    didn't change.

The hint we are storing is a bloom filter which answers "No" or "Maybe"
to the question "Did file A change in commit c"
If the answer is No, we can ignore walking that commit. Else, we fall
back to the diff algorithm like before to confirm if the file changed or
not. 

>  - The change is in two pieces: first, to generate the hints per tree
>    (which happens during commit-graph); second, to use those hints to
>    optimize a rev walk (which happens in revision.c patch 8)

Yes. 

>  - When we calculate the hints during commit-graph, we check the diff of
>    each tree compared to its recent ancestor to see if there was a
>    change; if so we calculate a hash for each path and use that as a key
>    for a map from hash to path. After we look through everything changed
>    in the diff, we can add it to a cumulative bloom filter list (one
>    filter per commit) so we have a handy in-memory idea of which paths
>    changed in each commit.
>  - When it's time to do the rev walk, we ask for the bloom filter for
>    each commit and check if that commit's map contains the path to the
>    object we're worried about; if so, then it's OK to unpack the tree
>    and check if the path we are interested in actually did get changed
>    during that commit.

Essentially yes. There are a few implementation specifics this description
is glossing over, but I understand that is the intention. 

 
> Thanks.
>  - Emily

Cheers! 
Garima Singh

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH 0/9] [RFC] Changed Paths Bloom Filters
  2020-01-21 23:40 ` Emily Shaffer
  2020-01-27 18:24   ` Garima Singh
@ 2020-02-01 23:32   ` Jakub Narebski
  1 sibling, 0 replies; 150+ messages in thread
From: Jakub Narebski @ 2020-02-01 23:32 UTC (permalink / raw)
  To: Emily Shaffer
  Cc: Garima Singh via GitGitGadget, git, stolee, szeder.dev,
	jonathantanmy, jeffhost, me, peff, Junio C Hamano

Emily Shaffer <emilyshaffer@google.com> writes:

[...]
> Speaking of making sure I understand the change right, I will also
> summarize my own understanding in the hopes that I can be corrected and
> others can learn too ;)
>
>  - The general idea is that we can write down a hint at the tree level
>    saying "something I own did change this commit"; when we look for an
>    object later, we can skip the commit where it looks like that path
>    didn't change.

Or, to be more exact, we write hint about all the files and directories
changed in the commit at the commit level.

Say, for example, that changed files are 'README' and 'subdir/file'.
We store hint that 'README', 'subdir/file' and 'subdir' paths have
changes in them.

>  - The change is in two pieces: first, to generate the hints per tree
>    (which happens during commit-graph); second, to use those hints to
>    optimize a rev walk (which happens in revision.c patch 8)

Right.

>  - When we calculate the hints during commit-graph, we check the diff of
>    each tree compared to its recent ancestor to see if there was a
>    change;

s/recent ancestor/first parent/

Right, though commits without any changes with respect to first-parent
(or null tree in case of root i.e. parentless commit) should be rare.

>           if so we calculate a hash for each path and use that as a key
>    for a map from hash to path.

Yes, and no.  Here we enter the details how Bloom filter is constructed.
We don't store paths in Bloom filter -- it would take too much space.

The hashmap is used as an implementation of mathematical set, a
temporary structure used during Bloom filter construction.  We could
have used string_list, but then we could waste time trying to add
intermediate directories multiple times (for example if 'foo/bar' and
'foo/baz' files changed, we need to add 'foo' path only once to Bloom
filter).

You can think of Bloom filter as a compact (and probabilistic, see
below) representation of set of changed paths.

>                                After we look through everything changed
>    in the diff, we can add it to a cumulative bloom filter list (one
>    filter per commit) so we have a handy in-memory idea of which paths
>    changed in each commit.

Yes.

>  - When it's time to do the rev walk, we ask for the bloom filter for
>    each commit and check if that commit's map contains the path to the
>    object we're worried about; if so, then it's OK to unpack the tree
>    and check if the path we are interested in actually did get changed
>    during that commit.

From the point of view of rev walk, we ask for the Bloom filter for each
commit walked, and check if the (sub)set of changed paths includes given
path.

Bloom filter can answer "no" -- then we can skip the commit simplifying
history, or it can answer "maybe" -- then we need to check if file was
actually changed unpacking the trees (there is around 1% probability
that Bloom filter will say "maybe" if the path is not actually changed).


From the point of view of Bloom filter, if the path was actually changed
the filter will always answer "maybe".  If the path was not changed,
then in most cases the filter will answer "no" but there is 1% of chance
that it will answer "maybe".


I hope that helps,
--
Jakub Narębski

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH 0/9] [RFC] Changed Paths Bloom Filters
  2020-01-21 16:14       ` Garima Singh
@ 2020-02-02 18:43         ` Jakub Narebski
  0 siblings, 0 replies; 150+ messages in thread
From: Jakub Narebski @ 2020-02-02 18:43 UTC (permalink / raw)
  To: Garima Singh
  Cc: Garima Singh via GitGitGadget, git, stolee, szeder.dev,
	jonathantanmy, jeffhost, me, peff, Junio C Hamano

Garima Singh <garimasigit@gmail.com> writes:
> On 1/20/2020 8:48 AM, Jakub Narebski wrote:

>>>                                      How often a file has been touched
>>> also makes a difference. The performance gains are less dramatic if the 
>>> file has a very sparse history even if it is a deep file.
>> 
>> This looks a bit strange (or maybe I don't understand something).
>> 
>> Bloom filter can answer "no" and "maybe" to subset inclusion query.
>> This means that if file was *not* changed, with great probability the
>> answer from Bloom filter would be "no", and we would skip diff-ing
>> trees (which may terminate early, though).
>> 
>> On the other hand if file was changed by the commit, and the answer from
>> a Bloom filter is "maybe", then we have to perform diffing to make sure.
>
> Yes. What I meant by statement however is that the performance gain i.e. 
> difference in performance between using and not using bloom filters, is not 
> always as dramatic if the history is sparse and the trees aren't touched 
> as often. So it is largely dependent on the shape of the repo and the shape
> of the commit graph. 

It probably depends on the depth of changes in a typical skipped commit,
I think.

If we are getting history for the
core/java/com/android/ims/internal/uce/presence/IPresenceListener.aidl
and the change is contained in libs/input/ directory, we have to unpack
only two trees, independent on the depth of the file we are asking
about.

It could be also possible, if Git is smart enough about it, to halt
early if we are checking only if given path changed rather than
calculating a full difftree.  Say, for example, that the change was in
core/java/org/apache/http/conn/ssl/SSLSocketFactory.java file, while
we were getting the history for the following file
core/java/com/android/ims/internal/uce/presence/IPresenceListener.aidl
After unpacking two or three threes we know that the second file was not
changed.  But if we compute full diff, we have to unpack 8 trees.
Quite a difference.


All example paths above came from AOSP repository that was used for
testing different proposed generation numbers v2, see
https://github.com/derrickstolee/gen-test/blob/master/clone-repos.sh

>>> The numbers from the git and linux repos for instance, are for files 
>>> closer to the root, hence 2x to 5x. 
>> 
>> That is quite nice speedup, anyway (git repository cannot be even
>> considered large; medium -- maybe).
>
> Yeah. Git and Linux served as nice initial test beds. If you have any 
> suggestions for interesting repos it would be worth running performanc 
> investigations on, do let me know! 

If we want repositories with deep path hierarchy, Java projects with
mandated directory structures might be a good choice, for example
Android (AOSP):

  git clone https://android.googlesource.com/platform/frameworks/base/ android-base

It is also quite large repository; in 2019 it had around 874000 commits,
around the same as the Linux kernel repository.

Another large repository is Chromium -- though I don't know if it has
deep filesystem hierarchy.

You can use the list of different large and large-ish repositories from
https://github.com/derrickstolee/gen-test/blob/master/clone-repos.sh
Other repositories with large number of commmits not on that list are
LLVM Compiler, GCC (GNU Compiler Collection) -- just converted to Git,
Homebrew, and Ruby on Rails.

>> P.S. I wonder if it would be worth to create some synthetical repository
>> to test performance gains of Bloom filters, perhaps in t/perf...
>> 
>
> I will look into this after I get v1 out on the mailing list. 
> Thanks! 

It would be nice to have, but it can wait.


Keep up the good work!
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 150+ messages in thread

* [PATCH v2 00/11] Changed Paths Bloom Filters
  2019-12-20 22:05 [PATCH 0/9] [RFC] Changed Paths Bloom Filters Garima Singh via GitGitGadget
                   ` (13 preceding siblings ...)
  2020-01-21 23:40 ` Emily Shaffer
@ 2020-02-05 22:56 ` Garima Singh via GitGitGadget
  2020-02-05 22:56   ` [PATCH v2 01/11] commit-graph: use MAX_NUM_CHUNKS Garima Singh via GitGitGadget
                     ` (13 more replies)
  2020-03-05 19:49 ` [PATCH 0/9] [RFC] " Garima Singh
  15 siblings, 14 replies; 150+ messages in thread
From: Garima Singh via GitGitGadget @ 2020-02-05 22:56 UTC (permalink / raw)
  To: git
  Cc: stolee, szeder.dev, jonathantanmy, jeffhost, me, peff,
	garimasigit, jnareb, christian.couder, emilyshaffer, gitster,
	Garima Singh

Hey! 

The commit graph feature brought in a lot of performance improvements across
multiple commands. However, file based history continues to be a performance
pain point, especially in large repositories. 

Adopting changed path bloom filters has been discussed on the list before,
and a prototype version was worked on by SZEDER Gábor, Jonathan Tan and Dr.
Derrick Stolee [1]. This series is based on Dr. Stolee's proof of concept in
[2]

Performance Gains: We tested the performance of git log -- path on the git
repo, the linux repo and some internal large repos, with a variety of paths
of varying depths.

On the git and linux repos: We observed a 2x to 5x speed up.

On a large internal repo with files seated 6-10 levels deep in the tree: We
observed 10x to 20x speed ups, with some paths going up to 28 times faster.

Future Work (not included in the scope of this series):

 1. Supporting multiple path based revision walk
 2. Adopting it in git blame logic. 
 3. Interactions with line log git log -L


----------------------------------------------------------------------------

Updates since the last submission

 * Removed all the RFC callouts, this is a ready for full review version
 * Added unit tests for the bloom filter computation layer
 * Added more evolved functional tests for git log
 * Fixed a lot of the bugs found by the tests
 * Reacted to other miscellaneous feedback on the RFC series. 

Cheers! Garima Singh

[1] https://lore.kernel.org/git/20181009193445.21908-1-szeder.dev@gmail.com/
[2] 
https://lore.kernel.org/git/61559c5b-546e-d61b-d2e1-68de692f5972@gmail.com/

Derrick Stolee (2):
  diff: halt tree-diff early after max_changes
  commit-graph: examine commits by generation number

Garima Singh (8):
  commit-graph: use MAX_NUM_CHUNKS
  bloom: core Bloom filter implementation for changed paths
  commit-graph: compute Bloom filters for changed paths
  commit-graph: write Bloom filters to commit graph file
  commit-graph: reuse existing Bloom filters during write.
  commit-graph: add --changed-paths option to write subcommand
  revision.c: use Bloom filters to speed up path based revision walks
  commit-graph: add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag

Jeff King (1):
  commit-graph: examine changed-path objects in pack order

 Documentation/git-commit-graph.txt            |   5 +
 .../technical/commit-graph-format.txt         |  24 ++
 Makefile                                      |   2 +
 bloom.c                                       | 277 ++++++++++++++++++
 bloom.h                                       |  58 ++++
 builtin/commit-graph.c                        |  10 +-
 ci/run-build-and-tests.sh                     |   1 +
 commit-graph.c                                | 211 ++++++++++++-
 commit-graph.h                                |   9 +-
 diff.h                                        |   5 +
 revision.c                                    | 124 +++++++-
 revision.h                                    |  11 +
 t/README                                      |   5 +
 t/helper/test-bloom.c                         |  84 ++++++
 t/helper/test-read-graph.c                    |   4 +
 t/helper/test-tool.c                          |   1 +
 t/helper/test-tool.h                          |   1 +
 t/t0095-bloom.sh                              | 113 +++++++
 t/t4216-log-bloom.sh                          | 143 +++++++++
 t/t5318-commit-graph.sh                       |   2 +
 t/t5324-split-commit-graph.sh                 |   1 +
 tree-diff.c                                   |   6 +
 22 files changed, 1088 insertions(+), 9 deletions(-)
 create mode 100644 bloom.c
 create mode 100644 bloom.h
 create mode 100644 t/helper/test-bloom.c
 create mode 100755 t/t0095-bloom.sh
 create mode 100755 t/t4216-log-bloom.sh


base-commit: 5b0ca878e008e82f91300091e793427205ce3544
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-497%2Fgarimasi514%2FcoreGit-bloomFilters-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-497/garimasi514/coreGit-bloomFilters-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/497

Range-diff vs v1:

  3:  a15f87fdcb =  1:  bf6b93878a commit-graph: use MAX_NUM_CHUNKS
  2:  e52c7ad37a !  2:  02b16d9422 commit-graph: write changed paths bloom filters
     @@ -1,65 +1,72 @@
      Author: Garima Singh <garima.singh@microsoft.com>
      
     -    commit-graph: write changed paths bloom filters
     +    bloom: core Bloom filter implementation for changed paths
      
     -    The changed path bloom filters help determine which paths changed between a
     -    commit and its first parent. We already have the "--changed-paths" option
     -    for the "git commit-graph write" subcommand, now actually compute them under
     -    that option. The COMMIT_GRAPH_WRITE_BLOOM_FILTERS flag enables this
     -    computation.
     +    Add the core Bloom filter logic for computing the paths changed between a
     +    commit and its first parent. For details on what Bloom filters are and how they
     +    work, please refer to Dr. Derrick Stolee's blog post [1]. It provides a concise
     +    explaination of the adoption of Bloom filters as described in [2] and [3]
      
     -    RFC Notes: Here are some details about the implementation and I would love
     -    to know your thoughts and suggestions for improvements here.
     +    1. We currently use 7 and 10 for the number of hashes and the size of each
     +       entry respectively. They served as great starting values, the mathematical
     +       details behind this choice are described in [1] and [4]. The implementation
     +       while not completely open to it at the moment, is flexible enough to allow
     +       for tweaking these settings in the future.
      
     -    For details on what bloom filters are and how they work, please refer to
     -    Dr. Derrick Stolee's blog post [1].
     -    [1] https://devblogs.microsoft.com/devops/super-charging-the-git-commit-graph-iv-bloom-filters/
     +       Note: The performance gains we have observed with these values are
     +       significant enough that we did not need to tweak these settings.
     +       The performance numbers are included in the cover letter of this series
     +       and in the message of a subsequent commit where we use Bloom filters in
     +       to speed up `git log -- <path>`.
      
     -    1. The implementation sticks to the recommended values of 7 and 10 for the
     -       number of hashes and the size of each entry, as described in the blog.
     -       The implementation while not completely open to it at the moment, is flexible
     -       enough to allow for tweaking these settings in the future.
     -       Note: The performance gains we have observed so far with these values is
     -       significant enough to not that we did not need to tweak these settings.
     -       The cover letter of this series has the details and the commit where we have
     -       git log use bloom filters.
     -
     -    2. As described in the blog and the linked technical paper therin, we do not need
     -       7 independent hashing functions. We use the Murmur3 hashing scheme - seed it
     -       twice and then combine those to procure an arbitrary number of hash values.
     +    2. As described in the blog and in [3], we do not need 7 independent hashing
     +       functions. We use the Murmur3 hashing scheme. Seed it twice and then
     +       combine those to procure an arbitrary number of hash values.
      
          3. The filters are sized according to the number of changes in the each commit,
             with minimum size of one 64 bit word.
      
     -    [Call for advice] We currently cap writing bloom filters for commits with
     -    atmost 512 changed files. In the current implementation, we compute the diff,
     -    and then just throw it away once we see it has more than 512 changes.
     -    Any suggestiongs on how to reduce the work we are doing in this case are more
     -    than welcome.
     +    4. We fill the Bloom filters as (const char *data, int len) pairs as
     +       "struct bloom_filter"s in a commit slab.
      
     -    [Call for advice] Would the git community like this commit to be split up into
     -    more granular commits? This commit could possibly be split out further with the
     -    bloom.c code in its own commit, to be used by the commit-graph in a subsequent
     -    commit. While I prefer it being contained in one commit this way, I am open to
     -    suggestions.
     +    5. The seed_murmur3 method is implemented as described in [5]. It hashes the
     +       given data using a given seed and produces a uniformly distributed hash
     +       value.
      
     -    [Call for advice] Would a technical document explaining the exact details of
     -    the bloom filter implemenation and the hashing calculations be helpful? I will
     -    be adding details into Documentation/technical/commit-graph-format.txt, but the
     -    bloom filter code is an independent subsystem and could be used outside of the
     -    commit-graph feature. Is it worth a separate document, or should we apply "You
     -    Ain't Gonna Need It" principles?
     +    [1] https://devblogs.microsoft.com/devops/super-charging-the-git-commit-graph-iv-Bloom-filters/
      
     -    [Call for advice] I plan to add unit tests for bloom.c, specifically to ensure
     -    that the hash algorithm and bloom key calculations are stable across versions.
     +    [2] Flavio Bonomi, Michael Mitzenmacher, Rina Panigrahy, Sushil Singh, George Varghese
     +        "An Improved Construction for Counting Bloom Filters"
     +        http://theory.stanford.edu/~rinap/papers/esa2006b.pdf
     +        https://doi.org/10.1007/11841036_61
      
     -    Signed-off-by: Garima Singh <garima.singh@microsoft.com>
     +    [3] Peter C. Dillinger and Panagiotis Manolios
     +        "Bloom Filters in Probabilistic Verification"
     +        http://www.ccs.neu.edu/home/pete/pub/Bloom-filters-verification.pdf
     +        https://doi.org/10.1007/978-3-540-30494-4_26
     +
     +    [4] Thomas Mueller Graf, Daniel Lemire
     +        "Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters"
     +        https://arxiv.org/abs/1912.08258
     +
     +    [5] https://en.wikipedia.org/wiki/MurmurHash#Algorithm
     +
     +    Helped-by: Jeff King <peff@peff.net>
          Helped-by: Derrick Stolee <dstolee@microsoft.com>
     +    Signed-off-by: Garima Singh <garima.singh@microsoft.com>
      
       diff --git a/Makefile b/Makefile
       --- a/Makefile
       +++ b/Makefile
      @@
     + 
     + PROGRAMS += $(patsubst %.o,git-%$X,$(PROGRAM_OBJS))
     + 
     ++TEST_BUILTINS_OBJS += test-bloom.o
     + TEST_BUILTINS_OBJS += test-chmtime.o
     + TEST_BUILTINS_OBJS += test-config.o
     + TEST_BUILTINS_OBJS += test-ctype.o
     +@@
       LIB_OBJS += bisect.o
       LIB_OBJS += blame.o
       LIB_OBJS += blob.o
     @@ -82,8 +89,6 @@
      +#include "revision.h"
      +#include "hashmap.h"
      +
     -+#define BITS_PER_BLOCK 64
     -+
      +define_commit_slab(bloom_filter_slab, struct bloom_filter);
      +
      +struct bloom_filter_slab bloom_filters;
     @@ -100,12 +105,18 @@
      +	return ((value >> count) | (value << ((-count) & mask)));
      +}
      +
     ++/*
     ++ * Calculate a hash value for the given data using the given seed.
     ++ * Produces a uniformly distributed hash value.
     ++ * Not considered to be cryptographically secure.
     ++ * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
     ++ **/
      +static uint32_t seed_murmur3(uint32_t seed, const char *data, int len)
      +{
      +	const uint32_t c1 = 0xcc9e2d51;
      +	const uint32_t c2 = 0x1b873593;
     -+	const int32_t r1 = 15;
     -+	const int32_t r2 = 13;
     ++	const uint32_t r1 = 15;
     ++	const uint32_t r2 = 13;
      +	const uint32_t m = 5;
      +	const uint32_t n = 0xe6546b64;
      +	int i;
     @@ -159,66 +170,67 @@
      +
      +static inline uint64_t get_bitmask(uint32_t pos)
      +{
     -+	return ((uint64_t)1) << (pos & (BITS_PER_BLOCK - 1));
     ++	return ((uint64_t)1) << (pos & (BITS_PER_WORD - 1));
     ++}
     ++
     ++void load_bloom_filters(void)
     ++{
     ++	init_bloom_filter_slab(&bloom_filters);
      +}
      +
      +void fill_bloom_key(const char *data,
     -+		    int len,
     -+		    struct bloom_key *key,
     -+		    struct bloom_filter_settings *settings)
     ++					int len,
     ++					struct bloom_key *key,
     ++					struct bloom_filter_settings *settings)
      +{
      +	int i;
     -+	uint32_t seed0 = 0x293ae76f;
     -+	uint32_t seed1 = 0x7e646e2c;
     -+
     -+	uint32_t hash0 = seed_murmur3(seed0, data, len);
     -+	uint32_t hash1 = seed_murmur3(seed1, data, len);
     ++	const uint32_t seed0 = 0x293ae76f;
     ++	const uint32_t seed1 = 0x7e646e2c;
     ++	const uint32_t hash0 = seed_murmur3(seed0, data, len);
     ++	const uint32_t hash1 = seed_murmur3(seed1, data, len);
      +
      +	key->hashes = (uint32_t *)xcalloc(settings->num_hashes, sizeof(uint32_t));
      +	for (i = 0; i < settings->num_hashes; i++)
      +		key->hashes[i] = hash0 + i * hash1;
      +}
      +
     -+static void add_key_to_filter(struct bloom_key *key,
     -+			      struct bloom_filter *filter,
     -+			      struct bloom_filter_settings *settings)
     ++void add_key_to_filter(struct bloom_key *key,
     ++					   struct bloom_filter *filter,
     ++					   struct bloom_filter_settings *settings)
      +{
      +	int i;
     -+	uint64_t mod = filter->len * BITS_PER_BLOCK;
     ++	uint64_t mod = filter->len * BITS_PER_WORD;
      +
      +	for (i = 0; i < settings->num_hashes; i++) {
      +		uint64_t hash_mod = key->hashes[i] % mod;
     -+		uint64_t block_pos = hash_mod / BITS_PER_BLOCK;
     ++		uint64_t block_pos = hash_mod / BITS_PER_WORD;
      +
      +		filter->data[block_pos] |= get_bitmask(hash_mod);
      +	}
      +}
      +
     -+void load_bloom_filters(void)
     -+{
     -+	init_bloom_filter_slab(&bloom_filters);
     -+}
     -+
      +struct bloom_filter *get_bloom_filter(struct repository *r,
      +				      struct commit *c)
      +{
      +	struct bloom_filter *filter;
      +	struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
      +	int i;
     -+	struct rev_info revs;
     -+	const char *revs_argv[] = {NULL, "HEAD", NULL};
     ++	struct diff_options diffopt;
     ++
     ++	if (!bloom_filters.slab_size)
     ++		return NULL;
      +
      +	filter = bloom_filter_slab_at(&bloom_filters, c);
     -+	init_revisions(&revs, NULL);
     -+	revs.diffopt.flags.recursive = 1;
      +
     -+	setup_revisions(2, revs_argv, &revs, NULL);
     ++	repo_diff_setup(r, &diffopt);
     ++	diffopt.flags.recursive = 1;
     ++	diff_setup_done(&diffopt);
      +
      +	if (c->parents)
     -+		diff_tree_oid(&c->parents->item->object.oid, &c->object.oid, "", &revs.diffopt);
     ++		diff_tree_oid(&c->parents->item->object.oid, &c->object.oid, "", &diffopt);
      +	else
     -+		diff_tree_oid(NULL, &c->object.oid, "", &revs.diffopt);
     -+	diffcore_std(&revs.diffopt);
     ++		diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
     ++	diffcore_std(&diffopt);
      +
      +	if (diff_queued_diff.nr <= 512) {
      +		struct hashmap pathmap;
     @@ -227,18 +239,18 @@
      +		hashmap_init(&pathmap, NULL, NULL, 0);
      +
      +		for (i = 0; i < diff_queued_diff.nr; i++) {
     -+		    const char* path = diff_queued_diff.queue[i]->two->path;
     -+		    const char* p = path;
     -+
     -+		    /*
     -+		     * Add each leading directory of the changed file, i.e. for
     -+		     * 'dir/subdir/file' add 'dir' and 'dir/subdir' as well, so
     -+		     * the Bloom filter could be used to speed up commands like
     -+		     * 'git log dir/subdir', too.
     -+		     *
     -+		     * Note that directories are added without the trailing '/'.
     -+		     */
     -+		    do {
     ++			const char* path = diff_queued_diff.queue[i]->two->path;
     ++			const char* p = path;
     ++
     ++			/*
     ++			* Add each leading directory of the changed file, i.e. for
     ++			* 'dir/subdir/file' add 'dir' and 'dir/subdir' as well, so
     ++			* the Bloom filter could be used to speed up commands like
     ++			* 'git log dir/subdir', too.
     ++			*
     ++			* Note that directories are added without the trailing '/'.
     ++			*/
     ++			do {
      +				char* last_slash = strrchr(p, '/');
      +
      +				FLEX_ALLOC_STR(e, path, path);
     @@ -246,25 +258,27 @@
      +				hashmap_add(&pathmap, &e->entry);
      +
      +				if (!last_slash)
     -+				    last_slash = (char*)p;
     ++					last_slash = (char*)p;
      +				*last_slash = '\0';
      +
     -+		    } while (*p);
     ++			} while (*p);
      +
     -+		    diff_free_filepair(diff_queued_diff.queue[i]);
     ++			diff_free_filepair(diff_queued_diff.queue[i]);
      +		}
      +
     -+		filter->len = (hashmap_get_size(&pathmap) * settings.bits_per_entry + BITS_PER_BLOCK - 1) / BITS_PER_BLOCK;
     ++		filter->len = (hashmap_get_size(&pathmap) * settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
      +		filter->data = xcalloc(filter->len, sizeof(uint64_t));
      +
      +		hashmap_for_each_entry(&pathmap, &iter, e, entry) {
     -+		    struct bloom_key key;
     -+		    fill_bloom_key(e->path, strlen(e->path), &key, &settings);
     -+		    add_key_to_filter(&key, filter, &settings);
     ++			struct bloom_key key;
     ++			fill_bloom_key(e->path, strlen(e->path), &key, &settings);
     ++			add_key_to_filter(&key, filter, &settings);
      +		}
      +
      +		hashmap_free_entries(&pathmap, struct pathmap_hash_entry, entry);
      +	} else {
     ++		for (i = 0; i < diff_queued_diff.nr; i++)
     ++			diff_free_filepair(diff_queued_diff.queue[i]);
      +		filter->data = NULL;
      +		filter->len = 0;
      +	}
     @@ -274,7 +288,26 @@
      +
      +	return filter;
      +}
     - \ No newline at end of file
     ++
     ++int bloom_filter_contains(struct bloom_filter *filter,
     ++			  struct bloom_key *key,
     ++			  struct bloom_filter_settings *settings)
     ++{
     ++	int i;
     ++	uint64_t mod = filter->len * BITS_PER_WORD;
     ++
     ++	if (!mod)
     ++		return -1;
     ++
     ++	for (i = 0; i < settings->num_hashes; i++) {
     ++		uint64_t hash_mod = key->hashes[i] % mod;
     ++		uint64_t block_pos = hash_mod / BITS_PER_WORD;
     ++		if (!(filter->data[block_pos] & get_bitmask(hash_mod)))
     ++			return 0;
     ++	}
     ++
     ++	return 1;
     ++}
      
       diff --git a/bloom.h b/bloom.h
       new file mode 100644
     @@ -286,6 +319,7 @@
      +
      +struct commit;
      +struct repository;
     ++struct commit_graph;
      +
      +struct bloom_filter_settings {
      +	uint32_t hash_version;
     @@ -294,6 +328,7 @@
      +};
      +
      +#define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 }
     ++#define BITS_PER_WORD 64
      +
      +/*
      + * A bloom_filter struct represents a data segment to
     @@ -318,85 +353,253 @@
      +
      +void load_bloom_filters(void);
      +
     -+struct bloom_filter *get_bloom_filter(struct repository *r,
     -+				      struct commit *c);
     -+
      +void fill_bloom_key(const char *data,
      +		    int len,
      +		    struct bloom_key *key,
      +		    struct bloom_filter_settings *settings);
      +
     ++void add_key_to_filter(struct bloom_key *key,
     ++					   struct bloom_filter *filter,
     ++					   struct bloom_filter_settings *settings);
     ++
     ++struct bloom_filter *get_bloom_filter(struct repository *r,
     ++				      struct commit *c);
     ++
     ++int bloom_filter_contains(struct bloom_filter *filter,
     ++			  struct bloom_key *key,
     ++			  struct bloom_filter_settings *settings);
     ++
      +#endif
      
     - diff --git a/commit-graph.c b/commit-graph.c
     - --- a/commit-graph.c
     - +++ b/commit-graph.c
     + diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
     + new file mode 100644
     + --- /dev/null
     + +++ b/t/helper/test-bloom.c
      @@
     - #include "hashmap.h"
     - #include "replace-object.h"
     - #include "progress.h"
     ++#include "test-tool.h"
     ++#include "git-compat-util.h"
      +#include "bloom.h"
     - 
     - #define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
     - #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
     -@@
     - 	unsigned append:1,
     - 		 report_progress:1,
     - 		 split:1,
     --		 check_oids:1;
     -+		 check_oids:1,
     -+		 bloom:1;
     - 
     - 	const struct split_commit_graph_opts *split_opts;
     -+	uint32_t total_bloom_filter_size;
     - };
     - 
     - static void write_graph_chunk_fanout(struct hashfile *f,
     -@@
     - 	stop_progress(&ctx->progress);
     - }
     - 
     -+static void compute_bloom_filters(struct write_commit_graph_context *ctx)
     -+{
     -+	int i;
     -+	struct progress *progress = NULL;
     ++#include "test-tool.h"
     ++#include "cache.h"
     ++#include "commit-graph.h"
     ++#include "commit.h"
     ++#include "config.h"
     ++#include "object-store.h"
     ++#include "object.h"
     ++#include "repository.h"
     ++#include "tree.h"
      +
     -+	load_bloom_filters();
     ++struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
      +
     -+	if (ctx->report_progress)
     -+		progress = start_progress(
     -+			_("Computing commit diff Bloom filters"),
     -+			ctx->commits.nr);
     ++static void print_bloom_filter(struct bloom_filter *filter) {
     ++	int i;
      +
     -+	for (i = 0; i < ctx->commits.nr; i++) {
     -+		struct commit *c = ctx->commits.list[i];
     -+		struct bloom_filter *filter = get_bloom_filter(ctx->r, c);
     -+		ctx->total_bloom_filter_size += sizeof(uint64_t) * filter->len;
     -+		display_progress(progress, i + 1);
     ++	if (!filter) {
     ++		printf("No filter.\n");
     ++		return;
     ++	}
     ++	printf("Filter_Length:%d\n", filter->len);
     ++	printf("Filter_Data:");
     ++	for (i = 0; i < filter->len; i++){
     ++		printf("%"PRIx64"|", filter->data[i]);
      +	}
     ++	printf("\n");
     ++}
     ++
     ++static void add_string_to_filter(const char *data, struct bloom_filter *filter) {
     ++		struct bloom_key key;
     ++		int i;
     ++
     ++		fill_bloom_key(data, strlen(data), &key, &settings);
     ++		printf("Hashes:");
     ++		for (i = 0; i < settings.num_hashes; i++){
     ++			printf("%08x|", key.hashes[i]);
     ++		}
     ++		printf("\n");
     ++		add_key_to_filter(&key, filter, &settings);
     ++}
      +
     -+	stop_progress(&progress);
     ++static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
     ++{
     ++	struct commit *c;
     ++	struct bloom_filter *filter;
     ++	setup_git_directory();
     ++	c = lookup_commit(the_repository, commit_oid);
     ++	filter = get_bloom_filter(the_repository, c);
     ++	print_bloom_filter(filter);
      +}
      +
     - static int add_ref_to_list(const char *refname,
     - 			   const struct object_id *oid,
     - 			   int flags, void *cb_data)
     ++int cmd__bloom(int argc, const char **argv)
     ++{
     ++    if (!strcmp(argv[1], "generate_filter")) {
     ++		struct bloom_filter filter;
     ++		int i = 2;
     ++		filter.len =  (settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
     ++		filter.data = xcalloc(filter.len, sizeof(uint64_t));
     ++
     ++		if (!argv[2]){
     ++			die("at least one input string expected");
     ++		}
     ++
     ++		while (argv[i]) {
     ++			add_string_to_filter(argv[i], &filter);
     ++			i++;
     ++		}
     ++
     ++		print_bloom_filter(&filter);
     ++	}
     ++
     ++	if (!strcmp(argv[1], "get_filter_for_commit")) {
     ++		struct object_id oid;
     ++		const char *end;
     ++		if (parse_oid_hex(argv[2], &oid, &end))
     ++			die("cannot parse oid '%s'", argv[2]);
     ++		load_bloom_filters();
     ++		get_bloom_filter_for_commit(&oid);
     ++	}
     ++
     ++	return 0;
     ++}
     +
     + diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
     + --- a/t/helper/test-tool.c
     + +++ b/t/helper/test-tool.c
      @@
     - 	ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
     - 	ctx->check_oids = flags & COMMIT_GRAPH_WRITE_CHECK_OIDS ? 1 : 0;
     - 	ctx->split_opts = split_opts;
     -+	ctx->bloom = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0;
     -+	ctx->total_bloom_filter_size = 0;
     + };
       
     - 	if (ctx->split) {
     - 		struct commit_graph *g;
     + static struct test_cmd cmds[] = {
     ++	{ "bloom", cmd__bloom },
     + 	{ "chmtime", cmd__chmtime },
     + 	{ "config", cmd__config },
     + 	{ "ctype", cmd__ctype },
     +
     + diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
     + --- a/t/helper/test-tool.h
     + +++ b/t/helper/test-tool.h
      @@
     + #define USE_THE_INDEX_COMPATIBILITY_MACROS
     + #include "git-compat-util.h"
       
     - 	compute_generation_numbers(ctx);
     - 
     -+	if (ctx->bloom)
     -+		compute_bloom_filters(ctx);
     -+
     - 	res = write_commit_graph_file(ctx);
     - 
     - 	if (ctx->split)
     ++int cmd__bloom(int argc, const char **argv);
     + int cmd__chmtime(int argc, const char **argv);
     + int cmd__config(int argc, const char **argv);
     + int cmd__ctype(int argc, const char **argv);
     +
     + diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
     + new file mode 100755
     + --- /dev/null
     + +++ b/t/t0095-bloom.sh
     +@@
     ++#!/bin/sh
     ++
     ++test_description='test bloom.c'
     ++. ./test-lib.sh
     ++
     ++test_expect_success 'get bloom filters for commit with no changes' '
     ++	git init &&
     ++	git commit --allow-empty -m "c0" &&
     ++	cat >expect <<-\EOF &&
     ++	Filter_Length:0
     ++	Filter_Data:
     ++	EOF
     ++	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
     ++	test_cmp expect actual
     ++'
     ++
     ++test_expect_success 'get bloom filter for commit with 10 changes' '
     ++	rm actual &&
     ++	rm expect &&
     ++	mkdir smallDir &&
     ++	for i in $(test_seq 0 9)
     ++	do
     ++		echo $i >smallDir/$i
     ++	done &&
     ++	git add smallDir &&
     ++	git commit -m "commit with 10 changes" &&
     ++	cat >expect <<-\EOF &&
     ++	Filter_Length:4
     ++	Filter_Data:508928809087080a|8a7648210804001|4089824400951000|841ab310098051a8|
     ++	EOF
     ++	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
     ++	test_cmp expect actual
     ++'
     ++
     ++test_expect_success EXPENSIVE 'get bloom filter for commit with 513 changes' '
     ++	rm actual &&
     ++	rm expect &&
     ++	mkdir bigDir &&
     ++	for i in $(test_seq 0 512)
     ++	do
     ++		echo $i >bigDir/$i
     ++	done &&
     ++	git add bigDir &&
     ++	git commit -m "commit with 513 changes" &&
     ++	cat >expect <<-\EOF &&
     ++	Filter_Length:0
     ++	Filter_Data:
     ++	EOF
     ++	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
     ++	test_cmp expect actual
     ++'
     ++
     ++test_expect_success 'compute bloom key for empty string' '
     ++	cat >expect <<-\EOF &&
     ++	Hashes:5615800c|5b966560|61174ab4|66983008|6c19155c|7199fab0|771ae004|
     ++	Filter_Length:1
     ++	Filter_Data:11000110001110|
     ++	EOF
     ++	test-tool bloom generate_filter "" >actual &&
     ++	test_cmp expect actual
     ++'
     ++
     ++test_expect_success 'compute bloom key for whitespace' '
     ++	cat >expect <<-\EOF &&
     ++	Hashes:1bf014e6|8a91b50b|f9335530|67d4f555|d676957a|4518359f|b3b9d5c4|
     ++	Filter_Length:1
     ++	Filter_Data:401004080200810|
     ++	EOF
     ++	test-tool bloom generate_filter " " >actual &&
     ++	test_cmp expect actual
     ++'
     ++
     ++test_expect_success 'compute bloom key for a root level folder' '
     ++	cat >expect <<-\EOF &&
     ++	Hashes:1a21016f|fff1c06d|e5c27f6b|cb933e69|b163fd67|9734bc65|7d057b63|
     ++	Filter_Length:1
     ++	Filter_Data:aaa800000000|
     ++	EOF
     ++	test-tool bloom generate_filter "A" >actual &&
     ++	test_cmp expect actual
     ++'
     ++
     ++test_expect_success 'compute bloom key for a root level file' '
     ++	cat >expect <<-\EOF &&
     ++	Hashes:e2d51107|30970605|7e58fb03|cc1af001|19dce4ff|679ed9fd|b560cefb|
     ++	Filter_Length:1
     ++	Filter_Data:a8000000000000aa|
     ++	EOF
     ++	test-tool bloom generate_filter "file.txt" >actual &&
     ++	test_cmp expect actual
     ++'
     ++
     ++test_expect_success 'compute bloom key for a deep folder' '
     ++	cat >expect <<-\EOF &&
     ++	Hashes:864cf838|27f055cd|c993b362|6b3710f7|0cda6e8c|ae7dcc21|502129b6|
     ++	Filter_Length:1
     ++	Filter_Data:1c0000600003000|
     ++	EOF
     ++	test-tool bloom generate_filter "A/B/C/D/E" >actual &&
     ++	test_cmp expect actual
     ++'
     ++
     ++test_expect_success 'compute bloom key for a deep file' '
     ++	cat >expect <<-\EOF &&
     ++	Hashes:07cdf850|4af629c7|8e1e5b3e|d1468cb5|146ebe2c|5796efa3|9abf211a|
     ++	Filter_Length:1
     ++	Filter_Data:4020100804010080|
     ++	EOF
     ++	test-tool bloom generate_filter "A/B/C/D/E/file.txt" >actual &&
     ++	test_cmp expect actual
     ++'
     ++
     ++test_done
  -:  ---------- >  3:  a698c04a78 diff: halt tree-diff early after max_changes
  -:  ---------- >  4:  c17bbcbc66 commit-graph: compute Bloom filters for changed paths
  -:  ---------- >  5:  78e8e49c3a commit-graph: examine changed-path objects in pack order
  -:  ---------- >  6:  58704d81b6 commit-graph: examine commits by generation number
  5:  7648021072 !  7:  39ee061080 commit-graph: write changed path bloom filters to commit-graph file.
     @@ -1,23 +1,67 @@
      Author: Garima Singh <garima.singh@microsoft.com>
      
     -    commit-graph: write changed path bloom filters to commit-graph file.
     +    commit-graph: write Bloom filters to commit graph file
      
     -    Write bloom filters to the commit-graph using the format described in
     -    Documentation/technical/commit-graph-format.txt
     +    Update the technical documentation for commit-graph-format with the formats for
     +    the Bloom filter index (BIDX) and Bloom filter data (BDAT) chunks. Write the
     +    computed Bloom filters information to the commit graph file using this format.
      
          Helped-by: Derrick Stolee <dstolee@microsoft.com>
          Signed-off-by: Garima Singh <garima.singh@microsoft.com>
      
     + diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
     + --- a/Documentation/technical/commit-graph-format.txt
     + +++ b/Documentation/technical/commit-graph-format.txt
     +@@
     + - The parents of the commit, stored using positional references within
     +   the graph file.
     + 
     ++- The Bloom filter of the commit carrying the paths that were changed between
     ++  the commit and its first parent.
     ++
     + These positional references are stored as unsigned 32-bit integers
     + corresponding to the array position within the list of commit OIDs. Due
     + to some special constants we use to track parents, we can store at most
     +@@
     +       positions for the parents until reaching a value with the most-significant
     +       bit on. The other bits correspond to the position of the last parent.
     + 
     ++  Bloom Filter Index (ID: {'B', 'I', 'D', 'X'}) (N * 4 bytes) [Optional]
     ++    * The ith entry, BIDX[i], stores the number of 8-byte word blocks in all
     ++      Bloom filters from commit 0 to commit i (inclusive) in lexicographic
     ++      order. The Bloom filter for the i-th commit spans from BIDX[i-1] to
     ++      BIDX[i] (plus header length), where BIDX[-1] is 0.
     ++    * The BIDX chunk is ignored if the BDAT chunk is not present.
     ++
     ++  Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]
     ++    * It starts with header consisting of three unsigned 32-bit integers:
     ++      - Version of the hash algorithm being used. We currently only support
     ++	value 1 which implies the murmur3 hash implemented exactly as described
     ++	in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
     ++      - The number of times a path is hashed and hence the number of bit positions
     ++	that cumulatively determine whether a file is present in the commit.
     ++      - The minimum number of bits 'b' per entry in the Bloom filter. If the filter
     ++	contains 'n' entries, then the filter size is the minimum number of 64-bit
     ++	words that contain n*b bits.
     ++    * The rest of the chunk is the concatenation of all the computed Bloom
     ++      filters for the commits in lexicographic order.
     ++    * The BDAT chunk is present iff BIDX is present.
     ++
     +   Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional]
     +       This list of H-byte hashes describe a set of B commit-graph files that
     +       form a commit-graph chain. The graph position for the ith commit in this
     +
       diff --git a/commit-graph.c b/commit-graph.c
       --- a/commit-graph.c
       +++ b/commit-graph.c
      @@
     + #define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
       #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
       #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
     - #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
     --#define MAX_NUM_CHUNKS 5
      +#define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
      +#define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
     + #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
     +-#define MAX_NUM_CHUNKS 5
      +#define MAX_NUM_CHUNKS 7
       
       #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
     @@ -46,15 +90,33 @@
      +				if (hash_version != 1)
      +					break;
      +
     -+				graph->settings = xmalloc(sizeof(struct bloom_filter_settings));
     -+				graph->settings->hash_version = hash_version;
     -+				graph->settings->num_hashes = get_be32(data + chunk_offset + 4);
     -+				graph->settings->bits_per_entry = get_be32(data + chunk_offset + 8);
     ++				graph->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings));
     ++				graph->bloom_filter_settings->hash_version = hash_version;
     ++				graph->bloom_filter_settings->num_hashes = get_be32(data + chunk_offset + 4);
     ++				graph->bloom_filter_settings->bits_per_entry = get_be32(data + chunk_offset + 8);
      +			}
      +			break;
       		}
       
       		if (chunk_repeated) {
     +@@
     + 		last_chunk_offset = chunk_offset;
     + 	}
     + 
     ++	/* We need both the bloom chunks to exist together. Else ignore the data */
     ++	if ((graph->chunk_bloom_indexes && !graph->chunk_bloom_data)
     ++		 || (!graph->chunk_bloom_indexes && graph->chunk_bloom_data)) {
     ++		graph->chunk_bloom_indexes = NULL;
     ++		graph->chunk_bloom_data = NULL;
     ++		graph->bloom_filter_settings = NULL;
     ++	}
     ++
     ++	if (graph->chunk_bloom_indexes && graph->chunk_bloom_data)
     ++		load_bloom_filters();
     ++
     + 	hashcpy(graph->oid.hash, graph->data + graph->data_len - graph->hash_len);
     + 
     + 	if (verify_commit_graph_lite(graph)) {
      @@
       	}
       }
     @@ -65,36 +127,67 @@
      +	struct commit **list = ctx->commits.list;
      +	struct commit **last = ctx->commits.list + ctx->commits.nr;
      +	uint32_t cur_pos = 0;
     ++	struct progress *progress = NULL;
     ++	int i = 0;
     ++
     ++	if (ctx->report_progress)
     ++		progress = start_delayed_progress(
     ++			_("Writing changed paths Bloom filters index"),
     ++			ctx->commits.nr);
      +
      +	while (list < last) {
      +		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
      +		cur_pos += filter->len;
     ++		display_progress(progress, ++i);
      +		hashwrite_be32(f, cur_pos);
      +		list++;
      +	}
     ++
     ++	stop_progress(&progress);
      +}
      +
      +static void write_graph_chunk_bloom_data(struct hashfile *f,
      +					 struct write_commit_graph_context *ctx,
      +					 struct bloom_filter_settings *settings)
      +{
     -+	struct commit **first = ctx->commits.list;
     ++	struct commit **list = ctx->commits.list;
      +	struct commit **last = ctx->commits.list + ctx->commits.nr;
     ++	struct progress *progress = NULL;
     ++	int i = 0;
     ++
     ++	if (ctx->report_progress)
     ++		progress = start_delayed_progress(
     ++			_("Writing changed paths Bloom filters data"),
     ++			ctx->commits.nr);
      +
      +	hashwrite_be32(f, settings->hash_version);
      +	hashwrite_be32(f, settings->num_hashes);
      +	hashwrite_be32(f, settings->bits_per_entry);
      +
     -+	while (first < last) {
     -+		struct bloom_filter *filter = get_bloom_filter(ctx->r, *first);
     ++	while (list < last) {
     ++		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
     ++		display_progress(progress, ++i);
      +		hashwrite(f, filter->data, filter->len * sizeof(uint64_t));
     -+		first++;
     ++		list++;
      +	}
     ++
     ++	stop_progress(&progress);
      +}
      +
       static int oid_compare(const void *_a, const void *_b)
       {
       	const struct object_id *a = (const struct object_id *)_a;
     +@@
     + 	load_bloom_filters();
     + 
     + 	if (ctx->report_progress)
     +-		progress = start_progress(
     +-			_("Computing commit diff Bloom filters"),
     ++		progress = start_delayed_progress(
     ++			_("Computing changed paths Bloom filters"),
     + 			ctx->commits.nr);
     + 
     + 	ALLOC_ARRAY(sorted_by_pos, ctx->commits.nr);
      @@
       	struct strbuf progress_title = STRBUF_INIT;
       	int num_chunks = 3;
     @@ -107,7 +200,7 @@
       		chunk_ids[num_chunks] = GRAPH_CHUNKID_EXTRAEDGES;
       		num_chunks++;
       	}
     -+	if (ctx->bloom) {
     ++	if (ctx->changed_paths) {
      +		chunk_ids[num_chunks] = GRAPH_CHUNKID_BLOOMINDEXES;
      +		num_chunks++;
      +		chunk_ids[num_chunks] = GRAPH_CHUNKID_BLOOMDATA;
     @@ -120,11 +213,13 @@
       						4 * ctx->num_extra_edges;
       		num_chunks++;
       	}
     -+	if (ctx->bloom) {
     -+		chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] + sizeof(uint32_t) * ctx->commits.nr;
     ++	if (ctx->changed_paths) {
     ++		chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
     ++						sizeof(uint32_t) * ctx->commits.nr;
      +		num_chunks++;
      +
     -+		chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] + sizeof(uint32_t) * 3 + ctx->total_bloom_filter_size;
     ++		chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
     ++						sizeof(uint32_t) * 3 + ctx->total_bloom_filter_data_size;
      +		num_chunks++;
      +	}
       	if (ctx->num_commit_graphs_after > 1) {
     @@ -134,7 +229,7 @@
       	write_graph_chunk_data(f, hashsz, ctx);
       	if (ctx->num_extra_edges)
       		write_graph_chunk_extra_edges(f, ctx);
     -+	if (ctx->bloom) {
     ++	if (ctx->changed_paths) {
      +		write_graph_chunk_bloom_indexes(f, ctx);
      +		write_graph_chunk_bloom_data(f, ctx, &bloom_settings);
      +	}
     @@ -160,7 +255,16 @@
      +	const unsigned char *chunk_bloom_indexes;
      +	const unsigned char *chunk_bloom_data;
      +
     -+	struct bloom_filter_settings *settings;
     ++	struct bloom_filter_settings *bloom_filter_settings;
       };
       
       struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st);
     +@@
     + 	COMMIT_GRAPH_WRITE_SPLIT      = (1 << 2),
     + 	/* Make sure that each OID in the input is a valid commit OID. */
     + 	COMMIT_GRAPH_WRITE_CHECK_OIDS = (1 << 3),
     +-	COMMIT_GRAPH_WRITE_BLOOM_FILTERS = (1 << 4)
     ++	COMMIT_GRAPH_WRITE_BLOOM_FILTERS = (1 << 4),
     + };
     + 
     + struct split_commit_graph_opts {
  7:  1e2acb37ad !  8:  b20c8d2b20 commit-graph: reuse existing bloom filters during write.
     @@ -1,27 +1,31 @@
      Author: Garima Singh <garima.singh@microsoft.com>
      
     -    commit-graph: reuse existing bloom filters during write.
     +    commit-graph: reuse existing Bloom filters during write.
      
     -    Read previously computed bloom filters from the commit-graph file if possible
     -    to avoid recomputing during commit-graph write.
     +    Read previously computed Bloom filters from the commit-graph file if
     +    possible to avoid recomputing during commit-graph write.
      
     -    Reading from the commit-graph is based on the format in which bloom filters are
     -    written in the commit graph file. See method `fill_filter_from_graph` in bloom.c
     +    See Documentation/technical/commit-graph-format for the format in which
     +    the Bloom filter information is written to the commit graph file.
      
     -    For reading the bloom filter for commit at lexicographic position i:
     -    1. Read BIDX[i] which essentially gives us the starting index in BDAT for filter
     -       of commit i+1 (called the next_index in the code)
     +    To read Bloom filter for a given commit with lexicographic position
     +    'i' we need to:
     +    1. Read BIDX[i] which essentially gives us the starting index in BDAT for
     +       filter of commit i+1. It is essentially the index past the end
     +       of the filter of commit i. It is called end_index in the code.
      
     -    2. For i>0, read BIDX[i-1] which will give us the starting index in BDAT for
     -       filter of commit i (called the prev_index in the code)
     -       For i = 0, prev_index will be 0. The first lexicographic commit's filter will
     -       start at BDAT.
     +    2. For i>0, read BIDX[i-1] which will give us the starting index in BDAT
     +       for filter of commit i. It is called the start_index in the code.
     +       For the first commit, where i = 0, Bloom filter data starts at the
     +       beginning, just past the header in the BDAT chunk. Hence, start_index
     +       will be 0.
      
     -    3. The length of the filter will be next_index - prev_index, because BIDX[i]
     -       gives the cumulative 8-byte words including the ith commit's filter.
     +    3. The length of the filter will be end_index - start_index, because
     +       BIDX[i] gives the cumulative 8-byte words including the ith
     +       commit's filter.
      
     -    We toggle whether bloom filters should be recomputed based on the compute_if_null
     -    flag.
     +    We toggle whether Bloom filters should be recomputed based on the
     +    compute_if_null flag.
      
          Helped-by: Derrick Stolee <dstolee@microsoft.com>
          Signed-off-by: Garima Singh <garima.singh@microsoft.com>
     @@ -41,107 +45,135 @@
       	}
       }
       
     -+static void fill_filter_from_graph(struct commit_graph *g,
     ++static int load_bloom_filter_from_graph(struct commit_graph *g,
      +				   struct bloom_filter *filter,
      +				   struct commit *c)
      +{
     -+	uint32_t lex_pos, prev_index, next_index;
     ++	uint32_t lex_pos, start_index, end_index;
      +
      +	while (c->graph_pos < g->num_commits_in_base)
      +		g = g->base_graph;
      +
     ++	/* The commit graph commit 'c' lives in doesn't carry bloom filters. */
     ++	if (!g->chunk_bloom_indexes)
     ++		return 0;
     ++
      +	lex_pos = c->graph_pos - g->num_commits_in_base;
      +
     -+	next_index = get_be32(g->chunk_bloom_indexes + 4 * lex_pos);
     ++	end_index = get_be32(g->chunk_bloom_indexes + 4 * lex_pos);
     ++
      +	if (lex_pos)
     -+		prev_index = get_be32(g->chunk_bloom_indexes + 4 * (lex_pos - 1));
     ++		start_index = get_be32(g->chunk_bloom_indexes + 4 * (lex_pos - 1));
      +	else
     -+		prev_index = 0;
     ++		start_index = 0;
      +
     -+	filter->len = next_index - prev_index;
     -+	filter->data = (uint64_t *)(g->chunk_bloom_data + 8 * prev_index + 12);
     ++	filter->len = end_index - start_index;
     ++	filter->data = (uint64_t *)(g->chunk_bloom_data +
     ++					sizeof(uint64_t) * start_index +
     ++					BLOOMDATA_CHUNK_HEADER_SIZE);
     ++
     ++	return 1;
      +}
      +
     - void load_bloom_filters(void)
     - {
     - 	init_bloom_filter_slab(&bloom_filters);
     - }
     - 
       struct bloom_filter *get_bloom_filter(struct repository *r,
      -				      struct commit *c)
      +				      struct commit *c,
     -+				      int compute_if_null)
     ++				      int compute_if_not_present)
       {
       	struct bloom_filter *filter;
       	struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
      @@
     - 	const char *revs_argv[] = {NULL, "HEAD", NULL};
       
       	filter = bloom_filter_slab_at(&bloom_filters, c);
     -+
     + 
      +	if (!filter->data) {
      +		load_commit_graph_info(r, c);
     -+		if (c->graph_pos != COMMIT_NOT_FROM_GRAPH && r->objects->commit_graph->chunk_bloom_indexes) {
     -+			fill_filter_from_graph(r->objects->commit_graph, filter, c);
     -+			return filter;
     ++		if (c->graph_pos != COMMIT_NOT_FROM_GRAPH &&
     ++			r->objects->commit_graph->chunk_bloom_indexes) {
     ++			if (load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
     ++				return filter;
     ++			else
     ++				return NULL;
      +		}
      +	}
      +
     -+	if (filter->data || !compute_if_null)
     -+			return filter;
     ++	if (filter->data || !compute_if_not_present)
     ++		return filter;
      +
     - 	init_revisions(&revs, NULL);
     - 	revs.diffopt.flags.recursive = 1;
     - 
     -@@
     - 	DIFF_QUEUE_CLEAR(&diff_queued_diff);
     - 
     - 	return filter;
     --}
     - \ No newline at end of file
     -+}
     + 	repo_diff_setup(r, &diffopt);
     + 	diffopt.flags.recursive = 1;
     + 	diffopt.max_changes = max_changes;
      
       diff --git a/bloom.h b/bloom.h
       --- a/bloom.h
       +++ b/bloom.h
      @@
     - void load_bloom_filters(void);
     + 
     + #define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 }
     + #define BITS_PER_WORD 64
     ++#define BLOOMDATA_CHUNK_HEADER_SIZE 3*sizeof(uint32_t)
     + 
     + /*
     +  * A bloom_filter struct represents a data segment to
     +@@
     + 					   struct bloom_filter_settings *settings);
       
       struct bloom_filter *get_bloom_filter(struct repository *r,
      -				      struct commit *c);
      +				      struct commit *c,
     -+				      int compute_if_null);
     ++				      int compute_if_not_present);
       
     - void fill_bloom_key(const char *data,
     - 		    int len,
     + int bloom_filter_contains(struct bloom_filter *filter,
     + 			  struct bloom_key *key,
      
       diff --git a/commit-graph.c b/commit-graph.c
       --- a/commit-graph.c
       +++ b/commit-graph.c
      @@
     - 	uint32_t cur_pos = 0;
     + 			ctx->commits.nr);
       
       	while (list < last) {
      -		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
      +		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
       		cur_pos += filter->len;
     + 		display_progress(progress, ++i);
       		hashwrite_be32(f, cur_pos);
     - 		list++;
      @@
       	hashwrite_be32(f, settings->bits_per_entry);
       
     - 	while (first < last) {
     --		struct bloom_filter *filter = get_bloom_filter(ctx->r, *first);
     -+		struct bloom_filter *filter = get_bloom_filter(ctx->r, *first, 0);
     + 	while (list < last) {
     +-		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
     ++		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
     + 		display_progress(progress, ++i);
       		hashwrite(f, filter->data, filter->len * sizeof(uint64_t));
     - 		first++;
     - 	}
     + 		list++;
      @@
       
       	for (i = 0; i < ctx->commits.nr; i++) {
     - 		struct commit *c = ctx->commits.list[i];
     + 		struct commit *c = sorted_by_pos[i];
      -		struct bloom_filter *filter = get_bloom_filter(ctx->r, c);
      +		struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1);
     - 		ctx->total_bloom_filter_size += sizeof(uint64_t) * filter->len;
     + 		ctx->total_bloom_filter_data_size += sizeof(uint64_t) * filter->len;
       		display_progress(progress, i + 1);
       	}
     +@@
     + 		g->data = NULL;
     + 		close(g->graph_fd);
     + 	}
     ++	free(g->bloom_filter_settings);
     + 	free(g->filename);
     + 	free(g);
     + }
     +
     + diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
     + --- a/t/helper/test-bloom.c
     + +++ b/t/helper/test-bloom.c
     +@@
     + 	struct bloom_filter *filter;
     + 	setup_git_directory();
     + 	c = lookup_commit(the_repository, commit_oid);
     +-	filter = get_bloom_filter(the_repository, c);
     ++	filter = get_bloom_filter(the_repository, c, 1);
     + 	print_bloom_filter(filter);
     + }
     + 
  1:  6bdde5e4f0 !  9:  3d7ee0c969 commit-graph: add --changed-paths option to write
     @@ -1,26 +1,13 @@
      Author: Garima Singh <garima.singh@microsoft.com>
      
     -    commit-graph: add --changed-paths option to write
     +    commit-graph: add --changed-paths option to write subcommand
      
          Add --changed-paths option to git commit-graph write. This option will
     -    soon allow users to compute bloom filters for the paths changed between
     -    a commit and its first significant parent, and write this information
     -    into the commit-graph file.
     -
     -    Note: This commit does not change any behavior. It only introduces
     -    the option and passes down the appropriate flag to the commit-graph.
     -
     -    RFC Notes:
     -    1. We named the option --changed-paths to capture what the option does,
     -       instead of how it does it. The current implementation does this
     -       using bloom filters. We believe using --changed-paths however keeps
     -       the implementation open to other data structures.
     -       All thoughts and suggestions for the name and this approach are
     -       welcome
     -
     -    2. Currently, a subsequent commit in this series will add tests that
     -       exercise this option. I plan to split that test commit across the
     -       series as appropriate.
     +    allow users to compute information about the paths that have changed
     +    between a commit and its first parent, and write it into the commit graph
     +    file. If the option is passed to the write subcommand we set the
     +    COMMIT_GRAPH_WRITE_BLOOM_FILTERS flag and pass it down to the
     +    commit-graph logic.
      
          Helped-by: Derrick Stolee <dstolee@microsoft.com>
          Signed-off-by: Garima Singh <garima.singh@microsoft.com>
     @@ -35,7 +22,7 @@
      +With the `--changed-paths` option, compute and write information about the
      +paths changed between a commit and it's first parent. This operation can
      +take a while on large repositories. It provides significant performance gains
     -+for getting file based history logs with `git log`
     ++for getting history of a directory or a file with `git log -- <path>`.
      ++
       With the `--split` option, write the commit-graph as a chain of multiple
       commit-graph files stored in `<dir>/info/commit-graphs`. The new commits
     @@ -66,7 +53,7 @@
       	int split;
       	int shallow;
       	int progress;
     -+	int enable_bloom_filters;
     ++	int enable_changed_paths;
       } opts;
       
       static int graph_verify(int argc, const char **argv)
     @@ -74,7 +61,7 @@
       			N_("start walk at commits listed by stdin")),
       		OPT_BOOL(0, "append", &opts.append,
       			N_("include all commits already in the commit-graph file")),
     -+		OPT_BOOL(0, "changed-paths", &opts.enable_bloom_filters,
     ++		OPT_BOOL(0, "changed-paths", &opts.enable_changed_paths,
      +			N_("enable computation for changed paths")),
       		OPT_BOOL(0, "progress", &opts.progress, N_("force progress reporting")),
       		OPT_BOOL(0, "split", &opts.split,
     @@ -83,22 +70,8 @@
       		flags |= COMMIT_GRAPH_WRITE_SPLIT;
       	if (opts.progress)
       		flags |= COMMIT_GRAPH_WRITE_PROGRESS;
     -+	if (opts.enable_bloom_filters)
     ++	if (opts.enable_changed_paths)
      +		flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;
       
       	read_replace_refs = 0;
       
     -
     - diff --git a/commit-graph.h b/commit-graph.h
     - --- a/commit-graph.h
     - +++ b/commit-graph.h
     -@@
     - 	COMMIT_GRAPH_WRITE_PROGRESS   = (1 << 1),
     - 	COMMIT_GRAPH_WRITE_SPLIT      = (1 << 2),
     - 	/* Make sure that each OID in the input is a valid commit OID. */
     --	COMMIT_GRAPH_WRITE_CHECK_OIDS = (1 << 3)
     -+	COMMIT_GRAPH_WRITE_CHECK_OIDS = (1 << 3),
     -+	COMMIT_GRAPH_WRITE_BLOOM_FILTERS = (1 << 4)
     - };
     - 
     - struct split_commit_graph_opts {
  4:  3182a11f7c <  -:  ---------- commit-graph: document bloom filter format
  6:  85bfdfa59c <  -:  ---------- commit-graph: test commit-graph write --changed-paths
  8:  72a2bbf676 ! 10:  77f1c561e8 revision.c: use bloom filters to speed up path based revision walks
     @@ -1,83 +1,35 @@
      Author: Garima Singh <garima.singh@microsoft.com>
      
     -    revision.c: use bloom filters to speed up path based revision walks
     +    revision.c: use Bloom filters to speed up path based revision walks
      
     -    If bloom filters have been written to the commit-graph file, revision walk will
     -    use them to speed up revision walks for a particular path.
     -    Note: The current implementation does this in the case of single pathspec
     -    case only.
     +    Revision walk will now use Bloom filters for commits to speed up revision
     +    walks for a particular path (for computing history for that path), if they
     +    are present in the commit-graph file.
      
     -    We load the bloom filters during the prepare_revision_walk step when dealing
     -    with a single pathspec. While comparing trees in rev_compare_trees(), if the
     -    bloom filter says that the file is not different between the two trees, we
     -    don't need to compute the expensive diff. This is where we get our performance
     -    gains.
     +    We load the Bloom filters during the prepare_revision_walk step, but only
     +    when dealing with a single pathspec. While comparing trees in
     +    rev_compare_trees(), if the Bloom filter says that the file is not different
     +    between the two trees, we don't need to compute the expensive diff. This is
     +    where we get our performance gains. The other response of the Bloom filter
     +    is `maybe`, in which case we fall back to the full diff calculation to
     +    determine if the path was changed in the commit.
      
          Performance Gains:
     -    We tested the performance of `git log --path` on the git repo, the linux and
     -    some internal large repos, with a variety of paths of varying depths.
     +    We tested the performance of `git log -- <path>` on the git repo, the linux
     +    and some internal large repos, with a variety of paths of varying depths.
      
          On the git and linux repos:
     -    we observed a 2x to 5x speed up.
     +    - we observed a 2x to 5x speed up.
      
          On a large internal repo with files seated 6-10 levels deep in the tree:
     -    we observed 10x to 20x speed ups, with some paths going up to 28 times faster.
     -
     -    RFC Notes:
     -    I plan to collect the folloowing statistics around this usage of bloom filters
     -    and trace them out using trace2.
     -    - number of bloom filter queries,
     -    - number of "No" responses (file hasn't changed)
     -    - number of "Maybe" responses (file may have changed)
     -    - number of "Commit not parsed" cases (commit had too many changes to have a
     -      bloom filter written out, currently our limit is 512 diffs)
     +    - we observed 10x to 20x speed ups, with some paths going up to 28 times
     +      faster.
      
          Helped-by: Derrick Stolee <dstolee@microsoft.com
          Helped-by: SZEDER Gábor <szeder.dev@gmail.com>
          Helped-by: Jonathan Tan <jonathantanmy@google.com>
          Signed-off-by: Garima Singh <garima.singh@microsoft.com>
      
     - diff --git a/bloom.c b/bloom.c
     - --- a/bloom.c
     - +++ b/bloom.c
     -@@
     - 
     - 	return filter;
     - }
     -+
     -+int bloom_filter_contains(struct bloom_filter *filter,
     -+			  struct bloom_key *key,
     -+			  struct bloom_filter_settings *settings)
     -+{
     -+	int i;
     -+	uint64_t mod = filter->len * BITS_PER_BLOCK;
     -+
     -+	if (!mod)
     -+		return 1;
     -+
     -+	for (i = 0; i < settings->num_hashes; i++) {
     -+		uint64_t hash_mod = key->hashes[i] % mod;
     -+		uint64_t block_pos = hash_mod / BITS_PER_BLOCK;
     -+		if (!(filter->data[block_pos] & get_bitmask(hash_mod)))
     -+			return 0;
     -+	}
     -+
     -+	return 1;
     -+}
     -
     - diff --git a/bloom.h b/bloom.h
     - --- a/bloom.h
     - +++ b/bloom.h
     -@@
     - 		    struct bloom_key *key,
     - 		    struct bloom_filter_settings *settings);
     - 
     -+int bloom_filter_contains(struct bloom_filter *filter,
     -+			  struct bloom_key *key,
     -+			  struct bloom_filter_settings *settings);
     -+
     - #endif
     -
       diff --git a/revision.c b/revision.c
       --- a/revision.c
       +++ b/revision.c
     @@ -86,6 +38,7 @@
       #include "hashmap.h"
       #include "utf8.h"
      +#include "bloom.h"
     ++#include "json-writer.h"
       
       volatile show_early_output_fn_t show_early_output;
       
     @@ -93,26 +46,106 @@
       	options->flags.has_changes = 1;
       }
       
     ++static int bloom_filter_atexit_registered;
     ++static unsigned int count_bloom_filter_maybe;
     ++static unsigned int count_bloom_filter_definitely_not;
     ++static unsigned int count_bloom_filter_false_positive;
     ++static unsigned int count_bloom_filter_not_present;
     ++static unsigned int count_bloom_filter_length_zero;
     ++
     ++static void trace2_bloom_filter_statistics_atexit(void)
     ++{
     ++	struct json_writer jw = JSON_WRITER_INIT;
     ++
     ++	jw_object_begin(&jw, 0);
     ++	jw_object_intmax(&jw, "filter_not_present", count_bloom_filter_not_present);
     ++	jw_object_intmax(&jw, "zero_length_filter", count_bloom_filter_length_zero);
     ++	jw_object_intmax(&jw, "maybe", count_bloom_filter_maybe);
     ++	jw_object_intmax(&jw, "definitely_not", count_bloom_filter_definitely_not);
     ++	jw_end(&jw);
     ++
     ++	trace2_data_json("bloom", the_repository, "statistics", &jw);
     ++
     ++	jw_release(&jw);
     ++}
     ++
     ++static void prepare_to_use_bloom_filter(struct rev_info *revs)
     ++{
     ++	struct pathspec_item *pi;
     ++	char *path_alloc = NULL;
     ++	const char *path;
     ++	int last_index;
     ++	int len;
     ++
     ++	if (!revs->commits)
     ++	    return;
     ++
     ++	repo_parse_commit(revs->repo, revs->commits->item);
     ++
     ++	if (!revs->repo->objects->commit_graph)
     ++		return;
     ++
     ++	revs->bloom_filter_settings = revs->repo->objects->commit_graph->bloom_filter_settings;
     ++	if (!revs->bloom_filter_settings)
     ++		return;
     ++
     ++	pi = &revs->pruning.pathspec.items[0];
     ++	last_index = pi->len - 1;
     ++
     ++	if (pi->match[last_index] == '/') {
     ++	    path_alloc = xstrdup(pi->match);
     ++	    path_alloc[last_index] = '\0';
     ++	    path = path_alloc;
     ++	} else
     ++	    path = pi->match;
     ++
     ++	len = strlen(path);
     ++
     ++	revs->bloom_key = xmalloc(sizeof(struct bloom_key));
     ++	fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);
     ++
     ++	if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
     ++		atexit(trace2_bloom_filter_statistics_atexit);
     ++		bloom_filter_atexit_registered = 1;
     ++	}
     ++
     ++	free(path_alloc);
     ++}
     ++
      +static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
     -+						 struct commit *commit,
     -+						 struct bloom_key *key,
     -+						 struct bloom_filter_settings *settings)
     ++						 struct commit *commit)
      +{
      +	struct bloom_filter *filter;
     ++	int result;
      +
      +	if (!revs->repo->objects->commit_graph)
      +		return -1;
     ++
      +	if (commit->generation == GENERATION_NUMBER_INFINITY)
      +		return -1;
     -+	if (!key || !settings)
     -+		return -1;
      +
      +	filter = get_bloom_filter(revs->repo, commit, 0);
      +
     -+	if (!filter || !filter->len)
     -+		return 1;
     ++	if (!filter) {
     ++		count_bloom_filter_not_present++;
     ++		return -1;
     ++	}
      +
     -+	return bloom_filter_contains(filter, key, settings);
     ++	if (!filter->len) {
     ++		count_bloom_filter_length_zero++;
     ++		return -1;
     ++	}
     ++
     ++	result = bloom_filter_contains(filter,
     ++				       revs->bloom_key,
     ++				       revs->bloom_filter_settings);
     ++
     ++	if (result)
     ++		count_bloom_filter_maybe++;
     ++	else
     ++		count_bloom_filter_definitely_not++;
     ++
     ++	return result;
      +}
      +
       static int rev_compare_tree(struct rev_info *revs,
     @@ -129,11 +162,8 @@
       			return REV_TREE_SAME;
       	}
       
     -+	if (revs->pruning.pathspec.nr == 1 && !nth_parent) {
     -+		bloom_ret = check_maybe_different_in_bloom_filter(revs,
     -+								  commit,
     -+								  revs->bloom_key,
     -+								  revs->bloom_filter_settings);
     ++	if (revs->pruning.pathspec.nr == 1 && !revs->reflog_info && !nth_parent) {
     ++		bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
      +
      +		if (bloom_ret == 0)
      +			return REV_TREE_SAME;
     @@ -142,6 +172,16 @@
       	tree_difference = REV_TREE_SAME;
       	revs->pruning.flags.has_changes = 0;
       	if (diff_tree_oid(&t1->object.oid, &t2->object.oid, "",
     + 			   &revs->pruning) < 0)
     + 		return REV_TREE_DIFFERENT;
     ++
     ++	if (!nth_parent)
     ++		if (bloom_ret == 1 && tree_difference == REV_TREE_SAME)
     ++			count_bloom_filter_false_positive++;
     ++
     + 	return tree_difference;
     + }
     + 
      @@
       			die("cannot simplify commit %s (because of %s)",
       			    oid_to_hex(&commit->object.oid),
     @@ -152,45 +192,19 @@
       			if (!revs->simplify_history || !relevant_commit(p)) {
       				/* Even if a merge with an uninteresting
      @@
     + 				       FOR_EACH_OBJECT_PROMISOR_ONLY);
       	}
     - }
       
     -+static void prepare_to_use_bloom_filter(struct rev_info *revs)
     -+{
     -+	struct pathspec_item *pi;
     -+	const char *path;
     -+	size_t len;
     -+
     -+	if (!revs->commits)
     -+	    return;
     -+
     -+	parse_commit(revs->commits->item);
     -+
     -+	if (!revs->repo->objects->commit_graph)
     -+		return;
     -+
     -+	revs->bloom_filter_settings = revs->repo->objects->commit_graph->settings;
     -+	if (!revs->bloom_filter_settings)
     -+		return;
     -+
     -+	pi = &revs->pruning.pathspec.items[0];
     -+	path = pi->match;
     -+	len = strlen(path);
     -+
     -+	load_bloom_filters();
     -+	revs->bloom_key = xmalloc(sizeof(struct bloom_key));
     -+	fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);
     -+}
     -+
     - int prepare_revision_walk(struct rev_info *revs)
     - {
     - 	int i;
     ++	if (revs->pruning.pathspec.nr == 1 && !revs->reflog_info)
     ++		prepare_to_use_bloom_filter(revs);
     + 	if (revs->no_walk != REVISION_WALK_NO_WALK_UNSORTED)
     + 		commit_list_sort_by_date(&revs->commits);
     + 	if (revs->no_walk)
      @@
       		simplify_merges(revs);
       	if (revs->children.name)
       		set_children(revs);
     -+	if (revs->pruning.pathspec.nr == 1)
     -+	    prepare_to_use_bloom_filter(revs);
     ++
       	return 0;
       }
       
     @@ -212,12 +226,33 @@
       
       	struct topo_walk_info *topo_walk_info;
      +
     ++	/* Commit graph bloom filter fields */
     ++	/* The bloom filter key for the pathspec */
      +	struct bloom_key *bloom_key;
     ++	/*
     ++	 * The bloom filter settings used to generate the key.
     ++	 * This is loaded from the commit-graph being used.
     ++	 */
      +	struct bloom_filter_settings *bloom_filter_settings;
       };
       
       int ref_excluded(struct string_list *, const char *path);
      
     + diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
     + --- a/t/helper/test-read-graph.c
     + +++ b/t/helper/test-read-graph.c
     +@@
     + 		printf(" commit_metadata");
     + 	if (graph->chunk_extra_edges)
     + 		printf(" extra_edges");
     ++	if (graph->chunk_bloom_indexes)
     ++		printf(" bloom_indexes");
     ++	if (graph->chunk_bloom_data)
     ++		printf(" bloom_data");
     + 	printf("\n");
     + 
     + 	UNLEAK(graph);
     +
       diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
       new file mode 100755
       --- /dev/null
     @@ -228,72 +263,138 @@
      +test_description='git log for a path with bloom filters'
      +. ./test-lib.sh
      +
     -+test_expect_success 'setup repo' '
     ++test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
      +	git init &&
     -+	git config core.commitGraph true &&
     -+	git config gc.writeCommitGraph false &&
     -+	infodir=".git/objects/info" &&
     -+	graphdir="$infodir/commit-graphs" &&
     -+	test_oid_init
     ++	mkdir A A/B A/B/C &&
     ++	test_commit c1 A/file1 &&
     ++	test_commit c2 A/B/file2 &&
     ++	test_commit c3 A/B/C/file3 &&
     ++	test_commit c4 A/file1 &&
     ++	test_commit c5 A/B/file2 &&
     ++	test_commit c6 A/B/C/file3 &&
     ++	test_commit c7 A/file1 &&
     ++	test_commit c8 A/B/file2 &&
     ++	test_commit c9 A/B/C/file3 &&
     ++	git checkout -b side HEAD~4 &&
     ++	test_commit side-1 file4 &&
     ++	git checkout master &&
     ++	git merge side &&
     ++	test_commit c10 file5 &&
     ++	mv file5 file5_renamed &&
     ++	git add file5_renamed &&
     ++	git commit -m "rename" &&
     ++	git commit-graph write --reachable --changed-paths
      +'
     -+
     -+test_expect_success 'create 9 commits and repack' '
     -+	test_commit c1 file1 &&
     -+	test_commit c2 file2 &&
     -+	test_commit c3 file3 &&
     -+	test_commit c4 file1 &&
     -+	test_commit c5 file2 &&
     -+	test_commit c6 file3 &&
     -+	test_commit c7 file1 &&
     -+	test_commit c8 file2 &&
     -+	test_commit c9 file3
     -+'
     -+
     -+printf "c7\nc4\nc1" > expect_file1
     -+
     -+test_expect_success 'log without bloom filters' '
     -+	git log --pretty="format:%s"  -- file1 > actual &&
     -+	test_cmp expect_file1 actual
     -+'
     -+
     -+printf "c8\nc7\nc5\nc4\nc2\nc1" > expect_file1_file2
     -+
     -+test_expect_success 'multi-path log without bloom filters' '
     -+	git log --pretty="format:%s"  -- file1 file2 > actual &&
     -+	test_cmp expect_file1_file2 actual
     -+'
     -+
      +graph_read_expect() {
      +	OPTIONAL=""
      +	NUM_CHUNKS=5
     -+	if test ! -z $2
     -+	then
     -+		OPTIONAL=" $2"
     -+		NUM_CHUNKS=$((3 + $(echo "$2" | wc -w)))
     -+	fi
      +	cat >expect <<- EOF
      +	header: 43475048 1 1 $NUM_CHUNKS 0
      +	num_commits: $1
     -+	chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data$OPTIONAL
     ++	chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data
      +	EOF
      +	test-tool read-graph >output &&
      +	test_cmp expect output
      +}
      +
     -+test_expect_success 'write commit graph with bloom filters' '
     -+	git commit-graph write --reachable --changed-paths &&
     -+	test_path_is_file $infodir/commit-graph &&
     -+	graph_read_expect "9"
     ++test_expect_success 'commit-graph write wrote out the bloom chunks' '
     ++	graph_read_expect 13
     ++'
     ++
     ++setup() {
     ++	rm output
     ++	rm "$TRASH_DIRECTORY/trace.perf"
     ++	git -c core.commitGraph=false log --pretty="format:%s" $1 >log_wo_bloom
     ++	GIT_TRACE2_PERF="$TRASH_DIRECTORY/trace.perf" git -c core.commitGraph=true log --pretty="format:%s" $1 >log_w_bloom
     ++}
     ++
     ++test_bloom_filters_used() {
     ++	log_args=$1
     ++	bloom_trace_prefix="statistics:{\"filter_not_present\":0,\"zero_length_filter\":0,\"maybe\""
     ++	setup "$log_args"
     ++	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" && test_cmp log_wo_bloom log_w_bloom
     ++}
     ++
     ++test_bloom_filters_not_used() {
     ++	log_args=$1
     ++	setup "$log_args"
     ++	!(grep -q "statistics:{\"filter_not_present\":" "$TRASH_DIRECTORY/trace.perf") && test_cmp log_wo_bloom log_w_bloom
     ++}
     ++
     ++for path in A A/B A/B/C A/file1 A/B/file2 A/B/C/file3 file4 file5_renamed
     ++do
     ++	for option in "" \
     ++		      "--full-history" \
     ++		      "--full-history --simplify-merges" \
     ++		      "--simplify-merges" \
     ++		      "--simplify-by-decoration" \
     ++		      "--follow" \
     ++		      "--first-parent" \
     ++		      "--topo-order" \
     ++		      "--date-order" \
     ++		      "--author-date-order" \
     ++		      "--ancestry-path side..master"
     ++	do
     ++		test_expect_success "git log option: $option for path: $path" '
     ++			test_bloom_filters_used "$option -- $path"
     ++		'
     ++	done
     ++done
     ++
     ++test_expect_success 'git log -- folder works with and without the trailing slash' '
     ++	test_bloom_filters_used "-- A" &&
     ++	test_bloom_filters_used "-- A/"
     ++'
     ++
     ++test_expect_success 'git log for path that does not exist. ' '
     ++	test_bloom_filters_used "-- path_does_not_exist"
     ++'
     ++
     ++test_expect_success 'git log with --walk-reflogs does not use bloom filters' '
     ++	test_bloom_filters_not_used "--walk-reflogs -- A"
     ++'
     ++
     ++test_expect_success 'git log -- multiple path specs does not use bloom filters' '
     ++	test_bloom_filters_not_used "-- file4 A/file1"
     ++'
     ++
     ++test_expect_success 'git log with wildcard that resolves to a single path uses bloom filters' '
     ++	test_bloom_filters_used "-- *4" &&
     ++	test_bloom_filters_used "-- *renamed"
      +'
      +
     -+test_expect_success 'log using bloom filters' '
     -+	git log --pretty="format:%s" -- file1 > actual &&
     -+	test_cmp expect_file1 actual
     ++test_expect_success 'git log with wildcard that resolves to a multiple paths does not uses bloom filters' '
     ++	test_bloom_filters_not_used "-- *" &&
     ++	test_bloom_filters_not_used "-- file*"
      +'
      +
     -+test_expect_success 'multi-path log using bloom filters' '
     -+	git log --pretty="format:%s"  -- file1 file2 > actual &&
     -+	test_cmp expect_file1_file2 actual
     ++test_expect_success 'setup - add commit-graph to the chain without bloom filters' '
     ++	test_commit c14 A/anotherFile2 &&
     ++	test_commit c15 A/B/anotherFile2 &&
     ++	test_commit c16 A/B/C/anotherFile2 &&
     ++	GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0 git commit-graph write --reachable --split &&
     ++	test_line_count = 2 .git/objects/info/commit-graphs/commit-graph-chain
     ++'
     ++
     ++test_expect_success 'git log does not use bloom filters if the latest graph does not have bloom filters.' '
     ++	test_bloom_filters_not_used "-- A/B"
     ++'
     ++
     ++test_expect_success 'setup - add commit-graph to the chain with bloom filters' '
     ++	test_commit c17 A/anotherFile3 &&
     ++	git commit-graph write --reachable --changed-paths --split &&
     ++	test_line_count = 3 .git/objects/info/commit-graphs/commit-graph-chain
     ++'
     ++
     ++test_bloom_filters_used_when_some_filters_are_missing() {
     ++	log_args=$1
     ++	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":6,\"definitely_not\":6"
     ++	setup "$log_args"
     ++	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" && test_cmp log_wo_bloom log_w_bloom
     ++}
     ++
     ++test_expect_success 'git log uses bloom filters if they exist in the latest but not all commit graphs in the chain.' '
     ++	test_bloom_filters_used_when_some_filters_are_missing "-- A/B"
      +'
      +
      +test_done
  9:  e1c315d0a7 ! 11:  e1b076a714 commit-graph: add GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS test flag
     @@ -1,14 +1,15 @@
      Author: Garima Singh <garima.singh@microsoft.com>
      
     -    commit-graph: add GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS test flag
     +    commit-graph: add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag
      
     -    Add GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS test flag to the test setup suite in
     -    order to toggle writing bloom filters when running any of the git tests. If set
     -    to true, we will compute and write bloom filters every time a test calls
     -    `git commit-graph write`.
     +    Add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag to the test setup suite
     +    in order to toggle writing Bloom filters when running any of the git tests.
     +    If set to true, we will compute and write Bloom filters every time a test
     +    calls `git commit-graph write`, as if the `--changed-paths` option was
     +    passed in.
      
          The test suite passes when GIT_TEST_COMMIT_GRAPH and
     -    GIT_COMMIT_GRAPH_BLOOM_FILTERS are enabled.
     +    GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS are enabled.
      
          Helped-by: Derrick Stolee <dstolee@microsoft.com>
          Signed-off-by: Garima Singh <garima.singh@microsoft.com>
     @@ -20,8 +21,9 @@
       		flags |= COMMIT_GRAPH_WRITE_SPLIT;
       	if (opts.progress)
       		flags |= COMMIT_GRAPH_WRITE_PROGRESS;
     --	if (opts.enable_bloom_filters)
     -+	if (opts.enable_bloom_filters || git_env_bool(GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS, 0))
     +-	if (opts.enable_changed_paths)
     ++	if (opts.enable_changed_paths ||
     ++	    git_env_bool(GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS, 0))
       		flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;
       
       	read_replace_refs = 0;
     @@ -33,7 +35,7 @@
       	export GIT_TEST_OE_SIZE=10
       	export GIT_TEST_OE_DELTA_SIZE=5
       	export GIT_TEST_COMMIT_GRAPH=1
     -+	export GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS=1
     ++	export GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=1
       	export GIT_TEST_MULTI_PACK_INDEX=1
       	make test
       	;;
     @@ -45,7 +47,7 @@
       
       #define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
       #define GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD "GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD"
     -+#define GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS "GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS"
     ++#define GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS "GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS"
       
       struct commit;
       struct bloom_filter_settings;
     @@ -57,8 +59,10 @@
       be written after every 'git commit' command, and overrides the
       'core.commitGraph' setting to true.
       
     -+GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS=<boolean>, when true, forces commit-graph
     -+write to compute and write bloom filters for every 'git commit-graph write'
     ++GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=<boolean>, when true, forces
     ++commit-graph write to compute and write changed path Bloom filters for
     ++every 'git commit-graph write', as if the `--changed-paths` option was
     ++passed in.
      +
       GIT_TEST_FSMONITOR=$PWD/t7519/fsmonitor-all exercises the fsmonitor
       code path for utilizing a file system monitor to speed up detecting
     @@ -72,11 +76,11 @@
       . ./test-lib.sh
       
      +GIT_TEST_COMMIT_GRAPH=0
     -+GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS=0
     ++GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0
      +
     - test_expect_success 'setup repo' '
     + test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
       	git init &&
     - 	git config core.commitGraph true &&
     + 	mkdir A A/B A/B/C &&
      
       diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
       --- a/t/t5318-commit-graph.sh
     @@ -85,7 +89,7 @@
       test_description='commit graph'
       . ./test-lib.sh
       
     -+GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS=0
     ++GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0
      +
       test_expect_success 'setup full repo' '
       	mkdir full &&
     @@ -98,21 +102,7 @@
       . ./test-lib.sh
       
       GIT_TEST_COMMIT_GRAPH=0
     -+GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS=0
     ++GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0
       
       test_expect_success 'setup repo' '
       	git init &&
     -
     - diff --git a/t/t5325-commit-graph-bloom.sh b/t/t5325-commit-graph-bloom.sh
     - --- a/t/t5325-commit-graph-bloom.sh
     - +++ b/t/t5325-commit-graph-bloom.sh
     -@@
     - test_description='commit graph with bloom filters'
     - . ./test-lib.sh
     - 
     -+GIT_TEST_COMMIT_GRAPH=0
     -+GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS=0
     -+
     - test_expect_success 'setup repo' '
     - 	git init &&
     - 	git config core.commitGraph true &&

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 150+ messages in thread

* [PATCH v2 01/11] commit-graph: use MAX_NUM_CHUNKS
  2020-02-05 22:56 ` [PATCH v2 00/11] " Garima Singh via GitGitGadget
@ 2020-02-05 22:56   ` Garima Singh via GitGitGadget
  2020-02-09 12:39     ` Jakub Narebski
  2020-02-05 22:56   ` [PATCH v2 02/11] bloom: core Bloom filter implementation for changed paths Garima Singh via GitGitGadget
                     ` (12 subsequent siblings)
  13 siblings, 1 reply; 150+ messages in thread
From: Garima Singh via GitGitGadget @ 2020-02-05 22:56 UTC (permalink / raw)
  To: git
  Cc: stolee, szeder.dev, jonathantanmy, jeffhost, me, peff,
	garimasigit, jnareb, christian.couder, emilyshaffer, gitster,
	Garima Singh, Garima Singh

From: Garima Singh <garima.singh@microsoft.com>

This is a minor cleanup to make it easier to change the
number of chunks being written to the commit-graph in the future.

Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
 commit-graph.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index b205e65ed1..3c4d411326 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -23,6 +23,7 @@
 #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
 #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
 #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
+#define MAX_NUM_CHUNKS 5
 
 #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
 
@@ -1356,8 +1357,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	int fd;
 	struct hashfile *f;
 	struct lock_file lk = LOCK_INIT;
-	uint32_t chunk_ids[6];
-	uint64_t chunk_offsets[6];
+	uint32_t chunk_ids[MAX_NUM_CHUNKS + 1];
+	uint64_t chunk_offsets[MAX_NUM_CHUNKS + 1];
 	const unsigned hashsz = the_hash_algo->rawsz;
 	struct strbuf progress_title = STRBUF_INIT;
 	int num_chunks = 3;
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 150+ messages in thread

* [PATCH v2 02/11] bloom: core Bloom filter implementation for changed paths
  2020-02-05 22:56 ` [PATCH v2 00/11] " Garima Singh via GitGitGadget
  2020-02-05 22:56   ` [PATCH v2 01/11] commit-graph: use MAX_NUM_CHUNKS Garima Singh via GitGitGadget
@ 2020-02-05 22:56   ` Garima Singh via GitGitGadget
  2020-02-15 17:17     ` Jakub Narebski
  2020-02-16 16:49     ` Jakub Narebski
  2020-02-05 22:56   ` [PATCH v2 03/11] diff: halt tree-diff early after max_changes Derrick Stolee via GitGitGadget
                     ` (11 subsequent siblings)
  13 siblings, 2 replies; 150+ messages in thread
From: Garima Singh via GitGitGadget @ 2020-02-05 22:56 UTC (permalink / raw)
  To: git
  Cc: stolee, szeder.dev, jonathantanmy, jeffhost, me, peff,
	garimasigit, jnareb, christian.couder, emilyshaffer, gitster,
	Garima Singh, Garima Singh

From: Garima Singh <garima.singh@microsoft.com>

Add the core Bloom filter logic for computing the paths changed between a
commit and its first parent. For details on what Bloom filters are and how they
work, please refer to Dr. Derrick Stolee's blog post [1]. It provides a concise
explaination of the adoption of Bloom filters as described in [2] and [3]

1. We currently use 7 and 10 for the number of hashes and the size of each
   entry respectively. They served as great starting values, the mathematical
   details behind this choice are described in [1] and [4]. The implementation
   while not completely open to it at the moment, is flexible enough to allow
   for tweaking these settings in the future.

   Note: The performance gains we have observed with these values are
   significant enough that we did not need to tweak these settings.
   The performance numbers are included in the cover letter of this series
   and in the message of a subsequent commit where we use Bloom filters in
   to speed up `git log -- <path>`.

2. As described in the blog and in [3], we do not need 7 independent hashing
   functions. We use the Murmur3 hashing scheme. Seed it twice and then
   combine those to procure an arbitrary number of hash values.

3. The filters are sized according to the number of changes in the each commit,
   with minimum size of one 64 bit word.

4. We fill the Bloom filters as (const char *data, int len) pairs as
   "struct bloom_filter"s in a commit slab.

5. The seed_murmur3 method is implemented as described in [5]. It hashes the
   given data using a given seed and produces a uniformly distributed hash
   value.

[1] https://devblogs.microsoft.com/devops/super-charging-the-git-commit-graph-iv-Bloom-filters/

[2] Flavio Bonomi, Michael Mitzenmacher, Rina Panigrahy, Sushil Singh, George Varghese
    "An Improved Construction for Counting Bloom Filters"
    http://theory.stanford.edu/~rinap/papers/esa2006b.pdf
    https://doi.org/10.1007/11841036_61

[3] Peter C. Dillinger and Panagiotis Manolios
    "Bloom Filters in Probabilistic Verification"
    http://www.ccs.neu.edu/home/pete/pub/Bloom-filters-verification.pdf
    https://doi.org/10.1007/978-3-540-30494-4_26

[4] Thomas Mueller Graf, Daniel Lemire
    "Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters"
    https://arxiv.org/abs/1912.08258

[5] https://en.wikipedia.org/wiki/MurmurHash#Algorithm

Helped-by: Jeff King <peff@peff.net>
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
 Makefile              |   2 +
 bloom.c               | 228 ++++++++++++++++++++++++++++++++++++++++++
 bloom.h               |  56 +++++++++++
 t/helper/test-bloom.c |  84 ++++++++++++++++
 t/helper/test-tool.c  |   1 +
 t/helper/test-tool.h  |   1 +
 t/t0095-bloom.sh      | 113 +++++++++++++++++++++
 7 files changed, 485 insertions(+)
 create mode 100644 bloom.c
 create mode 100644 bloom.h
 create mode 100644 t/helper/test-bloom.c
 create mode 100755 t/t0095-bloom.sh

diff --git a/Makefile b/Makefile
index 6134104ae6..afba81f4a8 100644
--- a/Makefile
+++ b/Makefile
@@ -695,6 +695,7 @@ X =
 
 PROGRAMS += $(patsubst %.o,git-%$X,$(PROGRAM_OBJS))
 
+TEST_BUILTINS_OBJS += test-bloom.o
 TEST_BUILTINS_OBJS += test-chmtime.o
 TEST_BUILTINS_OBJS += test-config.o
 TEST_BUILTINS_OBJS += test-ctype.o
@@ -840,6 +841,7 @@ LIB_OBJS += base85.o
 LIB_OBJS += bisect.o
 LIB_OBJS += blame.o
 LIB_OBJS += blob.o
+LIB_OBJS += bloom.o
 LIB_OBJS += branch.o
 LIB_OBJS += bulk-checkin.o
 LIB_OBJS += bundle.o
diff --git a/bloom.c b/bloom.c
new file mode 100644
index 0000000000..6082193a75
--- /dev/null
+++ b/bloom.c
@@ -0,0 +1,228 @@
+#include "git-compat-util.h"
+#include "bloom.h"
+#include "commit-graph.h"
+#include "object-store.h"
+#include "diff.h"
+#include "diffcore.h"
+#include "revision.h"
+#include "hashmap.h"
+
+define_commit_slab(bloom_filter_slab, struct bloom_filter);
+
+struct bloom_filter_slab bloom_filters;
+
+struct pathmap_hash_entry {
+    struct hashmap_entry entry;
+    const char path[FLEX_ARRAY];
+};
+
+static uint32_t rotate_right(uint32_t value, int32_t count)
+{
+	uint32_t mask = 8 * sizeof(uint32_t) - 1;
+	count &= mask;
+	return ((value >> count) | (value << ((-count) & mask)));
+}
+
+/*
+ * Calculate a hash value for the given data using the given seed.
+ * Produces a uniformly distributed hash value.
+ * Not considered to be cryptographically secure.
+ * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
+ **/
+static uint32_t seed_murmur3(uint32_t seed, const char *data, int len)
+{
+	const uint32_t c1 = 0xcc9e2d51;
+	const uint32_t c2 = 0x1b873593;
+	const uint32_t r1 = 15;
+	const uint32_t r2 = 13;
+	const uint32_t m = 5;
+	const uint32_t n = 0xe6546b64;
+	int i;
+	uint32_t k1 = 0;
+	const char *tail;
+
+	int len4 = len / sizeof(uint32_t);
+
+	const uint32_t *blocks = (const uint32_t*)data;
+
+	uint32_t k;
+	for (i = 0; i < len4; i++)
+	{
+		k = blocks[i];
+		k *= c1;
+		k = rotate_right(k, r1);
+		k *= c2;
+
+		seed ^= k;
+		seed = rotate_right(seed, r2) * m + n;
+	}
+
+	tail = (data + len4 * sizeof(uint32_t));
+
+	switch (len & (sizeof(uint32_t) - 1))
+	{
+	case 3:
+		k1 ^= ((uint32_t)tail[2]) << 16;
+		/*-fallthrough*/
+	case 2:
+		k1 ^= ((uint32_t)tail[1]) << 8;
+		/*-fallthrough*/
+	case 1:
+		k1 ^= ((uint32_t)tail[0]) << 0;
+		k1 *= c1;
+		k1 = rotate_right(k1, r1);
+		k1 *= c2;
+		seed ^= k1;
+		break;
+	}
+
+	seed ^= (uint32_t)len;
+	seed ^= (seed >> 16);
+	seed *= 0x85ebca6b;
+	seed ^= (seed >> 13);
+	seed *= 0xc2b2ae35;
+	seed ^= (seed >> 16);
+
+	return seed;
+}
+
+static inline uint64_t get_bitmask(uint32_t pos)
+{
+	return ((uint64_t)1) << (pos & (BITS_PER_WORD - 1));
+}
+
+void load_bloom_filters(void)
+{
+	init_bloom_filter_slab(&bloom_filters);
+}
+
+void fill_bloom_key(const char *data,
+					int len,
+					struct bloom_key *key,
+					struct bloom_filter_settings *settings)
+{
+	int i;
+	const uint32_t seed0 = 0x293ae76f;
+	const uint32_t seed1 = 0x7e646e2c;
+	const uint32_t hash0 = seed_murmur3(seed0, data, len);
+	const uint32_t hash1 = seed_murmur3(seed1, data, len);
+
+	key->hashes = (uint32_t *)xcalloc(settings->num_hashes, sizeof(uint32_t));
+	for (i = 0; i < settings->num_hashes; i++)
+		key->hashes[i] = hash0 + i * hash1;
+}
+
+void add_key_to_filter(struct bloom_key *key,
+					   struct bloom_filter *filter,
+					   struct bloom_filter_settings *settings)
+{
+	int i;
+	uint64_t mod = filter->len * BITS_PER_WORD;
+
+	for (i = 0; i < settings->num_hashes; i++) {
+		uint64_t hash_mod = key->hashes[i] % mod;
+		uint64_t block_pos = hash_mod / BITS_PER_WORD;
+
+		filter->data[block_pos] |= get_bitmask(hash_mod);
+	}
+}
+
+struct bloom_filter *get_bloom_filter(struct repository *r,
+				      struct commit *c)
+{
+	struct bloom_filter *filter;
+	struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
+	int i;
+	struct diff_options diffopt;
+
+	if (!bloom_filters.slab_size)
+		return NULL;
+
+	filter = bloom_filter_slab_at(&bloom_filters, c);
+
+	repo_diff_setup(r, &diffopt);
+	diffopt.flags.recursive = 1;
+	diff_setup_done(&diffopt);
+
+	if (c->parents)
+		diff_tree_oid(&c->parents->item->object.oid, &c->object.oid, "", &diffopt);
+	else
+		diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
+	diffcore_std(&diffopt);
+
+	if (diff_queued_diff.nr <= 512) {
+		struct hashmap pathmap;
+		struct pathmap_hash_entry* e;
+		struct hashmap_iter iter;
+		hashmap_init(&pathmap, NULL, NULL, 0);
+
+		for (i = 0; i < diff_queued_diff.nr; i++) {
+			const char* path = diff_queued_diff.queue[i]->two->path;
+			const char* p = path;
+
+			/*
+			* Add each leading directory of the changed file, i.e. for
+			* 'dir/subdir/file' add 'dir' and 'dir/subdir' as well, so
+			* the Bloom filter could be used to speed up commands like
+			* 'git log dir/subdir', too.
+			*
+			* Note that directories are added without the trailing '/'.
+			*/
+			do {
+				char* last_slash = strrchr(p, '/');
+
+				FLEX_ALLOC_STR(e, path, path);
+				hashmap_entry_init(&e->entry, strhash(p));
+				hashmap_add(&pathmap, &e->entry);
+
+				if (!last_slash)
+					last_slash = (char*)p;
+				*last_slash = '\0';
+
+			} while (*p);
+
+			diff_free_filepair(diff_queued_diff.queue[i]);
+		}
+
+		filter->len = (hashmap_get_size(&pathmap) * settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
+		filter->data = xcalloc(filter->len, sizeof(uint64_t));
+
+		hashmap_for_each_entry(&pathmap, &iter, e, entry) {
+			struct bloom_key key;
+			fill_bloom_key(e->path, strlen(e->path), &key, &settings);
+			add_key_to_filter(&key, filter, &settings);
+		}
+
+		hashmap_free_entries(&pathmap, struct pathmap_hash_entry, entry);
+	} else {
+		for (i = 0; i < diff_queued_diff.nr; i++)
+			diff_free_filepair(diff_queued_diff.queue[i]);
+		filter->data = NULL;
+		filter->len = 0;
+	}
+
+	free(diff_queued_diff.queue);
+	DIFF_QUEUE_CLEAR(&diff_queued_diff);
+
+	return filter;
+}
+
+int bloom_filter_contains(struct bloom_filter *filter,
+			  struct bloom_key *key,
+			  struct bloom_filter_settings *settings)
+{
+	int i;
+	uint64_t mod = filter->len * BITS_PER_WORD;
+
+	if (!mod)
+		return -1;
+
+	for (i = 0; i < settings->num_hashes; i++) {
+		uint64_t hash_mod = key->hashes[i] % mod;
+		uint64_t block_pos = hash_mod / BITS_PER_WORD;
+		if (!(filter->data[block_pos] & get_bitmask(hash_mod)))
+			return 0;
+	}
+
+	return 1;
+}
diff --git a/bloom.h b/bloom.h
new file mode 100644
index 0000000000..7f40c751f7
--- /dev/null
+++ b/bloom.h
@@ -0,0 +1,56 @@
+#ifndef BLOOM_H
+#define BLOOM_H
+
+struct commit;
+struct repository;
+struct commit_graph;
+
+struct bloom_filter_settings {
+	uint32_t hash_version;
+	uint32_t num_hashes;
+	uint32_t bits_per_entry;
+};
+
+#define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 }
+#define BITS_PER_WORD 64
+
+/*
+ * A bloom_filter struct represents a data segment to
+ * use when testing hash values. The 'len' member
+ * dictates how many uint64_t entries are stored in
+ * 'data'.
+ */
+struct bloom_filter {
+	uint64_t *data;
+	int len;
+};
+
+/*
+ * A bloom_key represents the k hash values for a
+ * given hash input. These can be precomputed and
+ * stored in a bloom_key for re-use when testing
+ * against a bloom_filter.
+ */
+struct bloom_key {
+	uint32_t *hashes;
+};
+
+void load_bloom_filters(void);
+
+void fill_bloom_key(const char *data,
+		    int len,
+		    struct bloom_key *key,
+		    struct bloom_filter_settings *settings);
+
+void add_key_to_filter(struct bloom_key *key,
+					   struct bloom_filter *filter,
+					   struct bloom_filter_settings *settings);
+
+struct bloom_filter *get_bloom_filter(struct repository *r,
+				      struct commit *c);
+
+int bloom_filter_contains(struct bloom_filter *filter,
+			  struct bloom_key *key,
+			  struct bloom_filter_settings *settings);
+
+#endif
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
new file mode 100644
index 0000000000..331957011b
--- /dev/null
+++ b/t/helper/test-bloom.c
@@ -0,0 +1,84 @@
+#include "test-tool.h"
+#include "git-compat-util.h"
+#include "bloom.h"
+#include "test-tool.h"
+#include "cache.h"
+#include "commit-graph.h"
+#include "commit.h"
+#include "config.h"
+#include "object-store.h"
+#include "object.h"
+#include "repository.h"
+#include "tree.h"
+
+struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
+
+static void print_bloom_filter(struct bloom_filter *filter) {
+	int i;
+
+	if (!filter) {
+		printf("No filter.\n");
+		return;
+	}
+	printf("Filter_Length:%d\n", filter->len);
+	printf("Filter_Data:");
+	for (i = 0; i < filter->len; i++){
+		printf("%"PRIx64"|", filter->data[i]);
+	}
+	printf("\n");
+}
+
+static void add_string_to_filter(const char *data, struct bloom_filter *filter) {
+		struct bloom_key key;
+		int i;
+
+		fill_bloom_key(data, strlen(data), &key, &settings);
+		printf("Hashes:");
+		for (i = 0; i < settings.num_hashes; i++){
+			printf("%08x|", key.hashes[i]);
+		}
+		printf("\n");
+		add_key_to_filter(&key, filter, &settings);
+}
+
+static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
+{
+	struct commit *c;
+	struct bloom_filter *filter;
+	setup_git_directory();
+	c = lookup_commit(the_repository, commit_oid);
+	filter = get_bloom_filter(the_repository, c);
+	print_bloom_filter(filter);
+}
+
+int cmd__bloom(int argc, const char **argv)
+{
+    if (!strcmp(argv[1], "generate_filter")) {
+		struct bloom_filter filter;
+		int i = 2;
+		filter.len =  (settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
+		filter.data = xcalloc(filter.len, sizeof(uint64_t));
+
+		if (!argv[2]){
+			die("at least one input string expected");
+		}
+
+		while (argv[i]) {
+			add_string_to_filter(argv[i], &filter);
+			i++;
+		}
+
+		print_bloom_filter(&filter);
+	}
+
+	if (!strcmp(argv[1], "get_filter_for_commit")) {
+		struct object_id oid;
+		const char *end;
+		if (parse_oid_hex(argv[2], &oid, &end))
+			die("cannot parse oid '%s'", argv[2]);
+		load_bloom_filters();
+		get_bloom_filter_for_commit(&oid);
+	}
+
+	return 0;
+}
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index c9a232d238..ca4f4b0066 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -14,6 +14,7 @@ struct test_cmd {
 };
 
 static struct test_cmd cmds[] = {
+	{ "bloom", cmd__bloom },
 	{ "chmtime", cmd__chmtime },
 	{ "config", cmd__config },
 	{ "ctype", cmd__ctype },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index c8549fd87f..05d2b32451 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -4,6 +4,7 @@
 #define USE_THE_INDEX_COMPATIBILITY_MACROS
 #include "git-compat-util.h"
 
+int cmd__bloom(int argc, const char **argv);
 int cmd__chmtime(int argc, const char **argv);
 int cmd__config(int argc, const char **argv);
 int cmd__ctype(int argc, const char **argv);
diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
new file mode 100755
index 0000000000..424fe4fc29
--- /dev/null
+++ b/t/t0095-bloom.sh
@@ -0,0 +1,113 @@
+#!/bin/sh
+
+test_description='test bloom.c'
+. ./test-lib.sh
+
+test_expect_success 'get bloom filters for commit with no changes' '
+	git init &&
+	git commit --allow-empty -m "c0" &&
+	cat >expect <<-\EOF &&
+	Filter_Length:0
+	Filter_Data:
+	EOF
+	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'get bloom filter for commit with 10 changes' '
+	rm actual &&
+	rm expect &&
+	mkdir smallDir &&
+	for i in $(test_seq 0 9)
+	do
+		echo $i >smallDir/$i
+	done &&
+	git add smallDir &&
+	git commit -m "commit with 10 changes" &&
+	cat >expect <<-\EOF &&
+	Filter_Length:4
+	Filter_Data:508928809087080a|8a7648210804001|4089824400951000|841ab310098051a8|
+	EOF
+	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success EXPENSIVE 'get bloom filter for commit with 513 changes' '
+	rm actual &&
+	rm expect &&
+	mkdir bigDir &&
+	for i in $(test_seq 0 512)
+	do
+		echo $i >bigDir/$i
+	done &&
+	git add bigDir &&
+	git commit -m "commit with 513 changes" &&
+	cat >expect <<-\EOF &&
+	Filter_Length:0
+	Filter_Data:
+	EOF
+	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'compute bloom key for empty string' '
+	cat >expect <<-\EOF &&
+	Hashes:5615800c|5b966560|61174ab4|66983008|6c19155c|7199fab0|771ae004|
+	Filter_Length:1
+	Filter_Data:11000110001110|
+	EOF
+	test-tool bloom generate_filter "" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'compute bloom key for whitespace' '
+	cat >expect <<-\EOF &&
+	Hashes:1bf014e6|8a91b50b|f9335530|67d4f555|d676957a|4518359f|b3b9d5c4|
+	Filter_Length:1
+	Filter_Data:401004080200810|
+	EOF
+	test-tool bloom generate_filter " " >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'compute bloom key for a root level folder' '
+	cat >expect <<-\EOF &&
+	Hashes:1a21016f|fff1c06d|e5c27f6b|cb933e69|b163fd67|9734bc65|7d057b63|
+	Filter_Length:1
+	Filter_Data:aaa800000000|
+	EOF
+	test-tool bloom generate_filter "A" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'compute bloom key for a root level file' '
+	cat >expect <<-\EOF &&
+	Hashes:e2d51107|30970605|7e58fb03|cc1af001|19dce4ff|679ed9fd|b560cefb|
+	Filter_Length:1
+	Filter_Data:a8000000000000aa|
+	EOF
+	test-tool bloom generate_filter "file.txt" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'compute bloom key for a deep folder' '
+	cat >expect <<-\EOF &&
+	Hashes:864cf838|27f055cd|c993b362|6b3710f7|0cda6e8c|ae7dcc21|502129b6|
+	Filter_Length:1
+	Filter_Data:1c0000600003000|
+	EOF
+	test-tool bloom generate_filter "A/B/C/D/E" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'compute bloom key for a deep file' '
+	cat >expect <<-\EOF &&
+	Hashes:07cdf850|4af629c7|8e1e5b3e|d1468cb5|146ebe2c|5796efa3|9abf211a|
+	Filter_Length:1
+	Filter_Data:4020100804010080|
+	EOF
+	test-tool bloom generate_filter "A/B/C/D/E/file.txt" >actual &&
+	test_cmp expect actual
+'
+
+test_done
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 150+ messages in thread

* [PATCH v2 03/11] diff: halt tree-diff early after max_changes
  2020-02-05 22:56 ` [PATCH v2 00/11] " Garima Singh via GitGitGadget
  2020-02-05 22:56   ` [PATCH v2 01/11] commit-graph: use MAX_NUM_CHUNKS Garima Singh via GitGitGadget
  2020-02-05 22:56   ` [PATCH v2 02/11] bloom: core Bloom filter implementation for changed paths Garima Singh via GitGitGadget
@ 2020-02-05 22:56   ` Derrick Stolee via GitGitGadget
  2020-02-17  0:00     ` Jakub Narebski
  2020-02-05 22:56   ` [PATCH v2 04/11] commit-graph: compute Bloom filters for changed paths Garima Singh via GitGitGadget
                     ` (10 subsequent siblings)
  13 siblings, 1 reply; 150+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-02-05 22:56 UTC (permalink / raw)
  To: git
  Cc: stolee, szeder.dev, jonathantanmy, jeffhost, me, peff,
	garimasigit, jnareb, christian.couder, emilyshaffer, gitster,
	Garima Singh, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When computing the changed-paths bloom filters for the commit-graph,
we limit the size of the filter by restricting the number of paths
in the diff. Instead of computing a large diff and then ignoring the
result, it is better to halt the diff computation early.

Create a new "max_changes" option in struct diff_options. If non-zero,
then halt the diff computation after discovering strictly more changed
paths. This includes paths corresponding to trees that change.

Use this max_changes option in the bloom filter calculations. This
reduces the time taken to compute the filters for the Linux kernel
repo from 2m50s to 2m35s. On a large internal repository with ~500
commits that perform tree-wide changes, the time reduced from
6m15s to 3m48s.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
 bloom.c     | 4 +++-
 diff.h      | 5 +++++
 tree-diff.c | 6 ++++++
 3 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/bloom.c b/bloom.c
index 6082193a75..818382c03b 100644
--- a/bloom.c
+++ b/bloom.c
@@ -134,6 +134,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 	struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
 	int i;
 	struct diff_options diffopt;
+	int max_changes = 512;
 
 	if (!bloom_filters.slab_size)
 		return NULL;
@@ -142,6 +143,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 
 	repo_diff_setup(r, &diffopt);
 	diffopt.flags.recursive = 1;
+	diffopt.max_changes = max_changes;
 	diff_setup_done(&diffopt);
 
 	if (c->parents)
@@ -150,7 +152,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 		diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
 	diffcore_std(&diffopt);
 
-	if (diff_queued_diff.nr <= 512) {
+	if (diff_queued_diff.nr <= max_changes) {
 		struct hashmap pathmap;
 		struct pathmap_hash_entry* e;
 		struct hashmap_iter iter;
diff --git a/diff.h b/diff.h
index 6febe7e365..9443dc1b00 100644
--- a/diff.h
+++ b/diff.h
@@ -285,6 +285,11 @@ struct diff_options {
 	/* Number of hexdigits to abbreviate raw format output to. */
 	int abbrev;
 
+	/* If non-zero, then stop computing after this many changes. */
+	int max_changes;
+	/* For internal use only. */
+	int num_changes;
+
 	int ita_invisible_in_index;
 /* white-space error highlighting */
 #define WSEH_NEW (1<<12)
diff --git a/tree-diff.c b/tree-diff.c
index 33ded7f8b3..f3d303c6e5 100644
--- a/tree-diff.c
+++ b/tree-diff.c
@@ -434,6 +434,9 @@ static struct combine_diff_path *ll_diff_tree_paths(
 		if (diff_can_quit_early(opt))
 			break;
 
+		if (opt->max_changes && opt->num_changes > opt->max_changes)
+			break;
+
 		if (opt->pathspec.nr) {
 			skip_uninteresting(&t, base, opt);
 			for (i = 0; i < nparent; i++)
@@ -518,6 +521,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
 
 			/* t↓ */
 			update_tree_entry(&t);
+			opt->num_changes++;
 		}
 
 		/* t > p[imin] */
@@ -535,6 +539,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
 		skip_emit_tp:
 			/* ∀ pi=p[imin]  pi↓ */
 			update_tp_entries(tp, nparent);
+			opt->num_changes++;
 		}
 	}
 
@@ -552,6 +557,7 @@ struct combine_diff_path *diff_tree_paths(
 	const struct object_id **parents_oid, int nparent,
 	struct strbuf *base, struct diff_options *opt)
 {
+	opt->num_changes = 0;
 	p = ll_diff_tree_paths(p, oid, parents_oid, nparent, base, opt);
 
 	/*
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 150+ messages in thread

* [PATCH v2 04/11] commit-graph: compute Bloom filters for changed paths
  2020-02-05 22:56 ` [PATCH v2 00/11] " Garima Singh via GitGitGadget
                     ` (2 preceding siblings ...)
  2020-02-05 22:56   ` [PATCH v2 03/11] diff: halt tree-diff early after max_changes Derrick Stolee via GitGitGadget
@ 2020-02-05 22:56   ` Garima Singh via GitGitGadget
  2020-02-17 21:56     ` Jakub Narebski
  2020-02-05 22:56   ` [PATCH v2 05/11] commit-graph: examine changed-path objects in pack order Jeff King via GitGitGadget
                     ` (9 subsequent siblings)
  13 siblings, 1 reply; 150+ messages in thread
From: Garima Singh via GitGitGadget @ 2020-02-05 22:56 UTC (permalink / raw)
  To: git
  Cc: stolee, szeder.dev, jonathantanmy, jeffhost, me, peff,
	garimasigit, jnareb, christian.couder, emilyshaffer, gitster,
	Garima Singh, Garima Singh

From: Garima Singh <garima.singh@microsoft.com>

Compute Bloom filters for the paths that changed between a commit and its
first parent using the implementation in bloom.c, when the
COMMIT_GRAPH_WRITE_CHANGED_PATHS flag is set. This computation is done on a
commit-by-commit basis. We will write these Bloom filters to the commit graph
file in the next change.

Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
 commit-graph.c | 32 +++++++++++++++++++++++++++++++-
 commit-graph.h |  3 ++-
 2 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 3c4d411326..724bfcffc4 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -16,6 +16,7 @@
 #include "hashmap.h"
 #include "replace-object.h"
 #include "progress.h"
+#include "bloom.h"
 
 #define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
 #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
@@ -795,9 +796,11 @@ struct write_commit_graph_context {
 	unsigned append:1,
 		 report_progress:1,
 		 split:1,
-		 check_oids:1;
+		 check_oids:1,
+		 changed_paths:1;
 
 	const struct split_commit_graph_opts *split_opts;
+	uint32_t total_bloom_filter_data_size;
 };
 
 static void write_graph_chunk_fanout(struct hashfile *f,
@@ -1140,6 +1143,28 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 	stop_progress(&ctx->progress);
 }
 
+static void compute_bloom_filters(struct write_commit_graph_context *ctx)
+{
+	int i;
+	struct progress *progress = NULL;
+
+	load_bloom_filters();
+
+	if (ctx->report_progress)
+		progress = start_progress(
+			_("Computing commit diff Bloom filters"),
+			ctx->commits.nr);
+
+	for (i = 0; i < ctx->commits.nr; i++) {
+		struct commit *c = ctx->commits.list[i];
+		struct bloom_filter *filter = get_bloom_filter(ctx->r, c);
+		ctx->total_bloom_filter_data_size += sizeof(uint64_t) * filter->len;
+		display_progress(progress, i + 1);
+	}
+
+	stop_progress(&progress);
+}
+
 static int add_ref_to_list(const char *refname,
 			   const struct object_id *oid,
 			   int flags, void *cb_data)
@@ -1794,6 +1819,8 @@ int write_commit_graph(const char *obj_dir,
 	ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
 	ctx->check_oids = flags & COMMIT_GRAPH_WRITE_CHECK_OIDS ? 1 : 0;
 	ctx->split_opts = split_opts;
+	ctx->changed_paths = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0;
+	ctx->total_bloom_filter_data_size = 0;
 
 	if (ctx->split) {
 		struct commit_graph *g;
@@ -1888,6 +1915,9 @@ int write_commit_graph(const char *obj_dir,
 
 	compute_generation_numbers(ctx);
 
+	if (ctx->changed_paths)
+		compute_bloom_filters(ctx);
+
 	res = write_commit_graph_file(ctx);
 
 	if (ctx->split)
diff --git a/commit-graph.h b/commit-graph.h
index 7f5c933fa2..952a4b83be 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -76,7 +76,8 @@ enum commit_graph_write_flags {
 	COMMIT_GRAPH_WRITE_PROGRESS   = (1 << 1),
 	COMMIT_GRAPH_WRITE_SPLIT      = (1 << 2),
 	/* Make sure that each OID in the input is a valid commit OID. */
-	COMMIT_GRAPH_WRITE_CHECK_OIDS = (1 << 3)
+	COMMIT_GRAPH_WRITE_CHECK_OIDS = (1 << 3),
+	COMMIT_GRAPH_WRITE_BLOOM_FILTERS = (1 << 4)
 };
 
 struct split_commit_graph_opts {
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 150+ messages in thread

* [PATCH v2 05/11] commit-graph: examine changed-path objects in pack order
  2020-02-05 22:56 ` [PATCH v2 00/11] " Garima Singh via GitGitGadget
                     ` (3 preceding siblings ...)
  2020-02-05 22:56   ` [PATCH v2 04/11] commit-graph: compute Bloom filters for changed paths Garima Singh via GitGitGadget
@ 2020-02-05 22:56   ` Jeff King via GitGitGadget
  2020-02-18 17:59     ` Jakub Narebski
  2020-02-05 22:56   ` [PATCH v2 06/11] commit-graph: examine commits by generation number Derrick Stolee via GitGitGadget
                     ` (8 subsequent siblings)
  13 siblings, 1 reply; 150+ messages in thread
From: Jeff King via GitGitGadget @ 2020-02-05 22:56 UTC (permalink / raw)
  To: git
  Cc: stolee, szeder.dev, jonathantanmy, jeffhost, me, peff,
	garimasigit, jnareb, christian.couder, emilyshaffer, gitster,
	Garima Singh, Jeff King

From: Jeff King <peff@peff.net>

Looking at the diff of commit objects in pack order is much faster than
in sha1 order, as it gives locality to the access of tree deltas
(whereas sha1 order is effectively random). Unfortunately the
commit-graph code sorts the commits (several times, sometimes as an oid
and sometimes a pointer-to-commit), and we ultimately traverse in sha1
order.

Instead, let's remember the position at which we see each commit, and
traverse in that order when looking at bloom filters. This drops my time
for "git commit-graph write --changed-paths" in linux.git from ~4
minutes to ~1.5 minutes.

Probably the "--reachable" code path would want something similar.

Or alternatively, we could use a different data structure (either a
hash, or maybe even just a bit in "struct commit") to keep track of
which oids we've seen, etc instead of sorting. And then we could keep
the original order.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
 commit-graph.c | 34 +++++++++++++++++++++++++++++++++-
 1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/commit-graph.c b/commit-graph.c
index 724bfcffc4..e125511a1c 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -17,6 +17,7 @@
 #include "replace-object.h"
 #include "progress.h"
 #include "bloom.h"
+#include "commit-slab.h"
 
 #define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
 #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
@@ -46,6 +47,29 @@
 /* Remember to update object flag allocation in object.h */
 #define REACHABLE       (1u<<15)
 
+/* Keep track of the order in which commits are added to our list. */
+define_commit_slab(commit_pos, int);
+static struct commit_pos commit_pos = COMMIT_SLAB_INIT(1, commit_pos);
+
+static void set_commit_pos(struct repository *r, const struct object_id *oid)
+{
+	static int32_t max_pos;
+	struct commit *commit = lookup_commit(r, oid);
+
+	if (!commit)
+		return; /* should never happen, but be lenient */
+
+	*commit_pos_at(&commit_pos, commit) = max_pos++;
+}
+
+static int commit_pos_cmp(const void *va, const void *vb)
+{
+	const struct commit *a = *(const struct commit **)va;
+	const struct commit *b = *(const struct commit **)vb;
+	return commit_pos_at(&commit_pos, a) -
+	       commit_pos_at(&commit_pos, b);
+}
+
 char *get_commit_graph_filename(const char *obj_dir)
 {
 	char *filename = xstrfmt("%s/info/commit-graph", obj_dir);
@@ -1027,6 +1051,8 @@ static int add_packed_commits(const struct object_id *oid,
 	oidcpy(&(ctx->oids.list[ctx->oids.nr]), oid);
 	ctx->oids.nr++;
 
+	set_commit_pos(ctx->r, oid);
+
 	return 0;
 }
 
@@ -1147,6 +1173,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 {
 	int i;
 	struct progress *progress = NULL;
+	struct commit **sorted_by_pos;
 
 	load_bloom_filters();
 
@@ -1155,13 +1182,18 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 			_("Computing commit diff Bloom filters"),
 			ctx->commits.nr);
 
+	ALLOC_ARRAY(sorted_by_pos, ctx->commits.nr);
+	COPY_ARRAY(sorted_by_pos, ctx->commits.list, ctx->commits.nr);
+	QSORT(sorted_by_pos, ctx->commits.nr, commit_pos_cmp);
+
 	for (i = 0; i < ctx->commits.nr; i++) {
-		struct commit *c = ctx->commits.list[i];
+		struct commit *c = sorted_by_pos[i];
 		struct bloom_filter *filter = get_bloom_filter(ctx->r, c);
 		ctx->total_bloom_filter_data_size += sizeof(uint64_t) * filter->len;
 		display_progress(progress, i + 1);
 	}
 
+	free(sorted_by_pos);
 	stop_progress(&progress);
 }
 
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 150+ messages in thread

* [PATCH v2 06/11] commit-graph: examine commits by generation number
  2020-02-05 22:56 ` [PATCH v2 00/11] " Garima Singh via GitGitGadget
                     ` (4 preceding siblings ...)
  2020-02-05 22:56   ` [PATCH v2 05/11] commit-graph: examine changed-path objects in pack order Jeff King via GitGitGadget
@ 2020-02-05 22:56   ` Derrick Stolee via GitGitGadget
  2020-02-19  0:32     ` Jakub Narebski
  2020-02-05 22:56   ` [PATCH v2 07/11] commit-graph: write Bloom filters to commit graph file Garima Singh via GitGitGadget
                     ` (7 subsequent siblings)
  13 siblings, 1 reply; 150+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-02-05 22:56 UTC (permalink / raw)
  To: git
  Cc: stolee, szeder.dev, jonathantanmy, jeffhost, me, peff,
	garimasigit, jnareb, christian.couder, emilyshaffer, gitster,
	Garima Singh, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When running 'git commit-graph write --changed-paths', we sort the
commits by pack-order to save time when computing the changed-paths
bloom filters. This does not help when finding the commits via the
--reachable flag.

If not using pack-order, then sort by generation number before
examining the diff. Commits with similar generation are more likely
to have many trees in common, making the diff faster.

On the Linux kernel repository, this change reduced the computation
time for 'git commit-graph write --reachable --changed-paths' from
3m00s to 1m37s.

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
 commit-graph.c | 33 ++++++++++++++++++++++++++++++---
 1 file changed, 30 insertions(+), 3 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index e125511a1c..32a315058f 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -70,6 +70,25 @@ static int commit_pos_cmp(const void *va, const void *vb)
 	       commit_pos_at(&commit_pos, b);
 }
 
+static int commit_gen_cmp(const void *va, const void *vb)
+{
+	const struct commit *a = *(const struct commit **)va;
+	const struct commit *b = *(const struct commit **)vb;
+
+	/* lower generation commits first */
+	if (a->generation < b->generation)
+		return -1;
+	else if (a->generation > b->generation)
+		return 1;
+
+	/* use date as a heuristic when generations are equal */
+	if (a->date < b->date)
+		return -1;
+	else if (a->date > b->date)
+		return 1;
+	return 0;
+}
+
 char *get_commit_graph_filename(const char *obj_dir)
 {
 	char *filename = xstrfmt("%s/info/commit-graph", obj_dir);
@@ -821,7 +840,8 @@ struct write_commit_graph_context {
 		 report_progress:1,
 		 split:1,
 		 check_oids:1,
-		 changed_paths:1;
+		 changed_paths:1,
+		 order_by_pack:1;
 
 	const struct split_commit_graph_opts *split_opts;
 	uint32_t total_bloom_filter_data_size;
@@ -1184,7 +1204,11 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 
 	ALLOC_ARRAY(sorted_by_pos, ctx->commits.nr);
 	COPY_ARRAY(sorted_by_pos, ctx->commits.list, ctx->commits.nr);
-	QSORT(sorted_by_pos, ctx->commits.nr, commit_pos_cmp);
+
+	if (ctx->order_by_pack)
+		QSORT(sorted_by_pos, ctx->commits.nr, commit_pos_cmp);
+	else
+		QSORT(sorted_by_pos, ctx->commits.nr, commit_gen_cmp);
 
 	for (i = 0; i < ctx->commits.nr; i++) {
 		struct commit *c = sorted_by_pos[i];
@@ -1902,6 +1926,7 @@ int write_commit_graph(const char *obj_dir,
 	}
 
 	if (pack_indexes) {
+		ctx->order_by_pack = 1;
 		if ((res = fill_oids_from_packs(ctx, pack_indexes)))
 			goto cleanup;
 	}
@@ -1911,8 +1936,10 @@ int write_commit_graph(const char *obj_dir,
 			goto cleanup;
 	}
 
-	if (!pack_indexes && !commit_hex)
+	if (!pack_indexes && !commit_hex) {
+		ctx->order_by_pack = 1;
 		fill_oids_from_all_packs(ctx);
+	}
 
 	close_reachable(ctx);
 
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 150+ messages in thread

* [PATCH v2 07/11] commit-graph: write Bloom filters to commit graph file
  2020-02-05 22:56 ` [PATCH v2 00/11] " Garima Singh via GitGitGadget
                     ` (5 preceding siblings ...)
  2020-02-05 22:56   ` [PATCH v2 06/11] commit-graph: examine commits by generation number Derrick Stolee via GitGitGadget
@ 2020-02-05 22:56   ` Garima Singh via GitGitGadget
  2020-02-19 15:13     ` Jakub Narebski
  2020-02-05 22:56   ` [PATCH v2 08/11] commit-graph: reuse existing Bloom filters during write Garima Singh via GitGitGadget
                     ` (6 subsequent siblings)
  13 siblings, 1 reply; 150+ messages in thread
From: Garima Singh via GitGitGadget @ 2020-02-05 22:56 UTC (permalink / raw)
  To: git
  Cc: stolee, szeder.dev, jonathantanmy, jeffhost, me, peff,
	garimasigit, jnareb, christian.couder, emilyshaffer, gitster,
	Garima Singh, Garima Singh

From: Garima Singh <garima.singh@microsoft.com>

Update the technical documentation for commit-graph-format with the formats for
the Bloom filter index (BIDX) and Bloom filter data (BDAT) chunks. Write the
computed Bloom filters information to the commit graph file using this format.

Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
 .../technical/commit-graph-format.txt         |  24 ++++
 commit-graph.c                                | 118 +++++++++++++++++-
 commit-graph.h                                |   7 +-
 3 files changed, 145 insertions(+), 4 deletions(-)

diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index a4f17441ae..22e511643d 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -17,6 +17,9 @@ metadata, including:
 - The parents of the commit, stored using positional references within
   the graph file.
 
+- The Bloom filter of the commit carrying the paths that were changed between
+  the commit and its first parent.
+
 These positional references are stored as unsigned 32-bit integers
 corresponding to the array position within the list of commit OIDs. Due
 to some special constants we use to track parents, we can store at most
@@ -93,6 +96,27 @@ CHUNK DATA:
       positions for the parents until reaching a value with the most-significant
       bit on. The other bits correspond to the position of the last parent.
 
+  Bloom Filter Index (ID: {'B', 'I', 'D', 'X'}) (N * 4 bytes) [Optional]
+    * The ith entry, BIDX[i], stores the number of 8-byte word blocks in all
+      Bloom filters from commit 0 to commit i (inclusive) in lexicographic
+      order. The Bloom filter for the i-th commit spans from BIDX[i-1] to
+      BIDX[i] (plus header length), where BIDX[-1] is 0.
+    * The BIDX chunk is ignored if the BDAT chunk is not present.
+
+  Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]
+    * It starts with header consisting of three unsigned 32-bit integers:
+      - Version of the hash algorithm being used. We currently only support
+	value 1 which implies the murmur3 hash implemented exactly as described
+	in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
+      - The number of times a path is hashed and hence the number of bit positions
+	that cumulatively determine whether a file is present in the commit.
+      - The minimum number of bits 'b' per entry in the Bloom filter. If the filter
+	contains 'n' entries, then the filter size is the minimum number of 64-bit
+	words that contain n*b bits.
+    * The rest of the chunk is the concatenation of all the computed Bloom
+      filters for the commits in lexicographic order.
+    * The BDAT chunk is present iff BIDX is present.
+
   Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional]
       This list of H-byte hashes describe a set of B commit-graph files that
       form a commit-graph chain. The graph position for the ith commit in this
diff --git a/commit-graph.c b/commit-graph.c
index 32a315058f..4585b3b702 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -24,8 +24,10 @@
 #define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
 #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
 #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
+#define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
+#define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
 #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
-#define MAX_NUM_CHUNKS 5
+#define MAX_NUM_CHUNKS 7
 
 #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
 
@@ -325,6 +327,32 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
 				chunk_repeated = 1;
 			else
 				graph->chunk_base_graphs = data + chunk_offset;
+			break;
+
+		case GRAPH_CHUNKID_BLOOMINDEXES:
+			if (graph->chunk_bloom_indexes)
+				chunk_repeated = 1;
+			else
+				graph->chunk_bloom_indexes = data + chunk_offset;
+			break;
+
+		case GRAPH_CHUNKID_BLOOMDATA:
+			if (graph->chunk_bloom_data)
+				chunk_repeated = 1;
+			else {
+				uint32_t hash_version;
+				graph->chunk_bloom_data = data + chunk_offset;
+				hash_version = get_be32(data + chunk_offset);
+
+				if (hash_version != 1)
+					break;
+
+				graph->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings));
+				graph->bloom_filter_settings->hash_version = hash_version;
+				graph->bloom_filter_settings->num_hashes = get_be32(data + chunk_offset + 4);
+				graph->bloom_filter_settings->bits_per_entry = get_be32(data + chunk_offset + 8);
+			}
+			break;
 		}
 
 		if (chunk_repeated) {
@@ -343,6 +371,17 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
 		last_chunk_offset = chunk_offset;
 	}
 
+	/* We need both the bloom chunks to exist together. Else ignore the data */
+	if ((graph->chunk_bloom_indexes && !graph->chunk_bloom_data)
+		 || (!graph->chunk_bloom_indexes && graph->chunk_bloom_data)) {
+		graph->chunk_bloom_indexes = NULL;
+		graph->chunk_bloom_data = NULL;
+		graph->bloom_filter_settings = NULL;
+	}
+
+	if (graph->chunk_bloom_indexes && graph->chunk_bloom_data)
+		load_bloom_filters();
+
 	hashcpy(graph->oid.hash, graph->data + graph->data_len - graph->hash_len);
 
 	if (verify_commit_graph_lite(graph)) {
@@ -1040,6 +1079,59 @@ static void write_graph_chunk_extra_edges(struct hashfile *f,
 	}
 }
 
+static void write_graph_chunk_bloom_indexes(struct hashfile *f,
+					    struct write_commit_graph_context *ctx)
+{
+	struct commit **list = ctx->commits.list;
+	struct commit **last = ctx->commits.list + ctx->commits.nr;
+	uint32_t cur_pos = 0;
+	struct progress *progress = NULL;
+	int i = 0;
+
+	if (ctx->report_progress)
+		progress = start_delayed_progress(
+			_("Writing changed paths Bloom filters index"),
+			ctx->commits.nr);
+
+	while (list < last) {
+		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
+		cur_pos += filter->len;
+		display_progress(progress, ++i);
+		hashwrite_be32(f, cur_pos);
+		list++;
+	}
+
+	stop_progress(&progress);
+}
+
+static void write_graph_chunk_bloom_data(struct hashfile *f,
+					 struct write_commit_graph_context *ctx,
+					 struct bloom_filter_settings *settings)
+{
+	struct commit **list = ctx->commits.list;
+	struct commit **last = ctx->commits.list + ctx->commits.nr;
+	struct progress *progress = NULL;
+	int i = 0;
+
+	if (ctx->report_progress)
+		progress = start_delayed_progress(
+			_("Writing changed paths Bloom filters data"),
+			ctx->commits.nr);
+
+	hashwrite_be32(f, settings->hash_version);
+	hashwrite_be32(f, settings->num_hashes);
+	hashwrite_be32(f, settings->bits_per_entry);
+
+	while (list < last) {
+		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
+		display_progress(progress, ++i);
+		hashwrite(f, filter->data, filter->len * sizeof(uint64_t));
+		list++;
+	}
+
+	stop_progress(&progress);
+}
+
 static int oid_compare(const void *_a, const void *_b)
 {
 	const struct object_id *a = (const struct object_id *)_a;
@@ -1198,8 +1290,8 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 	load_bloom_filters();
 
 	if (ctx->report_progress)
-		progress = start_progress(
-			_("Computing commit diff Bloom filters"),
+		progress = start_delayed_progress(
+			_("Computing changed paths Bloom filters"),
 			ctx->commits.nr);
 
 	ALLOC_ARRAY(sorted_by_pos, ctx->commits.nr);
@@ -1444,6 +1536,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	struct strbuf progress_title = STRBUF_INIT;
 	int num_chunks = 3;
 	struct object_id file_hash;
+	struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
 
 	if (ctx->split) {
 		struct strbuf tmp_file = STRBUF_INIT;
@@ -1488,6 +1581,12 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 		chunk_ids[num_chunks] = GRAPH_CHUNKID_EXTRAEDGES;
 		num_chunks++;
 	}
+	if (ctx->changed_paths) {
+		chunk_ids[num_chunks] = GRAPH_CHUNKID_BLOOMINDEXES;
+		num_chunks++;
+		chunk_ids[num_chunks] = GRAPH_CHUNKID_BLOOMDATA;
+		num_chunks++;
+	}
 	if (ctx->num_commit_graphs_after > 1) {
 		chunk_ids[num_chunks] = GRAPH_CHUNKID_BASE;
 		num_chunks++;
@@ -1506,6 +1605,15 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 						4 * ctx->num_extra_edges;
 		num_chunks++;
 	}
+	if (ctx->changed_paths) {
+		chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
+						sizeof(uint32_t) * ctx->commits.nr;
+		num_chunks++;
+
+		chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
+						sizeof(uint32_t) * 3 + ctx->total_bloom_filter_data_size;
+		num_chunks++;
+	}
 	if (ctx->num_commit_graphs_after > 1) {
 		chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
 						hashsz * (ctx->num_commit_graphs_after - 1);
@@ -1543,6 +1651,10 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	write_graph_chunk_data(f, hashsz, ctx);
 	if (ctx->num_extra_edges)
 		write_graph_chunk_extra_edges(f, ctx);
+	if (ctx->changed_paths) {
+		write_graph_chunk_bloom_indexes(f, ctx);
+		write_graph_chunk_bloom_data(f, ctx, &bloom_settings);
+	}
 	if (ctx->num_commit_graphs_after > 1 &&
 	    write_graph_chunk_base(f, ctx)) {
 		return -1;
diff --git a/commit-graph.h b/commit-graph.h
index 952a4b83be..25fefefb3e 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -10,6 +10,7 @@
 #define GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD "GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD"
 
 struct commit;
+struct bloom_filter_settings;
 
 char *get_commit_graph_filename(const char *obj_dir);
 int open_commit_graph(const char *graph_file, int *fd, struct stat *st);
@@ -58,6 +59,10 @@ struct commit_graph {
 	const unsigned char *chunk_commit_data;
 	const unsigned char *chunk_extra_edges;
 	const unsigned char *chunk_base_graphs;
+	const unsigned char *chunk_bloom_indexes;
+	const unsigned char *chunk_bloom_data;
+
+	struct bloom_filter_settings *bloom_filter_settings;
 };
 
 struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st);
@@ -77,7 +82,7 @@ enum commit_graph_write_flags {
 	COMMIT_GRAPH_WRITE_SPLIT      = (1 << 2),
 	/* Make sure that each OID in the input is a valid commit OID. */
 	COMMIT_GRAPH_WRITE_CHECK_OIDS = (1 << 3),
-	COMMIT_GRAPH_WRITE_BLOOM_FILTERS = (1 << 4)
+	COMMIT_GRAPH_WRITE_BLOOM_FILTERS = (1 << 4),
 };
 
 struct split_commit_graph_opts {
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 150+ messages in thread

* [PATCH v2 08/11] commit-graph: reuse existing Bloom filters during write.
  2020-02-05 22:56 ` [PATCH v2 00/11] " Garima Singh via GitGitGadget
                     ` (6 preceding siblings ...)
  2020-02-05 22:56   ` [PATCH v2 07/11] commit-graph: write Bloom filters to commit graph file Garima Singh via GitGitGadget
@ 2020-02-05 22:56   ` Garima Singh via GitGitGadget
  2020-02-20 18:48     ` Jakub Narebski
  2020-02-05 22:56   ` [PATCH v2 09/11] commit-graph: add --changed-paths option to write subcommand Garima Singh via GitGitGadget
                     ` (5 subsequent siblings)
  13 siblings, 1 reply; 150+ messages in thread
From: Garima Singh via GitGitGadget @ 2020-02-05 22:56 UTC (permalink / raw)
  To: git
  Cc: stolee, szeder.dev, jonathantanmy, jeffhost, me, peff,
	garimasigit, jnareb, christian.couder, emilyshaffer, gitster,
	Garima Singh, Garima Singh

From: Garima Singh <garima.singh@microsoft.com>

Read previously computed Bloom filters from the commit-graph file if
possible to avoid recomputing during commit-graph write.

See Documentation/technical/commit-graph-format for the format in which
the Bloom filter information is written to the commit graph file.

To read Bloom filter for a given commit with lexicographic position
'i' we need to:
1. Read BIDX[i] which essentially gives us the starting index in BDAT for
   filter of commit i+1. It is essentially the index past the end
   of the filter of commit i. It is called end_index in the code.

2. For i>0, read BIDX[i-1] which will give us the starting index in BDAT
   for filter of commit i. It is called the start_index in the code.
   For the first commit, where i = 0, Bloom filter data starts at the
   beginning, just past the header in the BDAT chunk. Hence, start_index
   will be 0.

3. The length of the filter will be end_index - start_index, because
   BIDX[i] gives the cumulative 8-byte words including the ith
   commit's filter.

We toggle whether Bloom filters should be recomputed based on the
compute_if_null flag.

Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
 bloom.c               | 49 ++++++++++++++++++++++++++++++++++++++++++-
 bloom.h               |  4 +++-
 commit-graph.c        |  7 ++++---
 t/helper/test-bloom.c |  2 +-
 4 files changed, 56 insertions(+), 6 deletions(-)

diff --git a/bloom.c b/bloom.c
index 818382c03b..90d84dc713 100644
--- a/bloom.c
+++ b/bloom.c
@@ -1,5 +1,7 @@
 #include "git-compat-util.h"
 #include "bloom.h"
+#include "commit.h"
+#include "commit-slab.h"
 #include "commit-graph.h"
 #include "object-store.h"
 #include "diff.h"
@@ -127,8 +129,39 @@ void add_key_to_filter(struct bloom_key *key,
 	}
 }
 
+static int load_bloom_filter_from_graph(struct commit_graph *g,
+				   struct bloom_filter *filter,
+				   struct commit *c)
+{
+	uint32_t lex_pos, start_index, end_index;
+
+	while (c->graph_pos < g->num_commits_in_base)
+		g = g->base_graph;
+
+	/* The commit graph commit 'c' lives in doesn't carry bloom filters. */
+	if (!g->chunk_bloom_indexes)
+		return 0;
+
+	lex_pos = c->graph_pos - g->num_commits_in_base;
+
+	end_index = get_be32(g->chunk_bloom_indexes + 4 * lex_pos);
+
+	if (lex_pos)
+		start_index = get_be32(g->chunk_bloom_indexes + 4 * (lex_pos - 1));
+	else
+		start_index = 0;
+
+	filter->len = end_index - start_index;
+	filter->data = (uint64_t *)(g->chunk_bloom_data +
+					sizeof(uint64_t) * start_index +
+					BLOOMDATA_CHUNK_HEADER_SIZE);
+
+	return 1;
+}
+
 struct bloom_filter *get_bloom_filter(struct repository *r,
-				      struct commit *c)
+				      struct commit *c,
+				      int compute_if_not_present)
 {
 	struct bloom_filter *filter;
 	struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
@@ -141,6 +174,20 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 
 	filter = bloom_filter_slab_at(&bloom_filters, c);
 
+	if (!filter->data) {
+		load_commit_graph_info(r, c);
+		if (c->graph_pos != COMMIT_NOT_FROM_GRAPH &&
+			r->objects->commit_graph->chunk_bloom_indexes) {
+			if (load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
+				return filter;
+			else
+				return NULL;
+		}
+	}
+
+	if (filter->data || !compute_if_not_present)
+		return filter;
+
 	repo_diff_setup(r, &diffopt);
 	diffopt.flags.recursive = 1;
 	diffopt.max_changes = max_changes;
diff --git a/bloom.h b/bloom.h
index 7f40c751f7..76f8a9ad0c 100644
--- a/bloom.h
+++ b/bloom.h
@@ -13,6 +13,7 @@ struct bloom_filter_settings {
 
 #define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 }
 #define BITS_PER_WORD 64
+#define BLOOMDATA_CHUNK_HEADER_SIZE 3*sizeof(uint32_t)
 
 /*
  * A bloom_filter struct represents a data segment to
@@ -47,7 +48,8 @@ void add_key_to_filter(struct bloom_key *key,
 					   struct bloom_filter_settings *settings);
 
 struct bloom_filter *get_bloom_filter(struct repository *r,
-				      struct commit *c);
+				      struct commit *c,
+				      int compute_if_not_present);
 
 int bloom_filter_contains(struct bloom_filter *filter,
 			  struct bloom_key *key,
diff --git a/commit-graph.c b/commit-graph.c
index 4585b3b702..c0e9834bf2 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1094,7 +1094,7 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f,
 			ctx->commits.nr);
 
 	while (list < last) {
-		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
+		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
 		cur_pos += filter->len;
 		display_progress(progress, ++i);
 		hashwrite_be32(f, cur_pos);
@@ -1123,7 +1123,7 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
 	hashwrite_be32(f, settings->bits_per_entry);
 
 	while (list < last) {
-		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
+		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
 		display_progress(progress, ++i);
 		hashwrite(f, filter->data, filter->len * sizeof(uint64_t));
 		list++;
@@ -1304,7 +1304,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 
 	for (i = 0; i < ctx->commits.nr; i++) {
 		struct commit *c = sorted_by_pos[i];
-		struct bloom_filter *filter = get_bloom_filter(ctx->r, c);
+		struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1);
 		ctx->total_bloom_filter_data_size += sizeof(uint64_t) * filter->len;
 		display_progress(progress, i + 1);
 	}
@@ -2314,6 +2314,7 @@ void free_commit_graph(struct commit_graph *g)
 		g->data = NULL;
 		close(g->graph_fd);
 	}
+	free(g->bloom_filter_settings);
 	free(g->filename);
 	free(g);
 }
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index 331957011b..9b4be97f75 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -47,7 +47,7 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
 	struct bloom_filter *filter;
 	setup_git_directory();
 	c = lookup_commit(the_repository, commit_oid);
-	filter = get_bloom_filter(the_repository, c);
+	filter = get_bloom_filter(the_repository, c, 1);
 	print_bloom_filter(filter);
 }
 
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 150+ messages in thread

* [PATCH v2 09/11] commit-graph: add --changed-paths option to write subcommand
  2020-02-05 22:56 ` [PATCH v2 00/11] " Garima Singh via GitGitGadget
                     ` (7 preceding siblings ...)
  2020-02-05 22:56   ` [PATCH v2 08/11] commit-graph: reuse existing Bloom filters during write Garima Singh via GitGitGadget
@ 2020-02-05 22:56   ` Garima Singh via GitGitGadget
  2020-02-20 20:28     ` Jakub Narebski
  2020-02-20 22:10     ` Bryan Turner
  2020-02-05 22:56   ` [PATCH v2 10/11] revision.c: use Bloom filters to speed up path based revision walks Garima Singh via GitGitGadget
                     ` (4 subsequent siblings)
  13 siblings, 2 replies; 150+ messages in thread
From: Garima Singh via GitGitGadget @ 2020-02-05 22:56 UTC (permalink / raw)
  To: git
  Cc: stolee, szeder.dev, jonathantanmy, jeffhost, me, peff,
	garimasigit, jnareb, christian.couder, emilyshaffer, gitster,
	Garima Singh, Garima Singh

From: Garima Singh <garima.singh@microsoft.com>

Add --changed-paths option to git commit-graph write. This option will
allow users to compute information about the paths that have changed
between a commit and its first parent, and write it into the commit graph
file. If the option is passed to the write subcommand we set the
COMMIT_GRAPH_WRITE_BLOOM_FILTERS flag and pass it down to the
commit-graph logic.

Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
 Documentation/git-commit-graph.txt | 5 +++++
 builtin/commit-graph.c             | 9 +++++++--
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index bcd85c1976..907d703b30 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -54,6 +54,11 @@ or `--stdin-packs`.)
 With the `--append` option, include all commits that are present in the
 existing commit-graph file.
 +
+With the `--changed-paths` option, compute and write information about the
+paths changed between a commit and it's first parent. This operation can
+take a while on large repositories. It provides significant performance gains
+for getting history of a directory or a file with `git log -- <path>`.
++
 With the `--split` option, write the commit-graph as a chain of multiple
 commit-graph files stored in `<dir>/info/commit-graphs`. The new commits
 not already in the commit-graph are added in a new "tip" file. This file
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index e0c6fc4bbf..261dcce091 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -9,7 +9,7 @@
 
 static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph verify [--object-dir <objdir>] [--shallow] [--[no-]progress]"),
-	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--[no-]progress] <split options>"),
+	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--changed-paths] [--[no-]progress] <split options>"),
 	NULL
 };
 
@@ -19,7 +19,7 @@ static const char * const builtin_commit_graph_verify_usage[] = {
 };
 
 static const char * const builtin_commit_graph_write_usage[] = {
-	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--[no-]progress] <split options>"),
+	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--changed-paths] [--[no-]progress] <split options>"),
 	NULL
 };
 
@@ -32,6 +32,7 @@ static struct opts_commit_graph {
 	int split;
 	int shallow;
 	int progress;
+	int enable_changed_paths;
 } opts;
 
 static int graph_verify(int argc, const char **argv)
@@ -110,6 +111,8 @@ static int graph_write(int argc, const char **argv)
 			N_("start walk at commits listed by stdin")),
 		OPT_BOOL(0, "append", &opts.append,
 			N_("include all commits already in the commit-graph file")),
+		OPT_BOOL(0, "changed-paths", &opts.enable_changed_paths,
+			N_("enable computation for changed paths")),
 		OPT_BOOL(0, "progress", &opts.progress, N_("force progress reporting")),
 		OPT_BOOL(0, "split", &opts.split,
 			N_("allow writing an incremental commit-graph file")),
@@ -143,6 +146,8 @@ static int graph_write(int argc, const char **argv)
 		flags |= COMMIT_GRAPH_WRITE_SPLIT;
 	if (opts.progress)
 		flags |= COMMIT_GRAPH_WRITE_PROGRESS;
+	if (opts.enable_changed_paths)
+		flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;
 
 	read_replace_refs = 0;
 
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 150+ messages in thread

* [PATCH v2 10/11] revision.c: use Bloom filters to speed up path based revision walks
  2020-02-05 22:56 ` [PATCH v2 00/11] " Garima Singh via GitGitGadget
                     ` (8 preceding siblings ...)
  2020-02-05 22:56   ` [PATCH v2 09/11] commit-graph: add --changed-paths option to write subcommand Garima Singh via GitGitGadget
@ 2020-02-05 22:56   ` Garima Singh via GitGitGadget
  2020-02-21 17:31     ` Jakub Narebski
  2020-02-21 22:45     ` Jakub Narebski
  2020-02-05 22:56   ` [PATCH v2 11/11] commit-graph: add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag Garima Singh via GitGitGadget
                     ` (3 subsequent siblings)
  13 siblings, 2 replies; 150+ messages in thread
From: Garima Singh via GitGitGadget @ 2020-02-05 22:56 UTC (permalink / raw)
  To: git
  Cc: stolee, szeder.dev, jonathantanmy, jeffhost, me, peff,
	garimasigit, jnareb, christian.couder, emilyshaffer, gitster,
	Garima Singh, Garima Singh

From: Garima Singh <garima.singh@microsoft.com>

Revision walk will now use Bloom filters for commits to speed up revision
walks for a particular path (for computing history for that path), if they
are present in the commit-graph file.

We load the Bloom filters during the prepare_revision_walk step, but only
when dealing with a single pathspec. While comparing trees in
rev_compare_trees(), if the Bloom filter says that the file is not different
between the two trees, we don't need to compute the expensive diff. This is
where we get our performance gains. The other response of the Bloom filter
is `maybe`, in which case we fall back to the full diff calculation to
determine if the path was changed in the commit.

Performance Gains:
We tested the performance of `git log -- <path>` on the git repo, the linux
and some internal large repos, with a variety of paths of varying depths.

On the git and linux repos:
- we observed a 2x to 5x speed up.

On a large internal repo with files seated 6-10 levels deep in the tree:
- we observed 10x to 20x speed ups, with some paths going up to 28 times
  faster.

Helped-by: Derrick Stolee <dstolee@microsoft.com
Helped-by: SZEDER Gábor <szeder.dev@gmail.com>
Helped-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
 revision.c                 | 124 +++++++++++++++++++++++++++++++-
 revision.h                 |  11 +++
 t/helper/test-read-graph.c |   4 ++
 t/t4216-log-bloom.sh       | 140 +++++++++++++++++++++++++++++++++++++
 4 files changed, 277 insertions(+), 2 deletions(-)
 create mode 100755 t/t4216-log-bloom.sh

diff --git a/revision.c b/revision.c
index 8136929e23..d1622afa17 100644
--- a/revision.c
+++ b/revision.c
@@ -29,6 +29,8 @@
 #include "prio-queue.h"
 #include "hashmap.h"
 #include "utf8.h"
+#include "bloom.h"
+#include "json-writer.h"
 
 volatile show_early_output_fn_t show_early_output;
 
@@ -624,11 +626,114 @@ static void file_change(struct diff_options *options,
 	options->flags.has_changes = 1;
 }
 
+static int bloom_filter_atexit_registered;
+static unsigned int count_bloom_filter_maybe;
+static unsigned int count_bloom_filter_definitely_not;
+static unsigned int count_bloom_filter_false_positive;
+static unsigned int count_bloom_filter_not_present;
+static unsigned int count_bloom_filter_length_zero;
+
+static void trace2_bloom_filter_statistics_atexit(void)
+{
+	struct json_writer jw = JSON_WRITER_INIT;
+
+	jw_object_begin(&jw, 0);
+	jw_object_intmax(&jw, "filter_not_present", count_bloom_filter_not_present);
+	jw_object_intmax(&jw, "zero_length_filter", count_bloom_filter_length_zero);
+	jw_object_intmax(&jw, "maybe", count_bloom_filter_maybe);
+	jw_object_intmax(&jw, "definitely_not", count_bloom_filter_definitely_not);
+	jw_end(&jw);
+
+	trace2_data_json("bloom", the_repository, "statistics", &jw);
+
+	jw_release(&jw);
+}
+
+static void prepare_to_use_bloom_filter(struct rev_info *revs)
+{
+	struct pathspec_item *pi;
+	char *path_alloc = NULL;
+	const char *path;
+	int last_index;
+	int len;
+
+	if (!revs->commits)
+	    return;
+
+	repo_parse_commit(revs->repo, revs->commits->item);
+
+	if (!revs->repo->objects->commit_graph)
+		return;
+
+	revs->bloom_filter_settings = revs->repo->objects->commit_graph->bloom_filter_settings;
+	if (!revs->bloom_filter_settings)
+		return;
+
+	pi = &revs->pruning.pathspec.items[0];
+	last_index = pi->len - 1;
+
+	if (pi->match[last_index] == '/') {
+	    path_alloc = xstrdup(pi->match);
+	    path_alloc[last_index] = '\0';
+	    path = path_alloc;
+	} else
+	    path = pi->match;
+
+	len = strlen(path);
+
+	revs->bloom_key = xmalloc(sizeof(struct bloom_key));
+	fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);
+
+	if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
+		atexit(trace2_bloom_filter_statistics_atexit);
+		bloom_filter_atexit_registered = 1;
+	}
+
+	free(path_alloc);
+}
+
+static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
+						 struct commit *commit)
+{
+	struct bloom_filter *filter;
+	int result;
+
+	if (!revs->repo->objects->commit_graph)
+		return -1;
+
+	if (commit->generation == GENERATION_NUMBER_INFINITY)
+		return -1;
+
+	filter = get_bloom_filter(revs->repo, commit, 0);
+
+	if (!filter) {
+		count_bloom_filter_not_present++;
+		return -1;
+	}
+
+	if (!filter->len) {
+		count_bloom_filter_length_zero++;
+		return -1;
+	}
+
+	result = bloom_filter_contains(filter,
+				       revs->bloom_key,
+				       revs->bloom_filter_settings);
+
+	if (result)
+		count_bloom_filter_maybe++;
+	else
+		count_bloom_filter_definitely_not++;
+
+	return result;
+}
+
 static int rev_compare_tree(struct rev_info *revs,
-			    struct commit *parent, struct commit *commit)
+			    struct commit *parent, struct commit *commit, int nth_parent)
 {
 	struct tree *t1 = get_commit_tree(parent);
 	struct tree *t2 = get_commit_tree(commit);
+	int bloom_ret = 1;
 
 	if (!t1)
 		return REV_TREE_NEW;
@@ -653,11 +758,23 @@ static int rev_compare_tree(struct rev_info *revs,
 			return REV_TREE_SAME;
 	}
 
+	if (revs->pruning.pathspec.nr == 1 && !revs->reflog_info && !nth_parent) {
+		bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
+
+		if (bloom_ret == 0)
+			return REV_TREE_SAME;
+	}
+
 	tree_difference = REV_TREE_SAME;
 	revs->pruning.flags.has_changes = 0;
 	if (diff_tree_oid(&t1->object.oid, &t2->object.oid, "",
 			   &revs->pruning) < 0)
 		return REV_TREE_DIFFERENT;
+
+	if (!nth_parent)
+		if (bloom_ret == 1 && tree_difference == REV_TREE_SAME)
+			count_bloom_filter_false_positive++;
+
 	return tree_difference;
 }
 
@@ -855,7 +972,7 @@ static void try_to_simplify_commit(struct rev_info *revs, struct commit *commit)
 			die("cannot simplify commit %s (because of %s)",
 			    oid_to_hex(&commit->object.oid),
 			    oid_to_hex(&p->object.oid));
-		switch (rev_compare_tree(revs, p, commit)) {
+		switch (rev_compare_tree(revs, p, commit, nth_parent)) {
 		case REV_TREE_SAME:
 			if (!revs->simplify_history || !relevant_commit(p)) {
 				/* Even if a merge with an uninteresting
@@ -3362,6 +3479,8 @@ int prepare_revision_walk(struct rev_info *revs)
 				       FOR_EACH_OBJECT_PROMISOR_ONLY);
 	}
 
+	if (revs->pruning.pathspec.nr == 1 && !revs->reflog_info)
+		prepare_to_use_bloom_filter(revs);
 	if (revs->no_walk != REVISION_WALK_NO_WALK_UNSORTED)
 		commit_list_sort_by_date(&revs->commits);
 	if (revs->no_walk)
@@ -3379,6 +3498,7 @@ int prepare_revision_walk(struct rev_info *revs)
 		simplify_merges(revs);
 	if (revs->children.name)
 		set_children(revs);
+
 	return 0;
 }
 
diff --git a/revision.h b/revision.h
index 475f048fb6..7c026fe41f 100644
--- a/revision.h
+++ b/revision.h
@@ -56,6 +56,8 @@ struct repository;
 struct rev_info;
 struct string_list;
 struct saved_parents;
+struct bloom_key;
+struct bloom_filter_settings;
 define_shared_commit_slab(revision_sources, char *);
 
 struct rev_cmdline_info {
@@ -291,6 +293,15 @@ struct rev_info {
 	struct revision_sources *sources;
 
 	struct topo_walk_info *topo_walk_info;
+
+	/* Commit graph bloom filter fields */
+	/* The bloom filter key for the pathspec */
+	struct bloom_key *bloom_key;
+	/*
+	 * The bloom filter settings used to generate the key.
+	 * This is loaded from the commit-graph being used.
+	 */
+	struct bloom_filter_settings *bloom_filter_settings;
 };
 
 int ref_excluded(struct string_list *, const char *path);
diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
index d2884efe0a..aff597c7a3 100644
--- a/t/helper/test-read-graph.c
+++ b/t/helper/test-read-graph.c
@@ -45,6 +45,10 @@ int cmd__read_graph(int argc, const char **argv)
 		printf(" commit_metadata");
 	if (graph->chunk_extra_edges)
 		printf(" extra_edges");
+	if (graph->chunk_bloom_indexes)
+		printf(" bloom_indexes");
+	if (graph->chunk_bloom_data)
+		printf(" bloom_data");
 	printf("\n");
 
 	UNLEAK(graph);
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
new file mode 100755
index 0000000000..19eca1864b
--- /dev/null
+++ b/t/t4216-log-bloom.sh
@@ -0,0 +1,140 @@
+#!/bin/sh
+
+test_description='git log for a path with bloom filters'
+. ./test-lib.sh
+
+test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
+	git init &&
+	mkdir A A/B A/B/C &&
+	test_commit c1 A/file1 &&
+	test_commit c2 A/B/file2 &&
+	test_commit c3 A/B/C/file3 &&
+	test_commit c4 A/file1 &&
+	test_commit c5 A/B/file2 &&
+	test_commit c6 A/B/C/file3 &&
+	test_commit c7 A/file1 &&
+	test_commit c8 A/B/file2 &&
+	test_commit c9 A/B/C/file3 &&
+	git checkout -b side HEAD~4 &&
+	test_commit side-1 file4 &&
+	git checkout master &&
+	git merge side &&
+	test_commit c10 file5 &&
+	mv file5 file5_renamed &&
+	git add file5_renamed &&
+	git commit -m "rename" &&
+	git commit-graph write --reachable --changed-paths
+'
+graph_read_expect() {
+	OPTIONAL=""
+	NUM_CHUNKS=5
+	cat >expect <<- EOF
+	header: 43475048 1 1 $NUM_CHUNKS 0
+	num_commits: $1
+	chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data
+	EOF
+	test-tool read-graph >output &&
+	test_cmp expect output
+}
+
+test_expect_success 'commit-graph write wrote out the bloom chunks' '
+	graph_read_expect 13
+'
+
+setup() {
+	rm output
+	rm "$TRASH_DIRECTORY/trace.perf"
+	git -c core.commitGraph=false log --pretty="format:%s" $1 >log_wo_bloom
+	GIT_TRACE2_PERF="$TRASH_DIRECTORY/trace.perf" git -c core.commitGraph=true log --pretty="format:%s" $1 >log_w_bloom
+}
+
+test_bloom_filters_used() {
+	log_args=$1
+	bloom_trace_prefix="statistics:{\"filter_not_present\":0,\"zero_length_filter\":0,\"maybe\""
+	setup "$log_args"
+	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" && test_cmp log_wo_bloom log_w_bloom
+}
+
+test_bloom_filters_not_used() {
+	log_args=$1
+	setup "$log_args"
+	!(grep -q "statistics:{\"filter_not_present\":" "$TRASH_DIRECTORY/trace.perf") && test_cmp log_wo_bloom log_w_bloom
+}
+
+for path in A A/B A/B/C A/file1 A/B/file2 A/B/C/file3 file4 file5_renamed
+do
+	for option in "" \
+		      "--full-history" \
+		      "--full-history --simplify-merges" \
+		      "--simplify-merges" \
+		      "--simplify-by-decoration" \
+		      "--follow" \
+		      "--first-parent" \
+		      "--topo-order" \
+		      "--date-order" \
+		      "--author-date-order" \
+		      "--ancestry-path side..master"
+	do
+		test_expect_success "git log option: $option for path: $path" '
+			test_bloom_filters_used "$option -- $path"
+		'
+	done
+done
+
+test_expect_success 'git log -- folder works with and without the trailing slash' '
+	test_bloom_filters_used "-- A" &&
+	test_bloom_filters_used "-- A/"
+'
+
+test_expect_success 'git log for path that does not exist. ' '
+	test_bloom_filters_used "-- path_does_not_exist"
+'
+
+test_expect_success 'git log with --walk-reflogs does not use bloom filters' '
+	test_bloom_filters_not_used "--walk-reflogs -- A"
+'
+
+test_expect_success 'git log -- multiple path specs does not use bloom filters' '
+	test_bloom_filters_not_used "-- file4 A/file1"
+'
+
+test_expect_success 'git log with wildcard that resolves to a single path uses bloom filters' '
+	test_bloom_filters_used "-- *4" &&
+	test_bloom_filters_used "-- *renamed"
+'
+
+test_expect_success 'git log with wildcard that resolves to a multiple paths does not uses bloom filters' '
+	test_bloom_filters_not_used "-- *" &&
+	test_bloom_filters_not_used "-- file*"
+'
+
+test_expect_success 'setup - add commit-graph to the chain without bloom filters' '
+	test_commit c14 A/anotherFile2 &&
+	test_commit c15 A/B/anotherFile2 &&
+	test_commit c16 A/B/C/anotherFile2 &&
+	GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0 git commit-graph write --reachable --split &&
+	test_line_count = 2 .git/objects/info/commit-graphs/commit-graph-chain
+'
+
+test_expect_success 'git log does not use bloom filters if the latest graph does not have bloom filters.' '
+	test_bloom_filters_not_used "-- A/B"
+'
+
+test_expect_success 'setup - add commit-graph to the chain with bloom filters' '
+	test_commit c17 A/anotherFile3 &&
+	git commit-graph write --reachable --changed-paths --split &&
+	test_line_count = 3 .git/objects/info/commit-graphs/commit-graph-chain
+'
+
+test_bloom_filters_used_when_some_filters_are_missing() {
+	log_args=$1
+	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":6,\"definitely_not\":6"
+	setup "$log_args"
+	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" && test_cmp log_wo_bloom log_w_bloom
+}
+
+test_expect_success 'git log uses bloom filters if they exist in the latest but not all commit graphs in the chain.' '
+	test_bloom_filters_used_when_some_filters_are_missing "-- A/B"
+'
+
+test_done
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 150+ messages in thread

* [PATCH v2 11/11] commit-graph: add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag
  2020-02-05 22:56 ` [PATCH v2 00/11] " Garima Singh via GitGitGadget
                     ` (9 preceding siblings ...)
  2020-02-05 22:56   ` [PATCH v2 10/11] revision.c: use Bloom filters to speed up path based revision walks Garima Singh via GitGitGadget
@ 2020-02-05 22:56   ` Garima Singh via GitGitGadget
  2020-02-22  0:11     ` Jakub Narebski
  2020-02-07 13:52   ` [PATCH v2 00/11] Changed Paths Bloom Filters SZEDER Gábor
                     ` (2 subsequent siblings)
  13 siblings, 1 reply; 150+ messages in thread
From: Garima Singh via GitGitGadget @ 2020-02-05 22:56 UTC (permalink / raw)
  To: git
  Cc: stolee, szeder.dev, jonathantanmy, jeffhost, me, peff,
	garimasigit, jnareb, christian.couder, emilyshaffer, gitster,
	Garima Singh, Garima Singh

From: Garima Singh <garima.singh@microsoft.com>

Add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag to the test setup suite
in order to toggle writing Bloom filters when running any of the git tests.
If set to true, we will compute and write Bloom filters every time a test
calls `git commit-graph write`, as if the `--changed-paths` option was
passed in.

The test suite passes when GIT_TEST_COMMIT_GRAPH and
GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS are enabled.

Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
 builtin/commit-graph.c        | 3 ++-
 ci/run-build-and-tests.sh     | 1 +
 commit-graph.h                | 1 +
 t/README                      | 5 +++++
 t/t4216-log-bloom.sh          | 3 +++
 t/t5318-commit-graph.sh       | 2 ++
 t/t5324-split-commit-graph.sh | 1 +
 7 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 261dcce091..fc9b234ab0 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -146,7 +146,8 @@ static int graph_write(int argc, const char **argv)
 		flags |= COMMIT_GRAPH_WRITE_SPLIT;
 	if (opts.progress)
 		flags |= COMMIT_GRAPH_WRITE_PROGRESS;
-	if (opts.enable_changed_paths)
+	if (opts.enable_changed_paths ||
+	    git_env_bool(GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS, 0))
 		flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;
 
 	read_replace_refs = 0;
diff --git a/ci/run-build-and-tests.sh b/ci/run-build-and-tests.sh
index ff0ef7f08e..7b4857651d 100755
--- a/ci/run-build-and-tests.sh
+++ b/ci/run-build-and-tests.sh
@@ -19,6 +19,7 @@ linux-gcc)
 	export GIT_TEST_OE_SIZE=10
 	export GIT_TEST_OE_DELTA_SIZE=5
 	export GIT_TEST_COMMIT_GRAPH=1
+	export GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=1
 	export GIT_TEST_MULTI_PACK_INDEX=1
 	make test
 	;;
diff --git a/commit-graph.h b/commit-graph.h
index 25fefefb3e..4c202ff3d7 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -8,6 +8,7 @@
 
 #define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
 #define GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD "GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD"
+#define GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS "GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS"
 
 struct commit;
 struct bloom_filter_settings;
diff --git a/t/README b/t/README
index caa125ba9a..be2f7d7fd2 100644
--- a/t/README
+++ b/t/README
@@ -378,6 +378,11 @@ GIT_TEST_COMMIT_GRAPH=<boolean>, when true, forces the commit-graph to
 be written after every 'git commit' command, and overrides the
 'core.commitGraph' setting to true.
 
+GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=<boolean>, when true, forces
+commit-graph write to compute and write changed path Bloom filters for
+every 'git commit-graph write', as if the `--changed-paths` option was
+passed in.
+
 GIT_TEST_FSMONITOR=$PWD/t7519/fsmonitor-all exercises the fsmonitor
 code path for utilizing a file system monitor to speed up detecting
 new or changed files.
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 19eca1864b..7acebb3962 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -3,6 +3,9 @@
 test_description='git log for a path with bloom filters'
 . ./test-lib.sh
 
+GIT_TEST_COMMIT_GRAPH=0
+GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0
+
 test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
 	git init &&
 	mkdir A A/B A/B/C &&
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 3f03de6018..973020be2d 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -3,6 +3,8 @@
 test_description='commit graph'
 . ./test-lib.sh
 
+GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0
+
 test_expect_success 'setup full repo' '
 	mkdir full &&
 	cd "$TRASH_DIRECTORY/full" &&
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index c24823431f..9235db4561 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -4,6 +4,7 @@ test_description='split commit graph'
 . ./test-lib.sh
 
 GIT_TEST_COMMIT_GRAPH=0
+GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0
 
 test_expect_success 'setup repo' '
 	git init &&
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH v2 00/11] Changed Paths Bloom Filters
  2020-02-05 22:56 ` [PATCH v2 00/11] " Garima Singh via GitGitGadget
                     ` (10 preceding siblings ...)
  2020-02-05 22:56   ` [PATCH v2 11/11] commit-graph: add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag Garima Singh via GitGitGadget
@ 2020-02-07 13:52   ` SZEDER Gábor
  2020-02-07 15:09     ` Garima Singh
  2020-02-08 23:04   ` Jakub Narebski
  2020-03-30  0:31   ` [PATCH v3 00/16] " Garima Singh via GitGitGadget
  13 siblings, 1 reply; 150+ messages in thread
From: SZEDER Gábor @ 2020-02-07 13:52 UTC (permalink / raw)
  To: Garima Singh via GitGitGadget
  Cc: git, stolee, jonathantanmy, jeffhost, me, peff, garimasigit,
	jnareb, christian.couder, emilyshaffer, gitster, Garima Singh

On Wed, Feb 05, 2020 at 10:56:19PM +0000, Garima Singh via GitGitGadget wrote:
> Hey! 
> 
> The commit graph feature brought in a lot of performance improvements across
> multiple commands. However, file based history continues to be a performance
> pain point, especially in large repositories. 
> 
> Adopting changed path bloom filters has been discussed on the list before,
> and a prototype version was worked on by SZEDER Gábor, Jonathan Tan and Dr.
> Derrick Stolee [1]. This series is based on Dr. Stolee's proof of concept in
> [2]
> 
> Performance Gains: We tested the performance of git log -- path on the git
> repo, the linux repo and some internal large repos, with a variety of paths
> of varying depths.
> 
> On the git and linux repos: We observed a 2x to 5x speed up.
> 
> On a large internal repo with files seated 6-10 levels deep in the tree: We
> observed 10x to 20x speed ups, with some paths going up to 28 times faster.
> 
> Future Work (not included in the scope of this series):
> 
>  1. Supporting multiple path based revision walk
>  2. Adopting it in git blame logic. 
>  3. Interactions with line log git log -L
> 
> 
> ----------------------------------------------------------------------------
> 
> Updates since the last submission
> 
>  * Removed all the RFC callouts, this is a ready for full review version

Don't know when I'll find enough time to properly review the series.
maybe someday...

>  * Added unit tests for the bloom filter computation layer

This fails on big endian, e.g. in Travis CI's s390x build:

  https://travis-ci.org/szeder/git-cooking-topics-for-travis-ci/jobs/647253022#L2210

(The link highlights the failure, but I'm afraid your browser won't
jump there right away; you'll have to click on the print-test-failures
fold at the bottom, and scroll down a bit...)

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH v2 00/11] Changed Paths Bloom Filters
  2020-02-07 13:52   ` [PATCH v2 00/11] Changed Paths Bloom Filters SZEDER Gábor
@ 2020-02-07 15:09     ` Garima Singh
  2020-02-07 15:36       ` Derrick Stolee
  2020-02-11 19:08       ` Garima Singh
  0 siblings, 2 replies; 150+ messages in thread
From: Garima Singh @ 2020-02-07 15:09 UTC (permalink / raw)
  To: SZEDER Gábor, Garima Singh via GitGitGadget
  Cc: git, stolee, jonathantanmy, jeffhost, me, peff, jnareb,
	christian.couder, emilyshaffer, gitster, Garima Singh


On 2/7/2020 8:52 AM, SZEDER Gábor wrote:
>>  * Added unit tests for the bloom filter computation layer
> 
> This fails on big endian, e.g. in Travis CI's s390x build:
> 
>   https://travis-ci.org/szeder/git-cooking-topics-for-travis-ci/jobs/647253022#L2210
> 
> (The link highlights the failure, but I'm afraid your browser won't
> jump there right away; you'll have to click on the print-test-failures
> fold at the bottom, and scroll down a bit...)
> 

Thank you so much for running this pipeline and pointing out the error!

We will carefully review our interactions with the binary data and 
hopefully solve this in the next version. 

Cheers!
Garima Singh

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH v2 00/11] Changed Paths Bloom Filters
  2020-02-07 15:09     ` Garima Singh
@ 2020-02-07 15:36       ` Derrick Stolee
  2020-02-07 16:15         ` SZEDER Gábor
  2020-02-11 19:08       ` Garima Singh
  1 sibling, 1 reply; 150+ messages in thread
From: Derrick Stolee @ 2020-02-07 15:36 UTC (permalink / raw)
  To: Garima Singh, SZEDER Gábor, Garima Singh via GitGitGadget
  Cc: git, jonathantanmy, jeffhost, me, peff, jnareb, christian.couder,
	emilyshaffer, gitster, Garima Singh

On 2/7/2020 10:09 AM, Garima Singh wrote:
> 
> On 2/7/2020 8:52 AM, SZEDER Gábor wrote:
>>>  * Added unit tests for the bloom filter computation layer
>>
>> This fails on big endian, e.g. in Travis CI's s390x build:
>>
>>   https://travis-ci.org/szeder/git-cooking-topics-for-travis-ci/jobs/647253022#L2210
>>
>> (The link highlights the failure, but I'm afraid your browser won't
>> jump there right away; you'll have to click on the print-test-failures
>> fold at the bottom, and scroll down a bit...)
>>
> 
> Thank you so much for running this pipeline and pointing out the error!
> 
> We will carefully review our interactions with the binary data and 
> hopefully solve this in the next version. 

Szeder,

Thanks so much for running this test. We don't have access to a big endian
machine right now, so could you please apply this patch and re-run your tests?

The issue is described in the message below, and Garima is working to ensure
the handling of the filter data is clarified in the next version.

This is an issue from WAY back in the original prototype, and it highlights
that we've never been writing the data in network-byte order. This is completely
my fault.

Thanks,
-Stolee


-->8--

From c1067db5d618b2dae430dfe373a11c771517da9e Mon Sep 17 00:00:00 2001
From: Derrick Stolee <dstolee@microsoft.com>
Date: Fri, 7 Feb 2020 10:24:05 -0500
Subject: [PATCH] fixup! bloom: core Bloom filter implementation for changed
 paths

The 'data' field of 'struct bloom_filter' can point to a memory location
(when computing one before writing to the commit-graph) or a memmap()'d
file location (when reading from the Bloom data chunk of the commit-graph
file). This means that the memory representation may be backwards in
Little Endian or Big Endian machines.

Always write and read bits from 'filter->data' using network order. This
allows us to avoid loading the data streams from the file into memory
buffers.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 bloom.c               | 6 ++++--
 t/helper/test-bloom.c | 2 +-
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/bloom.c b/bloom.c
index 90d84dc713..aa6896584b 100644
--- a/bloom.c
+++ b/bloom.c
@@ -124,8 +124,9 @@ void add_key_to_filter(struct bloom_key *key,
 	for (i = 0; i < settings->num_hashes; i++) {
 		uint64_t hash_mod = key->hashes[i] % mod;
 		uint64_t block_pos = hash_mod / BITS_PER_WORD;
+		uint64_t bit = get_bitmask(hash_mod);
 
-		filter->data[block_pos] |= get_bitmask(hash_mod);
+		filter->data[block_pos] |= htonll(bit);
 	}
 }
 
@@ -269,7 +270,8 @@ int bloom_filter_contains(struct bloom_filter *filter,
 	for (i = 0; i < settings->num_hashes; i++) {
 		uint64_t hash_mod = key->hashes[i] % mod;
 		uint64_t block_pos = hash_mod / BITS_PER_WORD;
-		if (!(filter->data[block_pos] & get_bitmask(hash_mod)))
+		uint64_t bit = get_bitmask(hash_mod);
+		if (!(filter->data[block_pos] & htonll(bit)))
 			return 0;
 	}
 
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index 9b4be97f75..09b2bb0a00 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -23,7 +23,7 @@ static void print_bloom_filter(struct bloom_filter *filter) {
 	printf("Filter_Length:%d\n", filter->len);
 	printf("Filter_Data:");
 	for (i = 0; i < filter->len; i++){
-		printf("%"PRIx64"|", filter->data[i]);
+		printf("%"PRIx64"|", ntohll(filter->data[i]));
 	}
 	printf("\n");
 }
-- 
2.25.0.vfs.1.1.1.g9906319d24.dirty




^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH v2 00/11] Changed Paths Bloom Filters
  2020-02-07 15:36       ` Derrick Stolee
@ 2020-02-07 16:15         ` SZEDER Gábor
  2020-02-07 16:33           ` Derrick Stolee
  0 siblings, 1 reply; 150+ messages in thread
From: SZEDER Gábor @ 2020-02-07 16:15 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Garima Singh, Garima Singh via GitGitGadget, git, jonathantanmy,
	jeffhost, me, peff, jnareb, christian.couder, emilyshaffer,
	gitster, Garima Singh

On Fri, Feb 07, 2020 at 10:36:58AM -0500, Derrick Stolee wrote:
> On 2/7/2020 10:09 AM, Garima Singh wrote:
> > 
> > On 2/7/2020 8:52 AM, SZEDER Gábor wrote:
> >>>  * Added unit tests for the bloom filter computation layer
> >>
> >> This fails on big endian, e.g. in Travis CI's s390x build:
> >>
> >>   https://travis-ci.org/szeder/git-cooking-topics-for-travis-ci/jobs/647253022#L2210
> >>
> >> (The link highlights the failure, but I'm afraid your browser won't
> >> jump there right away; you'll have to click on the print-test-failures
> >> fold at the bottom, and scroll down a bit...)
> >>
> > 
> > Thank you so much for running this pipeline and pointing out the error!
> > 
> > We will carefully review our interactions with the binary data and 
> > hopefully solve this in the next version. 
> 
> Szeder,
> 
> Thanks so much for running this test. We don't have access to a big endian
> machine right now, so could you please apply this patch and re-run your tests?

Unfortunately, it still failed:

  https://travis-ci.org/szeder/git-cooking-topics-for-travis-ci/jobs/647395554#L2204

> The issue is described in the message below, and Garima is working to ensure
> the handling of the filter data is clarified in the next version.
> 
> This is an issue from WAY back in the original prototype, and it highlights
> that we've never been writing the data in network-byte order. This is completely
> my fault.
> 
> Thanks,
> -Stolee
> 
> 
> -->8--
> 
> From c1067db5d618b2dae430dfe373a11c771517da9e Mon Sep 17 00:00:00 2001
> From: Derrick Stolee <dstolee@microsoft.com>
> Date: Fri, 7 Feb 2020 10:24:05 -0500
> Subject: [PATCH] fixup! bloom: core Bloom filter implementation for changed
>  paths
> 
> The 'data' field of 'struct bloom_filter' can point to a memory location
> (when computing one before writing to the commit-graph) or a memmap()'d
> file location (when reading from the Bloom data chunk of the commit-graph
> file). This means that the memory representation may be backwards in
> Little Endian or Big Endian machines.
> 
> Always write and read bits from 'filter->data' using network order. This
> allows us to avoid loading the data streams from the file into memory
> buffers.
> 
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  bloom.c               | 6 ++++--
>  t/helper/test-bloom.c | 2 +-
>  2 files changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/bloom.c b/bloom.c
> index 90d84dc713..aa6896584b 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -124,8 +124,9 @@ void add_key_to_filter(struct bloom_key *key,
>  	for (i = 0; i < settings->num_hashes; i++) {
>  		uint64_t hash_mod = key->hashes[i] % mod;
>  		uint64_t block_pos = hash_mod / BITS_PER_WORD;
> +		uint64_t bit = get_bitmask(hash_mod);
>  
> -		filter->data[block_pos] |= get_bitmask(hash_mod);
> +		filter->data[block_pos] |= htonll(bit);
>  	}
>  }
>  
> @@ -269,7 +270,8 @@ int bloom_filter_contains(struct bloom_filter *filter,
>  	for (i = 0; i < settings->num_hashes; i++) {
>  		uint64_t hash_mod = key->hashes[i] % mod;
>  		uint64_t block_pos = hash_mod / BITS_PER_WORD;
> -		if (!(filter->data[block_pos] & get_bitmask(hash_mod)))
> +		uint64_t bit = get_bitmask(hash_mod);
> +		if (!(filter->data[block_pos] & htonll(bit)))
>  			return 0;
>  	}
>  
> diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
> index 9b4be97f75..09b2bb0a00 100644
> --- a/t/helper/test-bloom.c
> +++ b/t/helper/test-bloom.c
> @@ -23,7 +23,7 @@ static void print_bloom_filter(struct bloom_filter *filter) {
>  	printf("Filter_Length:%d\n", filter->len);
>  	printf("Filter_Data:");
>  	for (i = 0; i < filter->len; i++){
> -		printf("%"PRIx64"|", filter->data[i]);
> +		printf("%"PRIx64"|", ntohll(filter->data[i]));
>  	}
>  	printf("\n");
>  }
> -- 
> 2.25.0.vfs.1.1.1.g9906319d24.dirty
> 
> 
> 

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH v2 00/11] Changed Paths Bloom Filters
  2020-02-07 16:15         ` SZEDER Gábor
@ 2020-02-07 16:33           ` Derrick Stolee
  0 siblings, 0 replies; 150+ messages in thread
From: Derrick Stolee @ 2020-02-07 16:33 UTC (permalink / raw)
  To: SZEDER Gábor
  Cc: Garima Singh, Garima Singh via GitGitGadget, git, jonathantanmy,
	jeffhost, me, peff, jnareb, christian.couder, emilyshaffer,
	gitster, Garima Singh

On 2/7/2020 11:15 AM, SZEDER Gábor wrote:
> On Fri, Feb 07, 2020 at 10:36:58AM -0500, Derrick Stolee wrote:
>> On 2/7/2020 10:09 AM, Garima Singh wrote:
>>>
>>> On 2/7/2020 8:52 AM, SZEDER Gábor wrote:
>>>>>  * Added unit tests for the bloom filter computation layer
>>>>
>>>> This fails on big endian, e.g. in Travis CI's s390x build:
>>>>
>>>>   https://travis-ci.org/szeder/git-cooking-topics-for-travis-ci/jobs/647253022#L2210
>>>>
>>>> (The link highlights the failure, but I'm afraid your browser won't
>>>> jump there right away; you'll have to click on the print-test-failures
>>>> fold at the bottom, and scroll down a bit...)
>>>>
>>>
>>> Thank you so much for running this pipeline and pointing out the error!
>>>
>>> We will carefully review our interactions with the binary data and 
>>> hopefully solve this in the next version. 
>>
>> Szeder,
>>
>> Thanks so much for running this test. We don't have access to a big endian
>> machine right now, so could you please apply this patch and re-run your tests?
> 
> Unfortunately, it still failed:
> 
>   https://travis-ci.org/szeder/git-cooking-topics-for-travis-ci/jobs/647395554#L2204

Thanks! Both fail on test 2 of t0095-bloom.sh, which includes this
expected output line:

	Filter_Data:508928809087080a|8a7648210804001|4089824400951000|841ab310098051a8|

We may not be properly adjusting the output in the test-helper.

I still think the fixup patch I included is a good idea, but Garima
continues to dig into the problem from all angles to understand this
failure and the full fix.

-Stolee


^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH v2 00/11] Changed Paths Bloom Filters
  2020-02-05 22:56 ` [PATCH v2 00/11] " Garima Singh via GitGitGadget
                     ` (11 preceding siblings ...)
  2020-02-07 13:52   ` [PATCH v2 00/11] Changed Paths Bloom Filters SZEDER Gábor
@ 2020-02-08 23:04   ` Jakub Narebski
  2020-02-21 17:41     ` Garima Singh
  2020-03-30  0:31   ` [PATCH v3 00/16] " Garima Singh via GitGitGadget
  13 siblings, 1 reply; 150+ messages in thread
From: Jakub Narebski @ 2020-02-08 23:04 UTC (permalink / raw)
  To: Garima Singh via GitGitGadget
  Cc: git, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
	Jeff Hostetler, Taylor Blau, Jeff King, Garima Singh,
	Christian Couder, Emily Shaffer, Junio C Hamano, Garima Singh

"Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:

> Hey! 
>
> The commit graph feature brought in a lot of performance improvements across
> multiple commands. However, file based history continues to be a performance
> pain point, especially in large repositories. 
>
> Adopting changed path Bloom filters has been discussed on the list before,
> and a prototype version was worked on by SZEDER Gábor, Jonathan Tan and Dr.
> Derrick Stolee [1]. This series is based on Dr. Stolee's proof of
> concept in [2].

Sidenote: I wondered why it did use MurmurHash3 (64-bit version), which
requires adding its implementation, instead of reusing FNV-1 hash
(Fowler–Noll–Vo hash function) used by Git hashmap implementation, see
https://github.com/git/git/blob/228f53135a4a41a37b6be8e4d6e2b6153db4a8ed/hashmap.h#L109
Beside the fact that everyone is using MurmurHash for Bloom filters ;-)

It turns out that in various benchmark MurmurHash is faster and also
slightly better as a hash than FNV-1 or FNV-1b.


I wonder then if it would be a good idea (in the future) to make it easy
to use hashmap with MurmurHash3 instead of FNV-1, or maybe to even make
it the default for hashing strings.

>
> Performance Gains: We tested the performance of git log -- path on the git
> repo, the linux repo and some internal large repos, with a variety of paths
> of varying depths.

As I wrote in reply to previous version of this series, a good public
repository (and thus being able to use by anyone) to test the Bloom
filter performance improvements could be AOSP (Android) base:

  https://android.googlesource.com/platform/frameworks/base/

which is a large repository with long path depths (due to Java file
naming conventions).

>
> On the git and linux repos: We observed a 2x to 5x speed up.
>
> On a large internal repo with files seated 6-10 levels deep in the tree: We
> observed 10x to 20x speed ups, with some paths going up to 28 times faster.

Very nice! Good work!

What is the cost of this feature, that is how long it takes to generate
Bloom filters, and how much larger commit-graph file gets?  It would be
nice to know.

>
> Future Work (not included in the scope of this series):
>
>  1. Supporting multiple path based revision walk

Shouldn't then tests that were added in v2 mark use of Bloom filters
with multiple paths revision walking as _not working *yet*_
(test_expect_failure), and not expected to not work (test_expect_success
with test_bloom_filters_not_used)?

>  2. Adopting it in git blame logic. 
>  3. Interactions with line log git log -L
>
>
> ----------------------------------------------------------------------------
>
> Updates since the last submission
>
>  * Removed all the RFC callouts, this is a ready for full review version
>  * Added unit tests for the bloom filter computation layer
>  * Added more evolved functional tests for git log
>  * Fixed a lot of the bugs found by the tests
>  * Reacted to other miscellaneous feedback on the RFC series. 
>
> Cheers! Garima Singh
>
> [1] https://lore.kernel.org/git/20181009193445.21908-1-szeder.dev@gmail.com/
> [2] https://lore.kernel.org/git/61559c5b-546e-d61b-d2e1-68de692f5972@gmail.com/
>
> Derrick Stolee (2):
>   diff: halt tree-diff early after max_changes
>   commit-graph: examine commits by generation number
>
> Garima Singh (8):
>   commit-graph: use MAX_NUM_CHUNKS
>   bloom: core Bloom filter implementation for changed paths
>   commit-graph: compute Bloom filters for changed paths
>   commit-graph: write Bloom filters to commit graph file
>   commit-graph: reuse existing Bloom filters during write.
>   commit-graph: add --changed-paths option to write subcommand
>   revision.c: use Bloom filters to speed up path based revision walks
>   commit-graph: add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag
>
> Jeff King (1):
>   commit-graph: examine changed-path objects in pack order

The shortlog summary is a fine tool to show contributors to the patch
series, but is not as useful to show patch series as a whole: splitting
of patches and their ordering.

I will review each of patches individually, but now I would like to say
a few things about the series as a whole.

- [PATCH v2 01/11] commit-graph: use MAX_NUM_CHUNKS

  Simple and non-controversial patch, improvement to existing code with
  the goal of helping future development (including further patches).

- [PATCH v2 02/11] bloom: core Bloom filter implementation for changed paths

  In my opinion this patch could be split into three individual pieces,
  though one might think it is not worth it.

  a. Add implementation of MurmurHash v3 (64-bit)
  
  Include tests based on test-tool (creating file similar to the
  t/helper/test-hash.c, or enhancing to that file) that the
  implementation is correct, for example that 'The quick brown fox jumps
  over the lazy dog' with given seed (for example the default feed of 0)
  hashes to the same value as other implementations.

  b. Add implementation of Bloom filter

  Include generic Bloom filter tests i.e. that it correctly answers
  "yes" and "maybe" (create filter, save it or print it, then use stored
  filter), and tests specific to our implementation, namely that the
  size of the filter behaves as it should.

  c. Bloom filter implementation for changed paths

  Here include tests that use 'test-tool bloom get_filter_for_commit',
  that filter for commit with no changes and for commit with more than
  512 fies changed works correctly, that directories are added along the
  paths, etc.

- [PATCH v2 03/11] diff: halt tree-diff early after max_changes

  I think keeping this patch as a separate step makes individual commits
  easier to understand and review.

- [PATCH v2 04/11] commit-graph: compute Bloom filters for changed paths

  Here we compute Bloom filters for changed paths for each commit in the
  commit-graph file, without writing it to file; as a side-effect we
  calculate total Bloom filters data size.

  This doesn't make much sense as a standalone patch, but it is nice,
  easy to understand incremental step in building the feature.

- [PATCH v2 05/11] commit-graph: examine changed-path objects in pack order
- [PATCH v2 06/11] commit-graph: examine commits by generation number

  Those two are performance improvements of previous step.  It is good
  to keep them as separate commits, makes it easier to understand (and
  easier to catch error via git-bisect, if there would be any)

- [PATCH v2 07/11] commit-graph: write Bloom filters to commit graph file

  This commit includes the documentation of the two new chunks of
  commit-graph file format.

  I wonder if the 9th patch in this series, namely
  commit-graph: add --changed-paths option to write subcommand
  should not precede this commit.  Otherwise we have this new code but
  no way of testing it.  On the other hand it makes it easier to
  review.  On the gripping hand, you can't really test that writing
  works without the ability to parse Bloom filter data out of
  commit-graph file... which is the next commit.

- [PATCH v2 08/11] commit-graph: reuse existing Bloom filters during write

  This implements reading Bloom filters data from commit-graph file.
  Is it a good split?  I think it makes it easier to review the single
  patch, but itt also makes them less standalone.

- [PATCH v2 09/11] commit-graph: add --changed-paths option to write subcommand

  One thing we could test there is that we are writing two new chunks to
  the commit-graph file (and perhaps checking that they are correctly
  formatted, and have correct shape).

- [PATCH v2 10/11] revision.c: use Bloom filters to speed up path based revision walks

  This is quite a big and involved patch, which in my opinion could be
  split in two or three parts:

  a. Add a bare bones implementation, like in v2

  This limits amount of testing we can do; the only thing we can really
  test is that we get the same results with and without Bloom filters.

  b.1. Add trace2 Bloom filter statistics
  b.2. Use said trace2 statistics to test use of Bloom filters

- [PATCH v2 11/11] commit-graph: add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag

  This one is for (optional) exhaustive testing of the feature.


Feel free to disagree with those ideas.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH v2 01/11] commit-graph: use MAX_NUM_CHUNKS
  2020-02-05 22:56   ` [PATCH v2 01/11] commit-graph: use MAX_NUM_CHUNKS Garima Singh via GitGitGadget
@ 2020-02-09 12:39     ` Jakub Narebski
  0 siblings, 0 replies; 150+ messages in thread
From: Jakub Narebski @ 2020-02-09 12:39 UTC (permalink / raw)
  To: Garima Singh via GitGitGadget
  Cc: git, stolee, szeder.dev, jonathantanmy, jeffhost, me, peff,
	garimasigit, christian.couder, emilyshaffer, gitster,
	Garima Singh

"Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Garima Singh <garima.singh@microsoft.com>
> Subject: Re: [PATCH v2 01/11] commit-graph: use MAX_NUM_CHUNKS
>
> This is a minor cleanup to make it easier to change the
> number of chunks being written to the commit-graph in the future.

Looks good to me...

...with the very minor possible nitpick that the subject probably should
be

  [PATCH v2 01/11] commit-graph: define and use MAX_NUM_CHUNKS

But this is just a bikeshedding.  Feel free to disregard this.

Best,
-- 
Jakub Narębski

> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> ---
>  commit-graph.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index b205e65ed1..3c4d411326 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -23,6 +23,7 @@
>  #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
>  #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
>  #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
> +#define MAX_NUM_CHUNKS 5
>  
>  #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
>  
> @@ -1356,8 +1357,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>  	int fd;
>  	struct hashfile *f;
>  	struct lock_file lk = LOCK_INIT;
> -	uint32_t chunk_ids[6];
> -	uint64_t chunk_offsets[6];
> +	uint32_t chunk_ids[MAX_NUM_CHUNKS + 1];
> +	uint64_t chunk_offsets[MAX_NUM_CHUNKS + 1];
>  	const unsigned hashsz = the_hash_algo->rawsz;
>  	struct strbuf progress_title = STRBUF_INIT;
>  	int num_chunks = 3;

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH v2 00/11] Changed Paths Bloom Filters
  2020-02-07 15:09     ` Garima Singh
  2020-02-07 15:36       ` Derrick Stolee
@ 2020-02-11 19:08       ` Garima Singh
  1 sibling, 0 replies; 150+ messages in thread
From: Garima Singh @ 2020-02-11 19:08 UTC (permalink / raw)
  To: SZEDER Gábor, Garima Singh via GitGitGadget
  Cc: git, stolee, jonathantanmy, jeffhost, me, peff, jnareb,
	christian.couder, emilyshaffer, gitster, Garima Singh


On 2/7/2020 10:09 AM, Garima Singh wrote:
> 
> On 2/7/2020 8:52 AM, SZEDER Gábor wrote:
>>>  * Added unit tests for the bloom filter computation layer
>>
>> This fails on big endian, e.g. in Travis CI's s390x build:
>>
>>   https://travis-ci.org/szeder/git-cooking-topics-for-travis-ci/jobs/647253022#L2210
>>
>> (The link highlights the failure, but I'm afraid your browser won't
>> jump there right away; you'll have to click on the print-test-failures
>> fold at the bottom, and scroll down a bit...)
>>
> 
> Thank you so much for running this pipeline and pointing out the error!
> 
> We will carefully review our interactions with the binary data and 
> hopefully solve this in the next version. 
> 
> Cheers!
> Garima Singh
> 

Hey! 

The patch below carries the fix for the failure on Big-endian architectures.
We now treat bloom filter data as a simple binary stream of 1 byte words 
instead of 8 byte words. This avoids the Big-endian vs Little-endian 
confusion on different CPU architectures. 

Here is the successful run of SZEDER's Travis CI s390x build. 

 https://travis-ci.org/szeder/git/jobs/649044879

I will be squashing this patch into the appropriate commits in the series
in v3, which I will send out after people have had a chance to complete
their review of v2. 

A special thanks to SZEDER for helping us test our patches on his CI 
pipeline and saving us the overhead of setting up a Big-endian machine!

Cheers!
Garima Singh

-->8--

From ee72310dd8c3ad2b810914edb651008f637e7c2a Mon Sep 17 00:00:00 2001
From: Garima Singh <garima.singh@microsoft.com>
Date: Tue, 11 Feb 2020 13:55:03 -0500
Subject: [PATCH] Process bloom filter data as 1 byte words

Process bloom filter data as 1 byte words instead of 8 byte
words to avoid the Big-endian vs Little-endian confusion on
different CPU architectures

Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
 bloom.c               |  24 ++++-----
 bloom.h               |   4 +-
 commit-graph.c        |   4 +-
 t/helper/test-bloom.c |   4 +-
 t/t0095-bloom.sh      | 118 +++++++++++++++++++++---------------------
 5 files changed, 77 insertions(+), 77 deletions(-)

diff --git a/bloom.c b/bloom.c
index 90d84dc713..6d5d6bb2ef 100644
--- a/bloom.c
+++ b/bloom.c
@@ -45,12 +45,13 @@ static uint32_t seed_murmur3(uint32_t seed, const char *data, int len)
 
 	int len4 = len / sizeof(uint32_t);
 
-	const uint32_t *blocks = (const uint32_t*)data;
-
 	uint32_t k;
-	for (i = 0; i < len4; i++)
-	{
-		k = blocks[i];
+	for (i = 0; i < len4; i++) {	
+		uint32_t byte1 = (uint32_t)data[4*i];
+		uint32_t byte2 = ((uint32_t)data[4*i + 1]) << 8;
+		uint32_t byte3 = ((uint32_t)data[4*i + 2]) << 16;
+		uint32_t byte4 = ((uint32_t)data[4*i + 3]) << 24;
+		k = byte1 | byte2 | byte3 | byte4;
 		k *= c1;
 		k = rotate_right(k, r1);
 		k *= c2;
@@ -61,8 +62,7 @@ static uint32_t seed_murmur3(uint32_t seed, const char *data, int len)
 
 	tail = (data + len4 * sizeof(uint32_t));
 
-	switch (len & (sizeof(uint32_t) - 1))
-	{
+	switch (len & (sizeof(uint32_t) - 1)) {
 	case 3:
 		k1 ^= ((uint32_t)tail[2]) << 16;
 		/*-fallthrough*/
@@ -88,9 +88,9 @@ static uint32_t seed_murmur3(uint32_t seed, const char *data, int len)
 	return seed;
 }
 
-static inline uint64_t get_bitmask(uint32_t pos)
+static inline unsigned char get_bitmask(uint32_t pos)
 {
-	return ((uint64_t)1) << (pos & (BITS_PER_WORD - 1));
+	return ((unsigned char)1) << (pos & (BITS_PER_WORD - 1));
 }
 
 void load_bloom_filters(void)
@@ -152,8 +152,8 @@ static int load_bloom_filter_from_graph(struct commit_graph *g,
 		start_index = 0;
 
 	filter->len = end_index - start_index;
-	filter->data = (uint64_t *)(g->chunk_bloom_data +
-					sizeof(uint64_t) * start_index +
+	filter->data = (unsigned char *)(g->chunk_bloom_data +
+					sizeof(unsigned char) * start_index +
 					BLOOMDATA_CHUNK_HEADER_SIZE);
 
 	return 1;
@@ -234,7 +234,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 		}
 
 		filter->len = (hashmap_get_size(&pathmap) * settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
-		filter->data = xcalloc(filter->len, sizeof(uint64_t));
+		filter->data = xcalloc(filter->len, sizeof(unsigned char));
 
 		hashmap_for_each_entry(&pathmap, &iter, e, entry) {
 			struct bloom_key key;
diff --git a/bloom.h b/bloom.h
index 76f8a9ad0c..9604723ce0 100644
--- a/bloom.h
+++ b/bloom.h
@@ -12,7 +12,7 @@ struct bloom_filter_settings {
 };
 
 #define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 }
-#define BITS_PER_WORD 64
+#define BITS_PER_WORD 8
 #define BLOOMDATA_CHUNK_HEADER_SIZE 3*sizeof(uint32_t)
 
 /*
@@ -22,7 +22,7 @@ struct bloom_filter_settings {
  * 'data'.
  */
 struct bloom_filter {
-	uint64_t *data;
+	unsigned char *data;
 	int len;
 };
 
diff --git a/commit-graph.c b/commit-graph.c
index c0e9834bf2..f5f9a23c9a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1125,7 +1125,7 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
 	while (list < last) {
 		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
 		display_progress(progress, ++i);
-		hashwrite(f, filter->data, filter->len * sizeof(uint64_t));
+		hashwrite(f, filter->data, filter->len * sizeof(unsigned char));
 		list++;
 	}
 
@@ -1305,7 +1305,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 	for (i = 0; i < ctx->commits.nr; i++) {
 		struct commit *c = sorted_by_pos[i];
 		struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1);
-		ctx->total_bloom_filter_data_size += sizeof(uint64_t) * filter->len;
+		ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
 		display_progress(progress, i + 1);
 	}
 
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index 9b4be97f75..8fa2d8fc25 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -23,7 +23,7 @@ static void print_bloom_filter(struct bloom_filter *filter) {
 	printf("Filter_Length:%d\n", filter->len);
 	printf("Filter_Data:");
 	for (i = 0; i < filter->len; i++){
-		printf("%"PRIx64"|", filter->data[i]);
+		printf("%02x|", filter->data[i]);
 	}
 	printf("\n");
 }
@@ -57,7 +57,7 @@ int cmd__bloom(int argc, const char **argv)
 		struct bloom_filter filter;
 		int i = 2;
 		filter.len =  (settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
-		filter.data = xcalloc(filter.len, sizeof(uint64_t));
+		filter.data = xcalloc(filter.len, sizeof(unsigned char));
 
 		if (!argv[2]){
 			die("at least one input string expected");
diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
index 424fe4fc29..58273219ff 100755
--- a/t/t0095-bloom.sh
+++ b/t/t0095-bloom.sh
@@ -3,58 +3,11 @@
 test_description='test bloom.c'
 . ./test-lib.sh
 
-test_expect_success 'get bloom filters for commit with no changes' '
-	git init &&
-	git commit --allow-empty -m "c0" &&
-	cat >expect <<-\EOF &&
-	Filter_Length:0
-	Filter_Data:
-	EOF
-	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
-	test_cmp expect actual
-'
-
-test_expect_success 'get bloom filter for commit with 10 changes' '
-	rm actual &&
-	rm expect &&
-	mkdir smallDir &&
-	for i in $(test_seq 0 9)
-	do
-		echo $i >smallDir/$i
-	done &&
-	git add smallDir &&
-	git commit -m "commit with 10 changes" &&
-	cat >expect <<-\EOF &&
-	Filter_Length:4
-	Filter_Data:508928809087080a|8a7648210804001|4089824400951000|841ab310098051a8|
-	EOF
-	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
-	test_cmp expect actual
-'
-
-test_expect_success EXPENSIVE 'get bloom filter for commit with 513 changes' '
-	rm actual &&
-	rm expect &&
-	mkdir bigDir &&
-	for i in $(test_seq 0 512)
-	do
-		echo $i >bigDir/$i
-	done &&
-	git add bigDir &&
-	git commit -m "commit with 513 changes" &&
-	cat >expect <<-\EOF &&
-	Filter_Length:0
-	Filter_Data:
-	EOF
-	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
-	test_cmp expect actual
-'
-
 test_expect_success 'compute bloom key for empty string' '
 	cat >expect <<-\EOF &&
 	Hashes:5615800c|5b966560|61174ab4|66983008|6c19155c|7199fab0|771ae004|
-	Filter_Length:1
-	Filter_Data:11000110001110|
+	Filter_Length:2
+	Filter_Data:11|11|
 	EOF
 	test-tool bloom generate_filter "" >actual &&
 	test_cmp expect actual
@@ -63,8 +16,8 @@ test_expect_success 'compute bloom key for empty string' '
 test_expect_success 'compute bloom key for whitespace' '
 	cat >expect <<-\EOF &&
 	Hashes:1bf014e6|8a91b50b|f9335530|67d4f555|d676957a|4518359f|b3b9d5c4|
-	Filter_Length:1
-	Filter_Data:401004080200810|
+	Filter_Length:2
+	Filter_Data:71|8c|
 	EOF
 	test-tool bloom generate_filter " " >actual &&
 	test_cmp expect actual
@@ -73,8 +26,8 @@ test_expect_success 'compute bloom key for whitespace' '
 test_expect_success 'compute bloom key for a root level folder' '
 	cat >expect <<-\EOF &&
 	Hashes:1a21016f|fff1c06d|e5c27f6b|cb933e69|b163fd67|9734bc65|7d057b63|
-	Filter_Length:1
-	Filter_Data:aaa800000000|
+	Filter_Length:2
+	Filter_Data:a8|aa|
 	EOF
 	test-tool bloom generate_filter "A" >actual &&
 	test_cmp expect actual
@@ -83,8 +36,8 @@ test_expect_success 'compute bloom key for a root level folder' '
 test_expect_success 'compute bloom key for a root level file' '
 	cat >expect <<-\EOF &&
 	Hashes:e2d51107|30970605|7e58fb03|cc1af001|19dce4ff|679ed9fd|b560cefb|
-	Filter_Length:1
-	Filter_Data:a8000000000000aa|
+	Filter_Length:2
+	Filter_Data:aa|a8|
 	EOF
 	test-tool bloom generate_filter "file.txt" >actual &&
 	test_cmp expect actual
@@ -93,8 +46,8 @@ test_expect_success 'compute bloom key for a root level file' '
 test_expect_success 'compute bloom key for a deep folder' '
 	cat >expect <<-\EOF &&
 	Hashes:864cf838|27f055cd|c993b362|6b3710f7|0cda6e8c|ae7dcc21|502129b6|
-	Filter_Length:1
-	Filter_Data:1c0000600003000|
+	Filter_Length:2
+	Filter_Data:c6|31|
 	EOF
 	test-tool bloom generate_filter "A/B/C/D/E" >actual &&
 	test_cmp expect actual
@@ -103,11 +56,58 @@ test_expect_success 'compute bloom key for a deep folder' '
 test_expect_success 'compute bloom key for a deep file' '
 	cat >expect <<-\EOF &&
 	Hashes:07cdf850|4af629c7|8e1e5b3e|d1468cb5|146ebe2c|5796efa3|9abf211a|
-	Filter_Length:1
-	Filter_Data:4020100804010080|
+	Filter_Length:2
+	Filter_Data:a9|54|
 	EOF
 	test-tool bloom generate_filter "A/B/C/D/E/file.txt" >actual &&
 	test_cmp expect actual
 '
 
+test_expect_success 'get bloom filters for commit with no changes' '
+	git init &&
+	git commit --allow-empty -m "c0" &&
+	cat >expect <<-\EOF &&
+	Filter_Length:0
+	Filter_Data:
+	EOF
+	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'get bloom filter for commit with 10 changes' '
+	rm actual &&
+	rm expect &&
+	mkdir smallDir &&
+	for i in $(test_seq 0 9)
+	do
+		echo $i >smallDir/$i
+	done &&
+	git add smallDir &&
+	git commit -m "commit with 10 changes" &&
+	cat >expect <<-\EOF &&
+	Filter_Length:25
+	Filter_Data:c2|0b|b8|c0|10|88|f0|1d|c1|0c|01|a4|01|28|81|80|01|30|10|d0|92|be|88|10|8a|
+	EOF
+	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success EXPENSIVE 'get bloom filter for commit with 513 changes' '
+	rm actual &&
+	rm expect &&
+	mkdir bigDir &&
+	for i in $(test_seq 0 512)
+	do
+		echo $i >bigDir/$i
+	done &&
+	git add bigDir &&
+	git commit -m "commit with 513 changes" &&
+	cat >expect <<-\EOF &&
+	Filter_Length:0
+	Filter_Data:
+	EOF
+	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
+	test_cmp expect actual
+'
+
 test_done
-- 
2.22.0.windows.1


^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH v2 02/11] bloom: core Bloom filter implementation for changed paths
  2020-02-05 22:56   ` [PATCH v2 02/11] bloom: core Bloom filter implementation for changed paths Garima Singh via GitGitGadget
@ 2020-02-15 17:17     ` Jakub Narebski
  2020-02-16 16:49     ` Jakub Narebski
  1 sibling, 0 replies; 150+ messages in thread
From: Jakub Narebski @ 2020-02-15 17:17 UTC (permalink / raw)
  To: Garima Singh via GitGitGadget
  Cc: git, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
	Jeff Hostetler, Taylor Blau, Jeff King, Christian Couder,
	Emily Shaffer, Junio C Hamano, Garima Singh, Garima Singh

"Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Garima Singh <garima.singh@microsoft.com>
>
> Add the core Bloom filter logic for computing the paths changed between a
> commit and its first parent. For details on what Bloom filters are and how they
> work, please refer to Dr. Derrick Stolee's blog post [1]. It provides a concise
> explaination of the adoption of Bloom filters as described in [2] and [3].
                                                                           ^^- to add
>
> 1. We currently use 7 and 10 for the number of hashes and the size of each
>    entry respectively. They served as great starting values, the mathematical
>    details behind this choice are described in [1] and [4]. The implementation,
                                                                                ^^- to add
>    while not completely open to it at the moment, is flexible enough to allow
>    for tweaking these settings in the future.

I don't know if it is worth it, but I think it should be size of each
entry, or in other words number of bits per element in the set, as first
value, and number of hashes as second.

About where those values come from.  The idea is that you decide on the
acceptable number of false positives, for example 1% (or 0.8% given that
the values must be integers); that gives you number of bits per element
i.e. 10, and from there you can find optimal number of hashes i.e. 7.
The references mentioned (and Wikipedia article) have those equations.

>
>    Note: The performance gains we have observed with these values are
>    significant enough that we did not need to tweak these settings.
>    The performance numbers are included in the cover letter of this series
>    and in the message of a subsequent commit where we use Bloom filters in
>    to speed up `git log -- <path>`.

All right.

>
> 2. As described in the blog and in [3], we do not need 7 independent hashing
>    functions. We use the Murmur3 hashing scheme. Seed it twice and then
>    combine those to procure an arbitrary number of hash values.

The technique from [3] is called "double hashing" (Algorithm 1 and
equation (4) on page 10).  Note that in this paper there is also
presented "enhanced double hashing" scheme (Algorithm 2 and equation
(6)) -- more about it later.

This is a standard technique from the hashing literature, called open
addressing with double hashing in hash tables.

This "enhanced double hashing" technique is further analyzed in [6].

[6] Adam Kirsch, Michael Mitzenmacher
    "Less Hashing, Same Performance: Building a Better Bloom Filter"
    https://www.eecs.harvard.edu/~michaelm/postscripts/esa2006a.pdf
    https://doi.org/10.5555/1400123.1400125

>
> 3. The filters are sized according to the number of changes in the each commit,
>    with minimum size of one 64 bit word.

If I understand it correctly (but which might not be entirely clear),
the filter size in bits is the number of changes^* times 10, rounded up
to the nearest multiple of 64.

[*] where the number of changes is the number of changed files (new blob
objects) _and_ the number of changed directories (new tree objects,
excluding root tree object change).


The interesting corner case, which might be worth specifying explicitly,
is what happens in the case there are _no changes_ with respect to first
parent (which can happen with either commit created with `git commit
--allow-empty`, or merge created e.g. with `git merge --strategy=ours`).
Is this case represented as Bloom filter of length 0, or as a Bloom
filter of length  of one 64-bit word which is minimal length composed of
all 0's (0x0000000000000000)?

>
> 4. We fill the Bloom filters as (const char *data, int len) pairs as
>    "struct bloom_filter"s in a commit slab.

All right.

>
> 5. The seed_murmur3 method is implemented as described in [5]. It hashes the
>    given data using a given seed and produces a uniformly distributed hash
>    value.

Actually there are two variants of Murmur3 hash, and we should specify
which one we are using.  There is Murmur3_32 which returns 32-bit value,
and Murmur3_128 which returns 128-bit value (which is different for x86
and x64 versions).  We use Murmur3_32.

Also, seed_murmur3 is the name given the function, not the name of the
method i.e. of a non-cryptographic hash function.


One question that one might as is why use Murmur3 hash instead for
example already implemented FNV hash from hashmap implementation (FNV
hash i.e. Fowler–Noll–Vo hash function is another non-cryptographic hash
function).  The answer is of course performance while maintaining good
enough quality (and for Bloom filter there is no problem of "hash
flooding" denial-of-service like for there is for a hash table -- no
need for SipHash or similar).

>
> [1] https://devblogs.microsoft.com/devops/super-charging-the-git-commit-graph-iv-Bloom-filters/

I would write it in full, similar to subsequent bibliographical entries,
that is:

  [1] Derrick Stolee
      "Supercharging the Git Commit Graph IV: Bloom Filters"
      https://devblogs.microsoft.com/devops/super-charging-the-git-commit-graph-iv-Bloom-filters/

But that is just a matter of style.

>
> [2] Flavio Bonomi, Michael Mitzenmacher, Rina Panigrahy, Sushil Singh, George Varghese
>     "An Improved Construction for Counting Bloom Filters"
>     http://theory.stanford.edu/~rinap/papers/esa2006b.pdf
>     https://doi.org/10.1007/11841036_61
>
> [3] Peter C. Dillinger and Panagiotis Manolios
>     "Bloom Filters in Probabilistic Verification"
>     http://www.ccs.neu.edu/home/pete/pub/Bloom-filters-verification.pdf
>     https://doi.org/10.1007/978-3-540-30494-4_26

Good, we should be able to find them even if the URL with PDF stops
working for some reason.

>
> [4] Thomas Mueller Graf, Daniel Lemire
>     "Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters"
>     https://arxiv.org/abs/1912.08258
>
> [5] https://en.wikipedia.org/wiki/MurmurHash#Algorithm
>
> Helped-by: Jeff King <peff@peff.net>
> Helped-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> ---
>  Makefile              |   2 +
>  bloom.c               | 228 ++++++++++++++++++++++++++++++++++++++++++
>  bloom.h               |  56 +++++++++++
>  t/helper/test-bloom.c |  84 ++++++++++++++++
>  t/helper/test-tool.c  |   1 +
>  t/helper/test-tool.h  |   1 +
>  t/t0095-bloom.sh      | 113 +++++++++++++++++++++
>  7 files changed, 485 insertions(+)
>  create mode 100644 bloom.c
>  create mode 100644 bloom.h
>  create mode 100644 t/helper/test-bloom.c
>  create mode 100755 t/t0095-bloom.sh

As I wrote earlier, In my opinion this patch could be split into three
individual single-functionality pieces, to make it easier to review and
aid in bisectability if needed.

1. Add implementation of MurmurHash v3 (32-bit result)
  
Include tests based on test-tool (creating file similar to the
t/helper/test-hash.c, or enhancing to that file) that the implementation
is correct, for example that 'The quick brown fox jumps over the lazy
dog' or 'Hello world!' with a given seed (for example the default seed
of 0) hashes to the same value as other implementations, including the
reference implementation in https://github.com/aappleby/smhasher


2. Add implementation of [variant of] Bloom filter

Include generic Bloom filter tests i.e. that it correctly answers "yes"
and "maybe" (create filter, save it or print it, then use stored
filter), and tests specific to our implementation, namely that the size
of the filter behaves as it should.


3. Bloom filter implementation for changed paths

Here include tests that use 'test-tool bloom get_filter_for_commit',
that filter for commit with no changes and for commit with more than 512
changes works correctly, that directories are added along the files,
etc.


This split would make it easier to distinguish if the problems with
tests failing on big-endian architectures is caused by different output
from our implementation of Murmur3 hash, different bit sequence in the
Bloom filter, or just different printed output of Bloom filter data.

>
> diff --git a/Makefile b/Makefile
> index 6134104ae6..afba81f4a8 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -695,6 +695,7 @@ X =
>  
>  PROGRAMS += $(patsubst %.o,git-%$X,$(PROGRAM_OBJS))
>  
> +TEST_BUILTINS_OBJS += test-bloom.o
>  TEST_BUILTINS_OBJS += test-chmtime.o
>  TEST_BUILTINS_OBJS += test-config.o
>  TEST_BUILTINS_OBJS += test-ctype.o
> @@ -840,6 +841,7 @@ LIB_OBJS += base85.o
>  LIB_OBJS += bisect.o
>  LIB_OBJS += blame.o
>  LIB_OBJS += blob.o
> +LIB_OBJS += bloom.o
>  LIB_OBJS += branch.o
>  LIB_OBJS += bulk-checkin.o
>  LIB_OBJS += bundle.o

All right.

> diff --git a/bloom.c b/bloom.c
> new file mode 100644
> index 0000000000..6082193a75
> --- /dev/null
> +++ b/bloom.c
> @@ -0,0 +1,228 @@
> +#include "git-compat-util.h"
> +#include "bloom.h"
> +#include "commit-graph.h"
> +#include "object-store.h"
> +#include "diff.h"
> +#include "diffcore.h"
> +#include "revision.h"
> +#include "hashmap.h"
> +
> +define_commit_slab(bloom_filter_slab, struct bloom_filter);
> +
> +struct bloom_filter_slab bloom_filters;

All right, this is needed to store per-commit Bloom filter data
(inside-out object style, or in other jargon stored on slab).

> +
> +struct pathmap_hash_entry {
> +    struct hashmap_entry entry;
> +    const char path[FLEX_ARRAY];
> +};

O.K. this is used to add gather paths to add them all as elements to the
Bloom filter.

> +
> +static uint32_t rotate_right(uint32_t value, int32_t count)
> +{
> +	uint32_t mask = 8 * sizeof(uint32_t) - 1;
> +	count &= mask;
> +	return ((value >> count) | (value << ((-count) & mask)));
> +}

Hmmm... both the algoritm on Wikipedia, and reference implementation use
rotate *left*, not rotate *right* in the implementation of Murmur3 hash,
see

  https://en.wikipedia.org/wiki/MurmurHash#Algorithm
  https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp#L23


inline uint32_t rotl32 ( uint32_t x, int8_t r )
{
  return (x << r) | (x >> (32 - r));
}

> +
> +/*
> + * Calculate a hash value for the given data using the given seed.
> + * Produces a uniformly distributed hash value.
> + * Not considered to be cryptographically secure.
> + * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
> + **/
    ^^-- why two _trailing_ asterisks?

Perhaps it would be worth it to add that this hash function is intended
to be fast while being reasonably good (it is distributed randomly
enough, and it doesn't have too many hash collisions on typical inputs).
But this might be too much for a comment.

> +static uint32_t seed_murmur3(uint32_t seed, const char *data, int len)

A few things: name of the function, type of parameters and ordering of
parameters.


About the name: when I first saw seed_murmur3() used, I thought it was
_setting_ the seed, not that it was returning the 32-bit hash value.
Other implementations use either murmur3_32, MurmurHash3_x86_32, or
something similar like hashmurmur3_32.  If we were to specify that
'seed' is one of parameters, then using this word as part of suffix
would be better than using seed_ prefix; if we need it at all.

Because there is 32-bit and 128-bit variants of Murmur3, I think the _32
suffix should be a part of function name.

In short, I think that the name of the function should be murmur3_32, or
murmurhash3_32, or possibly murmur3_32_seed, or something like that.


About types of parameters and the return type of function: I understand
that 'data' parameter is of type 'const char *', instead of more generic
'const uint8_t*' or 'const void *' because of what we will be using the
hash function for.  On the other hand taking a look at implementation of
FNV hash function in hashmap.{c,h} we see that the 'str*' variants take
'const char *' parameter _without_ length, and 'mem*' variants take
'const void *' parmeter with length of data.

Shouldn't 'len' parameter be of 'size_t' type, rather than 'int'?  Both
the example implementation in C on Wikipedia page, and implementation in
C in qLibc use 'size_t'; the implementation of FNV hash in hashmap in
Git also uses 'size_t' (while admittedly the reference implementation in
C++ of Austin Appleby uses 'int' type for len parameter).

For 32-bit output variant of Murmur3 hash, using uint32_t as return type
is just fine.  The '*hash*' functions from hashmap.{c,h} use 'unsigned
int' but I think 'uint32_t' is better.


About names and ordering of parameters: the 'seed' or 'hash_seed'
parameter should be either first or last; it is a matter of preference.
While example implementation on Wikipedia page, Appleby's reference
implementation in C++ have 'seed' as last parameter, memihash_cont()
from hashmap.c in Git has it as first parameter.

In short: I'm fine with either order (seed parameter first or last), and
either name (be it 'seed' or 'hash_seed').

> +{
> +	const uint32_t c1 = 0xcc9e2d51;
> +	const uint32_t c2 = 0x1b873593;
> +	const uint32_t r1 = 15;
> +	const uint32_t r2 = 13;
> +	const uint32_t m = 5;
> +	const uint32_t n = 0xe6546b64;
> +	int i;
> +	uint32_t k1 = 0;
> +	const char *tail;
> +
> +	int len4 = len / sizeof(uint32_t);
> +
> +	const uint32_t *blocks = (const uint32_t*)data;
> +
> +	uint32_t k;
> +	for (i = 0; i < len4; i++)
> +	{
> +		k = blocks[i];

IMPORTANT: There is a comment around there in the example implementation
in C on Wikipedia that this operation above is a source of differing
results across endianness.  The pseudo-code description of the algorithm
on Wikipedia (above of C code) says that endian swapping is only
necessary on big-endian machines (and that it is needed to place the
meaningful digits towards the low end of the value, to not be discarded
by the modulo arithmetic under overflow).

The original / reference implementation by Austin Appleby in C++ uses
getblock32() function for doing the block read... but it doesn't
actually implement the endian-swapping on big-endian architecture:

  //-----------------------------------------------------------------------------
  // Block read - if your platform needs to do endian-swapping or can only
  // handle aligned reads, do the conversion here

  FORCE_INLINE uint32_t getblock32 ( const uint32_t * p, int i )
  {
    return p[i];
  }

References:
-----------
1. https://en.wikipedia.org/wiki/MurmurHash#Algorithm
2. https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp

> +		k *= c1;
> +		k = rotate_right(k, r1);

It is  k ROL r1 / ROTL32(k,15) / (k << 15) | (k >> (32 - 15))
(in other implementations), not rotate_right.

> +		k *= c2;
> +
> +		seed ^= k;
> +		seed = rotate_right(seed, r2) * m + n;

It is  hash ROL r2 / ROTL32(h1,13) / (h << 13) | (h >> (32 - 13))
(in other implementations), not rotate_right.

References:
-----------
1. https://en.wikipedia.org/wiki/MurmurHash#Algorithm
2. https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp#L94
3. https://github.com/wolkykim/qlibc/blob/master/src/utilities/qhash.c#L258

> +	}
> +
> +	tail = (data + len4 * sizeof(uint32_t));

Hmmm... in the pseudocode implementation on Wikipedia this is the place
where one needs to respect endianness:

    with any remainingBytesInKey do
        remainingBytes ← SwapToLittleEndian(remainingBytesInKey)
        // Note: Endian swapping is only necessary on big-endian machines.
        //       The purpose is to place the meaningful digits towards the low end of the value,
        //       so that these digits have the greatest potential to affect the low range digits
        //       in the subsequent multiplication.  Consider that locating the meaningful digits
        //       in the high range would produce a greater effect upon the high digits of the
        //       multiplication, and notably, that such high digits are likely to be discarded
        //       by the modulo arithmetic under overflow.  We don't want that.

On the other hand in the reference Appleby's C++ implementation the
endian-swapping is [ssumed to be] done only in the loop over data.
Either should be enough alone, but doing swapping for remaining bytes
only would work, it would be a better solution -- you do swap only once,
at the end.

It looks like the Crhomium implementation in C by Shane Day (public
domain) uses the second solution; well almost, see:
https://chromium.googlesource.com/external/smhasher/+/5b8fd3c31a58b87b80605dca7a64fad6cb3f8a0f/PMurHash.c#189

> +
> +	switch (len & (sizeof(uint32_t) - 1))
> +	{
> +	case 3:
> +		k1 ^= ((uint32_t)tail[2]) << 16;
> +		/*-fallthrough*/
> +	case 2:
> +		k1 ^= ((uint32_t)tail[1]) << 8;
> +		/*-fallthrough*/
> +	case 1:
> +		k1 ^= ((uint32_t)tail[0]) << 0;
> +		k1 *= c1;
> +		k1 = rotate_right(k1, r1);



> +		k1 *= c2;
> +		seed ^= k1;
> +		break;
> +	}



> +
> +	seed ^= (uint32_t)len;
> +	seed ^= (seed >> 16);
> +	seed *= 0x85ebca6b;
> +	seed ^= (seed >> 13);
> +	seed *= 0xc2b2ae35;
> +	seed ^= (seed >> 16);
> +
> +	return seed;
> +}
> +
> +static inline uint64_t get_bitmask(uint32_t pos)
> +{
> +	return ((uint64_t)1) << (pos & (BITS_PER_WORD - 1));
> +}
> +
> +void load_bloom_filters(void)
> +{
> +	init_bloom_filter_slab(&bloom_filters);
> +}
> +
> +void fill_bloom_key(const char *data,
> +					int len,
> +					struct bloom_key *key,
> +					struct bloom_filter_settings *settings)
> +{
> +	int i;
> +	const uint32_t seed0 = 0x293ae76f;
> +	const uint32_t seed1 = 0x7e646e2c;
> +	const uint32_t hash0 = seed_murmur3(seed0, data, len);
> +	const uint32_t hash1 = seed_murmur3(seed1, data, len);
> +
> +	key->hashes = (uint32_t *)xcalloc(settings->num_hashes, sizeof(uint32_t));
> +	for (i = 0; i < settings->num_hashes; i++)
> +		key->hashes[i] = hash0 + i * hash1;
> +}
> +
> +void add_key_to_filter(struct bloom_key *key,
> +					   struct bloom_filter *filter,
> +					   struct bloom_filter_settings *settings)
> +{
> +	int i;
> +	uint64_t mod = filter->len * BITS_PER_WORD;
> +
> +	for (i = 0; i < settings->num_hashes; i++) {
> +		uint64_t hash_mod = key->hashes[i] % mod;
> +		uint64_t block_pos = hash_mod / BITS_PER_WORD;
> +
> +		filter->data[block_pos] |= get_bitmask(hash_mod);
> +	}
> +}
> +
> +struct bloom_filter *get_bloom_filter(struct repository *r,
> +				      struct commit *c)
> +{
> +	struct bloom_filter *filter;
> +	struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
> +	int i;
> +	struct diff_options diffopt;
> +
> +	if (!bloom_filters.slab_size)
> +		return NULL;
> +
> +	filter = bloom_filter_slab_at(&bloom_filters, c);
> +
> +	repo_diff_setup(r, &diffopt);
> +	diffopt.flags.recursive = 1;
> +	diff_setup_done(&diffopt);
> +
> +	if (c->parents)
> +		diff_tree_oid(&c->parents->item->object.oid, &c->object.oid, "", &diffopt);
> +	else
> +		diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
> +	diffcore_std(&diffopt);
> +
> +	if (diff_queued_diff.nr <= 512) {
> +		struct hashmap pathmap;
> +		struct pathmap_hash_entry* e;
> +		struct hashmap_iter iter;
> +		hashmap_init(&pathmap, NULL, NULL, 0);
> +
> +		for (i = 0; i < diff_queued_diff.nr; i++) {
> +			const char* path = diff_queued_diff.queue[i]->two->path;
> +			const char* p = path;
> +
> +			/*
> +			* Add each leading directory of the changed file, i.e. for
> +			* 'dir/subdir/file' add 'dir' and 'dir/subdir' as well, so
> +			* the Bloom filter could be used to speed up commands like
> +			* 'git log dir/subdir', too.
> +			*
> +			* Note that directories are added without the trailing '/'.
> +			*/
> +			do {
> +				char* last_slash = strrchr(p, '/');
> +
> +				FLEX_ALLOC_STR(e, path, path);
> +				hashmap_entry_init(&e->entry, strhash(p));
> +				hashmap_add(&pathmap, &e->entry);
> +
> +				if (!last_slash)
> +					last_slash = (char*)p;
> +				*last_slash = '\0';
> +
> +			} while (*p);
> +
> +			diff_free_filepair(diff_queued_diff.queue[i]);
> +		}
> +
> +		filter->len = (hashmap_get_size(&pathmap) * settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
> +		filter->data = xcalloc(filter->len, sizeof(uint64_t));
> +
> +		hashmap_for_each_entry(&pathmap, &iter, e, entry) {
> +			struct bloom_key key;
> +			fill_bloom_key(e->path, strlen(e->path), &key, &settings);
> +			add_key_to_filter(&key, filter, &settings);
> +		}
> +
> +		hashmap_free_entries(&pathmap, struct pathmap_hash_entry, entry);
> +	} else {
> +		for (i = 0; i < diff_queued_diff.nr; i++)
> +			diff_free_filepair(diff_queued_diff.queue[i]);
> +		filter->data = NULL;
> +		filter->len = 0;
> +	}
> +
> +	free(diff_queued_diff.queue);
> +	DIFF_QUEUE_CLEAR(&diff_queued_diff);
> +
> +	return filter;
> +}
> +
> +int bloom_filter_contains(struct bloom_filter *filter,
> +			  struct bloom_key *key,
> +			  struct bloom_filter_settings *settings)
> +{
> +	int i;
> +	uint64_t mod = filter->len * BITS_PER_WORD;
> +
> +	if (!mod)
> +		return -1;
> +
> +	for (i = 0; i < settings->num_hashes; i++) {
> +		uint64_t hash_mod = key->hashes[i] % mod;
> +		uint64_t block_pos = hash_mod / BITS_PER_WORD;
> +		if (!(filter->data[block_pos] & get_bitmask(hash_mod)))
> +			return 0;
> +	}
> +
> +	return 1;
> +}
> diff --git a/bloom.h b/bloom.h
> new file mode 100644
> index 0000000000..7f40c751f7
> --- /dev/null
> +++ b/bloom.h
> @@ -0,0 +1,56 @@
> +#ifndef BLOOM_H
> +#define BLOOM_H
> +
> +struct commit;
> +struct repository;
> +struct commit_graph;
> +
> +struct bloom_filter_settings {
> +	uint32_t hash_version;
> +	uint32_t num_hashes;
> +	uint32_t bits_per_entry;
> +};
> +
> +#define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 }
> +#define BITS_PER_WORD 64
> +
> +/*
> + * A bloom_filter struct represents a data segment to
> + * use when testing hash values. The 'len' member
> + * dictates how many uint64_t entries are stored in
> + * 'data'.
> + */
> +struct bloom_filter {
> +	uint64_t *data;
> +	int len;
> +};
> +
> +/*
> + * A bloom_key represents the k hash values for a
> + * given hash input. These can be precomputed and
> + * stored in a bloom_key for re-use when testing
> + * against a bloom_filter.
> + */
> +struct bloom_key {
> +	uint32_t *hashes;
> +};
> +
> +void load_bloom_filters(void);
> +
> +void fill_bloom_key(const char *data,
> +		    int len,
> +		    struct bloom_key *key,
> +		    struct bloom_filter_settings *settings);
> +
> +void add_key_to_filter(struct bloom_key *key,
> +					   struct bloom_filter *filter,
> +					   struct bloom_filter_settings *settings);
> +
> +struct bloom_filter *get_bloom_filter(struct repository *r,
> +				      struct commit *c);
> +
> +int bloom_filter_contains(struct bloom_filter *filter,
> +			  struct bloom_key *key,
> +			  struct bloom_filter_settings *settings);
> +
> +#endif
> diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
> new file mode 100644
> index 0000000000..331957011b
> --- /dev/null
> +++ b/t/helper/test-bloom.c
> @@ -0,0 +1,84 @@
> +#include "test-tool.h"
> +#include "git-compat-util.h"
> +#include "bloom.h"
> +#include "test-tool.h"
> +#include "cache.h"
> +#include "commit-graph.h"
> +#include "commit.h"
> +#include "config.h"
> +#include "object-store.h"
> +#include "object.h"
> +#include "repository.h"
> +#include "tree.h"
> +
> +struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
> +
> +static void print_bloom_filter(struct bloom_filter *filter) {
> +	int i;
> +
> +	if (!filter) {
> +		printf("No filter.\n");
> +		return;
> +	}
> +	printf("Filter_Length:%d\n", filter->len);
> +	printf("Filter_Data:");
> +	for (i = 0; i < filter->len; i++){
> +		printf("%"PRIx64"|", filter->data[i]);
> +	}
> +	printf("\n");
> +}
> +
> +static void add_string_to_filter(const char *data, struct bloom_filter *filter) {
> +		struct bloom_key key;
> +		int i;
> +
> +		fill_bloom_key(data, strlen(data), &key, &settings);
> +		printf("Hashes:");
> +		for (i = 0; i < settings.num_hashes; i++){
> +			printf("%08x|", key.hashes[i]);
> +		}
> +		printf("\n");
> +		add_key_to_filter(&key, filter, &settings);
> +}
> +
> +static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
> +{
> +	struct commit *c;
> +	struct bloom_filter *filter;
> +	setup_git_directory();
> +	c = lookup_commit(the_repository, commit_oid);
> +	filter = get_bloom_filter(the_repository, c);
> +	print_bloom_filter(filter);
> +}
> +
> +int cmd__bloom(int argc, const char **argv)
> +{
> +    if (!strcmp(argv[1], "generate_filter")) {
> +		struct bloom_filter filter;
> +		int i = 2;
> +		filter.len =  (settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
> +		filter.data = xcalloc(filter.len, sizeof(uint64_t));
> +
> +		if (!argv[2]){
> +			die("at least one input string expected");
> +		}
> +
> +		while (argv[i]) {
> +			add_string_to_filter(argv[i], &filter);
> +			i++;
> +		}
> +
> +		print_bloom_filter(&filter);
> +	}
> +
> +	if (!strcmp(argv[1], "get_filter_for_commit")) {
> +		struct object_id oid;
> +		const char *end;
> +		if (parse_oid_hex(argv[2], &oid, &end))
> +			die("cannot parse oid '%s'", argv[2]);
> +		load_bloom_filters();
> +		get_bloom_filter_for_commit(&oid);
> +	}
> +
> +	return 0;
> +}
> diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
> index c9a232d238..ca4f4b0066 100644
> --- a/t/helper/test-tool.c
> +++ b/t/helper/test-tool.c
> @@ -14,6 +14,7 @@ struct test_cmd {
>  };
>  
>  static struct test_cmd cmds[] = {
> +	{ "bloom", cmd__bloom },
>  	{ "chmtime", cmd__chmtime },
>  	{ "config", cmd__config },
>  	{ "ctype", cmd__ctype },
> diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
> index c8549fd87f..05d2b32451 100644
> --- a/t/helper/test-tool.h
> +++ b/t/helper/test-tool.h
> @@ -4,6 +4,7 @@
>  #define USE_THE_INDEX_COMPATIBILITY_MACROS
>  #include "git-compat-util.h"
>  
> +int cmd__bloom(int argc, const char **argv);
>  int cmd__chmtime(int argc, const char **argv);
>  int cmd__config(int argc, const char **argv);
>  int cmd__ctype(int argc, const char **argv);
> diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
> new file mode 100755
> index 0000000000..424fe4fc29
> --- /dev/null
> +++ b/t/t0095-bloom.sh
> @@ -0,0 +1,113 @@
> +#!/bin/sh
> +
> +test_description='test bloom.c'
> +. ./test-lib.sh
> +
> +test_expect_success 'get bloom filters for commit with no changes' '
> +	git init &&
> +	git commit --allow-empty -m "c0" &&
> +	cat >expect <<-\EOF &&
> +	Filter_Length:0
> +	Filter_Data:
> +	EOF
> +	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'get bloom filter for commit with 10 changes' '
> +	rm actual &&
> +	rm expect &&
> +	mkdir smallDir &&
> +	for i in $(test_seq 0 9)
> +	do
> +		echo $i >smallDir/$i
> +	done &&
> +	git add smallDir &&
> +	git commit -m "commit with 10 changes" &&
> +	cat >expect <<-\EOF &&
> +	Filter_Length:4
> +	Filter_Data:508928809087080a|8a7648210804001|4089824400951000|841ab310098051a8|
> +	EOF
> +	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success EXPENSIVE 'get bloom filter for commit with 513 changes' '
> +	rm actual &&
> +	rm expect &&
> +	mkdir bigDir &&
> +	for i in $(test_seq 0 512)
> +	do
> +		echo $i >bigDir/$i
> +	done &&
> +	git add bigDir &&
> +	git commit -m "commit with 513 changes" &&
> +	cat >expect <<-\EOF &&
> +	Filter_Length:0
> +	Filter_Data:
> +	EOF
> +	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'compute bloom key for empty string' '
> +	cat >expect <<-\EOF &&
> +	Hashes:5615800c|5b966560|61174ab4|66983008|6c19155c|7199fab0|771ae004|
> +	Filter_Length:1
> +	Filter_Data:11000110001110|
> +	EOF
> +	test-tool bloom generate_filter "" >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'compute bloom key for whitespace' '
> +	cat >expect <<-\EOF &&
> +	Hashes:1bf014e6|8a91b50b|f9335530|67d4f555|d676957a|4518359f|b3b9d5c4|
> +	Filter_Length:1
> +	Filter_Data:401004080200810|
> +	EOF
> +	test-tool bloom generate_filter " " >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'compute bloom key for a root level folder' '
> +	cat >expect <<-\EOF &&
> +	Hashes:1a21016f|fff1c06d|e5c27f6b|cb933e69|b163fd67|9734bc65|7d057b63|
> +	Filter_Length:1
> +	Filter_Data:aaa800000000|
> +	EOF
> +	test-tool bloom generate_filter "A" >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'compute bloom key for a root level file' '
> +	cat >expect <<-\EOF &&
> +	Hashes:e2d51107|30970605|7e58fb03|cc1af001|19dce4ff|679ed9fd|b560cefb|
> +	Filter_Length:1
> +	Filter_Data:a8000000000000aa|
> +	EOF
> +	test-tool bloom generate_filter "file.txt" >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'compute bloom key for a deep folder' '
> +	cat >expect <<-\EOF &&
> +	Hashes:864cf838|27f055cd|c993b362|6b3710f7|0cda6e8c|ae7dcc21|502129b6|
> +	Filter_Length:1
> +	Filter_Data:1c0000600003000|
> +	EOF
> +	test-tool bloom generate_filter "A/B/C/D/E" >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'compute bloom key for a deep file' '
> +	cat >expect <<-\EOF &&
> +	Hashes:07cdf850|4af629c7|8e1e5b3e|d1468cb5|146ebe2c|5796efa3|9abf211a|
> +	Filter_Length:1
> +	Filter_Data:4020100804010080|
> +	EOF
> +	test-tool bloom generate_filter "A/B/C/D/E/file.txt" >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_done

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH v2 02/11] bloom: core Bloom filter implementation for changed paths
  2020-02-05 22:56   ` [PATCH v2 02/11] bloom: core Bloom filter implementation for changed paths Garima Singh via GitGitGadget
  2020-02-15 17:17     ` Jakub Narebski
@ 2020-02-16 16:49     ` Jakub Narebski
  2020-02-22  0:32       ` Garima Singh
  1 sibling, 1 reply; 150+ messages in thread
From: Jakub Narebski @ 2020-02-16 16:49 UTC (permalink / raw)
  To: Garima Singh via GitGitGadget
  Cc: git, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
	Jeff Hostetler, Taylor Blau, Jeff King, Christian Couder,
	Emily Shaffer, Junio C Hamano, Garima Singh, Garima Singh

[I'm sorry for accidentally sending unfinished version of this email]

"Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Garima Singh <garima.singh@microsoft.com>
>
> Add the core Bloom filter logic for computing the paths changed between a
> commit and its first parent. For details on what Bloom filters are and how they
> work, please refer to Dr. Derrick Stolee's blog post [1]. It provides a concise
> explaination of the adoption of Bloom filters as described in [2] and [3].
                                                                           ^^- to add
>
> 1. We currently use 7 and 10 for the number of hashes and the size of each
>    entry respectively. They served as great starting values, the mathematical
>    details behind this choice are described in [1] and [4]. The implementation,
                                                                                ^^- to add
>    while not completely open to it at the moment, is flexible enough to allow
>    for tweaking these settings in the future.

I don't know if it is worth it, but I think it should be size of each
entry, or in other words number of bits per element in the set, as first
value, and number of hashes as second.

About where those values come from.  The idea is that you decide on the
acceptable number of false positives, for example 1% (or 0.8% given that
the values must be integers); that gives you number of bits per element
i.e. 10, and from there you can find optimal number of hashes i.e. 7.
The references mentioned (and Wikipedia article) have those equations.

>
>    Note: The performance gains we have observed with these values are
>    significant enough that we did not need to tweak these settings.
>    The performance numbers are included in the cover letter of this series
>    and in the message of a subsequent commit where we use Bloom filters in
>    to speed up `git log -- <path>`.

All right.

>
> 2. As described in the blog and in [3], we do not need 7 independent hashing
>    functions. We use the Murmur3 hashing scheme. Seed it twice and then
>    combine those to procure an arbitrary number of hash values.

The technique from [3] is called "double hashing" (Algorithm 1 and
equation (4) on page 10).  Note that in this paper there is also
presented "enhanced double hashing" scheme (Algorithm 2 and equation
(6)) -- more about it later.

This is a standard technique from the hashing literature, called open
addressing with double hashing in hash tables.

This "enhanced double hashing" technique is further analyzed in [6].

[6] Adam Kirsch, Michael Mitzenmacher
    "Less Hashing, Same Performance: Building a Better Bloom Filter"
    https://www.eecs.harvard.edu/~michaelm/postscripts/esa2006a.pdf
    https://doi.org/10.5555/1400123.1400125

>
> 3. The filters are sized according to the number of changes in the each commit,
>    with minimum size of one 64 bit word.

If I understand it correctly (but which might not be entirely clear),
the filter size in bits is the number of changes^* times 10, rounded up
to the nearest multiple of 64.

[*] where the number of changes is the number of changed files (new blob
objects) _and_ the number of changed directories (new tree objects,
excluding root tree object change).


The interesting corner case, which might be worth specifying explicitly,
is what happens in the case there are _no changes_ with respect to first
parent (which can happen with either commit created with `git commit
--allow-empty`, or merge created e.g. with `git merge --strategy=ours`).
Is this case represented as Bloom filter of length 0, or as a Bloom
filter of length  of one 64-bit word which is minimal length composed of
all 0's (0x0000000000000000)?

>
> 4. We fill the Bloom filters as (const char *data, int len) pairs as
>    "struct bloom_filter"s in a commit slab.

All right.

>
> 5. The seed_murmur3 method is implemented as described in [5]. It hashes the
>    given data using a given seed and produces a uniformly distributed hash
>    value.

Actually there are two variants of Murmur3 hash, and we should specify
which one we are using.  There is Murmur3_32 which returns 32-bit value,
and Murmur3_128 which returns 128-bit value (which is different for x86
and x64 versions).  We use Murmur3_32.

Also, seed_murmur3 is the name given the function, not the name of the
method i.e. of a non-cryptographic hash function.


One question that one might as is why use Murmur3 hash instead for
example already implemented FNV hash from hashmap implementation (FNV
hash i.e. Fowler–Noll–Vo hash function is another non-cryptographic hash
function).  The answer is of course performance while maintaining good
enough quality (and for Bloom filter there is no problem of "hash
flooding" denial-of-service like for there is for a hash table -- no
need for SipHash or similar).

>
> [1] https://devblogs.microsoft.com/devops/super-charging-the-git-commit-graph-iv-Bloom-filters/

I would write it in full, similar to subsequent bibliographical entries,
that is:

  [1] Derrick Stolee
      "Supercharging the Git Commit Graph IV: Bloom Filters"
      https://devblogs.microsoft.com/devops/super-charging-the-git-commit-graph-iv-Bloom-filters/

But that is just a matter of style.

>
> [2] Flavio Bonomi, Michael Mitzenmacher, Rina Panigrahy, Sushil Singh, George Varghese
>     "An Improved Construction for Counting Bloom Filters"
>     http://theory.stanford.edu/~rinap/papers/esa2006b.pdf
>     https://doi.org/10.1007/11841036_61
>
> [3] Peter C. Dillinger and Panagiotis Manolios
>     "Bloom Filters in Probabilistic Verification"
>     http://www.ccs.neu.edu/home/pete/pub/Bloom-filters-verification.pdf
>     https://doi.org/10.1007/978-3-540-30494-4_26

Good, we should be able to find them even if the URL with PDF stops
working for some reason.

>
> [4] Thomas Mueller Graf, Daniel Lemire
>     "Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters"
>     https://arxiv.org/abs/1912.08258
>
> [5] https://en.wikipedia.org/wiki/MurmurHash#Algorithm
>
> Helped-by: Jeff King <peff@peff.net>
> Helped-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> ---
>  Makefile              |   2 +
>  bloom.c               | 228 ++++++++++++++++++++++++++++++++++++++++++
>  bloom.h               |  56 +++++++++++
>  t/helper/test-bloom.c |  84 ++++++++++++++++
>  t/helper/test-tool.c  |   1 +
>  t/helper/test-tool.h  |   1 +
>  t/t0095-bloom.sh      | 113 +++++++++++++++++++++
>  7 files changed, 485 insertions(+)
>  create mode 100644 bloom.c
>  create mode 100644 bloom.h
>  create mode 100644 t/helper/test-bloom.c
>  create mode 100755 t/t0095-bloom.sh

As I wrote earlier, In my opinion this patch could be split into three
individual single-functionality pieces, to make it easier to review and
aid in bisectability if needed.

1. Add implementation of MurmurHash v3 (32-bit result)
  
Include tests based on test-tool (creating file similar to the
t/helper/test-hash.c, or enhancing to that file) that the implementation
is correct, for example that 'The quick brown fox jumps over the lazy
dog' or 'Hello world!' with a given seed (for example the default seed
of 0) hashes to the same value as other implementations, including the
reference implementation in https://github.com/aappleby/smhasher


2. Add implementation of [variant of] Bloom filter

Include generic Bloom filter tests i.e. that it correctly answers "yes"
and "maybe" (create filter, save it or print it, then use stored
filter), and tests specific to our implementation, namely that the size
of the filter behaves as it should.


3. Bloom filter implementation for changed paths

Here include tests that use 'test-tool bloom get_filter_for_commit',
that filter for commit with no changes and for commit with more than 512
changes works correctly, that directories are added along the files,
etc.


This split would make it easier to distinguish if the problems with
tests failing on big-endian architectures is caused by different output
from our implementation of Murmur3 hash, different bit sequence in the
Bloom filter, or just different printed output of Bloom filter data.

>
> diff --git a/Makefile b/Makefile
> index 6134104ae6..afba81f4a8 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -695,6 +695,7 @@ X =
>  
>  PROGRAMS += $(patsubst %.o,git-%$X,$(PROGRAM_OBJS))
>  
> +TEST_BUILTINS_OBJS += test-bloom.o
>  TEST_BUILTINS_OBJS += test-chmtime.o
>  TEST_BUILTINS_OBJS += test-config.o
>  TEST_BUILTINS_OBJS += test-ctype.o
> @@ -840,6 +841,7 @@ LIB_OBJS += base85.o
>  LIB_OBJS += bisect.o
>  LIB_OBJS += blame.o
>  LIB_OBJS += blob.o
> +LIB_OBJS += bloom.o
>  LIB_OBJS += branch.o
>  LIB_OBJS += bulk-checkin.o
>  LIB_OBJS += bundle.o

All right.

> diff --git a/bloom.c b/bloom.c
> new file mode 100644
> index 0000000000..6082193a75
> --- /dev/null
> +++ b/bloom.c
> @@ -0,0 +1,228 @@
> +#include "git-compat-util.h"
> +#include "bloom.h"
> +#include "commit-graph.h"
> +#include "object-store.h"
> +#include "diff.h"
> +#include "diffcore.h"
> +#include "revision.h"
> +#include "hashmap.h"
> +
> +define_commit_slab(bloom_filter_slab, struct bloom_filter);
> +
> +struct bloom_filter_slab bloom_filters;

All right, this is needed to store per-commit Bloom filter data
(inside-out object style, or in other jargon stored on slab).

> +
> +struct pathmap_hash_entry {
> +    struct hashmap_entry entry;
> +    const char path[FLEX_ARRAY];
> +};

O.K. this is used to add gather paths to add them all as elements to the
Bloom filter.

> +
> +static uint32_t rotate_right(uint32_t value, int32_t count)
> +{
> +	uint32_t mask = 8 * sizeof(uint32_t) - 1;
> +	count &= mask;
> +	return ((value >> count) | (value << ((-count) & mask)));
> +}

Hmmm... both the algoritm on Wikipedia, and reference implementation use
rotate *left*, not rotate *right* in the implementation of Murmur3 hash,
see

  https://en.wikipedia.org/wiki/MurmurHash#Algorithm
  https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp#L23


inline uint32_t rotl32 ( uint32_t x, int8_t r )
{
  return (x << r) | (x >> (32 - r));
}

> +
> +/*
> + * Calculate a hash value for the given data using the given seed.
> + * Produces a uniformly distributed hash value.
> + * Not considered to be cryptographically secure.
> + * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
> + **/
    ^^-- why two _trailing_ asterisks?

Perhaps it would be worth it to add that this hash function is intended
to be fast while being reasonably good (it is distributed randomly
enough, and it doesn't have too many hash collisions on typical inputs).
But this might be too much for a comment.

> +static uint32_t seed_murmur3(uint32_t seed, const char *data, int len)

A few things: name of the function, type of parameters and ordering of
parameters.


About the name: when I first saw seed_murmur3() used, I thought it was
_setting_ the seed, not that it was returning the 32-bit hash value.
Other implementations use either murmur3_32, MurmurHash3_x86_32, or
something similar like hashmurmur3_32.  If we were to specify that
'seed' is one of parameters, then using this word as part of suffix
would be better than using seed_ prefix; if we need it at all.

Because there is 32-bit and 128-bit variants of Murmur3, I think the _32
suffix should be a part of function name.

In short, I think that the name of the function should be murmur3_32, or
murmurhash3_32, or possibly murmur3_32_seed, or something like that.


About types of parameters and the return type of function: I understand
that 'data' parameter is of type 'const char *', instead of more generic
'const uint8_t*' or 'const void *' because of what we will be using the
hash function for.  On the other hand taking a look at implementation of
FNV hash function in hashmap.{c,h} we see that the 'str*' variants take
'const char *' parameter _without_ length, and 'mem*' variants take
'const void *' parmeter with length of data.

Shouldn't 'len' parameter be of 'size_t' type, rather than 'int'?  Both
the example implementation in C on Wikipedia page, and implementation in
C in qLibc use 'size_t'; the implementation of FNV hash in hashmap in
Git also uses 'size_t' (while admittedly the reference implementation in
C++ of Austin Appleby uses 'int' type for len parameter).

For 32-bit output variant of Murmur3 hash, using uint32_t as return type
is just fine.  The '*hash*' functions from hashmap.{c,h} use 'unsigned
int' but I think 'uint32_t' is better.


About names and ordering of parameters: the 'seed' or 'hash_seed'
parameter should be either first or last; it is a matter of preference.
While example implementation on Wikipedia page, Appleby's reference
implementation in C++ have 'seed' as last parameter, memihash_cont()
from hashmap.c in Git has it as first parameter.

In short: I'm fine with either order (seed parameter first or last), and
either name (be it 'seed' or 'hash_seed').

> +{
> +	const uint32_t c1 = 0xcc9e2d51;
> +	const uint32_t c2 = 0x1b873593;
> +	const uint32_t r1 = 15;
> +	const uint32_t r2 = 13;
> +	const uint32_t m = 5;
> +	const uint32_t n = 0xe6546b64;
> +	int i;
> +	uint32_t k1 = 0;
> +	const char *tail;
> +
> +	int len4 = len / sizeof(uint32_t);
> +
> +	const uint32_t *blocks = (const uint32_t*)data;
> +
> +	uint32_t k;
> +	for (i = 0; i < len4; i++)
> +	{
> +		k = blocks[i];

IMPORTANT: There is a comment around there in the example implementation
in C on Wikipedia that this operation above is a source of differing
results across endianness.  The pseudo-code description of the algorithm
on Wikipedia (above of C code) says that endian swapping is only
necessary on big-endian machines (and that it is needed to place the
meaningful digits towards the low end of the value, to not be discarded
by the modulo arithmetic under overflow).

The original / reference implementation by Austin Appleby in C++ uses
getblock32() function for doing the block read... but it doesn't
actually implement the endian-swapping on big-endian architecture:

  //-----------------------------------------------------------------------------
  // Block read - if your platform needs to do endian-swapping or can only
  // handle aligned reads, do the conversion here

  FORCE_INLINE uint32_t getblock32 ( const uint32_t * p, int i )
  {
    return p[i];
  }

References:
-----------
1. https://en.wikipedia.org/wiki/MurmurHash#Algorithm
2. https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp

> +		k *= c1;
> +		k = rotate_right(k, r1);

It is  k ROL r1 / ROTL32(k,15) / (k << 15) | (k >> (32 - 15))
(in other implementations), not rotate_right.

> +		k *= c2;
> +
> +		seed ^= k;
> +		seed = rotate_right(seed, r2) * m + n;

It is  hash ROL r2 / ROTL32(h1,13) / (h << 13) | (h >> (32 - 13))
(in other implementations), not rotate_right.

References:
-----------
1. https://en.wikipedia.org/wiki/MurmurHash#Algorithm
2. https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp#L94
3. https://github.com/wolkykim/qlibc/blob/master/src/utilities/qhash.c#L258

> +	}
> +
> +	tail = (data + len4 * sizeof(uint32_t));

Hmmm... in the pseudocode implementation on Wikipedia this is the place
where one needs to respect endianness:

    with any remainingBytesInKey do
        remainingBytes ← SwapToLittleEndian(remainingBytesInKey)
        // Note: Endian swapping is only necessary on big-endian machines.
        //       The purpose is to place the meaningful digits towards the low end of the value,
        //       so that these digits have the greatest potential to affect the low range digits
        //       in the subsequent multiplication.  Consider that locating the meaningful digits
        //       in the high range would produce a greater effect upon the high digits of the
        //       multiplication, and notably, that such high digits are likely to be discarded
        //       by the modulo arithmetic under overflow.  We don't want that.

On the other hand in the reference Appleby's C++ implementation the
endian-swapping is [ssumed to be] done only in the loop over data.
Either should be enough alone, but doing swapping for remaining bytes
only would work, it would be a better solution -- you do swap only once,
at the end.

It looks like the Chromium implementation in C by Shane Day (public
domain) uses the second solution; well almost, see:
https://chromium.googlesource.com/external/smhasher/+/5b8fd3c31a58b87b80605dca7a64fad6cb3f8a0f/PMurHash.c#189

> +
> +	switch (len & (sizeof(uint32_t) - 1))
> +	{
> +	case 3:
> +		k1 ^= ((uint32_t)tail[2]) << 16;
> +		/*-fallthrough*/
> +	case 2:
> +		k1 ^= ((uint32_t)tail[1]) << 8;
> +		/*-fallthrough*/
> +	case 1:
> +		k1 ^= ((uint32_t)tail[0]) << 0;
> +		k1 *= c1;
> +		k1 = rotate_right(k1, r1);

It is  remainingBytes ROL r1 / ROTL32(k1,15) / (k << 15) | (k >> (32 - 15)) 
(in other implementations), not rotate_right.  The same references as
before.

> +		k1 *= c2;
> +		seed ^= k1;
> +		break;
> +	}
> +
> +	seed ^= (uint32_t)len;
> +	seed ^= (seed >> 16);
> +	seed *= 0x85ebca6b;
> +	seed ^= (seed >> 13);
> +	seed *= 0xc2b2ae35;
> +	seed ^= (seed >> 16);
> +
> +	return seed;
> +}

In https://public-inbox.org/git/ba856e20-0a3c-e2d2-6744-b9abfacdc465@gmail.com/
you posted "[PATCH] Process bloom filter data as 1 byte words".
This may avoid the Big-endian vs Little-endian confusion,
that is wrong results on Big-endian architectures, but
it also may slow down the algorithm.

The public domain implementation in PMurHash.c in SMHasher
(re)implementation in Chromium (see URL above) fall backs to 1-byte
operations only if it doesn't know the endianness (or if it is neither
little-endian, nor big-endian, i.e. middle-endian or mixed-endian --
though I doubt that Git works correctly on mixed-endian anyway).


Sidenote: it looks like the current implementation if Murmur hash in
Cromium uses MurmurHash3_x86_32, i.e. little-endian unaligned-safe
implementation, but prepares data by swapping with StringToLE32
https://github.com/chromium/chromium/blob/master/components/variations/variations_murmur_hash.h


Assuming that the terminating NUL ("\0") character of a c-string is not
included in hash calculations, then murmur3_x86_32 hash has the
following results (all results are for seed equal 0):

''               -> 0x00000000
' '              -> 0x7ef49b98
'Hello world!'   -> 0x627b0c2c
'The quick brown fox jumps over the lazy dog'   -> 0x2e4ff723

C source (from Wikipedia): https://godbolt.org/z/ofa2p8
C++ source (Appleby's):    https://godbolt.org/z/BoSt6V

The implementation provided in this patch, with rotate_right (instead of
rotate_left) gives, on little-endian machine, different results:

''               -> 0x00000000
' '              -> 0xd1f27e64
'Hello world!'   -> 0xa0791ad7
'The quick brown fox jumps over the lazy dog'   -> 0x99f1676c

https://github.com/gitgitgadget/git/blob/e1b076a714d611e59d3d71c89221e41a3427fae4/bloom.c#L21
C source (via GitGitGadget): https://godbolt.org/z/R9s8Tt

Sidenote: While Godbolt.org site supports compiling with many different
compilers, including GCC, Clang (LLVM), icc (Intel), MSVC (via Wine),
and cross compiling for different platforms, including x86_64, ARM, MIPS,
PowerPC, power64 and power64le, AVR, it allows for execution only on
x86_64 i.e. little-endian.


We could create test similar to the one for SHA-1 and SHA-256 in
t/t0015-hash.sh but for murmur3, for example:

  test_expect_success 'test basic Murmur3_32 hash values' '
  	printf " " | test-tool murmur3_32 0 >actual &&
        printf "7ef49b98" >expected &&
        test_cmp expected actual &&
        ...
  '

or

  test_expect_success 'test basic Murmur3_32 hash values' '
  	printf " " | test-tool murmur3_32 0 >actual &&
        grep "7ef49b98" actual &&
        ...
  '

> +
> +static inline uint64_t get_bitmask(uint32_t pos)
> +{
> +	return ((uint64_t)1) << (pos & (BITS_PER_WORD - 1));
> +}

All right, that creates 64-bit wide mask with 1 bit set to 1 for a
64-bit word within filter data.  I just wonder if the trick with the &
operation is truly faster than using simpler to understand modulo
with compiler optimizations.

   static inline uint64_t get_bitmask(uint32_t pos)
   {
   	return ((uint64_t)1) << (pos % BITS_PER_WORD);
   }

Anyway, looks good (beside naming things, but I don't have better
proposal, and the function is static i.e. file-local anyway).

> +
> +void load_bloom_filters(void)
> +{
> +	init_bloom_filter_slab(&bloom_filters);
> +}


Actually this function doesn't load anything.  Perhaps it should be
named init_bloom_filters() or init_bloom_filters_storage(), or
bloom_filters_init()?

> +
> +void fill_bloom_key(const char *data,
> +					int len,
> +					struct bloom_key *key,
> +					struct bloom_filter_settings *settings)

The last parameter could be of 'const bloom_filter_settings *' type.

> +{
> +	int i;
> +	const uint32_t seed0 = 0x293ae76f;
> +	const uint32_t seed1 = 0x7e646e2c;

Where did those seeds values came from?

> +	const uint32_t hash0 = seed_murmur3(seed0, data, len);
> +	const uint32_t hash1 = seed_murmur3(seed1, data, len);
> +
> +	key->hashes = (uint32_t *)xcalloc(settings->num_hashes, sizeof(uint32_t));
> +	for (i = 0; i < settings->num_hashes; i++)
> +		key->hashes[i] = hash0 + i * hash1;

Note that in [3] authors say that double hashing technique has some
problems.  For one, we should ensure that hash1 is not zero, and even
better that it is odd (which makes it relatively prime to filter size
which is multiple of 64).  It also suffers from something called
"approximate fingerprint collisions".

That is why the define "enhanced double hashing" technique, which does
not suffer from those problems (Algorithm 2, page 11/15).

  +	for (i = 0; i < settings->num_hashes; i++) {
  +		key->hashes[i] = hash0;
  +
  +		hash0 = hash0 + hash1;
  +		hash1 = hash1 + i;
  +	}

This can also be written in closed form, based on equation (6)

  +	for (i = 0; i < settings->num_hashes; i++)
  +		key->hashes[i] = hash0 + i * hash1 + i*(i*i - 1)/6;


In later paper [6] the closed form for "enhanced double hashing"
(p. 188) is slightly modified (or rather they use different variant of
this technique):

  +	for (i = 0; i < settings->num_hashes; i++)
  +		key->hashes[i] = hash0 + i * hash1 + i*i;

This is a variant of more generic "enhanced double hashing", section
5.2 (Enhanced) Double Hashing Schemes (page 199):

        h_1(u) + i h_2(u) + f(i)    mod m

with f(i) = i^2 = i*i.

They have tested that enhanced double hashing with both f(i) equal i*i
and equal i*i*i, and triple hashing technique, and they have found that
it performs slightly better than straight double hashing technique
(Fig. 1, page 212, section 3).

> +}
> +
> +void add_key_to_filter(struct bloom_key *key,
> +					   struct bloom_filter *filter,
> +					   struct bloom_filter_settings *settings)

Here again the 'settings' argument can be const (as can the 'key'
parameter).

> +{
> +	int i;
> +	uint64_t mod = filter->len * BITS_PER_WORD;
> +
> +	for (i = 0; i < settings->num_hashes; i++) {
> +		uint64_t hash_mod = key->hashes[i] % mod;
> +		uint64_t block_pos = hash_mod / BITS_PER_WORD;
> +
> +		filter->data[block_pos] |= get_bitmask(hash_mod);
> +	}
> +}

All right, bloom_key is an intermediate representation that is used both
for creating Bloom filter, and for querying it.  In the latter case the
same path may be tested against Bloom filters for commits with different
number of (blob and tree) changes, and thus against Bloom filters with
different lengths.  It makes sense for bloom_key to store just values of
hash functions, without arithmetics modulo filter size.

Though I think it could be a good idea to create add_str_to_filter() as
a wrapper around add_key_to_filter() and fill_bloom_key() functions.

> +
> +struct bloom_filter *get_bloom_filter(struct repository *r,
> +				      struct commit *c)
> +{
> +	struct bloom_filter *filter;
> +	struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
> +	int i;
> +	struct diff_options diffopt;
> +
> +	if (!bloom_filters.slab_size)
> +		return NULL;

This is testing that commit slab for per-commit Bloom filters is
initialized, isn't it?

First, should we write the condition as

	if (!bloom_filters.slab_size)

or would the following be more readable

	if (bloom_filters.slab_size == 0)

Second, should we return NULL, or should we just initialize the slab?
Or is non-existence of slab treated as a signal that the Bloom filters
mechanism is turned off?

> +
> +	filter = bloom_filter_slab_at(&bloom_filters, c);

Wouldn't it be better to check if the data for commit exists already on
the slab, and create the Bloom filter for commit changes only if it does
not exists, i.e.:

  +	filter = bloom_filter_slab_peek(&bloom_filters, c);
  +	if (filter)
  +		return filter;
  +	filter = bloom_filter_slab_at(&bloom_filters, c);

> +
> +	repo_diff_setup(r, &diffopt);
> +	diffopt.flags.recursive = 1;
> +	diff_setup_done(&diffopt);

I'll punt on checking this.  Looks all right from first glance, and
follows calling sequence in https://github.com/git/git/blob/master/diff.h#L26

> +
> +	if (c->parents)
> +		diff_tree_oid(&c->parents->item->object.oid, &c->object.oid, "", &diffopt);
> +	else
> +		diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
> +	diffcore_std(&diffopt);

All right, that computes first-parent diff (or diff from empty tree of
there are no parents).

> +
> +	if (diff_queued_diff.nr <= 512) {

First, shouldn't this magic value 512 be hidden behind some symbolic
name (some preprocessor constant), e.g. BLOOM_MAX_CHANGES?  On the other
hand this value is used only once (except tests), so it might be not
worth it -- especially coming up with a good name.

Second, there is a minor issue that diff_queue_struct.nr stores the
number of filepairs, that is the number of changed files, while the
number of elements added to Bloom filter is number of changed blobs and
trees.  For example if the following files are changed:

  sub/dir/file1
  sub/file2

then diff_queued_diff.nr is 2, but number of elements to be added to
Bloom filter is 4.

  sub/dir/file1
  sub/file2
  sub/dir/
  sub/

I'm not sure if it matters in practice.

> +		struct hashmap pathmap;
> +		struct pathmap_hash_entry* e;
> +		struct hashmap_iter iter;
> +		hashmap_init(&pathmap, NULL, NULL, 0);

Stylistic issue: I have just noticed that here (and in some other
places), but not in all cases, you declare pointer types with asterisk
cuddled to type name, not to variable name, which contradicts
CodingGuidelines:

 - When declaring pointers, the star sides with the variable
   name, i.e. "char *string", not "char* string" or
   "char * string".  This makes it easier to understand code
   like "char *string, c;".

In this case it should be

  +		struct pathmap_hash_entry *e;

In many other places in this patch it is correct, though.

> +
> +		for (i = 0; i < diff_queued_diff.nr; i++) {
> +			const char* path = diff_queued_diff.queue[i]->two->path;

Is that correct that we consider only post-image name for storing
changes in Bloom filter?  Currently if file was renamed (or deleted), it
is considered changed, and `git log -- <old-name>` lists commit that
changed file name too.

> +			const char* p = path;

It should be "const char *" for both.

> +
> +			/*
> +			* Add each leading directory of the changed file, i.e. for
> +			* 'dir/subdir/file' add 'dir' and 'dir/subdir' as well, so
> +			* the Bloom filter could be used to speed up commands like
> +			* 'git log dir/subdir', too.
> +			*
> +			* Note that directories are added without the trailing '/'.
> +			*/
> +			do {
> +				char* last_slash = strrchr(p, '/');
> +
> +				FLEX_ALLOC_STR(e, path, path);

Here first 'path' is the field name, i.e. pathmap_hash_entry.path,
second 'path' is the name of local variable, aliased also to 'p'.

> +				hashmap_entry_init(&e->entry, strhash(p));

I don't know why both 'path' and 'p' are used, while both point to the
same memory (and thus have the same contents).  It is a bit confusing.
See also my previous comment.

> +				hashmap_add(&pathmap, &e->entry);
> +
> +				if (!last_slash)
> +					last_slash = (char*)p;
> +				*last_slash = '\0';
> +
> +			} while (*p);

Looks good.  We overwrite '/' with '\0', and gather shrinking pathnames
along the way.

> +
> +			diff_free_filepair(diff_queued_diff.queue[i]);
> +		}
> +
> +		filter->len = (hashmap_get_size(&pathmap) * settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;

All right, this is division by BITS_PER_WORD, rounding up.

Sidenote: I see now why hashmap was used, it was to be able to get
number of unique changes (changed blobs and trees) easily.

> +		filter->data = xcalloc(filter->len, sizeof(uint64_t));
> +
> +		hashmap_for_each_entry(&pathmap, &iter, e, entry) {
> +			struct bloom_key key;
> +			fill_bloom_key(e->path, strlen(e->path), &key, &settings);
> +			add_key_to_filter(&key, filter, &settings);
> +		}

All right.

> +
> +		hashmap_free_entries(&pathmap, struct pathmap_hash_entry, entry);
> +	} else {
> +		for (i = 0; i < diff_queued_diff.nr; i++)
> +			diff_free_filepair(diff_queued_diff.queue[i]);

All right, that frees the memory taken by diff results.

> +		filter->data = NULL;
> +		filter->len = 0;

This needs to be explicitly stated both in the commit message and in the
API documentation (in comments) that bloom_filter.len == 0 means "no
data", while "no changes" is represented as bloom_filter with len == 1
and *data == (uint64_t)0;

EDIT: actually "no changes" is also represented as bloom_filter with len
equal 0, as it turns out.

One possible alternative could be representing "no data" value with
Bloom filter of length 1 and all 64 bits set to 1, and "no changes"
represented as filter of length 0.  This is not unambiguous choice!

> +	}
> +
> +	free(diff_queued_diff.queue);
> +	DIFF_QUEUE_CLEAR(&diff_queued_diff);
> +
> +	return filter;
> +}

All right.

> +
> +int bloom_filter_contains(struct bloom_filter *filter,
> +			  struct bloom_key *key,
> +			  struct bloom_filter_settings *settings)

It might be good idea to define enum for return values, that is
NO_DATA = -1, NO = 0, MAYBE = 1.

> +{
> +	int i;
> +	uint64_t mod = filter->len * BITS_PER_WORD;
> +
> +	if (!mod)
> +		return -1;

All right, it is different way of writing

	if (filter->len == 0)
		return -1;

which means "no data" (too many elements for Bloom filter to store).
EDIT: or "no changes".

> +
> +	for (i = 0; i < settings->num_hashes; i++) {
> +		uint64_t hash_mod = key->hashes[i] % mod;
> +		uint64_t block_pos = hash_mod / BITS_PER_WORD;
> +		if (!(filter->data[block_pos] & get_bitmask(hash_mod)))
> +			return 0;

All right, if any of hash functions (hash results) doesn't match what is
stored in filter, then the key cannot be contained in the Bloom filter.

> +	}
> +
> +	return 1;

All right, otherwise the key is probably included in filter, but may be
false positive (with around 1% probability in theory).

This means that if we get value of 0, we can skip checking the diff; we
know commit is TREESAME with respect to the path given.

> +}
> diff --git a/bloom.h b/bloom.h
> new file mode 100644
> index 0000000000..7f40c751f7
> --- /dev/null
> +++ b/bloom.h
> @@ -0,0 +1,56 @@
> +#ifndef BLOOM_H
> +#define BLOOM_H

Should we #include the stdint.h header for uint32_t and uint64_t types?

> +
> +struct commit;
> +struct repository;
> +struct commit_graph;
> +

Perhaps we should add block comment for this struct, like there is one
for struct bloom_filter below.

> +struct bloom_filter_settings {
> +	uint32_t hash_version;
> +	uint32_t num_hashes;
> +	uint32_t bits_per_entry;

I guess that the type uint32_t was chosen to make it easier to store
this information and later retrieve it from the commit-graph file, isn't
it?  Otherwise those types are much too large for sensible range of
values (which would all fit in 8-bits byte).

> +};
> +
> +#define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 }
> +#define BITS_PER_WORD 64

Sidenote: While CodingGuidelines explicitly says:

 - We try to support a wide range of C compilers to compile Git with,
   including old ones.  You should not use features from newer C
   standard, even if your compiler groks them.

   There are a few exceptions to this guideline:

   [...]

   . since mid 2017 with cbc0f81d, we have been using designated
     initializers for struct (e.g. "struct t v = { .val = 'a' };").

I don't think however that using designated initializers in
DEFAULT_BLOOM_FILTER_SETTINGS is needed, as this preprocessor constant
is just below the definition of struct bloom_filter_settings type.

> +
> +/*
> + * A bloom_filter struct represents a data segment to
> + * use when testing hash values. The 'len' member
> + * dictates how many uint64_t entries are stored in
> + * 'data'.
> + */
> +struct bloom_filter {
> +	uint64_t *data;
> +	int len;
> +};

Just wondering: is there any advantage or disadvantage to putting 'len'
field first (i.e. before 'data') versus putting it after (i.e. after
'data')?  Is there a convention that Git uses?

> +
> +/*
> + * A bloom_key represents the k hash values for a
> + * given hash input. These can be precomputed and
> + * stored in a bloom_key for re-use when testing
> + * against a bloom_filter.

We might want to add that the number of hash values is given by Bloom
filter settings, and it is assumed to be the same for all bloom_key
variables / objects.

> + */
> +struct bloom_key {
> +	uint32_t *hashes;
> +};
> +
> +void load_bloom_filters(void);
> +
> +void fill_bloom_key(const char *data,
> +		    int len,
> +		    struct bloom_key *key,
> +		    struct bloom_filter_settings *settings);
> +
> +void add_key_to_filter(struct bloom_key *key,
> +					   struct bloom_filter *filter,
> +					   struct bloom_filter_settings *settings);
> +
> +struct bloom_filter *get_bloom_filter(struct repository *r,
> +				      struct commit *c);
> +
> +int bloom_filter_contains(struct bloom_filter *filter,
> +			  struct bloom_key *key,
> +			  struct bloom_filter_settings *settings);
> +
> +#endif
> diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
> new file mode 100644
> index 0000000000..331957011b
> --- /dev/null
> +++ b/t/helper/test-bloom.c
> @@ -0,0 +1,84 @@
> +#include "test-tool.h"
> +#include "git-compat-util.h"
> +#include "bloom.h"
> +#include "test-tool.h"
> +#include "cache.h"
> +#include "commit-graph.h"
> +#include "commit.h"
> +#include "config.h"
> +#include "object-store.h"
> +#include "object.h"
> +#include "repository.h"
> +#include "tree.h"
> +
> +struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
> +
> +static void print_bloom_filter(struct bloom_filter *filter) {
> +	int i;
> +
> +	if (!filter) {
> +		printf("No filter.\n");
> +		return;
> +	}
> +	printf("Filter_Length:%d\n", filter->len);
> +	printf("Filter_Data:");
> +	for (i = 0; i < filter->len; i++){
> +		printf("%"PRIx64"|", filter->data[i]);
> +	}
> +	printf("\n");
> +}
> +
> +static void add_string_to_filter(const char *data, struct bloom_filter *filter) {
> +		struct bloom_key key;
> +		int i;
> +
> +		fill_bloom_key(data, strlen(data), &key, &settings);
> +		printf("Hashes:");
> +		for (i = 0; i < settings.num_hashes; i++){
> +			printf("%08x|", key.hashes[i]);
> +		}
> +		printf("\n");
> +		add_key_to_filter(&key, filter, &settings);
> +}
> +
> +static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
> +{
> +	struct commit *c;
> +	struct bloom_filter *filter;
> +	setup_git_directory();
> +	c = lookup_commit(the_repository, commit_oid);
> +	filter = get_bloom_filter(the_repository, c);
> +	print_bloom_filter(filter);
> +}
> +
> +int cmd__bloom(int argc, const char **argv)
> +{
> +    if (!strcmp(argv[1], "generate_filter")) {
> +		struct bloom_filter filter;
> +		int i = 2;
> +		filter.len =  (settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
> +		filter.data = xcalloc(filter.len, sizeof(uint64_t));
> +
> +		if (!argv[2]){
> +			die("at least one input string expected");
> +		}
> +
> +		while (argv[i]) {
> +			add_string_to_filter(argv[i], &filter);
> +			i++;
> +		}
> +
> +		print_bloom_filter(&filter);
> +	}
> +
> +	if (!strcmp(argv[1], "get_filter_for_commit")) {
> +		struct object_id oid;
> +		const char *end;
> +		if (parse_oid_hex(argv[2], &oid, &end))
> +			die("cannot parse oid '%s'", argv[2]);
> +		load_bloom_filters();
> +		get_bloom_filter_for_commit(&oid);
> +	}
> +
> +	return 0;
> +}


I won't comment on test-tool code, as I think the Bloom filter and
Murmur3 hash tests should be structured differently, which would
completely change test-bloom.c code.

> diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
> index c9a232d238..ca4f4b0066 100644
> --- a/t/helper/test-tool.c
> +++ b/t/helper/test-tool.c
> @@ -14,6 +14,7 @@ struct test_cmd {
>  };
>  
>  static struct test_cmd cmds[] = {
> +	{ "bloom", cmd__bloom },
>  	{ "chmtime", cmd__chmtime },
>  	{ "config", cmd__config },
>  	{ "ctype", cmd__ctype },

> diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
> index c8549fd87f..05d2b32451 100644
> --- a/t/helper/test-tool.h
> +++ b/t/helper/test-tool.h
> @@ -4,6 +4,7 @@
>  #define USE_THE_INDEX_COMPATIBILITY_MACROS
>  #include "git-compat-util.h"
>  
> +int cmd__bloom(int argc, const char **argv);
>  int cmd__chmtime(int argc, const char **argv);
>  int cmd__config(int argc, const char **argv);
>  int cmd__ctype(int argc, const char **argv);

All right, looks good.

> diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
> new file mode 100755
> index 0000000000..424fe4fc29
> --- /dev/null
> +++ b/t/t0095-bloom.sh
> @@ -0,0 +1,113 @@
> +#!/bin/sh
> +
> +test_description='test bloom.c'

This description is a bit lackluster...

> +. ./test-lib.sh
> +
> +test_expect_success 'get bloom filters for commit with no changes' '
> +	git init &&
> +	git commit --allow-empty -m "c0" &&
> +	cat >expect <<-\EOF &&
> +	Filter_Length:0
> +	Filter_Data:
> +	EOF
> +	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
> +	test_cmp expect actual
> +'

A few things.  First, I wonder why we need to provide object ID;
couldn't 'test-tool bloom get_filter_for_commit' parse commit-ish
argument, or would it make it too complicated for no reason?

Second, why both "no changes" (here) and "no data" have the same
representation of filter with length equal 0?  Let's take a look at the
code.

For no changes:

  filter->len = (hashmap_get_size(&pathmap) * settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^ == 0  for no changes
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                 \-- == 0 + BITS_PER_WORD - 1     for no changes
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                 \-- == 0  for no changes
  filter->data = xcalloc(filter->len, sizeof(uint64_t));
                         ^^^^^^^^^^^ == 0  for no changes
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                 \-- is NULL or unique pointer that can be passed to free()

For more than 512 changed files:

  filter->data = NULL;
  filter->len = 0;

Not being able to distinguish between "no data" and "no changes in the
commit" cases means that we would always perform full diff for commit
with no changes, unnecessarily.  Fortunately there should be no hit to
performance, as in this case we need to simply compare objects IDs of
top tree to know that there is no change.

If it is a design decision we go with, it should be in my opinion at
least explained in the commit message explicitly.

> +
> +test_expect_success 'get bloom filter for commit with 10 changes' '
> +	rm actual &&
> +	rm expect &&
> +	mkdir smallDir &&
> +	for i in $(test_seq 0 9)
> +	do
> +		echo $i >smallDir/$i
> +	done &&
> +	git add smallDir &&
> +	git commit -m "commit with 10 changes" &&
> +	cat >expect <<-\EOF &&
> +	Filter_Length:4
> +	Filter_Data:508928809087080a|8a7648210804001|4089824400951000|841ab310098051a8|
> +	EOF
> +	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
> +	test_cmp expect actual
> +'

This test is in my opinion fragile, as it unnecessarily test the
implementation details instead of the functionality provided.  If we
change the hashing scheme (for example going from double hashing to some
variant of enhanced double hashing), or change the base hash function
(for example from Murmur3_32 to xxHash_64), or change the number of hash
functions (perhaps because changing of number of bits per element, and
thus optimal number of hash functions from 7 to 6), or change from
64-bit word blocks to 32-bit word blocks, the test would have to be
changed.

What I think would be a good test is something like t/t0011-hashmap.sh.
For example test that the Bloom filter size scales correctly could look
like this:

   test_bloom() {
   	echo "$1" | test-tool bloom $3 >actual &&
   	echo "$2" >expect &&
   	test_cmp expect actual
   }

   test_expect_success 'Bloom filter for commit size scales with number of changes' '
   	mkdir smallDir &&
	for i in $(test_seq 0 9)
	do
		echo $i >smallDir/$i
	done &&
	git add smallDir &&
	git commit -m "commit with 10 changes" &&
        HEAD=$(git rev-parse HEAD) &&
        cat | test-tool bloom >actual <<-EOF &&
        add-commit $HEAD
        len-commit $HEAD
        EOF
        echo "4" >expect &&
        test_cmp expect actual
   '

> +
> +test_expect_success EXPENSIVE 'get bloom filter for commit with 513 changes' '
> +	rm actual &&
> +	rm expect &&
> +	mkdir bigDir &&
> +	for i in $(test_seq 0 512)
> +	do
> +		echo $i >bigDir/$i
> +	done &&
> +	git add bigDir &&
> +	git commit -m "commit with 513 changes" &&
> +	cat >expect <<-\EOF &&
> +	Filter_Length:0
> +	Filter_Data:
> +	EOF
> +	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
> +	test_cmp expect actual
> +'

All right, it is good test to have (though perhaps in modified form,
less fragile one).

> +
> +test_expect_success 'compute bloom key for empty string' '
> +	cat >expect <<-\EOF &&
> +	Hashes:5615800c|5b966560|61174ab4|66983008|6c19155c|7199fab0|771ae004|
> +	Filter_Length:1
> +	Filter_Data:11000110001110|
> +	EOF
> +	test-tool bloom generate_filter "" >actual &&
> +	test_cmp expect actual
> +'

This might be unnecessarily fragile test, but it might be a good test
for double hashing or enhanced double hashing technique.  Murmur3 hash
on empty data (empty string) always return seed value, so the result of
(enhanced) double hashing technique is predictable, given two seed
values.

> +
> +test_expect_success 'compute bloom key for whitespace' '
> +	cat >expect <<-\EOF &&
> +	Hashes:1bf014e6|8a91b50b|f9335530|67d4f555|d676957a|4518359f|b3b9d5c4|
> +	Filter_Length:1
> +	Filter_Data:401004080200810|
> +	EOF
> +	test-tool bloom generate_filter " " >actual &&
> +	test_cmp expect actual
> +'

Instead of those two fragile tests (that depend on irrelevant details of
the implementation), it would be better to create test similar to those
in t/t0011-hashmap.sh, for example:

   test_expect_success 'testing Bloom filter querying' '
   	test_bloom "add abc
        add abcdef
        check abc
        check abcdef
        check abcdee
        check abcdefghi
        len" "maybe
        maybe
        no
        no
        1"
   '

Or maybe something like this:

   test_expect_success 'testing Bloom filter querying' '
   	cat >commands <<\-EOF &&
        add abc
        add abcdef
        check abc
        check abcdef
        check abcdee
        check abcdefghi
        len
        EOF

   	cat >expect <<\-EOF &&
        maybe
        maybe
        no
        no
        1
        EOF
        
        test-tool bloom <commands >actual &&
        test_cmp expect actual
   '

> +
> +test_expect_success 'compute bloom key for a root level folder' '
> +	cat >expect <<-\EOF &&
> +	Hashes:1a21016f|fff1c06d|e5c27f6b|cb933e69|b163fd67|9734bc65|7d057b63|
> +	Filter_Length:1
> +	Filter_Data:aaa800000000|
> +	EOF
> +	test-tool bloom generate_filter "A" >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'compute bloom key for a root level file' '
> +	cat >expect <<-\EOF &&
> +	Hashes:e2d51107|30970605|7e58fb03|cc1af001|19dce4ff|679ed9fd|b560cefb|
> +	Filter_Length:1
> +	Filter_Data:a8000000000000aa|
> +	EOF
> +	test-tool bloom generate_filter "file.txt" >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'compute bloom key for a deep folder' '
> +	cat >expect <<-\EOF &&
> +	Hashes:864cf838|27f055cd|c993b362|6b3710f7|0cda6e8c|ae7dcc21|502129b6|
> +	Filter_Length:1
> +	Filter_Data:1c0000600003000|
> +	EOF
> +	test-tool bloom generate_filter "A/B/C/D/E" >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'compute bloom key for a deep file' '
> +	cat >expect <<-\EOF &&
> +	Hashes:07cdf850|4af629c7|8e1e5b3e|d1468cb5|146ebe2c|5796efa3|9abf211a|
> +	Filter_Length:1
> +	Filter_Data:4020100804010080|
> +	EOF
> +	test-tool bloom generate_filter "A/B/C/D/E/file.txt" >actual &&
> +	test_cmp expect actual
> +'

What are those meant to test?  For the Bloom filter itself it doesn't
matter if we add "A/B/C/file.txt" string to filter, or "ABC" string.

What we didn't test is that changed _directories_ are also added to the
Bloom filter for a commit.  Such test could look like this:

   test_expect_success 'changed directories are added to Bloom filter' '
   	mkdir -p A/B &&
	echo "foo" >A/B/file.txt &&
	git add A/B/file.txt &&
	git commit -m "add A/B/file.txt" &&
        HEAD=$(git rev-parse HEAD) &&

   	cat >commands <<-EOF &&
        add-commit $HEAD
        check A/B/file.txt
        check A/B
        check A
        EOF

   	cat >expect <<\-EOF &&
        maybe
        maybe
	maybe
        EOF
        
        test-tool bloom <commands >actual &&
        test_cmp expect actual
   '


> +
> +test_done

Reviewed-by: Jakub Narębski <jnareb@gmail.com>

Thanks for working on this.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH v2 03/11] diff: halt tree-diff early after max_changes
  2020-02-05 22:56   ` [PATCH v2 03/11] diff: halt tree-diff early after max_changes Derrick Stolee via GitGitGadget
@ 2020-02-17  0:00     ` Jakub Narebski
  2020-02-22  0:37       ` Garima Singh
  0 siblings, 1 reply; 150+ messages in thread
From: Jakub Narebski @ 2020-02-17  0:00 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, Derrick Stolee, Derrick Stolee, SZEDER Gábor,
	Jonathan Tan, Jeff Hostetler, Taylor Blau, Jeff King,
	Garima Singh, Christian Couder, Emily Shaffer, Junio C Hamano,
	Garima Singh

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <dstolee@microsoft.com>
>
> When computing the changed-paths bloom filters for the commit-graph,
> we limit the size of the filter by restricting the number of paths
> in the diff. Instead of computing a large diff and then ignoring the
> result, it is better to halt the diff computation early.

Good idea.

>
> Create a new "max_changes" option in struct diff_options. If non-zero,
> then halt the diff computation after discovering strictly more changed
> paths. This includes paths corresponding to trees that change.

All right; also, it doesn't need to be exact, though it would be good if
it was.

512 changed paths (changed files) usually generate more than 512
elements to be added to the Bloom filter (changed directories and
files), anyway.

>
> Use this max_changes option in the bloom filter calculations. This
> reduces the time taken to compute the filters for the Linux kernel
> repo from 2m50s to 2m35s. On a large internal repository with ~500
> commits that perform tree-wide changes, the time reduced from
> 6m15s to 3m48s.

I wonder if there is some large open-source project with many commits
performing tree-wide changes, that is with many commits with more than
512 changed files with respect to the first parent.

Maybe https://github.com/whosonfirst-data/whosonfirst-data-venue-us-ny
from "Top Ten Worst Repositories to host on GitHub - Git Merge 2017"
could be a good repository to test ;-)

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>

Looks good to me, but that is from cursory examination.
Don't know the area to say anything more.

> ---
>  bloom.c     | 4 +++-
>  diff.h      | 5 +++++
>  tree-diff.c | 6 ++++++
>  3 files changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/bloom.c b/bloom.c
> index 6082193a75..818382c03b 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -134,6 +134,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>  	struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
>  	int i;
>  	struct diff_options diffopt;
> +	int max_changes = 512;
>  
>  	if (!bloom_filters.slab_size)
>  		return NULL;
> @@ -142,6 +143,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>  
>  	repo_diff_setup(r, &diffopt);
>  	diffopt.flags.recursive = 1;
> +	diffopt.max_changes = max_changes;
>  	diff_setup_done(&diffopt);
>  
>  	if (c->parents)
> @@ -150,7 +152,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>  		diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
>  	diffcore_std(&diffopt);
>  
> -	if (diff_queued_diff.nr <= 512) {
> +	if (diff_queued_diff.nr <= max_changes) {
>  		struct hashmap pathmap;
>  		struct pathmap_hash_entry* e;
>  		struct hashmap_iter iter;
> diff --git a/diff.h b/diff.h
> index 6febe7e365..9443dc1b00 100644
> --- a/diff.h
> +++ b/diff.h
> @@ -285,6 +285,11 @@ struct diff_options {
>  	/* Number of hexdigits to abbreviate raw format output to. */
>  	int abbrev;
>  
> +	/* If non-zero, then stop computing after this many changes. */
> +	int max_changes;
> +	/* For internal use only. */
> +	int num_changes;
> +
>  	int ita_invisible_in_index;
>  /* white-space error highlighting */
>  #define WSEH_NEW (1<<12)
> diff --git a/tree-diff.c b/tree-diff.c
> index 33ded7f8b3..f3d303c6e5 100644
> --- a/tree-diff.c
> +++ b/tree-diff.c
> @@ -434,6 +434,9 @@ static struct combine_diff_path *ll_diff_tree_paths(
>  		if (diff_can_quit_early(opt))
>  			break;
>  
> +		if (opt->max_changes && opt->num_changes > opt->max_changes)
> +			break;
> +
>  		if (opt->pathspec.nr) {
>  			skip_uninteresting(&t, base, opt);
>  			for (i = 0; i < nparent; i++)
> @@ -518,6 +521,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
>  
>  			/* t↓ */
>  			update_tree_entry(&t);
> +			opt->num_changes++;
>  		}
>  
>  		/* t > p[imin] */
> @@ -535,6 +539,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
>  		skip_emit_tp:
>  			/* ∀ pi=p[imin]  pi↓ */
>  			update_tp_entries(tp, nparent);
> +			opt->num_changes++;
>  		}
>  	}
>  
> @@ -552,6 +557,7 @@ struct combine_diff_path *diff_tree_paths(
>  	const struct object_id **parents_oid, int nparent,
>  	struct strbuf *base, struct diff_options *opt)
>  {
> +	opt->num_changes = 0;
>  	p = ll_diff_tree_paths(p, oid, parents_oid, nparent, base, opt);
>  
>  	/*

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH v2 04/11] commit-graph: compute Bloom filters for changed paths
  2020-02-05 22:56   ` [PATCH v2 04/11] commit-graph: compute Bloom filters for changed paths Garima Singh via GitGitGadget
@ 2020-02-17 21:56     ` Jakub Narebski
  2020-02-22  0:55       ` Garima Singh
  0 siblings, 1 reply; 150+ messages in thread
From: Jakub Narebski @ 2020-02-17 21:56 UTC (permalink / raw)
  To: Garima Singh via GitGitGadget
  Cc: git, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
	Jeff Hostetler, Taylor Blau, Jeff King, Garima Singh,
	Christian Couder, Emily Shaffer, Junio C Hamano, Garima Singh

"Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Garima Singh <garima.singh@microsoft.com>
> Subject: [PATCH v2 04/11] commit-graph: compute Bloom filters for changed paths
>
> Compute Bloom filters for the paths that changed between a commit and its
> first parent using the implementation in bloom.c, when the
> COMMIT_GRAPH_WRITE_CHANGED_PATHS flag is set. This computation is done on a
> commit-by-commit basis. We will write these Bloom filters to the commit graph
> file in the next change.

I have no major complaints about the contents of this patch (except lack
of test, and type of total_bloom_filter_data_size), but the commit
message could have been worded better.

I would write something like this instead:

  Add new COMMIT_GRAPH_WRITE_CHANGED_PATHS flag that makes Git compute
  Bloom filters that store the information about changed paths (that
  changed between a commit and its first parent) for each commit in the
  commit-graph.  This computation is done on a commit-by-commit basis.

  We will write these Bloom filters to the commit-graph file, to store
  this data on disk, in the next change in this series.

In my opinion the fact that we compute Bloom filters for each and every
commit in the commit-graph file is more important than quite obvious
fact that we use implementation from bloom.c.

>
> Helped-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> ---
>  commit-graph.c | 32 +++++++++++++++++++++++++++++++-
>  commit-graph.h |  3 ++-
>  2 files changed, 33 insertions(+), 2 deletions(-)

It would be good to have at least sanity check of this feature, perhaps
one that would check that the number of per-commit Bloom filters on slab
matches the number of commits in the commit-graph.

It could look something like this:

  test_expect_success 'create Bloom filters for all commit-graph commits' '
  	# create commit-graph with 2 commits
  	git rev-parse HEAD HEAD^ | git commit-graph write --stdin-commits &&
  	# generate Bloom filters for commit-graph commits
  	cat >commands <<\-EOF &&
  	add-graph-commits
  	filters-count
  	EOF
  	NUM_FILTERS=$(git test-tool bloom <commands) %%
  	test "$NUM_FILTERS" -eq 2
  '

>
> diff --git a/commit-graph.c b/commit-graph.c
> index 3c4d411326..724bfcffc4 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -16,6 +16,7 @@
>  #include "hashmap.h"
>  #include "replace-object.h"
>  #include "progress.h"
> +#include "bloom.h"
>  
>  #define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
>  #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
> @@ -795,9 +796,11 @@ struct write_commit_graph_context {
>  	unsigned append:1,
>  		 report_progress:1,
>  		 split:1,
> -		 check_oids:1;
> +		 check_oids:1,
> +		 changed_paths:1;

All right, this flag will be used for handling future `--changed-paths`
option to the `git commit-graph write`.

>  
>  	const struct split_commit_graph_opts *split_opts;
> +	uint32_t total_bloom_filter_data_size;

This is total size of Bloom filters data, in bytes, that will later be
used for BDAT chunk size.  However the commit-graph format uses 8 bytes
for byte-offset, not 4 bytes.  Why it is uint32_t and not uint64_t then?

>  };
>  
>  static void write_graph_chunk_fanout(struct hashfile *f,
> @@ -1140,6 +1143,28 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  	stop_progress(&ctx->progress);
>  }
>  
> +static void compute_bloom_filters(struct write_commit_graph_context *ctx)
> +{
> +	int i;
> +	struct progress *progress = NULL;
> +
> +	load_bloom_filters();
> +
> +	if (ctx->report_progress)
> +		progress = start_progress(
> +			_("Computing commit diff Bloom filters"),
> +			ctx->commits.nr);
> +

Shouldn't we initialize ctx->total_bloom_filter_data_size to 0 here?  We
cannot use compute_bloom_filters() to _update_ Bloom filters data, I
think -- we don't distinguish here between new and existing data (where
existing data size is already included in total Bloom filters size).  At
least I don't think so.


> +	for (i = 0; i < ctx->commits.nr; i++) {
> +		struct commit *c = ctx->commits.list[i];

Here we process commit in whatever order commits are in the
commits.list, which probably means lexicographical order, in practice
random order.

> +		struct bloom_filter *filter = get_bloom_filter(ctx->r, c);
> +		ctx->total_bloom_filter_data_size += sizeof(uint64_t) * filter->len;
> +		display_progress(progress, i + 1);
> +	}
> +
> +	stop_progress(&progress);
> +}
> +
>  static int add_ref_to_list(const char *refname,
>  			   const struct object_id *oid,
>  			   int flags, void *cb_data)

> @@ -1794,6 +1819,8 @@ int write_commit_graph(const char *obj_dir,
>  	ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
>  	ctx->check_oids = flags & COMMIT_GRAPH_WRITE_CHECK_OIDS ? 1 : 0;
>  	ctx->split_opts = split_opts;
> +	ctx->changed_paths = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0;
> +	ctx->total_bloom_filter_data_size = 0;
>  
>  	if (ctx->split) {
>  		struct commit_graph *g;
> @@ -1888,6 +1915,9 @@ int write_commit_graph(const char *obj_dir,
>  
>  	compute_generation_numbers(ctx);
>  
> +	if (ctx->changed_paths)
> +		compute_bloom_filters(ctx);
> +

All right.

>  	res = write_commit_graph_file(ctx);
>  
>  	if (ctx->split)
> diff --git a/commit-graph.h b/commit-graph.h
> index 7f5c933fa2..952a4b83be 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -76,7 +76,8 @@ enum commit_graph_write_flags {
>  	COMMIT_GRAPH_WRITE_PROGRESS   = (1 << 1),
>  	COMMIT_GRAPH_WRITE_SPLIT      = (1 << 2),
>  	/* Make sure that each OID in the input is a valid commit OID. */
> -	COMMIT_GRAPH_WRITE_CHECK_OIDS = (1 << 3)
> +	COMMIT_GRAPH_WRITE_CHECK_OIDS = (1 << 3),
> +	COMMIT_GRAPH_WRITE_BLOOM_FILTERS = (1 << 4)

All right.


Side note: perhaps we could add trailing comma after new enum entry,
that is

  +	COMMIT_GRAPH_WRITE_BLOOM_FILTERS = (1 << 4),

following new CodingGuidelines recommendation

 - We try to support a wide range of C compilers to compile Git with,
   including old ones.  You should not use features from newer C
   standard, even if your compiler groks them.

   There are a few exceptions to this guideline:

   . since early 2012 with e1327023ea, we have been using an enum
     definition whose last element is followed by a comma.  This, like
     an array initializer that ends with a trailing comma, can be used
     to reduce the patch noise when adding a new identifier at the end.

https://github.com/git/git/blob/master/Documentation/CodingGuidelines#L197

>  };
>  
>  struct split_commit_graph_opts {

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH v2 05/11] commit-graph: examine changed-path objects in pack order
  2020-02-05 22:56   ` [PATCH v2 05/11] commit-graph: examine changed-path objects in pack order Jeff King via GitGitGadget
@ 2020-02-18 17:59     ` Jakub Narebski
  2020-02-24 18:29       ` Garima Singh
  0 siblings, 1 reply; 150+ messages in thread
From: Jakub Narebski @ 2020-02-18 17:59 UTC (permalink / raw)
  To: Jeff King via GitGitGadget
  Cc: git, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
	Jeff Hostetler, Taylor Blau, Jeff King, Garima Singh,
	Christian Couder, Emily Shaffer, Junio C Hamano, Garima Singh

"Jeff King via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Jeff King <peff@peff.net>
>
> Looking at the diff of commit objects in pack order is much faster than
> in sha1 order, as it gives locality to the access of tree deltas

Nitpick: should we still say sha1 order?  Git is still using SHA-1 as an
*oid*, but hopefully soon it will be transitioning to NewHash = SHA-256.
(No need to change anything.)

> (whereas sha1 order is effectively random). Unfortunately the
> commit-graph code sorts the commits (several times, sometimes as an oid
> and sometimes a pointer-to-commit), and we ultimately traverse in sha1
> order.

Actually, commit-graph code needs write_commit_graph_context.commits.list
to be in lexicographical order to be able to turn position in graph into
reference to a commit.  The information about the parents of the commit
are stored using positional references within the graph file.

>
> Instead, let's remember the position at which we see each commit, and
> traverse in that order when looking at bloom filters. This drops my time
> for "git commit-graph write --changed-paths" in linux.git from ~4
> minutes to ~1.5 minutes.

Nitpick: with reordering of patches (which I think is otherwise a good
thing) this patch actually comes before the one adding "--changed-paths"
option to "git commit-graph write".  So it 'This would drop my time'
rather than 'This drops my time...' ;-)

>
> Probably the "--reachable" code path would want something similar.

Has anyone tried doing this?

>
> Or alternatively, we could use a different data structure (either a
> hash, or maybe even just a bit in "struct commit") to keep track of
> which oids we've seen, etc instead of sorting. And then we could keep
> the original order.

I think it is nice to keep those "what ifs?" thoughts in the commit
message.  They add some color.

>
> Signed-off-by: Jeff King <peff@peff.net>
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> ---
>  commit-graph.c | 34 +++++++++++++++++++++++++++++++++-
>  1 file changed, 33 insertions(+), 1 deletion(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 724bfcffc4..e125511a1c 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -17,6 +17,7 @@
>  #include "replace-object.h"
>  #include "progress.h"
>  #include "bloom.h"
> +#include "commit-slab.h"
>  
>  #define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
>  #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
> @@ -46,6 +47,29 @@
>  /* Remember to update object flag allocation in object.h */
>  #define REACHABLE       (1u<<15)
>  
> +/* Keep track of the order in which commits are added to our list. */
> +define_commit_slab(commit_pos, int);
> +static struct commit_pos commit_pos = COMMIT_SLAB_INIT(1, commit_pos);
> +
> +static void set_commit_pos(struct repository *r, const struct object_id *oid)
> +{
> +	static int32_t max_pos;
> +	struct commit *commit = lookup_commit(r, oid);
> +
> +	if (!commit)
> +		return; /* should never happen, but be lenient */
> +
> +	*commit_pos_at(&commit_pos, commit) = max_pos++;
> +}

All right, that is nice and universal function.

> +
> +static int commit_pos_cmp(const void *va, const void *vb)
> +{
> +	const struct commit *a = *(const struct commit **)va;
> +	const struct commit *b = *(const struct commit **)vb;
> +	return commit_pos_at(&commit_pos, a) -
> +	       commit_pos_at(&commit_pos, b);
> +}

Hmmm... I wonder what would happen in commit_pos was not set (like
e.g. commit-graph commits not coming from the packfile).  Let's look up
the documenation...

commit_pos_at() returns a pointer to an int... why are we comparing
pointers and not values?  Shouldn't it be

  +	return *commit_pos_at(&commit_pos, a) -
  +	       *commit_pos_at(&commit_pos, b);


With commit_pos_at() the location to store the data is allocated as
necessary (if data for commit doesn't exists), and because we are using
xalloc() the *commit_pos_at() is 0-initialized.  This means that if
commits didn't come from the packfile, we sort all commits as being
equal.  Luckily we fix that in next patch.

> +
>  char *get_commit_graph_filename(const char *obj_dir)
>  {
>  	char *filename = xstrfmt("%s/info/commit-graph", obj_dir);
> @@ -1027,6 +1051,8 @@ static int add_packed_commits(const struct object_id *oid,
>  	oidcpy(&(ctx->oids.list[ctx->oids.nr]), oid);
>  	ctx->oids.nr++;
>  
> +	set_commit_pos(ctx->r, oid);
> +
>  	return 0;
>  }
>  
> @@ -1147,6 +1173,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>  {
>  	int i;
>  	struct progress *progress = NULL;
> +	struct commit **sorted_by_pos;

In the next patch in series we would sort commits by generation number
and creation data; shouldn't this variable name be more generic to
reflect this, for example just `sorted_commits` or `commits_sorted`?

>  
>  	load_bloom_filters();
>  
> @@ -1155,13 +1182,18 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>  			_("Computing commit diff Bloom filters"),
>  			ctx->commits.nr);
>  
> +	ALLOC_ARRAY(sorted_by_pos, ctx->commits.nr);
> +	COPY_ARRAY(sorted_by_pos, ctx->commits.list, ctx->commits.nr);
> +	QSORT(sorted_by_pos, ctx->commits.nr, commit_pos_cmp);
> +

All right: allocate array, copy data, sort it.

We need to copy data because (what I think) we need commits in
lexicographical order to be able to turn the position in graph that
parents of a commit are stored as into the reference to this commit.

>  	for (i = 0; i < ctx->commits.nr; i++) {
> -		struct commit *c = ctx->commits.list[i];
> +		struct commit *c = sorted_by_pos[i];

All right: use sorted data.

>  		struct bloom_filter *filter = get_bloom_filter(ctx->r, c);
>  		ctx->total_bloom_filter_data_size += sizeof(uint64_t) * filter->len;
>  		display_progress(progress, i + 1);
>  	}
>  
> +	free(sorted_by_pos);

Can we free the slab data, i.e. call `clear_commit_pos(&commit_pos);`
here?  Otherwise we are leaking memory (well, except that finishing
command makes the operating system to free memory for us).

>  	stop_progress(&progress);
>  }

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH v2 06/11] commit-graph: examine commits by generation number
  2020-02-05 22:56   ` [PATCH v2 06/11] commit-graph: examine commits by generation number Derrick Stolee via GitGitGadget
@ 2020-02-19  0:32     ` Jakub Narebski
  2020-02-24 20:45       ` Garima Singh
  0 siblings, 1 reply; 150+ messages in thread
From: Jakub Narebski @ 2020-02-19  0:32 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
	Jeff Hostetler, Taylor Blau, Jeff King, Garima Singh,
	Christian Couder, Emily Shaffer, Junio C Hamano, Garima Singh,
	Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <dstolee@microsoft.com>
>
> When running 'git commit-graph write --changed-paths', we sort the
> commits by pack-order to save time when computing the changed-paths
> bloom filters. This does not help when finding the commits via the
> --reachable flag.

Minor improvement suggestion: s/--reachable flag/'--reachable' flag/.

>
> If not using pack-order, then sort by generation number before
> examining the diff.

All right, that is good description of what the patch does.

>                     Commits with similar generation are more likely
> to have many trees in common, making the diff faster.

Is this what causes the performance improvement, that subsequently
examined commits are more likely to have more trees in common, which
means that those trees would be hot in cache, making generating diff
faster?  Is it what profiling shows?

>
> On the Linux kernel repository, this change reduced the computation
> time for 'git commit-graph write --reachable --changed-paths' from
> 3m00s to 1m37s.

Would using the trick used for packfiles also for '--reachable', which
would mean commits examined in recency / reachability order, give
similar, worse or better performance improvements?

We would want this sorting order as one of possibilities anyway, because
'--stdin-commits' we could get commits in random order.

>
> Helped-by: Jeff King <peff@peff.net>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> ---
>  commit-graph.c | 33 ++++++++++++++++++++++++++++++---
>  1 file changed, 30 insertions(+), 3 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index e125511a1c..32a315058f 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -70,6 +70,25 @@ static int commit_pos_cmp(const void *va, const void *vb)
>  	       commit_pos_at(&commit_pos, b);
>  }
>  
> +static int commit_gen_cmp(const void *va, const void *vb)
> +{
> +	const struct commit *a = *(const struct commit **)va;
> +	const struct commit *b = *(const struct commit **)vb;
> +
> +	/* lower generation commits first */

Shouldn't higher generation commits come first, in recency-like order?
Or it doesn't matter if it is sorted in ascending or descending order,
as long as commits with close generation numbers are examined close
together?

> +	if (a->generation < b->generation)
> +		return -1;
> +	else if (a->generation > b->generation)
> +		return 1;
> +
> +	/* use date as a heuristic when generations are equal */
> +	if (a->date < b->date)
> +		return -1;
> +	else if (a->date > b->date)
> +		return 1;
> +	return 0;
> +}

I thought we have had such comparison function defined somewhere in Git
already, but I think I'm wrong here.

> +
>  char *get_commit_graph_filename(const char *obj_dir)
>  {
>  	char *filename = xstrfmt("%s/info/commit-graph", obj_dir);
> @@ -821,7 +840,8 @@ struct write_commit_graph_context {
>  		 report_progress:1,
>  		 split:1,
>  		 check_oids:1,
> -		 changed_paths:1;
> +		 changed_paths:1,
> +		 order_by_pack:1;
>  
>  	const struct split_commit_graph_opts *split_opts;
>  	uint32_t total_bloom_filter_data_size;
> @@ -1184,7 +1204,11 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>  
>  	ALLOC_ARRAY(sorted_by_pos, ctx->commits.nr);
>  	COPY_ARRAY(sorted_by_pos, ctx->commits.list, ctx->commits.nr);
> -	QSORT(sorted_by_pos, ctx->commits.nr, commit_pos_cmp);
> +
> +	if (ctx->order_by_pack)
> +		QSORT(sorted_by_pos, ctx->commits.nr, commit_pos_cmp);
> +	else
> +		QSORT(sorted_by_pos, ctx->commits.nr, commit_gen_cmp);

Here 'sorted_b_pos' variable name no longer reflects reality...
(see comment to the previous patch in the series).

>  
>  	for (i = 0; i < ctx->commits.nr; i++) {
>  		struct commit *c = sorted_by_pos[i];
> @@ -1902,6 +1926,7 @@ int write_commit_graph(const char *obj_dir,
>  	}
>  
>  	if (pack_indexes) {
> +		ctx->order_by_pack = 1;
>  		if ((res = fill_oids_from_packs(ctx, pack_indexes)))
>  			goto cleanup;
>  	}
> @@ -1911,8 +1936,10 @@ int write_commit_graph(const char *obj_dir,
>  			goto cleanup;
>  	}
>  
> -	if (!pack_indexes && !commit_hex)
> +	if (!pack_indexes && !commit_hex) {
> +		ctx->order_by_pack = 1;
>  		fill_oids_from_all_packs(ctx);
> +	}
>  
>  	close_reachable(ctx);

All right, that covers all cases where 'git commit-graph write' writes
serialized commit-graph based on the commits found in packfiles:
'--stdin-packs' and default no option case, in that order.

Looks good.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH v2 07/11] commit-graph: write Bloom filters to commit graph file
  2020-02-05 22:56   ` [PATCH v2 07/11] commit-graph: write Bloom filters to commit graph file Garima Singh via GitGitGadget
@ 2020-02-19 15:13     ` Jakub Narebski
  2020-02-24 21:14       ` Garima Singh
  0 siblings, 1 reply; 150+ messages in thread
From: Jakub Narebski @ 2020-02-19 15:13 UTC (permalink / raw)
  To: Garima Singh via GitGitGadget
  Cc: git, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
	Jeff Hostetler, Taylor Blau, Jeff King, Garima Singh,
	Christian Couder, Emily Shaffer, Junio C Hamano, Garima Singh

"Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Garima Singh <garima.singh@microsoft.com>
>
> Update the technical documentation for commit-graph-format with the formats for
> the Bloom filter index (BIDX) and Bloom filter data (BDAT) chunks. Write the
> computed Bloom filters information to the commit graph file using this format.

Nice description.

The only minor nitpick is with the formating: it is 80-character wide,
which is a bit wide.

>
> Helped-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> ---
>  .../technical/commit-graph-format.txt         |  24 ++++
>  commit-graph.c                                | 118 +++++++++++++++++-
>  commit-graph.h                                |   7 +-
>  3 files changed, 145 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
> index a4f17441ae..22e511643d 100644
> --- a/Documentation/technical/commit-graph-format.txt
> +++ b/Documentation/technical/commit-graph-format.txt
> @@ -17,6 +17,9 @@ metadata, including:
>  - The parents of the commit, stored using positional references within
>    the graph file.
>  
> +- The Bloom filter of the commit carrying the paths that were changed between
> +  the commit and its first parent.
> +

All right.

Should we also state that it is optional (meta)data?  This would be
first optional piece of data stored in commit-graph, I think.

>  These positional references are stored as unsigned 32-bit integers
>  corresponding to the array position within the list of commit OIDs. Due
>  to some special constants we use to track parents, we can store at most
> @@ -93,6 +96,27 @@ CHUNK DATA:
>        positions for the parents until reaching a value with the most-significant
>        bit on. The other bits correspond to the position of the last parent.
>  
> +  Bloom Filter Index (ID: {'B', 'I', 'D', 'X'}) (N * 4 bytes) [Optional]
> +    * The ith entry, BIDX[i], stores the number of 8-byte word blocks in all
> +      Bloom filters from commit 0 to commit i (inclusive) in lexicographic
> +      order. The Bloom filter for the i-th commit spans from BIDX[i-1] to
> +      BIDX[i] (plus header length), where BIDX[-1] is 0.
> +    * The BIDX chunk is ignored if the BDAT chunk is not present.

All right.  Looks good.

> +
> +  Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]
> +    * It starts with header consisting of three unsigned 32-bit integers:
> +      - Version of the hash algorithm being used. We currently only support
> +	value 1 which implies the murmur3 hash implemented exactly as described
> +	in https://en.wikipedia.org/wiki/MurmurHash#Algorithm

First a minor issue: shouldn't this nested unordered list be indented
with a hanging indent formatted with spaces?  That is be formatted like
the following:

  +  Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]
  +    * It starts with header consisting of three unsigned 32-bit integers:
  +      - Version of the hash algorithm being used. We currently only support
  +        value 1 which implies the murmur3 hash implemented exactly as
  +        described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm

But the existing formatting with spaces and tabs might be fine as it is,
that is it renders as nested list with Asciidoc; it only looks a bit
weird as patch, not so as text.

Second, and more important: it is in my opinion not enough information,
at least if we are assuming that the information in this document should
be enough for clean-room reimplementation of Bloom filter functionality
(for example by JGit).  To generate compatible Bloom filters, one needs
also the information on how to create $k$ functionally-independent hash
functions out of murmur3 hash.  We do it currently using double hashing
technique; if that changes then the exact set of bits in the Bloom
filter would also change.

The additional description could look something like the following:

  +    * It starts with header consisting of three unsigned 32-bit integers:
  +      - Version of the hash algorithm being used. We currently only support
  +        value 1 which implies the murmur3_32 hash implemented exactly as
  +        described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
  +        and double hashing technique with 0x293ae76f and 0x7e646e2c seeds
  +        as described in https://doi.org/10.1007/978-3-540-30494-4_26
  +        "Bloom Filters in Probabilistic Verification"

Also, it should be explicitly noted that we use murmur3_32, because
there is also 128-bit version of murmur3 hash.

> +      - The number of times a path is hashed and hence the number of bit positions
> +	that cumulatively determine whether a file is present in the commit.

All right, in the original Bloom filter it was the number of different
hash functions.  With the double hashing technique, it is the number of
times a path is hashed.

> +      - The minimum number of bits 'b' per entry in the Bloom filter. If the filter
> +	contains 'n' entries, then the filter size is the minimum number of 64-bit
> +	words that contain n*b bits.

All right, that means empty Bloom filter, representing "no changes",
with 'n' equal 0 entries, is represented as size 0 filter.  That is, if
we read this rule exactly as written.

Should we add the information that size 0 / length 0 filter is
considered "no data" case?  Or should we leave it to implementation?

There are two corner cases:
- "no changes" case, where all queries are answered with "no"
  can be represented as filter of size 0, or as Bloom filter with all
  bits set to 0
- "no data" case (used when there are more than 512 changed files)
  where all queries are answered with "maybe", currently represented
  as filter of size 0; can also be represented as Bloom filter with all
  bits set to 1

> +    * The rest of the chunk is the concatenation of all the computed Bloom
> +      filters for the commits in lexicographic order.

All right.

> +    * The BDAT chunk is present iff BIDX is present.

Perhaps we should spell 'iff' in full, that is 'if and only if'?

> +
>    Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional]
>        This list of H-byte hashes describe a set of B commit-graph files that
>        form a commit-graph chain. The graph position for the ith commit in this
> diff --git a/commit-graph.c b/commit-graph.c
> index 32a315058f..4585b3b702 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -24,8 +24,10 @@
>  #define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
>  #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
>  #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
> +#define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
> +#define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
>  #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
> -#define MAX_NUM_CHUNKS 5
> +#define MAX_NUM_CHUNKS 7
>  
>  #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
>  
> @@ -325,6 +327,32 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
>  				chunk_repeated = 1;
>  			else
>  				graph->chunk_base_graphs = data + chunk_offset;
> +			break;
> +
> +		case GRAPH_CHUNKID_BLOOMINDEXES:
> +			if (graph->chunk_bloom_indexes)
> +				chunk_repeated = 1;
> +			else
> +				graph->chunk_bloom_indexes = data + chunk_offset;
> +			break;
> +
> +		case GRAPH_CHUNKID_BLOOMDATA:
> +			if (graph->chunk_bloom_data)
> +				chunk_repeated = 1;
> +			else {
> +				uint32_t hash_version;
> +				graph->chunk_bloom_data = data + chunk_offset;
> +				hash_version = get_be32(data + chunk_offset);
> +
> +				if (hash_version != 1)
> +					break;

Shouldn't we mark Bloom filter as not to be used?  Or is it left for
later commit?

In the future it might be good idea to notify the user (perhaps
protected with some advice.* option) that there is problem with Bloom
filter data, namely that we have encountered unsupported hash version.

> +
> +				graph->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings));

Why is this structure allocated dynamically?  We are leaking admittedly
a small amount of memory because we never free this xmalloc() result.

If we need this field being a pointer to struct to have NULL mean no
supported Bloom filter data, we could have instead use chunk_bloom_*
fields instead - we can set at least one of them to NULL.

> +				graph->bloom_filter_settings->hash_version = hash_version;
> +				graph->bloom_filter_settings->num_hashes = get_be32(data + chunk_offset + 4);
> +				graph->bloom_filter_settings->bits_per_entry = get_be32(data + chunk_offset + 8);

All right; these 4 and 8 are sizeof(uint32_t) and 2*sizeof(uint32_t),
respectively.

> +			}
> +			break;
>  		}
>  
>  		if (chunk_repeated) {
> @@ -343,6 +371,17 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
>  		last_chunk_offset = chunk_offset;
>  	}
>  
> +	/* We need both the bloom chunks to exist together. Else ignore the data */
> +	if ((graph->chunk_bloom_indexes && !graph->chunk_bloom_data)
> +		 || (!graph->chunk_bloom_indexes && graph->chunk_bloom_data)) {
> +		graph->chunk_bloom_indexes = NULL;
> +		graph->chunk_bloom_data = NULL;
> +		graph->bloom_filter_settings = NULL;
> +	}
> +
> +	if (graph->chunk_bloom_indexes && graph->chunk_bloom_data)
> +		load_bloom_filters();

Wouldn't it be simpler to rely on the fact that both Bloom chunks must
exists for it to matter, and write it like this:

  +	if (graph->chunk_bloom_indexes && graph->chunk_bloom_data) {
  +		load_bloom_filters();
  +	} else {
  +		graph->chunk_bloom_indexes = NULL;
  +		graph->chunk_bloom_data = NULL;
  +		graph->bloom_filter_settings = NULL;
  +	}

> +
>  	hashcpy(graph->oid.hash, graph->data + graph->data_len - graph->hash_len);
>  
>  	if (verify_commit_graph_lite(graph)) {
> @@ -1040,6 +1079,59 @@ static void write_graph_chunk_extra_edges(struct hashfile *f,
>  	}
>  }
>  
> +static void write_graph_chunk_bloom_indexes(struct hashfile *f,
> +					    struct write_commit_graph_context *ctx)
> +{
> +	struct commit **list = ctx->commits.list;
> +	struct commit **last = ctx->commits.list + ctx->commits.nr;
> +	uint32_t cur_pos = 0;
> +	struct progress *progress = NULL;
> +	int i = 0;
> +
> +	if (ctx->report_progress)
> +		progress = start_delayed_progress(
> +			_("Writing changed paths Bloom filters index"),
> +			ctx->commits.nr);
> +
> +	while (list < last) {
> +		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
> +		cur_pos += filter->len;
> +		display_progress(progress, ++i);
> +		hashwrite_be32(f, cur_pos);
> +		list++;
> +	}
> +
> +	stop_progress(&progress);
> +}

All right, looks good.

> +
> +static void write_graph_chunk_bloom_data(struct hashfile *f,
> +					 struct write_commit_graph_context *ctx,
> +					 struct bloom_filter_settings *settings)
> +{
> +	struct commit **list = ctx->commits.list;
> +	struct commit **last = ctx->commits.list + ctx->commits.nr;
> +	struct progress *progress = NULL;
> +	int i = 0;
> +
> +	if (ctx->report_progress)
> +		progress = start_delayed_progress(
> +			_("Writing changed paths Bloom filters data"),
> +			ctx->commits.nr);
> +
> +	hashwrite_be32(f, settings->hash_version);
> +	hashwrite_be32(f, settings->num_hashes);
> +	hashwrite_be32(f, settings->bits_per_entry);
> +
> +	while (list < last) {
> +		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
> +		display_progress(progress, ++i);
> +		hashwrite(f, filter->data, filter->len * sizeof(uint64_t));
> +		list++;
> +	}
> +
> +	stop_progress(&progress);
> +}

All right, looks good.

Side note: why have while loop here instead of for loop, like in
previous patches?  I'm not saying this is a bad idea (especially with
same names for same variables).

> +
>  static int oid_compare(const void *_a, const void *_b)
>  {
>  	const struct object_id *a = (const struct object_id *)_a;
> @@ -1198,8 +1290,8 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>  	load_bloom_filters();
>  
>  	if (ctx->report_progress)
> -		progress = start_progress(
> -			_("Computing commit diff Bloom filters"),
> +		progress = start_delayed_progress(
> +			_("Computing changed paths Bloom filters"),
>  			ctx->commits.nr);
>

Ooops.  This look like a fixup which should be made to the original
earlier commit instead, isn't it?

>  	ALLOC_ARRAY(sorted_by_pos, ctx->commits.nr);
> @@ -1444,6 +1536,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>  	struct strbuf progress_title = STRBUF_INIT;
>  	int num_chunks = 3;
>  	struct object_id file_hash;
> +	struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
>  
>  	if (ctx->split) {
>  		struct strbuf tmp_file = STRBUF_INIT;
> @@ -1488,6 +1581,12 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>  		chunk_ids[num_chunks] = GRAPH_CHUNKID_EXTRAEDGES;
>  		num_chunks++;
>  	}
> +	if (ctx->changed_paths) {
> +		chunk_ids[num_chunks] = GRAPH_CHUNKID_BLOOMINDEXES;
> +		num_chunks++;
> +		chunk_ids[num_chunks] = GRAPH_CHUNKID_BLOOMDATA;
> +		num_chunks++;
> +	}

All right, adding chunks and counting them.

>  	if (ctx->num_commit_graphs_after > 1) {
>  		chunk_ids[num_chunks] = GRAPH_CHUNKID_BASE;
>  		num_chunks++;
> @@ -1506,6 +1605,15 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>  						4 * ctx->num_extra_edges;
>  		num_chunks++;
>  	}
> +	if (ctx->changed_paths) {
> +		chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
> +						sizeof(uint32_t) * ctx->commits.nr;
> +		num_chunks++;
> +
> +		chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
> +						sizeof(uint32_t) * 3 + ctx->total_bloom_filter_data_size;
> +		num_chunks++;
> +	}

All right, calculating chunk offsets.

>  	if (ctx->num_commit_graphs_after > 1) {
>  		chunk_offsets[num_chunks + 1] = chunk_offsets[num_chunks] +
>  						hashsz * (ctx->num_commit_graphs_after - 1);
> @@ -1543,6 +1651,10 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>  	write_graph_chunk_data(f, hashsz, ctx);
>  	if (ctx->num_extra_edges)
>  		write_graph_chunk_extra_edges(f, ctx);
> +	if (ctx->changed_paths) {
> +		write_graph_chunk_bloom_indexes(f, ctx);
> +		write_graph_chunk_bloom_data(f, ctx, &bloom_settings);
> +	}

All right, writing BIDX and BDAT chunks with default settings.

By the way, in the future, when appending to existing commit-graph file,
shouldn't we re-use existing settings even if they are different from
default settings?  But that is question for the future...

>  	if (ctx->num_commit_graphs_after > 1 &&
>  	    write_graph_chunk_base(f, ctx)) {
>  		return -1;
> diff --git a/commit-graph.h b/commit-graph.h
> index 952a4b83be..25fefefb3e 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -10,6 +10,7 @@
>  #define GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD "GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD"
>  
>  struct commit;
> +struct bloom_filter_settings;
>  
>  char *get_commit_graph_filename(const char *obj_dir);
>  int open_commit_graph(const char *graph_file, int *fd, struct stat *st);
> @@ -58,6 +59,10 @@ struct commit_graph {
>  	const unsigned char *chunk_commit_data;
>  	const unsigned char *chunk_extra_edges;
>  	const unsigned char *chunk_base_graphs;
> +	const unsigned char *chunk_bloom_indexes;
> +	const unsigned char *chunk_bloom_data;

All right.

> +
> +	struct bloom_filter_settings *bloom_filter_settings;

Why it is pointer to struct, instead of being just struct type?
Is there reason for that?

>  };
>  
>  struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st);
> @@ -77,7 +82,7 @@ enum commit_graph_write_flags {
>  	COMMIT_GRAPH_WRITE_SPLIT      = (1 << 2),
>  	/* Make sure that each OID in the input is a valid commit OID. */
>  	COMMIT_GRAPH_WRITE_CHECK_OIDS = (1 << 3),
> -	COMMIT_GRAPH_WRITE_BLOOM_FILTERS = (1 << 4)
> +	COMMIT_GRAPH_WRITE_BLOOM_FILTERS = (1 << 4),

This looks like accidental change; if we want to use trailing comma in
enum, this change should be in my opinion done in the commit that added
COMMIT_GRAPH_WRITE_BLOOM_FILTERS (as I have written in a comment there).

>  };
>  
>  struct split_commit_graph_opts {

Thank you for your work on this series.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH v2 08/11] commit-graph: reuse existing Bloom filters during write.
  2020-02-05 22:56   ` [PATCH v2 08/11] commit-graph: reuse existing Bloom filters during write Garima Singh via GitGitGadget
@ 2020-02-20 18:48     ` Jakub Narebski
  2020-02-24 21:45       ` Garima Singh
  0 siblings, 1 reply; 150+ messages in thread
From: Jakub Narebski @ 2020-02-20 18:48 UTC (permalink / raw)
  To: Garima Singh via GitGitGadget
  Cc: git, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
	Jeff Hostetler, Taylor Blau, Jeff King, Garima Singh,
	Christian Couder, Emily Shaffer, Junio C Hamano, Garima Singh

"Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Garima Singh <garima.singh@microsoft.com>
>
> Read previously computed Bloom filters from the commit-graph file if
> possible to avoid recomputing during commit-graph write.

All right, what is written makes sense for this point in patch series.

But it my opinion it is more important to state that this commit adds
"parsing" of the Bloom filter data from commit-graph file.  This means
that it needs to be calculated only once, then stored in commit-graph,
ready to be re-used.

>
> See Documentation/technical/commit-graph-format for the format in which
> the Bloom filter information is written to the commit graph file.
>
> To read Bloom filter for a given commit with lexicographic position
> 'i' we need to:
> 1. Read BIDX[i] which essentially gives us the starting index in BDAT for
>    filter of commit i+1. It is essentially the index past the end
>    of the filter of commit i. It is called end_index in the code.
>
> 2. For i>0, read BIDX[i-1] which will give us the starting index in BDAT
>    for filter of commit i. It is called the start_index in the code.
>    For the first commit, where i = 0, Bloom filter data starts at the
>    beginning, just past the header in the BDAT chunk. Hence, start_index
>    will be 0.
>
> 3. The length of the filter will be end_index - start_index, because
>    BIDX[i] gives the cumulative 8-byte words including the ith
>    commit's filter.
>
> We toggle whether Bloom filters should be recomputed based on the
> compute_if_null flag.

Nitpick: the flag (the parameter) is called compute_if_not_present, not
compute_if_null.

All right, this explanation is nice and clear.

>
> Helped-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> ---
>  bloom.c               | 49 ++++++++++++++++++++++++++++++++++++++++++-
>  bloom.h               |  4 +++-
>  commit-graph.c        |  7 ++++---
>  t/helper/test-bloom.c |  2 +-
>  4 files changed, 56 insertions(+), 6 deletions(-)
>
> diff --git a/bloom.c b/bloom.c
> index 818382c03b..90d84dc713 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -1,5 +1,7 @@
>  #include "git-compat-util.h"
>  #include "bloom.h"
> +#include "commit.h"
> +#include "commit-slab.h"
>  #include "commit-graph.h"
>  #include "object-store.h"
>  #include "diff.h"
> @@ -127,8 +129,39 @@ void add_key_to_filter(struct bloom_key *key,
>  	}
>  }
>  
> +static int load_bloom_filter_from_graph(struct commit_graph *g,
> +				   struct bloom_filter *filter,
> +				   struct commit *c)
> +{
> +	uint32_t lex_pos, start_index, end_index;
> +
> +	while (c->graph_pos < g->num_commits_in_base)
> +		g = g->base_graph;
> +
> +	/* The commit graph commit 'c' lives in doesn't carry bloom filters. */
> +	if (!g->chunk_bloom_indexes)
> +		return 0;
> +
> +	lex_pos = c->graph_pos - g->num_commits_in_base;

All right, this finds lexicographical position of the commit following
the chain of incremental commit-graph files, and also check if the
commit-graph fragment that contains the commit in question has Bloom
filter data included.

> +
> +	end_index = get_be32(g->chunk_bloom_indexes + 4 * lex_pos);
> +
> +	if (lex_pos)

Wouldn't it be better to be more explicit, and write

  +	if (lex_pos > 0)


> +		start_index = get_be32(g->chunk_bloom_indexes + 4 * (lex_pos - 1));
> +	else
> +		start_index = 0;

All right, here we find start_index and end_index.

It might be good idea to at least assert() that start_index <= end_index,
though that should not happen (that is why I propose for this check to
be compiled on only for debug builds).

> +
> +	filter->len = end_index - start_index;
> +	filter->data = (uint64_t *)(g->chunk_bloom_data +
> +					sizeof(uint64_t) * start_index +
> +					BLOOMDATA_CHUNK_HEADER_SIZE);

All right, nice use of constant.

> +
> +	return 1;
> +}
> +
>  struct bloom_filter *get_bloom_filter(struct repository *r,
> -				      struct commit *c)
> +				      struct commit *c,
> +				      int compute_if_not_present)
>  {
>  	struct bloom_filter *filter;
>  	struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
> @@ -141,6 +174,20 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>  
>  	filter = bloom_filter_slab_at(&bloom_filters, c);
>  
> +	if (!filter->data) {
> +		load_commit_graph_info(r, c);
> +		if (c->graph_pos != COMMIT_NOT_FROM_GRAPH &&
> +			r->objects->commit_graph->chunk_bloom_indexes) {

All right, the limitation that the top layer of incremental commit graph
needs to have Bloom filters enabled for it to be even considered is
reasonable tradeoff, in my opinion.

> +			if (load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
> +				return filter;
> +			else
> +				return NULL;

If it should have filter, return it, otherwise return NULL.

I wonder however when it can return NULL (and whether it should compute
Bloom filters if required instead).

> +		}
> +	}
> +
> +	if (filter->data || !compute_if_not_present)
> +		return filter;

If we have filter from slab, return it.  All right.

However, according to documentation contained in comments in
commit-slab.h, bloom_filter_slab_at() will allocate the location to
store the data, and return freshly allocated memory... fortunately it
uses xcalloc() so returned bloom_filter would have ->len == 0 and
->data == 0.

> +
>  	repo_diff_setup(r, &diffopt);
>  	diffopt.flags.recursive = 1;
>  	diffopt.max_changes = max_changes;
> diff --git a/bloom.h b/bloom.h
> index 7f40c751f7..76f8a9ad0c 100644
> --- a/bloom.h
> +++ b/bloom.h
> @@ -13,6 +13,7 @@ struct bloom_filter_settings {
>  
>  #define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 }
>  #define BITS_PER_WORD 64
> +#define BLOOMDATA_CHUNK_HEADER_SIZE 3*sizeof(uint32_t)

All right.

>  
>  /*
>   * A bloom_filter struct represents a data segment to
> @@ -47,7 +48,8 @@ void add_key_to_filter(struct bloom_key *key,
>  					   struct bloom_filter_settings *settings);
>  
>  struct bloom_filter *get_bloom_filter(struct repository *r,
> -				      struct commit *c);
> +				      struct commit *c,
> +				      int compute_if_not_present);
>

All right, adding new parameter (changing function signature).

>  int bloom_filter_contains(struct bloom_filter *filter,
>  			  struct bloom_key *key,
> diff --git a/commit-graph.c b/commit-graph.c
> index 4585b3b702..c0e9834bf2 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -1094,7 +1094,7 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f,
>  			ctx->commits.nr);
>  
>  	while (list < last) {
> -		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
> +		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
>  		cur_pos += filter->len;
>  		display_progress(progress, ++i);
>  		hashwrite_be32(f, cur_pos);
> @@ -1123,7 +1123,7 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
>  	hashwrite_be32(f, settings->bits_per_entry);
>  
>  	while (list < last) {
> -		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
> +		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
>  		display_progress(progress, ++i);
>  		hashwrite(f, filter->data, filter->len * sizeof(uint64_t));
>  		list++;

All right, if needed (that is, if '--changed-path' option from the
future commit is provided to 'git commit-graph write'),
compute_bloom_filters() would be called befor write_commit_graph_file(),
which in turn runs write_graph_chunk_bloom_index() and *_data().


Actually, when writing Bloom data chunks (BIDX and BDAT) we could have
requested recomputing filters if necessary: slab storage works as
memoization, so you would calculate Bloom filter data for each commit in
the commit-graph only once.  And write_graph_chunk_bloom_indexes()
and write_graph_chunk_bloom_data() are called only if ctx->changed_paths
is true.

So it would work with

  +		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 1);

Only in the future we would really need to call with compute_if_not_present
parameter set to falsy value.

> @@ -1304,7 +1304,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>  
>  	for (i = 0; i < ctx->commits.nr; i++) {
>  		struct commit *c = sorted_by_pos[i];
> -		struct bloom_filter *filter = get_bloom_filter(ctx->r, c);
> +		struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1);
>  		ctx->total_bloom_filter_data_size += sizeof(uint64_t) * filter->len;
>  		display_progress(progress, i + 1);
>  	}
> @@ -2314,6 +2314,7 @@ void free_commit_graph(struct commit_graph *g)
>  		g->data = NULL;
>  		close(g->graph_fd);
>  	}
> +	free(g->bloom_filter_settings);
>  	free(g->filename);
>  	free(g);

Shouldn't this fixup be added to earlier commit?

>  }
> diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
> index 331957011b..9b4be97f75 100644
> --- a/t/helper/test-bloom.c
> +++ b/t/helper/test-bloom.c
> @@ -47,7 +47,7 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
>  	struct bloom_filter *filter;
>  	setup_git_directory();
>  	c = lookup_commit(the_repository, commit_oid);
> -	filter = get_bloom_filter(the_repository, c);
> +	filter = get_bloom_filter(the_repository, c, 1);
>  	print_bloom_filter(filter);
>  }

I would like to see some tests, but that needs to wait for patch that
adds --changed-paths option to the 'write' subcommand.

Things to be tested:
1. That after reading commit-graph with Bloom filter:
   - that commit(s) in commit-graph have Bloom filter
   - that commits outside commit-graph do not have Bloom filter
2. That incremental commit-graph feature works:
   - for commits in deeper layer that have Bloom filter chunks
   - for commits in deeper layer that do not have Bloom filter chunks

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH v2 09/11] commit-graph: add --changed-paths option to write subcommand
  2020-02-05 22:56   ` [PATCH v2 09/11] commit-graph: add --changed-paths option to write subcommand Garima Singh via GitGitGadget
@ 2020-02-20 20:28     ` Jakub Narebski
  2020-02-24 21:51       ` Garima Singh
  2020-02-20 22:10     ` Bryan Turner
  1 sibling, 1 reply; 150+ messages in thread
From: Jakub Narebski @ 2020-02-20 20:28 UTC (permalink / raw)
  To: Garima Singh via GitGitGadget
  Cc: git, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
	Jeff Hostetler, Taylor Blau, Jeff King, Garima Singh,
	Christian Couder, Emily Shaffer, Junio C Hamano, Garima Singh

"Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Garima Singh <garima.singh@microsoft.com>
>
> Add --changed-paths option to git commit-graph write. This option will
> allow users to compute information about the paths that have changed
> between a commit and its first parent, and write it into the commit graph
> file. If the option is passed to the write subcommand we set the
> COMMIT_GRAPH_WRITE_BLOOM_FILTERS flag and pass it down to the
> commit-graph logic.

In the manpage you write that this operation (computing Bloom filters)
can take a while on large repositories.  Could you perhaps provide some
numbers: how much longer does it take to write commit-graph file with
and without '--changed-paths' for example for Linux kernel, or some
other large repository?  Thanks in advance.

>
> Helped-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> ---
>  Documentation/git-commit-graph.txt | 5 +++++
>  builtin/commit-graph.c             | 9 +++++++--
>  2 files changed, 12 insertions(+), 2 deletions(-)

What is missing is some sanity tests: that bloom index and bloom data
chunks are not present without '--changed-paths', and that they are
added with '--changed-paths'.

If possible, maybe also check in a separate test that the size of
bloom_index chunk agrees with the number of commits in the commit graph.


Also, we can now add those tests I have wrote about in my review of
previous patch, that is:

1. If you write commit-graph with --changed-paths, and either add some
   commits later or exclude some commits from the commit graph, then:

   a.) commit(s) in commit-graph have Bloom filter
   b.) commit(s) not in commit-graph do not have Bloom filter

2. If you write commit-graph without --changed-paths as base layer,
   and then write next layer with --changed-paths and --split, then:

   a.) commit(s) in top layer have Bloom filter(s)
   b.) commit(s) in bottom layer don't have Bloom filter(s)

>
> diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
> index bcd85c1976..907d703b30 100644
> --- a/Documentation/git-commit-graph.txt
> +++ b/Documentation/git-commit-graph.txt
> @@ -54,6 +54,11 @@ or `--stdin-packs`.)
>  With the `--append` option, include all commits that are present in the
>  existing commit-graph file.
>  +
> +With the `--changed-paths` option, compute and write information about the
> +paths changed between a commit and it's first parent. This operation can
> +take a while on large repositories. It provides significant performance gains
> +for getting history of a directory or a file with `git log -- <path>`.
> ++

Should we write about limitation that the topmost layer in the split
commit graph needs to be written with '--changed-paths' for Git to use
this information?  Or perhaps we should try (in the future) to remove
this limitation??

>  With the `--split` option, write the commit-graph as a chain of multiple
>  commit-graph files stored in `<dir>/info/commit-graphs`. The new commits
>  not already in the commit-graph are added in a new "tip" file. This file
> diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
> index e0c6fc4bbf..261dcce091 100644
> --- a/builtin/commit-graph.c
> +++ b/builtin/commit-graph.c
> @@ -9,7 +9,7 @@
>  
>  static char const * const builtin_commit_graph_usage[] = {
>  	N_("git commit-graph verify [--object-dir <objdir>] [--shallow] [--[no-]progress]"),
> -	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--[no-]progress] <split options>"),
> +	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--changed-paths] [--[no-]progress] <split options>"),
>  	NULL
>  };
>  
> @@ -19,7 +19,7 @@ static const char * const builtin_commit_graph_verify_usage[] = {
>  };
>  
>  static const char * const builtin_commit_graph_write_usage[] = {
> -	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--[no-]progress] <split options>"),
> +	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--changed-paths] [--[no-]progress] <split options>"),
>  	NULL
>  };
>

All right.

> @@ -32,6 +32,7 @@ static struct opts_commit_graph {
>  	int split;
>  	int shallow;
>  	int progress;
> +	int enable_changed_paths;

Bikeshed painting: should this field be called enable_changed_paths or
simply changed_paths?

>  } opts;
>  
>  static int graph_verify(int argc, const char **argv)
> @@ -110,6 +111,8 @@ static int graph_write(int argc, const char **argv)
>  			N_("start walk at commits listed by stdin")),
>  		OPT_BOOL(0, "append", &opts.append,
>  			N_("include all commits already in the commit-graph file")),
> +		OPT_BOOL(0, "changed-paths", &opts.enable_changed_paths,
> +			N_("enable computation for changed paths")),
>  		OPT_BOOL(0, "progress", &opts.progress, N_("force progress reporting")),
>  		OPT_BOOL(0, "split", &opts.split,
>  			N_("allow writing an incremental commit-graph file")),

All right.

> @@ -143,6 +146,8 @@ static int graph_write(int argc, const char **argv)
>  		flags |= COMMIT_GRAPH_WRITE_SPLIT;
>  	if (opts.progress)
>  		flags |= COMMIT_GRAPH_WRITE_PROGRESS;
> +	if (opts.enable_changed_paths)
> +		flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;
>  
>  	read_replace_refs = 0;

All right.  This actually turns on calculation Bloom filters for changed
paths, thanks to

 	ctx->changed_paths = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0;

that was added by the "[PATCH v2 04/11] commit-graph: compute Bloom
filters for changed paths" patch.

Though... should this enabling be split into two separate patches like
this?


Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH v2 09/11] commit-graph: add --changed-paths option to write subcommand
  2020-02-05 22:56   ` [PATCH v2 09/11] commit-graph: add --changed-paths option to write subcommand Garima Singh via GitGitGadget
  2020-02-20 20:28     ` Jakub Narebski
@ 2020-02-20 22:10     ` Bryan Turner
  2020-02-22  1:44       ` Garima Singh
  1 sibling, 1 reply; 150+ messages in thread
From: Bryan Turner @ 2020-02-20 22:10 UTC (permalink / raw)
  To: Garima Singh via GitGitGadget
  Cc: Git Users, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
	jeffhost, me, Jeff King, garimasigit, jnareb, Christian Couder,
	emilyshaffer, Junio C Hamano, Garima Singh

On Wed, Feb 5, 2020 at 2:56 PM Garima Singh via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Garima Singh <garima.singh@microsoft.com>
>
> Add --changed-paths option to git commit-graph write. This option will
> allow users to compute information about the paths that have changed
> between a commit and its first parent, and write it into the commit graph
> file. If the option is passed to the write subcommand we set the
> COMMIT_GRAPH_WRITE_BLOOM_FILTERS flag and pass it down to the
> commit-graph logic.
>
> Helped-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> ---
>  Documentation/git-commit-graph.txt | 5 +++++
>  builtin/commit-graph.c             | 9 +++++++--
>  2 files changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
> index bcd85c1976..907d703b30 100644
> --- a/Documentation/git-commit-graph.txt
> +++ b/Documentation/git-commit-graph.txt
> @@ -54,6 +54,11 @@ or `--stdin-packs`.)
>  With the `--append` option, include all commits that are present in the
>  existing commit-graph file.
>  +
> +With the `--changed-paths` option, compute and write information about the
> +paths changed between a commit and it's first parent. This operation can

"its first parent"

(Pardon the grammar nit from the peanut gallery!)

> +take a while on large repositories. It provides significant performance gains
> +for getting history of a directory or a file with `git log -- <path>`.
> ++
>  With the `--split` option, write the commit-graph as a chain of multiple
>  commit-graph files stored in `<dir>/info/commit-graphs`. The new commits
>  not already in the commit-graph are added in a new "tip" file. This file
> diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
> index e0c6fc4bbf..261dcce091 100644
> --- a/builtin/commit-graph.c
> +++ b/builtin/commit-graph.c
> @@ -9,7 +9,7 @@
>
>  static char const * const builtin_commit_graph_usage[] = {
>         N_("git commit-graph verify [--object-dir <objdir>] [--shallow] [--[no-]progress]"),
> -       N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--[no-]progress] <split options>"),
> +       N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--changed-paths] [--[no-]progress] <split options>"),
>         NULL
>  };
>
> @@ -19,7 +19,7 @@ static const char * const builtin_commit_graph_verify_usage[] = {
>  };
>
>  static const char * const builtin_commit_graph_write_usage[] = {
> -       N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--[no-]progress] <split options>"),
> +       N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--changed-paths] [--[no-]progress] <split options>"),
>         NULL
>  };
>
> @@ -32,6 +32,7 @@ static struct opts_commit_graph {
>         int split;
>         int shallow;
>         int progress;
> +       int enable_changed_paths;
>  } opts;
>
>  static int graph_verify(int argc, const char **argv)
> @@ -110,6 +111,8 @@ static int graph_write(int argc, const char **argv)
>                         N_("start walk at commits listed by stdin")),
>                 OPT_BOOL(0, "append", &opts.append,
>                         N_("include all commits already in the commit-graph file")),
> +               OPT_BOOL(0, "changed-paths", &opts.enable_changed_paths,
> +                       N_("enable computation for changed paths")),
>                 OPT_BOOL(0, "progress", &opts.progress, N_("force progress reporting")),
>                 OPT_BOOL(0, "split", &opts.split,
>                         N_("allow writing an incremental commit-graph file")),
> @@ -143,6 +146,8 @@ static int graph_write(int argc, const char **argv)
>                 flags |= COMMIT_GRAPH_WRITE_SPLIT;
>         if (opts.progress)
>                 flags |= COMMIT_GRAPH_WRITE_PROGRESS;
> +       if (opts.enable_changed_paths)
> +               flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;
>
>         read_replace_refs = 0;
>
> --
> gitgitgadget
>

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH v2 10/11] revision.c: use Bloom filters to speed up path based revision walks
  2020-02-05 22:56   ` [PATCH v2 10/11] revision.c: use Bloom filters to speed up path based revision walks Garima Singh via GitGitGadget
@ 2020-02-21 17:31     ` Jakub Narebski
  2020-02-21 22:45     ` Jakub Narebski
  1 sibling, 0 replies; 150+ messages in thread
From: Jakub Narebski @ 2020-02-21 17:31 UTC (permalink / raw)
  To: Garima Singh via GitGitGadget
  Cc: git, stolee, szeder.dev, jonathantanmy, jeffhost, me, peff,
	garimasigit, christian.couder, emilyshaffer, gitster,
	Garima Singh

"Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Garima Singh <garima.singh@microsoft.com>
>
> Revision walk will now use Bloom filters for commits to speed up revision
> walks for a particular path (for computing history for that path), if they
> are present in the commit-graph file.

Why do we need to turn this feature off for --walk-reflog?

Anyway, in my opinion this restriction should be stated explicitly in
the commit message, if kept. 

>
> We load the Bloom filters during the prepare_revision_walk step, but only
> when dealing with a single pathspec.

I would add the qualifier "currently" here, i.e. s/only/currently only/
to make it clear that it is the limitation of current implementation,
and not the inherent implementation of the technique.

>                                      While comparing trees in
> rev_compare_trees(), if the Bloom filter says that the file is not different
> between the two trees, we don't need to compute the expensive diff. This is
> where we get our performance gains. The other response of the Bloom filter
> is `maybe`, in which case we fall back to the full diff calculation to
> determine if the path was changed in the commit.

All right, looks good.

Very minor nitpick: s/`maybe`/'maybe'/ (in my opinion).

>
> Performance Gains:
> We tested the performance of `git log -- <path>` on the git repo, the linux
> and some internal large repos, with a variety of paths of varying depths.

Another repository that we could test Bloom filters feature would be, as
I have written before, Android AOSP frameworks core repository
https://android.googlesource.com/platform/frameworks/base/
because being written in Java it has deep path hierarchy, and it also
has large number of commits.

>
> On the git and linux repos:
> - we observed a 2x to 5x speed up.

It would be nice to have at least one specific and repeatable example:
in given repository, starting from given commit or tag, following the
history of given path, what are timing results for doing some specific
command with and without Bloom filters computed and enabled.

One might also want to know the cost of this speedup: how much disk
space does it take (i.e. how large is the commit-graph file with and
without Bloom filters chunks), and how long does it take to compute
(i.e. how much time writing commit-graph takes with and without using
--changed-paths options).

>
> On a large internal repo with files seated 6-10 levels deep in the tree:
> - we observed 10x to 20x speed ups, with some paths going up to 28 times
>   faster.

This is good to know.

In the future we might want to have procedurally generated synthetic
repository, where we would be able to control number of files, depth of
filesystem hierarchy, average number of changes per commit, etc. to be
used for performance testing.  (Just wishful thinking)

>
> Helped-by: Derrick Stolee <dstolee@microsoft.com
> Helped-by: SZEDER Gábor <szeder.dev@gmail.com>
> Helped-by: Jonathan Tan <jonathantanmy@google.com>
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> ---
>  revision.c                 | 124 +++++++++++++++++++++++++++++++-
>  revision.h                 |  11 +++
>  t/helper/test-read-graph.c |   4 ++
>  t/t4216-log-bloom.sh       | 140 +++++++++++++++++++++++++++++++++++++
>  4 files changed, 277 insertions(+), 2 deletions(-)
>  create mode 100755 t/t4216-log-bloom.sh
>
> diff --git a/revision.c b/revision.c
> index 8136929e23..d1622afa17 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -29,6 +29,8 @@
>  #include "prio-queue.h"
>  #include "hashmap.h"
>  #include "utf8.h"
> +#include "bloom.h"
> +#include "json-writer.h"
>  
>  volatile show_early_output_fn_t show_early_output;
>  
> @@ -624,11 +626,114 @@ static void file_change(struct diff_options *options,
>  	options->flags.has_changes = 1;
>  }
>  
> +static int bloom_filter_atexit_registered;
> +static unsigned int count_bloom_filter_maybe;
> +static unsigned int count_bloom_filter_definitely_not;
> +static unsigned int count_bloom_filter_false_positive;
> +static unsigned int count_bloom_filter_not_present;
> +static unsigned int count_bloom_filter_length_zero;
> +
> +static void trace2_bloom_filter_statistics_atexit(void)
> +{
> +	struct json_writer jw = JSON_WRITER_INIT;
> +
> +	jw_object_begin(&jw, 0);
> +	jw_object_intmax(&jw, "filter_not_present", count_bloom_filter_not_present);
> +	jw_object_intmax(&jw, "zero_length_filter", count_bloom_filter_length_zero);
> +	jw_object_intmax(&jw, "maybe", count_bloom_filter_maybe);
> +	jw_object_intmax(&jw, "definitely_not", count_bloom_filter_definitely_not);
> +	jw_end(&jw);
> +
> +	trace2_data_json("bloom", the_repository, "statistics", &jw);
> +
> +	jw_release(&jw);
> +}

I thought that it would be better to put this part together with tests
that absolutely require this functionality in a separate subsequent
patch, but now I am not so sure.  It is nice to have all or almost all
tests created in a single patch.

Looks good to me, but I don't know much about trace2 API, so take it
with a pinch of salt.

> +
> +static void prepare_to_use_bloom_filter(struct rev_info *revs)
> +{
> +	struct pathspec_item *pi;
> +	char *path_alloc = NULL;
> +	const char *path;
> +	int last_index;
> +	int len;
> +
> +	if (!revs->commits)
> +	    return;

I see that we need this because in next command we dereference
revs->commits to get revs->commits->item.

If I understand it correctly empty pending list may happen with "--all"
or "--glob" options, but somebody with more experience in this area of
code is needed to state for sure.

Should we test `git log --all -- <path>`?

> +
> +	repo_parse_commit(revs->repo, revs->commits->item);

Are we calling this function for its side-effects?  Wouldn't using
prepare_commit_graph(revs->repo) here be a better solution?

> +
> +	if (!revs->repo->objects->commit_graph)
> +		return;

Looks good to me.  If there is no commit graph, then there are no Bloom
filters to consult.

> +
> +	revs->bloom_filter_settings = revs->repo->objects->commit_graph->bloom_filter_settings;

Hmmm... is that why bloom_filter_settings is a pointer to struct, and
not struct itself?

> +	if (!revs->bloom_filter_settings)
> +		return;

Looks good to me.  If there is no Bloomm filter in the commit-graph
file, then there are no Bloom filters to consult.

> +
> +	pi = &revs->pruning.pathspec.items[0];
> +	last_index = pi->len - 1;
> +

It might be a good idea to add a comment explaining what is happening
here, for example:

  +	/* remove single trailing slash from path, if needed */
> +	if (pi->match[last_index] == '/') {
> +	    path_alloc = xstrdup(pi->match);
> +	    path_alloc[last_index] = '\0';
> +	    path = path_alloc;
> +	} else
> +	    path = pi->match;
> +
> +	len = strlen(path);

We can avoid computing strlen(path) here, because in first branch of
this conditional we have len = last_index, in the second branch we have
len = pi->len.

> +
> +	revs->bloom_key = xmalloc(sizeof(struct bloom_key));
> +	fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);

All right, this is the meat of this function: creating bloom_key for a
path.  Looks good to me.

> +
> +	if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
> +		atexit(trace2_bloom_filter_statistics_atexit);
> +		bloom_filter_atexit_registered = 1;
> +	}

OK, here we register trace2 Bloom filter statistics handler, but only
once, and only when needed.

> +
> +	free(path_alloc);

OK, path_alloc is either xstrdup-ed string, or NULL, and is no longer
needed (after possibly being used to create bloom_key).

> +}
> +
> +static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
> +						 struct commit *commit)
> +{
> +	struct bloom_filter *filter;
> +	int result;
> +
> +	if (!revs->repo->objects->commit_graph)
> +		return -1;
> +
> +	if (commit->generation == GENERATION_NUMBER_INFINITY)
> +		return -1;

Idle thought: would it be useful to gather for trace2 statistics also
number of commits encountered that were outside commit-graph?

> +
> +	filter = get_bloom_filter(revs->repo, commit, 0);
> +
> +	if (!filter) {
> +		count_bloom_filter_not_present++;
> +		return -1;
> +	}
> +
> +	if (!filter->len) {
> +		count_bloom_filter_length_zero++;
> +		return -1;
> +	}
> +
> +	result = bloom_filter_contains(filter,
> +				       revs->bloom_key,
> +				       revs->bloom_filter_settings);
> +
> +	if (result)
> +		count_bloom_filter_maybe++;
> +	else
> +		count_bloom_filter_definitely_not++;
> +
> +	return result;
> +}

The whole check_maybe_different_in_bloom_filter() looks good to me,
thanks to designing and building a good API.

> +
>  static int rev_compare_tree(struct rev_info *revs,
> -			    struct commit *parent, struct commit *commit)
> +			    struct commit *parent, struct commit *commit, int nth_parent)
>  {
>  	struct tree *t1 = get_commit_tree(parent);
>  	struct tree *t2 = get_commit_tree(commit);
> +	int bloom_ret = 1;

I don't understand why it is initialized to 1, and not to 0.

>  
>  	if (!t1)
>  		return REV_TREE_NEW;
> @@ -653,11 +758,23 @@ static int rev_compare_tree(struct rev_info *revs,
>  			return REV_TREE_SAME;
>  	}
>  
> +	if (revs->pruning.pathspec.nr == 1 && !revs->reflog_info && !nth_parent) {

Shouldn't we check upfront here that revs->bloom_key is not NULL?
I don't think we check this down the callchain...

Or even better replace the first two checks with it, as revs->bloom_key
is set only if (revs->pruning.pathspec.nr == 1 && !revs->reflog_info),
see addition to prepare_revision_walk() below.

Of course the !nth_parent check needs to be kept, as this changes during
the revision walk (it is a limitation of current version of Bloom filter
in that only changes with respect to first parent are stored in filter).

> +		bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
> +
> +		if (bloom_ret == 0)
> +			return REV_TREE_SAME;
> +	}

All right, if we have single pathspec, and we don't walk reflog (?), and
we are interested in first parent, then we query the Bloom filter.

The Bloom filter can return 'no' or 'maybe'; if it returns 'no' then we
can short-circuit and avoid computing the tree diff.

> +
>  	tree_difference = REV_TREE_SAME;
>  	revs->pruning.flags.has_changes = 0;
>  	if (diff_tree_oid(&t1->object.oid, &t2->object.oid, "",
>  			   &revs->pruning) < 0)
>  		return REV_TREE_DIFFERENT;
> +
> +	if (!nth_parent)

Shouldn't this condition be exactly the same as for running
check_maybe_different_in_bloom_filter()?  Otherwise due to initializing
bloom_ret to 1 we would get wrong statistics, isn't it?

> +		if (bloom_ret == 1 && tree_difference == REV_TREE_SAME)
> +			count_bloom_filter_false_positive++;
> +

All right, looks good.

>  	return tree_difference;
>  }
>  
> @@ -855,7 +972,7 @@ static void try_to_simplify_commit(struct rev_info *revs, struct commit *commit)
>  			die("cannot simplify commit %s (because of %s)",
>  			    oid_to_hex(&commit->object.oid),
>  			    oid_to_hex(&p->object.oid));
> -		switch (rev_compare_tree(revs, p, commit)) {
> +		switch (rev_compare_tree(revs, p, commit, nth_parent)) {
>  		case REV_TREE_SAME:
>  			if (!revs->simplify_history || !relevant_commit(p)) {
>  				/* Even if a merge with an uninteresting

OK, we are just dding new parameter, with the information needed to
decide whether Bloom filters can be used or not.

> @@ -3362,6 +3479,8 @@ int prepare_revision_walk(struct rev_info *revs)
>  				       FOR_EACH_OBJECT_PROMISOR_ONLY);
>  	}
>  
> +	if (revs->pruning.pathspec.nr == 1 && !revs->reflog_info)
> +		prepare_to_use_bloom_filter(revs);

Well, the limitation that the technique _currently_ works only with a
single pathspec is stated explicitly, but the fact that it is turned off
for some reason for --walk-reflog is not.

Otherwise, looks good to me.

>  	if (revs->no_walk != REVISION_WALK_NO_WALK_UNSORTED)
>  		commit_list_sort_by_date(&revs->commits);
>  	if (revs->no_walk)
> @@ -3379,6 +3498,7 @@ int prepare_revision_walk(struct rev_info *revs)
>  		simplify_merges(revs);
>  	if (revs->children.name)
>  		set_children(revs);
> +
>  	return 0;
>  }

Unrelated coding style fixup, but we are doing changes in the
neighborhood.  All right, I can agree to that.

>  
> diff --git a/revision.h b/revision.h
> index 475f048fb6..7c026fe41f 100644
> --- a/revision.h
> +++ b/revision.h
> @@ -56,6 +56,8 @@ struct repository;
>  struct rev_info;
>  struct string_list;
>  struct saved_parents;
> +struct bloom_key;
> +struct bloom_filter_settings;
>  define_shared_commit_slab(revision_sources, char *);
>  
>  struct rev_cmdline_info {
> @@ -291,6 +293,15 @@ struct rev_info {
>  	struct revision_sources *sources;
>  
>  	struct topo_walk_info *topo_walk_info;
> +
> +	/* Commit graph bloom filter fields */
> +	/* The bloom filter key for the pathspec */
> +	struct bloom_key *bloom_key;
> +	/*
> +	 * The bloom filter settings used to generate the key.
> +	 * This is loaded from the commit-graph being used.
> +	 */
> +	struct bloom_filter_settings *bloom_filter_settings;

It is nice having those explanatory comments.

Sidenote: if I understand it correctly, revs->bloom_key is allocated but
never free()d.  On the other hand revs->bloom_filter_settings is a weak
reference / is set to the value of other pointer, which is allocated and
free()d together with commit_graph struct.

>  };
>  
>  int ref_excluded(struct string_list *, const char *path);
> diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
> index d2884efe0a..aff597c7a3 100644
> --- a/t/helper/test-read-graph.c
> +++ b/t/helper/test-read-graph.c
> @@ -45,6 +45,10 @@ int cmd__read_graph(int argc, const char **argv)
>  		printf(" commit_metadata");
>  	if (graph->chunk_extra_edges)
>  		printf(" extra_edges");
> +	if (graph->chunk_bloom_indexes)
> +		printf(" bloom_indexes");
> +	if (graph->chunk_bloom_data)
> +		printf(" bloom_data");
>  	printf("\n");

This chunk could be moved to the commit adding --changed-paths
option... on the other hand if all tests are to be added by this patch,
it can be left as is.

>  
>  	UNLEAK(graph);
> diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
> new file mode 100755
> index 0000000000..19eca1864b
> --- /dev/null
> +++ b/t/t4216-log-bloom.sh
[...]

I'll leave reviewing tests of this feature for the next email.

Best regards,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH v2 00/11] Changed Paths Bloom Filters
  2020-02-08 23:04   ` Jakub Narebski
@ 2020-02-21 17:41     ` Garima Singh
  2020-03-29 18:36       ` Junio C Hamano
  0 siblings, 1 reply; 150+ messages in thread
From: Garima Singh @ 2020-02-21 17:41 UTC (permalink / raw)
  To: Jakub Narebski, Garima Singh via GitGitGadget
  Cc: git, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
	Jeff Hostetler, Taylor Blau, Jeff King, Christian Couder,
	Emily Shaffer, Junio C Hamano, Garima Singh


On 2/8/2020 6:04 PM, Jakub Narebski wrote:
> "Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> Hey! 
>>
>> The commit graph feature brought in a lot of performance improvements across
>> multiple commands. However, file based history continues to be a performance
>> pain point, especially in large repositories. 
>>
>> Adopting changed path Bloom filters has been discussed on the list before,
>> and a prototype version was worked on by SZEDER Gábor, Jonathan Tan and Dr.
>> Derrick Stolee [1]. This series is based on Dr. Stolee's proof of
>> concept in [2].
> 
> Sidenote: I wondered why it did use MurmurHash3 (64-bit version), which
> requires adding its implementation, instead of reusing FNV-1 hash
> (Fowler–Noll–Vo hash function) used by Git hashmap implementation, see
> https://github.com/git/git/blob/228f53135a4a41a37b6be8e4d6e2b6153db4a8ed/hashmap.h#L109
> Beside the fact that everyone is using MurmurHash for Bloom filters ;-)
> 
> It turns out that in various benchmark MurmurHash is faster and also
> slightly better as a hash than FNV-1 or FNV-1b.
> 
> 
> I wonder then if it would be a good idea (in the future) to make it easy
> to use hashmap with MurmurHash3 instead of FNV-1, or maybe to even make
> it the default for hashing strings.
> 

Making Murmur3 hash the default for hashing strings is definitely outside the
scope of this series. Also, if the method signatures for the murmur3 hash 
matched the existing hash method signatures in hashmap.c, then it would be 
appropriate to place them adjacently, even if no hashmap consumer uses it for 
hashmaps. However, we need the option to start at a custom seed to do our double
hashing. A change in the future that involves adopting murmur3 in the hashmap
code would involve a simple code move before creating the new methods that 
avoid a custom seed. So for now, it makes sense that these methods leave in 
bloom.c where they are being used for a very specific purpose. 

>>
>> Performance Gains: We tested the performance of git log -- path on the git
>> repo, the linux repo and some internal large repos, with a variety of paths
>> of varying depths.
> 
> As I wrote in reply to previous version of this series, a good public
> repository (and thus being able to use by anyone) to test the Bloom
> filter performance improvements could be AOSP (Android) base:
> 
>   https://android.googlesource.com/platform/frameworks/base/
> 
> which is a large repository with long path depths (due to Java file
> naming conventions).
> 

Thank you! I will incorporate these results into the commit messages as 
appropriate in v3. 

>>
>> On the git and linux repos: We observed a 2x to 5x speed up.
>>
>> On a large internal repo with files seated 6-10 levels deep in the tree: We
>> observed 10x to 20x speed ups, with some paths going up to 28 times faster.
> 
> Very nice! Good work!
> 
> What is the cost of this feature, that is how long it takes to generate
> Bloom filters, and how much larger commit-graph file gets?  It would be
> nice to know.
> 

The cost of writing is much better now with Peff and Dr. Stolee's improvements. 
I will include these numbers as well in the commit messages as appropriate in 
v3. 

>>
>> Future Work (not included in the scope of this series):
>>
>>  1. Supporting multiple path based revision walk
> 
> Shouldn't then tests that were added in v2 mark use of Bloom filters
> with multiple paths revision walking as _not working *yet*_
> (test_expect_failure), and not expected to not work (test_expect_success
> with test_bloom_filters_not_used)?
> 

My intent is to ensure that bloom filters are not being used in any of the 
unsupported code paths. I don't have a strong preference about the test 
semantics as long as I get that coverage :) So I will look into switching it 
to test_expect_failure as you have suggested. 

>> Derrick Stolee (2):
>>   diff: halt tree-diff early after max_changes
>>   commit-graph: examine commits by generation number
>>
>> Garima Singh (8):
>>   commit-graph: use MAX_NUM_CHUNKS
>>   bloom: core Bloom filter implementation for changed paths
>>   commit-graph: compute Bloom filters for changed paths
>>   commit-graph: write Bloom filters to commit graph file
>>   commit-graph: reuse existing Bloom filters during write.
>>   commit-graph: add --changed-paths option to write subcommand
>>   revision.c: use Bloom filters to speed up path based revision walks
>>   commit-graph: add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag
>>
>> Jeff King (1):
>>   commit-graph: examine changed-path objects in pack order
> 
> The shortlog summary is a fine tool to show contributors to the patch
> series, but is not as useful to show patch series as a whole: splitting
> of patches and their ordering.
> 

This is a GitGitGadget specific thing, and it is probably by design. I have 
opened an issue in that repo for any follow up discussions:
  https://github.com/gitgitgadget/gitgitgadget/issues/203

> - [PATCH v2 02/11] bloom: core Bloom filter implementation for changed paths
> 
>   In my opinion this patch could be split into three individual pieces,
>   though one might think it is not worth it.
> 

I have gone back and forth on doing this. I like most of the core Bloom filter
computations being isolated in one patch/commit. But based on the rest of your
review, it seems like you are leaning heavily on having this split out. 
So, I will take a proper stab at doing it for v3. 

> - [PATCH v2 07/11] commit-graph: write Bloom filters to commit graph file
> 
>   This commit includes the documentation of the two new chunks of
>   commit-graph file format.
> 
>   I wonder if the 9th patch in this series, namely
>   commit-graph: add --changed-paths option to write subcommand
>   should not precede this commit.  Otherwise we have this new code but
>   no way of testing it.  On the other hand it makes it easier to
>   review.  On the gripping hand, you can't really test that writing
>   works without the ability to parse Bloom filter data out of
>   commit-graph file... which is the next commit.
> 

Getting complete test coverage within a single patch would require 2 or 3 of 
these patches to be combined. This would lead to a large patch that would be 
much more difficult to review.
 
My tests in the patches following this one run git commands. Hence the tests 
get introduced when the command line is ready to use all the new code. 

The current ordering of patches works better than adding the --changed-paths 
option before the logic that computes and writes. Otherwise the option will not 
be doing what it is supposed to do in the patch it was introduced in.

> - [PATCH v2 08/11] commit-graph: reuse existing Bloom filters during write
> 
>   This implements reading Bloom filters data from commit-graph file.
>   Is it a good split?  I think it makes it easier to review the single
>   patch, but itt also makes them less standalone.
> 

All the logic upto this point works just fine without the ability to read and 
parse precomputed bloom filters. This patch is an enhancement and it also 
separates out the reading and writing logic. Reusing existing bloom filters 
during write is the simplest interatcion that involves reading from the commit
graph file, and builds the foundation to make the `git log` improvements. 
Hence, it warrants its own patch and review. 

> - [PATCH v2 10/11] revision.c: use Bloom filters to speed up path based revision walks
> 
>   This is quite a big and involved patch, which in my opinion could be
>   split in two or three parts:
> 
>   a. Add a bare bones implementation, like in v2
> 
>   This limits amount of testing we can do; the only thing we can really
>   test is that we get the same results with and without Bloom filters.
> 
>   b.1. Add trace2 Bloom filter statistics
>   b.2. Use said trace2 statistics to test use of Bloom filters
> 

Sure. I will look into doing this split as well for v3. 

> 
> Feel free to disagree with those ideas.
> 
> Best,

Thanks for taking the time for reviewing this series so thoroughly! 
It is greatly appreciated! 

Cheers,
Garima Singh


^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH v2 10/11] revision.c: use Bloom filters to speed up path based revision walks
  2020-02-05 22:56   ` [PATCH v2 10/11] revision.c: use Bloom filters to speed up path based revision walks Garima Singh via GitGitGadget
  2020-02-21 17:31     ` Jakub Narebski
@ 2020-02-21 22:45     ` Jakub Narebski
  1 sibling, 0 replies; 150+ messages in thread
From: Jakub Narebski @ 2020-02-21 22:45 UTC (permalink / raw)
  To: Garima Singh via GitGitGadget
  Cc: git, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
	Jeff Hostetler, Taylor Blau, Jeff King, Garima Singh,
	Christian Couder, Emily Shaffer, Junio C Hamano, Garima Singh

"Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:

This is a second part of my response, focusing solely on tests of the
Bloom filters feature.

[...]
> diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
> index d2884efe0a..aff597c7a3 100644
> --- a/t/helper/test-read-graph.c
> +++ b/t/helper/test-read-graph.c
> @@ -45,6 +45,10 @@ int cmd__read_graph(int argc, const char **argv)
>  		printf(" commit_metadata");
>  	if (graph->chunk_extra_edges)
>  		printf(" extra_edges");
> +	if (graph->chunk_bloom_indexes)
> +		printf(" bloom_indexes");
> +	if (graph->chunk_bloom_data)
> +		printf(" bloom_data");
>  	printf("\n");
>

All right, that is simple extension of 'test-helper read-graph'.

>  	UNLEAK(graph);
> diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
> new file mode 100755
> index 0000000000..19eca1864b
> --- /dev/null
> +++ b/t/t4216-log-bloom.sh
> @@ -0,0 +1,140 @@
> +#!/bin/sh
> +
> +test_description='git log for a path with bloom filters'
> +. ./test-lib.sh
> +
> +test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
> +	git init &&
> +	mkdir A A/B A/B/C &&
> +	test_commit c1 A/file1 &&
> +	test_commit c2 A/B/file2 &&
> +	test_commit c3 A/B/C/file3 &&
> +	test_commit c4 A/file1 &&
> +	test_commit c5 A/B/file2 &&
> +	test_commit c6 A/B/C/file3 &&
> +	test_commit c7 A/file1 &&
> +	test_commit c8 A/B/file2 &&
> +	test_commit c9 A/B/C/file3 &&
> +	git checkout -b side HEAD~4 &&
> +	test_commit side-1 file4 &&
> +	git checkout master &&
> +	git merge side &&
> +	test_commit c10 file5 &&

Unfortunately this might be not enough for Git's heuristic similarity
based rename detection, as it creates 'file5' file with content 'c10'.

[Checking something].  Well, actually it looks like it works, even with
not much contents.  I thought you would need to use something like

  +	test_write_lines 1 2 3 4 5 6 7 8 9 >file5 &&
  +	git add file5 &&
  +	git commit -m c10 &&

But it turns out that it is, s far as I have checked, not necessary.

> +	mv file5 file5_renamed &&
> +	git add file5_renamed &&
> +	git commit -m "rename" &&
> +	git commit-graph write --reachable --changed-paths
> +'

Hmmm... there is no test for file that was present in history but got
deleted.  Might be important (because of pre-image vs post-image name
issues).


Very minor issue: following the style used in t/test-lib-functions.sh
and the style guide in CodingGuidelines, it should be

  +graph_read_expect () {

and the same for the following functions.


https://github.com/git/git/blob/master/Documentation/CodingGuidelines#L144

 - We prefer a space between the function name and the parentheses,
   and no space inside the parentheses. The opening "{" should also
   be on the same line.

	(incorrect)
	my_function(){
		...

	(correct)
	my_function () {
		...

> +graph_read_expect() {
> +	OPTIONAL=""
> +	NUM_CHUNKS=5
> +	cat >expect <<- EOF
> +	header: 43475048 1 1 $NUM_CHUNKS 0
> +	num_commits: $1
> +	chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data

Either OPTIONAL remains unused, and should be removed, or we leave it
for possible future extension, and we write

  +	chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data$OPTIONAL

like in t/t5318-commit-graph.sh.

> +	EOF
> +	test-tool read-graph >output &&
> +	test_cmp expect output

Why 'output', and not 'actual'?

> +}
> +
> +test_expect_success 'commit-graph write wrote out the bloom chunks' '
> +	graph_read_expect 13
> +'

All right, that is sanity-checking 'git commit-graph write --changed-paths'.

> +
> +setup() {

I wonder if we can come up with a better name... setup_log(),
setup_log_bloom(), log_compare()?

> +	rm output

This shouldn't be here, in this function.  Or perhaps it shouldn't even
be used at all; having 'output' doesn't hinder anything.

> +	rm "$TRASH_DIRECTORY/trace.perf"

All right, this cleanup is needed.

> +	git -c core.commitGraph=false log --pretty="format:%s" $1 >log_wo_bloom
> +	GIT_TRACE2_PERF="$TRASH_DIRECTORY/trace.perf" git -c core.commitGraph=true log --pretty="format:%s" $1 >log_w_bloom

All right, we prepare for comparing version without Bloom filters
(reference) and with Bloom filters, and for checking if Bloom filters
were used.

> +}

This setup() function above is missing the && chain.

It should then in my opinion read:

  +setup () {
  +	rm "$TRASH_DIRECTORY/trace.perf" &&
  +	git -c core.commitGraph=false log --format="%s" $1 >log_wo_bloom &&
  +	GIT_TRACE2_PERF="$TRASH_DIRECTORY/trace.perf" \
  +	git -c core.commitGraph=true log --format="%s" $1 >log_w_bloom
  +}

Also, perhaps we should add at the beginning of this test file, outside
anu test_expect_success block, the following (see t/*trace2*.sh files):

  # Turn off any inherited trace2 settings for this test.
  sane_unset GIT_TRACE2 GIT_TRACE2_PERF GIT_TRACE2_EVENT
  sane_unset GIT_TRACE2_PERF_BRIEF
  sane_unset GIT_TRACE2_CONFIG_PARAMS

> +
> +test_bloom_filters_used() {
> +	log_args=$1
> +	bloom_trace_prefix="statistics:{\"filter_not_present\":0,\"zero_length_filter\":0,\"maybe\""
> +	setup "$log_args"

Missing && chain.

> +	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" && test_cmp log_wo_bloom log_w_bloom

Why no line break after &&?

> +}

Ugh, examining JSON output with regexp is in my opinion quite fragile.
Though I am not sure if requiring Perl and JSON module installed like
t/t0212-trace2-event.sh is any better.

> +
> +test_bloom_filters_not_used() {
> +	log_args=$1
> +	setup "$log_args"
> +	!(grep -q "statistics:{\"filter_not_present\":" "$TRASH_DIRECTORY/trace.perf") && test_cmp log_wo_bloom log_w_bloom

We should also check that "$TRASH_DIRECTORY/trace.perf" file exist with
test_path_is_file.

Also, testing that something was not found is a bit fragile, but I don't
have any better idea on how to do this test without negating grep exit
value.

> +}
> +
> +for path in A A/B A/B/C A/file1 A/B/file2 A/B/C/file3 file4 file5_renamed

NOTE: file5 is missing from this list!

I suspect that adding it might cause the test to fail.

> +do
> +	for option in "" \
> +		      "--full-history" \
> +		      "--full-history --simplify-merges" \
> +		      "--simplify-merges" \
> +		      "--simplify-by-decoration" \
> +		      "--follow" \
> +		      "--first-parent" \
> +		      "--topo-order" \
> +		      "--date-order" \
> +		      "--author-date-order" \
> +		      "--ancestry-path side..master"
> +	do
> +		test_expect_success "git log option: $option for path: $path" '
> +			test_bloom_filters_used "$option -- $path"

All right, this tests that Bloom filters were used *and* that the
command run with Bloom filters and without Bloom filters (without
commit-graph) produces the same output.

> +		'
> +	done
> +done
> +
> +test_expect_success 'git log -- folder works with and without the trailing slash' '
> +	test_bloom_filters_used "-- A" &&
> +	test_bloom_filters_used "-- A/"
> +'

All right.

I wonder if we should test for insane test case, namely pathname to an
ordinary file that ends with slash:

  +	test_bloom_filters_used "-- file4" &&
  +	test_bloom_filters_used "-- file4/"

The latter should produce no output, being treated as not existing file.

> +
> +test_expect_success 'git log for path that does not exist. ' '
> +	test_bloom_filters_used "-- path_does_not_exist"
> +'

All right.

> +
> +test_expect_success 'git log with --walk-reflogs does not use bloom filters' '
> +	test_bloom_filters_not_used "--walk-reflogs -- A"
> +'

All right, but why is it so?

> +
> +test_expect_success 'git log -- multiple path specs does not use bloom filters' '
> +	test_bloom_filters_not_used "-- file4 A/file1"
> +'

All right, though this is limitation of current code, not limitation of
technique, so _maybe_ it would be better to test_expect_failure that for
multiple pathspecs bloom_filters_used...

> +
> +test_expect_success 'git log with wildcard that resolves to a single path uses bloom filters' '
> +	test_bloom_filters_used "-- *4" &&
> +	test_bloom_filters_used "-- *renamed"
> +'
> +
> +test_expect_success 'git log with wildcard that resolves to a multiple paths does not uses bloom filters' '
> +	test_bloom_filters_not_used "-- *" &&
> +	test_bloom_filters_not_used "-- file*"
> +'

Same here.

> +
> +test_expect_success 'setup - add commit-graph to the chain without bloom filters' '
> +	test_commit c14 A/anotherFile2 &&
> +	test_commit c15 A/B/anotherFile2 &&
> +	test_commit c16 A/B/C/anotherFile2 &&
> +	GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0 git commit-graph write --reachable --split &&
> +	test_line_count = 2 .git/objects/info/commit-graphs/commit-graph-chain
> +'
> +
> +test_expect_success 'git log does not use bloom filters if the latest graph does not have bloom filters.' '
> +	test_bloom_filters_not_used "-- A/B"
> +'

All right... though I would try to come up with a shorter test name :-)

> +
> +test_expect_success 'setup - add commit-graph to the chain with bloom filters' '
> +	test_commit c17 A/anotherFile3 &&
> +	git commit-graph write --reachable --changed-paths --split &&
> +	test_line_count = 3 .git/objects/info/commit-graphs/commit-graph-chain
> +'
> +
> +test_bloom_filters_used_when_some_filters_are_missing() {
> +	log_args=$1
> +	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":6,\"definitely_not\":6"

Perhaps a better solution would be to use (enhanced) 'test-tool bloom'
to check which commits have Bloom filters and which do not.

> +	setup "$log_args"
> +	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" && test_cmp log_wo_bloom log_w_bloom
> +}

Why broken && chain between setup() and the resr, and why && is not
followed by line break (as before)?

> +
> +test_expect_success 'git log uses bloom filters if they exist in the latest but not all commit graphs in the chain.' '
> +	test_bloom_filters_used_when_some_filters_are_missing "-- A/B"
> +'
> +
> +test_done

All right... though the description of this test is a bit long.


Thank you for your work on this series.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH v2 11/11] commit-graph: add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag
  2020-02-05 22:56   ` [PATCH v2 11/11] commit-graph: add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag Garima Singh via GitGitGadget
@ 2020-02-22  0:11     ` Jakub Narebski
  0 siblings, 0 replies; 150+ messages in thread
From: Jakub Narebski @ 2020-02-22  0:11 UTC (permalink / raw)
  To: Garima Singh via GitGitGadget
  Cc: git, stolee, szeder.dev, jonathantanmy, jeffhost, me, peff,
	garimasigit, christian.couder, emilyshaffer, gitster,
	Garima Singh

"Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Garima Singh <garima.singh@microsoft.com>
>
> Add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag to the test setup suite
> in order to toggle writing Bloom filters when running any of the git tests.
> If set to true, we will compute and write Bloom filters every time a test
> calls `git commit-graph write`, as if the `--changed-paths` option was
> passed in.
>
> The test suite passes when GIT_TEST_COMMIT_GRAPH and
> GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS are enabled.

All right.  Nice.

>
> Helped-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> ---
>  builtin/commit-graph.c        | 3 ++-
>  ci/run-build-and-tests.sh     | 1 +
>  commit-graph.h                | 1 +
>  t/README                      | 5 +++++
>  t/t4216-log-bloom.sh          | 3 +++
>  t/t5318-commit-graph.sh       | 2 ++
>  t/t5324-split-commit-graph.sh | 1 +
>  7 files changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
> index 261dcce091..fc9b234ab0 100644
> --- a/builtin/commit-graph.c
> +++ b/builtin/commit-graph.c
> @@ -146,7 +146,8 @@ static int graph_write(int argc, const char **argv)
>  		flags |= COMMIT_GRAPH_WRITE_SPLIT;
>  	if (opts.progress)
>  		flags |= COMMIT_GRAPH_WRITE_PROGRESS;
> -	if (opts.enable_changed_paths)
> +	if (opts.enable_changed_paths ||
> +	    git_env_bool(GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS, 0))
>  		flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;
>

Looks good to me.

>  	read_replace_refs = 0;
> diff --git a/ci/run-build-and-tests.sh b/ci/run-build-and-tests.sh
> index ff0ef7f08e..7b4857651d 100755
> --- a/ci/run-build-and-tests.sh
> +++ b/ci/run-build-and-tests.sh
> @@ -19,6 +19,7 @@ linux-gcc)
>  	export GIT_TEST_OE_SIZE=10
>  	export GIT_TEST_OE_DELTA_SIZE=5
>  	export GIT_TEST_COMMIT_GRAPH=1
> +	export GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=1
>  	export GIT_TEST_MULTI_PACK_INDEX=1
>  	make test
>  	;;

OK, include in continuous integration.

> diff --git a/commit-graph.h b/commit-graph.h
> index 25fefefb3e..4c202ff3d7 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -8,6 +8,7 @@
>  
>  #define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
>  #define GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD "GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD"
> +#define GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS "GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS"
>

Looks good to me.

>  struct commit;
>  struct bloom_filter_settings;
> diff --git a/t/README b/t/README
> index caa125ba9a..be2f7d7fd2 100644
> --- a/t/README
> +++ b/t/README
> @@ -378,6 +378,11 @@ GIT_TEST_COMMIT_GRAPH=<boolean>, when true, forces the commit-graph to
>  be written after every 'git commit' command, and overrides the
>  'core.commitGraph' setting to true.
>  
> +GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=<boolean>, when true, forces
> +commit-graph write to compute and write changed path Bloom filters for
> +every 'git commit-graph write', as if the `--changed-paths` option was
> +passed in.
> +

Good, it is documented in README for tests.

>  GIT_TEST_FSMONITOR=$PWD/t7519/fsmonitor-all exercises the fsmonitor
>  code path for utilizing a file system monitor to speed up detecting
>  new or changed files.
> diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
> index 19eca1864b..7acebb3962 100755
> --- a/t/t4216-log-bloom.sh
> +++ b/t/t4216-log-bloom.sh
> @@ -3,6 +3,9 @@
>  test_description='git log for a path with bloom filters'
>  . ./test-lib.sh
>  
> +GIT_TEST_COMMIT_GRAPH=0
> +GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0
> +

All right, we need to ensure that 'git commit-graph write' is not run
automatically, otherwise split / incremental commit-graph tests would
not work.

We also need to ensure that '--changed-paths' is not added
automatically, so that we can test that commit-graph does not include
Bloom filters chunks if not requested.

>  test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
>  	git init &&
>  	mkdir A A/B A/B/C &&
> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index 3f03de6018..973020be2d 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -3,6 +3,8 @@
>  test_description='commit graph'
>  . ./test-lib.sh
>  
> +GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0
> +

OK, otherwise it would screw up checking the content of commit-graph
with 'test-tool read-graph'.

>  test_expect_success 'setup full repo' '
>  	mkdir full &&
>  	cd "$TRASH_DIRECTORY/full" &&
> diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
> index c24823431f..9235db4561 100755
> --- a/t/t5324-split-commit-graph.sh
> +++ b/t/t5324-split-commit-graph.sh
> @@ -4,6 +4,7 @@ test_description='split commit graph'
>  . ./test-lib.sh
>  
>  GIT_TEST_COMMIT_GRAPH=0
> +GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0
>  
>  test_expect_success 'setup repo' '
>  	git init &&

Same here.

Looks good to me.
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH v2 02/11] bloom: core Bloom filter implementation for changed paths
  2020-02-16 16:49     ` Jakub Narebski
@ 2020-02-22  0:32       ` Garima Singh
  2020-02-23 13:38         ` Jakub Narebski
  0 siblings, 1 reply; 150+ messages in thread
From: Garima Singh @ 2020-02-22  0:32 UTC (permalink / raw)
  To: Jakub Narebski, Garima Singh via GitGitGadget
  Cc: git, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
	Jeff Hostetler, Taylor Blau, Jeff King, Christian Couder,
	Emily Shaffer, Junio C Hamano, Garima Singh


On 2/16/2020 11:49 AM, Jakub Narebski wrote:
>> From: Garima Singh <garima.singh@microsoft.com>
>>
>> Add the core Bloom filter logic for computing the paths changed between a
>> commit and its first parent. For details on what Bloom filters are and how they
>> work, please refer to Dr. Derrick Stolee's blog post [1]. It provides a concise
>> explaination of the adoption of Bloom filters as described in [2] and [3].
>                                                                            ^^- to add

Not sure what this means. Can you please clarify. 

>> 1. We currently use 7 and 10 for the number of hashes and the size of each
>>    entry respectively. They served as great starting values, the mathematical
>>    details behind this choice are described in [1] and [4]. The implementation,
>                                                                                 ^^- to add

Not sure what this means. Can you please clarify.

>> 3. The filters are sized according to the number of changes in the each commit,
>>    with minimum size of one 64 bit word.
> 
> If I understand it correctly (but which might not be entirely clear),
> the filter size in bits is the number of changes^* times 10, rounded up
> to the nearest multiple of 64.
> 
> [*] where the number of changes is the number of changed files (new blob
> objects) _and_ the number of changed directories (new tree objects,
> excluding root tree object change).
> 

Yes. 

> The interesting corner case, which might be worth specifying explicitly,
> is what happens in the case there are _no changes_ with respect to first
> parent (which can happen with either commit created with `git commit
> --allow-empty`, or merge created e.g. with `git merge --strategy=ours`).
> Is this case represented as Bloom filter of length 0, or as a Bloom
> filter of length  of one 64-bit word which is minimal length composed of
> all 0's (0x0000000000000000)?
> 

See t0095-bloom.sh: The filter for a commit with no changes is of length 0.
I will call it out specifically in the appropriate commit message as well. 

>>
>> [1] https://devblogs.microsoft.com/devops/super-charging-the-git-commit-graph-iv-Bloom-filters/
> 
> I would write it in full, similar to subsequent bibliographical entries,
> that is:
> 
>   [1] Derrick Stolee
>       "Supercharging the Git Commit Graph IV: Bloom Filters"
>       https://devblogs.microsoft.com/devops/super-charging-the-git-commit-graph-iv-Bloom-filters/
> 
> But that is just a matter of style.
> 

Sounds good. Will do. 

>>
>> [4] Thomas Mueller Graf, Daniel Lemire
>>     "Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters"
>>     https://arxiv.org/abs/1912.08258
>>
>> [5] https://en.wikipedia.org/wiki/MurmurHash#Algorithm
>>
>> Helped-by: Jeff King <peff@peff.net>
>> Helped-by: Derrick Stolee <dstolee@microsoft.com>
>> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
>> ---
>>  Makefile              |   2 +
>>  bloom.c               | 228 ++++++++++++++++++++++++++++++++++++++++++
>>  bloom.h               |  56 +++++++++++
>>  t/helper/test-bloom.c |  84 ++++++++++++++++
>>  t/helper/test-tool.c  |   1 +
>>  t/helper/test-tool.h  |   1 +
>>  t/t0095-bloom.sh      | 113 +++++++++++++++++++++
>>  7 files changed, 485 insertions(+)
>>  create mode 100644 bloom.c
>>  create mode 100644 bloom.h
>>  create mode 100644 t/helper/test-bloom.c
>>  create mode 100755 t/t0095-bloom.sh
> 
> As I wrote earlier, In my opinion this patch could be split into three
> individual single-functionality pieces, to make it easier to review and
> aid in bisectability if needed.
> 

Doing this in v3. 


>> +
>> +static uint32_t rotate_right(uint32_t value, int32_t count)
>> +{
>> +	uint32_t mask = 8 * sizeof(uint32_t) - 1;
>> +	count &= mask;
>> +	return ((value >> count) | (value << ((-count) & mask)));
>> +}
> 
> Hmmm... both the algoritm on Wikipedia, and reference implementation use
> rotate *left*, not rotate *right* in the implementation of Murmur3 hash,
> see
> 
>   https://en.wikipedia.org/wiki/MurmurHash#Algorithm
>   https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp#L23
> 
> 
> inline uint32_t rotl32 ( uint32_t x, int8_t r )
> {
>   return (x << r) | (x >> (32 - r));
> }
> 

Thanks! Fixed this in v3. More on it later. 

>> +
>> +/*
>> + * Calculate a hash value for the given data using the given seed.
>> + * Produces a uniformly distributed hash value.
>> + * Not considered to be cryptographically secure.
>> + * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
>> + **/
>     ^^-- why two _trailing_ asterisks?
> 

Oops. Fixed. 

>> +static uint32_t seed_murmur3(uint32_t seed, const char *data, int len)
> 
> In short, I think that the name of the function should be murmur3_32, or
> murmurhash3_32, or possibly murmur3_32_seed, or something like that.
>

Renamed it to murmur3_seeded in v3. The input and output types in the 
signature make it clear that it is 32-bit version.
 
>> +{
>> +	const uint32_t c1 = 0xcc9e2d51;
>> +	const uint32_t c2 = 0x1b873593;
>> +	const uint32_t r1 = 15;
>> +	const uint32_t r2 = 13;
>> +	const uint32_t m = 5;
>> +	const uint32_t n = 0xe6546b64;
>> +	int i;
>> +	uint32_t k1 = 0;
>> +	const char *tail;
>> +
>> +	int len4 = len / sizeof(uint32_t);
>> +
>> +	const uint32_t *blocks = (const uint32_t*)data;
>> +
>> +	uint32_t k;
>> +	for (i = 0; i < len4; i++)
>> +	{
>> +		k = blocks[i];
> 
> IMPORTANT: There is a comment around there in the example implementation
> in C on Wikipedia that this operation above is a source of differing
> results across endianness.  

Thanks! SZEDER found this on his CI pipeline and we have fixed it to 
process the data in 1 byte words to avoid hitting any endian-ness issues. 
See this part of the thread that carries the fix and the related discussion. 
  https://lore.kernel.org/git/ba856e20-0a3c-e2d2-6744-b9abfacdc465@gmail.com/
I will be squashing those changes in appropriately in v3.  
 
>> +		k1 *= c2;
>> +		seed ^= k1;
>> +		break;
>> +	}
>> +
>> +	seed ^= (uint32_t)len;
>> +	seed ^= (seed >> 16);
>> +	seed *= 0x85ebca6b;
>> +	seed ^= (seed >> 13);
>> +	seed *= 0xc2b2ae35;
>> +	seed ^= (seed >> 16);
>> +
>> +	return seed;
>> +}
> 
> In https://public-inbox.org/git/ba856e20-0a3c-e2d2-6744-b9abfacdc465@gmail.com/
> you posted "[PATCH] Process bloom filter data as 1 byte words".
> This may avoid the Big-endian vs Little-endian confusion,
> that is wrong results on Big-endian architectures, but
> it also may slow down the algorithm.
> 

Oh cool! You have seen that patch. And yes, we understand that it might add 
a little overhead but at this point it is more important to be correct on all
architectures instead of micro-optimizing and introducing different 
implementations for Little-endian and Big-endian. This would make this 
series overly complicated. Optimizing the hashing techniques would deserve a
series of its own, which we can definitely revisit later.

> The public domain implementation in PMurHash.c in SMHasher
> (re)implementation in Chromium (see URL above) fall backs to 1-byte
> operations only if it doesn't know the endianness (or if it is neither
> little-endian, nor big-endian, i.e. middle-endian or mixed-endian --
> though I doubt that Git works correctly on mixed-endian anyway).
> 
> 
> Sidenote: it looks like the current implementation if Murmur hash in
> Cromium uses MurmurHash3_x86_32, i.e. little-endian unaligned-safe
> implementation, but prepares data by swapping with StringToLE32
> https://github.com/chromium/chromium/blob/master/components/variations/variations_murmur_hash.h
> 
> 
> Assuming that the terminating NUL ("\0") character of a c-string is not
> included in hash calculations, then murmur3_x86_32 hash has the
> following results (all results are for seed equal 0):
> 
> ''               -> 0x00000000
> ' '              -> 0x7ef49b98
> 'Hello world!'   -> 0x627b0c2c
> 'The quick brown fox jumps over the lazy dog'   -> 0x2e4ff723
> 
> C source (from Wikipedia): https://godbolt.org/z/ofa2p8
> C++ source (Appleby's):    https://godbolt.org/z/BoSt6V
> 
> The implementation provided in this patch, with rotate_right (instead of
> rotate_left) gives, on little-endian machine, different results:
> 
> ''               -> 0x00000000
> ' '              -> 0xd1f27e64
> 'Hello world!'   -> 0xa0791ad7
> 'The quick brown fox jumps over the lazy dog'   -> 0x99f1676c
> 
> https://github.com/gitgitgadget/git/blob/e1b076a714d611e59d3d71c89221e41a3427fae4/bloom.c#L21
> C source (via GitGitGadget): https://godbolt.org/z/R9s8Tt
> 

Thanks! This is an excellent catch! Fixing the rotate_right to rotate_left, 
gives us the same answers as the two implementations you pointed out. I have
added the appropriate unit tests in v3 and they match the values you obtained 
from the other implementations. Thanks a lot for the rigor! 

We based our implementation on the pseudo code and not on the sample code 
presented here: https://en.wikipedia.org/wiki/MurmurHash#Algorithm
We just didn't parse the ROL instruction correctly. 

>> +
>> +void load_bloom_filters(void)
>> +{
>> +	init_bloom_filter_slab(&bloom_filters);
>> +}
> 
> 
> Actually this function doesn't load anything.  Perhaps it should be
> named init_bloom_filters() or init_bloom_filters_storage(), or
> bloom_filters_init()?
>

Changed to init_bloom_filters() in v3. Thanks! 
 
>> +
>> +void fill_bloom_key(const char *data,
>> +					int len,
>> +					struct bloom_key *key,
>> +					struct bloom_filter_settings *settings)
> 
> The last parameter could be of 'const bloom_filter_settings *' type.
> 

Done. 

>> +{
>> +	int i;
>> +	const uint32_t seed0 = 0x293ae76f;
>> +	const uint32_t seed1 = 0x7e646e2c;
> 
> Where did those seeds values came from?
> 

Those values were chosen randomly. They will be fixed constants for the 
current hashing version. I will add a note calling this out in the 
appropriate commit messages and the Documentation in v3. 

>> +	const uint32_t hash0 = seed_murmur3(seed0, data, len);
>> +	const uint32_t hash1 = seed_murmur3(seed1, data, len);
>> +
>> +	key->hashes = (uint32_t *)xcalloc(settings->num_hashes, sizeof(uint32_t));
>> +	for (i = 0; i < settings->num_hashes; i++)
>> +		key->hashes[i] = hash0 + i * hash1;
> 
> Note that in [3] authors say that double hashing technique has some
> problems.  For one, we should ensure that hash1 is not zero, and even
> better that it is odd (which makes it relatively prime to filter size
> which is multiple of 64).  It also suffers from something called
> "approximate fingerprint collisions".
> 
> That is why the define "enhanced double hashing" technique, which does
> not suffer from those problems (Algorithm 2, page 11/15).
> 
>   +	for (i = 0; i < settings->num_hashes; i++) {
>   +		key->hashes[i] = hash0;
>   +
>   +		hash0 = hash0 + hash1;
>   +		hash1 = hash1 + i;
>   +	}
> 
> This can also be written in closed form, based on equation (6)
> 
>   +	for (i = 0; i < settings->num_hashes; i++)
>   +		key->hashes[i] = hash0 + i * hash1 + i*(i*i - 1)/6;
> 
> 
> In later paper [6] the closed form for "enhanced double hashing"
> (p. 188) is slightly modified (or rather they use different variant of
> this technique):
> 
>   +	for (i = 0; i < settings->num_hashes; i++)
>   +		key->hashes[i] = hash0 + i * hash1 + i*i;
> 
> This is a variant of more generic "enhanced double hashing", section
> 5.2 (Enhanced) Double Hashing Schemes (page 199):
> 
>         h_1(u) + i h_2(u) + f(i)    mod m
> 
> with f(i) = i^2 = i*i.
> 
> They have tested that enhanced double hashing with both f(i) equal i*i
> and equal i*i*i, and triple hashing technique, and they have found that
> it performs slightly better than straight double hashing technique
> (Fig. 1, page 212, section 3).
> 

Thanks for the detailed research here! The hash becoming zero and the 
approximate fingerprint collision are both extremely rare situations. In both
cases, we would just see git log having to diff more trees than if it didn't 
occur. While these techniques would be great optimizations to do, especially
if this implementation gets pulled into more generic hashing applications
in the code, we think that for the purposes of the current series - it is not 
worth it. I say this because Azure Repos has been using this exact hashing 
technique for several years now without any glitches. And we think it would
be great to rely on this battle tested strategy in atleast the first version
of this feature. 

>> +}
>> +
>> +void add_key_to_filter(struct bloom_key *key,
>> +					   struct bloom_filter *filter,
>> +					   struct bloom_filter_settings *settings)
> 
> Here again the 'settings' argument can be const (as can the 'key'
> parameter).
> 

Done. 

>> +
>> +struct bloom_filter *get_bloom_filter(struct repository *r,
>> +				      struct commit *c)
>> +{
>> +	struct bloom_filter *filter;
>> +	struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
>> +	int i;
>> +	struct diff_options diffopt;
>> +
>> +	if (!bloom_filters.slab_size)
>> +		return NULL;
> 
> This is testing that commit slab for per-commit Bloom filters is
> initialized, isn't it?
> 
> First, should we write the condition as
> 
> 	if (!bloom_filters.slab_size)
> 
> or would the following be more readable
> 
> 	if (bloom_filters.slab_size == 0)
> 

Sure. Switched to `if (bloom_filter.slab_size == 0)` in v3. 

> Second, should we return NULL, or should we just initialize the slab?
> Or is non-existence of slab treated as a signal that the Bloom filters
> mechanism is turned off?
> 

Yes. We purposefully choose to return NULL and ignore the mechanism 
overall because we use Bloom filters best effort only. 

>> +
>> +	if (diff_queued_diff.nr <= 512) {
>
> Second, there is a minor issue that diff_queue_struct.nr stores the
> number of filepairs, that is the number of changed files, while the
> number of elements added to Bloom filter is number of changed blobs and
> trees.  For example if the following files are changed:
> 
>   sub/dir/file1
>   sub/file2
> 
> then diff_queued_diff.nr is 2, but number of elements to be added to
> Bloom filter is 4.
> 
>   sub/dir/file1
>   sub/file2
>   sub/dir/
>   sub/
> 
> I'm not sure if it matters in practice.
> 

It does not matter much in practice, since the directories usually tend
to collapse across the changes. Still, I will add another limit after 
creating the hashmap entries to cap at 640 so that we have a maximum of 
100 changes in the bloom filter. 

We plan to make these values configurable later. 

>> +		struct hashmap pathmap;
>> +		struct pathmap_hash_entry* e;
>> +		struct hashmap_iter iter;
>> +		hashmap_init(&pathmap, NULL, NULL, 0);
> 
> Stylistic issue: I have just noticed that here (and in some other
> places), but not in all cases, you declare pointer types with asterisk
> cuddled to type name, not to variable name, which contradicts
> CodingGuidelines

Thanks for noticing that! Fixed all of these in v3. 

>> +
>> +		for (i = 0; i < diff_queued_diff.nr; i++) {
>> +			const char* path = diff_queued_diff.queue[i]->two->path;
> 
> Is that correct that we consider only post-image name for storing
> changes in Bloom filter?  Currently if file was renamed (or deleted), it
> is considered changed, and `git log -- <old-name>` lists commit that
> changed file name too.
>

The tests in t4216-log-bloom.sh ensure that the output of `git log -- <oldname>` 
remains unchanged for renamed and deleted files, when using bloom filters. 
I realize that I fat fingered over checking the old name, and didn't have an 
explicit deleted file in the test. I have added them in v3, and the tests pass. 
So the behavior is preserved and as expected when using Bloom filters. 
Thanks for paying close attention! 
 
>> +			const char* p = path;
> 
> It should be "const char *" for both.
> 
>> +
>> +			/*
>> +			* Add each leading directory of the changed file, i.e. for
>> +			* 'dir/subdir/file' add 'dir' and 'dir/subdir' as well, so
>> +			* the Bloom filter could be used to speed up commands like
>> +			* 'git log dir/subdir', too.
>> +			*
>> +			* Note that directories are added without the trailing '/'.
>> +			*/
>> +			do {
>> +				char* last_slash = strrchr(p, '/');
>> +
>> +				FLEX_ALLOC_STR(e, path, path);
> 
> Here first 'path' is the field name, i.e. pathmap_hash_entry.path,
> second 'path' is the name of local variable, aliased also to 'p'.
> 
>> +				hashmap_entry_init(&e->entry, strhash(p));
> 
> I don't know why both 'path' and 'p' are used, while both point to the
> same memory (and thus have the same contents).  It is a bit confusing.
> See also my previous comment.
> 

Cleaned up in v3. Thanks! 
>> +		filter->data = NULL;
>> +		filter->len = 0;
> 
> This needs to be explicitly stated both in the commit message and in the
> API documentation (in comments) that bloom_filter.len == 0 means "no
> data", while "no changes" is represented as bloom_filter with len == 1
> and *data == (uint64_t)0;
> 
> EDIT: actually "no changes" is also represented as bloom_filter with len
> equal 0, as it turns out.
> 
> One possible alternative could be representing "no data" value with
> Bloom filter of length 1 and all 64 bits set to 1, and "no changes"
> represented as filter of length 0.  This is not unambiguous choice!
>

There is no gain in distinguishing between the absence of a filter and
a commit having no changes. The effect on `git log -- path` is the same in 
both cases. We fall back to the normal diffing algorithm in revision.c.
I will make this clearer in the appropriate commit messages and in the 
Documentation in v3. 
 
>> +}
>> diff --git a/bloom.h b/bloom.h
>> new file mode 100644
>> index 0000000000..7f40c751f7
>> --- /dev/null
>> +++ b/bloom.h
>> @@ -0,0 +1,56 @@
>> +#ifndef BLOOM_H
>> +#define BLOOM_H
> 
> Should we #include the stdint.h header for uint32_t and uint64_t types?
> 

git-compat-util.h takes care of this. 

>> +
>> +struct commit;
>> +struct repository;
>> +struct commit_graph;
>> +
> 
> Perhaps we should add block comment for this struct, like there is one
> for struct bloom_filter below.
> 

Done in v3.

>> +struct bloom_filter_settings {
>> +	uint32_t hash_version;
>> +	uint32_t num_hashes;
>> +	uint32_t bits_per_entry;
> 
> I guess that the type uint32_t was chosen to make it easier to store
> this information and later retrieve it from the commit-graph file, isn't
> it?  Otherwise those types are much too large for sensible range of
> values (which would all fit in 8-bits byte).
> 

Yes.

>> +
>> +/*
>> + * A bloom_key represents the k hash values for a
>> + * given hash input. These can be precomputed and
>> + * stored in a bloom_key for re-use when testing
>> + * against a bloom_filter.
> 
> We might want to add that the number of hash values is given by Bloom
> filter settings, and it is assumed to be the same for all bloom_key
> variables / objects.
> 

Incorporated in v3. 

>> +. ./test-lib.sh
>> +
>> +test_expect_success 'get bloom filters for commit with no changes' '
>> +	git init &&
>> +	git commit --allow-empty -m "c0" &&
>> +	cat >expect <<-\EOF &&
>> +	Filter_Length:0
>> +	Filter_Data:
>> +	EOF
>> +	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
>> +	test_cmp expect actual
>> +'
> 
> A few things.  First, I wonder why we need to provide object ID;
> couldn't 'test-tool bloom get_filter_for_commit' parse commit-ish
> argument, or would it make it too complicated for no reason?
> 

Yes it was overkill for what I need in the test. 

>> +
>> +test_expect_success 'get bloom filter for commit with 10 changes' '
>> +	rm actual &&
>> +	rm expect &&
>> +	mkdir smallDir &&
>> +	for i in $(test_seq 0 9)
>> +	do
>> +		echo $i >smallDir/$i
>> +	done &&
>> +	git add smallDir &&
>> +	git commit -m "commit with 10 changes" &&
>> +	cat >expect <<-\EOF &&
>> +	Filter_Length:4
>> +	Filter_Data:508928809087080a|8a7648210804001|4089824400951000|841ab310098051a8|
>> +	EOF
>> +	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
>> +	test_cmp expect actual
>> +'
> 
> This test is in my opinion fragile, as it unnecessarily test the
> implementation details instead of the functionality provided.  If we
> change the hashing scheme (for example going from double hashing to some
> variant of enhanced double hashing), or change the base hash function
> (for example from Murmur3_32 to xxHash_64), or change the number of hash
> functions (perhaps because changing of number of bits per element, and
> thus optimal number of hash functions from 7 to 6), or change from
> 64-bit word blocks to 32-bit word blocks, the test would have to be
> changed.
> 

Regarding this and the rest of you comments on t0095-log-bloom.sh:

I am tweaking it as necessary but the entire point of these tests is to
break for the things you called out. They need to be intricately tied
to the current hashing strategy and are hence intended to be fragile so 
as to catch any subtle or accidental changes in the hashing computation. 
Any change like the ones you have called out would require a hash version
change and all the compatibility reactions that come with it. 

I have added more tests around the murmur3_seeded method in v3. Removed
some of the redundant ones. 

The other more evolved test cases you call out are covered in the e2e
integration tests in t4216-log-bloom.sh

> 
> Reviewed-by: Jakub Narębski <jnareb@gmail.com>
> 
> Thanks for working on this.
> 
> Best,
> 

Thank you once again for an excellent and in-depth review of this patch! 
You have helped make this code so much better!

Cheers! 
Garima Singh

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH v2 03/11] diff: halt tree-diff early after max_changes
  2020-02-17  0:00     ` Jakub Narebski
@ 2020-02-22  0:37       ` Garima Singh
  0 siblings, 0 replies; 150+ messages in thread
From: Garima Singh @ 2020-02-22  0:37 UTC (permalink / raw)
  To: Jakub Narebski, Derrick Stolee via GitGitGadget
  Cc: git, Derrick Stolee, Derrick Stolee, SZEDER Gábor,
	Jonathan Tan, Jeff Hostetler, Taylor Blau, Jeff King,
	Christian Couder, Emily Shaffer, Junio C Hamano, Garima Singh


On 2/16/2020 7:00 PM, Jakub Narebski wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> Use this max_changes option in the bloom filter calculations. This
>> reduces the time taken to compute the filters for the Linux kernel
>> repo from 2m50s to 2m35s. On a large internal repository with ~500
>> commits that perform tree-wide changes, the time reduced from
>> 6m15s to 3m48s.
> 
> I wonder if there is some large open-source project with many commits
> performing tree-wide changes, that is with many commits with more than
> 512 changed files with respect to the first parent.
> 
> Maybe https://github.com/whosonfirst-data/whosonfirst-data-venue-us-ny
> from "Top Ten Worst Repositories to host on GitHub - Git Merge 2017"
> could be a good repository to test ;-)
> 

Thanks for the suggestion! I will see if any of these repos gives us a 
good test bed and add the perf improvement numbers in the appropriate
commit messages in v3. 

Cheers! 
Garima Singh

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH v2 04/11] commit-graph: compute Bloom filters for changed paths
  2020-02-17 21:56     ` Jakub Narebski
@ 2020-02-22  0:55       ` Garima Singh
  2020-02-23 17:34         ` Jakub Narebski
  0 siblings, 1 reply; 150+ messages in thread
From: Garima Singh @ 2020-02-22  0:55 UTC (permalink / raw)
  To: Jakub Narebski, Garima Singh via GitGitGadget
  Cc: git, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
	Jeff Hostetler, Taylor Blau, Jeff King, Christian Couder,
	Emily Shaffer, Junio C Hamano, Garima Singh


On 2/17/2020 4:56 PM, Jakub Narebski wrote:
> "Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> From: Garima Singh <garima.singh@microsoft.com>
>> Subject: [PATCH v2 04/11] commit-graph: compute Bloom filters for changed paths
>>
>> Compute Bloom filters for the paths that changed between a commit and its
>> first parent using the implementation in bloom.c, when the
>> COMMIT_GRAPH_WRITE_CHANGED_PATHS flag is set. This computation is done on a
>> commit-by-commit basis. We will write these Bloom filters to the commit graph
>> file in the next change.
> 
> I have no major complaints about the contents of this patch (except lack
> of test, and type of total_bloom_filter_data_size), but the commit
> message could have been worded better.
> 
> I would write something like this instead:
> 
>   Add new COMMIT_GRAPH_WRITE_CHANGED_PATHS flag that makes Git compute
>   Bloom filters that store the information about changed paths (that
>   changed between a commit and its first parent) for each commit in the
>   commit-graph.  This computation is done on a commit-by-commit basis.
> 
>   We will write these Bloom filters to the commit-graph file, to store
>   this data on disk, in the next change in this series.
> 
> In my opinion the fact that we compute Bloom filters for each and every
> commit in the commit-graph file is more important than quite obvious
> fact that we use implementation from bloom.c.
> 

Nice! Incorporated in v3. Thanks!

>>
>> Helped-by: Derrick Stolee <dstolee@microsoft.com>
>> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
>> ---
>>  commit-graph.c | 32 +++++++++++++++++++++++++++++++-
>>  commit-graph.h |  3 ++-
>>  2 files changed, 33 insertions(+), 2 deletions(-)
> 
> It would be good to have at least sanity check of this feature, perhaps
> one that would check that the number of per-commit Bloom filters on slab
> matches the number of commits in the commit-graph.
> 

The combination of all the e2e tests in this series with the test
flag being turned on in the CI, and the performance gains we are seeing
confirm that this is happening correctly.

>>  
>>  	const struct split_commit_graph_opts *split_opts;
>> +	uint32_t total_bloom_filter_data_size;
> 
> This is total size of Bloom filters data, in bytes, that will later be
> used for BDAT chunk size.  However the commit-graph format uses 8 bytes
> for byte-offset, not 4 bytes.  Why it is uint32_t and not uint64_t then?
>

Changed to size_t. Thanks for noticing! 
 
>>  };
>>  
>>  static void write_graph_chunk_fanout(struct hashfile *f,
>> @@ -1140,6 +1143,28 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>>  	stop_progress(&ctx->progress);
>>  }
>>  
>> +static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>> +{
>> +	int i;
>> +	struct progress *progress = NULL;
>> +
>> +	load_bloom_filters();
>> +
>> +	if (ctx->report_progress)
>> +		progress = start_progress(
>> +			_("Computing commit diff Bloom filters"),
>> +			ctx->commits.nr);
>> +
> 
> Shouldn't we initialize ctx->total_bloom_filter_data_size to 0 here?  We
> cannot use compute_bloom_filters() to _update_ Bloom filters data, I
> think -- we don't distinguish here between new and existing data (where
> existing data size is already included in total Bloom filters size).  At
> least I don't think so.
> 

This line in commit-graph.c takes care of reinitializing the graph context and
by consequence the bloom filter data size. 
  ctx = xcalloc(1, sizeof(struct write_commit_graph_context));
  
So the total size gets recalculated every time, which is correct. 

> 
> Side note: perhaps we could add trailing comma after new enum entry,
> that is
> 
>   +	COMMIT_GRAPH_WRITE_BLOOM_FILTERS = (1 << 4),
> 
> following new CodingGuidelines recommendation
> 

Thanks! Fixed in v3.

Cheers! 
Garima Singh

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH v2 09/11] commit-graph: add --changed-paths option to write subcommand
  2020-02-20 22:10     ` Bryan Turner
@ 2020-02-22  1:44       ` Garima Singh
  0 siblings, 0 replies; 150+ messages in thread
From: Garima Singh @ 2020-02-22  1:44 UTC (permalink / raw)
  To: Bryan Turner, Garima Singh via GitGitGadget
  Cc: Git Users, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
	jeffhost, me, Jeff King, jnareb, Christian Couder, emilyshaffer,
	Junio C Hamano, Garima Singh


On 2/20/2020 5:10 PM, Bryan Turner wrote:
> On Wed, Feb 5, 2020 at 2:56 PM Garima Singh via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>> diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
>> index bcd85c1976..907d703b30 100644
>> --- a/Documentation/git-commit-graph.txt
>> +++ b/Documentation/git-commit-graph.txt
>> @@ -54,6 +54,11 @@ or `--stdin-packs`.)
>>  With the `--append` option, include all commits that are present in the
>>  existing commit-graph file.
>>  +
>> +With the `--changed-paths` option, compute and write information about the
>> +paths changed between a commit and it's first parent. This operation can
> 
> "its first parent"
> 
> (Pardon the grammar nit from the peanut gallery!)
> 

:)
Thank you! Fixed in v3. 

Cheers! 
Garima Singh

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH v2 02/11] bloom: core Bloom filter implementation for changed paths
  2020-02-22  0:32       ` Garima Singh
@ 2020-02-23 13:38         ` Jakub Narebski
  2020-02-24 17:34           ` Garima Singh
  0 siblings, 1 reply; 150+ messages in thread
From: Jakub Narebski @ 2020-02-23 13:38 UTC (permalink / raw)
  To: Garima Singh
  Cc: Garima Singh via GitGitGadget, git, Derrick Stolee,
	SZEDER Gábor, Jonathan Tan, Jeff Hostetler, Taylor Blau,
	Jeff King, Christian Couder, Emily Shaffer, Junio C Hamano,
	Garima Singh

Garima Singh <garimasigit@gmail.com> writes:
> On 2/16/2020 11:49 AM, Jakub Narebski wrote:
>>> From: Garima Singh <garima.singh@microsoft.com>
>>>
>>> Add the core Bloom filter logic for computing the paths changed between a
>>> commit and its first parent. For details on what Bloom filters are and how they
>>> work, please refer to Dr. Derrick Stolee's blog post [1]. It provides a concise
>>> explaination of the adoption of Bloom filters as described in [2] and [3].
>>                                                                           ^^- to add
>
> Not sure what this means. Can you please clarify. 
>
>>> 1. We currently use 7 and 10 for the number of hashes and the size of each
>>>    entry respectively. They served as great starting values, the mathematical
>>>    details behind this choice are described in [1] and [4]. The implementation,
>>                                                                                ^^- to add
>
> Not sure what this means. Can you please clarify.

I'm sorry for not being clear.  What I wanted to say that in both cases
the last line should have ended in either full stop in first case, or
comma in second case:

  "as described in [2] and [3]."

  "The implementation,"

What I wrote (trying to put the arrow below final fullstop or comma)
only works when one is using with fixed-width font.

>>> 3. The filters are sized according to the number of changes in the each commit,
>>>    with minimum size of one 64 bit word.

[...]
>> The interesting corner case, which might be worth specifying explicitly,
>> is what happens in the case there are _no changes_ with respect to first
>> parent (which can happen with either commit created with `git commit
>> --allow-empty`, or merge created e.g. with `git merge --strategy=ours`).
>> Is this case represented as Bloom filter of length 0, or as a Bloom
>> filter of length  of one 64-bit word which is minimal length composed of
>> all 0's (0x0000000000000000)?
>> 
>
> See t0095-bloom.sh: The filter for a commit with no changes is of length 0.
> I will call it out specifically in the appropriate commit message as well. 

I have realized this only later that both "no changes" and "no data"
uses filter of length 0; which works well because checking the diff if
there were no changes is cheap (both tree oids are the same).

>>> ---
>>>  Makefile              |   2 +
>>>  bloom.c               | 228 ++++++++++++++++++++++++++++++++++++++++++
>>>  bloom.h               |  56 +++++++++++
>>>  t/helper/test-bloom.c |  84 ++++++++++++++++
>>>  t/helper/test-tool.c  |   1 +
>>>  t/helper/test-tool.h  |   1 +
>>>  t/t0095-bloom.sh      | 113 +++++++++++++++++++++
>>>  7 files changed, 485 insertions(+)
>>>  create mode 100644 bloom.c
>>>  create mode 100644 bloom.h
>>>  create mode 100644 t/helper/test-bloom.c
>>>  create mode 100755 t/t0095-bloom.sh
>> 
>> As I wrote earlier, In my opinion this patch could be split into three
>> individual single-functionality pieces, to make it easier to review and
>> aid in bisectability if needed.
>
> Doing this in v3. 

Thanks.  Though if it makes (much) more work for you, I can work with
unsplit patch, no problem.

>>> +
>>> +static uint32_t rotate_right(uint32_t value, int32_t count)
>>> +{
>>> +	uint32_t mask = 8 * sizeof(uint32_t) - 1;
>>> +	count &= mask;
>>> +	return ((value >> count) | (value << ((-count) & mask)));
>>> +}
>> 
>> Hmmm... both the algoritm on Wikipedia, and reference implementation use
>> rotate *left*, not rotate *right* in the implementation of Murmur3 hash,
>> see
>> 
>>   https://en.wikipedia.org/wiki/MurmurHash#Algorithm
>>   https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp#L23
>> 
>> 
>> inline uint32_t rotl32 ( uint32_t x, int8_t r )
>> {
>>   return (x << r) | (x >> (32 - r));
>> }
>
> Thanks! Fixed this in v3. More on it later. 

Sidenote: If I understand it correctly Bloom filters functionality is
included in Scalar [1].  What will happen then with all those Bloom
filter chunks in commit-graph files with wrong hash functions?

[1]: https://devblogs.microsoft.com/devops/introducing-scalar/

>>> +
>>> +/*
>>> + * Calculate a hash value for the given data using the given seed.
>>> + * Produces a uniformly distributed hash value.
>>> + * Not considered to be cryptographically secure.
>>> + * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
>>> + **/
>>     ^^-- why two _trailing_ asterisks?
>
> Oops. Fixed. 

Often two _leading_ asterisks are used to mark commit as containing
docstring in some specific format, like Doxygen.  Two _trailing_
asterisks looks like typo.

>>> +static uint32_t seed_murmur3(uint32_t seed, const char *data, int len)
>> 
>> In short, I think that the name of the function should be murmur3_32, or
>> murmurhash3_32, or possibly murmur3_32_seed, or something like that.
>
> Renamed it to murmur3_seeded in v3. The input and output types in the 
> signature make it clear that it is 32-bit version.

All right, I can agree with that.

>>> +{
>>> +	const uint32_t c1 = 0xcc9e2d51;
>>> +	const uint32_t c2 = 0x1b873593;
>>> +	const uint32_t r1 = 15;
>>> +	const uint32_t r2 = 13;
>>> +	const uint32_t m = 5;
>>> +	const uint32_t n = 0xe6546b64;
>>> +	int i;
>>> +	uint32_t k1 = 0;
>>> +	const char *tail;
>>> +
>>> +	int len4 = len / sizeof(uint32_t);
>>> +
>>> +	const uint32_t *blocks = (const uint32_t*)data;
>>> +
>>> +	uint32_t k;
>>> +	for (i = 0; i < len4; i++)
>>> +	{
>>> +		k = blocks[i];
>> 
>> IMPORTANT: There is a comment around there in the example implementation
>> in C on Wikipedia that this operation above is a source of differing
>> results across endianness.  
>
> Thanks! SZEDER found this on his CI pipeline and we have fixed it to 
> process the data in 1 byte words to avoid hitting any endian-ness issues. 
> See this part of the thread that carries the fix and the related discussion. 
>   https://lore.kernel.org/git/ba856e20-0a3c-e2d2-6744-b9abfacdc465@gmail.com/
> I will be squashing those changes in appropriately in v3.  

[...]
>>> +		k1 *= c2;
>>> +		seed ^= k1;
>>> +		break;
>>> +	}
>>> +
>>> +	seed ^= (uint32_t)len;
>>> +	seed ^= (seed >> 16);
>>> +	seed *= 0x85ebca6b;
>>> +	seed ^= (seed >> 13);
>>> +	seed *= 0xc2b2ae35;
>>> +	seed ^= (seed >> 16);
>>> +
>>> +	return seed;
>>> +}
>> 
>> In https://public-inbox.org/git/ba856e20-0a3c-e2d2-6744-b9abfacdc465@gmail.com/
>> you posted "[PATCH] Process bloom filter data as 1 byte words".
>> This may avoid the Big-endian vs Little-endian confusion,
>> that is wrong results on Big-endian architectures, but
>> it also may slow down the algorithm.
>
> Oh cool! You have seen that patch. And yes, we understand that it might add 
> a little overhead but at this point it is more important to be correct on all
> architectures instead of micro-optimizing and introducing different 
> implementations for Little-endian and Big-endian. This would make this 
> series overly complicated. Optimizing the hashing techniques would deserve a
> series of its own, which we can definitely revisit later.

Right, "first make it work, then make it right, and, finally, make it fast.".

Anyway, could you maybe compare performance of Git for old version
(operating on 32-bit/4-bytes words) and new version (operating on 1-byte
words) file history operation with Bloom filters, to see if it matters
or not?

>> The public domain implementation in PMurHash.c in SMHasher
>> (re)implementation in Chromium (see URL above) fall backs to 1-byte
>> operations only if it doesn't know the endianness (or if it is neither
>> little-endian, nor big-endian, i.e. middle-endian or mixed-endian --
>> though I doubt that Git works correctly on mixed-endian anyway).
>> 
>> 
>> Sidenote: it looks like the current implementation if Murmur hash in
>> Chromium uses MurmurHash3_x86_32, i.e. little-endian unaligned-safe
>> implementation, but prepares data by swapping with StringToLE32
>> https://github.com/chromium/chromium/blob/master/components/variations/variations_murmur_hash.h

The solution in PMurHash.c in Chromium, and the pseudo-code algorithm on
Wikipedia do endian handling only for remaining bytes (while the
solution in Appleby's code [beginnings of], and in current
above-mentioned Chromium implementation do the conversion for all
bytes).  I think that handling it only for remaining bytes (for data
sizes not being multiply of 32-bits / 4-bytes) is enough; all other
operations, that is multiply, rotate, xor and addition do not depend on
endianness.

>> Assuming that the terminating NUL ("\0") character of a c-string is not
>> included in hash calculations, then murmur3_x86_32 hash has the
>> following results (all results are for seed equal 0):
>> 
>> ''               -> 0x00000000
>> ' '              -> 0x7ef49b98
>> 'Hello world!'   -> 0x627b0c2c
>> 'The quick brown fox jumps over the lazy dog'   -> 0x2e4ff723
>> 
>> C source (from Wikipedia): https://godbolt.org/z/ofa2p8
>> C++ source (Appleby's):    https://godbolt.org/z/BoSt6V
>> 
>> The implementation provided in this patch, with rotate_right (instead of
>> rotate_left) gives, on little-endian machine, different results:
>> 
>> ''               -> 0x00000000
>> ' '              -> 0xd1f27e64
>> 'Hello world!'   -> 0xa0791ad7
>> 'The quick brown fox jumps over the lazy dog'   -> 0x99f1676c
>> 
>> https://github.com/gitgitgadget/git/blob/e1b076a714d611e59d3d71c89221e41a3427fae4/bloom.c#L21
>> C source (via GitGitGadget): https://godbolt.org/z/R9s8Tt
>> 
>
> Thanks! This is an excellent catch! Fixing the rotate_right to rotate_left, 
> gives us the same answers as the two implementations you pointed out. I have
> added the appropriate unit tests in v3 and they match the values you obtained 
> from the other implementations. Thanks a lot for the rigor! 
>
> We based our implementation on the pseudo code and not on the sample code 
> presented here: https://en.wikipedia.org/wiki/MurmurHash#Algorithm
> We just didn't parse the ROL instruction correctly. 

All right, that's good.

Note that the pseudo code includes the following:

    with any remainingBytesInKey do
        remainingBytes ← SwapToLittleEndian(remainingBytesInKey)
        // Note: Endian swapping is only necessary on big-endian machines.
        //       The purpose is to place the meaningful digits towards the low end of the value,
        //       so that these digits have the greatest potential to affect the low range digits
        //       in the subsequent multiplication.  Consider that locating the meaningful digits
        //       in the high range would produce a greater effect upon the high digits of the
        //       multiplication, and notably, that such high digits are likely to be discarded
        //       by the modulo arithmetic under overflow.  We don't want that.

[...]
>>> +{
>>> +	int i;
>>> +	const uint32_t seed0 = 0x293ae76f;
>>> +	const uint32_t seed1 = 0x7e646e2c;
>> 
>> Where did those seeds values came from?
>>
>
> Those values were chosen randomly. They will be fixed constants for the 
> current hashing version. I will add a note calling this out in the 
> appropriate commit messages and the Documentation in v3. 

Nice to know.

I wonder if those seed values should be relatively prime, and whether
seed1 should be odd (from theoretical point of view).

>>> +	const uint32_t hash0 = seed_murmur3(seed0, data, len);
>>> +	const uint32_t hash1 = seed_murmur3(seed1, data, len);
>>> +
>>> +	key->hashes = (uint32_t *)xcalloc(settings->num_hashes, sizeof(uint32_t));
>>> +	for (i = 0; i < settings->num_hashes; i++)
>>> +		key->hashes[i] = hash0 + i * hash1;
>> 
>> Note that in [3] authors say that double hashing technique has some
>> problems.  For one, we should ensure that hash1 is not zero, and even
>> better that it is odd (which makes it relatively prime to filter size
>> which is multiple of 64).  It also suffers from something called
>> "approximate fingerprint collisions".
>> 
>> That is why the define "enhanced double hashing" technique, which does
>> not suffer from those problems (Algorithm 2, page 11/15).
>> 
>>   +	for (i = 0; i < settings->num_hashes; i++) {
>>   +		key->hashes[i] = hash0;
>>   +
>>   +		hash0 = hash0 + hash1;
>>   +		hash1 = hash1 + i;
>>   +	}
>> 
>> This can also be written in closed form, based on equation (6)
>> 
>>   +	for (i = 0; i < settings->num_hashes; i++)
>>   +		key->hashes[i] = hash0 + i * hash1 + i*(i*i - 1)/6;
>> 
>> 
>> In later paper [6] the closed form for "enhanced double hashing"
>> (p. 188) is slightly modified (or rather they use different variant of
>> this technique):
>> 
>>   +	for (i = 0; i < settings->num_hashes; i++)
>>   +		key->hashes[i] = hash0 + i * hash1 + i*i;
>> 
>> This is a variant of more generic "enhanced double hashing", section
>> 5.2 (Enhanced) Double Hashing Schemes (page 199):
>> 
>>         h_1(u) + i h_2(u) + f(i)    mod m
>> 
>> with f(i) = i^2 = i*i.
>> 
>> They have tested that enhanced double hashing with both f(i) equal i*i
>> and equal i*i*i, and triple hashing technique, and they have found that
>> it performs slightly better than straight double hashing technique
>> (Fig. 1, page 212, section 3).
>> 
>
> Thanks for the detailed research here! The hash becoming zero and the 
> approximate fingerprint collision are both extremely rare situations. In both
> cases, we would just see `git log` having to diff more trees than if it didn't 
> occur. While these techniques would be great optimizations to do, especially
> if this implementation gets pulled into more generic hashing applications
> in the code, we think that for the purposes of the current series - it is not 
> worth it. I say this because Azure Repos has been using this exact hashing 
> technique for several years now without any glitches. And we think it would
> be great to rely on this battle tested strategy in at least the first version
> of this feature. 

All right, that is a good strategy.

I wonder if switching from double hashing to enhanced double hashing
(for example the variant with i*i added) would bring any noticeable
performance improvements in Git operations (due to less false
positives).

>>> +
>>> +struct bloom_filter *get_bloom_filter(struct repository *r,
>>> +				      struct commit *c)
>>> +{
>>> +	struct bloom_filter *filter;
>>> +	struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
>>> +	int i;
>>> +	struct diff_options diffopt;
>>> +
>>> +	if (!bloom_filters.slab_size)
>>> +		return NULL;
>> 
>> This is testing that commit slab for per-commit Bloom filters is
>> initialized, isn't it?
>> 
>> First, should we write the condition as
>> 
>> 	if (!bloom_filters.slab_size)
>> 
>> or would the following be more readable
>> 
>> 	if (bloom_filters.slab_size == 0)
>> 
>
> Sure. Switched to `if (bloom_filter.slab_size == 0)` in v3. 

Though either works, and the former looks more like the test if
bloom_filters slab are initialized, now that I thought about it a bit.
Your choice.

>> Second, should we return NULL, or should we just initialize the slab?
>> Or is non-existence of slab treated as a signal that the Bloom filters
>> mechanism is turned off?
>> 
>
> Yes. We purposefully choose to return NULL and ignore the mechanism 
> overall because we use Bloom filters best effort only. 

All right.

>>> +
>>> +	if (diff_queued_diff.nr <= 512) {
>>
>> Second, there is a minor issue that diff_queue_struct.nr stores the
>> number of filepairs, that is the number of changed files, while the
>> number of elements added to Bloom filter is number of changed blobs and
>> trees.  For example if the following files are changed:
>> 
>>   sub/dir/file1
>>   sub/file2
>> 
>> then diff_queued_diff.nr is 2, but number of elements to be added to
>> Bloom filter is 4.
>> 
>>   sub/dir/file1
>>   sub/file2
>>   sub/dir/
>>   sub/
>> 
>> I'm not sure if it matters in practice.
>> 
>
> It does not matter much in practice, since the directories usually tend
> to collapse across the changes. Still, I will add another limit after 
> creating the hashmap entries to cap at 640 so that we have a maximum of 
> 100 changes in the bloom filter. 
>
> We plan to make these values configurable later. 

I'm not sure if it is truly necessary; we can treat limit on number of
changed paths as "best effort" limit on Bloom filter size.

I just wanted to point out the difference.


Side note: I wonder if it would be worth it (in the future) to change
handling commits with large amount of changes.  I was thinking about
switching to soft and hard limit: soft limit would be on the size of the
Bloom filter, that is if number of elements times bits per element is
greater that size threshold, we don't increase the size of the filter.

This would mean that the false positives ratio (the number of files that
are not present but get answer "maybe" instead of "no" out of the
filter) would increase, so there would be a need for another hard limit
where we decide that it is not worth it, and not store the data for the
Bloom filter -- current "no data" case with empty filter with length 0.
This hard limit can be imposed on number of changed files, or on number
of paths added to filter, or on number of bits set to 1 in the filter
(on popcount), or some combination thereof.

[...]
>>> +
>>> +		for (i = 0; i < diff_queued_diff.nr; i++) {
>>> +			const char* path = diff_queued_diff.queue[i]->two->path;
>> 
>> Is that correct that we consider only post-image name for storing
>> changes in Bloom filter?  Currently if file was renamed (or deleted), it
>> is considered changed, and `git log -- <old-name>` lists commit that
>> changed file name too.
>
> The tests in t4216-log-bloom.sh ensure that the output of `git log -- <oldname>` 
> remains unchanged for renamed and deleted files, when using bloom filters. 
> I realize that I fat fingered over checking the old name, and didn't have an 
> explicit deleted file in the test. I have added them in v3, and the tests pass. 
> So the behavior is preserved and as expected when using Bloom filters. 
> Thanks for paying close attention! 

It seems like it shouldn't be working, as we are not adding the old name
to Bloom filter, but that only means that I misunderstood how
diff_tree_oid() works with default options.  It turns out that without
explicitly turning on rename detection it shows rename as deletion of
old name and addition of new name -- so if tracking deletion works
correctly, then tracking renames should work correctly.

So it is in fact correct, which as you said was confirmed by (improved)
tests.  I think also that if there was a bug in handling renames in this
code it would have been detected when running CI with
GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS.

[...]
>>> +		filter->data = NULL;
>>> +		filter->len = 0;
>> 
>> This needs to be explicitly stated both in the commit message and in the
>> API documentation (in comments) that bloom_filter.len == 0 means "no
>> data", while "no changes" is represented as bloom_filter with len == 1
>> and *data == (uint64_t)0;
>> 
>> EDIT: actually "no changes" is also represented as bloom_filter with len
>> equal 0, as it turns out.
>> 
>> One possible alternative could be representing "no data" value with
>> Bloom filter of length 1 and all 64 bits set to 1, and "no changes"
>> represented as filter of length 0.  This is not unambiguous choice!
>>
>
> There is no gain in distinguishing between the absence of a filter and
> a commit having no changes. The effect on `git log -- path` is the same in 
> both cases. We fall back to the normal diffing algorithm in revision.c.
> I will make this clearer in the appropriate commit messages and in the 
> Documentation in v3. 

You are right, which I have realized only when reviewing subsequent
patches in the series.

In the absence of a filter, the "no data" case, we need to fall back to
examining the diff anyway.

In the case of commit having no changes, the "no changes" case,
computing the diff is cheap because Git can realize that both trees have
the same oid.  So we do not lose performance this way, and we avoid
special-casing it (avoiding branching) when computing the Bloom filter,
if the "no change" case was represented by filter of length 1 and all
zero bits as data.  Comparing tree oids and matching first hash function
in bloom_key against all zeros Bloom filter should be, I think, of
similar performance.

[...]
>>> +. ./test-lib.sh
>>> +
>>> +test_expect_success 'get bloom filters for commit with no changes' '
>>> +	git init &&
>>> +	git commit --allow-empty -m "c0" &&
>>> +	cat >expect <<-\EOF &&
>>> +	Filter_Length:0
>>> +	Filter_Data:
>>> +	EOF
>>> +	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
>>> +	test_cmp expect actual
>>> +'
>> 
>> A few things.  First, I wonder why we need to provide object ID;
>> couldn't 'test-tool bloom get_filter_for_commit' parse commit-ish
>> argument, or would it make it too complicated for no reason? 
>
> Yes it was overkill for what I need in the test. 

All right, I agree with that.

>>> +
>>> +test_expect_success 'get bloom filter for commit with 10 changes' '
>>> +	rm actual &&
>>> +	rm expect &&
>>> +	mkdir smallDir &&
>>> +	for i in $(test_seq 0 9)
>>> +	do
>>> +		echo $i >smallDir/$i
>>> +	done &&
>>> +	git add smallDir &&
>>> +	git commit -m "commit with 10 changes" &&
>>> +	cat >expect <<-\EOF &&
>>> +	Filter_Length:4
>>> +	Filter_Data:508928809087080a|8a7648210804001|4089824400951000|841ab310098051a8|
>>> +	EOF
>>> +	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
>>> +	test_cmp expect actual
>>> +'
>> 
>> This test is in my opinion fragile, as it unnecessarily test the
>> implementation details instead of the functionality provided.  If we
>> change the hashing scheme (for example going from double hashing to some
>> variant of enhanced double hashing), or change the base hash function
>> (for example from Murmur3_32 to xxHash_64), or change the number of hash
>> functions (perhaps because changing of number of bits per element, and
>> thus optimal number of hash functions from 7 to 6), or change from
>> 64-bit word blocks to 32-bit word blocks, the test would have to be
>> changed.
>
> Regarding this and the rest of you comments on t0095-log-bloom.sh:
>
> I am tweaking it as necessary but the entire point of these tests is to
> break for the things you called out. They need to be intricately tied
> to the current hashing strategy and are hence intended to be fragile so 
> as to catch any subtle or accidental changes in the hashing computation. 
> Any change like the ones you have called out would require a hash version
> change and all the compatibility reactions that come with it. 

All right, if we assume that commit-graph is not something purely local^*,
and we need iteroperability, then this test is necessary and is
necessarily fragile.

*. This may happen because the repository and the commit-graph file in
   it is on network disk, and accessed by hosts with different
   endianness.  Or in the future (or possibly now, if one is using
   Scalar) the commit-graph file can be sent together with packfile
   during the fetch operation.

On the other hand testing the functionality of Murmur hash, and of Bloom
filter would help finding possible troubles if we decide in the future
to change the algorithm details (change hash function, and/or move from
double hashing to enhanced double hashing, and/or change how commits
with large number of changes are handled, or even switching to xor
filters [1]).

[1]: Graf, Thomas Mueller; Lemire, Daniel (2019), "Xor Filters: Faster
and Smaller Than Bloom and Cuckoo Filters", https://arxiv.org/abs/1912.08258

> I have added more tests around the murmur3_seeded method in v3. Removed
> some of the redundant ones. 

There is another test that might be worth adding (see the comment below
why), namely one test checking that bloom_key is computed as expected.

> The other more evolved test cases you call out are covered in the e2e
> integration tests in t4216-log-bloom.sh

All right, but there is another issue to consider.  Good tests should
not only catch the breakage, but also help to detect where the bug is.
That is one of advantages that unit tests (like the ones I have
proposed) have over end-to-end functional tests.  They are also often
faster.

On the other hand e2e tests can catch problems with integration, and
actually check that the user-visible behaviour is as expected.


Best,
--
Jakub Narębski

>> 
>> Reviewed-by: Jakub Narębski <jnareb@gmail.com>
>> 
>> Thanks for working on this.
>> 
>> Best, 
>
> Thank you once again for an excellent and in-depth review of this patch! 
> You have helped make this code so much better!

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH v2 04/11] commit-graph: compute Bloom filters for changed paths
  2020-02-22  0:55       ` Garima Singh
@ 2020-02-23 17:34         ` Jakub Narebski
  0 siblings, 0 replies; 150+ messages in thread
From: Jakub Narebski @ 2020-02-23 17:34 UTC (permalink / raw)
  To: Garima Singh
  Cc: Garima Singh via GitGitGadget, git, Derrick Stolee,
	SZEDER Gábor, Jonathan Tan, Jeff Hostetler, Taylor Blau,
	Jeff King, Christian Couder, Emily Shaffer, Junio C Hamano,
	Garima Singh

Garima Singh <garimasigit@gmail.com> writes:
> On 2/17/2020 4:56 PM, Jakub Narebski wrote:
>> "Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:

[...]
>>> ---
>>>  commit-graph.c | 32 +++++++++++++++++++++++++++++++-
>>>  commit-graph.h |  3 ++-
>>>  2 files changed, 33 insertions(+), 2 deletions(-)
>> 
>> It would be good to have at least sanity check of this feature, perhaps
>> one that would check that the number of per-commit Bloom filters on slab
>> matches the number of commits in the commit-graph.
>
> The combination of all the e2e tests in this series with the test
> flag being turned on in the CI, and the performance gains we are seeing
> confirm that this is happening correctly.

Well, the advantage of unit tests over e2e functional tests is that they
can pinpoint the source of bug much more easily.

That said, I don't think there is absolute need for unit tests here,
though it would be nice to have them.

>>>  
>>>  	const struct split_commit_graph_opts *split_opts;
>>> +	uint32_t total_bloom_filter_data_size;
>> 
>> This is total size of Bloom filters data, in bytes, that will later be
>> used for BDAT chunk size.  However the commit-graph format uses 8 bytes
>> for byte-offset, not 4 bytes.  Why it is uint32_t and not uint64_t then?
>
> Changed to size_t. Thanks for noticing! 

Right, this is a local value (size_t may be different size on different
architectures), even though it will be stored indirectly in chunk lookup
table as pair of uint64_t offsets.

>>>  };
>>>  
>>>  static void write_graph_chunk_fanout(struct hashfile *f,
>>> @@ -1140,6 +1143,28 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>>>  	stop_progress(&ctx->progress);
>>>  }
>>>  
>>> +static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>>> +{
>>> +	int i;
>>> +	struct progress *progress = NULL;
>>> +
>>> +	load_bloom_filters();
>>> +
>>> +	if (ctx->report_progress)
>>> +		progress = start_progress(
>>> +			_("Computing commit diff Bloom filters"),
>>> +			ctx->commits.nr);
>>> +
>> 
>> Shouldn't we initialize ctx->total_bloom_filter_data_size to 0 here?  We
>> cannot use compute_bloom_filters() to _update_ Bloom filters data, I
>> think -- we don't distinguish here between new and existing data (where
>> existing data size is already included in total Bloom filters size).  At
>> least I don't think so.
>> 
>
> This line in commit-graph.c takes care of reinitializing the graph context and
> by consequence the bloom filter data size.
>
>   ctx = xcalloc(1, sizeof(struct write_commit_graph_context));
>   
> So the total size gets recalculated every time, which is correct. 

True, I have missed this.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH v2 02/11] bloom: core Bloom filter implementation for changed paths
  2020-02-23 13:38         ` Jakub Narebski
@ 2020-02-24 17:34           ` Garima Singh
  2020-02-24 18:20             ` Jakub Narebski
  0 siblings, 1 reply; 150+ messages in thread
From: Garima Singh @ 2020-02-24 17:34 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Garima Singh via GitGitGadget, git, Derrick Stolee,
	SZEDER Gábor, Jonathan Tan, Jeff Hostetler, Taylor Blau,
	Jeff King, Christian Couder, Emily Shaffer, Junio C Hamano,
	Garima Singh


On 2/23/2020 8:38 AM, Jakub Narebski wrote:
> Garima Singh <garimasigit@gmail.com> writes:
>> On 2/16/2020 11:49 AM, Jakub Narebski wrote:
>>>> From: Garima Singh <garima.singh@microsoft.com>
>>>>
>>>> Add the core Bloom filter logic for computing the paths changed between a
>>>> commit and its first parent. For details on what Bloom filters are and how they
>>>> work, please refer to Dr. Derrick Stolee's blog post [1]. It provides a concise
>>>> explaination of the adoption of Bloom filters as described in [2] and [3].
>>>                                                                           ^^- to add
>>
>> Not sure what this means. Can you please clarify. 
>>
>>>> 1. We currently use 7 and 10 for the number of hashes and the size of each
>>>>    entry respectively. They served as great starting values, the mathematical
>>>>    details behind this choice are described in [1] and [4]. The implementation,
>>>                                                                                ^^- to add
>>
>> Not sure what this means. Can you please clarify.
> 
> I'm sorry for not being clear.  What I wanted to say that in both cases
> the last line should have ended in either full stop in first case, or
> comma in second case:
> 
>   "as described in [2] and [3]."
> 
>   "The implementation,"
> 
> What I wrote (trying to put the arrow below final fullstop or comma)
> only works when one is using with fixed-width font.
> 

Aah. Cool. Thanks! 

>>>> ---
>>>>  Makefile              |   2 +
>>>>  bloom.c               | 228 ++++++++++++++++++++++++++++++++++++++++++
>>>>  bloom.h               |  56 +++++++++++
>>>>  t/helper/test-bloom.c |  84 ++++++++++++++++
>>>>  t/helper/test-tool.c  |   1 +
>>>>  t/helper/test-tool.h  |   1 +
>>>>  t/t0095-bloom.sh      | 113 +++++++++++++++++++++
>>>>  7 files changed, 485 insertions(+)
>>>>  create mode 100644 bloom.c
>>>>  create mode 100644 bloom.h
>>>>  create mode 100644 t/helper/test-bloom.c
>>>>  create mode 100755 t/t0095-bloom.sh
>>>
>>> As I wrote earlier, In my opinion this patch could be split into three
>>> individual single-functionality pieces, to make it easier to review and
>>> aid in bisectability if needed.
>>
>> Doing this in v3. 
> 
> Thanks.  Though if it makes (much) more work for you, I can work with
> unsplit patch, no problem.
> 

Thanks! That's great! Splitting the patches will add some overhead. I will 
try and do it provided it does not delay getting v3 on the list. 

>>>> +
>>>> +static uint32_t rotate_right(uint32_t value, int32_t count)
>>>> +{
>>>> +	uint32_t mask = 8 * sizeof(uint32_t) - 1;
>>>> +	count &= mask;
>>>> +	return ((value >> count) | (value << ((-count) & mask)));
>>>> +}
>>>
>>> Hmmm... both the algoritm on Wikipedia, and reference implementation use
>>> rotate *left*, not rotate *right* in the implementation of Murmur3 hash,
>>> see
>>>
>>>   https://en.wikipedia.org/wiki/MurmurHash#Algorithm
>>>   https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp#L23
>>>
>>>
>>> inline uint32_t rotl32 ( uint32_t x, int8_t r )
>>> {
>>>   return (x << r) | (x >> (32 - r));
>>> }
>>
>> Thanks! Fixed this in v3. More on it later. 
> 
> Sidenote: If I understand it correctly Bloom filters functionality is
> included in Scalar [1].  What will happen then with all those Bloom
> filter chunks in commit-graph files with wrong hash functions?
> 
> [1]: https://devblogs.microsoft.com/devops/introducing-scalar/
> 

It is not included in Scalar. Scalar will write to the commit-graph in 
the background using the features available in the git version it is working
with. It will update to include changed path Bloom filters when they are 
available in git. We are not taking the Bloom filter into microsoft/git 
until the format is approved and accepted by the core git community.

>>>> +{
>>>> +	const uint32_t c1 = 0xcc9e2d51;
>>>> +	const uint32_t c2 = 0x1b873593;
>>>> +	const uint32_t r1 = 15;
>>>> +	const uint32_t r2 = 13;
>>>> +	const uint32_t m = 5;
>>>> +	const uint32_t n = 0xe6546b64;
>>>> +	int i;
>>>> +	uint32_t k1 = 0;
>>>> +	const char *tail;
>>>> +
>>>> +	int len4 = len / sizeof(uint32_t);
>>>> +
>>>> +	const uint32_t *blocks = (const uint32_t*)data;
>>>> +
>>>> +	uint32_t k;
>>>> +	for (i = 0; i < len4; i++)
>>>> +	{
>>>> +		k = blocks[i];
>>>
>>> IMPORTANT: There is a comment around there in the example implementation
>>> in C on Wikipedia that this operation above is a source of differing
>>> results across endianness.  
>>
>> Thanks! SZEDER found this on his CI pipeline and we have fixed it to 
>> process the data in 1 byte words to avoid hitting any endian-ness issues. 
>> See this part of the thread that carries the fix and the related discussion. 
>>   https://lore.kernel.org/git/ba856e20-0a3c-e2d2-6744-b9abfacdc465@gmail.com/
>> I will be squashing those changes in appropriately in v3.  
> 
> [...]
>>>> +		k1 *= c2;
>>>> +		seed ^= k1;
>>>> +		break;
>>>> +	}
>>>> +
>>>> +	seed ^= (uint32_t)len;
>>>> +	seed ^= (seed >> 16);
>>>> +	seed *= 0x85ebca6b;
>>>> +	seed ^= (seed >> 13);
>>>> +	seed *= 0xc2b2ae35;
>>>> +	seed ^= (seed >> 16);
>>>> +
>>>> +	return seed;
>>>> +}
>>>
>>> In https://public-inbox.org/git/ba856e20-0a3c-e2d2-6744-b9abfacdc465@gmail.com/
>>> you posted "[PATCH] Process bloom filter data as 1 byte words".
>>> This may avoid the Big-endian vs Little-endian confusion,
>>> that is wrong results on Big-endian architectures, but
>>> it also may slow down the algorithm.
>>
>> Oh cool! You have seen that patch. And yes, we understand that it might add 
>> a little overhead but at this point it is more important to be correct on all
>> architectures instead of micro-optimizing and introducing different 
>> implementations for Little-endian and Big-endian. This would make this 
>> series overly complicated. Optimizing the hashing techniques would deserve a
>> series of its own, which we can definitely revisit later.
> 
> Right, "first make it work, then make it right, and, finally, make it fast.".
> 
> Anyway, could you maybe compare performance of Git for old version
> (operating on 32-bit/4-bytes words) and new version (operating on 1-byte
> words) file history operation with Bloom filters, to see if it matters
> or not?
> 

We chose to switch to 1 byte words for correctness, not performance. 
Also, this specific implementation choice is a very small portion of the 
end to end time spent computing and writing Bloom filters. We run two murmur3 
hashes per path, which is one path per `git log` query; and one path per change 
after parsing trees to compute a diff. Measuring performance and micro-optimizing 
is not worth the effort and/or trading in the simplicity here.


>>>> +
>>>> +struct bloom_filter *get_bloom_filter(struct repository *r,
>>>> +				      struct commit *c)
>>>> +{
>>>> +	struct bloom_filter *filter;
>>>> +	struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
>>>> +	int i;
>>>> +	struct diff_options diffopt;
>>>> +
>>>> +	if (!bloom_filters.slab_size)
>>>> +		return NULL;
>>>
>>> This is testing that commit slab for per-commit Bloom filters is
>>> initialized, isn't it?
>>>
>>> First, should we write the condition as
>>>
>>> 	if (!bloom_filters.slab_size)
>>>
>>> or would the following be more readable
>>>
>>> 	if (bloom_filters.slab_size == 0)
>>>
>>
>> Sure. Switched to `if (bloom_filter.slab_size == 0)` in v3. 
> 
> Though either works, and the former looks more like the test if
> bloom_filters slab are initialized, now that I thought about it a bit.
> Your choice.
> 

:) 


>>>> +
>>>> +	if (diff_queued_diff.nr <= 512) {
>>>
>>> Second, there is a minor issue that diff_queue_struct.nr stores the
>>> number of filepairs, that is the number of changed files, while the
>>> number of elements added to Bloom filter is number of changed blobs and
>>> trees.  For example if the following files are changed:
>>>
>>>   sub/dir/file1
>>>   sub/file2
>>>
>>> then diff_queued_diff.nr is 2, but number of elements to be added to
>>> Bloom filter is 4.
>>>
>>>   sub/dir/file1
>>>   sub/file2
>>>   sub/dir/
>>>   sub/
>>>
>>> I'm not sure if it matters in practice.
>>>
>>
>> It does not matter much in practice, since the directories usually tend
>> to collapse across the changes. Still, I will add another limit after 
>> creating the hashmap entries to cap at 640 so that we have a maximum of 
>> 100 changes in the bloom filter. 
>>
>> We plan to make these values configurable later. 
> 
> I'm not sure if it is truly necessary; we can treat limit on number of
> changed paths as "best effort" limit on Bloom filter size.
> 
> I just wanted to point out the difference.
> 

Sure. Not doing this for v3. Glad it got discussed here though!  

> 
> Side note: I wonder if it would be worth it (in the future) to change
> handling commits with large amount of changes.  I was thinking about
> switching to soft and hard limit: soft limit would be on the size of the
> Bloom filter, that is if number of elements times bits per element is
> greater that size threshold, we don't increase the size of the filter.
> 
> This would mean that the false positives ratio (the number of files that
> are not present but get answer "maybe" instead of "no" out of the
> filter) would increase, so there would be a need for another hard limit
> where we decide that it is not worth it, and not store the data for the
> Bloom filter -- current "no data" case with empty filter with length 0.
> This hard limit can be imposed on number of changed files, or on number
> of paths added to filter, or on number of bits set to 1 in the filter
> (on popcount), or some combination thereof.
> 
> [...]

Could be considered in the future. Doesn't make the cut for the current
series though. 

Thanks
Garima Singh

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH v2 02/11] bloom: core Bloom filter implementation for changed paths
  2020-02-24 17:34           ` Garima Singh
@ 2020-02-24 18:20             ` Jakub Narebski
  0 siblings, 0 replies; 150+ messages in thread
From: Jakub Narebski @ 2020-02-24 18:20 UTC (permalink / raw)
  To: Garima Singh
  Cc: Garima Singh via GitGitGadget, git, Derrick Stolee,
	SZEDER Gábor, Jonathan Tan, Jeff Hostetler, Taylor Blau,
	Jeff King, Christian Couder, Emily Shaffer, Junio C Hamano,
	Garima Singh

Garima Singh <garimasigit@gmail.com> writes:

> On 2/23/2020 8:38 AM, Jakub Narebski wrote:
>> Garima Singh <garimasigit@gmail.com> writes:
>>> On 2/16/2020 11:49 AM, Jakub Narebski wrote:
>>>>> From: Garima Singh <garima.singh@microsoft.com>

[...]
>>>> IMPORTANT: There is a comment around there in the example implementation
>>>> in C on Wikipedia that this operation above is a source of differing
>>>> results across endianness.  
>>>
>>> Thanks! SZEDER found this on his CI pipeline and we have fixed it to 
>>> process the data in 1 byte words to avoid hitting any endian-ness issues. 
>>> See this part of the thread that carries the fix and the related discussion. 
>>>   https://lore.kernel.org/git/ba856e20-0a3c-e2d2-6744-b9abfacdc465@gmail.com/
>>> I will be squashing those changes in appropriately in v3.  
>> 
>> [...]
>>>>
>>>> In https://public-inbox.org/git/ba856e20-0a3c-e2d2-6744-b9abfacdc465@gmail.com/
>>>> you posted "[PATCH] Process bloom filter data as 1 byte words".
>>>> This may avoid the Big-endian vs Little-endian confusion,
>>>> that is wrong results on Big-endian architectures, but
>>>> it also may slow down the algorithm.
>>>
>>> Oh cool! You have seen that patch. And yes, we understand that it might add 
>>> a little overhead but at this point it is more important to be correct on all
>>> architectures instead of micro-optimizing and introducing different 
>>> implementations for Little-endian and Big-endian. This would make this 
>>> series overly complicated. Optimizing the hashing techniques would deserve a
>>> series of its own, which we can definitely revisit later.
>> 
>> Right, "first make it work, then make it right, and, finally, make it fast.".
>> 
>> Anyway, could you maybe compare performance of Git for old version
>> (operating on 32-bit/4-bytes words) and new version (operating on 1-byte
>> words) file history operation with Bloom filters, to see if it matters
>> or not?
>> 
>
> We chose to switch to 1 byte words for correctness, not performance. 
> Also, this specific implementation choice is a very small portion of the 
> end to end time spent computing and writing Bloom filters. We run two murmur3 
> hashes per path, which is one path per `git log` query; and one path per change 
> after parsing trees to compute a diff. Measuring performance and micro-optimizing 
> is not worth the effort and/or trading in the simplicity here.

All right.

I still think that adding to_le32() invocation before the part that
processes remaining bytes (the 'switch' instruction in v2 code), just
like in pseudo-code on Wikipedia:

    with any remainingBytesInKey do
        remainingBytes ← SwapToLittleEndian(remainingBytesInKey)

would be enough to have correct results regardlless of endianness.

As I wrote

JN> The solution in PMurHash.c in Chromium [1], and the pseudo-code algorithm on
JN> Wikipedia do endian handling only for remaining bytes (while the
JN> beginnings of solution in Appleby's code, and solution in current
JN> above-mentioned Chromium implementation do the conversion for all
JN> bytes).  I think that handling it only for remaining bytes (for data
JN> sizes not being multiply of 32-bits / 4-bytes) is enough; all other
JN> operations, that is multiply, rotate, xor and addition do not depend on
JN> endianness.

[1]: https://chromium.googlesource.com/external/smhasher/+/5b8fd3c31a58b87b80605dca7a64fad6cb3f8a0f/PMurHash.c

If you have access to, or can run code on some big-endian architecture,
it should be easy enough to check it.


Anyway, if you decide on 1-byte at time implementation, please put a
comment about 32-bit chunk implementation.

>> Side note: I wonder if it would be worth it (in the future) to change
>> handling commits with large amount of changes.  I was thinking about
>> switching to soft and hard limit: soft limit would be on the size of the
>> Bloom filter, that is if number of elements times bits per element is
>> greater that size threshold, we don't increase the size of the filter.
>> 
>> This would mean that the false positives ratio (the number of files that
>> are not present but get answer "maybe" instead of "no" out of the
>> filter) would increase, so there would be a need for another hard limit
>> where we decide that it is not worth it, and not store the data for the
>> Bloom filter -- current "no data" case with empty filter with length 0.
>> This hard limit can be imposed on number of changed files, or on number
>> of paths added to filter, or on number of bits set to 1 in the filter
>> (on popcount), or some combination thereof.
>> 
>> [...]
>
> Could be considered in the future. Doesn't make the cut for the current
> series though. 

Right.

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH v2 05/11] commit-graph: examine changed-path objects in pack order
  2020-02-18 17:59     ` Jakub Narebski
@ 2020-02-24 18:29       ` Garima Singh
  0 siblings, 0 replies; 150+ messages in thread
From: Garima Singh @ 2020-02-24 18:29 UTC (permalink / raw)
  To: Jakub Narebski, Jeff King via GitGitGadget
  Cc: git, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
	Jeff Hostetler, Taylor Blau, Jeff King, Christian Couder,
	Emily Shaffer, Junio C Hamano, Garima Singh


On 2/18/2020 12:59 PM, Jakub Narebski wrote:
> "Jeff King via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> From: Jeff King <peff@peff.net>
>>
>> Looking at the diff of commit objects in pack order is much faster than
>> in sha1 order, as it gives locality to the access of tree deltas
> 
> Nitpick: should we still say sha1 order?  Git is still using SHA-1 as an
> *oid*, but hopefully soon it will be transitioning to NewHash = SHA-256.
> (No need to change anything.)
> 
>> (whereas sha1 order is effectively random). Unfortunately the
>> commit-graph code sorts the commits (several times, sometimes as an oid
>> and sometimes a pointer-to-commit), and we ultimately traverse in sha1
>> order.
> 
> Actually, commit-graph code needs write_commit_graph_context.commits.list
> to be in lexicographical order to be able to turn position in graph into
> reference to a commit.  The information about the parents of the commit
> are stored using positional references within the graph file.
> 

You are right. Fixing the commit message in v3. 

>>
>> Instead, let's remember the position at which we see each commit, and
>> traverse in that order when looking at bloom filters. This drops my time
>> for "git commit-graph write --changed-paths" in linux.git from ~4
>> minutes to ~1.5 minutes.
> 
> Nitpick: with reordering of patches (which I think is otherwise a good
> thing) this patch actually comes before the one adding "--changed-paths"
> option to "git commit-graph write".  So it 'This would drop my time'
> rather than 'This drops my time...' ;-)
> 

:) I will fix that up. 

>>
>> Probably the "--reachable" code path would want something similar.
> 
> Has anyone tried doing this?
> 

I will and I will include the perf numbers in the appropriately in v3. 


>> +
>>  char *get_commit_graph_filename(const char *obj_dir)
>>  {
>>  	char *filename = xstrfmt("%s/info/commit-graph", obj_dir);
>> @@ -1027,6 +1051,8 @@ static int add_packed_commits(const struct object_id *oid,
>>  	oidcpy(&(ctx->oids.list[ctx->oids.nr]), oid);
>>  	ctx->oids.nr++;
>>  
>> +	set_commit_pos(ctx->r, oid);
>> +
>>  	return 0;
>>  }
>>  
>> @@ -1147,6 +1173,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>>  {
>>  	int i;
>>  	struct progress *progress = NULL;
>> +	struct commit **sorted_by_pos;
> 
> In the next patch in series we would sort commits by generation number
> and creation data; shouldn't this variable name be more generic to
> reflect this, for example just `sorted_commits` or `commits_sorted`?
> 

Good call. I will clean this up in both commits. 

Thanks for the review! 
Cheers! 
Garima Singh

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH v2 06/11] commit-graph: examine commits by generation number
  2020-02-19  0:32     ` Jakub Narebski
@ 2020-02-24 20:45       ` Garima Singh
  0 siblings, 0 replies; 150+ messages in thread
From: Garima Singh @ 2020-02-24 20:45 UTC (permalink / raw)
  To: Jakub Narebski, Derrick Stolee via GitGitGadget
  Cc: git, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
	Jeff Hostetler, Taylor Blau, Jeff King, Christian Couder,
	Emily Shaffer, Junio C Hamano, Garima Singh, Derrick Stolee


On 2/18/2020 7:32 PM, Jakub Narebski wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> When running 'git commit-graph write --changed-paths', we sort the
>> commits by pack-order to save time when computing the changed-paths
>> bloom filters. This does not help when finding the commits via the
>> --reachable flag.
> 
> Minor improvement suggestion: s/--reachable flag/'--reachable' flag/.
> 

Sure. 


>>                     Commits with similar generation are more likely
>> to have many trees in common, making the diff faster.
> 
> Is this what causes the performance improvement, that subsequently
> examined commits are more likely to have more trees in common, which
> means that those trees would be hot in cache, making generating diff
> faster?  Is it what profiling shows?
> 

Yes. 

>>
>> Helped-by: Jeff King <peff@peff.net>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
>> ---
>>  commit-graph.c | 33 ++++++++++++++++++++++++++++++---
>>  1 file changed, 30 insertions(+), 3 deletions(-)
>>
>> diff --git a/commit-graph.c b/commit-graph.c
>> index e125511a1c..32a315058f 100644
>> --- a/commit-graph.c
>> +++ b/commit-graph.c
>> @@ -70,6 +70,25 @@ static int commit_pos_cmp(const void *va, const void *vb)
>>  	       commit_pos_at(&commit_pos, b);
>>  }
>>  
>> +static int commit_gen_cmp(const void *va, const void *vb)
>> +{
>> +	const struct commit *a = *(const struct commit **)va;
>> +	const struct commit *b = *(const struct commit **)vb;
>> +
>> +	/* lower generation commits first */
> 
> Shouldn't higher generation commits come first, in recency-like order?
> Or it doesn't matter if it is sorted in ascending or descending order,
> as long as commits with close generation numbers are examined close
> together?
> 

The direction does not matter. Locality is important. 

>> +	if (a->generation < b->generation)
>> +		return -1;
>> +	else if (a->generation > b->generation)
>> +		return 1;
>> +
>> +	/* use date as a heuristic when generations are equal */
>> +	if (a->date < b->date)
>> +		return -1;
>> +	else if (a->date > b->date)
>> +		return 1;
>> +	return 0;
>> +}
> 
> I thought we have had such comparison function defined somewhere in Git
> already, but I think I'm wrong here.
> 

It actually exists in commit.h
I will just use it here. 
Thanks for pointing it out! 

>> +
>>  char *get_commit_graph_filename(const char *obj_dir)
>>  {
>>  	char *filename = xstrfmt("%s/info/commit-graph", obj_dir);
>> @@ -821,7 +840,8 @@ struct write_commit_graph_context {
>>  		 report_progress:1,
>>  		 split:1,
>>  		 check_oids:1,
>> -		 changed_paths:1;
>> +		 changed_paths:1,
>> +		 order_by_pack:1;
>>  
>>  	const struct split_commit_graph_opts *split_opts;
>>  	uint32_t total_bloom_filter_data_size;
>> @@ -1184,7 +1204,11 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>>  
>>  	ALLOC_ARRAY(sorted_by_pos, ctx->commits.nr);
>>  	COPY_ARRAY(sorted_by_pos, ctx->commits.list, ctx->commits.nr);
>> -	QSORT(sorted_by_pos, ctx->commits.nr, commit_pos_cmp);
>> +
>> +	if (ctx->order_by_pack)
>> +		QSORT(sorted_by_pos, ctx->commits.nr, commit_pos_cmp);
>> +	else
>> +		QSORT(sorted_by_pos, ctx->commits.nr, commit_gen_cmp);
> 
> Here 'sorted_b_pos' variable name no longer reflects reality...
> (see comment to the previous patch in the series).
> 

Yup. Fixing. 

Thanks!
Garima Singh

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH v2 07/11] commit-graph: write Bloom filters to commit graph file
  2020-02-19 15:13     ` Jakub Narebski
@ 2020-02-24 21:14       ` Garima Singh
  2020-02-25 11:40         ` Jakub Narebski
  0 siblings, 1 reply; 150+ messages in thread
From: Garima Singh @ 2020-02-24 21:14 UTC (permalink / raw)
  To: Jakub Narebski, Garima Singh via GitGitGadget
  Cc: git, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
	Jeff Hostetler, Taylor Blau, Jeff King, Christian Couder,
	Emily Shaffer, Junio C Hamano, Garima Singh



On 2/19/2020 10:13 AM, Jakub Narebski wrote:
> "Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> From: Garima Singh <garima.singh@microsoft.com>
>>
>> Update the technical documentation for commit-graph-format with the formats for
>> the Bloom filter index (BIDX) and Bloom filter data (BDAT) chunks. Write the
>> computed Bloom filters information to the commit graph file using this format.
> 
> Nice description.
> 
> The only minor nitpick is with the formating: it is 80-character wide,
> which is a bit wide.
> 

Fixed in v3. Thanks! 

>>
>> Helped-by: Derrick Stolee <dstolee@microsoft.com>
>> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
>> ---
>>  .../technical/commit-graph-format.txt         |  24 ++++
>>  commit-graph.c                                | 118 +++++++++++++++++-
>>  commit-graph.h                                |   7 +-
>>  3 files changed, 145 insertions(+), 4 deletions(-)
>>
>> diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
>> index a4f17441ae..22e511643d 100644
>> --- a/Documentation/technical/commit-graph-format.txt
>> +++ b/Documentation/technical/commit-graph-format.txt
>> @@ -17,6 +17,9 @@ metadata, including:
>>  - The parents of the commit, stored using positional references within
>>    the graph file.
>>  
>> +- The Bloom filter of the commit carrying the paths that were changed between
>> +  the commit and its first parent.
>> +
> 
> All right.
> 
> Should we also state that it is optional (meta)data?  This would be
> first optional piece of data stored in commit-graph, I think.
> 

However the entire commit graph file is non critical metadata since git commands
work just fine without it, just slower. The same applies to the changed path
bloom filters. 

Based on the definition of optional you are suggesting, edge data is optional
because not every commit-graph has octopus merges. 

>> +
>> +  Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]
>> +    * It starts with header consisting of three unsigned 32-bit integers:
>> +      - Version of the hash algorithm being used. We currently only support
>> +	value 1 which implies the murmur3 hash implemented exactly as described
>> +	in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
> 
> First a minor issue: shouldn't this nested unordered list be indented
> with a hanging indent formatted with spaces?  That is be formatted like
> the following:
> 
>   +  Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]
>   +    * It starts with header consisting of three unsigned 32-bit integers:
>   +      - Version of the hash algorithm being used. We currently only support
>   +        value 1 which implies the murmur3 hash implemented exactly as
>   +        described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
> 
> But the existing formatting with spaces and tabs might be fine as it is,
> that is it renders as nested list with Asciidoc; it only looks a bit
> weird as patch, not so as text.
> 
> Second, and more important: it is in my opinion not enough information,
> at least if we are assuming that the information in this document should
> be enough for clean-room reimplementation of Bloom filter functionality
> (for example by JGit).  To generate compatible Bloom filters, one needs
> also the information on how to create $k$ functionally-independent hash
> functions out of murmur3 hash.  We do it currently using double hashing
> technique; if that changes then the exact set of bits in the Bloom
> filter would also change.
> 
> The additional description could look something like the following:
> 
>   +    * It starts with header consisting of three unsigned 32-bit integers:
>   +      - Version of the hash algorithm being used. We currently only support
>   +        value 1 which implies the murmur3_32 hash implemented exactly as
>   +        described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
>   +        and double hashing technique with 0x293ae76f and 0x7e646e2c seeds
>   +        as described in https://doi.org/10.1007/978-3-540-30494-4_26
>   +        "Bloom Filters in Probabilistic Verification"
> 
> Also, it should be explicitly noted that we use murmur3_32, because
> there is also 128-bit version of murmur3 hash.
> 

I will incorporate this in. Thanks! 


>> +    * The BDAT chunk is present iff BIDX is present.
> 
> Perhaps we should spell 'iff' in full, that is 'if and only if'?
> 

Sure. 

>> +
>>    Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional]
>>        This list of H-byte hashes describe a set of B commit-graph files that
>>        form a commit-graph chain. The graph position for the ith commit in this
>> diff --git a/commit-graph.c b/commit-graph.c
>> index 32a315058f..4585b3b702 100644
>> --- a/commit-graph.c
>> +++ b/commit-graph.c
>> @@ -24,8 +24,10 @@
>>  #define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
>>  #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
>>  #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
>> +#define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
>> +#define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
>>  #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
>> -#define MAX_NUM_CHUNKS 5
>> +#define MAX_NUM_CHUNKS 7
>>  
>>  #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
>>  
>> @@ -325,6 +327,32 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
>>  				chunk_repeated = 1;
>>  			else
>>  				graph->chunk_base_graphs = data + chunk_offset;
>> +			break;
>> +
>> +		case GRAPH_CHUNKID_BLOOMINDEXES:
>> +			if (graph->chunk_bloom_indexes)
>> +				chunk_repeated = 1;
>> +			else
>> +				graph->chunk_bloom_indexes = data + chunk_offset;
>> +			break;
>> +
>> +		case GRAPH_CHUNKID_BLOOMDATA:
>> +			if (graph->chunk_bloom_data)
>> +				chunk_repeated = 1;
>> +			else {
>> +				uint32_t hash_version;
>> +				graph->chunk_bloom_data = data + chunk_offset;
>> +				hash_version = get_be32(data + chunk_offset);
>> +
>> +				if (hash_version != 1)
>> +					break;
> 
> Shouldn't we mark Bloom filter as not to be used?  Or is it left for
> later commit?
> 

We take care of this in line 375. 

> In the future it might be good idea to notify the user (perhaps
> protected with some advice.* option) that there is problem with Bloom
> filter data, namely that we have encountered unsupported hash version.
> 
>> +
>> +				graph->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings));
> 
> Why is this structure allocated dynamically?  We are leaking admittedly
> a small amount of memory because we never free this xmalloc() result.
> 
> If we need this field being a pointer to struct to have NULL mean no
> supported Bloom filter data, we could have instead use chunk_bloom_*
> fields instead - we can set at least one of them to NULL.
> 

I am freeing this up in free_commit_graph but I messed up putting it in the right commit. 
Sorry about that. Fixed in v3. 

Also as discussed in https://lore.kernel.org/git/3b7d77a1-aed9-d202-8646-4b964cb965db@gmail.com/
there is a bug in commit-graph.c where we should be calling free_commit_graph() instead of 
just free(graph). I will do this in a separate series. 

>> +			}
>> +			break;
>>  		}
>>  
>>  		if (chunk_repeated) {
>> @@ -343,6 +371,17 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
>>  		last_chunk_offset = chunk_offset;
>>  	}
>>  
>> +	/* We need both the bloom chunks to exist together. Else ignore the data */
>> +	if ((graph->chunk_bloom_indexes && !graph->chunk_bloom_data)
>> +		 || (!graph->chunk_bloom_indexes && graph->chunk_bloom_data)) {
>> +		graph->chunk_bloom_indexes = NULL;
>> +		graph->chunk_bloom_data = NULL;
>> +		graph->bloom_filter_settings = NULL;
>> +	}
>> +
>> +	if (graph->chunk_bloom_indexes && graph->chunk_bloom_data)
>> +		load_bloom_filters();
> 
> Wouldn't it be simpler to rely on the fact that both Bloom chunks must
> exists for it to matter, and write it like this:
> 
>   +	if (graph->chunk_bloom_indexes && graph->chunk_bloom_data) {
>   +		load_bloom_filters();
>   +	} else {
>   +		graph->chunk_bloom_indexes = NULL;
>   +		graph->chunk_bloom_data = NULL;
>   +		graph->bloom_filter_settings = NULL;
>   +	}
> 

:) Yes. Fixed in v3. 

>> +
>>  static int oid_compare(const void *_a, const void *_b)
>>  {
>>  	const struct object_id *a = (const struct object_id *)_a;
>> @@ -1198,8 +1290,8 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>>  	load_bloom_filters();
>>  
>>  	if (ctx->report_progress)
>> -		progress = start_progress(
>> -			_("Computing commit diff Bloom filters"),
>> +		progress = start_delayed_progress(
>> +			_("Computing changed paths Bloom filters"),
>>  			ctx->commits.nr);
>>
> 
> Ooops.  This look like a fixup which should be made to the original
> earlier commit instead, isn't it?


Yes. Should have been in a previous commit. Fixed in v3. 


>>  };
>>  
>>  struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st);
>> @@ -77,7 +82,7 @@ enum commit_graph_write_flags {
>>  	COMMIT_GRAPH_WRITE_SPLIT      = (1 << 2),
>>  	/* Make sure that each OID in the input is a valid commit OID. */
>>  	COMMIT_GRAPH_WRITE_CHECK_OIDS = (1 << 3),
>> -	COMMIT_GRAPH_WRITE_BLOOM_FILTERS = (1 << 4)
>> +	COMMIT_GRAPH_WRITE_BLOOM_FILTERS = (1 << 4),
> 
> This looks like accidental change; if we want to use trailing comma in
> enum, this change should be in my opinion done in the commit that added
> COMMIT_GRAPH_WRITE_BLOOM_FILTERS (as I have written in a comment there).
> 

Yes, I noticed the lack of the comma later and forgot to move it to the right
commit. Fixed in v3. 

> 
> Thank you for your work on this series.
> 
> Best,
> 

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH v2 08/11] commit-graph: reuse existing Bloom filters during write.
  2020-02-20 18:48     ` Jakub Narebski
@ 2020-02-24 21:45       ` Garima Singh
  0 siblings, 0 replies; 150+ messages in thread
From: Garima Singh @ 2020-02-24 21:45 UTC (permalink / raw)
  To: Jakub Narebski, Garima Singh via GitGitGadget
  Cc: git, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
	Jeff Hostetler, Taylor Blau, Jeff King, Christian Couder,
	Emily Shaffer, Junio C Hamano, Garima Singh



On 2/20/2020 1:48 PM, Jakub Narebski wrote:
> "Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> From: Garima Singh <garima.singh@microsoft.com>
>>
>> Read previously computed Bloom filters from the commit-graph file if
>> possible to avoid recomputing during commit-graph write.
> 
> All right, what is written makes sense for this point in patch series.
> 
> But it my opinion it is more important to state that this commit adds
> "parsing" of the Bloom filter data from commit-graph file.  This means
> that it needs to be calculated only once, then stored in commit-graph,
> ready to be re-used.
> 

Good point. Incorporated in v3.

>>
>> See Documentation/technical/commit-graph-format for the format in which
>> the Bloom filter information is written to the commit graph file.
>>
>> To read Bloom filter for a given commit with lexicographic position
>> 'i' we need to:
>> 1. Read BIDX[i] which essentially gives us the starting index in BDAT for
>>    filter of commit i+1. It is essentially the index past the end
>>    of the filter of commit i. It is called end_index in the code.
>>
>> 2. For i>0, read BIDX[i-1] which will give us the starting index in BDAT
>>    for filter of commit i. It is called the start_index in the code.
>>    For the first commit, where i = 0, Bloom filter data starts at the
>>    beginning, just past the header in the BDAT chunk. Hence, start_index
>>    will be 0.
>>
>> 3. The length of the filter will be end_index - start_index, because
>>    BIDX[i] gives the cumulative 8-byte words including the ith
>>    commit's filter.
>>
>> We toggle whether Bloom filters should be recomputed based on the
>> compute_if_null flag.
> 
> Nitpick: the flag (the parameter) is called compute_if_not_present, not
> compute_if_null.
> 
Oops. Fixed in v3. 

>> +
>> +	end_index = get_be32(g->chunk_bloom_indexes + 4 * lex_pos);
>> +
>> +	if (lex_pos)
> 
> Wouldn't it be better to be more explicit, and write
> 
>   +	if (lex_pos > 0)
> 
> 

Sure. 

>> +		start_index = get_be32(g->chunk_bloom_indexes + 4 * (lex_pos - 1));
>> +	else
>> +		start_index = 0;
> 
> All right, here we find start_index and end_index.
> 
> It might be good idea to at least assert() that start_index <= end_index,
> though that should not happen (that is why I propose for this check to
> be compiled on only for debug builds).
> 

I will look into this. Thanks! 


>> @@ -1304,7 +1304,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>>  
>>  	for (i = 0; i < ctx->commits.nr; i++) {
>>  		struct commit *c = sorted_by_pos[i];
>> -		struct bloom_filter *filter = get_bloom_filter(ctx->r, c);
>> +		struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1);
>>  		ctx->total_bloom_filter_data_size += sizeof(uint64_t) * filter->len;
>>  		display_progress(progress, i + 1);
>>  	}
>> @@ -2314,6 +2314,7 @@ void free_commit_graph(struct commit_graph *g)
>>  		g->data = NULL;
>>  		close(g->graph_fd);
>>  	}
>> +	free(g->bloom_filter_settings);
>>  	free(g->filename);
>>  	free(g);
> 
> Shouldn't this fixup be added to earlier commit?
> 

Yes. 

>>  }
>> diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
>> index 331957011b..9b4be97f75 100644
>> --- a/t/helper/test-bloom.c
>> +++ b/t/helper/test-bloom.c
>> @@ -47,7 +47,7 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
>>  	struct bloom_filter *filter;
>>  	setup_git_directory();
>>  	c = lookup_commit(the_repository, commit_oid);
>> -	filter = get_bloom_filter(the_repository, c);
>> +	filter = get_bloom_filter(the_repository, c, 1);
>>  	print_bloom_filter(filter);
>>  }
> 
> I would like to see some tests, but that needs to wait for patch that
> adds --changed-paths option to the 'write' subcommand.
> 
> Things to be tested:
> 1. That after reading commit-graph with Bloom filter:
>    - that commit(s) in commit-graph have Bloom filter
>    - that commits outside commit-graph do not have Bloom filter
> 2. That incremental commit-graph feature works:
>    - for commits in deeper layer that have Bloom filter chunks
>    - for commits in deeper layer that do not have Bloom filter chunks
> 

Included in later commits. 

> Best,
> 

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH v2 09/11] commit-graph: add --changed-paths option to write subcommand
  2020-02-20 20:28     ` Jakub Narebski
@ 2020-02-24 21:51       ` Garima Singh
  2020-02-25 12:10         ` Jakub Narebski
  0 siblings, 1 reply; 150+ messages in thread
From: Garima Singh @ 2020-02-24 21:51 UTC (permalink / raw)
  To: Jakub Narebski, Garima Singh via GitGitGadget
  Cc: git, Derrick Stolee, SZEDER Gábor, Jonathan Tan,
	Jeff Hostetler, Taylor Blau, Jeff King, Christian Couder,
	Emily Shaffer, Junio C Hamano, Garima Singh



On 2/20/2020 3:28 PM, Jakub Narebski wrote:
> "Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> From: Garima Singh <garima.singh@microsoft.com>
>>
>> Add --changed-paths option to git commit-graph write. This option will
>> allow users to compute information about the paths that have changed
>> between a commit and its first parent, and write it into the commit graph
>> file. If the option is passed to the write subcommand we set the
>> COMMIT_GRAPH_WRITE_BLOOM_FILTERS flag and pass it down to the
>> commit-graph logic.
> 
> In the manpage you write that this operation (computing Bloom filters)
> can take a while on large repositories.  Could you perhaps provide some
> numbers: how much longer does it take to write commit-graph file with
> and without '--changed-paths' for example for Linux kernel, or some
> other large repository?  Thanks in advance.
> 

Yes. Will include numbers as appropriate in v3. 

>>
>> Helped-by: Derrick Stolee <dstolee@microsoft.com>
>> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
>> ---
>>  Documentation/git-commit-graph.txt | 5 +++++
>>  builtin/commit-graph.c             | 9 +++++++--
>>  2 files changed, 12 insertions(+), 2 deletions(-)
> 
> What is missing is some sanity tests: that bloom index and bloom data
> chunks are not present without '--changed-paths', and that they are
> added with '--changed-paths'.
> 
> If possible, maybe also check in a separate test that the size of
> bloom_index chunk agrees with the number of commits in the commit graph.
> 
> 
> Also, we can now add those tests I have wrote about in my review of
> previous patch, that is:
> 
> 1. If you write commit-graph with --changed-paths, and either add some
>    commits later or exclude some commits from the commit graph, then:
> 
>    a.) commit(s) in commit-graph have Bloom filter
>    b.) commit(s) not in commit-graph do not have Bloom filter
> 
> 2. If you write commit-graph without --changed-paths as base layer,
>    and then write next layer with --changed-paths and --split, then:
> 
>    a.) commit(s) in top layer have Bloom filter(s)
>    b.) commit(s) in bottom layer don't have Bloom filter(s)
> 

I will see what more can be done here. 

>>
>> diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
>> index bcd85c1976..907d703b30 100644
>> --- a/Documentation/git-commit-graph.txt
>> +++ b/Documentation/git-commit-graph.txt
>> @@ -54,6 +54,11 @@ or `--stdin-packs`.)
>>  With the `--append` option, include all commits that are present in the
>>  existing commit-graph file.
>>  +
>> +With the `--changed-paths` option, compute and write information about the
>> +paths changed between a commit and it's first parent. This operation can
>> +take a while on large repositories. It provides significant performance gains
>> +for getting history of a directory or a file with `git log -- <path>`.
>> ++
> 
> Should we write about limitation that the topmost layer in the split
> commit graph needs to be written with '--changed-paths' for Git to use
> this information?  Or perhaps we should try (in the future) to remove
> this limitation??
> 

Given that this information is going to be used best effort, it would be 
superfluous to describe every case and conditional that decides whether 
this information is being used.
>> @@ -143,6 +146,8 @@ static int graph_write(int argc, const char **argv)
>>  		flags |= COMMIT_GRAPH_WRITE_SPLIT;
>>  	if (opts.progress)
>>  		flags |= COMMIT_GRAPH_WRITE_PROGRESS;
>> +	if (opts.enable_changed_paths)
>> +		flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;
>>  
>>  	read_replace_refs = 0;
> 
> All right.  This actually turns on calculation Bloom filters for changed
> paths, thanks to
> 
>  	ctx->changed_paths = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0;
> 
> that was added by the "[PATCH v2 04/11] commit-graph: compute Bloom
> filters for changed paths" patch.
> 
> Though... should this enabling be split into two separate patches like
> this?
> 

The idea is that in 4/11 We compute only if the flag is set. 
And between that patch and this one: we prepare the foundational code 
that is now ready for that flag to be set via an opt-in by the user. 

> 
> Best,
> 

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH v2 07/11] commit-graph: write Bloom filters to commit graph file
  2020-02-24 21:14       ` Garima Singh
@ 2020-02-25 11:40         ` Jakub Narebski
  2020-02-25 15:58           ` Garima Singh
  0 siblings, 1 reply; 150+ messages in thread
From: Jakub Narebski @ 2020-02-25 11:40 UTC (permalink / raw)
  To: Garima Singh
  Cc: Garima Singh via GitGitGadget, git, Derrick Stolee,
	SZEDER Gábor, Jonathan Tan, Jeff Hostetler, Taylor Blau,
	Jeff King, Christian Couder, Emily Shaffer, Junio C Hamano,
	Garima Singh

Garima Singh <garimasigit@gmail.com> writes:
> On 2/19/2020 10:13 AM, Jakub Narebski wrote:
>> "Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
[...]
>>> diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
>>> index a4f17441ae..22e511643d 100644
>>> --- a/Documentation/technical/commit-graph-format.txt
>>> +++ b/Documentation/technical/commit-graph-format.txt
>>> @@ -17,6 +17,9 @@ metadata, including:
>>>  - The parents of the commit, stored using positional references within
>>>    the graph file.
>>>  
>>> +- The Bloom filter of the commit carrying the paths that were changed between
>>> +  the commit and its first parent.
>>> +
>> 
>> All right.
>> 
>> Should we also state that it is optional (meta)data?  This would be
>> first optional piece of data stored in commit-graph, I think.
>> 
>
> However the entire commit graph file is non critical metadata since git commands
> work just fine without it, just slower. The same applies to the changed path
> bloom filters. 
>
> Based on the definition of optional you are suggesting, edge data is optional
> because not every commit-graph has octopus merges. 

Well, edge data (EDGE chunk) is optional in different way from Bloom
filter data.  The former depends on the repository (whether there are
octopus merges used), the latter is opt-in user choice (whether to run
`git commit-graph write` with the `--changed-paths` option, or in the
future equivalent config option).

To provide some advise that can be acted upon: perhaps it would be
better to start with "It can store", or end with "if requested" or
"optionally".  For example the change could look like the following
suggestion:


 The Git commit graph stores a list of commit OIDs and some associated
 metadata, including:
[...]
+- The Bloom filter of the commit carrying the paths that were changed between
+  the commit and its first parent, if requested.
+

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH v2 09/11] commit-graph: add --changed-paths option to write subcommand
  2020-02-24 21:51       ` Garima Singh
@ 2020-02-25 12:10         ` Jakub Narebski
  0 siblings, 0 replies; 150+ messages in thread
From: Jakub Narebski @ 2020-02-25 12:10 UTC (permalink / raw)
  To: Garima Singh
  Cc: Garima Singh via GitGitGadget, git, Derrick Stolee,
	SZEDER Gábor, Jonathan Tan, Jeff Hostetler, Taylor Blau,
	Jeff King, Christian Couder, Emily Shaffer, Junio C Hamano,
	Garima Singh

Garima Singh <garimasigit@gmail.com> writes:
> On 2/20/2020 3:28 PM, Jakub Narebski wrote:
>> "Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:

[...]
>>> --- a/Documentation/git-commit-graph.txt
>>> +++ b/Documentation/git-commit-graph.txt
>>> @@ -54,6 +54,11 @@ or `--stdin-packs`.)
>>>  With the `--append` option, include all commits that are present in the
>>>  existing commit-graph file.
>>>  +
>>> +With the `--changed-paths` option, compute and write information about the
>>> +paths changed between a commit and it's first parent. This operation can
>>> +take a while on large repositories. It provides significant performance gains
>>> +for getting history of a directory or a file with `git log -- <path>`.
>>> ++
>> 
>> Should we write about limitation that the topmost layer in the split
>> commit graph needs to be written with '--changed-paths' for Git to use
>> this information?  Or perhaps we should try (in the future) to remove
>> this limitation?
>
> Given that this information is going to be used best effort, it would be 
> superfluous to describe every case and conditional that decides whether 
> this information is being used.

I can somewhat agree with this reasoning.

However what I would like to avoid is surprising users.  If one creates
base commit-graph with Bloom filters data, but then when creating
new layer of commit-graph (updating it incrementally), it may be
surprising that `git log -- <path>` is now much slower.

On the other hand if one would update commit-graph in a non-incremental
way (rewriting the commit-graph file), loosing the Bloom filter
information and performance of `git log -- <path>` because one forgot to
include `--changed-paths` is not that unexpected.

Anyway, in the future when this mechanism will be controlled by
appropriate config variable, this whole discussion would become somewhat
moot.


Thought for the future: perhaps `git commit-graph verify` could detect
that split graph has Bloom filters only for some layers, and inform the
user?  But that is almost certainly out of scope of this patch series.

>>> @@ -143,6 +146,8 @@ static int graph_write(int argc, const char **argv)
>>>  		flags |= COMMIT_GRAPH_WRITE_SPLIT;
>>>  	if (opts.progress)
>>>  		flags |= COMMIT_GRAPH_WRITE_PROGRESS;
>>> +	if (opts.enable_changed_paths)
>>> +		flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;
>>>  
>>>  	read_replace_refs = 0;
>> 
>> All right.  This actually turns on calculation Bloom filters for changed
>> paths, thanks to
>> 
>>  	ctx->changed_paths = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0;
>> 
>> that was added by the "[PATCH v2 04/11] commit-graph: compute Bloom
>> filters for changed paths" patch.
>> 
>> Though... should this enabling be split into two separate patches like
>> this?
>
> The idea is that in 4/11 We compute only if the flag is set. 
> And between that patch and this one: we prepare the foundational code 
> that is now ready for that flag to be set via an opt-in by the user. 

All right.

Choosing how to split large change into series is not easy.  One one
hand one would want for each change to be small and self contained.  On
the other hand it would be good if each change was testable (test-tool
can help here).

Best,
-- 
Jakub Narębski

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH v2 07/11] commit-graph: write Bloom filters to commit graph file
  2020-02-25 11:40         ` Jakub Narebski
@ 2020-02-25 15:58           ` Garima Singh
  0 siblings, 0 replies; 150+ messages in thread
From: Garima Singh @ 2020-02-25 15:58 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Garima Singh via GitGitGadget, git, Derrick Stolee,
	SZEDER Gábor, Jonathan Tan, Jeff Hostetler, Taylor Blau,
	Jeff King, Christian Couder, Emily Shaffer, Junio C Hamano,
	Garima Singh


On 2/25/2020 6:40 AM, Jakub Narebski wrote:
> Garima Singh <garimasigit@gmail.com> writes:
>> On 2/19/2020 10:13 AM, Jakub Narebski wrote:
>>> "Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
> [...]
>>>> diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
>>>> index a4f17441ae..22e511643d 100644
>>>> --- a/Documentation/technical/commit-graph-format.txt
>>>> +++ b/Documentation/technical/commit-graph-format.txt
>>>> @@ -17,6 +17,9 @@ metadata, including:
>>>>  - The parents of the commit, stored using positional references within
>>>>    the graph file.
>>>>  
>>>> +- The Bloom filter of the commit carrying the paths that were changed between
>>>> +  the commit and its first parent.
>>>> +
>>>
>>> All right.
>>>
>>> Should we also state that it is optional (meta)data?  This would be
>>> first optional piece of data stored in commit-graph, I think.
>>>
>>
>> However the entire commit graph file is non critical metadata since git commands
>> work just fine without it, just slower. The same applies to the changed path
>> bloom filters. 
>>
>> Based on the definition of optional you are suggesting, edge data is optional
>> because not every commit-graph has octopus merges. 
> 
> Well, edge data (EDGE chunk) is optional in different way from Bloom
> filter data.  The former depends on the repository (whether there are
> octopus merges used), the latter is opt-in user choice (whether to run
> `git commit-graph write` with the `--changed-paths` option, or in the
> future equivalent config option).
> 
> To provide some advise that can be acted upon: perhaps it would be
> better to start with "It can store", or end with "if requested" or
> "optionally".  For example the change could look like the following
> suggestion:
> 
> 
>  The Git commit graph stores a list of commit OIDs and some associated
>  metadata, including:
> [...]
> +- The Bloom filter of the commit carrying the paths that were changed between
> +  the commit and its first parent, if requested.
> +
> 
> Best,
> 

Sure. That makes sense. Will incorporate in v3. 

Cheers!
Garima Singh

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH 0/9] [RFC] Changed Paths Bloom Filters
  2019-12-20 22:05 [PATCH 0/9] [RFC] Changed Paths Bloom Filters Garima Singh via GitGitGadget
                   ` (14 preceding siblings ...)
  2020-02-05 22:56 ` [PATCH v2 00/11] " Garima Singh via GitGitGadget
@ 2020-03-05 19:49 ` Garima Singh
  15 siblings, 0 replies; 150+ messages in thread
From: Garima Singh @ 2020-03-05 19:49 UTC (permalink / raw)
  To: Garima Singh via GitGitGadget, git
  Cc: stolee, szeder.dev, jonathantanmy, jeffhost, me, peff, Junio C Hamano

My apologies that things have been quite on this series for the past
week or so. An unexpected high priority task at work demanded all of 
my attention and will continue to do so through the end of this week. 

Hopefully I will be able to pick this up again early next week and 
have v3 out soon! 

Cheers!
Garima Singh

On 12/20/2019 5:05 PM, Garima Singh via GitGitGadget wrote:
> Hey! 
> 
> The commit graph feature brought in a lot of performance improvements across
> multiple commands. However, file based history continues to be a performance
> pain point, especially in large repositories. 
> 
> Adopting changed path bloom filters has been discussed on the list before,
> and a prototype version was worked on by SZEDER Gábor, Jonathan Tan and Dr.
> Derrick Stolee [1]. This series is based on Dr. Stolee's approach [2] and
> presents an updated and more polished RFC version of the feature. 
> 
> Performance Gains: We tested the performance of git log -- path on the git
> repo, the linux repo and some internal large repos, with a variety of paths
> of varying depths.
> 
> On the git and linux repos: We observed a 2x to 5x speed up.
> 
> On a large internal repo with files seated 6-10 levels deep in the tree: We
> observed 10x to 20x speed ups, with some paths going up to 28 times faster.
> 
> Future Work (not included in the scope of this series):
> 
>  1. Supporting multiple path based revision walk
>  2. Adopting it in git blame logic. 
>  3. Interactions with line log git log -L
> 
> This series is intended to start the conversation and many of the commit
> messages include specific call outs for suggestions and thoughts. 
> 
> Cheers! Garima Singh
> 
> [1] https://lore.kernel.org/git/20181009193445.21908-1-szeder.dev@gmail.com/
> [2] 
> https://lore.kernel.org/git/61559c5b-546e-d61b-d2e1-68de692f5972@gmail.com/
> 
> Garima Singh (9):
>   commit-graph: add --changed-paths option to write
>   commit-graph: write changed paths bloom filters
>   commit-graph: use MAX_NUM_CHUNKS
>   commit-graph: document bloom filter format
>   commit-graph: write changed path bloom filters to commit-graph file.
>   commit-graph: test commit-graph write --changed-paths
>   commit-graph: reuse existing bloom filters during write.
>   revision.c: use bloom filters to speed up path based revision walks
>   commit-graph: add GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS test flag
> 
>  Documentation/git-commit-graph.txt            |   5 +
>  .../technical/commit-graph-format.txt         |  17 ++
>  Makefile                                      |   1 +
>  bloom.c                                       | 257 +++++++++++++++++
>  bloom.h                                       |  51 ++++
>  builtin/commit-graph.c                        |   9 +-
>  ci/run-build-and-tests.sh                     |   1 +
>  commit-graph.c                                | 116 +++++++-
>  commit-graph.h                                |   9 +-
>  revision.c                                    |  67 ++++-
>  revision.h                                    |   5 +
>  t/README                                      |   3 +
>  t/helper/test-read-graph.c                    |   4 +
>  t/t4216-log-bloom.sh                          |  77 ++++++
>  t/t5318-commit-graph.sh                       |   2 +
>  t/t5324-split-commit-graph.sh                 |   1 +
>  t/t5325-commit-graph-bloom.sh                 | 258 ++++++++++++++++++
>  17 files changed, 875 insertions(+), 8 deletions(-)
>  create mode 100644 bloom.c
>  create mode 100644 bloom.h
>  create mode 100755 t/t4216-log-bloom.sh
>  create mode 100755 t/t5325-commit-graph-bloom.sh
> 
> 
> base-commit: b02fd2accad4d48078671adf38fe5b5976d77304
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-497%2Fgarimasi514%2FcoreGit-bloomFilters-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-497/garimasi514/coreGit-bloomFilters-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/497
> 

^ permalink raw reply	[flat|nested] 150+ messages in thread

* Re: [PATCH v2 00/11] Changed Paths Bloom Filters
  2020-02-21 17:41     ` Garima Singh
@ 2020-03-29 18:36       ` Junio C Hamano
  0 siblings, 0 replies; 150+ messages in thread
From: Junio C Hamano @ 2020-03-29 18:36 UTC (permalink / raw)
  To: Garima Singh
  Cc: Jakub Narebski, Garima Singh via GitGitGadget, git,
	Derrick Stolee, SZEDER Gábor, Jonathan Tan, Jeff Hostetler,
	Taylor Blau, Jeff King, Christian Couder, Emily Shaffer,
	Garima Singh

Garima Singh <garimasigit@gmail.com> writes:

> On 2/8/2020 6:04 PM, Jakub Narebski wrote:
>> "Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
>> ...
> I have gone back and forth on doing this. I like most of the core Bloom filter
> computations being isolated in one patch/commit. But based on the rest of your
> review, it seems like you are leaning heavily on having this split out. 
> So, I will take a proper stab at doing it for v3. 
> ...
> Thanks for taking the time for reviewing this series so thoroughly! 
> It is greatly appreciated! 

Thanks for a great discussion.  Just a friendly ping to the thread,
so that something from the discussion thread will stay on the first
page of mailing list archive's threaded view ;-)


^ permalink raw reply	[flat|nested] 150+ messages in thread

* [PATCH v3 00/16] Changed Paths Bloom Filters
  2020-02-05 22:56 ` [PATCH v2 00/11] " Garima Singh via GitGitGadget
                     ` (12 preceding siblings ...)
  2020-02-08 23:04   ` Jakub Narebski
@ 2020-03-30  0:31   ` Garima Singh via GitGitGadget
  2020-03-30  0:31     ` [PATCH v3 01/16] commit-graph: define and use MAX_NUM_CHUNKS Garima Singh via GitGitGadget
                       ` (16 more replies)
  13 siblings, 17 replies; 150+ messages in thread
From: Garima Singh via GitGitGadget @ 2020-03-30  0:31 UTC (permalink / raw)
  To: git; +Cc: stolee, szeder.dev, jonathantanmy, Garima Singh

Hey! 

The commit graph feature brought in a lot of performance improvements across
multiple commands. However, file based history continues to be a performance
pain point, especially in large repositories. 

Adopting changed path Bloom filters has been discussed on the list before,
and a prototype version was worked on by SZEDER Gábor, Jonathan Tan and Dr.
Derrick Stolee [1]. This series is based on Dr. Stolee's proof of concept in
[2]

With the changes in this series, git users will be able to choose to write
Bloom filters to the commit-graph using the following command:

'git commit-graph write --changed-paths'

Subsequent 'git log -- path' commands will use these computed Bloom filters
to decided which commits are worth exploring further to produce the history
of the provided path. 

Cost of computing and writing Bloom filters
===========================================

Computing and writing Bloom filters to the commit graph for the first time
implies computing the diffs and the resulting Bloom filters for all the
commits in the repository. This adds a non trivial amount of time to run
time. Every subsequent run is incremental i.e. we reuse the previously
computed Bloom filters. So this is a one time cost. 

Time taken by 'git commit-graph write' with and w/o --changed-paths, speed
up in 'git log -- path' with computed Bloom filters (see a):- 

-------------------------------------------------------------------------
| Repo        | w/o --changed-paths | with --changed-paths | Speed up   |
-------------------------------------------------------------------------
| git [3]     | 0.9 seconds         | 7 seconds            | 2x to 6x   |
| linux [4]   | 16 seconds          | 1 minute 8 seconds   | 2x to 6x   | 
| android [5] | 9 seconds           | 48 seconds           | 2x to 6x   |
| AzDo(see b) | 1 minute            | 5 minutes 2 seconds  | 10x to 30x |
-------------------------------------------------------------------------

a) We tested the performance of git log -- path with randomly chosen paths
of varying depths in each repo. The speed up depends on how deep the files
are in the hierarchy and how often a file has been touched in the commit
history.

b) This internal repository has about 420k commits, 183k files distributed
across 34k folders, the size on disk is about 17 GiB. The most massive gains
on this repository were for files 6-12 levels deep in the tree. 

c) These numbers were collected on a Windows machine, except for the linux
repo which was tested on a Linux machine. 

Future Work (not included in the scope of this series)
======================================================

 1. Supporting multiple path based revision walk
 2. Adopting it in git blame logic. 
 3. Interactions with line log git log -L

Cheers! Garima Singh

[1] https://lore.kernel.org/git/20181009193445.21908-1-szeder.dev@gmail.com/

[2] 
https://lore.kernel.org/git/61559c5b-546e-d61b-d2e1-68de692f5972@gmail.com/

[3] https://github.com/git/git

[4] https://github.com/torvalds/linux

[5] https://android.googlesource.com/platform/frameworks/base/

jeffhost@microsoft.com, me@ttaylorr.com, peff@peff.net, 
garimasigit@gmail.com,jnareb@gmail.com, christian.couder@gmail.com, 
emilyshaffer@gmail.com,gitster@pobox.com

Derrick Stolee (1):
  diff: halt tree-diff early after max_changes

Garima Singh (14):
  commit-graph: define and use MAX_NUM_CHUNKS
  bloom.c: add the murmur3 hash implementation
  bloom.c: introduce core Bloom filter constructs
  bloom.c: core Bloom filter implementation for changed paths.
  commit-graph: compute Bloom filters for changed paths
  commit-graph: examine commits by generation number
  diff: skip batch object download when possible
  commit-graph: write Bloom filters to commit graph file
  commit-graph: reuse existing Bloom filters during write
  commit-graph: add --changed-paths option to write subcommand
  revision.c: use Bloom filters to speed up path based revision walks
  revision.c: add trace2 stats around Bloom filter usage
  t4216: add end to end tests for git log with Bloom filters
  commit-graph: add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag

Jeff King (1):
  commit-graph: examine changed-path objects in pack order

 Documentation/git-commit-graph.txt            |   5 +
 .../technical/commit-graph-format.txt         |  30 ++
 Makefile                                      |   2 +
 bloom.c                                       | 276 ++++++++++++++++++
 bloom.h                                       |  90 ++++++
 builtin/commit-graph.c                        |  10 +-
 ci/run-build-and-tests.sh                     |   1 +
 commit-graph.c                                | 213 +++++++++++++-
 commit-graph.h                                |   9 +-
 diff.c                                        |   8 +-
 diff.h                                        |   6 +
 revision.c                                    | 126 +++++++-
 revision.h                                    |  11 +
 t/README                                      |   5 +
 t/helper/test-bloom.c                         |  81 +++++
 t/helper/test-read-graph.c                    |   4 +
 t/helper/test-tool.c                          |   1 +
 t/helper/test-tool.h                          |   1 +
 t/t0095-bloom.sh                              | 117 ++++++++
 t/t4216-log-bloom.sh                          | 155 ++++++++++
 t/t5318-commit-graph.sh                       |   2 +
 t/t5324-split-commit-graph.sh                 |   1 +
 tree-diff.c                                   |   6 +
 23 files changed, 1148 insertions(+), 12 deletions(-)
 create mode 100644 bloom.c
 create mode 100644 bloom.h
 create mode 100644 t/helper/test-bloom.c
 create mode 100755 t/t0095-bloom.sh
 create mode 100755 t/t4216-log-bloom.sh


base-commit: 3bab5d56259722843359702bc27111475437ad2a
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-497%2Fgarimasi514%2FcoreGit-bloomFilters-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-497/garimasi514/coreGit-bloomFilters-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/497

Range-diff vs v2:

  1:  bf6b93878af !  1:  c3ffd9820d5 commit-graph: use MAX_NUM_CHUNKS
     @@ -1,10 +1,12 @@
      Author: Garima Singh <garima.singh@microsoft.com>
      
     -    commit-graph: use MAX_NUM_CHUNKS
     +    commit-graph: define and use MAX_NUM_CHUNKS
      
     -    This is a minor cleanup to make it easier to change the
     -    number of chunks being written to the commit-graph in the future.
     +    This is a minor cleanup to make it easier to change
     +    the number of chunks being written to the commit
     +    graph.
      
     +    Reviewed-by: Jakub Narębski <jnareb@gmail.com>
          Signed-off-by: Garima Singh <garima.singh@microsoft.com>
      
       diff --git a/commit-graph.c b/commit-graph.c
  -:  ----------- >  2:  a5aa3415c05 bloom.c: add the murmur3 hash implementation
  -:  ----------- >  3:  a7702c1afde bloom.c: introduce core Bloom filter constructs
  2:  02b16d94227 !  4:  8304c297520 bloom: core Bloom filter implementation for changed paths
     @@ -1,89 +1,33 @@
      Author: Garima Singh <garima.singh@microsoft.com>
      
     -    bloom: core Bloom filter implementation for changed paths
     +    bloom.c: core Bloom filter implementation for changed paths.
      
     -    Add the core Bloom filter logic for computing the paths changed between a
     -    commit and its first parent. For details on what Bloom filters are and how they
     -    work, please refer to Dr. Derrick Stolee's blog post [1]. It provides a concise
     -    explaination of the adoption of Bloom filters as described in [2] and [3]
     +    Add the core implementation for computing Bloom filters for
     +    the paths changed between a commit and it's first parent.
      
     -    1. We currently use 7 and 10 for the number of hashes and the size of each
     -       entry respectively. They served as great starting values, the mathematical
     -       details behind this choice are described in [1] and [4]. The implementation
     -       while not completely open to it at the moment, is flexible enough to allow
     -       for tweaking these settings in the future.
     +    We fill the Bloom filters as (const char *data, int len) pairs
     +    as `struct bloom_filters" within a commit slab.
      
     -       Note: The performance gains we have observed with these values are
     -       significant enough that we did not need to tweak these settings.
     -       The performance numbers are included in the cover letter of this series
     -       and in the message of a subsequent commit where we use Bloom filters in
     -       to speed up `git log -- <path>`.
     -
     -    2. As described in the blog and in [3], we do not need 7 independent hashing
     -       functions. We use the Murmur3 hashing scheme. Seed it twice and then
     -       combine those to procure an arbitrary number of hash values.
     -
     -    3. The filters are sized according to the number of changes in the each commit,
     -       with minimum size of one 64 bit word.
     -
     -    4. We fill the Bloom filters as (const char *data, int len) pairs as
     -       "struct bloom_filter"s in a commit slab.
     -
     -    5. The seed_murmur3 method is implemented as described in [5]. It hashes the
     -       given data using a given seed and produces a uniformly distributed hash
     -       value.
     -
     -    [1] https://devblogs.microsoft.com/devops/super-charging-the-git-commit-graph-iv-Bloom-filters/
     -
     -    [2] Flavio Bonomi, Michael Mitzenmacher, Rina Panigrahy, Sushil Singh, George Varghese
     -        "An Improved Construction for Counting Bloom Filters"
     -        http://theory.stanford.edu/~rinap/papers/esa2006b.pdf
     -        https://doi.org/10.1007/11841036_61
     -
     -    [3] Peter C. Dillinger and Panagiotis Manolios
     -        "Bloom Filters in Probabilistic Verification"
     -        http://www.ccs.neu.edu/home/pete/pub/Bloom-filters-verification.pdf
     -        https://doi.org/10.1007/978-3-540-30494-4_26
     -
     -    [4] Thomas Mueller Graf, Daniel Lemire
     -        "Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters"
     -        https://arxiv.org/abs/1912.08258
     -
     -    [5] https://en.wikipedia.org/wiki/MurmurHash#Algorithm
     +    Filters for commits with no changes and more than 512 changes,
     +    is represented with a filter of length zero. There is no gain
     +    in distinguishing between a computed filter of length zero for
     +    a commit with no changes, and an uncomputed filter for new commits
     +    or for commits with more than 512 changes. The effect on
     +    `git log -- path` is the same in both cases. We will fall back to
     +    the normal diffing algorithm when we can't benefit from the
     +    existence of Bloom filters.
      
          Helped-by: Jeff King <peff@peff.net>
          Helped-by: Derrick Stolee <dstolee@microsoft.com>
     +    Reviewed-by: Jakub Narębski <jnareb@gmail.com>
          Signed-off-by: Garima Singh <garima.singh@microsoft.com>
      
     - diff --git a/Makefile b/Makefile
     - --- a/Makefile
     - +++ b/Makefile
     -@@
     - 
     - PROGRAMS += $(patsubst %.o,git-%$X,$(PROGRAM_OBJS))
     - 
     -+TEST_BUILTINS_OBJS += test-bloom.o
     - TEST_BUILTINS_OBJS += test-chmtime.o
     - TEST_BUILTINS_OBJS += test-config.o
     - TEST_BUILTINS_OBJS += test-ctype.o
     -@@
     - LIB_OBJS += bisect.o
     - LIB_OBJS += blame.o
     - LIB_OBJS += blob.o
     -+LIB_OBJS += bloom.o
     - LIB_OBJS += branch.o
     - LIB_OBJS += bulk-checkin.o
     - LIB_OBJS += bundle.o
     -
       diff --git a/bloom.c b/bloom.c
     - new file mode 100644
     - --- /dev/null
     + --- a/bloom.c
       +++ b/bloom.c
      @@
     -+#include "git-compat-util.h"
     -+#include "bloom.h"
     -+#include "commit-graph.h"
     -+#include "object-store.h"
     + #include "git-compat-util.h"
     + #include "bloom.h"
      +#include "diff.h"
      +#include "diffcore.h"
      +#include "revision.h"
     @@ -97,118 +41,19 @@
      +    struct hashmap_entry entry;
      +    const char path[FLEX_ARRAY];
      +};
     + 
     + static uint32_t rotate_left(uint32_t value, int32_t count)
     + {
     +@@
     + 		filter->data[block_pos] |= get_bitmask(hash_mod);
     + 	}
     + }
      +
     -+static uint32_t rotate_right(uint32_t value, int32_t count)
     -+{
     -+	uint32_t mask = 8 * sizeof(uint32_t) - 1;
     -+	count &= mask;
     -+	return ((value >> count) | (value << ((-count) & mask)));
     -+}
     -+
     -+/*
     -+ * Calculate a hash value for the given data using the given seed.
     -+ * Produces a uniformly distributed hash value.
     -+ * Not considered to be cryptographically secure.
     -+ * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
     -+ **/
     -+static uint32_t seed_murmur3(uint32_t seed, const char *data, int len)
     -+{
     -+	const uint32_t c1 = 0xcc9e2d51;
     -+	const uint32_t c2 = 0x1b873593;
     -+	const uint32_t r1 = 15;
     -+	const uint32_t r2 = 13;
     -+	const uint32_t m = 5;
     -+	const uint32_t n = 0xe6546b64;
     -+	int i;
     -+	uint32_t k1 = 0;
     -+	const char *tail;
     -+
     -+	int len4 = len / sizeof(uint32_t);
     -+
     -+	const uint32_t *blocks = (const uint32_t*)data;
     -+
     -+	uint32_t k;
     -+	for (i = 0; i < len4; i++)
     -+	{
     -+		k = blocks[i];
     -+		k *= c1;
     -+		k = rotate_right(k, r1);
     -+		k *= c2;
     -+
     -+		seed ^= k;
     -+		seed = rotate_right(seed, r2) * m + n;
     -+	}
     -+
     -+	tail = (data + len4 * sizeof(uint32_t));
     -+
     -+	switch (len & (sizeof(uint32_t) - 1))
     -+	{
     -+	case 3:
     -+		k1 ^= ((uint32_t)tail[2]) << 16;
     -+		/*-fallthrough*/
     -+	case 2:
     -+		k1 ^= ((uint32_t)tail[1]) << 8;
     -+		/*-fallthrough*/
     -+	case 1:
     -+		k1 ^= ((uint32_t)tail[0]) << 0;
     -+		k1 *= c1;
     -+		k1 = rotate_right(k1, r1);
     -+		k1 *= c2;
     -+		seed ^= k1;
     -+		break;
     -+	}
     -+
     -+	seed ^= (uint32_t)len;
     -+	seed ^= (seed >> 16);
     -+	seed *= 0x85ebca6b;
     -+	seed ^= (seed >> 13);
     -+	seed *= 0xc2b2ae35;
     -+	seed ^= (seed >> 16);
     -+
     -+	return seed;
     -+}
     -+
     -+static inline uint64_t get_bitmask(uint32_t pos)
     -+{
     -+	return ((uint64_t)1) << (pos & (BITS_PER_WORD - 1));
     -+}
     -+
     -+void load_bloom_filters(void)
     ++void init_bloom_filters(void)
      +{
      +	init_bloom_filter_slab(&bloom_filters);
      +}
      +
     -+void fill_bloom_key(const char *data,
     -+					int len,
     -+					struct bloom_key *key,
     -+					struct bloom_filter_settings *settings)
     -+{
     -+	int i;
     -+	const uint32_t seed0 = 0x293ae76f;
     -+	const uint32_t seed1 = 0x7e646e2c;
     -+	const uint32_t hash0 = seed_murmur3(seed0, data, len);
     -+	const uint32_t hash1 = seed_murmur3(seed1, data, len);
     -+
     -+	key->hashes = (uint32_t *)xcalloc(settings->num_hashes, sizeof(uint32_t));
     -+	for (i = 0; i < settings->num_hashes; i++)
     -+		key->hashes[i] = hash0 + i * hash1;
     -+}
     -+
     -+void add_key_to_filter(struct bloom_key *key,
     -+					   struct bloom_filter *filter,
     -+					   struct bloom_filter_settings *settings)
     -+{
     -+	int i;
     -+	uint64_t mod = filter->len * BITS_PER_WORD;
     -+
     -+	for (i = 0; i < settings->num_hashes; i++) {
     -+		uint64_t hash_mod = key->hashes[i] % mod;
     -+		uint64_t block_pos = hash_mod / BITS_PER_WORD;
     -+
     -+		filter->data[block_pos] |= get_bitmask(hash_mod);
     -+	}
     -+}
     -+
      +struct bloom_filter *get_bloom_filter(struct repository *r,
      +				      struct commit *c)
      +{
     @@ -217,7 +62,7 @@
      +	int i;
      +	struct diff_options diffopt;
      +
     -+	if (!bloom_filters.slab_size)
     ++	if (bloom_filters.slab_size == 0)
      +		return NULL;
      +
      +	filter = bloom_filter_slab_at(&bloom_filters, c);
     @@ -234,13 +79,12 @@
      +
      +	if (diff_queued_diff.nr <= 512) {
      +		struct hashmap pathmap;
     -+		struct pathmap_hash_entry* e;
     ++		struct pathmap_hash_entry *e;
      +		struct hashmap_iter iter;
      +		hashmap_init(&pathmap, NULL, NULL, 0);
      +
      +		for (i = 0; i < diff_queued_diff.nr; i++) {
     -+			const char* path = diff_queued_diff.queue[i]->two->path;
     -+			const char* p = path;
     ++			const char *path = diff_queued_diff.queue[i]->two->path;
      +
      +			/*
      +			* Add each leading directory of the changed file, i.e. for
     @@ -251,23 +95,23 @@
      +			* Note that directories are added without the trailing '/'.
      +			*/
      +			do {
     -+				char* last_slash = strrchr(p, '/');
     ++				char *last_slash = strrchr(path, '/');
      +
      +				FLEX_ALLOC_STR(e, path, path);
     -+				hashmap_entry_init(&e->entry, strhash(p));
     ++				hashmap_entry_init(&e->entry, strhash(path));
      +				hashmap_add(&pathmap, &e->entry);
      +
      +				if (!last_slash)
     -+					last_slash = (char*)p;
     ++					last_slash = (char*)path;
      +				*last_slash = '\0';
      +
     -+			} while (*p);
     ++			} while (*path);
      +
      +			diff_free_filepair(diff_queued_diff.queue[i]);
      +		}
      +
      +		filter->len = (hashmap_get_size(&pathmap) * settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
     -+		filter->data = xcalloc(filter->len, sizeof(uint64_t));
     ++		filter->data = xcalloc(filter->len, sizeof(unsigned char));
      +
      +		hashmap_for_each_entry(&pathmap, &iter, e, entry) {
      +			struct bloom_key key;
     @@ -287,138 +131,48 @@
      +	DIFF_QUEUE_CLEAR(&diff_queued_diff);
      +
      +	return filter;
     -+}
     -+
     -+int bloom_filter_contains(struct bloom_filter *filter,
     -+			  struct bloom_key *key,
     -+			  struct bloom_filter_settings *settings)
     -+{
     -+	int i;
     -+	uint64_t mod = filter->len * BITS_PER_WORD;
     -+
     -+	if (!mod)
     -+		return -1;
     -+
     -+	for (i = 0; i < settings->num_hashes; i++) {
     -+		uint64_t hash_mod = key->hashes[i] % mod;
     -+		uint64_t block_pos = hash_mod / BITS_PER_WORD;
     -+		if (!(filter->data[block_pos] & get_bitmask(hash_mod)))
     -+			return 0;
     -+	}
     -+
     -+	return 1;
      +}
      
       diff --git a/bloom.h b/bloom.h
     - new file mode 100644
     - --- /dev/null
     + --- a/bloom.h
       +++ b/bloom.h
      @@
     -+#ifndef BLOOM_H
     -+#define BLOOM_H
     -+
     + #ifndef BLOOM_H
     + #define BLOOM_H
     + 
      +struct commit;
      +struct repository;
     -+struct commit_graph;
     -+
     -+struct bloom_filter_settings {
     -+	uint32_t hash_version;
     -+	uint32_t num_hashes;
     -+	uint32_t bits_per_entry;
     -+};
     -+
     -+#define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 }
     -+#define BITS_PER_WORD 64
      +
     -+/*
     -+ * A bloom_filter struct represents a data segment to
     -+ * use when testing hash values. The 'len' member
     -+ * dictates how many uint64_t entries are stored in
     -+ * 'data'.
     -+ */
     -+struct bloom_filter {
     -+	uint64_t *data;
     -+	int len;
     -+};
     -+
     -+/*
     -+ * A bloom_key represents the k hash values for a
     -+ * given hash input. These can be precomputed and
     -+ * stored in a bloom_key for re-use when testing
     -+ * against a bloom_filter.
     -+ */
     -+struct bloom_key {
     -+	uint32_t *hashes;
     -+};
     -+
     -+void load_bloom_filters(void);
     -+
     -+void fill_bloom_key(const char *data,
     -+		    int len,
     -+		    struct bloom_key *key,
     -+		    struct bloom_filter_settings *settings);
     -+
     -+void add_key_to_filter(struct bloom_key *key,
     -+					   struct bloom_filter *filter,
     -+					   struct bloom_filter_settings *settings);
     + struct bloom_filter_settings {
     + 	/*
     + 	 * The version of the hashing technique being used.
     +@@
     + 					   struct bloom_filter *filter,
     + 					   const struct bloom_filter_settings *settings);
     + 
     ++void init_bloom_filters(void);
      +
      +struct bloom_filter *get_bloom_filter(struct repository *r,
      +				      struct commit *c);
      +
     -+int bloom_filter_contains(struct bloom_filter *filter,
     -+			  struct bloom_key *key,
     -+			  struct bloom_filter_settings *settings);
     -+
     -+#endif
     + #e