git@vger.kernel.org mailing list mirror (one of many)
From: "SZEDER Gábor" <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: "Jeff King" <peff@peff.net>, "Junio C Hamano" <gitster@pobox.com>,
	"Derrick Stolee" <stolee@gmail.com>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"Stefan Beller" <sbeller@google.com>,
	"Duy Nguyen" <pclouds@gmail.com>,
	"SZEDER Gábor" <szeder.dev@gmail.com>
Subject: [PATCH 2/4] commit-graph: write a Bloom filter containing changed paths for each commit
Date: Tue,  9 Oct 2018 21:34:43 +0200
Message-ID: <20181009193445.21908-3-szeder.dev@gmail.com>
In-Reply-To: <20181009193445.21908-1-szeder.dev@gmail.com>

You can create a Bloom filter containing changed paths for each commit
in the history by running:

  $ GIT_USE_POC_BLOOM_FILTER=$((8*1024*1024*8)) git commit-graph write

where the value of $GIT_USE_POC_BLOOM_FILTER specifies the number of
bits in the Bloom filter's bitmap (64 Mi bits, i.e. 8 MiB, in the
example above).

Writing the Bloom filter is tied into the 'git commit-graph' command,
mainly because that's where it might end up anyway, if it turns out to
be useful, but for now it's written to a different file
('object/info/bloom').  There are no incremental updates yet; the
Bloom filter is regenerated from scratch each time.

There is one single, big Bloom filter for the whole history (mainly
because that was the simplest way to get this PoC experiment up and
running).  The Bloom filter stores tuples of (path, parent-oid,
commit-oid) using the hash function:

  XOR(SHA1(path), XOR(parent-oid, commit-oid))

The resulting 20 bytes are turned into 5 unsigned 32 bit ints, which
then specify the positions of the bits to set or check in the Bloom
filter's bitmap (modulo the bitmap's size).
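
As a rough illustration of that last step, here is a minimal sketch of
turning a 20-byte hash into five bit positions.  It is not the code
from patch 1/4: the 'sketch_' names and the struct are made up for this
example, and reading the words as big-endian is an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define HASH_RAWSZ 20	/* SHA-1 output size in bytes */
#define HASH_WORDS 5	/* 20 bytes = 5 x 32-bit words */

/* Hypothetical stand-in for the filter type; not the one in patch 1/4. */
struct sketch_bloom {
	unsigned char *bits;
	uint32_t nbits;	/* bitmap size in bits, a multiple of 8 */
};

static int sketch_bloom_init(struct sketch_bloom *bf, uint32_t nbits)
{
	assert(nbits > 0 && nbits % 8 == 0);
	bf->bits = calloc(nbits / 8, 1);
	bf->nbits = nbits;
	return bf->bits ? 0 : -1;
}

/* Read the i-th 32-bit word of the hash; big-endian is an assumption. */
static uint32_t sketch_word(const unsigned char *hash, int i)
{
	const unsigned char *p = hash + 4 * i;

	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Set the five bits selected by the hash, modulo the bitmap's size. */
static void sketch_bloom_add(struct sketch_bloom *bf,
			     const unsigned char *hash)
{
	int i;

	for (i = 0; i < HASH_WORDS; i++) {
		uint32_t pos = sketch_word(hash, i) % bf->nbits;
		bf->bits[pos / 8] |= 1u << (pos % 8);
	}
}

/* Return 1 if all five bits are set ("maybe present"), 0 otherwise. */
static int sketch_bloom_check(const struct sketch_bloom *bf,
			      const unsigned char *hash)
{
	int i;

	for (i = 0; i < HASH_WORDS; i++) {
		uint32_t pos = sketch_word(hash, i) % bf->nbits;
		if (!(bf->bits[pos / 8] & (1u << (pos % 8))))
			return 0;
	}
	return 1;
}
```

A check that returns 0 means the tuple was definitely never added; a
check that returns 1 only means it may have been added, so the caller
still has to run the real diff in that case.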

The parent oid is taken into account, because during revision walking
the diff is checked in rev_compare_tree(), which compares one commit
to _one_ of its parents, and in case of merge commits there are
multiple rev_compare_tree() calls with the same commit but with
different parent parameters.

Combining hashes with XOR is, in general, frowned upon, because of its
intrinsic properties:

  XOR(A, A) = 0
  XOR(A, B) = XOR(B, A)

In this case it should be fine, because all of XOR's operands are
cryptographic hashes, so we can safely assume that they'll never be
the same.
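
To make the two properties concrete, here is a small sketch with a
byte-wise XOR helper.  'sketch_hashxor()' and the other 'sketch_' names
are made up for this example; they only stand in for the hashxor() call
used in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define HASH_RAWSZ 20	/* SHA-1 output size in bytes */

/* Byte-wise XOR of two 20-byte hashes into 'out'. */
static void sketch_hashxor(const unsigned char *a, const unsigned char *b,
			   unsigned char *out)
{
	size_t i;

	assert(a && b && out);
	for (i = 0; i < HASH_RAWSZ; i++)
		out[i] = a[i] ^ b[i];
}

/* Return 1 if every byte of the hash is zero. */
static int sketch_is_zero(const unsigned char *h)
{
	size_t i;

	for (i = 0; i < HASH_RAWSZ; i++)
		if (h[i])
			return 0;
	return 1;
}

/* Return 1 if the two hashes are byte-for-byte identical. */
static int sketch_hash_eq(const unsigned char *a, const unsigned char *b)
{
	return !memcmp(a, b, HASH_RAWSZ);
}
```

With these helpers, XOR-ing a hash with itself yields all zeros, and
swapping the operands yields the same result.  For the filter this
means that (path, A, B) and (path, B, A) map to the same key, and that
equal operands would cancel each other out, which is why the argument
above depends on the operands being distinct.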

Add each leading directory of the changed file, i.e. for
'dir/subdir/file' add 'dir' and 'dir/subdir' as well, so the Bloom
filter could be used to speed up commands like 'git log dir/subdir',
too.
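
The prefix enumeration can be sketched in portable C as follows.  The
patch itself walks the path with strchrnul(), which is not part of
standard C (git carries a compat fallback), so this sketch uses
strchr() instead; 'sketch_next_prefix()' is a made-up helper, not a
function from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return the length of the shortest prefix of 'path' that is longer
 * than 'prev_len' and ends either just before a '/' or at the end of
 * the string.  Return 0 when there is no further prefix.
 *
 * Start with prev_len = 0 and feed each result back in to visit
 * 'dir', 'dir/subdir' and finally 'dir/subdir/file'.
 */
static size_t sketch_next_prefix(const char *path, size_t prev_len)
{
	const char *slash;

	assert(path != NULL);
	if (!path[prev_len])
		return 0;	/* already at the end of the path */

	slash = strchr(path + prev_len + 1, '/');
	return slash ? (size_t)(slash - path) : strlen(path);
}
```

Each returned length identifies one string to hash and add, i.e. the
first 'len' bytes of the path, so directories are added without the
trailing '/'.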

Creating the Bloom filter is sloooow.  Running it on git.git takes
about 23s on my hardware, while

  git log --format='%H%n%P' --name-only --all >/dev/null

gathers all the information necessary for that in about 5.3s.

About 30% of the runtime is wasted by naively hashing and rehashing
the same paths over and over again.  A hash function faster than SHA1
could help with that; I just haven't yet bothered with spicing up
memhash() and friends to produce 5 ints, nor did I want to introduce
another hash function with wider output just yet.  Or perhaps our
hashmap could map paths of files to their SHAs and to the SHAs of
their leading directories...

That's not the only factor though.  After ripping out all the loops
from add_changes_to_bloom_filter() there are no repeated SHA1(path)
calculations and no writes to the Bloom filter at all, i.e. all that
remains is revision walking and diffing, yet it still takes about 16s,
i.e. around 3 times longer than the above-mentioned 'git log' command.
I guess some other fields in 'struct rev_info' or 'struct
diff_options' need to be set, but both of those are huge, and I
haven't yet spotted which ones.
---
 commit-graph.c | 116 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 116 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index a1454c52a6..f415d3b41f 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -14,6 +14,9 @@
 #include "object-store.h"
 #include "alloc.h"
 #include "progress.h"
+#include "bloom-filter.h"
+#include "diff.h"
+#include "diffcore.h"
 
 #define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
 #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
@@ -709,6 +712,117 @@ static int add_ref_to_list(const char *refname,
 	return 0;
 }
 
+static void add_changes_to_bloom_filter(struct bloom_filter *bf,
+					struct commit *parent,
+					struct commit *commit,
+					struct diff_options *diffopt)
+{
+	unsigned char p_c_hash[GIT_MAX_RAWSZ];
+	int i;
+
+	hashxor(parent->object.oid.hash, commit->object.oid.hash, p_c_hash);
+
+	diff_tree_oid(&parent->object.oid, &commit->object.oid, "", diffopt);
+	diffcore_std(diffopt);
+
+	for (i = 0; i < diff_queued_diff.nr; i++) {
+		const char *path = diff_queued_diff.queue[i]->two->path;
+		const char *p = path;
+
+		/*
+		 * Add each leading directory of the changed file, i.e. for
+		 * 'dir/subdir/file' add 'dir' and 'dir/subdir' as well, so
+		 * the Bloom filter could be used to speed up commands like
+		 * 'git log dir/subdir', too.
+		 *
+		 * Note that directories are added without the trailing '/'.
+		 */
+		do {
+			git_hash_ctx ctx;
+			unsigned char name_hash[GIT_MAX_RAWSZ];
+			unsigned char hash[GIT_MAX_RAWSZ];
+
+			p = strchrnul(p + 1, '/');
+
+			/*
+			 * Beware all the wasted CPU cycles!
+			 *
+			 * Most paths change (a lot) more than once in the
+			 * history of a repository, so this hashes the same
+			 * paths over and over again, accounting for almost
+			 * 40% of the runtime.
+			 */
+			the_hash_algo->init_fn(&ctx);
+			the_hash_algo->update_fn(&ctx, path, p - path);
+			the_hash_algo->final_fn(name_hash, &ctx);
+
+			hashxor(name_hash, p_c_hash, hash);
+			bloom_filter_add_hash(bf, hash);
+		} while (*p);
+
+		diff_free_filepair(diff_queued_diff.queue[i]);
+	}
+
+	free(diff_queued_diff.queue);
+	DIFF_QUEUE_CLEAR(&diff_queued_diff);
+}
+
+static void fill_bloom_filter(struct bloom_filter *bf,
+				    struct progress *progress)
+{
+	struct rev_info revs;
+	const char *revs_argv[] = {NULL, "--all", NULL};
+	struct commit *commit;
+	int i = 0;
+
+	/* We (re-)create the bloom filter from scratch every time for now. */
+	init_revisions(&revs, NULL);
+	revs.diffopt.flags.recursive = 1;
+	setup_revisions(2, revs_argv, &revs, NULL);
+
+	if (prepare_revision_walk(&revs))
+		die("revision walk setup failed while preparing bloom filter");
+
+	while ((commit = get_revision(&revs))) {
+		struct commit_list *parent;
+
+		for (parent = commit->parents; parent; parent = parent->next)
+			add_changes_to_bloom_filter(bf, parent->item, commit,
+						    &revs.diffopt);
+
+		display_progress(progress, ++i);
+	}
+}
+
+static void write_bloom_filter(int report_progress, int commit_nr)
+{
+	struct bloom_filter bf;
+	struct progress *progress = NULL;
+	const char *v = getenv("GIT_USE_POC_BLOOM_FILTER");
+	unsigned int bitsize;
+	char *end;
+
+	if (!v)
+		return;
+
+	bitsize = strtoul(v, &end, 10);
+	if (end == v || *end)
+		die("GIT_USE_POC_BLOOM_FILTER must specify the number of bits in the bloom filter (multiple of 8, n < 2^32)");
+
+	bloom_filter_init(&bf, bitsize);
+
+	if (report_progress)
+		progress = start_progress(_("Computing bloom filter"),
+					  commit_nr);
+
+	fill_bloom_filter(&bf, progress);
+
+	bloom_filter_write(&bf);
+	bloom_filter_free(&bf);
+
+	stop_progress(&progress);
+}
+
 void write_commit_graph_reachable(const char *obj_dir, int append,
 				  int report_progress)
 {
@@ -916,6 +1030,8 @@ void write_commit_graph(const char *obj_dir,
 	finalize_hashfile(f, NULL, CSUM_HASH_IN_STREAM | CSUM_FSYNC);
 	commit_lock_file(&lk);
 
+	write_bloom_filter(report_progress, commits.nr);
+
 	free(graph_name);
 	free(commits.list);
 	free(oids.list);
-- 
2.19.1.409.g0a0ee5eb6b

