git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Derrick Stolee <stolee@gmail.com>
To: Jeff King <peff@peff.net>
Cc: "SZEDER Gábor" <szeder.dev@gmail.com>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"Stefan Beller" <sbeller@google.com>, git <git@vger.kernel.org>,
	"Duy Nguyen" <pclouds@gmail.com>
Subject: Re: Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph)
Date: Tue, 9 Oct 2018 15:03:08 -0400	[thread overview]
Message-ID: <ec3ca377-0434-322e-4ab9-49e27f96f4af@gmail.com> (raw)
In-Reply-To: <20181009184647.GA7014@sigill.intra.peff.net>

On 10/9/2018 2:46 PM, Jeff King wrote:
> On Tue, Oct 09, 2018 at 09:48:20AM -0400, Derrick Stolee wrote:
>
>> [I snipped all of the parts about bloom filters that seemed entirely
>>   reasonable to me ;) ]
>>> Imagine we have that list. Is a bloom filter still the best data
>>> structure for each commit? At the point that we have the complete
>>> universe of paths, we could give each commit a bitmap of changed paths.
>>> That lets us ask "did this commit touch these paths" (collect the bits
>>> from the list of paths, then check for 1's), as well as "what paths did
>>> we touch" (collect the 1 bits, and then index the path list).  Those
>>> bitmaps should compress very well via EWAH or similar (most of them
>>> would be huge stretches of 0's punctuated by short runs of 1's).
>> I'm not convinced we would frequently have runs of 1's, and the bitmap would
>> not compress much better than simply listing the positions. For example, a
>> path "foo/bar" that resolves to a tree would only start a run if the next
>> changes are the initial section of entries in that tree (sorted
>> lexicographically) such as "foo/bar/a, foo/bar/b". If we deepen into a tree,
>> then we will break the run of 1's unless we changed every path deeper than
>> that tree.
> Yeah, I doubt we'd really have runs of 1's (by short, I just mean 1 or
> 2). I agree that listing the positions could work, though I sort of
> assumed that was more or less what a decent compressed bitmap would
> turn into. E.g., if bit N is set, we should be able to say "N-1
> zeroes, 1 one" in about the same size as we could say "position N".
>
> EWAH seems pretty awful in that regard. Or at least its serialized
> format is (or maybe it's our implementation that is bad).
>
> The patch below generates a bitmap for each commit in a repository (it
> doesn't output the total list of paths; I've left that as an exercise
> for the reader). On linux.git, the result is 57MB. But when I look at
> the individual bitmap sizes (via GIT_TRACE), they're quite silly.
> Storing a single set bit takes 28 bytes in serialized form!
>
> There are only around 120k unique paths (including prefix trees).
> Naively using run-length encoding and varints, our worst case should be
> something like 18-20 bits to say "120k zeroes, then a 1, then all
> zeroes".  And the average case should be better (you don't even need to
> say "120k", but some smaller number).
>
> I wonder if Roaring does better here.

In these sparse cases, usually Roaring will organize the data as "array 
chunks" which are simply lists of the values. The thing that makes this 
still compressible is that we store two bytes per entry, as the entries 
are grouped by a common most-significant two bytes. SInce you say ~120k 
unique paths, the Roaring bitmap would have two or three chunks per 
bitmap (and those chunks could be empty). The overhead to store the 
chunk positions, types, and lengths does come at a cost, but it's more 
like 32 bytes _per commit_.

> Gzipping the resulting bitmaps drops the total size to about 7.5MB.
> That's not a particularly important number, but I think it shows that
> the built-in ewah compression is far from ideal.
>
> Just listing the positions with a series of varints would generally be
> fine, since we expect sparse bitmaps. I just hoped that a good
> RLE scheme would degrade to roughly that for the sparse case, but also
> perform well for more dense cases.
>
> So at any rate, I do think it would not be out of the question to store
> bitmaps like this. I'm much more worried about the maintenance cost of
> adding new entries incrementally. I think it's only feasible if we give
> up sorting, and then I wonder what other problems that might cause.
The patch below gives me a starting point to try the Bloom filter 
approach and see what the numbers are like. You did all the "git" stuff 
like computing the changed paths, so thanks!
>
> -- >8 --
> diff --git a/Makefile b/Makefile
> index 13e1c52478..f6e823f2d6 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -751,6 +751,7 @@ TEST_PROGRAMS_NEED_X += test-parse-options
>   TEST_PROGRAMS_NEED_X += test-pkt-line
>   TEST_PROGRAMS_NEED_X += test-svn-fe
>   TEST_PROGRAMS_NEED_X += test-tool
> +TEST_PROGRAMS_NEED_X += test-tree-bitmap
>   
>   TEST_PROGRAMS = $(patsubst %,t/helper/%$X,$(TEST_PROGRAMS_NEED_X))
>   
> diff --git a/t/helper/test-tree-bitmap.c b/t/helper/test-tree-bitmap.c
> new file mode 100644
> index 0000000000..bc5cf0e514
> --- /dev/null
> +++ b/t/helper/test-tree-bitmap.c
> @@ -0,0 +1,167 @@
> +#include "cache.h"
> +#include "revision.h"
> +#include "diffcore.h"
> +#include "argv-array.h"
> +#include "ewah/ewok.h"
> +
> +/* map of pathnames to bit positions */
> +struct pathmap_entry {
> +	struct hashmap_entry ent;
> +	unsigned pos;
> +	char path[FLEX_ARRAY];
> +};
> +
> +static int pathmap_entry_hashcmp(const void *unused_cmp_data,
> +				 const void *entry,
> +				 const void *entry_or_key,
> +				 const void *keydata)
> +{
> +	const struct pathmap_entry *a = entry;
> +	const struct pathmap_entry *b = entry_or_key;
> +	const char *key = keydata;
> +
> +	return strcmp(a->path, key ? key : b->path);
> +}
> +
> +static int pathmap_entry_strcmp(const void *va, const void *vb)
> +{
> +	struct pathmap_entry *a = *(struct pathmap_entry **)va;
> +	struct pathmap_entry *b = *(struct pathmap_entry **)vb;
> +	return strcmp(a->path, b->path);
> +}
> +
> +struct walk_paths_data {
> +	struct hashmap *paths;
> +	struct commit *commit;
> +};
> +
> +static void walk_paths(diff_format_fn_t fn, struct hashmap *paths)
> +{
> +	struct argv_array argv = ARGV_ARRAY_INIT;
> +	struct rev_info revs;
> +	struct walk_paths_data data;
> +	struct commit *commit;
> +
> +	argv_array_pushl(&argv, "rev-list",
> +			 "--all", "-t", "--no-renames",
> +			 NULL);
> +	init_revisions(&revs, NULL);
> +	setup_revisions(argv.argc, argv.argv, &revs, NULL);
> +	revs.diffopt.output_format = DIFF_FORMAT_CALLBACK;
> +	revs.diffopt.format_callback = fn;
> +	revs.diffopt.format_callback_data = &data;
> +
> +	data.paths = paths;
> +
> +	prepare_revision_walk(&revs);
> +	while ((commit = get_revision(&revs))) {
> +		data.commit = commit;
> +		diff_tree_combined_merge(commit, 0, &revs);
> +	}
> +
> +	reset_revision_walk();
> +	argv_array_clear(&argv);
> +}
> +
> +static void collect_commit_paths(struct diff_queue_struct *q,
> +				 struct diff_options *opts,
> +				 void *vdata)
> +{
> +	struct walk_paths_data *data = vdata;
> +	int i;
> +
> +	for (i = 0; i < q->nr; i++) {
> +		struct diff_filepair *p = q->queue[i];
> +		const char *path = p->one->path;
> +		struct pathmap_entry *entry;
> +		struct hashmap_entry lookup;
> +
> +		hashmap_entry_init(&lookup, strhash(path));
> +		entry = hashmap_get(data->paths, &lookup, path);
> +		if (entry)
> +			continue; /* already present */
> +
> +		FLEX_ALLOC_STR(entry, path, path);
> +		entry->ent = lookup;
> +		hashmap_put(data->paths, entry);
> +	}
> +}
> +
> +/* assign a bit position to all possible paths */
> +static void collect_paths(struct hashmap *paths)
> +{
> +	struct pathmap_entry **sorted;
> +	size_t i, n;
> +	struct hashmap_iter iter;
> +	struct pathmap_entry *entry;
> +
> +	/* grab all unique paths */
> +	hashmap_init(paths, pathmap_entry_hashcmp, NULL, 0);
> +	walk_paths(collect_commit_paths, paths);
> +
> +	/* and assign them bits in sorted order */
> +	n = hashmap_get_size(paths);
> +	ALLOC_ARRAY(sorted, n);
> +	i = 0;
> +	for (entry = hashmap_iter_first(paths, &iter);
> +	     entry;
> +	     entry = hashmap_iter_next(&iter)) {
> +		assert(i < n);
> +		sorted[i++] = entry;
> +	}
> +	QSORT(sorted, i, pathmap_entry_strcmp);
> +	for (i = 0; i < n; i++)
> +		sorted[i]->pos = i;
> +	free(sorted);
> +}
> +
> +/* generate the bitmap for a single commit */
> +static void generate_bitmap(struct diff_queue_struct *q,
> +			    struct diff_options *opts,
> +			    void *vdata)
> +{
> +	struct walk_paths_data *data = vdata;
> +	struct bitmap *bitmap = bitmap_new();
> +	struct ewah_bitmap *ewah;
> +	struct strbuf out = STRBUF_INIT;
> +	size_t i;
> +
> +	for (i = 0; i < q->nr; i++) {
> +		struct diff_filepair *p = q->queue[i];
> +		const char *path = p->one->path;
> +		struct pathmap_entry *entry;
> +		struct hashmap_entry lookup;
> +
> +		hashmap_entry_init(&lookup, strhash(path));
> +		entry = hashmap_get(data->paths, &lookup, path);
> +		if (!entry)
> +			BUG("mysterious path appeared: %s", path);
> +
> +		bitmap_set(bitmap, entry->pos);
> +	}
> +
> +	ewah = bitmap_to_ewah(bitmap);
> +	ewah_serialize_strbuf(ewah, &out);
> +	fwrite(out.buf, 1, out.len, stdout);
> +
> +	trace_printf("bitmap %s %u %u",
> +		     oid_to_hex(&data->commit->object.oid),
> +		     (unsigned)q->nr,
> +		     (unsigned)out.len);
> +
> +	strbuf_release(&out);
> +	ewah_free(ewah);
> +	bitmap_free(bitmap);
> +}
> +
> +int cmd_main(int argc, const char **argv)
> +{
> +	struct hashmap paths;
> +
> +	setup_git_directory();
> +	collect_paths(&paths);
> +
> +	walk_paths(generate_bitmap, &paths);
> +
> +	return 0;
> +}

  reply	other threads:[~2018-10-09 19:03 UTC|newest]

Thread overview: 78+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-10-03 13:23 We should add a "git gc --auto" after "git clone" due to commit graph Ævar Arnfjörð Bjarmason
2018-10-03 13:36 ` SZEDER Gábor
2018-10-03 13:42   ` Derrick Stolee
2018-10-03 14:18     ` Ævar Arnfjörð Bjarmason
2018-10-03 14:01   ` Ævar Arnfjörð Bjarmason
2018-10-03 14:17     ` SZEDER Gábor
2018-10-03 14:22       ` Ævar Arnfjörð Bjarmason
2018-10-03 14:53         ` SZEDER Gábor
2018-10-03 15:19           ` Ævar Arnfjörð Bjarmason
2018-10-03 16:59             ` SZEDER Gábor
2018-10-05  6:09               ` Junio C Hamano
2018-10-10 22:07                 ` SZEDER Gábor
2018-10-10 23:01                   ` Ævar Arnfjörð Bjarmason
2018-10-03 19:08           ` Stefan Beller
2018-10-03 19:21             ` Jeff King
2018-10-03 20:35               ` Ævar Arnfjörð Bjarmason
2018-10-03 17:47         ` Stefan Beller
2018-10-03 18:47           ` Ævar Arnfjörð Bjarmason
2018-10-03 18:51             ` Jeff King
2018-10-03 18:59               ` Derrick Stolee
2018-10-03 19:18                 ` Jeff King
2018-10-08 16:41                   ` SZEDER Gábor
2018-10-08 16:57                     ` Derrick Stolee
2018-10-08 18:10                       ` SZEDER Gábor
2018-10-08 18:29                         ` Derrick Stolee
2018-10-09  3:08                           ` Jeff King
2018-10-09 13:48                             ` Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph) Derrick Stolee
2018-10-09 18:45                               ` Ævar Arnfjörð Bjarmason
2018-10-09 18:46                               ` Jeff King
2018-10-09 19:03                                 ` Derrick Stolee [this message]
2018-10-09 21:14                                   ` Jeff King
2018-10-09 23:12                                     ` Bloom Filters Jeff King
2018-10-09 23:13                                       ` [PoC -- do not apply 1/3] initial tree-bitmap proof of concept Jeff King
2018-10-09 23:14                                       ` [PoC -- do not apply 2/3] test-tree-bitmap: add "dump" mode Jeff King
2018-10-10  0:48                                         ` Junio C Hamano
2018-10-11  3:13                                           ` Jeff King
2018-10-09 23:14                                       ` [PoC -- do not apply 3/3] test-tree-bitmap: replace ewah with custom rle encoding Jeff King
2018-10-10  0:58                                         ` Junio C Hamano
2018-10-11  3:20                                           ` Jeff King
2018-10-11 12:33                                       ` Bloom Filters Derrick Stolee
2018-10-11 13:43                                         ` Jeff King
2018-10-09 21:30                             ` We should add a "git gc --auto" after "git clone" due to commit graph SZEDER Gábor
2018-10-09 19:34                       ` [PATCH 0/4] Bloom filter experiment SZEDER Gábor
2018-10-09 19:34                         ` [PATCH 1/4] Add a (very) barebones Bloom filter implementation SZEDER Gábor
2018-10-09 19:34                         ` [PATCH 2/4] commit-graph: write a Bloom filter containing changed paths for each commit SZEDER Gábor
2018-10-09 21:06                           ` Jeff King
2018-10-09 21:37                             ` SZEDER Gábor
2018-10-09 19:34                         ` [PATCH 3/4] revision.c: use the Bloom filter to speed up path-limited revision walks SZEDER Gábor
2018-10-09 19:34                         ` [PATCH 4/4] revision.c: add GIT_TRACE_BLOOM_FILTER for a bit of statistics SZEDER Gábor
2018-10-09 19:47                         ` [PATCH 0/4] Bloom filter experiment Derrick Stolee
2018-10-11  1:21                         ` [PATCH 0/2] Per-commit filter proof of concept Jonathan Tan
2018-10-11  1:21                           ` [PATCH 1/2] One filter per commit Jonathan Tan
2018-10-11 12:49                             ` Derrick Stolee
2018-10-11 19:11                               ` [PATCH] Per-commit and per-parent filters for 2 parents Jonathan Tan
2018-10-11  1:21                           ` [PATCH 2/2] Only make bloom filter for first parent Jonathan Tan
2018-10-11  7:37                           ` [PATCH 0/2] Per-commit filter proof of concept Ævar Arnfjörð Bjarmason
2018-10-15 14:39                         ` [PATCH 0/4] Bloom filter experiment Derrick Stolee
2018-10-16  4:45                           ` Junio C Hamano
2018-10-16 11:13                             ` Derrick Stolee
2018-10-16 12:57                               ` Ævar Arnfjörð Bjarmason
2018-10-16 13:03                                 ` Derrick Stolee
2018-10-18  2:00                                 ` Junio C Hamano
2018-10-16 23:41                           ` Jonathan Tan
2018-10-08 23:02                     ` We should add a "git gc --auto" after "git clone" due to commit graph Junio C Hamano
2018-10-03 14:32     ` Duy Nguyen
2018-10-03 16:45 ` Duy Nguyen
2018-10-04 21:42 ` [RFC PATCH] " Ævar Arnfjörð Bjarmason
2018-10-05 12:05   ` Derrick Stolee
2018-10-05 13:05     ` Ævar Arnfjörð Bjarmason
2018-10-05 13:45       ` Derrick Stolee
2018-10-05 14:04         ` Ævar Arnfjörð Bjarmason
2018-10-05 19:21         ` Jeff King
2018-10-05 19:41           ` Derrick Stolee
2018-10-05 19:47             ` Jeff King
2018-10-05 20:00               ` Derrick Stolee
2018-10-05 20:02                 ` Jeff King
2018-10-05 20:01               ` Ævar Arnfjörð Bjarmason
2018-10-05 20:09                 ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ec3ca377-0434-322e-4ab9-49e27f96f4af@gmail.com \
    --to=stolee@gmail.com \
    --cc=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=pclouds@gmail.com \
    --cc=peff@peff.net \
    --cc=sbeller@google.com \
    --cc=szeder.dev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).