git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jakub Narebski <jnareb@gmail.com>
To: "Garima Singh via GitGitGadget" <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org, "Derrick Stolee" <stolee@gmail.com>,
	"SZEDER Gábor" <szeder.dev@gmail.com>,
	"Jonathan Tan" <jonathantanmy@google.com>,
	"Jeff Hostetler" <jeffhost@microsoft.com>,
	"Taylor Blau" <me@ttaylorr.com>, "Jeff King" <peff@peff.net>,
	"Junio C Hamano" <gitster@pobox.com>,
	"Garima Singh" <garima.singh@microsoft.com>
Subject: Re: [PATCH 2/9] commit-graph: write changed paths bloom filters
Date: Mon, 06 Jan 2020 19:44:14 +0100	[thread overview]
Message-ID: <86eewczapt.fsf@gmail.com> (raw)
In-Reply-To: <e52c7ad37a306891487bd79a09b040bfb657d723.1576879520.git.gitgitgadget@gmail.com> (Garima Singh via GitGitGadget's message of "Fri, 20 Dec 2019 22:05:13 +0000")

"Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Garima Singh <garima.singh@microsoft.com>
>
> The changed path bloom filters help determine which paths changed between a
> commit and its first parent. We already have the "--changed-paths" option
> for the "git commit-graph write" subcommand, now actually compute them under
> that option. The COMMIT_GRAPH_WRITE_BLOOM_FILTERS flag enables this
> computation.
>
> RFC Notes: Here are some details about the implementation and I would love
> to know your thoughts and suggestions for improvements here.
>
> For details on what bloom filters are and how they work, please refer to
> Dr. Derrick Stolee's blog post [1].
> [1] https://devblogs.microsoft.com/devops/super-charging-the-git-commit-graph-iv-bloom-filters/
>
> 1. The implementation sticks to the recommended values of 7 and 10 for the
>    number of hashes and the size of each entry, as described in the blog.

Please provide references to original work for this.  Derrick Stolee
blog post references the following work:

  Flavio Bonomi, Michael Mitzenmacher, Rina Panigrahy, Sushil Singh, George Varghese
  "An Improved Construction for Counting Bloom Filters"
  http://theory.stanford.edu/~rinap/papers/esa2006b.pdf
  https://doi.org/10.1007/11841036_61

However, we do not use Counting Bloom Filters, but ordinary Bloom
Filters, if used in untypical way: instead of testing many elements
(keys) against single filter, we test single element (path) against
mainy filters.

Also, I'm not sure that values 10 bits per entry and 7 hash functions
are recommended; the work states:

  "For example, when n/m = 10 and k = 7 the false positive probability
  is just over 0.008."

Given false positive probablity we can calculate best choice for n/m and
k.

On the other hand in https://arxiv.org/abs/1912.08258 we have

  "For efficient memory usage, a Bloom filter with a false-positive
   probability ϵ should use about − log_2(ϵ) hash functions
   [Broder2004]. At a false-positive probability of 1%, seven hash
   functions are thus required."

So k=7 being optimal is somewhat confirmed.

>    The implementation while not completely open to it at the moment, is flexible
>    enough to allow for tweaking these settings in the future.

All right.

>    Note: The performance gains we have observed so far with these values is
>    significant enough to not that we did not need to tweak these settings.
                           ^^^
s/not/note/

Did you try to tweak settings, i.e. numbers of bits per entry, number
of hash functions (which is derivative of the former - at least the
optimal number), the size of the block, the cutoff threshold value?
It is not needed to be in this patch series - fine tuning is probably
better left for later.

>    The cover letter of this series has the details and the commit where we have
>    git log use bloom filters.

The second part of this sentence, from "and the commit..." is a bit
unclear.  Did you mean here that the future / subsequent commit in this
patch series that makes Git actually use Bloom filters in `git log --
<path>` will have more details in its commit message?

> 2. As described in the blog and the linked technical paper therin, we do not need
                                                             ^^^^^^
s/therin/therein/

>    7 independent hashing functions. We use the Murmur3 hashing scheme - seed it
>    twice and then combine those to procure an arbitrary number of hash values.

The "linked technical paper" in the blog post (which I would prefer to
have linked directly to in the commit message) is

  Peter C. Dillinger and Panagiotis Manolios
  "Bloom Filters in Probabilistic Verification"
  http://www.ccs.neu.edu/home/pete/pub/bloom-filters-verification.pdf
  https://doi.org/10.1007/978-3-540-30494-4_26

Sidenote: it looks like it is a reference from Wikipedia on Bloom filters.
This is according to authors the original paper with the _double hashing_
technique.

They also examine in much detail the optimal number of hash functions.

> 3. The filters are sized according to the number of changes in the each commit,
>    with minimum size of one 64 bit word.

Do I understand it correctly that the size of filter is 10*(number of
changed files) bits, rounded up to nearest multiple of 64?

How do you count renames and copies?  As two changes?

Do I understand it correctly that commit with no changes in it (which
can rarely happen) would have 64-bits i.e. 8-bytes Bloom filter of all
zeros: 0x0000000000000000?

How merges are handled?  Does the filter uses all changed files, or just
changes compared to first parent?

>
> [Call for advice] We currently cap writing bloom filters for commits with
> atmost 512 changed files. In the current implementation, we compute the diff,
> and then just throw it away once we see it has more than 512 changes.
> Any suggestiongs on how to reduce the work we are doing in this case are more
> than welcome.

This got solved in "[PATCH] diff: halt tree-diff early after max_changes"
https://public-inbox.org/git/e9a4e4ff-5466-dc39-c3f5-c9a8b8f2f11d@gmail.com/

> [Call for advice] Would the git community like this commit to be split up into
> more granular commits? This commit could possibly be split out further with the
> bloom.c code in its own commit, to be used by the commit-graph in a subsequent
> commit. While I prefer it being contained in one commit this way, I am open to
> suggestions.

I think it might be a good idea to split this commit into purely Bloom
filter implementation (bloom.c) AND unit tests for Bloom filter itself
(which would probably involve some new test-tool).

I have not read further messages in the series [edit: they don't], so I
don't know if such tests already exist or not.  One could test for
negative match, maybe also (for specific choice of hash function) for
positive and maybe even false positive match, for filter size depending
on the number of changes, for changes cap (maybe), maybe also for
no-changes scenario.


As for splitting the main part of the series, I would envision it in the
following way (which is of course only one possibility):

1. Implementation of generic-ish Bloom filter (with elements being
   strings / paths, and optimized to test single key against many
   filters, each taking small-ish space, variable size filter, limit on
   maximum number of elements).

   Technical documentation in comments in bloom.h (description of API)
   and bloom.c (details of the algorithm, with references).

   TODO: test-tool and unit tests.

2. Using per-commit Bloom filter(s) to store changeset information
   i.e. changed paths.  This would implement in-memory storage (on slab)
   and creating Bloom filter out of commit and repository information.

   Perhaps this should also get its own unit tests (that Bloom filter
   catches changed files, and excluding false positivess catches
   unchanged files).

3. Storing per-commit Bloom filters in the commit-graph file:

   a.) writing Bloom filters data to commit-graph file, which means
       designing the chunk(s) format,
   b.) verifying Bloom filter chunks, at least sanity-checks
   c.) reading Bloom filters from commit-graph file into memory

   Perhaps also some integration tests that the information is stored
   and retrieved correctly, and that verifying finds bugs in
   intentionally corrupted Bloom filter chunks.

4. Using Bloom filters to speed up `git log -- <path>` (and similar
   commands).

   It would be nice to have some functional tests, and maybe some
   performance tests, if possible.


> [Call for advice] Would a technical document explaining the exact details of
> the bloom filter implemenation and the hashing calculations be helpful? I will
> be adding details into Documentation/technical/commit-graph-format.txt, but the
> bloom filter code is an independent subsystem and could be used outside of the
> commit-graph feature. Is it worth a separate document, or should we apply "You
> Ain't Gonna Need It" principles?

As nowadays technical reference documentation is being moved from
Documentation/technical/api-*.txt to appropriate header files, maybe the
documentation of Bloom filter API (and some technical documentation and
references) be put in bloom.h?  See for example comments in strbuf.h.

> [Call for advice] I plan to add unit tests for bloom.c, specifically to ensure
> that the hash algorithm and bloom key calculations are stable across versions.

Ah, so the unit tests for bloom.c does not exist, yet...

> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> Helped-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Makefile       |   1 +
>  bloom.c        | 201 +++++++++++++++++++++++++++++++++++++++++++++++++
>  bloom.h        |  46 +++++++++++
>  commit-graph.c |  32 +++++++-
>  4 files changed, 279 insertions(+), 1 deletion(-)
>  create mode 100644 bloom.c
>  create mode 100644 bloom.h
>
> diff --git a/Makefile b/Makefile
> index 42a061d3fb..9d5e26f5d6 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -838,6 +838,7 @@ LIB_OBJS += base85.o
>  LIB_OBJS += bisect.o
>  LIB_OBJS += blame.o
>  LIB_OBJS += blob.o
> +LIB_OBJS += bloom.o
>  LIB_OBJS += branch.o
>  LIB_OBJS += bulk-checkin.o
>  LIB_OBJS += bundle.o

I'll put bloom.h first, to make it easier to review.

> diff --git a/bloom.h b/bloom.h
> new file mode 100644
> index 0000000000..ba8ae70b67
> --- /dev/null
> +++ b/bloom.h
> @@ -0,0 +1,46 @@
> +#ifndef BLOOM_H
> +#define BLOOM_H
> +
> +struct commit;
> +struct repository;

This would probably be missing if this patch was split in two:
introducing Bloom filter and saving Bloom filter in the repository
metadata (in commit-graphh file).

> +

O.K., the names of fields are descriptive enough so that this struct
doesn't need detailed description in comment (like the next one).

> +struct bloom_filter_settings {
> +	uint32_t hash_version;

Do we need full half-word for hash version?

> +	uint32_t num_hashes;

Do we need full 32-bits for number of hashes?  The "Bloom Filters in
Probabilistic Verification" paper mentioned in Stolee blog states that
no one should need number of hashes greater than k=32 - the accuracy is
so high that it doesn't matter that it is not optimal.

  "Notice one last thing about Bloom filters in verification, if $m$ is
   several gigabytes or less and $m/n$ calls for more than about 32 index
   functions, the accuracy is going to be so high that there is not much
   reason to use more than 32—for the next several years at least. In
   response to this, 3SPIN currently limits the user to $k = 32$. The
   point of this observation is that we do not have to worry about the
   runtime cost of $k$ being on the order of 64 or 100, because those
   choices do not really buy us anything over 32."

Here 'm' is the number of bits in Bloom filter, and m/n is number of
bits per element added to filter.

> +	uint32_t bits_per_entry;

All right, we wouldn't really want large Bloom filters, as we use one
filter per commit to match againts one key, not single Bloom filter to
match againts many keys.

> +};
> +
> +#define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 }
> +
> +/*
> + * A bloom_filter struct represents a data segment to
> + * use when testing hash values. The 'len' member
> + * dictates how many uint64_t entries are stored in
> + * 'data'.
> + */
> +struct bloom_filter {
> +	uint64_t *data;
> +	int len;
> +};

O.K., so it is single variable-sized (in 64-bit increments) Bloom filter
bit vector (bitmap).

> +
> +/*
> + * A bloom_key represents the k hash values for a
> + * given hash input. These can be precomputed and
> + * stored in a bloom_key for re-use when testing
> + * against a bloom_filter.
> + */
> +struct bloom_key {
> +	uint32_t *hashes;
> +};

That is smart.  I wonder however if it wouldn't be a good idea to
'typedef' a hash function return type.

I repeat myself: in Git case we have one key that we try to match
against many Bloom filters which are never updated, while in an ordinary
case many keys are matched against single Bloom filter - in many cases
updated (with keys inserted to Bloom filter).

I wonder if somebody from academia have examined such situation.
I couldn't find a good search query.


Sidenote: perhaps Xor or Xor+ filters from Graf & Lemire (2019)
https://arxiv.org/abs/1912.08258 would be better solution - they also
assume unchanging filter.  Though they are a very fresh proposal;
also construction time might be important for Git.
https://github.com/FastFilter/xor_singleheader

> +
> +void load_bloom_filters(void);
> +
> +struct bloom_filter *get_bloom_filter(struct repository *r,
> +				      struct commit *c);

Those two functions really need API documentation on how they are used,
if they are to be used in any other role, especially what is their
calling convention?  Why load_bloom_filters() doesn't take any
parameters?

Anyway, if this patch would be split into pure Bloom filter
implementation and Git use^W store of Bloom filters, then this would be
left for the latter patch.

> +
> +void fill_bloom_key(const char *data,
> +		    int len,
> +		    struct bloom_key *key,
> +		    struct bloom_filter_settings *settings);

It is a bit strange that two basic Bloom filter operations, namely
adding element to Bloom filter (and constructing Bloom filter), and
testing whether element is in Bloom filter are not part of a public
API...

This function should probably be documented, in particular the fact that
*key is an in/out parameter.  This could also be a good place to
document the mechanism itself (i.e. our implementation of Bloom filter,
with references), though it might be better to keep the details of how
it works in the bloom.c - close to the actual source (while keeping
description of API in bloom.h comments).

> +
> +#endif
> diff --git a/bloom.c b/bloom.c
> new file mode 100644
> index 0000000000..08328cc381
> --- /dev/null
> +++ b/bloom.c
> @@ -0,0 +1,201 @@
> +#include "git-compat-util.h"
> +#include "bloom.h"
> +#include "commit-graph.h"
> +#include "object-store.h"
> +#include "diff.h"
> +#include "diffcore.h"
> +#include "revision.h"
> +#include "hashmap.h"
> +
> +#define BITS_PER_BLOCK 64
> +
> +define_commit_slab(bloom_filter_slab, struct bloom_filter);
> +
> +struct bloom_filter_slab bloom_filters;

All right, so the Bloom filter data would be on slab.  This should
probably be mentioned in the commit message, like in
https://lore.kernel.org/git/61559c5b-546e-d61b-d2e1-68de692f5972@gmail.com/

Sidenote: If I remember correctly one of the unmet prerequisites for
switching to generation numbers v2 (corrected commit date with monotonic
offsets) was moving 'generation' field out of 'struct commit' and on to
slab (possibly also 'graph_pos'), and certainly having 'corrected_date'
on slab (Inside-Out Object style).  Which probably could be done with
Coccinelle script...

> +
> +struct pathmap_hash_entry {
> +    struct hashmap_entry entry;
> +    const char path[FLEX_ARRAY];
> +};

Hmmm... I wonder why use hashmap and not string_list.  This is for
adding path with leading directories to the Bloom filter, isn't it?

> +
> +static uint32_t rotate_right(uint32_t value, int32_t count)
> +{
> +	uint32_t mask = 8 * sizeof(uint32_t) - 1;
> +	count &= mask;
> +	return ((value >> count) | (value << ((-count) & mask)));
> +}

Does it actually work with count being negative?  Shouldn't 'count' be
of unsigned type, and if int32_t is needed, perhaps add an assertion (if
needed)?  I think it does not.

It looks like it is John Regehr [2] safe and compiler-friendly
implementation, with explicit 8 in place of CHAR_BIT from <limits.h>,
which should compile to "rotate" assembly instruction... it looks like
it is the case, see https://godbolt.org/z/5JP1Jb (at least for C++
compiler).

[2]: https://en.wikipedia.org/wiki/Circular_shift


I wonder if this should, in the future, be a part of 'compat/', maybe
even using compiler intrinsics for "rotate right" if available (see
https://stackoverflow.com/a/776523/46058).  But that might be outside of
the scope of this patch (perhaps outside of choosing function name).

> +

It would be nice to have reference to the source of algorithm, or to the
code that was borrowed for this in the header comment for the following
function.

I will be comparing the algorithm itself in Wikipedia
https://en.wikipedia.org/wiki/MurmurHash#Algorithm
and its implementation in C in qLibc library (BSD licensed)
https://github.com/wolkykim/qlibc/blob/03a8ce035391adf88d6d755f9a26967c16a1a567/src/utilities/qhash.c#L258

> +static uint32_t seed_murmur3(uint32_t seed, const char *data, int len)
> +{
> +	const uint32_t c1 = 0xcc9e2d51;
> +	const uint32_t c2 = 0x1b873593;
> +	const int32_t r1 = 15;
> +	const int32_t r2 = 13;

Those two: r1 and r1, probably should be both uint32_t type.

> +	const uint32_t m = 5;
> +	const uint32_t n = 0xe6546b64;
> +	int i;
> +	uint32_t k1 = 0;
> +	const char *tail;

*tail should probably be 'uint8_t', not 'char', isn't it?

> +
> +	int len4 = len / sizeof(uint32_t);

All length variables and parameters, i.e. `len`, `len4`, `i`, could
possibly be `size_t` and not `int` type.

> +
> +	const uint32_t *blocks = (const uint32_t*)data;
> +

Some implementations copy `seed` (or assume seed=0) to the local
variable named `h` or `hash`.

> +	uint32_t k;
> +	for (i = 0; i < len4; i++)
> +	{
> +		k = blocks[i];
> +		k *= c1;
> +		k = rotate_right(k, r1);

Shouldn't it be *rotate_left* (ROL), not rotate_right (ROR)???
This affects all cases / uses.

> +		k *= c2;
> +
> +		seed ^= k;
> +		seed = rotate_right(seed, r2) * m + n;
> +	}
> +
> +	tail = (data + len4 * sizeof(uint32_t));
> +

We could have reused variable `k`, like the implementation in qLibc
does, instead of introducing new `k1` variable, but this way it is more
clean.  Or name it `remainingBytes` instead of `k1`

> +	switch (len & (sizeof(uint32_t) - 1))
> +	{
> +	case 3:
> +		k1 ^= ((uint32_t)tail[2]) << 16;
> +		/*-fallthrough*/
> +	case 2:
> +		k1 ^= ((uint32_t)tail[1]) << 8;
> +		/*-fallthrough*/
> +	case 1:
> +		k1 ^= ((uint32_t)tail[0]) << 0;
> +		k1 *= c1;
> +		k1 = rotate_right(k1, r1);
> +		k1 *= c2;
> +		seed ^= k1;
> +		break;
> +	}
> +
> +	seed ^= (uint32_t)len;
> +	seed ^= (seed >> 16);
> +	seed *= 0x85ebca6b;
> +	seed ^= (seed >> 13);
> +	seed *= 0xc2b2ae35;
> +	seed ^= (seed >> 16);
> +
> +	return seed;
> +}
> +

It would be nice to have header comment describing what this function is
intended to actually do.

> +static inline uint64_t get_bitmask(uint32_t pos)
> +{
> +	return ((uint64_t)1) << (pos & (BITS_PER_BLOCK - 1));
> +}

Sidenote: I wonder if ewah/bitmap.c implements something similar.
Certainly possible consolidation, if any possible exists, should be left
for the future.

> +
> +void fill_bloom_key(const char *data,
> +		    int len,
> +		    struct bloom_key *key,
> +		    struct bloom_filter_settings *settings)
> +{
> +	int i;
> +	uint32_t seed0 = 0x293ae76f;
> +	uint32_t seed1 = 0x7e646e2c;

Where did those constants came from?  It would be nice to have a
reference either in header comment (in bloom.h or bloom.c), or in a
commit message, or both.

Note that above *constants* are each used only once.

> +
> +	uint32_t hash0 = seed_murmur3(seed0, data, len);
> +	uint32_t hash1 = seed_murmur3(seed1, data, len);

Those are constant values, so perhaps they should be `const uint32_t`.

> +
> +	key->hashes = (uint32_t *)xcalloc(settings->num_hashes, sizeof(uint32_t));
> +	for (i = 0; i < settings->num_hashes; i++)
> +		key->hashes[i] = hash0 + i * hash1;

It looks like this code implements the double hashing technique given in
Eq. (4) in http://www.ccs.neu.edu/home/pete/pub/bloom-filters-verification.pdf
that is "Bloom Filters in Probabilistic Verification".

Note that Dillinger and Manolios in this paper propose also _enhanced_
double hashing algorithm (Algorithm 2 on page 11), which has closed form
given by Eq. (6) - with better science-theoretical properties at
similar cost.


It might be a good idea to explicitly state in the header comment that
all arithmetic is performed with unsigned 32-bit integers, which means
that operations are performed modulo 2^32.  Or it might not be needed.

> +}
> +
> +static void add_key_to_filter(struct bloom_key *key,
> +			      struct bloom_filter *filter,
> +			      struct bloom_filter_settings *settings)
> +{
> +	int i;
> +	uint64_t mod = filter->len * BITS_PER_BLOCK;
> +
> +	for (i = 0; i < settings->num_hashes; i++) {
> +		uint64_t hash_mod = key->hashes[i] % mod;
> +		uint64_t block_pos = hash_mod / BITS_PER_BLOCK;

All right.  Because Bloom filters for different commits (and the same
key) may have different lengths, we can perform modulo operation only
here.  `hash_mod` is i-th hash modulo size of filter, and `block_pos` is
the block the '1' bit would go into.

> +
> +		filter->data[block_pos] |= get_bitmask(hash_mod);

I'm not quite convinced that get_bitmask() is a good name: this function
returns bitmap with hash_mod's bit set to 1.  On the other hand it
doesn't matter, because it is static (file-local) helper function.

Never mind then.

> +	}
> +}
> +
> +void load_bloom_filters(void)
> +{
> +	init_bloom_filter_slab(&bloom_filters);
> +}

Why *load* if all it does is initialize?

> +
> +struct bloom_filter *get_bloom_filter(struct repository *r,
> +				      struct commit *c)

I will not comment on this function; see Jeff King reply and Derrick
Stolee reply.

> +{
[...]
> +}
> \ No newline at end of file

Why there is no newline at the end of the file?  Accident?

> diff --git a/commit-graph.c b/commit-graph.c
> index e771394aff..61e60ff98a 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -16,6 +16,7 @@
>  #include "hashmap.h"
>  #include "replace-object.h"
>  #include "progress.h"
> +#include "bloom.h"
>  
>  #define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
>  #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
> @@ -794,9 +795,11 @@ struct write_commit_graph_context {
>  	unsigned append:1,
>  		 report_progress:1,
>  		 split:1,
> -		 check_oids:1;
> +		 check_oids:1,
> +		 bloom:1;

Very minor nitpick: why `bloom` and not `bloom_filter`?

>  
>  	const struct split_commit_graph_opts *split_opts;
> +	uint32_t total_bloom_filter_size;

All right, I guess size of all Bloom filters would fit in uint32_t, no
need for size_t, is it?

Shouldn't it be total_bloom_filters_size -- it is not a single Bloom
filter, but many (minor nitpick)?

>  };
>  
>  static void write_graph_chunk_fanout(struct hashfile *f,
> @@ -1139,6 +1142,28 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  	stop_progress(&ctx->progress);
>  }
>  
> +static void compute_bloom_filters(struct write_commit_graph_context *ctx)
> +{
> +	int i;
> +	struct progress *progress = NULL;
> +
> +	load_bloom_filters();
> +
> +	if (ctx->report_progress)
> +		progress = start_progress(
> +			_("Computing commit diff Bloom filters"),
> +			ctx->commits.nr);
> +
> +	for (i = 0; i < ctx->commits.nr; i++) {
> +		struct commit *c = ctx->commits.list[i];
> +		struct bloom_filter *filter = get_bloom_filter(ctx->r, c);
> +		ctx->total_bloom_filter_size += sizeof(uint64_t) * filter->len;

Wouldn't it be more future proof instead of using `sizeof(uint64_t)` to
use `sizeof(filter->data[0])` here?  This may be not worth it, and be
less readable (we have hard-coded use of 64-bits blocks in other places).

> +		display_progress(progress, i + 1);
> +	}
> +
> +	stop_progress(&progress);
> +}
> +
>  static int add_ref_to_list(const char *refname,
>  			   const struct object_id *oid,
>  			   int flags, void *cb_data)
> @@ -1791,6 +1816,8 @@ int write_commit_graph(const char *obj_dir,
>  	ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
>  	ctx->check_oids = flags & COMMIT_GRAPH_WRITE_CHECK_OIDS ? 1 : 0;
>  	ctx->split_opts = split_opts;
> +	ctx->bloom = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0;

All right, this flag was defined in [PATCH 1/9].

The ordering of setting `ctx` members looks a bit strange.  Now it is
neither check `flags` firsts, neither keep related stuff together (see
ctx->split vs ctx->split_opts).  This is a very minor nitpick.

> +	ctx->total_bloom_filter_size = 0;
>  
>  	if (ctx->split) {
>  		struct commit_graph *g;
> @@ -1885,6 +1912,9 @@ int write_commit_graph(const char *obj_dir,
>  
>  	compute_generation_numbers(ctx);
>  
> +	if (ctx->bloom)
> +		compute_bloom_filters(ctx);
> +
>  	res = write_commit_graph_file(ctx);
>  
>  	if (ctx->split)

Regards,
-- 
Jakub Narębski

  parent reply	other threads:[~2020-01-06 18:44 UTC|newest]

Thread overview: 159+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-12-20 22:05 [PATCH 0/9] [RFC] Changed Paths Bloom Filters Garima Singh via GitGitGadget
2019-12-20 22:05 ` [PATCH 1/9] commit-graph: add --changed-paths option to write Garima Singh via GitGitGadget
2020-01-01 20:20   ` Jakub Narebski
2019-12-20 22:05 ` [PATCH 2/9] commit-graph: write changed paths bloom filters Garima Singh via GitGitGadget
2019-12-21 16:48   ` Philip Oakley
2020-01-06 18:44   ` Jakub Narebski [this message]
2020-01-13 19:48     ` Garima Singh
2019-12-20 22:05 ` [PATCH 3/9] commit-graph: use MAX_NUM_CHUNKS Garima Singh via GitGitGadget
2020-01-07 12:19   ` Jakub Narebski
2019-12-20 22:05 ` [PATCH 4/9] commit-graph: document bloom filter format Garima Singh via GitGitGadget
2020-01-07 14:46   ` Jakub Narebski
2019-12-20 22:05 ` [PATCH 5/9] commit-graph: write changed path bloom filters to commit-graph file Garima Singh via GitGitGadget
2020-01-07 16:01   ` Jakub Narebski
2020-01-14 15:14     ` Garima Singh
2019-12-20 22:05 ` [PATCH 6/9] commit-graph: test commit-graph write --changed-paths Garima Singh via GitGitGadget
2020-01-08  0:32   ` Jakub Narebski
2019-12-20 22:05 ` [PATCH 7/9] commit-graph: reuse existing bloom filters during write Garima Singh via GitGitGadget
2020-01-09 19:12   ` Jakub Narebski
2019-12-20 22:05 ` [PATCH 8/9] revision.c: use bloom filters to speed up path based revision walks Garima Singh via GitGitGadget
2020-01-11  0:27   ` Jakub Narebski
2020-01-15  0:08     ` Garima Singh
2019-12-20 22:05 ` [PATCH 9/9] commit-graph: add GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS test flag Garima Singh via GitGitGadget
2020-01-11 19:56   ` Jakub Narebski
2020-01-15  0:55     ` Garima Singh
2019-12-20 22:14 ` [PATCH 0/9] [RFC] Changed Paths Bloom Filters Junio C Hamano
2019-12-22  9:26 ` Christian Couder
2019-12-22  9:38   ` Jeff King
2020-01-01 12:04     ` Jakub Narebski
2019-12-22  9:30 ` Jeff King
2019-12-22  9:32   ` [PATCH 1/3] commit-graph: examine changed-path objects in pack order Jeff King
2019-12-27 14:51     ` Derrick Stolee
2019-12-29  6:12       ` Jeff King
2019-12-29  6:28         ` Jeff King
2019-12-30 14:37         ` Derrick Stolee
2019-12-30 14:51           ` Derrick Stolee
2019-12-22  9:32   ` [PATCH 2/3] commit-graph: free large diffs, too Jeff King
2019-12-27 14:52     ` Derrick Stolee
2019-12-22  9:32   ` [PATCH 3/3] commit-graph: stop using full rev_info for diffs Jeff King
2019-12-27 14:53     ` Derrick Stolee
2019-12-26 14:21   ` [PATCH 0/9] [RFC] Changed Paths Bloom Filters Derrick Stolee
2019-12-29  6:03     ` Jeff King
2019-12-27 16:11   ` Derrick Stolee
2019-12-29  6:24     ` Jeff King
2019-12-30 16:04       ` Derrick Stolee
2019-12-30 17:02       ` Junio C Hamano
2019-12-31 16:45 ` Jakub Narebski
2020-01-13 16:54   ` Garima Singh
2020-01-20 13:48     ` Jakub Narebski
2020-01-21 16:14       ` Garima Singh
2020-02-02 18:43         ` Jakub Narebski
2020-01-21 23:40 ` Emily Shaffer
2020-01-27 18:24   ` Garima Singh
2020-02-01 23:32   ` Jakub Narebski
2020-02-05 22:56 ` [PATCH v2 00/11] " Garima Singh via GitGitGadget
2020-02-05 22:56   ` [PATCH v2 01/11] commit-graph: use MAX_NUM_CHUNKS Garima Singh via GitGitGadget
2020-02-09 12:39     ` Jakub Narebski
2020-02-05 22:56   ` [PATCH v2 02/11] bloom: core Bloom filter implementation for changed paths Garima Singh via GitGitGadget
2020-02-15 17:17     ` Jakub Narebski
2020-02-16 16:49     ` Jakub Narebski
2020-02-22  0:32       ` Garima Singh
2020-02-23 13:38         ` Jakub Narebski
2020-02-24 17:34           ` Garima Singh
2020-02-24 18:20             ` Jakub Narebski
2020-02-05 22:56   ` [PATCH v2 03/11] diff: halt tree-diff early after max_changes Derrick Stolee via GitGitGadget
2020-02-17  0:00     ` Jakub Narebski
2020-02-22  0:37       ` Garima Singh
2020-02-05 22:56   ` [PATCH v2 04/11] commit-graph: compute Bloom filters for changed paths Garima Singh via GitGitGadget
2020-02-17 21:56     ` Jakub Narebski
2020-02-22  0:55       ` Garima Singh
2020-02-23 17:34         ` Jakub Narebski
2020-02-05 22:56   ` [PATCH v2 05/11] commit-graph: examine changed-path objects in pack order Jeff King via GitGitGadget
2020-02-18 17:59     ` Jakub Narebski
2020-02-24 18:29       ` Garima Singh
2020-02-05 22:56   ` [PATCH v2 06/11] commit-graph: examine commits by generation number Derrick Stolee via GitGitGadget
2020-02-19  0:32     ` Jakub Narebski
2020-02-24 20:45       ` Garima Singh
2020-02-05 22:56   ` [PATCH v2 07/11] commit-graph: write Bloom filters to commit graph file Garima Singh via GitGitGadget
2020-02-19 15:13     ` Jakub Narebski
2020-02-24 21:14       ` Garima Singh
2020-02-25 11:40         ` Jakub Narebski
2020-02-25 15:58           ` Garima Singh
2020-02-05 22:56   ` [PATCH v2 08/11] commit-graph: reuse existing Bloom filters during write Garima Singh via GitGitGadget
2020-02-20 18:48     ` Jakub Narebski
2020-02-24 21:45       ` Garima Singh
2020-02-05 22:56   ` [PATCH v2 09/11] commit-graph: add --changed-paths option to write subcommand Garima Singh via GitGitGadget
2020-02-20 20:28     ` Jakub Narebski
2020-02-24 21:51       ` Garima Singh
2020-02-25 12:10         ` Jakub Narebski
2020-02-20 22:10     ` Bryan Turner
2020-02-22  1:44       ` Garima Singh
2020-02-05 22:56   ` [PATCH v2 10/11] revision.c: use Bloom filters to speed up path based revision walks Garima Singh via GitGitGadget
2020-02-21 17:31     ` Jakub Narebski
2020-02-21 22:45     ` Jakub Narebski
2020-02-05 22:56   ` [PATCH v2 11/11] commit-graph: add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag Garima Singh via GitGitGadget
2020-02-22  0:11     ` Jakub Narebski
2020-02-07 13:52   ` [PATCH v2 00/11] Changed Paths Bloom Filters SZEDER Gábor
2020-02-07 15:09     ` Garima Singh
2020-02-07 15:36       ` Derrick Stolee
2020-02-07 16:15         ` SZEDER Gábor
2020-02-07 16:33           ` Derrick Stolee
2020-02-11 19:08       ` Garima Singh
2020-02-08 23:04   ` Jakub Narebski
2020-02-21 17:41     ` Garima Singh
2020-03-29 18:36       ` Junio C Hamano
2020-03-30  0:31   ` [PATCH v3 00/16] " Garima Singh via GitGitGadget
2020-03-30  0:31     ` [PATCH v3 01/16] commit-graph: define and use MAX_NUM_CHUNKS Garima Singh via GitGitGadget
2020-03-30  0:31     ` [PATCH v3 02/16] bloom.c: add the murmur3 hash implementation Garima Singh via GitGitGadget
2020-03-30  0:31     ` [PATCH v3 03/16] bloom.c: introduce core Bloom filter constructs Garima Singh via GitGitGadget
2020-03-30  0:31     ` [PATCH v3 04/16] bloom.c: core Bloom filter implementation for changed paths Garima Singh via GitGitGadget
2020-03-30  0:31     ` [PATCH v3 05/16] diff: halt tree-diff early after max_changes Derrick Stolee via GitGitGadget
2020-03-30  0:31     ` [PATCH v3 06/16] commit-graph: compute Bloom filters for changed paths Garima Singh via GitGitGadget
2020-03-30  0:31     ` [PATCH v3 07/16] commit-graph: examine changed-path objects in pack order Jeff King via GitGitGadget
2020-03-30  0:31     ` [PATCH v3 08/16] commit-graph: examine commits by generation number Garima Singh via GitGitGadget
2020-03-30  0:31     ` [PATCH v3 09/16] diff: skip batch object download when possible Garima Singh via GitGitGadget
2020-03-30  0:31     ` [PATCH v3 10/16] commit-graph: write Bloom filters to commit graph file Garima Singh via GitGitGadget
2020-03-30  0:31     ` [PATCH v3 11/16] commit-graph: reuse existing Bloom filters during write Garima Singh via GitGitGadget
2020-03-30  0:31     ` [PATCH v3 12/16] commit-graph: add --changed-paths option to write subcommand Garima Singh via GitGitGadget
2020-03-30  0:31     ` [PATCH v3 13/16] revision.c: use Bloom filters to speed up path based revision walks Garima Singh via GitGitGadget
2020-03-30  0:31     ` [PATCH v3 14/16] revision.c: add trace2 stats around Bloom filter usage Garima Singh via GitGitGadget
2020-03-30  0:31     ` [PATCH v3 15/16] t4216: add end to end tests for git log with Bloom filters Garima Singh via GitGitGadget
2020-03-30  0:31     ` [PATCH v3 16/16] commit-graph: add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag Garima Singh via GitGitGadget
2020-04-06 16:59     ` [PATCH v4 00/15] Changed Paths Bloom Filters Garima Singh via GitGitGadget
2020-04-06 16:59       ` [PATCH v4 01/15] commit-graph: define and use MAX_NUM_CHUNKS Garima Singh via GitGitGadget
2020-04-06 16:59       ` [PATCH v4 02/15] bloom.c: add the murmur3 hash implementation Garima Singh via GitGitGadget
2020-04-06 16:59       ` [PATCH v4 03/15] bloom.c: introduce core Bloom filter constructs Garima Singh via GitGitGadget
2020-04-06 16:59       ` [PATCH v4 04/15] bloom.c: core Bloom filter implementation for changed paths Garima Singh via GitGitGadget
2020-06-27 15:53         ` SZEDER Gábor
2020-04-06 16:59       ` [PATCH v4 05/15] diff: halt tree-diff early after max_changes Derrick Stolee via GitGitGadget
2020-08-04 14:47         ` SZEDER Gábor
2020-08-04 16:25           ` Derrick Stolee
2020-08-04 17:00             ` SZEDER Gábor
2020-08-04 17:31               ` Derrick Stolee
2020-08-05 17:08                 ` Derrick Stolee
2020-04-06 16:59       ` [PATCH v4 06/15] commit-graph: compute Bloom filters for changed paths Garima Singh via GitGitGadget
2020-04-06 16:59       ` [PATCH v4 07/15] commit-graph: examine changed-path objects in pack order Jeff King via GitGitGadget
2020-04-06 16:59       ` [PATCH v4 08/15] commit-graph: examine commits by generation number Garima Singh via GitGitGadget
2020-04-06 16:59       ` [PATCH v4 09/15] commit-graph: write Bloom filters to commit graph file Garima Singh via GitGitGadget
2020-05-29  8:57         ` SZEDER Gábor
2020-05-29 13:35           ` Derrick Stolee
2020-05-31 17:23             ` SZEDER Gábor
2020-07-09 17:00         ` [PATCH] commit-graph: fix "Writing out commit graph" progress counter SZEDER Gábor
2020-07-09 18:01           ` Derrick Stolee
2020-07-09 18:20             ` Derrick Stolee
2020-04-06 16:59       ` [PATCH v4 10/15] commit-graph: reuse existing Bloom filters during write Garima Singh via GitGitGadget
2020-06-19 14:02         ` SZEDER Gábor
2020-06-19 19:28           ` Junio C Hamano
2020-07-27 21:33         ` SZEDER Gábor
2020-04-06 16:59       ` [PATCH v4 11/15] commit-graph: add --changed-paths option to write subcommand Garima Singh via GitGitGadget
2020-06-07 22:21         ` SZEDER Gábor
2020-04-06 16:59       ` [PATCH v4 12/15] revision.c: use Bloom filters to speed up path based revision walks Garima Singh via GitGitGadget
2020-06-26  6:34         ` SZEDER Gábor
2020-04-06 16:59       ` [PATCH v4 13/15] revision.c: add trace2 stats around Bloom filter usage Garima Singh via GitGitGadget
2020-04-06 16:59       ` [PATCH v4 14/15] t4216: add end to end tests for git log with Bloom filters Garima Singh via GitGitGadget
2020-04-06 16:59       ` [PATCH v4 15/15] commit-graph: add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag Garima Singh via GitGitGadget
2020-04-08 15:51       ` [PATCH v4 00/15] Changed Paths Bloom Filters Derrick Stolee
2020-04-08 19:21         ` Junio C Hamano
2020-04-08 20:05         ` Jakub Narębski
2020-04-12 20:34         ` Taylor Blau
2020-03-05 19:49 ` [PATCH 0/9] [RFC] " Garima Singh

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=86eewczapt.fsf@gmail.com \
    --to=jnareb@gmail.com \
    --cc=garima.singh@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=gitster@pobox.com \
    --cc=jeffhost@microsoft.com \
    --cc=jonathantanmy@google.com \
    --cc=me@ttaylorr.com \
    --cc=peff@peff.net \
    --cc=stolee@gmail.com \
    --cc=szeder.dev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).