git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / mirror / code / Atom feed
From: Taylor Blau <me@ttaylorr.com>
To: "SZEDER Gábor" <szeder.dev@gmail.com>
Cc: Taylor Blau <me@ttaylorr.com>,
	git@vger.kernel.org, peff@peff.net, dstolee@microsoft.com,
	gitster@pobox.com
Subject: Re: [PATCH v3 12/14] commit-graph: add large-filters bitmap chunk
Date: Wed, 2 Sep 2020 16:40:42 -0400	[thread overview]
Message-ID: <20200902204042.GB5281@nand.local> (raw)
In-Reply-To: <20200901143546.GA5678@szeder.dev>

On Tue, Sep 01, 2020 at 04:35:46PM +0200, SZEDER Gábor wrote:
> On Tue, Aug 11, 2020 at 04:52:07PM -0400, Taylor Blau wrote:
> > This is already undesirable, since it means that we are wasting
> > considerable effort to discover that a commit with too many changed
> > paths, only to throw that effort away (and then repeat the process the
> > next time a roll-up is performed).
>
> Is this really the case?
>
> If a commit has a corresponding entry in the Bloom filters chunks,
> then the commit-graph machinery does know the Bloom filter associated
> with that commit.  The size of that filter should not matter, i.e.
> even if it is a zero-length filter, the commit-graph machinery should
> know about it all the same.  And as far as I can tell it does indeed,
> because load_bloom_filter_from_graph() sets a non-NULL 'filter->data'
> pointer even if 'filter->len' is zero, which get_bloom_filter() treats
> as "we know about this", and returns early without (re)computing the
> filter.  Even the test 'Bloom generation does not recompute too-large
> filters' added in this patch is expected to succeed, and my
> superficial and makeshift testing seems to corroborate this; at least
> I couldn't find a combination of commands and options that would
> recompute any existing zero-length Bloom filters.
>
> Am I missing something?

I'm not sure that this would work, or at least that it creates another
pair of cases that we have to disambiguate. Here's what I'm thinking
after reading what you wrote:

  - By this point in the series, we expect every commit in a
    commit-graph layer to have a corresponding entry in the 'BIDX' chunk
    (if one exists, obviously all of this is meaningless without
    specifying '--changed-paths').

  - In a future commit (specifically, "builtin/commit-graph.c: introduce
    '--max-new-filters=<n>'"), this will no longer be the case.

  - So, by that point in the series, we would have two possible reasons
    why a commit did not show up in the BIDX. Either:

      * The commit has too many changed paths, in which case we don't
        want to write it down, or

      * We had already computed too many new filters, and so didn't have
        the budget to compute the filter for some commit(s).

In the first of those two cases, we want to avoid recomputing the
too-large filter. But, in the latter case, we *do* want to compute the
filter from scratch, since we want to "backfill" commits with missing
filters by letting them trickle in over time.

I might be missing something, too, but I think that those two cases are
just as indistinguishable as what's in the commit-graph today (without
the 'BFXL' chunk).

> > To address this, add a new chunk which encodes a bitmap where the ith
> > bit is on iff the ith commit has zero or at least 512 changed paths.
> > Likewise store the maximum number of changed paths we are willing to
> > store in order to prepare for eventually making this value more easily
> > customizable.
>
> I don't understand how storing this value would make it any easier to
> customize.

It doesn't, but it's also not meant to. The idea is that if the value
ever were easy to customize, then we could keep track of which layers
were written with which threshold. This would allow you to invent fancy
rules like dropping filters when the threshold lowers, or recomputing
ones that you wouldn't have otherwise when it goes up.

> > Another approach would be to introduce a new BIDX chunk (say, one
> > identified by 'BID2') which is identical to the existing BIDX chunk,
> > except the most-significant bit of each offset is interpreted as "this
> > filter is too big" iff looking at a BID2 chunk. This avoids having to
> > write a bitmap, but forces older clients to rewrite their commit-graphs
> > (as well as reduces the theoretical largest Bloom filters we couldl
>
> And it reduces the max possible size of the BDAT chunk, and thus the
> max number of commits with Bloom filters as well.

Fair point.

> s/couldl/could/

Oops, thanks.

> > write, and forces us to maintain the code necessary to translate BIDX
> > chunks to BID2 ones). Separately from this patch, I implemented this
> > alternate approach and did not find it to be advantageous.
>
> Let's take a step back to reconsider what should be stored in this
> bitmap for a moment.  Sure, setting a bit for each commit that doesn't
> modify any paths or modifies too many makes it possible to repliably
> identify commits that don't have Bloom filters yet.  But isn't it a
> bit roundabout way?  I think it would be better to directly track
> which commits don't have Bloom filters yet.  IOW what you really want
> is a, say, BNCY "Bloom filter Not Computed Yet" chunk, where we set
> the corresponding bit for each commit which has an entry in the BIDX
> chunk but for which a Bloom filter hasn't been computed yet.
>
>   - It's simpler and easier to explain (IMO).
>
>   - This bitmap chunk can easily be made optional: if all Bloom
>     filters have been computed, then the bitmap will contain all
>     zeros.  So why bother writing it, when we can save a bit of space
>     instead?

I don't think that this is true. Omitting the chunk (because you have
computed--or tried to compute--filters for every commit in the graph)
isn't distinguishable from what exists today, so we are back at square
one there.

>   - It avoids the unpleasentness of setting a bit in the _Large_ Bloom
>     Filters chunks for commits _not_ modifying any paths.

I agree it's unpleasant, but I also don't think it's a show-stopper.

>   - Less incentive to spill implementation details to the format
>     specification (e.g. 512 modified paths).
>
> Now, let's take another step back: is such a bitmap really necessary?
> We could write a single-byte Bloom filter with no bits set for commits
> not modifying any paths, and a single-byte Bloom filter with all bits
> set for commits modifying too many paths.  This is compatible with the
> specs and any existing implementation should do the right thing when
> reading such filters, this would allow us to interpret zero-length
> filters as "not computed yet", and if that bitmap chunk won't be
> optional, then this would save space as long as less than 1/8 of
> commits modify no or too many paths.  Unfortunately, however, all
> existing zero-length Bloom filters have to be recomputed.

I think this is a semantic difference: either you store a bitmap, or a
Bloom filter containing the same data. To me, I don't think there's a
huge difference, since we're talking about 1 bit per commit. If we were
really worried, we could store them as EWAH-compressed bitmaps, but I
don't get a sense that such a concern exists.

I do feel strongly about using a non-probabilistic data structure,
though, since the point of this feature is to be able to make tighter
guarentees about the runtime of 'git commit-graph write'.

>
> > Signed-off-by: Taylor Blau <me@ttaylorr.com>
> > ---
> >  .../technical/commit-graph-format.txt         | 12 +++
> >  bloom.h                                       |  2 +-
> >  commit-graph.c                                | 96 ++++++++++++++++---
> >  commit-graph.h                                |  4 +
> >  t/t4216-log-bloom.sh                          | 25 ++++-
> >  5 files changed, 124 insertions(+), 15 deletions(-)
> >
> > diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
> > index 440541045d..5f2d9ab4d7 100644
> > --- a/Documentation/technical/commit-graph-format.txt
> > +++ b/Documentation/technical/commit-graph-format.txt
> > @@ -123,6 +123,18 @@ CHUNK DATA:
> >        of length zero.
> >      * The BDAT chunk is present if and only if BIDX is present.
> >
> > +  Large Bloom Filters (ID: {'B', 'F', 'X', 'L'}) [Optional]
> > +    * It starts with a 32-bit unsigned integer specifying the maximum number of
> > +      changed-paths that can be stored in a single Bloom filter.
>
> Should this value be the same in all elements of a commit-graph chain?
> Note that in this case having different values won't affect revision
> walks using modified path Bloom filters.

I don't think it needs to be the same necessarily.

> > +    * It then contains a list of 64-bit words (the length of this list is
> > +      determined by the width of the chunk) which is a bitmap. The 'i'th bit is
> > +      set exactly when the 'i'th commit in the graph has a changed-path Bloom
> > +      filter with zero entries (either because the commit is empty, or because
> > +      it contains more than 512 changed paths).
>
> Please make clear the byte order of these 64 bit words in the specs as
> well.
>
> Furthermore, that 512 path limit is an implementation detail, so it
> would be better if it didn't leak into the specification of this new
> chunk.

Addressed both, thanks.

Thanks,
Taylor

  reply	other threads:[~2020-09-02 20:40 UTC|newest]

Thread overview: 117+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-08-03 18:57 [PATCH 00/10] more miscellaneous Bloom filter improvements Taylor Blau
2020-08-03 18:57 ` [PATCH 01/10] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
2020-08-04  7:24   ` Jeff King
2020-08-04 20:08     ` Taylor Blau
2020-08-03 18:57 ` [PATCH 02/10] commit-graph: pass a 'struct repository *' in more places Taylor Blau
2020-08-03 18:57 ` [PATCH 03/10] t4216: use an '&&'-chain Taylor Blau
2020-08-03 18:57 ` [PATCH 04/10] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
2020-08-03 18:57 ` [PATCH 05/10] commit-graph: respect 'commitgraph.readChangedPaths' Taylor Blau
2020-08-03 18:57 ` [PATCH 06/10] commit-graph.c: sort index into commits list Taylor Blau
2020-08-04 12:31   ` Derrick Stolee
2020-08-04 20:10     ` Taylor Blau
2020-08-03 18:57 ` [PATCH 07/10] commit-graph: add large-filters bitmap chunk Taylor Blau
2020-08-03 18:59   ` Taylor Blau
2020-08-04 12:57   ` Derrick Stolee
2020-08-03 18:57 ` [PATCH 08/10] bloom: split 'get_bloom_filter()' in two Taylor Blau
2020-08-04 13:00   ` Derrick Stolee
2020-08-04 20:12     ` Taylor Blau
2020-08-03 18:57 ` [PATCH 09/10] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
2020-08-03 18:57 ` [PATCH 10/10] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
2020-08-04 13:03   ` Derrick Stolee
2020-08-04 20:14     ` Taylor Blau
2020-08-05 17:01 ` [PATCH v2 00/14] more miscellaneous Bloom filter improvements Taylor Blau
2020-08-05 17:01   ` [PATCH v2 01/14] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
2020-08-05 17:02   ` [PATCH v2 02/14] t4216: use an '&&'-chain Taylor Blau
2020-08-05 17:02   ` [PATCH v2 03/14] commit-graph: pass a 'struct repository *' in more places Taylor Blau
2020-08-05 17:02   ` [PATCH v2 04/14] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
2020-08-05 17:02   ` [PATCH v2 05/14] commit-graph: respect 'commitGraph.readChangedPaths' Taylor Blau
2020-08-05 17:02   ` [PATCH v2 06/14] commit-graph.c: store maximum changed paths Taylor Blau
2020-08-05 17:02   ` [PATCH v2 07/14] bloom: split 'get_bloom_filter()' in two Taylor Blau
2020-08-05 17:02   ` [PATCH v2 08/14] bloom: use provided 'struct bloom_filter_settings' Taylor Blau
2020-08-05 17:02   ` [PATCH v2 09/14] bloom/diff: properly short-circuit on max_changes Taylor Blau
2020-08-05 17:02   ` [PATCH v2 10/14] commit-graph.c: sort index into commits list Taylor Blau
2020-08-05 17:02   ` [PATCH v2 11/14] csum-file.h: introduce 'hashwrite_be64()' Taylor Blau
2020-08-05 17:02   ` [PATCH v2 12/14] commit-graph: add large-filters bitmap chunk Taylor Blau
2020-08-05 21:01     ` Junio C Hamano
2020-08-05 21:17       ` Taylor Blau
2020-08-05 22:21         ` Junio C Hamano
2020-08-05 22:25           ` Taylor Blau
2020-08-11 13:48             ` Taylor Blau
2020-08-11 18:59               ` Junio C Hamano
2020-08-05 17:03   ` [PATCH v2 13/14] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
2020-08-05 17:03   ` [PATCH v2 14/14] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
2020-08-11 20:51 ` [PATCH v3 00/14] more miscellaneous Bloom filter improvements Taylor Blau
2020-08-11 20:51   ` [PATCH v3 01/14] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
2020-08-11 21:18     ` SZEDER Gábor
2020-08-11 21:21       ` Taylor Blau
2020-08-11 21:27         ` SZEDER Gábor
2020-08-11 21:34           ` Taylor Blau
2020-08-11 23:55             ` SZEDER Gábor
2020-08-12 11:48               ` Derrick Stolee
2020-08-14 20:17                 ` Taylor Blau
2020-08-11 20:51   ` [PATCH v3 02/14] t4216: use an '&&'-chain Taylor Blau
2020-08-11 20:51   ` [PATCH v3 03/14] commit-graph: pass a 'struct repository *' in more places Taylor Blau
2020-08-11 20:51   ` [PATCH v3 04/14] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
2020-08-11 20:51   ` [PATCH v3 05/14] commit-graph: respect 'commitGraph.readChangedPaths' Taylor Blau
2020-08-11 20:51   ` [PATCH v3 06/14] commit-graph.c: store maximum changed paths Taylor Blau
2020-08-11 20:51   ` [PATCH v3 07/14] bloom: split 'get_bloom_filter()' in two Taylor Blau
2020-08-11 20:51   ` [PATCH v3 11/14] csum-file.h: introduce 'hashwrite_be64()' Taylor Blau
2020-08-11 20:51   ` [PATCH v3 08/14] bloom: use provided 'struct bloom_filter_settings' Taylor Blau
2020-08-11 20:51   ` [PATCH v3 09/14] bloom/diff: properly short-circuit on max_changes Taylor Blau
2020-08-11 20:52   ` [PATCH v3 10/14] commit-graph.c: sort index into commits list Taylor Blau
2020-08-11 20:52   ` [PATCH v3 12/14] commit-graph: add large-filters bitmap chunk Taylor Blau
2020-08-11 21:11     ` Derrick Stolee
2020-08-11 21:18       ` Taylor Blau
2020-08-11 22:05         ` Taylor Blau
2020-08-19 13:35     ` SZEDER Gábor
2020-09-02 20:23       ` Taylor Blau
2020-09-01 14:35     ` SZEDER Gábor
2020-09-02 20:40       ` Taylor Blau [this message]
2020-08-11 20:52   ` [PATCH v3 13/14] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
2020-08-19  9:56     ` SZEDER Gábor
2020-09-02 21:02       ` Taylor Blau
2020-08-11 20:52   ` [PATCH v3 14/14] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
2020-08-12 11:49     ` SZEDER Gábor
2020-08-14 20:20       ` Taylor Blau
2020-08-17 22:50         ` SZEDER Gábor
2020-09-02 21:03           ` Taylor Blau
2020-08-12 12:29     ` Derrick Stolee
2020-08-14 20:10       ` Taylor Blau
2020-08-18 22:23     ` SZEDER Gábor
2020-09-03 16:35       ` Taylor Blau
2020-08-19  8:20     ` SZEDER Gábor
2020-09-03 16:42       ` Taylor Blau
2020-09-04  8:50         ` SZEDER Gábor
2020-09-01 14:36     ` SZEDER Gábor
2020-09-03 18:49       ` Taylor Blau
2020-09-03 21:45   ` [PATCH v3 00/14] more miscellaneous Bloom filter improvements Junio C Hamano
2020-09-03 22:33     ` Taylor Blau
2020-09-03 22:45 ` [PATCH v4 " Taylor Blau
2020-09-03 22:46   ` [PATCH v4 01/14] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
2020-09-03 22:46   ` [PATCH v4 02/14] t4216: use an '&&'-chain Taylor Blau
2020-09-03 22:46   ` [PATCH v4 03/14] commit-graph: pass a 'struct repository *' in more places Taylor Blau
2020-09-03 22:46   ` [PATCH v4 04/14] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
2020-09-03 22:46   ` [PATCH v4 05/14] commit-graph: respect 'commitGraph.readChangedPaths' Taylor Blau
2020-09-03 22:46   ` [PATCH v4 06/14] commit-graph.c: store maximum changed paths Taylor Blau
2020-09-03 22:46   ` [PATCH v4 07/14] bloom: split 'get_bloom_filter()' in two Taylor Blau
2020-09-05 17:22     ` Jakub Narębski
2020-09-05 17:38       ` Taylor Blau
2020-09-05 17:50         ` Jakub Narębski
2020-09-05 18:01           ` Taylor Blau
2020-09-05 18:18             ` Jakub Narębski
2020-09-05 18:38               ` Taylor Blau
2020-09-05 18:55                 ` Taylor Blau
2020-09-05 19:04                   ` SZEDER Gábor
2020-09-05 19:49                     ` Taylor Blau
2020-09-06 21:52                       ` Junio C Hamano
2020-09-03 22:46   ` [PATCH v4 08/14] bloom: use provided 'struct bloom_filter_settings' Taylor Blau
2020-09-03 22:46   ` [PATCH v4 09/14] bloom/diff: properly short-circuit on max_changes Taylor Blau
2020-09-03 22:46   ` [PATCH v4 10/14] commit-graph.c: sort index into commits list Taylor Blau
2020-09-03 22:46   ` [PATCH v4 11/14] csum-file.h: introduce 'hashwrite_be64()' Taylor Blau
2020-09-04 20:18     ` René Scharfe
2020-09-04 20:22       ` Taylor Blau
2020-09-03 22:46   ` [PATCH v4 12/14] commit-graph: add large-filters bitmap chunk Taylor Blau
2020-09-03 22:46   ` [PATCH v4 13/14] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
2020-09-04 15:20     ` Taylor Blau
2020-09-03 22:47   ` [PATCH v4 14/14] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
2020-09-04 14:39   ` [PATCH v4 00/14] more miscellaneous Bloom filter improvements Derrick Stolee

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200902204042.GB5281@nand.local \
    --to=me@ttaylorr.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=peff@peff.net \
    --cc=szeder.dev@gmail.com \
    --subject='Re: [PATCH v3 12/14] commit-graph: add large-filters bitmap chunk' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Code repositories for project(s) associated with this inbox:

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).