From: "SZEDER Gábor" <szeder.dev@gmail.com>
To: Taylor Blau <me@ttaylorr.com>
Cc: git@vger.kernel.org, dstolee@microsoft.com, gitster@pobox.com,
peff@peff.net
Subject: Re: [PATCH v2 10/13] bloom: encode out-of-bounds filters as non-empty
Date: Fri, 18 Sep 2020 00:13:02 +0200 [thread overview]
Message-ID: <20200917221302.GC23146@szeder.dev> (raw)
In-Reply-To: <4653b5b4bcd254a3791797214b46722b4062dc18.1600279373.git.me@ttaylorr.com>
On Wed, Sep 16, 2020 at 02:07:59PM -0400, Taylor Blau wrote:
> When a changed-path Bloom filter has either zero, or more than a
> certain number (commonly 512) of entries, the commit-graph machinery
> encodes it as "missing". More specifically, it sets the indices adjacent
> in the BIDX chunk as equal to each other to indicate a "length 0"
> filter; that is, that the filter occupies zero bytes on disk.
>
> This has heretofore been fine, since the commit-graph machinery has no
> need to care about these filters with too few or too many changed paths.
> Both cases act like no filter has been generated at all, and so there is
> no need to store them.
>
> In a subsequent commit, however, the commit-graph machinery will learn
> to only compute Bloom filters for some commits in the current
> commit-graph layer. This is a change from the current implementation
> which computes Bloom filters for all commits that are in the layer being
> written. Critically for this patch, only computing some of the Bloom
> filters means adding a third state for length 0 Bloom filters: zero
> entries, too many entries, or "hasn't been computed".
>
> It will be important for that future patch to distinguish between "not
> representable" (i.e., zero or too-many changed paths), and "hasn't been
> computed". In particular, we don't want to waste time recomputing
> filters that have already been computed.
>
> To that end, change how we store Bloom filters in the "computed but not
> representable" category:
>
> - Bloom filters with no entries are stored as a single byte with all
> bits low (i.e., all queries to that Bloom filter will return
> "definitely not")
>
> - Bloom filters with too many entries are stored as a single byte with
> all bits set high (i.e., all queries to that Bloom filter will
> return "maybe").
>
> These rules are sufficient to not incur a behavior change by changing
> the on-disk representation of these two classes. Likewise, no
> specification changes are necessary for the commit-graph format, either:
>
> - Filters that were previously empty will be recomputed and stored
> according to the new rules, and
>
> - old clients reading filters generated by new clients will interpret
> the filters correctly and be none the wiser to how they were
> generated.
>
> Clients will invoke the Bloom machinery in more cases than before, but
> this can be addressed by returning a NULL filter when all bits are set
> high. This can be addressed in a future patch.
OTOH, clients will invoke the tree-diff machinery in fewer cases than
before, because querying the Bloom filter of commits not modifying any
files will now return "definitely not".
> Finally, note that this does increase the size of on-disk commit-graphs,
> but far less than other proposals. In particular, this is generally more
> efficient than storing a bitmap for which commits haven't computed their
> Bloom filters. Storing a bitmap incurs a penalty of one bit per commit,
> whereas storing explicit filters as above incurs a penalty of one byte
> per too-large or too-small commit.
s/too-small/empty/
> In practice, these boundary commits likely occupy a small proportion of
> the overall number of commits, and so the size penalty is likely smaller
> than storing a bitmap for all commits.
| Percentage of
| commits modifying
| 0 path | >= 512 paths
---------------+------------+----------------
android-base | 13.20% | 0.13%
cmssw | 0.15% | 0.23%
cpython | 3.07% | 0.01%
elasticsearch | 0.70% | 1.00%
gcc | 0.00% | 0.08%
gecko-dev | 0.14% | 0.64%
git | 0.11% | 0.02%
glibc | 0.02% | 0.10%
go | 0.00% | 0.07%
homebrew-cask | 0.40% | 0.02%
homebrew-core | 0.01% | 0.01%
jdk | 0.26% | 5.64%
linux | 0.01% | 0.51%
llvm-project | 0.12% | 0.03%
rails | 0.10% | 0.10%
rust | 0.07% | 0.17%
tensorflow | 0.09% | 1.02%
webkit | 0.05% | 0.31%
> A test to exercise filters which contain too many changed path entries
> will be introduced in a subsequent patch.
> diff --git a/bloom.h b/bloom.h
> index c6d77e8393..70a8840896 100644
> --- a/bloom.h
> +++ b/bloom.h
> @@ -93,6 +93,7 @@ enum bloom_filter_computed {
> BLOOM_NOT_COMPUTED = (1 << 0),
> BLOOM_COMPUTED = (1 << 1),
> BLOOM_TRUNC_LARGE = (1 << 2),
> + BLOOM_TRUNC_SMALL = (1 << 3),
s/SMALL/EMPTY/
This "small" suffix in the constant, variable, and trace2 key names is
misleading, because we only mean empty commits.
> };
>
> struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
> diff --git a/commit-graph.c b/commit-graph.c
> index 1ca754f19c..bd4247bca5 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -974,6 +974,7 @@ struct write_commit_graph_context {
>
> int count_bloom_filter_computed;
> int count_bloom_filter_not_computed;
> + int count_bloom_filter_trunc_small;
> int count_bloom_filter_trunc_large;
> };
>
next prev parent reply other threads:[~2020-09-17 22:13 UTC|newest]
Thread overview: 75+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-09-09 15:22 [PATCH 00/12] more miscellaneous Bloom filter improvements, redux Taylor Blau
2020-09-09 15:22 ` [PATCH 01/12] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
2020-09-09 15:22 ` [PATCH 02/12] t4216: use an '&&'-chain Taylor Blau
2020-09-09 15:22 ` [PATCH 03/12] commit-graph: pass a 'struct repository *' in more places Taylor Blau
2020-09-09 15:23 ` [PATCH 04/12] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
2020-09-09 15:23 ` [PATCH 05/12] commit-graph: respect 'commitGraph.readChangedPaths' Taylor Blau
2020-09-09 15:23 ` [PATCH 06/12] commit-graph.c: store maximum changed paths Taylor Blau
2020-09-09 15:23 ` [PATCH 07/12] bloom: split 'get_bloom_filter()' in two Taylor Blau
2020-09-09 15:23 ` [PATCH 08/12] bloom: use provided 'struct bloom_filter_settings' Taylor Blau
2020-09-09 15:23 ` [PATCH 09/12] bloom/diff: properly short-circuit on max_changes Taylor Blau
2020-09-09 15:23 ` [PATCH 10/12] bloom: encode out-of-bounds filters as non-empty Taylor Blau
2020-09-10 3:35 ` Taylor Blau
2020-09-10 15:45 ` Taylor Blau
2020-09-11 18:15 ` Derrick Stolee
2020-09-09 15:23 ` [PATCH 11/12] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
2020-09-09 15:24 ` [PATCH 12/12] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
2020-09-11 17:52 ` Jeff King
2020-09-11 18:59 ` Taylor Blau
2020-09-11 19:25 ` Taylor Blau
2020-09-14 20:12 ` Taylor Blau
2020-09-14 20:31 ` Derrick Stolee
2020-09-14 20:36 ` Taylor Blau
2020-09-15 0:59 ` Derrick Stolee
2020-09-15 4:31 ` Taylor Blau
2020-09-15 21:49 ` Junio C Hamano
2020-09-15 21:53 ` Taylor Blau
2020-09-11 19:47 ` Jeff King
2020-09-11 19:31 ` Junio C Hamano
2020-09-16 18:06 ` [PATCH v2 00/13] more miscellaneous Bloom filter improvements, redux Taylor Blau
2020-09-16 18:06 ` [PATCH v2 01/13] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
2020-09-16 18:07 ` [PATCH v2 02/13] t4216: use an '&&'-chain Taylor Blau
2020-09-16 18:07 ` [PATCH v2 03/13] commit-graph: pass a 'struct repository *' in more places Taylor Blau
2020-09-16 18:07 ` [PATCH v2 04/13] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
2020-09-16 18:07 ` [PATCH v2 05/13] commit-graph: respect 'commitGraph.readChangedPaths' Taylor Blau
2020-09-16 18:07 ` [PATCH v2 06/13] commit-graph.c: store maximum changed paths Taylor Blau
2020-09-16 18:07 ` [PATCH v2 07/13] bloom: split 'get_bloom_filter()' in two Taylor Blau
2020-09-16 18:07 ` [PATCH v2 08/13] bloom: use provided 'struct bloom_filter_settings' Taylor Blau
2020-09-16 18:07 ` [PATCH v2 09/13] bloom/diff: properly short-circuit on max_changes Taylor Blau
2020-09-16 18:07 ` [PATCH v2 10/13] bloom: encode out-of-bounds filters as non-empty Taylor Blau
2020-09-17 22:13 ` SZEDER Gábor [this message]
2020-09-17 23:13 ` Taylor Blau
2020-09-18 0:52 ` Junio C Hamano
2020-09-18 1:15 ` Taylor Blau
2020-09-16 18:08 ` [PATCH v2 11/13] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
2020-09-16 18:08 ` [PATCH v2 12/13] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
2020-09-18 9:23 ` SZEDER Gábor
2020-09-18 13:27 ` Taylor Blau
2020-09-16 18:08 ` [PATCH v2 13/13] commit-graph: introduce 'commitGraph.maxNewFilters' Taylor Blau
2020-09-16 22:51 ` [PATCH v2 00/13] more miscellaneous Bloom filter improvements, redux Derrick Stolee
2020-09-16 23:07 ` Junio C Hamano
2020-09-17 0:45 ` Taylor Blau
2020-09-17 0:59 ` Junio C Hamano
2020-09-17 1:10 ` Taylor Blau
2020-09-17 13:34 ` Taylor Blau
2020-09-17 13:38 ` Derrick Stolee
2020-09-18 2:58 ` [PATCH v3 " Taylor Blau
2020-09-18 2:58 ` [PATCH v3 01/13] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
2020-09-18 2:58 ` [PATCH v3 02/13] t4216: use an '&&'-chain Taylor Blau
2020-09-18 2:59 ` [PATCH v3 03/13] commit-graph: pass a 'struct repository *' in more places Taylor Blau
2020-09-18 2:59 ` [PATCH v3 04/13] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
2020-09-18 2:59 ` [PATCH v3 05/13] commit-graph: respect 'commitGraph.readChangedPaths' Taylor Blau
2020-09-18 2:59 ` [PATCH v3 06/13] commit-graph.c: store maximum changed paths Taylor Blau
2020-09-18 2:59 ` [PATCH v3 07/13] bloom: split 'get_bloom_filter()' in two Taylor Blau
2020-09-18 2:59 ` [PATCH v3 08/13] bloom: use provided 'struct bloom_filter_settings' Taylor Blau
2020-09-18 16:27 ` SZEDER Gábor
2020-09-18 16:32 ` Taylor Blau
2020-09-18 2:59 ` [PATCH v3 09/13] bloom/diff: properly short-circuit on max_changes Taylor Blau
2020-09-18 2:59 ` [PATCH v3 10/13] bloom: encode out-of-bounds filters as non-empty Taylor Blau
2020-09-18 2:59 ` [PATCH v3 11/13] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
2020-09-18 2:59 ` [PATCH v3 12/13] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
2020-09-18 2:59 ` [PATCH v3 13/13] commit-graph: introduce 'commitGraph.maxNewFilters' Taylor Blau
2020-09-18 13:29 ` Taylor Blau
2020-09-18 17:43 ` Junio C Hamano
2020-09-18 13:31 ` [PATCH v3 00/13] more miscellaneous Bloom filter improvements, redux Taylor Blau
2020-09-18 13:34 ` Taylor Blau
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: http://vger.kernel.org/majordomo-info.html
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20200917221302.GC23146@szeder.dev \
--to=szeder.dev@gmail.com \
--cc=dstolee@microsoft.com \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=me@ttaylorr.com \
--cc=peff@peff.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).