git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Derrick Stolee <stolee@gmail.com>
To: Taylor Blau <me@ttaylorr.com>, git@vger.kernel.org
Cc: dstolee@microsoft.com, gitster@pobox.com, peff@peff.net,
	szeder.dev@gmail.com
Subject: Re: [PATCH 10/12] bloom: encode out-of-bounds filters as non-empty
Date: Fri, 11 Sep 2020 14:15:18 -0400	[thread overview]
Message-ID: <ee881087-76b7-901d-d75c-814ad508f90d@gmail.com> (raw)
In-Reply-To: <20200910154516.GA32117@nand.local>

On 9/10/2020 11:45 AM, Taylor Blau wrote:
> On Wed, Sep 09, 2020 at 11:35:57PM -0400, Taylor Blau wrote:
>> On Wed, Sep 09, 2020 at 11:23:48AM -0400, Taylor Blau wrote:
>>> -		filter->data = NULL;
>>> -		filter->len = 0;
>>> +		filter->data = xmalloc(1);
>>> +		filter->data[0] = 0xFF;
>>> +		filter->len = 1;
>>>
>>>  		if (computed)
>>>  			*computed |= BLOOM_TRUNC_LARGE;
>>
>> Oops, I missed the case that added by the previous patch where the
>> number of diff entries is smaller than the limit, but the hashmap
>> entries (after directories are added and such) crosses the threshold.
>>
>> Specifically, this patch doesn't write the 0xFF filter like it should.
>> I'll send a different version of this patch tomorrow.
> 
> This one should do the trick. Let's use it instead.
> 
> --- >8 ---
> 
> Subject: [PATCH] bloom: encode out-of-bounds filters as non-empty
> 
> When a changed-path Bloom filter has either zero, or more than a
> certain number (commonly 512) of entries, the commit-graph machinery
> encodes it as "missing". More specifically, it sets the indices adjacent
> in the BIDX chunk as equal to each other to indicate a "length 0"
> filter; that is, that the filter occupies zero bytes on disk.
> 
> This has heretofore been fine, since the commit-graph machinery has no
> need to care about these filters with too few or too many changed paths.
> Both cases act like no filter has been generated at all, and so there is
> no need to store them.
> 
> In a subsequent commit, however, the commit-graph machinery will learn
> to only compute Bloom filters for some commits in the current
> commit-graph layer. This is a change from the current implementation
> which computes Bloom filters for all commits that are in the layer being
> written. Critically for this patch, only computing some of the Bloom
> filters means adding a third state for length 0 Bloom filters: zero
> entries, too many entries, or "hasn't been computed".
> 
> It will be important for that future patch to distinguish between "not
> representable" (i.e., zero or too-many changed paths), and "hasn't been
> computed". In particular, we don't want to waste time recomputing
> filters that have already been computed.
> 
> To that end, change how we store Bloom filters in the "computed but not
> representable" category:
> 
>   - Bloom filters with no entries are stored as a single byte with all
>     bits low (i.e., all queries to that Bloom filter will return
>     "definitely not")
> 
>   - Bloom filters with too many entries are stored as a single byte with
>     all bits set high (i.e., all queries to that Bloom filter will
>     return "maybe").
> 
> These rules are sufficient to not incur a behavior change by changing
> the on-disk representation of these two classes. Likewise, no
> specification changes are necessary for the commit-graph format, either:
> 
>   - Filters that were previously empty will be recomputed and stored
>     according to the new rules, and
> 
>   - old clients reading filters generated by new clients will interpret
>     the filters correctly and be none the wiser to how they were
>     generated.
> 
> Clients will invoke the Bloom machinery in more cases than before, but
> this can be addressed by returning a NULL filter when all bits are set
> high. This can be addressed in a future patch.
> 
> Finally, note that this does increase the size of on-disk commit-graphs,
> but far less than other proposals. In particular, this is generally more
> efficient than storing a bitmap for which commits haven't computed their
> Bloom filters. Storing a bitmap incurs a penalty of one bit per commit,
> whereas storing explicit filters as above incurs a penalty of one byte
> per too-large or too-small commit.
> 
> In practice, these boundary commits likely occupy a small proportion of
> the overall number of commits, and so the size penalty is likely smaller
> than storing a bitmap for all commits.
> 
> A test to exercise filters which contain too many changed path entries
> will be introduced in a subsequent patch.
> 
> Suggested-by: SZEDER Gábor <szeder.dev@gmail.com>
> Suggested-by: Jakub Narębski <jnareb@gmail.com>
> Helped-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  .../technical/commit-graph-format.txt         |  2 +-
>  bloom.c                                       | 16 ++++++++++++--
>  bloom.h                                       |  1 +
>  commit-graph.c                                |  5 +++++
>  t/t0095-bloom.sh                              |  8 +++----
>  t/t4216-log-bloom.sh                          | 21 +++++++++++++++++--
>  6 files changed, 44 insertions(+), 9 deletions(-)
> 
> diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
> index 6ddbceba15..6585f1948a 100644
> --- a/Documentation/technical/commit-graph-format.txt
> +++ b/Documentation/technical/commit-graph-format.txt
> @@ -125,7 +125,7 @@ CHUNK DATA:
>      * The rest of the chunk is the concatenation of all the computed Bloom
>        filters for the commits in lexicographic order.
>      * Note: Commits with no changes or more than 512 changes have Bloom filters
> -      of length zero.
> +      of length one, with either all bits set to zero or one respectively.
>      * The BDAT chunk is present if and only if BIDX is present.
> 
>    Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional]
> diff --git a/bloom.c b/bloom.c
> index db9fb82437..d24747a1d5 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -177,6 +177,13 @@ static int pathmap_cmp(const void *hashmap_cmp_fn_data,
>  	return strcmp(e1->path, e2->path);
>  }
> 
> +static void init_truncated_large_filter(struct bloom_filter *filter)
> +{
> +	filter->data = xmalloc(1);
> +	filter->data[0] = 0xFF;
> +	filter->len = 1;
> +}
> +

So this hunk is essentially the new bit in this version.

> diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
> index 232ba2c485..7e4ab1795f 100755
> --- a/t/t0095-bloom.sh
> +++ b/t/t0095-bloom.sh
> @@ -71,8 +71,8 @@ test_expect_success 'get bloom filters for commit with no changes' '
>  	git init &&
>  	git commit --allow-empty -m "c0" &&
>  	cat >expect <<-\EOF &&
> -	Filter_Length:0
> -	Filter_Data:
> +	Filter_Length:1
> +	Filter_Data:00|
>  	EOF
>  	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
>  	test_cmp expect actual
> @@ -107,8 +107,8 @@ test_expect_success EXPENSIVE 'get bloom filter for commit with 513 changes' '
>  	git add bigDir &&
>  	git commit -m "commit with 513 changes" &&
>  	cat >expect <<-\EOF &&
> -	Filter_Length:0
> -	Filter_Data:
> +	Filter_Length:1
> +	Filter_Data:ff|
>  	EOF
>  	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
>  	test_cmp expect actual

And these two tests now show the new behavior. Nice.

This version matches my expectations. Thanks.

-Stolee

  reply	other threads:[~2020-09-11 18:15 UTC|newest]

Thread overview: 75+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-09-09 15:22 [PATCH 00/12] more miscellaneous Bloom filter improvements, redux Taylor Blau
2020-09-09 15:22 ` [PATCH 01/12] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
2020-09-09 15:22 ` [PATCH 02/12] t4216: use an '&&'-chain Taylor Blau
2020-09-09 15:22 ` [PATCH 03/12] commit-graph: pass a 'struct repository *' in more places Taylor Blau
2020-09-09 15:23 ` [PATCH 04/12] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
2020-09-09 15:23 ` [PATCH 05/12] commit-graph: respect 'commitGraph.readChangedPaths' Taylor Blau
2020-09-09 15:23 ` [PATCH 06/12] commit-graph.c: store maximum changed paths Taylor Blau
2020-09-09 15:23 ` [PATCH 07/12] bloom: split 'get_bloom_filter()' in two Taylor Blau
2020-09-09 15:23 ` [PATCH 08/12] bloom: use provided 'struct bloom_filter_settings' Taylor Blau
2020-09-09 15:23 ` [PATCH 09/12] bloom/diff: properly short-circuit on max_changes Taylor Blau
2020-09-09 15:23 ` [PATCH 10/12] bloom: encode out-of-bounds filters as non-empty Taylor Blau
2020-09-10  3:35   ` Taylor Blau
2020-09-10 15:45     ` Taylor Blau
2020-09-11 18:15       ` Derrick Stolee [this message]
2020-09-09 15:23 ` [PATCH 11/12] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
2020-09-09 15:24 ` [PATCH 12/12] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
2020-09-11 17:52   ` Jeff King
2020-09-11 18:59     ` Taylor Blau
2020-09-11 19:25       ` Taylor Blau
2020-09-14 20:12         ` Taylor Blau
2020-09-14 20:31           ` Derrick Stolee
2020-09-14 20:36             ` Taylor Blau
2020-09-15  0:59               ` Derrick Stolee
2020-09-15  4:31                 ` Taylor Blau
2020-09-15 21:49               ` Junio C Hamano
2020-09-15 21:53                 ` Taylor Blau
2020-09-11 19:47       ` Jeff King
2020-09-11 19:31     ` Junio C Hamano
2020-09-16 18:06 ` [PATCH v2 00/13] more miscellaneous Bloom filter improvements, redux Taylor Blau
2020-09-16 18:06   ` [PATCH v2 01/13] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
2020-09-16 18:07   ` [PATCH v2 02/13] t4216: use an '&&'-chain Taylor Blau
2020-09-16 18:07   ` [PATCH v2 03/13] commit-graph: pass a 'struct repository *' in more places Taylor Blau
2020-09-16 18:07   ` [PATCH v2 04/13] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
2020-09-16 18:07   ` [PATCH v2 05/13] commit-graph: respect 'commitGraph.readChangedPaths' Taylor Blau
2020-09-16 18:07   ` [PATCH v2 06/13] commit-graph.c: store maximum changed paths Taylor Blau
2020-09-16 18:07   ` [PATCH v2 07/13] bloom: split 'get_bloom_filter()' in two Taylor Blau
2020-09-16 18:07   ` [PATCH v2 08/13] bloom: use provided 'struct bloom_filter_settings' Taylor Blau
2020-09-16 18:07   ` [PATCH v2 09/13] bloom/diff: properly short-circuit on max_changes Taylor Blau
2020-09-16 18:07   ` [PATCH v2 10/13] bloom: encode out-of-bounds filters as non-empty Taylor Blau
2020-09-17 22:13     ` SZEDER Gábor
2020-09-17 23:13       ` Taylor Blau
2020-09-18  0:52         ` Junio C Hamano
2020-09-18  1:15           ` Taylor Blau
2020-09-16 18:08   ` [PATCH v2 11/13] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
2020-09-16 18:08   ` [PATCH v2 12/13] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
2020-09-18  9:23     ` SZEDER Gábor
2020-09-18 13:27       ` Taylor Blau
2020-09-16 18:08   ` [PATCH v2 13/13] commit-graph: introduce 'commitGraph.maxNewFilters' Taylor Blau
2020-09-16 22:51   ` [PATCH v2 00/13] more miscellaneous Bloom filter improvements, redux Derrick Stolee
2020-09-16 23:07     ` Junio C Hamano
2020-09-17  0:45       ` Taylor Blau
2020-09-17  0:59         ` Junio C Hamano
2020-09-17  1:10           ` Taylor Blau
2020-09-17 13:34             ` Taylor Blau
2020-09-17 13:38               ` Derrick Stolee
2020-09-18  2:58 ` [PATCH v3 " Taylor Blau
2020-09-18  2:58   ` [PATCH v3 01/13] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
2020-09-18  2:58   ` [PATCH v3 02/13] t4216: use an '&&'-chain Taylor Blau
2020-09-18  2:59   ` [PATCH v3 03/13] commit-graph: pass a 'struct repository *' in more places Taylor Blau
2020-09-18  2:59   ` [PATCH v3 04/13] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
2020-09-18  2:59   ` [PATCH v3 05/13] commit-graph: respect 'commitGraph.readChangedPaths' Taylor Blau
2020-09-18  2:59   ` [PATCH v3 06/13] commit-graph.c: store maximum changed paths Taylor Blau
2020-09-18  2:59   ` [PATCH v3 07/13] bloom: split 'get_bloom_filter()' in two Taylor Blau
2020-09-18  2:59   ` [PATCH v3 08/13] bloom: use provided 'struct bloom_filter_settings' Taylor Blau
2020-09-18 16:27     ` SZEDER Gábor
2020-09-18 16:32       ` Taylor Blau
2020-09-18  2:59   ` [PATCH v3 09/13] bloom/diff: properly short-circuit on max_changes Taylor Blau
2020-09-18  2:59   ` [PATCH v3 10/13] bloom: encode out-of-bounds filters as non-empty Taylor Blau
2020-09-18  2:59   ` [PATCH v3 11/13] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
2020-09-18  2:59   ` [PATCH v3 12/13] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
2020-09-18  2:59   ` [PATCH v3 13/13] commit-graph: introduce 'commitGraph.maxNewFilters' Taylor Blau
2020-09-18 13:29     ` Taylor Blau
2020-09-18 17:43       ` Junio C Hamano
2020-09-18 13:31   ` [PATCH v3 00/13] more miscellaneous Bloom filter improvements, redux Taylor Blau
2020-09-18 13:34     ` Taylor Blau

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ee881087-76b7-901d-d75c-814ad508f90d@gmail.com \
    --to=stolee@gmail.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=me@ttaylorr.com \
    --cc=peff@peff.net \
    --cc=szeder.dev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).