git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / mirror / code / Atom feed
From: Taylor Blau <me@ttaylorr.com>
To: "SZEDER Gábor" <szeder.dev@gmail.com>
Cc: Taylor Blau <me@ttaylorr.com>,
	git@vger.kernel.org, dstolee@microsoft.com, gitster@pobox.com,
	peff@peff.net
Subject: Re: [PATCH v2 10/13] bloom: encode out-of-bounds filters as non-empty
Date: Thu, 17 Sep 2020 19:13:44 -0400	[thread overview]
Message-ID: <20200917231344.GA1591704@nand.local> (raw)
In-Reply-To: <20200917221302.GC23146@szeder.dev>

On Fri, Sep 18, 2020 at 12:13:02AM +0200, SZEDER Gábor wrote:
> > Clients will invoke the Bloom machinery in more cases than before, but
> > this can be addressed by returning a NULL filter when all bits are set
> > high. This can be addressed in a future patch.
>
> OTOH, clients will invoke the tree-diff machinery in fewer cases than
> before, because querying the Bloom filter of commits not modifying any
> files will now return "definitely not".

Absolutely right.

> > Finally, note that this does increase the size of on-disk commit-graphs,
> > but far less than other proposals. In particular, this is generally more
> > efficient than storing a bitmap for which commits haven't computed their
> > Bloom filters. Storing a bitmap incurs a penalty of one bit per commit,
> > whereas storing explicit filters as above incurs a penalty of one byte
> > per too-large or too-small commit.
>
> s/too-small/empty/

Fair enough, although I'm not planning to alter this or any other patch
now that it's picked up unless there's a real show-stopper (this doesn't
seem like one).

> > In practice, these boundary commits likely occupy a small proportion of
> > the overall number of commits, and so the size penalty is likely smaller
> > than storing a bitmap for all commits.
>
>                  |      Percentage of
>                  |    commits modifying
>                  |   0 path   |  >= 512 paths
>   ---------------+------------+----------------
>   android-base   |   13.20%   |   0.13%
>   cmssw          |    0.15%   |   0.23%
>   cpython        |    3.07%   |   0.01%
>   elasticsearch  |    0.70%   |   1.00%
>   gcc            |    0.00%   |   0.08%
>   gecko-dev      |    0.14%   |   0.64%
>   git            |    0.11%   |   0.02%
>   glibc          |    0.02%   |   0.10%
>   go             |    0.00%   |   0.07%
>   homebrew-cask  |    0.40%   |   0.02%
>   homebrew-core  |    0.01%   |   0.01%
>   jdk            |    0.26%   |   5.64%
>   linux          |    0.01%   |   0.51%
>   llvm-project   |    0.12%   |   0.03%
>   rails          |    0.10%   |   0.10%
>   rust           |    0.07%   |   0.17%
>   tensorflow     |    0.09%   |   1.02%
>   webkit         |    0.05%   |   0.31%

This is very useful information to have! Without the total number of
commits, it's impossible to know whether or not this is a win over the
BFXL chunk. But, since the number of commits is probably "large" versus
the percentage of boundary commits which is "small", it's almost
certainly an advantage.

> > A test to exercise filters which contain too many changed path entries
> > will be introduced in a subsequent patch.
>
>
> > diff --git a/bloom.h b/bloom.h
> > index c6d77e8393..70a8840896 100644
> > --- a/bloom.h
> > +++ b/bloom.h
> > @@ -93,6 +93,7 @@ enum bloom_filter_computed {
> >  	BLOOM_NOT_COMPUTED = (1 << 0),
> >  	BLOOM_COMPUTED     = (1 << 1),
> >  	BLOOM_TRUNC_LARGE  = (1 << 2),
> > +	BLOOM_TRUNC_SMALL  = (1 << 3),
>
> s/SMALL/EMPTY/
>
> This "small" suffix in the constant, variable, and trace2 key names is
> misleading, because we only mean empty commits.

I could buy that it might be misleading; I only picked this since it was
the opposite of "large". You could imagine that BLOOM_TRUNC_X means
"truncated in the direction of 'x'", but to be honest I don't think that
this matters.

I understand the churn of coming back to this after the topic has
already been merged creates more hassle, but frankly this series has
already gone on for quite a while, and it has been holding up important
bug fixes that are unrelated to the main feature.

So, I think that if it's truly misleading, we could revisit this after
the topic is merged. But, I'm not planning on changing anything at this
point.

Thanks,
Taylor

  reply	other threads:[~2020-09-17 23:13 UTC|newest]

Thread overview: 75+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-09-09 15:22 [PATCH 00/12] more miscellaneous Bloom filter improvements, redux Taylor Blau
2020-09-09 15:22 ` [PATCH 01/12] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
2020-09-09 15:22 ` [PATCH 02/12] t4216: use an '&&'-chain Taylor Blau
2020-09-09 15:22 ` [PATCH 03/12] commit-graph: pass a 'struct repository *' in more places Taylor Blau
2020-09-09 15:23 ` [PATCH 04/12] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
2020-09-09 15:23 ` [PATCH 05/12] commit-graph: respect 'commitGraph.readChangedPaths' Taylor Blau
2020-09-09 15:23 ` [PATCH 06/12] commit-graph.c: store maximum changed paths Taylor Blau
2020-09-09 15:23 ` [PATCH 07/12] bloom: split 'get_bloom_filter()' in two Taylor Blau
2020-09-09 15:23 ` [PATCH 08/12] bloom: use provided 'struct bloom_filter_settings' Taylor Blau
2020-09-09 15:23 ` [PATCH 09/12] bloom/diff: properly short-circuit on max_changes Taylor Blau
2020-09-09 15:23 ` [PATCH 10/12] bloom: encode out-of-bounds filters as non-empty Taylor Blau
2020-09-10  3:35   ` Taylor Blau
2020-09-10 15:45     ` Taylor Blau
2020-09-11 18:15       ` Derrick Stolee
2020-09-09 15:23 ` [PATCH 11/12] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
2020-09-09 15:24 ` [PATCH 12/12] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
2020-09-11 17:52   ` Jeff King
2020-09-11 18:59     ` Taylor Blau
2020-09-11 19:25       ` Taylor Blau
2020-09-14 20:12         ` Taylor Blau
2020-09-14 20:31           ` Derrick Stolee
2020-09-14 20:36             ` Taylor Blau
2020-09-15  0:59               ` Derrick Stolee
2020-09-15  4:31                 ` Taylor Blau
2020-09-15 21:49               ` Junio C Hamano
2020-09-15 21:53                 ` Taylor Blau
2020-09-11 19:47       ` Jeff King
2020-09-11 19:31     ` Junio C Hamano
2020-09-16 18:06 ` [PATCH v2 00/13] more miscellaneous Bloom filter improvements, redux Taylor Blau
2020-09-16 18:06   ` [PATCH v2 01/13] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
2020-09-16 18:07   ` [PATCH v2 02/13] t4216: use an '&&'-chain Taylor Blau
2020-09-16 18:07   ` [PATCH v2 03/13] commit-graph: pass a 'struct repository *' in more places Taylor Blau
2020-09-16 18:07   ` [PATCH v2 04/13] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
2020-09-16 18:07   ` [PATCH v2 05/13] commit-graph: respect 'commitGraph.readChangedPaths' Taylor Blau
2020-09-16 18:07   ` [PATCH v2 06/13] commit-graph.c: store maximum changed paths Taylor Blau
2020-09-16 18:07   ` [PATCH v2 07/13] bloom: split 'get_bloom_filter()' in two Taylor Blau
2020-09-16 18:07   ` [PATCH v2 08/13] bloom: use provided 'struct bloom_filter_settings' Taylor Blau
2020-09-16 18:07   ` [PATCH v2 09/13] bloom/diff: properly short-circuit on max_changes Taylor Blau
2020-09-16 18:07   ` [PATCH v2 10/13] bloom: encode out-of-bounds filters as non-empty Taylor Blau
2020-09-17 22:13     ` SZEDER Gábor
2020-09-17 23:13       ` Taylor Blau [this message]
2020-09-18  0:52         ` Junio C Hamano
2020-09-18  1:15           ` Taylor Blau
2020-09-16 18:08   ` [PATCH v2 11/13] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
2020-09-16 18:08   ` [PATCH v2 12/13] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
2020-09-18  9:23     ` SZEDER Gábor
2020-09-18 13:27       ` Taylor Blau
2020-09-16 18:08   ` [PATCH v2 13/13] commit-graph: introduce 'commitGraph.maxNewFilters' Taylor Blau
2020-09-16 22:51   ` [PATCH v2 00/13] more miscellaneous Bloom filter improvements, redux Derrick Stolee
2020-09-16 23:07     ` Junio C Hamano
2020-09-17  0:45       ` Taylor Blau
2020-09-17  0:59         ` Junio C Hamano
2020-09-17  1:10           ` Taylor Blau
2020-09-17 13:34             ` Taylor Blau
2020-09-17 13:38               ` Derrick Stolee
2020-09-18  2:58 ` [PATCH v3 " Taylor Blau
2020-09-18  2:58   ` [PATCH v3 01/13] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
2020-09-18  2:58   ` [PATCH v3 02/13] t4216: use an '&&'-chain Taylor Blau
2020-09-18  2:59   ` [PATCH v3 03/13] commit-graph: pass a 'struct repository *' in more places Taylor Blau
2020-09-18  2:59   ` [PATCH v3 04/13] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
2020-09-18  2:59   ` [PATCH v3 05/13] commit-graph: respect 'commitGraph.readChangedPaths' Taylor Blau
2020-09-18  2:59   ` [PATCH v3 06/13] commit-graph.c: store maximum changed paths Taylor Blau
2020-09-18  2:59   ` [PATCH v3 07/13] bloom: split 'get_bloom_filter()' in two Taylor Blau
2020-09-18  2:59   ` [PATCH v3 08/13] bloom: use provided 'struct bloom_filter_settings' Taylor Blau
2020-09-18 16:27     ` SZEDER Gábor
2020-09-18 16:32       ` Taylor Blau
2020-09-18  2:59   ` [PATCH v3 09/13] bloom/diff: properly short-circuit on max_changes Taylor Blau
2020-09-18  2:59   ` [PATCH v3 10/13] bloom: encode out-of-bounds filters as non-empty Taylor Blau
2020-09-18  2:59   ` [PATCH v3 11/13] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
2020-09-18  2:59   ` [PATCH v3 12/13] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
2020-09-18  2:59   ` [PATCH v3 13/13] commit-graph: introduce 'commitGraph.maxNewFilters' Taylor Blau
2020-09-18 13:29     ` Taylor Blau
2020-09-18 17:43       ` Junio C Hamano
2020-09-18 13:31   ` [PATCH v3 00/13] more miscellaneous Bloom filter improvements, redux Taylor Blau
2020-09-18 13:34     ` Taylor Blau

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200917231344.GA1591704@nand.local \
    --to=me@ttaylorr.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=peff@peff.net \
    --cc=szeder.dev@gmail.com \
    --subject='Re: [PATCH v2 10/13] bloom: encode out-of-bounds filters as non-empty' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Code repositories for project(s) associated with this inbox:

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).