git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Derrick Stolee <stolee@gmail.com>
To: "Taylor Blau" <me@ttaylorr.com>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Cc: git@vger.kernel.org, dstolee@microsoft.com, gitster@pobox.com,
	peff@peff.net, szeder.dev@gmail.com
Subject: Re: Making split commit graphs pick up new options (namely --changed-paths)
Date: Thu, 10 Jun 2021 14:21:16 -0400	[thread overview]
Message-ID: <39a675dc-d6fd-c81c-3d73-c1b1a163f10d@gmail.com> (raw)
In-Reply-To: <YMJKcHpN/gL5U6KK@nand.local>

On 6/10/2021 1:22 PM, Taylor Blau wrote:
> On Thu, Jun 10, 2021 at 12:40:33PM +0200, Ævar Arnfjörð Bjarmason wrote:
...
>> Reading the code there seems to be no way to do that, and we have the
>> "chunk_bloom_data" in the graph, as well as "bloom_filter_settings".
>>
>> I'd expect some way to combine the "max_new_filters" and --split with
>> some eventual-consistency logic so that graphs not matching our current
>> settings are replaced, or replaced some <limit> at a time.
> 
> This is asking about something slightly different, Bloom filter
> settings rather than the existence of chagned-path Bloom filters
> themselves. The Bloom settings aren't written to the commit-graph
> although there has been some discussion about doing this in the past.

Some of the settings are included, but not the "maximum size" of a
filter. Thus, if that maximum size changes we do not have a way to
determine if a missing filter is larger than that limit or not.
Further, the existing filters might be larger than the new maximum
which means we would need to check if some of those filters should
be dropped.

Here is the spec of the BDAT chunk:

Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]
    * It starts with header consisting of three unsigned 32-bit integers:
      - Version of the hash algorithm being used. We currently only support
	value 1 which corresponds to the 32-bit version of the murmur3 hash
	implemented exactly as described in
	https://en.wikipedia.org/wiki/MurmurHash#Algorithm and the double
	hashing technique using seed values 0x293ae76f and 0x7e646e2 as
	described in https://doi.org/10.1007/978-3-540-30494-4_26 "Bloom Filters
	in Probabilistic Verification"
      - The number of times a path is hashed and hence the number of bit positions
	      that cumulatively determine whether a file is present in the commit.
      - The minimum number of bits 'b' per entry in the Bloom filter. If the filter
	      contains 'n' entries, then the filter size is the minimum number of 64-bit
	      words that contain n*b bits.
    * The rest of the chunk is the concatenation of all the computed Bloom
      filters for the commits in lexicographic order.

> If we ever did encode the Bloom settings, I imagine that accomplishing a
> sort of "eventually replace all changed-path Bloom filters with these
> new settings" would be as simple as considering all filters computed
> under different settings to be "uncomputed".
> 
>> Also, am I reading the expire_commit_graphs() logic correctly that we
>> first write the split graph, and then unlink() things that are too old?
>> I.e. if you rely on the commit-graph to optimize things this will make
>> things slower until the next run of writing the graph?
> 
> Before expire_commit_graphs() is called, we call mark_commit_graphs()
> which freshens the mtimes of all surviving commit-graph layers, and then
> expire_commit_graphs() removes the stale layers. I'm not sure what
> things getting slower is referring to since the resulting commit-graph
> has at least as many commits as the commit-graph that existed prior to
> the write.
> 
>> I expected to find something more gentle there [...]
> 
> FWIW, I also find this "expire based on mtimes" thing a little odd for
> writing split commit-graphs because we know exactly which layers we want
> to get rid of. I suspect that the reuse comes from wanting to unify the
> logic for handling '--expire-time' with the expiration that happens
> after writing a split commit-graph that merged two or more previous
> layers.
> 
> I would probably change mark_commit_graphs() to remove those merged
> layers explicitly (but still run expire_commit_graphs() to handle
> --expire-time). But, come to think of it... if merging >2 layers already
> causes the merged layers to be removed, then why would you ever set an
> --expire-time yourself?

The --expire-time is intended to leave the old layers around a while
so concurrent processes who already parsed the commit-graph-chain file
can discover the layers it referenced. It's not a perfect mechanism, so
there is room for improvement here.

Thanks,
-Stolee

  reply	other threads:[~2021-06-10 18:22 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-06-10 10:40 Making split commit graphs pick up new options (namely --changed-paths) Ævar Arnfjörð Bjarmason
2021-06-10 17:22 ` Taylor Blau
2021-06-10 18:21   ` Derrick Stolee [this message]
2021-06-10 23:56   ` Ævar Arnfjörð Bjarmason
2021-06-11  0:50     ` Taylor Blau
2021-06-11 17:47       ` Derrick Stolee
2021-06-11 19:01         ` Taylor Blau
2021-06-15 14:21           ` Derrick Stolee
2021-06-15 14:35             ` Ævar Arnfjörð Bjarmason
2021-06-16  1:45               ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=39a675dc-d6fd-c81c-3d73-c1b1a163f10d@gmail.com \
    --to=stolee@gmail.com \
    --cc=avarab@gmail.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=me@ttaylorr.com \
    --cc=peff@peff.net \
    --cc=szeder.dev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).