git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "SZEDER Gábor" <szeder.dev@gmail.com>
To: Jeff King <peff@peff.net>
Cc: "Derrick Stolee" <stolee@gmail.com>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"Stefan Beller" <sbeller@google.com>, git <git@vger.kernel.org>,
	"Duy Nguyen" <pclouds@gmail.com>
Subject: Re: We should add a "git gc --auto" after "git clone" due to commit graph
Date: Tue, 9 Oct 2018 23:30:07 +0200	[thread overview]
Message-ID: <20181009213007.GB23446@szeder.dev> (raw)
In-Reply-To: <20181009030803.GA6250@sigill.intra.peff.net>

On Mon, Oct 08, 2018 at 11:08:03PM -0400, Jeff King wrote:
> I'd have done it as one fixed-size filter per commit. Then you should be
> able to hash the path keys once, and apply the result as a bitwise query
> to each individual commit (I'm assuming that it's constant-time to
> access the filter for each, as an index into an mmap'd array, with the
> offset coming from a commit-graph entry we'd be able to look up anyway).

I used one big Bloom filter for the whole history, because that was
the simplest way to get going, and because I was primarily interested
in the potential benefits instead of the cost of generating and
maintaining it.

Using a 8MB filter for git.git results in a false positive rate
between 0.21% - 0.53%.  Splitting that up for ~53k commits we get ~160
bytes for each.  On first sight that seems like rather small, but
running a bit statistics shows that 99% of our commits don't change
more than 10 files.

One advantage of the "one Bloom filter for each commit" is that if a
commit doesn't have a corresponding Bloom filter, then, well, we can't
query the non-existing filter.  OTOH, with one big Bloom filter we
have to be careful to only ever query it with commits whose changes
have already been added, otherwise we can get false negatives.

> I think it would also be easier to deal with maintenance, since each
> filter is independent (IIRC, you cannot delete from a bloom filter
> without re-adding all of the other keys).

Accumulating entries related to unreachable commits will eventually
increase the false positive rate, but otherwise it won't cause false
negatives, and won't increase the size of the Bloom filter or the time
necessary to query it.  So not deleting those entries right away is
not an issue, and I think it could be postponed until bigger gc runs.

[...]

> But there's also a related question: how do we match pathspec patterns?
> For a changed path like "foo/bar/baz", I imagine a bloom filter would
> mark all of "foo", "foo/bar", and "foo/bar/baz".

Indeed, that's what I did.

> But what about "*.c"? I
> don't think a bloom filter can answer that.

Surely not, but it could easily return "maybe", and thus simply fall
back to look at the diff.

However, I've looked through the output of

  grep '^git log[^|]*[\[*?]' ~/.bash_history

and haven't found a single case where I used Git's globbing.  When I
did use globbing, then I always used the shell's.  Yeah, just one data
point, and others surely use it differently, etc...  but I think we
should consider whether it's common enough to worry about and to
increase complexity because of it.

[...]

> So let's imagine we'd store such a cache external to the regular object
> data (i.e., as a commit-graph entry). The "log --raw" diff of linux.git
> has 1.7M entries. The paths should easily compress to a single 32-bit
> integer (e.g., as an index into a big path list). The oids are 20 bytes.
> Add a few bytes for modes. That's about 80MB. Big, but not impossibly
> so. Maybe pushing it for true gigantic repos, though.

In my experiments with the Linux repo a 256MB Bloom filter has ~0.3%
false positive rate, while a 128MB filter had 3-4%.  Even bigger,
though compared to the size of the checkout of the full kernel tree,
arguably not that much.

> Those numbers are ignoring merges, too. The meaning of "did this commit
> touch that path" is a lot trickier for a merge commit, and I think may
> depend on context. I'm not sure how even a bloom filter solution would
> handle that (I was assuming we'd mostly punt and let merges fall back to
> opening up the trees).

During revision walking rev_compare_tree() checks whether the given
paths changed between a commit and _one_ of its parents, and in case
of merge commits it's invoked multiple times with the same commit but
with different parent parameters.  By storing (changed-path,
parent-oid, commit-oid) tuples in the Bloom filter it can deal with
merges, too.


  parent reply	other threads:[~2018-10-09 21:30 UTC|newest]

Thread overview: 78+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-10-03 13:23 We should add a "git gc --auto" after "git clone" due to commit graph Ævar Arnfjörð Bjarmason
2018-10-03 13:36 ` SZEDER Gábor
2018-10-03 13:42   ` Derrick Stolee
2018-10-03 14:18     ` Ævar Arnfjörð Bjarmason
2018-10-03 14:01   ` Ævar Arnfjörð Bjarmason
2018-10-03 14:17     ` SZEDER Gábor
2018-10-03 14:22       ` Ævar Arnfjörð Bjarmason
2018-10-03 14:53         ` SZEDER Gábor
2018-10-03 15:19           ` Ævar Arnfjörð Bjarmason
2018-10-03 16:59             ` SZEDER Gábor
2018-10-05  6:09               ` Junio C Hamano
2018-10-10 22:07                 ` SZEDER Gábor
2018-10-10 23:01                   ` Ævar Arnfjörð Bjarmason
2018-10-03 19:08           ` Stefan Beller
2018-10-03 19:21             ` Jeff King
2018-10-03 20:35               ` Ævar Arnfjörð Bjarmason
2018-10-03 17:47         ` Stefan Beller
2018-10-03 18:47           ` Ævar Arnfjörð Bjarmason
2018-10-03 18:51             ` Jeff King
2018-10-03 18:59               ` Derrick Stolee
2018-10-03 19:18                 ` Jeff King
2018-10-08 16:41                   ` SZEDER Gábor
2018-10-08 16:57                     ` Derrick Stolee
2018-10-08 18:10                       ` SZEDER Gábor
2018-10-08 18:29                         ` Derrick Stolee
2018-10-09  3:08                           ` Jeff King
2018-10-09 13:48                             ` Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph) Derrick Stolee
2018-10-09 18:45                               ` Ævar Arnfjörð Bjarmason
2018-10-09 18:46                               ` Jeff King
2018-10-09 19:03                                 ` Derrick Stolee
2018-10-09 21:14                                   ` Jeff King
2018-10-09 23:12                                     ` Bloom Filters Jeff King
2018-10-09 23:13                                       ` [PoC -- do not apply 1/3] initial tree-bitmap proof of concept Jeff King
2018-10-09 23:14                                       ` [PoC -- do not apply 2/3] test-tree-bitmap: add "dump" mode Jeff King
2018-10-10  0:48                                         ` Junio C Hamano
2018-10-11  3:13                                           ` Jeff King
2018-10-09 23:14                                       ` [PoC -- do not apply 3/3] test-tree-bitmap: replace ewah with custom rle encoding Jeff King
2018-10-10  0:58                                         ` Junio C Hamano
2018-10-11  3:20                                           ` Jeff King
2018-10-11 12:33                                       ` Bloom Filters Derrick Stolee
2018-10-11 13:43                                         ` Jeff King
2018-10-09 21:30                             ` SZEDER Gábor [this message]
2018-10-09 19:34                       ` [PATCH 0/4] Bloom filter experiment SZEDER Gábor
2018-10-09 19:34                         ` [PATCH 1/4] Add a (very) barebones Bloom filter implementation SZEDER Gábor
2018-10-09 19:34                         ` [PATCH 2/4] commit-graph: write a Bloom filter containing changed paths for each commit SZEDER Gábor
2018-10-09 21:06                           ` Jeff King
2018-10-09 21:37                             ` SZEDER Gábor
2018-10-09 19:34                         ` [PATCH 3/4] revision.c: use the Bloom filter to speed up path-limited revision walks SZEDER Gábor
2018-10-09 19:34                         ` [PATCH 4/4] revision.c: add GIT_TRACE_BLOOM_FILTER for a bit of statistics SZEDER Gábor
2018-10-09 19:47                         ` [PATCH 0/4] Bloom filter experiment Derrick Stolee
2018-10-11  1:21                         ` [PATCH 0/2] Per-commit filter proof of concept Jonathan Tan
2018-10-11  1:21                           ` [PATCH 1/2] One filter per commit Jonathan Tan
2018-10-11 12:49                             ` Derrick Stolee
2018-10-11 19:11                               ` [PATCH] Per-commit and per-parent filters for 2 parents Jonathan Tan
2018-10-11  1:21                           ` [PATCH 2/2] Only make bloom filter for first parent Jonathan Tan
2018-10-11  7:37                           ` [PATCH 0/2] Per-commit filter proof of concept Ævar Arnfjörð Bjarmason
2018-10-15 14:39                         ` [PATCH 0/4] Bloom filter experiment Derrick Stolee
2018-10-16  4:45                           ` Junio C Hamano
2018-10-16 11:13                             ` Derrick Stolee
2018-10-16 12:57                               ` Ævar Arnfjörð Bjarmason
2018-10-16 13:03                                 ` Derrick Stolee
2018-10-18  2:00                                 ` Junio C Hamano
2018-10-16 23:41                           ` Jonathan Tan
2018-10-08 23:02                     ` We should add a "git gc --auto" after "git clone" due to commit graph Junio C Hamano
2018-10-03 14:32     ` Duy Nguyen
2018-10-03 16:45 ` Duy Nguyen
2018-10-04 21:42 ` [RFC PATCH] " Ævar Arnfjörð Bjarmason
2018-10-05 12:05   ` Derrick Stolee
2018-10-05 13:05     ` Ævar Arnfjörð Bjarmason
2018-10-05 13:45       ` Derrick Stolee
2018-10-05 14:04         ` Ævar Arnfjörð Bjarmason
2018-10-05 19:21         ` Jeff King
2018-10-05 19:41           ` Derrick Stolee
2018-10-05 19:47             ` Jeff King
2018-10-05 20:00               ` Derrick Stolee
2018-10-05 20:02                 ` Jeff King
2018-10-05 20:01               ` Ævar Arnfjörð Bjarmason
2018-10-05 20:09                 ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20181009213007.GB23446@szeder.dev \
    --to=szeder.dev@gmail.com \
    --cc=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=pclouds@gmail.com \
    --cc=peff@peff.net \
    --cc=sbeller@google.com \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).