git@vger.kernel.org mailing list mirror (one of many)
* We should add a "git gc --auto" after "git clone" due to commit graph
@ 2018-10-03 13:23 Ævar Arnfjörð Bjarmason
  2018-10-03 13:36 ` SZEDER Gábor
                   ` (2 more replies)
  0 siblings, 3 replies; 78+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-10-03 13:23 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Git List, Nguyễn Thái Ngọc Duy

Don't have time to patch this now, but thought I'd send a note / RFC
about this.

Now that we have the commit graph it's nice to be able to set
e.g. core.commitGraph=true & gc.writeCommitGraph=true in ~/.gitconfig or
/etc/gitconfig to apply them to all repos.
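
For reference, a minimal sketch of setting these globally (just the two
settings named above, nothing more):

    git config --global core.commitGraph true
    git config --global gc.writeCommitGraph true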

But when I clone e.g. linux.git stuff like 'tag --contains' will be slow
until whenever my first "gc" kicks in, which may be quite some time if
I'm just using it passively.

So we should make "git gc --auto" be run on clone, and change the
need_to_gc() / cmd_gc() behavior so that we detect that the
gc.writeCommitGraph=true setting is on, but we have no commit graph, and
then just generate that without doing a full repack.

As an aside such more granular "gc" would be nice for e.g. pack-refs
too. It's possible for us to just have one pack, but to have 100k loose
refs.

It might also be good to have some gc.autoDetachOnClone option and have
it false by default, so we don't have a race condition where "clone
linux && git -C linux tag --contains" is slow because the graph hasn't
been generated yet, and generating the graph initially doesn't take that
long compared to the time to clone a large repo (and on a small one it
won't matter either way).

I was going to say "also for midx", but of course after clone we have
just one pack, so I can't imagine us needing this. But I can see us
having other such optional side-indexes in the future generated by gc,
and they'd also benefit from this.

#leftoverbits


* Re: We should add a "git gc --auto" after "git clone" due to commit graph
  2018-10-03 13:23 We should add a "git gc --auto" after "git clone" due to commit graph Ævar Arnfjörð Bjarmason
@ 2018-10-03 13:36 ` SZEDER Gábor
  2018-10-03 13:42   ` Derrick Stolee
  2018-10-03 14:01   ` Ævar Arnfjörð Bjarmason
  2018-10-03 16:45 ` Duy Nguyen
  2018-10-04 21:42 ` [RFC PATCH] " Ævar Arnfjörð Bjarmason
  2 siblings, 2 replies; 78+ messages in thread
From: SZEDER Gábor @ 2018-10-03 13:36 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Derrick Stolee, Git List, Nguyễn Thái Ngọc Duy

On Wed, Oct 03, 2018 at 03:23:57PM +0200, Ævar Arnfjörð Bjarmason wrote:
> Don't have time to patch this now, but thought I'd send a note / RFC
> about this.
> 
> Now that we have the commit graph it's nice to be able to set
> e.g. core.commitGraph=true & gc.writeCommitGraph=true in ~/.gitconfig or
> /etc/gitconfig to apply them to all repos.
> 
> But when I clone e.g. linux.git stuff like 'tag --contains' will be slow
> until whenever my first "gc" kicks in, which may be quite some time if
> I'm just using it passively.
> 
> So we should make "git gc --auto" be run on clone,

There is no garbage after 'git clone'...

> and change the
> need_to_gc() / cmd_gc() behavior so that we detect that the
> gc.writeCommitGraph=true setting is on, but we have no commit graph, and
> then just generate that without doing a full repack.

Or just teach 'git clone' to run 'git commit-graph write ...'
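
A sketch of such an invocation (the arguments behind the "..." are left
unspecified above, so the flags here are only a guess):

    git -C <clone-dir> commit-graph write --reachable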

> As an aside such more granular "gc" would be nice for e.g. pack-refs
> too. It's possible for us to just have one pack, but to have 100k loose
> refs.
> 
> It might also be good to have some gc.autoDetachOnClone option and have
> it false by default, so we don't have a race condition where "clone
> linux && git -C linux tag --contains" is slow because the graph hasn't
> been generated yet, and generating the graph initially doesn't take that
> long compared to the time to clone a large repo (and on a small one it
> won't matter either way).
> 
> I was going to say "also for midx", but of course after clone we have
> just one pack, so I can't imagine us needing this. But I can see us
> having other such optional side-indexes in the future generated by gc,
> and they'd also benefit from this.
> 
> #leftoverbits


* Re: We should add a "git gc --auto" after "git clone" due to commit graph
  2018-10-03 13:36 ` SZEDER Gábor
@ 2018-10-03 13:42   ` Derrick Stolee
  2018-10-03 14:18     ` Ævar Arnfjörð Bjarmason
  2018-10-03 14:01   ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 78+ messages in thread
From: Derrick Stolee @ 2018-10-03 13:42 UTC (permalink / raw)
  To: SZEDER Gábor, Ævar Arnfjörð Bjarmason
  Cc: Git List, Nguyễn Thái Ngọc Duy

On 10/3/2018 9:36 AM, SZEDER Gábor wrote:
> On Wed, Oct 03, 2018 at 03:23:57PM +0200, Ævar Arnfjörð Bjarmason wrote:
>> Don't have time to patch this now, but thought I'd send a note / RFC
>> about this.
>>
>> Now that we have the commit graph it's nice to be able to set
>> e.g. core.commitGraph=true & gc.writeCommitGraph=true in ~/.gitconfig or
>> /etc/gitconfig to apply them to all repos.
>>
>> But when I clone e.g. linux.git stuff like 'tag --contains' will be slow
>> until whenever my first "gc" kicks in, which may be quite some time if
>> I'm just using it passively.
>>
>> So we should make "git gc --auto" be run on clone,
> There is no garbage after 'git clone'...

And since there is no garbage, the gc will not write the commit-graph.

>
>> and change the
>> need_to_gc() / cmd_gc() behavior so that we detect that the
>> gc.writeCommitGraph=true setting is on, but we have no commit graph, and
>> then just generate that without doing a full repack.
> Or just teach 'git clone' to run 'git commit-graph write ...'

I plan to add a 'fetch.writeCommitGraph' config setting. I was waiting 
until the file is incremental (on my to-do list soon), so the write is 
fast when only adding a few commits at a time. This would cover the 
clone case, too.

Thanks,
-Stolee


* Re: We should add a "git gc --auto" after "git clone" due to commit graph
  2018-10-03 13:36 ` SZEDER Gábor
  2018-10-03 13:42   ` Derrick Stolee
@ 2018-10-03 14:01   ` Ævar Arnfjörð Bjarmason
  2018-10-03 14:17     ` SZEDER Gábor
  2018-10-03 14:32     ` Duy Nguyen
  1 sibling, 2 replies; 78+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-10-03 14:01 UTC (permalink / raw)
  To: SZEDER Gábor
  Cc: Derrick Stolee, Git List, Nguyễn Thái Ngọc Duy


On Wed, Oct 03 2018, SZEDER Gábor wrote:

> On Wed, Oct 03, 2018 at 03:23:57PM +0200, Ævar Arnfjörð Bjarmason wrote:
>> Don't have time to patch this now, but thought I'd send a note / RFC
>> about this.
>>
>> Now that we have the commit graph it's nice to be able to set
>> e.g. core.commitGraph=true & gc.writeCommitGraph=true in ~/.gitconfig or
>> /etc/gitconfig to apply them to all repos.
>>
>> But when I clone e.g. linux.git stuff like 'tag --contains' will be slow
>> until whenever my first "gc" kicks in, which may be quite some time if
>> I'm just using it passively.
>>
>> So we should make "git gc --auto" be run on clone,
>
> There is no garbage after 'git clone'...

"git gc" is really "git gc-or-create-indexes" these days.

>> and change the
>> need_to_gc() / cmd_gc() behavior so that we detect that the
>> gc.writeCommitGraph=true setting is on, but we have no commit graph, and
>> then just generate that without doing a full repack.
>
> Or just teach 'git clone' to run 'git commit-graph write ...'

Then when adding something like the commit graph we'd need to patch both
git-clone and git-gc, it's much more straightforward to make
need_to_gc() more granular.

>> As an aside such more granular "gc" would be nice for e.g. pack-refs
>> too. It's possible for us to just have one pack, but to have 100k loose
>> refs.
>>
>> It might also be good to have some gc.autoDetachOnClone option and have
>> it false by default, so we don't have a race condition where "clone
>> linux && git -C linux tag --contains" is slow because the graph hasn't
>> been generated yet, and generating the graph initially doesn't take that
>> long compared to the time to clone a large repo (and on a small one it
>> won't matter either way).
>>
>> I was going to say "also for midx", but of course after clone we have
>> just one pack, so I can't imagine us needing this. But I can see us
>> having other such optional side-indexes in the future generated by gc,
>> and they'd also benefit from this.
>>
>> #leftoverbits


* Re: We should add a "git gc --auto" after "git clone" due to commit graph
  2018-10-03 14:01   ` Ævar Arnfjörð Bjarmason
@ 2018-10-03 14:17     ` SZEDER Gábor
  2018-10-03 14:22       ` Ævar Arnfjörð Bjarmason
  2018-10-03 14:32     ` Duy Nguyen
  1 sibling, 1 reply; 78+ messages in thread
From: SZEDER Gábor @ 2018-10-03 14:17 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Derrick Stolee, Git List, Nguyễn Thái Ngọc Duy

On Wed, Oct 03, 2018 at 04:01:40PM +0200, Ævar Arnfjörð Bjarmason wrote:
> 
> On Wed, Oct 03 2018, SZEDER Gábor wrote:
> 
> > On Wed, Oct 03, 2018 at 03:23:57PM +0200, Ævar Arnfjörð Bjarmason wrote:
> >> Don't have time to patch this now, but thought I'd send a note / RFC
> >> about this.
> >>
> >> Now that we have the commit graph it's nice to be able to set
> >> e.g. core.commitGraph=true & gc.writeCommitGraph=true in ~/.gitconfig or
> >> /etc/gitconfig to apply them to all repos.
> >>
> >> But when I clone e.g. linux.git stuff like 'tag --contains' will be slow
> >> until whenever my first "gc" kicks in, which may be quite some time if
> >> I'm just using it passively.
> >>
> >> So we should make "git gc --auto" be run on clone,
> >
> > There is no garbage after 'git clone'...
> 
> "git gc" is really "git gc-or-create-indexes" these days.

Because it happens to be convenient to create those indexes at
gc-time.  But that should not be an excuse to run gc when by
definition no gc is needed.

> >> and change the
> >> need_to_gc() / cmd_gc() behavior so that we detect that the
> >> gc.writeCommitGraph=true setting is on, but we have no commit graph, and
> >> then just generate that without doing a full repack.
> >
> > Or just teach 'git clone' to run 'git commit-graph write ...'
> 
> Then when adding something like the commit graph we'd need to patch both
> git-clone and git-gc, it's much more straightforward to make
> need_to_gc() more granular.
> 
> >> As an aside such more granular "gc" would be nice for e.g. pack-refs
> >> too. It's possible for us to just have one pack, but to have 100k loose
> >> refs.
> >>
> >> It might also be good to have some gc.autoDetachOnClone option and have
> >> it false by default, so we don't have a race condition where "clone
> >> linux && git -C linux tag --contains" is slow because the graph hasn't
> >> been generated yet, and generating the graph initially doesn't take that
> >> long compared to the time to clone a large repo (and on a small one it
> >> won't matter either way).
> >>
> >> I was going to say "also for midx", but of course after clone we have
> >> just one pack, so I can't imagine us needing this. But I can see us
> >> having other such optional side-indexes in the future generated by gc,
> >> and they'd also benefit from this.
> >>
> >> #leftoverbits


* Re: We should add a "git gc --auto" after "git clone" due to commit graph
  2018-10-03 13:42   ` Derrick Stolee
@ 2018-10-03 14:18     ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 78+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-10-03 14:18 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: SZEDER Gábor, Git List,
	Nguyễn Thái Ngọc Duy


On Wed, Oct 03 2018, Derrick Stolee wrote:

> On 10/3/2018 9:36 AM, SZEDER Gábor wrote:
>> On Wed, Oct 03, 2018 at 03:23:57PM +0200, Ævar Arnfjörð Bjarmason wrote:
>>> Don't have time to patch this now, but thought I'd send a note / RFC
>>> about this.
>>>
>>> Now that we have the commit graph it's nice to be able to set
>>> e.g. core.commitGraph=true & gc.writeCommitGraph=true in ~/.gitconfig or
>>> /etc/gitconfig to apply them to all repos.
>>>
>>> But when I clone e.g. linux.git stuff like 'tag --contains' will be slow
>>> until whenever my first "gc" kicks in, which may be quite some time if
>>> I'm just using it passively.
>>>
>>> So we should make "git gc --auto" be run on clone,
>> There is no garbage after 'git clone'...
>
> And since there is no garbage, the gc will not write the commit-graph.

I should probably have replied to this instead of SZEDER's message;
anyway, my 0.02 on that is in
https://public-inbox.org/git/87r2h7gmd7.fsf@evledraar.gmail.com/

>>
>>> and change the
>>> need_to_gc() / cmd_gc() behavior so that we detect that the
>>> gc.writeCommitGraph=true setting is on, but we have no commit graph, and
>>> then just generate that without doing a full repack.
>> Or just teach 'git clone' to run 'git commit-graph write ...'
>
> I plan to add a 'fetch.writeCommitGraph' config setting. I was waiting
> until the file is incremental (on my to-do list soon), so the write is
> fast when only adding a few commits at a time. This would cover the
> clone case, too.

It's re-arranging deck chairs on the Titanic at this point, but this
approach seems like the wrong way to go in this whole "do we have crap
to do?" git-gc state-machine.

In my mind we should have only one entry point into that, and there
shouldn't be magic like "here's the gc-ish stuff we do on
fetch". Because if we care about a bunch of new commits being added on
"fetch", that can also happen on "commit", "am", "merge", all of which
run "gc --auto" now.

Which is why I'm suggesting that we could add a sub-mode in need_to_gc()
that detects if a file we want to generate is entirely missing, which is
extendable to future formats. The only caveat at that point is whether
we'd like that subset of "gc" to block and run in the foreground in the
"clone" (or "fetch", ...) case.

And then if we have a desire to incrementally add recently added commits
to such formats, "gc --auto" could learn to consume reflogs or some
other general inventory of "stuff added since last gc", and then we
wouldn't have to instrument "fetch" specifically; the same would work
for "commit", "am", "merge" etc.


* Re: We should add a "git gc --auto" after "git clone" due to commit graph
  2018-10-03 14:17     ` SZEDER Gábor
@ 2018-10-03 14:22       ` Ævar Arnfjörð Bjarmason
  2018-10-03 14:53         ` SZEDER Gábor
  2018-10-03 17:47         ` Stefan Beller
  0 siblings, 2 replies; 78+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-10-03 14:22 UTC (permalink / raw)
  To: SZEDER Gábor
  Cc: Derrick Stolee, Git List, Nguyễn Thái Ngọc Duy


On Wed, Oct 03 2018, SZEDER Gábor wrote:

> On Wed, Oct 03, 2018 at 04:01:40PM +0200, Ævar Arnfjörð Bjarmason wrote:
>>
>> On Wed, Oct 03 2018, SZEDER Gábor wrote:
>>
>> > On Wed, Oct 03, 2018 at 03:23:57PM +0200, Ævar Arnfjörð Bjarmason wrote:
>> >> Don't have time to patch this now, but thought I'd send a note / RFC
>> >> about this.
>> >>
>> >> Now that we have the commit graph it's nice to be able to set
>> >> e.g. core.commitGraph=true & gc.writeCommitGraph=true in ~/.gitconfig or
>> >> /etc/gitconfig to apply them to all repos.
>> >>
>> >> But when I clone e.g. linux.git stuff like 'tag --contains' will be slow
>> >> until whenever my first "gc" kicks in, which may be quite some time if
>> >> I'm just using it passively.
>> >>
>> >> So we should make "git gc --auto" be run on clone,
>> >
>> > There is no garbage after 'git clone'...
>>
>> "git gc" is really "git gc-or-create-indexes" these days.
>
> Because it happens to be convenient to create those indexes at
> gc-time.  But that should not be an excuse to run gc when by
> definition no gc is needed.

Ah, I thought you just had an objection to the "gc" name being used for
non-gc stuff, but if you mean we shouldn't do a giant repack right after
clone I agree. I meant that "gc --auto" would learn to do a subset of
its work, instead of the current "I have work to do, let's do all of
pack-refs/repack/commit-graph etc.".

So we wouldn't be spending 5 minutes repacking linux.git right after
cloning it, just ~10s generating the commit graph, and the same would
happen if you rm'd .git/objects/info/commit-graph and ran "git commit",
which would kick off "gc --auto" in the background and do the same thing.


* Re: We should add a "git gc --auto" after "git clone" due to commit graph
  2018-10-03 14:01   ` Ævar Arnfjörð Bjarmason
  2018-10-03 14:17     ` SZEDER Gábor
@ 2018-10-03 14:32     ` Duy Nguyen
  1 sibling, 0 replies; 78+ messages in thread
From: Duy Nguyen @ 2018-10-03 14:32 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: SZEDER Gábor, Derrick Stolee, Git Mailing List

On Wed, Oct 3, 2018 at 4:01 PM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> >> and change the
> >> need_to_gc() / cmd_gc() behavior so that we detect that the
> >> gc.writeCommitGraph=true setting is on, but we have no commit graph, and
> >> then just generate that without doing a full repack.
> >
> > Or just teach 'git clone' to run 'git commit-graph write ...'
>
> Then when adding something like the commit graph we'd need to patch both
> git-clone and git-gc, it's much more straightforward to make
> need_to_gc() more granular.

It is straightforward and misleading. If we organize the code well,
patching both would not take much more effort, and it reduces "wtf?"
moments when reading the code.
-- 
Duy


* Re: We should add a "git gc --auto" after "git clone" due to commit graph
  2018-10-03 14:22       ` Ævar Arnfjörð Bjarmason
@ 2018-10-03 14:53         ` SZEDER Gábor
  2018-10-03 15:19           ` Ævar Arnfjörð Bjarmason
  2018-10-03 19:08           ` Stefan Beller
  2018-10-03 17:47         ` Stefan Beller
  1 sibling, 2 replies; 78+ messages in thread
From: SZEDER Gábor @ 2018-10-03 14:53 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Derrick Stolee, Git List, Nguyễn Thái Ngọc Duy

On Wed, Oct 03, 2018 at 04:22:12PM +0200, Ævar Arnfjörð Bjarmason wrote:
> 
> On Wed, Oct 03 2018, SZEDER Gábor wrote:
> 
> > On Wed, Oct 03, 2018 at 04:01:40PM +0200, Ævar Arnfjörð Bjarmason wrote:
> >>
> >> On Wed, Oct 03 2018, SZEDER Gábor wrote:
> >>
> >> > On Wed, Oct 03, 2018 at 03:23:57PM +0200, Ævar Arnfjörð Bjarmason wrote:
> >> >> Don't have time to patch this now, but thought I'd send a note / RFC
> >> >> about this.
> >> >>
> >> >> Now that we have the commit graph it's nice to be able to set
> >> >> e.g. core.commitGraph=true & gc.writeCommitGraph=true in ~/.gitconfig or
> >> >> /etc/gitconfig to apply them to all repos.
> >> >>
> >> >> But when I clone e.g. linux.git stuff like 'tag --contains' will be slow
> >> >> until whenever my first "gc" kicks in, which may be quite some time if
> >> >> I'm just using it passively.
> >> >>
> >> >> So we should make "git gc --auto" be run on clone,
> >> >
> >> > There is no garbage after 'git clone'...
> >>
> >> "git gc" is really "git gc-or-create-indexes" these days.
> >
> > Because it happens to be convenient to create those indexes at
> > gc-time.  But that should not be an excuse to run gc when by
> > definition no gc is needed.
> 
> Ah, I thought you just had an objection to the "gc" name being used for
> non-gc stuff,

But you thought right, I do have an objection against that.  'git gc'
should, well, collect garbage.  Any non-gc stuff is already violating
separation of concerns.

>  but if you mean we shouldn't do a giant repack right after
> clone I agree.

And, I also mean that since 'git clone' knows that there can't
possibly be any garbage in the first place, then it shouldn't call 'gc
--auto' at all.  However, since it also knows that there is a lot of
new stuff, then it should create a commit-graph if enabled.

> I meant that "gc --auto" would learn to do a subset of
> its work, instead of the current "I have work to do, let's do all of
> pack-refs/repack/commit-graph etc.".
> 
> So we wouldn't be spending 5 minutes repacking linux.git right after
> cloning it, just ~10s generating the commit graph, and the same would
> happen if you rm'd .git/objects/info/commit-graph and ran "git commit",
> which would kick off "gc --auto" in the background and do the same thing.


* Re: We should add a "git gc --auto" after "git clone" due to commit graph
  2018-10-03 14:53         ` SZEDER Gábor
@ 2018-10-03 15:19           ` Ævar Arnfjörð Bjarmason
  2018-10-03 16:59             ` SZEDER Gábor
  2018-10-03 19:08           ` Stefan Beller
  1 sibling, 1 reply; 78+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-10-03 15:19 UTC (permalink / raw)
  To: SZEDER Gábor
  Cc: Derrick Stolee, Git List, Nguyễn Thái Ngọc Duy


On Wed, Oct 03 2018, SZEDER Gábor wrote:

> On Wed, Oct 03, 2018 at 04:22:12PM +0200, Ævar Arnfjörð Bjarmason wrote:
>>
>> On Wed, Oct 03 2018, SZEDER Gábor wrote:
>>
>> > On Wed, Oct 03, 2018 at 04:01:40PM +0200, Ævar Arnfjörð Bjarmason wrote:
>> >>
>> >> On Wed, Oct 03 2018, SZEDER Gábor wrote:
>> >>
>> >> > On Wed, Oct 03, 2018 at 03:23:57PM +0200, Ævar Arnfjörð Bjarmason wrote:
>> >> >> Don't have time to patch this now, but thought I'd send a note / RFC
>> >> >> about this.
>> >> >>
>> >> >> Now that we have the commit graph it's nice to be able to set
>> >> >> e.g. core.commitGraph=true & gc.writeCommitGraph=true in ~/.gitconfig or
>> >> >> /etc/gitconfig to apply them to all repos.
>> >> >>
>> >> >> But when I clone e.g. linux.git stuff like 'tag --contains' will be slow
>> >> >> until whenever my first "gc" kicks in, which may be quite some time if
>> >> >> I'm just using it passively.
>> >> >>
>> >> >> So we should make "git gc --auto" be run on clone,
>> >> >
>> >> > There is no garbage after 'git clone'...
>> >>
>> >> "git gc" is really "git gc-or-create-indexes" these days.
>> >
>> > Because it happens to be convenient to create those indexes at
>> > gc-time.  But that should not be an excuse to run gc when by
>> > definition no gc is needed.
>>
>> Ah, I thought you just had an objection to the "gc" name being used for
>> non-gc stuff,
>
> But you thought right, I do have an objection against that.  'git gc'
> should, well, collect garbage.  Any non-gc stuff is already violating
> separation of concerns.

Ever since git-gc was added back in 30f610b7b0 ("Create 'git gc' to
perform common maintenance operations.", 2006-12-27) it has been
described as:

    git-gc - Cleanup unnecessary files and optimize the local repository

Creating these indexes like the commit-graph falls under "optimize the
local repository", and 3rd party tools (e.g. the repo tool doing this
came up on list recently) have been calling "gc --auto" with this
assumption.

>>  but if you mean we shouldn't do a giant repack right after
>> clone I agree.
>
> And, I also mean that since 'git clone' knows that there can't
> possibly be any garbage in the first place, then it shouldn't call 'gc
> --auto' at all.  However, since it also knows that there is a lot of
> new stuff, then it should create a commit-graph if enabled.

Is this something you think just because the tool isn't called
git-gc-and-optimize, or do you think this regardless of what it's
called?

I don't see how splitting up the entry points for "detect if we need to
cleanup or optimize the repo" leaves us with a better codebase for the
reasons noted in
https://public-inbox.org/git/87pnwrgll2.fsf@evledraar.gmail.com/


* Re: We should add a "git gc --auto" after "git clone" due to commit graph
  2018-10-03 13:23 We should add a "git gc --auto" after "git clone" due to commit graph Ævar Arnfjörð Bjarmason
  2018-10-03 13:36 ` SZEDER Gábor
@ 2018-10-03 16:45 ` Duy Nguyen
  2018-10-04 21:42 ` [RFC PATCH] " Ævar Arnfjörð Bjarmason
  2 siblings, 0 replies; 78+ messages in thread
From: Duy Nguyen @ 2018-10-03 16:45 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Derrick Stolee, Git Mailing List

On Wed, Oct 3, 2018 at 3:23 PM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>
> Don't have time to patch this now, but thought I'd send a note / RFC
> about this.
>
> Now that we have the commit graph it's nice to be able to set
> e.g. core.commitGraph=true & gc.writeCommitGraph=true in ~/.gitconfig or
> /etc/gitconfig to apply them to all repos.
>
> But when I clone e.g. linux.git stuff like 'tag --contains' will be slow
> until whenever my first "gc" kicks in, which may be quite some time if
> I'm just using it passively.

Since you have core.hooksPath, you can already force gc (even in
detached mode) in a post-checkout hook. I'm adding a new
"post-init-repo" hook to allow more customization right after clone or
init-db, which may also be useful for things other than gc.
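
A sketch of such a hook (assuming core.hooksPath points at a directory
containing an executable post-checkout file; the exact gc invocation is
just a guess):

    #!/bin/sh
    # <hooksPath>/post-checkout: kick off "gc --auto" after every checkout,
    # including the initial checkout done by 'git clone'
    git gc --auto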
-- 
Duy


* Re: We should add a "git gc --auto" after "git clone" due to commit graph
  2018-10-03 15:19           ` Ævar Arnfjörð Bjarmason
@ 2018-10-03 16:59             ` SZEDER Gábor
  2018-10-05  6:09               ` Junio C Hamano
  0 siblings, 1 reply; 78+ messages in thread
From: SZEDER Gábor @ 2018-10-03 16:59 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Derrick Stolee, Git List, Nguyễn Thái Ngọc Duy

On Wed, Oct 03, 2018 at 05:19:41PM +0200, Ævar Arnfjörð Bjarmason wrote:
> >> >> >> So we should make "git gc --auto" be run on clone,
> >> >> >
> >> >> > There is no garbage after 'git clone'...
> >> >>
> >> >> "git gc" is really "git gc-or-create-indexes" these days.
> >> >
> >> > Because it happens to be convenient to create those indexes at
> >> > gc-time.  But that should not be an excuse to run gc when by
> >> > definition no gc is needed.
> >>
> >> Ah, I thought you just had an objection to the "gc" name being used for
> >> non-gc stuff,
> >
> > But you thought right, I do have an objection against that.  'git gc'
> > should, well, collect garbage.  Any non-gc stuff is already violating
> > separation of concerns.
> 
> Ever since git-gc was added back in 30f610b7b0 ("Create 'git gc' to
> perform common maintenance operations.", 2006-12-27) it has been
> described as:
> 
>     git-gc - Cleanup unnecessary files and optimize the local repository
> 
> Creating these indexes like the commit-graph falls under "optimize the
> local repository",

But it doesn't fall under "cleanup unnecessary files"; the commit-graph
file is such an unnecessary file, since, strictly speaking, it's purely
an optimization.

That description came about, because cleaning up unnecessary files,
notably combining lots of loose refs into a single packed-refs file
and combining lots of loose objects and pack files into a single pack
file, could not only make the repository smaller (barring too many
exploding unreachable objects), but, as it turned out, could also make
Git operations in that repository faster.

To me, the main goal of the command is cleanup.  Optimization, however
beneficial, is its side effect, and I assume the "optimize" part was
added to the description mainly to inform and "encourage" users.
After all, the command is called 'git gc', not 'git optimize-repo'.

> and 3rd party tools (e.g. the repo tool doing this
> came up on list recently) have been calling "gc --auto" with this
> assumption.
> 
> >>  but if you mean we shouldn't do a giant repack right after
> >> clone I agree.
> >
> > And, I also mean that since 'git clone' knows that there can't
> > possibly be any garbage in the first place, then it shouldn't call 'gc
> > --auto' at all.  However, since it also knows that there is a lot of
> > new stuff, then it should create a commit-graph if enabled.
> 
> Is this something you think just because the tool isn't called
> git-gc-and-optimize, or do you think this regardless of what it's
> called?

Well, that still has 'gc' in it...

> I don't see how splitting up the entry points for "detect if we need to
> cleanup or optimize the repo" leaves us with a better codebase for the
> reasons noted in
> https://public-inbox.org/git/87pnwrgll2.fsf@evledraar.gmail.com/

Such a separation would be valuable for those having gc.auto = 0 in
their config.  Or, in general, to have a clearly marked entry point to
update all the enabled "purely-optimization" files without 'gc'
exploding a bunch of "just-became-unreachable" objects from deleted
reflog entries and packfiles, or without performing a comparatively
expensive repacking.  Note the "clearly marked"; I don't think
teaching 'gc [--auto]' various tricks to only create/update these
files without doing what it is fundamentally supposed to do qualifies
for that.




* Re: We should add a "git gc --auto" after "git clone" due to commit graph
  2018-10-03 14:22       ` Ævar Arnfjörð Bjarmason
  2018-10-03 14:53         ` SZEDER Gábor
@ 2018-10-03 17:47         ` Stefan Beller
  2018-10-03 18:47           ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 78+ messages in thread
From: Stefan Beller @ 2018-10-03 17:47 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: SZEDER Gábor, Derrick Stolee, git, Duy Nguyen

> So we wouldn't be spending 5 minutes repacking linux.git right after
> cloning it, just ~10s generating the commit graph, and the same would
> happen if you rm'd .git/objects/info/commit-graph and ran "git commit",
> which would kick off "gc --auto" in the background and do the same thing.

Or generating local bitmaps or pack idx files as well?


* Re: We should add a "git gc --auto" after "git clone" due to commit graph
  2018-10-03 17:47         ` Stefan Beller
@ 2018-10-03 18:47           ` Ævar Arnfjörð Bjarmason
  2018-10-03 18:51             ` Jeff King
  0 siblings, 1 reply; 78+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-10-03 18:47 UTC (permalink / raw)
  To: Stefan Beller
  Cc: SZEDER Gábor, Derrick Stolee, git, Duy Nguyen, Jeff King


On Wed, Oct 03 2018, Stefan Beller wrote:

>> So we wouldn't be spending 5 minutes repacking linux.git right after
>> cloning it, just ~10s generating the commit graph, and the same would
>> happen if you rm'd .git/objects/info/commit-graph and ran "git commit",
>> which would kick off "gc --auto" in the background and do the same thing.
>
> Or generating local bitmaps or pack idx files as well?

I'm less familiar with this area, but when I clone I get a pack *.idx
file, why does it need to be regenerated?

But yeah, in principle this would be a sensible addition, but I'm not
aware of cases where clients get significant benefits from bitmaps (see
https://githubengineering.com/counting-objects/), and I don't turn it on
for clients, but maybe I've missed something.


* Re: We should add a "git gc --auto" after "git clone" due to commit graph
  2018-10-03 18:47           ` Ævar Arnfjörð Bjarmason
@ 2018-10-03 18:51             ` Jeff King
  2018-10-03 18:59               ` Derrick Stolee
  0 siblings, 1 reply; 78+ messages in thread
From: Jeff King @ 2018-10-03 18:51 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Stefan Beller, SZEDER Gábor, Derrick Stolee, git, Duy Nguyen

On Wed, Oct 03, 2018 at 08:47:11PM +0200, Ævar Arnfjörð Bjarmason wrote:

> 
> On Wed, Oct 03 2018, Stefan Beller wrote:
> 
> >> So we wouldn't be spending 5 minutes repacking linux.git right after
> >> cloning it, just ~10s generating the commit graph, and the same would
> >> happen if you rm'd .git/objects/info/commit-graph and ran "git commit",
> >> which would kick off "gc --auto" in the background and do the same thing.
> >
> > Or generating local bitmaps or pack idx files as well?
> 
> I'm less familiar with this area, but when I clone I get a pack *.idx
> file, why does it need to be regenerated?
> 
> But yeah, in principle this would be a sensible addition, but I'm not
> aware of cases where clients get significant benefits from bitmaps (see
> https://githubengineering.com/counting-objects/), and I don't turn it on
> for clients, but maybe I've missed something.

They don't help yet, and there's no good reason to enable bitmaps for
clients. I have a few patches that use bitmaps for things like
ahead/behind and --contains checks, but the utility of those may be
lessened quite a bit by Stolee's commit-graph work.  And if it isn't,
I'm mildly in favor of replacing the existing .bitmap format with
something better integrated with commit-graphs (which would give us an
opportunity to clean up some of the rough edges).

-Peff


* Re: We should add a "git gc --auto" after "git clone" due to commit graph
  2018-10-03 18:51             ` Jeff King
@ 2018-10-03 18:59               ` Derrick Stolee
  2018-10-03 19:18                 ` Jeff King
  0 siblings, 1 reply; 78+ messages in thread
From: Derrick Stolee @ 2018-10-03 18:59 UTC (permalink / raw)
  To: Jeff King, Ævar Arnfjörð Bjarmason
  Cc: Stefan Beller, SZEDER Gábor, git, Duy Nguyen

On 10/3/2018 2:51 PM, Jeff King wrote:
> On Wed, Oct 03, 2018 at 08:47:11PM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> On Wed, Oct 03 2018, Stefan Beller wrote:
>>
>>>> So we wouldn't be spending 5 minutes repacking linux.git right after
>>>> cloning it, just ~10s generating the commit graph, and the same would
>>>> happen if you rm'd .git/objects/info/commit-graph and ran "git commit",
>>>> which would kick off "gc --auto" in the background and do the same thing.
>>> Or generating local bitmaps or pack idx files as well?
>> I'm less familiar with this area, but when I clone I get a pack *.idx
>> file, why does it need to be regenerated?
>>
>> But yeah, in principle this would be a sensible addition, but I'm not
>> aware of cases where clients get significant benefits from bitmaps (see
>> https://githubengineering.com/counting-objects/), and I don't turn it on
>> for clients, but maybe I've missed something.
> They don't help yet, and there's no good reason to enable bitmaps for
> clients. I have a few patches that use bitmaps for things like
> ahead/behind and --contains checks, but the utility of those may be
> lessened quite a bit by Stolee's commit-graph work.  And if it isn't,
> I'm mildly in favor of replacing the existing .bitmap format with
> something better integrated with commit-graphs (which would give us an
> opportunity to clean up some of the rough edges).

If the commit-graph doesn't improve enough on those applications, then 
we could consider adding a commit-to-commit reachability bitmap inside 
the commit-graph. ;)

-Stolee


* Re: We should add a "git gc --auto" after "git clone" due to commit graph
  2018-10-03 14:53         ` SZEDER Gábor
  2018-10-03 15:19           ` Ævar Arnfjörð Bjarmason
@ 2018-10-03 19:08           ` Stefan Beller
  2018-10-03 19:21             ` Jeff King
  1 sibling, 1 reply; 78+ messages in thread
From: Stefan Beller @ 2018-10-03 19:08 UTC (permalink / raw)
  To: SZEDER Gábor
  Cc: Ævar Arnfjörð Bjarmason, Derrick Stolee, git,
	Duy Nguyen

>
> But you thought right, I do have an objection against that.  'git gc'
> should, well, collect garbage.  Any non-gc stuff is already violating
> separation of concerns.

I share these concerns in a slightly more abstract way, as
I would bucket the actions into two separate bins:

One bin holds actions that throw away information.
This would include removing expired reflog entries (which
I do not think are garbage, or a collection thereof), though their
usefulness is questionable.

The other bin would be actions that optimize but
do not throw away any information; repacking (without
dropping files) would be part of it, as would the new
"write additional files" steps.

Maybe we can move all actions of the second bin into a new
"git optimize" command, and git gc would first do the "throw away
things" part and then the optimize action, whereas clone would only
go for the second, optimizing part?


* Re: We should add a "git gc --auto" after "git clone" due to commit graph
  2018-10-03 18:59               ` Derrick Stolee
@ 2018-10-03 19:18                 ` Jeff King
  2018-10-08 16:41                   ` SZEDER Gábor
  0 siblings, 1 reply; 78+ messages in thread
From: Jeff King @ 2018-10-03 19:18 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Ævar Arnfjörð Bjarmason, Stefan Beller,
	SZEDER Gábor, git, Duy Nguyen

On Wed, Oct 03, 2018 at 02:59:34PM -0400, Derrick Stolee wrote:

> > They don't help yet, and there's no good reason to enable bitmaps for
> > clients. I have a few patches that use bitmaps for things like
> > ahead/behind and --contains checks, but the utility of those may be
> > lessened quite a bit by Stolee's commit-graph work.  And if it isn't,
> > I'm mildly in favor of replacing the existing .bitmap format with
> > something better integrated with commit-graphs (which would give us an
> > opportunity to clean up some of the rough edges).
> 
> If the commit-graph doesn't improve enough on those applications, then we
> could consider adding a commit-to-commit reachability bitmap inside the
> commit-graph. ;)

That unfortunately wouldn't be enough for us to ditch the existing
.bitmap files, since we need full object reachability for some cases
(including packing). And commit-to-commit reachability is a trivial
subset of that. I'm not sure if it would be better to just leave
.bitmaps in place as a server-side thing, and grow a new thing for
commit-to-commit reachability (since it would presumably be easier).

I'm still excited about the prospect of a bloom filter for paths which
each commit touches. I think that's the next big frontier in getting
things like "git log -- path" to a reasonable run-time.

-Peff


* Re: We should add a "git gc --auto" after "git clone" due to commit graph
  2018-10-03 19:08           ` Stefan Beller
@ 2018-10-03 19:21             ` Jeff King
  2018-10-03 20:35               ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 78+ messages in thread
From: Jeff King @ 2018-10-03 19:21 UTC (permalink / raw)
  To: Stefan Beller
  Cc: SZEDER Gábor, Ævar Arnfjörð Bjarmason,
	Derrick Stolee, git, Duy Nguyen

On Wed, Oct 03, 2018 at 12:08:15PM -0700, Stefan Beller wrote:

> I share these concerns in a slightly more abstract way, as
> I would bucket the actions into two separate bins:
> 
> One bin that throws away information.
> this would include removing expired reflog entries (which
> I do not think are garbage, or collection thereof), but their
> usefulness is questionable.
> 
> The other bin would be actions that optimize but
> do not throw away any information, repacking (without
> dropping files) would be part of it, or the new
> "write additional files".
> 
> Maybe we can move all actions of the second bin into a new
> "git optimize" command, and git gc would do first the "throw away
> things" and then the optimize action, whereas clone would only
> go for the second optimizing part?

One problem with that world-view is that some of the operations do
_both_, for efficiency. E.g., repacking will drop unreachable objects in
too-old packs. We could actually be more aggressive in combining things
here. For instance, a full object graph walk in linux.git takes 30-60
seconds, depending on your CPU. But we do it at least twice during a gc:
once to repack, and then again to determine reachability for pruning.

If you generate bitmaps during the repack step, you can use them during
the prune step. But by itself, the cost of generating the bitmaps
generally outweighs the extra walk. So it's not worth generating them
_just_ for this (but is an obvious optimization for a server which would
be generating them anyway).
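
(For illustration, a server that wants bitmaps anyway would produce them
as part of the repack step with something like

    git repack -a -d --write-bitmap-index

which is the case where that extra cost is already being paid.)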

-Peff


* Re: We should add a "git gc --auto" after "git clone" due to commit graph
  2018-10-03 19:21             ` Jeff King
@ 2018-10-03 20:35               ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 78+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-10-03 20:35 UTC (permalink / raw)
  To: Jeff King
  Cc: Stefan Beller, SZEDER Gábor, Derrick Stolee, git, Duy Nguyen


On Wed, Oct 03 2018, Jeff King wrote:

> On Wed, Oct 03, 2018 at 12:08:15PM -0700, Stefan Beller wrote:
>
>> I share these concerns in a slightly more abstract way, as
>> I would bucket the actions into two separate bins:
>>
>> One bin that throws away information.
>> this would include removing expired reflog entries (which
>> I do not think are garbage, or collection thereof), but their
>> usefulness is questionable.
>>
>> The other bin would be actions that optimize but
>> do not throw away any information, repacking (without
>> dropping files) would be part of it, or the new
>> "write additional files".
>>
>> Maybe we can move all actions of the second bin into a new
>> "git optimize" command, and git gc would do first the "throw away
>> things" and then the optimize action, whereas clone would only
>> go for the second optimizing part?
>
> One problem with that world-view is that some of the operations do
> _both_, for efficiency. E.g., repacking will drop unreachable objects in
> too-old packs. We could actually be more aggressive in combining things
> here. For instance, a full object graph walk in linux.git takes 30-60
> seconds, depending on your CPU. But we do it at least twice during a gc:
> once to repack, and then again to determine reachability for pruning.
>
> If you generate bitmaps during the repack step, you can use them during
> the prune step. But by itself, the cost of generating the bitmaps
> generally outweighs the extra walk. So it's not worth generating them
> _just_ for this (but is an obvious optimization for a server which would
> be generating them anyway).

I don't mean to fan the flames of this obviously controversial "git gc
does optimization" topic (which I didn't suspect there would be a debate
about...), but a related thing I was wondering about the other day is
whether we could have a gc.fsck option, and in the background do fsck
while we were at it, and report this back via some facility like
gc.log[1].

That would also fall into this category of more work we could do while
we're doing a full walk anyway, but, as with what you're suggesting, it
would require some refactoring.

1. Well, one that doesn't suck, see
   https://public-inbox.org/git/87inc89j38.fsf@evledraar.gmail.com/ /
   https://public-inbox.org/git/87d0vmck55.fsf@evledraar.gmail.com/ etc.


* [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph
  2018-10-03 13:23 We should add a "git gc --auto" after "git clone" due to commit graph Ævar Arnfjörð Bjarmason
  2018-10-03 13:36 ` SZEDER Gábor
  2018-10-03 16:45 ` Duy Nguyen
@ 2018-10-04 21:42 ` Ævar Arnfjörð Bjarmason
  2018-10-05 12:05   ` Derrick Stolee
  2 siblings, 1 reply; 78+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-10-04 21:42 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git List, Nguyễn Thái Ngọc Duy,
	SZEDER Gábor, Jeff King, Stefan Beller


On Wed, Oct 03 2018, Ævar Arnfjörð Bjarmason wrote:

> Don't have time to patch this now, but thought I'd send a note / RFC
> about this.
>
> Now that we have the commit graph it's nice to be able to set
> e.g. core.commitGraph=true & gc.writeCommitGraph=true in ~/.gitconfig or
> /etc/gitconfig to apply them to all repos.
>
> But when I clone e.g. linux.git stuff like 'tag --contains' will be slow
> until whenever my first "gc" kicks in, which may be quite some time if
> I'm just using it passively.
>
> So we should make "git gc --auto" be run on clone, and change the
> need_to_gc() / cmd_gc() behavior so that we detect that the
> gc.writeCommitGraph=true setting is on, but we have no commit graph, and
> then just generate that without doing a full repack.
>
> As an aside such more granular "gc" would be nice for e.g. pack-refs
> too. It's possible for us to just have one pack, but to have 100k loose
> refs.
>
> It might also be good to have some gc.autoDetachOnClone option and have
> it false by default, so we don't have a race condition where "clone
> linux && git -C linux tag --contains" is slow because the graph hasn't
> been generated yet, and generating the graph initially doesn't take that
> long compared to the time to clone a large repo (and on a small one it
> won't matter either way).
>
> I was going to say "also for midx", but of course after clone we have
> just one pack, so I can't imagine us needing this. But I can see us
> having other such optional side-indexes in the future generated by gc,
> and they'd also benefit from this.

I don't have time to polish this up for submission now, but here's a WIP
patch that implements this, highlights:

 * There's a gc.clone.autoDetach=false default setting which overrides
   gc.autoDetach if 'git gc --auto' is run via git-clone (we just pass a
   --cloning option to indicate this).

 * A clone of say git.git with gc.writeCommitGraph=true looks like:

   [...]
   Receiving objects: 100% (255262/255262), 100.49 MiB | 17.78 MiB/s, done.
   Resolving deltas: 100% (188947/188947), done.
   Computing commit graph generation numbers: 100% (55210/55210), done.

 * The 'git gc --auto' command also knows to (only) run the commit-graph
   (and space is left for future optimization steps) if general GC isn't
   needed, but we need "optimization":

   $ rm .git/objects/info/commit-graph; ~/g/git/git --exec-path=$PWD -c gc.writeCommitGraph=true -c gc.autoDetach=false gc --auto;
   Annotating commits in commit graph: 341229, done.
   Computing commit graph generation numbers: 100% (165969/165969), done.
   $

 * The patch to gc.c looks less scary with -w, most of it is indenting
   the existing pack-refs etc. with a "!auto_gc || should_gc" condition.

 * I added a commit_graph_exists() function and only care if I
   get ENOENT for the purposes of this gc mode. This would need to be
   tweaked for the incremental mode Derrick talks about, but if we just
   set "should_optimize" that'll also work as far as gc --auto is
   concerned (e.g. on fetch, am etc.)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 1546833213..5759fbb067 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -1621,7 +1621,19 @@ gc.autoPackLimit::

 gc.autoDetach::
 	Make `git gc --auto` return immediately and run in background
-	if the system supports it. Default is true.
+	if the system supports it. Default is true. Overridden by
+	`gc.clone.autoDetach` when running linkgit:git-clone[1].
+
+gc.clone.autoDetach::
+	Make `git gc --auto` return immediately and run in background
+	if the system supports it when run via
+	linkgit:git-clone[1]. Default is false.
++
+The reason this defaults to false is because the only time we'll have
+work to do after a 'git clone' is if something like
+`gc.writeCommitGraph` is true, in that case we'd like to compute the
+optimized file before returning, so that say commands that benefit
+from commit graph aren't slow until it's generated in the background.

 gc.bigPackThreshold::
 	If non-zero, all packs larger than this limit are kept when
diff --git a/builtin/clone.c b/builtin/clone.c
index 15b142d646..824c130ba5 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -897,6 +897,8 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 	struct remote *remote;
 	int err = 0, complete_refs_before_fetch = 1;
 	int submodule_progress;
+	const char *argv_gc_auto[]       = {"gc", "--auto", "--cloning", NULL};
+	const char *argv_gc_auto_quiet[] = {"gc", "--auto", "--cloning", "--quiet", NULL};

 	struct refspec rs = REFSPEC_INIT_FETCH;
 	struct argv_array ref_prefixes = ARGV_ARRAY_INIT;
@@ -1245,5 +1247,11 @@ int cmd_clone(int argc, const char **argv, const char *prefix)

 	refspec_clear(&rs);
 	argv_array_clear(&ref_prefixes);
+
+	if (0 <= option_verbosity)
+		run_command_v_opt_cd_env(argv_gc_auto, RUN_GIT_CMD, git_dir, NULL);
+	else
+		run_command_v_opt_cd_env(argv_gc_auto_quiet, RUN_GIT_CMD, git_dir, NULL);
+
 	return err;
 }
diff --git a/builtin/gc.c b/builtin/gc.c
index 6591ddbe83..27be03890a 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -43,6 +43,7 @@ static int gc_auto_threshold = 6700;
 static int gc_auto_pack_limit = 50;
 static int gc_write_commit_graph;
 static int detach_auto = 1;
+static int detach_clone_auto = 0;
 static timestamp_t gc_log_expire_time;
 static const char *gc_log_expire = "1.day.ago";
 static const char *prune_expire = "2.weeks.ago";
@@ -133,6 +134,7 @@ static void gc_config(void)
 	git_config_get_int("gc.autopacklimit", &gc_auto_pack_limit);
 	git_config_get_bool("gc.writecommitgraph", &gc_write_commit_graph);
 	git_config_get_bool("gc.autodetach", &detach_auto);
+	git_config_get_bool("gc.clone.autodetach", &detach_clone_auto);
 	git_config_get_expiry("gc.pruneexpire", &prune_expire);
 	git_config_get_expiry("gc.worktreepruneexpire", &prune_worktrees_expire);
 	git_config_get_expiry("gc.logexpiry", &gc_log_expire);
@@ -157,9 +159,6 @@ static int too_many_loose_objects(void)
 	int num_loose = 0;
 	int needed = 0;

-	if (gc_auto_threshold <= 0)
-		return 0;
-
 	dir = opendir(git_path("objects/17"));
 	if (!dir)
 		return 0;
@@ -369,10 +368,21 @@ static int need_to_gc(void)
 		return 0;

 	if (run_hook_le(NULL, "pre-auto-gc", NULL))
-		return 0;
+		return -1;
 	return 1;
 }

+static int need_to_optimize(void) {
+	if (gc_write_commit_graph) {
+		char *obj_dir = get_object_directory();
+		char *graph_name = get_commit_graph_filename(obj_dir);
+
+		if (commit_graph_exists(graph_name) == 0) /* ENOENT */
+			return 1;
+	}
+	return 0;
+}
+
 /* return NULL on success, else hostname running the gc */
 static const char *lock_repo_for_gc(int force, pid_t* ret_pid)
 {
@@ -491,6 +501,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 {
 	int aggressive = 0;
 	int auto_gc = 0;
+	int cloning = 0;
 	int quiet = 0;
 	int force = 0;
 	const char *name;
@@ -498,6 +509,8 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 	int daemonized = 0;
 	int keep_base_pack = -1;
 	timestamp_t dummy;
+	int should_gc;
+	int should_optimize;

 	struct option builtin_gc_options[] = {
 		OPT__QUIET(&quiet, N_("suppress progress reporting")),
@@ -507,6 +520,8 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 		OPT_BOOL(0, "aggressive", &aggressive, N_("be more thorough (increased runtime)")),
 		OPT_BOOL_F(0, "auto", &auto_gc, N_("enable auto-gc mode"),
 			   PARSE_OPT_NOCOMPLETE),
+		OPT_BOOL_F(0, "cloning", &cloning, N_("enable cloning mode"),
+			   PARSE_OPT_NOCOMPLETE),
 		OPT_BOOL_F(0, "force", &force,
 			   N_("force running gc even if there may be another gc running"),
 			   PARSE_OPT_NOCOMPLETE),
@@ -555,22 +570,27 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 		/*
 		 * Auto-gc should be least intrusive as possible.
 		 */
-		if (!need_to_gc())
+		should_gc = need_to_gc();
+		if (should_gc == -1)
+			return 0;
+		should_optimize = need_to_optimize();
+		if (!should_gc && !should_optimize)
 			return 0;
-		if (!quiet) {
+		if (!quiet && should_gc) {
 			if (detach_auto)
 				fprintf(stderr, _("Auto packing the repository in background for optimum performance.\n"));
 			else
 				fprintf(stderr, _("Auto packing the repository for optimum performance.\n"));
 			fprintf(stderr, _("See \"git help gc\" for manual housekeeping.\n"));
 		}
-		if (detach_auto) {
+		if (detach_auto &&
+		    (!cloning || (cloning && detach_clone_auto))) {
 			if (report_last_gc_error())
 				return -1;

 			if (lock_repo_for_gc(force, &pid))
 				return 0;
-			if (gc_before_repack())
+			if (should_gc && gc_before_repack())
 				return -1;
 			delete_tempfile(&pidfile);

@@ -611,45 +631,48 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 		atexit(process_log_file_at_exit);
 	}

-	if (gc_before_repack())
-		return -1;
-
-	if (!repository_format_precious_objects) {
-		close_all_packs(the_repository->objects);
-		if (run_command_v_opt(repack.argv, RUN_GIT_CMD))
-			return error(FAILED_RUN, repack.argv[0]);
-
-		if (prune_expire) {
-			argv_array_push(&prune, prune_expire);
-			if (quiet)
-				argv_array_push(&prune, "--no-progress");
-			if (repository_format_partial_clone)
-				argv_array_push(&prune,
-						"--exclude-promisor-objects");
-			if (run_command_v_opt(prune.argv, RUN_GIT_CMD))
-				return error(FAILED_RUN, prune.argv[0]);
+	if (!auto_gc || should_gc) {
+		if (gc_before_repack())
+			return -1;
+
+		if (!repository_format_precious_objects) {
+			close_all_packs(the_repository->objects);
+			if (run_command_v_opt(repack.argv, RUN_GIT_CMD))
+				return error(FAILED_RUN, repack.argv[0]);
+
+			if (prune_expire) {
+				argv_array_push(&prune, prune_expire);
+				if (quiet)
+					argv_array_push(&prune, "--no-progress");
+				if (repository_format_partial_clone)
+					argv_array_push(&prune,
+							"--exclude-promisor-objects");
+				if (run_command_v_opt(prune.argv, RUN_GIT_CMD))
+					return error(FAILED_RUN, prune.argv[0]);
+			}
 		}
-	}

-	if (prune_worktrees_expire) {
-		argv_array_push(&prune_worktrees, prune_worktrees_expire);
-		if (run_command_v_opt(prune_worktrees.argv, RUN_GIT_CMD))
-			return error(FAILED_RUN, prune_worktrees.argv[0]);
-	}

-	if (run_command_v_opt(rerere.argv, RUN_GIT_CMD))
-		return error(FAILED_RUN, rerere.argv[0]);
+		if (prune_worktrees_expire) {
+			argv_array_push(&prune_worktrees, prune_worktrees_expire);
+			if (run_command_v_opt(prune_worktrees.argv, RUN_GIT_CMD))
+				return error(FAILED_RUN, prune_worktrees.argv[0]);
+		}

-	report_garbage = report_pack_garbage;
-	reprepare_packed_git(the_repository);
-	if (pack_garbage.nr > 0)
-		clean_pack_garbage();
+		if (run_command_v_opt(rerere.argv, RUN_GIT_CMD))
+			return error(FAILED_RUN, rerere.argv[0]);
+
+		report_garbage = report_pack_garbage;
+		reprepare_packed_git(the_repository);
+		if (pack_garbage.nr > 0)
+			clean_pack_garbage();
+	}

 	if (gc_write_commit_graph)
 		write_commit_graph_reachable(get_object_directory(), 0,
 					     !quiet && !daemonized);

-	if (auto_gc && too_many_loose_objects())
+	if (auto_gc && should_gc && too_many_loose_objects())
 		warning(_("There are too many unreachable loose objects; "
 			"run 'git prune' to remove them."));

diff --git a/commit-graph.c b/commit-graph.c
index 5908bd4e34..a4a7c94cec 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -57,6 +57,18 @@ static struct commit_graph *alloc_commit_graph(void)
 	return g;
 }

+int commit_graph_exists(const char *graph_file)
+{
+	struct stat st;
+	if (stat(graph_file, &st)) {
+		if (errno == ENOENT)
+			return 0;
+		else
+			return -1;
+	}
+	return 1;
+}
+
 struct commit_graph *load_commit_graph_one(const char *graph_file)
 {
 	void *graph_map;
diff --git a/commit-graph.h b/commit-graph.h
index 5678a8f4ca..a251f1bc32 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -11,6 +11,7 @@
 struct commit;

 char *get_commit_graph_filename(const char *obj_dir);
+int commit_graph_exists(const char *graph_file);

 /*
  * Given a commit struct, try to fill the commit struct info, including:


* Re: We should add a "git gc --auto" after "git clone" due to commit graph
  2018-10-03 16:59             ` SZEDER Gábor
@ 2018-10-05  6:09               ` Junio C Hamano
  2018-10-10 22:07                 ` SZEDER Gábor
  0 siblings, 1 reply; 78+ messages in thread
From: Junio C Hamano @ 2018-10-05  6:09 UTC (permalink / raw)
  To: SZEDER Gábor
  Cc: Ævar Arnfjörð Bjarmason, Derrick Stolee, Git List,
	Nguyễn Thái Ngọc Duy

SZEDER Gábor <szeder.dev@gmail.com> writes:

>>     git-gc - Cleanup unnecessary files and optimize the local repository
>> 
>> Creating these indexes like the commit-graph falls under "optimize the
>> local repository",
>
> But it doesn't fall under "cleanup unnecessary files", which the
> commit-graph file is, since, strictly speaking, it's purely
> optimization.

I won't be actively engaged in this discussion soon, but I must say
that "git gc" doing "garbage collection" is merely an implementation
detail of optimizing the repository for further use.  And from that
point of view, what needs to be updated is the synopsis of the
git-gc doc.  It states "X and Y" above, but it actually is "Y by
doing X and other things".

I understand your "by definition there is no garbage immediately
after clone" position, and also I would understand if you find it
(perhaps philosophically) disturbing that "git clone" may give users
a suboptimal repository that immediately needs optimizing [*1*].

But that bridge was crossed a long time ago, when pack transfer
was invented.  The data source sends only the pack data stream, and
the receiving end is responsible for spending cycles to build the
.idx file.  Theoretically, .pack should be all that is needed---you
should be able to locate any necessary object by parsing the .pack
file every time you open it, and .idx is a mere optimization.  You can
think of the .midx and graph files the same way.

I'd consider it a growing pain that these two recent inventions were
and are still built as totally optional and separate features,
requiring a completely separate full enumeration of objects in the
repository that needs to happen anyway when we build .idx out of the
received .pack.

I would not be surprised by a future in which the initial index-pack
that is responsible for receiving the incoming pack stream and
storing that in .pack file(s) while creating corresponding .idx
file(s) also becomes responsible for building the .midx and graph files
in the same pass, or at least in a smaller number of passes.  Once we
gain experience and confidence with these new auxiliary files, that
ought to happen naturally.  And at that point, we won't be having
this discussion---we'd all happily run index-pack to receive the
pack data, because that is pretty much the fundamental requirement
to make use of the data.

[Footnote]

*1* Even without considering these recent inventions of auxiliary
    files, cloning from a sloppily packed server whose primary focus
    is to avoid spending cycles by not computing better deltas will
    give the cloner a suboptimal repository.  If we truly want to
    have an optimized repository ready to be used after cloning, we
    should run an equivalent of "repack -a -d -f" immediately after
    "git clone".


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph
  2018-10-04 21:42 ` [RFC PATCH] " Ævar Arnfjörð Bjarmason
@ 2018-10-05 12:05   ` Derrick Stolee
  2018-10-05 13:05     ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 78+ messages in thread
From: Derrick Stolee @ 2018-10-05 12:05 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Git List, Nguyễn Thái Ngọc Duy,
	SZEDER Gábor, Jeff King, Stefan Beller

On 10/4/2018 5:42 PM, Ævar Arnfjörð Bjarmason wrote:
> I don't have time to polish this up for submission now, but here's a WIP
> patch that implements this, highlights:
>
>   * There's a gc.clone.autoDetach=false default setting which overrides
>     gc.autoDetach if 'git gc --auto' is run via git-clone (we just pass a
>     --cloning option to indicate this).

I'll repeat that it could make sense to do the same thing on clone _and_ 
fetch. Perhaps a "--post-fetch" flag would be good here to communicate 
that we just downloaded a pack from a remote.

>   * A clone of say git.git with gc.writeCommitGraph=true looks like:
>
>     [...]
>     Receiving objects: 100% (255262/255262), 100.49 MiB | 17.78 MiB/s, done.
>     Resolving deltas: 100% (188947/188947), done.
>     Computing commit graph generation numbers: 100% (55210/55210), done.

This looks like good UX. Thanks for the progress here!

>   * The 'git gc --auto' command also knows to (only) run the commit-graph
>     (and space is left for future optimization steps) if general GC isn't
>     needed, but we need "optimization":
>
>     $ rm .git/objects/info/commit-graph; ~/g/git/git --exec-path=$PWD -c gc.writeCommitGraph=true -c gc.autoDetach=false gc --auto;
>     Annotating commits in commit graph: 341229, done.
>     Computing commit graph generation numbers: 100% (165969/165969), done.
>     $

Will this also trigger a full commit-graph rewrite on every 'git commit' 
command? Or is there some way we can compute the staleness of the 
commit-graph in order to only update if we get too far ahead? 
Previously, this was solved by relying on the auto-GC threshold.

>   * The patch to gc.c looks less scary with -w, most of it is indenting
>     the existing pack-refs etc. with a "!auto_gc || should_gc" condition.
>
>   * I added a commit_graph_exists() exists function and only care if I
>     get ENOENT for the purposes of this gc mode. This would need to be
>     tweaked for the incremental mode Derrick talks about, but if we just
>     set "should_optimize" that'll also work as far as gc --auto is
>     concerned (e.g. on fetch, am etc.)

The incremental mode would operate the same as split-index, which means 
we will still look for .git/objects/info/commit-graph. That file may 
point us to more files.

> +int commit_graph_exists(const char *graph_file)
> +{
> +	struct stat st;
> +	if (stat(graph_file, &st)) {
> +		if (errno == ENOENT)
> +			return 0;
> +		else
> +			return -1;
> +	}
> +	return 1;
> +}
> +

This method serves a very similar purpose to 
generation_numbers_enabled(), except your method only cares about the 
file existing. It ignores information like `core.commitGraph`, which 
should keep us from doing anything with the commit-graph file if false.

Nothing about your method is specific to the commit-graph file, since 
you provide a filename as a parameter. It could easily be "int 
file_exists(const char *filename)".

Thanks,

-Stolee



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph
  2018-10-05 12:05   ` Derrick Stolee
@ 2018-10-05 13:05     ` Ævar Arnfjörð Bjarmason
  2018-10-05 13:45       ` Derrick Stolee
  0 siblings, 1 reply; 78+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-10-05 13:05 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git List, Nguyễn Thái Ngọc Duy,
	SZEDER Gábor, Jeff King, Stefan Beller


On Fri, Oct 05 2018, Derrick Stolee wrote:

> On 10/4/2018 5:42 PM, Ævar Arnfjörð Bjarmason wrote:
>> I don't have time to polish this up for submission now, but here's a WIP
>> patch that implements this, highlights:
>>
>>   * There's a gc.clone.autoDetach=false default setting which overrides
>>     gc.autoDetach if 'git gc --auto' is run via git-clone (we just pass a
>>     --cloning option to indicate this).
>
> I'll repeat that it could make sense to do the same thing on clone
> _and_ fetch. Perhaps a "--post-fetch" flag would be good here to
> communicate that we just downloaded a pack from a remote.

I don't think that makes sense, but let's talk about why, because maybe
I've missed something, you're certainly more familiar with the
commit-graph than I am.

The reason to do it on clone as a special-case or when the file is
missing, is because we know the file is desired (via the GC config), and
presumably is expected to help performance, and we have 0% of it. So by
going from 0% to 100% on clone we'll get fast --contains and other
goodies the graph helps with.

But when we're doing a fetch, or really anything else that runs "git gc
--auto" we can safely assume that we have a recent enough graph, because
it will have been run whenever auto-gc kicked in.

I.e.:

    # Slow, if we assume background forked commit-graph generation
    # (which I'm avoiding)
    git clone x && cd x && git tag --contains
    # Fast enough, since we have an existing commit-graph
    cd x && git fetch && git tag --contains

I *do* think it might make sense to in general split off parts of "gc
--auto" that we'd like to be more aggressive about, simply because the
ratio of how long it takes to do, and how much it helps with performance
makes more sense than a full repack, which is what the current heuristic
is based on.

And maybe when we run in that mode we should run in the foreground, but
I don't see why git-fetch should be a special case there, and in this
regard, the gc.clone.autoDetach=false setting I've made doesn't make
much sense. I.e. maybe we should also skip forking to the background in
such a mode when we trigger such a "mini gc" via git-commit or whatever.

>>   * A clone of say git.git with gc.writeCommitGraph=true looks like:
>>
>>     [...]
>>     Receiving objects: 100% (255262/255262), 100.49 MiB | 17.78 MiB/s, done.
>>     Resolving deltas: 100% (188947/188947), done.
>>     Computing commit graph generation numbers: 100% (55210/55210), done.
>
> This looks like good UX. Thanks for the progress here!
>
>>   * The 'git gc --auto' command also knows to (only) run the commit-graph
>>     (and space is left for future optimization steps) if general GC isn't
>>     needed, but we need "optimization":
>>
>>     $ rm .git/objects/info/commit-graph; ~/g/git/git --exec-path=$PWD -c gc.writeCommitGraph=true -c gc.autoDetach=false gc --auto;
>>     Annotating commits in commit graph: 341229, done.
>>     Computing commit graph generation numbers: 100% (165969/165969), done.
>>     $
>
> Will this also trigger a full commit-graph rewrite on every 'git
> commit' command?

Nope, because "git commit" can safely be assumed to have some
commit-graph anyway, and I'm just special casing the case where it
doesn't exist.

But if it doesn't exist and you do a "git commit" then "gc --auto" will
be run, and we'll fork to the background and generate it...

>  Or is there some way we can compute the staleness of
> the commit-graph in order to only update if we get too far ahead?
> Previously, this was solved by relying on the auto-GC threshold.

So re the "I don't think that makes sense..." at the start of my E-Mail,
isn't it fine to rely on the default thresholds here, or should we be
more aggressive?

>>   * The patch to gc.c looks less scary with -w, most of it is indenting
>>     the existing pack-refs etc. with a "!auto_gc || should_gc" condition.
>>
>>   * I added a commit_graph_exists() exists function and only care if I
>>     get ENOENT for the purposes of this gc mode. This would need to be
>>     tweaked for the incremental mode Derrick talks about, but if we just
>>     set "should_optimize" that'll also work as far as gc --auto is
>>     concerned (e.g. on fetch, am etc.)
>
> The incremental mode would operate the same as split-index, which
> means we will still look for .git/objects/info/commit-graph. That file
> may point us to more files.

Ah!

>> +int commit_graph_exists(const char *graph_file)
>> +{
>> +	struct stat st;
>> +	if (stat(graph_file, &st)) {
>> +		if (errno == ENOENT)
>> +			return 0;
>> +		else
>> +			return -1;
>> +	}
>> +	return 1;
>> +}
>> +
>
> This method serves a very similar purpose to
> generation_numbers_enabled(), except your method only cares about the
> file existing. It ignores information like `core.commitGraph`, which
> should keep us from doing anything with the commit-graph file if
> false.
>
> Nothing about your method is specific to the commit-graph file, since
> you provide a filename as a parameter. It could easily be "int
> file_exists(const char *filename)".

I was being paranoid about not doing this if it didn't exist but it was
something else than ENOENT (e.g. permission error?), but in retrospect
that's silly. I'll drop this helper and just use file_exists().
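
To make that concrete, the check I have in mind would end up looking
roughly like this (a sketch only; the exact placement depends on how
the gc.c refactoring shakes out, and "should_optimize" is the flag I
mentioned above):

  char *graph_name = get_commit_graph_filename(get_object_directory());

  if (gc_write_commit_graph && !file_exists(graph_name))
      should_optimize = 1; /* write the graph without doing a full repack */
  free(graph_name);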

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph
  2018-10-05 13:05     ` Ævar Arnfjörð Bjarmason
@ 2018-10-05 13:45       ` Derrick Stolee
  2018-10-05 14:04         ` Ævar Arnfjörð Bjarmason
  2018-10-05 19:21         ` Jeff King
  0 siblings, 2 replies; 78+ messages in thread
From: Derrick Stolee @ 2018-10-05 13:45 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Git List, Nguyễn Thái Ngọc Duy,
	SZEDER Gábor, Jeff King, Stefan Beller

On 10/5/2018 9:05 AM, Ævar Arnfjörð Bjarmason wrote:
> On Fri, Oct 05 2018, Derrick Stolee wrote:
>
>> On 10/4/2018 5:42 PM, Ævar Arnfjörð Bjarmason wrote:
>>> I don't have time to polish this up for submission now, but here's a WIP
>>> patch that implements this, highlights:
>>>
>>>    * There's a gc.clone.autoDetach=false default setting which overrides
>>>      gc.autoDetach if 'git gc --auto' is run via git-clone (we just pass a
>>>      --cloning option to indicate this).
>> I'll repeat that it could make sense to do the same thing on clone
>> _and_ fetch. Perhaps a "--post-fetch" flag would be good here to
>> communicate that we just downloaded a pack from a remote.
> I don't think that makes sense, but let's talk about why, because maybe
> I've missed something, you're certainly more familiar with the
> commit-graph than I am.
>
> The reason to do it on clone as a special-case or when the file is
> missing, is because we know the file is desired (via the GC config), and
> presumably is expected to help performance, and we have 0% of it. So by
> going from 0% to 100% on clone we'll get fast --contains and other
> goodies the graph helps with.
>
> But when we're doing a fetch, or really anything else that runs "git gc
> --auto" we can safely assume that we have a recent enough graph, because
> it will have been run whenever auto-gc kicked in.
>
> I.e.:
>
>      # Slow, if we assume background forked commit-graph generation
>      # (which I'm avoiding)
>      git clone x && cd x && git tag --contains
>      # Fast enough, since we have an existing commit-graph
>      cd x && git fetch && git tag --contains
>
> I *do* think it might make sense to in general split off parts of "gc
> --auto" that we'd like to be more aggressive about, simply because the
> ratio of how long it takes to do, and how much it helps with performance
> makes more sense than a full repack, which is what the current heuristic
> is based on.
>
> And maybe when we run in that mode we should run in the foreground, but
> I don't see why git-fetch should be a special case there, and in this
> regard, the gc.clone.autoDetach=false setting I've made doesn't make
> much sense. I.e. maybe we should also skip forking to the background in
> such a mode when we trigger such a "mini gc" via git-commit or whatever.

My misunderstanding was that your proposed change to gc computes the 
commit-graph in either of these two cases:

(1) The auto-GC threshold is met.

(2) There is no commit-graph file.

And what I hope to have instead of (2) is (3):

(3) The commit-graph file is "sufficiently behind" the tip refs.

This condition is intentionally vague at the moment. It could be that we 
hint that (3) holds by saying "--post-fetch" (i.e. "We just downloaded a 
pack, and it probably contains a lot of new commits") or we could create 
some more complicated condition based on counting reachable commits with 
infinite generation number (the number of commits not in the 
commit-graph file).

I like that you are moving forward to make the commit-graph be written 
more frequently, but I'm trying to push us in a direction of writing it 
even more often than your proposed strategy. We should avoid creating 
too many orthogonal conditions that trigger the commit-graph write, 
which is why I'm pushing on your design here.

Anyone else have thoughts on this direction?

Thanks,

-Stolee


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph
  2018-10-05 13:45       ` Derrick Stolee
@ 2018-10-05 14:04         ` Ævar Arnfjörð Bjarmason
  2018-10-05 19:21         ` Jeff King
  1 sibling, 0 replies; 78+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-10-05 14:04 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git List, Nguyễn Thái Ngọc Duy,
	SZEDER Gábor, Jeff King, Stefan Beller


On Fri, Oct 05 2018, Derrick Stolee wrote:

> On 10/5/2018 9:05 AM, Ævar Arnfjörð Bjarmason wrote:
>> On Fri, Oct 05 2018, Derrick Stolee wrote:
>>
>>> On 10/4/2018 5:42 PM, Ævar Arnfjörð Bjarmason wrote:
>>>> I don't have time to polish this up for submission now, but here's a WIP
>>>> patch that implements this, highlights:
>>>>
>>>>    * There's a gc.clone.autoDetach=false default setting which overrides
>>>>      gc.autoDetach if 'git gc --auto' is run via git-clone (we just pass a
>>>>      --cloning option to indicate this).
>>> I'll repeat that it could make sense to do the same thing on clone
>>> _and_ fetch. Perhaps a "--post-fetch" flag would be good here to
>>> communicate that we just downloaded a pack from a remote.
>> I don't think that makes sense, but let's talk about why, because maybe
>> I've missed something, you're certainly more familiar with the
>> commit-graph than I am.
>>
>> The reason to do it on clone as a special-case or when the file is
>> missing, is because we know the file is desired (via the GC config), and
>> presumably is expected to help performance, and we have 0% of it. So by
>> going from 0% to 100% on clone we'll get fast --contains and other
>> goodies the graph helps with.
>>
>> But when we're doing a fetch, or really anything else that runs "git gc
>> --auto" we can safely assume that we have a recent enough graph, because
>> it will have been run whenever auto-gc kicked in.
>>
>> I.e.:
>>
>>      # Slow, if we assume background forked commit-graph generation
>>      # (which I'm avoiding)
>>      git clone x && cd x && git tag --contains
>>      # Fast enough, since we have an existing commit-graph
>>      cd x && git fetch && git tag --contains
>>
>> I *do* think it might make sense to in general split off parts of "gc
>> --auto" that we'd like to be more aggressive about, simply because the
>> ratio of how long it takes to do, and how much it helps with performance
>> makes more sense than a full repack, which is what the current heuristic
>> is based on.
>>
>> And maybe when we run in that mode we should run in the foreground, but
>> I don't see why git-fetch should be a special case there, and in this
>> regard, the gc.clone.autoDetach=false setting I've made doesn't make
>> much sense. I.e. maybe we should also skip forking to the background in
>> such a mode when we trigger such a "mini gc" via git-commit or whatever.
>
> My misunderstanding was that your proposed change to gc computes the
> commit-graph in either of these two cases:
>
> (1) The auto-GC threshold is met.
>
> (2) There is no commit-graph file.
>
> And what I hope to have instead of (2) is (3):
>
> (3) The commit-graph file is "sufficiently behind" the tip refs.
>
> This condition is intentionally vague at the moment. It could be that
> we hint that (3) holds by saying "--post-fetch" (i.e. "We just
> downloaded a pack, and it probably contains a lot of new commits") or
> we could create some more complicated condition based on counting
> reachable commits with infinite generation number (the number of
> commits not in the commit-graph file).
>
> I like that you are moving forward to make the commit-graph be written
> more frequently, but I'm trying to push us in a direction of writing
> it even more often than your proposed strategy. We should avoid
> creating too many orthogonal conditions that trigger the commit-graph
> write, which is why I'm pushing on your design here.
>
> Anyone else have thoughts on this direction?

Ah. I see. I think #3 makes perfect sense, but probably makes sense to
do as a follow-up, or maybe you'd like to stick a patch on top of the
series I have when I send it. I don't know how to write the "I'm not
quite happy about the commit graph" code :)

What I will do is refactor gc.c a bit and leave it in a state where it's
going to be really easy to change the existing "we have no commit graph,
and thus should do the optimization step" to have some more complex
condition instead of "we have no commit graph", i.e. your "we just
grabbed a lot of data".

Also, I'll drop the gc.clone.autoDetach=false setting and name it
something more general, maybe gc.AutoDetachOnBigOptimization=false?
Anyway, something more generic so that "clone" will always pass in some
option saying "expect a large % commit graph update" (100% in its case),
and then in "fetch" we could have some detection of how big what we just
got from the server is, and do the same.

This seems to me to be the most general thing that would make sense, and
could also be extended e.g. to "git commit" and other users of gc
--auto. If I started with a README file in an empty repo and then made
a commit adding 1 million files all at once, we'd (depending on that
setting) also block in the foreground and generate the commit-graph.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph
  2018-10-05 13:45       ` Derrick Stolee
  2018-10-05 14:04         ` Ævar Arnfjörð Bjarmason
@ 2018-10-05 19:21         ` Jeff King
  2018-10-05 19:41           ` Derrick Stolee
  1 sibling, 1 reply; 78+ messages in thread
From: Jeff King @ 2018-10-05 19:21 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Ævar Arnfjörð Bjarmason, Git List,
	Nguyễn Thái Ngọc Duy, SZEDER Gábor,
	Stefan Beller

On Fri, Oct 05, 2018 at 09:45:47AM -0400, Derrick Stolee wrote:

> My misunderstanding was that your proposed change to gc computes the
> commit-graph in either of these two cases:
> 
> (1) The auto-GC threshold is met.
> 
> (2) There is no commit-graph file.
> 
> And what I hope to have instead of (2) is (3):
> 
> (3) The commit-graph file is "sufficiently behind" the tip refs.
> 
> This condition is intentionally vague at the moment. It could be that we
> hint that (3) holds by saying "--post-fetch" (i.e. "We just downloaded a
> pack, and it probably contains a lot of new commits") or we could create
> some more complicated condition based on counting reachable commits with
> infinite generation number (the number of commits not in the commit-graph
> file).
> 
> I like that you are moving forward to make the commit-graph be written more
> frequently, but I'm trying to push us in a direction of writing it even more
> often than your proposed strategy. We should avoid creating too many
> orthogonal conditions that trigger the commit-graph write, which is why I'm
> pushing on your design here.
> 
> Anyone else have thoughts on this direction?

Yes, I think measuring "sufficiently behind" is the right thing.
Everything else is a proxy or heuristic, and will run into corner cases.
E.g., I have some small number of objects and then do a huge fetch, and
now my commit-graph only covers 5% of what's available.

We know how many objects are in the graph already. And it's not too
expensive to get the number of objects in the repository. We can do the
same sampling for loose objects that "gc --auto" does, and counting
packed objects just involves opening up the .idx files (that can be slow
if you have a ton of packs, but you'd want to either repack or use a
.midx in that case anyway, either of which would help here).

So can we really just take (total_objects - commit_graph_objects) and
compare it to some threshold?

-Peff

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph
  2018-10-05 19:21         ` Jeff King
@ 2018-10-05 19:41           ` Derrick Stolee
  2018-10-05 19:47             ` Jeff King
  0 siblings, 1 reply; 78+ messages in thread
From: Derrick Stolee @ 2018-10-05 19:41 UTC (permalink / raw)
  To: Jeff King
  Cc: Ævar Arnfjörð Bjarmason, Git List,
	Nguyễn Thái Ngọc Duy, SZEDER Gábor,
	Stefan Beller

On 10/5/2018 3:21 PM, Jeff King wrote:
> On Fri, Oct 05, 2018 at 09:45:47AM -0400, Derrick Stolee wrote:
>
>> My misunderstanding was that your proposed change to gc computes the
>> commit-graph in either of these two cases:
>>
>> (1) The auto-GC threshold is met.
>>
>> (2) There is no commit-graph file.
>>
>> And what I hope to have instead of (2) is (3):
>>
>> (3) The commit-graph file is "sufficiently behind" the tip refs.
>>
>> This condition is intentionally vague at the moment. It could be that we
>> hint that (3) holds by saying "--post-fetch" (i.e. "We just downloaded a
>> pack, and it probably contains a lot of new commits") or we could create
>> some more complicated condition based on counting reachable commits with
>> infinite generation number (the number of commits not in the commit-graph
>> file).
>>
>> I like that you are moving forward to make the commit-graph be written more
>> frequently, but I'm trying to push us in a direction of writing it even more
>> often than your proposed strategy. We should avoid creating too many
>> orthogonal conditions that trigger the commit-graph write, which is why I'm
>> pushing on your design here.
>>
>> Anyone else have thoughts on this direction?
> Yes, I think measuring "sufficiently behind" is the right thing.
> Everything else is a proxy or heuristic, and will run into corner cases.
> E.g., I have some small number of objects and then do a huge fetch, and
> now my commit-graph only covers 5% of what's available.
>
> We know how many objects are in the graph already. And it's not too
> expensive to get the number of objects in the repository. We can do the
> same sampling for loose objects that "gc --auto" does, and counting
> packed objects just involves opening up the .idx files (that can be slow
> if you have a ton of packs, but you'd want to either repack or use a
> .midx in that case anyway, either of which would help here).
>
> So can we really just take (total_objects - commit_graph_objects) and
> compare it to some threshold?

The commit-graph only stores the number of _commits_, not total objects.

Azure Repos' commit-graph does store the total number of objects, and 
that is how we trigger updating the graph, so it is not unreasonable to 
use that as a heuristic.

Thanks,

-Stolee


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph
  2018-10-05 19:41           ` Derrick Stolee
@ 2018-10-05 19:47             ` Jeff King
  2018-10-05 20:00               ` Derrick Stolee
  2018-10-05 20:01               ` Ævar Arnfjörð Bjarmason
  0 siblings, 2 replies; 78+ messages in thread
From: Jeff King @ 2018-10-05 19:47 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Ævar Arnfjörð Bjarmason, Git List,
	Nguyễn Thái Ngọc Duy, SZEDER Gábor,
	Stefan Beller

On Fri, Oct 05, 2018 at 03:41:40PM -0400, Derrick Stolee wrote:

> > So can we really just take (total_objects - commit_graph_objects) and
> > compare it to some threshold?
> 
> The commit-graph only stores the number of _commits_, not total objects.

Oh, right, of course. That does throw a monkey wrench in that line of
thought. ;)

There's unfortunately not a fast way of doing that. One option would be
to keep a counter of "ungraphed commit objects", and have callers update
it. Anybody admitting a pack via index-pack or unpack-objects can easily
get this information. Commands like fast-import can do likewise, and
"git commit" obviously increments it by one.

I'm not excited about adding a new global on-disk data structure (and
the accompanying lock).

-Peff

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph
  2018-10-05 19:47             ` Jeff King
@ 2018-10-05 20:00               ` Derrick Stolee
  2018-10-05 20:02                 ` Jeff King
  2018-10-05 20:01               ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 78+ messages in thread
From: Derrick Stolee @ 2018-10-05 20:00 UTC (permalink / raw)
  To: Jeff King
  Cc: Ævar Arnfjörð Bjarmason, Git List,
	Nguyễn Thái Ngọc Duy, SZEDER Gábor,
	Stefan Beller

On 10/5/2018 3:47 PM, Jeff King wrote:
> On Fri, Oct 05, 2018 at 03:41:40PM -0400, Derrick Stolee wrote:
>
>>> So can we really just take (total_objects - commit_graph_objects) and
>>> compare it to some threshold?
>> The commit-graph only stores the number of _commits_, not total objects.
> Oh, right, of course. That does throw a monkey wrench in that line of
> thought. ;)
>
> There's unfortunately not a fast way of doing that. One option would be
> to keep a counter of "ungraphed commit objects", and have callers update
> it. Anybody admitting a pack via index-pack or unpack-objects can easily
> get this information. Commands like fast-import can do likewise, and
> "git commit" obviously increments it by one.
>
> I'm not excited about adding a new global on-disk data structure (and
> the accompanying lock).

If we want, then we can add an optional chunk to the commit-graph file 
that stores the object count.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph
  2018-10-05 19:47             ` Jeff King
  2018-10-05 20:00               ` Derrick Stolee
@ 2018-10-05 20:01               ` Ævar Arnfjörð Bjarmason
  2018-10-05 20:09                 ` Jeff King
  1 sibling, 1 reply; 78+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-10-05 20:01 UTC (permalink / raw)
  To: Jeff King
  Cc: Derrick Stolee, Git List, Nguyễn Thái Ngọc Duy,
	SZEDER Gábor, Stefan Beller


On Fri, Oct 05 2018, Jeff King wrote:

> On Fri, Oct 05, 2018 at 03:41:40PM -0400, Derrick Stolee wrote:
>
>> > So can we really just take (total_objects - commit_graph_objects) and
>> > compare it to some threshold?
>>
>> The commit-graph only stores the number of _commits_, not total objects.
>
> Oh, right, of course. That does throw a monkey wrench in that line of
> thought. ;)
>
> There's unfortunately not a fast way of doing that. One option would be
> to keep a counter of "ungraphed commit objects", and have callers update
> it. Anybody admitting a pack via index-pack or unpack-objects can easily
> get this information. Commands like fast-import can do likewise, and
> "git commit" obviously increments it by one.
>
> I'm not excited about adding a new global on-disk data structure (and
> the accompanying lock).

You don't really need a new global datastructure to solve this
problem. It would be sufficient to have git-gc itself write out a 4-line
text file after it runs saying how many tags, commits, trees and blobs
it found on its last run.

You can then fuzzily compare object counts v.s. commit counts for the
purposes of deciding whether something like the commit-graph needs to be
updated, while assuming that whatever new data you have has similar
enough ratios of those as your existing data.

That's an assumption that'll hold well enough for big repos where this
matters the most, and which tend to grow in fairly uniform ways as far as
their object type ratios go.

Databases like MySQL, PostgreSQL etc. pull similar tricks with their
fuzzy table statistics.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph
  2018-10-05 20:00               ` Derrick Stolee
@ 2018-10-05 20:02                 ` Jeff King
  0 siblings, 0 replies; 78+ messages in thread
From: Jeff King @ 2018-10-05 20:02 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Ævar Arnfjörð Bjarmason, Git List,
	Nguyễn Thái Ngọc Duy, SZEDER Gábor,
	Stefan Beller

On Fri, Oct 05, 2018 at 04:00:12PM -0400, Derrick Stolee wrote:

> On 10/5/2018 3:47 PM, Jeff King wrote:
> > On Fri, Oct 05, 2018 at 03:41:40PM -0400, Derrick Stolee wrote:
> > 
> > > > So can we really just take (total_objects - commit_graph_objects) and
> > > > compare it to some threshold?
> > > The commit-graph only stores the number of _commits_, not total objects.
> > Oh, right, of course. That does throw a monkey wrench in that line of
> > thought. ;)
> > 
> > There's unfortunately not a fast way of doing that. One option would be
> > to keep a counter of "ungraphed commit objects", and have callers update
> > it. Anybody admitting a pack via index-pack or unpack-objects can easily
> > get this information. Commands like fast-import can do likewise, and
> > "git commit" obviously increments it by one.
> > 
> > I'm not excited about adding a new global on-disk data structure (and
> > the accompanying lock).
> 
> If we want, then we can add an optional chunk to the commit-graph file that
> stores the object count.

Yeah, that's probably a saner route, since we have to do the write then
anyway.

-Peff

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH] We should add a "git gc --auto" after "git clone" due to commit graph
  2018-10-05 20:01               ` Ævar Arnfjörð Bjarmason
@ 2018-10-05 20:09                 ` Jeff King
  0 siblings, 0 replies; 78+ messages in thread
From: Jeff King @ 2018-10-05 20:09 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Derrick Stolee, Git List, Nguyễn Thái Ngọc Duy,
	SZEDER Gábor, Stefan Beller

On Fri, Oct 05, 2018 at 10:01:31PM +0200, Ævar Arnfjörð Bjarmason wrote:

> > There's unfortunately not a fast way of doing that. One option would be
> > to keep a counter of "ungraphed commit objects", and have callers update
> > it. Anybody admitting a pack via index-pack or unpack-objects can easily
> > get this information. Commands like fast-import can do likewise, and
> > "git commit" obviously increments it by one.
> >
> > I'm not excited about adding a new global on-disk data structure (and
> > the accompanying lock).
> 
> You don't really need a new global datastructure to solve this
> problem. It would be sufficient to have git-gc itself write out a 4-line
> text file after it runs saying how many tags, commits, trees and blobs
> it found on its last run.
>
> You can then fuzzily compare object counts v.s. commit counts for the
> purposes of deciding whether something like the commit-graph needs to be
> updated, while assuming that whatever new data you have has similar
> enough ratios of those as your existing data.

I think this is basically the same thing as Stolee's suggestion to keep
the total object count in the commit-graph file. The only difference
here is that we know the actual ratio of commits to blobs for this
particular repository. But I don't think we need to know that. As you
said, this is fuzzy anyway, so a single number for "update the graph
when there are N new objects" is likely enough.

If you had a repository with an unusually large tree, you'd end up
rebuilding the graph more often. But I think it would probably be OK, as
we're primarily trying not to waste time doing a graph rebuild when
we've only done a small amount of other work. But if we just shoved a
ton of objects through index-pack then we did a lot of work, whether
those were commit objects or not.
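
Just to sketch the shape of the check I mean (hand-waving;
approximate_object_count() exists already, but
commit_graph_covered_objects() and the threshold are made up here,
standing in for however we end up recording the count at the last graph
write, e.g. via Stolee's optional chunk):

  static int commit_graph_needs_update(void)
  {
      /*
       * approximate_object_count() is real; the other two names are
       * made up for the sketch: "covered" would come from whatever
       * count we store at graph-write time, and the threshold would
       * presumably be a config knob.
       */
      unsigned long total = approximate_object_count();
      unsigned long covered = commit_graph_covered_objects();
      unsigned long threshold = 50000;

      return total > covered && total - covered > threshold;
  }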

-Peff

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: We should add a "git gc --auto" after "git clone" due to commit graph
  2018-10-03 19:18                 ` Jeff King
@ 2018-10-08 16:41                   ` SZEDER Gábor
  2018-10-08 16:57                     ` Derrick Stolee
  2018-10-08 23:02                     ` We should add a "git gc --auto" after "git clone" due to commit graph Junio C Hamano
  0 siblings, 2 replies; 78+ messages in thread
From: SZEDER Gábor @ 2018-10-08 16:41 UTC (permalink / raw)
  To: Jeff King
  Cc: Derrick Stolee, Ævar Arnfjörð Bjarmason,
	Stefan Beller, git, Duy Nguyen

On Wed, Oct 03, 2018 at 03:18:05PM -0400, Jeff King wrote:
> I'm still excited about the prospect of a bloom filter for paths which
> each commit touches. I think that's the next big frontier in getting
> things like "git log -- path" to a reasonable run-time.

There is certainly potential there.  With a (very) rough PoC
experiment, an 8MB bloom filter, and a carefully chosen path I can
achieve a nice, almost 25x speedup:

  $ time git rev-list --count HEAD -- t/valgrind/valgrind.sh
  6

  real    0m1.563s
  user    0m1.519s
  sys     0m0.045s

  $ time GIT_USE_POC_BLOOM_FILTER=y ~/src/git/git rev-list --count HEAD -- t/valgrind/valgrind.sh
  6

  real    0m0.063s
  user    0m0.043s
  sys     0m0.020s

  bloom filter total queries: 16269 definitely not: 16195 maybe: 74 false positives: 64 fp ratio: 0.003934

But I'm afraid it will take a while until I get around to turn it into
something presentable...


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: We should add a "git gc --auto" after "git clone" due to commit graph
  2018-10-08 16:41                   ` SZEDER Gábor
@ 2018-10-08 16:57                     ` Derrick Stolee
  2018-10-08 18:10                       ` SZEDER Gábor
  2018-10-09 19:34                       ` [PATCH 0/4] Bloom filter experiment SZEDER Gábor
  2018-10-08 23:02                     ` We should add a "git gc --auto" after "git clone" due to commit graph Junio C Hamano
  1 sibling, 2 replies; 78+ messages in thread
From: Derrick Stolee @ 2018-10-08 16:57 UTC (permalink / raw)
  To: SZEDER Gábor, Jeff King
  Cc: Ævar Arnfjörð Bjarmason, Stefan Beller, git,
	Duy Nguyen

On 10/8/2018 12:41 PM, SZEDER Gábor wrote:
> On Wed, Oct 03, 2018 at 03:18:05PM -0400, Jeff King wrote:
>> I'm still excited about the prospect of a bloom filter for paths which
>> each commit touches. I think that's the next big frontier in getting
>> things like "git log -- path" to a reasonable run-time.
> There is certainly potential there.  With a (very) rough PoC
> experiment, an 8MB bloom filter, and a carefully chosen path I can
> achieve a nice, almost 25x speedup:
>
>    $ time git rev-list --count HEAD -- t/valgrind/valgrind.sh
>    6
>
>    real    0m1.563s
>    user    0m1.519s
>    sys     0m0.045s
>
>    $ time GIT_USE_POC_BLOOM_FILTER=y ~/src/git/git rev-list --count HEAD -- t/valgrind/valgrind.sh
>    6
>
>    real    0m0.063s
>    user    0m0.043s
>    sys     0m0.020s
>
>    bloom filter total queries: 16269 definitely not: 16195 maybe: 74 false positives: 64 fp ratio: 0.003934
Nice! These numbers make sense to me, in terms of how many TREESAME 
queries we actually need to perform for such a query.
> But I'm afraid it will take a while until I get around to turn it into
> something presentable...
Do you have the code pushed somewhere public where one could take a 
look? I could provide some early feedback.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: We should add a "git gc --auto" after "git clone" due to commit graph
  2018-10-08 16:57                     ` Derrick Stolee
@ 2018-10-08 18:10                       ` SZEDER Gábor
  2018-10-08 18:29                         ` Derrick Stolee
  2018-10-09 19:34                       ` [PATCH 0/4] Bloom filter experiment SZEDER Gábor
  1 sibling, 1 reply; 78+ messages in thread
From: SZEDER Gábor @ 2018-10-08 18:10 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Jeff King, Ævar Arnfjörð Bjarmason, Stefan Beller,
	git, Duy Nguyen

On Mon, Oct 08, 2018 at 12:57:34PM -0400, Derrick Stolee wrote:
> On 10/8/2018 12:41 PM, SZEDER Gábor wrote:
> >On Wed, Oct 03, 2018 at 03:18:05PM -0400, Jeff King wrote:
> >>I'm still excited about the prospect of a bloom filter for paths which
> >>each commit touches. I think that's the next big frontier in getting
> >>things like "git log -- path" to a reasonable run-time.
> >There is certainly potential there.  With a (very) rough PoC
> >experiment, an 8MB bloom filter, and a carefully chosen path I can
> >achieve a nice, almost 25x speedup:
> >
> >   $ time git rev-list --count HEAD -- t/valgrind/valgrind.sh
> >   6
> >
> >   real    0m1.563s
> >   user    0m1.519s
> >   sys     0m0.045s
> >
> >   $ time GIT_USE_POC_BLOOM_FILTER=y ~/src/git/git rev-list --count HEAD -- t/valgrind/valgrind.sh
> >   6
> >
> >   real    0m0.063s
> >   user    0m0.043s
> >   sys     0m0.020s
> >
> >   bloom filter total queries: 16269 definitely not: 16195 maybe: 74 false positives: 64 fp ratio: 0.003934

> Nice! These numbers make sense to me, in terms of how many TREESAME queries
> we actually need to perform for such a query.

Yeah...  because you didn't notice that I deliberately cheated :)

As it turned out, it's not just about the number of diff queries that
we can spare, but, for the speedup _ratio_, it's more about how
expensive those diff queries are.

git.git has a rather flat hierarchy, and 't/' is the 372nd entry in
the current root tree object, while 'valgrind/' is the 923rd entry,
and the diff machinery spends considerable time wading through the
previous entries.  Notice the "carefully chosen path" remark in my
previous email; I think this particular path has the highest number of
preceding tree entries, and, in addition, 't/' changes rather
frequently, so the diff machinery often has to scan two relatively big
tree objects.  Had I chosen 'Documentation/RelNotes/1.5.0.1.txt'
instead, i.e. another path two directories deep, but whose leading
path components are both near the beginning of the tree objects, the
speedup would be much less impressive: 0.282s vs. 0.049s, i.e. "only"
~5.7x instead of ~24.8x.

> >But I'm afraid it will take a while until I get around to turn it into
> >something presentable...
> Do you have the code pushed somewhere public where one could take a
> look? I could provide some early feedback.

Nah, definitely not...  I know full well how embarrassingly broken this
implementation is, I don't need others to tell me that ;)


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: We should add a "git gc --auto" after "git clone" due to commit graph
  2018-10-08 18:10                       ` SZEDER Gábor
@ 2018-10-08 18:29                         ` Derrick Stolee
  2018-10-09  3:08                           ` Jeff King
  0 siblings, 1 reply; 78+ messages in thread
From: Derrick Stolee @ 2018-10-08 18:29 UTC (permalink / raw)
  To: SZEDER Gábor
  Cc: Jeff King, Ævar Arnfjörð Bjarmason, Stefan Beller,
	git, Duy Nguyen

On 10/8/2018 2:10 PM, SZEDER Gábor wrote:
> On Mon, Oct 08, 2018 at 12:57:34PM -0400, Derrick Stolee wrote:
>> Nice! These numbers make sense to me, in terms of how many TREESAME queries
>> we actually need to perform for such a query.
> Yeah...  because you didn't notice that I deliberately cheated :)
>
> As it turned out, it's not just about the number of diff queries that
> we can spare, but, for the speedup _ratio_, it's more about how
> expensive those diff queries are.
>
> git.git has a rather flat hierarchy, and 't/' is the 372nd entry in
> the current root tree object, while 'valgrind/' is the 923rd entry,
> and the diff machinery spends considerable time wading through the
> previous entries.  Notice the "carefully chosen path" remark in my
> previous email; I think this particular path has the highest number of
> preceding tree entries, and, in addition, 't/' changes rather
> frequently, so the diff machinery often has to scan two relatively big
> tree objects.  Had I chosen 'Documentation/RelNotes/1.5.0.1.txt'
> instead, i.e. another path two directories deep, but whose leading
> path components are both near the beginning of the tree objects, the
> speedup would be much less impressive: 0.282s vs. 0.049s, i.e. "only"
> ~5.7x instead of ~24.8x.

This is expected. The performance ratio is better when the path is any 
of the following:

1. A very deep path (need to walk multiple trees to answer TREESAME)

2. An entry is late in a very wide tree (need to spend extra time 
parsing tree object)

3. The path doesn't change very often (need to inspect many TREESAME 
pairs before finding enough interesting commits)

4. Some sub-path changes often (so the TREESAME comparison needs to 
parse beyond that sub-path often)

Our standard examples (Git and Linux repos) don't have many paths that 
have these properties. But: they do exist. In other projects, this is 
actually typical. Think about Java projects that frequently have ~5 
levels of folders before actually touching a code file.

When I was implementing the Bloom filter feature for Azure Repos, I ran 
performance tests on the Linux repo using a random sampling of paths. 
The typical speedup was 5x while some outliers were in the 25x range.

>
>>> But I'm afraid it will take a while until I get around to turn it into
>>> something presentable...
>> Do you have the code pushed somewhere public where one could take a
>> look? I could provide some early feedback.
> Nah, definitely not...  I know full well how embarrassingly broken this
> implementation is, I don't need others to tell me that ;)
There are two questions that I was hoping to answer by looking at your code:

1. How do you store your Bloom filter? Is it connected to the 
commit-graph and split on a commit-by-commit basis (storing "$path" as a 
key), or is it one huge Bloom filter (storing "$commitid:$path" as key)?

2. Where does your Bloom filter check plug into the TREESAME logic? I 
haven't investigated this part, but hopefully it isn't too complicated.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: We should add a "git gc --auto" after "git clone" due to commit graph
  2018-10-08 16:41                   ` SZEDER Gábor
  2018-10-08 16:57                     ` Derrick Stolee
@ 2018-10-08 23:02                     ` Junio C Hamano
  1 sibling, 0 replies; 78+ messages in thread
From: Junio C Hamano @ 2018-10-08 23:02 UTC (permalink / raw)
  To: SZEDER Gábor
  Cc: Jeff King, Derrick Stolee, Ævar Arnfjörð Bjarmason,
	Stefan Beller, git, Duy Nguyen

SZEDER Gábor <szeder.dev@gmail.com> writes:

> There is certainly potential there.  With a (very) rough PoC
> experiment, an 8MB bloom filter, and a carefully chosen path I can
> achieve a nice, almost 25x speedup:
>
>   $ time git rev-list --count HEAD -- t/valgrind/valgrind.sh
>   6
>
>   real    0m1.563s
>   user    0m1.519s
>   sys     0m0.045s
>
>   $ time GIT_USE_POC_BLOOM_FILTER=y ~/src/git/git rev-list --count HEAD -- t/valgrind/valgrind.sh
>   6
>
>   real    0m0.063s
>   user    0m0.043s
>   sys     0m0.020s

Even though I somehow sense a sign of exaggeration in [v] in the
pathname, it still is quite respectable.

>   bloom filter total queries: 16269 definitely not: 16195 maybe: 74 false positives: 64 fp ratio: 0.003934
>
> But I'm afraid it will take a while until I get around to turn it into
> something presentable...

That's OK.  This is an encouraging result.

Just from curiosity, how are you keying the filter?  tree object
name of the top-level and full path concatenated or something like
that?

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: We should add a "git gc --auto" after "git clone" due to commit graph
  2018-10-08 18:29                         ` Derrick Stolee
@ 2018-10-09  3:08                           ` Jeff King
  2018-10-09 13:48                             ` Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph) Derrick Stolee
  2018-10-09 21:30                             ` We should add a "git gc --auto" after "git clone" due to commit graph SZEDER Gábor
  0 siblings, 2 replies; 78+ messages in thread
From: Jeff King @ 2018-10-09  3:08 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: SZEDER Gábor, Ævar Arnfjörð Bjarmason,
	Stefan Beller, git, Duy Nguyen

On Mon, Oct 08, 2018 at 02:29:47PM -0400, Derrick Stolee wrote:

> > > > But I'm afraid it will take a while until I get around to turn it into
> > > > something presentable...
> > > Do you have the code pushed somewhere public where one could take a
> > > look? I could provide some early feedback.
> > Nah, definitely not...  I know full well how embarrassingly broken this
> > implementation is, I don't need others to tell me that ;)
> There are two questions that I was hoping to answer by looking at your code:
> 
> 1. How do you store your Bloom filter? Is it connected to the commit-graph
> and split on a commit-by-commit basis (storing "$path" as a key), or is it
> one huge Bloom filter (storing "$commitid:$path" as key)?

I guess you've probably thought all of this through for your
implementation, but let me pontificate.

I'd have done it as one fixed-size filter per commit. Then you should be
able to hash the path keys once, and apply the result as a bitwise query
to each individual commit (I'm assuming that it's constant-time to
access the filter for each, as an index into an mmap'd array, with the
offset coming from a commit-graph entry we'd be able to look up anyway).
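
In pseudo-C, the per-commit query I'm picturing is something like this
(every name here is made up for illustration, including the fixed
256-bit filter size):

  #include <stdint.h>
  #include <stddef.h>

  #define FILTER_BYTES 32 /* assumed fixed-size filter per commit */
  #define NUM_HASHES    7

  /*
   * "filters" is the mmap'd array of per-commit filters, "graph_pos"
   * is the commit's position in the commit-graph, and "bits" holds
   * the NUM_HASHES hash values computed once for the path.
   */
  static int maybe_touched(const unsigned char *filters,
                           uint32_t graph_pos, const uint32_t *bits)
  {
      const unsigned char *f = filters + (size_t)graph_pos * FILTER_BYTES;
      int i;

      for (i = 0; i < NUM_HASHES; i++) {
          uint32_t bit = bits[i] % (FILTER_BYTES * 8);
          if (!(f[bit / 8] & (1u << (bit % 8))))
              return 0; /* definitely did not touch the path */
      }
      return 1; /* maybe touched it */
  }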

I think it would also be easier to deal with maintenance, since each
filter is independent (IIRC, you cannot delete from a bloom filter
without re-adding all of the other keys).

The obvious downside is that it's O(commits) storage instead of O(1).

Now let me ponder a bit further afield. A bloom filter lets us answer
the question "did this commit (probably) touch these paths?". But it
does not let us answer "which paths did this commit touch?".

That second one is less useful than you might think, because we almost
always care about not just the names of the paths, but their actual
object ids. Think about a --raw diff, or even traversing for
reachability (where if we knew the tree-diff cheaply, we could avoid
asking "have we seen this yet?" on most of the tree entries). The names
alone can make that a bit faster, but in the worst case you still have
to walk the whole tree to find their entries.

But there's also a related question: how do we match pathspec patterns?
For a changed path like "foo/bar/baz", I imagine a bloom filter would
mark all of "foo", "foo/bar", and "foo/bar/baz". But what about "*.c"? I
don't think a bloom filter can answer that.

At least not by itself. If we imagine that the commit-graph also had an
alphabetized list of every path in every tree, then it's easy: apply the
glob to that list once to get a set of concrete paths, and then query
the bloom filters for those. And that list actually isn't too big. The
complete set of paths in linux.git is only about 300k gzipped (I think
that's the most relevant measure, since it's an obvious win to avoid
repeating shared prefixes of long paths).

Imagine we have that list. Is a bloom filter still the best data
structure for each commit? At the point that we have the complete
universe of paths, we could give each commit a bitmap of changed paths.
That lets us ask "did this commit touch these paths" (collect the bits
from the list of paths, then check for 1's), as well as "what paths did
we touch" (collect the 1 bits, and then index the path list).  Those
bitmaps should compress very well via EWAH or similar (most of them
would be huge stretches of 0's punctuated by short runs of 1's).
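
To illustrate the two queries, with a commit's bitmap shown uncompressed
as an array of words indexed by position in the sorted path list
(ignoring the EWAH layer entirely; this is just a sketch):

  #include <stdint.h>

  /* "did this commit touch path_idx?" */
  static int commit_touched(const uint64_t *bitmap, uint32_t path_idx)
  {
      return (bitmap[path_idx / 64] >> (path_idx % 64)) & 1;
  }

The "what paths did we touch" direction is then just walking the set
bits back into the path list.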

So that seems promising to me (or at least not an obvious dead-end). I
do think maintenance gets to be a headache, though. Adding new paths
potentially means reordering the bitmaps, which means O(commits) work to
"incrementally" update the structure. (Unless you always add the new
paths at the end, but then you lose fast lookups in the list; that might
be an acceptable tradeoff).

And finally, there's one more radical option: could we actually store a
real per-commit tree-diff cache? I.e., imagine that each commit had the
equivalent of a --raw diff easily accessible, including object ids. That
would allow:

  - fast pathspec matches, including globs

  - fast --raw output (and faster -p output, since we can skip the tree
    entirely)

  - fast reachability traversals (we only need to bother to look at the
    objects for changed entries)

where "fast" is basically O(size of commit's changes), rather than
O(size of whole tree). This was one of the big ideas of packv4 that
never materialized. You can _almost_ do it with packv2, since after all,
we end up storing many trees as deltas. But those deltas are byte-wise
so it's hard for a reader to convert them directly into a pure-tree
diff (they also don't mention the "deleted" data, so it's really only
half a diff).

So let's imagine we'd store such a cache external to the regular object
data (i.e., as a commit-graph entry). The "log --raw" diff of linux.git
has 1.7M entries. The paths should easily compress to a single 32-bit
integer (e.g., as an index into a big path list). The oids are 20 bytes.
Add a few bytes for modes. That's about 80MB. Big, but not impossibly
so. Maybe pushing it for true gigantic repos, though.
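
(Spelling that arithmetic out, and assuming each entry carries both the
pre- and post-image oid the way a --raw line does: 1.7M entries *
(4-byte path index + 2 * 20-byte oids + ~4 bytes of modes) comes to
roughly 80MB.)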

Those numbers are ignoring merges, too. The meaning of "did this commit
touch that path" is a lot trickier for a merge commit, and I think may
depend on context. I'm not sure how even a bloom filter solution would
handle that (I was assuming we'd mostly punt and let merges fall back to
opening up the trees).

Phew. That was a lot. I don't want to derail any useful work either of
you is doing. These are just things I've been thinking over (or even in
some cases experimenting with), and I think it's worth laying all the
options on the table. I won't be surprised if you'd considered and
rejected any of these alternate approaches, but I'd be curious to hear
the counter-arguments. :)

-Peff

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph)
  2018-10-09  3:08                           ` Jeff King
@ 2018-10-09 13:48                             ` Derrick Stolee
  2018-10-09 18:45                               ` Ævar Arnfjörð Bjarmason
  2018-10-09 18:46                               ` Jeff King
  2018-10-09 21:30                             ` We should add a "git gc --auto" after "git clone" due to commit graph SZEDER Gábor
  1 sibling, 2 replies; 78+ messages in thread
From: Derrick Stolee @ 2018-10-09 13:48 UTC (permalink / raw)
  To: Jeff King
  Cc: SZEDER Gábor, Ævar Arnfjörð Bjarmason,
	Stefan Beller, git, Duy Nguyen

(Changing title to reflect the new topic.)

On 10/8/2018 11:08 PM, Jeff King wrote:
> On Mon, Oct 08, 2018 at 02:29:47PM -0400, Derrick Stolee wrote:
>
>> There are two questions that I was hoping to answer by looking at 
>> your code:
>> 1. How do you store your Bloom filter? Is it connected to the commit-graph
>> and split on a commit-by-commit basis (storing "$path" as a key), or is it
>> one huge Bloom filter (storing "$commitid:$path" as key)?
> I guess you've probably thought all of this through for your
> implementation, but let me pontificate.
>
> I'd have done it as one fixed-size filter per commit. Then you should be
> able to hash the path keys once, and apply the result as a bitwise query
> to each individual commit (I'm assuming that it's constant-time to
> access the filter for each, as an index into an mmap'd array, with the
> offset coming from a commit-graph entry we'd be able to look up anyway).

You're right that we want to hash the path a constant number of times. 
Add in that we want to re-use information already serialized when 
updating the file, and we see that having a commit-by-commit Bloom 
filter is a good idea. Using (commit, path) pairs requires lots of 
re-hashing, repeated work when extending the filter, and poor locality 
when evaluating membership of a single key.

The nice thing is that you can generate k 32-bit hash values based on 
two 32-bit hash values that are "independent enough" (see [1]). We used 
Murmur3 with two different seed values to generate hashes a & b, then 
used the arithmetic progression a, a + b, a + 2b, ..., a + (k-1)b as our 
k hash values. These can be computed up front and then dropped into any 
size filter using a simple modulo operation. This allows flexible sizes 
in our filters.
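
In code that hashing step is roughly the following (a sketch;
murmur3_32() stands in for whatever Murmur3 implementation we would
carry, and the two seed constants are arbitrary placeholders):

  #include <stdint.h>
  #include <stddef.h>

  /* assumed to exist: 32-bit Murmur3 over (data, len) with a seed */
  uint32_t murmur3_32(const void *data, size_t len, uint32_t seed);

  /* compute the k hash values for one path, once */
  static void bloom_hashes(const char *path, size_t len,
                           uint32_t *out, int k)
  {
      uint32_t a = murmur3_32(path, len, 0x293ae76f); /* placeholder seeds */
      uint32_t b = murmur3_32(path, len, 0x7e646e2c);
      int i;

      for (i = 0; i < k; i++)
          out[i] = a + (uint32_t)i * b; /* reduce mod filter size at lookup */
  }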

I don't think fixed size filters are a good idea, and instead would opt 
for flex-sized filters with a maximum size. The typical parameters use 7 
hash functions and a filter size of (at least) 10 bits per entry. For 
most commits (say 60-70%), 256 bits (32 bytes) is enough. Using a 
maximum of 512 bytes covers 99% of commits. We will want these bounds to 
be configurable via config. If we had a fixed size, then we either make 
it too small (and don't have sufficient coverage of commits) or too 
large (and waste a lot of space on the commits that change very little).

We can store these flex-sized filters in the commit-graph using two 
columns of data (two new optional chunks):

* Bloom filter data: stores the binary data of each commit's Bloom 
filter, concatenated together in the same order as the commits appear in 
the commit-graph.

* Bloom filter positions: The ith position of this column stores the 
start of the (i+1)th Bloom filter (the 0th filter starts at byte 0). A 
Bloom filter of size 0 is intended to mean "we didn't store this filter 
because it would be too large". We can compute the length of each filter 
by inspecting adjacent values.
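
A reader-side sketch of that layout (names are made up, and byte-order
conversion of the on-disk integers is omitted):

#include <stdint.h>

/*
 * positions[i] holds the start of filter i+1, i.e. the end of filter i;
 * filter 0 starts at byte 0 of the data chunk.  A length of 0 means
 * "not stored because it would have been too large".
 */
static const unsigned char *nth_bloom_filter(const unsigned char *data_chunk,
					     const uint32_t *positions,
					     uint32_t i, uint32_t *len)
{
	uint32_t start = i ? positions[i - 1] : 0;

	*len = positions[i] - start;
	return *len ? data_chunk + start : NULL;
}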

In order to be flexible, we will want to encode some basic information 
into the Bloom filter data chunk, such as a tuple of (hash version, num 
hash bits, num bits per entry). This allows us to change the parameters 
in config but still be able to read a serialized filter. Here I assume 
that all filters share the same parameters. The "hash version" here is 
different from the_hash_algo, because we don't care about cryptographic 
security, only a uniform distribution (hence, Murmur3 is a good, fast 
option).

[1] 
https://web.archive.org/web/20090131053735/http://www.eecs.harvard.edu/~kirsch/pubs/bbbf/esa06.pdf
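
To make that concrete, the header could be as small as this (field
names just mirror the tuple above; nothing here is final):

#include <stdint.h>

/* Prepended once to the Bloom filter data chunk. */
struct bloom_filter_header {
	uint32_t hash_version;   /* which hash function, e.g. Murmur3 */
	uint32_t num_hash_bits;  /* "num hash bits" from the tuple above */
	uint32_t bits_per_entry; /* e.g. 10 */
};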

>
> I think it would also be easier to deal with maintenance, since each
> filter is independent (IIRC, you cannot delete from a bloom filter
> without re-adding all of the other keys).
>
> The obvious downside is that it's O(commits) storage instead of O(1).

It would always be O(changes), as the Bloom filter needs to grow in size 
as the number of entries grows.

> Now let me ponder a bit further afield. A bloom filter lets us answer
> the question "did this commit (probably) touch these paths?". But it
> does not let us answer "which paths did this commit touch?".
>
> That second one is less useful than you might think, because we almost
> always care about not just the names of the paths, but their actual
> object ids. Think about a --raw diff, or even traversing for
> reachability (where if we knew the tree-diff cheaply, we could avoid
> asking "have we seen this yet?" on most of the tree entries). The names
> alone can make that a bit faster, but in the worst case you still have
> to walk the whole tree to find their entries.
>
> But there's also a related question: how do we match pathspec patterns?
> For a changed path like "foo/bar/baz", I imagine a bloom filter would
> mark all of "foo", "foo/bar", and "foo/bar/baz". But what about "*.c"? I
> don't think a bloom filter can answer that.

The filter needs to store every path that would be considered "not 
TREESAME". It can't store wildcards, so you would need to evaluate the 
wildcard and test all of those paths individually (not a good idea).

> At least not by itself. If we imagine that the commit-graph also had an
> alphabetized list of every path in every tree, then it's easy: apply the
> glob to that list once to get a set of concrete paths, and then query
> the bloom filters for those. And that list actually isn't too big. The
> complete set of paths in linux.git is only about 300k gzipped (I think
> that's the most relevant measure, since it's an obvious win to avoid
> repeating shared prefixes of long paths).

As you mention below, we would actually want a list of "every path that 
has ever appeared in the repo".

> Imagine we have that list. Is a bloom filter still the best data
> structure for each commit? At the point that we have the complete
> universe of paths, we could give each commit a bitmap of changed paths.
> That lets us ask "did this commit touch these paths" (collect the bits
> from the list of paths, then check for 1's), as well as "what paths did
> we touch" (collect the 1 bits, and then index the path list).  Those
> bitmaps should compress very well via EWAH or similar (most of them
> would be huge stretches of 0's punctuated by short runs of 1's).

I'm not convinced we would frequently have runs of 1's, and the bitmap 
would not compress much better than simply listing the positions. For 
example, a path "foo/bar" that resolves to a tree would only start a run 
if the next changes are the initial section of entries in that tree 
(sorted lexicographically) such as "foo/bar/a, foo/bar/b". If we deepen 
into a tree, then we will break the run of 1's unless we changed every 
path deeper than that tree.

> So that seems promising to me (or at least not an obvious dead-end). I
> do think maintenance gets to be a headache, though. Adding new paths
> potentially means reordering the bitmaps, which means O(commits) work to
> "incrementally" update the structure. (Unless you always add the new
> paths at the end, but then you lose fast lookups in the list; that might
> be an acceptable tradeoff).
>
> And finally, there's one more radical option: could we actually store a
> real per-commit tree-diff cache? I.e., imagine that each commit had the
> equivalent of a --raw diff easily accessible, including object ids. That
> would allow:
>
>    - fast pathspec matches, including globs
>
>    - fast --raw output (and faster -p output, since we can skip the tree
>      entirely)
>
>    - fast reachability traversals (we only need to bother to look at the
>      objects for changed entries)
>
> where "fast" is basically O(size of commit's changes), rather than
> O(size of whole tree). This was one of the big ideas of packv4 that
> never materialized. You can _almost_ do it with packv2, since after all,
> we end up storing many trees as deltas. But those deltas are byte-wise
> so it's hard for a reader to convert them directly into a pure-tree
> diff (they also don't mention the "deleted" data, so it's really only
> half a diff).
>
> So let's imagine we'd store such a cache external to the regular object
> data (i.e., as a commit-graph entry). The "log --raw" diff of linux.git
> has 1.7M entries. The paths should easily compress to a single 32-bit
> integer (e.g., as an index into a big path list). The oids are 20 bytes.
> Add a few bytes for modes. That's about 80MB. Big, but not impossibly
> so. Maybe pushing it for true gigantic repos, though.

Above, I mentioned my gut reaction that storing a "changed path bitmap" 
per commit would not compress well. That puts that implementation very 
close to the one you suggest here (except we also store the OID changes).

I just want to compare your 80MB here to ~4MB it would take to store 
those changed paths in Bloom filters (10 bits per entry -> ~2MB, but 
adding some slop for the commit-by-commit storage).

> Those numbers are ignoring merges, too. The meaning of "did this commit
> touch that path" is a lot trickier for a merge commit, and I think may
> depend on context. I'm not sure how even a bloom filter solution would
> handle that (I was assuming we'd mostly punt and let merges fall back to
> opening up the trees).

My solution here is to always store the list of paths changed against 
the first parent. If we evaluate TREESAME against our first parent while 
computing simplified file history, then we continue along first-parent 
history. It is possible to store filters for every parent, but I don't 
recommend it. The merge commit will typically have many more changed 
paths against the second parent, since the second parent is usually 
bringing in a small change done by few to catch up to the work done in 
parallel by many. Those diffs will frequently run over our limit.

> Phew. That was a lot. I don't want to derail any useful work either of
> you is doing. These are just things I've been thinking over (or even in
> some cases experimenting with), and I think it's worth laying all the
> options on the table. I won't be surprised if you'd considered and
> rejected any of these alternate approaches, but I'd be curious to hear
> the counter-arguments. :)

This is a good discussion to have, since the commit-graph feature is 
getting to a stable place. We still have ongoing algorithm work with 
generation numbers, but this Bloom filter discussion (and 
implementation) can happen in parallel.

Thanks,

-Stolee


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph)
  2018-10-09 13:48                             ` Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph) Derrick Stolee
@ 2018-10-09 18:45                               ` Ævar Arnfjörð Bjarmason
  2018-10-09 18:46                               ` Jeff King
  1 sibling, 0 replies; 78+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-10-09 18:45 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Jeff King, SZEDER Gábor, Stefan Beller, git, Duy Nguyen


On Tue, Oct 09 2018, Derrick Stolee wrote:

> The filter needs to store every path that would be considered "not
> TREESAME". It can't store wildcards, so you would need to evaluate the
> wildcard and test all of those paths individually (not a good idea).

If full paths are stored, yes. But have you considered, instead of
storing paths, storing all trigrams that can be extracted from the
changed paths at that commit?

I.e. instead of a change to "t/t0000-basic.sh" storing
"t/t0000-basic.sh" we'd store ["t/t", "/t0", "t00", "000", "00-" ...]
etc.

That sort of approach would mean that e.g. "t*000*", "*asi*.sh"
etc. could all be indexed, and as long as we could find three
consecutive bytes of fixed string we'd have a chance to short-circuit,
but would need to degrade to a full tree unpack for e.g. "t*". We could
also special-case certain sub-three-char indexes, or fall back to
"bi-grams", e.g. to be able to index '*.c' or 't*' (first char anchored
at the beginning of the string only).

It would mean having to check more things in the bloom filter for each
commit, but that's going to be hot in cache at that point so it'll
probably beat unpacking trees by far, and we could short-circuit exit at
the first one that returned false.
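
For illustration, generating those keys could be as simple as the
sketch below; hashing each key into the filter would then work the same
as for whole paths (add_key() is just a placeholder):

#include <stddef.h>
#include <string.h>

/*
 * Feed every 3-byte substring of a changed path to the filter, so
 * "t/t0000-basic.sh" yields "t/t", "/t0", "t00", "000", "00-", ...
 */
static void add_path_trigrams(const char *path,
			      void (*add_key)(const void *key, size_t len,
					      void *cb_data),
			      void *cb_data)
{
	size_t len = strlen(path);
	size_t i;

	for (i = 0; i + 3 <= len; i++)
		add_key(path + i, 3, cb_data);
}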

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph)
  2018-10-09 13:48                             ` Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph) Derrick Stolee
  2018-10-09 18:45                               ` Ævar Arnfjörð Bjarmason
@ 2018-10-09 18:46                               ` Jeff King
  2018-10-09 19:03                                 ` Derrick Stolee
  1 sibling, 1 reply; 78+ messages in thread
From: Jeff King @ 2018-10-09 18:46 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: SZEDER Gábor, Ævar Arnfjörð Bjarmason,
	Stefan Beller, git, Duy Nguyen

On Tue, Oct 09, 2018 at 09:48:20AM -0400, Derrick Stolee wrote:

> [I snipped all of the parts about bloom filters that seemed entirely
>  reasonable to me ;) ]

> > Imagine we have that list. Is a bloom filter still the best data
> > structure for each commit? At the point that we have the complete
> > universe of paths, we could give each commit a bitmap of changed paths.
> > That lets us ask "did this commit touch these paths" (collect the bits
> > from the list of paths, then check for 1's), as well as "what paths did
> > we touch" (collect the 1 bits, and then index the path list).  Those
> > bitmaps should compress very well via EWAH or similar (most of them
> > would be huge stretches of 0's punctuated by short runs of 1's).
> 
> I'm not convinced we would frequently have runs of 1's, and the bitmap would
> not compress much better than simply listing the positions. For example, a
> path "foo/bar" that resolves to a tree would only start a run if the next
> changes are the initial section of entries in that tree (sorted
> lexicographically) such as "foo/bar/a, foo/bar/b". If we deepen into a tree,
> then we will break the run of 1's unless we changed every path deeper than
> that tree.

Yeah, I doubt we'd really have runs of 1's (by short, I just mean 1 or
2). I agree that listing the positions could work, though I sort of
assumed that was more or less what a decent compressed bitmap would
turn into. E.g., if bit N is set, we should be able to say "N-1
zeroes, 1 one" in about the same size as we could say "position N".

EWAH seems pretty awful in that regard. Or at least its serialized
format is (or maybe it's our implementation that is bad).

The patch below generates a bitmap for each commit in a repository (it
doesn't output the total list of paths; I've left that as an exercise
for the reader). On linux.git, the result is 57MB. But when I look at
the individual bitmap sizes (via GIT_TRACE), they're quite silly.
Storing a single set bit takes 28 bytes in serialized form!

There are only around 120k unique paths (including prefix trees).
Naively using run-length encoding and varints, our worst case should be
something like 18-20 bits to say "120k zeroes, then a 1, then all
zeroes".  And the average case should be better (you don't even need to
say "120k", but some smaller number).

I wonder if Roaring does better here.

Gzipping the resulting bitmaps drops the total size to about 7.5MB.
That's not a particularly important number, but I think it shows that
the built-in ewah compression is far from ideal.

Just listing the positions with a series of varints would generally be
fine, since we expect sparse bitmaps. I just hoped that a good
RLE scheme would degrade to roughly that for the sparse case, but also
perform well for more dense cases.
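
A minimal sketch of that "just list the positions" fallback,
gap-encoding the sorted bit positions as base-128 varints (made-up
names; not necessarily the same scheme as git's existing varint
helpers):

#include <stdint.h>
#include <stddef.h>

/* Emit 'v' as a base-128 varint; returns the number of bytes written. */
static size_t emit_varint(uint32_t v, unsigned char *out)
{
	size_t n = 0;

	while (v >= 0x80) {
		out[n++] = (v & 0x7f) | 0x80;
		v >>= 7;
	}
	out[n++] = v;
	return n;
}

/*
 * Store the gaps between consecutive set-bit positions; a lone bit
 * around position 120k then costs ~3 bytes instead of 28.
 */
static size_t emit_positions(const uint32_t *sorted_pos, size_t nr,
			     unsigned char *out)
{
	size_t len = 0;
	uint32_t prev = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		len += emit_varint(sorted_pos[i] - prev, out + len);
		prev = sorted_pos[i];
	}
	return len;
}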

So at any rate, I do think it would not be out of the question to store
bitmaps like this. I'm much more worried about the maintenance cost of
adding new entries incrementally. I think it's only feasible if we give
up sorting, and then I wonder what other problems that might cause.

-Peff

-- >8 --
diff --git a/Makefile b/Makefile
index 13e1c52478..f6e823f2d6 100644
--- a/Makefile
+++ b/Makefile
@@ -751,6 +751,7 @@ TEST_PROGRAMS_NEED_X += test-parse-options
 TEST_PROGRAMS_NEED_X += test-pkt-line
 TEST_PROGRAMS_NEED_X += test-svn-fe
 TEST_PROGRAMS_NEED_X += test-tool
+TEST_PROGRAMS_NEED_X += test-tree-bitmap
 
 TEST_PROGRAMS = $(patsubst %,t/helper/%$X,$(TEST_PROGRAMS_NEED_X))
 
diff --git a/t/helper/test-tree-bitmap.c b/t/helper/test-tree-bitmap.c
new file mode 100644
index 0000000000..bc5cf0e514
--- /dev/null
+++ b/t/helper/test-tree-bitmap.c
@@ -0,0 +1,167 @@
+#include "cache.h"
+#include "revision.h"
+#include "diffcore.h"
+#include "argv-array.h"
+#include "ewah/ewok.h"
+
+/* map of pathnames to bit positions */
+struct pathmap_entry {
+	struct hashmap_entry ent;
+	unsigned pos;
+	char path[FLEX_ARRAY];
+};
+
+static int pathmap_entry_hashcmp(const void *unused_cmp_data,
+				 const void *entry,
+				 const void *entry_or_key,
+				 const void *keydata)
+{
+	const struct pathmap_entry *a = entry;
+	const struct pathmap_entry *b = entry_or_key;
+	const char *key = keydata;
+
+	return strcmp(a->path, key ? key : b->path);
+}
+
+static int pathmap_entry_strcmp(const void *va, const void *vb)
+{
+	struct pathmap_entry *a = *(struct pathmap_entry **)va;
+	struct pathmap_entry *b = *(struct pathmap_entry **)vb;
+	return strcmp(a->path, b->path);
+}
+
+struct walk_paths_data {
+	struct hashmap *paths;
+	struct commit *commit;
+};
+
+static void walk_paths(diff_format_fn_t fn, struct hashmap *paths)
+{
+	struct argv_array argv = ARGV_ARRAY_INIT;
+	struct rev_info revs;
+	struct walk_paths_data data;
+	struct commit *commit;
+
+	argv_array_pushl(&argv, "rev-list",
+			 "--all", "-t", "--no-renames",
+			 NULL);
+	init_revisions(&revs, NULL);
+	setup_revisions(argv.argc, argv.argv, &revs, NULL);
+	revs.diffopt.output_format = DIFF_FORMAT_CALLBACK;
+	revs.diffopt.format_callback = fn;
+	revs.diffopt.format_callback_data = &data;
+
+	data.paths = paths;
+
+	prepare_revision_walk(&revs);
+	while ((commit = get_revision(&revs))) {
+		data.commit = commit;
+		diff_tree_combined_merge(commit, 0, &revs);
+	}
+
+	reset_revision_walk();
+	argv_array_clear(&argv);
+}
+
+static void collect_commit_paths(struct diff_queue_struct *q,
+				 struct diff_options *opts,
+				 void *vdata)
+{
+	struct walk_paths_data *data = vdata;
+	int i;
+
+	for (i = 0; i < q->nr; i++) {
+		struct diff_filepair *p = q->queue[i];
+		const char *path = p->one->path;
+		struct pathmap_entry *entry;
+		struct hashmap_entry lookup;
+
+		hashmap_entry_init(&lookup, strhash(path));
+		entry = hashmap_get(data->paths, &lookup, path);
+		if (entry)
+			continue; /* already present */
+
+		FLEX_ALLOC_STR(entry, path, path);
+		entry->ent = lookup;
+		hashmap_put(data->paths, entry);
+	}
+}
+
+/* assign a bit position to all possible paths */
+static void collect_paths(struct hashmap *paths)
+{
+	struct pathmap_entry **sorted;
+	size_t i, n;
+	struct hashmap_iter iter;
+	struct pathmap_entry *entry;
+
+	/* grab all unique paths */
+	hashmap_init(paths, pathmap_entry_hashcmp, NULL, 0);
+	walk_paths(collect_commit_paths, paths);
+
+	/* and assign them bits in sorted order */
+	n = hashmap_get_size(paths);
+	ALLOC_ARRAY(sorted, n);
+	i = 0;
+	for (entry = hashmap_iter_first(paths, &iter);
+	     entry;
+	     entry = hashmap_iter_next(&iter)) {
+		assert(i < n);
+		sorted[i++] = entry;
+	}
+	QSORT(sorted, i, pathmap_entry_strcmp);
+	for (i = 0; i < n; i++)
+		sorted[i]->pos = i;
+	free(sorted);
+}
+
+/* generate the bitmap for a single commit */
+static void generate_bitmap(struct diff_queue_struct *q,
+			    struct diff_options *opts,
+			    void *vdata)
+{
+	struct walk_paths_data *data = vdata;
+	struct bitmap *bitmap = bitmap_new();
+	struct ewah_bitmap *ewah;
+	struct strbuf out = STRBUF_INIT;
+	size_t i;
+
+	for (i = 0; i < q->nr; i++) {
+		struct diff_filepair *p = q->queue[i];
+		const char *path = p->one->path;
+		struct pathmap_entry *entry;
+		struct hashmap_entry lookup;
+
+		hashmap_entry_init(&lookup, strhash(path));
+		entry = hashmap_get(data->paths, &lookup, path);
+		if (!entry)
+			BUG("mysterious path appeared: %s", path);
+
+		bitmap_set(bitmap, entry->pos);
+	}
+
+	ewah = bitmap_to_ewah(bitmap);
+	ewah_serialize_strbuf(ewah, &out);
+	fwrite(out.buf, 1, out.len, stdout);
+
+	trace_printf("bitmap %s %u %u",
+		     oid_to_hex(&data->commit->object.oid),
+		     (unsigned)q->nr,
+		     (unsigned)out.len);
+
+	strbuf_release(&out);
+	ewah_free(ewah);
+	bitmap_free(bitmap);
+}
+
+int cmd_main(int argc, const char **argv)
+{
+	struct hashmap paths;
+
+	setup_git_directory();
+	collect_paths(&paths);
+
+	walk_paths(generate_bitmap, &paths);
+
+	return 0;
+}

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph)
  2018-10-09 18:46                               ` Jeff King
@ 2018-10-09 19:03                                 ` Derrick Stolee
  2018-10-09 21:14                                   ` Jeff King
  0 siblings, 1 reply; 78+ messages in thread
From: Derrick Stolee @ 2018-10-09 19:03 UTC (permalink / raw)
  To: Jeff King
  Cc: SZEDER Gábor, Ævar Arnfjörð Bjarmason,
	Stefan Beller, git, Duy Nguyen

On 10/9/2018 2:46 PM, Jeff King wrote:
> On Tue, Oct 09, 2018 at 09:48:20AM -0400, Derrick Stolee wrote:
>
>> [I snipped all of the parts about bloom filters that seemed entirely
>>   reasonable to me ;) ]
>>> Imagine we have that list. Is a bloom filter still the best data
>>> structure for each commit? At the point that we have the complete
>>> universe of paths, we could give each commit a bitmap of changed paths.
>>> That lets us ask "did this commit touch these paths" (collect the bits
>>> from the list of paths, then check for 1's), as well as "what paths did
>>> we touch" (collect the 1 bits, and then index the path list).  Those
>>> bitmaps should compress very well via EWAH or similar (most of them
>>> would be huge stretches of 0's punctuated by short runs of 1's).
>> I'm not convinced we would frequently have runs of 1's, and the bitmap would
>> not compress much better than simply listing the positions. For example, a
>> path "foo/bar" that resolves to a tree would only start a run if the next
>> changes are the initial section of entries in that tree (sorted
>> lexicographically) such as "foo/bar/a, foo/bar/b". If we deepen into a tree,
>> then we will break the run of 1's unless we changed every path deeper than
>> that tree.
> Yeah, I doubt we'd really have runs of 1's (by short, I just mean 1 or
> 2). I agree that listing the positions could work, though I sort of
> assumed that was more or less what a decent compressed bitmap would
> turn into. E.g., if bit N is set, we should be able to say "N-1
> zeroes, 1 one" in about the same size as we could say "position N".
>
> EWAH seems pretty awful in that regard. Or at least its serialized
> format is (or maybe it's our implementation that is bad).
>
> The patch below generates a bitmap for each commit in a repository (it
> doesn't output the total list of paths; I've left that as an exercise
> for the reader). On linux.git, the result is 57MB. But when I look at
> the individual bitmap sizes (via GIT_TRACE), they're quite silly.
> Storing a single set bit takes 28 bytes in serialized form!
>
> There are only around 120k unique paths (including prefix trees).
> Naively using run-length encoding and varints, our worst case should be
> something like 18-20 bits to say "120k zeroes, then a 1, then all
> zeroes".  And the average case should be better (you don't even need to
> say "120k", but some smaller number).
>
> I wonder if Roaring does better here.

In these sparse cases, usually Roaring will organize the data as "array 
chunks" which are simply lists of the values. The thing that makes this 
still compressible is that we store two bytes per entry, as the entries 
are grouped by a common most-significant two bytes. Since you say ~120k 
unique paths, the Roaring bitmap would have two or three chunks per 
bitmap (and those chunks could be empty). The overhead to store the 
chunk positions, types, and lengths does come at a cost, but it's more 
like 32 bytes _per commit_.
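
For concreteness, the split Roaring does is essentially the following
(not tied to any particular library):

#include <stdint.h>

/*
 * The high 16 bits pick the container, the low 16 bits are what a
 * sparse "array" container actually stores (2 bytes per entry).  With
 * ~120k distinct positions everything lands in containers 0 and 1.
 */
static inline uint16_t container_key(uint32_t bit)   { return bit >> 16; }
static inline uint16_t container_value(uint32_t bit) { return bit & 0xffff; }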

> Gzipping the resulting bitmaps drops the total size to about 7.5MB.
> That's not a particularly important number, but I think it shows that
> the built-in ewah compression is far from ideal.
>
> Just listing the positions with a series of varints would generally be
> fine, since we expect sparse bitmaps. I just hoped that a good
> RLE scheme would degrade to roughly that for the sparse case, but also
> perform well for more dense cases.
>
> So at any rate, I do think it would not be out of the question to store
> bitmaps like this. I'm much more worried about the maintenance cost of
> adding new entries incrementally. I think it's only feasible if we give
> up sorting, and then I wonder what other problems that might cause.

The patch below gives me a starting point to try the Bloom filter 
approach and see what the numbers are like. You did all the "git" stuff 
like computing the changed paths, so thanks!
>
> -- >8 --
> diff --git a/Makefile b/Makefile
> index 13e1c52478..f6e823f2d6 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -751,6 +751,7 @@ TEST_PROGRAMS_NEED_X += test-parse-options
>   TEST_PROGRAMS_NEED_X += test-pkt-line
>   TEST_PROGRAMS_NEED_X += test-svn-fe
>   TEST_PROGRAMS_NEED_X += test-tool
> +TEST_PROGRAMS_NEED_X += test-tree-bitmap
>   
>   TEST_PROGRAMS = $(patsubst %,t/helper/%$X,$(TEST_PROGRAMS_NEED_X))
>   
> diff --git a/t/helper/test-tree-bitmap.c b/t/helper/test-tree-bitmap.c
> new file mode 100644
> index 0000000000..bc5cf0e514
> --- /dev/null
> +++ b/t/helper/test-tree-bitmap.c
> @@ -0,0 +1,167 @@
> +#include "cache.h"
> +#include "revision.h"
> +#include "diffcore.h"
> +#include "argv-array.h"
> +#include "ewah/ewok.h"
> +
> +/* map of pathnames to bit positions */
> +struct pathmap_entry {
> +	struct hashmap_entry ent;
> +	unsigned pos;
> +	char path[FLEX_ARRAY];
> +};
> +
> +static int pathmap_entry_hashcmp(const void *unused_cmp_data,
> +				 const void *entry,
> +				 const void *entry_or_key,
> +				 const void *keydata)
> +{
> +	const struct pathmap_entry *a = entry;
> +	const struct pathmap_entry *b = entry_or_key;
> +	const char *key = keydata;
> +
> +	return strcmp(a->path, key ? key : b->path);
> +}
> +
> +static int pathmap_entry_strcmp(const void *va, const void *vb)
> +{
> +	struct pathmap_entry *a = *(struct pathmap_entry **)va;
> +	struct pathmap_entry *b = *(struct pathmap_entry **)vb;
> +	return strcmp(a->path, b->path);
> +}
> +
> +struct walk_paths_data {
> +	struct hashmap *paths;
> +	struct commit *commit;
> +};
> +
> +static void walk_paths(diff_format_fn_t fn, struct hashmap *paths)
> +{
> +	struct argv_array argv = ARGV_ARRAY_INIT;
> +	struct rev_info revs;
> +	struct walk_paths_data data;
> +	struct commit *commit;
> +
> +	argv_array_pushl(&argv, "rev-list",
> +			 "--all", "-t", "--no-renames",
> +			 NULL);
> +	init_revisions(&revs, NULL);
> +	setup_revisions(argv.argc, argv.argv, &revs, NULL);
> +	revs.diffopt.output_format = DIFF_FORMAT_CALLBACK;
> +	revs.diffopt.format_callback = fn;
> +	revs.diffopt.format_callback_data = &data;
> +
> +	data.paths = paths;
> +
> +	prepare_revision_walk(&revs);
> +	while ((commit = get_revision(&revs))) {
> +		data.commit = commit;
> +		diff_tree_combined_merge(commit, 0, &revs);
> +	}
> +
> +	reset_revision_walk();
> +	argv_array_clear(&argv);
> +}
> +
> +static void collect_commit_paths(struct diff_queue_struct *q,
> +				 struct diff_options *opts,
> +				 void *vdata)
> +{
> +	struct walk_paths_data *data = vdata;
> +	int i;
> +
> +	for (i = 0; i < q->nr; i++) {
> +		struct diff_filepair *p = q->queue[i];
> +		const char *path = p->one->path;
> +		struct pathmap_entry *entry;
> +		struct hashmap_entry lookup;
> +
> +		hashmap_entry_init(&lookup, strhash(path));
> +		entry = hashmap_get(data->paths, &lookup, path);
> +		if (entry)
> +			continue; /* already present */
> +
> +		FLEX_ALLOC_STR(entry, path, path);
> +		entry->ent = lookup;
> +		hashmap_put(data->paths, entry);
> +	}
> +}
> +
> +/* assign a bit position to all possible paths */
> +static void collect_paths(struct hashmap *paths)
> +{
> +	struct pathmap_entry **sorted;
> +	size_t i, n;
> +	struct hashmap_iter iter;
> +	struct pathmap_entry *entry;
> +
> +	/* grab all unique paths */
> +	hashmap_init(paths, pathmap_entry_hashcmp, NULL, 0);
> +	walk_paths(collect_commit_paths, paths);
> +
> +	/* and assign them bits in sorted order */
> +	n = hashmap_get_size(paths);
> +	ALLOC_ARRAY(sorted, n);
> +	i = 0;
> +	for (entry = hashmap_iter_first(paths, &iter);
> +	     entry;
> +	     entry = hashmap_iter_next(&iter)) {
> +		assert(i < n);
> +		sorted[i++] = entry;
> +	}
> +	QSORT(sorted, i, pathmap_entry_strcmp);
> +	for (i = 0; i < n; i++)
> +		sorted[i]->pos = i;
> +	free(sorted);
> +}
> +
> +/* generate the bitmap for a single commit */
> +static void generate_bitmap(struct diff_queue_struct *q,
> +			    struct diff_options *opts,
> +			    void *vdata)
> +{
> +	struct walk_paths_data *data = vdata;
> +	struct bitmap *bitmap = bitmap_new();
> +	struct ewah_bitmap *ewah;
> +	struct strbuf out = STRBUF_INIT;
> +	size_t i;
> +
> +	for (i = 0; i < q->nr; i++) {
> +		struct diff_filepair *p = q->queue[i];
> +		const char *path = p->one->path;
> +		struct pathmap_entry *entry;
> +		struct hashmap_entry lookup;
> +
> +		hashmap_entry_init(&lookup, strhash(path));
> +		entry = hashmap_get(data->paths, &lookup, path);
> +		if (!entry)
> +			BUG("mysterious path appeared: %s", path);
> +
> +		bitmap_set(bitmap, entry->pos);
> +	}
> +
> +	ewah = bitmap_to_ewah(bitmap);
> +	ewah_serialize_strbuf(ewah, &out);
> +	fwrite(out.buf, 1, out.len, stdout);
> +
> +	trace_printf("bitmap %s %u %u",
> +		     oid_to_hex(&data->commit->object.oid),
> +		     (unsigned)q->nr,
> +		     (unsigned)out.len);
> +
> +	strbuf_release(&out);
> +	ewah_free(ewah);
> +	bitmap_free(bitmap);
> +}
> +
> +int cmd_main(int argc, const char **argv)
> +{
> +	struct hashmap paths;
> +
> +	setup_git_directory();
> +	collect_paths(&paths);
> +
> +	walk_paths(generate_bitmap, &paths);
> +
> +	return 0;
> +}

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 0/4] Bloom filter experiment
  2018-10-08 16:57                     ` Derrick Stolee
  2018-10-08 18:10                       ` SZEDER Gábor
@ 2018-10-09 19:34                       ` SZEDER Gábor
  2018-10-09 19:34                         ` [PATCH 1/4] Add a (very) barebones Bloom filter implementation SZEDER Gábor
                                           ` (6 more replies)
  1 sibling, 7 replies; 78+ messages in thread
From: SZEDER Gábor @ 2018-10-09 19:34 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Junio C Hamano, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Stefan Beller, Duy Nguyen,
	SZEDER Gábor

To keep the ball rolling, here is my proof of concept in a somewhat
cleaned-up form, with still plenty of rough edges.

You can play around with it like this:

  $ GIT_USE_POC_BLOOM_FILTER=$((8*1024*1024*8)) git commit-graph write
  Computing commit graph generation numbers: 100% (52801/52801), done.
  Computing bloom filter: 100% (52801/52801), done.
  # Yeah, I even added progress indicator! :)
  $ GIT_TRACE_BLOOM_FILTER=2 GIT_USE_POC_BLOOM_FILTER=y git rev-list --count --full-history HEAD -- t/valgrind/valgrind.sh
  886
  20:40:24.783699 revision.c:486          bloom filter total queries: 66095 definitely not: 64953 maybe: 1142 false positives: 256 fp ratio: 0.003873

The value of $GIT_USE_POC_BLOOM_FILTER only really matters when writing
the Bloom filter, and it specifies the number of bits in the filter's
bitmap, IOW the above command creates an 8MB Bloom filter.  To make use
of the filter the variable can be anything non-empty.

Writing the Bloom filter is very slow as it is (yeah, that's why I
bothered with the progress indicator ;).  I wrote about it in patch 2's
commit message: the cause of about half of the slowness is rather
obvious, but I don't (yet) know what's responsible for the other half.


Not a single test...  but I've run loops over all files in git.git
comparing 'git rev-list HEAD -- $file's output with and without the
Bloom filter, and, surprisingly, they match.  My quick'n'dirty
experiments usually don't fare this well...


It's also available at:

  https://github.com/szeder/git bloom-filter-experiment


SZEDER Gábor (4):
  Add a (very) barebones Bloom filter implementation
  commit-graph: write a Bloom filter containing changed paths for each
    commit
  revision.c: use the Bloom filter to speed up path-limited revision
    walks
  revision.c: add GIT_TRACE_BLOOM_FILTER for a bit of statistics

 Makefile       |   1 +
 bloom-filter.c | 103 +++++++++++++++++++++++++++++++++++++++
 bloom-filter.h |  39 +++++++++++++++
 commit-graph.c | 116 ++++++++++++++++++++++++++++++++++++++++++++
 pathspec.h     |   1 +
 revision.c     | 129 +++++++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 389 insertions(+)
 create mode 100644 bloom-filter.c
 create mode 100644 bloom-filter.h

-- 
2.19.1.409.g0a0ee5eb6b


^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 1/4] Add a (very) barebones Bloom filter implementation
  2018-10-09 19:34                       ` [PATCH 0/4] Bloom filter experiment SZEDER Gábor
@ 2018-10-09 19:34                         ` SZEDER Gábor
  2018-10-09 19:34                         ` [PATCH 2/4] commit-graph: write a Bloom filter containing changed paths for each commit SZEDER Gábor
                                           ` (5 subsequent siblings)
  6 siblings, 0 replies; 78+ messages in thread
From: SZEDER Gábor @ 2018-10-09 19:34 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Junio C Hamano, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Stefan Beller, Duy Nguyen,
	SZEDER Gábor

---
 Makefile       |   1 +
 bloom-filter.c | 103 +++++++++++++++++++++++++++++++++++++++++++++++++
 bloom-filter.h |  39 +++++++++++++++++++
 3 files changed, 143 insertions(+)
 create mode 100644 bloom-filter.c
 create mode 100644 bloom-filter.h

diff --git a/Makefile b/Makefile
index 13e1c52478..850eafb3ee 100644
--- a/Makefile
+++ b/Makefile
@@ -827,6 +827,7 @@ LIB_OBJS += base85.o
 LIB_OBJS += bisect.o
 LIB_OBJS += blame.o
 LIB_OBJS += blob.o
+LIB_OBJS += bloom-filter.o
 LIB_OBJS += branch.o
 LIB_OBJS += bulk-checkin.o
 LIB_OBJS += bundle.o
diff --git a/bloom-filter.c b/bloom-filter.c
new file mode 100644
index 0000000000..7dce0e35fa
--- /dev/null
+++ b/bloom-filter.c
@@ -0,0 +1,103 @@
+#include "cache.h"
+#include "bloom-filter.h"
+
+void bloom_filter_init(struct bloom_filter *bf, uint32_t bit_size)
+{
+	if (bit_size % CHAR_BIT)
+		BUG("invalid size for bloom filter");
+
+	bf->nr_entries = 0;
+	bf->bit_size = bit_size;
+	bf->bits = xcalloc(1, bit_size / CHAR_BIT);
+}
+
+void bloom_filter_free(struct bloom_filter *bf)
+{
+	bf->nr_entries = 0;
+	bf->bit_size = 0;
+	FREE_AND_NULL(bf->bits);
+}
+
+
+void bloom_filter_set_bits(struct bloom_filter *bf, const uint32_t *offsets,
+			   int nr_offsets, int nr_entries)
+{
+	int i;
+	for (i = 0; i < nr_offsets; i++) {
+		uint32_t byte_offset = (offsets[i] % bf->bit_size) / CHAR_BIT;
+		unsigned char mask = 1 << offsets[i] % CHAR_BIT;
+		bf->bits[byte_offset] |= mask;
+	}
+	bf->nr_entries += nr_entries;
+}
+
+int bloom_filter_check_bits(struct bloom_filter *bf, const uint32_t *offsets,
+			    int nr)
+{
+	int i;
+	for (i = 0; i < nr; i++) {
+		uint32_t byte_offset = (offsets[i] % bf->bit_size) / CHAR_BIT;
+		unsigned char mask = 1 << offsets[i] % CHAR_BIT;
+		if (!(bf->bits[byte_offset] & mask))
+			return 0;
+	}
+	return 1;
+}
+
+
+void bloom_filter_add_hash(struct bloom_filter *bf, const unsigned char *hash)
+{
+	uint32_t offsets[GIT_MAX_RAWSZ / sizeof(uint32_t)];
+	hashcpy((unsigned char*)offsets, hash);
+	bloom_filter_set_bits(bf, offsets,
+			     the_hash_algo->rawsz / sizeof(*offsets), 1);
+}
+
+int bloom_filter_check_hash(struct bloom_filter *bf, const unsigned char *hash)
+{
+	uint32_t offsets[GIT_MAX_RAWSZ / sizeof(uint32_t)];
+	hashcpy((unsigned char*)offsets, hash);
+	return bloom_filter_check_bits(bf, offsets,
+			the_hash_algo->rawsz / sizeof(*offsets));
+}
+
+void hashxor(const unsigned char *hash1, const unsigned char *hash2,
+	     unsigned char *out)
+{
+	int i;
+	for (i = 0; i < the_hash_algo->rawsz; i++)
+		out[i] = hash1[i] ^ hash2[i];
+}
+
+/* hardcoded for now... */
+static GIT_PATH_FUNC(git_path_bloom, "objects/info/bloom")
+
+int bloom_filter_load(struct bloom_filter *bf)
+{
+	int fd = open(git_path_bloom(), O_RDONLY);
+
+	if (fd < 0)
+		return -1;
+
+	read_in_full(fd, &bf->nr_entries, sizeof(bf->nr_entries));
+	read_in_full(fd, &bf->bit_size, sizeof(bf->bit_size));
+	if (bf->bit_size % CHAR_BIT)
+		BUG("invalid size for bloom filter");
+	bf->bits = xmalloc(bf->bit_size / CHAR_BIT);
+	read_in_full(fd, bf->bits, bf->bit_size / CHAR_BIT);
+
+	close(fd);
+
+	return 0;
+}
+
+void bloom_filter_write(struct bloom_filter *bf)
+{
+	int fd = xopen(git_path_bloom(), O_WRONLY | O_CREAT | O_TRUNC, 0666);
+
+	write_in_full(fd, &bf->nr_entries, sizeof(bf->nr_entries));
+	write_in_full(fd, &bf->bit_size, sizeof(bf->bit_size));
+	write_in_full(fd, bf->bits, bf->bit_size / CHAR_BIT);
+
+	close(fd);
+}
diff --git a/bloom-filter.h b/bloom-filter.h
new file mode 100644
index 0000000000..94d0af1708
--- /dev/null
+++ b/bloom-filter.h
@@ -0,0 +1,39 @@
+#ifndef BLOOM_FILTER_H
+#define BLOOM_FILTER_H
+
+#include "git-compat-util.h"
+
+struct bloom_filter {
+	uint32_t nr_entries;
+	uint32_t bit_size;
+	unsigned char *bits;
+};
+
+
+void bloom_filter_init(struct bloom_filter *bf, uint32_t bit_size);
+void bloom_filter_free(struct bloom_filter *bf);
+
+void bloom_filter_set_bits(struct bloom_filter *bf, const uint32_t *offsets,
+			   int nr_offsets, int nr_entries);
+int bloom_filter_check_bits(struct bloom_filter *bf, const uint32_t *offsets,
+			    int nr);
+
+/*
+ * Turns the given (SHA1) hash into 5 unsigned ints, and sets the bits at
+ * those positions (modulo the bitmap's size) in the Bloom filter.
+ */
+void bloom_filter_add_hash(struct bloom_filter *bf, const unsigned char *hash);
+/*
+ * Turns the given (SHA1) hash into 5 unsigned ints, and checks the bits at
+ * those positions (modulo the bitmap's size) in the Bloom filter.
+ * Returns 1 if all those bits are set, 0 otherwise.
+ */
+int bloom_filter_check_hash(struct bloom_filter *bf, const unsigned char *hash);
+
+void hashxor(const unsigned char *hash1, const unsigned char *hash2,
+	     unsigned char *out);
+
+int bloom_filter_load(struct bloom_filter *bf);
+void bloom_filter_write(struct bloom_filter *bf);
+
+#endif
-- 
2.19.1.409.g0a0ee5eb6b


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 2/4] commit-graph: write a Bloom filter containing changed paths for each commit
  2018-10-09 19:34                       ` [PATCH 0/4] Bloom filter experiment SZEDER Gábor
  2018-10-09 19:34                         ` [PATCH 1/4] Add a (very) barebones Bloom filter implementation SZEDER Gábor
@ 2018-10-09 19:34                         ` SZEDER Gábor
  2018-10-09 21:06                           ` Jeff King
  2018-10-09 19:34                         ` [PATCH 3/4] revision.c: use the Bloom filter to speed up path-limited revision walks SZEDER Gábor
                                           ` (4 subsequent siblings)
  6 siblings, 1 reply; 78+ messages in thread
From: SZEDER Gábor @ 2018-10-09 19:34 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Junio C Hamano, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Stefan Beller, Duy Nguyen,
	SZEDER Gábor

You can create a Bloom filter containing changed paths for each commit
in the history by running:

  $ GIT_USE_POC_BLOOM_FILTER=$((8*1024*1024*8)) git commit-graph write

where the value of $GIT_USE_POC_BLOOM_FILTER must specify the number
of bits used in the Bloom filter's bitmap.

Writing the Bloom filter is tied into the 'git commit-graph' command,
mainly because that's where it might end up anyway, if it turns out to
be useful, but for now it's written to a different file
('objects/info/bloom').  No incremental updates yet; the Bloom filter
is regenerated from scratch each time.

There is one single, big Bloom filter for the whole history (mainly
because that was the simplest way to get this PoC experiment up and
running).  The Bloom filter stores tuples of (path, parent-oid,
commit-oid) using the hash function:

  XOR(SHA1(path), XOR(parent-oid, commit-oid))

The resulting 20 bytes are turned into 5 unsigned 32 bit ints, which
then specify the positions of the bits to set or check in the Bloom
filter's bitmap (modulo the bitmap's size).

The parent oid is taken into account, because during revision walking
the diff is checked in rev_compare_tree(), which compares one commit
to _one_ of its parents, and in case of merge commits there are
multiple rev_compare_tree() calls with the same commit but with
different parent parameters.

Combining hashes with XOR is, in general, frowned upon, because of its
intrinsic properties:

  XOR(A, A) = 0
  XOR(A, B) = XOR(B, A)

In this case it should be fine, because all of XOR's operands are
cryptographic hashes, so we can safely assume that they'll never be
the same.

Add each leading directory of the changed file, i.e. for
'dir/subdir/file' add 'dir' and 'dir/subdir' as well, so the Bloom
filter could be used to speed up commands like 'git log dir/subdir',
too.

Creating the Bloom filter is sloooow.  Running it on git.git takes
about 23s on my hardware, while

  git log --format='%H%n%P' --name-only --all >/dev/null

gathers all the information necessary for that in about 5.3s.

About 30% of the runtime is wasted by naively hashing and rehashing
the same paths over and over again.  A hash function faster than SHA1
could help with that; I just haven't yet bothered with spicing up
memhash() and friends to produce 5 ints, nor wanted to introduce
another hash function with wider output just yet.  Or perhaps we
could keep a hashmap mapping paths of files to their SHAs and the
SHAs of their leading directories...

That's not the only factor though.  After ripping out all the loops
from add_changes_to_bloom_filter() there are no repeated SHA1(path)
calculations and no writes to the Bloom filter at all, i.e. all that
remains is revision walking and diffing, yet it still takes about 16s,
i.e. around 3 times more than the above-mentioned 'git log' command.
I guess some other fields in 'struct rev_info' or 'struct
diff_options' need to be set, but both of those are huge, and I
haven't yet spotted which ones.
---
 commit-graph.c | 116 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 116 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index a1454c52a6..f415d3b41f 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -14,6 +14,9 @@
 #include "object-store.h"
 #include "alloc.h"
 #include "progress.h"
+#include "bloom-filter.h"
+#include "diff.h"
+#include "diffcore.h"
 
 #define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
 #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
@@ -709,6 +712,117 @@ static int add_ref_to_list(const char *refname,
 	return 0;
 }
 
+static void add_changes_to_bloom_filter(struct bloom_filter *bf,
+					struct commit *parent,
+					struct commit *commit,
+					struct diff_options *diffopt)
+{
+	unsigned char p_c_hash[GIT_MAX_RAWSZ];
+	int i;
+
+	hashxor(parent->object.oid.hash, commit->object.oid.hash, p_c_hash);
+
+	diff_tree_oid(&parent->object.oid, &commit->object.oid, "", diffopt);
+	diffcore_std(diffopt);
+
+	for (i = 0; i < diff_queued_diff.nr; i++) {
+		const char *path = diff_queued_diff.queue[i]->two->path;
+		const char *p = path;
+
+		/*
+		 * Add each leading directory of the changed file, i.e. for
+		 * 'dir/subdir/file' add 'dir' and 'dir/subdir' as well, so
+		 * the Bloom filter could be used to speed up commands like
+		 * 'git log dir/subdir', too.
+		 *
+		 * Note that directories are added without the trailing '/'.
+		 */
+		do {
+			git_hash_ctx ctx;
+			unsigned char name_hash[GIT_MAX_RAWSZ];
+			unsigned char hash[GIT_MAX_RAWSZ];
+
+			p = strchrnul(p + 1, '/');
+
+			/*
+			 * Beware all the wasted CPU cycles!
+			 *
+			 * Most paths change (a lot) more than once in the
+			 * history of a repository, so this hashes the same
+			 * paths over and over again, accounting for almost
+			 * 40% of the runtime.
+			 */
+			the_hash_algo->init_fn(&ctx);
+			the_hash_algo->update_fn(&ctx, path, p - path);
+			the_hash_algo->final_fn(name_hash, &ctx);
+
+			hashxor(name_hash, p_c_hash, hash);
+			bloom_filter_add_hash(bf, hash);
+		} while (*p);
+
+		diff_free_filepair(diff_queued_diff.queue[i]);
+	}
+
+	free(diff_queued_diff.queue);
+	DIFF_QUEUE_CLEAR(&diff_queued_diff);
+}
+
+static void fill_bloom_filter(struct bloom_filter *bf,
+				    struct progress *progress)
+{
+	struct rev_info revs;
+	const char *revs_argv[] = {NULL, "--all", NULL};
+	struct commit *commit;
+	int i = 0;
+
+	/* We (re-)create the bloom filter from scratch every time for now. */
+	init_revisions(&revs, NULL);
+	revs.diffopt.flags.recursive = 1;
+	setup_revisions(2, revs_argv, &revs, NULL);
+
+	if (prepare_revision_walk(&revs))
+		die("revision walk setup failed while preparing bloom filter");
+
+	while ((commit = get_revision(&revs))) {
+		struct commit_list *parent;
+
+		for (parent = commit->parents; parent; parent = parent->next)
+			add_changes_to_bloom_filter(bf, parent->item, commit,
+						    &revs.diffopt);
+
+		display_progress(progress, ++i);
+	}
+}
+
+static void write_bloom_filter(int report_progress, int commit_nr)
+{
+	struct bloom_filter bf;
+	struct progress *progress = NULL;
+	const char *v = getenv("GIT_USE_POC_BLOOM_FILTER");
+	unsigned int bitsize;
+	char *end;
+
+	if (!v)
+		return;
+
+	bitsize = strtol(v, &end, 10);
+	if (*end)
+		die("GIT_USE_POC_BLOOM_FILTER must specify the number of bits in the bloom filter (multiple of 8, n < 2^32)");
+
+	bloom_filter_init(&bf, bitsize);
+
+	if (report_progress)
+		progress = start_progress(_("Computing bloom filter"),
+					  commit_nr);
+
+	fill_bloom_filter(&bf, progress);
+
+	bloom_filter_write(&bf);
+	bloom_filter_free(&bf);
+
+	stop_progress(&progress);
+}
+
 void write_commit_graph_reachable(const char *obj_dir, int append,
 				  int report_progress)
 {
@@ -916,6 +1030,8 @@ void write_commit_graph(const char *obj_dir,
 	finalize_hashfile(f, NULL, CSUM_HASH_IN_STREAM | CSUM_FSYNC);
 	commit_lock_file(&lk);
 
+	write_bloom_filter(report_progress, commits.nr);
+
 	free(graph_name);
 	free(commits.list);
 	free(oids.list);
-- 
2.19.1.409.g0a0ee5eb6b


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 3/4] revision.c: use the Bloom filter to speed up path-limited revision walks
  2018-10-09 19:34                       ` [PATCH 0/4] Bloom filter experiment SZEDER Gábor
  2018-10-09 19:34                         ` [PATCH 1/4] Add a (very) barebones Bloom filter implementation SZEDER Gábor
  2018-10-09 19:34                         ` [PATCH 2/4] commit-graph: write a Bloom filter containing changed paths for each commit SZEDER Gábor
@ 2018-10-09 19:34                         ` SZEDER Gábor
  2018-10-09 19:34                         ` [PATCH 4/4] revision.c: add GIT_TRACE_BLOOM_FILTER for a bit of statistics SZEDER Gábor
                                           ` (3 subsequent siblings)
  6 siblings, 0 replies; 78+ messages in thread
From: SZEDER Gábor @ 2018-10-09 19:34 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Junio C Hamano, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Stefan Beller, Duy Nguyen,
	SZEDER Gábor

When $GIT_USE_POC_BLOOM_FILTER is set to a non-empty value, path-limited
revision walks will use the Bloom filter to speed things up.

Load the Bloom filter in prepare_revision_walk(); probably not the
best place for it, but it should suffice for experimenting with 'git
rev-list'.

Checking the Bloom filter is plugged into rev_compare_tree(), the
function that compares the given paths in a commit to one of its
parents.  If checking the Bloom filter returns that the interesting
paths did not change, then it won't bother with running the
expensive diff.

Add a new field to 'struct pathspec' to hold the SHA of the path, so
that the hash is computed only once and then reused when checking each
commit.
---
 pathspec.h |  1 +
 revision.c | 97 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 98 insertions(+)

diff --git a/pathspec.h b/pathspec.h
index a6525a6551..565a26d91e 100644
--- a/pathspec.h
+++ b/pathspec.h
@@ -47,6 +47,7 @@ struct pathspec {
 			} match_mode;
 		} *attr_match;
 		struct attr_check *attr_check;
+		unsigned char name_hash[GIT_MAX_RAWSZ];
 	} *items;
 };
 
diff --git a/revision.c b/revision.c
index c5d0cb6599..3565785ca6 100644
--- a/revision.c
+++ b/revision.c
@@ -27,6 +27,7 @@
 #include "commit-reach.h"
 #include "commit-graph.h"
 #include "prio-queue.h"
+#include "bloom-filter.h"
 
 volatile show_early_output_fn_t show_early_output;
 
@@ -463,6 +464,62 @@ static void file_change(struct diff_options *options,
 	options->flags.has_changes = 1;
 }
 
+/* Another static... */
+static struct bloom_filter bf;
+
+static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
+						 struct commit *parent,
+						 struct commit *commit)
+{
+	unsigned char p_c_hash[GIT_MAX_RAWSZ];
+	int i;
+
+	if (!bf.bits)
+		return -1;
+	/*
+	 * If a commit is not in the 'commit-graph' file, then it's not in
+	 * the Bloom filter either, so any query into that would report
+	 * back a false negative, which is unacceptable.
+	 *
+	 * The writer of the Bloom filter must ensure that all commits that
+	 * go into the 'commit-graph' go into the Bloom filter as well.
+	 *
+	 * If we won't tie the Bloom filter to the commit-graph tightly,
+	 * then we'll have to come up with another means to prevent such
+	 * false negatives.
+	 */
+	if (!the_repository->objects->commit_graph)
+		return -1;
+	if (commit->generation == GENERATION_NUMBER_INFINITY)
+		return -1;
+
+	hashxor(parent->object.oid.hash, commit->object.oid.hash, p_c_hash);
+
+	for (i = 0; i < revs->pruning.pathspec.nr; i++) {
+		struct pathspec_item *pi = &revs->pruning.pathspec.items[i];
+		unsigned char hash[GIT_MAX_RAWSZ];
+
+		hashxor(pi->name_hash, p_c_hash, hash);
+		if (bloom_filter_check_hash(&bf, hash)) {
+			/*
+			 * At least one of the interesting pathspecs differs,
+			 * so we can return early and let the diff machinery
+			 * make sure that they indeed differ.
+			 *
+			 * Note: the diff machinery will look at all the given
+			 * paths; a possible future optimization might bring
+			 * the Bloom filter and the diff machinery closer to
+			 * each other, so the diff won't waste time looking
+			 * at those paths that the Bloom filter have found
+			 * unchanged.
+			 */
+			return 1;
+		}
+	}
+
+	return 0;
+}
+
 static int rev_compare_tree(struct rev_info *revs,
 			    struct commit *parent, struct commit *commit)
 {
@@ -492,6 +549,9 @@ static int rev_compare_tree(struct rev_info *revs,
 			return REV_TREE_SAME;
 	}
 
+	if (!check_maybe_different_in_bloom_filter(revs, parent, commit))
+		return REV_TREE_SAME;
+
 	tree_difference = REV_TREE_SAME;
 	revs->pruning.flags.has_changes = 0;
 	if (diff_tree_oid(&t1->object.oid, &t2->object.oid, "",
@@ -3106,6 +3166,40 @@ static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
 	}
 }
 
+void prepare_to_use_bloom_filter(struct rev_info *revs)
+{
+	const char *env = getenv("GIT_USE_POC_BLOOM_FILTER");
+	int i;
+
+	if (!env || !*env)
+		return;
+
+	for (i = 0; i < revs->pruning.pathspec.nr; i++) {
+		struct pathspec_item *pi = &revs->pruning.pathspec.items[i];
+		const char *path = pi->match;
+		git_hash_ctx ctx;
+		size_t len = strlen(path);
+
+		/*
+		 * TODO: What about wildcards?  We'd probably just want to
+		 * ignore the Bloom filter then.
+		 */
+		the_hash_algo->init_fn(&ctx);
+		the_hash_algo->update_fn(&ctx, path,
+					 path[len - 1] == '/' ? len - 1 : len);
+		the_hash_algo->final_fn(pi->name_hash, &ctx);
+	}
+
+	if (bf.bits)
+		/* Already loaded. */
+		return;
+
+	if (bloom_filter_load(&bf) < 0) {
+		warning("you wanted to use the Bloom filter, but it couldn't be loaded");
+		return;
+	}
+}
+
 int prepare_revision_walk(struct rev_info *revs)
 {
 	int i;
@@ -3155,6 +3249,9 @@ int prepare_revision_walk(struct rev_info *revs)
 		simplify_merges(revs);
 	if (revs->children.name)
 		set_children(revs);
+
+	prepare_to_use_bloom_filter(revs);
+
 	return 0;
 }
 
-- 
2.19.1.409.g0a0ee5eb6b


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 4/4] revision.c: add GIT_TRACE_BLOOM_FILTER for a bit of statistics
  2018-10-09 19:34                       ` [PATCH 0/4] Bloom filter experiment SZEDER Gábor
                                           ` (2 preceding siblings ...)
  2018-10-09 19:34                         ` [PATCH 3/4] revision.c: use the Bloom filter to speed up path-limited revision walks SZEDER Gábor
@ 2018-10-09 19:34                         ` SZEDER Gábor
  2018-10-09 19:47                         ` [PATCH 0/4] Bloom filter experiment Derrick Stolee
                                           ` (2 subsequent siblings)
  6 siblings, 0 replies; 78+ messages in thread
From: SZEDER Gábor @ 2018-10-09 19:34 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Junio C Hamano, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Stefan Beller, Duy Nguyen,
	SZEDER Gábor

It will output something like this:

  $ GIT_TRACE_BLOOM_FILTER=2 GIT_USE_POC_BLOOM_FILTER=y git rev-list --count --full-history HEAD -- t/valgrind/valgrind.sh
  886
  17:24:42.915053 revision.c:484          bloom filter total queries: 66095 definitely not: 64953 maybe: 1142 false positives: 256 fp ratio: 0.003873
---
 revision.c | 34 +++++++++++++++++++++++++++++++++-
 1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/revision.c b/revision.c
index 3565785ca6..2f3f73b4dd 100644
--- a/revision.c
+++ b/revision.c
@@ -467,6 +467,25 @@ static void file_change(struct diff_options *options,
 /* Another static... */
 static struct bloom_filter bf;
 
+static struct trace_key trace_bloom_filter = TRACE_KEY_INIT(BLOOM_FILTER);
+static int trace_bloom_filter_atexit_registered;
+static unsigned int bloom_filter_count_maybe;
+static unsigned int bloom_filter_count_definitely_not;
+static unsigned int bloom_filter_count_false_positive;
+
+static void print_bloom_filter_stats_atexit(void)
+{
+	unsigned int total = bloom_filter_count_maybe +
+			     bloom_filter_count_definitely_not;
+	trace_printf_key(&trace_bloom_filter,
+			 "bloom filter total queries: %d definitely not: %d maybe: %d false positives: %d fp ratio: %f\n",
+			 total,
+			 bloom_filter_count_definitely_not,
+			 bloom_filter_count_maybe,
+			 bloom_filter_count_false_positive,
+			 (1.0 * bloom_filter_count_false_positive) / total);
+}
+
 static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 						 struct commit *parent,
 						 struct commit *commit)
@@ -513,10 +532,12 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 			 * at those paths that the Bloom filter have found
 			 * unchanged.
 			 */
+			bloom_filter_count_maybe++;
 			return 1;
 		}
 	}
 
+	bloom_filter_count_definitely_not++;
 	return 0;
 }
 
@@ -525,6 +546,7 @@ static int rev_compare_tree(struct rev_info *revs,
 {
 	struct tree *t1 = get_commit_tree(parent);
 	struct tree *t2 = get_commit_tree(commit);
+	int bloom_ret;
 
 	if (!t1)
 		return REV_TREE_NEW;
@@ -549,7 +571,8 @@ static int rev_compare_tree(struct rev_info *revs,
 			return REV_TREE_SAME;
 	}
 
-	if (!check_maybe_different_in_bloom_filter(revs, parent, commit))
+	bloom_ret = check_maybe_different_in_bloom_filter(revs, parent, commit);
+	if (bloom_ret == 0)
 		return REV_TREE_SAME;
 
 	tree_difference = REV_TREE_SAME;
@@ -557,6 +580,8 @@ static int rev_compare_tree(struct rev_info *revs,
 	if (diff_tree_oid(&t1->object.oid, &t2->object.oid, "",
 			   &revs->pruning) < 0)
 		return REV_TREE_DIFFERENT;
+	if (bloom_ret == 1 && tree_difference == REV_TREE_SAME)
+		bloom_filter_count_false_positive++;
 	return tree_difference;
 }
 
@@ -3198,6 +3223,13 @@ void prepare_to_use_bloom_filter(struct rev_info *revs)
 		warning("you wanted to use the Bloom filter, but it couldn't be loaded");
 		return;
 	}
+
+	if (trace_want(&trace_bloom_filter)) {
+		if (!trace_bloom_filter_atexit_registered) {
+			atexit(print_bloom_filter_stats_atexit);
+			trace_bloom_filter_atexit_registered = 1;
+		}
+	}
 }
 
 int prepare_revision_walk(struct rev_info *revs)
-- 
2.19.1.409.g0a0ee5eb6b


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH 0/4] Bloom filter experiment
  2018-10-09 19:34                       ` [PATCH 0/4] Bloom filter experiment SZEDER Gábor
                                           ` (3 preceding siblings ...)
  2018-10-09 19:34                         ` [PATCH 4/4] revision.c: add GIT_TRACE_BLOOM_FILTER for a bit of statistics SZEDER Gábor
@ 2018-10-09 19:47                         ` Derrick Stolee
  2018-10-11  1:21                         ` [PATCH 0/2] Per-commit filter proof of concept Jonathan Tan
  2018-10-15 14:39                         ` [PATCH 0/4] Bloom filter experiment Derrick Stolee
  6 siblings, 0 replies; 78+ messages in thread
From: Derrick Stolee @ 2018-10-09 19:47 UTC (permalink / raw)
  To: SZEDER Gábor, git
  Cc: Jeff King, Junio C Hamano, Ævar Arnfjörð Bjarmason,
	Stefan Beller, Duy Nguyen

On 10/9/2018 3:34 PM, SZEDER Gábor wrote:
> To keep the ball rolling, here is my proof of concept in a somewhat
> cleaned-up form, with still plenty of rough edges.
>
> You can play around with it like this:
>
>    $ GIT_USE_POC_BLOOM_FILTER=$((8*1024*1024*8)) git commit-graph write
>    Computing commit graph generation numbers: 100% (52801/52801), done.
>    Computing bloom filter: 100% (52801/52801), done.
>    # Yeah, I even added progress indicator! :)
>    $ GIT_TRACE_BLOOM_FILTER=2 GIT_USE_POC_BLOOM_FILTER=y git rev-list --count --full-history HEAD -- t/valgrind/valgrind.sh
>    886
>    20:40:24.783699 revision.c:486          bloom filter total queries: 66095 definitely not: 64953 maybe: 1142 false positives: 256 fp ratio: 0.003873
>
> The value of $GIT_USE_POC_BLOOM_FILTER only really matters when writing
> the Bloom filter, and it specifies the number of bits in the filter's
>    bitmap, IOW the above command creates an 8MB Bloom filter.  To make use
> of the filter the variable can be anything non-empty.
>
> Writing the Bloom filter is very slow as it is (yeah, that's why I
> bothered with the progress indicator ;).  I wrote about it in patch 2's
> commit message: the cause of about half of the slowness is rather
> obvious, but I don't (yet) know what's responsible for the other half.
>
>
> Not a single test...  but I've run loops over all files in git.git
> comparing 'git rev-list HEAD -- $file's output with and without the
> Bloom filter, and, surprisingly, they match.  My quick'n'dirty
> experiments usually don't fare this well...
>
>
> It's also available at:
>
>    https://github.com/szeder/git bloom-filter-experiment

Thanks! I will take a close look at this tomorrow and start playing with it.

-Stolee


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 2/4] commit-graph: write a Bloom filter containing changed paths for each commit
  2018-10-09 19:34                         ` [PATCH 2/4] commit-graph: write a Bloom filter containing changed paths for each commit SZEDER Gábor
@ 2018-10-09 21:06                           ` Jeff King
  2018-10-09 21:37                             ` SZEDER Gábor
  0 siblings, 1 reply; 78+ messages in thread
From: Jeff King @ 2018-10-09 21:06 UTC (permalink / raw)
  To: SZEDER Gábor
  Cc: git, Junio C Hamano, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Stefan Beller, Duy Nguyen

On Tue, Oct 09, 2018 at 09:34:43PM +0200, SZEDER Gábor wrote:

> Creating the Bloom filter is sloooow.  Running it on git.git takes
> about 23s on my hardware, while
> 
>   git log --format='%H%n%P' --name-only --all >/dev/null
> 
> gathers all the information necessary for that in about 5.3s.

That command won't open the trees for merges at all. But your
implementation here looks like it does a diff against each parent of a
merge. Adding "-m" would be a more accurate comparison, I think.

Though I find that puzzling, because "-m --name-only" seems to take
about 20x longer, not 3x. So perhaps I'm missing something.

-Peff

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph)
  2018-10-09 19:03                                 ` Derrick Stolee
@ 2018-10-09 21:14                                   ` Jeff King
  2018-10-09 23:12                                     ` Bloom Filters Jeff King
  0 siblings, 1 reply; 78+ messages in thread
From: Jeff King @ 2018-10-09 21:14 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: SZEDER Gábor, Ævar Arnfjörð Bjarmason,
	Stefan Beller, git, Duy Nguyen

On Tue, Oct 09, 2018 at 03:03:08PM -0400, Derrick Stolee wrote:

> > I wonder if Roaring does better here.
> 
> In these sparse cases, usually Roaring will organize the data as "array
> chunks" which are simply lists of the values. The thing that makes this
> still compressible is that we store two bytes per entry, as the entries are
> grouped by a common most-significant two bytes. Since you say ~120k unique
> paths, the Roaring bitmap would have two or three chunks per bitmap (and
> those chunks could be empty). The overhead to store the chunk positions,
> types, and lengths does come at a cost, but it's more like 32 bytes _per
> commit_.
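
(For anyone following along, an "array chunk" is roughly the shape
sketched below; this is a simplified illustration, not Roaring's actual
container layout.)

  struct array_container {
  	uint16_t key;         /* high 16 bits shared by all values here */
  	uint16_t cardinality; /* number of entries below */
  	uint16_t values[FLEX_ARRAY]; /* sorted low 16 bits, 2 bytes each */
  };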

Hmph. It really sounds like we could do better with a custom RLE
solution. But that makes me feel like I'm missing something, because
surely I can't invent something better than the state of the art in a
simple thought experiment, right?

I know what I'm proposing would be quite bad for random access, but my
impression is that EWAH is the same. For the scale of bitmaps we're
talking about, I think linear/streaming access through the bitmap would
be OK.

> > So at any rate, I do think it would not be out of the question to store
> > bitmaps like this. I'm much more worried about the maintenance cost of
> > adding new entries incrementally. I think it's only feasible if we give
> > up sorting, and then I wonder what other problems that might cause.
> The patch below gives me a starting point to try the Bloom filter approach
> and see what the numbers are like. You did all the "git" stuff like
> computing the changed paths, so thanks!

Great, I hope it can be useful. I almost wrote it as perl consuming the
output of "log --format=%h --name-only", but realized I didn't have a
perl ewah implementation handy.

You'll probably want to tweak this part:

> > +	prepare_revision_walk(&revs);
> > +	while ((commit = get_revision(&revs))) {
> > +		data.commit = commit;
> > +		diff_tree_combined_merge(commit, 0, &revs);
> > +	}

...to handle merges in a particular way. This will actually ignore
merges totally. You could add "-m" to the revision arguments to get a
per-parent diff, but of course you'd see those in your callback
individually. If you want to do _just_ the first parent diff, I think
you'll have to pick it apart manually, like:

  while ((commit = get_revision(&revs))) {
	const struct object_id *parent;

	/* ignore non-first parents, but handle root commits like --root */
	if (commit->parents)
		parent = &commit->parents->item->object.oid;
	else
		parent = the_hash_algo->empty_tree;

	diff_tree_oid(parent, &commit->oid, ...);
  }

-Peff

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: We should add a "git gc --auto" after "git clone" due to commit graph
  2018-10-09  3:08                           ` Jeff King
  2018-10-09 13:48                             ` Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph) Derrick Stolee
@ 2018-10-09 21:30                             ` SZEDER Gábor
  1 sibling, 0 replies; 78+ messages in thread
From: SZEDER Gábor @ 2018-10-09 21:30 UTC (permalink / raw)
  To: Jeff King
  Cc: Derrick Stolee, Ævar Arnfjörð Bjarmason,
	Stefan Beller, git, Duy Nguyen

On Mon, Oct 08, 2018 at 11:08:03PM -0400, Jeff King wrote:
> I'd have done it as one fixed-size filter per commit. Then you should be
> able to hash the path keys once, and apply the result as a bitwise query
> to each individual commit (I'm assuming that it's constant-time to
> access the filter for each, as an index into an mmap'd array, with the
> offset coming from a commit-graph entry we'd be able to look up anyway).

I used one big Bloom filter for the whole history, because that was
the simplest way to get going, and because I was primarily interested
in the potential benefits instead of the cost of generating and
maintaining it.

Using an 8MB filter for git.git results in a false positive rate
between 0.21% and 0.53%.  Splitting that up among ~53k commits we get
~160 bytes for each.  At first sight that seems rather small, but
gathering a bit of statistics shows that 99% of our commits don't
change more than 10 files.

One advantage of the "one Bloom filter for each commit" is that if a
commit doesn't have a corresponding Bloom filter, then, well, we can't
query the non-existing filter.  OTOH, with one big Bloom filter we
have to be careful to only ever query it with commits whose changes
have already been added, otherwise we can get false negatives.

> I think it would also be easier to deal with maintenance, since each
> filter is independent (IIRC, you cannot delete from a bloom filter
> without re-adding all of the other keys).

Accumulating entries related to unreachable commits will eventually
increase the false positive rate, but otherwise it won't cause false
negatives, and won't increase the size of the Bloom filter or the time
necessary to query it.  So not deleting those entries right away is
not an issue, and I think it could be postponed until bigger gc runs.

[...]

> But there's also a related question: how do we match pathspec patterns?
> For a changed path like "foo/bar/baz", I imagine a bloom filter would
> mark all of "foo", "foo/bar", and "foo/bar/baz".

Indeed, that's what I did.
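
Roughly like the sketch below; add_path_component() is just a stand-in
for however the actual patch hashes a string into the filter, so take
it as an illustration rather than the real code:

  static void add_path_and_parents(struct bloom_filter *bf, const char *path)
  {
  	const char *slash = path;

  	/* mark every leading directory, then the full path itself */
  	while ((slash = strchr(slash, '/')) != NULL) {
  		add_path_component(bf, path, slash - path);
  		slash++;
  	}
  	add_path_component(bf, path, strlen(path));
  }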

> But what about "*.c"? I
> don't think a bloom filter can answer that.

Surely not, but it could easily return "maybe", and thus simply fall
back to look at the diff.

However, I've looked through the output of

  grep '^git log[^|]*[\[*?]' ~/.bash_history

and haven't found a single case where I used Git's globbing.  When I
did use globbing, I always used the shell's.  Yeah, just one data
point, and others surely use it differently, etc...  but I think we
should consider whether it's common enough to worry about and to
increase complexity because of it.

[...]

> So let's imagine we'd store such a cache external to the regular object
> data (i.e., as a commit-graph entry). The "log --raw" diff of linux.git
> has 1.7M entries. The paths should easily compress to a single 32-bit
> integer (e.g., as an index into a big path list). The oids are 20 bytes.
> Add a few bytes for modes. That's about 80MB. Big, but not impossibly
> so. Maybe pushing it for true gigantic repos, though.

In my experiments with the Linux repo a 256MB Bloom filter has ~0.3%
false positive rate, while a 128MB filter had 3-4%.  That's even
bigger, though compared to the size of a full kernel checkout it's
arguably not that much.

> Those numbers are ignoring merges, too. The meaning of "did this commit
> touch that path" is a lot trickier for a merge commit, and I think may
> depend on context. I'm not sure how even a bloom filter solution would
> handle that (I was assuming we'd mostly punt and let merges fall back to
> opening up the trees).

During revision walking rev_compare_tree() checks whether the given
paths changed between a commit and _one_ of its parents, and in case
of merge commits it's invoked multiple times with the same commit but
with different parent parameters.  By storing (changed-path,
parent-oid, commit-oid) tuples in the Bloom filter it can deal with
merges, too.
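
For illustration, the keying could look something like the sketch below
(hashxor() and bloom_filter_add_hash() are from the experimental patch;
exactly how the path gets folded in is an assumption here, not
necessarily what the PoC does):

  static void add_changed_path(struct bloom_filter *bf,
  			     const struct commit *parent,
  			     const struct commit *commit,
  			     const char *path)
  {
  	git_hash_ctx ctx;
  	unsigned char path_hash[GIT_MAX_RAWSZ];
  	unsigned char p_c_hash[GIT_MAX_RAWSZ];
  	unsigned char key[GIT_MAX_RAWSZ];

  	/* hash the path itself ... */
  	the_hash_algo->init_fn(&ctx);
  	the_hash_algo->update_fn(&ctx, path, strlen(path));
  	the_hash_algo->final_fn(path_hash, &ctx);

  	/* ... and tie it to this particular (parent, commit) pair */
  	hashxor(parent->object.oid.hash, commit->object.oid.hash, p_c_hash);
  	hashxor(path_hash, p_c_hash, key);

  	bloom_filter_add_hash(bf, key);
  }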


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 2/4] commit-graph: write a Bloom filter containing changed paths for each commit
  2018-10-09 21:06                           ` Jeff King
@ 2018-10-09 21:37                             ` SZEDER Gábor
  0 siblings, 0 replies; 78+ messages in thread
From: SZEDER Gábor @ 2018-10-09 21:37 UTC (permalink / raw)
  To: Jeff King
  Cc: git, Junio C Hamano, Derrick Stolee,
	Ævar Arnfjörð Bjarmason, Stefan Beller, Duy Nguyen

On Tue, Oct 09, 2018 at 05:06:20PM -0400, Jeff King wrote:
> On Tue, Oct 09, 2018 at 09:34:43PM +0200, SZEDER Gábor wrote:
> 
> > Creating the Bloom filter is sloooow.  Running it on git.git takes
> > about 23s on my hardware, while
> > 
> >   git log --format='%H%n%P' --name-only --all >/dev/null
> > 
> > gathers all the information necessary for that in about 5.3s.
> 
> That command won't open the trees for merges at all. But your
> implementation here looks like it does a diff against each parent of a
> merge.

Yeah, it does so, because that is what try_to_simplify_commit() /
rev_compare_tree() will do while traversing the history.

> Adding "-m" would be a more accurate comparison, I think.
> 
> Though I find that puzzling, because "-m --name-only" seems to take
> about 20x longer, not 3x. So perhaps I'm missing something.

Ugh, indeed.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: Bloom Filters
  2018-10-09 21:14                                   ` Jeff King
@ 2018-10-09 23:12                                     ` Jeff King
  2018-10-09 23:13                                       ` [PoC -- do not apply 1/3] initial tree-bitmap proof of concept Jeff King
                                                         ` (3 more replies)
  0 siblings, 4 replies; 78+ messages in thread
From: Jeff King @ 2018-10-09 23:12 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: SZEDER Gábor, Ævar Arnfjörð Bjarmason,
	Stefan Beller, git, Duy Nguyen

On Tue, Oct 09, 2018 at 05:14:50PM -0400, Jeff King wrote:

> Hmph. It really sounds like we could do better with a custom RLE
> solution. But that makes me feel like I'm missing something, because
> surely I can't invent something better than the state of the art in a
> simple thought experiment, right?
> 
> I know what I'm proposing would be quite bad for random access, but my
> impression is that EWAH is the same. For the scale of bitmaps we're
> talking about, I think linear/streaming access through the bitmap would
> be OK.

Thinking on it more, what I was missing is that for truly dense random
bitmaps, this will perform much worse. Because it will use a byte to say
"there's one 1", rather than a bit.

But I think it does OK in practice for the very sparse bitmaps we tend
to see in this application.  I was able to generate a complete output
that can reproduce "log --name-status -t" for linux.git in 32MB. But:

  - 15MB of that is commit sha1s, which will be stored elsewhere in a
    "real" system

  - 5MB of that is path list (which should shrink by a factor of 10 with
    prefix compression, and is really a function of tree size rather
    than history depth)

So the per-commit cost is not too bad. That's still not counting merges,
though, which would add another 10-15% (or maybe more; their bitmaps are
less sparse).

I don't know if this is a fruitful path at all or not. I was mostly just
satisfying my own curiosity on the bitmap encoding question. But I'll
post the patches, just to show my work. The first one is the same
initial proof of concept I showed earlier.

  [1/3]: initial tree-bitmap proof of concept
  [2/3]: test-tree-bitmap: add "dump" mode
  [3/3]: test-tree-bitmap: replace ewah with custom rle encoding

 Makefile                    |   1 +
 t/helper/test-tree-bitmap.c | 344 ++++++++++++++++++++++++++++++++++++
 2 files changed, 345 insertions(+)
 create mode 100644 t/helper/test-tree-bitmap.c

-Peff

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PoC -- do not apply 1/3] initial tree-bitmap proof of concept
  2018-10-09 23:12                                     ` Bloom Filters Jeff King
@ 2018-10-09 23:13                                       ` Jeff King
  2018-10-09 23:14                                       ` [PoC -- do not apply 2/3] test-tree-bitmap: add "dump" mode Jeff King
                                                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 78+ messages in thread
From: Jeff King @ 2018-10-09 23:13 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: SZEDER Gábor, Ævar Arnfjörð Bjarmason,
	Stefan Beller, git, Duy Nguyen

Signed-off-by: Jeff King <peff@peff.net>
---
 Makefile                    |   1 +
 t/helper/test-tree-bitmap.c | 167 ++++++++++++++++++++++++++++++++++++
 2 files changed, 168 insertions(+)
 create mode 100644 t/helper/test-tree-bitmap.c

diff --git a/Makefile b/Makefile
index 13e1c52478..f6e823f2d6 100644
--- a/Makefile
+++ b/Makefile
@@ -751,6 +751,7 @@ TEST_PROGRAMS_NEED_X += test-parse-options
 TEST_PROGRAMS_NEED_X += test-pkt-line
 TEST_PROGRAMS_NEED_X += test-svn-fe
 TEST_PROGRAMS_NEED_X += test-tool
+TEST_PROGRAMS_NEED_X += test-tree-bitmap
 
 TEST_PROGRAMS = $(patsubst %,t/helper/%$X,$(TEST_PROGRAMS_NEED_X))
 
diff --git a/t/helper/test-tree-bitmap.c b/t/helper/test-tree-bitmap.c
new file mode 100644
index 0000000000..bc5cf0e514
--- /dev/null
+++ b/t/helper/test-tree-bitmap.c
@@ -0,0 +1,167 @@
+#include "cache.h"
+#include "revision.h"
+#include "diffcore.h"
+#include "argv-array.h"
+#include "ewah/ewok.h"
+
+/* map of pathnames to bit positions */
+struct pathmap_entry {
+	struct hashmap_entry ent;
+	unsigned pos;
+	char path[FLEX_ARRAY];
+};
+
+static int pathmap_entry_hashcmp(const void *unused_cmp_data,
+				 const void *entry,
+				 const void *entry_or_key,
+				 const void *keydata)
+{
+	const struct pathmap_entry *a = entry;
+	const struct pathmap_entry *b = entry_or_key;
+	const char *key = keydata;
+
+	return strcmp(a->path, key ? key : b->path);
+}
+
+static int pathmap_entry_strcmp(const void *va, const void *vb)
+{
+	struct pathmap_entry *a = *(struct pathmap_entry **)va;
+	struct pathmap_entry *b = *(struct pathmap_entry **)vb;
+	return strcmp(a->path, b->path);
+}
+
+struct walk_paths_data {
+	struct hashmap *paths;
+	struct commit *commit;
+};
+
+static void walk_paths(diff_format_fn_t fn, struct hashmap *paths)
+{
+	struct argv_array argv = ARGV_ARRAY_INIT;
+	struct rev_info revs;
+	struct walk_paths_data data;
+	struct commit *commit;
+
+	argv_array_pushl(&argv, "rev-list",
+			 "--all", "-t", "--no-renames",
+			 NULL);
+	init_revisions(&revs, NULL);
+	setup_revisions(argv.argc, argv.argv, &revs, NULL);
+	revs.diffopt.output_format = DIFF_FORMAT_CALLBACK;
+	revs.diffopt.format_callback = fn;
+	revs.diffopt.format_callback_data = &data;
+
+	data.paths = paths;
+
+	prepare_revision_walk(&revs);
+	while ((commit = get_revision(&revs))) {
+		data.commit = commit;
+		diff_tree_combined_merge(commit, 0, &revs);
+	}
+
+	reset_revision_walk();
+	argv_array_clear(&argv);
+}
+
+static void collect_commit_paths(struct diff_queue_struct *q,
+				 struct diff_options *opts,
+				 void *vdata)
+{
+	struct walk_paths_data *data = vdata;
+	int i;
+
+	for (i = 0; i < q->nr; i++) {
+		struct diff_filepair *p = q->queue[i];
+		const char *path = p->one->path;
+		struct pathmap_entry *entry;
+		struct hashmap_entry lookup;
+
+		hashmap_entry_init(&lookup, strhash(path));
+		entry = hashmap_get(data->paths, &lookup, path);
+		if (entry)
+			continue; /* already present */
+
+		FLEX_ALLOC_STR(entry, path, path);
+		entry->ent = lookup;
+		hashmap_put(data->paths, entry);
+	}
+}
+
+/* assign a bit position to all possible paths */
+static void collect_paths(struct hashmap *paths)
+{
+	struct pathmap_entry **sorted;
+	size_t i, n;
+	struct hashmap_iter iter;
+	struct pathmap_entry *entry;
+
+	/* grab all unique paths */
+	hashmap_init(paths, pathmap_entry_hashcmp, NULL, 0);
+	walk_paths(collect_commit_paths, paths);
+
+	/* and assign them bits in sorted order */
+	n = hashmap_get_size(paths);
+	ALLOC_ARRAY(sorted, n);
+	i = 0;
+	for (entry = hashmap_iter_first(paths, &iter);
+	     entry;
+	     entry = hashmap_iter_next(&iter)) {
+		assert(i < n);
+		sorted[i++] = entry;
+	}
+	QSORT(sorted, i, pathmap_entry_strcmp);
+	for (i = 0; i < n; i++)
+		sorted[i]->pos = i;
+	free(sorted);
+}
+
+/* generate the bitmap for a single commit */
+static void generate_bitmap(struct diff_queue_struct *q,
+			    struct diff_options *opts,
+			    void *vdata)
+{
+	struct walk_paths_data *data = vdata;
+	struct bitmap *bitmap = bitmap_new();
+	struct ewah_bitmap *ewah;
+	struct strbuf out = STRBUF_INIT;
+	size_t i;
+
+	for (i = 0; i < q->nr; i++) {
+		struct diff_filepair *p = q->queue[i];
+		const char *path = p->one->path;
+		struct pathmap_entry *entry;
+		struct hashmap_entry lookup;
+
+		hashmap_entry_init(&lookup, strhash(path));
+		entry = hashmap_get(data->paths, &lookup, path);
+		if (!entry)
+			BUG("mysterious path appeared: %s", path);
+
+		bitmap_set(bitmap, entry->pos);
+	}
+
+	ewah = bitmap_to_ewah(bitmap);
+	ewah_serialize_strbuf(ewah, &out);
+	fwrite(out.buf, 1, out.len, stdout);
+
+	trace_printf("bitmap %s %u %u",
+		     oid_to_hex(&data->commit->object.oid),
+		     (unsigned)q->nr,
+		     (unsigned)out.len);
+
+	strbuf_release(&out);
+	ewah_free(ewah);
+	bitmap_free(bitmap);
+}
+
+int cmd_main(int argc, const char **argv)
+{
+	struct hashmap paths;
+
+	setup_git_directory();
+	collect_paths(&paths);
+
+	walk_paths(generate_bitmap, &paths);
+
+	return 0;
+}
-- 
2.19.1.550.g7610f1eecb


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PoC -- do not apply 2/3] test-tree-bitmap: add "dump" mode
  2018-10-09 23:12                                     ` Bloom Filters Jeff King
  2018-10-09 23:13                                       ` [PoC -- do not apply 1/3] initial tree-bitmap proof of concept Jeff King
@ 2018-10-09 23:14                                       ` Jeff King
  2018-10-10  0:48                                         ` Junio C Hamano
  2018-10-09 23:14                                       ` [PoC -- do not apply 3/3] test-tree-bitmap: replace ewah with custom rle encoding Jeff King
  2018-10-11 12:33                                       ` Bloom Filters Derrick Stolee
  3 siblings, 1 reply; 78+ messages in thread
From: Jeff King @ 2018-10-09 23:14 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: SZEDER Gábor, Ævar Arnfjörð Bjarmason,
	Stefan Beller, git, Duy Nguyen

This teaches "gen" mode (formerly the only mode) to include
the list of paths, and to prefix each bitmap with its
matching oid.

The "dump" mode can then read that back in and generate the
list of changed paths. This should be almost identical to:

  git rev-list --all |
  git diff-tree --stdin --name-only -t

The one difference is the sort order: git's diff output is
in tree-sort order, so a subtree "foo" sorts like "foo/",
which is after "foo.bar". Whereas the bitmap path list has a
true byte sort, which puts "foo.bar" after "foo".

Signed-off-by: Jeff King <peff@peff.net>
---
 t/helper/test-tree-bitmap.c | 104 +++++++++++++++++++++++++++++++++++-
 1 file changed, 102 insertions(+), 2 deletions(-)

diff --git a/t/helper/test-tree-bitmap.c b/t/helper/test-tree-bitmap.c
index bc5cf0e514..6f8833344a 100644
--- a/t/helper/test-tree-bitmap.c
+++ b/t/helper/test-tree-bitmap.c
@@ -112,6 +112,14 @@ static void collect_paths(struct hashmap *paths)
 	QSORT(sorted, i, pathmap_entry_strcmp);
 	for (i = 0; i < n; i++)
 		sorted[i]->pos = i;
+
+	/* dump it while we have the sorted order in memory */
+	for (i = 0; i < n; i++) {
+		printf("%s", sorted[i]->path);
+		putchar('\0');
+	}
+	putchar('\0');
+
 	free(sorted);
 }
 
@@ -142,6 +150,8 @@ static void generate_bitmap(struct diff_queue_struct *q,
 
 	ewah = bitmap_to_ewah(bitmap);
 	ewah_serialize_strbuf(ewah, &out);
+
+	fwrite(data->commit->object.oid.hash, 1, GIT_SHA1_RAWSZ, stdout);
 	fwrite(out.buf, 1, out.len, stdout);
 
 	trace_printf("bitmap %s %u %u",
@@ -154,14 +164,104 @@ static void generate_bitmap(struct diff_queue_struct *q,
 	bitmap_free(bitmap);
 }
 
-int cmd_main(int argc, const char **argv)
+static void do_gen(void)
 {
 	struct hashmap paths;
-
 	setup_git_directory();
 	collect_paths(&paths);
 
 	walk_paths(generate_bitmap, &paths);
+}
+
+static void show_path(size_t pos, void *data)
+{
+	const char **paths = data;
+
+	/* assert(pos < nr_paths), but we didn't pass the latter in */
+	printf("%s\n", paths[pos]);
+}
+
+static void do_dump(void)
+{
+	struct strbuf in = STRBUF_INIT;
+	const char *cur;
+	size_t remain;
+
+	const char **paths = NULL;
+	size_t alloc_paths = 0, nr_paths = 0;
+
+	/* slurp stdin; in the real world we'd mmap all this */
+	strbuf_read(&in, 0, 0);
+	cur = in.buf;
+	remain = in.len;
+
+	/* read path for each bit; in the real world this would be separate */
+	while (remain) {
+		const char *end = memchr(cur, '\0', remain);
+		if (!end) {
+			error("truncated input while reading path");
+			goto out;
+		}
+		if (end == cur) {
+			/* empty field signals end of paths */
+			cur++;
+			remain--;
+			break;
+		}
+
+		ALLOC_GROW(paths, nr_paths + 1, alloc_paths);
+		paths[nr_paths++] = cur;
+
+		remain -= end - cur + 1;
+		cur = end + 1;
+	}
+
+	/* read the bitmap for each commit */
+	while (remain) {
+		struct object_id oid;
+		struct ewah_bitmap *ewah;
+		ssize_t len;
+
+		if (remain < GIT_SHA1_RAWSZ) {
+			error("truncated input reading oid");
+			goto out;
+		}
+		hashcpy(oid.hash, (const unsigned char *)cur);
+		cur += GIT_SHA1_RAWSZ;
+		remain -= GIT_SHA1_RAWSZ;
+
+		ewah = ewah_new();
+		len = ewah_read_mmap(ewah, cur, remain);
+		if (len < 0) {
+			ewah_free(ewah);
+			goto out;
+		}
+
+		printf("%s\n", oid_to_hex(&oid));
+		ewah_each_bit(ewah, show_path, paths);
+
+		ewah_free(ewah);
+		cur += len;
+		remain -= len;
+	}
+
+out:
+	free(paths);
+	strbuf_release(&in);
+}
+
+int cmd_main(int argc, const char **argv)
+{
+	const char *usage_msg = "test-tree-bitmap <gen|dump>";
+
+	if (!argv[1])
+		usage(usage_msg);
+	else if (!strcmp(argv[1], "gen"))
+		do_gen();
+	else if (!strcmp(argv[1], "dump"))
+		do_dump();
+	else
+		usage(usage_msg);
 
 	return 0;
 }
-- 
2.19.1.550.g7610f1eecb


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PoC -- do not apply 3/3] test-tree-bitmap: replace ewah with custom rle encoding
  2018-10-09 23:12                                     ` Bloom Filters Jeff King
  2018-10-09 23:13                                       ` [PoC -- do not apply 1/3] initial tree-bitmap proof of concept Jeff King
  2018-10-09 23:14                                       ` [PoC -- do not apply 2/3] test-tree-bitmap: add "dump" mode Jeff King
@ 2018-10-09 23:14                                       ` Jeff King
  2018-10-10  0:58                                         ` Junio C Hamano
  2018-10-11 12:33                                       ` Bloom Filters Derrick Stolee
  3 siblings, 1 reply; 78+ messages in thread
From: Jeff King @ 2018-10-09 23:14 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: SZEDER Gábor, Ævar Arnfjörð Bjarmason,
	Stefan Beller, git, Duy Nguyen

The rules are basically:

 - each bitmap is a series of counts of runs of 0/1

 - each count is one of our standard varints

 - each bitmap must have at least one initial count of
   zeroes (which may itself be a zero-length count, if the
   first bit is set)

 - a zero-length count anywhere else marks the end of
   the bitmap

For a sparse bitmap, these will tend to be quite short,
because long runs are encoded as fairly small counts. The
worst case is an alternating 0/1/0/1 bitmap, where we will
spend a full byte to specify each bit (thus bloating it by a
factor of 8 over an uncompressed bitmap).
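
To give a tiny worked example: a bitmap with only bits 2, 3, 4, and 10
set encodes as the varints 2, 3, 5, 1, 0; that is, two zeroes, three
ones, five zeroes, a single one, and then the empty run that terminates
the bitmap (the trailing zeroes of the final word are simply dropped).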

Signed-off-by: Jeff King <peff@peff.net>
---
 t/helper/test-tree-bitmap.c | 105 +++++++++++++++++++++++++++++++-----
 1 file changed, 91 insertions(+), 14 deletions(-)

diff --git a/t/helper/test-tree-bitmap.c b/t/helper/test-tree-bitmap.c
index 6f8833344a..36f19ed464 100644
--- a/t/helper/test-tree-bitmap.c
+++ b/t/helper/test-tree-bitmap.c
@@ -3,6 +3,7 @@
 #include "diffcore.h"
 #include "argv-array.h"
 #include "ewah/ewok.h"
+#include "varint.h"
 
 /* map of pathnames to bit positions */
 struct pathmap_entry {
@@ -123,6 +124,49 @@ static void collect_paths(struct hashmap *paths)
 	free(sorted);
 }
 
+static void strbuf_add_varint(struct strbuf *out, uintmax_t val)
+{
+	size_t len;
+	strbuf_grow(out, 16); /* enough for any varint */
+	len = encode_varint(val, (unsigned char *)out->buf + out->len);
+	strbuf_setlen(out, out->len + len);
+}
+
+static void bitmap_to_rle(struct strbuf *out, struct bitmap *bitmap)
+{
+	int curval = 0; /* count zeroes, then ones, then zeroes, etc */
+	size_t run = 0;
+	size_t word;
+	size_t orig_len = out->len;
+
+	for (word = 0; word < bitmap->word_alloc; word++) {
+		int bit;
+
+		for (bit = 0; bit < BITS_IN_EWORD; bit++) {
+			int val = !!(bitmap->words[word] & (((eword_t)1) << bit));
+			if (val == curval)
+				run++;
+			else {
+				strbuf_add_varint(out, run);
+				curval = 1 - curval; /* flip 0/1 */
+				run = 1;
+			}
+		}
+	}
+
+	/*
+	 * complete the run, but do not bother with trailing zeroes, unless we
+	 * failed to write even an initial run of 0's.
+	 */
+	if (curval && run)
+		strbuf_add_varint(out, run);
+	else if (orig_len == out->len)
+		strbuf_add_varint(out, 0);
+
+	/* signal end-of-input with an empty run */
+	strbuf_add_varint(out, 0);
+}
+
 /* generate the bitmap for a single commit */
 static void generate_bitmap(struct diff_queue_struct *q,
 			    struct diff_options *opts,
@@ -130,7 +174,6 @@ static void generate_bitmap(struct diff_queue_struct *q,
 {
 	struct walk_paths_data *data = vdata;
 	struct bitmap *bitmap = bitmap_new();
-	struct ewah_bitmap *ewah;
 	struct strbuf out = STRBUF_INIT;
 	size_t i;
 
@@ -148,8 +191,7 @@ static void generate_bitmap(struct diff_queue_struct *q,
 		bitmap_set(bitmap, entry->pos);
 	}
 
-	ewah = bitmap_to_ewah(bitmap);
-	ewah_serialize_strbuf(ewah, &out);
+	bitmap_to_rle(&out, bitmap);
 
 	fwrite(data->commit->object.oid.hash, 1, GIT_SHA1_RAWSZ, stdout);
 	fwrite(out.buf, 1, out.len, stdout);
@@ -160,7 +202,6 @@ static void generate_bitmap(struct diff_queue_struct *q,
 		     (unsigned)out.len);
 
 	strbuf_release(&out);
-	ewah_free(ewah);
 	bitmap_free(bitmap);
 }
 
@@ -181,6 +222,51 @@ static void show_path(size_t pos, void *data)
 	printf("%s\n", paths[pos]);
 }
 
+static size_t rle_each_bit(const unsigned char *in, size_t len,
+			   void (*fn)(size_t, void *), void *data)
+{
+	int curval = 0; /* look for zeroes first, then ones, etc */
+	const unsigned char *cur = in;
+	const unsigned char *end = in + len;
+	size_t pos;
+
+	/* we always have a first run, even if it's 0 zeroes */
+	pos = decode_varint(&cur);
+
+	/*
+	 * ugh, varint does not seem to have a way to prevent reading past
+	 * the end of the buffer. We'll do a length check after each one,
+	 * so the worst case is bounded.
+	 */
+	if (cur > end) {
+		error("input underflow in rle");
+		return len;
+	}
+
+	while (1) {
+		size_t run = decode_varint(&cur);
+
+		if (cur > end) {
+			error("input underflow in rle");
+			return len;
+		}
+
+		if (!run)
+			break; /* empty run signals end */
+
+		curval = 1 - curval; /* flip 0/1 */
+		if (curval) {
+			/* we have a run of 1's; deliver them */
+			size_t i;
+			for (i = 0; i < run; i++)
+				fn(pos + i, data);
+		}
+		pos += run;
+	}
+
+	return cur - in;
+}
+
 static void do_dump(void)
 {
 	struct strbuf in = STRBUF_INIT;
@@ -219,7 +305,6 @@ static void do_dump(void)
 	/* read the bitmap for each commit */
 	while (remain) {
 		struct object_id oid;
-		struct ewah_bitmap *ewah;
 		ssize_t len;
 
 		if (remain < GIT_SHA1_RAWSZ) {
@@ -230,17 +315,9 @@ static void do_dump(void)
 		cur += GIT_SHA1_RAWSZ;
 		remain -= GIT_SHA1_RAWSZ;
 
-		ewah = ewah_new();
-		len = ewah_read_mmap(ewah, cur, remain);
-		if (len < 0) {
-			ewah_free(ewah);
-			goto out;
-		}
-
 		printf("%s\n", oid_to_hex(&oid));
-		ewah_each_bit(ewah, show_path, paths);
+		len = rle_each_bit((const unsigned char *)cur, remain, show_path, paths);
 
-		ewah_free(ewah);
 		cur += len;
 		remain -= len;
 	}
-- 
2.19.1.550.g7610f1eecb

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PoC -- do not apply 2/3] test-tree-bitmap: add "dump" mode
  2018-10-09 23:14                                       ` [PoC -- do not apply 2/3] test-tree-bitmap: add "dump" mode Jeff King
@ 2018-10-10  0:48                                         ` Junio C Hamano
  2018-10-11  3:13                                           ` Jeff King
  0 siblings, 1 reply; 78+ messages in thread
From: Junio C Hamano @ 2018-10-10  0:48 UTC (permalink / raw)
  To: Jeff King
  Cc: Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Stefan Beller, git,
	Duy Nguyen

Jeff King <peff@peff.net> writes:

> The one difference is the sort order: git's diff output is
> in tree-sort order, so a subtree "foo" sorts like "foo/",
> which is after "foo.bar". Whereas the bitmap path list has a
> true byte sort, which puts "foo.bar" after "foo".

If we truly cared, it is easy enough to fix by having a custom
comparison function in 1/3, used in the collect_paths() phase.

> +	/* dump it while we have the sorted order in memory */
> +	for (i = 0; i < n; i++) {
> +		printf("%s", sorted[i]->path);
> +		putchar('\0');
> +	}

With printf("%s%c", sorted[i]->path, '\0'); you can lose the braces.

> +	putchar('\0');
> +
>  	free(sorted);
>  }
>  
> @@ -142,6 +150,8 @@ static void generate_bitmap(struct diff_queue_struct *q,
>  
>  	ewah = bitmap_to_ewah(bitmap);
>  	ewah_serialize_strbuf(ewah, &out);
> +
> +	fwrite(data->commit->object.oid.hash, 1, GIT_SHA1_RAWSZ, stdout);
>  	fwrite(out.buf, 1, out.len, stdout);

OK, so per commit, we have an ewah bitmap that records the "changed
paths" after the commit object name.  Makes sense.

And the list of paths is based on the "one" side of the filepair.
When we do an equivalent of "git show X", we see "diff-tree X~1 X"
and by collecting the "one" side (i.e. subset of paths in the tree
of X~1 that were modified when going to X) we say "commit X changed
these paths".  Makes sense, too.

> -int cmd_main(int argc, const char **argv)
> +static void do_gen(void)
>  {
>  	struct hashmap paths;
> -

Let's not lose this blank line.

>  	setup_git_directory();
>  	collect_paths(&paths);
>  
>  	walk_paths(generate_bitmap, &paths);
> +}
> +
> +static void do_dump(void)
> +{
> +	struct strbuf in = STRBUF_INIT;
> +	const char *cur;
> +	size_t remain;
> +
> +	const char **paths = NULL;
> +	size_t alloc_paths = 0, nr_paths = 0;
> +
> +	/* slurp stdin; in the real world we'd mmap all this */
> +	strbuf_read(&in, 0, 0);
> +	cur = in.buf;
> +	remain = in.len;
> +
> +	/* read path for each bit; in the real world this would be separate */
> +	while (remain) {
> +		const char *end = memchr(cur, '\0', remain);
> +		if (!end) {
> +			error("truncated input while reading path");
> +			goto out;
> +		}
> +		if (end == cur) {
> +			/* empty field signals end of paths */
> +			cur++;
> +			remain--;
> +			break;
> +		}
> +
> +		ALLOC_GROW(paths, nr_paths + 1, alloc_paths);
> +		paths[nr_paths++] = cur;
> +
> +		remain -= end - cur + 1;
> +		cur = end + 1;
> +	}
> +

OK.

> +	while (remain) {
> +		struct object_id oid;
> +		struct ewah_bitmap *ewah;
> +		ssize_t len;
> +
> +		if (remain < GIT_SHA1_RAWSZ) {
> +			error("truncated input reading oid");
> +			goto out;
> +		}
> +		hashcpy(oid.hash, (const unsigned char *)cur);
> +		cur += GIT_SHA1_RAWSZ;
> +		remain -= GIT_SHA1_RAWSZ;
> +
> +		ewah = ewah_new();
> +		len = ewah_read_mmap(ewah, cur, remain);
> +		if (len < 0) {
> +			ewah_free(ewah);
> +			goto out;
> +		}
> +
> +		printf("%s\n", oid_to_hex(&oid));
> +		ewah_each_bit(ewah, show_path, paths);
> +
> +		ewah_free(ewah);
> +		cur += len;
> +		remain -= len;
> +	}

Makes perfect sense.

> +out:
> +	free(paths);
> +	strbuf_release(&in);
> +}
> +
> +int cmd_main(int argc, const char **argv)
> +{
> +	const char *usage_msg = "test-tree-bitmap <gen|dump>";
> +
> +	if (!argv[1])
> +		usage(usage_msg);
> +	else if (!strcmp(argv[1], "gen"))
> +		do_gen();
> +	else if (!strcmp(argv[1], "dump"))
> +		do_dump();
> +	else
> +		usage(usage_msg);
>  
>  	return 0;
>  }

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PoC -- do not apply 3/3] test-tree-bitmap: replace ewah with custom rle encoding
  2018-10-09 23:14                                       ` [PoC -- do not apply 3/3] test-tree-bitmap: replace ewah with custom rle encoding Jeff King
@ 2018-10-10  0:58                                         ` Junio C Hamano
  2018-10-11  3:20                                           ` Jeff King
  0 siblings, 1 reply; 78+ messages in thread
From: Junio C Hamano @ 2018-10-10  0:58 UTC (permalink / raw)
  To: Jeff King
  Cc: Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Stefan Beller, git,
	Duy Nguyen

Jeff King <peff@peff.net> writes:

> +static void strbuf_add_varint(struct strbuf *out, uintmax_t val)
> +{
> +	size_t len;
> +	strbuf_grow(out, 16); /* enough for any varint */
> +	len = encode_varint(val, (unsigned char *)out->buf + out->len);
> +	strbuf_setlen(out, out->len + len);
> +}
> +
> +static void bitmap_to_rle(struct strbuf *out, struct bitmap *bitmap)
> +{
> +	int curval = 0; /* count zeroes, then ones, then zeroes, etc */
> +	size_t run = 0;
> +	size_t word;
> +	size_t orig_len = out->len;
> +
> +	for (word = 0; word < bitmap->word_alloc; word++) {
> +		int bit;
> +
> +		for (bit = 0; bit < BITS_IN_EWORD; bit++) {
> +			int val = !!(bitmap->words[word] & (((eword_t)1) << bit));
> +			if (val == curval)
> +				run++;
> +			else {
> +				strbuf_add_varint(out, run);
> +				curval = 1 - curval; /* flip 0/1 */
> +				run = 1;
> +			}
> +		}

OK.  I find it a bit disturbing to see that the loop knows a bit too
much about how "struct bitmap" is implemented, but that is a complaint
against the bitmap API, not this new user of the API.

We do not try to handle the case where the bitmap's size is not a
multiple of BITS_IN_EWORD, and instead pretend that the size of such a
bitmap can be rounded up, because we ignore trailing 0-bits anyway,
and we know "struct bitmap" pads with 0-bits at the tail?

> +	}
> +
> +	/*
> +	 * complete the run, but do not bother with trailing zeroes, unless we
> +	 * failed to write even an initial run of 0's.
> +	 */
> +	if (curval && run)
> +		strbuf_add_varint(out, run);
> +	else if (orig_len == out->len)
> +		strbuf_add_varint(out, 0);
> +
> +	/* signal end-of-input with an empty run */
> +	strbuf_add_varint(out, 0);
> +}

OK.

> +static size_t rle_each_bit(const unsigned char *in, size_t len,
> +			   void (*fn)(size_t, void *), void *data)
> +{
> +	int curval = 0; /* look for zeroes first, then ones, etc */
> +	const unsigned char *cur = in;
> +	const unsigned char *end = in + len;
> +	size_t pos;
> +
> +	/* we always have a first run, even if it's 0 zeroes */
> +	pos = decode_varint(&cur);
> +
> +	/*
> +	 * ugh, varint does not seem to have a way to prevent reading past
> +	 * the end of the buffer. We'll do a length check after each one,
> +	 * so the worst case is bounded.
> +	 */

Sorry about that :-).

> +	if (cur > end) {
> +		error("input underflow in rle");
> +		return len;
> +	}
> +
> +	while (1) {
> +		size_t run = decode_varint(&cur);
> +
> +		if (cur > end) {
> +			error("input underflow in rle");
> +			return len;
> +		}
> +
> +		if (!run)
> +			break; /* empty run signals end */
> +
> +		curval = 1 - curval; /* flip 0/1 */
> +		if (curval) {
> +			/* we have a run of 1's; deliver them */
> +			size_t i;
> +			for (i = 0; i < run; i++)
> +				fn(pos + i, data);
> +		}
> +		pos += run;
> +	}
> +
> +	return cur - in;
> +}

Makes sense.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: We should add a "git gc --auto" after "git clone" due to commit graph
  2018-10-05  6:09               ` Junio C Hamano
@ 2018-10-10 22:07                 ` SZEDER Gábor
  2018-10-10 23:01                   ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 78+ messages in thread
From: SZEDER Gábor @ 2018-10-10 22:07 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Ævar Arnfjörð Bjarmason, Derrick Stolee, Git List,
	Nguyễn Thái Ngọc Duy

On Thu, Oct 04, 2018 at 11:09:58PM -0700, Junio C Hamano wrote:
> SZEDER Gábor <szeder.dev@gmail.com> writes:
> 
> >>     git-gc - Cleanup unnecessary files and optimize the local repository
> >> 
> >> Creating these indexes like the commit-graph falls under "optimize the
> >> local repository",
> >
> > But it doesn't fall under "cleanup unnecessary files", which the
> > commit-graph file is, since, strictly speaking, it's purely
> > optimization.
> 
> I won't be actively engaged in this discussion soon, but I must say
> that "git gc" doing "garbage collection" is merely an implementation
> detail of optimizing the repository for further use.  And from that
> point of view, what needs to be updated is the synopsis of the
> git-gc doc.  It states "X and Y" above, but it actually is "Y by
> doing X and other things".

Well, then perhaps the name of the command should be updated, too, to
better reflect what it actually does...

> I understand your "by definition there is no garbage immediately
> after clone" position, and also I would understand if you find it
> (perhaps philosophically) disturbing that "git clone" may give users
> a suboptimal repository that immediately needs optimizing [*1*].
> 
> But that bridge was crossed long time ago ever since pack transfer
> was invented.  The data source sends only the pack data stream, and
> the receiving end is responsible for spending cycles to build .idx
> file.  Theoretically, .pack should be all that is needed---you
> should be able to locate any necessary object by parsing the .pack
> file every time you open it, and .idx is mere optimization.  You can
> think of the .midx and graph files the same way.

I don't think this is a valid comparison, because, practically, Git
just didn't work after I deleted all pack index files.  So while they
can be easily (re)generated, they are essential to make pack files
usable.  The commit-graph and .midx files, however, can be safely
deleted, and everything keeps working as before.

OTOH, this is an excellent comparison, and I do think of the .midx and
graph files the same way as the pack index files.  During a clone, the
pack index file isn't generated by running a separate 'git gc
(--auto)', but by clone (or fetch-pack?) running 'git index-pack'.
The way I see it that should be the case for these other files as well.
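
The clone side of that really can be tiny; roughly (a sketch only,
assuming write_commit_graph_reachable()'s current interface and a
gc.writeCommitGraph check, purely for illustration):

  	/* at the end of cmd_clone(), once objects and refs are in place */
  	int write_graph = 0;

  	if (!git_config_get_bool("gc.writecommitgraph", &write_graph) &&
  	    write_graph)
  		write_commit_graph_reachable(get_object_directory(), 0);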

And it is much simpler, shorter, and cleaner to either run 'git
commit-graph ...' or even to call write_commit_graph_reachable()
directly from cmd_clone(), than to bolt on another option and config
variable on 'git gc' [1] to coax it into some kind of an "after clone"
mode, that it shouldn't be doing in the first place.  At least for
now, so when we'll eventually get as far ...

> I would not be surprised by a future in which the initial index-pack
> that is responsible for receiving the incoming pack stream and
> storing that in .pack file(s) while creating corresponding .idx
> file(s) becomes also responsible for building .midx and graph files
> in the same pass, or at least smaller number of passes.  Once we
> gain experience and confidence with these new auxiliary files, that
> ought to happen naturally.  And at that point, we won't be having
> this discussion---we'd all happily run index-pack to receive the
> pack data, because that is pretty much the fundamental requirement
> to make use of the data.

... that what you wrote here becomes a reality (and I fully agree that
this is what we should ultimately aim for), then we won't have that
option and config variable still lying around and requiring
maintenance because of backwards compatibility.

1 - https://public-inbox.org/git/87in2hgzin.fsf@evledraar.gmail.com/

> [Footnote]
> 
> *1* Even without considering these recent invention of auxiliary
>     files, cloning from a sloppily packed server whose primary focus
>     is to avoid spending cycles by not computing better deltas will
>     give the cloner a suboptimal repository.  If we truly want to
>     have an optimized repository ready to be used after cloning, we
>     should run an equivalent of "repack -a -d -f" immediately after
>     "git clone".

I noticed a few times that I got surprisingly large packs from GitHub,
e.g. there is over 70% size difference between --single-branch cloning
v2.19.0 from GitHub and from my local clone or from kernel.org (~95MB
vs. ~55MB vs ~52MB).  After running 'git repack -a -d -f' they all end
up at ~65MB, which is a nice size reduction for the clone from GitHub,
but the others just gained 10-13 more MBs.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: We should add a "git gc --auto" after "git clone" due to commit graph
  2018-10-10 22:07                 ` SZEDER Gábor
@ 2018-10-10 23:01                   ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 78+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-10-10 23:01 UTC (permalink / raw)
  To: SZEDER Gábor
  Cc: Junio C Hamano, Derrick Stolee, Git List,
	Nguyễn Thái Ngọc Duy


On Wed, Oct 10 2018, SZEDER Gábor wrote:

> On Thu, Oct 04, 2018 at 11:09:58PM -0700, Junio C Hamano wrote:
>> SZEDER Gábor <szeder.dev@gmail.com> writes:
>>
>> >>     git-gc - Cleanup unnecessary files and optimize the local repository
>> >>
>> >> Creating these indexes like the commit-graph falls under "optimize the
>> >> local repository",
>> >
>> > But it doesn't fall under "cleanup unnecessary files", which the
>> > commit-graph file is, since, strictly speaking, it's purely
>> > optimization.
>>
>> I won't be actively engaged in this discussion soon, but I must say
>> that "git gc" doing "garbage collection" is merely an implementation
>> detail of optimizing the repository for further use.  And from that
>> point of view, what needs to be updated is the synopsis of the
>> git-gc doc.  It states "X and Y" above, but it actually is "Y by
>> doing X and other things".
>
> Well, then perhaps the name of the command should be updated, too, to
> better reflect what it actually does...

I don't disagree, but between "git gc" being a longstanding thing called
not just by us but also by third parties expecting it to be a general
swiss army knife for "make the repo better" (so for a new tool they'd
need to update their code), and general name bikeshedding, I think it's
best if we just proceed for the sake of argument with the assumption
that none of us find the name confusing in this context.

At least that's the discussion I'm interested in. I.e. whether it makes
conceptual / structural sense for this stuff to live in the same place /
function in the code. We can always argue about the name of the function
as a separate topic.

>> I understand your "by definition there is no garbage immediately
>> after clone" position, and also I would understand if you find it
>> (perhaps philosophically) disturbing that "git clone" may give users
>> a suboptimal repository that immediately needs optimizing [*1*].
>>
>> But that bridge was crossed long time ago ever since pack transfer
>> was invented.  The data source sends only the pack data stream, and
>> the receiving end is responsible for spending cycles to build .idx
>> file.  Theoretically, .pack should be all that is needed---you
>> should be able to locate any necessary object by parsing the .pack
>> file every time you open it, and .idx is mere optimization.  You can
>> think of the .midx and graph files the same way.
>
> I don't think this is a valid comparison, because, practically, Git
> just didn't work after I deleted all pack index files.  So while they
> can be easily (re)generated, they are essential to make pack files
> usable.  The commit-graph and .midx files, however, can be safely
> deleted, and everything keeps working as before.

For "things that would run in 20ms now run in 30 seconds" (actual
numbers on a repo I have) values of "keeps working".

So I think this line gets blurred somewhat. In practice, if you're
expecting the graph to be there to run the sort of commands that most
benefit from it, it's essential that it be generated, not some nice
optional extra.

> OTOH, this is an excellent comparison, and I do think of the .midx and
> graph files the same way as the pack index files.  During a clone, the
> pack index file isn't generated by running a separate 'git gc
> (--auto)', but by clone (or fetch-pack?) running 'git index-pack'.
> The way I see it that should be the case for these other files as well.
>
> And it is much simpler, shorter, and cleaner to either run 'git
> commit-graph ...' or even to call write_commit_graph_reachable()
> directly from cmd_clone(), than to bolt on another option and config
> variable on 'git gc' [1] to coax it into some kind of an "after clone"
> mode, that it shouldn't be doing in the first place.  At least for
> now, so when we'll eventually get as far ...
>
>> I would not be surprised by a future in which the initial index-pack
>> that is responsible for receiving the incoming pack stream and
>> storing that in .pack file(s) while creating corresponding .idx
>> file(s) becomes also responsible for building .midx and graph files
>> in the same pass, or at least smaller number of passes.  Once we
>> gain experience and confidence with these new auxiliary files, that
>> ought to happen naturally.  And at that point, we won't be having
>> this discussion---we'd all happily run index-pack to receive the
>> pack data, because that is pretty much the fundamental requirement
>> to make use of the data.
>
> ... that what you wrote here becomes a reality (and I fully agree that
> this is what we should ultimately aim for), then we won't have that
> option and config variable still lying around and requiring
> maintenance because of backwards compatibility.

We'll still have the use-case of wanting to turn on
gc.writeCommitGraph=true or equivalent and wanting previously-cloned
repositories to "catch up" and get a commit graph ASAP (but not do a
full repack).

This is why my patch tries to unify those two codepaths, i.e. so I can
turn this on and know that on the next "git fetch" we'll have the graph
(without needing to also run a full repack if it's not needed).

> 1 - https://public-inbox.org/git/87in2hgzin.fsf@evledraar.gmail.com/
>
>> [Footnote]
>>
>> *1* Even without considering these recent invention of auxiliary
>>     files, cloning from a sloppily packed server whose primary focus
>>     is to avoid spending cycles by not computing better deltas will
>>     give the cloner a suboptimal repository.  If we truly want to
>>     have an optimized repository ready to be used after cloning, we
>>     should run an equivalent of "repack -a -d -f" immediately after
>>     "git clone".
>
> I noticed a few times that I got surprisingly large packs from GitHub,
> e.g. there is over 70% size difference between --single-branch cloning
> v2.19.0 from GitHub and from my local clone or from kernel.org (~95MB
> vs. ~55MB vs ~52MB).  After running 'git repack -a -d -f' they all end
> up at ~65MB, which is a nice size reduction for the clone from GitHub,
> but the others just gained 10-13 more MBs.

To me this sounds like even more reason why we shouldn't be
splitting up the entry point for the "does the repo look shitty? Fix
it!" function (currently called git-gc).

We might get such crappy packs from a clone, or from a fetch, or maybe
in the future when e.g. "git add" or "git commit" generate a pack
instead of a bunch of loose objects, we'd want to immediately kick off
something in the background to optimize / consolidate them.

So instead of having clone/commit/add/whatever call some custom
function(s) to just do the things they *think* they want, we just call
"gc --auto", because that entry point needs to eventually know how to
"recover" from any of those states without any prior knowledge or hints
about what just happened.

This is why I'm calling "gc --auto" from clone in this WIP series, with
the only special sauce being "if you have stuff to do, don't background
yourself", because not having a commit graph at all after a clone is
just one special case of many where we have no commit graph and want to
have one made ASAP (e.g. someone rm'd it, or the "I want a commit graph"
config variable just got turned on).

So since we need all those smarts in some function anyway, let's just
have one entry point to that logic. Discarding what it doesn't need to
do (e.g. too_many_loose_objects() just after a clone) is a trivial cost,
so if we don't care about that we have less complexity to worry about by
not having a proliferation of entry points into what are now subsets of
the GC code.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 0/2] Per-commit filter proof of concept
  2018-10-09 19:34                       ` [PATCH 0/4] Bloom filter experiment SZEDER Gábor
                                           ` (4 preceding siblings ...)
  2018-10-09 19:47                         ` [PATCH 0/4] Bloom filter experiment Derrick Stolee
@ 2018-10-11  1:21                         ` Jonathan Tan
  2018-10-11  1:21                           ` [PATCH 1/2] One filter per commit Jonathan Tan
                                             ` (2 more replies)
  2018-10-15 14:39                         ` [PATCH 0/4] Bloom filter experiment Derrick Stolee
  6 siblings, 3 replies; 78+ messages in thread
From: Jonathan Tan @ 2018-10-11  1:21 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, peff, stolee, avarab, szeder.dev

I did my own experiments on top of what Szeder provided - the first
patch is to have one fixed-size bloom filter per commit, and the second
patch makes that bloom filter apply to only the first parent of each
commit. The results are:

  Original (szeder's)
  $ GIT_USE_POC_BLOOM_FILTER=$((8*1024*1024*8)) time ./git commit-graph write
  0:10.28
  $ ls -l .git/objects/info/bloom
  8388616
  $ GIT_TRACE_BLOOM_FILTER=2 GIT_USE_POC_BLOOM_FILTER=y time ./git -c \
    core.commitgraph=true rev-list --count --full-history HEAD -- \
    t/valgrind/valgrind.sh
  886
  bloom filter total queries: 66459 definitely not: 65276 maybe: 1183 false positives: 297 fp ratio: 0.004469
  0:00.24

  With patch 1
  $ GIT_USE_POC_BLOOM_FILTER=256 time ./git commit-graph write
  0:16.22
  $ ls -l .git/objects/info/bloom
  1832620
  $ GIT_TRACE_BLOOM_FILTER=2 GIT_USE_POC_BLOOM_FILTER=y time ./git -c \
    core.commitgraph=true rev-list --count --full-history HEAD -- \
    t/valgrind/valgrind.sh
  886
  bloom filter total queries: 66459 definitely not: 46637 maybe: 19822 false positives: 18936 fp ratio: 0.284928
  0:01.53

  With patch 2
  $ GIT_USE_POC_BLOOM_FILTER=256 time ./git commit-graph write
  0:06.70
  $ ls -l .git/objects/info/bloom
  1832620
  $ GIT_TRACE_BLOOM_FILTER=2 GIT_USE_POC_BLOOM_FILTER=y time ./git -c \
    core.commitgraph=true rev-list --count --full-history HEAD -- \
    t/valgrind/valgrind.sh
  886
  bloom filter total queries: 53096 definitely not: 52989 maybe: 107 false positives: 89 fp ratio: 0.001676
  0:01.29

For comparison, a non-GIT_USE_POC_BLOOM_FILTER rev-list takes 3.517
seconds.

I haven't investigated why patch 1 takes longer than the original to
create the bloom filter.

Using per-commit filters and restricting the bloom filter to a single
parent increases the relative power of the filter in omitting tree
inspections compared to the original (107/53096 vs 1183/66459), but the
lack of coverage w.r.t. the non-first parents had a more significant
effect than I thought (1.29s vs .24s). It might be best to have one
filter for each (commit, parent) pair (or, at least, the first two
parents of each commit - we probably don't need to care that much about
octopus merges) - this would take up more disk space than if we only
store filters for the first parent, but is still less than the original
example of storing information for all commits in one filter.
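
(For concreteness, the layout I have in mind for that variant is
something like the sketch below; it is purely an indexing assumption,
not something implemented in these patches.)

  /*
   * Two fixed-size filter slots per commit, one for each of the first
   * two parents; everything else stays as in patch 1.
   */
  static uint32_t filter_bit_offset(const struct bloom_filter *bf,
  				    uint32_t graph_pos, int parent_idx,
  				    uint32_t hashed)
  {
  	uint32_t slot = 2 * graph_pos + (parent_idx ? 1 : 0);
  	return slot * bf->bit_size + hashed % bf->bit_size;
  }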

There are more possibilities like dynamic filter sizing, different
hashing, and hashing to support wildcard matches, which I haven't looked
into.

Jonathan Tan (2):
  One filter per commit
  Only make bloom filter for first parent

 bloom-filter.c | 31 ++++++++++++++++++-------------
 bloom-filter.h | 12 ++++--------
 commit-graph.c | 30 ++++++++++++++----------------
 revision.c     | 29 +++++++++++++++--------------
 4 files changed, 51 insertions(+), 51 deletions(-)

-- 
2.19.0.271.gfe8321ec05.dirty


^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 1/2] One filter per commit
  2018-10-11  1:21                         ` [PATCH 0/2] Per-commit filter proof of concept Jonathan Tan
@ 2018-10-11  1:21                           ` Jonathan Tan
  2018-10-11 12:49                             ` Derrick Stolee
  2018-10-11  1:21                           ` [PATCH 2/2] Only make bloom filter for first parent Jonathan Tan
  2018-10-11  7:37                           ` [PATCH 0/2] Per-commit filter proof of concept Ævar Arnfjörð Bjarmason
  2 siblings, 1 reply; 78+ messages in thread
From: Jonathan Tan @ 2018-10-11  1:21 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, peff, stolee, avarab, szeder.dev

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 bloom-filter.c | 31 ++++++++++++++++++-------------
 bloom-filter.h | 12 ++++--------
 commit-graph.c | 26 ++++++++++++--------------
 revision.c     |  9 +++------
 4 files changed, 37 insertions(+), 41 deletions(-)

diff --git a/bloom-filter.c b/bloom-filter.c
index 7dce0e35fa..39b453908f 100644
--- a/bloom-filter.c
+++ b/bloom-filter.c
@@ -1,14 +1,17 @@
 #include "cache.h"
 #include "bloom-filter.h"
 
-void bloom_filter_init(struct bloom_filter *bf, uint32_t bit_size)
+void bloom_filter_init(struct bloom_filter *bf, uint32_t commit_nr, uint32_t bit_size)
 {
 	if (bit_size % CHAR_BIT)
 		BUG("invalid size for bloom filter");
+	if (bit_size > 1024)
+		BUG("aborting: the bit size is per commit, not for the whole filter");
 
 	bf->nr_entries = 0;
+	bf->commit_nr = commit_nr;
 	bf->bit_size = bit_size;
-	bf->bits = xmalloc(bit_size / CHAR_BIT);
+	bf->bits = xcalloc(1, commit_nr * bit_size / CHAR_BIT);
 }
 
 void bloom_filter_free(struct bloom_filter *bf)
@@ -19,24 +22,24 @@ void bloom_filter_free(struct bloom_filter *bf)
 }
 
 
-void bloom_filter_set_bits(struct bloom_filter *bf, const uint32_t *offsets,
+static void bloom_filter_set_bits(struct bloom_filter *bf, uint32_t graph_pos, const uint32_t *offsets,
 			   int nr_offsets, int nr_entries)
 {
 	int i;
 	for (i = 0; i < nr_offsets; i++) {
-		uint32_t byte_offset = (offsets[i] % bf->bit_size) / CHAR_BIT;
+		uint32_t byte_offset = (offsets[i] % bf->bit_size + graph_pos * bf->bit_size) / CHAR_BIT;
 		unsigned char mask = 1 << offsets[i] % CHAR_BIT;
 		bf->bits[byte_offset] |= mask;
 	}
 	bf->nr_entries += nr_entries;
 }
 
-int bloom_filter_check_bits(struct bloom_filter *bf, const uint32_t *offsets,
+static int bloom_filter_check_bits(struct bloom_filter *bf, uint32_t graph_pos, const uint32_t *offsets,
 			    int nr)
 {
 	int i;
 	for (i = 0; i < nr; i++) {
-		uint32_t byte_offset = (offsets[i] % bf->bit_size) / CHAR_BIT;
+		uint32_t byte_offset = (offsets[i] % bf->bit_size + graph_pos * bf->bit_size) / CHAR_BIT;
 		unsigned char mask = 1 << offsets[i] % CHAR_BIT;
 		if (!(bf->bits[byte_offset] & mask))
 			return 0;
@@ -45,19 +48,19 @@ int bloom_filter_check_bits(struct bloom_filter *bf, const uint32_t *offsets,
 }
 
 
-void bloom_filter_add_hash(struct bloom_filter *bf, const unsigned char *hash)
+void bloom_filter_add_hash(struct bloom_filter *bf, uint32_t graph_pos, const unsigned char *hash)
 {
 	uint32_t offsets[GIT_MAX_RAWSZ / sizeof(uint32_t)];
 	hashcpy((unsigned char*)offsets, hash);
-	bloom_filter_set_bits(bf, offsets,
+	bloom_filter_set_bits(bf, graph_pos, offsets,
 			     the_hash_algo->rawsz / sizeof(*offsets), 1);
 }
 
-int bloom_filter_check_hash(struct bloom_filter *bf, const unsigned char *hash)
+int bloom_filter_check_hash(struct bloom_filter *bf, uint32_t graph_pos, const unsigned char *hash)
 {
 	uint32_t offsets[GIT_MAX_RAWSZ / sizeof(uint32_t)];
 	hashcpy((unsigned char*)offsets, hash);
-	return bloom_filter_check_bits(bf, offsets,
+	return bloom_filter_check_bits(bf, graph_pos, offsets,
 			the_hash_algo->rawsz / sizeof(*offsets));
 }
 
@@ -80,11 +83,12 @@ int bloom_filter_load(struct bloom_filter *bf)
 		return -1;
 
 	read_in_full(fd, &bf->nr_entries, sizeof(bf->nr_entries));
+	read_in_full(fd, &bf->commit_nr, sizeof(bf->commit_nr));
 	read_in_full(fd, &bf->bit_size, sizeof(bf->bit_size));
 	if (bf->bit_size % CHAR_BIT)
 		BUG("invalid size for bloom filter");
-	bf->bits = xmalloc(bf->bit_size / CHAR_BIT);
-	read_in_full(fd, bf->bits, bf->bit_size / CHAR_BIT);
+	bf->bits = xmalloc(bf->commit_nr * bf->bit_size / CHAR_BIT);
+	read_in_full(fd, bf->bits, bf->commit_nr * bf->bit_size / CHAR_BIT);
 
 	close(fd);
 
@@ -96,8 +100,9 @@ void bloom_filter_write(struct bloom_filter *bf)
 	int fd = xopen(git_path_bloom(), O_WRONLY | O_CREAT | O_TRUNC, 0666);
 
 	write_in_full(fd, &bf->nr_entries, sizeof(bf->nr_entries));
+	write_in_full(fd, &bf->commit_nr, sizeof(bf->commit_nr));
 	write_in_full(fd, &bf->bit_size, sizeof(bf->bit_size));
-	write_in_full(fd, bf->bits, bf->bit_size / CHAR_BIT);
+	write_in_full(fd, bf->bits, bf->commit_nr * bf->bit_size / CHAR_BIT);
 
 	close(fd);
 }
diff --git a/bloom-filter.h b/bloom-filter.h
index 94d0af1708..607649b8db 100644
--- a/bloom-filter.h
+++ b/bloom-filter.h
@@ -5,30 +5,26 @@
 
 struct bloom_filter {
 	uint32_t nr_entries;
+	uint32_t commit_nr;
 	uint32_t bit_size;
 	unsigned char *bits;
 };
 
 
-void bloom_filter_init(struct bloom_filter *bf, uint32_t bit_size);
+void bloom_filter_init(struct bloom_filter *bf, uint32_t commit_nr, uint32_t bit_size);
 void bloom_filter_free(struct bloom_filter *bf);
 
-void bloom_filter_set_bits(struct bloom_filter *bf, const uint32_t *offsets,
-			   int nr_offsets, int nr_enries);
-int bloom_filter_check_bits(struct bloom_filter *bf, const uint32_t *offsets,
-			    int nr);
-
 /*
  * Turns the given (SHA1) hash into 5 unsigned ints, and sets the bits at
  * those positions (modulo the bitmap's size) in the Bloom filter.
  */
-void bloom_filter_add_hash(struct bloom_filter *bf, const unsigned char *hash);
+void bloom_filter_add_hash(struct bloom_filter *bf, uint32_t graph_pos, const unsigned char *hash);
 /*
  * Turns the given (SHA1) hash into 5 unsigned ints, and checks the bits at
  * those positions (modulo the bitmap's size) in the Bloom filter.
  * Returns 1 if all those bits are set, 0 otherwise.
  */
-int bloom_filter_check_hash(struct bloom_filter *bf, const unsigned char *hash);
+int bloom_filter_check_hash(struct bloom_filter *bf, uint32_t graph_pos, const unsigned char *hash);
 
 void hashxor(const unsigned char *hash1, const unsigned char *hash2,
 	     unsigned char *out);
diff --git a/commit-graph.c b/commit-graph.c
index f415d3b41f..90b0b3df90 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -715,13 +715,11 @@ static int add_ref_to_list(const char *refname,
 static void add_changes_to_bloom_filter(struct bloom_filter *bf,
 					struct commit *parent,
 					struct commit *commit,
+					int index,
 					struct diff_options *diffopt)
 {
-	unsigned char p_c_hash[GIT_MAX_RAWSZ];
 	int i;
 
-	hashxor(parent->object.oid.hash, commit->object.oid.hash, p_c_hash);
-
 	diff_tree_oid(&parent->object.oid, &commit->object.oid, "", diffopt);
 	diffcore_std(diffopt);
 
@@ -756,8 +754,8 @@ static void add_changes_to_bloom_filter(struct bloom_filter *bf,
 			the_hash_algo->update_fn(&ctx, path, p - path);
 			the_hash_algo->final_fn(name_hash, &ctx);
 
-			hashxor(name_hash, p_c_hash, hash);
-			bloom_filter_add_hash(bf, hash);
+			hashxor(name_hash, parent->object.oid.hash, hash);
+			bloom_filter_add_hash(bf, index, hash);
 		} while (*p);
 
 		diff_free_filepair(diff_queued_diff.queue[i]);
@@ -768,11 +766,10 @@ static void add_changes_to_bloom_filter(struct bloom_filter *bf,
 }
 
 static void fill_bloom_filter(struct bloom_filter *bf,
-				    struct progress *progress)
+				    struct progress *progress, struct commit **commits, int commit_nr)
 {
 	struct rev_info revs;
 	const char *revs_argv[] = {NULL, "--all", NULL};
-	struct commit *commit;
 	int i = 0;
 
 	/* We (re-)create the bloom filter from scratch every time for now. */
@@ -783,18 +780,19 @@ static void fill_bloom_filter(struct bloom_filter *bf,
 	if (prepare_revision_walk(&revs))
 		die("revision walk setup failed while preparing bloom filter");
 
-	while ((commit = get_revision(&revs))) {
+	for (i = 0; i < commit_nr; i++) {
+		struct commit *commit = commits[i];
 		struct commit_list *parent;
 
 		for (parent = commit->parents; parent; parent = parent->next)
-			add_changes_to_bloom_filter(bf, parent->item, commit,
+			add_changes_to_bloom_filter(bf, parent->item, commit, i,
 						    &revs.diffopt);
 
-		display_progress(progress, ++i);
+		display_progress(progress, i);
 	}
 }
 
-static void write_bloom_filter(int report_progress, int commit_nr)
+static void write_bloom_filter(int report_progress, struct commit **commits, int commit_nr)
 {
 	struct bloom_filter bf;
 	struct progress *progress = NULL;
@@ -809,13 +807,13 @@ static void write_bloom_filter(int report_progress, int commit_nr)
 	if (*end)
 		die("GIT_USE_POC_BLOOM_FILTER must specify the number of bits in the bloom filter (multiple of 8, n < 2^32)");
 
-	bloom_filter_init(&bf, bitsize);
+	bloom_filter_init(&bf, commit_nr, bitsize);
 
 	if (report_progress)
 		progress = start_progress(_("Computing bloom filter"),
 					  commit_nr);
 
-	fill_bloom_filter(&bf, progress);
+	fill_bloom_filter(&bf, progress, commits, commit_nr);
 
 	bloom_filter_write(&bf);
 	bloom_filter_free(&bf);
@@ -1030,7 +1028,7 @@ void write_commit_graph(const char *obj_dir,
 	finalize_hashfile(f, NULL, CSUM_HASH_IN_STREAM | CSUM_FSYNC);
 	commit_lock_file(&lk);
 
-	write_bloom_filter(report_progress, commits.nr);
+	write_bloom_filter(report_progress, commits.list, commits.nr);
 
 	free(graph_name);
 	free(commits.list);
diff --git a/revision.c b/revision.c
index d5ba2b1598..c84a997928 100644
--- a/revision.c
+++ b/revision.c
@@ -490,7 +490,6 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 						 struct commit *parent,
 						 struct commit *commit)
 {
-	unsigned char p_c_hash[GIT_MAX_RAWSZ];
 	int i;
 
 	if (!bf.bits)
@@ -509,17 +508,15 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 	 */
 	if (!the_repository->objects->commit_graph)
 		return -1;
-	if (commit->generation == GENERATION_NUMBER_INFINITY)
+	if (commit->graph_pos == COMMIT_NOT_FROM_GRAPH)
 		return -1;
 
-	hashxor(parent->object.oid.hash, commit->object.oid.hash, p_c_hash);
-
 	for (i = 0; i < revs->pruning.pathspec.nr; i++) {
 		struct pathspec_item *pi = &revs->pruning.pathspec.items[i];
 		unsigned char hash[GIT_MAX_RAWSZ];
 
-		hashxor(pi->name_hash, p_c_hash, hash);
-		if (bloom_filter_check_hash(&bf, hash)) {
+		hashxor(pi->name_hash, parent->object.oid.hash, hash);
+		if (bloom_filter_check_hash(&bf, commit->graph_pos, hash)) {
 			/*
 			 * At least one of the interesting pathspecs differs,
 			 * so we can return early and let the diff machinery
-- 
2.19.0.271.gfe8321ec05.dirty


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 2/2] Only make bloom filter for first parent
  2018-10-11  1:21                         ` [PATCH 0/2] Per-commit filter proof of concept Jonathan Tan
  2018-10-11  1:21                           ` [PATCH 1/2] One filter per commit Jonathan Tan
@ 2018-10-11  1:21                           ` Jonathan Tan
  2018-10-11  7:37                           ` [PATCH 0/2] Per-commit filter proof of concept Ævar Arnfjörð Bjarmason
  2 siblings, 0 replies; 78+ messages in thread
From: Jonathan Tan @ 2018-10-11  1:21 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, peff, stolee, avarab, szeder.dev

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 commit-graph.c |  4 ++--
 revision.c     | 20 ++++++++++++--------
 2 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 90b0b3df90..d21d555611 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -782,9 +782,9 @@ static void fill_bloom_filter(struct bloom_filter *bf,
 
 	for (i = 0; i < commit_nr; i++) {
 		struct commit *commit = commits[i];
-		struct commit_list *parent;
+		struct commit_list *parent = commit->parents;
 
-		for (parent = commit->parents; parent; parent = parent->next)
+		if (parent)
 			add_changes_to_bloom_filter(bf, parent->item, commit, i,
 						    &revs.diffopt);
 
diff --git a/revision.c b/revision.c
index c84a997928..5a433a5878 100644
--- a/revision.c
+++ b/revision.c
@@ -539,11 +539,11 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 }
 
 static int rev_compare_tree(struct rev_info *revs,
-			    struct commit *parent, struct commit *commit)
+			    struct commit *parent, struct commit *commit, int nth_parent)
 {
 	struct tree *t1 = get_commit_tree(parent);
 	struct tree *t2 = get_commit_tree(commit);
-	int bloom_ret;
+	int bloom_ret = 1;
 
 	if (!t1)
 		return REV_TREE_NEW;
@@ -568,17 +568,21 @@ static int rev_compare_tree(struct rev_info *revs,
 			return REV_TREE_SAME;
 	}
 
-	bloom_ret = check_maybe_different_in_bloom_filter(revs, parent, commit);
-	if (bloom_ret == 0)
-		return REV_TREE_SAME;
+	if (!nth_parent) {
+		bloom_ret = check_maybe_different_in_bloom_filter(revs, parent, commit);
+		if (bloom_ret == 0)
+			return REV_TREE_SAME;
+	}
 
 	tree_difference = REV_TREE_SAME;
 	revs->pruning.flags.has_changes = 0;
 	if (diff_tree_oid(&t1->object.oid, &t2->object.oid, "",
 			   &revs->pruning) < 0)
 		return REV_TREE_DIFFERENT;
-	if (bloom_ret == 1 && tree_difference == REV_TREE_SAME)
-		bloom_filter_count_false_positive++;
+	if (!nth_parent) {
+		if (bloom_ret == 1 && tree_difference == REV_TREE_SAME)
+			bloom_filter_count_false_positive++;
+	}
 	return tree_difference;
 }
 
@@ -776,7 +780,7 @@ static void try_to_simplify_commit(struct rev_info *revs, struct commit *commit)
 			die("cannot simplify commit %s (because of %s)",
 			    oid_to_hex(&commit->object.oid),
 			    oid_to_hex(&p->object.oid));
-		switch (rev_compare_tree(revs, p, commit)) {
+		switch (rev_compare_tree(revs, p, commit, nth_parent)) {
 		case REV_TREE_SAME:
 			if (!revs->simplify_history || !relevant_commit(p)) {
 				/* Even if a merge with an uninteresting
-- 
2.19.0.271.gfe8321ec05.dirty


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PoC -- do not apply 2/3] test-tree-bitmap: add "dump" mode
  2018-10-10  0:48                                         ` Junio C Hamano
@ 2018-10-11  3:13                                           ` Jeff King
  0 siblings, 0 replies; 78+ messages in thread
From: Jeff King @ 2018-10-11  3:13 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Stefan Beller, git,
	Duy Nguyen

On Wed, Oct 10, 2018 at 09:48:53AM +0900, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > The one difference is the sort order: git's diff output is
> > in tree-sort order, so a subtree "foo" sorts like "foo/",
> > which is after "foo.bar". Whereas the bitmap path list has a
> > true byte sort, which puts "foo.bar" after "foo".
> 
> If we truly cared, it is easy enough to fix by having a custom
> comparison function in 1/3 used in collect_paths() phase.

Yep. I thought about doing it just so I could drop this "one difference"
note, but I got lazy.

Running this on linux.git, I do see a few other differences. It looks
like my code does actually compute lists of touched paths for some
merges (presumably using "-c"). That wasn't intended, and it would
actually make my timings less good, but my goal was just to get a rough
idea on size here (but see below).

> > +	/* dump it while we have the sorted order in memory */
> > +	for (i = 0; i < n; i++) {
> > +		printf("%s", sorted[i]->path);
> > +		putchar('\0');
> > +	}
> 
> With printf("%s%c", sorted[i]->path, '\0'); you can lose the braces.

Heh, I didn't really expect review at that level. I'm not even sure this
is a good direction to go versus something like the bloom filters (or
even a more full --raw cache). But if it is, this code is mostly
throw-away anyway, as we'd want to integrate it with the actual diff
code.

My original goal had mostly been to get an idea of the size, and the
"dump" half was there to verify that the results were roughly sane. But
it actually works for rough timing, too. I can generate roughly the same
results as "rev-list --all | diff-tree --stdin -t --name-only" in about
300ms, as opposed to 33s. So that's good.

But it's also a slight cheat, since I'm not actually traversing the
commits, but rather just opening up the bitmaps in the order we wrote
them. ;)

Actually walking the commits (and not looking at the trees) takes ~7s,
so it would at least be more like 33s versus 7.3s. With core.commitgraph,
it's more like 1.1s, so imagine 27s versus 1.4s, I guess.
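
(That is, assuming the costs simply add up: the 33s pipeline is roughly
7s of commit walking plus ~26s of tree diffs, so

    with commit-graph, no bitmaps:  ~1.1s walk + ~26s diffs        ~= 27s
    with commit-graph and bitmaps:  ~1.1s walk + ~0.3s bitmap pass ~= 1.4s

which is where those numbers come from.)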

That's also neglecting any load/lookup time for actual random-access to
the bitmaps. I doubt that's more than a few hundred ms, but that's just
a made-up number.

So I think the rough timings are favorable, but the real proof would
actually be using it from a revision walk, which I haven't written.

> > +	putchar('\0');
> > +
> >  	free(sorted);
> >  }
> >  
> > @@ -142,6 +150,8 @@ static void generate_bitmap(struct diff_queue_struct *q,
> >  
> >  	ewah = bitmap_to_ewah(bitmap);
> >  	ewah_serialize_strbuf(ewah, &out);
> > +
> > +	fwrite(data->commit->object.oid.hash, 1, GIT_SHA1_RAWSZ, stdout);
> >  	fwrite(out.buf, 1, out.len, stdout);
> 
> OK, so per commit, we have ewah bitmap that records the "changed
> paths" after the commit object name.  Makes sense.

Yeah. This format, btw, is garbage. It was just the smallest and
simplest thing I could think of that would work for my case. We'd want
random-access to the bitmaps for each commit, probably via an index
block in the commit-graph file.

> And the list of paths are based on the "one" side of the filepair.
> When we do an equivalent of "git show X", we see "diff-tree X~1 X"
> and by collecting the "one" side (i.e. subset of paths in the tree
> of X~1 that were modified when going to X) we say "commit X changed
> these paths".  Makes sense, too.

I didn't think too hard on whether we'd need to look at the "two" side
ever. I turned off renames, so we'd see deletions via the "one". I feel
like we'd miss additions in that case, though, but from my results, we
do not seem to.

-Peff

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PoC -- do not apply 3/3] test-tree-bitmap: replace ewah with custom rle encoding
  2018-10-10  0:58                                         ` Junio C Hamano
@ 2018-10-11  3:20                                           ` Jeff King
  0 siblings, 0 replies; 78+ messages in thread
From: Jeff King @ 2018-10-11  3:20 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Stefan Beller, git,
	Duy Nguyen

On Wed, Oct 10, 2018 at 09:58:51AM +0900, Junio C Hamano wrote:

> > +static void bitmap_to_rle(struct strbuf *out, struct bitmap *bitmap)
> > +{
> > +	int curval = 0; /* count zeroes, then ones, then zeroes, etc */
> > +	size_t run = 0;
> > +	size_t word;
> > +	size_t orig_len = out->len;
> > +
> > +	for (word = 0; word < bitmap->word_alloc; word++) {
> > +		int bit;
> > +
> > +		for (bit = 0; bit < BITS_IN_EWORD; bit++) {
> > +			int val = !!(bitmap->words[word] & (((eword_t)1) << bit));
> > +			if (val == curval)
> > +				run++;
> > +			else {
> > +				strbuf_add_varint(out, run);
> > +				curval = 1 - curval; /* flip 0/1 */
> > +				run = 1;
> > +			}
> > +		}
> 
> OK.  I find it a bit disturbing to see that the loop knows a bit too
> much about how "struct bitmap" is implemented, but that is a complaint
> against the bitmap API, not this new user of the API.

Heh, again, this is not really meant to be production code. I'm not at
all happy about inventing a new compressed bitmap format here, and I'd
want to investigate the state of the art a bit more. In particular, the
worst case here is quite bad, and I wonder if there are formats that can
select the best encoding when writing a bitmap (naive RLE when it's
good, something else other times).

I also suspect part of why this does better is that other formats are
optimized less for our case. We really don't care about setting or
looking at a few bits part way through a bitmap. Our bitmaps are small
enough that we don't mind streaming through a whole one. It's just that
we have so _many_ of them that we want to be meticulous about wasted
bytes.

Whatever format we choose, I think it would become part of the bitmap.c
file, and internal details would be OK to access there. I just put it
here to keep the patch simple.

> We do not try to handle the case where bitmap has bits that is not
> multiple of BITS_IN_EWORD and instead pretend that size of such a
> bitmap can be rounded up, because we ignore trailing 0-bit anyway,
> and we know the "struct bitmap" would pad with 0-bit at the tail?

Right. We do not know the "real" number of zero bits at all. It's just
assumed that there are infinite zeroes trailing off the end (and this is
how "struct bitmap" works, since it is the one that does not bother to
keep a separate size pointer).
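
As a tiny worked example of that: a bitmap whose only set bits are 4, 5,
and 6 comes out as essentially just two varint run lengths,

    4 3

i.e. "four 0s, then three 1s" (the first run is always a count of 0s, so
it would be 0 if bit 0 were set). The trailing 0s out to the end of the
last word never need to be represented, precisely because readers assume
an infinite tail of 0-bits.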

> > +	/*
> > +	 * ugh, varint does not seem to have a way to prevent reading past
> > +	 * the end of the buffer. We'll do a length check after each one,
> > +	 * so the worst case is bounded.
> > +	 */
> 
> Sorry about that :-).

:) We may want to address that. I know we did some hardening against
reading off the end of .pack and .idx files. But it seems like any user
of decode_varint() may read up to 16 bytes past the end of a buffer.

We seem to only use them for the $GIT_DIR/index, though. Anybody with a
"struct hashfile" result at least has a 20-byte trailer we can
accidentally read from. But I wouldn't be surprised if there's a way to
trick it in practice.
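
One possible shape for that would be a bounded variant that takes an
explicit end pointer. Purely a sketch (the name is made up; it mirrors
decode_varint() from varint.c, and like the original's overflow case it
conflates the error return with a legitimate value of 0):

  uintmax_t decode_varint_bounded(const unsigned char **bufp,
                                  const unsigned char *end)
  {
          const unsigned char *buf = *bufp;
          unsigned char c;
          uintmax_t val;

          if (buf == end)
                  return 0;       /* truncated */
          c = *buf++;
          val = c & 127;
          while (c & 128) {
                  val += 1;
                  if (!val || (val & (~(uintmax_t)0 << (sizeof(val) * 8 - 7))))
                          return 0;       /* overflow */
                  if (buf == end)
                          return 0;       /* truncated */
                  c = *buf++;
                  val = (val << 7) + (c & 127);
          }
          *bufp = buf;
          return val;
  }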

-Peff

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 0/2] Per-commit filter proof of concept
  2018-10-11  1:21                         ` [PATCH 0/2] Per-commit filter proof of concept Jonathan Tan
  2018-10-11  1:21                           ` [PATCH 1/2] One filter per commit Jonathan Tan
  2018-10-11  1:21                           ` [PATCH 2/2] Only make bloom filter for first parent Jonathan Tan
@ 2018-10-11  7:37                           ` Ævar Arnfjörð Bjarmason
  2 siblings, 0 replies; 78+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-10-11  7:37 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, peff, stolee, szeder.dev


On Thu, Oct 11 2018, Jonathan Tan wrote:

> Using per-commit filters and restricting the bloom filter to a single
> parent increases the relative power of the filter in omitting tree
> inspections compared to the original (107/53096 vs 1183/66459), but the
> lack of coverage w.r.t. the non-first parents had a more significant
> effect than I thought (1.29s vs .24s). It might be best to have one
> filter for each (commit, parent) pair (or, at least, the first two
> parents of each commit - we probably don't need to care that much about
> octopus merges) - this would take up more disk space than if we only
> store filters for the first parent, but is still less than the original
> example of storing information for all commits in one filter.
>
> There are more possibilities like dynamic filter sizing, different
> hashing, and hashing to support wildcard matches, which I haven't looked
> into.

Another way to deal with that is to have the filter store changes since
the merge base, from an E-Mail of mine back in May[1] when this was
discussed:

    From: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
    Date: Fri, 04 May 2018 22:36:07 +0200
    Message-ID: <87h8nnxio8.fsf@evledraar.gmail.com> (raw)

    On Fri, May 04 2018, Jakub Narebski wrote:

    (Just off-the cuff here and I'm surely about to be corrected by
    Derrick...)

    > * What to do about merge commits, and octopus merges in particular?
    >   Should Bloom filter be stored for each of the parents?  How to ensure
    >   fast access then (fixed-width records) - use large edge list?

    You could still store it fixed with, you'd just say that if you
    encounter a merge with N parents the filter wouldn't store files changed
    in that commit, but rather whether any of the N (including the merge)
    had changes to files as of the their common merge-base.

    Then if they did you'd need to walk all sides of the merge where each
    commit would also have the filter to figure out where the change(s)
    was/were, but if they didn't you could skip straight to the merge base
    and keep walking.
    [...]

Ideas are cheap and I don't have any code to back that up, just thought
I'd mention it if someone found it interesting.

Thinking about this again, I wonder if something like that could be
generalized more. In the abstract the idea is really whether we can
store a filter for N commits so we can skip across those N commits in
the walk as an optimization; doing this for merges is just an
implementation detail.

So what if the bloom filters were this sort of structure:

    <commit_the_filter_is_for> = [<bloom bitmap>, <next commit with filter>]

So e.g. given a history like ("-> " = parent relationship)

    A -> B
    B -> C
    C -> D
    D -> E
    E -> F
    F -> G

We could store:

    A -> B [<bloom bitmap for A..D>, D]
    B -> C
    C -> D
    D -> E [<bloom bitmap for D..F>, F]
    E -> F
    F -> G [<bloom bitmap for F..G>, G]

Note how the bitmaps aren't evenly spaced. That's because some algorithm
would have walked the graph and e.g. decided that from A..D we had few
enough changes that the bitmap should apply for 4 commits, and then 3
for the next set etc. Whether some range was worth extending could just
be a configurable implementation detail.
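
A very hand-wavy sketch of how a walk might consume such entries (every
name here is hypothetical; none of this exists):

  struct range_filter {
          struct bloom_filter bf;   /* covers this commit..skip_to */
          struct commit *skip_to;   /* the <next commit with filter> above */
  };

  /* would be looked up from wherever the filters end up being stored */
  struct range_filter *lookup_range_filter(struct commit *c);

  /* usual Bloom semantics: 0 means "definitely unchanged in the range" */
  int range_filter_maybe_changed(struct range_filter *rf,
                                 const unsigned char *path_hash);

  static struct commit *advance_for_path(struct commit *c,
                                         const unsigned char *path_hash)
  {
          struct range_filter *rf = lookup_range_filter(c);

          if (rf && !range_filter_maybe_changed(rf, path_hash))
                  return rf->skip_to;     /* skip the whole covered range */

          /* otherwise walk (and diff) one commit at a time as usual */
          return c->parents ? c->parents->item : NULL;
  }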

1. https://public-inbox.org/git/87h8nnxio8.fsf@evledraar.gmail.com/

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: Bloom Filters
  2018-10-09 23:12                                     ` Bloom Filters Jeff King
                                                         ` (2 preceding siblings ...)
  2018-10-09 23:14                                       ` [PoC -- do not apply 3/3] test-tree-bitmap: replace ewah with custom rle encoding Jeff King
@ 2018-10-11 12:33                                       ` Derrick Stolee
  2018-10-11 13:43                                         ` Jeff King
  3 siblings, 1 reply; 78+ messages in thread
From: Derrick Stolee @ 2018-10-11 12:33 UTC (permalink / raw)
  To: Jeff King
  Cc: SZEDER Gábor, Ævar Arnfjörð Bjarmason,
	Stefan Beller, git, Duy Nguyen

On 10/9/2018 7:12 PM, Jeff King wrote:
> On Tue, Oct 09, 2018 at 05:14:50PM -0400, Jeff King wrote:
>
>> Hmph. It really sounds like we could do better with a custom RLE
>> solution. But that makes me feel like I'm missing something, because
>> surely I can't invent something better than the state of the art in a
>> simple thought experiment, right?
>>
>> I know what I'm proposing would be quite bad for random access, but my
>> impression is that EWAH is the same. For the scale of bitmaps we're
>> talking about, I think linear/streaming access through the bitmap would
>> be OK.
> Thinking on it more, what I was missing is that for truly dense random
> bitmaps, this will perform much worse. Because it will use a byte to say
> "there's one 1", rather than a bit.
>
> But I think it does OK in practice for the very sparse bitmaps we tend
> to see in this application.  I was able to generate a complete output
> that can reproduce "log --name-status -t" for linux.git in 32MB. But:
>
>    - 15MB of that is commit sha1s, which will be stored elsewhere in a
>      "real" system
>
>    - 5MB of that is path list (which should shrink by a factor of 10 with
>      prefix compression, and is really a function of a tree size less
>      than history depth)
>
> So the per-commit cost is not too bad. That's still not counting merges,
> though, which would add another 10-15% (or maybe more; their bitmaps are
> less sparse).
>
> I don't know if this is a fruitful path at all or not. I was mostly just
> satisfying my own curiosity on the bitmap encoding question. But I'll
> post the patches, just to show my work. The first one is the same
> initial proof of concept I showed earlier.
>
>    [1/3]: initial tree-bitmap proof of concept
>    [2/3]: test-tree-bitmap: add "dump" mode
>    [3/3]: test-tree-bitmap: replace ewah with custom rle encoding
>
>   Makefile                    |   1 +
>   t/helper/test-tree-bitmap.c | 344 ++++++++++++++++++++++++++++++++++++
>   2 files changed, 345 insertions(+)
>   create mode 100644 t/helper/test-tree-bitmap.c
I'm trying to test this out myself, and am having trouble reverse 
engineering how I'm supposed to test it.

Looks like running "t/helper/test-tree-bitmap gen" will output a lot of 
binary data. Where should I store that? Does any file work?

Is this series just for the storage costs, assuming that we would 
replace all TREESAME checks with a query into this database? Or do you 
have a way to test how much this would improve a "git log -- <path>" query?

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/2] One filter per commit
  2018-10-11  1:21                           ` [PATCH 1/2] One filter per commit Jonathan Tan
@ 2018-10-11 12:49                             ` Derrick Stolee
  2018-10-11 19:11                               ` [PATCH] Per-commit and per-parent filters for 2 parents Jonathan Tan
  0 siblings, 1 reply; 78+ messages in thread
From: Derrick Stolee @ 2018-10-11 12:49 UTC (permalink / raw)
  To: Jonathan Tan, git; +Cc: peff, avarab, szeder.dev

On 10/10/2018 9:21 PM, Jonathan Tan wrote:
> diff --git a/commit-graph.c b/commit-graph.c
> index f415d3b41f..90b0b3df90 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -715,13 +715,11 @@ static int add_ref_to_list(const char *refname,
>   static void add_changes_to_bloom_filter(struct bloom_filter *bf,
>   					struct commit *parent,
>   					struct commit *commit,
> +					int index,
>   					struct diff_options *diffopt)
>   {
> -	unsigned char p_c_hash[GIT_MAX_RAWSZ];
>   	int i;
>   
> -	hashxor(parent->object.oid.hash, commit->object.oid.hash, p_c_hash);
> -
>   	diff_tree_oid(&parent->object.oid, &commit->object.oid, "", diffopt);
>   	diffcore_std(diffopt);
>   
> @@ -756,8 +754,8 @@ static void add_changes_to_bloom_filter(struct bloom_filter *bf,
>   			the_hash_algo->update_fn(&ctx, path, p - path);
>   			the_hash_algo->final_fn(name_hash, &ctx);
>   
> -			hashxor(name_hash, p_c_hash, hash);
> -			bloom_filter_add_hash(bf, hash);
> +			hashxor(name_hash, parent->object.oid.hash, hash);
> +			bloom_filter_add_hash(bf, index, hash);
>   		} while (*p);
>   
>   		diff_free_filepair(diff_queued_diff.queue[i]);
[snip]
> @@ -768,11 +766,10 @@ static void add_changes_to_bloom_filter(struct bloom_filter *bf,
>   }
>   
>   static void fill_bloom_filter(struct bloom_filter *bf,
> -				    struct progress *progress)
> +				    struct progress *progress, struct commit **commits, int commit_nr)
>   {
>   	struct rev_info revs;
>   	const char *revs_argv[] = {NULL, "--all", NULL};
> -	struct commit *commit;
>   	int i = 0;
>   
>   	/* We (re-)create the bloom filter from scratch every time for now. */
> @@ -783,18 +780,19 @@ static void fill_bloom_filter(struct bloom_filter *bf,
>   	if (prepare_revision_walk(&revs))
>   		die("revision walk setup failed while preparing bloom filter");
>   
> -	while ((commit = get_revision(&revs))) {
> +	for (i = 0; i < commit_nr; i++) {
> +		struct commit *commit = commits[i];
>   		struct commit_list *parent;
>   
>   		for (parent = commit->parents; parent; parent = parent->next)
> -			add_changes_to_bloom_filter(bf, parent->item, commit,
> +			add_changes_to_bloom_filter(bf, parent->item, commit, i,
>   						    &revs.diffopt);
>   
[snip]
>   
> -		hashxor(pi->name_hash, p_c_hash, hash);
> -		if (bloom_filter_check_hash(&bf, hash)) {
> +		hashxor(pi->name_hash, parent->object.oid.hash, hash);
> +		if (bloom_filter_check_hash(&bf, commit->graph_pos, hash)) {
>   			/*
>   			 * At least one of the interesting pathspecs differs,
>   			 * so we can return early and let the diff machinery
One main benefit of storing one Bloom filter per commit is to avoid 
recomputing hashes at every commit. Currently, this patch only improves 
locality when checking membership, at the cost of taking up more space. 
Drop the dependence on the parent oid and then we can save the time 
spent hashing during history queries.

-Stolee

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: Bloom Filters
  2018-10-11 12:33                                       ` Bloom Filters Derrick Stolee
@ 2018-10-11 13:43                                         ` Jeff King
  0 siblings, 0 replies; 78+ messages in thread
From: Jeff King @ 2018-10-11 13:43 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: SZEDER Gábor, Ævar Arnfjörð Bjarmason,
	Stefan Beller, git, Duy Nguyen

On Thu, Oct 11, 2018 at 08:33:58AM -0400, Derrick Stolee wrote:

> > I don't know if this is a fruitful path at all or not. I was mostly just
> > satisfying my own curiosity on the bitmap encoding question. But I'll
> > post the patches, just to show my work. The first one is the same
> > initial proof of concept I showed earlier.
> > 
> >    [1/3]: initial tree-bitmap proof of concept
> >    [2/3]: test-tree-bitmap: add "dump" mode
> >    [3/3]: test-tree-bitmap: replace ewah with custom rle encoding
> > 
> >   Makefile                    |   1 +
> >   t/helper/test-tree-bitmap.c | 344 ++++++++++++++++++++++++++++++++++++
> >   2 files changed, 345 insertions(+)
> >   create mode 100644 t/helper/test-tree-bitmap.c
> I'm trying to test this out myself, and am having trouble reverse
> engineering how I'm supposed to test it.
> 
> Looks like running "t/helper/test-tree-bitmap gen" will output a lot of
> binary data. Where should I store that? Does any file work?

Yeah, you can do:

  # optionally run with GIT_TRACE=1 to see some per-bitmap stats
  test-tree-bitmap gen >out

  # this should be roughly the same as:
  #  git rev-list --all |
  #  git diff-tree --stdin -t --name-only
  test-tree-bitmap dump <out

> Is this series just for the storage costs, assuming that we would replace
> all TREESAME checks with a query into this database? Or do you have a way to
> test how much this would improve a "git log -- <path>" query?

Right, I was just looking at storage cost here. It's not integrated with
the diff code at all. I left some hypothetical numbers elsewhere in the
thread.

-Peff

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH] Per-commit and per-parent filters for 2 parents
  2018-10-11 12:49                             ` Derrick Stolee
@ 2018-10-11 19:11                               ` Jonathan Tan
  0 siblings, 0 replies; 78+ messages in thread
From: Jonathan Tan @ 2018-10-11 19:11 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, stolee, peff, avarab, szeder.dev

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
> One main benefit of storing one Bloom filter per commit is to avoid
> recomputing hashes at every commit. Currently, this patch only improves
> locality when checking membership at the cost of taking up more space.
> Drop the dependence on the parent oid and then we can save the time
> spent hashing during history queries.

I've removed the hashing of the parent OID here and tried having
per-parent and per-commit hashes for the first 2 parents of each commit
instead of only 1, thus doubling the filter size. The results are not
much of an improvement though:

bloom filter total queries: 66409 definitely not: 56424 maybe: 9985 false positives: 9099 fp ratio: 0.137015
0:01.17
---
 bloom-filter.c | 25 ++++++++++++-------------
 bloom-filter.h |  4 ++--
 commit-graph.c | 13 ++++++++-----
 revision.c     | 11 +++++------
 4 files changed, 27 insertions(+), 26 deletions(-)

diff --git a/bloom-filter.c b/bloom-filter.c
index 39b453908f..10c73c45ae 100644
--- a/bloom-filter.c
+++ b/bloom-filter.c
@@ -11,7 +11,7 @@ void bloom_filter_init(struct bloom_filter *bf, uint32_t commit_nr, uint32_t bit
 	bf->nr_entries = 0;
 	bf->commit_nr = commit_nr;
 	bf->bit_size = bit_size;
-	bf->bits = xcalloc(1, commit_nr * bit_size / CHAR_BIT);
+	bf->bits = xcalloc(1, 2 * commit_nr * bit_size / CHAR_BIT);
 }
 
 void bloom_filter_free(struct bloom_filter *bf)
@@ -22,24 +22,24 @@ void bloom_filter_free(struct bloom_filter *bf)
 }
 
 
-static void bloom_filter_set_bits(struct bloom_filter *bf, uint32_t graph_pos, const uint32_t *offsets,
+static void bloom_filter_set_bits(struct bloom_filter *bf, uint32_t graph_pos, int parent_index, const uint32_t *offsets,
 			   int nr_offsets, int nr_entries)
 {
 	int i;
 	for (i = 0; i < nr_offsets; i++) {
-		uint32_t byte_offset = (offsets[i] % bf->bit_size + graph_pos * bf->bit_size) / CHAR_BIT;
+		uint32_t byte_offset = (offsets[i] % bf->bit_size + (2 * graph_pos + parent_index) * bf->bit_size) / CHAR_BIT;
 		unsigned char mask = 1 << offsets[i] % CHAR_BIT;
 		bf->bits[byte_offset] |= mask;
 	}
 	bf->nr_entries += nr_entries;
 }
 
-static int bloom_filter_check_bits(struct bloom_filter *bf, uint32_t graph_pos, const uint32_t *offsets,
+static int bloom_filter_check_bits(struct bloom_filter *bf, uint32_t graph_pos, int parent_index, const uint32_t *offsets,
 			    int nr)
 {
 	int i;
 	for (i = 0; i < nr; i++) {
-		uint32_t byte_offset = (offsets[i] % bf->bit_size + graph_pos * bf->bit_size) / CHAR_BIT;
+		uint32_t byte_offset = (offsets[i] % bf->bit_size + (2 * graph_pos + parent_index) * bf->bit_size) / CHAR_BIT;
 		unsigned char mask = 1 << offsets[i] % CHAR_BIT;
 		if (!(bf->bits[byte_offset] & mask))
 			return 0;
@@ -48,19 +48,18 @@ static int bloom_filter_check_bits(struct bloom_filter *bf, uint32_t graph_pos,
 }
 
 
-void bloom_filter_add_hash(struct bloom_filter *bf, uint32_t graph_pos, const unsigned char *hash)
+void bloom_filter_add_hash(struct bloom_filter *bf, uint32_t graph_pos, int parent_index, const unsigned char *hash)
 {
 	uint32_t offsets[GIT_MAX_RAWSZ / sizeof(uint32_t)];
 	hashcpy((unsigned char*)offsets, hash);
-	bloom_filter_set_bits(bf, graph_pos, offsets,
-			     the_hash_algo->rawsz / sizeof(*offsets), 1);
+	bloom_filter_set_bits(bf, graph_pos, parent_index, offsets, the_hash_algo->rawsz / sizeof(*offsets), 1);
 }
 
-int bloom_filter_check_hash(struct bloom_filter *bf, uint32_t graph_pos, const unsigned char *hash)
+int bloom_filter_check_hash(struct bloom_filter *bf, uint32_t graph_pos, int parent_index, const unsigned char *hash)
 {
 	uint32_t offsets[GIT_MAX_RAWSZ / sizeof(uint32_t)];
 	hashcpy((unsigned char*)offsets, hash);
-	return bloom_filter_check_bits(bf, graph_pos, offsets,
+	return bloom_filter_check_bits(bf, graph_pos, parent_index, offsets,
 			the_hash_algo->rawsz / sizeof(*offsets));
 }
 
@@ -87,8 +86,8 @@ int bloom_filter_load(struct bloom_filter *bf)
 	read_in_full(fd, &bf->bit_size, sizeof(bf->bit_size));
 	if (bf->bit_size % CHAR_BIT)
 		BUG("invalid size for bloom filter");
-	bf->bits = xmalloc(bf->commit_nr * bf->bit_size / CHAR_BIT);
-	read_in_full(fd, bf->bits, bf->commit_nr * bf->bit_size / CHAR_BIT);
+	bf->bits = xmalloc(2 * bf->commit_nr * bf->bit_size / CHAR_BIT);
+	read_in_full(fd, bf->bits, 2 * bf->commit_nr * bf->bit_size / CHAR_BIT);
 
 	close(fd);
 
@@ -102,7 +101,7 @@ void bloom_filter_write(struct bloom_filter *bf)
 	write_in_full(fd, &bf->nr_entries, sizeof(bf->nr_entries));
 	write_in_full(fd, &bf->commit_nr, sizeof(bf->commit_nr));
 	write_in_full(fd, &bf->bit_size, sizeof(bf->bit_size));
-	write_in_full(fd, bf->bits, bf->commit_nr * bf->bit_size / CHAR_BIT);
+	write_in_full(fd, bf->bits, 2 * bf->commit_nr * bf->bit_size / CHAR_BIT);
 
 	close(fd);
 }
diff --git a/bloom-filter.h b/bloom-filter.h
index 607649b8db..20e0527451 100644
--- a/bloom-filter.h
+++ b/bloom-filter.h
@@ -18,13 +18,13 @@ void bloom_filter_free(struct bloom_filter *bf);
  * Turns the given (SHA1) hash into 5 unsigned ints, and sets the bits at
  * those positions (modulo the bitmap's size) in the Bloom filter.
  */
-void bloom_filter_add_hash(struct bloom_filter *bf, uint32_t graph_pos, const unsigned char *hash);
+void bloom_filter_add_hash(struct bloom_filter *bf, uint32_t graph_pos, int parent_index, const unsigned char *hash);
 /*
  * Turns the given (SHA1) hash into 5 unsigned ints, and checks the bits at
  * those positions (modulo the bitmap's size) in the Bloom filter.
  * Returns 1 if all those bits are set, 0 otherwise.
  */
-int bloom_filter_check_hash(struct bloom_filter *bf, uint32_t graph_pos, const unsigned char *hash);
+int bloom_filter_check_hash(struct bloom_filter *bf, uint32_t graph_pos, int parent_index, const unsigned char *hash);
 
 void hashxor(const unsigned char *hash1, const unsigned char *hash2,
 	     unsigned char *out);
diff --git a/commit-graph.c b/commit-graph.c
index d21d555611..ca869c10e1 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -716,6 +716,7 @@ static void add_changes_to_bloom_filter(struct bloom_filter *bf,
 					struct commit *parent,
 					struct commit *commit,
 					int index,
+					int parent_index,
 					struct diff_options *diffopt)
 {
 	int i;
@@ -738,7 +739,6 @@ static void add_changes_to_bloom_filter(struct bloom_filter *bf,
 		do {
 			git_hash_ctx ctx;
 			unsigned char name_hash[GIT_MAX_RAWSZ];
-			unsigned char hash[GIT_MAX_RAWSZ];
 
 			p = strchrnul(p + 1, '/');
 
@@ -754,8 +754,7 @@ static void add_changes_to_bloom_filter(struct bloom_filter *bf,
 			the_hash_algo->update_fn(&ctx, path, p - path);
 			the_hash_algo->final_fn(name_hash, &ctx);
 
-			hashxor(name_hash, parent->object.oid.hash, hash);
-			bloom_filter_add_hash(bf, index, hash);
+			bloom_filter_add_hash(bf, index, parent_index, name_hash);
 		} while (*p);
 
 		diff_free_filepair(diff_queued_diff.queue[i]);
@@ -784,9 +783,13 @@ static void fill_bloom_filter(struct bloom_filter *bf,
 		struct commit *commit = commits[i];
 		struct commit_list *parent = commit->parents;
 
-		if (parent)
-			add_changes_to_bloom_filter(bf, parent->item, commit, i,
+		if (parent) {
+			add_changes_to_bloom_filter(bf, parent->item, commit, i, 0,
 						    &revs.diffopt);
+			if (parent->next)
+				add_changes_to_bloom_filter(bf, parent->next->item, commit, i, 1,
+							    &revs.diffopt);
+		}
 
 		display_progress(progress, i);
 	}
diff --git a/revision.c b/revision.c
index 5a433a5878..5478d08344 100644
--- a/revision.c
+++ b/revision.c
@@ -488,6 +488,7 @@ static void print_bloom_filter_stats_atexit(void)
 
 static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 						 struct commit *parent,
+						 int parent_index,
 						 struct commit *commit)
 {
 	int i;
@@ -513,10 +514,8 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 
 	for (i = 0; i < revs->pruning.pathspec.nr; i++) {
 		struct pathspec_item *pi = &revs->pruning.pathspec.items[i];
-		unsigned char hash[GIT_MAX_RAWSZ];
 
-		hashxor(pi->name_hash, parent->object.oid.hash, hash);
-		if (bloom_filter_check_hash(&bf, commit->graph_pos, hash)) {
+		if (bloom_filter_check_hash(&bf, commit->graph_pos, parent_index, pi->name_hash)) {
 			/*
 			 * At least one of the interesting pathspecs differs,
 			 * so we can return early and let the diff machinery
@@ -568,8 +567,8 @@ static int rev_compare_tree(struct rev_info *revs,
 			return REV_TREE_SAME;
 	}
 
-	if (!nth_parent) {
-		bloom_ret = check_maybe_different_in_bloom_filter(revs, parent, commit);
+	if (nth_parent <= 1) {
+		bloom_ret = check_maybe_different_in_bloom_filter(revs, parent, nth_parent, commit);
 		if (bloom_ret == 0)
 			return REV_TREE_SAME;
 	}
@@ -579,7 +578,7 @@ static int rev_compare_tree(struct rev_info *revs,
 	if (diff_tree_oid(&t1->object.oid, &t2->object.oid, "",
 			   &revs->pruning) < 0)
 		return REV_TREE_DIFFERENT;
-	if (!nth_parent) {
+	if (nth_parent <= 1) {
 		if (bloom_ret == 1 && tree_difference == REV_TREE_SAME)
 			bloom_filter_count_false_positive++;
 	}
-- 
2.19.0.271.gfe8321ec05.dirty


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH 0/4] Bloom filter experiment
  2018-10-09 19:34                       ` [PATCH 0/4] Bloom filter experiment SZEDER Gábor
                                           ` (5 preceding siblings ...)
  2018-10-11  1:21                         ` [PATCH 0/2] Per-commit filter proof of concept Jonathan Tan
@ 2018-10-15 14:39                         ` Derrick Stolee
  2018-10-16  4:45                           ` Junio C Hamano
  2018-10-16 23:41                           ` Jonathan Tan
  6 siblings, 2 replies; 78+ messages in thread
From: Derrick Stolee @ 2018-10-15 14:39 UTC (permalink / raw)
  To: SZEDER Gábor, git
  Cc: Jeff King, Junio C Hamano, Ævar Arnfjörð Bjarmason,
	Stefan Beller, Duy Nguyen, Jonathan Tan

On 10/9/2018 3:34 PM, SZEDER Gábor wrote:
> To keep the ball rolling, here is my proof of concept in a somewhat
> cleaned-up form, with still plenty of rough edges.

Peff, Szeder, and Jonathan,

Thanks for giving me the kick in the pants to finally write a proof of 
concept for my personal take on how this should work. My implementation 
borrows things from both Szeder and Jonathan's series. You can find my 
commits for all of the versions on GitHub (it's a bit too messy to share 
as a patch series right now, I think):

Repo: https://github.com/derrickstolee/git
Branches: bloom/* (includes bloom/stolee, bloom/peff, bloom/szeder, and 
bloom/tan for the respective implementations, and bloom/base as the 
common ancestor)

My implementation uses the following scheme:

1. Bloom filters are computed and stored on a commit-by-commit basis.

2. The filters are sized according to the number of changes in each 
commit, with a minimum of one 64-bit word.

3. The filters are stored in the commit-graph using two new optional 
chunks: one stores a single 32-bit integer for each commit that provides 
the end of its Bloom filter in the second "data" chunk. The data chunk 
also stores the magic constants (hash version, num hash keys, and num 
bits per entry).

4. We fill the Bloom filters as (const char *data, int len) pairs as 
"struct bloom_filter"s in a commit slab.

5. In order to evaluate containment, we need the struct bloom_filter, 
but also struct bloom_settings (stores the magic constants in one 
place), and struct bloom_key (stores the _k_ hash values). This allows 
us to hash a path once and test the same path against many Bloom filters 
(see the sketch after this list).

6. When we compute the Bloom filters, we don't store a filter for 
commits whose first-parent diff has more than 512 paths.

7. When we compute the commit-graph, we can re-use the pre-existing 
filters without needing to recompute diffs. (Caveat: the current 
implementation will re-compute diffs for the commits with diffs that 
were too large.)
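
Roughly, the split in item 5 looks like this (only a sketch of the idea;
the names and layout below are not the actual code on those branches):

  struct bloom_settings {
          uint32_t hash_version;
          uint32_t num_hashes;      /* k */
          uint32_t bits_per_entry;
  };

  struct bloom_key {
          uint32_t *hashes;         /* k hash values for one path, computed once */
  };

  struct bloom_filter {
          const unsigned char *data;
          size_t len;               /* in bytes, sized per commit */
  };

  static int bloom_maybe_contains(const struct bloom_filter *bf,
                                  const struct bloom_key *key,
                                  const struct bloom_settings *settings)
  {
          size_t bits = bf->len * CHAR_BIT;
          uint32_t i;

          for (i = 0; i < settings->num_hashes; i++) {
                  size_t bit = key->hashes[i] % bits;
                  if (!(bf->data[bit / CHAR_BIT] & (1 << (bit % CHAR_BIT))))
                          return 0;       /* definitely not changed */
          }
          return 1;                       /* maybe changed */
  }

The point of the split is that a history query hashes each pathspec into
a bloom_key once, up front, and then checks that same key against the
per-commit filters for every commit in the walk, instead of re-hashing
the path (or path-xor-parent) at each commit.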

You can build the Bloom filters in my implementation this way:

GIT_TEST_BLOOM_FILTERS=1 ./git commit-graph write --reachable

> You can play around with it like this:
>
>    $ GIT_USE_POC_BLOOM_FILTER=$((8*1024*1024*8)) git commit-graph write
>    Computing commit graph generation numbers: 100% (52801/52801), done.
>    Computing bloom filter: 100% (52801/52801), done.
>    # Yeah, I even added progress indicator! :)
>    $ GIT_TRACE_BLOOM_FILTER=2 GIT_USE_POC_BLOOM_FILTER=y git rev-list --count --full-history HEAD -- t/valgrind/valgrind.sh
>    886
>    20:40:24.783699 revision.c:486          bloom filter total queries: 66095 definitely not: 64953 maybe: 1142 false positives: 256 fp ratio: 0.003873

Jonathan used this same test, so will I. Here is a summary table:

| Implementation | Queries | Maybe | FP # | FP %  |
|----------------|---------|-------|------|-------|
| Szeder         | 66095   | 1142  | 256  | 0.38% |
| Jonathan       | 66459   | 107   | 89   | 0.16% |
| Stolee         | 53025   | 492   | 479  | 0.90% |

(Note that we must have used different starting points, which is why my 
"Queries" is so much smaller.)

The increase in false-positive percentage is expected in my 
implementation. I'm using the correct filter sizes to hit a <1% FP 
ratio. This could be lowered by changing the settings, and the size 
would dynamically grow. For my Git repo (which contains 
git-for-windows/git and microsoft/git) this implementation grows the 
commit-graph file from 5.8 MB to 7.3 MB (1.5 MB total, compared to 
Szeder's 8MB filter). For 105,260 commits, that rounds out to less than 
20 bytes per commit (compared to Jonathan's 256 bytes per commit).

Related stats for my Linux repo: 781,756 commits, commit-graph grows 
from 43.8 to 55.6 MB (~12 MB additional, ~16 bytes per commit).

I haven't done a side-by-side performance test for these 
implementations, but it would be interesting to do so.

Although I wrote a lot of code in a short amount of time, there is still a 
lot of work to be done before this is submittable:

1. There are three different environment variables right now. It would 
be better to have one GIT_TEST_ variable and rely on existing tracing 
for logs (trace2 values would be good here).

2. We need config values for writing and consuming bloom filters, but 
also to override the default settings.

3. My bloom.c/bloom.h is too coupled to the commit-graph. I want to 
harden that interface to let Bloom filters live as their own thing, so 
that the commit-graph could load a Bloom filter from the file instead of 
from the slab.

4. Tests, tests, and more tests.

We'll see how much time I have to do this polish, but I think the 
benefit is proven.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 0/4] Bloom filter experiment
  2018-10-15 14:39                         ` [PATCH 0/4] Bloom filter experiment Derrick Stolee
@ 2018-10-16  4:45                           ` Junio C Hamano
  2018-10-16 11:13                             ` Derrick Stolee
  2018-10-16 23:41                           ` Jonathan Tan
  1 sibling, 1 reply; 78+ messages in thread
From: Junio C Hamano @ 2018-10-16  4:45 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: SZEDER Gábor, git, Jeff King,
	Ævar Arnfjörð Bjarmason, Stefan Beller, Duy Nguyen,
	Jonathan Tan

Derrick Stolee <stolee@gmail.com> writes:

> 2. The filters are sized according to the number of changes in each
> commit, with a minimum of one 64-bit word.
> ...
> 6. When we compute the Bloom filters, we don't store a filter for
> commits whose first-parent diff has more than 512 paths.

Just being curious, but was 512 taken out of thin air or is there
some math behind it, e.g. to limit the false positive rate down to a
certain threshold?  With a wide-enough bitset, you could store an
arbitrarily large number of paths with a low enough false positive
rate, I guess, but is there a point where there are so many paths in
the change that we get diminishing returns and it is not worth
having a filter in the first place?

In a normal source-code-control context, the set of paths modified
by any single commit ought to be a small subset of the entire paths,
and whole-tree changes ought to be fairly rare.  In a project for
which that assumption does not hold, it might help to have a
negative bloom filter (i.e. "git log -- A" asks "does the commit
modify A?" and the filter would say "we know it does not, because we
threw all the paths that are not touched to the bloom filter"), but
I think that would optimize for a wrong case.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 0/4] Bloom filter experiment
  2018-10-16  4:45                           ` Junio C Hamano
@ 2018-10-16 11:13                             ` Derrick Stolee
  2018-10-16 12:57                               ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 78+ messages in thread
From: Derrick Stolee @ 2018-10-16 11:13 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: SZEDER Gábor, git, Jeff King,
	Ævar Arnfjörð Bjarmason, Stefan Beller, Duy Nguyen,
	Jonathan Tan

On 10/16/2018 12:45 AM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
>
>> 2. The filters are sized according to the number of changes in each
>> commit, with a minimum of one 64-bit word.
>> ...
>> 6. When we compute the Bloom filters, we don't store a filter for
>> commits whose first-parent diff has more than 512 paths.
> Just being curious but was 512 taken out of thin air or is there
> some math behind it, e.g. to limit false positive rate down to
> certain threshold?  With a wide-enough bitset, you could store
> arbitrary large number of paths with low enough false positive, I
> guess, but is there a point where there is too many paths in the
> change that gives us diminishing returns and not worth having a
> filter in the first place?
512 is somewhat arbitrary, but having a maximum size is not.
> In a normal source-code-control context, the set of paths modified
> by any single commit ought to be a small subset of the entire paths,
> and whole-tree changes ought to be fairly rare.  In a project for
> which that assumption does not hold, it might help to have a
> negative bloom filter (i.e. "git log -- A" asks "does the commit
> modify A?" and the filter would say "we know it does not, because we
> threw all the paths that are not touched to the bloom filter"), but
> I think that would optimize for a wrong case.

A commit with many changed paths is very rare. The 512 I picked above is 
enough to cover 99% of commits in each of the repos I sampled when first 
investigating Bloom filters.

When a Bloom filter response says "maybe yes" (in our case, "maybe not 
TREESAME"), then we need to verify that it is correct. In the extreme 
case that every path is changed, then the Bloom filter does nothing but 
add extra work.

These extreme cases are also not unprecedented: in our Azure Repos 
codebase, we were using core.autocrlf  to smudge CRLFs to LFs, but when 
it was time to dogfood VFS for Git, we needed to turn off the smudge 
filter. So, there is one commit that converts every LF to a CRLF in 
every text file. Storing a Bloom filter for those ~250,000 entries would 
take ~256KB for essentially no value. By not storing a filter for this 
commit, we go immediately to the regular TREESAME check, which would 
happen for most pathspecs.
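
To put rough numbers on that (assuming on the order of 8 bits per
changed path, which is about what the ~256KB figure above works out to):

    250,000 paths * 8 bits = 2,000,000 bits ~= 250 KB for that one commit
        512 paths * 8 bits =     4,096 bits  =  512 bytes

so the 512-path cap keeps any single filter down to a few hundred bytes.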

This is all to say: having a maximum size is good. 512 is big enough to 
cover _most_ commits, but not so big that we may store _really_ big filters.

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 0/4] Bloom filter experiment
  2018-10-16 11:13                             ` Derrick Stolee
@ 2018-10-16 12:57                               ` Ævar Arnfjörð Bjarmason
  2018-10-16 13:03                                 ` Derrick Stolee
  2018-10-18  2:00                                 ` Junio C Hamano
  0 siblings, 2 replies; 78+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-10-16 12:57 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Junio C Hamano, SZEDER Gábor, git, Jeff King, Stefan Beller,
	Duy Nguyen, Jonathan Tan


On Tue, Oct 16 2018, Derrick Stolee wrote:

> On 10/16/2018 12:45 AM, Junio C Hamano wrote:
>> Derrick Stolee <stolee@gmail.com> writes:
>>
>>> 2. The filters are sized according to the number of changes in each
>>> commit, with a minimum of one 64-bit word.
>>> ...
>>> 6. When we compute the Bloom filters, we don't store a filter for
>>> commits whose first-parent diff has more than 512 paths.
>> Just being curious but was 512 taken out of thin air or is there
>> some math behind it, e.g. to limit false positive rate down to
>> certain threshold?  With a wide-enough bitset, you could store
>> arbitrary large number of paths with low enough false positive, I
>> guess, but is there a point where there is too many paths in the
>> change that gives us diminishing returns and not worth having a
>> filter in the first place?
> 512 is somewhat arbitrary, but having a maximum size is not.
>> In a normal source-code-control context, the set of paths modified
>> by any single commit ought to be a small subset of the entire paths,
>> and whole-tree changes ought to be fairly rare.  In a project for
>> which that assumption does not hold, it might help to have a
>> negative bloom filter (i.e. "git log -- A" asks "does the commit
>> modify A?" and the filter would say "we know it does not, because we
>> threw all the paths that are not touched to the bloom filter"), but
>> I think that would optimize for a wrong case.
>
> A commit with many changed paths is very rare. The 512 I picked above
> is enough to cover 99% of commits in each of the repos I sampled when
> first investigating Bloom filters.
>
> When a Bloom filter response says "maybe yes" (in our case, "maybe not
> TREESAME"), then we need to verify that it is correct. In the extreme
> case that every path is changed, then the Bloom filter does nothing
> but add extra work.
>
> These extreme cases are also not unprecedented: in our Azure Repos
> codebase, we were using core.autocrlf to smudge CRLFs to LFs, but
> when it was time to dogfood VFS for Git, we needed to turn off the
> smudge filter. So, there is one commit that converts every LF to a
> CRLF in every text file. Storing a Bloom filter for those ~250,000
> entries would take ~256KB for essentially no value. By not storing a
> filter for this commit, we go immediately to the regular TREESAME
> check, which would happen for most pathspecs.
>
> This is all to say: having a maximum size is good. 512 is big enough
> to cover _most_ commits, but not so big that we may store _really_ big
> filters.

Makes sense. 512 is good enough to hardcode initially, but I couldn't
tell from briefly skimming the patches if it was possible to make this
size dynamic per-repo when the graph/filter is written.

I.e. we might later add some discovery step where we look at N
commits at random, until we're satisfied that we've come up with some
average/median number of total (recursive) tree entries & how many tend
to be changed per-commit.

I.e. I can imagine repositories (with some automated changes) where we
have 10k files and tend to change 1k per commit, or ones with 10k files
where we tend to change just 1-10 per commit, which would mean a
larger/smaller filter would be needed / would do.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 0/4] Bloom filter experiment
  2018-10-16 12:57                               ` Ævar Arnfjörð Bjarmason
@ 2018-10-16 13:03                                 ` Derrick Stolee
  2018-10-18  2:00                                 ` Junio C Hamano
  1 sibling, 0 replies; 78+ messages in thread
From: Derrick Stolee @ 2018-10-16 13:03 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Junio C Hamano, SZEDER Gábor, git, Jeff King, Stefan Beller,
	Duy Nguyen, Jonathan Tan

On 10/16/2018 8:57 AM, Ævar Arnfjörð Bjarmason wrote:
> On Tue, Oct 16 2018, Derrick Stolee wrote:
>
>> On 10/16/2018 12:45 AM, Junio C Hamano wrote:
>>> Derrick Stolee <stolee@gmail.com> writes:
>>>
>>>> 2. The filters are sized according to the number of changes in each
>>>> commit, with a minimum of one 64-bit word.
>>>> ...
>>>> 6. When we compute the Bloom filters, we don't store a filter for
>>>> commits whose first-parent diff has more than 512 paths.
>>> Just being curious but was 512 taken out of thin air or is there
>>> some math behind it, e.g. to limit false positive rate down to
>>> certain threshold?  With a wide-enough bitset, you could store
>>> arbitrary large number of paths with low enough false positive, I
>>> guess, but is there a point where there is too many paths in the
>>> change that gives us diminishing returns and not worth having a
>>> filter in the first place?
>> 512 is somewhat arbitrary, but having a maximum size is not.
>>> In a normal source-code-control context, the set of paths modified
>>> by any single commit ought to be a small subset of the entire paths,
>>> and whole-tree changes ought to be fairly rare.  In a project for
>>> which that assumption does not hold, it might help to have a
>>> negative bloom filter (i.e. "git log -- A" asks "does the commit
>>> modify A?" and the filter would say "we know it does not, because we
>>> threw all the paths that are not touched to the bloom filter"), but
>>> I think that would optimize for a wrong case.
>> A commit with many changed paths is very rare. The 512 I picked above
>> is enough to cover 99% of commits in each of the repos I sampled when
>> first investigating Bloom filters.
>>
>> When a Bloom filter response says "maybe yes" (in our case, "maybe not
>> TREESAME"), then we need to verify that it is correct. In the extreme
>> case that every path is changed, then the Bloom filter does nothing
>> but add extra work.
>>
>> These extreme cases are also not unprecedented: in our Azure Repos
>> codebase, we were using core.autocrlf to smudge CRLFs to LFs, but
>> when it was time to dogfood VFS for Git, we needed to turn off the
>> smudge filter. So, there is one commit that converts every LF to a
>> CRLF in every text file. Storing a Bloom filter for those ~250,000
>> entries would take ~256KB for essentially no value. By not storing a
>> filter for this commit, we go immediately to the regular TREESAME
>> check, which would happen for most pathspecs.
>>
>> This is all to say: having a maximum size is good. 512 is big enough
>> to cover _most_ commits, but not so big that we may store _really_ big
>> filters.
> Makes sense. 512 is good enough to hardcode initially, but I couldn't
> tell from briefly skimming the patches if it was possible to make this
> size dynamic per-repo when the graph/filter is written.
My proof-of-concept has it as a constant, but part of my plan is to make
these all config options, as in this item from my earlier message:

 >>> 2. We need config values for writing and consuming bloom filters,
 >>> but also to override the default settings.
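
Purely for illustration, the kind of knobs I have in mind would look
something like the below; these option names are made up for this
example and are not options that exist in the patches or in git today:

# hypothetical option names, for illustration only
[commitGraph]
        writeBloomFilters = true     # write filters with the commit-graph
        readBloomFilters = true      # consume them during revision walks
        bloomMaxChangedPaths = 512   # skip the filter for larger commits
        bloomBitsPerEntry = 10       # size vs. false-positive trade-off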

> I.e. we might later add some discovery step where we look at N number of
> commits at random, until we're satisfied that we've come up with some
> average/median number of total (recursive) tree entries & how many tend
> to be changed per-commit.
>
> I.e. I can imagine repositories (with some automated changes) where we
> have 10k files and tend to change 1k per commit, or ones with 10k files
> where we tend to change just 1-10 per commit, which would mean a
> larger/smaller filter would be needed / would do.
I'm not sure a dynamic approach would be worth the effort, but I'm open 
to hearing the results of an experiment.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 0/4] Bloom filter experiment
  2018-10-15 14:39                         ` [PATCH 0/4] Bloom filter experiment Derrick Stolee
  2018-10-16  4:45                           ` Junio C Hamano
@ 2018-10-16 23:41                           ` Jonathan Tan
  1 sibling, 0 replies; 78+ messages in thread
From: Jonathan Tan @ 2018-10-16 23:41 UTC (permalink / raw)
  To: stolee
  Cc: szeder.dev, git, peff, gitster, avarab, sbeller, pclouds,
	jonathantanmy

> | Implementation | Queries | Maybe | FP # | FP %  |
> |----------------|---------|-------|------|-------|
> | Szeder         | 66095   | 1142  | 256  | 0.38% |
> | Jonathan       | 66459   | 107   | 89   | 0.16% |
> | Stolee         | 53025   | 492   | 479  | 0.90% |
> 
> (Note that we must have used different starting points, which is why my 
> "Queries" is so much smaller.)

I suspect it's because your bloom filter implementation covers only the
first parent (if I'm understanding get_bloom_filter() correctly). When I
only covered the first parent in my initial test (see patch 2 of [1]), I
got (following the columns in the table above, except that the last value
is a raw fraction rather than a percentage):

  53096 107 89 0.001676

Also, I think that the rejecting power (Queries - Maybe)/(Total tree
comparisons if no bloom filters were used) needs to be in the evaluation
criteria somewhere, as that indicates how many tree comparisons we
managed to avoid.
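
For what it's worth, here is roughly how I mean those two numbers to be
computed, plugging in the "Stolee" row from the table; the no-filter
baseline is a made-up placeholder just to show the formula, since that
count is not in the table:

#include <stdio.h>

int main(void)
{
        double queries = 53025, maybe = 492, fp = 479; /* "Stolee" row above */
        double baseline = 60000; /* placeholder: tree comparisons without filters */

        printf("false positive rate: %.2f%%\n", 100 * fp / queries); /* 0.90% */
        printf("rejecting power:     %.2f%%\n",
               100 * (queries - maybe) / baseline);
        return 0;
}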

Also, we probably should test on a file that changes more
frequently :-)

[1] https://public-inbox.org/git/cover.1539219248.git.jonathantanmy@google.com/

> The increase in false-positive percentage is expected in my 
> implementation. I'm using the correct filter sizes to hit a <1% FP 
> ratio. This could be lowered by changing the settings, and the size 
> would dynamically grow. For my Git repo (which contains 
> git-for-windows/git and microsoft/git) this implementation grows the 
> commit-graph file from 5.8 MB to 7.3 MB (1.5 MB total, compared to 
> Szeder's 8MB filter). For 105,260 commits, that rounds out to less than 
> 20 bytes per commit (compared to Jonathan's 256 bytes per commit).

Mine has 256 bits per commit, which is 32 bytes per commit (still more
than yours).
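
Checking the arithmetic on both (this assumes the 1.5 MB above means
binary megabytes; either reading stays well under 20 bytes per commit):

#include <stdio.h>

int main(void)
{
        double growth = 1.5 * 1024 * 1024; /* commit-graph growth, in bytes */
        double commits = 105260;

        printf("Stolee:   ~%.1f bytes/commit\n", growth / commits); /* ~14.9 */
        printf("Jonathan: %d bytes/commit\n", 256 / 8);             /* 32 */
        return 0;
}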

Having said all that, thanks for writing up your version - in
particular, variable sized filters (like in yours) seem to be the way to
go.

> We'll see how much time I have to do this polish, but I think the 
> benefit is proven.

Agreed.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 0/4] Bloom filter experiment
  2018-10-16 12:57                               ` Ævar Arnfjörð Bjarmason
  2018-10-16 13:03                                 ` Derrick Stolee
@ 2018-10-18  2:00                                 ` Junio C Hamano
  1 sibling, 0 replies; 78+ messages in thread
From: Junio C Hamano @ 2018-10-18  2:00 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Derrick Stolee, SZEDER Gábor, git, Jeff King, Stefan Beller,
	Duy Nguyen, Jonathan Tan

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

>> This is all to say: having a maximum size is good. 512 is big enough
>> to cover _most_ commits, but not so big that we may store _really_ big
>> filters.
>
> Makes sense. 512 is good enough to hardcode initially, but I couldn't
> tell from briefly skimming the patches if it was possible to make this
> size dynamic per-repo when the graph/filter is written.
>
> I.e. we might later add some discovery step where we look at N number of
> commits at random, until we're satisfied that we've come up with some
> average/median number of total (recursive) tree entries & how many tend
> to be changed per-commit.
>
> I.e. I can imagine repositories (with some automated changes) where we
> have 10k files and tend to change 1k per commit, or ones with 10k files
> where we tend to change just 1-10 per commit, which would mean a
> larger/smaller filter would be needed / would do.

I was more interested to find out what the advice our docs should
give to the end users to tune the value once such a knob is
invented, and what you gave in the above paragraphs may lead us to a
nice auto-tuning heuristic.


^ permalink raw reply	[flat|nested] 78+ messages in thread

end of thread

Thread overview: 78+ messages
2018-10-03 13:23 We should add a "git gc --auto" after "git clone" due to commit graph Ævar Arnfjörð Bjarmason
2018-10-03 13:36 ` SZEDER Gábor
2018-10-03 13:42   ` Derrick Stolee
2018-10-03 14:18     ` Ævar Arnfjörð Bjarmason
2018-10-03 14:01   ` Ævar Arnfjörð Bjarmason
2018-10-03 14:17     ` SZEDER Gábor
2018-10-03 14:22       ` Ævar Arnfjörð Bjarmason
2018-10-03 14:53         ` SZEDER Gábor
2018-10-03 15:19           ` Ævar Arnfjörð Bjarmason
2018-10-03 16:59             ` SZEDER Gábor
2018-10-05  6:09               ` Junio C Hamano
2018-10-10 22:07                 ` SZEDER Gábor
2018-10-10 23:01                   ` Ævar Arnfjörð Bjarmason
2018-10-03 19:08           ` Stefan Beller
2018-10-03 19:21             ` Jeff King
2018-10-03 20:35               ` Ævar Arnfjörð Bjarmason
2018-10-03 17:47         ` Stefan Beller
2018-10-03 18:47           ` Ævar Arnfjörð Bjarmason
2018-10-03 18:51             ` Jeff King
2018-10-03 18:59               ` Derrick Stolee
2018-10-03 19:18                 ` Jeff King
2018-10-08 16:41                   ` SZEDER Gábor
2018-10-08 16:57                     ` Derrick Stolee
2018-10-08 18:10                       ` SZEDER Gábor
2018-10-08 18:29                         ` Derrick Stolee
2018-10-09  3:08                           ` Jeff King
2018-10-09 13:48                             ` Bloom Filters (was Re: We should add a "git gc --auto" after "git clone" due to commit graph) Derrick Stolee
2018-10-09 18:45                               ` Ævar Arnfjörð Bjarmason
2018-10-09 18:46                               ` Jeff King
2018-10-09 19:03                                 ` Derrick Stolee
2018-10-09 21:14                                   ` Jeff King
2018-10-09 23:12                                     ` Bloom Filters Jeff King
2018-10-09 23:13                                       ` [PoC -- do not apply 1/3] initial tree-bitmap proof of concept Jeff King
2018-10-09 23:14                                       ` [PoC -- do not apply 2/3] test-tree-bitmap: add "dump" mode Jeff King
2018-10-10  0:48                                         ` Junio C Hamano
2018-10-11  3:13                                           ` Jeff King
2018-10-09 23:14                                       ` [PoC -- do not apply 3/3] test-tree-bitmap: replace ewah with custom rle encoding Jeff King
2018-10-10  0:58                                         ` Junio C Hamano
2018-10-11  3:20                                           ` Jeff King
2018-10-11 12:33                                       ` Bloom Filters Derrick Stolee
2018-10-11 13:43                                         ` Jeff King
2018-10-09 21:30                             ` We should add a "git gc --auto" after "git clone" due to commit graph SZEDER Gábor
2018-10-09 19:34                       ` [PATCH 0/4] Bloom filter experiment SZEDER Gábor
2018-10-09 19:34                         ` [PATCH 1/4] Add a (very) barebones Bloom filter implementation SZEDER Gábor
2018-10-09 19:34                         ` [PATCH 2/4] commit-graph: write a Bloom filter containing changed paths for each commit SZEDER Gábor
2018-10-09 21:06                           ` Jeff King
2018-10-09 21:37                             ` SZEDER Gábor
2018-10-09 19:34                         ` [PATCH 3/4] revision.c: use the Bloom filter to speed up path-limited revision walks SZEDER Gábor
2018-10-09 19:34                         ` [PATCH 4/4] revision.c: add GIT_TRACE_BLOOM_FILTER for a bit of statistics SZEDER Gábor
2018-10-09 19:47                         ` [PATCH 0/4] Bloom filter experiment Derrick Stolee
2018-10-11  1:21                         ` [PATCH 0/2] Per-commit filter proof of concept Jonathan Tan
2018-10-11  1:21                           ` [PATCH 1/2] One filter per commit Jonathan Tan
2018-10-11 12:49                             ` Derrick Stolee
2018-10-11 19:11                               ` [PATCH] Per-commit and per-parent filters for 2 parents Jonathan Tan
2018-10-11  1:21                           ` [PATCH 2/2] Only make bloom filter for first parent Jonathan Tan
2018-10-11  7:37                           ` [PATCH 0/2] Per-commit filter proof of concept Ævar Arnfjörð Bjarmason
2018-10-15 14:39                         ` [PATCH 0/4] Bloom filter experiment Derrick Stolee
2018-10-16  4:45                           ` Junio C Hamano
2018-10-16 11:13                             ` Derrick Stolee
2018-10-16 12:57                               ` Ævar Arnfjörð Bjarmason
2018-10-16 13:03                                 ` Derrick Stolee
2018-10-18  2:00                                 ` Junio C Hamano
2018-10-16 23:41                           ` Jonathan Tan
2018-10-08 23:02                     ` We should add a "git gc --auto" after "git clone" due to commit graph Junio C Hamano
2018-10-03 14:32     ` Duy Nguyen
2018-10-03 16:45 ` Duy Nguyen
2018-10-04 21:42 ` [RFC PATCH] " Ævar Arnfjörð Bjarmason
2018-10-05 12:05   ` Derrick Stolee
2018-10-05 13:05     ` Ævar Arnfjörð Bjarmason
2018-10-05 13:45       ` Derrick Stolee
2018-10-05 14:04         ` Ævar Arnfjörð Bjarmason
2018-10-05 19:21         ` Jeff King
2018-10-05 19:41           ` Derrick Stolee
2018-10-05 19:47             ` Jeff King
2018-10-05 20:00               ` Derrick Stolee
2018-10-05 20:02                 ` Jeff King
2018-10-05 20:01               ` Ævar Arnfjörð Bjarmason
2018-10-05 20:09                 ` Jeff King

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git
