Using bitmaps to accelerate fetch and clone

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* Using bitmaps to accelerate fetch and clone
@ 2012-09-27  0:47 Shawn Pearce
  2012-09-27 12:17 ` Nguyen Thai Ngoc Duy
  2012-09-28 12:00 ` Nguyen Thai Ngoc Duy
  0 siblings, 2 replies; 20+ messages in thread
From: Shawn Pearce @ 2012-09-27  0:47 UTC (permalink / raw)
  To: git; +Cc: Colby Ranger

Google has published a series of patches (see links below) to JGit to
improve fetch and clone performance by adding compressed bitmaps to
the pack-*.idx structure.

Operation                   Index V2               Index VE003
Clone                       37530ms (524.06 MiB)     82ms (524.06 MiB)
Fetch (1 commit back)          75ms                 107ms
Fetch (10 commits back)       456ms (269.51 KiB)    341ms (265.19 KiB)
Fetch (100 commits back)      449ms (269.91 KiB)    337ms (267.28 KiB)
Fetch (1000 commits back)    2229ms ( 14.75 MiB)    189ms ( 14.42 MiB)
Fetch (10000 commits back)   2177ms ( 16.30 MiB)    254ms ( 15.88 MiB)
Fetch (100000 commits back) 14340ms (185.83 MiB)   1655ms (189.39 MiB)

In the table the repository tested was Android's
platform/frameworks/base. The time shown is the time spent in the
"Counting objects" phase of creating a pack for a client using the
git:// protocol. The byte size shown is the size of the pack
transferred to the client, and "commits back" describes how far behind
the client was from the server when it started the fetch. In all test
runs the client was git-core and the server was JGit on the same
machine.

The amount of disk space used by the compressed bitmaps is tunable,
but averages 10-15% of the pack-*.idx file size. So about 8 MiB of
additional space for this repository. A repository owner can reduce
the worst case time used in the 100000 commit back case by using
slightly more disk and positioning more bitmaps more frequently
throughout history. The code doesn't do this by default because the
expectation is that a client is probably not 100k commits behind.
Instead it populates bitmaps at all branch and tag tips, and densely
(every few hundred commits) near the tips, and spaces them out more
the further back in history it goes. We assume the older history is
accessed less often, and doesn't need to waste additional disk space
or precious buffer cache.

The basic gist of the implementation is a bitmap has a 1 bit set for
each object that is reachable from the commit the bitmap is associated
with. An index file may have a unique bitmap for hundreds of commits
in the corresponding pack file. The set of objects to send is
performed by doing a simple computation:

  OR (all want lines) AND NOT OR (all have lines)

There are two key patches in the series that implement the file format
change and logic involved:

* https://git.eclipse.org/r/7939

  Defines the new E003 index format and the bit set
  implementation logic.

* https://git.eclipse.org/r/7940

  Uses E003 indexes when available to make packs, and
  the logic required to make E003 format indexes during GC.

:-)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Using bitmaps to accelerate fetch and clone
  2012-09-27  0:47 Using bitmaps to accelerate fetch and clone Shawn Pearce
@ 2012-09-27 12:17 ` Nguyen Thai Ngoc Duy
  2012-09-27 14:33   ` Shawn Pearce
  2012-09-27 17:20   ` Jeff King
  2012-09-28 12:00 ` Nguyen Thai Ngoc Duy
  1 sibling, 2 replies; 20+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-09-27 12:17 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git, Colby Ranger

On Thu, Sep 27, 2012 at 7:47 AM, Shawn Pearce <spearce@spearce.org> wrote:
> Google has published a series of patches (see links below) to JGit to

Should discussions about this series happen in here, jgit mailing or
gerrit? I just want to make sure I'll discuss it at the right place.

> improve fetch and clone performance by adding compressed bitmaps to
> the pack-*.idx structure.
>
> Operation                   Index V2               Index VE003
> Clone                       37530ms (524.06 MiB)     82ms (524.06 MiB)
> Fetch (1 commit back)          75ms                 107ms
> Fetch (10 commits back)       456ms (269.51 KiB)    341ms (265.19 KiB)
> Fetch (100 commits back)      449ms (269.91 KiB)    337ms (267.28 KiB)
> Fetch (1000 commits back)    2229ms ( 14.75 MiB)    189ms ( 14.42 MiB)
> Fetch (10000 commits back)   2177ms ( 16.30 MiB)    254ms ( 15.88 MiB)
> Fetch (100000 commits back) 14340ms (185.83 MiB)   1655ms (189.39 MiB)

Beautiful. And curious, why do 100->1000 and 10000->10000 have such
big leaps in time (V2)?

> The basic gist of the implementation is a bitmap has a 1 bit set for
> each object that is reachable from the commit the bitmap is associated
> with. An index file may have a unique bitmap for hundreds of commits
> in the corresponding pack file. The set of objects to send is
> performed by doing a simple computation:
>
>   OR (all want lines) AND NOT OR (all have lines)
>
> There are two key patches in the series that implement the file format
> change and logic involved:
>
> * https://git.eclipse.org/r/7939
>
>   Defines the new E003 index format and the bit set
>   implementation logic.

I suppose the index format is not set in stone yet? My java-foo is
rusty and I'm not familiar with jgit, so I more likely read things
wrong.

It seems the bitmap data follows directly after regular index content.
I'd like to see some sort of extension mechanism like in
$GIT_DIR/index, so that we don't have to increase pack index version
often. What I have in mind is optional commit cache to speed up
rev-list and merge, which could be stored in pack index too.

In PackIndexVE003 class

+               // Read the bitmaps for the Git types
+               SimpleDataInput dataInput = new SimpleDataInput(fd);
+               this.commits = readBitmap(dataInput);
+               this.trees = readBitmap(dataInput);
+               this.blobs = readBitmap(dataInput);
+               this.tags = readBitmap(dataInput);

Am I correct in saying that you have four different on-disk bitmaps,
one for each object type? If so, for compression efficient reasons?

> :-)

Definitely :-). I have shown my interest in this topic before. So I
should probably say that I'm going to work on this on C Git, but
sllloooowwwly. As this benefits the server side greatly, perhaps a
GitHubber ;-) might want to work on this on C Git, for GitHub itself
of course, and, as a side effect, make the rest of us happy?
-- 
Duy

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Using bitmaps to accelerate fetch and clone
  2012-09-27 12:17 ` Nguyen Thai Ngoc Duy
@ 2012-09-27 14:33   ` Shawn Pearce
  2012-09-28  1:37     ` Nguyen Thai Ngoc Duy
  2012-09-27 17:20   ` Jeff King
  1 sibling, 1 reply; 20+ messages in thread
From: Shawn Pearce @ 2012-09-27 14:33 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: git, Colby Ranger

On Thu, Sep 27, 2012 at 5:17 AM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
> On Thu, Sep 27, 2012 at 7:47 AM, Shawn Pearce <spearce@spearce.org> wrote:
>> Google has published a series of patches (see links below) to JGit to
>
> Should discussions about this series happen in here, jgit mailing or
> gerrit? I just want to make sure I'll discuss it at the right place.

I think we should have a concrete discussion about the implementation
in Java on the Gerrit changes (e.g.  "you should fix this comment it
doesn't sufficiently describe the method"), and a discussion about the
file format and algorithm here where everyone can contribute.

The format is named E003 because its still experimental. Its not set
in stone. Future iterations might be named E004, etc. If we get
something final we will look to rename it to just version 3.

>> improve fetch and clone performance by adding compressed bitmaps to
>> the pack-*.idx structure.
>>
>> Operation                   Index V2               Index VE003
>> Clone                       37530ms (524.06 MiB)     82ms (524.06 MiB)
>> Fetch (1 commit back)          75ms                 107ms
>> Fetch (10 commits back)       456ms (269.51 KiB)    341ms (265.19 KiB)
>> Fetch (100 commits back)      449ms (269.91 KiB)    337ms (267.28 KiB)
>> Fetch (1000 commits back)    2229ms ( 14.75 MiB)    189ms ( 14.42 MiB)
>> Fetch (10000 commits back)   2177ms ( 16.30 MiB)    254ms ( 15.88 MiB)
>> Fetch (100000 commits back) 14340ms (185.83 MiB)   1655ms (189.39 MiB)
>
> Beautiful. And curious, why do 100->1000 and 10000->10000 have such
> big leaps in time (V2)?

We didn't investigate this. Colby just reported on what the current
JGit code does. I suspect there is something specific about the shape
of the history graph for this repository combined with the way JGit
wrote the pack file that caused these sorts of large increases.
Perhaps they could be smaller. Not sure I care anymore when the E003
approach gives us such low times. :-)

>> The basic gist of the implementation is a bitmap has a 1 bit set for
>> each object that is reachable from the commit the bitmap is associated
>> with. An index file may have a unique bitmap for hundreds of commits
>> in the corresponding pack file. The set of objects to send is
>> performed by doing a simple computation:
>>
>>   OR (all want lines) AND NOT OR (all have lines)
>>
>> There are two key patches in the series that implement the file format
>> change and logic involved:
>>
>> * https://git.eclipse.org/r/7939
>>
>>   Defines the new E003 index format and the bit set
>>   implementation logic.
>
> I suppose the index format is not set in stone yet?

For E003, yes, we already have some data encoded with it. But as a
file format change, no. We are willing to iterate on this if there is
tangible benefit displayed by an alternative. Future versions would
have to be E004 or some other new version number to disambiguate from
E003.

> My java-foo is
> rusty and I'm not familiar with jgit, so I more likely read things
> wrong.

Or maybe not. :-)

> It seems the bitmap data follows directly after regular index content.

Correct. It is after the regular content, but before the 2 SHA-1 trailers.

> I'd like to see some sort of extension mechanism like in
> $GIT_DIR/index, so that we don't have to increase pack index version
> often.

This might be worthwhile. I dislike the way $GIT_DIR/index encodes
extensions. Forcing an extension to fully materialize itself to
determine its length so the length can be placed before the data is
painful to work with when writing the file out to disk. I would prefer
writing an index catalog at the trailer of the file. We already
require random access to the index file, so its possible for a reader
to read a fixed size trailer record that has the 2 SHA-1s we normally
end an index with, and an extension catalog footer that has a length
and CRC-32 of the catalog. The catalog would immediately appear before
the footer, so a reader can find the start of the extension catalog by
subtracting from the end of the file the catalog length and the file
footer and catalog footer lengths. The catalog can then supply a
starting offset for each extension section, and writers don't need to
predict in advance how much data they need to store. Readers trying to
use extensions aren't really hurt, Git already randomly seeks to read
the tail of an index file to compare the pack SHA-1 before assuming
the index is valid.

> What I have in mind is optional commit cache to speed up
> rev-list and merge, which could be stored in pack index too.

We should also look into using the commit bitmap data to feed the
rev-list traversal. The bitmaps can tell us which objects are commits,
and their rough ordering given the packing rules. That may be
sufficient to feed the walker without having a priority queue.

> In PackIndexVE003 class
>
> +               // Read the bitmaps for the Git types
> +               SimpleDataInput dataInput = new SimpleDataInput(fd);
> +               this.commits = readBitmap(dataInput);
> +               this.trees = readBitmap(dataInput);
> +               this.blobs = readBitmap(dataInput);
> +               this.tags = readBitmap(dataInput);
>
> Am I correct in saying that you have four different on-disk bitmaps,
> one for each object type? If so, for compression efficient reasons?

Yes. The packer needs to know the type of each object its going to
pack. Instead of reading this from the individual object headers in
the pack files, we store them in compressed type bitmaps. This is much
faster to test than reading in the base data from the pack file.

> Definitely :-). I have shown my interest in this topic before. So I
> should probably say that I'm going to work on this on C Git, but
> sllloooowwwly. As this benefits the server side greatly, perhaps a
> GitHubber ;-) might want to work on this on C Git, for GitHub itself
> of course, and, as a side effect, make the rest of us happy?

Google may also contribute towards this work, we really want to see an
improvement in git-core too. Its just a lot of work, and we are also a
limited team. :-)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Using bitmaps to accelerate fetch and clone
  2012-09-27 12:17 ` Nguyen Thai Ngoc Duy
  2012-09-27 14:33   ` Shawn Pearce
@ 2012-09-27 17:20   ` Jeff King
  2012-09-27 17:35     ` Shawn Pearce
                       ` (2 more replies)
  1 sibling, 3 replies; 20+ messages in thread
From: Jeff King @ 2012-09-27 17:20 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: Shawn Pearce, git, Colby Ranger, David Barr

On Thu, Sep 27, 2012 at 07:17:42PM +0700, Nguyen Thai Ngoc Duy wrote:

> > Operation                   Index V2               Index VE003
> > Clone                       37530ms (524.06 MiB)     82ms (524.06 MiB)
> > Fetch (1 commit back)          75ms                 107ms
> > Fetch (10 commits back)       456ms (269.51 KiB)    341ms (265.19 KiB)
> > Fetch (100 commits back)      449ms (269.91 KiB)    337ms (267.28 KiB)
> > Fetch (1000 commits back)    2229ms ( 14.75 MiB)    189ms ( 14.42 MiB)
> > Fetch (10000 commits back)   2177ms ( 16.30 MiB)    254ms ( 15.88 MiB)
> > Fetch (100000 commits back) 14340ms (185.83 MiB)   1655ms (189.39 MiB)
> 
> Beautiful. And curious, why do 100->1000 and 10000->10000 have such
> big leaps in time (V2)?

Agreed. I'm very excited about these numbers.

> >   Defines the new E003 index format and the bit set
> >   implementation logic.
> [...]
> It seems the bitmap data follows directly after regular index content.
> I'd like to see some sort of extension mechanism like in
> $GIT_DIR/index, so that we don't have to increase pack index version
> often. What I have in mind is optional commit cache to speed up
> rev-list and merge, which could be stored in pack index too.

As I understand it, both the bitmaps and a commit cache are
theoretically optional. That is, git can do the job without them, but
they speed things up. If that is the case, do we need to bump the index
version at all? Why not store a plain v2 index, and then store an
additional file "pack-XXX.reachable" that contains the bitmaps and an
independent version number.

The sha1 in the filename makes sure that the reachability file is always
in sync with the actual pack data and index.  Old readers won't know
about the new file, and will ignore it. For new readers, if the file is
there they can use it; if it's missing (or its version is not
understood), they can fall back to the regular index.

I haven't looked at the details of the format change yet. If it is
purely an extra chunk of data at the end, this Just Works. If there are
changes to the earlier parts of the pack (e.g., I seem to recall the
commit cache idea wanted separate indices for each object type), we may
still need a v3. But it would be nice if we could make those changes
generic (e.g., just the separate indices, which might support many
different enhancements), and then let the actual feature work happen in
the separate files.

> Definitely :-). I have shown my interest in this topic before. So I
> should probably say that I'm going to work on this on C Git, but
> sllloooowwwly. As this benefits the server side greatly, perhaps a
> GitHubber ;-) might want to work on this on C Git, for GitHub itself
> of course, and, as a side effect, make the rest of us happy?

Yeah, GitHub is definitely interested in this. I may take a shot at it,
but I know David Barr (cc'd) is also interested in such things.

-Peff

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Using bitmaps to accelerate fetch and clone
  2012-09-27 17:20   ` Jeff King
@ 2012-09-27 17:35     ` Shawn Pearce
  2012-09-27 18:22       ` Jeff King
  2012-09-27 19:47     ` David Michael Barr
  2012-09-28  1:38     ` Nguyen Thai Ngoc Duy
  2 siblings, 1 reply; 20+ messages in thread
From: Shawn Pearce @ 2012-09-27 17:35 UTC (permalink / raw)
  To: Jeff King; +Cc: Nguyen Thai Ngoc Duy, git, Colby Ranger, David Barr

On Thu, Sep 27, 2012 at 10:20 AM, Jeff King <peff@peff.net> wrote:
> On Thu, Sep 27, 2012 at 07:17:42PM +0700, Nguyen Thai Ngoc Duy wrote:
>
>> > Operation                   Index V2               Index VE003
>> > Clone                       37530ms (524.06 MiB)     82ms (524.06 MiB)
>> > Fetch (1 commit back)          75ms                 107ms
>> > Fetch (10 commits back)       456ms (269.51 KiB)    341ms (265.19 KiB)
>> > Fetch (100 commits back)      449ms (269.91 KiB)    337ms (267.28 KiB)
>> > Fetch (1000 commits back)    2229ms ( 14.75 MiB)    189ms ( 14.42 MiB)
>> > Fetch (10000 commits back)   2177ms ( 16.30 MiB)    254ms ( 15.88 MiB)
>> > Fetch (100000 commits back) 14340ms (185.83 MiB)   1655ms (189.39 MiB)
>>
>> Beautiful. And curious, why do 100->1000 and 10000->10000 have such
>> big leaps in time (V2)?
>
> Agreed. I'm very excited about these numbers.
>
>> >   Defines the new E003 index format and the bit set
>> >   implementation logic.
>> [...]
>> It seems the bitmap data follows directly after regular index content.
>> I'd like to see some sort of extension mechanism like in
>> $GIT_DIR/index, so that we don't have to increase pack index version
>> often. What I have in mind is optional commit cache to speed up
>> rev-list and merge, which could be stored in pack index too.
>
> As I understand it, both the bitmaps and a commit cache are
> theoretically optional. That is, git can do the job without them, but
> they speed things up.

Yes, entirely true.

> If that is the case, do we need to bump the index
> version at all? Why not store a plain v2 index, and then store an
> additional file "pack-XXX.reachable" that contains the bitmaps and an
> independent version number.

This is the alternate version we considered internally. It was a bit
more work to define a 3rd file stream per pack in our backend storage
system, so we opted for a revision of an existing stream. We could
spend a bit more time and add a 3rd stream, keeping the index format
unmodified.

But we could have also done this with the CRC-32 table in index v2. We
didn't. If the data should almost always be there in order to provide
good service then we should really be embedding into the files.

I'm on the fence. I could go either way on this. E003 was just the
fastest way to prototype and start testing. We would probably be
equally happy with the 3rd stream.

> The sha1 in the filename makes sure that the reachability file is always
> in sync with the actual pack data and index.

Depending on the extension dependencies, you may need to also use the
trailer SHA-1 from the pack file itself, like the index does. E.g. the
bitmap data depends heavily on object order in the pack and is invalid
if you repack with a different ordering algorithm, or a different
delta set of results from delta compression.

>  Old readers won't know
> about the new file, and will ignore it. For new readers, if the file is
> there they can use it; if it's missing (or its version is not
> understood), they can fall back to the regular index.
>
> I haven't looked at the details of the format change yet. If it is
> purely an extra chunk of data at the end, this Just Works.

Yes, its just extra chunk on the end.

> If there are
> changes to the earlier parts of the pack (e.g., I seem to recall the
> commit cache idea wanted separate indices for each object type), we may
> still need a v3. But it would be nice if we could make those changes
> generic (e.g., just the separate indices, which might support many
> different enhancements), and then let the actual feature work happen in
> the separate files.

Yes. One downside is these separate streams aren't removed when you
run git repack. But this could be fixed by  a modification to git
repack to clean up additional extensions with the same pack base name.

>> Definitely :-). I have shown my interest in this topic before. So I
>> should probably say that I'm going to work on this on C Git, but
>> sllloooowwwly. As this benefits the server side greatly, perhaps a
>> GitHubber ;-) might want to work on this on C Git, for GitHub itself
>> of course, and, as a side effect, make the rest of us happy?
>
> Yeah, GitHub is definitely interested in this. I may take a shot at it,
> but I know David Barr (cc'd) is also interested in such things.

Building this in C is also dependent on having a good implementation
of EWAH compressed bitmaps available. So its a bit more work, but not
insurmountable by any means.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Using bitmaps to accelerate fetch and clone
  2012-09-27 17:35     ` Shawn Pearce
@ 2012-09-27 18:22       ` Jeff King
  2012-09-27 18:36         ` Shawn Pearce
  0 siblings, 1 reply; 20+ messages in thread
From: Jeff King @ 2012-09-27 18:22 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Nguyen Thai Ngoc Duy, git, Colby Ranger, David Barr

On Thu, Sep 27, 2012 at 10:35:47AM -0700, Shawn O. Pearce wrote:

> > If that is the case, do we need to bump the index
> > version at all? Why not store a plain v2 index, and then store an
> > additional file "pack-XXX.reachable" that contains the bitmaps and an
> > independent version number.
> 
> This is the alternate version we considered internally. It was a bit
> more work to define a 3rd file stream per pack in our backend storage
> system, so we opted for a revision of an existing stream. We could
> spend a bit more time and add a 3rd stream, keeping the index format
> unmodified.

I'd rather make the choice that provides the best user experience, even
if it is a bit more code refactoring.

> But we could have also done this with the CRC-32 table in index v2. We
> didn't. If the data should almost always be there in order to provide
> good service then we should really be embedding into the files.

Yes, although there were other changes in v2, also (e.g., the fanout to
handle larger packfiles).  Bumping the version also made the transition
take a lot longer. We introduced the reading and writing code, but then
couldn't flip the default for quite a while. For big server providers
this is not as big a deal (we know which versions of git we will use,
and are OK with flipping a config bit). But it's one more tuning thing
to deal with for small or single-person servers.

I think clients will also want it. If we can make "git rev-list
--objects --all" faster (which this should be able to do), we can speed
up "git prune", which in turn is by far the slowest part of "git gc
--auto", since in the typical case we are only incrementally packing.

I also like that the general technique can be reused easily. We've
talked about a generation-number cache in the past. That would fit this
model as well. Removing a backwards-compatibility barrier makes it a lot
easier to experiment with these sorts of things.

> > The sha1 in the filename makes sure that the reachability file is always
> > in sync with the actual pack data and index.
> 
> Depending on the extension dependencies, you may need to also use the
> trailer SHA-1 from the pack file itself, like the index does. E.g. the
> bitmap data depends heavily on object order in the pack and is invalid
> if you repack with a different ordering algorithm, or a different
> delta set of results from delta compression.

Interesting. I would have assumed it depended on order in the index. But
like I said, I haven't looked. I think you are still OK, though, because
the filename comes from the sha1 over the index file, which in turn
includes the sha1 over the packfile. Thus any change in the packfile
would give you a new pack and index name.

> Yes. One downside is these separate streams aren't removed when you
> run git repack. But this could be fixed by  a modification to git
> repack to clean up additional extensions with the same pack base name.

I don't think that's a big deal. We already do it with ".keep" files. If
you repack with an older version of git, you may have a stale
supplementary file wasting space. But that's OK. The next time you gc
with a newer version of git, we could detect and clean up such stale
files (we already do so for tmp_pack_* files).

-Peff

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Using bitmaps to accelerate fetch and clone
  2012-09-27 18:22       ` Jeff King
@ 2012-09-27 18:36         ` Shawn Pearce
  2012-09-27 18:52           ` Jeff King
  0 siblings, 1 reply; 20+ messages in thread
From: Shawn Pearce @ 2012-09-27 18:36 UTC (permalink / raw)
  To: Jeff King; +Cc: Nguyen Thai Ngoc Duy, git, Colby Ranger, David Barr

On Thu, Sep 27, 2012 at 11:22 AM, Jeff King <peff@peff.net> wrote:
>
> I think clients will also want it. If we can make "git rev-list
> --objects --all" faster (which this should be able to do), we can speed
> up "git prune", which in turn is by far the slowest part of "git gc
> --auto", since in the typical case we are only incrementally packing.

Yes, the bitmap can also accelerate prune. We didn't implement this
but it is a trivial use of the existing bitmap.

>> > The sha1 in the filename makes sure that the reachability file is always
>> > in sync with the actual pack data and index.
>>
>> Depending on the extension dependencies, you may need to also use the
>> trailer SHA-1 from the pack file itself, like the index does. E.g. the
>> bitmap data depends heavily on object order in the pack and is invalid
>> if you repack with a different ordering algorithm, or a different
>> delta set of results from delta compression.
>
> Interesting. I would have assumed it depended on order in the index.

No. We tried that. Assigning bits by order in index (aka order of
SHA-1s sorted) results in horrible compression of the bitmap itself
because of the uniform distribution of SHA-1. Encoding instead by pack
order gets us really good bitmap compression, because object graph
traversal order tends to take reachability into account. So we see
long contiguous runs of 1s and get good compression. Sorting by SHA-1
just makes the space into swiss cheese.

> I think you are still OK, though, because
> the filename comes from the sha1 over the index file, which in turn
> includes the sha1 over the packfile. Thus any change in the packfile
> would give you a new pack and index name.

No. The pack file name is composed from the SHA-1 of the sorted SHA-1s
in the pack. Any change in compression settings or delta windows or
even just random scheduling variations when repacking can cause
offsets to slide, even if the set of objects being repacked has not
differed. The resulting pack and index will have the same file names
(as its the same set of objects), but the offset information and
ordering is now different.

Naming a pack after a SHA-1 is a fun feature. Naming it after the
SHA-1 of the object list was a mistake. It should have been named
after the SHA-1 in the trailer of the file, so that any single bit
modified within the pack stream itself would have caused a different
name to be used on the filesystem. But alas this is water under the
bridge and not likely to change anytime soon.

>> Yes. One downside is these separate streams aren't removed when you
>> run git repack. But this could be fixed by  a modification to git
>> repack to clean up additional extensions with the same pack base name.
>
> I don't think that's a big deal. We already do it with ".keep" files. If
> you repack with an older version of git, you may have a stale
> supplementary file wasting space. But that's OK. The next time you gc
> with a newer version of git, we could detect and clean up such stale
> files (we already do so for tmp_pack_* files).

Yes, obviously.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Using bitmaps to accelerate fetch and clone
  2012-09-27 18:36         ` Shawn Pearce
@ 2012-09-27 18:52           ` Jeff King
  2012-09-27 20:18             ` Jeff King
  0 siblings, 1 reply; 20+ messages in thread
From: Jeff King @ 2012-09-27 18:52 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Nguyen Thai Ngoc Duy, git, Colby Ranger, David Barr

On Thu, Sep 27, 2012 at 11:36:19AM -0700, Shawn O. Pearce wrote:

> > Interesting. I would have assumed it depended on order in the index.
> 
> No. We tried that. Assigning bits by order in index (aka order of
> SHA-1s sorted) results in horrible compression of the bitmap itself
> because of the uniform distribution of SHA-1. Encoding instead by pack
> order gets us really good bitmap compression, because object graph
> traversal order tends to take reachability into account. So we see
> long contiguous runs of 1s and get good compression. Sorting by SHA-1
> just makes the space into swiss cheese.

Right, that makes a lot of sense.

> > I think you are still OK, though, because
> > the filename comes from the sha1 over the index file, which in turn
> > includes the sha1 over the packfile. Thus any change in the packfile
> > would give you a new pack and index name.
> 
> No. The pack file name is composed from the SHA-1 of the sorted SHA-1s
> in the pack. Any change in compression settings or delta windows or
> even just random scheduling variations when repacking can cause
> offsets to slide, even if the set of objects being repacked has not
> differed. The resulting pack and index will have the same file names
> (as its the same set of objects), but the offset information and
> ordering is now different.

Are you sure? The trailer is computed over the sha1 of the actual pack
data (ordering, delta choices, and all), and is computed and written to
the packfile via sha1close (see pack-objects.c, ll. 753-763). That
trailer sha1 is fed into finish_tmp_packfile (l. 793).  That function
feeds it to write_idx_file, which starts a new sha1 computation that
includes the sorted sha1 list and other index info. But before we
sha1close that computation, we write the _original_ trailer sha1, adding
it to the new sha1 calculation. See pack-write.c, ll. 178-180.

And then that sha1 gets returned to finish_tmp_packfile, which uses it
to name the resulting files.

Am I reading the code wrong?

-Peff

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Using bitmaps to accelerate fetch and clone
  2012-09-27 17:20   ` Jeff King
  2012-09-27 17:35     ` Shawn Pearce
@ 2012-09-27 19:47     ` David Michael Barr
  2012-09-28  1:38     ` Nguyen Thai Ngoc Duy
  2 siblings, 0 replies; 20+ messages in thread
From: David Michael Barr @ 2012-09-27 19:47 UTC (permalink / raw)
  To: Jeff King; +Cc: Nguyen Thai Ngoc Duy, Shawn Pearce, git, Colby Ranger

Hi all,

On Fri, Sep 28, 2012 at 3:20 AM, Jeff King <peff@peff.net> wrote:
> On Thu, Sep 27, 2012 at 07:17:42PM +0700, Nguyen Thai Ngoc Duy wrote:
>
>> > Operation                   Index V2               Index VE003
>> > Clone                       37530ms (524.06 MiB)     82ms (524.06 MiB)
>> > Fetch (1 commit back)          75ms                 107ms
>> > Fetch (10 commits back)       456ms (269.51 KiB)    341ms (265.19 KiB)
>> > Fetch (100 commits back)      449ms (269.91 KiB)    337ms (267.28 KiB)
>> > Fetch (1000 commits back)    2229ms ( 14.75 MiB)    189ms ( 14.42 MiB)
>> > Fetch (10000 commits back)   2177ms ( 16.30 MiB)    254ms ( 15.88 MiB)
>> > Fetch (100000 commits back) 14340ms (185.83 MiB)   1655ms (189.39 MiB)
>>
>> Beautiful. And curious, why do 100->1000 and 10000->10000 have such
>> big leaps in time (V2)?
>
> Agreed. I'm very excited about these numbers.

+1

>> Definitely :-). I have shown my interest in this topic before. So I
>> should probably say that I'm going to work on this on C Git, but
>> sllloooowwwly. As this benefits the server side greatly, perhaps a
>> GitHubber ;-) might want to work on this on C Git, for GitHub itself
>> of course, and, as a side effect, make the rest of us happy?
>
> Yeah, GitHub is definitely interested in this. I may take a shot at it,
> but I know David Barr (cc'd) is also interested in such things.

Yeah, I'm definitely interested, I love this stuff.

--
David Michael Barr

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Using bitmaps to accelerate fetch and clone
  2012-09-27 18:52           ` Jeff King
@ 2012-09-27 20:18             ` Jeff King
  2012-09-27 21:33               ` Junio C Hamano
  0 siblings, 1 reply; 20+ messages in thread
From: Jeff King @ 2012-09-27 20:18 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Nguyen Thai Ngoc Duy, git, Colby Ranger, David Barr

On Thu, Sep 27, 2012 at 02:52:29PM -0400, Jeff King wrote:

> > No. The pack file name is composed from the SHA-1 of the sorted SHA-1s
> > in the pack. Any change in compression settings or delta windows or
> > even just random scheduling variations when repacking can cause
> > offsets to slide, even if the set of objects being repacked has not
> > differed. The resulting pack and index will have the same file names
> > (as its the same set of objects), but the offset information and
> > ordering is now different.
> 
> Are you sure? The trailer is computed over the sha1 of the actual pack
> data (ordering, delta choices, and all), and is computed and written to
> the packfile via sha1close (see pack-objects.c, ll. 753-763). That
> trailer sha1 is fed into finish_tmp_packfile (l. 793).  That function
> feeds it to write_idx_file, which starts a new sha1 computation that
> includes the sorted sha1 list and other index info. But before we
> sha1close that computation, we write the _original_ trailer sha1, adding
> it to the new sha1 calculation. See pack-write.c, ll. 178-180.
> 
> And then that sha1 gets returned to finish_tmp_packfile, which uses it
> to name the resulting files.
> 
> Am I reading the code wrong?

And the answer is...yes. I'm blind.

The final bit of code in write_idx_file is:

        sha1write(f, sha1, 20);
        sha1close(f, NULL, ((opts->flags & WRITE_IDX_VERIFY)
                            ? CSUM_CLOSE : CSUM_FSYNC));
        git_SHA1_Final(sha1, &ctx);

So we write the trailer, but the sha1 we pull out is _not_ the sha1 over
the index format. It is from "ctx", not "f"; and hte former is from the
object list. Just like you said. :)

So yeah, we would want to put the pack trailer sha1 into the
supplementary index file, and check that it matches when we open it.
It's a slight annoyance, but it's O(1).

Anything which rewrote the pack and index would also want to rewrite
these supplementary files. So the worst case would be:

  1. Pack with a new version which builds the supplementary file.

  2. Repack with an old version which generates a pack with identical
     objects, but different ordering. It does not regenerate the
     supplementary file, because it does not know about it.

  3. Try to read with newer git.

Without the extra trailer check, we get a wrong answer. With the check,
we notice that the supplementary file is bogus, and fallback to the slow
path. Which I think is OK, considering that this is a reasonably
unlikely scenario to come up often (and it is no slower than it would be
if you generated a _new_ packfile in step 2).

-Peff

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Using bitmaps to accelerate fetch and clone
  2012-09-27 20:18             ` Jeff King
@ 2012-09-27 21:33               ` Junio C Hamano
  2012-09-27 21:36                 ` Jeff King
  0 siblings, 1 reply; 20+ messages in thread
From: Junio C Hamano @ 2012-09-27 21:33 UTC (permalink / raw)
  To: Jeff King
  Cc: Shawn Pearce, Nguyen Thai Ngoc Duy, git, Colby Ranger, David Barr

Jeff King <peff@peff.net> writes:

> So yeah, we would want to put the pack trailer sha1 into the
> supplementary index file, and check that it matches when we open it.
> It's a slight annoyance, but it's O(1).

Yes.  If I am not mistaken, that is exactly how an .idx file makes
sure that it describes the matching .pack file (it has packfile
checksum in its trailer).  Otherwise you can repack the same set of
objects into a new .pack file and make existing .idx very confused.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Using bitmaps to accelerate fetch and clone
  2012-09-27 21:33               ` Junio C Hamano
@ 2012-09-27 21:36                 ` Jeff King
  0 siblings, 0 replies; 20+ messages in thread
From: Jeff King @ 2012-09-27 21:36 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Shawn Pearce, Nguyen Thai Ngoc Duy, git, Colby Ranger, David Barr

On Thu, Sep 27, 2012 at 02:33:01PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > So yeah, we would want to put the pack trailer sha1 into the
> > supplementary index file, and check that it matches when we open it.
> > It's a slight annoyance, but it's O(1).
> 
> Yes.  If I am not mistaken, that is exactly how an .idx file makes
> sure that it describes the matching .pack file (it has packfile
> checksum in its trailer).  Otherwise you can repack the same set of
> objects into a new .pack file and make existing .idx very confused.

Yeah. In theory you wouldn't name the new packfile into place without
also generating a new index for it. But even if you do it right, there's
a race condition, and checking the trailer sha1 at least lets us know
that they don't match (I assume we just reject the index, then; I guess
this can result in an operation failing, but in practice it doesn't
really happen, as we don't bother packing unless there are actually new
objects).

-Peff

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Using bitmaps to accelerate fetch and clone
  2012-09-27 14:33   ` Shawn Pearce
@ 2012-09-28  1:37     ` Nguyen Thai Ngoc Duy
  0 siblings, 0 replies; 20+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-09-28  1:37 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git, Colby Ranger, Jeff King

On Thu, Sep 27, 2012 at 9:33 PM, Shawn Pearce <spearce@spearce.org> wrote:
>> I'd like to see some sort of extension mechanism like in
>> $GIT_DIR/index, so that we don't have to increase pack index version
>> often.
>
> This might be worthwhile. I dislike the way $GIT_DIR/index encodes
> extensions. Forcing an extension to fully materialize itself to
> determine its length so the length can be placed before the data is
> painful to work with when writing the file out to disk. I would prefer
> writing an index catalog at the trailer of the file. We already
> require random access to the index file, so its possible for a reader
> to read a fixed size trailer record that has the 2 SHA-1s we normally
> end an index with, and an extension catalog footer that has a length
> and CRC-32 of the catalog. The catalog would immediately appear before
> the footer, so a reader can find the start of the extension catalog by
> subtracting from the end of the file the catalog length and the file
> footer and catalog footer lengths. The catalog can then supply a
> starting offset for each extension section, and writers don't need to
> predict in advance how much data they need to store. Readers trying to
> use extensions aren't really hurt, Git already randomly seeks to read
> the tail of an index file to compare the pack SHA-1 before assuming
> the index is valid.

Yeah, that's exactly what I had in mind. But perhaps a separate file
(or files?) may be better. On that point, should all extensions be in
one new extra file, or one extension per file? I prefer all extensions
in one file, so we only need a single additional stat() for extension
check instead of probing for the entire pack-XXX.* range. In that
case, the catalog trailer idea still applies.
-- 
Duy

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Using bitmaps to accelerate fetch and clone
  2012-09-27 17:20   ` Jeff King
  2012-09-27 17:35     ` Shawn Pearce
  2012-09-27 19:47     ` David Michael Barr
@ 2012-09-28  1:38     ` Nguyen Thai Ngoc Duy
  2 siblings, 0 replies; 20+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-09-28  1:38 UTC (permalink / raw)
  To: Jeff King; +Cc: Shawn Pearce, git, Colby Ranger, David Barr

On Fri, Sep 28, 2012 at 12:20 AM, Jeff King <peff@peff.net> wrote:
>> Definitely :-). I have shown my interest in this topic before. So I
>> should probably say that I'm going to work on this on C Git, but
>> sllloooowwwly. As this benefits the server side greatly, perhaps a
>> GitHubber ;-) might want to work on this on C Git, for GitHub itself
>> of course, and, as a side effect, make the rest of us happy?
>
> Yeah, GitHub is definitely interested in this. I may take a shot at it,
> but I know David Barr (cc'd) is also interested in such things.

Great. Now I can just sit back and enjoy :)
-- 
Duy

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Using bitmaps to accelerate fetch and clone
  2012-09-27  0:47 Using bitmaps to accelerate fetch and clone Shawn Pearce
  2012-09-27 12:17 ` Nguyen Thai Ngoc Duy
@ 2012-09-28 12:00 ` Nguyen Thai Ngoc Duy
  2012-10-01  1:07   ` Shawn Pearce
  1 sibling, 1 reply; 20+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-09-28 12:00 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git, Colby Ranger

On Thu, Sep 27, 2012 at 7:47 AM, Shawn Pearce <spearce@spearce.org> wrote:
> * https://git.eclipse.org/r/7939
>
>   Defines the new E003 index format and the bit set
>   implementation logic.

Quote from the patch's message:

"Currently, the new index format can only be used with pack files that
contain a complete closure of the object graph e.g. the result of a
garbage collection."

You mentioned this before in your idea mail a while back. I wonder if
it's worth storing bitmaps for all packs, not just the self contained
ones. We could have one leaf bitmap per pack to mark all leaves where
we'll need to traverse outside the pack. Commit leaves are the best as
we can potentially reuse commit bitmaps from other packs. Tree leaves
will be followed in the normal/slow way.

For connectivity check, fewer trees/commits to deflate/parse means
less time. And connectivity check is done on every git-fetch (I
suspect the other end of a push also has the same check). It's not
unusual for me to fetch some repos once every few months so these
incomplete packs could be quite big and it'll take some time for gc
--auto to kick in (of course we could adjust gc --auto to start based
on the number of non-bitmapped objects, in additional to number of
packs).
-- 
Duy

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Using bitmaps to accelerate fetch and clone
  2012-09-28 12:00 ` Nguyen Thai Ngoc Duy
@ 2012-10-01  1:07   ` Shawn Pearce
  2012-10-01  1:59     ` Nguyen Thai Ngoc Duy
  0 siblings, 1 reply; 20+ messages in thread
From: Shawn Pearce @ 2012-10-01  1:07 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: git, Colby Ranger

On Fri, Sep 28, 2012 at 5:00 AM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
> On Thu, Sep 27, 2012 at 7:47 AM, Shawn Pearce <spearce@spearce.org> wrote:
>> * https://git.eclipse.org/r/7939
>>
>>   Defines the new E003 index format and the bit set
>>   implementation logic.
>
> Quote from the patch's message:
>
> "Currently, the new index format can only be used with pack files that
> contain a complete closure of the object graph e.g. the result of a
> garbage collection."
>
> You mentioned this before in your idea mail a while back. I wonder if
> it's worth storing bitmaps for all packs, not just the self contained
> ones.

Colby and I started talking about this late last week too. It seems
feasible, but does add a bit more complexity to the algorithm used
when enumerating.

> We could have one leaf bitmap per pack to mark all leaves where
> we'll need to traverse outside the pack. Commit leaves are the best as
> we can potentially reuse commit bitmaps from other packs. Tree leaves
> will be followed in the normal/slow way.

Yes, Colby proposed the same idea.

We cannot make a "leaf bitmap per pack". The leaf SHA-1s are not in
the pack and therefore cannot have a bit assigned to them. We could
add a new section that listed the unique leaf SHA-1s in their own
private table, and then assigned per bitmap a leaf bitmap that set to
1 for any leaf object that is outside of the pack. This would probably
take up the least amount of disk space, vs. storing the list of leaf
SHA-1s after each bitmap. If a pack has only 1 bitmap (e.g. it is a
small chunk of recent history) there is really no difference in disk
usage. If the pack has 2 or 3 commit bitmaps along a string of
approximately 300 commits, you will have an identical leaf set for
each of those bitmaps so using a single leaf SHA-1 table would support
reusing the redundant leaf pointers.

One of the problems we have seen with these non-closed packs is they
waste an incredible amount of disk. As an example, do a `git fetch`
from Linus tree when you are more than a few weeks behind. You will
get back more than 100 objects, so the thin pack will be saved and
completed with additional base objects. That thin pack will go from a
few MiBs to more than 40 MiB of data on disk, thanks to the redundant
base objects being appended to the end of the pack. For most uses
these packs are best eliminated and replaced with a new complete
closure pack. The redundant base objects disappear, and Git stops
wasting a huge amount of disk.

> For connectivity check, fewer trees/commits to deflate/parse means
> less time. And connectivity check is done on every git-fetch (I
> suspect the other end of a push also has the same check). It's not
> unusual for me to fetch some repos once every few months so these
> incomplete packs could be quite big and it'll take some time for gc
> --auto to kick in (of course we could adjust gc --auto to start based
> on the number of non-bitmapped objects, in additional to number of
> packs).

Yes, of course.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Using bitmaps to accelerate fetch and clone
  2012-10-01  1:07   ` Shawn Pearce
@ 2012-10-01  1:59     ` Nguyen Thai Ngoc Duy
  2012-10-01  2:26       ` Shawn Pearce
  0 siblings, 1 reply; 20+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-10-01  1:59 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git, Colby Ranger

On Mon, Oct 1, 2012 at 8:07 AM, Shawn Pearce <spearce@spearce.org> wrote:
>> You mentioned this before in your idea mail a while back. I wonder if
>> it's worth storing bitmaps for all packs, not just the self contained
>> ones.
>
> Colby and I started talking about this late last week too. It seems
> feasible, but does add a bit more complexity to the algorithm used
> when enumerating.

Yes. Though at server side, if it's too much trouble, the packer can
just ignore open packs and use only closed ones.

>> We could have one leaf bitmap per pack to mark all leaves where
>> we'll need to traverse outside the pack. Commit leaves are the best as
>> we can potentially reuse commit bitmaps from other packs. Tree leaves
>> will be followed in the normal/slow way.
>
> Yes, Colby proposed the same idea.
>
> We cannot make a "leaf bitmap per pack". The leaf SHA-1s are not in
> the pack and therefore cannot have a bit assigned to them.

We could mark all objects _in_ the pack that lead to an external
object. That's what I meant by leaves. We need to parse the leaves to
find out actual SHA-1s that are outside the pack. Or we could go with
your approach below too.

> We could
> add a new section that listed the unique leaf SHA-1s in their own
> private table, and then assigned per bitmap a leaf bitmap that set to
> 1 for any leaf object that is outside of the pack.


> One of the problems we have seen with these non-closed packs is they
> waste an incredible amount of disk. As an example, do a `git fetch`
> from Linus tree when you are more than a few weeks behind. You will
> get back more than 100 objects, so the thin pack will be saved and
> completed with additional base objects. That thin pack will go from a
> few MiBs to more than 40 MiB of data on disk, thanks to the redundant
> base objects being appended to the end of the pack. For most uses
> these packs are best eliminated and replaced with a new complete
> closure pack. The redundant base objects disappear, and Git stops
> wasting a huge amount of disk.

That's probably a different problem. I appreciate disk savings but I
would not want to wait a few more minutes for repack on every
git-fetch. But if this bitmap thing makes repack much faster than
currently, repacking after every git-fetch may become practical.
-- 
Duy

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Using bitmaps to accelerate fetch and clone
  2012-10-01  1:59     ` Nguyen Thai Ngoc Duy
@ 2012-10-01  2:26       ` Shawn Pearce
  2012-10-01 12:48         ` Nguyen Thai Ngoc Duy
  0 siblings, 1 reply; 20+ messages in thread
From: Shawn Pearce @ 2012-10-01  2:26 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: git, Colby Ranger

On Sun, Sep 30, 2012 at 6:59 PM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
> On Mon, Oct 1, 2012 at 8:07 AM, Shawn Pearce <spearce@spearce.org> wrote:
>>> You mentioned this before in your idea mail a while back. I wonder if
>>> it's worth storing bitmaps for all packs, not just the self contained
>>> ones.
>>
>> Colby and I started talking about this late last week too. It seems
>> feasible, but does add a bit more complexity to the algorithm used
>> when enumerating.
>
> Yes. Though at server side, if it's too much trouble, the packer can
> just ignore open packs and use only closed ones.

Its not trouble once the code is written. We were just trying to be
expedient in producing a prototype that we could start to deploy on
real-world workloads. Enumerating the non-closed-pack objects using
the classical implementation is still slow and consumes CPU time at
the server, using partial bitmaps should eliminate most of that CPU
usage and reduce server loads.

One of the more troublesome problems is building the bitmaps is
difficult from a streaming processor like index-pack. You need the
reachability graph for all objects, which is not currently produced
when moving data over the wire. We do an fsck after-the-fact to verify
we didn't get corrupt data, but this is optional and currently after
the pack is stored. We need to refactor this code to run earlier to
get the bitmap built. If we take Peff's idea and put the bitmap data
into a new stream rather than the pack-*.idx file we can produce the
bitmap at the same time as the fsck check, which is probably a simpler
change.

>>> We could have one leaf bitmap per pack to mark all leaves where
>>> we'll need to traverse outside the pack. Commit leaves are the best as
>>> we can potentially reuse commit bitmaps from other packs. Tree leaves
>>> will be followed in the normal/slow way.
>>
>> Yes, Colby proposed the same idea.
>>
>> We cannot make a "leaf bitmap per pack". The leaf SHA-1s are not in
>> the pack and therefore cannot have a bit assigned to them.
>
> We could mark all objects _in_ the pack that lead to an external
> object. That's what I meant by leaves. We need to parse the leaves to
> find out actual SHA-1s that are outside the pack.

OK, yes, I was being pretty stupid. Of course we could mark the
objects _in_ the pack as leaves and parse the leaves when we need to
find the external pointers.

> Or we could go with
> your approach below too.

I actually think my approach may be better. The root tree of a leaf
commit would need to be scanned every time to identify trees that are
reachable from this leaf commit, but are not reachable in the ancestry
of the leaf's parents. This turns out to be rather expensive to
compute every time a server or an fsck algorithm considers the partial
pack. Its also somewhat uncommon for such an object to exist, it has
to happen by an unrelated side branch introducing the same object and
the creator of the pack seeing that object already exist and not
including it in the pack.

Defining the pack's "edge" as a list of SHA-1s not in this pack but
known to be required allows us to compute that leaf root tree
reachability once, and never consider parsing it again. Which saves
servers that host frequently accessed Git repositories but aren't
repacking all of the time. (FWIW we repack frequently, I hear GitHub
does too, because a fully repacked repository serves clients better
than a partially packed one.)

>> We could
>> add a new section that listed the unique leaf SHA-1s in their own
>> private table, and then assigned per bitmap a leaf bitmap that set to
>> 1 for any leaf object that is outside of the pack.
>
>> One of the problems we have seen with these non-closed packs is they
>> waste an incredible amount of disk. As an example, do a `git fetch`
>> from Linus tree when you are more than a few weeks behind. You will
>> get back more than 100 objects, so the thin pack will be saved and
>> completed with additional base objects. That thin pack will go from a
>> few MiBs to more than 40 MiB of data on disk, thanks to the redundant
>> base objects being appended to the end of the pack. For most uses
>> these packs are best eliminated and replaced with a new complete
>> closure pack. The redundant base objects disappear, and Git stops
>> wasting a huge amount of disk.
>
> That's probably a different problem. I appreciate disk savings but I
> would not want to wait a few more minutes for repack on every
> git-fetch. But if this bitmap thing makes repack much faster than
> currently, repacking after every git-fetch may become practical.

I know the "Counting objects" phase of a repack is very expensive, but
so too is the time required to do the IO in and out of every object in
the repository. Spinning disks only transfer data so fast. Assuming
the drive can move 50 MiB/s each way, a 500 MiB repository will take
10 seconds just to write back out the new pack. You aren't ever going
to want to wait for repacking during git-fetch.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Using bitmaps to accelerate fetch and clone
  2012-10-01  2:26       ` Shawn Pearce
@ 2012-10-01 12:48         ` Nguyen Thai Ngoc Duy
  2012-10-02 15:00           ` Shawn Pearce
  0 siblings, 1 reply; 20+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-10-01 12:48 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git, Colby Ranger

On Mon, Oct 1, 2012 at 9:26 AM, Shawn Pearce <spearce@spearce.org> wrote:
> One of the more troublesome problems is building the bitmaps is
> difficult from a streaming processor like index-pack. You need the
> reachability graph for all objects, which is not currently produced
> when moving data over the wire. We do an fsck after-the-fact to verify
> we didn't get corrupt data, but this is optional and currently after
> the pack is stored. We need to refactor this code to run earlier to
> get the bitmap built. If we take Peff's idea and put the bitmap data
> into a new stream rather than the pack-*.idx file we can produce the
> bitmap at the same time as the fsck check, which is probably a simpler
> change.

If we need to go through the whole pack, not random sha-1 access, then
index-pack's traversal is more efficient. I have some patches that
remove pack-check.c and make fsck use index-pack to walk through
packs. It takes much less time. But rev walk for building bitmaps
probably does not fit this style of traversal because rev walk does
not align with delta walk.

> Defining the pack's "edge" as a list of SHA-1s not in this pack but
> known to be required allows us to compute that leaf root tree
> reachability once, and never consider parsing it again. Which saves
> servers that host frequently accessed Git repositories but aren't
> repacking all of the time. (FWIW we repack frequently, I hear GitHub
> does too, because a fully repacked repository serves clients better
> than a partially packed one.)

Probably off topic. Does saving a list of missing bases in the pack
index help storing thin packs directly? I may be missing some points
because I don't see why thin packs cannot be stored on disk in the
first place.
-- 
Duy

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Using bitmaps to accelerate fetch and clone
  2012-10-01 12:48         ` Nguyen Thai Ngoc Duy
@ 2012-10-02 15:00           ` Shawn Pearce
  0 siblings, 0 replies; 20+ messages in thread
From: Shawn Pearce @ 2012-10-02 15:00 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: git, Colby Ranger

On Mon, Oct 1, 2012 at 5:48 AM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
> Probably off topic. Does saving a list of missing bases in the pack
> index help storing thin packs directly? I may be missing some points
> because I don't see why thin packs cannot be stored on disk in the
> first place.

Packs are supposed to be completely self contained. We require all
delta bases to be in the pack so that we can always recover the
objects stored within it, even if there is no other pack available.
This simplifies things like git gc such that they don't need to worry
about retaining delta bases for other packs, or avoiding delta chain
cycles between packs. I have managed to create a A->B->A delta chain
once that made the repository corrupt as soon as the last non-delta
copy of A as removed. Both A and B became unreadable as A needed B
which needed A which needed B... :-(

To keep things simple and easily verified as correct, we don't allow this.

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2012-10-02 15:01 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-09-27  0:47 Using bitmaps to accelerate fetch and clone Shawn Pearce
2012-09-27 12:17 ` Nguyen Thai Ngoc Duy
2012-09-27 14:33   ` Shawn Pearce
2012-09-28  1:37     ` Nguyen Thai Ngoc Duy
2012-09-27 17:20   ` Jeff King
2012-09-27 17:35     ` Shawn Pearce
2012-09-27 18:22       ` Jeff King
2012-09-27 18:36         ` Shawn Pearce
2012-09-27 18:52           ` Jeff King
2012-09-27 20:18             ` Jeff King
2012-09-27 21:33               ` Junio C Hamano
2012-09-27 21:36                 ` Jeff King
2012-09-27 19:47     ` David Michael Barr
2012-09-28  1:38     ` Nguyen Thai Ngoc Duy
2012-09-28 12:00 ` Nguyen Thai Ngoc Duy
2012-10-01  1:07   ` Shawn Pearce
2012-10-01  1:59     ` Nguyen Thai Ngoc Duy
2012-10-01  2:26       ` Shawn Pearce
2012-10-01 12:48         ` Nguyen Thai Ngoc Duy
2012-10-02 15:00           ` Shawn Pearce

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).