Re: Using bitmaps to accelerate fetch and clone

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

From: Shawn Pearce <spearce@spearce.org>
To: Nguyen Thai Ngoc Duy <pclouds@gmail.com>
Cc: git <git@vger.kernel.org>, Colby Ranger <cranger@google.com>
Subject: Re: Using bitmaps to accelerate fetch and clone
Date: Thu, 27 Sep 2012 07:33:30 -0700	[thread overview]
Message-ID: <CAJo=hJt0PdpDT5ROUSfZ80Zh2ep=r5Sg1BS=v7Ve-djydHhp-w@mail.gmail.com> (raw)
In-Reply-To: <CACsJy8D0vkyEArNChXE0igUkanH6PwjmPitq22a9sudfmWF4kA@mail.gmail.com>

On Thu, Sep 27, 2012 at 5:17 AM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
> On Thu, Sep 27, 2012 at 7:47 AM, Shawn Pearce <spearce@spearce.org> wrote:
>> Google has published a series of patches (see links below) to JGit to
>
> Should discussions about this series happen in here, jgit mailing or
> gerrit? I just want to make sure I'll discuss it at the right place.

I think we should have a concrete discussion about the implementation
in Java on the Gerrit changes (e.g.  "you should fix this comment it
doesn't sufficiently describe the method"), and a discussion about the
file format and algorithm here where everyone can contribute.

The format is named E003 because its still experimental. Its not set
in stone. Future iterations might be named E004, etc. If we get
something final we will look to rename it to just version 3.

>> improve fetch and clone performance by adding compressed bitmaps to
>> the pack-*.idx structure.
>>
>> Operation                   Index V2               Index VE003
>> Clone                       37530ms (524.06 MiB)     82ms (524.06 MiB)
>> Fetch (1 commit back)          75ms                 107ms
>> Fetch (10 commits back)       456ms (269.51 KiB)    341ms (265.19 KiB)
>> Fetch (100 commits back)      449ms (269.91 KiB)    337ms (267.28 KiB)
>> Fetch (1000 commits back)    2229ms ( 14.75 MiB)    189ms ( 14.42 MiB)
>> Fetch (10000 commits back)   2177ms ( 16.30 MiB)    254ms ( 15.88 MiB)
>> Fetch (100000 commits back) 14340ms (185.83 MiB)   1655ms (189.39 MiB)
>
> Beautiful. And curious, why do 100->1000 and 10000->10000 have such
> big leaps in time (V2)?

We didn't investigate this. Colby just reported on what the current
JGit code does. I suspect there is something specific about the shape
of the history graph for this repository combined with the way JGit
wrote the pack file that caused these sorts of large increases.
Perhaps they could be smaller. Not sure I care anymore when the E003
approach gives us such low times. :-)

>> The basic gist of the implementation is a bitmap has a 1 bit set for
>> each object that is reachable from the commit the bitmap is associated
>> with. An index file may have a unique bitmap for hundreds of commits
>> in the corresponding pack file. The set of objects to send is
>> performed by doing a simple computation:
>>
>>   OR (all want lines) AND NOT OR (all have lines)
>>
>> There are two key patches in the series that implement the file format
>> change and logic involved:
>>
>> * https://git.eclipse.org/r/7939
>>
>>   Defines the new E003 index format and the bit set
>>   implementation logic.
>
> I suppose the index format is not set in stone yet?

For E003, yes, we already have some data encoded with it. But as a
file format change, no. We are willing to iterate on this if there is
tangible benefit displayed by an alternative. Future versions would
have to be E004 or some other new version number to disambiguate from
E003.

> My java-foo is
> rusty and I'm not familiar with jgit, so I more likely read things
> wrong.

Or maybe not. :-)

> It seems the bitmap data follows directly after regular index content.

Correct. It is after the regular content, but before the 2 SHA-1 trailers.

> I'd like to see some sort of extension mechanism like in
> $GIT_DIR/index, so that we don't have to increase pack index version
> often.

This might be worthwhile. I dislike the way $GIT_DIR/index encodes
extensions. Forcing an extension to fully materialize itself to
determine its length so the length can be placed before the data is
painful to work with when writing the file out to disk. I would prefer
writing an index catalog at the trailer of the file. We already
require random access to the index file, so its possible for a reader
to read a fixed size trailer record that has the 2 SHA-1s we normally
end an index with, and an extension catalog footer that has a length
and CRC-32 of the catalog. The catalog would immediately appear before
the footer, so a reader can find the start of the extension catalog by
subtracting from the end of the file the catalog length and the file
footer and catalog footer lengths. The catalog can then supply a
starting offset for each extension section, and writers don't need to
predict in advance how much data they need to store. Readers trying to
use extensions aren't really hurt, Git already randomly seeks to read
the tail of an index file to compare the pack SHA-1 before assuming
the index is valid.

> What I have in mind is optional commit cache to speed up
> rev-list and merge, which could be stored in pack index too.

We should also look into using the commit bitmap data to feed the
rev-list traversal. The bitmaps can tell us which objects are commits,
and their rough ordering given the packing rules. That may be
sufficient to feed the walker without having a priority queue.

> In PackIndexVE003 class
>
> +               // Read the bitmaps for the Git types
> +               SimpleDataInput dataInput = new SimpleDataInput(fd);
> +               this.commits = readBitmap(dataInput);
> +               this.trees = readBitmap(dataInput);
> +               this.blobs = readBitmap(dataInput);
> +               this.tags = readBitmap(dataInput);
>
> Am I correct in saying that you have four different on-disk bitmaps,
> one for each object type? If so, for compression efficient reasons?

Yes. The packer needs to know the type of each object its going to
pack. Instead of reading this from the individual object headers in
the pack files, we store them in compressed type bitmaps. This is much
faster to test than reading in the base data from the pack file.

> Definitely :-). I have shown my interest in this topic before. So I
> should probably say that I'm going to work on this on C Git, but
> sllloooowwwly. As this benefits the server side greatly, perhaps a
> GitHubber ;-) might want to work on this on C Git, for GitHub itself
> of course, and, as a side effect, make the rest of us happy?

Google may also contribute towards this work, we really want to see an
improvement in git-core too. Its just a lot of work, and we are also a
limited team. :-)

next prev parent reply	other threads:[~2012-09-27 14:34 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-09-27  0:47 Using bitmaps to accelerate fetch and clone Shawn Pearce
2012-09-27 12:17 ` Nguyen Thai Ngoc Duy
2012-09-27 14:33   ` Shawn Pearce [this message]
2012-09-28  1:37     ` Nguyen Thai Ngoc Duy
2012-09-27 17:20   ` Jeff King
2012-09-27 17:35     ` Shawn Pearce
2012-09-27 18:22       ` Jeff King
2012-09-27 18:36         ` Shawn Pearce
2012-09-27 18:52           ` Jeff King
2012-09-27 20:18             ` Jeff King
2012-09-27 21:33               ` Junio C Hamano
2012-09-27 21:36                 ` Jeff King
2012-09-27 19:47     ` David Michael Barr
2012-09-28  1:38     ` Nguyen Thai Ngoc Duy
2012-09-28 12:00 ` Nguyen Thai Ngoc Duy
2012-10-01  1:07   ` Shawn Pearce
2012-10-01  1:59     ` Nguyen Thai Ngoc Duy
2012-10-01  2:26       ` Shawn Pearce
2012-10-01 12:48         ` Nguyen Thai Ngoc Duy
2012-10-02 15:00           ` Shawn Pearce

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAJo=hJt0PdpDT5ROUSfZ80Zh2ep=r5Sg1BS=v7Ve-djydHhp-w@mail.gmail.com' \
    --to=spearce@spearce.org \
    --cc=cranger@google.com \
    --cc=git@vger.kernel.org \
    --cc=pclouds@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).