git@vger.kernel.org mailing list mirror (one of many)
From: Shawn Pearce <spearce@spearce.org>
To: Nguyen Thai Ngoc Duy <pclouds@gmail.com>
Cc: git <git@vger.kernel.org>, Colby Ranger <cranger@google.com>
Subject: Re: Using bitmaps to accelerate fetch and clone
Date: Sun, 30 Sep 2012 18:07:45 -0700
Message-ID: <CAJo=hJsWczUqhvj6Kqsomeh9WxAAJO-Yc-=61k94jos6vVtEjQ@mail.gmail.com>
In-Reply-To: <CACsJy8AUdRyjSrAgM+ABzWet2NKz7N7M4re2QVoRPrrA=zfvvg@mail.gmail.com>

On Fri, Sep 28, 2012 at 5:00 AM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
> On Thu, Sep 27, 2012 at 7:47 AM, Shawn Pearce <spearce@spearce.org> wrote:
>> * https://git.eclipse.org/r/7939
>>
>>   Defines the new E003 index format and the bit set
>>   implementation logic.
>
> Quote from the patch's message:
>
> "Currently, the new index format can only be used with pack files that
> contain a complete closure of the object graph e.g. the result of a
> garbage collection."
>
> You mentioned this before in your idea mail a while back. I wonder if
> it's worth storing bitmaps for all packs, not just the self contained
> ones.

Colby and I started talking about this late last week too. It seems
feasible, but does add a bit more complexity to the algorithm used
when enumerating.

> We could have one leaf bitmap per pack to mark all leaves where
> we'll need to traverse outside the pack. Commit leaves are the best as
> we can potentially reuse commit bitmaps from other packs. Tree leaves
> will be followed in the normal/slow way.

Yes, Colby proposed the same idea.

We cannot make a "leaf bitmap per pack". The leaf SHA-1s are not in
the pack and therefore cannot have a bit assigned to them. We could
add a new section that lists the unique leaf SHA-1s in their own
private table, and then give each bitmap a companion leaf bitmap with
a bit set to 1 for every leaf object it reaches outside of the pack.
This would probably take up the least amount of disk space, vs.
storing the list of leaf SHA-1s after each bitmap. If a pack has only
1 bitmap (e.g. it is a small chunk of recent history) there is really
no difference in disk usage. If the pack has 2 or 3 commit bitmaps
along a string of approximately 300 commits, you will have an
identical leaf set for each of those bitmaps, so a single leaf SHA-1
table avoids storing the redundant leaf pointers more than once.
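
To make the shared leaf table concrete, here is a rough Python sketch.
It is only an illustration of the idea, not the E003 format or JGit
code: the names `PackBitmapIndex`, `add_bitmap`, and `edges` are made
up for this example, and Python integers stand in for compressed bit
sets.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class PackBitmapIndex:
    """Hypothetical model of a bitmap index for a pack that is not a closure."""
    pack_objects: List[str]                              # bit i <-> pack_objects[i]
    leaf_table: List[str] = field(default_factory=list)  # unique out-of-pack IDs
    # commit ID -> (bitmap over pack_objects, bitmap over leaf_table)
    bitmaps: Dict[str, Tuple[int, int]] = field(default_factory=dict)


def add_bitmap(index: PackBitmapIndex, commit: str,
               edges: Dict[str, List[str]]) -> None:
    """Walk from `commit`, recording in-pack objects and out-of-pack leaves."""
    pos = {oid: i for i, oid in enumerate(index.pack_objects)}
    leaf_pos = {oid: i for i, oid in enumerate(index.leaf_table)}
    obj_bits = 0
    leaf_bits = 0
    seen = set()
    stack = [commit]
    while stack:
        oid = stack.pop()
        if oid in seen:
            continue
        seen.add(oid)
        if oid in pos:
            # Object is in the pack: set its bit and keep walking.
            obj_bits |= 1 << pos[oid]
            stack.extend(edges.get(oid, ()))
        else:
            # Object is outside the pack: it is a leaf. Add it to the
            # shared table once, then set its bit in this bitmap's leaf set.
            if oid not in leaf_pos:
                leaf_pos[oid] = len(index.leaf_table)
                index.leaf_table.append(oid)
            leaf_bits |= 1 << leaf_pos[oid]
    index.bitmaps[commit] = (obj_bits, leaf_bits)


# Toy pack: two commits and their trees; parent c1 and blob b1 live elsewhere.
edges = {"c3": ["c2", "t3"], "c2": ["c1", "t2"], "t3": ["b1"], "t2": ["b1"]}
index = PackBitmapIndex(pack_objects=["c3", "c2", "t3", "t2"])
add_bitmap(index, "c3", edges)
add_bitmap(index, "c2", edges)
```

Both bitmaps end up with the same leaf bits, and the leaf table holds
each outside SHA-1 only once, which is the disk saving described above.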

One of the problems we have seen with these non-closed packs is that
they waste an incredible amount of disk space. As an example, do a
`git fetch` from Linus' tree when you are more than a few weeks
behind. You will get back more than 100 objects (the default unpack
limit), so the thin pack will be saved and completed with additional
base objects. That thin pack will go from a few MiBs to more than
40 MiB of data on disk, thanks to the redundant base objects being
appended to the end of the pack. For most uses these packs are best
eliminated and replaced with a new complete closure pack. The
redundant base objects disappear, and Git stops wasting a huge amount
of disk space.

> For connectivity check, fewer trees/commits to deflate/parse means
> less time. And connectivity check is done on every git-fetch (I
> suspect the other end of a push also has the same check). It's not
> unusual for me to fetch some repos once every few months so these
> incomplete packs could be quite big and it'll take some time for gc
> --auto to kick in (of course we could adjust gc --auto to start based
> on the number of non-bitmapped objects, in addition to the number of
> packs).

Yes, of course.
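
The adjusted trigger suggested in the quoted text might look roughly
like the Python sketch below. This is not how `git gc --auto` is
implemented today: `should_auto_gc` and `non_bitmapped_limit` are
hypothetical names, and the 50,000 threshold is an arbitrary
placeholder. Only the pack limit of 50 mirrors the documented default
of gc.autoPackLimit.

```python
def should_auto_gc(pack_count: int,
                   non_bitmapped_objects: int,
                   pack_limit: int = 50,
                   non_bitmapped_limit: int = 50_000) -> bool:
    """Return True when a repack into one closed, bitmapped pack seems due.

    pack_limit mirrors the default of gc.autoPackLimit; the
    non_bitmapped_limit threshold is purely an assumption for this sketch.
    """
    too_many_packs = pack_count > pack_limit
    too_many_unindexed = non_bitmapped_objects > non_bitmapped_limit
    return too_many_packs or too_many_unindexed
```

With a check like this, a single large fetch after a few months away
could trigger a repack even though it only added one pack.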
