git@vger.kernel.org mailing list mirror (one of many)
From: Shawn Pearce <spearce@spearce.org>
To: Jeff King <peff@peff.net>
Cc: Nguyen Thai Ngoc Duy <pclouds@gmail.com>,
	git <git@vger.kernel.org>, Colby Ranger <cranger@google.com>,
	David Barr <barr@github.com>
Subject: Re: Using bitmaps to accelerate fetch and clone
Date: Thu, 27 Sep 2012 11:36:19 -0700
Message-ID: <CAJo=hJs4NXatb2vsZWWCamLGLmi+FoWkTaf3Ky-nereXkHEptA@mail.gmail.com>
In-Reply-To: <20120927182233.GA2519@sigill.intra.peff.net>

On Thu, Sep 27, 2012 at 11:22 AM, Jeff King <peff@peff.net> wrote:
>
> I think clients will also want it. If we can make "git rev-list
> --objects --all" faster (which this should be able to do), we can speed
> up "git prune", which in turn is by far the slowest part of "git gc
> --auto", since in the typical case we are only incrementally packing.

Yes, the bitmap can also accelerate prune. We didn't implement this
but it is a trivial use of the existing bitmap.
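
To illustrate why this is a small step, here is a minimal sketch, assuming
each ref tip already has a stored reachability bitmap with one bit per
object in pack order. The function names and the toy bitmaps are
hypothetical and are not taken from any implementation; Python integers
stand in for the compressed bitmaps.

```python
def reachable_bitmap(tip_bitmaps):
    """OR together the per-tip bitmaps to get everything reachable from any ref."""
    result = 0
    for bm in tip_bitmaps:
        result |= bm
    return result


def unreachable_positions(n_objects, reachable):
    """Return pack positions whose bit is clear, i.e. candidates for pruning."""
    return [i for i in range(n_objects) if not (reachable >> i) & 1]


# Toy pack with 8 objects and two ref tips (assumed data).
tips = [0b00001111, 0b00110011]
reach = reachable_bitmap(tips)
unreachable = unreachable_positions(8, reach)
```

The point of the sketch is that no object graph walk is needed once the
per-tip bitmaps exist; a real prune would still have to deal with loose
objects that have no bit assigned.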

>> > The sha1 in the filename makes sure that the reachability file is always
>> > in sync with the actual pack data and index.
>>
>> Depending on the extension dependencies, you may need to also use the
>> trailer SHA-1 from the pack file itself, like the index does. E.g. the
>> bitmap data depends heavily on object order in the pack and is invalid
>> if you repack with a different ordering algorithm, or a different
>> delta set of results from delta compression.
>
> Interesting. I would have assumed it depended on order in the index.

No. We tried that. Assigning bits by index order (i.e. sorted SHA-1
order) results in horrible compression of the bitmap itself
because of the uniform distribution of SHA-1. Encoding instead by pack
order gets us really good bitmap compression, because object graph
traversal order tends to take reachability into account. So we see
long contiguous runs of 1s and get good compression. Sorting by SHA-1
just makes the space into swiss cheese.
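
The effect can be reproduced with a toy model. This is a sketch under
stated assumptions: the object names are SHA-1s of a counter, the "pack
order" is simply the order they were generated in, and the commit being
described is assumed to reach one contiguous range of pack positions.
Counting runs of equal bits is used as a stand-in for how well a
run-length style encoding would do.

```python
import hashlib


def count_runs(bits):
    """Number of maximal runs of equal bits; fewer runs compress better."""
    if not bits:
        return 0
    return 1 + sum(1 for a, b in zip(bits, bits[1:]) if a != b)


# Hypothetical object names, listed in pack (traversal) order.
pack_order = [hashlib.sha1(str(i).encode()).hexdigest() for i in range(1000)]

# Assumption: the commit reaches the objects at pack positions 300..999.
reachable = set(pack_order[300:])

# Same reachable set, two different bit assignments.
bitmap_by_pack = [1 if o in reachable else 0 for o in pack_order]
bitmap_by_sha1 = [1 if o in reachable else 0 for o in sorted(pack_order)]

runs_by_pack = count_runs(bitmap_by_pack)
runs_by_sha1 = count_runs(bitmap_by_sha1)
```

In this model the pack-order bitmap is one run of 0s followed by one run
of 1s, while the SHA-1-order bitmap scatters the same bits across
hundreds of short runs.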

> I think you are still OK, though, because
> the filename comes from the sha1 over the index file, which in turn
> includes the sha1 over the packfile. Thus any change in the packfile
> would give you a new pack and index name.

No. The pack file name is composed from the SHA-1 of the sorted SHA-1s
in the pack. Any change in compression settings or delta windows or
even just random scheduling variations when repacking can cause
offsets to slide, even if the set of objects being repacked has not
changed. The resulting pack and index will have the same file names
(as it's the same set of objects), but the offset information and
ordering are now different.
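
A small sketch of the naming problem, assuming the name is the SHA-1 over
the sorted binary object names as described above. The byte streams below
are toy stand-ins for pack data, not real pack files, and the helper names
are hypothetical.

```python
import hashlib


def pack_name(object_ids):
    """Name derived only from the sorted set of object names."""
    h = hashlib.sha1()
    for oid in sorted(object_ids):
        h.update(bytes.fromhex(oid))
    return "pack-" + h.hexdigest()


def stream_checksum(pack_bytes):
    """Toy stand-in for the trailer: a SHA-1 over the bytes actually written."""
    return hashlib.sha1(pack_bytes).hexdigest()


oids = [hashlib.sha1(s).hexdigest() for s in (b"a", b"b", b"c")]

# The same objects written in two different orders (assumed toy encoding).
stream_1 = b"".join(bytes.fromhex(o) for o in oids)
stream_2 = b"".join(bytes.fromhex(o) for o in reversed(oids))

same_name = pack_name(oids) == pack_name(list(reversed(oids)))
same_checksum = stream_checksum(stream_1) == stream_checksum(stream_2)
```

Here same_name is true and same_checksum is false: the file name cannot
tell the two layouts apart, but a checksum over the stream can, which is
why a bitmap keyed to pack order needs the latter.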

Naming a pack after a SHA-1 is a fun feature. Naming it after the
SHA-1 of the object list was a mistake. It should have been named
after the SHA-1 in the trailer of the file, so that any single bit
modified within the pack stream itself would have caused a different
name to be used on the filesystem. But alas this is water under the
bridge and not likely to change anytime soon.

>> Yes. One downside is these separate streams aren't removed when you
>> run git repack. But this could be fixed by a modification to git
>> repack to clean up additional extensions with the same pack base name.
>
> I don't think that's a big deal. We already do it with ".keep" files. If
> you repack with an older version of git, you may have a stale
> supplementary file wasting space. But that's OK. The next time you gc
> with a newer version of git, we could detect and clean up such stale
> files (we already do so for tmp_pack_* files).

Yes, obviously.
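
For concreteness, a minimal sketch of such a cleanup pass, assuming the
supplementary stream lives next to the pack under the same base name. The
".bitmap" extension and the function name are hypothetical, and the
function only reports candidates rather than deleting anything.

```python
import os
import tempfile


def stale_sidecars(pack_dir, sidecar_exts=(".bitmap",)):
    """List supplementary files whose matching .pack file no longer exists."""
    names = os.listdir(pack_dir)
    live = {n[: -len(".pack")] for n in names if n.endswith(".pack")}
    stale = []
    for n in names:
        base, ext = os.path.splitext(n)
        if ext in sidecar_exts and base not in live:
            stale.append(n)
    return sorted(stale)


# Toy directory: pack-aaa is still present, pack-bbb was repacked away.
with tempfile.TemporaryDirectory() as d:
    for name in ("pack-aaa.pack", "pack-aaa.idx", "pack-aaa.bitmap",
                 "pack-bbb.bitmap"):
        open(os.path.join(d, name), "w").close()
    found = stale_sidecars(d)
```

This only catches files whose pack is gone. A repack that keeps the same
name but changes the object order would need an additional check against
the pack trailer.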

