git@vger.kernel.org mailing list mirror (one of many)
From: Jeff King <peff@peff.net>
To: Shawn Pearce <spearce@spearce.org>
Cc: Nguyen Thai Ngoc Duy <pclouds@gmail.com>,
	git <git@vger.kernel.org>, Colby Ranger <cranger@google.com>,
	David Barr <barr@github.com>
Subject: Re: Using bitmaps to accelerate fetch and clone
Date: Thu, 27 Sep 2012 14:22:33 -0400
Message-ID: <20120927182233.GA2519@sigill.intra.peff.net>
In-Reply-To: <CAJo=hJuXCYa=MKSqCRsxmwFdFYZamK_94zc3fE0tmvwUAVA2Ow@mail.gmail.com>

On Thu, Sep 27, 2012 at 10:35:47AM -0700, Shawn O. Pearce wrote:

> > If that is the case, do we need to bump the index
> > version at all? Why not store a plain v2 index, and then store an
> > additional file "pack-XXX.reachable" that contains the bitmaps and an
> > independent version number.
> 
> This is the alternate version we considered internally. It was a bit
> more work to define a 3rd file stream per pack in our backend storage
> system, so we opted for a revision of an existing stream. We could
> spend a bit more time and add a 3rd stream, keeping the index format
> unmodified.

I'd rather make the choice that provides the best user experience, even
if it is a bit more code refactoring.

> But we could have also done this with the CRC-32 table in index v2. We
> didn't. If the data should almost always be there in order to provide
> good service then we should really be embedding into the files.

Yes, although there were other changes in v2, also (e.g., the fanout to
handle larger packfiles).  Bumping the version also made the transition
take a lot longer. We introduced the reading and writing code, but then
couldn't flip the default for quite a while. For big server providers
this is not as big a deal (we know which versions of git we will use,
and are OK with flipping a config bit). But it's one more tuning thing
to deal with for small or single-person servers.

I think clients will also want it. If we can make "git rev-list
--objects --all" faster (which this should be able to do), we can speed
up "git prune", which in turn is by far the slowest part of "git gc
--auto", since in the typical case we are only incrementally packing.

I also like that the general technique can be reused easily. We've
talked about a generation-number cache in the past. That would fit this
model as well. Removing a backwards-compatibility barrier makes it a lot
easier to experiment with these sorts of things.

> > The sha1 in the filename makes sure that the reachability file is always
> > in sync with the actual pack data and index.
> 
> Depending on the extension dependencies, you may need to also use the
> trailer SHA-1 from the pack file itself, like the index does. E.g. the
> bitmap data depends heavily on object order in the pack and is invalid
> if you repack with a different ordering algorithm, or if delta
> compression produces a different set of deltas.

Interesting. I would have assumed it depended on order in the index. But
like I said, I haven't looked. I think you are still OK, though, because
the filename comes from the sha1 over the index file, which in turn
includes the sha1 over the packfile. Thus any change in the packfile
would give you a new pack and index name.
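
To illustrate the kind of check being discussed (a minimal sketch, not
code from this thread or from git itself): a packfile ends with a
20-byte SHA-1 trailer, and a v2 index ends with a copy of that trailer
followed by its own checksum. A supplementary file could record the
same trailer and be validated the same way. The helper names below are
hypothetical.

```python
import os

SHA1_LEN = 20  # size of a raw SHA-1 digest in bytes


def read_tail(path, length, skip=0):
    """Read `length` bytes that end `skip` bytes before the end of `path`."""
    with open(path, "rb") as f:
        f.seek(-(length + skip), os.SEEK_END)
        return f.read(length)


def pack_trailer(pack_path):
    """Return the trailing SHA-1 stored at the end of a packfile."""
    return read_tail(pack_path, SHA1_LEN)


def idx_pack_checksum(idx_path):
    """Return the copy of the pack trailer stored in a v2 index.

    The last 40 bytes of the index are the pack checksum followed by
    the index's own checksum, so skip the final 20 bytes.
    """
    return read_tail(idx_path, SHA1_LEN, skip=SHA1_LEN)


def index_matches_pack(pack_path, idx_path):
    """True if the index was written for exactly this packfile."""
    return pack_trailer(pack_path) == idx_pack_checksum(idx_path)
```

This only compares the stored trailers; it does not recompute the hash
over the pack contents, which would be the expensive part.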

> Yes. One downside is these separate streams aren't removed when you
> run git repack. But this could be fixed by a modification to git
> repack to clean up additional extensions with the same pack base name.

I don't think that's a big deal. We already do it with ".keep" files. If
you repack with an older version of git, you may have a stale
supplementary file wasting space. But that's OK. The next time you gc
with a newer version of git, we could detect and clean up such stale
files (we already do so for tmp_pack_* files).
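
A rough sketch of what such a cleanup pass might look for, assuming
supplementary files share the pack base name and use the ".reachable"
extension proposed earlier in this thread (the function name and the
extension list are hypothetical, and this is not how git implements
anything):

```python
from pathlib import Path


def stale_supplementary_files(pack_dir, extensions=(".reachable",)):
    """List supplementary files whose matching pack-*.pack is gone.

    `extensions` names the supplementary file types to consider; a file
    counts as stale when no packfile with the same base name exists.
    """
    pack_dir = Path(pack_dir)
    # Base names ("pack-<hash>") of packs that are still present.
    live = {p.stem for p in pack_dir.glob("pack-*.pack")}
    stale = []
    for p in sorted(pack_dir.glob("pack-*")):
        if p.suffix in extensions and p.stem not in live:
            stale.append(p)
    return stale
```

A caller would pass the repository's objects/pack directory and decide
separately whether to delete what is returned.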

-Peff
