git@vger.kernel.org mailing list mirror (one of many)
From: Nguyen Thai Ngoc Duy <pclouds@gmail.com>
To: Jeff King <peff@peff.net>
Cc: David Michael Barr <b@rr-dav.id.au>,
	Git Mailing List <git@vger.kernel.org>
Subject: Re: [RFC] pack-objects: compression level for non-blobs
Date: Sun, 30 Dec 2012 19:53:48 +0700
Message-ID: <CACsJy8C4UttGKcw11do1POcHZJM7iZ2r7F3ESOqEnWL8kdz+dQ@mail.gmail.com>
In-Reply-To: <20121230120542.GA10820@sigill.intra.peff.net>

On Sun, Dec 30, 2012 at 7:05 PM, Jeff King <peff@peff.net> wrote:
> So I was thinking about this, which led to some coding, which led to
> some benchmarking.

I like your way of thinking! May I suggest you take a new year break
first, then "think" about reachability bitmaps ;-) 2013 will be an
exciting year.

> I want to clean up a few things in the code before I post it, but the
> general idea is to have arbitrary per-pack cache files in the
> objects/pack directory. Like this:
>
>   $ cd objects/pack && ls
>   pack-a3e262f40d95fc0cc97d92797ff9988551367b75.commits
>   pack-a3e262f40d95fc0cc97d92797ff9988551367b75.idx
>   pack-a3e262f40d95fc0cc97d92797ff9988551367b75.pack
>   pack-a3e262f40d95fc0cc97d92797ff9988551367b75.parents
>   pack-a3e262f40d95fc0cc97d92797ff9988551367b75.timestamps
>   pack-a3e262f40d95fc0cc97d92797ff9988551367b75.trees
>
> Each file describes the objects in the matching pack. If a new pack is
> generated, you'd throw away the old cache files along with the old pack,
> and generate new ones. Or not. These are totally optional, and an older
> version of git will just ignore them. A newer version will use them if
> they're available, and otherwise fall back to the existing code (i.e.,
> reading the whole object from the pack). So you can generate them at

You have probably thought about this (and I don't have the source to
check first), but we may need to version these extra files so we can
change the format later if needed. Git versions that do not recognize
a newer format version would simply ignore the cache.
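
To make the versioning idea concrete, here is a minimal sketch in
Python. Everything in it is an assumption for illustration only: the
magic bytes, the 8-byte header layout, and the function names are made
up, and none of it reflects the actual code, which I have not seen.

```python
import struct

# Hypothetical layout: 4-byte magic, 4-byte big-endian version, then records.
MAGIC = b"PCCH"          # made-up signature for this sketch
KNOWN_VERSIONS = {1}     # format versions this reader understands
HEADER = struct.Struct(">4sI")


def write_cache(path, records, version=1):
    """Write the header followed by the raw fixed-length records."""
    with open(path, "wb") as f:
        f.write(HEADER.pack(MAGIC, version))
        for rec in records:
            f.write(rec)


def read_cache(path):
    """Return the record bytes, or None if the cache should be ignored."""
    try:
        with open(path, "rb") as f:
            header = f.read(HEADER.size)
            if len(header) < HEADER.size:
                return None          # truncated header: ignore the cache
            magic, version = HEADER.unpack(header)
            if magic != MAGIC or version not in KNOWN_VERSIONS:
                return None          # unrecognized format: ignore the cache
            return f.read()
    except FileNotFoundError:
        return None                  # cache files are optional
```

A caller that gets None back would take the existing path and read the
object from the pack, so an unknown version costs nothing but speed.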

> repack time, later on, or not at all. For now I have a separate command
> that generates them based on the pack index; if this turns out to be a
> good idea, it would probably get called as part of "repack".

I'd like to make it part of index-pack, where we have nearly
everything in memory. But let's leave it as a separate command first.

> Each file is a set of fixed-length records. The "commits" file contains
> the sha1 of every commit in the pack (sorted). A binary search of the
> mmap'd file gives the position of a particular commit within the list,

I think we could avoid storing sha-1 in the cache with Shawn's idea
[1]. But now that I read it again, I fail to see it :(

[1] http://article.gmane.org/gmane.comp.version-control.git/206485
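
For reference, the lookup quoted above (a binary search over sorted,
fixed-length sha1 records in an mmap'd file) can be sketched as below.
This is only an illustration in Python: it assumes raw 20-byte sha1
records with no header, and the function names are hypothetical.

```python
import mmap

SHA1_LEN = 20  # raw sha1 bytes per record


def find_commit(buf, sha1):
    """Binary-search sorted 20-byte records; return the index or -1."""
    lo, hi = 0, len(buf) // SHA1_LEN
    while lo < hi:
        mid = (lo + hi) // 2
        rec = buf[mid * SHA1_LEN:(mid + 1) * SHA1_LEN]
        if rec == sha1:
            return mid
        if rec < sha1:
            lo = mid + 1     # target sorts after this record
        else:
            hi = mid         # target sorts before this record
    return -1


def open_commits(path):
    """mmap a non-empty records file read-only; the caller closes the map."""
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
```

The lookup costs O(log n) record comparisons and touches only the pages
the search visits, which is what makes the sha1 column attractive even
if it is redundant with the pack index.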

> Of course, it does very little for the full --objects listing, where we
> spend most of our time inflating trees. We could couple this with
> uncompressed trees (which are not all that much bigger, since the sha1s
> do not compress anyway). Or we could have an external tree cache, but
> I'm not sure exactly what it would look like (this is basically
> reinventing bits of packv4, but doing so in a way that is redundant with
> the existing packfile, rather than replacing it).

Depending on the use case, we could just generate packv4-like cache
for recently-used trees only. I'm not sure how a tree cache would impact a
merge operation on a very large worktree (iow, a lot of trees
referenced from HEAD to be inflated). This is something a cache can
do, but a new pack version cannot.

> Or since the point of
> --objects is usually reachability, it may make more sense to pursue the
> bitmap, which should be even faster still.

Yes. And if narrow clone ever comes, which needs --objects limited by
pathspec, we could just produce extra bitmaps for frequently-used
pathspecs and only allow narrow clone with those pathspecs.
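
A rough sketch of what I mean, in Python. All of it is hypothetical:
it assumes objects are numbered by their position in the pack index,
that one bitmap is precomputed per supported pathspec, and that a
Python int stands in for the on-disk bitmap.

```python
def make_bitmap(positions):
    """Build a bitset (as an int) with one bit per object position."""
    bits = 0
    for pos in positions:
        bits |= 1 << pos
    return bits


def objects_for(bitmaps, pathspec):
    """Return the object positions for a precomputed pathspec.

    Raises KeyError for any pathspec without a bitmap, which models
    refusing a narrow clone that was not prepared in advance.
    """
    if pathspec not in bitmaps:
        raise KeyError("no bitmap for pathspec: " + pathspec)
    bits = bitmaps[pathspec]
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]
```

The server side would then never walk trees for a narrow request: it
either has the bitmap or it rejects the pathspec.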
-- 
Duy
