git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jeff King <peff@peff.net>
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org, Duy Nguyen <pclouds@gmail.com>,
	"Shawn O. Pearce" <spearce@spearce.org>
Subject: Re: [PATCH 4/6] introduce a commit metapack
Date: Wed, 30 Jan 2013 02:07:34 -0500	[thread overview]
Message-ID: <20130130070734.GC11147@sigill.intra.peff.net> (raw)
In-Reply-To: <7vy5fbq48t.fsf@alter.siamese.dyndns.org>

On Tue, Jan 29, 2013 at 09:38:10AM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > +int commit_metapack(unsigned char *sha1,
> > +		    uint32_t *timestamp,
> > +		    unsigned char **tree,
> > +		    unsigned char **parent1,
> > +		    unsigned char **parent2)
> > +{
> > +	struct commit_metapack *p;
> > +
> > +	prepare_commit_metapacks();
> > +	for (p = commit_metapacks; p; p = p->next) {
> > +		unsigned char *data;
> > +		int pos = sha1_entry_pos(p->index, 20, 0, 0, p->nr, p->nr, sha1);
> 
> This is a tangent, but isn't it about time to rip out the check for
> GIT_USE_LOOKUP in find_pack_entry_one(), I wonder.

Rip it out and always use sha1_entry_pos, or rip it out and never use
it? I have never been able to measure any speedup from it; I just use it
here to avoid rewriting the binary search myself. I do not think there
is any disadvantage to using it, so I'd be in favor of just
standardizing on it for any sha1 binary searches.

> These cached properties of a single commit will not change no matter
> which pack it appears in, and it feels logically wrong, especially
> when you record these object names in the full SHA-1 form, to tie a
> "commit metapack" to a pack.  Logically there needs only one commit
> metapack that describes all the commits known to the repository when
> the metapack was created.

True. I had originally envisioned it as tied to the packfile to help
manage the lifecycle. You know you need to generate a metapack for some
objects when they get packed, and you know you can throw it away when
the associated pack goes away. And it means you can verify the integrity
of the metapack by matching it to a particular packfile.

However, if you just have a commit cache, you can always blow the whole
thing away and regenerate it for all objects in the repo. It does not
have tied to a pack (you do end up doing some extra work at regeneration
when you are not doing a full repack, but it is really not enough to
worry about).

> In order to reduce the disk footprint and I/O cost, the future
> direction for this mechanism may want to point into an existing
> store of SHA-1 hashes with a shorter file offset, and the .idx file
> could be such a store, and in order to move in that direction, you
> cannot avoid tying a metapack to a pack.

Yes. That was not one of my original goals for the commit cache, but I
do think it's a useful direction to go in. And reachability bitmaps
(which would eventually be their own metapacks) would generally want to
be per-pack, too, for the same reason.

> > +	*tail = &commit_list_insert(c, *tail)->next;
> > +}
> 
> It feels somewhat wasteful to:
> 
>  - use commit_list for this, rather than an array of commit
>    objects.  If you have a rough estimate of the number of commits
>    in the pack, you could just preallocate a single array and use
>    ALLOC_GROW() on it, no?

We don't have a rough estimate, but yes, we could just use an array and
trust ALLOC_GROW to be reasonable. The use of commit_list did not have a
particular reason other than that it was simple (an array means stuffing
the array pointer and the nr and alloc counts into a struct to get to
the callback). The performance of writing the cache is dominated by
accessing the objects themselves, anyway. I don't mind changing it,
though, if you think it's clearer as an array.

>  - iterate over the .idx file and run sha1_object_info() and
>    parse_commit() on many objects in the SHA-1 order.  Iterating in
>    the way builtin/pack-objects.c::get_object_details() does avoids
>    jumping around in existing packfiles, which may be more
>    efficient, no?

Probably, though generating the complete commit cache for linux-2.6.git
takes only about 7 seconds on my machine. I wasn't too concerned with
optimizing generation, since it will typically be dwarfed by repacking
costs. It might make more of a difference for doing a metapack on all
objects (e.g., reachability bitmaps).

The reason I do it in .idx order is that it feeds the callback in sorted
order, so a writer could in theory just use that output as-is (and my
initial version did that, as it wrote separate metapacks for each
element). This version now puts all data elements together (for cache
locality), and builds the in-memory list so we do not have to re-do
sha1_object_info repeatedly. So it could very easily just generate the
list in pack order and sort it at the end.

I'll look into that for the next version.

-Peff

  parent reply	other threads:[~2013-01-30  7:08 UTC|newest]

Thread overview: 43+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-01-29  9:14 [PATCH/RFC 0/6] commit caching Jeff King
2013-01-29  9:15 ` [PATCH 1/6] csum-file: make sha1write const-correct Jeff King
2013-01-29  9:15 ` [PATCH 2/6] strbuf: add string-chomping functions Jeff King
2013-01-29 10:15   ` Michael Haggerty
2013-01-29 11:10     ` Jeff King
2013-01-30  5:00       ` Michael Haggerty
2013-01-29  9:15 ` [PATCH 3/6] introduce pack metadata cache files Jeff King
2013-01-29 17:35   ` Junio C Hamano
2013-01-30  6:47     ` Jeff King
2013-01-30  1:30   ` Duy Nguyen
2013-01-30  6:50     ` Jeff King
2013-01-29  9:16 ` [PATCH 4/6] introduce a commit metapack Jeff King
2013-01-29 10:24   ` Michael Haggerty
2013-01-29 11:13     ` Jeff King
2013-01-29 17:38   ` Junio C Hamano
2013-01-29 18:08     ` Junio C Hamano
2013-01-30  7:12       ` Jeff King
2013-01-30  7:17         ` Junio C Hamano
2013-02-01  9:21           ` Jeff King
2013-01-30 15:56         ` Junio C Hamano
2013-01-31 17:03           ` Shawn Pearce
2013-02-01  9:42             ` Jeff King
2013-02-02 17:49               ` Junio C Hamano
2013-01-30  7:07     ` Jeff King [this message]
2013-01-30  3:36   ` Duy Nguyen
2013-01-30  7:12     ` Jeff King
2013-01-30 13:56   ` Duy Nguyen
2013-01-30 14:16     ` Duy Nguyen
2013-01-31 11:06       ` Duy Nguyen
2013-02-01 10:15         ` Jeff King
2013-02-02  9:49           ` Duy Nguyen
2013-02-01 10:40         ` Jeff King
2013-03-17 13:21         ` Duy Nguyen
2013-03-18 12:20           ` Jeff King
2013-02-01 10:00     ` Jeff King
2013-01-29  9:16 ` [PATCH 5/6] add git-metapack command Jeff King
2013-01-29  9:16 ` [PATCH 6/6] commit: look up commit info in metapack Jeff King
2013-01-30  3:31 ` [PATCH/RFC 0/6] commit caching Duy Nguyen
2013-01-30  7:18   ` Jeff King
2013-01-30  8:32     ` Duy Nguyen
2013-01-31 17:14 ` Shawn Pearce
2013-02-01  9:11   ` Jeff King
2013-02-02 10:04     ` Shawn Pearce

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20130130070734.GC11147@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=pclouds@gmail.com \
    --cc=spearce@spearce.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).