Re: [PATCH 3/4] peel_ref: check object type before loading

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

From: Jeff King <peff@peff.net>
To: Junio C Hamano <gitster@pobox.com>
Cc: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"Git Mailing List" <git@vger.kernel.org>
Subject: Re: [PATCH 3/4] peel_ref: check object type before loading
Date: Thu, 4 Oct 2012 15:41:50 -0400	[thread overview]
Message-ID: <20121004194150.GA13955@sigill.intra.peff.net> (raw)
In-Reply-To: <7vbogiys47.fsf@alter.siamese.dyndns.org>

On Thu, Oct 04, 2012 at 12:06:16PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > The point of peel_ref is to dereference tags; if the base
> > object is not a tag, then we can return early without even
> > loading the object into memory.
> >
> > This patch accomplishes that by checking sha1_object_info
> > for the type. For a packed object, we can get away with just
> > looking in the pack index. For a loose object, we only need
> > to inflate the first couple of header bytes.
> 
> We look at the pack index and have to follow its delta chain down to
> the base to find its type; if the object is deeply deltified, this
> certainly is an overall loss.

Yes. Although in practice I would expect most things pointed to by refs
to not be delta'd. And indeed, comparing "git index-pack -v" to "git
for-each-ref" in my git.git repository yields 950 non-delta tips and
only 42 delta (and those are mostly rebased commit objects). In my
linux-2.6 repo, I have 0 delta ref tips (probably because I do not
actually do kernel work, but keep it around for querying).

The numbers I mentioned in 926f1dd (which did this exact same
optimization for upload-pack) showed a slight improvement in practice.
Because keep in mind that the alternative to this multi-step delta
lookup for the type is to load the object. Which will do the _same_
lookup, but will also have to actually compute the delta.

So the only real loss is that you have to repeat the lookup (or in the
case of a loose object, re-open and re-inflate the first 32 bytes). The
best way to do this would be to have sha1_object_info return some kind
of handle to the object where we could get the meta info cheaply, and
then reuse the handle to get the actual object data later (so for a
packed object, cache the actual pack location; for a loose object, keep
the descriptor and zlib stream open).

Of course, the speed up in 926f1dd probably won't matter anymore because
the whole point of this series is to avoid touching the object at all.
So hitting sha1_object_info in peel_ref is already the slow path. On
most objects, an optimization like this will not be noticeable either
way. The interesting thing is that we can avoid touching a large or
expensive object at all.

> > This optimization is the one that gave me the most pause. While
> > upload-pack does call peel_ref on everything, the other callers all
> > constrain themselves to refs/tags/. So for many projects, we will be
> > calling it mostly on annotated tags, and it may be a very small net
> > loss. But in practice, it will not matter for most projects with a sane
> > number of normal tags, and saving even one accidental giant blob load
> > can have a huge impact.
> 
> I may be missing something, but the above description is not
> convincing to me.  When was the last time you pointed a blob
> directly with a ref, whether large or small, and whether within
> refs/tags or outside?

Not often (though you have such a ref in git.git). I started down this
road because just applying patch 4/4 produces a regression in the test
suite. t1050 tries to clone a tag pointing to a large object with
GIT_ALLOC_LIMIT set. Using peel_ref regresses this to actually load the
whole object into memory.

That particular regression is fixed by 1/1 (because it's an annotated
tag pointing to the large object). This would fix the same case if it
were an unannotated tag.

Now is that a real-world test? No, I'm not sure it is. But I consider it
a win for large-object git[1] every time we can avoid loading an object,
because it only takes one unnecessary load to ruin the user experience.
So even if it's unlikely, if it isn't hurting the regular code path, I
think it's worth doing (and I'm not sure it is hurting; my numbers
showed that it be helping, though I suspect it is doing nothing after my
4/4).

-Peff

[1] One thing I've been toying is with "external alternates"; dumping
    your large objects in some realtively slow data store (e.g., a
    RESTful HTTP service). You could cache and cheaply query a list of
    "sha1 / size / type" for each object from the store, but getting the
    actual objects would be much more expensive. But again, it would
    depend on whether you would actually have such a store directly
    accessible by a ref.

next prev parent reply	other threads:[~2012-10-04 22:14 UTC|newest]

Thread overview: 38+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-10-03 12:36 upload-pack is slow with lots of refs Ævar Arnfjörð Bjarmason
2012-10-03 13:06 ` Nguyen Thai Ngoc Duy
2012-10-03 18:03 ` Jeff King
2012-10-03 18:53   ` Junio C Hamano
2012-10-03 18:55     ` Jeff King
2012-10-03 19:41       ` Shawn Pearce
2012-10-03 20:13         ` Jeff King
2012-10-04 21:52           ` Sascha Cunz
2012-10-05  0:20             ` Jeff King
2012-10-05  6:24         ` Johannes Sixt
2012-10-05 16:57           ` Shawn Pearce
2012-10-08 15:05             ` Johannes Sixt
2012-10-09  6:46               ` Shawn Pearce
2012-10-09 20:30                 ` Johannes Sixt
2012-10-09 20:46                   ` Johannes Sixt
2012-10-03 20:16   ` Ævar Arnfjörð Bjarmason
2012-10-03 21:20     ` Jeff King
2012-10-03 22:15       ` Ævar Arnfjörð Bjarmason
2012-10-03 23:15         ` Jeff King
2012-10-03 23:54           ` Ævar Arnfjörð Bjarmason
2012-10-04  7:56             ` [PATCH 0/4] optimizing upload-pack ref peeling Jeff King
2012-10-04  7:58               ` [PATCH 1/4] peel_ref: use faster deref_tag_noverify Jeff King
2012-10-04 18:24                 ` Junio C Hamano
2012-10-04  8:00               ` [PATCH 2/4] peel_ref: do not return a null sha1 Jeff King
2012-10-04 18:32                 ` Junio C Hamano
2012-10-04  8:02               ` [PATCH 3/4] peel_ref: check object type before loading Jeff King
2012-10-04 19:06                 ` Junio C Hamano
2012-10-04 19:41                   ` Jeff King [this message]
2012-10-04 20:41                     ` Junio C Hamano
2012-10-04 21:59                       ` Jeff King
2012-10-04  8:03               ` [PATCH 4/4] upload-pack: use peel_ref for ref advertisements Jeff King
2012-10-04  8:04               ` [PATCH 0/4] optimizing upload-pack ref peeling Jeff King
2012-10-04  9:01                 ` Ævar Arnfjörð Bjarmason
2012-10-04 12:14                   ` Nazri Ramliy
2012-10-03 22:32   ` upload-pack is slow with lots of refs Ævar Arnfjörð Bjarmason
2012-10-03 23:21     ` Jeff King
2012-10-03 23:47       ` Ævar Arnfjörð Bjarmason
2012-10-03 19:13 ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20121004194150.GA13955@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).