git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Stefan Beller <sbeller@google.com>
To: Jeff Hostetler <git@jeffhostetler.com>
Cc: "git@vger.kernel.org" <git@vger.kernel.org>,
	Junio C Hamano <gitster@pobox.com>, Jeff King <peff@peff.net>,
	Jeff Hostetler <jeffhost@microsoft.com>
Subject: Re: [PATCH v1] unpack-trees: avoid duplicate ODB lookups during checkout
Date: Thu, 6 Apr 2017 15:48:07 -0700	[thread overview]
Message-ID: <CAGZ79kafbRQo2Of0H162ue5YzL7uA2PDu=sTy0=cEOejGTJhyw@mail.gmail.com> (raw)
In-Reply-To: <20170406203732.47714-2-git@jeffhostetler.com>

On Thu, Apr 6, 2017 at 1:37 PM,  <git@jeffhostetler.com> wrote:
> From: Jeff Hostetler <jeffhost@microsoft.com>
>
> Teach traverse_trees_recursive() to not do redundant ODB
> lookups when both directories refer to the same OID.

And the reason for this is that omitting the second lookup
saves time, i.e. a lookup in the ODB of a sufficiently large
repo is slow.

My kneejerk line of thinking:
* yes, it sounds good to reduce the number of ODB accesses.
* But if we consider ODB lookups to be slow and we perform
  a structured access, how about a cache in front of the ODB?
* We already have that! (sort of..) 9a414486d9 (lookup_object:
  prioritize recently found objects, 2013-05-01)
* Instead of improving the caching, maybe change the
  size of the problem: We could keep the objects of different types
  in different hash-tables.

object.c has its own hash table, I presume for historical and
performance reasons, this would be split up to multiple hash
tables.

Additionally to "object *lookup_object(*sha1)", we'd have
a function

"object *lookup_object(*sha1, enum object_type hint)"
which looks into the correct the hash table.

If you were to call just  lookup_object with no hint, then you'd
look into all the different tables (I guess there is a preferrable
order in which to look, haven't thought about that).

>
> In operations such as read-tree, checkout, and merge when
> the differences between the commits are relatively small,
> there will likely be many directories that have the same
> SHA-1.  In these cases we can avoid hitting the ODB multiple
> times for the same SHA-1.

This would explain partially why there was such a good
performance boost in the referenced commit above as we
implicitly lookup the same object multiple times.

Peff is really into getting this part faster, c.f.
https://public-inbox.org/git/20160914235547.h3n2otje2hec6u7k@sigill.intra.peff.net/

> TODO This change is a first attempt to test that by comparing
> TODO the hashes of name[i] and name[i-i] and simply copying
> TODO the tree-descriptor data.  I was thinking of the n=2
> TODO case here.  We may want to extend this to the n=3 case.

>
> ================
> On the Windows repo (500K trees, 3.1M files, 450MB index),
> this reduced the overall time by 0.75 seconds when cycling
> between 2 commits with a single file difference.
>
> (avg) before: 22.699
> (avg) after:  21.955
> ===============

So it shaves off 4% of the time needed. it doesn't sound like a
break through, but I guess these small patches add up. :)

>         for (i = 0; i < n; i++, dirmask >>= 1) {
> -               const unsigned char *sha1 = NULL;
> -               if (dirmask & 1)
> -                       sha1 = names[i].oid->hash;
> -               buf[i] = fill_tree_descriptor(t+i, sha1);
> +               if (i > 0 && (dirmask & 1) && names[i].oid && names[i-1].oid &&
> +                       !hashcmp(names[i].oid->hash, names[i-1].oid->hash)) {

Why do we need to check for dirmask & 1 here?
This ought to be covered by the hashcmp already IIUC.
So maybe we can pull out the
                         if (dirmask & 1)
                                 sha1 = names[i].oid->hash;
out of the else when dropping that dirmask check?


Thanks,
Stefan

  reply	other threads:[~2017-04-06 22:48 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-04-06 20:37 [PATCH v1] WIP unpack-trees: avoid duplicate ODB lookups during checkout git
2017-04-06 20:37 ` [PATCH v1] " git
2017-04-06 22:48   ` Stefan Beller [this message]
2017-04-07  5:19     ` Jeff King
2017-04-07 13:51       ` Jeff Hostetler
2017-04-07 17:35     ` Jeff Hostetler
2017-04-07  0:32   ` René Scharfe
2017-04-07 13:57     ` Jeff Hostetler

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAGZ79kafbRQo2Of0H162ue5YzL7uA2PDu=sTy0=cEOejGTJhyw@mail.gmail.com' \
    --to=sbeller@google.com \
    --cc=git@jeffhostetler.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jeffhost@microsoft.com \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).