git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Junio C Hamano <gitster@pobox.com>
To: "Shawn O. Pearce" <spearce@spearce.org>
Cc: git@vger.kernel.org
Subject: Re: [PATCH 2/2] pack-objects: optimize "recency order"
Date: Fri, 08 Jul 2011 10:45:51 -0700	[thread overview]
Message-ID: <7voc14fyy8.fsf@alter.siamese.dyndns.org> (raw)
In-Reply-To: <1310084657-16790-3-git-send-email-gitster@pobox.com> (Junio C. Hamano's message of "Thu, 7 Jul 2011 17:24:17 -0700")

I <gitster@pobox.com> wrote:

>  - In the original recency order, trees and blobs are intermixed. Write
>    trees together before blobs, in the hope that this will improve
>    locality when running pathspec-limited revision traversal, i.e.
>    "git log paths...";
>
>  - When writing blob objects out, write the whole family of blobs that use
>    the same delta base object together, by starting from the root of the
>    delta chain, and writing its immediate children in a width-first
>    manner, in the hope that this will again improve locality when reading
>    blobs that belong to the same path, which are likely to be deltified
>    against each other.

An interesting tangent.

With a single-liner change to apply the same "write the entire family as a
group" logic also to tree objects, we get a superb improvement to the
"index-pack" workload, while "log pathspec" becomes measurably worse.

>  * Pathspec-limited log.
>    $ git log drivers/net >/dev/null
>                                   v1.7.6  with patch
> 	 60.0% percentile :        688,671     443,578
> 	 70.0% percentile :    319,574,732 110,370,100
> 	 99.0% percentile :    412,942,470 133,078,115
> 	 99.5% percentile :    413,172,266 133,163,349
> 	 99.9% percentile :    413,354,356 133,240,445
>        Less than 2MiB seek:       61.71%      62.87%

     60.00% percentile :    2,240,841
     70.00% percentile :  115,729,538
     99.00% percentile :  132,912,877
     99.90% percentile :  133,129,574
   Less than 2MiB seek :       59.77%

This of course is because trees tend to delta well against each other and
writing out everything in the same family tends to coalesce ancient and
recent trees close together, while the history traversal wants to see the
trees at the same path in adjacent commits, and the process to jump over
to the "next tree" in the same commit need to seek a lot more.

>  * Blame.
>    $ git blame drivers/net/ne.c >/dev/null
>                                   v1.7.6  with patch
> 	 50.0% percentile :         11,248       7,334
> 	 95.0% percentile :    408,230,673 132,174,308
>        Less than 2MiB seek:       56.47%      56.83%

     50.00% percentile :       79,182
     95.00% percentile :  132,547,959
   Less than 2MiB seek :       56.16%

This also is harmed by trees in adjacent commits being further apart, but
blame needs to read a lot of blobs as well, so the effect is not as grave
as tree-only "log pathspec" case. 

>  * Index pack.
>    $ git index-pack -v .git/objects/pack/pack*.pack
>                                   v1.7.6  with patch
> 	 80.0% percentile :      1,317,862       4,971
> 	 90.0% percentile :     11,926,385     119,398
> 	 99.9% percentile :    382,919,785  33,155,191
>        Less than 2MiB seek:       81.73%      96.92%

     80.00% percentile :      745
     90.00% percentile :    1,891
     99.90% percentile :   80,001
   Less than 2MiB seek :  100.00%

It is so obvious that the in pack object ordering with "the whole family
at once" tweak is extremely tuned for this workload that the result is not
even funny.

The moral of the story is that index-pack is not a workload to optimize
for ;-) and the following is a wrong patch to apply. In general, we have
to be extremely careful not to tune for a particular workload while
unknowingly sacrificing other workloads.

I wonder if we would get a better locality for "blame" by not writing the
whole family at once for blobs, but only a smaller subset.

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 27132bb..adcb509 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -553,7 +553,7 @@ static struct object_entry **compute_write_order(void)
 	for (i = 0; i < nr_objects; i++) {
 		if (objects[i].type != OBJ_TREE)
 			continue;
-		add_to_write_order(wo, &wo_end, &objects[i]);
+		add_family_to_write_order(wo, &wo_end, &objects[i]);
 	}
 
 	/*

  parent reply	other threads:[~2011-07-08 17:46 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-07-08  0:24 [PATCH 0/2] For improved pack locality Junio C Hamano
2011-07-08  0:24 ` [PATCH 1/2] core: log offset pack data accesses happened Junio C Hamano
2011-07-08  0:24 ` [PATCH 2/2] pack-objects: optimize "recency order" Junio C Hamano
2011-07-08  2:08   ` Shawn Pearce
2011-07-08 17:45   ` Junio C Hamano [this message]
2011-07-11 22:49     ` Nicolas Pitre
2011-07-08 22:47   ` Junio C Hamano
2011-07-09  0:42     ` Shawn Pearce
2011-10-27 21:01   ` Ævar Arnfjörð Bjarmason
2011-10-27 21:49     ` Ævar Arnfjörð Bjarmason
2011-10-27 22:02       ` Ævar Arnfjörð Bjarmason
2011-10-27 22:32         ` Jakub Narebski
2011-10-27 22:05     ` Junio C Hamano
2011-10-28  9:17       ` Ævar Arnfjörð Bjarmason

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7voc14fyy8.fsf@alter.siamese.dyndns.org \
    --to=gitster@pobox.com \
    --cc=git@vger.kernel.org \
    --cc=spearce@spearce.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).