git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Shawn Pearce <spearce@spearce.org>
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org
Subject: Re: [PATCH 2/2] pack-objects: optimize "recency order"
Date: Thu, 7 Jul 2011 19:08:11 -0700	[thread overview]
Message-ID: <CAJo=hJtEnGGH8yjTGHiVjjaiRd6TVznW+5DwpibNSN6JEY=Trw@mail.gmail.com> (raw)
In-Reply-To: <1310084657-16790-3-git-send-email-gitster@pobox.com>

On Thu, Jul 7, 2011 at 17:24, Junio C Hamano <gitster@pobox.com> wrote:
> This optimizes the "recency order" (see pack-heuristics.txt in
> Documentation/technical/ directory) used to order objects within a
> packfile in three ways:

Yay!

>  - Commits at the tip of tags are written together, in the hope that
>   revision traversal done in incremental fetch (which starts by
>   putting them in a revision queue marked as UNINTERESTING) will see a
>   better locality of these objects;

Putting these together after the first tagged commit is an interesting
approach. Currently JGit drops these after 1024 recent commits have
been produced, the alternative here parks them around the most recent
release. I wonder which is really the better approach for the
upload-pack workload. I chose 1024 because linux-2.6 seemed to be
getting several thousand commits per MB of pack data, so 1024 would
park the tagged commits within the first MB (making a cold-cache
upload-pack slightly faster), but doesn't harm `git log` going through
the pager too badly because this block is 1024 commits back.

Have you tried putting commits in parse order rather than recency order?

By this I mean the revision traversal code needs to parse the parents
of a commit in order to get their commit date and enqueue them into
the priority queue at the right position. The order the commits get
parsed in is different from the order they are popped in, especially
when a commit is created on an older revision and there is much
concurrent activity on a different branch between the commit and its
ancestor. This pattern is very typical in repositories that pull from
others to merge changes in... aka linux, git.git, etc.

My intuition says emulating the priority queue in pack-objects and
using that to influence the order commits are written out (so its the
order commits enter the queue, rather than leave it) will further
reduce minor page faults during the `git log >/dev/null` test, and
possibly also help the other log based workloads. Its certainly a lot
more work to code than this patch, but I wonder if it produces better
ordering.

>  - In the original recency order, trees and blobs are intermixed. Write
>   trees together before blobs, in the hope that this will improve
>   locality when running pathspec-limited revision traversal, i.e.
>   "git log paths...";

FWIW JGit has been doing "commits-then-trees-then-blobs" for a long
time now and we have generally found the same results as you did here,
segregating by object type helps page faults.

>  - When writing blob objects out, write the whole family of blobs that use
>   the same delta base object together, by starting from the root of the
>   delta chain, and writing its immediate children in a width-first
>   manner, in the hope that this will again improve locality when reading
>   blobs that belong to the same path, which are likely to be deltified
>   against each other.

This is an interesting approach, and one I had not considered before.

>  * Simple commit-only log.
>
>   $ git log >/dev/null
...
>   95% of the pack accesses look at data that is no further than 260kB
>   from the previous location we accessed. The patch does not change the
>   order of commit objects very much, and the result is very similar.

I think a more interesting thing to examine is how often do we have to
"skip back" to an earlier part of the file. Consider the case that the
~190MBish of commits does not fully fit into kernel buffer cache, but
we do have read-ahead support in the kernel. Forward references are
relatively free, because read-ahead should populate that page before
we need it. Backward references are horribly expensive, because they
may have been evicted to make room for more read-ahead. From what I
could tell of similar traces in JGit, the recency commit ordering
causes a lot more backwards references than the parse ordering.

>  * Pathspec-limited log.
>
>   $ git log drivers/net >/dev/null
>
>   The path is touched by 26551 commits and merges (among 254656 total).
>
>                                  v1.7.6  with patch
>   Total number of access :      559,511     558,663
...
>         70.0% percentile :  319,574,732 110,370,100
>         80.0% percentile :  361,647,599 123,707,229
>         90.0% percentile :  393,195,669 128,947,636
>         95.0% percentile :  405,496,875 131,609,321
...

Does this result suggest that we needed less of the pack in-memory in
order to produce the result? That is, on a cold cache we should be
touching fewer pages of the pack when this patch was used to create
it, and fewer page references would allow the command to complete more
quickly on a cold cache.

> Note that a half-a-gigabyte packfile comfortably fits in the buffer cache,
> and you would unlikely to see much performance difference on a modern and
> reasonably beefy machine with enough memory and local disks. Benchmarking
> with cold cache (or over NFS) would be interesting.

It would also be easy to test these cases by setting the pack window
size to something small (e.g. 1 MB) and the pack limit to something
equally small (e.g. 25 MB), and stuffing a delay of 20 ms into the
code path that xmmaps a new window when its not already opened.

I'm glad you started working on this, it looks like it may lead to a
better pack ordering.

-- 
Shawn.

  reply	other threads:[~2011-07-08  2:08 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-07-08  0:24 [PATCH 0/2] For improved pack locality Junio C Hamano
2011-07-08  0:24 ` [PATCH 1/2] core: log offset pack data accesses happened Junio C Hamano
2011-07-08  0:24 ` [PATCH 2/2] pack-objects: optimize "recency order" Junio C Hamano
2011-07-08  2:08   ` Shawn Pearce [this message]
2011-07-08 17:45   ` Junio C Hamano
2011-07-11 22:49     ` Nicolas Pitre
2011-07-08 22:47   ` Junio C Hamano
2011-07-09  0:42     ` Shawn Pearce
2011-10-27 21:01   ` Ævar Arnfjörð Bjarmason
2011-10-27 21:49     ` Ævar Arnfjörð Bjarmason
2011-10-27 22:02       ` Ævar Arnfjörð Bjarmason
2011-10-27 22:32         ` Jakub Narebski
2011-10-27 22:05     ` Junio C Hamano
2011-10-28  9:17       ` Ævar Arnfjörð Bjarmason

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAJo=hJtEnGGH8yjTGHiVjjaiRd6TVznW+5DwpibNSN6JEY=Trw@mail.gmail.com' \
    --to=spearce@spearce.org \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).