git@vger.kernel.org mailing list mirror (one of many)
 help / Atom feed
From: Jeff King <peff@peff.net>
To: Junio C Hamano <gitster@pobox.com>
Cc: SZEDER Gábor <szeder.dev@gmail.com>,
	Ævar Arnfjörð Bjarmason <avarab@gmail.com>,
	Derrick Stolee <stolee@gmail.com>,
	Git mailing list <git@vger.kernel.org>,
	Stefan Beller <sbeller@google.com>,
	Ramsay Jones <ramsay@ramsayjones.plus.com>,
	git@jeffhostetler.com, Derrick Stolee <dstolee@microsoft.com>
Subject: Re: [PATCH v6 00/14] Serialized Git Commit Graph
Date: Fri, 16 Mar 2018 16:49:45 -0400
Message-ID: <20180316204945.GA12333@sigill.intra.peff.net> (raw)
In-Reply-To: <xmqq370z3m5o.fsf@gitster-ct.c.googlers.com>

On Fri, Mar 16, 2018 at 11:33:55AM -0700, Junio C Hamano wrote:

> It is not so surprising that history walking runs rings around
> enumerating objects in packfiles, if packfiles are built well.
> 
> A well-built packfile tends to has newer objects in base form and
> has delta that goes in backward direction (older objects are
> represented as delta against newer ones).  This helps warlking from
> the tips of the history quite a bit, because your delta base cache
> will tend to have the base object (i.e. objects in the newer part of
> the history you just walked) that will be required to access the
> "next" older part of the history more often than not.
> 
> Trying to read the objects in the pack in their object name order
> would essentially mean reading them in a cryptgraphically random
> order.  Half the time you will end up wanting to access an object
> that is near the tip of a very deep delta chain even before you've
> accessed any of the base objects in the delta chain.

I coincidentally was doing some experiments in this area a few weeks
ago, and found a few things:

  1. The ordering makes a _huge_ difference for accessing trees and
     blobs.

  2. Pack order (not pack-idx order) is actually the best order, since
     it tends to follow the delta patterns (it's close to traversal
     order, but packs delta families more tightly).

  3. None of this really matters for commits, since we almost never
     store them as deltas anyway.

Here are a few experiments people can do themselves to demonstrate (my
numbers here are all from linux.git, which is sort of a wort-case
for bad ordering because its size stresses the default delta cache):

  [every object in sha1 order: slow]
  $ time git cat-file --batch-all-objects --batch >/dev/null
  real	8m44.041s
  user	8m31.359s
  sys	0m12.262s

  [every object from a traversal: faster, but --objects traversals are
   actually CPU heavy due to all of the hash lookups for each tree. Note
   not just wall-clock time but the CPU since it's split across two
   processes]
  $ time git rev-list --objects --all |
         cut -d' ' -f2 |
	 git cat-file --batch >/dev/null
  real	1m2.667s
  user	0m58.537s
  sys	0m32.392s

  [every object in pack order: fastest. This is due to skipping the
   traversal overhead, and should use our delta cache quite efficiently.
   I'm assuming a single pack and no loose objects here, but the
   performance should generalize since accessing the "big" pack
   dominates]
  $ time git show-index <$(ls .git/objects/pack/*.idx) |
         sort -n |
         cut -d' ' -f2 |
	 git cat-file --batch >/dev/null
  real	0m51.718s
  user	0m50.963s
  sys	0m7.068s

  [just commits, sha1 order: not horrible]
  $ time git cat-file --batch-all-objects --batch-check='%(objecttype) %(objectname)' |
         grep ^commit |
	 cut -d' ' -f2 |
	 git cat-file --batch >/dev/null
  real	0m8.115s
  user	0m14.033s
  sys	0m1.170s

  [just commits, pack order: slightly worse due to the extra piping, but
   obviously that could be done more quickly internally]
  $ time git show-index <$(ls .git/objects/pack/*.idx) |
         sort -n |
         cut -d' ' -f2 |
	 git cat-file --batch-check='%(objecttype) %(objectname)' |
         grep ^commit |
	 cut -d' ' -f2 |
	 git cat-file --batch >/dev/null
  real	0m21.670s
  user	0m24.867s
  sys	0m9.600s

  [and the reason is that hardly any commits get deltas]
  $ git cat-file --batch-all-objects --batch-check='%(objecttype) %(deltabase)' |
    grep ^commit >commits
  $ wc -l commits
  692596
  $ grep -v '0000000000000000000000000000000000000000' commits | wc -l
  18856

For the purposes of this patch series, I don't think the order matters
much, since we're only dealing with commits. For doing --batch-check, I
think the sha1 ordering given by "cat-file --batch-all-objects" is
convenient, and doesn't have a big impact on performance. But it's
_awful_ for --batch. I think we may want to add a sorting option to just
return the objects in the original packfile order.

-Peff

  parent reply index

Thread overview: 110+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-02-27  2:32 [PATCH v5 00/13] " Derrick Stolee
2018-02-27  2:32 ` [PATCH v5 01/13] commit-graph: add format document Derrick Stolee
2018-02-27  2:32 ` [PATCH v5 02/13] graph: add commit graph design document Derrick Stolee
2018-02-27  2:32 ` [PATCH v5 03/13] commit-graph: create git-commit-graph builtin Derrick Stolee
2018-02-27  2:32 ` [PATCH v5 04/13] csum-file: add CSUM_KEEP_OPEN flag Derrick Stolee
2018-03-12 13:55   ` Derrick Stolee
2018-03-13 21:42     ` Junio C Hamano
2018-03-14  2:26       ` Derrick Stolee
2018-03-14 17:00         ` Junio C Hamano
2018-02-27  2:32 ` [PATCH v5 05/13] commit-graph: implement write_commit_graph() Derrick Stolee
2018-02-27  2:33 ` [PATCH v5 06/13] commit-graph: implement 'git-commit-graph write' Derrick Stolee
2018-02-27  2:33 ` [PATCH v5 07/13] commit-graph: implement git commit-graph read Derrick Stolee
2018-02-27  2:33 ` [PATCH v5 08/13] commit-graph: add core.commitGraph setting Derrick Stolee
2018-02-27  2:33 ` [PATCH v5 09/13] commit-graph: close under reachability Derrick Stolee
2018-02-27  2:33 ` [PATCH v5 10/13] commit: integrate commit graph with commit parsing Derrick Stolee
2018-02-27  2:33 ` [PATCH v5 11/13] commit-graph: read only from specific pack-indexes Derrick Stolee
2018-02-27 20:15   ` Stefan Beller
2018-02-27  2:33 ` [PATCH v5 12/13] commit-graph: build graph from starting commits Derrick Stolee
2018-02-27  2:33 ` [PATCH v5 13/13] commit-graph: implement "--additive" option Derrick Stolee
2018-02-27 18:50 ` [PATCH v5 00/13] Serialized Git Commit Graph Stefan Beller
2018-03-14 19:27 ` [PATCH v6 00/14] " Derrick Stolee
2018-03-14 19:27   ` [PATCH v6 01/14] csum-file: rename hashclose() to finalize_hashfile() Derrick Stolee
2018-03-14 19:27   ` [PATCH v6 02/14] csum-file: refactor finalize_hashfile() method Derrick Stolee
2018-03-14 19:27   ` [PATCH v6 03/14] commit-graph: add format document Derrick Stolee
2018-03-14 19:27   ` [PATCH v6 04/14] graph: add commit graph design document Derrick Stolee
2018-03-14 19:27   ` [PATCH v6 05/14] commit-graph: create git-commit-graph builtin Derrick Stolee
2018-03-14 19:27   ` [PATCH v6 06/14] commit-graph: implement write_commit_graph() Derrick Stolee
2018-03-14 19:27   ` [PATCH v6 07/14] commit-graph: implement 'git-commit-graph write' Derrick Stolee
2018-03-18 13:25     ` Ævar Arnfjörð Bjarmason
2018-03-19 13:12       ` Derrick Stolee
2018-03-19 14:36         ` Ævar Arnfjörð Bjarmason
2018-03-19 18:27           ` Derrick Stolee
2018-03-19 18:48             ` Ævar Arnfjörð Bjarmason
2018-03-14 19:27   ` [PATCH v6 08/14] commit-graph: implement git commit-graph read Derrick Stolee
2018-03-14 19:27   ` [PATCH v6 09/14] commit-graph: add core.commitGraph setting Derrick Stolee
2018-03-14 19:27   ` [PATCH v6 10/14] commit-graph: close under reachability Derrick Stolee
2018-03-14 19:27   ` [PATCH v6 11/14] commit: integrate commit graph with commit parsing Derrick Stolee
2018-03-14 19:27   ` [PATCH v6 12/14] commit-graph: read only from specific pack-indexes Derrick Stolee
2018-03-15 22:50     ` SZEDER Gábor
2018-03-19 13:13       ` Derrick Stolee
2018-03-14 19:27   ` [PATCH v6 13/14] commit-graph: build graph from starting commits Derrick Stolee
2018-03-14 19:27   ` [PATCH v6 14/14] commit-graph: implement "--additive" option Derrick Stolee
2018-03-14 20:10   ` [PATCH v6 00/14] Serialized Git Commit Graph Ramsay Jones
2018-03-14 20:43   ` Junio C Hamano
2018-03-15 17:23     ` Johannes Schindelin
2018-03-15 18:41       ` Junio C Hamano
2018-03-15 21:51         ` Ramsay Jones
2018-03-16 11:50         ` Johannes Schindelin
2018-03-16 17:27           ` Junio C Hamano
2018-03-19 11:41             ` Johannes Schindelin
2018-03-16 16:28     ` Lars Schneider
2018-03-19 13:10       ` Derrick Stolee
2018-03-16 15:06   ` Ævar Arnfjörð Bjarmason
2018-03-16 16:38     ` SZEDER Gábor
2018-03-16 18:33       ` Junio C Hamano
2018-03-16 19:48         ` SZEDER Gábor
2018-03-16 20:06           ` Jeff King
2018-03-16 20:19             ` Jeff King
2018-03-19 12:55               ` Derrick Stolee
2018-03-20  1:17                 ` Derrick Stolee
2018-03-16 20:49         ` Jeff King [this message]
2018-04-02 20:34   ` [PATCH v7 " Derrick Stolee
2018-04-02 20:34     ` [PATCH v7 01/14] csum-file: rename hashclose() to finalize_hashfile() Derrick Stolee
2018-04-02 20:34     ` [PATCH v7 02/14] csum-file: refactor finalize_hashfile() method Derrick Stolee
2018-04-07 22:59       ` Jakub Narebski
2018-04-02 20:34     ` [PATCH v7 03/14] commit-graph: add format document Derrick Stolee
2018-04-07 23:49       ` Jakub Narebski
2018-04-02 20:34     ` [PATCH v7 04/14] graph: add commit graph design document Derrick Stolee
2018-04-08 11:06       ` Jakub Narebski
2018-04-02 20:34     ` [PATCH v7 05/14] commit-graph: create git-commit-graph builtin Derrick Stolee
2018-04-02 20:34     ` [PATCH v7 06/14] commit-graph: implement write_commit_graph() Derrick Stolee
2018-04-02 20:34     ` [PATCH v7 07/14] commit-graph: implement git-commit-graph write Derrick Stolee
2018-04-08 11:59       ` Jakub Narebski
2018-04-02 20:34     ` [PATCH v7 08/14] commit-graph: implement git commit-graph read Derrick Stolee
2018-04-02 21:33       ` Junio C Hamano
2018-04-03 11:49         ` Derrick Stolee
2018-04-08 12:59       ` Jakub Narebski
2018-04-02 20:34     ` [PATCH v7 09/14] commit-graph: add core.commitGraph setting Derrick Stolee
2018-04-08 13:39       ` Jakub Narebski
2018-04-02 20:34     ` [PATCH v7 10/14] commit-graph: close under reachability Derrick Stolee
2018-04-02 20:34     ` [PATCH v7 11/14] commit: integrate commit graph with commit parsing Derrick Stolee
2018-04-02 20:34     ` [PATCH v7 12/14] commit-graph: read only from specific pack-indexes Derrick Stolee
2018-04-02 20:34     ` [PATCH v7 13/14] commit-graph: build graph from starting commits Derrick Stolee
2018-04-08 13:50       ` Jakub Narebski
2018-04-02 20:34     ` [PATCH v7 14/14] commit-graph: implement "--additive" option Derrick Stolee
2018-04-05  8:27       ` SZEDER Gábor
2018-04-10 12:55     ` [PATCH v8 00/14] Serialized Git Commit Graph Derrick Stolee
2018-04-10 12:55       ` [PATCH v8 01/14] csum-file: rename hashclose() to finalize_hashfile() Derrick Stolee
2018-04-10 12:55       ` [PATCH v8 02/14] csum-file: refactor finalize_hashfile() method Derrick Stolee
2018-04-10 12:55       ` [PATCH v8 03/14] commit-graph: add format document Derrick Stolee
2018-04-10 19:10         ` Stefan Beller
2018-04-10 19:18           ` Derrick Stolee
2018-04-11 20:58         ` Jakub Narebski
2018-04-12 11:28           ` Derrick Stolee
2018-04-13 22:07             ` Jakub Narebski
2018-04-10 12:55       ` [PATCH v8 04/14] graph: add commit graph design document Derrick Stolee
2018-04-15 22:48         ` Jakub Narebski
2018-04-10 12:55       ` [PATCH v8 05/14] commit-graph: create git-commit-graph builtin Derrick Stolee
2018-04-10 12:56       ` [PATCH v8 06/14] commit-graph: implement write_commit_graph() Derrick Stolee
2018-04-10 12:56       ` [PATCH v8 07/14] commit-graph: implement git-commit-graph write Derrick Stolee
2018-04-10 12:56       ` [PATCH v8 08/14] commit-graph: implement git commit-graph read Derrick Stolee
2018-04-14 22:15         ` Jakub Narebski
2018-04-15  3:26           ` Eric Sunshine
2018-04-10 12:56       ` [PATCH v8 09/14] commit-graph: add core.commitGraph setting Derrick Stolee
2018-04-14 18:33         ` Jakub Narebski
2018-04-10 12:56       ` [PATCH v8 10/14] commit-graph: close under reachability Derrick Stolee
2018-04-10 12:56       ` [PATCH v8 11/14] commit: integrate commit graph with commit parsing Derrick Stolee
2018-04-10 12:56       ` [PATCH v8 12/14] commit-graph: read only from specific pack-indexes Derrick Stolee
2018-04-10 12:56       ` [PATCH v8 13/14] commit-graph: build graph from starting commits Derrick Stolee
2018-04-10 12:56       ` [PATCH v8 14/14] commit-graph: implement "--append" option Derrick Stolee

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180316204945.GA12333@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=avarab@gmail.com \
    --cc=dstolee@microsoft.com \
    --cc=git@jeffhostetler.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=ramsay@ramsayjones.plus.com \
    --cc=sbeller@google.com \
    --cc=stolee@gmail.com \
    --cc=szeder.dev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

git@vger.kernel.org mailing list mirror (one of many)

Archives are clonable:
	git clone --mirror https://public-inbox.org/git
	git clone --mirror http://ou63pmih66umazou.onion/git
	git clone --mirror http://czquwvybam4bgbro.onion/git
	git clone --mirror http://hjrcffqmbrq6wope.onion/git

Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.version-control.git
	nntp://ou63pmih66umazou.onion/inbox.comp.version-control.git
	nntp://czquwvybam4bgbro.onion/inbox.comp.version-control.git
	nntp://hjrcffqmbrq6wope.onion/inbox.comp.version-control.git
	nntp://news.gmane.org/gmane.comp.version-control.git

 note: .onion URLs require Tor: https://www.torproject.org/
       or Tor2web: https://www.tor2web.org/

AGPL code for this site: git clone https://public-inbox.org/ public-inbox