git@vger.kernel.org mailing list mirror (one of many)
From: Jeff King <peff@peff.net>
To: SZEDER Gábor <szeder.dev@gmail.com>
Cc: Junio C Hamano <gitster@pobox.com>, Ævar Arnfjörð Bjarmason <avarab@gmail.com>, Derrick Stolee <stolee@gmail.com>, Git mailing list <git@vger.kernel.org>, Stefan Beller <sbeller@google.com>, Ramsay Jones <ramsay@ramsayjones.plus.com>, git@jeffhostetler.com, Derrick Stolee <dstolee@microsoft.com>
Subject: Re: [PATCH v6 00/14] Serialized Git Commit Graph
Date: Fri, 16 Mar 2018 16:06:39 -0400
Message-ID: <20180316200639.GA1845@sigill.intra.peff.net>
In-Reply-To: <CAM0VKjmVgiWsqo8rQWwP9+mEq0tLinc8xoUM=8XdMP3VTBwJxw@mail.gmail.com>

On Fri, Mar 16, 2018 at 08:48:49PM +0100, SZEDER Gábor wrote:

> I came up with a different explanation back then: we are only interested
> in commit objects when creating the commit graph, and only a small-ish
> fraction of all objects are commit objects, so the "enumerate objects in
> packfiles" approach has to look at a lot more objects:
> 
>   # in my git fork
>   $ git rev-list --all --objects |cut -d' ' -f1 |\
>     git cat-file --batch-check='%(objecttype) %(objectsize)' >type-size
>   $ grep -c ^commit type-size
>   53754
>   $ wc -l type-size
>   244723 type-size
> 
> I.e. only about 20% of all objects are commit objects.
> 
> Furthermore, in order to look at an object it has to be zlib inflated
> first, and since commit objects tend to be much smaller than trees and
> especially blobs, there are a lot less bytes to inflate:
> 
>   $ grep ^commit type-size |cut -d' ' -f2 |avg
>   34395730 / 53754 = 639
>   $ cat type-size |cut -d' ' -f2 |avg
>   3866685744 / 244723 = 15800
> 
> So a simple revision walk inflates less than 1% of the bytes that the
> "enumerate objects packfiles" approach has to inflate.

I don't think this is quite accurate. It's true that we have to
_consider_ every object, but Git is smart enough not to inflate each one
to find its type. For loose objects we just inflate the header. For
packed objects, we either pick the type directly out of the packfile
header (for a non-delta) or can walk the delta chain (without actually
looking at the data bytes!) until we hit the base.
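
To illustrate why the packed case is cheap: in the pack format, each
entry starts with a header byte whose bits 4-6 hold the object type, so
the type is known before any zlib stream is touched. A minimal sketch of
that decoding, where pack_entry_type is a hypothetical helper name and
the argument is the first header byte as a decimal integer (this is not
how Git's own code is organized, just the bit layout):

```shell
# Hypothetical helper: map the first byte of a pack entry header
# (a decimal integer, 0-255) to an object type name.
# Bit 7 is the "size continues" flag, bits 4-6 are the type,
# bits 0-3 are the low bits of the uncompressed size.
pack_entry_type() {
	byte=$1
	case $(( (byte >> 4) & 7 )) in
	1) echo commit ;;
	2) echo tree ;;
	3) echo blob ;;
	4) echo tag ;;
	6) echo ofs-delta ;;
	7) echo ref-delta ;;
	*) echo unknown ;;
	esac
}
```

For the two delta types, the real type comes from the base object, found
by following the chain of headers as described above. In an actual
repository, something like
git cat-file --batch-all-objects --batch-check='%(objecttype) %(deltabase)'
reports the resolved type and the delta base without printing any
object contents.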

So starting from scratch:

  git cat-file --batch-all-objects --batch-check='%(objecttype) %(objectname)' |
  grep ^commit |
  cut -d' ' -f2 |
  git cat-file --batch

is in the same ballpark for most repos as:

  git rev-list --all |
  git cat-file --batch

though in my timings the traversal is a little bit faster (and I'd
expect that to remain the case when doing it all in a single process,
since the traversal only follows commit links, whereas processing the
object list has to do the type lookup for each object before deciding
whether to inflate it).
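
A rough way to try this comparison in a local clone is sketched below.
The function names (commits_only, enumerate_commits, walk_commits) are
made up for illustration, and the awk filter simply stands in for the
grep/cut pair above:

```shell
# Hypothetical helper: keep only the commit names from
# "<type> <name>" lines on stdin.
commits_only() {
	awk '$1 == "commit" { print $2 }'
}

# Enumeration strategy: look up the type of every object in the
# repository, then keep the commits.
enumerate_commits() {
	git cat-file --batch-all-objects \
		--batch-check='%(objecttype) %(objectname)' |
	commits_only
}

# Traversal strategy: follow commit links from all refs.
walk_commits() {
	git rev-list --all
}
```

Running each of "enumerate_commits | git cat-file --batch >/dev/null"
and "walk_commits | git cat-file --batch >/dev/null" under a timer gives
numbers to compare. Note that the two lists need not be identical:
the enumeration also reports commits that are no longer reachable from
any ref, while the traversal only sees reachable ones.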

I'm not sure, though, if that edge would remain for incremental updates.
For instance, after we take in some new objects via "fetch", the
traversal strategy would want to do something like:

  git rev-list $new_tips --not --all |
  git cat-file --batch

whose performance will depend on the refs _currently_ in the repository,
as we load them as UNINTERESTING tips for the walk. Whereas doing:

  git show-index <.git/objects/pack/the-one-new-packfile.idx |
  cut -d' ' -f2 |
  git cat-file --batch-check='%(objecttype) %(objectname)' |
  grep ^commit |
  cut -d' ' -f2 |
  git cat-file --batch

always scales exactly with the number of new objects, independent of how
many refs the repository has (obviously that's kind of baroque and this
would all be done internally, but I'm trying to demonstrate the
algorithmic complexity). I'm not sure what the plan would be if we
explode loose objects, though. ;)
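
The pack-index pipeline above can be wrapped up as a small sketch. Both
function names are hypothetical, and it assumes the usual
"offset name (crc)" line layout printed by git show-index:

```shell
# Hypothetical helper: pull the object name (second field) out of
# "git show-index" output lines of the form "<offset> <name> (<crc>)".
idx_object_names() {
	cut -d' ' -f2
}

# Hypothetical helper: print the commits contained in one pack, given
# the path to its .idx file. The work is proportional to the number
# of objects in that pack, not to the number of refs.
commits_in_pack() {
	git show-index <"$1" |
	idx_object_names |
	git cat-file --batch-check='%(objecttype) %(objectname)' |
	awk '$1 == "commit" { print $2 }'
}
```

Something like "commits_in_pack .git/objects/pack/the-one-new-packfile.idx"
would then list the candidate commits for an incremental update, and the
traversal equivalent remains "git rev-list $new_tips --not --all".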

-Peff



