git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Taylor Blau <me@ttaylorr.com>
To: Jeff King <peff@peff.net>
Cc: "Taylor Blau" <me@ttaylorr.com>, "Patrick Steinhardt" <ps@pks.im>,
	git@vger.kernel.org,
	"Felipe Contreras" <felipe.contreras@gmail.com>,
	"SZEDER Gábor" <szeder.dev@gmail.com>,
	"Chris Torek" <chris.torek@gmail.com>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"Junio C Hamano" <gitster@pobox.com>
Subject: Re: [PATCH v2 3/3] connected: implement connectivity check using bitmaps
Date: Fri, 2 Jul 2021 13:44:12 -0400	[thread overview]
Message-ID: <YN9QbEaWgP09PfeD@nand.local> (raw)
In-Reply-To: <YNwE3wES3iv+Xynp@coredump.intra.peff.net>

On Wed, Jun 30, 2021 at 01:45:03AM -0400, Jeff King wrote:
> That implies to me that yes, it really is saving time, especially in the
> cold-cache case. But if you have to do any actual fill-in traversal, the
> benefits get totally lost in the noise. _Especially_ in the cold-cache
> case, because then we're faulting in the actual object data, etc.

That's definitely true. I would say that any patches in this direction
would have the general sense of "this helps in some cases where we don't
have to do much traversal by eliminating an unnecessarily eager part of
loading bitmaps, and does not make anything worse when the bitmap
coverage is incomplete (and requires traversing)."

I would add that these effects change with the size of the bitmap.
Let's just consider the "count the number of objects in a bitmapped
commit". On my local copy of the kernel, I see a relatively modest
improvement:

    $ tip=2ab38c17aac10bf55ab3efde4c4db3893d8691d2
    $ hyperfine \
      'GIT_READ_COMMIT_TABLE=0 git.compile rev-list --count --objects --use-bitmap-index $tip' \
      'GIT_READ_COMMIT_TABLE=1 git.compile rev-list --count --objects --use-bitmap-index $tip' \
      --warmup=3
    Benchmark #1: GIT_READ_COMMIT_TABLE=0 git.compile rev-list --count --objects --use-bitmap-index $tip
      Time (mean ± σ):      21.5 ms ±   5.6 ms    [User: 8.7 ms, System: 12.7 ms]
      Range (min … max):    12.4 ms …  34.2 ms    170 runs

    Benchmark #2: GIT_READ_COMMIT_TABLE=1 git.compile rev-list --count --objects --use-bitmap-index $tip
      Time (mean ± σ):      10.6 ms ±   1.6 ms    [User: 7.1 ms, System: 3.5 ms]
      Range (min … max):     4.5 ms …  11.9 ms    258 runs

but on my copy of the kernel's fork network repo (that containing all of
torvalds/linux's objects, as well as all of its fork's objects, too),
the magnitude of the effect is much bigger:

    Benchmark #1: GIT_READ_COMMIT_TABLE=0 git.compile rev-list --count --objects --use-bitmap-index $tip
      Time (mean ± σ):     332.3 ms ±  12.6 ms    [User: 210.4 ms, System: 121.8 ms]
      Range (min … max):   322.7 ms … 362.4 ms    10 runs

    Benchmark #2: GIT_READ_COMMIT_TABLE=1 git.compile rev-list --count --objects --use-bitmap-index $tip
      Time (mean ± σ):     260.0 ms ±   9.3 ms    [User: 191.0 ms, System: 69.0 ms]
      Range (min … max):   250.4 ms … 272.8 ms    11 runs

That's a more modest 1.28x improvement (versus 2.03x in just linux.git),
but the overall magnitude is much bigger.

This is basically an effect of the bitmaps themselves. In the latter
example, there are more bitmaps (around 1.6k of them, versus just over
500 in my copy of just the kernel), and each of them are much wider
(because there are far more objects, 40.2M versus just 7.8M). So there
is more work to do, and the page cache is less efficient for us because
we can't fit as much of the .bitmap file in the page cache at once.

> By the way, one other thing I noticed is that having a fully-build
> commit-graph also made a big difference (much bigger than this patch),
> especially for the non-bitmapped commit. Which makes sense, since it is
> saving us from loading commit objects from disk during fill-in
> traversal.

Likewise having an reverse index helps a lot, too. That radix sort
scales linearly with the number of objects in the bitmapped pack (plus
you're paying the cost to allocate more heap, etc).

This clouded up some of my timings in p5310, which made me think that it
would be a good idea to `git config pack.writeReverseIndex true` in the
setup for those tests, but an even better direction would be to change
the default of pack.writeReverseIndex to true everywhere.

> So I dunno. There's absolutely savings for some cases, but I suspect in
> practice it's not going to really be noticeable. Part of me says "well,
> if it ever provides a benefit and there isn't a downside, why not?". So
> just devil's advocating on downsides for a moment:
>
>   - there's some extra complexity in the file format and code to read
>     and write these (and still fall back to the old system when they're
>     absent). I don't think it's a deal-breaker, as it's really not that
>     complicated a feature.

I agree with both of these. The complexity is manageable, I think,
especially since I dropped support for the extended offset table (having
a bitmap file that is >2GiB seems extremely unlikely to me, and it's
possible to add support for it in the future) and
fanout table (there are usually less than <1k commits with bitmaps, so
a 256-entry fanout table doesn't seem to help much in benchmarking).

So what's left of the format is really just:

  - a table of object id's
  - a table of (uint32_t, uint32_t) tuples describing the (short) offset
    of the bitmap, and an index position of the xor'd bitmap (if one
    exists).

>   - we're using extra bytes on disk (and the associated cold block cache
>     effects there). It's not very many bytes, though (I guess 20 for the
>     hash, plus a few offset integers; if we wanted to really
>     penny-pinch, we could probably store 32-bit pointers to the hashes
>     in the associated .idx file, at the cost of an extra level of
>     indirection while binary searching). But that is only for a few
>     hundred commits that are bitmapped, not all of them. And it's
>     balanced by not needing to allocate a similar in-memory lookup table
>     in each command. So it's probably a net win.

Worth benchmarking, at least.

I'll be offline for the next ~week and a half for my wedding, but I'll
post some patches to the list shortly after I get back.

Thanks,
Taylor

  reply	other threads:[~2021-07-02 17:44 UTC|newest]

Thread overview: 64+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-06-28  5:33 [PATCH v2 0/3] Speed up connectivity checks via bitmaps Patrick Steinhardt
2021-06-28  5:33 ` [PATCH v2 1/3] p5400: add perf tests for git-receive-pack(1) Patrick Steinhardt
2021-06-28  7:49   ` Ævar Arnfjörð Bjarmason
2021-06-29  6:18     ` Patrick Steinhardt
2021-06-29 12:09       ` Ævar Arnfjörð Bjarmason
2021-06-28  5:33 ` [PATCH v2 2/3] receive-pack: skip connectivity checks on delete-only commands Patrick Steinhardt
2021-06-28  8:00   ` Ævar Arnfjörð Bjarmason
2021-06-28  8:06     ` Ævar Arnfjörð Bjarmason
2021-06-29  6:26     ` Patrick Steinhardt
2021-06-30  1:31   ` Jeff King
2021-06-30  1:35     ` Jeff King
2021-06-30 13:52     ` Patrick Steinhardt
2021-06-28  5:33 ` [PATCH v2 3/3] connected: implement connectivity check using bitmaps Patrick Steinhardt
2021-06-28 20:23   ` Taylor Blau
2021-06-29 22:44     ` Taylor Blau
2021-06-30  2:04       ` Jeff King
2021-06-30  3:07         ` Taylor Blau
2021-06-30  5:45           ` Jeff King
2021-07-02 17:44             ` Taylor Blau [this message]
2021-07-02 21:21               ` Jeff King
2021-06-30  1:51   ` Jeff King
2021-07-20 14:26     ` Patrick Steinhardt
2021-08-02  9:37 ` [PATCH v3 0/4] Speed up connectivity checks Patrick Steinhardt
2021-08-02  9:38   ` [PATCH v3 1/4] connected: do not sort input revisions Patrick Steinhardt
2021-08-02 12:49     ` Ævar Arnfjörð Bjarmason
2021-08-03  8:50       ` Patrick Steinhardt
2021-08-04 11:01         ` Ævar Arnfjörð Bjarmason
2021-08-02 19:00     ` Junio C Hamano
2021-08-03  8:55       ` Patrick Steinhardt
2021-08-03 21:47         ` Junio C Hamano
2021-08-02  9:38   ` [PATCH v3 2/4] revision: stop retrieving reference twice Patrick Steinhardt
2021-08-02 12:53     ` Ævar Arnfjörð Bjarmason
2021-08-02  9:38   ` [PATCH v3 3/4] revision: avoid loading object headers multiple times Patrick Steinhardt
2021-08-02 12:55     ` Ævar Arnfjörð Bjarmason
2021-08-05 10:12       ` Patrick Steinhardt
2021-08-02 19:40     ` Junio C Hamano
2021-08-03  9:07       ` Patrick Steinhardt
2021-08-06 14:17         ` Patrick Steinhardt
2021-08-02  9:38   ` [PATCH v3 4/4] revision: avoid hitting packfiles when commits are in commit-graph Patrick Steinhardt
2021-08-02 20:01     ` Junio C Hamano
2021-08-03  9:16       ` Patrick Steinhardt
2021-08-03 21:56         ` Junio C Hamano
2021-08-05 11:01           ` Patrick Steinhardt
2021-08-05 16:16             ` Junio C Hamano
2021-08-04 10:51         ` Ævar Arnfjörð Bjarmason
2021-08-05 11:25   ` [PATCH v4 0/6] Speed up connectivity checks Patrick Steinhardt
2021-08-05 11:25     ` [PATCH v4 1/6] revision: separate walk and unsorted flags Patrick Steinhardt
2021-08-05 18:47       ` Junio C Hamano
2021-08-05 11:25     ` [PATCH v4 2/6] connected: do not sort input revisions Patrick Steinhardt
2021-08-05 18:44       ` Junio C Hamano
2021-08-06  6:00         ` Patrick Steinhardt
2021-08-06 16:50           ` Junio C Hamano
2021-08-05 11:25     ` [PATCH v4 3/6] revision: stop retrieving reference twice Patrick Steinhardt
2021-08-05 11:25     ` [PATCH v4 4/6] revision: avoid loading object headers multiple times Patrick Steinhardt
2021-08-05 11:25     ` [PATCH v4 5/6] commit-graph: split out function to search commit position Patrick Steinhardt
2021-08-05 11:25     ` [PATCH v4 6/6] revision: avoid hitting packfiles when commits are in commit-graph Patrick Steinhardt
2021-08-09  8:00 ` [PATCH v5 0/5] Speed up connectivity checks Patrick Steinhardt
2021-08-09  8:02   ` Patrick Steinhardt
2021-08-09  8:11 ` Patrick Steinhardt
2021-08-09  8:11   ` [PATCH v5 1/5] revision: separate walk and unsorted flags Patrick Steinhardt
2021-08-09  8:11   ` [PATCH v5 2/5] connected: do not sort input revisions Patrick Steinhardt
2021-08-09  8:11   ` [PATCH v5 3/5] revision: stop retrieving reference twice Patrick Steinhardt
2021-08-09  8:11   ` [PATCH v5 4/5] commit-graph: split out function to search commit position Patrick Steinhardt
2021-08-09  8:12   ` [PATCH v5 5/5] revision: avoid hitting packfiles when commits are in commit-graph Patrick Steinhardt

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YN9QbEaWgP09PfeD@nand.local \
    --to=me@ttaylorr.com \
    --cc=avarab@gmail.com \
    --cc=chris.torek@gmail.com \
    --cc=felipe.contreras@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=peff@peff.net \
    --cc=ps@pks.im \
    --cc=szeder.dev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).