From: Jeff King <peff@peff.net>
To: Taylor Blau <me@ttaylorr.com>
Cc: git@vger.kernel.org, "René Scharfe" <l.s.r@web.de>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Subject: Re: [PATCH 6/6] hash-object: use fsck for object checks
Date: Wed, 18 Jan 2023 21:31:36 -0500
Message-ID: <Y8iriP4T2FQPtBfF@coredump.intra.peff.net>
In-Reply-To: <Y8hlyr0o6gs9omI5@nand.local>

On Wed, Jan 18, 2023 at 04:34:02PM -0500, Taylor Blau wrote:

> That being said, let me play devil's advocate for a second. Do the new
> fsck checks slow anything in hash-object down significantly? If so, then
> it's plausible to imagine a hash-object caller who (a) doesn't use
> `--literally`, but (b) does care about throughput if they're writing a
> large number of objects at once.
> 
> I don't know if such a situation exists, or if these new fsck checks
> even slow hash-object down enough to care. But I didn't catch a
> discussion of this case in your series, so I figured I'd bring it up
> here just in case.

That's a really good point to bring up.

Prior to timing anything, here were my guesses:

  - it won't make a big difference either way because the time is
    dominated by computing sha1 anyway

  - we might actually be a little faster for commits and tags in the new
    code, because it isn't allocating structs for the pointed-to
    objects (trees, parents, etc), nor stuffing them into obj_hash, so
    our total memory usage would be lower.

  - trees may be a little slower, because we're doing more analysis on
    the filenames (sort order, various filesystem-specific checks for
    .git, etc)

And here's what I timed, using linux.git. First I pulled out the raw
object data like so:

  mkdir -p commit tag tree

  git cat-file --batch-all-objects --unordered --batch-check='%(objecttype) %(objectname)' |
  perl -alne 'print $F[1] unless $F[0] eq "blob"' |
  git cat-file --batch |
  perl -ne '
    /(\S+) (\S+) (\d+)/ or die "confusing: $_";
    my $dir = "$2/" . substr($1, 0, 2);
    my $fn = "$dir/" . substr($1, 2);
    mkdir($dir);
    open(my $fh, ">", $fn) or die "open($fn): $!";
    read(STDIN, my $buf, $3) or die "read($3): $!";
    print $fh $buf;
    read(STDIN, $buf, 1); # trailing newline
  '
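To spell out the layout that script produces: each object lands in
<type>/<first two hex digits>/<remaining hex digits>, so the object name
can be recovered from the path alone. Here is a minimal sketch of a
sanity check built on that. The function names are my own (hypothetical,
not part of git), and check_object assumes a git binary on your PATH:

```shell
# Sketch only: these helper names are illustrative, not part of git.

# "commit/ab/cdef..." -> "commit"
type_from_path() {
  printf '%s\n' "${1%%/*}"
}

# "commit/ab/cdef..." -> "abcdef..." (rejoin the two-character fan-out dir)
oid_from_path() {
  rest=${1#*/}
  printf '%s%s\n' "${rest%%/*}" "${rest#*/}"
}

# Succeeds if hashing the file reproduces the object name encoded in its
# path. --literally skips the format checks, so only the hash is compared.
check_object() {
  expect=$(oid_from_path "$1") &&
  actual=$(git hash-object --literally -t "$(type_from_path "$1")" "$1") &&
  [ "$expect" = "$actual" ]
}
```

Feeding a handful of paths from "find commit -type f" through
check_object should be enough to spot a botched extraction before
spending time on benchmarks.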

And then I timed it like this:

  find commit -type f | sort >input
  hyperfine -L v old,new './git.{v} hash-object --stdin-paths -t commit <input'

which yielded:

  Benchmark 1: ./git.old hash-object --stdin-paths -t commit <input
    Time (mean ± σ):      7.264 s ±  0.142 s    [User: 4.129 s, System: 3.043 s]
    Range (min … max):    7.098 s …  7.558 s    10 runs

  Benchmark 2: ./git.new hash-object --stdin-paths -t commit <input
    Time (mean ± σ):      6.832 s ±  0.087 s    [User: 3.848 s, System: 2.901 s]
    Range (min … max):    6.752 s …  7.059 s    10 runs

  Summary
    './git.new hash-object --stdin-paths -t commit <input' ran
      1.06 ± 0.02 times faster than './git.old hash-object --stdin-paths -t commit <input'

So the new code is indeed faster, though really most of the time is
spent reading the data and computing the hash anyway. For comparison,
using --literally drops it to ~6.3s.

And according to massif, peak heap drops from 241MB to 80kB, which is
pretty good.

Trees are definitely slower, though. I reduced the number to fit in my
budget of patience:

  find tree -type f | sort | head -n 200000 >input
  hyperfine -L v old,new './git.{v} hash-object --stdin-paths -t tree <input'

And got:

  Benchmark 1: ./git.old hash-object --stdin-paths -t tree <input
    Time (mean ± σ):      2.470 s ±  0.022 s    [User: 1.902 s, System: 0.549 s]
    Range (min … max):    2.442 s …  2.509 s    10 runs
  
  Benchmark 2: ./git.new hash-object --stdin-paths -t tree <input
    Time (mean ± σ):      3.244 s ±  0.026 s    [User: 2.661 s, System: 0.567 s]
    Range (min … max):    3.215 s …  3.295 s    10 runs
  
  Summary
    './git.old hash-object --stdin-paths -t tree <input' ran
      1.31 ± 0.02 times faster than './git.new hash-object --stdin-paths -t tree <input'

So trees did indeed get a bit slower (and --literally here is ~2.2s).
It's enough to outweigh the benefit from commits getting faster
(especially because there tend to be more trees than commits). But both
effects get diluted by blobs, which have a lot of data to hash and whose
fsck checks are essentially free.
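
For anyone who wants to check the arithmetic behind those summaries,
here is a rough sketch. It assumes the "±" on the ratio comes from
standard first-order error propagation for a quotient of two independent
measurements; I believe that matches what hyperfine does, but treat it
as an assumption. The function names are made up for illustration:

```shell
# speedup SLOW_MEAN SLOW_SD FAST_MEAN FAST_SD
# Prints the ratio of the means and its propagated uncertainty.
speedup() {
  awk -v sm="$1" -v ssd="$2" -v fm="$3" -v fsd="$4" 'BEGIN {
    r = sm / fm
    err = r * sqrt((ssd / sm) ^ 2 + (fsd / fm) ^ 2)
    printf "%.2f ± %.2f\n", r, err
  }'
}

# per_object_us SLOW_SECONDS FAST_SECONDS COUNT
# Prints the extra cost per object in microseconds.
per_object_us() {
  awk -v slow="$1" -v fast="$2" -v n="$3" 'BEGIN {
    printf "%.2f\n", (slow - fast) / n * 1000000
  }'
}

speedup 7.264 0.142 6.832 0.087    # commits: old relative to new
speedup 3.244 0.026 2.470 0.022    # trees: new relative to old
per_object_us 3.244 2.470 200000   # extra microseconds per tree
```

The first two calls reproduce the 1.06 ± 0.02 and 1.31 ± 0.02 figures
from the hyperfine summaries, and the last one puts the tree slowdown at
a little under 4 microseconds per object for this sample.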

So in the end, I think nobody would really care that much. The absolute
numbers are pretty small, and this is already a fairly dumb way to get
objects into your repository. The usual way is via index-pack, and it
already uses the fsck code for its checks. But I do think it was a good
question to explore (plus it found a descriptor leak in hash-object,
which I sent a separate patch for).

-Peff
