git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jeff King <peff@peff.net>
To: John Cai <johncai86@gmail.com>
Cc: git <git@vger.kernel.org>, Christian Couder <christian.couder@gmail.com>
Subject: Re: [INVESTIGATION] why is fsck --connectivity-only so much more expensive than rev-list --objects --all?
Date: Tue, 20 Sep 2022 16:41:51 -0400	[thread overview]
Message-ID: <YyoljwDIn7PxRlC9@coredump.intra.peff.net> (raw)
In-Reply-To: <9304B938-4A59-456B-B091-DBBCAA1823B2@gmail.com>

On Tue, Sep 20, 2022 at 03:27:29PM -0400, John Cai wrote:

> One observation is that `git-fsck` seems to use up alot more memory in calling
> unpack_compressed_entry() than `git-rev-list` does.

That just means it's data coming from a pack. The interesting parts are
further up the stack, I think:

> | | ->40.48% (826,445,820B) 0x2B2447: unpack_compressed_entry (packfile.c:1601)
> | | | ->40.48% (826,445,820B) 0x2B46AB: unpack_entry (packfile.c:1768)
> | | | | ->40.48% (826,445,820B) 0x2B4A23: cache_or_unpack_entry (packfile.c:1438)
> | | | |   ->40.48% (826,445,820B) 0x2B4A23: packed_object_info (packfile.c:1516)
> | | | |     ->40.48% (826,445,820B) 0x29EA83: do_oid_object_info_extended (object-file.c:1620)
> | | | |       ->40.48% (826,445,820B) 0x29EBDB: oid_object_info_extended (object-file.c:1639)
> | | | |         ->40.48% (826,445,820B) 0x29EDC3: read_object (object-file.c:1671)
> | | | |           ->40.48% (826,445,820B) 0x29EE5B: read_object_file_extended (object-file.c:1714)
> | | | |             ->40.12% (819,056,147B) 0x2143BB: repo_read_object_file (object-store.h:253)
> | | | |             | ->40.12% (819,056,147B) 0x2143BB: repo_parse_commit_internal (commit.c:511)
> | | | |             |   ->40.12% (819,056,147B) 0x2143BB: repo_parse_commit_internal (commit.c:495)
> | | | |             |     ->40.12% (819,056,147B) 0x25A747: repo_parse_commit (commit.h:90)
> | | | |             |       ->40.12% (819,056,147B) 0x25A747: fsck_walk_commit (fsck.c:355)

OK, so we're holding onto commit buffers after parsing them. We probably
should tell the parser not to do that. It's useful for git-log, etc,
which are going to pretty-print the buffer, but not for this use. You
can set the global save_commit_buffer to do that.

> | | ->08.61% (175,870,114B) 0x2B9DC3: patch_delta (patch-delta.c:36)
> | | | ->08.61% (175,870,114B) 0x2B42EB: unpack_entry (packfile.c:1829)
> | | |   ->08.61% (175,870,114B) 0x2B4A23: cache_or_unpack_entry (packfile.c:1438)
> | | |     ->08.61% (175,870,114B) 0x2B4A23: packed_object_info (packfile.c:1516)
> | | |       ->08.61% (175,870,114B) 0x29EA83: do_oid_object_info_extended (object-file.c:1620)
> | | |         ->08.61% (175,870,114B) 0x29EBDB: oid_object_info_extended (object-file.c:1639)
> | | |           ->08.61% (175,870,114B) 0x29EDC3: read_object (object-file.c:1671)
> | | |             ->08.61% (175,870,114B) 0x29EE5B: read_object_file_extended (object-file.c:1714)
> | | |               ->04.52% (92,219,588B) 0x3461CB: repo_read_object_file (object-store.h:253)
> | | |               | ->04.52% (92,219,588B) 0x3461CB: parse_tree_gently.part.0 (tree.c:132)
> | | |               |   ->04.52% (92,219,588B) 0x25A4EB: parse_tree (tree.h:24)
> | | |               |     ->04.52% (92,219,588B) 0x25A4EB: fsck_walk_tree (fsck.c:307)

And this is data we've allocated for trees. It kind of looks like
fsck_walk_tree() never bothers to clean up the trees it parses, leaving
the buffers attached to the tree structs. But that can't be the case,
because linux.git has something like 16GB of trees. These may be entries
we keep in the internal delta cache, though it should be a bit smaller
than what you have here (the default is 96MB; you can drop it with
core.deltaBaseCacheLimit, but runtime may suffer).

There's a call to free_tree_buffer() in builtin/fsck.c:traverse_one();
that may be what ends up freeing things. It's been a while since I've
traced through the call paths for fsck.

> But, not having delved too deeply into the code, I wanted to ask the list if
> anything jumps out as to why `git-fsck` consumes so much more memory than
> `git-rev-list`--perhaps there is opportunity for improvement/optimization?

The patch below improves my peak heap (as reported by massif) for "git
fsck --connectivity-only" on linux.git from ~2GB to ~1GB.

But in general, I think "rev-list" is a better tool for connectivity
checks. One problem with fsck is that it tries to read the set of
available objects up front, making it racy in a repository which is
receiving pushes, or which may be repacked (e.g., if somebody makes a
new object and references it, "fsck --connectivity-only" may fail to
notice the new object but will see the updated ref, and think the ref is
corrupt).

diff --git a/builtin/fsck.c b/builtin/fsck.c
index f7916f06ed..949073a00d 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -853,6 +853,7 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
 
 	errors_found = 0;
 	read_replace_refs = 0;
+	save_commit_buffer = 0;
 
 	argc = parse_options(argc, argv, prefix, fsck_opts, fsck_usage, 0);
 

-Peff

  reply	other threads:[~2022-09-20 20:42 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-09-20 19:27 [INVESTIGATION] why is fsck --connectivity-only so much more expensive than rev-list --objects --all? John Cai
2022-09-20 20:41 ` Jeff King [this message]
2022-09-22 10:09   ` [PATCH 0/3] reducing fsck memory usage Jeff King
2022-09-22 10:11     ` [PATCH 1/3] fsck: free tree buffers after walking unreachable objects Jeff King
2022-09-22 18:40       ` Junio C Hamano
2022-09-22 18:58         ` Jeff King
2022-09-22 19:27           ` Junio C Hamano
2022-09-22 22:16             ` Jeff King
2022-09-22 10:13     ` [PATCH 2/3] fsck: turn off save_commit_buffer Jeff King
2022-09-22 10:15     ` [PATCH 3/3] parse_object_buffer(): respect save_commit_buffer Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YyoljwDIn7PxRlC9@coredump.intra.peff.net \
    --to=peff@peff.net \
    --cc=christian.couder@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=johncai86@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).