git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / mirror / code / Atom feed
From: Junio C Hamano <gitster@pobox.com>
To: Jeff King <peff@peff.net>
Cc: John Cai <johncai86@gmail.com>, git <git@vger.kernel.org>,
	Christian Couder <christian.couder@gmail.com>
Subject: Re: [PATCH 1/3] fsck: free tree buffers after walking unreachable objects
Date: Thu, 22 Sep 2022 11:40:05 -0700	[thread overview]
Message-ID: <xmqqa66rz20q.fsf@gitster.g> (raw)
In-Reply-To: <Yyw031PqCyYlEqCX@coredump.intra.peff.net> (Jeff King's message of "Thu, 22 Sep 2022 06:11:43 -0400")

Jeff King <peff@peff.net> writes:

> After calling fsck_walk(), a tree object struct may be left in the
> parsed state, with the full tree contents available via tree->buffer.
> It's the responsibility of the caller to free these when it's done with
> the object to avoid having many trees allocated at once.
>
> In a regular "git fsck", we hit fsck_walk() only from fsck_obj(), which
> does call free_tree_buffer(). Likewise for "--connectivity-only", we see
> most objects via traverse_one_object(), which makes a similar call.
>
> The exception is in mark_unreachable_referents(). When using both
> "--connectivity-only" and "--dangling" (the latter of which is the
> default), we walk all of the unreachable objects, and there we forget to
> free. Most cases would not notice this, because they don't have a lot of
> unreachable objects, but you can make a pathological case like this:
>
>   git clone --bare /path/to/linux.git repo.git
>   cd repo.git
>   rm packed-refs ;# now everything is unreachable!
>   git fsck --connectivity-only
>
> That ends up with peak heap usage ~18GB, which is (not coincidentally)
> close to the size of all uncompressed trees in the repository. After
> this patch, the peak heap is only ~2GB.
>
> A few things to note:
>
>   - it might seem like fsck_walk(), if it is parsing the trees, should
>     be responsible for freeing them. But the situation is quite tricky.
>     In the non-connectivity mode, after we call fsck_walk() we then
>     proceed with fsck_object() which actually does the type-specific
>     sanity checks on the object contents. We do pass our own separate
>     buffer to fsck_object(), but there's a catch: our earlier call to
>     parse_object_buffer() may have attached that buffer to the object
>     struct! So by freeing it, we leave the rest of the code with a
>     dangling pointer.
>
>     Likewise, the call to fsck_walk() in index-pack is subtle. It
>     attaches a buffer to the tree object that must not be freed! And
>     so rather than calling free_tree_buffer(), it actually detaches it
>     by setting tree->buffer to NULL.
>
>     These cases would _probably_ be fixable by having fsck_walk() free
>     the tree buffer only when it was the one who allocated it via
>     parse_tree(). But that would still leave the callers responsible for
>     freeing other cases, so they wouldn't be simplified. While the
>     current semantics for fsck_walk() make it easy to accidentally leak
>     in new callers, at least they are simple to explain, and it's not a
>     function that's likely to get a lot of new call-sites.
>
>     And in any case, it's probably sensible to fix the leak first with
>     this simple patch, and try any more complicated refactoring
>     separately.
>
>   - a careful reader may notice that fsck_obj() also frees commit
>     buffers, but neither the call in traverse_one_object() nor the one
>     touched in this patch does so. And indeed, this is another problem
>     for --connectivity-only (and accounts for most of the 2GB heap after
>     this patch), but it's one we'll fix in a separate commit.
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  builtin/fsck.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/builtin/fsck.c b/builtin/fsck.c
> index f7916f06ed..34e575a170 100644
> --- a/builtin/fsck.c
> +++ b/builtin/fsck.c
> @@ -228,6 +228,8 @@ static void mark_unreachable_referents(const struct object_id *oid)
>  
>  	options.walk = mark_used;
>  	fsck_walk(obj, NULL, &options);
> +	if (obj->type == OBJ_TREE)
> +		free_tree_buffer((struct tree *)obj);
>  }

Unlike codepaths like mark_object(), which uses the REACHABLE bit to
avoid the walker coming into an already marked objects, we have no
protection that says "this tree is already marked as USED, so lets
not go into its contents" (it would be a disaster if we free tree
buffer here and then later end up calling the function on the same
tree), but it is OK because this is an unreachable object nobody
points at and we will never come back?

  reply	other threads:[~2022-09-22 18:41 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-09-20 19:27 [INVESTIGATION] why is fsck --connectivity-only so much more expensive than rev-list --objects --all? John Cai
2022-09-20 20:41 ` Jeff King
2022-09-22 10:09   ` [PATCH 0/3] reducing fsck memory usage Jeff King
2022-09-22 10:11     ` [PATCH 1/3] fsck: free tree buffers after walking unreachable objects Jeff King
2022-09-22 18:40       ` Junio C Hamano [this message]
2022-09-22 18:58         ` Jeff King
2022-09-22 19:27           ` Junio C Hamano
2022-09-22 22:16             ` Jeff King
2022-09-22 10:13     ` [PATCH 2/3] fsck: turn off save_commit_buffer Jeff King
2022-09-22 10:15     ` [PATCH 3/3] parse_object_buffer(): respect save_commit_buffer Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=xmqqa66rz20q.fsf@gitster.g \
    --to=gitster@pobox.com \
    --cc=christian.couder@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=johncai86@gmail.com \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).