git@vger.kernel.org mailing list mirror (one of many)
From: Jeff King <peff@peff.net>
To: Junio C Hamano <gitster@pobox.com>
Cc: Taylor Blau <me@ttaylorr.com>, git@vger.kernel.org, stolee@gmail.com
Subject: Re: [PATCH 3/3] commit-graph.c: handle corrupt/missing trees
Date: Fri, 6 Sep 2019 13:28:52 -0400
Message-ID: <20190906172851.GC23181@sigill.intra.peff.net>
In-Reply-To: <xmqqo8zxnz0m.fsf@gitster-ct.c.googlers.com>

On Fri, Sep 06, 2019 at 09:57:29AM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > This is sort-of attributable to my 834876630b (get_commit_tree(): return
> > NULL for broken tree, 2019-04-09). Before then it was a BUG(). However,
> > that state was relatively short-lived. Before 7b8a21dba1 (commit-graph:
> > lazy-load trees for commits, 2018-04-06), we'd have similarly returned
> > NULL (and anyway, BUG() is clearly wrong since it's a data error).
> >
> > None of which argues against your patches, but it's kind of sad that the
> > issue is present in so many code paths. I wonder if we could be handling
> > this in a more central way, but I don't see how short of dying.
> 
> Well, either we explicitly die in here, or let the caller segfault.
> Is there even a single caller that is prepared to react to NULL?
> [...]
> So, after fixing the above, we may safely be able to die inside
> get_commit_tree() instead of returning NULL.

I think the one alternative is catching this more reliably during the
parse phase. And then callers have the option of handling the error
_then_, without forcing every downstream user of the struct to
re-validate it.
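
To make the shape of that concrete, here is a minimal sketch of
"validate once at parse time". It is toy code under stated assumptions:
toy_commit, toy_tree, toy_parse_commit() and toy_commit_tree() are
hypothetical stand-ins, not Git's real structs or functions. The parse
step reports the error, and the accessor only asserts the invariant
instead of re-checking it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for illustration; not Git's real types. */
struct toy_tree {
	const char *oid;
};

struct toy_commit {
	int parsed;            /* set only after a successful parse */
	struct toy_tree *tree; /* non-NULL whenever parsed is set */
};

/* A tiny fake "object store" holding the trees we know about. */
static struct toy_tree toy_known_trees[] = { { "aaaa" }, { "bbbb" } };

static struct toy_tree *toy_find_tree(const char *oid)
{
	size_t i;
	size_t nr = sizeof(toy_known_trees) / sizeof(toy_known_trees[0]);

	for (i = 0; i < nr; i++)
		if (!strcmp(toy_known_trees[i].oid, oid))
			return &toy_known_trees[i];
	return NULL;
}

/*
 * Parse step: returns 0 on success, -1 if the tree cannot be resolved.
 * On failure the commit is left unparsed, so the caller can decide
 * right here whether to die, skip the commit, or report an error.
 */
int toy_parse_commit(struct toy_commit *c, const char *tree_oid)
{
	struct toy_tree *t = toy_find_tree(tree_oid);

	if (!t)
		return -1;
	c->tree = t;
	c->parsed = 1;
	return 0;
}

/*
 * Downstream accessor: it does not re-validate. It relies on the
 * invariant established by toy_parse_commit() and only asserts it.
 */
struct toy_tree *toy_commit_tree(const struct toy_commit *c)
{
	assert(c->parsed && c->tree);
	return c->tree;
}
```

A caller that gets -1 from toy_parse_commit() can return the error up
the stack, and any code that later calls toy_commit_tree() on a
successfully parsed commit never has to consider NULL.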

We _could_ add the die() there to catch any stragglers. But that does
make it harder for somebody to try to examine the error situation, or
gracefully return an error up the stack. Maybe that use case really
doesn't have any value. I dunno. This case did BUG() until recently, and
we did run into it in the real world. But the problem wasn't that the
operation didn't succeed, but rather the BUG(). I don't know of any code
path where the caller doesn't simply die().

>     Answer. There is a single hit inside fsck.c that wants to report
>     an error without killing ourselves in fsck_commit_buffer().  I
>     however doubt its use of get_commit_tree() is correct in the
>     first place.  The function is about validating the commit object
>     payload manually, without trusting the result of parse_commit(),
>     and it does read the object name of the tree object; the call to
>     get_commit_tree() used for reporting the error there should
>     probably become has_object() on the tree_oid.

I actually think that check should be removed entirely. That function is
about checking the syntactic validity of the object itself, not about
connectivity (which is handled separately). We already check that we
have a valid "tree" pointer earlier in the function.

The current get_commit_tree() check is doing essentially nothing.
parse_commit() would have parsed the same thing we already checked, and
the lookup_tree() call it uses to fill in the pointer is not reliable
(it would only fail if we happened to have seen the same oid already as
a non-tree in the same process).
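
A toy model of why that lookup is unreliable as a validity check. This
is a simplified sketch, not Git's actual object table: toy_lookup(),
toy_table and the toy_type enum are made-up names. The point it shows
is that an oid the process has never seen gets a fresh in-memory entry
of the requested type, so the call "succeeds" without consulting any
storage, and it fails only on a type conflict with an earlier lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical object types for this sketch. */
enum toy_type { TOY_BLOB, TOY_TREE };

struct toy_object {
	char oid[41];
	enum toy_type type;
};

/* In-memory table of objects seen so far in this process. */
#define TOY_MAX_OBJECTS 16
static struct toy_object toy_table[TOY_MAX_OBJECTS];
static size_t toy_nr;

/*
 * Look up "oid" as an object of type "want".
 *  - Seen before with the same type: return the existing entry.
 *  - Seen before with a different type: return NULL (the only
 *    data-related failure this function can detect).
 *  - Never seen: create a new entry of the requested type, without
 *    checking whether such an object really exists anywhere.
 */
struct toy_object *toy_lookup(const char *oid, enum toy_type want)
{
	size_t i;

	for (i = 0; i < toy_nr; i++) {
		if (!strcmp(toy_table[i].oid, oid))
			return toy_table[i].type == want ? &toy_table[i] : NULL;
	}

	/* Limits of the toy table only; not part of the idea. */
	if (toy_nr == TOY_MAX_OBJECTS || strlen(oid) >= sizeof(toy_table[0].oid))
		return NULL;

	strcpy(toy_table[toy_nr].oid, oid);
	toy_table[toy_nr].type = want;
	return &toy_table[toy_nr++];
}
```

So toy_lookup("1234", TOY_TREE) returns non-NULL even if no object
"1234" exists at all, while the same call returns NULL only when
"1234" was previously looked up as a blob.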

The history is interesting here. In the early days fsck-cache actually
did parse the commit object itself. Then ff5ebe39b0 ([PATCH] Port
fsck-cache to use parsing functions, 2005-04-18) converted it to use
parse_commit(). Then de2eb7f694 (git-fsck-cache.c: check commit objects
more carefully, 2005-07-27) went back to parsing it ourselves, but left
the struct checks in place.

We also look at commit->parents, but seemingly only to compare them to
grafts (and that's weird itself, because grafts aren't a property of the
object at all, and it seems like at best this is just verifying that we
correctly loaded the grafts).

> By the way, I think get_commit_tree() and parse_commit() in fsck
> should always use the value obtained from the underlying object and
> bypass any caches like commit graph---if they pay attention to the
> caches, they should be fixed.  Secondary caches like commit graph
> should of course be validated against what are recorded in the
> underlying object, but that should be done separately.

Agreed. Probably fsck should just be disabling the commit graph for the
whole process (it looks like there's an env variable for this, but no
internal global, which is what fsck would want).
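
Roughly the kind of thing meant by an internal global, as a sketch
only. Every name here is an assumption made up for illustration
(toy_commit_graph_disabled, toy_should_use_commit_graph(), and the
TOY_DISABLE_COMMIT_GRAPH variable); none of them is Git's actual
variable or function. A process-wide flag is consulted before the
environment, so a command like fsck could flip it once at startup.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical process-wide switch. A command that must read the
 * underlying objects directly would set this once, early on.
 */
int toy_commit_graph_disabled;

/*
 * Returns 1 if the (toy) commit graph may be used, 0 otherwise.
 * The in-process flag wins; the made-up environment variable is
 * the fallback for callers outside the process.
 */
int toy_should_use_commit_graph(void)
{
	const char *env;

	if (toy_commit_graph_disabled)
		return 0;

	env = getenv("TOY_DISABLE_COMMIT_GRAPH");
	if (env && *env && strcmp(env, "0"))
		return 0;

	return 1;
}
```

With something like this, the fsck entry point would set the flag to 1
before parsing any commits, and nothing would need to be exported into
the environment.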

-Peff
