git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Jeff King <peff@peff.net>
Cc: git@vger.kernel.org, "Taylor Blau" <me@ttaylorr.com>,
	"René Scharfe" <l.s.r@web.de>
Subject: Re: [PATCH 6/6] hash-object: use fsck for object checks
Date: Wed, 01 Feb 2023 14:08:39 +0100	[thread overview]
Message-ID: <230201.86h6w5r07w.gmgdl@evledraar.gmail.com> (raw)
In-Reply-To: <Y9pgG10dAoQABGXG@coredump.intra.peff.net>


On Wed, Feb 01 2023, Jeff King wrote:

> On Wed, Jan 18, 2023 at 03:44:12PM -0500, Jeff King wrote:
>
>> @@ -2350,12 +2340,13 @@ static int index_mem(struct index_state *istate,
>>  		}
>>  	}
>>  	if (flags & HASH_FORMAT_CHECK) {
>> -		if (type == OBJ_TREE)
>> -			check_tree(buf, size);
>> -		if (type == OBJ_COMMIT)
>> -			check_commit(buf, size);
>> -		if (type == OBJ_TAG)
>> -			check_tag(buf, size);
>> +		struct fsck_options opts = FSCK_OPTIONS_DEFAULT;
>> +
>> +		opts.strict = 1;
>> +		opts.error_func = hash_format_check_report;
>> +		if (fsck_buffer(null_oid(), type, buf, size, &opts))
>> +			die(_("refusing to create malformed object"));
>> +		fsck_finish(&opts);
>>  	}
>
> By the way, I wanted to call out one thing here that nobody mentioned
> during review: we are not checking the return value of fsck_finish().
>
> That is a bit of a weird function. We must call it because it cleans up
> any resources allocated during the fsck_buffer() call. But it also is
> the last chance to fsck any special blobs (like those that are found as
> .gitmodules, etc). We only find out the filenames while looking at the
> enclosing trees, so we queue them and then check the blobs later.
>
> So if we are hashing a blob, that is mostly fine. We will not have the
> blob's name queued as anything special, and so the fsck is a noop.
>
> But if we fsck a tree, and it has a .gitmodules entry pointing to blob
> X, then we would also pull X from the odb and fsck it during this
> "finish" phase.
>
> Which leads me to two diverging lines of thought:
>
>   1. One of my goals with this series is that one could add objects to
>      the repository via "git hash-object -w" and feel confident that no
>      fsck rules were violated, because fsck implements some security
>      checks. In the past when GitHub rolled out security checks this was
>      a major pain, because objects enter repositories not just from
>      pushes, but also from web-client activity (e.g., editing a blob on
>      the website). And since Git had no way to say "fsck just this
>      object", we ended up implementing the fsck checks multiple times,
>      in libgit2 and in some of its calling code.
>
>      So I was hoping that just passing the objects to "hash-object"
>      would be a viable solution. I'm not sure if it is or not. If you
>      just hash a blob, then we'll have no clue it's a .gitmodules file.
>      OTOH, you have to get the matching tree which binds the blob to the
>      .gitmodules path somehow. So if that tree is fsck'd, and then
>      checks the blob during fsck_finish(), that should be enough.
>      Assuming that fsck complains when the pointed-to blob cannot be
>      accessed, which I think it should (because really, incremental
>      pushes face the same problem).
>
>      In which case we really ought to be checking the result of
>      fsck_finish() here and complaining.
>
>   2. We're not checking fsck connectivity here, and that's intentional.
>      So you can "hash-object" a tree that points to blobs that we don't
>      actually have. But if you hash a tree that points a .gitmodules
>      entry at a blob that doesn't exist, then that will fail the fsck
>      (during the finish step). And respecting the fsck_finish() exit
>      code would break that.
>
>      As an addendum, in a regular fsck, many trees might mention the
>      same blob as .gitmodules, and we'll queue that blob to be checked
>      once. But here, we are potentially running a bunch of individual
>      fscks, one per object we hash. So if you had, say, 1000 trees that
>      all mentioned the same blob (because other entries were changing),
>      and you tried to hash them all with "hash-object --stdin-paths" or
>      similar, then we'd fsck that blob 1000 times.
>
>      Which isn't wrong, per se, but seems inefficient. Solving it would
>      require keeping track of what has been checked between calls to
>      index_mem(). Which is kind of awkward, seeing as how low-level it
>      is. It would be a lot more natural if all this checking happened in
>      hash-object itself.
>
> So I dunno. The code above is doing (2), albeit with the inefficiency of
> checking blobs that we might not care about. I kind of think (1) is the
> right thing, though, and anybody who really wants to make trees that
> point to bogus .gitmodules can use --literally.

Aside from the other things you bring up here, it seems wrong to me to
conflate --literally with some sort of "no fsck" or "don't fsck this
collection yet" mode.

Can't we have a "--no-fsck" or similar, which won't do any sort of full
fsck, but also won't accept bogus object types & the like?

Currently I believe (and I haven't had time to carefully review what you
have here) we only need --literally to produce objects that are truly
corrupt when viewed in isolation. E.g. a tag that refers to a bogus
object type etc.

But we have long supported a narrow view of what the fsck checks mean in
that context. E.g. now with "mktag" we'll use the fsck machinery, but
only skin-deep, so you can be referring to a tree which would in turn
fail our checks.

I tend to think that we should be keeping it like that, but documenting
that if you're creating such objects you either need to do it really
carefully, or follow it up with an operation that's guaranteed to fsck
the sum of the objects you've added recursively.

So, rather than teach e.g. "hash-object" to be smart about that we
should e.g. encourage a manual creation of trees/blobs/commits to be
followed-up with a "git push" to a new ref that refers to them, even if
that "git push" is to the repository located in the $PWD.

By doing that we offload the "what's new?" question to the
pack-over-the-wire machinery, which is well tested.

Anything else seems ultimately to be madness, after all if I feed a
newly crafted commit to "hash-object" how do we know where to stop,
other than essentially faking up a push negotiation with ourselves?

It's also worth noting that much of the complexity around .gitmodules in
particular is to support packfile-uri's odd notion of applying the
"last" part of the PACK before the "first" part, which nothing else
does.

Which, if we just blindly applied both, and then fsck'd the resulting
combination we'd get rid of that tricky special-case. But I haven't
benchmarked that. It should be a bit slower, particularly on a large
repository that won't fit in memory. But my hunch is that it won't be
too bad, and the resulting simplification may be worth it (particularly
now that we have bundle-uri, which doesn't share that edge-case).


  reply	other threads:[~2023-02-01 13:20 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-01-18 20:35 [RFC/PATCH 0/6] hash-object: use fsck to check objects Jeff King
2023-01-18 20:35 ` [PATCH 1/6] t1007: modernize malformed object tests Jeff King
2023-01-18 21:13   ` Taylor Blau
2023-01-18 20:35 ` [PATCH 2/6] t1006: stop using 0-padded timestamps Jeff King
2023-01-18 20:36 ` [PATCH 3/6] t7030: stop using invalid tag name Jeff King
2023-01-18 20:41 ` [PATCH 4/6] t: use hash-object --literally when created malformed objects Jeff King
2023-01-18 21:19   ` Taylor Blau
2023-01-19  2:06     ` Jeff King
2023-01-18 20:43 ` [PATCH 5/6] fsck: provide a function to fsck buffer without object struct Jeff King
2023-01-18 21:24   ` Taylor Blau
2023-01-19  2:07     ` Jeff King
2023-01-18 20:44 ` [PATCH 6/6] hash-object: use fsck for object checks Jeff King
2023-01-18 21:34   ` Taylor Blau
2023-01-19  2:31     ` Jeff King
2023-02-01 12:50   ` Jeff King
2023-02-01 13:08     ` Ævar Arnfjörð Bjarmason [this message]
2023-02-01 20:41     ` Junio C Hamano
2023-01-18 20:46 ` [RFC/PATCH 0/6] hash-object: use fsck to check objects Jeff King
2023-01-18 20:59 ` Junio C Hamano
2023-01-18 21:38   ` Taylor Blau
2023-01-19  2:03     ` Jeff King
2023-01-19  1:39 ` Jeff King
2023-01-19 23:13   ` [PATCH 7/6] fsck: do not assume NUL-termination of buffers Jeff King
2023-01-19 23:58     ` Junio C Hamano
2023-01-21  9:36   ` [RFC/PATCH 0/6] hash-object: use fsck to check objects René Scharfe
2023-01-22  7:48     ` Jeff King
2023-01-22 11:39       ` René Scharfe
2023-02-01 14:06         ` Ævar Arnfjörð Bjarmason

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=230201.86h6w5r07w.gmgdl@evledraar.gmail.com \
    --to=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=l.s.r@web.de \
    --cc=me@ttaylorr.com \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).