* [RFC/PATCH 0/6] hash-object: use fsck to check objects
From: Jeff King @ 2023-01-18 20:35 UTC
To: git; +Cc: René Scharfe, Ævar Arnfjörð Bjarmason

Right now "git hash-object" will do some basic sanity checks of the
input using the usual parser code. This series teaches it to use the
fsck code instead, which should catch more things. See patch 6 for some
discussion of the implications.

The reason this is marked as an RFC is that at the end, compiling with
SANITIZE=address will provoke a failure in t3800. The issue is that
fsck_tag_standalone(), when fed a buffer/size combo, will look for a NUL
at the end of the headers, which might be buffer[size]. This is usually
OK for objects we've loaded from the odb, because we intentionally stick
an extra NUL at the end for safety. But here index_mem() may get an
arbitrary buffer.

I'm not sure yet of the right path forward. It's not too hard to add an
extra NUL in most cases, but one code path will mmap a file on disk. And
sticking a NUL there is hard (we already went down that road trying to
avoid REG_STARTEND for grep, and there wasn't a good solution).

The other option is having the fsck code avoid looking past the size it
was given. I think the intent is that this should work, from commits
like 4d0d89755e (Make sure fsck_commit_buffer() does not run out of the
buffer, 2014-09-11). We do use skip_prefix() and parse_oid_hex(), which
won't respect the size, but I think[1] that's OK because we'll have
parsed up to the end-of-header beforehand (and those functions would
never match past there).

Which would mean that 9a1a3a4d4c (mktag: allow omitting the header/body
\n separator, 2021-01-05) and acf9de4c94 (mktag: use fsck instead of
custom verify_tag(), 2021-01-05) were buggy, and we can just fix them.

[1] But I said "I think" above because it can get pretty subtle. There's
    some more discussion in this thread:

      https://lore.kernel.org/git/20150625155128.C3E9738005C@gemini.denx.de/

    but I haven't yet convinced myself it's safe. This is exactly the
    kind of analysis I wish I had the power to nerd-snipe René into.

Anyway, here are the patches in the meantime. I do think this is a good
direction overall, modulo addressing the NUL-terminator question.

  [1/6]: t1007: modernize malformed object tests
  [2/6]: t1006: stop using 0-padded timestamps
  [3/6]: t7030: stop using invalid tag name
  [4/6]: t: use hash-object --literally when created malformed objects
  [5/6]: fsck: provide a function to fsck buffer without object struct
  [6/6]: hash-object: use fsck for object checks

 fsck.c                           | 29 ++++++++++-------
 fsck.h                           |  8 +++++
 object-file.c                    | 55 +++++++++++++-------------------
 t/t1006-cat-file.sh              |  6 ++--
 t/t1007-hash-object.sh           | 29 +++++++++++------
 t/t1450-fsck.sh                  | 28 ++++++++--------
 t/t4054-diff-bogus-tree.sh       |  2 +-
 t/t4058-diff-duplicates.sh       |  2 +-
 t/t4212-log-corrupt.sh           |  4 +--
 t/t5302-pack-index.sh            |  2 +-
 t/t5504-fetch-receive-strict.sh  |  2 +-
 t/t5702-protocol-v2.sh           |  2 +-
 t/t6300-for-each-ref.sh          |  2 +-
 t/t7030-verify-tag.sh            |  2 +-
 t/t7031-verify-tag-signed-ssh.sh |  2 +-
 t/t7509-commit-authorship.sh     |  2 +-
 t/t7510-signed-commit.sh         |  2 +-
 t/t7528-signed-commit-ssh.sh     |  2 +-
 t/t8003-blame-corner-cases.sh    |  2 +-
 t/t9350-fast-export.sh           |  2 +-
 20 files changed, 101 insertions(+), 84 deletions(-)

-Peff
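
To make the "avoid looking past the size it was given" option concrete, here is a minimal sketch in C of a header scan that never reads buffer[size]. It is not the code in fsck.c, and the helper name find_header_end() is hypothetical; it only illustrates that locating the end of the headers is possible with bounded reads.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not taken from fsck.c: return the offset just
 * past the blank line that ends the headers, or `size` if the buffer
 * has no blank line. Only buf[0] .. buf[size - 1] are ever read, so
 * buf[size] does not have to be a NUL (or even be mapped).
 */
static size_t find_header_end(const char *buf, size_t size)
{
	size_t pos = 0;

	assert(buf != NULL || size == 0);
	while (pos < size) {
		const char *eol;

		if (buf[pos] == '\n')
			return pos + 1; /* empty line: headers end here */
		eol = memchr(buf + pos, '\n', size - pos);
		if (!eol)
			return size; /* last header line is unterminated */
		pos = (size_t)(eol - buf) + 1;
	}
	return size;
}
```

Whether every helper called after such a scan also stays within the size is the subtle part raised in the footnote; the sketch does not settle that question.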
* [PATCH 1/6] t1007: modernize malformed object tests
From: Jeff King @ 2023-01-18 20:35 UTC
To: git; +Cc: René Scharfe, Ævar Arnfjörð Bjarmason

The tests in t1007 for detecting malformed objects have two
anachronisms:

  - they use "sha1" instead of "oid" in variable names, even though the
    script as a whole has been adapted to handle sha256

  - they use test_i18ngrep, which is no longer necessary

Since we'll be adding a new similar test, let's clean these up so they
are all consistently using the modern style.

Signed-off-by: Jeff King <peff@peff.net>
---
 t/t1007-hash-object.sh | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/t/t1007-hash-object.sh b/t/t1007-hash-object.sh
index ac5ad8c740..2d2148d8fa 100755
--- a/t/t1007-hash-object.sh
+++ b/t/t1007-hash-object.sh
@@ -203,23 +203,23 @@ done
 test_expect_success 'too-short tree' '
 	echo abc >malformed-tree &&
 	test_must_fail git hash-object -t tree malformed-tree 2>err &&
-	test_i18ngrep "too-short tree object" err
+	grep "too-short tree object" err
 '
 
 test_expect_success 'malformed mode in tree' '
-	hex_sha1=$(echo foo | git hash-object --stdin -w) &&
-	bin_sha1=$(echo $hex_sha1 | hex2oct) &&
-	printf "9100644 \0$bin_sha1" >tree-with-malformed-mode &&
+	hex_oid=$(echo foo | git hash-object --stdin -w) &&
+	bin_oid=$(echo $hex_oid | hex2oct) &&
+	printf "9100644 \0$bin_oid" >tree-with-malformed-mode &&
 	test_must_fail git hash-object -t tree tree-with-malformed-mode 2>err &&
-	test_i18ngrep "malformed mode in tree entry" err
+	grep "malformed mode in tree entry" err
 '
 
 test_expect_success 'empty filename in tree' '
-	hex_sha1=$(echo foo | git hash-object --stdin -w) &&
-	bin_sha1=$(echo $hex_sha1 | hex2oct) &&
-	printf "100644 \0$bin_sha1" >tree-with-empty-filename &&
+	hex_oid=$(echo foo | git hash-object --stdin -w) &&
+	bin_oid=$(echo $hex_oid | hex2oct) &&
+	printf "100644 \0$bin_oid" >tree-with-empty-filename &&
 	test_must_fail git hash-object -t tree tree-with-empty-filename 2>err &&
-	test_i18ngrep "empty filename in tree entry" err
+	grep "empty filename in tree entry" err
 '
 
 test_expect_success 'corrupt commit' '
-- 
2.39.1.616.gd06fca9e99
* Re: [PATCH 1/6] t1007: modernize malformed object tests
From: Taylor Blau @ 2023-01-18 21:13 UTC
To: Jeff King; +Cc: git, René Scharfe, Ævar Arnfjörð Bjarmason

On Wed, Jan 18, 2023 at 03:35:30PM -0500, Jeff King wrote:
> The tests in t1007 for detecting malformed objects have two
> anachronisms:
>
>   - they use "sha1" instead of "oid" in variable names, even though the
>     script as a whole has been adapted to handle sha256

I appreciate you saying that we should s/sha1/oid here. But more
importantly, thanks for drawing attention to the fact that this script
already handles sha256, and that the update is purely cosmetic.

> ---
>  t/t1007-hash-object.sh | 18 +++++++++---------
>  1 file changed, 9 insertions(+), 9 deletions(-)

These look obviously correct.

Thanks,
Taylor
* [PATCH 2/6] t1006: stop using 0-padded timestamps
From: Jeff King @ 2023-01-18 20:35 UTC
To: git; +Cc: René Scharfe, Ævar Arnfjörð Bjarmason

The fake objects in t1006 use dummy timestamps like "0000000000 +0000".
While this does make them look more like normal timestamps (which,
unless it is 1970, have many digits), it actually violates our fsck
checks, which complain about zero-padded timestamps.

This doesn't currently break anything, but let's future-proof our tests
against a version of hash-object which is a little more careful about
its input. We don't actually care about the exact values here (and in
fact, the helper functions in this script end up removing the timestamps
anyway, so we don't even have to adjust other parts of the tests).

Signed-off-by: Jeff King <peff@peff.net>
---
 t/t1006-cat-file.sh | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
index 23b8942edb..2d875b17d8 100755
--- a/t/t1006-cat-file.sh
+++ b/t/t1006-cat-file.sh
@@ -292,8 +292,8 @@ commit_message="Initial commit"
 commit_sha1=$(echo_without_newline "$commit_message" | git commit-tree $tree_sha1)
 commit_size=$(($(test_oid hexsz) + 137))
 commit_content="tree $tree_sha1
-author $GIT_AUTHOR_NAME <$GIT_AUTHOR_EMAIL> 0000000000 +0000
-committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> 0000000000 +0000
+author $GIT_AUTHOR_NAME <$GIT_AUTHOR_EMAIL> 0 +0000
+committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> 0 +0000
 
 $commit_message"
 
@@ -304,7 +304,7 @@ type blob
 tag hellotag
 tagger $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL>"
 tag_description="This is a tag"
-tag_content="$tag_header_without_timestamp 0000000000 +0000
+tag_content="$tag_header_without_timestamp 0 +0000
 
 $tag_description"
 
-- 
2.39.1.616.gd06fca9e99
* [PATCH 3/6] t7030: stop using invalid tag name
From: Jeff King @ 2023-01-18 20:36 UTC
To: git; +Cc: René Scharfe, Ævar Arnfjörð Bjarmason

We intentionally invalidate the signature of a tag by switching its tag
name from "seventh" to "7th forged". However, the latter is not a valid
tag name because it contains a space. This doesn't currently affect the
test, but we're better off using something syntactically valid. That
reduces the number of possible failure modes in the test, and
future-proofs us if git hash-object gets more picky about its input.

The t7031 script, which was mostly copied from t7030, has the same
problem, so we'll fix it, too.

Signed-off-by: Jeff King <peff@peff.net>
---
 t/t7030-verify-tag.sh            | 2 +-
 t/t7031-verify-tag-signed-ssh.sh | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/t/t7030-verify-tag.sh b/t/t7030-verify-tag.sh
index 10faa64515..6f526c37c2 100755
--- a/t/t7030-verify-tag.sh
+++ b/t/t7030-verify-tag.sh
@@ -115,7 +115,7 @@ test_expect_success GPGSM 'verify and show signatures x509 with high minTrustLev
 
 test_expect_success GPG 'detect fudged signature' '
 	git cat-file tag seventh-signed >raw &&
-	sed -e "/^tag / s/seventh/7th forged/" raw >forged1 &&
+	sed -e "/^tag / s/seventh/7th-forged/" raw >forged1 &&
 	git hash-object -w -t tag forged1 >forged1.tag &&
 	test_must_fail git verify-tag $(cat forged1.tag) 2>actual1 &&
 	grep "BAD signature from" actual1 &&
diff --git a/t/t7031-verify-tag-signed-ssh.sh b/t/t7031-verify-tag-signed-ssh.sh
index 1cb36b9ab8..36eb86a4b1 100755
--- a/t/t7031-verify-tag-signed-ssh.sh
+++ b/t/t7031-verify-tag-signed-ssh.sh
@@ -125,7 +125,7 @@ test_expect_success GPGSSH,GPGSSH_VERIFYTIME 'verify-tag failes with tag date ou
 test_expect_success GPGSSH 'detect fudged ssh signature' '
 	test_config gpg.ssh.allowedSignersFile "${GPGSSH_ALLOWED_SIGNERS}" &&
 	git cat-file tag seventh-signed >raw &&
-	sed -e "/^tag / s/seventh/7th forged/" raw >forged1 &&
+	sed -e "/^tag / s/seventh/7th-forged/" raw >forged1 &&
 	git hash-object -w -t tag forged1 >forged1.tag &&
 	test_must_fail git verify-tag $(cat forged1.tag) 2>actual1 &&
 	grep "${GPGSSH_BAD_SIGNATURE}" actual1 &&
-- 
2.39.1.616.gd06fca9e99
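
As a rough illustration of why "7th forged" is rejected as a tag name while "7th-forged" is fine, here is a deliberately simplified C sketch. It is not Git's refname checker: the function name tag_name_looks_valid() is hypothetical, and it covers only the forbidden-character part of the rules described in git-check-ref-format(1), ignoring the others (such as "..", "@{", or a trailing ".lock").

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical, partial check: reject empty names and names containing
 * a space, a control character, DEL, or one of ~ ^ : ? * [ \
 * Names that pass here may still be invalid under the full rules.
 */
static bool tag_name_looks_valid(const char *name)
{
	const char *p;

	assert(name != NULL);
	if (!*name)
		return false;
	for (p = name; *p; p++) {
		unsigned char c = (unsigned char)*p;

		if (c <= ' ' || c == 0x7f)
			return false; /* control characters, space, DEL */
		if (strchr("~^:?*[\\", c))
			return false; /* characters reserved by ref syntax */
	}
	return true;
}
```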
* [PATCH 4/6] t: use hash-object --literally when created malformed objects
From: Jeff King @ 2023-01-18 20:41 UTC
To: git; +Cc: René Scharfe, Ævar Arnfjörð Bjarmason

Many test scripts use hash-object to create malformed objects to see how
we handle the results in various commands. In some cases we already have
to use "hash-object --literally", because it does some rudimentary
quality checks. But let's use "--literally" more consistently to
future-proof these tests against hash-object learning to be more
careful.

Signed-off-by: Jeff King <peff@peff.net>
---
This patch is worth looking at because it shows the kinds of things the
new hash-object from patch 6 will reject. Most of these are obviously
terrible things that we'd want to complain about, like broken emails,
embedded NULs, and so on.

The most contentious one is probably a tag without a tagger line, which
were generated by early versions of Git (e.g., see Git's v0.99 tag).
This is an "info" in fsck (which is semantically like a warning, except
transfer.fsckObjects treats warnings as errors due to hysterical
raisins). But the hash-object change in patch 6 will reject it, because
it operates in strict mode. That seems reasonable to me, since we're
helping users avoid doing bad things, and not dealing with existing
objects.

 t/t1450-fsck.sh                 | 28 ++++++++++++++--------------
 t/t4054-diff-bogus-tree.sh      |  2 +-
 t/t4058-diff-duplicates.sh      |  2 +-
 t/t4212-log-corrupt.sh          |  4 ++--
 t/t5302-pack-index.sh           |  2 +-
 t/t5504-fetch-receive-strict.sh |  2 +-
 t/t5702-protocol-v2.sh          |  2 +-
 t/t6300-for-each-ref.sh         |  2 +-
 t/t7509-commit-authorship.sh    |  2 +-
 t/t7510-signed-commit.sh        |  2 +-
 t/t7528-signed-commit-ssh.sh    |  2 +-
 t/t8003-blame-corner-cases.sh   |  2 +-
 t/t9350-fast-export.sh          |  2 +-
 13 files changed, 27 insertions(+), 27 deletions(-)

diff --git a/t/t1450-fsck.sh b/t/t1450-fsck.sh
index de0f6d5e7f..fdb886dfe4 100755
--- a/t/t1450-fsck.sh
+++ b/t/t1450-fsck.sh
@@ -212,7 +212,7 @@ test_expect_success 'email without @ is okay' '
 test_expect_success 'email with embedded > is not okay' '
 	git cat-file commit HEAD >basis &&
 	sed "s/@[a-z]/&>/" basis >bad-email &&
-	new=$(git hash-object -t commit -w --stdin <bad-email) &&
+	new=$(git hash-object --literally -t commit -w --stdin <bad-email) &&
 	test_when_finished "remove_object $new" &&
 	git update-ref refs/heads/bogus "$new" &&
 	test_when_finished "git update-ref -d refs/heads/bogus" &&
@@ -223,7 +223,7 @@ test_expect_success 'email with embedded > is not okay' '
 test_expect_success 'missing < email delimiter is reported nicely' '
 	git cat-file commit HEAD >basis &&
 	sed "s/<//" basis >bad-email-2 &&
-	new=$(git hash-object -t commit -w --stdin <bad-email-2) &&
+	new=$(git hash-object --literally -t commit -w --stdin <bad-email-2) &&
 	test_when_finished "remove_object $new" &&
 	git update-ref refs/heads/bogus "$new" &&
 	test_when_finished "git update-ref -d refs/heads/bogus" &&
@@ -234,7 +234,7 @@ test_expect_success 'missing < email delimiter is reported nicely' '
 test_expect_success 'missing email is reported nicely' '
 	git cat-file commit HEAD >basis &&
 	sed "s/[a-z]* <[^>]*>//" basis >bad-email-3 &&
-	new=$(git hash-object -t commit -w --stdin <bad-email-3) &&
+	new=$(git hash-object --literally -t commit -w --stdin <bad-email-3) &&
 	test_when_finished "remove_object $new" &&
 	git update-ref refs/heads/bogus "$new" &&
 	test_when_finished "git update-ref -d refs/heads/bogus" &&
@@ -245,7 +245,7 @@ test_expect_success 'missing email is reported nicely' '
 test_expect_success '> in name is reported' '
 	git cat-file commit HEAD >basis &&
 	sed "s/ </> </" basis >bad-email-4 &&
-	new=$(git hash-object -t commit -w --stdin <bad-email-4) &&
+	new=$(git hash-object --literally -t commit -w --stdin <bad-email-4) &&
 	test_when_finished "remove_object $new" &&
 	git update-ref refs/heads/bogus "$new" &&
 	test_when_finished "git update-ref -d refs/heads/bogus" &&
@@ -258,7 +258,7 @@ test_expect_success 'integer overflow in timestamps is reported' '
 	git cat-file commit HEAD >basis &&
 	sed "s/^\\(author .*>\\) [0-9]*/\\1 18446744073709551617/" \
 		<basis >bad-timestamp &&
-	new=$(git hash-object -t commit -w --stdin <bad-timestamp) &&
+	new=$(git hash-object --literally -t commit -w --stdin <bad-timestamp) &&
 	test_when_finished "remove_object $new" &&
 	git update-ref refs/heads/bogus "$new" &&
 	test_when_finished "git update-ref -d refs/heads/bogus" &&
@@ -269,7 +269,7 @@ test_expect_success 'integer overflow in timestamps is reported' '
 test_expect_success 'commit with NUL in header' '
 	git cat-file commit HEAD >basis &&
 	sed "s/author ./author Q/" <basis | q_to_nul >commit-NUL-header &&
-	new=$(git hash-object -t commit -w --stdin <commit-NUL-header) &&
+	new=$(git hash-object --literally -t commit -w --stdin <commit-NUL-header) &&
 	test_when_finished "remove_object $new" &&
 	git update-ref refs/heads/bogus "$new" &&
 	test_when_finished "git update-ref -d refs/heads/bogus" &&
@@ -292,7 +292,7 @@ test_expect_success 'tree object with duplicate entries' '
 			git cat-file tree $T &&
 			git cat-file tree $T
 		) |
-		git hash-object -w -t tree --stdin
+		git hash-object --literally -w -t tree --stdin
 	) &&
 	test_must_fail git fsck 2>out &&
 	test_i18ngrep "error in tree .*contains duplicate file entries" out
@@ -426,7 +426,7 @@ test_expect_success 'tag with incorrect tag name & missing tagger' '
 	This is an invalid tag.
 	EOF
 
-	tag=$(git hash-object -t tag -w --stdin <wrong-tag) &&
+	tag=$(git hash-object --literally -t tag -w --stdin <wrong-tag) &&
 	test_when_finished "remove_object $tag" &&
 	echo $tag >.git/refs/tags/wrong &&
 	test_when_finished "git update-ref -d refs/tags/wrong" &&
@@ -558,7 +558,7 @@ test_expect_success 'rev-list --verify-objects with commit graph (parent)' '
 test_expect_success 'force fsck to ignore double author' '
 	git cat-file commit HEAD >basis &&
 	sed "s/^author .*/&,&/" <basis | tr , \\n >multiple-authors &&
-	new=$(git hash-object -t commit -w --stdin <multiple-authors) &&
+	new=$(git hash-object --literally -t commit -w --stdin <multiple-authors) &&
 	test_when_finished "remove_object $new" &&
 	git update-ref refs/heads/bogus "$new" &&
 	test_when_finished "git update-ref -d refs/heads/bogus" &&
@@ -573,7 +573,7 @@ test_expect_success 'fsck notices blob entry pointing to null sha1' '
 	(git init null-blob &&
 	 cd null-blob &&
 	 sha=$(printf "100644 file$_bz$_bzoid" |
-	       git hash-object -w --stdin -t tree) &&
+	       git hash-object --literally -w --stdin -t tree) &&
 	  git fsck 2>out &&
 	  test_i18ngrep "warning.*null sha1" out
 	)
@@ -583,7 +583,7 @@ test_expect_success 'fsck notices submodule entry pointing to null sha1' '
 	(git init null-commit &&
 	 cd null-commit &&
 	 sha=$(printf "160000 submodule$_bz$_bzoid" |
-	       git hash-object -w --stdin -t tree) &&
+	       git hash-object --literally -w --stdin -t tree) &&
 	  git fsck 2>out &&
 	  test_i18ngrep "warning.*null sha1" out
 	)
@@ -648,7 +648,7 @@ test_expect_success 'NUL in commit' '
 		git commit --allow-empty -m "initial commitQNUL after message" &&
 		git cat-file commit HEAD >original &&
 		q_to_nul <original >munged &&
-		git hash-object -w -t commit --stdin <munged >name &&
+		git hash-object --literally -w -t commit --stdin <munged >name &&
 		git branch bad $(cat name) &&
 
 		test_must_fail git -c fsck.nulInCommit=error fsck 2>warn.1 &&
@@ -794,8 +794,8 @@ test_expect_success 'fsck errors in packed objects' '
 	git cat-file commit HEAD >basis &&
 	sed "s/</one/" basis >one &&
 	sed "s/</foo/" basis >two &&
-	one=$(git hash-object -t commit -w one) &&
-	two=$(git hash-object -t commit -w two) &&
+	one=$(git hash-object --literally -t commit -w one) &&
+	two=$(git hash-object --literally -t commit -w two) &&
 	pack=$(
 		{
 			echo $one &&
diff --git a/t/t4054-diff-bogus-tree.sh b/t/t4054-diff-bogus-tree.sh
index 294fb55313..05c88f8cdf 100755
--- a/t/t4054-diff-bogus-tree.sh
+++ b/t/t4054-diff-bogus-tree.sh
@@ -10,7 +10,7 @@ test_expect_success 'create bogus tree' '
 	bogus_tree=$(
 		printf "100644 fooQ$name" |
 		q_to_nul |
-		git hash-object -w --stdin -t tree
+		git hash-object --literally -w --stdin -t tree
 	)
 '
 
diff --git a/t/t4058-diff-duplicates.sh b/t/t4058-diff-duplicates.sh
index 54614b814d..2501c89c1c 100755
--- a/t/t4058-diff-duplicates.sh
+++ b/t/t4058-diff-duplicates.sh
@@ -29,7 +29,7 @@ make_tree () {
 		make_tree_entry "$1" "$2" "$3"
 		shift; shift; shift
 	done |
-	git hash-object -w -t tree --stdin
+	git hash-object --literally -w -t tree --stdin
 }
 
 # this is kind of a convoluted setup, but matches
diff --git a/t/t4212-log-corrupt.sh b/t/t4212-log-corrupt.sh
index 30a219894b..e89e1f54b6 100755
--- a/t/t4212-log-corrupt.sh
+++ b/t/t4212-log-corrupt.sh
@@ -10,7 +10,7 @@ test_expect_success 'setup' '
 
 	git cat-file commit HEAD |
 	sed "/^author /s/>/>-<>/" >broken_email.commit &&
-	git hash-object -w -t commit broken_email.commit >broken_email.hash &&
+	git hash-object --literally -w -t commit broken_email.commit >broken_email.hash &&
 	git update-ref refs/heads/broken_email $(cat broken_email.hash)
 '
 
@@ -46,7 +46,7 @@ test_expect_success 'git log --format with broken author email' '
 munge_author_date () {
 	git cat-file commit "$1" >commit.orig &&
 	sed "s/^\(author .*>\) [0-9]*/\1 $2/" <commit.orig >commit.munge &&
-	git hash-object -w -t commit commit.munge
+	git hash-object --literally -w -t commit commit.munge
 }
 
 test_expect_success 'unparsable dates produce sentinel value' '
diff --git a/t/t5302-pack-index.sh b/t/t5302-pack-index.sh
index b0095ab41d..59e9e77223 100755
--- a/t/t5302-pack-index.sh
+++ b/t/t5302-pack-index.sh
@@ -263,7 +263,7 @@ tag guten tag
 This is an invalid tag.
 EOF
 
-	tag=$(git hash-object -t tag -w --stdin <wrong-tag) &&
+	tag=$(git hash-object -t tag -w --stdin --literally <wrong-tag) &&
 	pack1=$(echo $tag $sha | git pack-objects tag-test) &&
 	echo remove tag object &&
 	thirtyeight=${tag#??} &&
diff --git a/t/t5504-fetch-receive-strict.sh b/t/t5504-fetch-receive-strict.sh
index ac4099ca89..88d3c56750 100755
--- a/t/t5504-fetch-receive-strict.sh
+++ b/t/t5504-fetch-receive-strict.sh
@@ -138,7 +138,7 @@ This commit object intentionally broken
 EOF
 
 test_expect_success 'setup bogus commit' '
-	commit="$(git hash-object -t commit -w --stdin <bogus-commit)"
+	commit="$(git hash-object --literally -t commit -w --stdin <bogus-commit)"
 '
 
 test_expect_success 'fsck with no skipList input' '
diff --git a/t/t5702-protocol-v2.sh b/t/t5702-protocol-v2.sh
index b33cd4afca..e4db7513f4 100755
--- a/t/t5702-protocol-v2.sh
+++ b/t/t5702-protocol-v2.sh
@@ -1114,7 +1114,7 @@ test_expect_success 'packfile-uri with transfer.fsckobjects fails on bad object'
 	This commit object intentionally broken
 	EOF
 
-	BOGUS=$(git -C "$P" hash-object -t commit -w --stdin <bogus-commit) &&
+	BOGUS=$(git -C "$P" hash-object -t commit -w --stdin --literally <bogus-commit) &&
 	git -C "$P" branch bogus-branch "$BOGUS" &&
 
 	echo my-blob >"$P/my-blob" &&
diff --git a/t/t6300-for-each-ref.sh b/t/t6300-for-each-ref.sh
index 2ae1fc721b..c466fd989f 100755
--- a/t/t6300-for-each-ref.sh
+++ b/t/t6300-for-each-ref.sh
@@ -606,7 +606,7 @@ test_expect_success 'create tag without tagger' '
 	git tag -a -m "Broken tag" taggerless &&
 	git tag -f taggerless $(git cat-file tag taggerless |
 		sed -e "/^tagger /d" |
-		git hash-object --stdin -w -t tag)
+		git hash-object --literally --stdin -w -t tag)
 '
 
 test_atom refs/tags/taggerless type 'commit'
diff --git a/t/t7509-commit-authorship.sh b/t/t7509-commit-authorship.sh
index 21c668f75e..5d890949f7 100755
--- a/t/t7509-commit-authorship.sh
+++ b/t/t7509-commit-authorship.sh
@@ -105,7 +105,7 @@ test_expect_success '--amend option with empty author' '
 test_expect_success '--amend option with missing author' '
 	git cat-file commit Initial >tmp &&
 	sed "s/author [^<]* </author </" tmp >malformed &&
-	sha=$(git hash-object -t commit -w malformed) &&
+	sha=$(git hash-object --literally -t commit -w malformed) &&
 	test_when_finished "remove_object $sha" &&
 	git checkout $sha &&
 	test_when_finished "git checkout Initial" &&
diff --git a/t/t7510-signed-commit.sh b/t/t7510-signed-commit.sh
index 8593b7e3cb..bc7a31ba3e 100755
--- a/t/t7510-signed-commit.sh
+++ b/t/t7510-signed-commit.sh
@@ -202,7 +202,7 @@ test_expect_success GPG 'detect fudged signature with NUL' '
 	git cat-file commit seventh-signed >raw &&
 	cat raw >forged2 &&
 	echo Qwik | tr "Q" "\000" >>forged2 &&
-	git hash-object -w -t commit forged2 >forged2.commit &&
+	git hash-object --literally -w -t commit forged2 >forged2.commit &&
 	test_must_fail git verify-commit $(cat forged2.commit) &&
 	git show --pretty=short --show-signature $(cat forged2.commit) >actual2 &&
 	grep "BAD signature from" actual2 &&
diff --git a/t/t7528-signed-commit-ssh.sh b/t/t7528-signed-commit-ssh.sh
index f47e995179..065f780636 100755
--- a/t/t7528-signed-commit-ssh.sh
+++ b/t/t7528-signed-commit-ssh.sh
@@ -270,7 +270,7 @@ test_expect_success GPGSSH 'detect fudged signature with NUL' '
 	git cat-file commit seventh-signed >raw &&
 	cat raw >forged2 &&
 	echo Qwik | tr "Q" "\000" >>forged2 &&
-	git hash-object -w -t commit forged2 >forged2.commit &&
+	git hash-object --literally -w -t commit forged2 >forged2.commit &&
 	test_must_fail git verify-commit $(cat forged2.commit) &&
 	git show --pretty=short --show-signature $(cat forged2.commit) >actual2 &&
 	grep "${GPGSSH_BAD_SIGNATURE}" actual2 &&
diff --git a/t/t8003-blame-corner-cases.sh b/t/t8003-blame-corner-cases.sh
index d751d48b7d..8bcd39e81b 100755
--- a/t/t8003-blame-corner-cases.sh
+++ b/t/t8003-blame-corner-cases.sh
@@ -201,7 +201,7 @@ committer David Reiss <dreiss@facebook.com> 1234567890 +0000
 
 some message
 EOF
-	COMMIT=$(git hash-object -t commit -w badcommit) &&
+	COMMIT=$(git hash-object --literally -t commit -w badcommit) &&
 	git --no-pager blame $COMMIT -- uno >/dev/null
 '
 
diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
index ff21a12ee6..26c25c0eb2 100755
--- a/t/t9350-fast-export.sh
+++ b/t/t9350-fast-export.sh
@@ -373,7 +373,7 @@ EOF
 
 test_expect_success 'cope with tagger-less tags' '
 
-	TAG=$(git hash-object -t tag -w tag-content) &&
+	TAG=$(git hash-object --literally -t tag -w tag-content) &&
 	git update-ref refs/tags/sonnenschein $TAG &&
 	git fast-export -C -C --signed-tags=strip --all > output &&
 	test $(grep -c "^tag " output) = 4 &&
-- 
2.39.1.616.gd06fca9e99
* Re: [PATCH 4/6] t: use hash-object --literally when created malformed objects
From: Taylor Blau @ 2023-01-18 21:19 UTC
To: Jeff King; +Cc: git, René Scharfe, Ævar Arnfjörð Bjarmason

On Wed, Jan 18, 2023 at 03:41:56PM -0500, Jeff King wrote:
> Many test scripts use hash-object to create malformed objects to see how
> we handle the results in various commands. In some cases we already have
> to use "hash-object --literally", because it does some rudimentary
> quality checks. But let's use "--literally" more consistently to
> future-proof these tests against hash-object learning to be more
> careful.

Heh, I suppose this is a good illustration of how loose our checks are
even without `--literally` ;-).

> ---
> This patch is worth looking at because it shows the kinds of things the
> new hash-object from patch 6 will reject.

Obviously we could avoid this patch entirely by making the new behavior
of fscking the incoming objects hidden behind a `--fsck` flag or
something. But I think the decision not to is a good one.

We already have `--literally`, and it makes sense that passing that
should let us write anything, and that not passing it should perform
some validity checks. But I think exactly *what* those checks are is
ambiguous enough that the absence of `--literally` implying fsck checks
isn't out of the question.

You address this in the last patch more thoroughly, but I figure that it
is worth stating some of this here during review to indicate that I
think the direction you pursued here is a good one.

>  t/t1450-fsck.sh                 | 28 ++++++++++++++--------------
>  t/t4054-diff-bogus-tree.sh      |  2 +-
>  t/t4058-diff-duplicates.sh      |  2 +-
>  t/t4212-log-corrupt.sh          |  4 ++--
>  t/t5302-pack-index.sh           |  2 +-
>  t/t5504-fetch-receive-strict.sh |  2 +-
>  t/t5702-protocol-v2.sh          |  2 +-
>  t/t6300-for-each-ref.sh         |  2 +-
>  t/t7509-commit-authorship.sh    |  2 +-
>  t/t7510-signed-commit.sh        |  2 +-
>  t/t7528-signed-commit-ssh.sh    |  2 +-
>  t/t8003-blame-corner-cases.sh   |  2 +-
>  t/t9350-fast-export.sh          |  2 +-
>  13 files changed, 27 insertions(+), 27 deletions(-)

And these all look good, too. Each of the spots you touch here is
limited to replacing "git hash-object" with "git hash-object
--literally".

Thanks,
Taylor
* Re: [PATCH 4/6] t: use hash-object --literally when created malformed objects
From: Jeff King @ 2023-01-19 2:06 UTC
To: Taylor Blau; +Cc: git, René Scharfe, Ævar Arnfjörð Bjarmason

On Wed, Jan 18, 2023 at 04:19:20PM -0500, Taylor Blau wrote:

> > This patch is worth looking at because it shows the kinds of things the
> > new hash-object from patch 6 will reject.
>
> Obviously we could avoid this patch entirely by making the new behavior
> of fscking the incoming objects hidden behind a `--fsck` flag or
> something. But I think the decision not to is a good one.
>
> We already have `--literally`, and it makes sense that passing that
> should let us write anything, and that not passing it should perform
> some validity checks. But I think exactly *what* those checks are is
> ambiguous enough that the absence of `--literally` implying fsck checks
> isn't out of the question.
>
> You address this in the last patch more thoroughly, but I figure that it
> is worth stating some of this here during review to indicate that I
> think the direction you pursued here is a good one.

Thanks for raising this, I think it's a good thing to consider. I didn't
even really think about making it a new option, since we already do
quality checks (and loosen them via --literally). This just seemed like
more of the same.

So yeah, if there were people who really wanted to distinguish between
the severity of the old checks and the new ones, we can add --fsck (or
even default to having it on, and disable it with --no-fsck to get the
old checks). But I just see little point in that.

One thing we _could_ support that my patch doesn't (I think; I didn't
test very deeply here) is respecting individual fsck.msgType config
variables. Again, I don't really see much point there. If you know you
are producing garbage, then just say --literally. The type-specific ones
are useful when you have to hold your nose and accept somebody else's
historical garbage, and you want to limit the damage as much as
possible.

-Peff
* [PATCH 5/6] fsck: provide a function to fsck buffer without object struct
From: Jeff King @ 2023-01-18 20:43 UTC
To: git; +Cc: René Scharfe, Ævar Arnfjörð Bjarmason

The fsck code has been slowly moving away from requiring an object
struct in commits like 103fb6d43b (fsck: accept an oid instead of a
"struct tag" for fsck_tag(), 2019-10-18), c5b4269b57 (fsck: accept an
oid instead of a "struct commit" for fsck_commit(), 2019-10-18), etc.

However, the only external interface that fsck.c provides is
fsck_object(), which requires an object struct, then promptly discards
everything except its oid and type. Let's factor out the post-discard
part of that function as fsck_buffer(), leaving fsck_object() as a thin
wrapper around it. That will provide more flexibility for callers which
may not have a struct.

Signed-off-by: Jeff King <peff@peff.net>
---
This is obviously preparation for the next patch. But I suspect it could
be used elsewhere, too. Regular fsck wants object structs anyway to hold
flags, I think, but index-pack could probably save some memory and
effort by avoiding them. I didn't look too closely, as it's all out of
scope for this series.

 fsck.c | 29 ++++++++++++++++++-----------
 fsck.h |  8 ++++++++
 2 files changed, 26 insertions(+), 11 deletions(-)

diff --git a/fsck.c b/fsck.c
index 47eaeedd70..c2c8facd2d 100644
--- a/fsck.c
+++ b/fsck.c
@@ -1237,19 +1237,26 @@ int fsck_object(struct object *obj, void *data, unsigned long size,
 	if (!obj)
 		return report(options, NULL, OBJ_NONE, FSCK_MSG_BAD_OBJECT_SHA1, "no valid object to fsck");
 
-	if (obj->type == OBJ_BLOB)
-		return fsck_blob(&obj->oid, data, size, options);
-	if (obj->type == OBJ_TREE)
-		return fsck_tree(&obj->oid, data, size, options);
-	if (obj->type == OBJ_COMMIT)
-		return fsck_commit(&obj->oid, data, size, options);
-	if (obj->type == OBJ_TAG)
-		return fsck_tag(&obj->oid, data, size, options);
-
-	return report(options, &obj->oid, obj->type,
+	return fsck_buffer(&obj->oid, obj->type, data, size, options);
+}
+
+int fsck_buffer(const struct object_id *oid, enum object_type type,
+		void *data, unsigned long size,
+		struct fsck_options *options)
+{
+	if (type == OBJ_BLOB)
+		return fsck_blob(oid, data, size, options);
+	if (type == OBJ_TREE)
+		return fsck_tree(oid, data, size, options);
+	if (type == OBJ_COMMIT)
+		return fsck_commit(oid, data, size, options);
+	if (type == OBJ_TAG)
+		return fsck_tag(oid, data, size, options);
+
+	return report(options, oid, type,
 		      FSCK_MSG_UNKNOWN_TYPE,
 		      "unknown type '%d' (internal fsck error)",
-		      obj->type);
+		      type);
 }
 
 int fsck_error_function(struct fsck_options *o,
diff --git a/fsck.h b/fsck.h
index fcecf4101c..668330880e 100644
--- a/fsck.h
+++ b/fsck.h
@@ -183,6 +183,14 @@ int fsck_walk(struct object *obj, void *data, struct fsck_options *options);
 int fsck_object(struct object *obj, void *data, unsigned long size,
 	struct fsck_options *options);
 
+/*
+ * Same as fsck_object(), but for when the caller doesn't have an object
+ * struct.
+ */
+int fsck_buffer(const struct object_id *oid, enum object_type,
+		void *data, unsigned long size,
+		struct fsck_options *options);
+
 /*
  * fsck a tag, and pass info about it back to the caller. This is
  * exposed fsck_object() internals for git-mktag(1).
-- 
2.39.1.616.gd06fca9e99
* Re: [PATCH 5/6] fsck: provide a function to fsck buffer without object struct 2023-01-18 20:43 ` [PATCH 5/6] fsck: provide a function to fsck buffer without object struct Jeff King @ 2023-01-18 21:24 ` Taylor Blau 2023-01-19 2:07 ` Jeff King 0 siblings, 1 reply; 28+ messages in thread From: Taylor Blau @ 2023-01-18 21:24 UTC (permalink / raw) To: Jeff King; +Cc: git, René Scharfe, Ævar Arnfjörð Bjarmason On Wed, Jan 18, 2023 at 03:43:53PM -0500, Jeff King wrote: > However, the only external interface that fsck.c provides is > fsck_object(), which requires an object struct, then promptly discards > everything except its oid and type. Let's factor out the post-discard > part of that function as fsck_buffer(), leaving fsck_object() as a thin > wrapper around it. That will provide more flexibility for callers which > may not have a struct. It's really nice that the only thing we care about having an object struct around for is basically just knowing its type. IOW it seems to have made the refactoring here pretty straightforward, which is nice ;-). Thanks, Taylor ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 5/6] fsck: provide a function to fsck buffer without object struct 2023-01-18 21:24 ` Taylor Blau @ 2023-01-19 2:07 ` Jeff King 0 siblings, 0 replies; 28+ messages in thread From: Jeff King @ 2023-01-19 2:07 UTC (permalink / raw) To: Taylor Blau Cc: git, René Scharfe, Ævar Arnfjörð Bjarmason On Wed, Jan 18, 2023 at 04:24:25PM -0500, Taylor Blau wrote: > On Wed, Jan 18, 2023 at 03:43:53PM -0500, Jeff King wrote: > > However, the only external interface that fsck.c provides is > > fsck_object(), which requires an object struct, then promptly discards > > everything except its oid and type. Let's factor out the post-discard > > part of that function as fsck_buffer(), leaving fsck_object() as a thin > > wrapper around it. That will provide more flexibility for callers which > > may not have a struct. > > It's really nice that the only thing we care about having an object > struct around for is basically just knowing its type. IOW it seems to > have made the refactoring here pretty straightforward, which is nice > ;-). Yeah, it was always in the back of my mind while doing other fsck refactors. But I have to admit that I was surprised that we were so close to the finish line. :) -Peff ^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH 6/6] hash-object: use fsck for object checks 2023-01-18 20:35 [RFC/PATCH 0/6] hash-object: use fsck to check objects Jeff King ` (4 preceding siblings ...) 2023-01-18 20:43 ` [PATCH 5/6] fsck: provide a function to fsck buffer without object struct Jeff King @ 2023-01-18 20:44 ` Jeff King 2023-01-18 21:34 ` Taylor Blau 2023-02-01 12:50 ` Jeff King 2023-01-18 20:46 ` [RFC/PATCH 0/6] hash-object: use fsck to check objects Jeff King ` (2 subsequent siblings) 8 siblings, 2 replies; 28+ messages in thread From: Jeff King @ 2023-01-18 20:44 UTC (permalink / raw) To: git; +Cc: René Scharfe, Ævar Arnfjörð Bjarmason Since c879daa237 (Make hash-object more robust against malformed objects, 2011-02-05), we've done some rudimentary checks against objects we're about to write by running them through our usual parsers for trees, commits, and tags. These parsers catch some problems, but they are not nearly as careful as the fsck functions (which makes sense; the parsers are designed to be fast and forgiving, bailing only when the input is unintelligible). We are better off doing the more thorough fsck checks when writing objects. Doing so at write time is much better than writing garbage only to find out later (after building more history atop it!) that fsck complains about it, or hosts with transfer.fsckObjects reject it. This is obviously going to be a user-visible behavior change, and the test changes earlier in this series show the scope of the impact. But I'd argue that this is OK: - the documentation for hash-object is already vague about which checks we might do, saying that --literally will allow "any garbage[...] which might not otherwise pass standard object parsing or git-fsck checks". So we are already covered under the documented behavior. - users don't generally run hash-object anyway. There are a lot of spots in the tests that needed to be updated because creating garbage objects is something that Git's tests disproportionately do. 
- it's hard to imagine anyone thinking the new behavior is worse. Any object we reject would be a potential problem down the road for the user. And if they really want to create garbage, --literally is already the escape hatch they need. Note that the change here is actually in index_mem(), which handles the HASH_FORMAT_CHECK flag passed by hash-object. That flag is also used by "git-replace --edit" to sanity-check the result. Covering that with more thorough checks likewise seems like a good thing. Besides being more thorough, there are a few other bonuses: - we get rid of some questionable stack allocations of object structs. These don't seem to currently cause any problems in practice, but they subtly violate some of the assumptions made by the rest of the code (e.g., the "struct commit" we put on the stack and zero-initialize will not have a proper index from alloc_commit_index()). - likewise, those parsed object structs are the source of some small memory leaks. - the resulting messages are much better. For example: [before] $ echo 'tree 123' | git hash-object -t commit --stdin error: bogus commit object 0000000000000000000000000000000000000000 fatal: corrupt commit [after] $ echo 'tree 123' | git.compile hash-object -t commit --stdin error: object fails fsck: badTreeSha1: invalid 'tree' line format - bad sha1 fatal: refusing to create malformed object Signed-off-by: Jeff King <peff@peff.net> --- object-file.c | 55 ++++++++++++++++++------------------------ t/t1007-hash-object.sh | 11 +++++++++ 2 files changed, 34 insertions(+), 32 deletions(-) diff --git a/object-file.c b/object-file.c index 80a0cd3b35..5c96384803 100644 --- a/object-file.c +++ b/object-file.c @@ -33,6 +33,7 @@ #include "object-store.h" #include "promisor-remote.h" #include "submodule.h" +#include "fsck.h" /* The maximum size for an object header. 
*/ #define MAX_HEADER_LEN 32 @@ -2298,32 +2299,21 @@ int repo_has_object_file(struct repository *r, return repo_has_object_file_with_flags(r, oid, 0); } -static void check_tree(const void *buf, size_t size) -{ - struct tree_desc desc; - struct name_entry entry; - - init_tree_desc(&desc, buf, size); - while (tree_entry(&desc, &entry)) - /* do nothing - * tree_entry() will die() on malformed entries */ - ; -} - -static void check_commit(const void *buf, size_t size) -{ - struct commit c; - memset(&c, 0, sizeof(c)); - if (parse_commit_buffer(the_repository, &c, buf, size, 0)) - die(_("corrupt commit")); -} - -static void check_tag(const void *buf, size_t size) -{ - struct tag t; - memset(&t, 0, sizeof(t)); - if (parse_tag_buffer(the_repository, &t, buf, size)) - die(_("corrupt tag")); +/* + * We can't use the normal fsck_error_function() for index_mem(), + * because we don't yet have a valid oid for it to report. Instead, + * report the minimal fsck error here, and rely on the caller to + * give more context. 
+ */ +static int hash_format_check_report(struct fsck_options *opts, + const struct object_id *oid, + enum object_type object_type, + enum fsck_msg_type msg_type, + enum fsck_msg_id msg_id, + const char *message) +{ + error(_("object fails fsck: %s"), message); + return 1; } static int index_mem(struct index_state *istate, @@ -2350,12 +2340,13 @@ static int index_mem(struct index_state *istate, } } if (flags & HASH_FORMAT_CHECK) { - if (type == OBJ_TREE) - check_tree(buf, size); - if (type == OBJ_COMMIT) - check_commit(buf, size); - if (type == OBJ_TAG) - check_tag(buf, size); + struct fsck_options opts = FSCK_OPTIONS_DEFAULT; + + opts.strict = 1; + opts.error_func = hash_format_check_report; + if (fsck_buffer(null_oid(), type, buf, size, &opts)) + die(_("refusing to create malformed object")); + fsck_finish(&opts); } if (write_object) diff --git a/t/t1007-hash-object.sh b/t/t1007-hash-object.sh index 2d2148d8fa..ac3d173767 100755 --- a/t/t1007-hash-object.sh +++ b/t/t1007-hash-object.sh @@ -222,6 +222,17 @@ test_expect_success 'empty filename in tree' ' grep "empty filename in tree entry" err ' +test_expect_success 'duplicate filename in tree' ' + hex_oid=$(echo foo | git hash-object --stdin -w) && + bin_oid=$(echo $hex_oid | hex2oct) && + { + printf "100644 file\0$bin_oid" && + printf "100644 file\0$bin_oid" + } >tree-with-duplicate-filename && + test_must_fail git hash-object -t tree tree-with-duplicate-filename 2>err && + grep "duplicateEntries" err +' + test_expect_success 'corrupt commit' ' test_must_fail git hash-object -t commit --stdin </dev/null ' -- 2.39.1.616.gd06fca9e99 ^ permalink raw reply related [flat|nested] 28+ messages in thread
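As an illustration of the kind of header check behind the badTreeSha1 message in the example above, here is a minimal, self-contained C sketch. It is not Git's fsck_commit(): the name toy_check_tree_line() and the TOY_HEXSZ constant are hypothetical, and the sketch assumes 40-digit SHA-1 hex ids. It only verifies that a buffer starts with "tree <hex>\n", and it never reads past the size it was given.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define TOY_HEXSZ 40 /* assumption: SHA-1 hex length, for illustration only */

/*
 * Hypothetical helper, not part of Git.
 *
 * Return 0 if buf (exactly `size` bytes, not necessarily NUL-terminated)
 * starts with "tree <40 hex digits>\n", and non-zero otherwise. All reads
 * stay within buf[0..size-1].
 */
int toy_check_tree_line(const char *buf, size_t size)
{
	const size_t prefix_len = 5; /* length of "tree " */
	size_t i;

	assert(buf || !size);

	/* Check the length first so later indexing cannot overrun. */
	if (size < prefix_len + TOY_HEXSZ + 1)
		return 1;
	if (memcmp(buf, "tree ", prefix_len))
		return 1;
	for (i = 0; i < TOY_HEXSZ; i++)
		if (!isxdigit((unsigned char)buf[prefix_len + i]))
			return 1;
	/* The id must be followed directly by a newline. */
	return buf[prefix_len + TOY_HEXSZ] != '\n';
}
```

With this sketch, the nine bytes of "tree 123\n" from the example are rejected as too short, while a line carrying a full 40-digit id followed by a newline is accepted.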
* Re: [PATCH 6/6] hash-object: use fsck for object checks 2023-01-18 20:44 ` [PATCH 6/6] hash-object: use fsck for object checks Jeff King @ 2023-01-18 21:34 ` Taylor Blau 2023-01-19 2:31 ` Jeff King 2023-02-01 12:50 ` Jeff King 1 sibling, 1 reply; 28+ messages in thread From: Taylor Blau @ 2023-01-18 21:34 UTC (permalink / raw) To: Jeff King; +Cc: git, René Scharfe, Ævar Arnfjörð Bjarmason On Wed, Jan 18, 2023 at 03:44:12PM -0500, Jeff King wrote: > This is obviously going to be a user-visible behavior change, and the > test changes earlier in this series show the scope of the impact. But > I'd argue that this is OK: > > - the documentation for hash-object is already vague about which > checks we might do, saying that --literally will allow "any > garbage[...] which might not otherwise pass standard object parsing > or git-fsck checks". So we are already covered under the documented > behavior. > > - users don't generally run hash-object anyway. There are a lot of > spots in the tests that needed to be updated because creating > garbage objects is something that Git's tests disproportionately do. > > - it's hard to imagine anyone thinking the new behavior is worse. Any > object we reject would be a potential problem down the road for the > user. And if they really want to create garbage, --literally is > already the escape hatch they need. This is the discussion I was pointing out earlier in the series as evidence for making this behavior the new default without "--literally". That being said, let me play devil's advocate for a second. Do the new fsck checks slow anything in hash-object down significantly? If so, then it's plausible to imagine a hash-object caller who (a) doesn't use `--literally`, but (b) does care about throughput if they're writing a large number of objects at once. I don't know if such a situation exists, or if these new fsck checks even slow hash-object down enough to care. 
But I didn't catch a discussion of this case in your series, so I figured I'd bring it up here just in case. > - the resulting messages are much better. For example: > > [before] > $ echo 'tree 123' | git hash-object -t commit --stdin > error: bogus commit object 0000000000000000000000000000000000000000 > fatal: corrupt commit > > [after] > $ echo 'tree 123' | git.compile hash-object -t commit --stdin > error: object fails fsck: badTreeSha1: invalid 'tree' line format - bad sha1 > fatal: refusing to create malformed object Much nicer, well done. > Signed-off-by: Jeff King <peff@peff.net> > --- > object-file.c | 55 ++++++++++++++++++------------------------ > t/t1007-hash-object.sh | 11 +++++++++ > 2 files changed, 34 insertions(+), 32 deletions(-) > > diff --git a/object-file.c b/object-file.c > index 80a0cd3b35..5c96384803 100644 > --- a/object-file.c > +++ b/object-file.c > @@ -33,6 +33,7 @@ > #include "object-store.h" > #include "promisor-remote.h" > #include "submodule.h" > +#include "fsck.h" > > /* The maximum size for an object header. 
*/ > #define MAX_HEADER_LEN 32 > @@ -2298,32 +2299,21 @@ int repo_has_object_file(struct repository *r, > return repo_has_object_file_with_flags(r, oid, 0); > } > > -static void check_tree(const void *buf, size_t size) > -{ > - struct tree_desc desc; > - struct name_entry entry; > - > - init_tree_desc(&desc, buf, size); > - while (tree_entry(&desc, &entry)) > - /* do nothing > - * tree_entry() will die() on malformed entries */ > - ; > -} > - > -static void check_commit(const void *buf, size_t size) > -{ > - struct commit c; > - memset(&c, 0, sizeof(c)); > - if (parse_commit_buffer(the_repository, &c, buf, size, 0)) > - die(_("corrupt commit")); > -} > - > -static void check_tag(const void *buf, size_t size) > -{ > - struct tag t; > - memset(&t, 0, sizeof(t)); > - if (parse_tag_buffer(the_repository, &t, buf, size)) > - die(_("corrupt tag")); OK, here we're getting rid of all of the lightweight checks that hash-object used to implement on its own. > +/* > + * We can't use the normal fsck_error_function() for index_mem(), > + * because we don't yet have a valid oid for it to report. Instead, > + * report the minimal fsck error here, and rely on the caller to > + * give more context. 
> + */ > +static int hash_format_check_report(struct fsck_options *opts, > + const struct object_id *oid, > + enum object_type object_type, > + enum fsck_msg_type msg_type, > + enum fsck_msg_id msg_id, > + const char *message) > +{ > + error(_("object fails fsck: %s"), message); > + return 1; > } > > static int index_mem(struct index_state *istate, > @@ -2350,12 +2340,13 @@ static int index_mem(struct index_state *istate, > } > } > if (flags & HASH_FORMAT_CHECK) { > - if (type == OBJ_TREE) > - check_tree(buf, size); > - if (type == OBJ_COMMIT) > - check_commit(buf, size); > - if (type == OBJ_TAG) > - check_tag(buf, size); > + struct fsck_options opts = FSCK_OPTIONS_DEFAULT; > + > + opts.strict = 1; > + opts.error_func = hash_format_check_report; > + if (fsck_buffer(null_oid(), type, buf, size, &opts)) > + die(_("refusing to create malformed object")); > + fsck_finish(&opts); > } And here's the main part of the change, which is delightfully simple and appears correct to me. > diff --git a/t/t1007-hash-object.sh b/t/t1007-hash-object.sh > index 2d2148d8fa..ac3d173767 100755 > --- a/t/t1007-hash-object.sh > +++ b/t/t1007-hash-object.sh > @@ -222,6 +222,17 @@ test_expect_success 'empty filename in tree' ' > grep "empty filename in tree entry" err > ' > > +test_expect_success 'duplicate filename in tree' ' > + hex_oid=$(echo foo | git hash-object --stdin -w) && > + bin_oid=$(echo $hex_oid | hex2oct) && > + { > + printf "100644 file\0$bin_oid" && > + printf "100644 file\0$bin_oid" > + } >tree-with-duplicate-filename && > + test_must_fail git hash-object -t tree tree-with-duplicate-filename 2>err && > + grep "duplicateEntries" err > +' > + For what it's worth, I think that this is sufficient coverage for the new fsck checks. Thanks, Taylor ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 6/6] hash-object: use fsck for object checks 2023-01-18 21:34 ` Taylor Blau @ 2023-01-19 2:31 ` Jeff King 0 siblings, 0 replies; 28+ messages in thread From: Jeff King @ 2023-01-19 2:31 UTC (permalink / raw) To: Taylor Blau Cc: git, René Scharfe, Ævar Arnfjörð Bjarmason On Wed, Jan 18, 2023 at 04:34:02PM -0500, Taylor Blau wrote: > That being said, let me play devil's advocate for a second. Do the new > fsck checks slow anything in hash-object down significantly? If so, then > it's plausible to imagine a hash-object caller who (a) doesn't use > `--literally`, but (b) does care about throughput if they're writing a > large number of objects at once. > > I don't know if such a situation exists, or if these new fsck checks > even slow hash-object down enough to care. But I didn't catch a > discussion of this case in your series, so I figured I'd bring it up > here just in case. That's a really good point to bring up. Prior to timing anything, here were my guesses: - it won't make a big difference either way because the time is dominated by computing sha1 anyway - we might actually be a little faster for commits and tags in the new code, because they aren't allocating structs for the pointed-to objects (trees, parents, etc). Nor stuffing them into obj_hash, so our total memory usage would be lower. - trees may be a little slower, because we're doing more analysis on the filenames (sort order, various filesystem-specific checks for .git, etc) And here's what I timed, using linux.git. First I pulled out the raw object data like so: mkdir -p commit tag tree git cat-file --batch-all-objects --unordered --batch-check='%(objecttype) %(objectname)' | perl -alne 'print $F[1] unless $F[0] eq "blob"' | git cat-file --batch | perl -ne ' /(\S+) (\S+) (\d+)/ or die "confusing: $_"; my $dir = "$2/" . substr($1, 0, 2); my $fn = "$dir/" . 
substr($1, 2); mkdir($dir); open(my $fh, ">", $fn) or die "open($fn): $!"; read(STDIN, my $buf, $3) or die "read($3): $!"; print $fh $buf; read(STDIN, $buf, 1); # trailing newline ' And then I timed it like this: find commit -type f | sort >input hyperfine -L v old,new './git.{v} hash-object --stdin-paths -t commit <input' which yielded: Benchmark 1: ./git.old hash-object --stdin-paths -t commit <input Time (mean ± σ): 7.264 s ± 0.142 s [User: 4.129 s, System: 3.043 s] Range (min … max): 7.098 s … 7.558 s 10 runs Benchmark 2: ./git.new hash-object --stdin-paths -t commit <input Time (mean ± σ): 6.832 s ± 0.087 s [User: 3.848 s, System: 2.901 s] Range (min … max): 6.752 s … 7.059 s 10 runs Summary './git.new hash-object --stdin-paths -t commit <input' ran 1.06 ± 0.02 times faster than './git.old hash-object --stdin-paths -t commit <input' So the new code is indeed faster, though really most of the time is spent reading the data and computing the hash anyway. For comparison, using --literally drops it to ~6.3s. And according to massif, peak heap drops from 241MB to 80k. Which is pretty good. Trees are definitely slower, though. I reduced the number to fit in my budget of patience: find tree -type f | sort | head -n 200000 >input hyperfine -L v old,new './git.{v} hash-object --stdin-paths -t tree <input' And got: Benchmark 1: ./git.old hash-object --stdin-paths -t tree <input Time (mean ± σ): 2.470 s ± 0.022 s [User: 1.902 s, System: 0.549 s] Range (min … max): 2.442 s … 2.509 s 10 runs Benchmark 2: ./git.new hash-object --stdin-paths -t tree <input Time (mean ± σ): 3.244 s ± 0.026 s [User: 2.661 s, System: 0.567 s] Range (min … max): 3.215 s … 3.295 s 10 runs Summary './git.old hash-object --stdin-paths -t tree <input' ran 1.31 ± 0.02 times faster than './git.new hash-object --stdin-paths -t tree <input' So we indeed got a bit slower (and --literally here is ~2.2s). 
It's enough that it outweighs the benefits from the commits getting faster (especially because there tend to be more trees than commits). But those also get diluted by blobs (which have a lot of data to hash and free fsck checks). So in the end, I think nobody would really care that much. The absolute numbers are pretty small, and this is already a fairly dumb way to get objects into your repository. The usual way is via index-pack, and it already uses the fsck code for its checks. But I do think it was a good question to explore (plus it found a descriptor leak in hash-object, which I sent a separate patch for). -Peff ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 6/6] hash-object: use fsck for object checks 2023-01-18 20:44 ` [PATCH 6/6] hash-object: use fsck for object checks Jeff King 2023-01-18 21:34 ` Taylor Blau @ 2023-02-01 12:50 ` Jeff King 2023-02-01 13:08 ` Ævar Arnfjörð Bjarmason 2023-02-01 20:41 ` Junio C Hamano 1 sibling, 2 replies; 28+ messages in thread From: Jeff King @ 2023-02-01 12:50 UTC (permalink / raw) To: git; +Cc: Taylor Blau, René Scharfe, Ævar Arnfjörð Bjarmason On Wed, Jan 18, 2023 at 03:44:12PM -0500, Jeff King wrote: > @@ -2350,12 +2340,13 @@ static int index_mem(struct index_state *istate, > } > } > if (flags & HASH_FORMAT_CHECK) { > - if (type == OBJ_TREE) > - check_tree(buf, size); > - if (type == OBJ_COMMIT) > - check_commit(buf, size); > - if (type == OBJ_TAG) > - check_tag(buf, size); > + struct fsck_options opts = FSCK_OPTIONS_DEFAULT; > + > + opts.strict = 1; > + opts.error_func = hash_format_check_report; > + if (fsck_buffer(null_oid(), type, buf, size, &opts)) > + die(_("refusing to create malformed object")); > + fsck_finish(&opts); > } By the way, I wanted to call out one thing here that nobody mentioned during review: we are not checking the return value of fsck_finish(). That is a bit of a weird function. We must call it because it cleans up any resources allocated during the fsck_buffer() call. But it also is the last chance to fsck any special blobs (like those that are found as .gitmodules, etc). We only find out the filenames while looking at the enclosing trees, so we queue them and then check the blobs later. So if we are hashing a blob, that is mostly fine. We will not have the blob's name queued as anything special, and so the fsck is a noop. But if we fsck a tree, and it has a .gitmodules entry pointing to blob X, then we would also pull X from the odb and fsck it during this "finish" phase. Which leads me to two diverging lines of thought: 1. 
One of my goals with this series is that one could add objects to the repository via "git hash-object -w" and feel confident that no fsck rules were violated, because fsck implements some security checks. In the past when GitHub rolled out security checks this was a major pain, because objects enter repositories not just from pushes, but also from web-client activity (e.g., editing a blob on the website). And since Git had no way to say "fsck just this object", we ended up implementing the fsck checks multiple times, in libgit2 and in some of its calling code. So I was hoping that just passing the objects to "hash-object" would be a viable solution. I'm not sure if it is or not. If you just hash a blob, then we'll have no clue it's a .gitmodules file. OTOH, you have to get the matching tree which binds the blob to the .gitmodules path somehow. So if that tree is fsck'd, and then checks the blob during fsck_finish(), that should be enough. Assuming that fsck complains when the pointed-to blob cannot be accessed, which I think it should (because really, incremental pushes face the same problem). In which case we really ought to be checking the result of fsck_finish() here and complaining. 2. We're not checking fsck connectivity here, and that's intentional. So you can "hash-object" a tree that points to blobs that we don't actually have. But if you hash a tree that points a .gitmodules entry at a blob that doesn't exist, then that will fail the fsck (during the finish step). And respecting the fsck_finish() exit code would break that. As an addendum, in a regular fsck, many trees might mention the same blob as .gitmodules, and we'll queue that blob to be checked once. But here, we are potentially running a bunch of individual fscks, one per object we hash. 
So if you had, say, 1000 trees that all mentioned the same blob (because other entries were changing), and you tried to hash them all with "hash-object --stdin-paths" or similar, then we'd fsck that blob 1000 times. Which isn't wrong, per se, but seems inefficient. Solving it would require keeping track of what has been checked between calls to index_mem(). Which is kind of awkward, seeing as how low-level it is. It would be a lot more natural if all this checking happened in hash-object itself. So I dunno. The code above is doing (2), albeit with the inefficiency of checking blobs that we might not care about. I kind of think (1) is the right thing, though, and anybody who really wants to make trees that point to bogus .gitmodules can use --literally. -Peff ^ permalink raw reply [flat|nested] 28+ messages in thread
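The "keep track of what has been checked between calls" idea above can be sketched in isolation. This is a minimal illustration and not a patch against index_mem(): struct checked_set, checked_set_add() and simulate_shared_gitmodules() are hypothetical names, ids are assumed to be 20 raw bytes, and the linear scan is chosen for brevity only. A real change would more likely reuse Git's own oidset rather than a hand-rolled array.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define ID_LEN 20 /* assumption: SHA-1 sized raw ids, for illustration only */

/* Hypothetical "already checked" set; not part of Git's API. */
struct checked_set {
	unsigned char (*ids)[ID_LEN];
	size_t nr, alloc;
};

/* Return 1 if id is already recorded in the set, 0 otherwise. */
int checked_set_contains(const struct checked_set *set,
			 const unsigned char *id)
{
	size_t i;

	for (i = 0; i < set->nr; i++)
		if (!memcmp(set->ids[i], id, ID_LEN))
			return 1;
	return 0;
}

/*
 * Record id. Returns 1 if this is the first time we see it (the caller
 * should check the blob now), 0 if it was recorded before, and -1 if
 * memory allocation fails.
 */
int checked_set_add(struct checked_set *set, const unsigned char *id)
{
	assert(set && id);

	if (checked_set_contains(set, id))
		return 0;
	if (set->nr == set->alloc) {
		size_t new_alloc = set->alloc ? set->alloc * 2 : 16;
		void *p = realloc(set->ids, new_alloc * ID_LEN);

		if (!p)
			return -1;
		set->ids = p;
		set->alloc = new_alloc;
	}
	memcpy(set->ids[set->nr++], id, ID_LEN);
	return 1;
}

/*
 * Simulate hashing nr_trees trees that all name the same .gitmodules
 * blob, and return how many times that blob would actually be checked.
 */
int simulate_shared_gitmodules(int nr_trees)
{
	struct checked_set set = { NULL, 0, 0 };
	unsigned char blob[ID_LEN];
	int i, checks = 0;

	memset(blob, 0xab, sizeof(blob));
	for (i = 0; i < nr_trees; i++)
		if (checked_set_add(&set, blob) == 1)
			checks++;
	free(set.ids);
	return checks;
}
```

In this model, the 1000-tree case from the message above results in a single blob check, because every call after the first finds the id already recorded. The open question raised above remains where such a set would live, since index_mem() is low-level and has no natural place for state that spans calls.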
* Re: [PATCH 6/6] hash-object: use fsck for object checks 2023-02-01 12:50 ` Jeff King @ 2023-02-01 13:08 ` Ævar Arnfjörð Bjarmason 2023-02-01 20:41 ` Junio C Hamano 1 sibling, 0 replies; 28+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2023-02-01 13:08 UTC (permalink / raw) To: Jeff King; +Cc: git, Taylor Blau, René Scharfe On Wed, Feb 01 2023, Jeff King wrote: > On Wed, Jan 18, 2023 at 03:44:12PM -0500, Jeff King wrote: > >> @@ -2350,12 +2340,13 @@ static int index_mem(struct index_state *istate, >> } >> } >> if (flags & HASH_FORMAT_CHECK) { >> - if (type == OBJ_TREE) >> - check_tree(buf, size); >> - if (type == OBJ_COMMIT) >> - check_commit(buf, size); >> - if (type == OBJ_TAG) >> - check_tag(buf, size); >> + struct fsck_options opts = FSCK_OPTIONS_DEFAULT; >> + >> + opts.strict = 1; >> + opts.error_func = hash_format_check_report; >> + if (fsck_buffer(null_oid(), type, buf, size, &opts)) >> + die(_("refusing to create malformed object")); >> + fsck_finish(&opts); >> } > > By the way, I wanted to call out one thing here that nobody mentioned > during review: we are not checking the return value of fsck_finish(). > > That is a bit of a weird function. We must call it because it cleans up > any resources allocated during the fsck_buffer() call. But it also is > the last chance to fsck any special blobs (like those that are found as > .gitmodules, etc). We only find out the filenames while looking at the > enclosing trees, so we queue them and then check the blobs later. > > So if we are hashing a blob, that is mostly fine. We will not have the > blob's name queued as anything special, and so the fsck is a noop. > > But if we fsck a tree, and it has a .gitmodules entry pointing to blob > X, then we would also pull X from the odb and fsck it during this > "finish" phase. > > Which leads me to two diverging lines of thought: > > 1. 
One of my goals with this series is that one could add objects to > the repository via "git hash-object -w" and feel confident that no > fsck rules were violated, because fsck implements some security > checks. In the past when GitHub rolled out security checks this was > a major pain, because objects enter repositories not just from > pushes, but also from web-client activity (e.g., editing a blob on > the website). And since Git had no way to say "fsck just this > object", we ended up implementing the fsck checks multiple times, > in libgit2 and in some of its calling code. > > So I was hoping that just passing the objects to "hash-object" > would be a viable solution. I'm not sure if it is or not. If you > just hash a blob, then we'll have no clue it's a .gitmodules file. > OTOH, you have to get the matching tree which binds the blob to the > .gitmodules path somehow. So if that tree is fsck'd, and then > checks the blob during fsck_finish(), that should be enough. > Assuming that fsck complains when the pointed-to blob cannot be > accessed, which I think it should (because really, incremental > pushes face the same problem). > > In which case we really ought to be checking the result of > fsck_finish() here and complaining. > > 2. We're not checking fsck connectivity here, and that's intentional. > So you can "hash-object" a tree that points to blobs that we don't > actually have. But if you hash a tree that points a .gitmodules > entry at a blob that doesn't exist, then that will fail the fsck > (during the finish step). And respecting the fsck_finish() exit > code would break that. > > As an addendum, in a regular fsck, many trees might mention the > same blob as .gitmodules, and we'll queue that blob to be checked > once. But here, we are potentially running a bunch of individual > fscks, one per object we hash. 
So if you had, say, 1000 trees that > all mentioned the same blob (because other entries were changing), > and you tried to hash them all with "hash-object --stdin-paths" or > similar, then we'd fsck that blob 1000 times. > > Which isn't wrong, per se, but seems inefficient. Solving it would > require keeping track of what has been checked between calls to > index_mem(). Which is kind of awkward, seeing as how low-level it > is. It would be a lot more natural if all this checking happened in > hash-object itself. > > So I dunno. The code above is doing (2), albeit with the inefficiency of > checking blobs that we might not care about. I kind of think (1) is the > right thing, though, and anybody who really wants to make trees that > point to bogus .gitmodules can use --literally. Aside from the other things you bring up here, it seems wrong to me to conflate --literally with some sort of "no fsck" or "don't fsck this collection yet" mode. Can't we have a "--no-fsck" or similar, which won't do any sort of full fsck, but also won't accept bogus object types & the like? Currently I believe (and I haven't had time to carefully review what you have here) we only need --literally to produce objects that are truly corrupt when viewed in isolation. E.g. a tag that refers to a bogus object type etc. But we have long supported a narrow view of what the fsck checks mean in that context. E.g. now with "mktag" we'll use the fsck machinery, but only skin-deep, so you can be referring to a tree which would in turn fail our checks. I tend to think that we should be keeping it like that, but documenting that if you're creating such objects you either need to do it really carefully, or follow it up with an operation that's guaranteed to fsck the sum of the objects you've added recursively. So, rather than teach e.g. "hash-object" to be smart about that we should e.g. 
encourage manual creation of trees/blobs/commits to be followed up with a "git push" to a new ref that refers to them, even if that "git push" is to the repository located in the $PWD. By doing that we offload the "what's new?" question to the pack-over-the-wire machinery, which is well tested. Anything else seems ultimately to be madness; after all, if I feed a newly crafted commit to "hash-object", how do we know where to stop, other than essentially faking up a push negotiation with ourselves? It's also worth noting that much of the complexity around .gitmodules in particular is to support packfile-uri's odd notion of applying the "last" part of the PACK before the "first" part, which nothing else does. If we just blindly applied both and then fsck'd the resulting combination, we'd get rid of that tricky special case. But I haven't benchmarked that. It should be a bit slower, particularly on a large repository that won't fit in memory. But my hunch is that it won't be too bad, and the resulting simplification may be worth it (particularly now that we have bundle-uri, which doesn't share that edge case). ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 6/6] hash-object: use fsck for object checks 2023-02-01 12:50 ` Jeff King 2023-02-01 13:08 ` Ævar Arnfjörð Bjarmason @ 2023-02-01 20:41 ` Junio C Hamano 1 sibling, 0 replies; 28+ messages in thread From: Junio C Hamano @ 2023-02-01 20:41 UTC (permalink / raw) To: Jeff King Cc: git, Taylor Blau, René Scharfe, Ævar Arnfjörð Bjarmason Jeff King <peff@peff.net> writes: > ... So if that tree is fsck'd, and then > checks the blob during fsck_finish(), that should be enough. > Assuming that fsck complains when the pointed-to blob cannot be > accessed, which I think it should (because really, incremental > pushes face the same problem). Yes. > 2. We're not checking fsck connectivity here, and that's intentional. > So you can "hash-object" a tree that points to blobs that we don't > actually have. But if you hash a tree that points a .gitmodules > entry at a blob that doesn't exist, then that will fail the fsck > (during the finish step). And respecting the fsck_finish() exit > code would break that. That's tricky. An attack vector to sneak a bad .gitmodules file into history then becomes (1) hash a tree with a .gitmodules entry that points at a missing blob and then (2) after that fact is forgotten, hash a bad blob pointed to by the tree? We cannot afford to remember forever all trees with a .gitmodules entry whose blob we didn't find, so one approach to solve it is to reject trees with missing blobs. Legitimate use cases should be able to build up trees bottom up, hashing blobs before their containing trees. If you hash a commit object, we would want to fsck its tree? Do we want to fsck its parent commit and its tree? Ideally we can stop when our "traversal" reaches objects that are known to be good, but how do we decide which objects are "known to be good"? Being reachable from our refs, as usual? > So I dunno. The code above is doing (2), albeit with the inefficiency of > checking blobs that we might not care about. 
I kind of think (1) is the > right thing, though, and anybody who really wants to make trees that > point to bogus .gitmodules can use --literally. True. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC/PATCH 0/6] hash-object: use fsck to check objects 2023-01-18 20:35 [RFC/PATCH 0/6] hash-object: use fsck to check objects Jeff King ` (5 preceding siblings ...) 2023-01-18 20:44 ` [PATCH 6/6] hash-object: use fsck for object checks Jeff King @ 2023-01-18 20:46 ` Jeff King 2023-01-18 20:59 ` Junio C Hamano 2023-01-19 1:39 ` Jeff King 8 siblings, 0 replies; 28+ messages in thread From: Jeff King @ 2023-01-18 20:46 UTC (permalink / raw) To: git; +Cc: René Scharfe, Ævar Arnfjörð Bjarmason On Wed, Jan 18, 2023 at 03:35:06PM -0500, Jeff King wrote: > The other option is having the fsck code avoid looking past the size it > was given. I think the intent is that this should work, from commits > like 4d0d89755e (Make sure fsck_commit_buffer() does not run out of the > buffer, 2014-09-11). We do use skip_prefix() and parse_oid_hex(), which > won't respect the size, but I think[1] that's OK because we'll have > parsed up to the end-of-header beforehand (and those functions would > never match past there). > > Which would mean that 9a1a3a4d4c (mktag: allow omitting the header/body > \n separator, 2021-01-05) and acf9de4c94 (mktag: use fsck instead of > custom verify_tag(), 2021-01-05) were buggy, and we can just fix them. That would look something like this: diff --git a/fsck.c b/fsck.c index c2c8facd2d..d220276bcb 100644 --- a/fsck.c +++ b/fsck.c @@ -898,6 +898,7 @@ int fsck_tag_standalone(const struct object_id *oid, const char *buffer, { int ret = 0; char *eol; + const char *eob = buffer + size; struct strbuf sb = STRBUF_INIT; const char *p; @@ -960,10 +961,8 @@ int fsck_tag_standalone(const struct object_id *oid, const char *buffer, } else ret = fsck_ident(&buffer, oid, OBJ_TAG, options); - if (!*buffer) - goto done; - if (!starts_with(buffer, "\n")) { + if (buffer != eob && *buffer != '\n') { /* * The verify_headers() check will allow * e.g. 
"[...]tagger <tagger>\nsome Changing the starts_with() is not strictly necessary, but I think it makes it more clear that we are only going to look at the one character we confirmed is still valid inside the buffer. This is enough to have the whole test suite pass with ASan/UBSan after my series. But as I said earlier, I'd want to look carefully at the rest of the fsck code to make sure there aren't any other possible inputs that could look past the end of the buffer. -Peff ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [RFC/PATCH 0/6] hash-object: use fsck to check objects 2023-01-18 20:35 [RFC/PATCH 0/6] hash-object: use fsck to check objects Jeff King ` (6 preceding siblings ...) 2023-01-18 20:46 ` [RFC/PATCH 0/6] hash-object: use fsck to check objects Jeff King @ 2023-01-18 20:59 ` Junio C Hamano 2023-01-18 21:38 ` Taylor Blau 2023-01-19 1:39 ` Jeff King 8 siblings, 1 reply; 28+ messages in thread From: Junio C Hamano @ 2023-01-18 20:59 UTC (permalink / raw) To: Jeff King; +Cc: git, René Scharfe, Ævar Arnfjörð Bjarmason Jeff King <peff@peff.net> writes: > [1/6]: t1007: modernize malformed object tests Obviously good. > [2/6]: t1006: stop using 0-padded timestamps > [3/6]: t7030: stop using invalid tag name These two are pleasant to see and revealed what are "accepted" by mistake, quite surprisingly. > [4/6]: t: use hash-object --literally when created malformed objects The --literally option was invented initially primarily to allow a bogus type of object (e.g. "hash-object -t xyzzy --literally") but I am happy to see that we are finding different uses. I wonder if these objects of known types but with syntactically bad contents can be "repack"ed from loose into packed? > [5/6]: fsck: provide a function to fsck buffer without object struct Obvious, clean and very nice. > [6/6]: hash-object: use fsck for object checks ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC/PATCH 0/6] hash-object: use fsck to check objects 2023-01-18 20:59 ` Junio C Hamano @ 2023-01-18 21:38 ` Taylor Blau 2023-01-19 2:03 ` Jeff King 0 siblings, 1 reply; 28+ messages in thread From: Taylor Blau @ 2023-01-18 21:38 UTC (permalink / raw) To: Junio C Hamano Cc: Jeff King, git, René Scharfe, Ævar Arnfjörð Bjarmason On Wed, Jan 18, 2023 at 12:59:24PM -0800, Junio C Hamano wrote: > The --literally option was invented initially primarily to allow a > bogus type of object (e.g. "hash-object -t xyzzy --literally") but I > am happy to see that we are finding different uses. I wonder if > these objects of known types but with syntactically bad contents can > be "repack"ed from loose into packed? > > > [5/6]: fsck: provide a function to fsck buffer without object struct It is indeed possible: --- >8 --- Initialized empty Git repository in /home/ttaylorr/src/git/t/trash directory.t9999-test/.git/ expecting success of 9999.1 'repacking corrupt loose object into packed': name=$(echo $ZERO_OID | sed -e "s/00/Q/g") && printf "100644 fooQ$name" | q_to_nul | git hash-object -w --stdin -t tree >in && git pack-objects .git/objects/pack/pack <in Enumerating objects: 1, done. Counting objects: 100% (1/1), done. 06146c77fd19c096858d6459d602be0fdf10891b Writing objects: 100% (1/1), done. Total 1 (delta 0), reused 0 (delta 0), pack-reused 0 ok 1 - repacking corrupt loose object into packed --- 8< --- Thanks, Taylor ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC/PATCH 0/6] hash-object: use fsck to check objects 2023-01-18 21:38 ` Taylor Blau @ 2023-01-19 2:03 ` Jeff King 0 siblings, 0 replies; 28+ messages in thread From: Jeff King @ 2023-01-19 2:03 UTC (permalink / raw) To: Taylor Blau Cc: Junio C Hamano, git, René Scharfe, Ævar Arnfjörð Bjarmason On Wed, Jan 18, 2023 at 04:38:40PM -0500, Taylor Blau wrote: > On Wed, Jan 18, 2023 at 12:59:24PM -0800, Junio C Hamano wrote: > > The --literally option was invented initially primarily to allow a > > bogus type of object (e.g. "hash-object -t xyzzy --literally") but I > > am happy to see that we are finding different uses. I wonder if > > these objects of known types but with syntactically bad contents can > > be "repack"ed from loose into packed? > > > > > [5/6]: fsck: provide a function to fsck buffer without object struct > > It is indeed possible: > > --- >8 --- > Initialized empty Git repository in /home/ttaylorr/src/git/t/trash directory.t9999-test/.git/ > expecting success of 9999.1 'repacking corrupt loose object into packed': > name=$(echo $ZERO_OID | sed -e "s/00/Q/g") && > printf "100644 fooQ$name" | q_to_nul | > git hash-object -w --stdin -t tree >in && > > git pack-objects .git/objects/pack/pack <in > > Enumerating objects: 1, done. > Counting objects: 100% (1/1), done. > 06146c77fd19c096858d6459d602be0fdf10891b > Writing objects: 100% (1/1), done. > Total 1 (delta 0), reused 0 (delta 0), pack-reused 0 > ok 1 - repacking corrupt loose object into packed > --- 8< --- Right, we don't do any fsck-ing when packing objects. Nor should we, I think. We should be checking objects when they come into the repository (via index-pack/unpack-objects) or when they're created (hash-object), but there's little need to do so when they migrate between storage formats. The fact that "--literally" manually writes a loose object is mostly an implementation detail. 
I think if we are not writing an object with an esoteric type, that it could even just hit the regular index_fd() code path (and drop the HASH_FORMAT_CHECK flag). If you do write one with "-t xyzzy", I think pack-objects would barf, but not because of fsck checks. It just couldn't represent that type (which really makes such objects pretty pointless; you cannot ever fetch or push them!). -Peff ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC/PATCH 0/6] hash-object: use fsck to check objects 2023-01-18 20:35 [RFC/PATCH 0/6] hash-object: use fsck to check objects Jeff King ` (7 preceding siblings ...) 2023-01-18 20:59 ` Junio C Hamano @ 2023-01-19 1:39 ` Jeff King 2023-01-19 23:13 ` [PATCH 7/6] fsck: do not assume NUL-termination of buffers Jeff King 2023-01-21 9:36 ` [RFC/PATCH 0/6] hash-object: use fsck to check objects René Scharfe 8 siblings, 2 replies; 28+ messages in thread From: Jeff King @ 2023-01-19 1:39 UTC (permalink / raw) To: git; +Cc: René Scharfe, Ævar Arnfjörð Bjarmason On Wed, Jan 18, 2023 at 03:35:06PM -0500, Jeff King wrote: > The other option is having the fsck code avoid looking past the size it > was given. I think the intent is that this should work, from commits > like 4d0d89755e (Make sure fsck_commit_buffer() does not run out of the > buffer, 2014-09-11). We do use skip_prefix() and parse_oid_hex(), which > won't respect the size, but I think[1] that's OK because we'll have > parsed up to the end-of-header beforehand (and those functions would > never match past there). > > Which would mean that 9a1a3a4d4c (mktag: allow omitting the header/body > \n separator, 2021-01-05) and acf9de4c94 (mktag: use fsck instead of > custom verify_tag(), 2021-01-05) were buggy, and we can just fix them. > > [1] But I said "I think" above because it can get pretty subtle. There's > some more discussion in this thread: > > https://lore.kernel.org/git/20150625155128.C3E9738005C@gemini.denx.de/ > > but I haven't yet convinced myself it's safe. This is exactly the > kind of analysis I wish I had the power to nerd-snipe René into. I poked at this a bit more, and it definitely isn't safe. I think the use of skip_prefix(), etc, is OK, because they'd always stop at an unexpected newline. But verify_headers() is only confirming that we have a series of complete lines, and we might end with no "\n\n" (and hence no commit/tag message). 
And so the obvious case that fools us is one where the data simply ends at a newline, but we are missing one or more headers. So a truncated commit like: tree 1234567890123456789012345678901234567890 (with the newline at the end of the "tree" line, but nothing else) will cause fsck_commit() to look past the "size" we pass it. With all of the current callers, that means it sees a NUL and bails. So it's not currently a bug, but it becomes one if we can feed it arbitrary buffers. Fixing it isn't _too_ bad, and could look something like this: diff --git a/fsck.c b/fsck.c index c2c8facd2d..3f0bb3e350 100644 --- a/fsck.c +++ b/fsck.c @@ -834,6 +834,7 @@ static int fsck_commit(const struct object_id *oid, unsigned author_count; int err; const char *buffer_begin = buffer; + const char *buffer_end = buffer + size; const char *p; if (verify_headers(buffer, size, oid, OBJ_COMMIT, options)) @@ -847,7 +848,7 @@ static int fsck_commit(const struct object_id *oid, return err; } buffer = p + 1; - while (skip_prefix(buffer, "parent ", &buffer)) { + while (buffer < buffer_end && skip_prefix(buffer, "parent ", &buffer)) { if (parse_oid_hex(buffer, &parent_oid, &p) || *p != '\n') { err = report(options, oid, OBJ_COMMIT, FSCK_MSG_BAD_PARENT_SHA1, "invalid 'parent' line format - bad sha1"); if (err) @@ -856,7 +857,7 @@ static int fsck_commit(const struct object_id *oid, buffer = p + 1; } author_count = 0; - while (skip_prefix(buffer, "author ", &buffer)) { + while (buffer < buffer_end && skip_prefix(buffer, "author ", &buffer)) { author_count++; err = fsck_ident(&buffer, oid, OBJ_COMMIT, options); if (err) @@ -868,7 +869,7 @@ static int fsck_commit(const struct object_id *oid, err = report(options, oid, OBJ_COMMIT, FSCK_MSG_MULTIPLE_AUTHORS, "invalid format - multiple 'author' lines"); if (err) return err; - if (!skip_prefix(buffer, "committer ", &buffer)) + if (buffer >= buffer_end || !skip_prefix(buffer, "committer ", &buffer)) return report(options, oid, OBJ_COMMIT, 
FSCK_MSG_MISSING_COMMITTER, "invalid format - expected 'committer' line"); err = fsck_ident(&buffer, oid, OBJ_COMMIT, options); if (err) And then the tag side would need something similar. I'd probably also sprinkle some comments in verify_headers() and its callers documenting our assumptions and what's OK to do (string-like parsing functions work as long as they stop when they hit a newline). That, plus a few tests covering the problematic cases to avoid regressions, would probably be OK. I think fsck_tree() is mostly fine, as the tree-iterating code detects truncation. Though I do find the use of strlen() in decode_tree_entry() a little suspicious (and that would be true of the current code, as well, since it powers hash-object's existing parsing check). -Peff ^ permalink raw reply related [flat|nested] 28+ messages in thread
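To illustrate the pattern in the fsck_commit() diff above (check "buffer < buffer_end" before each header line, then let string-style helpers run within the line), here is a minimal standalone sketch. It is not git code: skip_prefix_local() and count_leading_lines() are hypothetical names made up for this example, and the precondition is assumed to be the one verify_headers() provides (non-empty buffer whose last byte is a newline).

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for git's skip_prefix(): if the bytes at "p"
 * begin with "prefix", store the position after it in "out" and return 1.
 * It stops at the first mismatching byte, so with a prefix that contains
 * no '\n' it can never read past a newline.
 */
static int skip_prefix_local(const char *p, const char *prefix,
			     const char **out)
{
	while (*prefix) {
		if (*p++ != *prefix++)
			return 0;
	}
	*out = p;
	return 1;
}

/*
 * Count the consecutive lines at the start of "buf" that begin with
 * "prefix", without relying on a NUL after the buffer.
 *
 * Assumed preconditions: size > 0, buf[size - 1] == '\n', and "prefix"
 * contains no '\n'.
 */
static size_t count_leading_lines(const char *buf, size_t size,
				  const char *prefix)
{
	const char *end = buf + size;
	const char *p = buf;
	size_t n = 0;

	/* Only look at a new line if there are bytes left in the buffer. */
	while (p < end && skip_prefix_local(p, prefix, &p)) {
		/* Intra-line scan: stops no later than buf[size - 1]. */
		while (*p != '\n')
			p++;
		p++; /* step past the newline; p may now equal end */
		n++;
	}
	return n;
}
```

The only explicit bounds check is the one at the start of each line; everything inside a line is protected by the trailing-newline guarantee, which is why dropping the "p < end" test would reintroduce the read past the end for a buffer that stops right after a matching line.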
* [PATCH 7/6] fsck: do not assume NUL-termination of buffers 2023-01-19 1:39 ` Jeff King @ 2023-01-19 23:13 ` Jeff King 2023-01-19 23:58 ` Junio C Hamano 2023-01-21 9:36 ` [RFC/PATCH 0/6] hash-object: use fsck to check objects René Scharfe 1 sibling, 1 reply; 28+ messages in thread From: Jeff King @ 2023-01-19 23:13 UTC (permalink / raw) To: git Cc: Taylor Blau, Junio C Hamano, René Scharfe, Ævar Arnfjörð Bjarmason On Wed, Jan 18, 2023 at 08:39:55PM -0500, Jeff King wrote: > On Wed, Jan 18, 2023 at 03:35:06PM -0500, Jeff King wrote: > > > The other option is having the fsck code avoid looking past the size it > > was given. I think the intent is that this should work, from commits > > like 4d0d89755e (Make sure fsck_commit_buffer() does not run out of the > > buffer, 2014-09-11). We do use skip_prefix() and parse_oid_hex(), which > > won't respect the size, but I think[1] that's OK because we'll have > > parsed up to the end-of-header beforehand (and those functions would > > never match past there). > > > > Which would mean that 9a1a3a4d4c (mktag: allow omitting the header/body > > \n separator, 2021-01-05) and acf9de4c94 (mktag: use fsck instead of > > custom verify_tag(), 2021-01-05) were buggy, and we can just fix them. > > > > [1] But I said "I think" above because it can get pretty subtle. There's > > some more discussion in this thread: > > > > https://lore.kernel.org/git/20150625155128.C3E9738005C@gemini.denx.de/ > > > > but I haven't yet convinced myself it's safe. This is exactly the > > kind of analysis I wish I had the power to nerd-snipe René into. > > I poked at this a bit more, and it definitely isn't safe. So here's the result of my digging on this. The good news is that this one commit on top of the rest of the series should make everything safe. I'm sorry the explanation is a bit long, but I wanted to capture a bit of the history, the subtle assumptions, and how I approached analyzing and fixing it. 
There are a few paths forward here: - apply this on top of the earlier 6 patches. This is the simplest thing, and my preference. It does mean that t3800 temporarily has a read-one-char-past-buffer bug that is detected by ASan after patch 6 but before this patch is applied. - put this fix first. Unfortunately the tests rely on having patch 6 in order to be able to feed a non-NUL-terminated buffer to fsck. Options there are: - split this patch into two: code fix goes at the beginning of the series, and then the tests come at the end. The downside here is that it's very hard to run the tests on the pre-fixed code to verify that they are finding problems (you'd have to revert the fix, or re-order patches to get the broken state) - introduce a test-helper that lets you feed a buffer to fsck_buffer(). That can demonstrate the problem and fix independently of any hash-object changes. But it ends up being a fair bit of boilerplate, and ultimately we want to test hash-object anyway. - decide the whole "make fsck work with arbitrary buffers" thing is too subtle and error-prone. I don't think this, or else I wouldn't have made this patch. But I think it's an argument that can be made (and is roughly the approach we decided to take way back in the 2015 thread linked above). The solution there is to make sure we NUL-terminate everything. As I said before, this is tricky because of mmap. But we could probably just skip using mmap in index_core() for non-blobs (which don't tend to be very big), and then assume fsck on individual blobs is safe (it is, because they won't have been marked as gitmodules, etc for more detailed scanning). I think it could work. I kind of prefer just making the fsck functions safe. Even though the way they do left-to-right scanning is error-prone, at least the ugliness is contained inside them, rather than this "sure, I take a ptr/len combo, but make sure you allocate an extra NUL byte!" assumption that currently exists. Anyway, here's the patch. 
I'm happy to repost the whole 7-patch series, too, but since the earlier ones didn't change in my preferred path forward, this seemed easier for now. ;) -- >8 -- Subject: [PATCH] fsck: do not assume NUL-termination of buffers The fsck code operates on an object buffer represented as a pointer/len combination. However, the parsing of commits and tags is a little bit loose; we mostly scan left-to-right through the buffer, without checking whether we've gone past the length we were given. This has traditionally been OK because the buffers we feed to fsck always have an extra NUL after the end of the object content, which ends any left-to-right scan. That has always been true for objects we read from the odb, and we made it true for incoming index-pack/unpack-objects checks in a1e920a0a7 (index-pack: terminate object buffers with NUL, 2014-12-08). However, we recently added an exception: hash-object asks index_fd() to do fsck checks. That _may_ have an extra NUL (if we read from a pipe into a strbuf), but it might not (if we read the contents from the file). Nor can we just teach it to always add a NUL. We may mmap the on-disk file, which will not have any extra bytes (if it's a multiple of the page size). Not to mention that this is a rather subtle assumption for the fsck code to make. Instead, let's make sure that the fsck parsers don't ever look past the size of the buffer they've been given. This _almost_ works already, thanks to earlier work in 4d0d89755e (Make sure fsck_commit_buffer() does not run out of the buffer, 2014-09-11). The theory there is that we check up front whether we have the end of header double-newline separator. And then any left-to-right scanning we do is OK as long as it stops when it hits that boundary. However, we later softened that in 84d18c0bcf (fsck: it is OK for a tag and a commit to lack the body, 2015-06-28), which allows the double-newline header to be missing, but does require that the header ends in a newline. 
That was OK back then, because of the NUL-termination guarantees (including the one from a1e920a0a7 mentioned above). Because 84d18c0bcf guarantees that any header line does end in a newline, we are still OK with most of the left-to-right scanning. We only need to take care after completing a line, to check that there is another line (and we didn't run out of buffer). Most of these cases just need to check "buffer < buffer_end" (where buffer is advanced as we parse) before scanning for the next header line. But here are a few notes: - we don't technically need to check for remaining buffer before parsing the very first line ("tree" for a commit, or "object" for a tag), because verify_headers() rejects a totally empty buffer. But we'll do so in the name of consistency and defensiveness. - there are some calls to strchr('\n'). These are actually OK by the "the final header line must end in a newline" guarantee from verify_headers(). They will always find that newline rather than run off the end of the buffer. Curiously, they do check for a NULL return and complain, but I believe that condition can never be reached. However, I converted them to use memchr() with a proper size and retained the NULL checks. Using memchr() is not much longer and makes it more obvious what is going on. Likewise, retaining the NULL checks serves as a defensive measure in case my analysis is wrong. - commit 9a1a3a4d4c (mktag: allow omitting the header/body \n separator, 2021-01-05), does check for the end-of-buffer condition, but does so with "!*buffer", relying explicitly on the NUL termination. We can accomplish the same thing with a pointer comparison. I also folded it into the follow-on conditional that checks the contents of the buffer, for consistency with the other checks. - fsck_ident() uses parse_timestamp(), which is based on strtoumax(). That function will happily skip past leading whitespace, including newlines, which makes it a risk. 
We can fix this by scanning to the first digit ourselves, and then using parse_timestamp() to do the actual numeric conversion. Note that as a side effect this fixes the fact that we missed zero-padded timestamps with extra whitespace before the date, like "<email>   0123" (whereas we would complain about "<email> 0123" with a single space). I doubt anybody cares, but I mention it here for completeness. - fsck_tree() does not need any modifications. It relies on decode_tree_entry() to do the actual parsing, and that function checks both that there are enough bytes in the buffer to represent an entry, and that there is a NUL at the appropriate spot (one hash-length from the end; this may not be the NUL for the entry we are parsing, but we know that in the worst case, everything from our current position to that NUL is a filename, so we won't run out of bytes). In addition to fixing the code itself, we'd like to make sure our rather subtle assumptions are not violated in the future. So this patch does two more things: - add comments around verify_headers() documenting the link between what it checks and the memory safety of the callers. I don't expect this code to be modified frequently, but this may keep somebody from accidentally breaking things. - add a thorough set of tests covering truncations at various key spots (e.g., for a "tree $oid" line, in the middle of the word "tree", right after it, after the space, in the middle of the $oid, and right at the end of the line). Most of these are fine already (it is only truncating right at the end of the line that is currently broken). And some of them are not even possible with the current code (we parse "tree " as a unit, so truncating before the space is equivalent). But I aimed here to consider the code a black box and look for any truncations that would be a problem for a left-to-right parser. 
Signed-off-by: Jeff King <peff@peff.net> --- fsck.c | 67 ++++++++++++++++---- t/t1451-fsck-buffer.sh | 140 +++++++++++++++++++++++++++++++++++++++++ 2 files changed, 194 insertions(+), 13 deletions(-) create mode 100755 t/t1451-fsck-buffer.sh diff --git a/fsck.c b/fsck.c index c2c8facd2d..2b18717ee8 100644 --- a/fsck.c +++ b/fsck.c @@ -748,6 +748,23 @@ static int fsck_tree(const struct object_id *tree_oid, return retval; } +/* + * Confirm that the headers of a commit or tag object end in a reasonable way, + * either with the usual "\n\n" separator, or at least with a trailing newline + * on the final header line. + * + * This property is important for the memory safety of our callers. It allows + * them to scan the buffer linewise without constantly checking the remaining + * size as long as: + * + * - they check that there are bytes left in the buffer at the start of any + * line (i.e., that the last newline they saw was not the final one we + * found here) + * + * - any intra-line scanning they do will stop at a newline, which will worst + * case hit the newline we found here as the end-of-header. This makes it + * OK for them to use helpers like parse_oid_hex(), or even skip_prefix(). + */ static int verify_headers(const void *data, unsigned long size, const struct object_id *oid, enum object_type type, struct fsck_options *options) @@ -808,6 +825,20 @@ static int fsck_ident(const char **ident, if (*p != ' ') return report(options, oid, type, FSCK_MSG_MISSING_SPACE_BEFORE_DATE, "invalid author/committer line - missing space before date"); p++; + /* + * Our timestamp parser is based on the C strto*() functions, which + * will happily eat whitespace, including the newline that is supposed + * to prevent us walking past the end of the buffer. So do our own + * scan, skipping linear whitespace but not newlines, and then + * confirming we found a digit. 
We _could_ be even more strict here, + * as we really expect only a single space, but since we have + * traditionally allowed extra whitespace, we'll continue to do so. + */ + while (*p == ' ' || *p == '\t') + p++; + if (!isdigit(*p)) + return report(options, oid, type, FSCK_MSG_BAD_DATE, + "invalid author/committer line - bad date"); if (*p == '0' && p[1] != ' ') return report(options, oid, type, FSCK_MSG_ZERO_PADDED_DATE, "invalid author/committer line - zero-padded date"); if (date_overflows(parse_timestamp(p, &end, 10))) @@ -834,20 +865,26 @@ static int fsck_commit(const struct object_id *oid, unsigned author_count; int err; const char *buffer_begin = buffer; + const char *buffer_end = buffer + size; const char *p; + /* + * We _must_ stop parsing immediately if this reports failure, as the + * memory safety of the rest of the function depends on it. See the + * comment above the definition of verify_headers() for more details. + */ if (verify_headers(buffer, size, oid, OBJ_COMMIT, options)) return -1; - if (!skip_prefix(buffer, "tree ", &buffer)) + if (buffer >= buffer_end || !skip_prefix(buffer, "tree ", &buffer)) return report(options, oid, OBJ_COMMIT, FSCK_MSG_MISSING_TREE, "invalid format - expected 'tree' line"); if (parse_oid_hex(buffer, &tree_oid, &p) || *p != '\n') { err = report(options, oid, OBJ_COMMIT, FSCK_MSG_BAD_TREE_SHA1, "invalid 'tree' line format - bad sha1"); if (err) return err; } buffer = p + 1; - while (skip_prefix(buffer, "parent ", &buffer)) { + while (buffer < buffer_end && skip_prefix(buffer, "parent ", &buffer)) { if (parse_oid_hex(buffer, &parent_oid, &p) || *p != '\n') { err = report(options, oid, OBJ_COMMIT, FSCK_MSG_BAD_PARENT_SHA1, "invalid 'parent' line format - bad sha1"); if (err) @@ -856,7 +893,7 @@ static int fsck_commit(const struct object_id *oid, buffer = p + 1; } author_count = 0; - while (skip_prefix(buffer, "author ", &buffer)) { + while (buffer < buffer_end && skip_prefix(buffer, "author ", &buffer)) { author_count++; 
 	err = fsck_ident(&buffer, oid, OBJ_COMMIT, options);
 	if (err)
@@ -868,7 +905,7 @@ static int fsck_commit(const struct object_id *oid,
 		err = report(options, oid, OBJ_COMMIT, FSCK_MSG_MULTIPLE_AUTHORS, "invalid format - multiple 'author' lines");
 	if (err)
 		return err;
-	if (!skip_prefix(buffer, "committer ", &buffer))
+	if (buffer >= buffer_end || !skip_prefix(buffer, "committer ", &buffer))
 		return report(options, oid, OBJ_COMMIT, FSCK_MSG_MISSING_COMMITTER, "invalid format - expected 'committer' line");
 	err = fsck_ident(&buffer, oid, OBJ_COMMIT, options);
 	if (err)
@@ -899,13 +936,19 @@ int fsck_tag_standalone(const struct object_id *oid, const char *buffer,
 	int ret = 0;
 	char *eol;
 	struct strbuf sb = STRBUF_INIT;
+	const char *buffer_end = buffer + size;
 	const char *p;
 
+	/*
+	 * We _must_ stop parsing immediately if this reports failure, as the
+	 * memory safety of the rest of the function depends on it. See the
+	 * comment above the definition of verify_headers() for more details.
+	 */
 	ret = verify_headers(buffer, size, oid, OBJ_TAG, options);
 	if (ret)
 		goto done;
 
-	if (!skip_prefix(buffer, "object ", &buffer)) {
+	if (buffer >= buffer_end || !skip_prefix(buffer, "object ", &buffer)) {
 		ret = report(options, oid, OBJ_TAG, FSCK_MSG_MISSING_OBJECT, "invalid format - expected 'object' line");
 		goto done;
 	}
@@ -916,11 +959,11 @@ int fsck_tag_standalone(const struct object_id *oid, const char *buffer,
 	}
 	buffer = p + 1;
 
-	if (!skip_prefix(buffer, "type ", &buffer)) {
+	if (buffer >= buffer_end || !skip_prefix(buffer, "type ", &buffer)) {
 		ret = report(options, oid, OBJ_TAG, FSCK_MSG_MISSING_TYPE_ENTRY, "invalid format - expected 'type' line");
 		goto done;
 	}
-	eol = strchr(buffer, '\n');
+	eol = memchr(buffer, '\n', buffer_end - buffer);
 	if (!eol) {
 		ret = report(options, oid, OBJ_TAG, FSCK_MSG_MISSING_TYPE, "invalid format - unexpected end after 'type' line");
 		goto done;
@@ -932,11 +975,11 @@ int fsck_tag_standalone(const struct object_id *oid, const char *buffer,
 		goto done;
 	buffer = eol + 1;
 
-	if (!skip_prefix(buffer, "tag ", &buffer)) {
+	if (buffer >= buffer_end || !skip_prefix(buffer, "tag ", &buffer)) {
 		ret = report(options, oid, OBJ_TAG, FSCK_MSG_MISSING_TAG_ENTRY, "invalid format - expected 'tag' line");
 		goto done;
 	}
-	eol = strchr(buffer, '\n');
+	eol = memchr(buffer, '\n', buffer_end - buffer);
 	if (!eol) {
 		ret = report(options, oid, OBJ_TAG, FSCK_MSG_MISSING_TAG, "invalid format - unexpected end after 'type' line");
 		goto done;
@@ -952,18 +995,16 @@ int fsck_tag_standalone(const struct object_id *oid, const char *buffer,
 	}
 	buffer = eol + 1;
 
-	if (!skip_prefix(buffer, "tagger ", &buffer)) {
+	if (buffer >= buffer_end || !skip_prefix(buffer, "tagger ", &buffer)) {
 		/* early tags do not contain 'tagger' lines; warn only */
 		ret = report(options, oid, OBJ_TAG, FSCK_MSG_MISSING_TAGGER_ENTRY, "invalid format - expected 'tagger' line");
 		if (ret)
 			goto done;
 	} else
 		ret = fsck_ident(&buffer, oid, OBJ_TAG, options);
-	if (!*buffer)
-		goto done;
 
-	if (!starts_with(buffer, "\n")) {
+	if (buffer < buffer_end && !starts_with(buffer, "\n")) {
 		/*
 		 * The verify_headers() check will allow
 		 * e.g. "[...]tagger <tagger>\nsome
diff --git a/t/t1451-fsck-buffer.sh b/t/t1451-fsck-buffer.sh
new file mode 100755
index 0000000000..9ac270abab
--- /dev/null
+++ b/t/t1451-fsck-buffer.sh
@@ -0,0 +1,140 @@
+#!/bin/sh
+
+test_description='fsck on buffers without NUL termination
+
+The goal here is to make sure that the various fsck parsers never look
+past the end of the buffer they are given, even when encountering broken
+or truncated objects.
+
+We have to use "hash-object" for this because most code paths that read objects
+append an extra NUL for safety after the buffer. But hash-object, since it is
+reading straight from a file (and possibly even mmap-ing it) cannot always do
+so.
+
+These tests _might_ catch such overruns in normal use, but should be run with
+ASan or valgrind for more confidence.
+'
+. ./test-lib.sh
+
+# the general idea for tags and commits is to build up the "base" file
+# progressively, and then test new truncations on top of it.
+reset () {
+	test_expect_success 'reset input to empty' '
+		>base
+	'
+}
+
+add () {
+	content="$1"
+	type=${content%% *}
+	test_expect_success "add $type line" '
+		echo "$content" >>base
+	'
+}
+
+check () {
+	type=$1
+	fsck=$2
+	content=$3
+	test_expect_success "truncated $type ($fsck, \"$content\")" '
+		# do not pipe into hash-object here; we want to increase
+		# the chance that it uses a fixed-size buffer or mmap,
+		# and a pipe would be read into a strbuf.
+		{
+			cat base &&
+			echo "$content"
+		} >input &&
+		test_must_fail git hash-object -t "$type" input 2>err &&
+		grep "$fsck" err
+	'
+}
+
+test_expect_success 'create valid objects' '
+	git commit --allow-empty -m foo &&
+	commit=$(git rev-parse --verify HEAD) &&
+	tree=$(git rev-parse --verify HEAD^{tree})
+'
+
+reset
+check commit missingTree ""
+check commit missingTree "tr"
+check commit missingTree "tree"
+check commit badTreeSha1 "tree "
+check commit badTreeSha1 "tree 1234"
+add "tree $tree"
+
+# these expect missingAuthor because "parent" is optional
+check commit missingAuthor ""
+check commit missingAuthor "par"
+check commit missingAuthor "parent"
+check commit badParentSha1 "parent "
+check commit badParentSha1 "parent 1234"
+add "parent $commit"
+
+check commit missingAuthor ""
+check commit missingAuthor "au"
+check commit missingAuthor "author"
+ident_checks () {
+	check $1 missingEmail "$2 "
+	check $1 missingEmail "$2 name"
+	check $1 badEmail "$2 name <"
+	check $1 badEmail "$2 name <email"
+	check $1 missingSpaceBeforeDate "$2 name <email>"
+	check $1 badDate "$2 name <email> "
+	check $1 badDate "$2 name <email> 1234"
+	check $1 badTimezone "$2 name <email> 1234 "
+	check $1 badTimezone "$2 name <email> 1234 +"
+}
+ident_checks commit author
+add "author name <email> 1234 +0000"
+
+check commit missingCommitter ""
+check commit missingCommitter "co"
+check commit missingCommitter "committer"
+ident_checks commit committer
+add "committer name <email> 1234 +0000"
+
+reset
+check tag missingObject ""
+check tag missingObject "obj"
+check tag missingObject "object"
+check tag badObjectSha1 "object "
+check tag badObjectSha1 "object 1234"
+add "object $commit"
+
+check tag missingType ""
+check tag missingType "ty"
+check tag missingType "type"
+check tag badType "type "
+check tag badType "type com"
+add "type commit"
+
+check tag missingTagEntry ""
+check tag missingTagEntry "ta"
+check tag missingTagEntry "tag"
+check tag badTagName "tag "
+add "tag foo"
+
+check tag missingTagger ""
+check tag missingTagger "ta"
+check tag missingTagger "tagger"
+ident_checks tag tagger
+
+# trees are a binary format and can't use our earlier helpers
+test_expect_success 'truncated tree (short hash)' '
+	printf "100644 foo\0\1\1\1\1" >input &&
+	test_must_fail git hash-object -t tree input 2>err &&
+	grep badTree err
+'
+
+test_expect_success 'truncated tree (missing nul)' '
+	# these two things are indistinguishable to the parser. The important
+	# thing about this example is that there are enough bytes to
+	# make up a hash, and that there is no NUL (and we confirm that the
+	# parser does not walk past the end of the buffer).
+	printf "100644 a long filename, or a hash with missing nul?" >input &&
+	test_must_fail git hash-object -t tree input 2>err &&
+	grep badTree err
+'
+
+test_done
-- 
2.39.1.616.gd06fca9e99
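For readers who want the bounded-parsing idea in isolation: the patch above keeps skip_prefix() and guards each call with "buffer >= buffer_end", relying on verify_headers() to have checked the header block first, and it swaps strchr() for a memchr() limited to buffer_end - buffer. The sketch below is a stricter, standalone variant of the same idea and is not code from the series. Both helper names are made up for illustration, and unlike the patch they do not depend on any earlier header check.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of git): like skip_prefix(), but it
 * compares only when the whole prefix fits before `end`, so it never
 * reads at or beyond `end`. Returns 1 and stores the position after
 * the prefix in *out on a match, 0 otherwise.
 */
static int skip_prefix_bounded(const char *buf, const char *end,
			       const char *prefix, const char **out)
{
	size_t len = strlen(prefix);

	if (buf > end || (size_t)(end - buf) < len)
		return 0;
	if (memcmp(buf, prefix, len))
		return 0;
	*out = buf + len;
	return 1;
}

/*
 * Hypothetical helper: find the newline that ends the current line
 * without assuming the buffer is NUL-terminated. Returns NULL when
 * there is no newline before `end`.
 */
static const char *find_eol_bounded(const char *buf, const char *end)
{
	if (buf >= end)
		return NULL;
	return memchr(buf, '\n', (size_t)(end - buf));
}
```

The trade-off is that every call site has to carry the end pointer around, which is roughly the cost the patch avoids by leaning on verify_headers().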
* Re: [PATCH 7/6] fsck: do not assume NUL-termination of buffers
  2023-01-19 23:13   ` [PATCH 7/6] fsck: do not assume NUL-termination of buffers Jeff King
@ 2023-01-19 23:58     ` Junio C Hamano
  0 siblings, 0 replies; 28+ messages in thread
From: Junio C Hamano @ 2023-01-19 23:58 UTC
To: Jeff King
Cc: git, Taylor Blau, René Scharfe, Ævar Arnfjörð Bjarmason

Jeff King <peff@peff.net> writes:

> So here's the result of my digging on this. The good news is that this
> one commit on top of the rest of the series should make everything safe.
> I'm sorry the explanation is a bit long, but I wanted to capture a bit
> of the history, the subtle assumptions, and how I approached analyzing
> and fixing it.
>
> There are a few paths forward here:
>
>   - apply this on top of the earlier 6 patches. This is the simplest
>     thing, and my preference. It does mean that t3800 temporarily has a
>     read-one-char-past-buffer bug that is detected by ASan after patch 6
>     but before this patch is applied.

That sounds reasonable, even though the purists among us may find it
slightly disturbing that it breaks "bisectability".

> Anyway, here's the patch. I'm happy to repost the whole 7-patch series,
> too, but since the earlier ones didn't change in my preferred path
> forward, this seemed easier for now. ;)

Thanks. Will queue.
* Re: [RFC/PATCH 0/6] hash-object: use fsck to check objects
  2023-01-19  1:39 ` Jeff King
  2023-01-19 23:13   ` [PATCH 7/6] fsck: do not assume NUL-termination of buffers Jeff King
@ 2023-01-21  9:36   ` René Scharfe
  2023-01-22  7:48     ` Jeff King
  1 sibling, 1 reply; 28+ messages in thread
From: René Scharfe @ 2023-01-21 9:36 UTC
To: Jeff King, git; +Cc: Ævar Arnfjörð Bjarmason

On 19.01.23 at 02:39, Jeff King wrote:
>
> Though I do find the use of strlen() in decode_tree_entry()
> a little suspicious (and that would be true of the current code, as
> well, since it powers hash-object's existing parsing check).

strlen() won't overrun the buffer because the first check in
decode_tree_entry() makes sure there is a NUL in the buffer ahead.
If get_mode() crosses it then we exit early.

Storing the result in an unsigned int can overflow on platforms where
size_t is bigger. That would result in pathlen values being too short
for really long paths, but no out-of-bounds access. They are then
stored as signed int in struct name_entry and used as such in many
places -- that seems like a bad idea, but I didn't actually check them
thoroughly.

get_mode() can overflow "mode" if there are too many octal digits. Do
we need to accept more than two handfuls in the first place? I'll send
a patch for at least rejecting overflow.

Hmm, what would be the performance impact of trees with mode fields
zero-padded to silly lengths?

René
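To make the get_mode() overflow concern concrete, here is a minimal sketch of an octal mode parser that rejects input whose value would no longer fit in an unsigned int. It is not git's get_mode() and not the patch mentioned above; the function name, the explicit end pointer, and the choice to return NULL on any malformed input are all assumptions made for illustration.

```c
#include <limits.h>
#include <stddef.h>

/*
 * Hypothetical helper: parse octal digits up to the first space and
 * store the value in *modep. Returns a pointer just past the space,
 * or NULL if the field is empty, contains a non-octal character,
 * would overflow unsigned int, or is not followed by a space before
 * `end`.
 */
static const char *parse_octal_mode(const char *str, const char *end,
				    unsigned int *modep)
{
	unsigned int mode = 0;

	if (str >= end || *str == ' ')
		return NULL; /* empty mode field */
	while (str < end && *str != ' ') {
		unsigned char c = (unsigned char)*str++;

		if (c < '0' || c > '7')
			return NULL;
		if (mode > (UINT_MAX >> 3))
			return NULL; /* the next shift would drop high bits */
		mode = (mode << 3) + (unsigned int)(c - '0');
	}
	if (str >= end)
		return NULL; /* ran out of buffer before the space */
	*modep = mode;
	return str + 1;
}
```

Note that this only rejects overflow. A field padded with leading zeros still parses, because the accumulated value stays small, so the zero-padding question raised above is a separate policy decision.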
* Re: [RFC/PATCH 0/6] hash-object: use fsck to check objects
  2023-01-21  9:36   ` [RFC/PATCH 0/6] hash-object: use fsck to check objects René Scharfe
@ 2023-01-22  7:48     ` Jeff King
  2023-01-22 11:39       ` René Scharfe
  0 siblings, 1 reply; 28+ messages in thread
From: Jeff King @ 2023-01-22 7:48 UTC
To: René Scharfe; +Cc: git, Ævar Arnfjörð Bjarmason

On Sat, Jan 21, 2023 at 10:36:08AM +0100, René Scharfe wrote:

> On 19.01.23 at 02:39, Jeff King wrote:
> >
> > Though I do find the use of strlen() in decode_tree_entry()
> > a little suspicious (and that would be true of the current code, as
> > well, since it powers hash-object's existing parsing check).
>
> strlen() won't overrun the buffer because the first check in
> decode_tree_entry() makes sure there is a NUL in the buffer ahead.
> If get_mode() crosses it then we exit early.

Yeah, that was what I found after digging deeper (see my patch 7).

> Storing the result in an unsigned int can overflow on platforms where
> size_t is bigger. That would result in pathlen values being too short
> for really long paths, but no out-of-bounds access. They are then
> stored as signed int in struct name_entry and used as such in many
> places -- that seems like a bad idea, but I didn't actually check them
> thoroughly.

Yeah, I agree that the use of a signed int there looks questionable. I
do think it's orthogonal to my series here, as that tree-decoding is
used by the existing hash-object checks.

But it probably bears further examination, especially because we use it
for the fsck checks on incoming objects via receive-pack, etc, which are
meant to be the first line of defense for hosters who might receive
malicious garbage from users.

We probably ought to reject trees with enormous names via fsck anyway. I
actually have a patch to do that, but of course it depends on
decode_tree_entry() to get the length, so there's a bit of
chicken-and-egg. We probably also should outright reject gigantic trees,
which closes out a whole class of integer truncation problems. I know
GitHub has rejected trees over 100MB for years for this reason.

> get_mode() can overflow "mode" if there are too many octal digits. Do
> we need to accept more than two handfuls in the first place? I'll send
> a patch for at least rejecting overflow.

Seems reasonable. I doubt there's an interesting attack here, just
because the mode isn't used to index anything. If you feed a garbage
mode that happens to overflow to something useful, you could just as
easily have sent the useful mode in the first place.

> Hmm, what would be the performance impact of trees with mode fields
> zero-padded to silly lengths?

Certainly it would waste some time parsing the tree, but you could do
that already with a long pathname. Or just having a lot of paths in a
tree. Or a lot of trees.

-Peff
* Re: [RFC/PATCH 0/6] hash-object: use fsck to check objects
  2023-01-22  7:48     ` Jeff King
@ 2023-01-22 11:39       ` René Scharfe
  2023-02-01 14:06         ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 28+ messages in thread
From: René Scharfe @ 2023-01-22 11:39 UTC
To: Jeff King; +Cc: git, Ævar Arnfjörð Bjarmason

On 22.01.23 at 08:48, Jeff King wrote:
> On Sat, Jan 21, 2023 at 10:36:08AM +0100, René Scharfe wrote:
>
>> On 19.01.23 at 02:39, Jeff King wrote:
>>>
>>> Though I do find the use of strlen() in decode_tree_entry()
>>> a little suspicious (and that would be true of the current code, as
>>> well, since it powers hash-object's existing parsing check).
>>
>> strlen() won't overrun the buffer because the first check in
>> decode_tree_entry() makes sure there is a NUL in the buffer ahead.
>> If get_mode() crosses it then we exit early.
>
> Yeah, that was what I found after digging deeper (see my patch 7).
>
>> Storing the result in an unsigned int can overflow on platforms where
>> size_t is bigger. That would result in pathlen values being too short
>> for really long paths, but no out-of-bounds access. They are then
>> stored as signed int in struct name_entry and used as such in many
>> places -- that seems like a bad idea, but I didn't actually check them
>> thoroughly.
>
> Yeah, I agree that the use of a signed int there looks questionable. I
> do think it's orthogonal to my series here, as that tree-decoding is
> used by the existing hash-object checks.

Sure.

> But it probably bears further examination, especially because we use it
> for the fsck checks on incoming objects via receive-pack, etc, which are
> meant to be the first line of defense for hosters who might receive
> malicious garbage from users.
>
> We probably ought to reject trees with enormous names via fsck anyway. I
> actually have a patch to do that, but of course it depends on
> decode_tree_entry() to get the length, so there's a bit of
> chicken-and-egg.

Solvable by limiting the search for the end of the string in
decode_tree_entry() by using strnlen(3) or memchr(3) instead of
strlen(3). You just need to define some (configurable?) limit.

> We probably also should outright reject gigantic trees,
> which closes out a whole class of integer truncation problems. I know
> GitHub has rejected trees over 100MB for years for this reason.

Makes sense.

>> get_mode() can overflow "mode" if there are too many octal digits. Do
>> we need to accept more than two handfuls in the first place? I'll send
>> a patch for at least rejecting overflow.
>
> Seems reasonable. I doubt there's an interesting attack here, just
> because the mode isn't used to index anything. If you feed a garbage
> mode that happens to overflow to something useful, you could just as
> easily have sent the useful mode in the first place.
>
>> Hmm, what would be the performance impact of trees with mode fields
>> zero-padded to silly lengths?
>
> Certainly it would waste some time parsing the tree, but you could do
> that already with a long pathname. Or just having a lot of paths in a
> tree. Or a lot of trees.

That's a cup half full/empty thing, perhaps. What's the harm in leading
zeros? vs. Why allow leading zeros?

René
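A rough sketch of the memchr(3) variant suggested above, for illustration only: it looks for the NUL that ends a tree entry's path, but never searches beyond the bytes that remain in the buffer or beyond a caller-supplied limit. The function name, the return convention, and the idea of passing the limit as a parameter are assumptions, not anything from decode_tree_entry() itself.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: find the length of a NUL-terminated path that
 * starts at `path`, looking at no more than `avail` bytes and
 * accepting at most `max_len` path bytes. Returns 0 and stores the
 * length in *len on success, or -1 if no NUL is found within those
 * bounds.
 */
static int bounded_path_len(const char *path, size_t avail,
			    size_t max_len, size_t *len)
{
	size_t span = avail;
	const char *nul;

	if (span > max_len)
		span = max_len + 1; /* max_len path bytes plus the NUL */
	nul = memchr(path, '\0', span);
	if (!nul)
		return -1;
	*len = (size_t)(nul - path);
	return 0;
}
```

With a helper like this the length is known to be at most max_len before it is stored anywhere, so a caller that picks a limit no larger than INT_MAX also sidesteps the truncation into int discussed earlier in the thread.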
* Re: [RFC/PATCH 0/6] hash-object: use fsck to check objects
  2023-01-22 11:39       ` René Scharfe
@ 2023-02-01 14:06         ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 28+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2023-02-01 14:06 UTC
To: René Scharfe; +Cc: Jeff King, git

On Sun, Jan 22 2023, René Scharfe wrote:

> On 22.01.23 at 08:48, Jeff King wrote:
>> We probably also should outright reject gigantic trees,
>> which closes out a whole class of integer truncation problems. I know
>> GitHub has rejected trees over 100MB for years for this reason.
>
> Makes sense.

I really don't think it does. Let's not forever encode arbitrary limits
in the formats because of transitory implementation details.

Those sorts of arbitrary limits are understandable for hosting
providers, and a sensible trade-off on that front.

But for git as a general tool? I'd like to be able to throw whatever
garbage I've got locally at it, and not have it complain.

We already have a deluge of int vs. unsigned int vs. size_t warnings
that we could address; we're just choosing to suppress them currently.

Instead we have hacks like cast_size_t_to_int().

Those sorts of hacks are understandable as band-aid fixes, but let's
work on fixing the real causes.
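For context on the int vs. size_t point, the underlying problem is that a length held in a size_t may not fit in an int. The sketch below shows one way a caller could detect that case and handle it as an ordinary error instead of silently truncating. It is not git's cast_size_t_to_int(); the name and the error-return convention are assumptions for illustration.

```c
#include <limits.h>
#include <stddef.h>

/*
 * Hypothetical helper: convert a size_t to int only when the value is
 * representable. Returns 0 and stores the result in *out on success,
 * or -1 (leaving *out untouched) when `in` is larger than INT_MAX.
 */
static int size_to_int_checked(size_t in, int *out)
{
	if (in > (size_t)INT_MAX)
		return -1;
	*out = (int)in;
	return 0;
}
```

A check like this only reports the problem at the conversion point. The longer-term fix argued for above is to widen the int fields themselves so the conversion is not needed.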