git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "René Scharfe" <l.s.r@web.de>
To: Junio C Hamano <gitster@pobox.com>
Cc: Brandon Williams <bwilliamseng@gmail.com>,
	git <git@vger.kernel.org>, Jeff King <peff@peff.net>
Subject: Re: invalid tree and commit object
Date: Sun, 10 May 2020 11:07:25 +0200	[thread overview]
Message-ID: <aab9512b-a70a-0f5b-5cdc-5d40acd343d0@web.de> (raw)
In-Reply-To: <xmqqh7wovoop.fsf@gitster.c.googlers.com>

Am 09.05.20 um 22:27 schrieb Junio C Hamano:
> René Scharfe <l.s.r@web.de> writes:
>
>> Hmm, this could lead to quadratic behavior in the worst case, can't it?
>
> Oh, absolutely.  But as you have shown, you'd need a specially
> crafted tree with early entries that are prefixes of later ones,
> which would rather be rare, and most of the time the bottom pointer
> would advance by one every time we consume one path.
>
> So it is trading (hopefully rare) worst-case runtime with reduced
> storage cost.
>
>> We could, however, reduce the names we add to the string_list to
>> those that are possible candidates for conflict -- blobs followed by an
>> entry whose name starts with the blob name followed by a dot and trees
>> that follow an entry whose name matches in the same way.
>
> Yes, that is a valid solution that strikes a different balance
> between allocation and runtime.

fsck should ideally handle the worst cases gracefully -- I assume it has
to deal with the weirdest "natural" corruptions and the "best" that
pranksters and DoSers can come up with.

> We may want to survey how commonly "bad" trees appear in real
> projects.  Depending on the result, we might want to update the
> "limit re-scanning using the bottom pointer" hack we have been using
> in the unpack-trees code.

They are surprisingly common in Git's own repo.  Ca. 74% of the trees
in my clone had at least one candidate and the average number of
candidates per tree in those was ca. 8.9.  We have e.g. the t/ tree
with its several tnnnn/ vs. tnnnn-foo.sh entries, or builtin/ vs.
builtin.h.

In the Linux repo ca. 20% of the checked trees have at least a
candidate and the average number of candidates in those is ca. 2.4.

So the patch for only adding possible candidates to the string_list
is below (on top of my earlier one), but I wonder if we can do
better.

When we look for a d/f name clash we don't necessarily have to look
back upon finding a candidate directory -- we can also scan ahead
upon finding a candidate non-directory.  That could be done using
repeated update_tree_entry_gently() calls on a private copy.  It
wouldn't require any string_list allocations, but its worst case
complexity would still be quadratic, of course.

Would a stack work?  When we see a candidate non-directory, we put
it on the stack.  When we see a candidate directory, we compare it
to the entry at the top of the stack using strcmp().  Equality
indicates a duplicate and we are done.  If the directory name is
less then we can pop the entry from the stack and check again, as
we're past the point where a duplicate would be.  Makes sense?

The candidate stack solution would have linear complexity and
require less memory than a full list of candidates.

It would rely on the entries to be in the correct order (same as
the patch below, come to think of it), but that's probably OK.
We may miss DUPLICATE_ENTRIES (just like today :), but
TREE_NOT_SORTED would still be reported.

---
 fsck.c | 25 +++++++++++++++++++++++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/fsck.c b/fsck.c
index f47b35fee8..7d5bfef804 100644
--- a/fsck.c
+++ b/fsck.c
@@ -569,6 +569,25 @@ static int verify_ordered(unsigned mode1, const char *name1, unsigned mode2, con
 	return c1 < c2 ? 0 : TREE_UNORDERED;
 }

+/*
+ * Consecutive entries are checked for duplicates in verify_ordered().
+ * However, there can be non-consecutive duplicates because a slash
+ * ('/') is appended implicitly to directory names during sorting, so
+ * there can be other names between a non-directory entry and a
+ * directory entry with the same name.  E.g.:
+ *
+ *    foo
+ *    foo.bar
+ *    foo/
+ *
+ */
+static int may_be_df_dup(const char *candidate, const char *interloper)
+{
+	const char *p;
+
+	return skip_prefix(interloper, candidate, &p) && '\0' < *p && *p < '/';
+}
+
 static int fsck_tree(const struct object_id *oid,
 		     const char *buffer, unsigned long size,
 		     struct fsck_options *options)
@@ -678,11 +697,14 @@ static int fsck_tree(const struct object_id *oid,
 			default:
 				break;
 			}
+			if (!S_ISDIR(o_mode) && may_be_df_dup(o_name, name))
+				string_list_append(&names, o_name);
+			if (S_ISDIR(mode) && may_be_df_dup(name, o_name))
+				string_list_append(&names, name);
 		}

 		o_mode = mode;
 		o_name = name;
-		string_list_append(&names, name);
 	}

 	nr = names.nr;
@@ -1221,7 +1243,6 @@ int fsck_finish(struct fsck_options *options)
 		free(buf);
 	}

-
 	oidset_clear(&gitmodules_found);
 	oidset_clear(&gitmodules_done);
 	return ret;
--
2.26.2


  reply	other threads:[~2020-05-10  9:08 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-05-09  6:19 invalid tree and commit object Brandon Williams
2020-05-09 10:16 ` René Scharfe
2020-05-09  7:16   ` Johannes Schindelin
2020-05-09 11:51     ` René Scharfe
2020-05-09 17:28   ` Junio C Hamano
2020-05-09 19:24     ` René Scharfe
2020-05-09 20:27       ` Junio C Hamano
2020-05-10  9:07         ` René Scharfe [this message]
2020-05-10 16:12           ` René Scharfe
2020-05-11 16:25             ` Junio C Hamano
2020-05-13 16:27               ` Brandon Williams
2020-05-21  9:51               ` René Scharfe
2020-05-21  9:52               ` [PATCH 1/4] fsck: fix a typo in a comment René Scharfe
2020-05-21 10:10                 ` Denton Liu
2020-05-21 11:15                 ` René Scharfe
2020-05-21  9:52               ` [PATCH 2/4] t1450: increase test coverage of in-tree d/f detection René Scharfe
2020-05-21 10:20                 ` Denton Liu
2020-05-21 13:31                   ` René Scharfe
2020-05-21 18:01                     ` Junio C Hamano
2020-05-21  9:52               ` [PATCH 3/4] t1450: demonstrate undetected in-tree d/f conflict René Scharfe
2020-05-21  9:52               ` [PATCH 4/4] fsck: detect more in-tree d/f conflicts René Scharfe
2020-05-10 16:37           ` invalid tree and commit object Junio C Hamano
2020-05-21  9:51             ` René Scharfe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aab9512b-a70a-0f5b-5cdc-5d40acd343d0@web.de \
    --to=l.s.r@web.de \
    --cc=bwilliamseng@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).