git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / Atom feed
* [RFC PATCH 0/1] commit-graph.c: handle corrupt commit trees
@ 2019-09-04  2:22 Taylor Blau
  2019-09-04  2:22 ` [RFC PATCH 1/1] commit-graph.c: die on un-parseable commits Taylor Blau
  2019-09-04 18:25 ` [RFC PATCH 0/1] commit-graph.c: handle corrupt commit trees Garima Singh
  0 siblings, 2 replies; 20+ messages in thread
From: Taylor Blau @ 2019-09-04  2:22 UTC (permalink / raw)
  To: stolee; +Cc: git, peff

Hi,

I was running some of the new 'git commit-graph' commands, and noticed
that I could consistently get 'git commit-graph write --reachable' to
segfault when a commit's root tree is corrupt.

I have an extremely-unfinished fix attached as an RFC PATCH below, but I
wanted to get a few thoughts on this before sending it out as a non-RFC.

In my patch, I simply 'die()' when a commit isn't able to be parsed
(i.e., when 'parse_commit_no_graph' returns a non-zero code), but I
wanted to see if others thought that this was an OK approach. Some
thoughts:

  * It seems like we could write a commit-graph by placing a "filler"
    entry where the broken commit would have gone. I don't see any place
    where this is implemented currently, but this seems like a viable
    alternative to not writing _any_ commits into the commit-graph.

  * Could we learn about the corruption earlier? Ideally (in the obscene
    of these placeholder objects that indicate corruption), we wouldn't
    start writing a commit-graph until all of the objects in it are
    known to be good.

    This seems like a costly operation when it comes to memory, but
    maybe I'm thinking about it the wrong way.

Depending on the approach here, I'll clean up the commit and message, as
well as add a test that demonstrates the breakage and subsequent fix.

Thanks in advance for your feedback :-).

Taylor Blau (1):
  commit-graph.c: die on un-parseable commits

 commit-graph.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

--
2.22.0

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [RFC PATCH 1/1] commit-graph.c: die on un-parseable commits
  2019-09-04  2:22 [RFC PATCH 0/1] commit-graph.c: handle corrupt commit trees Taylor Blau
@ 2019-09-04  2:22 ` Taylor Blau
  2019-09-04  3:04   ` Jeff King
  2019-09-04 18:25 ` [RFC PATCH 0/1] commit-graph.c: handle corrupt commit trees Garima Singh
  1 sibling, 1 reply; 20+ messages in thread
From: Taylor Blau @ 2019-09-04  2:22 UTC (permalink / raw)
  To: stolee; +Cc: git, peff

When we write a commit graph chunk, we process a given list of 'struct
commit *'s and parse out the parent(s) and tree OID in order to write
out its information.

We do this by calling 'parse_commit_no_graph', and then checking the
result of 'get_commit_tree_oid' to write the tree OID. This process
assumes that 'parse_commit_no_graph' parses the commit successfully.
When this isn't the case, 'get_commit_tree_oid(*list)' may return NULL,
in which case trying to '->hash' it causes a SIGSEGV.

Instead, teach 'write_graph_chunk_data' to stop when a commit isn't able
to be parsed, at the peril of failing to write a commit-graph.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 commit-graph.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/commit-graph.c b/commit-graph.c
index f2888c203b..6aa6998ecd 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -843,7 +843,9 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
 		uint32_t packedDate[2];
 		display_progress(ctx->progress, ++ctx->progress_cnt);
 
-		parse_commit_no_graph(*list);
+		if (parse_commit_no_graph(*list))
+			die(_("unable to parse commit %s"),
+				oid_to_hex(&(*list)->object.oid));
 		hashwrite(f, get_commit_tree_oid(*list)->hash, hash_len);
 
 		parent = (*list)->parents;
-- 
2.22.0

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 1/1] commit-graph.c: die on un-parseable commits
  2019-09-04  2:22 ` [RFC PATCH 1/1] commit-graph.c: die on un-parseable commits Taylor Blau
@ 2019-09-04  3:04   ` Jeff King
  2019-09-04 21:18     ` Taylor Blau
  2019-09-05 22:19     ` Junio C Hamano
  0 siblings, 2 replies; 20+ messages in thread
From: Jeff King @ 2019-09-04  3:04 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Junio C Hamano, stolee, git

On Tue, Sep 03, 2019 at 10:22:57PM -0400, Taylor Blau wrote:

> When we write a commit graph chunk, we process a given list of 'struct
> commit *'s and parse out the parent(s) and tree OID in order to write
> out its information.
> 
> We do this by calling 'parse_commit_no_graph', and then checking the
> result of 'get_commit_tree_oid' to write the tree OID. This process
> assumes that 'parse_commit_no_graph' parses the commit successfully.
> When this isn't the case, 'get_commit_tree_oid(*list)' may return NULL,
> in which case trying to '->hash' it causes a SIGSEGV.
> 
> Instead, teach 'write_graph_chunk_data' to stop when a commit isn't able
> to be parsed, at the peril of failing to write a commit-graph.

Yeah, I think it makes sense for commit-graph to bail completely if it
can't parse here. In theory it could omit the entry from the
commit-graph file (and a reader would just fall back to trying to access
the object contents itself), but I think we're too late in the process
for that. And besides, this should generally only happen in a corrupt
repository.

However...

> diff --git a/commit-graph.c b/commit-graph.c
> index f2888c203b..6aa6998ecd 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -843,7 +843,9 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
>  		uint32_t packedDate[2];
>  		display_progress(ctx->progress, ++ctx->progress_cnt);
>  
> -		parse_commit_no_graph(*list);
> +		if (parse_commit_no_graph(*list))
> +			die(_("unable to parse commit %s"),
> +				oid_to_hex(&(*list)->object.oid));
>  		hashwrite(f, get_commit_tree_oid(*list)->hash, hash_len);

I don't think parse_commit_no_graph() returning 0 assures us that
get_commit_tree() and friends will return non-NULL.

This is similar to the case discussed recently where a failed parse of a
tag may leave "tag->tagged == NULL" even though "tag->obj.parsed" is
set.

Here an earlier parsing error could cause (*list)->object.parsed to be
true, but with (*list)->maybe_tree still NULL. Our call to
parse_commit_no_graph() here would silently return "yep, already tried
to parse this", and then we'd still segfault.

We _could_ check:

  if (parse_commit_no_graph(*list))
	die("unable to parse...");
  tree = get_commit_tree_oid(*list);
  if (!tree)
	die("unable to get tree for %s...");

but trees are just one piece of data. In fact, the situation is much
worse for parents: there a NULL parent pointer is valid data, so worse
than segfaulting, we'd write invalid data to the commit graph, skipping
one or more parents!

And I think there's literally no way for this function to tell the
difference between "no parent" and "there was an earlier error, but we
set the parsed flag anyway and the parent flag is invalid".

I think that argues against Junio's response in:

  https://public-inbox.org/git/xmqqo90bhmi3.fsf@gitster-ct.c.googlers.com/

about how we can use the parsed flag to look at "slightly corrupt"
objects. I think we'd need at least an extra flag for "corrupt", though
I think it is simpler just _not_ setting "parsed" and letting the next
caller spend the extra cycles to rediscover the problem if they're
interested.

(All of this presupposes that you can convince another piece of code in
the same process to parse the commit buffer and ignore the error. I have
no idea if that's possible or not in this case, but it sure would be
nice not to have to care).

-Peff

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 0/1] commit-graph.c: handle corrupt commit trees
  2019-09-04  2:22 [RFC PATCH 0/1] commit-graph.c: handle corrupt commit trees Taylor Blau
  2019-09-04  2:22 ` [RFC PATCH 1/1] commit-graph.c: die on un-parseable commits Taylor Blau
@ 2019-09-04 18:25 ` Garima Singh
  2019-09-04 21:21   ` Taylor Blau
  1 sibling, 1 reply; 20+ messages in thread
From: Garima Singh @ 2019-09-04 18:25 UTC (permalink / raw)
  To: Taylor Blau, stolee; +Cc: git, peff


On 9/3/2019 10:22 PM, Taylor Blau wrote:
> Hi,
> 
> I was running some of the new 'git commit-graph' commands, and noticed
> that I could consistently get 'git commit-graph write --reachable' to
> segfault when a commit's root tree is corrupt.
> 
> I have an extremely-unfinished fix attached as an RFC PATCH below, but I
> wanted to get a few thoughts on this before sending it out as a non-RFC.
> 
> In my patch, I simply 'die()' when a commit isn't able to be parsed
> (i.e., when 'parse_commit_no_graph' returns a non-zero code), but I
> wanted to see if others thought that this was an OK approach. Some
> thoughts:

I like the idea of completely bailing if the commit can't be parsed too.
Only question: Is there a reason you chose to die() instead of BUG() 
like the other two places in that function? What is the criteria of 
choosing one over the other?

> 
>    * It seems like we could write a commit-graph by placing a "filler"
>      entry where the broken commit would have gone. I don't see any place
>      where this is implemented currently, but this seems like a viable
>      alternative to not writing _any_ commits into the commit-graph.

I would rather we didn't do this cause it will probably kick open the 
can of always watching for that filler when we are working with the 
commit-graph. Or do we already do that today? Maybe @stolee can chime in 
on what we do in cases of shallow clones and other potential gaps in the 
walk

-Garima

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 1/1] commit-graph.c: die on un-parseable commits
  2019-09-04  3:04   ` Jeff King
@ 2019-09-04 21:18     ` Taylor Blau
  2019-09-05  6:47       ` Jeff King
  2019-09-05 22:19     ` Junio C Hamano
  1 sibling, 1 reply; 20+ messages in thread
From: Taylor Blau @ 2019-09-04 21:18 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, Junio C Hamano, stolee, git

Hi Peff,

On Tue, Sep 03, 2019 at 11:04:56PM -0400, Jeff King wrote:
> On Tue, Sep 03, 2019 at 10:22:57PM -0400, Taylor Blau wrote:
>
> > When we write a commit graph chunk, we process a given list of 'struct
> > commit *'s and parse out the parent(s) and tree OID in order to write
> > out its information.
> >
> > We do this by calling 'parse_commit_no_graph', and then checking the
> > result of 'get_commit_tree_oid' to write the tree OID. This process
> > assumes that 'parse_commit_no_graph' parses the commit successfully.
> > When this isn't the case, 'get_commit_tree_oid(*list)' may return NULL,
> > in which case trying to '->hash' it causes a SIGSEGV.
> >
> > Instead, teach 'write_graph_chunk_data' to stop when a commit isn't able
> > to be parsed, at the peril of failing to write a commit-graph.
>
> Yeah, I think it makes sense for commit-graph to bail completely if it
> can't parse here. In theory it could omit the entry from the
> commit-graph file (and a reader would just fall back to trying to access
> the object contents itself), but I think we're too late in the process
> for that. And besides, this should generally only happen in a corrupt
> repository.

Yep. I sent this as an RFC PATCH because I wasn't quite sure how folks
would feel about 'die()'-ing in the middle of 'write_graph_chunk_data',
but I think that your reasoning makes sense, and it matches my own
preferences.

So, I am glad that we resolved that portion. I'll keep going below...

> However...
>
> > diff --git a/commit-graph.c b/commit-graph.c
> > index f2888c203b..6aa6998ecd 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -843,7 +843,9 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
> >  		uint32_t packedDate[2];
> >  		display_progress(ctx->progress, ++ctx->progress_cnt);
> >
> > -		parse_commit_no_graph(*list);
> > +		if (parse_commit_no_graph(*list))
> > +			die(_("unable to parse commit %s"),
> > +				oid_to_hex(&(*list)->object.oid));
> >  		hashwrite(f, get_commit_tree_oid(*list)->hash, hash_len);
>
> I don't think parse_commit_no_graph() returning 0 assures us that
> get_commit_tree() and friends will return non-NULL.
>
> This is similar to the case discussed recently where a failed parse of a
> tag may leave "tag->tagged == NULL" even though "tag->obj.parsed" is
> set.
>
> Here an earlier parsing error could cause (*list)->object.parsed to be
> true, but with (*list)->maybe_tree still NULL. Our call to
> parse_commit_no_graph() here would silently return "yep, already tried
> to parse this", and then we'd still segfault.
>
> We _could_ check:
>
>   if (parse_commit_no_graph(*list))
> 	die("unable to parse...");
>   tree = get_commit_tree_oid(*list);
>   if (!tree)
> 	die("unable to get tree for %s...");
>
> but trees are just one piece of data. In fact, the situation is much
> worse for parents: there a NULL parent pointer is valid data, so worse
> than segfaulting, we'd write invalid data to the commit graph, skipping
> one or more parents!
>
> And I think there's literally no way for this function to tell the
> difference between "no parent" and "there was an earlier error, but we
> set the parsed flag anyway and the parent flag is invalid".
>
> I think that argues against Junio's response in:
>
>   https://public-inbox.org/git/xmqqo90bhmi3.fsf@gitster-ct.c.googlers.com/
>
> about how we can use the parsed flag to look at "slightly corrupt"
> objects. I think we'd need at least an extra flag for "corrupt", though
> I think it is simpler just _not_ setting "parsed" and letting the next
> caller spend the extra cycles to rediscover the problem if they're
> interested.

All of this makes sense to me, so I'm wondering what part(s) of this you
feel are worth addressing in this first patch series. Presumably, there
is a longer series that we _could_ write which would introduce a new
'corrupt' field and then check for it here.

But, I'm hesitant to write those patches, since I only have this one
call-site in mind. If we introduce 'corrupt', I feel it would be best to
use it uniformly, instead of only checking it here, and relying on other
bespoke mechanisms to detect corruption elsewhere.

So, I'm content to write the pseudo-code you provided above (which is to
say, call and check both 'parse_commit_no_graph', _and_
'get_commit_tree_oid'), because I think that it's expedient, and fix the
issue which I'm pointing out here.

I don't know how to address the parents situation without going further,
so perhaps it's worth it to think of an alternative, or even leave it
out of these first patch(es).

Let me know what you think, and thanks for your thoughts so far.

> (All of this presupposes that you can convince another piece of code in
> the same process to parse the commit buffer and ignore the error. I have
> no idea if that's possible or not in this case, but it sure would be
> nice not to have to care).
>
> -Peff
Thanks,
Taylor

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 0/1] commit-graph.c: handle corrupt commit trees
  2019-09-04 18:25 ` [RFC PATCH 0/1] commit-graph.c: handle corrupt commit trees Garima Singh
@ 2019-09-04 21:21   ` Taylor Blau
  2019-09-05  6:08     ` Jeff King
  2019-09-06 16:48     ` Derrick Stolee
  0 siblings, 2 replies; 20+ messages in thread
From: Taylor Blau @ 2019-09-04 21:21 UTC (permalink / raw)
  To: Garima Singh; +Cc: Taylor Blau, stolee, git, peff

Hi Garima,

On Wed, Sep 04, 2019 at 02:25:55PM -0400, Garima Singh wrote:
>
> On 9/3/2019 10:22 PM, Taylor Blau wrote:
> > Hi,
> >
> > I was running some of the new 'git commit-graph' commands, and noticed
> > that I could consistently get 'git commit-graph write --reachable' to
> > segfault when a commit's root tree is corrupt.
> >
> > I have an extremely-unfinished fix attached as an RFC PATCH below, but I
> > wanted to get a few thoughts on this before sending it out as a non-RFC.
> >
> > In my patch, I simply 'die()' when a commit isn't able to be parsed
> > (i.e., when 'parse_commit_no_graph' returns a non-zero code), but I
> > wanted to see if others thought that this was an OK approach. Some
> > thoughts:
>
> I like the idea of completely bailing if the commit can't be parsed too.
> Only question: Is there a reason you chose to die() instead of BUG() like
> the other two places in that function? What is the criteria of choosing one
> over the other?

I did not call 'BUG' here because 'BUG' is traditionally used to
indicate an internal bug, e.g., an unexpected state or some such. On the
other side of that coin, 'BUG' is _not_ used to indicate repository
corruption, since that is not an issue in the Git codebase, rather in
the user's repository.

Though, to be honest, I've never seen that rule written out explicitly
(maybe if it were to be written somewhere, it could be stored in
Documentation/CodingGuidelines?). I think that this is some good
#leftoverbits material.

> >
> >    * It seems like we could write a commit-graph by placing a "filler"
> >      entry where the broken commit would have gone. I don't see any place
> >      where this is implemented currently, but this seems like a viable
> >      alternative to not writing _any_ commits into the commit-graph.
>
> I would rather we didn't do this cause it will probably kick open the can of
> always watching for that filler when we are working with the commit-graph.
> Or do we already do that today? Maybe @stolee can chime in on what we do in
> cases of shallow clones and other potential gaps in the walk

Yeah, I think that the consensus is that it makes sense to just die
here, which is fine by me.

> -Garima

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 0/1] commit-graph.c: handle corrupt commit trees
  2019-09-04 21:21   ` Taylor Blau
@ 2019-09-05  6:08     ` Jeff King
  2019-09-06 16:48     ` Derrick Stolee
  1 sibling, 0 replies; 20+ messages in thread
From: Jeff King @ 2019-09-05  6:08 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Garima Singh, stolee, git

On Wed, Sep 04, 2019 at 05:21:21PM -0400, Taylor Blau wrote:

> > I like the idea of completely bailing if the commit can't be parsed too.
> > Only question: Is there a reason you chose to die() instead of BUG() like
> > the other two places in that function? What is the criteria of choosing one
> > over the other?
> 
> I did not call 'BUG' here because 'BUG' is traditionally used to
> indicate an internal bug, e.g., an unexpected state or some such. On the
> other side of that coin, 'BUG' is _not_ used to indicate repository
> corruption, since that is not an issue in the Git codebase, rather in
> the user's repository.
> 
> Though, to be honest, I've never seen that rule written out explicitly
> (maybe if it were to be written somewhere, it could be stored in
> Documentation/CodingGuidelines?). I think that this is some good
> #leftoverbits material.

That rule matches my understanding. A BUG() should be about asserting
invariants or catching should-not-happen cases, etc. Any time a BUG()
triggers, that is truly a bug in Git, no matter what input got thrown at
it, what syscalls failed, etc, and is worth fixing (even if the only
sensible thing is to die()).

As a side note, we've generally treated segfaults the same way. It
doesn't matter if the files on disk or the program input is garbage, we
should say so and abort the operation cleanly.

-Peff

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 1/1] commit-graph.c: die on un-parseable commits
  2019-09-04 21:18     ` Taylor Blau
@ 2019-09-05  6:47       ` Jeff King
  2019-09-06 16:48         ` Derrick Stolee
  0 siblings, 1 reply; 20+ messages in thread
From: Jeff King @ 2019-09-05  6:47 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Junio C Hamano, stolee, git

On Wed, Sep 04, 2019 at 05:18:47PM -0400, Taylor Blau wrote:

> All of this makes sense to me, so I'm wondering what part(s) of this you
> feel are worth addressing in this first patch series. Presumably, there
> is a longer series that we _could_ write which would introduce a new
> 'corrupt' field and then check for it here.
> 
> But, I'm hesitant to write those patches, since I only have this one
> call-site in mind. If we introduce 'corrupt', I feel it would be best to
> use it uniformly, instead of only checking it here, and relying on other
> bespoke mechanisms to detect corruption elsewhere.
> 
> So, I'm content to write the pseudo-code you provided above (which is to
> say, call and check both 'parse_commit_no_graph', _and_
> 'get_commit_tree_oid'), because I think that it's expedient, and fix the
> issue which I'm pointing out here.

I'd actually be willing to just take the patch you have here, and
consider the "parsed but we saw an error" thing as an oddity of the
object code.  IOW, we shouldn't _have_ to be double-checking here.
Looking for an error return from parse_commit() should really be all a
caller needs to do. Once that's fixed, then your code would just be
doing the right thing.

That said, there's another unhandled case, I think: lookup_tree() might
return NULL (if somebody previously saw that oid as a non-tree), and
parse_commit() wouldn't even notice and return an error!

IMHO that's also something that parse_commit() should be returning an
error for. And it's probably a lot easier to trigger versus the "parsed
earlier but corrupted" thing.

So it might be worth doing the NULL tree check here in the meantime. I
dunno.

Below is a sketch of what I'm thinking parse_commit() should do:

  - remember when an earlier parse returned an error, so we can repeat
    that error (this requires some unfortunate bit-field adjusting)

  - notice a lookup_tree failure

  - likewise, notice a lookup_parent failure

---
diff --git a/commit.c b/commit.c
index a98de16e3d..7e415932b7 100644
--- a/commit.c
+++ b/commit.c
@@ -391,7 +391,9 @@ const void *detach_commit_buffer(struct commit *commit, unsigned long *sizep)
 	return ret;
 }
 
-int parse_commit_buffer(struct repository *r, struct commit *item, const void *buffer, unsigned long size, int check_graph)
+static int parse_commit_buffer_1(struct repository *r, struct commit *item,
+				 const void *buffer, unsigned long size,
+				 int check_graph)
 {
 	const char *tail = buffer;
 	const char *bufptr = buffer;
@@ -401,9 +403,6 @@ int parse_commit_buffer(struct repository *r, struct commit *item, const void *b
 	const int tree_entry_len = the_hash_algo->hexsz + 5;
 	const int parent_entry_len = the_hash_algo->hexsz + 7;
 
-	if (item->object.parsed)
-		return 0;
-	item->object.parsed = 1;
 	tail += size;
 	if (tail <= bufptr + tree_entry_len + 1 || memcmp(bufptr, "tree ", 5) ||
 			bufptr[tree_entry_len] != '\n')
@@ -412,6 +411,10 @@ int parse_commit_buffer(struct repository *r, struct commit *item, const void *b
 		return error("bad tree pointer in commit %s",
 			     oid_to_hex(&item->object.oid));
 	set_commit_tree(item, lookup_tree(r, &parent));
+	if (!item->maybe_tree)
+		return error("bad tree pointer %s in commit %s",
+			     oid_to_hex(&parent),
+			     oid_to_hex(&item->object.oid));
 	bufptr += tree_entry_len + 1; /* "tree " + "hex sha1" + "\n" */
 	pptr = &item->parents;
 
@@ -431,15 +434,19 @@ int parse_commit_buffer(struct repository *r, struct commit *item, const void *b
 		if (graft && (graft->nr_parent < 0 || grafts_replace_parents))
 			continue;
 		new_parent = lookup_commit(r, &parent);
-		if (new_parent)
-			pptr = &commit_list_insert(new_parent, pptr)->next;
+		if (!new_parent)
+			return error("bad parent %s in commit %s",
+				     oid_to_hex(&parent),
+				     oid_to_hex(&item->object.oid));
+		pptr = &commit_list_insert(new_parent, pptr)->next;
 	}
 	if (graft) {
 		int i;
 		struct commit *new_parent;
 		for (i = 0; i < graft->nr_parent; i++) {
 			new_parent = lookup_commit(r,
 						   &graft->parent[i]);
+			/* Here we ignore bogus grafts. Also should be an error? */
 			if (!new_parent)
 				continue;
 			pptr = &commit_list_insert(new_parent, pptr)->next;
@@ -453,6 +460,23 @@ int parse_commit_buffer(struct repository *r, struct commit *item, const void *b
 	return 0;
 }
 
+int parse_commit_buffer(struct repository *r, struct commit *item,
+			const void *buffer, unsigned long size,
+			int check_graph)
+{
+	int ret;
+
+	if (item->object.parsed)
+		return item->object.corrupt ? -1 : 0;
+	item->object.parsed = 1;
+
+	ret = parse_commit_buffer_1(r, item, buffer, size, check_graph);
+	if (ret < 0)
+		item->object.corrupt = 1;
+
+	return ret;
+}
+
 int repo_parse_commit_internal(struct repository *r,
 			       struct commit *item,
 			       int quiet_on_missing,
diff --git a/object.h b/object.h
index 0120892bbd..b83d3964ad 100644
--- a/object.h
+++ b/object.h
@@ -59,7 +59,7 @@ struct object_array {
 
 /*
  * object flag allocation:
- * revision.h:               0---------10                              25----28
+ * revision.h:               0---------10                              24----27
  * fetch-pack.c:             01
  * negotiator/default.c:       2--5
  * walker.c:                 0-2
@@ -78,13 +78,14 @@ struct object_array {
  * builtin/show-branch.c:    0-------------------------------------------26
  * builtin/unpack-objects.c:                                 2021
  */
-#define FLAG_BITS  29
+#define FLAG_BITS  28
 
 /*
  * The object type is stored in 3 bits.
  */
 struct object {
 	unsigned parsed : 1;
+	unsigned corrupt : 1;
 	unsigned type : TYPE_BITS;
 	unsigned flags : FLAG_BITS;
 	struct object_id oid;
diff --git a/revision.h b/revision.h
index 4134dc6029..5c0b831b37 100644
--- a/revision.h
+++ b/revision.h
@@ -33,7 +33,7 @@
 #define ALL_REV_FLAGS	(((1u<<11)-1) | NOT_USER_GIVEN | TRACK_LINEAR)
 
 #define TOPO_WALK_EXPLORED	(1u<<27)
-#define TOPO_WALK_INDEGREE	(1u<<28)
+#define TOPO_WALK_INDEGREE	(1u<<24)
 
 #define DECORATE_SHORT_REFS	1
 #define DECORATE_FULL_REFS	2

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 1/1] commit-graph.c: die on un-parseable commits
  2019-09-04  3:04   ` Jeff King
  2019-09-04 21:18     ` Taylor Blau
@ 2019-09-05 22:19     ` Junio C Hamano
  2019-09-06  6:35       ` Jeff King
  1 sibling, 1 reply; 20+ messages in thread
From: Junio C Hamano @ 2019-09-05 22:19 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, stolee, git

Jeff King <peff@peff.net> writes:

> I don't think parse_commit_no_graph() returning 0 assures us that
> get_commit_tree() and friends will return non-NULL.
>
> This is similar to the case discussed recently where a failed parse of a
> tag may leave "tag->tagged == NULL" even though "tag->obj.parsed" is
> set.
>
> Here an earlier parsing error could cause (*list)->object.parsed to be
> true, but with (*list)->maybe_tree still NULL. Our call to
> parse_commit_no_graph() here would silently return "yep, already tried
> to parse this", and then we'd still segfault.
> ...
> And I think there's literally no way for this function to tell the
> difference between "no parent" and "there was an earlier error, but we
> set the parsed flag anyway and the parent flag is invalid".
>
> I think that argues against Junio's response in:

Fair enough.  Forcing later users to reattempt parsing (and failing
the same way) would be safer and it should also be sufficient as we
are talking about how to handle a broken repository, i.e. an error
case.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 1/1] commit-graph.c: die on un-parseable commits
  2019-09-05 22:19     ` Junio C Hamano
@ 2019-09-06  6:35       ` Jeff King
  2019-09-06  6:56         ` Jeff King
  2019-09-06 16:59         ` Junio C Hamano
  0 siblings, 2 replies; 20+ messages in thread
From: Jeff King @ 2019-09-06  6:35 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, stolee, git

On Thu, Sep 05, 2019 at 03:19:48PM -0700, Junio C Hamano wrote:

> > Here an earlier parsing error could cause (*list)->object.parsed to be
> > true, but with (*list)->maybe_tree still NULL. Our call to
> > parse_commit_no_graph() here would silently return "yep, already tried
> > to parse this", and then we'd still segfault.
> > ...
> > And I think there's literally no way for this function to tell the
> > difference between "no parent" and "there was an earlier error, but we
> > set the parsed flag anyway and the parent flag is invalid".
> >
> > I think that argues against Junio's response in:
> 
> Fair enough.  Forcing later users to reattempt parsing (and failing
> the same way) would be safer and it should also be sufficient as we
> are talking about how to handle a broken repository, i.e. an error
> case.

One of the tricky things, and the reason I used a "corrupt" flag in my
earlier sketch, is that the state after we encounter a parse error is
unknown. So imagine parse_commit_buffer() sees that one of the parent
lines is bogus, and we return an error. The caller gets to see whatever
half-parsed state we managed to come up with.

So far so good. But now imagine we call parse_commit_buffer() again, and
we re-parse. How does that interact with the half-parsed state? Some of
it works OK (e.g., lookup_tree() would find the same tree). Some not so
much (I think we'd keep appending parents at each call).

I guess this might not be too bad to handle. Value fields like
timestamp_t are OK to overwrite. Pointers to objects likewise, since the
memory is owned elsewhere. If we see existing parent pointers in an
object we're parsing, we could probably free them under the assumption
they're leftover cruft. Likewise for the "tag" field of "struct tag",
which is owned by the struct and should be freed.

Blobs and trees don't actually parse anything into their structs. So
it's really just special-casing those two items.

-Peff

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 1/1] commit-graph.c: die on un-parseable commits
  2019-09-06  6:35       ` Jeff King
@ 2019-09-06  6:56         ` Jeff King
  2019-09-06 16:59         ` Junio C Hamano
  1 sibling, 0 replies; 20+ messages in thread
From: Jeff King @ 2019-09-06  6:56 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, stolee, git

On Fri, Sep 06, 2019 at 02:35:03AM -0400, Jeff King wrote:

> > Fair enough.  Forcing later users to reattempt parsing (and failing
> > the same way) would be safer and it should also be sufficient as we
> > are talking about how to handle a broken repository, i.e. an error
> > case.
> 
> One of the tricky things, and the reason I used a "corrupt" flag in my
> earlier sketch, is that the state after we encounter a parse error is
> unknown. So imagine parse_commit_buffer() sees that one of the parent
> lines is bogus, and we return an error. The caller gets to see whatever
> half-parsed state we managed to come up with.
> 
> So far so good. But now imagine we call parse_commit_buffer() again, and
> we re-parse. How does that interact with the half-parsed state? Some of
> it works OK (e.g., lookup_tree() would find the same tree). Some not so
> much (I think we'd keep appending parents at each call).
> 
> I guess this might not be too bad to handle. Value fields like
> timestamp_t are OK to overwrite. Pointers to objects likewise, since the
> memory is owned elsewhere. If we see existing parent pointers in an
> object we're parsing, we could probably free them under the assumption
> they're leftover cruft. Likewise for the "tag" field of "struct tag",
> which is owned by the struct and should be freed.
> 
> Blobs and trees don't actually parse anything into their structs. So
> it's really just special-casing those two items.

So here's something a bit more concrete to play with. Using the patch
below, we maintain the invariant that if you called parse_commit() or
equivalent and got a successful return, then commit->tree is always
non-NULL. That fixes the second "missing tree" test from elsewhere in
this thread:

  https://public-inbox.org/git/042a8ba8b2a98c269f9cd1a8e88488b80d686f0d.1567720960.git.me@ttaylorr.com/

_without_ applying the third patch (though the error message we expect
in the test does not tweaked). And it likely also protects most of the
other callers of get_commit_tree_oid(), assuming somewhere in their code
path they call parse_commit() and actually check the error code.

Likewise, tag->tagged is always non-NULL after a successful tag parse,
which should fix the segfault discussed recently in:

  https://public-inbox.org/git/20190824230944.GA14132@jessup.stsp.name/

as well as probably others.

And callers are still able to view the broken objects, but they have to
ignore the error return from the parse functions.

This seems like a promising direction. I'd probably break it into a few
separate patches, and it would be nice to have some tests (especially
one where we actually do try to parse multiple times).

---
diff --git a/commit.c b/commit.c
index a98de16e3d..47e36fd13e 100644
--- a/commit.c
+++ b/commit.c
@@ -398,20 +398,37 @@ int parse_commit_buffer(struct repository *r, struct commit *item, const void *b
 	struct object_id parent;
 	struct commit_list **pptr;
 	struct commit_graft *graft;
+	struct tree *tree;
 	const int tree_entry_len = the_hash_algo->hexsz + 5;
 	const int parent_entry_len = the_hash_algo->hexsz + 7;
 
 	if (item->object.parsed)
 		return 0;
-	item->object.parsed = 1;
+
+	if (item->parents) {
+		/*
+		 * Presumably this is leftover from an earlier failed parse;
+		 * clear it out in preparation for us re-parsing (we'll hit the
+		 * same error, but that's good, since it lets our caller know
+		 * the result cannot be trusted.
+		 */
+		free_commit_list(item->parents);
+		item->parents = NULL;
+	}
+
 	tail += size;
 	if (tail <= bufptr + tree_entry_len + 1 || memcmp(bufptr, "tree ", 5) ||
 			bufptr[tree_entry_len] != '\n')
 		return error("bogus commit object %s", oid_to_hex(&item->object.oid));
 	if (get_oid_hex(bufptr + 5, &parent) < 0)
 		return error("bad tree pointer in commit %s",
 			     oid_to_hex(&item->object.oid));
-	set_commit_tree(item, lookup_tree(r, &parent));
+	tree = lookup_tree(r, &parent);
+	if (!tree)
+		return error("bad tree pointer %s in commit %s",
+			     oid_to_hex(&parent),
+			     oid_to_hex(&item->object.oid));
+	set_commit_tree(item, tree);
 	bufptr += tree_entry_len + 1; /* "tree " + "hex sha1" + "\n" */
 	pptr = &item->parents;
 
@@ -450,6 +467,7 @@ int parse_commit_buffer(struct repository *r, struct commit *item, const void *b
 	if (check_graph)
 		load_commit_graph_info(r, item);
 
+	item->object.parsed = 1;
 	return 0;
 }
 
diff --git a/tag.c b/tag.c
index 5db870edb9..3dcebb4715 100644
--- a/tag.c
+++ b/tag.c
@@ -141,7 +141,11 @@ int parse_tag_buffer(struct repository *r, struct tag *item, const void *data, u
 
 	if (item->object.parsed)
 		return 0;
-	item->object.parsed = 1;
+
+	if (item->tag) {
+		/* Left over from an earlier failed parse */
+		FREE_AND_NULL(item->tag);
+	}
 
 	if (size < the_hash_algo->hexsz + 24)
 		return -1;
@@ -167,10 +171,15 @@ int parse_tag_buffer(struct repository *r, struct tag *item, const void *data, u
 	} else if (!strcmp(type, tag_type)) {
 		item->tagged = (struct object *)lookup_tag(r, &oid);
 	} else {
-		error("Unknown type %s", type);
-		item->tagged = NULL;
+		return error("unknown tag type '%s' in %s",
+			     type, oid_to_hex(&item->object.oid));
 	}
 
+	if (!item->tagged)
+		return error("bad tag pointer to %s in %s",
+			     oid_to_hex(&oid),
+			     oid_to_hex(&item->object.oid));
+
 	if (bufptr + 4 < tail && starts_with(bufptr, "tag "))
 		; 		/* good */
 	else
@@ -187,6 +196,7 @@ int parse_tag_buffer(struct repository *r, struct tag *item, const void *data, u
 	else
 		item->date = 0;
 
+	item->object.parsed = 1;
 	return 0;
 }
 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 1/1] commit-graph.c: die on un-parseable commits
  2019-09-05  6:47       ` Jeff King
@ 2019-09-06 16:48         ` Derrick Stolee
  2019-09-06 17:04           ` Jeff King
  0 siblings, 1 reply; 20+ messages in thread
From: Derrick Stolee @ 2019-09-06 16:48 UTC (permalink / raw)
  To: Jeff King, Taylor Blau; +Cc: Junio C Hamano, git

On 9/5/2019 2:47 AM, Jeff King wrote:
> On Wed, Sep 04, 2019 at 05:18:47PM -0400, Taylor Blau wrote:
> 
>> All of this makes sense to me, so I'm wondering what part(s) of this you
>> feel are worth addressing in this first patch series. Presumably, there
>> is a longer series that we _could_ write which would introduce a new
>> 'corrupt' field and then check for it here.
>>
>> But, I'm hesitant to write those patches, since I only have this one
>> call-site in mind. If we introduce 'corrupt', I feel it would be best to
>> use it uniformly, instead of only checking it here, and relying on other
>> bespoke mechanisms to detect corruption elsewhere.
>>
>> So, I'm content to write the pseudo-code you provided above (which is to
>> say, call and check both 'parse_commit_no_graph', _and_
>> 'get_commit_tree_oid'), because I think that it's expedient, and fix the
>> issue which I'm pointing out here.
> 
> I'd actually be willing to just take the patch you have here, and
> consider the "parsed but we saw an error" thing as an oddity of the
> object code.  IOW, we shouldn't _have_ to be double-checking here.
> Looking for an error return from parse_commit() should really be all a
> caller needs to do. Once that's fixed, then your code would just be
> doing the right thing.
> 
> That said, there's another unhandled case, I think: lookup_tree() might
> return NULL (if somebody previously saw that oid as a non-tree), and
> parse_commit() wouldn't even notice and return an error!
> 
> IMHO that's also something that parse_commit() should be returning an
> error for. And it's probably a lot easier to trigger versus the "parsed
> earlier but corrupted" thing.
> 
> So it might be worth doing the NULL tree check here in the meantime. I
> dunno.
> 
> Below is a sketch of what I'm thinking parse_commit() should do:
> 
>   - remember when an earlier parse returned an error, so we can repeat
>     that error (this requires some unfortunate bit-field adjusting)
> 
>   - notice a lookup_tree failure
> 
>   - likewise, notice a lookup_parent failure
> 
> ---
> diff --git a/commit.c b/commit.c
> index a98de16e3d..7e415932b7 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -391,7 +391,9 @@ const void *detach_commit_buffer(struct commit *commit, unsigned long *sizep)
>  	return ret;
>  }
>  
> -int parse_commit_buffer(struct repository *r, struct commit *item, const void *buffer, unsigned long size, int check_graph)
> +static int parse_commit_buffer_1(struct repository *r, struct commit *item,
> +				 const void *buffer, unsigned long size,
> +				 int check_graph)
>  {
>  	const char *tail = buffer;
>  	const char *bufptr = buffer;
> @@ -401,9 +403,6 @@ int parse_commit_buffer(struct repository *r, struct commit *item, const void *b
>  	const int tree_entry_len = the_hash_algo->hexsz + 5;
>  	const int parent_entry_len = the_hash_algo->hexsz + 7;
>  
> -	if (item->object.parsed)
> -		return 0;
> -	item->object.parsed = 1;
>  	tail += size;
>  	if (tail <= bufptr + tree_entry_len + 1 || memcmp(bufptr, "tree ", 5) ||
>  			bufptr[tree_entry_len] != '\n')
> @@ -412,6 +411,10 @@ int parse_commit_buffer(struct repository *r, struct commit *item, const void *b
>  		return error("bad tree pointer in commit %s",
>  			     oid_to_hex(&item->object.oid));
>  	set_commit_tree(item, lookup_tree(r, &parent));
> +	if (!item->maybe_tree)
> +		return error("bad tree pointer %s in commit %s",
> +			     oid_to_hex(&parent),
> +			     oid_to_hex(&item->object.oid));
>  	bufptr += tree_entry_len + 1; /* "tree " + "hex sha1" + "\n" */
>  	pptr = &item->parents;
>  
> @@ -431,15 +434,19 @@ int parse_commit_buffer(struct repository *r, struct commit *item, const void *b
>  		if (graft && (graft->nr_parent < 0 || grafts_replace_parents))
>  			continue;
>  		new_parent = lookup_commit(r, &parent);
> -		if (new_parent)
> -			pptr = &commit_list_insert(new_parent, pptr)->next;
> +		if (!new_parent)
> +			return error("bad parent %s in commit %s",
> +				     oid_to_hex(&parent),
> +				     oid_to_hex(&item->object.oid));
> +		pptr = &commit_list_insert(new_parent, pptr)->next;
>  	}
>  	if (graft) {
>  		int i;
>  		struct commit *new_parent;
>  		for (i = 0; i < graft->nr_parent; i++) {
>  			new_parent = lookup_commit(r,
>  						   &graft->parent[i]);
> +			/* Here we ignore bogus grafts. Also should be an error? */
>  			if (!new_parent)
>  				continue;
>  			pptr = &commit_list_insert(new_parent, pptr)->next;
> @@ -453,6 +460,23 @@ int parse_commit_buffer(struct repository *r, struct commit *item, const void *b
>  	return 0;
>  }
>  
> +int parse_commit_buffer(struct repository *r, struct commit *item,
> +			const void *buffer, unsigned long size,
> +			int check_graph)
> +{
> +	int ret;
> +
> +	if (item->object.parsed)
> +		return item->object.corrupt ? -1 : 0;
> +	item->object.parsed = 1;
> +
> +	ret = parse_commit_buffer_1(r, item, buffer, size, check_graph);
> +	if (ret < 0)
> +		item->object.corrupt = 1;
> +
> +	return ret;
> +}
> +
>  int repo_parse_commit_internal(struct repository *r,
>  			       struct commit *item,
>  			       int quiet_on_missing,
> diff --git a/object.h b/object.h
> index 0120892bbd..b83d3964ad 100644
> --- a/object.h
> +++ b/object.h
> @@ -59,7 +59,7 @@ struct object_array {
>  
>  /*
>   * object flag allocation:
> - * revision.h:               0---------10                              25----28
> + * revision.h:               0---------10                              24----27
>   * fetch-pack.c:             01
>   * negotiator/default.c:       2--5
>   * walker.c:                 0-2
> @@ -78,13 +78,14 @@ struct object_array {
>   * builtin/show-branch.c:    0-------------------------------------------26
>   * builtin/unpack-objects.c:                                 2021
>   */
> -#define FLAG_BITS  29
> +#define FLAG_BITS  28
>  
>  /*
>   * The object type is stored in 3 bits.
>   */
>  struct object {
>  	unsigned parsed : 1;
> +	unsigned corrupt : 1;
>  	unsigned type : TYPE_BITS;
>  	unsigned flags : FLAG_BITS;
>  	struct object_id oid;
> diff --git a/revision.h b/revision.h
> index 4134dc6029..5c0b831b37 100644
> --- a/revision.h
> +++ b/revision.h
> @@ -33,7 +33,7 @@
>  #define ALL_REV_FLAGS	(((1u<<11)-1) | NOT_USER_GIVEN | TRACK_LINEAR)
>  
>  #define TOPO_WALK_EXPLORED	(1u<<27)
> -#define TOPO_WALK_INDEGREE	(1u<<28)
> +#define TOPO_WALK_INDEGREE	(1u<<24)

As an aside, these flag bit modifications look fine, but would need to
be explained. I'm guessing that since you are adding a bit of data
to struct object you want to avoid increasing the struct size across
a 32-bit boundary. Are we sure that bit 24 is not used anywhere else?
(My search for "1u<<24" found nothing, and "1 << 24" found a bit in
the cache-entry flags, so this seems safe.)

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 0/1] commit-graph.c: handle corrupt commit trees
  2019-09-04 21:21   ` Taylor Blau
  2019-09-05  6:08     ` Jeff King
@ 2019-09-06 16:48     ` Derrick Stolee
  1 sibling, 0 replies; 20+ messages in thread
From: Derrick Stolee @ 2019-09-06 16:48 UTC (permalink / raw)
  To: Taylor Blau, Garima Singh; +Cc: git, peff

On 9/4/2019 5:21 PM, Taylor Blau wrote:
> Hi Garima,
> 
> On Wed, Sep 04, 2019 at 02:25:55PM -0400, Garima Singh wrote:
>>
>> On 9/3/2019 10:22 PM, Taylor Blau wrote:
>>> Hi,
>>>
>>> I was running some of the new 'git commit-graph' commands, and noticed
>>> that I could consistently get 'git commit-graph write --reachable' to
>>> segfault when a commit's root tree is corrupt.
>>>
>>> I have an extremely-unfinished fix attached as an RFC PATCH below, but I
>>> wanted to get a few thoughts on this before sending it out as a non-RFC.
>>>
>>> In my patch, I simply 'die()' when a commit isn't able to be parsed
>>> (i.e., when 'parse_commit_no_graph' returns a non-zero code), but I
>>> wanted to see if others thought that this was an OK approach. Some
>>> thoughts:
>>
>> I like the idea of completely bailing if the commit can't be parsed too.
>> Only question: Is there a reason you chose to die() instead of BUG() like
>> the other two places in that function? What is the criteria of choosing one
>> over the other?
> 
> I did not call 'BUG' here because 'BUG' is traditionally used to
> indicate an internal bug, e.g., an unexpected state or some such. On the
> other side of that coin, 'BUG' is _not_ used to indicate repository
> corruption, since that is not an issue in the Git codebase, rather in
> the user's repository.
> 
> Though, to be honest, I've never seen that rule written out explicitly
> (maybe if it were to be written somewhere, it could be stored in
> Documentation/CodingGuidelines?). I think that this is some good
> #leftoverbits material.
> 
>>>
>>>    * It seems like we could write a commit-graph by placing a "filler"
>>>      entry where the broken commit would have gone. I don't see any place
>>>      where this is implemented currently, but this seems like a viable
>>>      alternative to not writing _any_ commits into the commit-graph.
>>
>> I would rather we didn't do this cause it will probably kick open the can of
>> always watching for that filler when we are working with the commit-graph.
>> Or do we already do that today? Maybe @stolee can chime in on what we do in
>> cases of shallow clones and other potential gaps in the walk
> 
> Yeah, I think that the consensus is that it makes sense to just die
> here, which is fine by me.

I agree the die() is the best thing to do for now.

If we wanted to salvage as much as possible, then we could use these
corrupt marks and then use the "reverse walk" in compute_generation_numbers()
to mark all commits that can reach the corrupt commit as corrupt.
We would then need to remove all corrupt commits from the list we are
planning to write.

However, that is just hiding a corrupt object in the object database,
which is not a situation we want to leave unnoticed.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 1/1] commit-graph.c: die on un-parseable commits
  2019-09-06  6:35       ` Jeff King
  2019-09-06  6:56         ` Jeff King
@ 2019-09-06 16:59         ` Junio C Hamano
  2019-09-06 17:04           ` Jeff King
  1 sibling, 1 reply; 20+ messages in thread
From: Junio C Hamano @ 2019-09-06 16:59 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, stolee, git

Jeff King <peff@peff.net> writes:

> So far so good. But now imagine we call parse_commit_buffer() again, and
> we re-parse. How does that interact with the half-parsed state? Some of
> it works OK (e.g., lookup_tree() would find the same tree). Some not so
> much (I think we'd keep appending parents at each call).
>
> I guess this might not be too bad to handle. Value fields like
> timestamp_t are OK to overwrite. Pointers to objects likewise, since the
> memory is owned elsewhere. If we see existing parent pointers in an
> object we're parsing, we could probably free them under the assumption
> they're leftover cruft. Likewise for the "tag" field of "struct tag",
> which is owned by the struct and should be freed.

Yeah, or clear them before returning with .corrupt bit set?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 1/1] commit-graph.c: die on un-parseable commits
  2019-09-06 16:48         ` Derrick Stolee
@ 2019-09-06 17:04           ` Jeff King
  2019-09-06 17:19             ` Derrick Stolee
  2019-09-06 17:20             ` Derrick Stolee
  0 siblings, 2 replies; 20+ messages in thread
From: Jeff King @ 2019-09-06 17:04 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Taylor Blau, Junio C Hamano, git

On Fri, Sep 06, 2019 at 12:48:05PM -0400, Derrick Stolee wrote:

> > diff --git a/revision.h b/revision.h
> > index 4134dc6029..5c0b831b37 100644
> > --- a/revision.h
> > +++ b/revision.h
> > @@ -33,7 +33,7 @@
> >  #define ALL_REV_FLAGS	(((1u<<11)-1) | NOT_USER_GIVEN | TRACK_LINEAR)
> >  
> >  #define TOPO_WALK_EXPLORED	(1u<<27)
> > -#define TOPO_WALK_INDEGREE	(1u<<28)
> > +#define TOPO_WALK_INDEGREE	(1u<<24)
> 
> As an aside, these flag bit modifications look fine, but would need to
> be explained. I'm guessing that since you are adding a bit of data
> to struct object you want to avoid increasing the struct size across
> a 32-bit boundary. Are we sure that bit 24 is not used anywhere else?
> (My search for "1u<<24" found nothing, and "1 << 24" found a bit in
> the cache-entry flags, so this seems safe.)

Yeah, I'd definitely break this up into several commits with explanation
(though see an alternate I posted that just uses the parsed flag without
any new bits).

Bit 24 isn't used according to the table in objects.h, which is
_supposed_ to be the source of truth, though of course there's no
compiler-level checking. (One aside: is there a reason TOPO_WALK_* isn't
part of ALL_REV_FLAGS?).

And yes, the goal was to keep things to the 32-bit boundary. But in the
course of this, I discovered something interesting: 64-bit systems are
now padding this up to the 8-byte boundary!

The culprit is the switch of GIT_MAX_RAWSZ for sha256. Before then, our
object_id was 20 bytes for sha1. Adding 4 bytes of flags still left us
at 24 bytes, which is both 4- and 8-byte aligned.

With the switch to sha256, object_id is now 32 bytes. Adding 4 bytes
takes us to 36, and then 8-byte aligning the struct takes us to 40
bytes, with 4 bytes of wasted padding.

I'm sorely tempted to use this as an opportunity to move commit->index
into "struct object". That would actually shrink commit object sizes by
4 bytes, and would let all object types do the commit-slab trick to
store object data with constant-time lookup. This would make it possible
to migrate some uses of flags to per-operation bitfields (so e.g., two
traversals would have their _own_ flag data, and wouldn't risk stomping
on each other's bits).

The one downside would be that the index space would become more sparse.
I.e., right now if you're only storing things for commits in a slab, you
know that every slot you allocate is for a commit. But if we allocate an
index for each object, then the commits are less likely to be together
(so wasted memory and worse cache performance). That might be solvable
by assigning a per-type index (with a few hacks to handle OBJ_NONE).

Anyway, all of that is rather off the topic of this discussion.

-Peff

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 1/1] commit-graph.c: die on un-parseable commits
  2019-09-06 16:59         ` Junio C Hamano
@ 2019-09-06 17:04           ` Jeff King
  2019-09-09 16:39             ` Junio C Hamano
  0 siblings, 1 reply; 20+ messages in thread
From: Jeff King @ 2019-09-06 17:04 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, stolee, git

On Fri, Sep 06, 2019 at 09:59:04AM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > So far so good. But now imagine we call parse_commit_buffer() again, and
> > we re-parse. How does that interact with the half-parsed state? Some of
> > it works OK (e.g., lookup_tree() would find the same tree). Some not so
> > much (I think we'd keep appending parents at each call).
> >
> > I guess this might not be too bad to handle. Value fields like
> > timestamp_t are OK to overwrite. Pointers to objects likewise, since the
> > memory is owned elsewhere. If we see existing parent pointers in an
> > object we're parsing, we could probably free them under the assumption
> > they're leftover cruft. Likewise for the "tag" field of "struct tag",
> > which is owned by the struct and should be freed.
> 
> Yeah, or clear them before returning with .corrupt bit set?

This was my attempt to avoid dealing with a .corrupt bit. :)

-Peff

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 1/1] commit-graph.c: die on un-parseable commits
  2019-09-06 17:04           ` Jeff King
@ 2019-09-06 17:19             ` Derrick Stolee
  2019-09-06 17:20             ` Derrick Stolee
  1 sibling, 0 replies; 20+ messages in thread
From: Derrick Stolee @ 2019-09-06 17:19 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, Junio C Hamano, git



On 9/6/2019 1:04 PM, Jeff King wrote:
> On Fri, Sep 06, 2019 at 12:48:05PM -0400, Derrick Stolee wrote:
> 
>>> diff --git a/revision.h b/revision.h
>>> index 4134dc6029..5c0b831b37 100644
>>> --- a/revision.h
>>> +++ b/revision.h
>>> @@ -33,7 +33,7 @@
>>>  #define ALL_REV_FLAGS	(((1u<<11)-1) | NOT_USER_GIVEN | TRACK_LINEAR)
>>>  
>>>  #define TOPO_WALK_EXPLORED	(1u<<27)
>>> -#define TOPO_WALK_INDEGREE	(1u<<28)
>>> +#define TOPO_WALK_INDEGREE	(1u<<24)
>>
>> As an aside, these flag bit modifications look fine, but would need to
>> be explained. I'm guessing that since you are adding a bit of data
>> to struct object you want to avoid increasing the struct size across
>> a 32-bit boundary. Are we sure that bit 24 is not used anywhere else?
>> (My search for "1u<<24" found nothing, and "1 << 24" found a bit in
>> the cache-entry flags, so this seems safe.)
> 
> Yeah, I'd definitely break this up into several commits with explanation
> (though see an alternate I posted that just uses the parsed flag without
> any new bits).
> 
> Bit 24 isn't used according to the table in objects.h, which is
> _supposed_ to be the source of truth, though of course there's no
> compiler-level checking. (One aside: is there a reason TOPO_WALK_* isn't
> part of ALL_REV_FLAGS?).

This was an oversight on my part. Sorry.

-Stolee

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 1/1] commit-graph.c: die on un-parseable commits
  2019-09-06 17:04           ` Jeff King
  2019-09-06 17:19             ` Derrick Stolee
@ 2019-09-06 17:20             ` Derrick Stolee
  1 sibling, 0 replies; 20+ messages in thread
From: Derrick Stolee @ 2019-09-06 17:20 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, Junio C Hamano, git



On 9/6/2019 1:04 PM, Jeff King wrote:
> On Fri, Sep 06, 2019 at 12:48:05PM -0400, Derrick Stolee wrote:
> 
>>> diff --git a/revision.h b/revision.h
>>> index 4134dc6029..5c0b831b37 100644
>>> --- a/revision.h
>>> +++ b/revision.h
>>> @@ -33,7 +33,7 @@
>>>  #define ALL_REV_FLAGS	(((1u<<11)-1) | NOT_USER_GIVEN | TRACK_LINEAR)
>>>  
>>>  #define TOPO_WALK_EXPLORED	(1u<<27)
>>> -#define TOPO_WALK_INDEGREE	(1u<<28)
>>> +#define TOPO_WALK_INDEGREE	(1u<<24)
>>
>> As an aside, these flag bit modifications look fine, but would need to
>> be explained. I'm guessing that since you are adding a bit of data
>> to struct object you want to avoid increasing the struct size across
>> a 32-bit boundary. Are we sure that bit 24 is not used anywhere else?
>> (My search for "1u<<24" found nothing, and "1 << 24" found a bit in
>> the cache-entry flags, so this seems safe.)
> 
> Yeah, I'd definitely break this up into several commits with explanation
> (though see an alternate I posted that just uses the parsed flag without
> any new bits).
> 
> Bit 24 isn't used according to the table in objects.h, which is
> _supposed_ to be the source of truth, though of course there's no
> compiler-level checking. (One aside: is there a reason TOPO_WALK_* isn't
> part of ALL_REV_FLAGS?).
> 
> And yes, the goal was to keep things to the 32-bit boundary. But in the
> course of this, I discovered something interesting: 64-bit systems are
> now padding this up to the 8-byte boundary!
> 
> The culprit is the switch of GIT_MAX_RAWSZ for sha256. Before then, our
> object_id was 20 bytes for sha1. Adding 4 bytes of flags still left us
> at 24 bytes, which is both 4- and 8-byte aligned.
> 
> With the switch to sha256, object_id is now 32 bytes. Adding 4 bytes
> takes us to 36, and then 8-byte aligning the struct takes us to 40
> bytes, with 4 bytes of wasted padding.
> 
> I'm sorely tempted to use this as an opportunity to move commit->index
> into "struct object". That would actually shrink commit object sizes by
> 4 bytes, and would let all object types do the commit-slab trick to
> store object data with constant-time lookup. This would make it possible
> to migrate some uses of flags to per-operation bitfields (so e.g., two
> traversals would have their _own_ flag data, and wouldn't risk stomping
> on each other's bits).

This reminds me that I'm hoping to eventually get around to moving
"generation" into a commit slab. That would reduce the space for people
still working without a commit-graph, and would allow updating to
generation number v2 (which needs 64 bits of data).

-Stolee

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 1/1] commit-graph.c: die on un-parseable commits
  2019-09-06 17:04           ` Jeff King
@ 2019-09-09 16:39             ` Junio C Hamano
  2019-09-09 16:54               ` Jeff King
  0 siblings, 1 reply; 20+ messages in thread
From: Junio C Hamano @ 2019-09-09 16:39 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, stolee, git

Jeff King <peff@peff.net> writes:

> On Fri, Sep 06, 2019 at 09:59:04AM -0700, Junio C Hamano wrote:
>
>> Jeff King <peff@peff.net> writes:
>> 
>> > So far so good. But now imagine we call parse_commit_buffer() again, and
>> > we re-parse. How does that interact with the half-parsed state? Some of
>> > it works OK (e.g., lookup_tree() would find the same tree). Some not so
>> > much (I think we'd keep appending parents at each call).
>> >
>> > I guess this might not be too bad to handle. Value fields like
>> > timestamp_t are OK to overwrite. Pointers to objects likewise, since the
>> > memory is owned elsewhere. If we see existing parent pointers in an
>> > object we're parsing, we could probably free them under the assumption
>> > they're leftover cruft. Likewise for the "tag" field of "struct tag",
>> > which is owned by the struct and should be freed.
>> 
>> Yeah, or clear them before returning with .corrupt bit set?
>
> This was my attempt to avoid dealing with a .corrupt bit. :)

Then, clear them before returning with .parsed bit clear?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 1/1] commit-graph.c: die on un-parseable commits
  2019-09-09 16:39             ` Junio C Hamano
@ 2019-09-09 16:54               ` Jeff King
  0 siblings, 0 replies; 20+ messages in thread
From: Jeff King @ 2019-09-09 16:54 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, stolee, git

On Mon, Sep 09, 2019 at 09:39:41AM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > On Fri, Sep 06, 2019 at 09:59:04AM -0700, Junio C Hamano wrote:
> >
> >> Jeff King <peff@peff.net> writes:
> >> 
> >> > So far so good. But now imagine we call parse_commit_buffer() again, and
> >> > we re-parse. How does that interact with the half-parsed state? Some of
> >> > it works OK (e.g., lookup_tree() would find the same tree). Some not so
> >> > much (I think we'd keep appending parents at each call).
> >> >
> >> > I guess this might not be too bad to handle. Value fields like
> >> > timestamp_t are OK to overwrite. Pointers to objects likewise, since the
> >> > memory is owned elsewhere. If we see existing parent pointers in an
> >> > object we're parsing, we could probably free them under the assumption
> >> > they're leftover cruft. Likewise for the "tag" field of "struct tag",
> >> > which is owned by the struct and should be freed.
> >> 
> >> Yeah, or clear them before returning with .corrupt bit set?
> >
> > This was my attempt to avoid dealing with a .corrupt bit. :)
> 
> Then, clear them before returning with .parsed bit clear?

But then somebody calling parse_commit() and getting an error return
doesn't get to see the full extent of what we managed to parse. Which I
thought was one of your original points: people should be able to get
whatever information we can get out of even a corrupted or malformed
object.

If we keep that requirement, then we have to either clear them at the
start of the re-parsing step, or we have to avoid re-parsing entirely by
using an extra bit to say "parsed but had an error".  I've sent messy
"something like this" patches for both strategies.

I think the latter is cleaner conceptually, but the former doesn't
require giving up a flag bit.

-Peff

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, back to index

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-04  2:22 [RFC PATCH 0/1] commit-graph.c: handle corrupt commit trees Taylor Blau
2019-09-04  2:22 ` [RFC PATCH 1/1] commit-graph.c: die on un-parseable commits Taylor Blau
2019-09-04  3:04   ` Jeff King
2019-09-04 21:18     ` Taylor Blau
2019-09-05  6:47       ` Jeff King
2019-09-06 16:48         ` Derrick Stolee
2019-09-06 17:04           ` Jeff King
2019-09-06 17:19             ` Derrick Stolee
2019-09-06 17:20             ` Derrick Stolee
2019-09-05 22:19     ` Junio C Hamano
2019-09-06  6:35       ` Jeff King
2019-09-06  6:56         ` Jeff King
2019-09-06 16:59         ` Junio C Hamano
2019-09-06 17:04           ` Jeff King
2019-09-09 16:39             ` Junio C Hamano
2019-09-09 16:54               ` Jeff King
2019-09-04 18:25 ` [RFC PATCH 0/1] commit-graph.c: handle corrupt commit trees Garima Singh
2019-09-04 21:21   ` Taylor Blau
2019-09-05  6:08     ` Jeff King
2019-09-06 16:48     ` Derrick Stolee

git@vger.kernel.org list mirror (unofficial, one of many)

Archives are clonable:
	git clone --mirror https://public-inbox.org/git
	git clone --mirror http://ou63pmih66umazou.onion/git
	git clone --mirror http://czquwvybam4bgbro.onion/git
	git clone --mirror http://hjrcffqmbrq6wope.onion/git

Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.version-control.git
	nntp://ou63pmih66umazou.onion/inbox.comp.version-control.git
	nntp://czquwvybam4bgbro.onion/inbox.comp.version-control.git
	nntp://hjrcffqmbrq6wope.onion/inbox.comp.version-control.git
	nntp://news.gmane.org/gmane.comp.version-control.git

 note: .onion URLs require Tor: https://www.torproject.org/

AGPL code for this site: git clone https://public-inbox.org/ public-inbox