git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH on sb/more-repo-in-api] revision: use commit graph in get_reference()
@ 2018-12-04 22:42 Jonathan Tan
  2018-12-04 23:12 ` Stefan Beller
                   ` (5 more replies)
  0 siblings, 6 replies; 28+ messages in thread
From: Jonathan Tan @ 2018-12-04 22:42 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan

When fetching into a repository, a connectivity check is first made by
check_exist_and_connected() in builtin/fetch.c that runs:

  git rev-list --objects --stdin --not --all --quiet <(list of objects)

If the client repository has many refs, this command can be slow,
regardless of the nature of the server repository or what is being
fetched. A profiler reveals that most of the time is spent in
setup_revisions() (approx. 60/63), and of the time spent in
setup_revisions(), most of it is spent in parse_object() (approx.
49/60). This is because setup_revisions() parses the target of every ref
(from "--all"), and parse_object() reads the buffer of the object.

Reading the buffer is unnecessary if the repository has a commit graph
and if the ref points to a commit (which is typically the case). This
patch uses the commit graph wherever possible; on my computer, when I
run the above command with a list of 1 object on a many-ref repository,
I get a speedup from 1.8s to 1.0s.

Another way to accomplish this effect would be to modify parse_object()
to use the commit graph if possible; however, I did not want to change
parse_object()'s current behavior of always checking the object
signature of the returned object.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
This is on sb/more-repo-in-api because I'm using the repo_parse_commit()
function.

A colleague noticed this issue when handling a mirror clone.

Looking at the bigger picture, the speed of the connectivity check
during a fetch might be further improved by passing only the negotiation
tips (obtained through --negotiation-tip) instead of "--all". This patch
just handles the low-hanging fruit first.
---
 revision.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/revision.c b/revision.c
index b5108b75ab..e7da2c57ab 100644
--- a/revision.c
+++ b/revision.c
@@ -212,7 +212,20 @@ static struct object *get_reference(struct rev_info *revs, const char *name,
 {
 	struct object *object;
 
-	object = parse_object(revs->repo, oid);
+	/*
+	 * If the repository has commit graphs, repo_parse_commit() avoids
+	 * reading the object buffer, so use it whenever possible.
+	 */
+	if (oid_object_info(revs->repo, oid, NULL) == OBJ_COMMIT) {
+		struct commit *c = lookup_commit(revs->repo, oid);
+		if (!repo_parse_commit(revs->repo, c))
+			object = (struct object *) c;
+		else
+			object = NULL;
+	} else {
+		object = parse_object(revs->repo, oid);
+	}
+
 	if (!object) {
 		if (revs->ignore_missing)
 			return object;
-- 
2.19.0.271.gfe8321ec05.dirty


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH on sb/more-repo-in-api] revision: use commit graph in get_reference()
  2018-12-04 22:42 [PATCH on sb/more-repo-in-api] revision: use commit graph in get_reference() Jonathan Tan
@ 2018-12-04 23:12 ` Stefan Beller
  2018-12-06 23:36   ` Jonathan Tan
  2018-12-05  4:54 ` Jeff King
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 28+ messages in thread
From: Stefan Beller @ 2018-12-04 23:12 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git

On Tue, Dec 4, 2018 at 2:42 PM Jonathan Tan <jonathantanmy@google.com> wrote:
>
> When fetching into a repository, a connectivity check is first made by
> check_exist_and_connected() in builtin/fetch.c that runs:
>
>   git rev-list --objects --stdin --not --all --quiet <(list of objects)
>
> If the client repository has many refs, this command can be slow,
> regardless of the nature of the server repository or what is being
> fetched. A profiler reveals that most of the time is spent in
> setup_revisions() (approx. 60/63), and of the time spent in
> setup_revisions(), most of it is spent in parse_object() (approx.
> 49/60). This is because setup_revisions() parses the target of every ref
> (from "--all"), and parse_object() reads the buffer of the object.
>
> Reading the buffer is unnecessary if the repository has a commit graph
> and if the ref points to a commit (which is typically the case). This
> patch uses the commit graph wherever possible; on my computer, when I
> run the above command with a list of 1 object on a many-ref repository,
> I get a speedup from 1.8s to 1.0s.
>
> Another way to accomplish this effect would be to modify parse_object()
> to use the commit graph if possible; however, I did not want to change
> parse_object()'s current behavior of always checking the object
> signature of the returned object.
>
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
> This is on sb/more-repo-in-api because I'm using the repo_parse_commit()
> function.

This is a mere nicety, not strictly required.
Before we had parse_commit(struct commit *) which would accomplish the
same, (and we'd still have that afterwards as a #define falling back onto
the_repository). As the function get_reference() is not the_repository safe
as it contains a call to is_promisor_object() that is repository
agnostic, I think
it would be fair game to not depend on that series. I am not
complaining, though.

> A colleague noticed this issue when handling a mirror clone.
>
> Looking at the bigger picture, the speed of the connectivity check
> during a fetch might be further improved by passing only the negotiation
> tips (obtained through --negotiation-tip) instead of "--all". This patch
> just handles the low-hanging fruit first.
> ---
>  revision.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/revision.c b/revision.c
> index b5108b75ab..e7da2c57ab 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -212,7 +212,20 @@ static struct object *get_reference(struct rev_info *revs, const char *name,
>  {
>         struct object *object;
>
> -       object = parse_object(revs->repo, oid);
> +       /*
> +        * If the repository has commit graphs, repo_parse_commit() avoids
> +        * reading the object buffer, so use it whenever possible.
> +        */
> +       if (oid_object_info(revs->repo, oid, NULL) == OBJ_COMMIT) {
> +               struct commit *c = lookup_commit(revs->repo, oid);
> +               if (!repo_parse_commit(revs->repo, c))
> +                       object = (struct object *) c;
> +               else
> +                       object = NULL;

Would it make sense in this case to rely on parse_object below
instead of assigning NULL? The reason for that would be that
when lookup_commit returns NULL, we would try more broadly.

AFAICT oid_object_info doesn't take advantage of the commit graph,
but just looks up the object header, which is still less than completely
parsing it. Then lookup_commit is overly strict, as it may return
NULL as when there still is a type mismatch (I don't think a mismatch
could happen here, as both rely on just the object store, and not the
commit graph.), so this would be just defensive programming for
the sake of it. I dunno.

    struct commit *c;

    if (oid_object_info(revs->repo, oid, NULL) == OBJ_COMMIT &&
        (c = lookup_commit(revs->repo, oid)) &&
        !repo_parse_commit(revs->repo, c))
            object = (struct object *) c;
    else
        object = parse_object(revs->repo, oid);


So with all that said, I still think this is a good patch.

Thanks,
Stefan

> +       } else {
> +               object = parse_object(revs->repo, oid);
> +       }
> +
>         if (!object) {
>                 if (revs->ignore_missing)
>                         return object;
> --
> 2.19.0.271.gfe8321ec05.dirty
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH on sb/more-repo-in-api] revision: use commit graph in get_reference()
  2018-12-04 22:42 [PATCH on sb/more-repo-in-api] revision: use commit graph in get_reference() Jonathan Tan
  2018-12-04 23:12 ` Stefan Beller
@ 2018-12-05  4:54 ` Jeff King
  2018-12-06 23:54   ` Jonathan Tan
  2018-12-05 23:15 ` Junio C Hamano
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 28+ messages in thread
From: Jeff King @ 2018-12-05  4:54 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git

On Tue, Dec 04, 2018 at 02:42:38PM -0800, Jonathan Tan wrote:

> diff --git a/revision.c b/revision.c
> index b5108b75ab..e7da2c57ab 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -212,7 +212,20 @@ static struct object *get_reference(struct rev_info *revs, const char *name,
>  {
>  	struct object *object;
>  
> -	object = parse_object(revs->repo, oid);
> +	/*
> +	 * If the repository has commit graphs, repo_parse_commit() avoids
> +	 * reading the object buffer, so use it whenever possible.
> +	 */
> +	if (oid_object_info(revs->repo, oid, NULL) == OBJ_COMMIT) {
> +		struct commit *c = lookup_commit(revs->repo, oid);
> +		if (!repo_parse_commit(revs->repo, c))
> +			object = (struct object *) c;
> +		else
> +			object = NULL;
> +	} else {
> +		object = parse_object(revs->repo, oid);
> +	}

This seems like a reasonable thing to do, but I have sort of a
meta-comment. In several places we've started doing this kind of "if
it's this type of object, do X, otherwise do Y" optimization (e.g.,
handling large blobs for streaming).

And in the many cases we end up doubling the effort to do object
lookups: here we do one lookup to get the type, and then if it's not a
commit (or if we don't have a commit graph) we end up parsing it anyway.

I wonder if we could do better. In this instance, it might make sense
to first see if we actually have a commit graph available (it might not
have this object, of course, but at least we'd expect it to have most
commits). In general, it would be nice if we had a more incremental API
for accessing objects: open, get metadata, then read the data. That
would make these kinds of optimizations "free".

I don't have numbers for how much the extra lookups cost. The lookups
are probably dwarfed by parse_object() in general, so even if we save
only a few full object loads, it may be a win. It just seems a shame
that we may be making the "slow" paths (when our type-specific check
doesn't match) even slower.

-Peff

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH on sb/more-repo-in-api] revision: use commit graph in get_reference()
  2018-12-04 22:42 [PATCH on sb/more-repo-in-api] revision: use commit graph in get_reference() Jonathan Tan
  2018-12-04 23:12 ` Stefan Beller
  2018-12-05  4:54 ` Jeff King
@ 2018-12-05 23:15 ` Junio C Hamano
  2018-12-07 21:50 ` [PATCH on master v2] " Jonathan Tan
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 28+ messages in thread
From: Junio C Hamano @ 2018-12-05 23:15 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git

Jonathan Tan <jonathantanmy@google.com> writes:

> Looking at the bigger picture, the speed of the connectivity check
> during a fetch might be further improved by passing only the negotiation
> tips (obtained through --negotiation-tip) instead of "--all". This patch
> just handles the low-hanging fruit first.

That sounds like a good direction, when having to list excessive
number of refs is the primary problem.  When fetching their 'master'
into our 'remotes/origin/master' and doing nothing else, we may end
up showing only the latter, which will miss optimization opportunity
a lot if the latest change made over there is to merge in the change
we asked them to pull earlier (which would be greatly helped if we
let them know about the tip of the topic they earlier pulled from
us), but also avoids having to send irrelevant refs that point at
tags addded to months old states.  So there is a subtle trade-off
between sending more refs to reduce the resulting packfile, and
sending fewer refs to reduce the cost of the "have" exchange.

Changing the way to throw each object pointed at by a ref into the
queue to be emitted in the "have" exchange from regular object
parsing to peeking of precomputed data would reduce the local cost
of "have" exchange, but it does not reduce the network cost at all,
though.

As to the change being specific to get_reference() and not to
parse_object(), I think what we see here is probably better, simply
because parse_object() is not in the position to asssume that it is
likely to be asked to parse commits, but I think get_reference() is,
after looking at its callsites in revision.c.

I do share the meta-comment concern with Peff, though.

> ---
>  revision.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/revision.c b/revision.c
> index b5108b75ab..e7da2c57ab 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -212,7 +212,20 @@ static struct object *get_reference(struct rev_info *revs, const char *name,
>  {
>  	struct object *object;
>  
> -	object = parse_object(revs->repo, oid);
> +	/*
> +	 * If the repository has commit graphs, repo_parse_commit() avoids
> +	 * reading the object buffer, so use it whenever possible.
> +	 */
> +	if (oid_object_info(revs->repo, oid, NULL) == OBJ_COMMIT) {
> +		struct commit *c = lookup_commit(revs->repo, oid);
> +		if (!repo_parse_commit(revs->repo, c))
> +			object = (struct object *) c;
> +		else
> +			object = NULL;
> +	} else {
> +		object = parse_object(revs->repo, oid);
> +	}
> +
>  	if (!object) {
>  		if (revs->ignore_missing)
>  			return object;

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH on sb/more-repo-in-api] revision: use commit graph in get_reference()
  2018-12-04 23:12 ` Stefan Beller
@ 2018-12-06 23:36   ` Jonathan Tan
  2018-12-07 13:49     ` Derrick Stolee
  0 siblings, 1 reply; 28+ messages in thread
From: Jonathan Tan @ 2018-12-06 23:36 UTC (permalink / raw)
  To: sbeller; +Cc: jonathantanmy, git

> > This is on sb/more-repo-in-api because I'm using the repo_parse_commit()
> > function.
> 
> This is a mere nicety, not strictly required.
> Before we had parse_commit(struct commit *) which would accomplish the
> same, (and we'd still have that afterwards as a #define falling back onto
> the_repository). As the function get_reference() is not the_repository safe
> as it contains a call to is_promisor_object() that is repository
> agnostic, I think
> it would be fair game to not depend on that series. I am not
> complaining, though.

Good point - I'll base the next version on master (and add a TODO
explaining which functions are not yet converted).

> AFAICT oid_object_info doesn't take advantage of the commit graph,
> but just looks up the object header, which is still less than completely
> parsing it. Then lookup_commit is overly strict, as it may return
> NULL as when there still is a type mismatch (I don't think a mismatch
> could happen here, as both rely on just the object store, and not the
> commit graph.), so this would be just defensive programming for
> the sake of it. I dunno.
> 
>     struct commit *c;
> 
>     if (oid_object_info(revs->repo, oid, NULL) == OBJ_COMMIT &&
>         (c = lookup_commit(revs->repo, oid)) &&
>         !repo_parse_commit(revs->repo, c))
>             object = (struct object *) c;
>     else
>         object = parse_object(revs->repo, oid);

I like this way better - I'll do it in the next version.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH on sb/more-repo-in-api] revision: use commit graph in get_reference()
  2018-12-05  4:54 ` Jeff King
@ 2018-12-06 23:54   ` Jonathan Tan
  2018-12-07  8:53     ` Jeff King
  0 siblings, 1 reply; 28+ messages in thread
From: Jonathan Tan @ 2018-12-06 23:54 UTC (permalink / raw)
  To: peff; +Cc: jonathantanmy, git, stolee

Also CC-ing Stolee since I mention multi-pack indices at the end.

> This seems like a reasonable thing to do, but I have sort of a
> meta-comment. In several places we've started doing this kind of "if
> it's this type of object, do X, otherwise do Y" optimization (e.g.,
> handling large blobs for streaming).
> 
> And in the many cases we end up doubling the effort to do object
> lookups: here we do one lookup to get the type, and then if it's not a
> commit (or if we don't have a commit graph) we end up parsing it anyway.
> 
> I wonder if we could do better. In this instance, it might make sense
> to first see if we actually have a commit graph available (it might not
> have this object, of course, but at least we'd expect it to have most
> commits).

This makes sense - I thought I shouldn't mention the commit graph in the
code since it seems like a layering violation, but I felt the need to
mention commit graph in a comment, so maybe the need to mention commit
graph in the code is there too. Subsequently, maybe the lookup-for-type
could be replaced by a lookup-in-commit-graph (maybe by using
parse_commit_in_graph() directly), which should be at least slightly
faster.

> In general, it would be nice if we had a more incremental API
> for accessing objects: open, get metadata, then read the data. That
> would make these kinds of optimizations "free".

Would this be assuming that to read the data, you would (1) first need to
read the metadata, and (2) there would be no redundancy in reading the
two? It seems to me that for loose objects, you would want to perform
all your reads at once, since any read requires opening the file, and
for commit graphs, you just want to read what you want, since the
metadata and the data are in separate places.

> I don't have numbers for how much the extra lookups cost. The lookups
> are probably dwarfed by parse_object() in general, so even if we save
> only a few full object loads, it may be a win. It just seems a shame
> that we may be making the "slow" paths (when our type-specific check
> doesn't match) even slower.

I agree. I think it will always remain a tradeoff when we have multiple
data sources of objects (loose, packed, commit graph - and we can't
unify them all, since they each have their uses). Unless the multi-pack
index can reference commit graphs as well...then it could be our first
point of reference without introducing any inefficiencies...

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH on sb/more-repo-in-api] revision: use commit graph in get_reference()
  2018-12-06 23:54   ` Jonathan Tan
@ 2018-12-07  8:53     ` Jeff King
  0 siblings, 0 replies; 28+ messages in thread
From: Jeff King @ 2018-12-07  8:53 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, stolee

On Thu, Dec 06, 2018 at 03:54:46PM -0800, Jonathan Tan wrote:

> This makes sense - I thought I shouldn't mention the commit graph in the
> code since it seems like a layering violation, but I felt the need to
> mention commit graph in a comment, so maybe the need to mention commit
> graph in the code is there too. Subsequently, maybe the lookup-for-type
> could be replaced by a lookup-in-commit-graph (maybe by using
> parse_commit_in_graph() directly), which should be at least slightly
> faster.

That makes more sense to me. If we don't have a commit graph at all,
it's a quick noop. If we do, we might binary search in the list of
commits for a non-commit. But that's strictly faster than finding the
object's type (which involves a binary search of a larger list, followed
by actually accessing the type info).

> > In general, it would be nice if we had a more incremental API
> > for accessing objects: open, get metadata, then read the data. That
> > would make these kinds of optimizations "free".
> 
> Would this be assuming that to read the data, you would (1) first need to
> read the metadata, and (2) there would be no redundancy in reading the
> two? It seems to me that for loose objects, you would want to perform
> all your reads at once, since any read requires opening the file, and
> for commit graphs, you just want to read what you want, since the
> metadata and the data are in separate places.

By metadata here, I don't mean the commit-graph data, but just the
object type and size. So I'm imagining an interface more like:

  - object_open() locates the object, and stores either the pack
    file/offset or a descriptor to a loose path in an opaque handle
    struct

  - object_size() and object_type() on that handle would do what you
    expect. For loose objects, these would parse the header (the
    equivalent of unpack_sha1_header()). For packed ones, they'd use the
    object header in the pack (and chase down the delta bits as needed).

  - object_contents() would return the full content

  - object_read() could sequentially read a subset of the file (this
    could replace the streaming interface we currently have)

We have most of the low-level bits for this already, if you poke into
what object_info_extended() is doing. We just don't have them packaged
in an interface which can persist across multiple calls.

With an interface like that, parse_object()'s large-blob check could be
something like the patch below.

But your case here is a bit more interesting. If we have a commit graph,
then we can avoid opening (or even finding!) the on-disk object at all.
So I actually think it makes sense to just check the commit-graph first,
as discussed above.

---
diff --git a/object.c b/object.c
index e54160550c..afce58c0bc 100644
--- a/object.c
+++ b/object.c
@@ -254,23 +254,31 @@ struct object *parse_object(struct repository *r, const struct object_id *oid)
 	const struct object_id *repl = lookup_replace_object(r, oid);
 	void *buffer;
 	struct object *obj;
+	struct object_handle oh;
 
 	obj = lookup_object(r, oid->hash);
 	if (obj && obj->parsed)
 		return obj;
 
-	if ((obj && obj->type == OBJ_BLOB && has_object_file(oid)) ||
-	    (!obj && has_object_file(oid) &&
-	     oid_object_info(r, oid, NULL) == OBJ_BLOB)) {
-		if (check_object_signature(repl, NULL, 0, NULL) < 0) {
+	if (object_open(&oh, oid) < 0)
+		return NULL; /* missing object */
+
+	if (object_type(&oh) == OBJ_BLOB) {
+		/* this will call object_read() on 4k chunks */
+		if (check_object_signature_stream(&oh, oid)) {
 			error(_("sha1 mismatch %s"), oid_to_hex(oid));
 			return NULL;
 		}
+		object_close(&oh); /* we don't care about contents */
 		parse_blob_buffer(lookup_blob(r, oid), NULL, 0);
 		return lookup_object(r, oid->hash);
 	}
 
-	buffer = read_object_file(oid, &type, &size);
+	type = object_type(&oh);
+	size = object_size(&oh);
+	buffer = object_contents(&oh);
+	object_close(&oh);
+
 	if (buffer) {
 		if (check_object_signature(repl, buffer, size, type_name(type)) < 0) {
 			free(buffer);

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH on sb/more-repo-in-api] revision: use commit graph in get_reference()
  2018-12-06 23:36   ` Jonathan Tan
@ 2018-12-07 13:49     ` Derrick Stolee
  0 siblings, 0 replies; 28+ messages in thread
From: Derrick Stolee @ 2018-12-07 13:49 UTC (permalink / raw)
  To: Jonathan Tan, sbeller; +Cc: git

On 12/6/2018 6:36 PM, Jonathan Tan wrote:
>> AFAICT oid_object_info doesn't take advantage of the commit graph,
>> but just looks up the object header, which is still less than completely
>> parsing it. Then lookup_commit is overly strict, as it may return
>> NULL as when there still is a type mismatch (I don't think a mismatch
>> could happen here, as both rely on just the object store, and not the
>> commit graph.), so this would be just defensive programming for
>> the sake of it. I dunno.
>>
>>      struct commit *c;
>>
>>      if (oid_object_info(revs->repo, oid, NULL) == OBJ_COMMIT &&
>>          (c = lookup_commit(revs->repo, oid)) &&
>>          !repo_parse_commit(revs->repo, c))
>>              object = (struct object *) c;
>>      else
>>          object = parse_object(revs->repo, oid);
> I like this way better - I'll do it in the next version.

If we do _not_ have a commit-graph or if the commit-graph does not have
that commit, this will have the same performance problem, right?

Should we instead create a direct dependence on the commit-graph, and try
to parse the oid from the graph directly? If it succeeds, then we learn
that the object is a commit, in addition to all of the parsing work. This
means we could avoid oid_object_info() loading data if we succeed. We
would fall back to parse_object() if it fails.

I was thinking this should be a simple API call to parse_commit_in_graph(),
but that requires a struct commit filled with an oid, which is not the
best idea if we don't actually know it is a commit yet.

The approach I recommend would then be more detailed:

1. Modify find_commit_in_graph() to take a struct object_id instead of a
    struct commit. This helps find the integer position in the graph. That
    position can be used in fill_commit_in_graph() to load the commit
    contents. Keep find_commit_in_graph() static as it should not be a
    public function.

2. Create a public function with prototype

struct commit *try_parse_commit_from_graph(struct repository *r, struct 
object_id *oid)

    that returns a commit struct fully parsed if and only if the repository
    has that oid. It can call find_commit_in_graph(), then 
lookup_commit() and
    fill_commit_in_graph() to create the commit and parse the data.

3. In replace of the snippet above, do:

     struct commit *c;

     if ((c = try_parse_commit_from_graph(revs->repo, oid))
         object = (struct object *)c;
     else
         object = parse_object(revs->repo, oid);

A similar pattern _could_ be used in parse_object(), but I don't recommend
doing this pattern unless we have a reasonable suspicion that we are going
to parse commits more often than other objects. (It adds an O(log(# 
commits))
binary search to each object.)

A final thought: consider making this "try the commit graph first, but fall
back to parse_object()" a library function with a name like

     struct object *parse_probably_commit(struct repository *r, struct 
object_id *oid)

so other paths that are parsing a lot of commits (but also maybe tags) could
use the logic.

Thanks!
-Stolee

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH on master v2] revision: use commit graph in get_reference()
  2018-12-04 22:42 [PATCH on sb/more-repo-in-api] revision: use commit graph in get_reference() Jonathan Tan
                   ` (2 preceding siblings ...)
  2018-12-05 23:15 ` Junio C Hamano
@ 2018-12-07 21:50 ` Jonathan Tan
  2018-12-09  0:51   ` Junio C Hamano
  2018-12-13 18:54 ` [PATCH v3] " Jonathan Tan
  2019-01-25 15:33 ` Regression in: [PATCH on sb/more-repo-in-api] " SZEDER Gábor
  5 siblings, 1 reply; 28+ messages in thread
From: Jonathan Tan @ 2018-12-07 21:50 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, stolee, peff, gitster

When fetching into a repository, a connectivity check is first made by
check_exist_and_connected() in builtin/fetch.c that runs:

  git rev-list --objects --stdin --not --all --quiet <(list of objects)

If the client repository has many refs, this command can be slow,
regardless of the nature of the server repository or what is being
fetched. A profiler reveals that most of the time is spent in
setup_revisions() (approx. 60/63), and of the time spent in
setup_revisions(), most of it is spent in parse_object() (approx.
49/60). This is because setup_revisions() parses the target of every ref
(from "--all"), and parse_object() reads the buffer of the object.

Reading the buffer is unnecessary if the repository has a commit graph
and if the ref points to a commit (which is typically the case). This
patch uses the commit graph wherever possible; on my computer, when I
run the above command with a list of 1 object on a many-ref repository,
I get a speedup from 1.8s to 1.0s.

Another way to accomplish this effect would be to modify parse_object()
to use the commit graph if possible; however, I did not want to change
parse_object()'s current behavior of always checking the object
signature of the returned object.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
This patch is now on master.

v2 makes use of the optimization Stolee describes in [1], except that I
have arranged the functions slightly differently. In particular, I
didn't want to add even more ways to obtain objects, so I let
parse_commit_in_graph() be able to take in either a commit shell or an
OID, and did not create the parse_probably_commit() function he
suggested. But I'm not really attached to this design choice, and can
change it if requested.

[1] https://public-inbox.org/git/aa0cd481-c135-47aa-2a69-e3dc71661caa@gmail.com/
---
 commit-graph.c             | 38 ++++++++++++++++++++++++++++----------
 commit-graph.h             | 12 ++++++++----
 commit.c                   |  2 +-
 revision.c                 |  5 ++++-
 t/helper/test-repository.c |  4 ++--
 5 files changed, 43 insertions(+), 18 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 40c855f185..a571b523b7 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -286,7 +286,8 @@ void close_commit_graph(struct repository *r)
 	r->objects->commit_graph = NULL;
 }
 
-static int bsearch_graph(struct commit_graph *g, struct object_id *oid, uint32_t *pos)
+static int bsearch_graph(struct commit_graph *g, const struct object_id *oid,
+			 uint32_t *pos)
 {
 	return bsearch_hash(oid->hash, g->chunk_oid_fanout,
 			    g->chunk_oid_lookup, g->hash_len, pos);
@@ -374,24 +375,41 @@ static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uin
 	}
 }
 
-static int parse_commit_in_graph_one(struct commit_graph *g, struct commit *item)
+static struct commit *parse_commit_in_graph_one(struct repository *r,
+						struct commit_graph *g,
+						struct commit *shell,
+						const struct object_id *oid)
 {
 	uint32_t pos;
 
-	if (item->object.parsed)
-		return 1;
+	if (shell && shell->object.parsed)
+		return shell;
 
-	if (find_commit_in_graph(item, g, &pos))
-		return fill_commit_in_graph(item, g, pos);
+	if (shell && shell->graph_pos != COMMIT_NOT_FROM_GRAPH) {
+		pos = shell->graph_pos;
+	} else if (bsearch_graph(g, shell ? &shell->object.oid : oid, &pos)) {
+		/* bsearch_graph sets pos */
+	} else {
+		return NULL;
+	}
 
-	return 0;
+	if (!shell) {
+		shell = lookup_commit(r, oid);
+		if (!shell)
+			return NULL;
+	}
+
+	fill_commit_in_graph(shell, g, pos);
+	return shell;
 }
 
-int parse_commit_in_graph(struct repository *r, struct commit *item)
+struct commit *parse_commit_in_graph(struct repository *r, struct commit *shell,
+				     const struct object_id *oid)
 {
 	if (!prepare_commit_graph(r))
 		return 0;
-	return parse_commit_in_graph_one(r->objects->commit_graph, item);
+	return parse_commit_in_graph_one(r, r->objects->commit_graph, shell,
+					 oid);
 }
 
 void load_commit_graph_info(struct repository *r, struct commit *item)
@@ -1025,7 +1043,7 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g)
 		}
 
 		graph_commit = lookup_commit(r, &cur_oid);
-		if (!parse_commit_in_graph_one(g, graph_commit))
+		if (!parse_commit_in_graph_one(r, g, graph_commit, NULL))
 			graph_report("failed to parse %s from commit-graph",
 				     oid_to_hex(&cur_oid));
 	}
diff --git a/commit-graph.h b/commit-graph.h
index 9db40b4d3a..8b7b5985dc 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -13,16 +13,20 @@ struct commit;
 char *get_commit_graph_filename(const char *obj_dir);
 
 /*
- * Given a commit struct, try to fill the commit struct info, including:
+ * If the given commit (identified by shell->object.oid or oid) is in the
+ * commit graph, returns a commit struct (reusing shell if it is not NULL)
+ * including the following info:
  *  1. tree object
  *  2. date
  *  3. parents.
  *
- * Returns 1 if and only if the commit was found in the packed graph.
+ * If not, returns NULL. See parse_commit_buffer() for the fallback after this
+ * call.
  *
- * See parse_commit_buffer() for the fallback after this call.
+ * Either shell or oid must be non-NULL. If both are non-NULL, oid is ignored.
  */
-int parse_commit_in_graph(struct repository *r, struct commit *item);
+struct commit *parse_commit_in_graph(struct repository *r, struct commit *shell,
+				     const struct object_id *oid);
 
 /*
  * It is possible that we loaded commit contents from the commit buffer,
diff --git a/commit.c b/commit.c
index d13a7bc374..88eb580c5a 100644
--- a/commit.c
+++ b/commit.c
@@ -456,7 +456,7 @@ int parse_commit_internal(struct commit *item, int quiet_on_missing, int use_com
 		return -1;
 	if (item->object.parsed)
 		return 0;
-	if (use_commit_graph && parse_commit_in_graph(the_repository, item))
+	if (use_commit_graph && parse_commit_in_graph(the_repository, item, NULL))
 		return 0;
 	buffer = read_object_file(&item->object.oid, &type, &size);
 	if (!buffer)
diff --git a/revision.c b/revision.c
index 13e0519c02..05fddb5880 100644
--- a/revision.c
+++ b/revision.c
@@ -213,7 +213,10 @@ static struct object *get_reference(struct rev_info *revs, const char *name,
 {
 	struct object *object;
 
-	object = parse_object(revs->repo, oid);
+	object = (struct object *) parse_commit_in_graph(revs->repo, NULL, oid);
+	if (!object)
+		object = parse_object(revs->repo, oid);
+
 	if (!object) {
 		if (revs->ignore_missing)
 			return object;
diff --git a/t/helper/test-repository.c b/t/helper/test-repository.c
index 6a84a53efb..63b928a883 100644
--- a/t/helper/test-repository.c
+++ b/t/helper/test-repository.c
@@ -22,7 +22,7 @@ static void test_parse_commit_in_graph(const char *gitdir, const char *worktree,
 
 	c = lookup_commit(&r, commit_oid);
 
-	if (!parse_commit_in_graph(&r, c))
+	if (!parse_commit_in_graph(&r, c, NULL))
 		die("Couldn't parse commit");
 
 	printf("%"PRItime, c->date);
@@ -52,7 +52,7 @@ static void test_get_commit_tree_in_graph(const char *gitdir,
 	 * get_commit_tree_in_graph does not automatically parse the commit, so
 	 * parse it first.
 	 */
-	if (!parse_commit_in_graph(&r, c))
+	if (!parse_commit_in_graph(&r, c, NULL))
 		die("Couldn't parse commit");
 	tree = get_commit_tree_in_graph(&r, c);
 	if (!tree)
-- 
2.19.0.271.gfe8321ec05.dirty


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH on master v2] revision: use commit graph in get_reference()
  2018-12-07 21:50 ` [PATCH on master v2] " Jonathan Tan
@ 2018-12-09  0:51   ` Junio C Hamano
  2018-12-09  1:49     ` Junio C Hamano
  2018-12-11 10:54     ` Jeff King
  0 siblings, 2 replies; 28+ messages in thread
From: Junio C Hamano @ 2018-12-09  0:51 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, stolee, peff

Jonathan Tan <jonathantanmy@google.com> writes:

> When fetching into a repository, a connectivity check is first made by
> check_exist_and_connected() in builtin/fetch.c that runs:
> ...
> Another way to accomplish this effect would be to modify parse_object()
> to use the commit graph if possible; however, I did not want to change
> parse_object()'s current behavior of always checking the object
> signature of the returned object.

Sounds good.

> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
> This patch is now on master.

OK.  

Obviously that won't apply to the base for v1 without conflicts, and
it of course applies cleanly on 'master', but the result of doing so
will cause the same conflicts when sb/more-repo-in-api is merged on
top, which means that the same conflicts need to be resolved if this
wants to be merged to 'next' (or 'pu', FWIW).

> diff --git a/commit-graph.c b/commit-graph.c
> index 40c855f185..a571b523b7 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -374,24 +375,41 @@ static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uin
>  	}
>  }
>  
> -static int parse_commit_in_graph_one(struct commit_graph *g, struct commit *item)
> +static struct commit *parse_commit_in_graph_one(struct repository *r,
> +						struct commit_graph *g,
> +						struct commit *shell,
> +						const struct object_id *oid)

Now the complexity of the behaviour of this function deserves to be
documented in a comment in front.  Let me see if I can get it
correctly without such a comment by explaining the function aloud.

The caller may or may not have already obtained an in-core commit
object for a given object name, so shell could be NULL but otherwise
it could be used for optimization.  When shell==NULL, the function
looks up the commit object using the oid parameter instead.  The
returned in-core commit has the parents etc. filled as if we ran
parse_commit() on it.  If the commit is not yet in the graph, the
caller may get a NULL even if the commit exists.

>  {
>  	uint32_t pos;
>  
> -	if (item->object.parsed)
> -		return 1;
> +	if (shell && shell->object.parsed)
> +		return shell;
>  
> -	if (find_commit_in_graph(item, g, &pos))
> -		return fill_commit_in_graph(item, g, pos);
> +	if (shell && shell->graph_pos != COMMIT_NOT_FROM_GRAPH) {
> +		pos = shell->graph_pos;
> +	} else if (bsearch_graph(g, shell ? &shell->object.oid : oid, &pos)) {
> +		/* bsearch_graph sets pos */

Please spell an empty statement like so:

		; /* comment */

> +	} else {
> +		return NULL;

We come here when the commit (either came from shell or from oid) is
not found by bsearch_graph().  "Is the caller prepared for it, and
how?" is a natural question a reader would have.  Let's read on.

> +	}
>  
> -	return 0;
> +	if (!shell) {
> +		shell = lookup_commit(r, oid);
> +		if (!shell)
> +			return NULL;
> +	}
> +
> +	fill_commit_in_graph(shell, g, pos);
> +	return shell;
>  }
>  
> -int parse_commit_in_graph(struct repository *r, struct commit *item)
> +struct commit *parse_commit_in_graph(struct repository *r, struct commit *shell,
> +				     const struct object_id *oid)
>  {
>  	if (!prepare_commit_graph(r))
>  		return 0;
> -	return parse_commit_in_graph_one(r->objects->commit_graph, item);
> +	return parse_commit_in_graph_one(r, r->objects->commit_graph, shell,
> +					 oid);
>  }
>  
>  void load_commit_graph_info(struct repository *r, struct commit *item)
> @@ -1025,7 +1043,7 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g)
>  		}
>  
>  		graph_commit = lookup_commit(r, &cur_oid);
> -		if (!parse_commit_in_graph_one(g, graph_commit))
> +		if (!parse_commit_in_graph_one(r, g, graph_commit, NULL))
>  			graph_report("failed to parse %s from commit-graph",
>  				     oid_to_hex(&cur_oid));
>  	}
> diff --git a/commit-graph.h b/commit-graph.h
> index 9db40b4d3a..8b7b5985dc 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -13,16 +13,20 @@ struct commit;
>  char *get_commit_graph_filename(const char *obj_dir);
>  
>  /*
> - * Given a commit struct, try to fill the commit struct info, including:
> + * If the given commit (identified by shell->object.oid or oid) is in the
> + * commit graph, returns a commit struct (reusing shell if it is not NULL)
> + * including the following info:
>   *  1. tree object
>   *  2. date
>   *  3. parents.
>   *
> - * Returns 1 if and only if the commit was found in the packed graph.
> + * If not, returns NULL. See parse_commit_buffer() for the fallback after this
> + * call.
>   *
> - * See parse_commit_buffer() for the fallback after this call.
> + * Either shell or oid must be non-NULL. If both are non-NULL, oid is ignored.
>   */

OK, the eventual caller is the caller of this thing, which should
have been prepared to see NULL for a commit that is too new.  So
that should be OK.

> -int parse_commit_in_graph(struct repository *r, struct commit *item);
> +struct commit *parse_commit_in_graph(struct repository *r, struct commit *shell,
> +				     const struct object_id *oid);
>  
>  /*
>   * It is possible that we loaded commit contents from the commit buffer,
> diff --git a/commit.c b/commit.c
> index d13a7bc374..88eb580c5a 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -456,7 +456,7 @@ int parse_commit_internal(struct commit *item, int quiet_on_missing, int use_com
>  		return -1;
>  	if (item->object.parsed)
>  		return 0;
> -	if (use_commit_graph && parse_commit_in_graph(the_repository, item))
> +	if (use_commit_graph && parse_commit_in_graph(the_repository, item, NULL))
>  		return 0;
>  	buffer = read_object_file(&item->object.oid, &type, &size);
>  	if (!buffer)
> diff --git a/revision.c b/revision.c
> index 13e0519c02..05fddb5880 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -213,7 +213,10 @@ static struct object *get_reference(struct rev_info *revs, const char *name,
>  {
>  	struct object *object;
>  
> -	object = parse_object(revs->repo, oid);
> +	object = (struct object *) parse_commit_in_graph(revs->repo, NULL, oid);
> +	if (!object)
> +		object = parse_object(revs->repo, oid);

OK and this is such a caller.  I think a general rule of thumb is
that we need to access recent history a lot more often than the
older part of the history, and having to fall back for more recent
commits feels a bit disturbing, but I do not see an easy way to
reverse the performance characteristics offhand.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH on master v2] revision: use commit graph in get_reference()
  2018-12-09  0:51   ` Junio C Hamano
@ 2018-12-09  1:49     ` Junio C Hamano
  2018-12-11 10:54     ` Jeff King
  1 sibling, 0 replies; 28+ messages in thread
From: Junio C Hamano @ 2018-12-09  1:49 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, stolee, peff

Junio C Hamano <gitster@pobox.com> writes:

> Jonathan Tan <jonathantanmy@google.com> writes:
>
>> When fetching into a repository, a connectivity check is first made by
>> check_exist_and_connected() in builtin/fetch.c that runs:
>> ...
>> Another way to accomplish this effect would be to modify parse_object()
>> to use the commit graph if possible; however, I did not want to change
>> parse_object()'s current behavior of always checking the object
>> signature of the returned object.
>
> Sounds good.
>
>> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
>> ---
>> This patch is now on master.
>
> OK.  
>
> Obviously that won't apply to the base for v1 without conflicts, and
> it of course applies cleanly on 'master', but the result of doing so
> will cause the same conflicts when sb/more-repo-in-api is merged on
> top, which means that the same conflicts need to be resolved if this
> wants to be merged to 'next' (or 'pu', FWIW).

So,... as I had to do the reverse rebase anyway, here is the
difference since the previous round, which I came up with by
comparing these two:

 (A) merge 'sb/more-repo-in-api' to 'master' and then merge v1 of
     this topic to the result.

 (B) apply your patch to 'master', and then merge
     'sb/more-repo-in-api' to the result.

diff --git a/commit-graph.c b/commit-graph.c
index f78a8e96b5..74a17789f8 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -286,7 +286,8 @@ void close_commit_graph(struct repository *r)
 	r->objects->commit_graph = NULL;
 }
 
-static int bsearch_graph(struct commit_graph *g, struct object_id *oid, uint32_t *pos)
+static int bsearch_graph(struct commit_graph *g, const struct object_id *oid,
+			 uint32_t *pos)
 {
 	return bsearch_hash(oid->hash, g->chunk_oid_fanout,
 			    g->chunk_oid_lookup, g->hash_len, pos);
@@ -377,26 +378,41 @@ static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uin
 	}
 }
 
-static int parse_commit_in_graph_one(struct repository *r,
-				     struct commit_graph *g,
-				     struct commit *item)
+static struct commit *parse_commit_in_graph_one(struct repository *r,
+						struct commit_graph *g,
+						struct commit *shell,
+						const struct object_id *oid)
 {
 	uint32_t pos;
 
-	if (item->object.parsed)
-		return 1;
+	if (shell && shell->object.parsed)
+		return shell;
 
-	if (find_commit_in_graph(item, g, &pos))
-		return fill_commit_in_graph(r, item, g, pos);
+	if (shell && shell->graph_pos != COMMIT_NOT_FROM_GRAPH) {
+		pos = shell->graph_pos;
+	} else if (bsearch_graph(g, shell ? &shell->object.oid : oid, &pos)) {
+		/* bsearch_graph sets pos */
+	} else {
+		return NULL;
+	}
 
-	return 0;
+	if (!shell) {
+		shell = lookup_commit(r, oid);
+		if (!shell)
+			return NULL;
+	}
+
+	fill_commit_in_graph(r, shell, g, pos);
+	return shell;
 }
 
-int parse_commit_in_graph(struct repository *r, struct commit *item)
+struct commit *parse_commit_in_graph(struct repository *r, struct commit *shell,
+				     const struct object_id *oid)
 {
 	if (!prepare_commit_graph(r))
 		return 0;
-	return parse_commit_in_graph_one(r, r->objects->commit_graph, item);
+	return parse_commit_in_graph_one(r, r->objects->commit_graph, shell,
+					 oid);
 }
 
 void load_commit_graph_info(struct repository *r, struct commit *item)
@@ -1033,7 +1049,7 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g)
 		}
 
 		graph_commit = lookup_commit(r, &cur_oid);
-		if (!parse_commit_in_graph_one(r, g, graph_commit))
+		if (!parse_commit_in_graph_one(r, g, graph_commit, NULL))
 			graph_report("failed to parse %s from commit-graph",
 				     oid_to_hex(&cur_oid));
 	}
diff --git a/commit-graph.h b/commit-graph.h
index 9db40b4d3a..8b7b5985dc 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -13,16 +13,20 @@ struct commit;
 char *get_commit_graph_filename(const char *obj_dir);
 
 /*
- * Given a commit struct, try to fill the commit struct info, including:
+ * If the given commit (identified by shell->object.oid or oid) is in the
+ * commit graph, returns a commit struct (reusing shell if it is not NULL)
+ * including the following info:
  *  1. tree object
  *  2. date
  *  3. parents.
  *
- * Returns 1 if and only if the commit was found in the packed graph.
+ * If not, returns NULL. See parse_commit_buffer() for the fallback after this
+ * call.
  *
- * See parse_commit_buffer() for the fallback after this call.
+ * Either shell or oid must be non-NULL. If both are non-NULL, oid is ignored.
  */
-int parse_commit_in_graph(struct repository *r, struct commit *item);
+struct commit *parse_commit_in_graph(struct repository *r, struct commit *shell,
+				     const struct object_id *oid);
 
 /*
  * It is possible that we loaded commit contents from the commit buffer,
diff --git a/commit.c b/commit.c
index a5333c7ac6..da7a1d3262 100644
--- a/commit.c
+++ b/commit.c
@@ -462,7 +462,7 @@ int repo_parse_commit_internal(struct repository *r,
 		return -1;
 	if (item->object.parsed)
 		return 0;
-	if (use_commit_graph && parse_commit_in_graph(r, item))
+	if (use_commit_graph && parse_commit_in_graph(r, item, NULL))
 		return 0;
 	buffer = repo_read_object_file(r, &item->object.oid, &type, &size);
 	if (!buffer)
diff --git a/revision.c b/revision.c
index 22aa109c14..05fddb5880 100644
--- a/revision.c
+++ b/revision.c
@@ -213,19 +213,9 @@ static struct object *get_reference(struct rev_info *revs, const char *name,
 {
 	struct object *object;
 
-	/*
-	 * If the repository has commit graphs, repo_parse_commit() avoids
-	 * reading the object buffer, so use it whenever possible.
-	 */
-	if (oid_object_info(revs->repo, oid, NULL) == OBJ_COMMIT) {
-		struct commit *c = lookup_commit(revs->repo, oid);
-		if (!repo_parse_commit(revs->repo, c))
-			object = (struct object *) c;
-		else
-			object = NULL;
-	} else {
+	object = (struct object *) parse_commit_in_graph(revs->repo, NULL, oid);
+	if (!object)
 		object = parse_object(revs->repo, oid);
-	}
 
 	if (!object) {
 		if (revs->ignore_missing)
diff --git a/t/helper/test-repository.c b/t/helper/test-repository.c
index f7f8618445..689a0b652e 100644
--- a/t/helper/test-repository.c
+++ b/t/helper/test-repository.c
@@ -27,7 +27,7 @@ static void test_parse_commit_in_graph(const char *gitdir, const char *worktree,
 
 	c = lookup_commit(&r, commit_oid);
 
-	if (!parse_commit_in_graph(&r, c))
+	if (!parse_commit_in_graph(&r, c, NULL))
 		die("Couldn't parse commit");
 
 	printf("%"PRItime, c->date);
@@ -62,7 +62,7 @@ static void test_get_commit_tree_in_graph(const char *gitdir,
 	 * get_commit_tree_in_graph does not automatically parse the commit, so
 	 * parse it first.
 	 */
-	if (!parse_commit_in_graph(&r, c))
+	if (!parse_commit_in_graph(&r, c, NULL))
 		die("Couldn't parse commit");
 	tree = get_commit_tree_in_graph(&r, c);
 	if (!tree)


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH on master v2] revision: use commit graph in get_reference()
  2018-12-09  0:51   ` Junio C Hamano
  2018-12-09  1:49     ` Junio C Hamano
@ 2018-12-11 10:54     ` Jeff King
  2018-12-12 19:58       ` Jonathan Tan
  1 sibling, 1 reply; 28+ messages in thread
From: Jeff King @ 2018-12-11 10:54 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jonathan Tan, git, stolee

On Sun, Dec 09, 2018 at 09:51:28AM +0900, Junio C Hamano wrote:

> > -static int parse_commit_in_graph_one(struct commit_graph *g, struct commit *item)
> > +static struct commit *parse_commit_in_graph_one(struct repository *r,
> > +						struct commit_graph *g,
> > +						struct commit *shell,
> > +						const struct object_id *oid)
> 
> Now the complexity of the behaviour of this function deserves to be
> documented in a comment in front.  Let me see if I can get it
> correctly without such a comment by explaining the function aloud.
> 
> The caller may or may not have already obtained an in-core commit
> object for a given object name, so shell could be NULL but otherwise
> it could be used for optimization.  When shell==NULL, the function
> looks up the commit object using the oid parameter instead.  The
> returned in-core commit has the parents etc. filled as if we ran
> parse_commit() on it.  If the commit is not yet in the graph, the
> caller may get a NULL even if the commit exists.

Yeah, this was the part that took me a bit to figure out, as well. The
optimization here is really just avoiding a call to lookup_commit(),
which will do a single hash-table lookup. I wonder if that's actually
worth this more complex interface (as opposed to just always taking an
oid and then always returning a "struct commit", which could be old or
new).

-Peff

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH on master v2] revision: use commit graph in get_reference()
  2018-12-11 10:54     ` Jeff King
@ 2018-12-12 19:58       ` Jonathan Tan
  2018-12-13  1:27         ` Jeff King
  0 siblings, 1 reply; 28+ messages in thread
From: Jonathan Tan @ 2018-12-12 19:58 UTC (permalink / raw)
  To: peff; +Cc: gitster, jonathantanmy, git, stolee

> On Sun, Dec 09, 2018 at 09:51:28AM +0900, Junio C Hamano wrote:
> 
> > > -static int parse_commit_in_graph_one(struct commit_graph *g, struct commit *item)
> > > +static struct commit *parse_commit_in_graph_one(struct repository *r,
> > > +						struct commit_graph *g,
> > > +						struct commit *shell,
> > > +						const struct object_id *oid)
> > 
> > Now the complexity of the behaviour of this function deserves to be
> > documented in a comment in front.  Let me see if I can get it
> > correctly without such a comment by explaining the function aloud.
> > 
> > The caller may or may not have already obtained an in-core commit
> > object for a given object name, so shell could be NULL but otherwise
> > it could be used for optimization.  When shell==NULL, the function
> > looks up the commit object using the oid parameter instead.  The
> > returned in-core commit has the parents etc. filled as if we ran
> > parse_commit() on it.  If the commit is not yet in the graph, the
> > caller may get a NULL even if the commit exists.

In the next revision, I'll unify parse_commit_in_graph_one() (quoted
above) with parse_commit_in_graph(), so that the comment I wrote for the
latter can cover the entire functionality. I think the comment covers
the details that you outline here.

> Yeah, this was the part that took me a bit to figure out, as well. The
> optimization here is really just avoiding a call to lookup_commit(),
> which will do a single hash-table lookup. I wonder if that's actually
> worth this more complex interface (as opposed to just always taking an
> oid and then always returning a "struct commit", which could be old or
> new).

Avoidance of lookup_commit() is more important than an optimization, I
think. Here, we call lookup_commit() only when we know that that object
is a commit (by its presence in a commit graph). If we just called it
blindly, we might mistakenly create a commit for that hash when it is
actually an object of another type. (We could inline lookup_commit() in
parse_commit_in_graph_one(), removing the object creation part, but that
adds complexity as well.)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH on master v2] revision: use commit graph in get_reference()
  2018-12-12 19:58       ` Jonathan Tan
@ 2018-12-13  1:27         ` Jeff King
  2018-12-13 16:20           ` Derrick Stolee
  0 siblings, 1 reply; 28+ messages in thread
From: Jeff King @ 2018-12-13  1:27 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: gitster, git, stolee

On Wed, Dec 12, 2018 at 11:58:12AM -0800, Jonathan Tan wrote:

> > Yeah, this was the part that took me a bit to figure out, as well. The
> > optimization here is really just avoiding a call to lookup_commit(),
> > which will do a single hash-table lookup. I wonder if that's actually
> > worth this more complex interface (as opposed to just always taking an
> > oid and then always returning a "struct commit", which could be old or
> > new).
> 
> Avoidance of lookup_commit() is more important than an optimization, I
> think. Here, we call lookup_commit() only when we know that that object
> is a commit (by its presence in a commit graph). If we just called it
> blindly, we might mistakenly create a commit for that hash when it is
> actually an object of another type. (We could inline lookup_commit() in
> parse_commit_in_graph_one(), removing the object creation part, but that
> adds complexity as well.)

I was thinking we would only do so in the happy path when we find a
commit. I.e., something like:

  obj = lookup_object(oid); /* does not auto-vivify */
  if (obj && obj->parsed)
	return obj;

  if (we_have_it_in_commit_graph) {
	commit = obj || lookup_commit(oid);
	fill_in_details_from_commit_graph(commit);
	return &commit->obj;
  } else {
	return parse_object(oid);
  }

which is more along the lines of that parse_probably_commit() that
Stolee mentioned.

-Peff

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH on master v2] revision: use commit graph in get_reference()
  2018-12-13  1:27         ` Jeff King
@ 2018-12-13 16:20           ` Derrick Stolee
  0 siblings, 0 replies; 28+ messages in thread
From: Derrick Stolee @ 2018-12-13 16:20 UTC (permalink / raw)
  To: Jeff King, Jonathan Tan; +Cc: gitster, git

On 12/12/2018 8:27 PM, Jeff King wrote:
> On Wed, Dec 12, 2018 at 11:58:12AM -0800, Jonathan Tan wrote:
>
>>> Yeah, this was the part that took me a bit to figure out, as well. The
>>> optimization here is really just avoiding a call to lookup_commit(),
>>> which will do a single hash-table lookup. I wonder if that's actually
>>> worth this more complex interface (as opposed to just always taking an
>>> oid and then always returning a "struct commit", which could be old or
>>> new).
>> Avoidance of lookup_commit() is more important than an optimization, I
>> think. Here, we call lookup_commit() only when we know that that object
>> is a commit (by its presence in a commit graph). If we just called it
>> blindly, we might mistakenly create a commit for that hash when it is
>> actually an object of another type. (We could inline lookup_commit() in
>> parse_commit_in_graph_one(), removing the object creation part, but that
>> adds complexity as well.)
> I was thinking we would only do so in the happy path when we find a
> commit. I.e., something like:
>
>    obj = lookup_object(oid); /* does not auto-vivify */
>    if (obj && obj->parsed)
> 	return obj;
>
>    if (we_have_it_in_commit_graph) {
> 	commit = obj || lookup_commit(oid);
> 	fill_in_details_from_commit_graph(commit);
> 	return &commit->obj;
>    } else {
> 	return parse_object(oid);
>    }
>
> which is more along the lines of that parse_probably_commit() that
> Stolee mentioned.

This approach is what I had in mind. Thanks for making it more concrete!

-Stolee


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v3] revision: use commit graph in get_reference()
  2018-12-04 22:42 [PATCH on sb/more-repo-in-api] revision: use commit graph in get_reference() Jonathan Tan
                   ` (3 preceding siblings ...)
  2018-12-07 21:50 ` [PATCH on master v2] " Jonathan Tan
@ 2018-12-13 18:54 ` Jonathan Tan
  2018-12-14  3:20   ` Junio C Hamano
  2018-12-14  8:45   ` Jeff King
  2019-01-25 15:33 ` Regression in: [PATCH on sb/more-repo-in-api] " SZEDER Gábor
  5 siblings, 2 replies; 28+ messages in thread
From: Jonathan Tan @ 2018-12-13 18:54 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, peff, stolee, gitster

When fetching into a repository, a connectivity check is first made by
check_exist_and_connected() in builtin/fetch.c that runs:

  git rev-list --objects --stdin --not --all --quiet <(list of objects)

If the client repository has many refs, this command can be slow,
regardless of the nature of the server repository or what is being
fetched. A profiler reveals that most of the time is spent in
setup_revisions() (approx. 60/63), and of the time spent in
setup_revisions(), most of it is spent in parse_object() (approx.
49/60). This is because setup_revisions() parses the target of every ref
(from "--all"), and parse_object() reads the buffer of the object.

Reading the buffer is unnecessary if the repository has a commit graph
and if the ref points to a commit (which is typically the case). This
patch uses the commit graph wherever possible; on my computer, when I
run the above command with a list of 1 object on a many-ref repository,
I get a speedup from 1.8s to 1.0s.

Another way to accomplish this effect would be to modify parse_object()
to use the commit graph if possible; however, I did not want to change
parse_object()'s current behavior of always checking the object
signature of the returned object.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
This patch is still on master. Junio, let me know if you would rather
have me base it on sb/more-repo-in-api instead.

I mentioned [1] that I would unify parse_commit_in_graph_one() and
parse_commit_in_graph() so that one documentation comment could cover
all the functionality, but with the simpler API, I decided not to do
that to minimize the diff.

Change in v3: Now uses a simpler API with the algorithm suggested by
Peff in [2], except that I also retain the existing optimization that
checks if graph_pos is already set.

[1] https://public-inbox.org/git/20181212195812.232726-1-jonathantanmy@google.com/
[2] https://public-inbox.org/git/20181213012707.GC26210@sigill.intra.peff.net/
---
 commit-graph.c             | 44 ++++++++++++++++++++++++++------------
 commit-graph.h             | 11 +++++-----
 commit.c                   |  4 +++-
 revision.c                 |  5 ++++-
 t/helper/test-repository.c |  8 ++-----
 5 files changed, 45 insertions(+), 27 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 40c855f185..0aca7ec0fe 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -286,7 +286,8 @@ void close_commit_graph(struct repository *r)
 	r->objects->commit_graph = NULL;
 }
 
-static int bsearch_graph(struct commit_graph *g, struct object_id *oid, uint32_t *pos)
+static int bsearch_graph(struct commit_graph *g, const struct object_id *oid,
+			 uint32_t *pos)
 {
 	return bsearch_hash(oid->hash, g->chunk_oid_fanout,
 			    g->chunk_oid_lookup, g->hash_len, pos);
@@ -374,24 +375,42 @@ static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uin
 	}
 }
 
-static int parse_commit_in_graph_one(struct commit_graph *g, struct commit *item)
+static struct commit *parse_commit_in_graph_one(struct repository *r,
+						struct commit_graph *g,
+						const struct object_id *oid)
 {
+	struct object *obj;
+	struct commit *commit;
 	uint32_t pos;
 
-	if (item->object.parsed)
-		return 1;
+	obj = lookup_object(r, oid->hash);
+	commit = obj && obj->type == OBJ_COMMIT ? (struct commit *) obj : NULL;
+	if (commit && obj->parsed)
+		return commit;
 
-	if (find_commit_in_graph(item, g, &pos))
-		return fill_commit_in_graph(item, g, pos);
+	if (commit && commit->graph_pos != COMMIT_NOT_FROM_GRAPH)
+		pos = commit->graph_pos;
+	else if (bsearch_graph(g, oid, &pos))
+		; /* bsearch_graph sets pos */
+	else
+		return NULL;
 
-	return 0;
+	if (!commit) {
+		commit = lookup_commit(r, oid);
+		if (!commit)
+			return NULL;
+	}
+
+	fill_commit_in_graph(commit, g, pos);
+	return commit;
 }
 
-int parse_commit_in_graph(struct repository *r, struct commit *item)
+struct commit *parse_commit_in_graph(struct repository *r,
+				     const struct object_id *oid)
 {
 	if (!prepare_commit_graph(r))
-		return 0;
-	return parse_commit_in_graph_one(r->objects->commit_graph, item);
+		return NULL;
+	return parse_commit_in_graph_one(r, r->objects->commit_graph, oid);
 }
 
 void load_commit_graph_info(struct repository *r, struct commit *item)
@@ -1004,8 +1023,6 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g)
 	}
 
 	for (i = 0; i < g->num_commits; i++) {
-		struct commit *graph_commit;
-
 		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
 
 		if (i && oidcmp(&prev_oid, &cur_oid) >= 0)
@@ -1024,8 +1041,7 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g)
 			cur_fanout_pos++;
 		}
 
-		graph_commit = lookup_commit(r, &cur_oid);
-		if (!parse_commit_in_graph_one(g, graph_commit))
+		if (!parse_commit_in_graph_one(r, g, &cur_oid))
 			graph_report("failed to parse %s from commit-graph",
 				     oid_to_hex(&cur_oid));
 	}
diff --git a/commit-graph.h b/commit-graph.h
index 9db40b4d3a..b67aac1125 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -13,16 +13,17 @@ struct commit;
 char *get_commit_graph_filename(const char *obj_dir);
 
 /*
- * Given a commit struct, try to fill the commit struct info, including:
+ * If the given commit is in the commit graph, returns a commit struct
+ * including the following info:
  *  1. tree object
  *  2. date
  *  3. parents.
  *
- * Returns 1 if and only if the commit was found in the packed graph.
- *
- * See parse_commit_buffer() for the fallback after this call.
+ * If not, returns NULL. See parse_commit_buffer() for the fallback after this
+ * call.
  */
-int parse_commit_in_graph(struct repository *r, struct commit *item);
+struct commit *parse_commit_in_graph(struct repository *r,
+				     const struct object_id *oid);
 
 /*
  * It is possible that we loaded commit contents from the commit buffer,
diff --git a/commit.c b/commit.c
index d13a7bc374..19ce5e34a2 100644
--- a/commit.c
+++ b/commit.c
@@ -456,7 +456,9 @@ int parse_commit_internal(struct commit *item, int quiet_on_missing, int use_com
 		return -1;
 	if (item->object.parsed)
 		return 0;
-	if (use_commit_graph && parse_commit_in_graph(the_repository, item))
+	if (use_commit_graph &&
+	    parse_commit_in_graph(the_repository, &item->object.oid) &&
+	    item->object.parsed)
 		return 0;
 	buffer = read_object_file(&item->object.oid, &type, &size);
 	if (!buffer)
diff --git a/revision.c b/revision.c
index 13e0519c02..7f54f3b4c7 100644
--- a/revision.c
+++ b/revision.c
@@ -213,7 +213,10 @@ static struct object *get_reference(struct rev_info *revs, const char *name,
 {
 	struct object *object;
 
-	object = parse_object(revs->repo, oid);
+	object = (struct object *) parse_commit_in_graph(revs->repo, oid);
+	if (!object)
+		object = parse_object(revs->repo, oid);
+
 	if (!object) {
 		if (revs->ignore_missing)
 			return object;
diff --git a/t/helper/test-repository.c b/t/helper/test-repository.c
index 6a84a53efb..35bfd1233d 100644
--- a/t/helper/test-repository.c
+++ b/t/helper/test-repository.c
@@ -20,9 +20,7 @@ static void test_parse_commit_in_graph(const char *gitdir, const char *worktree,
 	if (repo_init(&r, gitdir, worktree))
 		die("Couldn't init repo");
 
-	c = lookup_commit(&r, commit_oid);
-
-	if (!parse_commit_in_graph(&r, c))
+	if (!(c = parse_commit_in_graph(&r, commit_oid)))
 		die("Couldn't parse commit");
 
 	printf("%"PRItime, c->date);
@@ -46,13 +44,11 @@ static void test_get_commit_tree_in_graph(const char *gitdir,
 	if (repo_init(&r, gitdir, worktree))
 		die("Couldn't init repo");
 
-	c = lookup_commit(&r, commit_oid);
-
 	/*
 	 * get_commit_tree_in_graph does not automatically parse the commit, so
 	 * parse it first.
 	 */
-	if (!parse_commit_in_graph(&r, c))
+	if (!(c = parse_commit_in_graph(&r, commit_oid)))
 		die("Couldn't parse commit");
 	tree = get_commit_tree_in_graph(&r, c);
 	if (!tree)
-- 
2.19.0.271.gfe8321ec05.dirty


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v3] revision: use commit graph in get_reference()
  2018-12-13 18:54 ` [PATCH v3] " Jonathan Tan
@ 2018-12-14  3:20   ` Junio C Hamano
  2018-12-14  8:45   ` Jeff King
  1 sibling, 0 replies; 28+ messages in thread
From: Junio C Hamano @ 2018-12-14  3:20 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, peff, stolee

Jonathan Tan <jonathantanmy@google.com> writes:

> When fetching into a repository, a connectivity check is first made by
> check_exist_and_connected() in builtin/fetch.c that runs:
>
>   git rev-list --objects --stdin --not --all --quiet <(list of objects)
>
> If the client repository has many refs, this command can be slow,
> regardless of the nature of the server repository or what is being
> fetched. A profiler reveals that most of the time is spent in
> setup_revisions() (approx. 60/63), and of the time spent in
> setup_revisions(), most of it is spent in parse_object() (approx.
> 49/60). This is because setup_revisions() parses the target of every ref
> (from "--all"), and parse_object() reads the buffer of the object.
>
> Reading the buffer is unnecessary if the repository has a commit graph
> and if the ref points to a commit (which is typically the case). This
> patch uses the commit graph wherever possible; on my computer, when I
> run the above command with a list of 1 object on a many-ref repository,
> I get a speedup from 1.8s to 1.0s.
>
> Another way to accomplish this effect would be to modify parse_object()
> to use the commit graph if possible; however, I did not want to change
> parse_object()'s current behavior of always checking the object
> signature of the returned object.
>
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
> This patch is still on master. Junio, let me know if you would rather
> have me base it on sb/more-repo-in-api instead.

Unless we all agree that we will abandon sb/more-repo-in-api,
rerolling this on 'master' will force me to resolve similar but
different conflicts every time.  Unless we fast-track that other
topic, that is, but I do not think that is what you meant to do.

> Change in v3: Now uses a simpler API with the algorithm suggested by
> Peff in [2], except that I also retain the existing optimization that
> checks if graph_pos is already set.

OK.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3] revision: use commit graph in get_reference()
  2018-12-13 18:54 ` [PATCH v3] " Jonathan Tan
  2018-12-14  3:20   ` Junio C Hamano
@ 2018-12-14  8:45   ` Jeff King
  1 sibling, 0 replies; 28+ messages in thread
From: Jeff King @ 2018-12-14  8:45 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, stolee, gitster

On Thu, Dec 13, 2018 at 10:54:50AM -0800, Jonathan Tan wrote:

> -static int parse_commit_in_graph_one(struct commit_graph *g, struct commit *item)
> +static struct commit *parse_commit_in_graph_one(struct repository *r,
> +						struct commit_graph *g,
> +						const struct object_id *oid)

Making sure I understand the new logic...

>  {
> +	struct object *obj;
> +	struct commit *commit;
>  	uint32_t pos;
>  
> -	if (item->object.parsed)
> -		return 1;
> +	obj = lookup_object(r, oid->hash);
> +	commit = obj && obj->type == OBJ_COMMIT ? (struct commit *) obj : NULL;
> +	if (commit && obj->parsed)
> +		return commit;

OK, so if it's a commit and we have it parsed, we return that. By using
lookup_object(), if it's a non-commit, we haven't changed anything.
Good.

> -	if (find_commit_in_graph(item, g, &pos))
> -		return fill_commit_in_graph(item, g, pos);
> +	if (commit && commit->graph_pos != COMMIT_NOT_FROM_GRAPH)
> +		pos = commit->graph_pos;
> +	else if (bsearch_graph(g, oid, &pos))
> +		; /* bsearch_graph sets pos */
> +	else
> +		return NULL;

And then we try to find it in the commit graph. If we didn't, then we'll
end up returning NULL. Good.

> -	return 0;
> +	if (!commit) {
> +		commit = lookup_commit(r, oid);
> +		if (!commit)
> +			return NULL;
> +	}

And at this point we found it in the commit graph, so we know it's a
commit. lookup_commit() should succeed, but in the off chance that it's
in the commit graph _and_ we previously found it as a non-commit
(yikes!), we'll return NULL. That's equivalent to just pretending we
didn't find it in the commit graph, and the caller can sort it out (when
they read the object, either it will match the previous type, or it
really will be a commit and they'll follow the normal complaining path).
Good.

So this all makes sense. The one thing we don't do here is actually
parse an unparsed commit that isn't in the graph, and instead leave that
to the caller. E.g. get_reference() now does:

> -	object = parse_object(revs->repo, oid);
> +	object = (struct object *) parse_commit_in_graph(revs->repo, oid);
> +	if (!object)
> +		object = parse_object(revs->repo, oid);

In theory we could save another lookup_object() in parse_object() by
combining these steps, but I don't think it's really worth worrying too
much about.

So overall this looks good to me.

-Peff

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Regression in: [PATCH on sb/more-repo-in-api] revision: use commit graph in get_reference()
  2018-12-04 22:42 [PATCH on sb/more-repo-in-api] revision: use commit graph in get_reference() Jonathan Tan
                   ` (4 preceding siblings ...)
  2018-12-13 18:54 ` [PATCH v3] " Jonathan Tan
@ 2019-01-25 15:33 ` SZEDER Gábor
  2019-01-25 19:56   ` Stefan Beller
  5 siblings, 1 reply; 28+ messages in thread
From: SZEDER Gábor @ 2019-01-25 15:33 UTC (permalink / raw)
  To: Jonathan Tan
  Cc: git, Jeff King, Junio C Hamano, Derrick Stolee, Stefan Beller

On Tue, Dec 04, 2018 at 02:42:38PM -0800, Jonathan Tan wrote:
> When fetching into a repository, a connectivity check is first made by
> check_exist_and_connected() in builtin/fetch.c that runs:
> 
>   git rev-list --objects --stdin --not --all --quiet <(list of objects)
> 
> If the client repository has many refs, this command can be slow,
> regardless of the nature of the server repository or what is being
> fetched. A profiler reveals that most of the time is spent in
> setup_revisions() (approx. 60/63), and of the time spent in
> setup_revisions(), most of it is spent in parse_object() (approx.
> 49/60). This is because setup_revisions() parses the target of every ref
> (from "--all"), and parse_object() reads the buffer of the object.
> 
> Reading the buffer is unnecessary if the repository has a commit graph
> and if the ref points to a commit (which is typically the case). This
> patch uses the commit graph wherever possible; on my computer, when I
> run the above command with a list of 1 object on a many-ref repository,
> I get a speedup from 1.8s to 1.0s.
> 
> Another way to accomplish this effect would be to modify parse_object()
> to use the commit graph if possible; however, I did not want to change
> parse_object()'s current behavior of always checking the object
> signature of the returned object.
> 
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
> This is on sb/more-repo-in-api because I'm using the repo_parse_commit()
> function.
> 
> A colleague noticed this issue when handling a mirror clone.
> 
> Looking at the bigger picture, the speed of the connectivity check
> during a fetch might be further improved by passing only the negotiation
> tips (obtained through --negotiation-tip) instead of "--all". This patch
> just handles the low-hanging fruit first.
> ---

I stumbled upon a regression that bisects down to this commit
ec0c5798ee (revision: use commit graph in get_reference(),
2018-12-04):

  $ ~/src/git/bin-wrappers/git version
  git version 2.19.1.566.gec0c5798ee
  $ ~/src/git/bin-wrappers/git commit-graph write --reachable
  Computing commit graph generation numbers: 100% (58994/58994), done.
  $ ~/src/git/bin-wrappers/git status
  HEAD detached at origin/pu
  nothing to commit, working tree clean
  $ ~/src/git/bin-wrappers/git -c core.commitGraph=false describe --dirty
  v2.20.1-833-gcb3b9e7ee3
  $ ~/src/git/bin-wrappers/git -c core.commitGraph=true describe --dirty
  v2.20.1-833-gcb3b9e7ee3

It's all good with only '--dirty', but watch this with '--all
--dirty':

  $ ~/src/git/bin-wrappers/git -c core.commitGraph=false describe --all --dirty
  remotes/origin/pu
  $ ~/src/git/bin-wrappers/git -c core.commitGraph=true describe --all --dirty
  remotes/origin/pu-dirty

IOW if the commit-graph is enabled, then my clean worktree is reported
as dirty.

And to add a cherry on top of my confusion:

  $ git checkout v2.20.0
  Previous HEAD position was cb3b9e7ee3 Merge branch 'jh/trace2' into pu
  HEAD is now at 5d826e9729 Git 2.20
  $ ~/src/git/bin-wrappers/git -c core.commitGraph=true describe --all --dirty
  tags/v2.20.0

It's clean even with '--all' and commit-graph enabled, but watch this:

  $ git branch this-will-screw-it-up
  $ ~/src/git/bin-wrappers/git -c core.commitGraph=true describe --all --dirty
  tags/v2.20.0-dirty

Have fun! :)


>  revision.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/revision.c b/revision.c
> index b5108b75ab..e7da2c57ab 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -212,7 +212,20 @@ static struct object *get_reference(struct rev_info *revs, const char *name,
>  {
>  	struct object *object;
>  
> -	object = parse_object(revs->repo, oid);
> +	/*
> +	 * If the repository has commit graphs, repo_parse_commit() avoids
> +	 * reading the object buffer, so use it whenever possible.
> +	 */
> +	if (oid_object_info(revs->repo, oid, NULL) == OBJ_COMMIT) {
> +		struct commit *c = lookup_commit(revs->repo, oid);
> +		if (!repo_parse_commit(revs->repo, c))
> +			object = (struct object *) c;
> +		else
> +			object = NULL;
> +	} else {
> +		object = parse_object(revs->repo, oid);
> +	}
> +
>  	if (!object) {
>  		if (revs->ignore_missing)
>  			return object;
> -- 
> 2.19.0.271.gfe8321ec05.dirty
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Regression in: [PATCH on sb/more-repo-in-api] revision: use commit graph in get_reference()
  2019-01-25 15:33 ` Regression in: [PATCH on sb/more-repo-in-api] " SZEDER Gábor
@ 2019-01-25 19:56   ` Stefan Beller
  2019-01-25 22:01     ` Jonathan Tan
  2019-01-25 22:14     ` SZEDER Gábor
  0 siblings, 2 replies; 28+ messages in thread
From: Stefan Beller @ 2019-01-25 19:56 UTC (permalink / raw)
  To: SZEDER Gábor
  Cc: Jonathan Tan, git, Jeff King, Junio C Hamano, Derrick Stolee

> Have fun! :)

$ git gc
...
Computing commit graph generation numbers: 100% (164264/164264), done.
$ ./git version
git version 2.20.1.775.g2313a6b87fe.dirty
# pu + one commit addressing
# https://public-inbox.org/git/CAGZ79kaUg3NTRPRi5mLk6ag87iDB_Ltq_kEiLwZ2HGZ+-Vsd8w@mail.gmail.com/

$ ./git -c core.commitGraph=false describe --dirty --all
remotes/gitgitgadget/pu-1-g03745a36e6
$ ./git -c core.commitGraph=true describe --dirty --all
remotes/gitgitgadget/pu-1-g03745a36e6
$ ./git -c core.commitGraph=true describe --dirty
v2.20.1-776-g03745a36e6
$ ./git -c core.commitGraph=false describe --dirty
v2.20.1-776-g03745a36e6

it looks like it is working correctly here?
Or did I miss some hint as in how to setup the reproduction properly?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Regression in: [PATCH on sb/more-repo-in-api] revision: use commit graph in get_reference()
  2019-01-25 19:56   ` Stefan Beller
@ 2019-01-25 22:01     ` Jonathan Tan
  2019-01-25 22:14     ` SZEDER Gábor
  1 sibling, 0 replies; 28+ messages in thread
From: Jonathan Tan @ 2019-01-25 22:01 UTC (permalink / raw)
  To: sbeller; +Cc: szeder.dev, jonathantanmy, git, peff, gitster, stolee

> > Have fun! :)
> 
> $ git gc
> ...
> Computing commit graph generation numbers: 100% (164264/164264), done.
> $ ./git version
> git version 2.20.1.775.g2313a6b87fe.dirty
> # pu + one commit addressing
> # https://public-inbox.org/git/CAGZ79kaUg3NTRPRi5mLk6ag87iDB_Ltq_kEiLwZ2HGZ+-Vsd8w@mail.gmail.com/
> 
> $ ./git -c core.commitGraph=false describe --dirty --all
> remotes/gitgitgadget/pu-1-g03745a36e6
> $ ./git -c core.commitGraph=true describe --dirty --all
> remotes/gitgitgadget/pu-1-g03745a36e6
> $ ./git -c core.commitGraph=true describe --dirty
> v2.20.1-776-g03745a36e6
> $ ./git -c core.commitGraph=false describe --dirty
> v2.20.1-776-g03745a36e6
> 
> it looks like it is working correctly here?
> Or did I miss some hint as in how to setup the reproduction properly?

I could reproduce it with version ec0c5798ee (as stated in Szeder's
original email) - as stated by Szeder, it doesn't work, but its parent
does. I'm looking into this, but any help is appreciated.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Regression in: [PATCH on sb/more-repo-in-api] revision: use commit graph in get_reference()
  2019-01-25 19:56   ` Stefan Beller
  2019-01-25 22:01     ` Jonathan Tan
@ 2019-01-25 22:14     ` SZEDER Gábor
  2019-01-25 22:21       ` SZEDER Gábor
  1 sibling, 1 reply; 28+ messages in thread
From: SZEDER Gábor @ 2019-01-25 22:14 UTC (permalink / raw)
  To: Stefan Beller
  Cc: Jonathan Tan, git, Jeff King, Junio C Hamano, Derrick Stolee

On Fri, Jan 25, 2019 at 11:56:38AM -0800, Stefan Beller wrote:
> > Have fun! :)
> 
> $ git gc
> ...
> Computing commit graph generation numbers: 100% (164264/164264), done.
> $ ./git version
> git version 2.20.1.775.g2313a6b87fe.dirty
> # pu + one commit addressing
> # https://public-inbox.org/git/CAGZ79kaUg3NTRPRi5mLk6ag87iDB_Ltq_kEiLwZ2HGZ+-Vsd8w@mail.gmail.com/
> 
> $ ./git -c core.commitGraph=false describe --dirty --all
> remotes/gitgitgadget/pu-1-g03745a36e6
> $ ./git -c core.commitGraph=true describe --dirty --all
> remotes/gitgitgadget/pu-1-g03745a36e6
> $ ./git -c core.commitGraph=true describe --dirty
> v2.20.1-776-g03745a36e6
> $ ./git -c core.commitGraph=false describe --dirty
> v2.20.1-776-g03745a36e6
> 
> it looks like it is working correctly here?
> Or did I miss some hint as in how to setup the reproduction properly?

How many refs are pointing to the commits you tried to describe?  In
the git repo, with an all-encompassing commit-graph it seems to be
important that more than one refs point there.  I could reproduce the
issue in a fresh git.git clone with Git built from commit 2313a6b87fe:

  $ git clone https://github.com/git/git
  Cloning into 'git'...
  <...>
  $ git commit-graph write --reachable
  Computing commit graph generation numbers: 100% (56722/56722), done.
  # 'HOME=.' makes sure that this command doesn't read my global
  # gitconfig.
  $ HOME=. ~/src/git/git describe --all --dirty
  heads/master-dirty
  $ git checkout origin/pu 
  HEAD is now at cb3b9e7ee3 Merge branch 'jh/trace2' into pu
  $ HOME=. ~/src/git/git -c core.commitGraph=true describe --all --dirty
  remotes/origin/pu
  $ git branch a-second-ref-pointing-at-pu buzz ~/src/tmp/git
  $ HOME=. ~/src/git/git -c core.commitGraph=true describe --all --dirty
  heads/a-second-ref-pointing-at-pu-dirty

I could also reproduce it in other repositories lying around here, but
could not manage to reproduce it in a minimal repository.

The smallest I could get is the test script below, where the last test
fails, i.e. the clean worktree is described as '-dirty', when the
to-be-described HEAD is not in the commit-graph.  I suspect this is
the same issue, because it bisects down to this same commit.

  --- >8 ---

Subject: [PATCH] test

---
 t/t9999-test.sh | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)
 create mode 100755 t/t9999-test.sh

diff --git a/t/t9999-test.sh b/t/t9999-test.sh
new file mode 100755
index 0000000000..cd1286e157
--- /dev/null
+++ b/t/t9999-test.sh
@@ -0,0 +1,26 @@
+#!/bin/sh
+
+test_description='test'
+
+. ./test-lib.sh
+
+test_expect_success 'setup' '
+	test_commit one &&
+	test_commit two &&
+	# Two refs point there.
+	git for-each-ref --points-at=two &&
+	git config core.commitGraph true
+'
+
+test_expect_success 'full commit-graph' '
+	git commit-graph write --reachable &&
+	verbose test "$(git describe --all --dirty)" = tags/two
+'
+
+test_expect_success 'partial commit-graph, described HEAD is not in C-G' '
+	git rev-parse one | git commit-graph write --stdin-commits &&
+	git status &&
+	verbose test "$(git describe --all --dirty)" = tags/two
+'
+
+test_done
-- 
2.20.1.642.gc55a771460


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: Regression in: [PATCH on sb/more-repo-in-api] revision: use commit graph in get_reference()
  2019-01-25 22:14     ` SZEDER Gábor
@ 2019-01-25 22:21       ` SZEDER Gábor
  2019-01-27 13:08         ` [PATCH] object_as_type: initialize commit-graph-related fields of 'struct commit' SZEDER Gábor
  0 siblings, 1 reply; 28+ messages in thread
From: SZEDER Gábor @ 2019-01-25 22:21 UTC (permalink / raw)
  To: Stefan Beller
  Cc: Jonathan Tan, git, Jeff King, Junio C Hamano, Derrick Stolee

On Fri, Jan 25, 2019 at 11:14:14PM +0100, SZEDER Gábor wrote:
> On Fri, Jan 25, 2019 at 11:56:38AM -0800, Stefan Beller wrote:
> > > Have fun! :)
> > 
> > $ git gc
> > ...
> > Computing commit graph generation numbers: 100% (164264/164264), done.
> > $ ./git version
> > git version 2.20.1.775.g2313a6b87fe.dirty
> > # pu + one commit addressing
> > # https://public-inbox.org/git/CAGZ79kaUg3NTRPRi5mLk6ag87iDB_Ltq_kEiLwZ2HGZ+-Vsd8w@mail.gmail.com/
> > 
> > $ ./git -c core.commitGraph=false describe --dirty --all
> > remotes/gitgitgadget/pu-1-g03745a36e6
> > $ ./git -c core.commitGraph=true describe --dirty --all
> > remotes/gitgitgadget/pu-1-g03745a36e6
> > $ ./git -c core.commitGraph=true describe --dirty
> > v2.20.1-776-g03745a36e6
> > $ ./git -c core.commitGraph=false describe --dirty
> > v2.20.1-776-g03745a36e6
> > 
> > it looks like it is working correctly here?
> > Or did I miss some hint as in how to setup the reproduction properly?
> 
> How many refs are pointing to the commits you tried to describe?  In
> the git repo, with an all-encompassing commit-graph it seems to be
> important that more than one refs point there.

Erm, let me try to clarify this sentence.

In general it seems to be important that more than one refs point to
the described HEAD.  In the git repo (and in other non-toy repos) I
could reproduce the issue with a commit-graph file containing all
commits, but in a minimal repo only when the described HEAD was not in
the commit-graph.

> I could reproduce the
> issue in a fresh git.git clone with Git built from commit 2313a6b87fe:
> 
>   $ git clone https://github.com/git/git
>   Cloning into 'git'...
>   <...>
>   $ git commit-graph write --reachable
>   Computing commit graph generation numbers: 100% (56722/56722), done.
>   # 'HOME=.' makes sure that this command doesn't read my global
>   # gitconfig.
>   $ HOME=. ~/src/git/git describe --all --dirty
>   heads/master-dirty
>   $ git checkout origin/pu 
>   HEAD is now at cb3b9e7ee3 Merge branch 'jh/trace2' into pu
>   $ HOME=. ~/src/git/git -c core.commitGraph=true describe --all --dirty
>   remotes/origin/pu
>   $ git branch a-second-ref-pointing-at-pu buzz ~/src/tmp/git
>   $ HOME=. ~/src/git/git -c core.commitGraph=true describe --all --dirty
>   heads/a-second-ref-pointing-at-pu-dirty
> 
> I could also reproduce it in other repositories lying around here, but
> could not manage to reproduce it in a minimal repository.
> 
> The smallest I could get is the test script below, where the last test
> fails, i.e. the clean worktree is described as '-dirty', when the
> to-be-described HEAD is not in the commit-graph.  I suspect this is
> the same issue, because it bisects down to this same commit.
> 
>   --- >8 ---
> 
> Subject: [PATCH] test
> 
> ---
>  t/t9999-test.sh | 26 ++++++++++++++++++++++++++
>  1 file changed, 26 insertions(+)
>  create mode 100755 t/t9999-test.sh
> 
> diff --git a/t/t9999-test.sh b/t/t9999-test.sh
> new file mode 100755
> index 0000000000..cd1286e157
> --- /dev/null
> +++ b/t/t9999-test.sh
> @@ -0,0 +1,26 @@
> +#!/bin/sh
> +
> +test_description='test'
> +
> +. ./test-lib.sh
> +
> +test_expect_success 'setup' '
> +	test_commit one &&
> +	test_commit two &&
> +	# Two refs point there.
> +	git for-each-ref --points-at=two &&
> +	git config core.commitGraph true
> +'
> +
> +test_expect_success 'full commit-graph' '
> +	git commit-graph write --reachable &&
> +	verbose test "$(git describe --all --dirty)" = tags/two
> +'
> +
> +test_expect_success 'partial commit-graph, described HEAD is not in C-G' '
> +	git rev-parse one | git commit-graph write --stdin-commits &&
> +	git status &&
> +	verbose test "$(git describe --all --dirty)" = tags/two
> +'
> +
> +test_done
> -- 
> 2.20.1.642.gc55a771460
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH] object_as_type: initialize commit-graph-related fields of 'struct commit'
  2019-01-25 22:21       ` SZEDER Gábor
@ 2019-01-27 13:08         ` SZEDER Gábor
  2019-01-27 13:28           ` SZEDER Gábor
                             ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: SZEDER Gábor @ 2019-01-27 13:08 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee
  Cc: SZEDER Gábor, Jonathan Tan, git, Jeff King, Stefan Beller

When the commit graph and generation numbers were introduced in
commits 177722b344 (commit: integrate commit graph with commit
parsing, 2018-04-10) and 83073cc994 (commit: add generation number to
struct commit, 2018-04-25), they tried to make sure that the
corresponding 'graph_pos' and 'generation' fields of 'struct commit'
are initialized conservatively, as if the commit were not included in
the commit-graph file.

Alas, initializing those fields only in alloc_commit_node() missed the
case when an object that happens to be a commit is first looked up via
lookup_unknown_object(), and is then later converted to a 'struct
commit' via the object_as_type() helper function (either calling it
directly, or as part of a subsequent lookup_commit() call).
Consequently, both of those fields incorrectly remain set to zero,
which means e.g. that the commit is present in and is the first entry
of the commit-graph file.  This will result in wrong timestamp, parent
and root tree hashes, if such a 'struct commit' instance is later
filled from the commit-graph.

Extract the initialization of 'struct commit's fields from
alloc_commit_node() into a helper function, and call it from
object_as_type() as well, to make sure that it properly initializes
the two commit-graph-related fields, too.  With this helper function
it is hopefully less likely that any new fields added to 'struct
commit' in the future would remain uninitialized.

With this change alloc_commit_index() won't have any remaining callers
outside of 'alloc.c', so mark it as static.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---

So, it turns out that ec0c5798ee (revision: use commit graph in
get_reference(), 2018-12-04) is not the culprit after all, it merely
highlighted a bug that is as old as the commit-graph feature itself.
This patch fixes this and all other related issues I reported
upthread.

I couldn't find any other place where an object of unknown type is
turned into a 'struct commit', so this might have been the only place
that needed fixing.

Other object types seem to be fine, because they contain only fields
that should be zero initialized.  At least for now, because a similar
issue might arise in the future, if one of them gains a new field that
should not be initialized to zero...  but will they ever get such a
field?  So I'm not too keen on introducing similar init_tree_node(),
etc. helper funcions at the moment.

 alloc.c  | 11 ++++++++---
 alloc.h  |  2 +-
 object.c |  5 +++--
 3 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/alloc.c b/alloc.c
index e7aa81b7aa..1c64c4dd16 100644
--- a/alloc.c
+++ b/alloc.c
@@ -99,18 +99,23 @@ void *alloc_object_node(struct repository *r)
 	return obj;
 }
 
-unsigned int alloc_commit_index(struct repository *r)
+static unsigned int alloc_commit_index(struct repository *r)
 {
 	return r->parsed_objects->commit_count++;
 }
 
-void *alloc_commit_node(struct repository *r)
+void init_commit_node(struct repository *r, struct commit *c)
 {
-	struct commit *c = alloc_node(r->parsed_objects->commit_state, sizeof(struct commit));
 	c->object.type = OBJ_COMMIT;
 	c->index = alloc_commit_index(r);
 	c->graph_pos = COMMIT_NOT_FROM_GRAPH;
 	c->generation = GENERATION_NUMBER_INFINITY;
+}
+
+void *alloc_commit_node(struct repository *r)
+{
+	struct commit *c = alloc_node(r->parsed_objects->commit_state, sizeof(struct commit));
+	init_commit_node(r, c);
 	return c;
 }
 
diff --git a/alloc.h b/alloc.h
index ba356ed847..ed1071c11e 100644
--- a/alloc.h
+++ b/alloc.h
@@ -9,11 +9,11 @@ struct repository;
 
 void *alloc_blob_node(struct repository *r);
 void *alloc_tree_node(struct repository *r);
+void init_commit_node(struct repository *r, struct commit *c);
 void *alloc_commit_node(struct repository *r);
 void *alloc_tag_node(struct repository *r);
 void *alloc_object_node(struct repository *r);
 void alloc_report(struct repository *r);
-unsigned int alloc_commit_index(struct repository *r);
 
 struct alloc_state *allocate_alloc_state(void);
 void clear_alloc_state(struct alloc_state *s);
diff --git a/object.c b/object.c
index c4170d2d0c..7bccfd5d8e 100644
--- a/object.c
+++ b/object.c
@@ -164,8 +164,9 @@ void *object_as_type(struct repository *r, struct object *obj, enum object_type
 		return obj;
 	else if (obj->type == OBJ_NONE) {
 		if (type == OBJ_COMMIT)
-			((struct commit *)obj)->index = alloc_commit_index(r);
-		obj->type = type;
+			init_commit_node(r, (struct commit *) obj);
+		else
+			obj->type = type;
 		return obj;
 	}
 	else {
-- 
2.20.1.642.gc55a771460


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH] object_as_type: initialize commit-graph-related fields of 'struct commit'
  2019-01-27 13:08         ` [PATCH] object_as_type: initialize commit-graph-related fields of 'struct commit' SZEDER Gábor
@ 2019-01-27 13:28           ` SZEDER Gábor
  2019-01-27 18:40             ` Derrick Stolee
  2019-01-28 16:15           ` Jeff King
  2019-01-28 16:57           ` Jonathan Tan
  2 siblings, 1 reply; 28+ messages in thread
From: SZEDER Gábor @ 2019-01-27 13:28 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee
  Cc: Jonathan Tan, git, Jeff King, Stefan Beller

On Sun, Jan 27, 2019 at 02:08:32PM +0100, SZEDER Gábor wrote:
> When the commit graph and generation numbers were introduced in
> commits 177722b344 (commit: integrate commit graph with commit
> parsing, 2018-04-10) and 83073cc994 (commit: add generation number to
> struct commit, 2018-04-25), they tried to make sure that the
> corresponding 'graph_pos' and 'generation' fields of 'struct commit'
> are initialized conservatively, as if the commit were not included in
> the commit-graph file.
> 
> Alas, initializing those fields only in alloc_commit_node() missed the
> case when an object that happens to be a commit is first looked up via
> lookup_unknown_object(), and is then later converted to a 'struct
> commit' via the object_as_type() helper function (either calling it
> directly, or as part of a subsequent lookup_commit() call).
> Consequently, both of those fields incorrectly remain set to zero,
> which means e.g. that the commit is present in and is the first entry
> of the commit-graph file.  This will result in wrong timestamp, parent
> and root tree hashes, if such a 'struct commit' instance is later
> filled from the commit-graph.
> 
> Extract the initialization of 'struct commit's fields from
> alloc_commit_node() into a helper function, and call it from
> object_as_type() as well, to make sure that it properly initializes
> the two commit-graph-related fields, too.  With this helper function
> it is hopefully less likely that any new fields added to 'struct
> commit' in the future would remain uninitialized.
> 
> With this change alloc_commit_index() won't have any remaining callers
> outside of 'alloc.c', so mark it as static.
> 
> Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
> ---
> 
> So, it turns out that ec0c5798ee (revision: use commit graph in
> get_reference(), 2018-12-04) is not the culprit after all, it merely
> highlighted a bug that is as old as the commit-graph feature itself.
> This patch fixes this and all other related issues I reported
> upthread.

And how/why does this affect 'git describe --dirty'?

  - 'git describe' first iterates over all refs, and somewhere deep
    inside for_each_ref() each commit (well, object) a ref points to
    is looked up via lookup_unknown_object().  This leaves all fields
    of the created object zero initialized.

  - Then it dereferences HEAD for '--dirty' and ec0c5798ee's changes
    to get_reference() kick in: lookup_commit() doesn't instantiate a
    brand new and freshly initialized 'struct commit', but returns the
    object created in the previous step converted into 'struct
    commit'.  This conversion doesn't set the commit-graph fields in
    'struct commit', but leaves both as zero.  get_reference() then
    tries to load HEAD's commit information from the commit-graph,
    find_commit_in_graph() sees the the still zero 'graph_pos' field
    and doesn't perform a search through the commit-graph file, and
    the subsequent fill_commit_in_graph() reads the commit info from
    the first entry.

    In case of the failing test I posted earlier, where only the first
    commit is in the commit-graph but HEAD isn't, this means that the
    HEAD's 'struct commit' is filled with the info of HEAD^.

  - Ultimately, the diff machinery then doesn't compare the worktree
    to HEAD's tree, but to HEAD^'s, finds that they differ, hence the
    incorrect '-dirty' flag in the output.

Before ec0c5798ee get_reference() simply called parse_object(), which
ignored the commit-graph, so the issue could remain hidden.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] object_as_type: initialize commit-graph-related fields of 'struct commit'
  2019-01-27 13:28           ` SZEDER Gábor
@ 2019-01-27 18:40             ` Derrick Stolee
  0 siblings, 0 replies; 28+ messages in thread
From: Derrick Stolee @ 2019-01-27 18:40 UTC (permalink / raw)
  To: SZEDER Gábor, Junio C Hamano
  Cc: Jonathan Tan, git, Jeff King, Stefan Beller

On 1/27/2019 8:28 AM, SZEDER Gábor wrote:
> On Sun, Jan 27, 2019 at 02:08:32PM +0100, SZEDER Gábor wrote:
>> When the commit graph and generation numbers were introduced in
>> commits 177722b344 (commit: integrate commit graph with commit
>> parsing, 2018-04-10) and 83073cc994 (commit: add generation number to
>> struct commit, 2018-04-25), they tried to make sure that the
>> corresponding 'graph_pos' and 'generation' fields of 'struct commit'
>> are initialized conservatively, as if the commit were not included in
>> the commit-graph file.
>>
>> Alas, initializing those fields only in alloc_commit_node() missed the
>> case when an object that happens to be a commit is first looked up via
>> lookup_unknown_object(), and is then later converted to a 'struct
>> commit' via the object_as_type() helper function (either calling it
>> directly, or as part of a subsequent lookup_commit() call).
>> Consequently, both of those fields incorrectly remain set to zero,
>> which means e.g. that the commit is present in and is the first entry
>> of the commit-graph file.  This will result in wrong timestamp, parent
>> and root tree hashes, if such a 'struct commit' instance is later
>> filled from the commit-graph.
>>
>> Extract the initialization of 'struct commit's fields from
>> alloc_commit_node() into a helper function, and call it from
>> object_as_type() as well, to make sure that it properly initializes
>> the two commit-graph-related fields, too.  With this helper function
>> it is hopefully less likely that any new fields added to 'struct
>> commit' in the future would remain uninitialized.
>>
>> With this change alloc_commit_index() won't have any remaining callers
>> outside of 'alloc.c', so mark it as static.
>>
>> Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
>> ---
>>
>> So, it turns out that ec0c5798ee (revision: use commit graph in
>> get_reference(), 2018-12-04) is not the culprit after all, it merely
>> highlighted a bug that is as old as the commit-graph feature itself.
>> This patch fixes this and all other related issues I reported
>> upthread.
> 
> And how/why does this affect 'git describe --dirty'?
> 
>   - 'git describe' first iterates over all refs, and somewhere deep
>     inside for_each_ref() each commit (well, object) a ref points to
>     is looked up via lookup_unknown_object().  This leaves all fields
>     of the created object zero initialized.
> 
>   - Then it dereferences HEAD for '--dirty' and ec0c5798ee's changes
>     to get_reference() kick in: lookup_commit() doesn't instantiate a
>     brand new and freshly initialized 'struct commit', but returns the
>     object created in the previous step converted into 'struct
>     commit'.  This conversion doesn't set the commit-graph fields in
>     'struct commit', but leaves both as zero.  get_reference() then
>     tries to load HEAD's commit information from the commit-graph,
>     find_commit_in_graph() sees the the still zero 'graph_pos' field
>     and doesn't perform a search through the commit-graph file, and
>     the subsequent fill_commit_in_graph() reads the commit info from
>     the first entry.
> 
>     In case of the failing test I posted earlier, where only the first
>     commit is in the commit-graph but HEAD isn't, this means that the
>     HEAD's 'struct commit' is filled with the info of HEAD^.
> 
>   - Ultimately, the diff machinery then doesn't compare the worktree
>     to HEAD's tree, but to HEAD^'s, finds that they differ, hence the
>     incorrect '-dirty' flag in the output.
> 
> Before ec0c5798ee get_reference() simply called parse_object(), which
> ignored the commit-graph, so the issue could remain hidden.

Thanks for digging in, Szeder. This is a very subtle interaction, and
I'm glad you caught the issue. There are likely other ways this could
become problematic, including hitting BUG() statements regarding
generation numbers.

I recommend this be merged to 'maint' if possible.

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] object_as_type: initialize commit-graph-related fields of 'struct commit'
  2019-01-27 13:08         ` [PATCH] object_as_type: initialize commit-graph-related fields of 'struct commit' SZEDER Gábor
  2019-01-27 13:28           ` SZEDER Gábor
@ 2019-01-28 16:15           ` Jeff King
  2019-01-28 16:57           ` Jonathan Tan
  2 siblings, 0 replies; 28+ messages in thread
From: Jeff King @ 2019-01-28 16:15 UTC (permalink / raw)
  To: SZEDER Gábor
  Cc: Junio C Hamano, Derrick Stolee, Jonathan Tan, git, Stefan Beller

On Sun, Jan 27, 2019 at 02:08:32PM +0100, SZEDER Gábor wrote:

> When the commit graph and generation numbers were introduced in
> commits 177722b344 (commit: integrate commit graph with commit
> parsing, 2018-04-10) and 83073cc994 (commit: add generation number to
> struct commit, 2018-04-25), they tried to make sure that the
> corresponding 'graph_pos' and 'generation' fields of 'struct commit'
> are initialized conservatively, as if the commit were not included in
> the commit-graph file.
> 
> Alas, initializing those fields only in alloc_commit_node() missed the
> case when an object that happens to be a commit is first looked up via
> lookup_unknown_object(), and is then later converted to a 'struct
> commit' via the object_as_type() helper function (either calling it
> directly, or as part of a subsequent lookup_commit() call).
> Consequently, both of those fields incorrectly remain set to zero,
> which means e.g. that the commit is present in and is the first entry
> of the commit-graph file.  This will result in wrong timestamp, parent
> and root tree hashes, if such a 'struct commit' instance is later
> filled from the commit-graph.
> 
> Extract the initialization of 'struct commit's fields from
> alloc_commit_node() into a helper function, and call it from
> object_as_type() as well, to make sure that it properly initializes
> the two commit-graph-related fields, too.  With this helper function
> it is hopefully less likely that any new fields added to 'struct
> commit' in the future would remain uninitialized.
> 
> With this change alloc_commit_index() won't have any remaining callers
> outside of 'alloc.c', so mark it as static.

Good find, and nicely explained.

> ---
> 
> So, it turns out that ec0c5798ee (revision: use commit graph in
> get_reference(), 2018-12-04) is not the culprit after all, it merely
> highlighted a bug that is as old as the commit-graph feature itself.
> This patch fixes this and all other related issues I reported
> upthread.
> 
> I couldn't find any other place where an object of unknown type is
> turned into a 'struct commit', so this might have been the only place
> that needed fixing.

This should be the only place. We already ran into this with the
commit-index field, which was what caused us to create object_as_type()
in the first place, to give a central place for coercing OBJ_NONE into
other types.

> Other object types seem to be fine, because they contain only fields
> that should be zero initialized.  At least for now, because a similar
> issue might arise in the future, if one of them gains a new field that
> should not be initialized to zero...  but will they ever get such a
> field?  So I'm not too keen on introducing similar init_tree_node(),
> etc. helper funcions at the moment.

Agreed. We can deal with those if it ever becomes necessary. In theory
adding empty placeholder functions might help somebody realize they'd
need to handle this case, but I have the feeling that they'd be as
likely to miss init_tree_node() as they would object_as_type().

I dunno. I guess if init_tree_node() were actually called from
alloc_tree_node(), it might be harder to miss.

>  alloc.c  | 11 ++++++++---
>  alloc.h  |  2 +-
>  object.c |  5 +++--
>  3 files changed, 12 insertions(+), 6 deletions(-)

The patch itself looks good to me.

-Peff

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] object_as_type: initialize commit-graph-related fields of 'struct commit'
  2019-01-27 13:08         ` [PATCH] object_as_type: initialize commit-graph-related fields of 'struct commit' SZEDER Gábor
  2019-01-27 13:28           ` SZEDER Gábor
  2019-01-28 16:15           ` Jeff King
@ 2019-01-28 16:57           ` Jonathan Tan
  2 siblings, 0 replies; 28+ messages in thread
From: Jonathan Tan @ 2019-01-28 16:57 UTC (permalink / raw)
  To: SZEDER Gábor
  Cc: Junio C Hamano, Derrick Stolee, git, Jeff King, Stefan Beller

On Sun, Jan 27, 2019 at 5:08 AM SZEDER Gábor <szeder.dev@gmail.com> wrote:
>
> When the commit graph and generation numbers were introduced in
> commits 177722b344 (commit: integrate commit graph with commit
> parsing, 2018-04-10) and 83073cc994 (commit: add generation number to
> struct commit, 2018-04-25), they tried to make sure that the
> corresponding 'graph_pos' and 'generation' fields of 'struct commit'
> are initialized conservatively, as if the commit were not included in
> the commit-graph file.

Thanks for looking into this! The patch looks good to me.

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2019-01-28 17:01 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-12-04 22:42 [PATCH on sb/more-repo-in-api] revision: use commit graph in get_reference() Jonathan Tan
2018-12-04 23:12 ` Stefan Beller
2018-12-06 23:36   ` Jonathan Tan
2018-12-07 13:49     ` Derrick Stolee
2018-12-05  4:54 ` Jeff King
2018-12-06 23:54   ` Jonathan Tan
2018-12-07  8:53     ` Jeff King
2018-12-05 23:15 ` Junio C Hamano
2018-12-07 21:50 ` [PATCH on master v2] " Jonathan Tan
2018-12-09  0:51   ` Junio C Hamano
2018-12-09  1:49     ` Junio C Hamano
2018-12-11 10:54     ` Jeff King
2018-12-12 19:58       ` Jonathan Tan
2018-12-13  1:27         ` Jeff King
2018-12-13 16:20           ` Derrick Stolee
2018-12-13 18:54 ` [PATCH v3] " Jonathan Tan
2018-12-14  3:20   ` Junio C Hamano
2018-12-14  8:45   ` Jeff King
2019-01-25 15:33 ` Regression in: [PATCH on sb/more-repo-in-api] " SZEDER Gábor
2019-01-25 19:56   ` Stefan Beller
2019-01-25 22:01     ` Jonathan Tan
2019-01-25 22:14     ` SZEDER Gábor
2019-01-25 22:21       ` SZEDER Gábor
2019-01-27 13:08         ` [PATCH] object_as_type: initialize commit-graph-related fields of 'struct commit' SZEDER Gábor
2019-01-27 13:28           ` SZEDER Gábor
2019-01-27 18:40             ` Derrick Stolee
2019-01-28 16:15           ` Jeff King
2019-01-28 16:57           ` Jonathan Tan

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).