git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH] revision.c: reduce object database queries
@ 2018-02-25  1:34 Derrick Stolee
  2018-02-25  1:41 ` Derrick Stolee
  2018-02-26  1:30 ` Jeff King
  0 siblings, 2 replies; 7+ messages in thread
From: Derrick Stolee @ 2018-02-25  1:34 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, Derrick Stolee

In mark_parents_uninteresting(), we check for the existence of an
object file to see if we should treat a commit as parsed. The result
is to set the "parsed" bit on the commit.

Modify the condition to only check has_object_file() if the result
would change the parsed bit.

When a local branch is different from its upstream ref, "git status"
will compute ahead/behind counts. This uses paint_down_to_common()
and hits mark_parents_uninteresting(). On a copy of the Linux repo
with a local instance of "master" behind the remote branch
"origin/master" by ~60,000 commits, we find the performance of
"git status" went from 1.42 seconds to 1.32 seconds, for a relative
difference of -7.0%.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 revision.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/revision.c b/revision.c
index 5ce9b93..bc7def5 100644
--- a/revision.c
+++ b/revision.c
@@ -113,7 +113,8 @@ void mark_parents_uninteresting(struct commit *commit)
 			 * it is popped next time around, we won't be trying
 			 * to parse it and get an error.
 			 */
-			if (!has_object_file(&commit->object.oid))
+			if (!commit->object.parsed &&
+			    !has_object_file(&commit->object.oid))
 				commit->object.parsed = 1;
 
 			if (commit->object.flags & UNINTERESTING)
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] revision.c: reduce object database queries
  2018-02-25  1:34 [PATCH] revision.c: reduce object database queries Derrick Stolee
@ 2018-02-25  1:41 ` Derrick Stolee
  2018-02-26  1:30 ` Jeff King
  1 sibling, 0 replies; 7+ messages in thread
From: Derrick Stolee @ 2018-02-25  1:41 UTC (permalink / raw)
  To: Derrick Stolee, git; +Cc: gitster, peff

On 2/24/2018 8:34 PM, Derrick Stolee wrote:
> In mark_parents_uninteresting(), we check for the existence of an
> object file to see if we should treat a commit as parsed. The result
> is to set the "parsed" bit on the commit.
>
> Modify the condition to only check has_object_file() if the result
> would change the parsed bit.
>
> When a local branch is different from its upstream ref, "git status"
> will compute ahead/behind counts. This uses paint_down_to_common()
> and hits mark_parents_uninteresting(). On a copy of the Linux repo
> with a local instance of "master" behind the remote branch
> "origin/master" by ~60,000 commits, we find the performance of
> "git status" went from 1.42 seconds to 1.32 seconds, for a relative
> difference of -7.0%.

I do want to add that the exact same "git status" performance test with 
the commit-graph feature goes from 0.29 seconds to 0.21 seconds for a 
relative change of -27.5%.

Thanks,

-Stolee



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] revision.c: reduce object database queries
  2018-02-25  1:34 [PATCH] revision.c: reduce object database queries Derrick Stolee
  2018-02-25  1:41 ` Derrick Stolee
@ 2018-02-26  1:30 ` Jeff King
  2018-02-26  1:38   ` Jeff King
  1 sibling, 1 reply; 7+ messages in thread
From: Jeff King @ 2018-02-26  1:30 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, gitster

On Sat, Feb 24, 2018 at 08:34:56PM -0500, Derrick Stolee wrote:

> In mark_parents_uninteresting(), we check for the existence of an
> object file to see if we should treat a commit as parsed. The result
> is to set the "parsed" bit on the commit.
> 
> Modify the condition to only check has_object_file() if the result
> would change the parsed bit.
> 
> When a local branch is different from its upstream ref, "git status"
> will compute ahead/behind counts. This uses paint_down_to_common()
> and hits mark_parents_uninteresting(). On a copy of the Linux repo
> with a local instance of "master" behind the remote branch
> "origin/master" by ~60,000 commits, we find the performance of
> "git status" went from 1.42 seconds to 1.32 seconds, for a relative
> difference of -7.0%.

This looks like an obvious and easy optimization, and I'm OK with this
patch.

But looking at the existing code, it makes me wonder a few things:

> diff --git a/revision.c b/revision.c
> index 5ce9b93..bc7def5 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -113,7 +113,8 @@ void mark_parents_uninteresting(struct commit *commit)
>  			 * it is popped next time around, we won't be trying
>  			 * to parse it and get an error.
>  			 */
> -			if (!has_object_file(&commit->object.oid))
> +			if (!commit->object.parsed &&
> +			    !has_object_file(&commit->object.oid))
>  				commit->object.parsed = 1;

We don't actually need the object contents at all right here. This is
just faking the "parsed" flag for later so that calls to parse_object()
don't barf.

This code comes originally form 454fbbcde3 (git-rev-list: allow missing
objects when the parent is marked UNINTERESTING, 2005-07-10). But later,
in aeeae1b771 (revision traversal: allow UNINTERESTING objects to be
missing, 2009-01-27), we marked dealt with calling parse_object() on the
parents more directly.

So what I wonder is whether this code is simply redundant and can go
away entirely. That would save the has_object_file() call in all cases.
We'd stop setting the fake commit->object.parsed flag, which in theory
means we'd make repeated failed calls to parse_commit_gently() if we
visited the same commit over and over. I wonder if add_parents_to_list()
should skip already-UNINTERESTING parents, which would mean we only try
to access each missing candidate once (OTOH missing ones are probably
rare enough that it doesn't make much difference either way).

Probably the fake commit->object.parsed flag isn't hurting anything in
practice, but it seems pretty nasty to lie about having loaded the
commit in the global store. E.g., imagine a follow-up traversal in the
same process in which that commit _isn't_ UNINTERESTING, and we'd
erroneously treat it as an available commit with no parents.

-Peff

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] revision.c: reduce object database queries
  2018-02-26  1:30 ` Jeff King
@ 2018-02-26  1:38   ` Jeff King
  2018-02-27 23:16     ` Junio C Hamano
  0 siblings, 1 reply; 7+ messages in thread
From: Jeff King @ 2018-02-26  1:38 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, gitster

On Sun, Feb 25, 2018 at 08:30:48PM -0500, Jeff King wrote:

> > diff --git a/revision.c b/revision.c
> > index 5ce9b93..bc7def5 100644
> > --- a/revision.c
> > +++ b/revision.c
> > @@ -113,7 +113,8 @@ void mark_parents_uninteresting(struct commit *commit)
> >  			 * it is popped next time around, we won't be trying
> >  			 * to parse it and get an error.
> >  			 */
> > -			if (!has_object_file(&commit->object.oid))
> > +			if (!commit->object.parsed &&
> > +			    !has_object_file(&commit->object.oid))
> >  				commit->object.parsed = 1;
> 
> We don't actually need the object contents at all right here. This is
> just faking the "parsed" flag for later so that calls to parse_object()
> don't barf.
> 
> This code comes originally form 454fbbcde3 (git-rev-list: allow missing
> objects when the parent is marked UNINTERESTING, 2005-07-10). But later,
> in aeeae1b771 (revision traversal: allow UNINTERESTING objects to be
> missing, 2009-01-27), we marked dealt with calling parse_object() on the
> parents more directly.
> 
> So what I wonder is whether this code is simply redundant and can go
> away entirely. That would save the has_object_file() call in all cases.

There's a similar case for trees. In mark_tree_contents_uninteresting()
we do:

   if (!has_object_file(&obj->oid))
	return;
   if (parse_tree(tree) < 0)
	die("bad tree %s", oid_to_hex(&obj->oid));

which seems wasteful. Probably this could be:

  if (parse_tree_gently(tree, 1) < 0)
	return; /* missing uninteresting trees ok */

though technically the existing code allows _missing_ trees, but
not on corrupt ones.

I guess this is perhaps less interesting, because we only mark trees
directly fed from the pending array, not every tree of commits that we
traverse. Though if you had a really gigantic tree, it might be
measurable.

-Peff

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] revision.c: reduce object database queries
  2018-02-26  1:38   ` Jeff King
@ 2018-02-27 23:16     ` Junio C Hamano
  2018-02-28  6:37       ` Jeff King
  0 siblings, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2018-02-27 23:16 UTC (permalink / raw)
  To: Jeff King; +Cc: Derrick Stolee, git

Jeff King <peff@peff.net> writes:

>> This code comes originally form 454fbbcde3 (git-rev-list: allow missing
>> objects when the parent is marked UNINTERESTING, 2005-07-10). But later,
>> in aeeae1b771 (revision traversal: allow UNINTERESTING objects to be
>> missing, 2009-01-27), we marked dealt with calling parse_object() on the
>> parents more directly.
>>
>> So what I wonder is whether this code is simply redundant and can go
>> away entirely. That would save the has_object_file() call in all cases.

Hmm, interesting. I forgot all what I did around this area, but you
are right.  

> There's a similar case for trees. ...
> though technically the existing code allows _missing_ trees, but
> not on corrupt ones.

True, but the intention of these "do not care too much about missing
stuff while marking uninteresting" effort is aligned better with
ignoring corrupt ones, too, I would think, as "missing" in that
sentence is in fact about "not availble", and stuff that exists in
corrupt form is still not available anyway.  So I do not think it
makes a bad change to start allowing corrupt ones.

> I guess this is perhaps less interesting, because we only mark trees
> directly fed from the pending array, not every tree of commits that we
> traverse. Though if you had a really gigantic tree, it might be
> measurable.

I tend to agree that this is less interesting case than commits.

A huge tree with millions of entries in a single level would spend
quite a lot of cycle in slurping the tree data to in-core buffer,
but we do not actually parse these million entries upon opening the
tree, so it may not be too bad.


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] revision.c: reduce object database queries
  2018-02-27 23:16     ` Junio C Hamano
@ 2018-02-28  6:37       ` Jeff King
  2018-02-28 13:34         ` Derrick Stolee
  0 siblings, 1 reply; 7+ messages in thread
From: Jeff King @ 2018-02-28  6:37 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Derrick Stolee, git

On Tue, Feb 27, 2018 at 03:16:58PM -0800, Junio C Hamano wrote:

> >> This code comes originally form 454fbbcde3 (git-rev-list: allow missing
> >> objects when the parent is marked UNINTERESTING, 2005-07-10). But later,
> >> in aeeae1b771 (revision traversal: allow UNINTERESTING objects to be
> >> missing, 2009-01-27), we marked dealt with calling parse_object() on the
> >> parents more directly.
> >>
> >> So what I wonder is whether this code is simply redundant and can go
> >> away entirely. That would save the has_object_file() call in all cases.
> 
> Hmm, interesting. I forgot all what I did around this area, but you
> are right.

I'll leave it to Stolee whether he wants to dig into removing the
has_object_file() call. I think it would do the right thing, but the
most interesting bit would be how it impacts the timings.

> > There's a similar case for trees. ...
> > though technically the existing code allows _missing_ trees, but
> > not on corrupt ones.
> 
> True, but the intention of these "do not care too much about missing
> stuff while marking uninteresting" effort is aligned better with
> ignoring corrupt ones, too, I would think, as "missing" in that
> sentence is in fact about "not availble", and stuff that exists in
> corrupt form is still not available anyway.  So I do not think it
> makes a bad change to start allowing corrupt ones.

Agreed. Here it is in patch form, though as we both said, it probably
doesn't matter that much in practice. So I'd be OK dropping it out of
a sense of conservatism.

-- >8 --
Subject: [PATCH] mark_tree_contents_uninteresting: drop has_object check

It's generally acceptable for UNINTERESTING objects in a
traversal to be unavailable (e.g., see aeeae1b771). When
marking trees UNINTERESTING, we access the object database
twice: once to check if the object is missing (and return
quietly if it is), and then again to actually parse it.

We can instead just try to parse; if that fails, we can then
return quietly. That halves the effort we spend on locating
the object.

Note that this isn't _exactly_ the same as the original
behavior, as the parse failure could be due to other
problems than a missing object: it could be corrupted, in
which case the original code would have died. But the new
behavior is arguably better, as it covers the object being
unavailable for any reason. We'll also still issue a warning
to stderr in such a case.

Signed-off-by: Jeff King <peff@peff.net>
---
 revision.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/revision.c b/revision.c
index 5ce9b93baa..221d62c52b 100644
--- a/revision.c
+++ b/revision.c
@@ -51,12 +51,9 @@ static void mark_tree_contents_uninteresting(struct tree *tree)
 {
 	struct tree_desc desc;
 	struct name_entry entry;
-	struct object *obj = &tree->object;
 
-	if (!has_object_file(&obj->oid))
+	if (parse_tree_gently(tree, 1) < 0)
 		return;
-	if (parse_tree(tree) < 0)
-		die("bad tree %s", oid_to_hex(&obj->oid));
 
 	init_tree_desc(&desc, tree->buffer, tree->size);
 	while (tree_entry(&desc, &entry)) {
-- 
2.16.2.582.ge2c16ac3c4


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] revision.c: reduce object database queries
  2018-02-28  6:37       ` Jeff King
@ 2018-02-28 13:34         ` Derrick Stolee
  0 siblings, 0 replies; 7+ messages in thread
From: Derrick Stolee @ 2018-02-28 13:34 UTC (permalink / raw)
  To: Jeff King, Junio C Hamano; +Cc: Derrick Stolee, git

On 2/28/2018 1:37 AM, Jeff King wrote:
> On Tue, Feb 27, 2018 at 03:16:58PM -0800, Junio C Hamano wrote:
>
>>>> This code comes originally form 454fbbcde3 (git-rev-list: allow missing
>>>> objects when the parent is marked UNINTERESTING, 2005-07-10). But later,
>>>> in aeeae1b771 (revision traversal: allow UNINTERESTING objects to be
>>>> missing, 2009-01-27), we marked dealt with calling parse_object() on the
>>>> parents more directly.
>>>>
>>>> So what I wonder is whether this code is simply redundant and can go
>>>> away entirely. That would save the has_object_file() call in all cases.
>> Hmm, interesting. I forgot all what I did around this area, but you
>> are right.
> I'll leave it to Stolee whether he wants to dig into removing the
> has_object_file() call. I think it would do the right thing, but the
> most interesting bit would be how it impacts the timings.

This patch was so small that I could understand the full implication of 
my change.

I'm still very unfamiliar with the object walking machinery in 
revision.c. There are a lot of inter-dependencies that are taking time 
for me to understand and to feel confident that I won't cause a side 
effect in a special case. I do appreciate that Junio added a test in 
aeeae1b771.

I'll make a note to revisit this area after I have a better grasp of the 
object walk code, but I will not try removing the has_object_file() call 
now.

>
>>> There's a similar case for trees. ...
>>> though technically the existing code allows _missing_ trees, but
>>> not on corrupt ones.
>> True, but the intention of these "do not care too much about missing
>> stuff while marking uninteresting" effort is aligned better with
>> ignoring corrupt ones, too, I would think, as "missing" in that
>> sentence is in fact about "not availble", and stuff that exists in
>> corrupt form is still not available anyway.  So I do not think it
>> makes a bad change to start allowing corrupt ones.
> Agreed. Here it is in patch form, though as we both said, it probably
> doesn't matter that much in practice. So I'd be OK dropping it out of
> a sense of conservatism.
>
> -- >8 --
> Subject: [PATCH] mark_tree_contents_uninteresting: drop has_object check
>
> It's generally acceptable for UNINTERESTING objects in a
> traversal to be unavailable (e.g., see aeeae1b771). When
> marking trees UNINTERESTING, we access the object database
> twice: once to check if the object is missing (and return
> quietly if it is), and then again to actually parse it.
>
> We can instead just try to parse; if that fails, we can then
> return quietly. That halves the effort we spend on locating
> the object.
>
> Note that this isn't _exactly_ the same as the original
> behavior, as the parse failure could be due to other
> problems than a missing object: it could be corrupted, in
> which case the original code would have died. But the new
> behavior is arguably better, as it covers the object being
> unavailable for any reason. We'll also still issue a warning
> to stderr in such a case.
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>   revision.c | 5 +----
>   1 file changed, 1 insertion(+), 4 deletions(-)
>
> diff --git a/revision.c b/revision.c
> index 5ce9b93baa..221d62c52b 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -51,12 +51,9 @@ static void mark_tree_contents_uninteresting(struct tree *tree)
>   {
>   	struct tree_desc desc;
>   	struct name_entry entry;
> -	struct object *obj = &tree->object;
>   
> -	if (!has_object_file(&obj->oid))
> +	if (parse_tree_gently(tree, 1) < 0)
>   		return;
> -	if (parse_tree(tree) < 0)
> -		die("bad tree %s", oid_to_hex(&obj->oid));
>   
>   	init_tree_desc(&desc, tree->buffer, tree->size);
>   	while (tree_entry(&desc, &entry)) {


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2018-02-28 13:34 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-25  1:34 [PATCH] revision.c: reduce object database queries Derrick Stolee
2018-02-25  1:41 ` Derrick Stolee
2018-02-26  1:30 ` Jeff King
2018-02-26  1:38   ` Jeff King
2018-02-27 23:16     ` Junio C Hamano
2018-02-28  6:37       ` Jeff King
2018-02-28 13:34         ` Derrick Stolee

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).