git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* suboptimal behavior of fast-import in some cases with "from"
@ 2015-07-06 22:07 Mike Hommey
  2015-07-06 22:54 ` Junio C Hamano
  0 siblings, 1 reply; 9+ messages in thread
From: Mike Hommey @ 2015-07-06 22:07 UTC (permalink / raw)
  To: git

Hi,

I did something "stupid" with a script using fast-import, and that made
the whole process ~20% slower on Linux and 400~500% slower on Mac. The
reason this happened is that the script was essentially adding a "from"
to every "commit" command, even when the "from" commit is the current
head of the branch.

One of the first things parse_from does is unconditionally throw away
the tree for the given branch, and then the "from" tree is loaded. So
when the "from" commit is the current head of the branch, that make
fast-import do more work than necessary.

Even more so when the pack flush code in gfi_unpack_entry is triggered,
which, on mac, is extra slow (and explains the huge slow down there).

Now, I do understand that my script is doing something stupid. So the
question is whether it's worth fixing in fast-import or not. I already
fixed my script anyways.

Mike

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: suboptimal behavior of fast-import in some cases with "from"
  2015-07-06 22:07 suboptimal behavior of fast-import in some cases with "from" Mike Hommey
@ 2015-07-06 22:54 ` Junio C Hamano
  2015-07-09  5:03   ` Mike Hommey
  0 siblings, 1 reply; 9+ messages in thread
From: Junio C Hamano @ 2015-07-06 22:54 UTC (permalink / raw)
  To: Mike Hommey; +Cc: git

Mike Hommey <mh@glandium.org> writes:

> One of the first things parse_from does is unconditionally throw away
> the tree for the given branch, and then the "from" tree is loaded. So
> when the "from" commit is the current head of the branch, that make
> fast-import do more work than necessary.

If it is very common that the next commit the input stream wants to
create is often on top of the commit that was just created, and if
it is very common that the input stream producer knows what the
commit object name of the commit that was just created, then
optimising for that case does not sound too bad.  It really depends
on two factors:

 - How likely is it that other people make the same mistake?

 - How bad the damage to parse_from() would be if we wanted to
   optimize for this case?

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: suboptimal behavior of fast-import in some cases with "from"
  2015-07-06 22:54 ` Junio C Hamano
@ 2015-07-09  5:03   ` Mike Hommey
  2015-07-09  5:52     ` Mike Hommey
  0 siblings, 1 reply; 9+ messages in thread
From: Mike Hommey @ 2015-07-09  5:03 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Mon, Jul 06, 2015 at 03:54:35PM -0700, Junio C Hamano wrote:
> Mike Hommey <mh@glandium.org> writes:
> 
> > One of the first things parse_from does is unconditionally throw away
> > the tree for the given branch, and then the "from" tree is loaded. So
> > when the "from" commit is the current head of the branch, that make
> > fast-import do more work than necessary.
> 
> If it is very common that the next commit the input stream wants to
> create is often on top of the commit that was just created, and if
> it is very common that the input stream producer knows what the
> commit object name of the commit that was just created, then
> optimising for that case does not sound too bad.  It really depends
> on two factors:
> 
>  - How likely is it that other people make the same mistake?

Looks like the answer is: very. Assuming my quick glance at the code is
not mistaken, the same mistake is made in at least git-p4.py (yes, the
one that comes with git), felipec's git-remote-hg, and hg-fast-export,
and that's 100% of the sample I looked at.

I won't claim to know what fast-import is doing, not having looked at
more than the parse_from* functions and the commit message for 4cabf858,
but it seems plausible this also skips making tree deltas for those
trees.

>  - How bad the damage to parse_from() would be if we wanted to
>    optimize for this case?

I /think/ it would look like this (untested), which doesn't seem too
damaging:

diff --git a/fast-import.c b/fast-import.c
index e78ca10..cb232e0 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -2588,14 +2588,12 @@ static int parse_from(struct branch *b)
 {
        const char *from;
        struct branch *s;
+       unsigned char sha1[20];
 
        if (!skip_prefix(command_buf.buf, "from ", &from))
                return 0;
 
-       if (b->branch_tree.tree) {
-               release_tree_content_recursive(b->branch_tree.tree);
-               b->branch_tree.tree = NULL;
-       }
+       hashcpy(sha1, b->branch_tree.versions[1].sha1);
 
        s = lookup_branch(from);
        if (b == s)
@@ -2626,6 +2624,11 @@ static int parse_from(struct branch *b)
        else
                die("Invalid ref name or SHA1 expression: %s", from);
 
+       if (b->branch_tree.tree && hashcmp(sha1, b->branch_tree.versions[1].sha1)) {
+               release_tree_content_recursive(b->branch_tree.tree);
+               b->branch_tree.tree = NULL;
+       }
+
        read_next_command();
        return 1;
 }

Mike

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: suboptimal behavior of fast-import in some cases with "from"
  2015-07-09  5:03   ` Mike Hommey
@ 2015-07-09  5:52     ` Mike Hommey
  2015-07-09  6:50       ` [PATCH] fast-import: Do less work when given "from" matches current branch head Mike Hommey
  0 siblings, 1 reply; 9+ messages in thread
From: Mike Hommey @ 2015-07-09  5:52 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Thu, Jul 09, 2015 at 02:03:15PM +0900, Mike Hommey wrote:
> On Mon, Jul 06, 2015 at 03:54:35PM -0700, Junio C Hamano wrote:
> > Mike Hommey <mh@glandium.org> writes:
> > 
> > > One of the first things parse_from does is unconditionally throw away
> > > the tree for the given branch, and then the "from" tree is loaded. So
> > > when the "from" commit is the current head of the branch, that make
> > > fast-import do more work than necessary.
> > 
> > If it is very common that the next commit the input stream wants to
> > create is often on top of the commit that was just created, and if
> > it is very common that the input stream producer knows what the
> > commit object name of the commit that was just created, then
> > optimising for that case does not sound too bad.  It really depends
> > on two factors:
> > 
> >  - How likely is it that other people make the same mistake?
> 
> Looks like the answer is: very. Assuming my quick glance at the code is
> not mistaken, the same mistake is made in at least git-p4.py (yes, the
> one that comes with git), felipec's git-remote-hg, and hg-fast-export,
> and that's 100% of the sample I looked at.
> 
> I won't claim to know what fast-import is doing, not having looked at
> more than the parse_from* functions and the commit message for 4cabf858,
> but it seems plausible this also skips making tree deltas for those
> trees.

It doesn't /seem/ to be the case.

> >  - How bad the damage to parse_from() would be if we wanted to
> >    optimize for this case?
> 
> I /think/ it would look like this (untested), which doesn't seem too
> damaging:

It's actually not enough. It does avoid to reread trees, but it doesn't
avoid the pack to be closed/reopened.

Mike

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH] fast-import: Do less work when given "from" matches current branch head
  2015-07-09  5:52     ` Mike Hommey
@ 2015-07-09  6:50       ` Mike Hommey
  2015-07-09 20:37         ` Junio C Hamano
  0 siblings, 1 reply; 9+ messages in thread
From: Mike Hommey @ 2015-07-09  6:50 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano

When building a fast-import stream, it's easy to forget the fact that
for non-merge commits happening on top of the current branch head, there
is no need for a "from" command. That is corroborated by the fact that
at least git-p4, hg-fast-export and felipec's git-remote-hg all
unconditionally use a "from" command.

Unfortunately, giving a "from" command always resets the branch tree,
forcing it to be re-read, and in many cases, the pack is also closed
and reopened through gfi_unpack_entry. Both are extra unwanted overhead,
and the latter is particularly slow at least on OSX.

So, avoid resetting the tree when it's unmodified, and avoid calling
gfi_unpack_entry when the given mark points to the same commit as the
current branch head.

Signed-off-by: Mike Hommey <mh@glandium.org>
---
 fast-import.c | 29 +++++++++++++++++------------
 1 file changed, 17 insertions(+), 12 deletions(-)

diff --git a/fast-import.c b/fast-import.c
index e78ca10..372d63a 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -2588,14 +2588,12 @@ static int parse_from(struct branch *b)
 {
 	const char *from;
 	struct branch *s;
+	unsigned char sha1[20];
 
 	if (!skip_prefix(command_buf.buf, "from ", &from))
 		return 0;
 
-	if (b->branch_tree.tree) {
-		release_tree_content_recursive(b->branch_tree.tree);
-		b->branch_tree.tree = NULL;
-	}
+	hashcpy(sha1, b->branch_tree.versions[1].sha1);
 
 	s = lookup_branch(from);
 	if (b == s)
@@ -2610,14 +2608,16 @@ static int parse_from(struct branch *b)
 		struct object_entry *oe = find_mark(idnum);
 		if (oe->type != OBJ_COMMIT)
 			die("Mark :%" PRIuMAX " not a commit", idnum);
-		hashcpy(b->sha1, oe->idx.sha1);
-		if (oe->pack_id != MAX_PACK_ID) {
-			unsigned long size;
-			char *buf = gfi_unpack_entry(oe, &size);
-			parse_from_commit(b, buf, size);
-			free(buf);
-		} else
-			parse_from_existing(b);
+		if (hashcmp(b->sha1, oe->idx.sha1)) {
+			hashcpy(b->sha1, oe->idx.sha1);
+			if (oe->pack_id != MAX_PACK_ID) {
+				unsigned long size;
+				char *buf = gfi_unpack_entry(oe, &size);
+				parse_from_commit(b, buf, size);
+				free(buf);
+			} else
+				parse_from_existing(b);
+		}
 	} else if (!get_sha1(from, b->sha1)) {
 		parse_from_existing(b);
 		if (is_null_sha1(b->sha1))
@@ -2626,6 +2626,11 @@ static int parse_from(struct branch *b)
 	else
 		die("Invalid ref name or SHA1 expression: %s", from);
 
+	if (b->branch_tree.tree && hashcmp(sha1, b->branch_tree.versions[1].sha1)) {
+		release_tree_content_recursive(b->branch_tree.tree);
+		b->branch_tree.tree = NULL;
+	}
+
 	read_next_command();
 	return 1;
 }
-- 
2.4.3.2.gf0a024e.dirty

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] fast-import: Do less work when given "from" matches current branch head
  2015-07-09  6:50       ` [PATCH] fast-import: Do less work when given "from" matches current branch head Mike Hommey
@ 2015-07-09 20:37         ` Junio C Hamano
  2015-07-09 22:30           ` Mike Hommey
  0 siblings, 1 reply; 9+ messages in thread
From: Junio C Hamano @ 2015-07-09 20:37 UTC (permalink / raw)
  To: Mike Hommey; +Cc: git, Shawn Pearce, Jonathan Nieder, Jeff King

Mike Hommey <mh@glandium.org> writes:

Cc'ed a few people who appear at the top of "shortlog --no-merges";
I think the end result is not incorrect, but I want to hear second
opinions on this one.  I do not know Shawn still remembers this
code, but what is under discussion seems to have come mostly from
ea5e370a (fast-import: Support reusing 'from' and brown paper bag
fix reset., 2007-02-12).

>  	if (!skip_prefix(command_buf.buf, "from ", &from))
>  		return 0;
>  
> -	if (b->branch_tree.tree) {
> -		release_tree_content_recursive(b->branch_tree.tree);
> -		b->branch_tree.tree = NULL;
> -	}
> +	hashcpy(sha1, b->branch_tree.versions[1].sha1);
>  
>  	s = lookup_branch(from);
>  	if (b == s)

The part that deals with a branch that is different from the current
one is not visible in the context (i.e. when s = lookup_branch(from)
returned a non-NULL result that is different from b) but it used to,
and continues to with this patch, copy sha1 from branch_tree.sha1
and branch_tree.versions[] from sha1 and branch_tree.versions[1] of
the specified branch.

That codepath used to release the contents of branch_tree.tree when
it did so, but it no longer does so after this patch because of the
removal we see above.

Does that mean the original code was doing a release that was
unnecessary?  Or does it mean this patch changes what happens on
that codepath, namely (1) leaking resource, and/or (2) keeping a
tree of the original 'b' that does not have anything to do with the
tree of 's', preventing the later lazy-load code from reading the
tree of 's' and instead of building on top of a wrong tree content?

... me goes and reads on ...

> @@ -2610,14 +2608,16 @@ static int parse_from(struct branch *b)
>  		struct object_entry *oe = find_mark(idnum);
>  		if (oe->type != OBJ_COMMIT)
>  			die("Mark :%" PRIuMAX " not a commit", idnum);
> -		hashcpy(b->sha1, oe->idx.sha1);
> -		if (oe->pack_id != MAX_PACK_ID) {
> -			unsigned long size;
> -			char *buf = gfi_unpack_entry(oe, &size);
> -			parse_from_commit(b, buf, size);
> -			free(buf);
> -		} else
> -			parse_from_existing(b);
> +		if (hashcmp(b->sha1, oe->idx.sha1)) {
> +			hashcpy(b->sha1, oe->idx.sha1);
> +			if (oe->pack_id != MAX_PACK_ID) {
> +				unsigned long size;
> +				char *buf = gfi_unpack_entry(oe, &size);
> +				parse_from_commit(b, buf, size);
> +				free(buf);
> +			} else
> +				parse_from_existing(b);
> +		}
>  	} else if (!get_sha1(from, b->sha1)) {
>  		parse_from_existing(b);
>  		if (is_null_sha1(b->sha1))

This part is straight-forward.

> @@ -2626,6 +2626,11 @@ static int parse_from(struct branch *b)
>  	else
>  		die("Invalid ref name or SHA1 expression: %s", from);
>  
> +	if (b->branch_tree.tree && hashcmp(sha1, b->branch_tree.versions[1].sha1)) {
> +		release_tree_content_recursive(b->branch_tree.tree);
> +		b->branch_tree.tree = NULL;
> +	}
> +

This looks like an attempt to compensate for that "what happens if
(s != NULL && s != b)?" issue, and also for the surviving codepaths.

As both parse_from_commit() and parse_from_existing() only touch
branch_tree.versions[] and they do not seem to get affected if
b->branch_tree.tree holds a stale and unrelated content, this looks
OK to me from a cursory reading, but it does make me feel dirty that
it has to put *b temporarily into an inconsistent state.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] fast-import: Do less work when given "from" matches current branch head
  2015-07-09 20:37         ` Junio C Hamano
@ 2015-07-09 22:30           ` Mike Hommey
  2015-07-09 22:44             ` Junio C Hamano
  0 siblings, 1 reply; 9+ messages in thread
From: Mike Hommey @ 2015-07-09 22:30 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Shawn Pearce, Jonathan Nieder, Jeff King

On Thu, Jul 09, 2015 at 01:37:02PM -0700, Junio C Hamano wrote:
> Mike Hommey <mh@glandium.org> writes:
> 
> Cc'ed a few people who appear at the top of "shortlog --no-merges";
> I think the end result is not incorrect, but I want to hear second
> opinions on this one.  I do not know Shawn still remembers this
> code, but what is under discussion seems to have come mostly from
> ea5e370a (fast-import: Support reusing 'from' and brown paper bag
> fix reset., 2007-02-12).
> 
> >  	if (!skip_prefix(command_buf.buf, "from ", &from))
> >  		return 0;
> >  
> > -	if (b->branch_tree.tree) {
> > -		release_tree_content_recursive(b->branch_tree.tree);
> > -		b->branch_tree.tree = NULL;
> > -	}
> > +	hashcpy(sha1, b->branch_tree.versions[1].sha1);
> >  
> >  	s = lookup_branch(from);
> >  	if (b == s)
> 
> The part that deals with a branch that is different from the current
> one is not visible in the context (i.e. when s = lookup_branch(from)
> returned a non-NULL result that is different from b) but it used to,
> and continues to with this patch, copy sha1 from branch_tree.sha1
> and branch_tree.versions[] from sha1 and branch_tree.versions[1] of
> the specified branch.
> 
> That codepath used to release the contents of branch_tree.tree when
> it did so, but it no longer does so after this patch because of the
> removal we see above.
> 
> Does that mean the original code was doing a release that was
> unnecessary?  Or does it mean this patch changes what happens on
> that codepath, namely (1) leaking resource, and/or (2) keeping a
> tree of the original 'b' that does not have anything to do with the
> tree of 's', preventing the later lazy-load code from reading the
> tree of 's' and instead of building on top of a wrong tree content?

I guess the question is whether branch_tree.tree can be in a state that
doesn't match that of branch_tree.versions[1].sha1. If not, then if s
and b have the same branch_tree.versions[1].sha1 for some reason, then
keeping the old branch_tree.tree makes no practical difference from
resetting it. Except it skips the busy-work.

Mike

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] fast-import: Do less work when given "from" matches current branch head
  2015-07-09 22:30           ` Mike Hommey
@ 2015-07-09 22:44             ` Junio C Hamano
  2015-07-09 22:47               ` Junio C Hamano
  0 siblings, 1 reply; 9+ messages in thread
From: Junio C Hamano @ 2015-07-09 22:44 UTC (permalink / raw)
  To: Mike Hommey; +Cc: git, Shawn Pearce, Jonathan Nieder, Jeff King

Mike Hommey <mh@glandium.org> writes:

>> Does that mean the original code was doing a release that was
>> unnecessary?  Or does it mean this patch changes what happens on
>> that codepath, namely (1) leaking resource, and/or (2) keeping a
>> tree of the original 'b' that does not have anything to do with the
>> tree of 's', preventing the later lazy-load code from reading the
>> tree of 's' and instead of building on top of a wrong tree content?
>
> I guess the question is whether branch_tree.tree can be in a state that
> doesn't match that of branch_tree.versions[1].sha1. If not, then if s
> and b have the same branch_tree.versions[1].sha1 for some reason, then
> keeping the old branch_tree.tree makes no practical difference from
> resetting it. Except it skips the busy-work.

Perhaps my comment was misleading.  I _think_ the state at the end
of this function (i.e. the latter hunk you added to the function
makes the above issue I raised go away).  It just made me feel
uneasy to leave branch_tree.tree and branch_tree.versions[] in an
inconsistent state inside this function, while it calls a few helper
functions (hence my comment on the fact that they do not seem to be
affected by this inconsistency).

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] fast-import: Do less work when given "from" matches current branch head
  2015-07-09 22:44             ` Junio C Hamano
@ 2015-07-09 22:47               ` Junio C Hamano
  0 siblings, 0 replies; 9+ messages in thread
From: Junio C Hamano @ 2015-07-09 22:47 UTC (permalink / raw)
  To: Mike Hommey; +Cc: git, Shawn Pearce, Jonathan Nieder, Jeff King

Junio C Hamano <gitster@pobox.com> writes:

> Mike Hommey <mh@glandium.org> writes:
>
>>> Does that mean the original code was doing a release that was
>>> unnecessary?  Or does it mean this patch changes what happens on
>>> that codepath, namely (1) leaking resource, and/or (2) keeping a
>>> tree of the original 'b' that does not have anything to do with the
>>> tree of 's', preventing the later lazy-load code from reading the
>>> tree of 's' and instead of building on top of a wrong tree content?
>>
>> I guess the question is whether branch_tree.tree can be in a state that
>> doesn't match that of branch_tree.versions[1].sha1. If not, then if s
>> and b have the same branch_tree.versions[1].sha1 for some reason, then
>> keeping the old branch_tree.tree makes no practical difference from
>> resetting it. Except it skips the busy-work.
>
> Perhaps my comment was misleading.  I _think_ the state at the end
> of this function (i.e. the latter hunk you added to the function
> makes the above issue I raised go away).  It just made me feel

s/this function (/this function is good (/;

> uneasy to leave branch_tree.tree and branch_tree.versions[] in an
> inconsistent state inside this function, while it calls a few helper
> functions (hence my comment on the fact that they do not seem to be
> affected by this inconsistency).

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2015-07-09 22:48 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-07-06 22:07 suboptimal behavior of fast-import in some cases with "from" Mike Hommey
2015-07-06 22:54 ` Junio C Hamano
2015-07-09  5:03   ` Mike Hommey
2015-07-09  5:52     ` Mike Hommey
2015-07-09  6:50       ` [PATCH] fast-import: Do less work when given "from" matches current branch head Mike Hommey
2015-07-09 20:37         ` Junio C Hamano
2015-07-09 22:30           ` Mike Hommey
2015-07-09 22:44             ` Junio C Hamano
2015-07-09 22:47               ` Junio C Hamano

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).