git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH] hash binary sha1 into patch id
@ 2010-08-13  9:40 Clemens Buchacher
  2010-08-13 20:00 ` Jonathan Nieder
  0 siblings, 1 reply; 13+ messages in thread
From: Clemens Buchacher @ 2010-08-13  9:40 UTC (permalink / raw)
  To: git; +Cc: Marat Radchenko, Michael J Gruber, Junio C Hamano


[-- Attachment #1.1: Type: text/plain, Size: 1601 bytes --]

Since commit 2f82f760 (Take binary diffs into account for "git rebase"), binary
files are included in patch ID computation. Binary files are diffed using the
text diff algorithm, however, which has a huge impact on performance. The
following tests performance for a 50000 line file marked as binary in
.gitattributes.

$ git format-patch --stdout --full-index --ignore-if-in-upstream master

real	0m0.367s
user	0m0.354s
sys	0m0.010s

Instead of hashing the diff of binary files, use the post-image sha1, which is
just as unique. As a result, performance is much improved.

$ git format-patch --stdout --full-index --ignore-if-in-upstream master

real	0m0.016s
user	0m0.015s
sys	0m0.001s

Signed-off-by: Clemens Buchacher <drizzd@aon.at>
---

This may be related to the rebase performance issue discussed in
the following thread.
 
 http://mid.gmane.org/loom.20100713T082913-327@post.gmane.org

I am attaching the script which I used to test performance.

 diff.c |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/diff.c b/diff.c
index 17873f3..20fc6db 100644
--- a/diff.c
+++ b/diff.c
@@ -3758,6 +3758,12 @@ static int diff_get_patch_id(struct diff_options *options, unsigned char *sha1)
 					len2, p->two->path);
 		git_SHA1_Update(&ctx, buffer, len1);
 
+		if (diff_filespec_is_binary(p->two)) {
+			len1 = sprintf(buffer, "%s", sha1_to_hex(p->two->sha1));
+			git_SHA1_Update(&ctx, buffer, len1);
+			continue;
+		}
+
 		xpp.flags = 0;
 		xecfg.ctxlen = 3;
 		xecfg.flags = XDL_EMIT_FUNCNAMES;
-- 
1.7.2.1.1.g202c


[-- Attachment #1.2: test-patchid.sh --]
[-- Type: application/x-sh, Size: 1719 bytes --]

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 490 bytes --]

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH] hash binary sha1 into patch id
  2010-08-13  9:40 [PATCH] hash binary sha1 into patch id Clemens Buchacher
@ 2010-08-13 20:00 ` Jonathan Nieder
  2010-08-13 21:23   ` Clemens Buchacher
  0 siblings, 1 reply; 13+ messages in thread
From: Jonathan Nieder @ 2010-08-13 20:00 UTC (permalink / raw)
  To: Clemens Buchacher; +Cc: git, Marat Radchenko, Michael J Gruber, Junio C Hamano

Clemens Buchacher wrote:

> Since commit 2f82f760 (Take binary diffs into account for "git rebase"), binary
> files are included in patch ID computation. Binary files are diffed using the
> text diff algorithm, however
[...]
> Instead of hashing the diff of binary files, use the post-image sha1, which is
> just as unique. As a result, performance is much improved.

Maybe it should use both the pre- and post-image?

> diff --git a/diff.c b/diff.c
> index 17873f3..20fc6db 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -3758,6 +3758,12 @@ static int diff_get_patch_id(struct diff_options *options, unsigned char *sha1)
>  					len2, p->two->path);
>  		git_SHA1_Update(&ctx, buffer, len1);
>  
> +		if (diff_filespec_is_binary(p->two)) {
> +			len1 = sprintf(buffer, "%s", sha1_to_hex(p->two->sha1));
> +			git_SHA1_Update(&ctx, buffer, len1);


i.e., maybe also

			git_SHA1_Update(&ctx, sha1_to_hex(p->one->sha1), 40);

Luckily this is after the filenames and so on have been incorporated,
so there’s no need to add that specially.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] hash binary sha1 into patch id
  2010-08-13 20:00 ` Jonathan Nieder
@ 2010-08-13 21:23   ` Clemens Buchacher
  2010-08-13 21:37     ` Jonathan Nieder
  0 siblings, 1 reply; 13+ messages in thread
From: Clemens Buchacher @ 2010-08-13 21:23 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: git, Marat Radchenko, Michael J Gruber, Junio C Hamano

[-- Attachment #1: Type: text/plain, Size: 1530 bytes --]

Hi Jonathan,

On Fri, Aug 13, 2010 at 03:00:31PM -0500, Jonathan Nieder wrote:
> Clemens Buchacher wrote:
> 
> > Since commit 2f82f760 (Take binary diffs into account for "git rebase"), binary
> > files are included in patch ID computation. Binary files are diffed using the
> > text diff algorithm, however
> [...]
> > Instead of hashing the diff of binary files, use the post-image sha1, which is
> > just as unique. As a result, performance is much improved.
> 
> Maybe it should use both the pre- and post-image?

That would make the patch ID more correct in that it will identify
a particular change. But ultimately, we want to know whether or not
a change has been applied already. If the contents of a binary file
are the same in both commits, this is almost certainly true,
regardless of whether or not the pre-images match.

So I think we get better behavior if we ignore the pre-image.
Although the difference is probably minuscule.

> 
> > diff --git a/diff.c b/diff.c
> > index 17873f3..20fc6db 100644
> > --- a/diff.c
> > +++ b/diff.c
> > @@ -3758,6 +3758,12 @@ static int diff_get_patch_id(struct diff_options *options, unsigned char *sha1)
> >  					len2, p->two->path);
> >  		git_SHA1_Update(&ctx, buffer, len1);
> >  
> > +		if (diff_filespec_is_binary(p->two)) {
> > +			len1 = sprintf(buffer, "%s", sha1_to_hex(p->two->sha1));
> > +			git_SHA1_Update(&ctx, buffer, len1);
> 
> 
> i.e., maybe also
> 
> 			git_SHA1_Update(&ctx, sha1_to_hex(p->one->sha1), 40);

Thanks.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 490 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] hash binary sha1 into patch id
  2010-08-13 21:23   ` Clemens Buchacher
@ 2010-08-13 21:37     ` Jonathan Nieder
  2010-08-13 21:58       ` Clemens Buchacher
  0 siblings, 1 reply; 13+ messages in thread
From: Jonathan Nieder @ 2010-08-13 21:37 UTC (permalink / raw)
  To: Clemens Buchacher; +Cc: git, Marat Radchenko, Michael J Gruber, Junio C Hamano

Clemens Buchacher wrote:
> On Fri, Aug 13, 2010 at 03:00:31PM -0500, Jonathan Nieder wrote:

>> Maybe it should use both the pre- and post-image?
[...]
> So I think we get better behavior if we ignore the pre-image.
> Although the difference is probably minuscule.

As long as you’ve thought it over, I’m happy with it.  As you say, the
postimage matching without the preimage matching should be a rare
event.

FWIW what I was imagining was some structured binary format:

With A some long string, patch 1:

 A  --> AAA

Patch 2:

 AA --> AAA

Ideally one would want an attempt to apply patch 2 to result in a
conflict.  Probably that is far-fetched.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] hash binary sha1 into patch id
  2010-08-13 21:37     ` Jonathan Nieder
@ 2010-08-13 21:58       ` Clemens Buchacher
  2010-08-15  7:20         ` [PATCH v2] " Clemens Buchacher
  0 siblings, 1 reply; 13+ messages in thread
From: Clemens Buchacher @ 2010-08-13 21:58 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: git, Marat Radchenko, Michael J Gruber, Junio C Hamano

[-- Attachment #1: Type: text/plain, Size: 764 bytes --]

On Fri, Aug 13, 2010 at 04:37:26PM -0500, Jonathan Nieder wrote:
> 
> FWIW what I was imagining was some structured binary format:
> 
> With A some long string, patch 1:
> 
>  A  --> AAA
> 
> Patch 2:
> 
>  AA --> AAA
> 
> Ideally one would want an attempt to apply patch 2 to result in a
> conflict.  Probably that is far-fetched.

Actually, rebase will try to merge the respective branches and it
will typically succeed, because the post-images are the same. It
will fail due to conflict only if yet another patch has been
applied upstream on top of patch 1.

But that is also how textual changes behave. So for the sake of
consistency, it makes sense to require the pre-images to be the
same. I will resend an appropriate patch tomorrow.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 490 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v2] hash binary sha1 into patch id
  2010-08-13 21:58       ` Clemens Buchacher
@ 2010-08-15  7:20         ` Clemens Buchacher
  2010-08-15  7:56           ` Jonathan Nieder
  2010-09-10  5:17           ` Marat Radchenko
  0 siblings, 2 replies; 13+ messages in thread
From: Clemens Buchacher @ 2010-08-15  7:20 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Marat Radchenko, Michael J Gruber, Jonathan Nieder


[-- Attachment #1.1: Type: text/plain, Size: 1592 bytes --]

Since commit 2f82f760 (Take binary diffs into
account for "git rebase"), binary files are
included in patch ID computation. Binary files are
diffed using the text diff algorithm, however,
which has a huge impact on performance. The
following tests performance for a 50000 line file
marked as binary in .gitattributes.

$ git format-patch --stdout --ignore-if-in-upstream master

real    0m0.367s
user    0m0.354s
sys     0m0.010s

Instead of diffing the binary files, hash the pre-
and post-image sha1, which is just as unique. As a
result, performance is much improved.

$ git format-patch --stdout --ignore-if-in-upstream master

real    0m0.016s
user    0m0.015s
sys     0m0.001s

Signed-off-by: Clemens Buchacher <drizzd@aon.at>
---

I am again attaching the corresponding test script. Since it mostly
tests performance, I am not sure how to incorporate it into the
main test suite.

 diff.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/diff.c b/diff.c
index 17873f3..f4a23ab 100644
--- a/diff.c
+++ b/diff.c
@@ -3758,6 +3758,13 @@ static int diff_get_patch_id(struct diff_options *options, unsigned char *sha1)
 					len2, p->two->path);
 		git_SHA1_Update(&ctx, buffer, len1);
 
+		if (diff_filespec_is_binary(p->one) ||
+		    diff_filespec_is_binary(p->two)) {
+			git_SHA1_Update(&ctx, sha1_to_hex(p->one->sha1), 40);
+			git_SHA1_Update(&ctx, sha1_to_hex(p->two->sha1), 40);
+			continue;
+		}
+
 		xpp.flags = 0;
 		xecfg.ctxlen = 3;
 		xecfg.flags = XDL_EMIT_FUNCNAMES;
-- 
1.7.2.1.1.g202c


[-- Attachment #1.2: test-patchid.sh --]
[-- Type: application/x-sh, Size: 1710 bytes --]

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 490 bytes --]

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v2] hash binary sha1 into patch id
  2010-08-15  7:20         ` [PATCH v2] " Clemens Buchacher
@ 2010-08-15  7:56           ` Jonathan Nieder
  2010-09-10  5:17           ` Marat Radchenko
  1 sibling, 0 replies; 13+ messages in thread
From: Jonathan Nieder @ 2010-08-15  7:56 UTC (permalink / raw)
  To: Clemens Buchacher; +Cc: Junio C Hamano, git, Marat Radchenko, Michael J Gruber

Clemens Buchacher wrote:

> Signed-off-by: Clemens Buchacher <drizzd@aon.at>

Nice. :)

> I am again attaching the corresponding test script. Since it mostly
> tests performance, I am not sure how to incorporate it into the
> main test suite.

Maybe t3302-notes-index-expensive.sh would be a good example to
compare.

I haven’t looked carefully at your test script; is it possible
to extract a non-expensive test (i.e., just for correctness)
from it, too?

Jonathan

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2] hash binary sha1 into patch id
  2010-08-15  7:20         ` [PATCH v2] " Clemens Buchacher
  2010-08-15  7:56           ` Jonathan Nieder
@ 2010-09-10  5:17           ` Marat Radchenko
  2010-09-10  8:16             ` Clemens Buchacher
  1 sibling, 1 reply; 13+ messages in thread
From: Marat Radchenko @ 2010-09-10  5:17 UTC (permalink / raw)
  To: Clemens Buchacher, Junio C Hamano; +Cc: git, Michael J Gruber, Jonathan Nieder

As I see, this recently hit git master. I want to say big thanks since it improved git rebase from tens of minutes to just minutes here (warm disk caches).

However, I can't say I'm completely satisfied with git rebase speed (I'd like it to be < 1 min). So, are there known performance hotspots in 'git rev-list --cherry-pick'/'git format-patch --ignore-if-in-upstream' (these are two flags have biggest impact on git rebase [-i] speed) or I should run some kind of profiles in order to determine what else could be improved?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2] hash binary sha1 into patch id
  2010-09-10  5:17           ` Marat Radchenko
@ 2010-09-10  8:16             ` Clemens Buchacher
  2010-10-13  7:46               ` Marat Radchenko
  0 siblings, 1 reply; 13+ messages in thread
From: Clemens Buchacher @ 2010-09-10  8:16 UTC (permalink / raw)
  To: Marat Radchenko; +Cc: Junio C Hamano, git, Michael J Gruber, Jonathan Nieder

[-- Attachment #1: Type: text/plain, Size: 2880 bytes --]

On Fri, Sep 10, 2010 at 09:17:30AM +0400, Marat Radchenko wrote:
> 
> However, I can't say I'm completely satisfied with git rebase
> speed (I'd like it to be < 1 min). So, are there known
> performance hotspots in 'git rev-list --cherry-pick'/'git
> format-patch --ignore-if-in-upstream' (these are two flags have
> biggest impact on git rebase [-i] speed) or I should run some
> kind of profiles in order to determine what else could be
> improved?

git format-patch is still slow if you have large text files (>10000
lines). This patch only helps for binary files or files marked as
binary in .gitattributes. There is also a patch

 http://mid.gmane.org/645d8a9bf671937c1a6962b49cf1c512e0af0d82.1279008702.git.git@drmicha.warpmail.net

to turn off patch ID computation entirely.

Other than that, I am not aware of performance issues with rebase
that would cause it to take more than a minute. Possibly you can
use 'perf' to profile your use case. I used it to test this patch
in the script below (pass -p to enable). But it's only testing
format-patch right now, so you would have to modify it to profile
rebase as a whole.

Clemens
---

#!/bin/bash

set -e

NLINES=50000
NRAND=32768
# probability of change in percent
CHG_PROB=30
perf=
while test $# -gt 0
do
	case $1 in
	-p)
		perf=t
		;;
	-n)
		shift
		NLINES=$1
		;;
	*)
		break
		;;
	esac
	shift
done

if test $# -gt 0
then
	export PATH="$1:$PATH"
	shift
fi

if test $# -gt 0
then
	echo "too many arguments" >&2
	exit 1
fi

dir=$(mktemp -d)

scramble()
{
	while read x
	do
		if test $RANDOM -lt $((($CHG_PROB * $NRAND)/100))
		then
			echo $RANDOM
		else
			echo "$x"
		fi
	done < "$1" > "$1.new"
	mv -f "$1.new" "$1"
}

run()
{
	echo \$ "$@"
	if test -n "$perf"
	then
		perf record -g -f "$@" >/dev/null
		perf report -g
	else
		time "$@" >/dev/null
	fi
}

cd "$dir"
git init -q

for i in $(seq $NLINES)
do
	echo $i
done > file
git add file
echo "file binary" >.gitattributes
git add .gitattributes
git commit -q -m initial
git branch other

scramble file
git add file
git commit -q -m 'change big file'

git checkout -q other
: >newfile
git add newfile
git commit -q -m 'add small file'

gfp="git format-patch --stdout"

run $gfp master
run $gfp --ignore-if-in-upstream master

git cherry-pick master >/dev/null 2>&1

git checkout -q master
scramble file
git add file
git commit -q -m 'change big file again'

git checkout -q other^{}
run git rebase master
if test -n "$(git rev-list master...HEAD~)"
then
	echo "patch not identified" >&2
	exit 1
fi

git checkout -q -b squashed master
git reset -q --soft HEAD~2
git commit -q -m squashed
git checkout -q other^{}
if git rebase squashed >/dev/null
then
	echo "patch dropped" >&2
	exit 1
fi

cd - >/dev/null
rm -rf "$dir"

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 490 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2] hash binary sha1 into patch id
  2010-09-10  8:16             ` Clemens Buchacher
@ 2010-10-13  7:46               ` Marat Radchenko
  2010-10-13  9:17                 ` Marat Radchenko
  0 siblings, 1 reply; 13+ messages in thread
From: Marat Radchenko @ 2010-10-13  7:46 UTC (permalink / raw)
  To: git


So, here's some gprof data of git-rev-list:
http://git.661346.n2.nabble.com/file/n5629905/gprof_-p.txt gprof -p output 
http://git.661346.n2.nabble.com/file/n5629905/gprof_-q.txt gprof -q output 
-- 
View this message in context: http://git.661346.n2.nabble.com/PATCH-hash-binary-sha1-into-patch-id-tp5419375p5629905.html
Sent from the git mailing list archive at Nabble.com.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2] hash binary sha1 into patch id
  2010-10-13  7:46               ` Marat Radchenko
@ 2010-10-13  9:17                 ` Marat Radchenko
  2010-10-13 21:10                   ` Clemens Buchacher
  0 siblings, 1 reply; 13+ messages in thread
From: Marat Radchenko @ 2010-10-13  9:17 UTC (permalink / raw)
  To: git


Does diff_get_patch_id (in diff.c) really need to set xecfg.flags =
XDL_EMIT_FUNCNAMES? Removing that makes git-rev-list (called by rebase) 2x
faster here. Not sure about consequences though.
-- 
View this message in context: http://git.661346.n2.nabble.com/PATCH-hash-binary-sha1-into-patch-id-tp5419375p5630164.html
Sent from the git mailing list archive at Nabble.com.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2] hash binary sha1 into patch id
  2010-10-13  9:17                 ` Marat Radchenko
@ 2010-10-13 21:10                   ` Clemens Buchacher
  2010-10-14  7:19                     ` Marat Radchenko
  0 siblings, 1 reply; 13+ messages in thread
From: Clemens Buchacher @ 2010-10-13 21:10 UTC (permalink / raw)
  To: Marat Radchenko; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 1056 bytes --]

Hi Marat,

On Wed, Oct 13, 2010 at 02:17:32AM -0700, Marat Radchenko wrote:
> 
> Does diff_get_patch_id (in diff.c) really need to set xecfg.flags =
> XDL_EMIT_FUNCNAMES? Removing that makes git-rev-list (called by rebase) 2x
> faster here. Not sure about consequences though.

Yes, this has already been fixed in ad14b450 (do not search
functions for patch ID), which is currently in git.git's pu branch.
Function name search performance has also been improved for some
obscure cases in c099789b (diff: avoid repeated scanning while
looking for funcname).

Thanks anyways. If you notice anything else, please let us know.

I had a look at your gprof results, but I cannot make much sense of
it. I do not understand why git rev-list would even do any diffs.
For me, rev-list spends more than 50% in libz.  Can you give more
details on the kind of repository you have (binaries, memory size,
history size, any special attributes or options?), which git
version you are using, and which options you are passing to
rev-list?

Clemens

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 490 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2] hash binary sha1 into patch id
  2010-10-13 21:10                   ` Clemens Buchacher
@ 2010-10-14  7:19                     ` Marat Radchenko
  0 siblings, 0 replies; 13+ messages in thread
From: Marat Radchenko @ 2010-10-14  7:19 UTC (permalink / raw)
  To: git


I'm running git-rev-list from git master in the form git rebase does:
git-rev-list --no-merges --cherry-pick --pretty=oneline --abbrev-commit
--abbrev=7 --reverse --left-right --topo-order FOO...BAR

New gprof output (I compiled without -O2 to prevent function inlining and
removed xecfg.flags = XDL_EMIT_FUNCNAMES):
http://git.661346.n2.nabble.com/file/n5633988/gprof_-p.txt gprof -p 
http://git.661346.n2.nabble.com/file/n5633988/gprof_-q.txt gprof -q 

This thing takes ~3 minutes of cpu time.

So, it takes ~3 minutes of cpu time and looks like all time is spent in
xdl_cleanup_records.

Repo info: 80GB, 200K revisions (~200 per working day), 500K files (15GB) in
master working copy, running git-rev-list to rebase ~5 commits over ~800
commits that happened upstream.
-- 
View this message in context: http://git.661346.n2.nabble.com/PATCH-hash-binary-sha1-into-patch-id-tp5419375p5633988.html
Sent from the git mailing list archive at Nabble.com.

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2010-10-14  7:19 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-08-13  9:40 [PATCH] hash binary sha1 into patch id Clemens Buchacher
2010-08-13 20:00 ` Jonathan Nieder
2010-08-13 21:23   ` Clemens Buchacher
2010-08-13 21:37     ` Jonathan Nieder
2010-08-13 21:58       ` Clemens Buchacher
2010-08-15  7:20         ` [PATCH v2] " Clemens Buchacher
2010-08-15  7:56           ` Jonathan Nieder
2010-09-10  5:17           ` Marat Radchenko
2010-09-10  8:16             ` Clemens Buchacher
2010-10-13  7:46               ` Marat Radchenko
2010-10-13  9:17                 ` Marat Radchenko
2010-10-13 21:10                   ` Clemens Buchacher
2010-10-14  7:19                     ` Marat Radchenko

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).