[PATCH 0/3] xdiff: speedup histogram diff

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* [PATCH 0/3] xdiff: speedup histogram diff
@ 2021-11-17 11:20 Phillip Wood via GitGitGadget
  2021-11-17 11:20 ` [PATCH 1/3] diff histogram: intern strings Phillip Wood via GitGitGadget
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Phillip Wood via GitGitGadget @ 2021-11-17 11:20 UTC (permalink / raw)
  To: git; +Cc: Phillip Wood

Histogram is the only diff algorithm not to call xdl_classify_record().
Calling xdl_classify_record() means that it is not necessary to use
xdl_recmatch() when comparing lines, all that is necessary is to compare the
hash values. This gives a 7% reduction in the runtime of "git log --patch"
when using the histogram diff algorithm.

Phillip Wood (3):
  diff histogram: intern strings
  xdiff: avoid unnecessary memory allocations
  xdiff: simplify comparison

 xdiff/xdiffi.c     |  5 +----
 xdiff/xhistogram.c |  5 ++---
 xdiff/xprepare.c   | 35 +++++++++++++++--------------------
 3 files changed, 18 insertions(+), 27 deletions(-)


base-commit: cd3e606211bb1cf8bc57f7d76bab98cc17a150bc
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1079%2Fphillipwood%2Fwip%2Fhistogram-speedup-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1079/phillipwood/wip/histogram-speedup-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1079
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/3] diff histogram: intern strings
  2021-11-17 11:20 [PATCH 0/3] xdiff: speedup histogram diff Phillip Wood via GitGitGadget
@ 2021-11-17 11:20 ` Phillip Wood via GitGitGadget
  2021-11-17 15:55   ` Derrick Stolee
  2021-11-17 11:20 ` [PATCH 2/3] xdiff: avoid unnecessary memory allocations Phillip Wood via GitGitGadget
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 15+ messages in thread
From: Phillip Wood via GitGitGadget @ 2021-11-17 11:20 UTC (permalink / raw)
  To: git; +Cc: Phillip Wood, Phillip Wood

From: Phillip Wood <phillip.wood@dunelm.org.uk>

Histogram is the only diff algorithm not to call
xdl_classify_record(). xdl_classify_record() ensures that the hash
values of two strings that are not equal differ which means that it is
not necessary to use xdl_recmatch() when comparing lines, all that is
necessary is to compare the hash values. This gives a 7% reduction in
the runtime of "git log --patch" when using the histogram diff
algorithm.

Test                                  HEAD^             HEAD
-----------------------------------------------------------------------------
4000.1: log -3000 (baseline)          0.18(0.14+0.04)   0.19(0.17+0.02) +5.6%
4000.2: log --raw -3000 (tree-only)   0.99(0.77+0.21)   0.98(0.78+0.20) -1.0%
4000.3: log -p -3000 (Myers)          4.84(4.31+0.51)   4.81(4.15+0.64) -0.6%
4000.4: log -p -3000 --histogram      6.34(5.86+0.46)   5.87(5.19+0.66) -7.4%
4000.5: log -p -3000 --patience       5.39(4.60+0.76)   5.35(4.60+0.73) -0.7%

Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk>
---
 xdiff/xhistogram.c |  5 ++---
 xdiff/xprepare.c   | 24 ++++++++----------------
 2 files changed, 10 insertions(+), 19 deletions(-)

diff --git a/xdiff/xhistogram.c b/xdiff/xhistogram.c
index e694bfd9e31..6c1c88a69a1 100644
--- a/xdiff/xhistogram.c
+++ b/xdiff/xhistogram.c
@@ -91,9 +91,8 @@ struct region {
 static int cmp_recs(xpparam_t const *xpp,
 	xrecord_t *r1, xrecord_t *r2)
 {
-	return r1->ha == r2->ha &&
-		xdl_recmatch(r1->ptr, r1->size, r2->ptr, r2->size,
-			    xpp->flags);
+	return r1->ha == r2->ha;
+
 }
 
 #define CMP_ENV(xpp, env, s1, l1, s2, l2) \
diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index abeb8fb84e6..7fae0727a02 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -181,15 +181,11 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_
 	if (!(recs = (xrecord_t **) xdl_malloc(narec * sizeof(xrecord_t *))))
 		goto abort;
 
-	if (XDF_DIFF_ALG(xpp->flags) == XDF_HISTOGRAM_DIFF)
-		hbits = hsize = 0;
-	else {
-		hbits = xdl_hashbits((unsigned int) narec);
-		hsize = 1 << hbits;
-		if (!(rhash = (xrecord_t **) xdl_malloc(hsize * sizeof(xrecord_t *))))
-			goto abort;
-		memset(rhash, 0, hsize * sizeof(xrecord_t *));
-	}
+	hbits = xdl_hashbits((unsigned int) narec);
+	hsize = 1 << hbits;
+	if (!(rhash = (xrecord_t **) xdl_malloc(hsize * sizeof(xrecord_t *))))
+		goto abort;
+	memset(rhash, 0, hsize * sizeof(xrecord_t *));
 
 	nrec = 0;
 	if ((cur = blk = xdl_mmfile_first(mf, &bsize)) != NULL) {
@@ -208,9 +204,7 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_
 			crec->size = (long) (cur - prev);
 			crec->ha = hav;
 			recs[nrec++] = crec;
-
-			if ((XDF_DIFF_ALG(xpp->flags) != XDF_HISTOGRAM_DIFF) &&
-			    xdl_classify_record(pass, cf, rhash, hbits, crec) < 0)
+			if (xdl_classify_record(pass, cf, rhash, hbits, crec) < 0)
 				goto abort;
 		}
 	}
@@ -279,8 +273,7 @@ int xdl_prepare_env(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
 	enl1 = xdl_guess_lines(mf1, sample) + 1;
 	enl2 = xdl_guess_lines(mf2, sample) + 1;
 
-	if (XDF_DIFF_ALG(xpp->flags) != XDF_HISTOGRAM_DIFF &&
-	    xdl_init_classifier(&cf, enl1 + enl2 + 1, xpp->flags) < 0)
+	if (xdl_init_classifier(&cf, enl1 + enl2 + 1, xpp->flags) < 0)
 		return -1;
 
 	if (xdl_prepare_ctx(1, mf1, enl1, xpp, &cf, &xe->xdf1) < 0) {
@@ -305,8 +298,7 @@ int xdl_prepare_env(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
 		return -1;
 	}
 
-	if (XDF_DIFF_ALG(xpp->flags) != XDF_HISTOGRAM_DIFF)
-		xdl_free_classifier(&cf);
+	xdl_free_classifier(&cf);
 
 	return 0;
 }
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/3] xdiff: avoid unnecessary memory allocations
  2021-11-17 11:20 [PATCH 0/3] xdiff: speedup histogram diff Phillip Wood via GitGitGadget
  2021-11-17 11:20 ` [PATCH 1/3] diff histogram: intern strings Phillip Wood via GitGitGadget
@ 2021-11-17 11:20 ` Phillip Wood via GitGitGadget
  2021-11-17 11:20 ` [PATCH 3/3] xdiff: simplify comparison Phillip Wood via GitGitGadget
  2021-11-18 15:40 ` [PATCH 0/3] xdiff: speedup histogram diff Johannes Schindelin
  3 siblings, 0 replies; 15+ messages in thread
From: Phillip Wood via GitGitGadget @ 2021-11-17 11:20 UTC (permalink / raw)
  To: git; +Cc: Phillip Wood, Phillip Wood

From: Phillip Wood <phillip.wood@dunelm.org.uk>

rindex and ha are only used by xdl_cleanup_records() which is not
called by the histogram or patience algorithms. The perf test results
show a small reduction in run time but that is probably within the
noise.

Test                                  HEAD^             HEAD
-----------------------------------------------------------------------------
4000.1: log -3000 (baseline)          0.19(0.17+0.02)   0.19(0.12+0.07) +0.0%
4000.2: log --raw -3000 (tree-only)   0.98(0.78+0.20)   0.98(0.81+0.16) +0.0%
4000.3: log -p -3000 (Myers)          4.81(4.15+0.64)   4.81(4.23+0.56) +0.0%
4000.4: log -p -3000 --histogram      5.87(5.19+0.66)   5.83(5.11+0.70) -0.7%
4000.5: log -p -3000 --patience       5.35(4.60+0.73)   5.31(4.61+0.69) -0.7%

Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk>
---
 xdiff/xprepare.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index 7fae0727a02..4527a4a07c4 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -213,10 +213,13 @@ static int xdl_prepare_ctx(unsigned int pass, mmfile_t *mf, long narec, xpparam_
 		goto abort;
 	memset(rchg, 0, (nrec + 2) * sizeof(char));
 
-	if (!(rindex = (long *) xdl_malloc((nrec + 1) * sizeof(long))))
-		goto abort;
-	if (!(ha = (unsigned long *) xdl_malloc((nrec + 1) * sizeof(unsigned long))))
-		goto abort;
+	if ((XDF_DIFF_ALG(xpp->flags) != XDF_PATIENCE_DIFF) &&
+	    (XDF_DIFF_ALG(xpp->flags) != XDF_HISTOGRAM_DIFF)) {
+		if (!(rindex = xdl_malloc((nrec + 1) * sizeof(*rindex))))
+			goto abort;
+		if (!(ha = xdl_malloc((nrec + 1) * sizeof(*ha))))
+			goto abort;
+	}
 
 	xdf->nrec = nrec;
 	xdf->recs = recs;
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 3/3] xdiff: simplify comparison
  2021-11-17 11:20 [PATCH 0/3] xdiff: speedup histogram diff Phillip Wood via GitGitGadget
  2021-11-17 11:20 ` [PATCH 1/3] diff histogram: intern strings Phillip Wood via GitGitGadget
  2021-11-17 11:20 ` [PATCH 2/3] xdiff: avoid unnecessary memory allocations Phillip Wood via GitGitGadget
@ 2021-11-17 11:20 ` Phillip Wood via GitGitGadget
  2021-11-18 15:40 ` [PATCH 0/3] xdiff: speedup histogram diff Johannes Schindelin
  3 siblings, 0 replies; 15+ messages in thread
From: Phillip Wood via GitGitGadget @ 2021-11-17 11:20 UTC (permalink / raw)
  To: git; +Cc: Phillip Wood, Phillip Wood

From: Phillip Wood <phillip.wood@dunelm.org.uk>

Now that the histogram algorithm calls xdl_classify_record() it is no
longer necessary to use xdl_recmatch() to compare lines, it is
sufficient just to compare the hash values. This has a negligible
effect on performance.

Test                                  HEAD~1            HEAD
-----------------------------------------------------------------------------
4000.1: log -3000 (baseline)          0.19(0.12+0.07)   0.18(0.14+0.04) -5.3%
4000.2: log --raw -3000 (tree-only)   0.98(0.81+0.16)   0.98(0.79+0.18) +0.0%
4000.3: log -p -3000 (Myers)          4.81(4.23+0.56)   4.80(4.26+0.53) -0.2%
4000.4: log -p -3000 --histogram      5.83(5.11+0.70)   5.82(5.15+0.65) -0.2%
4000.5: log -p -3000 --patience       5.31(4.61+0.69)   5.30(4.54+0.75) -0.2%

Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk>
---
 xdiff/xdiffi.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index a4542c05b61..6a3b9280beb 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -392,10 +392,7 @@ static xdchange_t *xdl_add_change(xdchange_t *xscr, long i1, long i2, long chg1,
 
 static int recs_match(xrecord_t *rec1, xrecord_t *rec2, long flags)
 {
-	return (rec1->ha == rec2->ha &&
-		xdl_recmatch(rec1->ptr, rec1->size,
-			     rec2->ptr, rec2->size,
-			     flags));
+	return (rec1->ha == rec2->ha);
 }
 
 /*
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] diff histogram: intern strings
  2021-11-17 11:20 ` [PATCH 1/3] diff histogram: intern strings Phillip Wood via GitGitGadget
@ 2021-11-17 15:55   ` Derrick Stolee
  2021-11-17 16:46     ` Jeff King
  2021-11-17 16:52     ` Phillip Wood
  0 siblings, 2 replies; 15+ messages in thread
From: Derrick Stolee @ 2021-11-17 15:55 UTC (permalink / raw)
  To: Phillip Wood via GitGitGadget, git; +Cc: Phillip Wood



On 11/17/2021 6:20 AM, Phillip Wood via GitGitGadget wrote:
> From: Phillip Wood <phillip.wood@dunelm.org.uk>
> 
> Histogram is the only diff algorithm not to call
> xdl_classify_record(). xdl_classify_record() ensures that the hash
> values of two strings that are not equal differ which means that it is
> not necessary to use xdl_recmatch() when comparing lines, all that is
> necessary is to compare the hash values. This gives a 7% reduction in
> the runtime of "git log --patch" when using the histogram diff
> algorithm.
> 
> Test                                  HEAD^             HEAD
> -----------------------------------------------------------------------------
> 4000.1: log -3000 (baseline)          0.18(0.14+0.04)   0.19(0.17+0.02) +5.6%
> 4000.2: log --raw -3000 (tree-only)   0.99(0.77+0.21)   0.98(0.78+0.20) -1.0%
> 4000.3: log -p -3000 (Myers)          4.84(4.31+0.51)   4.81(4.15+0.64) -0.6%
> 4000.4: log -p -3000 --histogram      6.34(5.86+0.46)   5.87(5.19+0.66) -7.4%
> 4000.5: log -p -3000 --patience       5.39(4.60+0.76)   5.35(4.60+0.73) -0.7%
> 
> Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk>
> ---
>  xdiff/xhistogram.c |  5 ++---
>  xdiff/xprepare.c   | 24 ++++++++----------------
>  2 files changed, 10 insertions(+), 19 deletions(-)
> 
> diff --git a/xdiff/xhistogram.c b/xdiff/xhistogram.c
> index e694bfd9e31..6c1c88a69a1 100644
> --- a/xdiff/xhistogram.c
> +++ b/xdiff/xhistogram.c
> @@ -91,9 +91,8 @@ struct region {
>  static int cmp_recs(xpparam_t const *xpp,
>  	xrecord_t *r1, xrecord_t *r2)
>  {
> -	return r1->ha == r2->ha &&
> -		xdl_recmatch(r1->ptr, r1->size, r2->ptr, r2->size,
> -			    xpp->flags);
> +	return r1->ha == r2->ha;
> +

nit: stray newline.

The only meaningful change here is that you are relying entirely on
the hash and not checking the content again. This means that hash
collisions on this 32-bit hash could start introducing different
results. Are we worried about that?

I see that a similar hash-comparison is done in xpatience.c without
further checking the contents, but xdiffi.c compares the hashes and
then checks with xdl_recmatch(). So, we are still not reaching full
consistency across all diff algorithms with how we handle these
comparisons. I think it is good to have at least one that can be used
if/when we hit these hash collisions within a diff, but it can be hard
to communicate to a user why they need to change a diff algorithm for
such an internal reason.


The following bits looked scary at first, but you are just removing the
special-casing of XDF_HISTOGRAM_DIFF from the preparation stage.

> -	if (XDF_DIFF_ALG(xpp->flags) == XDF_HISTOGRAM_DIFF)
> -		hbits = hsize = 0;
> -	else {
> -		hbits = xdl_hashbits((unsigned int) narec);
> -		hsize = 1 << hbits;
> -		if (!(rhash = (xrecord_t **) xdl_malloc(hsize * sizeof(xrecord_t *))))
> -			goto abort;
> -		memset(rhash, 0, hsize * sizeof(xrecord_t *));
> -	}
> +	hbits = xdl_hashbits((unsigned int) narec);
> +	hsize = 1 << hbits;
> +	if (!(rhash = (xrecord_t **) xdl_malloc(hsize * sizeof(xrecord_t *))))
> +		goto abort;
> +	memset(rhash, 0, hsize * sizeof(xrecord_t *));

> -			if ((XDF_DIFF_ALG(xpp->flags) != XDF_HISTOGRAM_DIFF) &&
> -			    xdl_classify_record(pass, cf, rhash, hbits, crec) < 0)
> +			if (xdl_classify_record(pass, cf, rhash, hbits, crec) < 0)

> -	if (XDF_DIFF_ALG(xpp->flags) != XDF_HISTOGRAM_DIFF &&
> -	    xdl_init_classifier(&cf, enl1 + enl2 + 1, xpp->flags) < 0)
> +	if (xdl_init_classifier(&cf, enl1 + enl2 + 1, xpp->flags) < 0)

> -	if (XDF_DIFF_ALG(xpp->flags) != XDF_HISTOGRAM_DIFF)
> -		xdl_free_classifier(&cf);
> +	xdl_free_classifier(&cf);

The existence of these conditions gave me pause, so I went to look for where they
were inserted. They were made in 9f37c27 (xdiff/xprepare: skip classification,
2011-07-12) with the justification that 

    We don't need any of that in histogram diff, so we omit calls to these
    functions. We also skip allocating memory to the hash table, rhash, as
    it is no longer used.

    This gives us a small boost in performance.

But you are actually _using_ these preparation steps, which means you are
re-adding the cost of hashing but overall improving because you use the
data correctly. Excellent.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] diff histogram: intern strings
  2021-11-17 15:55   ` Derrick Stolee
@ 2021-11-17 16:46     ` Jeff King
  2021-11-17 16:52     ` Phillip Wood
  1 sibling, 0 replies; 15+ messages in thread
From: Jeff King @ 2021-11-17 16:46 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Phillip Wood via GitGitGadget, git, Phillip Wood

On Wed, Nov 17, 2021 at 10:55:02AM -0500, Derrick Stolee wrote:

> > diff --git a/xdiff/xhistogram.c b/xdiff/xhistogram.c
> > index e694bfd9e31..6c1c88a69a1 100644
> > --- a/xdiff/xhistogram.c
> > +++ b/xdiff/xhistogram.c
> > @@ -91,9 +91,8 @@ struct region {
> >  static int cmp_recs(xpparam_t const *xpp,
> >  	xrecord_t *r1, xrecord_t *r2)
> >  {
> > -	return r1->ha == r2->ha &&
> > -		xdl_recmatch(r1->ptr, r1->size, r2->ptr, r2->size,
> > -			    xpp->flags);
> > +	return r1->ha == r2->ha;
> > +
> 
> nit: stray newline.
> 
> The only meaningful change here is that you are relying entirely on
> the hash and not checking the content again. This means that hash
> collisions on this 32-bit hash could start introducing different
> results. Are we worried about that?

I had the same thought. But running "git log --histogram -p" on git.git
does not seem to produce any differences between the two. So perhaps
collisions are removed elsewhere. It would be nice to have a better
understanding of this before proceeding with this change.

Curiously, I got a much smaller improvement in my test, which did:

  git log --no-merges -p --histogram :^po

My assumption being that "po/" diffs are big and uninteresting and so
bloat the output. But that turned out to be interesting timing-wise.
Excluding "po/" means that patch produces only a 0.6% improvement in
speed. But _just_ running the diffs for po shows a 24% speedup!

I guess this is just because those files are much larger than average
(and changed in fewer commits), meaning that the time spent hashing and
comparing lines will show up as a larger percentage of the total work.

But I wondered...

> The existence of these conditions gave me pause, so I went to look for where they
> were inserted. They were made in 9f37c27 (xdiff/xprepare: skip classification,
> 2011-07-12) with the justification that 
> 
>     We don't need any of that in histogram diff, so we omit calls to these
>     functions. We also skip allocating memory to the hash table, rhash, as
>     it is no longer used.
> 
>     This gives us a small boost in performance.
> 
> But you are actually _using_ these preparation steps, which means you are
> re-adding the cost of hashing but overall improving because you use the
> data correctly. Excellent.

Are we making a tradeoff here based on the data patterns? That is, it
seems like we are spending extra time upfront to do classification in
order to get quicker comparisons later. Presumably the upfront work is
O(n) in the length of the file. How many comparisons do we expect to
save?  Is it also linear in the number of lines, or could it be
super- or sub-linear depending on the actual diff?

-Peff

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] diff histogram: intern strings
  2021-11-17 15:55   ` Derrick Stolee
  2021-11-17 16:46     ` Jeff King
@ 2021-11-17 16:52     ` Phillip Wood
  2021-11-18 15:35       ` Johannes Schindelin
  1 sibling, 1 reply; 15+ messages in thread
From: Phillip Wood @ 2021-11-17 16:52 UTC (permalink / raw)
  To: Derrick Stolee, Phillip Wood via GitGitGadget, git; +Cc: Phillip Wood

Hi Stolee

On 17/11/2021 15:55, Derrick Stolee wrote:
> 
> 
> On 11/17/2021 6:20 AM, Phillip Wood via GitGitGadget wrote:
>> From: Phillip Wood <phillip.wood@dunelm.org.uk>
>>
>> Histogram is the only diff algorithm not to call
>> xdl_classify_record(). xdl_classify_record() ensures that the hash
>> values of two strings that are not equal differ which means that it is
>> not necessary to use xdl_recmatch() when comparing lines, all that is
>> necessary is to compare the hash values. This gives a 7% reduction in
>> the runtime of "git log --patch" when using the histogram diff
>> algorithm.
>>
>> Test                                  HEAD^             HEAD
>> -----------------------------------------------------------------------------
>> 4000.1: log -3000 (baseline)          0.18(0.14+0.04)   0.19(0.17+0.02) +5.6%
>> 4000.2: log --raw -3000 (tree-only)   0.99(0.77+0.21)   0.98(0.78+0.20) -1.0%
>> 4000.3: log -p -3000 (Myers)          4.84(4.31+0.51)   4.81(4.15+0.64) -0.6%
>> 4000.4: log -p -3000 --histogram      6.34(5.86+0.46)   5.87(5.19+0.66) -7.4%
>> 4000.5: log -p -3000 --patience       5.39(4.60+0.76)   5.35(4.60+0.73) -0.7%
>>
>> Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk>
>> ---
>>   xdiff/xhistogram.c |  5 ++---
>>   xdiff/xprepare.c   | 24 ++++++++----------------
>>   2 files changed, 10 insertions(+), 19 deletions(-)
>>
>> diff --git a/xdiff/xhistogram.c b/xdiff/xhistogram.c
>> index e694bfd9e31..6c1c88a69a1 100644
>> --- a/xdiff/xhistogram.c
>> +++ b/xdiff/xhistogram.c
>> @@ -91,9 +91,8 @@ struct region {
>>   static int cmp_recs(xpparam_t const *xpp,
>>   	xrecord_t *r1, xrecord_t *r2)
>>   {
>> -	return r1->ha == r2->ha &&
>> -		xdl_recmatch(r1->ptr, r1->size, r2->ptr, r2->size,
>> -			    xpp->flags);
>> +	return r1->ha == r2->ha;
>> +
> 
> nit: stray newline.
> 
> The only meaningful change here is that you are relying entirely on
> the hash and not checking the content again. This means that hash
> collisions on this 32-bit hash could start introducing different
> results. Are we worried about that?

xdiff-interface.c limits the size of the file that can be passed to just 
below 1GB so we are safe. The other diff algorithms are already using 
this optimization. (the hash is 64 bits on most platforms, the xdiff 
code could really benefit from a unsigned long -> size_t cleanup)

> I see that a similar hash-comparison is done in xpatience.c without
> further checking the contents, but xdiffi.c compares the hashes and
> then checks with xdl_recmatch(). So, we are still not reaching full
> consistency across all diff algorithms with how we handle these
> comparisons. I think it is good to have at least one that can be used
> if/when we hit these hash collisions within a diff, but it can be hard
> to communicate to a user why they need to change a diff algorithm for
> such an internal reason.

I think that code in xdiffi.c is only used by the diff slider code that 
implements diff.indentHeuristic, the Myers diff implementation just 
compares the hash values. Before this change the diff slider code needed 
to do the full check to be correct when processing the output of the 
histogram algorithm.

> The following bits looked scary at first, but you are just removing the
> special-casing of XDF_HISTOGRAM_DIFF from the preparation stage.
> 
>> -	if (XDF_DIFF_ALG(xpp->flags) == XDF_HISTOGRAM_DIFF)
>> -		hbits = hsize = 0;
>> -	else {
>> -		hbits = xdl_hashbits((unsigned int) narec);
>> -		hsize = 1 << hbits;
>> -		if (!(rhash = (xrecord_t **) xdl_malloc(hsize * sizeof(xrecord_t *))))
>> -			goto abort;
>> -		memset(rhash, 0, hsize * sizeof(xrecord_t *));
>> -	}
>> +	hbits = xdl_hashbits((unsigned int) narec);
>> +	hsize = 1 << hbits;
>> +	if (!(rhash = (xrecord_t **) xdl_malloc(hsize * sizeof(xrecord_t *))))
>> +		goto abort;
>> +	memset(rhash, 0, hsize * sizeof(xrecord_t *));
> 
>> -			if ((XDF_DIFF_ALG(xpp->flags) != XDF_HISTOGRAM_DIFF) &&
>> -			    xdl_classify_record(pass, cf, rhash, hbits, crec) < 0)
>> +			if (xdl_classify_record(pass, cf, rhash, hbits, crec) < 0)
> 
>> -	if (XDF_DIFF_ALG(xpp->flags) != XDF_HISTOGRAM_DIFF &&
>> -	    xdl_init_classifier(&cf, enl1 + enl2 + 1, xpp->flags) < 0)
>> +	if (xdl_init_classifier(&cf, enl1 + enl2 + 1, xpp->flags) < 0)
> 
>> -	if (XDF_DIFF_ALG(xpp->flags) != XDF_HISTOGRAM_DIFF)
>> -		xdl_free_classifier(&cf);
>> +	xdl_free_classifier(&cf);
> 
> The existence of these conditions gave me pause, so I went to look for where they
> were inserted. They were made in 9f37c27 (xdiff/xprepare: skip classification,
> 2011-07-12) with the justification that
> 
>      We don't need any of that in histogram diff, so we omit calls to these
>      functions. We also skip allocating memory to the hash table, rhash, as
>      it is no longer used.
> 
>      This gives us a small boost in performance.
> 
> But you are actually _using_ these preparation steps, which means you are
> re-adding the cost of hashing but overall improving because you use the
> data correctly. Excellent.

Thanks for looking at this

Best Wishes

Phillip

> Thanks,
> -Stolee
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] diff histogram: intern strings
  2021-11-17 16:52     ` Phillip Wood
@ 2021-11-18 15:35       ` Johannes Schindelin
  2021-11-18 15:42         ` Jeff King
  0 siblings, 1 reply; 15+ messages in thread
From: Johannes Schindelin @ 2021-11-18 15:35 UTC (permalink / raw)
  To: Phillip Wood; +Cc: Derrick Stolee, Phillip Wood via GitGitGadget, git

Hi,

On Wed, 17 Nov 2021, Phillip Wood wrote:

> On 17/11/2021 15:55, Derrick Stolee wrote:
> >
> > On 11/17/2021 6:20 AM, Phillip Wood via GitGitGadget wrote:
> > > From: Phillip Wood <phillip.wood@dunelm.org.uk>
> > >
> > > Histogram is the only diff algorithm not to call
> > > xdl_classify_record(). xdl_classify_record() ensures that the hash
> > > values of two strings that are not equal differ which means that it is
> > > not necessary to use xdl_recmatch() when comparing lines, all that is
> > > necessary is to compare the hash values. This gives a 7% reduction in
> > > the runtime of "git log --patch" when using the histogram diff
> > > algorithm.
> > >
> > > Test                                  HEAD^             HEAD
> > > -----------------------------------------------------------------------------
> > > 4000.1: log -3000 (baseline)          0.18(0.14+0.04)   0.19(0.17+0.02)
> > > +5.6%
> > > 4000.2: log --raw -3000 (tree-only)   0.99(0.77+0.21)   0.98(0.78+0.20)
> > > -1.0%
> > > 4000.3: log -p -3000 (Myers)          4.84(4.31+0.51)   4.81(4.15+0.64)
> > > -0.6%
> > > 4000.4: log -p -3000 --histogram      6.34(5.86+0.46)   5.87(5.19+0.66)
> > > -7.4%
> > > 4000.5: log -p -3000 --patience       5.39(4.60+0.76)   5.35(4.60+0.73)
> > > -0.7%
> > >
> > > Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk>
> > > ---
> > >   xdiff/xhistogram.c |  5 ++---
> > >   xdiff/xprepare.c   | 24 ++++++++----------------
> > >   2 files changed, 10 insertions(+), 19 deletions(-)
> > >
> > > diff --git a/xdiff/xhistogram.c b/xdiff/xhistogram.c
> > > index e694bfd9e31..6c1c88a69a1 100644
> > > --- a/xdiff/xhistogram.c
> > > +++ b/xdiff/xhistogram.c
> > > @@ -91,9 +91,8 @@ struct region {
> > >   static int cmp_recs(xpparam_t const *xpp,
> > >   	xrecord_t *r1, xrecord_t *r2)
> > >   {
> > > -	return r1->ha == r2->ha &&
> > > -		xdl_recmatch(r1->ptr, r1->size, r2->ptr, r2->size,
> > > -			    xpp->flags);
> > > +	return r1->ha == r2->ha;
> > > +
> >
> > nit: stray newline.
> >
> > The only meaningful change here is that you are relying entirely on
> > the hash and not checking the content again. This means that hash
> > collisions on this 32-bit hash could start introducing different
> > results. Are we worried about that?
>
> xdiff-interface.c limits the size of the file that can be passed to just below
> 1GB so we are safe. The other diff algorithms are already using this
> optimization. (the hash is 64 bits on most platforms, the xdiff code could
> really benefit from a unsigned long -> size_t cleanup)

I think the really important thing to point out is that
`xdl_classify_record()` ensures that the `ha` attribute is different for
different text. AFAIR it even "linearizes" the `ha` values, i.e. they
won't be all over the place but start at 0 (or 1).

So no, I'm not worried about collisions. That would be a bug in
`xdl_classify_record()` and I think we would have caught this bug by now.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/3] xdiff: speedup histogram diff
  2021-11-17 11:20 [PATCH 0/3] xdiff: speedup histogram diff Phillip Wood via GitGitGadget
                   ` (2 preceding siblings ...)
  2021-11-17 11:20 ` [PATCH 3/3] xdiff: simplify comparison Phillip Wood via GitGitGadget
@ 2021-11-18 15:40 ` Johannes Schindelin
  3 siblings, 0 replies; 15+ messages in thread
From: Johannes Schindelin @ 2021-11-18 15:40 UTC (permalink / raw)
  To: Phillip Wood via GitGitGadget; +Cc: git, Phillip Wood

Hi Phillip,

On Wed, 17 Nov 2021, Phillip Wood via GitGitGadget wrote:

> Histogram is the only diff algorithm not to call xdl_classify_record().
> Calling xdl_classify_record() means that it is not necessary to use
> xdl_recmatch() when comparing lines, all that is necessary is to compare the
> hash values. This gives a 7% reduction in the runtime of "git log --patch"
> when using the histogram diff algorithm.

Thanks for this! I had a look over the patches (and refreshed my memory of
what `xdl_classify_record()` does), and they look good.

Having said that, I would love to increase my confidence by backing up the
patches with a sort of stress test. Do we have anything like that? I guess
not, a `git grep histogram t/` did not really turn up anything promising,
p4000 does not even validate the output...

Thanks,
Dscho

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] diff histogram: intern strings
  2021-11-18 15:35       ` Johannes Schindelin
@ 2021-11-18 15:42         ` Jeff King
  2021-11-19 10:05           ` Phillip Wood
  0 siblings, 1 reply; 15+ messages in thread
From: Jeff King @ 2021-11-18 15:42 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Phillip Wood, Derrick Stolee, Phillip Wood via GitGitGadget, git

On Thu, Nov 18, 2021 at 04:35:48PM +0100, Johannes Schindelin wrote:

> I think the really important thing to point out is that
> `xdl_classify_record()` ensures that the `ha` attribute is different for
> different text. AFAIR it even "linearizes" the `ha` values, i.e. they
> won't be all over the place but start at 0 (or 1).
> 
> So no, I'm not worried about collisions. That would be a bug in
> `xdl_classify_record()` and I think we would have caught this bug by now.

Ah, thanks for that explanation. That addresses my collision concern from
earlier in the thread completely.

-Peff

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] diff histogram: intern strings
  2021-11-18 15:42         ` Jeff King
@ 2021-11-19 10:05           ` Phillip Wood
  2021-11-19 14:45             ` Jeff King
  2021-11-19 15:49             ` Johannes Schindelin
  0 siblings, 2 replies; 15+ messages in thread
From: Phillip Wood @ 2021-11-19 10:05 UTC (permalink / raw)
  To: Jeff King, Johannes Schindelin
  Cc: Phillip Wood, Derrick Stolee, Phillip Wood via GitGitGadget, git

On 18/11/2021 15:42, Jeff King wrote:
> On Thu, Nov 18, 2021 at 04:35:48PM +0100, Johannes Schindelin wrote:
> 
>> I think the really important thing to point out is that
>> `xdl_classify_record()` ensures that the `ha` attribute is different for
>> different text. AFAIR it even "linearizes" the `ha` values, i.e. they
>> won't be all over the place but start at 0 (or 1).
>>
>> So no, I'm not worried about collisions. That would be a bug in
>> `xdl_classify_record()` and I think we would have caught this bug by now.
> 
> Ah, thanks for that explanation. That addresses my collision concern from
> earlier in the thread completely.

Yes, thanks for clarifying I should have been clearer in my reply to 
Stolee. The reason I was waffling on about file sizes is that there can 
only be a collision if there are more than 2^32 unique lines. I think 
the minimum file size where that happens is just below 10GB when one 
side of the diff has 2^31 lines and the other has 2^31 + 1 lines and all 
the lines are unique.

Best Wishes

Phillip

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] diff histogram: intern strings
  2021-11-19 10:05           ` Phillip Wood
@ 2021-11-19 14:45             ` Jeff King
  2021-11-19 21:22               ` Ævar Arnfjörð Bjarmason
  2021-11-19 15:49             ` Johannes Schindelin
  1 sibling, 1 reply; 15+ messages in thread
From: Jeff King @ 2021-11-19 14:45 UTC (permalink / raw)
  To: phillip.wood
  Cc: Johannes Schindelin, Derrick Stolee,
	Phillip Wood via GitGitGadget, git

On Fri, Nov 19, 2021 at 10:05:32AM +0000, Phillip Wood wrote:

> On 18/11/2021 15:42, Jeff King wrote:
> > On Thu, Nov 18, 2021 at 04:35:48PM +0100, Johannes Schindelin wrote:
> > 
> > > I think the really important thing to point out is that
> > > `xdl_classify_record()` ensures that the `ha` attribute is different for
> > > different text. AFAIR it even "linearizes" the `ha` values, i.e. they
> > > won't be all over the place but start at 0 (or 1).
> > > 
> > > So no, I'm not worried about collisions. That would be a bug in
> > > `xdl_classify_record()` and I think we would have caught this bug by now.
> > 
> > Ah, thanks for that explanation. That addresses my collision concern from
> > earlier in the thread completely.
> 
> Yes, thanks for clarifying I should have been clearer in my reply to Stolee.
> The reason I was waffling on about file sizes is that there can only be a
> collision if there are more than 2^32 unique lines. I think the minimum file
> size where that happens is just below 10GB when one side of the diff has
> 2^31 lines and the other has 2^31 + 1 lines and all the lines are unique.

Right, that makes more sense (and we are not likely to lift the 1GB
limit anytime soon; there are tons of 32-bit variables and potential
integer overflows all through the xdiff code).

It's probably worth explaining this a bit in the commit message.

I also, FWIW, found the subject confusing. I expected "intern" to refer
to keeping a single copy of some strings. Maybe:

  Subject: diff histogram: skip xdl_recmatch for comparing records

or something?

-Peff

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] diff histogram: intern strings
  2021-11-19 10:05           ` Phillip Wood
  2021-11-19 14:45             ` Jeff King
@ 2021-11-19 15:49             ` Johannes Schindelin
  1 sibling, 0 replies; 15+ messages in thread
From: Johannes Schindelin @ 2021-11-19 15:49 UTC (permalink / raw)
  To: Phillip Wood
  Cc: Jeff King, Derrick Stolee, Phillip Wood via GitGitGadget, git

Hi Phillip,

On Fri, 19 Nov 2021, Phillip Wood wrote:

> On 18/11/2021 15:42, Jeff King wrote:
> > On Thu, Nov 18, 2021 at 04:35:48PM +0100, Johannes Schindelin wrote:
> >
> > > I think the really important thing to point out is that
> > > `xdl_classify_record()` ensures that the `ha` attribute is different
> > > for different text. AFAIR it even "linearizes" the `ha` values, i.e.
> > > they won't be all over the place but start at 0 (or 1).
> > >
> > > So no, I'm not worried about collisions. That would be a bug in
> > > `xdl_classify_record()` and I think we would have caught this bug by
> > > now.
> >
> > Ah, thanks for that explanation. That addresses my collision concern
> > from earlier in the thread completely.
>
> Yes, thanks for clarifying I should have been clearer in my reply to
> Stolee. The reason I was waffling on about file sizes is that there can
> only be a collision if there are more than 2^32 unique lines. I think
> the minimum file size where that happens is just below 10GB when one
> side of the diff has 2^31 lines and the other has 2^31 + 1 lines and all
> the lines are unique.

Indeed, and as you pointed out, we already refuse to generate diffs for
such large amounts of data.

(For what it's worth, I totally agree with punting on such large data, it
would also take too long a time to generate diffs on such large data to be
reasonable.)

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] diff histogram: intern strings
  2021-11-19 14:45             ` Jeff King
@ 2021-11-19 21:22               ` Ævar Arnfjörð Bjarmason
  2021-11-19 22:19                 ` Jeff King
  0 siblings, 1 reply; 15+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-11-19 21:22 UTC (permalink / raw)
  To: Jeff King
  Cc: phillip.wood, Johannes Schindelin, Derrick Stolee,
	Phillip Wood via GitGitGadget, git


On Fri, Nov 19 2021, Jeff King wrote:

> On Fri, Nov 19, 2021 at 10:05:32AM +0000, Phillip Wood wrote:
>
>> On 18/11/2021 15:42, Jeff King wrote:
>> > On Thu, Nov 18, 2021 at 04:35:48PM +0100, Johannes Schindelin wrote:
>> > 
>> > > I think the really important thing to point out is that
>> > > `xdl_classify_record()` ensures that the `ha` attribute is different for
>> > > different text. AFAIR it even "linearizes" the `ha` values, i.e. they
>> > > won't be all over the place but start at 0 (or 1).
>> > > 
>> > > So no, I'm not worried about collisions. That would be a bug in
>> > > `xdl_classify_record()` and I think we would have caught this bug by now.
>> > 
>> > Ah, thanks for that explanation. That addresses my collision concern from
>> > earlier in the thread completely.
>> 
>> Yes, thanks for clarifying I should have been clearer in my reply to Stolee.
>> The reason I was waffling on about file sizes is that there can only be a
>> collision if there are more than 2^32 unique lines. I think the minimum file
>> size where that happens is just below 10GB when one side of the diff has
>> 2^31 lines and the other has 2^31 + 1 lines and all the lines are unique.
>
> Right, that makes more sense (and we are not likely to lift the 1GB
> limit anytime soon; there are tons of 32-bit variables and potential
> integer overflows all through the xdiff code).

Interestingly:
    
    $ du -sh 8gb*
    8.1G    8gb
    8.1G    8gb.cp
    $ ~/g/git/git -P -c core.bigFileThreshold=10g diff -U0 --no-index --no-color-moved 2gb 2gb.cp
    diff --git a/8gb b/8gb.cp
    index a886cdfe5ce..4965a132d44 100644
    --- a/8gb
    +++ b/8gb.cp
    @@ -17,0 +18 @@ more
    +blah

And the only change I made was:
    
    diff --git a/xdiff-interface.c b/xdiff-interface.c
    index 75b32aef51d..cb8ca5f5d0a 100644
    --- a/xdiff-interface.c
    +++ b/xdiff-interface.c
    @@ -117,9 +117,6 @@ int xdi_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp, xdemitconf_t co
            mmfile_t a = *mf1;
            mmfile_t b = *mf2;
     
    -       if (mf1->size > MAX_XDIFF_SIZE || mf2->size > MAX_XDIFF_SIZE)
    -               return -1;
    -
            if (!xecfg->ctxlen && !(xecfg->flags & XDL_EMIT_FUNCCONTEXT))
                    trim_common_tail(&a, &b);

Perhaps we're being overly concervative with these hardcoded limits, at
least on some platforms? This is Linux x86_64.

I understand from skimming the above that it's about the pathological
case, these two files are the same except for a trailer at the end.

I wonder how far you could get with #define int size_t & the like ... :)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] diff histogram: intern strings
  2021-11-19 21:22               ` Ævar Arnfjörð Bjarmason
@ 2021-11-19 22:19                 ` Jeff King
  0 siblings, 0 replies; 15+ messages in thread
From: Jeff King @ 2021-11-19 22:19 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: phillip.wood, Johannes Schindelin, Derrick Stolee,
	Phillip Wood via GitGitGadget, git

On Fri, Nov 19, 2021 at 10:22:04PM +0100, Ævar Arnfjörð Bjarmason wrote:

> > Right, that makes more sense (and we are not likely to lift the 1GB
> > limit anytime soon; there are tons of 32-bit variables and potential
> > integer overflows all through the xdiff code).
> 
> Interestingly:
>     
>     $ du -sh 8gb*
>     8.1G    8gb
>     8.1G    8gb.cp
>     $ ~/g/git/git -P -c core.bigFileThreshold=10g diff -U0 --no-index --no-color-moved 2gb 2gb.cp
>     diff --git a/8gb b/8gb.cp
>     index a886cdfe5ce..4965a132d44 100644
>     --- a/8gb
>     +++ b/8gb.cp
>     @@ -17,0 +18 @@ more
>     +blah
> 
> And the only change I made was:
>     
>     diff --git a/xdiff-interface.c b/xdiff-interface.c
>     index 75b32aef51d..cb8ca5f5d0a 100644
>     --- a/xdiff-interface.c
>     +++ b/xdiff-interface.c
>     @@ -117,9 +117,6 @@ int xdi_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp, xdemitconf_t co
>             mmfile_t a = *mf1;
>             mmfile_t b = *mf2;
>      
>     -       if (mf1->size > MAX_XDIFF_SIZE || mf2->size > MAX_XDIFF_SIZE)
>     -               return -1;
>     -
>             if (!xecfg->ctxlen && !(xecfg->flags & XDL_EMIT_FUNCCONTEXT))
>                     trim_common_tail(&a, &b);
> 
> Perhaps we're being overly concervative with these hardcoded limits, at
> least on some platforms? This is Linux x86_64.

It's been a couple of years since I looked, but I'm fairly certain there
are triggerable heap overflows. You probably need fewer than 2^31 lines,
but more than 2^30, as that will overflow the size computation of an
array whose elements are themselves 32-bit integers.

For instance, this:

  perl -e 'print "x\n" x (2**30 + 10)'  >gigaline
  cp gigaline gigaline.cp
  echo foo >>gigaline

results in:

  $ git.compile -c core.bigfilethreshold=10g --no-pager diff --no-index gigaline gigaline.cp
  fatal: Out of memory, malloc failed (tried to allocate 18446744056529682432 bytes)

so at some point we went negative with our allocation (and then it was
cast to size_t when we passed it xmalloc). There's probably a value
somewhere in the middle where it wraps but stays positive, and you'd get
a heap overflow.

> I understand from skimming the above that it's about the pathological
> case, these two files are the same except for a trailer at the end.

The real danger here is not producing a wrong answer for some dumb
cases, but introducing an exploitable heap overflow. Switching to
size_t, or at the very least using st_mult(), etc, everywhere in xdiff
would help. I looked at that long ago, but eventually decided it was
safer and less work to just stick the 1GB limit, since it practice
nobody really cares about diffing beyond that level. (And the limit is
really about number of lines, but 1GB of bytes is an easy proxy for
that).

It would be OK for somebody to fix it if they really want bigger diffs,
but I think it has to be done carefully.

-Peff

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2021-11-19 22:19 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-17 11:20 [PATCH 0/3] xdiff: speedup histogram diff Phillip Wood via GitGitGadget
2021-11-17 11:20 ` [PATCH 1/3] diff histogram: intern strings Phillip Wood via GitGitGadget
2021-11-17 15:55   ` Derrick Stolee
2021-11-17 16:46     ` Jeff King
2021-11-17 16:52     ` Phillip Wood
2021-11-18 15:35       ` Johannes Schindelin
2021-11-18 15:42         ` Jeff King
2021-11-19 10:05           ` Phillip Wood
2021-11-19 14:45             ` Jeff King
2021-11-19 21:22               ` Ævar Arnfjörð Bjarmason
2021-11-19 22:19                 ` Jeff King
2021-11-19 15:49             ` Johannes Schindelin
2021-11-17 11:20 ` [PATCH 2/3] xdiff: avoid unnecessary memory allocations Phillip Wood via GitGitGadget
2021-11-17 11:20 ` [PATCH 3/3] xdiff: simplify comparison Phillip Wood via GitGitGadget
2021-11-18 15:40 ` [PATCH 0/3] xdiff: speedup histogram diff Johannes Schindelin

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).