git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Johannes Schindelin <Johannes.Schindelin@gmx.de>
To: Phillip Wood <phillip.wood@dunelm.org.uk>
Cc: Derrick Stolee <stolee@gmail.com>,
	Phillip Wood via GitGitGadget <gitgitgadget@gmail.com>,
	git@vger.kernel.org
Subject: Re: [PATCH 1/3] diff histogram: intern strings
Date: Thu, 18 Nov 2021 16:35:48 +0100 (CET)	[thread overview]
Message-ID: <nycvar.QRO.7.76.6.2111181631260.11028@tvgsbejvaqbjf.bet> (raw)
In-Reply-To: <88eaee89-4536-fba4-3aa0-c3693f58eae0@gmail.com>

Hi,

On Wed, 17 Nov 2021, Phillip Wood wrote:

> On 17/11/2021 15:55, Derrick Stolee wrote:
> >
> > On 11/17/2021 6:20 AM, Phillip Wood via GitGitGadget wrote:
> > > From: Phillip Wood <phillip.wood@dunelm.org.uk>
> > >
> > > Histogram is the only diff algorithm not to call
> > > xdl_classify_record(). xdl_classify_record() ensures that the hash
> > > values of two strings that are not equal differ which means that it is
> > > not necessary to use xdl_recmatch() when comparing lines, all that is
> > > necessary is to compare the hash values. This gives a 7% reduction in
> > > the runtime of "git log --patch" when using the histogram diff
> > > algorithm.
> > >
> > > Test                                  HEAD^             HEAD
> > > -----------------------------------------------------------------------------
> > > 4000.1: log -3000 (baseline)          0.18(0.14+0.04)   0.19(0.17+0.02)
> > > +5.6%
> > > 4000.2: log --raw -3000 (tree-only)   0.99(0.77+0.21)   0.98(0.78+0.20)
> > > -1.0%
> > > 4000.3: log -p -3000 (Myers)          4.84(4.31+0.51)   4.81(4.15+0.64)
> > > -0.6%
> > > 4000.4: log -p -3000 --histogram      6.34(5.86+0.46)   5.87(5.19+0.66)
> > > -7.4%
> > > 4000.5: log -p -3000 --patience       5.39(4.60+0.76)   5.35(4.60+0.73)
> > > -0.7%
> > >
> > > Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk>
> > > ---
> > >   xdiff/xhistogram.c |  5 ++---
> > >   xdiff/xprepare.c   | 24 ++++++++----------------
> > >   2 files changed, 10 insertions(+), 19 deletions(-)
> > >
> > > diff --git a/xdiff/xhistogram.c b/xdiff/xhistogram.c
> > > index e694bfd9e31..6c1c88a69a1 100644
> > > --- a/xdiff/xhistogram.c
> > > +++ b/xdiff/xhistogram.c
> > > @@ -91,9 +91,8 @@ struct region {
> > >   static int cmp_recs(xpparam_t const *xpp,
> > >   	xrecord_t *r1, xrecord_t *r2)
> > >   {
> > > -	return r1->ha == r2->ha &&
> > > -		xdl_recmatch(r1->ptr, r1->size, r2->ptr, r2->size,
> > > -			    xpp->flags);
> > > +	return r1->ha == r2->ha;
> > > +
> >
> > nit: stray newline.
> >
> > The only meaningful change here is that you are relying entirely on
> > the hash and not checking the content again. This means that hash
> > collisions on this 32-bit hash could start introducing different
> > results. Are we worried about that?
>
> xdiff-interface.c limits the size of the file that can be passed to just below
> 1GB so we are safe. The other diff algorithms are already using this
> optimization. (the hash is 64 bits on most platforms, the xdiff code could
> really benefit from a unsigned long -> size_t cleanup)

I think the really important thing to point out is that
`xdl_classify_record()` ensures that the `ha` attribute is different for
different text. AFAIR it even "linearizes" the `ha` values, i.e. they
won't be all over the place but start at 0 (or 1).

So no, I'm not worried about collisions. That would be a bug in
`xdl_classify_record()` and I think we would have caught this bug by now.

Ciao,
Dscho

  reply	other threads:[~2021-11-18 15:36 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-11-17 11:20 [PATCH 0/3] xdiff: speedup histogram diff Phillip Wood via GitGitGadget
2021-11-17 11:20 ` [PATCH 1/3] diff histogram: intern strings Phillip Wood via GitGitGadget
2021-11-17 15:55   ` Derrick Stolee
2021-11-17 16:46     ` Jeff King
2021-11-17 16:52     ` Phillip Wood
2021-11-18 15:35       ` Johannes Schindelin [this message]
2021-11-18 15:42         ` Jeff King
2021-11-19 10:05           ` Phillip Wood
2021-11-19 14:45             ` Jeff King
2021-11-19 21:22               ` Ævar Arnfjörð Bjarmason
2021-11-19 22:19                 ` Jeff King
2021-11-19 15:49             ` Johannes Schindelin
2021-11-17 11:20 ` [PATCH 2/3] xdiff: avoid unnecessary memory allocations Phillip Wood via GitGitGadget
2021-11-17 11:20 ` [PATCH 3/3] xdiff: simplify comparison Phillip Wood via GitGitGadget
2021-11-18 15:40 ` [PATCH 0/3] xdiff: speedup histogram diff Johannes Schindelin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=nycvar.QRO.7.76.6.2111181631260.11028@tvgsbejvaqbjf.bet \
    --to=johannes.schindelin@gmx.de \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=phillip.wood@dunelm.org.uk \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).