git@vger.kernel.org mailing list mirror (one of many)
From: Jeff King <peff@peff.net>
To: Paolo Bonzini <paolo.bonzini@gmail.com>
Cc: Junio C Hamano <gitster@pobox.com>, John Bito <jwbito@gmail.com>,
	git <git@vger.kernel.org>
Subject: Re: git diff looping?
Date: Wed, 17 Jun 2009 06:23:33 -0400
Message-ID: <20090617102332.GA32353@coredump.intra.peff.net>
In-Reply-To: <4A38AD5D.6010404@gmail.com>

On Wed, Jun 17, 2009 at 10:46:21AM +0200, Paolo Bonzini wrote:

> 2) make sure that at least one space/tab is eaten on all but the last  
> occurrence of the repeated subexpression.  To this end the LHS of {2,} is 
> duplicated, once with [ \t]+ and once with [ \t]*.  The repetition itself 
> becomes a + since the last occurrence is now separately handled:
>
> ^[ \t]*(([A-Za-z_][A-Za-z_0-9]*[ \t]+)+[A-Za-z_][A-Za-z_0-9]*
> [ \t]*\([^;]*)$

Thanks, I can confirm that this is _much_ faster. Here are some timings
from my Solaris 8 box for the "git diff v0.4.0" case using the system
and compat engines, and using three regexes: the original that git is
using now, an updated one with your regex above[1] replacing the second
line of the stock pattern, and a baseline regex of "." which should take
virtually no time at all.

  system,  orig: infinite
  system, paolo:   2.5s
  system,   ".":   0.6s
  compat,  orig: 288.0s
  compat, paolo:   1.5s
  compat,   ".":   0.6s

So the system engine goes from infinite to 2.5s, which still means
spending about 3 times as long matching funcname regexes (roughly 1.9s)
as actually calculating the diff (0.6s). The compat library is a little
better on Paolo's regex, but still chokes pretty badly on the original
one.
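
For anyone who wants to poke at a single pattern outside of git, here is
a minimal sketch of a timing helper built on POSIX regcomp()/regexec().
It is not how the numbers above were produced (those are whole "git
diff" runs); the helper name time_match and the variable paolo_re are
made up for illustration, and only Paolo's pattern is included because
the stock pattern is not quoted in this message.

```c
#include <regex.h>
#include <stddef.h>
#include <time.h>

/*
 * Paolo's pattern from the quoted text, written as a C string literal.
 * Note the doubled backslash before the parenthesis.
 */
static const char paolo_re[] =
	"^[ \t]*(([A-Za-z_][A-Za-z_0-9]*[ \t]+)+[A-Za-z_][A-Za-z_0-9]*"
	"[ \t]*\\([^;]*)$";

/*
 * Hypothetical helper: compile `pattern` as a POSIX extended regex and
 * match it once against `line`.
 *
 * Returns 1 on match, 0 on no match, -1 if the pattern does not compile.
 * If `seconds` is non-NULL, it receives the CPU time spent in regexec().
 */
static int time_match(const char *pattern, const char *line, double *seconds)
{
	regex_t re;
	clock_t start;
	int ret;

	if (regcomp(&re, pattern, REG_EXTENDED))
		return -1;

	start = clock();
	ret = regexec(&re, line, 0, NULL, 0);
	if (seconds)
		*seconds = (double)(clock() - start) / CLOCKS_PER_SEC;

	regfree(&re);
	return ret == 0;
}
```

Calling time_match(paolo_re, line, &secs) on each line of a file and
summing secs gives a rough per-pattern cost. Linking the same file
against the system library and against compat/regex would be one way to
compare engines, though a single match is far too short to time
reliably, so any real measurement would need many lines or repetitions.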

Let's compare compat to the glibc implementation on my Debian box:

  system,  orig:   0.22s
  system, paolo:   0.22s
  system,   ".":   0.15s
  compat,  orig: 150.88s
  compat, paolo:   0.43s
  compat,   ".":   0.15s

Besides the exponential behavior on the original regex, compat is still
about twice as slow as the system library on Paolo's regex (0.43s versus
0.22s).

So I think there are three possible optimizations worth considering:

  1. Replace the builtin diff.java.xfuncname pattern with what Paolo
     suggested (though I haven't verified its correctness beyond a
     cursory look at the results). This is easy to do, and will help
     people with crappy system regex libraries and people on
     compat/regex/ (right now just mingw) a _lot_. The downside is that
     it's a little harder to read the regex, but not terribly so.

  2. Recommend NO_REGEX for people with slow system regex libraries.
     This is also easy to do, and will help people even if we do (1) for
     two reasons:

       a. we process user-defined regexes through diff.*.xfuncname
          patterns, as well as through "git grep"; so we are protecting
          against poor performance when they give us a complex regex

       b. even on more reasonable regexps like Paolo's, we seem to get a
          2:1 speedup over the Solaris system library

  3. Replace compat/regex with something faster. It still produces
     exponential behavior in complex cases where glibc does not, and it
     seems to be only about half as fast on Paolo's regex (0.43s versus
     0.22s in the Debian timings above).

     I haven't looked at how large or how portable the glibc
     implementation is. Another alternative is that we could provide a
     simple compat/ as now, and have better support for linking against
     an external library like pcre, if it is available.

-Peff

[1] Note if you are cutting and pasting Paolo's regex into the C code,
    the "\(" needs to be "\\(", which I screwed up in my initial
    timings. :)
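
    To make the escaping point concrete, here is a small sketch (the
    helper name ere_matches and the two pattern variables are made up
    for illustration). The regex engine has to see a backslash followed
    by a parenthesis, so the C source needs two backslashes; with the
    backslash lost, the engine sees a plain group instead of a literal
    parenthesis. Most compilers also warn about "\(" as an unknown
    escape sequence.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 if the POSIX extended regex `pattern`
 * matches somewhere in `text`, 0 if it does not, -1 on compile error.
 */
static int ere_matches(const char *pattern, const char *text)
{
	regex_t re;
	int ret;

	if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB))
		return -1;
	ret = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return ret == 0;
}

/* The engine sees  f\(x\)  and matches the literal text "f(x)". */
static const char literal_parens[] = "f\\(x\\)";

/* The engine sees  f(x)  which is "f" followed by the group "x". */
static const char group_parens[] = "f(x)";
```

    With these, ere_matches(literal_parens, "f(x)") matches, while
    ere_matches(group_parens, "f(x)") does not, because the second
    pattern is looking for the two characters "fx".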
