git@vger.kernel.org mailing list mirror (one of many)
* git grep performance regression on macOS
From: Benjamin Hiller @ 2023-09-29 23:56 UTC
  To: git

What did you do before the bug happened? (Steps to reproduce your issue)

git grep seems to have gotten much slower as of git 2.39 on macOS for
complex extended regexes.
We noticed this because git secrets --scan was running much more
slowly for some people on our team, and eventually realized that it
was due to them using a newer version of git. git secrets runs a git
grep command with an extended regex (this is a somewhat simplified
version of the command, but still shows the performance issue):

git grep -E "(A3T[A-Z0-9]|AKIA|AGPA|AIDA|AROA|AIPA|ANPA|ANVA|ASIA)[A-Z0-9]{16}|(\"|')?(AWS|aws|Aws)?_?(SECRET|secret|Secret)?_?(ACCESS|access|Access)?_?(KEY|key|Key)(\"|')?\s*(:|=>|=)\s*(\"|')?[A-Za-z0-9/\+=]{40}(\"|')?|(\"|')?(AWS|aws|Aws)?_?(ACCOUNT|account|Account)_?(ID|id|Id)?(\"|')?\s*(:|=>|=)\s*(\"|')?[0-9]{4}\-?[0-9]{4}\-?[0-9]{4}(\"|')?"

What did you expect to happen? (Expected behavior)
With git 2.38, that command took under half a second to run on a large repo.
Using the git (https://github.com/git/git) repo as an example, it took
0.2s on my laptop.

What happened instead? (Actual behavior)
After 2.39, it now takes over 40 seconds on my laptop with the git repo!

What's different between what you expected and what actually happened?
The command runs much more slowly, though it still does return the
correct result.

Anything else you want to add:
I confirmed that the performance regression was first introduced in
2.39. Additionally, I saw that reverting the change to Makefile from
https://github.com/git/git/commit/1819ad327b7a1f19540a819813b70a0e8a7f798f
fixed the performance regression and the git grep command went back to
taking <1 second. That seems to indicate that switching from Git's
regex library to the native macOS regex library caused this
performance regression, but I haven't investigated beyond that to see
why the native macOS regex library is so much slower.
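
A rough sketch of how the two builds could be timed against each other. The helper name `elapsed_seconds` is hypothetical. It is also only an assumption, not confirmed in this thread, that a build made with `make NO_REGEX=YesPlease` approximates reverting the Makefile change.

```shell
# Hypothetical helper: print the whole wall-clock seconds a command takes.
# The command's own output is discarded so that only the timing is shown.
elapsed_seconds() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}
```

Running `elapsed_seconds git grep -E "$pattern"` once with the stock binary and once with a locally built one, in the same repository, should be enough to see a gap of the size reported here. Whole-second resolution is coarse, but it separates 0.2s from 40s.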


[System Info]
git version:
git version 2.42.0
cpu: arm64
no commit associated with this build
sizeof-long: 8
sizeof-size_t: 8
shell-path: /bin/sh
feature: fsmonitor--daemon
uname: Darwin 22.4.0 Darwin Kernel Version 22.4.0: Mon Mar  6 21:00:41
PST 2023; root:xnu-8796.101.5~3/RELEASE_ARM64_T8103 arm64
compiler info: clang: 14.0.3 (clang-1403.0.22.14.1)
libc info: no libc information available
$SHELL (typically, interactive shell): /bin/zsh


[Enabled Hooks]
post-checkout
post-merge
pre-commit
pre-push


* Re: git grep performance regression on macOS
From: Junio C Hamano @ 2023-09-30  5:45 UTC
  To: Benjamin Hiller; +Cc: git

Benjamin Hiller <benhiller@gmail.com> writes:

> git grep seems to have gotten much slower as of git 2.39 on macOS for
> complex extended regexes.

> I confirmed that the performance regression was first introduced in
> 2.39. Additionally, I saw that reverting the change to Makefile from
> https://github.com/git/git/commit/1819ad327b7a1f19540a819813b70a0e8a7f798f
> fixed the performance regression and the git grep command went back to
> taking <1 second. That seems to indicate that switching from Git's
> regex library to the native macOS regex library caused this
> performance regression, but I haven't investigated beyond that to see
> why the native macOS regex library is so much slower.

Yeah, that does sound like a plausible explanation.

The regexp code in compat/ is meant as a fallback implementation for
platforms whose regexp library lacks certain features we take
advantage of, but it has the limitation that it is not Unicode aware.
In the olden days, the regexp library on macOS lacked the REG_STARTEND
feature, which forced us to use NO_REGEX (hence the fallback
implementation we ship that is not Unicode aware).  The commit you
cite makes us use the macOS native regexp library, as somebody on
the platform got annoyed enough by the lack of Unicode awareness of
the fallback implementation, and also noticed that REG_STARTEND is
supported by the macOS native regexp library these days.

The change in 2.39 was unfortunately about correctness.  It would
have been nicer if the macOS native implementation were faster, but
using the fallback implementation would favor "performance" (which
produces incorrect results "faster" when run on multi-byte
strings) over correctness, so a straight revert of the commit is
not likely to be a good idea.
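
To illustrate the kind of multi-byte mismatch described above, here is a minimal sketch. It uses the system `grep` under `LC_ALL=C` as a stand-in for a byte-oriented matcher; it does not exercise Git's compat/regex code, and the function name `count_byte_matches` is hypothetical.

```shell
# Hypothetical helper: feed one line holding a single two-byte UTF-8
# character (U+00E9, "e" with acute accent) to a byte-oriented matcher
# and print how many lines match the given extended regex.
count_byte_matches() {
    # \303\251 is the UTF-8 encoding of U+00E9; LC_ALL=C makes grep
    # treat the input as bytes rather than characters.
    # grep exits non-zero when nothing matches, so "|| true" keeps the
    # helper's own exit status at zero while the count is still printed.
    printf '\303\251\n' | LC_ALL=C grep -c -E "$1" || true
}
```

With a typical grep, `count_byte_matches '^.$'` would be expected to report no matching line and `count_byte_matches '^..$'` one, because the single character is seen as two bytes. A Unicode-aware matcher in a UTF-8 locale would treat it as one character.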



* Re: git grep performance regression on macOS
From: Carlo Marcelo Arenas Bel'on @ 2023-10-02  3:05 UTC
  To: Benjamin Hiller; +Cc: git

On Fri, Sep 29, 2023 at 04:56:19PM -0700, Benjamin Hiller wrote:
> 
> git grep -E "(A3T[A-Z0-9]|AKIA|AGPA|AIDA|AROA|AIPA|ANPA|ANVA|ASIA)[A-Z0-9]{16}|(\"|')?(AWS|aws|Aws)?_?(SECRET|secret|Secret)?_?(ACCESS|access|Access)?_?(KEY|key|Key)(\"|')?\s*(:|=>|=)\s*(\"|')?[A-Za-z0-9/\+=]{40}(\"|')?|(\"|')?(AWS|aws|Aws)?_?(ACCOUNT|account|Account)_?(ID|id|Id)?(\"|')?\s*(:|=>|=)\s*(\"|')?[0-9]{4}\-?[0-9]{4}\-?[0-9]{4}(\"|')?"

Changing this command to use `git grep -P` instead will make it at
least 7x faster, even if your pcre2 library was built without JIT.
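
A minimal sketch of that switch, assuming a simplified pattern (only the first alternative of the original regex) and a hypothetical wrapper name, `grep_access_key_ids`. It tries `-P` first and falls back to `-E` when the git build reports an error, which is what a build without PCRE2 support would be expected to do.

```shell
# Hypothetical wrapper: search tracked files for strings shaped like AWS
# access key IDs, preferring the PCRE engine over extended regexes.
grep_access_key_ids() {
    # Simplified pattern: first alternative of the regex in the report.
    pattern='(A3T[A-Z0-9]|AKIA|AGPA|AIDA|AROA|AIPA|ANPA|ANVA|ASIA)[A-Z0-9]{16}'
    status=0
    git grep -P -e "$pattern" 2>/dev/null || status=$?
    if [ "$status" -gt 1 ]; then
        # Status 1 means "no match"; anything higher is treated as an
        # error (for example no PCRE2 support), so retry with -E.
        status=0
        git grep -E -e "$pattern" || status=$?
    fi
    return "$status"
}
```

The fallback keeps the scan working everywhere, at the cost of the slower `-E` path on builds that lack PCRE2.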

Carlo
