git@vger.kernel.org mailing list mirror (one of many)
* Git blame performance on files with a lot of history
@ 2018-12-14 18:29 Clement Moyroud
  2018-12-14 19:10 ` Bryan Turner
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Clement Moyroud @ 2018-12-14 18:29 UTC (permalink / raw)
  To: git

Hello,

My group at work is migrating a CVS repo to Git. The biggest issue we
face so far is the performance of git blame, especially compared to
CVS on the same file. One file especially causes us trouble: it's a
30k lines file with 25 years of history in 3k+ commits. The complete
repo has 200k+ commits over that same period of time.

Currently, 'cvs annotate' takes 2.7 seconds, while 'git blame'
(without -M nor -C) takes 145s.

I tried using the commit-graph with the Bloom filter, per
https://public-inbox.org/git/61559c5b-546e-d61b-d2e1-68de692f5972@gmail.com/.
No dice:
    > time GIT_TEST_BLOOM_FILTERS=1 /wv/cmoyroud/calibre-src/git-bloom-filters/git-bloom-bin/bin/git commit-graph write --reachable
    Annotating commits in commit graph: 573705, done.
    Computing commit graph generation numbers: 100% (286441/286441), done.
    Computing commit diff Bloom filters: 100% (286441/286441), done.
    GIT_TEST_BLOOM_FILTERS=1  commit-graph write --reachable  386.80s user 31.78s system 78% cpu 8:53.87 total
    > time GIT_TEST_BLOOM_FILTERS=1 GIT_TRACE_BLOOM_FILTER=2 GIT_USE_POC_BLOOM_FILTER=y /path/to/git blame master -- important/file.C > /tmp/foo.compiler.bloom
    Blaming lines: 100% (33179/33179), done.
    GIT_TEST_BLOOM_FILTERS=1 GIT_TRACE_BLOOM_FILTER=2 GIT_USE_POC_BLOOM_FILTER=y   145.11s user 0.97s system 99% cpu 2:26.22 total
    > time /path/to/git blame master -- important/file.C > /tmp/foo.compiler.nobloom
    Blaming lines: 100% (33179/33179), done.
    GIT_TEST_BLOOM_FILTERS=1 GIT_TEST_BLOOM_FILTERS=1 GIT_USE_POC_BLOOM_FILTER=y   141.69s user 0.77s system 99% cpu 2:22.56 total

I used Derrick Stolee's tree at
https://github.com/derrickstolee/git/tree/bloom/stolee

Looking at the blame code, it does not seem to be able to use the
commit graph, so I tried the same rev-list command from the e-mail,
using my own file:
    > GIT_TRACE_BLOOM_FILTER=2 GIT_USE_POC_BLOOM_FILTER=y /path/to/git rev-list --count --full-history HEAD -- important/file.C
    3576

No trace information there either. Running 'strings' on the binary
reports the env. variable names, so I'm not totally crazy. Let me know
if I tried the right thing :)
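
For readers new to the idea, here is a minimal sketch of how a per-commit
Bloom filter over changed paths can answer "did this commit possibly touch
this file?". It is only an illustration: the class name, the SHA-256 based
hashing and the sizes are assumptions made for the example, not the hashing
or on-disk format used by the prototype branch.

```python
import hashlib


class PathBloomFilter:
    """Toy Bloom filter over file paths (hypothetical, not Git's format)."""

    def __init__(self, num_bits=1024, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, path):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{path}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, path):
        for pos in self._positions(path):
            self.bits |= 1 << pos

    def might_contain(self, path):
        # False means "definitely not"; True only means "maybe".
        return all((self.bits >> pos) & 1 for pos in self._positions(path))


# One filter per commit, filled with the paths that commit changed.
changed_by_commit = PathBloomFilter()
for changed_path in ["Makefile", "src/main.C", "important/file.C"]:
    changed_by_commit.add(changed_path)

print(changed_by_commit.might_contain("important/file.C"))  # True: "maybe"
print(changed_by_commit.might_contain("docs/README"))
```

A "definitely not" answer lets a history walk skip the tree diff for that
commit; a "maybe" still has to be checked with a real diff, which is where
false positives cost time.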

Looks like blame performance is gonna be the biggest issue for us, so
I'm really interested in seeing improvements there. Let me know if
there's anything else I can try.

Cheers,

Clément

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Git blame performance on files with a lot of history
  2018-12-14 18:29 Git blame performance on files with a lot of history Clement Moyroud
@ 2018-12-14 19:10 ` Bryan Turner
  2018-12-17 20:43   ` Clement Moyroud
  2018-12-14 21:31 ` Derrick Stolee
  2018-12-14 22:48 ` Ævar Arnfjörð Bjarmason
  2 siblings, 1 reply; 7+ messages in thread
From: Bryan Turner @ 2018-12-14 19:10 UTC (permalink / raw)
  To: clement.moyroud; +Cc: Git Users

On Fri, Dec 14, 2018 at 10:29 AM Clement Moyroud
<clement.moyroud@gmail.com> wrote:
>
> Hello,
>
> My group at work is migrating a CVS repo to Git. The biggest issue we
> face so far is the performance of git blame, especially compared to
> CVS on the same file. One file especially causes us trouble: it's a
> 30k lines file with 25 years of history in 3k+ commits. The complete
> repo has 200k+ commits over that same period of time.

After you converted the repository from CVS to Git, did you run a manual repack?

The process of converting a repository from another SCM often results
in poor delta chain selections which result in a repository that's
unnecessarily large on disk, and/or performs quite slowly.

Something like `git repack -Adf --depth=50 --window=200` discards the
existing delta chains and chooses new ones, and may result in
significantly improved performance. A smaller depth, like --depth=20,
might result in even more performance improvement, but may also make
the repository larger on disk; you'll need to find the balance that
works for you.
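
The depth trade-off can be sketched with a toy model. It assumes every
version of one file is stored as a full object followed by at most
`max_depth` deltas, which is a deliberate simplification: real packing picks
delta bases with size and window heuristics and caches reconstructed bases,
so only the trend is meaningful. The function name is hypothetical and 3576
is the commit count for the file reported earlier in the thread.

```python
import math


def chain_model(num_versions, max_depth):
    """Toy model: full bases, each followed by up to max_depth deltas."""
    chain_len = max_depth + 1  # one full object plus its deltas
    full_objects = math.ceil(num_versions / chain_len)
    # A version at position p in its chain needs p delta applications.
    total_applications = sum(i % chain_len for i in range(num_versions))
    return full_objects, total_applications / num_versions


for depth in (50, 20):
    full, avg = chain_model(3576, depth)
    print(f"--depth={depth}: {full} full objects, "
          f"{avg:.1f} delta applications per read on average")
```

In this model a smaller depth stores more full objects (more disk) but needs
fewer delta applications to rebuild any one version (less CPU per read).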

Might be something worth testing, if you haven't?

Bryan

* Re: Git blame performance on files with a lot of history
  2018-12-14 18:29 Git blame performance on files with a lot of history Clement Moyroud
  2018-12-14 19:10 ` Bryan Turner
@ 2018-12-14 21:31 ` Derrick Stolee
  2018-12-17 20:59   ` Clement Moyroud
  2018-12-14 22:48 ` Ævar Arnfjörð Bjarmason
  2 siblings, 1 reply; 7+ messages in thread
From: Derrick Stolee @ 2018-12-14 21:31 UTC (permalink / raw)
  To: Clement Moyroud, git

On 12/14/2018 1:29 PM, Clement Moyroud wrote:
> My group at work is migrating a CVS repo to Git. The biggest issue we
> face so far is the performance of git blame, especially compared to
> CVS on the same file. One file especially causes us trouble: it's a
> 30k lines file with 25 years of history in 3k+ commits. The complete
> repo has 200k+ commits over that same period of time.

I think the 30k lines are a bigger issue than the 200k+ commits. I'm
not terribly familiar with the blame code, though.

> Currently, 'cvs annotate' takes 2.7 seconds, while 'git blame'
> (without -M nor -C) takes 145s.
>
> I tried using the commit-graph with the Bloom filter, per
> https://public-inbox.org/git/61559c5b-546e-d61b-d2e1-68de692f5972@gmail.com/.

Thanks for the interest in this prototype feature. Sorry that it doesn't
appear to help you in this case. Blame support should definitely be a
follow-up once that feature is polished to production quality.

> Looking at the blame code, it does not seem to be able to use the
> commit graph, so I tried the same rev-list command from the e-mail,
> using my own file:
>      > GIT_TRACE_BLOOM_FILTER=2 GIT_USE_POC_BLOOM_FILTER=y
> /path/to/git rev-list --count --full-history HEAD -- important/file.C
>      3576
>
Please double-check that you have the 'core.commitGraph' config setting 
enabled, or you will not read the commit-graph at run-time:

     git config core.commitGraph true

I see that the commit introducing GIT_TRACE_BLOOM_FILTER [1] does 
nothing if the commit-graph is not loaded.

Thanks,
-Stolee

[1] https://github.com/derrickstolee/git/commit/adc469894b755512c9d02f099700ead2a7a78377

* Re: Git blame performance on files with a lot of history
  2018-12-14 18:29 Git blame performance on files with a lot of history Clement Moyroud
  2018-12-14 19:10 ` Bryan Turner
  2018-12-14 21:31 ` Derrick Stolee
@ 2018-12-14 22:48 ` Ævar Arnfjörð Bjarmason
  2018-12-17 20:30   ` Clement Moyroud
  2 siblings, 1 reply; 7+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-12-14 22:48 UTC (permalink / raw)
  To: Clement Moyroud; +Cc: git


On Fri, Dec 14 2018, Clement Moyroud wrote:

> My group at work is migrating a CVS repo to Git. The biggest issue we
> face so far is the performance of git blame, especially compared to
> CVS on the same file. One file especially causes us trouble: it's a
> 30k lines file with 25 years of history in 3k+ commits. The complete
> repo has 200k+ commits over that same period of time.

There's a real-world repo with a shape & size very similar to this that
has good performance, gcc.git: https://github.com/gcc-mirror/gcc

    $ wc -l ChangeLog
    20240 ChangeLog
    $ git log --oneline -- ChangeLog | wc -l
    2676
    $ git log --oneline | wc -l
    165309
    $ time git blame ChangeLog >/dev/null

    real    0m1.977s
    user    0m1.909s
    sys     0m0.069s

Its history began in 1997, and the changes to the ChangeLog file are, by
its nature, fairly evenly spread through that period.

So check out that repo to see if you have similar or worse
performance. Does your work repo show the same problem with a history
produced with 'git fast-export --anonymize', and if so is that something
you'd be OK with sharing?

* Re: Git blame performance on files with a lot of history
  2018-12-14 22:48 ` Ævar Arnfjörð Bjarmason
@ 2018-12-17 20:30   ` Clement Moyroud
  0 siblings, 0 replies; 7+ messages in thread
From: Clement Moyroud @ 2018-12-17 20:30 UTC (permalink / raw)
  To: avarab; +Cc: git

On Fri, Dec 14, 2018 at 2:48 PM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
>
> On Fri, Dec 14 2018, Clement Moyroud wrote:
>
> > My group at work is migrating a CVS repo to Git. The biggest issue we
> > face so far is the performance of git blame, especially compared to
> > CVS on the same file. One file especially causes us trouble: it's a
> > 30k lines file with 25 years of history in 3k+ commits. The complete
> > repo has 200k+ commits over that same period of time.
>
> There's a real-world repo with a shape & size very similar to this that
> has good performance, gcc.git: https://github.com/gcc-mirror/gcc
>
>     $ wc -l ChangeLog
>     20240 ChangeLog
>     $ git log --oneline -- ChangeLog | wc -l
>     2676
>     $ git log --oneline | wc -l
>     165309
>     $ time git blame ChangeLog >/dev/null
>
>     real    0m1.977s
>     user    0m1.909s
>     sys     0m0.069s
>
> Its history began in 1997, and the changes to the ChangeLog file are, by
> its nature, fairly evenly spread through that period.
>
> So check out that repo to see if you have similar or worse
> performance. Does your work repo show the same problem with a history
> produced with 'git fast-export --anonymize', and if so is that something
> you'd be OK with sharing?

Hi Ævar,

I see around 3s here on the GCC repo, but I'm on a VM and the repo is
cloned on an NFS disk, so I'd say it matches :) It's around 45x faster
than my repo, on the same NFS share and VM. So there's definitely
something to improve here on my end (see my reply to Bryan re: repack
in a separate e-mail).

The anonymized export won't work in that case: all file contents are
replaced with 'anonymous blob <n>', so there's no per-line history for
blame to follow. Let me see if I can post-process a non-anonymized
version to keep the relevant data available.
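
One possible shape for such post-processing, as a rough sketch only: replace
each line with an opaque token so that identical lines stay identical and
the line structure is preserved. The helper names and the salt are
hypothetical, and wiring this into an export stream is left out.

```python
import hashlib


def scrub_line(line, secret=b"hypothetical-salt"):
    """Replace a line with an opaque token; equal lines get equal tokens."""
    digest = hashlib.sha256(secret + line.encode("utf-8")).hexdigest()
    return digest[:16]


def scrub_blob(text):
    # Keep the line structure so line-level diffs between versions survive.
    return "\n".join(scrub_line(line) for line in text.split("\n"))


old = "int a;\nint b;\nreturn a + b;"
new = "int a;\nint c;\nreturn a + b;"
old_s = scrub_blob(old).split("\n")
new_s = scrub_blob(new).split("\n")
# Only the edited middle line differs after scrubbing.
print([i for i, (x, y) in enumerate(zip(old_s, new_s)) if x != y])  # [1]
```

Because unchanged lines map to the same token in every version, blame would
still have the same per-line history to follow.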

Cheers,

Clément

* Re: Git blame performance on files with a lot of history
  2018-12-14 19:10 ` Bryan Turner
@ 2018-12-17 20:43   ` Clement Moyroud
  0 siblings, 0 replies; 7+ messages in thread
From: Clement Moyroud @ 2018-12-17 20:43 UTC (permalink / raw)
  To: bturner; +Cc: git

On Fri, Dec 14, 2018 at 11:10 AM Bryan Turner
<bturner@atlassian.com> wrote:
>
> After you converted the repository from CVS to Git, did you run a manual repack?
>
> The process of converting a repository from another SCM often results
> in poor delta chain selections which result in a repository that's
> unnecessarily large on disk, and/or performs quite slowly.
>

Yep, I did a repack, using 'git repack -A -d --pack-kept-objects'.
Without it, things would be even worse on NFS, because of all the small
objects Git would have to go through.

> Something like `git repack -Adf --depth=50 --window=200` discards the
> existing delta chains and chooses new ones, and may result in
> significantly improved performance. A smaller depth, like --depth=20,
> might result in even more performance improvement, but may also make
> the repository larger on disk; you'll need to find the balance that
> works for you.
>

I re-ran with 'git repack -Adf --depth=20 --window=200' and that did
help quite a bit:
  > time git blame master --  important/file.C > /tmp/foo
  Blaming lines: 100% (33179/33179), done.
  git blame master -- important/file.C > /tmp/foo 50.70s user 0.55s system 99% cpu 51.298 total

That's roughly a 3x improvement, great. I just need another 10x one
and we'll be in business :) Based on experiments with the Bloom filter,
maybe that'll help enough to get our users on board.
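
For reference, the arithmetic behind those ratios, using only timings quoted
in this thread. Note the assumption: the two blame figures are user CPU
times, while the thread does not say how the 'cvs annotate' figure was
measured, so the second ratio is only indicative.

```python
blame_before = 145.11  # seconds, user time before the aggressive repack
blame_after = 50.70    # seconds, user time after repack -Adf --depth=20
cvs_annotate = 2.7     # seconds, 'cvs annotate' on the same file

repack_speedup = blame_before / blame_after
remaining_gap = blame_after / cvs_annotate

print(f"repack speedup: {repack_speedup:.2f}x")
print(f"still slower than cvs annotate by: {remaining_gap:.1f}x")
```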

Cheers,

Clément

* Re: Git blame performance on files with a lot of history
  2018-12-14 21:31 ` Derrick Stolee
@ 2018-12-17 20:59   ` Clement Moyroud
  0 siblings, 0 replies; 7+ messages in thread
From: Clement Moyroud @ 2018-12-17 20:59 UTC (permalink / raw)
  To: stolee; +Cc: git

On Fri, Dec 14, 2018 at 1:31 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> Please double-check that you have the 'core.commitGraph' config setting
> enabled, or you will not read the commit-graph at run-time:
>
>      git config core.commitGraph true
>

Yeah, this is what happens when trying too many things at once :( I
had removed it to get with/without scores, and forgot to re-enable it
before trying my last set of experiments.
Here are the results with it enabled:
> time GIT_TRACE_BLOOM_FILTER=2 GIT_USE_POC_BLOOM_FILTER=y /path/to/git rev-list --count --full-history HEAD -- important/file.C
10:32:06.665057 revision.c:483          bloom filter total queries: 286363 definitely not: 234605 maybe: 51758 false positives: 48212 fp ratio: 0.168360
GIT_TRACE_BLOOM_FILTER=2 GIT_USE_POC_BLOOM_FILTER=y  rev-list --count HEAD -  2.62s user 0.14s system 97% cpu 2.830 total
> time /path/to/git rev-list --count --full-history HEAD -- ic/lv/src/iclv/drc_compiler.C
3576
/path/to/git rev-list      8.86s user 0.15s system 99% cpu 9.031 total

So I'm getting a 3x benefit, not bad! This is on the re-repacked repo,
which is why I ran again with and without the Bloom filter.
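
The trace counters can be cross-checked with a few lines of arithmetic. This
is a sketch under one assumption, inferred from the numbers themselves
rather than from the prototype's source: that "fp ratio" is false positives
divided by total queries.

```python
total_queries = 286363
definitely_not = 234605
maybe = 51758
false_positives = 48212

# Every query is answered either "definitely not" or "maybe".
assert definitely_not + maybe == total_queries

fp_ratio = false_positives / total_queries      # compare with the trace
true_hits = maybe - false_positives             # "maybe" answers that held
skipped_share = definitely_not / total_queries  # queries settled by filter

speedup = 8.86 / 2.62  # user seconds without / with the filter

print(f"fp ratio {fp_ratio:.6f}, true hits {true_hits}, "
      f"skipped {skipped_share:.1%}, speedup {speedup:.2f}x")
```

Under that assumption the computed ratio agrees with the traced 0.168360,
and roughly four out of five queries were settled by the filter alone.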

Let's see what this does for blame:
> time GIT_TRACE_BLOOM_FILTER=2 GIT_USE_POC_BLOOM_FILTER=y /path/to/git blame master -- important/file.C > /tmp/foo
Blaming lines: 100% (33179/33179), done.
12:50:42.703522 revision.c:483          bloom filter total queries: 0 definitely not: 0 maybe: 0 false positives: 0 fp ratio: -nan
GIT_TRACE_BLOOM_FILTER=2 GIT_USE_POC_BLOOM_FILTER=y  blame master -- >   132.59s user 2.15s system 99% cpu 2:14.95 total

Seems like it's not implemented for blame operations. I'll be happy to
test any implementation.

Take care,

Clément
