Poor performance of git describe in big repos

git@vger.kernel.org mailing list mirror (one of many)

* Poor performance of git describe in big repos
@ 2013-05-30 10:38 Alex Bennée
  2013-05-30 11:33 ` Ramkumar Ramachandra
  2013-05-30 11:48 ` John Keeping
  0 siblings, 2 replies; 41+ messages in thread
From: Alex Bennée @ 2013-05-30 10:38 UTC (permalink / raw)
  To: git

Hi,

I'm a fairly heavy user of the magit Emacs extension for interacting
with my git repos. However I've noticed there are some cases where lag
is very high. By analysing strace output of emacs calling git I found
two commands that were particularly problematic when interrogating
the repo:

11:00 ajb@sloy/x86_64 [work.git] >time /usr/bin/git --no-pager describe --long --tags
ajb-build-test-5224-10-gfa296e6

real    0m5.016s
user    0m4.364s
sys     0m0.444s

11:34 ajb@sloy/x86_64 [work.git] >time /usr/bin/git --no-pager describe --contains HEAD
fatal: cannot describe 'fa296e61f549a1252a65a13b2f734d7afbc7e88e'

real    0m4.805s
user    0m4.388s
sys     0m0.400s

Running the first command with the --debug flag on gives:

11:34 ajb@sloy/x86_64 [work.git] >time /usr/bin/git --no-pager describe --long --tags --debug
searching to describe HEAD
 lightweight       10 ajb-build-test-5224
 lightweight       41 ajb-build-test-5222
 annotated        146 vnms-2-1-36-32
 annotated        155 vnms-2-1-36-31
 annotated        174 vnms-2-1-36-30
 annotated        183 vnms-2-1-36-29
 lightweight      188 vnms-2-1-36-28
 annotated        193 vnms-2-1-36-27
 annotated        206 vnms-2-1-36-26
 annotated        215 vectastar-4-2-83-5
traversed 223 commits
more than 10 tags found; listed 10 most recent
gave up search at 2b69df72d47be8440e3ce4cee91b9b7ceaf8b77c
ajb-build-test-5224-10-gfa296e6

real    0m4.817s
user    0m4.320s
sys     0m0.464s

Which has only traversed 223 commits before coming to a decision. This
seems like a very low number of commits given the time it's spent doing
this.

One factor might be the size of my repo (.git is around 2.4G). Could
this just be due to computational cost of searching through large
packs to walk the commit chain? Is there any way to make this easier
for git to do?

-- 
Alex, homepage: http://www.bennee.com/~alex/

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Poor performance of git describe in big repos
  2013-05-30 10:38 Poor performance of git describe in big repos Alex Bennée
@ 2013-05-30 11:33 ` Ramkumar Ramachandra
  2013-05-30 13:09   ` Alex Bennée
  2013-05-30 11:48 ` John Keeping
  1 sibling, 1 reply; 41+ messages in thread
From: Ramkumar Ramachandra @ 2013-05-30 11:33 UTC (permalink / raw)
  To: Alex Bennée; +Cc: git

Alex Bennée wrote:
>>time /usr/bin/git --no-pager
> traversed 223 commits
>
> real    0m4.817s
> user    0m4.320s
> sys     0m0.464s

I'm quite clueless about why it is taking this long: I think it's IO
because there's nothing to compute?  I really can't trace anything
unless you can reproduce it on a public repository.  On linux.git with
my rotating hard disk:

$ time git describe --debug --long --tags HEAD~10000
searching to describe HEAD~10000
 annotated       5445 v2.6.33
 annotated       5660 v2.6.33-rc8
 annotated       5884 v2.6.33-rc7
 annotated       6140 v2.6.33-rc6
 annotated       6467 v2.6.33-rc5
 annotated       6999 v2.6.33-rc4
 annotated       7430 v2.6.33-rc3
 annotated       7746 v2.6.33-rc2
 annotated       8212 v2.6.33-rc1
 annotated      13854 v2.6.32
traversed 18895 commits
more than 10 tags found; listed 10 most recent
gave up search at 648f4e3e50c4793d9dbf9a09afa193631f76fa26
v2.6.33-5445-ge7c84ee

real    0m0.509s
user    0m0.470s
sys     0m0.037s

18k+ commits traversed in half a second here, so I really don't know
what is going on.


* Re: Poor performance of git describe in big repos
  2013-05-30 10:38 Poor performance of git describe in big repos Alex Bennée
  2013-05-30 11:33 ` Ramkumar Ramachandra
@ 2013-05-30 11:48 ` John Keeping
  2013-05-30 12:29   ` Alex Bennée
  2013-05-30 13:16   ` Alex Bennée
  1 sibling, 2 replies; 41+ messages in thread
From: John Keeping @ 2013-05-30 11:48 UTC (permalink / raw)
  To: Alex Bennée; +Cc: git

On Thu, May 30, 2013 at 11:38:32AM +0100, Alex Bennée wrote:
> One factor might be the size of my repo (.git is around 2.4G). Could
> this just be due to computational cost of searching through large
> packs to walk the commit chain? Is there any way to make this easier
> for git to do?

What does "git count-objects -v" say for your repository?

You may find that performance improves if you repack with "git gc
--aggressive".


* Re: Poor performance of git describe in big repos
  2013-05-30 11:48 ` John Keeping
@ 2013-05-30 12:29   ` Alex Bennée
  2013-05-30 13:20     ` Duy Nguyen
  2013-05-30 13:16   ` Alex Bennée
  1 sibling, 1 reply; 41+ messages in thread
From: Alex Bennée @ 2013-05-30 12:29 UTC (permalink / raw)
  To: John Keeping; +Cc: git

The repo is a fairly hairy one as it includes two historically
unrelated but content-related repos which I'm in the process of
cherry-picking stuff across.

11:58 ajb@sloy/x86_64 [work.git] >git count-objects -v
count: 493
size: 4572
in-pack: 399307
packs: 1
size-pack: 1930755
prune-packable: 0
garbage: 0
size-garbage: 0

This was after a repack, which did have a slight negative effect on
performance. The pack file is:

13:27 ajb@sloy/x86_64 [work.git] >ls -lh ./.git/objects/pack/*
-r--r--r-- 1 ajb cvs  11M May 30 11:56 ./.git/objects/pack/pack-a9ba133a6f25ffa74c3c407e09ab030f8745b201.idx
-r--r--r-- 1 ajb cvs 1.9G May 30 11:56 ./.git/objects/pack/pack-a9ba133a6f25ffa74c3c407e09ab030f8745b201.pack

I ran perf on it and the top items in the report were:

 41.58%   git  libcrypto.so.1.0.0  [.] 0x6ae73
 33.96%   git  libz.so.1.2.3.4     [.] 0xe0ec
 10.39%   git  libz.so.1.2.3.4     [.] adler32
  2.03%   git  [kernel.kallsyms]   [k] clear_page_c

So I'm guessing it's spending a lot of non-cache efficient time
un-packing and processing the deltas?

--
Alex.


* Re: Poor performance of git describe in big repos
  2013-05-30 11:33 ` Ramkumar Ramachandra
@ 2013-05-30 13:09   ` Alex Bennée
  2013-05-30 14:32     ` Ramkumar Ramachandra
  2013-05-30 15:33     ` Thomas Rast
  0 siblings, 2 replies; 41+ messages in thread
From: Alex Bennée @ 2013-05-30 13:09 UTC (permalink / raw)
  To: Ramkumar Ramachandra; +Cc: git

It looks like it's a file caching effect combined with my repo being
more pathological in size and contents. Note run 1 (cold) vs run 2 on
the linux file tree:

13:52 ajb@sloy/x86_64 [linux.git] >time git describe --debug --long --tags HEAD~10000
searching to describe HEAD~10000
 annotated         57 v2.6.34-rc2
 annotated       1688 v2.6.34-rc1
 annotated       7932 v2.6.33
 annotated       8157 v2.6.33-rc8
 annotated       8381 v2.6.33-rc7
 annotated       8637 v2.6.33-rc6
 annotated       8964 v2.6.33-rc5
 annotated       9493 v2.6.33-rc4
 annotated       9912 v2.6.33-rc3
 annotated      10202 v2.6.33-rc2
traversed 10547 commits
more than 10 tags found; listed 10 most recent
gave up search at 55639353a0035052d9ea6cfe4dde0ac7fcbb2c9f
v2.6.34-rc2-57-gef5da59

real    0m7.332s
user    0m0.308s
sys     0m0.244s
14:03 ajb@sloy/x86_64 [linux.git] >time git describe --debug --long --tags HEAD~10000
searching to describe HEAD~10000
 annotated         57 v2.6.34-rc2
 annotated       1688 v2.6.34-rc1
 annotated       7932 v2.6.33
 annotated       8157 v2.6.33-rc8
 annotated       8381 v2.6.33-rc7
 annotated       8637 v2.6.33-rc6
 annotated       8964 v2.6.33-rc5
 annotated       9493 v2.6.33-rc4
 annotated       9912 v2.6.33-rc3
 annotated      10202 v2.6.33-rc2
traversed 10547 commits
more than 10 tags found; listed 10 most recent
gave up search at 55639353a0035052d9ea6cfe4dde0ac7fcbb2c9f
v2.6.34-rc2-57-gef5da59

real    0m0.298s
user    0m0.244s
sys     0m0.036s

Although the perf profile looks subtly different.

First through the linux tree:

 22.35%   git  libz.so.1.2.3.4    [.] inflate
 18.56%   git  libz.so.1.2.3.4    [.] inflate_fast
 17.48%   git  libz.so.1.2.3.4    [.] inflate_table
  7.84%   git  git                [.] hashcmp
  3.93%   git  git                [.] get_sha1_hex
  3.46%   git  libz.so.1.2.3.4    [.] adler32

And through my "special" repo:

 41.58%   git  libcrypto.so.1.0.0  [.] sha1_block_data_order_ssse3
 33.62%   git  libz.so.1.2.3.4     [.] inflate_fast
 10.39%   git  libz.so.1.2.3.4     [.] adler32
  2.03%   git  [kernel.kallsyms]   [k] clear_page_c

 I'm not sure why libcrypto features so highly in the results


 --
 Alex.


* Re: Poor performance of git describe in big repos
  2013-05-30 11:48 ` John Keeping
  2013-05-30 12:29   ` Alex Bennée
@ 2013-05-30 13:16   ` Alex Bennée
  1 sibling, 0 replies; 41+ messages in thread
From: Alex Bennée @ 2013-05-30 13:16 UTC (permalink / raw)
  To: John Keeping; +Cc: git

> You may find that performance improves if you repack with "git gc
> --aggressive".

That seems to increase the time it takes to get to where it wants to:

14:12 ajb@sloy/x86_64 [work.git] >time /usr/bin/git --no-pager describe --long --tags --debug
searching to describe HEAD
 lightweight       10 ajb-build-test-5224
 lightweight       41 ajb-build-test-5222
 annotated        146 vnms-2-1-36-32
 annotated        155 vnms-2-1-36-31
 annotated        174 vnms-2-1-36-30
 annotated        183 vnms-2-1-36-29
 lightweight      188 vnms-2-1-36-28
 annotated        193 vnms-2-1-36-27
 annotated        206 vnms-2-1-36-26
 annotated        215 vectastar-4-2-83-5
traversed 223 commits
more than 10 tags found; listed 10 most recent
gave up search at 2b69df72d47be8440e3ce4cee91b9b7ceaf8b77c
ajb-build-test-5224-10-gfa296e6

real    0m14.658s
user    0m12.845s
sys     0m1.776s


* Re: Poor performance of git describe in big repos
  2013-05-30 12:29   ` Alex Bennée
@ 2013-05-30 13:20     ` Duy Nguyen
       [not found]       ` <CAJ-05NPacjAEC99Ntd9eMnTD9_PMMYFob-_tAx5CeSB79TkRSg@mail.gmail.com>
  0 siblings, 1 reply; 41+ messages in thread
From: Duy Nguyen @ 2013-05-30 13:20 UTC (permalink / raw)
  To: Alex Bennée; +Cc: John Keeping, Git Mailing List

On Thu, May 30, 2013 at 7:29 PM, Alex Bennée <kernel-hacker@bennee.com> wrote:
> I ran perf on it and the top items in the report were:
>
>  41.58%   git  libcrypto.so.1.0.0  [.] 0x6ae73
>  33.96%   git  libz.so.1.2.3.4     [.] 0xe0ec
>  10.39%   git  libz.so.1.2.3.4     [.] adler32
>   2.03%   git  [kernel.kallsyms]   [k] clear_page_c
>
> So I'm guessing it's spending a lot of non-cache efficient time
> un-packing and processing the deltas?

If I'm not mistaken, commits are never deltified. They are usually
small and packed close together for better I/O patterns. If you really
just read hundreds of commits, it can't take that long. Maybe some
code paths accidentally open a tree, a blob or something..

Can you try setting core.logpackaccess to a path and rerun
describe? Judging from the code (I never actually tried it), it'll
create a file at the given path with the accessed pack offsets. You
can check what offset corresponds to what object with verify-pack -v.
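
A rough sketch of that workflow, assuming the access log ends each line
with a pack offset (e.g. "<pack path> <offset>") and that the fifth column
of "verify-pack -v" output is the object's offset in the pack.
summarize_pack_access is a hypothetical helper name and the paths in the
usage comment are placeholders. It counts how many logged accesses landed
on each object type, which should show whether describe is touching blobs
or trees as well as commits:

```shell
# Hypothetical helper: count logged pack accesses per object type.
#   $1: access log, assumed to end each line with a pack offset
#   $2: saved output of `git verify-pack -v <pack>.idx`
summarize_pack_access() {
    awk '
        # First file (verify-pack output): remember the type stored at each
        # offset. Object lines start with a hex object name; the summary
        # lines at the end of the output do not.
        NR == FNR {
            if ($1 ~ /^[0-9a-f]+$/ && NF >= 5) type[$5] = $2
            next
        }
        # Second file (access log): look up the offset found in the last field.
        {
            t = ($NF in type) ? type[$NF] : "unknown"
            count[t]++
        }
        END { for (t in count) print count[t], t }
    ' "$2" "$1" | sort -rn
}

# Usage (placeholder paths):
#   git verify-pack -v .git/objects/pack/pack-<id>.idx > /tmp/verify.txt
#   summarize_pack_access /tmp/log-pack.txt /tmp/verify.txt
```

Newer Git releases document a GIT_TRACE_PACK_ACCESS environment variable
for the same kind of log. Because the helper only reads the last field of
each line, it should in principle cope with that output too.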
--
Duy


* Re: Poor performance of git describe in big repos
       [not found]       ` <CAJ-05NPacjAEC99Ntd9eMnTD9_PMMYFob-_tAx5CeSB79TkRSg@mail.gmail.com>
@ 2013-05-30 13:45         ` Duy Nguyen
  2013-05-30 14:02           ` Alex Bennée
  0 siblings, 1 reply; 41+ messages in thread
From: Duy Nguyen @ 2013-05-30 13:45 UTC (permalink / raw)
  To: Alex Bennée; +Cc: John Keeping, Git Mailing List

On Thu, May 30, 2013 at 8:34 PM, Alex Bennée <kernel-hacker@bennee.com> wrote:
> From the following run:
>
>
> 14:31 ajb@sloy/x86_64 [work.git] >time /usr/bin/git --no-pager describe --long --tags
> ajb-build-test-5224-11-g9660048
>
> real    0m14.720s
> user    0m12.985s
> sys     0m1.700s
> 14:31 ajb@sloy/x86_64 [work.git] >wc -l /tmp/log-pack.txt
> 1610 /tmp/log-pack.txt
>
> The pack has been "tuned" with a gc --aggressive. Assuming the numbers
> are offsets into the pack it looks fairly random access until the last
> 100 or so.
>
> [snipped]

Thanks. Can you share "verify-pack -v" output of
pack-a9ba133a6f25ffa74c3c407e09ab030f8745b201.pack? I think you need
to put it somewhere on Internet temporarily as it's likely to exceed
git@vger limits.
--
Duy


* Re: Poor performance of git describe in big repos
  2013-05-30 13:45         ` Duy Nguyen
@ 2013-05-30 14:02           ` Alex Bennée
  0 siblings, 0 replies; 41+ messages in thread
From: Alex Bennée @ 2013-05-30 14:02 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: John Keeping, Git Mailing List

On 30 May 2013 14:45, Duy Nguyen <pclouds@gmail.com> wrote:
> On Thu, May 30, 2013 at 8:34 PM, Alex Bennée <kernel-hacker@bennee.com> wrote:
> <snip>
> Thanks. Can you share "verify-pack -v" output of
> pack-a9ba133a6f25ffa74c3c407e09ab030f8745b201.pack? I think you need
> to put it somewhere on Internet temporarily as it's likely to exceed
> git@vger limits.
> --
> Duy

http://www.bennee.com/~alex/stuff/git-pack-access.tar.bz2

--
Alex, homepage: http://www.bennee.com/~alex/


* Re: Poor performance of git describe in big repos
  2013-05-30 13:09   ` Alex Bennée
@ 2013-05-30 14:32     ` Ramkumar Ramachandra
  2013-05-30 15:01       ` Alex Bennée
  2013-05-30 15:33     ` Thomas Rast
  1 sibling, 1 reply; 41+ messages in thread
From: Ramkumar Ramachandra @ 2013-05-30 14:32 UTC (permalink / raw)
  To: Alex Bennée; +Cc: git

Alex Bennée wrote:
> And through my "special" repo:
>
>  41.58%   git  libcrypto.so.1.0.0  [.] sha1_block_data_order_ssse3
>  33.62%   git  libz.so.1.2.3.4     [.] inflate_fast
>  10.39%   git  libz.so.1.2.3.4     [.] adler32
>   2.03%   git  [kernel.kallsyms]   [k] clear_page_c
>
>  I'm not sure why libcrypto features so highly in the results

While Duy churns on the delta chain, let me try to make a (rather
crude) observation:

What does it mean for libcrypto to be so high in your perf report?
sha1_block_data_order is ultimately called by object.c:parse_object.  While
it indicates that deltas are taking a long time to apply (or are
somehow not optimally organized for IO), I think it indicates either:

1. Your history is very deep and there is an unusually high number of
deltas for each blob.  What is the total number of commits?

2. You have huge (binary) files checked into your repository.  Do
you?  If so, why isn't the code in streaming.c kicking in?


* Re: Poor performance of git describe in big repos
  2013-05-30 14:32     ` Ramkumar Ramachandra
@ 2013-05-30 15:01       ` Alex Bennée
  2013-05-30 15:17         ` Ramkumar Ramachandra
  0 siblings, 1 reply; 41+ messages in thread
From: Alex Bennée @ 2013-05-30 15:01 UTC (permalink / raw)
  To: Ramkumar Ramachandra; +Cc: Git Mailing List

On 30 May 2013 15:32, Ramkumar Ramachandra <artagnon@gmail.com> wrote:
> Alex Bennée wrote:
>> And through my "special" repo:
>>
>>  41.58%   git  libcrypto.so.1.0.0  [.] sha1_block_data_order_ssse3
>>  33.62%   git  libz.so.1.2.3.4     [.] inflate_fast
>>  10.39%   git  libz.so.1.2.3.4     [.] adler32
>>   2.03%   git  [kernel.kallsyms]   [k] clear_page_c
>>
>>  I'm not sure why libcrypto features so highly in the results
>
> While Duy churns on the delta chain, let me try to make a (rather
> crude) observation:
>
> What does it mean for libcrypto to be so high in your perf report?
> sha1_block_data_order is ultimately called by object.c:parse_object.  While
> it indicates that deltas are taking a long time to apply (or are
> somehow not optimally organized for IO), I think it indicates either:
>
> 1. Your history is very deep and there is an unusually high number of
> deltas for each blob.  What is the total number of commits?

Well, the history does encompass about 10 years of product development
and has a high number of files in the repo (including about 3 copies of
the kernel - sans upstream history).

15:50 ajb@sloy/x86_64 [work.git] >time git log --pretty=oneline | wc -l
24648

real    0m0.434s
user    0m0.388s
sys     0m0.112s

Although it doesn't take too long to walk the whole mainline history
(obviously ignoring all the other branches).

15:52 ajb@sloy/x86_64 [work.git] >git count-objects -v -H
count: 581
size: 5.09 MiB
in-pack: 399307
packs: 1
size-pack: 1.49 GiB
prune-packable: 0
garbage: 0
size-garbage: 0 bytes

It is a pick repo. The gc --aggressive nearly took out my machine keeping
around 4gb resident for most of the half hour and using nearly 8gb of VM.

Of course most of the history is not needed for day to day stuff. Maybe
if I split the pack files up it wouldn't be quite such a strain to work
through them?

> 2. You have huge (binary) files checked into your repository.  Do
> you?  If so, why isn't the code in streaming.c kicking in?

We do have some binary blobs in the repository (mainly DSP and FPGA images)
although not a huge number:

15:58 ajb@sloy/x86_64 [work.git] >time git log --pretty=oneline -- xxx xxx/xxxxxx/*.out ./xxx/xxx/*.out ./xxx/xxxxxxx/*.out | wc -l
234

real    0m0.590s
user    0m0.552s
sys     0m0.040s

How can I tell if streaming is kicking in or not?


-- 
Alex, homepage: http://www.bennee.com/~alex/


* Re: Poor performance of git describe in big repos
  2013-05-30 15:01       ` Alex Bennée
@ 2013-05-30 15:17         ` Ramkumar Ramachandra
  0 siblings, 0 replies; 41+ messages in thread
From: Ramkumar Ramachandra @ 2013-05-30 15:17 UTC (permalink / raw)
  To: Alex Bennée; +Cc: Git Mailing List

Alex Bennée wrote:
> 15:50 ajb@sloy/x86_64 [work.git] >time git log --pretty=oneline | wc -l
> 24648
>
> real    0m0.434s
> user    0m0.388s
> sys     0m0.112s
>
> Although it doesn't take too long to walk the whole mainline history
> (obviously ignoring all the other branches).

Damn, non-starter.  linux.git has 361k+ commits in mainline history.

Nit: use git rev-list --count HEAD next time.

> 15:52 ajb@sloy/x86_64 [work.git] >git count-objects -v -H
> count: 581
> size: 5.09 MiB
> in-pack: 399307
> packs: 1
> size-pack: 1.49 GiB
> prune-packable: 0
> garbage: 0
> size-garbage: 0 bytes

linux.git has 2.9m+ in-pack.  The pack-size is much lower at about
800+ MiB, but I don't think 1.49 GiB is a problem in itself.  Looking
forward to your big-files report to see why it's so big.

> It is a pick repo. The gc --aggressive nearly took out my machine keeping
> around 4gb resident for most of the half hour and using nearly 8gb of VM.
>
> Of course most of the history is not needed for day to day stuff. Maybe
> if I split the pack files up it wouldn't be quite such a strain to work
> through them?

Really out of my depth here, sorry.  Let's see what Duy (or the
others) have to say.

>> 2. You have huge (binary) files checked into your repository.  Do
>> you?  If so, why isn't the code in streaming.c kicking in?
>
> We do have some binary blobs in the repository (mainly DSP and FPGA images)
> although not a huge number:
>
> 15:58 ajb@sloy/x86_64 [work.git] >time git log --pretty=oneline -- xxx xxx/xxxxxx/*.out ./xxx/xxx/*.out ./xxx/xxxxxxx/*.out | wc -l
> 234
>
> real    0m0.590s
> user    0m0.552s
> sys     0m0.040s

log is streaming, and is not a good measure: it doesn't even walk the
entire commit graph.  How big are these files?

> How can I tell if streaming is kicking in or not?

I use callgrind (and kcachegrind to visualize).  Can you post
callgrind output?  It will be helpful in figuring out where exactly
git is spending time.


* Re: Poor performance of git describe in big repos
  2013-05-30 13:09   ` Alex Bennée
  2013-05-30 14:32     ` Ramkumar Ramachandra
@ 2013-05-30 15:33     ` Thomas Rast
  2013-05-30 16:01       ` Alex Bennée
  1 sibling, 1 reply; 41+ messages in thread
From: Thomas Rast @ 2013-05-30 15:33 UTC (permalink / raw)
  To: Alex Bennée; +Cc: Ramkumar Ramachandra, git

Alex Bennée <kernel-hacker@bennee.com> writes:

>  41.58%   git  libcrypto.so.1.0.0  [.] sha1_block_data_order_ssse3
>  33.62%   git  libz.so.1.2.3.4     [.] inflate_fast
>  10.39%   git  libz.so.1.2.3.4     [.] adler32
>   2.03%   git  [kernel.kallsyms]   [k] clear_page_c

Do you have any large blobs in the repo that are referenced directly by
a tag?

Because this just so happens to exactly reproduce your symptoms:

  # in a random git.git
  $ time git describe --debug
  [...]
  real    0m0.390s
  user    0m0.037s
  sys     0m0.011s
  $ git tag big1 $(dd if=/dev/urandom bs=1M count=512 | git hash-object -w --stdin)
  512+0 records in
  512+0 records out
  536870912 bytes (537 MB) copied, 45.5088 s, 11.8 MB/s
  $ time git describe --debug
  [...]
  real    0m1.875s
  user    0m1.738s
  sys     0m0.129s
  $ git tag big2 $(dd if=/dev/urandom bs=1M count=512 | git hash-object -w --stdin)
  512+0 records in
  512+0 records out
  536870912 bytes (537 MB) copied, 44.972 s, 11.9 MB/s
  $ time git describe --debug
  [...]
  real    0m3.620s
  user    0m3.357s
  sys     0m0.248s

(I actually ran the git-describe invocations more than once to ensure
that they are again cache-hot.)

git-describe should probably be fixed to avoid loading blobs, though I'm
not sure off hand if we have any infrastructure to infer the type of a
loose object without inflating it.  (This could probably be added by
inflating only the first block.)  We do have this for packed objects, so
at least for packed repos there's a speedup to be had.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch


* Re: Poor performance of git describe in big repos
  2013-05-30 15:33     ` Thomas Rast
@ 2013-05-30 16:01       ` Alex Bennée
  2013-05-30 16:21         ` Thomas Rast
  0 siblings, 1 reply; 41+ messages in thread
From: Alex Bennée @ 2013-05-30 16:01 UTC (permalink / raw)
  To: Thomas Rast; +Cc: Ramkumar Ramachandra, Git Mailing List

On 30 May 2013 16:33, Thomas Rast <trast@inf.ethz.ch> wrote:
> Alex Bennée <kernel-hacker@bennee.com> writes:
>
>>  41.58%   git  libcrypto.so.1.0.0  [.] sha1_block_data_order_ssse3
>>  33.62%   git  libz.so.1.2.3.4     [.] inflate_fast
>>  10.39%   git  libz.so.1.2.3.4     [.] adler32
>>   2.03%   git  [kernel.kallsyms]   [k] clear_page_c
>
> Do you have any large blobs in the repo that are referenced directly by
> a tag?

Most probably. I've certainly done a bunch of releases (which are tagged) where
the last thing that was updated was an FPGA image.

> Because this just so happens to exactly reproduce your symptoms:
>
>   # in a random git.git
>   $ time git describe --debug
>   [...]
>   real    0m0.390s
>   user    0m0.037s
>   sys     0m0.011s
>   $ git tag big1 $(dd if=/dev/urandom bs=1M count=512 | git hash-object -w --stdin)
>   512+0 records in
>   512+0 records out
>   536870912 bytes (537 MB) copied, 45.5088 s, 11.8 MB/s
>   $ time git describe --debug
>   [...]
>   real    0m1.875s
>   user    0m1.738s
>   sys     0m0.129s
>   $ git tag big2 $(dd if=/dev/urandom bs=1M count=512 | git hash-object -w --stdin)
>   512+0 records in
>   512+0 records out
>   536870912 bytes (537 MB) copied, 44.972 s, 11.9 MB/s
>   $ time git describe --debug
>   [...]
>   real    0m3.620s
>   user    0m3.357s
>   sys     0m0.248s
>
> (I actually ran the git-describe invocations more than once to ensure
> that they are again cache-hot.)

That looks pretty promising as a replication.

> git-describe should probably be fixed to avoid loading blobs, though I'm
> not sure off hand if we have any infrastructure to infer the type of a
> loose object without inflating it.  (This could probably be added by
> inflating only the first block.)  We do have this for packed objects, so
> at least for packed repos there's a speedup to be had.

Will it be loading the blob for every commit it traverses or just ones that hit
a tag? Why does it need to load the blob at all? Surely the commit tree state
doesn't need to be walked down?


-- 
Alex, homepage: http://www.bennee.com/~alex/


* Re: Poor performance of git describe in big repos
  2013-05-30 16:01       ` Alex Bennée
@ 2013-05-30 16:21         ` Thomas Rast
  2013-05-30 16:44           ` Thomas Rast
  2013-05-30 19:30           ` Poor performance of git describe in big repos John Keeping
  0 siblings, 2 replies; 41+ messages in thread
From: Thomas Rast @ 2013-05-30 16:21 UTC (permalink / raw)
  To: Alex Bennée; +Cc: Ramkumar Ramachandra, Git Mailing List

Alex Bennée <kernel-hacker@bennee.com> writes:

> On 30 May 2013 16:33, Thomas Rast <trast@inf.ethz.ch> wrote:
>> Alex Bennée <kernel-hacker@bennee.com> writes:
>>
>>>  41.58%   git  libcrypto.so.1.0.0  [.] sha1_block_data_order_ssse3
>>>  33.62%   git  libz.so.1.2.3.4     [.] inflate_fast
>>>  10.39%   git  libz.so.1.2.3.4     [.] adler32
>>>   2.03%   git  [kernel.kallsyms]   [k] clear_page_c
>>
>> Do you have any large blobs in the repo that are referenced directly by
>> a tag?
>
> Most probably. I've certainly done a bunch of releases (which are tagged) where
> the last thing that was updated was an FPGA image.
[...]
>> git-describe should probably be fixed to avoid loading blobs, though I'm
>> not sure off hand if we have any infrastructure to infer the type of a
>> loose object without inflating it.  (This could probably be added by
>> inflating only the first block.)  We do have this for packed objects, so
>> at least for packed repos there's a speedup to be had.
>
> Will it be loading the blob for every commit it traverses or just ones that hit
> a tag? Why does it need to load the blob at all? Surely the commit tree state
> doesn't need to be walked down?

No, my theory is that you tagged *the blobs*.  Git supports this.

git-describe needs to look at the commit (if any) obtained by peeling
each tag (i.e. dereferencing tags until it reaches a non-tag).  So to do
that, it resolves the tag's referent and loads it.  Usually this will be
a commit, in which case it is marked as reached by the tag.

As my example shows, it also resolves tags' referents if they refer to
non-commits, in particular, it will decompress large blobs that are
(directly) referenced by a tag.

Note that while annotated tags provide the type information themselves,
e.g.

  $ git cat-file tag junio-gpg-pub
  object fe113d3f96636710600c6b02d5fd421fa7e87dd6
  type blob
  tag junio-gpg-pub
  [...]

unannotated tags are simply refs, so it is not enough to just look at
the tag objects' referent type.

I had a brief look around sha1_file.c, in particular sha1_object_info,
and it turns out we lack the "deflate only early part" logic as I
suspected.  So that'll have to be fixed first.  After that I *think* it
should automatically carry over into the tag readers.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch


* Re: Poor performance of git describe in big repos
  2013-05-30 16:21         ` Thomas Rast
@ 2013-05-30 16:44           ` Thomas Rast
  2013-05-30 19:01             ` Antoine Pelisse
  2013-05-30 20:00             ` [PATCH 1/2] sha1_file: silence sha1_loose_object_info Thomas Rast
  2013-05-30 19:30           ` Poor performance of git describe in big repos John Keeping
  1 sibling, 2 replies; 41+ messages in thread
From: Thomas Rast @ 2013-05-30 16:44 UTC (permalink / raw)
  To: Alex Bennée; +Cc: Ramkumar Ramachandra, Git Mailing List

Thomas Rast <trast@inf.ethz.ch> writes:

> I had a brief look around sha1_file.c, in particular sha1_object_info,
> and it turns out we lack the "deflate only early part" logic as I
> suspected.  So that'll have to be fixed first.  After that I *think* it
> should automatically carry over into the tag readers.

Strike that, I'm wrong.  sha1_object_info is fast even for these big
loose objects.

The culprit, according to some callgrind investigation, is
lookup_commit_reference_gently() [for the unannotated case] or
deref_tag() [annotated case] calling parse_object().

-- 
Thomas Rast
trast@{inf,student}.ethz.ch


* Re: Poor performance of git describe in big repos
  2013-05-30 16:44           ` Thomas Rast
@ 2013-05-30 19:01             ` Antoine Pelisse
  2013-05-30 20:00             ` [PATCH 1/2] sha1_file: silence sha1_loose_object_info Thomas Rast
  1 sibling, 0 replies; 41+ messages in thread
From: Antoine Pelisse @ 2013-05-30 19:01 UTC (permalink / raw)
  To: Thomas Rast; +Cc: Alex Bennée, Ramkumar Ramachandra, Git Mailing List

> The culprit, according to some callgrind investigation, is
> lookup_commit_reference_gently() [for the unannotated case] or
> deref_tag() [annotated case] calling parse_object().

Using the scenario you described earlier, I think it ends up spending
most of its time in check_sha1_signature (both deref_tag and
lookup_commit_reference_gently() go there) with 20% inflating, 80% in
SHA1_Update(). Not much we can do about that, can we?
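
For context on why the SHA-1 cost scales with blob size: an object's name
is the SHA-1 of a short header plus the full content, so verifying it means
hashing every byte. A minimal sketch of that computation for a blob,
assuming sha1sum from coreutils is available and the repository uses SHA-1
object names (blob_oid is a hypothetical helper, not a Git command):

```shell
# Hypothetical helper: compute the SHA-1 object name Git would give a file
# stored as a blob, i.e. sha1("blob <size>\0" + <content>).
blob_oid() {
    # Byte count of the file; strip the padding some wc versions emit.
    size=$(wc -c < "$1" | tr -d ' ')
    # Hash the header followed by the entire content.
    { printf 'blob %s\0' "$size"; cat "$1"; } | sha1sum | cut -d' ' -f1
}

# Usage:
#   blob_oid path/to/file
```

For a file in a SHA-1 repository this should print the same name as
"git hash-object <file>". For a 512 MB blob it has to stream all 512 MB
through SHA-1, which is consistent with the profile above.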


* Re: Poor performance of git describe in big repos
  2013-05-30 16:21         ` Thomas Rast
  2013-05-30 16:44           ` Thomas Rast
@ 2013-05-30 19:30           ` John Keeping
  2013-05-31  8:14             ` Alex Bennée
  1 sibling, 1 reply; 41+ messages in thread
From: John Keeping @ 2013-05-30 19:30 UTC (permalink / raw)
  To: Thomas Rast; +Cc: Alex Bennée, Ramkumar Ramachandra, Git Mailing List

On Thu, May 30, 2013 at 06:21:55PM +0200, Thomas Rast wrote:
> Alex Bennée <kernel-hacker@bennee.com> writes:
> 
> > On 30 May 2013 16:33, Thomas Rast <trast@inf.ethz.ch> wrote:
> >> Alex Bennée <kernel-hacker@bennee.com> writes:
> >>
> >>>  41.58%   git  libcrypto.so.1.0.0  [.] sha1_block_data_order_ssse3
> >>>  33.62%   git  libz.so.1.2.3.4     [.] inflate_fast
> >>>  10.39%   git  libz.so.1.2.3.4     [.] adler32
> >>>   2.03%   git  [kernel.kallsyms]   [k] clear_page_c
> >>
> >> Do you have any large blobs in the repo that are referenced directly by
> >> a tag?
> >
> > Most probably. I've certainly done a bunch of releases (which are tagged) where
> > the last thing that was updated was an FPGA image.
> [...]
> >> git-describe should probably be fixed to avoid loading blobs, though I'm
> >> not sure off hand if we have any infrastructure to infer the type of a
> >> loose object without inflating it.  (This could probably be added by
> >> inflating only the first block.)  We do have this for packed objects, so
> >> at least for packed repos there's a speedup to be had.
> >
> > Will it be loading the blob for every commit it traverses or just ones that hit
> > a tag? Why does it need to load the blob at all? Surely the commit
> > tree state doesn't
> > need to be walked down?
> 
> No, my theory is that you tagged *the blobs*.  Git supports this.

You can see if that is the case by doing something like this:

    eval $(git for-each-ref --shell --format '
        test $(git cat-file -t %(objectname)^{}) = commit ||
        echo %(refname);')

That will print out the name of any ref that doesn't point at a commit.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 1/2] sha1_file: silence sha1_loose_object_info
  2013-05-30 16:44           ` Thomas Rast
  2013-05-30 19:01             ` Antoine Pelisse
@ 2013-05-30 20:00             ` Thomas Rast
  2013-05-30 20:00               ` [PATCH 2/2] lookup_commit_reference_gently: do not read non-{tag,commit} Thomas Rast
  1 sibling, 1 reply; 41+ messages in thread
From: Thomas Rast @ 2013-05-30 20:00 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Ramkumar Ramachandra, Alex Bennée,
	Antoine Pelisse, John Keeping,
	Nguyễn Thái Ngọc Duy

sha1_object_info() returns -1 (OBJ_BAD) if it cannot find the object
for some reason, which suggests that it wants the _caller_ to report
this error.  However, part of its work happens in
sha1_loose_object_info, which _does_ report errors itself.  This is
doubly strange because:

* packed_object_info(), which is the other half of the duo, does _not_
  report this.

* In the event that an object is packed and pruned while
  sha1_object_info_extended() goes looking for it, we would
  erroneously show the error -- even though the code of the latter
  function purports to handle this case gracefully.

* A caller might invoke sha1_object_info() to find the type of an
  object even if that object is not known to exist.

Silence this error.  The others remain untouched as a corrupt object
is a much more grave error than it merely being absent.

Signed-off-by: Thomas Rast <trast@inf.ethz.ch>
---
 sha1_file.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sha1_file.c b/sha1_file.c
index 67e815b..c0f6a0e 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2348,7 +2348,7 @@ static int sha1_loose_object_info(const unsigned char *sha1, unsigned long *size

 	map = map_sha1_file(sha1, &mapsize);
 	if (!map)
-		return error("unable to find %s", sha1_to_hex(sha1));
+		return -1;
 	if (unpack_sha1_header(&stream, map, mapsize, hdr, sizeof(hdr)) < 0)
 		status = error("unable to unpack %s header",
 			       sha1_to_hex(sha1));
-- 
1.8.3.506.g4fdeee5

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 2/2] lookup_commit_reference_gently: do not read non-{tag,commit}
  2013-05-30 20:00             ` [PATCH 1/2] sha1_file: silence sha1_loose_object_info Thomas Rast
@ 2013-05-30 20:00               ` Thomas Rast
  2013-05-30 21:22                 ` Jeff King
  2013-05-31  6:43                 ` Ramkumar Ramachandra
  0 siblings, 2 replies; 41+ messages in thread
From: Thomas Rast @ 2013-05-30 20:00 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Ramkumar Ramachandra, Alex Bennée,
	Antoine Pelisse, John Keeping,
	Nguyễn Thái Ngọc Duy

lookup_commit_reference_gently unconditionally parses the object given
to it.  This slows down git-describe a lot if you have a repository
with large tagged blobs in it: parse_object() will read the entire
blob and verify that its sha1 matches, only to then throw it away.

Speed it up by checking the type with sha1_object_info() prior to
unpacking.

The reason that deref_tag() does not need the same fix is a bit
subtle: parse_tag_buffer() does not fill the 'tagged' member of the
tag struct if the tagged object is a blob.

Reported-by: Alex Bennée <kernel-hacker@bennee.com>
Signed-off-by: Thomas Rast <trast@inf.ethz.ch>
---
 commit.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/commit.c b/commit.c
index 888e02a..00e8d4a 100644
--- a/commit.c
+++ b/commit.c
@@ -31,8 +31,12 @@ static struct commit *check_commit(struct object *obj,
 struct commit *lookup_commit_reference_gently(const unsigned char *sha1,
 					      int quiet)
 {
-	struct object *obj = deref_tag(parse_object(sha1), NULL, 0);
-
+	struct object *obj;
+	int type = sha1_object_info(sha1, NULL);
+	/* If it's neither tag nor commit, parsing the object is wasted effort */
+	if (type != OBJ_TAG && type != OBJ_COMMIT)
+		return NULL;
+	obj = deref_tag(parse_object(sha1), NULL, 0);
 	if (!obj)
 		return NULL;
 	return check_commit(obj, sha1, quiet);
-- 
1.8.3.506.g4fdeee5

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/2] lookup_commit_reference_gently: do not read non-{tag,commit}
  2013-05-30 20:00               ` [PATCH 2/2] lookup_commit_reference_gently: do not read non-{tag,commit} Thomas Rast
@ 2013-05-30 21:22                 ` Jeff King
  2013-05-31  0:52                   ` Duy Nguyen
  2013-05-31  8:08                   ` Thomas Rast
  2013-05-31  6:43                 ` Ramkumar Ramachandra
  1 sibling, 2 replies; 41+ messages in thread
From: Jeff King @ 2013-05-30 21:22 UTC (permalink / raw)
  To: Thomas Rast
  Cc: git, Junio C Hamano, Ramkumar Ramachandra, Alex Bennée,
	Antoine Pelisse, John Keeping,
	Nguyễn Thái Ngọc Duy

On Thu, May 30, 2013 at 10:00:23PM +0200, Thomas Rast wrote:

> lookup_commit_reference_gently unconditionally parses the object given
> to it.  This slows down git-describe a lot if you have a repository
> with large tagged blobs in it: parse_object() will read the entire
> blob and verify that its sha1 matches, only to then throw it away.
> 
> Speed it up by checking the type with sha1_object_info() prior to
> unpacking.

This would speed up the case where we do not end up looking at the
object at all, but it will slow down the (presumably common) case where
we will in fact find a commit and end up parsing the object anyway.

Have you measured the impact of this on normal operations? During a
traversal, we spend a measurable amount of time looking up commits in
packfiles, and this would presumably double it.

This is not the first time I have seen this tradeoff in git.  It would
be nice if our object access was structured to do incremental
examination of the objects (i.e., store the packfile index lookup or
partial unpack of a loose object header, and then use that to complete
the next step of actually getting the contents).
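
To make the tradeoff concrete, here is a toy sketch that counts index searches. It is not git code: the in-memory store, toy_find() and struct toy_handle are all made up for illustration. Asking for the type and then the contents through independent entry points searches twice, while a handle that remembers the search result searches once:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy object store; every name here is hypothetical, not git's API. */
enum toy_type { TOY_BAD = -1, TOY_COMMIT, TOY_BLOB };

struct toy_entry {
	const char *id;
	enum toy_type type;
	const char *contents;
};

static const struct toy_entry toy_store[] = {
	{ "c1", TOY_COMMIT, "tree t1\n" },
	{ "b1", TOY_BLOB, "large blob contents" },
};

static int toy_lookups;	/* counts the (expensive) index searches */

/* Stand-in for the pack index search: the step we want to do once. */
static const struct toy_entry *toy_find(const char *id)
{
	size_t i;

	assert(id);
	toy_lookups++;
	for (i = 0; i < sizeof(toy_store) / sizeof(toy_store[0]); i++)
		if (!strcmp(toy_store[i].id, id))
			return &toy_store[i];
	return NULL;
}

/* Two independent entry points: each pays for its own search. */
static enum toy_type toy_type_of(const char *id)
{
	const struct toy_entry *e = toy_find(id);
	return e ? e->type : TOY_BAD;
}

static const char *toy_read(const char *id)
{
	const struct toy_entry *e = toy_find(id);
	return e ? e->contents : NULL;
}

/* Incremental variant: keep the search result and reuse it. */
struct toy_handle {
	const struct toy_entry *entry;
};

static int toy_open(struct toy_handle *h, const char *id)
{
	h->entry = toy_find(id);
	return h->entry ? 0 : -1;
}

static enum toy_type toy_handle_type(const struct toy_handle *h)
{
	return h->entry->type;
}

static const char *toy_handle_read(const struct toy_handle *h)
{
	return h->entry->contents;
}
```

With the handle, a caller can look at the type first and decide whether reading the contents is worthwhile, without repeating the search in either case.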

-Peff

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/2] lookup_commit_reference_gently: do not read non-{tag,commit}
  2013-05-30 21:22                 ` Jeff King
@ 2013-05-31  0:52                   ` Duy Nguyen
  2013-05-31  8:08                   ` Thomas Rast
  1 sibling, 0 replies; 41+ messages in thread
From: Duy Nguyen @ 2013-05-31  0:52 UTC (permalink / raw)
  To: Jeff King
  Cc: Thomas Rast, Git Mailing List, Junio C Hamano,
	Ramkumar Ramachandra, Alex Bennée, Antoine Pelisse,
	John Keeping

On Fri, May 31, 2013 at 4:22 AM, Jeff King <peff@peff.net> wrote:
> On Thu, May 30, 2013 at 10:00:23PM +0200, Thomas Rast wrote:
>
>> lookup_commit_reference_gently unconditionally parses the object given
>> to it.  This slows down git-describe a lot if you have a repository
>> with large tagged blobs in it: parse_object() will read the entire
>> blob and verify that its sha1 matches, only to then throw it away.
>>
>> Speed it up by checking the type with sha1_object_info() prior to
>> unpacking.
>
> This would speed up the case where we do not end up looking at the
> object at all, but it will slow down the (presumably common) case where
> we will in fact find a commit and end up parsing the object anyway.

Perhaps turn "quiet" into a bitmap and only let git-describe do this?
--
Duy

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/2] lookup_commit_reference_gently: do not read non-{tag,commit}
  2013-05-30 20:00               ` [PATCH 2/2] lookup_commit_reference_gently: do not read non-{tag,commit} Thomas Rast
  2013-05-30 21:22                 ` Jeff King
@ 2013-05-31  6:43                 ` Ramkumar Ramachandra
  2013-05-31  8:16                   ` Thomas Rast
  1 sibling, 1 reply; 41+ messages in thread
From: Ramkumar Ramachandra @ 2013-05-31  6:43 UTC (permalink / raw)
  To: Thomas Rast
  Cc: git, Junio C Hamano, Alex Bennée, Antoine Pelisse,
	John Keeping, Nguyễn Thái Ngọc Duy

Thomas Rast wrote:
> diff --git a/commit.c b/commit.c
> index 888e02a..00e8d4a 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -31,8 +31,12 @@ static struct commit *check_commit(struct object *obj,
>  struct commit *lookup_commit_reference_gently(const unsigned char *sha1,
>                                               int quiet)
>  {
> -       struct object *obj = deref_tag(parse_object(sha1), NULL, 0);
> -
> +       struct object *obj;
> +       int type = sha1_object_info(sha1, NULL);
> +       /* If it's neither tag nor commit, parsing the object is wasted effort */
> +       if (type != OBJ_TAG && type != OBJ_COMMIT)
> +               return NULL;
> +       obj = deref_tag(parse_object(sha1), NULL, 0);
>         if (!obj)
>                 return NULL;
>         return check_commit(obj, sha1, quiet);

As Jeff points out, you've introduced an extra sha1_object_info() call
in the common case of a tag (which derefs into a commit anyway) or a
commit, slowing things down.

So, my main doubt centres around how sha1_object_info() determines the
type of the object without actually parsing it.  You have to open up
the file and look at the fields near the top, no? (or fall back to blob
failing that).  I am reading it:

1. It calls sha1_loose_object_info() or packed_object_info(),
depending on whether the particular object is in a pack or not.  Let's
see what is common between them.

2. The loose counterpart seems to call unpack_sha1_header() after
mmap'ing the file.  This ultimately ends up calling
unpack_object_header_buffer(), which is also what the packed
counterpart calls.

3. I didn't understand what unpack_object_header_buffer() is doing.
And'ing with some magic 0x80 and shifting by 4 bits iteratively? type
= (c >> 4) & 7?
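
For what it is worth, that arithmetic matches the variable-length header that precedes each object in a packfile: the first byte carries a continuation bit (0x80), a three-bit type and the low four bits of the size, and each later byte contributes seven more size bits, so the shift starts at 4 and grows by 7. A minimal sketch of such a decoder follows; decode_pack_header is a hypothetical name written for illustration, not git's own routine:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Illustrative decoder for the variable-length header that precedes
 * each object in a packfile.  decode_pack_header is a hypothetical
 * name; git's own routine is unpack_object_header_buffer().
 *
 * First byte:   bit 7 = "more bytes follow", bits 6-4 = object type,
 *               bits 3-0 = lowest four bits of the size.
 * Later bytes:  bit 7 = "more bytes follow", bits 6-0 = next seven
 *               bits of the size.
 *
 * Returns the number of header bytes consumed, or 0 on a truncated
 * or oversized header.
 */
static size_t decode_pack_header(const unsigned char *buf, size_t len,
				 unsigned *type, unsigned long *size)
{
	size_t used = 0;
	unsigned shift = 4;
	unsigned long c, sz;

	assert(buf && type && size);
	if (len == 0)
		return 0;

	c = buf[used++];
	*type = (unsigned)((c >> 4) & 7);	/* 1 commit, 2 tree, 3 blob, 4 tag */
	sz = c & 15;

	while (c & 0x80) {	/* continuation bit set */
		if (used >= len || shift >= sizeof(sz) * 8)
			return 0;
		c = buf[used++];
		sz += (c & 0x7f) << shift;
		shift += 7;
	}

	*size = sz;
	return used;
}
```

For example, the bytes 0x95 0x0a decode to type 1 (commit) and size 5 + (10 << 4) = 165. Type values 6 and 7 mark delta entries, whose real type is only known once the delta base has been found.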

In contrast, parse_object() first calls lookup_object() to look it up
in some hashtable to get the type -- the packfile idx, presumably?
Why don't you also do that instead of sha1_object_info()?  Or, why
don't you wrap parse_object() in an API that doesn't go beyond the
first blob check (and not execute parse_object_buffer())?

Also, does this patch fix the bug Alex reported?

Apologies if I've misunderstood something horribly (which does seem to
be the case).

Thanks.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/2] lookup_commit_reference_gently: do not read non-{tag,commit}
  2013-05-30 21:22                 ` Jeff King
  2013-05-31  0:52                   ` Duy Nguyen
@ 2013-05-31  8:08                   ` Thomas Rast
  2013-05-31 16:00                     ` Jeff King
  1 sibling, 1 reply; 41+ messages in thread
From: Thomas Rast @ 2013-05-31  8:08 UTC (permalink / raw)
  To: Jeff King
  Cc: git, Junio C Hamano, Ramkumar Ramachandra, Alex Bennée,
	Antoine Pelisse, John Keeping,
	Nguyễn Thái Ngọc Duy

Jeff King <peff@peff.net> writes:

> On Thu, May 30, 2013 at 10:00:23PM +0200, Thomas Rast wrote:
>
>> lookup_commit_reference_gently unconditionally parses the object given
>> to it.  This slows down git-describe a lot if you have a repository
>> with large tagged blobs in it: parse_object() will read the entire
>> blob and verify that its sha1 matches, only to then throw it away.
>> 
>> Speed it up by checking the type with sha1_object_info() prior to
>> unpacking.
>
> This would speed up the case where we do not end up looking at the
> object at all, but it will slow down the (presumably common) case where
> we will in fact find a commit and end up parsing the object anyway.
>
> Have you measured the impact of this on normal operations? During a
> traversal, we spend a measurable amount of time looking up commits in
> packfiles, and this would presumably double it.

I don't think so, but admittedly I didn't measure it.

The reason why it's unlikely is that this is specific to
lookup_commit_reference_gently, which according to some grepping is
usually done on refs or values that refs might have; e.g. on the old&new
sides of a fetch in remote.c, or in many places in the callback of some
variant of for_each_ref.

Of course if you have a ridiculously large number of refs (and I gather
_you_ do), this will hurt somewhat in the usual case, but speed up the
case where there is a ref (usually a lightweight tag) directly pointing
at a large blob.

I'm not sure this can be fixed without the change you outline here:

> This is not the first time I have seen this tradeoff in git.  It would
> be nice if our object access was structured to do incremental
> examination of the objects (i.e., store the packfile index lookup or
> partial unpack of a loose object header, and then use that to complete
> the next step of actually getting the contents).

But in any case I see the point, I should try and gather some
performance numbers.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Poor performance of git describe in big repos
  2013-05-30 19:30           ` Poor performance of git describe in big repos John Keeping
@ 2013-05-31  8:14             ` Alex Bennée
  2013-05-31  8:24               ` Thomas Rast
  2013-05-31  8:32               ` John Keeping
  0 siblings, 2 replies; 41+ messages in thread
From: Alex Bennée @ 2013-05-31  8:14 UTC (permalink / raw)
  To: John Keeping; +Cc: Thomas Rast, Ramkumar Ramachandra, Git Mailing List

On 30 May 2013 20:30, John Keeping <john@keeping.me.uk> wrote:
> On Thu, May 30, 2013 at 06:21:55PM +0200, Thomas Rast wrote:
>> Alex Bennée <kernel-hacker@bennee.com> writes:
>>
>> > On 30 May 2013 16:33, Thomas Rast <trast@inf.ethz.ch> wrote:
>> >> Alex Bennée <kernel-hacker@bennee.com> writes:
> <snip>
>> > Will it be loading the blob for every commit it traverses or just ones that hit
>> > a tag? Why does it need to load the blob at all? Surely the commit
>> > tree state doesn't
>> > need to be walked down?
>>
>> No, my theory is that you tagged *the blobs*.  Git supports this.

Wait, is this the difference between annotated and non-annotated tags?
I thought a non-annotated tag just acted like a reference to a particular
tree state?

>
> You can see if that is the case by doing something like this:
>
>     eval $(git for-each-ref --shell --format '
>         test $(git cat-file -t %(objectname)^{}) = commit ||
>         echo %(refname);')
>
> That will print out the name of any ref that doesn't point at a
> commit.

Hmm that didn't seem to work. But looking at the output by hand I
certainly have a mix of tags that are commits vs tags:


09:08 ajb@sloy/x86_64 [work.git] >git for-each-ref | grep "refs/tags" | grep "commit" | wc -l
1345
09:12 ajb@sloy/x86_64 [work.git] >git for-each-ref | grep "refs/tags" | grep -v "commit" | wc -l
66

Unfortunately I can't just delete those tags as they do refer to known
releases which we obviously care about. If I delete the tags on my
local repo and test for a speed increase can I re-create them as
annotated tag objects?

-- 
Alex, homepage: http://www.bennee.com/~alex/

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/2] lookup_commit_reference_gently: do not read non-{tag,commit}
  2013-05-31  6:43                 ` Ramkumar Ramachandra
@ 2013-05-31  8:16                   ` Thomas Rast
  0 siblings, 0 replies; 41+ messages in thread
From: Thomas Rast @ 2013-05-31  8:16 UTC (permalink / raw)
  To: Ramkumar Ramachandra
  Cc: git, Junio C Hamano, Alex Bennée, Antoine Pelisse,
	John Keeping, Nguyễn Thái Ngọc Duy

Ramkumar Ramachandra <artagnon@gmail.com> writes:

> Thomas Rast wrote:
>> +       struct object *obj;
>> +       int type = sha1_object_info(sha1, NULL);
>> +       /* If it's neither tag nor commit, parsing the object is wasted effort */
>> +       if (type != OBJ_TAG && type != OBJ_COMMIT)
>> +               return NULL;
>> +       obj = deref_tag(parse_object(sha1), NULL, 0);
[...]
> In contrast, parse_object() first calls lookup_object() to look it up
> in some hashtable to get the type -- the packfile idx, presumably?
> Why don't you also do that instead of sha1_object_info()?  Or, why
> don't you wrap parse_object() in an API that doesn't go beyond the
> first blob check (and not execute parse_object_buffer())?
>
> Also, does this patch fix the bug Alex reported?
>
> Apologies if I've misunderstood something horribly (which does seem to
> be the case).

Yes, it does fix the bug.  (It's not really buggy, just slow.)

However, you implicitly point out an important point: If we have the
object, and it was already parsed (obj->parsed is set), parse_object()
is essentially free.  But sha1_object_info is not, it will in particular
unconditionally dig through long delta chains to discover the base type
of an object that has already been unpacked.


As for your original questions: lookup_object() is "do we have it in our
big object hashtable?" -- the one that holds many[1] objects, that Peff
recently sped up.

sha1_object_info() and read_object() are in many ways parallel functions
that do approximately the following:

  check all pack indexes for this object
  if we found a hit:
    attempt to unpack by recursively going through deltas
    (for _info, no need to unpack, but we still go through the delta
    chain because the type of object is determined by the innermost
    delta base)
  try to load it as a loose object
  it could have been repacked and pruned while we were looking, so:
    reload pack index information
    try the packs again (search indexes, then unpack)
  complain


[1]  blobs in particular are frequently not stored in that hash table,
because it is an insert-only table
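
The fallback order above can be sketched as a small decision function. This is a toy model, assuming three made-up flags that describe where the object currently lives; none of the names exist in git:

```c
#include <assert.h>

/* Toy model, not git code: where a single object can currently be found. */
struct toy_repo {
	int in_known_pack;	/* listed in a pack index we have loaded */
	int is_loose;		/* present as a loose object file */
	int in_new_pack;	/* in a pack created after we loaded indexes */
	int reloads;		/* how often we re-read the pack directory */
};

enum toy_source {
	TOY_MISSING,		/* caller gets to complain */
	TOY_FROM_PACK,
	TOY_FROM_LOOSE,
	TOY_FROM_RELOADED_PACK
};

static enum toy_source toy_locate(struct toy_repo *r)
{
	assert(r);

	/* 1. Check all pack indexes we already know about. */
	if (r->in_known_pack)
		return TOY_FROM_PACK;

	/* 2. Try to load it as a loose object. */
	if (r->is_loose)
		return TOY_FROM_LOOSE;

	/*
	 * 3. It could have been repacked and pruned while we were
	 *    looking, so reload pack information and try once more.
	 */
	r->reloads++;
	if (r->in_new_pack)
		return TOY_FROM_RELOADED_PACK;

	return TOY_MISSING;
}
```

The reload only happens on the slow path, so an object that is found in the first or second step never pays for it.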

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Poor performance of git describe in big repos
  2013-05-31  8:14             ` Alex Bennée
@ 2013-05-31  8:24               ` Thomas Rast
  2013-05-31  8:40                 ` Alex Bennée
  2013-05-31  8:32               ` John Keeping
  1 sibling, 1 reply; 41+ messages in thread
From: Thomas Rast @ 2013-05-31  8:24 UTC (permalink / raw)
  To: Alex Bennée; +Cc: John Keeping, Ramkumar Ramachandra, Git Mailing List

Alex Bennée <kernel-hacker@bennee.com> writes:

> On 30 May 2013 20:30, John Keeping <john@keeping.me.uk> wrote:
>> On Thu, May 30, 2013 at 06:21:55PM +0200, Thomas Rast wrote:
>>> Alex Bennée <kernel-hacker@bennee.com> writes:
>>>
>>> > On 30 May 2013 16:33, Thomas Rast <trast@inf.ethz.ch> wrote:
>>> >> Alex Bennée <kernel-hacker@bennee.com> writes:
>> <snip>
>>> > Will it be loading the blob for every commit it traverses or just ones that hit
>>> > a tag? Why does it need to load the blob at all? Surely the commit
>>> > tree state doesn't
>>> > need to be walked down?
>>>
>>> No, my theory is that you tagged *the blobs*.  Git supports this.
>
> Wait is this the difference between annotated and non-annotated tags?
> I thought a non-annotated just acted like references to a particular
> tree state?

A tag is just a ref.  It can point at anything, in particular also a
blob (= some file *contents*).

An annotated tag is just a tag pointing at a "tag object".  A tag object
contains tagger name/email/date, a reference to an object, and a tag
message.

The slowness I found relates to having tags that point at blobs directly
(unannotated).
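
To illustrate the layout of a tag object: its body begins with an "object <sha1>" line followed by a "type <name>" line, so the kind of object an annotated tag points at can be read from the tag itself. A minimal sketch, assuming the tag body is already in memory; tag_target_type is a hypothetical helper and not git's parse_tag_buffer():

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not git's parse_tag_buffer(): given the body of
 * a tag object, copy the type of the object it points at into out.
 * The body starts with "object <40 hex digits>\n" followed by
 * "type <name>\n".  Returns 0 on success, -1 if the layout does not
 * match.
 */
static int tag_target_type(const char *buf, size_t len,
			   char *out, size_t cap)
{
	const size_t obj_line = 7 + 40 + 1;	/* "object " + sha1 + "\n" */
	const char *p, *end, *nl;
	size_t n;

	assert(buf && out && cap > 0);
	if (len < obj_line + 5 || memcmp(buf, "object ", 7) ||
	    buf[obj_line - 1] != '\n')
		return -1;

	p = buf + obj_line;
	end = buf + len;
	if (memcmp(p, "type ", 5))
		return -1;
	p += 5;

	/* The type name runs up to the end of the line. */
	nl = memchr(p, '\n', (size_t)(end - p));
	if (!nl)
		return -1;
	n = (size_t)(nl - p);
	if (n == 0 || n >= cap)
		return -1;
	memcpy(out, p, n);
	out[n] = '\0';
	return 0;
}
```

An unannotated tag has no such body to consult: the ref holds the object name directly, so the type has to come from the object store.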

>> You can see if that is the case by doing something like this:
>>
>>     eval $(git for-each-ref --shell --format '
>>         test $(git cat-file -t %(objectname)^{}) = commit ||
>>         echo %(refname);')
>>
>> That will print out the name of any ref that doesn't point at a
>> commit.
>
> Hmm that didn't seem to work. But looking at the output by hand I
> certainly have a mix of tags that are commits vs tags:
>
>
> 09:08 ajb@sloy/x86_64 [work.git] >git for-each-ref | grep "refs/tags"
> | grep "commit" | wc -l
> 1345
> 09:12 ajb@sloy/x86_64 [work.git] >git for-each-ref | grep "refs/tags"
> | grep -v "commit" | wc -l
> 66
>
> Unfortunately I can't just delete those tags as they do refer to known
> releases which we obviously care about. If I delete the tags on my
> local repo and test for a speed increase can I re-create them as
> annotated tag objects?

I would be more interested in this:

  git for-each-ref | grep ' blob'

and

  (git for-each-ref | grep ' blob' | cut -d\  -f1 | xargs -n1 git cat-file blob) | wc -c

The first tells you if you have any refs pointing at blobs.  The second
computes their total unpacked size.  My theory is that the second yields
some large number (hundreds of megabytes at least).

It would be nice if you checked, because if there turn out to be big
blobs, we have all the pieces and just need to assemble the best
solution.  Otherwise, there's something else going on and the problem
remains open.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Poor performance of git describe in big repos
  2013-05-31  8:14             ` Alex Bennée
  2013-05-31  8:24               ` Thomas Rast
@ 2013-05-31  8:32               ` John Keeping
  2013-05-31  8:49                 ` Alex Bennée
  1 sibling, 1 reply; 41+ messages in thread
From: John Keeping @ 2013-05-31  8:32 UTC (permalink / raw)
  To: Alex Bennée; +Cc: Thomas Rast, Ramkumar Ramachandra, Git Mailing List

On Fri, May 31, 2013 at 09:14:49AM +0100, Alex Bennée wrote:
> On 30 May 2013 20:30, John Keeping <john@keeping.me.uk> wrote:
> > On Thu, May 30, 2013 at 06:21:55PM +0200, Thomas Rast wrote:
> >> Alex Bennée <kernel-hacker@bennee.com> writes:
> >>
> >> > On 30 May 2013 16:33, Thomas Rast <trast@inf.ethz.ch> wrote:
> >> >> Alex Bennée <kernel-hacker@bennee.com> writes:
> > <snip>
> >> > Will it be loading the blob for every commit it traverses or just ones that hit
> >> > a tag? Why does it need to load the blob at all? Surely the commit
> >> > tree state doesn't
> >> > need to be walked down?
> >>
> >> No, my theory is that you tagged *the blobs*.  Git supports this.
> 
> Wait is this the difference between annotated and non-annotated tags?
> I thought a non-annotated just acted like references to a particular
> tree state?

No, this is something slightly different.  In Git there are four types
of object: tag, commit, tree and blob.  When you have a heavyweight tag,
the tag reference points at a tag object (which in turn points at
another object).  With a lightweight tag, the tag reference typically
points at a commit object.

However, there is no restriction that says that a tag object must point
to a commit or that a lightweight tag must point at a commit - it is
equally possible to point directly at a tree or a blob (although a lot
less common).

Thomas is suggesting that you might have a tag that does not point at a
commit but instead points to a blob object.

> > You can see if that is the case by doing something like this:
> >
> >     eval $(git for-each-ref --shell --format '
> >         test $(git cat-file -t %(objectname)^{}) = commit ||
> >         echo %(refname);')
> >
> > That will print out the name of any ref that doesn't point at a
> > commit.
> 
> Hmm that didn't seem to work.

You mean there was no output?  In that case it's likely that all your
references do indeed point at commits.

>                               But looking at the output by hand I
> certainly have a mix of tags that are commits vs tags:
> 
> 
> 09:08 ajb@sloy/x86_64 [work.git] >git for-each-ref | grep "refs/tags"
> | grep "commit" | wc -l
> 1345
> 09:12 ajb@sloy/x86_64 [work.git] >git for-each-ref | grep "refs/tags"
> | grep -v "commit" | wc -l
> 66

This means that you have 1345 lightweight tags and 66 heavyweight tags,
assuming that all of the lines that don't say "commit" do say "tag".

By the way, I don't remember if you said which version of Git you're
using.  If it's an older version then it's possible that something has
changed.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Poor performance of git describe in big repos
  2013-05-31  8:24               ` Thomas Rast
@ 2013-05-31  8:40                 ` Alex Bennée
  2013-05-31  8:46                   ` Thomas Rast
  0 siblings, 1 reply; 41+ messages in thread
From: Alex Bennée @ 2013-05-31  8:40 UTC (permalink / raw)
  To: Thomas Rast; +Cc: John Keeping, Ramkumar Ramachandra, Git Mailing List

On 31 May 2013 09:24, Thomas Rast <trast@inf.ethz.ch> wrote:
> Alex Bennée <kernel-hacker@bennee.com> writes:
>> On 30 May 2013 20:30, John Keeping <john@keeping.me.uk> wrote:
>>> On Thu, May 30, 2013 at 06:21:55PM +0200, Thomas Rast wrote:
>>>> Alex Bennée <kernel-hacker@bennee.com> writes:
>>>> > On 30 May 2013 16:33, Thomas Rast <trast@inf.ethz.ch> wrote:
> <snip>
>>>> No, my theory is that you tagged *the blobs*.  Git supports this.
>>
>> Wait is this the difference between annotated and non-annotated tags?
>> I thought a non-annotated just acted like references to a particular
>> tree state?
>
> A tag is just a ref.  It can point at anything, in particular also a
> blob (= some file *contents*).
>
> An annotated tag is just a tag pointing at a "tag object".  A tag object
> contains tagger name/email/date, a reference to an object, and a tag
> message.
>
> The slowness I found relates to having tags that point at blobs directly
> (unannotated).

I think you are right. I was brave (well I assumed the tags would come
back from the upstream repo) and ran:

git for-each-ref | grep "refs/tags" | grep "commit" | cut -d '/' -f 3 | xargs git tag -d

And boom:

09:19 ajb@sloy/x86_64 [work.git] >time /usr/bin/git --no-pager describe --long --tags
ajb-build-test-5225-2-gdc0b771

real    0m0.009s
user    0m0.008s
sys     0m0.000s

Which is much better performance. So it does look like unannotated
tags pointing at binary blobs is the failure case.

<snip>
>
> I would be more interested in this:
>
>   git for-each-ref | grep ' blob'

Hmmm, that gives nothing. All the refs are either tag or commit.

> and
>
>   (git for-each-ref | grep ' blob' | cut -d\  -f1 | xargs -n1 git cat-file blob) | wc -c

However I have some big commits it seems:

09:37 ajb@sloy/x86_64 [work.git] >(git for-each-ref | grep ' commit' |
cut -d\  -f1 | xargs -n1 git cat-file commit) | wc -c
1147231984

>
> The first tells you if you have any refs pointing at blobs.  The second
> computes their total unpacked size.  My theory is that the second yields
> some large number (hundreds of megabytes at least).
>
> It would be nice if you checked, because if there turn out to be big
> blobs, we have all the pieces and just need to assemble the best
> solution.  Otherwise, there's something else going on and the problem
> remains open.

If you want any other numbers I'm only too happy to help. Sorry I
can't share the repo though...

-- 
Alex, homepage: http://www.bennee.com/~alex/

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Poor performance of git describe in big repos
  2013-05-31  8:40                 ` Alex Bennée
@ 2013-05-31  8:46                   ` Thomas Rast
  2013-05-31  9:57                     ` Alex Bennée
  2013-05-31 10:27                     ` Thomas Rast
  0 siblings, 2 replies; 41+ messages in thread
From: Thomas Rast @ 2013-05-31  8:46 UTC (permalink / raw)
  To: Alex Bennée; +Cc: John Keeping, Ramkumar Ramachandra, Git Mailing List

Alex Bennée <kernel-hacker@bennee.com> writes:

> I think you are right. I was brave (well I assumed the tags would come
> back from the upstream repo) and ran:
>
> git for-each-ref | grep "refs/tags" | grep "commit" | cut -d '/' -f 3
> | xargs git tag -d

So that deleted all unannotated tags pointing at commits, and then it
was fast.  Curious.

> However I have some big commits it seems:
>
> 09:37 ajb@sloy/x86_64 [work.git] >(git for-each-ref | grep ' commit' |
> cut -d\  -f1 | xargs -n1 git cat-file commit) | wc -c
> 1147231984

How many unique entries are there in that list, i.e., what does

  git for-each-ref | grep ' commit' | cut -d\  -f1 | sort -u | wc -l

say?  Perhaps you can also find the biggest commit, e.g. like so:

  git for-each-ref | grep ' commit' | cut -d\  -f1 |
  while read sha; do git cat-file commit $sha | wc -c; done |
  sort -n

However, if that turns out to be the culprit, it's not fixable
currently[1].  Having commits with insanely long messages is just, well,
insane.


[1]  unless we do a major rework of the loading infrastructure, so that
we can teach it to load only the beginning of a commit as long as we are
only interested in parents and such
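
As an illustration of what "only the beginning of a commit" would mean: the parents live in the header lines before the first blank line, so a parser interested in them never needs to see the message. A minimal sketch on an already-inflated buffer; count_parents is a hypothetical helper, and it does not address the harder part, which is teaching the loading code to stop inflating early:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical sketch: count the "parent" lines in a commit object
 * without looking past the header.  The header ends at the first empty
 * line; everything after it is the message, which is never scanned.
 * Stores the header length (including the blank line) in *header_len
 * and returns the number of parents, or -1 if no blank line is found.
 */
static int count_parents(const char *buf, size_t len, size_t *header_len)
{
	size_t pos = 0;
	int parents = 0;

	assert(buf && header_len);
	while (pos < len) {
		const char *line = buf + pos;
		const char *nl = memchr(line, '\n', len - pos);
		size_t line_len;

		if (!nl)
			return -1;
		line_len = (size_t)(nl - line);
		if (line_len == 0) {	/* blank line: header is over */
			*header_len = pos + 1;
			return parents;
		}
		if (line_len > 7 && !memcmp(line, "parent ", 7))
			parents++;
		pos += line_len + 1;
	}
	return -1;
}
```

The work done here depends on the header length only, however long the message that follows it is.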

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Poor performance of git describe in big repos
  2013-05-31  8:32               ` John Keeping
@ 2013-05-31  8:49                 ` Alex Bennée
  2013-05-31  8:59                   ` John Keeping
  0 siblings, 1 reply; 41+ messages in thread
From: Alex Bennée @ 2013-05-31  8:49 UTC (permalink / raw)
  To: John Keeping; +Cc: Thomas Rast, Ramkumar Ramachandra, Git Mailing List

On 31 May 2013 09:32, John Keeping <john@keeping.me.uk> wrote:
> On Fri, May 31, 2013 at 09:14:49AM +0100, Alex Bennée wrote:
>> On 30 May 2013 20:30, John Keeping <john@keeping.me.uk> wrote:
>> > On Thu, May 30, 2013 at 06:21:55PM +0200, Thomas Rast wrote:
>> >> Alex Bennée <kernel-hacker@bennee.com> writes:
>> >>
>> >> > On 30 May 2013 16:33, Thomas Rast <trast@inf.ethz.ch> wrote:
>> >> >> Alex Bennée <kernel-hacker@bennee.com> writes:
>> > <snip>
>> >> > Will it be loading the blob for every commit it traverses or just ones that hit
>> >> > a tag? Why does it need to load the blob at all? Surely the commit
>> >> > tree state doesn't
>> >> > need to be walked down?
>> >>
>> >> No, my theory is that you tagged *the blobs*.  Git supports this.
>>
>> Wait is this the difference between annotated and non-annotated tags?
>> I thought a non-annotated just acted like references to a particular
>> tree state?
>
> No, this is something slightly different.  In Git there are four types
> of object: tag, commit, tree and blob.  When you have a heavyweight tag,
> the tag reference points at a tag object (which in turn points at
> another object).  With a lightweight tag, the tag reference typically
> points at a commit object.

I think this is the case in my repo.

> However, there is no restriction that says that a tag object must point
> to a commit or that a lightweight tag must point at a commit - it is
> equally possible to point directly at a tree or a blob (although a lot
> less common).
>
> Thomas is suggesting that you might have a tag that does not point at a
> commit but instead points to a blob object.

It's looking like I just have some very heavy commits. One data point
I probably should have mentioned at the beginning is that this is a
converted CVS repo, and I'm wondering if some of the artifacts the
conversion introduced have contributed to this.

>> > You can see if that is the case by doing something like this:
>> >
>> >     eval $(git for-each-ref --shell --format '
>> >         test $(git cat-file -t %(objectname)^{}) = commit ||
>> >         echo %(refname);')
>> >
>> > That will print out the name of any ref that doesn't point at a
>> > commit.
>>
>> Hmm that didn't seem to work.
>
> You mean there was no output?  In that case it's likely that all your
> references do indeed point at commits.

Correct.

>
>>                               But looking at the output by hand I
>> certainly have a mix of tags that are commits vs tags:
>>
>>
>> 09:08 ajb@sloy/x86_64 [work.git] >git for-each-ref | grep "refs/tags"
>> | grep "commit" | wc -l
>> 1345
>> 09:12 ajb@sloy/x86_64 [work.git] >git for-each-ref | grep "refs/tags"
>> | grep -v "commit" | wc -l
>> 66
>
> This means that you have 1345 lightweight tags and 66 heavyweight tags,
> assuming that all of the lines that don't say "commit" do say "tag".

Yep, all commits and tags, nothing else.
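
The two grep pipelines above assume the word "commit" appears only in the
type column, so a tag whose name contains "commit" would be miscounted. As a
minimal sketch (Python, not from the thread), the same count can be made by
reading the type field of the default `git for-each-ref` output, which prints
`<sha> <type>`, a tab, and then the ref name. The helper name `classify_tags`
and the sample lines are made up for illustration.

```python
from collections import Counter


def classify_tags(for_each_ref_output):
    """Count refs under refs/tags/ by the type of object they point at.

    Expects text in the default `git for-each-ref` layout:
    "<sha> <type>\t<refname>" per line.
    """
    counts = Counter()
    for line in for_each_ref_output.splitlines():
        if not line.strip():
            continue
        meta, _, refname = line.partition("\t")
        if not refname.startswith("refs/tags/"):
            continue  # branches and other refs are not tags
        _sha, objtype = meta.split()
        counts[objtype] += 1
    return counts


# Hypothetical sample data; the object names are placeholders.
SAMPLE = (
    "1111111111111111111111111111111111111111 commit\trefs/heads/master\n"
    "2222222222222222222222222222222222222222 commit\trefs/tags/build-1\n"
    "3333333333333333333333333333333333333333 tag\trefs/tags/release-1\n"
    "4444444444444444444444444444444444444444 tag\trefs/tags/first-commit\n"
)

if __name__ == "__main__":
    for objtype, count in sorted(classify_tags(SAMPLE).items()):
        print(objtype, count)
```

Here a `commit` entry corresponds to a lightweight tag pointing at a commit
and a `tag` entry to an annotated tag. The last sample line is an annotated
tag that a plain `grep "commit"` would have counted as lightweight.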

> By the way, I don't remember if you said which version of Git you're
> using.  If it's an older version then it's possible that something has
> changed.

I'm running Git from the stable PPA:

09:38 ajb@sloy/x86_64 [work.git] >git --version
git version 1.8.3

Although I have also tested with the latest git.git maint. I'm happy
to try master if it's likely to have changed.

-- 
Alex, homepage: http://www.bennee.com/~alex/

* Re: Poor performance of git describe in big repos
  2013-05-31  8:49                 ` Alex Bennée
@ 2013-05-31  8:59                   ` John Keeping
  0 siblings, 0 replies; 41+ messages in thread
From: John Keeping @ 2013-05-31  8:59 UTC (permalink / raw)
  To: Alex Bennée; +Cc: Thomas Rast, Ramkumar Ramachandra, Git Mailing List

On Fri, May 31, 2013 at 09:49:57AM +0100, Alex Bennée wrote:
> On 31 May 2013 09:32, John Keeping <john@keeping.me.uk> wrote:
> > Thomas is suggesting that you might have a tag that does not point at a
> > commit but instead points to a blob object.
> 
> It's looking like I just have some very heavy commits. One data point
> I probably should have mentioned at the beginning is that this is a
> converted CVS repo, and I'm wondering if some of the artifacts the
> conversion introduced have contributed to this.

You can try another for-each-ref invocation to see if that's the case:

    eval $(git for-each-ref --format 'printf "%s %s\n" \
        $(git cat-file -s %(objectname)) %(refname);') | sort -n

That will print the size of each object followed by the ref that points
to it, sorted by size.

> I'm running the GIT stable PPA:
> 
> 09:38 ajb@sloy/x86_64 [work.git] >git --version
> git version 1.8.3
> 
> Although I have also tested with the latest git.git maint. I'm happy
> to try master if it's likely to have changed.

master's still very close to 1.8.3 at the moment, so I don't think that
will make a difference.

* Re: Poor performance of git describe in big repos
  2013-05-31  8:46                   ` Thomas Rast
@ 2013-05-31  9:57                     ` Alex Bennée
  2013-06-03  8:02                       ` Alex Bennée
  2013-05-31 10:27                     ` Thomas Rast
  1 sibling, 1 reply; 41+ messages in thread
From: Alex Bennée @ 2013-05-31  9:57 UTC (permalink / raw)
  To: Thomas Rast; +Cc: John Keeping, Ramkumar Ramachandra, Git Mailing List

On 31 May 2013 09:46, Thomas Rast <trast@inf.ethz.ch> wrote:
> Alex Bennée <kernel-hacker@bennee.com> writes:
>
>> I think you are right. I was brave (well I assumed the tags would come
>> back from the upstream repo) and ran:
>>
>> git for-each-ref | grep "refs/tags" | grep "commit" | cut -d '/' -f 3
>> | xargs git tag -d
>
> So that deleted all unannotated tags pointing at commits, and then it
> was fast.  Curious.
>
>> However I have some big commits it seems:
>>
>> 09:37 ajb@sloy/x86_64 [work.git] >(git for-each-ref | grep ' commit' |
>> cut -d\  -f1 | xargs -n1 git cat-file commit) | wc -c
>> 1147231984
>
> How many unique entries are there in that list, i.e., what does
>
>   git for-each-ref | grep ' commit' | cut -d\  -f1 | sort -u | wc -l

09:49 ajb@sloy/x86_64 [work.git] >git for-each-ref | grep ' commit' |
cut -d\  -f1 | sort -u | wc -l
1508

> say?  Perhaps you can also find the biggest commit, e.g. like so:
>
>   git for-each-ref | grep ' commit' | cut -d\  -f1 |
>   while read sha; do git cat-file commit $sha | wc -c; done |
>   sort -n

Yeah, the sizes range from a few hundred bytes up to a large number of
commits of about 3M each. I guess I need to identify which commits they
are and remove the tags or convert them to annotated tags.

> However, if that turns out to be the culprit, it's not fixable
> currently[1].  Having commits with insanely long messages is just, well,
> insane.
>
>

> [1]  unless we do a major rework of the loading infrastructure, so that
> we can teach it to load only the beginning of a commit as long as we are
> only interested in parents and such

I'll do a bit of scripting to dig into the nature of these
uber-commits and try to work out how they came about. I suspect they
are simply start-of-branch states in our broken and disparate history.

I'll get back to you once I've dug a little deeper.

>
> --
> Thomas Rast
> trast@{inf,student}.ethz.ch



-- 
Alex, homepage: http://www.bennee.com/~alex/

* Re: Poor performance of git describe in big repos
  2013-05-31  8:46                   ` Thomas Rast
  2013-05-31  9:57                     ` Alex Bennée
@ 2013-05-31 10:27                     ` Thomas Rast
  2013-05-31 16:17                       ` Jeff King
  1 sibling, 1 reply; 41+ messages in thread
From: Thomas Rast @ 2013-05-31 10:27 UTC (permalink / raw)
  To: Jeff King
  Cc: Alex Bennée, John Keeping, Ramkumar Ramachandra,
	Git Mailing List

Thomas Rast <trast@inf.ethz.ch> writes:

> However, if that turns out to be the culprit, it's not fixable
> currently[1].  Having commits with insanely long messages is just, well,
> insane.
>
> [1]  unless we do a major rework of the loading infrastructure, so that
> we can teach it to load only the beginning of a commit as long as we are
> only interested in parents and such

Actually, Peff, doesn't your commit parent/tree pointer caching give us
this for free?

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

* Re: [PATCH 2/2] lookup_commit_reference_gently: do not read non-{tag,commit}
  2013-05-31  8:08                   ` Thomas Rast
@ 2013-05-31 16:00                     ` Jeff King
  0 siblings, 0 replies; 41+ messages in thread
From: Jeff King @ 2013-05-31 16:00 UTC (permalink / raw)
  To: Thomas Rast
  Cc: git, Junio C Hamano, Ramkumar Ramachandra, Alex Bennée,
	Antoine Pelisse, John Keeping,
	Nguyễn Thái Ngọc Duy

On Fri, May 31, 2013 at 10:08:06AM +0200, Thomas Rast wrote:

> > Have you measured the impact of this on normal operations? During a
> > traversal, we spend a measurable amount of time looking up commits in
> > packfiles, and this would presumably double it.
> 
> I don't think so, but admittedly I didn't measure it.
> 
> The reason why it's unlikely is that this is specific to
> lookup_commit_reference_gently, which according to some grepping is
> usually done on refs or values that refs might have; e.g. on the old&new
> sides of a fetch in remote.c, or in many places in the callback of some
> variant of for_each_ref.

Yeah, I saw that the "_gently" form backs some of the other forms
(non-gently, lookup_commit_or_die) and was worried that we would use it
as part of the revision traversal to find parents. But we don't, of
course; we use lookup_commit, because we would not accept a parent that
is a tag pointing to a commit.

So I think it probably won't matter in any sane case.

> Of course if you have a ridiculously large number of refs (and I gather
> _you_ do), this will hurt somewhat in the usual case, but speed up the
> case where there is a ref (usually a lightweight tag) directly pointing
> at a large blob.

In my large-ref cases, there are often a lot of duplicate refs anyway
(e.g., many forks of a project having the same tags). So usually the
right thing there is to use lookup_object to see if we have the object
already anyway. parse_object has this optimization, but we can add it
into sha1_object_info, too, if it turns out to be a problem.

-Peff

* Re: Poor performance of git describe in big repos
  2013-05-31 10:27                     ` Thomas Rast
@ 2013-05-31 16:17                       ` Jeff King
  2013-06-03  8:39                         ` Alex Bennée
  0 siblings, 1 reply; 41+ messages in thread
From: Jeff King @ 2013-05-31 16:17 UTC (permalink / raw)
  To: Thomas Rast
  Cc: Alex Bennée, John Keeping, Ramkumar Ramachandra,
	Git Mailing List

On Fri, May 31, 2013 at 12:27:11PM +0200, Thomas Rast wrote:

> Thomas Rast <trast@inf.ethz.ch> writes:
> 
> > However, if that turns out to be the culprit, it's not fixable
> > currently[1].  Having commits with insanely long messages is just, well,
> > insane.
> >
> > [1]  unless we do a major rework of the loading infrastructure, so that
> > we can teach it to load only the beginning of a commit as long as we are
> > only interested in parents and such
> 
> Actually, Peff, doesn't your commit parent/tree pointer caching give us
> this for free?

It does. You can test it from the "jk/metapacks" branch at
git://github.com/peff/git. After building, you'd need to do:

  $ git gc
  $ git metapack --all --commits

in the target repository. You can check that it's working because "git
rev-list --all --count" should be an order of magnitude faster. You may
need to add "save_commit_buffer = 0" in any commands you are checking,
though, as the optimization can only kick in if parse_commit does not
want to save the buffer as a side effect.

I also looked into trying to just read the beginning part of a commit[1],
but it turned out not to be all that much of an improvement.

-Peff

[1] http://article.gmane.org/gmane.comp.version-control.git/212301

* Re: Poor performance of git describe in big repos
  2013-05-31  9:57                     ` Alex Bennée
@ 2013-06-03  8:02                       ` Alex Bennée
  2013-06-03 16:32                         ` Junio C Hamano
  0 siblings, 1 reply; 41+ messages in thread
From: Alex Bennée @ 2013-06-03  8:02 UTC (permalink / raw)
  To: Thomas Rast; +Cc: John Keeping, Ramkumar Ramachandra, Git Mailing List

On 31 May 2013 10:57, Alex Bennée <kernel-hacker@bennee.com> wrote:
> On 31 May 2013 09:46, Thomas Rast <trast@inf.ethz.ch> wrote:
>>
>> So that deleted all unannotated tags pointing at commits, and then it
>> was fast.  Curious.
>>
>> However, if that turns out to be the culprit, it's not fixable
>> currently[1].  Having commits with insanely long messages is just, well,
>> insane.
>>
>>
>> [1]  unless we do a major rework of the loading infrastructure, so that
>> we can teach it to load only the beginning of a commit as long as we are
>> only interested in parents and such
>
> I'll do a bit of scripting to dig into the nature of these
> uber-commits and try to work out how they came about. I suspect they
> are simply start-of-branch states in our broken and disparate history.
>
> I'll get back to you once I've dug a little deeper.

So I wrote a little script [1] which I ran to remove all tags that did
not exist on any branches:

git-tag-cleaner.py -d no-branch

After a lot of churning:

17:26 ajb@sloy/x86_64 [work.git] >time /usr/bin/git --no-pager
describe --long --tags
ajb-build-test-5225-2-gdc0b771

real    0m0.799s
user    0m0.024s
sys     0m0.052s

So at least I can fix up my repo. All the big commits look as though
they were weird cvs2svn creations that exist to represent the
detached state of a strange CVS tag from the converted repository.
However, it does raise one question.

Why is git attempting to parse a commit not on the DAG for the branch
I'm attempting to describe?

Anyway, as I have a workaround, I'm going to do a slightly more
conservative clean of the repo with my script and move on.

[1] https://github.com/stsquad/git-tag-cleaner

-- 
Alex, homepage: http://www.bennee.com/~alex/

* Re: Poor performance of git describe in big repos
  2013-05-31 16:17                       ` Jeff King
@ 2013-06-03  8:39                         ` Alex Bennée
  2013-06-03 14:49                           ` Jeff King
  0 siblings, 1 reply; 41+ messages in thread
From: Alex Bennée @ 2013-06-03  8:39 UTC (permalink / raw)
  To: Jeff King
  Cc: Thomas Rast, John Keeping, Ramkumar Ramachandra, Git Mailing List

On 31 May 2013 17:17, Jeff King <peff@peff.net> wrote:
> On Fri, May 31, 2013 at 12:27:11PM +0200, Thomas Rast wrote:
>
>> Thomas Rast <trast@inf.ethz.ch> writes:
>>
>> > However, if that turns out to be the culprit, it's not fixable
>> > currently[1].  Having commits with insanely long messages is just, well,
>> > insane.
>> >
>> > [1]  unless we do a major rework of the loading infrastructure, so that
>> > we can teach it to load only the beginning of a commit as long as we are
>> > only interested in parents and such
>>
>> Actually, Peff, doesn't your commit parent/tree pointer caching give us
>> this for free?
>
> It does. You can test it from the "jk/metapacks" branch at
> git://github.com/peff/git. After building, you'd need to do:
>
>   $ git gc
>   $ git metapack --all --commits
>
> in the target repository. You can check that it's working because "git
> rev-list --all --count" should be an order of magnitude faster. You may
> need to add "save_commit_buffer = 0" in any commands you are checking,
> though, as the optimization can only kick in if parse_commit does not
> want to save the buffer as a side effect.

Is this a command line argument? The tools don't seem to think so.

Anyway, it seems to make only a marginal difference in my case:

09:08 ajb@sloy/x86_64 [work.git] >time git --no-pager describe --long --tags
ajb-build-test-5225-2-gdc0b771

real    0m14.105s
user    0m12.409s
sys     0m1.660s
09:11 ajb@sloy/x86_64 [work.git] >git gc
Counting objects: 399436, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (110874/110874), done.
Writing objects: 100% (399436/399436), done.
Total 399436 (delta 281538), reused 398357 (delta 280493)
Checking connectivity: 399436, done.
09:12 ajb@sloy/x86_64 [work.git] >git metapack --all --commits
09:13 ajb@sloy/x86_64 [work.git] >time git --no-pager describe --long --tags
ajb-build-test-5225-2-gdc0b771

real    0m12.781s
user    0m11.669s
sys     0m1.080s
09:32 ajb@sloy/x86_64 [work.git] >time git --no-pager describe --long --tags
ajb-build-test-5225-2-gdc0b771

real    0m12.768s
user    0m11.817s
sys     0m0.908s
09:33 ajb@sloy/x86_64 [work.git] >time git --no-pager describe --long --tags
ajb-build-test-5225-2-gdc0b771

real    0m12.642s
user    0m11.705s
sys     0m0.904s


>
> I also looked into trying to just read the beginning part of a commit[1],
> but it turned out not to be all that much of an improvement.
>
> -Peff
>
> [1] http://article.gmane.org/gmane.comp.version-control.git/212301



-- 
Alex, homepage: http://www.bennee.com/~alex/

* Re: Poor performance of git describe in big repos
  2013-06-03  8:39                         ` Alex Bennée
@ 2013-06-03 14:49                           ` Jeff King
  0 siblings, 0 replies; 41+ messages in thread
From: Jeff King @ 2013-06-03 14:49 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Thomas Rast, John Keeping, Ramkumar Ramachandra, Git Mailing List

On Mon, Jun 03, 2013 at 09:39:21AM +0100, Alex Bennée wrote:

> > in the target repository. You can check that it's working because "git
> > rev-list --all --count" should be an order of magnitude faster. You may
> > need to add "save_commit_buffer = 0" in any commands you are checking,
> > though, as the optimization can only kick in if parse_commit does not
> > want to save the buffer as a side effect.
> 
> Is this a command line argument? The tools don't seem to think so.

If you mean the "save_commit_buffer = 0", no; I mean you would have to
insert it somewhere in builtin/$CMD.c, and then recompile. However,
git-describe already has it, so it should work.

> Anyway it seems to make a marginal difference to my case:

I get much better results:

  $ cd linux-2.6
  $ time git --no-pager describe --long --tags HEAD~800
  v3.5-6956-gaa0b3b2

  real    0m0.261s
  user    0m0.248s
  sys     0m0.012s

  $ git metapack --commits --all
  $ time git --no-pager describe --long --tags HEAD~800
  v3.5-6956-gaa0b3b2

  real    0m0.057s
  user    0m0.032s
  sys     0m0.024s

which implies that your time is being spent elsewhere. That topic
wouldn't avoid inflating tag objects from disk. Do you have really big
tag objects (or unannotated tags pointing to blobs)? What does:

  git for-each-ref --format='%(object)' refs/tags |
  git cat-file --batch-check |
  sort -k 3nr |
  head

say?
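
Assuming the usual `<sha> <type> <size>` lines that `git cat-file
--batch-check` prints, the sort-and-head step of the pipeline above can be
sketched in Python as follows. This is only an illustration of what the
pipeline reports, not part of Git; the function name `largest_objects` and
the sample input are made up.

```python
def largest_objects(batch_check_output, limit=10):
    """Return up to `limit` (size, type, sha) tuples, largest size first.

    Lines that do not have exactly three fields (blank lines, or a
    "<name> missing" report) are skipped.
    """
    entries = []
    for line in batch_check_output.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        sha, objtype, size = fields
        entries.append((int(size), objtype, sha))
    # Same ordering as `sort -k 3nr`: numeric, descending, on the size column.
    entries.sort(key=lambda entry: entry[0], reverse=True)
    return entries[:limit]


# Hypothetical sample data; object names are shortened placeholders.
SAMPLE_BATCH = "\n".join([
    "a1 commit 3145728",
    "b2 tag 187",
    "c3 missing",
    "d4 commit 254",
    "",
])

if __name__ == "__main__":
    for size, objtype, sha in largest_objects(SAMPLE_BATCH, limit=2):
        print(sha, objtype, size)
```

A handful of multi-megabyte entries at the top of such a list would point at
the oversized objects that every run has to inflate.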

-Peff

* Re: Poor performance of git describe in big repos
  2013-06-03  8:02                       ` Alex Bennée
@ 2013-06-03 16:32                         ` Junio C Hamano
  2013-06-03 17:48                           ` Junio C Hamano
  0 siblings, 1 reply; 41+ messages in thread
From: Junio C Hamano @ 2013-06-03 16:32 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Thomas Rast, John Keeping, Ramkumar Ramachandra, Git Mailing List

Alex Bennée <kernel-hacker@bennee.com> writes:

> Why is git attempting to parse a commit not on the DAG for the branch
> I'm attempting to describe?

I think that is because you need to parse the objects at the tip of
refs to see if they are on the DAG in the first place.

If there weren't any annotated tags, conceivably you could do without
parsing these objects.  You would:

 - First read the refs without parsing anything to learn the object
   name of the tips of refs;

 - Traverse the DAG, starting from the commit and notice when you
   see commits that are at the tips of refs you learned in the first
   step, arranging to stop when you found the "closest" tip.

But with annotated tags (and "git describe" is designed to be
primarily used with them; you would need "--tags" option to make it
notice unannotated tags), the object name you see sitting at the tip
will never appear during the DAG traversal.  You will only see
commits from the latter, so you would need to parse the tips to
learn what commits they refer to.

And of course, "then parse only annotated tags, without parsing
commits" would not work, because you wouldn't know what the object
is without looking at it ;-)
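
The reasoning above can be sketched on a toy object store. This is a
simplified illustration in Python, not Git's actual describe algorithm: it
does a plain breadth-first walk and reports the hop count to the first
tagged commit, whereas the real command orders its walk and counts commits
differently. The dictionary layout and the names `peel` and `describe` are
assumptions made for the example.

```python
from collections import deque

# Toy object store: name -> (kind, payload).
#   ("commit", [parent names]), ("tag", target name), ("blob", None)


def peel(objects, name):
    """Follow annotated-tag objects until a non-tag object is reached."""
    kind, target = objects[name]
    while kind == "tag":
        name = target
        kind, target = objects[name]
    return name


def describe(objects, refs, start):
    """Return (refname, hops) for the first ref tip found from `start`."""
    # Step 1: every ref tip has to be looked at. An annotated tag's own
    # name never appears in the commit walk, so it must be peeled first.
    tips = {}
    for refname, name in refs.items():
        peeled = peel(objects, name)
        if objects[peeled][0] == "commit":
            tips.setdefault(peeled, refname)

    # Step 2: walk the commit graph from `start` until a tip is seen.
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        commit, hops = queue.popleft()
        if commit in tips:
            return tips[commit], hops
        for parent in objects[commit][1]:
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, hops + 1))
    return None


# Hypothetical data: three commits, one annotated tag, one tag on a blob.
OBJECTS = {
    "c1": ("commit", []),
    "c2": ("commit", ["c1"]),
    "c3": ("commit", ["c2"]),
    "t1": ("tag", "c1"),
    "b1": ("blob", None),
}
REFS = {"refs/tags/v1.0": "t1", "refs/tags/blob-tag": "b1"}

if __name__ == "__main__":
    print(describe(OBJECTS, REFS, "c3"))
```

In this model step 1 touches every ref tip, whether or not it is reachable
from the commit being described. That matches the observation in the thread
that tags on large, unrelated objects slow the command down.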

* Re: Poor performance of git describe in big repos
  2013-06-03 16:32                         ` Junio C Hamano
@ 2013-06-03 17:48                           ` Junio C Hamano
  0 siblings, 0 replies; 41+ messages in thread
From: Junio C Hamano @ 2013-06-03 17:48 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Thomas Rast, John Keeping, Ramkumar Ramachandra, Git Mailing List

Junio C Hamano <gitster@pobox.com> writes:

> Alex Bennée <kernel-hacker@bennee.com> writes:
>
>> Why is git attempting to parse a commit not on the DAG for the branch
>> I'm attempting to describe?
>
> I think that is because you need to parse the objects at the tip of
> refs to see if they are on the DAG in the first place.
>
> If there weren't any annotated tags, conceivably you could do without
> parsing these objects.  You would:
>
>  - First read the refs without parsing anything to learn the object
>    name of the tips of refs;
>
>  - Traverse the DAG, starting from the commit and notice when you
>    see commits that are at the tips of refs you learned in the first
>    step, arranging to stop when you found the "closest" tip.
>
> But with annotated tags (and "git describe" is designed to be
> primarily used with them; you would need "--tags" option to make it
> notice unannotated tags), the object name you see sitting at the tip
> will never appear during the DAG traversal.  You will only see
> commits from the latter, so you would need to parse the tips to
> learn what commits they refer to.
>
> And of course, "then parse only annotated tags, without parsing
> commits" would not work, because you wouldn't know what the object
> is without looking at it ;-)

Having said all that, with changes by Peff and Michael Haggerty
around f85354b5c7b8 (pack_one_ref(): use function peel_entry(),
2013-04-22), recent Git does not "parse" as many refs as it used to,
only to figure out what commit an annotated tag points at when your
refs are packed, so we may be a lot closer to the optimum than I
hinted by the above description.

end of thread [~2013-06-03 17:49 UTC]

Thread overview: 41+ messages
2013-05-30 10:38 Poor performance of git describe in big repos Alex Bennée
2013-05-30 11:33 ` Ramkumar Ramachandra
2013-05-30 13:09   ` Alex Bennée
2013-05-30 14:32     ` Ramkumar Ramachandra
2013-05-30 15:01       ` Alex Bennée
2013-05-30 15:17         ` Ramkumar Ramachandra
2013-05-30 15:33     ` Thomas Rast
2013-05-30 16:01       ` Alex Bennée
2013-05-30 16:21         ` Thomas Rast
2013-05-30 16:44           ` Thomas Rast
2013-05-30 19:01             ` Antoine Pelisse
2013-05-30 20:00             ` [PATCH 1/2] sha1_file: silence sha1_loose_object_info Thomas Rast
2013-05-30 20:00               ` [PATCH 2/2] lookup_commit_reference_gently: do not read non-{tag,commit} Thomas Rast
2013-05-30 21:22                 ` Jeff King
2013-05-31  0:52                   ` Duy Nguyen
2013-05-31  8:08                   ` Thomas Rast
2013-05-31 16:00                     ` Jeff King
2013-05-31  6:43                 ` Ramkumar Ramachandra
2013-05-31  8:16                   ` Thomas Rast
2013-05-30 19:30           ` Poor performance of git describe in big repos John Keeping
2013-05-31  8:14             ` Alex Bennée
2013-05-31  8:24               ` Thomas Rast
2013-05-31  8:40                 ` Alex Bennée
2013-05-31  8:46                   ` Thomas Rast
2013-05-31  9:57                     ` Alex Bennée
2013-06-03  8:02                       ` Alex Bennée
2013-06-03 16:32                         ` Junio C Hamano
2013-06-03 17:48                           ` Junio C Hamano
2013-05-31 10:27                     ` Thomas Rast
2013-05-31 16:17                       ` Jeff King
2013-06-03  8:39                         ` Alex Bennée
2013-06-03 14:49                           ` Jeff King
2013-05-31  8:32               ` John Keeping
2013-05-31  8:49                 ` Alex Bennée
2013-05-31  8:59                   ` John Keeping
2013-05-30 11:48 ` John Keeping
2013-05-30 12:29   ` Alex Bennée
2013-05-30 13:20     ` Duy Nguyen
     [not found]       ` <CAJ-05NPacjAEC99Ntd9eMnTD9_PMMYFob-_tAx5CeSB79TkRSg@mail.gmail.com>
2013-05-30 13:45         ` Duy Nguyen
2013-05-30 14:02           ` Alex Bennée
2013-05-30 13:16   ` Alex Bennée
