* Resolving deltas dominates clone time
@ 2019-04-19 21:47 Martin Fick
  2019-04-20  3:58 ` Jeff King
  0 siblings, 1 reply; 26+ messages in thread
From: Martin Fick @ 2019-04-19 21:47 UTC (permalink / raw)
  To: Git Mailing List

We have a serious performance problem with one of our large repos. The repo is 
our internal version of the android platform/manifest project. Our repo after 
running a clean "repack -A -d -F" is close to 8G in size, has over 700K refs, 
and it has over 8M objects. The repo takes around 40min to clone locally (same 
disk to same disk) using git 1.8.2.1 on a high end machine (56 processors, 
128GB RAM)! It takes around 10mins before getting to the resolving deltas 
phase which then takes most of the rest of the time.

While this is a fairly large repo, a straight cp -r of the repo takes less 
than 2mins, so I would expect a clone to be on the same order of magnitude in 
time. For perspective, I have a kernel/msm repo with a third of the ref count 
and double the object count which takes only around 20mins to clone on the 
same machine (still slower than I would like).

I mention 1.8.2.1 because we have many old machines which need this. However, 
I also tested this with git v2.18 and it actually is much slower even 
(~140mins).

Reading the advice on the net, people seem to think that repacking with 
shorter delta-chains would help improve this. I have not had any success with 
this yet.

I have been thinking about this problem, and I suspect that this compute time 
is actually spent doing SHA1 calculations, is that possible? Some basic back 
of the envelope math and scripting seems to show that the repo may actually 
contain about 2TB of data if you add up the size of all the objects in the 
repo. Some quick research on the net seems to indicate that we might be able 
to expect something around 500MB/s throughput on computing SHA1s, does that 
seem reasonable? If I really have 2TB of data, should it then take around 
66mins to get the SHA1s for all that data? Could my repo clone time really be 
dominated by SHA1 math?
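
For reference, the rough arithmetic behind that estimate (treating 2TB as
2,000,000MB and assuming the ~500MB/s figure above):

  $ echo $(( 2000000 / 500 / 60 ))   # ~2TB at ~500MB/s, in minutes
  66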

Any advice on how to speed up cloning this repo, or on what to pursue
further in my investigation?

Thanks,

-Martin


-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation



* Re: Resolving deltas dominates clone time
  2019-04-19 21:47 Resolving deltas dominates clone time Martin Fick
@ 2019-04-20  3:58 ` Jeff King
  2019-04-20  7:59   ` Ævar Arnfjörð Bjarmason
  2019-04-22 20:21   ` Martin Fick
  0 siblings, 2 replies; 26+ messages in thread
From: Jeff King @ 2019-04-20  3:58 UTC (permalink / raw)
  To: Martin Fick; +Cc: Git Mailing List

On Fri, Apr 19, 2019 at 03:47:22PM -0600, Martin Fick wrote:

> I have been thinking about this problem, and I suspect that this compute time 
> is actually spent doing SHA1 calculations, is that possible? Some basic back 
> of the envelope math and scripting seems to show that the repo may actually 
> contain about 2TB of data if you add up the size of all the objects in the 
> repo. Some quick research on the net seems to indicate that we might be able 
> to expect something around 500MB/s throughput on computing SHA1s, does that 
> seem reasonable? If I really have 2TB of data, should it then take around 
> 66mins to get the SHA1s for all that data? Could my repo clone time really be 
> dominated by SHA1 math?

That sounds about right, actually. 8GB to 2TB is a compression ratio of
250:1. That's bigger than I've seen, but I get 51:1 in the kernel.

Try this (with a recent version of git; your v1.8.2.1 won't have
--batch-all-objects):

  # count the on-disk size of all objects
  git cat-file --batch-all-objects --batch-check='%(objectsize) %(objectsize:disk)' |
  perl -alne '
    $repo += $F[0];
    $disk += $F[1];
    END { print "$repo / $disk = ", $repo/$disk }
  '

250:1 isn't inconceivable if you have large blobs which have small
changes to them (and at 8GB for 8 million objects, you probably do have
some larger blobs, since the kernel is about 1/8th the size for the same
number of objects).

So yes, if you really do have to hash 2TB of data, that's going to take
a while. "openssl speed" on my machine gives per-second speeds of:

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
sha1            135340.73k   337086.10k   677821.10k   909513.73k  1007528.62k  1016916.65k

So it's faster on bigger chunks, but yeah 500-1000MB/s seems like about
the best you're going to do. And...

> I mention 1.8.2.1 because we have many old machines which need this. However, 
> I also tested this with git v2.18 and it actually is much slower even 
> (~140mins).

I think v2.18 will have the collision-detecting sha1 on by default,
which is slower. Building with OPENSSL_SHA1 should be the fastest (and is
what the numbers above show). Git's internal (but not collision-detecting)
BLK_SHA1 is somewhere in the middle.
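
If you want to experiment, the SHA-1 implementation is a compile-time knob in
git's Makefile; roughly (the exact knob names can vary by version):

  # collision-detecting sha1dc (the current default)
  make DC_SHA1=YesPlease
  # OpenSSL's optimized SHA-1 (fastest, no collision detection)
  make OPENSSL_SHA1=YesPlease
  # git's own block-sha1 (no collision detection, somewhere in between)
  make BLK_SHA1=YesPlease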

> Any advice on how to speed up cloning this repo, or on what to pursue
> further in my investigation?

If you don't mind losing the collision-detection, using openssl's sha1
might help. The delta resolution should be threaded, too. So in _theory_
you're using 66 minutes of CPU time, but that should only take 1-2
minutes on your 56-core machine. I don't know at what point you'd run
into lock contention, though. The locking there is quite coarse.
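
(Back of the envelope for that: with perfect parallelism,

  $ echo "scale=1; 66 / 56" | bc   # 66 CPU-minutes spread over 56 cores
  1.1

so anything much beyond a minute or two is contention or serial work.)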

We also hash non-deltas while we're receiving them over the network.
That's accounted for in the "receiving pack" part of the progress meter.
If the time looks to be going to "resolving deltas", then that should
all be threaded.

If you want to replay the slow part, it should just be index-pack. So
something like (with $old as a fresh clone of the repo):

  git init --bare new-repo.git
  cd new-repo.git
  perf record git index-pack -v --stdin <$old/.git/objects/pack/pack-*.pack
  perf report

should show you where the time is going (substitute perf with whatever
profiling tool you like).

As far as avoiding that work altogether, there aren't a lot of options.
Git clients do not trust the server, so the server sends only the raw
data, and the client is responsible for computing the object ids. The
only exception is a local filesystem clone, which will blindly copy or
hardlink the .pack and .idx files from the source.
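
To be concrete, that shortcut only kicks in for a plain-path clone; roughly
(paths here are just placeholders):

  # local path: hardlinks/copies the existing .pack and .idx, no re-hashing
  git clone /path/to/src.git dst.git

  # file:// forces the normal transport, so you pay the full index-pack cost
  git clone file:///path/to/src.git dst2.git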

In theory there could be a protocol extension to let the client say "I
trust you, please send me the matching .idx that goes with this pack,
and I'll assume there was no bitrot nor trickery on your part". I
don't recall anybody ever discussing such a patch in the past, but I
think Microsoft's VFS for Git project that backs development on Windows
might do similar trickery under the hood.

-Peff


* Re: Resolving deltas dominates clone time
  2019-04-20  3:58 ` Jeff King
@ 2019-04-20  7:59   ` Ævar Arnfjörð Bjarmason
  2019-04-22 15:57     ` Jeff King
  2019-04-22 20:21   ` Martin Fick
  1 sibling, 1 reply; 26+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-04-20  7:59 UTC (permalink / raw)
  To: Jeff King; +Cc: Martin Fick, Git Mailing List


On Sat, Apr 20 2019, Jeff King wrote:

> On Fri, Apr 19, 2019 at 03:47:22PM -0600, Martin Fick wrote:
>
>> I have been thinking about this problem, and I suspect that this compute time
>> is actually spent doing SHA1 calculations, is that possible? Some basic back
>> of the envelope math and scripting seems to show that the repo may actually
>> contain about 2TB of data if you add up the size of all the objects in the
>> repo. Some quick research on the net seems to indicate that we might be able
>> to expect something around 500MB/s throughput on computing SHA1s, does that
>> seem reasonable? If I really have 2TB of data, should it then take around
>> 66mins to get the SHA1s for all that data? Could my repo clone time really be
>> dominated by SHA1 math?
>
> That sounds about right, actually. 8GB to 2TB is a compression ratio of
> 250:1. That's bigger than I've seen, but I get 51:1 in the kernel.
>
> Try this (with a recent version of git; your v1.8.2.1 won't have
> --batch-all-objects):
>
>   # count the on-disk size of all objects
>   git cat-file --batch-all-objects --batch-check='%(objectsize) %(objectsize:disk)' |
>   perl -alne '
>     $repo += $F[0];
>     $disk += $F[1];
>     END { print "$repo / $disk = ", $repo/$disk }
>   '
>
> 250:1 isn't inconceivable if you have large blobs which have small
> changes to them (and at 8GB for 8 million objects, you probably do have
> some larger blobs, since the kernel is about 1/8th the size for the same
> number of objects).
>
> So yes, if you really do have to hash 2TB of data, that's going to take
> a while. "openssl speed" on my machine gives per-second speeds of:
>
> type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
> sha1            135340.73k   337086.10k   677821.10k   909513.73k  1007528.62k  1016916.65k
>
> So it's faster on bigger chunks, but yeah 500-1000MB/s seems like about
> the best you're going to do. And...
>
>> I mention 1.8.2.1 because we have many old machines which need this. However,
>> I also tested this with git v2.18 and it actually is much slower even
>> (~140mins).
>
> I think v2.18 will have the collision-detecting sha1 on by default,
> which is slower. Building with OPENSSL_SHA1 should be the fastest (and is
> what the numbers above show). Git's internal (but not collision-detecting)
> BLK_SHA1 is somewhere in the middle.
>
>> Any advice on how to speed up cloning this repo, or on what to pursue
>> further in my investigation?
>
> If you don't mind losing the collision-detection, using openssl's sha1
> might help. The delta resolution should be threaded, too. So in _theory_
> you're using 66 minutes of CPU time, but that should only take 1-2
> minutes on your 56-core machine. I don't know at what point you'd run
> into lock contention, though. The locking there is quite coarse.

There's also my (been meaning to re-roll)
https://public-inbox.org/git/20181113201910.11518-1-avarab@gmail.com/
if *that* part of the SHA-1 checking is part of what's going on here. It'll
help a *tiny* bit, but of course is part of the "trust remote" risk
management...

> We also hash non-deltas while we're receiving them over the network.
> That's accounted for in the "receiving pack" part of the progress meter.
> If the time looks to be going to "resolving deltas", then that should
> all be threaded.
>
> If you want to replay the slow part, it should just be index-pack. So
> something like (with $old as a fresh clone of the repo):
>
>   git init --bare new-repo.git
>   cd new-repo.git
>   perf record git index-pack -v --stdin <$old/.git/objects/pack/pack-*.pack
>   perf report
>
> should show you where the time is going (substitute perf with whatever
> profiling tool you like).
>
> As far as avoiding that work altogether, there aren't a lot of options.
> Git clients do not trust the server, so the server sends only the raw
> data, and the client is responsible for computing the object ids. The
> only exception is a local filesystem clone, which will blindly copy or
> hardlink the .pack and .idx files from the source.
>
> In theory there could be a protocol extension to let the client say "I
> trust you, please send me the matching .idx that goes with this pack,
> and I'll assume there was no bitrot nor trickery on your part". I
> don't recall anybody ever discussing such a patch in the past, but I
> think Microsoft's VFS for Git project that backs development on Windows
> might do similar trickery under the hood.

I started to write:

    I wonder if there's room for some tacit client/server cooperation
    without such a protocol change.

    E.g. the server sending over a pack constructed in such a way that
    everything required for a checkout is at the beginning of the
    data. Now we implicitly tend to do it mostly the other way around
    for delta optimization purposes.

    That would allow a smart client in a hurry to index-pack it as they
    go along, and as soon as they have enough to check out HEAD return
    to the client, and continue the rest in the background

But realized I was just starting to describe something like 'clone
--depth=1' followed by a 'fetch --unshallow' in the background, except
that would work better (if you did "just the tip" naïvely you'd get
'missing object' on e.g. 'git log', with that ad-hoc hack we'd need to
write out two packs etc...).

    $ rm -rf /tmp/git; time git clone --depth=1 https://chromium.googlesource.com/chromium/src /tmp/git; time git -C /tmp/git fetch --unshallow
    Cloning into '/tmp/git'...
    remote: Counting objects: 304839, done
    remote: Finding sources: 100% (304839/304839)
    remote: Total 304839 (delta 70483), reused 204837 (delta 70483)
    Receiving objects: 100% (304839/304839), 1.48 GiB | 19.87 MiB/s, done.
    Resolving deltas: 100% (70483/70483), done.
    Checking out files: 100% (302768/302768), done.

    real    2m10.223s
    user    1m2.434s
    sys     0m15.564s
    [not waiting for that second bit, but it'll take ages...]

I think just having a clone mode that did that for you might scratch a
lot of people's itch. I.e. "I want full history, but mainly want a
checkout right away, so background the full clone".

But at this point I'm just starting to describe some shoddy version of
Documentation/technical/partial-clone.txt :), OTOH there's no "narrow
clone and fleshen right away" option.
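
For reference, the partial-clone half of that exists today (assuming a server
that allows filters); it's the "flesh everything out afterwards" half that
doesn't:

  # omit all blobs up front; they are then fetched lazily on demand
  git clone --filter=blob:none https://example.com/repo.git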

On protocol extensions: Just having a way to "wget" the corresponding
*.idx file from the server would be great, and reduce clone times by a
lot. There's the risk of trusting the server, but most people's use-case
is going to be pushing right back to the same server, which'll be doing
a full validation.

We could also defer that validation instead of skipping it. E.g. wget
*.{pack,idx} followed by a 'fsck' in the background. I've sometimes
wanted that anyway, i.e. "fsck --auto" similar to "gc --auto"
periodically to detect repository bitflips.
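
A minimal sketch of that deferred validation, assuming the *.pack/*.idx pair
had already been dropped into place unverified:

  # re-verify object hashes and connectivity later, off the clone path
  git -C new-repo.git fsck --full >fsck.out 2>&1 &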

Or, do some "narrow" validation of such an *.idx file right
away. E.g. for all the trees/blobs required for the current checkout,
and background the rest.


* Re: Resolving deltas dominates clone time
  2019-04-20  7:59   ` Ævar Arnfjörð Bjarmason
@ 2019-04-22 15:57     ` Jeff King
  2019-04-22 18:01       ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 26+ messages in thread
From: Jeff King @ 2019-04-22 15:57 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Martin Fick, Git Mailing List

On Sat, Apr 20, 2019 at 09:59:12AM +0200, Ævar Arnfjörð Bjarmason wrote:

> > If you don't mind losing the collision-detection, using openssl's sha1
> > might help. The delta resolution should be threaded, too. So in _theory_
> > you're using 66 minutes of CPU time, but that should only take 1-2
> > minutes on your 56-core machine. I don't know at what point you'd run
> > into lock contention, though. The locking there is quite coarse.
> 
> There's also my (been meaning to re-roll)
> https://public-inbox.org/git/20181113201910.11518-1-avarab@gmail.com/
> if *that* part of the SHA-1 checking is part of what's going on here. It'll
> help a *tiny* bit, but of course is part of the "trust remote" risk
> management...

I think we're talking about two different collision detections, and your
patch wouldn't help at all here.

Your patch is optionally removing the "woah, we got an object with a
duplicate sha1, let's check that the bytes are the same in both copies"
check. But Martin's problem is a clone, so we wouldn't have any existing
objects to duplicate in the first place.

The problem in his case is literally just that the actual SHA-1 is
expensive, and that can be helped by using the optimized openssl
implementation rather than the sha1dc (which checks not collisions with
objects we _have_, but evidence of somebody trying to exploit weaknesses
in sha1).

One thing we could do to make that easier is a run-time flag to switch
between sha1dc and a faster implementation (either openssl or blk_sha1,
depending on the build). That would let you flip the "trust" bit per
operation, rather than having it baked into your build.

(Note that the oft-discussed "use a faster sha1 implementation for
checksums, but sha1dc for object hashing" idea would not help here,
because these really are object hashes whose time is dominating. We have
to checksum 8GB of raw packfile but 2TB of object data).
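
(Same ballpark arithmetic as before:

  $ echo $(( 8000 / 500 ))         # seconds to checksum ~8GB of raw packfile
  16
  $ echo $(( 2000000 / 500 / 60 )) # minutes to hash ~2TB of object data
  66

so swapping implementations only for the stream checksum saves almost
nothing.)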

> I started to write:
> 
>     I wonder if there's room for some tacit client/server cooperation
>     without such a protocol change.
> 
>     E.g. the server sending over a pack constructed in such a way that
>     everything required for a checkout is at the beginning of the
>     data. Now we implicitly tend to do it mostly the other way around
>     for delta optimization purposes.
> 
>     That would allow a smart client in a hurry to index-pack it as they
>     go along, and as soon as they have enough to check out HEAD return
>     to the client, and continue the rest in the background

Interesting idea. You're not reducing the total client effort, but
you're improving latency of getting the user to a checkout. Of course
that doesn't help if they want to run "git log" as their first
operation. ;)

> But realized I was just starting to describe something like 'clone
> --depth=1' followed by a 'fetch --unshallow' in the background, except
> that would work better (if you did "just the tip" naïvely you'd get
> 'missing object' on e.g. 'git log', with that ad-hoc hack we'd need to
> write out two packs etc...).

Right, that would work. I will note one thing, though: the total time to
do a 1-depth clone followed by an unshallow is probably much higher than
doing the whole clone as one unit, for two reasons:

  1. The server won't use reachability bitmaps when serving the
     follow-up fetch (because shallowness invalidates the reachability
     data they're caching), so it will spend much more time in the
     "Counting objects" phase.

  2. The server has to throw away some deltas. Imagine version X of a
     file in the tip commit is stored as a delta against version Y in
     that commit's parent. The initial clone has to throw away the
     on-disk delta of X and send you the whole object (because you are
     not requesting Y at all). And then in the follow-up fetch, it must
     either send you Y as a base object (wasting bandwidth), or it must
     on-the-fly generate a delta from Y to X (wasting CPU).

> But at this point I'm just starting to describe some shoddy version of
> Documentation/technical/partial-clone.txt :), OTOH there's no "narrow
> clone and fleshen right away" option.

Yes. And partial-clone suffers from the problems above to an even
greater extent. ;)

> On protocol extensions: Just having a way to "wget" the corresponding
> *.idx file from the server would be great, and reduce clone times by a
> lot. There's the risk of trusting the server, but most people's use-case
> is going to be pushing right back to the same server, which'll be doing
> a full validation.

One tricky thing is that the server may be handing you a bespoke .pack
file. There is no matching ".idx" at all, neither in-memory nor on disk.
And you would not want the whole on-disk .pack/.idx pair from a site
like GitHub, where there are objects from many forks.

So in general, I think you'd need some cooperation from the server side
to ask it to generate and send the .idx that matches the .pack it is
sending you. Or even if not the .idx format itself, some stable list of
sha1s that you could use to reproduce it without hashing each
uncompressed byte yourself. This could even be stuffed into the pack
format and stripped out by the receiving index-pack (i.e., each entry is
prefixed with "and by the way, here is my sha1...").

> We could also defer that validation instead of skipping it. E.g. wget
> *.{pack,idx} followed by a 'fsck' in the background. I've sometimes
> wanted that anyway, i.e. "fsck --auto" similar to "gc --auto"
> periodically to detect repository bitflips.
> 
> Or, do some "narrow" validation of such an *.idx file right
> away. E.g. for all the trees/blobs required for the current checkout,
> and background the rest.

The "do we have all of the objects we need" is already separate from
"figure out the sha1 of each object", so I think you'd get that
naturally if you just took in an untrusted .idx (it also demonstrates
that any .idx cost is really focusing on blobs, because the "do we have
all objects" check is going to decompress every commit and tree in the
repo anyway).

-Peff


* Re: Resolving deltas dominates clone time
  2019-04-22 15:57     ` Jeff King
@ 2019-04-22 18:01       ` Ævar Arnfjörð Bjarmason
  2019-04-22 18:43         ` Jeff King
  0 siblings, 1 reply; 26+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-04-22 18:01 UTC (permalink / raw)
  To: Jeff King; +Cc: Martin Fick, Git Mailing List, Jansen, Geert, Jonathan Tan


On Mon, Apr 22 2019, Jeff King wrote:

> On Sat, Apr 20, 2019 at 09:59:12AM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> > If you don't mind losing the collision-detection, using openssl's sha1
>> > might help. The delta resolution should be threaded, too. So in _theory_
>> > you're using 66 minutes of CPU time, but that should only take 1-2
>> > minutes on your 56-core machine. I don't know at what point you'd run
>> > into lock contention, though. The locking there is quite coarse.
>>
>> There's also my (been meaning to re-roll)
>> https://public-inbox.org/git/20181113201910.11518-1-avarab@gmail.com/
>> if *that* part of the SHA-1 checking is part of what's going on here. It'll
>> help a *tiny* bit, but of course is part of the "trust remote" risk
>> management...
>
> I think we're talking about two different collision detections, and your
> patch wouldn't help at all here.
>
> Your patch is optionally removing the "woah, we got an object with a
> duplicate sha1, let's check that the bytes are the same in both copies"
> check. But Martin's problem is a clone, so we wouldn't have any existing
> objects to duplicate in the first place.

Right, but we do anyway, as reported by Geert at @amazon.com preceding
that patch of mine. But it is 99.99% irrelevant to *performance* in this
case after the loose object cache you added (but before that could make
all the difference depending on the FS).

I just mentioned it to plant a flag on another bit of the code where
index-pack in general has certain paranoias/validation the user might be
willing to optionally drop just at "clone" time.

> The problem in his case is literally just that the actual SHA-1 is
> expensive, and that can be helped by using the optimized openssl
> implementation rather than the sha1dc (which checks not collisions with
> objects we _have_, but evidence of somebody trying to exploit weaknesses
> in sha1).
>
> One thing we could do to make that easier is a run-time flag to switch
> between sha1dc and a faster implementation (either openssl or blk_sha1,
> depending on the build). That would let you flip the "trust" bit per
> operation, rather than having it baked into your build.

Yeah, this would be neat.

> (Note that the oft-discussed "use a faster sha1 implementation for
> checksums, but sha1dc for object hashing" idea would not help here,
> because these really are object hashes whose time is dominating. We have
> to checksum 8GB of raw packfile but 2TB of object data).
>
>> I started to write:
>>
>>     I wonder if there's room for some tacit client/server cooperation
>>     without such a protocol change.
>>
>>     E.g. the server sending over a pack constructed in such a way that
>>     everything required for a checkout is at the beginning of the
>>     data. Now we implicitly tend to do it mostly the other way around
>>     for delta optimization purposes.
>>
>>     That would allow a smart client in a hurry to index-pack it as they
>>     go along, and as soon as they have enough to check out HEAD return
>>     to the client, and continue the rest in the background
>
> Interesting idea. You're not reducing the total client effort, but
> you're improving latency of getting the user to a checkout. Of course
> that doesn't help if they want to run "git log" as their first
> operation. ;)
>
>> But realized I was just starting to describe something like 'clone
>> --depth=1' followed by a 'fetch --unshallow' in the background, except
>> that would work better (if you did "just the tip" naïvely you'd get
>> 'missing object' on e.g. 'git log', with that ad-hoc hack we'd need to
>> write out two packs etc...).
>
> Right, that would work. I will note one thing, though: the total time to
> do a 1-depth clone followed by an unshallow is probably much higher than
> doing the whole clone as one unit, for two reasons:

Indeed. The hypothesis is that the user doesn't really care about the
clone-time, but the clone-to-repo-mostly-usable time.

>   1. The server won't use reachability bitmaps when serving the
>      follow-up fetch (because shallowness invalidates the reachability
>      data they're caching), so it will spend much more time in the
>      "Counting objects" phase.
>
>   2. The server has to throw away some deltas. Imagine version X of a
>      file in the tip commit is stored as a delta against version Y in
>      that commit's parent. The initial clone has to throw away the
>      on-disk delta of X and send you the whole object (because you are
>      not requesting Y at all). And then in the follow-up fetch, it must
>      either send you Y as a base object (wasting bandwidth), or it must
>      on-the-fly generate a delta from Y to X (wasting CPU).
>
>> But at this point I'm just starting to describe some shoddy version of
>> Documentation/technical/partial-clone.txt :), OTOH there's no "narrow
>> clone and fleshen right away" option.
>
> Yes. And partial-clone suffers from the problems above to an even
> greater extent. ;)
>
>> On protocol extensions: Just having a way to "wget" the corresponding
>> *.idx file from the server would be great, and reduce clone times by a
>> lot. There's the risk of trusting the server, but most people's use-case
>> is going to be pushing right back to the same server, which'll be doing
>> a full validation.
>
> One tricky thing is that the server may be handing you a bespoke .pack
> file. There is no matching ".idx" at all, neither in-memory nor on disk.
> And you would not want the whole on-disk .pack/.idx pair from a site
> like GitHub, where there are objects from many forks.
>
> So in general, I think you'd need some cooperation from the server side
> to ask it to generate and send the .idx that matches the .pack it is
> sending you. Or even if not the .idx format itself, some stable list of
> sha1s that you could use to reproduce it without hashing each
> uncompressed byte yourself.

Yeah, depending on how jt/fetch-cdn-offload is designed (see my
https://public-inbox.org/git/87k1hv6eel.fsf@evledraar.gmail.com/) it
could be (ab)used to do this. I.e. you'd keep a "base" *.{pack,idx}
around for such a purpose.

So in such a case you'd serve up that recent-enough *.{pack,idx} for the
client to "wget", and the client would then trust it (or not) and do the
equivalent of a "fetch" from that point to be 100% up-to-date.

> This could even be stuffed into the pack format and stripped out by
> the receiving index-pack (i.e., each entry is prefixed with "and by
> the way, here is my sha1...").

That would be really interesting. I.e. just having room for that (or
anything else) in the pack format.

I wonder if it could be added to the delta-chain in the current format
as a nasty hack :)

>> We could also defer that validation instead of skipping it. E.g. wget
>> *.{pack,idx} followed by a 'fsck' in the background. I've sometimes
>> wanted that anyway, i.e. "fsck --auto" similar to "gc --auto"
>> periodically to detect repository bitflips.
>>
>> Or, do some "narrow" validation of such an *.idx file right
>> away. E.g. for all the trees/blobs required for the current checkout,
>> and background the rest.
>
> The "do we have all of the objects we need" is already separate from
> "figure out the sha1 of each object", so I think you'd get that
> naturally if you just took in an untrusted .idx (it also demonstrates
> that any .idx cost is really focusing on blobs, because the "do we have
> all objects" check is going to decompress every commit and tree in the
> repo anyway).
>
> -Peff


* Re: Resolving deltas dominates clone time
  2019-04-22 18:01       ` Ævar Arnfjörð Bjarmason
@ 2019-04-22 18:43         ` Jeff King
  2019-04-23  7:07           ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 26+ messages in thread
From: Jeff King @ 2019-04-22 18:43 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Martin Fick, Git Mailing List, Jansen, Geert, Jonathan Tan

On Mon, Apr 22, 2019 at 08:01:15PM +0200, Ævar Arnfjörð Bjarmason wrote:

> > Your patch is optionally removing the "woah, we got an object with a
> > duplicate sha1, let's check that the bytes are the same in both copies"
> > check. But Martin's problem is a clone, so we wouldn't have any existing
> > objects to duplicate in the first place.
> 
> Right, but we do anyway, as reported by Geert at @amazon.com preceding
> that patch of mine. But it is 99.99% irrelevant to *performance* in this
> case after the loose object cache you added (but before that could make
> all the difference depending on the FS).

I scratched my head at this a bit. If we don't have any other objects,
then what are we comparing against? But I think you mean that we have
the overhead of doing the object lookups to find that out. Yes, that can
add up if your filesystem has high latency, but I think in this case it
is a drop in the bucket compared to dealing with the actual object data.

> I just mentioned it to plant a flag on another bit of the code where
> index-pack in general has certain paranoias/validation the user might be
> willing to optionally drop just at "clone" time.

Yeah, I agree it may be worth pursuing independently. I just don't think
it will help Martin's case in any noticeable way.

> > Right, that would work. I will note one thing, though: the total time to
> > do a 1-depth clone followed by an unshallow is probably much higher than
> > doing the whole clone as one unit, for two reasons:
> 
> Indeed. The hypothesis is that the user doesn't really care about the
> clone-time, but the clone-to-repo-mostly-usable time.

There was a little bit of self-interest in there for me, too, as a
server operator. While it does add to the end-to-end time, most of the
resource use for the shallow fetch gets put on the server. IOW, I don't
think we'd be happy to see clients doing this depth-1-and-then-unshallow
strategy for every clone.

> > So in general, I think you'd need some cooperation from the server side
> > to ask it to generate and send the .idx that matches the .pack it is
> > sending you. Or even if not the .idx format itself, some stable list of
> > sha1s that you could use to reproduce it without hashing each
> > uncompressed byte yourself.
> 
> Yeah, depending on how jt/fetch-cdn-offload is designed (see my
> https://public-inbox.org/git/87k1hv6eel.fsf@evledraar.gmail.com/) it
> could be (ab)used to do this. I.e. you'd keep a "base" *.{pack,idx}
> around for such a purpose.
> 
> So in such a case you'd serve up that recent-enough *.{pack,idx} for the
> client to "wget", and the client would then trust it (or not) and do the
> equivalent of a "fetch" from that point to be 100% up-to-date.

I think it's sort of orthogonal. Either way you have to teach the client
how to get a .pack/.idx combo. Whether it learns to receive it inline
from the first fetch, or whether it is taught to expect it from the
out-of-band fetch, most of the challenge is the same.

> > This could even be stuffed into the pack format and stripped out by
> > the receiving index-pack (i.e., each entry is prefixed with "and by
> > the way, here is my sha1...").
> 
> That would be really interesting. I.e. just having room for that (or
> anything else) in the pack format.
> 
> I wonder if it could be added to the delta-chain in the current format
> as a nasty hack :)

There's definitely not "room" in any sense of the word in the pack
format. :) However, as long as all parties agreed, we can stick whatever
we want into the on-the-wire format. So I was imagining something more
like:

  1. pack-objects learns a --report-object-id option that sticks some
     additional bytes before each object (in its simplest form,
     $obj_hash bytes of id)

  2. likewise, index-pack learns a --parse-object-id option to receive
     it and skip hashing the object bytes

  3. we get a new protocol capability, "send-object-ids". If the server
     advertises and the client requests it, then both sides turn on the
     appropriate option

You could even imagine generalizing it to "--report-object-metadata",
and including 0 or more metadata packets before each object. With object
id being one, but possibly other computable bits like "generation
number" after that. I'm not convinced other metadata is worth the
space/time tradeoff, though. After all, this is stuff that the client
_could_ generate and cache themselves, so you're trading off bandwidth
to save the client from doing the computation.

Anyway, food for thought. :)

-Peff


* Re: Resolving deltas dominates clone time
  2019-04-20  3:58 ` Jeff King
  2019-04-20  7:59   ` Ævar Arnfjörð Bjarmason
@ 2019-04-22 20:21   ` Martin Fick
  2019-04-22 20:56     ` Jeff King
  1 sibling, 1 reply; 26+ messages in thread
From: Martin Fick @ 2019-04-22 20:21 UTC (permalink / raw)
  To: Jeff King; +Cc: Git Mailing List

On Friday, April 19, 2019 11:58:25 PM MDT Jeff King wrote:
> On Fri, Apr 19, 2019 at 03:47:22PM -0600, Martin Fick wrote:
> > I have been thinking about this problem, and I suspect that this compute
> > time is actually spent doing SHA1 calculations, is that possible? Some
> > basic back of the envelope math and scripting seems to show that the repo
> > may actually contain about 2TB of data if you add up the size of all the
> > objects in the repo. Some quick research on the net seems to indicate
> > that we might be able to expect something around 500MB/s throughput on
> > computing SHA1s, does that seem reasonable? If I really have 2TB of data,
> > should it then take around 66mins to get the SHA1s for all that data?
> > Could my repo clone time really be dominated by SHA1 math?
> 
> That sounds about right, actually. 8GB to 2TB is a compression ratio of
> 250:1. That's bigger than I've seen, but I get 51:1 in the kernel.
> 
> Try this (with a recent version of git; your v1.8.2.1 won't have
> --batch-all-objects):
> 
>   # count the on-disk size of all objects
>   git cat-file --batch-all-objects --batch-check='%(objectsize)
> %(objectsize:disk)' | perl -alne '
>     $repo += $F[0];
>     $disk += $F[1];
>     END { print "$repo / $disk = ", $repo/$disk }
>   '

This has been running for a few hours now, I will update you with results when 
it's done.

> 250:1 isn't inconceivable if you have large blobs which have small
> changes to them (and at 8GB for 8 million objects, you probably do have
> some larger blobs, since the kernel is about 1/8th the size for the same
> number of objects).

I think it's mostly xml files in the 1-10MB range.

> So yes, if you really do have to hash 2TB of data, that's going to take
> a while.

I was hoping I was wrong. Unfortunately I sense that this is not likely 
something we can improve with a better algorithm. It seems like the best way 
to handle this long term is likely to use BUP's rolling hash splitting, it 
would make this way better (assuming it made objects small enough). I think it 
is interesting that this approach might end up being effective for more than 
just large binary file repos. If I could get this repo into bup somehow, it 
could potentially show us if this would drastically reduce the index-pack 
time.

> I think v2.18 will have the collision-detecting sha1 on by default,
> which is slower.

Makes sense.

> If you don't mind losing the collision-detection, using openssl's sha1
> might help. The delta resolution should be threaded, too. So in _theory_
> you're using 66 minutes of CPU time, but that should only take 1-2
> minutes on your 56-core machine. I don't know at what point you'd run
> into lock contention, though. The locking there is quite coarse.

I suspect at 3 threads, seems like the default?

I am running some index packs to test the theory, I can tell you already that 
the 56-thread version was much slower; it took 397m25.622s. I am running a
few other tests also, but it will take a while to get an answer. Since things 
take hours to test, I made a repo with a single branch (and the tags for that 
branch) from this bigger repo using a git init/git fetch. The single branch 
repo takes about 12s to clone, but it takes around 14s with 3 threads to run 
index-pack, any ideas why it is slower than a clone?

Here are some thread times for the single branch case:

 Threads  Time
 56       49s
 12       34s
 5        20s
 4        15s
 3        14s
 2        17s
 1        30s

So 3 threads appears optimal in this case.

Perhaps the locking can be improved here to make threading more effective?

> We also hash non-deltas while we're receiving them over the network.
> That's accounted for in the "receiving pack" part of the progress meter.
> If the time looks to be going to "resolving deltas", then that should
> all be threaded.

Would it make sense to make the receiving pack time also threaded because I 
believe that time is still longer than the I/O time (2 or 3 times)?

> If you want to replay the slow part, it should just be index-pack. So
> something like (with $old as a fresh clone of the repo):
> 
>   git init --bare new-repo.git
>   cd new-repo.git
>   perf record git index-pack -v --stdin <$old/.git/objects/pack/pack-*.pack
>   perf report
> 
> should show you where the time is going (substitute perf with whatever
> profiling tool you like).

I will work on profiling soon, but I wanted to give an update now.

Thanks for the great feedback,
 
-Martin

-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation



* Re: Resolving deltas dominates clone time
  2019-04-22 20:21   ` Martin Fick
@ 2019-04-22 20:56     ` Jeff King
  2019-04-22 21:02       ` Jeff King
                         ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Jeff King @ 2019-04-22 20:56 UTC (permalink / raw)
  To: Martin Fick; +Cc: Git Mailing List

On Mon, Apr 22, 2019 at 02:21:40PM -0600, Martin Fick wrote:

> > Try this (with a recent version of git; your v1.8.2.1 won't have
> > --batch-all-objects):
> > 
> >   # count the on-disk size of all objects
> >   git cat-file --batch-all-objects --batch-check='%(objectsize)
> > %(objectsize:disk)' | perl -alne '
> >     $repo += $F[0];
> >     $disk += $F[1];
> >     END { print "$repo / $disk = ", $repo/$disk }
> >   '
> 
> This has been running for a few hours now, I will update you with results when 
> it's done.

Hours? I think something might be wrong. It takes 20s to run on
linux.git.

> > 250:1 isn't inconceivable if you have large blobs which have small
> > changes to them (and at 8GB for 8 million objects, you probably do have
> > some larger blobs, since the kernel is about 1/8th the size for the same
> > number of objects).
> 
> I think it's mostly xml files in the 1-10MB range.

Yeah, I could believe that would do it, then. Take a 10MB file with a
hundred 1K updates. That'd be 99*1K + 10MB to store, for almost 100:1
compression.
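
Spelling that out in KB, for a hundred revisions of a 10MB file where each
new revision is stored as a ~1K delta:

  $ echo "scale=1; (100 * 10 * 1024) / (10 * 1024 + 99 * 1)" | bc
  99.0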

The key is really having objects where the size of change versus the
file size is small (so the marginal cost of each revision is small), and
then having lots of changes (to give you a big multiplier).

> > So yes, if you really do have to hash 2TB of data, that's going to take
> > a while.
> 
> I was hoping I was wrong. Unfortunately I sense that this is not likely 
> something we can improve with a better algorithm. It seems like the best way 
> to handle this long term is likely to use BUP's rolling hash splitting, it 
> would make this way better (assuming it made objects small enough). I think it 
> is interesting that this approach might end up being effective for more than 
> just large binary file repos. If I could get this repo into bup somehow, it 
> could potentially show us if this would drastically reduce the index-pack 
> time.

It fundamentally can't help without changing Git's object model, since
you need to have complete hashes of all of those files. If you're
proposing to do rolling hashes to split blobs into multiple objects that
would definitely work. But it wouldn't be compatible with Git anymore.

> > If you don't mind losing the collision-detection, using openssl's sha1
> > might help. The delta resolution should be threaded, too. So in _theory_
> > you're using 66 minutes of CPU time, but that should only take 1-2
> > minutes on your 56-core machine. I don't know at what point you'd run
> > into lock contention, though. The locking there is quite coarse.
> 
> I suspect at 3 threads, seems like the default?

Ah, right, I forgot we cap it at 3 (which was determined experimentally,
and which we more or less attributed to lock contention as the
bottleneck). I think you need to use $GIT_FORCE_THREADS to override it.

> I am running some index packs to test the theory, I can tell you already that 
> the 56-thread version was much slower; it took 397m25.622s. I am running a
> few other tests also, but it will take a while to get an answer. Since things 
> take hours to test, I made a repo with a single branch (and the tags for that 
> branch) from this bigger repo using a git init/git fetch. The single branch 
> repo takes about 12s to clone, but it takes around 14s with 3 threads to run 
> index-pack, any ideas why it is slower than a clone?

Are you running it in the same repo, or in another newly-created repo?
Or alternatively, in a new repo but repeatedly running index-pack? After
the first run, that repo will have all of the objects. And so for each
object it sees, index-pack will say "woah, we already had that one;
let's double check that they're byte for byte identical" which carries
extra overhead (and probably makes the lock contention way worse, too,
because accessing existing objects just has one big coarse lock).

So definitely do something like:

  for threads in 1 2 3 4 5 12 56; do
	rm -rf repo.git
	git init --bare repo.git
	GIT_FORCE_THREADS=$threads \
	  git -C repo.git index-pack -v --stdin </path/to/pack
  done

to test.

> Perhaps the locking can be improved here to make threading more effective?

Probably, but easier said than done, of course.

> > We also hash non-deltas while we're receiving them over the network.
> > That's accounted for in the "receiving pack" part of the progress meter.
> > If the time looks to be going to "resolving deltas", then that should
> > all be threaded.
> 
> Would it make sense to make the receiving pack time also threaded because I
> believe that time is still longer than the I/O time (2 or 3 times)?

It's a lot harder to thread since we're eating the incoming bytes. And
unless you're just coming from a local disk copy, the network is
generally the bottleneck (and if you are coming from a local disk copy,
then consider doing a local clone which will avoid all of this hashing
in the first place).

-Peff


* Re: Resolving deltas dominates clone time
  2019-04-22 20:56     ` Jeff King
@ 2019-04-22 21:02       ` Jeff King
  2019-04-22 21:19       ` [PATCH] p5302: create the repo in each index-pack test Jeff King
  2019-04-22 22:32       ` Resolving deltas dominates clone time Martin Fick
  2 siblings, 0 replies; 26+ messages in thread
From: Jeff King @ 2019-04-22 21:02 UTC (permalink / raw)
  To: Martin Fick; +Cc: Git Mailing List

On Mon, Apr 22, 2019 at 04:56:54PM -0400, Jeff King wrote:

> > I suspect at 3 threads, seems like the default?
> 
> Ah, right, I forgot we cap it at 3 (which was determined experimentally,
> and which we more or less attributed to lock contention as the
> bottleneck). I think you need to use $GIT_FORCE_THREADS to override it.

Ah, nevermind this. Using --threads will do what you expect. The cap at
3 only applies when we've just picked the number of available CPUs as a
default.
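
So the earlier loop can just pass the option directly; roughly:

  for threads in 1 2 3 4 5 12 56; do
	rm -rf repo.git
	git init --bare repo.git
	git -C repo.git index-pack -v --threads=$threads --stdin </path/to/pack
  done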

-Peff


* [PATCH] p5302: create the repo in each index-pack test
  2019-04-22 20:56     ` Jeff King
  2019-04-22 21:02       ` Jeff King
@ 2019-04-22 21:19       ` Jeff King
  2019-04-23  1:09         ` Junio C Hamano
  2019-04-22 22:32       ` Resolving deltas dominates clone time Martin Fick
  2 siblings, 1 reply; 26+ messages in thread
From: Jeff King @ 2019-04-22 21:19 UTC (permalink / raw)
  To: Martin Fick; +Cc: Git Mailing List

On Mon, Apr 22, 2019 at 04:56:53PM -0400, Jeff King wrote:

> > I am running some index packs to test the theory, I can tell you already that 
> > the 56-thread version was much slower; it took 397m25.622s. I am running a
> > few other tests also, but it will take a while to get an answer. Since things 
> > take hours to test, I made a repo with a single branch (and the tags for that 
> > branch) from this bigger repo using a git init/git fetch. The single branch 
> > repo takes about 12s to clone, but it takes around 14s with 3 threads to run 
> > index-pack, any ideas why it is slower than a clone?
> 
> Are you running it in the same repo, or in another newly-created repo?
> Or alternatively, in a new repo but repeatedly running index-pack? After
> the first run, that repo will have all of the objects. And so for each
> object it sees, index-pack will say "woah, we already had that one;
> let's double check that they're byte for byte identical" which carries
> extra overhead (and probably makes the lock contention way worse, too,
> because accessing existing objects just has one big coarse lock).
> 
> So definitely do something like:
> 
>   for threads in 1 2 3 4 5 12 56; do
> 	rm -rf repo.git
> 	git init --bare repo.git
> 	GIT_FORCE_THREADS=$threads \
> 	  git -C repo.git index-pack -v --stdin </path/to/pack
>   done
> 
> to test.

This is roughly what p5302 is doing, though it does not go as high as
56 (though it seems like there is probably not much point in doing so).

However, I did notice this slight bug in it. After this fix, here are my
numbers from indexing git.git:

  Test                                           HEAD             
  ----------------------------------------------------------------
  5302.2: index-pack 0 threads                   22.72(22.55+0.16)
  5302.3: index-pack 1 thread                    23.26(23.02+0.24)
  5302.4: index-pack 2 threads                   13.19(24.06+0.23)
  5302.5: index-pack 4 threads                   7.96(24.65+0.25) 
  5302.6: index-pack 8 threads                   7.94(45.06+0.38) 
  5302.7: index-pack default number of threads   9.37(23.82+0.18) 

So it looks like "4" is slightly better than the default of "3" for me.

I'm running it on linux.git now, but it will take quite a while to come
up with a result.

-- >8 --
Subject: [PATCH] p5302: create the repo in each index-pack test

The p5302 script runs "index-pack --stdin" in each timing test. It does
two things to try to get good timings:

  1. we do the repo creation in a separate (non-timed) setup test, so
     that our timing is purely the index-pack run

  2. we use a separate repo for each test; this is important because the
     presence of existing objects in the repo influences the result
     (because we'll end up doing collision checks against them)

But this forgets one thing: we generally run each timed test multiple
times to reduce the impact of noise. Which means that repeats of each
test after the first will be subject to the collision slowdown from
point 2, and we'll generally just end up taking the first time anyway.

Instead, let's create the repo in the test (effectively undoing point
1). That does add a constant amount of extra work to each iteration, but
it's quite small compared to the actual effects we're interested in
measuring.

Signed-off-by: Jeff King <peff@peff.net>
---
The very first 0-thread one will run faster because it has less to "rm
-rf", but I think we can ignore that.

 t/perf/p5302-pack-index.sh | 31 ++++++++++++++++++-------------
 1 file changed, 18 insertions(+), 13 deletions(-)

diff --git a/t/perf/p5302-pack-index.sh b/t/perf/p5302-pack-index.sh
index 99bdb16c85..a9b3e112d9 100755
--- a/t/perf/p5302-pack-index.sh
+++ b/t/perf/p5302-pack-index.sh
@@ -13,35 +13,40 @@ test_expect_success 'repack' '
 	export PACK
 '
 
-test_expect_success 'create target repositories' '
-	for repo in t1 t2 t3 t4 t5 t6
-	do
-		git init --bare $repo
-	done
-'
-
 test_perf 'index-pack 0 threads' '
-	GIT_DIR=t1 git index-pack --threads=1 --stdin < $PACK
+	rm -rf repo.git &&
+	git init --bare repo.git &&
+	GIT_DIR=repo.git git index-pack --threads=1 --stdin < $PACK
 '
 
 test_perf 'index-pack 1 thread ' '
-	GIT_DIR=t2 GIT_FORCE_THREADS=1 git index-pack --threads=1 --stdin < $PACK
+	rm -rf repo.git &&
+	git init --bare repo.git &&
+	GIT_DIR=repo.git GIT_FORCE_THREADS=1 git index-pack --threads=1 --stdin < $PACK
 '
 
 test_perf 'index-pack 2 threads' '
-	GIT_DIR=t3 git index-pack --threads=2 --stdin < $PACK
+	rm -rf repo.git &&
+	git init --bare repo.git &&
+	GIT_DIR=repo.git git index-pack --threads=2 --stdin < $PACK
 '
 
 test_perf 'index-pack 4 threads' '
-	GIT_DIR=t4 git index-pack --threads=4 --stdin < $PACK
+	rm -rf repo.git &&
+	git init --bare repo.git &&
+	GIT_DIR=repo.git git index-pack --threads=4 --stdin < $PACK
 '
 
 test_perf 'index-pack 8 threads' '
-	GIT_DIR=t5 git index-pack --threads=8 --stdin < $PACK
+	rm -rf repo.git &&
+	git init --bare repo.git &&
+	GIT_DIR=repo.git git index-pack --threads=8 --stdin < $PACK
 '
 
 test_perf 'index-pack default number of threads' '
-	GIT_DIR=t6 git index-pack --stdin < $PACK
+	rm -rf repo.git &&
+	git init --bare repo.git &&
+	GIT_DIR=repo.git git index-pack --stdin < $PACK
 '
 
 test_done
-- 
2.21.0.1182.g3590c06d32



* Re: Resolving deltas dominates clone time
  2019-04-22 20:56     ` Jeff King
  2019-04-22 21:02       ` Jeff King
  2019-04-22 21:19       ` [PATCH] p5302: create the repo in each index-pack test Jeff King
@ 2019-04-22 22:32       ` Martin Fick
  2019-04-23  1:55         ` Jeff King
  2 siblings, 1 reply; 26+ messages in thread
From: Martin Fick @ 2019-04-22 22:32 UTC (permalink / raw)
  To: Jeff King; +Cc: Git Mailing List

On Monday, April 22, 2019 4:56:54 PM MDT Jeff King wrote:
> On Mon, Apr 22, 2019 at 02:21:40PM -0600, Martin Fick wrote:
> > > Try this (with a recent version of git; your v1.8.2.1 won't have
> > > 
> > > --batch-all-objects):
> > >   # count the on-disk size of all objects
> > >   git cat-file --batch-all-objects --batch-check='%(objectsize)
> > > 
> > > %(objectsize:disk)' | perl -alne '
> > > 
> > >     $repo += $F[0];
> > >     $disk += $F[1];
> > >     END { print "$repo / $disk = ", $repo/$disk }
> > >   
> > >   '
> > 
> > This has been running for a few hours now, I will update you with results
> > when it's done.
> 
> Hours? I think something might be wrong. It takes 20s to run on
> linux.git.

OK, yes I was running this on a "bad" copy of the repo, see below because I 
think it might be of some interest also...

On the better copy, the git part of the command took 3m48.025s (still not 20s, 
but not hours either), using git v2.18. The results were:

2085532480789 / 7909358121 = 263.679106304687

This seems to confirm the sizing numbers.
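
(The numerator also confirms the ~2TB estimate, since

  $ echo "scale=2; 2085532480789 / 1024^4" | bc   # total object size in TiB
  1.89

i.e. about 1.9TiB of logical object data.)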


As for the "bad repo", let me describe:

1) It is on a different machine (only 32 processors, and some serious 
background load)

2) That copy of the repo is from a mirror clone of our source repo without 
running repacking (-f) on it. Without repacking -f, the clone is actually 16G, 
which I believe is using the deltas from the source repo which would have been 
created by people pushing new commits over and over to the source repo and 
those new objects just being added to the common pack file during repacking 
without creating a ton of new deltas. It would appear that the deltas created 
from regular repacking without using the -f may be really destroying some of 
the performance of our source repo even worse then I imagined. The rest of my 
testing has been done on a repo repacked with -f to eliminate the variability 
imposed from the source repo because I figured it would adversely impact 
things, but I did not imagine it being that bad because even the clone time of 
the bad repo is not that bad, so I wonder why the git cat-file is way worse?
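
For reference, the two kinds of repack being compared here are roughly:

  # default: reuse existing deltas, so the pack inherits whatever deltas
  # individual pushes happened to produce over time
  git repack -a -d

  # -f (--no-reuse-delta) recomputes deltas from scratch; -F additionally
  # disables object reuse, and is what the clean 8G copy was built with
  git repack -A -d -F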


> The key is really having objects where the size of change versus the
> file size is small (so the marginal cost of each revision is small), and
> then having lots of changes (to give you a big multiplier).

Right.

> > > So yes, if you really do have to hash 2TB of data, that's going to take
> > > a while.
> > 
> > I was hoping I was wrong. Unfortunately I sense that this is not likely
> > something we can improve with a better algorithm. It seems like the best
> > way to handle this long term is likely to use BUP's rolling hash
> > splitting, it would make this way better (assuming it made objects small
> > enough). I think it is interesting that this approach might end up being
> > effective for more than just large binary file repos. If I could get this
> > repo into bup somehow, it could potentially show us if this would
> > drastically reduce the index-pack time.
> 
> It fundamentally can't help without changing Git's object model, since
> you need to have complete hashes of all of those files. If you're
> proposing to do rolling hashes to split blobs into multiple objects that
> would definitely work. But it wouldn't be compatible with Git anymore.

Right. I really just meant to point out that many people may want BUP-style
rolling-hash split blobs in git even without having gigantic blobs, since it
might really help with smaller blobs and long histories with small deltas.
Most of the git performance issues tend to focus on large repos with large
blobs (which is what BUP was made for), but this could really help normal git
users, possibly even the kernel, as we start to develop some seriously longer
histories. I think most assumptions have been that this is not likely to be a
problem any time soon.

> Are you running it in the same repo, or in another newly-created repo?

Yes.

> Or alternatively, in a new repo but repeatedly running index-pack? After
> the first run, that repo will have all of the objects. And so for each
> object it sees, index-pack will say "woah, we already had that one;
> let's double check that they're byte for byte identical" which carries
> extra overhead (and probably makes the lock contention way worse, too,
> because accessing existing objects just has one big coarse lock).
> 
> So definitely do something like:
> 
>   for threads in 1 2 3 4 5 12 56; do
> 	rm -rf repo.git
> 	git init --bare repo.git
> 	GIT_FORCE_THREADS=$threads \
> 	  git -C repo.git index-pack -v --stdin </path/to/pack
>   done
> 
> to test.

Makes sense, testing now...

> > Perhaps the locking can be improved here to make threading more effective?
> 
> Probably, but easier said than done, of course.
> 
> > > We also hash non-deltas while we're receiving them over the network.
> > > That's accounted for in the "receiving pack" part of the progress meter.
> > > If the time looks to be going to "resolving deltas", then that should
> > > all be threaded.
> > 
> > Would it make sense to make the receiving pack time also threaded because
> > I
> > believe that time is still longer than the I/O time (2 or 3 times)?
> 
> It's a lot harder to thread since we're eating the incoming bytes. And
> unless you're just coming from a local disk copy, the network is
> generally the bottleneck (and if you are coming from a local disk copy,
> then consider doing a local clone which will avoid all of this hashing
> in the first place).

Is that really a fair assumption in today's intranets? Many corporate LANs 
have higher bandwidth than normal disks, don't they?

-Martin

-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] p5302: create the repo in each index-pack test
  2019-04-22 21:19       ` [PATCH] p5302: create the repo in each index-pack test Jeff King
@ 2019-04-23  1:09         ` Junio C Hamano
  2019-04-23  2:07           ` Jeff King
  0 siblings, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2019-04-23  1:09 UTC (permalink / raw)
  To: Jeff King; +Cc: Martin Fick, Git Mailing List

Jeff King <peff@peff.net> writes:

> Subject: [PATCH] p5302: create the repo in each index-pack test
>
> The p5302 script runs "index-pack --stdin" in each timing test. It does
> two things to try to get good timings:
>
>   1. we do the repo creation in a separate (non-timed) setup test, so
>      that our timing is purely the index-pack run
>
>   2. we use a separate repo for each test; this is important because the
>      presence of existing objects in the repo influences the result
>      (because we'll end up doing collision checks against them)
>
> But this forgets one thing: we generally run each timed test multiple
> times to reduce the impact of noise. Which means that repeats of each
> test after the first will be subject to the collision slowdown from
> point 2, and we'll generally just end up taking the first time anyway.

The above is very cleanly written to convince anybody that what the
current test does contradicts wish #2 above, and that the two wishes
#1 and #2 are probably mutually incompatible.

But isn't the collision check a part of the real-life workload that
Git users are made to wait for and care about the performance of?
Or are we purely interested in the cost of resolving deltas,
computing the object name, and writing the result out to the disk in
this test, with the "overall experience" benchmark left elsewhere?

The reason I got confused is that the test_description of the
script leaves "the actual effects we're interested in measuring"
unsaid, I think.  The log message of b8a2486f ("index-pack: support
multithreaded delta resolving", 2012-05-06) that created this test
does not help that much, either.

In any case, with the above "this forgets one thing" we now declare
very clearly what we are interested in at this point in time, and I
agree that the solution described in the paragraph below matches that
goal.  Looks good.

> Instead, let's create the repo in the test (effectively undoing point
> 1). That does add a constant amount of extra work to each iteration, but
> it's quite small compared to the actual effects we're interested in
> measuring.


> Signed-off-by: Jeff King <peff@peff.net>
> ---
> The very first 0-thread one will run faster because it has less to "rm
> -rf", but I think we can ignore that.

OK.

> -	GIT_DIR=t1 git index-pack --threads=1 --stdin < $PACK
> +	rm -rf repo.git &&
> +	git init --bare repo.git &&
> +	GIT_DIR=repo.git git index-pack --threads=1 --stdin < $PACK

This is obviously inherited from the original, but do we get scolded
by some versions of bash for this line, without quoting the source path
of the redirection, i.e.

	... --stdin <"$PACK"


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Resolving deltas dominates clone time
  2019-04-22 22:32       ` Resolving deltas dominates clone time Martin Fick
@ 2019-04-23  1:55         ` Jeff King
  2019-04-23  4:21           ` Jeff King
  0 siblings, 1 reply; 26+ messages in thread
From: Jeff King @ 2019-04-23  1:55 UTC (permalink / raw)
  To: Martin Fick; +Cc: Git Mailing List

On Mon, Apr 22, 2019 at 04:32:16PM -0600, Martin Fick wrote:

> > Hours? I think something might be wrong. It takes 20s to run on
> > linux.git.
> 
> OK, yes I was running this on a "bad" copy of the repo, see below because I 
> think it might be of some interest also...
> 
> On the better copy, the git part of the command took 3m48.025s (still not 20s, 
> but not hours either), using git v2.18. The results were:

That is slower than I'd expect. Two things that could speed it up:

  - use --buffer, which drops write() overhead

  - use a more recent version of Git which has 0750bb5b51 (cat-file:
    support "unordered" output for --batch-all-objects, 2018-08-10).
    This order gets better cache locality (both the disk cache, if you
    can't fit the whole thing in RAM, and the delta cache).

Those drop my 20s run on linux.git to 13s, but they might have more
impact for you.
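
Concretely, I mean something like this (just a sketch; I'm assuming the
original measurement was summing %(objectsize) over every object):

  git cat-file --batch-all-objects --batch-check='%(objectsize)' \
	--buffer --unordered |
  awk '{ sum += $1 } END { print sum }'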

> 2) That copy of the repo is from a mirror clone of our source repo without 
> running repacking (-f) on it. Without repacking -f, the clone is actually 16G, 
> which I believe is using the deltas from the source repo which would have been 
> created by people pushing new commits over and over to the source repo and 
> those new objects just being added to the common pack file during repacking 
> without creating a ton of new deltas. It would appear that the deltas created 
> from regular repacking without using the -f may be really destroying some of 
> the performance of our source repo even worse then I imagined. The rest of my 
> testing has been done on a repo repacked with -f to eliminate the variability 
> imposed from the source repo because I figured it would adversely impact 
> things, but I did not imagine it being that bad because even the clone time of 
> the bad repo is not that bad, so I wonder why the git cat-file is way worse?

Yes, I do think it's worth doing a "repack -f" every once in a while
(and probably worth --window=250 since you're already splurging on CPU).
It does seem to produce better results than taking the accumulated
results of the individual pushes. I don't think this is an area we've
studied all that well, so there may be some ways to make it better (but
what's there has been "good enough" that nobody has sunk a lot of time
into it).
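
In other words, as an occasional maintenance job, something along these
lines (nothing exotic, just the flags discussed above):

  # --window=250 costs more repack CPU, but tends to produce a much
  # denser pack than the accumulated deltas from individual pushes
  git repack -a -d -f --window=250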

> > So definitely do something like:
> > 
> >   for threads in 1 2 3 4 5 12 56; do
> > 	rm -rf repo.git
> > 	git init --bare repo.git
> > 	GIT_FORCE_THREADS=$threads \
> > 	  git -C repo.git index-pack -v --stdin </path/to/pack
> >   done
> > 
> > to test.
> 
> Makes sense, testing now...

Here are my p5302 numbers on linux.git, by the way.

  Test                                           jk/p5302-repeat-fix
  ------------------------------------------------------------------
  5302.2: index-pack 0 threads                   307.04(303.74+3.30)
  5302.3: index-pack 1 thread                    309.74(306.13+3.56)
  5302.4: index-pack 2 threads                   177.89(313.73+3.60)
  5302.5: index-pack 4 threads                   117.14(344.07+4.29)
  5302.6: index-pack 8 threads                   112.40(607.12+5.80)
  5302.7: index-pack default number of threads   135.00(322.03+3.74)

which still imply that "4" is a win over "3" ("8" is slightly better
still in wall-clock time, but the total CPU rises dramatically; that's
probably because this is a quad-core with hyperthreading, so by that
point we're just throttling down the CPUs).
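
(For reference, these numbers come from the perf harness, roughly like
the following; the path is illustrative, and I'm assuming p5302 picks
the repo up from the usual large-repo variable:)

  cd t/perf &&
  GIT_PERF_LARGE_REPO=/path/to/linux.git \
  GIT_PERF_REPEAT_COUNT=3 \
  ./run p5302-*.sh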

> > It's a lot harder to thread since we're eating the incoming bytes. And
> > unless you're just coming from a local disk copy, the network is
> > generally the bottleneck (and if you are coming from a local disk copy,
> > then consider doing a local clone which will avoid all of this hashing
> > in the first place).
> 
> Is that really a fair assumption in today's intranets? Many corporate LANs 
> have higher bandwidth than normal disks, don't they?

Single-threaded SHA-1 is 500-1000 megabytes/sec. That's 4-8
gigabits/sec. And keep in mind we're just handling the base objects, so
we're only computing the SHA-1 on some of the bytes.
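
(If you want to sanity-check that figure on a given box, e.g.:)

  # OpenSSL's numbers are a rough proxy for git's SHA-1 code
  openssl speed sha1
  # or push ~4GB of zeroes through sha1sum and look at the wall time
  time dd if=/dev/zero bs=1M count=4096 2>/dev/null | sha1sum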

Of course there's another SHA-1 computation going over the whole
packfile at that point, so _that_ is probably a bigger bottleneck if you
really are on a 10 gigabit network.

But if somebody wants to spend the time to look at parallelizing it, I
wouldn't say no. :)

-Peff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] p5302: create the repo in each index-pack test
  2019-04-23  1:09         ` Junio C Hamano
@ 2019-04-23  2:07           ` Jeff King
  2019-04-23  2:27             ` Junio C Hamano
  0 siblings, 1 reply; 26+ messages in thread
From: Jeff King @ 2019-04-23  2:07 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Martin Fick, Git Mailing List

On Tue, Apr 23, 2019 at 10:09:54AM +0900, Junio C Hamano wrote:

> The above is very cleanly written to convince anybody that what the
> current test does contradicts with wish #2 above, and that the two
> wishes #1 and #2 are probably mutually incompatible.
> 
> But isn't the collision check a part of the real-life workload that
> Git users are made waiting for and care about the performance of?
> Or are we purely interested in the cost of resolving delta,
> computing the object name, and writing the result out to the disk in
> this test and the "overall experience" benchmark is left elsewhere?

I think we _are_ just interested in the resolving delta cost (after all,
we're testing it with various thread levels). What's more, the old code
would run the test $GIT_PERF_REPEAT_COUNT times, once without any objects, and
then the other N-1 times with objects. And then take the smallest time,
which would generally be the one-off!  So we're really just measuring
that case more consistently now.

But even if you left all of that aside, I think the case without objects
is actually the realistic one. It represents the equivalent of a full
clone, where we would not have any objects already.

The case where we are fetching into a repository with objects already is
also potentially of interest, but this test wouldn't show that very
well. Because there the main added cost is looking up each object and
saying "ah, we do not have it; no need to do a collision check", because
we'd generally not expect the other side to be sending us duplicates.

But because this test would be repeating itself on the same pack each
time, we'd be seeing a collision on _every_ object. And the added time
would be dominated by us saying "oops, a collision; let's take the slow
path and reconstruct that object from disk so we can compare its bytes".
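
(That is easy to provoke directly, by the way: index the same pack twice
into one repo and compare the two timings; the pack path is
illustrative.)

  git init --bare repo.git
  time git -C repo.git index-pack -v --stdin </path/to/pack  # clean run
  time git -C repo.git index-pack -v --stdin </path/to/pack  # every object collides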

> The reason why I got confused is because the test_description of the
> script leaves "the actual effects we're interested in measuring"
> unsaid, I think.  The log message of b8a2486f ("index-pack: support
> multithreaded delta resolving", 2012-05-06) that created this test
> does not help that much, either.
> 
> In any case, the above "this forgets one thing" makes it clear that
> we at this point in time declare what we are interested in very
> clearly, and I agree that the solution described in the paragraph
> below clearly matches the goal.  Looks good.

So I think you convinced yourself even before my email that this was the
right path, but let me know if you think it's worth trying to revise the
commit message to include some of the above reasoning.

> > -	GIT_DIR=t1 git index-pack --threads=1 --stdin < $PACK
> > +	rm -rf repo.git &&
> > +	git init --bare repo.git &&
> > +	GIT_DIR=repo.git git index-pack --threads=1 --stdin < $PACK
> 
> This is obviously inherited from the original, but do we get scolded
> by some versions of bash for this line, without quoting the source path
> of the redirection, i.e.
> 
> 	... --stdin <"$PACK"

In general, yes, but I think we are OK in this instance because we
generated $PACK ourselves in the setup step, and we know that it is just
a relative .git/objects/pack/xyz.pack with no spaces. I almost touched
it just to get rid of the style-violating space after the "<" though. ;)

-Peff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] p5302: create the repo in each index-pack test
  2019-04-23  2:07           ` Jeff King
@ 2019-04-23  2:27             ` Junio C Hamano
  2019-04-23  2:36               ` Jeff King
  0 siblings, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2019-04-23  2:27 UTC (permalink / raw)
  To: Jeff King; +Cc: Martin Fick, Git Mailing List

Jeff King <peff@peff.net> writes:

>> This is obviously inherited from the original, but do we get scolded
>> by some versions of bash for this line, without quoting the source path
>> of the redirection, i.e.
>> 
>> 	... --stdin <"$PACK"
>
> In general, yes, but I think we are OK in this instance because we
> generated $PACK ourselves in the setup step, and we know that it is just
> a relative .git/objects/pack/xyz.pack with no spaces.

I know we are OK, but AFAIU the issue with some versions of bash is
that bash is not OK with an unquoted $variable used as the target or
the source of a redirection, regardless of its contents, and issues
an unnecessary warning.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] p5302: create the repo in each index-pack test
  2019-04-23  2:27             ` Junio C Hamano
@ 2019-04-23  2:36               ` Jeff King
  2019-04-23  2:40                 ` Junio C Hamano
  0 siblings, 1 reply; 26+ messages in thread
From: Jeff King @ 2019-04-23  2:36 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Martin Fick, Git Mailing List

On Tue, Apr 23, 2019 at 11:27:02AM +0900, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> >> This is obviously inherited from the original, but do we get scolded
> >> by some versions of bash for this line, without quoting the source path
> >> of the redirection, i.e.
> >> 
> >> 	... --stdin <"$PACK"
> >
> > In general, yes, but I think we are OK in this instance because we
> > generated $PACK ourselves in the setup step, and we know that it is just
> > a relative .git/objects/pack/xyz.pack with no spaces.
> 
> I know we are OK, but the issue with some versions of bash AFAIU is
> that bash is not OK regardless of the contents of $variable that is
> not quoted and used as the target or the source of a redirection,
> issuing an unnecessary warning.

Is it? I thought the issue was specifically when there were spaces. I
get:

  $ bash
  $ file=ok
  $ echo foo >$file
  $ file='not ok'
  $ echo foo >$file
  bash: $file: ambiguous redirect

And that is AFAIK what the recent 7951a016a5 (t4038-diff-combined: quote
paths with whitespace, 2019-03-17) was about (because our trash
directory always has a space in it).

Did I miss a report where it happens on some versions even without
spaces? If so, we have quite a number of things to fix judging from the
output of:

  git grep '>\$'

-Peff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] p5302: create the repo in each index-pack test
  2019-04-23  2:36               ` Jeff King
@ 2019-04-23  2:40                 ` Junio C Hamano
  0 siblings, 0 replies; 26+ messages in thread
From: Junio C Hamano @ 2019-04-23  2:40 UTC (permalink / raw)
  To: Jeff King; +Cc: Martin Fick, Git Mailing List

Jeff King <peff@peff.net> writes:

> Is it? I thought the issue was specifically when there were spaces. I
> get:
>
>   $ bash
>   $ file=ok
>   $ echo foo >$file
>   $ file='not ok'
>   $ echo foo >$file
>   bash: $file: ambiguous redirect

OK, so I misremembered.  Then we are good.  Thanks.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Resolving deltas dominates clone time
  2019-04-23  1:55         ` Jeff King
@ 2019-04-23  4:21           ` Jeff King
  2019-04-23 10:08             ` Duy Nguyen
  0 siblings, 1 reply; 26+ messages in thread
From: Jeff King @ 2019-04-23  4:21 UTC (permalink / raw)
  To: Martin Fick; +Cc: Git Mailing List

On Mon, Apr 22, 2019 at 09:55:38PM -0400, Jeff King wrote:

> Here are my p5302 numbers on linux.git, by the way.
> 
>   Test                                           jk/p5302-repeat-fix
>   ------------------------------------------------------------------
>   5302.2: index-pack 0 threads                   307.04(303.74+3.30)
>   5302.3: index-pack 1 thread                    309.74(306.13+3.56)
>   5302.4: index-pack 2 threads                   177.89(313.73+3.60)
>   5302.5: index-pack 4 threads                   117.14(344.07+4.29)
>   5302.6: index-pack 8 threads                   112.40(607.12+5.80)
>   5302.7: index-pack default number of threads   135.00(322.03+3.74)
> 
> which still imply that "4" is a win over "3" ("8" is slightly better
> still in wall-clock time, but the total CPU rises dramatically; that's
> probably because this is a quad-core with hyperthreading, so by that
> point we're just throttling down the CPUs).

And here's a similar test run on a 20-core Xeon w/ hyperthreading (I
tweaked the test to keep going after eight threads):

Test                            HEAD                
----------------------------------------------------
5302.2: index-pack 1 threads    376.88(364.50+11.52)
5302.3: index-pack 2 threads    228.13(371.21+17.86)
5302.4: index-pack 4 threads    151.41(387.06+21.12)
5302.5: index-pack 8 threads    113.68(413.40+25.80)
5302.6: index-pack 16 threads   100.60(511.85+37.53)
5302.7: index-pack 32 threads   94.43(623.82+45.70) 
5302.8: index-pack 40 threads   93.64(702.88+47.61) 

I don't think any of this is _particularly_ relevant to your case, but
it really seems to me that the default of capping at 3 threads is too
low.

-Peff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Resolving deltas dominates clone time
  2019-04-22 18:43         ` Jeff King
@ 2019-04-23  7:07           ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 26+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-04-23  7:07 UTC (permalink / raw)
  To: Jeff King; +Cc: Martin Fick, Git Mailing List, Jansen, Geert, Jonathan Tan


On Mon, Apr 22 2019, Jeff King wrote:

> On Mon, Apr 22, 2019 at 08:01:15PM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> > Your patch is optionally removing the "woah, we got an object with a
>> > duplicate sha1, let's check that the bytes are the same in both copies"
>> > check. But Martin's problem is a clone, so we wouldn't have any existing
>> > objects to duplicate in the first place.
>>
>> Right, but we do anyway, as reported by Geert at @amazon.com preceding
>> that patch of mine. But it is 99.99% irrelevant to *performance* in this
>> case after the loose object cache you added (but before that could make
>> all the difference depending on the FS).
>
> I scratched my head at this a bit. If we don't have any other objects,
> then what are we comparing against? But I think you mean that we have
> the overhead of doing the object lookups to find that out. Yes, that can
> add up if your filesystem has high latency, but I think in this case it
> is a drop in the bucket compared to dealing with the actual object data.

There was no "we have no objects" clause, so this bit is what dominated
clone time before the loose object cache...

>> I just mentioned it to plant a flag on another bit of the code where
>> index-pack in general has certain paranoias/validation the user might be
>> willing to optionally drop just at "clone" time.
>
> Yeah, I agree it may be worth pursuing independently. I just don't think
> it will help Martin's case in any noticeable way.

...indeed; as noted, I was just mentioning this in the context of things
in index-pack that *in general* might benefit from some "we had no
objects before" special-case.

>> > Right, that would work. I will note one thing, though: the total time to
>> > do a 1-depth clone followed by an unshallow is probably much higher than
>> > doing the whole clone as one unit, for two reasons:
>>
>> Indeed. The hypothesis is that the user doesn't really care about the
>> clone-time, but the clone-to-repo-mostly-usable time.
>
> There was a little bit of self-interest in there for me, too, as a
> server operator. While it does add to the end-to-end time, most of the
> resource use for the shallow fetch gets put on the server. IOW, I don't
> think we'd be happy to see clients doing this depth-1-and-then-unshallow
> strategy for every clone.

Just change from per-seat pricing to charging a premium for CPU &
I/O. Now your problem is a solution :)

More seriously, yeah I think we definitely need to be careful about
changes to git that'll eat someone's "free" server time to save the client
time/work.

At the same time we have dedicated internal operators who wouldn't mind
spending that CPU. So hopefully we can in general find some reasonable
middle-ground.

>> > So in general, I think you'd need some cooperation from the server side
>> > to ask it to generate and send the .idx that matches the .pack it is
>> > sending you. Or even if not the .idx format itself, some stable list of
>> > sha1s that you could use to reproduce it without hashing each
>> > uncompressed byte yourself.
>>
>> Yeah, depending on how jt/fetch-cdn-offload is designed (see my
>> https://public-inbox.org/git/87k1hv6eel.fsf@evledraar.gmail.com/) it
>> could be (ab)used to do this. I.e. you'd keep a "base" *.{pack,idx}
>> around for such a purpose.
>>
>> So in such a case you'd serve up that recent-enough *.{pack,idx} for the
>> client to "wget", and the client would then trust it (or not) and do the
>> equivalent of a "fetch" from that point to be 100% up-to-date.
>
> I think it's sort of orthogonal. Either way you have to teach the client
> how to get a .pack/.idx combo. Whether it learns to receive it inline
> from the first fetch, or whether it is taught to expect it from the
> out-of-band fetch, most of the challenge is the same.

I think it is, but maybe we're talking about different things.

I suspect a few of us have experimented with similar rsync-and-pull
hacks as a replacement for "clone". It's much faster (often 50-90%
faster).

I.e. just an rsync of a recent-enough .git dir (or .git/objects),
followed by a 'reset --hard' to get the worktree and then a 'git pull'.
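
Roughly like this (host and path are made up, and the seed is assumed to
be a reasonably fresh non-bare clone of the same upstream):

  mkdir repo &&
  rsync -a seed-host:/srv/seed-clone/.git/ repo/.git/ &&
  cd repo &&
  git reset --hard &&   # materialize the worktree from the seeded HEAD
  git pull              # only what changed since the seed goes over the wire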

>> > This could even be stuffed into the pack format and stripped out by
>> > the receiving index-pack (i.e., each entry is prefixed with "and by
>> > the way, here is my sha1...").
>>
>> That would be really interesting. I.e. just having room for that (or
>> anything else) in the pack format.
>>
>> I wonder if it could be added to the delta-chain in the current format
>> as a nasty hack :)
>
> There's definitely not "room" in any sense of the word in the pack
> format. :) However, as long as all parties agreed, we can stick whatever
> we want into the on-the-wire format. So I was imagining something more
> like:
>
>   1. pack-objects learns a --report-object-id option that sticks some
>      additional bytes before each object (in its simplest form,
>      $obj_hash bytes of id)
>
>   2. likewise, index-pack learns a --parse-object-id option to receive
>      it and skip hashing the object bytes
>
>   3. we get a new protocol capability, "send-object-ids". If the server
>      advertises and the client requests it, then both sides turn on the
>      appropriate option
>
> You could even imagine generalizing it to "--report-object-metadata",
> and including 0 or more metadata packets before each object. With object
> id being one, but possibly other computable bits like "generation
> number" after that. I'm not convinced other metadata is worth the
> space/time tradeoff, though. After all, this is stuff that the client
> _could_ generate and cache themselves, so you're trading off bandwidth
> to save the client from doing the computation.
>
> Anyway, food for thought. :)

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Resolving deltas dominates clone time
  2019-04-23  4:21           ` Jeff King
@ 2019-04-23 10:08             ` Duy Nguyen
  2019-04-23 20:09               ` Martin Fick
  2019-04-30 17:50               ` Jeff King
  0 siblings, 2 replies; 26+ messages in thread
From: Duy Nguyen @ 2019-04-23 10:08 UTC (permalink / raw)
  To: Jeff King; +Cc: Martin Fick, Git Mailing List

On Tue, Apr 23, 2019 at 11:45 AM Jeff King <peff@peff.net> wrote:
>
> On Mon, Apr 22, 2019 at 09:55:38PM -0400, Jeff King wrote:
>
> > Here are my p5302 numbers on linux.git, by the way.
> >
> >   Test                                           jk/p5302-repeat-fix
> >   ------------------------------------------------------------------
> >   5302.2: index-pack 0 threads                   307.04(303.74+3.30)
> >   5302.3: index-pack 1 thread                    309.74(306.13+3.56)
> >   5302.4: index-pack 2 threads                   177.89(313.73+3.60)
> >   5302.5: index-pack 4 threads                   117.14(344.07+4.29)
> >   5302.6: index-pack 8 threads                   112.40(607.12+5.80)
> >   5302.7: index-pack default number of threads   135.00(322.03+3.74)
> >
> > which still imply that "4" is a win over "3" ("8" is slightly better
> > still in wall-clock time, but the total CPU rises dramatically; that's
> > probably because this is a quad-core with hyperthreading, so by that
> > point we're just throttling down the CPUs).
>
> And here's a similar test run on a 20-core Xeon w/ hyperthreading (I
> tweaked the test to keep going after eight threads):
>
> Test                            HEAD
> ----------------------------------------------------
> 5302.2: index-pack 1 threads    376.88(364.50+11.52)
> 5302.3: index-pack 2 threads    228.13(371.21+17.86)
> 5302.4: index-pack 4 threads    151.41(387.06+21.12)
> 5302.5: index-pack 8 threads    113.68(413.40+25.80)
> 5302.6: index-pack 16 threads   100.60(511.85+37.53)
> 5302.7: index-pack 32 threads   94.43(623.82+45.70)
> 5302.8: index-pack 40 threads   93.64(702.88+47.61)
>
> I don't think any of this is _particularly_ relevant to your case, but
> it really seems to me that the default of capping at 3 threads is too
> low.

Looking back at the multithread commit, I think the trend was the same
and I capped it because the gain was not proportional to the number of
cores we threw at index-pack anymore. I would not be opposed to
raising the cap though (or maybe just remove it)
-- 
Duy

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Resolving deltas dominates clone time
  2019-04-23 10:08             ` Duy Nguyen
@ 2019-04-23 20:09               ` Martin Fick
  2019-04-30 18:02                 ` Jeff King
  2019-04-30 17:50               ` Jeff King
  1 sibling, 1 reply; 26+ messages in thread
From: Martin Fick @ 2019-04-23 20:09 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Jeff King, Git Mailing List

On Tuesday, April 23, 2019 5:08:40 PM MDT Duy Nguyen wrote:
> On Tue, Apr 23, 2019 at 11:45 AM Jeff King <peff@peff.net> wrote:
> > On Mon, Apr 22, 2019 at 09:55:38PM -0400, Jeff King wrote:
> > > Here are my p5302 numbers on linux.git, by the way.
> > > 
> > >   Test                                           jk/p5302-repeat-fix
> > >   ------------------------------------------------------------------
> > >   5302.2: index-pack 0 threads                   307.04(303.74+3.30)
> > >   5302.3: index-pack 1 thread                    309.74(306.13+3.56)
> > >   5302.4: index-pack 2 threads                   177.89(313.73+3.60)
> > >   5302.5: index-pack 4 threads                   117.14(344.07+4.29)
> > >   5302.6: index-pack 8 threads                   112.40(607.12+5.80)
> > >   5302.7: index-pack default number of threads   135.00(322.03+3.74)
> > > 
> > > which still imply that "4" is a win over "3" ("8" is slightly better
> > > still in wall-clock time, but the total CPU rises dramatically; that's
> > > probably because this is a quad-core with hyperthreading, so by that
> > > point we're just throttling down the CPUs).
> > 
> > And here's a similar test run on a 20-core Xeon w/ hyperthreading (I
> > tweaked the test to keep going after eight threads):
> > 
> > Test                            HEAD
> > ----------------------------------------------------
> > 5302.2: index-pack 1 threads    376.88(364.50+11.52)
> > 5302.3: index-pack 2 threads    228.13(371.21+17.86)
> > 5302.4: index-pack 4 threads    151.41(387.06+21.12)
> > 5302.5: index-pack 8 threads    113.68(413.40+25.80)
> > 5302.6: index-pack 16 threads   100.60(511.85+37.53)
> > 5302.7: index-pack 32 threads   94.43(623.82+45.70)
> > 5302.8: index-pack 40 threads   93.64(702.88+47.61)
> > 
> > I don't think any of this is _particularly_ relevant to your case, but
> > it really seems to me that the default of capping at 3 threads is too
> > low.

Here are my index-pack results (I only ran them once since they take a while) 
using git v1.8.3.2:

Threads  real       user        sys
1     108m46.151s 106m14.420s  1m57.192s
2     58m14.274s  106m23.158s  5m32.736s
3     40m33.351s  106m42.281s  5m40.884s
4     31m40.342s  107m20.278s  5m40.675s
5     26m0.454s   106m54.370s  5m35.827s
12    13m25.304s  107m57.271s  6m26.493s
16    10m56.866s  107m46.107s  6m41.330s
18    10m18.112s  109m50.893s  7m1.369s
20    9m54.010s   113m51.028s  7m53.082s
24    9m1.104s    115m8.245s   7m57.156s
28    8m26.058s   116m46.311s  8m34.752s
32    8m42.967s   140m33.280s  9m59.514s
36    8m52.228s   151m28.939s  11m55.590s
40    8m22.719s   153m4.496s   12m36.041s
44    8m12.419s   166m41.594s  14m7.717s
48    8m0.377s    172m3.597s   16m32.041s
56    8m22.320s   188m31.426s  17m48.274s

 
> Looking back at the multithread commit, I think the trend was the same
> and I capped it because the gain was not proportional to the number of
> cores we threw at index-pack anymore. I would not be opposed to
> raising the cap though (or maybe just remove it)

I think that if there were no default limit during a clone it could have 
disastrous effects on people using the repo tool from the android project, or 
any other "submodule like" tool that might clone many projects in parallel. 
With the repo tool, people often use a large -j number such as 24 which means 
they end up cloning around 24 projects at a time, and they may do this for 
around 1000 projects. If git clone suddenly started as many threads as there 
are CPUs for each clone, this would likely paralyze the machine.
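
(For context, the usual client-side pattern looks like this, with a
made-up manifest URL:)

  repo init -u https://gerrit.example.com/platform/manifest
  repo sync -j24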

I do suspect it would be nice to have a switch though that repo could use to 
adjust this intelligently, is there some way to adjust threads from a clone, I 
don't see one? I tried using 'GIT_FORCE_THREADS=28 git clone ...' and it 
didn't seem to make a difference?

-Martin

-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Resolving deltas dominates clone time
  2019-04-23 10:08             ` Duy Nguyen
  2019-04-23 20:09               ` Martin Fick
@ 2019-04-30 17:50               ` Jeff King
  2019-04-30 18:48                 ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 26+ messages in thread
From: Jeff King @ 2019-04-30 17:50 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Martin Fick, Git Mailing List

On Tue, Apr 23, 2019 at 05:08:40PM +0700, Duy Nguyen wrote:

> On Tue, Apr 23, 2019 at 11:45 AM Jeff King <peff@peff.net> wrote:
> >
> > On Mon, Apr 22, 2019 at 09:55:38PM -0400, Jeff King wrote:
> >
> > > Here are my p5302 numbers on linux.git, by the way.
> > >
> > >   Test                                           jk/p5302-repeat-fix
> > >   ------------------------------------------------------------------
> > >   5302.2: index-pack 0 threads                   307.04(303.74+3.30)
> > >   5302.3: index-pack 1 thread                    309.74(306.13+3.56)
> > >   5302.4: index-pack 2 threads                   177.89(313.73+3.60)
> > >   5302.5: index-pack 4 threads                   117.14(344.07+4.29)
> > >   5302.6: index-pack 8 threads                   112.40(607.12+5.80)
> > >   5302.7: index-pack default number of threads   135.00(322.03+3.74)
> > >
> > > which still imply that "4" is a win over "3" ("8" is slightly better
> > > still in wall-clock time, but the total CPU rises dramatically; that's
> > > probably because this is a quad-core with hyperthreading, so by that
> > > point we're just throttling down the CPUs).
> >
> > And here's a similar test run on a 20-core Xeon w/ hyperthreading (I
> > tweaked the test to keep going after eight threads):
> >
> > Test                            HEAD
> > ----------------------------------------------------
> > 5302.2: index-pack 1 threads    376.88(364.50+11.52)
> > 5302.3: index-pack 2 threads    228.13(371.21+17.86)
> > 5302.4: index-pack 4 threads    151.41(387.06+21.12)
> > 5302.5: index-pack 8 threads    113.68(413.40+25.80)
> > 5302.6: index-pack 16 threads   100.60(511.85+37.53)
> > 5302.7: index-pack 32 threads   94.43(623.82+45.70)
> > 5302.8: index-pack 40 threads   93.64(702.88+47.61)
> >
> > I don't think any of this is _particularly_ relevant to your case, but
> > it really seems to me that the default of capping at 3 threads is too
> > low.
> 
> Looking back at the multithread commit, I think the trend was the same
> and I capped it because the gain was not proportional to the number of
> cores we threw at index-pack anymore. I would not be opposed to
> raising the cap though (or maybe just remove it)

I'm not sure what the right cap would be. I don't think it's static;
we'd want ~4 threads on the top case, and 10-20 on the bottom one.

It does seem like there's an inflection point in the graph at N/2
threads. But then maybe that's just because these are hyper-threaded
machines, so "N/2" is the actual number of physical cores, and the
inflated CPU times above that are just because we can't turbo-boost
then, so we're actually clocking slower. Multi-threaded profiling and
measurement is such a mess. :)

So I'd say the right answer is probably either online_cpus() or half
that. The latter would be more appropriate for the machines I have, but
I'd worry that it would leave performance on the table for non-intel
machines.
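
(On Linux the two candidate values are easy to see, e.g.:)

  nproc    # logical CPUs, i.e. what online_cpus() reports
  lscpu    # threads per core / cores per socket, to get the physical count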

-Peff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Resolving deltas dominates clone time
  2019-04-23 20:09               ` Martin Fick
@ 2019-04-30 18:02                 ` Jeff King
  2019-04-30 22:08                   ` Martin Fick
  0 siblings, 1 reply; 26+ messages in thread
From: Jeff King @ 2019-04-30 18:02 UTC (permalink / raw)
  To: Martin Fick; +Cc: Duy Nguyen, Git Mailing List

On Tue, Apr 23, 2019 at 02:09:31PM -0600, Martin Fick wrote:

> Here are my index-pack results (I only ran them once since they take a while) 
> using git v1.8.3.2:
> 
> Threads  real       user        sys
> 1     108m46.151s 106m14.420s  1m57.192s
> 2     58m14.274s  106m23.158s  5m32.736s
> 3     40m33.351s  106m42.281s  5m40.884s
> 4     31m40.342s  107m20.278s  5m40.675s
> 5     26m0.454s   106m54.370s  5m35.827s
> 12    13m25.304s  107m57.271s  6m26.493s
> 16    10m56.866s  107m46.107s  6m41.330s
> 18    10m18.112s  109m50.893s  7m1.369s
> 20    9m54.010s   113m51.028s  7m53.082s
> 24    9m1.104s    115m8.245s   7m57.156s
> 28    8m26.058s   116m46.311s  8m34.752s
> 32    8m42.967s   140m33.280s  9m59.514s
> 36    8m52.228s   151m28.939s  11m55.590s
> 40    8m22.719s   153m4.496s   12m36.041s
> 44    8m12.419s   166m41.594s  14m7.717s
> 48    8m0.377s    172m3.597s   16m32.041s
> 56    8m22.320s   188m31.426s  17m48.274s

Thanks for the data.

That seems to roughly match my results. Things get obviously better up
to around half of the available processors, and then you get
minimal returns for more CPU (some of yours actually get worse in the
middle, but that may be due to noise; my timings are all best-of-3).

> I think that if there were no default limit during a clone it could have 
> disastrous effects on people using the repo tool from the android project, or 
> any other "submodule like" tool that might clone many projects in parallel. 
> With the repo tool, people often use a large -j number such as 24 which means 
> they end up cloning around 24 projects at a time, and they may do this for 
> around 1000 projects. If git clone suddenly started as many threads as there 
> are CPUs for each clone, this would likely paralyze the machine.

IMHO this is already a problem, because none of those 24 gits knows
about the others. So they're already using 24*3 cores, though of course
at any given moment some of those 24 may be bottle-necked on the
network.

I suspect that repo should be passing in `-c pack.threads=N`, where `N`
is some formula based around how many cores we want to use, with some
constant fraction applied for how many we expect to be chugging on CPU
at any given point.
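
Even something as simple as this would probably be an improvement
(hypothetical numbers; the floor of one thread just keeps small machines
working, and this assumes clone's "-c" reaches index-pack, as discussed
below):

  jobs=24
  threads=$(( $(nproc) / jobs ))
  test "$threads" -ge 1 || threads=1
  git clone -c pack.threads=$threads https://git.example.com/some/project.git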

The optimal behavior would probably come from index-pack dynamically
assigning work based on system load, but that gets pretty messy. Ideally
we could just throw all of the load at the kernel's scheduler and let it
do the right thing, but:

  - we clearly get some inefficiencies from being overly-parallelized,
    so we don't want to go too far

  - we have other resources we'd like to keep in use like the network
    and disk. So probably the optimal case would be to have one (or a
    few) index-packs fully utilizing the network, and then as they move
    to the local-only CPU-heavy phase, start a few more on the network,
    and so on.

    There's no way to do that kind of slot-oriented gating now. It
    actually wouldn't be too hard at the low level of the code, but I'm
    not sure what interface you'd use to communicate "OK, now go" to
    each process.

> I do suspect it would be nice to have a switch though that repo could use to 
> adjust this intelligently, is there some way to adjust threads from a clone, I 
> don't see one? I tried using 'GIT_FORCE_THREADS=28 git clone ...' and it 
> didn't seem to make a difference?

I think I led you astray earlier by mentioning GIT_FORCE_THREADS. It's
actually just a boolean for "use threads even if we're only
single-threaded". What you actually want is probably:

  git clone -c pack.threads=28 ...

(though I didn't test it to be sure).

-Peff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Resolving deltas dominates clone time
  2019-04-30 17:50               ` Jeff King
@ 2019-04-30 18:48                 ` Ævar Arnfjörð Bjarmason
  2019-04-30 20:33                   ` Jeff King
  0 siblings, 1 reply; 26+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-04-30 18:48 UTC (permalink / raw)
  To: Jeff King; +Cc: Duy Nguyen, Martin Fick, Git Mailing List


On Tue, Apr 30 2019, Jeff King wrote:

> On Tue, Apr 23, 2019 at 05:08:40PM +0700, Duy Nguyen wrote:
>
>> On Tue, Apr 23, 2019 at 11:45 AM Jeff King <peff@peff.net> wrote:
>> >
>> > On Mon, Apr 22, 2019 at 09:55:38PM -0400, Jeff King wrote:
>> >
>> > > Here are my p5302 numbers on linux.git, by the way.
>> > >
>> > >   Test                                           jk/p5302-repeat-fix
>> > >   ------------------------------------------------------------------
>> > >   5302.2: index-pack 0 threads                   307.04(303.74+3.30)
>> > >   5302.3: index-pack 1 thread                    309.74(306.13+3.56)
>> > >   5302.4: index-pack 2 threads                   177.89(313.73+3.60)
>> > >   5302.5: index-pack 4 threads                   117.14(344.07+4.29)
>> > >   5302.6: index-pack 8 threads                   112.40(607.12+5.80)
>> > >   5302.7: index-pack default number of threads   135.00(322.03+3.74)
>> > >
>> > > which still imply that "4" is a win over "3" ("8" is slightly better
>> > > still in wall-clock time, but the total CPU rises dramatically; that's
>> > > probably because this is a quad-core with hyperthreading, so by that
>> > > point we're just throttling down the CPUs).
>> >
>> > And here's a similar test run on a 20-core Xeon w/ hyperthreading (I
>> > tweaked the test to keep going after eight threads):
>> >
>> > Test                            HEAD
>> > ----------------------------------------------------
>> > 5302.2: index-pack 1 threads    376.88(364.50+11.52)
>> > 5302.3: index-pack 2 threads    228.13(371.21+17.86)
>> > 5302.4: index-pack 4 threads    151.41(387.06+21.12)
>> > 5302.5: index-pack 8 threads    113.68(413.40+25.80)
>> > 5302.6: index-pack 16 threads   100.60(511.85+37.53)
>> > 5302.7: index-pack 32 threads   94.43(623.82+45.70)
>> > 5302.8: index-pack 40 threads   93.64(702.88+47.61)
>> >
>> > I don't think any of this is _particularly_ relevant to your case, but
>> > it really seems to me that the default of capping at 3 threads is too
>> > low.
>>
>> Looking back at the multithread commit, I think the trend was the same
>> and I capped it because the gain was not proportional to the number of
>> cores we threw at index-pack anymore. I would not be opposed to
>> raising the cap though (or maybe just remove it)
>
> I'm not sure what the right cap would be. I don't think it's static;
> we'd want ~4 threads on the top case, and 10-20 on the bottom one.
>
> It does seem like there's an inflection point in the graph at N/2
> threads. But then maybe that's just because these are hyper-threaded
> machines, so "N/2" is the actual number of physical cores, and the
> inflated CPU times above that are just because we can't turbo-boost
> then, so we're actually clocking slower. Multi-threaded profiling and
> measurement is such a mess. :)
>
> So I'd say the right answer is probably either online_cpus() or half
> that. The latter would be more appropriate for the machines I have, but
> I'd worry that it would leave performance on the table for non-intel
> machines.

It would be a nice #leftoverbits project to do this dynamically at
runtime, i.e. hook up the throughput code in progress.c to some new
utility functions where the current code using pthreads would
occasionally stop and try to find some (local) maximum throughput given
N threads.

You could then dynamically save that optimum for next time, or adjust
threading at runtime every X seconds, e.g. on a server with N=24 cores
you might want 24 threads if you have one index-pack, but if you have 24
index-packs you probably don't want each with 24 threads, for a total of
576.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Resolving deltas dominates clone time
  2019-04-30 18:48                 ` Ævar Arnfjörð Bjarmason
@ 2019-04-30 20:33                   ` Jeff King
  0 siblings, 0 replies; 26+ messages in thread
From: Jeff King @ 2019-04-30 20:33 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Duy Nguyen, Martin Fick, Git Mailing List

On Tue, Apr 30, 2019 at 08:48:08PM +0200, Ævar Arnfjörð Bjarmason wrote:

> > So I'd say the right answer is probably either online_cpus() or half
> > that. The latter would be more appropriate for the machines I have, but
> > I'd worry that it would leave performance on the table for non-intel
> > machines.
> 
> It would be a nice #leftoverbits project to do this dynamically at
> runtime, i.e. hook up the throughput code in progress.c to some new
> utility functions where the current code using pthreads would
> occasionally stop and try to find some (local) maximum throughput given
> N threads.
> 
> You could then dynamically save that optimum for next time, or adjust
> threading at runtime every X seconds, e.g. on a server with N=24 cores
> you might want 24 threads if you have one index-pack, but if you have 24
> index-packs you probably don't want each with 24 threads, for a total of
> 576.

Yeah, I touched on that in my response to Martin. I think that would be
nice, but it's complicated enough that I don't think it's a left-over
bit. I'm also not sure how hard it is to change the number of threads
after the initialization.

IIRC, it's a worker pool that just asks for more work. So that's
probably the right moment to say not just "is there more work to do" but
also "does it seem like there's an idle slot on the system for our
thread to take".

-Peff

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Resolving deltas dominates clone time
  2019-04-30 18:02                 ` Jeff King
@ 2019-04-30 22:08                   ` Martin Fick
  0 siblings, 0 replies; 26+ messages in thread
From: Martin Fick @ 2019-04-30 22:08 UTC (permalink / raw)
  To: Jeff King; +Cc: Duy Nguyen, Git Mailing List

On Tuesday, April 30, 2019 2:02:32 PM MDT Jeff King wrote:
> On Tue, Apr 23, 2019 at 02:09:31PM -0600, Martin Fick wrote:
> > I think that if there were no default limit during a clone it could have
> > disastrous effects on people using the repo tool from the android project,
> > or any other "submodule like" tool that might clone many projects in
> > parallel. With the repo tool, people often use a large -j number such as
> > 24 which means they end up cloning around 24 projects at a time, and they
> > may do this for around 1000 projects. If git clone suddenly started as
> > many threads as there are CPUs for each clone, this would likely paralyze
> > the machine.
> 
> IMHO this is already a problem, because none of those 24 gits knows
> about the others. So they're already using 24*3 cores, though of course
> at any given moment some of those 24 may be bottle-necked on the
> network.

I think this is very different. In the first case this is a linear constraint 
that someone can use to intelligently adjust the number of clones they are 
doing at the same time. The second case cannot be accounted for in any 
intelligent way by the person running the clones; it is almost unconstrained.

...
> > I do suspect it would be nice to have a switch though that repo could use
> > to adjust this intelligently, is there some way to adjust threads from a
> > clone, I don't see one? I tried using 'GIT_FORCE_THREADS=28 git clone
> > ...' and it didn't seem to make a difference?
> 
> I think I led you astray earlier by mentioning GIT_FORCE_THREADS. It's
> actually just a boolean for "use threads even if we're only
> single-threaded". What you actually want is probably:
> 
>   git clone -c pack.threads=28 ...
> 
> (though I didn't test it to be sure).

Thanks, I will test this!

-Martin

-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2019-04-30 22:08 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-19 21:47 Resolving deltas dominates clone time Martin Fick
2019-04-20  3:58 ` Jeff King
2019-04-20  7:59   ` Ævar Arnfjörð Bjarmason
2019-04-22 15:57     ` Jeff King
2019-04-22 18:01       ` Ævar Arnfjörð Bjarmason
2019-04-22 18:43         ` Jeff King
2019-04-23  7:07           ` Ævar Arnfjörð Bjarmason
2019-04-22 20:21   ` Martin Fick
2019-04-22 20:56     ` Jeff King
2019-04-22 21:02       ` Jeff King
2019-04-22 21:19       ` [PATCH] p5302: create the repo in each index-pack test Jeff King
2019-04-23  1:09         ` Junio C Hamano
2019-04-23  2:07           ` Jeff King
2019-04-23  2:27             ` Junio C Hamano
2019-04-23  2:36               ` Jeff King
2019-04-23  2:40                 ` Junio C Hamano
2019-04-22 22:32       ` Resolving deltas dominates clone time Martin Fick
2019-04-23  1:55         ` Jeff King
2019-04-23  4:21           ` Jeff King
2019-04-23 10:08             ` Duy Nguyen
2019-04-23 20:09               ` Martin Fick
2019-04-30 18:02                 ` Jeff King
2019-04-30 22:08                   ` Martin Fick
2019-04-30 17:50               ` Jeff King
2019-04-30 18:48                 ` Ævar Arnfjörð Bjarmason
2019-04-30 20:33                   ` Jeff King

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).