* Resolving deltas dominates clone time
@ 2019-04-19 21:47 Martin Fick
  2019-04-20  3:58 ` Jeff King
  0 siblings, 1 reply; 26+ messages in thread

From: Martin Fick @ 2019-04-19 21:47 UTC (permalink / raw)
To: Git Mailing List

We have a serious performance problem with one of our large repos. The repo is
our internal version of the android platform/manifest project. Our repo, after
running a clean "repack -A -d -F", is close to 8G in size, has over 700K refs,
and has over 8M objects.

The repo takes around 40min to clone locally (same disk to same disk) using git
1.8.2.1 on a high-end machine (56 processors, 128GB RAM)! It takes around
10mins before getting to the resolving deltas phase, which then takes most of
the rest of the time. While this is a fairly large repo, a straight cp -r of
the repo takes less than 2mins, so I would expect a clone to be on the same
order of magnitude in time. For perspective, I have a kernel/msm repo with a
third of the ref count and double the object count which takes only around
20mins to clone on the same machine (still slower than I would like).

I mention 1.8.2.1 because we have many old machines which need it. However, I
also tested this with git v2.18 and it is actually even much slower
(~140mins). Reading the advice on the net, people seem to think that repacking
with shorter delta chains would help improve this. I have not had any success
with that yet.

I have been thinking about this problem, and I suspect that this compute time
is actually spent doing SHA1 calculations; is that possible? Some basic
back-of-the-envelope math and scripting seems to show that the repo may
actually contain about 2TB of data if you add up the sizes of all the objects
in the repo. Some quick research on the net seems to indicate that we might be
able to expect something around 500MB/s throughput on computing SHA1s; does
that seem reasonable? If I really have 2TB of data, should it then take around
66mins to compute the SHA1s for all that data?
Could my repo clone time really be dominated by SHA1 math?

Any advice on how to speed up cloning this repo, or what to pursue more in my
investigation?

Thanks,

-Martin

--
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted
by The Linux Foundation

^ permalink raw reply	[flat|nested] 26+ messages in thread
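Martin's arithmetic checks out; as a quick sketch (using his stated assumptions of 2TB of object data and ~500MB/s of SHA-1 throughput — the small difference from his ~66min figure is just binary vs. decimal TB):

```shell
# Back-of-the-envelope: time to hash 2 TB of object data at ~500 MB/s
total_mb=$((2 * 1024 * 1024))      # 2 TB expressed in MB (binary units)
rate_mb_per_s=500                  # assumed single-stream SHA-1 throughput
secs=$((total_mb / rate_mb_per_s))
echo "~$((secs / 60)) minutes of SHA-1 time"   # prints: ~69 minutes of SHA-1 time
```

With decimal TB (2,000,000 MB) the same division gives the ~66 minutes quoted in the message.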
* Re: Resolving deltas dominates clone time
  2019-04-19 21:47 Resolving deltas dominates clone time Martin Fick
@ 2019-04-20  3:58 ` Jeff King
  2019-04-20  7:59   ` Ævar Arnfjörð Bjarmason
  2019-04-22 20:21   ` Martin Fick
  0 siblings, 2 replies; 26+ messages in thread

From: Jeff King @ 2019-04-20 3:58 UTC (permalink / raw)
To: Martin Fick; +Cc: Git Mailing List

On Fri, Apr 19, 2019 at 03:47:22PM -0600, Martin Fick wrote:

> I have been thinking about this problem, and I suspect that this compute time
> is actually spent doing SHA1 calculations, is that possible? Some basic back
> of the envelope math and scripting seems to show that the repo may actually
> contain about 2TB of data if you add up the size of all the objects in the
> repo. Some quick research on the net seems to indicate that we might be able
> to expect something around 500MB/s throughput on computing SHA1s, does that
> seem reasonable? If I really have 2TB of data, should it then take around
> 66mins to get the SHA1s for all that data? Could my repo clone time really be
> dominated by SHA1 math?

That sounds about right, actually. 8GB to 2TB is a compression ratio of 250:1.
That's bigger than I've seen, but I get 51:1 in the kernel.

Try this (with a recent version of git; your v1.8.2.1 won't have
--batch-all-objects):

  # count the on-disk size of all objects
  git cat-file --batch-all-objects --batch-check='%(objectsize) %(objectsize:disk)' |
  perl -alne '
    $repo += $F[0];
    $disk += $F[1];
    END { print "$repo / $disk = ", $repo/$disk }
  '

250:1 isn't inconceivable if you have large blobs which have small changes to
them (and at 8GB for 8 million objects, you probably do have some larger
blobs, since the kernel is about 1/8th the size for the same number of
objects).

So yes, if you really do have to hash 2TB of data, that's going to take a
while.
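The same accounting works without perl; here is an awk equivalent of the pipeline above (still needs a git new enough for --batch-all-objects; the throwaway demo repo just makes the example self-contained — in real use, run the pipeline inside the repo you are measuring):

```shell
# build a tiny throwaway repo so the pipeline has something to measure
tmp=$(mktemp -d)
git init -q "$tmp"
echo hello > "$tmp/f"
git -C "$tmp" add f
git -C "$tmp" -c user.email=you@example.com -c user.name=you commit -q -m demo

# sum logical size vs. on-disk (compressed/deltified) size of all objects
git -C "$tmp" cat-file --batch-all-objects \
    --batch-check='%(objectsize) %(objectsize:disk)' |
awk '{ repo += $1; disk += $2 }
     END { printf "%d / %d = %.1f\n", repo, disk, repo / disk }'
```

A ratio well above the kernel's ~51:1 would support the "2TB of object data in an 8GB pack" estimate.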
"openssl speed" on my machine gives per-second speeds of:

  type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
  sha1            135340.73k   337086.10k   677821.10k   909513.73k  1007528.62k  1016916.65k

So it's faster on bigger chunks, but yeah, 500-1000MB/s seems like about the
best you're going to do. And...

> I mention 1.8.2.1 because we have many old machines which need this. However,
> I also tested this with git v2.18 and it actually is much slower even
> (~140mins).

I think v2.18 will have the collision-detecting sha1 on by default, which is
slower. Building with OPENSSL_SHA1 should be the fastest (those are the
numbers above). Git's internal (but not collision-detecting) BLK_SHA1 is
somewhere in the middle.

> Any advice on how to speed up cloning this repo, or what to pursue more
> in my investigation?

If you don't mind losing the collision-detection, using openssl's sha1 might
help. The delta resolution should be threaded, too. So in _theory_ you're
using 66 minutes of CPU time, but that should only take 1-2 minutes on your
56-core machine. I don't know at what point you'd run into lock contention,
though. The locking there is quite coarse.

We also hash non-deltas while we're receiving them over the network. That's
accounted for in the "receiving pack" part of the progress meter. If the time
looks to be going to "resolving deltas", then that should all be threaded.

If you want to replay the slow part, it should just be index-pack. So
something like (with $old as a fresh clone of the repo):

  git init --bare new-repo.git
  cd new-repo.git
  perf record git index-pack -v --stdin <$old/.git/objects/pack/pack-*.pack
  perf report

should show you where the time is going (substitute perf with whatever
profiling tool you like).

As far as avoiding that work altogether, there aren't a lot of options. Git
clients do not trust the server, so the server sends only the raw data, and
the client is responsible for computing the object ids.
The only exception is a local filesystem clone, which will blindly copy or
hardlink the .pack and .idx files from the source.

In theory there could be a protocol extension to let the client say "I trust
you, please send me the matching .idx that goes with this pack, and I'll
assume there was no bitrot nor trickery on your part". I don't recall anybody
ever discussing such a patch in the past, but I think Microsoft's VFS for Git
project that backs development on Windows might do similar trickery under the
hood.

-Peff
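Peff's index-pack replay recipe can be fleshed out into a self-contained script against a throwaway repo (substitute your real clone for $old, and perf for time; identity values below are placeholders):

```shell
set -e
# stand-in for "$old", a fresh clone of the real repo
old=$(mktemp -d)
git init -q "$old"
git -C "$old" -c user.email=you@example.com -c user.name=you \
    commit -q --allow-empty -m demo
git -C "$old" repack -q -ad        # ensure all objects live in a single pack

new=$(mktemp -d)/new-repo.git
git init -q --bare "$new"
cd "$new"
# for real analysis, wrap this in `perf record` / `perf report` (or your profiler)
time git index-pack -v --stdin < "$old"/.git/objects/pack/pack-*.pack
```

index-pack here re-receives the pack from stdin and recomputes every object id, which is exactly the work that dominates the "resolving deltas" phase of a clone.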
* Re: Resolving deltas dominates clone time
  2019-04-20  3:58 ` Jeff King
@ 2019-04-20  7:59   ` Ævar Arnfjörð Bjarmason
  2019-04-22 15:57     ` Jeff King
  2019-04-22 20:21   ` Martin Fick
  1 sibling, 1 reply; 26+ messages in thread

From: Ævar Arnfjörð Bjarmason @ 2019-04-20 7:59 UTC (permalink / raw)
To: Jeff King; +Cc: Martin Fick, Git Mailing List

On Sat, Apr 20 2019, Jeff King wrote:

> On Fri, Apr 19, 2019 at 03:47:22PM -0600, Martin Fick wrote:
>
>> I have been thinking about this problem, and I suspect that this compute time
>> is actually spent doing SHA1 calculations, is that possible? Some basic back
>> of the envelope math and scripting seems to show that the repo may actually
>> contain about 2TB of data if you add up the size of all the objects in the
>> repo. Some quick research on the net seems to indicate that we might be able
>> to expect something around 500MB/s throughput on computing SHA1s, does that
>> seem reasonable? If I really have 2TB of data, should it then take around
>> 66mins to get the SHA1s for all that data? Could my repo clone time really be
>> dominated by SHA1 math?
>
> That sounds about right, actually. 8GB to 2TB is a compression ratio of
> 250:1. That's bigger than I've seen, but I get 51:1 in the kernel.
>
> Try this (with a recent version of git; your v1.8.2.1 won't have
> --batch-all-objects):
>
>   # count the on-disk size of all objects
>   git cat-file --batch-all-objects --batch-check='%(objectsize) %(objectsize:disk)' |
>   perl -alne '
>     $repo += $F[0];
>     $disk += $F[1];
>     END { print "$repo / $disk = ", $repo/$disk }
>   '
>
> 250:1 isn't inconceivable if you have large blobs which have small
> changes to them (and at 8GB for 8 million objects, you probably do have
> some larger blobs, since the kernel is about 1/8th the size for the same
> number of objects).
>
> So yes, if you really do have to hash 2TB of data, that's going to take
> a while.
> "openssl speed" on my machine gives per-second speeds of:
>
>   type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
>   sha1            135340.73k   337086.10k   677821.10k   909513.73k  1007528.62k  1016916.65k
>
> So it's faster on bigger chunks, but yeah 500-1000MB/s seems like about
> the best you're going to do. And...
>
>> I mention 1.8.2.1 because we have many old machines which need this. However,
>> I also tested this with git v2.18 and it actually is much slower even
>> (~140mins).
>
> I think v2.18 will have the collision-detecting sha1 on by default,
> which is slower. Building with OPENSSL_SHA1 should be the fastest (and
> are those numbers above). Git's internal (but not collision detecting)
> BLK_SHA1 is somewhere in the middle.
>
>> Any advice on how to speed up cloning this repo, or what to pursue more
>> in my investigation?
>
> If you don't mind losing the collision-detection, using openssl's sha1
> might help. The delta resolution should be threaded, too. So in _theory_
> you're using 66 minutes of CPU time, but that should only take 1-2
> minutes on your 56-core machine. I don't know at what point you'd run
> into lock contention, though. The locking there is quite coarse.

There's also my (been meaning to re-roll)
https://public-inbox.org/git/20181113201910.11518-1-avarab@gmail.com/
*that* part of the SHA-1 checking is part of what's going on here. It'll
help a *tiny* bit, but of course is part of the "trust remote" risk
management...

> We also hash non-deltas while we're receiving them over the network.
> That's accounted for in the "receiving pack" part of the progress meter.
> If the time looks to be going to "resolving deltas", then that should
> all be threaded.
>
> If you want to replay the slow part, it should just be index-pack.
> So something like (with $old as a fresh clone of the repo):
>
>   git init --bare new-repo.git
>   cd new-repo.git
>   perf record git index-pack -v --stdin <$old/.git/objects/pack/pack-*.pack
>   perf report
>
> should show you where the time is going (substitute perf with whatever
> profiling tool you like).
>
> As far as avoiding that work altogether, there aren't a lot of options.
> Git clients do not trust the server, so the server sends only the raw
> data, and the client is responsible for computing the object ids. The
> only exception is a local filesystem clone, which will blindly copy or
> hardlink the .pack and .idx files from the source.
>
> In theory there could be a protocol extension to let the client say "I
> trust you, please send me the matching .idx that goes with this pack,
> and I'll assume there was no bitrot nor trickery on your part". I
> don't recall anybody ever discussing such a patch in the past, but I
> think Microsoft's VFS for Git project that backs development on Windows
> might do similar trickery under the hood.

I started to write:

    I wonder if there's room for some tacit client/server cooperation
    without such a protocol change.

    E.g. the server sending over a pack constructed in such a way that
    everything required for a checkout is at the beginning of the
    data. Now we implicitly tend to do it mostly the other way around
    for delta optimization purposes.

    That would allow a smart client in a hurry to index-pack it as they
    go along, and as soon as they have enough to check out HEAD return
    to the client, and continue the rest in the background

But realized I was just starting to describe something like 'clone
--depth=1' followed by a 'fetch --unshallow' in the background, except that
would work better (if you did "just the tip" naïvely you'd get 'missing
object' on e.g. 'git log', with that ad-hoc hack we'd need to write out two
packs etc...).
$ rm -rf /tmp/git; time git clone --depth=1 https://chromium.googlesource.com/chromium/src /tmp/git; time git -C /tmp/git fetch --unshallow
Cloning into '/tmp/git'...
remote: Counting objects: 304839, done
remote: Finding sources: 100% (304839/304839)
remote: Total 304839 (delta 70483), reused 204837 (delta 70483)
Receiving objects: 100% (304839/304839), 1.48 GiB | 19.87 MiB/s, done.
Resolving deltas: 100% (70483/70483), done.
Checking out files: 100% (302768/302768), done.

real    2m10.223s
user    1m2.434s
sys     0m15.564s

[not waiting for that second bit, but it'll take ages...]

I think just having a clone mode that did that for you might scratch a lot of
people's itch. I.e. "I want full history, but mainly want a checkout right
away, so background the full clone". But at this point I'm just starting to
describe some shoddy version of Documentation/technical/partial-clone.txt :),
OTOH there's no "narrow clone and fleshen right away" option.

On protocol extensions: Just having a way to "wget" the corresponding *.idx
file from the server would be great, and reduce clone times by a lot. There's
the risk of trusting the server, but most people's use-case is going to be
pushing right back to the same server, which'll be doing a full validation.

We could also defer that validation instead of skipping it. E.g. wget
*.{pack,idx} followed by a 'fsck' in the background. I've sometimes wanted
that anyway, i.e. "fsck --auto" similar to "gc --auto" periodically to detect
repository bitflips.

Or, do some "narrow" validation of such an *.idx file right away. E.g. for all
the trees/blobs required for the current checkout, and background the rest.
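The clone --depth=1 / fetch --unshallow flow discussed above can be demonstrated end to end against a throwaway local repo (file:// so the shallow negotiation actually runs; a real remote URL goes in the same place, and the identity values are placeholders):

```shell
set -e
# throwaway source repo with two commits
tmp=$(mktemp -d)
git init -q "$tmp/src"
git -C "$tmp/src" -c user.email=you@example.com -c user.name=you \
    commit -q --allow-empty -m one
git -C "$tmp/src" -c user.email=you@example.com -c user.name=you \
    commit -q --allow-empty -m two

# shallow clone: the checkout is usable almost immediately
git clone -q --depth=1 "file://$tmp/src" "$tmp/dst"
git -C "$tmp/dst" rev-list --count HEAD    # just the tip commit so far

# the step a "background the full clone" mode would run for you
git -C "$tmp/dst" fetch -q --unshallow
git -C "$tmp/dst" rev-list --count HEAD    # full history now present
```

The first rev-list count reflects only the shallow tip; after --unshallow the .git/shallow marker is gone and all history is reachable.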
* Re: Resolving deltas dominates clone time
  2019-04-20  7:59 ` Ævar Arnfjörð Bjarmason
@ 2019-04-22 15:57   ` Jeff King
  2019-04-22 18:01     ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 26+ messages in thread

From: Jeff King @ 2019-04-22 15:57 UTC (permalink / raw)
To: Ævar Arnfjörð Bjarmason; +Cc: Martin Fick, Git Mailing List

On Sat, Apr 20, 2019 at 09:59:12AM +0200, Ævar Arnfjörð Bjarmason wrote:

>> If you don't mind losing the collision-detection, using openssl's sha1
>> might help. The delta resolution should be threaded, too. So in _theory_
>> you're using 66 minutes of CPU time, but that should only take 1-2
>> minutes on your 56-core machine. I don't know at what point you'd run
>> into lock contention, though. The locking there is quite coarse.
>
> There's also my (been meaning to re-roll)
> https://public-inbox.org/git/20181113201910.11518-1-avarab@gmail.com/
> *that* part of the SHA-1 checking is part of what's going on here. It'll
> help a *tiny* bit, but of course is part of the "trust remote" risk
> management...

I think we're talking about two different collision detections, and your
patch wouldn't help at all here.

Your patch is optionally removing the "woah, we got an object with a
duplicate sha1, let's check that the bytes are the same in both copies"
check. But Martin's problem is a clone, so we wouldn't have any existing
objects to duplicate in the first place.

The problem in his case is literally just that the actual SHA-1 is expensive,
and that can be helped by using the optimized openssl implementation rather
than sha1dc (which checks not for collisions with objects we _have_, but for
evidence of somebody trying to exploit weaknesses in sha1).

One thing we could do to make that easier is a run-time flag to switch between
sha1dc and a faster implementation (either openssl or blk_sha1, depending on
the build). That would let you flip the "trust" bit per operation, rather than
having it baked into your build.
(Note that the oft-discussed "use a faster sha1 implementation for checksums,
but sha1dc for object hashing" idea would not help here, because these really
are object hashes whose time is dominating. We have to checksum 8GB of raw
packfile but 2TB of object data.)

> I started to write:
>
>     I wonder if there's room for some tacit client/server cooperation
>     without such a protocol change.
>
>     E.g. the server sending over a pack constructed in such a way that
>     everything required for a checkout is at the beginning of the
>     data. Now we implicitly tend to do it mostly the other way around
>     for delta optimization purposes.
>
>     That would allow a smart client in a hurry to index-pack it as they
>     go along, and as soon as they have enough to check out HEAD return
>     to the client, and continue the rest in the background

Interesting idea. You're not reducing the total client effort, but you're
improving latency of getting the user to a checkout. Of course that doesn't
help if they want to run "git log" as their first operation. ;)

> But realized I was just starting to describe something like 'clone
> --depth=1' followed by a 'fetch --unshallow' in the background, except
> that would work better (if you did "just the tip" naïvely you'd get
> 'missing object' on e.g. 'git log', with that ad-hoc hack we'd need to
> write out two packs etc...).

Right, that would work. I will note one thing, though: the total time to do a
1-depth clone followed by an unshallow is probably much higher than doing the
whole clone as one unit, for two reasons:

  1. The server won't use reachability bitmaps when serving the
     follow-up fetch (because shallowness invalidates the reachability
     data they're caching), so it will spend much more time in the
     "Counting objects" phase.

  2. The server has to throw away some deltas. Imagine version X of a
     file in the tip commit is stored as a delta against version Y in
     that commit's parent.
     The initial clone has to throw away the on-disk delta of X and send
     you the whole object (because you are not requesting Y at all). And
     then in the follow-up fetch, it must either send you Y as a base
     object (wasting bandwidth), or it must on-the-fly generate a delta
     from Y to X (wasting CPU).

> But at this point I'm just starting to describe some shoddy version of
> Documentation/technical/partial-clone.txt :), OTOH there's no "narrow
> clone and fleshen right away" option.

Yes. And partial-clone suffers from the problems above to an even greater
extent. ;)

> On protocol extensions: Just having a way to "wget" the corresponding
> *.idx file from the server would be great, and reduce clone times by a
> lot. There's the risk of trusting the server, but most people's use-case
> is going to be pushing right back to the same server, which'll be doing
> a full validation.

One tricky thing is that the server may be handing you a bespoke .pack file.
There is no matching ".idx" at all, neither in-memory nor on disk. And you
would not want the whole on-disk .pack/.idx pair from a site like GitHub,
where there are objects from many forks.

So in general, I think you'd need some cooperation from the server side to ask
it to generate and send the .idx that matches the .pack it is sending you. Or
even if not the .idx format itself, some stable list of sha1s that you could
use to reproduce it without hashing each uncompressed byte yourself. This
could even be stuffed into the pack format and stripped out by the receiving
index-pack (i.e., each entry is prefixed with "and by the way, here is my
sha1...").

> We could also defer that validation instead of skipping it. E.g. wget
> *.{pack,idx} followed by a 'fsck' in the background. I've sometimes
> wanted that anyway, i.e. "fsck --auto" similar to "gc --auto"
> periodically to detect repository bitflips.
>
> Or, do some "narrow" validation of such an *.idx file right
> away. E.g.
> for all the trees/blobs required for the current checkout,
> and background the rest.

The "do we have all of the objects we need" check is already separate from
"figure out the sha1 of each object", so I think you'd get that naturally if
you just took in an untrusted .idx (it also demonstrates that any .idx cost is
really focusing on blobs, because the "do we have all objects" check is going
to decompress every commit and tree in the repo anyway).

-Peff
* Re: Resolving deltas dominates clone time
  2019-04-22 15:57 ` Jeff King
@ 2019-04-22 18:01   ` Ævar Arnfjörð Bjarmason
  2019-04-22 18:43     ` Jeff King
  0 siblings, 1 reply; 26+ messages in thread

From: Ævar Arnfjörð Bjarmason @ 2019-04-22 18:01 UTC (permalink / raw)
To: Jeff King; +Cc: Martin Fick, Git Mailing List, Jansen, Geert, Jonathan Tan

On Mon, Apr 22 2019, Jeff King wrote:

> On Sat, Apr 20, 2019 at 09:59:12AM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> > If you don't mind losing the collision-detection, using openssl's sha1
>> > might help. The delta resolution should be threaded, too. So in _theory_
>> > you're using 66 minutes of CPU time, but that should only take 1-2
>> > minutes on your 56-core machine. I don't know at what point you'd run
>> > into lock contention, though. The locking there is quite coarse.
>>
>> There's also my (been meaning to re-roll)
>> https://public-inbox.org/git/20181113201910.11518-1-avarab@gmail.com/
>> *that* part of the SHA-1 checking is part of what's going on here. It'll
>> help a *tiny* bit, but of course is part of the "trust remote" risk
>> management...
>
> I think we're talking about two different collision detections, and your
> patch wouldn't help at all here.
>
> Your patch is optionally removing the "woah, we got an object with a
> duplicate sha1, let's check that the bytes are the same in both copies"
> check. But Martin's problem is a clone, so we wouldn't have any existing
> objects to duplicate in the first place.

Right, but we do anyway, as reported by Geert at @amazon.com preceding that
patch of mine. But it is 99.99% irrelevant to *performance* in this case after
the loose object cache you added (though before that it could make all the
difference depending on the FS).

I just mentioned it to plant a flag on another bit of the code where
index-pack in general has certain paranoias/validation the user might be
willing to optionally drop just at "clone" time.
> The problem in his case is literally just that the actual SHA-1 is
> expensive, and that can be helped by using the optimized openssl
> implementation rather than the sha1dc (which checks not collisions with
> objects we _have_, but evidence of somebody trying to exploit weaknesses
> in sha1).
>
> One thing we could do to make that easier is a run-time flag to switch
> between sha1dc and a faster implementation (either openssl or blk_sha1,
> depending on the build). That would let you flip the "trust" bit per
> operation, rather than having it baked into your build.

Yeah, this would be neat.

> (Note that the oft-discussed "use a faster sha1 implementation for
> checksums, but sha1dc for object hashing" idea would not help here,
> because these really are object hashes whose time is dominating. We have
> to checksum 8GB of raw packfile but 2TB of object data).
>
>> I started to write:
>>
>>     I wonder if there's room for some tacit client/server cooperation
>>     without such a protocol change.
>>
>>     E.g. the server sending over a pack constructed in such a way that
>>     everything required for a checkout is at the beginning of the
>>     data. Now we implicitly tend to do it mostly the other way around
>>     for delta optimization purposes.
>>
>>     That would allow a smart client in a hurry to index-pack it as they
>>     go along, and as soon as they have enough to check out HEAD return
>>     to the client, and continue the rest in the background
>
> Interesting idea. You're not reducing the total client effort, but
> you're improving latency of getting the user to a checkout. Of course
> that doesn't help if they want to run "git log" as their first
> operation. ;)
>
>> But realized I was just starting to describe something like 'clone
>> --depth=1' followed by a 'fetch --unshallow' in the background, except
>> that would work better (if you did "just the tip" naïvely you'd get
>> 'missing object' on e.g. 'git log', with that ad-hoc hack we'd need to
>> write out two packs etc...).
>
> Right, that would work. I will note one thing, though: the total time to
> do a 1-depth clone followed by an unshallow is probably much higher than
> doing the whole clone as one unit, for two reasons:

Indeed. The hypothesis is that the user doesn't really care about the
clone time, but the clone-to-repo-mostly-usable time.

> 1. The server won't use reachability bitmaps when serving the
>    follow-up fetch (because shallowness invalidates the reachability
>    data they're caching), so it will spend much more time in the
>    "Counting objects" phase.
>
> 2. The server has to throw away some deltas. Imagine version X of a
>    file in the tip commit is stored as a delta against version Y in
>    that commit's parent. The initial clone has to throw away the
>    on-disk delta of X and send you the whole object (because you are
>    not requesting Y at all). And then in the follow-up fetch, it must
>    either send you Y as a base object (wasting bandwidth), or it must
>    on-the-fly generate a delta from Y to X (wasting CPU).
>
>> But at this point I'm just starting to describe some shoddy version of
>> Documentation/technical/partial-clone.txt :), OTOH there's no "narrow
>> clone and fleshen right away" option.
>
> Yes. And partial-clone suffers from the problems above to an even
> greater extent. ;)
>
>> On protocol extensions: Just having a way to "wget" the corresponding
>> *.idx file from the server would be great, and reduce clone times by a
>> lot. There's the risk of trusting the server, but most people's use-case
>> is going to be pushing right back to the same server, which'll be doing
>> a full validation.
>
> One tricky thing is that the server may be handing you a bespoke .pack
> file. There is no matching ".idx" at all, neither in-memory nor on disk.
> And you would not want the whole on-disk .pack/.idx pair from a site
> like GitHub, where there are objects from many forks.
>
> So in general, I think you'd need some cooperation from the server side
> to ask it to generate and send the .idx that matches the .pack it is
> sending you. Or even if not the .idx format itself, some stable list of
> sha1s that you could use to reproduce it without hashing each
> uncompressed byte yourself.

Yeah, depending on how jt/fetch-cdn-offload is designed (see my
https://public-inbox.org/git/87k1hv6eel.fsf@evledraar.gmail.com/) it could be
(ab)used to do this. I.e. you'd keep a "base" *.{pack,idx} around for such a
purpose.

So in such a case you'd serve up that recent-enough *.{pack,idx} for the
client to "wget", and the client would then trust it (or not) and do the
equivalent of a "fetch" from that point to be 100% up-to-date.

> This could even be stuffed into the pack format and stripped out by
> the receiving index-pack (i.e., each entry is prefixed with "and by
> the way, here is my sha1...").

That would be really interesting. I.e. just having room for that (or anything
else) in the pack format.

I wonder if it could be added to the delta-chain in the current format as a
nasty hack :)

>> We could also defer that validation instead of skipping it. E.g. wget
>> *.{pack,idx} followed by a 'fsck' in the background. I've sometimes
>> wanted that anyway, i.e. "fsck --auto" similar to "gc --auto"
>> periodically to detect repository bitflips.
>>
>> Or, do some "narrow" validation of such an *.idx file right
>> away. E.g. for all the trees/blobs required for the current checkout,
>> and background the rest.
>
> The "do we have all of the objects we need" is already separate from
> "figure out the sha1 of each object", so I think you'd get that
> naturally if you just took in an untrusted .idx (it also demonstrates
> that any .idx cost is really focusing on blobs, because the "do we have
> all objects" check is going to decompress every commit and tree in the
> repo anyway).
>
> -Peff
* Re: Resolving deltas dominates clone time
  2019-04-22 18:01 ` Ævar Arnfjörð Bjarmason
@ 2019-04-22 18:43   ` Jeff King
  2019-04-23  7:07     ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 26+ messages in thread

From: Jeff King @ 2019-04-22 18:43 UTC (permalink / raw)
To: Ævar Arnfjörð Bjarmason
Cc: Martin Fick, Git Mailing List, Jansen, Geert, Jonathan Tan

On Mon, Apr 22, 2019 at 08:01:15PM +0200, Ævar Arnfjörð Bjarmason wrote:

>> Your patch is optionally removing the "woah, we got an object with a
>> duplicate sha1, let's check that the bytes are the same in both copies"
>> check. But Martin's problem is a clone, so we wouldn't have any existing
>> objects to duplicate in the first place.
>
> Right, but we do anyway, as reported by Geert at @amazon.com preceding
> that patch of mine. But it is 99.99% irrelevant to *performance* in this
> case after the loose object cache you added (but before that could make
> all the difference depending on the FS).

I scratched my head at this a bit. If we don't have any other objects, then
what are we comparing against? But I think you mean that we have the overhead
of doing the object lookups to find that out. Yes, that can add up if your
filesystem has high latency, but I think in this case it is a drop in the
bucket compared to dealing with the actual object data.

> I just mentioned it to plant a flag on another bit of the code where
> index-pack in general has certain paranoias/validation the user might be
> willing to optionally drop just at "clone" time.

Yeah, I agree it may be worth pursuing independently. I just don't think it
will help Martin's case in any noticeable way.

>> Right, that would work. I will note one thing, though: the total time to
>> do a 1-depth clone followed by an unshallow is probably much higher than
>> doing the whole clone as one unit, for two reasons:
>
> Indeed. The hypothesis is that the user doesn't really care about the
> clone-time, but the clone-to-repo-mostly-usable time.
There was a little bit of self-interest in there for me, too, as a server
operator. While it does add to the end-to-end time, most of the resource use
for the shallow fetch gets put on the server. IOW, I don't think we'd be happy
to see clients doing this depth-1-and-then-unshallow strategy for every clone.

>> So in general, I think you'd need some cooperation from the server side
>> to ask it to generate and send the .idx that matches the .pack it is
>> sending you. Or even if not the .idx format itself, some stable list of
>> sha1s that you could use to reproduce it without hashing each
>> uncompressed byte yourself.
>
> Yeah, depending on how jt/fetch-cdn-offload is designed (see my
> https://public-inbox.org/git/87k1hv6eel.fsf@evledraar.gmail.com/) it
> could be (ab)used to do this. I.e. you'd keep a "base" *.{pack,idx}
> around for such a purpose.
>
> So in such a case you'd serve up that recent-enough *.{pack,idx} for the
> client to "wget", and the client would then trust it (or not) and do the
> equivalent of a "fetch" from that point to be 100% up-to-date.

I think it's sort of orthogonal. Either way you have to teach the client how
to get a .pack/.idx combo. Whether it learns to receive it inline from the
first fetch, or whether it is taught to expect it from the out-of-band fetch,
most of the challenge is the same.

>> This could even be stuffed into the pack format and stripped out by
>> the receiving index-pack (i.e., each entry is prefixed with "and by
>> the way, here is my sha1...").
>
> That would be really interesting. I.e. just having room for that (or
> anything else) in the pack format.
>
> I wonder if it could be added to the delta-chain in the current format
> as a nasty hack :)

There's definitely not "room" in any sense of the word in the pack format. :)
However, as long as all parties agreed, we can stick whatever we want into the
on-the-wire format. So I was imagining something more like:
pack-objects learns a --report-object-id option that sticks some additional bytes before each object (in its simplest form, $obj_hash bytes of id) 2. likewise, index-pack learns a --parse-object-id option to receive it and skip hashing the object bytes 3. we get a new protocol capability, "send-object-ids". If the server advertises and the client requests it, then both sides turn on the appropriate option You could even imagine generalizing it to "--report-object-metadata", and including 0 or more metadata packets before each object. With object id being one, but possibly other computable bits like "generation number" after that. I'm not convinced other metadata is worth the space/time tradeoff, though. After all, this is stuff that the client _could_ generate and cache themselves, so you're trading off bandwidth to save the client from doing the computation. Anyway, food for thought. :) -Peff ^ permalink raw reply [flat|nested] 26+ messages in thread
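Peff's three-step proposal could be sketched roughly like this (a hypothetical framing only: the option names come from the mail, but the byte layout here is invented for illustration and is not git's actual pack or wire format):

```python
import hashlib

def serialize_with_ids(payloads):
    # writer side (cf. the hypothetical pack-objects --report-object-id):
    # prefix each object's payload with its 20-byte SHA-1 and a length
    out = b""
    for payload in payloads:
        oid = hashlib.sha1(payload).digest()
        out += oid + len(payload).to_bytes(4, "big") + payload
    return out

def parse_trusting_ids(stream):
    # reader side (cf. the hypothetical index-pack --parse-object-id):
    # take the sender's id at face value instead of re-hashing each payload
    objects = []
    pos = 0
    while pos < len(stream):
        oid = stream[pos:pos + 20]
        size = int.from_bytes(stream[pos + 20:pos + 24], "big")
        objects.append((oid.hex(), stream[pos + 24:pos + 24 + size]))
        pos += 24 + size
    return objects
```

The trade-off is visible in the reader: it saves the SHA-1 work only by trusting the sender, which is why the proposal gates it behind a negotiated capability ("send-object-ids") that both sides must opt into.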
* Re: Resolving deltas dominates clone time 2019-04-22 18:43 ` Jeff King @ 2019-04-23 7:07 ` Ævar Arnfjörð Bjarmason 0 siblings, 0 replies; 26+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2019-04-23 7:07 UTC (permalink / raw) To: Jeff King; +Cc: Martin Fick, Git Mailing List, Jansen, Geert, Jonathan Tan On Mon, Apr 22 2019, Jeff King wrote: > On Mon, Apr 22, 2019 at 08:01:15PM +0200, Ævar Arnfjörð Bjarmason wrote: > >> > Your patch is optionally removing the "woah, we got an object with a >> > duplicate sha1, let's check that the bytes are the same in both copies" >> > check. But Martin's problem is a clone, so we wouldn't have any existing >> > objects to duplicate in the first place. >> >> Right, but we do anyway, as reported by Geert at @amazon.com preceding >> that patch of mine. But it is 99.99% irrelevant to *performance* in this >> case after the loose object cache you added (but before that could make >> all the difference depending on the FS). > > I scratched my head at this a bit. If we don't have any other objects, > then what are we comparing against? But I think you mean that we have > the overhead of doing the object lookups to find that out. Yes, that can > add up if your filesystem has high latency, but I think in this case it > is a drop in the bucket compared to dealing with the actual object data. There was no "we have no objects" clause, so this bit is what dominated clone time before the loose object cache... >> I just mentioned it to plant a flag on another bit of the code where >> index-pack in general has certain paranoias/validation the user might be >> willing to optionally drop just at "clone" time. > > Yeah, I agree it may be worth pursuing independently. I just don't think > it will help Martin's case in any noticeable way. ...indeed, as noted just mentioning this in the context of things in index-pack that *in general* might benefit from some "we had no objects before" special-case. >> > Right, that would work. 
I will note one thing, though: the total time to >> > do a 1-depth clone followed by an unshallow is probably much higher than >> > doing the whole clone as one unit, for two reasons: >> >> Indeed. The hypothesis is that the user doesn't really care about the >> clone-time, but the clone-to-repo-mostly-usable time. > > There was a little bit of self-interest in there for me, too, as a > server operator. While it does add to the end-to-end time, most of the > resource use for the shallow fetch gets put on the server. IOW, I don't > think we'd be happy to see clients doing this depth-1-and-then-unshallow > strategy for every clone. Just change from per-seat pricing to charging a premium for CPU & I/O. Now your problem is a solution :) More seriously, yeah I think we definitely need to be careful about changes to git that'll eat someone's "free" server time to save the client time/work. At the same time we have dedicated internal operators who wouldn't mind spending that CPU. So hopefully we can in general find some reasonable middle-ground. >> > So in general, I think you'd need some cooperation from the server side >> > to ask it to generate and send the .idx that matches the .pack it is >> > sending you. Or even if not the .idx format itself, some stable list of >> > sha1s that you could use to reproduce it without hashing each >> > uncompressed byte yourself. >> >> Yeah, depending on how jt/fetch-cdn-offload is designed (see my >> https://public-inbox.org/git/87k1hv6eel.fsf@evledraar.gmail.com/) it >> could be (ab)used to do this. I.e. you'd keep a "base" *.{pack,idx} >> around for such a purpose. >> >> So in such a case you'd serve up that recent-enough *.{pack,idx} for the >> client to "wget", and the client would then trust it (or not) and do the >> equivalent of a "fetch" from that point to be 100% up-to-date. > > I think it's sort of orthogonal. Either way you have to teach the client > how to get a .pack/.idx combo.
Whether it learns to receive it inline > from the first fetch, or whether it is taught to expect it from the > out-of-band fetch, most of the challenge is the same. I think it is, but maybe we're talking about different things. I suspect a few of us have experimented with similar rsync-and-pull hacks as a replacement for "clone". It's much faster (often 50-90% faster). I.e. just an rsync of a recent-enough .git dir (or .git/objects), followed by a 'reset --hard' to get the worktree and then a 'git pull'. >> > This could even be stuffed into the pack format and stripped out by >> > the receiving index-pack (i.e., each entry is prefixed with "and by >> > the way, here is my sha1..."). >> >> That would be really interesting. I.e. just having room for that (or >> anything else) in the pack format. >> >> I wonder if it could be added to the delta-chain in the current format >> as a nasty hack :) > > There's definitely not "room" in any sense of the word in the pack > format. :) However, as long as all parties agreed, we can stick whatever > we want into the on-the-wire format. So I was imagining something more > like: > > 1. pack-objects learns a --report-object-id option that sticks some > additional bytes before each object (in its simplest form, > $obj_hash bytes of id) > > 2. likewise, index-pack learns a --parse-object-id option to receive > it and skip hashing the object bytes > > 3. we get a new protocol capability, "send-object-ids". If the server > advertises and the client requests it, then both sides turn on the > appropriate option > > You could even imagine generalizing it to "--report-object-metadata", > and including 0 or more metadata packets before each object. With object > id being one, but possibly other computable bits like "generation > number" after that. I'm not convinced other metadata is worth the > space/time tradeoff, though. 
After all, this is stuff that the client > _could_ generate and cache themselves, so you're trading off bandwidth > to save the client from doing the computation. > > Anyway, food for thought. :) ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Resolving deltas dominates clone time 2019-04-20 3:58 ` Jeff King 2019-04-20 7:59 ` Ævar Arnfjörð Bjarmason @ 2019-04-22 20:21 ` Martin Fick 2019-04-22 20:56 ` Jeff King 1 sibling, 1 reply; 26+ messages in thread From: Martin Fick @ 2019-04-22 20:21 UTC (permalink / raw) To: Jeff King; +Cc: Git Mailing List On Friday, April 19, 2019 11:58:25 PM MDT Jeff King wrote: > On Fri, Apr 19, 2019 at 03:47:22PM -0600, Martin Fick wrote: > > I have been thinking about this problem, and I suspect that this compute > > time is actually spent doing SHA1 calculations, is that possible? Some > > basic back of the envelope math and scripting seems to show that the repo > > may actually contain about 2TB of data if you add up the size of all the > > objects in the repo. Some quick research on the net seems to indicate > > that we might be able to expect something around 500MB/s throughput on > > computing SHA1s, does that seem reasonable? If I really have 2TB of data, > > should it then take around 66mins to get the SHA1s for all that data? > > Could my repo clone time really be dominated by SHA1 math? > > That sounds about right, actually. 8GB to 2TB is a compression ratio of > 250:1. That's bigger than I've seen, but I get 51:1 in the kernel. > > Try this (with a recent version of git; your v1.8.2.1 won't have > --batch-all-objects):
>
> # count the on-disk size of all objects
> git cat-file --batch-all-objects --batch-check='%(objectsize) %(objectsize:disk)' | perl -alne '
>   $repo += $F[0];
>   $disk += $F[1];
>   END { print "$repo / $disk = ", $repo/$disk }
> '

This has been running for a few hours now, I will update you with results when it's done. > 250:1 isn't inconceivable if you have large blobs which have small > changes to them (and at 8GB for 8 million objects, you probably do have > some larger blobs, since the kernel is about 1/8th the size for the same > number of objects). I think it's mostly xml files in the 1-10MB range.
> So yes, if you really do have to hash 2TB of data, that's going to take > a while. I was hoping I was wrong. Unfortunately I sense that this is not likely something we can improve with a better algorithm. It seems like the best way to handle this long term is likely to use BUP's rolling hash splitting, it would make this way better (assuming it made objects small enough). I think it is interesting that this approach might end up being effective for more than just large binary file repos. If I could get this repo into bup somehow, it could potentially show us if this would drastically reduce the index-pack time. > I think v2.18 will have the collision-detecting sha1 on by default, > which is slower. Makes sense. > If you don't mind losing the collision-detection, using openssl's sha1 > might help. The delta resolution should be threaded, too. So in _theory_ > you're using 66 minutes of CPU time, but that should only take 1-2 > minutes on your 56-core machine. I don't know at what point you'd run > into lock contention, though. The locking there is quite coarse. I suspect at 3 threads, seems like the default? I am running some index packs to test the theory, I can tell you already that the 56 thread version was much slower, it took 397m25.622s. I am running a few other tests also, but it will take a while to get an answer. Since things take hours to test, I made a repo with a single branch (and the tags for that branch) from this bigger repo using a git init/git fetch. The single branch repo takes about 12s to clone, but it takes around 14s with 3 threads to run index-pack, any ideas why it is slower than a clone? Here are some thread times for the single branch case:

Threads  Time
56       49s
12       34s
5        20s
4        15s
3        14s
2        17s
1        30s

So 3 threads appears optimal in this case. Perhaps the locking can be improved here to make threading more effective? > We also hash non-deltas while we're receiving them over the network.
> That's accounted for in the "receiving pack" part of the progress meter. > If the time looks to be going to "resolving deltas", then that should > all be threaded. Would it make sense to make the receiving pack time also threaded because I believe that time is still longer than the I/O time (2 or 3 times)? > If you want to replay the slow part, it should just be index-pack. So > something like (with $old as a fresh clone of the repo): > > git init --bare new-repo.git > cd new-repo.git > perf record git index-pack -v --stdin <$old/.git/objects/pack/pack-*.pack > perf report > > should show you where the time is going (substitute perf with whatever > profiling tool you like). I will work on profiling soon, but I wanted to give an update now. Thanks for the great feedback, -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation ^ permalink raw reply [flat|nested] 26+ messages in thread
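Martin's back-of-the-envelope estimate from the start of the thread (roughly 2TB of inflated object data at ~500MB/s of SHA-1 throughput) is easy to sanity-check; the numbers here are the ones quoted in the thread, not measurements:

```python
def hash_time_minutes(total_bytes, throughput_bytes_per_s):
    # time to SHA-1 `total_bytes` at a given single-stream throughput
    return total_bytes / throughput_bytes_per_s / 60.0

# ~2TB of object data at ~500MB/s, as estimated in the thread
print(round(hash_time_minutes(2e12, 500e6), 1))  # → 66.7
```

That ~66 minutes of single-threaded CPU time is what the thread is then trying to spread across threads in the delta-resolution phase.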
* Re: Resolving deltas dominates clone time 2019-04-22 20:21 ` Martin Fick @ 2019-04-22 20:56 ` Jeff King 2019-04-22 21:02 ` Jeff King ` (2 more replies) 0 siblings, 3 replies; 26+ messages in thread From: Jeff King @ 2019-04-22 20:56 UTC (permalink / raw) To: Martin Fick; +Cc: Git Mailing List On Mon, Apr 22, 2019 at 02:21:40PM -0600, Martin Fick wrote: > > Try this (with a recent version of git; your v1.8.2.1 won't have > > --batch-all-objects): > > > > # count the on-disk size of all objects > > git cat-file --batch-all-objects --batch-check='%(objectsize) > > %(objectsize:disk)' | perl -alne ' > > $repo += $F[0]; > > $disk += $F[1]; > > END { print "$repo / $disk = ", $repo/$disk } > > ' > > This has been running for a few hours now, I will update you with results when > its done. Hours? I think something might be wrong. It takes 20s to run on linux.git. > > 250:1 isn't inconceivable if you have large blobs which have small > > changes to them (and at 8GB for 8 million objects, you probably do have > > some larger blobs, since the kernel is about 1/8th the size for the same > > number of objects). > > I think it's mostly xml files in the 1-10MB range. Yeah, I could believe that would do it, then. Take a 10MB file with a hundred 1K updates. That'd be 99*1K + 10MB to store, for almost 100:1 compression. The key is really having objects where the size of change versus the file size is small (so the marginal cost of each revision is small), and then having lots of changes (to give you a big multiplier). > > So yes, if you really do have to hash 2TB of data, that's going to take > > a while. > > I was hoping I was wrong. Unfortunately I sense that this is not likely > something we can improve with a better algorithm. It seems like the best way > to handle this long term is likely to use BUP's rolling hash splitting, it > would make this way better (assuming it made objects small enough). 
I think it > is interesting that this approach might end up being effective for more than > just large binary file repos. If I could get this repo into bup somehow, it > could potentially show us if this would drastically reduce the index-pack > time. It fundamentally can't help without changing Git's object model, since you need to have complete hashes of all of those files. If you're proposing to do rolling hashes to split blobs into multiple objects that would definitely work. But it wouldn't be compatible with Git anymore. > > If you don't mind losing the collision-detection, using openssl's sha1 > > might help. The delta resolution should be threaded, too. So in _theory_ > > you're using 66 minutes of CPU time, but that should only take 1-2 > > minutes on your 56-core machine. I don't know at what point you'd run > > into lock contention, though. The locking there is quite coarse. > > I suspect at 3 threads, seems like the default? Ah, right, I forgot we cap it at 3 (which was determined experimentally, and which we more or less attributed to lock contention as the bottleneck). I think you need to use $GIT_FORCE_THREADS to override it. > I am running some index packs to test the theory, I can tell you already that > the 56 thread versions was much slower, it took 397m25.622s. I am running a > few other tests also, but it will take a while to get an answer. Since things > take hours to test, I made a repo with a single branch (and the tags for that > branch) from this bigger repo using a git init/git fetch. The single branch > repo takes about 12s to clone, but it takes around 14s with 3 threads to run > index-pack, any ideas why it is slower than a clone? Are you running it in the same repo, or in another newly-created repo? Or alternatively, in a new repo but repeatedly running index-pack? After the first run, that repo will have all of the objects. 
And so for each object it sees, index-pack will say "woah, we already had that one; let's double check that they're byte for byte identical" which carries extra overhead (and probably makes the lock contention way worse, too, because accessing existing objects just has one big coarse lock). So definitely do something like:

  for threads in 1 2 3 4 5 12 56; do
    rm -rf repo.git
    git init --bare repo.git
    GIT_FORCE_THREADS=$threads \
      git -C repo.git index-pack -v --stdin </path/to/pack
  done

to test. > Perhaps the locking can be improved here to make threading more effective? Probably, but easier said than done, of course. > > We also hash non-deltas while we're receiving them over the network. > > That's accounted for in the "receiving pack" part of the progress meter. > > If the time looks to be going to "resolving deltas", then that should > > all be threaded. > Would it make sense to make the receiving pack time also threaded because I > believe that time is still longer than the I/O time (2 or 3 times)? It's a lot harder to thread since we're eating the incoming bytes. And unless you're just coming from a local disk copy, the network is generally the bottleneck (and if you are coming from a local disk copy, then consider doing a local clone which will avoid all of this hashing in the first place). -Peff ^ permalink raw reply [flat|nested] 26+ messages in thread
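Jeff's "10MB file with a hundred 1K updates" arithmetic works out as follows (a sketch using round decimal units; real delta sizes vary):

```python
MB, KB = 10**6, 10**3

revisions = 100
full_size = 10 * MB     # every revision of the file, fully inflated
delta_size = 1 * KB     # on-disk cost of each of the 99 later revisions

logical = revisions * full_size                     # what index-pack must hash
on_disk = full_size + (revisions - 1) * delta_size  # one base + 99 deltas
print(round(logical / on_disk, 1))  # → 99.0
```

So the logical data index-pack must hash is nearly 100x the pack size, which is how an 8GB pack can imply terabytes of SHA-1 input.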
* Re: Resolving deltas dominates clone time 2019-04-22 20:56 ` Jeff King @ 2019-04-22 21:02 ` Jeff King 2019-04-22 21:19 ` [PATCH] p5302: create the repo in each index-pack test Jeff King 2019-04-22 22:32 ` Resolving deltas dominates clone time Martin Fick 2 siblings, 0 replies; 26+ messages in thread From: Jeff King @ 2019-04-22 21:02 UTC (permalink / raw) To: Martin Fick; +Cc: Git Mailing List On Mon, Apr 22, 2019 at 04:56:54PM -0400, Jeff King wrote: > > I suspect at 3 threads, seems like the default? > > Ah, right, I forgot we cap it at 3 (which was determined experimentally, > and which we more or less attributed to lock contention as the > bottleneck). I think you need to use $GIT_FORCE_THREADS to override it. Ah, nevermind this. Using --threads will do what you expect. The cap at 3 only applies when we've just picked the number of available CPUs as a default. -Peff ^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH] p5302: create the repo in each index-pack test 2019-04-22 20:56 ` Jeff King 2019-04-22 21:02 ` Jeff King @ 2019-04-22 21:19 ` Jeff King 2019-04-23 1:09 ` Junio C Hamano 2019-04-22 22:32 ` Resolving deltas dominates clone time Martin Fick 2 siblings, 1 reply; 26+ messages in thread From: Jeff King @ 2019-04-22 21:19 UTC (permalink / raw) To: Martin Fick; +Cc: Git Mailing List On Mon, Apr 22, 2019 at 04:56:53PM -0400, Jeff King wrote: > > I am running some index packs to test the theory, I can tell you already that > > the 56 thread versions was much slower, it took 397m25.622s. I am running a > > few other tests also, but it will take a while to get an answer. Since things > > take hours to test, I made a repo with a single branch (and the tags for that > > branch) from this bigger repo using a git init/git fetch. The single branch > > repo takes about 12s to clone, but it takes around 14s with 3 threads to run > > index-pack, any ideas why it is slower than a clone? > > Are you running it in the same repo, or in another newly-created repo? > Or alternatively, in a new repo but repeatedly running index-pack? After > the first run, that repo will have all of the objects. And so for each > object it sees, index-pack will say "woah, we already had that one; > let's double check that they're byte for byte identical" which carries > extra overhead (and probably makes the lock contention way worse, too, > because accessing existing objects just has one big coarse lock). > > So definitely do something like:
>
>   for threads in 1 2 3 4 5 12 56; do
>     rm -rf repo.git
>     git init --bare repo.git
>     GIT_FORCE_THREADS=$threads \
>       git -C repo.git index-pack -v --stdin </path/to/pack
>   done
>
> to test.

This is roughly what p5302 is doing, though it does not go as high as 56 (though it seems like there is probably not much point in doing so). However, I did notice this slight bug in it.
After this fix, here are my numbers from indexing git.git:

Test                                          HEAD
----------------------------------------------------------------
5302.2: index-pack 0 threads                  22.72(22.55+0.16)
5302.3: index-pack 1 thread                   23.26(23.02+0.24)
5302.4: index-pack 2 threads                  13.19(24.06+0.23)
5302.5: index-pack 4 threads                  7.96(24.65+0.25)
5302.6: index-pack 8 threads                  7.94(45.06+0.38)
5302.7: index-pack default number of threads  9.37(23.82+0.18)

So it looks like "4" is slightly better than the default of "3" for me. I'm running it on linux.git now, but it will take quite a while to come up with a result.

-- >8 --
Subject: [PATCH] p5302: create the repo in each index-pack test

The p5302 script runs "index-pack --stdin" in each timing test. It does two things to try to get good timings:

  1. we do the repo creation in a separate (non-timed) setup test, so that our timing is purely the index-pack run

  2. we use a separate repo for each test; this is important because the presence of existing objects in the repo influences the result (because we'll end up doing collision checks against them)

But this forgets one thing: we generally run each timed test multiple times to reduce the impact of noise. Which means that repeats of each test after the first will be subject to the collision slowdown from point 2, and we'll generally just end up taking the first time anyway.

Instead, let's create the repo in the test (effectively undoing point 1). That does add a constant amount of extra work to each iteration, but it's quite small compared to the actual effects we're interested in measuring.

Signed-off-by: Jeff King <peff@peff.net>
---
The very first 0-thread one will run faster because it has less to "rm -rf", but I think we can ignore that.
 t/perf/p5302-pack-index.sh | 31 ++++++++++++++++++-------------
 1 file changed, 18 insertions(+), 13 deletions(-)

diff --git a/t/perf/p5302-pack-index.sh b/t/perf/p5302-pack-index.sh
index 99bdb16c85..a9b3e112d9 100755
--- a/t/perf/p5302-pack-index.sh
+++ b/t/perf/p5302-pack-index.sh
@@ -13,35 +13,40 @@ test_expect_success 'repack' '
 	export PACK
 '
 
-test_expect_success 'create target repositories' '
-	for repo in t1 t2 t3 t4 t5 t6
-	do
-		git init --bare $repo
-	done
-'
-
 test_perf 'index-pack 0 threads' '
-	GIT_DIR=t1 git index-pack --threads=1 --stdin < $PACK
+	rm -rf repo.git &&
+	git init --bare repo.git &&
+	GIT_DIR=repo.git git index-pack --threads=1 --stdin < $PACK
 '
 
 test_perf 'index-pack 1 thread ' '
-	GIT_DIR=t2 GIT_FORCE_THREADS=1 git index-pack --threads=1 --stdin < $PACK
+	rm -rf repo.git &&
+	git init --bare repo.git &&
+	GIT_DIR=repo.git GIT_FORCE_THREADS=1 git index-pack --threads=1 --stdin < $PACK
 '
 
 test_perf 'index-pack 2 threads' '
-	GIT_DIR=t3 git index-pack --threads=2 --stdin < $PACK
+	rm -rf repo.git &&
+	git init --bare repo.git &&
+	GIT_DIR=repo.git git index-pack --threads=2 --stdin < $PACK
 '
 
 test_perf 'index-pack 4 threads' '
-	GIT_DIR=t4 git index-pack --threads=4 --stdin < $PACK
+	rm -rf repo.git &&
+	git init --bare repo.git &&
+	GIT_DIR=repo.git git index-pack --threads=4 --stdin < $PACK
 '
 
 test_perf 'index-pack 8 threads' '
-	GIT_DIR=t5 git index-pack --threads=8 --stdin < $PACK
+	rm -rf repo.git &&
+	git init --bare repo.git &&
+	GIT_DIR=repo.git git index-pack --threads=8 --stdin < $PACK
 '
 
 test_perf 'index-pack default number of threads' '
-	GIT_DIR=t6 git index-pack --stdin < $PACK
+	rm -rf repo.git &&
+	git init --bare repo.git &&
+	GIT_DIR=repo.git git index-pack --stdin < $PACK
 '
 
 test_done
-- 
2.21.0.1182.g3590c06d32

^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH] p5302: create the repo in each index-pack test 2019-04-22 21:19 ` [PATCH] p5302: create the repo in each index-pack test Jeff King @ 2019-04-23 1:09 ` Junio C Hamano 2019-04-23 2:07 ` Jeff King 0 siblings, 1 reply; 26+ messages in thread From: Junio C Hamano @ 2019-04-23 1:09 UTC (permalink / raw) To: Jeff King; +Cc: Martin Fick, Git Mailing List Jeff King <peff@peff.net> writes: > Subject: [PATCH] p5302: create the repo in each index-pack test > > The p5302 script runs "index-pack --stdin" in each timing test. It does > two things to try to get good timings: > > 1. we do the repo creation in a separate (non-timed) setup test, so > that our timing is purely the index-pack run > > 2. we use a separate repo for each test; this is important because the > presence of existing objects in the repo influences the result > (because we'll end up doing collision checks against them) > > But this forgets one thing: we generally run each timed test multiple > times to reduce the impact of noise. Which means that repeats of each > test after the first will be subject to the collision slowdown from > point 2, and we'll generally just end up taking the first time anyway. The above is very cleanly written to convince anybody that what the current test does contradicts with wish #2 above, and that the two wishes #1 and #2 are probably mutually incompatible. But isn't the collision check a part of the real-life workload that Git users are made waiting for and care about the performance of? Or are we purely interested in the cost of resolving delta, computing the object name, and writing the result out to the disk in this test and the "overall experience" benchmark is left elsewhere? The reason why I got confused is because the test_description of the script leaves "the actual effects we're interested in measuring" unsaid, I think. 
The log message of b8a2486f ("index-pack: support multithreaded delta resolving", 2012-05-06) that created this test does not help that much, either. In any case, the above "this forgets one thing" makes it clear that we at this point in time declare what we are interested in very clearly, and I agree that the solution described in the paragraph below clearly matches the goal. Looks good. > Instead, let's create the repo in the test (effectively undoing point > 1). That does add a constant amount of extra work to each iteration, but > it's quite small compared to the actual effects we're interested in > measuring. > Signed-off-by: Jeff King <peff@peff.net> > --- > The very first 0-thread one will run faster because it has less to "rm > -rf", but I think we can ignore that. OK. > - GIT_DIR=t1 git index-pack --threads=1 --stdin < $PACK > + rm -rf repo.git && > + git init --bare repo.git && > + GIT_DIR=repo.git git index-pack --threads=1 --stdin < $PACK This is obviously inherited from the original, but do we get scolded by some versions of bash for this line, without quoting the source path of the redirection, i.e. ... --stdin <"$PACK" ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH] p5302: create the repo in each index-pack test 2019-04-23 1:09 ` Junio C Hamano @ 2019-04-23 2:07 ` Jeff King 2019-04-23 2:27 ` Junio C Hamano 0 siblings, 1 reply; 26+ messages in thread From: Jeff King @ 2019-04-23 2:07 UTC (permalink / raw) To: Junio C Hamano; +Cc: Martin Fick, Git Mailing List On Tue, Apr 23, 2019 at 10:09:54AM +0900, Junio C Hamano wrote: > The above is very cleanly written to convince anybody that what the > current test does contradicts with wish #2 above, and that the two > wishes #1 and #2 are probably mutually incompatible. > > But isn't the collision check a part of the real-life workload that > Git users are made waiting for and care about the performance of? > Or are we purely interested in the cost of resolving delta, > computing the object name, and writing the result out to the disk in > this test and the "overall experience" benchmark is left elsewhere? I think we _are_ just interested in the resolving delta cost (after all, we're testing it with various thread levels). What's more, the old code would run the test $GIT_PERF_COUNT times, once without any objects, and then the other N-1 times with objects. And then take the smallest time, which would generally be the one-off! So we're really just measuring that case more consistently now. But even if you left all of that aside, I think the case without objects is actually the realistic one. It represents the equivalent of a full clone, where we would not have any objects already. The case where we are fetching into a repository with objects already is also potentially of interest, but this test wouldn't show that very well. Because there the main added cost is looking up each object and saying "ah, we do not have it; no need to do a collision check", because we'd generally not expect the other side to be sending us duplicates. But because this test would be repeating itself on the same pack each time, we'd be seeing a collision on _every_ object.
And the added time would be dominated by us saying "oops, a collision; let's take the slow path and reconstruct that object from disk so we can compare its bytes". > The reason why I got confused is because the test_description of the > script leaves "the actual effects we're interested in measuring" > unsaid, I think. The log message of b8a2486f ("index-pack: support > multithreaded delta resolving", 2012-05-06) that created this test > does not help that much, either. > > In any case, the above "this forgets one thing" makes it clear that > we at this point in time declare what we are interested in very > clearly, and I agree that the solution described in the paragraph > below clearly matches the goal. Looks good. So I think you convinced yourself even before my email that this was the right path, but let me know if you think it's worth trying to revise the commit message to include some of the above reasoning. > > - GIT_DIR=t1 git index-pack --threads=1 --stdin < $PACK > > + rm -rf repo.git && > > + git init --bare repo.git && > > + GIT_DIR=repo.git git index-pack --threads=1 --stdin < $PACK > > This is obviously inherited from the original, but do we get scolded > by some versions of bash for this line, without quoting the source path > of the redirection, i.e. > > ... --stdin <"$PACK" In general, yes, but I think we are OK in this instance because we generated $PACK ourselves in the setup step, and we know that it is just a relative .git/objects/pack/xyz.pack with no spaces. I almost touched it just to get rid of the style-violating space after the "<" though. ;) -Peff ^ permalink raw reply [flat|nested] 26+ messages in thread
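The collision check that keeps coming up in this thread can be modeled in a few lines (a toy model of the idea only, not git's implementation):

```python
import hashlib

class ToyObjectStore:
    """On a duplicate id, re-read the stored copy and compare it byte
    for byte before accepting the newcomer as a harmless duplicate."""
    def __init__(self):
        self.objects = {}
        self.collision_checks = 0

    def add(self, payload):
        oid = hashlib.sha1(payload).hexdigest()
        if oid in self.objects:
            # the slow path: same id seen before, so verify the bytes
            self.collision_checks += 1
            if self.objects[oid] != payload:
                raise RuntimeError("SHA-1 collision: same id, different bytes")
        else:
            self.objects[oid] = payload
        return oid
```

Feeding the same pack into the same store twice means every object after the first pass takes the comparison path, which is why repeated index-pack runs into a populated repo measure something quite different from a fresh clone.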
* Re: [PATCH] p5302: create the repo in each index-pack test 2019-04-23 2:07 ` Jeff King @ 2019-04-23 2:27 ` Junio C Hamano 2019-04-23 2:36 ` Jeff King 0 siblings, 1 reply; 26+ messages in thread From: Junio C Hamano @ 2019-04-23 2:27 UTC (permalink / raw) To: Jeff King; +Cc: Martin Fick, Git Mailing List Jeff King <peff@peff.net> writes: >> This is obviously inherited from the original, but do we get scolded >> by some versions of bash for this line, without quoting the source path >> of the redirection, i.e. >> >> ... --stdin <"$PACK" > > In general, yes, but I think we are OK in this instance because we > generated $PACK ourselves in the setup step, and we know that it is just > a relative .git/objects/pack/xyz.pack with no spaces. I know we are OK, but the issue with some versions of bash AFAIU is that bash is not OK regardless of the contents of $variable that is not quoted and used as the target or the source of a redirection, issuing an unnecessary warning. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH] p5302: create the repo in each index-pack test 2019-04-23 2:27 ` Junio C Hamano @ 2019-04-23 2:36 ` Jeff King 2019-04-23 2:40 ` Junio C Hamano 0 siblings, 1 reply; 26+ messages in thread From: Jeff King @ 2019-04-23 2:36 UTC (permalink / raw) To: Junio C Hamano; +Cc: Martin Fick, Git Mailing List On Tue, Apr 23, 2019 at 11:27:02AM +0900, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > > >> This is obviously inherited from the original, but do we get scolded > >> by some versions of bash for this line, without quoting the source path > >> of the redirection, i.e. > >> > >> ... --stdin <"$PACK" > > > > In general, yes, but I think we are OK in this instance because we > > generated $PACK ourselves in the setup step, and we know that it is just > > a relative .git/objects/pack/xyz.pack with no spaces. > > I know we are OK, but the issue with some versions of bash AFAIU is > that bash is not OK regardless of the contents of $variable that is > not quoted and used as the target or the source of a redirection, > issuing an unnecessary warning. Is it? I thought the issue was specifically when there were spaces. I get:

  $ bash
  $ file=ok
  $ echo foo >$file
  $ file='not ok'
  $ echo foo >$file
  bash: $file: ambiguous redirect

And that is AFAIK what the recent 7951a016a5 (t4038-diff-combined: quote paths with whitespace, 2019-03-17) was about (because our trash directory always has a space in it). Did I miss a report where it happens on some versions even without spaces? If so, we have quite a number of things to fix judging from the output of:

  git grep '>\$'

-Peff ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH] p5302: create the repo in each index-pack test
From: Junio C Hamano @ 2019-04-23  2:40 UTC (permalink / raw)
To: Jeff King; +Cc: Martin Fick, Git Mailing List

Jeff King <peff@peff.net> writes:

> Is it? I thought the issue was specifically when there were spaces. I get:
>
>   $ bash
>   $ file=ok
>   $ echo foo >$file
>   $ file='not ok'
>   $ echo foo >$file
>   bash: $file: ambiguous redirect

OK, so I misremembered. Then we are good. Thanks.
* Re: Resolving deltas dominates clone time
From: Martin Fick @ 2019-04-22 22:32 UTC (permalink / raw)
To: Jeff King; +Cc: Git Mailing List

On Monday, April 22, 2019 4:56:54 PM MDT Jeff King wrote:
> On Mon, Apr 22, 2019 at 02:21:40PM -0600, Martin Fick wrote:
> > > Try this (with a recent version of git; your v1.8.2.1 won't have
> > > --batch-all-objects):
> > >
> > >   # count the on-disk size of all objects
> > >   git cat-file --batch-all-objects \
> > >       --batch-check='%(objectsize) %(objectsize:disk)' |
> > >   perl -alne '
> > >     $repo += $F[0];
> > >     $disk += $F[1];
> > >     END { print "$repo / $disk = ", $repo/$disk }
> > >   '
> >
> > This has been running for a few hours now, I will update you with
> > results when it's done.
>
> Hours? I think something might be wrong. It takes 20s to run on
> linux.git.

OK, yes, I was running this on a "bad" copy of the repo; see below, because
I think it might be of some interest also...

On the better copy, the git part of the command took 3m48.025s (still not
20s, but not hours either), using git v2.18. The results were:

  2085532480789 / 7909358121 = 263.679106304687

This seems to confirm the sizing numbers.

As for the "bad repo", let me describe:

1) It is on a different machine (only 32 processors, and some serious
   background load)

2) That copy of the repo is from a mirror clone of our source repo without
   running repacking (-f) on it. Without repacking with -f, the clone is
   actually 16G, which I believe is because it reuses the deltas from the
   source repo. Those deltas would have been created by people pushing new
   commits over and over to the source repo, with the new objects just
   being added to the common pack file during repacking without creating a
   ton of new deltas.

It would appear that the deltas created from regular repacking without
using -f may be hurting the performance of our source repo even worse than
I imagined. The rest of my testing has been done on a repo repacked with
-f to eliminate the variability imposed by the source repo, because I
figured it would adversely impact things. But I did not imagine it being
that bad, since even the clone time of the bad repo is not that bad, so I
wonder why the git cat-file run is so much worse?

> The key is really having objects where the size of change versus the
> file size is small (so the marginal cost of each revision is small), and
> then having lots of changes (to give you a big multiplier).

Right.

> > > So yes, if you really do have to hash 2TB of data, that's going to
> > > take a while.
> >
> > I was hoping I was wrong. Unfortunately I sense that this is not
> > likely something we can improve with a better algorithm. It seems like
> > the best way to handle this long term is likely to use BUP's rolling
> > hash splitting; it would make this way better (assuming it made
> > objects small enough). I think it is interesting that this approach
> > might end up being effective for more than just large binary file
> > repos. If I could get this repo into bup somehow, it could potentially
> > show us if this would drastically reduce the index-pack time.
>
> It fundamentally can't help without changing Git's object model, since
> you need to have complete hashes of all of those files. If you're
> proposing to do rolling hashes to split blobs into multiple objects that
> would definitely work. But it wouldn't be compatible with Git anymore.

Right. I really just meant to point out how many people may want the
BUP-style rolling-hash-split blobs in git even without having gigantic
blobs, since it might really help with smaller blobs and long histories
with small deltas. Most of the git performance discussion tends to focus
on large repos with large blobs (that is what BUP was made for), but it
could really help normal git users, possibly even the kernel, as we start
to develop some seriously longer histories. I think most assumptions have
been that this is not likely to be a problem any time soon.

> Are you running it in the same repo, or in another newly-created repo?

Yes.

> Or alternatively, in a new repo but repeatedly running index-pack? After
> the first run, that repo will have all of the objects. And so for each
> object it sees, index-pack will say "woah, we already had that one;
> let's double check that they're byte for byte identical" which carries
> extra overhead (and probably makes the lock contention way worse, too,
> because accessing existing objects just has one big coarse lock).
>
> So definitely do something like:
>
>   for threads in 1 2 3 4 5 12 56; do
>     rm -rf repo.git
>     git init --bare repo.git
>     GIT_FORCE_THREADS=$threads \
>       git -C repo.git index-pack -v --stdin </path/to/pack
>   done
>
> to test.

Makes sense, testing now...

> > Perhaps the locking can be improved here to make threading more
> > effective?
>
> Probably, but easier said than done, of course.
>
> > > We also hash non-deltas while we're receiving them over the network.
> > > That's accounted for in the "receiving pack" part of the progress
> > > meter. If the time looks to be going to "resolving deltas", then
> > > that should all be threaded.
> >
> > Would it make sense to make the receiving pack phase also threaded,
> > because I believe that time is still longer than the I/O time (2 or 3
> > times)?
>
> It's a lot harder to thread since we're eating the incoming bytes. And
> unless you're just coming from a local disk copy, the network is
> generally the bottleneck (and if you are coming from a local disk copy,
> then consider doing a local clone which will avoid all of this hashing
> in the first place).

Is that really a fair assumption in today's intranets? Many corporate LANs
have higher bandwidth than normal disks, don't they?

-Martin

--
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation
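Peff's perl one-liner above translates directly into a small Python filter if perl isn't handy. A minimal sketch (the column-summing logic is factored into a function; feed it the `%(objectsize) %(objectsize:disk)` pairs from `git cat-file --batch-all-objects --batch-check` on stdin):

```python
import sys

def size_ratio(lines):
    """Sum '<objectsize> <objectsize:disk>' pairs and return
    (total_logical, total_on_disk, ratio)."""
    logical = disk = 0
    for line in lines:
        size, on_disk = line.split()
        logical += int(size)
        disk += int(on_disk)
    return logical, disk, logical / disk

if __name__ == "__main__":
    # git cat-file --batch-all-objects --buffer \
    #     --batch-check='%(objectsize) %(objectsize:disk)' | python3 ratio.py
    repo, d, r = size_ratio(sys.stdin)
    print(f"{repo} / {d} = {r}")
```

Applied to Martin's numbers, 2085532480789 / 7909358121 gives the ~263x expansion factor he reported.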
* Re: Resolving deltas dominates clone time
From: Jeff King @ 2019-04-23  1:55 UTC (permalink / raw)
To: Martin Fick; +Cc: Git Mailing List

On Mon, Apr 22, 2019 at 04:32:16PM -0600, Martin Fick wrote:

> > Hours? I think something might be wrong. It takes 20s to run on
> > linux.git.
>
> OK, yes I was running this on a "bad" copy of the repo, see below
> because I think it might be of some interest also...
>
> On the better copy, the git part of the command took 3m48.025s (still
> not 20s, but not hours either), using git v2.18. The results were:

That is slower than I'd expect. Two things that could speed it up:

  - use --buffer, which drops write() overhead

  - use a more recent version of Git, which has 0750bb5b51 (cat-file:
    support "unordered" output for --batch-all-objects, 2018-08-10). That
    order gets better cache locality (both disk cache, if you can't fit
    the whole thing in RAM, but also the delta cache).

Those drop my 20s run on linux.git to 13s, but they might have more impact
for you.

> 2) That copy of the repo is from a mirror clone of our source repo
>    without running repacking (-f) on it. Without repacking -f, the clone
>    is actually 16G, which I believe is using the deltas from the source
>    repo which would have been created by people pushing new commits over
>    and over to the source repo and those new objects just being added to
>    the common pack file during repacking without creating a ton of new
>    deltas. It would appear that the deltas created from regular
>    repacking without using the -f may be really hurting the performance
>    of our source repo even worse than I imagined.

Yes, I do think it's worth doing a "repack -f" every once in a while (and
probably worth --window=250 since you're already splurging on CPU). It
does seem to produce better results than taking the accumulated results of
the individual pushes. I don't think this is an area we've studied all
that well, so there may be some ways to make it better (but what's there
has been "good enough" that nobody has sunk a lot of time into it).

> > So definitely do something like:
> >
> >   for threads in 1 2 3 4 5 12 56; do
> >     rm -rf repo.git
> >     git init --bare repo.git
> >     GIT_FORCE_THREADS=$threads \
> >       git -C repo.git index-pack -v --stdin </path/to/pack
> >   done
> >
> > to test.
>
> Makes sense, testing now...

Here are my p5302 numbers on linux.git, by the way.

  Test                                          jk/p5302-repeat-fix
  ------------------------------------------------------------------
  5302.2: index-pack 0 threads                  307.04(303.74+3.30)
  5302.3: index-pack 1 thread                   309.74(306.13+3.56)
  5302.4: index-pack 2 threads                  177.89(313.73+3.60)
  5302.5: index-pack 4 threads                  117.14(344.07+4.29)
  5302.6: index-pack 8 threads                  112.40(607.12+5.80)
  5302.7: index-pack default number of threads  135.00(322.03+3.74)

which still imply that "4" is a win over "3" ("8" is slightly better still
in wall-clock time, but the total CPU rises dramatically; that's probably
because this is a quad-core with hyperthreading, so by that point we're
just throttling down the CPUs).

> > It's a lot harder to thread since we're eating the incoming bytes. And
> > unless you're just coming from a local disk copy, the network is
> > generally the bottleneck (and if you are coming from a local disk
> > copy, then consider doing a local clone which will avoid all of this
> > hashing in the first place).
>
> Is that really a fair assumption in today's intranets? Many corporate
> LANs have higher bandwidth than normal disks, don't they?

Single-threaded SHA-1 is 500-1000 megabytes/sec. That's 4-8 gigabits/sec.
And keep in mind we're just handling the base objects, so we're only
computing the SHA-1 on some of the bytes. Of course there's another SHA-1
computation going over the whole packfile at that point, so _that_ is
probably a bigger bottleneck if you really are on a 10 gigabit network.
But if somebody wants to spend the time to look at parallelizing it, I
wouldn't say no. :)

-Peff
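The 500-1000 MB/s figure is easy to sanity-check on any given box with a quick `hashlib` micro-benchmark. A minimal sketch (the throughput you measure is of course machine- and build-dependent, and Python's `hashlib` may not match git's SHA-1 implementation exactly):

```python
import hashlib
import time

def sha1_throughput(total_mb=256, chunk_mb=8):
    """Hash total_mb of zero bytes in chunk_mb pieces; return MB/s."""
    chunk = bytes(chunk_mb * 1024 * 1024)
    h = hashlib.sha1()
    start = time.perf_counter()
    for _ in range(total_mb // chunk_mb):
        h.update(chunk)
    elapsed = time.perf_counter() - start
    h.hexdigest()  # finalize the digest
    return total_mb / elapsed

if __name__ == "__main__":
    print(f"single-threaded SHA-1: ~{sha1_throughput():.0f} MB/s")
```

At 500 MB/s, hashing Martin's ~2 TB of expanded object data single-threaded would indeed take roughly 2000000/500/60, i.e. about 66 minutes, matching his back-of-the-envelope estimate.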
* Re: Resolving deltas dominates clone time
From: Jeff King @ 2019-04-23  4:21 UTC (permalink / raw)
To: Martin Fick; +Cc: Git Mailing List

On Mon, Apr 22, 2019 at 09:55:38PM -0400, Jeff King wrote:

> Here are my p5302 numbers on linux.git, by the way.
>
>   Test                                          jk/p5302-repeat-fix
>   ------------------------------------------------------------------
>   5302.2: index-pack 0 threads                  307.04(303.74+3.30)
>   5302.3: index-pack 1 thread                   309.74(306.13+3.56)
>   5302.4: index-pack 2 threads                  177.89(313.73+3.60)
>   5302.5: index-pack 4 threads                  117.14(344.07+4.29)
>   5302.6: index-pack 8 threads                  112.40(607.12+5.80)
>   5302.7: index-pack default number of threads  135.00(322.03+3.74)
>
> which still imply that "4" is a win over "3" ("8" is slightly better
> still in wall-clock time, but the total CPU rises dramatically; that's
> probably because this is a quad-core with hyperthreading, so by that
> point we're just throttling down the CPUs).

And here's a similar test run on a 20-core Xeon w/ hyperthreading (I
tweaked the test to keep going after eight threads):

  Test                            HEAD
  ----------------------------------------------------
  5302.2: index-pack 1 threads    376.88(364.50+11.52)
  5302.3: index-pack 2 threads    228.13(371.21+17.86)
  5302.4: index-pack 4 threads    151.41(387.06+21.12)
  5302.5: index-pack 8 threads    113.68(413.40+25.80)
  5302.6: index-pack 16 threads   100.60(511.85+37.53)
  5302.7: index-pack 32 threads    94.43(623.82+45.70)
  5302.8: index-pack 40 threads    93.64(702.88+47.61)

I don't think any of this is _particularly_ relevant to your case, but it
really seems to me that the default of capping at 3 threads is too low.

-Peff
* Re: Resolving deltas dominates clone time
From: Duy Nguyen @ 2019-04-23 10:08 UTC (permalink / raw)
To: Jeff King; +Cc: Martin Fick, Git Mailing List

On Tue, Apr 23, 2019 at 11:45 AM Jeff King <peff@peff.net> wrote:
>
> On Mon, Apr 22, 2019 at 09:55:38PM -0400, Jeff King wrote:
>
> > Here are my p5302 numbers on linux.git, by the way.
> > [...]
> > which still imply that "4" is a win over "3" ("8" is slightly better
> > still in wall-clock time, but the total CPU rises dramatically; that's
> > probably because this is a quad-core with hyperthreading, so by that
> > point we're just throttling down the CPUs).
>
> And here's a similar test run on a 20-core Xeon w/ hyperthreading (I
> tweaked the test to keep going after eight threads):
>
>   Test                            HEAD
>   ----------------------------------------------------
>   5302.2: index-pack 1 threads    376.88(364.50+11.52)
>   5302.3: index-pack 2 threads    228.13(371.21+17.86)
>   5302.4: index-pack 4 threads    151.41(387.06+21.12)
>   5302.5: index-pack 8 threads    113.68(413.40+25.80)
>   5302.6: index-pack 16 threads   100.60(511.85+37.53)
>   5302.7: index-pack 32 threads    94.43(623.82+45.70)
>   5302.8: index-pack 40 threads    93.64(702.88+47.61)
>
> I don't think any of this is _particularly_ relevant to your case, but
> it really seems to me that the default of capping at 3 threads is too
> low.

Looking back at the multithread commit, I think the trend was the same,
and I capped it because the gain was no longer proportional to the number
of cores we threw at index-pack. I would not be opposed to raising the cap
though (or maybe just removing it).
--
Duy
* Re: Resolving deltas dominates clone time
From: Martin Fick @ 2019-04-23 20:09 UTC (permalink / raw)
To: Duy Nguyen; +Cc: Jeff King, Git Mailing List

On Tuesday, April 23, 2019 5:08:40 PM MDT Duy Nguyen wrote:
> On Tue, Apr 23, 2019 at 11:45 AM Jeff King <peff@peff.net> wrote:
> > On Mon, Apr 22, 2019 at 09:55:38PM -0400, Jeff King wrote:
> > > Here are my p5302 numbers on linux.git, by the way.
> > > [...]
> >
> > And here's a similar test run on a 20-core Xeon w/ hyperthreading (I
> > tweaked the test to keep going after eight threads):
> > [...]
> >
> > I don't think any of this is _particularly_ relevant to your case, but
> > it really seems to me that the default of capping at 3 threads is too
> > low.

Here are my index-pack results (I only ran them once since they take a
while) using git 1.8.3.2:

  Threads   real          user          sys
   1        108m46.151s   106m14.420s    1m57.192s
   2         58m14.274s   106m23.158s    5m32.736s
   3         40m33.351s   106m42.281s    5m40.884s
   4         31m40.342s   107m20.278s    5m40.675s
   5         26m0.454s    106m54.370s    5m35.827s
  12         13m25.304s   107m57.271s    6m26.493s
  16         10m56.866s   107m46.107s    6m41.330s
  18         10m18.112s   109m50.893s    7m1.369s
  20          9m54.010s   113m51.028s    7m53.082s
  24          9m1.104s    115m8.245s     7m57.156s
  28          8m26.058s   116m46.311s    8m34.752s
  32          8m42.967s   140m33.280s    9m59.514s
  36          8m52.228s   151m28.939s   11m55.590s
  40          8m22.719s   153m4.496s    12m36.041s
  44          8m12.419s   166m41.594s   14m7.717s
  48          8m0.377s    172m3.597s    16m32.041s
  56          8m22.320s   188m31.426s   17m48.274s

> Looking back at the multithread commit, I think the trend was the same
> and I capped it because the gain was not proportional to the number of
> cores we threw at index-pack anymore. I would not be opposed to
> raising the cap though (or maybe just remove it)

I think that if there were no default limit during a clone it could have
disastrous effects on people using the repo tool from the android project,
or any other "submodule like" tool that might clone many projects in
parallel. With the repo tool, people often use a large -j number such as
24, which means they end up cloning around 24 projects at a time, and they
may do this for around 1000 projects. If git clone suddenly started as
many threads as there are CPUs for each clone, this would likely paralyze
the machine.

I do suspect it would be nice to have a switch, though, that repo could
use to adjust this intelligently. Is there some way to adjust threads from
a clone? I don't see one. I tried using 'GIT_FORCE_THREADS=28 git clone
...' and it didn't seem to make a difference.

-Martin

--
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation
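Martin's timing table is close to what Amdahl's law predicts for an almost fully parallel workload with a small serial fraction. A sketch of the arithmetic (the helper names are mine; the serial fraction is fitted from the 1- and 4-thread wall times, not measured directly):

```python
def amdahl_speedup(n, serial_fraction):
    """Predicted speedup on n threads given a serial fraction s:
    S(n) = 1 / (s + (1 - s)/n)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)

def fit_serial_fraction(t1, tn, n):
    """Infer the serial fraction from wall times at 1 and n threads,
    by inverting Amdahl's law: s = (n/speedup - 1) / (n - 1)."""
    speedup = t1 / tn
    return (n / speedup - 1.0) / (n - 1.0)

if __name__ == "__main__":
    t1 = 108 * 60 + 46   # 1 thread, seconds (108m46s)
    t4 = 31 * 60 + 40    # 4 threads (31m40s)
    s = fit_serial_fraction(t1, t4, 4)
    predicted_16 = t1 / amdahl_speedup(16, s)
    print(f"serial fraction ~{s:.3f}, "
          f"predicted 16-thread time ~{predicted_16 / 60:.1f} min")
```

The fit gives a serial fraction of a few percent, which is consistent with the wall time flattening out around 8 minutes no matter how many threads are added: by Amdahl's law, the serial part alone bounds the runtime from below.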
* Re: Resolving deltas dominates clone time
From: Jeff King @ 2019-04-30 18:02 UTC (permalink / raw)
To: Martin Fick; +Cc: Duy Nguyen, Git Mailing List

On Tue, Apr 23, 2019 at 02:09:31PM -0600, Martin Fick wrote:

> Here are my index-pack results (I only ran them once since they take a
> while) using git 1.8.3.2:
>
>   Threads   real          user          sys
>    1        108m46.151s   106m14.420s    1m57.192s
>    2         58m14.274s   106m23.158s    5m32.736s
>    3         40m33.351s   106m42.281s    5m40.884s
>    4         31m40.342s   107m20.278s    5m40.675s
>    5         26m0.454s    106m54.370s    5m35.827s
>   12         13m25.304s   107m57.271s    6m26.493s
>   16         10m56.866s   107m46.107s    6m41.330s
>   18         10m18.112s   109m50.893s    7m1.369s
>   20          9m54.010s   113m51.028s    7m53.082s
>   24          9m1.104s    115m8.245s     7m57.156s
>   28          8m26.058s   116m46.311s    8m34.752s
>   32          8m42.967s   140m33.280s    9m59.514s
>   36          8m52.228s   151m28.939s   11m55.590s
>   40          8m22.719s   153m4.496s    12m36.041s
>   44          8m12.419s   166m41.594s   14m7.717s
>   48          8m0.377s    172m3.597s    16m32.041s
>   56          8m22.320s   188m31.426s   17m48.274s

Thanks for the data. That seems to roughly match my results. Things get
obviously better up to around close to half of the available processors,
and then you get minimal returns for more CPU (some of yours actually get
worse in the middle, but that may be due to noise; my timings are all
best-of-3).

> I think that if there were no default limit during a clone it could have
> disastrous effects on people using the repo tool from the android
> project, or any other "submodule like" tool that might clone many
> projects in parallel. With the repo tool, people often use a large -j
> number such as 24 which means they end up cloning around 24 projects at
> a time, and they may do this for around 1000 projects. If git clone
> suddenly started as many threads as there are CPUs for each clone, this
> would likely paralyze the machine.

IMHO this is already a problem, because none of those 24 gits knows about
the others. So they're already using 24*3 cores, though of course at any
given moment some of those 24 may be bottle-necked on the network.

I suspect that repo should be passing in `-c pack.threads=N`, where `N` is
some formula based around how many cores we want to use, with some
constant fraction applied for how many we expect to be chugging on CPU at
any given point.

The optimal behavior would probably come from index-pack dynamically
assigning work based on system load, but that gets pretty messy. Ideally
we could just throw all of the load at the kernel's scheduler and let it
do the right thing, but:

  - we clearly get some inefficiencies from being overly-parallelized, so
    we don't want to go too far

  - we have other resources we'd like to keep in use, like the network and
    disk

So probably the optimal case would be to have one (or a few) index-packs
fully utilizing the network, and then as they move to the local-only
CPU-heavy phase, start a few more on the network, and so on. There's no
way to do that kind of slot-oriented gating now. It actually wouldn't be
too hard at the low level of the code, but I'm not sure what interface
you'd use to communicate "OK, now go" to each process.

> I do suspect it would be nice to have a switch though that repo could
> use to adjust this intelligently, is there some way to adjust threads
> from a clone, I don't see one? I tried using 'GIT_FORCE_THREADS=28 git
> clone ...' and it didn't seem to make a difference?

I think I led you astray earlier by mentioning GIT_FORCE_THREADS. It's
actually just a boolean for "use threads even if we're only
single-threaded". What you actually want is probably:

  git clone -c pack.threads=28 ...

(though I didn't test it to be sure).

-Peff
* Re: Resolving deltas dominates clone time
From: Martin Fick @ 2019-04-30 22:08 UTC (permalink / raw)
To: Jeff King; +Cc: Duy Nguyen, Git Mailing List

On Tuesday, April 30, 2019 2:02:32 PM MDT Jeff King wrote:
> On Tue, Apr 23, 2019 at 02:09:31PM -0600, Martin Fick wrote:
> > I think that if there were no default limit during a clone it could
> > have disastrous effects on people using the repo tool from the android
> > project, or any other "submodule like" tool that might clone many
> > projects in parallel. With the repo tool, people often use a large -j
> > number such as 24 which means they end up cloning around 24 projects
> > at a time, and they may do this for around 1000 projects. If git clone
> > suddenly started as many threads as there are CPUs for each clone,
> > this would likely paralyze the machine.
>
> IMHO this is already a problem, because none of those 24 gits knows
> about the others. So they're already using 24*3 cores, though of course
> at any given moment some of those 24 may be bottle-necked on the
> network.

I think this is very different. In the first case there is a linear
constraint that someone can use to intelligently adjust the number of
clones they are doing at the same time. The second case cannot be
accounted for in any intelligent way by the person running the clones; it
is almost unconstrained.

...

> > I do suspect it would be nice to have a switch though that repo could
> > use to adjust this intelligently, is there some way to adjust threads
> > from a clone, I don't see one? I tried using 'GIT_FORCE_THREADS=28 git
> > clone ...' and it didn't seem to make a difference?
>
> I think I led you astray earlier by mentioning GIT_FORCE_THREADS. It's
> actually just a boolean for "use threads even if we're only
> single-threaded". What you actually want is probably:
>
>   git clone -c pack.threads=28 ...
>
> (though I didn't test it to be sure).

Thanks, I will test this!

-Martin

--
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation
* Re: Resolving deltas dominates clone time
From: Jeff King @ 2019-04-30 17:50 UTC (permalink / raw)
To: Duy Nguyen; +Cc: Martin Fick, Git Mailing List

On Tue, Apr 23, 2019 at 05:08:40PM +0700, Duy Nguyen wrote:

> On Tue, Apr 23, 2019 at 11:45 AM Jeff King <peff@peff.net> wrote:
> > On Mon, Apr 22, 2019 at 09:55:38PM -0400, Jeff King wrote:
> > > Here are my p5302 numbers on linux.git, by the way.
> > > [...]
> > > which still imply that "4" is a win over "3" ("8" is slightly better
> > > still in wall-clock time, but the total CPU rises dramatically;
> > > that's probably because this is a quad-core with hyperthreading, so
> > > by that point we're just throttling down the CPUs).
> >
> > And here's a similar test run on a 20-core Xeon w/ hyperthreading (I
> > tweaked the test to keep going after eight threads):
> > [...]
> > I don't think any of this is _particularly_ relevant to your case, but
> > it really seems to me that the default of capping at 3 threads is too
> > low.
>
> Looking back at the multithread commit, I think the trend was the same
> and I capped it because the gain was not proportional to the number of
> cores we threw at index-pack anymore. I would not be opposed to
> raising the cap though (or maybe just remove it)

I'm not sure what the right cap would be. I don't think it's static; we'd
want ~4 threads on the top case, and 10-20 on the bottom one.

It does seem like there's an inflection point in the graph at N/2 threads.
But then maybe that's just because these are hyper-threaded machines, so
"N/2" is the actual number of physical cores, and the inflated CPU times
above that are just because we can't turbo-boost then, so we're actually
clocking slower. Multi-threaded profiling and measurement is such a
mess. :)

So I'd say the right answer is probably either online_cpus() or half that.
The latter would be more appropriate for the machines I have, but I'd
worry that it would leave performance on the table for non-intel machines.

-Peff
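The two candidate defaults Peff weighs here are easy to state precisely. A minimal sketch (the function name is hypothetical, not git's actual code; git's `online_cpus()` roughly corresponds to Python's `os.cpu_count()`):

```python
import os

def default_index_pack_threads(cpus=None, use_physical_estimate=False):
    """Candidate defaults discussed above: all logical CPUs, or half of
    them as a rough stand-in for the physical-core count on SMT boxes."""
    cpus = cpus or os.cpu_count() or 1
    if use_physical_estimate:
        return max(1, cpus // 2)
    return cpus
```

The halving heuristic matches the N/2 inflection point in both benchmark tables, but as noted it bakes in an assumption of 2-way SMT that may not hold on other hardware.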
* Re: Resolving deltas dominates clone time
From: Ævar Arnfjörð Bjarmason @ 2019-04-30 18:48 UTC (permalink / raw)
To: Jeff King; +Cc: Duy Nguyen, Martin Fick, Git Mailing List

On Tue, Apr 30 2019, Jeff King wrote:

> On Tue, Apr 23, 2019 at 05:08:40PM +0700, Duy Nguyen wrote:
>
>> Looking back at the multithread commit, I think the trend was the same
>> and I capped it because the gain was not proportional to the number of
>> cores we threw at index-pack anymore. I would not be opposed to
>> raising the cap though (or maybe just remove it)
>
> I'm not sure what the right cap would be. I don't think it's static;
> we'd want ~4 threads on the top case, and 10-20 on the bottom one.
>
> It does seem like there's an inflection point in the graph at N/2
> threads. But then maybe that's just because these are hyper-threaded
> machines, so "N/2" is the actual number of physical cores, and the
> inflated CPU times above that are just because we can't turbo-boost
> then, so we're actually clocking slower. Multi-threaded profiling and
> measurement is such a mess. :)
>
> So I'd say the right answer is probably either online_cpus() or half
> that. The latter would be more appropriate for the machines I have, but
> I'd worry that it would leave performance on the table for non-intel
> machines.

It would be a nice #leftoverbits project to do this dynamically at
runtime, i.e. hook up the throughput code in progress.c to some new
utility functions where the current code using pthreads would occasionally
stop and try to find some (local) maximum throughput given N threads.

You could then dynamically save that optimum for next time, or adjust
threading at runtime every X seconds, e.g. on a server with N=24 cores you
might want 24 threads if you have one index-pack, but if you have 24
index-packs you probably don't want each with 24 threads, for a total
of 576.
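The runtime-tuning idea Ævar sketches amounts to a hill climb on observed throughput. A toy sketch of one possible step function (entirely hypothetical; nothing like this exists in git, and a real version would need damping against noisy measurements):

```python
def next_thread_count(current, throughput_history):
    """Simple hill climb on measured throughput. throughput_history is a
    list of (threads, objects_per_sec) samples, oldest first. Keep moving
    in the direction that last improved throughput; reverse otherwise."""
    if len(throughput_history) < 2:
        return current + 1  # probe upward first
    (t_prev, r_prev), (t_last, r_last) = throughput_history[-2:]
    step = 1 if t_last >= t_prev else -1
    if r_last > r_prev:
        return max(1, t_last + step)  # last move helped; keep going
    return max(1, t_last - step)      # last move hurt; back off
```

A worker pool would call this between work batches, recording throughput over the last interval; with a roughly unimodal throughput curve (as in the benchmark tables above) it settles into oscillating around the local optimum.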
* Re: Resolving deltas dominates clone time
  2019-04-30 18:48 ` Ævar Arnfjörð Bjarmason
@ 2019-04-30 20:33 ` Jeff King
  0 siblings, 0 replies; 26+ messages in thread
From: Jeff King @ 2019-04-30 20:33 UTC (permalink / raw)
To: Ævar Arnfjörð Bjarmason
Cc: Duy Nguyen, Martin Fick, Git Mailing List

On Tue, Apr 30, 2019 at 08:48:08PM +0200, Ævar Arnfjörð Bjarmason wrote:

> > So I'd say the right answer is probably either online_cpus() or half
> > that. The latter would be more appropriate for the machines I have, but
> > I'd worry that it would leave performance on the table for non-intel
> > machines.
>
> It would be a nice #leftoverbits project to do this dynamically at
> runtime, i.e. hook up the throughput code in progress.c to some new
> utility functions where the current code using pthreads would
> occasionally stop and try to find some (local) maximum throughput given
> N threads.
>
> You could then dynamically save that optimum for next time, or adjust
> threading at runtime every X seconds, e.g. on a server with N=24 cores
> you might want 24 threads if you have one index-pack, but if you have 24
> index-packs you probably don't want each with 24 threads, for a total of
> 576.

Yeah, I touched on that in my response to Martin. I think that would be
nice, but it's complicated enough that I don't think it's a left-over
bit.

I'm also not sure how hard it is to change the number of threads after
the initialization. IIRC, it's a worker pool that just asks for more
work. So that's probably the right moment to say not just "is there
more work to do" but also "does it seem like there's an idle slot on
the system for our thread to take".

-Peff

^ permalink raw reply	[flat|nested] 26+ messages in thread
end of thread, other threads:[~2019-04-30 22:08 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-19 21:47 Resolving deltas dominates clone time Martin Fick
2019-04-20  3:58 ` Jeff King
2019-04-20  7:59 ` Ævar Arnfjörð Bjarmason
2019-04-22 15:57 ` Jeff King
2019-04-22 18:01 ` Ævar Arnfjörð Bjarmason
2019-04-22 18:43 ` Jeff King
2019-04-23  7:07 ` Ævar Arnfjörð Bjarmason
2019-04-22 20:21 ` Martin Fick
2019-04-22 20:56 ` Jeff King
2019-04-22 21:02 ` Jeff King
2019-04-22 21:19 ` [PATCH] p5302: create the repo in each index-pack test Jeff King
2019-04-23  1:09 ` Junio C Hamano
2019-04-23  2:07 ` Jeff King
2019-04-23  2:27 ` Junio C Hamano
2019-04-23  2:36 ` Jeff King
2019-04-23  2:40 ` Junio C Hamano
2019-04-22 22:32 ` Resolving deltas dominates clone time Martin Fick
2019-04-23  1:55 ` Jeff King
2019-04-23  4:21 ` Jeff King
2019-04-23 10:08 ` Duy Nguyen
2019-04-23 20:09 ` Martin Fick
2019-04-30 18:02 ` Jeff King
2019-04-30 22:08 ` Martin Fick
2019-04-30 17:50 ` Jeff King
2019-04-30 18:48 ` Ævar Arnfjörð Bjarmason
2019-04-30 20:33 ` Jeff King
Code repositories for project(s) associated with this public inbox:
	https://80x24.org/mirrors/git.git

This is a public inbox; see mirroring instructions for how to clone and
mirror all data and code used for this inbox, as well as URLs for
read-only IMAP folder(s) and NNTP newsgroup(s).