* `git index-pack --strict` is *very* slow during pushes to large repos
From: Craig de Stigter @ 2021-05-09 20:52 UTC
To: git
Hey folks
(apologies if repost; my first post seemed to disappear entirely)
We're hosting a service with some fairly large repos (created by
Kart[1]), and I've been looking into some poor
performance of `git push` on our service.
Background: We host repositories with a specific layout. I'll try to avoid
most of the technical details but a brief description of the repo layout
might be helpful:
- At each revision we have 256 trees
- each containing 256 trees (so 65536 trees at this level)
- each subtree contains a number of objects (distributed via a hash
scheme, evenly across the subtrees)
- Some repos have up to 100 million blobs active in a given revision.
In that case each of the 65536 subtrees would contain ~1500 blobs.
- Blobs are usually a few bytes to a few KB in size.
- For various reasons we have disabled deltas entirely.
- Most repos have a few hundred commits, and a typical commit might
modify 100,000 features (again spread evenly across the 65536 trees),
thus modifying most of the trees also.
- Our largest repos are currently a few hundred GB on disk.
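To put rough numbers on the layout above, here is a minimal Python sketch. It assumes the hash scheme spreads blobs uniformly and independently across the 65536 second-level trees; that assumption and all names in the code are illustrative, not taken from Kart.

```python
# Back-of-the-envelope numbers for the repo layout described above.
# Assumption (not from the original post): blobs are hashed uniformly
# and independently into the second-level trees.

LEAF_TREES = 256 * 256      # 65536 trees at the second level
BLOBS = 100_000_000         # blobs active in one revision of a large repo
CHANGED = 100_000           # features modified by a typical commit


def blobs_per_leaf_tree(blobs: int = BLOBS, leaves: int = LEAF_TREES) -> float:
    """Average number of blobs held by each second-level tree."""
    return blobs / leaves


def expected_touched_fraction(changed: int = CHANGED,
                              leaves: int = LEAF_TREES) -> float:
    """Expected fraction of second-level trees modified by one commit.

    A given tree is untouched only if none of the `changed` blobs
    hashes into it, which has probability (1 - 1/leaves) ** changed.
    """
    return 1.0 - (1.0 - 1.0 / leaves) ** changed


if __name__ == "__main__":
    frac = expected_touched_fraction()
    print(f"blobs per second-level tree: {blobs_per_leaf_tree():.0f}")
    print(f"fraction of trees touched:   {frac:.2f}")
    print(f"trees touched per commit:    {frac * LEAF_TREES:.0f}")
```

Under these assumptions a typical commit rewrites a little over three quarters of the second-level trees, which is consistent with "most of the trees" above.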
We've come across a curious performance issue with `git index-pack` when
invoked by `receive-pack` during a push operation. We have
`transfer.fsckObjects=true` in the server config, so the index-pack
invocation looks like:
```
git --shallow-file shallow_filename index-pack \
--stdin --keep='receive-pack 1234 on <servername>' \
--show-resolving-progress --report-end-of-input --fix-thin \
--strict
```
For our largest repos, when pushing ~100K blobs and associated trees, this
takes a *long* time - sometimes over 12 hours. The process uses enormous
amounts of disk IO (all reads; I haven't measured how much per process, but
the server was doing many terabytes of IO in total).
Here is one that "only" took 45 minutes with a few tracing environment vars
enabled:
```
$ cat craig.pack | /opt/sno/libexec/git-core/git --shallow-file myfilename \
    index-pack --stdin --keep='receive-pack 159567 on servername' \
    --show-resolving-progress --report-end-of-input --fix-thin \
    --strict
07:48:20.781099 common-main.c:48 version 2.29.2
07:48:20.781111 common-main.c:48 | d0 | main | version | | | | | 2.29.2
07:48:20.781127 common-main.c:49 start /opt/sno/libexec/git-core/git --shallow-file indexed.pack index-pack --stdin '--keep=receive-pack 159567 on cave-7dc7798cc9-qcvxd' --show-resolving-progress --report-end-of-input --fix-thin --strict
07:48:20.781133 common-main.c:49 | d0 | main | start | | 0.000264 | | | /opt/sno/libexec/git-core/git --shallow-file indexed.pack index-pack --stdin '--keep=receive-pack 159567 on cave-7dc7798cc9-qcvxd' --show-resolving-progress --report-end-of-input --fix-thin --strict
07:48:20.781296 git.c:444 trace: built-in: git index-pack --stdin '--keep=receive-pack 159567 on cave-7dc7798cc9-qcvxd' --show-resolving-progress --report-end-of-input --fix-thin --strict
07:48:20.781306 git.c:445 cmd_name index-pack (index-pack)
07:48:20.781312 git.c:445 | d0 | main | cmd_name | | | | | index-pack (index-pack)
07:48:20.781530 midx.c:184 | d0 | main | data | r0 | 0.000670 | 0.000670 | midx | load/num_packs:1
07:48:20.781542 midx.c:185 | d0 | main | data | r0 | 0.000683 | 0.000683 | midx | load/num_objects:42658742
pack 5aa14bbb43187b7dfd5f996514854c3dcdc66d71
08:27:33.724306 git.c:700 exit elapsed:2352.943441 code:0
08:27:33.724321 git.c:700 | d0 | main | exit | | 2352.943441 | | | code:0
08:27:33.724336 trace2/tr2_tgt_normal.c:123 atexit elapsed:2352.943475 code:0
08:27:33.724341 trace2/tr2_tgt_perf.c:213 | d0 | main | atexit | | 2352.943475 | | | code:0
```
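For reference, the wall-clock time of that run can be read off the first and last trace timestamps. A small Python sketch of the arithmetic follows; the helper name is made up for illustration, and it assumes both timestamps fall on the same day.

```python
from datetime import datetime

# First and last timestamps copied from the trace output above.
START = "07:48:20.781099"
END = "08:27:33.724341"


def elapsed_seconds(start: str, end: str) -> float:
    """Seconds between two HH:MM:SS.ffffff timestamps on the same day."""
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds()


if __name__ == "__main__":
    secs = elapsed_seconds(START, END)
    print(f"{secs:.3f} s = {secs / 60:.1f} min")
```

The result agrees with the `elapsed:2352.943475` value reported by the `atexit` trace event, i.e. a little over 39 minutes.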
Removing the `--strict` from the invocation by disabling
`transfer.fsckObjects` solves the problem - the process completes in less
than a minute, and uses less than a GB of read IO.
I can theorise why this operation is slightly expensive:
- `--strict` causes `index-pack` to call `fsck_object()` on each object
pushed
- these large pushes that push 100K+ blobs actually touch almost every
*tree* as well - so most/all of the 65K trees are pushed too.
- calling `fsck_object` on a tree looks up all its children (blobs and
trees) to ensure they're reachable [2]
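If that theory holds, the number of child lookups scales with the size of the pushed trees rather than with the number of new blobs. A rough Python estimate is below; it assumes one lookup per tree entry and takes the worst case of all 65536 trees being pushed. Whether git really performs one lookup per entry is exactly the open question here, so treat this as an upper-bound sketch.

```python
# Upper-bound estimate of child-object lookups during a strict push.
# Assumption (unverified): every entry of every pushed tree triggers
# one "does this object exist?" lookup.

TREES_PUSHED = 65536        # worst case: every second-level tree is rewritten
ENTRIES_PER_TREE = 1500     # ~1500 blobs per tree in the largest repos
NEW_BLOBS = 100_000         # blobs actually contained in the push


def estimated_child_lookups(trees_pushed: int, entries_per_tree: int) -> int:
    """Total tree entries that would need to be looked up."""
    return trees_pushed * entries_per_tree


if __name__ == "__main__":
    lookups = estimated_child_lookups(TREES_PUSHED, ENTRIES_PER_TREE)
    print(f"child lookups (upper bound): {lookups:,}")
    print(f"lookups per pushed blob:     {lookups / NEW_BLOBS:.0f}")
```

So a push of ~100K blobs could imply on the order of a hundred million lookups against existing objects, nearly all of them for blobs that were not part of the push.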
What I can't understand is why that makes it take quite *so* much longer
and use so much IO. I think it *should* probably not be checking much about
objects that are already in the repo, other than that they exist. We
have multi-pack indexes enabled, so my assumption is that a "does
object xyz exist?" check should be very inexpensive.
What could I be missing here?
As the start of a possible theory: we found when using libgit2 that our
peculiar repo structure, with so many trees, requires that we expand the
size of the tree cache[3]. Otherwise repeated operations on blobs would
cause tree cache misses every time their path was traversed. I wonder if
there is a similar tree cache structure in git itself, and if so, could it
be relevant here?
Many thanks and sorry about the long-winded post :)
Craig de Stigter
Platform Engineer
Koordinates
references:
[1]: https://kartproject.org
[2]: fsck_walk_tree:
https://github.com/git/git/blob/a0dda6023ed82b927fa205c474654699a5b07a82/fsck.c#L300
[3]: GIT_OPT_SET_CACHE_OBJECT_LIMIT:
https://github.com/libgit2/libgit2/blob/508361401fbb5d87118045eaeae3356a729131aa/include/git2/common.h#L266-L272