git@vger.kernel.org list mirror (unofficial, one of many)
* `git index-pack --strict` is *very* slow during pushes to large repos
From: Craig de Stigter @ 2021-05-09 20:52 UTC (permalink / raw)
  To: git

Hey folks

(apologies if repost; my first post seemed to disappear entirely)

We're hosting a service with some fairly large repos (created by
Kart[1]), and I've been looking into some poor performance of
`git push` on our service.

Background: we host repositories with a specific layout. I'll try to avoid
most of the technical details, but a brief description of the repo layout
might be helpful:

- At each revision we have 256 top-level trees
      - each containing 256 subtrees (so 65536 trees at this level)
      - each subtree contains a number of blobs (distributed evenly
      across the subtrees via a hash scheme)
- Some repos have up to 100 million blobs active in a given revision.
  In that case each of the 65536 subtrees contains ~1500 blobs.
- Blobs are usually a few bytes to a few KB in size.
- For various reasons we have disabled deltas entirely.
- Most repos have a few hundred commits, and a typical commit might
  modify 100,000 features (blobs, again spread evenly across the 65536
  trees), thus modifying most of the trees as well.
- Our largest repos are currently a few hundred GB on disk.
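
A rough sketch of the arithmetic behind this layout, assuming the modified
blobs land in the 65536 leaf trees uniformly at random. The function name
and output keys are illustrative only, not part of Kart or git:

```shell
#!/bin/sh
# Back-of-envelope numbers for the repo layout described above.
# Assumption: each modified blob falls into a leaf tree uniformly at random.
estimate_layout() {
    # $1 = total blobs, $2 = leaf trees, $3 = blobs modified per commit
    awk -v blobs="$1" -v trees="$2" -v changed="$3" 'BEGIN {
        per_tree = blobs / trees
        # Probability that a given leaf tree receives none of the changes.
        p_untouched = exp(changed * log(1 - 1 / trees))
        touched = trees * (1 - p_untouched)
        printf "blobs_per_leaf_tree=%.0f\n", per_tree
        printf "expected_touched_leaf_trees=%.0f\n", touched
        # Every touched tree is a new object listing all of its entries.
        printf "entries_in_touched_trees=%.0f\n", touched * per_tree
    }'
}

estimate_layout 100000000 65536 100000
```

With the figures above this suggests that roughly four out of five leaf
trees change in a typical commit, and each changed tree is a new object
that lists all of its ~1500 entries again.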

We've come across a curious performance issue with `git index-pack` when
invoked by `receive-pack` during a push operation. We have
`transfer.fsckObjects=true` in the server config, so the index-pack
invocation looks like:

```
git --shallow-file shallow_filename index-pack \
   --stdin --keep='receive-pack 1234 on <servername>' \
   --show-resolving-progress --report-end-of-input --fix-thin \
   --strict
```

For our largest repos, when pushing ~100K blobs and associated trees, this
takes a *long* time - sometimes over 12 hours. The process uses an enormous
amount of disk IO (all reads; I haven't measured how much per process, but
the server was doing many terabytes of IO in total).
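
One way the per-process read volume could be measured on Linux is the
`read_bytes` counter in `/proc/<pid>/io`. A minimal sketch, assuming a
Linux server and permission to read that file; the helper names are
hypothetical:

```shell
#!/bin/sh
# Extract the read_bytes counter from text in the format of Linux's
# /proc/<pid>/io (bytes the process caused to be fetched from storage).
parse_read_bytes() {
    awk '$1 == "read_bytes:" { print $2 }'
}

# Print read_bytes for a running process (Linux only; reading another
# user's process usually requires root).
proc_read_bytes() {
    parse_read_bytes < "/proc/$1/io"
}

# Example (hypothetical): sample the oldest running index-pack process.
#   proc_read_bytes "$(pgrep -o -f 'index-pack')"
```

Sampling the counter at the start and end of a push would give the read
volume attributable to that one `index-pack` process.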

Here is one that "only" took about 40 minutes (2352 seconds elapsed), with
a few tracing environment variables enabled:

```
$ cat craig.pack | /opt/sno/libexec/git-core/git --shallow-file myfilename \
    index-pack --stdin --keep='receive-pack 159567 on servername' \
    --show-resolving-progress --report-end-of-input --fix-thin --strict
07:48:20.781099 common-main.c:48                  version 2.29.2
07:48:20.781111 common-main.c:48             | d0 | main      | version      |     |           |           |              | 2.29.2
07:48:20.781127 common-main.c:49                  start /opt/sno/libexec/git-core/git --shallow-file indexed.pack index-pack --stdin '--keep=receive-pack 159567 on cave-7dc7798cc9-qcvxd' --show-resolving-progress --report-end-of-input --fix-thin --strict
07:48:20.781133 common-main.c:49             | d0 | main      | start        |     |  0.000264 |           |              | /opt/sno/libexec/git-core/git --shallow-file indexed.pack index-pack --stdin '--keep=receive-pack 159567 on cave-7dc7798cc9-qcvxd' --show-resolving-progress --report-end-of-input --fix-thin --strict
07:48:20.781296 git.c:444               trace: built-in: git index-pack --stdin '--keep=receive-pack 159567 on cave-7dc7798cc9-qcvxd' --show-resolving-progress --report-end-of-input --fix-thin --strict
07:48:20.781306 git.c:445                         cmd_name index-pack (index-pack)
07:48:20.781312 git.c:445                    | d0 | main      | cmd_name     |     |           |           |              | index-pack (index-pack)
07:48:20.781530 midx.c:184                   | d0 | main      | data         | r0  |  0.000670 |  0.000670 | midx         | load/num_packs:1
07:48:20.781542 midx.c:185                   | d0 | main      | data         | r0  |  0.000683 |  0.000683 | midx         | load/num_objects:42658742
pack    5aa14bbb43187b7dfd5f996514854c3dcdc66d71
08:27:33.724306 git.c:700                         exit elapsed:2352.943441 code:0
08:27:33.724321 git.c:700                    | d0 | main      | exit         |     | 2352.943441 |           |              | code:0
08:27:33.724336 trace2/tr2_tgt_normal.c:123       atexit elapsed:2352.943475 code:0
08:27:33.724341 trace2/tr2_tgt_perf.c:213    | d0 | main      | atexit       |     | 2352.943475 |           |              | code:0
```
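
The output above interleaves three kinds of lines: classic `GIT_TRACE`
messages (`trace: built-in: ...`), the Trace2 normal format, and the Trace2
performance format with `|` columns. The exact variables used for this run
are not stated. A minimal sketch, assuming `GIT_TRACE2` was among them, of
capturing such a log and reading off the elapsed time (`elapsed_minutes` is
an illustrative helper, not a git command):

```shell
#!/bin/sh
# Print the wall-clock minutes taken from the first "elapsed:<seconds>"
# field found in a Trace2 normal-format log read from stdin.
elapsed_minutes() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^elapsed:/) {
                sub(/^elapsed:/, "", $i)
                printf "%.1f\n", $i / 60
                exit
            }
        }
    }'
}

# Example capture (the log path is hypothetical; GIT_TRACE2 needs an
# absolute path to write to a file):
#   GIT_TRACE2="$PWD/index-pack.trace" \
#       git index-pack --stdin --fix-thin --strict < craig.pack
#   elapsed_minutes < index-pack.trace
```

Fed the `elapsed:2352.943441` field from the trace above, this prints 39.2.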

Removing the `--strict` from the invocation by disabling
`transfer.fsckObjects` solves the problem - the process completes in less
than a minute, and uses less than a GB of read IO.
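
For anyone wanting to reproduce the comparison on one repository without
touching fetch-side checks, `receive.fsckObjects` can be set per repo; when
it is set, `receive-pack` uses it in preference to `transfer.fsckObjects`.
A sketch on a throwaway bare repository (the scratch path is an assumption
for illustration, and this does trade away integrity checks on pushed
objects):

```shell
#!/bin/sh
# Scratch bare repository so the commands are safe to try anywhere.
repo="$(mktemp -d)/scratch.git"
git init --quiet --bare "$repo"

# Stand-in for the server-wide setting (normally in system or global
# config): check objects on every transfer.
git -C "$repo" config transfer.fsckObjects true

# Per-repo override for pushes only: with this set to false,
# receive-pack no longer passes --strict to index-pack here.
git -C "$repo" config receive.fsckObjects false

# Show the value that receive-pack will use for this repository.
git -C "$repo" config --get receive.fsckObjects
```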

I can theorise why this operation is slightly expensive:

   - `--strict` causes `index-pack` to call `fsck_object()` on each object
   pushed
   - these large pushes that push 100K+ blobs actually touch almost every
   *tree* as well - so most/all of the 65K trees are pushed too.
   - calling `fsck_object` on a tree looks up all its children (blobs and
   trees) to ensure they're reachable [2]

What I can't understand is why that makes it take quite *so* much longer
and use so much IO. I think it *should* probably not be checking much about
objects that are already in the repo, other than that they exist. We have
multi-pack indexes enabled, so my assumption is that a "does object xyz
exist?" check should be very inexpensive. What could I be missing here?
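
A rough way to sanity-check that assumption outside of `index-pack` is to
look up every child of one leaf tree with `git cat-file --batch-check`.
This is only a sketch with hypothetical helper names, and `--batch-check`
reports type and size as well, so it does somewhat more work than a pure
existence check:

```shell
#!/bin/sh
# List the object IDs referenced by a tree and ask git for each one's
# type and size; absent objects are reported as "<oid> missing".
children_status() {
    git ls-tree "$1" | awk '{ print $3 }' | git cat-file --batch-check
}

# Count the "missing" lines in `git cat-file --batch-check` output.
count_missing() {
    awk '$2 == "missing" { n++ } END { print n + 0 }'
}

# Example (run inside a repository; the tree-ish is a placeholder):
#   children_status 'HEAD^{tree}' | count_missing
```

Timing this for a single leaf tree (for example with bash's `time`
keyword) gives a rough per-tree lookup cost to compare against the
`--strict` run.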

As the start of a possible theory: we found when using libgit2 that our
peculiar repo structure, with so many trees, requires that we expand the
size of the tree cache[3] - otherwise repeated operations on blobs would
cause tree cache misses every time their path was traversed. I wonder if
there is a similar tree cache structure in git itself, and if so, could it
be relevant here?

Many thanks, and sorry about the long-winded post :)

Craig de Stigter
Platform Engineer
Koordinates


references:
[1]: https://kartproject.org
[2]: fsck_walk_tree:
https://github.com/git/git/blob/a0dda6023ed82b927fa205c474654699a5b07a82/fsck.c#L300
[3]: GIT_OPT_SET_CACHE_OBJECT_LIMIT:
https://github.com/libgit2/libgit2/blob/508361401fbb5d87118045eaeae3356a729131aa/include/git2/common.h#L266-L272
