git@vger.kernel.org mailing list mirror (one of many)
* [ANNOUNCE] git-sizer: compute various size-related metrics for your Git repository
@ 2018-03-16 15:28 Michael Haggerty
  2018-03-16 20:01 ` Ævar Arnfjörð Bjarmason
  2018-03-21 16:02 ` Johannes Schindelin
  0 siblings, 2 replies; 5+ messages in thread
From: Michael Haggerty @ 2018-03-16 15:28 UTC (permalink / raw)
  To: Git Mailing List; +Cc: Lars Schneider

What makes a Git repository unwieldy to work with and host? It turns
out that the repository's on-disk size in gigabytes is only part of
the story. From our experience at GitHub, repositories cause problems
because of poor internal layout at least as often as because of their
overall size. For example,

* blobs or trees that are too large
* large blobs that are modified frequently (e.g., database dumps)
* large trees that are modified frequently
* trees that expand to unreasonable size when checked out (e.g., "Git
bombs" [2])
* too many tiny Git objects
* too many references
* other oddities, such as giant octopus merges, super long reference
names or file paths, huge commit messages, etc.

`git-sizer` [1] is a new open-source tool that computes various
size-related statistics for a Git repository and points out those that
are likely to cause problems or inconvenience to its users.

I tried to make the output of `git-sizer` "opinionated" and easy to
interpret. Example output for the Linux kernel is appended below. I
also made it memory-efficient and resistant against git bombs.

I've written a blog post [3] about `git-sizer` with more explanation
and examples, and the main project page [1] has a long README with
some information about what the individual metrics mean and tips for
fixing problems.

I also put quite a bit of effort into making `git-sizer` fast. It does
its work (including figuring out path names for large objects) based
on a single traversal of the repository history using `git rev-list
--objects --reverse [...]`, followed by using the output of `git
cat-file --batch` or `git cat-file --batch-check` to get information
about individual objects.
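
As a rough illustration of the second step, here is a minimal Go sketch that
parses one line of `git cat-file --batch-check` output in its default format,
`<oid> <type> <size>`. The names `ObjectInfo` and `parseBatchCheckLine` are
made up for this example and are not taken from the `git-sizer` source; the
sample lines use the well-known IDs of the empty blob and the empty tree.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ObjectInfo holds one record of `git cat-file --batch-check` output.
// The type and field names are illustrative, not taken from git-sizer.
type ObjectInfo struct {
	OID  string
	Type string
	Size uint64
}

// parseBatchCheckLine parses a line in the default --batch-check format,
// "<oid> <type> <size>". Git reports an object it cannot find as
// "<name> missing"; that case is returned as an error here.
func parseBatchCheckLine(line string) (ObjectInfo, error) {
	fields := strings.Fields(line)
	if len(fields) == 2 && fields[1] == "missing" {
		return ObjectInfo{}, fmt.Errorf("object %s is missing", fields[0])
	}
	if len(fields) != 3 {
		return ObjectInfo{}, fmt.Errorf("unexpected line: %q", line)
	}
	size, err := strconv.ParseUint(fields[2], 10, 64)
	if err != nil {
		return ObjectInfo{}, fmt.Errorf("bad size in %q: %w", line, err)
	}
	return ObjectInfo{OID: fields[0], Type: fields[1], Size: size}, nil
}

func main() {
	// Sample input: the empty blob and the empty tree, both of size 0.
	sample := []string{
		"e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 blob 0",
		"4b825dc642cb6eb9a060e54bf8d69288fbee4904 tree 0",
	}
	for _, line := range sample {
		info, err := parseBatchCheckLine(line)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s %s %d bytes\n", info.OID[:7], info.Type, info.Size)
	}
}
```

In a real pipeline the lines would come from the standard output of a
long-running `git cat-file --batch-check` process rather than from a slice of
strings.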

On that subject, let me share some more technical details. `git-sizer`
is written in Go. I prototyped several ways of extracting object
information, which is critical to the performance because `git-sizer`
has to read all of the reachable non-blob objects in the repository.
The results surprised me:

| Mechanism for accessing Git data                    | Time   |
| --------------------------------------------------- | -----: |
| `libgit2/git2go`                                    | 25.5 s |
| `libgit2/git2go` with `ManagedTree` optimization    | 18.9 s |
| `src-d/go-git`                                      | 63.0 s |
| Git command line client                             |  6.6 s |

It was almost a factor of four faster to read and parse the output of
Git plumbing commands (mainly `git for-each-ref`, `git rev-list
--objects`, `git cat-file --batch-check`, and `git cat-file --batch`)
than it was to use the Go bindings to libgit2. (I expect that part of
the reason is that Go's peculiar stack layout makes it quite expensive
to call out to C.) Even after Carlos Martin implemented an
experimental `ManagedTree` optimization that removed the need to call
C for every entry in a tree, it was still not competitive with the Git
CLI. `go-git`, which is a Git implementation in pure Go, was even
slower. So the final version of `git-sizer` calls `git` for accessing
the repository.
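
For reference, the ratios implied by the table can be computed directly. This
small Go sketch only redoes the arithmetic on the four measured times and adds
no new measurements; the helper name `slowdown` is invented for the example.

```go
package main

import "fmt"

// timing pairs a mechanism from the table above with its measured time.
type timing struct {
	mechanism string
	seconds   float64
}

// slowdown returns how many times slower t is than the baseline.
func slowdown(t, baseline float64) float64 {
	return t / baseline
}

func main() {
	const gitCLI = 6.6 // seconds, Git command line client
	others := []timing{
		{"libgit2/git2go", 25.5},
		{"libgit2/git2go with ManagedTree", 18.9},
		{"src-d/go-git", 63.0},
	}
	for _, t := range others {
		fmt.Printf("%-32s %.1fx slower than the Git CLI\n",
			t.mechanism, slowdown(t.seconds, gitCLI))
	}
}
```

Plain `git2go` comes out at roughly 3.9 times the CLI time (the "almost a
factor of four"), the `ManagedTree` variant at roughly 2.9 times, and `go-git`
at roughly 9.5 times.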

Feedback is welcome, including about the weightings [4] that I use to
compute the "level of concern" of the various metrics.

Have fun,
Michael

[1] https://github.com/github/git-sizer
[2] https://kate.io/blog/git-bomb/
[3] https://blog.github.com/2018-03-05-measuring-the-many-sizes-of-a-git-repository/
[4] https://github.com/github/git-sizer/blob/2e9a30f241ac357f2af01d42f0dd51fbbbae4b0b/sizes/output.go#L330-L401

$ git-sizer --verbose
Processing blobs: 1652370
Processing trees: 3396199
Processing commits: 722647
Matching commits to trees: 722647
Processing annotated tags: 534
Processing references: 539
| Name                         | Value     | Level of concern               |
| ---------------------------- | --------- | ------------------------------ |
| Overall repository size      |           |                                |
| * Commits                    |           |                                |
|   * Count                    |   723 k   | *                              |
|   * Total size               |   525 MiB | **                             |
| * Trees                      |           |                                |
|   * Count                    |  3.40 M   | **                             |
|   * Total size               |  9.00 GiB | ****                           |
|   * Total tree entries       |   264 M   | *****                          |
| * Blobs                      |           |                                |
|   * Count                    |  1.65 M   | *                              |
|   * Total size               |  55.8 GiB | *****                          |
| * Annotated tags             |           |                                |
|   * Count                    |   534     |                                |
| * References                 |           |                                |
|   * Count                    |   539     |                                |
|                              |           |                                |
| Biggest objects              |           |                                |
| * Commits                    |           |                                |
|   * Maximum size         [1] |  72.7 KiB | *                              |
|   * Maximum parents      [2] |    66     | ******                         |
| * Trees                      |           |                                |
|   * Maximum entries      [3] |  1.68 k   |                                |
| * Blobs                      |           |                                |
|   * Maximum size         [4] |  13.5 MiB | *                              |
|                              |           |                                |
| History structure            |           |                                |
| * Maximum history depth      |   136 k   |                                |
| * Maximum tag depth      [5] |     1     | *                              |
|                              |           |                                |
| Biggest checkouts            |           |                                |
| * Number of directories  [6] |  4.38 k   | **                             |
| * Maximum path depth     [7] |    13     | *                              |
| * Maximum path length    [8] |   134 B   | *                              |
| * Number of files        [9] |  62.3 k   | *                              |
| * Total size of files    [9] |   747 MiB |                                |
| * Number of symlinks    [10] |    40     |                                |
| * Number of submodules       |     0     |                                |

[1]  91cc53b0c78596a73fa708cceb7313e7168bb146
[2]  2cde51fbd0f310c8a2c5f977e665c0ac3945b46d
[3]  4f86eed5893207aca2c2da86b35b38f2e1ec1fc8 (refs/heads/master:arch/arm/boot/dts)
[4]  a02b6794337286bc12c907c33d5d75537c240bd0 (refs/heads/master:drivers/gpu/drm/amd/include/asic_reg/vega10/NBIO/nbio_6_1_sh_mask.h)
[5]  5dc01c595e6c6ec9ccda4f6f69c131c0dd945f8c (refs/tags/v2.6.11)
[6]  1459754b9d9acc2ffac8525bed6691e15913c6e2 (589b754df3f37ca0a1f96fccde7f91c59266f38a^{tree})
[7]  78a269635e76ed927e17d7883f2d90313570fdbc (dae09011115133666e47c35673c0564b0a702db7^{tree})
[8]  ce5f2e31d3bdc1186041fdfd27a5ac96e728f2c5 (refs/heads/master^{tree})
[9]  532bdadc08402b7a72a4b45a2e02e5c710b7d626 (e9ef1fe312b533592e39cddc1327463c30b0ed8d^{tree})
[10] f29a5ea76884ac37e1197bef1941f62fda3f7b99 (f5308d1b83eba20e69df5e0926ba7257c8dd9074^{tree})


* Re: [ANNOUNCE] git-sizer: compute various size-related metrics for your Git repository
  2018-03-16 15:28 [ANNOUNCE] git-sizer: compute various size-related metrics for your Git repository Michael Haggerty
@ 2018-03-16 20:01 ` Ævar Arnfjörð Bjarmason
  2018-03-16 21:29   ` Jeff King
  2018-03-21 16:02 ` Johannes Schindelin
  1 sibling, 1 reply; 5+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-03-16 20:01 UTC (permalink / raw)
  To: mhagger; +Cc: Git Mailing List, Lars Schneider


On Fri, Mar 16 2018, Michael Haggerty jotted:

> What makes a Git repository unwieldy to work with and host? It turns
> out that the repository's on-disk size in gigabytes is only part of
> the story. From our experience at GitHub, repositories cause problems
> because of poor internal layout at least as often as because of their
> overall size. For example,
>
> * blobs or trees that are too large
> * large blobs that are modified frequently (e.g., database dumps)
> * large trees that are modified frequently
> * trees that expand to unreasonable size when checked out (e.g., "Git
> bombs" [2])
> * too many tiny Git objects
> * too many references
> * other oddities, such as giant octopus merges, super long reference
> names or file paths, huge commit messages, etc.
>
> `git-sizer` [1] is a new open-source tool that computes various
> size-related statistics for a Git repository and points out those that
> are likely to cause problems or inconvenience to its users.

This is a very useful tool. I've been using it to get insight into some
bad repositories.

One suggestion for something to add (I don't have the time or the Go
tuits to do it myself):

One thing that can make repositories very pathological is if the ratio
of trees to commits is too low.

I was dealing with a repo the other day that had several thousand files
all in the same root directory, and no subdirectories.

This meant that doing `git log -- <file>` was very expensive. I wrote a
bit about this on this related ticket the other day:
https://gitlab.com/gitlab-org/gitlab-ce/issues/42104#note_54933512

But it's not something where you can just say having more trees is
better, because at the other end of the spectrum we can imagine a repo
like linux.git in which each file such as COPYING instead lives at
C/O/P/Y/I/N/G, and that would also be pathological.

It would be very interesting to do some tests to see what the optimal
value would be.

I also suspect it's not really about the commit / tree ratio, but about
having a reasonable number of nested trees per file, *and* having
changes to them reasonably spread out. I.e., it doesn't help to have a
doc/ and a src/ directory if 99% of your commits change src/ and you're
running 'git log -- src/something.c'.

Which is all a very long-winded way of saying that I don't know what the
general rule is, though I have some suspicions; having all your files
in the root is definitely bad.


* Re: [ANNOUNCE] git-sizer: compute various size-related metrics for your Git repository
  2018-03-16 20:01 ` Ævar Arnfjörð Bjarmason
@ 2018-03-16 21:29   ` Jeff King
  2018-03-18 19:06     ` Michael Haggerty
  0 siblings, 1 reply; 5+ messages in thread
From: Jeff King @ 2018-03-16 21:29 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: mhagger, Git Mailing List, Lars Schneider

On Fri, Mar 16, 2018 at 09:01:42PM +0100, Ævar Arnfjörð Bjarmason wrote:

> One suggestion for something to add (I don't have the time or the Go
> tuits to do it myself):
> 
> One thing that can make repositories very pathological is if the ratio
> of trees to commits is too low.
> 
> I was dealing with a repo the other day that had several thousand files
> all in the same root directory, and no subdirectories.

We've definitely run into this problem before (CocoaPods/Specs, for
example). The metric that would hopefully show this off is "what is the
tree object with the most entries". Or possibly "what is the average
number of entries in a tree object".

That's not the _whole_ story, because the really pathological case is
when you then touch that giant tree a lot. But if you assume the paths
touched by commits are reasonably distributed over the tree, then having
a huge number of entries in one tree will also mean that more commits
will touch that tree. Sort of a vaguely quadratic problem.

Doing it at the root is obviously the worst case, but the same thing can
happen if you have "foo/bar" as a huge tree, and every single commit
needs to touch some variant of "foo/bar/baz".

That's why I suspect some "average per tree object" may show this type
of problem, because you'd have a lot of near-identical copies of that
giant tree if it's being modified a lot.

> But it's not something where you can just say having more trees is
> better, because on the other end of the spectrum we can imagine a repo
> like linux.git where each file like COPYING instead exists at
> C/O/P/Y/I/N/G, that would also be pathological.
> 
> It would be very interesting to do some tests to see what the optimal
> value would be.

I suspect there's some math that could give us the solution. You want
approximately equal-sized trees, so maybe log(N) levels?
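
The log(N) guess can be sanity-checked under a deliberately crude model: N
files sit at the leaves of a balanced hierarchy with d levels of trees, every
tree holds N^(1/d) entries, and each commit changes exactly one file, so d
trees of N^(1/d) entries each get rewritten. Minimizing d * N^(1/d) over d
gives d = ln N. The Go sketch below evaluates that model. The function names,
the cost measure, and the use of the 62.3 k file count from the Linux output
are all assumptions of the example, and the model ignores delta compression,
uneven directory sizes, and commits that touch many files.

```go
package main

import (
	"fmt"
	"math"
)

// entriesPerChange estimates how many tree entries must be rewritten when
// one file changes, in an idealized repository where n files sit at the
// bottom of a balanced hierarchy with `levels` levels of trees and every
// tree has the same number of entries, n^(1/levels).
func entriesPerChange(n float64, levels int) float64 {
	fanout := math.Pow(n, 1/float64(levels))
	return float64(levels) * fanout
}

// bestLevels returns the number of levels in 1..maxLevels that minimizes
// entriesPerChange for n files.
func bestLevels(n float64, maxLevels int) int {
	best := 1
	for l := 2; l <= maxLevels; l++ {
		if entriesPerChange(n, l) < entriesPerChange(n, best) {
			best = l
		}
	}
	return best
}

func main() {
	// Roughly the file count of the biggest Linux checkout reported above.
	const files = 62300
	for _, l := range []int{1, 2, 3, 5, 8, 11, 16} {
		fmt.Printf("levels=%2d  entries rewritten per change ~ %.0f\n",
			l, entriesPerChange(files, l))
	}
	fmt.Println("best number of levels:", bestLevels(files, 32))
	fmt.Printf("ln(files) = %.1f\n", math.Log(files))
}
```

Under this model the minimum for 62,300 files falls at 11 levels, matching
ln(62300) of about 11.0, but most of the benefit arrives early: going from one
flat directory to three levels already cuts the count from 62,300 entries to
roughly 120.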

-Peff


* Re: [ANNOUNCE] git-sizer: compute various size-related metrics for your Git repository
  2018-03-16 21:29   ` Jeff King
@ 2018-03-18 19:06     ` Michael Haggerty
  0 siblings, 0 replies; 5+ messages in thread
From: Michael Haggerty @ 2018-03-18 19:06 UTC (permalink / raw)
  To: Jeff King
  Cc: Ævar Arnfjörð Bjarmason, Git Mailing List, Lars Schneider

On Fri, Mar 16, 2018 at 10:29 PM, Jeff King <peff@peff.net> wrote:
> On Fri, Mar 16, 2018 at 09:01:42PM +0100, Ævar Arnfjörð Bjarmason wrote:
>> One thing that can make repositories very pathological is if the ratio
>> of trees to commits is too low.
>>
>> I was dealing with a repo the other day that had several thousand files
>> all in the same root directory, and no subdirectories.
>
> We've definitely run into this problem before (CocoaPods/Specs, for
> example). The metric that would hopefully show this off is "what is the
> tree object with the most entries". Or possibly "what is the average
> number of entries in a tree object".

I find that the best metric for determining this sort of problem is
"Overall repository size -> Trees -> Total tree entries". If you have
a big directory that is being changed frequently, the *real* problem
is that every commit has to rewrite the whole tree, with all of its
many entries. So "Total tree entries" (or equivalently, the total tree
size) skyrockets. And this means that a history traversal has to
*expand* all of those trees again. So a repository that is problematic
for this reason will have a very large number of tree entries.

If you want to detect a bad repository layout like this *before* it
becomes a problem, then something like "average tree entries per
commit" might be a good leading indicator of a problem.
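
The candidate metrics mentioned in this subthread (trees per commit, average
entries per tree, and average tree entries per commit) can all be derived from
totals that `git-sizer` already reports. Here is a minimal Go sketch using the
Linux kernel numbers from the original announcement; the `repoStats` type and
its method names are invented for the example, and the tree entry total is the
rounded "264 M" figure, so the results are approximate.

```go
package main

import "fmt"

// repoStats holds the few totals needed for the ratios discussed in this
// thread. The struct and its field names are illustrative only.
type repoStats struct {
	commits     float64
	trees       float64
	treeEntries float64
}

// treesPerCommit is the ratio of distinct tree objects to commits.
func (s repoStats) treesPerCommit() float64 { return s.trees / s.commits }

// entriesPerTree is the average number of entries in a tree object.
func (s repoStats) entriesPerTree() float64 { return s.treeEntries / s.trees }

// treeEntriesPerCommit is the average number of tree entries per commit.
func (s repoStats) treeEntriesPerCommit() float64 { return s.treeEntries / s.commits }

func main() {
	// Counts from the git-sizer output for the Linux kernel earlier in the
	// thread; treeEntries is the rounded "264 M" value from the table.
	linux := repoStats{commits: 722647, trees: 3396199, treeEntries: 264e6}

	fmt.Printf("trees per commit:        %.1f\n", linux.treesPerCommit())
	fmt.Printf("entries per tree:        %.1f\n", linux.entriesPerTree())
	fmt.Printf("tree entries per commit: %.1f\n", linux.treeEntriesPerCommit())
}
```

For the kernel this works out to roughly 4.7 trees per commit, 78 entries per
tree, and 365 tree entries per commit. Nothing in this thread establishes what
values would count as healthy, so these are data points rather than
thresholds.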

Michael


* Re: [ANNOUNCE] git-sizer: compute various size-related metrics for your Git repository
  2018-03-16 15:28 [ANNOUNCE] git-sizer: compute various size-related metrics for your Git repository Michael Haggerty
  2018-03-16 20:01 ` Ævar Arnfjörð Bjarmason
@ 2018-03-21 16:02 ` Johannes Schindelin
  1 sibling, 0 replies; 5+ messages in thread
From: Johannes Schindelin @ 2018-03-21 16:02 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Git Mailing List, Lars Schneider

Hi Michael,

On Fri, 16 Mar 2018, Michael Haggerty wrote:

> What makes a Git repository unwieldy to work with and host? It turns
> out that the repository's on-disk size in gigabytes is only part of
> the story. From our experience at GitHub, repositories cause problems
> because of poor internal layout at least as often as because of their
> overall size. For example,
> 
> * blobs or trees that are too large
> * large blobs that are modified frequently (e.g., database dumps)
> * large trees that are modified frequently
> * trees that expand to unreasonable size when checked out (e.g., "Git
> bombs" [2])
> * too many tiny Git objects
> * too many references
> * other oddities, such as giant octopus merges, super long reference
> names or file paths, huge commit messages, etc.
> 
> `git-sizer` [1] is a new open-source tool that computes various
> size-related statistics for a Git repository and points out those that
> are likely to cause problems or inconvenience to its users.

Thank you very much for sharing this tool.

I packaged this as an MSYS2 package for use in Git for Windows' SDKs. You
can install it via

	pacman -Sy mingw-w64-x86_64-git-sizer

(Obviously, if you are in a 32-bit SDK, you want to replace x86_64 with i686.)

Note: I am simply re-bundling the binaries you post to the GitHub
releases; the main purpose is to make it easier for users to include this
in their custom installers.

Second note: I briefly considered including this tool in Git for Windows,
but it does increase the size of the installer by a full megabyte, and
therefore I decided to keep it as an SDK-only, optional package.

Thanks!
Dscho

