From: Michael Haggerty <mhagger@alum.mit.edu>
To: Git Mailing List <git@vger.kernel.org>
Cc: Lars Schneider <larsxschneider@gmail.com>
Subject: [ANNOUNCE] git-sizer: compute various size-related metrics for your Git repository
Date: Fri, 16 Mar 2018 16:28:22 +0100
Message-ID: <CAMy9T_FaOdLP482YZcMX16mpy_EgM0ok1GKg45rE=X+HTGxSiQ@mail.gmail.com>

What makes a Git repository unwieldy to work with and host? It turns
out that the repository's on-disk size in gigabytes is only part of
the story. From our experience at GitHub, repositories cause problems
because of poor internal layout at least as often as because of their
overall size. For example,

* blobs or trees that are too large
* large blobs that are modified frequently (e.g., database dumps)
* large trees that are modified frequently
* trees that expand to unreasonable size when checked out (e.g., "Git
bombs" [2])
* too many tiny Git objects
* too many references
* other oddities, such as giant octopus merges, super long reference
names or file paths, huge commit messages, etc.

`git-sizer` [1] is a new open-source tool that computes various
size-related statistics for a Git repository and points out those that
are likely to cause problems or inconvenience to its users.

I tried to make the output of `git-sizer` "opinionated" and easy to
interpret. Example output for the Linux kernel is appended below. I
also made it memory-efficient and resistant to Git bombs.

I've written a blog post [3] about `git-sizer` with more explanation
and examples, and the main project page [1] has a long README with
some information about what the individual metrics mean and tips for
fixing problems.

I also put quite a bit of effort into making `git-sizer` fast. It does
its work (including figuring out path names for large objects) based
on a single traversal of the repository history using `git rev-list
--objects --reverse [...]`, followed by using the output of `git
cat-file --batch` or `git cat-file --batch-check` to get information
about individual objects.

On that subject, let me share some more technical details. `git-sizer`
is written in Go. I prototyped several ways of extracting object
information, which is critical to performance because `git-sizer`
has to read all of the reachable non-blob objects in the repository.
The results surprised me:

| Mechanism for accessing Git data                    | Time   |
| --------------------------------------------------- | -----: |
| `libgit2/git2go`                                    | 25.5 s |
| `libgit2/git2go` with `ManagedTree` optimization    | 18.9 s |
| `src-d/go-git`                                      | 63.0 s |
| Git command line client                             |  6.6 s |

It was almost a factor of four faster to read and parse the output of
Git plumbing commands (mainly `git for-each-ref`, `git rev-list
--objects`, `git cat-file --batch-check`, and `git cat-file --batch`)
than it was to use the Go bindings to libgit2. (I expect that part of
the reason is that Go's peculiar stack layout makes it quite expensive
to call out to C.) Even after Carlos Martin implemented an
experimental `ManagedTree` optimization that removed the need to call
C for every entry in a tree, it was still not competitive with the Git
CLI. `go-git`, which is a Git implementation in pure Go, was even
slower. So the final version of `git-sizer` calls `git` for accessing
the repository.

Feedback is welcome, including about the weightings [4] that I use to
compute the "level of concern" of the various metrics.

Have fun,
Michael

[1] https://github.com/github/git-sizer
[2] https://kate.io/blog/git-bomb/
[3] https://blog.github.com/2018-03-05-measuring-the-many-sizes-of-a-git-repository/
[4] https://github.com/github/git-sizer/blob/2e9a30f241ac357f2af01d42f0dd51fbbbae4b0b/sizes/output.go#L330-L401

$ git-sizer --verbose
Processing blobs: 1652370
Processing trees: 3396199
Processing commits: 722647
Matching commits to trees: 722647
Processing annotated tags: 534
Processing references: 539
| Name                         | Value     | Level of concern               |
| ---------------------------- | --------- | ------------------------------ |
| Overall repository size      |           |                                |
| * Commits                    |           |                                |
|   * Count                    |   723 k   | *                              |
|   * Total size               |   525 MiB | **                             |
| * Trees                      |           |                                |
|   * Count                    |  3.40 M   | **                             |
|   * Total size               |  9.00 GiB | ****                           |
|   * Total tree entries       |   264 M   | *****                          |
| * Blobs                      |           |                                |
|   * Count                    |  1.65 M   | *                              |
|   * Total size               |  55.8 GiB | *****                          |
| * Annotated tags             |           |                                |
|   * Count                    |   534     |                                |
| * References                 |           |                                |
|   * Count                    |   539     |                                |
|                              |           |                                |
| Biggest objects              |           |                                |
| * Commits                    |           |                                |
|   * Maximum size         [1] |  72.7 KiB | *                              |
|   * Maximum parents      [2] |    66     | ******                         |
| * Trees                      |           |                                |
|   * Maximum entries      [3] |  1.68 k   |                                |
| * Blobs                      |           |                                |
|   * Maximum size         [4] |  13.5 MiB | *                              |
|                              |           |                                |
| History structure            |           |                                |
| * Maximum history depth      |   136 k   |                                |
| * Maximum tag depth      [5] |     1     | *                              |
|                              |           |                                |
| Biggest checkouts            |           |                                |
| * Number of directories  [6] |  4.38 k   | **                             |
| * Maximum path depth     [7] |    13     | *                              |
| * Maximum path length    [8] |   134 B   | *                              |
| * Number of files        [9] |  62.3 k   | *                              |
| * Total size of files    [9] |   747 MiB |                                |
| * Number of symlinks    [10] |    40     |                                |
| * Number of submodules       |     0     |                                |

[1]  91cc53b0c78596a73fa708cceb7313e7168bb146
[2]  2cde51fbd0f310c8a2c5f977e665c0ac3945b46d
[3]  4f86eed5893207aca2c2da86b35b38f2e1ec1fc8 (refs/heads/master:arch/arm/boot/dts)
[4]  a02b6794337286bc12c907c33d5d75537c240bd0 (refs/heads/master:drivers/gpu/drm/amd/include/asic_reg/vega10/NBIO/nbio_6_1_sh_mask.h)
[5]  5dc01c595e6c6ec9ccda4f6f69c131c0dd945f8c (refs/tags/v2.6.11)
[6]  1459754b9d9acc2ffac8525bed6691e15913c6e2 (589b754df3f37ca0a1f96fccde7f91c59266f38a^{tree})
[7]  78a269635e76ed927e17d7883f2d90313570fdbc (dae09011115133666e47c35673c0564b0a702db7^{tree})
[8]  ce5f2e31d3bdc1186041fdfd27a5ac96e728f2c5 (refs/heads/master^{tree})
[9]  532bdadc08402b7a72a4b45a2e02e5c710b7d626 (e9ef1fe312b533592e39cddc1327463c30b0ed8d^{tree})
[10] f29a5ea76884ac37e1197bef1941f62fda3f7b99 (f5308d1b83eba20e69df5e0926ba7257c8dd9074^{tree})
