git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: mhagger@alum.mit.edu
Cc: Git Mailing List <git@vger.kernel.org>,
	Lars Schneider <larsxschneider@gmail.com>
Subject: Re: [ANNOUNCE] git-sizer: compute various size-related metrics for your Git repository
Date: Fri, 16 Mar 2018 21:01:42 +0100	[thread overview]
Message-ID: <87370zeqmx.fsf@evledraar.gmail.com> (raw)
In-Reply-To: <CAMy9T_FaOdLP482YZcMX16mpy_EgM0ok1GKg45rE=X+HTGxSiQ@mail.gmail.com>


On Fri, Mar 16 2018, Michael Haggerty jotted:

> What makes a Git repository unwieldy to work with and host? It turns
> out that the respository's on-disk size in gigabytes is only part of
> the story. From our experience at GitHub, repositories cause problems
> because of poor internal layout at least as often as because of their
> overall size. For example,
>
> * blobs or trees that are too large
> * large blobs that are modified frequently (e.g., database dumps)
> * large trees that are modified frequently
> * trees that expand to unreasonable size when checked out (e.g., "Git
> bombs" [2])
> * too many tiny Git objects
> * too many references
> * other oddities, such as giant octopus merges, super long reference
> names or file paths, huge commit messages, etc.
>
> `git-sizer` [1] is a new open-source tool that computes various
> size-related statistics for a Git repository and points out those that
> are likely to cause problems or inconvenience to its users.

This is a very useful tool. I've been using it to get insight into some
bad repositories.

Suggestion for a thing to add to it, I don't have the time on the Go
tuits:

One thing that can make repositories very pathological is if the ratio
of trees to commits is too low.

I was dealing with a repo the other day that had several thousand files
all in the same root directory, and no subdirectories.

This meant that doing `git log -- <file>` was very expensive. I wrote a
bit about this on this related ticket the other day:
https://gitlab.com/gitlab-org/gitlab-ce/issues/42104#note_54933512

But it's not something where you can just say having more trees is
better, because on the other end of the spectrume we can imagine a repo
like linux.git where each file like COPYING instead exists at
C/O/P/Y/I/N/G, that would also be pathological.

It would be very interesting to do some tests to see what the optimal
value would be.

I also suspect it's not really about the commit / tree ratio, but that
you have some reasonable amount of nested trees per file, *and* that
changes to them are reasonably spread out. I.e. it doesn't help if you
have a doc/ and a src/ directory if 99% of your commits change src/, and
if you're doing 'git log -- src/something.c'.

Which is all a very long-winded way of saying that I don't know what the
general rule is, but I have some suspicions, but having all your files
in the root is definitely bad.

  reply	other threads:[~2018-03-16 20:02 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-03-16 15:28 [ANNOUNCE] git-sizer: compute various size-related metrics for your Git repository Michael Haggerty
2018-03-16 20:01 ` Ævar Arnfjörð Bjarmason [this message]
2018-03-16 21:29   ` Jeff King
2018-03-18 19:06     ` Michael Haggerty
2018-03-21 16:02 ` Johannes Schindelin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87370zeqmx.fsf@evledraar.gmail.com \
    --to=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=larsxschneider@gmail.com \
    --cc=mhagger@alum.mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).