git@vger.kernel.org mailing list mirror (one of many)
* Git performance results on a large repository
@ 2012-02-03 14:20 Joshua Redstone
From: Joshua Redstone @ 2012-02-03 14:20 UTC (permalink / raw)
  To: git@vger.kernel.org

Hi Git folks,

We (Facebook) have been investigating source control systems to meet our
growing needs.  We already use git fairly widely, but have noticed it
getting slower as we grow, and we want to make sure we have a good story
going forward.  We're debating how to proceed and would like to solicit
people's thoughts.

To better understand git scalability, I've built up a large, synthetic
repository and measured a few git operations on it.  I summarize the
results here.

The test repo has 4 million commits, linear history and about 1.3 million
files.  The size of the .git directory is about 15GB, and has been
repacked with 'git repack -a -d -f --max-pack-size=10g --depth=100
--window=250'.  This repack took about 2 days on a beefy machine (i.e.,
lots of RAM and flash).  The size of the index file is 191 MB.  I can share
the script that generated it if people are interested.  It basically picks
2-5 files, modifies a line or two and adds a few lines at the end
consisting of random dictionary words, occasionally creates a new file,
commits all the modifications and repeats.
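
A rough sketch of the kind of generator loop described above, in Python.
This is not the script that built the test repo.  The word list, the
function names (`mutate_file`, `make_commit`) and the 10% chance of
creating a new file are all illustrative assumptions.

```python
import random
import subprocess
from pathlib import Path

# Stand-in for a dictionary file; the original word source is not specified.
WORDS = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot"]


def random_line(rng, n_words=6):
    """Return one line made of randomly chosen words."""
    return " ".join(rng.choice(WORDS) for _ in range(n_words))


def mutate_file(path, rng):
    """Rewrite one existing line (if any) and append one to three new lines."""
    lines = path.read_text().splitlines() if path.exists() else []
    if lines:
        lines[rng.randrange(len(lines))] = random_line(rng)
    lines.extend(random_line(rng) for _ in range(rng.randint(1, 3)))
    path.write_text("\n".join(lines) + "\n")


def make_commit(repo, rng, new_file_chance=0.1):
    """Modify 2-5 tracked files, occasionally add a new one, then commit."""
    repo = Path(repo)
    tracked = subprocess.run(
        ["git", "ls-files"], cwd=repo, check=True,
        capture_output=True, text=True,
    ).stdout.splitlines()
    picks = rng.sample(tracked, min(len(tracked), rng.randint(2, 5)))
    targets = [repo / name for name in picks]
    # Assumed probability; an empty repo always gets a first file.
    if not targets or rng.random() < new_file_chance:
        targets.append(repo / f"file_{rng.getrandbits(32):08x}.txt")
    for path in targets:
        mutate_file(path, rng)
    subprocess.run(["git", "add", "-A"], cwd=repo, check=True)
    subprocess.run(["git", "commit", "-q", "-m", "synthetic change"],
                   cwd=repo, check=True)
```

Calling `make_commit(repo, random.Random(seed))` in a loop yields a linear
history.  Listing every tracked file on each iteration would be far too
slow at the scale above, so a real generator would presumably keep the
file list in memory.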

I timed a few common operations with both a warm OS file cache and a cold
cache.  That is, I ran 'echo 3 | tee /proc/sys/vm/drop_caches' and then ran
the operation in question a few times (the first timing is the cold timing,
the next few are the warm timings).  The following results are on a server
with an average hard drive (i.e., not flash) and more than 10GB of RAM.
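
For anyone who wants to repeat the measurement, here is a minimal sketch
of the cold/warm procedure in Python.  It assumes Linux and root
privileges.  The function names are hypothetical, and the `sync` before
dropping caches is an addition that was not part of the procedure
described above.

```python
import subprocess
import time


def drop_caches():
    """Flush dirty pages, then ask Linux to drop its caches (needs root)."""
    subprocess.run(["sync"], check=True)  # addition, not in the original steps
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")


def time_command(argv, runs=4, cwd=None):
    """Run argv `runs` times and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(argv, cwd=cwd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings


def cold_and_warm(argv, runs=4, cwd=None):
    """Drop caches once, then return (cold_seconds, [warm_seconds, ...])."""
    drop_caches()
    timings = time_command(argv, runs=runs, cwd=cwd)
    return timings[0], timings[1:]
```

For example, `cold_and_warm(["git", "status"], cwd="/path/to/repo")`
returns the first (cold) timing and the list of later (warm) timings.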

'git status' :   39 minutes cold, and 24 seconds warm.

'git blame':   44 minutes cold, 11 minutes warm.

'git add' (appending a few chars to the end of a file and adding it):   7
seconds cold and 5 seconds warm.

'git commit -m "foo bar3" --no-verify --untracked-files=no --quiet
--no-status':  41 minutes cold, 20 seconds warm.  I also hacked a version
of git to remove the three or four places where 'git commit' stats every
file in the repo, and this dropped the times to 30 minutes cold and 8
seconds warm.
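
To get a feel for what one "stat every file" pass costs on a given
machine, a small Python sketch like the following can help.  It only
approximates the work (it walks the directory tree rather than the
index), and `stat_all` is a made-up name, not anything in git.

```python
import os
import time


def stat_all(root):
    """lstat() every file under root, skipping .git; return (count, seconds)."""
    count = 0
    start = time.perf_counter()
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune the repository metadata directory in place so os.walk skips it.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            os.lstat(os.path.join(dirpath, name))
            count += 1
    return count, time.perf_counter() - start
```

Running it once after dropping caches and once again right away gives a
rough cold and warm cost for a single pass over the working tree.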


The git performance we observed here is too slow for our needs.  So the
question becomes: if we want to keep using git going forward, what's the
best way to improve performance?  It seems clear we'll probably need some
specialized servers (e.g., to perform git-blame quickly) and maybe
specialized file system integration to detect what files have changed in a
working tree.

One way to get there is to do some deep code modifications to git
internals, to, for example, create some abstractions and interfaces that
allow plugging in the specialized servers.  Another way is to leave git
internals as they are and develop a layer of wrapper scripts around all
the git commands that do the necessary interfacing.  The wrapper scripts
seem perhaps easier in the short term, but may lead to increasing
divergence from how git behaves natively and add a layer of complexity.
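
To make the wrapper option concrete, here is a minimal Python sketch of
the routing such a script might do.  Everything in it is hypothetical:
the helper programs `fast-blame-client` and `fast-status-client` do not
exist, and the names only stand in for whatever specialized servers
would be built.

```python
import shutil
import subprocess
import sys

# Hypothetical mapping from git subcommand to a specialized helper program.
# Neither helper exists; the names only illustrate the routing idea.
SPECIALIZED = {
    "blame": "fast-blame-client",
    "status": "fast-status-client",
}


def resolve(argv, which=shutil.which):
    """Return the command line to run for `git <argv...>`."""
    if argv and argv[0] in SPECIALIZED:
        helper = which(SPECIALIZED[argv[0]])
        if helper is not None:
            return [helper, *argv[1:]]
    # Fall back to stock git for everything else, or if the helper is missing.
    return ["git", *argv]


def main():
    """Entry point for a wrapper installed ahead of git on PATH."""
    sys.exit(subprocess.run(resolve(sys.argv[1:])).returncode)
```

Even this small sketch shows the divergence risk: an invocation such as
`git -C dir blame file` starts with an option rather than the subcommand,
so it falls through to stock git unless the wrapper also learns git's
option parsing.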

Thoughts?

Cheers,
Josh


end of thread, other threads:[~2012-02-10 12:25 UTC | newest]

Thread overview: 34+ messages
2012-02-03 14:20 Git performance results on a large repository Joshua Redstone
2012-02-03 14:56 ` Ævar Arnfjörð Bjarmason
2012-02-03 17:00   ` Joshua Redstone
2012-02-03 22:40     ` Sam Vilain
2012-02-03 22:57       ` Sam Vilain
2012-02-07  1:19       ` Nguyen Thai Ngoc Duy
2012-02-03 23:05     ` Matt Graham
2012-02-04  1:25   ` Evgeny Sazhin
2012-02-03 23:35 ` Chris Lee
2012-02-04  0:01 ` Zeki Mokhtarzada
2012-02-04  5:07 ` Joey Hess
2012-02-04  6:53 ` Nguyen Thai Ngoc Duy
2012-02-04 18:05   ` Joshua Redstone
2012-02-05  3:47     ` Nguyen Thai Ngoc Duy
2012-02-06 15:40       ` Joey Hess
2012-02-07 13:43         ` Nguyen Thai Ngoc Duy
2012-02-09 21:06           ` Joshua Redstone
2012-02-10  7:12             ` Nguyen Thai Ngoc Duy
2012-02-10  9:39               ` Christian Couder
2012-02-10 12:24                 ` Nguyen Thai Ngoc Duy
2012-02-06  7:10     ` David Mohs
2012-02-06 16:23     ` Matt Graham
2012-02-06 20:50       ` Joshua Redstone
2012-02-06 21:07         ` Greg Troxel
2012-02-07  1:28         ` david
2012-02-06 21:17     ` Sam Vilain
2012-02-04 20:05   ` Joshua Redstone
2012-02-05 15:01   ` Tomas Carnecky
2012-02-05 15:17     ` Nguyen Thai Ngoc Duy
2012-02-04  8:57 ` slinky
2012-02-04 21:42 ` Greg Troxel
2012-02-05  4:30 ` david
2012-02-05 11:24   ` David Barr
2012-02-07  8:58 ` Emanuele Zattin
