Git performance results on a large repository

git@vger.kernel.org mailing list mirror (one of many)

* Git performance results on a large repository
@ 2012-02-03 14:20 Joshua Redstone
  2012-02-03 14:56 ` Ævar Arnfjörð Bjarmason
                   ` (8 more replies)
  0 siblings, 9 replies; 34+ messages in thread
From: Joshua Redstone @ 2012-02-03 14:20 UTC (permalink / raw)
  To: git@vger.kernel.org

Hi Git folks,

We (Facebook) have been investigating source control systems to meet our
growing needs.  We already use git fairly widely, but have noticed it
getting slower as we grow, and we want to make sure we have a good story
going forward.  We're debating how to proceed and would like to solicit
people's thoughts.

To better understand git scalability, I've built up a large, synthetic
repository and measured a few git operations on it.  I summarize the
results here.

The test repo has 4 million commits, linear history and about 1.3 million
files.  The size of the .git directory is about 15 GB, and it has been
repacked with 'git repack -a -d -f --max-pack-size=10g --depth=100
--window=250'.  This repack took about 2 days on a beefy machine (i.e.,
lots of RAM and flash).  The size of the index file is 191 MB.  I can share
the script that generated it if people are interested.  It basically picks
2-5 files, modifies a line or two and adds a few lines at the end
consisting of random dictionary words, occasionally creates a new file,
commits all the modifications and repeats.
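
The generator script itself was not posted to the list.  A minimal
sketch of one iteration of such a loop is given here.  The word list
path, the use of GNU 'shuf' and the function names are all assumptions,
and this version only appends lines and occasionally adds a file; it
does not modify existing lines.

```shell
#!/bin/sh
# Hypothetical sketch of the commit generator described above; this is
# NOT the script that was actually used. Run inside a git working tree.

WORDS=${WORDS:-/usr/share/dict/words}   # assumed word list location

# random_words N: print N random dictionary words on one line.
random_words() {
    shuf -n "$1" "$WORDS" | tr '\n' ' '
    echo
}

# synthetic_commit: append to 2-5 tracked files, sometimes add a new
# file, then commit everything.
synthetic_commit() {
    count=$(shuf -i 2-5 -n 1)
    git ls-files | shuf -n "$count" | while IFS= read -r f; do
        # Append a few lines of random words to the end of the file.
        for i in 1 2 3; do random_words 8 >> "$f"; done
    done
    # Roughly one time in ten, also create a brand-new file.
    if [ "$(shuf -i 1-10 -n 1)" -eq 1 ]; then
        new="new-$(date +%s)-$$.txt"
        random_words 8 > "$new"
        git add "$new"
    fi
    git commit -a -q -m "synthetic change: $(random_words 3)"
}
```

Calling 'synthetic_commit' in a loop grows a linear history.  For
millions of commits, feeding 'git fast-import' is likely to be much
faster than running 'git commit' each time.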

I timed a few common operations with both a warm OS file cache and a cold
cache.  That is, I ran 'echo 3 | tee /proc/sys/vm/drop_caches' and then ran
the operation in question a few times (the first timing is the cold timing,
the next few are the warm timings).  The following results are from a server
with an average hard drive (i.e., not flash) and more than 10 GB of RAM.
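
A minimal sketch of a cold/warm timing harness along these lines.  The
function names are illustrative, it assumes Linux and GNU date, and
dropping caches needs root.

```shell
#!/bin/sh
# Sketch of a cold/warm timing harness (function names are illustrative).

# drop_caches: flush dirty pages, then drop the page cache, dentries and
# inodes. Linux-only and needs root; returns non-zero if it cannot write.
drop_caches() {
    sync
    [ -w /proc/sys/vm/drop_caches ] && echo 3 > /proc/sys/vm/drop_caches
}

# time_cmd LABEL CMD...: run CMD once and print "LABEL <elapsed> ms".
# Relies on GNU date's %N (nanoseconds).
time_cmd() {
    label=$1; shift
    start=$(date +%s%N)
    "$@" > /dev/null 2>&1
    end=$(date +%s%N)
    echo "$label $(( (end - start) / 1000000 )) ms"
}

# cold_warm RUNS CMD...: the first run after a cache drop is the cold
# timing, the remaining RUNS-1 runs are warm timings.
cold_warm() {
    runs=$1; shift
    drop_caches 2>/dev/null || echo "warning: caches not dropped" >&2
    time_cmd cold "$@"
    i=1
    while [ "$i" -lt "$runs" ]; do
        time_cmd "warm$i" "$@"
        i=$((i + 1))
    done
}
```

For example, 'cold_warm 4 git status' prints one cold and three warm
timings.  If the caches could not be dropped, the "cold" figure is not
really cold and should be discarded.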

'git status' :   39 minutes cold, and 24 seconds warm.

'git blame':   44 minutes cold, 11 minutes warm.

'git add' (appending a few chars to the end of a file and adding it):   7
seconds cold and 5 seconds warm.

'git commit -m "foo bar3" --no-verify --untracked-files=no --quiet
--no-status':  41 minutes cold, 20 seconds warm.  I also hacked a version
of git to remove the three or four places where 'git commit' stats every
file in the repo, and this dropped the times to 30 minutes cold and 8
seconds warm.

The git performance we observed here is too slow for our needs.  So the
question becomes: if we want to keep using git going forward, what's the
best way to improve performance?  It seems clear we'll probably need some
specialized servers (e.g., to perform git-blame quickly) and maybe
specialized file system integration to detect what files have changed in a
working tree.

One way to get there is to do some deep code modifications to git
internals, to, for example, create some abstractions and interfaces that
allow plugging in the specialized servers.  Another way is to leave git
internals as they are and develop a layer of wrapper scripts around all
the git commands that do the necessary interfacing.  The wrapper scripts
seem perhaps easier in the short-term, but may lead to increasing
divergence from how git behaves natively and also a layer of complexity.

Thoughts?

Cheers,
Josh

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Git performance results on a large repository
  2012-02-03 14:20 Git performance results on a large repository Joshua Redstone
@ 2012-02-03 14:56 ` Ævar Arnfjörð Bjarmason
  2012-02-03 17:00   ` Joshua Redstone
  2012-02-04  1:25   ` Evgeny Sazhin
  2012-02-03 23:35 ` Chris Lee
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 34+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2012-02-03 14:56 UTC (permalink / raw)
  To: Joshua Redstone; +Cc: git@vger.kernel.org

On Fri, Feb 3, 2012 at 15:20, Joshua Redstone <joshua.redstone@fb.com> wrote:

> We (Facebook) have been investigating source control systems to meet our
> growing needs.  We already use git fairly widely, but have noticed it
> getting slower as we grow, and we want to make sure we have a good story
> going forward.  We're debating how to proceed and would like to solicit
> people's thoughts.

Where I work we also have a relatively large Git repository. Around
30k files, a couple of hundred thousand commits, clone size around
half a GB.

You haven't supplied background info on this but it really seems to me
like your testcase is converting something like a humongous Perforce
repository directly to Git.

While you /can/ do this, it's not a good idea. You should split up
repositories at the boundaries that code or data doesn't directly
cross, e.g. there's no reason why you need HipHop PHP in the same
repository as Cassandra or the Facebook chat system, is there?

While Git could do better with large repositories (in particular,
applying commits in interactive rebase seems to slow down on bigger
repositories), there's only so much you can do about stat-ing 1.3
million files.

A structure that would make more sense would be to split up that giant
repository into a lot of other repositories. Most of them probably
have no direct dependencies on other components, but even those that
do can sometimes just use some other repository as a submodule.

Even if you have the requirement that you'd like to roll out
*everything* at a certain point in time you can still solve that with
a super-repository that has all the other ones as submodules, and
creates a tag for every rollout or something like that.
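
A minimal sketch of that super-repository workflow.  The function
names, URLs, paths and the "rollout-" tag prefix are placeholders, not
anything used in practice.

```shell
#!/bin/sh
# Sketch of the super-repository layout; all names are placeholders.
# Run these inside the super-repository.

# add_component URL PATH: track another repository as a submodule.
add_component() {
    git submodule add "$1" "$2" &&
    git commit -q -m "Add $2 as a submodule"
}

# tag_rollout NAME: commit whatever submodule commits are currently
# checked out, then tag that state so the rollout can be reproduced.
tag_rollout() {
    git commit -q -a --allow-empty -m "Rollout $1" &&
    git tag -a "rollout-$1" -m "Rollout $1"
}
```

Checking out a rollout tag and running 'git submodule update --init
--recursive' should then restore every component to the commit that
was rolled out.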


* Re: Git performance results on a large repository
  2012-02-03 14:56 ` Ævar Arnfjörð Bjarmason
@ 2012-02-03 17:00   ` Joshua Redstone
  2012-02-03 22:40     ` Sam Vilain
  2012-02-03 23:05     ` Matt Graham
  2012-02-04  1:25   ` Evgeny Sazhin
  1 sibling, 2 replies; 34+ messages in thread
From: Joshua Redstone @ 2012-02-03 17:00 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: git@vger.kernel.org

Hi Ævar,


Thanks for the comments.  I've included a bunch more info on the test repo
below.  It is based on a growth model of two of our current repositories
(I.e., it's not a perforce import). We already have some of the easily
separable projects in separate repositories, like HPHP.   If we could
split our largest repos into multiple ones, that would help the scaling
issue.  However, the code in those repos is rather interdependent and we
believe it'd hurt more than help to split it up, at least for the
medium-term future.  We derive a fair amount of benefit from the code
sharing and keeping things together in a single repo, so it's not clear
when it'd make sense to get more aggressive splitting things up.

Some more information on the test repository:   The working directory is
9.5 GB, the median file size is 2 KB.  The average depth of a directory
(counting the number of '/'s) is 3.6 levels and the average depth of a
file is 4.6.  More detailed histograms of the repository composition are
below:

------------------------

Histogram of depth of every directory in the repo (dirs=`find . -type d` ;
(for dir in $dirs; do t=${dir//[^\/]/}; echo ${#t} ; done) |
~/tmp/histo.py)
* (The .git directory itself has only 161 files, so although it is
included, it doesn't affect the numbers significantly.)

[0.0 - 1.3): 271
[1.3 - 2.6): 9966
[2.6 - 3.9): 56595
[3.9 - 5.2): 230239
[5.2 - 6.5): 67394
[6.5 - 7.8): 22868
[7.8 - 9.1): 6568
[9.1 - 10.4): 420
[10.4 - 11.7): 45
[11.7 - 13.0]: 21
n=394387 mean=4.671830, median=5.000000, stddev=1.272658


Histogram of depth of every file in the repo (files=`git ls-files` ; (for
file in $files; do t=${file//[^\/]/}; echo ${#t} ; done) | ~/tmp/histo.py)
* 'git ls-files' does not prefix entries with ./ as the 'find' command
above does, which is why the average appears to be the same as in the
directory stats

[0.0 - 1.3): 1274
[1.3 - 2.6): 35353
[2.6 - 3.9): 196747
[3.9 - 5.2): 786647
[5.2 - 6.5): 225913
[6.5 - 7.8): 77667
[7.8 - 9.1): 22130
[9.1 - 10.4): 1599
[10.4 - 11.7): 164
[11.7 - 13.0]: 118
n=1347612 mean=4.655750, median=5.000000, stddev=1.278399
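
The histo.py helper was not posted.  As a rough stand-in, the depth
counts can be tallied with standard tools.  This is a sketch with
made-up function names, and it uses one bin per integer depth rather
than the bins shown above.

```shell
#!/bin/sh
# path_depths: read one path per line on stdin and print the number of
# '/' characters in each. Unlike word-splitting a file list in a for
# loop, this copes with spaces in file names.
path_depths() {
    awk -F/ '{ print NF - 1 }'
}

# depth_counts: tally how many paths fall at each depth, printed as
# "<depth> <count>" in increasing depth order.
depth_counts() {
    path_depths | sort -n | uniq -c | awk '{ print $2, $1 }'
}
```

Usage would be 'git ls-files | depth_counts' for files, or
'find . -type d | depth_counts' for directories (the latter counts the
leading ./ as one extra level).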


Histogram of file sizes (for first 50k files - this command takes a
while):  files=`git ls-files` ; (for file in $files; do stat -c%s $file ;
done) | ~/tmp/histo.py

[ 0.0 - 4.7): 0
[ 4.7 - 22.5): 2
[ 22.5 - 106.8): 0
[ 106.8 - 506.8): 0
[ 506.8 - 2404.7): 31142
[ 2404.7 - 11409.9): 17837
[ 11409.9 - 54137.1): 942
[ 54137.1 - 256866.9): 53
[ 256866.9 - 1218769.7): 18
[ 1218769.7 - 5782760.0]: 5
n=49999 mean=3590.953239, median=1772.000000, stddev=42835.330259

Cheers,
Josh







* Re: Git performance results on a large repository
  2012-02-03 17:00   ` Joshua Redstone
@ 2012-02-03 22:40     ` Sam Vilain
  2012-02-03 22:57       ` Sam Vilain
  2012-02-07  1:19       ` Nguyen Thai Ngoc Duy
  2012-02-03 23:05     ` Matt Graham
  1 sibling, 2 replies; 34+ messages in thread
From: Sam Vilain @ 2012-02-03 22:40 UTC (permalink / raw)
  To: Joshua Redstone
  Cc: Ævar Arnfjörð Bjarmason, git@vger.kernel.org

Joshua,

You have an interesting use case.

If I were you I'd consider investigating the git fast-import protocol.
It has become bi-directional, and is essentially socket access to a git
repository with read and transactional update capability.  With a few
more commands implemented, it may even be capable of providing all
functionality required for command-line git use.

It is already possible that the ".git" directory can be a file: this 
case is used for submodules in git 1.7.8 and higher.  For this use case, 
there would be an extra field to the ".git" file which is created.  It 
would indicate the hostname (and port) to connect its internal 
'fast-import' stream to.  'clone' would consist of creating this file, 
and then getting the server to stream the objects from its pack to the 
client.
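
For reference, the existing ".git"-as-a-file mechanism holds a single
"gitdir: <path>" line pointing at a local directory.  The extra
host/port field proposed above does not exist in git.  The sketch
below only shows the existing local form, and the function names are
illustrative.

```shell
#!/bin/sh
# make_gitfile_checkout WORKTREE GITDIR: create a working tree whose
# ".git" is a plain file pointing at a repository stored elsewhere.
make_gitfile_checkout() {
    git init -q --separate-git-dir="$2" "$1"
}

# gitfile_target WORKTREE: print the path that the ".git" file points to.
gitfile_target() {
    sed -n 's/^gitdir: //p' "$1/.git"
}
```

After 'make_gitfile_checkout wt /srv/store.git', git commands run
inside wt operate on the repository stored in /srv/store.git.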

With the hard-working part of git on the other end of a network service,
you could back it by a re-implementation of git which is written to be
distributed in Hadoop.  There are at least two similar implementations
of git that are like this: one for cassandra which was written by github
as a research project, and Google's implementation on top of their
BigTable/GFS/whatever.  As the git object storage model is write-only
and content-addressed, it should git this kind of scaling well.

There have also been designs at various times for sparse check-outs; ie
check-outs where you don't check out the root of the repository but a
sub-tree.

With both of these features, clients could easily check out a small part 
of the repository very quickly.  This is probably the only case which 
SVN still does better than git at, which is a particular blocker for use 
cases like repositories with large binaries in them and for projects 
such as the one you have (another one with a similar problem was KDE, 
where their projects moved around the repository a lot, and refactoring 
touched many projects simultaneously at times).

It's a large undertaking, alright.

Sam,
just another git community propeller-head.



* Re: Git performance results on a large repository
  2012-02-03 22:40     ` Sam Vilain
@ 2012-02-03 22:57       ` Sam Vilain
  2012-02-07  1:19       ` Nguyen Thai Ngoc Duy
  1 sibling, 0 replies; 34+ messages in thread
From: Sam Vilain @ 2012-02-03 22:57 UTC (permalink / raw)
  To: Joshua Redstone
  Cc: Ævar Arnfjörð Bjarmason, git@vger.kernel.org

On 2/3/12 2:40 PM, Sam Vilain wrote:
> As the git object storage model is write-only and content-addressed,
> it should git this kind of scaling well.
             ^^^

Could have sworn I typed 'suit' there.  My fingers have auto-correct ;-)

Sam


* Re: Git performance results on a large repository
  2012-02-03 22:40     ` Sam Vilain
  2012-02-03 22:57       ` Sam Vilain
@ 2012-02-07  1:19       ` Nguyen Thai Ngoc Duy
  1 sibling, 0 replies; 34+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-02-07  1:19 UTC (permalink / raw)
  To: Sam Vilain
  Cc: Joshua Redstone, Ævar Arnfjörð Bjarmason,
	git@vger.kernel.org

On Sat, Feb 4, 2012 at 5:40 AM, Sam Vilain <sam@vilain.net> wrote:
> There have also been designs at various times for sparse check-outs; ie
> check-outs where you don't check out the root of the repository but a
> sub-tree.

There is a sparse checkout feature in git (hopefully from one of the
designs you mentioned) and it can checkout subtrees. The only problem
in this case is it maintains full index. So it only solves half of the
problem (stat calls), reading/writing large index just slows
everything down.
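
A small experiment that shows this behaviour.  It is a sketch that
assumes the core.sparseCheckout setting together with a pattern file at
.git/info/sparse-checkout; the function names are illustrative.

```shell
#!/bin/sh
# sparse_to DIR: restrict the working tree to DIR/ using the
# core.sparseCheckout mechanism. Run from the top of a clean work tree.
sparse_to() {
    git config core.sparseCheckout true &&
    mkdir -p .git/info &&
    printf '%s/\n' "$1" > .git/info/sparse-checkout &&
    git read-tree -mu HEAD
}

# count_index_entries: paths recorded in the index, on disk or not.
count_index_entries() { git ls-files | wc -l; }

# count_skipped_entries: index entries marked skip-worktree, i.e. kept
# in the index but absent from the working tree ('S' in ls-files -v).
count_skipped_entries() { git ls-files -v | grep -c '^S'; }
```

After 'sparse_to some/dir', files outside some/dir disappear from the
working tree, but count_index_entries still reports every tracked
path, so the index is no smaller.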
-- 
Duy


* Re: Git performance results on a large repository
  2012-02-03 17:00   ` Joshua Redstone
  2012-02-03 22:40     ` Sam Vilain
@ 2012-02-03 23:05     ` Matt Graham
  1 sibling, 0 replies; 34+ messages in thread
From: Matt Graham @ 2012-02-03 23:05 UTC (permalink / raw)
  To: Joshua Redstone
  Cc: Ævar Arnfjörð Bjarmason, git@vger.kernel.org

Hi Josh,

On Fri, Feb 3, 2012 at 17:00, Joshua Redstone <joshua.redstone@fb.com> wrote:
> Thanks for the comments.  I've included a bunch more info on the test repo
> below.  It is based on a growth model of two of our current repositories
> (I.e., it's not a perforce import). We already have some of the easily
> separable projects in separate repositories, like HPHP.   If we could
> split our largest repos into multiple ones, that would help the scaling
> issue.  However, the code in those repos is rather interdependent and we
> believe it'd hurt more than help to split it up, at least for the
> medium-term future.  We derive a fair amount of benefit from the code
> sharing and keeping things together in a single repo, so it's not clear
> when it'd make sense to get more aggressive splitting things up.
>
> Some more information on the test repository:   The working directory is
> 9.5 GB, the median file size is 2 KB.  The average depth of a directory
> (counting the number of '/'s) is 3.6 levels and the average depth of a
> file is 4.6.  More detailed histograms of the repository composition is
> below:

Do you have a histogram of the types of files in the repo?
And as suggested earlier, is svn working for you now because it allows
sparse checkout?  I imagine the stats for svn on the full repo would
be comparable or worse to what you measured with git?


* Re: Git performance results on a large repository
  2012-02-03 14:56 ` Ævar Arnfjörð Bjarmason
  2012-02-03 17:00   ` Joshua Redstone
@ 2012-02-04  1:25   ` Evgeny Sazhin
  1 sibling, 0 replies; 34+ messages in thread
From: Evgeny Sazhin @ 2012-02-04  1:25 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Joshua Redstone, git@vger.kernel.org

 

On Feb 3, 2012, at 9:56 AM, Ævar Arnfjörð Bjarmason wrote:

> On Fri, Feb 3, 2012 at 15:20, Joshua Redstone <joshua.redstone@fb.com> wrote:
> 
>> We (Facebook) have been investigating source control systems to meet our
>> growing needs.  We already use git fairly widely, but have noticed it
>> getting slower as we grow, and we want to make sure we have a good story
>> going forward.  We're debating how to proceed and would like to solicit
>> people's thoughts.
> 
> Where I work we also have a relatively large Git repository. Around
> 30k files, a couple of hundred thousand commits, clone size around
> half a GB.
> 
> You haven't supplied background info on this but it really seems to me
> like your testcase is converting something like a humongous Perforce
> repository directly to Git.
> 
> While you /can/ do this it's not a good idea, you should split up
> repositories at the boundaries code or data doesn't directly cross
> over, e.g. there's no reason why you need HipHop PHP in the same
> repository as Cassandra or the Facebook chat system, is there?
> 
> While Git could better with large repositories (in particular applying
> commits in interactive rebase seems to be to slow down on bigger
> repositories) there's only so much you can do about stat-ing 1.3
> million files.
> 
> A structure that would make more sense would be to split up that giant
> repository into a lot of other repositories, most of them probably
> have no direct dependencies on other components, but even those that
> do can sometimes just use some other repository as a submodule.
> 
> Even if you have the requirement that you'd like to roll out
> *everything* at a certain point in time you can still solve that with
> a super-repository that has all the other ones as submodules, and
> creates a tag for every rollout or something like that.



I concur. I work at a company with many years of development history in several huge CVS repos, and we are slowly but surely migrating the codebase from CVS to Git.
Split things up. This will allow you to reorganize things better, and there are IMHO no downsides.
As for rollout, I think this job should be given to a build/release system that is able to gather the necessary code from the different repos and tag it properly.

just my 2 cents

Thanks,
Eugene


* Re: Git performance results on a large repository
  2012-02-03 14:20 Git performance results on a large repository Joshua Redstone
  2012-02-03 14:56 ` Ævar Arnfjörð Bjarmason
@ 2012-02-03 23:35 ` Chris Lee
  2012-02-04  0:01 ` Zeki Mokhtarzada
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 34+ messages in thread
From: Chris Lee @ 2012-02-03 23:35 UTC (permalink / raw)
  To: Joshua Redstone; +Cc: git@vger.kernel.org

On Fri, Feb 3, 2012 at 6:20 AM, Joshua Redstone <joshua.redstone@fb.com> wrote:
> [snip]
>
> The git performance we observed here is too slow for our needs.  So the
> question becomes, if we want to keep using git going forward, what's the
> best way to improve performance.  It seems clear we'll probably need some
> specialized servers (e.g., to perform git-blame quickly) and maybe
> specialized file system integration to detect what files have changed in a
> working tree.

Have you considered upgrading all of engineering to SSDs? 200+GB SSDs
are under $400USD nowadays.

-clee


* Re: Git performance results on a large repository
  2012-02-03 14:20 Git performance results on a large repository Joshua Redstone
  2012-02-03 14:56 ` Ævar Arnfjörð Bjarmason
  2012-02-03 23:35 ` Chris Lee
@ 2012-02-04  0:01 ` Zeki Mokhtarzada
  2012-02-04  5:07 ` Joey Hess
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 34+ messages in thread
From: Zeki Mokhtarzada @ 2012-02-04  0:01 UTC (permalink / raw)
  To: git

> The test repo has 4 million commits, linear history and about 1.3 million
> files.  The size of the .git directory is about 15GB, and has been
> repacked with 'git repack -a -d -f --max-pack-size=10g --depth=100
> --window=250'.  This repack took about 2 days on a beefy machine (I.e.,
> lots of ram and flash).  The size of the index file is 191 MB. I can share

Are you willing to give up all or part of your history in your working
repository?  I've heard of larger projects starting from scratch (i.e. copy all
of your files into a brand new repo.)  You can keep your old repo around for
archival purposes.  Also, how much of your repo is code, versus static assets. 
You could move all of your static assets (images, css, maybe some js?) into
another repo, and then merge the two repo's together at build time if you
absolutely need them deployed together.
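
A minimal sketch of the "start from scratch" option: import the current
tree as a single commit in a new repository and leave the old one
untouched.  The function name is made up for illustration.

```shell
#!/bin/sh
# snapshot_repo SRC DEST: create a brand-new repository at DEST whose
# single commit holds the files of SRC's HEAD. SRC is not modified and
# keeps the full history for archival purposes.
snapshot_repo() {
    src_head=$(git -C "$1" rev-parse HEAD) || return 1
    mkdir -p "$2" &&
    git -C "$1" archive HEAD | tar -xf - -C "$2" &&
    git -C "$2" init -q &&
    git -C "$2" add -A &&
    git -C "$2" commit -q -m "Import snapshot of $src_head"
}
```

Recording the old commit id in the import message keeps a pointer back
into the archived history.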

Here are a couple strategies for doing a partial truncate:

http://stackoverflow.com/questions/4515580/how-do-i-remove-the-old-history-from-a-git-repository
http://bogdan.org.ua/2011/03/28/how-to-truncate-git-history-sample-script-included.html

-Zeki


* Re: Git performance results on a large repository
  2012-02-03 14:20 Git performance results on a large repository Joshua Redstone
                   ` (2 preceding siblings ...)
  2012-02-04  0:01 ` Zeki Mokhtarzada
@ 2012-02-04  5:07 ` Joey Hess
  2012-02-04  6:53 ` Nguyen Thai Ngoc Duy
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 34+ messages in thread
From: Joey Hess @ 2012-02-04  5:07 UTC (permalink / raw)
  To: Joshua Redstone; +Cc: git@vger.kernel.org

Joshua Redstone wrote:
> The test repo has 4 million commits, linear history and about 1.3 million
> files.

Have you tried separating these two factors, to see how badly each is
affecting performance?

If the number of commits is the problem (seems likely for git blame at
least), a shallow clone would avoid that overhead.
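
A minimal sketch of such a shallow clone, with illustrative function
names.  Note that in a shallow clone 'git blame' can only attribute
lines as far back as the cut-off commit.

```shell
#!/bin/sh
# shallow_clone URL DEST [DEPTH]: clone only the most recent DEPTH
# commits (default 1). URL must not be a plain local path, because git
# ignores --depth for those; use file://, ssh:// or http(s):// instead.
shallow_clone() {
    git clone -q --depth "${3:-1}" "$1" "$2"
}

# history_length REPO: number of commits reachable from HEAD in REPO.
history_length() {
    git -C "$1" rev-list --count HEAD
}
```

Comparing 'history_length' on the full and the shallow clone, and
timing the same command in both, would help separate the cost of the
commit count from the cost of the file count.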

I think that git often writes .git/index inefficiently when staging
files (though your `git add` is pretty fast) and committing. It rewrites
the whole file to .git/index.lock and then renames it over .git/index at
the end. I have code that keeps a journal of changes to avoid rewriting
the index repeatedly, but it's application specific. Fixing git to write
the index more intelligently is something I'd like to see.
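
The write-then-rename pattern, reduced to a shell sketch.  This only
illustrates the general technique with a made-up function name; it is
not how git implements it internally.

```shell
#!/bin/sh
# atomic_write TARGET: write stdin to TARGET.lock, then rename it over
# TARGET. Readers see either the old or the new file, never a partial
# one, but the whole file is rewritten even for a one-entry change.
atomic_write() {
    lock="$1.lock"
    # noclobber (set -C) makes the redirection fail if the lock file
    # already exists, so two writers cannot both hold it.
    ( set -C; cat > "$lock" ) || return 1
    mv -f "$lock" "$1"
}
```

The cost is proportional to the size of the whole file rather than to
the size of the change, which is why a 191 MB index makes every
staging operation expensive.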

Hint for git status: `git status .` in a smaller subdirectory will be much
faster than the default that stats everything.

-- 
see shy jo


* Re: Git performance results on a large repository
  2012-02-03 14:20 Git performance results on a large repository Joshua Redstone
                   ` (3 preceding siblings ...)
  2012-02-04  5:07 ` Joey Hess
@ 2012-02-04  6:53 ` Nguyen Thai Ngoc Duy
  2012-02-04 18:05   ` Joshua Redstone
                     ` (2 more replies)
  2012-02-04  8:57 ` slinky
                   ` (3 subsequent siblings)
  8 siblings, 3 replies; 34+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-02-04  6:53 UTC (permalink / raw)
  To: Joshua Redstone; +Cc: git@vger.kernel.org

On Fri, Feb 3, 2012 at 9:20 PM, Joshua Redstone <joshua.redstone@fb.com> wrote:
> I timed a few common operations with both a warm OS file cache and a cold
> cache.  i.e., I did a 'echo 3 | tee /proc/sys/vm/drop_caches' and then did
> the operation in question a few times (first timing is the cold timing,
> the next few are the warm timings).  The following results are on a server
> with average hard drive (I.e., not flash)  and > 10GB of ram.
>
> 'git status' :   39 minutes cold, and 24 seconds warm.
>
> 'git blame':   44 minutes cold, 11 minutes warm.
>
> 'git add' (appending a few chars to the end of a file and adding it):   7
> seconds cold and 5 seconds warm.
>
> 'git commit -m "foo bar3" --no-verify --untracked-files=no --quiet
> --no-status':  41 minutes cold, 20 seconds warm.  I also hacked a version
> of git to remove the three or four places where 'git commit' stats every
> file in the repo, and this dropped the times to 30 minutes cold and 8
> seconds warm.

Have you tried "git update-index --assume-unchanged"? That should
reduce mass lstat() and hopefully improve the above numbers. The
interface is not exactly easy to use, but if it gives a significant
gain, then we can try to improve the UI.
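
A minimal sketch of how the bit could be applied to a whole tree.  The
function names are illustrative.  Once the bit is set, git no longer
notices edits to a file until the bit is cleared again.

```shell
#!/bin/sh
# assume_all_unchanged: set the assume-unchanged bit on every tracked
# file so git stops lstat()ing them. Git will NOT notice edits to these
# files until the bit is cleared.
assume_all_unchanged() {
    git ls-files -z | xargs -0 git update-index --assume-unchanged
}

# unmark FILE...: make git look at these files again, e.g. before
# editing them.
unmark() {
    git update-index --no-assume-unchanged "$@"
}

# list_assumed: show files that currently carry the bit. 'git ls-files
# -v' prints a lowercase status letter for them.
list_assumed() {
    git ls-files -v | grep '^[a-z]' | cut -c3-
}
```

A typical cycle would be 'assume_all_unchanged' once, then
'unmark path/to/file' before each edit, which is the usability problem
mentioned above.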

On the index size issue, ideally we should make minimal writes to the
index instead of rewriting a 191 MB index. An improvement we could make
now is to compress it, reducing the disk footprint and thus disk I/O.
If you compress the index with gzip, how big is it?
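
One way to answer that without touching the index itself is sketched
here, with made-up helper names.  The file is only read, never
modified.

```shell
#!/bin/sh
# raw_size FILE / gzip_size FILE: size in bytes before and after gzip.
# The file itself is left untouched.
raw_size()  { wc -c < "$1" | tr -d ' '; }
gzip_size() { gzip -c "$1" | wc -c | tr -d ' '; }

# report_compression FILE: print both sizes and the compressed size as
# a percentage of the original. FILE must not be empty.
report_compression() {
    raw=$(raw_size "$1")
    gz=$(gzip_size "$1")
    echo "raw=$raw gzip=$gz ratio=$(( gz * 100 / raw ))%"
}
```

Run as 'report_compression .git/index'.  Path names usually compress
well, but each index entry also carries a 20-byte SHA-1 that does not,
so the achievable ratio has to be measured rather than guessed.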
-- 
Duy


* RE: Git performance results on a large repository
  2012-02-04  6:53 ` Nguyen Thai Ngoc Duy
@ 2012-02-04 18:05   ` Joshua Redstone
  2012-02-05  3:47     ` Nguyen Thai Ngoc Duy
                       ` (3 more replies)
  2012-02-04 20:05   ` Joshua Redstone
  2012-02-05 15:01   ` Tomas Carnecky
  2 siblings, 4 replies; 34+ messages in thread
From: Joshua Redstone @ 2012-02-04 18:05 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: git@vger.kernel.org

[ wanted to reply to my initial msg, but wasn't subscribed to the list at time of mailing, so replying to most recent post instead ]

Thanks to everyone for the questions and suggestions.  I'll try to respond here.  One high-level clarification - this synthetic repo for which I've reported perf times is representative of where we think we'll be in the future.  Git is slow but marginally acceptable for today.  We want to start planning now for any big changes we need to make going forward.

Evgeny Sazhin, Slinky and Ævar Arnfjörð Bjarmason suggested splitting up the repo into multiple, smaller repos.  I indicated before that we have a lot of cross-dependencies.  Our largest repo by number of files and commits is the repo containing the front-end server.  It is a large code base in which the tight integration of various components results in many of the cross dependencies.  We are working slowly to split things up more, for example into services, but that is a long-term process.

To get a bit abstract for a moment, in an ideal world, it doesn't seem like the performance constraints of a source-control-system should dictate how we choose to structure our code.  Ideally, it seems like we should be able to choose to structure our code in whatever way we feel maximizes developer productivity.  If development and code/release management seem easier in a single repo, then why not make an SCM that can handle it?  This is one reason I've been leaning towards figuring out an SCM approach that can work well with our current practices rather than changing them as a prerequisite for good SCM performance.

Sam Vilain:  Thanks for the pointer, I didn't realize that fast-import was bi-directional.  I used it for generating the synthetic repo.  Will look into using it the other way around.  Though that still won't speed up things like git-blame, presumably?  The sparse-checkout issue you mention is a good one.  There is a good question of how to support quick checkout, branch switching, clone, push and so forth.  I'll look into the approaches you suggest.  One consideration is coming up with a high-leverage approach - i.e. not doing heavy dev work if we can avoid it.  On the other hand, it would be nice if we (including the entire community :) ) improve git in areas where others who share similar issues benefit as well.

Matt Graham:  I don't have file stats at the moment.  It's mostly code files, with a few larger data files here and there.    We also don't do sparse checkouts, primarily because most people use git (whether on top of SVN or not), which doesn't support it.

Chris Lee:  When I was building up the repo (e.g., doing lots of commits, before I started using fast-import), I noticed that flash was not much faster - stat'ing the whole repo takes a lot of kernel time, even with flash.  My hunch is that we'd see similar issues with other operations, like git-blame.

Zeki Mokhtarzada:  Dumping history I think would speed up operations for which we don't care about old history, like git-blame in which we only want to see recent modifications.  We'd also need a good story for other kinds of operations.  In my mental model of git scalability, I categorize git structures into three kinds:  those for reasoning about history, those for the index and those for the working directory  (yeah, I know these don't map precisely to actual on-disk things like the object store, including trees, etc.).  One scaling approach we've been thinking of is to focus on each individually:  develop a specialized thing to handle history commands efficiently (git-blame, git-log, git-diff, etc.), something to speed up or bypass the index, and something to make large changes to the working directory quickly.

Joey Hess:  Separating the factors is a good suggestion.  My hunch is that the various git operations test the performance issues in isolation.  For example, git-status performance depends just on the number of files, not on the depth of history.  On the other hand, my guess is that git-blame performance is more a function of the length of history rather than the number of files.  Though, certainly with compression and indexing in pack files, I could imagine there being cross-effects between length of history and number of files.   The git-status suggestion definitely helps when you know which directory you are concerned about.  Often I'm lazy and stat the repo root so I trade-off slowness for being more sure I'm not missing anything.

@Joey, I think you're also touching on a good meta point, which is that there's probably no silver bullet here.  If we want git to efficiently handle repos that are large across a number of dimensions (size, # commits, # files, etc.), there are multiple parts of git that would need enhancement of some form.

Nguyen Thai Ngoc Duy:  At which point in the test flow should I insert git-update-index?  I'm happy to try it out.  Will compress the index when I next get to a terminal.  My guess is it'll compress a bunch.  It's also conceivable that, if git had an external interface through which other systems could efficiently report which files have changed (e.g., via file-system integration), we could omit managing the index in many cases.  I know that would be a big change, but the benefits are intriguing.

Cheers,
Josh

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Git performance results on a large repository
  2012-02-04 18:05   ` Joshua Redstone
@ 2012-02-05  3:47     ` Nguyen Thai Ngoc Duy
  2012-02-06 15:40       ` Joey Hess
  2012-02-06  7:10     ` David Mohs
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 34+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-02-05  3:47 UTC (permalink / raw)
  To: Joshua Redstone; +Cc: git@vger.kernel.org

On Sun, Feb 5, 2012 at 1:05 AM, Joshua Redstone <joshua.redstone@fb.com> wrote:
> It's also conceivable that, if there were an external interface in git to attach other
> systems to efficiently report which files have changed (e.g., via file-system integration),
> it's possible that we could omit managing the index in many cases.
> I know that would be a big change, but the benefits are intriguing.

The "interface to report which files have changed" is exactly what "git
update-index --[no-]assume-unchanged" is for. Have a look at the man
page. Basically you can mark every file "unchanged" in the beginning
and git won't bother to lstat() them. For any file you then change, you
have to explicitly run "git update-index --no-assume-unchanged" to tell git.
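
A minimal sketch of that workflow, run against a throwaway repository
(the file names are made up for illustration):

```shell
# Build a throwaway repository so the commands can be tried safely.
demo=$(mktemp -d)
cd "$demo" || exit 1
git init -q
echo one > a.txt
echo two > b.txt
git add a.txt b.txt

# Mark every tracked file "assume unchanged" so git skips lstat() on it.
git ls-files -z | git update-index --assume-unchanged -z --stdin

# Before editing a file, clear the bit so git looks at it again.
git update-index --no-assume-unchanged a.txt
echo edit >> a.txt

# List entries with their status tags.
git ls-files -v
```

In the "git ls-files -v" output, a lowercase tag such as "h" marks an
entry that still has the assume-unchanged bit set.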

Someone on HN suggested making assume-unchanged files read-only to
avoid 90% of the cases of accidentally changing a file without telling
git. When the assume-unchanged bit is cleared, the file is made
read-write again.
-- 
Duy

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Git performance results on a large repository
  2012-02-05  3:47     ` Nguyen Thai Ngoc Duy
@ 2012-02-06 15:40       ` Joey Hess
  2012-02-07 13:43         ` Nguyen Thai Ngoc Duy
  0 siblings, 1 reply; 34+ messages in thread
From: Joey Hess @ 2012-02-06 15:40 UTC (permalink / raw)
  To: git@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 1176 bytes --]

Nguyen Thai Ngoc Duy wrote:
> The "interface to report which files have changed" is exactly what "git
> update-index --[no-]assume-unchanged" is for. Have a look at the man
> page. Basically you can mark every file "unchanged" in the beginning
> and git won't bother to lstat() them. For any file you then change, you
> have to explicitly run "git update-index --no-assume-unchanged" to tell git.
>
> Someone on HN suggested making assume-unchanged files read-only to
> avoid 90% of the cases of accidentally changing a file without telling
> git. When the assume-unchanged bit is cleared, the file is made
> read-write again.

That made me think about using assume-unchanged with git-annex since it
already has read-only files. 

But, here's what seems a misfeature... If an assume-unchanged file has
modifications and I git add it, nothing happens. To stage a change, I
have to explicitly git update-index --no-assume-unchanged and only then
git add, and then I need to remember to reset the assume-unchanged bit
when I'm done working on that file for now. Compare with running git mv
on the same file, which does stage the move despite assume-unchanged. (So
does git rm.)

-- 
see shy jo

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Git performance results on a large repository
  2012-02-06 15:40       ` Joey Hess
@ 2012-02-07 13:43         ` Nguyen Thai Ngoc Duy
  2012-02-09 21:06           ` Joshua Redstone
  0 siblings, 1 reply; 34+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-02-07 13:43 UTC (permalink / raw)
  To: Joey Hess; +Cc: git@vger.kernel.org

On Mon, Feb 6, 2012 at 10:40 PM, Joey Hess <joey@kitenet.net> wrote:
>> Someone on HN suggested making assume-unchanged files read-only to
>> avoid 90% of the cases of accidentally changing a file without telling
>> git. When the assume-unchanged bit is cleared, the file is made
>> read-write again.
>
> That made me think about using assume-unchanged with git-annex since it
> already has read-only files.
>
> But, here's what seems a misfeature...

because, well.. assume-unchanged was designed to avoid stat() and
nothing else. We are basing a new feature on top of it.

> If an assume-unchanged file has
> modifications and I git add it, nothing happens. To stage a change, I
> have to explicitly git update-index --no-assume-unchanged and only then
> git add, and then I need to remember to reset the assume-unchanged bit
> when I'm done working on that file for now. Compare with running git mv
> on the same file, which does stage the move despite assume-unchanged. (So
> does git rm.)

This is normal in the lock-based "checkout/edit/checkin" model. mv/rm
operate on directory content, which is not "locked - no edit allowed"
(in our case --assume-unchanged) in git. But the lock-based model does
not map really well to git anyway. Such a model has no index (which may
make things more complicated). Also, at the index level, git does not
really understand directories.

I think we could add a protection layer to the index, where any change
(including removal) to an index entry is only allowed if the entry is
"unlocked" (i.e., no assume-unchanged bit). Locked entries are read-only
and have the assume-unchanged bit set. "git lock" and "git unlock" would
be introduced as the new UI. Does that make assume-unchanged friendlier?
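
To make the idea concrete, here is a rough sketch of what such a pair
of commands might do, written as shell functions on top of what git
has today. git_lock and git_unlock are hypothetical names, not
existing git commands, and this only combines the assume-unchanged bit
with the read-only trick; it does not add any protection inside the
index itself.

```shell
# Hypothetical helpers sketching the proposed "git lock" / "git unlock".
# "Lock": tell git to assume the files are unchanged, then make them read-only.
git_lock() {
    git update-index --assume-unchanged -- "$@" && chmod a-w -- "$@"
}

# "Unlock": make the files writable for the owner again and clear the bit.
git_unlock() {
    chmod u+w -- "$@" && git update-index --no-assume-unchanged -- "$@"
}

# Demo on a throwaway repository.
demo=$(mktemp -d)
cd "$demo" || exit 1
git init -q
echo draft > notes.txt
git add notes.txt

git_lock notes.txt
git ls-files -v notes.txt     # lowercase tag: assume-unchanged is set
git_unlock notes.txt
echo more >> notes.txt        # the file is writable again
git ls-files -v notes.txt     # uppercase tag: git checks the file again
```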
-- 
Duy

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Git performance results on a large repository
  2012-02-07 13:43         ` Nguyen Thai Ngoc Duy
@ 2012-02-09 21:06           ` Joshua Redstone
  2012-02-10  7:12             ` Nguyen Thai Ngoc Duy
  0 siblings, 1 reply; 34+ messages in thread
From: Joshua Redstone @ 2012-02-09 21:06 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: git@vger.kernel.org

Hi Nguyen,
I like the notion of using --assume-unchanged to cut down the set of
things that git considers may have changed.
It seems to me that there may still be situations that require operations
on the order of the # of files in the repo and hence may still be slow.
Following is a list of potential candidates that occur to me.

1. Switching branches, especially if you switch to an old branch.
Sometimes I've seen branch switching taking a long time for what I thought
was close to where HEAD was.

2. Interactive rebase in which you reorder a few commits close to the tip
of the branch (I observed this taking a long time, but haven't profiled it
yet).  I include here other types of cherry-picking of commits.

3. Any working directory operations that fail part-way through and make
you want to do a 'git reset --hard' or at least a full 'git-status'.  That
is, when you have reason to believe that files with 'assume-unchanged' may
have accidentally changed.

4. Operations that require rewriting the index - I think git-add is one?

If the working-tree representation is the full set of all files
materialized on disk and it's the same as the representation of files
changed, then I'm not sure how to avoid some of these without playing file
system games or using wrapper scripts.

What do you (or others) think?

Josh

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Git performance results on a large repository
  2012-02-09 21:06           ` Joshua Redstone
@ 2012-02-10  7:12             ` Nguyen Thai Ngoc Duy
  2012-02-10  9:39               ` Christian Couder
  0 siblings, 1 reply; 34+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-02-10  7:12 UTC (permalink / raw)
  To: Joshua Redstone; +Cc: git@vger.kernel.org

On Fri, Feb 10, 2012 at 4:06 AM, Joshua Redstone <joshua.redstone@fb.com> wrote:
> Hi Nguyen,
> I like the notion of using --assume-unchanged to cut down the set of
> things that git considers may have changed.
> It seems to me that there may still be situations that require operations
> on the order of the # of files in the repo and hence may still be slow.
> Following is a list of potential candidates that occur to me.
>
> 1. Switching branches, especially if you switch to an old branch.
> Sometimes I've seen branch switching taking a long time for what I thought
> was close to where HEAD was.
>
> 2. Interactive rebase in which you reorder a few commits close to the tip
> of the branch (I observed this taking a long time, but haven't profiled it
> yet).  I include here other types of cherry-picking of commits.
>
> 3. Any working directory operations that fail part-way through and make
> you want to do a 'git reset --hard' or at least a full 'git-status'.  That
> is, when you have reason to believe that files with 'assume-unchanged' may
> have accidentally changed.

All these involve unpack_trees(), which is a full-tree operation. The
bigger your worktree is, the slower it is. Another good reason to
split unrelated parts into separate repositories.


> 4. Operations that require rewriting the index - I think git-add is one?
>
> If the working-tree representation is the full set of all files
> materialized on disk and it's the same as the representation of files
> changed, then I'm not sure how to avoid some of these without playing file
> system games or using wrapper scripts.
>
> What do you (or others) think?
>
>
> Josh

-- 
Duy

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Git performance results on a large repository
  2012-02-10  7:12             ` Nguyen Thai Ngoc Duy
@ 2012-02-10  9:39               ` Christian Couder
  2012-02-10 12:24                 ` Nguyen Thai Ngoc Duy
  0 siblings, 1 reply; 34+ messages in thread
From: Christian Couder @ 2012-02-10  9:39 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: Joshua Redstone, git@vger.kernel.org

Hi,

On Fri, Feb 10, 2012 at 8:12 AM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
>
> All these involve unpack_trees(), which is a full-tree operation. The
> bigger your worktree is, the slower it is. Another good reason to
> split unrelated parts into separate repositories.

Maybe having different "views" would be enough to make a smaller
worktree and history, so that things are much faster for a developer?

(I already suggested "views" based on "git replace" in this thread:
http://thread.gmane.org/gmane.comp.version-control.git/177146/focus=177639)

Best regards,
Christian.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Git performance results on a large repository
  2012-02-10  9:39               ` Christian Couder
@ 2012-02-10 12:24                 ` Nguyen Thai Ngoc Duy
  0 siblings, 0 replies; 34+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-02-10 12:24 UTC (permalink / raw)
  To: Christian Couder; +Cc: Joshua Redstone, git@vger.kernel.org

On Fri, Feb 10, 2012 at 4:39 PM, Christian Couder
<christian.couder@gmail.com> wrote:
> Hi,
>
> On Fri, Feb 10, 2012 at 8:12 AM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
>>
>> All these involve unpack_trees(), which is a full-tree operation. The
>> bigger your worktree is, the slower it is. Another good reason to
>> split unrelated parts into separate repositories.
>
> Maybe having different "views" would be enough to make a smaller
> worktree and history, so that things are much faster for a developer?
>
> (I already suggested "views" based on "git replace" in this thread:
> http://thread.gmane.org/gmane.comp.version-control.git/177146/focus=177639)

That's more or less what I did with the subtree clone series [1] and
ended up doing narrow clone [2]. The only difference between the two
is how they handle the partial worktree/index. The former uses git-replace
to seal any holes, the latter tackles it at the pathspec level and is
generally more elegant.

The worktree part from that work should be usable in full clone too. I
am reviving the series and going to repost it soon. Have a look [3] if
you are interested.

[1] http://thread.gmane.org/gmane.comp.version-control.git/152347
[2] http://thread.gmane.org/gmane.comp.version-control.git/155427
[3] https://github.com/pclouds/git/commits/narrow-clone
-- 
Duy

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Git performance results on a large repository
  2012-02-04 18:05   ` Joshua Redstone
  2012-02-05  3:47     ` Nguyen Thai Ngoc Duy
@ 2012-02-06  7:10     ` David Mohs
  2012-02-06 16:23     ` Matt Graham
  2012-02-06 21:17     ` Sam Vilain
  3 siblings, 0 replies; 34+ messages in thread
From: David Mohs @ 2012-02-06  7:10 UTC (permalink / raw)
  To: git

Joshua Redstone <joshua.redstone <at> fb.com> writes:

> To get a bit abstract for a moment, in an ideal world, it doesn't seem like
> performance constraints of a source-control-system should dictate how we
> choose to structure our code. Ideally, seems like we should be able to choose
> to structure our code in whatever way we feel maximizes developer
> productivity. If development and code/release management seem easier in a
> single repo, then why not make an SCM that can handle it? This is one reason
> I've been leaning towards figuring out an SCM approach that can work well with
> our current practices rather than changing them as a prerequisite for good SCM
> performance.

I certainly agree with this perspective---that our tools should support our
use cases and not the other way around. However, I'd like you to consider that
the size of this hypothetical repository might be giving you some useful
information on the health of the code it contains. You might consider creating
separate repositories simply to promote good modularization. It would involve
some up-front effort and certainly some pain, but this work itself might be
beneficial to your codebase without even considering the improved performance
of the version control system.

My concern here is that it may be extremely difficult to make a single piece
of software scale for a project that can grow arbitrarily large. You may add
some great performance improvements to git to then find that your bottleneck
is the filesystem. That would enlarge the scope of your work and would likely
make the project more difficult to manage.

If you are able to prove me wrong, the entire software community will benefit
from this work. However, before you embark upon a technical solution to your
problem, I would urge you to consider the possible benefits of a non-technical
solution, specifically restructuring your code and/or teams into more
independent modules. You might find benefits from this approach that extend
beyond source code control, which could make it the solution with the least
amount of overall risk.

Thanks for starting this valuable discussion.

-David

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Git performance results on a large repository
  2012-02-04 18:05   ` Joshua Redstone
  2012-02-05  3:47     ` Nguyen Thai Ngoc Duy
  2012-02-06  7:10     ` David Mohs
@ 2012-02-06 16:23     ` Matt Graham
  2012-02-06 20:50       ` Joshua Redstone
  2012-02-06 21:17     ` Sam Vilain
  3 siblings, 1 reply; 34+ messages in thread
From: Matt Graham @ 2012-02-06 16:23 UTC (permalink / raw)
  To: Joshua Redstone; +Cc: Nguyen Thai Ngoc Duy, git@vger.kernel.org

On Sat, Feb 4, 2012 at 18:05, Joshua Redstone <joshua.redstone@fb.com> wrote:
> [ wanted to reply to my initial msg, but wasn't subscribed to the list at time of mailing, so replying to most recent post instead ]
>
> Matt Graham:  I don't have file stats at the moment.  It's mostly code files, with a few larger data files here and there.    We also don't do sparse checkouts, primarily because most people use git (whether on top of SVN or not), which doesn't support it.

This doesn't help your original goal, but while you're still working
with git-svn, you can do sparse checkouts. Use --ignore-paths when you
do the original clone and it will filter out directories that are not
of interest.

We used this at Etsy to keep git svn checkouts manageable when we
still had a gigantic svn repo.  You've repeatedly said you don't want
to reorganize your repos but you may find this writeup informative
about how Etsy migrated to git (which included a healthy amount of repo
manipulation).
http://codeascraft.etsy.com/2011/12/02/moving-from-svn-to-git-in-1000-easy-steps/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Git performance results on a large repository
  2012-02-06 16:23     ` Matt Graham
@ 2012-02-06 20:50       ` Joshua Redstone
  2012-02-06 21:07         ` Greg Troxel
  2012-02-07  1:28         ` david
  0 siblings, 2 replies; 34+ messages in thread
From: Joshua Redstone @ 2012-02-06 20:50 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy, Joey Hess, dgma@mohsinc.com, Matt Graham,
	Tomas Carnecky, Greg Troxel, david@lang.hm, David Barr
  Cc: git@vger.kernel.org

Hi all,

Nguyen, thanks for pointing out the assume-unchanged part.  That, and
especially the suggestion of making assume-unchanged files read-only, is
interesting.  It does require explicit specification of what's changed.
Hmm, I wonder if that could be a candidate API through which something
like a CoW file system could let git know what's changed.  Btw, I think you
asked earlier, but the index compresses from 158MB to 58MB - keep in mind
that the majority of file names in the repo are synthetic, so take that
with a big grain of salt.

Joey, it sounds like it might be good if git-mv and other commands were
consistent in how they treat the assume-unchanged bit.

David Mohs:  Yeah, it's an open question whether we'd be better off
somehow forcing the repos to split apart more.  As a practical matter,
what may happen is that we incrementally solve our problem by addressing
pain points as they come up (e.g., git status being slow).  One risk with
that approach is that it leads to overly short-term thinking and we get
stuck in a local minimum.  I totally agree that good modularization and
code health is valuable.  I think sometimes that getting to good
modularization does involve some technical work - like maybe moving
functionality between systems so they split apart better, having some
notion of versioning and dependency and managing that, and so forth.    I
suppose the other aspect to the problem is that we want to make sure we
have a good source-control story even if the modularization effort takes a
long time - we'd rather not end up in a race between long-term
modularization efforts and source-control performance going south too
fast.  I suppose this comes back to the desire that modularization not be
a prerequisite for good source-control performance.  Oh, and in case I
didn't mention it - we are working on modularization and splitting off
large chunks of code, both into separable libraries as well as into
separate services, but it's a long-term process.

Matt, some of our repos are still on SVN, many are on pure-git.  One of
the main ones that is on SVN is, at least at the moment, not amenable to
sparse checkouts because of its structure.

Tomas, yeah, I think one of the big questions is how little technical work
can we get away with, and where's the point of maximum leverage in terms
of how much engineering time we invest.

Greg,  'git commit' does some stat'ing of every file, even with all those
flags - for example, I think one place it does so is after the pre-commit
hooks run: it re-stats everything in case a hook touched any files.
Regarding the
perf numbers, I ran it on a beefy linux box.  Have you tried doing your
measurements with the drop_caches trick to make sure the file cache is
totally cold?  Sorry for the dumb question, but how do I check the vnode
cache size?

David Lang and David Barr, I generated the pack files by doing a repack:
"git repack -a -d -f --max-pack-size=10g --depth=100 --window=250"  after
generating the repo.

One other update, the command I was running to get a histogram of all
files in the repo finally completed.  The histogram (counting file size in
bytes) is:

[       0.0 -        6.4): 3
[       6.4 -       41.3): 27
[      41.3 -      265.7): 6
[     265.7 -     1708.1): 652594
[    1708.1 -    10980.6): 673482
[   10980.6 -    70591.6): 19519
[   70591.6 -   453814.3): 1583
[  453814.3 -  2917451.4): 276
[ 2917451.4 - 18755519.0): 61
[18755519.0 - 120574242.0]: 4
n=1347555 mean=3697.917708, median=1770.000000, stddev=122940.890559

The smaller files are all text (code), and the large ones are probably
binary.
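
For anyone who wants comparable numbers from their own repository,
here is a small sketch that prints the same kind of summary line
(count, mean, median) from a list of file sizes. size_summary is a
made-up helper name; how the histogram above was actually produced is
not stated here, so this is only one possible way to do it.

```shell
# Read one file size (in bytes) per line on stdin; print n, mean and median.
# size_summary is a hypothetical helper name.
size_summary() {
    sort -n | awk '
        { v[NR] = $1; sum += $1 }
        END {
            if (NR == 0) { print "n=0"; exit }
            # Input is sorted, so the median is the middle element
            # (or the average of the two middle elements).
            median = (NR % 2) ? v[(NR + 1) / 2] : (v[NR / 2] + v[NR / 2 + 1]) / 2
            printf "n=%d mean=%.2f median=%.2f\n", NR, sum / NR, median
        }'
}

# Small demo with three made-up sizes.
printf '10\n30\n20\n' | size_summary    # prints: n=3 mean=20.00 median=20.00
```

In a git checkout, the sizes could come from something like
"git ls-files -z | xargs -0 stat -c %s" on systems with GNU stat (BSD
stat takes different options).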

Cheers,
Josh

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Git performance results on a large repository
  2012-02-06 20:50       ` Joshua Redstone
@ 2012-02-06 21:07         ` Greg Troxel
  2012-02-07  1:28         ` david
  1 sibling, 0 replies; 34+ messages in thread
From: Greg Troxel @ 2012-02-06 21:07 UTC (permalink / raw)
  To: Joshua Redstone; +Cc: git@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 1742 bytes --]

Joshua Redstone <joshua.redstone@fb.com> writes:

> Greg,  'git commit' does some stat'ing of every file, even with all those
> flags - for example, I think one place it does so is after the pre-commit
> hooks run: it re-stats everything in case a hook touched any files.

That seems ripe for skipping.  If I understand correctly, what's being
committed is the index, not the working dir contents, so it would follow
that a pre-commit hook changing a file is a bug.

> Regarding the perf numbers, I ran it on a beefy linux box.  Have you
> tried doing your measurements with the drop_caches trick to make sure
> the file cache is totally cold?

On NetBSD, there should be a clear cache command for just this reason,
but I'm not sure there is.  So I did

  sysctl -w kern.maxvnodes=1000 # seemed to take a while
  ls -lR # wait for those to be faulted in
  sysctl -w kern.maxvnodes=500000

Then, git status on my repo churned the disk for a long time.

  real    2m7.121s
  user    0m3.086s
  sys     0m7.577s

and then again right away

  real    0m6.497s
  user    0m2.533s
  sys     0m3.010s

That repo has 217852 files (a real source tree with a few binaries, not
synthetic).

> Sorry for the dumb question, but how do I check the vnode cache size?

On BSD, sysctl kern.maxvnodes.  I would assume that on Linux there is
some max size for the vnode cache, and that stat of a file in that
cache is faster than going to the filesystem (even if reading from
cached disk blocks).  But I really don't know how that works in Linux.

I was going to say that if your vnode cache isn't big enough, then the
hot run won't be so much faster than the warm run, but that's not true,
because the fs blocks will be in the block cache and it will still help.

* Re: Git performance results on a large repository
  2012-02-06 20:50       ` Joshua Redstone
  2012-02-06 21:07         ` Greg Troxel
@ 2012-02-07  1:28         ` david
  1 sibling, 0 replies; 34+ messages in thread
From: david @ 2012-02-07  1:28 UTC (permalink / raw)
  To: Joshua Redstone
  Cc: Nguyen Thai Ngoc Duy, Joey Hess, dgma@mohsinc.com, Matt Graham,
	Tomas Carnecky, Greg Troxel, David Barr, git@vger.kernel.org

On Mon, 6 Feb 2012, Joshua Redstone wrote:

> David Lang and David Barr, I generated the pack files by doing a repack:
> "git repack -a -d -f --max-pack-size=10g --depth=100 --window=250"  after
> generating the repo.

how many pack files does this end up creating?

I think that doing a full repack the way you did will group all revisions 
of a given file into a pack.

while what I'm saying is that if you create the packs based on time, 
rather than space efficiency of the resulting pack files, you may end up 
not having to go through as much data when doing things like a git blame.

what you did was

initialize repo
4M commits
repack

what I'm saying is

initialize repo
loop
    500K commits
    repack (and set pack to .keep so it doesn't get overwritten)

so you will end up with ~8 sets of pack files, but time based so that when 
you only need recent information you only look at the most recent pack 
file. If you need to go back through all time, the multiple pack files 
will be a little more expensive to process.

this has the added advantage that the 8 small repacks should be cheaper 
than the one large repack as it isn't trying to cover all commits each 
time.
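
The loop above can be sketched roughly as follows. This is only an illustration under stated assumptions: `repack_and_keep` is a hypothetical helper name, the repository is non-bare, and the sketch leans on the documented behaviour that `git repack -a -d` leaves alone any pack that has a matching `.keep` file.

```shell
#!/bin/sh
# repack_and_keep (hypothetical helper): pack everything in the repository
# at $1 that is not already in a kept pack, then mark every pack as .keep
# so that later repacks do not rewrite it.
repack_and_keep() {
    repo=$1
    git -C "$repo" repack -a -d -q || return 1
    for pack in "$repo"/.git/objects/pack/pack-*.pack; do
        [ -e "$pack" ] || continue          # no packs at all
        touch "${pack%.pack}.keep"
    done
}
```

Calling it once after each batch of commits should leave one pack per batch; `git count-objects -v` reports the resulting number on its `packs:` line.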

David Lang

* Re: Git performance results on a large repository
  2012-02-04 18:05   ` Joshua Redstone
                       ` (2 preceding siblings ...)
  2012-02-06 16:23     ` Matt Graham
@ 2012-02-06 21:17     ` Sam Vilain
  3 siblings, 0 replies; 34+ messages in thread
From: Sam Vilain @ 2012-02-06 21:17 UTC (permalink / raw)
  To: Joshua Redstone; +Cc: Nguyen Thai Ngoc Duy, git@vger.kernel.org

 > Sam Vilain: Thanks for the pointer, i didn't realize that
 > fast-import was bi-directional.  I used it for generating the
 > synthetic repo.  Will look into using it the other way around.
 > Though that still won't speed up things like git-blame,
 > presumably?

It could, because blame is an operation which primarily works on
the source history with little reference to the working copy.  Of
course this will depend on the quality of the implementation
server-side.  Blame should suit distribution over a cluster, as
it is mostly involved with scanning candidate revisions for
string matches which is the compute intensive part.  Coming up
with candidate revisions has its own cost and can probably also
be distributed, but just working on the lowest loop level might
be a good place to start.

What it doesn't help with is local filesystem operations.  For
this I think a different approach is required, if you can tie
into fam or a similar inode change notification system, then you
should be able to avoid the entire recursive stat on 'git
status'.  I'm not sure --assume-unchanged on its own is a good
idea, you could easily miss things.  Those stat's are useful.

Making the index able to hold just changes to the checked-out
tree, as others have mentioned, would also save the massive reads
and writes you've identified.  Perhaps a more high performance
back-end could be developed.

 > The sparse-checkout issue you mention is a good one.

It's actually been on the table since at least GitTogether 2008;
there's been some design discussion on it and I think it's just
one of those features which doesn't have enough demand yet for it
to be built.  It keeps coming up but not from anyone with the
inclination or resources to make it happen.  There is a protocol
issue, but this should be able to fit into the current extension
system.

 > There is a good question of how to support quick checkout,
 > branch switching, clone, push and so forth.

Sure.  It will be much more network intensive as you are
replacing the part which normally has a very fast link through
the buffercache to pack files etc.  A hybrid approach is also
possible, where objects are fetched individually via fast-import
and cached in a local .git repo.  And I have a hunch that LZOP
compression of the stream may also be a win, but as with all of
these ideas, it would be after profiling identifies it as a choke point
rather than just because it sounds good.

 > I'll look into the approaches you suggest.  One consideration
 > is coming up with a high-leverage approach - i.e. not doing
 > heavy dev work if we can avoid it.

Right.  You don't actually need to port the whole of git to Hadoop 
initially, to begin with it can just pass through all commands to a 
server-side git fast-import process.  When you find specific operations 
which are slow then these specific operations can be implemented using a 
Hadoop back-end, and the rest backed to the standard git.  If done using 
a useful plug-in system, these systems could be accepted by the core 
project as an enterprise scaling option.

This could let you get going with the knowledge that the scaling option 
is there should it come out.

 > On the other hand, it would be nice if we (including the entire
 > community:) ) improve git in areas that others that share
 > similar issues benefit from as well.

Like I say, a lot of people have run into this already...

HTH,
Sam

* RE: Git performance results on a large repository
  2012-02-04  6:53 ` Nguyen Thai Ngoc Duy
  2012-02-04 18:05   ` Joshua Redstone
@ 2012-02-04 20:05   ` Joshua Redstone
  2012-02-05 15:01   ` Tomas Carnecky
  2 siblings, 0 replies; 34+ messages in thread
From: Joshua Redstone @ 2012-02-04 20:05 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: git@vger.kernel.org

One more follow-on thought.  I imagine that most consumers of git are nowhere near the scale of the test repo that I described.  They may still benefit from efforts to improve git support for large repos.  A few possible reasons:

1. The performance improvements should speed things up for smaller repos as well.
2. They may find their repos growing to a 'large scale' at some point in the future.
3. Any code cleanup as part of an effort to support git scalability is good for code base health and e.g., would facilitate future modifications that may more directly affect them.

Cheers,
Josh
________________________________________
From: Nguyen Thai Ngoc Duy [pclouds@gmail.com]
Sent: Friday, February 03, 2012 10:53 PM
To: Joshua Redstone
Cc: git@vger.kernel.org
Subject: Re: Git performance results on a large repository

On Fri, Feb 3, 2012 at 9:20 PM, Joshua Redstone <joshua.redstone@fb.com> wrote:
> I timed a few common operations with both a warm OS file cache and a cold
> cache.  i.e., I did a 'echo 3 | tee /proc/sys/vm/drop_caches' and then did
> the operation in question a few times (first timing is the cold timing,
> the next few are the warm timings).  The following results are on a server
> with average hard drive (I.e., not flash)  and > 10GB of ram.
>
> 'git status' :   39 minutes cold, and 24 seconds warm.
>
> 'git blame':   44 minutes cold, 11 minutes warm.
>
> 'git add' (appending a few chars to the end of a file and adding it):   7
> seconds cold and 5 seconds warm.
>
> 'git commit -m "foo bar3" --no-verify --untracked-files=no --quiet
> --no-status':  41 minutes cold, 20 seconds warm.  I also hacked a version
> of git to remove the three or four places where 'git commit' stats every
> file in the repo, and this dropped the times to 30 minutes cold and 8
> seconds warm.

Have you tried "git update-index --assume-unchanged"? That should
reduce mass lstat() and hopefully improve the above numbers. The
interface is not exactly easy-to-use, but if it has significant gain,
then we can try to improve UI.
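
A minimal sketch of trying this on a whole working tree, assuming hypothetical helper names and a git new enough to have `-C`. Paths are fed through `--stdin` so the index is rewritten once rather than once per xargs batch. Note that with the bit set git stops noticing edits to those files until it is cleared again.

```shell
#!/bin/sh
# Hypothetical helpers: set or clear the assume-unchanged bit on every
# tracked file of the repository given as $1.
assume_unchanged_all() {
    git -C "$1" ls-files -z |
        git -C "$1" update-index --assume-unchanged -z --stdin
}

no_assume_unchanged_all() {
    git -C "$1" ls-files -z |
        git -C "$1" update-index --no-assume-unchanged -z --stdin
}
```

Afterwards `git ls-files -v` shows the affected entries with a lowercase status letter (for example `h` instead of `H`).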

On the index size issue, ideally we should make minimum writes to
index instead of rewriting 191 MB index. An improvement we could do
now is to compress it, reduce disk footprint, thus disk I/O. If you
compress the index with gzip, how big is it?
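
One quick way to answer the gzip question without touching the index itself, as a sketch only: `index_sizes` is a hypothetical helper and it assumes a plain non-bare repository whose index lives at `.git/index`.

```shell
#!/bin/sh
# index_sizes (hypothetical helper): print the size in bytes of the index
# of the repository at $1, uncompressed and after gzip. The index file is
# only read, never modified.
index_sizes() {
    idx="$1/.git/index"
    raw=$(wc -c < "$idx") || return 1
    gz=$(gzip -c "$idx" | wc -c) || return 1
    # $(( )) strips the padding some wc implementations add
    echo "raw=$(($raw)) gzip=$(($gz))"
}
```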
--
Duy

* Re: Git performance results on a large repository
  2012-02-04  6:53 ` Nguyen Thai Ngoc Duy
  2012-02-04 18:05   ` Joshua Redstone
  2012-02-04 20:05   ` Joshua Redstone
@ 2012-02-05 15:01   ` Tomas Carnecky
  2012-02-05 15:17     ` Nguyen Thai Ngoc Duy
  2 siblings, 1 reply; 34+ messages in thread
From: Tomas Carnecky @ 2012-02-05 15:01 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: Joshua Redstone, git@vger.kernel.org

On 2/4/12 7:53 AM, Nguyen Thai Ngoc Duy wrote:
> On Fri, Feb 3, 2012 at 9:20 PM, Joshua Redstone <joshua.redstone@fb.com> wrote:
>> I timed a few common operations with both a warm OS file cache and a cold
>> cache.  i.e., I did a 'echo 3 | tee /proc/sys/vm/drop_caches' and then did
>> the operation in question a few times (first timing is the cold timing,
>> the next few are the warm timings).  The following results are on a server
>> with average hard drive (I.e., not flash)  and > 10GB of ram.
>>
>> 'git status' :   39 minutes cold, and 24 seconds warm.
>>
>> 'git blame':   44 minutes cold, 11 minutes warm.
>>
>> 'git add' (appending a few chars to the end of a file and adding it):   7
>> seconds cold and 5 seconds warm.
>>
>> 'git commit -m "foo bar3" --no-verify --untracked-files=no --quiet
>> --no-status':  41 minutes cold, 20 seconds warm.  I also hacked a version
>> of git to remove the three or four places where 'git commit' stats every
>> file in the repo, and this dropped the times to 30 minutes cold and 8
>> seconds warm.
> Have you tried "git update-index --assume-unchanged"? That should
> reduce mass lstat() and hopefully improve the above numbers. The
> interface is not exactly easy-to-use, but if it has significant gain,
> then we can try to improve UI.
>
> On the index size issue, ideally we should make minimum writes to
> index instead of rewriting 191 MB index. An improvement we could do
> now is to compress it, reduce disk footprint, thus disk I/O. If you
> compress the index with gzip, how big is it?
If you're not afraid to add filesystem-specific code to git, you could 
leverage the btrfs find-new command (or use the ioctl directly) to 
quickly find changed files since a certain point in time. Other CoW 
filesystems may have similar mechanisms. You could for example store the 
last generation id in an index extension, that's what those extensions 
are for, right?

tom

* Re: Git performance results on a large repository
  2012-02-05 15:01   ` Tomas Carnecky
@ 2012-02-05 15:17     ` Nguyen Thai Ngoc Duy
  0 siblings, 0 replies; 34+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-02-05 15:17 UTC (permalink / raw)
  To: Tomas Carnecky; +Cc: Joshua Redstone, git@vger.kernel.org

On Sun, Feb 5, 2012 at 10:01 PM, Tomas Carnecky <tom@dbservice.com> wrote:
> On 2/4/12 7:53 AM, Nguyen Thai Ngoc Duy wrote:
>>
>> On Fri, Feb 3, 2012 at 9:20 PM, Joshua Redstone <joshua.redstone@fb.com>
>> wrote:
>>>
>>> I timed a few common operations with both a warm OS file cache and a cold
>>> cache.  i.e., I did a 'echo 3 | tee /proc/sys/vm/drop_caches' and then
>>> did
>>> the operation in question a few times (first timing is the cold timing,
>>> the next few are the warm timings).  The following results are on a
>>> server
>>> with average hard drive (I.e., not flash)  and > 10GB of ram.
>>>
>>> 'git status' :   39 minutes cold, and 24 seconds warm.
>>>
>>> 'git blame':   44 minutes cold, 11 minutes warm.
>>>
>>> 'git add' (appending a few chars to the end of a file and adding it):   7
>>> seconds cold and 5 seconds warm.
>>>
>>> 'git commit -m "foo bar3" --no-verify --untracked-files=no --quiet
>>> --no-status':  41 minutes cold, 20 seconds warm.  I also hacked a version
>>> of git to remove the three or four places where 'git commit' stats every
>>> file in the repo, and this dropped the times to 30 minutes cold and 8
>>> seconds warm.
>>
>> Have you tried "git update-index --assume-unchanged"? That should
>> reduce mass lstat() and hopefully improve the above numbers. The
>> interface is not exactly easy-to-use, but if it has significant gain,
>> then we can try to improve UI.
>>
>> On the index size issue, ideally we should make minimum writes to
>> index instead of rewriting 191 MB index. An improvement we could do
>> now is to compress it, reduce disk footprint, thus disk I/O. If you
>> compress the index with gzip, how big is it?
>
> If you're not afraid to add filesystem-specific code to git, you could
> leverage the btrfs find-new command (or use the ioctl directly) to quickly
> find changed files since a certain point in time. Other CoW filesystems may
> have similar mechanisms. You could for example store the last generation id
> in an index extension, that's what those extensions are for, right?

Sure they could be stored as index extensions. I'm more concerned about
the index size. I guess fs-specific code, if properly implemented
(e.g. clean, handling repos crossing fs boundaries, moving repos...),
may get Junio's approval. There were also talks of implementing NTFS's
journal (or something) on msysgit for a similar goal.
-- 
Duy

* Re: Git performance results on a large repository
  2012-02-03 14:20 Git performance results on a large repository Joshua Redstone
                   ` (4 preceding siblings ...)
  2012-02-04  6:53 ` Nguyen Thai Ngoc Duy
@ 2012-02-04  8:57 ` slinky
  2012-02-04 21:42 ` Greg Troxel
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 34+ messages in thread
From: slinky @ 2012-02-04  8:57 UTC (permalink / raw)
  To: git

Joshua Redstone <joshua.redstone <at> fb.com> writes:

> The git performance we observed here is too slow for our needs.  So the
> question becomes, if we want to keep using git going forward, what's the
> best way to improve performance.  It seems clear we'll probably need some
> specialized servers (e.g., to perform git-blame quickly) and maybe
> specialized file system integration to detect what files have changed in a
> working tree.

Hi Joshua,

sounds like you have everything in a single .git. Split up the massive
repository to separate smaller .git repositories.

For example, the Android code base is quite big. They use the repo tool to manage a
number of separate .git repositories as one big aggregate "repository".

Cheers,
Slinky

* Re: Git performance results on a large repository
  2012-02-03 14:20 Git performance results on a large repository Joshua Redstone
                   ` (5 preceding siblings ...)
  2012-02-04  8:57 ` slinky
@ 2012-02-04 21:42 ` Greg Troxel
  2012-02-05  4:30 ` david
  2012-02-07  8:58 ` Emanuele Zattin
  8 siblings, 0 replies; 34+ messages in thread
From: Greg Troxel @ 2012-02-04 21:42 UTC (permalink / raw)
  To: Joshua Redstone; +Cc: git@vger.kernel.org

Joshua Redstone <joshua.redstone@fb.com> writes:

> The test repo has 4 million commits, linear history and about 1.3 million
> files.  The size of the .git directory is about 15GB, and has been
> repacked with 'git repack -a -d -f --max-pack-size=10g --depth=100
> --window=250'.  This repack took about 2 days on a beefy machine (I.e.,
> lots of ram and flash).  The size of the index file is 191 MB. I can share
> the script that generated it if people are interested - It basically picks
> 2-5 files, modifies a line or two and adds a few lines at the end
> consisting of random dictionary words, occasionally creates a new file,
> commits all the modifications and repeats.

I have a repository with about 500K files, 3.3G checkout, 1.5G .git, and
about 10K commits.  (This is a real repository, not a test case.)  So
not as many commits by a lot, but the size seems not so far off.

> I timed a few common operations with both a warm OS file cache and a cold
> cache.  i.e., I did a 'echo 3 | tee /proc/sys/vm/drop_caches' and then did
> the operation in question a few times (first timing is the cold timing,
> the next few are the warm timings).  The following results are on a server
> with average hard drive (I.e., not flash)  and > 10GB of ram.
>
> 'git status' :   39 minutes cold, and 24 seconds warm.

Both of these numbers surprise me.  I'm using NetBSD, whose stat
implementation isn't as optimized as Linux (you didn't say, but
assuming).   On a years-old desktop, git status seems to be about a
minute semi-cold and 5s warm (once I raised the vnode cache limit above
500K, vs the 350K default for a 2G ram machine).

So on the warm status, I wonder how big your vnode cache is, and if
you've exceeded it, and I don't follow the cold time at all.  Probably
some sort of profiling within git status would be illuminating.

> 'git blame':   44 minutes cold, 11 minutes warm.
>
> 'git add' (appending a few chars to the end of a file and adding it):   7
> seconds cold and 5 seconds warm.
>
> 'git commit -m "foo bar3" --no-verify --untracked-files=no --quiet
> --no-status':  41 minutes cold, 20 seconds warm.  I also hacked a version
> of git to remove the three or four places where 'git commit' stats every
> file in the repo, and this dropped the times to 30 minutes cold and 8
> seconds warm.

So without the stat, I wonder what it's doing that takes 30 minutes.

> One way to get there is to do some deep code modifications to git
> internals, to, for example, create some abstractions and interfaces that
> allow plugging in the specialized servers.  Another way is to leave git
> internals as they are and develop a layer of wrapper scripts around all
> the git commands that do the necessary interfacing.  The wrapper scripts
> seem perhaps easier in the short-term, but may lead to increasing
> divergence from how git behaves natively and also a layer of complexity.

Having hooks for a blame server cache, etc. sounds sensible.  Having a
way to call blames sort of like with --since and then keep updating it
(eg. in emacs) to earlier times sounds useful.
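
Stock git can already approximate the incremental idea, since `git blame` accepts `--since`. A minimal sketch, where `blame_since` is a hypothetical wrapper name and the widening schedule is up to the caller:

```shell
#!/bin/sh
# blame_since (hypothetical helper): blame the file $3 in the repository
# at $1, looking only at commits newer than $2 (any date git understands,
# e.g. 3.months). Lines untouched inside that window are attributed to a
# boundary commit, which git marks with a leading "^".
blame_since() {
    git -C "$1" blame --since="$2" -- "$3"
}
```

An editor integration could call it with a short window first (say `1.month`) and re-run with a longer one only for the lines still blamed on a `^` boundary commit.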

* Re: Git performance results on a large repository
  2012-02-03 14:20 Git performance results on a large repository Joshua Redstone
                   ` (6 preceding siblings ...)
  2012-02-04 21:42 ` Greg Troxel
@ 2012-02-05  4:30 ` david
  2012-02-05 11:24   ` David Barr
  2012-02-07  8:58 ` Emanuele Zattin
  8 siblings, 1 reply; 34+ messages in thread
From: david @ 2012-02-05  4:30 UTC (permalink / raw)
  To: Joshua Redstone; +Cc: git@vger.kernel.org

On Fri, 3 Feb 2012, Joshua Redstone wrote:

> The test repo has 4 million commits, linear history and about 1.3 million
> files.  The size of the .git directory is about 15GB, and has been
> repacked with 'git repack -a -d -f --max-pack-size=10g --depth=100
> --window=250'.  This repack took about 2 days on a beefy machine (I.e.,
> lots of ram and flash).  The size of the index file is 191 MB.

This may be a silly thought, but what if instead of one pack file of your 
entire history (4 million commits) you create multiple packs (say every 
half million commits) and mark all but the most recent pack as .keep (so 
that they won't be modified by a repack)

that way things that only need to worry about recent history (blame, etc) 
will probably never have to go past the most recent pack file or two

I may be wrong, but I think that when git is looking for 'similar files' 
for delta compression, it limits its search to the current pack, so this 
will also keep you from searching the entire project history.

David Lang

* Re: Git performance results on a large repository
  2012-02-05  4:30 ` david
@ 2012-02-05 11:24   ` David Barr
  0 siblings, 0 replies; 34+ messages in thread
From: David Barr @ 2012-02-05 11:24 UTC (permalink / raw)
  To: david; +Cc: Joshua Redstone, git@vger.kernel.org

On Sun, Feb 5, 2012 at 3:30 PM,  <david@lang.hm> wrote:
> On Fri, 3 Feb 2012, Joshua Redstone wrote:
>
>> The test repo has 4 million commits, linear history and about 1.3 million
>> files.  The size of the .git directory is about 15GB, and has been
>> repacked with 'git repack -a -d -f --max-pack-size=10g --depth=100
>> --window=250'.  This repack took about 2 days on a beefy machine (I.e.,
>> lots of ram and flash).  The size of the index file is 191 MB.
>
>
> This may be a silly thought, but what if instead of one pack file of your
> entire history (4 million commits) you create multiple packs (say every half
> million commits) and mark all but the most recent pack as .keep (so that
> they won't be modified by a repack)
>
> that way things that only need to worry about recent history (blame, etc)
> will probably never have to go past the most recent pack file or two
>
> I may be wrong, but I think that when git is looking for 'similar files' for
> delta compression, it limits its search to the current pack, so this will
> also keep you from searching the entire project history.

I don't know if there is an easy way to determine it with the current
tools in git, but one useful statistic for tuning packing performance is
the size of the largest component in the delta-chain graph. The
significance of this number is that the product of window-size and
maximum depth need not be larger than it.
I've found that with some older repositories I could have a depth as low as 3
and still get good performance from a moderate window size.
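
A related number that the current tools do expose is the histogram of delta chain lengths printed by `git verify-pack -s`. This is chain depth, not the component size described above, so treat the following only as a rough proxy; `delta_chain_histogram` is a hypothetical helper name and a non-bare repository layout is assumed.

```shell
#!/bin/sh
# delta_chain_histogram (hypothetical helper): for each pack in the
# repository at $1, print git's histogram of delta chain lengths.
# -s (--stat-only) skips verifying the pack contents.
delta_chain_histogram() {
    for idx in "$1"/.git/objects/pack/pack-*.idx; do
        [ -e "$idx" ] || continue           # repository has no packs
        echo "== $idx"
        git verify-pack -s "$idx"
    done
}
```

If nearly all objects sit in short chains, a much smaller `--depth` than 100 may cost little in pack size.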

--
David Barr

* Re: Git performance results on a large repository
  2012-02-03 14:20 Git performance results on a large repository Joshua Redstone
                   ` (7 preceding siblings ...)
  2012-02-05  4:30 ` david
@ 2012-02-07  8:58 ` Emanuele Zattin
  8 siblings, 0 replies; 34+ messages in thread
From: Emanuele Zattin @ 2012-02-07  8:58 UTC (permalink / raw)
  To: git

Joshua Redstone <joshua.redstone <at> fb.com> writes:

> 
> Hi Git folks,
> 

Hello everybody! 

I would just like to contribute a small set of blog posts 
about this issue and a possible solution. 
Sorry for the tone in which I wrote those posts, 
but I think there are some valid points in there.

https://gist.github.com/1758346

BR,

Emanuele Zattin

end of thread, other threads:[~2012-02-10 12:25 UTC | newest]

Thread overview: 34+ messages
2012-02-03 14:20 Git performance results on a large repository Joshua Redstone
2012-02-03 14:56 ` Ævar Arnfjörð Bjarmason
2012-02-03 17:00   ` Joshua Redstone
2012-02-03 22:40     ` Sam Vilain
2012-02-03 22:57       ` Sam Vilain
2012-02-07  1:19       ` Nguyen Thai Ngoc Duy
2012-02-03 23:05     ` Matt Graham
2012-02-04  1:25   ` Evgeny Sazhin
2012-02-03 23:35 ` Chris Lee
2012-02-04  0:01 ` Zeki Mokhtarzada
2012-02-04  5:07 ` Joey Hess
2012-02-04  6:53 ` Nguyen Thai Ngoc Duy
2012-02-04 18:05   ` Joshua Redstone
2012-02-05  3:47     ` Nguyen Thai Ngoc Duy
2012-02-06 15:40       ` Joey Hess
2012-02-07 13:43         ` Nguyen Thai Ngoc Duy
2012-02-09 21:06           ` Joshua Redstone
2012-02-10  7:12             ` Nguyen Thai Ngoc Duy
2012-02-10  9:39               ` Christian Couder
2012-02-10 12:24                 ` Nguyen Thai Ngoc Duy
2012-02-06  7:10     ` David Mohs
2012-02-06 16:23     ` Matt Graham
2012-02-06 20:50       ` Joshua Redstone
2012-02-06 21:07         ` Greg Troxel
2012-02-07  1:28         ` david
2012-02-06 21:17     ` Sam Vilain
2012-02-04 20:05   ` Joshua Redstone
2012-02-05 15:01   ` Tomas Carnecky
2012-02-05 15:17     ` Nguyen Thai Ngoc Duy
2012-02-04  8:57 ` slinky
2012-02-04 21:42 ` Greg Troxel
2012-02-05  4:30 ` david
2012-02-05 11:24   ` David Barr
2012-02-07  8:58 ` Emanuele Zattin
