On Thu, Oct 10, 2013 at 5:51 PM, Karsten Blees <karsten.blees@gmail.com> wrote:
 
>> I've noticed that when working with a very large repository using msys
>> git, the initial checkout of a cloned repository is excruciatingly
>> slow (80%+ of total clone time).  The root cause, I think, is that git
>> does all the file access serially, and that's really slow on Windows.
>>

What exactly do you mean by "excruciatingly slow"?

I just ran a few tests with a big repo (WebKit, ~2GB, ~200k files). A full checkout with git 1.8.4 on my SSD took 52s on Linux and 81s on Windows. Xcopy /s took ~4 minutes (so xcopy is much slower than git). On a 'real' HD (WD Caviar Green) the Windows checkout took ~9 minutes.

I'm using blink for my test, which should be more or less indistinguishable from WebKit.  I'm using a standard spinning disk, no SSD.  For my purposes, I need to optimize this for "standard" hardware, not for best-in-class.

For my test, I first run 'git clone -n <repo>', and then measure the running time of 'git checkout --force HEAD'.  On linux, the checkout command runs in 0:12; on Windows, it's about 3:30.
 
If your numbers are much slower, check for overeager virus scanners and probably the infamous "User Account Control" (On Vista/7 (8?), the luafv.sys driver slows down things on the system drive even with UAC turned off in control panel. The driver can be disabled with "sc config luafv start= disabled" + reboot. Reenable with "sc config luafv start= auto").

I confess that I am pretty ignorant about Windows, so I'll have to research these.

>> Has anyone considered threading file access to speed this up?  In
>> particular, I've got my eye on this loop in unpack-trees.c:
>>

Its probably worth a try, however, in my experience, doing disk IO in parallel tends to slow things down due to more disk seeks. 

I'd rather try to minimize seeks, ...

In my experience, modern disk controllers are very very good at this; it rarely, if ever, makes sense to try and outsmart them.

But, from talking to Windows-savvy people, I believe the issue is not disk seek time, but rather the fact that Windows doesn't cache file stat information.  Instead, it goes all the way to the source of truth (i.e., the physical disk) every time it stats a file or directory.  That's what causes the checkout to be so slow: all those file stats run serially.

Does that sound right?  I'm prepared to be wrong about this; but if no one has tried it, then it's probably at least worth an experiment.

Thanks,

Stefan

--
--
*** Please reply-to-all at all times ***
*** (do not pretend to know who is subscribed and who is not) ***
*** Please avoid top-posting. ***
The msysGit Wiki is here: https://github.com/msysgit/msysgit/wiki - Github accounts are free.
 
You received this message because you are subscribed to the Google
Groups "msysGit" group.
To post to this group, send email to msysgit@googlegroups.com
To unsubscribe from this group, send email to
msysgit+unsubscribe@googlegroups.com
For more options, and view previous threads, visit this group at
http://groups.google.com/group/msysgit?hl=en_US?hl=en
 
---
You received this message because you are subscribed to the Google Groups "msysGit" group.
To unsubscribe from this group and stop receiving emails from it, send an email to msysgit+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.