git@vger.kernel.org mailing list mirror (one of many)
* question about: Facebook makes Mercurial faster than Git
@ 2014-03-10 10:07 Dennis Luehring
  2014-03-10 10:13 ` David Lang
  2014-03-10 11:28 ` demerphq
  0 siblings, 2 replies; 12+ messages in thread
From: Dennis Luehring @ 2014-03-10 10:07 UTC (permalink / raw)
  To: git

According to these blog posts:

http://www.infoq.com/news/2014/01/facebook-scaling-hg
https://code.facebook.com/posts/218678814984400/scaling-mercurial-at-facebook/

Mercurial "can" be faster than Git.

But I haven't found any reply from the Git community on whether this is a real
problem, or whether there are ongoing changes (maybe for Git 2.0) to compete
better in this case.


* Re: question about: Facebook makes Mercurial faster than Git
  2014-03-10 10:07 question about: Facebook makes Mercurial faster than Git Dennis Luehring
@ 2014-03-10 10:13 ` David Lang
  2014-03-10 17:51   ` Ondřej Bílka
  2014-03-10 11:28 ` demerphq
  1 sibling, 1 reply; 12+ messages in thread
From: David Lang @ 2014-03-10 10:13 UTC (permalink / raw)
  To: Dennis Luehring; +Cc: git

On Mon, 10 Mar 2014, Dennis Luehring wrote:

> according to these blog posts
>
> http://www.infoq.com/news/2014/01/facebook-scaling-hg
> https://code.facebook.com/posts/218678814984400/scaling-mercurial-at-facebook/
>
> mercurial "can" be faster then git
>
> but i don't found any reply from the git community if it is a real problem
> or if there a ongoing (maybe git 2.0) changes to compete better in this case

As I understand this, the biggest part of what happened is that Facebook made a 
tweak to mercurial so that when it needs to know what files have changed in 
their massive tree, their version asks their special storage array, while git 
would have to look at it through the filesystem interface (by doing stat calls 
on the directories and files to see if anything has changed).

In other words, unless you have a very high end storage system that can keep
track of such things for you, the Facebook 'fix' won't help you. And even if
your storage does have such a capability, unless you use the same storage
system that Facebook uses, you would have to port the change to your class of
device.

Now, in addition to this, they did some other tweaks and changes, but compared 
to this status change, everything else is minor.
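
To make the "stat calls on the directories and files" part concrete, here is a
minimal sketch of a stat-based scan. It is an illustration only, not how Git's
index code is written: the function names snapshot() and changed_paths() are
made up for this example, and a real implementation compares against stat data
stored in the index rather than a second in-memory snapshot.

```python
import os


def snapshot(root):
    """Map every file under root to (mtime_ns, size), one lstat call per file."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            state[path] = (st.st_mtime_ns, st.st_size)
    return state


def changed_paths(old, new):
    """Paths that were added, removed, or whose stat data differs."""
    return sorted(p for p in old.keys() | new.keys() if old.get(p) != new.get(p))
```

The cost is one lstat per file on every run, which is why the time grows with
the size of the work tree even when nothing has changed.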

David Lang


* Re: question about: Facebook makes Mercurial faster than Git
  2014-03-10 10:07 question about: Facebook makes Mercurial faster than Git Dennis Luehring
  2014-03-10 10:13 ` David Lang
@ 2014-03-10 11:28 ` demerphq
  2014-03-10 11:42   ` Dennis Luehring
  2014-03-14 12:58   ` Duy Nguyen
  1 sibling, 2 replies; 12+ messages in thread
From: demerphq @ 2014-03-10 11:28 UTC (permalink / raw)
  To: Dennis Luehring; +Cc: Git

On 10 March 2014 11:07, Dennis Luehring <dl.soluz@gmx.net> wrote:
> according to these blog posts
>
> http://www.infoq.com/news/2014/01/facebook-scaling-hg
> https://code.facebook.com/posts/218678814984400/scaling-mercurial-at-facebook/
>
> mercurial "can" be faster then git
>
> but i don't found any reply from the git community if it is a real problem
> or if there a ongoing (maybe git 2.0) changes to compete better in this case

They mailed the list about performance issues in git. From what I saw
there was relatively little feedback.

I had the impression, and I would not be surprised if they had the
impression, that the git development community is relatively
unconcerned about performance issues on larger repositories.

There have been other reports, which are difficult to keep track of
without a bug tracking system, but the ones I know of are:

 - Poor performance of git status with a large number of excluded files
   and large repositories.

 - Poor performance, and breakage, on repositories with very large
   numbers of files in them. (Rebase, for instance, will break if you
   rebase a commit that contains a *lot* of files.)

 - Poor performance in the protocol layer (and other places) with repos
   with large numbers of refs. (Maybe this is fixed, not sure.)

cheers,
Yves




-- 
perl -Mre=debug -e "/just|another|perl|hacker/"


* Re: question about: Facebook makes Mercurial faster than Git
  2014-03-10 11:28 ` demerphq
@ 2014-03-10 11:42   ` Dennis Luehring
  2014-03-10 12:10     ` Johan Herland
  2014-03-10 14:18     ` Karsten Blees
  2014-03-14 12:58   ` Duy Nguyen
  1 sibling, 2 replies; 12+ messages in thread
From: Dennis Luehring @ 2014-03-10 11:42 UTC (permalink / raw)
  Cc: Git

Am 10.03.2014 12:28, schrieb demerphq:
> I had the impression, and I would not be surprised if they had the
> impression that the git development community is relatively
> unconcerned about performance issues on larger repositories.

So the question is whether the Git community is interested in being
competitive in such large-scale scenarios, something Mercurial now seems
to be out of the box.


* Re: question about: Facebook makes Mercurial faster than Git
  2014-03-10 11:42   ` Dennis Luehring
@ 2014-03-10 12:10     ` Johan Herland
  2014-03-10 14:48       ` Michael Haggerty
  2014-03-10 14:18     ` Karsten Blees
  1 sibling, 1 reply; 12+ messages in thread
From: Johan Herland @ 2014-03-10 12:10 UTC (permalink / raw)
  To: Dennis Luehring; +Cc: Git

On Mon, Mar 10, 2014 at 12:42 PM, Dennis Luehring <dl.soluz@gmx.net> wrote:
> Am 10.03.2014 12:28, schrieb demerphq:
>
>> I had the impression, and I would not be surprised if they had the
>> impression that the git development community is relatively
>> unconcerned about performance issues on larger repositories.
>
> so the question is if the git community is interested in beeing competive in
> such large scale scenarios - something what mercurial seems to be now out
> of the box

AFAIK, David Lang's comment is not far off the mark. Facebook has made
a tool called Watchman (https://github.com/facebook/watchman) that
watches your work tree (i.e. wrapping inotify on Linux) and triggers
various commands when files within are changed (e.g. do an auto-build
whenever a file in your project changes). Since this tool will
discover when files change, they have adjusted Mercurial to discover
changes by querying Watchman instead of stat-ing the entire work tree.

AFAICS, this is basically a tradeoff between the time it takes to stat
your work tree and the overhead/administrivia of running a daemon to
monitor the work tree. It seems Facebook has organized their code and
infrastructure in a way that makes the latter approach worthwhile for
them, and has contributed their solution back to Mercurial.
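
A rough sketch of that tradeoff, under loud assumptions: ChangeJournal below is
a hypothetical stand-in for a watching daemon, not Watchman's actual API, and
record() would in practice be driven by inotify events rather than called by
hand. The point is only that the status query re-stats the paths the journal
names instead of walking the whole tree.

```python
import os


class ChangeJournal:
    """Hypothetical daemon state: paths touched since the last query."""

    def __init__(self):
        self._dirty = set()

    def record(self, path):
        # A real daemon would call this from its file-event handler.
        self._dirty.add(path)

    def drain(self):
        # Hand out the accumulated paths and start a fresh set.
        dirty, self._dirty = self._dirty, set()
        return dirty


def status_with_journal(journal, known):
    """Re-stat only journaled paths; known maps path -> (mtime_ns, size)."""
    modified = []
    for path in sorted(journal.drain()):
        try:
            st = os.lstat(path)
            current = (st.st_mtime_ns, st.st_size)
        except FileNotFoundError:
            current = None  # the path was deleted
        if known.get(path) != current:
            modified.append(path)
    return modified
```

The work per query is proportional to the number of touched paths, at the
price of keeping the journal process alive and trustworthy.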

It should be possible to teach Git to do similar things, and IINM
there are (and have previously been) several attempts to do similar
things in Git, e.g.:

 - http://thread.gmane.org/gmane.comp.version-control.git/240339

 - http://thread.gmane.org/gmane.comp.version-control.git/217817

I haven't looked closely at these attempts (it is not my itch to
scratch), and I don't know if/how they would work on top of Watchman, but
in principle I don't see why Git shouldn't be able to leverage
Watchman the same way Mercurial does.


...Johan

-- 
Johan Herland, <johan@herland.net>
www.herland.net


* Re: question about: Facebook makes Mercurial faster than Git
  2014-03-10 11:42   ` Dennis Luehring
  2014-03-10 12:10     ` Johan Herland
@ 2014-03-10 14:18     ` Karsten Blees
  1 sibling, 0 replies; 12+ messages in thread
From: Karsten Blees @ 2014-03-10 14:18 UTC (permalink / raw)
  To: Dennis Luehring; +Cc: Git

Am 10.03.2014 12:42, schrieb Dennis Luehring:
> Am 10.03.2014 12:28, schrieb demerphq:
>> I had the impression, and I would not be surprised if they had the
>> impression that the git development community is relatively
>> unconcerned about performance issues on larger repositories.
> 
> so the question is if the git community is interested in beeing competive in such
> large scale scenarios - something what mercurial seems to be now out of the box
> 

The hgwatchman site claims (https://bitbucket.org/facebook/hgwatchman)

"On a real-world repository with over 200,000 files, hg status normally takes over 3 seconds. With hgwatchman it takes under 0.6 seconds."

There have been a few performance improvements in git status to support such large repositories. I just re-checked git status performance with the WebKit repo (~200k files):

Linux (with core.preloadIndex)
git status -uall: 0.620s
git status -uno : 0.255s

Windows (with core.preloadIndex and core.fscache)
git status -uall: 1.006s
git status -uno : 0.695s

Of course, for more reliable benchmark data, you'd have to compare the same repo on the same platform. But at first glance, it seems that mercurial with the hgwatchman extension may be as fast as git is out of the box, not the other way around.

This comes at the cost of running a background daemon, which may slow down the entire system. E.g. if the daemon activates whenever the compiler creates a .o file, it will probably slow down build performance.

Note that hgwatchman doesn't support Windows, so git is probably much faster there.
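
For anyone who wants to repeat this kind of measurement on their own
repository, a small timing sketch follows. The helper names time_command(),
status_commands() and report() are invented for this example; the git options
used (-c core.preloadIndex=true, status -uall, status -uno) are the ones
mentioned above. core.fscache is left out because it only exists in the
Windows port.

```python
import subprocess
import sys
import time


def time_command(argv, cwd=None, repeat=3):
    """Run argv `repeat` times and return the best wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        subprocess.run(argv, cwd=cwd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        best = min(best, time.perf_counter() - start)
    return best


def status_commands():
    """The two status variants compared above, with preloadIndex enabled."""
    base = ["git", "-c", "core.preloadIndex=true", "status"]
    return {"-uall": base + ["-uall"], "-uno": base + ["-uno"]}


def report(repo):
    """Print one timing line per variant for the repository at `repo`."""
    for label, argv in status_commands().items():
        print(f"git status {label}: {time_command(argv, cwd=repo):.3f}s")
```

Calling report("/path/to/repo") prints one line per variant. Taking the best of
several runs mostly measures the warm-cache case, which is what the numbers
above appear to be.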


* Re: question about: Facebook makes Mercurial faster than Git
  2014-03-10 12:10     ` Johan Herland
@ 2014-03-10 14:48       ` Michael Haggerty
  0 siblings, 0 replies; 12+ messages in thread
From: Michael Haggerty @ 2014-03-10 14:48 UTC (permalink / raw)
  To: Johan Herland; +Cc: Dennis Luehring, Git

On 03/10/2014 01:10 PM, Johan Herland wrote:
> It should be possible to teach Git to do similar things, and IINM
> there are (and have previously been) several attempts to do similar
> things in Git, e.g.:
> 
>  - http://thread.gmane.org/gmane.comp.version-control.git/240339
> 
>  - http://thread.gmane.org/gmane.comp.version-control.git/217817
> 
> I haven't looked closely at these attempts (it is not my scratch to
> itch), and I don't know if/how they would work on top of Watchman, but
> in principle I don't see why Git shouldn't be able to leverage
> Watchman the same way Mercurial does.

This touches on the most important thing that we should take to heart
from this episode:

Of course Facebook could have modified either Git or Mercurial to do
what they want.  Why did they pick Mercurial?  The article seems to
claim that they were initially biased towards Git, but they chose
Mercurial because its code base is easier to modify.  This is a claim
that I can easily believe.

The two projects are almost exactly the same age.  The number of commits
in the two projects is similar.  Mercurial has had fewer contributors
active at any given time over its project lifetime.

But let's see how much code is in the main part of Mercurial vs. Git:

    $ find mercurial hgext \( -name '*.c' -o -name '*.py' \) -print |
          xargs cat | wc -l
    46164

    $ cat *.c *.h *.sh *.perl builtin/*.c | wc -l
    188530

These are just crude estimates and I hope I got the right directories
for Mercurial.  But, by these numbers, Git has 4 times as much code as
Mercurial.  That alone will go a long way to making Git harder to
modify.  I don't think that Git has anywhere near 4 times the features
of Mercurial.  Probably most of the difference can be explained by the
choice of implementation languages; 94% of the code in these hg
directories is Python, whereas 88% of Git's core code is C.
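
The same crude count can be sketched in Python, with the caveat that
count_lines() is a made-up helper and, unlike the Git command above, it
recurses into every subdirectory. It counts newline bytes, as wc -l does. The
ratio at the end uses only the two totals quoted above.

```python
import os


def count_lines(root, extensions):
    """Count newline bytes in files under root whose names end with one of extensions."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(tuple(extensions)):
                with open(os.path.join(dirpath, name), "rb") as f:
                    total += f.read().count(b"\n")
    return total


# Totals quoted above.
mercurial_lines = 46164
git_lines = 188530
ratio = git_lines / mercurial_lines  # roughly 4.08
```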

How can we make Git easier to hack (short of switching languages)?  Here
are my suggestions:

* Better function docstrings -- don't make developers have to read the
whole call stack to find out what a function does, or who owns the
memory that is passed around.

* More modularity -- more coherent and abstract APIs between different
parts of the system, and less pawing around in your neighbor's data
structures.

* Higher-level abstractions -- make more use of APIs like strbuf and
string_list as opposed to handling every malloc() and realloc() by hand.

I personally wish that we as a project would be more willing to spend a
few extra CPU microseconds to make our code easier to read and modify
and more robust.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/


* Re: question about: Facebook makes Mercurial faster than Git
  2014-03-10 10:13 ` David Lang
@ 2014-03-10 17:51   ` Ondřej Bílka
  2014-03-10 17:56     ` David Lang
  0 siblings, 1 reply; 12+ messages in thread
From: Ondřej Bílka @ 2014-03-10 17:51 UTC (permalink / raw)
  To: David Lang; +Cc: Dennis Luehring, git

On Mon, Mar 10, 2014 at 03:13:45AM -0700, David Lang wrote:
> On Mon, 10 Mar 2014, Dennis Luehring wrote:
> 
> >according to these blog posts
> >
> >http://www.infoq.com/news/2014/01/facebook-scaling-hg
> >https://code.facebook.com/posts/218678814984400/scaling-mercurial-at-facebook/
> >
> >mercurial "can" be faster then git
> >
> >but i don't found any reply from the git community if it is a real problem
> >or if there a ongoing (maybe git 2.0) changes to compete better in this case
> 
> As I understand this, the biggest part of what happened is that
> Facebook made a tweak to mercurial so that when it needs to know
> what files have changed in their massive tree, their version asks
> their special storage array, while git would have to look at it
> through the filesystem interface (by doing stat calls on the
> directories and files to see if anything has changed)
> 
That is mostly a kernel problem. Long ago a patch was proposed to add a
recursive mtime, so you could check which subtrees changed. If somebody
resurrected that patch, it would give a similar boost.

There are two issues that need to be handled. First, if you are concerned
about one mtime change causing a lot of updates: an application needs to
mark all directories it is interested in, and when we do an update we
unmark the directory. That way each directory is updated at most once per
application run.

The second problem is hard links, where probably the best course is to
keep a list of them and stat them separately.
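
A toy model of the marking scheme, to show why the update cost is bounded.
Everything here is hypothetical: Dir is an in-memory stand-in, not a kernel or
filesystem structure, and hard links are not modelled at all. The `updates`
counter exists only to make the "at most once per application run" property
visible.

```python
class Dir:
    """Hypothetical directory node with a 'marked' flag set by the application."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = {}
        self.marked = False  # True once the application has scanned this directory
        self.updates = 0     # number of propagation writes this directory received
        if parent is not None:
            parent.children[name] = self


def mark_tree(node):
    """Application side: declare interest in node and everything below it."""
    node.marked = True
    for child in node.children.values():
        mark_tree(child)


def file_changed(directory):
    """Filesystem side: walk upward, unmarking, and stop at the first unmarked directory."""
    node = directory
    while node is not None and node.marked:
        node.marked = False
        node.updates += 1
        node = node.parent


def dirty_dirs(node, prefix=""):
    """Application side: list directories whose subtree changed, re-marking them."""
    here = prefix + node.name
    if node.marked:
        return []  # nothing below this directory changed since the last run
    found = [here]
    for child in node.children.values():
        found += dirty_dirs(child, here + "/")
    node.marked = True
    return found
```

After mark_tree(root), any number of file_changed() calls touch each directory
at most once, and dirty_dirs(root) visits only the changed subtrees before
marking them again for the next run.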


* Re: question about: Facebook makes Mercurial faster than Git
  2014-03-10 17:51   ` Ondřej Bílka
@ 2014-03-10 17:56     ` David Lang
  2014-03-10 20:22       ` Martin Langhoff
  2014-03-11 14:23       ` Ondřej Bílka
  0 siblings, 2 replies; 12+ messages in thread
From: David Lang @ 2014-03-10 17:56 UTC (permalink / raw)
  To: Ondřej Bílka; +Cc: Dennis Luehring, git


On Mon, 10 Mar 2014, Ondřej Bílka wrote:

> On Mon, Mar 10, 2014 at 03:13:45AM -0700, David Lang wrote:
>> On Mon, 10 Mar 2014, Dennis Luehring wrote:
>>
>>> according to these blog posts
>>>
>>> http://www.infoq.com/news/2014/01/facebook-scaling-hg
>>> https://code.facebook.com/posts/218678814984400/scaling-mercurial-at-facebook/
>>>
>>> mercurial "can" be faster then git
>>>
>>> but i don't found any reply from the git community if it is a real problem
>>> or if there a ongoing (maybe git 2.0) changes to compete better in this case
>>
>> As I understand this, the biggest part of what happened is that
>> Facebook made a tweak to mercurial so that when it needs to know
>> what files have changed in their massive tree, their version asks
>> their special storage array, while git would have to look at it
>> through the filesystem interface (by doing stat calls on the
>> directories and files to see if anything has changed)
>>
> That is mostly a kernel problem. Long ago there was proposed patch to
> add a recursive mtime so you could check what subtrees changed. If
> somebody ressurected that patch it would gave similar boost.

btrfs could actually implement this efficiently, but for a lot of other
filesystems this could be very expensive. The question is whether it would be
enough of a win to make it a good choice for people who are doing a heavy git
workload as opposed to more generic uses.

there's also the issue of managed vs generated files, if you update the mtime 
all the way up the tree because a source file was compiled and a binary created, 
that will quickly defeat the value of the recursive mtime.

David Lang

> There are two issues that need to be handled, first if you are concerned
> about one mtime change doing lot of updates a application needs to mark
> all directories it is interested on, when we do update we unmark
> directory and by that we update each directory at most once per
> application run.
>
> Second problem were hard links where probably a best course is keep list
> of these and stat them separately.


* Re: question about: Facebook makes Mercurial faster than Git
  2014-03-10 17:56     ` David Lang
@ 2014-03-10 20:22       ` Martin Langhoff
  2014-03-11 14:23       ` Ondřej Bílka
  1 sibling, 0 replies; 12+ messages in thread
From: Martin Langhoff @ 2014-03-10 20:22 UTC (permalink / raw)
  To: David Lang; +Cc: Ondřej Bílka, Dennis Luehring, Git Mailing List

On Mon, Mar 10, 2014 at 1:56 PM, David Lang <david@lang.hm> wrote:
> there's also the issue of managed vs generated files, if you update the
> mtime all the way up the tree because a source file was compiled and a
> binary created, that will quickly defeat the value of the recursive mtime.

I think this points us again to an inotify-based strategy, where git
can put an event listener daemon which registers just the watchers it
needs, and filters the events on its own conditions.

The kernel and fs have no good way of knowing about this stuff.
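
As a sketch of "filters the events on its own conditions": the listener can
drop events for generated files before they ever reach the status logic. The
function name and the pattern list are assumptions for illustration, and plain
fnmatch patterns on the file name are much simpler than real gitignore rules.

```python
import fnmatch


def filter_events(paths, ignore_patterns):
    """Keep only event paths whose file name matches none of the ignore patterns."""
    kept = []
    for path in paths:
        name = path.rsplit("/", 1)[-1]
        if not any(fnmatch.fnmatchcase(name, pat) for pat in ignore_patterns):
            kept.append(path)
    return kept
```

With ignore_patterns set to ["*.o"], an event for src/main.o is discarded while
src/main.c passes through, so a compile no longer makes the tree look dirty.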

cheers,


m
-- 
 martin.langhoff@gmail.com
 -  ask interesting questions
 - don't get distracted with shiny stuff  - working code first
 ~ http://docs.moodle.org/en/User:Martin_Langhoff


* Re: question about: Facebook makes Mercurial faster than Git
  2014-03-10 17:56     ` David Lang
  2014-03-10 20:22       ` Martin Langhoff
@ 2014-03-11 14:23       ` Ondřej Bílka
  1 sibling, 0 replies; 12+ messages in thread
From: Ondřej Bílka @ 2014-03-11 14:23 UTC (permalink / raw)
  To: David Lang; +Cc: Dennis Luehring, git

On Mon, Mar 10, 2014 at 10:56:51AM -0700, David Lang wrote:
> On Mon, 10 Mar 2014, Ondřej Bílka wrote:
> 
> >On Mon, Mar 10, 2014 at 03:13:45AM -0700, David Lang wrote:
> >>On Mon, 10 Mar 2014, Dennis Luehring wrote:
> >>
> >>>according to these blog posts
> >>>
> >>>http://www.infoq.com/news/2014/01/facebook-scaling-hg
> >>>https://code.facebook.com/posts/218678814984400/scaling-mercurial-at-facebook/
> >>>
> >>>mercurial "can" be faster then git
> >>>
> >>>but i don't found any reply from the git community if it is a real problem
> >>>or if there a ongoing (maybe git 2.0) changes to compete better in this case
> >>
> >>As I understand this, the biggest part of what happened is that
> >>Facebook made a tweak to mercurial so that when it needs to know
> >>what files have changed in their massive tree, their version asks
> >>their special storage array, while git would have to look at it
> >>through the filesystem interface (by doing stat calls on the
> >>directories and files to see if anything has changed)
> >>
> >That is mostly a kernel problem. Long ago there was proposed patch to
> >add a recursive mtime so you could check what subtrees changed. If
> >somebody ressurected that patch it would gave similar boost.
> 
> btrfs could actually implement this efficiently, but for a lot of
> other filesysems this could be very expensive. The question is if it
> could be enough of a win to make it a good choice for people who are
> doing a heavy git workload as opposed to more generic uses.
>
Read the next paragraph for how to do that efficiently: a directory update
needs to be done only once between application runs. Also, there is no
overhead when it is not used (except if it makes headers bigger).
 
> there's also the issue of managed vs generated files, if you update
> the mtime all the way up the tree because a source file was compiled
> and a binary created, that will quickly defeat the value of the
> recursive mtime.
>
You could do the marking on a per-file basis. I am not sure that is needed,
as larger projects use makefiles to avoid recompiling everything, so a file
is probably recompiled because a source file in the same directory changed.
Also, if your compile time is five minutes, a half-second status would not
make much difference.

 
> 
> >There are two issues that need to be handled, first if you are concerned
> >about one mtime change doing lot of updates a application needs to mark
> >all directories it is interested on, when we do update we unmark
> >directory and by that we update each directory at most once per
> >application run.
> >
> >Second problem were hard links where probably a best course is keep list
> >of these and stat them separately.


* Re: question about: Facebook makes Mercurial faster than Git
  2014-03-10 11:28 ` demerphq
  2014-03-10 11:42   ` Dennis Luehring
@ 2014-03-14 12:58   ` Duy Nguyen
  1 sibling, 0 replies; 12+ messages in thread
From: Duy Nguyen @ 2014-03-14 12:58 UTC (permalink / raw)
  To: demerphq; +Cc: Dennis Luehring, Git

On Mon, Mar 10, 2014 at 6:28 PM, demerphq <demerphq@gmail.com> wrote:
> I had the impression, and I would not be surprised if they had the
> impression that the git development community is relatively
> unconcerned about performance issues on larger repositories.
>
> There have been other reports, which are difficult to keep track of
> without a bug tracking system, but the ones I know of are:
>
> Poor performance of git status with large number of excluded files and
> large repositories.

I thought this had been improved lately. I think we could do better
still, but my WIP is nowhere near ready for anybody's eyes.

> Poor performance, and breakage, on repositories with very large
> numbers of files in them.

index v5 and sparse checkout should help a bit. The ultimate solution,
though, is narrow clone, which is nowhere near finished. Well, if you
need all files present in the worktree, then narrow clone does not help
either.

Along the same lines, there is also poor performance on repos with a lot
of very large files. Junio's split-blob series was a start, but no one
picked it up, so I guess your impression was right.

> (Rebase for instance will break if you rebase a commit that contains a *lot* of files.)

Interesting. I guess it hits the shell's limitations? Roughly how many
files does it take to break it?

> Poor performance in protocol layer (and other places) with repos with
> large numbers of refs. (Maybe this is fixed, not sure.)

Ah.. no it's not. It's being stirred up again though, in both protocol
and ref backend.
-- 
Duy

