git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / mirror / code / Atom feed
* Significant performance waste in git-svn and friends
@ 2007-09-05 18:47 Mike Hommey
  2007-09-05 20:40 ` Junio C Hamano
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Mike Hommey @ 2007-09-05 18:47 UTC (permalink / raw)
  To: git

Hi,

Being a pervert abusing the way subversion doesn't deal with branches
and tags, I'm actually not a user of git-svn or git-svnimport, because
they just can't deal easily with my perversion. So I'm writing a script
to do the conversion for me, and since I also like to learn new things
when I'm coding, I'm writing it in ruby.

Anyways, one of the things I'm trying to convert is my svk repository
for debian packaging of xulrunner (so, a significant subset of the
mozilla tree), which doesn't involve a lot of revisions (around 280,
because I only imported releases or CVS snapshots), but involves a lot
of files (roughly 20k).

The first thing I noticed when twisting around the svk repo so that
git-svn could somehow import it a while ago, is that running git-svn
was in my case significantly slower than svnadmin dump | svnadmin load
(more than 2 times slower).

And now, with my own script, I got the same kind of "slowdown". So I
investigated it, and it didn't take long to realize that replacing
git-hash-object by a simple reimplementation in ruby was *way* faster.
git-hash-object being more than probably what you do the most when you
import a remote repository, it is not much of a surprise that forking
thousands of times is a huge performance waste.

So, just for the record, I did a lame hack of git-svn to see what kind
of speedup could happen in git-svn. You can find this lame hack as a
patch below. I did some tests (with a 1.5.2.1 release) and here are the
results, importing only the trunk (192 revisions), with no checkout, and
redirecting stdout to /dev/null:

original git-svn:
real    25m1.871s
user    8m51.593s
sys     12m31.659s

patched git-svn:
real    14m45.870s
user    7m31.928s
sys     4m1.047s

Some notes about the patch:
- I've not looked at the rest of the code to see if there was a way to
  get the size of the file so that SHA-1 sum and compression could be
  done in one pass and without copying the whole file in memory.
- The object creation in the .git/objects directory is not as safe as
  what git-hash-object does.

Some notes about git-svn:
- A few lines above the patched zone, the file is already read once to
  do the MD5 sum. It should be possible to do SHA-1, MD5 sums and
  deflate in just one pass.
- It might be worth testing if git-cat-file is called a lot. If so,
  implementing a simple git-cat-file equivalent that would work for
  unpacked objects could improve speed.

The same things obviously apply to git-cvsimport and other scripts
calling git-hash-object a lot.

Mike

diff --git a/git-svn.perl b/git-svn.perl
index d3c8cd0..202c228 100755
--- a/git-svn.perl
+++ b/git-svn.perl
@@ -2417,6 +2417,8 @@ use warnings;
 use Carp qw/croak/;
 use IO::File qw//;
 use Digest::MD5;
+use Digest::SHA1;
+use Compress::Zlib;
 
 # file baton members: path, mode_a, mode_b, pool, fh, blob, base
 sub new {
@@ -2603,15 +2605,26 @@ sub close_file {
 			$buf eq 'link ' or die "$path has mode 120000",
 			                       "but is not a link\n";
 		}
-		defined(my $pid = open my $out,'-|') or die "Can't fork: $!\n";
-		if (!$pid) {
-			open STDIN, '<&', $fh or croak $!;
-			exec qw/git-hash-object -w --stdin/ or croak $!;
+		my $size = 0;
+		my $buf = "";
+		while (my $read = sysread $fh, my $tmp, 4096) {
+			$size += $read;
+			$buf .= $tmp;
 		}
-		chomp($hash = do { local $/; <$out> });
-		close $out or croak $!;
+		my $sha1 = Digest::SHA1->new;
+		$sha1->add("blob $size\0");
+		$sha1->add($buf);
+		$hash = $sha1->hexdigest;
 		close $fh or croak $!;
 		$hash =~ /^[a-f\d]{40}$/ or die "not a sha1: $hash\n";
+		my $blob_dir = "$ENV{GIT_DIR}/objects/" . substr($hash, 0, 2);
+		my $blob_file = $blob_dir . "/" . substr($hash, 2);
+		if (! -f $blob_file) {
+			mkdir $blob_dir unless -d $blob_dir;
+			open BLOB, ">$blob_file";
+			print BLOB compress("blob $size\0" . $buf);
+			close BLOB;
+		}
 		close $fb->{base} or croak $!;
 	} else {
 		$hash = $fb->{blob} or die "no blob information\n";

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Significant performance waste in git-svn and friends
  2007-09-05 18:47 Significant performance waste in git-svn and friends Mike Hommey
@ 2007-09-05 20:40 ` Junio C Hamano
  2007-09-05 21:19   ` David Kastrup
                     ` (2 more replies)
  2007-09-06  7:04 ` Eric Wong
  2007-09-07  5:41 ` Mike Hommey
  2 siblings, 3 replies; 11+ messages in thread
From: Junio C Hamano @ 2007-09-05 20:40 UTC (permalink / raw)
  To: Mike Hommey; +Cc: git

Mike Hommey <mh@glandium.org> writes:

> The same things obviously apply to git-cvsimport and other scripts
> calling git-hash-object a lot.

I *obviously* hate this patch, as it makes this Porcelain
command to be aware of the internal representation too much.

I wonder if letting fast-import handle the object creation is an
option, though.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Significant performance waste in git-svn and friends
  2007-09-05 20:40 ` Junio C Hamano
@ 2007-09-05 21:19   ` David Kastrup
  2007-09-06  1:07     ` Patrick Doyle
  2007-09-06  2:19     ` Shawn O. Pearce
  2007-09-06  2:16   ` Shawn O. Pearce
  2007-09-06  5:52   ` Mike Hommey
  2 siblings, 2 replies; 11+ messages in thread
From: David Kastrup @ 2007-09-05 21:19 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Mike Hommey, git

Junio C Hamano <gitster@pobox.com> writes:

> Mike Hommey <mh@glandium.org> writes:
>
>> The same things obviously apply to git-cvsimport and other scripts
>> calling git-hash-object a lot.
>
> I *obviously* hate this patch, as it makes this Porcelain
> command to be aware of the internal representation too much.
>
> I wonder if letting fast-import handle the object creation is an
> option, though.

I think it would be saner to give git-hash-object an operation mode
that makes it usable as a pipe-controlled daemon, so that one needs
not fork and exec for interning another object.  That way, porcelain
commands could keep one bidirectional pipe (feed object type and
source and whether to use -w into git-hash-project, receive object id)
to git-hash-object around until they finish.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Significant performance waste in git-svn and friends
  2007-09-05 21:19   ` David Kastrup
@ 2007-09-06  1:07     ` Patrick Doyle
  2007-09-06  2:19     ` Shawn O. Pearce
  1 sibling, 0 replies; 11+ messages in thread
From: Patrick Doyle @ 2007-09-06  1:07 UTC (permalink / raw)
  To: git

I'll add my hearty vote of support, and perhaps even a patch or two,
to anything that would make git-svn faster.  Just last week, I gave up
waiting for it to complete the import of a (rather large) uClinux
distribution, falling back on SVN.  I'm looking forward to trying it
again with Mike's patch, or perhaps even implementing and trying out
David's idea.

Hmmm.... I'm supposed to be on a couple of planes for a rather longish
period of time next week.  Perhaps I'll give it a shot.  (But don't
count on much -- the laptop battery is pretty shot nowadays and, given
the nature of the rest of the participitants on this list, it'll
probably be implemented and tested by the time I pack up for the
trip.)

--wpd

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Significant performance waste in git-svn and friends
  2007-09-05 20:40 ` Junio C Hamano
  2007-09-05 21:19   ` David Kastrup
@ 2007-09-06  2:16   ` Shawn O. Pearce
  2007-09-06  5:52   ` Mike Hommey
  2 siblings, 0 replies; 11+ messages in thread
From: Shawn O. Pearce @ 2007-09-06  2:16 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Mike Hommey, git

Junio C Hamano <gitster@pobox.com> wrote:
> Mike Hommey <mh@glandium.org> writes:
> 
> > The same things obviously apply to git-cvsimport and other scripts
> > calling git-hash-object a lot.
> 
> I *obviously* hate this patch, as it makes this Porcelain
> command to be aware of the internal representation too much.

Me too.  I think its bad enough that fast-import knows the
internal representation of a packfile.  And its plumbing!
 
> I wonder if letting fast-import handle the object creation is an
> option, though.

I agree.  This is *exactly* what fast-import was built for.  It
really would be the better tool to use here.  Much better than
reinventing the wheel.  Its already been invented twice for you
in Git (normal object creation path and fast-import path).  No
need to do it a third time.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Significant performance waste in git-svn and friends
  2007-09-05 21:19   ` David Kastrup
  2007-09-06  1:07     ` Patrick Doyle
@ 2007-09-06  2:19     ` Shawn O. Pearce
  1 sibling, 0 replies; 11+ messages in thread
From: Shawn O. Pearce @ 2007-09-06  2:19 UTC (permalink / raw)
  To: David Kastrup; +Cc: Junio C Hamano, Mike Hommey, git

David Kastrup <dak@gnu.org> wrote:
> Junio C Hamano <gitster@pobox.com> writes:
> 
> > Mike Hommey <mh@glandium.org> writes:
> >
> >> The same things obviously apply to git-cvsimport and other scripts
> >> calling git-hash-object a lot.
> >
> > I wonder if letting fast-import handle the object creation is an
> > option, though.
> 
> I think it would be saner to give git-hash-object an operation mode
> that makes it usable as a pipe-controlled daemon, so that one needs
> not fork and exec for interning another object.  That way, porcelain
> commands could keep one bidirectional pipe (feed object type and
> source and whether to use -w into git-hash-project, receive object id)
> to git-hash-object around until they finish.

Aside from getting the hashes back from fast-import, that's what
fast-import is for.  I could also make it disable writing.  Hmm.
Junio and I were just talking about making fast-import send the
marks table back out on stdout.  This would make it easier for a
frontend process to stream a whole bunch of objects into the process,
then get back all of their SHA-1s.  Less context switches and more
parallel operation.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Significant performance waste in git-svn and friends
  2007-09-05 20:40 ` Junio C Hamano
  2007-09-05 21:19   ` David Kastrup
  2007-09-06  2:16   ` Shawn O. Pearce
@ 2007-09-06  5:52   ` Mike Hommey
  2 siblings, 0 replies; 11+ messages in thread
From: Mike Hommey @ 2007-09-06  5:52 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Wed, Sep 05, 2007 at 01:40:42PM -0700, Junio C Hamano <gitster@pobox.com> wrote:
> Mike Hommey <mh@glandium.org> writes:
> 
> > The same things obviously apply to git-cvsimport and other scripts
> > calling git-hash-object a lot.
> 
> I *obviously* hate this patch, as it makes this Porcelain
> command to be aware of the internal representation too much.

The patch was not supposed to be applied. I said it was lame ;)
It was more of a proof of concept.

Anyways, thinking a bit more about it, I was wondering if it wouldn't be
a good idea to have Git.pm have a "native" implementation (by native I
mean a .so module) for low-level plumbing tools such as hash-object,
cat-file and such.

Obviously, reinventing the wheel is not good, so this native
implementation would be using a "git library" API, such as what has
been done under SoC (though I don't know if this API exposes low-level
plumbing functions)

> I wonder if letting fast-import handle the object creation is an
> option, though.

It could, probably. The reason I didn't use it is that it was way
quicker to hack a 10 lines patch to create the blobs by hand than it
would have been to fork a fast-import object at the correct place during
git-svn initialization and piping to it at the appropriate times.

My goal was only to check how faster this would make it not to fork a
git-hash-object per blob.

Mike

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Significant performance waste in git-svn and friends
  2007-09-05 18:47 Significant performance waste in git-svn and friends Mike Hommey
  2007-09-05 20:40 ` Junio C Hamano
@ 2007-09-06  7:04 ` Eric Wong
  2007-09-07  4:55   ` Shawn O. Pearce
  2007-09-07  5:41 ` Mike Hommey
  2 siblings, 1 reply; 11+ messages in thread
From: Eric Wong @ 2007-09-06  7:04 UTC (permalink / raw)
  To: Mike Hommey; +Cc: git

Mike Hommey <mh@glandium.org> wrote:
> Hi,

Hi Mike,

> Being a pervert abusing the way subversion doesn't deal with branches
> and tags, I'm actually not a user of git-svn or git-svnimport, because
> they just can't deal easily with my perversion. So I'm writing a script
> to do the conversion for me, and since I also like to learn new things
> when I'm coding, I'm writing it in ruby.
> 
> Anyways, one of the things I'm trying to convert is my svk repository
> for debian packaging of xulrunner (so, a significant subset of the
> mozilla tree), which doesn't involve a lot of revisions (around 280,
> because I only imported releases or CVS snapshots), but involves a lot
> of files (roughly 20k).
> 
> The first thing I noticed when twisting around the svk repo so that
> git-svn could somehow import it a while ago, is that running git-svn
> was in my case significantly slower than svnadmin dump | svnadmin load
> (more than 2 times slower).
> 
> And now, with my own script, I got the same kind of "slowdown". So I
> investigated it, and it didn't take long to realize that replacing
> git-hash-object by a simple reimplementation in ruby was *way* faster.
> git-hash-object being more than probably what you do the most when you
> import a remote repository, it is not much of a surprise that forking
> thousands of times is a huge performance waste.

I haven't looked at the times in a while, but I suspect that exec()
is the (much bigger) culprit.

Since I usually import off remote repositories, so I notice network
latency way before I notice local performance problems with git-svn.

> So, just for the record, I did a lame hack of git-svn to see what kind
> of speedup could happen in git-svn. You can find this lame hack as a
> patch below. I did some tests (with a 1.5.2.1 release) and here are the
> results, importing only the trunk (192 revisions), with no checkout, and
> redirecting stdout to /dev/null:
> 
> original git-svn:
> real    25m1.871s
> user    8m51.593s
> sys     12m31.659s
> 
> patched git-svn:
> real    14m45.870s
> user    7m31.928s
> sys     4m1.047s

That's awesome.

> - It might be worth testing if git-cat-file is called a lot. If so,
>   implementing a simple git-cat-file equivalent that would work for
>   unpacked objects could improve speed.

IIRC git-cat-file is called a lot.  Every modified file needs the
original cat-ed to make use of the delta.

> The same things obviously apply to git-cvsimport and other scripts
> calling git-hash-object a lot.

Making git-svn use fast-import would be very nice.  I've got a bunch
of other git-svn things that I need to work on, but having git-svn
converted to use fast-import would be nice.  Or allowing Git.pm
to access more of the git internals...

However, how well/poorly would fast-import work for incremental
fetches throughout the day?

-- 
Eric Wong

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Significant performance waste in git-svn and friends
  2007-09-06  7:04 ` Eric Wong
@ 2007-09-07  4:55   ` Shawn O. Pearce
  2007-09-07  6:28     ` Steven Grimm
  0 siblings, 1 reply; 11+ messages in thread
From: Shawn O. Pearce @ 2007-09-07  4:55 UTC (permalink / raw)
  To: Eric Wong; +Cc: Mike Hommey, git

Eric Wong <normalperson@yhbt.net> wrote:
> Making git-svn use fast-import would be very nice.  I've got a bunch
> of other git-svn things that I need to work on, but having git-svn
> converted to use fast-import would be nice.  Or allowing Git.pm
> to access more of the git internals...
> 
> However, how well/poorly would fast-import work for incremental
> fetches throughout the day?

It would work just fine.  git-p4 uses fast-import and runs it in
incremental mode all of the time.

What will happen is you will get a packfile per git-svn fetch, so
if you fetch 5 times that day you will get 5 packfiles that day.
But you could also get 5 packfiles from git-fetch if each of those
fetches brought in 100 or more new objects.  So it really is not
that big of a deal.

At some point the number of packfiles gets out of control, but so
does the number of loose objects.  Repacking is the obvious fix in
both cases.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Significant performance waste in git-svn and friends
  2007-09-05 18:47 Significant performance waste in git-svn and friends Mike Hommey
  2007-09-05 20:40 ` Junio C Hamano
  2007-09-06  7:04 ` Eric Wong
@ 2007-09-07  5:41 ` Mike Hommey
  2 siblings, 0 replies; 11+ messages in thread
From: Mike Hommey @ 2007-09-07  5:41 UTC (permalink / raw)
  To: git

On Wed, Sep 05, 2007 at 08:47:10PM +0200, Mike Hommey <mh@glandium.org> wrote:
> Some notes about git-svn:
> - A few lines above the patched zone, the file is already read once to
>   do the MD5 sum. It should be possible to do SHA-1, MD5 sums and
>   deflate in just one pass.

FWIW, I tried, and it doesn't make it much quicker to do everything in one pass.
99% of the speedup comes from not using git-hash-object.

Mike

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Significant performance waste in git-svn and friends
  2007-09-07  4:55   ` Shawn O. Pearce
@ 2007-09-07  6:28     ` Steven Grimm
  0 siblings, 0 replies; 11+ messages in thread
From: Steven Grimm @ 2007-09-07  6:28 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Eric Wong, Mike Hommey, git

Shawn O. Pearce wrote:
> What will happen is you will get a packfile per git-svn fetch, so
> if you fetch 5 times that day you will get 5 packfiles that day.
> But you could also get 5 packfiles from git-fetch if each of those
> fetches brought in 100 or more new objects.  So it really is not
> that big of a deal.
>
> At some point the number of packfiles gets out of control, but so
> does the number of loose objects.  Repacking is the obvious fix in
> both cases.

This dovetails nicely with the "git gc --auto" work in progress, 
particularly the recent patch to pack small packfiles together when 
there get to be too many of them; run that at the end of every git-svn 
fetch and the problem goes away.

-Steve

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2007-09-07  6:28 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-09-05 18:47 Significant performance waste in git-svn and friends Mike Hommey
2007-09-05 20:40 ` Junio C Hamano
2007-09-05 21:19   ` David Kastrup
2007-09-06  1:07     ` Patrick Doyle
2007-09-06  2:19     ` Shawn O. Pearce
2007-09-06  2:16   ` Shawn O. Pearce
2007-09-06  5:52   ` Mike Hommey
2007-09-06  7:04 ` Eric Wong
2007-09-07  4:55   ` Shawn O. Pearce
2007-09-07  6:28     ` Steven Grimm
2007-09-07  5:41 ` Mike Hommey

Code repositories for project(s) associated with this inbox:

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).