git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Michael Haggerty <mhagger@alum.mit.edu>
To: Jeff King <peff@peff.net>
Cc: Martin Fick <mfick@codeaurora.org>,
	git discussion list <git@vger.kernel.org>,
	Junio C Hamano <gitster@pobox.com>
Subject: Re: remove_duplicates() in builtin/fetch-pack.c is O(N^2)
Date: Thu, 24 May 2012 14:05:16 +0200	[thread overview]
Message-ID: <4FBE23FC.5070405@alum.mit.edu> (raw)
In-Reply-To: <20120522173355.GC11600@sigill.intra.peff.net>

On 05/22/2012 07:33 PM, Jeff King wrote:
> On Tue, May 22, 2012 at 03:28:32PM +0200, Michael Haggerty wrote:
>
>>> When I try it with both 'next' and v1.7.10, I see that the latter is
>>> much faster.  I did my tests with a warm cache, but the interesting
>>> number is the CPU time, which is quite different.
>>
>> I cannot reproduce anything as big as the performance regression that
>> you see.  I did find a regression 9.5 s ->  10.1 s caused by
>>
>> 5fa044184 find_containing_dir(): use strbuf in implementation of this
>> function
>>
>> It is fixed by the patch that I just sent to the mailing list [1],
>> which sizes the strbuf in that function to strlen(refname) instead of
>> PATH_MAX.  Since your experiments suggest that the performance
>> regression is related to the size of the repository contents, it
>> could be that the same test produces more memory pressure on your
>> system and therefore a larger effect.  Please try the patch and tell
>> me if it fixes the problem for you.
>
> That patch drops about a second off of the slow case, but the main
> problem still remains. Just to be sure we are both doing the exact same
> thing, here is a complete reproduction recipe:
>
>    GIT=/path/to/your/git/checkout
>    RAILS=/path/to/unpacked/rails.git
>
>    cd $GIT&&
>    git checkout 432ad41e60cedb87ceec446ab034d46a53f5f9d8^&&
>    make&&
>
>    cd $RAILS&&
>    time $GIT/bin-wrappers/git fetch . refs/*:refs/*&&
>
>    cd $GIT&&
>    git checkout 432ad41e60cedb87ceec446ab034d46a53f5f9d8&&
>    make&&
>
>    cd $RAILS&&
>    time $GIT/bin-wrappers/git fetch . refs/*:refs/*
>
> produces:
>
>    [before]
>    real    0m9.128s
>    user    0m9.369s
>    sys     0m0.976s
>
>    [after]
>    real    0m15.926s
>    user    0m16.181s
>    sys     0m0.984s
>
> I don't think memory pressure is involved. The git process maxes out at
> ~300M, and this machine has 7.5G available.
>
> I wonder why we are getting different results. Could it be compilation
> options? As I mentioned, I compile with -O0, but I got similar results
> with -O3.

Thanks for the idiot-proof reproduction steps.  I don't know what I was 
doing differently last time, but now I can reproduce your slowdown.

The thing that triggers the problem is a reference directory 
("refs/remotes/8514/pull/") with 3810 sub-namespaces 
("refs/remotes/8514/pull/1091", "refs/remotes/8514/pull/1092", etc), 
each subnamespace containing only one or two references.  It wouldn't be 
a problem having so many *references* in a namespace, because references 
are just added to the end of the directory unsorted, and typically only 
sorted once after all of them have been added.  But every time that a 
sub-namespace is accessed, the namespace has to be searched to see if 
that sub-namespace already exists.  The searching requires the namespace 
to be sorted.  So, for example, when adding the following sequence of 
references:

1. refs/remotes/8514/pull/1091/head
2. refs/remotes/8514/pull/1091/merge
3. refs/remotes/8514/pull/1092/head
4. refs/remotes/8514/pull/1092/merge
5. refs/remotes/8514/pull/1093/head
6. refs/remotes/8514/pull/1093/merge

At step 1, sub-namespace "refs/remotes/8514/pull/1091/" is added to 
"refs/remotes/8514/pull/".  This makes the code think that 
"refs/remotes/8514/pull/" is unsorted.

Step 2 is not problematic; the new references is added to 
"refs/remotes/8514/pull/1091/", but adding a reference doesn't require 
the ref_dir to be sorted.

At step 3, namespace "refs/remotes/8514/pull/" is first checked to see 
if sub-namespace "refs/remotes/8514/pull/1092/" already exists.  This 
search requires namespace "refs/remotes/8514/pull/" to be sorted because 
step 1 caused it to be considered unsorted.

Again at step 5, namespace "refs/remotes/8514/pull/" needs to be sorted, 
and so on every time a subnamespace is added.

I will submit a patch shortly.

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

  reply	other threads:[~2012-05-24 12:12 UTC|newest]

Thread overview: 46+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-05-21  8:13 remove_duplicates() in builtin/fetch-pack.c is O(N^2) Michael Haggerty
2012-05-21  9:09 ` Junio C Hamano
2012-05-21  9:42 ` demerphq
2012-05-21 17:45 ` Jeff King
2012-05-21 22:14   ` Jeff King
2012-05-21 22:17     ` [PATCH 1/5] fetch-pack: sort incoming heads Jeff King
2012-05-22 20:08       ` Junio C Hamano
2012-05-22 20:23         ` Jeff King
2012-05-24  6:04           ` Jeff King
2012-05-21 22:17     ` [PATCH 2/5] fetch-pack: avoid quadratic behavior in remove_duplicates Jeff King
2012-05-21 22:19     ` [PATCH 3/5] add sorting infrastructure for list refs Jeff King
2012-05-21 22:19     ` [PATCH 4/5] fetch-pack: sort the list of incoming refs Jeff King
2012-05-21 22:23     ` [PATCH 5/5] fetch-pack: avoid quadratic loop in filter_refs Jeff King
2012-05-22 20:16       ` Junio C Hamano
2012-05-21 23:52     ` remove_duplicates() in builtin/fetch-pack.c is O(N^2) Jeff King
2012-05-22  0:07       ` Jeff King
2012-05-22  3:59       ` Michael Haggerty
2012-05-22  4:11         ` Jeff King
2012-05-22  7:15           ` Michael Haggerty
2012-05-22  7:37             ` Jeff King
2012-05-22 13:28               ` Michael Haggerty
2012-05-22 17:33                 ` Jeff King
2012-05-24 12:05                   ` Michael Haggerty [this message]
2012-05-25  0:17     ` Martin Fick
2012-05-25  0:39       ` Jeff King
2012-05-25  0:54         ` Martin Fick
2012-05-25  1:04           ` Jeff King
2012-05-25  1:32             ` Martin Fick
2012-05-25  6:50               ` Jeff King
2012-05-22 12:18   ` Nguyen Thai Ngoc Duy
2012-05-22 13:35     ` Michael Haggerty
2012-05-22 17:01     ` Jeff King
2012-05-22 17:35       ` Junio C Hamano
2012-05-22 17:46         ` Jeff King
2012-05-24  4:54         ` Michael Haggerty
2012-05-23  1:20       ` Nguyen Thai Ngoc Duy
2012-05-22 16:16   ` Junio C Hamano
2012-05-21 18:15 ` Martin Fick
2012-05-21 19:41   ` Jeff King
2012-05-21 22:08     ` Junio C Hamano
2012-05-21 22:24       ` Jeff King
2012-05-22  5:51     ` Martin Fick
2012-05-22 18:21       ` Jeff King
2012-05-22 22:19         ` Martin Fick
2012-05-22 23:23           ` Junio C Hamano
2012-05-23  0:46             ` Martin Fick

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4FBE23FC.5070405@alum.mit.edu \
    --to=mhagger@alum.mit.edu \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=mfick@codeaurora.org \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).