How to still kill git fetch with too many refs

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* How to still kill git fetch with too many refs
@ 2013-07-02  3:02 Martin Fick
  2013-07-02  4:07 ` Jeff King
  2013-07-02  9:24 ` Michael Haggerty
  0 siblings, 2 replies; 15+ messages in thread
From: Martin Fick @ 2013-07-02  3:02 UTC (permalink / raw)
  To: git

I have often reported problems with git fetch when there are 
many refs in a repo, and I have been pleasantly surprised 
how many problems I reported were so quickly fixed. :) With 
time, others have created various synthetic test cases to 
ensure that git can handle many many refs.  A simple 
synthetic test case with 1M refs all pointing to the same 
sha1 seems to be easily handled by git these days.  However, 
in our experience with our internal git repo, we still have 
performance issues related to having too many refs, in our 
kernel/msm instance we have around 400K.

When I tried the simple synthetic test case and could not 
reproduce bad results, so I tried something just a little 
more complex and was able to get atrocious results!!! 
Basically, I generate a packed-refs files with many refs 
which each point to a different sha1.  To get a list of 
valid but unique sha1s for the repo, I simply used rev-list.  
The result, a copy of linus' repo with a million unique 
valid refs and a git fetch of a single updated ref taking a 
very long time (55mins and it did not complete yet).  Note, 
with 100K refs it completes in about 2m40s.  It is likely 
not linear since 2m40s * 10 would be ~26m (but the 
difference could also just be how the data in the sha1s are 
ordered).

Here is my small reproducible test case for this issue:

git clone 
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
cp -rp linux linux.1Mrefs-revlist

cd linux
echo "Hello" > hello ; git add hello ; git ci -a -m 'hello'
cd ..

cd linux.1Mrefs-revlist
git rev-list HEAD | for nn in $(seq 0 100) ; do for c in 
$(seq 0 10000) ; do  read sha ; echo $sha refs/c/$nn/$c$nn ; 
done ; done > .git/packed-refs

time git fetch file:///$(dirname $PWD)/linux 
refs/heads/master

Any insights as to why it is so slow, and how we could 
possibly speed it up?

Thanks,

-Martin

PS: My tests were performed with git version 1.8.2.1 on 
linux 2.6.32-37-generic #81-Ubuntu SMP 

-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: How to still kill git fetch with too many refs
  2013-07-02  3:02 How to still kill git fetch with too many refs Martin Fick
@ 2013-07-02  4:07 ` Jeff King
  2013-07-02  4:41   ` Jeff King
  2013-07-02  9:24 ` Michael Haggerty
  1 sibling, 1 reply; 15+ messages in thread
From: Jeff King @ 2013-07-02  4:07 UTC (permalink / raw)
  To: Martin Fick; +Cc: git

On Mon, Jul 01, 2013 at 09:02:31PM -0600, Martin Fick wrote:

> A simple synthetic test case with 1M refs all pointing to the same
> sha1 seems to be easily handled by git these days.  However, in our
> experience with our internal git repo, we still have performance
> issues related to having too many refs, in our kernel/msm instance we
> have around 400K.

I'm not too surprised. There's O(n^2) behavior in fetch-pack's
mark_complete function, as it adds each of the N ref tips to a
commit_list, sorting by date (so each insertion is O(n)).

I posted two alternative patches in May of 2011.  The first simply
avoids adding duplicate objects, which is simple and covers many
real-world cases (e.g., an "alternates" repository which has a bunch of
copies of the same tags, one per fork). The second one switches the
commit_list out for a heap-based priority queue.

We ended up taking the first (as ea5f220), since it was trivial and
obviously correct, but didn't bother with the second since:

  1. There had been no real-world reports of it.

  2. While in theory a priority queue implementation would be used in
     other spots, too, it ended up being a pain to use it, as most of
     the callers wanted list-like splicing.

You can see the original here:

  http://thread.gmane.org/gmane.comp.version-control.git/174003/focus=174005

Though it probably doesn't apply cleanly anymore.  However, I've kept it
rebased over the years at:

  git://github.com/peff/git.git jk/fast-commit-list

Junio recently added a priority queue implementation in b4b594a
(prio-queue: priority queue of pointers to structs, 2013-06-06), which
is currently in next. So a modern version of that series would build on
top of that, rather than my priority queue.

And yet another alternative would be to keep the list unsorted during
the mark_complete calls, and then sort it at the end. Like this:

diff --git a/fetch-pack.c b/fetch-pack.c
index abe5ffb..4df8abd 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -505,7 +505,7 @@ static int mark_complete(const char *refname, const unsigned char *sha1, int fla
 		struct commit *commit = (struct commit *)o;
 		if (!(commit->object.flags & COMPLETE)) {
 			commit->object.flags |= COMPLETE;
-			commit_list_insert_by_date(commit, &complete);
+			commit_list_insert(commit, &complete);
 		}
 	}
 	return 0;
@@ -622,6 +622,7 @@ static int everything_local(struct fetch_pack_args *args,
 	if (!args->depth) {
 		for_each_ref(mark_complete, NULL);
 		for_each_alternate_ref(mark_alternate_complete, NULL);
+		commit_list_sort_by_date(&complete);
 		if (cutoff)
 			mark_recent_complete_commits(args, cutoff);
 	}

(If you're wondering why we didn't do this trivial bit at the time, it
was because back then we did not yet have the René's nice linked-list
mergesort that backs commit_list_sort_by_date).

> The result, a copy of linus' repo with a million unique 
> valid refs and a git fetch of a single updated ref taking a 
> very long time (55mins and it did not complete yet).  Note, 
> with 100K refs it completes in about 2m40s.  It is likely 
> not linear since 2m40s * 10 would be ~26m (but the 
> difference could also just be how the data in the sha1s are 
> ordered).

That sounds like the O(n^2) problem. My timings back then with 100K refs
were 1-2 minutes. Does the patch above fix it for you?

-Peff

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: How to still kill git fetch with too many refs
  2013-07-02  4:07 ` Jeff King
@ 2013-07-02  4:41   ` Jeff King
  2013-07-02  5:01     ` Jeff King
  0 siblings, 1 reply; 15+ messages in thread
From: Jeff King @ 2013-07-02  4:41 UTC (permalink / raw)
  To: Martin Fick; +Cc: git

On Tue, Jul 02, 2013 at 12:07:58AM -0400, Jeff King wrote:

> And yet another alternative would be to keep the list unsorted during
> the mark_complete calls, and then sort it at the end. Like this:

Amusingly, I seem to have posted the exact same patch last year:

  http://article.gmane.org/gmane.comp.version-control.git/194939

but never polished it up. It does cut a little time off of the initial
ref-loading that fetch-pack does (especially when we do not end up
transferring any objects at all). But it does not solve your problem.

I replicated your test setup, and the problem is that we have many
common objects on both sides during the ref negotiation. So we end up in
rev_list_push for each one, which has the same O(n^2) behavior.
Switching it to just sort at the end is not trivial; we first insert all
of the objects, but then we actually walk the parents, pushing onto the
list as we go. So I think we'd want a better data structure (like a
priority queue).

-Peff

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: How to still kill git fetch with too many refs
  2013-07-02  4:41   ` Jeff King
@ 2013-07-02  5:01     ` Jeff King
  2013-07-02  5:19       ` Junio C Hamano
  2013-07-02 17:52       ` How to still kill git fetch with too many refs Brandon Casey
  0 siblings, 2 replies; 15+ messages in thread
From: Jeff King @ 2013-07-02  5:01 UTC (permalink / raw)
  To: Martin Fick; +Cc: git

On Tue, Jul 02, 2013 at 12:41:51AM -0400, Jeff King wrote:

> I replicated your test setup, and the problem is that we have many
> common objects on both sides during the ref negotiation. So we end up in
> rev_list_push for each one, which has the same O(n^2) behavior.
> Switching it to just sort at the end is not trivial; we first insert all
> of the objects, but then we actually walk the parents, pushing onto the
> list as we go. So I think we'd want a better data structure (like a
> priority queue).

Like the patch below, which is built on top of next (which has Junio's
prio_queue implementation), and has both the priority queue fix for
rev_list_push and the mark_complete sort-at-the-end fix. With this, your
test case completes in a few seconds (no clue if it would ever have
finished otherwise; I never let it run for more than a minute or two,
but I did confirm that it was spending all of its time in
commit_list_insert_by_date).

I didn't look too hard at the actual semantics, so there might be away
to not mark so many commits in the first place. This is just a mindless
replacement of an O(n^2) list insertion with the queue.

I did note two oddities, though:

  1. In the original, we never free the commit_list pointers when we pop
     them (or when we clear the list in find_common. So I think the
     existing code was leaking memory, and my version does not.

  2. In the original version of get_rev, we "peek" at the head commit of
     rev_list:

               commit = rev_list->item;

     Then we look at the parents of that commit, and possibly push them
     onto rev_list:

               while (parents) {
                      if (!(parents->item->object.flags & SEEN))
                              rev_list_push(parents->item, mark);
                      ...
               }

     and then finally, "pop" the commit off the list:

               rev_list = rev_list->next;

     But what guarantee do we have that the commit we just processed is
     still at the front of the list? If the timestamps are monotonic,
     then the parent must come after the descendent and therefore will
     not have gotten pushed onto the front of the list. But in the face
     of clock skew, this is not the case, and we will have just skipped
     over the parent with our "pop".

     So unless I am missing something very clever, I think this was a
     bug (albeit a difficult one to have matter). And it's also fixed by
     using prio_queue (which does the peek/pop together at the top of
     the function).

diff --git a/commit.c b/commit.c
index 521e49c..ebc0eea 100644
--- a/commit.c
+++ b/commit.c
@@ -581,7 +581,7 @@ static int compare_commits_by_author_date(const void *a_, const void *b_,
 	return 0;
 }
 
-static int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused)
+int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused)
 {
 	const struct commit *a = a_, *b = b_;
 	/* newer commits with larger date first */
diff --git a/commit.h b/commit.h
index 4d452dc..18a5234 100644
--- a/commit.h
+++ b/commit.h
@@ -254,4 +254,6 @@ extern void check_commit_signature(const struct commit* commit, struct signature
  */
 extern void check_commit_signature(const struct commit* commit, struct signature_check *sigc);
 
+int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused);
+
 #endif /* COMMIT_H */
diff --git a/fetch-pack.c b/fetch-pack.c
index abe5ffb..6684348 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -11,6 +11,7 @@
 #include "run-command.h"
 #include "transport.h"
 #include "version.h"
+#include "prio-queue.h"
 
 static int transfer_unpack_limit = -1;
 static int fetch_unpack_limit = -1;
@@ -37,7 +38,7 @@ static int marked;
  */
 #define MAX_IN_VAIN 256
 
-static struct commit_list *rev_list;
+static struct prio_queue rev_list = { compare_commits_by_commit_date };
 static int non_common_revs, multi_ack, use_sideband, allow_tip_sha1_in_want;
 
 static void rev_list_push(struct commit *commit, int mark)
@@ -49,7 +50,7 @@ static void rev_list_push(struct commit *commit, int mark)
 			if (parse_commit(commit))
 				return;
 
-		commit_list_insert_by_date(commit, &rev_list);
+		prio_queue_put(&rev_list, commit);
 
 		if (!(commit->object.flags & COMMON))
 			non_common_revs++;
@@ -122,10 +123,10 @@ static const unsigned char *get_rev(void)
 		unsigned int mark;
 		struct commit_list *parents;
 
-		if (rev_list == NULL || non_common_revs == 0)
+		if (rev_list.nr == 0 || non_common_revs == 0)
 			return NULL;
 
-		commit = rev_list->item;
+		commit = prio_queue_get(&rev_list);
 		if (!commit->object.parsed)
 			parse_commit(commit);
 		parents = commit->parents;
@@ -152,8 +153,6 @@ static const unsigned char *get_rev(void)
 				mark_common(parents->item, 1, 0);
 			parents = parents->next;
 		}
-
-		rev_list = rev_list->next;
 	}
 
 	return commit->object.sha1;
@@ -442,7 +441,7 @@ static int find_common(struct fetch_pack_args *args,
 					in_vain = 0;
 					got_continue = 1;
 					if (ack == ACK_ready) {
-						rev_list = NULL;
+						clear_prio_queue(&rev_list);
 						got_ready = 1;
 					}
 					break;
@@ -505,7 +504,7 @@ static int mark_complete(const char *refname, const unsigned char *sha1, int fla
 		struct commit *commit = (struct commit *)o;
 		if (!(commit->object.flags & COMPLETE)) {
 			commit->object.flags |= COMPLETE;
-			commit_list_insert_by_date(commit, &complete);
+			commit_list_insert(commit, &complete);
 		}
 	}
 	return 0;
@@ -622,6 +621,7 @@ static int everything_local(struct fetch_pack_args *args,
 	if (!args->depth) {
 		for_each_ref(mark_complete, NULL);
 		for_each_alternate_ref(mark_alternate_complete, NULL);
+		commit_list_sort_by_date(&complete);
 		if (cutoff)
 			mark_recent_complete_commits(args, cutoff);
 	}

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: How to still kill git fetch with too many refs
  2013-07-02  5:01     ` Jeff King
@ 2013-07-02  5:19       ` Junio C Hamano
  2013-07-02  5:28         ` Jeff King
  2013-07-02 17:52       ` How to still kill git fetch with too many refs Brandon Casey
  1 sibling, 1 reply; 15+ messages in thread
From: Junio C Hamano @ 2013-07-02  5:19 UTC (permalink / raw)
  To: Jeff King; +Cc: Martin Fick, git

Jeff King <peff@peff.net> writes:

> On Tue, Jul 02, 2013 at 12:41:51AM -0400, Jeff King wrote:
>
>> I replicated your test setup, and the problem is that we have many
>> common objects on both sides during the ref negotiation. So we end up in
>> rev_list_push for each one, which has the same O(n^2) behavior.
>> Switching it to just sort at the end is not trivial; we first insert all
>> of the objects, but then we actually walk the parents, pushing onto the
>> list as we go. So I think we'd want a better data structure (like a
>> priority queue).
>
> Like the patch below, which is built on top of next (which has Junio's
> prio_queue implementation), and has both the priority queue fix for
> rev_list_push and the mark_complete sort-at-the-end fix.

Wow, I saw "160 lines" in my MUA which scared me a bit until I
opened it to realize 40% is discussion and most of the remaining
lines are context around single liners.

It just looks too easy/simple, but the result looks correct, at
least from a cursory read.

Good job ;-)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: How to still kill git fetch with too many refs
  2013-07-02  5:19       ` Junio C Hamano
@ 2013-07-02  5:28         ` Jeff King
  2013-07-02  6:11           ` [PATCH 0/3] avoid quadratic behavior in fetch-pack Jeff King
  0 siblings, 1 reply; 15+ messages in thread
From: Jeff King @ 2013-07-02  5:28 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Martin Fick, git

On Mon, Jul 01, 2013 at 10:19:51PM -0700, Junio C Hamano wrote:

> > Like the patch below, which is built on top of next (which has Junio's
> > prio_queue implementation), and has both the priority queue fix for
> > rev_list_push and the mark_complete sort-at-the-end fix.
> 
> Wow, I saw "160 lines" in my MUA which scared me a bit until I
> opened it to realize 40% is discussion and most of the remaining
> lines are context around single liners.
> 
> It just looks too easy/simple, but the result looks correct, at
> least from a cursory read.
> 
> Good job ;-)

Thanks. :)

I'm splitting it out into readable patches now. At first I was made a
bit nervous by the "popping" behavior I described as "oddity #2"
earlier.  But the more I look at it, the more I am convinced it is
simply a bug that we can happen to fix along the way.

Patches in a few minutes.

-Peff

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 0/3] avoid quadratic behavior in fetch-pack
  2013-07-02  5:28         ` Jeff King
@ 2013-07-02  6:11           ` Jeff King
  2013-07-02  6:16             ` [PATCH 1/3] fetch-pack: avoid quadratic list insertion in mark_complete Jeff King
                               ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Jeff King @ 2013-07-02  6:11 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Martin Fick, git

Here are my patches to deal with Martin's pathological case, split out
for easy reading. I took a few timings to show that the results of the
3rd patch are noticeable even with 50,000 unique refs (which is still a
lot, but something that I could conceive of a busy repo accumulating
over time).

  [1/3]: fetch-pack: avoid quadratic list insertion in mark_complete
  [2/3]: commit.c: make compare_commits_by_commit_date global
  [3/3]: fetch-pack: avoid quadratic behavior in rev_list_push

And here's the diffstat to prove it is really not scary. :)

 commit.c     |  2 +-
 commit.h     |  2 ++
 fetch-pack.c | 16 ++++++++--------
 3 files changed, 11 insertions(+), 9 deletions(-)

-Peff

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/3] fetch-pack: avoid quadratic list insertion in mark_complete
  2013-07-02  6:11           ` [PATCH 0/3] avoid quadratic behavior in fetch-pack Jeff King
@ 2013-07-02  6:16             ` Jeff King
  2013-07-02  6:21             ` [PATCH 2/3] commit.c: make compare_commits_by_commit_date global Jeff King
                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 15+ messages in thread
From: Jeff King @ 2013-07-02  6:16 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Martin Fick, git

We insert the commit pointed to by each ref one-by-one into
the "complete" commit_list using insert_by_date. Because
each insertion is O(n), we end up with O(n^2) behavior.

This typically doesn't matter, because the number of refs is
reasonably small. And even if there are a lot of refs, they
often point to a smaller set of objects (in which case the
optimization in commit ea5f220 keeps our "n" small).

However, in pathological repositories (hundreds of thousands
of refs, each pointing to a unique commit), this quadratic
behavior can make a difference. Since we do not care about
the list order until we have finished building it, we can
simply keep it unsorted during the insertion phase, then
sort it afterwards.

On a repository like the one described above, this dropped
the time to do a no-op fetch from 2.0s to 1.7s. On normal
repositories, it probably does not matter at all, but it
does not hurt to protect ourselves from pathological cases.

Signed-off-by: Jeff King <peff@peff.net>
---
A note on the timings. I measured the times above last year when I wrote
the same patch here:

  http://article.gmane.org/gmane.comp.version-control.git/194939

And earlier tonight, I did a fetch that showed the same result. But when
I tried to replicate it while writing the commit message, I had trouble,
because either:

  1. I was fetching actual commits, in which case the more serious
     problem behavior in find_common kicked in, ruining the measurement.

  2. It was a no-op fetch, in which case quickfetch() kicked in and we
     did not call mark_complete at all.

So I'm rather confused how I managed to get timings earlier (both last
year, and earlier today). But I still think it's an obviously correct
thing to do, and does protect us in case (1) above (actually fetching a
commit) once we fix the problem in find_common.

 fetch-pack.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fetch-pack.c b/fetch-pack.c
index abe5ffb..4df8abd 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -505,7 +505,7 @@ static int mark_complete(const char *refname, const unsigned char *sha1, int fla
 		struct commit *commit = (struct commit *)o;
 		if (!(commit->object.flags & COMPLETE)) {
 			commit->object.flags |= COMPLETE;
-			commit_list_insert_by_date(commit, &complete);
+			commit_list_insert(commit, &complete);
 		}
 	}
 	return 0;
@@ -622,6 +622,7 @@ static int everything_local(struct fetch_pack_args *args,
 	if (!args->depth) {
 		for_each_ref(mark_complete, NULL);
 		for_each_alternate_ref(mark_alternate_complete, NULL);
+		commit_list_sort_by_date(&complete);
 		if (cutoff)
 			mark_recent_complete_commits(args, cutoff);
 	}
-- 
1.8.3.rc2.14.g7eee6b3

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/3] commit.c: make compare_commits_by_commit_date global
  2013-07-02  6:11           ` [PATCH 0/3] avoid quadratic behavior in fetch-pack Jeff King
  2013-07-02  6:16             ` [PATCH 1/3] fetch-pack: avoid quadratic list insertion in mark_complete Jeff King
@ 2013-07-02  6:21             ` Jeff King
  2013-07-02  6:24             ` [PATCH 3/3] fetch-pack: avoid quadratic behavior in rev_list_push Jeff King
  2013-07-02 17:45             ` [PATCH 0/3] avoid quadratic behavior in fetch-pack Martin Fick
  3 siblings, 0 replies; 15+ messages in thread
From: Jeff King @ 2013-07-02  6:21 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Martin Fick, git

This helper function was introduced as a prio_queue
comparator to help topological sorting. However, other users
of prio_queue who want to replace commit_list_insert_by_date
will want to use it, too. So let's make it public.

Signed-off-by: Jeff King <peff@peff.net>
---
There is also compare_commits_by_author_date, but I expect it to be less
generally useful (especially because it relies on a slab), so I didn't
bother publicizing it.  I think this is sufficient for now, and any
later users can make the author version public if they need to.

Note also that we have a similar comparison function,
commit_list_compare_by_date, which gets fed to the linked-list mergesort
for commit_list_sort_by_date. I was tempted to unify them, but we can't,
because it takes an actual "struct commit_list *", not a "struct commit
*" (and the logic is not so complex that it is worth factoring out to a
shared helper).

 commit.c | 2 +-
 commit.h | 2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index 521e49c..ebc0eea 100644
--- a/commit.c
+++ b/commit.c
@@ -581,7 +581,7 @@ static int compare_commits_by_author_date(const void *a_, const void *b_,
 	return 0;
 }
 
-static int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused)
+int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused)
 {
 	const struct commit *a = a_, *b = b_;
 	/* newer commits with larger date first */
diff --git a/commit.h b/commit.h
index 4d452dc..18a5234 100644
--- a/commit.h
+++ b/commit.h
@@ -254,4 +254,6 @@ extern void check_commit_signature(const struct commit* commit, struct signature
  */
 extern void check_commit_signature(const struct commit* commit, struct signature_check *sigc);
 
+int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused);
+
 #endif /* COMMIT_H */
-- 
1.8.3.rc2.14.g7eee6b3

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 3/3] fetch-pack: avoid quadratic behavior in rev_list_push
  2013-07-02  6:11           ` [PATCH 0/3] avoid quadratic behavior in fetch-pack Jeff King
  2013-07-02  6:16             ` [PATCH 1/3] fetch-pack: avoid quadratic list insertion in mark_complete Jeff King
  2013-07-02  6:21             ` [PATCH 2/3] commit.c: make compare_commits_by_commit_date global Jeff King
@ 2013-07-02  6:24             ` Jeff King
  2013-07-02  7:52               ` Eric Sunshine
  2013-07-02 17:45             ` [PATCH 0/3] avoid quadratic behavior in fetch-pack Martin Fick
  3 siblings, 1 reply; 15+ messages in thread
From: Jeff King @ 2013-07-02  6:24 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Martin Fick, git

When we call find_common to start finding common ancestors
with the remote side of a fetch, the first thing we do is
insert the tip of each ref into our rev_list linked list. We
keep the list sorted the whole time with
commit_list_insert_by_date, which means our insertion ends
up doing O(n^2) timestamp comparisons.

We could teach rev_list_push to use an unsorted list, and
then sort it once after we have added each ref. However, in
get_rev, we process the list by popping commits off the
front and adding parents back in timestamp-sorted order. So
that procedure would still operate on the large list.

Instead, we can replace the linked list with a heap-based
priority queue, which can do O(log n) insertion, making the
whole insertion procedure O(n log n).

As a result of switching to the prio_queue struct, we fix
two minor bugs:

  1. When we "pop" a commit in get_rev, and when we clear
     the rev_list in find_common, we do not take care to
     free the "struct commit_list", and just leak its
     memory. With the prio_queue implementation, the memory
     management is handled for us.

  2. In get_rev, we look at the head commit of the list,
     possibly push its parents onto the list, and then "pop"
     the front of the list off, assuming it is the same
     element that we just peeked at. This is typically going
     to be the case, but would not be in the face of clock
     skew: the parents are inserted by date, and could
     potentailly be inserted at the head of the list if they
     have a timestamp newer than their descendent. In this
     case, we would accidentally pop the parent, and never
     process it at all.

     The new implementation pulls the commit off of the
     queue as we examine it, and so does not suffer from
     this problem.

With this patch, a fetch of a single commit into a
repository with 50,000 refs went from:

  real    0m7.984s
  user    0m7.852s
  sys     0m0.120s

to:

  real    0m2.017s
  user    0m1.884s
  sys     0m0.124s

Before this patch, a larger case with 370K refs still had
not completed after tens of minutes; with this patch, it
completes in about 12 seconds.

Signed-off-by: Jeff King <peff@peff.net>
---
Not that it really matters, but Martin, note that your "one million
refs" case is actually more like 370K, because that is how many commits
there are in the linux.git repo. The rest of the lines written to your
packed-refs file are just bogus.

 fetch-pack.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/fetch-pack.c b/fetch-pack.c
index 4df8abd..6684348 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -11,6 +11,7 @@
 #include "run-command.h"
 #include "transport.h"
 #include "version.h"
+#include "prio-queue.h"
 
 static int transfer_unpack_limit = -1;
 static int fetch_unpack_limit = -1;
@@ -37,7 +38,7 @@ static int marked;
  */
 #define MAX_IN_VAIN 256
 
-static struct commit_list *rev_list;
+static struct prio_queue rev_list = { compare_commits_by_commit_date };
 static int non_common_revs, multi_ack, use_sideband, allow_tip_sha1_in_want;
 
 static void rev_list_push(struct commit *commit, int mark)
@@ -49,7 +50,7 @@ static void rev_list_push(struct commit *commit, int mark)
 			if (parse_commit(commit))
 				return;
 
-		commit_list_insert_by_date(commit, &rev_list);
+		prio_queue_put(&rev_list, commit);
 
 		if (!(commit->object.flags & COMMON))
 			non_common_revs++;
@@ -122,10 +123,10 @@ static const unsigned char *get_rev(void)
 		unsigned int mark;
 		struct commit_list *parents;
 
-		if (rev_list == NULL || non_common_revs == 0)
+		if (rev_list.nr == 0 || non_common_revs == 0)
 			return NULL;
 
-		commit = rev_list->item;
+		commit = prio_queue_get(&rev_list);
 		if (!commit->object.parsed)
 			parse_commit(commit);
 		parents = commit->parents;
@@ -152,8 +153,6 @@ static const unsigned char *get_rev(void)
 				mark_common(parents->item, 1, 0);
 			parents = parents->next;
 		}
-
-		rev_list = rev_list->next;
 	}
 
 	return commit->object.sha1;
@@ -442,7 +441,7 @@ static int find_common(struct fetch_pack_args *args,
 					in_vain = 0;
 					got_continue = 1;
 					if (ack == ACK_ready) {
-						rev_list = NULL;
+						clear_prio_queue(&rev_list);
 						got_ready = 1;
 					}
 					break;
-- 
1.8.3.rc2.14.g7eee6b3

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/3] fetch-pack: avoid quadratic behavior in rev_list_push
  2013-07-02  6:24             ` [PATCH 3/3] fetch-pack: avoid quadratic behavior in rev_list_push Jeff King
@ 2013-07-02  7:52               ` Eric Sunshine
  0 siblings, 0 replies; 15+ messages in thread
From: Eric Sunshine @ 2013-07-02  7:52 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Martin Fick, Git List

On Tue, Jul 2, 2013 at 2:24 AM, Jeff King <peff@peff.net> wrote:
> When we call find_common to start finding common ancestors
> with the remote side of a fetch, the first thing we do is
> insert the tip of each ref into our rev_list linked list. We
> keep the list sorted the whole time with
> commit_list_insert_by_date, which means our insertion ends
> up doing O(n^2) timestamp comparisons.
>
> We could teach rev_list_push to use an unsorted list, and
> then sort it once after we have added each ref. However, in
> get_rev, we process the list by popping commits off the
> front and adding parents back in timestamp-sorted order. So
> that procedure would still operate on the large list.
>
> Instead, we can replace the linked list with a heap-based
> priority queue, which can do O(log n) insertion, making the
> whole insertion procedure O(n log n).
>
> As a result of switching to the prio_queue struct, we fix
> two minor bugs:
>
>   1. When we "pop" a commit in get_rev, and when we clear
>      the rev_list in find_common, we do not take care to
>      free the "struct commit_list", and just leak its
>      memory. With the prio_queue implementation, the memory
>      management is handled for us.
>
>   2. In get_rev, we look at the head commit of the list,
>      possibly push its parents onto the list, and then "pop"
>      the front of the list off, assuming it is the same
>      element that we just peeked at. This is typically going
>      to be the case, but would not be in the face of clock
>      skew: the parents are inserted by date, and could
>      potentailly be inserted at the head of the list if they

s/potentailly/potentially/

>      have a timestamp newer than their descendent. In this
>      case, we would accidentally pop the parent, and never
>      process it at all.
>
>      The new implementation pulls the commit off of the
>      queue as we examine it, and so does not suffer from
>      this problem.
>
> With this patch, a fetch of a single commit into a
> repository with 50,000 refs went from:
>
>   real    0m7.984s
>   user    0m7.852s
>   sys     0m0.120s
>
> to:
>
>   real    0m2.017s
>   user    0m1.884s
>   sys     0m0.124s
>
> Before this patch, a larger case with 370K refs still had
> not completed after tens of minutes; with this patch, it
> completes in about 12 seconds.
>
> Signed-off-by: Jeff King <peff@peff.net>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/3] avoid quadratic behavior in fetch-pack
  2013-07-02  6:11           ` [PATCH 0/3] avoid quadratic behavior in fetch-pack Jeff King
                               ` (2 preceding siblings ...)
  2013-07-02  6:24             ` [PATCH 3/3] fetch-pack: avoid quadratic behavior in rev_list_push Jeff King
@ 2013-07-02 17:45             ` Martin Fick
  3 siblings, 0 replies; 15+ messages in thread
From: Martin Fick @ 2013-07-02 17:45 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git

On Tuesday, July 02, 2013 12:11:49 am Jeff King wrote:
> Here are my patches to deal with Martin's pathological
> case, split out for easy reading. I took a few timings
> to show that the results of the 3rd patch are noticeable
> even with 50,000 unique refs (which is still a lot, but
> something that I could conceive of a busy repo
> accumulating over time).
> 
>   [1/3]: fetch-pack: avoid quadratic list insertion in
> mark_complete [2/3]: commit.c: make
> compare_commits_by_commit_date global [3/3]: fetch-pack:
> avoid quadratic behavior in rev_list_push
> 
> And here's the diffstat to prove it is really not scary.
> :)
> 
>  commit.c     |  2 +-
>  commit.h     |  2 ++
>  fetch-pack.c | 16 ++++++++--------
>  3 files changed, 11 insertions(+), 9 deletions(-)
> 
> -Peff

I applied these 3 patches and it indeed improves things 
dramatically.  Thanks Peff, you are awesome!!!

The synthetic test case (but sorted), now comes in at around 
15s.  The more important real world case (for us), fetching 
from my production server, which took around 12mins 
previously, now takes around 30s (I think the extra time is 
now spent on the Gerrit server, but I will investigate that 
a bit more)!  That is very significant and should make many 
workflows much more efficient.  +1 for merging this. :)

Again, thanks,

-Martin

Note, I tested git-next 1.8.3.2.883.g27cfd27 to be sure that 
it is still problematic without this patch, it is (running 
for 10mins now without completing).

-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: How to still kill git fetch with too many refs
  2013-07-02  5:01     ` Jeff King
  2013-07-02  5:19       ` Junio C Hamano
@ 2013-07-02 17:52       ` Brandon Casey
  1 sibling, 0 replies; 15+ messages in thread
From: Brandon Casey @ 2013-07-02 17:52 UTC (permalink / raw)
  To: Jeff King; +Cc: Martin Fick, git@vger.kernel.org

On Mon, Jul 1, 2013 at 10:01 PM, Jeff King <peff@peff.net> wrote:
> On Tue, Jul 02, 2013 at 12:41:51AM -0400, Jeff King wrote:
>
>> I replicated your test setup, and the problem is that we have many
>> common objects on both sides during the ref negotiation. So we end up in
>> rev_list_push for each one, which has the same O(n^2) behavior.
>> Switching it to just sort at the end is not trivial; we first insert all
>> of the objects, but then we actually walk the parents, pushing onto the
>> list as we go. So I think we'd want a better data structure (like a
>> priority queue).
>
> Like the patch below,

<snip>

Looks obviously correct to me since I've got essentially the same
patch sitting in my local repo. :b

I've got the patch coming that fixes the same problem on the push side
of things and provides the same order of magnitude improvement.

-Brandon

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: How to still kill git fetch with too many refs
  2013-07-02  3:02 How to still kill git fetch with too many refs Martin Fick
  2013-07-02  4:07 ` Jeff King
@ 2013-07-02  9:24 ` Michael Haggerty
  2013-07-02 16:58   ` Martin Fick
  1 sibling, 1 reply; 15+ messages in thread
From: Michael Haggerty @ 2013-07-02  9:24 UTC (permalink / raw)
  To: Martin Fick; +Cc: git, Jeff King, Junio C Hamano

On 07/02/2013 05:02 AM, Martin Fick wrote:
> I have often reported problems with git fetch when there are 
> many refs in a repo, and I have been pleasantly surprised 
> how many problems I reported were so quickly fixed. :) With 
> time, others have created various synthetic test cases to 
> ensure that git can handle many many refs.  A simple 
> synthetic test case with 1M refs all pointing to the same 
> sha1 seems to be easily handled by git these days.  However, 
> in our experience with our internal git repo, we still have 
> performance issues related to having too many refs, in our 
> kernel/msm instance we have around 400K.
> 
> When I tried the simple synthetic test case and could not 
> reproduce bad results, so I tried something just a little 
> more complex and was able to get atrocious results!!! 
> Basically, I generate a packed-refs files with many refs 
> which each point to a different sha1.  To get a list of 
> valid but unique sha1s for the repo, I simply used rev-list.  
> The result, a copy of linus' repo with a million unique 
> valid refs and a git fetch of a single updated ref taking a 
> very long time (55mins and it did not complete yet).  Note, 
> with 100K refs it completes in about 2m40s.  It is likely 
> not linear since 2m40s * 10 would be ~26m (but the 
> difference could also just be how the data in the sha1s are 
> ordered).
> 
> 
> Here is my small reproducible test case for this issue:
> 
> git clone 
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> cp -rp linux linux.1Mrefs-revlist
> 
> cd linux
> echo "Hello" > hello ; git add hello ; git ci -a -m 'hello'
> cd ..
> 
> cd linux.1Mrefs-revlist
> git rev-list HEAD | for nn in $(seq 0 100) ; do for c in 
> $(seq 0 10000) ; do  read sha ; echo $sha refs/c/$nn/$c$nn ; 
> done ; done > .git/packed-refs

I believe this generates a packed-refs file that is not sorted
lexicographically by refname, whereas all Git-generated packed-refs
files are sorted.  There are some optimizations in refs.c for adding
references in order that might therefore be circumvented by your
unsorted file.  Please try sorting the file by refname and see if that
helps.  (You can do so by deleting one of the packed references; then
git will sort the remainder while rewriting the file.)

Michael

-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: How to still kill git fetch with too many refs
  2013-07-02  9:24 ` Michael Haggerty
@ 2013-07-02 16:58   ` Martin Fick
  0 siblings, 0 replies; 15+ messages in thread
From: Martin Fick @ 2013-07-02 16:58 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: git, Jeff King, Junio C Hamano

On Tuesday, July 02, 2013 03:24:14 am Michael Haggerty 
wrote:
> > git rev-list HEAD | for nn in $(seq 0 100) ; do for c
> > in $(seq 0 10000) ; do  read sha ; echo $sha
> > refs/c/$nn/$c$nn ; done ; done > .git/packed-refs
> 
> I believe this generates a packed-refs file that is not
> sorted lexicographically by refname, whereas all
> Git-generated packed-refs files are sorted.  


Yes, you are indeed correct.  I was attempting to be too 
clever with my sharding I guess.  Thanks.

> There are
> some optimizations in refs.c for adding references in
> order that might therefore be circumvented by your
> unsorted file.  Please try sorting the file by refname
> and see if that helps.  (You can do so by deleting one
> of the packed references; then git will sort the
> remainder while rewriting the file.)

A simple git pack-refs seems to clean it up.

The original test did complete in ~77mins last night.  A 
rerun with a sorted file takes ~61mins,

-Martin


PS: This test was performed with git version 1.8.2.1 on 
linux 2.6.32-37-generic #81-Ubuntu SMP 

-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation
 

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2013-07-02 17:52 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-07-02  3:02 How to still kill git fetch with too many refs Martin Fick
2013-07-02  4:07 ` Jeff King
2013-07-02  4:41   ` Jeff King
2013-07-02  5:01     ` Jeff King
2013-07-02  5:19       ` Junio C Hamano
2013-07-02  5:28         ` Jeff King
2013-07-02  6:11           ` [PATCH 0/3] avoid quadratic behavior in fetch-pack Jeff King
2013-07-02  6:16             ` [PATCH 1/3] fetch-pack: avoid quadratic list insertion in mark_complete Jeff King
2013-07-02  6:21             ` [PATCH 2/3] commit.c: make compare_commits_by_commit_date global Jeff King
2013-07-02  6:24             ` [PATCH 3/3] fetch-pack: avoid quadratic behavior in rev_list_push Jeff King
2013-07-02  7:52               ` Eric Sunshine
2013-07-02 17:45             ` [PATCH 0/3] avoid quadratic behavior in fetch-pack Martin Fick
2013-07-02 17:52       ` How to still kill git fetch with too many refs Brandon Casey
2013-07-02  9:24 ` Michael Haggerty
2013-07-02 16:58   ` Martin Fick

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).