* How to still kill git fetch with too many refs
@ 2013-07-02 3:02 Martin Fick
2013-07-02 4:07 ` Jeff King
2013-07-02 9:24 ` Michael Haggerty
0 siblings, 2 replies; 15+ messages in thread
From: Martin Fick @ 2013-07-02 3:02 UTC (permalink / raw)
To: git
I have often reported problems with git fetch when there are
many refs in a repo, and I have been pleasantly surprised
how many problems I reported were so quickly fixed. :) With
time, others have created various synthetic test cases to
ensure that git can handle many many refs. A simple
synthetic test case with 1M refs all pointing to the same
sha1 seems to be easily handled by git these days. However,
in our experience with our internal git repo, we still have
performance issues related to having too many refs, in our
kernel/msm instance we have around 400K.
When I tried the simple synthetic test case and could not
reproduce bad results, so I tried something just a little
more complex and was able to get atrocious results!!!
Basically, I generate a packed-refs files with many refs
which each point to a different sha1. To get a list of
valid but unique sha1s for the repo, I simply used rev-list.
The result, a copy of linus' repo with a million unique
valid refs and a git fetch of a single updated ref taking a
very long time (55mins and it did not complete yet). Note,
with 100K refs it completes in about 2m40s. It is likely
not linear since 2m40s * 10 would be ~26m (but the
difference could also just be how the data in the sha1s are
ordered).
Here is my small reproducible test case for this issue:
git clone
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
cp -rp linux linux.1Mrefs-revlist
cd linux
echo "Hello" > hello ; git add hello ; git ci -a -m 'hello'
cd ..
cd linux.1Mrefs-revlist
git rev-list HEAD | for nn in $(seq 0 100) ; do for c in
$(seq 0 10000) ; do read sha ; echo $sha refs/c/$nn/$c$nn ;
done ; done > .git/packed-refs
time git fetch file:///$(dirname $PWD)/linux
refs/heads/master
Any insights as to why it is so slow, and how we could
possibly speed it up?
Thanks,
-Martin
PS: My tests were performed with git version 1.8.2.1 on
linux 2.6.32-37-generic #81-Ubuntu SMP
--
The Qualcomm Innovation Center, Inc. is a member of Code
Aurora Forum, hosted by The Linux Foundation
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: How to still kill git fetch with too many refs
2013-07-02 3:02 How to still kill git fetch with too many refs Martin Fick
@ 2013-07-02 4:07 ` Jeff King
2013-07-02 4:41 ` Jeff King
2013-07-02 9:24 ` Michael Haggerty
1 sibling, 1 reply; 15+ messages in thread
From: Jeff King @ 2013-07-02 4:07 UTC (permalink / raw)
To: Martin Fick; +Cc: git
On Mon, Jul 01, 2013 at 09:02:31PM -0600, Martin Fick wrote:
> A simple synthetic test case with 1M refs all pointing to the same
> sha1 seems to be easily handled by git these days. However, in our
> experience with our internal git repo, we still have performance
> issues related to having too many refs, in our kernel/msm instance we
> have around 400K.
I'm not too surprised. There's O(n^2) behavior in fetch-pack's
mark_complete function, as it adds each of the N ref tips to a
commit_list, sorting by date (so each insertion is O(n)).
I posted two alternative patches in May of 2011. The first simply
avoids adding duplicate objects, which is simple and covers many
real-world cases (e.g., an "alternates" repository which has a bunch of
copies of the same tags, one per fork). The second one switches the
commit_list out for a heap-based priority queue.
We ended up taking the first (as ea5f220), since it was trivial and
obviously correct, but didn't bother with the second since:
1. There had been no real-world reports of it.
2. While in theory a priority queue implementation would be used in
other spots, too, it ended up being a pain to use it, as most of
the callers wanted list-like splicing.
You can see the original here:
http://thread.gmane.org/gmane.comp.version-control.git/174003/focus=174005
Though it probably doesn't apply cleanly anymore. However, I've kept it
rebased over the years at:
git://github.com/peff/git.git jk/fast-commit-list
Junio recently added a priority queue implementation in b4b594a
(prio-queue: priority queue of pointers to structs, 2013-06-06), which
is currently in next. So a modern version of that series would build on
top of that, rather than my priority queue.
And yet another alternative would be to keep the list unsorted during
the mark_complete calls, and then sort it at the end. Like this:
diff --git a/fetch-pack.c b/fetch-pack.c
index abe5ffb..4df8abd 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -505,7 +505,7 @@ static int mark_complete(const char *refname, const unsigned char *sha1, int fla
struct commit *commit = (struct commit *)o;
if (!(commit->object.flags & COMPLETE)) {
commit->object.flags |= COMPLETE;
- commit_list_insert_by_date(commit, &complete);
+ commit_list_insert(commit, &complete);
}
}
return 0;
@@ -622,6 +622,7 @@ static int everything_local(struct fetch_pack_args *args,
if (!args->depth) {
for_each_ref(mark_complete, NULL);
for_each_alternate_ref(mark_alternate_complete, NULL);
+ commit_list_sort_by_date(&complete);
if (cutoff)
mark_recent_complete_commits(args, cutoff);
}
(If you're wondering why we didn't do this trivial bit at the time, it
was because back then we did not yet have the René's nice linked-list
mergesort that backs commit_list_sort_by_date).
> The result, a copy of linus' repo with a million unique
> valid refs and a git fetch of a single updated ref taking a
> very long time (55mins and it did not complete yet). Note,
> with 100K refs it completes in about 2m40s. It is likely
> not linear since 2m40s * 10 would be ~26m (but the
> difference could also just be how the data in the sha1s are
> ordered).
That sounds like the O(n^2) problem. My timings back then with 100K refs
were 1-2 minutes. Does the patch above fix it for you?
-Peff
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: How to still kill git fetch with too many refs
2013-07-02 4:07 ` Jeff King
@ 2013-07-02 4:41 ` Jeff King
2013-07-02 5:01 ` Jeff King
0 siblings, 1 reply; 15+ messages in thread
From: Jeff King @ 2013-07-02 4:41 UTC (permalink / raw)
To: Martin Fick; +Cc: git
On Tue, Jul 02, 2013 at 12:07:58AM -0400, Jeff King wrote:
> And yet another alternative would be to keep the list unsorted during
> the mark_complete calls, and then sort it at the end. Like this:
Amusingly, I seem to have posted the exact same patch last year:
http://article.gmane.org/gmane.comp.version-control.git/194939
but never polished it up. It does cut a little time off of the initial
ref-loading that fetch-pack does (especially when we do not end up
transferring any objects at all). But it does not solve your problem.
I replicated your test setup, and the problem is that we have many
common objects on both sides during the ref negotiation. So we end up in
rev_list_push for each one, which has the same O(n^2) behavior.
Switching it to just sort at the end is not trivial; we first insert all
of the objects, but then we actually walk the parents, pushing onto the
list as we go. So I think we'd want a better data structure (like a
priority queue).
-Peff
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: How to still kill git fetch with too many refs
2013-07-02 4:41 ` Jeff King
@ 2013-07-02 5:01 ` Jeff King
2013-07-02 5:19 ` Junio C Hamano
2013-07-02 17:52 ` How to still kill git fetch with too many refs Brandon Casey
0 siblings, 2 replies; 15+ messages in thread
From: Jeff King @ 2013-07-02 5:01 UTC (permalink / raw)
To: Martin Fick; +Cc: git
On Tue, Jul 02, 2013 at 12:41:51AM -0400, Jeff King wrote:
> I replicated your test setup, and the problem is that we have many
> common objects on both sides during the ref negotiation. So we end up in
> rev_list_push for each one, which has the same O(n^2) behavior.
> Switching it to just sort at the end is not trivial; we first insert all
> of the objects, but then we actually walk the parents, pushing onto the
> list as we go. So I think we'd want a better data structure (like a
> priority queue).
Like the patch below, which is built on top of next (which has Junio's
prio_queue implementation), and has both the priority queue fix for
rev_list_push and the mark_complete sort-at-the-end fix. With this, your
test case completes in a few seconds (no clue if it would ever have
finished otherwise; I never let it run for more than a minute or two,
but I did confirm that it was spending all of its time in
commit_list_insert_by_date).
I didn't look too hard at the actual semantics, so there might be away
to not mark so many commits in the first place. This is just a mindless
replacement of an O(n^2) list insertion with the queue.
I did note two oddities, though:
1. In the original, we never free the commit_list pointers when we pop
them (or when we clear the list in find_common. So I think the
existing code was leaking memory, and my version does not.
2. In the original version of get_rev, we "peek" at the head commit of
rev_list:
commit = rev_list->item;
Then we look at the parents of that commit, and possibly push them
onto rev_list:
while (parents) {
if (!(parents->item->object.flags & SEEN))
rev_list_push(parents->item, mark);
...
}
and then finally, "pop" the commit off the list:
rev_list = rev_list->next;
But what guarantee do we have that the commit we just processed is
still at the front of the list? If the timestamps are monotonic,
then the parent must come after the descendent and therefore will
not have gotten pushed onto the front of the list. But in the face
of clock skew, this is not the case, and we will have just skipped
over the parent with our "pop".
So unless I am missing something very clever, I think this was a
bug (albeit a difficult one to have matter). And it's also fixed by
using prio_queue (which does the peek/pop together at the top of
the function).
diff --git a/commit.c b/commit.c
index 521e49c..ebc0eea 100644
--- a/commit.c
+++ b/commit.c
@@ -581,7 +581,7 @@ static int compare_commits_by_author_date(const void *a_, const void *b_,
return 0;
}
-static int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused)
+int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused)
{
const struct commit *a = a_, *b = b_;
/* newer commits with larger date first */
diff --git a/commit.h b/commit.h
index 4d452dc..18a5234 100644
--- a/commit.h
+++ b/commit.h
@@ -254,4 +254,6 @@ extern void check_commit_signature(const struct commit* commit, struct signature
*/
extern void check_commit_signature(const struct commit* commit, struct signature_check *sigc);
+int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused);
+
#endif /* COMMIT_H */
diff --git a/fetch-pack.c b/fetch-pack.c
index abe5ffb..6684348 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -11,6 +11,7 @@
#include "run-command.h"
#include "transport.h"
#include "version.h"
+#include "prio-queue.h"
static int transfer_unpack_limit = -1;
static int fetch_unpack_limit = -1;
@@ -37,7 +38,7 @@ static int marked;
*/
#define MAX_IN_VAIN 256
-static struct commit_list *rev_list;
+static struct prio_queue rev_list = { compare_commits_by_commit_date };
static int non_common_revs, multi_ack, use_sideband, allow_tip_sha1_in_want;
static void rev_list_push(struct commit *commit, int mark)
@@ -49,7 +50,7 @@ static void rev_list_push(struct commit *commit, int mark)
if (parse_commit(commit))
return;
- commit_list_insert_by_date(commit, &rev_list);
+ prio_queue_put(&rev_list, commit);
if (!(commit->object.flags & COMMON))
non_common_revs++;
@@ -122,10 +123,10 @@ static const unsigned char *get_rev(void)
unsigned int mark;
struct commit_list *parents;
- if (rev_list == NULL || non_common_revs == 0)
+ if (rev_list.nr == 0 || non_common_revs == 0)
return NULL;
- commit = rev_list->item;
+ commit = prio_queue_get(&rev_list);
if (!commit->object.parsed)
parse_commit(commit);
parents = commit->parents;
@@ -152,8 +153,6 @@ static const unsigned char *get_rev(void)
mark_common(parents->item, 1, 0);
parents = parents->next;
}
-
- rev_list = rev_list->next;
}
return commit->object.sha1;
@@ -442,7 +441,7 @@ static int find_common(struct fetch_pack_args *args,
in_vain = 0;
got_continue = 1;
if (ack == ACK_ready) {
- rev_list = NULL;
+ clear_prio_queue(&rev_list);
got_ready = 1;
}
break;
@@ -505,7 +504,7 @@ static int mark_complete(const char *refname, const unsigned char *sha1, int fla
struct commit *commit = (struct commit *)o;
if (!(commit->object.flags & COMPLETE)) {
commit->object.flags |= COMPLETE;
- commit_list_insert_by_date(commit, &complete);
+ commit_list_insert(commit, &complete);
}
}
return 0;
@@ -622,6 +621,7 @@ static int everything_local(struct fetch_pack_args *args,
if (!args->depth) {
for_each_ref(mark_complete, NULL);
for_each_alternate_ref(mark_alternate_complete, NULL);
+ commit_list_sort_by_date(&complete);
if (cutoff)
mark_recent_complete_commits(args, cutoff);
}
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: How to still kill git fetch with too many refs
2013-07-02 5:01 ` Jeff King
@ 2013-07-02 5:19 ` Junio C Hamano
2013-07-02 5:28 ` Jeff King
2013-07-02 17:52 ` How to still kill git fetch with too many refs Brandon Casey
1 sibling, 1 reply; 15+ messages in thread
From: Junio C Hamano @ 2013-07-02 5:19 UTC (permalink / raw)
To: Jeff King; +Cc: Martin Fick, git
Jeff King <peff@peff.net> writes:
> On Tue, Jul 02, 2013 at 12:41:51AM -0400, Jeff King wrote:
>
>> I replicated your test setup, and the problem is that we have many
>> common objects on both sides during the ref negotiation. So we end up in
>> rev_list_push for each one, which has the same O(n^2) behavior.
>> Switching it to just sort at the end is not trivial; we first insert all
>> of the objects, but then we actually walk the parents, pushing onto the
>> list as we go. So I think we'd want a better data structure (like a
>> priority queue).
>
> Like the patch below, which is built on top of next (which has Junio's
> prio_queue implementation), and has both the priority queue fix for
> rev_list_push and the mark_complete sort-at-the-end fix.
Wow, I saw "160 lines" in my MUA which scared me a bit until I
opened it to realize 40% is discussion and most of the remaining
lines are context around single liners.
It just looks too easy/simple, but the result looks correct, at
least from a cursory read.
Good job ;-)
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: How to still kill git fetch with too many refs
2013-07-02 5:19 ` Junio C Hamano
@ 2013-07-02 5:28 ` Jeff King
2013-07-02 6:11 ` [PATCH 0/3] avoid quadratic behavior in fetch-pack Jeff King
0 siblings, 1 reply; 15+ messages in thread
From: Jeff King @ 2013-07-02 5:28 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Martin Fick, git
On Mon, Jul 01, 2013 at 10:19:51PM -0700, Junio C Hamano wrote:
> > Like the patch below, which is built on top of next (which has Junio's
> > prio_queue implementation), and has both the priority queue fix for
> > rev_list_push and the mark_complete sort-at-the-end fix.
>
> Wow, I saw "160 lines" in my MUA which scared me a bit until I
> opened it to realize 40% is discussion and most of the remaining
> lines are context around single liners.
>
> It just looks too easy/simple, but the result looks correct, at
> least from a cursory read.
>
> Good job ;-)
Thanks. :)
I'm splitting it out into readable patches now. At first I was made a
bit nervous by the "popping" behavior I described as "oddity #2"
earlier. But the more I look at it, the more I am convinced it is
simply a bug that we can happen to fix along the way.
Patches in a few minutes.
-Peff
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH 0/3] avoid quadratic behavior in fetch-pack
2013-07-02 5:28 ` Jeff King
@ 2013-07-02 6:11 ` Jeff King
2013-07-02 6:16 ` [PATCH 1/3] fetch-pack: avoid quadratic list insertion in mark_complete Jeff King
` (3 more replies)
0 siblings, 4 replies; 15+ messages in thread
From: Jeff King @ 2013-07-02 6:11 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Martin Fick, git
Here are my patches to deal with Martin's pathological case, split out
for easy reading. I took a few timings to show that the results of the
3rd patch are noticeable even with 50,000 unique refs (which is still a
lot, but something that I could conceive of a busy repo accumulating
over time).
[1/3]: fetch-pack: avoid quadratic list insertion in mark_complete
[2/3]: commit.c: make compare_commits_by_commit_date global
[3/3]: fetch-pack: avoid quadratic behavior in rev_list_push
And here's the diffstat to prove it is really not scary. :)
commit.c | 2 +-
commit.h | 2 ++
fetch-pack.c | 16 ++++++++--------
3 files changed, 11 insertions(+), 9 deletions(-)
-Peff
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH 1/3] fetch-pack: avoid quadratic list insertion in mark_complete
2013-07-02 6:11 ` [PATCH 0/3] avoid quadratic behavior in fetch-pack Jeff King
@ 2013-07-02 6:16 ` Jeff King
2013-07-02 6:21 ` [PATCH 2/3] commit.c: make compare_commits_by_commit_date global Jeff King
` (2 subsequent siblings)
3 siblings, 0 replies; 15+ messages in thread
From: Jeff King @ 2013-07-02 6:16 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Martin Fick, git
We insert the commit pointed to by each ref one-by-one into
the "complete" commit_list using insert_by_date. Because
each insertion is O(n), we end up with O(n^2) behavior.
This typically doesn't matter, because the number of refs is
reasonably small. And even if there are a lot of refs, they
often point to a smaller set of objects (in which case the
optimization in commit ea5f220 keeps our "n" small).
However, in pathological repositories (hundreds of thousands
of refs, each pointing to a unique commit), this quadratic
behavior can make a difference. Since we do not care about
the list order until we have finished building it, we can
simply keep it unsorted during the insertion phase, then
sort it afterwards.
On a repository like the one described above, this dropped
the time to do a no-op fetch from 2.0s to 1.7s. On normal
repositories, it probably does not matter at all, but it
does not hurt to protect ourselves from pathological cases.
Signed-off-by: Jeff King <peff@peff.net>
---
A note on the timings. I measured the times above last year when I wrote
the same patch here:
http://article.gmane.org/gmane.comp.version-control.git/194939
And earlier tonight, I did a fetch that showed the same result. But when
I tried to replicate it while writing the commit message, I had trouble,
because either:
1. I was fetching actual commits, in which case the more serious
problem behavior in find_common kicked in, ruining the measurement.
2. It was a no-op fetch, in which case quickfetch() kicked in and we
did not call mark_complete at all.
So I'm rather confused how I managed to get timings earlier (both last
year, and earlier today). But I still think it's an obviously correct
thing to do, and does protect us in case (1) above (actually fetching a
commit) once we fix the problem in find_common.
fetch-pack.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fetch-pack.c b/fetch-pack.c
index abe5ffb..4df8abd 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -505,7 +505,7 @@ static int mark_complete(const char *refname, const unsigned char *sha1, int fla
struct commit *commit = (struct commit *)o;
if (!(commit->object.flags & COMPLETE)) {
commit->object.flags |= COMPLETE;
- commit_list_insert_by_date(commit, &complete);
+ commit_list_insert(commit, &complete);
}
}
return 0;
@@ -622,6 +622,7 @@ static int everything_local(struct fetch_pack_args *args,
if (!args->depth) {
for_each_ref(mark_complete, NULL);
for_each_alternate_ref(mark_alternate_complete, NULL);
+ commit_list_sort_by_date(&complete);
if (cutoff)
mark_recent_complete_commits(args, cutoff);
}
--
1.8.3.rc2.14.g7eee6b3
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH 2/3] commit.c: make compare_commits_by_commit_date global
2013-07-02 6:11 ` [PATCH 0/3] avoid quadratic behavior in fetch-pack Jeff King
2013-07-02 6:16 ` [PATCH 1/3] fetch-pack: avoid quadratic list insertion in mark_complete Jeff King
@ 2013-07-02 6:21 ` Jeff King
2013-07-02 6:24 ` [PATCH 3/3] fetch-pack: avoid quadratic behavior in rev_list_push Jeff King
2013-07-02 17:45 ` [PATCH 0/3] avoid quadratic behavior in fetch-pack Martin Fick
3 siblings, 0 replies; 15+ messages in thread
From: Jeff King @ 2013-07-02 6:21 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Martin Fick, git
This helper function was introduced as a prio_queue
comparator to help topological sorting. However, other users
of prio_queue who want to replace commit_list_insert_by_date
will want to use it, too. So let's make it public.
Signed-off-by: Jeff King <peff@peff.net>
---
There is also compare_commits_by_author_date, but I expect it to be less
generally useful (especially because it relies on a slab), so I didn't
bother publicizing it. I think this is sufficient for now, and any
later users can make the author version public if they need to.
Note also that we have a similar comparison function,
commit_list_compare_by_date, which gets fed to the linked-list mergesort
for commit_list_sort_by_date. I was tempted to unify them, but we can't,
because it takes an actual "struct commit_list *", not a "struct commit
*" (and the logic is not so complex that it is worth factoring out to a
shared helper).
commit.c | 2 +-
commit.h | 2 ++
2 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/commit.c b/commit.c
index 521e49c..ebc0eea 100644
--- a/commit.c
+++ b/commit.c
@@ -581,7 +581,7 @@ static int compare_commits_by_author_date(const void *a_, const void *b_,
return 0;
}
-static int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused)
+int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused)
{
const struct commit *a = a_, *b = b_;
/* newer commits with larger date first */
diff --git a/commit.h b/commit.h
index 4d452dc..18a5234 100644
--- a/commit.h
+++ b/commit.h
@@ -254,4 +254,6 @@ extern void check_commit_signature(const struct commit* commit, struct signature
*/
extern void check_commit_signature(const struct commit* commit, struct signature_check *sigc);
+int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused);
+
#endif /* COMMIT_H */
--
1.8.3.rc2.14.g7eee6b3
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH 3/3] fetch-pack: avoid quadratic behavior in rev_list_push
2013-07-02 6:11 ` [PATCH 0/3] avoid quadratic behavior in fetch-pack Jeff King
2013-07-02 6:16 ` [PATCH 1/3] fetch-pack: avoid quadratic list insertion in mark_complete Jeff King
2013-07-02 6:21 ` [PATCH 2/3] commit.c: make compare_commits_by_commit_date global Jeff King
@ 2013-07-02 6:24 ` Jeff King
2013-07-02 7:52 ` Eric Sunshine
2013-07-02 17:45 ` [PATCH 0/3] avoid quadratic behavior in fetch-pack Martin Fick
3 siblings, 1 reply; 15+ messages in thread
From: Jeff King @ 2013-07-02 6:24 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Martin Fick, git
When we call find_common to start finding common ancestors
with the remote side of a fetch, the first thing we do is
insert the tip of each ref into our rev_list linked list. We
keep the list sorted the whole time with
commit_list_insert_by_date, which means our insertion ends
up doing O(n^2) timestamp comparisons.
We could teach rev_list_push to use an unsorted list, and
then sort it once after we have added each ref. However, in
get_rev, we process the list by popping commits off the
front and adding parents back in timestamp-sorted order. So
that procedure would still operate on the large list.
Instead, we can replace the linked list with a heap-based
priority queue, which can do O(log n) insertion, making the
whole insertion procedure O(n log n).
As a result of switching to the prio_queue struct, we fix
two minor bugs:
1. When we "pop" a commit in get_rev, and when we clear
the rev_list in find_common, we do not take care to
free the "struct commit_list", and just leak its
memory. With the prio_queue implementation, the memory
management is handled for us.
2. In get_rev, we look at the head commit of the list,
possibly push its parents onto the list, and then "pop"
the front of the list off, assuming it is the same
element that we just peeked at. This is typically going
to be the case, but would not be in the face of clock
skew: the parents are inserted by date, and could
potentailly be inserted at the head of the list if they
have a timestamp newer than their descendent. In this
case, we would accidentally pop the parent, and never
process it at all.
The new implementation pulls the commit off of the
queue as we examine it, and so does not suffer from
this problem.
With this patch, a fetch of a single commit into a
repository with 50,000 refs went from:
real 0m7.984s
user 0m7.852s
sys 0m0.120s
to:
real 0m2.017s
user 0m1.884s
sys 0m0.124s
Before this patch, a larger case with 370K refs still had
not completed after tens of minutes; with this patch, it
completes in about 12 seconds.
Signed-off-by: Jeff King <peff@peff.net>
---
Not that it really matters, but Martin, note that your "one million
refs" case is actually more like 370K, because that is how many commits
there are in the linux.git repo. The rest of the lines written to your
packed-refs file are just bogus.
fetch-pack.c | 13 ++++++-------
1 file changed, 6 insertions(+), 7 deletions(-)
diff --git a/fetch-pack.c b/fetch-pack.c
index 4df8abd..6684348 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -11,6 +11,7 @@
#include "run-command.h"
#include "transport.h"
#include "version.h"
+#include "prio-queue.h"
static int transfer_unpack_limit = -1;
static int fetch_unpack_limit = -1;
@@ -37,7 +38,7 @@ static int marked;
*/
#define MAX_IN_VAIN 256
-static struct commit_list *rev_list;
+static struct prio_queue rev_list = { compare_commits_by_commit_date };
static int non_common_revs, multi_ack, use_sideband, allow_tip_sha1_in_want;
static void rev_list_push(struct commit *commit, int mark)
@@ -49,7 +50,7 @@ static void rev_list_push(struct commit *commit, int mark)
if (parse_commit(commit))
return;
- commit_list_insert_by_date(commit, &rev_list);
+ prio_queue_put(&rev_list, commit);
if (!(commit->object.flags & COMMON))
non_common_revs++;
@@ -122,10 +123,10 @@ static const unsigned char *get_rev(void)
unsigned int mark;
struct commit_list *parents;
- if (rev_list == NULL || non_common_revs == 0)
+ if (rev_list.nr == 0 || non_common_revs == 0)
return NULL;
- commit = rev_list->item;
+ commit = prio_queue_get(&rev_list);
if (!commit->object.parsed)
parse_commit(commit);
parents = commit->parents;
@@ -152,8 +153,6 @@ static const unsigned char *get_rev(void)
mark_common(parents->item, 1, 0);
parents = parents->next;
}
-
- rev_list = rev_list->next;
}
return commit->object.sha1;
@@ -442,7 +441,7 @@ static int find_common(struct fetch_pack_args *args,
in_vain = 0;
got_continue = 1;
if (ack == ACK_ready) {
- rev_list = NULL;
+ clear_prio_queue(&rev_list);
got_ready = 1;
}
break;
--
1.8.3.rc2.14.g7eee6b3
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH 3/3] fetch-pack: avoid quadratic behavior in rev_list_push
2013-07-02 6:24 ` [PATCH 3/3] fetch-pack: avoid quadratic behavior in rev_list_push Jeff King
@ 2013-07-02 7:52 ` Eric Sunshine
0 siblings, 0 replies; 15+ messages in thread
From: Eric Sunshine @ 2013-07-02 7:52 UTC (permalink / raw)
To: Jeff King; +Cc: Junio C Hamano, Martin Fick, Git List
On Tue, Jul 2, 2013 at 2:24 AM, Jeff King <peff@peff.net> wrote:
> When we call find_common to start finding common ancestors
> with the remote side of a fetch, the first thing we do is
> insert the tip of each ref into our rev_list linked list. We
> keep the list sorted the whole time with
> commit_list_insert_by_date, which means our insertion ends
> up doing O(n^2) timestamp comparisons.
>
> We could teach rev_list_push to use an unsorted list, and
> then sort it once after we have added each ref. However, in
> get_rev, we process the list by popping commits off the
> front and adding parents back in timestamp-sorted order. So
> that procedure would still operate on the large list.
>
> Instead, we can replace the linked list with a heap-based
> priority queue, which can do O(log n) insertion, making the
> whole insertion procedure O(n log n).
>
> As a result of switching to the prio_queue struct, we fix
> two minor bugs:
>
> 1. When we "pop" a commit in get_rev, and when we clear
> the rev_list in find_common, we do not take care to
> free the "struct commit_list", and just leak its
> memory. With the prio_queue implementation, the memory
> management is handled for us.
>
> 2. In get_rev, we look at the head commit of the list,
> possibly push its parents onto the list, and then "pop"
> the front of the list off, assuming it is the same
> element that we just peeked at. This is typically going
> to be the case, but would not be in the face of clock
> skew: the parents are inserted by date, and could
> potentailly be inserted at the head of the list if they
s/potentailly/potentially/
> have a timestamp newer than their descendent. In this
> case, we would accidentally pop the parent, and never
> process it at all.
>
> The new implementation pulls the commit off of the
> queue as we examine it, and so does not suffer from
> this problem.
>
> With this patch, a fetch of a single commit into a
> repository with 50,000 refs went from:
>
> real 0m7.984s
> user 0m7.852s
> sys 0m0.120s
>
> to:
>
> real 0m2.017s
> user 0m1.884s
> sys 0m0.124s
>
> Before this patch, a larger case with 370K refs still had
> not completed after tens of minutes; with this patch, it
> completes in about 12 seconds.
>
> Signed-off-by: Jeff King <peff@peff.net>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: How to still kill git fetch with too many refs
2013-07-02 3:02 How to still kill git fetch with too many refs Martin Fick
2013-07-02 4:07 ` Jeff King
@ 2013-07-02 9:24 ` Michael Haggerty
2013-07-02 16:58 ` Martin Fick
1 sibling, 1 reply; 15+ messages in thread
From: Michael Haggerty @ 2013-07-02 9:24 UTC (permalink / raw)
To: Martin Fick; +Cc: git, Jeff King, Junio C Hamano
On 07/02/2013 05:02 AM, Martin Fick wrote:
> I have often reported problems with git fetch when there are
> many refs in a repo, and I have been pleasantly surprised
> how many problems I reported were so quickly fixed. :) With
> time, others have created various synthetic test cases to
> ensure that git can handle many many refs. A simple
> synthetic test case with 1M refs all pointing to the same
> sha1 seems to be easily handled by git these days. However,
> in our experience with our internal git repo, we still have
> performance issues related to having too many refs, in our
> kernel/msm instance we have around 400K.
>
> When I tried the simple synthetic test case and could not
> reproduce bad results, so I tried something just a little
> more complex and was able to get atrocious results!!!
> Basically, I generate a packed-refs files with many refs
> which each point to a different sha1. To get a list of
> valid but unique sha1s for the repo, I simply used rev-list.
> The result, a copy of linus' repo with a million unique
> valid refs and a git fetch of a single updated ref taking a
> very long time (55mins and it did not complete yet). Note,
> with 100K refs it completes in about 2m40s. It is likely
> not linear since 2m40s * 10 would be ~26m (but the
> difference could also just be how the data in the sha1s are
> ordered).
>
>
> Here is my small reproducible test case for this issue:
>
> git clone
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> cp -rp linux linux.1Mrefs-revlist
>
> cd linux
> echo "Hello" > hello ; git add hello ; git ci -a -m 'hello'
> cd ..
>
> cd linux.1Mrefs-revlist
> git rev-list HEAD | for nn in $(seq 0 100) ; do for c in
> $(seq 0 10000) ; do read sha ; echo $sha refs/c/$nn/$c$nn ;
> done ; done > .git/packed-refs
I believe this generates a packed-refs file that is not sorted
lexicographically by refname, whereas all Git-generated packed-refs
files are sorted. There are some optimizations in refs.c for adding
references in order that might therefore be circumvented by your
unsorted file. Please try sorting the file by refname and see if that
helps. (You can do so by deleting one of the packed references; then
git will sort the remainder while rewriting the file.)
Michael
--
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: How to still kill git fetch with too many refs
2013-07-02 9:24 ` Michael Haggerty
@ 2013-07-02 16:58 ` Martin Fick
0 siblings, 0 replies; 15+ messages in thread
From: Martin Fick @ 2013-07-02 16:58 UTC (permalink / raw)
To: Michael Haggerty; +Cc: git, Jeff King, Junio C Hamano
On Tuesday, July 02, 2013 03:24:14 am Michael Haggerty
wrote:
> > git rev-list HEAD | for nn in $(seq 0 100) ; do for c
> > in $(seq 0 10000) ; do read sha ; echo $sha
> > refs/c/$nn/$c$nn ; done ; done > .git/packed-refs
>
> I believe this generates a packed-refs file that is not
> sorted lexicographically by refname, whereas all
> Git-generated packed-refs files are sorted.
Yes, you are indeed correct. I was attempting to be too
clever with my sharding I guess. Thanks.
> There are
> some optimizations in refs.c for adding references in
> order that might therefore be circumvented by your
> unsorted file. Please try sorting the file by refname
> and see if that helps. (You can do so by deleting one
> of the packed references; then git will sort the
> remainder while rewriting the file.)
A simple git pack-refs seems to clean it up.
The original test did complete in ~77mins last night. A
rerun with a sorted file takes ~61mins,
-Martin
PS: This test was performed with git version 1.8.2.1 on
linux 2.6.32-37-generic #81-Ubuntu SMP
--
The Qualcomm Innovation Center, Inc. is a member of Code
Aurora Forum, hosted by The Linux Foundation
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 0/3] avoid quadratic behavior in fetch-pack
2013-07-02 6:11 ` [PATCH 0/3] avoid quadratic behavior in fetch-pack Jeff King
` (2 preceding siblings ...)
2013-07-02 6:24 ` [PATCH 3/3] fetch-pack: avoid quadratic behavior in rev_list_push Jeff King
@ 2013-07-02 17:45 ` Martin Fick
3 siblings, 0 replies; 15+ messages in thread
From: Martin Fick @ 2013-07-02 17:45 UTC (permalink / raw)
To: Jeff King; +Cc: Junio C Hamano, git
On Tuesday, July 02, 2013 12:11:49 am Jeff King wrote:
> Here are my patches to deal with Martin's pathological
> case, split out for easy reading. I took a few timings
> to show that the results of the 3rd patch are noticeable
> even with 50,000 unique refs (which is still a lot, but
> something that I could conceive of a busy repo
> accumulating over time).
>
> [1/3]: fetch-pack: avoid quadratic list insertion in
> mark_complete [2/3]: commit.c: make
> compare_commits_by_commit_date global [3/3]: fetch-pack:
> avoid quadratic behavior in rev_list_push
>
> And here's the diffstat to prove it is really not scary.
> :)
>
> commit.c | 2 +-
> commit.h | 2 ++
> fetch-pack.c | 16 ++++++++--------
> 3 files changed, 11 insertions(+), 9 deletions(-)
>
> -Peff
I applied these 3 patches and it indeed improves things
dramatically. Thanks Peff, you are awesome!!!
The synthetic test case (but sorted), now comes in at around
15s. The more important real world case (for us), fetching
from my production server, which took around 12mins
previously, now takes around 30s (I think the extra time is
now spent on the Gerrit server, but I will investigate that
a bit more)! That is very significant and should make many
workflows much more efficient. +1 for merging this. :)
Again, thanks,
-Martin
Note, I tested git-next 1.8.3.2.883.g27cfd27 to be sure that
it is still problematic without this patch, it is (running
for 10mins now without completing).
--
The Qualcomm Innovation Center, Inc. is a member of Code
Aurora Forum, hosted by The Linux Foundation
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: How to still kill git fetch with too many refs
2013-07-02 5:01 ` Jeff King
2013-07-02 5:19 ` Junio C Hamano
@ 2013-07-02 17:52 ` Brandon Casey
1 sibling, 0 replies; 15+ messages in thread
From: Brandon Casey @ 2013-07-02 17:52 UTC (permalink / raw)
To: Jeff King; +Cc: Martin Fick, git@vger.kernel.org
On Mon, Jul 1, 2013 at 10:01 PM, Jeff King <peff@peff.net> wrote:
> On Tue, Jul 02, 2013 at 12:41:51AM -0400, Jeff King wrote:
>
>> I replicated your test setup, and the problem is that we have many
>> common objects on both sides during the ref negotiation. So we end up in
>> rev_list_push for each one, which has the same O(n^2) behavior.
>> Switching it to just sort at the end is not trivial; we first insert all
>> of the objects, but then we actually walk the parents, pushing onto the
>> list as we go. So I think we'd want a better data structure (like a
>> priority queue).
>
> Like the patch below,
<snip>
Looks obviously correct to me since I've got essentially the same
patch sitting in my local repo. :b
I've got the patch coming that fixes the same problem on the push side
of things and provides the same order of magnitude improvement.
-Brandon
^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2013-07-02 17:52 UTC | newest]
Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-07-02 3:02 How to still kill git fetch with too many refs Martin Fick
2013-07-02 4:07 ` Jeff King
2013-07-02 4:41 ` Jeff King
2013-07-02 5:01 ` Jeff King
2013-07-02 5:19 ` Junio C Hamano
2013-07-02 5:28 ` Jeff King
2013-07-02 6:11 ` [PATCH 0/3] avoid quadratic behavior in fetch-pack Jeff King
2013-07-02 6:16 ` [PATCH 1/3] fetch-pack: avoid quadratic list insertion in mark_complete Jeff King
2013-07-02 6:21 ` [PATCH 2/3] commit.c: make compare_commits_by_commit_date global Jeff King
2013-07-02 6:24 ` [PATCH 3/3] fetch-pack: avoid quadratic behavior in rev_list_push Jeff King
2013-07-02 7:52 ` Eric Sunshine
2013-07-02 17:45 ` [PATCH 0/3] avoid quadratic behavior in fetch-pack Martin Fick
2013-07-02 17:52 ` How to still kill git fetch with too many refs Brandon Casey
2013-07-02 9:24 ` Michael Haggerty
2013-07-02 16:58 ` Martin Fick
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).