* How to still kill git fetch with too many refs @ 2013-07-02 3:02 Martin Fick 2013-07-02 4:07 ` Jeff King 2013-07-02 9:24 ` Michael Haggerty 0 siblings, 2 replies; 15+ messages in thread From: Martin Fick @ 2013-07-02 3:02 UTC (permalink / raw) To: git I have often reported problems with git fetch when there are many refs in a repo, and I have been pleasantly surprised how many problems I reported were so quickly fixed. :) With time, others have created various synthetic test cases to ensure that git can handle many many refs. A simple synthetic test case with 1M refs all pointing to the same sha1 seems to be easily handled by git these days. However, in our experience with our internal git repo, we still have performance issues related to having too many refs, in our kernel/msm instance we have around 400K. When I tried the simple synthetic test case and could not reproduce bad results, so I tried something just a little more complex and was able to get atrocious results!!! Basically, I generate a packed-refs files with many refs which each point to a different sha1. To get a list of valid but unique sha1s for the repo, I simply used rev-list. The result, a copy of linus' repo with a million unique valid refs and a git fetch of a single updated ref taking a very long time (55mins and it did not complete yet). Note, with 100K refs it completes in about 2m40s. It is likely not linear since 2m40s * 10 would be ~26m (but the difference could also just be how the data in the sha1s are ordered). Here is my small reproducible test case for this issue: git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git cp -rp linux linux.1Mrefs-revlist cd linux echo "Hello" > hello ; git add hello ; git ci -a -m 'hello' cd .. cd linux.1Mrefs-revlist git rev-list HEAD | for nn in $(seq 0 100) ; do for c in $(seq 0 10000) ; do read sha ; echo $sha refs/c/$nn/$c$nn ; done ; done > .git/packed-refs time git fetch file:///$(dirname $PWD)/linux refs/heads/master Any insights as to why it is so slow, and how we could possibly speed it up? Thanks, -Martin PS: My tests were performed with git version 1.8.2.1 on linux 2.6.32-37-generic #81-Ubuntu SMP -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: How to still kill git fetch with too many refs 2013-07-02 3:02 How to still kill git fetch with too many refs Martin Fick @ 2013-07-02 4:07 ` Jeff King 2013-07-02 4:41 ` Jeff King 2013-07-02 9:24 ` Michael Haggerty 1 sibling, 1 reply; 15+ messages in thread From: Jeff King @ 2013-07-02 4:07 UTC (permalink / raw) To: Martin Fick; +Cc: git On Mon, Jul 01, 2013 at 09:02:31PM -0600, Martin Fick wrote: > A simple synthetic test case with 1M refs all pointing to the same > sha1 seems to be easily handled by git these days. However, in our > experience with our internal git repo, we still have performance > issues related to having too many refs, in our kernel/msm instance we > have around 400K. I'm not too surprised. There's O(n^2) behavior in fetch-pack's mark_complete function, as it adds each of the N ref tips to a commit_list, sorting by date (so each insertion is O(n)). I posted two alternative patches in May of 2011. The first simply avoids adding duplicate objects, which is simple and covers many real-world cases (e.g., an "alternates" repository which has a bunch of copies of the same tags, one per fork). The second one switches the commit_list out for a heap-based priority queue. We ended up taking the first (as ea5f220), since it was trivial and obviously correct, but didn't bother with the second since: 1. There had been no real-world reports of it. 2. While in theory a priority queue implementation would be used in other spots, too, it ended up being a pain to use it, as most of the callers wanted list-like splicing. You can see the original here: http://thread.gmane.org/gmane.comp.version-control.git/174003/focus=174005 Though it probably doesn't apply cleanly anymore. However, I've kept it rebased over the years at: git://github.com/peff/git.git jk/fast-commit-list Junio recently added a priority queue implementation in b4b594a (prio-queue: priority queue of pointers to structs, 2013-06-06), which is currently in next. So a modern version of that series would build on top of that, rather than my priority queue. And yet another alternative would be to keep the list unsorted during the mark_complete calls, and then sort it at the end. Like this: diff --git a/fetch-pack.c b/fetch-pack.c index abe5ffb..4df8abd 100644 --- a/fetch-pack.c +++ b/fetch-pack.c @@ -505,7 +505,7 @@ static int mark_complete(const char *refname, const unsigned char *sha1, int fla struct commit *commit = (struct commit *)o; if (!(commit->object.flags & COMPLETE)) { commit->object.flags |= COMPLETE; - commit_list_insert_by_date(commit, &complete); + commit_list_insert(commit, &complete); } } return 0; @@ -622,6 +622,7 @@ static int everything_local(struct fetch_pack_args *args, if (!args->depth) { for_each_ref(mark_complete, NULL); for_each_alternate_ref(mark_alternate_complete, NULL); + commit_list_sort_by_date(&complete); if (cutoff) mark_recent_complete_commits(args, cutoff); } (If you're wondering why we didn't do this trivial bit at the time, it was because back then we did not yet have the René's nice linked-list mergesort that backs commit_list_sort_by_date). > The result, a copy of linus' repo with a million unique > valid refs and a git fetch of a single updated ref taking a > very long time (55mins and it did not complete yet). Note, > with 100K refs it completes in about 2m40s. It is likely > not linear since 2m40s * 10 would be ~26m (but the > difference could also just be how the data in the sha1s are > ordered). That sounds like the O(n^2) problem. My timings back then with 100K refs were 1-2 minutes. Does the patch above fix it for you? -Peff ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: How to still kill git fetch with too many refs 2013-07-02 4:07 ` Jeff King @ 2013-07-02 4:41 ` Jeff King 2013-07-02 5:01 ` Jeff King 0 siblings, 1 reply; 15+ messages in thread From: Jeff King @ 2013-07-02 4:41 UTC (permalink / raw) To: Martin Fick; +Cc: git On Tue, Jul 02, 2013 at 12:07:58AM -0400, Jeff King wrote: > And yet another alternative would be to keep the list unsorted during > the mark_complete calls, and then sort it at the end. Like this: Amusingly, I seem to have posted the exact same patch last year: http://article.gmane.org/gmane.comp.version-control.git/194939 but never polished it up. It does cut a little time off of the initial ref-loading that fetch-pack does (especially when we do not end up transferring any objects at all). But it does not solve your problem. I replicated your test setup, and the problem is that we have many common objects on both sides during the ref negotiation. So we end up in rev_list_push for each one, which has the same O(n^2) behavior. Switching it to just sort at the end is not trivial; we first insert all of the objects, but then we actually walk the parents, pushing onto the list as we go. So I think we'd want a better data structure (like a priority queue). -Peff ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: How to still kill git fetch with too many refs 2013-07-02 4:41 ` Jeff King @ 2013-07-02 5:01 ` Jeff King 2013-07-02 5:19 ` Junio C Hamano 2013-07-02 17:52 ` How to still kill git fetch with too many refs Brandon Casey 0 siblings, 2 replies; 15+ messages in thread From: Jeff King @ 2013-07-02 5:01 UTC (permalink / raw) To: Martin Fick; +Cc: git On Tue, Jul 02, 2013 at 12:41:51AM -0400, Jeff King wrote: > I replicated your test setup, and the problem is that we have many > common objects on both sides during the ref negotiation. So we end up in > rev_list_push for each one, which has the same O(n^2) behavior. > Switching it to just sort at the end is not trivial; we first insert all > of the objects, but then we actually walk the parents, pushing onto the > list as we go. So I think we'd want a better data structure (like a > priority queue). Like the patch below, which is built on top of next (which has Junio's prio_queue implementation), and has both the priority queue fix for rev_list_push and the mark_complete sort-at-the-end fix. With this, your test case completes in a few seconds (no clue if it would ever have finished otherwise; I never let it run for more than a minute or two, but I did confirm that it was spending all of its time in commit_list_insert_by_date). I didn't look too hard at the actual semantics, so there might be away to not mark so many commits in the first place. This is just a mindless replacement of an O(n^2) list insertion with the queue. I did note two oddities, though: 1. In the original, we never free the commit_list pointers when we pop them (or when we clear the list in find_common. So I think the existing code was leaking memory, and my version does not. 2. In the original version of get_rev, we "peek" at the head commit of rev_list: commit = rev_list->item; Then we look at the parents of that commit, and possibly push them onto rev_list: while (parents) { if (!(parents->item->object.flags & SEEN)) rev_list_push(parents->item, mark); ... } and then finally, "pop" the commit off the list: rev_list = rev_list->next; But what guarantee do we have that the commit we just processed is still at the front of the list? If the timestamps are monotonic, then the parent must come after the descendent and therefore will not have gotten pushed onto the front of the list. But in the face of clock skew, this is not the case, and we will have just skipped over the parent with our "pop". So unless I am missing something very clever, I think this was a bug (albeit a difficult one to have matter). And it's also fixed by using prio_queue (which does the peek/pop together at the top of the function). diff --git a/commit.c b/commit.c index 521e49c..ebc0eea 100644 --- a/commit.c +++ b/commit.c @@ -581,7 +581,7 @@ static int compare_commits_by_author_date(const void *a_, const void *b_, return 0; } -static int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused) +int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused) { const struct commit *a = a_, *b = b_; /* newer commits with larger date first */ diff --git a/commit.h b/commit.h index 4d452dc..18a5234 100644 --- a/commit.h +++ b/commit.h @@ -254,4 +254,6 @@ extern void check_commit_signature(const struct commit* commit, struct signature */ extern void check_commit_signature(const struct commit* commit, struct signature_check *sigc); +int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused); + #endif /* COMMIT_H */ diff --git a/fetch-pack.c b/fetch-pack.c index abe5ffb..6684348 100644 --- a/fetch-pack.c +++ b/fetch-pack.c @@ -11,6 +11,7 @@ #include "run-command.h" #include "transport.h" #include "version.h" +#include "prio-queue.h" static int transfer_unpack_limit = -1; static int fetch_unpack_limit = -1; @@ -37,7 +38,7 @@ static int marked; */ #define MAX_IN_VAIN 256 -static struct commit_list *rev_list; +static struct prio_queue rev_list = { compare_commits_by_commit_date }; static int non_common_revs, multi_ack, use_sideband, allow_tip_sha1_in_want; static void rev_list_push(struct commit *commit, int mark) @@ -49,7 +50,7 @@ static void rev_list_push(struct commit *commit, int mark) if (parse_commit(commit)) return; - commit_list_insert_by_date(commit, &rev_list); + prio_queue_put(&rev_list, commit); if (!(commit->object.flags & COMMON)) non_common_revs++; @@ -122,10 +123,10 @@ static const unsigned char *get_rev(void) unsigned int mark; struct commit_list *parents; - if (rev_list == NULL || non_common_revs == 0) + if (rev_list.nr == 0 || non_common_revs == 0) return NULL; - commit = rev_list->item; + commit = prio_queue_get(&rev_list); if (!commit->object.parsed) parse_commit(commit); parents = commit->parents; @@ -152,8 +153,6 @@ static const unsigned char *get_rev(void) mark_common(parents->item, 1, 0); parents = parents->next; } - - rev_list = rev_list->next; } return commit->object.sha1; @@ -442,7 +441,7 @@ static int find_common(struct fetch_pack_args *args, in_vain = 0; got_continue = 1; if (ack == ACK_ready) { - rev_list = NULL; + clear_prio_queue(&rev_list); got_ready = 1; } break; @@ -505,7 +504,7 @@ static int mark_complete(const char *refname, const unsigned char *sha1, int fla struct commit *commit = (struct commit *)o; if (!(commit->object.flags & COMPLETE)) { commit->object.flags |= COMPLETE; - commit_list_insert_by_date(commit, &complete); + commit_list_insert(commit, &complete); } } return 0; @@ -622,6 +621,7 @@ static int everything_local(struct fetch_pack_args *args, if (!args->depth) { for_each_ref(mark_complete, NULL); for_each_alternate_ref(mark_alternate_complete, NULL); + commit_list_sort_by_date(&complete); if (cutoff) mark_recent_complete_commits(args, cutoff); } ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: How to still kill git fetch with too many refs 2013-07-02 5:01 ` Jeff King @ 2013-07-02 5:19 ` Junio C Hamano 2013-07-02 5:28 ` Jeff King 2013-07-02 17:52 ` How to still kill git fetch with too many refs Brandon Casey 1 sibling, 1 reply; 15+ messages in thread From: Junio C Hamano @ 2013-07-02 5:19 UTC (permalink / raw) To: Jeff King; +Cc: Martin Fick, git Jeff King <peff@peff.net> writes: > On Tue, Jul 02, 2013 at 12:41:51AM -0400, Jeff King wrote: > >> I replicated your test setup, and the problem is that we have many >> common objects on both sides during the ref negotiation. So we end up in >> rev_list_push for each one, which has the same O(n^2) behavior. >> Switching it to just sort at the end is not trivial; we first insert all >> of the objects, but then we actually walk the parents, pushing onto the >> list as we go. So I think we'd want a better data structure (like a >> priority queue). > > Like the patch below, which is built on top of next (which has Junio's > prio_queue implementation), and has both the priority queue fix for > rev_list_push and the mark_complete sort-at-the-end fix. Wow, I saw "160 lines" in my MUA which scared me a bit until I opened it to realize 40% is discussion and most of the remaining lines are context around single liners. It just looks too easy/simple, but the result looks correct, at least from a cursory read. Good job ;-) ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: How to still kill git fetch with too many refs 2013-07-02 5:19 ` Junio C Hamano @ 2013-07-02 5:28 ` Jeff King 2013-07-02 6:11 ` [PATCH 0/3] avoid quadratic behavior in fetch-pack Jeff King 0 siblings, 1 reply; 15+ messages in thread From: Jeff King @ 2013-07-02 5:28 UTC (permalink / raw) To: Junio C Hamano; +Cc: Martin Fick, git On Mon, Jul 01, 2013 at 10:19:51PM -0700, Junio C Hamano wrote: > > Like the patch below, which is built on top of next (which has Junio's > > prio_queue implementation), and has both the priority queue fix for > > rev_list_push and the mark_complete sort-at-the-end fix. > > Wow, I saw "160 lines" in my MUA which scared me a bit until I > opened it to realize 40% is discussion and most of the remaining > lines are context around single liners. > > It just looks too easy/simple, but the result looks correct, at > least from a cursory read. > > Good job ;-) Thanks. :) I'm splitting it out into readable patches now. At first I was made a bit nervous by the "popping" behavior I described as "oddity #2" earlier. But the more I look at it, the more I am convinced it is simply a bug that we can happen to fix along the way. Patches in a few minutes. -Peff ^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH 0/3] avoid quadratic behavior in fetch-pack 2013-07-02 5:28 ` Jeff King @ 2013-07-02 6:11 ` Jeff King 2013-07-02 6:16 ` [PATCH 1/3] fetch-pack: avoid quadratic list insertion in mark_complete Jeff King ` (3 more replies) 0 siblings, 4 replies; 15+ messages in thread From: Jeff King @ 2013-07-02 6:11 UTC (permalink / raw) To: Junio C Hamano; +Cc: Martin Fick, git Here are my patches to deal with Martin's pathological case, split out for easy reading. I took a few timings to show that the results of the 3rd patch are noticeable even with 50,000 unique refs (which is still a lot, but something that I could conceive of a busy repo accumulating over time). [1/3]: fetch-pack: avoid quadratic list insertion in mark_complete [2/3]: commit.c: make compare_commits_by_commit_date global [3/3]: fetch-pack: avoid quadratic behavior in rev_list_push And here's the diffstat to prove it is really not scary. :) commit.c | 2 +- commit.h | 2 ++ fetch-pack.c | 16 ++++++++-------- 3 files changed, 11 insertions(+), 9 deletions(-) -Peff ^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH 1/3] fetch-pack: avoid quadratic list insertion in mark_complete 2013-07-02 6:11 ` [PATCH 0/3] avoid quadratic behavior in fetch-pack Jeff King @ 2013-07-02 6:16 ` Jeff King 2013-07-02 6:21 ` [PATCH 2/3] commit.c: make compare_commits_by_commit_date global Jeff King ` (2 subsequent siblings) 3 siblings, 0 replies; 15+ messages in thread From: Jeff King @ 2013-07-02 6:16 UTC (permalink / raw) To: Junio C Hamano; +Cc: Martin Fick, git We insert the commit pointed to by each ref one-by-one into the "complete" commit_list using insert_by_date. Because each insertion is O(n), we end up with O(n^2) behavior. This typically doesn't matter, because the number of refs is reasonably small. And even if there are a lot of refs, they often point to a smaller set of objects (in which case the optimization in commit ea5f220 keeps our "n" small). However, in pathological repositories (hundreds of thousands of refs, each pointing to a unique commit), this quadratic behavior can make a difference. Since we do not care about the list order until we have finished building it, we can simply keep it unsorted during the insertion phase, then sort it afterwards. On a repository like the one described above, this dropped the time to do a no-op fetch from 2.0s to 1.7s. On normal repositories, it probably does not matter at all, but it does not hurt to protect ourselves from pathological cases. Signed-off-by: Jeff King <peff@peff.net> --- A note on the timings. I measured the times above last year when I wrote the same patch here: http://article.gmane.org/gmane.comp.version-control.git/194939 And earlier tonight, I did a fetch that showed the same result. But when I tried to replicate it while writing the commit message, I had trouble, because either: 1. I was fetching actual commits, in which case the more serious problem behavior in find_common kicked in, ruining the measurement. 2. It was a no-op fetch, in which case quickfetch() kicked in and we did not call mark_complete at all. So I'm rather confused how I managed to get timings earlier (both last year, and earlier today). But I still think it's an obviously correct thing to do, and does protect us in case (1) above (actually fetching a commit) once we fix the problem in find_common. fetch-pack.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/fetch-pack.c b/fetch-pack.c index abe5ffb..4df8abd 100644 --- a/fetch-pack.c +++ b/fetch-pack.c @@ -505,7 +505,7 @@ static int mark_complete(const char *refname, const unsigned char *sha1, int fla struct commit *commit = (struct commit *)o; if (!(commit->object.flags & COMPLETE)) { commit->object.flags |= COMPLETE; - commit_list_insert_by_date(commit, &complete); + commit_list_insert(commit, &complete); } } return 0; @@ -622,6 +622,7 @@ static int everything_local(struct fetch_pack_args *args, if (!args->depth) { for_each_ref(mark_complete, NULL); for_each_alternate_ref(mark_alternate_complete, NULL); + commit_list_sort_by_date(&complete); if (cutoff) mark_recent_complete_commits(args, cutoff); } -- 1.8.3.rc2.14.g7eee6b3 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH 2/3] commit.c: make compare_commits_by_commit_date global 2013-07-02 6:11 ` [PATCH 0/3] avoid quadratic behavior in fetch-pack Jeff King 2013-07-02 6:16 ` [PATCH 1/3] fetch-pack: avoid quadratic list insertion in mark_complete Jeff King @ 2013-07-02 6:21 ` Jeff King 2013-07-02 6:24 ` [PATCH 3/3] fetch-pack: avoid quadratic behavior in rev_list_push Jeff King 2013-07-02 17:45 ` [PATCH 0/3] avoid quadratic behavior in fetch-pack Martin Fick 3 siblings, 0 replies; 15+ messages in thread From: Jeff King @ 2013-07-02 6:21 UTC (permalink / raw) To: Junio C Hamano; +Cc: Martin Fick, git This helper function was introduced as a prio_queue comparator to help topological sorting. However, other users of prio_queue who want to replace commit_list_insert_by_date will want to use it, too. So let's make it public. Signed-off-by: Jeff King <peff@peff.net> --- There is also compare_commits_by_author_date, but I expect it to be less generally useful (especially because it relies on a slab), so I didn't bother publicizing it. I think this is sufficient for now, and any later users can make the author version public if they need to. Note also that we have a similar comparison function, commit_list_compare_by_date, which gets fed to the linked-list mergesort for commit_list_sort_by_date. I was tempted to unify them, but we can't, because it takes an actual "struct commit_list *", not a "struct commit *" (and the logic is not so complex that it is worth factoring out to a shared helper). commit.c | 2 +- commit.h | 2 ++ 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/commit.c b/commit.c index 521e49c..ebc0eea 100644 --- a/commit.c +++ b/commit.c @@ -581,7 +581,7 @@ static int compare_commits_by_author_date(const void *a_, const void *b_, return 0; } -static int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused) +int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused) { const struct commit *a = a_, *b = b_; /* newer commits with larger date first */ diff --git a/commit.h b/commit.h index 4d452dc..18a5234 100644 --- a/commit.h +++ b/commit.h @@ -254,4 +254,6 @@ extern void check_commit_signature(const struct commit* commit, struct signature */ extern void check_commit_signature(const struct commit* commit, struct signature_check *sigc); +int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused); + #endif /* COMMIT_H */ -- 1.8.3.rc2.14.g7eee6b3 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH 3/3] fetch-pack: avoid quadratic behavior in rev_list_push 2013-07-02 6:11 ` [PATCH 0/3] avoid quadratic behavior in fetch-pack Jeff King 2013-07-02 6:16 ` [PATCH 1/3] fetch-pack: avoid quadratic list insertion in mark_complete Jeff King 2013-07-02 6:21 ` [PATCH 2/3] commit.c: make compare_commits_by_commit_date global Jeff King @ 2013-07-02 6:24 ` Jeff King 2013-07-02 7:52 ` Eric Sunshine 2013-07-02 17:45 ` [PATCH 0/3] avoid quadratic behavior in fetch-pack Martin Fick 3 siblings, 1 reply; 15+ messages in thread From: Jeff King @ 2013-07-02 6:24 UTC (permalink / raw) To: Junio C Hamano; +Cc: Martin Fick, git When we call find_common to start finding common ancestors with the remote side of a fetch, the first thing we do is insert the tip of each ref into our rev_list linked list. We keep the list sorted the whole time with commit_list_insert_by_date, which means our insertion ends up doing O(n^2) timestamp comparisons. We could teach rev_list_push to use an unsorted list, and then sort it once after we have added each ref. However, in get_rev, we process the list by popping commits off the front and adding parents back in timestamp-sorted order. So that procedure would still operate on the large list. Instead, we can replace the linked list with a heap-based priority queue, which can do O(log n) insertion, making the whole insertion procedure O(n log n). As a result of switching to the prio_queue struct, we fix two minor bugs: 1. When we "pop" a commit in get_rev, and when we clear the rev_list in find_common, we do not take care to free the "struct commit_list", and just leak its memory. With the prio_queue implementation, the memory management is handled for us. 2. In get_rev, we look at the head commit of the list, possibly push its parents onto the list, and then "pop" the front of the list off, assuming it is the same element that we just peeked at. This is typically going to be the case, but would not be in the face of clock skew: the parents are inserted by date, and could potentailly be inserted at the head of the list if they have a timestamp newer than their descendent. In this case, we would accidentally pop the parent, and never process it at all. The new implementation pulls the commit off of the queue as we examine it, and so does not suffer from this problem. With this patch, a fetch of a single commit into a repository with 50,000 refs went from: real 0m7.984s user 0m7.852s sys 0m0.120s to: real 0m2.017s user 0m1.884s sys 0m0.124s Before this patch, a larger case with 370K refs still had not completed after tens of minutes; with this patch, it completes in about 12 seconds. Signed-off-by: Jeff King <peff@peff.net> --- Not that it really matters, but Martin, note that your "one million refs" case is actually more like 370K, because that is how many commits there are in the linux.git repo. The rest of the lines written to your packed-refs file are just bogus. fetch-pack.c | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/fetch-pack.c b/fetch-pack.c index 4df8abd..6684348 100644 --- a/fetch-pack.c +++ b/fetch-pack.c @@ -11,6 +11,7 @@ #include "run-command.h" #include "transport.h" #include "version.h" +#include "prio-queue.h" static int transfer_unpack_limit = -1; static int fetch_unpack_limit = -1; @@ -37,7 +38,7 @@ static int marked; */ #define MAX_IN_VAIN 256 -static struct commit_list *rev_list; +static struct prio_queue rev_list = { compare_commits_by_commit_date }; static int non_common_revs, multi_ack, use_sideband, allow_tip_sha1_in_want; static void rev_list_push(struct commit *commit, int mark) @@ -49,7 +50,7 @@ static void rev_list_push(struct commit *commit, int mark) if (parse_commit(commit)) return; - commit_list_insert_by_date(commit, &rev_list); + prio_queue_put(&rev_list, commit); if (!(commit->object.flags & COMMON)) non_common_revs++; @@ -122,10 +123,10 @@ static const unsigned char *get_rev(void) unsigned int mark; struct commit_list *parents; - if (rev_list == NULL || non_common_revs == 0) + if (rev_list.nr == 0 || non_common_revs == 0) return NULL; - commit = rev_list->item; + commit = prio_queue_get(&rev_list); if (!commit->object.parsed) parse_commit(commit); parents = commit->parents; @@ -152,8 +153,6 @@ static const unsigned char *get_rev(void) mark_common(parents->item, 1, 0); parents = parents->next; } - - rev_list = rev_list->next; } return commit->object.sha1; @@ -442,7 +441,7 @@ static int find_common(struct fetch_pack_args *args, in_vain = 0; got_continue = 1; if (ack == ACK_ready) { - rev_list = NULL; + clear_prio_queue(&rev_list); got_ready = 1; } break; -- 1.8.3.rc2.14.g7eee6b3 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH 3/3] fetch-pack: avoid quadratic behavior in rev_list_push 2013-07-02 6:24 ` [PATCH 3/3] fetch-pack: avoid quadratic behavior in rev_list_push Jeff King @ 2013-07-02 7:52 ` Eric Sunshine 0 siblings, 0 replies; 15+ messages in thread From: Eric Sunshine @ 2013-07-02 7:52 UTC (permalink / raw) To: Jeff King; +Cc: Junio C Hamano, Martin Fick, Git List On Tue, Jul 2, 2013 at 2:24 AM, Jeff King <peff@peff.net> wrote: > When we call find_common to start finding common ancestors > with the remote side of a fetch, the first thing we do is > insert the tip of each ref into our rev_list linked list. We > keep the list sorted the whole time with > commit_list_insert_by_date, which means our insertion ends > up doing O(n^2) timestamp comparisons. > > We could teach rev_list_push to use an unsorted list, and > then sort it once after we have added each ref. However, in > get_rev, we process the list by popping commits off the > front and adding parents back in timestamp-sorted order. So > that procedure would still operate on the large list. > > Instead, we can replace the linked list with a heap-based > priority queue, which can do O(log n) insertion, making the > whole insertion procedure O(n log n). > > As a result of switching to the prio_queue struct, we fix > two minor bugs: > > 1. When we "pop" a commit in get_rev, and when we clear > the rev_list in find_common, we do not take care to > free the "struct commit_list", and just leak its > memory. With the prio_queue implementation, the memory > management is handled for us. > > 2. In get_rev, we look at the head commit of the list, > possibly push its parents onto the list, and then "pop" > the front of the list off, assuming it is the same > element that we just peeked at. This is typically going > to be the case, but would not be in the face of clock > skew: the parents are inserted by date, and could > potentailly be inserted at the head of the list if they s/potentailly/potentially/ > have a timestamp newer than their descendent. In this > case, we would accidentally pop the parent, and never > process it at all. > > The new implementation pulls the commit off of the > queue as we examine it, and so does not suffer from > this problem. > > With this patch, a fetch of a single commit into a > repository with 50,000 refs went from: > > real 0m7.984s > user 0m7.852s > sys 0m0.120s > > to: > > real 0m2.017s > user 0m1.884s > sys 0m0.124s > > Before this patch, a larger case with 370K refs still had > not completed after tens of minutes; with this patch, it > completes in about 12 seconds. > > Signed-off-by: Jeff King <peff@peff.net> ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 0/3] avoid quadratic behavior in fetch-pack 2013-07-02 6:11 ` [PATCH 0/3] avoid quadratic behavior in fetch-pack Jeff King ` (2 preceding siblings ...) 2013-07-02 6:24 ` [PATCH 3/3] fetch-pack: avoid quadratic behavior in rev_list_push Jeff King @ 2013-07-02 17:45 ` Martin Fick 3 siblings, 0 replies; 15+ messages in thread From: Martin Fick @ 2013-07-02 17:45 UTC (permalink / raw) To: Jeff King; +Cc: Junio C Hamano, git On Tuesday, July 02, 2013 12:11:49 am Jeff King wrote: > Here are my patches to deal with Martin's pathological > case, split out for easy reading. I took a few timings > to show that the results of the 3rd patch are noticeable > even with 50,000 unique refs (which is still a lot, but > something that I could conceive of a busy repo > accumulating over time). > > [1/3]: fetch-pack: avoid quadratic list insertion in > mark_complete [2/3]: commit.c: make > compare_commits_by_commit_date global [3/3]: fetch-pack: > avoid quadratic behavior in rev_list_push > > And here's the diffstat to prove it is really not scary. > :) > > commit.c | 2 +- > commit.h | 2 ++ > fetch-pack.c | 16 ++++++++-------- > 3 files changed, 11 insertions(+), 9 deletions(-) > > -Peff I applied these 3 patches and it indeed improves things dramatically. Thanks Peff, you are awesome!!! The synthetic test case (but sorted), now comes in at around 15s. The more important real world case (for us), fetching from my production server, which took around 12mins previously, now takes around 30s (I think the extra time is now spent on the Gerrit server, but I will investigate that a bit more)! That is very significant and should make many workflows much more efficient. +1 for merging this. :) Again, thanks, -Martin Note, I tested git-next 1.8.3.2.883.g27cfd27 to be sure that it is still problematic without this patch, it is (running for 10mins now without completing). -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: How to still kill git fetch with too many refs 2013-07-02 5:01 ` Jeff King 2013-07-02 5:19 ` Junio C Hamano @ 2013-07-02 17:52 ` Brandon Casey 1 sibling, 0 replies; 15+ messages in thread From: Brandon Casey @ 2013-07-02 17:52 UTC (permalink / raw) To: Jeff King; +Cc: Martin Fick, git@vger.kernel.org On Mon, Jul 1, 2013 at 10:01 PM, Jeff King <peff@peff.net> wrote: > On Tue, Jul 02, 2013 at 12:41:51AM -0400, Jeff King wrote: > >> I replicated your test setup, and the problem is that we have many >> common objects on both sides during the ref negotiation. So we end up in >> rev_list_push for each one, which has the same O(n^2) behavior. >> Switching it to just sort at the end is not trivial; we first insert all >> of the objects, but then we actually walk the parents, pushing onto the >> list as we go. So I think we'd want a better data structure (like a >> priority queue). > > Like the patch below, <snip> Looks obviously correct to me since I've got essentially the same patch sitting in my local repo. :b I've got the patch coming that fixes the same problem on the push side of things and provides the same order of magnitude improvement. -Brandon ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: How to still kill git fetch with too many refs 2013-07-02 3:02 How to still kill git fetch with too many refs Martin Fick 2013-07-02 4:07 ` Jeff King @ 2013-07-02 9:24 ` Michael Haggerty 2013-07-02 16:58 ` Martin Fick 1 sibling, 1 reply; 15+ messages in thread From: Michael Haggerty @ 2013-07-02 9:24 UTC (permalink / raw) To: Martin Fick; +Cc: git, Jeff King, Junio C Hamano On 07/02/2013 05:02 AM, Martin Fick wrote: > I have often reported problems with git fetch when there are > many refs in a repo, and I have been pleasantly surprised > how many problems I reported were so quickly fixed. :) With > time, others have created various synthetic test cases to > ensure that git can handle many many refs. A simple > synthetic test case with 1M refs all pointing to the same > sha1 seems to be easily handled by git these days. However, > in our experience with our internal git repo, we still have > performance issues related to having too many refs, in our > kernel/msm instance we have around 400K. > > When I tried the simple synthetic test case and could not > reproduce bad results, so I tried something just a little > more complex and was able to get atrocious results!!! > Basically, I generate a packed-refs files with many refs > which each point to a different sha1. To get a list of > valid but unique sha1s for the repo, I simply used rev-list. > The result, a copy of linus' repo with a million unique > valid refs and a git fetch of a single updated ref taking a > very long time (55mins and it did not complete yet). Note, > with 100K refs it completes in about 2m40s. It is likely > not linear since 2m40s * 10 would be ~26m (but the > difference could also just be how the data in the sha1s are > ordered). > > > Here is my small reproducible test case for this issue: > > git clone > git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git > cp -rp linux linux.1Mrefs-revlist > > cd linux > echo "Hello" > hello ; git add hello ; git ci -a -m 'hello' > cd .. > > cd linux.1Mrefs-revlist > git rev-list HEAD | for nn in $(seq 0 100) ; do for c in > $(seq 0 10000) ; do read sha ; echo $sha refs/c/$nn/$c$nn ; > done ; done > .git/packed-refs I believe this generates a packed-refs file that is not sorted lexicographically by refname, whereas all Git-generated packed-refs files are sorted. There are some optimizations in refs.c for adding references in order that might therefore be circumvented by your unsorted file. Please try sorting the file by refname and see if that helps. (You can do so by deleting one of the packed references; then git will sort the remainder while rewriting the file.) Michael -- Michael Haggerty mhagger@alum.mit.edu http://softwareswirl.blogspot.com/ ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: How to still kill git fetch with too many refs 2013-07-02 9:24 ` Michael Haggerty @ 2013-07-02 16:58 ` Martin Fick 0 siblings, 0 replies; 15+ messages in thread From: Martin Fick @ 2013-07-02 16:58 UTC (permalink / raw) To: Michael Haggerty; +Cc: git, Jeff King, Junio C Hamano On Tuesday, July 02, 2013 03:24:14 am Michael Haggerty wrote: > > git rev-list HEAD | for nn in $(seq 0 100) ; do for c > > in $(seq 0 10000) ; do read sha ; echo $sha > > refs/c/$nn/$c$nn ; done ; done > .git/packed-refs > > I believe this generates a packed-refs file that is not > sorted lexicographically by refname, whereas all > Git-generated packed-refs files are sorted. Yes, you are indeed correct. I was attempting to be too clever with my sharding I guess. Thanks. > There are > some optimizations in refs.c for adding references in > order that might therefore be circumvented by your > unsorted file. Please try sorting the file by refname > and see if that helps. (You can do so by deleting one > of the packed references; then git will sort the > remainder while rewriting the file.) A simple git pack-refs seems to clean it up. The original test did complete in ~77mins last night. A rerun with a sorted file takes ~61mins, -Martin PS: This test was performed with git version 1.8.2.1 on linux 2.6.32-37-generic #81-Ubuntu SMP -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation ^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2013-07-02 17:52 UTC | newest] Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2013-07-02 3:02 How to still kill git fetch with too many refs Martin Fick 2013-07-02 4:07 ` Jeff King 2013-07-02 4:41 ` Jeff King 2013-07-02 5:01 ` Jeff King 2013-07-02 5:19 ` Junio C Hamano 2013-07-02 5:28 ` Jeff King 2013-07-02 6:11 ` [PATCH 0/3] avoid quadratic behavior in fetch-pack Jeff King 2013-07-02 6:16 ` [PATCH 1/3] fetch-pack: avoid quadratic list insertion in mark_complete Jeff King 2013-07-02 6:21 ` [PATCH 2/3] commit.c: make compare_commits_by_commit_date global Jeff King 2013-07-02 6:24 ` [PATCH 3/3] fetch-pack: avoid quadratic behavior in rev_list_push Jeff King 2013-07-02 7:52 ` Eric Sunshine 2013-07-02 17:45 ` [PATCH 0/3] avoid quadratic behavior in fetch-pack Martin Fick 2013-07-02 17:52 ` How to still kill git fetch with too many refs Brandon Casey 2013-07-02 9:24 ` Michael Haggerty 2013-07-02 16:58 ` Martin Fick
Code repositories for project(s) associated with this public inbox https://80x24.org/mirrors/git.git This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).