[PATCH] refs.c: add a function to sort a ref list, rather then sorting on add

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* [PATCH] refs.c: add a function to sort a ref list, rather then sorting on add
@ 2007-04-17  1:42 Julian Phillips
  2007-04-17 16:03 ` Linus Torvalds
  2007-04-17 21:01 ` [PATCH] refs.c: add a function to sort a ref list, rather then sorting on add Julian Phillips
  0 siblings, 2 replies; 12+ messages in thread
From: Julian Phillips @ 2007-04-17  1:42 UTC (permalink / raw)
  To: git

Rather than sorting the refs list while building it, sort in one go
after it is built using a merge sort.  This has a large performance
boost with large numbers of refs.

Signed-off-by: Julian Phillips <julian@quantumfyre.co.uk>
---

Having got builtin fetch to the point of generating a correct FETCH_HEAD (for a certain path through the code at least), I revisted the speed issue I brought up a while back with the sorting of refs.

Running fetch (builtin version) on a repo with >9000 refs which is up-to-date, using the old sort-on-add I get (best of 5, warm cache):

real    0m4.351s
user    0m4.068s
sys     0m0.219s

With this patch the same fetch gives (worst of 5, warm cache):

real    0m2.196s
user    0m1.870s
sys     0m0.212s

Since this is orthogonal to making fetch a builtin, I don't see that it needs to wait ...

 refs.c |   95 +++++++++++++++++++++++++++++++++++++++++++++++++--------------
 1 files changed, 74 insertions(+), 21 deletions(-)

diff --git a/refs.c b/refs.c
index d2b7b7f..23982fc 100644
--- a/refs.c
+++ b/refs.c
@@ -47,22 +47,7 @@ static struct ref_list *add_ref(const char *name, const unsigned char *sha1,
 				struct ref_list **new_entry)
 {
 	int len;
-	struct ref_list **p = &list, *entry;
-
-	/* Find the place to insert the ref into.. */
-	while ((entry = *p) != NULL) {
-		int cmp = strcmp(entry->name, name);
-		if (cmp > 0)
-			break;
-
-		/* Same as existing entry? */
-		if (!cmp) {
-			if (new_entry)
-				*new_entry = entry;
-			return list;
-		}
-		p = &entry->next;
-	}
+	struct ref_list *entry;
 
 	/* Allocate it and add it in.. */
 	len = strlen(name) + 1;
@@ -71,11 +56,79 @@ static struct ref_list *add_ref(const char *name, const unsigned char *sha1,
 	hashclr(entry->peeled);
 	memcpy(entry->name, name, len);
 	entry->flag = flag;
-	entry->next = *p;
-	*p = entry;
+	entry->next = list;
 	if (new_entry)
 		*new_entry = entry;
-	return list;
+	return entry;
+}
+
+/* merge sort the ref list */
+static struct ref_list *sort_ref_list(struct ref_list *list)
+{
+	int psize, qsize, last_merge_count;
+	struct ref_list *p, *q, *l, *e;
+	struct ref_list *new_list = list;
+	int k = 1;
+	int merge_count = 0;
+
+	if (!list)
+		return list;
+
+	do {
+		last_merge_count = merge_count;
+		merge_count = 0;
+
+		psize = 0;
+
+		p = new_list;
+		q = new_list;
+		new_list = NULL;
+		l = NULL;
+
+		while (p) {
+			merge_count++;
+
+			while (psize < k && q->next) {
+				q = q->next;
+				psize++;
+			}
+			qsize = k;
+
+			while ((psize > 0) || (qsize > 0 && q)) {
+				if (qsize == 0 || !q) {
+					e = p;
+					p = p->next;
+					psize--;
+				} else if (psize == 0) {
+					e = q;
+					q = q->next;
+					qsize--;
+				} else if (strcmp(q->name, p->name) < 0) {
+					e = q;
+					q = q->next;
+					qsize--;
+				} else {
+					e = p;
+					p = p->next;
+					psize--;
+				}
+
+				e->next = NULL;
+
+				if (l)
+					l->next = e;
+				if (!new_list)
+					new_list = e;
+				l = e;
+			}
+
+			p = q;
+		};
+
+		k = k * 2;
+	} while ((last_merge_count != merge_count) || (last_merge_count != 1));
+
+	return new_list;
 }
 
 /*
@@ -142,7 +195,7 @@ static void read_packed_refs(FILE *f, struct cached_refs *cached_refs)
 		    !get_sha1_hex(refline + 1, sha1))
 			hashcpy(last->peeled, sha1);
 	}
-	cached_refs->packed = list;
+	cached_refs->packed = sort_ref_list(list);
 }
 
 static struct ref_list *get_packed_refs(void)
@@ -201,7 +254,7 @@ static struct ref_list *get_ref_dir(const char *base, struct ref_list *list)
 		free(ref);
 		closedir(dir);
 	}
-	return list;
+	return sort_ref_list(list);
 }
 
 static struct ref_list *get_loose_refs(void)
-- 
1.5.1.1

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH] refs.c: add a function to sort a ref list, rather then sorting on add
  2007-04-17  1:42 [PATCH] refs.c: add a function to sort a ref list, rather then sorting on add Julian Phillips
@ 2007-04-17 16:03 ` Linus Torvalds
  2007-04-17 20:17   ` Julian Phillips
  2007-04-17 22:00   ` Junio C Hamano
  2007-04-17 21:01 ` [PATCH] refs.c: add a function to sort a ref list, rather then sorting on add Julian Phillips
  1 sibling, 2 replies; 12+ messages in thread
From: Linus Torvalds @ 2007-04-17 16:03 UTC (permalink / raw)
  To: Julian Phillips; +Cc: git



On Tue, 17 Apr 2007, Julian Phillips wrote:
>
> Rather than sorting the refs list while building it, sort in one go
> after it is built using a merge sort.  This has a large performance
> boost with large numbers of refs.
> 
> Signed-off-by: Julian Phillips <julian@quantumfyre.co.uk>

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>

Looks fine. I think that even your new times are a bit high (over two 
seconds?) but things are clearly better. Have you looked at what takes so 
long now? 

		Linus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] refs.c: add a function to sort a ref list, rather then sorting on add
  2007-04-17 16:03 ` Linus Torvalds
@ 2007-04-17 20:17   ` Julian Phillips
  2007-04-17 20:51     ` Linus Torvalds
  2007-04-17 22:00   ` Junio C Hamano
  1 sibling, 1 reply; 12+ messages in thread
From: Julian Phillips @ 2007-04-17 20:17 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

On Tue, 17 Apr 2007, Linus Torvalds wrote:

>
>
> On Tue, 17 Apr 2007, Julian Phillips wrote:
>>
>> Rather than sorting the refs list while building it, sort in one go
>> after it is built using a merge sort.  This has a large performance
>> boost with large numbers of refs.
>>
>> Signed-off-by: Julian Phillips <julian@quantumfyre.co.uk>
>
> Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
>
> Looks fine. I think that even your new times are a bit high (over two
> seconds?) but things are clearly better. Have you looked at what takes so
> long now?

It's the tag auto-following code, I'm calling read_ref to see if I already 
have that tag - and it appears that doing that a few thousand times takes 
a while.

If I comment out that one line (so the code will _always_ think I have 
the tags - but I do have them, so ...) I get:

real    0m0.472s
user    0m0.277s
sys     0m0.181s

Looks like read_ref is the wrong thing to be using ...

-- 
Julian

  ---
"Life is like a buffet; it's not good but there's plenty of it."

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] refs.c: add a function to sort a ref list, rather then sorting on add
  2007-04-17 20:17   ` Julian Phillips
@ 2007-04-17 20:51     ` Linus Torvalds
  2007-04-17 22:03       ` Julian Phillips
  0 siblings, 1 reply; 12+ messages in thread
From: Linus Torvalds @ 2007-04-17 20:51 UTC (permalink / raw)
  To: Julian Phillips; +Cc: git



On Tue, 17 Apr 2007, Julian Phillips wrote:
> 
> It's the tag auto-following code, I'm calling read_ref to see if I already
> have that tag - and it appears that doing that a few thousand times takes a
> while.

Heh. I think we should probably call read_refs() just once to read them 
all (when most of them are packed, that's cheap), and then after that, 
have some way to just check for a match on the refs we have cached.

		Linus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] refs.c: add a function to sort a ref list, rather then sorting on add
  2007-04-17  1:42 [PATCH] refs.c: add a function to sort a ref list, rather then sorting on add Julian Phillips
  2007-04-17 16:03 ` Linus Torvalds
@ 2007-04-17 21:01 ` Julian Phillips
  1 sibling, 0 replies; 12+ messages in thread
From: Julian Phillips @ 2007-04-17 21:01 UTC (permalink / raw)
  To: git

On Tue, 17 Apr 2007, Julian Phillips wrote:

> Rather than sorting the refs list while building it, sort in one go
> after it is built using a merge sort.  This has a large performance
> boost with large numbers of refs.
>
> Signed-off-by: Julian Phillips <julian@quantumfyre.co.uk>
> ---
>
> Having got builtin fetch to the point of generating a correct FETCH_HEAD (for a certain path through the code at least), I revisted the speed issue I brought up a while back with the sorting of refs.
>
> Running fetch (builtin version) on a repo with >9000 refs which is up-to-date, using the old sort-on-add I get (best of 5, warm cache):
>
> real    0m4.351s
> user    0m4.068s
> sys     0m0.219s
>
> With this patch the same fetch gives (worst of 5, warm cache):
>
> real    0m2.196s
> user    0m1.870s
> sys     0m0.212s
>
> Since this is orthogonal to making fetch a builtin, I don't see that it needs to wait ...

In case anyone is curious, doing the same fetch with master 
(v1.5.1.1-135-gf948792):

master (best of 5, warm cache):
real    0m33.962s
user    0m23.992s
sys     0m9.986s

master + patch (worst of 5, warm cache):
real    0m20.821s
user    0m10.390s
sys     0m9.799s

-- 
Julian

  ---
NANCY!!  Why is everything RED?!

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] refs.c: add a function to sort a ref list, rather then sorting on add
  2007-04-17 16:03 ` Linus Torvalds
  2007-04-17 20:17   ` Julian Phillips
@ 2007-04-17 22:00   ` Junio C Hamano
  2007-04-17 22:43     ` Julian Phillips
  1 sibling, 1 reply; 12+ messages in thread
From: Junio C Hamano @ 2007-04-17 22:00 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Julian Phillips, git

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Tue, 17 Apr 2007, Julian Phillips wrote:
>>
>> Rather than sorting the refs list while building it, sort in one go
>> after it is built using a merge sort.  This has a large performance
>> boost with large numbers of refs.
>> 
>> Signed-off-by: Julian Phillips <julian@quantumfyre.co.uk>
>
> Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
>
> Looks fine. I think that even your new times are a bit high (over two 
> seconds?) but things are clearly better. Have you looked at what takes so 
> long now? 

I wonder why the loss of "we are replacing the same one" case in
the original add_ref() was not compensated for in the new
sort_ref_list().

I think we would not call add_ref() to the same list with
duplicate names, unless (1) filesystem is grossly corrupt, (2)
somebody added a new ref while we are walking (how does
readdir() behave in such a case???), or (3) packed-refs file is
corrupt.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] refs.c: add a function to sort a ref list, rather then sorting on add
  2007-04-17 20:51     ` Linus Torvalds
@ 2007-04-17 22:03       ` Julian Phillips
  0 siblings, 0 replies; 12+ messages in thread
From: Julian Phillips @ 2007-04-17 22:03 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

On Tue, 17 Apr 2007, Linus Torvalds wrote:

>
>
> On Tue, 17 Apr 2007, Julian Phillips wrote:
>>
>> It's the tag auto-following code, I'm calling read_ref to see if I already
>> have that tag - and it appears that doing that a few thousand times takes a
>> while.
>
> Heh. I think we should probably call read_refs() just once to read them
> all (when most of them are packed, that's cheap), and then after that,
> have some way to just check for a match on the refs we have cached.

I had a look at the exclude_existing function in show-ref.  That uses 
for_each_ref to build a path_list, and path_list_has_path to do the 
filtering...

Using that I get (worst of 5, warm cache):
real    0m0.526s
user    0m0.302s
sys     0m0.176s

-- 
Julian

  ---
The descent to Hades is the same from every place.
 		-- Anaxagoras

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] refs.c: add a function to sort a ref list, rather then sorting on add
  2007-04-17 22:00   ` Junio C Hamano
@ 2007-04-17 22:43     ` Julian Phillips
  2007-04-18 19:36       ` Junio C Hamano
  0 siblings, 1 reply; 12+ messages in thread
From: Julian Phillips @ 2007-04-17 22:43 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Linus Torvalds, git

On Tue, 17 Apr 2007, Junio C Hamano wrote:

> I wonder why the loss of "we are replacing the same one" case in
> the original add_ref() was not compensated for in the new
> sort_ref_list().

That would be because it didn't bite me during testing, and I forgot to 
double check the code I was removing after I was happy that the sort was 
producing sane output (it was getting rather late).

Thing is, I'm not sure what to do about it anyway.  You could drop the 
duplicate in the sort, but since add_ref returns a pointer to the added 
entry (and used to simply return a pointer to the one already in the list) 
it's possible that you would be dropping something that someone had a 
pointer to.

> I think we would not call add_ref() to the same list with
> duplicate names, unless (1) filesystem is grossly corrupt, (2)
> somebody added a new ref while we are walking (how does
> readdir() behave in such a case???), or (3) packed-refs file is
> corrupt.

This combined with the fact that the old code didn't check that the sha1 
was the same suggests to me that this behaviour may actually have been a 
subtle bug?  Perhaps the best thing to do is die if we find two entries 
with the same name when sorting?

-- 
Julian

  ---
What passes for optimism is most often the effect of an intellectual error.
 		-- Raymond Aron, "The Opium of the Intellectuals"

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] refs.c: add a function to sort a ref list, rather then sorting on add
  2007-04-17 22:43     ` Julian Phillips
@ 2007-04-18 19:36       ` Junio C Hamano
  2007-04-18 21:27         ` [PATCH] refs.c: drop duplicate entries in sort_ref_list Julian Phillips
  0 siblings, 1 reply; 12+ messages in thread
From: Junio C Hamano @ 2007-04-18 19:36 UTC (permalink / raw)
  To: Julian Phillips; +Cc: Linus Torvalds, git

Julian Phillips <julian@quantumfyre.co.uk> writes:

> On Tue, 17 Apr 2007, Junio C Hamano wrote:
> ...
>> I think we would not call add_ref() to the same list with
>> duplicate names, unless (1) filesystem is grossly corrupt, (2)
>> somebody added a new ref while we are walking (how does
>> readdir() behave in such a case???), or (3) packed-refs file is
>> corrupt.
>
> This combined with the fact that the old code didn't check that the
> sha1 was the same suggests to me that this behaviour may actually have
> been a subtle bug?  Perhaps the best thing to do is die if we find two
> entries with the same name when sorting?

I am not sure what readdir() does if somebody adds a new ref
while we are walking the directory; I am hoping we would not get
the same thing in duplicates, but I dunno.

I think the most sensible thing to do is to check for
duplicates, discarding if the SHA-1 match and otherwise dying.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH] refs.c: drop duplicate entries in sort_ref_list
  2007-04-18 19:36       ` Junio C Hamano
@ 2007-04-18 21:27         ` Julian Phillips
  2007-04-19  8:19           ` Johannes Sixt
  0 siblings, 1 reply; 12+ messages in thread
From: Julian Phillips @ 2007-04-18 21:27 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Linus Torvalds, git

It shouldn't happen that we read duplicate entries into the same list,
but just in case make sort_ref_list drop them.  If the SHA1s don't
match then die instead, as we have no way of knowing which one is the
correct one.

Signed-off-by: Julian Phillips <julian@quantumfyre.co.uk>
---

On Wed, 18 Apr 2007, Junio C Hamano wrote:

> Julian Phillips <julian@quantumfyre.co.uk> writes:
>
>> On Tue, 17 Apr 2007, Junio C Hamano wrote:
>> ...
>>> I think we would not call add_ref() to the same list with
>>> duplicate names, unless (1) filesystem is grossly corrupt, (2)
>>> somebody added a new ref while we are walking (how does
>>> readdir() behave in such a case???), or (3) packed-refs file is
>>> corrupt.
>>
>> This combined with the fact that the old code didn't check that the
>> sha1 was the same suggests to me that this behaviour may actually have
>> been a subtle bug?  Perhaps the best thing to do is die if we find two
>> entries with the same name when sorting?
>
> I am not sure what readdir() does if somebody adds a new ref
> while we are walking the directory; I am hoping we would not get
> the same thing in duplicates, but I dunno.
>
> I think the most sensible thing to do is to check for
> duplicates, discarding if the SHA-1 match and otherwise dying.
>

Something like this (on top of my previous patch)?

 refs.c |   34 ++++++++++++++++++++++++++--------
 1 files changed, 26 insertions(+), 8 deletions(-)

diff --git a/refs.c b/refs.c
index 23982fc..5736a0e 100644
--- a/refs.c
+++ b/refs.c
@@ -65,7 +65,7 @@ static struct ref_list *add_ref(const char *name, const unsigned char *sha1,
 /* merge sort the ref list */
 static struct ref_list *sort_ref_list(struct ref_list *list)
 {
-	int psize, qsize, last_merge_count;
+	int psize, qsize, last_merge_count, cmp;
 	struct ref_list *p, *q, *l, *e;
 	struct ref_list *new_list = list;
 	int k = 1;
@@ -103,14 +103,32 @@ static struct ref_list *sort_ref_list(struct ref_list *list)
 					e = q;
 					q = q->next;
 					qsize--;
-				} else if (strcmp(q->name, p->name) < 0) {
-					e = q;
-					q = q->next;
-					qsize--;
 				} else {
-					e = p;
-					p = p->next;
-					psize--;
+					cmp = strcmp(q->name, p->name);
+					if (cmp < 0) {
+						e = q;
+						q = q->next;
+						qsize--;
+					} else if (cmp > 0) {
+						e = p;
+						p = p->next;
+						psize--;
+					} else {
+						if (hashcmp(q->sha1, p->sha1))
+							die("Duplicated ref, "
+							    "and SHA1s don't "
+							    "match: %s",
+							    q->name);
+						warning("Duplicated ref: %s",
+							q->name);
+						e = q;
+						q = q->next;
+						qsize--;
+						free(e);
+						e = p;
+						p = p->next;
+						psize--;
+					}
 				}
 
 				e->next = NULL;
-- 
1.5.1.1

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH] refs.c: drop duplicate entries in sort_ref_list
  2007-04-18 21:27         ` [PATCH] refs.c: drop duplicate entries in sort_ref_list Julian Phillips
@ 2007-04-19  8:19           ` Johannes Sixt
  2007-04-19  8:49             ` Julian Phillips
  0 siblings, 1 reply; 12+ messages in thread
From: Johannes Sixt @ 2007-04-19  8:19 UTC (permalink / raw)
  To: git

Julian Phillips wrote:
> 
> It shouldn't happen that we read duplicate entries into the same list,
> but just in case make sort_ref_list drop them.  If the SHA1s don't
> match then die instead, as we have no way of knowing which one is the
> correct one.

Clone a random repository that has tags, then

$ rm .git/refs/tags/*
$ git fetch origin

now lists each tag twice: for the tag object itself and the commit it
points to. Is this related to the problem you are solving here? If so,
dying is probably not the best in this situation since the tags are
still unique.

(rm .git/refs/tags/* may be insane, but just the other day I wanted to
clean up my repo from a number of unneeded branches as well as the tags
that they contained - to find out that it is not enough to remove the
refs: the objects must go away too, otherwise git fetch auto-follows
them and the tags come back even though the branch heads are not there
anymore.)

$ git version
git version 1.5.1.1.27.g91776

-- Hannes

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] refs.c: drop duplicate entries in sort_ref_list
  2007-04-19  8:19           ` Johannes Sixt
@ 2007-04-19  8:49             ` Julian Phillips
  0 siblings, 0 replies; 12+ messages in thread
From: Julian Phillips @ 2007-04-19  8:49 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: git

On Thu, 19 Apr 2007, Johannes Sixt wrote:

> Julian Phillips wrote:
>>
>> It shouldn't happen that we read duplicate entries into the same list,
>> but just in case make sort_ref_list drop them.  If the SHA1s don't
>> match then die instead, as we have no way of knowing which one is the
>> correct one.
>
> Clone a random repository that has tags, then
>
> $ rm .git/refs/tags/*
> $ git fetch origin
>
> now lists each tag twice: for the tag object itself and the commit it
> points to. Is this related to the problem you are solving here? If so,
> dying is probably not the best in this situation since the tags are
> still unique.

The sort function is only looking at refs.  So if you rm .git/refs/tags/* 
then you don't have any tag refs (assuming no packed-refs).  What it is 
talking about is having two instances of (e.g.) refs/tags/foo.

In the situation you mention above you still shouldn't see duplicated 
refs.

-- 
Julian

  ---

Ah, so that's what's been wrong with the little fella.  He misses
casual sex.

 		-- Homer Simpson
 		   Two Dozen and One Greyhounds

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2007-04-19  8:49 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-04-17  1:42 [PATCH] refs.c: add a function to sort a ref list, rather then sorting on add Julian Phillips
2007-04-17 16:03 ` Linus Torvalds
2007-04-17 20:17   ` Julian Phillips
2007-04-17 20:51     ` Linus Torvalds
2007-04-17 22:03       ` Julian Phillips
2007-04-17 22:00   ` Junio C Hamano
2007-04-17 22:43     ` Julian Phillips
2007-04-18 19:36       ` Junio C Hamano
2007-04-18 21:27         ` [PATCH] refs.c: drop duplicate entries in sort_ref_list Julian Phillips
2007-04-19  8:19           ` Johannes Sixt
2007-04-19  8:49             ` Julian Phillips
2007-04-17 21:01 ` [PATCH] refs.c: add a function to sort a ref list, rather then sorting on add Julian Phillips

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).