git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* "git reflog expire --all" very slow
@ 2009-03-31  1:43 Linus Torvalds
  2009-03-31  4:34 ` Junio C Hamano
  0 siblings, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2009-03-31  1:43 UTC (permalink / raw
  To: Junio C Hamano, Brandon Casey, Johannes Schindelin; +Cc: Git Mailing List


I haven't checked in detail what is up, but I just did a "git gc --prune", 
and it was quiet for about half a minute before anything seemed to happen.

Very irritating, as normally the expensive stuff at least gives you some 
kind of indication of what it's doing.

It turns out that it's the reflog expiration. On my crazy beefy 
Nehalem machine:

	[torvalds@nehalem linux]$ time git reflog expire --all

	real	0m37.596s
	user	0m37.554s
	sys	0m0.040s

and that really isn't good. 37 cpu-seconds on this machine is like half a 
decade on some laptops I could name.

The flat pgprof for this thing (user-land oprofile isn't doing Nehalem 
yet) looks like this:

      %   cumulative   self              self     total           
     time   seconds   seconds    calls   s/call   s/call  name    
     60.94     30.24    30.24 301120211     0.00     0.00  interesting
     12.37     36.38     6.14 301338513     0.00     0.00  insert_by_date
     11.35     42.01     5.63     8776     0.00     0.00  clear_commit_marks
      9.96     46.95     4.94     4388     0.00     0.01  merge_bases_many
      2.16     48.02     1.07 301486366     0.00     0.00  commit_list_insert
      1.21     48.62     0.60 301329737     0.00     0.00  parse_commit
      0.87     49.05     0.43 301637945     0.00     0.00  xmalloc
      0.34     49.22     0.17       24     0.01     0.01  xstrdup
      ...

Ok, so my reflog on this thing has 1583 entries on HEAD (yes, in the last 
90 days, the problem is _not_ that I have a long reflog and am pruning it, 
it _is_ already pruned). Add to that the reflogs for the branches (mainly 
master: 1294), and you end up with apparently a nice total of 4388 reflog 
entries.

And then it looks like for _each_ reflog entry we have:

  expire_reflog_ent()
    in_merge_bases()

which then calls 

  get_merge_bases()
    get_merge_bases_many()
      ..

each of which probably often traverses an appreciable part of the kernel 
tree, since my reflog entries are often merges, and the merge bases need 
easily thousands of commits to look up.

Which explains how you end up with 301 _million_ commits inserted into the 
lists and checked if they are interesting. Since the whole kernel tree has 
only something like 140k commits, and my revlog doesn't even go back more 
than three months, I guess that means that we'll be traversing the same 
commits tens of thousands of times each.

Even on this machine, that whole cluster-f*ck takes a little while. Oops.

I have not checked if there is anything really obvious going on that could 
change that whole logic that causes us to do merge-bases into something 
saner, since the reflog code is not a part of git I'm familiar with. 

Instead, I'm just sending this to Junio, Brandon, and Dscho, who are 
getting the main blame for 'builtin-reflog.c'. Although I'm pretty sure 
this is all Junio, but just in case..

			Linus

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: "git reflog expire --all" very slow
  2009-03-31  1:43 "git reflog expire --all" very slow Linus Torvalds
@ 2009-03-31  4:34 ` Junio C Hamano
  2009-03-31  5:09   ` Linus Torvalds
  0 siblings, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2009-03-31  4:34 UTC (permalink / raw
  To: Linus Torvalds; +Cc: Brandon Casey, Johannes Schindelin, Git Mailing List

Linus Torvalds <torvalds@linux-foundation.org> writes:

> I have not checked if there is anything really obvious going on that could 
> change that whole logic that causes us to do merge-bases into something 
> saner, since the reflog code is not a part of git I'm familiar with. 

When we look at one reflog entry with old and new commit, we decide to
prune an old entry if either side is unreachable from the tip of the
branch:

	if (timestamp < cb->cmd->expire_unreachable) {
		if (!cb->ref_commit)
			goto prune;
		if (!old && !is_null_sha1(osha1))
			old = lookup_commit_reference_gently(osha1, 1);
		if (!new && !is_null_sha1(nsha1))
			new = lookup_commit_reference_gently(nsha1, 1);
		if ((old && !in_merge_bases(old, &cb->ref_commit, 1)) ||
		    (new && !in_merge_bases(new, &cb->ref_commit, 1)))
			goto prune;
	}

Most of your reflog entries are expected to be reachable from the tip, so
one optimization would be to mark all commits reachable from the tip
upfront, and omit the in_merge_bases() computation for the ones that are
already marked.  Perhaps something like this...

 builtin-reflog.c |   49 +++++++++++++++++++++++++++++++++++++++++++++++--
 1 files changed, 47 insertions(+), 2 deletions(-)

diff --git a/builtin-reflog.c b/builtin-reflog.c
index d95f515..f0d61bd 100644
--- a/builtin-reflog.c
+++ b/builtin-reflog.c
@@ -52,6 +52,7 @@ struct collect_reflog_cb {
 
 #define INCOMPLETE	(1u<<10)
 #define STUDYING	(1u<<11)
+#define REACHABLE	(1u<<12)
 
 static int tree_is_complete(const unsigned char *sha1)
 {
@@ -209,6 +210,43 @@ static int keep_entry(struct commit **it, unsigned char *sha1)
 	return 1;
 }
 
+static void mark_reachable(struct commit *commit, unsigned long expire_limit)
+{
+	/*
+	 * We need to compute if commit on either side of an reflog
+	 * entry is reachable from the tip of the ref for all entries.
+	 * Mark commits that are reachable from the tip down to the
+	 * time threashold first; we know a commit marked thusly is
+	 * reachable from the tip without running in_merge_bases()
+	 * at all.
+	 */
+	struct commit_list *pending = NULL;
+
+	commit_list_insert(commit, &pending);
+	while (pending) {
+		struct commit_list *entry = pending;
+		struct commit_list *parent;
+		pending = entry->next;
+		commit = entry->item;
+		free(entry);
+		if (commit->object.flags & REACHABLE)
+			continue;
+		commit->object.flags |= REACHABLE;
+		parent = commit->parents;
+		while (parent) {
+			commit = parent->item;
+			parent = parent->next;
+			if (commit->object.flags & REACHABLE)
+				continue;
+			if (parse_commit(commit))
+				continue;
+			if (commit->date < expire_limit)
+				continue;
+			commit_list_insert(commit, &pending);
+		}
+	}
+}
+
 static int expire_reflog_ent(unsigned char *osha1, unsigned char *nsha1,
 		const char *email, unsigned long timestamp, int tz,
 		const char *message, void *cb_data)
@@ -234,8 +272,12 @@ static int expire_reflog_ent(unsigned char *osha1, unsigned char *nsha1,
 			old = lookup_commit_reference_gently(osha1, 1);
 		if (!new && !is_null_sha1(nsha1))
 			new = lookup_commit_reference_gently(nsha1, 1);
-		if ((old && !in_merge_bases(old, &cb->ref_commit, 1)) ||
-		    (new && !in_merge_bases(new, &cb->ref_commit, 1)))
+		if ((old &&
+		     !(old->object.flags & REACHABLE) &&
+		     !in_merge_bases(old, &cb->ref_commit, 1)) ||
+		    (new &&
+		     !(new->object.flags & REACHABLE) &&
+		     !in_merge_bases(new, &cb->ref_commit, 1)))
 			goto prune;
 	}
 
@@ -288,7 +330,10 @@ static int expire_reflog(const char *ref, const unsigned char *sha1, int unused,
 	cb.ref_commit = lookup_commit_reference_gently(sha1, 1);
 	cb.ref = ref;
 	cb.cmd = cmd;
+
+	mark_reachable(cb.ref_commit, cmd->expire_unreachable);
 	for_each_reflog_ent(ref, expire_reflog_ent, &cb);
+	clear_commit_marks(cb.ref_commit, REACHABLE);
  finish:
 	if (cb.newlog) {
 		if (fclose(cb.newlog)) {

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: "git reflog expire --all" very slow
  2009-03-31  4:34 ` Junio C Hamano
@ 2009-03-31  5:09   ` Linus Torvalds
  2009-03-31  5:24     ` Junio C Hamano
  2009-03-31  5:38     ` Linus Torvalds
  0 siblings, 2 replies; 13+ messages in thread
From: Linus Torvalds @ 2009-03-31  5:09 UTC (permalink / raw
  To: Junio C Hamano; +Cc: Brandon Casey, Johannes Schindelin, Git Mailing List



On Mon, 30 Mar 2009, Junio C Hamano wrote:
> 
> Most of your reflog entries are expected to be reachable from the tip, so
> one optimization would be to mark all commits reachable from the tip
> upfront, and omit the in_merge_bases() computation for the ones that are
> already marked.  Perhaps something like this...

This if anything makes things just go slower.

Not much, but some. It went from 36.566s to 38.070s. That may be in the 
noise, I've not done any sensitivity analysis.

I thought you perhaps had a missing "parse_commit()" making the 
reachability thing not work (look_up_gently parses the object, but if it's 
a tag deref_tag() will dereference it until it hits a commit, but never 
parse the commit). But that wasn't it.

			Linus

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: "git reflog expire --all" very slow
  2009-03-31  5:09   ` Linus Torvalds
@ 2009-03-31  5:24     ` Junio C Hamano
  2009-03-31  5:42       ` Linus Torvalds
  2009-03-31  5:50       ` Junio C Hamano
  2009-03-31  5:38     ` Linus Torvalds
  1 sibling, 2 replies; 13+ messages in thread
From: Junio C Hamano @ 2009-03-31  5:24 UTC (permalink / raw
  To: Linus Torvalds; +Cc: Brandon Casey, Johannes Schindelin, Git Mailing List

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Mon, 30 Mar 2009, Junio C Hamano wrote:
>> 
>> Most of your reflog entries are expected to be reachable from the tip, so
>> one optimization would be to mark all commits reachable from the tip
>> upfront, and omit the in_merge_bases() computation for the ones that are
>> already marked.  Perhaps something like this...
>
> This if anything makes things just go slower.
>
> Not much, but some. It went from 36.566s to 38.070s. That may be in the 
> noise, I've not done any sensitivity analysis.

I actually think that the cutoff for history traversal is bogus.  You may
have a 30-day old reflog entry that pulled in a 45-day old commit, and
giving up the smudging at the expiry cutoff timestamp does not make much
sense.

I do have a lot of reflog entries kept around, as my main repository has
these:

        [gc]
                reflogexpire = '2005-01-01 00:00:00 +0000'
                reflogexpireunreachable = '2005-01-01 00:00:00 +0000'

so let me experiment a bit more.

 builtin-reflog.c |   61 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 59 insertions(+), 2 deletions(-)

diff --git a/builtin-reflog.c b/builtin-reflog.c
index d95f515..ef02b5a 100644
--- a/builtin-reflog.c
+++ b/builtin-reflog.c
@@ -36,6 +36,7 @@ struct expire_reflog_cb {
 	FILE *newlog;
 	const char *ref;
 	struct commit *ref_commit;
+	int commit_marked;
 	struct cmd_reflog_expire_cb *cmd;
 	unsigned char last_kept_sha1[20];
 };
@@ -52,6 +53,7 @@ struct collect_reflog_cb {
 
 #define INCOMPLETE	(1u<<10)
 #define STUDYING	(1u<<11)
+#define REACHABLE	(1u<<12)
 
 static int tree_is_complete(const unsigned char *sha1)
 {
@@ -209,6 +211,43 @@ static int keep_entry(struct commit **it, unsigned char *sha1)
 	return 1;
 }
 
+static void mark_reachable(struct commit *commit, unsigned long expire_limit)
+{
+	/*
+	 * We need to compute if commit on either side of an reflog
+	 * entry is reachable from the tip of the ref for all entries.
+	 * Mark commits that are reachable from the tip down to the
+	 * time threashold first; we know a commit marked thusly is
+	 * reachable from the tip without running in_merge_bases()
+	 * at all.
+	 */
+	struct commit_list *pending = NULL;
+
+	commit_list_insert(commit, &pending);
+	while (pending) {
+		struct commit_list *entry = pending;
+		struct commit_list *parent;
+		pending = entry->next;
+		commit = entry->item;
+		free(entry);
+		if (commit->object.flags & REACHABLE)
+			continue;
+		commit->object.flags |= REACHABLE;
+		parent = commit->parents;
+		while (parent) {
+			commit = parent->item;
+			parent = parent->next;
+			if (commit->object.flags & REACHABLE)
+				continue;
+			if (parse_commit(commit))
+				continue;
+			if (commit->date < expire_limit)
+				continue;
+			commit_list_insert(commit, &pending);
+		}
+	}
+}
+
 static int expire_reflog_ent(unsigned char *osha1, unsigned char *nsha1,
 		const char *email, unsigned long timestamp, int tz,
 		const char *message, void *cb_data)
@@ -230,12 +269,28 @@ static int expire_reflog_ent(unsigned char *osha1, unsigned char *nsha1,
 	if (timestamp < cb->cmd->expire_unreachable) {
 		if (!cb->ref_commit)
 			goto prune;
+
 		if (!old && !is_null_sha1(osha1))
 			old = lookup_commit_reference_gently(osha1, 1);
 		if (!new && !is_null_sha1(nsha1))
 			new = lookup_commit_reference_gently(nsha1, 1);
-		if ((old && !in_merge_bases(old, &cb->ref_commit, 1)) ||
-		    (new && !in_merge_bases(new, &cb->ref_commit, 1)))
+
+		if (!cb->commit_marked) {
+			if (old) {
+				mark_reachable(cb->ref_commit, old->date);
+				cb->commit_marked = 1;
+			} else if (new) {
+				mark_reachable(cb->ref_commit, new->date);
+				cb->commit_marked = 1;
+			}
+		}
+
+		if ((old &&
+		     !(old->object.flags & REACHABLE) &&
+		     !in_merge_bases(old, &cb->ref_commit, 1)) ||
+		    (new &&
+		     !(new->object.flags & REACHABLE) &&
+		     !in_merge_bases(new, &cb->ref_commit, 1)))
 			goto prune;
 	}
 
@@ -289,6 +344,8 @@ static int expire_reflog(const char *ref, const unsigned char *sha1, int unused,
 	cb.ref = ref;
 	cb.cmd = cmd;
 	for_each_reflog_ent(ref, expire_reflog_ent, &cb);
+	if (cb.ref_commit && cb.commit_marked)
+		clear_commit_marks(cb.ref_commit, REACHABLE);
  finish:
 	if (cb.newlog) {
 		if (fclose(cb.newlog)) {

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: "git reflog expire --all" very slow
  2009-03-31  5:09   ` Linus Torvalds
  2009-03-31  5:24     ` Junio C Hamano
@ 2009-03-31  5:38     ` Linus Torvalds
  2009-03-31  5:50       ` Linus Torvalds
  1 sibling, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2009-03-31  5:38 UTC (permalink / raw
  To: Junio C Hamano; +Cc: Brandon Casey, Johannes Schindelin, Git Mailing List



On Mon, 30 Mar 2009, Linus Torvalds wrote:
> 
> This if anything makes things just go slower.
> 
> Not much, but some. It went from 36.566s to 38.070s. That may be in the 
> noise, I've not done any sensitivity analysis.
> 
> I thought you perhaps had a missing "parse_commit()" making the 
> reachability thing not work (look_up_gently parses the object, but if it's 
> a tag deref_tag() will dereference it until it hits a commit, but never 
> parse the commit). But that wasn't it.

Ahhah.

I know why it makes things slower.

The slow case is already inside that whole:

	if (timestamp < cb->cmd->expire_unreachable) {

if-statement, so the thing that slows down is if we hit a commit that is 
_older_ than the expire limit.

But your whole "mark_reachable()" thing only marks things _younger_ than 
that reachable. So you mark exactly the wrong things reachable - you mark 
the ones that we don't even care about.

If I do

	mark_reachable(cb.ref_commit, 0);

instead (to traverse the _whole_ tree, with no regards to date), the time 
shrinks to 1.7s. But of course, that's also wrong.

Qutie frankly, I don't really understand why the logic isn't just

	if (timestamp < cb->cmd->expire_unreachable)
		goto prune; 

why is that reachability so important?

			Linus

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: "git reflog expire --all" very slow
  2009-03-31  5:24     ` Junio C Hamano
@ 2009-03-31  5:42       ` Linus Torvalds
  2009-03-31  5:57         ` Junio C Hamano
  2009-03-31  5:50       ` Junio C Hamano
  1 sibling, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2009-03-31  5:42 UTC (permalink / raw
  To: Junio C Hamano; +Cc: Brandon Casey, Johannes Schindelin, Git Mailing List



On Mon, 30 Mar 2009, Junio C Hamano wrote:
> 
> I do have a lot of reflog entries kept around, as my main repository has
> these:
> 
>         [gc]
>                 reflogexpire = '2005-01-01 00:00:00 +0000'
>                 reflogexpireunreachable = '2005-01-01 00:00:00 +0000'

I think that actually _hides_ the problem. You'll never have anything at 
all that triggers that

	if (timestamp < cb->cmd->expire_unreachable) {

because your "expire_unreachable" timestamp is already very old (== small 
value), so 'timestamp' will _not_ be older (smaller value) than that.

I dunno. As mentioned, I don't really understand why we'd want to save 
some of those reflog entries at all in the first place, so I'm probably 
missing something.

If we've asked for reflog entries past a certain age to be expired, why do 
when then look at the details of those reflog entries and only expire them 
under certain circumstances? Just expire them, and get rid of the 
'unreachable' part.

			Linus

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: "git reflog expire --all" very slow
  2009-03-31  5:38     ` Linus Torvalds
@ 2009-03-31  5:50       ` Linus Torvalds
  2009-03-31  5:51         ` Linus Torvalds
  2009-03-31  6:08         ` Junio C Hamano
  0 siblings, 2 replies; 13+ messages in thread
From: Linus Torvalds @ 2009-03-31  5:50 UTC (permalink / raw
  To: Junio C Hamano; +Cc: Brandon Casey, Johannes Schindelin, Git Mailing List



On Mon, 30 Mar 2009, Linus Torvalds wrote:
> 
> If I do
> 
> 	mark_reachable(cb.ref_commit, 0);

Ok, I think I got it.

You had

	mark_reachable(cb.ref_commit, cmd->expire_unreachable);

but we care about the commits that are younger than 'expire_total' (older 
than that, and they are pruned unconditionally), but older than 
'expire_unreachable' (younger than that and the date doesn't matter).

So making it do

	mark_reachable(cb.ref_commit, cmd->expire_total);

marks the right parts reachable. Not the whole tree, but also not just the 
commits we're not going to expire regardless.

With that change, it's all basically instantaneous. We don't need to 
traverse the whole kernel history, and with that change to your patch, I 
get

	[torvalds@nehalem linux]$ time ~/git/git reflog expire --all

	real	0m1.715s
	user	0m1.676s
	sys	0m0.040s

which is still slower than I'd wish for, but is a whole lot faster than 
over half a minute.

		Linus

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: "git reflog expire --all" very slow
  2009-03-31  5:24     ` Junio C Hamano
  2009-03-31  5:42       ` Linus Torvalds
@ 2009-03-31  5:50       ` Junio C Hamano
  1 sibling, 0 replies; 13+ messages in thread
From: Junio C Hamano @ 2009-03-31  5:50 UTC (permalink / raw
  To: Linus Torvalds; +Cc: Brandon Casey, Johannes Schindelin, Git Mailing List

Junio C Hamano <gitster@pobox.com> writes:

> Linus Torvalds <torvalds@linux-foundation.org> writes:
>> ...
>> This if anything makes things just go slower.
>> 
>> Not much, but some. It went from 36.566s to 38.070s. That may be in the 
>> noise, I've not done any sensitivity analysis.
>
> I actually think that the cutoff for history traversal is bogus.  You may
> have a 30-day old reflog entry that pulled in a 45-day old commit, and
> giving up the smudging at the expiry cutoff timestamp does not make much
> sense.
>
> I do have a lot of reflog entries kept around, as my main repository has
> these:
>
>         [gc]
>                 reflogexpire = '2005-01-01 00:00:00 +0000'
>                 reflogexpireunreachable = '2005-01-01 00:00:00 +0000'
>
> so let me experiment a bit more.

To initially prune a copy of my git repository that was never pruned
before, it takes a lot of time with or without the patch to prune from the
original 73476 reflog entries.

(without patch)
23.46user 0.23system 0:23.70elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+5464outputs (0major+6045minor)pagefaults 0swaps

(with patch, but with a bit of fprintf's for counting)
21.99user 0.40system 0:22.56elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+5472outputs (0major+6040minor)pagefaults 0swaps

After the initial pruning, with 13780 reflog entries still left (and no
further pruning expected), I am seeing these:

(without patch)
2.09user 0.22system 0:02.32elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+5464outputs (0major+4956minor)pagefaults 0swaps

(with patch, but again with a bit of fprintf's for counting)
0.76user 0.25system 0:01.05elapsed 96%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+5464outputs (0major+4918minor)pagefaults 0swaps

I might be better off traversing down all the way and do away with
in_merge_bases(), but that might hold true only for something small like
git.git and the kernel.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: "git reflog expire --all" very slow
  2009-03-31  5:50       ` Linus Torvalds
@ 2009-03-31  5:51         ` Linus Torvalds
  2009-04-02  6:46           ` Junio C Hamano
  2009-03-31  6:08         ` Junio C Hamano
  1 sibling, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2009-03-31  5:51 UTC (permalink / raw
  To: Junio C Hamano; +Cc: Brandon Casey, Johannes Schindelin, Git Mailing List


That made no sense. It should have been:

On Mon, 30 Mar 2009, Linus Torvalds wrote:
> 
> but we care about the commits that are younger than 'expire_total' (older 
> than that, and they are pruned unconditionally), but older than 
> 'expire_unreachable' (younger than that and the date doesn't matter).
                                                  ^^^^
                                                 reachability

but other than that the commentary stands.

		Linus

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: "git reflog expire --all" very slow
  2009-03-31  5:42       ` Linus Torvalds
@ 2009-03-31  5:57         ` Junio C Hamano
  0 siblings, 0 replies; 13+ messages in thread
From: Junio C Hamano @ 2009-03-31  5:57 UTC (permalink / raw
  To: Linus Torvalds; +Cc: Brandon Casey, Johannes Schindelin, Git Mailing List

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Mon, 30 Mar 2009, Junio C Hamano wrote:
>> 
>> I do have a lot of reflog entries kept around, as my main repository has
>> these:
>> 
>>         [gc]
>>                 reflogexpire = '2005-01-01 00:00:00 +0000'
>>                 reflogexpireunreachable = '2005-01-01 00:00:00 +0000'
>
> I think that actually _hides_ the problem. You'll never have anything at 
> all that triggers that
>
> 	if (timestamp < cb->cmd->expire_unreachable) {
>
> because your "expire_unreachable" timestamp is already very old (== small 
> value), so 'timestamp' will _not_ be older (smaller value) than that.
>
> I dunno. As mentioned, I don't really understand why we'd want to save 
> some of those reflog entries at all in the first place, so I'm probably 
> missing something.
>
> If we've asked for reflog entries past a certain age to be expired, why do 
> when then look at the details of those reflog entries and only expire them 
> under certain circumstances? Just expire them, and get rid of the 
> 'unreachable' part.

I know.  My test repository does not have them.  I only quoted them to
tell you that I have a lot of entries to play with.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: "git reflog expire --all" very slow
  2009-03-31  5:50       ` Linus Torvalds
  2009-03-31  5:51         ` Linus Torvalds
@ 2009-03-31  6:08         ` Junio C Hamano
  1 sibling, 0 replies; 13+ messages in thread
From: Junio C Hamano @ 2009-03-31  6:08 UTC (permalink / raw
  To: Linus Torvalds; +Cc: Brandon Casey, Johannes Schindelin, Git Mailing List

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Mon, 30 Mar 2009, Linus Torvalds wrote:
>> 
>> If I do
>> 
>> 	mark_reachable(cb.ref_commit, 0);
>
> Ok, I think I got it.
>
> You had
>
> 	mark_reachable(cb.ref_commit, cmd->expire_unreachable);
>
> but we care about the commits that are younger than 'expire_total' (older 
> than that, and they are pruned unconditionally), but older than 
> 'expire_unreachable' (younger than that and the date doesn't matter).
>
> So making it do
>
> 	mark_reachable(cb.ref_commit, cmd->expire_total);
>
> marks the right parts reachable. Not the whole tree, but also not just the 
> commits we're not going to expire regardless.
>
> With that change, it's all basically instantaneous. We don't need to 
> traverse the whole kernel history, and with that change to your patch, I 
> get
>
> 	[torvalds@nehalem linux]$ time ~/git/git reflog expire --all
>
> 	real	0m1.715s
> 	user	0m1.676s
> 	sys	0m0.040s
>
> which is still slower than I'd wish for, but is a whole lot faster than 
> over half a minute.

Actually, the initial pruning I showed (around 23 seconds) is not helped
with the expire_total change at all, but if I mark everything and get rid
of in_merge_bases(), it goes down to 1.15 seconds.

But subsequent pruning, which is what we should be optimizing for, gets
much faster with your approach of not traversing down the history all the
way.

By the way, I suspect I might be making the same mistake of traversing
both sides of the merge twice while marking, but I am tired already, so
I'll throw out the final patch for tonight and go to bed.

 builtin-reflog.c |   48 ++++++++++++++++++++++++++++++++++++++++++++++--
 1 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/builtin-reflog.c b/builtin-reflog.c
index d95f515..b67272a 100644
--- a/builtin-reflog.c
+++ b/builtin-reflog.c
@@ -52,6 +52,7 @@ struct collect_reflog_cb {
 
 #define INCOMPLETE	(1u<<10)
 #define STUDYING	(1u<<11)
+#define REACHABLE	(1u<<12)
 
 static int tree_is_complete(const unsigned char *sha1)
 {
@@ -209,6 +210,43 @@ static int keep_entry(struct commit **it, unsigned char *sha1)
 	return 1;
 }
 
+static void mark_reachable(struct commit *commit, unsigned long expire_limit)
+{
+	/*
+	 * We need to compute if commit on either side of an reflog
+	 * entry is reachable from the tip of the ref for all entries.
+	 * Mark commits that are reachable from the tip down to the
+	 * time threashold first; we know a commit marked thusly is
+	 * reachable from the tip without running in_merge_bases()
+	 * at all.
+	 */
+	struct commit_list *pending = NULL;
+
+	commit_list_insert(commit, &pending);
+	while (pending) {
+		struct commit_list *entry = pending;
+		struct commit_list *parent;
+		pending = entry->next;
+		commit = entry->item;
+		free(entry);
+		if (commit->object.flags & REACHABLE)
+			continue;
+		commit->object.flags |= REACHABLE;
+		parent = commit->parents;
+		while (parent) {
+			commit = parent->item;
+			parent = parent->next;
+			if (commit->object.flags & REACHABLE)
+				continue;
+			if (parse_commit(commit))
+				continue;
+			if (expire_limit && commit->date < expire_limit)
+				continue;
+			commit_list_insert(commit, &pending);
+		}
+	}
+}
+
 static int expire_reflog_ent(unsigned char *osha1, unsigned char *nsha1,
 		const char *email, unsigned long timestamp, int tz,
 		const char *message, void *cb_data)
@@ -234,8 +272,10 @@ static int expire_reflog_ent(unsigned char *osha1, unsigned char *nsha1,
 			old = lookup_commit_reference_gently(osha1, 1);
 		if (!new && !is_null_sha1(nsha1))
 			new = lookup_commit_reference_gently(nsha1, 1);
-		if ((old && !in_merge_bases(old, &cb->ref_commit, 1)) ||
-		    (new && !in_merge_bases(new, &cb->ref_commit, 1)))
+		if ((old && !(old->object.flags & REACHABLE) &&
+		     !in_merge_bases(old, &cb->ref_commit, 1)) ||
+		    (new && !(new->object.flags & REACHABLE) &&
+		     !in_merge_bases(new, &cb->ref_commit, 1)))
 			goto prune;
 	}
 
@@ -288,7 +328,11 @@ static int expire_reflog(const char *ref, const unsigned char *sha1, int unused,
 	cb.ref_commit = lookup_commit_reference_gently(sha1, 1);
 	cb.ref = ref;
 	cb.cmd = cmd;
+	if (cb.ref_commit)
+		mark_reachable(cb.ref_commit, cmd->expire_total);
 	for_each_reflog_ent(ref, expire_reflog_ent, &cb);
+	if (cb.ref_commit)
+		clear_commit_marks(cb.ref_commit, REACHABLE);
  finish:
 	if (cb.newlog) {
 		if (fclose(cb.newlog)) {

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: "git reflog expire --all" very slow
  2009-03-31  5:51         ` Linus Torvalds
@ 2009-04-02  6:46           ` Junio C Hamano
  2009-04-02 15:30             ` Linus Torvalds
  0 siblings, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2009-04-02  6:46 UTC (permalink / raw
  To: Linus Torvalds; +Cc: Brandon Casey, Johannes Schindelin, Git Mailing List

Linus Torvalds <torvalds@linux-foundation.org> writes:

> That made no sense. It should have been:
>
> On Mon, 30 Mar 2009, Linus Torvalds wrote:
>> 
>> but we care about the commits that are younger than 'expire_total' (older 
>> than that, and they are pruned unconditionally), but older than 
>> 'expire_unreachable' (younger than that and the date doesn't matter).
>                                                   ^^^^
>                                                  reachability
>
> but other than that the commentary stands.

Correct.  But after thinking about this a bit more, I am starting to suspect
the "of course" in your earlier

    If I do

            mark_reachable(cb.ref_commit, 0);

    instead (to traverse the _whole_ tree, with no regards to date), the time 
    shrinks to 1.7s. But of course, that's also wrong.

may not be such a clearly obvious thing.

Suppose you do not do "mark_reachable(cb.ref_commit, 0)" but use the
expire_total as the cut-off value (which is what I've queued).  If you
have one unreachable entry that you end up running in_merge_bases() for,
you will traverse all the history down _anyway_, and at that point, you
would be better off if you actually marked everything upfront, and
discarded anything unmarked as unreachable without falling back to
in_merge_bases() at all.

The above reasoning of course assumes that "keep reflog entries if they
are reachable from the tip, otherwise drop them if they are more than 30
days old" is a good medium level semantics to cull what the other rule
"drop any reflog entry older than 90 days" may not.

A hacky alternative would be to use total_expire as the cut-off and do not
fall back on in_merge_bases().  We might incorrectly prune away an entry
that records that you pulled a commit that is still reachable from the tip
last week, if that commit happens to be 4 months old if we did so, so I am
not convinced myself it is a reasonable hack, though.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: "git reflog expire --all" very slow
  2009-04-02  6:46           ` Junio C Hamano
@ 2009-04-02 15:30             ` Linus Torvalds
  0 siblings, 0 replies; 13+ messages in thread
From: Linus Torvalds @ 2009-04-02 15:30 UTC (permalink / raw
  To: Junio C Hamano; +Cc: Brandon Casey, Johannes Schindelin, Git Mailing List



On Wed, 1 Apr 2009, Junio C Hamano wrote:
> 
> Correct.  But after thinking about this a bit more, I am starting to suspect
> the "of course" in your earlier
> 
>     If I do
> 
>             mark_reachable(cb.ref_commit, 0);
> 
>     instead (to traverse the _whole_ tree, with no regards to date), the time 
>     shrinks to 1.7s. But of course, that's also wrong.
> 
> may not be such a clearly obvious thing.

I think we should never do it up-front, because for nicely behaved people 
who just pull other peoples work (which are also the people most likely to 
not have beefy machines), the normal reflog is going to be entirely 
reachable, and we don't have to traverse the whole thing.

So what I'd suggest is something like:

 - start off with the time limit, possibly with some extra fudging

 - but never bother calling "in_merge_bases()"

 - if we ever get to a commit that doesn't look reachable in that 
   situation, we now have two choices:

    (a) just use the dang 'object->flags & REACHABLE' flag as-is.
        Why even bother to be clever? We did the reachability by time 
        already, it's done, it's there, just use it. In other words, the 
        reachability simply works like "--since=<date>'.

    (b) Try to do the "exact" thing, and waste lots of time on it, and 
        maybe find an odd commit or two where the date had been wrong. Do 
        we really care? 

I'd actually go for 'a', with a slight modification: try to convert the 
"reflog date" (the date of a local action) into a "commit date" (the date 
of a commit in the repository). Because those two are different "time 
spaces", and comparing a "commit date" to a "in my repo" date is fairly 
wrong.

But in general, I don't think this is something that needs any extra 
precision. We're not talking about "theoretically reachable" here. We're 
talking about reflog entries that are already older than the 
unreachability limit, and that point to commits that are older than the 
reachability limit. Yes, yes, clocks aren't synchronized, but do we really 
care?

IOW, I'd suggest just removing the in_merge_base() tests entirely. Make 
the semantics even simpler:

	have_done_reachability = 0;
	reachability_date = 0;

	for_each_reflog_oldest_first() {
		/* Older than unconditional expire? */
		if (really_old(entry)) {
			reachability_date = entry->commit->date;
			goto prune;
		}

		/* Younger than the reflog reachability? */
		if (really_young(entry) && !fix_stale)
			goto save;

		/*
		 * Ok, not an unconditional expire entry.
		 * Do the reachability - once
		 */
		if (!have_done_reachability) {
			have_done_reachability = 1;
			if (fix_stale)
				reachability_date = 0;
			mark_reachabile(top, reachability_date);
		}

		if (!(entry->commit->flags & REACHABLE))
			goto prune;

	save:
		save(entry);
		continue;
	prune:
		prune(entry);
		continue;
	}

Does that change semantics? Yes, absolutely. But it sounds very practical.

		Linus

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2009-04-02 15:36 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-03-31  1:43 "git reflog expire --all" very slow Linus Torvalds
2009-03-31  4:34 ` Junio C Hamano
2009-03-31  5:09   ` Linus Torvalds
2009-03-31  5:24     ` Junio C Hamano
2009-03-31  5:42       ` Linus Torvalds
2009-03-31  5:57         ` Junio C Hamano
2009-03-31  5:50       ` Junio C Hamano
2009-03-31  5:38     ` Linus Torvalds
2009-03-31  5:50       ` Linus Torvalds
2009-03-31  5:51         ` Linus Torvalds
2009-04-02  6:46           ` Junio C Hamano
2009-04-02 15:30             ` Linus Torvalds
2009-03-31  6:08         ` Junio C Hamano

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).