* "git reflog expire --all" very slow
@ 2009-03-31 1:43 Linus Torvalds
2009-03-31 4:34 ` Junio C Hamano
0 siblings, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2009-03-31 1:43 UTC (permalink / raw
To: Junio C Hamano, Brandon Casey, Johannes Schindelin; +Cc: Git Mailing List
I haven't checked in detail what is up, but I just did a "git gc --prune",
and it was quiet for about half a minute before anything seemed to happen.
Very irritating, as normally the expensive stuff at least gives you some
kind of indication of what it's doing.
It turns out that it's the reflog expiration. On my crazy beefy
Nehalem machine:
[torvalds@nehalem linux]$ time git reflog expire --all
real 0m37.596s
user 0m37.554s
sys 0m0.040s
and that really isn't good. 37 cpu-seconds on this machine is like half a
decade on some laptops I could name.
The flat pgprof for this thing (user-land oprofile isn't doing Nehalem
yet) looks like this:
% cumulative self self total
time seconds seconds calls s/call s/call name
60.94 30.24 30.24 301120211 0.00 0.00 interesting
12.37 36.38 6.14 301338513 0.00 0.00 insert_by_date
11.35 42.01 5.63 8776 0.00 0.00 clear_commit_marks
9.96 46.95 4.94 4388 0.00 0.01 merge_bases_many
2.16 48.02 1.07 301486366 0.00 0.00 commit_list_insert
1.21 48.62 0.60 301329737 0.00 0.00 parse_commit
0.87 49.05 0.43 301637945 0.00 0.00 xmalloc
0.34 49.22 0.17 24 0.01 0.01 xstrdup
...
Ok, so my reflog on this thing has 1583 entries on HEAD (yes, in the last
90 days, the problem is _not_ that I have a long reflog and am pruning it,
it _is_ already pruned). Add to that the reflogs for the branches (mainly
master: 1294), and you end up with apparently a nice total of 4388 reflog
entries.
And then it looks like for _each_ reflog entry we have:
expire_reflog_ent()
in_merge_bases()
which then calls
get_merge_bases()
get_merge_bases_many()
..
each of which probably often traverses an appreciable part of the kernel
tree, since my reflog entries are often merges, and the merge bases need
easily thousands of commits to look up.
Which explains how you end up with 301 _million_ commits inserted into the
lists and checked if they are interesting. Since the whole kernel tree has
only something like 140k commits, and my revlog doesn't even go back more
than three months, I guess that means that we'll be traversing the same
commits tens of thousands of times each.
Even on this machine, that whole cluster-f*ck takes a little while. Oops.
I have not checked if there is anything really obvious going on that could
change that whole logic that causes us to do merge-bases into something
saner, since the reflog code is not a part of git I'm familiar with.
Instead, I'm just sending this to Junio, Brandon, and Dscho, who are
getting the main blame for 'builtin-reflog.c'. Although I'm pretty sure
this is all Junio, but just in case..
Linus
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: "git reflog expire --all" very slow
2009-03-31 1:43 "git reflog expire --all" very slow Linus Torvalds
@ 2009-03-31 4:34 ` Junio C Hamano
2009-03-31 5:09 ` Linus Torvalds
0 siblings, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2009-03-31 4:34 UTC (permalink / raw
To: Linus Torvalds; +Cc: Brandon Casey, Johannes Schindelin, Git Mailing List
Linus Torvalds <torvalds@linux-foundation.org> writes:
> I have not checked if there is anything really obvious going on that could
> change that whole logic that causes us to do merge-bases into something
> saner, since the reflog code is not a part of git I'm familiar with.
When we look at one reflog entry with old and new commit, we decide to
prune an old entry if either side is unreachable from the tip of the
branch:
if (timestamp < cb->cmd->expire_unreachable) {
if (!cb->ref_commit)
goto prune;
if (!old && !is_null_sha1(osha1))
old = lookup_commit_reference_gently(osha1, 1);
if (!new && !is_null_sha1(nsha1))
new = lookup_commit_reference_gently(nsha1, 1);
if ((old && !in_merge_bases(old, &cb->ref_commit, 1)) ||
(new && !in_merge_bases(new, &cb->ref_commit, 1)))
goto prune;
}
Most of your reflog entries are expected to be reachable from the tip, so
one optimization would be to mark all commits reachable from the tip
upfront, and omit the in_merge_bases() computation for the ones that are
already marked. Perhaps something like this...
builtin-reflog.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++--
1 files changed, 47 insertions(+), 2 deletions(-)
diff --git a/builtin-reflog.c b/builtin-reflog.c
index d95f515..f0d61bd 100644
--- a/builtin-reflog.c
+++ b/builtin-reflog.c
@@ -52,6 +52,7 @@ struct collect_reflog_cb {
#define INCOMPLETE (1u<<10)
#define STUDYING (1u<<11)
+#define REACHABLE (1u<<12)
static int tree_is_complete(const unsigned char *sha1)
{
@@ -209,6 +210,43 @@ static int keep_entry(struct commit **it, unsigned char *sha1)
return 1;
}
+static void mark_reachable(struct commit *commit, unsigned long expire_limit)
+{
+ /*
+ * We need to compute if commit on either side of an reflog
+ * entry is reachable from the tip of the ref for all entries.
+ * Mark commits that are reachable from the tip down to the
+ * time threashold first; we know a commit marked thusly is
+ * reachable from the tip without running in_merge_bases()
+ * at all.
+ */
+ struct commit_list *pending = NULL;
+
+ commit_list_insert(commit, &pending);
+ while (pending) {
+ struct commit_list *entry = pending;
+ struct commit_list *parent;
+ pending = entry->next;
+ commit = entry->item;
+ free(entry);
+ if (commit->object.flags & REACHABLE)
+ continue;
+ commit->object.flags |= REACHABLE;
+ parent = commit->parents;
+ while (parent) {
+ commit = parent->item;
+ parent = parent->next;
+ if (commit->object.flags & REACHABLE)
+ continue;
+ if (parse_commit(commit))
+ continue;
+ if (commit->date < expire_limit)
+ continue;
+ commit_list_insert(commit, &pending);
+ }
+ }
+}
+
static int expire_reflog_ent(unsigned char *osha1, unsigned char *nsha1,
const char *email, unsigned long timestamp, int tz,
const char *message, void *cb_data)
@@ -234,8 +272,12 @@ static int expire_reflog_ent(unsigned char *osha1, unsigned char *nsha1,
old = lookup_commit_reference_gently(osha1, 1);
if (!new && !is_null_sha1(nsha1))
new = lookup_commit_reference_gently(nsha1, 1);
- if ((old && !in_merge_bases(old, &cb->ref_commit, 1)) ||
- (new && !in_merge_bases(new, &cb->ref_commit, 1)))
+ if ((old &&
+ !(old->object.flags & REACHABLE) &&
+ !in_merge_bases(old, &cb->ref_commit, 1)) ||
+ (new &&
+ !(new->object.flags & REACHABLE) &&
+ !in_merge_bases(new, &cb->ref_commit, 1)))
goto prune;
}
@@ -288,7 +330,10 @@ static int expire_reflog(const char *ref, const unsigned char *sha1, int unused,
cb.ref_commit = lookup_commit_reference_gently(sha1, 1);
cb.ref = ref;
cb.cmd = cmd;
+
+ mark_reachable(cb.ref_commit, cmd->expire_unreachable);
for_each_reflog_ent(ref, expire_reflog_ent, &cb);
+ clear_commit_marks(cb.ref_commit, REACHABLE);
finish:
if (cb.newlog) {
if (fclose(cb.newlog)) {
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: "git reflog expire --all" very slow
2009-03-31 4:34 ` Junio C Hamano
@ 2009-03-31 5:09 ` Linus Torvalds
2009-03-31 5:24 ` Junio C Hamano
2009-03-31 5:38 ` Linus Torvalds
0 siblings, 2 replies; 13+ messages in thread
From: Linus Torvalds @ 2009-03-31 5:09 UTC (permalink / raw
To: Junio C Hamano; +Cc: Brandon Casey, Johannes Schindelin, Git Mailing List
On Mon, 30 Mar 2009, Junio C Hamano wrote:
>
> Most of your reflog entries are expected to be reachable from the tip, so
> one optimization would be to mark all commits reachable from the tip
> upfront, and omit the in_merge_bases() computation for the ones that are
> already marked. Perhaps something like this...
This if anything makes things just go slower.
Not much, but some. It went from 36.566s to 38.070s. That may be in the
noise, I've not done any sensitivity analysis.
I thought you perhaps had a missing "parse_commit()" making the
reachability thing not work (look_up_gently parses the object, but if it's
a tag deref_tag() will dereference it until it hits a commit, but never
parse the commit). But that wasn't it.
Linus
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: "git reflog expire --all" very slow
2009-03-31 5:09 ` Linus Torvalds
@ 2009-03-31 5:24 ` Junio C Hamano
2009-03-31 5:42 ` Linus Torvalds
2009-03-31 5:50 ` Junio C Hamano
2009-03-31 5:38 ` Linus Torvalds
1 sibling, 2 replies; 13+ messages in thread
From: Junio C Hamano @ 2009-03-31 5:24 UTC (permalink / raw
To: Linus Torvalds; +Cc: Brandon Casey, Johannes Schindelin, Git Mailing List
Linus Torvalds <torvalds@linux-foundation.org> writes:
> On Mon, 30 Mar 2009, Junio C Hamano wrote:
>>
>> Most of your reflog entries are expected to be reachable from the tip, so
>> one optimization would be to mark all commits reachable from the tip
>> upfront, and omit the in_merge_bases() computation for the ones that are
>> already marked. Perhaps something like this...
>
> This if anything makes things just go slower.
>
> Not much, but some. It went from 36.566s to 38.070s. That may be in the
> noise, I've not done any sensitivity analysis.
I actually think that the cutoff for history traversal is bogus. You may
have a 30-day old reflog entry that pulled in a 45-day old commit, and
giving up the smudging at the expiry cutoff timestamp does not make much
sense.
I do have a lot of reflog entries kept around, as my main repository has
these:
[gc]
reflogexpire = '2005-01-01 00:00:00 +0000'
reflogexpireunreachable = '2005-01-01 00:00:00 +0000'
so let me experiment a bit more.
builtin-reflog.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 59 insertions(+), 2 deletions(-)
diff --git a/builtin-reflog.c b/builtin-reflog.c
index d95f515..ef02b5a 100644
--- a/builtin-reflog.c
+++ b/builtin-reflog.c
@@ -36,6 +36,7 @@ struct expire_reflog_cb {
FILE *newlog;
const char *ref;
struct commit *ref_commit;
+ int commit_marked;
struct cmd_reflog_expire_cb *cmd;
unsigned char last_kept_sha1[20];
};
@@ -52,6 +53,7 @@ struct collect_reflog_cb {
#define INCOMPLETE (1u<<10)
#define STUDYING (1u<<11)
+#define REACHABLE (1u<<12)
static int tree_is_complete(const unsigned char *sha1)
{
@@ -209,6 +211,43 @@ static int keep_entry(struct commit **it, unsigned char *sha1)
return 1;
}
+static void mark_reachable(struct commit *commit, unsigned long expire_limit)
+{
+ /*
+ * We need to compute if commit on either side of an reflog
+ * entry is reachable from the tip of the ref for all entries.
+ * Mark commits that are reachable from the tip down to the
+ * time threashold first; we know a commit marked thusly is
+ * reachable from the tip without running in_merge_bases()
+ * at all.
+ */
+ struct commit_list *pending = NULL;
+
+ commit_list_insert(commit, &pending);
+ while (pending) {
+ struct commit_list *entry = pending;
+ struct commit_list *parent;
+ pending = entry->next;
+ commit = entry->item;
+ free(entry);
+ if (commit->object.flags & REACHABLE)
+ continue;
+ commit->object.flags |= REACHABLE;
+ parent = commit->parents;
+ while (parent) {
+ commit = parent->item;
+ parent = parent->next;
+ if (commit->object.flags & REACHABLE)
+ continue;
+ if (parse_commit(commit))
+ continue;
+ if (commit->date < expire_limit)
+ continue;
+ commit_list_insert(commit, &pending);
+ }
+ }
+}
+
static int expire_reflog_ent(unsigned char *osha1, unsigned char *nsha1,
const char *email, unsigned long timestamp, int tz,
const char *message, void *cb_data)
@@ -230,12 +269,28 @@ static int expire_reflog_ent(unsigned char *osha1, unsigned char *nsha1,
if (timestamp < cb->cmd->expire_unreachable) {
if (!cb->ref_commit)
goto prune;
+
if (!old && !is_null_sha1(osha1))
old = lookup_commit_reference_gently(osha1, 1);
if (!new && !is_null_sha1(nsha1))
new = lookup_commit_reference_gently(nsha1, 1);
- if ((old && !in_merge_bases(old, &cb->ref_commit, 1)) ||
- (new && !in_merge_bases(new, &cb->ref_commit, 1)))
+
+ if (!cb->commit_marked) {
+ if (old) {
+ mark_reachable(cb->ref_commit, old->date);
+ cb->commit_marked = 1;
+ } else if (new) {
+ mark_reachable(cb->ref_commit, new->date);
+ cb->commit_marked = 1;
+ }
+ }
+
+ if ((old &&
+ !(old->object.flags & REACHABLE) &&
+ !in_merge_bases(old, &cb->ref_commit, 1)) ||
+ (new &&
+ !(new->object.flags & REACHABLE) &&
+ !in_merge_bases(new, &cb->ref_commit, 1)))
goto prune;
}
@@ -289,6 +344,8 @@ static int expire_reflog(const char *ref, const unsigned char *sha1, int unused,
cb.ref = ref;
cb.cmd = cmd;
for_each_reflog_ent(ref, expire_reflog_ent, &cb);
+ if (cb.ref_commit && cb.commit_marked)
+ clear_commit_marks(cb.ref_commit, REACHABLE);
finish:
if (cb.newlog) {
if (fclose(cb.newlog)) {
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: "git reflog expire --all" very slow
2009-03-31 5:09 ` Linus Torvalds
2009-03-31 5:24 ` Junio C Hamano
@ 2009-03-31 5:38 ` Linus Torvalds
2009-03-31 5:50 ` Linus Torvalds
1 sibling, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2009-03-31 5:38 UTC (permalink / raw
To: Junio C Hamano; +Cc: Brandon Casey, Johannes Schindelin, Git Mailing List
On Mon, 30 Mar 2009, Linus Torvalds wrote:
>
> This if anything makes things just go slower.
>
> Not much, but some. It went from 36.566s to 38.070s. That may be in the
> noise, I've not done any sensitivity analysis.
>
> I thought you perhaps had a missing "parse_commit()" making the
> reachability thing not work (look_up_gently parses the object, but if it's
> a tag deref_tag() will dereference it until it hits a commit, but never
> parse the commit). But that wasn't it.
Ahhah.
I know why it makes things slower.
The slow case is already inside that whole:
if (timestamp < cb->cmd->expire_unreachable) {
if-statement, so the thing that slows down is if we hit a commit that is
_older_ than the expire limit.
But your whole "mark_reachable()" thing only marks things _younger_ than
that reachable. So you mark exactly the wrong things reachable - you mark
the ones that we don't even care about.
If I do
mark_reachable(cb.ref_commit, 0);
instead (to traverse the _whole_ tree, with no regards to date), the time
shrinks to 1.7s. But of course, that's also wrong.
Qutie frankly, I don't really understand why the logic isn't just
if (timestamp < cb->cmd->expire_unreachable)
goto prune;
why is that reachability so important?
Linus
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: "git reflog expire --all" very slow
2009-03-31 5:24 ` Junio C Hamano
@ 2009-03-31 5:42 ` Linus Torvalds
2009-03-31 5:57 ` Junio C Hamano
2009-03-31 5:50 ` Junio C Hamano
1 sibling, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2009-03-31 5:42 UTC (permalink / raw
To: Junio C Hamano; +Cc: Brandon Casey, Johannes Schindelin, Git Mailing List
On Mon, 30 Mar 2009, Junio C Hamano wrote:
>
> I do have a lot of reflog entries kept around, as my main repository has
> these:
>
> [gc]
> reflogexpire = '2005-01-01 00:00:00 +0000'
> reflogexpireunreachable = '2005-01-01 00:00:00 +0000'
I think that actually _hides_ the problem. You'll never have anything at
all that triggers that
if (timestamp < cb->cmd->expire_unreachable) {
because your "expire_unreachable" timestamp is already very old (== small
value), so 'timestamp' will _not_ be older (smaller value) than that.
I dunno. As mentioned, I don't really understand why we'd want to save
some of those reflog entries at all in the first place, so I'm probably
missing something.
If we've asked for reflog entries past a certain age to be expired, why do
when then look at the details of those reflog entries and only expire them
under certain circumstances? Just expire them, and get rid of the
'unreachable' part.
Linus
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: "git reflog expire --all" very slow
2009-03-31 5:38 ` Linus Torvalds
@ 2009-03-31 5:50 ` Linus Torvalds
2009-03-31 5:51 ` Linus Torvalds
2009-03-31 6:08 ` Junio C Hamano
0 siblings, 2 replies; 13+ messages in thread
From: Linus Torvalds @ 2009-03-31 5:50 UTC (permalink / raw
To: Junio C Hamano; +Cc: Brandon Casey, Johannes Schindelin, Git Mailing List
On Mon, 30 Mar 2009, Linus Torvalds wrote:
>
> If I do
>
> mark_reachable(cb.ref_commit, 0);
Ok, I think I got it.
You had
mark_reachable(cb.ref_commit, cmd->expire_unreachable);
but we care about the commits that are younger than 'expire_total' (older
than that, and they are pruned unconditionally), but older than
'expire_unreachable' (younger than that and the date doesn't matter).
So making it do
mark_reachable(cb.ref_commit, cmd->expire_total);
marks the right parts reachable. Not the whole tree, but also not just the
commits we're not going to expire regardless.
With that change, it's all basically instantaneous. We don't need to
traverse the whole kernel history, and with that change to your patch, I
get
[torvalds@nehalem linux]$ time ~/git/git reflog expire --all
real 0m1.715s
user 0m1.676s
sys 0m0.040s
which is still slower than I'd wish for, but is a whole lot faster than
over half a minute.
Linus
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: "git reflog expire --all" very slow
2009-03-31 5:24 ` Junio C Hamano
2009-03-31 5:42 ` Linus Torvalds
@ 2009-03-31 5:50 ` Junio C Hamano
1 sibling, 0 replies; 13+ messages in thread
From: Junio C Hamano @ 2009-03-31 5:50 UTC (permalink / raw
To: Linus Torvalds; +Cc: Brandon Casey, Johannes Schindelin, Git Mailing List
Junio C Hamano <gitster@pobox.com> writes:
> Linus Torvalds <torvalds@linux-foundation.org> writes:
>> ...
>> This if anything makes things just go slower.
>>
>> Not much, but some. It went from 36.566s to 38.070s. That may be in the
>> noise, I've not done any sensitivity analysis.
>
> I actually think that the cutoff for history traversal is bogus. You may
> have a 30-day old reflog entry that pulled in a 45-day old commit, and
> giving up the smudging at the expiry cutoff timestamp does not make much
> sense.
>
> I do have a lot of reflog entries kept around, as my main repository has
> these:
>
> [gc]
> reflogexpire = '2005-01-01 00:00:00 +0000'
> reflogexpireunreachable = '2005-01-01 00:00:00 +0000'
>
> so let me experiment a bit more.
To initially prune a copy of my git repository that was never pruned
before, it takes a lot of time with or without the patch to prune from the
original 73476 reflog entries.
(without patch)
23.46user 0.23system 0:23.70elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+5464outputs (0major+6045minor)pagefaults 0swaps
(with patch, but with a bit of fprintf's for counting)
21.99user 0.40system 0:22.56elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+5472outputs (0major+6040minor)pagefaults 0swaps
After the initial pruning, with 13780 reflog entries still left (and no
further pruning expected), I am seeing these:
(without patch)
2.09user 0.22system 0:02.32elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+5464outputs (0major+4956minor)pagefaults 0swaps
(with patch, but again with a bit of fprintf's for counting)
0.76user 0.25system 0:01.05elapsed 96%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+5464outputs (0major+4918minor)pagefaults 0swaps
I might be better off traversing down all the way and do away with
in_merge_bases(), but that might hold true only for something small like
git.git and the kernel.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: "git reflog expire --all" very slow
2009-03-31 5:50 ` Linus Torvalds
@ 2009-03-31 5:51 ` Linus Torvalds
2009-04-02 6:46 ` Junio C Hamano
2009-03-31 6:08 ` Junio C Hamano
1 sibling, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2009-03-31 5:51 UTC (permalink / raw
To: Junio C Hamano; +Cc: Brandon Casey, Johannes Schindelin, Git Mailing List
That made no sense. It should have been:
On Mon, 30 Mar 2009, Linus Torvalds wrote:
>
> but we care about the commits that are younger than 'expire_total' (older
> than that, and they are pruned unconditionally), but older than
> 'expire_unreachable' (younger than that and the date doesn't matter).
^^^^
reachability
but other than that the commentary stands.
Linus
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: "git reflog expire --all" very slow
2009-03-31 5:42 ` Linus Torvalds
@ 2009-03-31 5:57 ` Junio C Hamano
0 siblings, 0 replies; 13+ messages in thread
From: Junio C Hamano @ 2009-03-31 5:57 UTC (permalink / raw
To: Linus Torvalds; +Cc: Brandon Casey, Johannes Schindelin, Git Mailing List
Linus Torvalds <torvalds@linux-foundation.org> writes:
> On Mon, 30 Mar 2009, Junio C Hamano wrote:
>>
>> I do have a lot of reflog entries kept around, as my main repository has
>> these:
>>
>> [gc]
>> reflogexpire = '2005-01-01 00:00:00 +0000'
>> reflogexpireunreachable = '2005-01-01 00:00:00 +0000'
>
> I think that actually _hides_ the problem. You'll never have anything at
> all that triggers that
>
> if (timestamp < cb->cmd->expire_unreachable) {
>
> because your "expire_unreachable" timestamp is already very old (== small
> value), so 'timestamp' will _not_ be older (smaller value) than that.
>
> I dunno. As mentioned, I don't really understand why we'd want to save
> some of those reflog entries at all in the first place, so I'm probably
> missing something.
>
> If we've asked for reflog entries past a certain age to be expired, why do
> when then look at the details of those reflog entries and only expire them
> under certain circumstances? Just expire them, and get rid of the
> 'unreachable' part.
I know. My test repository does not have them. I only quoted them to
tell you that I have a lot of entries to play with.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: "git reflog expire --all" very slow
2009-03-31 5:50 ` Linus Torvalds
2009-03-31 5:51 ` Linus Torvalds
@ 2009-03-31 6:08 ` Junio C Hamano
1 sibling, 0 replies; 13+ messages in thread
From: Junio C Hamano @ 2009-03-31 6:08 UTC (permalink / raw
To: Linus Torvalds; +Cc: Brandon Casey, Johannes Schindelin, Git Mailing List
Linus Torvalds <torvalds@linux-foundation.org> writes:
> On Mon, 30 Mar 2009, Linus Torvalds wrote:
>>
>> If I do
>>
>> mark_reachable(cb.ref_commit, 0);
>
> Ok, I think I got it.
>
> You had
>
> mark_reachable(cb.ref_commit, cmd->expire_unreachable);
>
> but we care about the commits that are younger than 'expire_total' (older
> than that, and they are pruned unconditionally), but older than
> 'expire_unreachable' (younger than that and the date doesn't matter).
>
> So making it do
>
> mark_reachable(cb.ref_commit, cmd->expire_total);
>
> marks the right parts reachable. Not the whole tree, but also not just the
> commits we're not going to expire regardless.
>
> With that change, it's all basically instantaneous. We don't need to
> traverse the whole kernel history, and with that change to your patch, I
> get
>
> [torvalds@nehalem linux]$ time ~/git/git reflog expire --all
>
> real 0m1.715s
> user 0m1.676s
> sys 0m0.040s
>
> which is still slower than I'd wish for, but is a whole lot faster than
> over half a minute.
Actually, the initial pruning I showed (around 23 seconds) is not helped
with the expire_total change at all, but if I mark everything and get rid
of in_merge_bases(), it goes down to 1.15 seconds.
But subsequent pruning, which is what we should be optimizing for, gets
much faster with your approach of not traversing down the history all the
way.
By the way, I suspect I might be making the same mistake of traversing
both sides of the merge twice while marking, but I am tired already, so
I'll throw out the final patch for tonight and go to bed.
builtin-reflog.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++--
1 files changed, 46 insertions(+), 2 deletions(-)
diff --git a/builtin-reflog.c b/builtin-reflog.c
index d95f515..b67272a 100644
--- a/builtin-reflog.c
+++ b/builtin-reflog.c
@@ -52,6 +52,7 @@ struct collect_reflog_cb {
#define INCOMPLETE (1u<<10)
#define STUDYING (1u<<11)
+#define REACHABLE (1u<<12)
static int tree_is_complete(const unsigned char *sha1)
{
@@ -209,6 +210,43 @@ static int keep_entry(struct commit **it, unsigned char *sha1)
return 1;
}
+static void mark_reachable(struct commit *commit, unsigned long expire_limit)
+{
+ /*
+ * We need to compute if commit on either side of an reflog
+ * entry is reachable from the tip of the ref for all entries.
+ * Mark commits that are reachable from the tip down to the
+ * time threashold first; we know a commit marked thusly is
+ * reachable from the tip without running in_merge_bases()
+ * at all.
+ */
+ struct commit_list *pending = NULL;
+
+ commit_list_insert(commit, &pending);
+ while (pending) {
+ struct commit_list *entry = pending;
+ struct commit_list *parent;
+ pending = entry->next;
+ commit = entry->item;
+ free(entry);
+ if (commit->object.flags & REACHABLE)
+ continue;
+ commit->object.flags |= REACHABLE;
+ parent = commit->parents;
+ while (parent) {
+ commit = parent->item;
+ parent = parent->next;
+ if (commit->object.flags & REACHABLE)
+ continue;
+ if (parse_commit(commit))
+ continue;
+ if (expire_limit && commit->date < expire_limit)
+ continue;
+ commit_list_insert(commit, &pending);
+ }
+ }
+}
+
static int expire_reflog_ent(unsigned char *osha1, unsigned char *nsha1,
const char *email, unsigned long timestamp, int tz,
const char *message, void *cb_data)
@@ -234,8 +272,10 @@ static int expire_reflog_ent(unsigned char *osha1, unsigned char *nsha1,
old = lookup_commit_reference_gently(osha1, 1);
if (!new && !is_null_sha1(nsha1))
new = lookup_commit_reference_gently(nsha1, 1);
- if ((old && !in_merge_bases(old, &cb->ref_commit, 1)) ||
- (new && !in_merge_bases(new, &cb->ref_commit, 1)))
+ if ((old && !(old->object.flags & REACHABLE) &&
+ !in_merge_bases(old, &cb->ref_commit, 1)) ||
+ (new && !(new->object.flags & REACHABLE) &&
+ !in_merge_bases(new, &cb->ref_commit, 1)))
goto prune;
}
@@ -288,7 +328,11 @@ static int expire_reflog(const char *ref, const unsigned char *sha1, int unused,
cb.ref_commit = lookup_commit_reference_gently(sha1, 1);
cb.ref = ref;
cb.cmd = cmd;
+ if (cb.ref_commit)
+ mark_reachable(cb.ref_commit, cmd->expire_total);
for_each_reflog_ent(ref, expire_reflog_ent, &cb);
+ if (cb.ref_commit)
+ clear_commit_marks(cb.ref_commit, REACHABLE);
finish:
if (cb.newlog) {
if (fclose(cb.newlog)) {
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: "git reflog expire --all" very slow
2009-03-31 5:51 ` Linus Torvalds
@ 2009-04-02 6:46 ` Junio C Hamano
2009-04-02 15:30 ` Linus Torvalds
0 siblings, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2009-04-02 6:46 UTC (permalink / raw
To: Linus Torvalds; +Cc: Brandon Casey, Johannes Schindelin, Git Mailing List
Linus Torvalds <torvalds@linux-foundation.org> writes:
> That made no sense. It should have been:
>
> On Mon, 30 Mar 2009, Linus Torvalds wrote:
>>
>> but we care about the commits that are younger than 'expire_total' (older
>> than that, and they are pruned unconditionally), but older than
>> 'expire_unreachable' (younger than that and the date doesn't matter).
> ^^^^
> reachability
>
> but other than that the commentary stands.
Correct. But after thinking about this a bit more, I am starting to suspect
the "of course" in your earlier
If I do
mark_reachable(cb.ref_commit, 0);
instead (to traverse the _whole_ tree, with no regards to date), the time
shrinks to 1.7s. But of course, that's also wrong.
may not be such a clearly obvious thing.
Suppose you do not do "mark_reachable(cb.ref_commit, 0)" but use the
expire_total as the cut-off value (which is what I've queued). If you
have one unreachable entry that you end up running in_merge_bases() for,
you will traverse all the history down _anyway_, and at that point, you
would be better off if you actually marked everything upfront, and
discarded anything unmarked as unreachable without falling back to
in_merge_bases() at all.
The above reasoning of course assumes that "keep reflog entries if they
are reachable from the tip, otherwise drop them if they are more than 30
days old" is a good medium level semantics to cull what the other rule
"drop any reflog entry older than 90 days" may not.
A hacky alternative would be to use total_expire as the cut-off and do not
fall back on in_merge_bases(). We might incorrectly prune away an entry
that records that you pulled a commit that is still reachable from the tip
last week, if that commit happens to be 4 months old if we did so, so I am
not convinced myself it is a reasonable hack, though.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: "git reflog expire --all" very slow
2009-04-02 6:46 ` Junio C Hamano
@ 2009-04-02 15:30 ` Linus Torvalds
0 siblings, 0 replies; 13+ messages in thread
From: Linus Torvalds @ 2009-04-02 15:30 UTC (permalink / raw
To: Junio C Hamano; +Cc: Brandon Casey, Johannes Schindelin, Git Mailing List
On Wed, 1 Apr 2009, Junio C Hamano wrote:
>
> Correct. But after thinking about this a bit more, I am starting to suspect
> the "of course" in your earlier
>
> If I do
>
> mark_reachable(cb.ref_commit, 0);
>
> instead (to traverse the _whole_ tree, with no regards to date), the time
> shrinks to 1.7s. But of course, that's also wrong.
>
> may not be such a clearly obvious thing.
I think we should never do it up-front, because for nicely behaved people
who just pull other peoples work (which are also the people most likely to
not have beefy machines), the normal reflog is going to be entirely
reachable, and we don't have to traverse the whole thing.
So what I'd suggest is something like:
- start off with the time limit, possibly with some extra fudging
- but never bother calling "in_merge_bases()"
- if we ever get to a commit that doesn't look reachable in that
situation, we now have two choices:
(a) just use the dang 'object->flags & REACHABLE' flag as-is.
Why even bother to be clever? We did the reachability by time
already, it's done, it's there, just use it. In other words, the
reachability simply works like "--since=<date>'.
(b) Try to do the "exact" thing, and waste lots of time on it, and
maybe find an odd commit or two where the date had been wrong. Do
we really care?
I'd actually go for 'a', with a slight modification: try to convert the
"reflog date" (the date of a local action) into a "commit date" (the date
of a commit in the repository). Because those two are different "time
spaces", and comparing a "commit date" to a "in my repo" date is fairly
wrong.
But in general, I don't think this is something that needs any extra
precision. We're not talking about "theoretically reachable" here. We're
talking about reflog entries that are already older than the
unreachability limit, and that point to commits that are older than the
reachability limit. Yes, yes, clocks aren't synchronized, but do we really
care?
IOW, I'd suggest just removing the in_merge_base() tests entirely. Make
the semantics even simpler:
have_done_reachability = 0;
reachability_date = 0;
for_each_reflog_oldest_first() {
/* Older than unconditional expire? */
if (really_old(entry)) {
reachability_date = entry->commit->date;
goto prune;
}
/* Younger than the reflog reachability? */
if (really_young(entry) && !fix_stale)
goto save;
/*
* Ok, not an unconditional expire entry.
* Do the reachability - once
*/
if (!have_done_reachability) {
have_done_reachability = 1;
if (fix_stale)
reachability_date = 0;
mark_reachabile(top, reachability_date);
}
if (!(entry->commit->flags & REACHABLE))
goto prune;
save:
save(entry);
continue;
prune:
prune(entry);
continue;
}
Does that change semantics? Yes, absolutely. But it sounds very practical.
Linus
^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2009-04-02 15:36 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-03-31 1:43 "git reflog expire --all" very slow Linus Torvalds
2009-03-31 4:34 ` Junio C Hamano
2009-03-31 5:09 ` Linus Torvalds
2009-03-31 5:24 ` Junio C Hamano
2009-03-31 5:42 ` Linus Torvalds
2009-03-31 5:57 ` Junio C Hamano
2009-03-31 5:50 ` Junio C Hamano
2009-03-31 5:38 ` Linus Torvalds
2009-03-31 5:50 ` Linus Torvalds
2009-03-31 5:51 ` Linus Torvalds
2009-04-02 6:46 ` Junio C Hamano
2009-04-02 15:30 ` Linus Torvalds
2009-03-31 6:08 ` Junio C Hamano
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).