git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* Re: [PATCH] git tag --contains : avoid stack overflow
       [not found] <1352568970-4669-1-git-send-email-jeanjacques.lafay@gmail.com>
@ 2012-11-10 20:00 ` Philip Oakley
  2012-11-10 21:13   ` Jean-Jacques Lafay
  0 siblings, 1 reply; 15+ messages in thread
From: Philip Oakley @ 2012-11-10 20:00 UTC (permalink / raw
  To: Jean-Jacques Lafay, msysgit; +Cc: Git List

From: "Jean-Jacques Lafay" <jeanjacques.lafay@gmail.com> Sent: Saturday, 
November 10, 2012 5:36 PM
> In large repos, the recursion implementation of contains(commit, 
> commit_list)
> may result in a stack overflow. Replace the recursion with a loop to 
> fix it.
>
> Signed-off-by: Jean-Jacques Lafay <jeanjacques.lafay@gmail.com>

This is a change to the main git code so it is better to submit it to 
the main git list at git@vger.kernel.org (plain text, no HTML, first 
post usually has a delay ;-)

It sounds reasonable but others may have opinions and comments.

Have you actually suffered from the stack overflow issue? You only 
suggest it as a possibility, rather than a real problem.

Philip

> ---
> builtin/tag.c  | 83 
> ++++++++++++++++++++++++++++++++++++++++------------------
> t/t7004-tag.sh | 21 +++++++++++++++
> 2 files changed, 78 insertions(+), 26 deletions(-)
>
> diff --git a/builtin/tag.c b/builtin/tag.c
> index 9c3e067..4be67dd 100644
> --- a/builtin/tag.c
> +++ b/builtin/tag.c
> @@ -73,40 +73,71 @@ static int in_commit_list(const struct commit_list 
> *want, struct commit *c)
>  return 0;
> }
>
> -static int contains_recurse(struct commit *candidate,
> -     const struct commit_list *want)
> -{
> - struct commit_list *p;
> -
> - /* was it previously marked as containing a want commit? */
> - if (candidate->object.flags & TMP_MARK)
> - return 1;
> - /* or marked as not possibly containing a want commit? */
> - if (candidate->object.flags & UNINTERESTING)
> - return 0;
> - /* or are we it? */
> - if (in_commit_list(want, candidate))
> - return 1;
> +struct commit_list_list {
> + struct commit *item;
> + struct commit_list *next_item;
> + struct commit_list_list *next;
> +};
>
> - if (parse_commit(candidate) < 0)
> - return 0;
> +static void mark_path(struct commit_list_list *path, int status)
> +{
> + struct commit_list_list *temp;
> + while (path) {
> + path->item->object.flags |= status;
> + temp = path;
> + path = temp->next;
> + free(temp);
> + }
> +}
>
> - /* Otherwise recurse and mark ourselves for future traversals. */
> - for (p = candidate->parents; p; p = p->next) {
> - if (contains_recurse(p->item, want)) {
> - candidate->object.flags |= TMP_MARK;
> +static int contains(struct commit *candidate,
> +     const struct commit_list *want)
> +{
> + /* previously implemented with a recursion, but could stack overflow 
> in large repos */
> + struct commit_list_list *p, *tmp;
> + p = xmalloc(sizeof(struct commit_list_list));
> + p->item = candidate;
> + p->next_item = NULL;
> + p->next = NULL;
> +
> + while (p) {
> + candidate = p->item;
> +
> + /* is it ok : */
> + /* was it previously marked as containing a want commit? */
> + /* or are we it? */
> + if (candidate->object.flags & TMP_MARK || in_commit_list(want, 
> candidate)) {
> + mark_path(p, TMP_MARK);
>  return 1;
>  }
> + /* is it not ok : */
> + /* was it previously marked as not possibly containing a want 
> commit? */
> + /* do we have parents? */
> + if (candidate->object.flags & UNINTERESTING || 
> parse_commit(candidate) < 0 || !candidate->parents) {
> + candidate->object.flags |= UNINTERESTING;
> + /* then backtrack, marking as UNINTERESTING along the way */
> + while (p && !p->next_item) {
> + p->item->object.flags |= UNINTERESTING;
> + tmp = p;
> + p = tmp->next;
> + free(tmp);
> + }
> + if (p) {
> + p->item = p->next_item->item;
> + p->next_item = p->next_item->next;
> + }
> + } else {
> + /* let's check the parents */
> + tmp = xmalloc(sizeof(struct commit_list_list));
> + tmp->item = candidate->parents->item;
> + tmp->next_item = candidate->parents->next;
> + tmp->next = p;
> + p = tmp;
> + }
>  }
> - candidate->object.flags |= UNINTERESTING;
>  return 0;
> }
>
> -static int contains(struct commit *candidate, const struct 
> commit_list *want)
> -{
> - return contains_recurse(candidate, want);
> -}
> -
> static void show_tag_lines(const unsigned char *sha1, int lines)
> {
>  int i;
> diff --git a/t/t7004-tag.sh b/t/t7004-tag.sh
> index 5189446..196ac54 100755
> --- a/t/t7004-tag.sh
> +++ b/t/t7004-tag.sh
> @@ -1365,4 +1365,25 @@ test_expect_success 'multiple --points-at are 
> OR-ed together' '
>  test_cmp expect actual
> '
>
> +# what about a deep repo ?
> +
> +>expect
> +test_expect_success '--contains works in a deep repo' '
> + i=1
> + while test $i -lt 40000
> + do
> + echo "commit refs/heads/master
> +committer A U Thor <author@example.com> $((1000000000 + $i * 100)) 
> +0200
> +data <<EOF
> +commit #$i
> +EOF"
> + test $i == 1 && echo "from refs/heads/master^0"
> + i=$(($i + 1))
> + done | git fast-import
> + git checkout master
> + git tag far-far-away HEAD^
> + git tag --contains HEAD >actual &&
> + test_cmp expect actual
> +'
> +
> test_done
> -- 
> 1.8.0.msysgit.0.1.geed93bd.dirty
>
> -- 

-- 
*** Please reply-to-all at all times ***
*** (do not pretend to know who is subscribed and who is not) ***
*** Please avoid top-posting. ***
The msysGit Wiki is here: https://github.com/msysgit/msysgit/wiki - Github accounts are free.

You received this message because you are subscribed to the Google
Groups "msysGit" group.
To post to this group, send email to msysgit@googlegroups.com
To unsubscribe from this group, send email to
msysgit+unsubscribe@googlegroups.com
For more options, and view previous threads, visit this group at
http://groups.google.com/group/msysgit?hl=en_US?hl=en

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] git tag --contains : avoid stack overflow
  2012-11-10 20:00 ` [PATCH] git tag --contains : avoid stack overflow Philip Oakley
@ 2012-11-10 21:13   ` Jean-Jacques Lafay
  2012-11-10 21:33     ` Pat Thoyts
  2012-11-11 16:46     ` René Scharfe
  0 siblings, 2 replies; 15+ messages in thread
From: Jean-Jacques Lafay @ 2012-11-10 21:13 UTC (permalink / raw
  To: msysgit; +Cc: Jean-Jacques Lafay, Git List, Philip Oakley

[-- Attachment #1: Type: text/plain, Size: 2385 bytes --]

Le samedi 10 novembre 2012 21:00:10 UTC+1, Philip Oakley a écrit :

> From: "Jean-Jacques Lafay" <jeanjacq...@gmail.com <javascript:>> Sent: 
> Saturday, 
> November 10, 2012 5:36 PM 
> > In large repos, the recursion implementation of contains(commit, 
> > commit_list) 
> > may result in a stack overflow. Replace the recursion with a loop to 
> > fix it. 
> > 
> > Signed-off-by: Jean-Jacques Lafay <jeanjacq...@gmail.com <javascript:>> 
>
> This is a change to the main git code so it is better to submit it to 
> the main git list at g...@vger.kernel.org <javascript:> (plain text, no 
> HTML, first 
> post usually has a delay ;-) 
>
> It sounds reasonable but others may have opinions and comments. 
>
> Have you actually suffered from the stack overflow issue? You only 
> suggest it as a possibility, rather than a real problem. 
>
> Philip 
>

Yes, it actually crashed on me, and since the call is made by GitExtension 
while browsing the commit history, it was quickly annoying to have windows 
popping a "git.exe stopped working" message box at my face every time I 
clicked on a line of the history ;-)  (well, you can sort of work around it 
by having the "file tree" or "diff" tab active)

Coincidentally, as I was having a glance at the recent topics, I saw that 
someone else experienced it.

However, I couldn't reproduce it on Linux : where the windows 
implementations crashes at a ~32000 depth (*not* exactly 32768, mind you), 
on linux it happily went through 100000 commits. I didn't take time to look 
much further, but maybe on my 64 bit Linux VM, the process can afford to 
reserve a much bigger address range for the stack of each thread than the 
1Mb given to 32 bit processes on windows.
 
Jean-Jacques.

-- 
*** Please reply-to-all at all times ***
*** (do not pretend to know who is subscribed and who is not) ***
*** Please avoid top-posting. ***
The msysGit Wiki is here: https://github.com/msysgit/msysgit/wiki - Github accounts are free.

You received this message because you are subscribed to the Google
Groups "msysGit" group.
To post to this group, send email to msysgit@googlegroups.com
To unsubscribe from this group, send email to
msysgit+unsubscribe@googlegroups.com
For more options, and view previous threads, visit this group at
http://groups.google.com/group/msysgit?hl=en_US?hl=en

[-- Attachment #2: Type: text/html, Size: 3089 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] git tag --contains : avoid stack overflow
  2012-11-10 21:13   ` Jean-Jacques Lafay
@ 2012-11-10 21:33     ` Pat Thoyts
  2012-11-11 16:46     ` René Scharfe
  1 sibling, 0 replies; 15+ messages in thread
From: Pat Thoyts @ 2012-11-10 21:33 UTC (permalink / raw
  To: Jean-Jacques Lafay; +Cc: msysgit, Git List, Philip Oakley

On 10 November 2012 21:13, Jean-Jacques Lafay
<jeanjacques.lafay@gmail.com> wrote:
> Le samedi 10 novembre 2012 21:00:10 UTC+1, Philip Oakley a écrit :
>>
>> From: "Jean-Jacques Lafay" <jeanjacq...@gmail.com> Sent: Saturday,
>> November 10, 2012 5:36 PM
>> > In large repos, the recursion implementation of contains(commit,
>> > commit_list)
>> > may result in a stack overflow. Replace the recursion with a loop to
>> > fix it.
>> >
>> > Signed-off-by: Jean-Jacques Lafay <jeanjacq...@gmail.com>
>>
>> This is a change to the main git code so it is better to submit it to
>> the main git list at g...@vger.kernel.org (plain text, no HTML, first
>> post usually has a delay ;-)
>>
>> It sounds reasonable but others may have opinions and comments.
>>
>> Have you actually suffered from the stack overflow issue? You only
>> suggest it as a possibility, rather than a real problem.
>>
>> Philip
>
>
> Yes, it actually crashed on me, and since the call is made by GitExtension
> while browsing the commit history, it was quickly annoying to have windows
> popping a "git.exe stopped working" message box at my face every time I
> clicked on a line of the history ;-)  (well, you can sort of work around it
> by having the "file tree" or "diff" tab active)
>
> Coincidentally, as I was having a glance at the recent topics, I saw that
> someone else experienced it.
>
> However, I couldn't reproduce it on Linux : where the windows
> implementations crashes at a ~32000 depth (*not* exactly 32768, mind you),
> on linux it happily went through 100000 commits. I didn't take time to look
> much further, but maybe on my 64 bit Linux VM, the process can afford to
> reserve a much bigger address range for the stack of each thread than the
> 1Mb given to 32 bit processes on windows.
>
> Jean-Jacques.

I checked this on msysGit - the test causes a crash on Windows when
using the 1.8.0 release and recompiling with this patch applied fixes
this. As mentioned, its best to have this reviewed upstream too as its
likely that windows just has a smaller stack so causes the crash
earlier.

-- 
*** Please reply-to-all at all times ***
*** (do not pretend to know who is subscribed and who is not) ***
*** Please avoid top-posting. ***
The msysGit Wiki is here: https://github.com/msysgit/msysgit/wiki - Github accounts are free.

You received this message because you are subscribed to the Google
Groups "msysGit" group.
To post to this group, send email to msysgit@googlegroups.com
To unsubscribe from this group, send email to
msysgit+unsubscribe@googlegroups.com
For more options, and view previous threads, visit this group at
http://groups.google.com/group/msysgit?hl=en_US?hl=en

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] git tag --contains : avoid stack overflow
  2012-11-10 21:13   ` Jean-Jacques Lafay
  2012-11-10 21:33     ` Pat Thoyts
@ 2012-11-11 16:46     ` René Scharfe
  2012-11-11 16:54       ` Jeff King
  1 sibling, 1 reply; 15+ messages in thread
From: René Scharfe @ 2012-11-11 16:46 UTC (permalink / raw
  To: Jean-Jacques Lafay; +Cc: msysgit, Git List, Philip Oakley, Jeff King

Am 10.11.2012 22:13, schrieb Jean-Jacques Lafay:
> Le samedi 10 novembre 2012 21:00:10 UTC+1, Philip Oakley a écrit :
>
>     From: "Jean-Jacques Lafay" <jeanjacq...@gmail.com <javascript:>>
>     Sent: Saturday,
>     November 10, 2012 5:36 PM
>      > In large repos, the recursion implementation of contains(commit,
>      > commit_list)
>      > may result in a stack overflow. Replace the recursion with a loop to
>      > fix it.
>      >
>      > Signed-off-by: Jean-Jacques Lafay <jeanjacq...@gmail.com
>     <javascript:>>
>
>     This is a change to the main git code so it is better to submit it to
>     the main git list at g...@vger.kernel.org <javascript:> (plain text,
>     no HTML, first
>     post usually has a delay ;-)
>
>     It sounds reasonable but others may have opinions and comments.
>
>     Have you actually suffered from the stack overflow issue? You only
>     suggest it as a possibility, rather than a real problem.
>
>     Philip
>
>
> Yes, it actually crashed on me, and since the call is made by
> GitExtension while browsing the commit history, it was quickly annoying
> to have windows popping a "git.exe stopped working" message box at my
> face every time I clicked on a line of the history ;-)  (well, you can
> sort of work around it by having the "file tree" or "diff" tab active)
>
> Coincidentally, as I was having a glance at the recent topics, I saw
> that someone else experienced it.
>
> However, I couldn't reproduce it on Linux : where the windows
> implementations crashes at a ~32000 depth (*not* exactly 32768, mind
> you), on linux it happily went through 100000 commits. I didn't take
> time to look much further, but maybe on my 64 bit Linux VM, the process
> can afford to reserve a much bigger address range for the stack of each
> thread than the 1Mb given to 32 bit processes on windows.
> Jean-Jacques.

I can reproduce it on Linux (Debian testing amd64) with ulimit -s 1000 
to reduce the stack size from its default value of 8MB.

After reverting ffc4b8012d9a4f92ef238ff72c0d15e9e1b402ed (tag: speed up 
--contains calculation) the test passes even with the smaller stack, but 
it makes "git tag --contains" take thrice the time as before.

René

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] git tag --contains : avoid stack overflow
  2012-11-11 16:46     ` René Scharfe
@ 2012-11-11 16:54       ` Jeff King
  2012-11-11 23:10         ` Johannes Schindelin
  2012-11-12 22:27         ` Jean-Jacques Lafay
  0 siblings, 2 replies; 15+ messages in thread
From: Jeff King @ 2012-11-11 16:54 UTC (permalink / raw
  To: René Scharfe; +Cc: Jean-Jacques Lafay, msysgit, Git List, Philip Oakley

On Sun, Nov 11, 2012 at 05:46:32PM +0100, René Scharfe wrote:

> >However, I couldn't reproduce it on Linux : where the windows
> >implementations crashes at a ~32000 depth (*not* exactly 32768, mind
> >you), on linux it happily went through 100000 commits. I didn't take
> >time to look much further, but maybe on my 64 bit Linux VM, the process
> >can afford to reserve a much bigger address range for the stack of each
> >thread than the 1Mb given to 32 bit processes on windows.
> >Jean-Jacques.
> 
> I can reproduce it on Linux (Debian testing amd64) with ulimit -s
> 1000 to reduce the stack size from its default value of 8MB.
> 
> After reverting ffc4b8012d9a4f92ef238ff72c0d15e9e1b402ed (tag: speed
> up --contains calculation) the test passes even with the smaller
> stack, but it makes "git tag --contains" take thrice the time as
> before.

Right, I am not too surprised.  That function replaced the original
algorithm with a much faster depth-first recursive one. I haven't looked
closely yet at Jean-Jacques' iterative adaptation, but that direction
seems like a good fix for now.

Ultimately, I have some ideas for doing this in a breadth-first way,
which would make it more naturally iterative. It would involve having N
bits of storage per commit to check N tags, but it would mean that we
could get accurate answers in the face of clock skew (like the
merge-base calculation, it would merely get slower in the face of skew).

But since I haven't worked on that at all, fixing the depth-first
algorithm to be iterative makes sense to me.

-Peff

-- 
*** Please reply-to-all at all times ***
*** (do not pretend to know who is subscribed and who is not) ***
*** Please avoid top-posting. ***
The msysGit Wiki is here: https://github.com/msysgit/msysgit/wiki - Github accounts are free.

You received this message because you are subscribed to the Google
Groups "msysGit" group.
To post to this group, send email to msysgit@googlegroups.com
To unsubscribe from this group, send email to
msysgit+unsubscribe@googlegroups.com
For more options, and view previous threads, visit this group at
http://groups.google.com/group/msysgit?hl=en_US?hl=en

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Re: [PATCH] git tag --contains : avoid stack overflow
  2012-11-11 16:54       ` Jeff King
@ 2012-11-11 23:10         ` Johannes Schindelin
  2012-11-12 22:27         ` Jean-Jacques Lafay
  1 sibling, 0 replies; 15+ messages in thread
From: Johannes Schindelin @ 2012-11-11 23:10 UTC (permalink / raw
  To: Jeff King
  Cc: René Scharfe, Jean-Jacques Lafay, msysgit, Git List,
	Philip Oakley

Hi,

On Sun, 11 Nov 2012, Jeff King wrote:

> On Sun, Nov 11, 2012 at 05:46:32PM +0100, René Scharfe wrote:
> 
> > >However, I couldn't reproduce it on Linux : where the windows
> > >implementations crashes at a ~32000 depth (*not* exactly 32768, mind
> > >you), on linux it happily went through 100000 commits. I didn't take
> > >time to look much further, but maybe on my 64 bit Linux VM, the
> > >process can afford to reserve a much bigger address range for the
> > >stack of each thread than the 1Mb given to 32 bit processes on
> > >windows.  Jean-Jacques.
> > 
> > I can reproduce it on Linux (Debian testing amd64) with ulimit -s 1000
> > to reduce the stack size from its default value of 8MB.
> > 
> > After reverting ffc4b8012d9a4f92ef238ff72c0d15e9e1b402ed (tag: speed
> > up --contains calculation) the test passes even with the smaller
> > stack, but it makes "git tag --contains" take thrice the time as
> > before.
> 
> Right, I am not too surprised.  That function replaced the original
> algorithm with a much faster depth-first recursive one. I haven't looked
> closely yet at Jean-Jacques' iterative adaptation, but that direction
> seems like a good fix for now.
> 
> Ultimately, I have some ideas for doing this in a breadth-first way,
> which would make it more naturally iterative. It would involve having N
> bits of storage per commit to check N tags, but it would mean that we
> could get accurate answers in the face of clock skew (like the
> merge-base calculation, it would merely get slower in the face of skew).
> 
> But since I haven't worked on that at all, fixing the depth-first
> algorithm to be iterative makes sense to me.

Have you tried the latest tag-contains branch of
git://github.com/msysgit/git/? It contains a couple of brush-ups and a
re-write of the recursion (which I hope is right, I had only time to work
on it during an unwanted layover at O'Hare). The SHA-1 is
fc4f42787a0dd0022d202627681362081a66ef70.

Ciao,
Johannes

-- 
*** Please reply-to-all at all times ***
*** (do not pretend to know who is subscribed and who is not) ***
*** Please avoid top-posting. ***
The msysGit Wiki is here: https://github.com/msysgit/msysgit/wiki - Github accounts are free.

You received this message because you are subscribed to the Google
Groups "msysGit" group.
To post to this group, send email to msysgit@googlegroups.com
To unsubscribe from this group, send email to
msysgit+unsubscribe@googlegroups.com
For more options, and view previous threads, visit this group at
http://groups.google.com/group/msysgit?hl=en_US?hl=en

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] git tag --contains : avoid stack overflow
  2012-11-11 16:54       ` Jeff King
  2012-11-11 23:10         ` Johannes Schindelin
@ 2012-11-12 22:27         ` Jean-Jacques Lafay
  2012-11-12 23:14           ` Jeff King
  1 sibling, 1 reply; 15+ messages in thread
From: Jean-Jacques Lafay @ 2012-11-12 22:27 UTC (permalink / raw
  To: Jeff King; +Cc: René Scharfe, msysgit, Git List, Philip Oakley

2012/11/11 Jeff King <peff@peff.net>:
> On Sun, Nov 11, 2012 at 05:46:32PM +0100, René Scharfe wrote:
>
> Ultimately, I have some ideas for doing this in a breadth-first way,
> which would make it more naturally iterative. It would involve having N
> bits of storage per commit to check N tags, but it would mean that we
> could get accurate answers in the face of clock skew (like the
> merge-base calculation, it would merely get slower in the face of skew).

I guess the optimal algorithm may also depend on the commit graph
general shape, but intuitively, I'd say that the critical factor is
the number and distribution of tags. As soon as you have a significant
number of tags (let's say 1% of the commits are tagged, evenly
distributed), you'll quickly end up with every commit marked as
containing or not the target commit, so that each additional tag check
is cheap.

This suggests a complexity of O(number of commits) more often then
not, however you choose to traverse the graph.

At least on my almost-real-life repo*, with ~140,000 commits and
~2,000 tags, tag --contains takes a few seconds, of course more than
the worst-case test on a 40,000 commit / 1 tag repo, but still in the
same order of magnitude.

* : meaning we're in the process of migrating from clearcase (deep
sighs allowed !), and for now I kept almost everything in a single git
repo, which may not be the eventual choice

Jean-Jacques.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] git tag --contains : avoid stack overflow
  2012-11-12 22:27         ` Jean-Jacques Lafay
@ 2012-11-12 23:14           ` Jeff King
  2012-11-13  1:16             ` Johannes Schindelin
  0 siblings, 1 reply; 15+ messages in thread
From: Jeff King @ 2012-11-12 23:14 UTC (permalink / raw
  To: Jean-Jacques Lafay; +Cc: René Scharfe, msysgit, Git List, Philip Oakley

On Mon, Nov 12, 2012 at 11:27:14PM +0100, Jean-Jacques Lafay wrote:

> 2012/11/11 Jeff King <peff@peff.net>:
> > On Sun, Nov 11, 2012 at 05:46:32PM +0100, René Scharfe wrote:
> >
> > Ultimately, I have some ideas for doing this in a breadth-first way,
> > which would make it more naturally iterative. It would involve having N
> > bits of storage per commit to check N tags, but it would mean that we
> > could get accurate answers in the face of clock skew (like the
> > merge-base calculation, it would merely get slower in the face of skew).
> 
> I guess the optimal algorithm may also depend on the commit graph
> general shape, but intuitively, I'd say that the critical factor is
> the number and distribution of tags. As soon as you have a significant
> number of tags (let's say 1% of the commits are tagged, evenly
> distributed), you'll quickly end up with every commit marked as
> containing or not the target commit, so that each additional tag check
> is cheap.
> 
> This suggests a complexity of O(number of commits) more often then
> not, however you choose to traverse the graph.

We can do much better than O(number of commits), though, if we stop
traversing down a path when its timestamp shows that it is too old to
contain the commits we are searching for. The problem is that the
timestamps cannot always be trusted, because they are generated on
machines with wrong clocks, or by buggy software. This could be solved
by calculating and caching a "generation" number, but last time it was
discussed there was a lot of arguing and nothing got done.

Another approach, used by the merge-base calculation, is to treat
parents in a breadth-first way, but sort them by timestamp, and walk
backwards to find the merge-base (and if you get to a merge-base, you
can stop). In that case, bad timestamps may cause you to look at extra
commits (because you process a commit prematurely and end up going
deeper than the merge base), but it can never give you a wrong answer.

Thinking on it more, though, the merge-base style of computation would
mean you always have to walk backwards to the oldest tag. Which is in
the same order of magnitude as the number of commits, assuming you have
tags near the start of history. So I think we will always want to keep a
cutoff, anyway, and there is not much point in switching off of a
depth-first approach (but of course it does make sense to use iteration
instead of recursion to do so).

> At least on my almost-real-life repo*, with ~140,000 commits and
> ~2,000 tags, tag --contains takes a few seconds, of course more than
> the worst-case test on a 40,000 commit / 1 tag repo, but still in the
> same order of magnitude.

Try "git rev-list --count --all" to get a sense of how long O(# of
commits) takes. Before the depth-first implementation, "tag --contains"
was O(commits * tags) in the worst case. With depth first, it's
O(commits), and then with the timestamp cutoff, it's really O(commits
since needle), where "needle" is the oldest thing you're looking for.
Here are those numbers on linux-2.6 (with linux-stable tags):

  [to get a sense of the repo size]
  $ git for-each-ref refs/tags | wc -l
  909
  $ git rev-list --count --all
  363413

  [this is O(commits)]
  $ time git rev-list --count --all
  real    0m4.034s
  user    0m3.960s
  sys     0m0.056s

  [before depth-first, ffc4b80^; you can see that it is much worse than
   O(commits), though not as bad as the worst case (because finding the
   merge bases for recent tags is not O(commits)]
  $ time git tag --contains HEAD~200
  real    0m42.838s
  user    0m42.527s
  sys     0m0.156s

  [after depth-first, ffc4b80; you can see that this is O(commits),
   because we will go depth-first down to the roots, but do only a
   single traversal]
  $ time git tag --contains HEAD~200
  real    0m3.939s
  user    0m3.784s
  sys     0m0.140s

  [with my generation patches; much faster]
  $ time git tag --contains HEAD~200
  real    0m0.037s
  user    0m0.020s
  sys     0m0.012s

I was thinking we had merged the timestamp cutoff (which has the same
performance characteristics as generations, just with the skew issue) to
master, but it looks like we didn't.

-Peff

-- 
*** Please reply-to-all at all times ***
*** (do not pretend to know who is subscribed and who is not) ***
*** Please avoid top-posting. ***
The msysGit Wiki is here: https://github.com/msysgit/msysgit/wiki - Github accounts are free.

You received this message because you are subscribed to the Google
Groups "msysGit" group.
To post to this group, send email to msysgit@googlegroups.com
To unsubscribe from this group, send email to
msysgit+unsubscribe@googlegroups.com
For more options, and view previous threads, visit this group at
http://groups.google.com/group/msysgit?hl=en_US?hl=en

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Re: [PATCH] git tag --contains : avoid stack overflow
  2012-11-12 23:14           ` Jeff King
@ 2012-11-13  1:16             ` Johannes Schindelin
  2012-11-13  3:46               ` [msysGit] " Jeff King
  0 siblings, 1 reply; 15+ messages in thread
From: Johannes Schindelin @ 2012-11-13  1:16 UTC (permalink / raw
  To: Jeff King
  Cc: Jean-Jacques Lafay, René Scharfe, msysgit, Git List,
	Philip Oakley

Hi Peff,

On Mon, 12 Nov 2012, Jeff King wrote:

> On Mon, Nov 12, 2012 at 11:27:14PM +0100, Jean-Jacques Lafay wrote:
> 
> > 2012/11/11 Jeff King <peff@peff.net>:
> > > On Sun, Nov 11, 2012 at 05:46:32PM +0100, René Scharfe wrote:
> > >
> > > Ultimately, I have some ideas for doing this in a breadth-first way,
> > > which would make it more naturally iterative. It would involve
> > > having N bits of storage per commit to check N tags, but it would
> > > mean that we could get accurate answers in the face of clock skew
> > > (like the merge-base calculation, it would merely get slower in the
> > > face of skew).
> > 
> > I guess the optimal algorithm may also depend on the commit graph
> > general shape, but intuitively, I'd say that the critical factor is
> > the number and distribution of tags. As soon as you have a significant
> > number of tags (let's say 1% of the commits are tagged, evenly
> > distributed), you'll quickly end up with every commit marked as
> > containing or not the target commit, so that each additional tag check
> > is cheap.
> > 
> > This suggests a complexity of O(number of commits) more often then
> > not, however you choose to traverse the graph.
> 
> We can do much better than O(number of commits), though, if we stop
> traversing down a path when its timestamp shows that it is too old to
> contain the commits we are searching for. The problem is that the
> timestamps cannot always be trusted, because they are generated on
> machines with wrong clocks, or by buggy software. This could be solved
> by calculating and caching a "generation" number, but last time it was
> discussed there was a lot of arguing and nothing got done.

Sadly, not only machines with skewed clocks, but in particular buggy
3rd-party SCMs make this more than just problematic. In a git-svn clone
that was used as base for heavy Git development, I encountered quite a lot
of Jan 1, 1970 commits.

It just cannot be helped, we must distrust timestamps completely.

Ciao,
Dscho

-- 
*** Please reply-to-all at all times ***
*** (do not pretend to know who is subscribed and who is not) ***
*** Please avoid top-posting. ***
The msysGit Wiki is here: https://github.com/msysgit/msysgit/wiki - Github accounts are free.

You received this message because you are subscribed to the Google
Groups "msysGit" group.
To post to this group, send email to msysgit@googlegroups.com
To unsubscribe from this group, send email to
msysgit+unsubscribe@googlegroups.com
For more options, and view previous threads, visit this group at
http://groups.google.com/group/msysgit?hl=en_US?hl=en

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [msysGit] Re: [PATCH] git tag --contains : avoid stack overflow
  2012-11-13  1:16             ` Johannes Schindelin
@ 2012-11-13  3:46               ` Jeff King
  2012-11-13  4:01                 ` Johannes Schindelin
  2012-11-13  4:51                 ` Junio C Hamano
  0 siblings, 2 replies; 15+ messages in thread
From: Jeff King @ 2012-11-13  3:46 UTC (permalink / raw
  To: Johannes Schindelin
  Cc: Jean-Jacques Lafay, René Scharfe, msysgit, Git List,
	Philip Oakley

On Tue, Nov 13, 2012 at 01:16:01AM +0000, Johannes Schindelin wrote:

> > We can do much better than O(number of commits), though, if we stop
> > traversing down a path when its timestamp shows that it is too old to
> > contain the commits we are searching for. The problem is that the
> > timestamps cannot always be trusted, because they are generated on
> > machines with wrong clocks, or by buggy software. This could be solved
> > by calculating and caching a "generation" number, but last time it was
> > discussed there was a lot of arguing and nothing got done.
> 
> Sadly, not only machines with skewed clocks, but in particular buggy
> 3rd-party SCMs make this more than just problematic. In a git-svn clone
> that was used as base for heavy Git development, I encountered quite a lot
> of Jan 1, 1970 commits.

Yeah. We tolerate a certain amount of skew (24 hours for --name-rev, and
5 broken commits in a row for --since). But the big ones are usually
software bugs (the big kernel ones were from broken "guilt", I think) or
broken imports (when I published a bunch of skew statistics last year,
the interesting ones were all imports; I don't know if they were
software bugs, or just garbage in, garbage out).

> It just cannot be helped, we must distrust timestamps completely.

Note that name-rev will produce wrong answers in the face of clock skew.
And I think that you even wrote that code. :)

-Peff

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Re: [PATCH] git tag --contains : avoid stack overflow
  2012-11-13  3:46               ` [msysGit] " Jeff King
@ 2012-11-13  4:01                 ` Johannes Schindelin
  2012-11-13  4:05                   ` Jeff King
  2012-11-13  4:51                 ` Junio C Hamano
  1 sibling, 1 reply; 15+ messages in thread
From: Johannes Schindelin @ 2012-11-13  4:01 UTC (permalink / raw
  To: Jeff King
  Cc: Jean-Jacques Lafay, René Scharfe, msysgit, Git List,
	Philip Oakley

Hi Peff,

On Mon, 12 Nov 2012, Jeff King wrote:

> On Tue, Nov 13, 2012 at 01:16:01AM +0000, Johannes Schindelin wrote:
> 
> > > We can do much better than O(number of commits), though, if we stop
> > > traversing down a path when its timestamp shows that it is too old to
> > > contain the commits we are searching for. The problem is that the
> > > timestamps cannot always be trusted, because they are generated on
> > > machines with wrong clocks, or by buggy software. This could be solved
> > > by calculating and caching a "generation" number, but last time it was
> > > discussed there was a lot of arguing and nothing got done.
> > 
> > Sadly, not only machines with skewed clocks, but in particular buggy
> > 3rd-party SCMs make this more than just problematic. In a git-svn clone
> > that was used as base for heavy Git development, I encountered quite a lot
> > of Jan 1, 1970 commits.
> 
> Yeah. We tolerate a certain amount of skew (24 hours for --name-rev, and
> 5 broken commits in a row for --since). But the big ones are usually
> software bugs (the big kernel ones were from broken "guilt", I think) or
> broken imports (when I published a bunch of skew statistics last year,
> the interesting ones were all imports; I don't know if they were
> software bugs, or just garbage in, garbage out).
> 
> > It just cannot be helped, we must distrust timestamps completely.
> 
> Note that name-rev will produce wrong answers in the face of clock skew.
> And I think that you even wrote that code. :)

IIRC the cute code to short-circuit using the date is not from me. If it
is, I am very ashamed.

Ciao,
Johannes

-- 
*** Please reply-to-all at all times ***
*** (do not pretend to know who is subscribed and who is not) ***
*** Please avoid top-posting. ***
The msysGit Wiki is here: https://github.com/msysgit/msysgit/wiki - Github accounts are free.

You received this message because you are subscribed to the Google
Groups "msysGit" group.
To post to this group, send email to msysgit@googlegroups.com
To unsubscribe from this group, send email to
msysgit+unsubscribe@googlegroups.com
For more options, and view previous threads, visit this group at
http://groups.google.com/group/msysgit?hl=en_US?hl=en

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Re: [PATCH] git tag --contains : avoid stack overflow
  2012-11-13  4:01                 ` Johannes Schindelin
@ 2012-11-13  4:05                   ` Jeff King
  2012-11-13  4:52                     ` Johannes Schindelin
  0 siblings, 1 reply; 15+ messages in thread
From: Jeff King @ 2012-11-13  4:05 UTC (permalink / raw
  To: Johannes Schindelin
  Cc: Jean-Jacques Lafay, René Scharfe, msysgit, Git List,
	Philip Oakley

On Tue, Nov 13, 2012 at 04:01:11AM +0000, Johannes Schindelin wrote:

> > Note that name-rev will produce wrong answers in the face of clock skew.
> > And I think that you even wrote that code. :)
> 
> IIRC the cute code to short-circuit using the date is not from me. If it
> is, I am very ashamed.

Sorry, but it was:

  $ git blame -L'/commit->date < cutoff/',+1  builtin/name-rev.c
  bd321bcc name-rev.c (Johannes Schindelin 2005-10-26 15:10:20 +0200 32)
  if (commit->date < cutoff)

But it is never too late to fix it. :)

-Peff

-- 
*** Please reply-to-all at all times ***
*** (do not pretend to know who is subscribed and who is not) ***
*** Please avoid top-posting. ***
The msysGit Wiki is here: https://github.com/msysgit/msysgit/wiki - Github accounts are free.

You received this message because you are subscribed to the Google
Groups "msysGit" group.
To post to this group, send email to msysgit@googlegroups.com
To unsubscribe from this group, send email to
msysgit+unsubscribe@googlegroups.com
For more options, and view previous threads, visit this group at
http://groups.google.com/group/msysgit?hl=en_US?hl=en

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Re: [PATCH] git tag --contains : avoid stack overflow
  2012-11-13  3:46               ` [msysGit] " Jeff King
  2012-11-13  4:01                 ` Johannes Schindelin
@ 2012-11-13  4:51                 ` Junio C Hamano
  2012-11-13  6:08                   ` Jeff King
  1 sibling, 1 reply; 15+ messages in thread
From: Junio C Hamano @ 2012-11-13  4:51 UTC (permalink / raw
  To: Jeff King
  Cc: Johannes Schindelin, Jean-Jacques Lafay, René Scharfe,
	msysgit, Git List, Philip Oakley

Jeff King <peff@peff.net> writes:

> Yeah. We tolerate a certain amount of skew (24 hours for --name-rev, and
> 5 broken commits in a row for --since). But the big ones are usually
> software bugs (the big kernel ones were from broken "guilt", I think) or
> broken imports (when I published a bunch of skew statistics last year,
> the interesting ones were all imports; I don't know if they were
> software bugs, or just garbage in, garbage out).

I was hoping that 2e6bdd3 (test-generation: compute generation
numbers and clock skews, 2012-09-04) may be a good first step to
come up with a practical and cheap solution on top of it.

The traversal can be fooled by clock skews when it sees a commit
that has a timestamp that is older than it should, causing it to
give up, incorrectly thinking that there won't be newer commits that
it is interested in behind the problematic commit.

The logic implemented by the change is to identify these problematic
commits, and we could record these commits with the value of the
timestamps they should have had (e.g. the timestamp of the newest
ancestor for each of these commits) in a notes tree.  Then the
traversal logic (commit-list-insert-by-date) could be updated use
that "corrected" timestamp instead not to be fooled by the clock
skew.

Such a notes tree can be built once and updated by only "appending",
as a commit will never acquire more ancestors in its parents chain
once it is made.

Is it too simplistic, or too costly?  In git.git we have three such
commits whose timestamp need to be corrected, while in the Linux
kernel there were 2.2k skewed commits when I counted them a few
months ago.

-- 
*** Please reply-to-all at all times ***
*** (do not pretend to know who is subscribed and who is not) ***
*** Please avoid top-posting. ***
The msysGit Wiki is here: https://github.com/msysgit/msysgit/wiki - Github accounts are free.

You received this message because you are subscribed to the Google
Groups "msysGit" group.
To post to this group, send email to msysgit@googlegroups.com
To unsubscribe from this group, send email to
msysgit+unsubscribe@googlegroups.com
For more options, and view previous threads, visit this group at
http://groups.google.com/group/msysgit?hl=en_US?hl=en

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Re: [PATCH] git tag --contains : avoid stack overflow
  2012-11-13  4:05                   ` Jeff King
@ 2012-11-13  4:52                     ` Johannes Schindelin
  0 siblings, 0 replies; 15+ messages in thread
From: Johannes Schindelin @ 2012-11-13  4:52 UTC (permalink / raw
  To: Jeff King
  Cc: Jean-Jacques Lafay, René Scharfe, msysgit, Git List,
	Philip Oakley

Hi Peff,

On Mon, 12 Nov 2012, Jeff King wrote:

> On Tue, Nov 13, 2012 at 04:01:11AM +0000, Johannes Schindelin wrote:
> 
> > > Note that name-rev will produce wrong answers in the face of clock skew.
> > > And I think that you even wrote that code. :)
> > 
> > IIRC the cute code to short-circuit using the date is not from me. If it
> > is, I am very ashamed.
> 
> Sorry, but it was:
> 
>   $ git blame -L'/commit->date < cutoff/',+1  builtin/name-rev.c
>   bd321bcc name-rev.c (Johannes Schindelin 2005-10-26 15:10:20 +0200 32)
>   if (commit->date < cutoff)
> 
> But it is never too late to fix it. :)

I will now go and find a hole to hide in. Or alternatively finally go to
sleep.

Ciao,
Johannes

-- 
*** Please reply-to-all at all times ***
*** (do not pretend to know who is subscribed and who is not) ***
*** Please avoid top-posting. ***
The msysGit Wiki is here: https://github.com/msysgit/msysgit/wiki - Github accounts are free.

You received this message because you are subscribed to the Google
Groups "msysGit" group.
To post to this group, send email to msysgit@googlegroups.com
To unsubscribe from this group, send email to
msysgit+unsubscribe@googlegroups.com
For more options, and view previous threads, visit this group at
http://groups.google.com/group/msysgit?hl=en_US?hl=en

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Re: [PATCH] git tag --contains : avoid stack overflow
  2012-11-13  4:51                 ` Junio C Hamano
@ 2012-11-13  6:08                   ` Jeff King
  0 siblings, 0 replies; 15+ messages in thread
From: Jeff King @ 2012-11-13  6:08 UTC (permalink / raw
  To: Junio C Hamano
  Cc: Johannes Schindelin, Jean-Jacques Lafay, René Scharfe,
	msysgit, Git List, Philip Oakley

On Mon, Nov 12, 2012 at 08:51:37PM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > Yeah. We tolerate a certain amount of skew (24 hours for --name-rev, and
> > 5 broken commits in a row for --since). But the big ones are usually
> > software bugs (the big kernel ones were from broken "guilt", I think) or
> > broken imports (when I published a bunch of skew statistics last year,
> > the interesting ones were all imports; I don't know if they were
> > software bugs, or just garbage in, garbage out).
> 
> I was hoping that 2e6bdd3 (test-generation: compute generation
> numbers and clock skews, 2012-09-04) may be a good first step to
> come up with a practical and cheap solution on top of it.
>
> The traversal can be fooled by clock skews when it sees a commit
> that has a timestamp that is older than it should, causing it to
> give up, incorrectly thinking that there won't be newer commits that
> it is interested in behind the problematic commit.

I wrote a similar skew-finding tool last year, though some of the
numbers it came up with were different (I remember having many fewer
skewed commits in the kernel repo).

One problem is that it identifies commits which behave badly with
certain algorithms, but it does not identify commits which are wrong.
If I skew backwards, it will find my commit. But if I skew forwards, it
will label my children as wrong.

> The logic implemented by the change is to identify these problematic
> commits, and we could record these commits with the value of the
> timestamps they should have had (e.g. the timestamp of the newest
> ancestor for each of these commits) in a notes tree.  Then the
> traversal logic (commit-list-insert-by-date) could be updated use
> that "corrected" timestamp instead not to be fooled by the clock
> skew.
> 
> Such a notes tree can be built once and updated by only "appending",
> as a commit will never acquire more ancestors in its parents chain
> once it is made.
> 
> Is it too simplistic, or too costly?  In git.git we have three such
> commits whose timestamp need to be corrected, while in the Linux
> kernel there were 2.2k skewed commits when I counted them a few
> months ago.

This came up in the big generations discussion last summer, and I think
I even implemented a proof of concept. I couldn't find the actual code,
though but only that I got "pleasing performance results using a notes
tree to store a list of commits with bogus timestamps":

  http://article.gmane.org/gmane.comp.version-control.git/161101

It is a little wasteful in space if you have a lot of skewed commits
(the notes tree stores a 160-bit hash pointing to a blob object storing
a 32-bit integer).

My personal preference at this point would be:

  1. introduce an auxiliary metadata file that would live alongside the
     pack index and contain generation numbers

  2. generate the metadata file during pack indexing.

  3. If we have a generation metadata file, but a particular object is
     not in it, compute the generation; this should be quick because we
     will hit a file with a stored generation eventually

  4. If we do not have any generation metadata files, or if grafts or
     replace objects are in use, do not use cutoffs in algorithms. Be
     safe but slow.

On the other hand, just switching to doing a single traversal instead of
one merge-base computation per tag already got rid of the really awful
performance cases. Nobody has complained since that went in, so maybe
nobody cares about shaving a few seconds per operation down to a few
tens of milliseconds. The real win was shaving tens of seconds down to a
few seconds.

-Peff

-- 
*** Please reply-to-all at all times ***
*** (do not pretend to know who is subscribed and who is not) ***
*** Please avoid top-posting. ***
The msysGit Wiki is here: https://github.com/msysgit/msysgit/wiki - Github accounts are free.

You received this message because you are subscribed to the Google
Groups "msysGit" group.
To post to this group, send email to msysgit@googlegroups.com
To unsubscribe from this group, send email to
msysgit+unsubscribe@googlegroups.com
For more options, and view previous threads, visit this group at
http://groups.google.com/group/msysgit?hl=en_US?hl=en

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2012-11-13  6:08 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <1352568970-4669-1-git-send-email-jeanjacques.lafay@gmail.com>
2012-11-10 20:00 ` [PATCH] git tag --contains : avoid stack overflow Philip Oakley
2012-11-10 21:13   ` Jean-Jacques Lafay
2012-11-10 21:33     ` Pat Thoyts
2012-11-11 16:46     ` René Scharfe
2012-11-11 16:54       ` Jeff King
2012-11-11 23:10         ` Johannes Schindelin
2012-11-12 22:27         ` Jean-Jacques Lafay
2012-11-12 23:14           ` Jeff King
2012-11-13  1:16             ` Johannes Schindelin
2012-11-13  3:46               ` [msysGit] " Jeff King
2012-11-13  4:01                 ` Johannes Schindelin
2012-11-13  4:05                   ` Jeff King
2012-11-13  4:52                     ` Johannes Schindelin
2012-11-13  4:51                 ` Junio C Hamano
2012-11-13  6:08                   ` Jeff King

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).