crash on git diff-tree -Ganything <tree> for new files with textconv filter

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* crash on git diff-tree -Ganything <tree> for new files with textconv filter
@ 2012-10-27 18:37 Peter Oberndorfer
  2012-10-28 12:01 ` Jeff King
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Oberndorfer @ 2012-10-27 18:37 UTC (permalink / raw)
  To: git, Junio C Hamano

Hi,

It seems "git diff-tree -Ganything <tree>" crashes[1] with a null
pointer dereference
when run on a commit that adds a file (pdf) with a textconv filter.

It can be reproduced with vanilla git by having a commit on top that
adds a file with a textconv filter and executing git diff-tree
-Ganything HEAD
But running git log -Ganything still works without a crash.
This problem seems to exist since the feature was first added in f506b8e8b5.

While testing I also noticed the -S and -G act on the original file
instead of the textconv munged data.
Is this intentional or caused by accessing the wrong data?
Wild guess: should we really access p->one->data and not mf1.ptr?

Is there some more information i should provide?

Greetings Peter

[1]
I am running msysgit v1.8.0.msysgit.0 (52d3a7583a)
and i tried added -G pickaxe support to gitk.
gitk runs git diff-tree -r -s --stdin -Gpattern

This is how i detected the crash the first time.
(but only because of a crash popup on Windows, gitk does not complain...)

For testing on vanilla git I used .git/config:
[diff "upcase2"]
    textconv = tr a-z A-Z <

.gitatrributes:
newtext diff=upcase2

Program received signal SIGSEGV, Segmentation fault.
[Switching to thread 3864.0x83c]
0x0049d90c in regexec (preg=0x22f900, string=0x0, nmatch=1,
pmatch=0x22f46c, eflags=0) at compat/regex/regexec.c:241
241           length = strlen (string);
(gdb) bt
#0  0x0049d90c in regexec (preg=0x22f900, string=0x0, nmatch=1,
pmatch=0x22f46c, eflags=0) at compat/regex/regexec.c:241
#1  0x004f5763 in diff_grep (p=0x109a530, o=0x550b48, regexp=0x22f900,
kws=0x0) at diffcore-pickaxe.c:110
#2  0x004f59dc in pickaxe (o=<value optimized out>, regexp=0x22f900,
kws=0xffffffff, fn=0x4f5620 <diff_grep>) at diffcore-pickaxe.c:40
#3  0x004f5bd4 in diffcore_pickaxe_grep (o=0x550b48) at
diffcore-pickaxe.c:154
#4  0x0048049a in diffcore_std (options=0x550b48) at diff.c:4630
#5  0x004dc16a in log_tree_diff_flush (opt=0x5508c0) at log-tree.c:697
#6  0x004dc32e in log_tree_commit (opt=0x5508c0, commit=0xffc620) at
log-tree.c:790
#7  0x004206dd in cmd_diff_tree (argc=<value optimized out>,
argv=0x3d24bc, prefix=0x0) at builtin/diff-tree.c:43
#8  0x00401a16 in handle_internal_command (argc=<value optimized out>,
argv=<value optimized out>) at git.c:306
#9  0x00401c00 in main (argc=6, argv=0x3d24b8) at git.c:513
(gdb) up
#1  0x004f5763 in diff_grep (p=0x109a530, o=0x550b48, regexp=0x22f900,
kws=0x0) at diffcore-pickaxe.c:110
110                     hit = !regexec(regexp, p->one->data, 1,
&regmatch, 0);
(gdb) info locals
regmatch = {rm_so = 17408640, rm_eo = 2291968}
textconv_one = (struct userdiff_driver *) 0x0
textconv_two = (struct userdiff_driver *) 0xe10468
mf1 = {ptr = 0x0, size = 0}
mf2 = {
  ptr = 0x2bbcaf0 ' ' <repeats 52 times>, "PROJECT
DESCRIPTION\r\n\r\n\r\nProject Number:", ' ' <repeats 19 times>,
"xxxxx\r\nProject Description:", ' ' <repeats 14 times>,
"xxxxxxxx\r\nxxxxx:", ' ' <repeats 11 times>..., size = 64185}
hit = <value optimized out>
(gdb) print p
$1 = (struct diff_filepair *) 0x109a530
(gdb) print *p
$2 = {one = 0x109a320, two = 0x109a428, score = 0, status = 0 '\0',
broken_pair = 0, renamed_pair = 0, is_unmerged = 0}
(gdb) print p->one
$3 = (struct diff_filespec *) 0x109a320
(gdb) print *p->one
$4 = {sha1 = '\0' <repeats 19 times>, path = 0x109a360
"xxxxxxxx/xxx/doc/xxxx xxxxx 1.4.pdf", data = 0x0, cnt_data = 0x0,
  funcname_pattern_ident = 0x0, size = 0, count = 1, xfrm_flags = 0,
rename_used = 0, mode = 0, sha1_valid = 0, should_free = 0,
should_munmap = 0,
  dirty_submodule = 0, is_stdin = 0, has_more_entries = 0, driver = 0x0,
is_binary = -1}
(gdb) print *p->two
$5 = {sha1 = "0aax\217\231)oBaAa(\021\234^'Q\236\230", path = 0x109a468
"xxxxxxxx/xxx/doc/xxxx xxxxx 1.4.pdf", data = 0x0,
  cnt_data = 0x0, funcname_pattern_ident = 0x0, size = 0, count = 1,
xfrm_flags = 0, rename_used = 0, mode = 33188, sha1_valid = 1,
should_free = 0,
  should_munmap = 0, dirty_submodule = 0, is_stdin = 0, has_more_entries
= 0, driver = 0xe10468, is_binary = -1}
(gdb) print textconv_one
$6 = (struct userdiff_driver *) 0x0
(gdb) print textconv_two
$7 = (struct userdiff_driver *) 0xe10468
(gdb) print *textconv_two
$10 = {name = 0xfe4fc0 "astextplain", external = 0x0, binary = -1,
funcname = {pattern = 0x0, cflags = 0}, word_regex = 0x0,
  textconv = 0xfe4fd8 "astextplain", textconv_cache = 0x0,
textconv_want_cache = 0}
(gdb)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: crash on git diff-tree -Ganything <tree> for new files with textconv filter
  2012-10-27 18:37 crash on git diff-tree -Ganything <tree> for new files with textconv filter Peter Oberndorfer
@ 2012-10-28 12:01 ` Jeff King
  2012-10-28 12:45   ` [PATCH 0/2] textconv support for "log -S" Jeff King
  2012-10-28 19:56   ` crash on git diff-tree -Ganything <tree> for new files with textconv filter Peter Oberndorfer
  0 siblings, 2 replies; 24+ messages in thread
From: Jeff King @ 2012-10-28 12:01 UTC (permalink / raw)
  To: Peter Oberndorfer; +Cc: git, Junio C Hamano

On Sat, Oct 27, 2012 at 08:37:24PM +0200, Peter Oberndorfer wrote:

> It seems "git diff-tree -Ganything <tree>" crashes[1] with a null
> pointer dereference
> when run on a commit that adds a file (pdf) with a textconv filter.
> 
> It can be reproduced with vanilla git by having a commit on top that
> adds a file with a textconv filter and executing git diff-tree
> -Ganything HEAD
> But running git log -Ganything still works without a crash.
> This problem seems to exist since the feature was first added in f506b8e8b5.

Thanks for a thorough bug report. I didn't reproduce the crash, but I
can see how it happens (it happens with diff-tree because we will reuse
the working tree file in that instance; it could also happen if you
turned on textconv caching).

> While testing I also noticed the -S and -G act on the original file
> instead of the textconv munged data.
> Is this intentional or caused by accessing the wrong data?

Both, perhaps. :)

-G operates on the munged data; you can see it feed the munged data to
xdiff in diff_grep. But the optimization for handling added and removed
files accidentally fed the wrong pointer. Fixing that is a no-brainer,
since the optimization is inconsistent with the regular code path.

-S, however, predates the invention of textconv and has never used it.
It is a little less clear that textconv is the right thing here, because
it is not about grepping the diff, but about counting occurrences of the
string inside the file content. I tend to think that doing so on the
textconv'd data would be what people generally want, but it is a
behavior change.

> Wild guess: should we really access p->one->data and not mf1.ptr?

Precisely. Thanks for your wild guess; it made finding the bug very
easy. :)

> Is there some more information i should provide?

The patch below should fix it. I added tests, but please try your
real-world test case on it to double-check.

-- >8 --
Subject: [PATCH] diff_grep: use textconv buffers for add/deleted files

If you use "-G" to grep a diff, we will apply a configured
textconv filter to the data before generating the diff.
However, if the diff is an addition or deletion, we do not
bother running the diff at all, and just look for the token
in the added (or removed) content. This works because we
know that the diff must contain every line of content.

However, while we used the textconv-derived buffers in the
regular diff, we accidentally passed the original unmodified
buffers to regexec when checking the added or removed
content. This could lead to an incorrect answer.

Worse, in some cases we might have a textconv buffer but no
original buffer (e.g., if we pulled the textconv data from
cache, or if we reused a working tree file when generating
it). In that case, we could actually feed NULL to regexec
and segfault.

Reported-by: Peter Oberndorfer <kumbayo84@arcor.de>
Signed-off-by: Jeff King <peff@peff.net>
---
 diffcore-pickaxe.c       |  4 ++--
 t/t4030-diff-textconv.sh | 12 ++++++++++++
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
index ed23eb4..a209376 100644
--- a/diffcore-pickaxe.c
+++ b/diffcore-pickaxe.c
@@ -104,10 +104,10 @@ static int diff_grep(struct diff_filepair *p, struct diff_options *o,
 		if (!mf2.ptr)
 			return 0; /* ignore unmerged */
 		/* created "two" -- does it have what we are looking for? */
-		hit = !regexec(regexp, p->two->data, 1, &regmatch, 0);
+		hit = !regexec(regexp, mf2.ptr, 1, &regmatch, 0);
 	} else if (!mf2.ptr) {
 		/* removed "one" -- did it have what we are looking for? */
-		hit = !regexec(regexp, p->one->data, 1, &regmatch, 0);
+		hit = !regexec(regexp, mf1.ptr, 1, &regmatch, 0);
 	} else {
 		/*
 		 * We have both sides; need to run textual diff and see if
diff --git a/t/t4030-diff-textconv.sh b/t/t4030-diff-textconv.sh
index eebb1ee..461d27a 100755
--- a/t/t4030-diff-textconv.sh
+++ b/t/t4030-diff-textconv.sh
@@ -84,6 +84,18 @@ test_expect_success 'status -v produces text' '
 	git reset --soft HEAD@{1}
 '
 
+test_expect_success 'grep-diff (-G) operates on textconv data (add)' '
+	echo one >expect &&
+	git log --root --format=%s -G0 >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'grep-diff (-G) operates on textconv data (modification)' '
+	echo two >expect &&
+	git log --root --format=%s -G1 >actual &&
+	test_cmp expect actual
+'
+
 cat >expect.stat <<'EOF'
  file | Bin 2 -> 4 bytes
  1 file changed, 0 insertions(+), 0 deletions(-)
-- 
1.8.0.3.g3456896

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 0/2] textconv support for "log -S"
  2012-10-28 12:01 ` Jeff King
@ 2012-10-28 12:45   ` Jeff King
  2012-10-28 12:46     ` [PATCH 1/2] pickaxe: hoist empty needle check Jeff King
  2012-10-28 12:47     ` [PATCH 2/2] pickaxe: use textconv for -S counting Jeff King
  2012-10-28 19:56   ` crash on git diff-tree -Ganything <tree> for new files with textconv filter Peter Oberndorfer
  1 sibling, 2 replies; 24+ messages in thread
From: Jeff King @ 2012-10-28 12:45 UTC (permalink / raw)
  To: Peter Oberndorfer; +Cc: git, Junio C Hamano

On Sun, Oct 28, 2012 at 08:01:04AM -0400, Jeff King wrote:

> -G operates on the munged data; you can see it feed the munged data to
> xdiff in diff_grep. But the optimization for handling added and removed
> files accidentally fed the wrong pointer. Fixing that is a no-brainer,
> since the optimization is inconsistent with the regular code path.
> 
> -S, however, predates the invention of textconv and has never used it.
> It is a little less clear that textconv is the right thing here, because
> it is not about grepping the diff, but about counting occurrences of the
> string inside the file content. I tend to think that doing so on the
> textconv'd data would be what people generally want, but it is a
> behavior change.

I prepared the earlier bugfix for "-G" for maint. Modifying "-S" would
be a separate feature topic, and would look like this (I built it on top
of the bugfix patch, since the tests are a follow-on).

  [1/2]: pickaxe: hoist empty needle check
  [2/2]: pickaxe: use textconv for -S counting

-Peff

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/2] pickaxe: hoist empty needle check
  2012-10-28 12:45   ` [PATCH 0/2] textconv support for "log -S" Jeff King
@ 2012-10-28 12:46     ` Jeff King
  2012-10-28 12:47     ` [PATCH 2/2] pickaxe: use textconv for -S counting Jeff King
  1 sibling, 0 replies; 24+ messages in thread
From: Jeff King @ 2012-10-28 12:46 UTC (permalink / raw)
  To: Peter Oberndorfer; +Cc: git, Junio C Hamano

If we are given an empty pickaxe needle like "git log -S ''",
it is impossible for us to find anything (because no matter
what the content, the count will always be 0). We currently
check this at the lowest level of contains(). Let's hoist
the logic much earlier to has_changes(), so that it is
simpler to return our answer before loading any blob data.

Signed-off-by: Jeff King <peff@peff.net>
---
 diffcore-pickaxe.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
index a209376..61f628c 100644
--- a/diffcore-pickaxe.c
+++ b/diffcore-pickaxe.c
@@ -163,8 +163,6 @@ static unsigned int contains(struct diff_filespec *one, struct diff_options *o,
 	unsigned int cnt;
 	unsigned long sz;
 	const char *data;
-	if (!o->pickaxe[0])
-		return 0;
 	if (diff_populate_filespec(one, 0))
 		return 0;
 
@@ -206,6 +204,9 @@ static int has_changes(struct diff_filepair *p, struct diff_options *o,
 static int has_changes(struct diff_filepair *p, struct diff_options *o,
 		       regex_t *regexp, kwset_t kws)
 {
+	if (!o->pickaxe[0])
+		return 0;
+
 	if (!DIFF_FILE_VALID(p->one)) {
 		if (!DIFF_FILE_VALID(p->two))
 			return 0; /* ignore unmerged */
-- 
1.8.0.3.g3456896

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 2/2] pickaxe: use textconv for -S counting
  2012-10-28 12:45   ` [PATCH 0/2] textconv support for "log -S" Jeff King
  2012-10-28 12:46     ` [PATCH 1/2] pickaxe: hoist empty needle check Jeff King
@ 2012-10-28 12:47     ` Jeff King
  2012-11-13 23:13       ` Junio C Hamano
  1 sibling, 1 reply; 24+ messages in thread
From: Jeff King @ 2012-10-28 12:47 UTC (permalink / raw)
  To: Peter Oberndorfer; +Cc: git, Junio C Hamano

We currently just look at raw blob data when using "-S" to
pickaxe. This is mostly historical, as pickaxe predates the
textconv feature. If the user has bothered to define a
textconv filter, it is more likely that their search string will be
on the textconv output, as that is what they will see in the
diff (and we do not even provide a mechanism for them to
search for binary needles that contain NUL characters).

This patch teaches "-S" to use textconv, just as we
already do for "-G".

Signed-off-by: Jeff King <peff@peff.net>
---
 diffcore-pickaxe.c       | 56 +++++++++++++++++++++++++++++++++---------------
 t/t4030-diff-textconv.sh | 12 +++++++++++
 2 files changed, 51 insertions(+), 17 deletions(-)

diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
index 61f628c..b097fa7 100644
--- a/diffcore-pickaxe.c
+++ b/diffcore-pickaxe.c
@@ -157,17 +157,15 @@ static unsigned int contains(struct diff_filespec *one, struct diff_options *o,
 	return;
 }
 
-static unsigned int contains(struct diff_filespec *one, struct diff_options *o,
+static unsigned int contains(mmfile_t *mf, struct diff_options *o,
 			     regex_t *regexp, kwset_t kws)
 {
 	unsigned int cnt;
 	unsigned long sz;
 	const char *data;
-	if (diff_populate_filespec(one, 0))
-		return 0;
 
-	sz = one->size;
-	data = one->data;
+	sz = mf->size;
+	data = mf->ptr;
 	cnt = 0;
 
 	if (regexp) {
@@ -197,29 +195,53 @@ static int has_changes(struct diff_filepair *p, struct diff_options *o,
 			cnt++;
 		}
 	}
-	diff_free_filespec_data(one);
 	return cnt;
 }
 
 static int has_changes(struct diff_filepair *p, struct diff_options *o,
 		       regex_t *regexp, kwset_t kws)
 {
+	struct userdiff_driver *textconv_one = get_textconv(p->one);
+	struct userdiff_driver *textconv_two = get_textconv(p->two);
+	mmfile_t mf1, mf2;
+	int ret;
+
 	if (!o->pickaxe[0])
 		return 0;
 
-	if (!DIFF_FILE_VALID(p->one)) {
-		if (!DIFF_FILE_VALID(p->two))
-			return 0; /* ignore unmerged */
+	/*
+	 * If we have an unmodified pair, we know that the count will be the
+	 * same and don't even have to load the blobs. Unless textconv is in
+	 * play, _and_ we are using two different textconv filters (e.g.,
+	 * because a pair is an exact rename with different textconv attributes
+	 * for each side, which might generate different content).
+	 */
+	if (textconv_one == textconv_two && diff_unmodified_pair(p))
+		return 0;
+
+	fill_one(p->one, &mf1, &textconv_one);
+	fill_one(p->two, &mf2, &textconv_two);
+
+	if (!mf1.ptr) {
+		if (!mf2.ptr)
+			ret = 0; /* ignore unmerged */
 		/* created */
-		return contains(p->two, o, regexp, kws) != 0;
-	}
-	if (!DIFF_FILE_VALID(p->two))
-		return contains(p->one, o, regexp, kws) != 0;
-	if (!diff_unmodified_pair(p)) {
-		return contains(p->one, o, regexp, kws) !=
-		       contains(p->two, o, regexp, kws);
+		ret = contains(&mf2, o, regexp, kws) != 0;
 	}
-	return 0;
+	else if (!mf2.ptr) /* removed */
+		ret = contains(&mf1, o, regexp, kws) != 0;
+	else
+		ret = contains(&mf1, o, regexp, kws) !=
+		      contains(&mf2, o, regexp, kws);
+
+	if (textconv_one)
+		free(mf1.ptr);
+	if (textconv_two)
+		free(mf2.ptr);
+	diff_free_filespec_data(p->one);
+	diff_free_filespec_data(p->two);
+
+	return ret;
 }
 
 static void diffcore_pickaxe_count(struct diff_options *o)
diff --git a/t/t4030-diff-textconv.sh b/t/t4030-diff-textconv.sh
index 461d27a..53ec330 100755
--- a/t/t4030-diff-textconv.sh
+++ b/t/t4030-diff-textconv.sh
@@ -96,6 +96,18 @@ test_expect_success 'grep-diff (-G) operates on textconv data (modification)' '
 	test_cmp expect actual
 '
 
+test_expect_success 'pickaxe (-S) operates on textconv data (add)' '
+	echo one >expect &&
+	git log --root --format=%s -S0 >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'pickaxe (-S) operates on textconv data (modification)' '
+	echo two >expect &&
+	git log --root --format=%s -S1 >actual &&
+	test_cmp expect actual
+'
+
 cat >expect.stat <<'EOF'
  file | Bin 2 -> 4 bytes
  1 file changed, 0 insertions(+), 0 deletions(-)
-- 
1.8.0.3.g3456896

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: crash on git diff-tree -Ganything <tree> for new files with textconv filter
  2012-10-28 12:01 ` Jeff King
  2012-10-28 12:45   ` [PATCH 0/2] textconv support for "log -S" Jeff King
@ 2012-10-28 19:56   ` Peter Oberndorfer
  2012-10-29  6:05     ` Jeff King
  1 sibling, 1 reply; 24+ messages in thread
From: Peter Oberndorfer @ 2012-10-28 19:56 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Junio C Hamano

On 2012-10-28 13:01, Jeff King wrote:
> On Sat, Oct 27, 2012 at 08:37:24PM +0200, Peter Oberndorfer wrote:
>
>> It seems "git diff-tree -Ganything <tree>" crashes[1] with a null
>> pointer dereference
>> when run on a commit that adds a file (pdf) with a textconv filter.
>>
>> It can be reproduced with vanilla git by having a commit on top that
>> adds a file with a textconv filter and executing git diff-tree
>> -Ganything HEAD
>> But running git log -Ganything still works without a crash.
>> This problem seems to exist since the feature was first added in f506b8e8b5.
> Thanks for a thorough bug report. I didn't reproduce the crash, but I
> can see how it happens (it happens with diff-tree because we will reuse
> the working tree file in that instance; it could also happen if you
> turned on textconv caching).
>
>> While testing I also noticed the -S and -G act on the original file
>> instead of the textconv munged data.
>> Is this intentional or caused by accessing the wrong data?
> Both, perhaps. :)
Hi,
thanks for your patch for this!

>
> -G operates on the munged data; you can see it feed the munged data to
> xdiff in diff_grep. But the optimization for handling added and removed
> files accidentally fed the wrong pointer. Fixing that is a no-brainer,
> since the optimization is inconsistent with the regular code path.
>
> -S, however, predates the invention of textconv and has never used it.
> It is a little less clear that textconv is the right thing here, because
> it is not about grepping the diff, but about counting occurrences of the
> string inside the file content. I tend to think that doing so on the
> textconv'd data would be what people generally want, but it is a
> behavior change.
>
>> Wild guess: should we really access p->one->data and not mf1.ptr?
> Precisely. Thanks for your wild guess; it made finding the bug very
> easy. :)
>
>> Is there some more information i should provide?
> The patch below should fix it. I added tests, but please try your
> real-world test case on it to double-check.

I tested your patch, but now it crashes for another reason :-)
i have a file with exactly 12288(0x3000) bytes in the repository.
When the file is loaded, the data is placed luckily so the data end
falls at a page boundary.
Later diff_grep() calls regexec() which calls strlen() on the loaded buffer
and ends up reading beyond the actual data into the next page
which is not allocated and causes a pagefault.
Or it could possibly (randomly) match the regex on data that is not
actually part of a file...
Different memory allocation rules on Windows probably also have some
influence here.

My guess is that diff_filespec->data is not supposed to be zero terminated
and we should not invoke strlen() on a not zero terminated data.
But this should be decided by somebody who knows the rules.

Greetings Peter

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: crash on git diff-tree -Ganything <tree> for new files with textconv filter
  2012-10-28 19:56   ` crash on git diff-tree -Ganything <tree> for new files with textconv filter Peter Oberndorfer
@ 2012-10-29  6:05     ` Jeff King
  2012-10-29  6:18       ` Jeff King
  2012-10-29 20:19       ` Peter Oberndorfer
  0 siblings, 2 replies; 24+ messages in thread
From: Jeff King @ 2012-10-29  6:05 UTC (permalink / raw)
  To: Peter Oberndorfer; +Cc: git, Junio C Hamano

On Sun, Oct 28, 2012 at 08:56:39PM +0100, Peter Oberndorfer wrote:

> > The patch below should fix it. I added tests, but please try your
> > real-world test case on it to double-check.
> 
> I tested your patch, but now it crashes for another reason :-)

Well, that's progress, right? :)

> i have a file with exactly 12288(0x3000) bytes in the repository.
> When the file is loaded, the data is placed luckily so the data end
> falls at a page boundary.
> Later diff_grep() calls regexec() which calls strlen() on the loaded buffer
> and ends up reading beyond the actual data into the next page
> which is not allocated and causes a pagefault.
> Or it could possibly (randomly) match the regex on data that is not
> actually part of a file...

Yuck. For the most part, we treat blob content (and generally most
object content) as a sized buffer. However, there are some spots which,
either through laziness or because a code interface expects a string, we
pass the value as a string. This works because the object-reading code
puts an extra NUL at the end of our buffer to handle just such an
instance. So we might prematurely end if the object contains embedded
NULs, but we would never read past the end.

The code to read the output of a textconv filter does not do this
explicitly. I would think it would get it for free by virtue of reading
into a strbuf, though. I'll try to investigate.

-Peff

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: crash on git diff-tree -Ganything <tree> for new files with textconv filter
  2012-10-29  6:05     ` Jeff King
@ 2012-10-29  6:18       ` Jeff King
  2012-10-29 20:19       ` Peter Oberndorfer
  1 sibling, 0 replies; 24+ messages in thread
From: Jeff King @ 2012-10-29  6:18 UTC (permalink / raw)
  To: Peter Oberndorfer; +Cc: git, Junio C Hamano

On Mon, Oct 29, 2012 at 02:05:24AM -0400, Jeff King wrote:

> > i have a file with exactly 12288(0x3000) bytes in the repository.
> > When the file is loaded, the data is placed luckily so the data end
> > falls at a page boundary.
> > Later diff_grep() calls regexec() which calls strlen() on the loaded buffer
> > and ends up reading beyond the actual data into the next page
> > which is not allocated and causes a pagefault.
> > Or it could possibly (randomly) match the regex on data that is not
> > actually part of a file...
> 
> Yuck. For the most part, we treat blob content (and generally most
> object content) as a sized buffer. However, there are some spots which,
> either through laziness or because a code interface expects a string, we
> pass the value as a string. This works because the object-reading code
> puts an extra NUL at the end of our buffer to handle just such an
> instance. So we might prematurely end if the object contains embedded
> NULs, but we would never read past the end.
> 
> The code to read the output of a textconv filter does not do this
> explicitly. I would think it would get it for free by virtue of reading
> into a strbuf, though. I'll try to investigate.

I can't seem to replicate the problem here, even under valgrind. Do you
have a minimal test case?

-Peff

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: crash on git diff-tree -Ganything <tree> for new files with textconv filter
  2012-10-29  6:05     ` Jeff King
  2012-10-29  6:18       ` Jeff King
@ 2012-10-29 20:19       ` Peter Oberndorfer
  2012-10-29 22:35         ` Jeff King
  1 sibling, 1 reply; 24+ messages in thread
From: Peter Oberndorfer @ 2012-10-29 20:19 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Junio C Hamano

On 2012-10-29 07:05, Jeff King wrote:
> On Sun, Oct 28, 2012 at 08:56:39PM +0100, Peter Oberndorfer wrote:
>
>>> The patch below should fix it. I added tests, but please try your
>>> real-world test case on it to double-check.
>> I tested your patch, but now it crashes for another reason :-)
> Well, that's progress, right? :)
Sure :-)
>
>> i have a file with exactly 12288(0x3000) bytes in the repository.
>> When the file is loaded, the data is placed luckily so the data end
>> falls at a page boundary.
>> Later diff_grep() calls regexec() which calls strlen() on the loaded buffer
>> and ends up reading beyond the actual data into the next page
>> which is not allocated and causes a pagefault.
>> Or it could possibly (randomly) match the regex on data that is not
>> actually part of a file...
> Yuck. For the most part, we treat blob content (and generally most
> object content) as a sized buffer. However, there are some spots which,
> either through laziness or because a code interface expects a string, we
> pass the value as a string. This works because the object-reading code
> puts an extra NUL at the end of our buffer to handle just such an
> instance. So we might prematurely end if the object contains embedded
> NULs, but we would never read past the end.
>
> The code to read the output of a textconv filter does not do this
> explicitly. I would think it would get it for free by virtue of reading
> into a strbuf, though. I'll try to investigate.
I could reproduce with my 0x3000 bytes file on linux. The buffer is not
read with a trailing null byte it is mapped by mmap in
diff_populate_filespec...
So i think we will not get away with expecting a trailing null :-/

For me the key to reproduce the problem was to have 2 commits.
Adding the file in the root commit it did not work. [1]

Greetings Peter
> -Peff
> --
> To unsubscribe from this list: send the line "unsubscribe git" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

[1]
kumbayo@home:~/src$ mkdir git_mmap_crash2
kumbayo@home:~/src$ cd git_mmap_crash2
kumbayo@home:~/src/git_mmap_crash2$ git init
kumbayo@home:~/src/git_mmap_crash2$ echo blah>blah
kumbayo@home:~/src/git_mmap_crash2$ git add blah
kumbayo@home:~/src/git_mmap_crash2$ git commit -m blah
[master (Basis-Version) 3458422] blah
diff_populate_filespec -> xmmap for blah size:0x5 returned: 0xb7206000
 1 file changed, 1 insertion(+)
 create mode 100644 blah
kumbayo@home:~/src/git_mmap_crash2$ perl -e 'print "-" x 0x3000 '> asdf.txt
kumbayo@home:~/src/git_mmap_crash2$ git add asdf.txt
kumbayo@home:~/src/git_mmap_crash2$ git commit -m crashy
[master 5cf2c5f] crashy
diff_populate_filespec -> xmmap for asdf.txt size:0x3000 returned:
0xb771e000
 1 file changed, 1 insertion(+)
 create mode 100644 asdf.txt

kumbayo@soybean:~/src/git_mmap_crash2$ valgrind git diff-tree -Ganything HEAD
==8388== Memcheck, a memory error detector
==8388== Copyright (C) 2002-2011, and GNU GPL'd, by Julian Seward et al.
==8388== Using Valgrind-3.7.0 and LibVEX; rerun with -h for copyright info
==8388== Command: git diff-tree -Ganything HEAD
==8388==
==8388== Conditional jump or move depends on uninitialised value(s)
==8388==    at 0x405ADD8: inflateReset2 (in /lib/i386-linux-gnu/libz.so.1.2.3.4)
==8388==    by 0xA0: ???
==8388==
==8388== Conditional jump or move depends on uninitialised value(s)
==8388==    at 0x405ADD8: inflateReset2 (in /lib/i386-linux-gnu/libz.so.1.2.3.4)
==8388==    by 0x7F: ???
==8388==


==8388== Conditional jump or move depends on uninitialised value(s)


==8388==    at 0x405ADD8: inflateReset2 (in /lib/i386-linux-gnu/libz.so.1.2.3.4)


==8388==    by 0x30: ???


==8388==


==8388== Conditional jump or move depends on uninitialised value(s)


==8388==    at 0x405ADD8: inflateReset2 (in /lib/i386-linux-gnu/libz.so.1.2.3.4)


==8388==    by 0x50: ???


==8388==


diffcore_pickaxe_grep


diff_populate_filespec -> xmmap for asdf.txt size:0x3000 returned: 0x4035000


==8388== Invalid read of size 1


==8388==    at 0x402C683: __GI_strlen (in
/usr/lib/valgrind/vgpreload_memcheck-x86-linux.so)


==8388==    by 0x430581F: regexec@@GLIBC_2.3.4 (regexec.c:245)


==8388==    by 0x814489D: diff_grep (diffcore-pickaxe.c:110)
==8388==    by 0x8144B89: pickaxe.constprop.6 (diffcore-pickaxe.c:40)
==8388==    by 0x8144DCD: diffcore_pickaxe_grep (diffcore-pickaxe.c:155)
==8388==    by 0x80DCE64: diffcore_std (diff.c:4638)
==8388==    by 0x80F0B20: log_tree_diff_flush (log-tree.c:696)
==8388==  Address 0x4038000 is not stack'd, malloc'd or (recently) free'd
==8388==
==8388==
==8388== Process terminating with default action of signal 11 (SIGSEGV)
==8388==  Access not within mapped region at address 0x4038000
==8388==    at 0x402C683: __GI_strlen (in /usr/lib/valgrind/vgpreload_memcheck-x86-linux.so)
==8388==    by 0x430581F: regexec@@GLIBC_2.3.4 (regexec.c:245)
==8388==    by 0x814489D: diff_grep (diffcore-pickaxe.c:110)
==8388==    by 0x8144B89: pickaxe.constprop.6 (diffcore-pickaxe.c:40)
==8388==    by 0x8144DCD: diffcore_pickaxe_grep (diffcore-pickaxe.c:155)
==8388==    by 0x80DCE64: diffcore_std (diff.c:4638)
==8388==    by 0x80F0B20: log_tree_diff_flush (log-tree.c:696)
==8388==  If you believe this happened as a result of a stack
==8388==  overflow in your program's main thread (unlikely but
==8388==  possible), you can try to increase the size of the
==8388==  main thread stack using the --main-stacksize= flag.
==8388==  The main thread stack size used in this run was 8388608.
==8388==
==8388== HEAP SUMMARY:
==8388==     in use at exit: 86,229 bytes in 69 blocks
==8388==   total heap usage: 193 allocs, 124 frees, 259,991 bytes allocated
==8388==
==8388== LEAK SUMMARY:
==8388==    definitely lost: 65 bytes in 1 blocks
==8388==    indirectly lost: 0 bytes in 0 blocks
==8388==      possibly lost: 0 bytes in 0 blocks
==8388==    still reachable: 86,164 bytes in 68 blocks
==8388==         suppressed: 0 bytes in 0 blocks
==8388== Rerun with --leak-check=full to see details of leaked memory
==8388==
==8388== For counts of detected and suppressed errors, rerun with: -v
==8388== Use --track-origins=yes to see where uninitialised values come from
==8388== ERROR SUMMARY: 7 errors from 5 contexts (suppressed: 0 from 0)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: crash on git diff-tree -Ganything <tree> for new files with textconv filter
  2012-10-29 20:19       ` Peter Oberndorfer
@ 2012-10-29 22:35         ` Jeff King
  2012-10-29 22:47           ` Jeff King
  2012-11-07 21:10           ` Peter Oberndorfer
  0 siblings, 2 replies; 24+ messages in thread
From: Jeff King @ 2012-10-29 22:35 UTC (permalink / raw)
  To: Peter Oberndorfer; +Cc: git, Junio C Hamano

On Mon, Oct 29, 2012 at 09:19:48PM +0100, Peter Oberndorfer wrote:

> I could reproduce with my 0x3000 bytes file on linux. The buffer is not
> read with a trailing null byte it is mapped by mmap in
> diff_populate_filespec...
> So i think we will not get away with expecting a trailing null :-/

Thanks for the reproduction recipe. I was testing with "git log", which
does not use the mmap optimization.

> For me the key to reproduce the problem was to have 2 commits.
> Adding the file in the root commit it did not work. [1]

You probably would need to pass "--root" for it to do the diff of the
initial commit.

The patch below fixes it, but it's terribly inefficient (it just detects
the situation and reallocates). It would be much better to disable the
reuse_worktree_file mmap when we populate the filespec, but it is too
late to pass an option; we may have already populated from an earlier
diffcore stage.

I guess if we teach the whole diff code that "-G" (and --pickaxe-regex)
is brittle, we can disable the optimization from the beginning based on
the diff options. I'll take a look.

diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
index b097fa7..88d1a8f 100644
--- a/diffcore-pickaxe.c
+++ b/diffcore-pickaxe.c
@@ -80,6 +80,29 @@ static void fill_one(struct diff_filespec *one,
 	if (DIFF_FILE_VALID(one)) {
 		*textconv = get_textconv(one);
 		mf->size = fill_textconv(*textconv, one, &mf->ptr);
+
+		/*
+		 * Horrible, horrible hack. If we are going to feed the result
+		 * to regexec, we must make sure it is NUL-terminated, but we
+		 * will not be if we have mmap'd a file and never munged it.
+		 *
+		 * We would do much better to turn off the reuse_worktree_file
+		 * optimization in the first place, which is the sole source of
+		 * these mmaps.
+		 */
+		if (one->should_munmap && !*textconv) { mf->ptr =
+			xmallocz(one->size); memcpy(mf->ptr, one->data,
+						    one->size);
+
+			/*
+			 * Attach the result to the filespec, which will
+			 * properly free it eventually.
+			 */
+			munmap(one->data, one->size);
+			one->should_munmap = 0;
+			one->data = mf->ptr;
+			one->should_free = 1;
+		}
 	} else {
 		memset(mf, 0, sizeof(*mf));
 	}

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: crash on git diff-tree -Ganything <tree> for new files with textconv filter
  2012-10-29 22:35         ` Jeff King
@ 2012-10-29 22:47           ` Jeff King
  2012-10-30 12:17             ` Jeff King
  2012-11-07 21:10           ` Peter Oberndorfer
  1 sibling, 1 reply; 24+ messages in thread
From: Jeff King @ 2012-10-29 22:47 UTC (permalink / raw)
  To: Peter Oberndorfer; +Cc: git, Junio C Hamano

On Mon, Oct 29, 2012 at 06:35:21PM -0400, Jeff King wrote:

> The patch below fixes it, but it's terribly inefficient (it just detects
> the situation and reallocates). It would be much better to disable the
> reuse_worktree_file mmap when we populate the filespec, but it is too
> late to pass an option; we may have already populated from an earlier
> diffcore stage.
> 
> I guess if we teach the whole diff code that "-G" (and --pickaxe-regex)
> is brittle, we can disable the optimization from the beginning based on
> the diff options. I'll take a look.

Hmm. That is problematic for two reasons.

  1. The whole diff call chain will have to be modified to pass the
     options around, so they can make it down to the
     diff_populate_filespec level. Alternatively, we could do some kind
     of global hack, which is ugly but would work OK in practice.

  2. Reusing a working tree file is only half of the reason a filespec
     might be mmap'd. It might also be because we are literally diffing
     the working tree. "-G" was meant to be used to limit log traversal,
     but it also works to reduce the diff output for something like "git
     diff HEAD^".

I really wish there were an alternate regexec interface we could use
that took a pointer/size pair. Bleh.

-Peff

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: crash on git diff-tree -Ganything <tree> for new files with textconv filter
  2012-10-29 22:47           ` Jeff King
@ 2012-10-30 12:17             ` Jeff King
  2012-10-30 12:46               ` Junio C Hamano
  2012-11-01 19:19               ` Ramsay Jones
  0 siblings, 2 replies; 24+ messages in thread
From: Jeff King @ 2012-10-30 12:17 UTC (permalink / raw)
  To: Peter Oberndorfer; +Cc: git, Junio C Hamano

On Mon, Oct 29, 2012 at 06:47:05PM -0400, Jeff King wrote:

> On Mon, Oct 29, 2012 at 06:35:21PM -0400, Jeff King wrote:
> 
> > The patch below fixes it, but it's terribly inefficient (it just detects
> > the situation and reallocates). It would be much better to disable the
> > reuse_worktree_file mmap when we populate the filespec, but it is too
> > late to pass an option; we may have already populated from an earlier
> > diffcore stage.
> > 
> > I guess if we teach the whole diff code that "-G" (and --pickaxe-regex)
> > is brittle, we can disable the optimization from the beginning based on
> > the diff options. I'll take a look.
> 
> Hmm. That is problematic for two reasons.
> 
>   1. The whole diff call chain will have to be modified to pass the
>      options around, so they can make it down to the
>      diff_populate_filespec level. Alternatively, we could do some kind
>      of global hack, which is ugly but would work OK in practice.
> 
>   2. Reusing a working tree file is only half of the reason a filespec
>      might be mmap'd. It might also be because we are literally diffing
>      the working tree. "-G" was meant to be used to limit log traversal,
>      but it also works to reduce the diff output for something like "git
>      diff HEAD^".
> 
> I really wish there were an alternate regexec interface we could use
> that took a pointer/size pair. Bleh.

Thinking on it more, my patch, hacky thought it seems, may not be the
worst solution. Here are the options that I see:

  1. Use a regex library that does not require NUL termination. If we
     are bound by the regular regexec interface, this is not feasible.
     But the GNU implementation works on arbitrary-length buffers (you
     just have to use a slightly different interface), and we already
     carry it in compat. It would mean platforms which provide a working
     but non-GNU regexec would have to start defining NO_REGEX.

  2. Figure out a way to get one extra zero byte via mmap. If the
     requested size does not fall on a page boundary, you get extra
     zero-ed bytes. Unfortunately, requesting an extra byte does not
     do what we want; you get SIGBUS accessing it.

  3. Copy mmap'd data at point-of-use into a NUL-terminated buffer. That
     way we only incur the cost when we need it.

  4. Avoid mmap-ing in the first place when we are using -G or
     --pickaxe-regex (e.g., by doing a big read()). At first glance,
     this sounds more efficient than loading the data one way and then
     making another copy. But mmap+memcpy, aside from the momentary
     doubled memory requirement, is probably just as fast or faster than
     calling read() repeatedly.

I am really tempted by (1).

Given that (2) does not work, unless somebody comes up with something
clever there, that would make (3) the next best choice.

-Peff

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: crash on git diff-tree -Ganything <tree> for new files with textconv filter
  2012-10-30 12:17             ` Jeff King
@ 2012-10-30 12:46               ` Junio C Hamano
  2012-10-30 13:12                 ` Jeff King
  2012-11-01 19:19               ` Ramsay Jones
  1 sibling, 1 reply; 24+ messages in thread
From: Junio C Hamano @ 2012-10-30 12:46 UTC (permalink / raw)
  To: Jeff King, Peter Oberndorfer; +Cc: git

(1) sounds attractive for more than one reason. In addition to avoidance of this issue, it would bring bug-to-bug compatibility across platforms.

(4), if we can run grep on streaming data (tweak interface we have for checking out a large blob to the working tree), would let us work on dataset larger than fit in core. Even though it would be much more work, it might turn out to be a better option in the longer run.

Jeff King <peff@peff.net> wrote:

>On Mon, Oct 29, 2012 at 06:47:05PM -0400, Jeff King wrote:
>
>> On Mon, Oct 29, 2012 at 06:35:21PM -0400, Jeff King wrote:
>> 
>> > The patch below fixes it, but it's terribly inefficient (it just
>detects
>> > the situation and reallocates). It would be much better to disable
>the
>> > reuse_worktree_file mmap when we populate the filespec, but it is
>too
>> > late to pass an option; we may have already populated from an
>earlier
>> > diffcore stage.
>> > 
>> > I guess if we teach the whole diff code that "-G" (and
>--pickaxe-regex)
>> > is brittle, we can disable the optimization from the beginning
>based on
>> > the diff options. I'll take a look.
>> 
>> Hmm. That is problematic for two reasons.
>> 
>>   1. The whole diff call chain will have to be modified to pass the
>>      options around, so they can make it down to the
>>      diff_populate_filespec level. Alternatively, we could do some
>kind
>>      of global hack, which is ugly but would work OK in practice.
>> 
>>   2. Reusing a working tree file is only half of the reason a
>filespec
>>      might be mmap'd. It might also be because we are literally
>diffing
>>      the working tree. "-G" was meant to be used to limit log
>traversal,
>>      but it also works to reduce the diff output for something like
>"git
>>      diff HEAD^".
>> 
>> I really wish there were an alternate regexec interface we could use
>> that took a pointer/size pair. Bleh.
>
>Thinking on it more, my patch, hacky thought it seems, may not be the
>worst solution. Here are the options that I see:
>
>  1. Use a regex library that does not require NUL termination. If we
>     are bound by the regular regexec interface, this is not feasible.
>     But the GNU implementation works on arbitrary-length buffers (you
>     just have to use a slightly different interface), and we already
>    carry it in compat. It would mean platforms which provide a working
>     but non-GNU regexec would have to start defining NO_REGEX.
>
>  2. Figure out a way to get one extra zero byte via mmap. If the
>     requested size does not fall on a page boundary, you get extra
>     zero-ed bytes. Unfortunately, requesting an extra byte does not
>     do what we want; you get SIGBUS accessing it.
>
> 3. Copy mmap'd data at point-of-use into a NUL-terminated buffer. That
>     way we only incur the cost when we need it.
>
>  4. Avoid mmap-ing in the first place when we are using -G or
>     --pickaxe-regex (e.g., by doing a big read()). At first glance,
>     this sounds more efficient than loading the data one way and then
>     making another copy. But mmap+memcpy, aside from the momentary
>    doubled memory requirement, is probably just as fast or faster than
>     calling read() repeatedly.
>
>I am really tempted by (1).
>
>Given that (2) does not work, unless somebody comes up with something
>clever there, that would make (3) the next best choice.
>
>-Peff

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: crash on git diff-tree -Ganything <tree> for new files with textconv filter
  2012-10-30 12:46               ` Junio C Hamano
@ 2012-10-30 13:12                 ` Jeff King
  0 siblings, 0 replies; 24+ messages in thread
From: Jeff King @ 2012-10-30 13:12 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Peter Oberndorfer, git

On Tue, Oct 30, 2012 at 09:46:01PM +0900, Junio C Hamano wrote:

> (1) sounds attractive for more than one reason. In addition to
> avoidance of this issue, it would bring bug-to-bug compatibility
> across platforms.

Yeah. I mentioned breaking the build for people who would now need to
turn on NO_REGEX. But the only reason to do that is to let people on
glibc systems use the system version of the tools. A much saner approach
would be to just always build with our compat regex, and turn NO_REGEX
into a no-op. We already do the same thing for kwset.

> (4), if we can run grep on streaming data (tweak interface we have for
> checking out a large blob to the working tree), would let us work on
> dataset larger than fit in core. Even though it would be much more
> work, it might turn out to be a better option in the longer run.

Agreed, that would be nice. It's potentially a lot of work, but we could
probably get by with a special streaming version of diff_populate_filespec.

The tricky thing is that we have to run the regex matcher progressively
as we stream data in (since your match might fall in the middle of a
read boundary). Which is certainly going to require switching off of
regular regexec. I don't think glibc regex will handle it either,
though. It looks like pcre can report a partial match at the end of the
string, and you can either continue with the next chunk (if using
pcre_dfa) or append and re-start the pattern match (for regular
pcre_exec).

Which means we'd probably have to make streaming matches an optional
feature, and still do (1) first to fix the correctness issue.

-Peff

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: crash on git diff-tree -Ganything <tree> for new files with textconv filter
  2012-10-30 12:17             ` Jeff King
  2012-10-30 12:46               ` Junio C Hamano
@ 2012-11-01 19:19               ` Ramsay Jones
  1 sibling, 0 replies; 24+ messages in thread
From: Ramsay Jones @ 2012-11-01 19:19 UTC (permalink / raw)
  To: Jeff King; +Cc: Peter Oberndorfer, git, Junio C Hamano

Jeff King wrote:
> Thinking on it more, my patch, hacky thought it seems, may not be the
> worst solution. Here are the options that I see:
> 
>   1. Use a regex library that does not require NUL termination. If we
>      are bound by the regular regexec interface, this is not feasible.
>      But the GNU implementation works on arbitrary-length buffers (you
>      just have to use a slightly different interface), and we already
>      carry it in compat. It would mean platforms which provide a working
>      but non-GNU regexec would have to start defining NO_REGEX.

I have thought about the possibility of doing this for unrelated reasons
in the past.

On cygwin, there have been two unexpected test passes since about v1.6.0
(I reported it to the list in passing), like so:

    [ ... ]
    All tests successful.

    Test Summary Report
    -------------------
    t0050-filesystem.sh                              (Wstat: 0 Tests: 9 Failed: 0)
      TODO passed:   5
    t7008-grep-binary.sh                             (Wstat: 0 Tests: 20 Failed: 0)
      TODO passed:   12
    Files=604, Tests=8439, 11190 wallclock secs ( 2.59 usr  1.59 sys + 7294.86 cusr
    3416.65 csys = 10715.70 CPU)
    Result: PASS

In particular, t7008.12 passes on cygwin because the regex library apparently
matches '.' to NUL. Indeed if you add a test_pause to the script and execute
"grep .fi a" (note grep *not* git-grep) then "Binary file a matches" on Linux,
cygwin and MinGW. (So I assume the test was added to document a difference in
behaviour to GNU grep).

So, if we use the GNU interface to the regex routines in compat, then we may
specify the "grep syntax" for use in git-grep. (Well that's the theory, I've
not actually tried to code it up, so take this with a pinch of salt! :-P ).

ATB,
Ramsay Jones

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: crash on git diff-tree -Ganything <tree> for new files with textconv filter
  2012-10-29 22:35         ` Jeff King
  2012-10-29 22:47           ` Jeff King
@ 2012-11-07 21:10           ` Peter Oberndorfer
  2012-11-07 21:13             ` Jeff King
  1 sibling, 1 reply; 24+ messages in thread
From: Peter Oberndorfer @ 2012-11-07 21:10 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Junio C Hamano

On 2012-10-29 23:35, Jeff King wrote:
> On Mon, Oct 29, 2012 at 09:19:48PM +0100, Peter Oberndorfer wrote:
>
>> I could reproduce with my 0x3000 bytes file on linux. The buffer is not
>> read with a trailing null byte it is mapped by mmap in
>> diff_populate_filespec...
>> So i think we will not get away with expecting a trailing null :-/
> Thanks for the reproduction recipe. I was testing with "git log", which
> does not use the mmap optimization.
>
>> For me the key to reproduce the problem was to have 2 commits.
>> Adding the file in the root commit it did not work. [1]
> You probably would need to pass "--root" for it to do the diff of the
> initial commit.
>
> The patch below fixes it, but it's terribly inefficient (it just detects
> the situation and reallocates). It would be much better to disable the
> reuse_worktree_file mmap when we populate the filespec, but it is too
> late to pass an option; we may have already populated from an earlier
> diffcore stage.
Hi,
I tested your patch, and i can confirm it fixes the problem for me.
(also on my real world test in msysgit)

Again, thanks a lot!
Greetings Peter

> I guess if we teach the whole diff code that "-G" (and --pickaxe-regex)
> is brittle, we can disable the optimization from the beginning based on
> the diff options. I'll take a look.
>
> <snip patch>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: crash on git diff-tree -Ganything <tree> for new files with textconv filter
  2012-11-07 21:10           ` Peter Oberndorfer
@ 2012-11-07 21:13             ` Jeff King
  2013-06-03 17:25               ` Peter Oberndorfer
  0 siblings, 1 reply; 24+ messages in thread
From: Jeff King @ 2012-11-07 21:13 UTC (permalink / raw)
  To: Peter Oberndorfer; +Cc: git, Junio C Hamano

On Wed, Nov 07, 2012 at 10:10:59PM +0100, Peter Oberndorfer wrote:

> >> For me the key to reproduce the problem was to have 2 commits.
> >> Adding the file in the root commit it did not work. [1]
> > You probably would need to pass "--root" for it to do the diff of the
> > initial commit.
> >
> > The patch below fixes it, but it's terribly inefficient (it just detects
> > the situation and reallocates). It would be much better to disable the
> > reuse_worktree_file mmap when we populate the filespec, but it is too
> > late to pass an option; we may have already populated from an earlier
> > diffcore stage.
> Hi,
> I tested your patch, and i can confirm it fixes the problem for me.
> (also on my real world test in msysgit)

Thanks for the report. I'd still like to pursue using a regex library
that does not require NUL-termination, but I've been distracted by other
things. I'm going to hold back my copy-to-a-NUL-buffer patch for now and
see if I can get to the regex thing this week.

-Peff

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/2] pickaxe: use textconv for -S counting
  2012-10-28 12:47     ` [PATCH 2/2] pickaxe: use textconv for -S counting Jeff King
@ 2012-11-13 23:13       ` Junio C Hamano
  2012-11-15  1:21         ` Jeff King
  0 siblings, 1 reply; 24+ messages in thread
From: Junio C Hamano @ 2012-11-13 23:13 UTC (permalink / raw)
  To: Jeff King; +Cc: Peter Oberndorfer, git

Jeff King <peff@peff.net> writes:

> We currently just look at raw blob data when using "-S" to
> pickaxe. This is mostly historical, as pickaxe predates the
> textconv feature. If the user has bothered to define a
> textconv filter, it is more likely that their search string will be
> on the textconv output, as that is what they will see in the
> diff (and we do not even provide a mechanism for them to
> search for binary needles that contain NUL characters).

Oookay, I suppose...

>  static int has_changes(struct diff_filepair *p, struct diff_options *o,
>  		       regex_t *regexp, kwset_t kws)
>  {
> +	struct userdiff_driver *textconv_one = get_textconv(p->one);
> +	struct userdiff_driver *textconv_two = get_textconv(p->two);
> +	mmfile_t mf1, mf2;
> +	int ret;
> +
>  	if (!o->pickaxe[0])
>  		return 0;
>  
> -	if (!DIFF_FILE_VALID(p->one)) {
> -		if (!DIFF_FILE_VALID(p->two))
> -			return 0; /* ignore unmerged */

What happened to this part that avoids showing nonsense for unmerged
paths?

> +	/*
> +	 * If we have an unmodified pair, we know that the count will be the
> +	 * same and don't even have to load the blobs. Unless textconv is in
> +	 * play, _and_ we are using two different textconv filters (e.g.,
> +	 * because a pair is an exact rename with different textconv attributes
> +	 * for each side, which might generate different content).
> +	 */
> +	if (textconv_one == textconv_two && diff_unmodified_pair(p))
> +		return 0;

I am not sure about this part that cares about the textconv.

Wouldn't the normal "git diff A B" skip the filepair that are
unmodified in the first place at the object name level without even
looking at the contents (see e.g. diff_flush_patch())?

Shouldn't this part of the code emulating that behaviour no matter
what textconv filter(s) are configured for these paths?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/2] pickaxe: use textconv for -S counting
  2012-11-13 23:13       ` Junio C Hamano
@ 2012-11-15  1:21         ` Jeff King
  2012-11-20  0:31           ` Junio C Hamano
  0 siblings, 1 reply; 24+ messages in thread
From: Jeff King @ 2012-11-15  1:21 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Peter Oberndorfer, git

On Tue, Nov 13, 2012 at 03:13:19PM -0800, Junio C Hamano wrote:

> >  static int has_changes(struct diff_filepair *p, struct diff_options *o,
> >  		       regex_t *regexp, kwset_t kws)
> >  {
> > +	struct userdiff_driver *textconv_one = get_textconv(p->one);
> > +	struct userdiff_driver *textconv_two = get_textconv(p->two);
> > +	mmfile_t mf1, mf2;
> > +	int ret;
> > +
> >  	if (!o->pickaxe[0])
> >  		return 0;
> >  
> > -	if (!DIFF_FILE_VALID(p->one)) {
> > -		if (!DIFF_FILE_VALID(p->two))
> > -			return 0; /* ignore unmerged */
> 
> What happened to this part that avoids showing nonsense for unmerged
> paths?

It's moved down. fill_one will return an empty mmfile if
!DIFF_FILE_VALID, so we end up here:

        fill_one(p->one, &mf1, &textconv_one);
        fill_one(p->two, &mf2, &textconv_two);

        if (!mf1.ptr) {
                if (!mf2.ptr)
                        ret = 0; /* ignore unmerged */

Prior to this change, we didn't use fill_one, so we had to check manually.

> > +	/*
> > +	 * If we have an unmodified pair, we know that the count will be the
> > +	 * same and don't even have to load the blobs. Unless textconv is in
> > +	 * play, _and_ we are using two different textconv filters (e.g.,
> > +	 * because a pair is an exact rename with different textconv attributes
> > +	 * for each side, which might generate different content).
> > +	 */
> > +	if (textconv_one == textconv_two && diff_unmodified_pair(p))
> > +		return 0;
> 
> I am not sure about this part that cares about the textconv.
> 
> Wouldn't the normal "git diff A B" skip the filepair that are
> unmodified in the first place at the object name level without even
> looking at the contents (see e.g. diff_flush_patch())?

Hmph. The point was to find the case when the paths are different (e.g.,
in a rename), and therefore the textconvs might be different. But I
think I missed the fact that diff_unmodified_pair will note the
difference in paths. So just calling diff_unmodified_pair would be
sufficient, as the code prior to my patch does.

I thought the point was an optimization to avoid comparing contains() on
the same data (which we can know will match without looking at it).
Exact renames are the obvious one, but they are not handled here. So I
am not sure of the point (to catch "git diff $blob1 $blob2" when the two
are identical? I am not sure at what layer we cull that from the diff
queue).

So there is room for optimization here on exact renames, but
diff_unmodified_pair is too forgiving of what is interesting (a rename
is interesting to diff_flush_patch, because it wants to mention the
rename, but it is not interesting to pickaxe, because we did not change
the content, and it could be culled here).

I don't know that it is that big a deal in general. Pure renames are
going to be the minority of blobs we look at, so it is probably not even
measurable. You could construct a pathological case (e.g., an otherwise
small repo with a 2G file, rename the 2G file without modification, then
running "git log -Sfoo" will unnecessarily load the giant blob while
examining the rename commit).

> Shouldn't this part of the code emulating that behaviour no matter
> what textconv filter(s) are configured for these paths?

Yeah, I just missed that it is checking the path already. It may still
make sense to tighten the optimization, but that is a separate issue. It
should just check diff_unmodified_pair as before; textconv only matters
if you are trying to optimize out exact renames.

-Peff

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/2] pickaxe: use textconv for -S counting
  2012-11-15  1:21         ` Jeff King
@ 2012-11-20  0:31           ` Junio C Hamano
  2012-11-20  0:48             ` Junio C Hamano
  0 siblings, 1 reply; 24+ messages in thread
From: Junio C Hamano @ 2012-11-20  0:31 UTC (permalink / raw)
  To: Jeff King; +Cc: Peter Oberndorfer, git

Jeff King <peff@peff.net> writes:

> On Tue, Nov 13, 2012 at 03:13:19PM -0800, Junio C Hamano wrote:
>
>> >  static int has_changes(struct diff_filepair *p, struct diff_options *o,
>> >  		       regex_t *regexp, kwset_t kws)
>> >  {
>> > +	struct userdiff_driver *textconv_one = get_textconv(p->one);
>> > +	struct userdiff_driver *textconv_two = get_textconv(p->two);
>> > +	mmfile_t mf1, mf2;
>> > +	int ret;
>> > +
>> >  	if (!o->pickaxe[0])
>> >  		return 0;
>> >  
>> > -	if (!DIFF_FILE_VALID(p->one)) {
>> > -		if (!DIFF_FILE_VALID(p->two))
>> > -			return 0; /* ignore unmerged */
>> 
>> What happened to this part that avoids showing nonsense for unmerged
>> paths?
>
> It's moved down. fill_one will return an empty mmfile if
> !DIFF_FILE_VALID, so we end up here:
>
>         fill_one(p->one, &mf1, &textconv_one);
>         fill_one(p->two, &mf2, &textconv_two);
>
>         if (!mf1.ptr) {
>                 if (!mf2.ptr)
>                         ret = 0; /* ignore unmerged */
>
> Prior to this change, we didn't use fill_one, so we had to check manually.
>
>> > +	/*
>> > +	 * If we have an unmodified pair, we know that the count will be the
>> > +	 * same and don't even have to load the blobs. Unless textconv is in
>> > +	 * play, _and_ we are using two different textconv filters (e.g.,
>> > +	 * because a pair is an exact rename with different textconv attributes
>> > +	 * for each side, which might generate different content).
>> > +	 */
>> > +	if (textconv_one == textconv_two && diff_unmodified_pair(p))
>> > +		return 0;
>> 
>> I am not sure about this part that cares about the textconv.
>> 
>> Wouldn't the normal "git diff A B" skip the filepair that are
>> unmodified in the first place at the object name level without even
>> looking at the contents (see e.g. diff_flush_patch())?
>
> Hmph. The point was to find the case when the paths are different (e.g.,
> in a rename), and therefore the textconvs might be different. But I
> think I missed the fact that diff_unmodified_pair will note the
> difference in paths. So just calling diff_unmodified_pair would be
> sufficient, as the code prior to my patch does.
>
> I thought the point was an optimization to avoid comparing contains() on
> the same data (which we can know will match without looking at it).

Yes.

> Exact renames are the obvious one, but they are not handled here.

That is half true.  Before this change, we will find the same number
of needles and this function would have said "no differences" in a
very inefficient way.  After this change, we may apply different
textconv filters and this function will say "there is a difference",
even though we wouldn't see such a difference at the content level
if there wasn't any rename.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/2] pickaxe: use textconv for -S counting
  2012-11-20  0:31           ` Junio C Hamano
@ 2012-11-20  0:48             ` Junio C Hamano
  2012-11-21 20:27               ` Jeff King
  0 siblings, 1 reply; 24+ messages in thread
From: Junio C Hamano @ 2012-11-20  0:48 UTC (permalink / raw)
  To: Jeff King; +Cc: Peter Oberndorfer, git

Junio C Hamano <gitster@pobox.com> writes:

>> Exact renames are the obvious one, but they are not handled here.
>
> That is half true.  Before this change, we will find the same number
> of needles and this function would have said "no differences" in a
> very inefficient way.  After this change, we may apply different
> textconv filters and this function will say "there is a difference",
> even though we wouldn't see such a difference at the content level
> if there wasn't any rename.

... but I think that is a good thing anyway.

If you renamed foo.c to foo.cc with different conversions from C
code to the text that explain what the code does, if we special case
only the exact rename case but let pickaxe examine the converted
result in a case where blobs are modified only by one byte, we would
get drastically different results between the two cases.

Corollary to this is what should happen when you update the attributes
between two trees so that textconv for a path that did not change
between preimage and postimage are different.  Ideally, we should
notice that the two converted result are different, perhaps, but I
do not like the performance implications very much.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/2] pickaxe: use textconv for -S counting
  2012-11-20  0:48             ` Junio C Hamano
@ 2012-11-21 20:27               ` Jeff King
  0 siblings, 0 replies; 24+ messages in thread
From: Jeff King @ 2012-11-21 20:27 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Peter Oberndorfer, git

On Mon, Nov 19, 2012 at 04:48:22PM -0800, Junio C Hamano wrote:

> Junio C Hamano <gitster@pobox.com> writes:
> 
> >> Exact renames are the obvious one, but they are not handled here.
> >
> > That is half true.  Before this change, we will find the same number
> > of needles and this function would have said "no differences" in a
> > very inefficient way.  After this change, we may apply different
> > textconv filters and this function will say "there is a difference",
> > even though we wouldn't see such a difference at the content level
> > if there wasn't any rename.
> 
> ... but I think that is a good thing anyway.
> 
> If you renamed foo.c to foo.cc with different conversions from C
> code to the text that explain what the code does, if we special case
> only the exact rename case but let pickaxe examine the converted
> result in a case where blobs are modified only by one byte, we would
> get drastically different results between the two cases.

Right, exactly. I think the only sane thing is to always textconv or
always not textconv (whether they are identical renames or not), and any
"these are the same" optimization for identical content needs to take
into account whether we _would have_ done a different textconv (which
most of the time is going to be "no", as textconv is either not in use,
or both paths use the same diff driver; but it is not too expensive to
look up).

The diff_unmodified_pair at the top off diff_flush_patch is correct,
because it treats renames as interesting (because we have to show the
diff header, anyway). I do not know offhand if we avoid feeding
identical content to xdiff at all, but if so, we should be doing so only
after checking that the textconv filters are identical.

> Corollary to this is what should happen when you update the attributes
> between two trees so that textconv for a path that did not change
> between preimage and postimage are different.  Ideally, we should
> notice that the two converted result are different, perhaps, but I
> do not like the performance implications very much.

The content to compare cannot be different unless either the input
content changed or the path changed, and we treat either as
"interesting" in most code paths. So I do not think there are any
performance implications, except that we may need to make sure to look
up textconvs a few lines sooner in some cases.

I'll re-roll the series next week and break out the rename-optimization
bits separately so it is more obvious that it is doing the right thing.

As an aside, I also need to revisit the regex half of that code, which
is still buggy (before and after my patch, due to the expecting-a-NUL
behavior we talked about a week or two ago).  That is a separate topic,
but the same area of code.

-Peff

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: crash on git diff-tree -Ganything <tree> for new files with textconv filter
  2012-11-07 21:13             ` Jeff King
@ 2013-06-03 17:25               ` Peter Oberndorfer
  2013-06-03 22:17                 ` Jeff King
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Oberndorfer @ 2013-06-03 17:25 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Junio C Hamano

On 2012-11-07 22:13, Jeff King wrote:
> On Wed, Nov 07, 2012 at 10:10:59PM +0100, Peter Oberndorfer wrote:
>
>>>> For me the key to reproduce the problem was to have 2 commits.
>>>> Adding the file in the root commit it did not work. [1]
>>> You probably would need to pass "--root" for it to do the diff of the
>>> initial commit.
>>>
>>> The patch below fixes it, but it's terribly inefficient (it just detects
>>> the situation and reallocates). It would be much better to disable the
>>> reuse_worktree_file mmap when we populate the filespec, but it is too
>>> late to pass an option; we may have already populated from an earlier
>>> diffcore stage.
>> Hi,
>> I tested your patch, and i can confirm it fixes the problem for me.
>> (also on my real world test in msysgit)
> Thanks for the report. I'd still like to pursue using a regex library
> that does not require NUL-termination, but I've been distracted by other
> things. I'm going to hold back my copy-to-a-NUL-buffer patch for now and
> see if I can get to the regex thing this week.
>
Hi,

are there any news regarding this problem?
The crash seems to still exist in the current version 1.8.3 and master.

Thanks,
Greetings Peter

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: crash on git diff-tree -Ganything <tree> for new files with textconv filter
  2013-06-03 17:25               ` Peter Oberndorfer
@ 2013-06-03 22:17                 ` Jeff King
  0 siblings, 0 replies; 24+ messages in thread
From: Jeff King @ 2013-06-03 22:17 UTC (permalink / raw)
  To: Peter Oberndorfer; +Cc: git, Junio C Hamano

On Mon, Jun 03, 2013 at 07:25:06PM +0200, Peter Oberndorfer wrote:

> > Thanks for the report. I'd still like to pursue using a regex library
> > that does not require NUL-termination, but I've been distracted by other
> > things. I'm going to hold back my copy-to-a-NUL-buffer patch for now and
> > see if I can get to the regex thing this week.
> >
> are there any news regarding this problem?
> The crash seems to still exist in the current version 1.8.3 and master.

Sorry, no, this got dropped due to lack of time. I _think_ it is as
simple as just tweaking the Makefile to unconditionally build against
the compat/ regex library, and then tweaking callsites as appropriate to
use the GNU-specific interface that takes buf/len instead of a
NUL-terminated string.

But there may be some hidden complexities to it.

-Peff

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2013-06-03 22:17 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-10-27 18:37 crash on git diff-tree -Ganything <tree> for new files with textconv filter Peter Oberndorfer
2012-10-28 12:01 ` Jeff King
2012-10-28 12:45   ` [PATCH 0/2] textconv support for "log -S" Jeff King
2012-10-28 12:46     ` [PATCH 1/2] pickaxe: hoist empty needle check Jeff King
2012-10-28 12:47     ` [PATCH 2/2] pickaxe: use textconv for -S counting Jeff King
2012-11-13 23:13       ` Junio C Hamano
2012-11-15  1:21         ` Jeff King
2012-11-20  0:31           ` Junio C Hamano
2012-11-20  0:48             ` Junio C Hamano
2012-11-21 20:27               ` Jeff King
2012-10-28 19:56   ` crash on git diff-tree -Ganything <tree> for new files with textconv filter Peter Oberndorfer
2012-10-29  6:05     ` Jeff King
2012-10-29  6:18       ` Jeff King
2012-10-29 20:19       ` Peter Oberndorfer
2012-10-29 22:35         ` Jeff King
2012-10-29 22:47           ` Jeff King
2012-10-30 12:17             ` Jeff King
2012-10-30 12:46               ` Junio C Hamano
2012-10-30 13:12                 ` Jeff King
2012-11-01 19:19               ` Ramsay Jones
2012-11-07 21:10           ` Peter Oberndorfer
2012-11-07 21:13             ` Jeff King
2013-06-03 17:25               ` Peter Oberndorfer
2013-06-03 22:17                 ` Jeff King

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).