git@vger.kernel.org mailing list mirror (one of many)
From: Johannes Schindelin <Johannes.Schindelin@gmx.de>
To: Jeff King <peff@peff.net>
Cc: git@vger.kernel.org, Junio C Hamano <gitster@pobox.com>
Subject: Re: [PATCH 2/3] diff_populate_filespec: NUL-terminate buffers
Date: Tue, 6 Sep 2016 18:02:59 +0200 (CEST)
Message-ID: <alpine.DEB.2.20.1609061613270.129229@virtualbox>
In-Reply-To: <20160906070604.i5rojh3kyc7x7kso@sigill.intra.peff.net>

Hi Peff,

On Tue, 6 Sep 2016, Jeff King wrote:

> On Mon, Sep 05, 2016 at 05:45:06PM +0200, Johannes Schindelin wrote:
> 
> > It is true that many code paths populate the mmfile_t structure
> > silently appending a NUL, e.g. when running textconv on a temporary
> > file and reading the results back into an strbuf.
> > 
> > The assumption is most definitely wrong, however, when mmap()ing a
> > file.
> > 
> > Practically, we seemed to be lucky that the bytes after mmap()ed
> > memory were 1) accessible and 2) somehow contained NUL bytes
> > *somewhere*.
> > 
> > In a use case reported by Chris Sidi, it turned out that the mmap()ed
> > file had the precise size of a memory page, and on Windows the bytes
> > after memory-mapped pages are in general not valid.
> > 
> > This patch works around that issue, giving us time to discuss the best
> > course how to fix this problem more generally.
> 
> I don't know if we are in that much of a rush.

I am ;-)

> This bug has been around for many years (the thread I linked earlier is
> from 2012). Yes, it's bad and annoying, but we can probably spend a few
> days discussing the solution.

Sure we can. But I have to have a solution because of a recent switch from
storing LF to storing CR/LF in the repository (which resulted in a
noticeable performance improvement): combined with -G being an integral
part of the workflow in the project that reported the issue, it is
essential that this bug gets fixed before I go mostly offline.

> > diff --git a/diff.c b/diff.c
> > index 534c12e..32f7f46 100644
> > --- a/diff.c
> > +++ b/diff.c
> > @@ -2826,6 +2826,15 @@ int diff_populate_filespec(struct diff_filespec *s, unsigned int flags)
> >  			s->data = strbuf_detach(&buf, &size);
> >  			s->size = size;
> >  			s->should_free = 1;
> > +		} else {
> > +			/* data must be NUL-terminated, e.g. for regexec() */
> > +			char *data = xmalloc(s->size + 1);
> > +			memcpy(data, s->data, s->size);
> > +			data[s->size] = '\0';
> > +			munmap(s->data, s->size);
> > +			s->should_munmap = 0;
> > +			s->data = data;
> > +			s->should_free = 1;
> >  		}
> 
> Without having done a complete audit recently, my gut and my
> recollection from previous discussions is that regexec() really is the
> culprit here for the diff code[1]. If we are going to do a workaround
> like this, I think we could limit it only to cases where we know it
> matters, like --pickaxe-regex.

Sure.

We could introduce a new NEEDS_NUL flag.

It will still be quite tricky, because we have to touch a function that is
rather at the bottom of the food chain: diff_populate_filespec() is called
from fill_textconv(), which in turn is called from pickaxe_match(), and
only pickaxe_match() knows whether we want to call regexec() or not (it
depends on its regexp parameter).

Adding a flag to diff_populate_filespec() sounds really reasonable until
you see how many call sites fill_textconv() has.

See below for a better idea.

> Can it be triggered with -G?

It can, and it is, as demonstrated by the test I introduced in 1/3.

> I thought that operated on the diff content itself, which would always
> be in a heap buffer (which should be NUL terminated, but if it isn't,
> that would be a separate fix from this).

That is true.

Except when preimage or postimage does not exist. In which case we call

	regexec(regexp, two->ptr, 1, &regmatch, 0);

or the same with one->ptr. Note the notable absence of two->size.
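
To illustrate why the missing size matters, here is a minimal sketch (not
part of any patch in this series; the function names match_cstring() and
match_sized() are made up for illustration). Plain POSIX regexec() only
gets a pointer, so it keeps scanning until it hits a NUL, no matter how
many bytes the caller considers valid. The second function shows the
copy-and-terminate idea that patch 2/3 applies to the mmap()ed data:

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/*
 * Returns 1 if `re` matches anywhere in the C string at `ptr`.
 * regexec() is given no length, so it reads until it finds a NUL.
 */
int match_cstring(const regex_t *re, const char *ptr)
{
	regmatch_t m;

	assert(re && ptr);
	return !regexec(re, ptr, 1, &m, 0);
}

/*
 * Returns 1 if `re` matches within the first `size` bytes at `ptr`,
 * 0 if it does not, and -1 if the temporary copy cannot be allocated.
 * The bytes are copied into a NUL-terminated heap buffer first.
 */
int match_sized(const regex_t *re, const char *ptr, size_t size)
{
	regmatch_t m;
	char *copy;
	int hit;

	assert(re && ptr);
	copy = malloc(size + 1);
	if (!copy)
		return -1;
	memcpy(copy, ptr, size);
	copy[size] = '\0';
	hit = !regexec(re, copy, 1, &m, 0);
	free(copy);
	return hit;
}
```

With the text "hayneedleXYZ" and a logical size of 9, a search for "XYZ"
succeeds through match_cstring() (it reads past the nine valid bytes) but
not through match_sized(). On mmap()ed data the bytes past the end may not
even be readable, which is the crash reported here.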

> [1] We do make the assumption elsewhere that git objects are
>     NUL-terminated, but that is enforced by the object-reading code
>     (with the exception of streamed blobs, but those are obviously dealt
>     with separately anyway).

I know. I am the reason you introduced that, because I added code to
fsck.c that assumes that tag/commit messages are NUL-terminated.

So now for the better idea.

While I was researching the code for this reply, I hit upon one thing that
I never knew existed, introduced in f96e567 (grep: use REG_STARTEND for
all matching if available, 2010-05-22). Apparently, NetBSD introduced an
extension to regexec() where you can specify buffer boundaries using
REG_STARTEND. Which is pretty much what we need.
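
For readers who have not met REG_STARTEND either, a rough stand-alone
sketch of the idea follows. The helper name bounded_regexec() is
hypothetical and is not what the patch below uses. It assumes that, where
REG_STARTEND is defined, regexec() reads the search bounds from pmatch[0]
on input; where it is not defined, the sketch falls back to searching a
NUL-terminated copy:

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: run regexec() on the first `size` bytes of `buf`,
 * which need not be NUL-terminated. Returns 0 on a match, like regexec(),
 * and stores the match offsets (relative to `buf`) in *match.
 */
int bounded_regexec(const regex_t *re, const char *buf, size_t size,
		    regmatch_t *match, int eflags)
{
	assert(re && buf && match);
#ifdef REG_STARTEND
	/* On input, match[0] delimits the region that may be searched. */
	match->rm_so = 0;
	match->rm_eo = (regoff_t)size;
	return regexec(re, buf, 1, match, eflags | REG_STARTEND);
#else
	/* Portable fallback: search a NUL-terminated copy instead. */
	char *copy = malloc(size + 1);
	int ret;

	if (!copy)
		return REG_ESPACE;
	memcpy(copy, buf, size);
	copy[size] = '\0';
	ret = regexec(re, copy, 1, match, eflags);
	free(copy);
	return ret;
#endif
}
```

The attraction over the copying workaround is that the common case needs
neither an allocation nor a write to the caller's buffer.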

So I have this as my current proof-of-concept (which passes the test
suite, but is white-space corrupted, because I really have no time to get
non-white-space-corrupted text into this here mailer):

-- snipsnap --
diff --git a/diff.c b/diff.c
index 534c12e..2c5a360 100644
--- a/diff.c
+++ b/diff.c
@@ -951,7 +951,13 @@ static int find_word_boundaries(mmfile_t *buffer, regex_t *word_regex,
 {
 	if (word_regex && *begin < buffer->size) {
 		regmatch_t match[1];
-		if (!regexec(word_regex, buffer->ptr + *begin, 1, match, 0)) {
+		int f = 0;
+#ifdef REG_STARTEND
+		match[0].rm_so = 0;
+		match[0].rm_eo = *end - *begin;
+		f = REG_STARTEND;
+#endif
+		if (!regexec(word_regex, buffer->ptr + *begin, 1, match, f)) {
 			char *p = memchr(buffer->ptr + *begin + match[0].rm_so,
 					'\n', match[0].rm_eo - match[0].rm_so);
 			*end = p ? p - buffer->ptr : match[0].rm_eo + *begin;
@@ -994,7 +1000,7 @@ static void diff_words_fill(struct diff_words_buffer *buffer, mmfile_t *out,
 	buffer->orig[0].begin = buffer->orig[0].end = buffer->text.ptr;
 	buffer->orig_nr = 1;
 
-	for (i = 0; i < buffer->text.size; i++) {
+	for (i = 0, j = buffer->text.size; i < buffer->text.size; i++) {
 		if (find_word_boundaries(&buffer->text, word_regex, &i, &j))
 			return;
 
diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
index 55067ca..2cd09e2 100644
--- a/diffcore-pickaxe.c
+++ b/diffcore-pickaxe.c
@@ -23,7 +23,9 @@ static void diffgrep_consume(void *priv, char *line, unsigned long len)
 {
 	struct diffgrep_cb *data = priv;
 	regmatch_t regmatch;
+#ifndef REG_STARTEND
 	int hold;
+#endif
 
 	if (line[0] != '+' && line[0] != '-')
 		return;
@@ -33,11 +35,18 @@ static void diffgrep_consume(void *priv, char *line, unsigned long len)
 		 * caller early.
 		 */
 		return;
+#ifdef REG_STARTEND
+	regmatch.rm_so = 0;
+	regmatch.rm_eo = len - 1; /* matching starts at line + 1 */
+	data->hit = !regexec(data->regexp, line + 1, 1,
+			     &regmatch, REG_STARTEND);
+#else
 	/* Yuck -- line ought to be "const char *"! */
 	hold = line[len];
 	line[len] = '\0';
 	data->hit = !regexec(data->regexp, line + 1, 1, &regmatch, 0);
 	line[len] = hold;
+#endif
 }
 
 static int diff_grep(mmfile_t *one, mmfile_t *two,
@@ -49,10 +58,24 @@ static int diff_grep(mmfile_t *one, mmfile_t *two,
 	xpparam_t xpp;
 	xdemitconf_t xecfg;
 
-	if (!one)
-		return !regexec(regexp, two->ptr, 1, &regmatch, 0);
-	if (!two)
-		return !regexec(regexp, one->ptr, 1, &regmatch, 0);
+	if (!one) {
+		int flags = 0;
+#ifdef REG_STARTEND
+		regmatch.rm_so = 0;
+		regmatch.rm_eo = two->size;
+		flags = REG_STARTEND;
+#endif
+		return !regexec(regexp, two->ptr, 1, &regmatch, flags);
+	}
+	if (!two) {
+		int flags = 0;
+#ifdef REG_STARTEND
+		regmatch.rm_so = 0;
+		regmatch.rm_eo = one->size;
+		flags = REG_STARTEND;
+#endif
+		return !regexec(regexp, one->ptr, 1, &regmatch, flags);
+	}
 
 	/*
 	 * We have both sides; need to run textual diff and see if
@@ -83,7 +106,13 @@ static unsigned int contains(mmfile_t *mf, regex_t *regexp, kwset_t kws)
 		regmatch_t regmatch;
 		int flags = 0;
 
+#ifndef REG_STARTEND
 		assert(data[sz] == '\0');
+#else
+		regmatch.rm_so = 0;
+		regmatch.rm_eo = sz;
+		flags |= REG_STARTEND;
+#endif
 		while (*data && !regexec(regexp, data, 1, &regmatch, flags)) {
 			flags |= REG_NOTBOL;
 			data += regmatch.rm_eo;
diff --git a/xdiff-interface.c b/xdiff-interface.c
index f34ea76..c179d43 100644
--- a/xdiff-interface.c
+++ b/xdiff-interface.c
@@ -218,7 +218,7 @@ static long ff_regexp(const char *line, long len,
 	struct ff_regs *regs = priv;
 	regmatch_t pmatch[2];
 	int i;
-	int result = -1;
+	int result = -1, flags = 0;
 
 	/* Exclude terminating newline (and cr) from matching */
 	if (len > 0 && line[len-1] == '\n') {
@@ -228,11 +228,20 @@ static long ff_regexp(const char *line, long len,
 			len--;
 	}
 
+#ifndef REG_STARTEND
 	line_buffer = xstrndup(line, len); /* make NUL terminated */
+#else
+	line_buffer = (char *)line;
+	flags = REG_STARTEND;
+#endif
 
 	for (i = 0; i < regs->nr; i++) {
 		struct ff_reg *reg = regs->array + i;
-		if (!regexec(&reg->re, line_buffer, 2, pmatch, 0)) {
+#ifdef REG_STARTEND
+		pmatch->rm_so = 0;
+		pmatch->rm_eo = len;
+#endif
+		if (!regexec(&reg->re, line_buffer, 2, pmatch, flags)) {
 			if (reg->negate)
 				goto fail;
 			break;
@@ -249,7 +258,9 @@ static long ff_regexp(const char *line, long len,
 		result--;
 	memcpy(buffer, line, result);
  fail:
+#ifndef REG_STARTEND
 	free(line_buffer);
+#endif
 	return result;
 }
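
One thing to keep in mind when looping over successive matches, as
contains() does: with REG_STARTEND, regexec() reads pmatch[0] as its input
bounds and then overwrites it with the match offsets, so the bounds have
to be set again before every call. A small sketch of such a loop, outside
of Git and with made-up names (search_from(), count_matches()), assuming
the reported offsets are relative to the start of the buffer:

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: search buf[from..size) for `re`. On success
 * (return value 0), the offsets in *m are relative to `buf`.
 */
int search_from(const regex_t *re, const char *buf, size_t size,
		size_t from, regmatch_t *m, int eflags)
{
#ifdef REG_STARTEND
	/* Re-prime the bounds: the previous call overwrote *m. */
	m->rm_so = (regoff_t)from;
	m->rm_eo = (regoff_t)size;
	return regexec(re, buf, 1, m, eflags | REG_STARTEND);
#else
	/* Fallback: search a NUL-terminated copy of the remaining bytes. */
	char *copy = malloc(size - from + 1);
	int ret;

	if (!copy)
		return REG_ESPACE;
	memcpy(copy, buf + from, size - from);
	copy[size - from] = '\0';
	ret = regexec(re, copy, 1, m, eflags);
	free(copy);
	if (!ret) {
		m->rm_so += (regoff_t)from;
		m->rm_eo += (regoff_t)from;
	}
	return ret;
#endif
}

/* Count non-overlapping matches of `re` in the first `size` bytes. */
size_t count_matches(const regex_t *re, const char *buf, size_t size)
{
	size_t count = 0, from = 0;
	int eflags = 0;
	regmatch_t m;

	assert(re && buf);
	while (from < size && !search_from(re, buf, size, from, &m, eflags)) {
		count++;
		/* Step past the match; skip one byte after an empty match. */
		from = m.rm_eo > m.rm_so ? (size_t)m.rm_eo : (size_t)m.rm_eo + 1;
		/* As in the contains() loop, later searches set REG_NOTBOL. */
		eflags |= REG_NOTBOL;
	}
	return count;
}
```

For the pattern "ab" and the bytes "abxabyab", this counts three matches
when all eight bytes are in bounds and two when only the first seven are.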
