git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH] fast-import: Reinitialize command_buf rather than detach it.
@ 2019-08-25  4:13 Mike Hommey
  2019-08-25  6:57 ` Jeff King
  2019-08-25 12:35 ` [PATCH] fast-import: Reinitialize command_buf rather than detach it René Scharfe
  0 siblings, 2 replies; 15+ messages in thread
From: Mike Hommey @ 2019-08-25  4:13 UTC (permalink / raw)
  To: git; +Cc: gitster

command_buf.buf is also stored in cmd_hist, so instead of
strbuf_release, the current code uses strbuf_detach in order to
"leak" the buffer as far as the strbuf is concerned.

However, strbuf_detach does more than "leak" the strbuf buffer: it
possibly reallocates it to ensure a terminating nul character. And when
that happens, what is already stored in cmd_hist may now be a free()d
buffer.

In practice, though, command_buf.buf is most of the time a nul
terminated string, meaning command_buf.len < command_buf.alloc, and
strbuf_detach is a no-op. BUT, when it's not (e.g. on the first call),
command_buf.buf is &strbuf_slopbuf. In that case, strbuf_detach does
allocate a 1 byte buffer to store a nul character in it, which is then
leaked.

Since the code using strbuf_detach is assuming it does nothing to
command_buf.buf, it's overall safer to use strbuf_init, which has
the same practical effect in the usual case, and works appropriately
when command_buf is empty.
---
 fast-import.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/fast-import.c b/fast-import.c
index b44d6a467e..b1d07efe8c 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -1763,7 +1763,9 @@ static int read_next_command(void)
 		} else {
 			struct recent_command *rc;
 
-			strbuf_detach(&command_buf, NULL);
+			// command_buf is enther empty or also stored in cmd_hist,
+			// reinitialize it.
+			strbuf_init(&command_buf, 0);
 			stdin_eof = strbuf_getline_lf(&command_buf, stdin);
 			if (stdin_eof)
 				return EOF;
@@ -1833,7 +1835,9 @@ static int parse_data(struct strbuf *sb, uintmax_t limit, uintmax_t *len_res)
 		char *term = xstrdup(data);
 		size_t term_len = command_buf.len - (data - command_buf.buf);
 
-		strbuf_detach(&command_buf, NULL);
+		// command_buf is enther empty or also stored in cmd_hist,
+		// reinitialize it.
+		strbuf_init(&command_buf, 0);
 		for (;;) {
 			if (strbuf_getline_lf(&command_buf, stdin) == EOF)
 				die("EOF in data (terminator '%s' not found)", term);
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH] fast-import: Reinitialize command_buf rather than detach it.
  2019-08-25  4:13 [PATCH] fast-import: Reinitialize command_buf rather than detach it Mike Hommey
@ 2019-08-25  6:57 ` Jeff King
  2019-08-25  7:20   ` Mike Hommey
  2019-08-25  8:06   ` [PATCH 0/2] fast-import input string handling bugs Jeff King
  2019-08-25 12:35 ` [PATCH] fast-import: Reinitialize command_buf rather than detach it René Scharfe
  1 sibling, 2 replies; 15+ messages in thread
From: Jeff King @ 2019-08-25  6:57 UTC (permalink / raw)
  To: Mike Hommey; +Cc: git, gitster

On Sun, Aug 25, 2019 at 01:13:48PM +0900, Mike Hommey wrote:

> command_buf.buf is also stored in cmd_hist, so instead of
> strbuf_release, the current code uses strbuf_detach in order to
> "leak" the buffer as far as the strbuf is concerned.
> 
> However, strbuf_detach does more than "leak" the strbuf buffer: it
> possibly reallocates it to ensure a terminating nul character. And when
> that happens, what is already stored in cmd_hist may now be a free()d
> buffer.
> 
> In practice, though, command_buf.buf is most of the time a nul
> terminated string, meaning command_buf.len < command_buf.alloc, and
> strbuf_detach is a no-op. BUT, when it's not (e.g. on the first call),
> command_buf.buf is &strbuf_slopbuf. In that case, strbuf_detach does
> allocate a 1 byte buffer to store a nul character in it, which is then
> leaked.

I think this is stronger than just "most of the time". It's an invariant
for strbufs to have a NUL, so the only case where detaching isn't a noop
is the empty slopbuf case you mention.

Splitting hairs, perhaps, but I think with that explanation, we could
probably argue that this case will never come up: strbuf_getline will
either have allocated a buffer or will have returned EOF.

That said, I do think it's quite confusing and is worth fixing, both for
readability and for future-proofing. But...

> diff --git a/fast-import.c b/fast-import.c
> index b44d6a467e..b1d07efe8c 100644
> --- a/fast-import.c
> +++ b/fast-import.c
> @@ -1763,7 +1763,9 @@ static int read_next_command(void)
>  		} else {
>  			struct recent_command *rc;
>  
> -			strbuf_detach(&command_buf, NULL);
> +			// command_buf is enther empty or also stored in cmd_hist,
> +			// reinitialize it.
> +			strbuf_init(&command_buf, 0);

This whole thing is a sign that the code is Doing It Wrong. Whoever is
taking ownership of the buffer should be calling strbuf_detach() at that
point.

It's a bit tricky, though, because we take ownership and then expect
people still look at command_buf.buf. Which makes me concerned that
there are other operations which might modify the buffer and have the
same issue.

I think it would be much easier to follow if we simply used the same
command_buf over and over, and just copied into the history. The cost is
about the same (we still alloc once per line, though we do an extra
memcpy now). So I thought something like this would work (we don't need
to convert those detaches into a strbuf_reset() calls because
strbuf_getline does so automatically):

diff --git a/fast-import.c b/fast-import.c
index b44d6a467e..31207acd61 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -1763,7 +1763,6 @@ static int read_next_command(void)
 		} else {
 			struct recent_command *rc;
 
-			strbuf_detach(&command_buf, NULL);
 			stdin_eof = strbuf_getline_lf(&command_buf, stdin);
 			if (stdin_eof)
 				return EOF;
@@ -1784,7 +1783,7 @@ static int read_next_command(void)
 				free(rc->buf);
 			}
 
-			rc->buf = command_buf.buf;
+			rc->buf = xstrdup(command_buf.buf);
 			rc->prev = cmd_tail;
 			rc->next = cmd_hist.prev;
 			rc->prev->next = rc;
@@ -1833,7 +1832,6 @@ static int parse_data(struct strbuf *sb, uintmax_t limit, uintmax_t *len_res)
 		char *term = xstrdup(data);
 		size_t term_len = command_buf.len - (data - command_buf.buf);
 
-		strbuf_detach(&command_buf, NULL);
 		for (;;) {
 			if (strbuf_getline_lf(&command_buf, stdin) == EOF)
 				die("EOF in data (terminator '%s' not found)", term);

But it doesn't! It turns out that there are other places in the code
which assume they can call read_next_command() while hanging onto the
existing buffer. Which only works in the old code because the history
feature happens to hold on to the old pointer.

If I do this on top, then all tests pass:

diff --git a/fast-import.c b/fast-import.c
index 31207acd61..1f9160b645 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -2586,7 +2586,7 @@ static void parse_new_commit(const char *arg)
 	struct branch *b;
 	char *author = NULL;
 	char *committer = NULL;
-	const char *encoding = NULL;
+	char *encoding = NULL;
 	struct hash_list *merge_list = NULL;
 	unsigned int merge_count;
 	unsigned char prev_fanout, new_fanout;
@@ -2609,8 +2609,10 @@ static void parse_new_commit(const char *arg)
 	}
 	if (!committer)
 		die("Expected committer but didn't get one");
-	if (skip_prefix(command_buf.buf, "encoding ", &encoding))
+	if (skip_prefix(command_buf.buf, "encoding ", &v)) {
+		encoding = xstrdup(v);
 		read_next_command();
+	}
 	parse_data(&msg, 0, NULL);
 	read_next_command();
 	parse_from(b);
@@ -2684,6 +2686,7 @@ static void parse_new_commit(const char *arg)
 	strbuf_addbuf(&new_data, &msg);
 	free(author);
 	free(committer);
+	free(encoding);
 
 	if (!store_object(OBJ_COMMIT, &new_data, NULL, &b->oid, next_mark))
 		b->pack_id = pack_id;

And I think this is actually a real bug in the current code! We keep a
pointer to the encoding string, which survives because of the history.
But that history is bounded, and we could have an indefinite number of
changed files in the middle. If I modify t9300 like this:

diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
index 141b7fa35e..d4bbe630d5 100755
--- a/t/t9300-fast-import.sh
+++ b/t/t9300-fast-import.sh
@@ -3314,6 +3314,11 @@ test_expect_success 'X: handling encoding' '
 
 	printf "Pi: \360\nCOMMIT\n" >>input &&
 
+	for i in $(test_seq 200)
+	do
+		echo "M 644 $EMPTY_BLOB file-$i"
+	done >>input &&
+
 	git fast-import <input &&
 	git cat-file -p encoding | grep $(printf "\360") &&
 	git log -1 --format=%B encoding | grep $(printf "\317\200")

and run the test suite (using an unmodified git, without the earlier
patches I showed) then the test fails, putting garbage into the encoding
header (and when compiled with ASan, reports a use-after-free).

So I think the way the string handling is currently done papers over
such problems. You only see the problem if you have a hundred or more
modified files, so it works _most_ of the time.

That implies to me it's worth following a fix like the one I showed
above. I am slightly concerned there are other similar cases to the
"encoding" one lurking (and they _might_ not be bugs already, if there's
a fixed number of reads between the pointer being saved and being used),
but it seems worth the risk.

-Peff

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH] fast-import: Reinitialize command_buf rather than detach it.
  2019-08-25  6:57 ` Jeff King
@ 2019-08-25  7:20   ` Mike Hommey
  2019-08-25  7:28     ` Jeff King
  2019-08-25  8:06   ` [PATCH 0/2] fast-import input string handling bugs Jeff King
  1 sibling, 1 reply; 15+ messages in thread
From: Mike Hommey @ 2019-08-25  7:20 UTC (permalink / raw)
  To: Jeff King; +Cc: git, gitster

On Sun, Aug 25, 2019 at 02:57:48AM -0400, Jeff King wrote:
> On Sun, Aug 25, 2019 at 01:13:48PM +0900, Mike Hommey wrote:
> 
> > command_buf.buf is also stored in cmd_hist, so instead of
> > strbuf_release, the current code uses strbuf_detach in order to
> > "leak" the buffer as far as the strbuf is concerned.
> > 
> > However, strbuf_detach does more than "leak" the strbuf buffer: it
> > possibly reallocates it to ensure a terminating nul character. And when
> > that happens, what is already stored in cmd_hist may now be a free()d
> > buffer.
> > 
> > In practice, though, command_buf.buf is most of the time a nul
> > terminated string, meaning command_buf.len < command_buf.alloc, and
> > strbuf_detach is a no-op. BUT, when it's not (e.g. on the first call),
> > command_buf.buf is &strbuf_slopbuf. In that case, strbuf_detach does
> > allocate a 1 byte buffer to store a nul character in it, which is then
> > leaked.
> 
> I think this is stronger than just "most of the time". It's an invariant
> for strbufs to have a NUL, so the only case where detaching isn't a noop
> is the empty slopbuf case you mention.

If it's an invariant, why does detach actively tries to realloc and set
the nul terminator as if it can happen in more cases than when using the
slopbuf?

> Splitting hairs, perhaps, but I think with that explanation, we could
> probably argue that this case will never come up: strbuf_getline will
> either have allocated a buffer or will have returned EOF.

Note that the slopbuf case _does_ come up, and we always leak a 1 byte
buffer.

> That said, I do think it's quite confusing and is worth fixing, both for
> readability and for future-proofing. But...
(...)

I do agree the way fast-import works between cmd_hist and command_buf is
very brittle, as you've shown. I didn't feel like digging into it
though. Thanks for having gone further than I did.

Mike

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] fast-import: Reinitialize command_buf rather than detach it.
  2019-08-25  7:20   ` Mike Hommey
@ 2019-08-25  7:28     ` Jeff King
  0 siblings, 0 replies; 15+ messages in thread
From: Jeff King @ 2019-08-25  7:28 UTC (permalink / raw)
  To: Mike Hommey; +Cc: git, gitster

On Sun, Aug 25, 2019 at 04:20:31PM +0900, Mike Hommey wrote:

> > I think this is stronger than just "most of the time". It's an invariant
> > for strbufs to have a NUL, so the only case where detaching isn't a noop
> > is the empty slopbuf case you mention.
> 
> If it's an invariant, why does detach actively tries to realloc and set
> the nul terminator as if it can happen in more cases than when using the
> slopbuf?

It calls strbuf_grow() to handle the slopbuf case (we can't hand off the
slopbuf, since the caller expects an allocated buffer). It just doesn't
bother to distinguish that case itself, and lets strbuf_grow() handle
it.

I think it would be equally correct for strbuf_detach() to do:

  if (!sb->alloc)
	strbuf_grow(0);

> > Splitting hairs, perhaps, but I think with that explanation, we could
> > probably argue that this case will never come up: strbuf_getline will
> > either have allocated a buffer or will have returned EOF.
> 
> Note that the slopbuf case _does_ come up, and we always leak a 1 byte
> buffer.

Hmm, I suppose so, on the very first call before we've read anything
(and likewise if parse_data() reset it then got an EOF, and we then
tried to read another command).

> I do agree the way fast-import works between cmd_hist and command_buf is
> very brittle, as you've shown. I didn't feel like digging into it
> though. Thanks for having gone further than I did.

I'll see if I can shape my rambling into a patch.

-Peff

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 0/2] fast-import input string handling bugs
  2019-08-25  6:57 ` Jeff King
  2019-08-25  7:20   ` Mike Hommey
@ 2019-08-25  8:06   ` Jeff King
  2019-08-25  8:08     ` [PATCH 1/2] fast-import: duplicate parsed encoding string Jeff King
                       ` (3 more replies)
  1 sibling, 4 replies; 15+ messages in thread
From: Jeff King @ 2019-08-25  8:06 UTC (permalink / raw)
  To: Mike Hommey; +Cc: Elijah Newren, git, gitster

On Sun, Aug 25, 2019 at 02:57:48AM -0400, Jeff King wrote:

> And I think this is actually a real bug in the current code! We keep a
> pointer to the encoding string, which survives because of the history.
> But that history is bounded, and we could have an indefinite number of
> changed files in the middle. If I modify t9300 like this:

Here are two patches. The first fixes the existing bug with "encoding",
and the second uses the approach I suggested to fix the leak you
noticed.

The second one does carry a greater risk of regression than your patch,
but I think it's worth it for the fact that it makes any other bugs
(like the "encoding" one) more obvious.

  [1/2]: fast-import: duplicate parsed encoding string
  [2/2]: fast-import: duplicate into history rather than passing ownership

 fast-import.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

-Peff

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/2] fast-import: duplicate parsed encoding string
  2019-08-25  8:06   ` [PATCH 0/2] fast-import input string handling bugs Jeff King
@ 2019-08-25  8:08     ` Jeff King
  2019-08-26 18:28       ` Elijah Newren
  2019-08-25  8:10     ` [PATCH 2/2] fast-import: duplicate into history rather than passing ownership Jeff King
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 15+ messages in thread
From: Jeff King @ 2019-08-25  8:08 UTC (permalink / raw)
  To: Mike Hommey; +Cc: Elijah Newren, git, gitster

We read each line of the fast-import stream into the command_buf strbuf.
When reading a commit, we parse a line like "encoding foo" by storing a
pointer to "foo", but not making a copy. We may then read an unbounded
number of other lines (e.g., one for each modified file in the commit),
each of which writes into command_buf.

This works out in practice for small cases, because we hand off
ownership of the heap buffer from command_buf to the cmd_hist array, and
read new commands into a fresh heap buffer. And thus the pointer to
"foo" remains valid as long as there aren't so many intermediate lines
that we end up dropping the original "encoding" line from the history.

But as the test modification shows, if we go over our default of 100
lines, we end up with our encoding string pointing into freed heap
memory. This seems to fail reliably by writing garbage into the output,
but running under ASan definitely detects this as a user-after-free.

We can fix it by duplicating the encoding value, just as we do for other
parsed lines (e.g., an author line ends up in parse_ident, which copies
it to a new string).

Signed-off-by: Jeff King <peff@peff.net>
---
 fast-import.c          | 7 +++++--
 t/t9300-fast-import.sh | 5 +++++
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/fast-import.c b/fast-import.c
index b44d6a467e..ee7258037a 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -2588,7 +2588,7 @@ static void parse_new_commit(const char *arg)
 	struct branch *b;
 	char *author = NULL;
 	char *committer = NULL;
-	const char *encoding = NULL;
+	char *encoding = NULL;
 	struct hash_list *merge_list = NULL;
 	unsigned int merge_count;
 	unsigned char prev_fanout, new_fanout;
@@ -2611,8 +2611,10 @@ static void parse_new_commit(const char *arg)
 	}
 	if (!committer)
 		die("Expected committer but didn't get one");
-	if (skip_prefix(command_buf.buf, "encoding ", &encoding))
+	if (skip_prefix(command_buf.buf, "encoding ", &v)) {
+		encoding = xstrdup(v);
 		read_next_command();
+	}
 	parse_data(&msg, 0, NULL);
 	read_next_command();
 	parse_from(b);
@@ -2686,6 +2688,7 @@ static void parse_new_commit(const char *arg)
 	strbuf_addbuf(&new_data, &msg);
 	free(author);
 	free(committer);
+	free(encoding);
 
 	if (!store_object(OBJ_COMMIT, &new_data, NULL, &b->oid, next_mark))
 		b->pack_id = pack_id;
diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
index 141b7fa35e..cf66b40ebc 100755
--- a/t/t9300-fast-import.sh
+++ b/t/t9300-fast-import.sh
@@ -3314,6 +3314,11 @@ test_expect_success 'X: handling encoding' '
 
 	printf "Pi: \360\nCOMMIT\n" >>input &&
 
+	for i in $(test_seq 100)
+	do
+		echo "M 644 $EMPTY_BLOB file-$i"
+	done >>input &&
+
 	git fast-import <input &&
 	git cat-file -p encoding | grep $(printf "\360") &&
 	git log -1 --format=%B encoding | grep $(printf "\317\200")
-- 
2.23.0.478.g23872bed7d


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/2] fast-import: duplicate into history rather than passing ownership
  2019-08-25  8:06   ` [PATCH 0/2] fast-import input string handling bugs Jeff King
  2019-08-25  8:08     ` [PATCH 1/2] fast-import: duplicate parsed encoding string Jeff King
@ 2019-08-25  8:10     ` Jeff King
  2019-08-25 10:02       ` Mike Hommey
  2019-08-26 15:36     ` [PATCH 0/2] fast-import input string handling bugs Junio C Hamano
  2019-08-26 19:18     ` Elijah Newren
  3 siblings, 1 reply; 15+ messages in thread
From: Jeff King @ 2019-08-25  8:10 UTC (permalink / raw)
  To: Mike Hommey; +Cc: Elijah Newren, git, gitster

Fast-import's read_next_command() has somewhat odd memory ownership
semantics for the command_buf strbuf. After reading a command, we copy
the strbuf's pointer (without duplicating the string) into our cmd_hist
array of recent commands. And then when we're about to read a new
command, we clear the strbuf by calling strbuf_detach(), dropping
ownership from the strbuf (leaving the cmd_hist reference as the
remaining owner).

This has a few surprising implications:

  - if the strbuf hasn't been copied into cmd_hist (e.g., because we
    haven't ready any commands yet), then the strbuf_detach() will leak
    the resulting string

  - any modification to command_buf risks invalidating the pointer held
    by cmd_hist. There doesn't seem to be any way to trigger this
    currently (since we tend to modify it only by detaching and reading
    in a new value), but it's subtly dangerous.

  - any pointers into an input string will remain valid as long as
    cmd_hist points to them. So in general, you can point into
    command_buf.buf and call read_next_command() up to 100 times before
    your string is cycled out and freed, leaving you with a dangling
    pointer. This makes it easy to miss bugs during testing, as they
    might trigger only for a sufficiently large commit (e.g., the bug
    fixed in the previous commit).

Instead, let's make a new string to copy the command into the history
array, rather than having dual ownership with the old. Then we can drop
the strbuf_detach() calls entirely, and just reuse the same buffer
within command_buf over and over. We'd normally have to strbuf_reset()
it before using it again, but in both cases here we're using
strbuf_getline(), which does it automatically for us.

This fixes the leak, and it means that even a single call to
read_next_command() will invalidate any held pointers, making it easier
to find bugs. In fact, we can drop the extra input lines added to the
test case by the previous commit, as the unfixed bug would now trigger
just from reading the commit message, even without any modified files in
the commit.

Reported-by: Mike Hommey <mh@glandium.org>
Signed-off-by: Jeff King <peff@peff.net>
---
 fast-import.c          | 4 +---
 t/t9300-fast-import.sh | 5 -----
 2 files changed, 1 insertion(+), 8 deletions(-)

diff --git a/fast-import.c b/fast-import.c
index ee7258037a..1f9160b645 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -1763,7 +1763,6 @@ static int read_next_command(void)
 		} else {
 			struct recent_command *rc;
 
-			strbuf_detach(&command_buf, NULL);
 			stdin_eof = strbuf_getline_lf(&command_buf, stdin);
 			if (stdin_eof)
 				return EOF;
@@ -1784,7 +1783,7 @@ static int read_next_command(void)
 				free(rc->buf);
 			}
 
-			rc->buf = command_buf.buf;
+			rc->buf = xstrdup(command_buf.buf);
 			rc->prev = cmd_tail;
 			rc->next = cmd_hist.prev;
 			rc->prev->next = rc;
@@ -1833,7 +1832,6 @@ static int parse_data(struct strbuf *sb, uintmax_t limit, uintmax_t *len_res)
 		char *term = xstrdup(data);
 		size_t term_len = command_buf.len - (data - command_buf.buf);
 
-		strbuf_detach(&command_buf, NULL);
 		for (;;) {
 			if (strbuf_getline_lf(&command_buf, stdin) == EOF)
 				die("EOF in data (terminator '%s' not found)", term);
diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
index cf66b40ebc..141b7fa35e 100755
--- a/t/t9300-fast-import.sh
+++ b/t/t9300-fast-import.sh
@@ -3314,11 +3314,6 @@ test_expect_success 'X: handling encoding' '
 
 	printf "Pi: \360\nCOMMIT\n" >>input &&
 
-	for i in $(test_seq 100)
-	do
-		echo "M 644 $EMPTY_BLOB file-$i"
-	done >>input &&
-
 	git fast-import <input &&
 	git cat-file -p encoding | grep $(printf "\360") &&
 	git log -1 --format=%B encoding | grep $(printf "\317\200")
-- 
2.23.0.478.g23872bed7d

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] fast-import: duplicate into history rather than passing ownership
  2019-08-25  8:10     ` [PATCH 2/2] fast-import: duplicate into history rather than passing ownership Jeff King
@ 2019-08-25 10:02       ` Mike Hommey
  2019-08-25 14:21         ` René Scharfe
  0 siblings, 1 reply; 15+ messages in thread
From: Mike Hommey @ 2019-08-25 10:02 UTC (permalink / raw)
  To: Jeff King; +Cc: Elijah Newren, git, gitster

On Sun, Aug 25, 2019 at 04:10:55AM -0400, Jeff King wrote:
> Fast-import's read_next_command() has somewhat odd memory ownership
> semantics for the command_buf strbuf. After reading a command, we copy
> the strbuf's pointer (without duplicating the string) into our cmd_hist
> array of recent commands. And then when we're about to read a new
> command, we clear the strbuf by calling strbuf_detach(), dropping
> ownership from the strbuf (leaving the cmd_hist reference as the
> remaining owner).
> 
> This has a few surprising implications:
> 
>   - if the strbuf hasn't been copied into cmd_hist (e.g., because we
>     haven't ready any commands yet), then the strbuf_detach() will leak
>     the resulting string
> 
>   - any modification to command_buf risks invalidating the pointer held
>     by cmd_hist. There doesn't seem to be any way to trigger this
>     currently (since we tend to modify it only by detaching and reading
>     in a new value), but it's subtly dangerous.
> 
>   - any pointers into an input string will remain valid as long as
>     cmd_hist points to them. So in general, you can point into
>     command_buf.buf and call read_next_command() up to 100 times before
>     your string is cycled out and freed, leaving you with a dangling
>     pointer. This makes it easy to miss bugs during testing, as they
>     might trigger only for a sufficiently large commit (e.g., the bug
>     fixed in the previous commit).
> 
> Instead, let's make a new string to copy the command into the history
> array, rather than having dual ownership with the old. Then we can drop
> the strbuf_detach() calls entirely, and just reuse the same buffer
> within command_buf over and over. We'd normally have to strbuf_reset()
> it before using it again, but in both cases here we're using
> strbuf_getline(), which does it automatically for us.
> 
> This fixes the leak, and it means that even a single call to
> read_next_command() will invalidate any held pointers, making it easier
> to find bugs. In fact, we can drop the extra input lines added to the
> test case by the previous commit, as the unfixed bug would now trigger
> just from reading the commit message, even without any modified files in
> the commit.
> 
> Reported-by: Mike Hommey <mh@glandium.org>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  fast-import.c          | 4 +---
>  t/t9300-fast-import.sh | 5 -----
>  2 files changed, 1 insertion(+), 8 deletions(-)
> 
> diff --git a/fast-import.c b/fast-import.c
> index ee7258037a..1f9160b645 100644
> --- a/fast-import.c
> +++ b/fast-import.c
> @@ -1763,7 +1763,6 @@ static int read_next_command(void)
>  		} else {
>  			struct recent_command *rc;
>  
> -			strbuf_detach(&command_buf, NULL);
>  			stdin_eof = strbuf_getline_lf(&command_buf, stdin);
>  			if (stdin_eof)
>  				return EOF;
> @@ -1784,7 +1783,7 @@ static int read_next_command(void)
>  				free(rc->buf);
>  			}
>  
> -			rc->buf = command_buf.buf;
> +			rc->buf = xstrdup(command_buf.buf);

You could xstrndup(command_buf.buf, command_buf.len), which would avoid
a hidden strlen.

Mike

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] fast-import: Reinitialize command_buf rather than detach it.
  2019-08-25  4:13 [PATCH] fast-import: Reinitialize command_buf rather than detach it Mike Hommey
  2019-08-25  6:57 ` Jeff King
@ 2019-08-25 12:35 ` René Scharfe
  1 sibling, 0 replies; 15+ messages in thread
From: René Scharfe @ 2019-08-25 12:35 UTC (permalink / raw)
  To: Mike Hommey; +Cc: git, gitster

Am 25.08.19 um 06:13 schrieb Mike Hommey:
> command_buf.buf is also stored in cmd_hist, so instead of
> strbuf_release, the current code uses strbuf_detach in order to
> "leak" the buffer as far as the strbuf is concerned.

Hmm.

> However, strbuf_detach does more than "leak" the strbuf buffer: it
> possibly reallocates it to ensure a terminating nul character. And when
> that happens, what is already stored in cmd_hist may now be a free()d
> buffer.
>
> In practice, though, command_buf.buf is most of the time a nul
> terminated string, meaning command_buf.len < command_buf.alloc, and
> strbuf_detach is a no-op. BUT, when it's not (e.g. on the first call),
> command_buf.buf is &strbuf_slopbuf. In that case, strbuf_detach does
> allocate a 1 byte buffer to store a nul character in it, which is then
> leaked.

And that's why a pointer to a strbuf buf is valid until the next strbuf_
function call.

>
> Since the code using strbuf_detach is assuming it does nothing to
> command_buf.buf, it's overall safer to use strbuf_init, which has
> the same practical effect in the usual case, and works appropriately
> when command_buf is empty.
> ---
>  fast-import.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/fast-import.c b/fast-import.c
> index b44d6a467e..b1d07efe8c 100644
> --- a/fast-import.c
> +++ b/fast-import.c
> @@ -1763,7 +1763,9 @@ static int read_next_command(void)
>  		} else {
>  			struct recent_command *rc;
>
> -			strbuf_detach(&command_buf, NULL);
> +			// command_buf is enther empty or also stored in cmd_hist,
> +			// reinitialize it.
> +			strbuf_init(&command_buf, 0);

This is a no-op; strbuf_detach already re-initializes the strbuf.

(And double-slash comments are avoided in Git code..)

((s/enther/either/))

>  			stdin_eof = strbuf_getline_lf(&command_buf, stdin);
>  			if (stdin_eof)
>  				return EOF;
> @@ -1833,7 +1835,9 @@ static int parse_data(struct strbuf *sb, uintmax_t limit, uintmax_t *len_res)
>  		char *term = xstrdup(data);
>  		size_t term_len = command_buf.len - (data - command_buf.buf);
>
> -		strbuf_detach(&command_buf, NULL);
> +		// command_buf is enther empty or also stored in cmd_hist,
> +		// reinitialize it.
> +		strbuf_init(&command_buf, 0);

Same here.

>  		for (;;) {
>  			if (strbuf_getline_lf(&command_buf, stdin) == EOF)
>  				die("EOF in data (terminator '%s' not found)", term);
>

strbuf_detach() is handing over ownership of the buffer.  Its return
value should be stored; grabbing the pointer beforehand is naughty
because it can get stale, as you noted.  I doubt there is a good reason
for ignoring the return value of strbuf_detach(), ever.

René

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] fast-import: duplicate into history rather than passing ownership
  2019-08-25 10:02       ` Mike Hommey
@ 2019-08-25 14:21         ` René Scharfe
  2019-08-26 18:42           ` Jeff King
  0 siblings, 1 reply; 15+ messages in thread
From: René Scharfe @ 2019-08-25 14:21 UTC (permalink / raw)
  To: Mike Hommey, Jeff King; +Cc: Elijah Newren, git, gitster

Am 25.08.19 um 12:02 schrieb Mike Hommey:
> On Sun, Aug 25, 2019 at 04:10:55AM -0400, Jeff King wrote:
>> diff --git a/fast-import.c b/fast-import.c
>> index ee7258037a..1f9160b645 100644
>> --- a/fast-import.c
>> +++ b/fast-import.c
>> @@ -1763,7 +1763,6 @@ static int read_next_command(void)
>>  		} else {
>>  			struct recent_command *rc;
>>
>> -			strbuf_detach(&command_buf, NULL);
>>  			stdin_eof = strbuf_getline_lf(&command_buf, stdin);
>>  			if (stdin_eof)
>>  				return EOF;
>> @@ -1784,7 +1783,7 @@ static int read_next_command(void)
>>  				free(rc->buf);
>>  			}
>>
>> -			rc->buf = command_buf.buf;
>> +			rc->buf = xstrdup(command_buf.buf);
>
> You could xstrndup(command_buf.buf, command_buf.len), which would avoid
> a hidden strlen.

xstrndup() also searches for NUL, albeit with memchr(3).  xmemdupz()
would copy without checking.

I suspect the simplicity of xstrdup() outweighs the benefits of the
alternatives, but didn't do any measurements..

René

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/2] fast-import input string handling bugs
  2019-08-25  8:06   ` [PATCH 0/2] fast-import input string handling bugs Jeff King
  2019-08-25  8:08     ` [PATCH 1/2] fast-import: duplicate parsed encoding string Jeff King
  2019-08-25  8:10     ` [PATCH 2/2] fast-import: duplicate into history rather than passing ownership Jeff King
@ 2019-08-26 15:36     ` Junio C Hamano
  2019-08-26 19:18     ` Elijah Newren
  3 siblings, 0 replies; 15+ messages in thread
From: Junio C Hamano @ 2019-08-26 15:36 UTC (permalink / raw)
  To: Jeff King; +Cc: Mike Hommey, Elijah Newren, git

Jeff King <peff@peff.net> writes:

> On Sun, Aug 25, 2019 at 02:57:48AM -0400, Jeff King wrote:
>
>> And I think this is actually a real bug in the current code! We keep a
>> pointer to the encoding string, which survives because of the history.
>> But that history is bounded, and we could have an indefinite number of
>> changed files in the middle. If I modify t9300 like this:
>
> Here are two patches. The first fixes the existing bug with "encoding",
> and the second uses the approach I suggested to fix the leak you
> noticed.
>
> The second one does carry a greater risk of regression than your patch,
> but I think it's worth it for the fact that it makes any other bugs
> (like the "encoding" one) more obvious.

Yeah, it may be worth the risk, given that this is quite early in
the cycle, so we have enough time to cook it in 'next' to see if
somebody screams ;-)

>
>   [1/2]: fast-import: duplicate parsed encoding string
>   [2/2]: fast-import: duplicate into history rather than passing ownership
>
>  fast-import.c | 11 ++++++-----
>  1 file changed, 6 insertions(+), 5 deletions(-)
>
> -Peff

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] fast-import: duplicate parsed encoding string
  2019-08-25  8:08     ` [PATCH 1/2] fast-import: duplicate parsed encoding string Jeff King
@ 2019-08-26 18:28       ` Elijah Newren
  2019-08-26 18:44         ` Jeff King
  0 siblings, 1 reply; 15+ messages in thread
From: Elijah Newren @ 2019-08-26 18:28 UTC (permalink / raw)
  To: Jeff King; +Cc: Mike Hommey, Git Mailing List, Junio C Hamano

On Sun, Aug 25, 2019 at 1:08 AM Jeff King <peff@peff.net> wrote:
>
> We read each line of the fast-import stream into the command_buf strbuf.
> When reading a commit, we parse a line like "encoding foo" by storing a
> pointer to "foo", but not making a copy. We may then read an unbounded
> number of other lines (e.g., one for each modified file in the commit),
> each of which writes into command_buf.
>
> This works out in practice for small cases, because we hand off
> ownership of the heap buffer from command_buf to the cmd_hist array, and
> read new commands into a fresh heap buffer. And thus the pointer to
> "foo" remains valid as long as there aren't so many intermediate lines
> that we end up dropping the original "encoding" line from the history.
>
> But as the test modification shows, if we go over our default of 100
> lines, we end up with our encoding string pointing into freed heap
> memory. This seems to fail reliably by writing garbage into the output,
> but running under ASan definitely detects this as a user-after-free.

s/user-after-free/use-after-free/

> We can fix it by duplicating the encoding value, just as we do for other
> parsed lines (e.g., an author line ends up in parse_ident, which copies
> it to a new string).

Eek!  Thanks for fixing this up for me; patch looks good.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] fast-import: duplicate into history rather than passing ownership
  2019-08-25 14:21         ` René Scharfe
@ 2019-08-26 18:42           ` Jeff King
  0 siblings, 0 replies; 15+ messages in thread
From: Jeff King @ 2019-08-26 18:42 UTC (permalink / raw)
  To: René Scharfe; +Cc: Mike Hommey, Elijah Newren, git, gitster

On Sun, Aug 25, 2019 at 04:21:54PM +0200, René Scharfe wrote:

> > You could xstrndup(command_buf.buf, command_buf.len), which would avoid
> > a hidden strlen.
> 
> xstrndup() also searches for NUL, albeit with memchr(3).  xmemdupz()
> would copy without checking.
> 
> I suspect the simplicity of xstrdup() outweighs the benefits of the
> alternatives, but didn't do any measurements..

Yep. I actually started to write xmemdupz() originally then decided it
was unnecessarily verbose and a premature optimization.

I wondered after this exchange whether something like:

  char *strbuf_dup(const struct strbuf *sb)
  {
	return xmemdupz(sb->buf, sb->len);
  }

would be a useful general helper. Grepping around it doesn't seem like
there are a lot of candidates.

If we really wanted to micro-optimize, we could have cmd_hist store
strbufs, and then we could reuse the same buffers over and over without
re-allocating. And use strbuf_addbuf(&cmd_hist.buf, &command_buf). :)

-Peff

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] fast-import: duplicate parsed encoding string
  2019-08-26 18:28       ` Elijah Newren
@ 2019-08-26 18:44         ` Jeff King
  0 siblings, 0 replies; 15+ messages in thread
From: Jeff King @ 2019-08-26 18:44 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Mike Hommey, Git Mailing List, Junio C Hamano

On Mon, Aug 26, 2019 at 11:28:58AM -0700, Elijah Newren wrote:

> On Sun, Aug 25, 2019 at 1:08 AM Jeff King <peff@peff.net> wrote:
> >
> > We read each line of the fast-import stream into the command_buf strbuf.
> > When reading a commit, we parse a line like "encoding foo" by storing a
> > pointer to "foo", but not making a copy. We may then read an unbounded
> > number of other lines (e.g., one for each modified file in the commit),
> > each of which writes into command_buf.
> >
> > This works out in practice for small cases, because we hand off
> > ownership of the heap buffer from command_buf to the cmd_hist array, and
> > read new commands into a fresh heap buffer. And thus the pointer to
> > "foo" remains valid as long as there aren't so many intermediate lines
> > that we end up dropping the original "encoding" line from the history.
> >
> > But as the test modification shows, if we go over our default of 100
> > lines, we end up with our encoding string pointing into freed heap
> > memory. This seems to fail reliably by writing garbage into the output,
> > but running under ASan definitely detects this as a user-after-free.
> 
> s/user-after-free/use-after-free/

Wow. I self-corrected "user-after-free" at least three other times while
writing this thread. I clearly have a problem. :)

> > We can fix it by duplicating the encoding value, just as we do for other
> > parsed lines (e.g., an author line ends up in parse_ident, which copies
> > it to a new string).
> 
> Eek!  Thanks for fixing this up for me; patch looks good.

No problem. On the plus side, finding your bug made me think much harder
about the implications of patch 2 (because it's quite subtle and an easy
mistake to make).

-Peff

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/2] fast-import input string handling bugs
  2019-08-25  8:06   ` [PATCH 0/2] fast-import input string handling bugs Jeff King
                       ` (2 preceding siblings ...)
  2019-08-26 15:36     ` [PATCH 0/2] fast-import input string handling bugs Junio C Hamano
@ 2019-08-26 19:18     ` Elijah Newren
  3 siblings, 0 replies; 15+ messages in thread
From: Elijah Newren @ 2019-08-26 19:18 UTC (permalink / raw)
  To: Jeff King; +Cc: Mike Hommey, Git Mailing List, Junio C Hamano

On Sun, Aug 25, 2019 at 1:06 AM Jeff King <peff@peff.net> wrote:
>
> On Sun, Aug 25, 2019 at 02:57:48AM -0400, Jeff King wrote:
>
> > And I think this is actually a real bug in the current code! We keep a
> > pointer to the encoding string, which survives because of the history.
> > But that history is bounded, and we could have an indefinite number of
> > changed files in the middle. If I modify t9300 like this:
>
> Here are two patches. The first fixes the existing bug with "encoding",
> and the second uses the approach I suggested to fix the leak you
> noticed.
>
> The second one does carry a greater risk of regression than your patch,
> but I think it's worth it for the fact that it makes any other bugs
> (like the "encoding" one) more obvious.

I agree, both patches look good to me, and I particularly appreciate
some extra help to avoid making the same mistake again.  :-)

Just for good measure, I also went and tested these patches by running
the git filter-repo testsuite and by re-running the filter-repo timing
cases at https://public-inbox.org/git/CABPp-BGOz8nks0+Tdw5GyGqxeYR-3FF6FT5JcgVqZDYVRQ6qog@mail.gmail.com/.

While filter-repo only uses a subset of fast-export functionality, it
tests it with a variety of weird and unusual tiny test repos.  And the
timing cases run on three real-world repositories (git, rails, and
linux).  Everything looks good on all of these testcases.

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2019-08-26 19:19 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-25  4:13 [PATCH] fast-import: Reinitialize command_buf rather than detach it Mike Hommey
2019-08-25  6:57 ` Jeff King
2019-08-25  7:20   ` Mike Hommey
2019-08-25  7:28     ` Jeff King
2019-08-25  8:06   ` [PATCH 0/2] fast-import input string handling bugs Jeff King
2019-08-25  8:08     ` [PATCH 1/2] fast-import: duplicate parsed encoding string Jeff King
2019-08-26 18:28       ` Elijah Newren
2019-08-26 18:44         ` Jeff King
2019-08-25  8:10     ` [PATCH 2/2] fast-import: duplicate into history rather than passing ownership Jeff King
2019-08-25 10:02       ` Mike Hommey
2019-08-25 14:21         ` René Scharfe
2019-08-26 18:42           ` Jeff King
2019-08-26 15:36     ` [PATCH 0/2] fast-import input string handling bugs Junio C Hamano
2019-08-26 19:18     ` Elijah Newren
2019-08-25 12:35 ` [PATCH] fast-import: Reinitialize command_buf rather than detach it René Scharfe

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).