[PATCH 1/2] commit: reject invalid UTF-8 codepoints

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* [PATCH 1/2] commit: reject invalid UTF-8 codepoints
@ 2013-06-29 17:40 brian m. carlson
  2013-06-30  2:13 ` Junio C Hamano
  2013-07-01  7:12 ` Peter Krefting
  0 siblings, 2 replies; 7+ messages in thread
From: brian m. carlson @ 2013-06-29 17:40 UTC (permalink / raw)
  To: git; +Cc: gitster

The commit code already contains code for validating UTF-8, but it does not
check for invalid values, such as guaranteed non-characters and surrogates.  Fix
this by explicitly checking for and rejecting such characters.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 commit.c                  | 18 +++++++++++++++---
 t/t3900-i18n-commit.sh    |  9 +++++++++
 t/t3900/UTF-8-invalid.txt |  3 +++
 3 files changed, 27 insertions(+), 3 deletions(-)
 create mode 100644 t/t3900/UTF-8-invalid.txt

diff --git a/commit.c b/commit.c
index 888e02a..2280413 100644
--- a/commit.c
+++ b/commit.c
@@ -1244,6 +1244,7 @@ static int find_invalid_utf8(const char *buf, int len)
 	while (len) {
 		unsigned char c = *buf++;
 		int bytes, bad_offset;
+		unsigned int codepoint;
 
 		len--;
 		offset++;
@@ -1264,24 +1265,35 @@ static int find_invalid_utf8(const char *buf, int len)
 			bytes++;
 		}
 
-		/* Must be between 1 and 5 more bytes */
-		if (bytes < 1 || bytes > 5)
+		/* Must be between 1 and 3 more bytes */
+		if (bytes < 1 || bytes > 3)
 			return bad_offset;
 
 		/* Do we *have* that many bytes? */
 		if (len < bytes)
 			return bad_offset;
 
+		/* Place the encoded bits at the bottom of the value. */
+		codepoint = (c & 0x7f) >> bytes;
+
 		offset += bytes;
 		len -= bytes;
 
 		/* And verify that they are good continuation bytes */
 		do {
+			codepoint <<= 6;
+			codepoint |= *buf & 0x3f;
 			if ((*buf++ & 0xc0) != 0x80)
 				return bad_offset;
 		} while (--bytes);
 
-		/* We could/should check the value and length here too */
+		/* Check the value here */
+		if (codepoint >= 0xd800 && codepoint <= 0xdfff)
+			return bad_offset;
+		if (codepoint > 0x10ffff)
+			return bad_offset;
+		if ((codepoint & 0x1ffffe) == 0xfffe)
+			return bad_offset;
 	}
 	return -1;
 }
diff --git a/t/t3900-i18n-commit.sh b/t/t3900-i18n-commit.sh
index 37ddabb..16ed707 100755
--- a/t/t3900-i18n-commit.sh
+++ b/t/t3900-i18n-commit.sh
@@ -39,6 +39,15 @@ test_expect_failure 'UTF-16 refused because of NULs' '
 	git commit -a -F "$TEST_DIRECTORY"/t3900/UTF-16.txt
 '
 
+test_expect_success 'UTF-8 invalid characters refused' '
+	rm -f "$HOME/stderr" &&
+	echo "UTF-8 characters" >F &&
+	git commit -a -F "$TEST_DIRECTORY"/t3900/UTF-8-invalid.txt \
+		2>"$HOME"/stderr &&
+	grep "did not conform" "$HOME"/stderr
+'
+
+rm -f "$HOME/stderr"
 
 for H in ISO8859-1 eucJP ISO-2022-JP
 do
diff --git a/t/t3900/UTF-8-invalid.txt b/t/t3900/UTF-8-invalid.txt
new file mode 100644
index 0000000..343684d
--- /dev/null
+++ b/t/t3900/UTF-8-invalid.txt
@@ -0,0 +1,3 @@
+Commit message
+
+Invalid surrogate:???
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH 1/2] commit: reject invalid UTF-8 codepoints
  2013-06-29 17:40 [PATCH 1/2] commit: reject invalid UTF-8 codepoints brian m. carlson
@ 2013-06-30  2:13 ` Junio C Hamano
  2013-06-30  2:29   ` brian m. carlson
  2013-07-04  0:17   ` brian m. carlson
  2013-07-01  7:12 ` Peter Krefting
  1 sibling, 2 replies; 7+ messages in thread
From: Junio C Hamano @ 2013-06-30  2:13 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git

"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> The commit code already contains code for validating UTF-8, but it
> does not check for invalid values, such as guaranteed
> non-characters and surrogates.  Fix this by explicitly checking
> for and rejecting such characters.
>
> Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
> ---

Does this correspond to the following comment in the same file, and
if so, shouldn't this part of your patch?

diff --git a/commit.c b/commit.c
index 888e02a..8ce20d5 100644
--- a/commit.c
+++ b/commit.c
@@ -1292,8 +1292,8 @@ static int find_invalid_utf8(const char *buf, int len)
  * If it isn't, it assumes any non-utf8 characters are Latin1,
  * and does the conversion.
  *
- * Fixme: we should probably also disallow overlong forms and
- * invalid characters. But we don't do that currently.
+ * Fixme: we should probably also disallow overlong forms.
+ * But we don't do that fully, yet.
  */
 static int verify_utf8(struct strbuf *buf)
 {

> @@ -1264,24 +1265,35 @@ static int find_invalid_utf8(const char *buf, int len)
>  			bytes++;
>  		}
>  
> -		/* Must be between 1 and 5 more bytes */
> -		if (bytes < 1 || bytes > 5)
> +		/* Must be between 1 and 3 more bytes */
> +		if (bytes < 1 || bytes > 3)
>  			return bad_offset;

We used to allow te original up-to-6 form and this update is about
rejecting anything above U+10FFFF (in line with e.g. RFC 3629)?

>  		/* Do we *have* that many bytes? */
>  		if (len < bytes)
>  			return bad_offset;
>  
> +		/* Place the encoded bits at the bottom of the value. */
> +		codepoint = (c & 0x7f) >> bytes;
> +
>  		offset += bytes;
>  		len -= bytes;
>  
>  		/* And verify that they are good continuation bytes */
>  		do {
> +			codepoint <<= 6;
> +			codepoint |= *buf & 0x3f;
>  			if ((*buf++ & 0xc0) != 0x80)
>  				return bad_offset;
>  		} while (--bytes);
>  
> -		/* We could/should check the value and length here too */
> +		/* Check the value here */
> +		if (codepoint >= 0xd800 && codepoint <= 0xdfff)
> +			return bad_offset;

As you use separate if statements for this check, perhaps you can
give each of them a better comment to say what each is rejecting?
E.g.

	/* do not allow range for surrogate pair */
		if (codepoint >= 0xd800 && codepoint <= 0xdfff)
			return bad_offset;

> +		if (codepoint > 0x10ffff)
> +			return bad_offset;
> +		if ((codepoint & 0x1ffffe) == 0xfffe)
> +			return bad_offset;

As that comment I quoted at the beginning says, we did not check for
invalid encoded values and the primary reason for it is beccause
this function did not want to look into the actual values here.  But
now you are looking into "codepoint", you can now also check for
"overlong" form (e.g. sequence "C0 80" turning into U+0000)?

>  	}
>  	return -1;
>  }
> diff --git a/t/t3900-i18n-commit.sh b/t/t3900-i18n-commit.sh
> index 37ddabb..16ed707 100755
> --- a/t/t3900-i18n-commit.sh
> +++ b/t/t3900-i18n-commit.sh
> @@ -39,6 +39,15 @@ test_expect_failure 'UTF-16 refused because of NULs' '
>  	git commit -a -F "$TEST_DIRECTORY"/t3900/UTF-16.txt
>  '
>  
> +test_expect_success 'UTF-8 invalid characters refused' '
> +	rm -f "$HOME/stderr" &&
> +	echo "UTF-8 characters" >F &&
> +	git commit -a -F "$TEST_DIRECTORY"/t3900/UTF-8-invalid.txt \
> +		2>"$HOME"/stderr &&
> +	grep "did not conform" "$HOME"/stderr
> +'
> +
> +rm -f "$HOME/stderr"

Perhaps

        test_expect_success 'test name' '
                test_when_finished "rm -f $HOME/stderr" &&
                "rm -f $HOME/stderr" &&
                echo ...
                grep "did not conform" ...
        '

> diff --git a/t/t3900/UTF-8-invalid.txt b/t/t3900/UTF-8-invalid.txt
> new file mode 100644
> index 0000000..343684d
> --- /dev/null
> +++ b/t/t3900/UTF-8-invalid.txt
> @@ -0,0 +1,3 @@
> +Commit message
> +
> +Invalid surrogate:???

I suspect that I did not receive what you intended to send.  Do you
want to send this part as a binary patch perhaps?

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH 1/2] commit: reject invalid UTF-8 codepoints
  2013-06-30  2:13 ` Junio C Hamano
@ 2013-06-30  2:29   ` brian m. carlson
  2013-06-30 19:29     ` Junio C Hamano
  2013-07-04  0:17   ` brian m. carlson
  1 sibling, 1 reply; 7+ messages in thread
From: brian m. carlson @ 2013-06-30  2:29 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 2279 bytes --]

On Sat, Jun 29, 2013 at 07:13:40PM -0700, Junio C Hamano wrote:
> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
> Does this correspond to the following comment in the same file, and
> if so, shouldn't this part of your patch?

Yes, yes, it should.

> We used to allow te original up-to-6 form and this update is about
> rejecting anything above U+10FFFF (in line with e.g. RFC 3629)?

Correct.  Since no codepoints above U+10FFFF will ever be assigned,
there's no point in trying to allow those UTF-8 sequences, since they
will be caught below anyway.  Some four-byte sequences will produce
too-large codepoints, but all five- and six-byte sequences are
guaranteed to.

> As you use separate if statements for this check, perhaps you can
> give each of them a better comment to say what each is rejecting?
> E.g.
> 
> 	/* do not allow range for surrogate pair */
> 		if (codepoint >= 0xd800 && codepoint <= 0xdfff)
> 			return bad_offset;

Sure.

> As that comment I quoted at the beginning says, we did not check for
> invalid encoded values and the primary reason for it is beccause
> this function did not want to look into the actual values here.  But
> now you are looking into "codepoint", you can now also check for
> "overlong" form (e.g. sequence "C0 80" turning into U+0000)?

Correct.  That's what my second patch does.  I thought I would make the
changes separately, since they're slightly different and from what I've
seen git development prefers small, independent patches, but if you'd
prefer, I can squash them into one patch.

> Perhaps
> 
>         test_expect_success 'test name' '
>                 test_when_finished "rm -f $HOME/stderr" &&

I wasn't aware that existed, but it makes sense to use it.

> > +Invalid surrogate:???
> 
> I suspect that I did not receive what you intended to send.  Do you
> want to send this part as a binary patch perhaps?

If you ended up with an encoding of U+D800, then you got it.  Otherwise,
I can resend it as a binary patch during the reroll.

-- 
brian m. carlson / brian with sandals: Houston, Texas, US
+1 832 623 2791 | http://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: RSA v4 4096b: 88AC E9B2 9196 305B A994 7552 F1BA 225C 0223 B187

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 1/2] commit: reject invalid UTF-8 codepoints
  2013-06-30  2:29   ` brian m. carlson
@ 2013-06-30 19:29     ` Junio C Hamano
  0 siblings, 0 replies; 7+ messages in thread
From: Junio C Hamano @ 2013-06-30 19:29 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git

"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> On Sat, Jun 29, 2013 at 07:13:40PM -0700, Junio C Hamano wrote:
>> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
>> Does this correspond to the following comment in the same file, and
>> if so, shouldn't this part of your patch?
>
> Yes, yes, it should.
> ...
>> As that comment I quoted at the beginning says, we did not check for
>> invalid encoded values and the primary reason for it is beccause
>> this function did not want to look into the actual values here.  But
>> now you are looking into "codepoint", you can now also check for
>> "overlong" form (e.g. sequence "C0 80" turning into U+0000)?
>
> Correct.  That's what my second patch does.

Ah, I started with that quoted comment when I saw 1/2, thinking that
you may not even have noticed that these two things are our "todo"
list, hence I thought it would be natural to do them both in the
same commit as the logic to do so is in the single place (the
function you patched).  It is not _wrong_ per-se to do this in two
steps, but I do not think it is necessary.  You *could* break this
to even smaller steps to go to the other extreme, first limiting
only upto 4-byte forms, then rejecting a half of the 4-byte form,
then rejecting the range for surrogates, and then rejecting overlong
forms, and the split may be technically logical and "one step at a
time" but that kind of granularity is probably more noisy than it is
valuable.

I think "1/2 is about rejecting a codepoint outside valid ranges and
2/2 is about rejecting a valid codepoint inside the valid range but
expressed in an invalid way" is on the borderline between the two
extremes, but it probably is on the saner side, so let's keep these
two patches separate.

> If you ended up with an encoding of U+D800, then you got it.

I think we are seeing ? U+003F in our mailboxes.

Thanks.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 1/2] commit: reject invalid UTF-8 codepoints
  2013-06-29 17:40 [PATCH 1/2] commit: reject invalid UTF-8 codepoints brian m. carlson
  2013-06-30  2:13 ` Junio C Hamano
@ 2013-07-01  7:12 ` Peter Krefting
  1 sibling, 0 replies; 7+ messages in thread
From: Peter Krefting @ 2013-07-01  7:12 UTC (permalink / raw)
  To: brian m. carlson; +Cc: Git Mailing List, gitster

brian m. carlson:

> +		/* Check the value here */
> +		if (codepoint >= 0xd800 && codepoint <= 0xdfff)
> +			return bad_offset;

if ((x & 0xFFFFF800) == 0xD800)

is slightly shorter, albeit a bit more difficult to read.

Please also consider adding some commentary as to what you are 
checking here, as it is not obvious unless you know Unicode well.

-- 
\\// Peter - http://www.softwolves.pp.se/

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 1/2] commit: reject invalid UTF-8 codepoints
  2013-06-30  2:13 ` Junio C Hamano
  2013-06-30  2:29   ` brian m. carlson
@ 2013-07-04  0:17   ` brian m. carlson
  2013-07-04  0:47     ` Junio C Hamano
  1 sibling, 1 reply; 7+ messages in thread
From: brian m. carlson @ 2013-07-04  0:17 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 896 bytes --]

On Sat, Jun 29, 2013 at 07:13:40PM -0700, Junio C Hamano wrote:
> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
> > diff --git a/t/t3900/UTF-8-invalid.txt b/t/t3900/UTF-8-invalid.txt
> > new file mode 100644
> > index 0000000..343684d
> > --- /dev/null
> > +++ b/t/t3900/UTF-8-invalid.txt
> > @@ -0,0 +1,3 @@
> > +Commit message
> > +
> > +Invalid surrogate:???
> 
> I suspect that I did not receive what you intended to send.  Do you
> want to send this part as a binary patch perhaps?

git format-patch --binary seems to produce the exact same output as
without the --binary option.  Is there a different way I should be
selecting the binary option?

-- 
brian m. carlson / brian with sandals: Houston, Texas, US
+1 832 623 2791 | http://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: RSA v4 4096b: 88AC E9B2 9196 305B A994 7552 F1BA 225C 0223 B187

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 1/2] commit: reject invalid UTF-8 codepoints
  2013-07-04  0:17   ` brian m. carlson
@ 2013-07-04  0:47     ` Junio C Hamano
  0 siblings, 0 replies; 7+ messages in thread
From: Junio C Hamano @ 2013-07-04  0:47 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git

"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> git format-patch --binary seems to produce the exact same output as
> without the --binary option.  Is there a different way I should be
> selecting the binary option?

These days --binary is a no-op option for format-patch (we used to
default to showing "binary files differ" in diff and required the
option to force generating binary patch), but the real issue is that
you haven't told Git that these "invalid" files are binary files and
not text X-<.

Perhaps add something like this temporarily to .git/info/attributes
when running the command?

	$ cat >>.git/info/attributes <<\EOF
        t/t3900/UTF-8-invalid.txt	binary
        EOF

Another idea is _not_ to create these invalid files inside t3900/
but create them in the set-up part of the test script to the current
directory, using "printf" to make sure the readers who read test
know what exact byte sequence is written out in the test and is
used, which might be a better option in the long run.

Thanks.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2013-07-04  0:47 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-06-29 17:40 [PATCH 1/2] commit: reject invalid UTF-8 codepoints brian m. carlson
2013-06-30  2:13 ` Junio C Hamano
2013-06-30  2:29   ` brian m. carlson
2013-06-30 19:29     ` Junio C Hamano
2013-07-04  0:17   ` brian m. carlson
2013-07-04  0:47     ` Junio C Hamano
2013-07-01  7:12 ` Peter Krefting

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).