[PATCH v2 0/2] commit: improve UTF-8 validation

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* [PATCH v2 0/2] commit: improve UTF-8 validation
@ 2013-07-04 17:17 brian m. carlson
  2013-07-04 17:19 ` [PATCH v2 1/2] commit: reject invalid UTF-8 codepoints brian m. carlson
  2013-07-04 17:20 ` [PATCH v2 2/2] commit: reject overlong UTF-8 sequences brian m. carlson
  0 siblings, 2 replies; 11+ messages in thread
From: brian m. carlson @ 2013-07-04 17:17 UTC (permalink / raw)
  To: git; +Cc: gitster

This series contains a pair of patches that improve the validation of
the UTF-8 used in commit messages.  Invalid codepoints, such as
surrogates and guaranteed non-characters, are rejected, along with
overlong UTF-8 sequences.

Changes from v1:

* Improved comments to aid those less familiar with Unicode.
* Generated test files using printf as part of the test.
* Removed FIXME comments for things that have been fixed.
* Use a shorter form for detecting surrogate pairs.

brian m. carlson (2):
  commit: reject invalid UTF-8 codepoints
  commit: reject overlong UTF-8 sequences

 commit.c               | 34 ++++++++++++++++++++++++++++------
 t/t3900-i18n-commit.sh | 23 +++++++++++++++++++++++
 2 files changed, 51 insertions(+), 6 deletions(-)

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v2 1/2] commit: reject invalid UTF-8 codepoints
  2013-07-04 17:17 [PATCH v2 0/2] commit: improve UTF-8 validation brian m. carlson
@ 2013-07-04 17:19 ` brian m. carlson
  2013-07-04 19:58   ` Torsten Bögershausen
  2013-07-05 12:51   ` Peter Krefting
  2013-07-04 17:20 ` [PATCH v2 2/2] commit: reject overlong UTF-8 sequences brian m. carlson
  1 sibling, 2 replies; 11+ messages in thread
From: brian m. carlson @ 2013-07-04 17:19 UTC (permalink / raw)
  To: git; +Cc: gitster

The commit code already contains code for validating UTF-8, but it does not
check for invalid values, such as guaranteed non-characters and surrogates.  Fix
this by explicitly checking for and rejecting such characters.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 commit.c               | 27 ++++++++++++++++++++++-----
 t/t3900-i18n-commit.sh | 12 ++++++++++++
 2 files changed, 34 insertions(+), 5 deletions(-)

diff --git a/commit.c b/commit.c
index 888e02a..2264106 100644
--- a/commit.c
+++ b/commit.c
@@ -1244,6 +1244,7 @@ static int find_invalid_utf8(const char *buf, int len)
 	while (len) {
 		unsigned char c = *buf++;
 		int bytes, bad_offset;
+		unsigned int codepoint;
 
 		len--;
 		offset++;
@@ -1264,24 +1265,40 @@ static int find_invalid_utf8(const char *buf, int len)
 			bytes++;
 		}
 
-		/* Must be between 1 and 5 more bytes */
-		if (bytes < 1 || bytes > 5)
+		/*
+		 * Must be between 1 and 3 more bytes.  Longer sequences result in
+		 * codepoints beyond U+10FFFF, which are guaranteed never to exist.
+		 */
+		if (bytes < 1 || bytes > 3)
 			return bad_offset;
 
 		/* Do we *have* that many bytes? */
 		if (len < bytes)
 			return bad_offset;
 
+		/* Place the encoded bits at the bottom of the value. */
+		codepoint = (c & 0x7f) >> bytes;
+
 		offset += bytes;
 		len -= bytes;
 
 		/* And verify that they are good continuation bytes */
 		do {
+			codepoint <<= 6;
+			codepoint |= *buf & 0x3f;
 			if ((*buf++ & 0xc0) != 0x80)
 				return bad_offset;
 		} while (--bytes);
 
-		/* We could/should check the value and length here too */
+		/* No codepoints can ever be allocated beyond U+10FFFF. */
+		if (codepoint > 0x10ffff)
+			return bad_offset;
+		/* Surrogates are only for UTF-16 and cannot be encoded in UTF-8. */
+		if ((codepoint & 0x1ff800) == 0xd800)
+			return bad_offset;
+		/* U+FFFE and U+FFFF are guaranteed non-characters. */
+		if ((codepoint & 0x1ffffe) == 0xfffe)
+			return bad_offset;
 	}
 	return -1;
 }
@@ -1292,8 +1309,8 @@ static int find_invalid_utf8(const char *buf, int len)
  * If it isn't, it assumes any non-utf8 characters are Latin1,
  * and does the conversion.
  *
- * Fixme: we should probably also disallow overlong forms and
- * invalid characters. But we don't do that currently.
+ * Fixme: we should probably also disallow overlong forms.
+ * But we don't do that currently.
  */
 static int verify_utf8(struct strbuf *buf)
 {
diff --git a/t/t3900-i18n-commit.sh b/t/t3900-i18n-commit.sh
index 37ddabb..ee8ba6c 100755
--- a/t/t3900-i18n-commit.sh
+++ b/t/t3900-i18n-commit.sh
@@ -39,6 +39,18 @@ test_expect_failure 'UTF-16 refused because of NULs' '
 	git commit -a -F "$TEST_DIRECTORY"/t3900/UTF-16.txt
 '
 
+test_expect_success 'UTF-8 invalid characters refused' '
+	test_when_finished "rm -f $HOME/stderr $HOME/invalid" && 
+	rm -f "$HOME/stderr" &&
+	echo "UTF-8 characters" >F &&
+	printf "Commit message\n\nInvalid surrogate:\355\240\200\n" \
+		>"$HOME/invalid" &&
+	git commit -a -F "$HOME/invalid" \
+		2>"$HOME"/stderr &&
+	grep "did not conform" "$HOME"/stderr
+'
+
+rm -f "$HOME/stderr"
 
 for H in ISO8859-1 eucJP ISO-2022-JP
 do
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/2] commit: reject invalid UTF-8 codepoints
  2013-07-04 17:19 ` [PATCH v2 1/2] commit: reject invalid UTF-8 codepoints brian m. carlson
@ 2013-07-04 19:58   ` Torsten Bögershausen
  2013-07-04 20:39     ` brian m. carlson
  2013-07-05 12:51   ` Peter Krefting
  1 sibling, 1 reply; 11+ messages in thread
From: Torsten Bögershausen @ 2013-07-04 19:58 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git, gitster

On 2013-07-04 19.19, brian m. carlson wrote:
> The commit code already contains code for validating UTF-8, but it does not
> check for invalid values, such as guaranteed non-characters and surrogates.  Fix
s/guaranteed non-characters/code points out of range/
> this by explicitly checking for and rejecting such characters.
Do we really reject them, or do we (only) warn about them ? 

Other question:
Now that we have a check for codepoints out of range, beyond U+10FFFF,
do we want to have an additional testcase ?


> +test_expect_success 'UTF-8 invalid characters refused' '
May be:
 test_expect_success 'UTF-8 invalid surrogate' '


> +	test_when_finished "rm -f $HOME/stderr $HOME/invalid" && 
> +	rm -f "$HOME/stderr" &&
> +	echo "UTF-8 characters" >F &&
> +	printf "Commit message\n\nInvalid surrogate:\355\240\200\n" \
> +		>"$HOME/invalid" &&
good
> +	git commit -a -F "$HOME/invalid" \
> +		2>"$HOME"/stderr &&
> +	grep "did not conform" "$HOME"/stderr
> +'
> +
> +rm -f "$HOME/stderr"
Does it make sense to "grep on the fly", like this:
git commit -a -F "$HOME/invalid" 2>&1  | grep "did not conform"

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/2] commit: reject invalid UTF-8 codepoints
  2013-07-04 19:58   ` Torsten Bögershausen
@ 2013-07-04 20:39     ` brian m. carlson
  0 siblings, 0 replies; 11+ messages in thread
From: brian m. carlson @ 2013-07-04 20:39 UTC (permalink / raw)
  To: Torsten Bögershausen; +Cc: git, gitster

[-- Attachment #1: Type: text/plain, Size: 1921 bytes --]

On Thu, Jul 04, 2013 at 09:58:08PM +0200, Torsten Bögershausen wrote:
> On 2013-07-04 19.19, brian m. carlson wrote:
> > The commit code already contains code for validating UTF-8, but it does not
> > check for invalid values, such as guaranteed non-characters and surrogates.  Fix
> s/guaranteed non-characters/code points out of range/

The "such as" is meant to be illustrative, not all-inclusive, and my
patch does check for U+FFFE and U+FFFF, which are guaranteed
non-characters.

> > this by explicitly checking for and rejecting such characters.
> Do we really reject them, or do we (only) warn about them ? 

Well, find_invalid_utf8 rejects them as invalid, and verify_utf8 fixes
them up as if they were Latin-1, and commit_tree_extended warns about
them.  My interpretation was from the point of view of the code that I
touched (find_invalid_utf8), not the binary.  It would be nice if the
binary actually rejected it, too, but that isn't within the scope of
this patch.

> Other question:
> Now that we have a check for codepoints out of range, beyond U+10FFFF,
> do we want to have an additional testcase ?

Sure, why not?

> > +test_expect_success 'UTF-8 invalid characters refused' '
> May be:
>  test_expect_success 'UTF-8 invalid surrogate' '

Since I'll be adding at least one more unit test, as you requested, I'll
change the name.  I suppose I might as well add a test for the
non-characters as well.

> Does it make sense to "grep on the fly", like this:
> git commit -a -F "$HOME/invalid" 2>&1  | grep "did not conform"

I am interested in making sure that git commit succeeds, and using a
pipe will cause any failure of git commit to be ignored.

-- 
brian m. carlson / brian with sandals: Houston, Texas, US
+1 832 623 2791 | http://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: RSA v4 4096b: 88AC E9B2 9196 305B A994 7552 F1BA 225C 0223 B187

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/2] commit: reject invalid UTF-8 codepoints
  2013-07-04 17:19 ` [PATCH v2 1/2] commit: reject invalid UTF-8 codepoints brian m. carlson
  2013-07-04 19:58   ` Torsten Bögershausen
@ 2013-07-05 12:51   ` Peter Krefting
  2013-07-08 19:36     ` Junio C Hamano
  1 sibling, 1 reply; 11+ messages in thread
From: Peter Krefting @ 2013-07-05 12:51 UTC (permalink / raw)
  To: brian m. carlson; +Cc: Git Mailing List, gitster

brian m. carlson:

> +		/* U+FFFE and U+FFFF are guaranteed non-characters. */
> +		if ((codepoint & 0x1ffffe) == 0xfffe)
> +			return bad_offset;

I missed this the first time around: All Unicode characters whose 
lower 16-bits are FFFE or FFFF are non-characters, so you can re-write 
that to:

   /* U+xxFFFE and U+xxFFFF are guaranteed non-characters. */
   if ((codepoint & 0xfffe) == 0xfffe)
    return bad_offset;

Also, the range U+FDD0--U+FDEF are also non-characters, if you wish to 
be really pedantic.

$ grep '^[0-9A-F].*<not a' NamesList.txt
FDD0	<not a character>
FDD1	<not a character>
FDD2	<not a character>
FDD3	<not a character>
FDD4	<not a character>
FDD5	<not a character>
FDD6	<not a character>
FDD7	<not a character>
FDD8	<not a character>
FDD9	<not a character>
FDDA	<not a character>
FDDB	<not a character>
FDDC	<not a character>
FDDD	<not a character>
FDDE	<not a character>
FDDF	<not a character>
FDE0	<not a character>
FDE1	<not a character>
FDE2	<not a character>
FDE3	<not a character>
FDE4	<not a character>
FDE5	<not a character>
FDE6	<not a character>
FDE7	<not a character>
FDE8	<not a character>
FDE9	<not a character>
FDEA	<not a character>
FDEB	<not a character>
FDEC	<not a character>
FDED	<not a character>
FDEE	<not a character>
FDEF	<not a character>
FFFE	<not a character>
FFFF	<not a character>
1FFFE	<not a character>
1FFFF	<not a character>
2FFFE	<not a character>
2FFFF	<not a character>
3FFFE	<not a character>
3FFFF	<not a character>
4FFFE	<not a character>
4FFFF	<not a character>
5FFFE	<not a character>
5FFFF	<not a character>
6FFFE	<not a character>
6FFFF	<not a character>
7FFFE	<not a character>
7FFFF	<not a character>
8FFFE	<not a character>
8FFFF	<not a character>
9FFFE	<not a character>
9FFFF	<not a character>
AFFFE	<not a character>
AFFFF	<not a character>
BFFFE	<not a character>
BFFFF	<not a character>
CFFFE	<not a character>
CFFFF	<not a character>
DFFFE	<not a character>
DFFFF	<not a character>
EFFFE	<not a character>
EFFFF	<not a character>
FFFFE	<not a character>
FFFFF	<not a character>
10FFFE	<not a character>
10FFFF	<not a character>

-- 
\\// Peter - http://www.softwolves.pp.se/

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/2] commit: reject invalid UTF-8 codepoints
  2013-07-05 12:51   ` Peter Krefting
@ 2013-07-08 19:36     ` Junio C Hamano
  2013-07-09 11:16       ` [PATCH] commit: reject non-characters Peter Krefting
  0 siblings, 1 reply; 11+ messages in thread
From: Junio C Hamano @ 2013-07-08 19:36 UTC (permalink / raw)
  To: Peter Krefting; +Cc: brian m. carlson, Git Mailing List

Peter Krefting <peter@softwolves.pp.se> writes:

> brian m. carlson:
>
>> +		/* U+FFFE and U+FFFF are guaranteed non-characters. */
>> +		if ((codepoint & 0x1ffffe) == 0xfffe)
>> +			return bad_offset;
>
> I missed this the first time around: All Unicode characters whose
> lower 16-bits are FFFE or FFFF are non-characters, so you can re-write
> that to:
>
>   /* U+xxFFFE and U+xxFFFF are guaranteed non-characters. */
>   if ((codepoint & 0xfffe) == 0xfffe)
>    return bad_offset;
>
> Also, the range U+FDD0--U+FDEF are also non-characters, if you wish to
> be really pedantic.

Yeah, while we are at it, doing this may not hurt.  I think Brian's
two patches are in fairly good shape otherwise, so perhaps you can
do this as a follow-up patch on top of the tip of the topic,
e82bd6cc (commit: reject overlong UTF-8 sequences, 2013-07-04)?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH] commit: reject non-characters
  2013-07-08 19:36     ` Junio C Hamano
@ 2013-07-09 11:16       ` Peter Krefting
  2013-08-05 12:48         ` Peter Krefting
  0 siblings, 1 reply; 11+ messages in thread
From: Peter Krefting @ 2013-07-09 11:16 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: brian m. carlson, Git Mailing List

Unicode clause D14 defines all characters U+nFFFE and U+nFFFF (where
0 <= n <= 10h) as well as the range U+FDD0..U+FDEF as non-characters,
reserved for internal use only.  Disallow these characters in commit
messages as they are normally not recommended for interchange.

Signed-off-by: Peter Krefting <peter@softwolves.pp.se>
---
Junio C Hamano:

> Yeah, while we are at it, doing this may not hurt.  I think Brian's
> two patches are in fairly good shape otherwise, so perhaps you can
> do this as a follow-up patch on top of the tip of the topic,
> e82bd6cc (commit: reject overlong UTF-8 sequences, 2013-07-04)?

OK, here you are. Enjoy :)

  commit.c               |  7 +++++--
  t/t3900-i18n-commit.sh | 18 ++++++++++++++++++
  2 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/commit.c b/commit.c
index 5097dba..0587732 100644
--- a/commit.c
+++ b/commit.c
@@ -1305,8 +1305,11 @@ static int find_invalid_utf8(const char *buf, int len)
  		/* Surrogates are only for UTF-16 and cannot be encoded in UTF-8. */
  		if ((codepoint & 0x1ff800) == 0xd800)
  			return bad_offset;
-		/* U+FFFE and U+FFFF are guaranteed non-characters. */
-		if ((codepoint & 0x1ffffe) == 0xfffe)
+		/* U+xxFFFE and U+xxFFFF are guaranteed non-characters. */
+		if ((codepoint & 0xffffe) == 0xfffe)
+			return bad_offset;
+		/* So are anything in the range U+FDD0..U+FDEF. */
+		if (codepoint >= 0xfdd0 && codepoint <= 0xfdef)
  			return bad_offset;
  	}
  	return -1;
diff --git a/t/t3900-i18n-commit.sh b/t/t3900-i18n-commit.sh
index 051ea9d..38b00c3 100755
--- a/t/t3900-i18n-commit.sh
+++ b/t/t3900-i18n-commit.sh
@@ -58,6 +58,24 @@ test_expect_success 'UTF-8 overlong sequences rejected' '
  	grep "did not conform" "$HOME"/stderr
  '

+test_expect_success 'UTF-8 non-characters refused' '
+	test_when_finished "rm -f $HOME/stderr $HOME/invalid" &&
+	echo "UTF-8 non-character 1" >F &&
+	printf "Commit message\n\nNon-character:\364\217\277\276\n" \
+		>"$HOME/invalid" &&
+	git commit -a -F "$HOME/invalid" 2>"$HOME"/stderr &&
+	grep "did not conform" "$HOME"/stderr
+'
+
+test_expect_success 'UTF-8 non-characters refused' '
+	test_when_finished "rm -f $HOME/stderr $HOME/invalid" &&
+	echo "UTF-8 non-character 2." >F &&
+	printf "Commit message\n\nNon-character:\357\267\220\n" \
+		>"$HOME/invalid" &&
+	git commit -a -F "$HOME/invalid" 2>"$HOME"/stderr &&
+	grep "did not conform" "$HOME"/stderr
+'
+
  for H in ISO8859-1 eucJP ISO-2022-JP
  do
  	test_expect_success "$H setup" '
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] commit: reject non-characters
  2013-07-09 11:16       ` [PATCH] commit: reject non-characters Peter Krefting
@ 2013-08-05 12:48         ` Peter Krefting
  2013-08-05 16:54           ` Junio C Hamano
  0 siblings, 1 reply; 11+ messages in thread
From: Peter Krefting @ 2013-08-05 12:48 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: brian m. carlson, Git Mailing List

Peter Krefting:

> -		/* U+FFFE and U+FFFF are guaranteed non-characters. */
> -		if ((codepoint & 0x1ffffe) == 0xfffe)
> +		/* U+xxFFFE and U+xxFFFF are guaranteed non-characters. */
> +		if ((codepoint & 0xffffe) == 0xfffe)
> +			return bad_offset;

Drats, there is an F too many in the bitmask, it should be:

  +		if ((codepoint & 0xfffe) == 0xfffe)

-- 
\\// Peter - http://www.softwolves.pp.se/

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] commit: reject non-characters
  2013-08-05 12:48         ` Peter Krefting
@ 2013-08-05 16:54           ` Junio C Hamano
  2013-08-06  7:03             ` Peter Krefting
  0 siblings, 1 reply; 11+ messages in thread
From: Junio C Hamano @ 2013-08-05 16:54 UTC (permalink / raw)
  To: Peter Krefting; +Cc: brian m. carlson, Git Mailing List

Peter Krefting <peter@softwolves.pp.se> writes:

> Peter Krefting:
>
>> -		/* U+FFFE and U+FFFF are guaranteed non-characters. */
>> -		if ((codepoint & 0x1ffffe) == 0xfffe)
>> +		/* U+xxFFFE and U+xxFFFF are guaranteed non-characters. */
>> +		if ((codepoint & 0xffffe) == 0xfffe)
>> +			return bad_offset;
>
> Drats, there is an F too many in the bitmask, it should be:
>
>  +		if ((codepoint & 0xfffe) == 0xfffe)

Indeed.

-- >8 --
Subject: [PATCH] commit: typofix for xxFFF[EF] check

We wanted to catch all codepoints that ends with FFFE and FFFF,
not with 0FFFE and 0FFFF.

Noticed and corrected by Peter Krefting.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 commit.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index 7dcfeea..38d8979 100644
--- a/commit.c
+++ b/commit.c
@@ -1306,7 +1306,7 @@ static int find_invalid_utf8(const char *buf, int len)
 		if ((codepoint & 0x1ff800) == 0xd800)
 			return bad_offset;
 		/* U+xxFFFE and U+xxFFFF are guaranteed non-characters. */
-		if ((codepoint & 0xffffe) == 0xfffe)
+		if ((codepoint & 0xfffe) == 0xfffe)
 			return bad_offset;
 		/* So are anything in the range U+FDD0..U+FDEF. */
 		if (codepoint >= 0xfdd0 && codepoint <= 0xfdef)
-- 
1.8.4-rc1-129-g1f3472b

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] commit: reject non-characters
  2013-08-05 16:54           ` Junio C Hamano
@ 2013-08-06  7:03             ` Peter Krefting
  0 siblings, 0 replies; 11+ messages in thread
From: Peter Krefting @ 2013-08-06  7:03 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: brian m. carlson, Git Mailing List

Junio C Hamano:

> Indeed.

Thanks. Testcases are good, but not if they don't actually catch the 
bug one has just introduced :-)

> -- >8 --
> Subject: [PATCH] commit: typofix for xxFFF[EF] check
>
> We wanted to catch all codepoints that ends with FFFE and FFFF,
> not with 0FFFE and 0FFFF.
>
> Noticed and corrected by Peter Krefting.
>
> Signed-off-by: Junio C Hamano <gitster@pobox.com>

Signed-off-by: Peter Krefting <peter@softwolves.pp.se>

-- 
\\// Peter - http://www.softwolves.pp.se/

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v2 2/2] commit: reject overlong UTF-8 sequences
  2013-07-04 17:17 [PATCH v2 0/2] commit: improve UTF-8 validation brian m. carlson
  2013-07-04 17:19 ` [PATCH v2 1/2] commit: reject invalid UTF-8 codepoints brian m. carlson
@ 2013-07-04 17:20 ` brian m. carlson
  1 sibling, 0 replies; 11+ messages in thread
From: brian m. carlson @ 2013-07-04 17:20 UTC (permalink / raw)
  To: git; +Cc: gitster

The commit code accepts pseudo-UTF-8 sequences that encode a character with more
bytes than necessary.  Reject such sequences, since they are not valid UTF-8.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 commit.c               | 17 +++++++++++------
 t/t3900-i18n-commit.sh | 11 +++++++++++
 2 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/commit.c b/commit.c
index 2264106..b59c187 100644
--- a/commit.c
+++ b/commit.c
@@ -1240,11 +1240,15 @@ int commit_tree(const struct strbuf *msg, unsigned char *tree,
 static int find_invalid_utf8(const char *buf, int len)
 {
 	int offset = 0;
+	static const unsigned int max_codepoint[] = {
+		0x7f, 0x7ff, 0xffff, 0x10ffff
+	};
 
 	while (len) {
 		unsigned char c = *buf++;
 		int bytes, bad_offset;
 		unsigned int codepoint;
+		unsigned int min_val, max_val;
 
 		len--;
 		offset++;
@@ -1276,8 +1280,12 @@ static int find_invalid_utf8(const char *buf, int len)
 		if (len < bytes)
 			return bad_offset;
 
-		/* Place the encoded bits at the bottom of the value. */
+		/* Place the encoded bits at the bottom of the value and compute the
+		 * valid range.
+		 */
 		codepoint = (c & 0x7f) >> bytes;
+		min_val = max_codepoint[bytes-1] + 1;
+		max_val = max_codepoint[bytes];
 
 		offset += bytes;
 		len -= bytes;
@@ -1290,8 +1298,8 @@ static int find_invalid_utf8(const char *buf, int len)
 				return bad_offset;
 		} while (--bytes);
 
-		/* No codepoints can ever be allocated beyond U+10FFFF. */
-		if (codepoint > 0x10ffff)
+		/* Reject codepoints that are out of range for the sequence length. */
+		if (codepoint < min_val || codepoint > max_val)
 			return bad_offset;
 		/* Surrogates are only for UTF-16 and cannot be encoded in UTF-8. */
 		if ((codepoint & 0x1ff800) == 0xd800)
@@ -1308,9 +1316,6 @@ static int find_invalid_utf8(const char *buf, int len)
  *
  * If it isn't, it assumes any non-utf8 characters are Latin1,
  * and does the conversion.
- *
- * Fixme: we should probably also disallow overlong forms.
- * But we don't do that currently.
  */
 static int verify_utf8(struct strbuf *buf)
 {
diff --git a/t/t3900-i18n-commit.sh b/t/t3900-i18n-commit.sh
index ee8ba6c..94fa1e8 100755
--- a/t/t3900-i18n-commit.sh
+++ b/t/t3900-i18n-commit.sh
@@ -50,6 +50,17 @@ test_expect_success 'UTF-8 invalid characters refused' '
 	grep "did not conform" "$HOME"/stderr
 '
 
+test_expect_success 'UTF-8 overlong sequences rejected' '
+	test_when_finished "rm -f $HOME/stderr $HOME/invalid" &&
+	rm -f "$HOME/stderr" "$HOME/invalid" &&
+	echo "UTF-8 overlong" >F &&
+	printf "\340\202\251ommit message\n\nThis is not a space:\300\240\n" \
+		>"$HOME/invalid" &&
+	git commit -a -F "$HOME/invalid" \
+		2>"$HOME"/stderr &&
+	grep "did not conform" "$HOME"/stderr
+'
+
 rm -f "$HOME/stderr"
 
 for H in ISO8859-1 eucJP ISO-2022-JP
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2013-08-06  7:04 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-07-04 17:17 [PATCH v2 0/2] commit: improve UTF-8 validation brian m. carlson
2013-07-04 17:19 ` [PATCH v2 1/2] commit: reject invalid UTF-8 codepoints brian m. carlson
2013-07-04 19:58   ` Torsten Bögershausen
2013-07-04 20:39     ` brian m. carlson
2013-07-05 12:51   ` Peter Krefting
2013-07-08 19:36     ` Junio C Hamano
2013-07-09 11:16       ` [PATCH] commit: reject non-characters Peter Krefting
2013-08-05 12:48         ` Peter Krefting
2013-08-05 16:54           ` Junio C Hamano
2013-08-06  7:03             ` Peter Krefting
2013-07-04 17:20 ` [PATCH v2 2/2] commit: reject overlong UTF-8 sequences brian m. carlson

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).