git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Junio C Hamano <gitster@pobox.com>
To: "brian m. carlson" <sandals@crustytoothpaste.net>
Cc: git@vger.kernel.org
Subject: Re: [PATCH 1/2] commit: reject invalid UTF-8 codepoints
Date: Sat, 29 Jun 2013 19:13:40 -0700	[thread overview]
Message-ID: <7v7ghcl50r.fsf@alter.siamese.dyndns.org> (raw)
In-Reply-To: <20130629174023.GB226907@vauxhall.crustytoothpaste.net> (brian m. carlson's message of "Sat, 29 Jun 2013 17:40:23 +0000")

"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> The commit code already contains code for validating UTF-8, but it
> does not check for invalid values, such as guaranteed
> non-characters and surrogates.  Fix this by explicitly checking
> for and rejecting such characters.
>
> Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
> ---

Does this correspond to the following comment in the same file, and
if so, shouldn't this part of your patch?

diff --git a/commit.c b/commit.c
index 888e02a..8ce20d5 100644
--- a/commit.c
+++ b/commit.c
@@ -1292,8 +1292,8 @@ static int find_invalid_utf8(const char *buf, int len)
  * If it isn't, it assumes any non-utf8 characters are Latin1,
  * and does the conversion.
  *
- * Fixme: we should probably also disallow overlong forms and
- * invalid characters. But we don't do that currently.
+ * Fixme: we should probably also disallow overlong forms.
+ * But we don't do that fully, yet.
  */
 static int verify_utf8(struct strbuf *buf)
 {

> @@ -1264,24 +1265,35 @@ static int find_invalid_utf8(const char *buf, int len)
>  			bytes++;
>  		}
>  
> -		/* Must be between 1 and 5 more bytes */
> -		if (bytes < 1 || bytes > 5)
> +		/* Must be between 1 and 3 more bytes */
> +		if (bytes < 1 || bytes > 3)
>  			return bad_offset;

We used to allow te original up-to-6 form and this update is about
rejecting anything above U+10FFFF (in line with e.g. RFC 3629)?

>  		/* Do we *have* that many bytes? */
>  		if (len < bytes)
>  			return bad_offset;
>  
> +		/* Place the encoded bits at the bottom of the value. */
> +		codepoint = (c & 0x7f) >> bytes;
> +
>  		offset += bytes;
>  		len -= bytes;
>  
>  		/* And verify that they are good continuation bytes */
>  		do {
> +			codepoint <<= 6;
> +			codepoint |= *buf & 0x3f;
>  			if ((*buf++ & 0xc0) != 0x80)
>  				return bad_offset;
>  		} while (--bytes);
>  
> -		/* We could/should check the value and length here too */
> +		/* Check the value here */
> +		if (codepoint >= 0xd800 && codepoint <= 0xdfff)
> +			return bad_offset;

As you use separate if statements for this check, perhaps you can
give each of them a better comment to say what each is rejecting?
E.g.

	/* do not allow range for surrogate pair */
		if (codepoint >= 0xd800 && codepoint <= 0xdfff)
			return bad_offset;

> +		if (codepoint > 0x10ffff)
> +			return bad_offset;
> +		if ((codepoint & 0x1ffffe) == 0xfffe)
> +			return bad_offset;

As that comment I quoted at the beginning says, we did not check for
invalid encoded values and the primary reason for it is beccause
this function did not want to look into the actual values here.  But
now you are looking into "codepoint", you can now also check for
"overlong" form (e.g. sequence "C0 80" turning into U+0000)?

>  	}
>  	return -1;
>  }
> diff --git a/t/t3900-i18n-commit.sh b/t/t3900-i18n-commit.sh
> index 37ddabb..16ed707 100755
> --- a/t/t3900-i18n-commit.sh
> +++ b/t/t3900-i18n-commit.sh
> @@ -39,6 +39,15 @@ test_expect_failure 'UTF-16 refused because of NULs' '
>  	git commit -a -F "$TEST_DIRECTORY"/t3900/UTF-16.txt
>  '
>  
> +test_expect_success 'UTF-8 invalid characters refused' '
> +	rm -f "$HOME/stderr" &&
> +	echo "UTF-8 characters" >F &&
> +	git commit -a -F "$TEST_DIRECTORY"/t3900/UTF-8-invalid.txt \
> +		2>"$HOME"/stderr &&
> +	grep "did not conform" "$HOME"/stderr
> +'
> +
> +rm -f "$HOME/stderr"

Perhaps

        test_expect_success 'test name' '
                test_when_finished "rm -f $HOME/stderr" &&
                "rm -f $HOME/stderr" &&
                echo ...
                grep "did not conform" ...
        '

> diff --git a/t/t3900/UTF-8-invalid.txt b/t/t3900/UTF-8-invalid.txt
> new file mode 100644
> index 0000000..343684d
> --- /dev/null
> +++ b/t/t3900/UTF-8-invalid.txt
> @@ -0,0 +1,3 @@
> +Commit message
> +
> +Invalid surrogate:???

I suspect that I did not receive what you intended to send.  Do you
want to send this part as a binary patch perhaps?

  reply	other threads:[~2013-06-30  2:14 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-06-29 17:40 [PATCH 1/2] commit: reject invalid UTF-8 codepoints brian m. carlson
2013-06-30  2:13 ` Junio C Hamano [this message]
2013-06-30  2:29   ` brian m. carlson
2013-06-30 19:29     ` Junio C Hamano
2013-07-04  0:17   ` brian m. carlson
2013-07-04  0:47     ` Junio C Hamano
2013-07-01  7:12 ` Peter Krefting

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7v7ghcl50r.fsf@alter.siamese.dyndns.org \
    --to=gitster@pobox.com \
    --cc=git@vger.kernel.org \
    --cc=sandals@crustytoothpaste.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).