git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jeff King <peff@peff.net>
To: Kevin Daudt <me@ikke.info>
Cc: git@vger.kernel.org, Swift Geek <swiftgeek@gmail.com>,
	Junio C Hamano <gitster@pobox.com>
Subject: Re: [PATCH] mailinfo: unescape quoted-pair in header fields
Date: Fri, 16 Sep 2016 15:22:06 -0700	[thread overview]
Message-ID: <20160916222206.jz2d4gpaxxccia5p@sigill.intra.peff.net> (raw)
In-Reply-To: <20160916210204.31282-1-me@ikke.info>

On Fri, Sep 16, 2016 at 11:02:04PM +0200, Kevin Daudt wrote:

> rfc2822 has provisions for quoted strings in structured header fields,
> but also allows for escaping these with so-called quoted-pairs.
> 
> The only thing git currently does is removing exterior quotes, but
> quotes within are left alone.
> 
> Tell mailinfo to remove exterior quotes and remove escape characters from the
> author so that they don't show up in the commits author field.
> 
> Signed-off-by: Kevin Daudt <me@ikke.info>
> ---
> The only thing I could not easily fix is the prevent git am from
> removing any quotes around the author. This is done in fmt_ident,
> which calls `strbuf_addstr_without_crud`. 

Ah, OK. I was wondering where that stripping was being done. That makes
sense, and makes me doubly confident this is the right place to be doing
it, since the other quote-stripping was not even intentional, but just a
side effect of the low-level routines.

I think it is OK to leave it in place. If you really want your name to
be:

  "My Name is Always in Quotes"

then tough luck. Git does not support it via git-am, but nor does it via
git-commit, etc.

>  mailinfo.c                 | 54 ++++++++++++++++++++++++++++++++++++++++++++++
>  t/t5100-mailinfo.sh        |  6 ++++++
>  t/t5100/quoted-pair.expect |  5 +++++
>  t/t5100/quoted-pair.in     |  9 ++++++++
>  4 files changed, 74 insertions(+)
>  create mode 100644 t/t5100/quoted-pair.expect
>  create mode 100644 t/t5100/quoted-pair.in
> 
> diff --git a/mailinfo.c b/mailinfo.c
> index e19abe3..04036f3 100644
> --- a/mailinfo.c
> +++ b/mailinfo.c
> @@ -54,15 +54,69 @@ static void parse_bogus_from(struct mailinfo *mi, const struct strbuf *line)
>  	get_sane_name(&mi->name, &mi->name, &mi->email);
>  }
>  
> +static int unquote_quoted_string(struct strbuf *line)
> +{
> +	struct strbuf outbuf;
> +	const char *in = line->buf;
> +	int c, take_next_literally = 0;
> +	int found_error = 0;
> +	char escape_context=0;

Style: whitespace around "=".

I had to wonder why we needed both escape_context and
take_next_literally; shouldn't we just need a single state bit. But
escape_context is not "escape the next character", it is "we are
currently in a mode where we should be escaping".

Could we give it a more descriptive name? I guess it is more than just
"we are in a mode", but rather "here is the character that will end the
escaped mode". Maybe a comment would be more appropriate.

> +	while ((c = *in++) != 0) {
> +		if (take_next_literally) {
> +			take_next_literally = 0;
> +		} else {

OK, so that means the previous one was backslash-quoted, and we don't do
any other cleverness. Good.

> +			switch (c) {
> +			case '"':
> +				if (!escape_context)
> +					escape_context = '"';
> +				else if (escape_context == '"')
> +					escape_context = 0;
> +				continue;

And here we open or close the quoted portion, depending. Makes sense.

> +			case '\\':
> +				if (escape_context) {
> +					take_next_literally = 1;
> +					continue;
> +				}
> +				break;

I didn't look in the RFC. Is:

  From: my \"name\" <foo@example.com>

really the same as:

  From: "my \\\"name\\\"" <foo@example.com>

? That seems weird, but I think it may be that the former is simply
bogus (you are not supposed to use backslashes outside of the quoted
section at all).

> +			case '(':
> +				if (!escape_context)
> +					escape_context = '(';
> +				else if (escape_context == '(')
> +					found_error = 1;
> +				break;

Hmm. Is:

  From: Name (Comment with (another comment))

really disallowed? RFC2822 seems to say that "comment" can contain
"ccontent", which can itself be a comment.

This is obviously getting pretty silly, but if we are going to follow
the RFC, I think you actually have to do a recursive parse, and keep
track of an arbitrary depth of context.

I dunno. This method probably covers most cases in practice, and it's
easy to reason about.

> +			case ')':
> +				if (escape_context == '(')
> +					escape_context = 0;
> +				break;
> +			}
> +		}
> +
> +		strbuf_addch(&outbuf, c);
> +	}
> +
> +	strbuf_reset(line);
> +	strbuf_addbuf(line, &outbuf);
> +	strbuf_release(&outbuf);

I think you can use strbuf_swap() here to avoid copying the line an
extra time, like:

  strbuf_swap(line, &outbuf);
  strbuf_release(&outbuf);

Another option would be to just:

  in = strbuf_detach(&line);

at the beginning, and then output back into "line".

> +	return found_error;

What happens when we get here and take_next_literally is set? I.e., a
backslash at the end of the string. We'll silently print nothing, which
seems reasonable to me (the other option is to print a literal
backslash).

Ditto, what if escape_context is non-zero? We're in the middle of an
unterminated quoted string (or comment).

I'm fine with silently continuing, but it seems weird that we notice
embedded comments (and return an error), but not these other conditions.

>  static void handle_from(struct mailinfo *mi, const struct strbuf *from)
>  {
>  	char *at;
>  	size_t el;
>  	struct strbuf f;
>  
> +
>  	strbuf_init(&f, from->len);
>  	strbuf_addbuf(&f, from);

Funny extra line?

> +test_expect_success 'mailinfo unescapes rfc2822 quoted-string' '
> +    mkdir quoted-pair &&
> +    git mailinfo /dev/null /dev/null <"$TEST_DIRECTORY"/t5100/quoted-pair.in >quoted-pair/info &&
> +    test_cmp "$TEST_DIRECTORY"/t5100/quoted-pair.expect quoted-pair/info
> +'

We usually break long lines with backslash-escapes. Like:

  git mailinfo /dev/null /dev/null \
	<"$TEST_DIRECTORY"/t5100/quoted-pair.in \
	>quoted-pair/info

I'd also wonder if things might be made much more readable by putting
"$TEST_DIRECTORY/t5100" into a shorter variable like $data or something.
That would be best done as a preparatory patch which updates all of the
tests.

> --- /dev/null
> +++ b/t/t5100/quoted-pair.in
> @@ -0,0 +1,9 @@
> +From 1234567890123456789012345678901234567890 Mon Sep 17 00:00:00 2001
> +From: "Author \"The Author\" Name" <somebody@example.com>
> +Date: Sun, 25 May 2008 00:38:18 -0700
> +Subject: [PATCH] testing quoted-pair

I do not care that much about the "()" comment behavior myself, but if
we are going to implement it, it probably makes sense to protect it from
regression with a test.

-Peff

  reply	other threads:[~2016-09-16 22:22 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-09-16 21:02 [PATCH] mailinfo: unescape quoted-pair in header fields Kevin Daudt
2016-09-16 22:22 ` Jeff King [this message]
2016-09-19 10:51   ` Kevin Daudt
2016-09-20  3:57     ` Jeff King
2016-09-21 16:07       ` Junio C Hamano
2016-09-19 18:54 ` [PATCH v2 0/2] Handle escape characters in From field Kevin Daudt
2016-09-25 21:08   ` [PATCH v3 1/2] t5100-mailinfo: replace common path prefix with variable Kevin Daudt
2016-09-25 21:08     ` [PATCH v3 2/2] mailinfo: unescape quoted-pair in header fields Kevin Daudt
2016-09-26 19:11       ` Junio C Hamano
2016-09-26 19:26         ` Junio C Hamano
2016-09-26 19:44           ` Kevin Daudt
2016-09-26 22:23             ` Junio C Hamano
2016-09-27 10:26               ` Kevin Daudt
2016-09-26 19:06     ` [PATCH v3 1/2] t5100-mailinfo: replace common path prefix with variable Junio C Hamano
2016-09-28 19:49     ` [PATCH v4 0/2] Handle RFC2822 quoted-pairs in From header Kevin Daudt
2016-09-28 19:52       ` [PATCH v4 1/2] t5100-mailinfo: replace common path prefix with variable Kevin Daudt
2016-09-28 20:21         ` Junio C Hamano
2016-09-28 20:27           ` Kevin Daudt
2016-09-28 19:52       ` [PATCH v4 2/2] mailinfo: unescape quoted-pair in header fields Kevin Daudt
2016-09-19 18:54 ` [PATCH v2 1/2] t5100-mailinfo: replace common path prefix with variable Kevin Daudt
2016-09-19 21:16   ` Junio C Hamano
2016-09-20  3:59     ` Jeff King
2016-09-19 18:54 ` [PATCH v2 2/2] mailinfo: unescape quoted-pair in header fields Kevin Daudt
2016-09-19 21:24   ` Junio C Hamano
2016-09-19 22:04     ` Junio C Hamano
2016-09-20  4:28   ` Jeff King
2016-09-21 11:09   ` Jeff King
2016-09-22 22:17     ` Junio C Hamano
2016-09-23  4:15       ` Jeff King
2016-09-25 20:17         ` Kevin Daudt
2016-09-25 22:38           ` Jakub Narębski
2016-09-26  5:02             ` Kevin Daudt

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160916222206.jz2d4gpaxxccia5p@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=me@ikke.info \
    --cc=swiftgeek@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).