From: Junio C Hamano <gitster@pobox.com>
To: Jonathan Tan <jonathantanmy@google.com>
Cc: git@vger.kernel.org, peff@peff.net
Subject: Re: [RFC/PATCH 1/3] mailinfo: refactor commit message processing
Date: Fri, 16 Sep 2016 12:12:50 -0700 [thread overview]
Message-ID: <xmqqoa3nk6a5.fsf@gitster.mtv.corp.google.com> (raw)
In-Reply-To: <7dbb4bc0659056211b27f0033c73f0d558efdb54.1474047135.git.jonathantanmy@google.com> (Jonathan Tan's message of "Fri, 16 Sep 2016 10:37:22 -0700")
Jonathan Tan <jonathantanmy@google.com> writes:
> Within the processing of the commit message, check for a scissors line
> or a patchbreak line first (before checking for in-body headers) so that
> a subsequent patch modifying the processing of in-body headers would not
> cause a scissors line or patchbreak line to be misidentified.
>
> If a line could be both an in-body header and a scissors line (for
> example, "From: -- >8 --"), this is considered a fatal error
> (previously, it would be interpreted as an in-body header).
The scissors line is designed to allow garbage other than scissors
and perforation marks to be on the same line, i.e.
/*
* The mark must be at least 8 bytes long (e.g. "-- >8 --").
* Even though there can be arbitrary cruft on the same line
* (e.g. "cut here"), in order to avoid misidentification, the
* perforation must occupy more than a third of the visible
* width of the line, and dashes and scissors must occupy more
* than half of the perforation.
*/
Even though it is not likely for people to do so, it would probably
be nicer if we can treat
From: -- >8 -- cut -- >8 -- >8 -- here -- >8 --
as a scissors line instead of making it a fatal error, by treating
that "From:" as just a random garbage.
But this is a minor point. It is not worth to make it work like so
if the resulting code will become messier.
> The following enumeration shows that processing is the same except (as
> described above) the in-body header + scissors line case.
>
> o in-body header (check_header OK)
> o passes UTF-8 conversion
> o [described above] is scissors line
> o [not possible] is patchbreak line
> o [not possible] is blank line
> o is none of the above - processed as header
> o fails UTF-8 conversion - processed as header
> o not in-body header
> o passes UTF-8 conversion
> o is scissors line - processed as scissors
> o is patchbreak line - processed as patchbreak
> o is blank line - ignored if in header_stage
> o is none of the above - log message
> o fails UTF-8 conversion - input error
>
> As for the result left in "line" (after the invocation of
> handle_commit_msg), it is unused (by its caller, handle_filter, and by
> handle_filter's callers, handle_boundary and handle_body) unless this
> line is a patchbreak line, in which case handle_patch is subsequently
> called (in handle_filter) on "line". In this case, "line" must have
> passed UTF-8 conversion both before and after this patch, so the result
> is still the same overall.
>
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
> mailinfo.c | 145 ++++++++++++++++++++++++++++++++++++++++++++++++-------------
> 1 file changed, 115 insertions(+), 30 deletions(-)
>
> diff --git a/mailinfo.c b/mailinfo.c
> index e19abe3..23a56c2 100644
> --- a/mailinfo.c
> +++ b/mailinfo.c
> @@ -340,23 +340,56 @@ static struct strbuf *decode_b_segment(const struct strbuf *b_seg)
> return out;
> }
>
> -static int convert_to_utf8(struct mailinfo *mi,
> - struct strbuf *line, const char *charset)
> +/*
> + * Attempts to convert line into UTF-8, storing the result in line.
> + *
> + * This differs from convert_to_utf8 in that conversion non-success is not
> + * considered an error case - mi->input_error is not set, and no error message
> + * is printed.
> + *
> + * If the conversion is unnecessary, returns 0 and stores NULL in old_buf (if
> + * old_buf is not NULL).
> + *
> + * If the conversion is successful, returns 0 and stores the unconverted string
> + * in old_buf and old_len (if they are respectively not NULL).
> + *
> + * If the conversion is unsuccessful, returns -1.
> + */
> +static int try_convert_to_utf8(const struct mailinfo *mi, struct strbuf *line,
> + const char *charset, char **old_buf,
> + size_t *old_len)
> {
> - char *out;
> + char *utf8;
>
> - if (!mi->metainfo_charset || !charset || !*charset)
> + if (!mi->metainfo_charset || !charset || !*charset ||
> + same_encoding(mi->metainfo_charset, charset)) {
> + if (old_buf)
> + *old_buf = NULL;
> return 0;
> + }
>
> - if (same_encoding(mi->metainfo_charset, charset))
> + utf8 = reencode_string(line->buf, mi->metainfo_charset, charset);
> + if (utf8) {
> + char *temp = strbuf_detach(line, old_len);
> + if (old_buf)
> + *old_buf = temp;
> + strbuf_attach(line, utf8, strlen(utf8), strlen(utf8));
> return 0;
> - out = reencode_string(line->buf, mi->metainfo_charset, charset);
> - if (!out) {
> + }
> + return -1;
> +}
> +
> +/*
> + * Converts line into UTF-8, setting mi->input_error to -1 upon failure.
> + */
> +static int convert_to_utf8(struct mailinfo *mi,
> + struct strbuf *line, const char *charset)
> +{
> + if (try_convert_to_utf8(mi, line, charset, NULL, NULL)) {
> mi->input_error = -1;
> return error("cannot convert from %s to %s",
> charset, mi->metainfo_charset);
> }
> - strbuf_attach(line, out, strlen(out), strlen(out));
> return 0;
> }
Please split this part into its own patch. IIUC, it moves the meat
of convert_to_utf8() to a more silent try_convert_to_utf8() and then
makes the former a thin wrapper of the latter. Which by itself is a
good change but does not have anything to do with "fix handling of
the in-body headers", other than that the main fix wants to have
such a more silent helper for its own use.
> @@ -515,6 +548,13 @@ static int check_header(struct mailinfo *mi,
> return ret;
> }
>
> +static int check_header_raw(struct mailinfo *mi,
> + char *buf, size_t len,
> + struct strbuf *hdr_data[], int overwrite) {
> + const struct strbuf sb = {0, len, buf};
> + return check_header(mi, &sb, hdr_data, overwrite);
> +}
IIUC, this is a helper for callers that do not have a strbuf but
instead have <buf, len> pair to perform the same check_header() the
callers that have strbuf can do.
As check_header() uses the strbuf as a read-only entity, wrapping
the <buf, len> pair in a temporary strbuf like this is safe.
The incoming <buf> should conceptually be "const char *", but it's
OK.
If check_header() didn't call any helper function that gets passed
&sb as a strbuf, or if convertiong the helper function to take a
<buf, len> pair instead, I would actually suggest refactoring this
the other way around, though. That is, move the implementation of
check_header() to this function, updating its reference to line->buf
and line->len to reference to <buf> and <len>, and then make
check_header() a thin wrapper that does
check_header(mi, const struct strbuf *line,
struct strbuf *hdr_data[], int overwrite)
{
return check_header_raw(mi, line->buf, line->len,
hdr_data, overwrite);
}
I didn't check how involved to update cmp_header() to take <buf,len>
pair. If it does not look too bad, then I think I would prefer to
do it that way, and as before, make that conversion a separate
preparatory patch.
> @@ -623,32 +663,48 @@ static int is_scissors_line(const struct strbuf *line)
> gap * 2 < perforation);
> }
>
> -static int handle_commit_msg(struct mailinfo *mi, struct strbuf *line)
> +static int resembles_rfc2822_header(const struct strbuf *line)
> {
> - assert(!mi->filter_stage);
> + char *c;
>
> - if (mi->header_stage) {
> - if (!line->len || (line->len == 1 && line->buf[0] == '\n'))
> + if (!isalpha(line->buf[0]))
> + return 0;
> +
> + for (c = line->buf + 1; *c != 0; c++) {
> + if (*c == ':')
> + return 1;
> + else if (*c != '-' && !isalpha(*c))
> return 0;
> }
> + return 0;
> +}
Is this helper supposed to handle any rfc2822 looking header, or
only the ones we expect to see as in-body header?
> - if (mi->use_inbody_headers && mi->header_stage) {
> - mi->header_stage = check_header(mi, line, mi->s_hdr_data, 0);
> - if (mi->header_stage)
> - return 0;
> - } else
> - /* Only trim the first (blank) line of the commit message
> - * when ignoring in-body headers.
> - */
> - mi->header_stage = 0;
> +static int handle_commit_msg(struct mailinfo *mi, struct strbuf *line)
> +{
> + int ret = 0;
> + int utf8_result;
> + char *old_buf;
> + size_t old_len;
> +
> + assert(!mi->filter_stage);
>
> - /* normalize the log message to UTF-8. */
> - if (convert_to_utf8(mi, line, mi->charset.buf))
> - return 0; /* mi->input_error already set */
> + /*
> + * Obtain UTF8 for scissors line and patchbreak checks, but retain the
> + * undecoded line in case we need to process it as an in-body header.
> + */
> + utf8_result = try_convert_to_utf8(mi, line, mi->charset.buf, &old_buf,
> + &old_len);
Just a minor style suggestion. As <old_buf, old_len> come in a
pair, fold the line before them, so that the readers can easily
see the association between them. I.e.
utf8_result = try_convert_to_utf8(mi, line, mi->charset.buf,
&old_buf, &old_len);
> - if (mi->use_scissors && is_scissors_line(line)) {
> + if (!utf8_result && mi->use_scissors && is_scissors_line(line)) {
> int i;
>
> + if (resembles_rfc2822_header(line))
> + /*
> + * Explicitly reject scissor lines that resemble a RFC
> + * 2822 header, to avoid being prone to error.
> + */
> + die("scissors line resembles RFC 2822 header");
> +
I guess "disambiguate to favor scissors" is not that difficult ;-)
> strbuf_setlen(&mi->log_message, 0);
> mi->header_stage = 1;
>
> @@ -661,18 +717,47 @@ static int handle_commit_msg(struct mailinfo *mi, struct strbuf *line)
> strbuf_release(mi->s_hdr_data[i]);
> mi->s_hdr_data[i] = NULL;
> }
> - return 0;
> + goto handle_commit_msg_out;
> }
> -
> - if (patchbreak(line)) {
> + if (!utf8_result && patchbreak(line)) {
> if (mi->message_id)
> strbuf_addf(&mi->log_message,
> "Message-Id: %s\n", mi->message_id);
> - return 1;
> + ret = 1;
> + goto handle_commit_msg_out;
> }
>
> + if (mi->header_stage) {
> + char *buf = old_buf ? old_buf : line->buf;
> + if (buf[0] == 0 || (buf[0] == '\n' && buf[1] == 0))
> + goto handle_commit_msg_out;
> + }
> +
> + if (mi->use_inbody_headers && mi->header_stage) {
> + char *buf = old_buf ? old_buf : line->buf;
> + size_t len = old_buf ? old_len : line->len;
> + mi->header_stage = check_header_raw(mi, buf, len,
> + mi->s_hdr_data, 0);
Just a minor comment, but I guess check_header_raw() refactoring is
not strictly needed after all, as this callsite can wrap <buf,len>
into a temporary strbuf.
Unlike the real header that is read in read_one_header_line() inside
a loop to implement line folding before check_header() is called, we
call check_header() before possibly-foldable header lines is fully
assembled into one header. Probably it comes in later patches, I
guess.
It is not immediately obvious to me how this step helps further work
done by later patches in the series until I read them, but so far
what this patch did looks understandable to me ;-)
Thanks.
> + if (mi->header_stage)
> + goto handle_commit_msg_out;
> + } else
> + /* Only trim the first (blank) line of the commit message
> + * when ignoring in-body headers.
> + */
> + mi->header_stage = 0;
> +
> + /* If adding as a log message, conversion to UTF-8 is required. */
> + if (utf8_result) {
> + mi->input_error = -1;
> + error("cannot convert from %s to %s",
> + mi->charset.buf, mi->metainfo_charset);
> + goto handle_commit_msg_out;
> + }
> strbuf_addbuf(&mi->log_message, line);
> - return 0;
> +
> +handle_commit_msg_out:
> + free(old_buf);
> + return ret;
> }
>
> static void handle_patch(struct mailinfo *mi, const struct strbuf *line)
next prev parent reply other threads:[~2016-09-16 19:12 UTC|newest]
Thread overview: 24+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-09-02 19:58 [PATCH] sequencer: support folding in rfc2822 footer Jonathan Tan
2016-09-03 2:23 ` Junio C Hamano
2016-09-06 22:08 ` Jonathan Tan
2016-09-06 23:30 ` Jonathan Tan
2016-09-07 6:38 ` Jeff King
2016-09-16 17:37 ` [RFC/PATCH 0/3] handle multiline in-body headers Jonathan Tan
2016-09-16 18:29 ` Junio C Hamano
2016-09-16 17:37 ` [RFC/PATCH 1/3] mailinfo: refactor commit message processing Jonathan Tan
2016-09-16 19:12 ` Junio C Hamano [this message]
2016-09-16 21:46 ` Jeff King
2016-09-16 17:37 ` [RFC/PATCH 2/3] mailinfo: correct malformed test example Jonathan Tan
2016-09-16 19:19 ` Junio C Hamano
2016-09-16 22:42 ` Jonathan Tan
2016-09-16 22:55 ` Junio C Hamano
2016-09-17 0:31 ` Jonathan Tan
2016-09-17 3:48 ` Junio C Hamano
2016-09-16 17:37 ` [RFC/PATCH 3/3] mailinfo: handle in-body header continuations Jonathan Tan
2016-09-16 20:17 ` Junio C Hamano
2016-09-16 20:49 ` Jonathan Tan
2016-09-16 20:59 ` Junio C Hamano
2016-09-16 22:36 ` Jonathan Tan
2016-09-16 23:04 ` Junio C Hamano
2016-09-17 0:22 ` Jonathan Tan
2016-09-16 21:51 ` Jeff King
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: http://vger.kernel.org/majordomo-info.html
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=xmqqoa3nk6a5.fsf@gitster.mtv.corp.google.com \
--to=gitster@pobox.com \
--cc=git@vger.kernel.org \
--cc=jonathantanmy@google.com \
--cc=peff@peff.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).