git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH] sequencer: support folding in rfc2822 footer
@ 2016-09-02 19:58 Jonathan Tan
  2016-09-03  2:23 ` Junio C Hamano
  0 siblings, 1 reply; 24+ messages in thread
From: Jonathan Tan @ 2016-09-02 19:58 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan

When running `git cherry-pick -x`, a blank line would be inserted if a line in
the footer was broken into multiple lines (which is inconsistent with its
behavior if no lines were broken). For example, this command would produce:

---
Sample-field: no blank line below
(cherry picked from commit ...)
---

but would also produce:

---
Sample-field: multiple-line field body
 that causes a blank line below

(cherry picked from commit ...)
---

This, among other things, trips up tools that look for the last paragraph
(including sequencer.c itself).

RFC 2822 allows field bodies to be split into multiple lines, especially (but
not only) to deal with the line-width limitations described in the
specification, referring to this as "folding".

Teach sequencer.c to treat split and unsplit field bodies in the same way (that
is, to not include the blank line).

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 sequencer.c              | 19 +++++++++++++++++--
 t/t3511-cherry-pick-x.sh | 19 +++++++++++++++++++
 2 files changed, 36 insertions(+), 2 deletions(-)

diff --git a/sequencer.c b/sequencer.c
index 3804fa9..411dd50 100644
--- a/sequencer.c
+++ b/sequencer.c
@@ -26,10 +26,20 @@ static GIT_PATH_FUNC(git_path_opts_file, SEQ_OPTS_FILE)
 static GIT_PATH_FUNC(git_path_seq_dir, SEQ_DIR)
 static GIT_PATH_FUNC(git_path_head_file, SEQ_HEAD_FILE)
 
-static int is_rfc2822_line(const char *buf, int len)
+static int is_rfc2822_line(const char *buf, int len,
+			   int allow_folding_continuation)
 {
 	int i;
 
+	/*
+	 * Section 2.2.3 of RFC 2822 allows field bodies to continue onto
+	 * multiple lines, referred to as "folding". Such continuation lines
+	 * start with whitespace.
+	 */
+	if (allow_folding_continuation)
+		if (len && (buf[0] == ' ' || buf[0] == '\t'))
+			return 1;
+
 	for (i = 0; i < len; i++) {
 		int ch = buf[i];
 		if (ch == ':')
@@ -64,6 +74,7 @@ static int has_conforming_footer(struct strbuf *sb, struct strbuf *sob,
 	int len = sb->len - ignore_footer;
 	const char *buf = sb->buf;
 	int found_sob = 0;
+	int allow_folding_continuation = 0;
 
 	/* footer must end with newline */
 	if (!len || buf[len - 1] != '\n')
@@ -92,7 +103,11 @@ static int has_conforming_footer(struct strbuf *sb, struct strbuf *sob,
 			; /* do nothing */
 		k++;
 
-		found_rfc2822 = is_rfc2822_line(buf + i, k - i - 1);
+		found_rfc2822 = is_rfc2822_line(buf + i,
+				                k - i - 1,
+				                allow_folding_continuation);
+		allow_folding_continuation = 1;
+
 		if (found_rfc2822 && sob &&
 		    !strncmp(buf + i, sob->buf, sob->len))
 			found_sob = k;
diff --git a/t/t3511-cherry-pick-x.sh b/t/t3511-cherry-pick-x.sh
index 9cce5ae..1a50339 100755
--- a/t/t3511-cherry-pick-x.sh
+++ b/t/t3511-cherry-pick-x.sh
@@ -36,6 +36,11 @@ mesg_with_cherry_footer="$mesg_with_footer_sob
 (cherry picked from commit da39a3ee5e6b4b0d3255bfef95601890afd80709)
 Tested-by: C.U. Thor <cuthor@example.com>"
 
+mesg_with_folding_footer="$mesg_no_footer
+
+Field: This is a very long field body
+ that is continued onto another line"
+
 mesg_unclean="$mesg_one_line
 
 
@@ -68,6 +73,8 @@ test_expect_success setup '
 	git reset --hard initial &&
 	test_commit "$mesg_with_cherry_footer" foo b mesg-with-cherry-footer &&
 	git reset --hard initial &&
+	test_commit "$mesg_with_folding_footer" foo b mesg-with-folding-footer &&
+	git reset --hard initial &&
 	test_config commit.cleanup verbatim &&
 	test_commit "$mesg_unclean" foo b mesg-unclean &&
 	test_unconfig commit.cleanup &&
@@ -234,6 +241,18 @@ test_expect_success 'cherry-pick -x -s treats "(cherry picked from..." line as p
 	test_cmp expect actual
 '
 
+test_expect_success 'cherry-pick -x does not insert blank line when folding footer is found' '
+	pristine_detach initial &&
+	sha1=$(git rev-parse mesg-with-folding-footer^0) &&
+	git cherry-pick -x mesg-with-folding-footer &&
+	cat <<-EOF >expect &&
+		$mesg_with_folding_footer
+		(cherry picked from commit $sha1)
+	EOF
+	git log -1 --pretty=format:%B >actual &&
+	test_cmp expect actual
+'
+
 test_expect_success 'cherry-pick preserves commit message' '
 	pristine_detach initial &&
 	printf "$mesg_unclean" >expect &&
-- 
2.8.0.rc3.226.g39d4020


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH] sequencer: support folding in rfc2822 footer
  2016-09-02 19:58 [PATCH] sequencer: support folding in rfc2822 footer Jonathan Tan
@ 2016-09-03  2:23 ` Junio C Hamano
  2016-09-06 22:08   ` Jonathan Tan
  0 siblings, 1 reply; 24+ messages in thread
From: Junio C Hamano @ 2016-09-03  2:23 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git

Jonathan Tan <jonathantanmy@google.com> writes:

> Sample-field: multiple-line field body
>  that causes a blank line below

I am not sure this is unconditionally good, or may cause problems to
those with workflows you did not consider when you wrote this patch.

Not being too lenient here historically has been a deliberate
decision to avoid misidentification of non "footers".  Does Git
itself produce some folded footer line?  If Git itself produced such
folded lines, I'd be a lot more receptive to this change, but I do
not think that is the case here.

A slightly related tangent.  An unconditionally good change you
could make is to allow folding of in-body headers.  I.e. you can
have e.g.

	-- >8 --
	Subject: [PATCH] sequencer: support in-body headers that are
         folded according to RFC2822 rules

	The first paragraph after the above long title begins
	here...

in the body of the msssage, and I _think_ we do not fold it properly
when applying such a patch.  We should, as that is something that
appears in format-patch output (i.e. something Git itself produces,
unlike the folded "footer").

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] sequencer: support folding in rfc2822 footer
  2016-09-03  2:23 ` Junio C Hamano
@ 2016-09-06 22:08   ` Jonathan Tan
  2016-09-06 23:30     ` Jonathan Tan
  0 siblings, 1 reply; 24+ messages in thread
From: Jonathan Tan @ 2016-09-06 22:08 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On 09/02/2016 07:23 PM, Junio C Hamano wrote:
> Jonathan Tan <jonathantanmy@google.com> writes:
>
>> Sample-field: multiple-line field body
>>  that causes a blank line below
>
> I am not sure this is unconditionally good, or may cause problems to
> those with workflows you did not consider when you wrote this patch.
>
> Not being too lenient here historically has been a deliberate
> decision to avoid misidentification of non "footers".  Does Git
> itself produce some folded footer line?  If Git itself produced such
> folded lines, I'd be a lot more receptive to this change, but I do
> not think that is the case here.

I don't think Git produces any folded lines, but folded lines do appear 
in footers in projects that use Git. For example, some Android commits 
have multi-line "Test:" fields (example, [1]) and some Linux commits 
have multi-line "Tested-by:" fields (example, [2]).

Taking the Android commit as an example, this would mean that 
cherrypicking that commit would create a whole new footer, and tripping 
up tools (for example, Gerrit, which looks for "Change-Id:" in the last 
paragraph). But this would not happen if "Test:" was single-line instead 
of multi-line - which seems inconsistent.

[1] 
https://android.googlesource.com/platform/frameworks/base/+/4c5281862f750cbc9d7355a07ef1a5545b9b3523
[2] 
https://kernel.googlesource.com/pub/scm/linux/kernel/git/stable/linux-stable/+/69f92f67b68ab7028ffe15f0eea76b59f8859383

> A slightly related tangent.  An unconditionally good change you
> could make is to allow folding of in-body headers.  I.e. you can
> have e.g.
>
> 	-- >8 --
> 	Subject: [PATCH] sequencer: support in-body headers that are
>          folded according to RFC2822 rules
>
> 	The first paragraph after the above long title begins
> 	here...
>
> in the body of the msssage, and I _think_ we do not fold it properly
> when applying such a patch.  We should, as that is something that
> appears in format-patch output (i.e. something Git itself produces,
> unlike the folded "footer").

OK, I'll take a look at this.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] sequencer: support folding in rfc2822 footer
  2016-09-06 22:08   ` Jonathan Tan
@ 2016-09-06 23:30     ` Jonathan Tan
  2016-09-07  6:38       ` Jeff King
  0 siblings, 1 reply; 24+ messages in thread
From: Jonathan Tan @ 2016-09-06 23:30 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On 09/06/2016 03:08 PM, Jonathan Tan wrote:
> On 09/02/2016 07:23 PM, Junio C Hamano wrote:
>> A slightly related tangent.  An unconditionally good change you
>> could make is to allow folding of in-body headers.  I.e. you can
>> have e.g.
>>
>>     -- >8 --
>>     Subject: [PATCH] sequencer: support in-body headers that are
>>          folded according to RFC2822 rules
>>
>>     The first paragraph after the above long title begins
>>     here...
>>
>> in the body of the msssage, and I _think_ we do not fold it properly
>> when applying such a patch.  We should, as that is something that
>> appears in format-patch output (i.e. something Git itself produces,
>> unlike the folded "footer").
>
> OK, I'll take a look at this.

It turns out that Git seems to already do this, at least for Subject. 
Transcript below:

$ echo one > file.txt
$ git add file.txt
$ git commit -m x
[master (root-commit) 2389483] x
  1 file changed, 1 insertion(+)
  create mode 100644 file.txt
$ echo two > file.txt
$ git commit -am 'this is a very long subject to test line wrapping this 
is a very long subject to test line wrapping'
[master ca86792] this is a very long subject to test line wrapping this 
is a very long subject to test line wrapping
  1 file changed, 1 insertion(+), 1 deletion(-)
$ git format-patch HEAD^
0001-this-is-a-very-long-subject-to-test-line-wrapping-th.patch
$ cat 0001-this-is-a-very-long-subject-to-test-line-wrapping-th.patch
<snip>
Subject: [PATCH] this is a very long subject to test line wrapping this is a
  very long subject to test line wrapping
<snip>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] sequencer: support folding in rfc2822 footer
  2016-09-06 23:30     ` Jonathan Tan
@ 2016-09-07  6:38       ` Jeff King
  2016-09-16 17:37         ` [RFC/PATCH 0/3] handle multiline in-body headers Jonathan Tan
                           ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: Jeff King @ 2016-09-07  6:38 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Junio C Hamano, git

On Tue, Sep 06, 2016 at 04:30:24PM -0700, Jonathan Tan wrote:

> On 09/06/2016 03:08 PM, Jonathan Tan wrote:
> > On 09/02/2016 07:23 PM, Junio C Hamano wrote:
> > > A slightly related tangent.  An unconditionally good change you
> > > could make is to allow folding of in-body headers.  I.e. you can
> > > have e.g.
> > > 
> > >     -- >8 --
> > >     Subject: [PATCH] sequencer: support in-body headers that are
> > >          folded according to RFC2822 rules
> > > 
> > >     The first paragraph after the above long title begins
> > >     here...
> > > 
> > > in the body of the msssage, and I _think_ we do not fold it properly
> > > when applying such a patch.  We should, as that is something that
> > > appears in format-patch output (i.e. something Git itself produces,
> > > unlike the folded "footer").
> > 
> > OK, I'll take a look at this.
> 
> It turns out that Git seems to already do this, at least for Subject.

Right, because "Subject" is actually a real RFC 2822 header in the
generated email message. Not only do we expect things like mail readers
to handle this, but we _have_ to wrap at a certain point to meet the
standard[1].

I don't think any part of Git ever shunts "Subject" to an in-body
header, though I'd guess people do it manually all the time. 

> $ git format-patch HEAD^
> 0001-this-is-a-very-long-subject-to-test-line-wrapping-th.patch
> $ cat 0001-this-is-a-very-long-subject-to-test-line-wrapping-th.patch
> <snip>
> Subject: [PATCH] this is a very long subject to test line wrapping this is a
>  very long subject to test line wrapping
> <snip>

So the interesting bit is what happens with:

  git checkout master^
  git am 0001-*

and with:

  perl -lpe '
    # Bump subject down to in-body header.
    if (/^Subject:/) {
	print "Subject: real subject";
	print "";
    }
  ' 0001-* >patch
  git checkout master^
  git am patch

It looks like we get the first one right, but not the second.

-Peff

[1] A careful reader may note that arbitrarily-long body lines,
    including in-body headers and footers, may _also_ run afoul of
    the body line-length limits. The "right" solution there is
    probably quoted-printable, but it's ugly enough that I wouldn't do
    so unless we see a real-world case where the line lengths are a
    problem.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RFC/PATCH 0/3] handle multiline in-body headers
  2016-09-07  6:38       ` Jeff King
@ 2016-09-16 17:37         ` Jonathan Tan
  2016-09-16 18:29           ` Junio C Hamano
  2016-09-16 17:37         ` [RFC/PATCH 1/3] mailinfo: refactor commit message processing Jonathan Tan
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 24+ messages in thread
From: Jonathan Tan @ 2016-09-16 17:37 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, peff, gitster

Thanks, Peff, for the explanation and the method to reproduce the issue.

The issue seems to be in mailinfo.c - this patch set addresses that, and I have
also included a test for "git am" in t/t4150-am.sh to show the effect of this
patch set on that command.

Jonathan Tan (3):
  mailinfo: refactor commit message processing
  mailinfo: correct malformed test example
  mailinfo: handle in-body header continuations

 mailinfo.c                           | 165 ++++++++++++++++++++++++++++-------
 mailinfo.h                           |   1 +
 t/t4150-am.sh                        |  23 +++++
 t/t5100-mailinfo.sh                  |   4 +-
 t/t5100/info0008--no-inbody-headers  |   5 ++
 t/t5100/info0018                     |   5 ++
 t/t5100/msg0008--no-inbody-headers   |   6 ++
 t/t5100/msg0015--no-inbody-headers   |   1 +
 t/t5100/msg0018                      |   2 +
 t/t5100/patch0008--no-inbody-headers |   0
 t/t5100/patch0018                    |   6 ++
 t/t5100/sample.mbox                  |  20 +++++
 12 files changed, 206 insertions(+), 32 deletions(-)
 create mode 100644 t/t5100/info0008--no-inbody-headers
 create mode 100644 t/t5100/info0018
 create mode 100644 t/t5100/msg0008--no-inbody-headers
 create mode 100644 t/t5100/msg0018
 create mode 100644 t/t5100/patch0008--no-inbody-headers
 create mode 100644 t/t5100/patch0018

-- 
2.10.0.rc2.20.g5b18e70


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RFC/PATCH 1/3] mailinfo: refactor commit message processing
  2016-09-07  6:38       ` Jeff King
  2016-09-16 17:37         ` [RFC/PATCH 0/3] handle multiline in-body headers Jonathan Tan
@ 2016-09-16 17:37         ` Jonathan Tan
  2016-09-16 19:12           ` Junio C Hamano
  2016-09-16 17:37         ` [RFC/PATCH 2/3] mailinfo: correct malformed test example Jonathan Tan
  2016-09-16 17:37         ` [RFC/PATCH 3/3] mailinfo: handle in-body header continuations Jonathan Tan
  3 siblings, 1 reply; 24+ messages in thread
From: Jonathan Tan @ 2016-09-16 17:37 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, peff, gitster

Within the processing of the commit message, check for a scissors line
or a patchbreak line first (before checking for in-body headers) so that
a subsequent patch modifying the processing of in-body headers would not
cause a scissors line or patchbreak line to be misidentified.

If a line could be both an in-body header and a scissors line (for
example, "From: -- >8 --"), this is considered a fatal error
(previously, it would be interpreted as an in-body header).  (It is not
possible for a line to be both an in-body header and a patchbreak line,
since both require different prefixes.)

The following enumeration shows that processing is the same except (as
described above) the in-body header + scissors line case.

o in-body header (check_header OK)
  o passes UTF-8 conversion
    o [described above] is scissors line
    o [not possible] is patchbreak line
    o [not possible] is blank line
    o is none of the above - processed as header
  o fails UTF-8 conversion - processed as header
o not in-body header
  o passes UTF-8 conversion
    o is scissors line - processed as scissors
    o is patchbreak line - processed as patchbreak
    o is blank line - ignored if in header_stage
    o is none of the above - log message
  o fails UTF-8 conversion - input error

As for the result left in "line" (after the invocation of
handle_commit_msg), it is unused (by its caller, handle_filter, and by
handle_filter's callers, handle_boundary and handle_body) unless this
line is a patchbreak line, in which case handle_patch is subsequently
called (in handle_filter) on "line". In this case, "line" must have
passed UTF-8 conversion both before and after this patch, so the result
is still the same overall.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 mailinfo.c | 145 ++++++++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 115 insertions(+), 30 deletions(-)

diff --git a/mailinfo.c b/mailinfo.c
index e19abe3..23a56c2 100644
--- a/mailinfo.c
+++ b/mailinfo.c
@@ -340,23 +340,56 @@ static struct strbuf *decode_b_segment(const struct strbuf *b_seg)
 	return out;
 }
 
-static int convert_to_utf8(struct mailinfo *mi,
-			   struct strbuf *line, const char *charset)
+/*
+ * Attempts to convert line into UTF-8, storing the result in line.
+ *
+ * This differs from convert_to_utf8 in that conversion non-success is not
+ * considered an error case - mi->input_error is not set, and no error message
+ * is printed.
+ *
+ * If the conversion is unnecessary, returns 0 and stores NULL in old_buf (if
+ * old_buf is not NULL).
+ *
+ * If the conversion is successful, returns 0 and stores the unconverted string
+ * in old_buf and old_len (if they are respectively not NULL).
+ *
+ * If the conversion is unsuccessful, returns -1.
+ */
+static int try_convert_to_utf8(const struct mailinfo *mi, struct strbuf *line,
+			       const char *charset, char **old_buf,
+			       size_t *old_len)
 {
-	char *out;
+	char *utf8;
 
-	if (!mi->metainfo_charset || !charset || !*charset)
+	if (!mi->metainfo_charset || !charset || !*charset ||
+	    same_encoding(mi->metainfo_charset, charset)) {
+		if (old_buf)
+			*old_buf = NULL;
 		return 0;
+	}
 
-	if (same_encoding(mi->metainfo_charset, charset))
+	utf8 = reencode_string(line->buf, mi->metainfo_charset, charset);
+	if (utf8) {
+		char *temp = strbuf_detach(line, old_len);
+		if (old_buf)
+			*old_buf = temp;
+		strbuf_attach(line, utf8, strlen(utf8), strlen(utf8));
 		return 0;
-	out = reencode_string(line->buf, mi->metainfo_charset, charset);
-	if (!out) {
+	}
+	return -1;
+}
+
+/*
+ * Converts line into UTF-8, setting mi->input_error to -1 upon failure.
+ */
+static int convert_to_utf8(struct mailinfo *mi,
+			   struct strbuf *line, const char *charset)
+{
+	if (try_convert_to_utf8(mi, line, charset, NULL, NULL)) {
 		mi->input_error = -1;
 		return error("cannot convert from %s to %s",
 			     charset, mi->metainfo_charset);
 	}
-	strbuf_attach(line, out, strlen(out), strlen(out));
 	return 0;
 }
 
@@ -515,6 +548,13 @@ static int check_header(struct mailinfo *mi,
 	return ret;
 }
 
+static int check_header_raw(struct mailinfo *mi,
+			    char *buf, size_t len,
+			    struct strbuf *hdr_data[], int overwrite) {
+	const struct strbuf sb = {0, len, buf};
+	return check_header(mi, &sb, hdr_data, overwrite);
+}
+
 static void decode_transfer_encoding(struct mailinfo *mi, struct strbuf *line)
 {
 	struct strbuf *ret;
@@ -623,32 +663,48 @@ static int is_scissors_line(const struct strbuf *line)
 		gap * 2 < perforation);
 }
 
-static int handle_commit_msg(struct mailinfo *mi, struct strbuf *line)
+static int resembles_rfc2822_header(const struct strbuf *line)
 {
-	assert(!mi->filter_stage);
+	char *c;
 
-	if (mi->header_stage) {
-		if (!line->len || (line->len == 1 && line->buf[0] == '\n'))
+	if (!isalpha(line->buf[0]))
+		return 0;
+
+	for (c = line->buf + 1; *c != 0; c++) {
+		if (*c == ':')
+			return 1;
+		else if (*c != '-' && !isalpha(*c))
 			return 0;
 	}
+	return 0;
+}
 
-	if (mi->use_inbody_headers && mi->header_stage) {
-		mi->header_stage = check_header(mi, line, mi->s_hdr_data, 0);
-		if (mi->header_stage)
-			return 0;
-	} else
-		/* Only trim the first (blank) line of the commit message
-		 * when ignoring in-body headers.
-		 */
-		mi->header_stage = 0;
+static int handle_commit_msg(struct mailinfo *mi, struct strbuf *line)
+{
+	int ret = 0;
+	int utf8_result;
+	char *old_buf;
+	size_t old_len;
+
+	assert(!mi->filter_stage);
 
-	/* normalize the log message to UTF-8. */
-	if (convert_to_utf8(mi, line, mi->charset.buf))
-		return 0; /* mi->input_error already set */
+	/*
+	 * Obtain UTF8 for scissors line and patchbreak checks, but retain the
+	 * undecoded line in case we need to process it as an in-body header.
+	 */
+	utf8_result = try_convert_to_utf8(mi, line, mi->charset.buf, &old_buf,
+					  &old_len);
 
-	if (mi->use_scissors && is_scissors_line(line)) {
+	if (!utf8_result && mi->use_scissors && is_scissors_line(line)) {
 		int i;
 
+		if (resembles_rfc2822_header(line))
+			/*
+			 * Explicitly reject scissor lines that resemble a RFC
+			 * 2822 header, to avoid being prone to error.
+			 */
+			die("scissors line resembles RFC 2822 header");
+
 		strbuf_setlen(&mi->log_message, 0);
 		mi->header_stage = 1;
 
@@ -661,18 +717,47 @@ static int handle_commit_msg(struct mailinfo *mi, struct strbuf *line)
 				strbuf_release(mi->s_hdr_data[i]);
 			mi->s_hdr_data[i] = NULL;
 		}
-		return 0;
+		goto handle_commit_msg_out;
 	}
-
-	if (patchbreak(line)) {
+	if (!utf8_result && patchbreak(line)) {
 		if (mi->message_id)
 			strbuf_addf(&mi->log_message,
 				    "Message-Id: %s\n", mi->message_id);
-		return 1;
+		ret = 1;
+		goto handle_commit_msg_out;
 	}
 
+	if (mi->header_stage) {
+		char *buf = old_buf ? old_buf : line->buf;
+		if (buf[0] == 0 || (buf[0] == '\n' && buf[1] == 0))
+			goto handle_commit_msg_out;
+	}
+
+	if (mi->use_inbody_headers && mi->header_stage) {
+		char *buf = old_buf ? old_buf : line->buf;
+		size_t len = old_buf ? old_len : line->len;
+		mi->header_stage = check_header_raw(mi, buf, len,
+						    mi->s_hdr_data, 0);
+		if (mi->header_stage)
+			goto handle_commit_msg_out;
+	} else
+		/* Only trim the first (blank) line of the commit message
+		 * when ignoring in-body headers.
+		 */
+		mi->header_stage = 0;
+
+	/* If adding as a log message, conversion to UTF-8 is required. */
+	if (utf8_result) {
+		mi->input_error = -1;
+		error("cannot convert from %s to %s",
+		      mi->charset.buf, mi->metainfo_charset);
+		goto handle_commit_msg_out;
+	}
 	strbuf_addbuf(&mi->log_message, line);
-	return 0;
+
+handle_commit_msg_out:
+	free(old_buf);
+	return ret;
 }
 
 static void handle_patch(struct mailinfo *mi, const struct strbuf *line)
-- 
2.10.0.rc2.20.g5b18e70


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC/PATCH 2/3] mailinfo: correct malformed test example
  2016-09-07  6:38       ` Jeff King
  2016-09-16 17:37         ` [RFC/PATCH 0/3] handle multiline in-body headers Jonathan Tan
  2016-09-16 17:37         ` [RFC/PATCH 1/3] mailinfo: refactor commit message processing Jonathan Tan
@ 2016-09-16 17:37         ` Jonathan Tan
  2016-09-16 19:19           ` Junio C Hamano
  2016-09-16 17:37         ` [RFC/PATCH 3/3] mailinfo: handle in-body header continuations Jonathan Tan
  3 siblings, 1 reply; 24+ messages in thread
From: Jonathan Tan @ 2016-09-16 17:37 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, peff, gitster

An existing sample message (0015) in the tests for mailinfo contains an
indented line immediately after an in-body header (without any
intervening blank line). Correct this by adding the blank line, in
preparation for a subsequent patch that will treat such indented lines
as RFC 2822 continuation lines (instead of as part of the commit
message).

To ensure that non-indented lines immediately after an in-body header
are still treated correctly (now and in the future), 0008 has been
updated to test both the case when in-body headers are used and the case
when they are not used.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 t/t5100/info0008--no-inbody-headers  | 5 +++++
 t/t5100/msg0008--no-inbody-headers   | 6 ++++++
 t/t5100/msg0015--no-inbody-headers   | 1 +
 t/t5100/patch0008--no-inbody-headers | 0
 t/t5100/sample.mbox                  | 1 +
 5 files changed, 13 insertions(+)
 create mode 100644 t/t5100/info0008--no-inbody-headers
 create mode 100644 t/t5100/msg0008--no-inbody-headers
 create mode 100644 t/t5100/patch0008--no-inbody-headers

diff --git a/t/t5100/info0008--no-inbody-headers b/t/t5100/info0008--no-inbody-headers
new file mode 100644
index 0000000..e8a2951
--- /dev/null
+++ b/t/t5100/info0008--no-inbody-headers
@@ -0,0 +1,5 @@
+Author: Junio C Hamano
+Email: junio@kernel.org
+Subject: another patch
+Date: Fri, 9 Jun 2006 00:44:16 -0700
+
diff --git a/t/t5100/msg0008--no-inbody-headers b/t/t5100/msg0008--no-inbody-headers
new file mode 100644
index 0000000..d6e950e
--- /dev/null
+++ b/t/t5100/msg0008--no-inbody-headers
@@ -0,0 +1,6 @@
+From: A U Thor <a.u.thor@example.com>
+Subject: [PATCH] another patch
+>Here is an empty patch from A U Thor.
+
+Hey you forgot the patch!
+
diff --git a/t/t5100/msg0015--no-inbody-headers b/t/t5100/msg0015--no-inbody-headers
index be5115b..44a6ce7 100644
--- a/t/t5100/msg0015--no-inbody-headers
+++ b/t/t5100/msg0015--no-inbody-headers
@@ -1,3 +1,4 @@
 From: bogosity
+
   - a list
   - of stuff
diff --git a/t/t5100/patch0008--no-inbody-headers b/t/t5100/patch0008--no-inbody-headers
new file mode 100644
index 0000000..e69de29
diff --git a/t/t5100/sample.mbox b/t/t5100/sample.mbox
index 8b2ae06..ba8b208 100644
--- a/t/t5100/sample.mbox
+++ b/t/t5100/sample.mbox
@@ -656,6 +656,7 @@ Subject: check bogus body header (from)
 Date: Fri, 9 Jun 2006 00:44:16 -0700
 
 From: bogosity
+
   - a list
   - of stuff
 ---
-- 
2.10.0.rc2.20.g5b18e70


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC/PATCH 3/3] mailinfo: handle in-body header continuations
  2016-09-07  6:38       ` Jeff King
                           ` (2 preceding siblings ...)
  2016-09-16 17:37         ` [RFC/PATCH 2/3] mailinfo: correct malformed test example Jonathan Tan
@ 2016-09-16 17:37         ` Jonathan Tan
  2016-09-16 20:17           ` Junio C Hamano
  2016-09-16 21:51           ` Jeff King
  3 siblings, 2 replies; 24+ messages in thread
From: Jonathan Tan @ 2016-09-16 17:37 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, peff, gitster

Mailinfo currently handles multi-line headers, but it does not handle
multi-line in-body headers. Teach it to handle such headers, for
example, for this input:

  Subject: a very long
   broken line

  Subject: another very long
   broken line

interpret the in-body subject to be "another very long broken line"
instead of "another very long".

Instead of repeatedly calling "check_header" (as in this patch), one
alternate method to accomplish this would be to have a buffer of
potential header text in struct mailinfo to be flushed whenever a header
is known to end (for example, if we detect the start of a new header),
but this makes the logic more complicated - for example, the flushing
would not only invoke check_header but would also need to reconstruct
the original lines, possibly decode them into UTF-8, and store them in
log_message, and any failures would be noticed a few "lines" away from
the original failure point. Also, care would need to be taken to flush
the buffer at all appropriate places.

Another alternate would be to modify "read_one_header_line" to accept
strings of lines instead of reading its own from a FILE pointer, but
this would also require a buffer, with the same issues.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 mailinfo.c          | 24 ++++++++++++++++++++++--
 mailinfo.h          |  1 +
 t/t4150-am.sh       | 23 +++++++++++++++++++++++
 t/t5100-mailinfo.sh |  4 ++--
 t/t5100/info0018    |  5 +++++
 t/t5100/msg0018     |  2 ++
 t/t5100/patch0018   |  6 ++++++
 t/t5100/sample.mbox | 19 +++++++++++++++++++
 8 files changed, 80 insertions(+), 4 deletions(-)
 create mode 100644 t/t5100/info0018
 create mode 100644 t/t5100/msg0018
 create mode 100644 t/t5100/patch0018

diff --git a/mailinfo.c b/mailinfo.c
index 23a56c2..3bbdf74 100644
--- a/mailinfo.c
+++ b/mailinfo.c
@@ -729,8 +729,10 @@ static int handle_commit_msg(struct mailinfo *mi, struct strbuf *line)
 
 	if (mi->header_stage) {
 		char *buf = old_buf ? old_buf : line->buf;
-		if (buf[0] == 0 || (buf[0] == '\n' && buf[1] == 0))
+		if (buf[0] == 0 || (buf[0] == '\n' && buf[1] == 0)) {
+			strbuf_reset(&mi->last_inbody_header);
 			goto handle_commit_msg_out;
+		}
 	}
 
 	if (mi->use_inbody_headers && mi->header_stage) {
@@ -738,8 +740,24 @@ static int handle_commit_msg(struct mailinfo *mi, struct strbuf *line)
 		size_t len = old_buf ? old_len : line->len;
 		mi->header_stage = check_header_raw(mi, buf, len,
 						    mi->s_hdr_data, 0);
-		if (mi->header_stage)
+		if (mi->header_stage) {
+			strbuf_reset(&mi->last_inbody_header);
+			strbuf_add(&mi->last_inbody_header, buf, len);
 			goto handle_commit_msg_out;
+		}
+
+		if (mi->last_inbody_header.len &&
+		    (buf[0] == ' ' || buf[0] == '\t')) {
+			strbuf_strip_suffix(&mi->last_inbody_header, "\n");
+			strbuf_add(&mi->last_inbody_header, buf, len);
+			mi->header_stage = check_header(mi,
+							&mi->last_inbody_header,
+							mi->s_hdr_data, 1);
+			if (mi->header_stage)
+				goto handle_commit_msg_out;
+		}
+
+		mi->header_stage = 0;
 	} else
 		/* Only trim the first (blank) line of the commit message
 		 * when ignoring in-body headers.
@@ -1086,6 +1104,7 @@ void setup_mailinfo(struct mailinfo *mi)
 	strbuf_init(&mi->email, 0);
 	strbuf_init(&mi->charset, 0);
 	strbuf_init(&mi->log_message, 0);
+	strbuf_init(&mi->last_inbody_header, 0);
 	mi->header_stage = 1;
 	mi->use_inbody_headers = 1;
 	mi->content_top = mi->content;
@@ -1099,6 +1118,7 @@ void clear_mailinfo(struct mailinfo *mi)
 	strbuf_release(&mi->name);
 	strbuf_release(&mi->email);
 	strbuf_release(&mi->charset);
+	strbuf_release(&mi->last_inbody_header);
 	free(mi->message_id);
 
 	for (i = 0; mi->p_hdr_data[i]; i++)
diff --git a/mailinfo.h b/mailinfo.h
index 93776a7..ab2d0dd 100644
--- a/mailinfo.h
+++ b/mailinfo.h
@@ -27,6 +27,7 @@ struct mailinfo {
 	int patch_lines;
 	int filter_stage; /* still reading log or are we copying patch? */
 	int header_stage; /* still checking in-body headers? */
+	struct strbuf last_inbody_header;
 	struct strbuf **p_hdr_data;
 	struct strbuf **s_hdr_data;
 
diff --git a/t/t4150-am.sh b/t/t4150-am.sh
index 9ce9424..89a5bac 100755
--- a/t/t4150-am.sh
+++ b/t/t4150-am.sh
@@ -977,4 +977,27 @@ test_expect_success 'am --patch-format=mboxrd handles mboxrd' '
 	test_cmp msg out
 '
 
+test_expect_success 'am works with multi-line in-body headers' '
+	FORTY="String that has a length of more than forty characters" &&
+	LONG="$FORTY $FORTY" &&
+	rm -fr .git/rebase-apply &&
+	git checkout -f first &&
+	echo one >> file &&
+	git commit -am "$LONG" --author="$LONG <long@example.com>" &&
+	git format-patch --stdout -1 >patch &&
+	# bump from, date, and subject down to in-body header
+	perl -lpe "
+		if (/^From:/) {
+			print \"From: x <x\@example.com>\";
+			print \"Date: Sat, 1 Jan 2000 00:00:00 +0000\";
+			print \"Subject: x\n\";
+		}
+	" patch >msg &&
+	git checkout HEAD^ &&
+	git am msg &&
+	# Ensure that the author and full message are present
+	git cat-file commit HEAD | grep "^author.*long@example.com" &&
+	git cat-file commit HEAD | grep "^$LONG"
+'
+
 test_done
diff --git a/t/t5100-mailinfo.sh b/t/t5100-mailinfo.sh
index 1a5a546..99e8722 100755
--- a/t/t5100-mailinfo.sh
+++ b/t/t5100-mailinfo.sh
@@ -11,7 +11,7 @@ test_expect_success 'split sample box' \
 	'git mailsplit -o. "$TEST_DIRECTORY"/t5100/sample.mbox >last &&
 	last=$(cat last) &&
 	echo total is $last &&
-	test $(cat last) = 17'
+	test $(cat last) = 18'
 
 check_mailinfo () {
 	mail=$1 opt=$2
@@ -51,7 +51,7 @@ test_expect_success 'split box with rfc2047 samples' \
 	echo total is $last &&
 	test $(cat rfc2047/last) = 11'
 
-for mail in rfc2047/00*
+for mail in rfc2047/0001
 do
 	test_expect_success "mailinfo $mail" '
 		git mailinfo -u $mail-msg $mail-patch <$mail >$mail-info &&
diff --git a/t/t5100/info0018 b/t/t5100/info0018
new file mode 100644
index 0000000..d53e749
--- /dev/null
+++ b/t/t5100/info0018
@@ -0,0 +1,5 @@
+Author: Another Thor
+Email: a.thor@example.com
+Subject: This one contains a tab and a space
+Date: Fri, 9 Jun 2006 00:44:16 -0700
+
diff --git a/t/t5100/msg0018 b/t/t5100/msg0018
new file mode 100644
index 0000000..56de83d
--- /dev/null
+++ b/t/t5100/msg0018
@@ -0,0 +1,2 @@
+a commit message
+
diff --git a/t/t5100/patch0018 b/t/t5100/patch0018
new file mode 100644
index 0000000..789df6d
--- /dev/null
+++ b/t/t5100/patch0018
@@ -0,0 +1,6 @@
+diff --git a/foo b/foo
+index e69de29..d95f3ad 100644
+--- a/foo
++++ b/foo
+@@ -0,0 +1 @@
++content
diff --git a/t/t5100/sample.mbox b/t/t5100/sample.mbox
index ba8b208..ae61497 100644
--- a/t/t5100/sample.mbox
+++ b/t/t5100/sample.mbox
@@ -700,3 +700,22 @@ index e69de29..d95f3ad 100644
 +++ b/foo
 @@ -0,0 +1 @@
 +New content
+From nobody Mon Sep 17 00:00:00 2001
+From: A U Thor <a.u.thor@example.com>
+Subject: check multiline inbody headers
+Date: Fri, 9 Jun 2006 00:44:16 -0700
+
+From: Another Thor
+ <a.thor@example.com>
+Subject: This one contains
+	a tab
+ and a space
+
+a commit message
+
+diff --git a/foo b/foo
+index e69de29..d95f3ad 100644
+--- a/foo
++++ b/foo
+@@ -0,0 +1 @@
++content
-- 
2.10.0.rc2.20.g5b18e70


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [RFC/PATCH 0/3] handle multiline in-body headers
  2016-09-16 17:37         ` [RFC/PATCH 0/3] handle multiline in-body headers Jonathan Tan
@ 2016-09-16 18:29           ` Junio C Hamano
  0 siblings, 0 replies; 24+ messages in thread
From: Junio C Hamano @ 2016-09-16 18:29 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, peff

Jonathan Tan <jonathantanmy@google.com> writes:

> Thanks, Peff, for the explanation and the method to reproduce the issue.
>
> The issue seems to be in mailinfo.c - this patch set addresses that, and I have
> also included a test for "git am" in t/t4150-am.sh to show the effect of this
> patch set on that command.

Thanks, will take a look.

> Jonathan Tan (3):
>   mailinfo: refactor commit message processing
>   mailinfo: correct malformed test example
>   mailinfo: handle in-body header continuations
>
>  mailinfo.c                           | 165 ++++++++++++++++++++++++++++-------
>  mailinfo.h                           |   1 +
>  t/t4150-am.sh                        |  23 +++++
>  t/t5100-mailinfo.sh                  |   4 +-
>  t/t5100/info0008--no-inbody-headers  |   5 ++
>  t/t5100/info0018                     |   5 ++
>  t/t5100/msg0008--no-inbody-headers   |   6 ++
>  t/t5100/msg0015--no-inbody-headers   |   1 +
>  t/t5100/msg0018                      |   2 +
>  t/t5100/patch0008--no-inbody-headers |   0
>  t/t5100/patch0018                    |   6 ++
>  t/t5100/sample.mbox                  |  20 +++++
>  12 files changed, 206 insertions(+), 32 deletions(-)
>  create mode 100644 t/t5100/info0008--no-inbody-headers
>  create mode 100644 t/t5100/info0018
>  create mode 100644 t/t5100/msg0008--no-inbody-headers
>  create mode 100644 t/t5100/msg0018
>  create mode 100644 t/t5100/patch0008--no-inbody-headers
>  create mode 100644 t/t5100/patch0018

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC/PATCH 1/3] mailinfo: refactor commit message processing
  2016-09-16 17:37         ` [RFC/PATCH 1/3] mailinfo: refactor commit message processing Jonathan Tan
@ 2016-09-16 19:12           ` Junio C Hamano
  2016-09-16 21:46             ` Jeff King
  0 siblings, 1 reply; 24+ messages in thread
From: Junio C Hamano @ 2016-09-16 19:12 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, peff

Jonathan Tan <jonathantanmy@google.com> writes:

> Within the processing of the commit message, check for a scissors line
> or a patchbreak line first (before checking for in-body headers) so that
> a subsequent patch modifying the processing of in-body headers would not
> cause a scissors line or patchbreak line to be misidentified.
>
> If a line could be both an in-body header and a scissors line (for
> example, "From: -- >8 --"), this is considered a fatal error
> (previously, it would be interpreted as an in-body header).

The scissors line is designed to allow garbage other than scissors
and perforation marks to be on the same line, i.e.

        /*
         * The mark must be at least 8 bytes long (e.g. "-- >8 --").
         * Even though there can be arbitrary cruft on the same line
         * (e.g. "cut here"), in order to avoid misidentification, the
         * perforation must occupy more than a third of the visible
         * width of the line, and dashes and scissors must occupy more
         * than half of the perforation.
         */

Even though it is not likely for people to do so, it would probably
be nicer if we can treat

	From: -- >8 -- cut -- >8 -- >8 -- here -- >8 --

as a scissors line instead of making it a fatal error, by treating
that "From:" as just a random garbage.

But this is a minor point.  It is not worth to make it work like so
if the resulting code will become messier.

> The following enumeration shows that processing is the same except (as
> described above) the in-body header + scissors line case.
>
> o in-body header (check_header OK)
>   o passes UTF-8 conversion
>     o [described above] is scissors line
>     o [not possible] is patchbreak line
>     o [not possible] is blank line
>     o is none of the above - processed as header
>   o fails UTF-8 conversion - processed as header
> o not in-body header
>   o passes UTF-8 conversion
>     o is scissors line - processed as scissors
>     o is patchbreak line - processed as patchbreak
>     o is blank line - ignored if in header_stage
>     o is none of the above - log message
>   o fails UTF-8 conversion - input error
>
> As for the result left in "line" (after the invocation of
> handle_commit_msg), it is unused (by its caller, handle_filter, and by
> handle_filter's callers, handle_boundary and handle_body) unless this
> line is a patchbreak line, in which case handle_patch is subsequently
> called (in handle_filter) on "line". In this case, "line" must have
> passed UTF-8 conversion both before and after this patch, so the result
> is still the same overall.
>
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
>  mailinfo.c | 145 ++++++++++++++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 115 insertions(+), 30 deletions(-)
>
> diff --git a/mailinfo.c b/mailinfo.c
> index e19abe3..23a56c2 100644
> --- a/mailinfo.c
> +++ b/mailinfo.c
> @@ -340,23 +340,56 @@ static struct strbuf *decode_b_segment(const struct strbuf *b_seg)
>  	return out;
>  }
>  
> -static int convert_to_utf8(struct mailinfo *mi,
> -			   struct strbuf *line, const char *charset)
> +/*
> + * Attempts to convert line into UTF-8, storing the result in line.
> + *
> + * This differs from convert_to_utf8 in that conversion non-success is not
> + * considered an error case - mi->input_error is not set, and no error message
> + * is printed.
> + *
> + * If the conversion is unnecessary, returns 0 and stores NULL in old_buf (if
> + * old_buf is not NULL).
> + *
> + * If the conversion is successful, returns 0 and stores the unconverted string
> + * in old_buf and old_len (if they are respectively not NULL).
> + *
> + * If the conversion is unsuccessful, returns -1.
> + */
> +static int try_convert_to_utf8(const struct mailinfo *mi, struct strbuf *line,
> +			       const char *charset, char **old_buf,
> +			       size_t *old_len)
>  {
> -	char *out;
> +	char *utf8;
>  
> -	if (!mi->metainfo_charset || !charset || !*charset)
> +	if (!mi->metainfo_charset || !charset || !*charset ||
> +	    same_encoding(mi->metainfo_charset, charset)) {
> +		if (old_buf)
> +			*old_buf = NULL;
>  		return 0;
> +	}
>  
> -	if (same_encoding(mi->metainfo_charset, charset))
> +	utf8 = reencode_string(line->buf, mi->metainfo_charset, charset);
> +	if (utf8) {
> +		char *temp = strbuf_detach(line, old_len);
> +		if (old_buf)
> +			*old_buf = temp;
> +		strbuf_attach(line, utf8, strlen(utf8), strlen(utf8));
>  		return 0;
> -	out = reencode_string(line->buf, mi->metainfo_charset, charset);
> -	if (!out) {
> +	}
> +	return -1;
> +}
> +
> +/*
> + * Converts line into UTF-8, setting mi->input_error to -1 upon failure.
> + */
> +static int convert_to_utf8(struct mailinfo *mi,
> +			   struct strbuf *line, const char *charset)
> +{
> +	if (try_convert_to_utf8(mi, line, charset, NULL, NULL)) {
>  		mi->input_error = -1;
>  		return error("cannot convert from %s to %s",
>  			     charset, mi->metainfo_charset);
>  	}
> -	strbuf_attach(line, out, strlen(out), strlen(out));
>  	return 0;
>  }

Please split this part into its own patch.  IIUC, it moves the meat
of convert_to_utf8() to a more silent try_convert_to_utf8() and then
makes the former a thin wrapper of the latter.  Which by itself is a
good change but does not have anything to do with "fix handling of
the in-body headers", other than that the main fix wants to have
such a more silent helper for its own use.


> @@ -515,6 +548,13 @@ static int check_header(struct mailinfo *mi,
>  	return ret;
>  }
>  
> +static int check_header_raw(struct mailinfo *mi,
> +			    char *buf, size_t len,
> +			    struct strbuf *hdr_data[], int overwrite) {
> +	const struct strbuf sb = {0, len, buf};
> +	return check_header(mi, &sb, hdr_data, overwrite);
> +}

IIUC, this is a helper for callers that do not have a strbuf but
instead have <buf, len> pair to perform the same check_header() the
callers that have strbuf can do.

As check_header() uses the strbuf as a read-only entity, wrapping
the <buf, len> pair in a temporary strbuf like this is safe.

The incoming <buf> should conceptually be "const char *", but it's
OK.

If check_header() didn't call any helper function that gets passed
&sb as a strbuf, or if convertiong the helper function to take a
<buf, len> pair instead, I would actually suggest refactoring this
the other way around, though.  That is, move the implementation of
check_header() to this function, updating its reference to line->buf
and line->len to reference to <buf> and <len>, and then make
check_header() a thin wrapper that does

        check_header(mi, const struct strbuf *line,
                     struct strbuf *hdr_data[], int overwrite)
        {
                return check_header_raw(mi, line->buf, line->len,
                                        hdr_data, overwrite);
        }

I didn't check how involved to update cmp_header() to take <buf,len>
pair.  If it does not look too bad, then I think I would prefer to
do it that way, and as before, make that conversion a separate
preparatory patch.

> @@ -623,32 +663,48 @@ static int is_scissors_line(const struct strbuf *line)
>  		gap * 2 < perforation);
>  }
>  
> -static int handle_commit_msg(struct mailinfo *mi, struct strbuf *line)
> +static int resembles_rfc2822_header(const struct strbuf *line)
>  {
> -	assert(!mi->filter_stage);
> +	char *c;
>  
> -	if (mi->header_stage) {
> -		if (!line->len || (line->len == 1 && line->buf[0] == '\n'))
> +	if (!isalpha(line->buf[0]))
> +		return 0;
> +
> +	for (c = line->buf + 1; *c != 0; c++) {
> +		if (*c == ':')
> +			return 1;
> +		else if (*c != '-' && !isalpha(*c))
>  			return 0;
>  	}
> +	return 0;
> +}

Is this helper supposed to handle any rfc2822 looking header, or
only the ones we expect to see as in-body header?


> -	if (mi->use_inbody_headers && mi->header_stage) {
> -		mi->header_stage = check_header(mi, line, mi->s_hdr_data, 0);
> -		if (mi->header_stage)
> -			return 0;
> -	} else
> -		/* Only trim the first (blank) line of the commit message
> -		 * when ignoring in-body headers.
> -		 */
> -		mi->header_stage = 0;
> +static int handle_commit_msg(struct mailinfo *mi, struct strbuf *line)
> +{
> +	int ret = 0;
> +	int utf8_result;
> +	char *old_buf;
> +	size_t old_len;
> +
> +	assert(!mi->filter_stage);
>  
> -	/* normalize the log message to UTF-8. */
> -	if (convert_to_utf8(mi, line, mi->charset.buf))
> -		return 0; /* mi->input_error already set */
> +	/*
> +	 * Obtain UTF8 for scissors line and patchbreak checks, but retain the
> +	 * undecoded line in case we need to process it as an in-body header.
> +	 */
> +	utf8_result = try_convert_to_utf8(mi, line, mi->charset.buf, &old_buf,
> +					  &old_len);

Just a minor style suggestion.  As <old_buf, old_len> come in a
pair, fold the line before them, so that the readers can easily
see the association between them.  I.e.

        utf8_result = try_convert_to_utf8(mi, line, mi->charset.buf,
                                          &old_buf, &old_len);


> -	if (mi->use_scissors && is_scissors_line(line)) {
> +	if (!utf8_result && mi->use_scissors && is_scissors_line(line)) {
>  		int i;
>  
> +		if (resembles_rfc2822_header(line))
> +			/*
> +			 * Explicitly reject scissor lines that resemble a RFC
> +			 * 2822 header, to avoid being prone to error.
> +			 */
> +			die("scissors line resembles RFC 2822 header");
> +

I guess "disambiguate to favor scissors" is not that difficult ;-)

>  		strbuf_setlen(&mi->log_message, 0);
>  		mi->header_stage = 1;
>  
> @@ -661,18 +717,47 @@ static int handle_commit_msg(struct mailinfo *mi, struct strbuf *line)
>  				strbuf_release(mi->s_hdr_data[i]);
>  			mi->s_hdr_data[i] = NULL;
>  		}
> -		return 0;
> +		goto handle_commit_msg_out;
>  	}
> -
> -	if (patchbreak(line)) {
> +	if (!utf8_result && patchbreak(line)) {
>  		if (mi->message_id)
>  			strbuf_addf(&mi->log_message,
>  				    "Message-Id: %s\n", mi->message_id);
> -		return 1;
> +		ret = 1;
> +		goto handle_commit_msg_out;
>  	}
>  
> +	if (mi->header_stage) {
> +		char *buf = old_buf ? old_buf : line->buf;
> +		if (buf[0] == 0 || (buf[0] == '\n' && buf[1] == 0))
> +			goto handle_commit_msg_out;
> +	}
> +
> +	if (mi->use_inbody_headers && mi->header_stage) {
> +		char *buf = old_buf ? old_buf : line->buf;
> +		size_t len = old_buf ? old_len : line->len;
> +		mi->header_stage = check_header_raw(mi, buf, len,
> +						    mi->s_hdr_data, 0);

Just a minor comment, but I guess check_header_raw() refactoring is
not strictly needed after all, as this callsite can wrap <buf,len>
into a temporary strbuf.

Unlike the real header that is read in read_one_header_line() inside
a loop to implement line folding before check_header() is called, we
call check_header() before possibly-foldable header lines is fully
assembled into one header.  Probably it comes in later patches, I
guess.

It is not immediately obvious to me how this step helps further work
done by later patches in the series until I read them, but so far
what this patch did looks understandable to me ;-)

Thanks.


> +		if (mi->header_stage)
> +			goto handle_commit_msg_out;
> +	} else
> +		/* Only trim the first (blank) line of the commit message
> +		 * when ignoring in-body headers.
> +		 */
> +		mi->header_stage = 0;
> +
> +	/* If adding as a log message, conversion to UTF-8 is required. */
> +	if (utf8_result) {
> +		mi->input_error = -1;
> +		error("cannot convert from %s to %s",
> +		      mi->charset.buf, mi->metainfo_charset);
> +		goto handle_commit_msg_out;
> +	}
>  	strbuf_addbuf(&mi->log_message, line);
> -	return 0;
> +
> +handle_commit_msg_out:
> +	free(old_buf);
> +	return ret;
>  }
>  
>  static void handle_patch(struct mailinfo *mi, const struct strbuf *line)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC/PATCH 2/3] mailinfo: correct malformed test example
  2016-09-16 17:37         ` [RFC/PATCH 2/3] mailinfo: correct malformed test example Jonathan Tan
@ 2016-09-16 19:19           ` Junio C Hamano
  2016-09-16 22:42             ` Jonathan Tan
  0 siblings, 1 reply; 24+ messages in thread
From: Junio C Hamano @ 2016-09-16 19:19 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, peff

Jonathan Tan <jonathantanmy@google.com> writes:

> An existing sample message (0015) in the tests for mailinfo contains an
> indented line immediately after an in-body header (without any
> intervening blank line).

This comes from d25e5159 ("git am/mailinfo: Don't look at in-body
headers when rebasing", 2009-11-20), where we want to make sure that
a "From: bogosity" that isn't meant to be an in-body header is not
identified as such, even when it is immediately followed by a
non-blank line.  "From: bogosity" is for msg0015 but the same
applies to the header-looking block for msg0008.

Adding a blank line there will defeat the whole point of the test,
which is to make sure we don't do anything funky when --no-inbody-headers
is asked for, no?

> diff --git a/t/t5100/info0008--no-inbody-headers b/t/t5100/info0008--no-inbody-headers
> new file mode 100644
> index 0000000..e8a2951
> --- /dev/null
> +++ b/t/t5100/info0008--no-inbody-headers
> @@ -0,0 +1,5 @@
> +Author: Junio C Hamano
> +Email: junio@kernel.org
> +Subject: another patch
> +Date: Fri, 9 Jun 2006 00:44:16 -0700
> +
> diff --git a/t/t5100/msg0008--no-inbody-headers b/t/t5100/msg0008--no-inbody-headers
> new file mode 100644
> index 0000000..d6e950e
> --- /dev/null
> +++ b/t/t5100/msg0008--no-inbody-headers
> @@ -0,0 +1,6 @@
> +From: A U Thor <a.u.thor@example.com>
> +Subject: [PATCH] another patch
> +>Here is an empty patch from A U Thor.
> +
> +Hey you forgot the patch!
> +
> diff --git a/t/t5100/msg0015--no-inbody-headers b/t/t5100/msg0015--no-inbody-headers
> index be5115b..44a6ce7 100644
> --- a/t/t5100/msg0015--no-inbody-headers
> +++ b/t/t5100/msg0015--no-inbody-headers
> @@ -1,3 +1,4 @@
>  From: bogosity
> +
>    - a list
>    - of stuff
> diff --git a/t/t5100/patch0008--no-inbody-headers b/t/t5100/patch0008--no-inbody-headers
> new file mode 100644
> index 0000000..e69de29
> diff --git a/t/t5100/sample.mbox b/t/t5100/sample.mbox
> index 8b2ae06..ba8b208 100644
> --- a/t/t5100/sample.mbox
> +++ b/t/t5100/sample.mbox
> @@ -656,6 +656,7 @@ Subject: check bogus body header (from)
>  Date: Fri, 9 Jun 2006 00:44:16 -0700
>  
>  From: bogosity
> +
>    - a list
>    - of stuff
>  ---

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC/PATCH 3/3] mailinfo: handle in-body header continuations
  2016-09-16 17:37         ` [RFC/PATCH 3/3] mailinfo: handle in-body header continuations Jonathan Tan
@ 2016-09-16 20:17           ` Junio C Hamano
  2016-09-16 20:49             ` Jonathan Tan
  2016-09-16 21:51           ` Jeff King
  1 sibling, 1 reply; 24+ messages in thread
From: Junio C Hamano @ 2016-09-16 20:17 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, peff

Jonathan Tan <jonathantanmy@google.com> writes:

> Instead of repeatedly calling "check_header" (as in this patch), one
> alternate method to accomplish this would be to have a buffer of
> potential header text in struct mailinfo to be flushed whenever a header
> is known to end (for example, if we detect the start of a new header),
> but this makes the logic more complicated - for example, the flushing
> would not only invoke check_header but would also need to reconstruct
> the original lines, possibly decode them into UTF-8, and store them in
> log_message, and any failures would be noticed a few "lines" away from
> the original failure point. Also, care would need to be taken to flush
> the buffer at all appropriate places.

I am not sure how much the UTF-8 decoding argument above matters.

The current way handle_commit_msg() is structured (before any of
your patches) is for it to take one raw line at a time and:

    - If we haven't seen a non-header line (i.e. at the beginning,
      or we were reading in-body headers), return without doing
      anything.

    - If we are told to honor in-body headers and if we haven't seen
      a non-header line, see if the line itself looks like a header
      and if so, handle it as an in-body header and return.  If that
      line is not an in-body header, continue processing.

    - If the processing reaches at this point, we are done with the
      headers (i.e. mi->header_stage is set to 0).

    - Make sure the line is in utf8.

    - If it is a scissors line and we are told to honor scissors
      lines, ignore what we have read so far and go back to "we
      haven't seen a non-header line" state and return.

    - If it is a patch break, return and signal the caller we are
      done with the log message.

    - Otherwise accumulate the line as part of the log message.

The bug we want to address is in the second step.  We only look at
the first line of folded in-body header line, because we are fed one
line at a time.

If we keep the location of UTF8 conversion, and buffered the in-body
header in "struct mailinfo *mi" (like you seem to do in this patch),
what we will queue there will be _before_ conversion.  We'd call
check_header() on it once we know one logical line of a header is
accumulated, and check_header() would do the right conversion via
decode_header() etc., so I do not see why you need to worry about
the encoding issues at all.

I wonder if the simplest would be to introduce another state in the
state machine that is "we know we are processing in-body header, and
we have read early part of an in-body header line that may not be
complete".

In other words, wouldn't something like the illustration at the end
of this message sufficient?  If the body consists solely of in-body
header without any patch or patchbreak, we may reach EOF with
something in mi->in_line_header buffer and nothing in
mi->log_message and without this function getting any chance to
return 1, so a careful caller may want to flush in_line_header, but
the overall result of the mailinfo subsystem in such a case would be
an error ("you didn't have any patch or a message?"), so it may not
matter too much.

What am I missing?

handle_commit_msg(...)
{
	if (mi->in_line_header->len) {
		/* we have read the beginning of one in-line header */
		if (line->len && isspace(*line->buf))
			append to mi->in_line_header strbuf;
                        return 0;
		/* otherwise we know mi->in_line_header is now complete */
		check_header(mi, mi->in_line_header, ...);
		strbuf_reset(&mi->in_line_header);
	}

	if (mi->header_stage && (it is a blank line))
		return 0;

	if (mi->use_inbody_headers && mi->header_stage &&
	    (the line looks like beginning of 2822 header)) {
		strbuf_addbuf(&mi->in_line_header, line);
		return 0;
	}
        /* otherwise we are no longer looking at headers */
        mi->header_stage = 0;

	/* normalize the log message to UTF-8. */
	if (convert_to_utf8(mi, line, mi->charset.buf))
		return 0; /* mi->input_error already set */

	if (mi->use_scissors && is_scissors_line(line)) {
		int i;

		strbuf_setlen(&mi->log_message, 0);
		mi->header_stage = 1;

		/*
		 * We may have already read "secondary headers"; purge
		 * them to give ourselves a clean restart.
		 */
		for (i = 0; header[i]; i++) {
			if (mi->s_hdr_data[i])
				strbuf_release(mi->s_hdr_data[i]);
			mi->s_hdr_data[i] = NULL;
		}
		return 0;
	}

	if (patchbreak(line)) {
		if (mi->message_id)
			strbuf_addf(&mi->log_message,
				    "Message-Id: %s\n", mi->message_id);
		return 1;
	}

	strbuf_addbuf(&mi->log_message, line);
	return 0;
}
 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC/PATCH 3/3] mailinfo: handle in-body header continuations
  2016-09-16 20:17           ` Junio C Hamano
@ 2016-09-16 20:49             ` Jonathan Tan
  2016-09-16 20:59               ` Junio C Hamano
  0 siblings, 1 reply; 24+ messages in thread
From: Jonathan Tan @ 2016-09-16 20:49 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, peff

On 09/16/2016 01:17 PM, Junio C Hamano wrote:
> In other words, wouldn't something like the illustration at the end
> of this message sufficient?  If the body consists solely of in-body
> header without any patch or patchbreak, we may reach EOF with
> something in mi->in_line_header buffer and nothing in
> mi->log_message and without this function getting any chance to
> return 1, so a careful caller may want to flush in_line_header, but
> the overall result of the mailinfo subsystem in such a case would be
> an error ("you didn't have any patch or a message?"), so it may not
> matter too much.

Noted. (This was one of my concerns - that the caller should, but did 
not, flush.)

> What am I missing?
>
> handle_commit_msg(...)
> {
> 	if (mi->in_line_header->len) {
> 		/* we have read the beginning of one in-line header */
> 		if (line->len && isspace(*line->buf))

This would mean that a message like the following:

   From: Me <me@example.com>
     -- 8< -- this scissors line will be treated as part of "From"

would have its scissors line treated as a header.

The main reason why I reordered the checks (in RFC/PATCH 1/3) is to 
avoid this (treating a scissors line with an initial space immediately 
following an in-body header as part of a header).

(If this is not a concern then yes, I agree that the way you described 
is simpler and better.)

> 			append to mi->in_line_header strbuf;
>                         return 0;
> 		/* otherwise we know mi->in_line_header is now complete */
> 		check_header(mi, mi->in_line_header, ...);
> 		strbuf_reset(&mi->in_line_header);
> 	}
>
> 	if (mi->header_stage && (it is a blank line))
> 		return 0;
>
> 	if (mi->use_inbody_headers && mi->header_stage &&
> 	    (the line looks like beginning of 2822 header)) {
> 		strbuf_addbuf(&mi->in_line_header, line);
> 		return 0;
> 	}
>         /* otherwise we are no longer looking at headers */
>         mi->header_stage = 0;
>
> 	/* normalize the log message to UTF-8. */
> 	if (convert_to_utf8(mi, line, mi->charset.buf))
> 		return 0; /* mi->input_error already set */
>
> 	if (mi->use_scissors && is_scissors_line(line)) {
> 		int i;
>
> 		strbuf_setlen(&mi->log_message, 0);
> 		mi->header_stage = 1;
>
> 		/*
> 		 * We may have already read "secondary headers"; purge
> 		 * them to give ourselves a clean restart.
> 		 */
> 		for (i = 0; header[i]; i++) {
> 			if (mi->s_hdr_data[i])
> 				strbuf_release(mi->s_hdr_data[i]);
> 			mi->s_hdr_data[i] = NULL;
> 		}
> 		return 0;
> 	}
>
> 	if (patchbreak(line)) {
> 		if (mi->message_id)
> 			strbuf_addf(&mi->log_message,
> 				    "Message-Id: %s\n", mi->message_id);
> 		return 1;
> 	}
>
> 	strbuf_addbuf(&mi->log_message, line);
> 	return 0;
> }
>
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC/PATCH 3/3] mailinfo: handle in-body header continuations
  2016-09-16 20:49             ` Jonathan Tan
@ 2016-09-16 20:59               ` Junio C Hamano
  2016-09-16 22:36                 ` Jonathan Tan
  0 siblings, 1 reply; 24+ messages in thread
From: Junio C Hamano @ 2016-09-16 20:59 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, peff

Jonathan Tan <jonathantanmy@google.com> writes:

>> handle_commit_msg(...)
>> {
>> 	if (mi->in_line_header->len) {
>> 		/* we have read the beginning of one in-line header */
>> 		if (line->len && isspace(*line->buf))
>
> This would mean that a message like the following:
>
>   From: Me <me@example.com>
>     -- 8< -- this scissors line will be treated as part of "From"
>
> would have its scissors line treated as a header.
>
> The main reason why I reordered the checks (in RFC/PATCH 1/3) is to
> avoid this (treating a scissors line with an initial space immediately
> following an in-body header as part of a header).
>
> (If this is not a concern then yes, I agree that the way you described
> is simpler and better.)

Ahh, OK.  I do not think anybody sane would do the "From:" thing,
but with the "does it look like 2822 header" check to decide if the
first header-looking line should be queued, another failure mode may
be:

	any-random-alpha-and-dash-string: 
         -- >8 -- cut here -- >8 --
        Subject: real subject

        The first line of the real message
        
I personally do not think it matters that much, but if we wanted to
protect us from it we could easily do

        if (mi->in_line_header->len) {
                /* we have read the beginning of one in-line header */
                if (line->len && isspace(*line->buf) &&
                    !(mi->use_scissors && is_scissors_line(line))) {
                        append to mi->in_line_header strbuf;
                        return 0;
                }
                /* otherwise we know mi->in_line_header is now complete */
                check_header(mi, mi->in_line_header, ...);
                strbuf_reset(&mi->in_line_header);
        }
	...

instead, I think.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC/PATCH 1/3] mailinfo: refactor commit message processing
  2016-09-16 19:12           ` Junio C Hamano
@ 2016-09-16 21:46             ` Jeff King
  0 siblings, 0 replies; 24+ messages in thread
From: Jeff King @ 2016-09-16 21:46 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jonathan Tan, git

On Fri, Sep 16, 2016 at 12:12:50PM -0700, Junio C Hamano wrote:

> > +static int check_header_raw(struct mailinfo *mi,
> > +			    char *buf, size_t len,
> > +			    struct strbuf *hdr_data[], int overwrite) {
> > +	const struct strbuf sb = {0, len, buf};
> > +	return check_header(mi, &sb, hdr_data, overwrite);
> > +}
> 
> IIUC, this is a helper for callers that do not have a strbuf but
> instead have <buf, len> pair to perform the same check_header() the
> callers that have strbuf can do.
> 
> As check_header() uses the strbuf as a read-only entity, wrapping
> the <buf, len> pair in a temporary strbuf like this is safe.
> 
> The incoming <buf> should conceptually be "const char *", but it's
> OK.

I think the "right" way to do this would be to continue taking a "char
*", and then strbuf_attach() it. That saves us from unexpectedly
violating any strbuf invariants.

If our assumption that check_header() does not touch the
contents turns out to be wrong, neither strategy would inform our
caller, though. I think you'd want something like:

  assert(sb.buf == buf);

after check_header() returns (though I guess we are in theory protected
by the "const").

That being said...

> If check_header() didn't call any helper function that gets passed
> &sb as a strbuf, or if convertiong the helper function to take a
> <buf, len> pair instead, I would actually suggest refactoring this
> the other way around, though.  That is, move the implementation of
> check_header() to this function, updating its reference to line->buf
> and line->len to reference to <buf> and <len>, and then make
> check_header() a thin wrapper that does
> 
>         check_header(mi, const struct strbuf *line,
>                      struct strbuf *hdr_data[], int overwrite)
>         {
>                 return check_header_raw(mi, line->buf, line->len,
>                                         hdr_data, overwrite);
>         }

This is _way_ better, and it looks like check_header() could handle it
easily. Looking at it, I also suspect the cascading if in that function
could be made more pleasant by modeling cmp_header()'s interface after
skip_prefix_mem(), but that is totally orthogonal and optional.

-Peff

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC/PATCH 3/3] mailinfo: handle in-body header continuations
  2016-09-16 17:37         ` [RFC/PATCH 3/3] mailinfo: handle in-body header continuations Jonathan Tan
  2016-09-16 20:17           ` Junio C Hamano
@ 2016-09-16 21:51           ` Jeff King
  1 sibling, 0 replies; 24+ messages in thread
From: Jeff King @ 2016-09-16 21:51 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, gitster

On Fri, Sep 16, 2016 at 10:37:24AM -0700, Jonathan Tan wrote:

> Mailinfo currently handles multi-line headers, but it does not handle
> multi-line in-body headers. Teach it to handle such headers, for
> example, for this input:
> 
>   Subject: a very long
>    broken line
> 
>   Subject: another very long
>    broken line
> 
> interpret the in-body subject to be "another very long broken line"
> instead of "another very long".

This puzzled me; we should stop parsing in-body headers after the first
blank line. But then I realized you probably meant the first "Subject"
to be the real mail header.

I wonder if it would be more obvious with an example like:

  From: ...
  Date: ...
  Subject: the actual mail subject

  Subject: a very long
    broken line

Or something.

-Peff

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC/PATCH 3/3] mailinfo: handle in-body header continuations
  2016-09-16 20:59               ` Junio C Hamano
@ 2016-09-16 22:36                 ` Jonathan Tan
  2016-09-16 23:04                   ` Junio C Hamano
  0 siblings, 1 reply; 24+ messages in thread
From: Jonathan Tan @ 2016-09-16 22:36 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, peff

On 09/16/2016 01:59 PM, Junio C Hamano wrote:
>         if (mi->in_line_header->len) {
>                 /* we have read the beginning of one in-line header */
>                 if (line->len && isspace(*line->buf) &&
>                     !(mi->use_scissors && is_scissors_line(line))) {

Minor note: this means that the scissors check appears twice in the 
code, once here and once below (for the non-header case).

>                         append to mi->in_line_header strbuf;
>                         return 0;
>                 }
>                 /* otherwise we know mi->in_line_header is now complete */
>                 check_header(mi, mi->in_line_header, ...);

(Sorry - should have also noticed this in your original e-mail.)

I'm concerned about what happens if check_header fails - we would then 
have some lines which need to be treated as log messages. (At least, 
they are currently treated that way.)

To treat them as log messages, we would need to convert them into UTF-8, 
which may possibly fail, so we would have to figure out how to clean up 
(we have to clean up because we cannot `die` immediately, at least to 
preserve the current behavior). Also, we are likely to detect such a 
failure only while processing a subsequent line - this non-"fail fast" 
currently is fine, but I'm concerned that it will hinder future 
development (especially when debugging).

Minor note: the buffer would also need to be more complicated (instead 
of the current single buffer), either:

o store newlines in that buffer (and we would need to remove all
   newlines before passing to check_header), or
o 2 buffers: one with newlines (for log messages) and one without (for
   check_header).

In light of the above (multiple scissors checks, late detection of 
failure, more complicated buffer), it seems clearer to me to just change 
the order of the checks (as in RFC/PATCH 1/3). This necessitates holding 
on to the old un-decoded buf and len, but this seems easier to me than 
the above.

>                 strbuf_reset(&mi->in_line_header);
>         }
> 	...

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC/PATCH 2/3] mailinfo: correct malformed test example
  2016-09-16 19:19           ` Junio C Hamano
@ 2016-09-16 22:42             ` Jonathan Tan
  2016-09-16 22:55               ` Junio C Hamano
  0 siblings, 1 reply; 24+ messages in thread
From: Jonathan Tan @ 2016-09-16 22:42 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, peff

On 09/16/2016 12:19 PM, Junio C Hamano wrote:
> Jonathan Tan <jonathantanmy@google.com> writes:
>
>> An existing sample message (0015) in the tests for mailinfo contains an
>> indented line immediately after an in-body header (without any
>> intervening blank line).
>
> This comes from d25e5159 ("git am/mailinfo: Don't look at in-body
> headers when rebasing", 2009-11-20), where we want to make sure that
> a "From: bogosity" that isn't meant to be an in-body header is not
> identified as such, even when it is immediately followed by a
> non-blank line.  "From: bogosity" is for msg0015 but the same
> applies to the header-looking block for msg0008.
>
> Adding a blank line there will defeat the whole point of the test,
> which is to make sure we don't do anything funky when --no-inbody-headers
> is asked for, no?

Before I revise the patch set...I think that the point of 0015 would be 
handled by 0008 (after this patch is applied), but if you prefer that 
0015 retain its purpose, I can unindent the bullet list in 0015 instead 
of adding the extra line (and then dropping all 0008 changes). Would 
that be better? (0015 needs to be changed somehow, because its indented 
line would be interpreted as a continuation line after RFC/PATCH 3/3 is 
applied.)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC/PATCH 2/3] mailinfo: correct malformed test example
  2016-09-16 22:42             ` Jonathan Tan
@ 2016-09-16 22:55               ` Junio C Hamano
  2016-09-17  0:31                 ` Jonathan Tan
  0 siblings, 1 reply; 24+ messages in thread
From: Junio C Hamano @ 2016-09-16 22:55 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, peff

Jonathan Tan <jonathantanmy@google.com> writes:

> On 09/16/2016 12:19 PM, Junio C Hamano wrote:
>> Jonathan Tan <jonathantanmy@google.com> writes:
>>
>>> An existing sample message (0015) in the tests for mailinfo contains an
>>> indented line immediately after an in-body header (without any
>>> intervening blank line).
>>
>> This comes from d25e5159 ("git am/mailinfo: Don't look at in-body
>> headers when rebasing", 2009-11-20), where we want to make sure that
>> a "From: bogosity" that isn't meant to be an in-body header is not
>> identified as such, even when it is immediately followed by a
>> non-blank line.  "From: bogosity" is for msg0015 but the same
>> applies to the header-looking block for msg0008.
>>
>> Adding a blank line there will defeat the whole point of the test,
>> which is to make sure we don't do anything funky when --no-inbody-headers
>> is asked for, no?
>
> Before I revise the patch set...I think that the point of 0015 would
> be handled by 0008 (after this patch is applied), but if you prefer
> that 0015 retain its purpose, I can unindent the bullet list in 0015
> instead of adding the extra line (and then dropping all 0008
> changes). Would that be better? (0015 needs to be changed somehow,
> because its indented line would be interpreted as a continuation line
> after RFC/PATCH 3/3 is applied.)

Hmph, these:

 t/t5100/info0008--no-inbody-headers  | 5 +++++
 t/t5100/msg0008--no-inbody-headers   | 6 ++++++
 t/t5100/msg0015--no-inbody-headers   | 1 +

have --no-inbody-headers in their names; wouldn't that an indication
that they are expected output when mailinfo is run while in-body
header feature disabled?

I would have expected that it would make more sense to make no
change to sample.mbox and have updated expectation to outputs in the
case where in-body header feature is enabled.

To make sure this new feature will not break in the future, we would
want a brand new message with a folded in-body header added to the
sample.mbox, and see how it is parsed by mailinfo with in-body
header feature enabled (and disabled).


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC/PATCH 3/3] mailinfo: handle in-body header continuations
  2016-09-16 22:36                 ` Jonathan Tan
@ 2016-09-16 23:04                   ` Junio C Hamano
  2016-09-17  0:22                     ` Jonathan Tan
  0 siblings, 1 reply; 24+ messages in thread
From: Junio C Hamano @ 2016-09-16 23:04 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, peff

Jonathan Tan <jonathantanmy@google.com> writes:

> On 09/16/2016 01:59 PM, Junio C Hamano wrote:
>>         if (mi->in_line_header->len) {
>>                 /* we have read the beginning of one in-line header */
>>                 if (line->len && isspace(*line->buf) &&
>>                     !(mi->use_scissors && is_scissors_line(line))) {
>
> Minor note: this means that the scissors check appears twice in the
> code, once here and once below (for the non-header case).

Yes.  I actually was wondering if it is even more sensible to always
have the scissors check at the very beginning.  Even if we saw a
half-written in-body header already in the message, when we see a
scissors line, we clear the slate and restart as if the line after
the scissors is the first line in the body of the message.

>>                         append to mi->in_line_header strbuf;
>>                         return 0;
>>                 }
>>                 /* otherwise we know mi->in_line_header is now complete */
>>                 check_header(mi, mi->in_line_header, ...);
>
> (Sorry - should have also noticed this in your original e-mail.)
>
> I'm concerned about what happens if check_header fails - we would then
> have some lines which need to be treated as log messages. (At least,
> they are currently treated that way.)

I actually think we should refactor check_header() further so that
in-body header processing does not even see things that shouldn't be
changed.  The current check_header() should be used only for real
mail headers, and then a reduced version of check_header() that is
called for in-body will _ONLY_ handle the header lines that are
handled by the first "search for the interesting parts" loop.

And of course we would update your "does it look like rfc2822?" to
match what are handled by the "interesting parts" loop.  That I
think would match the current behaviour much better, I would think.

The ">From " and "[PATCH]" cases in check_header() should not even
be there.  We should handle them inside handle_commit_msg(), as
these two cases should never appear in the real header part of a
message.

And if we clean it up like that, I do not think we would ever need
to worry about "ah, it looked like a header but it is not after
all".  And not having to worry about it is a good thing and should
be one of the primary goals in this conversion, I whould think.

Thanks.



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC/PATCH 3/3] mailinfo: handle in-body header continuations
  2016-09-16 23:04                   ` Junio C Hamano
@ 2016-09-17  0:22                     ` Jonathan Tan
  0 siblings, 0 replies; 24+ messages in thread
From: Jonathan Tan @ 2016-09-17  0:22 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, peff

On 09/16/2016 04:04 PM, Junio C Hamano wrote:
> Jonathan Tan <jonathantanmy@google.com> writes:
>> I'm concerned about what happens if check_header fails - we would then
>> have some lines which need to be treated as log messages. (At least,
>> they are currently treated that way.)
>
> I actually think we should refactor check_header() further so that
> in-body header processing does not even see things that shouldn't be
> changed.  The current check_header() should be used only for real
> mail headers, and then a reduced version of check_header() that is
> called for in-body will _ONLY_ handle the header lines that are
> handled by the first "search for the interesting parts" loop.
>
> And of course we would update your "does it look like rfc2822?" to
> match what are handled by the "interesting parts" loop.  That I
> think would match the current behaviour much better, I would think.

There would be a bit of code duplication in that this "does it look like 
rfc2822" function would also need to account for duplicate headers (e.g. 
2 "Subject:" lines in the in-body headers) because check_header would 
reject the 2nd one, but that is minor. (Alternatively, we could just 
allow duplicate headers in the in-body headers.)

> The ">From " and "[PATCH]" cases in check_header() should not even
> be there.  We should handle them inside handle_commit_msg(), as
> these two cases should never appear in the real header part of a
> message.

> And if we clean it up like that, I do not think we would ever need
> to worry about "ah, it looked like a header but it is not after
> all".  And not having to worry about it is a good thing and should
> be one of the primary goals in this conversion, I whould think.

Yes, this makes sense. I'll go ahead and make a patch set implementing 
this (unless anyone has any objections).

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC/PATCH 2/3] mailinfo: correct malformed test example
  2016-09-16 22:55               ` Junio C Hamano
@ 2016-09-17  0:31                 ` Jonathan Tan
  2016-09-17  3:48                   ` Junio C Hamano
  0 siblings, 1 reply; 24+ messages in thread
From: Jonathan Tan @ 2016-09-17  0:31 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, peff

On 09/16/2016 03:55 PM, Junio C Hamano wrote:
> Hmph, these:
>
>  t/t5100/info0008--no-inbody-headers  | 5 +++++
>  t/t5100/msg0008--no-inbody-headers   | 6 ++++++
>  t/t5100/msg0015--no-inbody-headers   | 1 +
>
> have --no-inbody-headers in their names; wouldn't that an indication
> that they are expected output when mailinfo is run while in-body
> header feature disabled?

Yes, that's correct (they are the test data for when the in-body header 
feature is disabled).

> I would have expected that it would make more sense to make no
> change to sample.mbox and have updated expectation to outputs in the
> case where in-body header feature is enabled.

The sample.mbox file contains the following:

   From nobody Mon Sep 17 00:00:00 2001
   From: A U Thor <a.u.thor@example.com>
   Subject: check bogus body header (from)
   Date: Fri, 9 Jun 2006 00:44:16 -0700

   From: bogosity
     - a list
     - of stuff

Unchanged, the subsequent patch would break this test because it would 
interpret that as a multi-line "From" in-body header when in-body 
headers are *not* disabled.

Besides changing sample.mbox, the other way to make sure that this test 
passes is to suppress the test when in-body headers are *not* disabled, 
but looking at t5100* (directory and script), it seemed more 
straightforward to modify sample.mbox.

The patch I sent added a blank line after "From: bogosity", but removing 
the spaces before "- a list" and "- of stuff" would work too.

> To make sure this new feature will not break in the future, we would
> want a brand new message with a folded in-body header added to the
> sample.mbox, and see how it is parsed by mailinfo with in-body
> header feature enabled (and disabled).

OK, I'll add this test. (The subsequent patch already has the brand new 
message, but not the test where in-body headers are disabled.)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC/PATCH 2/3] mailinfo: correct malformed test example
  2016-09-17  0:31                 ` Jonathan Tan
@ 2016-09-17  3:48                   ` Junio C Hamano
  0 siblings, 0 replies; 24+ messages in thread
From: Junio C Hamano @ 2016-09-17  3:48 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, peff

Jonathan Tan <jonathantanmy@google.com> writes:

>   From: bogosity
>     - a list
>     - of stuff
>
> Unchanged, the subsequent patch would break this test because it would
> interpret that as a multi-line "From" in-body header when in-body
> headers are *not* disabled.

Yes, that is totally expected.  So I would be perfectly fine if your
patch changed the test vector for that case, saying "Allowing a
folded in-body header means the expected result for the above three
lines has to change".


^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2016-09-17  3:48 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-02 19:58 [PATCH] sequencer: support folding in rfc2822 footer Jonathan Tan
2016-09-03  2:23 ` Junio C Hamano
2016-09-06 22:08   ` Jonathan Tan
2016-09-06 23:30     ` Jonathan Tan
2016-09-07  6:38       ` Jeff King
2016-09-16 17:37         ` [RFC/PATCH 0/3] handle multiline in-body headers Jonathan Tan
2016-09-16 18:29           ` Junio C Hamano
2016-09-16 17:37         ` [RFC/PATCH 1/3] mailinfo: refactor commit message processing Jonathan Tan
2016-09-16 19:12           ` Junio C Hamano
2016-09-16 21:46             ` Jeff King
2016-09-16 17:37         ` [RFC/PATCH 2/3] mailinfo: correct malformed test example Jonathan Tan
2016-09-16 19:19           ` Junio C Hamano
2016-09-16 22:42             ` Jonathan Tan
2016-09-16 22:55               ` Junio C Hamano
2016-09-17  0:31                 ` Jonathan Tan
2016-09-17  3:48                   ` Junio C Hamano
2016-09-16 17:37         ` [RFC/PATCH 3/3] mailinfo: handle in-body header continuations Jonathan Tan
2016-09-16 20:17           ` Junio C Hamano
2016-09-16 20:49             ` Jonathan Tan
2016-09-16 20:59               ` Junio C Hamano
2016-09-16 22:36                 ` Jonathan Tan
2016-09-16 23:04                   ` Junio C Hamano
2016-09-17  0:22                     ` Jonathan Tan
2016-09-16 21:51           ` Jeff King

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).