git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH] range-diff: support reading mbox files
@ 2022-11-15 18:20 Johannes Schindelin via GitGitGadget
  2022-11-16  2:09 ` Taylor Blau
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Johannes Schindelin via GitGitGadget @ 2022-11-15 18:20 UTC (permalink / raw)
  To: git; +Cc: Johannes Schindelin, Johannes Schindelin

From: Johannes Schindelin <johannes.schindelin@gmx.de>

Internally, the `git range-diff` command spawns a `git log` process and
parses its output for the given commit ranges.

This works well when the patches that need to be compared are present in
the local repository in the form of commits.

In scenarios where that is not the case, the `range-diff` command is
currently less helpful.

The Git mailing list is such a scenario: Instead of using Git to
exchange commits, the patches are sent there as plain-text and no commit
range can be specified to let `range-diff` consume those patches.

Instead, the expectation is to download the mails, apply them locally
and then use `range-diff`. This can be quite cumbersome e.g. when a
suitable base revision has to be found first where the patch applies
cleanly.

Let's offer a way to read those patches from pre-prepared MBox files
instead when an argument "mbox:<filename>" is passed instead of a commit
range.

For extra convenience, interpret the filename `-` as standard input.
This makes it easy to compare contributions on the mailing list with the
actual commits that were integrated into Git's main branch. Example:

	commit=5c4003ca3f0e9ac6d3c8aa3e387ff843bd440411
	mid=bdfa3845b81531863941e6a97c28eb1afa62dd2c.1489435755.git.johannes.schindelin@gmx.de
	curl -s https://lore.kernel.org/git/$mid/raw |
	git range-diff mbox:- $commit^!

This addresses https://github.com/gitgitgadget/git/issues/207

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
    range-diff: support reading mbox files
    
    I frequently find myself wanting to look at the range-diff between some
    local commits and the patches on the Git mailing list, but unwilling to
    go through the process of finding an appropriate base revision to apply
    the patches onto (just to throw the generated patches away afterwards,
    anyway).
    
    So I came up with this patch. May it be helpful to other developers,
    too.
    
    This patch contains a home-rolled mbox parser. Initially, I wrote a
    really basic parser and it worked well enough, but, you know, as things
    go it became more complex than that in order to provide actually useful
    range-diffs for existing commits and their corresponding mails (because
    of in-body From: headers, because of -- trailers and long subjects, just
    to name a few reasons). In hindsight, it might have made sense to try to
    to reuse the parser that is available in mailinfo.c, which I had
    initially dismissed as overly complex and unnecessary for this use case.
    If anyone feels up to it, I would invite them to adjust this code to
    replace the mbox parser with one based on the mailinfo.c. Incrementally,
    of course, because the perfect is the enemy of the good.

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1420%2Fdscho%2Frange-diff-from-mbox-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1420/dscho/range-diff-from-mbox-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1420

 Documentation/git-range-diff.txt |   3 +-
 range-diff.c                     | 317 ++++++++++++++++++++++++++++++-
 t/t3206-range-diff.sh            |   9 +
 3 files changed, 327 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-range-diff.txt b/Documentation/git-range-diff.txt
index 0b393715d70..e2c4661acde 100644
--- a/Documentation/git-range-diff.txt
+++ b/Documentation/git-range-diff.txt
@@ -37,7 +37,8 @@ There are three ways to specify the commit ranges:
 
 - `<range1> <range2>`: Either commit range can be of the form
   `<base>..<rev>`, `<rev>^!` or `<rev>^-<n>`. See `SPECIFYING RANGES`
-  in linkgit:gitrevisions[7] for more details.
+  in linkgit:gitrevisions[7] for more details. Alternatively, the
+  patches can be provided as an mbox-formatted file via `mbox:<path>`.
 
 - `<rev1>...<rev2>`. This is equivalent to
   `<rev2>..<rev1> <rev1>..<rev2>`.
diff --git a/range-diff.c b/range-diff.c
index 124dd678c38..7c84cdbeffa 100644
--- a/range-diff.c
+++ b/range-diff.c
@@ -12,6 +12,7 @@
 #include "userdiff.h"
 #include "apply.h"
 #include "revision.h"
+#include "dir.h"
 
 struct patch_util {
 	/* For the search for an exact match */
@@ -26,6 +27,293 @@ struct patch_util {
 	struct object_id oid;
 };
 
+static inline int strtost(char const *s, size_t *result, const char **end)
+{
+	unsigned long u;
+	char *p;
+
+	errno = 0;
+	/* negative values would be accepted by strtoul */
+	if (*s == '-')
+		return -1;
+	u = strtoul(s, &p, 10);
+	if (errno || p == s)
+		return -1;
+	if (result)
+		*result = u;
+	*end = p;
+
+	return 0;
+}
+
+static int parse_hunk_header(const char *p,
+			     size_t *old_count, size_t *new_count,
+			     const char **end)
+{
+	size_t o = 1, n = 1;
+
+	if (!skip_prefix(p, "@@ -", &p) ||
+	    strtost(p, NULL, &p) ||
+	    (*p != ' ' && (*p != ',' || strtost(p + 1, &o, &p))) ||
+	    !skip_prefix(p, " +", &p) ||
+	    strtost(p, NULL, &p) ||
+	    (*p != ' ' && (*p != ',' || strtost(p + 1, &n, &p))) ||
+	    !skip_prefix(p, " @@", &p))
+		return -1;
+
+	*old_count = o;
+	*new_count = n;
+	*end = p;
+
+	return 0;
+}
+
+static inline int find_eol(const char *line, size_t size)
+{
+	char *eol;
+
+	eol = memchr(line, '\n', size);
+	if (!eol)
+		return size;
+
+	if (eol != line && eol[-1] == '\r')
+		eol[-1] = '\0';
+	else
+		*eol = '\0';
+
+	return eol + 1 - line;
+}
+
+static int read_mbox(const char *path, struct string_list *list)
+{
+	struct strbuf buf = STRBUF_INIT, contents = STRBUF_INIT;
+	struct strbuf long_subject = STRBUF_INIT;
+	struct patch_util *util = NULL;
+	enum {
+		MBOX_BEFORE_HEADER,
+		MBOX_IN_HEADER,
+		MBOX_IN_COMMIT_MESSAGE,
+		MBOX_AFTER_TRIPLE_DASH,
+		MBOX_IN_DIFF
+	} state = MBOX_BEFORE_HEADER;
+	char *line, *current_filename = NULL;
+	int len;
+	size_t size, old_count = 0, new_count = 0;
+	const char *author = NULL, *subject = NULL;
+
+	if (!strcmp(path, "-")) {
+		if (strbuf_read(&contents, STDIN_FILENO, 0) < 0)
+			return error_errno(_("could not read stdin"));
+	} else if (strbuf_read_file(&contents, path, 0) < 0)
+		return error_errno(_("could not read '%s'"), path);
+
+	line = contents.buf;
+	size = contents.len;
+	for (; size > 0; size -= len, line += len) {
+		const char *p;
+
+		len = find_eol(line, size);
+
+		if (state == MBOX_BEFORE_HEADER) {
+			if (!skip_prefix(line, "From ", &p))
+				continue;
+
+			util = xcalloc(1, sizeof(*util));
+			if (get_oid_hex(p, &util->oid) < 0)
+				oidcpy(&util->oid, null_oid());
+			util->matching = -1;
+			author = subject = NULL;
+
+			state = MBOX_IN_HEADER;
+		}
+
+		if (starts_with(line, "diff --git ")) {
+			struct patch patch = { 0 };
+			struct strbuf root = STRBUF_INIT;
+			int linenr = 0;
+			int orig_len;
+
+			state = MBOX_IN_DIFF;
+			old_count = new_count = 0;
+			strbuf_addch(&buf, '\n');
+			if (!util->diff_offset)
+				util->diff_offset = buf.len;
+			line[len - 1] = '\n';
+			orig_len = len;
+			len = parse_git_diff_header(&root, &linenr, 1, line,
+						    len, size, &patch);
+			if (len < 0) {
+				error(_("could not parse git header '%.*s'"),
+				      orig_len, line);
+				free(util);
+				free(current_filename);
+				string_list_clear(list, 1);
+				strbuf_release(&buf);
+				strbuf_release(&contents);
+				strbuf_release(&long_subject);
+				return -1;
+			}
+
+			if (patch.old_name)
+				skip_prefix(patch.old_name, "a/",
+					    (const char **)&patch.old_name);
+			if (patch.new_name)
+				skip_prefix(patch.new_name, "b/",
+					    (const char **)&patch.new_name);
+
+			strbuf_addstr(&buf, " ## ");
+			if (patch.is_new > 0)
+				strbuf_addf(&buf, "%s (new)", patch.new_name);
+			else if (patch.is_delete > 0)
+				strbuf_addf(&buf, "%s (deleted)", patch.old_name);
+			else if (patch.is_rename)
+				strbuf_addf(&buf, "%s => %s", patch.old_name, patch.new_name);
+			else
+				strbuf_addstr(&buf, patch.new_name);
+
+			free(current_filename);
+			if (patch.is_delete > 0)
+				current_filename = xstrdup(patch.old_name);
+			else
+				current_filename = xstrdup(patch.new_name);
+
+			if (patch.new_mode && patch.old_mode &&
+			    patch.old_mode != patch.new_mode)
+				strbuf_addf(&buf, " (mode change %06o => %06o)",
+					    patch.old_mode, patch.new_mode);
+
+			strbuf_addstr(&buf, " ##\n");
+			util->diffsize++;
+		} else if (state == MBOX_IN_HEADER) {
+			if (!line[0]) {
+				state = MBOX_IN_COMMIT_MESSAGE;
+				/* Look for an in-body From: */
+				if (size > 5 && skip_prefix(line + 1, "From: ", &p)) {
+					size -= p - line;
+					line += p - line;
+					len = find_eol(line, size);
+
+					while (isspace(*p))
+						p++;
+					author = p;
+				}
+				strbuf_addstr(&buf, " ## Metadata ##\n");
+				if (author)
+					strbuf_addf(&buf, "Author: %s\n", author);
+				strbuf_addstr(&buf, "\n ## Commit message ##\n");
+				if (subject)
+					strbuf_addf(&buf, "    %s\n\n", subject);
+			} else if (skip_prefix(line, "From: ", &p)) {
+				while (isspace(*p))
+					p++;
+				author = p;
+			} else if (skip_prefix(line, "Subject: ", &p)) {
+				const char *q;
+
+				while (isspace(*p))
+					p++;
+				subject = p;
+
+				if (starts_with(p, "[PATCH") &&
+				    (q = strchr(p, ']'))) {
+					q++;
+					while (isspace(*q))
+						q++;
+					subject = q;
+				}
+
+				if (len < size && line[len] == ' ') {
+					/* handle long subject */
+					strbuf_reset(&long_subject);
+					strbuf_addstr(&long_subject, subject);
+					while (len < size && line[len] == ' ') {
+						line += len;
+						size -= len;
+						len = find_eol(line, size);
+						strbuf_addstr(&long_subject, line);
+					}
+					subject = long_subject.buf;
+				}
+			}
+		} else if (state == MBOX_IN_COMMIT_MESSAGE) {
+			if (!*line)
+				strbuf_addch(&buf, '\n');
+			else if (strcmp(line, "---")) {
+				int tabs = 0;
+
+				/* simulate tab expansion */
+				while (line[tabs] == '\t')
+					tabs++;
+				strbuf_addf(&buf, "%*s%s\n",
+					    4 + 8 * tabs, "", line + tabs);
+			} else {
+				/*
+				 * Trim the trailing newline that is added
+				 * by `format-patch`.
+				 */
+				strbuf_trim_trailing_newline(&buf);
+				state = MBOX_AFTER_TRIPLE_DASH;
+			}
+		} else if (state == MBOX_IN_DIFF) {
+			switch (line[0]) {
+			case '\0':
+				continue; /* ignore empty lines after diff */
+			case '+':
+			case '-':
+			case ' ':
+				if (!old_count && !new_count)
+					break;
+				if (old_count && line[0] != '+')
+					old_count--;
+				if (new_count && line[0] != '-')
+					new_count--;
+				/* fallthrough */
+			case '\\':
+				strbuf_addstr(&buf, line);
+				strbuf_addch(&buf, '\n');
+				util->diffsize++;
+				continue;
+			case '@':
+				if (parse_hunk_header(line, &old_count,
+						      &new_count, &p))
+					break;
+
+				strbuf_addstr(&buf, "@@");
+				if (current_filename && *p)
+					strbuf_addf(&buf, " %s:",
+						    current_filename);
+				strbuf_addstr(&buf, p);
+				strbuf_addch(&buf, '\n');
+				util->diffsize++;
+				continue;
+			}
+
+			if (util) {
+				string_list_append(list, buf.buf)->util = util;
+				strbuf_reset(&buf);
+			}
+			util = xcalloc(1, sizeof(*util));
+			oidcpy(&util->oid, null_oid());
+			util->matching = -1;
+			author = subject = NULL;
+			state = MBOX_BEFORE_HEADER;
+		}
+	}
+	strbuf_release(&contents);
+
+	if (util) {
+		if (state == MBOX_IN_DIFF)
+			string_list_append(list, buf.buf)->util = util;
+		else
+			free(util);
+	}
+	strbuf_release(&buf);
+	strbuf_release(&long_subject);
+	free(current_filename);
+
+	return 0;
+}
+
 /*
  * Reads the patches into a string list, with the `util` field being populated
  * as struct object_id (will need to be free()d).
@@ -41,6 +329,10 @@ static int read_patches(const char *range, struct string_list *list,
 	ssize_t len;
 	size_t size;
 	int ret = -1;
+	const char *path;
+
+	if (skip_prefix(range, "mbox:", &path))
+		return read_mbox(path, list);
 
 	strvec_pushl(&cp.args, "log", "--no-color", "-p", "--no-merges",
 		     "--reverse", "--date-order", "--decorate=no",
@@ -424,6 +716,19 @@ static void output_pair_header(struct diff_options *diffopt,
 
 		strbuf_addch(buf, ' ');
 		pp_commit_easy(CMIT_FMT_ONELINE, commit, buf);
+	} else {
+		struct patch_util *util = b_util ? b_util : a_util;
+		const char *needle = "\n ## Commit message ##\n";
+		const char *p = !util || !util->patch ?
+			NULL : strstr(util->patch, needle);
+		if (p) {
+			if (status == '!')
+				strbuf_addf(buf, "%s%s", color_reset, color);
+
+			strbuf_addch(buf, ' ');
+			p += strlen(needle);
+			strbuf_add(buf, p, strchrnul(p, '\n') - p);
+		}
 	}
 	strbuf_addf(buf, "%s\n", color_reset);
 
@@ -554,6 +859,9 @@ int show_range_diff(const char *range1, const char *range2,
 	if (range_diff_opts->left_only && range_diff_opts->right_only)
 		res = error(_("options '%s' and '%s' cannot be used together"), "--left-only", "--right-only");
 
+	if (!strcmp(range1, "mbox:-") && !strcmp(range2, "mbox:-"))
+		res = error(_("only one mbox can be read from stdin"));
+
 	if (!res && read_patches(range1, &branch1, range_diff_opts->other_arg))
 		res = error(_("could not parse log for '%s'"), range1);
 	if (!res && read_patches(range2, &branch2, range_diff_opts->other_arg))
@@ -575,10 +883,17 @@ int show_range_diff(const char *range1, const char *range2,
 int is_range_diff_range(const char *arg)
 {
 	char *copy = xstrdup(arg); /* setup_revisions() modifies it */
-	const char *argv[] = { "", copy, "--", NULL };
+	const char *argv[] = { "", copy, "--", NULL }, *path;
 	int i, positive = 0, negative = 0;
 	struct rev_info revs;
 
+	if (skip_prefix(arg, "mbox:", &path)) {
+		if (!strcmp(path, "-") || file_exists(path))
+			return 1;
+		error_errno(_("not an mbox: '%s'"), path);
+		return 0;
+	}
+
 	init_revisions(&revs, NULL);
 	if (setup_revisions(3, argv, &revs, NULL) == 1) {
 		for (i = 0; i < revs.pending.nr; i++)
diff --git a/t/t3206-range-diff.sh b/t/t3206-range-diff.sh
index 459beaf7d9c..2b64d30e8e6 100755
--- a/t/t3206-range-diff.sh
+++ b/t/t3206-range-diff.sh
@@ -783,6 +783,15 @@ test_expect_success 'ranges with pathspecs' '
 	! grep "$topic_oid" actual
 '
 
+test_expect_success 'compare range vs mbox' '
+	git format-patch --stdout topic..mode-only-change >mbox &&
+	git range-diff topic...mode-only-change >expect &&
+	git range-diff mode-only-change..topic mbox:./mbox >actual &&
+	test_cmp expect actual &&
+	git range-diff mode-only-change..topic mbox:- <mbox >actual &&
+	test_cmp expect actual
+'
+
 test_expect_success 'submodule changes are shown irrespective of diff.submodule' '
 	git init sub-repo &&
 	test_commit -C sub-repo sub-first &&

base-commit: b75747829f4c277323c78b1c5973ad63ea038a2d
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH] range-diff: support reading mbox files
  2022-11-15 18:20 [PATCH] range-diff: support reading mbox files Johannes Schindelin via GitGitGadget
@ 2022-11-16  2:09 ` Taylor Blau
  2022-11-16 14:40 ` Phillip Wood
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 16+ messages in thread
From: Taylor Blau @ 2022-11-16  2:09 UTC (permalink / raw)
  To: Johannes Schindelin via GitGitGadget; +Cc: git, Johannes Schindelin

On Tue, Nov 15, 2022 at 06:20:05PM +0000, Johannes Schindelin via GitGitGadget wrote:
> Let's offer a way to read those patches from pre-prepared MBox files
> instead when an argument "mbox:<filename>" is passed instead of a commit
> range.

Very cool. Thanks for working on this. I don't have time to review the
whole parser right now (and I agree that it would be nice to see it hook
into the existing stuff in mailinfo.c), but the idea sounds delightful.

If it's possible to use the battle-tested parts of mailinfo.c without
much effort, I'd be in favor of waiting on that. But in case that it's
not, I agree with you that we shouldn't let perfect be the enemy of the
good[^1], either.

Thanks,
Taylor

[^1]: Though let's make sure that "good" doesn't have any buffer
  overruns in it ;-)

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] range-diff: support reading mbox files
  2022-11-15 18:20 [PATCH] range-diff: support reading mbox files Johannes Schindelin via GitGitGadget
  2022-11-16  2:09 ` Taylor Blau
@ 2022-11-16 14:40 ` Phillip Wood
  2022-11-17 10:26   ` Phillip Wood
  2022-11-18 13:16   ` Johannes Schindelin
  2022-11-17 18:24 ` Ævar Arnfjörð Bjarmason
  2022-11-19 23:11 ` [PATCH v2] " Johannes Schindelin via GitGitGadget
  3 siblings, 2 replies; 16+ messages in thread
From: Phillip Wood @ 2022-11-16 14:40 UTC (permalink / raw)
  To: Johannes Schindelin via GitGitGadget, git
  Cc: Johannes Schindelin, Taylor Blau

Hi Dscho

On 15/11/2022 18:20, Johannes Schindelin via GitGitGadget wrote:
> From: Johannes Schindelin <johannes.schindelin@gmx.de>
> 
> Internally, the `git range-diff` command spawns a `git log` process and
> parses its output for the given commit ranges.
> 
> This works well when the patches that need to be compared are present in
> the local repository in the form of commits.
> 
> In scenarios where that is not the case, the `range-diff` command is
> currently less helpful.
> 
> The Git mailing list is such a scenario: Instead of using Git to
> exchange commits, the patches are sent there as plain-text and no commit
> range can be specified to let `range-diff` consume those patches.
> 
> Instead, the expectation is to download the mails, apply them locally
> and then use `range-diff`. This can be quite cumbersome e.g. when a
> suitable base revision has to be found first where the patch applies
> cleanly.

That's a good motivation for this change.

> Let's offer a way to read those patches from pre-prepared MBox files
> instead when an argument "mbox:<filename>" is passed instead of a commit
> range.
> 
> For extra convenience, interpret the filename `-` as standard input.
> This makes it easy to compare contributions on the mailing list with the
> actual commits that were integrated into Git's main branch. Example:
> 
> 	commit=5c4003ca3f0e9ac6d3c8aa3e387ff843bd440411
> 	mid=bdfa3845b81531863941e6a97c28eb1afa62dd2c.1489435755.git.johannes.schindelin@gmx.de
> 	curl -s https://lore.kernel.org/git/$mid/raw |
> 	git range-diff mbox:- $commit^!
> 
> This addresses https://github.com/gitgitgadget/git/issues/207
> 
> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> ---
>      range-diff: support reading mbox files
>      
>      I frequently find myself wanting to look at the range-diff between some
>      local commits and the patches on the Git mailing list, but unwilling to
>      go through the process of finding an appropriate base revision to apply
>      the patches onto (just to throw the generated patches away afterwards,
>      anyway).
>      
>      So I came up with this patch. May it be helpful to other developers,
>      too.
>      
>      This patch contains a home-rolled mbox parser. Initially, I wrote a
>      really basic parser and it worked well enough, but, you know, as things
>      go it became more complex than that in order to provide actually useful
>      range-diffs for existing commits and their corresponding mails (because
>      of in-body From: headers, because of -- trailers and long subjects, just
>      to name a few reasons). In hindsight, it might have made sense to try to
>      to reuse the parser that is available in mailinfo.c, which I had
>      initially dismissed as overly complex and unnecessary for this use case.
>      If anyone feels up to it, I would invite them to adjust this code to
>      replace the mbox parser with one based on the mailinfo.c. Incrementally,
>      of course, because the perfect is the enemy of the good.
>
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1420%2Fdscho%2Frange-diff-from-mbox-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1420/dscho/range-diff-from-mbox-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/1420
> 
>   Documentation/git-range-diff.txt |   3 +-
>   range-diff.c                     | 317 ++++++++++++++++++++++++++++++-
>   t/t3206-range-diff.sh            |   9 +
>   3 files changed, 327 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/git-range-diff.txt b/Documentation/git-range-diff.txt
> index 0b393715d70..e2c4661acde 100644
> --- a/Documentation/git-range-diff.txt
> +++ b/Documentation/git-range-diff.txt
> @@ -37,7 +37,8 @@ There are three ways to specify the commit ranges:
>   
>   - `<range1> <range2>`: Either commit range can be of the form
>     `<base>..<rev>`, `<rev>^!` or `<rev>^-<n>`. See `SPECIFYING RANGES`
> -  in linkgit:gitrevisions[7] for more details.
> +  in linkgit:gitrevisions[7] for more details. Alternatively, the
> +  patches can be provided as an mbox-formatted file via `mbox:<path>`.
>   
>   - `<rev1>...<rev2>`. This is equivalent to
>     `<rev2>..<rev1> <rev1>..<rev2>`.
> diff --git a/range-diff.c b/range-diff.c
> index 124dd678c38..7c84cdbeffa 100644
> --- a/range-diff.c
> +++ b/range-diff.c
> @@ -12,6 +12,7 @@
>   #include "userdiff.h"
>   #include "apply.h"
>   #include "revision.h"
> +#include "dir.h"
>   
>   struct patch_util {
>   	/* For the search for an exact match */
> @@ -26,6 +27,293 @@ struct patch_util {
>   	struct object_id oid;
>   };
>   
> +static inline int strtost(char const *s, size_t *result, const char **end)
> +{
> +	unsigned long u;
> +	char *p;
> +
> +	errno = 0;
> +	/* negative values would be accepted by strtoul */
> +	if (*s == '-')
> +		return -1;

I think it is right to treat the input as untrusted and so look for 
malformed hunk headers. However This test is not sufficient for that, we 
expect a digit so I think

	if (!isdigit(*s))
		return -1;

would be safer. The use of strtoul() looks good as we set errno to zero 
before the call and check both errno and endp afterwards.

> +	u = strtoul(s, &p, 10);
> +	if (errno || p == s)
> +		return -1;
> +	if (result)
> +		*result = u;
> +	*end = p;
> +
> +	return 0;
> +}
> +
> +static int parse_hunk_header(const char *p,
> +			     size_t *old_count, size_t *new_count,
> +			     const char **end)
> +{
> +	size_t o = 1, n = 1;
> +
> +	if (!skip_prefix(p, "@@ -", &p) ||
> +	    strtost(p, NULL, &p) ||
> +	    (*p != ' ' && (*p != ',' || strtost(p + 1, &o, &p))) ||

It took me a minute to understand the double negatives but it is 
correctly checking if we have ' ' or ',<digits>'

> +	    !skip_prefix(p, " +", &p) ||
> +	    strtost(p, NULL, &p) ||
> +	    (*p != ' ' && (*p != ',' || strtost(p + 1, &n, &p))) ||
> +	    !skip_prefix(p, " @@", &p))
> +		return -1;
> +
> +	*old_count = o;
> +	*new_count = n;
> +	*end = p;
> +
> +	return 0;
> +}
> +
> +static inline int find_eol(const char *line, size_t size)
> +{
> +	char *eol;
> +
> +	eol = memchr(line, '\n', size);
> +	if (!eol)
> +		return size;
> +
> +	if (eol != line && eol[-1] == '\r')
> +		eol[-1] = '\0';
> +	else
> +		*eol = '\0';
> +
> +	return eol + 1 - line;

We return the offset to the start of the next line, not the length of 
the line. This will be important later.

> +}
> +
> +static int read_mbox(const char *path, struct string_list *list)
> +{
> +	struct strbuf buf = STRBUF_INIT, contents = STRBUF_INIT;
> +	struct strbuf long_subject = STRBUF_INIT;
> +	struct patch_util *util = NULL;
> +	enum {
> +		MBOX_BEFORE_HEADER,
> +		MBOX_IN_HEADER,
> +		MBOX_IN_COMMIT_MESSAGE,
> +		MBOX_AFTER_TRIPLE_DASH,
> +		MBOX_IN_DIFF
> +	} state = MBOX_BEFORE_HEADER;
> +	char *line, *current_filename = NULL;
> +	int len;
> +	size_t size, old_count = 0, new_count = 0;
> +	const char *author = NULL, *subject = NULL;
> +
> +	if (!strcmp(path, "-")) {
> +		if (strbuf_read(&contents, STDIN_FILENO, 0) < 0)
> +			return error_errno(_("could not read stdin"));
> +	} else if (strbuf_read_file(&contents, path, 0) < 0)
> +		return error_errno(_("could not read '%s'"), path);
> +
> +	line = contents.buf;
> +	size = contents.len;
> +	for (; size > 0; size -= len, line += len) {

size is unsigned so we're effectively testing 'size != 0' which means if 
we're off by one somewhere we'll have an undetected buffer overflow. 
Using a signed type wouldn't prevent the buffer overflow but it would 
limit its extent.

> +		const char *p;
> +
> +		len = find_eol(line, size);

Here len is not the length of line if it originally ended "\r\n".

> +		if (state == MBOX_BEFORE_HEADER) {
> +			if (!skip_prefix(line, "From ", &p))
> +				continue;
> +
> +			util = xcalloc(1, sizeof(*util));
> +			if (get_oid_hex(p, &util->oid) < 0)
> +				oidcpy(&util->oid, null_oid());
> +			util->matching = -1;
> +			author = subject = NULL;
> +
> +			state = MBOX_IN_HEADER;

I wondered if there should there be a `continue;` here but I think it 
probably needs to "fall-through" to the MBOX_IN_HEADER handling below. A 
comment to clarify that would be helpful.

> +		}
> +
> +		if (starts_with(line, "diff --git ")) {
> +			struct patch patch = { 0 };
> +			struct strbuf root = STRBUF_INIT;
> +			int linenr = 0;
> +			int orig_len;
> +
> +			state = MBOX_IN_DIFF;
> +			old_count = new_count = 0;
> +			strbuf_addch(&buf, '\n');
> +			if (!util->diff_offset)
> +				util->diff_offset = buf.len;
> +			line[len - 1] = '\n';

Here the line will still be NUL terminated if it originally ended "\r\n" 
which presumably messes up the call to parse_git_diff_header() below. I 
have not checked if parse_git_diff_header() can handle "\r\n" when it is 
parsing the rest of the diff header.

> +			orig_len = len;
> +			len = parse_git_diff_header(&root, &linenr, 1, line,
> +						    len, size, &patch);
> +			if (len < 0) {
> +				error(_("could not parse git header '%.*s'"),
> +				      orig_len, line);
> +				free(util);
> +				free(current_filename);
> +				string_list_clear(list, 1);
> +				strbuf_release(&buf);
> +				strbuf_release(&contents);
> +				strbuf_release(&long_subject);
> +				return -1;
> +			}
> +
> +			if (patch.old_name)
> +				skip_prefix(patch.old_name, "a/",
> +					    (const char **)&patch.old_name);
> +			if (patch.new_name)
> +				skip_prefix(patch.new_name, "b/",
> +					    (const char **)&patch.new_name);

I think this is fine for now but we might want to support other prefixes 
in the future. If it is not a copy or rename then the filename can be 
deduced by finding the common tail of patch.old_name and patch.new_name 
and stripping anything before the first '/'. If it is a copy or rename 
then I suspect there is no prefix (though I've not checked)

> +			strbuf_addstr(&buf, " ## ");
> +			if (patch.is_new > 0)

`patch.is_now` and `patch.is_delete` are booleans like `patch.is_rename` 
so we don't need the '> 0'

> +				strbuf_addf(&buf, "%s (new)", patch.new_name);
> +			else if (patch.is_delete > 0)
> +				strbuf_addf(&buf, "%s (deleted)", patch.old_name);
> +			else if (patch.is_rename)
> +				strbuf_addf(&buf, "%s => %s", patch.old_name, patch.new_name);
> +			else
> +				strbuf_addstr(&buf, patch.new_name);
> +
> +			free(current_filename);
> +			if (patch.is_delete > 0)
> +				current_filename = xstrdup(patch.old_name);
> +			else
> +				current_filename = xstrdup(patch.new_name);
> +
> +			if (patch.new_mode && patch.old_mode &&
> +			    patch.old_mode != patch.new_mode)
> +				strbuf_addf(&buf, " (mode change %06o => %06o)",
> +					    patch.old_mode, patch.new_mode);
> +
> +			strbuf_addstr(&buf, " ##\n");
> +			util->diffsize++;
> +		} else if (state == MBOX_IN_HEADER) {
> +			if (!line[0]) {
> +				state = MBOX_IN_COMMIT_MESSAGE;
> +				/* Look for an in-body From: */
> +				if (size > 5 && skip_prefix(line + 1, "From: ", &p)) {

The "size > 5" seems a bit unnecessary as we're using skip_prefix()

> +					size -= p - line;
> +					line += p - line;

This is good, we're accounting for reading the next line.

> +					len = find_eol(line, size);
> +
> +					while (isspace(*p))
> +						p++;
> +					author = p;
> +				}
> +				strbuf_addstr(&buf, " ## Metadata ##\n");
> +				if (author)
> +					strbuf_addf(&buf, "Author: %s\n", author);
> +				strbuf_addstr(&buf, "\n ## Commit message ##\n");
> +				if (subject)
> +					strbuf_addf(&buf, "    %s\n\n", subject);
> +			} else if (skip_prefix(line, "From: ", &p)) {
> +				while (isspace(*p))
> +					p++;
> +				author = p;
> +			} else if (skip_prefix(line, "Subject: ", &p)) {
> +				const char *q;
> +
> +				while (isspace(*p))
> +					p++;
> +				subject = p;
> +
> +				if (starts_with(p, "[PATCH") &&
> +				    (q = strchr(p, ']'))) {
> +					q++;
> +					while (isspace(*q))
> +						q++;
> +					subject = q;
> +				}
> +
> +				if (len < size && line[len] == ' ') {
> +					/* handle long subject */
> +					strbuf_reset(&long_subject);
> +					strbuf_addstr(&long_subject, subject);
> +					while (len < size && line[len] == ' ') {
> +						line += len;
> +						size -= len;
> +						len = find_eol(line, size);
> +						strbuf_addstr(&long_subject, line);

Looks good

> +					}
> +					subject = long_subject.buf;
> +				}
> +			}
> +		} else if (state == MBOX_IN_COMMIT_MESSAGE) {
> +			if (!*line)

Not a big issue elsewhere you've used "!line[0]"
Style: there should be braces on this branch.

> +				strbuf_addch(&buf, '\n');
> +			else if (strcmp(line, "---")) {
> +				int tabs = 0;
> +
> +				/* simulate tab expansion */
> +				while (line[tabs] == '\t')
> +					tabs++;
> +				strbuf_addf(&buf, "%*s%s\n",
> +					    4 + 8 * tabs, "", line + tabs);
> +			} else {
> +				/*
> +				 * Trim the trailing newline that is added
> +				 * by `format-patch`.
> +				 */
> +				strbuf_trim_trailing_newline(&buf);
> +				state = MBOX_AFTER_TRIPLE_DASH;
> +			}
> +		} else if (state == MBOX_IN_DIFF) {
> +			switch (line[0]) {
> +			case '\0':
> +				continue; /* ignore empty lines after diff */
> +			case '+':
> +			case '-':
> +			case ' ':
> +				if (!old_count && !new_count)
> +					break;

This shouldn't happen in a well formed diff. Below we happily accept bad 
counts, is there a reason to reject them here?

> +				if (old_count && line[0] != '+')
> +					old_count--;
> +				if (new_count && line[0] != '-')
> +					new_count--;

The diff is malformed if old_count == 0 and we see '-' or ' ' or 
new_count == 0 and we see '+' or ' '. The code is careful not to 
decrement the count in that case so I think it is harmless to accept 
diffs with bad line counts in the hunk header.

> +				/* fallthrough */
> +			case '\\':
> +				strbuf_addstr(&buf, line);
> +				strbuf_addch(&buf, '\n');
> +				util->diffsize++;

I think this might be a better place to break if old_count and new_count 
are both zero.

> +				continue;
> +			case '@':
> +				if (parse_hunk_header(line, &old_count,
> +						      &new_count, &p))
> +					break;
> +
> +				strbuf_addstr(&buf, "@@");
> +				if (current_filename && *p)
> +					strbuf_addf(&buf, " %s:",
> +						    current_filename);
> +				strbuf_addstr(&buf, p);
> +				strbuf_addch(&buf, '\n');
> +				util->diffsize++;
> +				continue;
> +			}

This is effectively the `default:` clause as it is executed when we 
don't handle the line above. We ignore the contents of this line which 
makes me wonder what happens if it is the start of another diff. Do we 
have tests that alter more than one file in a single commit?

I think this is a useful addition, it could perhaps benefit from more 
tests though. Having tests for bad input, "\r\n" line endings and 
getting the author from a From: header as well as an in-body From: line 
would give a bit more reassurance about how robust the parser is.

Best Wishes

Phillip

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] range-diff: support reading mbox files
  2022-11-16 14:40 ` Phillip Wood
@ 2022-11-17 10:26   ` Phillip Wood
  2022-11-18 12:53     ` Johannes Schindelin
  2022-11-18 13:16   ` Johannes Schindelin
  1 sibling, 1 reply; 16+ messages in thread
From: Phillip Wood @ 2022-11-17 10:26 UTC (permalink / raw)
  To: Johannes Schindelin via GitGitGadget, git
  Cc: Johannes Schindelin, Taylor Blau

Hi Dscho

Having slept on it I think I misunderstood what was happening in the 
diff parsing yesterday

On 16/11/2022 14:40, Phillip Wood wrote:
>> +        } else if (state == MBOX_IN_DIFF) {
>> +            switch (line[0]) {
>> +            case '\0':
>> +                continue; /* ignore empty lines after diff */
>> +            case '+':
>> +            case '-':
>> +            case ' ':
>> +                if (!old_count && !new_count)
>> +                    break;
> 
> This shouldn't happen in a well formed diff. Below we happily accept bad 
> counts, is there a reason to reject them here?

I think this might be picking up the "--" at the end of the patch as we 
don't want to break here at the end of a hunk. If so then a comment 
would be helpful.

>> +                if (old_count && line[0] != '+')
>> +                    old_count--;
>> +                if (new_count && line[0] != '-')
>> +                    new_count--;
> 
> The diff is malformed if old_count == 0 and we see '-' or ' ' or 
> new_count == 0 and we see '+' or ' '. The code is careful not to 
> decrement the count in that case so I think it is harmless to accept 
> diffs with bad line counts in the hunk header.
>> +                /* fallthrough */
>> +            case '\\':
>> +                strbuf_addstr(&buf, line);
>> +                strbuf_addch(&buf, '\n');
>> +                util->diffsize++;
> 
> I think this might be a better place to break if old_count and new_count 
> are both zero.

It would be the right place to break at the end of each hunk, but I 
don't think we want to do that.

>> +                continue;
>> +            case '@':
>> +                if (parse_hunk_header(line, &old_count,
>> +                              &new_count, &p))
>> +                    break;
>> +
>> +                strbuf_addstr(&buf, "@@");
>> +                if (current_filename && *p)
>> +                    strbuf_addf(&buf, " %s:",
>> +                            current_filename);
>> +                strbuf_addstr(&buf, p);
>> +                strbuf_addch(&buf, '\n');
>> +                util->diffsize++;
>> +                continue;
>> +            }
> 
> This is effectively the `default:` clause as it is executed when we 
> don't handle the line above. We ignore the contents of this line which 
> makes me wonder what happens if it is the start of another diff.

We'll pick that up earlier with "if (starts_with(line, "diff --git"))"

We only get here at the end of a patch (assuming it has the "--" line 
from format-patch)

Best Wishes

Phillip

> Do we 
> have tests that alter more than one file in a single commit?
> 
> I think this is a useful addition, it could perhaps benefit from more 
> tests though. Having tests for bad input, "\r\n" line endings and 
> getting the author from a From: header as well as an in-body From: line 
> would give a bit more reassurance about how robust the parser is.
> 
> Best Wishes
> 
> Phillip

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] range-diff: support reading mbox files
  2022-11-15 18:20 [PATCH] range-diff: support reading mbox files Johannes Schindelin via GitGitGadget
  2022-11-16  2:09 ` Taylor Blau
  2022-11-16 14:40 ` Phillip Wood
@ 2022-11-17 18:24 ` Ævar Arnfjörð Bjarmason
  2022-11-18 11:39   ` Johannes Schindelin
  2022-11-19 23:11 ` [PATCH v2] " Johannes Schindelin via GitGitGadget
  3 siblings, 1 reply; 16+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-11-17 18:24 UTC (permalink / raw)
  To: Johannes Schindelin via GitGitGadget; +Cc: git, Johannes Schindelin


On Tue, Nov 15 2022, Johannes Schindelin via GitGitGadget wrote:

> +		if (starts_with(line, "diff --git ")) {
> +			struct patch patch = { 0 };
> +			struct strbuf root = STRBUF_INIT;
> +			int linenr = 0;
> +			int orig_len;
> +
> +			state = MBOX_IN_DIFF;
> +			old_count = new_count = 0;
> +			strbuf_addch(&buf, '\n');
> +			if (!util->diff_offset)
> +				util->diff_offset = buf.len;
> +			line[len - 1] = '\n';
> +			orig_len = len;
> +			len = parse_git_diff_header(&root, &linenr, 1, line,
> +						    len, size, &patch);

Try this with SANITIZE=leak, e.g. this seems to fix 1/4 leaks that pop
up if you try the command noted in the patch:
	
	diff --git a/range-diff.c b/range-diff.c
	index 77fa9b970b1..7ff33f92e39 100644
	--- a/range-diff.c
	+++ b/range-diff.c
	@@ -142,6 +142,7 @@ static int read_mbox(const char *path, struct string_list *list)
	 			orig_len = len;
	 			len = parse_git_diff_header(&root, &linenr, 1, line,
	 						    len, size, &patch);
	+			free(patch.def_name);
	 			if (len < 0) {
	 				error(_("could not parse git header '%.*s'"),
	 				      orig_len, line);

Maybe it really should segfault, I didn't check carefully, but your test
passes with SANITIZE=address with this, so if so it's missing
coverage...

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] range-diff: support reading mbox files
  2022-11-17 18:24 ` Ævar Arnfjörð Bjarmason
@ 2022-11-18 11:39   ` Johannes Schindelin
  0 siblings, 0 replies; 16+ messages in thread
From: Johannes Schindelin @ 2022-11-18 11:39 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Johannes Schindelin via GitGitGadget, git

[-- Attachment #1: Type: text/plain, Size: 1392 bytes --]

Hi Ævar,

On Thu, 17 Nov 2022, Ævar Arnfjörð Bjarmason wrote:

>
> On Tue, Nov 15 2022, Johannes Schindelin via GitGitGadget wrote:
>
> > +		if (starts_with(line, "diff --git ")) {
> > +			struct patch patch = { 0 };
> > +			struct strbuf root = STRBUF_INIT;
> > +			int linenr = 0;
> > +			int orig_len;
> > +
> > +			state = MBOX_IN_DIFF;
> > +			old_count = new_count = 0;
> > +			strbuf_addch(&buf, '\n');
> > +			if (!util->diff_offset)
> > +				util->diff_offset = buf.len;
> > +			line[len - 1] = '\n';
> > +			orig_len = len;
> > +			len = parse_git_diff_header(&root, &linenr, 1, line,
> > +						    len, size, &patch);
>
> Try this with SANITIZE=leak, e.g. this seems to fix 1/4 leaks that pop
> up if you try the command noted in the patch:
>
> 	diff --git a/range-diff.c b/range-diff.c
> 	index 77fa9b970b1..7ff33f92e39 100644
> 	--- a/range-diff.c
> 	+++ b/range-diff.c
> 	@@ -142,6 +142,7 @@ static int read_mbox(const char *path, struct string_list *list)
> 	 			orig_len = len;
> 	 			len = parse_git_diff_header(&root, &linenr, 1, line,
> 	 						    len, size, &patch);
> 	+			free(patch.def_name);
> 	 			if (len < 0) {
> 	 				error(_("could not parse git header '%.*s'"),
> 	 				      orig_len, line);

Thank you for keeping your feedback concise. Much appreciated.

This will be addressed in the next iteration,
Johannes

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] range-diff: support reading mbox files
  2022-11-17 10:26   ` Phillip Wood
@ 2022-11-18 12:53     ` Johannes Schindelin
  0 siblings, 0 replies; 16+ messages in thread
From: Johannes Schindelin @ 2022-11-18 12:53 UTC (permalink / raw)
  To: phillip.wood; +Cc: Johannes Schindelin via GitGitGadget, git, Taylor Blau

[-- Attachment #1: Type: text/plain, Size: 3866 bytes --]

Hi Phillip,

On Thu, 17 Nov 2022, Phillip Wood wrote:

> On 16/11/2022 14:40, Phillip Wood wrote:
> > > +        } else if (state == MBOX_IN_DIFF) {
> > > +            switch (line[0]) {
> > > +            case '\0':
> > > +                continue; /* ignore empty lines after diff */
> > > +            case '+':
> > > +            case '-':
> > > +            case ' ':
> > > +                if (!old_count && !new_count)
> > > +                    break;
> > 
> > This shouldn't happen in a well formed diff. Below we happily accept bad
> > counts, is there a reason to reject them here?
> 
> I think this might be picking up the "--" at the end of the patch as we don't
> want to break here at the end of a hunk. If so then a comment would be
> helpful.

Agreed. And yes, it is picking up the "-- " line at the end of the patch.

> > > +                if (old_count && line[0] != '+')
> > > +                    old_count--;
> > > +                if (new_count && line[0] != '-')
> > > +                    new_count--;
> > 
> > The diff is malformed if old_count == 0 and we see '-' or ' ' or new_count
> > == 0 and we see '+' or ' '. The code is careful not to decrement the count
> > in that case so I think it is harmless to accept diffs with bad line counts
> > in the hunk header.

I might be overly cautious here, but as you mentioned elsewhere, it is
really bad if a `size_t` is decremented below 0, and
`new_count`/`old_count` are of that type.

> > > +                /* fallthrough */
> > > +            case '\\':
> > > +                strbuf_addstr(&buf, line);
> > > +                strbuf_addch(&buf, '\n');
> > > +                util->diffsize++;
> > 
> > I think this might be a better place to break if old_count and new_count are
> > both zero.
> 
> It would be the right place to break at the end of each hunk, but I don't
> think we want to do that.

It would not even be the right place to break here then: think of the
`\ No newline at end of file` lines: they come after the preceding line
decremented `old_count`/`new_count`, yet we still want them to be part of
the diff.

> 
> > > +                continue;
> > > +            case '@':
> > > +                if (parse_hunk_header(line, &old_count,
> > > +                              &new_count, &p))
> > > +                    break;
> > > +
> > > +                strbuf_addstr(&buf, "@@");
> > > +                if (current_filename && *p)
> > > +                    strbuf_addf(&buf, " %s:",
> > > +                            current_filename);
> > > +                strbuf_addstr(&buf, p);
> > > +                strbuf_addch(&buf, '\n');
> > > +                util->diffsize++;
> > > +                continue;
> > > +            }
> > 
> > This is effectively the `default:` clause as it is executed when we don't
> > handle the line above. We ignore the contents of this line which makes me
> > wonder what happens if it is the start of another diff.
> 
> We'll pick that up earlier with "if (starts_with(line, "diff --git"))"
> 
> We only get here at the end of a patch (assuming it has the "--" line from
> format-patch)

We also get here in case of garbage in the middle of a diff ;-)

Thank you for setting a fantastic example how to review code in a
constructive, helpful manner!
Dscho

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] range-diff: support reading mbox files
  2022-11-16 14:40 ` Phillip Wood
  2022-11-17 10:26   ` Phillip Wood
@ 2022-11-18 13:16   ` Johannes Schindelin
  1 sibling, 0 replies; 16+ messages in thread
From: Johannes Schindelin @ 2022-11-18 13:16 UTC (permalink / raw)
  To: phillip.wood; +Cc: Johannes Schindelin via GitGitGadget, git, Taylor Blau

Hi Phillip,

On Wed, 16 Nov 2022, Phillip Wood wrote:

> On 15/11/2022 18:20, Johannes Schindelin via GitGitGadget wrote:
> > From: Johannes Schindelin <johannes.schindelin@gmx.de>
> >
> > Internally, the `git range-diff` command spawns a `git log` process and
> > parses its output for the given commit ranges.
> >
> > This works well when the patches that need to be compared are present in
> > the local repository in the form of commits.
> >
> > In scenarios where that is not the case, the `range-diff` command is
> > currently less helpful.
> >
> > The Git mailing list is such a scenario: Instead of using Git to
> > exchange commits, the patches are sent there as plain-text and no commit
> > range can be specified to let `range-diff` consume those patches.
> >
> > Instead, the expectation is to download the mails, apply them locally
> > and then use `range-diff`. This can be quite cumbersome e.g. when a
> > suitable base revision has to be found first where the patch applies
> > cleanly.
>
> That's a good motivation for this change.

Side note (potentially fun to know): A couple of weeks ago, I wanted to
work on this because I was asked to review a new patch series iteration on
the Git mailing list (I forgot which one it was, precisely), which came
without a range-diff, and I wanted to download both iterations using a
variation of
https://github.com/git-for-windows/build-extra/blob/HEAD/apply-from-lore.sh
and then use the mbox mode of `range-diff` on them. Unfortunately, I ran
out of time back then (it's a common theme for me these days), and the
patch series (rightfully) advanced to `next` without my review, and I let
this here patch slide.

> > +static inline int strtost(char const *s, size_t *result, const char **end)
> > +{
> > +	unsigned long u;
> > +	char *p;
> > +
> > +	errno = 0;
> > +	/* negative values would be accepted by strtoul */
> > +	if (*s == '-')
> > +		return -1;
>
> I think it is right to treat the input as untrusted and so look for malformed
> hunk headers. However This test is not sufficient for that, we expect a digit
> so I think
>
> 	if (!isdigit(*s))
> 		return -1;
>
> would be safer. The use of strtoul() looks good as we set errno to zero before
> the call and check both errno and endp afterwards.

Good point. As you might have guessed, this function is a copy/edit of
existing code, in this instance `strtoul_ui()`.

I have made the change you suggested.

>
> > +	u = strtoul(s, &p, 10);
> > +	if (errno || p == s)
> > +		return -1;
> > +	if (result)
> > +		*result = u;
> > +	*end = p;
> > +
> > +	return 0;
> > +}
> > +
> > +static int parse_hunk_header(const char *p,
> > +			     size_t *old_count, size_t *new_count,
> > +			     const char **end)
> > +{
> > +	size_t o = 1, n = 1;
> > +
> > +	if (!skip_prefix(p, "@@ -", &p) ||
> > +	    strtost(p, NULL, &p) ||
> > +	    (*p != ' ' && (*p != ',' || strtost(p + 1, &o, &p))) ||
>
> It took me a minute to understand the double negatives but it is correctly
> checking if we have ' ' or ',<digits>'

I actually hesitated when I wrote this, and only kept the code as-is
because I remembered how often Junio uses double negatives and thought it
would be a good joke to do the same here. Joke's obviously on me, though.

I've hence changed it to:

	    /* The range is -<start>[,<count>], defaulting to count = 1 */
	    !(*p == ' ' || (*p == ',' && !strtost(p + 1, &o, &p))) ||

and

	    /* The range is * +<start>[,<count>], * defaulting to count = 1 */
	    !(*p == ' ' || (*p == ',' && !strtost(p + 1, &n, &p))) ||

>
> > +	    !skip_prefix(p, " +", &p) ||
> > +	    strtost(p, NULL, &p) ||
> > +	    (*p != ' ' && (*p != ',' || strtost(p + 1, &n, &p))) ||
> > +	    !skip_prefix(p, " @@", &p))
> > +		return -1;
> > +
> > +	*old_count = o;
> > +	*new_count = n;
> > +	*end = p;
> > +
> > +	return 0;
> > +}
> > +
> > +static inline int find_eol(const char *line, size_t size)
> > +{
> > +	char *eol;
> > +
> > +	eol = memchr(line, '\n', size);
> > +	if (!eol)
> > +		return size;
> > +
> > +	if (eol != line && eol[-1] == '\r')
> > +		eol[-1] = '\0';
> > +	else
> > +		*eol = '\0';
> > +
> > +	return eol + 1 - line;
>
> We return the offset to the start of the next line, not the length of the
> line. This will be important later.

You're right. I changed the name to `find_next_line()` and added an
informative comment on top of the function.

>
> > +}
> > +
> > +static int read_mbox(const char *path, struct string_list *list)
> > +{
> > +	struct strbuf buf = STRBUF_INIT, contents = STRBUF_INIT;
> > +	struct strbuf long_subject = STRBUF_INIT;
> > +	struct patch_util *util = NULL;
> > +	enum {
> > +		MBOX_BEFORE_HEADER,
> > +		MBOX_IN_HEADER,
> > +		MBOX_IN_COMMIT_MESSAGE,
> > +		MBOX_AFTER_TRIPLE_DASH,
> > +		MBOX_IN_DIFF
> > +	} state = MBOX_BEFORE_HEADER;
> > +	char *line, *current_filename = NULL;
> > +	int len;
> > +	size_t size, old_count = 0, new_count = 0;
> > +	const char *author = NULL, *subject = NULL;
> > +
> > +	if (!strcmp(path, "-")) {
> > +		if (strbuf_read(&contents, STDIN_FILENO, 0) < 0)
> > +			return error_errno(_("could not read stdin"));
> > +	} else if (strbuf_read_file(&contents, path, 0) < 0)
> > +		return error_errno(_("could not read '%s'"), path);
> > +
> > +	line = contents.buf;
> > +	size = contents.len;
> > +	for (; size > 0; size -= len, line += len) {
>
> size is unsigned so we're effectively testing 'size != 0' which means if we're
> off by one somewhere we'll have an undetected buffer overflow. Using a signed
> type wouldn't prevent the buffer overflow but it would limit its extent.

`contents.len` is of type `size_t`, though, and I'd like to stay
consistent.

I did remove the misleading `> 0` from the condition, though.

>
> > +		const char *p;
> > +
> > +		len = find_eol(line, size);
>
> Here len is not the length of line if it originally ended "\r\n".
>
> > +		if (state == MBOX_BEFORE_HEADER) {
> > +			if (!skip_prefix(line, "From ", &p))
> > +				continue;
> > +
> > +			util = xcalloc(1, sizeof(*util));
> > +			if (get_oid_hex(p, &util->oid) < 0)
> > +				oidcpy(&util->oid, null_oid());
> > +			util->matching = -1;
> > +			author = subject = NULL;
> > +
> > +			state = MBOX_IN_HEADER;
>
> I wondered if there should there be a `continue;` here but I think it probably
> needs to "fall-through" to the MBOX_IN_HEADER handling below. A comment to
> clarify that would be helpful.

You're absolutely correct, there totally should be a `continue` here: we
saw a `From `, and in the `MBOX_IN_HEADER` we look for either an empty
line, the `From:` header or the `Subject:` header, so we know that the
current line cannot match, no need to fall through.

>
> > +		}
> > +
> > +		if (starts_with(line, "diff --git ")) {
> > +			struct patch patch = { 0 };
> > +			struct strbuf root = STRBUF_INIT;
> > +			int linenr = 0;
> > +			int orig_len;
> > +
> > +			state = MBOX_IN_DIFF;
> > +			old_count = new_count = 0;
> > +			strbuf_addch(&buf, '\n');
> > +			if (!util->diff_offset)
> > +				util->diff_offset = buf.len;
> > +			line[len - 1] = '\n';
>
> Here the line will still be NUL terminated if it originally ended "\r\n" which
> presumably messes up the call to parse_git_diff_header() below. I have not
> checked if parse_git_diff_header() can handle "\r\n" when it is parsing the
> rest of the diff header.

I changed it locally to reinstate the `\r\n`, only to figure out that
almost the entire `apply.c` machinery totally falls over CR/LF line
endings.

I changed the code to detect a Carriage Return and error out if one is
detected.

>
> > +			orig_len = len;
> > +			len = parse_git_diff_header(&root, &linenr, 1, line,
> > +						    len, size, &patch);
> > +			if (len < 0) {
> > +				error(_("could not parse git header '%.*s'"),
> > +				      orig_len, line);
> > +				free(util);
> > +				free(current_filename);
> > +				string_list_clear(list, 1);
> > +				strbuf_release(&buf);
> > +				strbuf_release(&contents);
> > +				strbuf_release(&long_subject);
> > +				return -1;
> > +			}
> > +
> > +			if (patch.old_name)
> > +				skip_prefix(patch.old_name, "a/",
> > +					    (const char **)&patch.old_name);
> > +			if (patch.new_name)
> > +				skip_prefix(patch.new_name, "b/",
> > +					    (const char **)&patch.new_name);
>
> I think this is fine for now but we might want to support other prefixes in
> the future. If it is not a copy or rename then the filename can be deduced by
> finding the common tail of patch.old_name and patch.new_name and stripping
> anything before the first '/'. If it is a copy or rename then I suspect there
> is no prefix (though I've not checked)

Since `skip_prefix()` does not do anything if there is no match, I think
it is sane to err by not stripping anything unless the expected `a/` and
`b/` prefixes are seen.

>
> > +			strbuf_addstr(&buf, " ## ");
> > +			if (patch.is_new > 0)
>
> `patch.is_now` and `patch.is_delete` are booleans like `patch.is_rename` so we
> don't need the '> 0'

Good catch.

>
> > +				strbuf_addf(&buf, "%s (new)", patch.new_name);
> > +			else if (patch.is_delete > 0)
> > +				strbuf_addf(&buf, "%s (deleted)",
> > patch.old_name);
> > +			else if (patch.is_rename)
> > +				strbuf_addf(&buf, "%s => %s", patch.old_name,
> > patch.new_name);
> > +			else
> > +				strbuf_addstr(&buf, patch.new_name);
> > +
> > +			free(current_filename);
> > +			if (patch.is_delete > 0)
> > +				current_filename = xstrdup(patch.old_name);
> > +			else
> > +				current_filename = xstrdup(patch.new_name);
> > +
> > +			if (patch.new_mode && patch.old_mode &&
> > +			    patch.old_mode != patch.new_mode)
> > +				strbuf_addf(&buf, " (mode change %06o =>
> > %06o)",
> > +					    patch.old_mode, patch.new_mode);
> > +
> > +			strbuf_addstr(&buf, " ##\n");
> > +			util->diffsize++;
> > +		} else if (state == MBOX_IN_HEADER) {
> > +			if (!line[0]) {
> > +				state = MBOX_IN_COMMIT_MESSAGE;
> > +				/* Look for an in-body From: */
> > +				if (size > 5 && skip_prefix(line + 1, "From:
> > ", &p)) {
>
> The "size > 5" seems a bit unnecessary as we're using skip_prefix()

Right!

>
> > +					size -= p - line;
> > +					line += p - line;
>
> This is good, we're accounting for reading the next line.
>
> > +					len = find_eol(line, size);
> > +
> > +					while (isspace(*p))
> > +						p++;
> > +					author = p;
> > +				}
> > +				strbuf_addstr(&buf, " ## Metadata ##\n");
> > +				if (author)
> > +					strbuf_addf(&buf, "Author: %s\n",
> > author);
> > +				strbuf_addstr(&buf, "\n ## Commit message
> > ##\n");
> > +				if (subject)
> > +					strbuf_addf(&buf, "    %s\n\n",
> > subject);
> > +			} else if (skip_prefix(line, "From: ", &p)) {
> > +				while (isspace(*p))
> > +					p++;
> > +				author = p;
> > +			} else if (skip_prefix(line, "Subject: ", &p)) {
> > +				const char *q;
> > +
> > +				while (isspace(*p))
> > +					p++;
> > +				subject = p;
> > +
> > +				if (starts_with(p, "[PATCH") &&
> > +				    (q = strchr(p, ']'))) {
> > +					q++;
> > +					while (isspace(*q))
> > +						q++;
> > +					subject = q;
> > +				}
> > +
> > +				if (len < size && line[len] == ' ') {
> > +					/* handle long subject */
> > +					strbuf_reset(&long_subject);
> > +					strbuf_addstr(&long_subject, subject);
> > +					while (len < size && line[len] == ' ')
> > {
> > +						line += len;
> > +						size -= len;
> > +						len = find_eol(line, size);
> > +						strbuf_addstr(&long_subject,
> > line);
>
> Looks good
>
> > +					}
> > +					subject = long_subject.buf;
> > +				}
> > +			}
> > +		} else if (state == MBOX_IN_COMMIT_MESSAGE) {
> > +			if (!*line)
>
> Not a big issue elsewhere you've used "!line[0]"
> Style: there should be braces on this branch.

Fixed on both accounts.

> [... addressed in follow-up mail...]
>
> I think this is a useful addition, it could perhaps benefit from more tests
> though. Having tests for bad input, "\r\n" line endings and getting the author
> from a From: header as well as an in-body From: line would give a bit more
> reassurance about how robust the parser is.

Good point, I've augmented the test case a bit more.

Thank you again. It is a real pleasure to receive these constructive
reviews from you that pay so much attention to detail, and always lead to
a clear path forward.

Thanks!
Dscho

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v2] range-diff: support reading mbox files
  2022-11-15 18:20 [PATCH] range-diff: support reading mbox files Johannes Schindelin via GitGitGadget
                   ` (2 preceding siblings ...)
  2022-11-17 18:24 ` Ævar Arnfjörð Bjarmason
@ 2022-11-19 23:11 ` Johannes Schindelin via GitGitGadget
  2022-11-21 10:08   ` Phillip Wood
  2022-11-22  9:08   ` [PATCH v3] " Johannes Schindelin via GitGitGadget
  3 siblings, 2 replies; 16+ messages in thread
From: Johannes Schindelin via GitGitGadget @ 2022-11-19 23:11 UTC (permalink / raw)
  To: git
  Cc: Taylor Blau, Phillip Wood, Ævar Arnfjörð Bjarmason,
	Johannes Schindelin, Johannes Schindelin

From: Johannes Schindelin <johannes.schindelin@gmx.de>

Internally, the `git range-diff` command spawns a `git log` process and
parses its output for the given commit ranges.

This works well when the patches that need to be compared are present in
the local repository in the form of commits.

In scenarios where that is not the case, the `range-diff` command is
currently less helpful.

The Git mailing list is such a scenario: Instead of using Git to
exchange commits, the patches are sent there as plain-text and no commit
range can be specified to let `range-diff` consume those patches.

Instead, the expectation is to download the mails, apply them locally
and then use `range-diff`. This can be quite cumbersome e.g. when a
suitable base revision has to be found first where the patch applies
cleanly.

Let's offer a way to read those patches from pre-prepared MBox files
instead when an argument "mbox:<filename>" is passed instead of a commit
range.

For extra convenience, interpret the filename `-` as standard input.
This makes it easy to compare contributions on the mailing list with the
actual commits that were integrated into Git's main branch. Example:

	commit=5c4003ca3f0e9ac6d3c8aa3e387ff843bd440411
	mid=bdfa3845b81531863941e6a97c28eb1afa62dd2c.1489435755.git.johannes.schindelin@gmx.de
	curl -s https://lore.kernel.org/git/$mid/raw |
	git range-diff mbox:- $commit^!

This addresses https://github.com/gitgitgadget/git/issues/207

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
    range-diff: support reading mbox files
    
    I frequently find myself wanting to look at the range-diff between some
    local commits and the patches on the Git mailing list, but unwilling to
    go through the process of finding an appropriate base revision to apply
    the patches onto (just to throw the generated patches away afterwards,
    anyway).
    
    So I came up with this patch. May it be helpful to other developers,
    too.
    
    This patch contains a home-rolled mbox parser. Initially, I wrote a
    really basic parser and it worked well enough, but, you know, as things
    go it became more complex than that in order to provide actually useful
    range-diffs for existing commits and their corresponding mails (because
    of in-body From: headers, because of -- trailers and long subjects, just
    to name a few reasons). In hindsight, it might have made sense to try to
    to reuse the parser that is available in mailinfo.c, which I had
    initially dismissed as overly complex and unnecessary for this use case.
    If anyone feels up to it, I would invite them to adjust this code to
    replace the mbox parser with one based on the mailinfo.c. Incrementally,
    of course, because the perfect is the enemy of the good.
    
    Changes since v1:
    
     * We no longer leak allocated memory in the struct patch instance
     * Made strtost() a bit more stringent
     * Postel [https://en.wikipedia.org/wiki/Postel%27s_Law]ized the mbox
       parser substantially, together with a couple more cosmetic fixes,
       based on Phillip Wood's excellent review of v1.
     * Extended the test case to cover mboxes containing CR/LF and in-body
       From: lines

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1420%2Fdscho%2Frange-diff-from-mbox-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1420/dscho/range-diff-from-mbox-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/1420

Range-diff vs v1:

 1:  354840fc57c ! 1:  485249ddfb3 range-diff: support reading mbox files
     @@ range-diff.c: struct patch_util {
      +
      +	errno = 0;
      +	/* negative values would be accepted by strtoul */
     -+	if (*s == '-')
     ++	if (!isdigit(*s))
      +		return -1;
      +	u = strtoul(s, &p, 10);
      +	if (errno || p == s)
     @@ range-diff.c: struct patch_util {
      +
      +	if (!skip_prefix(p, "@@ -", &p) ||
      +	    strtost(p, NULL, &p) ||
     -+	    (*p != ' ' && (*p != ',' || strtost(p + 1, &o, &p))) ||
     ++	    /* The range is -<start>[,<count>], defaulting to count = 1 */
     ++	    !(*p == ' ' || (*p == ',' && !strtost(p + 1, &o, &p))) ||
      +	    !skip_prefix(p, " +", &p) ||
      +	    strtost(p, NULL, &p) ||
     -+	    (*p != ' ' && (*p != ',' || strtost(p + 1, &n, &p))) ||
     ++	    /* The range is +<start>[,<count>], defaulting to count = 1 */
     ++	    !(*p == ' ' || (*p == ',' && !strtost(p + 1, &n, &p))) ||
      +	    !skip_prefix(p, " @@", &p))
      +		return -1;
      +
     @@ range-diff.c: struct patch_util {
      +	return 0;
      +}
      +
     -+static inline int find_eol(const char *line, size_t size)
     ++/*
     ++ * This function finds the end of the line, replaces the newline character with
     ++ * a NUL, and returns the offset of the start of the next line.
     ++ *
     ++ * If no newline character was found, it returns the offset of the trailing NUL
     ++ * instead.
     ++ */
     ++static inline int find_next_line(const char *line, size_t size)
      +{
      +	char *eol;
      +
     @@ range-diff.c: struct patch_util {
      +	if (!eol)
      +		return size;
      +
     -+	if (eol != line && eol[-1] == '\r')
     -+		eol[-1] = '\0';
     -+	else
     -+		*eol = '\0';
     ++	*eol = '\0';
      +
      +	return eol + 1 - line;
      +}
     @@ range-diff.c: struct patch_util {
      +
      +	line = contents.buf;
      +	size = contents.len;
     -+	for (; size > 0; size -= len, line += len) {
     ++	for (; size; size -= len, line += len) {
      +		const char *p;
      +
     -+		len = find_eol(line, size);
     ++		len = find_next_line(line, size);
      +
     -+		if (state == MBOX_BEFORE_HEADER) {
     ++		if (state == MBOX_BEFORE_HEADER ||
     ++		    (state == MBOX_IN_DIFF && line[0] == 'F')) {
      +			if (!skip_prefix(line, "From ", &p))
      +				continue;
      +
     @@ range-diff.c: struct patch_util {
      +			author = subject = NULL;
      +
      +			state = MBOX_IN_HEADER;
     ++			continue;
      +		}
      +
      +		if (starts_with(line, "diff --git ")) {
     @@ range-diff.c: struct patch_util {
      +			strbuf_addch(&buf, '\n');
      +			if (!util->diff_offset)
      +				util->diff_offset = buf.len;
     -+			line[len - 1] = '\n';
     ++
      +			orig_len = len;
     -+			len = parse_git_diff_header(&root, &linenr, 1, line,
     -+						    len, size, &patch);
     ++			/* `find_next_line()`'s replaced the LF with a NUL */
     ++			line[len - 1] = '\n';
     ++			len = len > 1 && line[len - 2] == '\r' ?
     ++				error(_("cannot handle diff headers with "
     ++					"CR/LF line endings")) :
     ++				parse_git_diff_header(&root, &linenr, 1, line,
     ++						      len, size, &patch);
      +			if (len < 0) {
      +				error(_("could not parse git header '%.*s'"),
      +				      orig_len, line);
     ++				release_patch(&patch);
      +				free(util);
      +				free(current_filename);
      +				string_list_clear(list, 1);
     @@ range-diff.c: struct patch_util {
      +					    (const char **)&patch.new_name);
      +
      +			strbuf_addstr(&buf, " ## ");
     -+			if (patch.is_new > 0)
     ++			if (patch.is_new)
      +				strbuf_addf(&buf, "%s (new)", patch.new_name);
     -+			else if (patch.is_delete > 0)
     ++			else if (patch.is_delete)
      +				strbuf_addf(&buf, "%s (deleted)", patch.old_name);
      +			else if (patch.is_rename)
      +				strbuf_addf(&buf, "%s => %s", patch.old_name, patch.new_name);
     @@ range-diff.c: struct patch_util {
      +				strbuf_addstr(&buf, patch.new_name);
      +
      +			free(current_filename);
     -+			if (patch.is_delete > 0)
     ++			if (patch.is_delete)
      +				current_filename = xstrdup(patch.old_name);
      +			else
      +				current_filename = xstrdup(patch.new_name);
     @@ range-diff.c: struct patch_util {
      +
      +			strbuf_addstr(&buf, " ##\n");
      +			util->diffsize++;
     ++			release_patch(&patch);
      +		} else if (state == MBOX_IN_HEADER) {
      +			if (!line[0]) {
      +				state = MBOX_IN_COMMIT_MESSAGE;
      +				/* Look for an in-body From: */
     -+				if (size > 5 && skip_prefix(line + 1, "From: ", &p)) {
     ++				if (skip_prefix(line + 1, "From: ", &p)) {
      +					size -= p - line;
      +					line += p - line;
     -+					len = find_eol(line, size);
     ++					len = find_next_line(line, size);
      +
      +					while (isspace(*p))
      +						p++;
     @@ range-diff.c: struct patch_util {
      +					while (len < size && line[len] == ' ') {
      +						line += len;
      +						size -= len;
     -+						len = find_eol(line, size);
     ++						len = find_next_line(line, size);
      +						strbuf_addstr(&long_subject, line);
      +					}
      +					subject = long_subject.buf;
      +				}
      +			}
      +		} else if (state == MBOX_IN_COMMIT_MESSAGE) {
     -+			if (!*line)
     ++			if (!line[0]) {
      +				strbuf_addch(&buf, '\n');
     -+			else if (strcmp(line, "---")) {
     ++			} else if (strcmp(line, "---")) {
      +				int tabs = 0;
      +
      +				/* simulate tab expansion */
     @@ range-diff.c: struct patch_util {
      +			case '+':
      +			case '-':
      +			case ' ':
     ++				/* A `-- ` line indicates the end of a diff */
      +				if (!old_count && !new_count)
      +					break;
      +				if (old_count && line[0] != '+')
     @@ t/t3206-range-diff.sh: test_expect_success 'ranges with pathspecs' '
      +test_expect_success 'compare range vs mbox' '
      +	git format-patch --stdout topic..mode-only-change >mbox &&
      +	git range-diff topic...mode-only-change >expect &&
     ++
      +	git range-diff mode-only-change..topic mbox:./mbox >actual &&
      +	test_cmp expect actual &&
     -+	git range-diff mode-only-change..topic mbox:- <mbox >actual &&
     -+	test_cmp expect actual
     ++
     ++	sed -e "/^From: .*/{
     ++		h
     ++		s/.*/From: Bugs Bunny <bugs@bun.ny>/
     ++		:1
     ++		N
     ++		/[ -z]$/b1
     ++		G
     ++	}" <mbox >mbox.from &&
     ++	git range-diff mode-only-change..topic mbox:./mbox.from >actual.from &&
     ++	test_cmp expect actual.from &&
     ++
     ++	append_cr <mbox >mbox.cr &&
     ++	test_must_fail git range-diff \
     ++		mode-only-change..topic mbox:./mbox.cr 2>err &&
     ++	grep CR/LF err &&
     ++
     ++	git range-diff mode-only-change..topic mbox:- <mbox >actual.stdin &&
     ++	test_cmp expect actual.stdin
      +'
      +
       test_expect_success 'submodule changes are shown irrespective of diff.submodule' '


 Documentation/git-range-diff.txt |   3 +-
 range-diff.c                     | 333 ++++++++++++++++++++++++++++++-
 t/t3206-range-diff.sh            |  27 +++
 3 files changed, 361 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-range-diff.txt b/Documentation/git-range-diff.txt
index 0b393715d70..e2c4661acde 100644
--- a/Documentation/git-range-diff.txt
+++ b/Documentation/git-range-diff.txt
@@ -37,7 +37,8 @@ There are three ways to specify the commit ranges:
 
 - `<range1> <range2>`: Either commit range can be of the form
   `<base>..<rev>`, `<rev>^!` or `<rev>^-<n>`. See `SPECIFYING RANGES`
-  in linkgit:gitrevisions[7] for more details.
+  in linkgit:gitrevisions[7] for more details. Alternatively, the
+  patches can be provided as an mbox-formatted file via `mbox:<path>`.
 
 - `<rev1>...<rev2>`. This is equivalent to
   `<rev2>..<rev1> <rev1>..<rev2>`.
diff --git a/range-diff.c b/range-diff.c
index 124dd678c38..9dc1e3af3f8 100644
--- a/range-diff.c
+++ b/range-diff.c
@@ -12,6 +12,7 @@
 #include "userdiff.h"
 #include "apply.h"
 #include "revision.h"
+#include "dir.h"
 
 struct patch_util {
 	/* For the search for an exact match */
@@ -26,6 +27,309 @@ struct patch_util {
 	struct object_id oid;
 };
 
+static inline int strtost(char const *s, size_t *result, const char **end)
+{
+	unsigned long u;
+	char *p;
+
+	errno = 0;
+	/* negative values would be accepted by strtoul */
+	if (!isdigit(*s))
+		return -1;
+	u = strtoul(s, &p, 10);
+	if (errno || p == s)
+		return -1;
+	if (result)
+		*result = u;
+	*end = p;
+
+	return 0;
+}
+
+static int parse_hunk_header(const char *p,
+			     size_t *old_count, size_t *new_count,
+			     const char **end)
+{
+	size_t o = 1, n = 1;
+
+	if (!skip_prefix(p, "@@ -", &p) ||
+	    strtost(p, NULL, &p) ||
+	    /* The range is -<start>[,<count>], defaulting to count = 1 */
+	    !(*p == ' ' || (*p == ',' && !strtost(p + 1, &o, &p))) ||
+	    !skip_prefix(p, " +", &p) ||
+	    strtost(p, NULL, &p) ||
+	    /* The range is +<start>[,<count>], defaulting to count = 1 */
+	    !(*p == ' ' || (*p == ',' && !strtost(p + 1, &n, &p))) ||
+	    !skip_prefix(p, " @@", &p))
+		return -1;
+
+	*old_count = o;
+	*new_count = n;
+	*end = p;
+
+	return 0;
+}
+
+/*
+ * This function finds the end of the line, replaces the newline character with
+ * a NUL, and returns the offset of the start of the next line.
+ *
+ * If no newline character was found, it returns the offset of the trailing NUL
+ * instead.
+ */
+static inline int find_next_line(const char *line, size_t size)
+{
+	char *eol;
+
+	eol = memchr(line, '\n', size);
+	if (!eol)
+		return size;
+
+	*eol = '\0';
+
+	return eol + 1 - line;
+}
+
+static int read_mbox(const char *path, struct string_list *list)
+{
+	struct strbuf buf = STRBUF_INIT, contents = STRBUF_INIT;
+	struct strbuf long_subject = STRBUF_INIT;
+	struct patch_util *util = NULL;
+	enum {
+		MBOX_BEFORE_HEADER,
+		MBOX_IN_HEADER,
+		MBOX_IN_COMMIT_MESSAGE,
+		MBOX_AFTER_TRIPLE_DASH,
+		MBOX_IN_DIFF
+	} state = MBOX_BEFORE_HEADER;
+	char *line, *current_filename = NULL;
+	int len;
+	size_t size, old_count = 0, new_count = 0;
+	const char *author = NULL, *subject = NULL;
+
+	if (!strcmp(path, "-")) {
+		if (strbuf_read(&contents, STDIN_FILENO, 0) < 0)
+			return error_errno(_("could not read stdin"));
+	} else if (strbuf_read_file(&contents, path, 0) < 0)
+		return error_errno(_("could not read '%s'"), path);
+
+	line = contents.buf;
+	size = contents.len;
+	for (; size; size -= len, line += len) {
+		const char *p;
+
+		len = find_next_line(line, size);
+
+		if (state == MBOX_BEFORE_HEADER ||
+		    (state == MBOX_IN_DIFF && line[0] == 'F')) {
+			if (!skip_prefix(line, "From ", &p))
+				continue;
+
+			util = xcalloc(1, sizeof(*util));
+			if (get_oid_hex(p, &util->oid) < 0)
+				oidcpy(&util->oid, null_oid());
+			util->matching = -1;
+			author = subject = NULL;
+
+			state = MBOX_IN_HEADER;
+			continue;
+		}
+
+		if (starts_with(line, "diff --git ")) {
+			struct patch patch = { 0 };
+			struct strbuf root = STRBUF_INIT;
+			int linenr = 0;
+			int orig_len;
+
+			state = MBOX_IN_DIFF;
+			old_count = new_count = 0;
+			strbuf_addch(&buf, '\n');
+			if (!util->diff_offset)
+				util->diff_offset = buf.len;
+
+			orig_len = len;
+			/* `find_next_line()`'s replaced the LF with a NUL */
+			line[len - 1] = '\n';
+			len = len > 1 && line[len - 2] == '\r' ?
+				error(_("cannot handle diff headers with "
+					"CR/LF line endings")) :
+				parse_git_diff_header(&root, &linenr, 1, line,
+						      len, size, &patch);
+			if (len < 0) {
+				error(_("could not parse git header '%.*s'"),
+				      orig_len, line);
+				release_patch(&patch);
+				free(util);
+				free(current_filename);
+				string_list_clear(list, 1);
+				strbuf_release(&buf);
+				strbuf_release(&contents);
+				strbuf_release(&long_subject);
+				return -1;
+			}
+
+			if (patch.old_name)
+				skip_prefix(patch.old_name, "a/",
+					    (const char **)&patch.old_name);
+			if (patch.new_name)
+				skip_prefix(patch.new_name, "b/",
+					    (const char **)&patch.new_name);
+
+			strbuf_addstr(&buf, " ## ");
+			if (patch.is_new)
+				strbuf_addf(&buf, "%s (new)", patch.new_name);
+			else if (patch.is_delete)
+				strbuf_addf(&buf, "%s (deleted)", patch.old_name);
+			else if (patch.is_rename)
+				strbuf_addf(&buf, "%s => %s", patch.old_name, patch.new_name);
+			else
+				strbuf_addstr(&buf, patch.new_name);
+
+			free(current_filename);
+			if (patch.is_delete)
+				current_filename = xstrdup(patch.old_name);
+			else
+				current_filename = xstrdup(patch.new_name);
+
+			if (patch.new_mode && patch.old_mode &&
+			    patch.old_mode != patch.new_mode)
+				strbuf_addf(&buf, " (mode change %06o => %06o)",
+					    patch.old_mode, patch.new_mode);
+
+			strbuf_addstr(&buf, " ##\n");
+			util->diffsize++;
+			release_patch(&patch);
+		} else if (state == MBOX_IN_HEADER) {
+			if (!line[0]) {
+				state = MBOX_IN_COMMIT_MESSAGE;
+				/* Look for an in-body From: */
+				if (skip_prefix(line + 1, "From: ", &p)) {
+					size -= p - line;
+					line += p - line;
+					len = find_next_line(line, size);
+
+					while (isspace(*p))
+						p++;
+					author = p;
+				}
+				strbuf_addstr(&buf, " ## Metadata ##\n");
+				if (author)
+					strbuf_addf(&buf, "Author: %s\n", author);
+				strbuf_addstr(&buf, "\n ## Commit message ##\n");
+				if (subject)
+					strbuf_addf(&buf, "    %s\n\n", subject);
+			} else if (skip_prefix(line, "From: ", &p)) {
+				while (isspace(*p))
+					p++;
+				author = p;
+			} else if (skip_prefix(line, "Subject: ", &p)) {
+				const char *q;
+
+				while (isspace(*p))
+					p++;
+				subject = p;
+
+				if (starts_with(p, "[PATCH") &&
+				    (q = strchr(p, ']'))) {
+					q++;
+					while (isspace(*q))
+						q++;
+					subject = q;
+				}
+
+				if (len < size && line[len] == ' ') {
+					/* handle long subject */
+					strbuf_reset(&long_subject);
+					strbuf_addstr(&long_subject, subject);
+					while (len < size && line[len] == ' ') {
+						line += len;
+						size -= len;
+						len = find_next_line(line, size);
+						strbuf_addstr(&long_subject, line);
+					}
+					subject = long_subject.buf;
+				}
+			}
+		} else if (state == MBOX_IN_COMMIT_MESSAGE) {
+			if (!line[0]) {
+				strbuf_addch(&buf, '\n');
+			} else if (strcmp(line, "---")) {
+				int tabs = 0;
+
+				/* simulate tab expansion */
+				while (line[tabs] == '\t')
+					tabs++;
+				strbuf_addf(&buf, "%*s%s\n",
+					    4 + 8 * tabs, "", line + tabs);
+			} else {
+				/*
+				 * Trim the trailing newline that is added
+				 * by `format-patch`.
+				 */
+				strbuf_trim_trailing_newline(&buf);
+				state = MBOX_AFTER_TRIPLE_DASH;
+			}
+		} else if (state == MBOX_IN_DIFF) {
+			switch (line[0]) {
+			case '\0':
+				continue; /* ignore empty lines after diff */
+			case '+':
+			case '-':
+			case ' ':
+				/* A `-- ` line indicates the end of a diff */
+				if (!old_count && !new_count)
+					break;
+				if (old_count && line[0] != '+')
+					old_count--;
+				if (new_count && line[0] != '-')
+					new_count--;
+				/* fallthrough */
+			case '\\':
+				strbuf_addstr(&buf, line);
+				strbuf_addch(&buf, '\n');
+				util->diffsize++;
+				continue;
+			case '@':
+				if (parse_hunk_header(line, &old_count,
+						      &new_count, &p))
+					break;
+
+				strbuf_addstr(&buf, "@@");
+				if (current_filename && *p)
+					strbuf_addf(&buf, " %s:",
+						    current_filename);
+				strbuf_addstr(&buf, p);
+				strbuf_addch(&buf, '\n');
+				util->diffsize++;
+				continue;
+			}
+
+			if (util) {
+				string_list_append(list, buf.buf)->util = util;
+				strbuf_reset(&buf);
+			}
+			util = xcalloc(1, sizeof(*util));
+			oidcpy(&util->oid, null_oid());
+			util->matching = -1;
+			author = subject = NULL;
+			state = MBOX_BEFORE_HEADER;
+		}
+	}
+	strbuf_release(&contents);
+
+	if (util) {
+		if (state == MBOX_IN_DIFF)
+			string_list_append(list, buf.buf)->util = util;
+		else
+			free(util);
+	}
+	strbuf_release(&buf);
+	strbuf_release(&long_subject);
+	free(current_filename);
+
+	return 0;
+}
+
 /*
  * Reads the patches into a string list, with the `util` field being populated
  * as struct object_id (will need to be free()d).
@@ -41,6 +345,10 @@ static int read_patches(const char *range, struct string_list *list,
 	ssize_t len;
 	size_t size;
 	int ret = -1;
+	const char *path;
+
+	if (skip_prefix(range, "mbox:", &path))
+		return read_mbox(path, list);
 
 	strvec_pushl(&cp.args, "log", "--no-color", "-p", "--no-merges",
 		     "--reverse", "--date-order", "--decorate=no",
@@ -424,6 +732,19 @@ static void output_pair_header(struct diff_options *diffopt,
 
 		strbuf_addch(buf, ' ');
 		pp_commit_easy(CMIT_FMT_ONELINE, commit, buf);
+	} else {
+		struct patch_util *util = b_util ? b_util : a_util;
+		const char *needle = "\n ## Commit message ##\n";
+		const char *p = !util || !util->patch ?
+			NULL : strstr(util->patch, needle);
+		if (p) {
+			if (status == '!')
+				strbuf_addf(buf, "%s%s", color_reset, color);
+
+			strbuf_addch(buf, ' ');
+			p += strlen(needle);
+			strbuf_add(buf, p, strchrnul(p, '\n') - p);
+		}
 	}
 	strbuf_addf(buf, "%s\n", color_reset);
 
@@ -554,6 +875,9 @@ int show_range_diff(const char *range1, const char *range2,
 	if (range_diff_opts->left_only && range_diff_opts->right_only)
 		res = error(_("options '%s' and '%s' cannot be used together"), "--left-only", "--right-only");
 
+	if (!strcmp(range1, "mbox:-") && !strcmp(range2, "mbox:-"))
+		res = error(_("only one mbox can be read from stdin"));
+
 	if (!res && read_patches(range1, &branch1, range_diff_opts->other_arg))
 		res = error(_("could not parse log for '%s'"), range1);
 	if (!res && read_patches(range2, &branch2, range_diff_opts->other_arg))
@@ -575,10 +899,17 @@ int show_range_diff(const char *range1, const char *range2,
 int is_range_diff_range(const char *arg)
 {
 	char *copy = xstrdup(arg); /* setup_revisions() modifies it */
-	const char *argv[] = { "", copy, "--", NULL };
+	const char *argv[] = { "", copy, "--", NULL }, *path;
 	int i, positive = 0, negative = 0;
 	struct rev_info revs;
 
+	if (skip_prefix(arg, "mbox:", &path)) {
+		if (!strcmp(path, "-") || file_exists(path))
+			return 1;
+		error_errno(_("not an mbox: '%s'"), path);
+		return 0;
+	}
+
 	init_revisions(&revs, NULL);
 	if (setup_revisions(3, argv, &revs, NULL) == 1) {
 		for (i = 0; i < revs.pending.nr; i++)
diff --git a/t/t3206-range-diff.sh b/t/t3206-range-diff.sh
index 459beaf7d9c..89ef9a5ffc4 100755
--- a/t/t3206-range-diff.sh
+++ b/t/t3206-range-diff.sh
@@ -783,6 +783,33 @@ test_expect_success 'ranges with pathspecs' '
 	! grep "$topic_oid" actual
 '
 
+test_expect_success 'compare range vs mbox' '
+	git format-patch --stdout topic..mode-only-change >mbox &&
+	git range-diff topic...mode-only-change >expect &&
+
+	git range-diff mode-only-change..topic mbox:./mbox >actual &&
+	test_cmp expect actual &&
+
+	sed -e "/^From: .*/{
+		h
+		s/.*/From: Bugs Bunny <bugs@bun.ny>/
+		:1
+		N
+		/[ -z]$/b1
+		G
+	}" <mbox >mbox.from &&
+	git range-diff mode-only-change..topic mbox:./mbox.from >actual.from &&
+	test_cmp expect actual.from &&
+
+	append_cr <mbox >mbox.cr &&
+	test_must_fail git range-diff \
+		mode-only-change..topic mbox:./mbox.cr 2>err &&
+	grep CR/LF err &&
+
+	git range-diff mode-only-change..topic mbox:- <mbox >actual.stdin &&
+	test_cmp expect actual.stdin
+'
+
 test_expect_success 'submodule changes are shown irrespective of diff.submodule' '
 	git init sub-repo &&
 	test_commit -C sub-repo sub-first &&

base-commit: b75747829f4c277323c78b1c5973ad63ea038a2d
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v2] range-diff: support reading mbox files
  2022-11-19 23:11 ` [PATCH v2] " Johannes Schindelin via GitGitGadget
@ 2022-11-21 10:08   ` Phillip Wood
  2022-11-22  7:40     ` Johannes Schindelin
  2022-11-22  9:08   ` [PATCH v3] " Johannes Schindelin via GitGitGadget
  1 sibling, 1 reply; 16+ messages in thread
From: Phillip Wood @ 2022-11-21 10:08 UTC (permalink / raw)
  To: Johannes Schindelin via GitGitGadget, git
  Cc: Taylor Blau, Phillip Wood, Ævar Arnfjörð Bjarmason,
	Johannes Schindelin

Hi Dscho

On 19/11/2022 23:11, Johannes Schindelin via GitGitGadget wrote:
> From: Johannes Schindelin <johannes.schindelin@gmx.de>
>      
>      Changes since v1:
>      
>       * We no longer leak allocated memory in the struct patch instance
>       * Made strtost() a bit more stringent
>       * Postel [https://en.wikipedia.org/wiki/Postel%27s_Law]ized the mbox
>         parser substantially, together with a couple more cosmetic fixes,
>         based on Phillip Wood's excellent review of v1.
>       * Extended the test case to cover mboxes containing CR/LF and in-body
>         From: lines

This addresses all of my comments on V1. I've left a couple of queries 
on the range-diff for a couple of changes I didn't quite understand.

> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1420%2Fdscho%2Frange-diff-from-mbox-v2
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1420/dscho/range-diff-from-mbox-v2
> Pull-Request: https://github.com/gitgitgadget/git/pull/1420
> 
> Range-diff vs v1:
> 
>   1:  354840fc57c ! 1:  485249ddfb3 range-diff: support reading mbox files
>       @@ range-diff.c: struct patch_util {
>        +
>        +	errno = 0;
>        +	/* negative values would be accepted by strtoul */
>       -+	if (*s == '-')
>       ++	if (!isdigit(*s))
>        +		return -1;
>        +	u = strtoul(s, &p, 10);
>        +	if (errno || p == s)
>       @@ range-diff.c: struct patch_util {
>        +
>        +	if (!skip_prefix(p, "@@ -", &p) ||
>        +	    strtost(p, NULL, &p) ||
>       -+	    (*p != ' ' && (*p != ',' || strtost(p + 1, &o, &p))) ||
>       ++	    /* The range is -<start>[,<count>], defaulting to count = 1 */
>       ++	    !(*p == ' ' || (*p == ',' && !strtost(p + 1, &o, &p))) ||
>        +	    !skip_prefix(p, " +", &p) ||
>        +	    strtost(p, NULL, &p) ||
>       -+	    (*p != ' ' && (*p != ',' || strtost(p + 1, &n, &p))) ||
>       ++	    /* The range is +<start>[,<count>], defaulting to count = 1 */
>       ++	    !(*p == ' ' || (*p == ',' && !strtost(p + 1, &n, &p))) ||
>        +	    !skip_prefix(p, " @@", &p))
>        +		return -1;
>        +
>       @@ range-diff.c: struct patch_util {
>        +	return 0;
>        +}
>        +
>       -+static inline int find_eol(const char *line, size_t size)
>       ++/*
>       ++ * This function finds the end of the line, replaces the newline character with
>       ++ * a NUL, and returns the offset of the start of the next line.
>       ++ *
>       ++ * If no newline character was found, it returns the offset of the trailing NUL
>       ++ * instead.
>       ++ */
>       ++static inline int find_next_line(const char *line, size_t size)
>        +{
>        +	char *eol;
>        +
>       @@ range-diff.c: struct patch_util {
>        +	if (!eol)
>        +		return size;
>        +
>       -+	if (eol != line && eol[-1] == '\r')
>       -+		eol[-1] = '\0';
>       -+	else
>       -+		*eol = '\0';
>       ++	*eol = '\0';
>        +
>        +	return eol + 1 - line;
>        +}
>       @@ range-diff.c: struct patch_util {
>        +
>        +	line = contents.buf;
>        +	size = contents.len;
>       -+	for (; size > 0; size -= len, line += len) {
>       ++	for (; size; size -= len, line += len) {
>        +		const char *p;
>        +
>       -+		len = find_eol(line, size);
>       ++		len = find_next_line(line, size);
>        +
>       -+		if (state == MBOX_BEFORE_HEADER) {
>       ++		if (state == MBOX_BEFORE_HEADER ||
>       ++		    (state == MBOX_IN_DIFF && line[0] == 'F')) {

If we see an 'F' when parsing a diff then it is probably a From: line 
indicating that we've reached the start of a new patch. I think we need 
to add the diff we've just parsed to the list of commits before 
continuing and change state to MBOX_BEFORE_HEADER otherwise we'll hit 
the continue just below this comment with state unchanged.

>        +			if (!skip_prefix(line, "From ", &p))
>        +				continue;
>        +
>       @@ range-diff.c: struct patch_util {
>        +			author = subject = NULL;
>        +
>        +			state = MBOX_IN_HEADER;
>       ++			continue;
>        +		}
>        +
>        +		if (starts_with(line, "diff --git ")) {
>       @@ range-diff.c: struct patch_util {
>        +			strbuf_addch(&buf, '\n');
>        +			if (!util->diff_offset)
>        +				util->diff_offset = buf.len;
>       -+			line[len - 1] = '\n';
>       ++
>        +			orig_len = len;
>       -+			len = parse_git_diff_header(&root, &linenr, 1, line,
>       -+						    len, size, &patch);
>       ++			/* `find_next_line()`'s replaced the LF with a NUL */
>       ++			line[len - 1] = '\n';
>       ++			len = len > 1 && line[len - 2] == '\r' ?
>       ++				error(_("cannot handle diff headers with "
>       ++					"CR/LF line endings")) :
>       ++				parse_git_diff_header(&root, &linenr, 1, line,
>       ++						      len, size, &patch);
>        +			if (len < 0) {
>        +				error(_("could not parse git header '%.*s'"),
>        +				      orig_len, line);
>       ++				release_patch(&patch);

Thank you for using release_patch() rather than the "free random struct 
members" approach that was suggested.

>        +				free(util);
>        +				free(current_filename);
>        +				string_list_clear(list, 1);
>       @@ range-diff.c: struct patch_util {
>        +					    (const char **)&patch.new_name);
>        +
>        +			strbuf_addstr(&buf, " ## ");
>       -+			if (patch.is_new > 0)
>       ++			if (patch.is_new)
>        +				strbuf_addf(&buf, "%s (new)", patch.new_name);
>       -+			else if (patch.is_delete > 0)
>       ++			else if (patch.is_delete)
>        +				strbuf_addf(&buf, "%s (deleted)", patch.old_name);
>        +			else if (patch.is_rename)
>        +				strbuf_addf(&buf, "%s => %s", patch.old_name, patch.new_name);
>       @@ range-diff.c: struct patch_util {
>        +				strbuf_addstr(&buf, patch.new_name);
>        +
>        +			free(current_filename);
>       -+			if (patch.is_delete > 0)
>       ++			if (patch.is_delete)
>        +				current_filename = xstrdup(patch.old_name);
>        +			else
>        +				current_filename = xstrdup(patch.new_name);
>       @@ range-diff.c: struct patch_util {
>        +
>        +			strbuf_addstr(&buf, " ##\n");
>        +			util->diffsize++;
>       ++			release_patch(&patch);
>        +		} else if (state == MBOX_IN_HEADER) {
>        +			if (!line[0]) {
>        +				state = MBOX_IN_COMMIT_MESSAGE;
>        +				/* Look for an in-body From: */
>       -+				if (size > 5 && skip_prefix(line + 1, "From: ", &p)) {
>       ++				if (skip_prefix(line + 1, "From: ", &p)) {
>        +					size -= p - line;
>        +					line += p - line;
>       -+					len = find_eol(line, size);
>       ++					len = find_next_line(line, size);
>        +
>        +					while (isspace(*p))
>        +						p++;
>       @@ range-diff.c: struct patch_util {
>        +					while (len < size && line[len] == ' ') {
>        +						line += len;
>        +						size -= len;
>       -+						len = find_eol(line, size);
>       ++						len = find_next_line(line, size);
>        +						strbuf_addstr(&long_subject, line);
>        +					}
>        +					subject = long_subject.buf;
>        +				}
>        +			}
>        +		} else if (state == MBOX_IN_COMMIT_MESSAGE) {
>       -+			if (!*line)
>       ++			if (!line[0]) {
>        +				strbuf_addch(&buf, '\n');
>       -+			else if (strcmp(line, "---")) {
>       ++			} else if (strcmp(line, "---")) {
>        +				int tabs = 0;
>        +
>        +				/* simulate tab expansion */
>       @@ range-diff.c: struct patch_util {
>        +			case '+':
>        +			case '-':
>        +			case ' ':
>       ++				/* A `-- ` line indicates the end of a diff */
>        +				if (!old_count && !new_count)
>        +					break;
>        +				if (old_count && line[0] != '+')
>       @@ t/t3206-range-diff.sh: test_expect_success 'ranges with pathspecs' '
>        +test_expect_success 'compare range vs mbox' '
>        +	git format-patch --stdout topic..mode-only-change >mbox &&
>        +	git range-diff topic...mode-only-change >expect &&
>       ++
>        +	git range-diff mode-only-change..topic mbox:./mbox >actual &&
>        +	test_cmp expect actual &&
>       -+	git range-diff mode-only-change..topic mbox:- <mbox >actual &&
>       -+	test_cmp expect actual
>       ++

I'm a bit confused by this sed command, I've annotated it with my 
probably flawed understanding.

>       ++	sed -e "/^From: .*/{
>       ++		h

This stores the From: header in the hold space

>       ++		s/.*/From: Bugs Bunny <bugs@bun.ny>/

This changes the From: header in the pattern space

>       ++		:1
>       ++		N
>       ++		/[ -z]$/b1

We loop until we find a line that does not end with a space, letter or 
number adding the lines to the hold space

>       ++		G

This appends the hold space to the pattern space, then the pattern space 
is printed. Doesn't this mean we end up with two From: headers? Is the 
in-body From: line already present?

>       ++	}" <mbox >mbox.from &&
>       ++	git range-diff mode-only-change..topic mbox:./mbox.from >actual.from &&
>       ++	test_cmp expect actual.from &&
>       ++
>       ++	append_cr <mbox >mbox.cr &&
>       ++	test_must_fail git range-diff \
>       ++		mode-only-change..topic mbox:./mbox.cr 2>err &&
>       ++	grep CR/LF err &&

Thanks for adding that

Best Wishes

Phillip

>       ++	git range-diff mode-only-change..topic mbox:- <mbox >actual.stdin &&
>       ++	test_cmp expect actual.stdin
>        +'
>        +
>         test_expect_success 'submodule changes are shown irrespective of diff.submodule' '
> 
> 
>   Documentation/git-range-diff.txt |   3 +-
>   range-diff.c                     | 333 ++++++++++++++++++++++++++++++-
>   t/t3206-range-diff.sh            |  27 +++
>   3 files changed, 361 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/git-range-diff.txt b/Documentation/git-range-diff.txt
> index 0b393715d70..e2c4661acde 100644
> --- a/Documentation/git-range-diff.txt
> +++ b/Documentation/git-range-diff.txt
> @@ -37,7 +37,8 @@ There are three ways to specify the commit ranges:
>   
>   - `<range1> <range2>`: Either commit range can be of the form
>     `<base>..<rev>`, `<rev>^!` or `<rev>^-<n>`. See `SPECIFYING RANGES`
> -  in linkgit:gitrevisions[7] for more details.
> +  in linkgit:gitrevisions[7] for more details. Alternatively, the
> +  patches can be provided as an mbox-formatted file via `mbox:<path>`.
>   
>   - `<rev1>...<rev2>`. This is equivalent to
>     `<rev2>..<rev1> <rev1>..<rev2>`.
> diff --git a/range-diff.c b/range-diff.c
> index 124dd678c38..9dc1e3af3f8 100644
> --- a/range-diff.c
> +++ b/range-diff.c
> @@ -12,6 +12,7 @@
>   #include "userdiff.h"
>   #include "apply.h"
>   #include "revision.h"
> +#include "dir.h"
>   
>   struct patch_util {
>   	/* For the search for an exact match */
> @@ -26,6 +27,309 @@ struct patch_util {
>   	struct object_id oid;
>   };
>   
> +static inline int strtost(char const *s, size_t *result, const char **end)
> +{
> +	unsigned long u;
> +	char *p;
> +
> +	errno = 0;
> +	/* negative values would be accepted by strtoul */
> +	if (!isdigit(*s))
> +		return -1;
> +	u = strtoul(s, &p, 10);
> +	if (errno || p == s)
> +		return -1;
> +	if (result)
> +		*result = u;
> +	*end = p;
> +
> +	return 0;
> +}
> +
> +static int parse_hunk_header(const char *p,
> +			     size_t *old_count, size_t *new_count,
> +			     const char **end)
> +{
> +	size_t o = 1, n = 1;
> +
> +	if (!skip_prefix(p, "@@ -", &p) ||
> +	    strtost(p, NULL, &p) ||
> +	    /* The range is -<start>[,<count>], defaulting to count = 1 */
> +	    !(*p == ' ' || (*p == ',' && !strtost(p + 1, &o, &p))) ||
> +	    !skip_prefix(p, " +", &p) ||
> +	    strtost(p, NULL, &p) ||
> +	    /* The range is +<start>[,<count>], defaulting to count = 1 */
> +	    !(*p == ' ' || (*p == ',' && !strtost(p + 1, &n, &p))) ||
> +	    !skip_prefix(p, " @@", &p))
> +		return -1;
> +
> +	*old_count = o;
> +	*new_count = n;
> +	*end = p;
> +
> +	return 0;
> +}
> +
> +/*
> + * This function finds the end of the line, replaces the newline character with
> + * a NUL, and returns the offset of the start of the next line.
> + *
> + * If no newline character was found, it returns the offset of the trailing NUL
> + * instead.
> + */
> +static inline int find_next_line(const char *line, size_t size)
> +{
> +	char *eol;
> +
> +	eol = memchr(line, '\n', size);
> +	if (!eol)
> +		return size;
> +
> +	*eol = '\0';
> +
> +	return eol + 1 - line;
> +}
> +
> +static int read_mbox(const char *path, struct string_list *list)
> +{
> +	struct strbuf buf = STRBUF_INIT, contents = STRBUF_INIT;
> +	struct strbuf long_subject = STRBUF_INIT;
> +	struct patch_util *util = NULL;
> +	enum {
> +		MBOX_BEFORE_HEADER,
> +		MBOX_IN_HEADER,
> +		MBOX_IN_COMMIT_MESSAGE,
> +		MBOX_AFTER_TRIPLE_DASH,
> +		MBOX_IN_DIFF
> +	} state = MBOX_BEFORE_HEADER;
> +	char *line, *current_filename = NULL;
> +	int len;
> +	size_t size, old_count = 0, new_count = 0;
> +	const char *author = NULL, *subject = NULL;
> +
> +	if (!strcmp(path, "-")) {
> +		if (strbuf_read(&contents, STDIN_FILENO, 0) < 0)
> +			return error_errno(_("could not read stdin"));
> +	} else if (strbuf_read_file(&contents, path, 0) < 0)
> +		return error_errno(_("could not read '%s'"), path);
> +
> +	line = contents.buf;
> +	size = contents.len;
> +	for (; size; size -= len, line += len) {
> +		const char *p;
> +
> +		len = find_next_line(line, size);
> +
> +		if (state == MBOX_BEFORE_HEADER ||
> +		    (state == MBOX_IN_DIFF && line[0] == 'F')) {
> +			if (!skip_prefix(line, "From ", &p))
> +				continue;
> +
> +			util = xcalloc(1, sizeof(*util));
> +			if (get_oid_hex(p, &util->oid) < 0)
> +				oidcpy(&util->oid, null_oid());
> +			util->matching = -1;
> +			author = subject = NULL;
> +
> +			state = MBOX_IN_HEADER;
> +			continue;
> +		}
> +
> +		if (starts_with(line, "diff --git ")) {
> +			struct patch patch = { 0 };
> +			struct strbuf root = STRBUF_INIT;
> +			int linenr = 0;
> +			int orig_len;
> +
> +			state = MBOX_IN_DIFF;
> +			old_count = new_count = 0;
> +			strbuf_addch(&buf, '\n');
> +			if (!util->diff_offset)
> +				util->diff_offset = buf.len;
> +
> +			orig_len = len;
> +			/* `find_next_line()`'s replaced the LF with a NUL */
> +			line[len - 1] = '\n';
> +			len = len > 1 && line[len - 2] == '\r' ?
> +				error(_("cannot handle diff headers with "
> +					"CR/LF line endings")) :
> +				parse_git_diff_header(&root, &linenr, 1, line,
> +						      len, size, &patch);
> +			if (len < 0) {
> +				error(_("could not parse git header '%.*s'"),
> +				      orig_len, line);
> +				release_patch(&patch);
> +				free(util);
> +				free(current_filename);
> +				string_list_clear(list, 1);
> +				strbuf_release(&buf);
> +				strbuf_release(&contents);
> +				strbuf_release(&long_subject);
> +				return -1;
> +			}
> +
> +			if (patch.old_name)
> +				skip_prefix(patch.old_name, "a/",
> +					    (const char **)&patch.old_name);
> +			if (patch.new_name)
> +				skip_prefix(patch.new_name, "b/",
> +					    (const char **)&patch.new_name);
> +
> +			strbuf_addstr(&buf, " ## ");
> +			if (patch.is_new)
> +				strbuf_addf(&buf, "%s (new)", patch.new_name);
> +			else if (patch.is_delete)
> +				strbuf_addf(&buf, "%s (deleted)", patch.old_name);
> +			else if (patch.is_rename)
> +				strbuf_addf(&buf, "%s => %s", patch.old_name, patch.new_name);
> +			else
> +				strbuf_addstr(&buf, patch.new_name);
> +
> +			free(current_filename);
> +			if (patch.is_delete)
> +				current_filename = xstrdup(patch.old_name);
> +			else
> +				current_filename = xstrdup(patch.new_name);
> +
> +			if (patch.new_mode && patch.old_mode &&
> +			    patch.old_mode != patch.new_mode)
> +				strbuf_addf(&buf, " (mode change %06o => %06o)",
> +					    patch.old_mode, patch.new_mode);
> +
> +			strbuf_addstr(&buf, " ##\n");
> +			util->diffsize++;
> +			release_patch(&patch);
> +		} else if (state == MBOX_IN_HEADER) {
> +			if (!line[0]) {
> +				state = MBOX_IN_COMMIT_MESSAGE;
> +				/* Look for an in-body From: */
> +				if (skip_prefix(line + 1, "From: ", &p)) {
> +					size -= p - line;
> +					line += p - line;
> +					len = find_next_line(line, size);
> +
> +					while (isspace(*p))
> +						p++;
> +					author = p;
> +				}
> +				strbuf_addstr(&buf, " ## Metadata ##\n");
> +				if (author)
> +					strbuf_addf(&buf, "Author: %s\n", author);
> +				strbuf_addstr(&buf, "\n ## Commit message ##\n");
> +				if (subject)
> +					strbuf_addf(&buf, "    %s\n\n", subject);
> +			} else if (skip_prefix(line, "From: ", &p)) {
> +				while (isspace(*p))
> +					p++;
> +				author = p;
> +			} else if (skip_prefix(line, "Subject: ", &p)) {
> +				const char *q;
> +
> +				while (isspace(*p))
> +					p++;
> +				subject = p;
> +
> +				if (starts_with(p, "[PATCH") &&
> +				    (q = strchr(p, ']'))) {
> +					q++;
> +					while (isspace(*q))
> +						q++;
> +					subject = q;
> +				}
> +
> +				if (len < size && line[len] == ' ') {
> +					/* handle long subject */
> +					strbuf_reset(&long_subject);
> +					strbuf_addstr(&long_subject, subject);
> +					while (len < size && line[len] == ' ') {
> +						line += len;
> +						size -= len;
> +						len = find_next_line(line, size);
> +						strbuf_addstr(&long_subject, line);
> +					}
> +					subject = long_subject.buf;
> +				}
> +			}
> +		} else if (state == MBOX_IN_COMMIT_MESSAGE) {
> +			if (!line[0]) {
> +				strbuf_addch(&buf, '\n');
> +			} else if (strcmp(line, "---")) {
> +				int tabs = 0;
> +
> +				/* simulate tab expansion */
> +				while (line[tabs] == '\t')
> +					tabs++;
> +				strbuf_addf(&buf, "%*s%s\n",
> +					    4 + 8 * tabs, "", line + tabs);
> +			} else {
> +				/*
> +				 * Trim the trailing newline that is added
> +				 * by `format-patch`.
> +				 */
> +				strbuf_trim_trailing_newline(&buf);
> +				state = MBOX_AFTER_TRIPLE_DASH;
> +			}
> +		} else if (state == MBOX_IN_DIFF) {
> +			switch (line[0]) {
> +			case '\0':
> +				continue; /* ignore empty lines after diff */
> +			case '+':
> +			case '-':
> +			case ' ':
> +				/* A `-- ` line indicates the end of a diff */
> +				if (!old_count && !new_count)
> +					break;
> +				if (old_count && line[0] != '+')
> +					old_count--;
> +				if (new_count && line[0] != '-')
> +					new_count--;
> +				/* fallthrough */
> +			case '\\':
> +				strbuf_addstr(&buf, line);
> +				strbuf_addch(&buf, '\n');
> +				util->diffsize++;
> +				continue;
> +			case '@':
> +				if (parse_hunk_header(line, &old_count,
> +						      &new_count, &p))
> +					break;
> +
> +				strbuf_addstr(&buf, "@@");
> +				if (current_filename && *p)
> +					strbuf_addf(&buf, " %s:",
> +						    current_filename);
> +				strbuf_addstr(&buf, p);
> +				strbuf_addch(&buf, '\n');
> +				util->diffsize++;
> +				continue;
> +			}
> +
> +			if (util) {
> +				string_list_append(list, buf.buf)->util = util;
> +				strbuf_reset(&buf);
> +			}
> +			util = xcalloc(1, sizeof(*util));
> +			oidcpy(&util->oid, null_oid());
> +			util->matching = -1;
> +			author = subject = NULL;
> +			state = MBOX_BEFORE_HEADER;
> +		}
> +	}
> +	strbuf_release(&contents);
> +
> +	if (util) {
> +		if (state == MBOX_IN_DIFF)
> +			string_list_append(list, buf.buf)->util = util;
> +		else
> +			free(util);
> +	}
> +	strbuf_release(&buf);
> +	strbuf_release(&long_subject);
> +	free(current_filename);
> +
> +	return 0;
> +}
> +
>   /*
>    * Reads the patches into a string list, with the `util` field being populated
>    * as struct object_id (will need to be free()d).
> @@ -41,6 +345,10 @@ static int read_patches(const char *range, struct string_list *list,
>   	ssize_t len;
>   	size_t size;
>   	int ret = -1;
> +	const char *path;
> +
> +	if (skip_prefix(range, "mbox:", &path))
> +		return read_mbox(path, list);
>   
>   	strvec_pushl(&cp.args, "log", "--no-color", "-p", "--no-merges",
>   		     "--reverse", "--date-order", "--decorate=no",
> @@ -424,6 +732,19 @@ static void output_pair_header(struct diff_options *diffopt,
>   
>   		strbuf_addch(buf, ' ');
>   		pp_commit_easy(CMIT_FMT_ONELINE, commit, buf);
> +	} else {
> +		struct patch_util *util = b_util ? b_util : a_util;
> +		const char *needle = "\n ## Commit message ##\n";
> +		const char *p = !util || !util->patch ?
> +			NULL : strstr(util->patch, needle);
> +		if (p) {
> +			if (status == '!')
> +				strbuf_addf(buf, "%s%s", color_reset, color);
> +
> +			strbuf_addch(buf, ' ');
> +			p += strlen(needle);
> +			strbuf_add(buf, p, strchrnul(p, '\n') - p);
> +		}
>   	}
>   	strbuf_addf(buf, "%s\n", color_reset);
>   
> @@ -554,6 +875,9 @@ int show_range_diff(const char *range1, const char *range2,
>   	if (range_diff_opts->left_only && range_diff_opts->right_only)
>   		res = error(_("options '%s' and '%s' cannot be used together"), "--left-only", "--right-only");
>   
> +	if (!strcmp(range1, "mbox:-") && !strcmp(range2, "mbox:-"))
> +		res = error(_("only one mbox can be read from stdin"));
> +
>   	if (!res && read_patches(range1, &branch1, range_diff_opts->other_arg))
>   		res = error(_("could not parse log for '%s'"), range1);
>   	if (!res && read_patches(range2, &branch2, range_diff_opts->other_arg))
> @@ -575,10 +899,17 @@ int show_range_diff(const char *range1, const char *range2,
>   int is_range_diff_range(const char *arg)
>   {
>   	char *copy = xstrdup(arg); /* setup_revisions() modifies it */
> -	const char *argv[] = { "", copy, "--", NULL };
> +	const char *argv[] = { "", copy, "--", NULL }, *path;
>   	int i, positive = 0, negative = 0;
>   	struct rev_info revs;
>   
> +	if (skip_prefix(arg, "mbox:", &path)) {
> +		if (!strcmp(path, "-") || file_exists(path))
> +			return 1;
> +		error_errno(_("not an mbox: '%s'"), path);
> +		return 0;
> +	}
> +
>   	init_revisions(&revs, NULL);
>   	if (setup_revisions(3, argv, &revs, NULL) == 1) {
>   		for (i = 0; i < revs.pending.nr; i++)
> diff --git a/t/t3206-range-diff.sh b/t/t3206-range-diff.sh
> index 459beaf7d9c..89ef9a5ffc4 100755
> --- a/t/t3206-range-diff.sh
> +++ b/t/t3206-range-diff.sh
> @@ -783,6 +783,33 @@ test_expect_success 'ranges with pathspecs' '
>   	! grep "$topic_oid" actual
>   '
>   
> +test_expect_success 'compare range vs mbox' '
> +	git format-patch --stdout topic..mode-only-change >mbox &&
> +	git range-diff topic...mode-only-change >expect &&
> +
> +	git range-diff mode-only-change..topic mbox:./mbox >actual &&
> +	test_cmp expect actual &&
> +
> +	sed -e "/^From: .*/{
> +		h
> +		s/.*/From: Bugs Bunny <bugs@bun.ny>/
> +		:1
> +		N
> +		/[ -z]$/b1
> +		G
> +	}" <mbox >mbox.from &&
> +	git range-diff mode-only-change..topic mbox:./mbox.from >actual.from &&
> +	test_cmp expect actual.from &&
> +
> +	append_cr <mbox >mbox.cr &&
> +	test_must_fail git range-diff \
> +		mode-only-change..topic mbox:./mbox.cr 2>err &&
> +	grep CR/LF err &&
> +
> +	git range-diff mode-only-change..topic mbox:- <mbox >actual.stdin &&
> +	test_cmp expect actual.stdin
> +'
> +
>   test_expect_success 'submodule changes are shown irrespective of diff.submodule' '
>   	git init sub-repo &&
>   	test_commit -C sub-repo sub-first &&
> 
> base-commit: b75747829f4c277323c78b1c5973ad63ea038a2d

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2] range-diff: support reading mbox files
  2022-11-21 10:08   ` Phillip Wood
@ 2022-11-22  7:40     ` Johannes Schindelin
  2022-11-22 14:20       ` Phillip Wood
  0 siblings, 1 reply; 16+ messages in thread
From: Johannes Schindelin @ 2022-11-22  7:40 UTC (permalink / raw)
  To: phillip.wood
  Cc: Johannes Schindelin via GitGitGadget, git, Taylor Blau,
	Phillip Wood, Ævar Arnfjörð Bjarmason

[-- Attachment #1: Type: text/plain, Size: 7334 bytes --]

Hi Phillip,

On Mon, 21 Nov 2022, Phillip Wood wrote:

> [...]
> >       @@ range-diff.c: struct patch_util {
> >        +
> >        +	line = contents.buf;
> >        +	size = contents.len;
> >       -+	for (; size > 0; size -= len, line += len) {
> >       ++	for (; size; size -= len, line += len) {
> >        +		const char *p;
> >        +
> >       -+		len = find_eol(line, size);
> >       ++		len = find_next_line(line, size);
> >       +
> >       -+		if (state == MBOX_BEFORE_HEADER) {
> >       ++		if (state == MBOX_BEFORE_HEADER ||
> >       ++		    (state == MBOX_IN_DIFF && line[0] == 'F')) {
>
> If we see an 'F' when parsing a diff then it is probably a From: line
> indicating that we've reached the start of a new patch. I think we need to add
> the diff we've just parsed to the list of commits before continuing and change
> state to MBOX_BEFORE_HEADER otherwise we'll hit the continue just below this
> comment with state unchanged.

As always, thank you *so* much for modeling how to perform productive,
actionable and *thorough* code reviews that focus on the relevant.

You spotted something important. It is a bug, as you suspected, and there
is even more: `util` leaks because it is allocated both after the patch
was parsed and when we see a diff header, without releasing the first (but
granted, this leak has been there before).

You have no idea how much I appreciate your diligence; I understand how
much of an effort it takes to study patches in depth until you understand
them enough to spot complex issues like this one. (In Parkinson's
parlance: I am grateful that you focus on the design of the nuclear power
plant, see https://en.wikipedia.org/w/index.php?title=Atwood%27s_duck)

Back to our regularly scheduled programming (pun intended): I indeed made
a mistake here and fixed it, by removing that 'F' check again, and instead
adding a `goto` label and jumping to that upon encountering the end of the
diff.

I also de-duplicated the `util = xcalloc(1, sizeof(*util))` calls so that
we do that _only_ in the `MBOX_BEFORE_HEADER` arm, and no longer at the
end of the `MBOX_IN_DIFF` arm.

Since I already looked at a leak, I looked further and saw that also
`copy` is leaked in `show_range_diff()` in case of an mbox. I've addressed
that, too.

>
> >        +			if (!skip_prefix(line, "From ", &p))
> >        +				continue;
> >        +
> >       @@ range-diff.c: struct patch_util {
> >        +			author = subject = NULL;
> >        +
> >        +			state = MBOX_IN_HEADER;
> >       ++			continue;
> >        +		}
> >        +
> >        +		if (starts_with(line, "diff --git ")) {
> >       @@ range-diff.c: struct patch_util {
> >        +			strbuf_addch(&buf, '\n');
> >        +			if (!util->diff_offset)
> >        +				util->diff_offset = buf.len;
> >       -+			line[len - 1] = '\n';
> >       ++
> >       +			orig_len = len;
> >       -+			len = parse_git_diff_header(&root, &linenr, 1,
> >       line,
> >       -+						    len, size,
> >       &patch);
> >       ++			/* `find_next_line()`'s replaced the LF with a
> >       NUL */
> >       ++			line[len - 1] = '\n';
> >       ++			len = len > 1 && line[len - 2] == '\r' ?
> >       ++				error(_("cannot handle diff headers
> >       with "
> >       ++					"CR/LF line endings")) :
> >       ++				parse_git_diff_header(&root, &linenr,
> >       1, line,
> >       ++						      len, size,
> >       &patch);
> >        +			if (len < 0) {
> >        +				error(_("could not parse git header
> >        '%.*s'"),
> >        +				      orig_len, line);
> >       ++				release_patch(&patch);
>
> Thank you for using release_patch() rather than the "free random struct
> members" approach that was suggested.

Indeed, it sounded like a specific suggestion to release a specific
`struct` member, but that was of course not the end of the story, and I
completed the analysis myself that I had wished the reviewer would have
completed _before_ sending off a review.

> >       @@ t/t3206-range-diff.sh: test_expect_success 'ranges with pathspecs'
> >       '
> >        +test_expect_success 'compare range vs mbox' '
> >        +	git format-patch --stdout topic..mode-only-change >mbox &&
> >        +	git range-diff topic...mode-only-change >expect &&
> >       ++
> >        +	git range-diff mode-only-change..topic mbox:./mbox >actual &&
> >        +	test_cmp expect actual &&
> >       -+	git range-diff mode-only-change..topic mbox:- <mbox >actual &&
> >       -+	test_cmp expect actual
> >       ++
>
> I'm a bit confused by this sed command, I've annotated it with my probably
> flawed understanding.
>
> >       ++	sed -e "/^From: .*/{
> >       ++		h
>
> This stores the From: header in the hold space

👍

> >       ++		s/.*/From: Bugs Bunny <bugs@bun.ny>/
>
> This changes the From: header in the pattern space

👍

> >       ++		:1
> >       ++		N
> >       ++		/[ -z]$/b1
>
> We loop until we find a line that does not end with a space, letter or number
> adding the lines to the hold space

I would have _loved_ to match on an empty line, i.e. `/[^\n]$/b1`. But
that construct is not understood by the `sed` on macOS.

I even went so far as to search for the source code of a BSD `sed` (and I
found it, and modified the code so that it builds on Linux, see
https://github.com/dscho/sed-bsd) to try a couple of things, but could not
make it work with any variation of `\n`. Therefore, I settled on expecting
all the lines in the commit header to end in printable ASCII characters
between the space and the `z`.

> >       ++		G
>
> This appends the hold space to the pattern space, then the pattern space is
> printed.

👍

> Doesn't this mean we end up with two From: headers? Is the in-body From:
> line already present?

There is no in-body `From:` header because the patch author matches the
`GIT_AUTHOR_IDENT` that is in effect while running the `format-patch`
command.

Let me show you what this `sed` call deals with. In the local test run, it
modified an `mbox` starting with this:

	From 4d39cb329d3ef4c8e69b43859c2e11adb83f8613 Mon Sep 17 00:00:00 2001
	From: Thomas Rast <trast@inf.ethz.ch>
	Date: Mon, 22 Jul 2013 11:23:44 +0200
	Subject: [PATCH 1/3] s/4/A/ + add other-file

	---
	  file       | 2 +-
	  other-file | 0
	  [...]

to a modified `mbox` that starts with this:

	From 4d39cb329d3ef4c8e69b43859c2e11adb83f8613 Mon Sep 17 00:00:00 2001
	From: Bugs Bunny <bugs@bun.ny>
	Date: Mon, 22 Jul 2013 11:23:44 +0200
	Subject: [PATCH 1/3] s/4/A/ + add other-file

	From: Thomas Rast <trast@inf.ethz.ch>
	---
	  file       | 2 +-
	  other-file | 0
	  [...]

>
> >       ++	}" <mbox >mbox.from &&
> >       ++	git range-diff mode-only-change..topic mbox:./mbox.from
> >       >actual.from &&
> >       ++	test_cmp expect actual.from &&
> >       ++
> >       ++	append_cr <mbox >mbox.cr &&
> >       ++	test_must_fail git range-diff \
> >       ++		mode-only-change..topic mbox:./mbox.cr 2>err &&
> >       ++	grep CR/LF err &&
>
> Thanks for adding that

Thank you for the suggestion to add it!

And thank you again for modeling how to perform actionable, helpful and
productive code reviews,
Dscho

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v3] range-diff: support reading mbox files
  2022-11-19 23:11 ` [PATCH v2] " Johannes Schindelin via GitGitGadget
  2022-11-21 10:08   ` Phillip Wood
@ 2022-11-22  9:08   ` Johannes Schindelin via GitGitGadget
  2022-11-22 14:23     ` Phillip Wood
                       ` (2 more replies)
  1 sibling, 3 replies; 16+ messages in thread
From: Johannes Schindelin via GitGitGadget @ 2022-11-22  9:08 UTC (permalink / raw)
  To: git
  Cc: Taylor Blau, Phillip Wood, Ævar Arnfjörð Bjarmason,
	Johannes Schindelin, Johannes Schindelin

From: Johannes Schindelin <johannes.schindelin@gmx.de>

Internally, the `git range-diff` command spawns a `git log` process and
parses its output for the given commit ranges.

This works well when the patches that need to be compared are present in
the local repository in the form of commits.

In scenarios where that is not the case, the `range-diff` command is
currently less helpful.

The Git mailing list is such a scenario: Instead of using Git to
exchange commits, the patches are sent there as plain-text and no commit
range can be specified to let `range-diff` consume those patches.

Instead, the expectation is to download the mails, apply them locally
and then use `range-diff`. This can be quite cumbersome e.g. when a
suitable base revision has to be found first where the patch applies
cleanly.

Let's offer a way to read those patches from pre-prepared MBox files
instead when an argument "mbox:<filename>" is passed instead of a commit
range.

For extra convenience, interpret the filename `-` as standard input.
This makes it easy to compare contributions on the mailing list with the
actual commits that were integrated into Git's main branch. Example:

	commit=5c4003ca3f0e9ac6d3c8aa3e387ff843bd440411
	mid=bdfa3845b81531863941e6a97c28eb1afa62dd2c.1489435755.git.johannes.schindelin@gmx.de
	curl -s https://lore.kernel.org/git/$mid/raw |
	git range-diff mbox:- $commit^!

This addresses https://github.com/gitgitgadget/git/issues/207

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
    range-diff: support reading mbox files
    
    I frequently find myself wanting to look at the range-diff between some
    local commits and the patches on the Git mailing list, but unwilling to
    go through the process of finding an appropriate base revision to apply
    the patches onto (just to throw the generated patches away afterwards,
    anyway).
    
    So I came up with this patch. May it be helpful to other developers,
    too.
    
    This patch contains a home-rolled mbox parser. Initially, I wrote a
    really basic parser and it worked well enough, but, you know, as things
    go it became more complex than that in order to provide actually useful
    range-diffs for existing commits and their corresponding mails (because
    of in-body From: headers, because of -- trailers and long subjects, just
    to name a few reasons). In hindsight, it might have made sense to try to
    to reuse the parser that is available in mailinfo.c, which I had
    initially dismissed as overly complex and unnecessary for this use case.
    If anyone feels up to it, I would invite them to adjust this code to
    replace the mbox parser with one based on the mailinfo.c. Incrementally,
    of course, because the perfect is the enemy of the good.
    
    Changes since v2:
    
     * Fixed two leaks
     * Avoided dropping a patch just because it does not end in a "-- " line
     * Empty lines in the middle of a diff are now treated as context lines
       (newer GNU diff versions generate those)
     * We now warn when a parsed diff is incomplete (i.e. when the ranges in
       the hunk header do not line up with the number of parsed lines), but
       still follow Postel's Law i.e. being kind and accepting.
    
    Changes since v1:
    
     * We no longer leak allocated memory in the struct patch instance
     * Made strtost() a bit more stringent
     * Postel [https://en.wikipedia.org/wiki/Postel%27s_Law]ized the mbox
       parser substantially, together with a couple more cosmetic fixes,
       based on Phillip Wood's excellent review of v1.
     * Extended the test case to cover mboxes containing CR/LF and in-body
       From: lines

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1420%2Fdscho%2Frange-diff-from-mbox-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1420/dscho/range-diff-from-mbox-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/1420

Range-diff vs v2:

 1:  485249ddfb3 ! 1:  ac4d8704068 range-diff: support reading mbox files
     @@ range-diff.c: struct patch_util {
      +
      +		len = find_next_line(line, size);
      +
     -+		if (state == MBOX_BEFORE_HEADER ||
     -+		    (state == MBOX_IN_DIFF && line[0] == 'F')) {
     ++		if (state == MBOX_BEFORE_HEADER) {
     ++parse_from_delimiter:
      +			if (!skip_prefix(line, "From ", &p))
      +				continue;
      +
     ++			if (util)
     ++				BUG("util already allocated");
      +			util = xcalloc(1, sizeof(*util));
      +			if (get_oid_hex(p, &util->oid) < 0)
      +				oidcpy(&util->oid, null_oid());
     @@ range-diff.c: struct patch_util {
      +			}
      +		} else if (state == MBOX_IN_DIFF) {
      +			switch (line[0]) {
     -+			case '\0':
     -+				continue; /* ignore empty lines after diff */
     ++			case '\0': /* newer GNU diff, an empty context line */
      +			case '+':
      +			case '-':
      +			case ' ':
     @@ range-diff.c: struct patch_util {
      +				strbuf_addch(&buf, '\n');
      +				util->diffsize++;
      +				continue;
     ++			default:
     ++				if (old_count || new_count)
     ++					warning(_("diff ended prematurely (-%d/+%d)"),
     ++						(int)old_count, (int)new_count);
     ++				break;
      +			}
      +
      +			if (util) {
      +				string_list_append(list, buf.buf)->util = util;
     ++				util = NULL;
      +				strbuf_reset(&buf);
      +			}
     -+			util = xcalloc(1, sizeof(*util));
     -+			oidcpy(&util->oid, null_oid());
     -+			util->matching = -1;
     -+			author = subject = NULL;
      +			state = MBOX_BEFORE_HEADER;
     ++			goto parse_from_delimiter;
      +		}
      +	}
      +	strbuf_release(&contents);
     @@ range-diff.c: int show_range_diff(const char *range1, const char *range2,
       	struct rev_info revs;
       
      +	if (skip_prefix(arg, "mbox:", &path)) {
     ++		free(copy);
      +		if (!strcmp(path, "-") || file_exists(path))
      +			return 1;
      +		error_errno(_("not an mbox: '%s'"), path);


 Documentation/git-range-diff.txt |   3 +-
 range-diff.c                     | 338 ++++++++++++++++++++++++++++++-
 t/t3206-range-diff.sh            |  27 +++
 3 files changed, 366 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-range-diff.txt b/Documentation/git-range-diff.txt
index 0b393715d70..e2c4661acde 100644
--- a/Documentation/git-range-diff.txt
+++ b/Documentation/git-range-diff.txt
@@ -37,7 +37,8 @@ There are three ways to specify the commit ranges:
 
 - `<range1> <range2>`: Either commit range can be of the form
   `<base>..<rev>`, `<rev>^!` or `<rev>^-<n>`. See `SPECIFYING RANGES`
-  in linkgit:gitrevisions[7] for more details.
+  in linkgit:gitrevisions[7] for more details. Alternatively, the
+  patches can be provided as an mbox-formatted file via `mbox:<path>`.
 
 - `<rev1>...<rev2>`. This is equivalent to
   `<rev2>..<rev1> <rev1>..<rev2>`.
diff --git a/range-diff.c b/range-diff.c
index 124dd678c38..c391ffa603b 100644
--- a/range-diff.c
+++ b/range-diff.c
@@ -12,6 +12,7 @@
 #include "userdiff.h"
 #include "apply.h"
 #include "revision.h"
+#include "dir.h"
 
 struct patch_util {
 	/* For the search for an exact match */
@@ -26,6 +27,313 @@ struct patch_util {
 	struct object_id oid;
 };
 
+static inline int strtost(char const *s, size_t *result, const char **end)
+{
+	unsigned long u;
+	char *p;
+
+	errno = 0;
+	/* negative values would be accepted by strtoul */
+	if (!isdigit(*s))
+		return -1;
+	u = strtoul(s, &p, 10);
+	if (errno || p == s)
+		return -1;
+	if (result)
+		*result = u;
+	*end = p;
+
+	return 0;
+}
+
+static int parse_hunk_header(const char *p,
+			     size_t *old_count, size_t *new_count,
+			     const char **end)
+{
+	size_t o = 1, n = 1;
+
+	if (!skip_prefix(p, "@@ -", &p) ||
+	    strtost(p, NULL, &p) ||
+	    /* The range is -<start>[,<count>], defaulting to count = 1 */
+	    !(*p == ' ' || (*p == ',' && !strtost(p + 1, &o, &p))) ||
+	    !skip_prefix(p, " +", &p) ||
+	    strtost(p, NULL, &p) ||
+	    /* The range is +<start>[,<count>], defaulting to count = 1 */
+	    !(*p == ' ' || (*p == ',' && !strtost(p + 1, &n, &p))) ||
+	    !skip_prefix(p, " @@", &p))
+		return -1;
+
+	*old_count = o;
+	*new_count = n;
+	*end = p;
+
+	return 0;
+}
+
+/*
+ * This function finds the end of the line, replaces the newline character with
+ * a NUL, and returns the offset of the start of the next line.
+ *
+ * If no newline character was found, it returns the offset of the trailing NUL
+ * instead.
+ */
+static inline int find_next_line(const char *line, size_t size)
+{
+	char *eol;
+
+	eol = memchr(line, '\n', size);
+	if (!eol)
+		return size;
+
+	*eol = '\0';
+
+	return eol + 1 - line;
+}
+
+static int read_mbox(const char *path, struct string_list *list)
+{
+	struct strbuf buf = STRBUF_INIT, contents = STRBUF_INIT;
+	struct strbuf long_subject = STRBUF_INIT;
+	struct patch_util *util = NULL;
+	enum {
+		MBOX_BEFORE_HEADER,
+		MBOX_IN_HEADER,
+		MBOX_IN_COMMIT_MESSAGE,
+		MBOX_AFTER_TRIPLE_DASH,
+		MBOX_IN_DIFF
+	} state = MBOX_BEFORE_HEADER;
+	char *line, *current_filename = NULL;
+	int len;
+	size_t size, old_count = 0, new_count = 0;
+	const char *author = NULL, *subject = NULL;
+
+	if (!strcmp(path, "-")) {
+		if (strbuf_read(&contents, STDIN_FILENO, 0) < 0)
+			return error_errno(_("could not read stdin"));
+	} else if (strbuf_read_file(&contents, path, 0) < 0)
+		return error_errno(_("could not read '%s'"), path);
+
+	line = contents.buf;
+	size = contents.len;
+	for (; size; size -= len, line += len) {
+		const char *p;
+
+		len = find_next_line(line, size);
+
+		if (state == MBOX_BEFORE_HEADER) {
+parse_from_delimiter:
+			if (!skip_prefix(line, "From ", &p))
+				continue;
+
+			if (util)
+				BUG("util already allocated");
+			util = xcalloc(1, sizeof(*util));
+			if (get_oid_hex(p, &util->oid) < 0)
+				oidcpy(&util->oid, null_oid());
+			util->matching = -1;
+			author = subject = NULL;
+
+			state = MBOX_IN_HEADER;
+			continue;
+		}
+
+		if (starts_with(line, "diff --git ")) {
+			struct patch patch = { 0 };
+			struct strbuf root = STRBUF_INIT;
+			int linenr = 0;
+			int orig_len;
+
+			state = MBOX_IN_DIFF;
+			old_count = new_count = 0;
+			strbuf_addch(&buf, '\n');
+			if (!util->diff_offset)
+				util->diff_offset = buf.len;
+
+			orig_len = len;
+			/* `find_next_line()`'s replaced the LF with a NUL */
+			line[len - 1] = '\n';
+			len = len > 1 && line[len - 2] == '\r' ?
+				error(_("cannot handle diff headers with "
+					"CR/LF line endings")) :
+				parse_git_diff_header(&root, &linenr, 1, line,
+						      len, size, &patch);
+			if (len < 0) {
+				error(_("could not parse git header '%.*s'"),
+				      orig_len, line);
+				release_patch(&patch);
+				free(util);
+				free(current_filename);
+				string_list_clear(list, 1);
+				strbuf_release(&buf);
+				strbuf_release(&contents);
+				strbuf_release(&long_subject);
+				return -1;
+			}
+
+			if (patch.old_name)
+				skip_prefix(patch.old_name, "a/",
+					    (const char **)&patch.old_name);
+			if (patch.new_name)
+				skip_prefix(patch.new_name, "b/",
+					    (const char **)&patch.new_name);
+
+			strbuf_addstr(&buf, " ## ");
+			if (patch.is_new)
+				strbuf_addf(&buf, "%s (new)", patch.new_name);
+			else if (patch.is_delete)
+				strbuf_addf(&buf, "%s (deleted)", patch.old_name);
+			else if (patch.is_rename)
+				strbuf_addf(&buf, "%s => %s", patch.old_name, patch.new_name);
+			else
+				strbuf_addstr(&buf, patch.new_name);
+
+			free(current_filename);
+			if (patch.is_delete)
+				current_filename = xstrdup(patch.old_name);
+			else
+				current_filename = xstrdup(patch.new_name);
+
+			if (patch.new_mode && patch.old_mode &&
+			    patch.old_mode != patch.new_mode)
+				strbuf_addf(&buf, " (mode change %06o => %06o)",
+					    patch.old_mode, patch.new_mode);
+
+			strbuf_addstr(&buf, " ##\n");
+			util->diffsize++;
+			release_patch(&patch);
+		} else if (state == MBOX_IN_HEADER) {
+			if (!line[0]) {
+				state = MBOX_IN_COMMIT_MESSAGE;
+				/* Look for an in-body From: */
+				if (skip_prefix(line + 1, "From: ", &p)) {
+					size -= p - line;
+					line += p - line;
+					len = find_next_line(line, size);
+
+					while (isspace(*p))
+						p++;
+					author = p;
+				}
+				strbuf_addstr(&buf, " ## Metadata ##\n");
+				if (author)
+					strbuf_addf(&buf, "Author: %s\n", author);
+				strbuf_addstr(&buf, "\n ## Commit message ##\n");
+				if (subject)
+					strbuf_addf(&buf, "    %s\n\n", subject);
+			} else if (skip_prefix(line, "From: ", &p)) {
+				while (isspace(*p))
+					p++;
+				author = p;
+			} else if (skip_prefix(line, "Subject: ", &p)) {
+				const char *q;
+
+				while (isspace(*p))
+					p++;
+				subject = p;
+
+				if (starts_with(p, "[PATCH") &&
+				    (q = strchr(p, ']'))) {
+					q++;
+					while (isspace(*q))
+						q++;
+					subject = q;
+				}
+
+				if (len < size && line[len] == ' ') {
+					/* handle long subject */
+					strbuf_reset(&long_subject);
+					strbuf_addstr(&long_subject, subject);
+					while (len < size && line[len] == ' ') {
+						line += len;
+						size -= len;
+						len = find_next_line(line, size);
+						strbuf_addstr(&long_subject, line);
+					}
+					subject = long_subject.buf;
+				}
+			}
+		} else if (state == MBOX_IN_COMMIT_MESSAGE) {
+			if (!line[0]) {
+				strbuf_addch(&buf, '\n');
+			} else if (strcmp(line, "---")) {
+				int tabs = 0;
+
+				/* simulate tab expansion */
+				while (line[tabs] == '\t')
+					tabs++;
+				strbuf_addf(&buf, "%*s%s\n",
+					    4 + 8 * tabs, "", line + tabs);
+			} else {
+				/*
+				 * Trim the trailing newline that is added
+				 * by `format-patch`.
+				 */
+				strbuf_trim_trailing_newline(&buf);
+				state = MBOX_AFTER_TRIPLE_DASH;
+			}
+		} else if (state == MBOX_IN_DIFF) {
+			switch (line[0]) {
+			case '\0': /* newer GNU diff, an empty context line */
+			case '+':
+			case '-':
+			case ' ':
+				/* A `-- ` line indicates the end of a diff */
+				if (!old_count && !new_count)
+					break;
+				if (old_count && line[0] != '+')
+					old_count--;
+				if (new_count && line[0] != '-')
+					new_count--;
+				/* fallthrough */
+			case '\\':
+				strbuf_addstr(&buf, line);
+				strbuf_addch(&buf, '\n');
+				util->diffsize++;
+				continue;
+			case '@':
+				if (parse_hunk_header(line, &old_count,
+						      &new_count, &p))
+					break;
+
+				strbuf_addstr(&buf, "@@");
+				if (current_filename && *p)
+					strbuf_addf(&buf, " %s:",
+						    current_filename);
+				strbuf_addstr(&buf, p);
+				strbuf_addch(&buf, '\n');
+				util->diffsize++;
+				continue;
+			default:
+				if (old_count || new_count)
+					warning(_("diff ended prematurely (-%d/+%d)"),
+						(int)old_count, (int)new_count);
+				break;
+			}
+
+			if (util) {
+				string_list_append(list, buf.buf)->util = util;
+				util = NULL;
+				strbuf_reset(&buf);
+			}
+			state = MBOX_BEFORE_HEADER;
+			goto parse_from_delimiter;
+		}
+	}
+	strbuf_release(&contents);
+
+	if (util) {
+		if (state == MBOX_IN_DIFF)
+			string_list_append(list, buf.buf)->util = util;
+		else
+			free(util);
+	}
+	strbuf_release(&buf);
+	strbuf_release(&long_subject);
+	free(current_filename);
+
+	return 0;
+}
+
 /*
  * Reads the patches into a string list, with the `util` field being populated
  * as struct object_id (will need to be free()d).
@@ -41,6 +349,10 @@ static int read_patches(const char *range, struct string_list *list,
 	ssize_t len;
 	size_t size;
 	int ret = -1;
+	const char *path;
+
+	if (skip_prefix(range, "mbox:", &path))
+		return read_mbox(path, list);
 
 	strvec_pushl(&cp.args, "log", "--no-color", "-p", "--no-merges",
 		     "--reverse", "--date-order", "--decorate=no",
@@ -424,6 +736,19 @@ static void output_pair_header(struct diff_options *diffopt,
 
 		strbuf_addch(buf, ' ');
 		pp_commit_easy(CMIT_FMT_ONELINE, commit, buf);
+	} else {
+		struct patch_util *util = b_util ? b_util : a_util;
+		const char *needle = "\n ## Commit message ##\n";
+		const char *p = !util || !util->patch ?
+			NULL : strstr(util->patch, needle);
+		if (p) {
+			if (status == '!')
+				strbuf_addf(buf, "%s%s", color_reset, color);
+
+			strbuf_addch(buf, ' ');
+			p += strlen(needle);
+			strbuf_add(buf, p, strchrnul(p, '\n') - p);
+		}
 	}
 	strbuf_addf(buf, "%s\n", color_reset);
 
@@ -554,6 +879,9 @@ int show_range_diff(const char *range1, const char *range2,
 	if (range_diff_opts->left_only && range_diff_opts->right_only)
 		res = error(_("options '%s' and '%s' cannot be used together"), "--left-only", "--right-only");
 
+	if (!strcmp(range1, "mbox:-") && !strcmp(range2, "mbox:-"))
+		res = error(_("only one mbox can be read from stdin"));
+
 	if (!res && read_patches(range1, &branch1, range_diff_opts->other_arg))
 		res = error(_("could not parse log for '%s'"), range1);
 	if (!res && read_patches(range2, &branch2, range_diff_opts->other_arg))
@@ -575,10 +903,18 @@ int show_range_diff(const char *range1, const char *range2,
 int is_range_diff_range(const char *arg)
 {
 	char *copy = xstrdup(arg); /* setup_revisions() modifies it */
-	const char *argv[] = { "", copy, "--", NULL };
+	const char *argv[] = { "", copy, "--", NULL }, *path;
 	int i, positive = 0, negative = 0;
 	struct rev_info revs;
 
+	if (skip_prefix(arg, "mbox:", &path)) {
+		free(copy);
+		if (!strcmp(path, "-") || file_exists(path))
+			return 1;
+		error_errno(_("not an mbox: '%s'"), path);
+		return 0;
+	}
+
 	init_revisions(&revs, NULL);
 	if (setup_revisions(3, argv, &revs, NULL) == 1) {
 		for (i = 0; i < revs.pending.nr; i++)
diff --git a/t/t3206-range-diff.sh b/t/t3206-range-diff.sh
index 459beaf7d9c..89ef9a5ffc4 100755
--- a/t/t3206-range-diff.sh
+++ b/t/t3206-range-diff.sh
@@ -783,6 +783,33 @@ test_expect_success 'ranges with pathspecs' '
 	! grep "$topic_oid" actual
 '
 
+test_expect_success 'compare range vs mbox' '
+	git format-patch --stdout topic..mode-only-change >mbox &&
+	git range-diff topic...mode-only-change >expect &&
+
+	git range-diff mode-only-change..topic mbox:./mbox >actual &&
+	test_cmp expect actual &&
+
+	sed -e "/^From: .*/{
+		h
+		s/.*/From: Bugs Bunny <bugs@bun.ny>/
+		:1
+		N
+		/[ -z]$/b1
+		G
+	}" <mbox >mbox.from &&
+	git range-diff mode-only-change..topic mbox:./mbox.from >actual.from &&
+	test_cmp expect actual.from &&
+
+	append_cr <mbox >mbox.cr &&
+	test_must_fail git range-diff \
+		mode-only-change..topic mbox:./mbox.cr 2>err &&
+	grep CR/LF err &&
+
+	git range-diff mode-only-change..topic mbox:- <mbox >actual.stdin &&
+	test_cmp expect actual.stdin
+'
+
 test_expect_success 'submodule changes are shown irrespective of diff.submodule' '
 	git init sub-repo &&
 	test_commit -C sub-repo sub-first &&

base-commit: b75747829f4c277323c78b1c5973ad63ea038a2d
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v2] range-diff: support reading mbox files
  2022-11-22  7:40     ` Johannes Schindelin
@ 2022-11-22 14:20       ` Phillip Wood
  0 siblings, 0 replies; 16+ messages in thread
From: Phillip Wood @ 2022-11-22 14:20 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Johannes Schindelin via GitGitGadget, git, Taylor Blau,
	Phillip Wood, Ævar Arnfjörð Bjarmason

Hi Dscho

On 22/11/2022 07:40, Johannes Schindelin wrote:
> [...]
>> I'm a bit confused by this sed command, I've annotated it with my probably
>> flawed understanding.
>>
>>>        ++	sed -e "/^From: .*/{
>>>        ++		h
>>
>> This stores the From: header in the hold space
> 
> 👍
> 
>>>        ++		s/.*/From: Bugs Bunny <bugs@bun.ny>/
>>
>> This changes the From: header in the pattern space
> 
> 👍
> 
>>>        ++		:1
>>>        ++		N
>>>        ++		/[ -z]$/b1
>>
>> We loop until we find a line that does not end with a space, letter or number
>> adding the lines to the hold space
> 
> I would have _loved_ to match on an empty line, i.e. `/[^\n]$/b1`. But
> that construct is not understood by the `sed` on macOS.

I don't think it matters much but can't you match an empty line with 
/^$/ ? Then you can loop on non-empty lines with /^$/!b1

> I even went so far as to search for the source code of a BSD `sed` (and I
> found it, and modified the code so that it builds on Linux, see
> https://github.com/dscho/sed-bsd) to try a couple of things, but could not
> make it work with any variation of `\n`. Therefore, I settled on expecting
> all the lines in the commit header to end in printable ASCII characters
> between the space and the `z`.
> 
>>>        ++		G
>>
>> This appends the hold space to the pattern space, then the pattern space is
>> printed.
> 
> 👍
> 
>> Doesn't this mean we end up with two From: headers? Is the in-body From:
>> line already present?
> 
> There is no in-body `From:` header because the patch author matches the
> `GIT_AUTHOR_IDENT` that is in effect while running the `format-patch`
> command.
> 
> Let me show you what this `sed` call deals with. In the local test run, it
> modified an `mbox` starting with this:

Thanks for taking the trouble to show the mbox files, I didn't have time 
to run the tests my self yesterday. The processed mbox file looks good.

> 	From 4d39cb329d3ef4c8e69b43859c2e11adb83f8613 Mon Sep 17 00:00:00 2001
> 	From: Thomas Rast <trast@inf.ethz.ch>
> 	Date: Mon, 22 Jul 2013 11:23:44 +0200
> 	Subject: [PATCH 1/3] s/4/A/ + add other-file
> 
> 	---
> 	  file       | 2 +-
> 	  other-file | 0
> 	  [...]
> 
> to a modified `mbox` that starts with this:
> 
> 	From 4d39cb329d3ef4c8e69b43859c2e11adb83f8613 Mon Sep 17 00:00:00 2001
> 	From: Bugs Bunny <bugs@bun.ny>
> 	Date: Mon, 22 Jul 2013 11:23:44 +0200
> 	Subject: [PATCH 1/3] s/4/A/ + add other-file
> 
> 	From: Thomas Rast <trast@inf.ethz.ch>
> 	---
> 	  file       | 2 +-
> 	  other-file | 0
> 	  [...]
> 
>>
>>>        ++	}" <mbox >mbox.from &&
>>>        ++	git range-diff mode-only-change..topic mbox:./mbox.from
>>>        >actual.from &&
>>>        ++	test_cmp expect actual.from &&
>>>        ++
>>>        ++	append_cr <mbox >mbox.cr &&
>>>        ++	test_must_fail git range-diff \
>>>        ++		mode-only-change..topic mbox:./mbox.cr 2>err &&
>>>        ++	grep CR/LF err &&
>>
>> Thanks for adding that
> 
> Thank you for the suggestion to add it!
> 
> And thank you again for modeling how to perform actionable, helpful and
> productive code reviews,

Thanks for your kind words, I glad you found my comments helpful

Best Wishes

Phillip

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3] range-diff: support reading mbox files
  2022-11-22  9:08   ` [PATCH v3] " Johannes Schindelin via GitGitGadget
@ 2022-11-22 14:23     ` Phillip Wood
  2022-11-22 23:58     ` Junio C Hamano
  2023-03-03 22:02     ` Junio C Hamano
  2 siblings, 0 replies; 16+ messages in thread
From: Phillip Wood @ 2022-11-22 14:23 UTC (permalink / raw)
  To: Johannes Schindelin via GitGitGadget, git
  Cc: Taylor Blau, Phillip Wood, Ævar Arnfjörð Bjarmason,
	Johannes Schindelin

Hi Dscho

On 22/11/2022 09:08, Johannes Schindelin via GitGitGadget wrote:
 > [...]
>      Changes since v2:
>      
>       * Fixed two leaks
>       * Avoided dropping a patch just because it does not end in a "-- " line
>       * Empty lines in the middle of a diff are now treated as context lines
>         (newer GNU diff versions generate those)

Oh, well spotted, I should have thought of that myself.

>       * We now warn when a parsed diff is incomplete (i.e. when the ranges in
>         the hunk header do not line up with the number of parsed lines), but
>         still follow Postel's Law i.e. being kind and accepting.

This version looks good to me, thanks for working on it.

Best Wishes

Phillip

>      Changes since v1:
>      
>       * We no longer leak allocated memory in the struct patch instance
>       * Made strtost() a bit more stringent
>       * Postel [https://en.wikipedia.org/wiki/Postel%27s_Law]ized the mbox
>         parser substantially, together with a couple more cosmetic fixes,
>         based on Phillip Wood's excellent review of v1.
>       * Extended the test case to cover mboxes containing CR/LF and in-body
>         From: lines
> 
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1420%2Fdscho%2Frange-diff-from-mbox-v3
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1420/dscho/range-diff-from-mbox-v3
> Pull-Request: https://github.com/gitgitgadget/git/pull/1420
> 
> Range-diff vs v2:
> 
>   1:  485249ddfb3 ! 1:  ac4d8704068 range-diff: support reading mbox files
>       @@ range-diff.c: struct patch_util {
>        +
>        +		len = find_next_line(line, size);
>        +
>       -+		if (state == MBOX_BEFORE_HEADER ||
>       -+		    (state == MBOX_IN_DIFF && line[0] == 'F')) {
>       ++		if (state == MBOX_BEFORE_HEADER) {
>       ++parse_from_delimiter:
>        +			if (!skip_prefix(line, "From ", &p))
>        +				continue;
>        +
>       ++			if (util)
>       ++				BUG("util already allocated");
>        +			util = xcalloc(1, sizeof(*util));
>        +			if (get_oid_hex(p, &util->oid) < 0)
>        +				oidcpy(&util->oid, null_oid());
>       @@ range-diff.c: struct patch_util {
>        +			}
>        +		} else if (state == MBOX_IN_DIFF) {
>        +			switch (line[0]) {
>       -+			case '\0':
>       -+				continue; /* ignore empty lines after diff */
>       ++			case '\0': /* newer GNU diff, an empty context line */
>        +			case '+':
>        +			case '-':
>        +			case ' ':
>       @@ range-diff.c: struct patch_util {
>        +				strbuf_addch(&buf, '\n');
>        +				util->diffsize++;
>        +				continue;
>       ++			default:
>       ++				if (old_count || new_count)
>       ++					warning(_("diff ended prematurely (-%d/+%d)"),
>       ++						(int)old_count, (int)new_count);
>       ++				break;
>        +			}
>        +
>        +			if (util) {
>        +				string_list_append(list, buf.buf)->util = util;
>       ++				util = NULL;
>        +				strbuf_reset(&buf);
>        +			}
>       -+			util = xcalloc(1, sizeof(*util));
>       -+			oidcpy(&util->oid, null_oid());
>       -+			util->matching = -1;
>       -+			author = subject = NULL;
>        +			state = MBOX_BEFORE_HEADER;
>       ++			goto parse_from_delimiter;
>        +		}
>        +	}
>        +	strbuf_release(&contents);
>       @@ range-diff.c: int show_range_diff(const char *range1, const char *range2,
>         	struct rev_info revs;
>         
>        +	if (skip_prefix(arg, "mbox:", &path)) {
>       ++		free(copy);
>        +		if (!strcmp(path, "-") || file_exists(path))
>        +			return 1;
>        +		error_errno(_("not an mbox: '%s'"), path);
> 
> 
>   Documentation/git-range-diff.txt |   3 +-
>   range-diff.c                     | 338 ++++++++++++++++++++++++++++++-
>   t/t3206-range-diff.sh            |  27 +++
>   3 files changed, 366 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/git-range-diff.txt b/Documentation/git-range-diff.txt
> index 0b393715d70..e2c4661acde 100644
> --- a/Documentation/git-range-diff.txt
> +++ b/Documentation/git-range-diff.txt
> @@ -37,7 +37,8 @@ There are three ways to specify the commit ranges:
>   
>   - `<range1> <range2>`: Either commit range can be of the form
>     `<base>..<rev>`, `<rev>^!` or `<rev>^-<n>`. See `SPECIFYING RANGES`
> -  in linkgit:gitrevisions[7] for more details.
> +  in linkgit:gitrevisions[7] for more details. Alternatively, the
> +  patches can be provided as an mbox-formatted file via `mbox:<path>`.
>   
>   - `<rev1>...<rev2>`. This is equivalent to
>     `<rev2>..<rev1> <rev1>..<rev2>`.
> diff --git a/range-diff.c b/range-diff.c
> index 124dd678c38..c391ffa603b 100644
> --- a/range-diff.c
> +++ b/range-diff.c
> @@ -12,6 +12,7 @@
>   #include "userdiff.h"
>   #include "apply.h"
>   #include "revision.h"
> +#include "dir.h"
>   
>   struct patch_util {
>   	/* For the search for an exact match */
> @@ -26,6 +27,313 @@ struct patch_util {
>   	struct object_id oid;
>   };
>   
> +static inline int strtost(char const *s, size_t *result, const char **end)
> +{
> +	unsigned long u;
> +	char *p;
> +
> +	errno = 0;
> +	/* negative values would be accepted by strtoul */
> +	if (!isdigit(*s))
> +		return -1;
> +	u = strtoul(s, &p, 10);
> +	if (errno || p == s)
> +		return -1;
> +	if (result)
> +		*result = u;
> +	*end = p;
> +
> +	return 0;
> +}
> +
> +static int parse_hunk_header(const char *p,
> +			     size_t *old_count, size_t *new_count,
> +			     const char **end)
> +{
> +	size_t o = 1, n = 1;
> +
> +	if (!skip_prefix(p, "@@ -", &p) ||
> +	    strtost(p, NULL, &p) ||
> +	    /* The range is -<start>[,<count>], defaulting to count = 1 */
> +	    !(*p == ' ' || (*p == ',' && !strtost(p + 1, &o, &p))) ||
> +	    !skip_prefix(p, " +", &p) ||
> +	    strtost(p, NULL, &p) ||
> +	    /* The range is +<start>[,<count>], defaulting to count = 1 */
> +	    !(*p == ' ' || (*p == ',' && !strtost(p + 1, &n, &p))) ||
> +	    !skip_prefix(p, " @@", &p))
> +		return -1;
> +
> +	*old_count = o;
> +	*new_count = n;
> +	*end = p;
> +
> +	return 0;
> +}
> +
> +/*
> + * This function finds the end of the line, replaces the newline character with
> + * a NUL, and returns the offset of the start of the next line.
> + *
> + * If no newline character was found, it returns the offset of the trailing NUL
> + * instead.
> + */
> +static inline int find_next_line(const char *line, size_t size)
> +{
> +	char *eol;
> +
> +	eol = memchr(line, '\n', size);
> +	if (!eol)
> +		return size;
> +
> +	*eol = '\0';
> +
> +	return eol + 1 - line;
> +}
> +
> +static int read_mbox(const char *path, struct string_list *list)
> +{
> +	struct strbuf buf = STRBUF_INIT, contents = STRBUF_INIT;
> +	struct strbuf long_subject = STRBUF_INIT;
> +	struct patch_util *util = NULL;
> +	enum {
> +		MBOX_BEFORE_HEADER,
> +		MBOX_IN_HEADER,
> +		MBOX_IN_COMMIT_MESSAGE,
> +		MBOX_AFTER_TRIPLE_DASH,
> +		MBOX_IN_DIFF
> +	} state = MBOX_BEFORE_HEADER;
> +	char *line, *current_filename = NULL;
> +	int len;
> +	size_t size, old_count = 0, new_count = 0;
> +	const char *author = NULL, *subject = NULL;
> +
> +	if (!strcmp(path, "-")) {
> +		if (strbuf_read(&contents, STDIN_FILENO, 0) < 0)
> +			return error_errno(_("could not read stdin"));
> +	} else if (strbuf_read_file(&contents, path, 0) < 0)
> +		return error_errno(_("could not read '%s'"), path);
> +
> +	line = contents.buf;
> +	size = contents.len;
> +	for (; size; size -= len, line += len) {
> +		const char *p;
> +
> +		len = find_next_line(line, size);
> +
> +		if (state == MBOX_BEFORE_HEADER) {
> +parse_from_delimiter:
> +			if (!skip_prefix(line, "From ", &p))
> +				continue;
> +
> +			if (util)
> +				BUG("util already allocated");
> +			util = xcalloc(1, sizeof(*util));
> +			if (get_oid_hex(p, &util->oid) < 0)
> +				oidcpy(&util->oid, null_oid());
> +			util->matching = -1;
> +			author = subject = NULL;
> +
> +			state = MBOX_IN_HEADER;
> +			continue;
> +		}
> +
> +		if (starts_with(line, "diff --git ")) {
> +			struct patch patch = { 0 };
> +			struct strbuf root = STRBUF_INIT;
> +			int linenr = 0;
> +			int orig_len;
> +
> +			state = MBOX_IN_DIFF;
> +			old_count = new_count = 0;
> +			strbuf_addch(&buf, '\n');
> +			if (!util->diff_offset)
> +				util->diff_offset = buf.len;
> +
> +			orig_len = len;
> +			/* `find_next_line()`'s replaced the LF with a NUL */
> +			line[len - 1] = '\n';
> +			len = len > 1 && line[len - 2] == '\r' ?
> +				error(_("cannot handle diff headers with "
> +					"CR/LF line endings")) :
> +				parse_git_diff_header(&root, &linenr, 1, line,
> +						      len, size, &patch);
> +			if (len < 0) {
> +				error(_("could not parse git header '%.*s'"),
> +				      orig_len, line);
> +				release_patch(&patch);
> +				free(util);
> +				free(current_filename);
> +				string_list_clear(list, 1);
> +				strbuf_release(&buf);
> +				strbuf_release(&contents);
> +				strbuf_release(&long_subject);
> +				return -1;
> +			}
> +
> +			if (patch.old_name)
> +				skip_prefix(patch.old_name, "a/",
> +					    (const char **)&patch.old_name);
> +			if (patch.new_name)
> +				skip_prefix(patch.new_name, "b/",
> +					    (const char **)&patch.new_name);
> +
> +			strbuf_addstr(&buf, " ## ");
> +			if (patch.is_new)
> +				strbuf_addf(&buf, "%s (new)", patch.new_name);
> +			else if (patch.is_delete)
> +				strbuf_addf(&buf, "%s (deleted)", patch.old_name);
> +			else if (patch.is_rename)
> +				strbuf_addf(&buf, "%s => %s", patch.old_name, patch.new_name);
> +			else
> +				strbuf_addstr(&buf, patch.new_name);
> +
> +			free(current_filename);
> +			if (patch.is_delete)
> +				current_filename = xstrdup(patch.old_name);
> +			else
> +				current_filename = xstrdup(patch.new_name);
> +
> +			if (patch.new_mode && patch.old_mode &&
> +			    patch.old_mode != patch.new_mode)
> +				strbuf_addf(&buf, " (mode change %06o => %06o)",
> +					    patch.old_mode, patch.new_mode);
> +
> +			strbuf_addstr(&buf, " ##\n");
> +			util->diffsize++;
> +			release_patch(&patch);
> +		} else if (state == MBOX_IN_HEADER) {
> +			if (!line[0]) {
> +				state = MBOX_IN_COMMIT_MESSAGE;
> +				/* Look for an in-body From: */
> +				if (skip_prefix(line + 1, "From: ", &p)) {
> +					size -= p - line;
> +					line += p - line;
> +					len = find_next_line(line, size);
> +
> +					while (isspace(*p))
> +						p++;
> +					author = p;
> +				}
> +				strbuf_addstr(&buf, " ## Metadata ##\n");
> +				if (author)
> +					strbuf_addf(&buf, "Author: %s\n", author);
> +				strbuf_addstr(&buf, "\n ## Commit message ##\n");
> +				if (subject)
> +					strbuf_addf(&buf, "    %s\n\n", subject);
> +			} else if (skip_prefix(line, "From: ", &p)) {
> +				while (isspace(*p))
> +					p++;
> +				author = p;
> +			} else if (skip_prefix(line, "Subject: ", &p)) {
> +				const char *q;
> +
> +				while (isspace(*p))
> +					p++;
> +				subject = p;
> +
> +				if (starts_with(p, "[PATCH") &&
> +				    (q = strchr(p, ']'))) {
> +					q++;
> +					while (isspace(*q))
> +						q++;
> +					subject = q;
> +				}
> +
> +				if (len < size && line[len] == ' ') {
> +					/* handle long subject */
> +					strbuf_reset(&long_subject);
> +					strbuf_addstr(&long_subject, subject);
> +					while (len < size && line[len] == ' ') {
> +						line += len;
> +						size -= len;
> +						len = find_next_line(line, size);
> +						strbuf_addstr(&long_subject, line);
> +					}
> +					subject = long_subject.buf;
> +				}
> +			}
> +		} else if (state == MBOX_IN_COMMIT_MESSAGE) {
> +			if (!line[0]) {
> +				strbuf_addch(&buf, '\n');
> +			} else if (strcmp(line, "---")) {
> +				int tabs = 0;
> +
> +				/* simulate tab expansion */
> +				while (line[tabs] == '\t')
> +					tabs++;
> +				strbuf_addf(&buf, "%*s%s\n",
> +					    4 + 8 * tabs, "", line + tabs);
> +			} else {
> +				/*
> +				 * Trim the trailing newline that is added
> +				 * by `format-patch`.
> +				 */
> +				strbuf_trim_trailing_newline(&buf);
> +				state = MBOX_AFTER_TRIPLE_DASH;
> +			}
> +		} else if (state == MBOX_IN_DIFF) {
> +			switch (line[0]) {
> +			case '\0': /* newer GNU diff, an empty context line */
> +			case '+':
> +			case '-':
> +			case ' ':
> +				/* A `-- ` line indicates the end of a diff */
> +				if (!old_count && !new_count)
> +					break;
> +				if (old_count && line[0] != '+')
> +					old_count--;
> +				if (new_count && line[0] != '-')
> +					new_count--;
> +				/* fallthrough */
> +			case '\\':
> +				strbuf_addstr(&buf, line);
> +				strbuf_addch(&buf, '\n');
> +				util->diffsize++;
> +				continue;
> +			case '@':
> +				if (parse_hunk_header(line, &old_count,
> +						      &new_count, &p))
> +					break;
> +
> +				strbuf_addstr(&buf, "@@");
> +				if (current_filename && *p)
> +					strbuf_addf(&buf, " %s:",
> +						    current_filename);
> +				strbuf_addstr(&buf, p);
> +				strbuf_addch(&buf, '\n');
> +				util->diffsize++;
> +				continue;
> +			default:
> +				if (old_count || new_count)
> +					warning(_("diff ended prematurely (-%d/+%d)"),
> +						(int)old_count, (int)new_count);
> +				break;
> +			}
> +
> +			if (util) {
> +				string_list_append(list, buf.buf)->util = util;
> +				util = NULL;
> +				strbuf_reset(&buf);
> +			}
> +			state = MBOX_BEFORE_HEADER;
> +			goto parse_from_delimiter;
> +		}
> +	}
> +	strbuf_release(&contents);
> +
> +	if (util) {
> +		if (state == MBOX_IN_DIFF)
> +			string_list_append(list, buf.buf)->util = util;
> +		else
> +			free(util);
> +	}
> +	strbuf_release(&buf);
> +	strbuf_release(&long_subject);
> +	free(current_filename);
> +
> +	return 0;
> +}
> +
>   /*
>    * Reads the patches into a string list, with the `util` field being populated
>    * as struct object_id (will need to be free()d).
> @@ -41,6 +349,10 @@ static int read_patches(const char *range, struct string_list *list,
>   	ssize_t len;
>   	size_t size;
>   	int ret = -1;
> +	const char *path;
> +
> +	if (skip_prefix(range, "mbox:", &path))
> +		return read_mbox(path, list);
>   
>   	strvec_pushl(&cp.args, "log", "--no-color", "-p", "--no-merges",
>   		     "--reverse", "--date-order", "--decorate=no",
> @@ -424,6 +736,19 @@ static void output_pair_header(struct diff_options *diffopt,
>   
>   		strbuf_addch(buf, ' ');
>   		pp_commit_easy(CMIT_FMT_ONELINE, commit, buf);
> +	} else {
> +		struct patch_util *util = b_util ? b_util : a_util;
> +		const char *needle = "\n ## Commit message ##\n";
> +		const char *p = !util || !util->patch ?
> +			NULL : strstr(util->patch, needle);
> +		if (p) {
> +			if (status == '!')
> +				strbuf_addf(buf, "%s%s", color_reset, color);
> +
> +			strbuf_addch(buf, ' ');
> +			p += strlen(needle);
> +			strbuf_add(buf, p, strchrnul(p, '\n') - p);
> +		}
>   	}
>   	strbuf_addf(buf, "%s\n", color_reset);
>   
> @@ -554,6 +879,9 @@ int show_range_diff(const char *range1, const char *range2,
>   	if (range_diff_opts->left_only && range_diff_opts->right_only)
>   		res = error(_("options '%s' and '%s' cannot be used together"), "--left-only", "--right-only");
>   
> +	if (!strcmp(range1, "mbox:-") && !strcmp(range2, "mbox:-"))
> +		res = error(_("only one mbox can be read from stdin"));
> +
>   	if (!res && read_patches(range1, &branch1, range_diff_opts->other_arg))
>   		res = error(_("could not parse log for '%s'"), range1);
>   	if (!res && read_patches(range2, &branch2, range_diff_opts->other_arg))
> @@ -575,10 +903,18 @@ int show_range_diff(const char *range1, const char *range2,
>   int is_range_diff_range(const char *arg)
>   {
>   	char *copy = xstrdup(arg); /* setup_revisions() modifies it */
> -	const char *argv[] = { "", copy, "--", NULL };
> +	const char *argv[] = { "", copy, "--", NULL }, *path;
>   	int i, positive = 0, negative = 0;
>   	struct rev_info revs;
>   
> +	if (skip_prefix(arg, "mbox:", &path)) {
> +		free(copy);
> +		if (!strcmp(path, "-") || file_exists(path))
> +			return 1;
> +		error_errno(_("not an mbox: '%s'"), path);
> +		return 0;
> +	}
> +
>   	init_revisions(&revs, NULL);
>   	if (setup_revisions(3, argv, &revs, NULL) == 1) {
>   		for (i = 0; i < revs.pending.nr; i++)
> diff --git a/t/t3206-range-diff.sh b/t/t3206-range-diff.sh
> index 459beaf7d9c..89ef9a5ffc4 100755
> --- a/t/t3206-range-diff.sh
> +++ b/t/t3206-range-diff.sh
> @@ -783,6 +783,33 @@ test_expect_success 'ranges with pathspecs' '
>   	! grep "$topic_oid" actual
>   '
>   
> +test_expect_success 'compare range vs mbox' '
> +	git format-patch --stdout topic..mode-only-change >mbox &&
> +	git range-diff topic...mode-only-change >expect &&
> +
> +	git range-diff mode-only-change..topic mbox:./mbox >actual &&
> +	test_cmp expect actual &&
> +
> +	sed -e "/^From: .*/{
> +		h
> +		s/.*/From: Bugs Bunny <bugs@bun.ny>/
> +		:1
> +		N
> +		/[ -z]$/b1
> +		G
> +	}" <mbox >mbox.from &&
> +	git range-diff mode-only-change..topic mbox:./mbox.from >actual.from &&
> +	test_cmp expect actual.from &&
> +
> +	append_cr <mbox >mbox.cr &&
> +	test_must_fail git range-diff \
> +		mode-only-change..topic mbox:./mbox.cr 2>err &&
> +	grep CR/LF err &&
> +
> +	git range-diff mode-only-change..topic mbox:- <mbox >actual.stdin &&
> +	test_cmp expect actual.stdin
> +'
> +
>   test_expect_success 'submodule changes are shown irrespective of diff.submodule' '
>   	git init sub-repo &&
>   	test_commit -C sub-repo sub-first &&
> 
> base-commit: b75747829f4c277323c78b1c5973ad63ea038a2d

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3] range-diff: support reading mbox files
  2022-11-22  9:08   ` [PATCH v3] " Johannes Schindelin via GitGitGadget
  2022-11-22 14:23     ` Phillip Wood
@ 2022-11-22 23:58     ` Junio C Hamano
  2023-03-03 22:02     ` Junio C Hamano
  2 siblings, 0 replies; 16+ messages in thread
From: Junio C Hamano @ 2022-11-22 23:58 UTC (permalink / raw)
  To: Johannes Schindelin via GitGitGadget
  Cc: git, Taylor Blau, Phillip Wood,
	Ævar Arnfjörð Bjarmason, Johannes Schindelin

> +static inline int strtost(char const *s, size_t *result, const char **end)
> +{
> +	unsigned long u;
> +	char *p;
> +
> +	errno = 0;

Minor nit.  If this is to be able to see the error condition from
strtoul(), I think it should be done after the "!isdigit()" test,
immediately before we make strtoul() call, to avoid clearing errno
unnecessarily.

> +	/* negative values would be accepted by strtoul */
> +	if (!isdigit(*s))
> +		return -1;
> +	u = strtoul(s, &p, 10);
> +	if (errno || p == s)
> +		return -1;

> +static int parse_hunk_header(const char *p,
> +			     size_t *old_count, size_t *new_count,
> +			     const char **end)
> +{
> +	size_t o = 1, n = 1;
> +
> +	if (!skip_prefix(p, "@@ -", &p) ||
> +	    strtost(p, NULL, &p) ||
> +	    /* The range is -<start>[,<count>], defaulting to count = 1 */
> +	    !(*p == ' ' || (*p == ',' && !strtost(p + 1, &o, &p))) ||
> +	    !skip_prefix(p, " +", &p) ||
> +	    strtost(p, NULL, &p) ||
> +	    /* The range is +<start>[,<count>], defaulting to count = 1 */
> +	    !(*p == ' ' || (*p == ',' && !strtost(p + 1, &n, &p))) ||
> +	    !skip_prefix(p, " @@", &p))
> +		return -1;
> +
> +	*old_count = o;
> +	*new_count = n;
> +	*end = p;

We care only about how long the hunk is, and do not care exactly
where in the preimage it sits.  The code looks correct, but for such
a specialized "parser", the name of the function gives a false
impression that it does a lot more.  Finding it only slightly
disturbing.

> + * This function finds the end of the line, replaces the newline character with
> + * a NUL, and returns the offset of the start of the next line.
> + *
> + * If no newline character was found, it returns the offset of the trailing NUL
> + * instead.

Pretty bog-standard "read each line" helper.  I suspect we may want
to consolidate multiple copies we may have elsewhere after the dust
settles.  Looking good.

> +static inline int find_next_line(const char *line, size_t size)
> +{
> +	char *eol;
> +
> +	eol = memchr(line, '\n', size);
> +	if (!eol)
> +		return size;
> +
> +	*eol = '\0';
> +
> +	return eol + 1 - line;
> +}




> +static int read_mbox(const char *path, struct string_list *list)
> +{
> +	struct strbuf buf = STRBUF_INIT, contents = STRBUF_INIT;
> +	struct strbuf long_subject = STRBUF_INIT;
> +	struct patch_util *util = NULL;
> +	enum {
> +		MBOX_BEFORE_HEADER,
> +		MBOX_IN_HEADER,
> +		MBOX_IN_COMMIT_MESSAGE,
> +		MBOX_AFTER_TRIPLE_DASH,
> +		MBOX_IN_DIFF
> +	} state = MBOX_BEFORE_HEADER;
> +	char *line, *current_filename = NULL;
> +	int len;
> +	size_t size, old_count = 0, new_count = 0;
> +	const char *author = NULL, *subject = NULL;
> +
> +	if (!strcmp(path, "-")) {
> +		if (strbuf_read(&contents, STDIN_FILENO, 0) < 0)
> +			return error_errno(_("could not read stdin"));
> +	} else if (strbuf_read_file(&contents, path, 0) < 0)
> +		return error_errno(_("could not read '%s'"), path);
> +
> +	line = contents.buf;
> +	size = contents.len;
> +	for (; size; size -= len, line += len) {
> +		const char *p;
> +
> +		len = find_next_line(line, size);
> +
> +		if (state == MBOX_BEFORE_HEADER) {
> +parse_from_delimiter:
> +			if (!skip_prefix(line, "From ", &p))
> +				continue;
> +
> +			if (util)
> +				BUG("util already allocated");

OK.  The only transition that brings us into _BEFORE_HEADER state
is from _IN_DIFF and we consume and clear the current util there
before the transition happens, so this BUG() will trigger only when
there is some programming error, not any data errors.

Good.

> +			util = xcalloc(1, sizeof(*util));
> +			if (get_oid_hex(p, &util->oid) < 0)
> +				oidcpy(&util->oid, null_oid());
> +			util->matching = -1;
> +			author = subject = NULL;
> +
> +			state = MBOX_IN_HEADER;
> +			continue;
> +		}
> +
> +		if (starts_with(line, "diff --git ")) {
> +			struct patch patch = { 0 };
> +			struct strbuf root = STRBUF_INIT;
> +			int linenr = 0;
> +			int orig_len;
> +
> +			state = MBOX_IN_DIFF;
> +			old_count = new_count = 0;
> +			strbuf_addch(&buf, '\n');
> +			if (!util->diff_offset)
> +				util->diff_offset = buf.len;
> +
> +			orig_len = len;
> +			/* `find_next_line()`'s replaced the LF with a NUL */
> +			line[len - 1] = '\n';

Does this work correctly when the input ended with an incomplete
line that lacked the final LF?  find_next_line() would have given
the size of the remaining input, and the byte at line[len-1] is the
last byte on the incomplete line that is not LF.

> +			len = len > 1 && line[len - 2] == '\r' ?
> +				error(_("cannot handle diff headers with "
> +					"CR/LF line endings")) :
> +				parse_git_diff_header(&root, &linenr, 1, line,
> +						      len, size, &patch);

Cute (in that it tries to use a single "len < 0" for all error
conditions) but moderately hard to follow.

> +			if (len < 0) {
> +				error(_("could not parse git header '%.*s'"),
> +				      orig_len, line);
> +				release_patch(&patch);
> +				free(util);
> +				free(current_filename);
> +				string_list_clear(list, 1);
> +				strbuf_release(&buf);
> +				strbuf_release(&contents);
> +				strbuf_release(&long_subject);
> +				return -1;
> +			}

OK.

> +			if (patch.old_name)
> +				skip_prefix(patch.old_name, "a/",
> +					    (const char **)&patch.old_name);
> +			if (patch.new_name)
> +				skip_prefix(patch.new_name, "b/",
> +					    (const char **)&patch.new_name);

Do we only accept "-p1" patches?  From time to time we seem to hear
on this list from folks in communities that do not use -p1 (aka a/
and b/) convention.

> +		} else if (state == MBOX_IN_HEADER) {
> +			if (!line[0]) {

OK.  After seeing a block of header lines, the first empty line
signals the end of the headers and we transition into a new state.

> +				state = MBOX_IN_COMMIT_MESSAGE;

As an in-body "From:" can have have another blank line or some
in-body header other than "From: " before it at the beginning of an
e-mail body, I do not think this is a good code structure.  I would
have expected that the first blank line would transition us into a
new state (in-commit-messages state) without doing anything else and
in that state:

 - Leading blank lines are skipped, and we will stay in the same state.

 - From:, Subject:, Date:, etc. are interpreted as in-body headers,
   and we will stay in the same state, expecting more in-body
   headers,

 - Everything else will bring us into "we are now really reading the
   log" state (do not lose that line that made us transition into
   the new state---that line is the first line of the body).

would happen.

> +		} else if (state == MBOX_IN_COMMIT_MESSAGE) {
> +			if (!line[0]) {
> +				strbuf_addch(&buf, '\n');
> +			} else if (strcmp(line, "---")) {
> +				int tabs = 0;
> +
> +				/* simulate tab expansion */
> +				while (line[tabs] == '\t')
> +					tabs++;
> +				strbuf_addf(&buf, "%*s%s\n",
> +					    4 + 8 * tabs, "", line + tabs);

I am not sure if special casing the empty line above is correct.  I
am assuming that you are pretending as if you read an "git show -s"
output after applying that patch, but "git log" gives a 4-space
indent even for an empty line, I think.  A quick sanity check

    $ git show -s | cat -e

on the patch I am responding to tells me that it is the case.

I'll stop here, as I see Phillip saw the reading of "diff --git"
output well enough to notice that diff.suppressBlankEmpty is handled
correctly and I'd trust his review.

Thanks.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3] range-diff: support reading mbox files
  2022-11-22  9:08   ` [PATCH v3] " Johannes Schindelin via GitGitGadget
  2022-11-22 14:23     ` Phillip Wood
  2022-11-22 23:58     ` Junio C Hamano
@ 2023-03-03 22:02     ` Junio C Hamano
  2 siblings, 0 replies; 16+ messages in thread
From: Junio C Hamano @ 2023-03-03 22:02 UTC (permalink / raw)
  To: Johannes Schindelin via GitGitGadget
  Cc: git, Taylor Blau, Phillip Wood,
	Ævar Arnfjörð Bjarmason, Johannes Schindelin

Another thing I forgot to mention.

> Let's offer a way to read those patches from pre-prepared MBox files
> instead when an argument "mbox:<filename>" is passed instead of a commit
> range.
> ...
> +	const char *path;
> +
> +	if (skip_prefix(range, "mbox:", &path))
> +		return read_mbox(path, list);

Shouldn't this take the prefix into account, similar to how option
of OPTION_FILENAME type does using parse-options.c::fix_filename()?

A test that starts "range-diff" in a subdirectory and refer to a
mbox file elsewhere may be a good way to prevent such a bug from
happening and regression after the change hits a release.


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2023-03-03 22:24 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-15 18:20 [PATCH] range-diff: support reading mbox files Johannes Schindelin via GitGitGadget
2022-11-16  2:09 ` Taylor Blau
2022-11-16 14:40 ` Phillip Wood
2022-11-17 10:26   ` Phillip Wood
2022-11-18 12:53     ` Johannes Schindelin
2022-11-18 13:16   ` Johannes Schindelin
2022-11-17 18:24 ` Ævar Arnfjörð Bjarmason
2022-11-18 11:39   ` Johannes Schindelin
2022-11-19 23:11 ` [PATCH v2] " Johannes Schindelin via GitGitGadget
2022-11-21 10:08   ` Phillip Wood
2022-11-22  7:40     ` Johannes Schindelin
2022-11-22 14:20       ` Phillip Wood
2022-11-22  9:08   ` [PATCH v3] " Johannes Schindelin via GitGitGadget
2022-11-22 14:23     ` Phillip Wood
2022-11-22 23:58     ` Junio C Hamano
2023-03-03 22:02     ` Junio C Hamano

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).