Re: [PATCH] range-diff: support reading mbox files

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

From: Phillip Wood <phillip.wood123@gmail.com>
To: Johannes Schindelin via GitGitGadget <gitgitgadget@gmail.com>,
	git@vger.kernel.org
Cc: Johannes Schindelin <johannes.schindelin@gmx.de>,
	Taylor Blau <me@ttaylorr.com>
Subject: Re: [PATCH] range-diff: support reading mbox files
Date: Wed, 16 Nov 2022 14:40:49 +0000	[thread overview]
Message-ID: <dfe0190c-1d2e-804a-5312-877b7b2f5822@dunelm.org.uk> (raw)
In-Reply-To: <pull.1420.git.1668536405563.gitgitgadget@gmail.com>

Hi Dscho

On 15/11/2022 18:20, Johannes Schindelin via GitGitGadget wrote:
> From: Johannes Schindelin <johannes.schindelin@gmx.de>
> 
> Internally, the `git range-diff` command spawns a `git log` process and
> parses its output for the given commit ranges.
> 
> This works well when the patches that need to be compared are present in
> the local repository in the form of commits.
> 
> In scenarios where that is not the case, the `range-diff` command is
> currently less helpful.
> 
> The Git mailing list is such a scenario: Instead of using Git to
> exchange commits, the patches are sent there as plain-text and no commit
> range can be specified to let `range-diff` consume those patches.
> 
> Instead, the expectation is to download the mails, apply them locally
> and then use `range-diff`. This can be quite cumbersome e.g. when a
> suitable base revision has to be found first where the patch applies
> cleanly.

That's a good motivation for this change.

> Let's offer a way to read those patches from pre-prepared MBox files
> instead when an argument "mbox:<filename>" is passed instead of a commit
> range.
> 
> For extra convenience, interpret the filename `-` as standard input.
> This makes it easy to compare contributions on the mailing list with the
> actual commits that were integrated into Git's main branch. Example:
> 
> 	commit=5c4003ca3f0e9ac6d3c8aa3e387ff843bd440411
> 	mid=bdfa3845b81531863941e6a97c28eb1afa62dd2c.1489435755.git.johannes.schindelin@gmx.de
> 	curl -s https://lore.kernel.org/git/$mid/raw |
> 	git range-diff mbox:- $commit^!
> 
> This addresses https://github.com/gitgitgadget/git/issues/207
> 
> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> ---
>      range-diff: support reading mbox files
>      
>      I frequently find myself wanting to look at the range-diff between some
>      local commits and the patches on the Git mailing list, but unwilling to
>      go through the process of finding an appropriate base revision to apply
>      the patches onto (just to throw the generated patches away afterwards,
>      anyway).
>      
>      So I came up with this patch. May it be helpful to other developers,
>      too.
>      
>      This patch contains a home-rolled mbox parser. Initially, I wrote a
>      really basic parser and it worked well enough, but, you know, as things
>      go it became more complex than that in order to provide actually useful
>      range-diffs for existing commits and their corresponding mails (because
>      of in-body From: headers, because of -- trailers and long subjects, just
>      to name a few reasons). In hindsight, it might have made sense to try to
>      to reuse the parser that is available in mailinfo.c, which I had
>      initially dismissed as overly complex and unnecessary for this use case.
>      If anyone feels up to it, I would invite them to adjust this code to
>      replace the mbox parser with one based on the mailinfo.c. Incrementally,
>      of course, because the perfect is the enemy of the good.
>
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1420%2Fdscho%2Frange-diff-from-mbox-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1420/dscho/range-diff-from-mbox-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/1420
> 
>   Documentation/git-range-diff.txt |   3 +-
>   range-diff.c                     | 317 ++++++++++++++++++++++++++++++-
>   t/t3206-range-diff.sh            |   9 +
>   3 files changed, 327 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/git-range-diff.txt b/Documentation/git-range-diff.txt
> index 0b393715d70..e2c4661acde 100644
> --- a/Documentation/git-range-diff.txt
> +++ b/Documentation/git-range-diff.txt
> @@ -37,7 +37,8 @@ There are three ways to specify the commit ranges:
>   
>   - `<range1> <range2>`: Either commit range can be of the form
>     `<base>..<rev>`, `<rev>^!` or `<rev>^-<n>`. See `SPECIFYING RANGES`
> -  in linkgit:gitrevisions[7] for more details.
> +  in linkgit:gitrevisions[7] for more details. Alternatively, the
> +  patches can be provided as an mbox-formatted file via `mbox:<path>`.
>   
>   - `<rev1>...<rev2>`. This is equivalent to
>     `<rev2>..<rev1> <rev1>..<rev2>`.
> diff --git a/range-diff.c b/range-diff.c
> index 124dd678c38..7c84cdbeffa 100644
> --- a/range-diff.c
> +++ b/range-diff.c
> @@ -12,6 +12,7 @@
>   #include "userdiff.h"
>   #include "apply.h"
>   #include "revision.h"
> +#include "dir.h"
>   
>   struct patch_util {
>   	/* For the search for an exact match */
> @@ -26,6 +27,293 @@ struct patch_util {
>   	struct object_id oid;
>   };
>   
> +static inline int strtost(char const *s, size_t *result, const char **end)
> +{
> +	unsigned long u;
> +	char *p;
> +
> +	errno = 0;
> +	/* negative values would be accepted by strtoul */
> +	if (*s == '-')
> +		return -1;

I think it is right to treat the input as untrusted and so look for 
malformed hunk headers. However This test is not sufficient for that, we 
expect a digit so I think

	if (!isdigit(*s))
		return -1;

would be safer. The use of strtoul() looks good as we set errno to zero 
before the call and check both errno and endp afterwards.

> +	u = strtoul(s, &p, 10);
> +	if (errno || p == s)
> +		return -1;
> +	if (result)
> +		*result = u;
> +	*end = p;
> +
> +	return 0;
> +}
> +
> +static int parse_hunk_header(const char *p,
> +			     size_t *old_count, size_t *new_count,
> +			     const char **end)
> +{
> +	size_t o = 1, n = 1;
> +
> +	if (!skip_prefix(p, "@@ -", &p) ||
> +	    strtost(p, NULL, &p) ||
> +	    (*p != ' ' && (*p != ',' || strtost(p + 1, &o, &p))) ||

It took me a minute to understand the double negatives but it is 
correctly checking if we have ' ' or ',<digits>'

> +	    !skip_prefix(p, " +", &p) ||
> +	    strtost(p, NULL, &p) ||
> +	    (*p != ' ' && (*p != ',' || strtost(p + 1, &n, &p))) ||
> +	    !skip_prefix(p, " @@", &p))
> +		return -1;
> +
> +	*old_count = o;
> +	*new_count = n;
> +	*end = p;
> +
> +	return 0;
> +}
> +
> +static inline int find_eol(const char *line, size_t size)
> +{
> +	char *eol;
> +
> +	eol = memchr(line, '\n', size);
> +	if (!eol)
> +		return size;
> +
> +	if (eol != line && eol[-1] == '\r')
> +		eol[-1] = '\0';
> +	else
> +		*eol = '\0';
> +
> +	return eol + 1 - line;

We return the offset to the start of the next line, not the length of 
the line. This will be important later.

> +}
> +
> +static int read_mbox(const char *path, struct string_list *list)
> +{
> +	struct strbuf buf = STRBUF_INIT, contents = STRBUF_INIT;
> +	struct strbuf long_subject = STRBUF_INIT;
> +	struct patch_util *util = NULL;
> +	enum {
> +		MBOX_BEFORE_HEADER,
> +		MBOX_IN_HEADER,
> +		MBOX_IN_COMMIT_MESSAGE,
> +		MBOX_AFTER_TRIPLE_DASH,
> +		MBOX_IN_DIFF
> +	} state = MBOX_BEFORE_HEADER;
> +	char *line, *current_filename = NULL;
> +	int len;
> +	size_t size, old_count = 0, new_count = 0;
> +	const char *author = NULL, *subject = NULL;
> +
> +	if (!strcmp(path, "-")) {
> +		if (strbuf_read(&contents, STDIN_FILENO, 0) < 0)
> +			return error_errno(_("could not read stdin"));
> +	} else if (strbuf_read_file(&contents, path, 0) < 0)
> +		return error_errno(_("could not read '%s'"), path);
> +
> +	line = contents.buf;
> +	size = contents.len;
> +	for (; size > 0; size -= len, line += len) {

size is unsigned so we're effectively testing 'size != 0' which means if 
we're off by one somewhere we'll have an undetected buffer overflow. 
Using a signed type wouldn't prevent the buffer overflow but it would 
limit its extent.

> +		const char *p;
> +
> +		len = find_eol(line, size);

Here len is not the length of line if it originally ended "\r\n".

> +		if (state == MBOX_BEFORE_HEADER) {
> +			if (!skip_prefix(line, "From ", &p))
> +				continue;
> +
> +			util = xcalloc(1, sizeof(*util));
> +			if (get_oid_hex(p, &util->oid) < 0)
> +				oidcpy(&util->oid, null_oid());
> +			util->matching = -1;
> +			author = subject = NULL;
> +
> +			state = MBOX_IN_HEADER;

I wondered if there should there be a `continue;` here but I think it 
probably needs to "fall-through" to the MBOX_IN_HEADER handling below. A 
comment to clarify that would be helpful.

> +		}
> +
> +		if (starts_with(line, "diff --git ")) {
> +			struct patch patch = { 0 };
> +			struct strbuf root = STRBUF_INIT;
> +			int linenr = 0;
> +			int orig_len;
> +
> +			state = MBOX_IN_DIFF;
> +			old_count = new_count = 0;
> +			strbuf_addch(&buf, '\n');
> +			if (!util->diff_offset)
> +				util->diff_offset = buf.len;
> +			line[len - 1] = '\n';

Here the line will still be NUL terminated if it originally ended "\r\n" 
which presumably messes up the call to parse_git_diff_header() below. I 
have not checked if parse_git_diff_header() can handle "\r\n" when it is 
parsing the rest of the diff header.

> +			orig_len = len;
> +			len = parse_git_diff_header(&root, &linenr, 1, line,
> +						    len, size, &patch);
> +			if (len < 0) {
> +				error(_("could not parse git header '%.*s'"),
> +				      orig_len, line);
> +				free(util);
> +				free(current_filename);
> +				string_list_clear(list, 1);
> +				strbuf_release(&buf);
> +				strbuf_release(&contents);
> +				strbuf_release(&long_subject);
> +				return -1;
> +			}
> +
> +			if (patch.old_name)
> +				skip_prefix(patch.old_name, "a/",
> +					    (const char **)&patch.old_name);
> +			if (patch.new_name)
> +				skip_prefix(patch.new_name, "b/",
> +					    (const char **)&patch.new_name);

I think this is fine for now but we might want to support other prefixes 
in the future. If it is not a copy or rename then the filename can be 
deduced by finding the common tail of patch.old_name and patch.new_name 
and stripping anything before the first '/'. If it is a copy or rename 
then I suspect there is no prefix (though I've not checked)

> +			strbuf_addstr(&buf, " ## ");
> +			if (patch.is_new > 0)

`patch.is_now` and `patch.is_delete` are booleans like `patch.is_rename` 
so we don't need the '> 0'

> +				strbuf_addf(&buf, "%s (new)", patch.new_name);
> +			else if (patch.is_delete > 0)
> +				strbuf_addf(&buf, "%s (deleted)", patch.old_name);
> +			else if (patch.is_rename)
> +				strbuf_addf(&buf, "%s => %s", patch.old_name, patch.new_name);
> +			else
> +				strbuf_addstr(&buf, patch.new_name);
> +
> +			free(current_filename);
> +			if (patch.is_delete > 0)
> +				current_filename = xstrdup(patch.old_name);
> +			else
> +				current_filename = xstrdup(patch.new_name);
> +
> +			if (patch.new_mode && patch.old_mode &&
> +			    patch.old_mode != patch.new_mode)
> +				strbuf_addf(&buf, " (mode change %06o => %06o)",
> +					    patch.old_mode, patch.new_mode);
> +
> +			strbuf_addstr(&buf, " ##\n");
> +			util->diffsize++;
> +		} else if (state == MBOX_IN_HEADER) {
> +			if (!line[0]) {
> +				state = MBOX_IN_COMMIT_MESSAGE;
> +				/* Look for an in-body From: */
> +				if (size > 5 && skip_prefix(line + 1, "From: ", &p)) {

The "size > 5" seems a bit unnecessary as we're using skip_prefix()

> +					size -= p - line;
> +					line += p - line;

This is good, we're accounting for reading the next line.

> +					len = find_eol(line, size);
> +
> +					while (isspace(*p))
> +						p++;
> +					author = p;
> +				}
> +				strbuf_addstr(&buf, " ## Metadata ##\n");
> +				if (author)
> +					strbuf_addf(&buf, "Author: %s\n", author);
> +				strbuf_addstr(&buf, "\n ## Commit message ##\n");
> +				if (subject)
> +					strbuf_addf(&buf, "    %s\n\n", subject);
> +			} else if (skip_prefix(line, "From: ", &p)) {
> +				while (isspace(*p))
> +					p++;
> +				author = p;
> +			} else if (skip_prefix(line, "Subject: ", &p)) {
> +				const char *q;
> +
> +				while (isspace(*p))
> +					p++;
> +				subject = p;
> +
> +				if (starts_with(p, "[PATCH") &&
> +				    (q = strchr(p, ']'))) {
> +					q++;
> +					while (isspace(*q))
> +						q++;
> +					subject = q;
> +				}
> +
> +				if (len < size && line[len] == ' ') {
> +					/* handle long subject */
> +					strbuf_reset(&long_subject);
> +					strbuf_addstr(&long_subject, subject);
> +					while (len < size && line[len] == ' ') {
> +						line += len;
> +						size -= len;
> +						len = find_eol(line, size);
> +						strbuf_addstr(&long_subject, line);

Looks good

> +					}
> +					subject = long_subject.buf;
> +				}
> +			}
> +		} else if (state == MBOX_IN_COMMIT_MESSAGE) {
> +			if (!*line)

Not a big issue elsewhere you've used "!line[0]"
Style: there should be braces on this branch.

> +				strbuf_addch(&buf, '\n');
> +			else if (strcmp(line, "---")) {
> +				int tabs = 0;
> +
> +				/* simulate tab expansion */
> +				while (line[tabs] == '\t')
> +					tabs++;
> +				strbuf_addf(&buf, "%*s%s\n",
> +					    4 + 8 * tabs, "", line + tabs);
> +			} else {
> +				/*
> +				 * Trim the trailing newline that is added
> +				 * by `format-patch`.
> +				 */
> +				strbuf_trim_trailing_newline(&buf);
> +				state = MBOX_AFTER_TRIPLE_DASH;
> +			}
> +		} else if (state == MBOX_IN_DIFF) {
> +			switch (line[0]) {
> +			case '\0':
> +				continue; /* ignore empty lines after diff */
> +			case '+':
> +			case '-':
> +			case ' ':
> +				if (!old_count && !new_count)
> +					break;

This shouldn't happen in a well formed diff. Below we happily accept bad 
counts, is there a reason to reject them here?

> +				if (old_count && line[0] != '+')
> +					old_count--;
> +				if (new_count && line[0] != '-')
> +					new_count--;

The diff is malformed if old_count == 0 and we see '-' or ' ' or 
new_count == 0 and we see '+' or ' '. The code is careful not to 
decrement the count in that case so I think it is harmless to accept 
diffs with bad line counts in the hunk header.

> +				/* fallthrough */
> +			case '\\':
> +				strbuf_addstr(&buf, line);
> +				strbuf_addch(&buf, '\n');
> +				util->diffsize++;

I think this might be a better place to break if old_count and new_count 
are both zero.

> +				continue;
> +			case '@':
> +				if (parse_hunk_header(line, &old_count,
> +						      &new_count, &p))
> +					break;
> +
> +				strbuf_addstr(&buf, "@@");
> +				if (current_filename && *p)
> +					strbuf_addf(&buf, " %s:",
> +						    current_filename);
> +				strbuf_addstr(&buf, p);
> +				strbuf_addch(&buf, '\n');
> +				util->diffsize++;
> +				continue;
> +			}

This is effectively the `default:` clause as it is executed when we 
don't handle the line above. We ignore the contents of this line which 
makes me wonder what happens if it is the start of another diff. Do we 
have tests that alter more than one file in a single commit?

I think this is a useful addition, it could perhaps benefit from more 
tests though. Having tests for bad input, "\r\n" line endings and 
getting the author from a From: header as well as an in-body From: line 
would give a bit more reassurance about how robust the parser is.

Best Wishes

Phillip

next prev parent reply	other threads:[~2022-11-16 14:42 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-11-15 18:20 [PATCH] range-diff: support reading mbox files Johannes Schindelin via GitGitGadget
2022-11-16  2:09 ` Taylor Blau
2022-11-16 14:40 ` Phillip Wood [this message]
2022-11-17 10:26   ` Phillip Wood
2022-11-18 12:53     ` Johannes Schindelin
2022-11-18 13:16   ` Johannes Schindelin
2022-11-17 18:24 ` Ævar Arnfjörð Bjarmason
2022-11-18 11:39   ` Johannes Schindelin
2022-11-19 23:11 ` [PATCH v2] " Johannes Schindelin via GitGitGadget
2022-11-21 10:08   ` Phillip Wood
2022-11-22  7:40     ` Johannes Schindelin
2022-11-22 14:20       ` Phillip Wood
2022-11-22  9:08   ` [PATCH v3] " Johannes Schindelin via GitGitGadget
2022-11-22 14:23     ` Phillip Wood
2022-11-22 23:58     ` Junio C Hamano
2023-03-03 22:02     ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=dfe0190c-1d2e-804a-5312-877b7b2f5822@dunelm.org.uk \
    --to=phillip.wood123@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=johannes.schindelin@gmx.de \
    --cc=me@ttaylorr.com \
    --cc=phillip.wood@dunelm.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).