git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Johannes Schindelin <Johannes.Schindelin@gmx.de>
To: phillip.wood@dunelm.org.uk
Cc: Johannes Schindelin via GitGitGadget <gitgitgadget@gmail.com>,
	git@vger.kernel.org, Taylor Blau <me@ttaylorr.com>
Subject: Re: [PATCH] range-diff: support reading mbox files
Date: Fri, 18 Nov 2022 14:16:32 +0100 (CET)	[thread overview]
Message-ID: <06ro81n5-7sn1-070r-6747-5n77262o822p@tzk.qr> (raw)
In-Reply-To: <dfe0190c-1d2e-804a-5312-877b7b2f5822@dunelm.org.uk>

Hi Phillip,

On Wed, 16 Nov 2022, Phillip Wood wrote:

> On 15/11/2022 18:20, Johannes Schindelin via GitGitGadget wrote:
> > From: Johannes Schindelin <johannes.schindelin@gmx.de>
> >
> > Internally, the `git range-diff` command spawns a `git log` process and
> > parses its output for the given commit ranges.
> >
> > This works well when the patches that need to be compared are present in
> > the local repository in the form of commits.
> >
> > In scenarios where that is not the case, the `range-diff` command is
> > currently less helpful.
> >
> > The Git mailing list is such a scenario: Instead of using Git to
> > exchange commits, the patches are sent there as plain-text and no commit
> > range can be specified to let `range-diff` consume those patches.
> >
> > Instead, the expectation is to download the mails, apply them locally
> > and then use `range-diff`. This can be quite cumbersome e.g. when a
> > suitable base revision has to be found first where the patch applies
> > cleanly.
>
> That's a good motivation for this change.

Side note (potentially fun to know): A couple of weeks ago, I wanted to
work on this because I was asked to review a new patch series iteration on
the Git mailing list (I forgot which one it was, precisely), which came
without a range-diff, and I wanted to download both iterations using a
variation of
https://github.com/git-for-windows/build-extra/blob/HEAD/apply-from-lore.sh
and then use the mbox mode of `range-diff` on them. Unfortunately, I ran
out of time back then (it's a common theme for me these days), and the
patch series (rightfully) advanced to `next` without my review, and I let
this here patch slide.

> > +static inline int strtost(char const *s, size_t *result, const char **end)
> > +{
> > +	unsigned long u;
> > +	char *p;
> > +
> > +	errno = 0;
> > +	/* negative values would be accepted by strtoul */
> > +	if (*s == '-')
> > +		return -1;
>
> I think it is right to treat the input as untrusted and so look for malformed
> hunk headers. However This test is not sufficient for that, we expect a digit
> so I think
>
> 	if (!isdigit(*s))
> 		return -1;
>
> would be safer. The use of strtoul() looks good as we set errno to zero before
> the call and check both errno and endp afterwards.

Good point. As you might have guessed, this function is a copy/edit of
existing code, in this instance `strtoul_ui()`.

I have made the change you suggested.

>
> > +	u = strtoul(s, &p, 10);
> > +	if (errno || p == s)
> > +		return -1;
> > +	if (result)
> > +		*result = u;
> > +	*end = p;
> > +
> > +	return 0;
> > +}
> > +
> > +static int parse_hunk_header(const char *p,
> > +			     size_t *old_count, size_t *new_count,
> > +			     const char **end)
> > +{
> > +	size_t o = 1, n = 1;
> > +
> > +	if (!skip_prefix(p, "@@ -", &p) ||
> > +	    strtost(p, NULL, &p) ||
> > +	    (*p != ' ' && (*p != ',' || strtost(p + 1, &o, &p))) ||
>
> It took me a minute to understand the double negatives but it is correctly
> checking if we have ' ' or ',<digits>'

I actually hesitated when I wrote this, and only kept the code as-is
because I remembered how often Junio uses double negatives and thought it
would be a good joke to do the same here. Joke's obviously on me, though.

I've hence changed it to:

	    /* The range is -<start>[,<count>], defaulting to count = 1 */
	    !(*p == ' ' || (*p == ',' && !strtost(p + 1, &o, &p))) ||

and

	    /* The range is * +<start>[,<count>], * defaulting to count = 1 */
	    !(*p == ' ' || (*p == ',' && !strtost(p + 1, &n, &p))) ||

>
> > +	    !skip_prefix(p, " +", &p) ||
> > +	    strtost(p, NULL, &p) ||
> > +	    (*p != ' ' && (*p != ',' || strtost(p + 1, &n, &p))) ||
> > +	    !skip_prefix(p, " @@", &p))
> > +		return -1;
> > +
> > +	*old_count = o;
> > +	*new_count = n;
> > +	*end = p;
> > +
> > +	return 0;
> > +}
> > +
> > +static inline int find_eol(const char *line, size_t size)
> > +{
> > +	char *eol;
> > +
> > +	eol = memchr(line, '\n', size);
> > +	if (!eol)
> > +		return size;
> > +
> > +	if (eol != line && eol[-1] == '\r')
> > +		eol[-1] = '\0';
> > +	else
> > +		*eol = '\0';
> > +
> > +	return eol + 1 - line;
>
> We return the offset to the start of the next line, not the length of the
> line. This will be important later.

You're right. I changed the name to `find_next_line()` and added an
informative comment on top of the function.

>
> > +}
> > +
> > +static int read_mbox(const char *path, struct string_list *list)
> > +{
> > +	struct strbuf buf = STRBUF_INIT, contents = STRBUF_INIT;
> > +	struct strbuf long_subject = STRBUF_INIT;
> > +	struct patch_util *util = NULL;
> > +	enum {
> > +		MBOX_BEFORE_HEADER,
> > +		MBOX_IN_HEADER,
> > +		MBOX_IN_COMMIT_MESSAGE,
> > +		MBOX_AFTER_TRIPLE_DASH,
> > +		MBOX_IN_DIFF
> > +	} state = MBOX_BEFORE_HEADER;
> > +	char *line, *current_filename = NULL;
> > +	int len;
> > +	size_t size, old_count = 0, new_count = 0;
> > +	const char *author = NULL, *subject = NULL;
> > +
> > +	if (!strcmp(path, "-")) {
> > +		if (strbuf_read(&contents, STDIN_FILENO, 0) < 0)
> > +			return error_errno(_("could not read stdin"));
> > +	} else if (strbuf_read_file(&contents, path, 0) < 0)
> > +		return error_errno(_("could not read '%s'"), path);
> > +
> > +	line = contents.buf;
> > +	size = contents.len;
> > +	for (; size > 0; size -= len, line += len) {
>
> size is unsigned so we're effectively testing 'size != 0' which means if we're
> off by one somewhere we'll have an undetected buffer overflow. Using a signed
> type wouldn't prevent the buffer overflow but it would limit its extent.

`contents.len` is of type `size_t`, though, and I'd like to stay
consistent.

I did remove the misleading `> 0` from the condition, though.

>
> > +		const char *p;
> > +
> > +		len = find_eol(line, size);
>
> Here len is not the length of line if it originally ended "\r\n".
>
> > +		if (state == MBOX_BEFORE_HEADER) {
> > +			if (!skip_prefix(line, "From ", &p))
> > +				continue;
> > +
> > +			util = xcalloc(1, sizeof(*util));
> > +			if (get_oid_hex(p, &util->oid) < 0)
> > +				oidcpy(&util->oid, null_oid());
> > +			util->matching = -1;
> > +			author = subject = NULL;
> > +
> > +			state = MBOX_IN_HEADER;
>
> I wondered if there should there be a `continue;` here but I think it probably
> needs to "fall-through" to the MBOX_IN_HEADER handling below. A comment to
> clarify that would be helpful.

You're absolutely correct, there totally should be a `continue` here: we
saw a `From `, and in the `MBOX_IN_HEADER` we look for either an empty
line, the `From:` header or the `Subject:` header, so we know that the
current line cannot match, no need to fall through.

>
> > +		}
> > +
> > +		if (starts_with(line, "diff --git ")) {
> > +			struct patch patch = { 0 };
> > +			struct strbuf root = STRBUF_INIT;
> > +			int linenr = 0;
> > +			int orig_len;
> > +
> > +			state = MBOX_IN_DIFF;
> > +			old_count = new_count = 0;
> > +			strbuf_addch(&buf, '\n');
> > +			if (!util->diff_offset)
> > +				util->diff_offset = buf.len;
> > +			line[len - 1] = '\n';
>
> Here the line will still be NUL terminated if it originally ended "\r\n" which
> presumably messes up the call to parse_git_diff_header() below. I have not
> checked if parse_git_diff_header() can handle "\r\n" when it is parsing the
> rest of the diff header.

I changed it locally to reinstate the `\r\n`, only to figure out that
almost the entire `apply.c` machinery totally falls over CR/LF line
endings.

I changed the code to detect a Carriage Return and error out if one is
detected.

>
> > +			orig_len = len;
> > +			len = parse_git_diff_header(&root, &linenr, 1, line,
> > +						    len, size, &patch);
> > +			if (len < 0) {
> > +				error(_("could not parse git header '%.*s'"),
> > +				      orig_len, line);
> > +				free(util);
> > +				free(current_filename);
> > +				string_list_clear(list, 1);
> > +				strbuf_release(&buf);
> > +				strbuf_release(&contents);
> > +				strbuf_release(&long_subject);
> > +				return -1;
> > +			}
> > +
> > +			if (patch.old_name)
> > +				skip_prefix(patch.old_name, "a/",
> > +					    (const char **)&patch.old_name);
> > +			if (patch.new_name)
> > +				skip_prefix(patch.new_name, "b/",
> > +					    (const char **)&patch.new_name);
>
> I think this is fine for now but we might want to support other prefixes in
> the future. If it is not a copy or rename then the filename can be deduced by
> finding the common tail of patch.old_name and patch.new_name and stripping
> anything before the first '/'. If it is a copy or rename then I suspect there
> is no prefix (though I've not checked)

Since `skip_prefix()` does not do anything if there is no match, I think
it is sane to err by not stripping anything unless the expected `a/` and
`b/` prefixes are seen.

>
> > +			strbuf_addstr(&buf, " ## ");
> > +			if (patch.is_new > 0)
>
> `patch.is_now` and `patch.is_delete` are booleans like `patch.is_rename` so we
> don't need the '> 0'

Good catch.

>
> > +				strbuf_addf(&buf, "%s (new)", patch.new_name);
> > +			else if (patch.is_delete > 0)
> > +				strbuf_addf(&buf, "%s (deleted)",
> > patch.old_name);
> > +			else if (patch.is_rename)
> > +				strbuf_addf(&buf, "%s => %s", patch.old_name,
> > patch.new_name);
> > +			else
> > +				strbuf_addstr(&buf, patch.new_name);
> > +
> > +			free(current_filename);
> > +			if (patch.is_delete > 0)
> > +				current_filename = xstrdup(patch.old_name);
> > +			else
> > +				current_filename = xstrdup(patch.new_name);
> > +
> > +			if (patch.new_mode && patch.old_mode &&
> > +			    patch.old_mode != patch.new_mode)
> > +				strbuf_addf(&buf, " (mode change %06o =>
> > %06o)",
> > +					    patch.old_mode, patch.new_mode);
> > +
> > +			strbuf_addstr(&buf, " ##\n");
> > +			util->diffsize++;
> > +		} else if (state == MBOX_IN_HEADER) {
> > +			if (!line[0]) {
> > +				state = MBOX_IN_COMMIT_MESSAGE;
> > +				/* Look for an in-body From: */
> > +				if (size > 5 && skip_prefix(line + 1, "From:
> > ", &p)) {
>
> The "size > 5" seems a bit unnecessary as we're using skip_prefix()

Right!

>
> > +					size -= p - line;
> > +					line += p - line;
>
> This is good, we're accounting for reading the next line.
>
> > +					len = find_eol(line, size);
> > +
> > +					while (isspace(*p))
> > +						p++;
> > +					author = p;
> > +				}
> > +				strbuf_addstr(&buf, " ## Metadata ##\n");
> > +				if (author)
> > +					strbuf_addf(&buf, "Author: %s\n",
> > author);
> > +				strbuf_addstr(&buf, "\n ## Commit message
> > ##\n");
> > +				if (subject)
> > +					strbuf_addf(&buf, "    %s\n\n",
> > subject);
> > +			} else if (skip_prefix(line, "From: ", &p)) {
> > +				while (isspace(*p))
> > +					p++;
> > +				author = p;
> > +			} else if (skip_prefix(line, "Subject: ", &p)) {
> > +				const char *q;
> > +
> > +				while (isspace(*p))
> > +					p++;
> > +				subject = p;
> > +
> > +				if (starts_with(p, "[PATCH") &&
> > +				    (q = strchr(p, ']'))) {
> > +					q++;
> > +					while (isspace(*q))
> > +						q++;
> > +					subject = q;
> > +				}
> > +
> > +				if (len < size && line[len] == ' ') {
> > +					/* handle long subject */
> > +					strbuf_reset(&long_subject);
> > +					strbuf_addstr(&long_subject, subject);
> > +					while (len < size && line[len] == ' ')
> > {
> > +						line += len;
> > +						size -= len;
> > +						len = find_eol(line, size);
> > +						strbuf_addstr(&long_subject,
> > line);
>
> Looks good
>
> > +					}
> > +					subject = long_subject.buf;
> > +				}
> > +			}
> > +		} else if (state == MBOX_IN_COMMIT_MESSAGE) {
> > +			if (!*line)
>
> Not a big issue elsewhere you've used "!line[0]"
> Style: there should be braces on this branch.

Fixed on both accounts.

> [... addressed in follow-up mail...]
>
> I think this is a useful addition, it could perhaps benefit from more tests
> though. Having tests for bad input, "\r\n" line endings and getting the author
> from a From: header as well as an in-body From: line would give a bit more
> reassurance about how robust the parser is.

Good point, I've augmented the test case a bit more.

Thank you again. It is a real pleasure to receive these constructive
reviews from you that pay so much attention to detail, and always lead to
a clear path forward.

Thanks!
Dscho

  parent reply	other threads:[~2022-11-18 13:17 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-11-15 18:20 [PATCH] range-diff: support reading mbox files Johannes Schindelin via GitGitGadget
2022-11-16  2:09 ` Taylor Blau
2022-11-16 14:40 ` Phillip Wood
2022-11-17 10:26   ` Phillip Wood
2022-11-18 12:53     ` Johannes Schindelin
2022-11-18 13:16   ` Johannes Schindelin [this message]
2022-11-17 18:24 ` Ævar Arnfjörð Bjarmason
2022-11-18 11:39   ` Johannes Schindelin
2022-11-19 23:11 ` [PATCH v2] " Johannes Schindelin via GitGitGadget
2022-11-21 10:08   ` Phillip Wood
2022-11-22  7:40     ` Johannes Schindelin
2022-11-22 14:20       ` Phillip Wood
2022-11-22  9:08   ` [PATCH v3] " Johannes Schindelin via GitGitGadget
2022-11-22 14:23     ` Phillip Wood
2022-11-22 23:58     ` Junio C Hamano
2023-03-03 22:02     ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=06ro81n5-7sn1-070r-6747-5n77262o822p@tzk.qr \
    --to=johannes.schindelin@gmx.de \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=me@ttaylorr.com \
    --cc=phillip.wood@dunelm.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).