From: Johannes Schindelin <Johannes.Schindelin@gmx.de>
To: phillip.wood@dunelm.org.uk
Cc: Johannes Schindelin via GitGitGadget <gitgitgadget@gmail.com>,
git@vger.kernel.org, Taylor Blau <me@ttaylorr.com>
Subject: Re: [PATCH] range-diff: support reading mbox files
Date: Fri, 18 Nov 2022 14:16:32 +0100 (CET) [thread overview]
Message-ID: <06ro81n5-7sn1-070r-6747-5n77262o822p@tzk.qr> (raw)
In-Reply-To: <dfe0190c-1d2e-804a-5312-877b7b2f5822@dunelm.org.uk>
Hi Phillip,
On Wed, 16 Nov 2022, Phillip Wood wrote:
> On 15/11/2022 18:20, Johannes Schindelin via GitGitGadget wrote:
> > From: Johannes Schindelin <johannes.schindelin@gmx.de>
> >
> > Internally, the `git range-diff` command spawns a `git log` process and
> > parses its output for the given commit ranges.
> >
> > This works well when the patches that need to be compared are present in
> > the local repository in the form of commits.
> >
> > In scenarios where that is not the case, the `range-diff` command is
> > currently less helpful.
> >
> > The Git mailing list is such a scenario: Instead of using Git to
> > exchange commits, the patches are sent there as plain-text and no commit
> > range can be specified to let `range-diff` consume those patches.
> >
> > Instead, the expectation is to download the mails, apply them locally
> > and then use `range-diff`. This can be quite cumbersome e.g. when a
> > suitable base revision has to be found first where the patch applies
> > cleanly.
>
> That's a good motivation for this change.
Side note (potentially fun to know): A couple of weeks ago, I wanted to
work on this because I was asked to review a new patch series iteration on
the Git mailing list (I forgot which one it was, precisely), which came
without a range-diff, and I wanted to download both iterations using a
variation of
https://github.com/git-for-windows/build-extra/blob/HEAD/apply-from-lore.sh
and then use the mbox mode of `range-diff` on them. Unfortunately, I ran
out of time back then (it's a common theme for me these days), and the
patch series (rightfully) advanced to `next` without my review, and I let
this here patch slide.
> > +static inline int strtost(char const *s, size_t *result, const char **end)
> > +{
> > + unsigned long u;
> > + char *p;
> > +
> > + errno = 0;
> > + /* negative values would be accepted by strtoul */
> > + if (*s == '-')
> > + return -1;
>
> I think it is right to treat the input as untrusted and so look for malformed
> hunk headers. However This test is not sufficient for that, we expect a digit
> so I think
>
> if (!isdigit(*s))
> return -1;
>
> would be safer. The use of strtoul() looks good as we set errno to zero before
> the call and check both errno and endp afterwards.
Good point. As you might have guessed, this function is a copy/edit of
existing code, in this instance `strtoul_ui()`.
I have made the change you suggested.
>
> > + u = strtoul(s, &p, 10);
> > + if (errno || p == s)
> > + return -1;
> > + if (result)
> > + *result = u;
> > + *end = p;
> > +
> > + return 0;
> > +}
> > +
> > +static int parse_hunk_header(const char *p,
> > + size_t *old_count, size_t *new_count,
> > + const char **end)
> > +{
> > + size_t o = 1, n = 1;
> > +
> > + if (!skip_prefix(p, "@@ -", &p) ||
> > + strtost(p, NULL, &p) ||
> > + (*p != ' ' && (*p != ',' || strtost(p + 1, &o, &p))) ||
>
> It took me a minute to understand the double negatives but it is correctly
> checking if we have ' ' or ',<digits>'
I actually hesitated when I wrote this, and only kept the code as-is
because I remembered how often Junio uses double negatives and thought it
would be a good joke to do the same here. Joke's obviously on me, though.
I've hence changed it to:
/* The range is -<start>[,<count>], defaulting to count = 1 */
!(*p == ' ' || (*p == ',' && !strtost(p + 1, &o, &p))) ||
and
/* The range is * +<start>[,<count>], * defaulting to count = 1 */
!(*p == ' ' || (*p == ',' && !strtost(p + 1, &n, &p))) ||
>
> > + !skip_prefix(p, " +", &p) ||
> > + strtost(p, NULL, &p) ||
> > + (*p != ' ' && (*p != ',' || strtost(p + 1, &n, &p))) ||
> > + !skip_prefix(p, " @@", &p))
> > + return -1;
> > +
> > + *old_count = o;
> > + *new_count = n;
> > + *end = p;
> > +
> > + return 0;
> > +}
> > +
> > +static inline int find_eol(const char *line, size_t size)
> > +{
> > + char *eol;
> > +
> > + eol = memchr(line, '\n', size);
> > + if (!eol)
> > + return size;
> > +
> > + if (eol != line && eol[-1] == '\r')
> > + eol[-1] = '\0';
> > + else
> > + *eol = '\0';
> > +
> > + return eol + 1 - line;
>
> We return the offset to the start of the next line, not the length of the
> line. This will be important later.
You're right. I changed the name to `find_next_line()` and added an
informative comment on top of the function.
>
> > +}
> > +
> > +static int read_mbox(const char *path, struct string_list *list)
> > +{
> > + struct strbuf buf = STRBUF_INIT, contents = STRBUF_INIT;
> > + struct strbuf long_subject = STRBUF_INIT;
> > + struct patch_util *util = NULL;
> > + enum {
> > + MBOX_BEFORE_HEADER,
> > + MBOX_IN_HEADER,
> > + MBOX_IN_COMMIT_MESSAGE,
> > + MBOX_AFTER_TRIPLE_DASH,
> > + MBOX_IN_DIFF
> > + } state = MBOX_BEFORE_HEADER;
> > + char *line, *current_filename = NULL;
> > + int len;
> > + size_t size, old_count = 0, new_count = 0;
> > + const char *author = NULL, *subject = NULL;
> > +
> > + if (!strcmp(path, "-")) {
> > + if (strbuf_read(&contents, STDIN_FILENO, 0) < 0)
> > + return error_errno(_("could not read stdin"));
> > + } else if (strbuf_read_file(&contents, path, 0) < 0)
> > + return error_errno(_("could not read '%s'"), path);
> > +
> > + line = contents.buf;
> > + size = contents.len;
> > + for (; size > 0; size -= len, line += len) {
>
> size is unsigned so we're effectively testing 'size != 0' which means if we're
> off by one somewhere we'll have an undetected buffer overflow. Using a signed
> type wouldn't prevent the buffer overflow but it would limit its extent.
`contents.len` is of type `size_t`, though, and I'd like to stay
consistent.
I did remove the misleading `> 0` from the condition, though.
>
> > + const char *p;
> > +
> > + len = find_eol(line, size);
>
> Here len is not the length of line if it originally ended "\r\n".
>
> > + if (state == MBOX_BEFORE_HEADER) {
> > + if (!skip_prefix(line, "From ", &p))
> > + continue;
> > +
> > + util = xcalloc(1, sizeof(*util));
> > + if (get_oid_hex(p, &util->oid) < 0)
> > + oidcpy(&util->oid, null_oid());
> > + util->matching = -1;
> > + author = subject = NULL;
> > +
> > + state = MBOX_IN_HEADER;
>
> I wondered if there should there be a `continue;` here but I think it probably
> needs to "fall-through" to the MBOX_IN_HEADER handling below. A comment to
> clarify that would be helpful.
You're absolutely correct, there totally should be a `continue` here: we
saw a `From `, and in the `MBOX_IN_HEADER` we look for either an empty
line, the `From:` header or the `Subject:` header, so we know that the
current line cannot match, no need to fall through.
>
> > + }
> > +
> > + if (starts_with(line, "diff --git ")) {
> > + struct patch patch = { 0 };
> > + struct strbuf root = STRBUF_INIT;
> > + int linenr = 0;
> > + int orig_len;
> > +
> > + state = MBOX_IN_DIFF;
> > + old_count = new_count = 0;
> > + strbuf_addch(&buf, '\n');
> > + if (!util->diff_offset)
> > + util->diff_offset = buf.len;
> > + line[len - 1] = '\n';
>
> Here the line will still be NUL terminated if it originally ended "\r\n" which
> presumably messes up the call to parse_git_diff_header() below. I have not
> checked if parse_git_diff_header() can handle "\r\n" when it is parsing the
> rest of the diff header.
I changed it locally to reinstate the `\r\n`, only to figure out that
almost the entire `apply.c` machinery totally falls over CR/LF line
endings.
I changed the code to detect a Carriage Return and error out if one is
detected.
>
> > + orig_len = len;
> > + len = parse_git_diff_header(&root, &linenr, 1, line,
> > + len, size, &patch);
> > + if (len < 0) {
> > + error(_("could not parse git header '%.*s'"),
> > + orig_len, line);
> > + free(util);
> > + free(current_filename);
> > + string_list_clear(list, 1);
> > + strbuf_release(&buf);
> > + strbuf_release(&contents);
> > + strbuf_release(&long_subject);
> > + return -1;
> > + }
> > +
> > + if (patch.old_name)
> > + skip_prefix(patch.old_name, "a/",
> > + (const char **)&patch.old_name);
> > + if (patch.new_name)
> > + skip_prefix(patch.new_name, "b/",
> > + (const char **)&patch.new_name);
>
> I think this is fine for now but we might want to support other prefixes in
> the future. If it is not a copy or rename then the filename can be deduced by
> finding the common tail of patch.old_name and patch.new_name and stripping
> anything before the first '/'. If it is a copy or rename then I suspect there
> is no prefix (though I've not checked)
Since `skip_prefix()` does not do anything if there is no match, I think
it is sane to err by not stripping anything unless the expected `a/` and
`b/` prefixes are seen.
>
> > + strbuf_addstr(&buf, " ## ");
> > + if (patch.is_new > 0)
>
> `patch.is_now` and `patch.is_delete` are booleans like `patch.is_rename` so we
> don't need the '> 0'
Good catch.
>
> > + strbuf_addf(&buf, "%s (new)", patch.new_name);
> > + else if (patch.is_delete > 0)
> > + strbuf_addf(&buf, "%s (deleted)",
> > patch.old_name);
> > + else if (patch.is_rename)
> > + strbuf_addf(&buf, "%s => %s", patch.old_name,
> > patch.new_name);
> > + else
> > + strbuf_addstr(&buf, patch.new_name);
> > +
> > + free(current_filename);
> > + if (patch.is_delete > 0)
> > + current_filename = xstrdup(patch.old_name);
> > + else
> > + current_filename = xstrdup(patch.new_name);
> > +
> > + if (patch.new_mode && patch.old_mode &&
> > + patch.old_mode != patch.new_mode)
> > + strbuf_addf(&buf, " (mode change %06o =>
> > %06o)",
> > + patch.old_mode, patch.new_mode);
> > +
> > + strbuf_addstr(&buf, " ##\n");
> > + util->diffsize++;
> > + } else if (state == MBOX_IN_HEADER) {
> > + if (!line[0]) {
> > + state = MBOX_IN_COMMIT_MESSAGE;
> > + /* Look for an in-body From: */
> > + if (size > 5 && skip_prefix(line + 1, "From:
> > ", &p)) {
>
> The "size > 5" seems a bit unnecessary as we're using skip_prefix()
Right!
>
> > + size -= p - line;
> > + line += p - line;
>
> This is good, we're accounting for reading the next line.
>
> > + len = find_eol(line, size);
> > +
> > + while (isspace(*p))
> > + p++;
> > + author = p;
> > + }
> > + strbuf_addstr(&buf, " ## Metadata ##\n");
> > + if (author)
> > + strbuf_addf(&buf, "Author: %s\n",
> > author);
> > + strbuf_addstr(&buf, "\n ## Commit message
> > ##\n");
> > + if (subject)
> > + strbuf_addf(&buf, " %s\n\n",
> > subject);
> > + } else if (skip_prefix(line, "From: ", &p)) {
> > + while (isspace(*p))
> > + p++;
> > + author = p;
> > + } else if (skip_prefix(line, "Subject: ", &p)) {
> > + const char *q;
> > +
> > + while (isspace(*p))
> > + p++;
> > + subject = p;
> > +
> > + if (starts_with(p, "[PATCH") &&
> > + (q = strchr(p, ']'))) {
> > + q++;
> > + while (isspace(*q))
> > + q++;
> > + subject = q;
> > + }
> > +
> > + if (len < size && line[len] == ' ') {
> > + /* handle long subject */
> > + strbuf_reset(&long_subject);
> > + strbuf_addstr(&long_subject, subject);
> > + while (len < size && line[len] == ' ')
> > {
> > + line += len;
> > + size -= len;
> > + len = find_eol(line, size);
> > + strbuf_addstr(&long_subject,
> > line);
>
> Looks good
>
> > + }
> > + subject = long_subject.buf;
> > + }
> > + }
> > + } else if (state == MBOX_IN_COMMIT_MESSAGE) {
> > + if (!*line)
>
> Not a big issue elsewhere you've used "!line[0]"
> Style: there should be braces on this branch.
Fixed on both accounts.
> [... addressed in follow-up mail...]
>
> I think this is a useful addition, it could perhaps benefit from more tests
> though. Having tests for bad input, "\r\n" line endings and getting the author
> from a From: header as well as an in-body From: line would give a bit more
> reassurance about how robust the parser is.
Good point, I've augmented the test case a bit more.
Thank you again. It is a real pleasure to receive these constructive
reviews from you that pay so much attention to detail, and always lead to
a clear path forward.
Thanks!
Dscho
next prev parent reply other threads:[~2022-11-18 13:17 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-11-15 18:20 [PATCH] range-diff: support reading mbox files Johannes Schindelin via GitGitGadget
2022-11-16 2:09 ` Taylor Blau
2022-11-16 14:40 ` Phillip Wood
2022-11-17 10:26 ` Phillip Wood
2022-11-18 12:53 ` Johannes Schindelin
2022-11-18 13:16 ` Johannes Schindelin [this message]
2022-11-17 18:24 ` Ævar Arnfjörð Bjarmason
2022-11-18 11:39 ` Johannes Schindelin
2022-11-19 23:11 ` [PATCH v2] " Johannes Schindelin via GitGitGadget
2022-11-21 10:08 ` Phillip Wood
2022-11-22 7:40 ` Johannes Schindelin
2022-11-22 14:20 ` Phillip Wood
2022-11-22 9:08 ` [PATCH v3] " Johannes Schindelin via GitGitGadget
2022-11-22 14:23 ` Phillip Wood
2022-11-22 23:58 ` Junio C Hamano
2023-03-03 22:02 ` Junio C Hamano
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: http://vger.kernel.org/majordomo-info.html
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=06ro81n5-7sn1-070r-6747-5n77262o822p@tzk.qr \
--to=johannes.schindelin@gmx.de \
--cc=git@vger.kernel.org \
--cc=gitgitgadget@gmail.com \
--cc=me@ttaylorr.com \
--cc=phillip.wood@dunelm.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).