git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Elijah Newren <newren@gmail.com>
To: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Cc: Git Mailing List <git@vger.kernel.org>
Subject: Re: [PATCH v10 18/36] merge-recursive: add get_directory_renames()
Date: Fri, 11 Oct 2019 13:02:03 -0700	[thread overview]
Message-ID: <CABPp-BFrFrm1-MdttBDRHMmun++HxwV1tySoDb75f58ObwMwfg@mail.gmail.com> (raw)
In-Reply-To: <nycvar.QRO.7.76.6.1910092044590.46@tvgsbejvaqbjf.bet>

// Dropping a few folks from the cc list as the thread is so old that
I think it should just be the normal git mailing list.

Hi Dscho,

On Wed, Oct 9, 2019 at 1:39 PM Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
>
> Hi Elijah,
>
> sorry about the blast from the past, but I just stumbled over something
> I could not even find any discussion about:

I'm curious what brought you to this part of the codebase, but either
way, thanks for sending an email with your findings.

More comments below...

[...]
> > @@ -1357,6 +1395,169 @@ static struct diff_queue_struct *get_diffpairs(struct merge_options *o,
> >       return ret;
> >  }
> >
> > +static void get_renamed_dir_portion(const char *old_path, const char *new_path,
> > +                                 char **old_dir, char **new_dir)
> > +{
> > +     char *end_of_old, *end_of_new;
> > +     int old_len, new_len;
> > +
> > +     *old_dir = NULL;
> > +     *new_dir = NULL;
> > +
> > +     /*
> > +      * For
> > +      *    "a/b/c/d/e/foo.c" -> "a/b/some/thing/else/e/foo.c"
> > +      * the "e/foo.c" part is the same, we just want to know that
> > +      *    "a/b/c/d" was renamed to "a/b/some/thing/else"
> > +      * so, for this example, this function returns "a/b/c/d" in
> > +      * *old_dir and "a/b/some/thing/else" in *new_dir.
> > +      *
> > +      * Also, if the basename of the file changed, we don't care.  We
> > +      * want to know which portion of the directory, if any, changed.
> > +      */
> > +     end_of_old = strrchr(old_path, '/');
> > +     end_of_new = strrchr(new_path, '/');
> > +
> > +     if (end_of_old == NULL || end_of_new == NULL)
> > +             return;
> > +     while (*--end_of_new == *--end_of_old &&
> > +            end_of_old != old_path &&
> > +            end_of_new != new_path)
> > +             ; /* Do nothing; all in the while loop */
> > +     /*
> > +      * We've found the first non-matching character in the directory
> > +      * paths.  That means the current directory we were comparing
> > +      * represents the rename.  Move end_of_old and end_of_new back
> > +      * to the full directory name.
> > +      */
> > +     if (*end_of_old == '/')
> > +             end_of_old++;
> > +     if (*end_of_old != '/')
> > +             end_of_new++;
>
> Is this intentional? Even after thinking about it for fifteen minutes, I
> think it was probable meant to test for `*end_of_new == '/'` instead of
> `*end_of_old != '/'`. And...

Yeah, looks like a mess-up, and yes your suspicion is correct about
what was intended.

Hilariously, though, no bug results from this.  Since these are paths,
as canonicalized by git (i.e. not as specified by the user where they
might accidentally type multiple consecutive slashes), there will
never be two slashes in a row (because we can't have directories with
an empty name).  Thus, it is guaranteed at this point that *end_of_old
!= '/', and end_of_new is thus unconditionally advanced.  Further,
since we wanted to find the _next_ '/' character after end_of_new,
then there were two cases: (1) end_of_new already pointed at a slash
character in which case we needed it to be advanced, or (2) end_of_new
didn't point to a slash character so it wouldn't hurt at all to
advance it.

> > +     end_of_old = strchr(end_of_old, '/');
> > +     end_of_new = strchr(end_of_new, '/');
>
> ... while I satisfied myself that these calls cannot return `NULL` at
> this point, it took quite a few minutes of reasoning.
>
> So I think we might want to rewrite these past 6 lines, to make
> everything quite a bit more obvious, like this:
>
>         if (end_of_old != old_path)
>                 while (*(++end_of_old) != '/')
>                         ; /* keep looking */
>         if (end_of_new != new_path)
>                 while (*(++end_of_new) != '/')
>                         ; /* keep looking */

I think your if-checks here are not correct.  Let's say that old_path
was "tar/foo.c" and new_path was "star/foo.c".  The initial strrchr
will bring both end_of_* variables back to the slash.  The moving left
while equal will move end_of_old back to old_path (i.e. pointing to
the "t") and end_of_new back to pointing at "t" as well.  Here's where
your six alternate lines would kick in, and would leave end_of_old at
old_path, while moving end_of_new to the '/', making it look like we
had a rename of "" (the empty string or root directory) to "star"
instead of a rename of "tar" to "star".  If you dropped your if-checks
(just having the while loops), then I think it does the right thing.

> There is _still_ one thing that makes this harder than trivial to reason
> about: the case where one of `*end_of_old` and `*end_of_new` is a slash.
> At this point, we assume that `*end_of_old != *end_of_new` (more about
> that assumption in the next paragraph), therefore only one of them can
> be a slash, and we want to advance beyond it. But even if the pointer
> does not point at a slash, we want to look for one, so we want to
> advance beyond it.

I should probably add a comment that we want to advance BOTH to the
next slash.  I would have just used strchr() but it wouldn't advance
the string if it already points to what I'm looking for.  Actually, I
guess I could simplify the code by unconditionally advancing by one
character, then calling strchr().  In other words, simplifying these
six lines to just

       end_of_old = strchr(++end_of_old, '/');
       end_of_new = strchr(++end_of_new, '/');

> I also think that we need an extra guard: we do not handle the case
> `a/b/c` -> `a/b/d` well. As stated a few lines above, "if the basename
> of the file changed, we don't care". So we start looking at the last
> slash, then go backwards, and since everything matches, end up with
> `end_of_old == old_path` and `end_of_new == new_path`. The current code
> will advance `end_of_new` (which I think is wrong) and then looks for
> the next slash in both `end_of_new` and `end_of_old` (which is also
> wrong).

The current code is slightly convoluted, but I would say it's not
wrong for this case.  If we renamed a/b/c -> a/b/d, then there isn't a
directory rename; the leading directory (a/b/) is the same for both.
You are right that the advancing of end_of_old and end_of_new to the
next slash would result in what looks like a rename of "a" to "a", but
the checks at the end checked for this case and only returned
something for *old_dir and *new_dir if these didn't match; in fact,
it's the part of the code at the end of your email that you didn't
comment on, here:

> > +
> > +     /*
> > +      * It may have been the case that old_path and new_path were the same
> > +      * directory all along.  Don't claim a rename if they're the same.
> > +      */
> > +     old_len = end_of_old - old_path;
> > +     new_len = end_of_new - new_path;
> > +
> > +     if (old_len != new_len || strncmp(old_path, new_path, old_len)) {
> > +             *old_dir = xstrndup(old_path, old_len);
> > +             *new_dir = xstrndup(new_path, new_len);
> > +     }
> > +}
> > [...]

However, we could drop this late check by just doing a simpler earlier
check to see if end_of_old == old_path and end_of_new == new_path
after the "find first non-equal character" step and before advancing
to the next '/', and if that condition is found, then return early
with no match.


However, since you highlighted this code, there are two other special
cases that might be interesting:

1) What if we are renaming e.g. foo/bar/baz.c ->
leading/dir/foo/bar/baz.c?  Then after trying to find the first
non-matching char we'll have end_of_old == old_path and *end_of_new ==
'f', and the advancing makes it look like "foo" being renamed to
"leading/dir/foo".  Since the root directory cannot be renamed (it
always exists on both sides of history), this probably makes sense as
the right thing to return.
2) What if the renaming went the other way, from
leading/dir/foo/bar/baz.c -> foo/bar/baz.c?  The whole advancing thing
makes this look like "leading/dir/foo" being renamed to "foo", instead
of "leading/dir" being renamed to "" (the root directory).  If we
don't detect it as "leading/dir" being renamed (merged into) the root
directory, then new files added directly within leading/dir/ on the
other side of history won't be moved by directory rename detection
into the root directory.

> Is my reading correct?

I'm not sure if I've answered your questions; let me know if not.  I
have generated a couple patches to (1) make the code easier to follow,
and (2) support the rename/merge of a subdirectory into the root
directory.  They're waiting for the gitgitgadget CI checks right now,
then I'll send them to the list.

  reply	other threads:[~2019-10-11 20:02 UTC|newest]

Thread overview: 78+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-04-19 17:57 [PATCH v10 00/36] Add directory rename detection to git Elijah Newren
2018-04-19 17:57 ` [PATCH v10 01/36] directory rename detection: basic testcases Elijah Newren
2018-04-19 17:57 ` [PATCH v10 02/36] directory rename detection: directory splitting testcases Elijah Newren
2018-04-19 17:57 ` [PATCH v10 03/36] directory rename detection: testcases to avoid taking detection too far Elijah Newren
2018-04-19 17:57 ` [PATCH v10 04/36] directory rename detection: partially renamed directory testcase/discussion Elijah Newren
2018-04-19 17:57 ` [PATCH v10 05/36] directory rename detection: files/directories in the way of some renames Elijah Newren
2018-04-19 17:57 ` [PATCH v10 06/36] directory rename detection: testcases checking which side did the rename Elijah Newren
2018-04-19 17:57 ` [PATCH v10 07/36] directory rename detection: more involved edge/corner testcases Elijah Newren
2018-04-19 17:57 ` [PATCH v10 08/36] directory rename detection: testcases exploring possibly suboptimal merges Elijah Newren
2018-04-19 17:57 ` [PATCH v10 09/36] directory rename detection: miscellaneous testcases to complete coverage Elijah Newren
2018-04-19 17:57 ` [PATCH v10 10/36] directory rename detection: tests for handling overwriting untracked files Elijah Newren
2018-04-19 17:57 ` [PATCH v10 11/36] directory rename detection: tests for handling overwriting dirty files Elijah Newren
2018-04-19 17:57 ` [PATCH v10 12/36] merge-recursive: move the get_renames() function Elijah Newren
2018-04-19 17:58 ` [PATCH v10 13/36] merge-recursive: introduce new functions to handle rename logic Elijah Newren
2018-04-19 17:58 ` [PATCH v10 14/36] merge-recursive: fix leaks of allocated renames and diff_filepairs Elijah Newren
2018-04-19 17:58 ` [PATCH v10 15/36] merge-recursive: make !o->detect_rename codepath more obvious Elijah Newren
2018-04-19 17:58 ` [PATCH v10 16/36] merge-recursive: split out code for determining diff_filepairs Elijah Newren
2018-04-19 17:58 ` [PATCH v10 17/36] merge-recursive: make a helper function for cleanup for handle_renames Elijah Newren
2018-04-19 17:58 ` [PATCH v10 18/36] merge-recursive: add get_directory_renames() Elijah Newren
2018-05-06 23:41   ` SZEDER Gábor
2018-05-07 15:45     ` [PATCH] fixup! " Elijah Newren
2019-10-09 20:38   ` [PATCH v10 18/36] " Johannes Schindelin
2019-10-11 20:02     ` Elijah Newren [this message]
2019-10-12 19:23       ` Johannes Schindelin
2018-04-19 17:58 ` [PATCH v10 19/36] merge-recursive: check for directory level conflicts Elijah Newren
2018-04-19 17:58 ` [PATCH v10 20/36] merge-recursive: add computation of collisions due to dir rename & merging Elijah Newren
2018-04-19 17:58 ` [PATCH v10 21/36] merge-recursive: check for file level conflicts then get new name Elijah Newren
2018-04-19 17:58 ` [PATCH v10 22/36] merge-recursive: when comparing files, don't include trees Elijah Newren
2018-04-19 17:58 ` [PATCH v10 23/36] merge-recursive: apply necessary modifications for directory renames Elijah Newren
2018-04-19 17:58 ` [PATCH v10 24/36] merge-recursive: avoid clobbering untracked files with " Elijah Newren
2018-04-19 17:58 ` [PATCH v10 25/36] merge-recursive: fix overwriting dirty files involved in renames Elijah Newren
2018-04-19 20:48   ` Martin Ågren
2018-04-19 20:54     ` Martin Ågren
2018-04-19 21:06     ` Elijah Newren
2018-04-19 17:58 ` [PATCH v10 26/36] merge-recursive: fix remaining directory rename + dirty overwrite cases Elijah Newren
2018-04-19 17:58 ` [PATCH v10 27/36] directory rename detection: new testcases showcasing a pair of bugs Elijah Newren
2018-04-19 17:58 ` [PATCH v10 28/36] merge-recursive: avoid spurious rename/rename conflict from dir renames Elijah Newren
2018-04-19 17:58 ` [PATCH v10 29/36] merge-recursive: improve add_cacheinfo error handling Elijah Newren
2018-04-19 17:58 ` [PATCH v10 30/36] merge-recursive: move more is_dirty handling to merge_content Elijah Newren
2018-04-19 17:58 ` [PATCH v10 31/36] merge-recursive: avoid triggering add_cacheinfo error with dirty mod Elijah Newren
2018-04-19 17:58 ` [PATCH v10 32/36] t6046: testcases checking whether updates can be skipped in a merge Elijah Newren
2018-04-19 20:26   ` SZEDER Gábor
2018-04-19 20:55     ` Elijah Newren
2018-04-19 17:58 ` [PATCH v10 33/36] merge-recursive: fix was_tracked() to quit lying with some renamed paths Elijah Newren
2018-04-19 20:39   ` Martin Ågren
2018-04-19 20:54     ` Elijah Newren
2018-04-20 12:23   ` SZEDER Gábor
2018-04-20 15:23     ` Elijah Newren
2018-04-21 19:37     ` [RFC PATCH v10 32.5/36] unpack_trees: fix memory corruption with split_index when src != dst Elijah Newren
2018-04-21 20:13       ` Elijah Newren
2018-04-22 12:38       ` Duy Nguyen
2018-04-23 17:09         ` Elijah Newren
2018-04-23 17:37           ` Duy Nguyen
2018-04-23 18:05             ` Elijah Newren
2018-04-24  0:24               ` [PATCH v2] unpack_trees: fix breakage when o->src_index != o->dst_index Elijah Newren
2018-04-24  1:51                 ` Junio C Hamano
2018-04-24  3:05                 ` Junio C Hamano
2018-04-24  6:50                   ` [PATCH v3] " Elijah Newren
2018-04-29 18:05                     ` Duy Nguyen
2018-04-29 20:53                       ` Johannes Schindelin
2018-04-30 14:42                         ` Duy Nguyen
2018-04-30 14:45                           ` Duy Nguyen
2018-04-30 16:19                             ` Elijah Newren
2018-04-30 16:29                               ` Duy Nguyen
2018-04-19 17:58 ` [PATCH v10 34/36] merge-recursive: fix remainder of was_dirty() to use original index Elijah Newren
2018-04-19 17:58 ` [PATCH v10 35/36] merge-recursive: make "Auto-merging" comment show for other merges Elijah Newren
2018-04-19 17:58 ` [PATCH v10 36/36] merge-recursive: fix check for skipability of working tree updates Elijah Newren
2018-04-19 18:35 ` [PATCH v10 00/36] Add directory rename detection to git Elijah Newren
2018-04-19 18:41   ` Stefan Beller
2018-04-19 19:54     ` Derrick Stolee
2018-04-19 20:22   ` Elijah Newren
2018-04-20  3:05   ` Junio C Hamano
2018-04-23 17:50     ` Elijah Newren
2018-04-24 20:20     ` [PATCH v10 1/2] fixup! merge-recursive: fix was_tracked() to quit lying with some renamed paths Elijah Newren
2018-04-24 20:21       ` [PATCH v10 2/2] fixup! t6046: testcases checking whether updates can be skipped in a merge Elijah Newren
2018-04-23 17:28 ` [PATCH v10 00/36] Add directory rename detection to git Elijah Newren
2018-04-23 23:46   ` Junio C Hamano
2018-04-24  0:15     ` Elijah Newren

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CABPp-BFrFrm1-MdttBDRHMmun++HxwV1tySoDb75f58ObwMwfg@mail.gmail.com \
    --to=newren@gmail.com \
    --cc=Johannes.Schindelin@gmx.de \
    --cc=git@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).