git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Johannes Schindelin <Johannes.Schindelin@gmx.de>
To: Elijah Newren <newren@gmail.com>
Cc: Git Mailing List <git@vger.kernel.org>
Subject: Re: [PATCH v10 18/36] merge-recursive: add get_directory_renames()
Date: Sat, 12 Oct 2019 21:23:58 +0200 (CEST)	[thread overview]
Message-ID: <nycvar.QRO.7.76.6.1910122122520.3272@tvgsbejvaqbjf.bet> (raw)
In-Reply-To: <CABPp-BFrFrm1-MdttBDRHMmun++HxwV1tySoDb75f58ObwMwfg@mail.gmail.com>

Hi Elijah,

On Fri, 11 Oct 2019, Elijah Newren wrote:

> On Wed, Oct 9, 2019 at 1:39 PM Johannes Schindelin
> <Johannes.Schindelin@gmx.de> wrote:
> >
> > sorry about the blast from the past, but I just stumbled over something
> > I could not even find any discussion about:
>
> I'm curious what brought you to this part of the codebase, but either
> way, thanks for sending an email with your findings.

Well, you know, it's a loooooong story.

> More comments below...

Thank you so much, they unpuzzled me quite a bit.

Ciao,
Dscho

>
> [...]
> > > @@ -1357,6 +1395,169 @@ static struct diff_queue_struct *get_diffpairs(struct merge_options *o,
> > >       return ret;
> > >  }
> > >
> > > +static void get_renamed_dir_portion(const char *old_path, const char *new_path,
> > > +                                 char **old_dir, char **new_dir)
> > > +{
> > > +     char *end_of_old, *end_of_new;
> > > +     int old_len, new_len;
> > > +
> > > +     *old_dir = NULL;
> > > +     *new_dir = NULL;
> > > +
> > > +     /*
> > > +      * For
> > > +      *    "a/b/c/d/e/foo.c" -> "a/b/some/thing/else/e/foo.c"
> > > +      * the "e/foo.c" part is the same, we just want to know that
> > > +      *    "a/b/c/d" was renamed to "a/b/some/thing/else"
> > > +      * so, for this example, this function returns "a/b/c/d" in
> > > +      * *old_dir and "a/b/some/thing/else" in *new_dir.
> > > +      *
> > > +      * Also, if the basename of the file changed, we don't care.  We
> > > +      * want to know which portion of the directory, if any, changed.
> > > +      */
> > > +     end_of_old = strrchr(old_path, '/');
> > > +     end_of_new = strrchr(new_path, '/');
> > > +
> > > +     if (end_of_old == NULL || end_of_new == NULL)
> > > +             return;
> > > +     while (*--end_of_new == *--end_of_old &&
> > > +            end_of_old != old_path &&
> > > +            end_of_new != new_path)
> > > +             ; /* Do nothing; all in the while loop */
> > > +     /*
> > > +      * We've found the first non-matching character in the directory
> > > +      * paths.  That means the current directory we were comparing
> > > +      * represents the rename.  Move end_of_old and end_of_new back
> > > +      * to the full directory name.
> > > +      */
> > > +     if (*end_of_old == '/')
> > > +             end_of_old++;
> > > +     if (*end_of_old != '/')
> > > +             end_of_new++;
> >
> > Is this intentional? Even after thinking about it for fifteen minutes, I
> > think it was probable meant to test for `*end_of_new == '/'` instead of
> > `*end_of_old != '/'`. And...
>
> Yeah, looks like a mess-up, and yes your suspicion is correct about
> what was intended.
>
> Hilariously, though, no bug results from this.  Since these are paths,
> as canonicalized by git (i.e. not as specified by the user where they
> might accidentally type multiple consecutive slashes), there will
> never be two slashes in a row (because we can't have directories with
> an empty name).  Thus, it is guaranteed at this point that *end_of_old
> != '/', and end_of_new is thus unconditionally advanced.  Further,
> since we wanted to find the _next_ '/' character after end_of_new,
> then there were two cases: (1) end_of_new already pointed at a slash
> character in which case we needed it to be advanced, or (2) end_of_new
> didn't point to a slash character so it wouldn't hurt at all to
> advance it.
>
> > > +     end_of_old = strchr(end_of_old, '/');
> > > +     end_of_new = strchr(end_of_new, '/');
> >
> > ... while I satisfied myself that these calls cannot return `NULL` at
> > this point, it took quite a few minutes of reasoning.
> >
> > So I think we might want to rewrite these past 6 lines, to make
> > everything quite a bit more obvious, like this:
> >
> >         if (end_of_old != old_path)
> >                 while (*(++end_of_old) != '/')
> >                         ; /* keep looking */
> >         if (end_of_new != new_path)
> >                 while (*(++end_of_new) != '/')
> >                         ; /* keep looking */
>
> I think your if-checks here are not correct.  Let's say that old_path
> was "tar/foo.c" and new_path was "star/foo.c".  The initial strrchr
> will bring both end_of_* variables back to the slash.  The moving left
> while equal will move end_of_old back to old_path (i.e. pointing to
> the "t") and end_of_new back to pointing at "t" as well.  Here's where
> your six alternate lines would kick in, and would leave end_of_old at
> old_path, while moving end_of_new to the '/', making it look like we
> had a rename of "" (the empty string or root directory) to "star"
> instead of a rename of "tar" to "star".  If you dropped your if-checks
> (just having the while loops), then I think it does the right thing.
>
> > There is _still_ one thing that makes this harder than trivial to reason
> > about: the case where one of `*end_of_old` and `*end_of_new` is a slash.
> > At this point, we assume that `*end_of_old != *end_of_new` (more about
> > that assumption in the next paragraph), therefore only one of them can
> > be a slash, and we want to advance beyond it. But even if the pointer
> > does not point at a slash, we want to look for one, so we want to
> > advance beyond it.
>
> I should probably add a comment that we want to advance BOTH to the
> next slash.  I would have just used strchr() but it wouldn't advance
> the string if it already points to what I'm looking for.  Actually, I
> guess I could simplify the code by unconditionally advancing by one
> character, then calling strchr().  In other words, simplifying these
> six lines to just
>
>        end_of_old = strchr(++end_of_old, '/');
>        end_of_new = strchr(++end_of_new, '/');
>
> > I also think that we need an extra guard: we do not handle the case
> > `a/b/c` -> `a/b/d` well. As stated a few lines above, "if the basename
> > of the file changed, we don't care". So we start looking at the last
> > slash, then go backwards, and since everything matches, end up with
> > `end_of_old == old_path` and `end_of_new == new_path`. The current code
> > will advance `end_of_new` (which I think is wrong) and then looks for
> > the next slash in both `end_of_new` and `end_of_old` (which is also
> > wrong).
>
> The current code is slightly convoluted, but I would say it's not
> wrong for this case.  If we renamed a/b/c -> a/b/d, then there isn't a
> directory rename; the leading directory (a/b/) is the same for both.
> You are right that the advancing of end_of_old and end_of_new to the
> next slash would result in what looks like a rename of "a" to "a", but
> the checks at the end checked for this case and only returned
> something for *old_dir and *new_dir if these didn't match; in fact,
> it's the part of the code at the end of your email that you didn't
> comment on, here:
>
> > > +
> > > +     /*
> > > +      * It may have been the case that old_path and new_path were the same
> > > +      * directory all along.  Don't claim a rename if they're the same.
> > > +      */
> > > +     old_len = end_of_old - old_path;
> > > +     new_len = end_of_new - new_path;
> > > +
> > > +     if (old_len != new_len || strncmp(old_path, new_path, old_len)) {
> > > +             *old_dir = xstrndup(old_path, old_len);
> > > +             *new_dir = xstrndup(new_path, new_len);
> > > +     }
> > > +}
> > > [...]
>
> However, we could drop this late check by just doing a simpler earlier
> check to see if end_of_old == old_path and end_of_new == new_path
> after the "find first non-equal character" step and before advancing
> to the next '/', and if that condition is found, then return early
> with no match.
>
>
> However, since you highlighted this code, there are two other special
> cases that might be interesting:
>
> 1) What if we are renaming e.g. foo/bar/baz.c ->
> leading/dir/foo/bar/baz.c?  Then after trying to find the first
> non-matching char we'll have end_of_old == old_path and *end_of_new ==
> 'f', and the advancing makes it look like "foo" being renamed to
> "leading/dir/foo".  Since the root directory cannot be renamed (it
> always exists on both sides of history), this probably makes sense as
> the right thing to return.
> 2) What if the renaming went the other way, from
> leading/dir/foo/bar/baz.c -> foo/bar/baz.c?  The whole advancing thing
> makes this look like "leading/dir/foo" being renamed to "foo", instead
> of "leading/dir" being renamed to "" (the root directory).  If we
> don't detect it as "leading/dir" being renamed (merged into) the root
> directory, then new files added directly within leading/dir/ on the
> other side of history won't be moved by directory rename detection
> into the root directory.
>
> > Is my reading correct?
>
> I'm not sure if I've answered your questions; let me know if not.  I
> have generated a couple patches to (1) make the code easier to follow,
> and (2) support the rename/merge of a subdirectory into the root
> directory.  They're waiting for the gitgitgadget CI checks right now,
> then I'll send them to the list.
>

  reply	other threads:[~2019-10-12 19:24 UTC|newest]

Thread overview: 78+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-04-19 17:57 [PATCH v10 00/36] Add directory rename detection to git Elijah Newren
2018-04-19 17:57 ` [PATCH v10 01/36] directory rename detection: basic testcases Elijah Newren
2018-04-19 17:57 ` [PATCH v10 02/36] directory rename detection: directory splitting testcases Elijah Newren
2018-04-19 17:57 ` [PATCH v10 03/36] directory rename detection: testcases to avoid taking detection too far Elijah Newren
2018-04-19 17:57 ` [PATCH v10 04/36] directory rename detection: partially renamed directory testcase/discussion Elijah Newren
2018-04-19 17:57 ` [PATCH v10 05/36] directory rename detection: files/directories in the way of some renames Elijah Newren
2018-04-19 17:57 ` [PATCH v10 06/36] directory rename detection: testcases checking which side did the rename Elijah Newren
2018-04-19 17:57 ` [PATCH v10 07/36] directory rename detection: more involved edge/corner testcases Elijah Newren
2018-04-19 17:57 ` [PATCH v10 08/36] directory rename detection: testcases exploring possibly suboptimal merges Elijah Newren
2018-04-19 17:57 ` [PATCH v10 09/36] directory rename detection: miscellaneous testcases to complete coverage Elijah Newren
2018-04-19 17:57 ` [PATCH v10 10/36] directory rename detection: tests for handling overwriting untracked files Elijah Newren
2018-04-19 17:57 ` [PATCH v10 11/36] directory rename detection: tests for handling overwriting dirty files Elijah Newren
2018-04-19 17:57 ` [PATCH v10 12/36] merge-recursive: move the get_renames() function Elijah Newren
2018-04-19 17:58 ` [PATCH v10 13/36] merge-recursive: introduce new functions to handle rename logic Elijah Newren
2018-04-19 17:58 ` [PATCH v10 14/36] merge-recursive: fix leaks of allocated renames and diff_filepairs Elijah Newren
2018-04-19 17:58 ` [PATCH v10 15/36] merge-recursive: make !o->detect_rename codepath more obvious Elijah Newren
2018-04-19 17:58 ` [PATCH v10 16/36] merge-recursive: split out code for determining diff_filepairs Elijah Newren
2018-04-19 17:58 ` [PATCH v10 17/36] merge-recursive: make a helper function for cleanup for handle_renames Elijah Newren
2018-04-19 17:58 ` [PATCH v10 18/36] merge-recursive: add get_directory_renames() Elijah Newren
2018-05-06 23:41   ` SZEDER Gábor
2018-05-07 15:45     ` [PATCH] fixup! " Elijah Newren
2019-10-09 20:38   ` [PATCH v10 18/36] " Johannes Schindelin
2019-10-11 20:02     ` Elijah Newren
2019-10-12 19:23       ` Johannes Schindelin [this message]
2018-04-19 17:58 ` [PATCH v10 19/36] merge-recursive: check for directory level conflicts Elijah Newren
2018-04-19 17:58 ` [PATCH v10 20/36] merge-recursive: add computation of collisions due to dir rename & merging Elijah Newren
2018-04-19 17:58 ` [PATCH v10 21/36] merge-recursive: check for file level conflicts then get new name Elijah Newren
2018-04-19 17:58 ` [PATCH v10 22/36] merge-recursive: when comparing files, don't include trees Elijah Newren
2018-04-19 17:58 ` [PATCH v10 23/36] merge-recursive: apply necessary modifications for directory renames Elijah Newren
2018-04-19 17:58 ` [PATCH v10 24/36] merge-recursive: avoid clobbering untracked files with " Elijah Newren
2018-04-19 17:58 ` [PATCH v10 25/36] merge-recursive: fix overwriting dirty files involved in renames Elijah Newren
2018-04-19 20:48   ` Martin Ågren
2018-04-19 20:54     ` Martin Ågren
2018-04-19 21:06     ` Elijah Newren
2018-04-19 17:58 ` [PATCH v10 26/36] merge-recursive: fix remaining directory rename + dirty overwrite cases Elijah Newren
2018-04-19 17:58 ` [PATCH v10 27/36] directory rename detection: new testcases showcasing a pair of bugs Elijah Newren
2018-04-19 17:58 ` [PATCH v10 28/36] merge-recursive: avoid spurious rename/rename conflict from dir renames Elijah Newren
2018-04-19 17:58 ` [PATCH v10 29/36] merge-recursive: improve add_cacheinfo error handling Elijah Newren
2018-04-19 17:58 ` [PATCH v10 30/36] merge-recursive: move more is_dirty handling to merge_content Elijah Newren
2018-04-19 17:58 ` [PATCH v10 31/36] merge-recursive: avoid triggering add_cacheinfo error with dirty mod Elijah Newren
2018-04-19 17:58 ` [PATCH v10 32/36] t6046: testcases checking whether updates can be skipped in a merge Elijah Newren
2018-04-19 20:26   ` SZEDER Gábor
2018-04-19 20:55     ` Elijah Newren
2018-04-19 17:58 ` [PATCH v10 33/36] merge-recursive: fix was_tracked() to quit lying with some renamed paths Elijah Newren
2018-04-19 20:39   ` Martin Ågren
2018-04-19 20:54     ` Elijah Newren
2018-04-20 12:23   ` SZEDER Gábor
2018-04-20 15:23     ` Elijah Newren
2018-04-21 19:37     ` [RFC PATCH v10 32.5/36] unpack_trees: fix memory corruption with split_index when src != dst Elijah Newren
2018-04-21 20:13       ` Elijah Newren
2018-04-22 12:38       ` Duy Nguyen
2018-04-23 17:09         ` Elijah Newren
2018-04-23 17:37           ` Duy Nguyen
2018-04-23 18:05             ` Elijah Newren
2018-04-24  0:24               ` [PATCH v2] unpack_trees: fix breakage when o->src_index != o->dst_index Elijah Newren
2018-04-24  1:51                 ` Junio C Hamano
2018-04-24  3:05                 ` Junio C Hamano
2018-04-24  6:50                   ` [PATCH v3] " Elijah Newren
2018-04-29 18:05                     ` Duy Nguyen
2018-04-29 20:53                       ` Johannes Schindelin
2018-04-30 14:42                         ` Duy Nguyen
2018-04-30 14:45                           ` Duy Nguyen
2018-04-30 16:19                             ` Elijah Newren
2018-04-30 16:29                               ` Duy Nguyen
2018-04-19 17:58 ` [PATCH v10 34/36] merge-recursive: fix remainder of was_dirty() to use original index Elijah Newren
2018-04-19 17:58 ` [PATCH v10 35/36] merge-recursive: make "Auto-merging" comment show for other merges Elijah Newren
2018-04-19 17:58 ` [PATCH v10 36/36] merge-recursive: fix check for skipability of working tree updates Elijah Newren
2018-04-19 18:35 ` [PATCH v10 00/36] Add directory rename detection to git Elijah Newren
2018-04-19 18:41   ` Stefan Beller
2018-04-19 19:54     ` Derrick Stolee
2018-04-19 20:22   ` Elijah Newren
2018-04-20  3:05   ` Junio C Hamano
2018-04-23 17:50     ` Elijah Newren
2018-04-24 20:20     ` [PATCH v10 1/2] fixup! merge-recursive: fix was_tracked() to quit lying with some renamed paths Elijah Newren
2018-04-24 20:21       ` [PATCH v10 2/2] fixup! t6046: testcases checking whether updates can be skipped in a merge Elijah Newren
2018-04-23 17:28 ` [PATCH v10 00/36] Add directory rename detection to git Elijah Newren
2018-04-23 23:46   ` Junio C Hamano
2018-04-24  0:15     ` Elijah Newren

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=nycvar.QRO.7.76.6.1910122122520.3272@tvgsbejvaqbjf.bet \
    --to=johannes.schindelin@gmx.de \
    --cc=git@vger.kernel.org \
    --cc=newren@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).