git@vger.kernel.org mailing list mirror (one of many)
From: Elijah Newren <newren@gmail.com>
To: Stefan Beller <sbeller@google.com>
Cc: "Junio C Hamano" <gitster@pobox.com>, git <git@vger.kernel.org>,
	"SZEDER Gábor" <szeder.dev@gmail.com>,
	"Jonathan Nieder" <jrnieder@gmail.com>,
	"Jeff King" <peff@peff.net>
Subject: Re: [PATCH v7 19/31] merge-recursive: add get_directory_renames()
Date: Sat, 3 Feb 2018 14:32:07 -0800
Message-ID: <CABPp-BGh7QTTfu3kgH4KO5DrrXiQjtrNhx_uaQsB6fHXT+9hLQ@mail.gmail.com>
In-Reply-To: <CAGZ79kb2+tpr0K0x2VVfFR-u3W2htvbxokxfKBpG60mNjXGPEw@mail.gmail.com>

On Fri, Feb 2, 2018 at 5:02 PM, Stefan Beller <sbeller@google.com> wrote:
> On Tue, Jan 30, 2018 at 3:25 PM, Elijah Newren <newren@gmail.com> wrote:

>> +       /* For
>
> comment style.

Fixed it and looked through the file for any other violations and
fixed the ones introduced in this series.  (The ones I added years ago
in other places of the file I just left.)

>> +        *    "a/b/c/d/foo.c" -> "a/b/something-else/d/foo.c"
>> +        * the "d/foo.c" part is the same, we just want to know that
>> +        *    "a/b/c" was renamed to "a/b/something-else"
>> +        * so, for this example, this function returns "a/b/c" in
>> +        * *old_dir and "a/b/something-else" in *new_dir.
>
> So we would not see multi-directory renames?
>
>     "a/b/c/d/foo.c" -> "a/b/something-else/e/foo.c"
>
> could be detected as
>
>     "a/b/{c/d/ => something-else/e}/foo.c"
>
> I assume this patch series is not bringing that to the table.
> (which is fine, I am just wondering)

I fully intended to support that, and believe the code does.  I
changed the comment as follows to try to make it clearer:

    /*
     * For
     *    "a/b/c/d/e/foo.c" -> "a/b/some/thing/else/e/foo.c"
     * the "e/foo.c" part is the same, we just want to know that
     *    "a/b/c/d" was renamed to "a/b/some/thing/else"
     * so, for this example, this function returns "a/b/c/d" in
     * *old_dir and "a/b/some/thing/else" in *new_dir.
     *
     * Also, if the basename of the file changed, we don't care.  We
     * want to know which portion of the directory, if any, changed.
     */

Is that better?
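
To make the intent concrete, here is a minimal standalone sketch of
the idea.  It is not the code from the patch: it drops the basenames
and then strips trailing directory components for as long as they
match.  The names copy_prefix() and renamed_dir_portion_sketch(), and
the use of plain malloc, are assumptions made for illustration only.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy the first len bytes of s into a new string. */
static char *copy_prefix(const char *s, size_t len)
{
	char *out = malloc(len + 1);

	if (!out)
		abort();
	memcpy(out, s, len);
	out[len] = '\0';
	return out;
}

/*
 * Illustrative only.  Sets *old_dir and *new_dir to the differing
 * leading directories, or both to NULL if either file sits at the
 * toplevel or the two directories are identical.
 */
static void renamed_dir_portion_sketch(const char *old_path,
				       const char *new_path,
				       char **old_dir, char **new_dir)
{
	const char *old_end = strrchr(old_path, '/');
	const char *new_end = strrchr(new_path, '/');
	size_t old_len, new_len;

	*old_dir = *new_dir = NULL;
	if (!old_end || !new_end)
		return; /* toplevel files are not handled */

	old_len = old_end - old_path;
	new_len = new_end - new_path;

	/* Same directory (at most the basename changed): no rename. */
	if (old_len == new_len && !strncmp(old_path, new_path, old_len))
		return;

	/* Strip identical trailing components, e.g. "/e" in the example. */
	for (;;) {
		size_t o = old_len, n = new_len;

		/* Find where the last component of each directory starts. */
		while (o > 0 && old_path[o - 1] != '/')
			o--;
		while (n > 0 && new_path[n - 1] != '/')
			n--;
		if (o == 0 || n == 0)
			break; /* keep at least one component per side */
		if (old_len - o != new_len - n ||
		    strncmp(old_path + o, new_path + n, old_len - o))
			break; /* last components differ */
		old_len = o - 1; /* drop "/component" from both */
		new_len = n - 1;
	}

	*old_dir = copy_prefix(old_path, old_len);
	*new_dir = copy_prefix(new_path, new_len);
}
```

Called with the two paths from the comment above, the sketch yields
"a/b/c/d" and "a/b/some/thing/else"; for a/b/file.c -> a/b/newfile.c
it leaves both outputs NULL.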

>> +        *
>> +        * Also, if the basename of the file changed, we don't care.  We
>> +        * want to know which portion of the directory, if any, changed.
>> +        */
>> +       end_of_old = strrchr(old_path, '/');
>> +       end_of_new = strrchr(new_path, '/');
>> +
>> +       if (end_of_old == NULL || end_of_new == NULL)
>> +               return;
>
> return early as the files are in the top level, and apparently we do
> not support top level renaming?
>
> What about a commit like 81b50f3ce4 (Move 'builtin-*' into
> a 'builtin/' subdirectory, 2010-02-22) ?
>
> Well that specific commit left many files outside the new builtin dir,
> but conceptually we could see a directory rename of
>
>     /* => /src/*

We had some internal use cases for supporting a "rename" of the
toplevel directory into a subdirectory of itself.  However, attempting
to support that opened much too big a can of worms for me, and I'd
expect it to cause some big surprises somewhere.

In particular, note that not supporting a "rename" of the toplevel
directory is a special case of not supporting a "rename" of any
directory to a subdirectory below itself, which in turn is a very
special case of excluding partial directory renames.  I addressed this
in the cover letter of my original submission, as follows:

"""
Further, there's a basic question about when directory rename detection
should be applied at all.  I have a simple rule:

  3) If a given directory still exists on both sides of a merge, we do
     not consider it to have been renamed.

Rule 3 may sound obvious at first, but it will probably arise as a
question for some users -- what if someone "mostly" moved a directory but
still left some files around, or, equivalently (from the perspective of the
three-way merge that merge-recursive performs), fully renamed a directory
in one commit and then recreated that directory in a later commit adding
some new files and then tried to merge?  See the big comment in section 4
of the new t6043 for further discussion of this rule.
"""

Patch 04/31 is the one that adds that big comment with further discussion.

Maybe there's a way to support toplevel renames, but I think it'd make
this series longer and more complicated...and may cause more edge
cases that confuse users.

>> +       while (*--end_of_new == *--end_of_old &&
>> +              end_of_old != old_path &&
>> +              end_of_new != new_path)
>> +               ; /* Do nothing; all in the while loop */
>
> We have to compare manually as we'd want to find
> the first non-equal and there doesn't seem to be a good
> library function for that.
>
> Assuming many repos are UTF8 (including in their paths),
> how does this work with display characters longer than one char?
> It should be fine as we cut at the slash?

Oh, UTF-8.  Ugh.
Can UTF-8 characters, other than '/', have a byte whose value matches
(unsigned char)('/')?  If so, then I'll need to figure out how to do
utf-8 character parsing.  Anyone have pointers?
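
For reference, UTF-8 encodes every non-ASCII code point using only
bytes with the high bit set (0x80 and above), so in valid UTF-8 a byte
equal to '/' (0x2F) can only be a literal slash.  Below is a small
sketch that checks this on a given string; the function name is made
up for illustration and is not part of git.

```c
#include <stddef.h>

/*
 * Hypothetical helper: count bytes equal to '/' that sit where a
 * UTF-8 continuation byte is expected.  For valid UTF-8 this is
 * always 0, because continuation bytes lie in 0x80..0xBF.
 */
static size_t slashes_inside_multibyte(const char *s)
{
	const unsigned char *p = (const unsigned char *)s;
	size_t hits = 0;

	while (*p) {
		size_t extra = 0, i;

		/* The lead byte tells how many continuation bytes follow. */
		if (*p >= 0xF0)
			extra = 3;
		else if (*p >= 0xE0)
			extra = 2;
		else if (*p >= 0xC0)
			extra = 1;
		p++;
		for (i = 0; i < extra && *p; i++, p++)
			if (*p == '/')
				hits++;
	}
	return hits;
}
```

So, assuming paths are valid UTF-8, searching for '/' bytes with
strchr()/strrchr() should not need any UTF-8 parsing.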

>> +       /*
>> +        * We've found the first non-matching character in the directory
>> +        * paths.  That means the current directory we were comparing
>> +        * represents the rename.  Move end_of_old and end_of_new back
>> +        * to the full directory name.
>> +        */
>> +       if (*end_of_old == '/')
>> +               end_of_old++;
>> +       if (*end_of_old != '/')
>> +               end_of_new++;
>> +       end_of_old = strchr(end_of_old, '/');
>> +       end_of_new = strchr(end_of_new, '/');
>> +
>> +       /*
>> +        * It may have been the case that old_path and new_path were the same
>> +        * directory all along.  Don't claim a rename if they're the same.
>> +        */
>> +       old_len = end_of_old - old_path;
>> +       new_len = end_of_new - new_path;
>> +
>> +       if (old_len != new_len || strncmp(old_path, new_path, old_len)) {
>
> How often are we going to see this string-is-equal case?
> Would it make sense to do that first in the function?

We don't have old_len at that point, and old_len != strlen(old_path):
this check compares directories, whereas old_path and new_path both
hold file paths.  I think the most common case is someone renaming e.g.
   a/b/file.c -> a/b/newfile.c
The filenames differ, but the directory does not.  Comparing at the
beginning of the function would thus give us the wrong information.

So the check really needs to be done at this point of the function.

>> +       dir_renames = malloc(sizeof(struct hashmap));
>
> xmalloc

Thanks; I also looked for any other malloc uses I introduced, but it
looks like you caught both of them.  I'll fix them up.

>
>> +       dir_rename_init(dir_renames);
>> +       for (i = 0; i < pairs->nr; ++i) {
>> +               struct string_list_item *item;
>> +               int *count;
>> +               struct diff_filepair *pair = pairs->queue[i];
>> +               char *old_dir, *new_dir;
>> +
>> +               /* File not part of directory rename if it wasn't renamed */
>> +               if (pair->status != 'R')
>> +                       continue;
>> +
>> +               get_renamed_dir_portion(pair->one->path, pair->two->path,
>> +                                       &old_dir,        &new_dir);
>> +               if (!old_dir)
>> +                       /* Directory didn't change at all; ignore this one. */
>> +                       continue;
>
>
> So the first loop is about counting the number of files in each directory
> that are renamed and the later while loop is about mapping them?

Close; would adding the following comment at the top of the function help?

    /*
     * Typically, we think of a directory rename as all files from a
     * certain directory being moved to a target directory.  However,
     * what if someone first moved two files from the original
     * directory in one commit, and then renamed the directory
     * somewhere else in a later commit?  At merge time, we just know
     * that files from the original directory went to two different
     * places, and that the bulk of them ended up in the same place.
     * We want each directory rename to follow the bulk of the files
     * from that directory.  This function exists to find where the
     * bulk of the files went.
     *
     * The first loop below simply iterates through the list of
     * renames, adding up all the rename source->target locations with
     * a count.
     *
     * The second loop compares the count for each renamed directory
     * and declares the "winning" target location.
     */

Is there any part that remains unclear with that comment?  (Also, is
that comment too long?)
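
A toy version of the two loops, purely to illustrate the comment.  It
uses fixed-size arrays instead of the hashmap and string_list from the
patch, and all of the names (struct dir_move, struct tally,
tally_move, winning_target) are invented for this sketch.

```c
#include <string.h>

#define MAX_MOVES 16

/* Hypothetical record: one (source dir, target dir) pair and its count. */
struct dir_move {
	const char *old_dir;
	const char *new_dir;
	int count;
};

struct tally {
	struct dir_move moves[MAX_MOVES];
	int nr;
};

/* Body of the first loop: bump the count for old_dir -> new_dir. */
static void tally_move(struct tally *t, const char *old_dir,
		       const char *new_dir)
{
	int i;

	for (i = 0; i < t->nr; i++) {
		if (!strcmp(t->moves[i].old_dir, old_dir) &&
		    !strcmp(t->moves[i].new_dir, new_dir)) {
			t->moves[i].count++;
			return;
		}
	}
	if (t->nr < MAX_MOVES) {
		t->moves[t->nr].old_dir = old_dir;
		t->moves[t->nr].new_dir = new_dir;
		t->moves[t->nr].count = 1;
		t->nr++;
	}
}

/*
 * Second loop: pick the target with the highest count for old_dir.
 * Returns NULL if old_dir has no renames or if the top count is tied.
 * How a tie should be treated is a policy choice; this sketch simply
 * declines to pick a winner.
 */
static const char *winning_target(const struct tally *t, const char *old_dir)
{
	const char *best = NULL;
	int best_count = 0, tied = 0, i;

	for (i = 0; i < t->nr; i++) {
		if (strcmp(t->moves[i].old_dir, old_dir))
			continue;
		if (t->moves[i].count > best_count) {
			best = t->moves[i].new_dir;
			best_count = t->moves[i].count;
			tied = 0;
		} else if (t->moves[i].count == best_count) {
			tied = 1;
		}
	}
	return tied ? NULL : best;
}
```

With three files going z -> y and one going z -> x, winning_target()
reports "y" for "z", i.e. the directory rename follows the bulk of
the files.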

>> +               /* Strings were xstrndup'ed before inserting into string-list,
>> +                * so ask string_list to remove the entries for us.
>> +                */
>
> comment style.

Thanks.

>> +               entry->possible_new_dirs.strdup_strings = 1;
>
> Why do we need to set strdup_strings here (so late in the
> game, we are about to clear it?) Could we set it earlier?
>
> Or rather have the string list duplicate the strings instead of
> get_renamed_dir_portion ?

We didn't strdup the original string (a file path) as-is, we
strndup'ed a subset of the original string (just the relevant portion
of the directory).  Since we already had to allocate a special string
for that, making the string list duplicate the strings would have
caused an extra unnecessary allocation and required us to free the
memory allocated by get_renamed_dir_portion manually.  So, we do need
this here.

Does this comment make it clearer?:

        /*
         * The relevant directory sub-portions of the original full
         * filepaths were xstrndup'ed before inserting into
         * possible_new_dirs, and instead of manually iterating the
         * list and free'ing each, just lie and tell
         * possible_new_dirs that it did the strdup'ing so that it
         * will free them for us.
         */
        entry->possible_new_dirs.strdup_strings = 1;
        string_list_clear(&entry->possible_new_dirs, 1);
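
A stripped-down stand-in for the pattern, in case it helps.  This is
not git's string_list; struct str_list and its two functions are
invented here to show only the ownership flag, with owns_strings
playing the role of strdup_strings.

```c
#include <stdlib.h>
#include <string.h>

/* Invented stand-in for a string list with an ownership flag. */
struct str_list {
	char **items;
	size_t nr, alloc;
	int owns_strings; /* plays the role of strdup_strings */
};

/* Append s; copy it only if the list currently owns its strings. */
static void str_list_append(struct str_list *l, char *s)
{
	if (l->nr == l->alloc) {
		l->alloc = l->alloc ? 2 * l->alloc : 4;
		l->items = realloc(l->items, l->alloc * sizeof(*l->items));
		if (!l->items)
			abort();
	}
	if (l->owns_strings) {
		size_t len = strlen(s) + 1;
		char *copy = malloc(len);

		if (!copy)
			abort();
		memcpy(copy, s, len);
		s = copy;
	}
	l->items[l->nr++] = s;
}

/*
 * Free the array, and the strings too if the list owns them.
 * Returns the number of strings that were freed.
 */
static size_t str_list_clear(struct str_list *l)
{
	size_t i, freed = 0;

	if (l->owns_strings) {
		for (i = 0; i < l->nr; i++) {
			free(l->items[i]);
			freed++;
		}
	}
	free(l->items);
	l->items = NULL;
	l->nr = l->alloc = 0;
	return freed;
}
```

Appending an already-allocated string while owns_strings is 0 stores
the pointer without a second copy; flipping owns_strings to 1 just
before str_list_clear() hands the job of freeing it to the list, which
is the same trick as in the snippet above.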
