Directory rename detection breaks if repo has a lot of similar/identical files

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* Directory rename detection breaks if repo has a lot of similar/identical files
@ 2021-02-08 17:30 Yannis Zarkadas
  2021-02-08 23:35 ` Elijah Newren
  0 siblings, 1 reply; 2+ messages in thread
From: Yannis Zarkadas @ 2021-02-08 17:30 UTC (permalink / raw)
  To: git; +Cc: vkoukis

Hi everyone! I believe I have found a use-case that breaks git's directory
rename detection algorithm. In short, when a repo has a lot of identical 
files,
directory rename detection breaks. I believe this happens because git feeds
misleading renames to the directory rename detection algorithm.

For example, suppose a changeset (COMMIT_A, COMMIT_B). If `x/a` and `y/b` in
COMMIT_A are the same as `new/x/a` and `new/y/b` in COMMIT_B, then git may
decide the renames are `x/a -> new/y/b` and `y/b -> new/x/a` instead of
`x/a -> new/x/a` and `y/b -> new/y/b`. Of course, this confuses the 
directory
rename detection algorithm.

To make a long story short (I am including all the individual steps to
investigate this below), it seems the following enhancement would work 
to solve
this particular problem:

If file `a/base/base.txt` in COMMIT_A is the same as 
`new/a/base/base.txt` AND
`new/z/base/base.txt` in COMMIT_B, then prefer the rename
`a/base/base.txt->new/a/base/base.txt` over the rename
`a/base/base.txt->new/z/base/base.txt`. I have included a more generic 
algorithm
below.

Problem
=======

The original repo was very large, so I tried to find a simpler example that
caused git to detect directory renames incorrectly. I managed to create such
an environment with the following script:
https://gist.github.com/yanniszark/4165bd8f92d9838e143495e5a8df2ce0

The script creates a repo (/tmp/demo) with 3 branches:
- *upstream*: Contains a lot of folders, with some files being identical 
across
   folders.
- *upstream-restructured*: Same as upstream, but with all folders moved 
under
   `new/`. For example, `a/`->`new/a`
- *downstream*: Based on upstream, adds some extra files under every 
folder of
   upstream.

Then, it tries to rebase downstream ontop of upstream-restructured. Git 
detects
the WRONG directory rename `z/->new/a/` and fails with a wrong conflict.

$ git -c merge.directoryRenames=true rebase -v -m --onto 
upstream-restructured upstream
Path updated: z/overlays/develop/overlay.txt added in 9bf94f0 (Add 
overlay for z) inside a directory that was renamed in HEAD; moving it to 
new/a/overlays/develop/overlay.txt.
CONFLICT (rename/rename): Rename 
a/overlays/develop/overlay.txt->new/a/overlays/develop/overlay.txt in 
HEAD. Rename 
z/overlays/develop/overlay.txt->new/a/overlays/develop/overlay.txt in 
9bf94f0 (Add overlay for z)

We are going to work with this example for the rest of this mail.

Analysis
========

After learning a bit about how git works internally, I found out:
- Rebasing with the merge backend (`-m`) uses cherry-pick underneath.
- Cherry-pick frames the problem as a merge, with the merge-base being the
   parent of the REMOTE commit.
- The merge algorithm accepts 3 inputs (BASE, LOCAL, REMOTE), computes the
   changesets of (BASE, LOCAL) and (BASE, REMOTE), and finally combines 
them.

With this background in mind, I run the demo script and got this message:

Path updated: z/overlays/develop/overlay.txt added in 9bf94f0 (Add 
overlay for z) inside a directory that was renamed in HEAD; moving it to 
new/a/overlays/develop/overlay.txt.
CONFLICT (rename/rename): Rename 
a/overlays/develop/overlay.txt->new/a/overlays/develop/overlay.txt in 
HEAD. Rename 
z/overlays/develop/overlay.txt->new/a/overlays/develop/overlay.txt in 
9bf94f0 (Add overlay for z)

Applying our knowledge of rebase, cherry-pick and merge from above, let's
calculate the BASE, LOCAL and REMOTE commits.

LOCAL=HEAD
REMOTE=9bf94f0  # same as the error message
BASE=REMOTE~    # based on how cherry-pick works

At this point, we can peek at what changesets git's merge algorithm is 
seeing:

git diff --find-renames --stat $BASE $LOCAL
git diff --find-renames --stat $BASE $REMOTE

The first command is the interesting one and returns suspicious entries 
like the
following:

{z => new/a}/base/base.txt                   | 0
{a => new/a}/overlays/develop/overlay.txt    | 0
{a => new/a}/overlays/upstream/file_1.txt    | 0
{z => new/a}/overlays/upstream/file_3.txt    | 0
{z => new/a}/overlays/upstream/file_5.txt    | 0
{a => new/a}/overlays/upstream/file_a.txt    | 0
{z => new/a}/overlays/upstream/overlay.txt   | 0
[...]
{a => new/z}/base/base.txt                   | 0
{z => new/z}/overlays/upstream/file_1.txt    | 0
{a => new/z}/overlays/upstream/file_3.txt    | 0
{a => new/z}/overlays/upstream/file_5.txt    | 0
{z => new/z}/overlays/upstream/file_z.txt    | 0
{a => new/z}/overlays/upstream/overlay.txt   | 0

Hypothesis
==========

At this point, I formed a hypothesis:
- git detects renames between (BASE, LOCAL) as they appear in the `git diff`
   command above.
- git feeds those renames in the directory rename detection algorithm, which
   gets confused about what directory renames actually happened.

Confirming the Hypothesis
=========================

I tried to confirm my hypothesis by diving in the source code. My C is a bit
rusty, but I managed to find out the function that handles the renames, 
called
`detect_and_process_renames` (inside `merge-recursive.c`). The following 
part
is particularly interesting:

head_pairs = get_diffpairs(opt, common, head);
merge_pairs = get_diffpairs(opt, common, merge);
[...]
     dir_re_head = get_directory_renames(head_pairs);
     dir_re_merge = get_directory_renames(merge_pairs);

The first part is computing the changeset for (BASE, LOCAL) and (BASE, 
REMOTE).
The second part is feeding the changeset to the get_directory_renames 
function
for calculating directory renames.

By inspecting the `head_pairs` variable with my debugger, I confirmed 
that the
changeset is the one I saw with `git diff`. Indeed, git is feeding the
`get_directory_renames` function confusing renames.

Proposed Enhancement
====================

Now that we know what is most likely going on, I can describe what I 
would like
git to do.

Intuition: If file `a/base/base.txt` in BASE is the same as
`new/a/base/base.txt` AND `new/z/base/base.txt` in LOCAL, then prefer 
the rename
`a/base/base.txt->new/a/base/base.txt` over the rename
`a/base/base.txt->new/z/base/base.txt`.

In other words, if multiple rename candidates are possible, prefer the 
ones that
look more like the user move the file/folder from one path to another.

Possible Algorithm:

More generally, if files `A1`, `A2`, ..., `An` in commit A are the same 
as `B1`,
`B2`, `Bm` in commit B, then:
- Calculate a sorted list of scores `S(Ai, Bj), i=1,...,n and 
j=1,...,m`, where
   `S(Ai, Bj)` is the length of the suffix match of `Ai` and `Bj`.
- Walk the sorted list and for each element:
     - If `Ai` and `Bj` is not used in a rename yet, apply the rename.
     - Stop when you have found `min(n, m)` renames.
- The remaining `max(n, m) - min(n, m)` files are either deletes (if 
they are on
   commit A) or copies (if they are on commit B).

Conclusion
==========

I want to attempt to implement something like what I described above, when I
have more time to get familiar with diff's internals. In the meantime, I 
would
love to hear the opinion of more seasoned git developers on this issue. 
Do you
agree with my findings and the direction I proposed? Is there something 
I have
overlooked? Perhaps someone is already having these problems.

Thanks in advance,
Yannis Zarkadas

^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: Directory rename detection breaks if repo has a lot of similar/identical files
  2021-02-08 17:30 Directory rename detection breaks if repo has a lot of similar/identical files Yannis Zarkadas
@ 2021-02-08 23:35 ` Elijah Newren
  0 siblings, 0 replies; 2+ messages in thread
From: Elijah Newren @ 2021-02-08 23:35 UTC (permalink / raw)
  To: Yannis Zarkadas; +Cc: Git Mailing List, vkoukis

Hello Yannis, and welcome!

On Mon, Feb 8, 2021 at 9:37 AM Yannis Zarkadas <yanniszark@arrikto.com> wrote:
>
> Hi everyone! I believe I have found a use-case that breaks git's directory
> rename detection algorithm. In short, when a repo has a lot of identical
> files,
> directory rename detection breaks. I believe this happens because git feeds
> misleading renames to the directory rename detection algorithm.

The directory rename detection logic doesn't seem to have any issues
here; the problem is in the data fed to that logic, which is
essentially random.  Thus, this lies more on the side of rename
detection than directory rename detection, even if you paid more
attention to the issues once they started affecting directory rename
detection.  That's worth noting because directory rename detection
code is in merge-recursive.c, while normal rename detection code is in
diffcore-rename.c.  Just want to make sure we're looking at the right
code if you want to dive in.

In diffcore-rename.c, in the function named find_identical_files(),
ties will be broken based on whether basenames match:

    score += basename_same(source, target);
    if (score > best_score) {

If you have multiple identical files AND they have matching basenames,
then they'll all have the same score.  Since it only records a new
best if it is *better* than the previous, it will thus take whichever
one was first.  The ordering of which is first depends on the random
order that the hashmap decides to iterate over them.

Of course, this is only for identical files.  However, basename_same()
is used in one other place in the file, in association with inexact
renames, but has identical usage where the tie-breaker for files with
the same content similarity percentage are broken by whether the have
the same basename -- but any other path components are ignored.

...snip a bunch of good analysis...

> git diff --find-renames --stat $BASE $LOCAL
> git diff --find-renames --stat $BASE $REMOTE
>
> The first command is the interesting one and returns suspicious entries
> like the
> following:
>
> {z => new/a}/base/base.txt                   | 0
> {a => new/a}/overlays/develop/overlay.txt    | 0
> {a => new/a}/overlays/upstream/file_1.txt    | 0
> {z => new/a}/overlays/upstream/file_3.txt    | 0
> {z => new/a}/overlays/upstream/file_5.txt    | 0
> {a => new/a}/overlays/upstream/file_a.txt    | 0
> {z => new/a}/overlays/upstream/overlay.txt   | 0
> [...]
> {a => new/z}/base/base.txt                   | 0
> {z => new/z}/overlays/upstream/file_1.txt    | 0
> {a => new/z}/overlays/upstream/file_3.txt    | 0
> {a => new/z}/overlays/upstream/file_5.txt    | 0
> {z => new/z}/overlays/upstream/file_z.txt    | 0
> {a => new/z}/overlays/upstream/overlay.txt   | 0

And yep, that seems to be what you're seeing.  And if the renames look
like this, then trying infer directory renames based on the renames is
going to be a mess.

So it all boils down to the fact that you have identical files with
identical basenames, and renames don't currently know how to
differentiate between such files to get the "right" rename.

> Proposed Enhancement
> ====================
>
> Now that we know what is most likely going on, I can describe what I
> would like
> git to do.
>
> Intuition: If file `a/base/base.txt` in BASE is the same as
> `new/a/base/base.txt` AND `new/z/base/base.txt` in LOCAL, then prefer
> the rename
> `a/base/base.txt->new/a/base/base.txt` over the rename
> `a/base/base.txt->new/z/base/base.txt`.
>
> In other words, if multiple rename candidates are possible, prefer the
> ones that
> look more like the user move the file/folder from one path to another.

Stated differently, break ties for rename detection based on the
number of matching trailing path components.  For example, if the
basename was the only path component that was the same (e.g.
src/base.txt and source/base.txt), then we'd add one to the score.  If
the basenames were the same and two of the directories before that
matched as well (e.g. src/a/base/base.txt and source/a/base/base.txt)
then we'd add three to the score or name_score.  To implement this,
you'd only have to change the basename_same() function to: (1) rename
it (since it's no longer just about the basename), (2) make it keep
going past the first '/' component, and (3) change the 0...1 scoring
range to 0...N based on the number of '/' characters you passed while
walking backwards (except that reaching the toplevel counts the same
as hitting a '/' character, e.g. foo.txt and src/foo.txt have 1
matching path component rather than 0).

> Possible Algorithm:
>
> More generally, if files `A1`, `A2`, ..., `An` in commit A are the same
> as `B1`,
> `B2`, `Bm` in commit B, then:
> - Calculate a sorted list of scores `S(Ai, Bj), i=1,...,n and
> j=1,...,m`, where
>    `S(Ai, Bj)` is the length of the suffix match of `Ai` and `Bj`.
> - Walk the sorted list and for each element:
>      - If `Ai` and `Bj` is not used in a rename yet, apply the rename.
>      - Stop when you have found `min(n, m)` renames.
> - The remaining `max(n, m) - min(n, m)` files are either deletes (if
> they are on
>    commit A) or copies (if they are on commit B).

This would require a bigger code restructure, and I think it's no
different than my suggested changes to basename_same().

> Conclusion
> ==========
>
> I want to attempt to implement something like what I described above, when I
> have more time to get familiar with diff's internals. In the meantime, I
> would
> love to hear the opinion of more seasoned git developers on this issue.
> Do you
> agree with my findings and the direction I proposed? Is there something
> I have
> overlooked? Perhaps someone is already having these problems.

You're the first I know of to have identical files in different
directories with identical basenames, and then attempt to rename them
and do directory rename detection based on that combination.  But just
because you're the first I know of doesn't imply that you're the only
one or the last one, of course.  I think the basename_same() change
would probably fix this up nicely for you.

You'll need  to be aware, though, that I'm making lots of changes to
diffcore-rename.c right now, though, so there is a risk of conflicts.
Just since git-2.30, see these commits:

350410f6b1 (diffcore-rename: remove unnecessary duplicate entry
checks, 2020-12-29)
9db2ac5616 (diffcore-rename: accelerate rename_dst setup, 2020-12-11)
b970b4ef62 (diffcore-rename: simplify and accelerate
register_rename_src(), 2020-12-11)
81c4bf0296 (diffcore-rename: reduce jumpiness in progress counters, 2020-12-11)
ad8a1be529 (diffcore-rename: simplify limit check, 2020-12-11)
00b8cccdd8 (diffcore-rename: avoid usage of global in
too_many_rename_candidates(), 2020-12-11)
26a66a6b1c (diffcore-rename: rename num_create to num_destinations, 2020-12-11)

plus many of the commits found in each of the pull requests up at
https://github.com/gitgitgadget/git/pulls?q=is%3Apr+author%3Anewren+Optimization+batch.
That shouldn't stop you, but be aware that you might need to rebase a
few times, or let me take over and keep rebasing your patches on top
of mine (retaining you as author) in order to get all the series of
patches through.

Hope that helps,
Elijah

^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2021-02-08 23:37 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-08 17:30 Directory rename detection breaks if repo has a lot of similar/identical files Yannis Zarkadas
2021-02-08 23:35 ` Elijah Newren

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).