git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Yannis Zarkadas <yanniszark@arrikto.com>
To: git@vger.kernel.org
Cc: vkoukis@arrikto.com
Subject: Directory rename detection breaks if repo has a lot of similar/identical files
Date: Mon, 8 Feb 2021 19:30:22 +0200	[thread overview]
Message-ID: <4ceea1ec-59fe-af8e-b91d-aced4e6a1b0f@arrikto.com> (raw)

Hi everyone! I believe I have found a use-case that breaks git's directory
rename detection algorithm. In short, when a repo has a lot of identical 
files,
directory rename detection breaks. I believe this happens because git feeds
misleading renames to the directory rename detection algorithm.

For example, suppose a changeset (COMMIT_A, COMMIT_B). If `x/a` and `y/b` in
COMMIT_A are the same as `new/x/a` and `new/y/b` in COMMIT_B, then git may
decide the renames are `x/a -> new/y/b` and `y/b -> new/x/a` instead of
`x/a -> new/x/a` and `y/b -> new/y/b`. Of course, this confuses the 
directory
rename detection algorithm.

To make a long story short (I am including all the individual steps to
investigate this below), it seems the following enhancement would work 
to solve
this particular problem:

If file `a/base/base.txt` in COMMIT_A is the same as 
`new/a/base/base.txt` AND
`new/z/base/base.txt` in COMMIT_B, then prefer the rename
`a/base/base.txt->new/a/base/base.txt` over the rename
`a/base/base.txt->new/z/base/base.txt`. I have included a more generic 
algorithm
below.


Problem
=======

The original repo was very large, so I tried to find a simpler example that
caused git to detect directory renames incorrectly. I managed to create such
an environment with the following script:
https://gist.github.com/yanniszark/4165bd8f92d9838e143495e5a8df2ce0

The script creates a repo (/tmp/demo) with 3 branches:
- *upstream*: Contains a lot of folders, with some files being identical 
across
   folders.
- *upstream-restructured*: Same as upstream, but with all folders moved 
under
   `new/`. For example, `a/`->`new/a`
- *downstream*: Based on upstream, adds some extra files under every 
folder of
   upstream.

Then, it tries to rebase downstream ontop of upstream-restructured. Git 
detects
the WRONG directory rename `z/->new/a/` and fails with a wrong conflict.

$ git -c merge.directoryRenames=true rebase -v -m --onto 
upstream-restructured upstream
Path updated: z/overlays/develop/overlay.txt added in 9bf94f0 (Add 
overlay for z) inside a directory that was renamed in HEAD; moving it to 
new/a/overlays/develop/overlay.txt.
CONFLICT (rename/rename): Rename 
a/overlays/develop/overlay.txt->new/a/overlays/develop/overlay.txt in 
HEAD. Rename 
z/overlays/develop/overlay.txt->new/a/overlays/develop/overlay.txt in 
9bf94f0 (Add overlay for z)


We are going to work with this example for the rest of this mail.


Analysis
========

After learning a bit about how git works internally, I found out:
- Rebasing with the merge backend (`-m`) uses cherry-pick underneath.
- Cherry-pick frames the problem as a merge, with the merge-base being the
   parent of the REMOTE commit.
- The merge algorithm accepts 3 inputs (BASE, LOCAL, REMOTE), computes the
   changesets of (BASE, LOCAL) and (BASE, REMOTE), and finally combines 
them.


With this background in mind, I run the demo script and got this message:

Path updated: z/overlays/develop/overlay.txt added in 9bf94f0 (Add 
overlay for z) inside a directory that was renamed in HEAD; moving it to 
new/a/overlays/develop/overlay.txt.
CONFLICT (rename/rename): Rename 
a/overlays/develop/overlay.txt->new/a/overlays/develop/overlay.txt in 
HEAD. Rename 
z/overlays/develop/overlay.txt->new/a/overlays/develop/overlay.txt in 
9bf94f0 (Add overlay for z)

Applying our knowledge of rebase, cherry-pick and merge from above, let's
calculate the BASE, LOCAL and REMOTE commits.

LOCAL=HEAD
REMOTE=9bf94f0  # same as the error message
BASE=REMOTE~    # based on how cherry-pick works

At this point, we can peek at what changesets git's merge algorithm is 
seeing:

git diff --find-renames --stat $BASE $LOCAL
git diff --find-renames --stat $BASE $REMOTE

The first command is the interesting one and returns suspicious entries 
like the
following:

{z => new/a}/base/base.txt                   | 0
{a => new/a}/overlays/develop/overlay.txt    | 0
{a => new/a}/overlays/upstream/file_1.txt    | 0
{z => new/a}/overlays/upstream/file_3.txt    | 0
{z => new/a}/overlays/upstream/file_5.txt    | 0
{a => new/a}/overlays/upstream/file_a.txt    | 0
{z => new/a}/overlays/upstream/overlay.txt   | 0
[...]
{a => new/z}/base/base.txt                   | 0
{z => new/z}/overlays/upstream/file_1.txt    | 0
{a => new/z}/overlays/upstream/file_3.txt    | 0
{a => new/z}/overlays/upstream/file_5.txt    | 0
{z => new/z}/overlays/upstream/file_z.txt    | 0
{a => new/z}/overlays/upstream/overlay.txt   | 0


Hypothesis
==========

At this point, I formed a hypothesis:
- git detects renames between (BASE, LOCAL) as they appear in the `git diff`
   command above.
- git feeds those renames in the directory rename detection algorithm, which
   gets confused about what directory renames actually happened.


Confirming the Hypothesis
=========================

I tried to confirm my hypothesis by diving in the source code. My C is a bit
rusty, but I managed to find out the function that handles the renames, 
called
`detect_and_process_renames` (inside `merge-recursive.c`). The following 
part
is particularly interesting:

head_pairs = get_diffpairs(opt, common, head);
merge_pairs = get_diffpairs(opt, common, merge);
[...]
     dir_re_head = get_directory_renames(head_pairs);
     dir_re_merge = get_directory_renames(merge_pairs);

The first part is computing the changeset for (BASE, LOCAL) and (BASE, 
REMOTE).
The second part is feeding the changeset to the get_directory_renames 
function
for calculating directory renames.

By inspecting the `head_pairs` variable with my debugger, I confirmed 
that the
changeset is the one I saw with `git diff`. Indeed, git is feeding the
`get_directory_renames` function confusing renames.


Proposed Enhancement
====================

Now that we know what is most likely going on, I can describe what I 
would like
git to do.

Intuition: If file `a/base/base.txt` in BASE is the same as
`new/a/base/base.txt` AND `new/z/base/base.txt` in LOCAL, then prefer 
the rename
`a/base/base.txt->new/a/base/base.txt` over the rename
`a/base/base.txt->new/z/base/base.txt`.

In other words, if multiple rename candidates are possible, prefer the 
ones that
look more like the user move the file/folder from one path to another.

Possible Algorithm:

More generally, if files `A1`, `A2`, ..., `An` in commit A are the same 
as `B1`,
`B2`, `Bm` in commit B, then:
- Calculate a sorted list of scores `S(Ai, Bj), i=1,...,n and 
j=1,...,m`, where
   `S(Ai, Bj)` is the length of the suffix match of `Ai` and `Bj`.
- Walk the sorted list and for each element:
     - If `Ai` and `Bj` is not used in a rename yet, apply the rename.
     - Stop when you have found `min(n, m)` renames.
- The remaining `max(n, m) - min(n, m)` files are either deletes (if 
they are on
   commit A) or copies (if they are on commit B).


Conclusion
==========

I want to attempt to implement something like what I described above, when I
have more time to get familiar with diff's internals. In the meantime, I 
would
love to hear the opinion of more seasoned git developers on this issue. 
Do you
agree with my findings and the direction I proposed? Is there something 
I have
overlooked? Perhaps someone is already having these problems.


Thanks in advance,
Yannis Zarkadas


             reply	other threads:[~2021-02-08 17:35 UTC|newest]

Thread overview: 2+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-02-08 17:30 Yannis Zarkadas [this message]
2021-02-08 23:35 ` Directory rename detection breaks if repo has a lot of similar/identical files Elijah Newren

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4ceea1ec-59fe-af8e-b91d-aced4e6a1b0f@arrikto.com \
    --to=yanniszark@arrikto.com \
    --cc=git@vger.kernel.org \
    --cc=vkoukis@arrikto.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).