From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net X-Spam-Level: X-Spam-Status: No, score=-3.6 required=3.0 tests=AWL,BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,RCVD_IN_DNSWL_BLOCKED, SPF_HELO_PASS,SPF_PASS shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2 Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by dcvr.yhbt.net (Postfix) with ESMTP id 04E041F8C2 for ; Mon, 8 Feb 2021 16:28:18 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234325AbhBHQ0Y (ORCPT ); Mon, 8 Feb 2021 11:26:24 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:39632 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229558AbhBHQ0D (ORCPT ); Mon, 8 Feb 2021 11:26:03 -0500 Received: from mail-ot1-x331.google.com (mail-ot1-x331.google.com [IPv6:2607:f8b0:4864:20::331]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2E2B0C061788 for ; Mon, 8 Feb 2021 08:25:23 -0800 (PST) Received: by mail-ot1-x331.google.com with SMTP id o12so14658408ote.12 for ; Mon, 08 Feb 2021 08:25:23 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=1sIQ8isQY6jz6k7tV+QcWrkTdF5Ak67rfg+PjHkxuJE=; b=TT5RCXwG6Hwi8UbvjSTlCkFt6TgjpTpnSLGVg7X3EgorFEPVtJgIup0e+0UAuEactj uzJEWDYbGycupyQbl22RDnpv4Dd7R0YLgKL5daCN6xjwZwt74NGJ0fZ4fPSAmf1oBGxt X/nJdsPgLAtTzpkTCq/yBcpzB8A2u/JWgUbrP+VDaUGIGVIDPhnp3ruy/HmClQxhKZbE Z/eiUcHJIiSziK7sTQQ3e2rLcviPi7UgREQPy0SBWvb6AOclsYhbfjTh8awpj+TUHfL/ WMJ9fBHCTZa9d+yaOAiqSBVsm+Vt7zjTjUW3Qmg+G/NJXn+aWRRq4fzldxoP3K3jlzI0 FZhg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=1sIQ8isQY6jz6k7tV+QcWrkTdF5Ak67rfg+PjHkxuJE=; b=C+3WM1xcHzyn55pvFVIF9EvJlvI44YHXrIOzyCwQuCvJssZPfTH7PWHZAkApp/LWTm ABeG20wRm9uFPYb5qJylpVNHXPDPa1rv38WyoWjlwD+ojDMCI7pTnHkwP7nk1Vd+tVJ/ /vb0atMo3GnhybGpJeLL5v7YRntR/33Zkss6OJb0khLVEOcOWtqrjhyYiQH+aCXIr5lg FfrSTBtFM6GVsw1KKMOgnDo4AbL5cDWH4lqyG++nJJk/+5xdQySyBnPNA1qM4aTBaDP+ 2aFlJnoTx2rBoBdOHJ/1QlyXCc3spCuYCEfi8QsIKdwpKPwSRxcYwa8/RTTIGGxP027E Z+FA== X-Gm-Message-State: AOAM533OnghBkceJbmBE12Zyzc67Vrc0TV478QI8zMbCWtxr5EV2V0LR 7hYLwE9y7tczKSy7LPb5mBzoS8BXIFNnVAiuL1I= X-Google-Smtp-Source: ABdhPJzVnpvC7DTY8lskkht1zbS0IYPKkG5qUsHFJ/xg2ChNglCIRuQGmfa9HERJzGdksAIrFmA6HRf+RhLjHU9rqTo= X-Received: by 2002:a05:6830:93:: with SMTP id a19mr13276910oto.345.1612801522420; Mon, 08 Feb 2021 08:25:22 -0800 (PST) MIME-Version: 1.0 References: <1d941c35076e8d515c8ff7ef01d6b9d8c092aaa9.1612651937.git.gitgitgadget@gmail.com> <9fbed0f9-032e-3f99-8467-f8a9cfa2d8f1@gmail.com> <57d30e7d-7727-8d98-e3ef-bcfeebf9edd3@gmail.com> In-Reply-To: <57d30e7d-7727-8d98-e3ef-bcfeebf9edd3@gmail.com> From: Elijah Newren Date: Mon, 8 Feb 2021 08:25:00 -0800 Message-ID: Subject: Re: [PATCH 3/3] diffcore-rename: guide inexact rename detection based on basenames To: Derrick Stolee Cc: Junio C Hamano , Elijah Newren via GitGitGadget , Git Mailing List , Derrick Stolee , Jonathan Tan , Taylor Blau , Jeff King Content-Type: text/plain; charset="UTF-8" Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org On Mon, Feb 8, 2021 at 3:44 AM Derrick Stolee wrote: > > On 2/8/2021 3:38 AM, Elijah Newren wrote: > > On Sun, Feb 7, 2021 at 11:51 AM Junio C Hamano wrote: > >> > >> Derrick Stolee writes: > >> > >>> Before I get too deep into reviewing these patches, I do want > >>> to make it clear that the speed-up is coming at the cost of > >>> a behavior change. We are restricting the "best match" search > >>> to be first among files with common base name (although maybe > >>> I would use 'suffix'?). If we search for a rename among all > >>> additions and deletions ending the ".txt" we might find a > >>> similarity match that is 60% and declare that a rename, even > >>> if there is a ".txt" -> ".md" pair that has a 70% match. > >> > >> Yes, my initial reaction to the idea was that "yuck, our rename > >> detection lost its purity". diffcore-rename strived to base its > >> decision purely on content similarity, primarily because it is one > >> of the oldest part of Git where the guiding principle has always > >> been that the content is the king. I think my aversion to the "all > >> of my neighbors are relocating, so should I move to the same place" > >> (aka "directory rename") comes from a similar place, but in a sense > >> this was worse. > >> > >> At least, until I got over the initial bump. I do not think the > >> suffix match is necessarily a bad idea, but it adds more "magically > >> doing a wrong thing" failure modes (e.g. the ".txt" to ".md" example > >> would probably have more variants that impact the real life > >> projects; ".C" vs ".cc" vs ".cxx" vs ".cpp" immediately comes to > >> mind), and a tool that silently does a wrong thing because it uses > >> more magic would be a tool that is hard to explain why it did the > >> wrong thing when it does. > > > > Stolee explained a new algorithm different than what I have proposed, > > Yes, sorry for adding noise. The point stands that we are changing > the behavior in some cases, so that must be agreed upon. What you > are _actually_ proposing is a much smaller change than I thought, > but it is still worth pointing out the behavior change. Again, no need to apologize; if my commit messages weren't clear, they need to be fixed. I am much more surprised here by your repeated point that it's worth pointing out the behavior change. I totally agree with that, and it's why I spent two paragraphs on the second commit message explicitly covering this and listing four enumerated reasons to argue for the change. So, I thought I had done that, but it apparently isn't very clear...and I'm left wondering how to clarify it further. So, a question for you: How should I change it? Should I just modify the third commit message to re-highlight that it does change behavior and refer to the second commit message for details? Because repeating the point is the only way I can think of to make it clearer. Is there anything else I can or should do? > > I think based on the apparent different meanings of "basename" that > > exist. I tried to clarify that in response to his email, but I wanted > > to clarify one additional thing here too: > >> diffcore-rename has not in the past based its decision solely on > > content similarity. It only does that when break detection is on. > > Otherwise, 0% content similarity is trumped by sufficient filename > > similarity (namely, with a filename similarity of 100%). If the > > filename similarity wasn't sufficiently high (anything less than an > > exact match), then it completely ignored filename similarity and > > looked only at content similarity. It thus jumped from one extreme to > > another. > > This idea of optimizing first for 100% filename similarity is a > good perspective on Git's rename detection algorithm. The canonical > example of this 100% filename similarity is a rename cycle: > > A -> B > B -> C > C -> A > > Even if the OIDs are distinct and exactly match across these renames, > we see that there are no adds or deletes, so we do not even trigger > rename detection and report A, B, and C as edited instead. > > A "rename path" (not cycle) such as: > > A -> B > B -> C > > does trigger rename detection, but B will never be considered. Instead, > "A -> C" will be checked for similarity to see if it is within the > threshold. > > Of course, I am _not_ advocating that we change this behavior. These > situations are incredibly rare and we should not sacrifice performance > in the typical case to handle them. > > > My optimization is adding an in-between state. When the basename (the > > part of the path excluding the leading directory) matches the basename > > of another file (and those basenames are unique on each side), then > > compare content similarity and mark the files as a rename if the two > > are sufficiently similar. It is thus a position that considers both > > filename similarity (basename match) and content similarity together. > > > >>> This could be documented in a test case, to demonstrate that > >>> we are making this choice explicitly. > >> > >> Yes. I wonder if we can solve it by requiring a lot better than > >> minimum match when trying the "suffix match" first, or something? > > > > This may still be a useful idea, and was something I had considered, > > but more in the context of more generic filename similarity > > comparisons. We could still discuss it even when basenames match, but > > basenames matching seems strong enough to me that I wasn't sure extra > > configuration knobs were warranted. > > I think this is a complication that we might not want to add to the > heuristic, at least not at first. We might want to have a follow-up > that adjusts that value to be higher. A natural way would be through > a config option, so users can select something incredibly high like > 99%. Another option would be to take a minimum that is halfway between > the existing similarity minimum and 100%. > > >> Provided if we agree that it is a good idea to insert this between > >> "exact contents match" and "full matrix", I have one question to > >> Elijah on what the code does. > >> > >> To me, it seems that the "full matrix" part still uses the remaining > >> src and dst candidates fully. But if "A.txt" and "B.txt" are still > >> surviving in the src/dst at that stage, shouldn't we be saying that > >> "no way these can be similar enough---we've checked in the middle > >> stage where only the ones with the same suffix are considered and > >> this pair didn't turn into a rename"? > > > > This is a very good point. A.txt and B.txt will not have been > > compared previously since their basenames do not match, but the basic > > idea is still possible. For example, A.txt could have been compared > > to source/some-module/A.txt. And I don't do anything in the final > > "full matrix" stage to avoid re-comparing those two files again. > > However, it is worth noting that A.txt will have been compared to at > > most one other file, not N files. And thus while we are wasting some > > re-comparisons, it is at most O(N) duplicated comparisons, not O(N^2). > > I thought about that, but decided to not bother, based on the > > following thinking: > > > > 1) The most expensive comparison is the first one, because when we do > > that one, we first have to populate the list of integers that lines in > > the file hash to. Subsequent comparisons are relatively cheap since > > this list of integers has already been computed. > > > > 2) This would only save us from at most N comparisons in the N x M > > matrix (since no file in this optimization is compared to more than > > one other) > > > > 3) Checking if two files have previously been compared requires more > > code, in what is already a tight nested loop. My experience > > attempting to modify that tight loop for extra conditions (e.g. don't > > compare files that are too large), is that it's easy to accidentally > > make the code slower. In fact, this is in part what led to the > > addition of the remove_unneed_paths_from_src() function. > > Even storing a single bit to say "these were already compared" takes > quadratic space. The hope is to not have quadratic behavior if it > can be avoided. > > > 4) There were plenty of other interesting ideas and maybe I was a tad lazy. :-) > > > > I think removing these already-compared cases could be done, but I > > just avoided it. If we were to do the "attempt to match files with > > the same extension" optimization that Stolee outlines/invents above, > > then we'd definitely need to consider it. Otherwise, it's just a > > minor additional optimization that someone could add to my patches. > > The more I think about it, the less my idea makes sense. I'm sorry > for adding noise to the thread. > > Thanks, > -Stolee