From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net X-Spam-Level: X-Spam-Status: No, score=-3.9 required=3.0 tests=AWL,BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,NICE_REPLY_A, SPF_HELO_PASS,SPF_PASS shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2 Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by dcvr.yhbt.net (Postfix) with ESMTP id 7A5A71F8C4 for ; Sun, 7 Feb 2021 14:40:48 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229621AbhBGOjY (ORCPT ); Sun, 7 Feb 2021 09:39:24 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:47470 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229510AbhBGOjV (ORCPT ); Sun, 7 Feb 2021 09:39:21 -0500 Received: from mail-ot1-x330.google.com (mail-ot1-x330.google.com [IPv6:2607:f8b0:4864:20::330]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 25268C06174A for ; Sun, 7 Feb 2021 06:38:41 -0800 (PST) Received: by mail-ot1-x330.google.com with SMTP id d7so9474830otq.6 for ; Sun, 07 Feb 2021 06:38:41 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=subject:to:cc:references:from:message-id:date:user-agent :mime-version:in-reply-to:content-language:content-transfer-encoding; bh=dLJftGDaTepuKg/4jLvajFz3T4KngFWThWjvjqkTDkI=; b=KcgkQiRRkG5kWx45fkWKgYQoRcp+g3R9onLY+3lP9Ls0jjYxbrU6K5lpRf2yeaB/dF 3sGQJzXOQy5WXojXTqFtkIJZ+tBpVajjHmdC58NxJQx8YIBwnZFFA0HhQRMIw5GxcXbE K8LK3C1hD6EUjcgS5NhzUB0LFaWnskX+rPAjJUnP2qeKnYFbSc9gu2QOHk+YJYsTXSbg HXfNf0CGa0uVR9kIPE3ObMbmj/k/GZeXKpD0C/OwoRdPemFChY0vnTGmdGrvXXSkwRDK UN81B8rcfpuG7GJAoddsJIY+I+0GNEp0FBGT/aDExAkYnYtKHLsoZPY4mqEPpf0qS1Fs Orzg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:subject:to:cc:references:from:message-id:date :user-agent:mime-version:in-reply-to:content-language :content-transfer-encoding; bh=dLJftGDaTepuKg/4jLvajFz3T4KngFWThWjvjqkTDkI=; b=VnQD34TG3arTI/qFFjp5VHOLhtQsOBedvqbL3OYdxOhmU1zxQJlVHaLy425YPwfmSx 6yZVzDyBz1XHCequaQR8A0k/0FPyXrYORaLnuUn6nJGw8jY9jPl8C2wEeg71frBZTP9B VtapF1+cGF687GfPOTKFox5mQsEvZRtbebp4e+8yatHxO+51HpjJaLZ9h6Jpbt9TBD5Y L8cNyN3wloW1o2sFRfMpw0qE8QA4kfwuA3Z6eacdArJ+Z3+loPvfyv46VrbBbuJan63i fj9Hi71CumaUxe8wFMZ1HA7rnEU+TVQjphECm8nDtJX5nc+FMPGIpdNaDWSYeJKBqqeg 8tuQ== X-Gm-Message-State: AOAM531cFw3t8OANr98fNSJ7NIkotF8GKa2sgJKzNk0xmZTXDv80mUf6 nTjqIPGtQ9EXu2YiqoMkSH2pbWoSnd77Tg== X-Google-Smtp-Source: ABdhPJx7cT0WPzJITEeY4jBtg5wD722fTDYWttnxfJFznaCVomS2DmFzWsnOvmp0uGZTBMGzHIlsEg== X-Received: by 2002:a05:6830:131a:: with SMTP id p26mr9235983otq.134.1612708720058; Sun, 07 Feb 2021 06:38:40 -0800 (PST) Received: from Derricks-MBP.attlocal.net ([2600:1700:e72:80a0:45c1:3f2d:a11c:fe19]) by smtp.gmail.com with ESMTPSA id x11sm765790oif.39.2021.02.07.06.38.38 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Sun, 07 Feb 2021 06:38:39 -0800 (PST) Subject: Re: [PATCH 3/3] diffcore-rename: guide inexact rename detection based on basenames To: Elijah Newren via GitGitGadget , git@vger.kernel.org Cc: Derrick Stolee , Jonathan Tan , Taylor Blau , Junio C Hamano , Jeff King , Elijah Newren References: <1d941c35076e8d515c8ff7ef01d6b9d8c092aaa9.1612651937.git.gitgitgadget@gmail.com> From: Derrick Stolee Message-ID: <9fbed0f9-032e-3f99-8467-f8a9cfa2d8f1@gmail.com> Date: Sun, 7 Feb 2021 09:38:37 -0500 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:78.0) Gecko/20100101 Thunderbird/78.7.0 MIME-Version: 1.0 In-Reply-To: <1d941c35076e8d515c8ff7ef01d6b9d8c092aaa9.1612651937.git.gitgitgadget@gmail.com> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 8bit Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org On 2/6/21 5:52 PM, Elijah Newren via GitGitGadget wrote: > From: Elijah Newren > > Make use of the new find_basename_matches() function added in the last > two patches, to find renames more rapidly in cases where we can match up > files based on basenames. This is a valuable heuristic. > For the testcases mentioned in commit 557ac0350d ("merge-ort: begin > performance work; instrument with trace2_region_* calls", 2020-10-28), > this change improves the performance as follows: > > Before After > no-renames: 13.815 s ± 0.062 s 13.138 s ± 0.086 s > mega-renames: 1799.937 s ± 0.493 s 169.488 s ± 0.494 s > just-one-mega: 51.289 s ± 0.019 s 5.061 s ± 0.017 s These numbers are very impressive. Before I get too deep into reviewing these patches, I do want to make it clear that the speed-up is coming at the cost of a behavior change. We are restricting the "best match" search to be first among files with common base name (although maybe I would use 'suffix'?). If we search for a rename among all additions and deletions ending the ".txt" we might find a similarity match that is 60% and declare that a rename, even if there is a ".txt" -> ".md" pair that has a 70% match. This could be documented in a test case, to demonstrate that we are making this choice explicitly. For example, here is a test that passes now, but would start failing with your patches (if I understand them correctly): diff --git a/t/t4001-diff-rename.sh b/t/t4001-diff-rename.sh index c16486a9d41..e4c71fcf3be 100755 --- a/t/t4001-diff-rename.sh +++ b/t/t4001-diff-rename.sh @@ -262,4 +262,21 @@ test_expect_success 'diff-tree -l0 defaults to a big rename limit, not zero' ' grep "myotherfile.*myfile" actual ' +test_expect_success 'multiple similarity choices' ' + test_write_lines line1 line2 line3 line4 line5 \ + line6 line7 line8 line9 line10 >delete.txt && + git add delete.txt && + git commit -m "base txt" && + + rm delete.txt && + test_write_lines line1 line2 line3 line4 line5 \ + line6 line7 line8 >add.txt && + test_write_lines line1 line2 line3 line4 line5 \ + line6 line7 line8 line9 >add.md && + git add add.txt add.md && + git commit -a -m "rename" && + git diff-tree -M HEAD HEAD^ >actual && + grep "add.md delete.txt" actual +' + test_done Personally, I'm fine with making this assumption. All of our renames are based on heuristics, so any opportunity to reduce the number of content comparisons is a win in my mind. We also don't report a rename unless there _is_ an add/delete pair that is sufficiently close in content. So, in this way, we are changing the optimization function that is used to determine the "best" rename available. It might be good to update documentation for how we choose renames: An add/delete pair is marked as a rename based on the following similarity function: 1. If the blob content is identical, then those files are marked as a rename. (Should we break ties here based on the basename?) 2. Among pairs whose content matches the minimum similarity limit, we optimize for: i. among files with the same basename (trailer after final '.') select pairs with highest similarity. ii. if no files with the same basename have the minimum similarity, then select pairs with highest similarity across all filenames. The above was written quickly as an attempt, so it will require careful editing to actually make sense to end users. Thanks, -Stolee