From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net X-Spam-Level: X-Spam-ASN: AS53758 23.128.96.0/24 X-Spam-Status: No, score=-3.7 required=3.0 tests=AWL,BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,RCVD_IN_DNSWL_NONE, SPF_HELO_PASS,SPF_PASS shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2 Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by dcvr.yhbt.net (Postfix) with ESMTP id 4711A1F8C6 for ; Mon, 12 Jul 2021 20:23:55 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235838AbhGLU0k (ORCPT ); Mon, 12 Jul 2021 16:26:40 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:54298 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230284AbhGLU0j (ORCPT ); Mon, 12 Jul 2021 16:26:39 -0400 Received: from mail-ot1-x32a.google.com (mail-ot1-x32a.google.com [IPv6:2607:f8b0:4864:20::32a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0DFDDC0613DD for ; Mon, 12 Jul 2021 13:23:50 -0700 (PDT) Received: by mail-ot1-x32a.google.com with SMTP id f12-20020a056830204cb029048bcf4c6bd9so20182653otp.8 for ; Mon, 12 Jul 2021 13:23:50 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=t1yhJquqwxb3mf2MAKERPATK+5/wB596C2+6qxHHaTE=; b=KXQheejjRTvhaT+gd92nVOmvlm9my8v+jwkZtJo3YUz9UT5YB7gXsVrLx7nn2B3ui/ HstuOnw4/hZLUimvhUwFDWYTEql2B36iKGKJH5VJNIYzr1jxVayjGrwJy4Nixnh7NOEV kkGLOdXETC7e5XXLTyHkYnB+cQI0paT+6xrZpLLSLeJiMrm63MGgdkLU10nZSnVukPGx ixY03B8gmWwV0xqaV9AFcg7DidQwc2yF6KPDmmxo7CeFTWEj8fgX6MSvohr4LDP+yf8Z hvFwWpvY4uDFuEUMUrjaWvlfFkW9TdaKIO8IhZcGiUg4U4XAL6aQyAAp4+RXEAvsxyl2 n2WA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=t1yhJquqwxb3mf2MAKERPATK+5/wB596C2+6qxHHaTE=; b=RicEc/Ws1xuctcdktox58UIOGEY7xmXJ81N6ool7fMYWANzekigsMrU9Ccf/0q3ywQ gvQfSJAy3LnFOosj1in/ZErtLYoQ69OnAGR1KuHO2OMQMMmVCEzJjlaVJhXJz07I+pPv O7jVfukEyVXw4ICC3lfeFQdLn+XkXISuH2AJwTBAocy+WdWhwqReXIT4Br20dyFAPvys l6pC8ZDYAjv/rQmtJpfbKrhA8fhciEwT9N6BAXypfUYH7Gt9zNYTaAQaSg5g1N0DyewL GfbYYYOJwRURLqS54LELoSKYK/GvN9U3FPxkO3YaMucS0IHLlaiUG5q8xfOVI34Mp2DM Qd0A== X-Gm-Message-State: AOAM533KmbNjYL1wmU1Mvrc4eA37uoBWS3+BfT3OWzAWfst0pu2jrML1 9dfd3PWWgBkpfvJ+6uH6ojXCgvCAkqpWDitPR7I= X-Google-Smtp-Source: ABdhPJx5kJnNQSgY8RIEfXf0U00FnXZAXGTFo2h5TEYT6YJJS54pvxoliPskN9lCuyag9BIG0BKpdNzD9S92HZICAzc= X-Received: by 2002:a9d:6d03:: with SMTP id o3mr603334otp.316.1626121429320; Mon, 12 Jul 2021 13:23:49 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Elijah Newren Date: Mon, 12 Jul 2021 13:23:38 -0700 Message-ID: Subject: Re: [RFC] Bump {diff,merge}.renameLimit ? To: Jeff King Cc: Git Mailing List , Derrick Stolee , Junio C Hamano , Linus Torvalds , Jonathan Tan , =?UTF-8?B?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= Content-Type: text/plain; charset="UTF-8" Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org On Mon, Jul 12, 2021 at 10:16 AM Jeff King wrote: > > tl;dr Bumping the limit seems like a good idea to me. :-) > On Sat, Jul 10, 2021 at 05:28:43PM -0700, Elijah Newren wrote: > > > * My colleagues happily raised merge.renameLimit beyond 32767 when the > > artificial cap was removed. 10 minute waits were annoying, but much > > less so than having to manually cherry-pick commits (especially given > > the risk of getting it wrong).[13] > > One tricky thing here is that waiting 10 minutes may be worth it _if the > rename detection finds something_. If it doesn't, then it's just > annoying. > > I do think progress meters help a bit there, because then at least the > user understands what's going on. I'll go into more detail in the > sub-thread there. :) Another thing that helps here (at least for merges rather than diffs) is that we can determine a priori whether individual renames will have no effect on the merge result and just exclude those particular renames from the detection machinery. Well, at least we can do this with the new merge-ort backend. That should dramatically reduce the annoyance factor once we make it the default. ... > Yeah, it is definitely time to revisit the default numbers. I think at > one point we talked about letting it run for N wallclock seconds before > giving up, but we've been hesitant to introduce that kind of time-based > limit, because it ends up with non-deterministic results (plus you don't > realize you're not going to finish until you've already wasted a bunch > of time, whereas the static limits can avoid even beginning the work). Yeah, I'm kinda glad we didn't go that route; seems problematic to me. ... > I don't remember my methodology at this point, but perhaps it was based > on blobs in the graph, not just one tree, like: > > $ git rev-list --objects v2.6.25 | > git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' | > awk ' > /^blob/ { sum += $2; total += 1 } > END { print sum / total } > ' > 27535.8 > > I suspect the difference versus a single tree is that there is a > quadratic-ish property going on with file size: the bigger the file, the > more likely it is to be touched (so total storage is closer to bytes^2). Ah, gotcha. That makes sense. > Looking at single-tree blob sizes is probably better though, as rename > detection will happen between two single trees. Agreed. > > * I think the median file size is a better predictor of rename > > performance than mean file size, and median file size is ~2.5x smaller > > than the mean[18]. > > There you might get hit with the quadratic-update thing again, though. > The big files are more likely to be touched, so could be weighted more > (though are they more likely to have been added/delete/renamed? Who > knows). I'll agree that big files are more likely to be updated, but I don't think renames are weighted towards bigger files. In fact, I wrote a quick script to look at the sizes of all the renamed files in the history of v2.6.25, and the mean (8034.1) and median (3866) of the renamed files sizes in that history are comparable to the mean (11150.3) and median (4198) of the files sizes in the v2.6.25 tree. I re-did the calculations using v5.5, and found that the mean (12495.1) and median (3702) sizes of renames in all linux history up to that point again were a bit less than the mean (13449.2) and median (3860) file size of a file in the final v5.5 tree. Granted, this is a bit hand-wavy (what about creations or deletions? Is there too much bias from the fact that I did rename sizes over all history (due to needing enough to get statistics) while just grabbing regular file sizes just in the end tree?), but I think it provides pretty good first order approximation suggesting that mean/median sizes of files involved in rename detection will be similar to the mean/median sizes of other files within the relevant trees. > I don't think file size matters all _that_ much, though, as it has a > linear relationship to time spent. Whereas the number of entries is > quadratic. And of course the whole experiment is ball-parking in the > first place. We're looking for order-of-magnitude approximations, I'd > think. I agree that the number of entries is what's important; in fact, that's why I think the median file size is more important than the mean file size: In the case of the linux kernel, since the mean is 2.5x bigger than the median file size... => diffcore-rename checks file sizes before comparing content => Files more than 2x different in size can't be more than 50% similar => Therefore, files of the mean size will not be compared to files of the median size (or less) => Therefore, we automatically know that files of mean size will not be compared to more than half the files. > > * The feedback about the limit is better today than when we last changed > > the limits, and folks can configure a higher limit relatively easily. > > Many already have. > > I can't remember the last time I saw the limit kick in in practice, but > then I don't generally work with super-large repos (and my workflows > typically do not encourage merging across big segments of history). > Nor do I remember the topic coming up on the list after the last bump. > So maybe that means that people are happily bumping the limits > themselves via config. It might also mean that you're missing more emails than you used to, or just forgot them. :-) e.g.: https://lore.kernel.org/git/20171129201154.192379-1-jonathantanmy@google.com/ https://lore.kernel.org/git/20171113201600.24878-1-newren@gmail.com/ But I do certainly suspect it's come up less often than it would have before. > But I don't think that's really an argument against at least a moderate > bump. If it helps even a few people avoid having to learn about the > config, that's time saved. And it's a trivial code change on our end. :-)