git@vger.kernel.org mailing list mirror (one of many)
From: Junio C Hamano <gitster@pobox.com>
To: Elijah Newren <newren@gmail.com>
Cc: Elijah Newren via GitGitGadget <gitgitgadget@gmail.com>,
	Git Mailing List <git@vger.kernel.org>,
	Derrick Stolee <dstolee@microsoft.com>,
	Jonathan Tan <jonathantanmy@google.com>,
	Taylor Blau <me@ttaylorr.com>, Jeff King <peff@peff.net>,
	Karsten Blees <blees@dcon.de>, Derrick Stolee <stolee@gmail.com>
Subject: Re: [PATCH v2 0/2] Optimization batch 6: make full use of exact renames
Date: Wed, 03 Feb 2021 15:26:13 -0800
Message-ID: <xmqq5z38bune.fsf@gitster.c.googlers.com>
In-Reply-To: <CABPp-BEgwfv70NRGgyAnHnQBPx4APSyYxNCbvH9F=7WGSj4DLQ@mail.gmail.com> (Elijah Newren's message of "Wed, 3 Feb 2021 15:06:26 -0800")

Elijah Newren <newren@gmail.com> writes:

> 3) It uses a similarity measure that diverges from what researchers
> used for MinHash and other fancy algorithms.  In particular,
>
>    size(A intersect B) / size(A union B)  !=  size(A intersect B) / max(size(A), size(B))
>
> The formula on the right-hand side would mean that if file A is a
> subset of file B, say the first 10% of file B, then it will be treated
> as 100% similar when most humans would look at it and say it is only
> 10% similar.

If you are talking about "you start from a 100-line file and append
900 lines of your own, and you still have 100% of the original
material remaining in the file", it is quite deliberate that we use
that as an indication that the original 100-line file is a good
candidate to have been renamed to the resulting 1000-line file.
It is a "what you have kept from the original" measure.
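
To make the two notions concrete, here is a minimal sketch in Python.
It is only an illustration of the measures as described in this
message, not the code in diffcore-rename: the function names
jaccard() and kept_from_original() are made up for the example, and
files are modelled as sets of unique lines.

```python
def jaccard(a, b):
    """size(A intersect B) / size(A union B), over unique lines."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def kept_from_original(original, result):
    """Fraction of the original's lines still present in the result.

    Hypothetical helper: one reading of the "what you have kept from
    the original" measure, not Git's actual scoring code.
    """
    original, result = set(original), set(result)
    return len(original & result) / len(original) if original else 1.0


# A 100-line file that had 900 new lines appended to it.
original = [f"original line {i}" for i in range(100)]
result = original + [f"appended line {i}" for i in range(900)]

print(jaccard(original, result))
print(kept_from_original(original, result))
```

Under these assumptions the set-based measure reports that only a
tenth of the combined material is shared, while the "kept" measure
reports that all of the original survives.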

Of course, taken to the extreme, this means that rename detection
does not have to be symmetrical.  "diff A B" may find that the
original 100-line file in A has grown into a 1000-line file in B
elsewhere, but "diff B A" or "diff -R A B" would not necessarily pair
these two blobs as matching.
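
The asymmetry can be sketched in a few lines of Python.  This is a toy
model under stated assumptions, not what diffcore-rename does: the
names fraction_kept() and is_rename_candidate() are hypothetical,
files are treated as sets of unique lines, and the 50% cutoff is
chosen only for illustration.

```python
def fraction_kept(source, destination):
    """Fraction of the source's unique lines found in the destination."""
    source, destination = set(source), set(destination)
    return len(source & destination) / len(source) if source else 1.0


def is_rename_candidate(source, destination, threshold=0.5):
    """Pair two files when enough of the source survives (toy rule)."""
    return fraction_kept(source, destination) >= threshold


small = [f"original line {i}" for i in range(100)]
large = small + [f"appended line {i}" for i in range(900)]

# Forward direction: everything in the small file is still there.
print(is_rename_candidate(small, large))

# Reverse direction: only a tenth of the large file survives.
print(is_rename_candidate(large, small))
```

With this toy rule the forward comparison pairs the two files and the
reverse comparison does not, which is the kind of one-directional
result described above.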

> Maybe the performance gains I'm adding elsewhere will offset possible
> grumpy users?

Users being what they are, that will never happen.  When they have
something to complain about, they will complain, regardless of what
else you do.


