From: Derrick Stolee <stolee@gmail.com>
To: Jeff King <peff@peff.net>,
Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org, Derrick Stolee <dstolee@microsoft.com>,
Philip Oakley <philipoakley@iee.email>
Subject: Re: [PATCH] bloom: ignore renames when computing changed paths
Date: Thu, 9 Apr 2020 07:56:43 -0400 [thread overview]
Message-ID: <72fa2e30-b841-9600-ae2c-21a269817f1c@gmail.com> (raw)
In-Reply-To: <20200408223111.GC3468797@coredump.intra.peff.net>
On 4/8/2020 6:31 PM, Jeff King wrote:
> On Wed, Apr 08, 2020 at 04:38:27PM +0000, Derrick Stolee via GitGitGadget wrote:
>
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> The changed-path Bloom filters record an entry in the filter for
>> every path that was changed. This includes every add and delete,
>> regardless of whther a rename was detected. Detecting renames
>> causes significant performance issues, but also will trigger
>> downloading missing blobs in partial clone.
>>
>> The simple fix is to disable rename detection when computing a
>> changed-path Bloom filter.
>
> Yes, we should be doing as simple a tree-diff as possible.
>
> I wonder if this might actually be fixing a potential bug, too. The loop
> below this code only looks at the "two" half of each queued diff pair.
> With renames enabled, we might see the source path only in the "one"
> half of the rename pair.
>
> However, this seems to work already. If I do:
>
> echo content >file
> git add file
> git commit -m added
> echo change >file
> git commit -am changed
> git mv file other
> git commit -am 'rename away'
> git mv other file
> git commit -am 'rename back'
> git rm file
> git commit -am removed
>
> git commit-graph write --reachable --changed-paths
> git log --oneline -- file | cat
>
> then I see all of the commits. Instrumenting Git like:
>
> diff --git a/bloom.c b/bloom.c
> index c5b461d1cf..fb2a758e1d 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -207,6 +207,10 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
> for (i = 0; i < diff_queued_diff.nr; i++) {
> const char *path = diff_queued_diff.queue[i]->two->path;
>
> + warning("queuing touched %c path %s for %s",
> + diff_queued_diff.queue[i]->status,
> + path, oid_to_hex(&c->object.oid));
> +
> /*
> * Add each leading directory of the changed file, i.e. for
> * 'dir/subdir/file' add 'dir' and 'dir/subdir' as well, so
>
> results in:
>
> warning: queuing touched A path file for 2346d88b0cb4bca11c38ee545d007a7a14ca472a
> warning: queuing touched M path file for 991cd7f0696ae29fea738ca1b8340c90dae4b201
> warning: queuing touched D path file for d3642c9fb27459ea09f6c967a1e6ad119e265d6f
> warning: queuing touched A path other for d3642c9fb27459ea09f6c967a1e6ad119e265d6f
> warning: queuing touched A path file for bc908eb29e562d97ebb8cf718e41b69d3aa1d834
> warning: queuing touched D path other for bc908eb29e562d97ebb8cf718e41b69d3aa1d834
> warning: queuing touched D path file for 7433b46bd6aa170ab17a651c10658a5b0c10ba4f
>
> So we really aren't detecting renames in the first place! And indeed,
> checking diffopt.detect_rename shows that it is unset. So I'm curious if
> there is a case where that would not be true. I _think_ it would only be
> true in a program which ran init_diff_ui_defaults(), but never in
> git-commit-graph.
So our issue was really that the partial clone prefetch logic was just
overly aggressive.
> Even if it does nothing in practice, I'm not at all opposed to having it
> in there as an explicit documentation of our expectations/requirements
> for the loop below. But it's probably worth saying so in the commit
> message.
I will update the message and send a v2. I'll fix the typo that Philip pointed
out, too.
Thanks,
-Stolee
next prev parent reply other threads:[~2020-04-09 11:56 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-04-08 16:38 [PATCH] bloom: ignore renames when computing changed paths Derrick Stolee via GitGitGadget
2020-04-08 19:11 ` Junio C Hamano
2020-04-08 19:13 ` Philip Oakley
2020-04-08 22:31 ` Jeff King
2020-04-09 11:56 ` Derrick Stolee [this message]
2020-04-09 13:47 ` Jeff King
2020-04-09 14:00 ` Derrick Stolee
2020-04-09 14:15 ` Jeff King
2020-04-09 13:00 ` [PATCH v2] " Derrick Stolee via GitGitGadget
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: http://vger.kernel.org/majordomo-info.html
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=72fa2e30-b841-9600-ae2c-21a269817f1c@gmail.com \
--to=stolee@gmail.com \
--cc=dstolee@microsoft.com \
--cc=git@vger.kernel.org \
--cc=gitgitgadget@gmail.com \
--cc=peff@peff.net \
--cc=philipoakley@iee.email \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).