git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Derrick Stolee <stolee@gmail.com>
To: Jeff King <peff@peff.net>,
	Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org, Derrick Stolee <dstolee@microsoft.com>,
	Philip Oakley <philipoakley@iee.email>
Subject: Re: [PATCH] bloom: ignore renames when computing changed paths
Date: Thu, 9 Apr 2020 07:56:43 -0400	[thread overview]
Message-ID: <72fa2e30-b841-9600-ae2c-21a269817f1c@gmail.com> (raw)
In-Reply-To: <20200408223111.GC3468797@coredump.intra.peff.net>

On 4/8/2020 6:31 PM, Jeff King wrote:
> On Wed, Apr 08, 2020 at 04:38:27PM +0000, Derrick Stolee via GitGitGadget wrote:
> 
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> The changed-path Bloom filters record an entry in the filter for
>> every path that was changed. This includes every add and delete,
>> regardless of whther a rename was detected. Detecting renames
>> causes significant performance issues, but also will trigger
>> downloading missing blobs in partial clone.
>>
>> The simple fix is to disable rename detection when computing a
>> changed-path Bloom filter.
> 
> Yes, we should be doing as simple a tree-diff as possible.
> 
> I wonder if this might actually be fixing a potential bug, too. The loop
> below this code only looks at the "two" half of each queued diff pair.
> With renames enabled, we might see the source path only in the "one"
> half of the rename pair.
> 
> However, this seems to work already. If I do:
> 
>   echo content >file
>   git add file
>   git commit -m added
>   echo change >file
>   git commit -am changed
>   git mv file other
>   git commit -am 'rename away'
>   git mv other file
>   git commit -am 'rename back'
>   git rm file
>   git commit -am removed
> 
>   git commit-graph write --reachable --changed-paths
>   git log --oneline -- file | cat
> 
> then I see all of the commits. Instrumenting Git like:
> 
> diff --git a/bloom.c b/bloom.c
> index c5b461d1cf..fb2a758e1d 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -207,6 +207,10 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>  		for (i = 0; i < diff_queued_diff.nr; i++) {
>  			const char *path = diff_queued_diff.queue[i]->two->path;
>  
> +			warning("queuing touched %c path %s for %s",
> +				diff_queued_diff.queue[i]->status,
> +				path, oid_to_hex(&c->object.oid));
> +
>  			/*
>  			* Add each leading directory of the changed file, i.e. for
>  			* 'dir/subdir/file' add 'dir' and 'dir/subdir' as well, so
> 
> results in:
> 
>   warning: queuing touched A path file for 2346d88b0cb4bca11c38ee545d007a7a14ca472a
>   warning: queuing touched M path file for 991cd7f0696ae29fea738ca1b8340c90dae4b201
>   warning: queuing touched D path file for d3642c9fb27459ea09f6c967a1e6ad119e265d6f
>   warning: queuing touched A path other for d3642c9fb27459ea09f6c967a1e6ad119e265d6f
>   warning: queuing touched A path file for bc908eb29e562d97ebb8cf718e41b69d3aa1d834
>   warning: queuing touched D path other for bc908eb29e562d97ebb8cf718e41b69d3aa1d834
>   warning: queuing touched D path file for 7433b46bd6aa170ab17a651c10658a5b0c10ba4f
> 
> So we really aren't detecting renames in the first place! And indeed,
> checking diffopt.detect_rename shows that it is unset. So I'm curious if
> there is a case where that would not be true. I _think_ it would only be
> true in a program which ran init_diff_ui_defaults(), but never in
> git-commit-graph.

So our issue was really that the partial clone prefetch logic was just
overly aggressive.

> Even if it does nothing in practice, I'm not at all opposed to having it
> in there as an explicit documentation of our expectations/requirements
> for the loop below. But it's probably worth saying so in the commit
> message.

I will update the message and send a v2. I'll fix the typo that Philip pointed
out, too.

Thanks,
-Stolee


  reply	other threads:[~2020-04-09 11:56 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-04-08 16:38 [PATCH] bloom: ignore renames when computing changed paths Derrick Stolee via GitGitGadget
2020-04-08 19:11 ` Junio C Hamano
2020-04-08 19:13 ` Philip Oakley
2020-04-08 22:31 ` Jeff King
2020-04-09 11:56   ` Derrick Stolee [this message]
2020-04-09 13:47     ` Jeff King
2020-04-09 14:00       ` Derrick Stolee
2020-04-09 14:15         ` Jeff King
2020-04-09 13:00 ` [PATCH v2] " Derrick Stolee via GitGitGadget

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=72fa2e30-b841-9600-ae2c-21a269817f1c@gmail.com \
    --to=stolee@gmail.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=peff@peff.net \
    --cc=philipoakley@iee.email \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).