git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Derrick Stolee <stolee@gmail.com>
To: Junio C Hamano <gitster@pobox.com>
Cc: Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>,
	git@vger.kernel.org, me@ttaylorr.com, jnareb@gmail.com,
	garimasigit@gmail.com, Derrick Stolee <dstolee@microsoft.com>
Subject: Re: [PATCH 1/3] revision: complicated pathspecs disable filters
Date: Wed, 15 Apr 2020 09:27:45 -0400	[thread overview]
Message-ID: <f57c7908-55ae-deae-e9ea-1909549e628c@gmail.com> (raw)
In-Reply-To: <xmqqmu7d9b6j.fsf@gitster.c.googlers.com>

On 4/14/2020 2:25 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
> 
>>> But if users may use icase pathspec very often, it may be worth
>>> considering to build the bloom filter after downcasing the paths,
>>> perhaps?  Given that many projects extract their source code to a
>>> case insensitive filesystem, I would imagine that downcasing paths
>>> would map two originally different paths into the same thing only
>>> rarely, if ever, so there may not be much downside to do so.
>>
>> This behavior could be extended later, and carefully. My initial
>> thought was that the case check would happen on every commit. If
>> the :(icase) check only happens at the walk tip(s), then we could
>> compute a single Bloom key at the start.
> 
> Sorry, I am not sure what you mean.

That's my fault. There are a couple of things I misunderstood here.

1. Thinking about "git blame" we would need to "collapse" a pathspec
   to a specific file before starting history. But blame doesn't
   allow ":(icase)" anyway.

2. With that context of "git blame" in my head, I was thinking
   (incorrectly) that "git log" would collapse the pathspec based on
   what file(s) match the pattern at HEAD. The tests in
   t6131-pathspec-icase.sh clearly show that this is wrong. In fact,
   if we apply the following diff to this patch, then we can get failures
   with the changed-path filters:

diff --git a/revision.c b/revision.c
index f78c636e4d..a02be25feb 100644
--- a/revision.c
+++ b/revision.c
@@ -652,13 +652,14 @@ static void trace2_bloom_filter_statistics_atexit(void)
 
 static int forbid_bloom_filters(struct pathspec *spec)
 {
+       int allowed_flags = PATHSPEC_LITERAL | PATHSPEC_ICASE;
        if (spec->has_wildcard)
                return 1;
        if (spec->nr > 1)
                return 1;
-       if (spec->magic & ~PATHSPEC_LITERAL)
+       if (spec->magic & ~allowed_flags)
                return 1;
-       if (spec->nr && (spec->items[0].magic & ~PATHSPEC_LITERAL))
+       if (spec->nr && (spec->items[0].magic & ~allowed_flags))
                return 1;
 
        return 0;

> Do you mean that we notice that the user wants to match 'foo' case
> insensitively, and tell the logic that uses changed-path records in
> the graph file that commits that cannot possibly have touched any or
> the paths 'foo', 'foO', 'fOo', ... (all 8 case permutations) are not
> interesting?
> 
> I guess that would work, but I was wondering if it is simpler
> without much downside if the changed-path records in the graph file
> are prepared on paths after they are normalized to a single case.
> That would lose information (e.g. you no longer can say "commits
> that touch the path 'foo' is interesting, but those that touch the
> path 'Foo' are not"), but makes the side that queries much simpler
> (i.e. you do not have to prepare all 8 case permutations---you only
> ask about 'foo').
> 
> And because the Bloom filter is used only for performance to cull
> commits that can never possibly match, allowing a false positive
> that would be discarded by actually running tree-diff anyway, the
> only potential downside happens when the project has too many paths
> that are different only in cases by increased collisions and by
> reducing our chances to skip running tree-diff (and never affects
> correctness).  
> 
> But this is not the "could be extended later" kind of behaviour, I
> am afraid.  It is baked in the data stored in the graph file.

Since the feature is not released, we still have time to update the
format if we so desired. With the current format, we would need to
disable the filters when using an :(icase) pathspec as the current
patch does.

I'm not against the idea. Logically, collapsing case before hashing
the Bloom keys should not increase the probabilities of false
positives except in the situations where we have case conflicts.
There is a small cost in the pre-hashing step to change the case of
the paths, but that should be much lower than the cost of the hash
itself and the tree parsing to find the changed paths.

> It all depends on how often people want :(icase) pathspec matches in
> the history, I suspect.  My point was that we need to declare that
> :(icase) won't matter in real life (hence we won't optimize our data
> to support that use case), before the way in which the data stored
> in the graph file is computed is cast in stone.

My earlier statement can be summarized as "we could make this happen"
and you ask here "is it worth doing?"

I will play around with how complicated the change would be while
the community considers the "is it worth doing?" question.

Thanks,
-Stolee

  reply	other threads:[~2020-04-15 13:27 UTC|newest]

Thread overview: 48+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-04-11  1:02 [PATCH 0/3] Integrate changed-path Bloom filters with 'git blame' Derrick Stolee via GitGitGadget
2020-04-11  1:02 ` [PATCH 1/3] revision: complicated pathspecs disable filters Derrick Stolee via GitGitGadget
2020-04-11 21:40   ` Junio C Hamano
2020-04-13 11:49     ` Derrick Stolee
2020-04-14 18:25       ` Junio C Hamano
2020-04-15 13:27         ` Derrick Stolee [this message]
2020-04-15 18:37           ` Derrick Stolee
2020-04-15 19:32             ` Junio C Hamano
2020-04-15 19:39               ` Junio C Hamano
2020-04-15 21:25             ` Junio C Hamano
2020-04-16  0:56               ` Taylor Blau
2020-04-15 22:18             ` Jakub Narębski
2020-04-16  0:52               ` Taylor Blau
2020-04-16 13:26                 ` Derrick Stolee
2020-04-16 16:33                   ` Taylor Blau
2020-04-16 18:02                     ` Junio C Hamano
2020-04-12 22:22   ` Taylor Blau
2020-04-12 22:30     ` Junio C Hamano
2020-04-13  0:07       ` Taylor Blau
2020-04-13 11:54         ` Derrick Stolee
2020-04-11  1:03 ` [PATCH 2/3] commit: write commit-graph with bloom filters Derrick Stolee via GitGitGadget
2020-04-11 21:57   ` Junio C Hamano
2020-04-12 20:51     ` Taylor Blau
2020-04-13 12:08       ` Derrick Stolee
2020-04-13 22:11         ` Junio C Hamano
2020-04-11  1:03 ` [PATCH 3/3] blame: use changed-path Bloom filters Derrick Stolee via GitGitGadget
2020-04-11 22:03   ` Junio C Hamano
2020-04-12  7:39     ` Eric Sunshine
2020-04-11 21:30 ` [PATCH 0/3] Integrate changed-path Bloom filters with 'git blame' Junio C Hamano
2020-04-13 14:45 ` [PATCH v2 0/4] " Derrick Stolee via GitGitGadget
2020-04-13 14:45   ` [PATCH v2 1/4] revision: complicated pathspecs disable filters Derrick Stolee via GitGitGadget
2020-04-13 16:09     ` Taylor Blau
2020-04-13 22:18       ` Junio C Hamano
2020-04-13 14:45   ` [PATCH v2 2/4] commit: write commit-graph with Bloom filters Derrick Stolee via GitGitGadget
2020-04-13 16:12     ` Taylor Blau
2020-04-13 22:21       ` Junio C Hamano
2020-04-14 15:04         ` Derrick Stolee
2020-04-14 17:26           ` Junio C Hamano
2020-04-14 17:40             ` Derrick Stolee
2020-04-15  0:17               ` Taylor Blau
2020-04-13 14:45   ` [PATCH v2 3/4] commit-graph: write commit-graph in more tests Derrick Stolee via GitGitGadget
2020-04-13 14:45   ` [PATCH v2 4/4] blame: use changed-path Bloom filters Derrick Stolee via GitGitGadget
2020-04-13 16:21   ` [PATCH v2 0/4] Integrate changed-path Bloom filters with 'git blame' Taylor Blau
2020-04-16 20:14   ` [PATCH v3 0/3] " Derrick Stolee via GitGitGadget
2020-04-16 20:14     ` [PATCH v3 1/3] revision: complicated pathspecs disable filters Derrick Stolee via GitGitGadget
2020-06-07 20:33       ` SZEDER Gábor
2020-04-16 20:14     ` [PATCH v3 2/3] tests: write commit-graph with Bloom filters Derrick Stolee via GitGitGadget
2020-04-16 20:14     ` [PATCH v3 3/3] blame: use changed-path " Derrick Stolee via GitGitGadget

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=f57c7908-55ae-deae-e9ea-1909549e628c@gmail.com \
    --to=stolee@gmail.com \
    --cc=dstolee@microsoft.com \
    --cc=garimasigit@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=gitster@pobox.com \
    --cc=jnareb@gmail.com \
    --cc=me@ttaylorr.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).