git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / mirror / code / Atom feed
From: Derrick Stolee <stolee@gmail.com>
To: Junio C Hamano <gitster@pobox.com>
Cc: Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>,
	git@vger.kernel.org, me@ttaylorr.com, jnareb@gmail.com,
	garimasigit@gmail.com, Derrick Stolee <dstolee@microsoft.com>
Subject: Re: [PATCH 1/3] revision: complicated pathspecs disable filters
Date: Wed, 15 Apr 2020 14:37:40 -0400	[thread overview]
Message-ID: <99e0ae2c-6b65-24e4-3d2b-1dff619a5daa@gmail.com> (raw)
In-Reply-To: <f57c7908-55ae-deae-e9ea-1909549e628c@gmail.com>

On 4/15/2020 9:27 AM, Derrick Stolee wrote:
>
> I'm not against the idea. Logically, collapsing case before hashing
> the Bloom keys should not increase the probabilities of false
> positives except in the situations where we have case conflicts.
> There is a small cost in the pre-hashing step to change the case of
> the paths, but that should be much lower than the cost of the hash
> itself and the tree parsing to find the changed paths.
> 
>> It all depends on how often people want :(icase) pathspec matches in
>> the history, I suspect.  My point was that we need to declare that
>> :(icase) won't matter in real life (hence we won't optimize our data
>> to support that use case), before the way in which the data stored
>> in the graph file is computed is cast in stone.
> 
> My earlier statement can be summarized as "we could make this happen"
> and you ask here "is it worth doing?"
> 
> I will play around with how complicated the change would be while
> the community considers the "is it worth doing?" question.

I played a bit, and it wasn't terrible to cover everything by
adjusting fill_bloom_key(). I also added a test that we will want
anyways.

There is more work to be done if this is the direction we want to
go, including double-checking that this doesn't cause significant
performance degradation.

The point of me sending this patch is to make the proposed
direction very clear how it would work. I'm still not sure how I
feel about it.

Thanks,
-Stolee

-->8--
From 89beb9598daabb19e3c896bbceeb0fc1b9ccc6ca Mon Sep 17 00:00:00 2001
From: Derrick Stolee <dstolee@microsoft.com>
Date: Wed, 15 Apr 2020 18:04:25 +0000
Subject: [PATCH] bloom: compute all Bloom hashes from lowercase

The changed-path Bloom filters currently hash path strings using
the exact string for the path. This makes it difficult* to use the
filters when restricting to case-insensitive pathspecs.

* I say "difficult" because it is possible to generate all 2^n
  options for the case of a path and test them all, but this is
  a bad idea and should not be done. "Impossible" is an appropriate
  alternative.

THIS IS A BREAKING CHANGE. Commit-graph files with changed-path
Bloom filters computed by a previous commit will not be compatible
with the filters computed in this commit, nor will we get correct
results when testing across these incompatible versions. Normally,
this would be a completely unacceptable change, but the filters
have not been released and hence are still possible to update
before release.

TODO: If we decide to move in this direction, then the following
steps should be done (and some of them should be done anyway):

* We need to document the Bloom filter format to specify exactly
  how we compute the filter data. The details should be careful
  enough that someone can reproduce the exact file format without
  looking at the C code.

* That document would include the tolower() transformation that is
  being done here.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 bloom.c                   | 16 ++++++++++++++--
 revision.c                |  5 +++--
 t/t6131-pathspec-icase.sh | 18 ++++++++++++++++++
 3 files changed, 35 insertions(+), 4 deletions(-)

diff --git a/bloom.c b/bloom.c
index dd9bab9bbd..c919c60f12 100644
--- a/bloom.c
+++ b/bloom.c
@@ -130,8 +130,20 @@ void fill_bloom_key(const char *data,
 	int i;
 	const uint32_t seed0 = 0x293ae76f;
 	const uint32_t seed1 = 0x7e646e2c;
-	const uint32_t hash0 = murmur3_seeded(seed0, data, len);
-	const uint32_t hash1 = murmur3_seeded(seed1, data, len);
+	uint32_t hash0, hash1;
+	static struct strbuf icase_data = STRBUF_INIT;
+	char *cur;
+
+	strbuf_setlen(&icase_data, 0);
+	strbuf_addstr(&icase_data, data);
+
+	for (cur = icase_data.buf; cur && *cur; cur++) {
+		if (isupper(*cur))
+			*cur = tolower(*cur);
+	}
+
+	hash0 = murmur3_seeded(seed0, icase_data.buf, len);
+	hash1 = murmur3_seeded(seed1, icase_data.buf, len);
 
 	key->hashes = (uint32_t *)xcalloc(settings->num_hashes, sizeof(uint32_t));
 	for (i = 0; i < settings->num_hashes; i++)
diff --git a/revision.c b/revision.c
index f78c636e4d..a02be25feb 100644
--- a/revision.c
+++ b/revision.c
@@ -652,13 +652,14 @@ static void trace2_bloom_filter_statistics_atexit(void)
 
 static int forbid_bloom_filters(struct pathspec *spec)
 {
+	int allowed_flags = PATHSPEC_LITERAL | PATHSPEC_ICASE;
 	if (spec->has_wildcard)
 		return 1;
 	if (spec->nr > 1)
 		return 1;
-	if (spec->magic & ~PATHSPEC_LITERAL)
+	if (spec->magic & ~allowed_flags)
 		return 1;
-	if (spec->nr && (spec->items[0].magic & ~PATHSPEC_LITERAL))
+	if (spec->nr && (spec->items[0].magic & ~allowed_flags))
 		return 1;
 
 	return 0;
diff --git a/t/t6131-pathspec-icase.sh b/t/t6131-pathspec-icase.sh
index 39fc3f6769..ee0766bd91 100755
--- a/t/t6131-pathspec-icase.sh
+++ b/t/t6131-pathspec-icase.sh
@@ -106,4 +106,22 @@ test_expect_success '"git diff" can take magic :(icase) pathspec' '
 	test_cmp expect actual
 '
 
+test_expect_success '"git log" takes the magic :(icase) pathspec' '
+	cat >expect <<-\EOF &&
+	FOO/BAR
+	FOO/bAr
+	FOO/bar
+	fOo/BAR
+	fOo/bAr
+	fOo/bar
+	foo/BAR
+	foo/bAr
+	foo/bar
+	EOF
+	git log --pretty=%s HEAD -- ":(icase)foo/bar" >actual &&
+	test_cmp expect actual &&
+	git log --pretty=%s HEAD -- ":(icase)foo" >actual &&
+	test_cmp expect actual
+'
+
 test_done



  reply	other threads:[~2020-04-15 19:17 UTC|newest]

Thread overview: 48+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-04-11  1:02 [PATCH 0/3] Integrate changed-path Bloom filters with 'git blame' Derrick Stolee via GitGitGadget
2020-04-11  1:02 ` [PATCH 1/3] revision: complicated pathspecs disable filters Derrick Stolee via GitGitGadget
2020-04-11 21:40   ` Junio C Hamano
2020-04-13 11:49     ` Derrick Stolee
2020-04-14 18:25       ` Junio C Hamano
2020-04-15 13:27         ` Derrick Stolee
2020-04-15 18:37           ` Derrick Stolee [this message]
2020-04-15 19:32             ` Junio C Hamano
2020-04-15 19:39               ` Junio C Hamano
2020-04-15 21:25             ` Junio C Hamano
2020-04-16  0:56               ` Taylor Blau
2020-04-15 22:18             ` Jakub Narębski
2020-04-16  0:52               ` Taylor Blau
2020-04-16 13:26                 ` Derrick Stolee
2020-04-16 16:33                   ` Taylor Blau
2020-04-16 18:02                     ` Junio C Hamano
2020-04-12 22:22   ` Taylor Blau
2020-04-12 22:30     ` Junio C Hamano
2020-04-13  0:07       ` Taylor Blau
2020-04-13 11:54         ` Derrick Stolee
2020-04-11  1:03 ` [PATCH 2/3] commit: write commit-graph with bloom filters Derrick Stolee via GitGitGadget
2020-04-11 21:57   ` Junio C Hamano
2020-04-12 20:51     ` Taylor Blau
2020-04-13 12:08       ` Derrick Stolee
2020-04-13 22:11         ` Junio C Hamano
2020-04-11  1:03 ` [PATCH 3/3] blame: use changed-path Bloom filters Derrick Stolee via GitGitGadget
2020-04-11 22:03   ` Junio C Hamano
2020-04-12  7:39     ` Eric Sunshine
2020-04-11 21:30 ` [PATCH 0/3] Integrate changed-path Bloom filters with 'git blame' Junio C Hamano
2020-04-13 14:45 ` [PATCH v2 0/4] " Derrick Stolee via GitGitGadget
2020-04-13 14:45   ` [PATCH v2 1/4] revision: complicated pathspecs disable filters Derrick Stolee via GitGitGadget
2020-04-13 16:09     ` Taylor Blau
2020-04-13 22:18       ` Junio C Hamano
2020-04-13 14:45   ` [PATCH v2 2/4] commit: write commit-graph with Bloom filters Derrick Stolee via GitGitGadget
2020-04-13 16:12     ` Taylor Blau
2020-04-13 22:21       ` Junio C Hamano
2020-04-14 15:04         ` Derrick Stolee
2020-04-14 17:26           ` Junio C Hamano
2020-04-14 17:40             ` Derrick Stolee
2020-04-15  0:17               ` Taylor Blau
2020-04-13 14:45   ` [PATCH v2 3/4] commit-graph: write commit-graph in more tests Derrick Stolee via GitGitGadget
2020-04-13 14:45   ` [PATCH v2 4/4] blame: use changed-path Bloom filters Derrick Stolee via GitGitGadget
2020-04-13 16:21   ` [PATCH v2 0/4] Integrate changed-path Bloom filters with 'git blame' Taylor Blau
2020-04-16 20:14   ` [PATCH v3 0/3] " Derrick Stolee via GitGitGadget
2020-04-16 20:14     ` [PATCH v3 1/3] revision: complicated pathspecs disable filters Derrick Stolee via GitGitGadget
2020-06-07 20:33       ` SZEDER Gábor
2020-04-16 20:14     ` [PATCH v3 2/3] tests: write commit-graph with Bloom filters Derrick Stolee via GitGitGadget
2020-04-16 20:14     ` [PATCH v3 3/3] blame: use changed-path " Derrick Stolee via GitGitGadget

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=99e0ae2c-6b65-24e4-3d2b-1dff619a5daa@gmail.com \
    --to=stolee@gmail.com \
    --cc=dstolee@microsoft.com \
    --cc=garimasigit@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=gitster@pobox.com \
    --cc=jnareb@gmail.com \
    --cc=me@ttaylorr.com \
    --subject='Re: [PATCH 1/3] revision: complicated pathspecs disable filters' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Code repositories for project(s) associated with this inbox:

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).