git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "SZEDER Gábor" <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: "Junio C Hamano" <gitster@pobox.com>,
	"Garima Singh" <garima.singh@microsoft.com>,
	"Derrick Stolee" <stolee@gmail.com>,
	"Jakub Narebski" <jnareb@gmail.com>, "Jeff King" <peff@peff.net>,
	"Taylor Blau" <me@ttaylorr.com>,
	"SZEDER Gábor" <szeder.dev@gmail.com>
Subject: [PATCH 34/34] commit-graph: use modified path Bloom filters with wildcards, if possible
Date: Fri, 29 May 2020 10:50:38 +0200	[thread overview]
Message-ID: <20200529085038.26008-35-szeder.dev@gmail.com> (raw)
In-Reply-To: <20200529085038.26008-1-szeder.dev@gmail.com>

Modified path Bloom filter don't store the names of modified paths,
they only set a couple of bits based on those paths' hashes.
Consequently, they can only be used when looking for the history of a
concrete path, so we disabled them when looking for pathspecs with
wildcards.

However, if the pathspec has "wildcard-less" leading directories, then
we can use modified path Bloom filters to skip commits that don't
modify those leading directories.  As a result, something like:

  git -c core.modifiedPathBloomFilters=1 rev-list HEAD -- 'compat/win32/pthread.*'

will take only ~0.045s instead of ~1.24s, achieving over 27x speedup.

For comparison, letting the shell do the wildcard matching, i.e. the
equivalent of using the pathspecs 'compat/win32/pthread.c' and
'compat/win32/pthread.h' takes ~0.311s without using modified path
Bloom filters: apparently tree-diff with wildcards can be considerably
more expensive that without wildcards, even if the wildcard is in the
last path component and that directory only contains a dozen files.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 commit-graph.c | 44 ++++++++++++++++++++++++++++++++++----------
 1 file changed, 34 insertions(+), 10 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 8eb0cbedaf..db43877426 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1023,9 +1023,9 @@ static void compute_modified_path_bloom_hashes_for_path(const char *path,
 void init_pathspec_bloom_fields(struct repository *r,
 				struct pathspec *pathspec)
 {
-	const unsigned bloom_compatible_magic = PATHSPEC_LITERAL;
+	const unsigned bloom_compatible_magic = PATHSPEC_LITERAL | PATHSPEC_GLOB;
 	struct commit_graph *graph = r->objects->commit_graph;
-	int i;
+	int i, can_use_modified_path_bloom_filters;
 
 	if (!graph)
 		return;
@@ -1033,15 +1033,14 @@ void init_pathspec_bloom_fields(struct repository *r,
 		return;
 	if (!pathspec->nr)
 		return;
-	if (pathspec->has_wildcard)
-		return;
 	if (pathspec->magic & ~bloom_compatible_magic)
 		return;
 
+	can_use_modified_path_bloom_filters = 1;
 	for (i = 0; i < pathspec->nr; i++) {
 		struct pathspec_item *pi = &pathspec->items[i];
 		const char *path = pi->match, *p;
-		size_t len = pi->len;
+		size_t nowildcard_len = pi->nowildcard_len;
 		int path_component_nr = 0, j;
 		uint32_t *hashes;
 		struct bloom_filter embedded_bf;
@@ -1051,14 +1050,29 @@ void init_pathspec_bloom_fields(struct repository *r,
 		 * slashes, but a trailing slash might still be present,
 		 * "remove" it.
 		 */
-		if (path[len - 1] == '/')
-			len--;
+		if (path[nowildcard_len - 1] == '/')
+			nowildcard_len--;
 
 		p = path;
 		do {
 			p = strchrnul(p + 1, '/');
-			path_component_nr++;
-		} while (p - path < len);
+			if (p - path <= nowildcard_len)
+				path_component_nr++;
+		} while (p - path < nowildcard_len);
+		/*
+		 * If a pathspec uses wildcards but has wildcard-less
+		 * leading directories, then we can use modified path Bloom
+		 * filters to skip commits that don't modify those leading
+		 * directories.
+		 * However, if there is even one pathspec that has a wilcard
+		 * in its first path component, then we have no choice but
+		 * to run tree-diff anyway, so don't bother with Bloom
+		 * filters at all in that case.
+		 */
+		if (!path_component_nr) {
+			can_use_modified_path_bloom_filters = 0;
+			break;
+		}
 
 		pi->modified_path_bloom_hashes_nr = path_component_nr * graph->num_modified_path_bloom_hashes;
 		ALLOC_ARRAY(pi->modified_path_bloom_hashes,
@@ -1084,7 +1098,17 @@ void init_pathspec_bloom_fields(struct repository *r,
 				      pi->modified_path_bloom_hashes_nr);
 	}
 
-	pathspec->can_use_modified_path_bloom_filters = 1;
+	if (can_use_modified_path_bloom_filters) {
+		pathspec->can_use_modified_path_bloom_filters = 1;
+	} else {
+		int j;
+		for (j = 0; j < i; j++) {
+			struct pathspec_item *pi = &pathspec->items[j];
+			FREE_AND_NULL(pi->modified_path_bloom_hashes);
+			pi->modified_path_bloom_hashes_nr = 0;
+			pi->modified_path_bloom_mask = 0;
+		}
+	}
 }
 
 struct packed_commit_list {
-- 
2.27.0.rc1.431.g5c813f95dc


  parent reply	other threads:[~2020-05-29  8:52 UTC|newest]

Thread overview: 43+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-05-29  8:50 [PoC PATCH 00/34] An alternative modified path Bloom filters implementation SZEDER Gábor
2020-05-28 22:09 ` Johannes Schindelin
2020-05-29  8:50 ` [PATCH 01/34] tree-walk.c: don't match submodule entries for 'submod/anything' SZEDER Gábor
2020-05-29  8:50 ` [PATCH 02/34] commit-graph: fix parsing the Chunk Lookup table SZEDER Gábor
2020-05-29  8:50 ` [PATCH 03/34] commit-graph-format.txt: all multi-byte numbers are in network byte order SZEDER Gábor
2020-05-29  8:50 ` [PATCH 04/34] commit-slab: add a function to deep free entries on the slab SZEDER Gábor
2020-06-04 16:43   ` Derrick Stolee
2020-06-27 15:56     ` SZEDER Gábor
2020-06-29 11:30       ` Derrick Stolee
2020-05-29  8:50 ` [PATCH 05/34] diff.h: drop diff_tree_oid() & friends' return value SZEDER Gábor
2020-05-29  8:50 ` [PATCH 06/34] commit-graph: clean up #includes SZEDER Gábor
2020-05-29  8:50 ` [PATCH 07/34] commit-graph: simplify parse_commit_graph() #1 SZEDER Gábor
2020-05-29  8:50 ` [PATCH 08/34] commit-graph: simplify parse_commit_graph() #2 SZEDER Gábor
2020-05-29  8:50 ` [PATCH 09/34] commit-graph: simplify write_commit_graph_file() #1 SZEDER Gábor
2020-05-29  8:50 ` [PATCH 10/34] commit-graph: simplify write_commit_graph_file() #2 SZEDER Gábor
2020-05-29  8:50 ` [PATCH 11/34] commit-graph: allocate the 'struct chunk_info' array dinamically SZEDER Gábor
2020-05-29  8:50 ` [PATCH 12/34] commit-graph: unify the signatures of all write_graph_chunk_*() functions SZEDER Gábor
2020-05-29  8:50 ` [PATCH 13/34] commit-graph: simplify write_commit_graph_file() #3 SZEDER Gábor
2020-05-29  8:50 ` [PATCH 14/34] commit-graph: check chunk sizes after writing SZEDER Gábor
2020-05-29  8:50 ` [PATCH 15/34] commit-graph-format.txt: document the modified path Bloom filter chunks SZEDER Gábor
2020-06-02 17:59   ` SZEDER Gábor
2020-05-29  8:50 ` [PATCH 16/34] Add a generic and minimal Bloom filter implementation SZEDER Gábor
2020-05-29  8:50 ` [PATCH 17/34] Import a streaming-capable Murmur3 hash function implementation SZEDER Gábor
2020-05-29  8:50 ` [PATCH 18/34] commit-graph: write "empty" Modified Path Bloom Filter Index chunk SZEDER Gábor
2020-05-29  8:50 ` [PATCH 19/34] commit-graph: add commit slab for modified path Bloom filters SZEDER Gábor
2020-05-29  8:50 ` [PATCH 20/34] commit-graph: fill the Modified Path Bloom Filter Index chunk SZEDER Gábor
2020-05-29  8:50 ` [PATCH 21/34] commit-graph: load and use " SZEDER Gábor
2020-05-29  8:50 ` [PATCH 22/34] commit-graph: write the Modified Path Bloom Filters chunk SZEDER Gábor
2020-05-29  8:50 ` [PATCH 23/34] commit-graph: load and use " SZEDER Gábor
2020-05-29  8:50 ` [PATCH 24/34] commit-graph: check all leading directories in modified path Bloom filters SZEDER Gábor
2020-05-29  8:50 ` [PATCH 25/34] commit-graph: check embedded modified path Bloom filters with a mask SZEDER Gábor
2020-05-29  8:50 ` [PATCH 26/34] commit-graph: deduplicate modified path Bloom filters SZEDER Gábor
2020-05-29  8:50 ` [PATCH 27/34] commit-graph: load modified path Bloom filters for merge commits SZEDER Gábor
2020-05-29  8:50 ` [PATCH 28/34] commit-graph: write Modified Path Bloom Filter Merge Index chunk SZEDER Gábor
2020-05-29  8:50 ` [PATCH 29/34] commit-graph: extract init and free write_commit_graph_context SZEDER Gábor
2020-05-29  8:50 ` [PATCH 30/34] commit-graph: move write_commit_graph_reachable below write_commit_graph SZEDER Gábor
2020-05-29  8:50 ` [PATCH 31/34] t7007-show: make the first test compatible with the next patch SZEDER Gábor
2020-05-29  8:50 ` [PATCH 32/34] PoC commit-graph: use revision walk machinery for '--reachable' SZEDER Gábor
2020-05-29  8:50 ` [PATCH 33/34] commit-graph: write modified path Bloom filters in "history order" SZEDER Gábor
2020-05-29  8:50 ` SZEDER Gábor [this message]
2020-05-29 13:59 ` [PoC PATCH 00/34] An alternative modified path Bloom filters implementation Derrick Stolee
2020-06-01 23:25 ` Taylor Blau
2020-06-02 17:08   ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200529085038.26008-35-szeder.dev@gmail.com \
    --to=szeder.dev@gmail.com \
    --cc=garima.singh@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jnareb@gmail.com \
    --cc=me@ttaylorr.com \
    --cc=peff@peff.net \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).