git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* git log exclude pathspec from file - supported? plans?
       [not found] <CACPiFCLtj5QF6_Goc5UYh9KHWgkrKtjApL-cCH04S5gdTFyk7Q@mail.gmail.com>
@ 2021-06-30 16:59 ` Martin Langhoff
  2021-06-30 17:58   ` Jeff King
  0 siblings, 1 reply; 8+ messages in thread
From: Martin Langhoff @ 2021-06-30 16:59 UTC (permalink / raw)
  To: Git Mailing List

Hi Git Gang!

long time no see! I'm doing some complex git repo spelunking and
pushing the boundaries of the pathspec magic for excludes.

Is there a reasonable way to provide a (potentially large) set of
excludes? something like

     git log --exclude-pathspec-file paths-to-exclude.txt .

Has there been discussion / patches / plans related to this? I may
have some cycles (hopefully!)

cheers,


m
-- 
 martin.langhoff@gmail.com
 - ask interesting questions  ~  http://linkedin.com/in/martinlanghoff
 - don't be distracted        ~  http://github.com/martin-langhoff
   by shiny stuff

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: git log exclude pathspec from file - supported? plans?
  2021-06-30 16:59 ` git log exclude pathspec from file - supported? plans? Martin Langhoff
@ 2021-06-30 17:58   ` Jeff King
  2021-06-30 18:22     ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 8+ messages in thread
From: Jeff King @ 2021-06-30 17:58 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Git Mailing List

On Wed, Jun 30, 2021 at 12:59:43PM -0400, Martin Langhoff wrote:

> long time no see! I'm doing some complex git repo spelunking and
> pushing the boundaries of the pathspec magic for excludes.
> 
> Is there a reasonable way to provide a (potentially large) set of
> excludes? something like
> 
>      git log --exclude-pathspec-file paths-to-exclude.txt .
> 
> Has there been discussion / patches / plans related to this? I may
> have some cycles (hopefully!)

You can feed pathspecs via --stdin. So:

  {
	echo "--"
	sed s/^/:^/ paths-to-exclude.txt
  } | git log --stdin

works. Obviously it's not as turn-key if you really do have a list of
paths in a file already, but it's much more flexible.

I'll caution you that the pathspec code is not well-optimized to handle
a large number of pathspecs. E.g.:

  [no pathspecs]
  $ time git rev-list HEAD /dev/null
  real	0m0.033s
  user	0m0.017s
  sys	0m0.017s

  [trivial pathspec; now we have to actually open up trees]
  $ { echo --; echo .; } >input
  $ time git rev-list HEAD --stdin <input >/dev/null
  real	0m1.338s
  user	0m1.294s
  sys	0m0.045s

  [lots of pathspecs; now we spend loads of time actually matching
   strings; the ^C is when I got bored and killed it]
  $ { echo --; git ls-files; } >input
  $ time git rev-list HEAD --stdin <input >/dev/null
  ^C
  real	1m24.406s
  user	1m24.369s
  sys	0m0.036s

The problem is that we try to linearly match every pathspec against
every path we consider, so it's quadratic-ish in the number of files in
the repo. I played a long time ago with storing non-wildcard pathspecs
in a trie that we could traverse as we talked the individual trees we
were matching. It performed well, but IIRC the interface was hacky (I
had to bolt it specifically onto the way the tree-walker uses
pathspecs, and the other pathspec matchers didn't benefit at all).

I can probably dig it up if anybody's interested in looking at it.

-Peff

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: git log exclude pathspec from file - supported? plans?
  2021-06-30 17:58   ` Jeff King
@ 2021-06-30 18:22     ` Ævar Arnfjörð Bjarmason
  2021-07-01 21:27       ` Jeff King
  0 siblings, 1 reply; 8+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-06-30 18:22 UTC (permalink / raw)
  To: Jeff King; +Cc: Martin Langhoff, Git Mailing List, Taylor Blau


On Wed, Jun 30 2021, Jeff King wrote:

> On Wed, Jun 30, 2021 at 12:59:43PM -0400, Martin Langhoff wrote:
>
>> long time no see! I'm doing some complex git repo spelunking and
>> pushing the boundaries of the pathspec magic for excludes.
>> 
>> Is there a reasonable way to provide a (potentially large) set of
>> excludes? something like
>> 
>>      git log --exclude-pathspec-file paths-to-exclude.txt .
>> 
>> Has there been discussion / patches / plans related to this? I may
>> have some cycles (hopefully!)
>
> You can feed pathspecs via --stdin. So:
>
>   {
> 	echo "--"
> 	sed s/^/:^/ paths-to-exclude.txt
>   } | git log --stdin
>
> works. Obviously it's not as turn-key if you really do have a list of
> paths in a file already, but it's much more flexible.
>
> I'll caution you that the pathspec code is not well-optimized to handle
> a large number of pathspecs. E.g.:
>
>   [no pathspecs]
>   $ time git rev-list HEAD /dev/null
>   real	0m0.033s
>   user	0m0.017s
>   sys	0m0.017s
>
>   [trivial pathspec; now we have to actually open up trees]
>   $ { echo --; echo .; } >input
>   $ time git rev-list HEAD --stdin <input >/dev/null
>   real	0m1.338s
>   user	0m1.294s
>   sys	0m0.045s
>
>   [lots of pathspecs; now we spend loads of time actually matching
>    strings; the ^C is when I got bored and killed it]
>   $ { echo --; git ls-files; } >input
>   $ time git rev-list HEAD --stdin <input >/dev/null
>   ^C
>   real	1m24.406s
>   user	1m24.369s
>   sys	0m0.036s
>
> The problem is that we try to linearly match every pathspec against
> every path we consider, so it's quadratic-ish in the number of files in
> the repo. I played a long time ago with storing non-wildcard pathspecs
> in a trie that we could traverse as we talked the individual trees we
> were matching. It performed well, but IIRC the interface was hacky (I
> had to bolt it specifically onto the way the tree-walker uses
> pathspecs, and the other pathspec matchers didn't benefit at all).
>
> I can probably dig it up if anybody's interested in looking at it.

If it's not too much trouble I'd find it interesting, but I likely won't
do anything with it any time soon.

One of the PCREv2 experiments I had very early WIP work towards was to
create a search index for commit messages, contents etc. and stick it in
something similar to the --changed-paths part of the commit-graph.

The PCREv2 codebase actually has (supposedly) a bug-for-bug compatible
implementation of our wildmatch function as a translator to a PCREv2
regex, I have a brnch somewhere where we run all our wildmatch tests
against it successfully.

So couple that with regex introspection, and a search index that
e.g. creates a trie bloom filter, then as long as your --grep=<RX>,
-G<RX> or pathspec has at least 3 fixed strings among its wildcards we
can ask the bloom filter "is this commit a candidate for this regex
searching this path/commit message/diff/whatever".

So you can have indexed matches for things like '*/test-lib.sh", not
just prefixes or fixed-strings.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: git log exclude pathspec from file - supported? plans?
  2021-06-30 18:22     ` Ævar Arnfjörð Bjarmason
@ 2021-07-01 21:27       ` Jeff King
  2021-07-01 21:30         ` [PATCH 1/3] pathspec: add optional trie index Jeff King
                           ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Jeff King @ 2021-07-01 21:27 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Martin Langhoff, Git Mailing List, Taylor Blau

On Wed, Jun 30, 2021 at 08:22:35PM +0200, Ævar Arnfjörð Bjarmason wrote:

> > The problem is that we try to linearly match every pathspec against
> > every path we consider, so it's quadratic-ish in the number of files in
> > the repo. I played a long time ago with storing non-wildcard pathspecs
> > in a trie that we could traverse as we talked the individual trees we
> > were matching. It performed well, but IIRC the interface was hacky (I
> > had to bolt it specifically onto the way the tree-walker uses
> > pathspecs, and the other pathspec matchers didn't benefit at all).
> >
> > I can probably dig it up if anybody's interested in looking at it.
> 
> If it's not too much trouble I'd find it interesting, but I likely won't
> do anything with it any time soon.

The patches are from 2014 (yikes). I forward ported them to the current
tip of master, but beware:

  - it fails a test in t4010 matching a gitlink with a trailing "/" in the
    pathspec (which I think just didn't exist back then). I suspect this
    is a simple bug (the trie has to record a bit for "this must be a
    dir", and it probably just doesn't understand gitlinks properly).

  - the original renamed diff_tree_sha1() to diff_tree_sha1_recurse(),
    which we then call internally while recursing (to pass the trie
    cursor). And then top-level calls use a diff_tree_sha1() wrapper
    that kicks off the traversal (to avoid disrupting other callers).

    When forward-porting, those functions got renamed and reorganized to
    handle multi-parent diffs. I was able to just use the internal
    ll_diff_tree_paths() as my "recurse" version, but I'm not 100% sure
    that's right (e.g., should internal calls to ll_diff_tree_oid also
    grow a cursor parameter?)

  - a few bits had to be ported from the 2-parent case to handle
    n-parents. It _looked_ pretty trivial as I did it, but it's entirely
    possible there are gotchas. Tests seem to pass, though (there are no
    specific tests for finding changes in a merge here, but all tree
    diffs should be using the new code).

  - there are a few basic tests included, but this was never run in any
    kind of production setting. Caveat emptor. :)

> One of the PCREv2 experiments I had very early WIP work towards was to
> create a search index for commit messages, contents etc. and stick it in
> something similar to the --changed-paths part of the commit-graph.

Yeah, to some degree --change-paths may mitigate this, since for a
series of simple pathspecs we'd generate the bloom filter once and then
get O(1) matching per commit.

My ulterior motive with all of this was to plug it into our custom
"blame-tree" (i.e., which commit last-touched each path), with the idea
being that we'd start with a big pathspec, and then shrink it as we find
answers. Removing an item from a bloom filter is awkward, but possible.

Anyway, here are the patches.

  [1/3]: pathspec: add optional trie index
  [2/3]: pathspec: turn on tries when appropriate
  [3/3]: tree-diff: use pathspec tries

 Makefile                      |   1 +
 pathspec.c                    | 127 ++++++++++++++++++++++++++++++++++
 pathspec.h                    |  13 ++++
 t/helper/test-pathspec.c      |  96 +++++++++++++++++++++++++
 t/helper/test-tool.c          |   1 +
 t/helper/test-tool.h          |   1 +
 t/perf/p4002-diff-pathspec.sh |  26 +++++++
 t/t6137-pathspec-trie.sh      |  57 +++++++++++++++
 tree-diff.c                   | 102 +++++++++++++++++++++++----
 9 files changed, 411 insertions(+), 13 deletions(-)
 create mode 100644 t/helper/test-pathspec.c
 create mode 100755 t/perf/p4002-diff-pathspec.sh
 create mode 100755 t/t6137-pathspec-trie.sh

-Peff

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 1/3] pathspec: add optional trie index
  2021-07-01 21:27       ` Jeff King
@ 2021-07-01 21:30         ` Jeff King
  2021-07-01 21:30         ` [PATCH 2/3] pathspec: turn on tries when appropriate Jeff King
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 8+ messages in thread
From: Jeff King @ 2021-07-01 21:30 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Martin Langhoff, Git Mailing List, Taylor Blau

When checking if a path matches our pathspec list, in the
worst case we walk the whole list linearly to check each
pathspec.  As a result, an operation like a tree diff does
`O(paths * pathspecs)` comparisons. If we assume that the
number of pathspecs may approach the number of paths, this
is `O(n^2)` in the number of paths.  We have to do it this
way in the general case because pathspecs may be arbitrary
fnmatch expressions, and we cannot create any meaningful
index.

However, it is common that all of the pathspecs are vanilla
prefixes (e.g., "git.c" or "Documentation/") that do not use
any wildcards or magic features. In such a case, we can
index the pathspec list to give us faster lookups.

The simplest solution would be to keep the pathspec list
sorted, and use a binary search over the entries for each
path. This turns out to be rather tricky, though, as we are
not looking for an exact match in the list. We must also
find prefix matches, and take care to handle trailing
slashes for directories.

Instead, this patch introduces a pathspec_trie struct, which
represents the pathspecs as a graph of entries. For
example, the pathspec list ["foo/bar", "foo/baz/"] would be
represented as:

        (bar, terminal)
       /
  (foo)
       \
        (baz, terminal, must_be_dir)

To see if a path "foo/eek" is covered by the pathspec, we
walk the trie, matching "foo" but ultimately seeing that
"eek" is not mentioned.

This provides us with two optimizations:

  1. The children of each trie node are simple strings
     representing directory components. This means we can sort
     and binary-search them, giving us logarithmic lookups
     (but note that rather than one log lookup on the whole
     pathname, it is a sequence of smaller log lookups on
     components of the pathname).

  2. Many users of the pathspec code do not feed full
     pathnames, but instead are walking a tree hierarchy
     themselves, and limiting the walk as they go. They can
     walk the trie at the same time, meanining they avoid
     looking repeatedly at parts of the pathspec we already
     know are matched. The current code, by contrast, will
     repeatedly match "foo/" against each pathspec when
     looking at "foo/bar", "foo/baz", etc.

Note that this patch just introduces the data structure; the
code is not used anywhere yet.

Signed-off-by: Jeff King <peff@peff.net>
---
 Makefile                 |   1 +
 pathspec.c               | 125 +++++++++++++++++++++++++++++++++++++++
 pathspec.h               |  12 ++++
 t/helper/test-pathspec.c |  98 ++++++++++++++++++++++++++++++
 t/helper/test-tool.c     |   1 +
 t/helper/test-tool.h     |   1 +
 t/t6137-pathspec-trie.sh |  57 ++++++++++++++++++
 7 files changed, 295 insertions(+)
 create mode 100644 t/helper/test-pathspec.c
 create mode 100755 t/t6137-pathspec-trie.sh

diff --git a/Makefile b/Makefile
index c3565fc0f8..675f51a330 100644
--- a/Makefile
+++ b/Makefile
@@ -726,6 +726,7 @@ TEST_BUILTINS_OBJS += test-online-cpus.o
 TEST_BUILTINS_OBJS += test-parse-options.o
 TEST_BUILTINS_OBJS += test-parse-pathspec-file.o
 TEST_BUILTINS_OBJS += test-path-utils.o
+TEST_BUILTINS_OBJS += test-pathspec.o
 TEST_BUILTINS_OBJS += test-pcre2-config.o
 TEST_BUILTINS_OBJS += test-pkt-line.o
 TEST_BUILTINS_OBJS += test-prio-queue.o
diff --git a/pathspec.c b/pathspec.c
index 08f8d3eedc..24a24f627e 100644
--- a/pathspec.c
+++ b/pathspec.c
@@ -760,3 +760,128 @@ int match_pathspec_attrs(struct index_state *istate,
 
 	return 1;
 }
+
+/*
+ * This is basically a strcmp, but we do not want the caller
+ * to have to terminate "a", so we pretend as if it had a NUL.
+ */
+static int pathspec_trie_cmp(const char *a, size_t a_len,
+			     const char *b)
+{
+	int ret = strncmp(a, b, a_len);
+	return ret ?
+		ret :
+		(unsigned char)0 - (unsigned char )b[a_len];
+}
+
+/*
+ * Do a binary search on one level of the pathspec_trie. If found,
+ * returns the offset of the item in the entry list. If not found,
+ * return a negative value encoding the offset where it would be inserted
+ * (you can recover the true offset with "-pos - 1").
+ */
+int pathspec_trie_lookup(const struct pathspec_trie *parent,
+			 const char *path, size_t len)
+{
+	int lo = 0, hi = parent->nr;
+	while (lo < hi) {
+		int mi = lo + ((hi - lo) / 2);
+		int cmp;
+
+		cmp = pathspec_trie_cmp(path, len, parent->entries[mi]->path);
+		if (!cmp)
+			return mi;
+		if (cmp < 0)
+			hi = mi;
+		else
+			lo = mi + 1;
+	}
+	return -lo - 1;
+}
+
+static struct pathspec_trie *alloc_pathspec_trie(const char *path, size_t len)
+{
+	struct pathspec_trie *ret = xcalloc(1, sizeof(*ret) + len);
+	memcpy(ret->path, path, len);
+	return ret;
+}
+
+/*
+ * Add "path" to the trie rooted at "t".
+ */
+static void add_pathspec_trie(struct pathspec_trie *t, const char *path)
+{
+	/*
+	 * Special case the empty path (i.e., "."), as our splitting algorithm
+	 * below assumes at least one component.
+	 */
+	if (!*path) {
+		t->terminal = 1;
+		return;
+	}
+
+	while (1) {
+		const char *slash = strchrnul(path, '/');
+		size_t len = slash - path;
+		int pos;
+
+		pos = pathspec_trie_lookup(t, path, len);
+		if (pos < 0) {
+			ALLOC_GROW(t->entries, t->nr + 1, t->alloc);
+
+			pos = -pos - 1;
+			if (pos < t->nr)
+				memmove(t->entries + pos + 1,
+					t->entries + pos,
+					sizeof(*t->entries) * (t->nr - pos));
+			t->entries[pos] = alloc_pathspec_trie(path, len);
+			t->nr++;
+		}
+
+		t = t->entries[pos];
+		path += len;
+
+		if (!*path) {
+			t->must_be_dir = 0;
+			t->terminal = 1;
+			return;
+		}
+
+		while (*path == '/')
+			path++;
+		if (!*path) {
+			/*
+			 * if we were already a terminal, then do not set
+			 * must_be_dir; we are "foo/", but we already had a
+			 * pathspec "foo", which is a superset of us.
+			 */
+			if (!t->terminal)
+				t->must_be_dir = 1;
+			t->terminal = 1;
+			return;
+		}
+	}
+}
+
+struct pathspec_trie *build_pathspec_trie(const struct pathspec *pathspec)
+{
+	struct pathspec_trie *ret;
+	int i;
+
+	/* we only make a trie for plain-vanilla pathspecs */
+	if (pathspec->has_wildcard || (pathspec->magic & ~PATHSPEC_LITERAL))
+		return NULL;
+
+	ret = alloc_pathspec_trie("", 0);
+
+	/*
+	 * XXX we could construct the trie more efficiently by creating each
+	 * node with all of its entries in sorted order. But this is much
+	 * simpler, and since we only do this once at the start of a traversal,
+	 * it's probably fast enough.
+	 */
+	for (i = 0; i < pathspec->nr; i++)
+		add_pathspec_trie(ret, pathspec->items[i].match);
+
+	return ret;
+}
diff --git a/pathspec.h b/pathspec.h
index fceebb876f..15c9244d08 100644
--- a/pathspec.h
+++ b/pathspec.h
@@ -22,6 +22,14 @@ struct index_state;
 
 #define PATHSPEC_ONESTAR 1	/* the pathspec pattern satisfies GFNM_ONESTAR */
 
+struct pathspec_trie {
+	struct pathspec_trie **entries;
+	int nr, alloc;
+	unsigned terminal:1,
+		 must_be_dir:1;
+	char path[FLEX_ARRAY];
+};
+
 /**
  * See glossary-context.txt for the syntax of pathspec.
  * In memory, a pathspec set is represented by "struct pathspec" and is
@@ -172,4 +180,8 @@ int match_pathspec_attrs(struct index_state *istate,
 			 const char *name, int namelen,
 			 const struct pathspec_item *item);
 
+struct pathspec_trie *build_pathspec_trie(const struct pathspec *);
+int pathspec_trie_lookup(const struct pathspec_trie *pst,
+			 const char *path, size_t len);
+
 #endif /* PATHSPEC_H */
diff --git a/t/helper/test-pathspec.c b/t/helper/test-pathspec.c
new file mode 100644
index 0000000000..3f1b8f1a79
--- /dev/null
+++ b/t/helper/test-pathspec.c
@@ -0,0 +1,98 @@
+#include "test-tool.h"
+#include "pathspec.h"
+
+static const char usage_msg[] =
+"test-tool pathspec trie [pathspecs...] -- [paths....]";
+
+/*
+ * XXX Yuck. This is a lot of complicated code specific to our test. Even if it
+ * runs correctly, we have no real guarantee that the actual trie users are
+ * doing it right. And reusing their code is tough, because it happens as part
+ * of their own traversals (e.g., we walk the pathspec trie while walking the
+ * tree objects themselves).
+ *
+ * This whole test program should probably go away in favor of directly testing
+ * the tree-diff code.
+ */
+static int trie_match(const struct pathspec_trie *pst,
+		      const char *path)
+{
+	int pathlen = strlen(path);
+	int is_dir = 0;
+
+	if (pathlen > 0 && path[pathlen-1] == '/') {
+		is_dir = 1;
+		pathlen--;
+	}
+
+	while (pathlen) {
+		const char *slash = memchr(path, '/', pathlen);
+		int component_len;
+		int pos;
+
+		if (slash)
+			component_len = slash - path;
+		else
+			component_len = pathlen;
+
+		pos = pathspec_trie_lookup(pst, path, component_len);
+		if (pos < 0)
+			return 0;
+
+		pst = pst->entries[pos];
+		path += component_len;
+		pathlen -= component_len;
+
+		while (pathlen && *path == '/') {
+			path++;
+			pathlen--;
+		}
+
+		if (pst->terminal) {
+			if (!pst->must_be_dir)
+				return 1;
+			if (pathlen)
+				return 1;
+			return is_dir;
+		}
+	}
+	return 0;
+}
+
+static int cmd_trie(const char **argv)
+{
+	const char **specs, **paths;
+	struct pathspec pathspec;
+	struct pathspec_trie *trie;
+
+	paths = specs = argv;
+	while (*paths && strcmp(*paths, "--"))
+		paths++;
+	if (*paths)
+		*paths++ = NULL;
+
+	parse_pathspec(&pathspec, 0, 0, "", specs);
+	trie = build_pathspec_trie(&pathspec);
+	if (!trie)
+		die("unable to make trie from pathspec");
+
+	for (; *paths; paths++) {
+		if (trie_match(trie, *paths))
+			printf("yes\n");
+		else
+			printf("no\n");
+	}
+	return 0;
+}
+
+int cmd__pathspec(int argc, const char **argv)
+{
+	const char *cmd = argv[1];
+
+	if (!cmd)
+		usage(usage_msg);
+	else if (!strcmp(cmd, "trie"))
+		return cmd_trie(argv + 2);
+	else
+		die("unknown cmd: %s", cmd);
+}
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index c5bd0c6d4c..c5dc6153e0 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -47,6 +47,7 @@ static struct test_cmd cmds[] = {
 	{ "parse-options", cmd__parse_options },
 	{ "parse-pathspec-file", cmd__parse_pathspec_file },
 	{ "path-utils", cmd__path_utils },
+	{ "pathspec", cmd__pathspec },
 	{ "pcre2-config", cmd__pcre2_config },
 	{ "pkt-line", cmd__pkt_line },
 	{ "prio-queue", cmd__prio_queue },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index e8069a3b22..878aba1bc8 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -35,6 +35,7 @@ int cmd__oidmap(int argc, const char **argv);
 int cmd__online_cpus(int argc, const char **argv);
 int cmd__parse_options(int argc, const char **argv);
 int cmd__parse_pathspec_file(int argc, const char** argv);
+int cmd__pathspec(int argc, const char **argv);
 int cmd__path_utils(int argc, const char **argv);
 int cmd__pcre2_config(int argc, const char **argv);
 int cmd__pkt_line(int argc, const char **argv);
diff --git a/t/t6137-pathspec-trie.sh b/t/t6137-pathspec-trie.sh
new file mode 100755
index 0000000000..ca2935c7db
--- /dev/null
+++ b/t/t6137-pathspec-trie.sh
@@ -0,0 +1,57 @@
+#!/bin/sh
+
+test_description='test optimized pathspec trie lookup'
+. ./test-lib.sh
+
+# This is a basic lookup test; the offsets are there to provide
+# some variation in where we land in our binary search.
+ps="faa fzz foo/bar foo/baa foo/bzz"
+for offset in a "a b" "a b c"; do
+	test_expect_success "lookups with ps=$offset $ps" "
+		cat >expect <<-\\EOF &&
+		no
+		yes
+		yes
+		no
+		EOF
+		test-tool pathspec trie $offset $ps -- f faa foo/bar foo/baz >actual &&
+		test_cmp expect actual
+	"
+done
+
+test_expect_success 'pathspecs match by prefix' '
+	cat >expect <<-\EOF &&
+	yes
+	yes
+	yes
+	EOF
+	test-tool pathspec trie foo -- foo foo/bar foo/with/deep/subdirs >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'trailing slash sets must_be_dir' '
+	cat >expect <<-\EOF &&
+	no
+	yes
+	yes
+	EOF
+	test-tool pathspec trie dir/ -- dir dir/ dir/foo
+'
+
+test_expect_success 'overlapping pathspecs allow the "loose" side' '
+	echo yes >expect &&
+	test-tool pathspec trie foo foo/ -- foo >actual &&
+	test_cmp expect actual &&
+	test-tool pathspec trie foo/ foo -- foo >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success '"." at the root matches everything' '
+	cat >expect <<-\EOF &&
+	yes
+	yes
+	EOF
+	test-tool pathspec trie . -- foo foo/bar
+'
+
+test_done
-- 
2.32.0.359.g3de86e008e.dirty


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 2/3] pathspec: turn on tries when appropriate
  2021-07-01 21:27       ` Jeff King
  2021-07-01 21:30         ` [PATCH 1/3] pathspec: add optional trie index Jeff King
@ 2021-07-01 21:30         ` Jeff King
  2021-07-01 21:36         ` [PATCH 3/3] tree-diff: use pathspec tries Jeff King
  2021-07-01 21:43         ` git log exclude pathspec from file - supported? plans? Jeff King
  3 siblings, 0 replies; 8+ messages in thread
From: Jeff King @ 2021-07-01 21:30 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Martin Langhoff, Git Mailing List, Taylor Blau

An earlier commit introduced pathspec_tries, but we did not
actually generate them by default. This patch causes us to
do so when it is possible (i.e., when no wildcards or other
pathspec magic are in use). This doesn't actually do
anything yet, though, as none of the pathspec users have
learned to make use of the tries.

We embed the pathspec_trie directly inside the "struct
pathspec". This is not strictly necessary, as once created,
the trie does not depend on the original pathspec. However,
since the intended use is to optimize existing pathspec
callers, passing the trie around as part of the pathspec
will minimize disruption to the call chain.

Signed-off-by: Jeff King <peff@peff.net>
---
 pathspec.c               | 2 ++
 pathspec.h               | 1 +
 t/helper/test-pathspec.c | 6 ++----
 3 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/pathspec.c b/pathspec.c
index 24a24f627e..435dfd3117 100644
--- a/pathspec.c
+++ b/pathspec.c
@@ -639,6 +639,8 @@ void parse_pathspec(struct pathspec *pathspec,
 			BUG("PATHSPEC_MAXDEPTH_VALID and PATHSPEC_KEEP_ORDER are incompatible");
 		QSORT(pathspec->items, pathspec->nr, pathspec_item_cmp);
 	}
+
+	pathspec->trie = build_pathspec_trie(pathspec);
 }
 
 void parse_pathspec_file(struct pathspec *pathspec, unsigned magic_mask,
diff --git a/pathspec.h b/pathspec.h
index 15c9244d08..5329c0a6f6 100644
--- a/pathspec.h
+++ b/pathspec.h
@@ -61,6 +61,7 @@ struct pathspec {
 		} *attr_match;
 		struct attr_check *attr_check;
 	} *items;
+	struct pathspec_trie *trie;
 };
 
 #define GUARD_PATHSPEC(ps, mask) \
diff --git a/t/helper/test-pathspec.c b/t/helper/test-pathspec.c
index 3f1b8f1a79..0fb059409b 100644
--- a/t/helper/test-pathspec.c
+++ b/t/helper/test-pathspec.c
@@ -63,7 +63,6 @@ static int cmd_trie(const char **argv)
 {
 	const char **specs, **paths;
 	struct pathspec pathspec;
-	struct pathspec_trie *trie;
 
 	paths = specs = argv;
 	while (*paths && strcmp(*paths, "--"))
@@ -72,12 +71,11 @@ static int cmd_trie(const char **argv)
 		*paths++ = NULL;
 
 	parse_pathspec(&pathspec, 0, 0, "", specs);
-	trie = build_pathspec_trie(&pathspec);
-	if (!trie)
+	if (!pathspec.trie)
 		die("unable to make trie from pathspec");
 
 	for (; *paths; paths++) {
-		if (trie_match(trie, *paths))
+		if (trie_match(pathspec.trie, *paths))
 			printf("yes\n");
 		else
 			printf("no\n");
-- 
2.32.0.359.g3de86e008e.dirty


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 3/3] tree-diff: use pathspec tries
  2021-07-01 21:27       ` Jeff King
  2021-07-01 21:30         ` [PATCH 1/3] pathspec: add optional trie index Jeff King
  2021-07-01 21:30         ` [PATCH 2/3] pathspec: turn on tries when appropriate Jeff King
@ 2021-07-01 21:36         ` Jeff King
  2021-07-01 21:43         ` git log exclude pathspec from file - supported? plans? Jeff King
  3 siblings, 0 replies; 8+ messages in thread
From: Jeff King @ 2021-07-01 21:36 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Martin Langhoff, Git Mailing List, Taylor Blau

The tree-diff currently matches each pathspec against every
entry of the tree. For the common case of a handful of
pathspecs, this is not a big deal. However, if you have a
large number of pathspecs, it gets noticeably slow.

Now that we have the pathspec_trie optimization, we can do
much better (at least for simple cases without wildcards).
Here are numbers for running "git rev-list" with limiting
pathspecs of varying sizes, both before and after this
patch:

Test                 origin              HEAD
---------------------------------------------------------------
4001.2: size=1       0.12(0.11+0.00)     0.12(0.12+0.00) +0.0%
4001.3: size=2       0.17(0.16+0.00)     0.16(0.15+0.01) -5.9%
4001.4: size=4       0.17(0.17+0.00)     0.17(0.17+0.00) +0.0%
4001.5: size=8       0.21(0.20+0.00)     0.20(0.20+0.00) -4.8%
4001.6: size=16      0.25(0.24+0.00)     0.21(0.20+0.00) -16.0%
4001.7: size=32      0.31(0.31+0.00)     0.21(0.20+0.00) -32.3%
4001.8: size=64      0.43(0.41+0.01)     0.21(0.21+0.00) -51.2%
4001.9: size=128     0.73(0.72+0.00)     0.22(0.21+0.00) -69.9%
4001.10: size=256    2.02(2.02+0.00)     0.37(0.36+0.00) -81.7%
4001.11: size=512    6.78(6.78+0.00)     0.64(0.64+0.00) -90.6%
4001.12: size=1024   23.67(23.67+0.02)   1.22(1.20+0.01) -94.8%

For small pathspecs, we produce no real difference (which is
good; we know we are asymptotically better, but we have not
regressed our constant factor). Between 16 and 32 pathspecs we
start to see some small improvement, and the benefit keeps
growing with the number of pathspecs.

Obviously these large-pathspec cases are unusual. But you
might use them, for example, if the pathspecs were generated
programatically (e.g., if you want the history of all files
that are in Documentation/ _now_, not what was historically
ever there, you would expand the pathspec at the current
tree, and feed the result to rev-list).

Signed-off-by: Jeff King <peff@peff.net>
---
This commit message is the original from 2014. The perf script got
bumped to p4002, but the results today are similar (actually, the
"before" code is slower, but perhaps that's because we have a lot more
commits in git.git these days).

 t/perf/p4002-diff-pathspec.sh |  26 +++++++++
 tree-diff.c                   | 102 +++++++++++++++++++++++++++++-----
 2 files changed, 115 insertions(+), 13 deletions(-)
 create mode 100755 t/perf/p4002-diff-pathspec.sh

diff --git a/t/perf/p4002-diff-pathspec.sh b/t/perf/p4002-diff-pathspec.sh
new file mode 100755
index 0000000000..3d35b3579d
--- /dev/null
+++ b/t/perf/p4002-diff-pathspec.sh
@@ -0,0 +1,26 @@
+#!/bin/sh
+
+test_description='diff performance with many pathspecs'
+. ./perf-lib.sh
+
+test_perf_default_repo
+
+sizes='1 2 4 8 16 32 64 128 256 512 1024'
+
+test_expect_success 'create pathspec lists' '
+	git ls-tree --name-only -r HEAD >all &&
+	for i in $sizes; do
+		{
+			printf "%s\n" -- &&
+			head -$i all
+		} >ps-$i
+	done
+'
+
+for i in $sizes; do
+	test_perf "size=$i" "
+		git rev-list HEAD --stdin <ps-$i >/dev/null
+	"
+done
+
+test_done
diff --git a/tree-diff.c b/tree-diff.c
index 1572615bd9..47836941dd 100644
--- a/tree-diff.c
+++ b/tree-diff.c
@@ -28,7 +28,8 @@
 static struct combine_diff_path *ll_diff_tree_paths(
 	struct combine_diff_path *p, const struct object_id *oid,
 	const struct object_id **parents_oid, int nparent,
-	struct strbuf *base, struct diff_options *opt);
+	struct strbuf *base, struct diff_options *opt,
+	struct pathspec_trie *pst);
 static void ll_diff_tree_oid(const struct object_id *old_oid,
 			     const struct object_id *new_oid,
 			     struct strbuf *base, struct diff_options *opt);
@@ -179,7 +180,7 @@ static struct combine_diff_path *path_appendnew(struct combine_diff_path *last,
 static struct combine_diff_path *emit_path(struct combine_diff_path *p,
 	struct strbuf *base, struct diff_options *opt, int nparent,
 	struct tree_desc *t, struct tree_desc *tp,
-	int imin)
+	int imin, struct pathspec_trie *pst)
 {
 	unsigned short mode;
 	const char *path;
@@ -283,23 +284,95 @@ static struct combine_diff_path *emit_path(struct combine_diff_path *p,
 			parents_oid[i] = tpi_valid ? &tp[i].entry.oid : NULL;
 		}
 
+		/*
+		 * As we recurse through the tree objects, move through
+		 * our pathspec trie, as well. The one exception is if
+		 * we already hit a terminal node. This means we have a strict
+		 * prefix match (e.g., "foo/" matched, and we are in
+		 * "foo/bar"). We don't have to bother with checking the
+		 * pathspec at all anymore in that case.
+		 *
+		 * Note that the "pos < 0" case should not happen here,
+		 * as we would have skipped the tree entry as uninteresting
+		 * earlier. As a safety measure, we turn off the trie
+		 * optimization and fall back to doing regular pathspec
+		 * matching in this case.
+		 */
+		if (pst && !pst->terminal) {
+			int pos = pathspec_trie_lookup(pst, path, pathlen);
+			if (pos < 0)
+				pst = NULL;
+			else
+				pst = pst->entries[pos];
+		}
+
 		strbuf_add(base, path, pathlen);
 		strbuf_addch(base, '/');
-		p = ll_diff_tree_paths(p, oid, parents_oid, nparent, base, opt);
+		p = ll_diff_tree_paths(p, oid, parents_oid, nparent, base, opt, pst);
 		FAST_ARRAY_FREE(parents_oid, nparent);
 	}
 
 	strbuf_setlen(base, old_baselen);
 	return p;
 }
 
+static enum interesting match_pathspec_trie_entry(struct pathspec_trie *pst,
+						  const struct name_entry *entry)
+{
+	int pos;
+
+	/*
+	 * If our base directory is matched, then everything below is
+	 * interesting (i.e., a prefix match).
+	 */
+	if (pst->terminal)
+		return entry_interesting;
+
+	/*
+	 * Otherwise, look up the actual entry. If we don't mention it at all,
+	 * it's definitely uninteresting. But furthermore, if we're at the
+	 * end of our sorted list, we know that nothing after it is
+	 * interesting, either.
+	 *
+	 * XXX It seems like we should have to make special consideration here
+	 * for the sort order of trees. But tree_entry_interesting does not
+	 * seem to. Is it OK, is tree_entry_interesting buggy too, or am I
+	 * reading it wrong? This optimization gives substantial speedups, so
+	 * we really need to keep it or something like it.
+	 */
+	pos = pathspec_trie_lookup(pst, entry->path, tree_entry_len(entry));
+	if (pos < 0) {
+		if (-pos - 1 == pst->nr)
+			return all_entries_not_interesting;
+		else
+			return entry_not_interesting;
+	}
+
+	/*
+	 * We definitely have the entry. First we have to resolve any directory
+	 * restrictions; if there aren't any, then it's definitely interesting.
+	 *
+	 * Note that we do not need to check the "terminal" flag of the
+	 * resulting trie node. If it is not set, then this particular entry
+	 * does not match our pathspec, but we do still need to traverse
+	 * through it to get to the interesting things inside. It's interesting
+	 * either way.
+	 */
+	if (pst->entries[pos]->must_be_dir)
+		return !!S_ISDIR(entry->mode);
+	return entry_interesting;
+}
+
 static void skip_uninteresting(struct tree_desc *t, struct strbuf *base,
-			       struct diff_options *opt)
+			       struct diff_options *opt,
+			       struct pathspec_trie *pst)
 {
 	enum interesting match;
 
 	while (t->size) {
-		match = tree_entry_interesting(opt->repo->index, &t->entry,
+		match = pst ?
+			match_pathspec_trie_entry(pst, &t->entry) :
+			tree_entry_interesting(opt->repo->index, &t->entry,
 					       base, 0, &opt->pathspec);
 		if (match) {
 			if (match == all_entries_not_interesting)
@@ -310,7 +383,6 @@ static void skip_uninteresting(struct tree_desc *t, struct strbuf *base,
 	}
 }
 
-
 /*
  * generate paths for combined diff D(sha1,parents_oid[])
  *
@@ -406,7 +478,8 @@ static inline void update_tp_entries(struct tree_desc *tp, int nparent)
 static struct combine_diff_path *ll_diff_tree_paths(
 	struct combine_diff_path *p, const struct object_id *oid,
 	const struct object_id **parents_oid, int nparent,
-	struct strbuf *base, struct diff_options *opt)
+	struct strbuf *base, struct diff_options *opt,
+	struct pathspec_trie *pst)
 {
 	struct tree_desc t, *tp;
 	void *ttree, **tptree;
@@ -438,9 +511,9 @@ static struct combine_diff_path *ll_diff_tree_paths(
 			break;
 
 		if (opt->pathspec.nr) {
-			skip_uninteresting(&t, base, opt);
+			skip_uninteresting(&t, base, opt, pst);
 			for (i = 0; i < nparent; i++)
-				skip_uninteresting(&tp[i], base, opt);
+				skip_uninteresting(&tp[i], base, opt, pst);
 		}
 
 		/* comparing is finished when all trees are done */
@@ -505,7 +578,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
 
 			/* D += {δ(t,pi) if pi=p[imin];  "+a" if pi > p[imin]} */
 			p = emit_path(p, base, opt, nparent,
-					&t, tp, imin);
+					&t, tp, imin, pst);
 
 		skip_emit_t_tp:
 			/* t↓,  ∀ pi=p[imin]  pi↓ */
@@ -517,7 +590,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
 		else if (cmp < 0) {
 			/* D += "+t" */
 			p = emit_path(p, base, opt, nparent,
-					&t, /*tp=*/NULL, -1);
+					&t, /*tp=*/NULL, -1, pst);
 
 			/* t↓ */
 			update_tree_entry(&t);
@@ -533,7 +606,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
 			}
 
 			p = emit_path(p, base, opt, nparent,
-					/*t=*/NULL, tp, imin);
+					/*t=*/NULL, tp, imin, pst);
 
 		skip_emit_tp:
 			/* ∀ pi=p[imin]  pi↓ */
@@ -555,7 +628,10 @@ struct combine_diff_path *diff_tree_paths(
 	const struct object_id **parents_oid, int nparent,
 	struct strbuf *base, struct diff_options *opt)
 {
-	p = ll_diff_tree_paths(p, oid, parents_oid, nparent, base, opt);
+	/* XXX if base is not empty, we need to walk its paths
+	 * to get to the right level of the trie */
+	p = ll_diff_tree_paths(p, oid, parents_oid, nparent, base, opt,
+			       opt->pathspec.trie);
 
 	/*
 	 * free pre-allocated last element, if any
-- 
2.32.0.359.g3de86e008e.dirty

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: git log exclude pathspec from file - supported? plans?
  2021-07-01 21:27       ` Jeff King
                           ` (2 preceding siblings ...)
  2021-07-01 21:36         ` [PATCH 3/3] tree-diff: use pathspec tries Jeff King
@ 2021-07-01 21:43         ` Jeff King
  3 siblings, 0 replies; 8+ messages in thread
From: Jeff King @ 2021-07-01 21:43 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Martin Langhoff, Git Mailing List, Taylor Blau

On Thu, Jul 01, 2021 at 05:27:05PM -0400, Jeff King wrote:

> > One of the PCREv2 experiments I had very early WIP work towards was to
> > create a search index for commit messages, contents etc. and stick it in
> > something similar to the --changed-paths part of the commit-graph.
> 
> Yeah, to some degree --change-paths may mitigate this, since for a
> series of simple pathspecs we'd generate the bloom filter once and then
> get O(1) matching per commit.

The timings for "git rev-list" I shared earlier were with commit-graphs,
but not with changed-path filters. They don't seem to help much, though.

I suspect the reason is that they can never give a definite answer. They
can only say "probably, yes, this commit is worth looking at". And then
we have to do the same slow, linear pathspec match on that commit. And
if your pathspec contains virtually every path in the first place, then
the answer from the bloom filters will always be "probably, yes".

So I guess they are not really a silver bullet here.

-Peff

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2021-07-01 21:43 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CACPiFCLtj5QF6_Goc5UYh9KHWgkrKtjApL-cCH04S5gdTFyk7Q@mail.gmail.com>
2021-06-30 16:59 ` git log exclude pathspec from file - supported? plans? Martin Langhoff
2021-06-30 17:58   ` Jeff King
2021-06-30 18:22     ` Ævar Arnfjörð Bjarmason
2021-07-01 21:27       ` Jeff King
2021-07-01 21:30         ` [PATCH 1/3] pathspec: add optional trie index Jeff King
2021-07-01 21:30         ` [PATCH 2/3] pathspec: turn on tries when appropriate Jeff King
2021-07-01 21:36         ` [PATCH 3/3] tree-diff: use pathspec tries Jeff King
2021-07-01 21:43         ` git log exclude pathspec from file - supported? plans? Jeff King

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).