From: Jacob Keller <jacob.keller@gmail.com>
To: Taylor Blau <me@ttaylorr.com>
Cc: Git mailing list <git@vger.kernel.org>, Jeff King <peff@peff.net>
Subject: Re: [PATCH 1/1] ref-filter.c: find disjoint pattern prefixes
Date: Wed, 26 Jun 2019 17:37:42 -0700 [thread overview]
Message-ID: <CA+P7+xqQv4UZMy7fEHnGHejU6nvhVKgkSruXdmW-akqUG1TLKA@mail.gmail.com> (raw)
In-Reply-To: <e41db267f7b7086126e9fd3fd5b1a02e38c8c077.1561588479.git.me@ttaylorr.com>
On Wed, Jun 26, 2019 at 3:44 PM Taylor Blau <me@ttaylorr.com> wrote:
>
> Since cfe004a5a9 (ref-filter: limit traversal to prefix, 2017-05-22),
> the ref-filter code has sought to limit the traversals to a prefix of
> the given patterns.
>
> That code stopped short of handling more than one pattern, because it
> means invoking 'for_each_ref_in' multiple times. If we're not careful
> about which patterns overlap, we will output the same refs multiple
> times.
Right.
>
> For instance, consider the set of patterns 'refs/heads/a/*',
> 'refs/heads/a/b/c', and 'refs/tags/v1.0.0'. If we naďvely ran:
>
> for_each_ref_in("refs/heads/a/*", ...);
> for_each_ref_in("refs/heads/a/b/c", ...);
> for_each_ref_in("refs/tags/v1.0.0", ...);
>
> we would see 'refs/heads/a/b/c' (and everything underneath it) twice.
>
Explaining why we ignored solving this before..
> Instead, we want to partition the patterns into disjoint sets, where we
> know that no ref will be matched by any two patterns in different sets.
> In the above, these are:
>
> - {'refs/heads/a/*', 'refs/heads/a/b/c'}, and
> - {'refs/tags/v1.0.0'}
Is this disjoint set calculation already existing, or did you have to
add it in this patch?
>
> Given one of these disjoint sets, what is a suitable pattern to pass to
> 'for_each_ref_in'? One approach is to compute the longest common prefix
> over all elements in that disjoint set, and let the caller cull out the
> refs they didn't want. Computing the longest prefix means that in most
> cases, we won't match too many things the caller would like to ignore.
>
> The longest common prefixes of the above are:
>
> - {'refs/heads/a/*', 'refs/heads/a/b/c'} -> refs/heads/a/*
> - {'refs/tags/v1.0.0'} -> refs/tags/v1.0.0
>
> We instead invoke:
>
> for_each_ref_in("refs/heads/a/*", ...);
> for_each_ref_in("refs/tags/v1.0.0", ...);
>
> Which provides us with the refs we were looking for with a minimal
> amount of extra cruft, but never a duplicate of the ref we asked for.
>
> Implemented here is an algorithm which accomplishes the above, which
> works as follows:
>
> 1. Lexicographically sort the given list of patterns.
>
> 2. Initialize 'prefix' to the empty string, where our goal is to
> build each element in the above set of longest common prefixes.
>
> 3. Consider each pattern in the given set, and emit 'prefix' if it
> reaches the end of a pattern, or touches a wildcard character. The
> end of a string is treated as if it precedes a wildcard. (Note that
> there is some room for future work to detect that, e.g., 'a?b' and
> 'abc' are disjoint).
Neat, so you're calculating disjoint patterns inline while also
figuring out which one is the "best" for any given pattern set...
Neat!
>
> 4. Otherwise, recurse on step (3) with the slice of the list
> corresponding to our current prefix (i.e., the subset of patterns
> that have our prefix as a literal string prefix.)
>
> This algorithm is 'O(kn + n log(n))', where 'k' is max(len(pattern)) for
> each pattern in the list, and 'n' is len(patterns).
>
ok, so if we can assume that k is some relatively small constant
number (since the maximum pattern length isn't likely to grow without
bounds), this is O(n*log(n)) on the number of patterns, so we don't
even approach n^2 even when we are given a large number of patterns.
Nice!
> By discovering this set of interesting patterns, we reduce the runtime
> of multi-pattern 'git for-each-ref' (and other ref traversals) from
> O(N) to O(n log(N)), where 'N' is the total number of packed references.
So here, n is the number of patterns still? This seems like a pretty
significant gane when we have a large number of packed references.
>
> Running 'git for-each-ref refs/tags/a refs/tags/b' on a repository with
> 10,000,000 refs in 'refs/tags/huge-N', my best-of-five times drop from:
>
> real 0m5.805s
> user 0m5.188s
> sys 0m0.468s
>
> to:
>
> real 0m0.001s
> user 0m0.000s
> sys 0m0.000s
>
That's a pretty significant decrease!
> On linux.git, the times to dig out two of the latest -rc tags drops from
> 0.002s to 0.001s, so the change on repositories with fewer tags is much
> less noticeable.
>
This explains why it might not have been done before.. many
repositories wouldn't benefit much.
That said, the patch description doesn't make it seem very
complicated. I did run out of time reading the message, so I'll have
to follow up reviewing the actual change below later. I think the
description of the goal and solution is sound though.
Thanks,
Jake
> Co-authored-by: Jeff King <peff@peff.net>
> Signed-off-by: Jeff King <peff@peff.net>
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
> ref-filter.c | 89 +++++++++++++++++++++++++++++------------
> t/t6300-for-each-ref.sh | 26 ++++++++++++
> 2 files changed, 89 insertions(+), 26 deletions(-)
>
> diff --git a/ref-filter.c b/ref-filter.c
> index 8500671bc6..3e10fd647b 100644
> --- a/ref-filter.c
> +++ b/ref-filter.c
> @@ -20,6 +20,7 @@
> #include "commit-slab.h"
> #include "commit-graph.h"
> #include "commit-reach.h"
> +#include "argv-array.h"
>
> static struct ref_msg {
> const char *gone;
> @@ -1790,21 +1791,62 @@ static int filter_pattern_match(struct ref_filter *filter, const char *refname)
> return match_pattern(filter, refname);
> }
>
> -/*
> - * Find the longest prefix of pattern we can pass to
> - * `for_each_fullref_in()`, namely the part of pattern preceding the
> - * first glob character. (Note that `for_each_fullref_in()` is
> - * perfectly happy working with a prefix that doesn't end at a
> - * pathname component boundary.)
> - */
> -static void find_longest_prefix(struct strbuf *out, const char *pattern)
> +static int qsort_strcmp(const void *va, const void *vb)
> {
> - const char *p;
> + const char *a = *(const char **)va;
> + const char *b = *(const char **)vb;
>
> - for (p = pattern; *p && !is_glob_special(*p); p++)
> - ;
> + return strcmp(a, b);
> +}
>
> - strbuf_add(out, pattern, p - pattern);
> +static void find_longest_prefixes_1(struct string_list *out,
> + struct strbuf *prefix,
> + const char **patterns, size_t nr)
> +{
> + size_t i;
> +
> + for (i = 0; i < nr; i++) {
> + char c = patterns[i][prefix->len];
> + if (!c || is_glob_special(c)) {
> + string_list_append(out, prefix->buf);
> + return;
> + }
> + }
> +
> + i = 0;
> + while (i < nr) {
> + size_t end;
> +
> + /*
> + * Set "end" to the index of the element _after_ the last one
> + * in our group.
> + */
> + for (end = i + 1; end < nr; end++) {
> + if (patterns[i][prefix->len] != patterns[end][prefix->len])
> + break;
> + }
> +
> + strbuf_addch(prefix, patterns[i][prefix->len]);
> + find_longest_prefixes_1(out, prefix, patterns + i, end - i);
> + strbuf_setlen(prefix, prefix->len - 1);
> +
> + i = end;
> + }
> +}
> +
> +static void find_longest_prefixes(struct string_list *out,
> + const char **patterns)
> +{
> + struct argv_array sorted = ARGV_ARRAY_INIT;
> + struct strbuf prefix = STRBUF_INIT;
> +
> + argv_array_pushv(&sorted, patterns);
> + QSORT(sorted.argv, sorted.argc, qsort_strcmp);
> +
> + find_longest_prefixes_1(out, &prefix, sorted.argv, sorted.argc);
> +
> + argv_array_clear(&sorted);
> + strbuf_release(&prefix);
> }
>
> /*
> @@ -1817,7 +1859,8 @@ static int for_each_fullref_in_pattern(struct ref_filter *filter,
> void *cb_data,
> int broken)
> {
> - struct strbuf prefix = STRBUF_INIT;
> + struct string_list prefixes = STRING_LIST_INIT_DUP;
> + struct string_list_item *prefix;
> int ret;
>
> if (!filter->match_as_path) {
> @@ -1843,21 +1886,15 @@ static int for_each_fullref_in_pattern(struct ref_filter *filter,
> return for_each_fullref_in("", cb, cb_data, broken);
> }
>
> - if (filter->name_patterns[1]) {
> - /*
> - * multiple patterns; in theory this could still work as long
> - * as the patterns are disjoint. We'd just make multiple calls
> - * to for_each_ref(). But if they're not disjoint, we'd end up
> - * reporting the same ref multiple times. So let's punt on that
> - * for now.
> - */
> - return for_each_fullref_in("", cb, cb_data, broken);
> + find_longest_prefixes(&prefixes, filter->name_patterns);
> +
> + for_each_string_list_item(prefix, &prefixes) {
> + ret = for_each_fullref_in(prefix->string, cb, cb_data, broken);
> + if (ret)
> + break;
> }
>
> - find_longest_prefix(&prefix, filter->name_patterns[0]);
> -
> - ret = for_each_fullref_in(prefix.buf, cb, cb_data, broken);
> - strbuf_release(&prefix);
> + string_list_clear(&prefixes, 0);
> return ret;
> }
>
> diff --git a/t/t6300-for-each-ref.sh b/t/t6300-for-each-ref.sh
> index d9235217fc..ab69aa176d 100755
> --- a/t/t6300-for-each-ref.sh
> +++ b/t/t6300-for-each-ref.sh
> @@ -345,6 +345,32 @@ test_expect_success 'Verify descending sort' '
> test_cmp expected actual
> '
>
> +cat >expected <<\EOF
> +refs/tags/testtag
> +refs/tags/testtag-2
> +EOF
> +
> +test_expect_success 'exercise patterns with prefixes' '
> + git tag testtag-2 &&
> + test_when_finished "git tag -d testtag-2" &&
> + git for-each-ref --format="%(refname)" \
> + refs/tags/testtag refs/tags/testtag-2 >actual &&
> + test_cmp expected actual
> +'
> +
> +cat >expected <<\EOF
> +refs/tags/testtag
> +refs/tags/testtag-2
> +EOF
> +
> +test_expect_success 'exercise glob patterns with prefixes' '
> + git tag testtag-2 &&
> + test_when_finished "git tag -d testtag-2" &&
> + git for-each-ref --format="%(refname)" \
> + refs/tags/testtag "refs/tags/testtag-*" >actual &&
> + test_cmp expected actual
> +'
> +
> cat >expected <<\EOF
> 'refs/heads/master'
> 'refs/remotes/origin/master'
> --
> 2.21.0.203.g358da99528
next prev parent reply other threads:[~2019-06-27 0:37 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-06-26 22:41 [PATCH 0/1] ref-filter.c: faster multi-pattern ref filtering Taylor Blau
2019-06-26 22:41 ` [PATCH 1/1] ref-filter.c: find disjoint pattern prefixes Taylor Blau
2019-06-27 0:37 ` Jacob Keller [this message]
2019-07-01 14:39 ` Taylor Blau
2019-06-27 23:21 ` Jacob Keller
2019-07-01 14:49 ` Taylor Blau
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: http://vger.kernel.org/majordomo-info.html
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=CA+P7+xqQv4UZMy7fEHnGHejU6nvhVKgkSruXdmW-akqUG1TLKA@mail.gmail.com \
--to=jacob.keller@gmail.com \
--cc=git@vger.kernel.org \
--cc=me@ttaylorr.com \
--cc=peff@peff.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).