Re: [PATCH 1/1] ref-filter.c: find disjoint pattern prefixes

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

From: Jacob Keller <jacob.keller@gmail.com>
To: Taylor Blau <me@ttaylorr.com>
Cc: Git mailing list <git@vger.kernel.org>, Jeff King <peff@peff.net>
Subject: Re: [PATCH 1/1] ref-filter.c: find disjoint pattern prefixes
Date: Wed, 26 Jun 2019 17:37:42 -0700	[thread overview]
Message-ID: <CA+P7+xqQv4UZMy7fEHnGHejU6nvhVKgkSruXdmW-akqUG1TLKA@mail.gmail.com> (raw)
In-Reply-To: <e41db267f7b7086126e9fd3fd5b1a02e38c8c077.1561588479.git.me@ttaylorr.com>

On Wed, Jun 26, 2019 at 3:44 PM Taylor Blau <me@ttaylorr.com> wrote:
>
> Since cfe004a5a9 (ref-filter: limit traversal to prefix, 2017-05-22),
> the ref-filter code has sought to limit the traversals to a prefix of
> the given patterns.
>
> That code stopped short of handling more than one pattern, because it
> means invoking 'for_each_ref_in' multiple times. If we're not careful
> about which patterns overlap, we will output the same refs multiple
> times.

Right.

>
> For instance, consider the set of patterns 'refs/heads/a/*',
> 'refs/heads/a/b/c', and 'refs/tags/v1.0.0'. If we naďvely ran:
>
>   for_each_ref_in("refs/heads/a/*", ...);
>   for_each_ref_in("refs/heads/a/b/c", ...);
>   for_each_ref_in("refs/tags/v1.0.0", ...);
>
> we would see 'refs/heads/a/b/c' (and everything underneath it) twice.
>

Explaining why we ignored solving this before..

> Instead, we want to partition the patterns into disjoint sets, where we
> know that no ref will be matched by any two patterns in different sets.
> In the above, these are:
>
>   - {'refs/heads/a/*', 'refs/heads/a/b/c'}, and
>   - {'refs/tags/v1.0.0'}

Is this disjoint set calculation already existing, or did you have to
add it in this patch?

>
> Given one of these disjoint sets, what is a suitable pattern to pass to
> 'for_each_ref_in'? One approach is to compute the longest common prefix
> over all elements in that disjoint set, and let the caller cull out the
> refs they didn't want. Computing the longest prefix means that in most
> cases, we won't match too many things the caller would like to ignore.
>
> The longest common prefixes of the above are:
>
>   - {'refs/heads/a/*', 'refs/heads/a/b/c'} -> refs/heads/a/*
>   - {'refs/tags/v1.0.0'}                   -> refs/tags/v1.0.0
>
> We instead invoke:
>
>   for_each_ref_in("refs/heads/a/*", ...);
>   for_each_ref_in("refs/tags/v1.0.0", ...);
>
> Which provides us with the refs we were looking for with a minimal
> amount of extra cruft, but never a duplicate of the ref we asked for.
>
> Implemented here is an algorithm which accomplishes the above, which
> works as follows:
>
>   1. Lexicographically sort the given list of patterns.
>
>   2. Initialize 'prefix' to the empty string, where our goal is to
>      build each element in the above set of longest common prefixes.
>
>   3. Consider each pattern in the given set, and emit 'prefix' if it
>      reaches the end of a pattern, or touches a wildcard character. The
>      end of a string is treated as if it precedes a wildcard. (Note that
>      there is some room for future work to detect that, e.g., 'a?b' and
>      'abc' are disjoint).

Neat, so you're calculating disjoint patterns inline while also
figuring out which one is the "best" for any given pattern set...
Neat!

>
>   4. Otherwise, recurse on step (3) with the slice of the list
>      corresponding to our current prefix (i.e., the subset of patterns
>      that have our prefix as a literal string prefix.)
>
> This algorithm is 'O(kn + n log(n))', where 'k' is max(len(pattern)) for
> each pattern in the list, and 'n' is len(patterns).
>

ok, so if we can assume that k is some relatively small constant
number (since the maximum pattern length isn't likely to grow without
bounds), this is O(n*log(n)) on the number of patterns, so we don't
even approach n^2 even when we are given a large number of patterns.
Nice!

> By discovering this set of interesting patterns, we reduce the runtime
> of multi-pattern 'git for-each-ref' (and other ref traversals) from
> O(N) to O(n log(N)), where 'N' is the total number of packed references.

So here, n is the number of patterns still? This seems like a pretty
significant gane when we have a large number of packed references.

>
> Running 'git for-each-ref refs/tags/a refs/tags/b' on a repository with
> 10,000,000 refs in 'refs/tags/huge-N', my best-of-five times drop from:
>
>   real    0m5.805s
>   user    0m5.188s
>   sys     0m0.468s
>
> to:
>
>   real    0m0.001s
>   user    0m0.000s
>   sys     0m0.000s
>

That's a pretty significant decrease!

> On linux.git, the times to dig out two of the latest -rc tags drops from
> 0.002s to 0.001s, so the change on repositories with fewer tags is much
> less noticeable.
>

This explains why it might not have been done before.. many
repositories wouldn't benefit much.

That said, the patch description doesn't make it seem very
complicated. I did run out of time reading the message, so I'll have
to follow up reviewing the actual change below later. I think the
description of the goal and solution is sound though.

Thanks,
Jake

> Co-authored-by: Jeff King <peff@peff.net>
> Signed-off-by: Jeff King <peff@peff.net>
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  ref-filter.c            | 89 +++++++++++++++++++++++++++++------------
>  t/t6300-for-each-ref.sh | 26 ++++++++++++
>  2 files changed, 89 insertions(+), 26 deletions(-)
>
> diff --git a/ref-filter.c b/ref-filter.c
> index 8500671bc6..3e10fd647b 100644
> --- a/ref-filter.c
> +++ b/ref-filter.c
> @@ -20,6 +20,7 @@
>  #include "commit-slab.h"
>  #include "commit-graph.h"
>  #include "commit-reach.h"
> +#include "argv-array.h"
>
>  static struct ref_msg {
>         const char *gone;
> @@ -1790,21 +1791,62 @@ static int filter_pattern_match(struct ref_filter *filter, const char *refname)
>         return match_pattern(filter, refname);
>  }
>
> -/*
> - * Find the longest prefix of pattern we can pass to
> - * `for_each_fullref_in()`, namely the part of pattern preceding the
> - * first glob character. (Note that `for_each_fullref_in()` is
> - * perfectly happy working with a prefix that doesn't end at a
> - * pathname component boundary.)
> - */
> -static void find_longest_prefix(struct strbuf *out, const char *pattern)
> +static int qsort_strcmp(const void *va, const void *vb)
>  {
> -       const char *p;
> +       const char *a = *(const char **)va;
> +       const char *b = *(const char **)vb;
>
> -       for (p = pattern; *p && !is_glob_special(*p); p++)
> -               ;
> +       return strcmp(a, b);
> +}
>
> -       strbuf_add(out, pattern, p - pattern);
> +static void find_longest_prefixes_1(struct string_list *out,
> +                                 struct strbuf *prefix,
> +                                 const char **patterns, size_t nr)
> +{
> +       size_t i;
> +
> +       for (i = 0; i < nr; i++) {
> +               char c = patterns[i][prefix->len];
> +               if (!c || is_glob_special(c)) {
> +                       string_list_append(out, prefix->buf);
> +                       return;
> +               }
> +       }
> +
> +       i = 0;
> +       while (i < nr) {
> +               size_t end;
> +
> +               /*
> +               * Set "end" to the index of the element _after_ the last one
> +               * in our group.
> +               */
> +               for (end = i + 1; end < nr; end++) {
> +                       if (patterns[i][prefix->len] != patterns[end][prefix->len])
> +                               break;
> +               }
> +
> +               strbuf_addch(prefix, patterns[i][prefix->len]);
> +               find_longest_prefixes_1(out, prefix, patterns + i, end - i);
> +               strbuf_setlen(prefix, prefix->len - 1);
> +
> +               i = end;
> +       }
> +}
> +
> +static void find_longest_prefixes(struct string_list *out,
> +                                 const char **patterns)
> +{
> +       struct argv_array sorted = ARGV_ARRAY_INIT;
> +       struct strbuf prefix = STRBUF_INIT;
> +
> +       argv_array_pushv(&sorted, patterns);
> +       QSORT(sorted.argv, sorted.argc, qsort_strcmp);
> +
> +       find_longest_prefixes_1(out, &prefix, sorted.argv, sorted.argc);
> +
> +       argv_array_clear(&sorted);
> +       strbuf_release(&prefix);
>  }
>
>  /*
> @@ -1817,7 +1859,8 @@ static int for_each_fullref_in_pattern(struct ref_filter *filter,
>                                        void *cb_data,
>                                        int broken)
>  {
> -       struct strbuf prefix = STRBUF_INIT;
> +       struct string_list prefixes = STRING_LIST_INIT_DUP;
> +       struct string_list_item *prefix;
>         int ret;
>
>         if (!filter->match_as_path) {
> @@ -1843,21 +1886,15 @@ static int for_each_fullref_in_pattern(struct ref_filter *filter,
>                 return for_each_fullref_in("", cb, cb_data, broken);
>         }
>
> -       if (filter->name_patterns[1]) {
> -               /*
> -                * multiple patterns; in theory this could still work as long
> -                * as the patterns are disjoint. We'd just make multiple calls
> -                * to for_each_ref(). But if they're not disjoint, we'd end up
> -                * reporting the same ref multiple times. So let's punt on that
> -                * for now.
> -                */
> -               return for_each_fullref_in("", cb, cb_data, broken);
> +       find_longest_prefixes(&prefixes, filter->name_patterns);
> +
> +       for_each_string_list_item(prefix, &prefixes) {
> +               ret = for_each_fullref_in(prefix->string, cb, cb_data, broken);
> +               if (ret)
> +                       break;
>         }
>
> -       find_longest_prefix(&prefix, filter->name_patterns[0]);
> -
> -       ret = for_each_fullref_in(prefix.buf, cb, cb_data, broken);
> -       strbuf_release(&prefix);
> +       string_list_clear(&prefixes, 0);
>         return ret;
>  }
>
> diff --git a/t/t6300-for-each-ref.sh b/t/t6300-for-each-ref.sh
> index d9235217fc..ab69aa176d 100755
> --- a/t/t6300-for-each-ref.sh
> +++ b/t/t6300-for-each-ref.sh
> @@ -345,6 +345,32 @@ test_expect_success 'Verify descending sort' '
>         test_cmp expected actual
>  '
>
> +cat >expected <<\EOF
> +refs/tags/testtag
> +refs/tags/testtag-2
> +EOF
> +
> +test_expect_success 'exercise patterns with prefixes' '
> +       git tag testtag-2 &&
> +       test_when_finished "git tag -d testtag-2" &&
> +       git for-each-ref --format="%(refname)" \
> +               refs/tags/testtag refs/tags/testtag-2 >actual &&
> +       test_cmp expected actual
> +'
> +
> +cat >expected <<\EOF
> +refs/tags/testtag
> +refs/tags/testtag-2
> +EOF
> +
> +test_expect_success 'exercise glob patterns with prefixes' '
> +       git tag testtag-2 &&
> +       test_when_finished "git tag -d testtag-2" &&
> +       git for-each-ref --format="%(refname)" \
> +               refs/tags/testtag "refs/tags/testtag-*" >actual &&
> +       test_cmp expected actual
> +'
> +
>  cat >expected <<\EOF
>  'refs/heads/master'
>  'refs/remotes/origin/master'
> --
> 2.21.0.203.g358da99528

next prev parent reply	other threads:[~2019-06-27  0:37 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-06-26 22:41 [PATCH 0/1] ref-filter.c: faster multi-pattern ref filtering Taylor Blau
2019-06-26 22:41 ` [PATCH 1/1] ref-filter.c: find disjoint pattern prefixes Taylor Blau
2019-06-27  0:37   ` Jacob Keller [this message]
2019-07-01 14:39     ` Taylor Blau
2019-06-27 23:21   ` Jacob Keller
2019-07-01 14:49     ` Taylor Blau

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CA+P7+xqQv4UZMy7fEHnGHejU6nvhVKgkSruXdmW-akqUG1TLKA@mail.gmail.com \
    --to=jacob.keller@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=me@ttaylorr.com \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).