git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jeff King <peff@peff.net>
To: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Cc: Martin Langhoff <martin.langhoff@gmail.com>,
	Git Mailing List <git@vger.kernel.org>,
	Taylor Blau <me@ttaylorr.com>
Subject: Re: git log exclude pathspec from file - supported? plans?
Date: Thu, 1 Jul 2021 17:27:05 -0400	[thread overview]
Message-ID: <YN4zKVK7gvuIZ0vK@coredump.intra.peff.net> (raw)
In-Reply-To: <87sg0zdx7z.fsf@evledraar.gmail.com>

On Wed, Jun 30, 2021 at 08:22:35PM +0200, Ævar Arnfjörð Bjarmason wrote:

> > The problem is that we try to linearly match every pathspec against
> > every path we consider, so it's quadratic-ish in the number of files in
> > the repo. I played a long time ago with storing non-wildcard pathspecs
> > in a trie that we could traverse as we talked the individual trees we
> > were matching. It performed well, but IIRC the interface was hacky (I
> > had to bolt it specifically onto the way the tree-walker uses
> > pathspecs, and the other pathspec matchers didn't benefit at all).
> >
> > I can probably dig it up if anybody's interested in looking at it.
> 
> If it's not too much trouble I'd find it interesting, but I likely won't
> do anything with it any time soon.

The patches are from 2014 (yikes). I forward ported them to the current
tip of master, but beware:

  - it fails a test in t4010 matching a gitlink with a trailing "/" in the
    pathspec (which I think just didn't exist back then). I suspect this
    is a simple bug (the trie has to record a bit for "this must be a
    dir", and it probably just doesn't understand gitlinks properly).

  - the original renamed diff_tree_sha1() to diff_tree_sha1_recurse(),
    which we then call internally while recursing (to pass the trie
    cursor). And then top-level calls use a diff_tree_sha1() wrapper
    that kicks off the traversal (to avoid disrupting other callers).

    When forward-porting, those functions got renamed and reorganized to
    handle multi-parent diffs. I was able to just use the internal
    ll_diff_tree_paths() as my "recurse" version, but I'm not 100% sure
    that's right (e.g., should internal calls to ll_diff_tree_oid also
    grow a cursor parameter?)

  - a few bits had to be ported from the 2-parent case to handle
    n-parents. It _looked_ pretty trivial as I did it, but it's entirely
    possible there are gotchas. Tests seem to pass, though (there are no
    specific tests for finding changes in a merge here, but all tree
    diffs should be using the new code).

  - there are a few basic tests included, but this was never run in any
    kind of production setting. Caveat emptor. :)

> One of the PCREv2 experiments I had very early WIP work towards was to
> create a search index for commit messages, contents etc. and stick it in
> something similar to the --changed-paths part of the commit-graph.

Yeah, to some degree --change-paths may mitigate this, since for a
series of simple pathspecs we'd generate the bloom filter once and then
get O(1) matching per commit.

My ulterior motive with all of this was to plug it into our custom
"blame-tree" (i.e., which commit last-touched each path), with the idea
being that we'd start with a big pathspec, and then shrink it as we find
answers. Removing an item from a bloom filter is awkward, but possible.

Anyway, here are the patches.

  [1/3]: pathspec: add optional trie index
  [2/3]: pathspec: turn on tries when appropriate
  [3/3]: tree-diff: use pathspec tries

 Makefile                      |   1 +
 pathspec.c                    | 127 ++++++++++++++++++++++++++++++++++
 pathspec.h                    |  13 ++++
 t/helper/test-pathspec.c      |  96 +++++++++++++++++++++++++
 t/helper/test-tool.c          |   1 +
 t/helper/test-tool.h          |   1 +
 t/perf/p4002-diff-pathspec.sh |  26 +++++++
 t/t6137-pathspec-trie.sh      |  57 +++++++++++++++
 tree-diff.c                   | 102 +++++++++++++++++++++++----
 9 files changed, 411 insertions(+), 13 deletions(-)
 create mode 100644 t/helper/test-pathspec.c
 create mode 100755 t/perf/p4002-diff-pathspec.sh
 create mode 100755 t/t6137-pathspec-trie.sh

-Peff

  reply	other threads:[~2021-07-01 21:27 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <CACPiFCLtj5QF6_Goc5UYh9KHWgkrKtjApL-cCH04S5gdTFyk7Q@mail.gmail.com>
2021-06-30 16:59 ` git log exclude pathspec from file - supported? plans? Martin Langhoff
2021-06-30 17:58   ` Jeff King
2021-06-30 18:22     ` Ævar Arnfjörð Bjarmason
2021-07-01 21:27       ` Jeff King [this message]
2021-07-01 21:30         ` [PATCH 1/3] pathspec: add optional trie index Jeff King
2021-07-01 21:30         ` [PATCH 2/3] pathspec: turn on tries when appropriate Jeff King
2021-07-01 21:36         ` [PATCH 3/3] tree-diff: use pathspec tries Jeff King
2021-07-01 21:43         ` git log exclude pathspec from file - supported? plans? Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YN4zKVK7gvuIZ0vK@coredump.intra.peff.net \
    --to=peff@peff.net \
    --cc=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=martin.langhoff@gmail.com \
    --cc=me@ttaylorr.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).