From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Jeff King <peff@peff.net>
Cc: Martin Langhoff <martin.langhoff@gmail.com>,
Git Mailing List <git@vger.kernel.org>,
Taylor Blau <me@ttaylorr.com>
Subject: Re: git log exclude pathspec from file - supported? plans?
Date: Wed, 30 Jun 2021 20:22:35 +0200
Message-ID: <87sg0zdx7z.fsf@evledraar.gmail.com>
In-Reply-To: <YNywsEbFcrQFeH91@coredump.intra.peff.net>
On Wed, Jun 30 2021, Jeff King wrote:
> On Wed, Jun 30, 2021 at 12:59:43PM -0400, Martin Langhoff wrote:
>
>> long time no see! I'm doing some complex git repo spelunking and
>> pushing the boundaries of the pathspec magic for excludes.
>>
>> Is there a reasonable way to provide a (potentially large) set of
>> excludes? something like
>>
>> git log --exclude-pathspec-file paths-to-exclude.txt .
>>
>> Has there been discussion / patches / plans related to this? I may
>> have some cycles (hopefully!)
>
> You can feed pathspecs via --stdin. So:
>
> {
> echo "--"
> sed s/^/:^/ paths-to-exclude.txt
> } | git log --stdin
>
> works. Obviously it's not as turn-key if you really do have a list of
> paths in a file already, but it's much more flexible.
>
> I'll caution you that the pathspec code is not well-optimized to handle
> a large number of pathspecs. E.g.:
>
> [no pathspecs]
> $ time git rev-list HEAD >/dev/null
> real 0m0.033s
> user 0m0.017s
> sys 0m0.017s
>
> [trivial pathspec; now we have to actually open up trees]
> $ { echo --; echo .; } >input
> $ time git rev-list HEAD --stdin <input >/dev/null
> real 0m1.338s
> user 0m1.294s
> sys 0m0.045s
>
> [lots of pathspecs; now we spend loads of time actually matching
> strings; the ^C is when I got bored and killed it]
> $ { echo --; git ls-files; } >input
> $ time git rev-list HEAD --stdin <input >/dev/null
> ^C
> real 1m24.406s
> user 1m24.369s
> sys 0m0.036s
>
> The problem is that we try to linearly match every pathspec against
> every path we consider, so it's quadratic-ish in the number of files in
> the repo. I played a long time ago with storing non-wildcard pathspecs
> in a trie that we could traverse as we walked the individual trees we
> were matching. It performed well, but IIRC the interface was hacky (I
> had to bolt it specifically onto the way the tree-walker uses
> pathspecs, and the other pathspec matchers didn't benefit at all).
>
> I can probably dig it up if anybody's interested in looking at it.
If it's not too much trouble I'd find it interesting, but I likely won't
do anything with it any time soon.
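
For readers following along, the trie idea quoted above can be sketched
roughly as follows. This is a minimal Python illustration, not the code
Jeff refers to: the names PathspecTrie, add and matches are hypothetical,
and it only handles literal (non-wildcard) pathspecs with
leading-directory semantics.

```python
class PathspecTrie:
    """Trie keyed on path components; holds literal pathspecs only.

    Hypothetical sketch: not git's data structure or API.
    """

    def __init__(self):
        self.children = {}     # path component -> PathspecTrie
        self.terminal = False  # True if a pathspec ends at this node

    def add(self, pathspec):
        node = self
        for part in pathspec.strip("/").split("/"):
            node = node.children.setdefault(part, PathspecTrie())
        node.terminal = True

    def matches(self, path):
        """True if a stored pathspec equals `path` or is a leading directory of it."""
        node = self
        for part in path.split("/"):
            node = node.children.get(part)
            if node is None:
                return False   # no pathspec shares this prefix
            if node.terminal:
                return True    # a pathspec ends here, so everything below matches
        return False           # `path` is only a proper prefix of some pathspec


trie = PathspecTrie()
for spec in ["Documentation", "t/test-lib.sh"]:
    trie.add(spec)

print(trie.matches("Documentation/git-log.txt"))
print(trie.matches("t/test-lib.sh"))
print(trie.matches("t/t0000-basic.sh"))
```

The point of the structure is that each lookup costs one dictionary
probe per path component, independent of how many pathspecs were added,
instead of one string comparison per pathspec. A real tree-walker would
also need to know when a directory is a proper prefix of some pathspec
(so it keeps descending); this sketch simply reports that case as "no
match".
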
One of the PCREv2 experiments I had very early WIP work towards was to
create a search index for commit messages, contents etc. and stick it in
something similar to the --changed-paths part of the commit-graph.
The PCREv2 codebase actually has (supposedly) a bug-for-bug compatible
implementation of our wildmatch function as a translator to a PCREv2
regex. I have a branch somewhere where we run all our wildmatch tests
against it successfully.
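
To illustrate the general idea of translating a glob into a regex, here
is a deliberately simplified Python sketch. It is not PCREv2's converter
and is nowhere near bug-for-bug compatible with wildmatch: glob_to_regex
is a hypothetical helper that handles only "*", "?" and a "**/" at the
start of a path component, and ignores bracket expressions and backslash
escapes. It also assumes the ":(glob)"-style semantics where "*" does
not match a "/"; plain pathspecs let "*" cross directory separators.

```python
import re


def glob_to_regex(glob):
    """Translate a small glob subset into a regex string (hypothetical helper)."""
    out = []
    i = 0
    while i < len(glob):
        at_component_start = i == 0 or glob[i - 1] == "/"
        if at_component_start and glob.startswith("**/", i):
            out.append("(?:.*/)?")  # zero or more leading directories
            i += 3
        elif glob[i] == "*":
            out.append("[^/]*")     # any run of characters within one component
            i += 1
        elif glob[i] == "?":
            out.append("[^/]")      # exactly one non-slash character
            i += 1
        else:
            out.append(re.escape(glob[i]))  # literal character
            i += 1
    return "".join(out)


pattern = re.compile(glob_to_regex("**/test-lib.sh"))
print(bool(pattern.fullmatch("t/test-lib.sh")))
print(bool(pattern.fullmatch("test-lib.sh")))
print(bool(pattern.fullmatch("t/test-lib.sh.orig")))
```

Once the glob is a regex, the same introspection that works on a
--grep or -G pattern can be applied to a pathspec.
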
So couple that with regex introspection, and a search index that
e.g. creates a trie bloom filter, then as long as your --grep=<RX>,
-G<RX> or pathspec has at least 3 fixed strings among its wildcards we
can ask the bloom filter "is this commit a candidate for this regex
searching this path/commit message/diff/whatever".
So you can have indexed matches for things like '*/test-lib.sh', not
just prefixes or fixed-strings.
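
A rough sketch of how such a candidate check could work, assuming
(this is not stated above) that the index stores 3-character substrings
of each commit's changed paths in a per-commit Bloom filter. All names
here are hypothetical, and the hashing is purely illustrative; it is not
what the commit-graph's --changed-paths filters do.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter backed by a Python int used as a bit array."""

    def __init__(self, nbits=1024, nhashes=3):
        self.nbits = nbits
        self.nhashes = nhashes
        self.bits = 0

    def _positions(self, item):
        # Derive nhashes bit positions by salting SHA-1 with a seed.
        for seed in range(self.nhashes):
            digest = hashlib.sha1(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.nbits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def trigrams(text):
    """All 3-character substrings of text (empty set if text is shorter)."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def index_commit(paths):
    """Build the per-commit filter from the paths the commit touches."""
    bloom = BloomFilter()
    for path in paths:
        for gram in trigrams(path):
            bloom.add(gram)
    return bloom


def is_candidate(bloom, fixed_part):
    """False: the commit cannot match. True: run the real matcher on it."""
    return all(bloom.might_contain(g) for g in trigrams(fixed_part))


commit = index_commit(["t/test-lib.sh", "Makefile"])
# "/test-lib.sh" is the fixed part of the pattern '*/test-lib.sh'.
print(is_candidate(commit, "/test-lib.sh"))
```

The filter can report false positives but never false negatives, so a
"no" lets the walk skip the commit without opening its trees, while a
"yes" still has to be confirmed by the real matcher. A fixed part
shorter than three characters yields no trigrams, in which case every
commit stays a candidate.
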
Thread overview: 8+ messages
[not found] <CACPiFCLtj5QF6_Goc5UYh9KHWgkrKtjApL-cCH04S5gdTFyk7Q@mail.gmail.com>
2021-06-30 16:59 ` git log exclude pathspec from file - supported? plans? Martin Langhoff
2021-06-30 17:58 ` Jeff King
2021-06-30 18:22 ` Ævar Arnfjörð Bjarmason [this message]
2021-07-01 21:27 ` Jeff King
2021-07-01 21:30 ` [PATCH 1/3] pathspec: add optional trie index Jeff King
2021-07-01 21:30 ` [PATCH 2/3] pathspec: turn on tries when appropriate Jeff King
2021-07-01 21:36 ` [PATCH 3/3] tree-diff: use pathspec tries Jeff King
2021-07-01 21:43 ` git log exclude pathspec from file - supported? plans? Jeff King