Date | Commit message |
|
This will allow us to use p2q-compatible specifications such as
"dfpost7" to only capture blob OIDs which are 7 characters in
length (the indexer will always index down to 7 characters).
|
|
While chdir simplifies path manipulation on our end, its use
falls over when PERL5LIB/@INC contains relative paths which need
to be made absolute. It's fewer lines of code to eliminate
chdir usage than it is to keep using relative paths in most
places.
|
|
Most xap_terms callers do not benefit from the hashref
return value, and we can delay hashmap use until
List::Util::uniqstr if needed.
|
|
Xapian has always sorted termlist iterators, so we now:
1) break out of the iterator loop early on non-matches
2) avoid doing sorting ourselves
As a result, we'll also favor the wantarray forms of xap_terms
and all_terms to preserve sort order in most cases.
Confirmed by the Xapian maintainer: <20231201184844.GO4059@survex.com>
Link: https://lists.xapian.org/pipermail/xapian-discuss/2023-December/010013.html
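
A minimal sketch of the early-exit idea, in Python rather than our
Perl, assuming the term list is already sorted in ascending order.
The terms_with_prefix name and the sample terms are hypothetical and
not part of xap_terms:

```python
from bisect import bisect_left


def terms_with_prefix(sorted_terms, prefix):
    """Return the suffix of every term that starts with `prefix`.

    Assumption: `sorted_terms` is already in ascending order, which is
    what the message above says termlist iterators provide.
    """
    out = []
    # Jump straight to the first term that is >= prefix.
    start = bisect_left(sorted_terms, prefix)
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted input: no later term can match either
        out.append(term[len(prefix):])
    return out  # still in sorted order, so no extra sort is needed
```

Because the input order is preserved, returning a list instead of a
hash keeps the results sorted for the caller.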
|
|
As of SpamAssassin 4.0.0, spamc(1) corrupts messages with NUL in
the body when the `--headers' switch is used. This increases
transport costs, but most spamc/spamd setups are via local
sockets, so it's unlikely to be significant.
Link: https://bugs.debian.org/1057749
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
|
|
There's no need to recurse and trigger deep recursion warnings
when we hit a coderepo with a known hash (SHA-1 vs SHA-256).
Noticed while pruning the 1200+ repos on a git.kernel.org
mirror.
|
|
This future proofs the index against git auto-abbreviation
needing more characters as the repo grows. It'll be useful for
joining against inboxes using dfpre.
As with emails, we'll continue indexing abbreviated blob OIDs
down to 7 hex characters so a SHA-1 git repo will have all
abbreviations of the OID from 7-39 hex characters in addition
to the 40 character unabbreviated form.
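
A rough illustration of which strings end up indexed for one blob,
in Python rather than our Perl. The oid_abbreviations helper and the
sample OID are made up for this sketch:

```python
def oid_abbreviations(oid, min_len=7):
    """Yield every prefix of `oid`, from `min_len` characters up to
    and including the full, unabbreviated OID."""
    for n in range(min_len, len(oid) + 1):
        yield oid[:n]


# Hypothetical 40-character SHA-1 OID, used only for illustration.
sample_oid = "0123456789abcdef0123456789abcdef01234567"
abbrevs = list(oid_abbreviations(sample_oid))
```

For a 40-character OID this gives the 7-39 character abbreviations
plus the full form.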
|
|
Oddly, Perl did not warn about this. Spotted while confirming
abbreviated OIDs are also indexed when unabbreviated OIDs
appear.
|
|
It looks like DragonFly inherited this from FreeBSD to
allow us to save some syscalls.
|
|
I forgot to set TMPDIR=/path/to/non-tmpfs again.
|
|
I mixed up "flush" with "close" :x
Fixes: 87b7f633f241 (xap_helper: implement mset endpoint for WWW, IMAP, etc...)
|
|
As with mail search, a cindex may be updated while WWW is
serving requests. Thus we must reopen the Xapian DB when
the revision we're using becomes stale.
|
|
We no longer vivify the intermediate $ibx->{-hide} hashref;
instead, we use $ibx->{-hide_$KEY} directly. This avoids
an intermediate hashref and extra hash table lookups.
|
|
This is a convenient (and slightly memory-saving) alternative to
specifying a `publicinbox.*.url' entry for every single inbox
when using publicinbox.wwwListing.
|
|
For inboxes associated with an extindex (currently only the
special "all" one), we can share the git process across
all those inboxes unambiguously when retrieving full SHA-1
blobs.
The comment for my proposed patch is also out-of-date as that
git speedup has been a part of git since 2.33.
|
|
We no longer trigger git cleanups from the Inbox package since
`git cat-file' users have their own cleanup to support git
coderepos not associated with any inbox.
This change means we unconditionally expire SQLite and Xapian
FDs and some internal caches regardless of git activity. The
old logic was irrelevant to Gcf2 (libgit2) users anyways since
we couldn't determine whether or not an inbox was active based
on {inflight} git requests, and upcoming changes will make it
inaccurate for all extindex/cindex users as well.
Opening SQLite and Xapian DBs is fairly cheap; so it's a small
price to pay to reduce memory use and fragmentation.
|
|
This brings a no-op -cindex scan of a git.kernel.org mirror
down from 70s to 10s with a hot cache on a busy machine.
CPU-intensive SHA-256 fingerprinting of the `git show-ref'
result can be parallelized on shard workers. Future changes can
move more of the initial scan setup phase into shard workers for
more parallelism.
But most of the performance for skipping unchanged repos is
gained from delaying the commit time reading until we've seen
the fingerprint is out-of-date, since reading commit times
requires a large amount of I/O compared to only reading refs
for fingerprints.
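
The ordering described above can be sketched as follows, in Python
rather than our Perl. The scan_repo name, its arguments and the
read_commit_times callback are hypothetical stand-ins, not the
actual -cindex code:

```python
import hashlib


def scan_repo(show_ref_output, stored_fingerprint, read_commit_times):
    """Return (fingerprint, commit_times or None).

    `show_ref_output` is assumed to be the raw bytes printed by
    `git show-ref`.  `read_commit_times` is a hypothetical callback
    standing in for the expensive, I/O-heavy step.
    """
    fp = hashlib.sha256(show_ref_output).hexdigest()
    if fp == stored_fingerprint:
        return fp, None  # unchanged repo: skip the expensive read
    # Only pay for reading commit times once we know refs changed.
    return fp, read_commit_times()
```

The fingerprint only depends on the refs, so unchanged repos never
reach the commit time read.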
|
|
When setting up stdin for commands, the write_file API is
convenient enough nowadays to not be worth having special
support with process spawning.
When reading stdout of commands, we should probably be using
utf8_maybe everywhere since there'll always be legacy encodings
in git repos.
Reading regular files with :utf8 also results in worse memory
management since the file size cannot be used as a hint.
|
|
We no longer fork after cidx_init, so there's no need to spend
CPU cycles on the getpid() syscall, especially since it's no
longer cached on glibc while syscalls are also more expensive
these days due to CPU vulnerability mitigations.
|
|
It saves some code in case we keep libgit2 around.
|
|
This will allow WWW to use a combined LeiALE-like
thing to reduce git processes.
|
|
This fixes the case where we're running both SHA-256 and SHA-1.
There are no tests for SHA-256 yet, but the bug is pretty obvious
upon reading the code.
|
|
We only use it as a boolean flag, and there's no need to waste
space for common, non-error cases.
|
|
Explicitly drop support for "\n" in git coderepo pathnames as
we do other stuff. Gcf2 (our libgit2 helper) was always
broken with "\n" in pathnames, and I'm not sure if cgit config
files work with them, either. Dealing with newline characters
requires extra complexity that I'm not willing to deal with when
managing alternates files.
|
|
It's entirely possible for public inboxes to have zero patches
in them, so the number of match slots may not match the
number of joined ekeys.
|
|
It's possible to update the fingerprint for a given repo when we
have no commits to index on because they were already done for
another repo. Thus we'll always vivify $repo_ctx->{active}
before calling store_repo since $active may've been undef.
|
|
Found by tidy(1) while dealing with other stuff.
|
|
This gets rid of the "\ No newline at end of file"
since it's distracting noise.
|
|
We need to load the proper package and fully-qualify the sub
call since we shouldn't load Hval in lei. Some users use this
feature even if it's broken, oh well :<
|
|
--no-import-before skips importing entire messages, not just
keywords, so it can cause permanent data loss if -o is pointed
to precious data.
|
|
This ensures the /all/ extindex can have auto-associations
with coderepos just like normal inboxes do.
|
|
The HTML is still extremely rough, but links seem to be mostly
working...
|
|
We don't want to be accessing uninitialized variables on
process teardown since much of our control flow revolves
around DESTROY for dependency handling.
|
|
Absolute pathnames of git coderepos are stored in the cindex,
but we should favor paths relative to $ENV{PWD} since it
respects symlinks in the hierarchy.
Respecting symlinks makes it easier to migrate cindex to
new storage as old storage wears out and to relocate the
storage device onto another machine.
|
|
This becomes noticeable when loading lots of coderepos on
my local mirror of git.kernel.org now that we can load repos
from cindex.
|
|
Accepting @ARGV without switches ends up being ambiguous with
optional parameters for --join and --show. Requiring users to
specify `--join=' or `--show=' is a bit awkward (as it is with
-clone --objstore= and the like, but that is historical baggage
we need to carry at this point...)
|
|
Only worktrees need to use `git rev-parse --git-path', so avoid
the spawn overhead of a new process. With the SolverGit.pm
limit on coderepo scans disabled and scanning over 800 git repos
for git@vger matches, this reduces xt/solver.t times by
roughly 25%.
|
|
This is a major step in solving the problem of having to
manually associate hundreds/thousands of coderepos with
hundreds/thousands of public-inboxes to power solver
(and more).
|
|
File::Spec->abs2rel doesn't touch the filesystem at all when
given an absolute base arg ($env->{PATH_INFO}), so we can rely
on it to generate relative links that work with the `mount'
from Plack::Builder and also people running `wget -r' mirrors.
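
For comparison, a purely lexical relative link computation looks
like this in Python, where posixpath.relpath plays the role of
File::Spec->abs2rel. The relative_link name and the sample paths
are invented for this sketch, and it assumes both arguments are
absolute and that the base is a directory-like path:

```python
import posixpath


def relative_link(target, base_dir):
    """Return `target` relative to `base_dir`.

    Assumption: both arguments are absolute paths, so the result is
    computed from the strings alone.
    """
    return posixpath.relpath(target, start=base_dir)


# Hypothetical request path and link target, for illustration only.
link = relative_link("/all/new.atom", "/all/some-msgid/T")
```

Since the link never mentions the mount point, it keeps working
when the app is mounted under a different prefix or mirrored to
static files.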
|
|
The C++ version will allow us to take full advantage of Xapian's
APIs for better queries, and the Perl bindings version can still
be advantageous in the future since we'll be able to support
timeouts effectively.
|
|
It ought to help a bit with organization since xap_helper.h
is getting somewhat large and we'll need new endpoints to
support WWW, lei, and whatever else needs to come.
|
|
We don't want hundreds of git cat-file processes for coderepos
lingering around.
|
|
We store the full path name and xap_terms already removes
the `P' character, so the loop and substr calls are a
no-op replacing `/' with `/'.
|
|
Data::Dumper+B::Deparse seems fast enough to generate cache keys
with, so this makes updating and developing tests easier (as
opposed to forcing the developer to change the identifier). The
main downside is we'll have to deal with cache expiration, but
"make clean" seems overly aggressive already (it keeps blowing
away the clones made by t/cindex-join.t :<)
|
|
While MTAs seem to stop '\0' from appearing in headers, users
fetching archives via git remain susceptible to having '\0' land
in archives. So we'll filter them out of Xapian and SQLite DBs
to avoid interoperability problems with CLI tools since there are no
known messages in lore or any of my archives which feature them.
Avoiding '\0' will ensure all indexed Message-IDs and List-Ids
can be specified from the command-line (although some characters
will still require $(printf) contortions).
As with Message-ID, List-Id fields with /\n\t\r/ characters will
also be stripped for indexing. I will assume whatever went wrong
with the References: header in
<https://public-inbox.org/git/656C30A1EFC89F6B2082D9B6@localhost/raw>
could also happen to the List-Id header.
This is inspired by commit aca47e05a6026c12c768753c87e6ff769ef6bee4
(Import: Don't copy nulls from emails into git, 2018-07-07)
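
A minimal sketch of the filtering, in Python rather than our Perl.
The sanitize_for_index name is hypothetical; the character set is
the one named above (NUL plus newline, tab and carriage return):

```python
# Deletion table for the characters we refuse to index.
_STRIP = str.maketrans("", "", "\0\n\t\r")


def sanitize_for_index(value):
    """Drop NUL, LF, TAB and CR from a Message-ID or List-Id value
    before it is written to an index."""
    return value.translate(_STRIP)
```

The stored value can then be typed on a command line without NUL
getting in the way.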
|
|
Our use of MID_ESC characters was only intended for the pathname
component of URIs and not appropriate for the query string
component. So use a different $unsafe parameter list for
uri_escape to make the result appropriate for query strings by
disallowing [\&\'\+=] characters. Most notably, this change
also allows us to accept `/' (slash) unescaped to make dfn: queries
nicer to look at.
Finally, we'll also add an ascii_html call on the URI-escaped
result as an extra safety measure even though it's not really
needed.
As far as I can tell, the code without this fix didn't result in
an HTML injection since all our uses of uri_escape did escape
angle brackets.
Reported-by: Ricardo Cañuelo <ricardo.canuelo@collabora.com>
Link: https://public-inbox.org/meta/87o7ff4nlk.fsf@collabora.com/
Tested-by: Ricardo Cañuelo <ricardo.canuelo@collabora.com>
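
A rough Python analogue of the two steps, with urllib.parse.quote
standing in for uri_escape with a custom $unsafe list and
html.escape standing in for ascii_html. The query_component name is
hypothetical, and the exact set of escaped characters differs from
our Perl (this sketch also escapes `:', for instance):

```python
import html
from urllib.parse import quote


def query_component(value):
    """Escape `value` for use in a query string, then for HTML.

    `/' is left alone; `&', `'', `+' and `=' are percent-encoded.
    """
    escaped = quote(value, safe="/")
    # Extra safety: percent-encoding already removed <, >, & and quotes.
    return html.escape(escaped, quote=True)
```

With this, a value such as "dfn:a/b" keeps its slash readable while
anything that could split the query string gets encoded.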
|
|
We'll introduce a helper to cast off_t to size_t consistently
for mmap/munmap/calloc calls which require size_t. Also, an
extra check for multiplication overflow can be helpful just
in case we end up with a gigantic roots file.
|
|
It's not async-signal-safe and the glibc implementation uses
malloc via asnprintf. Practically it's not a problem unless the
kernel OOMs and the write(2) fails to the self-pipe.
|
|
Reset gets called on END{} anyways to work around DBI lifetime
problems, so there's no need to call it near exit. We can't
replace calls to POSIX::_exit with `exit' to force END{} to
run just yet, as there are still some lingering destruction
ordering problems on newer DBI and/or Perls.
|
|
While the {inflight} array should be tied to the IO object even
more tightly, that's not an easy task with our current code. So
take some small steps by introducing a gcf_inflight helper to
validate the ownership of the process and to drain the inflight
array via the awaitpid callback.
This hopefully fixes problems with t/lei-q-save.t (still) hanging
occasionally on v2 outputs since git->cleanup/->DESTROY was getting
called in v2 shard workers.
|