Date | Commit message (Collapse) |
|
We'll be using this for MiscIdx and pre-generating the necessary
JSON for manifest.js.gz, so make it easier to share code for
generating per-repo JSON entries for grokmirror.
|
|
This will be used to index and search Inbox objects and perhaps
individual git repositories/epochs for grokmirror manifest.js.gz
generation. There is no sharding planned for this at the moment
since inbox count should remain low (~100K to 1M) compared to
message count.
Folding this into the existing sharded DBs could be possible;
but would likely increase query and maintenance costs, as well
as development complexity. So we'll use a few more inodes and
FDs at runtime, instead.
|
|
Recent (2020) versions of Email::MIME (and/or dependencies)
have different behavior than historical versions which seem
to be less DWIM and perhaps technically more correct. We'll
retain historical behavior for now, since it doesn't seem to
cause real problems and DWIM-ness is often required to make
sense of historical mail.
Tested on a FreeBSD 11.4 VM with the following packages:
p5-Email-MIME-1.949
p5-Email-MIME-ContentType-1.024_1
p5-Email-MIME-Encodings-1.315_2
|
|
This will set us up for supporting graceful shutdown
on -index without repeating any work.
|
|
Upon "eindex" rhymes with "reindex", which could be confusing;
so name the command and config prefix to use "extindex" which
is hopefully less confusing.
|
|
We can now handle cases where messages are edited in one inbox
but not another, bifurcating the message.
V2Writable::log_range handles some edge-cases which could happen
in v2-only code paths, as well, but weren't usually triggered
due to default git-gc knobs not pruning immediately
|
|
We don't actually use it anywhere, and may not need it in
the future.
|
|
We want NNTP clients to see consistent Xref: headers to ensure
client-side caches don't get confused.
|
|
It doesn't seem worth storing xref3 data in Xapian now that
the same info is in over.sqlite3.
|
|
We may not end up storing xref3 data in Xapian, actually.
This will make indexlevel=basic possible, and along with
--sequential-shard indexing support for slow storage.
Making oidmap a separate table seems unnecessary, too, so
fold it into the xref3 table since it's unlikely a git blob
will be responsible for multiple xref3 rows.
|
|
Not documented, yet, but it runs...
|
|
It compiles...
|
|
Since external indices won't have msgmap.sqlite3, we'll need to
store last_commit-* metadata in over.sqlite3 instead. This
has a longer limits to account for path names or newsgroup names
stored in keys.
We'll also rely on built-in counters for Xapian document IDs,
since msgmap.sqlite3 no longer provides an AUTOINCREMENT column.
|
|
This will be used to track cross-posted messages in the
external/detached index.
|
|
This will provide a similar API to PublicInbox::Inbox for
read-only WWW, -imapd, and -nntpd interfaces.
|
|
This switch is still undocumented, but we can reduce the scope
of our Xapian docdata dependency by moving its only caller to
SearchIdx. This reduces the amount of code loaded by read-only
code paths.
|
|
This follows -watch commit b70473ab8296d31ebb600adb4fa8fe0ac5935ca8
to match List-Id headers case-insensitively.
Reported-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org>
Link: https://public-inbox.org/meta/20200921180152.uyqluod7qxbwqubo@chatter.i7.local/
|
|
It seems easiest to have a singleton Gcf2Client client object
per daemon worker for all inboxes to use. This reduces overall
FD usage from pipes.
The `public-inbox-gcf2' command + manpage are gone and a `$^X'
one-liner is used, instead. This saves inodes for internal
commands and hopefully makes it easier to avoid mismatched
PERL5LIB include paths (as noticed during development :x).
We'll also make the existing cat-file process management
infrastructure more resilient to BOFHs on process killing
sprees (or in case our libgit2-based code fails on us).
(Rare) PublicInbox::WWW PSGI users NOT using public-inbox-httpd
won't automatically benefit from this change, and extra
configuration will be required (to be documented later).
|
|
This amortizes the cost of recreating PublicInbox::Gcf2 objects
when alternates change in v2 all.git.
|
|
Since we only get OIDs from trusted local data sources
(over.sqlite3), we can safely retry within the -gcf2 process
without worry about clients spamming us with requests for
invalid OIDs and triggering reopens.
|
|
This should be able to replace multiple `git cat-file' for blob
retrieval, but adjustments may be needed.
|
|
Calling ->add_alternate won't pick up new additions to
$OBJDIR/info/alternates, unfornately. Thus v2 inboxes will
need to do something to invalidate Gcf2 objects.
|
|
Having tens of thousands of inboxes and associated git processes
won't work well, so we'll use libgit2 to access the object DB
directly. We only care about OID lookups and won't need to rely
on per-repo revision names or paths.
The Git::Raw XS package won't be used since its manpages don't
promise a stable API. Since we already use Inline::C and have
experience with I::C when it comes to compatibility, this only
introduces libgit2 itself as a source of new incompatibilities.
This also provides an excuse for me to writev(2) to reduce
syscalls, but liburing is on the horizon for next year.
|
|
Oops :x
|
|
This will help with eventual git SHA-256 transitions.
|
|
This was triggered by blindly trying to FETCH an MSN (not
"UID FETCH") on an empty dummy inbox. It's harmless, and
probably triggered by a wayward client or misbehaving bot.
|
|
We don't want to cascade failures/warnings when something else
breaks. There's likely more of these to be fixed as we
encounter them.
|
|
We may need to test against development versions of Xapian,
which may rely on setting `XAPIAN_COMPACT=xapian-compact-1.5'.
Ensure it's possible to do that.
And add a missing check in t/xcpdb-reshard.t, too.
|
|
It's only in RFC 2980 (not 977 or 3977), but Net::NNTP has
supported it since 2001, at least. We'll be making changes
to avoid pathological behavior, so test it, first.
|
|
With public-inbox-httpd, this mitigates the effect of slow git
blob storage with multiple coderepos configured for an inbox.
It's still synchronous for now (and may need to remain that way
for ->last_check_err), but no longer monopolizes the event loop
when checking multiple coderepos.
We don't yet support multi-inbox scanning, yet; but this also
prepares us for a future where we do.
We'll also support >=40 char blob OIDs in preparation for future
git SHA-256 support, too.
|
|
This helped me diagnose an error I would've introduced
in the next commit.
|
|
It's still as slow as before with hundreds/thousands of inboxes,
but at least it's fair. Future changes will allow it to be
cached and memoized with persistent HTTP servers.
|
|
"*foo" is ambiguous in that it may refer to a bareword file handle;
so we'll use it where we can without triggering warnings.
PublicInbox::TestCommon::run_script_exit required dropping the
prototype, however. We'll also future-proof by dropping "use
warnings" in Cgit.pm and use the less-ambiguous "//=" in Inbox.pm
while we're in the area.
|
|
We cannot blindly use the selected newsgroup for
HEAD/ARTICLE/BODY requests using Message-ID, since
those commands look across all newsgroups; not just
the selected one (if any).
So stuff a reference to the Inbox object into $smsg.
We can reduce args passed into set_nntp_headers() and
msg_hdr_write(), too.
Fixes: 0e6ceff37fc38f28 ("nntp: support slow blob retrievals")
|
|
Nearly all of the search uses in the production code rely on
a Xapian mset iterator being returned (instead of an array
of $smsg objects). So default to returning the mset and move
the burden of smsg array conversion into the test cases.
|
|
strict.pm helped me find a typo in an upcoming recent change, so
ensure we use it since it does more good than harm. We'll also
take the opportunity here to declare v5.10.1 compatibility level
to future-proof against Perl incompatibilities.
|
|
The special case (if any) belongs at a higher-level,
and this is another step towards removing {over_ro}-dependence
in our Search object.
|
|
{over_ro} being a part of the Search object is a historical
oddity which will go away, soon. Lets start removing its use in
tests and rarely-used helper scripts.
|
|
We'll use {oidx} as the common field name for the read-write
OverIdx, here, to disambiguate it from the read-only {over}
field. This hopefully makes it clearer which code paths are
read-only and which are read-write.
|
|
Bareword file handles outside of STD(IN|OUT|ERR) seem to be on
the chopping block for Perl 8. We'll also "use v5.10.1" to
guard against future incompatibilities.
|
|
Following "git init" as an example, we'll create every parent
path up to the one specified, instead of attempting to continue
on when Cwd::abs_path returns `undef'.
|
|
While it's not a known problem, our deduplicating logic may
change in the future; or a BOFH could be manually injecting
duplicate messages directly into the git epoch repositories.
Ensure indexing in mirrors doesn't break when there's
duplicates. This is in preparation for detached indices
for multi-inbox search.
|
|
This is no longer limited to Maildirs now that IMAP and NNTP
support exist; so give it a shorter name.
|
|
Quiet down logs from -imapd when clients are blindly
sending some unsupported flag conditions (e.g. "DRAFT",
"DELETED") specified in RFC 3501.
|
|
Link: https://public-inbox.org/meta/20200828221803.GA89978@dcvr/
|
|
We'll deduplicate redundant lines and show counts of skipped
tests to ensure it's easy to notice if something is unexpectedly
skipped.
|
|
There's no need for this to be a separate sub since there's
only a single caller. This saves a few kilobytes at least
in short-lived processes.
|
|
As noted in commit 87dca6d8d5988c5eb54019cca342450b0b7dd6b7
("www: rework query responses to avoid COUNT in SQLite"),
COUNT on many rows is expensive on big SQLite DBs.
We've already stopped using that code path long ago in WWW
while -imapd and -nntpd never used it. So we'll adjust our
remaining test cases to not need it, either.
|
|
Since we got rid of over->connect, `disconnect' no longer pairs
with it. So name it after the `close(2)' syscall it ultimately
issues.
|
|
`->connect' is confused with the perlfunc for the `connect(2)'
syscall, and also `DBI->connect'. Since SQLite doesn't use
sockets, the word "connect" needlessly confuses me. Give
it a short name to match the field name we use for it, which
also matches the variable name used by the DBI(3pm) and
DBD::SQLite(3pm) manpages.
|