Date | Commit message (Collapse) |
|
On powerful systems, having this option is preferable to
XAPIAN_FLUSH_THRESHOLD due to lock granularity and contention
with other processes (-learn, -mda, -watch).
Setting XAPIAN_FLUSH_THRESHOLD can cause -learn, -mda, and
-watch to get stuck until an epoch is completely processed.
|
|
The old name may be confused with "Content-ID" as described in
RFC 2392, so use an alternate name to avoid confusing future
readers.
|
|
PublicInbox::Eml has enough functionality to replace the
Email::MIME-based PublicInbox::MIME.
|
|
In normal mail paths, we can rely on MTAs being configured with
reasonable limits in the -watch and -mda mail injection paths.
However, the MTA is bypassed in a git-only delivery path, a BOFH
could inject a large message and DoS users attempting to mirror
a public-inbox.
This doesn't protect unindexed WWW interfaces from Email::MIME
memory explosions on v1 inboxes. Probably nobody cares about
unindexed WWW interfaces anymore, especially now that Xapian is
optional for indexing.
|
|
We switched to the SDBM-based queue to store author/committer
info last month.
Fixes: c7acdfe78bda5bf3 ("v2: SDBM-based multi Message-ID queue")
|
|
Allowing ->init_bare to be used as a method saves some
keystrokes, and we can save a little bit of time on systems with
our vfork(2)-enabled spawn().
This also sets us up for future improvements where we can
avoid spawning a process at all.
|
|
This lets us store author and committer times for deferred
indexing messages with ambiguous Message-IDs. This allows
us to reproducibly reindex messages with the git commit
and author times when a rare message lacks Received and/or
Date headers while having ambiguous Message-IDs.
|
|
We can finally get rid of the awkward, ad-hoc use of V2Writable,
SearchIdx, and OverIdx args for passing {cotime} and {autime}
between classes.
We'll still use those git time fields internally within
V2Writable and SearchIdx for (re)indexing, but that's not
worth avoiding as a fallback.
|
|
We can pass fewer order-dependent args to V2Writable::do_idx and
SearchIdxShard::index_raw by passing the smsg object, instead.
|
|
We can pass blessed PublicInbox::Smsg objects to internal
indexing APIs instead of having long parameter lists in some
places. The end goal is to avoid parsing redundant information
each step of the way and hopefully make things more
understandable.
|
|
While v2 indexing is triggered immediately after writing the
commit to the git repository, there may be a gap between when
PublicInbox::Import generates a timestamp and when
PublicInbox::SearchIdx sees the message. So follow the mirror
indexing behavior and take the to-be-indexed (time|date)stamps
directly from the git commit.
|
|
When indexing messages without Date: and/or Received: headers,
fall back to using timestamps originally recorded by git in the
commit object. This allows git mirrors to preserve the import
datestamp and timestamp of a message according to what was fed
into git, instead of blindly falling back to the current time.
|
|
It only needs to return a boolean, since none of the current
callers care about the return value. Thus avoid a hash table
assignment and use of `$smsg->{mime}', here.
|
|
Import::remove is a documented interface, and the return
value of the V2Writable work-alike should try to be compatible
with what Import implements.
|
|
I didn't wait until September to do it, this year!
|
|
OpenBSD and FreeBSD support `getconf NPROCESSORS_ONLN` (no
leading underscore). They may also have GNU nproc installed as
"gnproc".
We may also encounter Linux systems w/o GNU coreutils, but able
to use `getconf _NPROCESSORS_ONLN` (with leading underscore).
|
|
The $jobs parameter in `public-inbox-convert' is passed to
V2Writable->init_inbox as `undef' by default, causing
parallelization to be disabled.
Instead, leave the underlying {parallel} flag untouched if
$shards is undef and do not clobber the default shard count.
This allows us to take advantage of multicore systems when
running public-inbox-convert with no command-line switches.
|
|
This is to be consistent with the `nproc(1)' code path. It also
quiets down a warning from Admin when "-j $JOBS" is specified,
since the master process (which distributes work to shards and
handles OverIdx and Msgmap) is considered a job on its own.
|
|
New epochs are the most likely to have loose objects. git won't
be able to take advantage of pack indices and needs to scan
every alternate for the loose object via open/openat syscalls.
Those syscalls will add up some day when we've got hundreds or
thousands of epochs.
|
|
popen_rd accepts arbitrary redirects, so we can reuse its
code to setup the pipe end we want to read, saving each
caller a few lines of code compared to calling pipe+spawn.
|
|
Most spawn and popen_rd callers die on failure to spawn,
anyways, and some are missing checks entirely. This saves
us a bunch of verbose error-checking code in callers.
This also makes popen_rd more consistent, since it already
dies on pipe creation failures.
|
|
There's a bunch of leftover "require" and "use" statements we no
longer need and can get rid of, along with some excessive
imports via "use".
IO::Handle usage isn't always obvious, so add comments
describing why a package loads it. Along the same lines,
document the tmpdir support as the reason we depend on
File::Temp 0.19, even though every Perl 5.10.1+ user has it.
While we're at it, favor "use" over "require", since it it gives
us extra compile-time checking.
|
|
We can save callers the trouble of {-hold} and {-dev_null}
refs as well as the trouble of calling fileno().
|
|
Xapian upstream is slowly phasing out the XS-based Search::Xapian
in favor of the SWIG-generated "Xapian" package. While Debian and
both FreeBSD have Search::Xapian, OpenBSD only includes the "Xapian"
binding.
More information about the status of the "Xapian" Perl module here:
https://trac.xapian.org/ticket/523
|
|
Since we give users no indication or control of how "git gc"
runs, showing its progress is confusing.
|
|
While I've never seen "git log" fail on its own, it could happen
one day and we should be prepared to abort indexing when it
happens.
Beef up tests for t/spawn.t to ensure close() behaves
on popen_rd the way we expect it to.
|
|
And use it for mda, since "0" could be a usable directory
if somebody insists on using relative paths...
|
|
Instead of storing Message-IDs in the Msgmap object, we can
store the blob OID.
For initial indexing of mirrors, this lets us preserve
$sync->{regen} by storing the intended article number in
the queue.
On --reindex, the article number we store in Msgmap is ignored
but only used for ordering purposes.
This also allows us to avoid ENOMEM errors if somebody abuses
our system by reusing Message-IDs; but we now risk ENOSPC
instead (but systems tend to have more FS storage than RAM).
|
|
We need to stop the git process to avoid leaking FDs
to Xapian if we recurse ->index_sync on reindex.
|
|
And maybe 8-headered ones, too...
I noticed --reindex failing on the linux-renesas-soc mirror due
one 3-headed monster of a message having 3 sets of headers;
while another normal message had a Message-ID that matched one
of the 3 IDs of the 3-headed monster.
We still try to do the majority of indexing backwards, but we
defer indexing multi-Message-ID'd messages until the end to
ensure we get all the "good" messages in before we process the
multi-headered ones.
Link: https://public-inbox.org/meta/20191016211415.GA6084@dcvr/
|
|
Make it obvious that we're not the Msgmap sub and return an
array because it's less awkward than providing a modifiable ref
to a function to write to.
|
|
We'll actually use the keys of this hash in future commits.
|
|
"mainrepo" ws a bad name and artifact from the early days when I
intended for there to be a "spamrepo" (now just the
ENV{PI_EMERGENCY} Maildir). With v2, "mainrepo" can be
especially confusing, since v2 needs at least two git
repositories (epoch + all.git) to function and we shouldn't
confuse users by having them point to a git repository for v2.
Much of our documentation already references "INBOX_DIR" for
command-line arguments, so use "inboxdir" as the
git-config(1)-friendly variant for that.
"mainrepo" remains supported indefinitely for compatibility.
Users may need to revert to old versions, or may be referring
to old documentation and must not be forced to change config
files to account for this change.
So if you're using "mainrepo" today, I do NOT recommend changing
it right away because other bugs can lurk.
Link: https://public-inbox.org/meta/874l0ice8v.fsf@alyssa.is/
|
|
We don't need to make unnecesary writes to the git config file
and wear out storage devices every time we run
"public-inbox-index"
|
|
|
|
Now that the code matches Xapian terminology, ensure
our comments match, too.
|
|
Be consistent with our own terminology and use "epoch" for
[0-9]+\.git repos. The term "partition" is going away entirely.
|
|
|
|
We'll be using the term "shard" from now on to be consistent
with Xapian terminology.
|
|
Another step towards keeping our file and package names
consistent with Xapian terminology.
|
|
Our internal data structure should be consistent with Xapian
terminology.
|
|
Another step towards becoming consistent with Xapian terminology
|
|
Using compact to change shard count was abandoned during
the v2 development phase.
|
|
Oops :x
|
|
* origin/reshard:
xcpdb: support resharding v2 repos
xcpdb: use destination shard as progress prefix
xapcmd: preserve indexlevel based on the destination
v2writable: use a smaller default for Xapian partitions
|
|
Apparently 16 CPUs (probably HT) and SATA storage is common
these days. Having excessive Xapian partitions leads to
contention and excessive FD/space use. So set a smaller
default but continue allowing user-specified values to bump
this up.
|
|
Xapian on Linux <3.15 has trouble with coprocesses since it used
fork() for locking and would hold onto pipes used for git
unnecessarily.
|
|
Much of the existing purge code is repurposed to a general
"replace" functionality.
->purge is simpler because it can just drop the information.
Unlike ->purge, ->replace needs to edit existing git commits (in
case of From: and Subject: headers) and reindex the modified
message.
We currently disallow editing of References:, In-Reply-To: and
Message-ID headers because it can cause bad side effects with
our threading (and our lack of rethreading support to deal with
excessive matching from incorrect/invalid References).
|
|
Continuing the work by Eric Biederman in commit a118d58a402bd31b
("Import.pm: When purging replace a purged file with a zero length file"),
we can use a generic OID replacement mechanism to implement
purge.
|
|
It's one ugly sub with lots of parameters, but it's better
than calling a bunch of ugly subs with lots of parameters;
as we'll be needing to call it again when reindexing for
message replacements.
|