Date | Commit message |
|
* origin/purge:
implement public-inbox-purge tool
v2writable: read epoch on purge
v2writable: cleanup processes when done
v2writable: purge ignores non-existent git epoch directories
v2writable: ->purge returns undef on no-op
import: purge: reap fast-export process
hoist out resolve_repo_dir from -index
|
|
Maybe we'll default to a dark theme to promote energy savings...
See contrib/css/README for details
|
|
|
|
Expose the ->purge functionality of V2Writable for rewriting
git history to permanently purge messages from history. This
may be necessary for legal reasons.
Usage:
# requires ~/.public-inbox/config
public-inbox-purge --all </path/to/message-to-purge
# good for testing with unconfigured inboxes:
public-inbox-purge $INBOX_DIR </path/to/message-to-purge
|
|
We'll be using it in future admin tools, and making this
easier-to-test.
|
|
Clearly the AltId stuff was never tested for v2. Ensure
this tricky filter (which reuses Msgmap to avoid introducing
new serial numbers) doesn't trigger deadlocks in SQLite due
to opening a DB for writing multiple times.
I went through several iterations of this change before
going with this one, which is the least intrusive I could
find.
|
|
No need to reach into PublicInbox::Config internals and iterate
through the hashref by hand
|
|
This allows archivists to publish incomplete archives with newer
mail while allowing "0.git" (or "1.git" and so on) epochs to be
added-after-the-fact (without affecting "git clone" followers).
A reindex will be necessary for Xapian and SQLite to catch up
once the old epochs are added; but the reindexing code is also
capable of tolerating missing epochs.
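
A minimal sketch of what tolerating missing epochs can look like, in Python rather than the project's own code. The function names and the `git_dir` parameter (a directory holding the "N.git" epochs) are hypothetical and are not taken from public-inbox:

```python
import os
import re

# Epoch directories are named "0.git", "1.git", and so on.
EPOCH_RE = re.compile(r"^(\d+)\.git$")


def existing_epochs(git_dir):
    """Return the sorted epoch numbers that have an "N.git" directory."""
    found = []
    for name in os.listdir(git_dir):
        m = EPOCH_RE.match(name)
        if m and os.path.isdir(os.path.join(git_dir, name)):
            found.append(int(m.group(1)))
    return sorted(found)


def missing_epochs(git_dir):
    """Return epoch numbers below the newest epoch that are absent."""
    present = existing_epochs(git_dir)
    if not present:
        return []
    have = set(present)
    return [n for n in range(present[-1]) if n not in have]
```

A caller would walk only `existing_epochs()` and simply skip the gaps; once an old epoch such as "0.git" is dropped in later, it appears in the list and a reindex picks it up.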
|
|
It is redundant to set default values in the public-inbox
config file. Let's not clutter up users' screens when they
view or edit the config file.
|
|
This reuses some of the configuration from -watch, but remains
independent since some configurations will use -watch for some
inboxes and -mda for others.
The default remains "spamc" for -mda users so nothing changes
without explicit configuration.
Per-inbox configurations may also be supported in the future.
|
|
We must not clobber the original message string, as Email::MIME(*)
still needs it for iterating through parts in SearchIdx (but not
when handing it as a raw string to git-fast-import).
I've noticed message bodies (especially dfpre/dpost) were not
getting indexed when going through -mda (no problems with
-watch). This also did not affect v1 repos, since indexing is a
separate process for v1 and requires re-reading the data from
git.
(*) tested Email::MIME 1.937 on Debian stretch
|
|
It's a convenient wrapper nowadays, so get rid of some legacy
code and minimize differences from the -watch code.
|
|
If indexlevel is specified on the command line prefer that.
If indexlevel is specified in the config file prefer that.
If indexlevel is not specified anywhere default to full.
This should make indexlevel somewhat approachable.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
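
The precedence described above can be sketched as follows. This is an illustration in Python, not the project's code; `resolve_indexlevel` and its parameters are hypothetical names, and `None` stands for "not specified":

```python
DEFAULT_INDEXLEVEL = "full"


def resolve_indexlevel(cli_value=None, config_value=None):
    """Pick the indexlevel: command line first, then config, then default."""
    if cli_value is not None:
        return cli_value
    if config_value is not None:
        return config_value
    return DEFAULT_INDEXLEVEL
```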
|
|
We subtract one from "jobs" to map to "partitions" to account
for the overview index and git fast-import jobs.
|
|
Many MTAs understand these and map them to sensible SMTP error messages.
Inability to find an inbox results in "5.1.1 user unknown".
Misformatted messages are rejected with "5.6.0 data format error".
Unsupported inbox versions are reported as "5.3.5 local configuration error".
All of these are interpreted as permanent failures.
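
For illustration, assuming "these" refers to sysexits.h-style exit codes (the message body does not name them), the mapping performed on the MTA side might be sketched like this. The numeric values are the standard sysexits.h ones; the table and function names are hypothetical:

```python
# Standard sysexits.h values (assumed to be the codes the message refers to).
EX_DATAERR = 65  # input data was incorrect
EX_NOUSER = 67   # addressee unknown
EX_CONFIG = 78   # configuration error

# Status text as quoted in the commit message above.
STATUS_BY_EXIT = {
    EX_NOUSER: "5.1.1 user unknown",
    EX_DATAERR: "5.6.0 data format error",
    EX_CONFIG: "5.3.5 local configuration error",
}


def status_for_exit(code):
    """Return the SMTP status text for an exit code, or None if unmapped."""
    return STATUS_BY_EXIT.get(code)


def is_permanent(status):
    """Enhanced status codes in class 5 denote permanent failures."""
    return status.startswith("5.")
```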
|
|
Oops, I mainly rely on public-inbox-watch for spam training
and completely forgot this tool existed :x
|
|
This quiets a warning inside Spawn.pm
|
|
Some users may not have any public-inboxes configured, especially in
tests.
|
|
I noticed I lost a $GIT_DIR/description in a conversion, so we
should preserve it. While we're at it, we ought to copy any
config in the old repo to the new one.
We will need to warn about cloneurl since it's unfortunately
not an automatic process to update. Oh well..
|
|
--no-renumber does not allow merging, and merging is not ideal
for reindexing, either.
|
|
Since we only query the SQLite over DB for OVER/XOVER, we do not
need to waste space storing fields To/Cc/:bytes/:lines or the
XNUM term. We only use From/Subject/References/Message-ID/:blob
in various places of the PSGI code.
For reindexing, we will take advantage of docid stability
in "xapian-compact --no-renumber" to ensure duplicates do not
show up in search results. Since the PSGI interface is the
only consumer of Xapian at the moment, it has no need to
search based on NNTP article number.
|
|
public-inbox-convert ought to be 100% lossless, now
|
|
Not everybody needs multiprocess support.
|
|
Xapian is size-intensive and SQLite is not strictly necessary for v1.
|
|
Some of this jankiness was from early performance problems
and they turned out to be unnecessary measures.
|
|
Let's not scare users when they encounter files that are supposed
to be there. Then, preserve the journal and pipe.lock, even if
they're supposedly unused due to us holding the inbox-wide lock.
|
|
This is important for people running mirrors via "git fetch",
as they need to be kept up-to-date. Purging is also now
supported in mirrors.
The short-lived "--regenerate" option is gone and is now
implicitly enabled as a result. It's still cheap when
article number regeneration is unnecessary, as we track
the range for each git repository.
|
|
|
|
This ought to provide better performance and scalability
which is less dependent on inbox size. Xapian does not
seem optimized for some queries used by the WWW homepage,
Atom feeds, XOVER and NEWNEWS NNTP commands.
This can actually make Xapian optional for NNTP usage,
and allow more functionality to work without Xapian
installed.
Indexing performance was extremely bad at first, but
DBI::Profile helped me optimize away problematic queries.
|
|
We need to ensure there is only one file in the top-level tree
at any commit so the "add; remove; add;" sequence on the same
message is detected properly.
Otherwise, git will not detect the second "add" unless
a second message is added to history.
Deletes are now stored in "d" (and not "D" or "_/D") at the
top-level. There's no need to have a "_" to reduce churn
as "m" and "d" should never co-exist. It's now lowercased to
make it easier-to-distinguish from "D" in git-log output.
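
A toy model of the layout described above, written in Python with plain dicts standing in for git trees. It is not the project's import code; `tree_for`, `replay` and the operation names are hypothetical:

```python
def tree_for(op, blob_id):
    """Return the whole top-level tree for one commit: exactly one entry."""
    if op == "add":
        return {"m": blob_id}
    if op == "remove":
        return {"d": blob_id}
    raise ValueError("unknown op: %r" % (op,))


def replay(ops):
    """Build the sequence of top-level trees for a list of (op, blob) pairs."""
    return [tree_for(op, blob_id) for op, blob_id in ops]


def every_commit_changes_tree(trees):
    """True if each commit's tree differs from the one before it."""
    return all(a != b for a, b in zip(trees, trees[1:]))
```

Replaying "add; remove; add;" for the same blob yields the trees "m", "d", "m" in turn, so each commit differs from its predecessor and the second "add" stays visible.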
|
|
Ensure -convert and -compact do not make repositories
unreadable on live servers.
|
|
This bug was hidden due to timing problems with eatmydata or
running with tmpfs for TMPDIR.
|
|
I mainly focus on -watch for mirroring busy mailing lists, but
using -mda should remain an option.
|
|
Having multiple Xapian partitions is mostly pointless after
the initial import. We can compact all the partitions into
one while keeping the skeleton separate.
|
|
And we do not want to start making confused repos if somebody
leaves out "-V2" the second time around.
|
|
This should make it easier to let users perform comparisons and
migrate to v2 if needed.
|
|
Allow best-effort regeneration of NNTP article numbers from
cloned git repositories in addition to indexing Xapian. Article
numbers will not remain consistent when we add purge support,
though.
|
|
This still requires a msgmap.sqlite3 file to exist, but
it allows us to tweak Xapian indexing rules and reindex
the Xapian database online while -watch is running.
|
|
This will make it easier to as well as supporting future
Filter API users. It allows simplifying our ad-hoc
import_vger_from_mbox script.
|
|
While parallel processes improve import speed for initial
imports, they are probably not necessary for daily mail imports
via WatchMaildir and certainly not for public-inbox-init. This
saves some memory for daily use and even helps improve readability
of some subroutines by showing which methods they call remotely.
|
|
No functional changes, yet, but this makes future changes
easier-to-read.
|
|
A work-in-progress, but it appears the v2 UI pieces
will not require a lot of work to do.
|
|
It works around some bugs in older Email::MIME which we'll
find useful.
|
|
Using update-copyrights from gnulib
While we're at it, use the SPDX identifier for AGPL-3.0+ to
ease mechanical processing.
|
|
We need to use the correct subject when doing global scanning,
too. In fact, the per-recipient spam training path is entirely
redundant at this point.
|
|
Sometimes an email is an innocent removal "rm" for a
misdirected, off-topic post, while most removed messages are
"spam". Allow anybody to look at history and easily distinguish
the reason for removing the message.
|
|
This should be more reliable and safer as it'll ensure
existing fast-import instances are shut down properly.
|
|
We need to ensure new messages are being processed
fairly during full rescans, so have the ->scan subroutine
yield and reschedule itself. Additionally, having a
long-running task inside the signal handler is dangerous
and subject to reentrancy bugs.
Due to the limitations of the Filesys::Notify::Simple interface,
we cannot rely on multiplexing I/O interfaces (select, IO::Poll,
Danga::Socket, etc...) for this. Forking a separate process
was considered, but it is more expensive for a mostly-idle
process.
So, we use a variant of the "self-pipe trick" via inotify (or
whatever Filesys::Notify::Simple gives us). Instead of writing
to our own pipe, we write to a file in our own temporary
directory watched by Filesys::Notify::Simple to trigger events
in signal handlers.
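
A rough sketch of the file-based variant of the self-pipe trick, in Python. It covers only the trigger side: the class and method names are hypothetical, and the directory watcher that would report the change (Filesys::Notify::Simple in the real code) is left to the caller:

```python
import os
import signal
import tempfile


class FileTrigger:
    """Wake a directory-watching loop by touching a file we own."""

    def __init__(self):
        # Private temporary directory; the caller adds it to its watch list.
        self.dir = tempfile.mkdtemp(prefix="trigger-")
        self.path = os.path.join(self.dir, "wake")

    def poke(self, signum=None, frame=None):
        # Usable as a signal handler: do one tiny write and return,
        # leaving the long-running work to the main loop.
        fd = os.open(self.path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        try:
            os.write(fd, b"x")
        finally:
            os.close(fd)

    def consume(self):
        # Called by the main loop after the watcher reports an event in
        # self.dir; returns True if a wakeup was actually pending.
        try:
            os.unlink(self.path)
        except FileNotFoundError:
            return False
        return True

    def install(self, signum):
        # Route the given signal to poke().
        signal.signal(signum, self.poke)
```

The handler never runs the scan itself, so the reentrancy concern mentioned above does not arise; the scan happens later, in the normal event loop, once the watcher sees the file change.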
|
|
Otherwise the old watcher may run indefinitely
|
|
This matches the behavior of the -watch daemon since
6d534038285ddd760709ba76ea007f9108200097
("watch: watchspam affects all configured inboxes")
|