Date | Commit message (Collapse) |
|
This allows us to simplify version checking by avoiding
"//" or "||" operators sprinkled around.
|
|
"mainrepo" ws a bad name and artifact from the early days when I
intended for there to be a "spamrepo" (now just the
ENV{PI_EMERGENCY} Maildir). With v2, "mainrepo" can be
especially confusing, since v2 needs at least two git
repositories (epoch + all.git) to function and we shouldn't
confuse users by having them point to a git repository for v2.
Much of our documentation already references "INBOX_DIR" for
command-line arguments, so use "inboxdir" as the
git-config(1)-friendly variant for that.
"mainrepo" remains supported indefinitely for compatibility.
Users may need to revert to old versions, or may be referring
to old documentation and must not be forced to change config
files to account for this change.
So if you're using "mainrepo" today, I do NOT recommend changing
it right away because other bugs can lurk.
Link: https://public-inbox.org/meta/874l0ice8v.fsf@alyssa.is/
|
|
|
|
Well, it could probably be moved to contrib...
|
|
I haven't touched most these scripts in ages, but we might as well
purge \d usage from here, as well.
|
|
Extracted from import_slrnspool, since some spools get converted
to mbox or what not.
|
|
Stop showing redundant slashes and stop showing progress
for messages which do not exist.
|
|
|
|
While hunting duplicates, I noticed a leading '-' in some
Message-IDs as a result of RFC4648 encoding. While '-' seems
allowed by RFC5322 and URL-friendly (RFC4648), they are uncommon
and make using Message-IDs as arguments for command-line tools
more difficult. So prefix them with a datestamp to at least
give readers some sense of the age. And shorten the "localhost"
hostname to "z" to save space.
|
|
Xapian is size-intensive and SQLite is not strictly necessary for v1.
|
|
For objects like Inbox; the '-' prefixed hash keys are
probably intended for auto-generated/hidden parameters.
|
|
This will make it easier to as well as supporting future
Filter API users. It allows simplifying our ad-hoc
import_vger_from_mbox script.
|
|
Perhaps we should filter these headers out in Import
|
|
It appears most of the mboxes in the archive I've been given are
mboxrd (despite having Content-Length:) and needs the escaping.
|
|
The first Received: header is believable since it typically
hits the user's mail server and can be treated as relatively
trustworthy. We still show the Date: in per-message (permalink)
views, which may expose users for having incorrect Date:
headers, but all the ISO YYYY-MM-DD dates we display will
match what we see.
|
|
It works around some bugs in older Email::MIME which we'll
find useful.
|
|
It is less confusing without the clobber assignment; and
PublicInbox::MIME exists to workaround bugs in older
Email::MIME (which is in Debian 9 (stretch))
|
|
This will let us quickly test between v2 and v1 inboxes.
|
|
This is too slow, currently. Working with only 2017 LKML
archives:
git-only: ~1 minute
git + SQLite: ~12 minutes
git+Xapian+SQlite: ~45 minutes
So yes, it looks like we'll need to parallelize Xapian indexing,
at least.
|
|
Wrap the old Import package to enable creating new repos based
on size thresholds. This is better than relying on time-based
rotation as LKML traffic seems to be increasing.
|
|
Big lists are orders of magnitude more efficient with v2.
|
|
This can be useful for getting baseline of performance
of just Email::MIME and Date: header parsing. We'll need
to do some Date: header parsing for LKML since there are
some wonky date formats which causes the git RFC822 parser
to choke.
|
|
The mboxes I got from cregit have two spaces after the email
address, while the "git format-patch" output I'm used to dealing
with only has one space.
It's still a "strict" match in that it checks for something
resembling a timestamp, but it relaxes the number of spaces
between the email address and date.
|
|
Using update-copyrights from gnulib
While we're at it, use the SPDX identifier for AGPL-3.0+ to
ease mechanical processing.
|
|
This will be much faster and invoking -mda for every message.
|
|
Oops, due to an old mistake , List-ID was set incorrectly
in the MDA. This could cause some breakage w.r.t. mail filters.
|
|
This makes us closer to git.git style (though I'm not quite sure
why we do this...)
|
|
Based on reading RFC 3986, it seems '@', ':', '!', '$', '&',
"'", '; '(', ')', '*', '+', ',', ';', '=' are all allowed
in path-absolute where we have the Message-ID.
In any case, it seems '@' is fairly common in path components
nowadays and too common in Message-IDs.
|
|
I needed to use this to resurrect some messages missing
from my initial downloads from gmane...
|
|
For some existing mailing list archives, messages are identified
by serial number (such as NNTP article numbers in gmane). Those
links may become inaccessible (as is the current case for
gmane), so ensure users can still search based on old serial
numbers.
Now, I run the following periodically to get article numbers
from gmane (while news.gmane.org remains):
NNTPSERVER=news.gmane.org
export NNTPSERVER
GROUP=gmane.comp.version-control.git
perl -I lib scripts/xhdr-num2mid $GROUP --msgmap=/path/to/gmane.sqlite3
(I might integrate this further with public-inbox-* scripts one day).
My ~/.public-inbox/config as an added "altid" snippet which now
looks like this:
[publicinbox "git"]
address = git@vger.kernel.org
mainrepo = /path/to/git.vger.git
newsgroup = inbox.comp.version-control.git
; relative pathnames expand to $mainrepo/public-inbox/$file
altid = serial:gmane:file=gmane.sqlite3
And run "public-inbox-index --reindex /path/to/git.vger.git"
periodically.
This ought to allow searching for "gmane:12345" to work for
Xapian-enabled instances.
Disclaimer: while public-inbox supports NNTP and stable article
serial numbers, use of those for public links is discouraged
since it encourages centralization.
|
|
This is used to quickly generate an article number to Message-ID
mapping.
Usage:
NNTPSERVER=news.example.org ./scripts/xhdr-num2mid GROUP >file
|
|
In case others want to use it...
|
|
Oops :x
|
|
SpamAssassin often misses messages which contain viruses,
so ClamAV should fill that gap nicely.
|
|
Unfortunately, people screw up addresses enough and
for this to be a real problem.
|
|
|
|
Otherwise, tempfile() will use the current working directory,
which may not be writable.
|
|
A public-inbox is NOT necessarily a mailing list, but it
could serve as an input point for zero, one, or infinite
mailing lists :D
|
|
Unfortunately, most users still prefer their mail delivered
over SMTP; so we'll at least document mlmmj integration for now
until we can popularize pull-based reading over POP3/NNTP/ssoma.
|
|
In the future, it should be possible to use this:
git ls-files | UPDATE_COPYRIGHT_HOLDER='all contributors' \
UPDATE_COPYRIGHT_USE_INTERVALS=2 \
xargs /path/to/gnulib/build-aux/update-copyright
|
|
We want to be able to reject errors back to the MTA.
|
|
This prevents process growth when importing large messages.
Memory growth could be due to the sliding sbrk window in glibc malloc
or a circular reference in the Email::* Perl code somewhere.
|
|
PublicInbox::Config->lookup won't return unknown keys
|
|
This should alleviate fears of interrupting the process.
|
|
|
|
Some mailing lists (e.g. git@vger.kernel.org) accept messages
via Bcc: and possibly other things which get rejected by
the strict PublicInbox::Filter rules. So rely on ssoma-mda
instead.
This prefers a recent revision of ssoma-mda (commit 7fce38e9
onwards) to display subject/author/date information in the
commit message.
|
|
Apparently it's not a problem with recent archives.
|
|
We start with zero and only store the next valid ID.
|
|
This allows incremental imports of slrn spools, ideal for
tracking lists via gmane.
|
|
Any existing directory should do.
|