Date | Commit message |
|
File::Temp only requires four 'X' characters (unlike mkstemp(3),
which requires six). So only give it four to avoid an
80-column violation and maybe save metadata space on FSes.
|
|
Using "make update-copyrights" after setting GNULIB_PATH in my
config.mak
|
|
{pi_config} may be confused with the documented `PI_CONFIG'
environment variable, and we'll favor vowel-removal to be
consistent with our usage of object references.
The `pi_' prefix may stay in some places for now, since a
separate namespace may come into this codebase for local/private
client-tooling.
For InboxIdle, we'll also remove an invalid comment about
holding a reference to the PublicInbox::Config object, too.
|
|
When dupe-finder was switched from ->search->{over_ro} to ->over, the
database handle was dropped. Restore it because a spot downstream
uses it.
Fixes: 73e3a6ed6e95adc6 (use more idiomatic internal API for ->over access)
|
|
{over_ro} being a part of the Search object is a historical
oddity which will go away soon. Let's start removing its use in
tests and rarely-used helper scripts.
|
|
`->connect' is confused with the perlfunc for the `connect(2)'
syscall, and also `DBI->connect'. Since SQLite doesn't use
sockets, the word "connect" needlessly confuses me. Give
it a short name to match the field name we use for it, which
also matches the variable name used by the DBI(3pm) and
DBD::SQLite(3pm) manpages.
|
|
These aren't really supported and will probably be replaced with
better tools, but PublicInbox::Eml should be readily available
to anybody who already has our source tree.
|
|
glob() sorts alphabetically by default, which doesn't have
a useful meaning with many articles. Stop wasting CPU cycles
and memory.
|
|
I did not know to use the return value of `do' back in the day.
There's probably no practical difference in these cases, but
`eval' is overkill for these uses and may hide actual errors.
We can get rid of a few redundant `scalar' ops and pass scalar
refs to Email::MIME->new to avoid copies in a few more places,
too.
|
|
This allows us to consistently enforce the same Message-ID
extraction rules everywhere and makes it easier for us to
make changes in the future.
Update scripts/ssoma-replay, as well, but don't rely on
PublicInbox::* modules in that since it's legacy and
public-inbox was never a dependency of ssoma.
|
|
It shouldn't be hard to make this into a more generic
importer not specific to vger lists.
|
|
PublicInbox::InboxWritable takes care of those imports.
|
|
I didn't wait until September to do it, this year!
|
|
This allows us to simplify version checking by avoiding
"//" or "||" operators sprinkled around.
|
|
"mainrepo" was a bad name and an artifact from the early days when I
intended for there to be a "spamrepo" (now just the
ENV{PI_EMERGENCY} Maildir). With v2, "mainrepo" can be
especially confusing, since v2 needs at least two git
repositories (epoch + all.git) to function and we shouldn't
confuse users by having them point to a git repository for v2.
Much of our documentation already references "INBOX_DIR" for
command-line arguments, so use "inboxdir" as the
git-config(1)-friendly variant for that.
"mainrepo" remains supported indefinitely for compatibility.
Users may need to revert to old versions, or may be referring
to old documentation and must not be forced to change config
files to account for this change.
So if you're using "mainrepo" today, I do NOT recommend changing
it right away because other bugs can lurk.
Link: https://public-inbox.org/meta/874l0ice8v.fsf@alyssa.is/
|
|
|
|
Well, it could probably be moved to contrib...
|
|
I haven't touched most of these scripts in ages, but we might as well
purge \d usage from here, as well.
|
|
Extracted from import_slrnspool, since some spools get converted
to mbox or what not.
|
|
Stop showing redundant slashes and stop showing progress
for messages which do not exist.
|
|
|
|
While hunting duplicates, I noticed a leading '-' in some
Message-IDs as a result of RFC4648 encoding. While '-' seems
allowed by RFC5322 and URL-friendly (RFC4648), they are uncommon
and make using Message-IDs as arguments for command-line tools
more difficult. So prefix them with a datestamp to at least
give readers some sense of the age. And shorten the "localhost"
hostname to "z" to save space.
|
|
Xapian is size-intensive and SQLite is not strictly necessary for v1.
|
|
For objects like Inbox, the '-' prefixed hash keys are
probably intended for auto-generated/hidden parameters.
|
|
This will make it easier to support future Filter API users.
It also allows simplifying our ad-hoc import_vger_from_mbox
script.
|
|
Perhaps we should filter these headers out in Import
|
|
It appears most of the mboxes in the archive I've been given are
mboxrd (despite having Content-Length:) and need the escaping.
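For readers unfamiliar with mboxrd, the escaping rule can be sketched as follows. This is an illustrative Python sketch of the general mboxrd convention, not the Perl code used by these scripts; the function names are hypothetical.

```python
import re

# mboxrd convention: on write, every body line matching ^>*From<space>
# gains one leading '>'; on read, exactly one leading '>' is removed
# from every line matching ^>+From<space>.

def mboxrd_escape(body: str) -> str:
    """Quote From_-like lines so they cannot be mistaken for separators."""
    return re.sub(r"^(>*From )", r">\1", body, flags=re.MULTILINE)

def mboxrd_unescape(body: str) -> str:
    """Reverse mboxrd_escape by dropping one level of '>' quoting."""
    return re.sub(r"^>(>*From )", r"\1", body, flags=re.MULTILINE)
```

Because each direction adds or removes exactly one '>', the transformation is reversible, which is what distinguishes mboxrd from the older mboxo quoting.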
|
|
The first Received: header is believable since it typically
hits the user's mail server and can be treated as relatively
trustworthy. We still show the Date: in per-message (permalink)
views, which may expose users as having incorrect Date:
headers, but all the ISO YYYY-MM-DD dates we display will
match what we see.
|
|
It works around some bugs in older Email::MIME which we'll
find useful.
|
|
It is less confusing without the clobber assignment; and
PublicInbox::MIME exists to work around bugs in older
Email::MIME (which is in Debian 9 (stretch)).
|
|
This will let us quickly test between v2 and v1 inboxes.
|
|
This is too slow, currently. Working with only 2017 LKML
archives:
git-only: ~1 minute
git + SQLite: ~12 minutes
git + Xapian + SQLite: ~45 minutes
So yes, it looks like we'll need to parallelize Xapian indexing,
at least.
|
|
Wrap the old Import package to enable creating new repos based
on size thresholds. This is better than relying on time-based
rotation as LKML traffic seems to be increasing.
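The idea of rotating on size rather than time can be sketched like this. It is a hypothetical Python illustration only: the class name, the byte counting, and the threshold handling are assumptions and do not describe how the actual wrapper around Import works.

```python
class EpochRotator:
    """Pick a repository index ("epoch") from bytes written, not dates."""

    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes  # hypothetical size threshold per repo
        self.epoch = 0              # index of the repo currently written to
        self.written = 0            # bytes added to the current repo so far

    def add(self, message: bytes) -> int:
        """Return the epoch this message belongs in, rotating first if needed."""
        size = len(message)
        # Rotate only when the current epoch already holds data; an
        # oversized first message still has to land somewhere.
        if self.written and self.written + size > self.max_bytes:
            self.epoch += 1
            self.written = 0
        self.written += size
        return self.epoch
```

With a size threshold, a busy month simply produces more repositories, whereas a fixed time window would produce ever larger ones as traffic grows.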
|
|
Big lists are orders of magnitude more efficient with v2.
|
|
This can be useful for getting a performance baseline
of just Email::MIME and Date: header parsing. We'll need
to do some Date: header parsing for LKML since there are
some wonky date formats which cause the git RFC822 parser
to choke.
|
|
The mboxes I got from cregit have two spaces after the email
address, while the "git format-patch" output I'm used to dealing
with only has one space.
It's still a "strict" match in that it checks for something
resembling a timestamp, but it relaxes the number of spaces
between the email address and date.
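A relaxed but still "strict" From_ line check along these lines might look like the sketch below. The pattern is an assumption written for illustration in Python; it is not the regular expression used in the script, and the sample addresses are made up.

```python
import re

# Hypothetical pattern: "From ", a sender token, one OR MORE spaces,
# then something resembling an asctime(3) timestamp such as
# "Mon Sep 17 00:00:00 2001" (the day of month may be space-padded).
FROM_LINE = re.compile(
    r"^From \S+ +"
    r"[A-Z][a-z]{2} [A-Z][a-z]{2} [ \d]\d "
    r"\d\d:\d\d:\d\d \d{4}$"
)

def is_from_line(line: str) -> bool:
    """True if the line looks like an mbox message separator."""
    return FROM_LINE.match(line.rstrip("\n")) is not None
```

Requiring the timestamp keeps ordinary body lines that merely begin with "From " from being treated as message boundaries, while the `+` after the sender token accepts both the one-space and two-space variants.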
|
|
Using update-copyrights from gnulib
While we're at it, use the SPDX identifier for AGPL-3.0+ to
ease mechanical processing.
|
|
This will be much faster than invoking -mda for every message.
|
|
Oops, due to an old mistake, List-ID was set incorrectly
in the MDA. This could cause some breakage w.r.t. mail filters.
|
|
This makes us closer to git.git style (though I'm not quite sure
why we do this...)
|
|
Based on reading RFC 3986, it seems '@', ':', '!', '$', '&',
"'", '(', ')', '*', '+', ',', ';', '=' are all allowed
in path-absolute where we have the Message-ID.
In any case, it seems '@' is fairly common in path components
nowadays and too common in Message-IDs.
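As an illustration of the rule above, here is a minimal Python sketch that percent-encodes a Message-ID for use as a single path segment while leaving the RFC 3986 `pchar` extras alone. The helper name and sample Message-IDs are hypothetical, and this is not the project's own escaping routine.

```python
from urllib.parse import quote

# RFC 3986 pchar allows unreserved characters, sub-delims, ':' and '@'.
# quote() already leaves unreserved characters untouched, so only the
# extras are listed here. '/' is deliberately absent so that a slash
# inside a Message-ID is still percent-encoded.
PATH_SEGMENT_SAFE = "@:!$&'()*+,;="

def mid_to_path(mid: str) -> str:
    """Escape a Message-ID for use as one URL path segment."""
    return quote(mid, safe=PATH_SEGMENT_SAFE)
```

Leaving '@' unescaped keeps typical Message-ID URLs readable, since nearly every Message-ID contains one.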
|
|
I needed to use this to resurrect some messages missing
from my initial downloads from gmane...
|
|
For some existing mailing list archives, messages are identified
by serial number (such as NNTP article numbers in gmane). Those
links may become inaccessible (as is the current case for
gmane), so ensure users can still search based on old serial
numbers.
Now, I run the following periodically to get article numbers
from gmane (while news.gmane.org remains):
	NNTPSERVER=news.gmane.org
	export NNTPSERVER
	GROUP=gmane.comp.version-control.git
	perl -I lib scripts/xhdr-num2mid $GROUP --msgmap=/path/to/gmane.sqlite3
(I might integrate this further with public-inbox-* scripts one day).
My ~/.public-inbox/config has an added "altid" snippet which now
looks like this:
	[publicinbox "git"]
		address = git@vger.kernel.org
		mainrepo = /path/to/git.vger.git
		newsgroup = inbox.comp.version-control.git
		; relative pathnames expand to $mainrepo/public-inbox/$file
		altid = serial:gmane:file=gmane.sqlite3
And run "public-inbox-index --reindex /path/to/git.vger.git"
periodically.
This ought to allow searching for "gmane:12345" to work for
Xapian-enabled instances.
Disclaimer: while public-inbox supports NNTP and stable article
serial numbers, use of those for public links is discouraged
since it encourages centralization.
|
|
This is used to quickly generate an article number to Message-ID
mapping.
Usage:
NNTPSERVER=news.example.org ./scripts/xhdr-num2mid GROUP >file
|
|
In case others want to use it...
|
|
Oops :x
|
|
SpamAssassin often misses messages which contain viruses,
so ClamAV should fill that gap nicely.
|
|
Unfortunately, people screw up addresses enough
for this to be a real problem.
|
|
|
|
Otherwise, tempfile() will use the current working directory,
which may not be writable.
|