path: root/scripts
Date Commit message
2020-04-02 mid: add $MID_EXTRACT regexp for export
This allows us to consistently enforce the same Message-ID extraction rules everywhere and makes it easier for us to make changes in the future. Update scripts/ssoma-replay, as well, but don't rely on PublicInbox::* modules in that since it's legacy and public-inbox was never a dependency of ssoma.
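A minimal sketch of the idea, in Python rather than the project's Perl: one shared, compiled pattern that every caller uses to pull Message-IDs out of a header value. The pattern and the helper name `extract_mids` are assumptions for illustration and are not taken from the actual `$MID_EXTRACT` definition.

```python
import re

# Assumed pattern: capture whatever sits between "<" and the next ">".
MID_EXTRACT = re.compile(r"<([^>]+)>")


def extract_mids(header_value):
    """Return every Message-ID in a header value, without angle brackets."""
    return MID_EXTRACT.findall(header_value)


print(extract_mids("<a@example.com> <b@example.com>"))
```

Keeping the pattern in one place means a later rule change only has to be made once.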
2020-02-24 import_vger_from_mbox: add --filter parameter
It shouldn't be hard to make this into a more generic importer not specific to vger lists.
2020-02-24 import_vger_from_mbox: drop redundant "use" statements
PublicInbox::InboxWritable takes care of those imports.
2020-02-06 treewide: run update-copyrights from gnulib for 2019
I didn't wait until September to do it, this year!
2020-01-27 inbox: add ->version method
This allows us to simplify version checking by avoiding "//" or "||" operators sprinkled around.
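As a rough illustration of why an accessor helps, here is a Python sketch with a hypothetical `Inbox` class. The default of version 1 for inboxes without an explicit version is an assumption, not something stated in the commit.

```python
class Inbox:
    """Hypothetical stand-in for an inbox object."""

    def __init__(self, version=None):
        self._version = version

    def version(self):
        # Assumption: an inbox with no explicit version is treated as v1.
        return 1 if self._version is None else self._version


def needs_epochs(inbox):
    # Callers compare against one accessor instead of repeating the default.
    return inbox.version() >= 2


print(needs_epochs(Inbox()), needs_epochs(Inbox(2)))
```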
2019-10-16 config: support "inboxdir" in addition to "mainrepo"
"mainrepo" was a bad name and an artifact from the early days when I intended for there to be a "spamrepo" (now just the ENV{PI_EMERGENCY} Maildir). With v2, "mainrepo" can be especially confusing, since v2 needs at least two git repositories (epoch + all.git) to function and we shouldn't confuse users by having them point to a git repository for v2. Much of our documentation already references "INBOX_DIR" for command-line arguments, so use "inboxdir" as the git-config(1)-friendly variant for that. "mainrepo" remains supported indefinitely for compatibility. Users may need to revert to old versions, or may be referring to old documentation and must not be forced to change config files to account for this change. So if you're using "mainrepo" today, I do NOT recommend changing it right away because other bugs can lurk. Link: https://public-inbox.org/meta/874l0ice8v.fsf@alyssa.is/
2019-09-09 run update-copyrights from gnulib for 2019
2019-06-05 scripts: add README to describe its purpose
Well, it could probably be moved to contrib...
2019-06-05 scripts: require ASCII digits in a few places
I haven't touched most of these scripts in ages, but we might as well purge \d usage from here, as well.
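The reason for avoiding \d can be shown in Python, whose `re` module has the same behavior for text patterns: \d accepts any Unicode decimal digit, while an explicit `[0-9]` class accepts only ASCII. This is an illustrative sketch, not code from the scripts.

```python
import re

arabic_indic = "\u0661\u0662\u0663"  # the digits 1, 2, 3 in Arabic-Indic script

# \d matches any Unicode decimal digit in a str pattern ...
unicode_match = re.fullmatch(r"\d+", arabic_indic)
# ... while an explicit class only accepts ASCII 0-9.
ascii_match = re.fullmatch(r"[0-9]+", arabic_indic)

print(bool(unicode_match), bool(ascii_match))  # prints: True False
```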
2018-12-28 add filter for gmane archives
Extracted from import_slrnspool, since some spools get converted to mbox or what not.
2018-05-02 scripts/import_slrnspool: cleanup progress messages
Stop showing redundant slashes and stop showing progress for messages which do not exist.
2018-05-02 scripts/import_slrnspool: support v2 repos
2018-04-18 v2: generate better Message-IDs for duplicates
While hunting duplicates, I noticed a leading '-' in some Message-IDs as a result of RFC4648 encoding. While '-' seems allowed by RFC5322 and URL-friendly (RFC4648), they are uncommon and make using Message-IDs as arguments for command-line tools more difficult. So prefix them with a datestamp to at least give readers some sense of the age. And shorten the "localhost" hostname to "z" to save space.
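A sketch of the scheme described above, in Python. The choice of SHA-1 over the raw message, the exact timestamp format, and the function name `generate_mid` are all assumptions made for illustration. Only the overall shape (datestamp prefix, URL-safe RFC4648 token, "z" as the host part) comes from the commit message.

```python
import base64
import hashlib
import time


def generate_mid(content: bytes, timestamp: int) -> str:
    """Build a replacement Message-ID for a duplicate (hypothetical helper)."""
    digest = hashlib.sha1(content).digest()
    # URL-safe base64 (RFC4648) may begin with "-", hence the prefix below.
    token = base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
    stamp = time.strftime("%Y%m%d%H%M%S", time.gmtime(timestamp))
    return f"{stamp}.{token}@z"


print(generate_mid(b"example message", 1523923200))
```

Because the datestamp always comes first, the result never starts with '-' and is safe to pass as a command-line argument.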
2018-04-06 ensure Xapian and SQLite are still optional for v1 tests
Xapian is size-intensive and SQLite is not strictly necessary for v1.
2018-04-01 scripts/import_vger_from_mbox: set address properly
For objects like Inbox; the '-' prefixed hash keys are probably intended for auto-generated/hidden parameters.
2018-03-20 InboxWritable: add mbox/maildir parsing + import logic
This will make importing easier, as well as supporting future Filter API users. It allows simplifying our ad-hoc import_vger_from_mbox script.
2018-03-19 scripts/import_vger_from_mbox: filter out same headers as MDA
Perhaps we should filter these headers out in Import.
2018-03-06 scripts/import_vger_from_mbox: perform mboxrd or mboxo escaping
It appears most of the mboxes in the archive I've been given are mboxrd (despite having Content-Length:) and need the escaping.
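For readers unfamiliar with the two variants, a Python sketch of the quoting rules follows. In mboxrd, any body line consisting of zero or more ">" followed by "From " gains one extra ">" when written and loses one when read. In mboxo, only a bare "From " is quoted, so reading cannot tell an original ">From " from an escaped one. The function names are illustrative.

```python
import re


def escape_mboxrd(body: bytes) -> bytes:
    """Prefix one ">" to every line matching ^>*From (mboxrd writing)."""
    return re.sub(rb"^(>*From )", rb">\1", body, flags=re.MULTILINE)


def unescape_mboxrd(body: bytes) -> bytes:
    """Strip one ">" from every line matching ^>+From (mboxrd reading)."""
    return re.sub(rb"^>(>*From )", rb"\1", body, flags=re.MULTILINE)


def unescape_mboxo(body: bytes) -> bytes:
    """mboxo reading: only a single leading ">" before "From " is undone."""
    return re.sub(rb"^>From ", rb"From ", body, flags=re.MULTILINE)


body = b"From here\n>From there\nplain line\n"
assert unescape_mboxrd(escape_mboxrd(body)) == body  # mboxrd round-trips
```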
2018-03-06 favor Received: date over Date: header globally
The first Received: header is believable since it typically hits the user's mail server and can be treated as relatively trustworthy. We still show the Date: in per-message (permalink) views, which may expose users for having incorrect Date: headers, but all the ISO YYYY-MM-DD dates we display will match what we see.
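A sketch of that preference using Python's standard `email` package. It assumes "first" means the topmost Received: header in the message and that the timestamp follows the last ";" in that header. The helper name `message_datetime` is hypothetical.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime


def message_datetime(msg):
    """Prefer the first Received: timestamp, falling back to Date:."""
    received = msg.get_all("Received") or []
    if received:
        # The timestamp follows the last ";" in a Received: header.
        _, sep, stamp = received[0].rpartition(";")
        if sep:
            try:
                return parsedate_to_datetime(stamp.strip())
            except (TypeError, ValueError):
                pass  # unparsable Received: date, fall through to Date:
    date = msg.get("Date")
    return parsedate_to_datetime(date) if date else None


raw = (
    "Received: from a.example by b.example; Tue, 06 Mar 2018 10:00:00 +0000\n"
    "Date: Mon, 01 Jan 1990 00:00:00 +0000\n"
    "Subject: test\n\nbody\n"
)
print(message_datetime(message_from_string(raw)).isoformat())
```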
2018-02-28 use PublicInbox::MIME consistently
It works around some bugs in older Email::MIME which we'll find useful.
2018-02-22 import_vger_from_mbox: use PublicInbox::MIME and avoid clobbering
It is less confusing without the clobber assignment; and PublicInbox::MIME exists to work around bugs in older Email::MIME (which is in Debian 9 (stretch)).
2018-02-22 import_vger_from_inbox: allow "-V" option
This will let us quickly test between v2 and v1 inboxes.
2018-02-20 v2: support Xapian + SQLite indexing
This is too slow, currently. Working with only 2017 LKML archives:
    git-only: ~1 minute
    git + SQLite: ~12 minutes
    git + Xapian + SQLite: ~45 minutes
So yes, it looks like we'll need to parallelize Xapian indexing, at least.
2018-02-19 v2writable: initial cut for repo-rotation
Wrap the old Import package to enable creating new repos based on size thresholds. This is better than relying on time-based rotation as LKML traffic seems to be increasing.
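Size-based rotation can be sketched as a small counter that moves to a new repository (epoch) once the bytes written would exceed a threshold. The class name, the threshold value and the byte accounting below are assumptions for illustration, not the v2writable implementation.

```python
class EpochRotator:
    """Hypothetical helper deciding which epoch repo receives each message."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.epoch = 0
        self.written = 0

    def add(self, nbytes):
        """Return the epoch a message of nbytes should be written to."""
        # Rotate only if the current epoch already holds something, so a
        # single oversized message still lands somewhere.
        if self.written and self.written + nbytes > self.max_bytes:
            self.epoch += 1
            self.written = 0
        self.written += nbytes
        return self.epoch


rotator = EpochRotator(max_bytes=100)
print([rotator.add(n) for n in (60, 60, 30, 200)])  # prints: [0, 1, 1, 2]
```

Unlike a time-based scheme, the number of epochs here tracks the volume of mail, so rising traffic does not produce ever larger repositories.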
2018-02-15 scripts/import_vger_from_mbox: use v2 layout for import
Big lists are orders of magnitude more efficient with v2.
2018-02-13 scripts/import_vger_from_mbox: support --dry-run option
This can be useful for getting a baseline of the performance of just Email::MIME and Date: header parsing. We'll need to do some Date: header parsing for LKML since there are some wonky date formats which cause the git RFC822 parser to choke.
2018-02-08 scripts/import_vger_from_mbox: relax From_ line match slightly
The mboxes I got from cregit have two spaces after the email address, while the "git format-patch" output I'm used to dealing with only has one space. It's still a "strict" match in that it checks for something resembling a timestamp, but it relaxes the number of spaces between the email address and date.
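One way such a match could look, sketched in Python: the pattern still requires an asctime-style timestamp, but accepts one or more spaces between the sender and the date. The exact regular expression used by the script is not shown in this log, so this pattern is an assumption.

```python
import re

FROM_LINE = re.compile(
    r"^From \S+ +"                               # sender, then 1+ spaces
    r"[A-Z][a-z]{2} [A-Z][a-z]{2} +[0-9]{1,2} "  # weekday, month, day
    r"[0-9]{2}:[0-9]{2}:[0-9]{2} [0-9]{4}$"      # HH:MM:SS and year
)

for line in (
    "From user@example.com  Thu Feb  8 12:34:56 2018",  # two spaces
    "From user@example.com Thu Feb  8 12:34:56 2018",   # one space
    "From the start, this was body text",               # not a From_ line
):
    print(bool(FROM_LINE.match(line)))
```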
2018-02-07 update copyrights for 2018
Using update-copyrights from gnulib. While we're at it, use the SPDX identifier for AGPL-3.0+ to ease mechanical processing.
2017-06-29 scripts/import_maildir: rewrite to use Import
This will be much faster than invoking -mda for every message.
2017-06-26 mda: set List-ID correctly according to RFC2919
Oops, due to an old mistake, List-ID was set incorrectly in the MDA. This could cause some breakage w.r.t. mail filters.
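RFC2919 defines the header value as an optional phrase followed by a dot-separated identifier in angle brackets. A minimal sketch, assuming the identifier is derived from the list address by replacing "@" with "." (a common convention, but an assumption about what the MDA does):

```python
def list_id_header(address, description=None):
    """Build an RFC2919-style List-Id value from a list address."""
    local, _, domain = address.partition("@")
    list_id = f"<{local}.{domain}>"
    return f"{description} {list_id}" if description else list_id


print(list_id_header("git@vger.kernel.org"))  # prints: <git.vger.kernel.org>
```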
2016-08-21 avoid spaces after shell redirection operators
This makes us closer to git.git style (though I'm not quite sure why we do this...)
2016-08-14 www: do not unnecessarily escape some chars in paths
Based on reading RFC 3986, it seems '@', ':', '!', '$', '&', "'", '(', ')', '*', '+', ',', ';', '=' are all allowed in path-absolute where we have the Message-ID. In any case, it seems '@' is fairly common in path components nowadays and too common in Message-IDs.
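The same rule can be expressed with Python's `urllib.parse.quote`, leaving the RFC 3986 path characters listed above unescaped while still escaping "/" so that a Message-ID stays a single path segment. The helper name `mid_to_path` is illustrative.

```python
from urllib.parse import quote

# ":" and "@" plus the RFC 3986 sub-delims, all legal in a path segment.
PATH_SAFE = "@:!$&'()*+,;="


def mid_to_path(mid):
    """Percent-encode a Message-ID for use as one URL path segment."""
    return quote(mid, safe=PATH_SAFE)


print(mid_to_path("20160814$abc@example.com"))  # unchanged
print(mid_to_path("a/b@example.com"))           # prints: a%2Fb@example.com
```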
2016-08-14 import_slrnspool: reimplement using fast-import
I needed to use this to resurrect some messages missing from my initial downloads from gmane...
2016-08-11 search: support alt-ID for mapping legacy serial numbers
For some existing mailing list archives, messages are identified by serial number (such as NNTP article numbers in gmane). Those links may become inaccessible (as is the current case for gmane), so ensure users can still search based on old serial numbers.
Now, I run the following periodically to get article numbers from gmane (while news.gmane.org remains):
    NNTPSERVER=news.gmane.org
    export NNTPSERVER
    GROUP=gmane.comp.version-control.git
    perl -I lib scripts/xhdr-num2mid $GROUP --msgmap=/path/to/gmane.sqlite3
(I might integrate this further with public-inbox-* scripts one day).
My ~/.public-inbox/config has an added "altid" snippet which now looks like this:
    [publicinbox "git"]
        address = git@vger.kernel.org
        mainrepo = /path/to/git.vger.git
        newsgroup = inbox.comp.version-control.git
        ; relative pathnames expand to $mainrepo/public-inbox/$file
        altid = serial:gmane:file=gmane.sqlite3
And run "public-inbox-index --reindex /path/to/git.vger.git" periodically. This ought to allow searching for "gmane:12345" to work for Xapian-enabled instances.
Disclaimer: while public-inbox supports NNTP and stable article serial numbers, use of those for public links is discouraged since it encourages centralization.
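To make the data flow concrete, here is a Python sketch that splits an altid spec of the form shown above and resolves a serial number to a Message-ID from a SQLite table. The table layout (`msgmap` with `num` and `mid` columns) and both helper names are assumptions for illustration and may not match the real msgmap schema.

```python
import sqlite3


def parse_altid(spec):
    """Split "type:prefix:key=value" into its parts (hypothetical helper)."""
    kind, prefix, query = spec.split(":", 2)
    key, _, value = query.partition("=")
    return {"type": kind, "prefix": prefix, key: value}


def lookup_mid(db, serial):
    """Map a legacy serial number to a Message-ID, or None if unknown."""
    row = db.execute(
        "SELECT mid FROM msgmap WHERE num = ?", (serial,)
    ).fetchone()
    return row[0] if row else None


# In-memory stand-in for gmane.sqlite3, with an assumed schema.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE msgmap (num INTEGER PRIMARY KEY, mid TEXT NOT NULL UNIQUE)")
db.execute("INSERT INTO msgmap VALUES (?, ?)", (12345, "example@example.com"))

print(parse_altid("serial:gmane:file=gmane.sqlite3"))
print(lookup_mid(db, 12345))
```

A query such as "gmane:12345" would then use the prefix to pick the mapping and the number to find the message.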
2016-07-28 add scripts/xhdr-num2mid example
This is used to quickly generate an article number to Message-ID mapping. Usage:
    NNTPSERVER=news.example.org ./scripts/xhdr-num2mid GROUP >file
2016-07-28 add script used for importing git from download.gmane.org
In case others want to use it...
2016-07-06 scripts/dc-dlvr: ensure temporary files are removed
Oops :x
2016-06-17 scripts/dc-dlvr: ClamAV support via clamdscan
SpamAssassin often misses messages which contain viruses, so ClamAV should fill that gap nicely.
2016-06-17 scripts/dc-dlvr: remove catchall account
Unfortunately, people screw up addresses often enough for this to be a real problem.
2016-06-17 scripts/dc-dlvr: update copyright
2016-05-20 ssoma-replay: use TMPDIR for temporary path
Otherwise, tempfile() will use the current working directory, which may not be writable.
2016-05-14 rename most instances of "list" to "inbox"
A public-inbox is NOT necessarily a mailing list, but it could serve as an input point for zero, one, or infinite mailing lists :D
2016-05-14 import ssoma-replay example script I've been using
Unfortunately, most users still prefer their mail delivered over SMTP; so we'll at least document mlmmj integration for now until we can popularize pull-based reading over POP3/NNTP/ssoma.
2015-09-06 update copyright headers and email addresses
In the future, it should be possible to use this:
    git ls-files | UPDATE_COPYRIGHT_HOLDER='all contributors' \
        UPDATE_COPYRIGHT_USE_INTERVALS=2 \
        xargs /path/to/gnulib/build-aux/update-copyright
2015-07-14 scripts/dc-dlvr.pre: ensure stderr gets back to the MTA
We want to be able to reject errors back to the MTA.
2015-01-12 import_slrnspool: fork a process for each message
This prevents process growth when importing large messages. Memory growth could be due to the sliding sbrk window in glibc malloc or a circular reference in the Email::* Perl code somewhere.
2015-01-11 import_slrnspool: load private config key
PublicInbox::Config->lookup won't return unknown keys
2015-01-11 import_slrnspool: graceful exit for interruptibility
This should alleviate fears of interrupting the process.
2015-01-11 import_slrnspool: make filtering optional
2015-01-11 import_slrnspool: use ssoma-mda instead
Some mailing lists (e.g. git@vger.kernel.org) accept messages via Bcc: and possibly other things which get rejected by the strict PublicInbox::Filter rules. So rely on ssoma-mda instead. This prefers a recent revision of ssoma-mda (commit 7fce38e9 onwards) to display subject/author/date information in the commit message.