path: root/scripts
Date Commit message
2018-12-28 add filter for gmane archives
Extracted from import_slrnspool, since some spools get converted to mbox or what not.
2018-05-02 scripts/import_slrnspool: cleanup progress messages
Stop showing redundant slashes and stop showing progress for messages which do not exist.
2018-05-02 scripts/import_slrnspool: support v2 repos
2018-04-18 v2: generate better Message-IDs for duplicates
While hunting duplicates, I noticed a leading '-' in some Message-IDs as a result of RFC4648 encoding. While '-' seems allowed by RFC5322 and URL-friendly (RFC4648), they are uncommon and make using Message-IDs as arguments for command-line tools more difficult. So prefix them with a datestamp to at least give readers some sense of the age. And shorten the "localhost" hostname to "z" to save space.
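The idea can be sketched roughly as follows, in Python rather than the project's Perl. The hash function, the digest input, and the exact layout of the generated ID are assumptions; the commit message only states that a datestamp prefix is added and that "localhost" becomes "z".

```python
import base64
import hashlib
import time


def generated_message_id(raw_message: bytes, timestamp: float) -> str:
    """Build a replacement Message-ID for a duplicate (hypothetical layout)."""
    # RFC 4648 URL-safe base64 can begin with '-', which command-line
    # tools may mistake for an option.
    digest = hashlib.sha1(raw_message).digest()
    encoded = base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
    # A leading datestamp means the ID never starts with '-' and hints at age.
    datestamp = time.strftime("%Y%m%d%H%M%S", time.gmtime(timestamp))
    # "z" stands in for the longer "localhost" hostname to save space.
    return f"<{datestamp}.{encoded}@z>"
```

Because the datestamp always comes first, the encoded part may still contain '-' without ever leading the ID.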
2018-04-06 ensure Xapian and SQLite are still optional for v1 tests
Xapian is size-intensive and SQLite is not strictly necessary for v1.
2018-04-01 scripts/import_vger_from_mbox: set address properly
For objects like Inbox; the '-' prefixed hash keys are probably intended for auto-generated/hidden parameters.
2018-03-20 InboxWritable: add mbox/maildir parsing + import logic
This will make it easier to as well as supporting future Filter API users. It allows simplifying our ad-hoc import_vger_from_mbox script.
2018-03-19 scripts/import_vger_from_mbox: filter out same headers as MDA
Perhaps we should filter these headers out in Import
2018-03-06 scripts/import_vger_from_mbox: perform mboxrd or mboxo escaping
It appears most of the mboxes in the archive I've been given are mboxrd (despite having Content-Length:) and need the escaping.
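For reference, the two quoting conventions differ only in which body lines get a '>' prepended. The following is a minimal Python sketch of both, not the script's actual Perl code; the function names are illustrative.

```python
import re

# mboxrd quotes any body line made of zero or more '>' followed by "From ".
_MBOXRD_FROM = re.compile(rb"^(>*From )", re.MULTILINE)
_MBOXRD_QUOTED = re.compile(rb"^>(>*From )", re.MULTILINE)
# mboxo only quotes bare "From " lines, so it cannot be reversed reliably.
_MBOXO_FROM = re.compile(rb"^(From )", re.MULTILINE)


def mboxrd_escape(body: bytes) -> bytes:
    """Add one '>' to every line that looks like a (quoted) From_ line."""
    return _MBOXRD_FROM.sub(rb">\1", body)


def mboxrd_unescape(body: bytes) -> bytes:
    """Strip exactly one '>' from every quoted From_ line."""
    return _MBOXRD_QUOTED.sub(rb"\1", body)


def mboxo_escape(body: bytes) -> bytes:
    """Quote only unquoted "From " lines, as mboxo does."""
    return _MBOXO_FROM.sub(rb">\1", body)
```

mboxrd round-trips because every level of quoting is preserved, whereas mboxo leaves an existing ">From " line indistinguishable from one it quoted itself.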
2018-03-06 favor Received: date over Date: header globally
The first Received: header is believable since it typically hits the user's mail server and can be treated as relatively trustworthy. We still show the Date: in per-message (permalink) views, which may expose users for having incorrect Date: headers, but all the ISO YYYY-MM-DD dates we display will match what we see.
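A rough Python sketch of this preference order, using only the standard library. Reading "first" as the topmost Received: header, and falling back to Date: when no usable Received: timestamp exists, are assumptions rather than details confirmed by the commit message.

```python
from datetime import datetime
from email import message_from_bytes
from email.utils import parsedate_to_datetime
from typing import Optional


def message_datetime(raw: bytes) -> Optional[datetime]:
    """Prefer the topmost Received: timestamp, then fall back to Date:."""
    msg = message_from_bytes(raw)
    candidates = []
    received = msg.get_all("Received", [])
    if received:
        # The timestamp of a Received: header follows its last ';'.
        _, sep, stamp = str(received[0]).rpartition(";")
        if sep:
            candidates.append(stamp.strip())
    if msg["Date"]:
        candidates.append(str(msg["Date"]))
    for value in candidates:
        try:
            return parsedate_to_datetime(value)
        except (TypeError, ValueError):
            continue  # unparseable; try the next candidate
    return None
```

Returning `None` leaves the policy for messages with no parseable date to the caller.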
2018-02-28 use PublicInbox::MIME consistently
It works around some bugs in older Email::MIME which we'll find useful.
2018-02-22 import_vger_from_mbox: use PublicInbox::MIME and avoid clobbering
It is less confusing without the clobber assignment, and PublicInbox::MIME exists to work around bugs in older Email::MIME (which is in Debian 9 (stretch)).
2018-02-22 import_vger_from_inbox: allow "-V" option
This will let us quickly test between v2 and v1 inboxes.
2018-02-20 v2: support Xapian + SQLite indexing
This is too slow, currently. Working with only 2017 LKML archives:
    git-only: ~1 minute
    git + SQLite: ~12 minutes
    git + Xapian + SQLite: ~45 minutes
So yes, it looks like we'll need to parallelize Xapian indexing, at least.
2018-02-19 v2writable: initial cut for repo-rotation
Wrap the old Import package to enable creating new repos based on size thresholds. This is better than relying on time-based rotation as LKML traffic seems to be increasing.
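Size-based rotation can be illustrated with a small Python sketch. The class name, the byte-count bookkeeping, and the threshold semantics are all hypothetical; the commit message does not say how the real wrapper measures size.

```python
class RotatingWriter:
    """Track which numbered repo the next message should be written to."""

    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes  # hypothetical size threshold per repo
        self.repo_index = 0
        self.repo_bytes = 0

    def add(self, raw_message: bytes) -> int:
        """Account for one message and return the index of its repo."""
        size = len(raw_message)
        # Rotate only when the current repo holds data and would overflow.
        if self.repo_bytes and self.repo_bytes + size > self.max_bytes:
            self.repo_index += 1
            self.repo_bytes = 0
        self.repo_bytes += size
        return self.repo_index
```

Unlike a time-based scheme, the number of repos here grows with traffic, so a busier list simply rotates more often.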
2018-02-15 scripts/import_vger_from_mbox: use v2 layout for import
Big lists are orders of magnitude more efficient with v2.
2018-02-13 scripts/import_vger_from_mbox: support --dry-run option
This can be useful for getting a baseline of the performance of just Email::MIME and Date: header parsing. We'll need to do some Date: header parsing for LKML since there are some wonky date formats which cause the git RFC822 parser to choke.
2018-02-08 scripts/import_vger_from_mbox: relax From_ line match slightly
The mboxes I got from cregit have two spaces after the email address, while the "git format-patch" output I'm used to dealing with only has one space. It's still a "strict" match in that it checks for something resembling a timestamp, but it relaxes the number of spaces between the email address and date.
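A pattern in this spirit might look like the Python sketch below. The exact regular expression used by the script is not shown in the commit message, so this one is an assumption that only mirrors the described behavior: one or more spaces after the address, followed by something resembling an asctime-style timestamp.

```python
import re

# "From <address><one or more spaces><Day Mon DD HH:MM:SS YYYY>"
FROM_LINE = re.compile(
    r"^From \S+ +"                  # address, then one OR MORE spaces
    r"[A-Z][a-z]{2} [A-Z][a-z]{2} "  # weekday and month names
    r"[ \d]\d \d\d:\d\d:\d\d \d{4}"  # space-padded day, time, year
    r"\s*$"
)


def is_from_line(line: str) -> bool:
    """Return True if the line looks like an mbox From_ separator."""
    return FROM_LINE.match(line) is not None
```

Requiring the timestamp keeps ordinary body lines that merely start with "From " from being treated as message separators.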
2018-02-07 update copyrights for 2018
Using update-copyrights from gnulib. While we're at it, use the SPDX identifier for AGPL-3.0+ to ease mechanical processing.
2017-06-29 scripts/import_maildir: rewrite to use Import
This will be much faster than invoking -mda for every message.
2017-06-26 mda: set List-ID correctly according to RFC2919
Oops, due to an old mistake, List-ID was set incorrectly in the MDA. This could cause some breakage w.r.t. mail filters.
2016-08-21 avoid spaces after shell redirection operators
This makes us closer to git.git style (though I'm not quite sure why we do this...)
2016-08-14 www: do not unnecessarily escape some chars in paths
Based on reading RFC 3986, it seems '@', ':', '!', '$', '&', "'", '(', ')', '*', '+', ',', ';', '=' are all allowed in path-absolute where we have the Message-ID. In any case, it seems '@' is fairly common in path components nowadays and too common in Message-IDs.
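The same rule can be expressed in Python with `urllib.parse.quote`, which percent-encodes everything outside the unreserved set unless told otherwise. This is an illustrative sketch, not the project's Perl escaping routine; `mid_to_path` is a hypothetical name.

```python
from urllib.parse import quote

# ':' and '@' plus the RFC 3986 sub-delims, all permitted in a path segment.
PATH_SAFE = "@:!$&'()*+,;="


def mid_to_path(message_id: str) -> str:
    """Escape a Message-ID for use as a single URL path component."""
    # '/' is deliberately not in PATH_SAFE, so it still becomes %2F.
    return quote(message_id, safe=PATH_SAFE)
```

A typical Message-ID such as `20160814.a+b@example.com` passes through unchanged, which keeps URLs readable.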
2016-08-14 import_slrnspool: reimplement using fast-import
I needed to use this to resurrect some messages missing from my initial downloads from gmane...
2016-08-11 search: support alt-ID for mapping legacy serial numbers
For some existing mailing list archives, messages are identified by serial number (such as NNTP article numbers in gmane). Those links may become inaccessible (as is the current case for gmane), so ensure users can still search based on old serial numbers.
Now, I run the following periodically to get article numbers from gmane (while news.gmane.org remains):
    NNTPSERVER=news.gmane.org
    export NNTPSERVER
    GROUP=gmane.comp.version-control.git
    perl -I lib scripts/xhdr-num2mid $GROUP --msgmap=/path/to/gmane.sqlite3
(I might integrate this further with public-inbox-* scripts one day).
My ~/.public-inbox/config has an added "altid" snippet which now looks like this:
    [publicinbox "git"]
        address = git@vger.kernel.org
        mainrepo = /path/to/git.vger.git
        newsgroup = inbox.comp.version-control.git
        ; relative pathnames expand to $mainrepo/public-inbox/$file
        altid = serial:gmane:file=gmane.sqlite3
And run "public-inbox-index --reindex /path/to/git.vger.git" periodically. This ought to allow searching for "gmane:12345" to work for Xapian-enabled instances.
Disclaimer: while public-inbox supports NNTP and stable article serial numbers, use of those for public links is discouraged since it encourages centralization.
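To make the shape of the "altid" value concrete, here is a hedged Python sketch that splits a spec such as `serial:gmane:file=gmane.sqlite3` and applies the relative-path rule from the config comment. The function name and the returned tuple are hypothetical, and only the `file=` parameter seen above is handled.

```python
import os
from typing import Tuple


def parse_altid(spec: str, mainrepo: str) -> Tuple[str, str, str]:
    """Split "<type>:<prefix>:file=<path>" into (type, prefix, full path)."""
    kind, prefix, param = spec.split(":", 2)
    key, _, value = param.partition("=")
    if key != "file" or not value:
        raise ValueError(f"unsupported altid parameter: {param!r}")
    # Relative pathnames expand to $mainrepo/public-inbox/$file.
    if not os.path.isabs(value):
        value = os.path.join(mainrepo, "public-inbox", value)
    return kind, prefix, value
```

The prefix ("gmane" here) is what a user would type before the serial number in a search such as "gmane:12345".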
2016-07-28 add scripts/xhdr-num2mid example
This is used to quickly generate an article number to Message-ID mapping. Usage: NNTPSERVER=news.example.org ./scripts/xhdr-num2mid GROUP >file
2016-07-28 add script used for importing git from download.gmane.org
In case others want to use it...
2016-07-06 scripts/dc-dlvr: ensure temporary files are removed
Oops :x
2016-06-17 scripts/dc-dlvr: ClamAV support via clamdscan
SpamAssassin often misses messages which contain viruses, so ClamAV should fill that gap nicely.
2016-06-17 scripts/dc-dlvr: remove catchall account
Unfortunately, people screw up addresses often enough for this to be a real problem.
2016-06-17 scripts/dc-dlvr: update copyright
2016-05-20 ssoma-replay: use TMPDIR for temporary path
Otherwise, tempfile() will use the current working directory, which may not be writable.
2016-05-14 rename most instances of "list" to "inbox"
A public-inbox is NOT necessarily a mailing list, but it could serve as an input point for zero, one, or infinite mailing lists :D
2016-05-14 import ssoma-replay example script I've been using
Unfortunately, most users still prefer their mail delivered over SMTP; so we'll at least document mlmmj integration for now until we can popularize pull-based reading over POP3/NNTP/ssoma.
2015-09-06 update copyright headers and email addresses
In the future, it should be possible to use this:
    git ls-files | UPDATE_COPYRIGHT_HOLDER='all contributors' \
        UPDATE_COPYRIGHT_USE_INTERVALS=2 \
        xargs /path/to/gnulib/build-aux/update-copyright
2015-07-14 scripts/dc-dlvr.pre: ensure stderr gets back to the MTA
We want to be able to reject errors back to the MTA.
2015-01-12 import_slrnspool: fork a process for each message
This prevents process growth when importing large messages. Memory growth could be due to the sliding sbrk window in glibc malloc or a circular reference in the Email::* Perl code somewhere.
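The pattern is easy to show in Python on a POSIX system. This is only a sketch of the general technique (one short-lived child per message, so any memory it allocates is returned when it exits); the function and parameter names are hypothetical and the original is a Perl script.

```python
import os
from typing import Callable, Iterable


def import_each_in_child(messages: Iterable[bytes],
                         handle: Callable[[bytes], None]) -> int:
    """Run handle() on each message in its own child; count the failures."""
    failures = 0
    for raw in messages:
        pid = os.fork()
        if pid == 0:
            # Child: process exactly one message, then exit immediately.
            status = 1
            try:
                handle(raw)
                status = 0
            finally:
                os._exit(status)
        # Parent: wait so only one child exists at a time.
        _, wait_status = os.waitpid(pid, 0)
        if wait_status != 0:
            failures += 1
    return failures
```

The parent's memory footprint stays flat regardless of message size, at the cost of one fork per message.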
2015-01-11 import_slrnspool: load private config key
PublicInbox::Config->lookup won't return unknown keys
2015-01-11 import_slrnspool: graceful exit for interruptibility
This should alleviate fears of interrupting the process.
2015-01-11 import_slrnspool: make filtering optional
2015-01-11 import_slrnspool: use ssoma-mda instead
Some mailing lists (e.g. git@vger.kernel.org) accept messages via Bcc: and possibly other things which get rejected by the strict PublicInbox::Filter rules. So rely on ssoma-mda instead. This prefers a recent revision of ssoma-mda (commit 7fce38e9 onwards) to display subject/author/date information in the commit message.
2015-01-11 *slrnspool* old gmane archives set Original-To
Apparently it's not a problem with recent archives.
2015-01-11 import_slrnspool: fix off-by-one error
We start with zero and only store the next valid ID.
2015-01-11 scripts/import_slrnspool: new incremental importer
This allows incremental imports of slrn spools, ideal for tracking lists via gmane.
2014-05-21 slrnspool2maildir: fix help and dir creation
Any existing directory should do.
2014-04-26 spamassassin rule and config updates
While we're at it, add a script for easy editing of user prefs. We need some human-maintained rules based on the spam we get. It's an imperfect world, but I'd _much_ rather deal with the occasional spam than require signup/registration to post.
2014-04-21 new scripts for importing slrn spools and maildirs
The old import_gmane_spool script was inflexible, since we may import from maildir archives as well, so get everything into maildir, first.
2014-04-21 scripts/dc-dlvr: allow exiting from ~/.dc-dlvr.pre
The ~/.dc-dlvr.pre script for my public-inbox user does this.
2014-04-20 use ORIGINAL_RECIPIENT once again
It should be common for a single user to be subscribed to multiple addresses/lists, so we must use the address before alias expansion. This partially reverts commit b949afc9edf89dd494cac6255c78b124d58e11a5
2014-04-20 scripts/import_gmane_spool: set git committer date
We normally want committer date to be different so we may track delivery latencies (which do not differ much). However, the rules for importing are much different and tend to screw things up when using time ranges with git-rev-list.
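One way to pin both dates when driving git from a script is through its `GIT_AUTHOR_DATE` and `GIT_COMMITTER_DATE` environment variables. The Python sketch below assumes the importer wants the committer date to equal the message date, which the commit message implies but does not spell out; the helper name is hypothetical.

```python
import os
from datetime import datetime
from email.utils import format_datetime
from typing import Dict, Optional


def git_env_for_message(when: datetime,
                        base_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return an environment in which git uses `when` for both dates."""
    env = dict(os.environ if base_env is None else base_env)
    stamp = format_datetime(when)  # RFC 2822 date, a format git accepts
    env["GIT_AUTHOR_DATE"] = stamp
    # Assumption: for imports, committer date mirrors the message date so
    # that time ranges given to git-rev-list follow the original timeline.
    env["GIT_COMMITTER_DATE"] = stamp
    return env
```

The resulting dictionary can be passed as the `env` argument of `subprocess.run` when invoking `git commit`.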