path: root/scripts
Date Commit message
2024-02-01 scripts/import_*: update usage to include lei tips
These scripts probably don't offer anything useful now that lei has fleshed out read-only MH support and v2 outputs.
2024-02-01 scripts/slrnspool2maildir: use MHreader and LeiToMail
This contains gmane-specific header munging to unmunge the things gmane does to headers. While we're at it, document the generic `lei convert' invocation for users who don't need the gmane-specific header munging.
2023-12-08 workaround --headers bug with spamc(1)
As of SpamAssassin 4.0.0, spamc(1) corrupts messages with NUL in the body when the `--headers' switch is used. This increases transport costs, but most spamc/spamd setups are via local sockets, so it's unlikely to be significant. Link: https://bugs.debian.org/1057749 Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
2023-08-28 Fix some typos/grammar/errors in docs and comments
2021-03-28 treewide: shorten temporary filename
File::Temp only requires four 'X' characters (unlike mkstemp(3), which requires six). So only give it 4 to avoid an 80-column violation and maybe save metadata space on FSes.
2021-01-01 update copyrights for 2021
Using "make update-copyrights" after setting GNULIB_PATH in my config.mak
2020-12-09 rename {pi_config} fields to {pi_cfg}
{pi_config} may be confused with the documented `PI_CONFIG' environment variable, and we'll favor vowel-removal to be consistent with our usage of object references. The `pi_' prefix may stay in some places, for now; since a separate namespace may come into this codebase for local/private client-tooling. For InboxIdle, we'll also remove an invalid comment about holding a reference to the PublicInbox::Config object, too.
2020-10-16 scripts/dupe-finder: restore $dbh variable
When dupe-finder was switched from ->search->{over_ro} to ->over, the database handle was dropped. Restore it because a spot downstream uses it. Fixes: 73e3a6ed6e95adc6 (use more idiomatic internal API for ->over access)
2020-09-03 use more idiomatic internal API for ->over access
{over_ro} being a part of the Search object is a historical oddity which will go away, soon. Let's start removing its use in tests and rarely-used helper scripts.
2020-08-27 over: rename ->connect method to ->dbh
`->connect' is confused with the perlfunc for the `connect(2)' syscall, and also `DBI->connect'. Since SQLite doesn't use sockets, the word "connect" needlessly confuses me. Give it a short name to match the field name we use for it, which also matches the variable name used by the DBI(3pm) and DBD::SQLite(3pm) manpages.
2020-05-20 scripts/import_*: remove PublicInbox::MIME usage
These aren't really supported and will probably be replaced with better tools, but PublicInbox::Eml should be readily available to anybody who already has our source tree.
2020-05-05 scripts/slrnspool2maildir: don't sort glob()
glob() sorts alphabetically by default, which doesn't have a useful meaning with many articles. Stop wasting CPU cycles and memory.
2020-04-19 favor `do {}' over `eval {}' for localized slurp
I did not know to use the return value of `do' back in the day. There's probably no practical difference in these cases, but `eval' is overkill for these uses and may hide actual errors. We can get rid of a few redundant `scalar' ops and pass scalar refs to Email::MIME->new to avoid copies in a few more places, too.
2020-04-02 mid: add $MID_EXTRACT regexp for export
This allows us to consistently enforce the same Message-ID extraction rules everywhere and makes it easier for us to make changes in the future. Update scripts/ssoma-replay, as well, but don't rely on PublicInbox::* modules in that since it's legacy and public-inbox was never a dependency of ssoma.
2020-02-24 import_vger_from_mbox: add --filter parameter
It shouldn't be hard to make this into a more generic importer not specific to vger lists.
2020-02-24 import_vger_from_mbox: drop redundant "use" statements
PublicInbox::InboxWritable takes care of those imports.
2020-02-06 treewide: run update-copyrights from gnulib for 2019
I didn't wait until September to do it, this year!
2020-01-27 inbox: add ->version method
This allows us to simplify version checking by avoiding "//" or "||" operators sprinkled around.
2019-10-16 config: support "inboxdir" in addition to "mainrepo"
"mainrepo" was a bad name and artifact from the early days when I intended for there to be a "spamrepo" (now just the ENV{PI_EMERGENCY} Maildir). With v2, "mainrepo" can be especially confusing, since v2 needs at least two git repositories (epoch + all.git) to function and we shouldn't confuse users by having them point to a git repository for v2. Much of our documentation already references "INBOX_DIR" for command-line arguments, so use "inboxdir" as the git-config(1)-friendly variant for that. "mainrepo" remains supported indefinitely for compatibility. Users may need to revert to old versions, or may be referring to old documentation and must not be forced to change config files to account for this change. So if you're using "mainrepo" today, I do NOT recommend changing it right away because other bugs can lurk. Link: https://public-inbox.org/meta/874l0ice8v.fsf@alyssa.is/
2019-09-09 run update-copyrights from gnulib for 2019
2019-06-05 scripts: add README to describe its purpose
Well, it could probably be moved to contrib...
2019-06-05 scripts: require ASCII digits in a few places
I haven't touched most of these scripts in ages, but we might as well purge \d usage from here, as well.
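The motivation for purging \d can be illustrated outside Perl. The sketch below is in Python, not the Perl these scripts are written in, and relies on the fact that Python's `\d` on `str` patterns likewise matches non-ASCII decimal digits. The helper names `is_ascii_number` and `is_unicode_number` are hypothetical and exist only for this illustration.

```python
import re

# On str patterns, \d matches any Unicode decimal digit, not just 0-9.
UNICODE_DIGITS = re.compile(r"\d+")
# An explicit character class restricts the match to ASCII digits.
ASCII_DIGITS = re.compile(r"[0-9]+")


def is_unicode_number(text: str) -> bool:
    """True if text consists solely of Unicode decimal digits."""
    return UNICODE_DIGITS.fullmatch(text) is not None


def is_ascii_number(text: str) -> bool:
    """True if text consists solely of the ASCII digits 0-9."""
    return ASCII_DIGITS.fullmatch(text) is not None


# ARABIC-INDIC DIGIT ONE, TWO, THREE (U+0661..U+0663)
ARABIC_INDIC_123 = "\u0661\u0662\u0663"
```

Here `is_unicode_number(ARABIC_INDIC_123)` accepts the string while `is_ascii_number(ARABIC_INDIC_123)` rejects it, which is the kind of surprise an explicit `[0-9]` class avoids when parsing article numbers or timestamps.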
2018-12-28 add filter for gmane archives
Extracted from import_slrnspool, since some spools get converted to mbox or what not.
2018-05-02 scripts/import_slrnspool: cleanup progress messages
Stop showing redundant slashes and stop showing progress for messages which do not exist.
2018-05-02 scripts/import_slrnspool: support v2 repos
2018-04-18 v2: generate better Message-IDs for duplicates
While hunting duplicates, I noticed a leading '-' in some Message-IDs as a result of RFC4648 encoding. While '-' seems allowed by RFC5322 and URL-friendly (RFC4648), they are uncommon and make using Message-IDs as arguments for command-line tools more difficult. So prefix them with a datestamp to at least give readers some sense of the age. And shorten the "localhost" hostname to "z" to save space.
2018-04-06 ensure Xapian and SQLite are still optional for v1 tests
Xapian is size-intensive and SQLite is not strictly necessary for v1.
2018-04-01 scripts/import_vger_from_mbox: set address properly
For objects like Inbox, the '-' prefixed hash keys are probably intended for auto-generated/hidden parameters.
2018-03-20 InboxWritable: add mbox/maildir parsing + import logic
This will make imports easier as well as supporting future Filter API users. It allows simplifying our ad-hoc import_vger_from_mbox script.
2018-03-19 scripts/import_vger_from_mbox: filter out same headers as MDA
Perhaps we should filter these headers out in Import.
2018-03-06 scripts/import_vger_from_mbox: perform mboxrd or mboxo escaping
It appears most of the mboxes in the archive I've been given are mboxrd (despite having Content-Length:) and need the escaping.
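For readers unfamiliar with the mboxrd convention: a writer prefixes ">" to every body line that is "From " preceded by zero or more ">" characters, and a reader strips exactly one ">" from such lines, so the round trip is lossless. A minimal Python sketch of that rule follows; it is an illustration only, not the Perl code in the script, and the function names are hypothetical.

```python
import re

# Body lines consisting of zero or more '>' followed by "From " need quoting.
_RD_ESCAPE = re.compile(rb"^(>*From )", re.MULTILINE)
# On reading, exactly one leading '>' is removed from quoted lines.
_RD_UNESCAPE = re.compile(rb"^>(>*From )", re.MULTILINE)


def mboxrd_escape(body: bytes) -> bytes:
    """Quote From_-like lines so they cannot be mistaken for separators."""
    return _RD_ESCAPE.sub(rb">\1", body)


def mboxrd_unescape(body: bytes) -> bytes:
    """Reverse mboxrd_escape by dropping one '>' from each quoted line."""
    return _RD_UNESCAPE.sub(rb"\1", body)
```

With mboxo, by contrast, only unquoted "From " lines are escaped on writing, so a reader cannot tell whether a ">From " line was quoted by the writer or was present in the original message.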
2018-03-06 favor Received: date over Date: header globally
The first Received: header is believable since it typically hits the user's mail server and can be treated as relatively trustworthy. We still show the Date: in per-message (permalink) views, which may expose users who have incorrect Date: headers, but all the ISO YYYY-MM-DD dates we display will match what we see.
2018-02-28 use PublicInbox::MIME consistently
It works around some bugs in older Email::MIME which we'll find useful.
2018-02-22 import_vger_from_mbox: use PublicInbox::MIME and avoid clobbering
It is less confusing without the clobber assignment; and PublicInbox::MIME exists to work around bugs in older Email::MIME (which is in Debian 9 (stretch)).
2018-02-22 import_vger_from_inbox: allow "-V" option
This will let us quickly test between v2 and v1 inboxes.
2018-02-20 v2: support Xapian + SQLite indexing
This is too slow, currently. Working with only 2017 LKML archives:
  git-only:           ~1 minute
  git + SQLite:       ~12 minutes
  git+Xapian+SQLite:  ~45 minutes
So yes, it looks like we'll need to parallelize Xapian indexing, at least.
2018-02-19 v2writable: initial cut for repo-rotation
Wrap the old Import package to enable creating new repos based on size thresholds. This is better than relying on time-based rotation as LKML traffic seems to be increasing.
2018-02-15 scripts/import_vger_from_mbox: use v2 layout for import
Big lists are orders of magnitude more efficient with v2.
2018-02-13 scripts/import_vger_from_mbox: support --dry-run option
This can be useful for getting a performance baseline of just Email::MIME and Date: header parsing. We'll need to do some Date: header parsing for LKML since there are some wonky date formats which cause the git RFC822 parser to choke.
2018-02-08 scripts/import_vger_from_mbox: relax From_ line match slightly
The mboxes I got from cregit have two spaces after the email address, while the "git format-patch" output I'm used to dealing with only has one space. It's still a "strict" match in that it checks for something resembling a timestamp, but it relaxes the number of spaces between the email address and date.
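A relaxed-but-strict From_ match of the kind described above might look like the following Python sketch. The exact pattern used by the script is not shown in this log, so the regular expression here is an assumption: it expects an asctime-style timestamp and allows one or more spaces between the address and the date. The names `FROM_LINE` and `is_from_line` are hypothetical.

```python
import re

# "From ", a sender token, one or more spaces, then an asctime-style
# timestamp such as "Thu Feb  8 12:34:56 2018" (day may be space-padded).
FROM_LINE = re.compile(
    r"^From (\S+) +"
    r"([A-Z][a-z]{2} [A-Z][a-z]{2} [ 0-9][0-9] "
    r"[0-9]{2}:[0-9]{2}:[0-9]{2} [0-9]{4})$"
)


def is_from_line(line: str) -> bool:
    """True if line looks like an mbox From_ separator."""
    return FROM_LINE.match(line) is not None
```

Both `"From user@example.com Thu Feb  8 12:34:56 2018"` and the two-space variant `"From user@example.com  Thu Feb  8 12:34:56 2018"` are accepted, while an ordinary sentence such as `"From the desk of the maintainer"` is rejected because nothing resembling a timestamp follows.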
2018-02-07 update copyrights for 2018
Using update-copyrights from gnulib. While we're at it, use the SPDX identifier for AGPL-3.0+ to ease mechanical processing.
2017-06-29 scripts/import_maildir: rewrite to use Import
This will be much faster than invoking -mda for every message.
2017-06-26 mda: set List-ID correctly according to RFC2919
Oops, due to an old mistake, List-ID was set incorrectly in the MDA. This could cause some breakage w.r.t. mail filters.
2016-08-21 avoid spaces after shell redirection operators
This makes us closer to git.git style (though I'm not quite sure why we do this...)
2016-08-14 www: do not unnecessarily escape some chars in paths
Based on reading RFC 3986, it seems '@', ':', '!', '$', '&', "'", '(', ')', '*', '+', ',', ';', '=' are all allowed in path-absolute where we have the Message-ID. In any case, it seems '@' is fairly common in path components nowadays and too common in Message-IDs.
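The effect of leaving those characters unescaped can be sketched with Python's standard `urllib.parse.quote`. This is only an illustration of the RFC 3986 character set named above, not the Perl code used by the WWW interface; `PATH_SAFE` and `mid_to_path` are hypothetical names.

```python
from urllib.parse import quote

# Characters listed in the commit message as allowed in a path segment:
# ':' and '@' plus the RFC 3986 sub-delims.
PATH_SAFE = "@:!$&'()*+,;="


def mid_to_path(mid: str) -> str:
    """Percent-encode a Message-ID for use as a single URL path segment.

    '/' is deliberately not in the safe set, so it is always escaped
    and cannot introduce an extra path component.
    """
    return quote(mid, safe=PATH_SAFE)
```

A typical Message-ID such as `20160814.abc+def@example.com` passes through unchanged, whereas a '/' or a space is still percent-encoded.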
2016-08-14 import_slrnspool: reimplement using fast-import
I needed to use this to resurrect some messages missing from my initial downloads from gmane...
2016-08-11 search: support alt-ID for mapping legacy serial numbers
For some existing mailing list archives, messages are identified by serial number (such as NNTP article numbers in gmane). Those links may become inaccessible (as is the current case for gmane), so ensure users can still search based on old serial numbers. Now, I run the following periodically to get article numbers from gmane (while news.gmane.org remains):
    NNTPSERVER=news.gmane.org
    export NNTPSERVER
    GROUP=gmane.comp.version-control.git
    perl -I lib scripts/xhdr-num2mid $GROUP --msgmap=/path/to/gmane.sqlite3
(I might integrate this further with public-inbox-* scripts one day). My ~/.public-inbox/config has an added "altid" snippet which now looks like this:
    [publicinbox "git"]
        address = git@vger.kernel.org
        mainrepo = /path/to/git.vger.git
        newsgroup = inbox.comp.version-control.git
        ; relative pathnames expand to $mainrepo/public-inbox/$file
        altid = serial:gmane:file=gmane.sqlite3
And run "public-inbox-index --reindex /path/to/git.vger.git" periodically. This ought to allow searching for "gmane:12345" to work for Xapian-enabled instances. Disclaimer: while public-inbox supports NNTP and stable article serial numbers, use of those for public links is discouraged since it encourages centralization.
2016-07-28 add scripts/xhdr-num2mid example
This is used to quickly generate an article number to Message-ID mapping. Usage: NNTPSERVER=news.example.org ./scripts/xhdr-num2mid GROUP >file
2016-07-28 add script used for importing git from download.gmane.org
In case others want to use it...
2016-07-06 scripts/dc-dlvr: ensure temporary files are removed
Oops :x