about summary refs log tree commit homepage
path: root/lib/PublicInbox/Import.pm
DateCommit message (Collapse)
2018-03-19v2writable: support "barrier" operation to avoid reforking
Stopping and starting a bunch of processes to look up duplicates or removals is inefficient. Take advantage of checkpointing in "git fast-import" and transactions in Xapian and SQLite.
2018-03-06import: fall back to Sender for extracting name and email
This seems like a reasonable course of action for old messages. Cc: Nicolás Ojeda Bär <n.oje.bar@gmail.com>
2018-03-06favor Received: date over Date: header globally
The first Received: header is believable since it typically hits the user's mail server and can be treated as relatively trustworthy. We still show the Date: in per-message (permalink) views, which may expose users for having incorrect Date: headers, but all the ISO YYYY-MM-DD dates we display will match what we see.
2018-03-03import: consolidate object info for v2 imports
It's easier to store everything in one array ref similar to what our Git->check routine returns
2018-02-28v2writable: cleanup unused pipes in partitions
Leaking these pipes to child processes wasn't harmful, but made determining relationships and dataflow between processes more confusing.
2018-02-22v2: parallelize Xapian indexing
The parallelization requires splitting Msgmap, text+term indexing, and thread-linking out into separate processes. git-fast-import is fast, so we don't bother parallelizing it. Msgmap (SQLite) and thread-linking (Xapian) must be serialized because they rely on monotonically increasing numbers (NNTP article number and internal thread_id, respectively). We handle msgmap in the main process which drives fast-import. When the article number is retrieved/generated, we write the entire message to per-partition subprocesses via pipes for expensive text+term indexing. When these per-partition subprocesses are done with the expensive text+term indexing, they write SearchMsg (small data) to a shared pipe (inherited from the main V2Writable process) back to the threader, which runs its own subprocess. The number of text+term Xapian partitions is chosen at import and can be made equal to the number of cores in a machine. V2Writable --> Import -> git-fast-import \-> SearchIdxThread -> Msgmap (synchronous) \-> SearchIdxPart[n] -> SearchIdx[*] \-> SearchIdxThread -> SearchIdx ("threader", a subprocess) [* ] each subprocess writes to threader
2018-02-20v2: support Xapian + SQLite indexing
This is too slow, currently. Working with only 2017 LKML archives: git-only: ~1 minute git + SQLite: ~12 minutes git+Xapian+SQlite: ~45 minutes So yes, it looks like we'll need to parallelize Xapian indexing, at least.
2018-02-19v2writable: initial cut for repo-rotation
Wrap the old Import package to enable creating new repos based on size thresholds. This is better than relying on time-based rotation as LKML traffic seems to be increasing.
2018-02-15import: allow the epoch (0s) as a valid time
Despite email not existing until 1971; "Jan 1, 1970 00:00:00" seems like a common default timestamp for some test emails to use as a Date: header.
2018-02-15import: quiet down warnings from bogus From: lines
There's a lot of crap in archives and git-fast-import accepts empty names and email addresses for authors just fine.
2018-02-15import: pass "raw" dates to git-fast-import(1)
For LKML, it appears we need an even more liberal parser than RFC2822 date parser in git. I have not validated Date::Parse parses dates correctly, but this at least prevents git-fast-import(1) from choking.
2018-02-14import: APIs to support v2 use
Wrap "get-mark" and "checkpoint" commands for git-fast-import while documenting/cementing parts of the API.
2018-02-12import: initial handling for v2
Call order will need to change a bit since this is going to be tied to Xapian
2018-02-09import: begin supporting this without ssoma.lock
We'll reuse this class in v2, but won't be utilizing per-git-repository ssoma.lock files. Meanwhile, stop treating ::Inbox objects as an afterthought and allow importing name and email into them.
2018-02-08import: stop writing legacy ssoma.index by default
For machines which have never seen ssoma, they don't need the index so stop creating it.
2018-02-07update copyrights for 2018
Using update-copyrights from gnulib While we're at it, use the SPDX identifier for AGPL-3.0+ to ease mechanical processing.
2017-11-16learn: use "spam" as subject for removal commits
Sometimes an email is an innocent removal "rm" for a misdirected, off-topic post, while most removed messages are "spam". Allow anybody to look at history and easily distinguish the reason for removing the message.
2017-06-20import: fix encoding issues from weird "raw" emails
This seems to allow weirdly-encoded "raw" emails in blade.nagaokaut.ac.jp/ruby/ruby-core/* to be handled without difficulties.
2017-05-25import: reset :raw mode for commit title (subject)
This was necessary for the presence of the 0xa0 byte(*) in the Subject: of the message at: http://blade.nagaokaut.ac.jp/ruby/ruby-core/3220 (*) That is 0xa0, not 0x0a ("\n"), so I wonder if the nibbles got swapped somehow.
2017-01-10introduce PublicInbox::MIME wrapper class
This should fix problems with multipart messages where text/plain parts lack a header. cf. git clone --mirror https://github.com/rjbs/Email-MIME.git refs/pull/28/head In the future, we may still introduce as streaming interface to reduce memory usage on large emails.
2016-10-16import: failed GC runs are non-fatal
We should not completely kill a process if "git gc --auto" errors out due to a warning or whatnot.
2016-09-08import: run "git gc --auto" when done
We need to prevent excessive repository growth for public-inbox-watch and public-inbox-mda users.
2016-09-08import: hoist out common run_die subroutine
We will be reusing this in the next commit, too.
2016-09-08import: hoist out _check_path function
This reduces duplication, slightly. We may be using it yet again in a to-be-introduced function (or we may not introduce it).
2016-08-15import: use common address parsing to drop unnecessary quotes
Not sure why or how I missed this before; but the common address parsing routine we have should be more correct. Add a test to ensure excessively quoted names don't make it through, either.
2016-08-12watch: respect altid for incremental watch changes
We need to pass the Inbox object to SearchIdx to get altid mappings properly for incremental imports. TODO: use the Inbox object in more places where it makes sense to do so.
2016-08-09searchidx: release Xapian FDs before spawning git log
This will allow us to release and re-acquire Xapian locks due to the lack of FD_CLOEXEC on some FDs.
2016-08-02search: improve reindexing behavior
For reindexing, fresh Xapian DBs do not count as a reindex, allowing users to blindly use --reindex on the first run on a clean repo. While we're at it, allow indexing to override HEAD ref for multi-head git repos.
2016-07-27localize $/ when using chomp
Callers may have localized $/ to something else, so make sure we chomp the expected character(s) when calling chomp.
2016-06-24watch_maildir: implement optional spam checking
Mailing lists I watch and mirror may not have the best spam filtering, and an extra layer should not hurt.
2016-06-19import: allow messages without subject
Because our WatchMaildir module is liberal about what it accepts, we can potentially have messages without a subject.
2016-06-17import: auto-update index when done
This prevents multiple update processes from stepping over each other while called under the lock, and also allows the new -watch process to update the index iff indexing was desired.
2016-05-25remove Email::Address dependency
git has stricter requirements for ident names (no '<>') which Email::Address allows. Even in 1.908, Email::Address also has an incomplete fix for CVE-2015-7686 with a DoS-able regexp for comments. Since we don't care for or need all the RFC compliance of Email::Address, avoiding it entirely may be preferable. Email::Address will still be installed as a requirement for Email::MIME, but it is only used by the Email::MIME::header_str_set which we do not use
2016-05-21import: avoid needless git update-server-info
We don't need to update-server-info (or read-tree) if fast import was spawned for removals and no changes were made.
2016-05-12import: fallback to email if '<>' exists in author name
git doesn't handle '<' and '>' characters in the author name at all regardless of quoting, not just matched pairs. So fall back to using the email as the author name since the commit info isn't critical, anyways (shallow clones are fine).
2016-05-12import: normalize body by stripping trailing newlines
Mbox formatters may add extra newlines at the end of the message, and that's not relevant for comparing messages for deletion.
2016-04-28import: run git-update-server-info when done
We should update $GIT_DIR/info/refs for dumb HTTP clients whenever we make changes to the repository. The best place to update is immediately after making commits. This fixes a bug where public-inbox-learn did not properly update $GIT_DIR/info/refs after inserting or removing messages.
2016-04-27import: document API for public consumption
This is probably trivial enough to be final?
2016-04-25remove ssoma dependency
By converting to using ourt git-fast-import-based Import module. This should allow us to be more easily installed.
2016-04-25import: extra check for final byte read
The read could fail entirely and leave $lf undefined.
2016-04-12import: filter out [<>] from user names
It confuses the git ident parser and may not be a great idea to fix in git since it could break interopability with older versions.
2016-04-11import: use bytes::length for true data length in bytes
git is byte-oriented and fast-import will not tolerate miscalculations. This is necessary for wide characters in commit messages (email Subjects).
2016-04-11import: set binmode before printing author names
Author names may have wide characters in them, so avoid warnings as git favors UTF-8 for names and fast-import even requires them for commit messages
2016-04-11import: initial module + test case
This will allow us to write fast importers for existing archives as well as eventually removing the ssoma dependency for performance and ease-of-installation.