path: root/lib/PublicInbox/Import.pm
Date  Commit message
2018-04-04 v2: support incremental indexing + purge
This is important for people running mirrors via "git fetch", as they need to be kept up-to-date. Purging is also now supported in mirrors. The short-lived "--regenerate" option is gone and is now implicitly enabled as a result. It's still cheap when article number regeneration is unnecessary, as we track the range for each git repository.
2018-04-04 import: rewrite less history during purge
We do not need to rewrite old commits unaffected by the object_id purge, only newer commits. This was a state management bug :x We will also return the new commit ID of rewritten history to aid in incremental indexing of mirrors for the next change.
2018-04-01 v2: one file, really
We need to ensure there is only one file in the top-level tree at any commit so the "add; remove; add;" sequence on the same message is detected properly. Otherwise, git will not detect the second "add" unless a second message is added to history. Deletes are now stored in "d" (and not "D" or "_/D") at the top-level. There's no need to have a "_" to reduce churn as "m" and "d" should never co-exist. It's now lowercased to make it easier to distinguish from "D" in git-log output.
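A minimal Python sketch of the single-file rule described above, not the actual Perl code in Import.pm. The helper name `file_changes`, the `tree` set and the mark numbers are hypothetical; the emitted lines follow the `D <path>` and `M <mode> :<mark> <path>` file-change syntax of git-fast-import(1).

```python
def file_changes(tree, keep, mark):
    """Return fast-import file-change lines that leave only `keep` in the tree.

    `tree` is a set of top-level paths currently present; it is updated
    in place. `keep` is "m" for an add or "d" for a delete.
    """
    # Remove whatever else is at the top level, so "m" and "d" never co-exist.
    lines = [f"D {path}" for path in sorted(tree - {keep})]
    # Point the single remaining file at the blob identified by the mark.
    lines.append(f"M 100644 :{mark} {keep}")
    tree.clear()
    tree.add(keep)
    return lines


tree = set()
for keep, mark in (("m", 1), ("d", 1), ("m", 1)):  # add; remove; add
    print(file_changes(tree, keep, mark))
```

Because every commit swaps the one top-level file, the second "add" of the same blob still changes the tree and therefore shows up in history.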
2018-03-29 import: run_die supports redirects as spawn does
We'll be using it in more future tests and scripts.
2018-03-29 v2writable: support purging messages from git entirely
Purging existing messages is fairly straightforward since we can take advantage of Xapian and look up the git object_id with it. Unfortunately, purging an already "removed" message (which is no longer in Xapian) is not as easy and we'll need to expose ->purge_oids to purge by the git object_id (currently SHA-1). Furthermore, we expire reflogs and prune in hopes a dumb HTTP client won't get the object.
2018-03-29 v2writable: append, instead of prepending generated Message-ID
The original Message-ID is still the most important when discussing with other recipients who do not rely on a message flowing through public-inbox. So whatever Message-ID we use to deduplicate internally will be secondary and less important. All of our front-end v2 code is order-independent, so we won't let the message count against us, that way.
2018-03-22 import: consolidate mid prepend logic, here
This also quiets down warnings from -watch when spam training happens on messages without Message-Id.
2018-03-22 v2writable: clarify header cleanups
We want to make it clear to the code and DEBUG_DIFF users that we do not introduce messages with unsuitable headers into public archives.
2018-03-22 use both Date: and Received: times
We want to rely on Date: to sort messages within individual threads since it keeps messages from git-send-email(1) sorted. However, since developers occasionally have the clock set wrong on their machines, sort overall messages by the newest date in a Received: header so the landing page isn't forever polluted by messages from the future. This also gives us determinism for commit times in most cases, as we'll use the Received: timestamp there, as well.
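A rough Python sketch of extracting both timestamps, using only the standard `email` package. The function name `message_times` is hypothetical, and the real code is Perl; the sketch assumes the timestamp of a Received: header follows its last ";".

```python
from email import message_from_string
from email.utils import parsedate_to_datetime


def _to_ts(text):
    """Parse an RFC 2822 style date into a Unix timestamp, or None."""
    try:
        return parsedate_to_datetime(text.strip()).timestamp()
    except (TypeError, ValueError):
        return None


def message_times(raw):
    """Return (date_ts, received_ts); either may be None if unavailable."""
    msg = message_from_string(raw)
    date_ts = _to_ts(msg["Date"]) if msg["Date"] else None
    received = []
    for hdr in msg.get_all("Received", []):
        _, sep, stamp = hdr.rpartition(";")  # timestamp follows the last ';'
        ts = _to_ts(stamp) if sep else None
        if ts is not None:
            received.append(ts)
    # Newest Received: time, used for overall ordering in this sketch.
    return date_ts, (max(received) if received else None)
```

With a Date: header set years into the future, `date_ts` stays available for sorting within a thread while `received_ts` keeps the message out of the top of the landing page.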
2018-03-20 import: discard all the same headers as MDA
Reduce the places where we have duplicate logic for discarding unwanted headers.
2018-03-19 Lock: new base class for writable lockers
This reduces code duplication needed for locking and hopefully makes things easier to understand.
2018-03-19 import: enable locking under v2
Instead of using ssoma-based locking, enable locking via Import for now.
2018-03-19 import: switch to URL-safe Base64 for Message-IDs
Hexdigests are too long and shorter Message-IDs are easier to deal with.
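To illustrate the length difference, here is a small Python sketch. The choice of SHA-1 and the helper name `short_id` are assumptions for the example; the commit does not say which digest or input is used.

```python
import base64
import hashlib


def short_id(raw: bytes) -> str:
    """URL-safe Base64 of a digest of `raw`, with '=' padding stripped."""
    digest = hashlib.sha1(raw).digest()  # 20 bytes; the hash is an assumption
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


raw = b"Subject: hello\n\nbody\n"
print(len(hashlib.sha1(raw).hexdigest()))  # 40 characters as a hexdigest
print(len(short_id(raw)))                  # 27 characters as URL-safe Base64
```

The URL-safe alphabet replaces "+" and "/" with "-" and "_", so the result needs no escaping in URLs.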
2018-03-19 import: force Message-ID generation for v1 here
This allows us to share code for generating Message-IDs between v1 and v2 repos. For v1, this introduces a slight incompatibility in message removal iff the original message lacked a Message-ID AND the training request came from a message which did not pass through public-inbox. The workaround for this would be to reuse the bad message from the archive itself.
2018-03-19 import: implement barrier operation for v1 repos
This will allow WatchMaildir to use ->barrier operations instead of reaching inside for nchg. This also ensures dumb HTTP clients can see changes to V2 repos immediately.
2018-03-19 import: (v2): write deletes to a separate '_' subdirectory
In the future, we may store "purged" content IDs or other uncommon stuff under "_/" of the git tree. This keeps the top-level tree small and more amenable to deltafication. This helps the common case where "m" is the most commonly changed file at the top level. Also, use 'D' instead of 'd' since it matches git's '--raw' output format.
2018-03-19 import: (v2) delete writes the blob into history in subdir
This makes it easier to audit deletes with "git log -p" and prevents an unstable specification of "content_id" from being stored in history. This should be cost-free if done in the same partition (and even cheaper than before as it introduces no new blobs). It does have a higher cost across partitions, but is probably irrelevant given the typical ham:spam ratio.
2018-03-19 use string ref for Email::Simple->new
Email::Simple is slightly faster this way, and Email::MIME and PublicInbox::MIME both wrap that.
2018-03-19 v2writable: support "barrier" operation to avoid reforking
Stopping and starting a bunch of processes to look up duplicates or removals is inefficient. Take advantage of checkpointing in "git fast-import" and transactions in Xapian and SQLite.
2018-03-06 import: fall back to Sender for extracting name and email
This seems like a reasonable course of action for old messages.
Cc: Nicolás Ojeda Bär <n.oje.bar@gmail.com>
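A short Python sketch of the fallback, assuming it applies when From: yields no usable address. The function name `author_of` is hypothetical and the real implementation is in Perl.

```python
from email import message_from_string
from email.utils import parseaddr


def author_of(raw):
    """Return (name, email) from From:, falling back to Sender:."""
    msg = message_from_string(raw)
    for field in ("From", "Sender"):
        name, addr = parseaddr(msg.get(field, ""))
        if addr:  # first header with a usable address wins
            return name, addr
    return "", ""  # nothing usable; empty author fields


print(author_of("Sender: List Bot <bot@example.com>\n\nbody\n"))
```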
2018-03-06 favor Received: date over Date: header globally
The first Received: header is believable since it typically hits the user's mail server and can be treated as relatively trustworthy. We still show the Date: in per-message (permalink) views, which may expose users for having incorrect Date: headers, but all the ISO YYYY-MM-DD dates we display will match what we see.
2018-03-03 import: consolidate object info for v2 imports
It's easier to store everything in one array ref similar to what our Git->check routine returns.
2018-02-28 v2writable: cleanup unused pipes in partitions
Leaking these pipes to child processes wasn't harmful, but made determining relationships and dataflow between processes more confusing.
2018-02-22 v2: parallelize Xapian indexing
The parallelization requires splitting Msgmap, text+term indexing, and thread-linking out into separate processes.

git-fast-import is fast, so we don't bother parallelizing it.

Msgmap (SQLite) and thread-linking (Xapian) must be serialized because they rely on monotonically increasing numbers (NNTP article number and internal thread_id, respectively).

We handle msgmap in the main process which drives fast-import. When the article number is retrieved/generated, we write the entire message to per-partition subprocesses via pipes for expensive text+term indexing.

When these per-partition subprocesses are done with the expensive text+term indexing, they write SearchMsg (small data) to a shared pipe (inherited from the main V2Writable process) back to the threader, which runs its own subprocess.

The number of text+term Xapian partitions is chosen at import and can be made equal to the number of cores in a machine.

    V2Writable --> Import -> git-fast-import
               \-> SearchIdxThread -> Msgmap (synchronous)
               \-> SearchIdxPart[n] -> SearchIdx[*]
               \-> SearchIdxThread -> SearchIdx ("threader", a subprocess)

    [*] each subprocess writes to threader
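The dataflow above can be approximated with a toy Python sketch. It uses threads and queues as stand-ins for the subprocesses and pipes, picks the partition with `num % nparts` (an assumption), and replaces Xapian indexing with a trivial word split; all names are hypothetical.

```python
import queue
import threading


def index_all(messages, nparts=2):
    """Number serially, index in per-partition workers, thread in one place."""
    part_qs = [queue.Queue() for _ in range(nparts)]
    thread_q = queue.Queue()  # shared channel back to the single threader
    indexed = [[] for _ in range(nparts)]
    threaded = []

    def part_worker(i):
        while (item := part_qs[i].get()) is not None:
            num, text = item
            # Stand-in for the expensive text+term indexing.
            indexed[i].append((num, sorted(set(text.lower().split()))))
            thread_q.put((num, len(text)))  # only small data goes back
        thread_q.put(None)  # tell the threader this partition is done

    def threader():
        done = 0
        while done < nparts:
            item = thread_q.get()
            if item is None:
                done += 1
            else:
                threaded.append(item)

    workers = [threading.Thread(target=part_worker, args=(i,))
               for i in range(nparts)]
    workers.append(threading.Thread(target=threader))
    for w in workers:
        w.start()
    # The caller assigns monotonically increasing numbers, one at a time.
    for num, text in enumerate(messages, 1):
        part_qs[num % nparts].put((num, text))
    for q in part_qs:
        q.put(None)
    for w in workers:
        w.join()
    return indexed, sorted(threaded)
```

Only the numbering and the threader are serialized here; the per-partition work can proceed concurrently, which mirrors the split the commit message describes.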
2018-02-20 v2: support Xapian + SQLite indexing
This is too slow, currently. Working with only 2017 LKML archives:

    git-only:            ~1 minute
    git + SQLite:        ~12 minutes
    git+Xapian+SQLite:   ~45 minutes

So yes, it looks like we'll need to parallelize Xapian indexing, at least.
2018-02-19 v2writable: initial cut for repo-rotation
Wrap the old Import package to enable creating new repos based on size thresholds. This is better than relying on time-based rotation as LKML traffic seems to be increasing.
2018-02-15 import: allow the epoch (0s) as a valid time
Despite email not existing until 1971, "Jan 1, 1970 00:00:00" seems like a common default timestamp for some test emails to use as a Date: header.
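The usual pitfall with a timestamp of 0 is testing it for truthiness instead of checking whether it was parsed at all. A minimal Python sketch of that distinction follows; `pick_time` is a hypothetical helper, and the commit does not spell out the exact check that was changed.

```python
def pick_time(parsed, fallback):
    """Use `parsed` whenever a date was parsed, even if it is 0 (the epoch)."""
    # `if parsed:` would wrongly discard 0, so compare against None instead.
    return parsed if parsed is not None else fallback


print(pick_time(0, 1518652800))     # 0: the epoch is kept
print(pick_time(None, 1518652800))  # 1518652800: no date, use the fallback
```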
2018-02-15 import: quiet down warnings from bogus From: lines
There's a lot of crap in archives and git-fast-import accepts empty names and email addresses for authors just fine.
2018-02-15 import: pass "raw" dates to git-fast-import(1)
For LKML, it appears we need an even more liberal parser than the RFC2822 date parser in git. I have not validated Date::Parse parses dates correctly, but this at least prevents git-fast-import(1) from choking.
2018-02-14 import: APIs to support v2 use
Wrap "get-mark" and "checkpoint" commands for git-fast-import while documenting/cementing parts of the API.
2018-02-12 import: initial handling for v2
Call order will need to change a bit since this is going to be tied to Xapian.
2018-02-09 import: begin supporting this without ssoma.lock
We'll reuse this class in v2, but won't be utilizing per-git-repository ssoma.lock files. Meanwhile, stop treating ::Inbox objects as an afterthought and allow importing name and email into them.
2018-02-08 import: stop writing legacy ssoma.index by default
For machines which have never seen ssoma, they don't need the index so stop creating it.
2018-02-07 update copyrights for 2018
Using update-copyrights from gnulib. While we're at it, use the SPDX identifier for AGPL-3.0+ to ease mechanical processing.
2017-11-16 learn: use "spam" as subject for removal commits
Sometimes an email is an innocent removal "rm" for a misdirected, off-topic post, while most removed messages are "spam". Allow anybody to look at history and easily distinguish the reason for removing the message.
2017-06-20 import: fix encoding issues from weird "raw" emails
This seems to allow weirdly-encoded "raw" emails in blade.nagaokaut.ac.jp/ruby/ruby-core/* to be handled without difficulties.
2017-05-25 import: reset :raw mode for commit title (subject)
This was necessary for the presence of the 0xa0 byte(*) in the Subject: of the message at: http://blade.nagaokaut.ac.jp/ruby/ruby-core/3220
(*) That is 0xa0, not 0x0a ("\n"), so I wonder if the nibbles got swapped somehow.
2017-01-10 introduce PublicInbox::MIME wrapper class
This should fix problems with multipart messages where text/plain parts lack a header.
cf. git clone --mirror https://github.com/rjbs/Email-MIME.git refs/pull/28/head
In the future, we may still introduce a streaming interface to reduce memory usage on large emails.
2016-10-16 import: failed GC runs are non-fatal
We should not completely kill a process if "git gc --auto" errors out due to a warning or whatnot.
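A hedged Python sketch of treating a failed gc run as a warning rather than a fatal error. The helper name `gc_auto` and its `argv` override are hypothetical (the override exists only so the behavior can be exercised without a repository); the default command uses git's real `--git-dir` option and `gc --auto`.

```python
import subprocess
import sys


def gc_auto(git_dir, argv=None):
    """Run "git gc --auto"; warn and return False on failure, never raise."""
    argv = argv or ["git", f"--git-dir={git_dir}", "gc", "--auto"]
    try:
        rc = subprocess.run(argv).returncode
    except OSError as exc:  # e.g. the git binary is missing
        print(f"W: could not run {argv[0]}: {exc}", file=sys.stderr)
        return False
    if rc != 0:
        print(f"W: {' '.join(argv)} exited with {rc}", file=sys.stderr)
        return False
    return True
```

The caller can ignore the return value and carry on, since a skipped gc only delays repacking.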
2016-09-08 import: run "git gc --auto" when done
We need to prevent excessive repository growth for public-inbox-watch and public-inbox-mda users.
2016-09-08 import: hoist out common run_die subroutine
We will be reusing this in the next commit, too.
2016-09-08 import: hoist out _check_path function
This reduces duplication, slightly. We may be using it yet again in a to-be-introduced function (or we may not introduce it).
2016-08-15 import: use common address parsing to drop unnecessary quotes
Not sure why or how I missed this before; but the common address parsing routine we have should be more correct. Add a test to ensure excessively quoted names don't make it through, either.
2016-08-12 watch: respect altid for incremental watch changes
We need to pass the Inbox object to SearchIdx to get altid mappings properly for incremental imports. TODO: use the Inbox object in more places where it makes sense to do so.
2016-08-09 searchidx: release Xapian FDs before spawning git log
This will allow us to release and re-acquire Xapian locks due to the lack of FD_CLOEXEC on some FDs.
2016-08-02 search: improve reindexing behavior
For reindexing, fresh Xapian DBs do not count as a reindex, allowing users to blindly use --reindex on the first run on a clean repo. While we're at it, allow indexing to override HEAD ref for multi-head git repos.
2016-07-27 localize $/ when using chomp
Callers may have localized $/ to something else, so make sure we chomp the expected character(s) when calling chomp.
2016-06-24 watch_maildir: implement optional spam checking
Mailing lists I watch and mirror may not have the best spam filtering, and an extra layer should not hurt.
2016-06-19 import: allow messages without subject
Because our WatchMaildir module is liberal about what it accepts, we can potentially have messages without a subject.
2016-06-17 import: auto-update index when done
This prevents multiple update processes from stepping over each other while called under the lock, and also allows the new -watch process to update the index iff indexing was desired.