path: root/lib/PublicInbox/Import.pm
Date Commit message
2019-01-09 doc: various overview-level module comments
Hopefully this helps people familiarize themselves with the source code.
2019-01-02 update and add documentation for repository formats
Remove confusing documentation around ssoma now that we have NNTP and downloadable mbox support. Only lightly-checked for grammar and speling, and not yet formatting. Edits, corrections and addendums expected :>
2018-08-10 Import.pm: When purging replace a purged file with a zero length file
This ensures that the number of added files remains the same and thus the article numbers derived from a repository will remain the same. I think this is the last place in public-inbox that has to be tweaked to guarantee the generated article number will remain the same in a public-inbox archive.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-08-02 Import.pm: Don't assume {in} and {out} always exist
While working on one of the tests I did:

    my $im = PublicInbox::V2Writable->new($ibx, 1);
    my $im0 = $im->importer();
    $im->add($mime);

Which resulted in a warning of the use of an undefined value from atfork_child, and the test failing nastily. Inspection of the code reveals this can happen anytime gfi_start has not been called. So just fix atfork_child to skip closing file descriptors that have not yet been set up.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-07-19 Import.pm: Deal with potentially missing From and Sender headers
Use ||= '' to ensure that if the From or Sender header is not present the code sees an empty string instead of undefined. I had some email messages with a From field without an @ (because the sender was local) and without a Sender which were causing errors when imported. I think this was bad enough that the email messages were failing to be imported.
Signed-off-by: Eric Biederman <ebiederm@xmission.com>
2018-07-07 Import: Don't copy nulls from emails into git
Recently I ran git --git-dir=lkml/git/1.git fsck and it reported:

> warning in commit 299dbd50b6995c6debe2275f0df984ce697fb4cc: nulInCommit: NULL byte in the commit object body

Which I found quite scary. Nulls in the wrong place have a bad tendency to make programs misbehave. It turns out someone had placed "=?iso-8859-1?q?=00?=" at the end of their subject line, which is the MIME encoding for NULL. Email::MIME had correctly decoded the header, and then public-inbox had simply copied the contents of the header into the subject line of the git commit. To prevent that from causing problems, replace nulls in such subject lines with spaces.
Signed-off-by: Eric Biederman <ebiederm@xmission.com>
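The fix described above amounts to a one-character substitution on the decoded subject. A minimal Python sketch of the idea follows; the real change lives in the Perl module, and the function name sanitize_subject is hypothetical:

```python
def sanitize_subject(subject: str) -> str:
    """Replace NUL characters with spaces before the text becomes a commit title."""
    return subject.replace("\x00", " ")


# A decoded subject that ended with "=?iso-8859-1?q?=00?=" carries a trailing NUL.
decoded = "Re: patch\x00"
print(repr(sanitize_subject(decoded)))  # prints 'Re: patch '
```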
2018-04-20 import: cleanup git cat-file processes when ->done
This should reduce idle cat-file instances
2018-04-18 import: cat_blob drops leading 'From ' lines like Inbox
In case people were running old buggy versions from 2016... (and -convert should probably clean those up, eventually)
2018-04-18 v2: generate better Message-IDs for duplicates
While hunting duplicates, I noticed a leading '-' in some Message-IDs as a result of RFC4648 encoding. While '-' seems allowed by RFC5322 and URL-friendly (RFC4648), they are uncommon and make using Message-IDs as arguments for command-line tools more difficult. So prefix them with a datestamp to at least give readers some sense of the age. And shorten the "localhost" hostname to "z" to save space.
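The scheme above can be sketched in Python as follows. This is an illustration under assumptions, not the module's Perl code: the choice of SHA-1 over the raw message, the exact datestamp format, and the name generated_mid are all hypothetical; only the datestamp prefix, the RFC4648 URL-safe alphabet, and the "z" hostname come from the commit message.

```python
import base64
import hashlib
import time


def generated_mid(raw: bytes, timestamp: int) -> str:
    """Build a Message-ID body of the form <datestamp>.<urlsafe-base64>@z."""
    digest = hashlib.sha1(raw).digest()  # assumption: SHA-1 of the raw message
    # URL-safe alphabet from RFC4648, with the trailing '=' padding dropped.
    b64 = base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
    # The datestamp prefix guarantees the ID never starts with '-'.
    stamp = time.strftime("%Y%m%d%H%M%S", time.gmtime(timestamp))
    return f"{stamp}.{b64}@z"


print(generated_mid(b"duplicate message body", 1524009600))
```

Because the digest may encode to a string beginning with '-', the prefix is what keeps the result safe to pass as a command-line argument.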
2018-04-07 v2writable: reduce barriers
Since we handle the overview info synchronously, we only need barriers in tests, now. We will use asynchronous checkpoints to sync less-important Xapian data. For data deduplication, this requires us to hoist out the cat-blob support in ::Import for reading uncommitted data in git.
2018-04-04 v2: support incremental indexing + purge
This is important for people running mirrors via "git fetch", as they need to be kept up-to-date. Purging is also now supported in mirrors. The short-lived "--regenerate" option is gone and is now implicitly enabled as a result. It's still cheap when article number regeneration is unnecessary, as we track the range for each git repository.
2018-04-04 import: rewrite less history during purge
We do not need to rewrite old commits unaffected by the object_id purge, only newer commits. This was a state management bug :x We will also return the new commit ID of rewritten history to aid in incremental indexing of mirrors for the next change.
2018-04-01 v2: one file, really
We need to ensure there is only one file in the top-level tree at any commit so the "add; remove; add;" sequence on the same message is detected properly. Otherwise, git will not detect the second "add" unless a second message is added to history. Deletes are now stored in "d" (and not "D" or "_/D") at the top level. There's no need to have a "_" to reduce churn as "m" and "d" should never co-exist. It's now lowercased to make it easier to distinguish from "D" in git-log output.
2018-03-29 import: run_die supports redirects as spawn does
We'll be using it in more future tests and scripts.
2018-03-29 v2writable: support purging messages from git entirely
Purging existing messages is fairly straightforward since we can take advantage of Xapian and look up the git object_id with it. Unfortunately, purging an already "removed" message (which is no longer in Xapian) is not as easy and we'll need to expose ->purge_oids to purge by the git object_id (currently SHA-1). Furthermore, we expire reflogs and prune in hopes a dumb HTTP client won't get the object.
2018-03-29 v2writable: append, instead of prepending generated Message-ID
The original Message-ID is still the most important when discussing with other recipients who do not rely on a message flowing through public-inbox. So whatever Message-ID we use to deduplicate internally will be secondary and less important. All of our front-end v2 code is order-independent, so we won't let the message count against us, that way.
2018-03-22 import: consolidate mid prepend logic, here
This also quiets down warnings from -watch when spam training happens on messages without Message-Id.
2018-03-22 v2writable: clarify header cleanups
We want to make it clear to the code and DEBUG_DIFF users that we do not introduce messages with unsuitable headers into public archives.
2018-03-22 use both Date: and Received: times
We want to rely on Date: to sort messages within individual threads since it keeps messages from git-send-email(1) sorted. However, since developers occasionally have the clock set wrong on their machines, sort overall messages by the newest date in a Received: header so the landing page isn't forever polluted by messages from the future. This also gives us determinism for commit times in most cases, as we'll use the Received: timestamp there, as well.
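A rough Python sketch of the two sort keys described above. It assumes each Received: header carries its timestamp after the last ';', and the fallbacks used when a header is missing are a guess; the names parse_epoch and sort_times are hypothetical and this is not the module's Perl implementation.

```python
from email import message_from_string
from email.utils import mktime_tz, parsedate_tz


def parse_epoch(value):
    """Parse an RFC 2822 style date into epoch seconds, or None if unparsable."""
    parsed = parsedate_tz(value or "")
    return mktime_tz(parsed) if parsed else None


def sort_times(msg):
    """Return (thread_time, overall_time) for a parsed message."""
    date = parse_epoch(msg.get("Date"))
    received = []
    for value in msg.get_all("Received", []):
        # The timestamp conventionally follows the last ';' of the header.
        stamp = parse_epoch(value.rpartition(";")[2])
        if stamp is not None:
            received.append(stamp)
    newest = max(received) if received else None
    thread_time = date if date is not None else newest    # sort within a thread
    overall_time = newest if newest is not None else date  # sort the landing page
    return thread_time, overall_time


msg = message_from_string(
    "Date: Tue, 01 Jan 2030 00:00:00 +0000\n"
    "Received: from a by b; Thu, 22 Mar 2018 10:00:00 +0000\n"
    "\n"
    "body\n"
)
thread_time, overall_time = sort_times(msg)
print(thread_time > overall_time)  # prints True: the Date: header is in the future
```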
2018-03-20 import: discard all the same headers as MDA
Reduce the places where we have duplicate logic for discarding unwanted headers.
2018-03-19 Lock: new base class for writable lockers
This reduces code duplication needed for locking and hopefully makes things easier to understand.
2018-03-19 import: enable locking under v2
Instead of using ssoma-based locking, enable locking via Import for now.
2018-03-19 import: switch to URL-safe Base64 for Message-IDs
Hexdigests are too long and shorter Message-IDs are easier to deal with.
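To make the size difference concrete, here is a small Python comparison. It assumes a 20-byte SHA-1 digest, which the commit message does not state; the variable names are illustrative.

```python
import base64
import hashlib

digest = hashlib.sha1(b"example message").digest()  # 20 raw bytes

hex_id = digest.hex()
b64_id = base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")

print(len(hex_id), len(b64_id))  # prints "40 27"
```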
2018-03-19 import: force Message-ID generation for v1 here
This allows us to share code for generating Message-IDs between v1 and v2 repos. For v1, this introduces a slight incompatibility in message removal iff the original message lacked a Message-ID AND the training request came from a message which did not pass through the public-inbox: The workaround for this would be to reuse the bad message from the archive itself.
2018-03-19 import: implement barrier operation for v1 repos
This will allow WatchMaildir to use ->barrier operations instead of reaching inside for nchg. This also ensures dumb HTTP clients can see changes to V2 repos immediately.
2018-03-19 import: (v2): write deletes to a separate '_' subdirectory
In the future, we may store "purged" content IDs or other uncommon stuff under "_/" of the git tree. This keeps the top-level tree small and more amenable to deltafication. This helps the common case where "m" is the most commonly changed file at the top level. Also, use 'D' instead of 'd' since it matches git's '--raw' output format.
2018-03-19 import: (v2) delete writes the blob into history in subdir
This makes it easier to audit deletes with "git log -p" and prevents an unstable specification of "content_id" from being stored in history. This should be cost-free if done in the same partition (and even cheaper than before as it introduces no new blobs). It does have a higher cost across partitions, but is probably irrelevant given the typical ham:spam ratio.
2018-03-19 use string ref for Email::Simple->new
Email::Simple is slightly faster this way, and Email::MIME and PublicInbox::MIME both wrap that.
2018-03-19 v2writable: support "barrier" operation to avoid reforking
Stopping and starting a bunch of processes to look up duplicates or removals is inefficient. Take advantage of checkpointing in "git fast-import" and transactions in Xapian and SQLite.
2018-03-06 import: fall back to Sender for extracting name and email
This seems like a reasonable course of action for old messages.
Cc: Nicolás Ojeda Bär <n.oje.bar@gmail.com>
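The fallback can be sketched in Python like this. The function name extract_author, the "@" check, and the empty-string result for unusable headers are assumptions made for illustration; the module itself is Perl and may differ in detail.

```python
from email.message import Message
from email.utils import parseaddr


def extract_author(msg):
    """Return (name, email) from From:, falling back to Sender:."""
    for header in ("From", "Sender"):
        # A missing header is treated as an empty string rather than None.
        name, addr = parseaddr(msg.get(header) or "")
        if "@" in addr:
            return name, addr
    # Neither header was usable; empty strings stand in for the author.
    return "", ""


msg = Message()
msg["From"] = "root"  # local sender without an '@'
msg["Sender"] = "Alice <alice@example.com>"
print(extract_author(msg))  # prints ('Alice', 'alice@example.com')
```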
2018-03-06 favor Received: date over Date: header globally
The first Received: header is believable since it typically hits the user's mail server and can be treated as relatively trustworthy. We still show the Date: in per-message (permalink) views, which may expose users for having incorrect Date: headers, but all the ISO YYYY-MM-DD dates we display will match what we see.
2018-03-03 import: consolidate object info for v2 imports
It's easier to store everything in one array ref similar to what our Git->check routine returns
2018-02-28 v2writable: cleanup unused pipes in partitions
Leaking these pipes to child processes wasn't harmful, but made determining relationships and dataflow between processes more confusing.
2018-02-22 v2: parallelize Xapian indexing
The parallelization requires splitting Msgmap, text+term indexing, and thread-linking out into separate processes. git-fast-import is fast, so we don't bother parallelizing it.

Msgmap (SQLite) and thread-linking (Xapian) must be serialized because they rely on monotonically increasing numbers (NNTP article number and internal thread_id, respectively). We handle msgmap in the main process which drives fast-import. When the article number is retrieved/generated, we write the entire message to per-partition subprocesses via pipes for expensive text+term indexing.

When these per-partition subprocesses are done with the expensive text+term indexing, they write SearchMsg (small data) to a shared pipe (inherited from the main V2Writable process) back to the threader, which runs its own subprocess. The number of text+term Xapian partitions is chosen at import and can be made equal to the number of cores in a machine.

    V2Writable --> Import -> git-fast-import
               \-> SearchIdxThread -> Msgmap (synchronous)
               \-> SearchIdxPart[n] -> SearchIdx[*]
               \-> SearchIdxThread -> SearchIdx ("threader", a subprocess)

    [*] each subprocess writes to threader
2018-02-20 v2: support Xapian + SQLite indexing
This is too slow, currently. Working with only 2017 LKML archives:

    git-only: ~1 minute
    git + SQLite: ~12 minutes
    git+Xapian+SQLite: ~45 minutes

So yes, it looks like we'll need to parallelize Xapian indexing, at least.
2018-02-19 v2writable: initial cut for repo-rotation
Wrap the old Import package to enable creating new repos based on size thresholds. This is better than relying on time-based rotation as LKML traffic seems to be increasing.
2018-02-15 import: allow the epoch (0s) as a valid time
Despite email not existing until 1971, "Jan 1, 1970 00:00:00" seems like a common default timestamp for some test emails to use as a Date: header.
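The commit message does not say how zero was being rejected before. One common way such a bug arises is a truthiness or "greater than zero" check on the parsed timestamp. The Python sketch below illustrates that pitfall under this assumption; both function names are hypothetical.

```python
def choose_time_buggy(parsed, fallback):
    """Treats 0 as 'missing' because 0 is falsy."""
    return parsed or fallback


def choose_time(parsed, fallback):
    """Accepts the epoch (0) and only falls back when parsing failed or went negative."""
    if parsed is not None and parsed >= 0:
        return parsed
    return fallback


now = 1518652800  # stand-in for the current time
print(choose_time_buggy(0, now), choose_time(0, now))  # prints "1518652800 0"
```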
2018-02-15 import: quiet down warnings from bogus From: lines
There's a lot of crap in archives and git-fast-import accepts empty names and email addresses for authors just fine.
2018-02-15 import: pass "raw" dates to git-fast-import(1)
For LKML, it appears we need an even more liberal parser than the RFC2822 date parser in git. I have not validated that Date::Parse parses dates correctly, but this at least prevents git-fast-import(1) from choking.
2018-02-14 import: APIs to support v2 use
Wrap "get-mark" and "checkpoint" commands for git-fast-import while documenting/cementing parts of the API.
2018-02-12 import: initial handling for v2
Call order will need to change a bit since this is going to be tied to Xapian
2018-02-09 import: begin supporting this without ssoma.lock
We'll reuse this class in v2, but won't be utilizing per-git-repository ssoma.lock files. Meanwhile, stop treating ::Inbox objects as an afterthought and allow importing name and email into them.
2018-02-08 import: stop writing legacy ssoma.index by default
For machines which have never seen ssoma, they don't need the index so stop creating it.
2018-02-07 update copyrights for 2018
Using update-copyrights from gnulib. While we're at it, use the SPDX identifier for AGPL-3.0+ to ease mechanical processing.
2017-11-16 learn: use "spam" as subject for removal commits
Sometimes an email is an innocent removal "rm" for a misdirected, off-topic post, while most removed messages are "spam". Allow anybody to look at history and easily distinguish the reason for removing the message.
2017-06-20 import: fix encoding issues from weird "raw" emails
This seems to allow weirdly-encoded "raw" emails in blade.nagaokaut.ac.jp/ruby/ruby-core/* to be handled without difficulties.
2017-05-25 import: reset :raw mode for commit title (subject)
This was necessary for the presence of the 0xa0 byte(*) in the Subject: of the message at: http://blade.nagaokaut.ac.jp/ruby/ruby-core/3220
(*) That is 0xa0, not 0x0a ("\n"), so I wonder if the nibbles got swapped somehow.
2017-01-10 introduce PublicInbox::MIME wrapper class
This should fix problems with multipart messages where text/plain parts lack a header. cf. git clone --mirror https://github.com/rjbs/Email-MIME.git refs/pull/28/head In the future, we may still introduce a streaming interface to reduce memory usage on large emails.
2016-10-16 import: failed GC runs are non-fatal
We should not completely kill a process if "git gc --auto" errors out due to a warning or whatnot.
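A minimal Python sketch of treating a failed "git gc --auto" as a warning rather than a fatal error. The helper names run_nonfatal and gc_auto are hypothetical, and the module's own Perl code handles this differently in detail.

```python
import subprocess
import sys


def run_nonfatal(argv):
    """Run a command; warn on failure and return False instead of raising."""
    try:
        result = subprocess.run(argv, check=False)
    except OSError as err:  # e.g. the executable is not installed
        print(f"warning: could not run {argv[0]}: {err}", file=sys.stderr)
        return False
    if result.returncode != 0:
        print(f"warning: {' '.join(argv)} exited with {result.returncode}",
              file=sys.stderr)
        return False
    return True


def gc_auto(git_dir):
    """Housekeeping after an import; the caller carries on even if this fails."""
    return run_nonfatal(["git", f"--git-dir={git_dir}", "gc", "--auto"])
```

The caller can ignore the return value entirely, so a warning from gc never takes down a long-running importer.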
2016-09-08 import: run "git gc --auto" when done
We need to prevent excessive repository growth for public-inbox-watch and public-inbox-mda users.