path: root/lib/PublicInbox/Import.pm
Date Commit message
2021-02-01 import: reap git-config(1) synchronously
This avoids a zombie if another step of the event loop takes too long.
2021-01-03 use Eml (or MIME) objects for all indexing paths
We don't need to be keeping the raw message around after it hits git. Shard work now relies on Storable (or Sereal) and all of the indexing code relies on the Email::MIME-like API of Eml to access interesting parts of the message. Similarly, smsg->{raw_bytes} is no longer carried around and we do the CRLF adjustment when setting smsg->{bytes}. There's also a small simplification to t/import.t while we're in the area to use xqx instead of spawn/popen_rd.
2021-01-02 import: switch to using ProcessPipe
This saves us a few lines of code, but also prevents misreaping by sibling processes.
2021-01-02 import: unset GIT_CONFIG with `git config --global'
GIT_CONFIG is set by -convert, and users may have it set for other reasons. In either case, it conflicts with any attempt to use `git config --global` so we have to unset it. This fixes t/multi-mid.t under TEST_RUN_MODE=0.
2021-01-01 update copyrights for 2021
Using "make update-copyrights" after setting GNULIB_PATH in my config.mak
2021-01-01 spawn: move run_die here from PublicInbox::Import
It seems like a more logical place for it, but we'll favor the newly-added xsys_e() in tests for BAIL_OUT use.
2021-01-01 lei_store: use per-machine refname as git HEAD
It may be helpful to identify the source of messages and perhaps avoid conflicting history. On the other hand, this may be a terrible idea for users who move portable storage (e.g. USB sticks) across computers...
2021-01-01 import: respect init.defaultBranch
This matches git v2.28.0+ behavior in case users prefer a different name.
2020-12-31 Merge remote-tracking branch 'origin/master' into lorelei
* origin/master: (58 commits)
  ds: flatten + reuse @events, epoll_wait style fixes
  ds: simplify EventLoop implementation
  check defined return value for localized slurp errors
  import: check for git->qx errors, clearer return values
  git: qx: avoid extra "local" for scalar context case
  search: remove {mset} option for ->mset method
  search: remove pointless {relevance} setting
  miscsearch: take reopen from Search and use it
  extsearch: unconditionally reopen on access
  extindex: allow using --all without EXTINDEX_DIR
  extindex: add undocumented --no-scan switch
  extindex: enable autoflush on STDOUT/STDERR
  extindex: various --watch signal handling fixes
  extindex: --watch for inotify-based updates
  eml: fix undefined vars on <Perl 5.28
  t/config: test --get-urlmatch for git <2.26
  default to CORE::warn in $SIG{__WARN__} handlers
  inbox: name variable for values loop iterator
  inboxidle: avoid needless syscalls on refresh
  inboxidle: clue users into resolving ENOSPC from inotify
  ...
2020-12-28 import: check for git->qx errors, clearer return values
Those git commands can fail and git->qx will set $? when it fails. There's no need for the extra indirection of the @ret array, either. Improve git->qx coverage to check for $? while we're at it.
2020-12-19 lei_store: local storage for Local Email Interface
Still unstable, this builds off the equally unstable extindex :P This will be used for caching/memoization of traditional mail stores (IMAP, Maildir, etc) while providing indexing via Xapian, along with compression, and checksumming from git. Most notably, this adds the ability to add/remove per-message keywords (draft, seen, flagged, answered) as described in the JMAP specification (RFC 8621 section 4.1.1). We'll use `.' (a single period) as an $eidx_key since it's an invalid {inboxdir} or {newsgroup} name.
2020-12-18 import: drop X-Status in addition to Status
It's actually supported by mutt, dovecot[1], and likely some other software to augment the Status: header. While dovecot doesn't expose X-Status to clients, mutt will write 'A' (answered) and 'F' to X-Status (but not T (draft)). So we'll drop it like we do Status since it's not suitable for public mail, but sticking it in an @UNWANTED_HEADERS array will allow us to configure an override if needed.
[1] https://doc.dovecot.org/configuration_manual/mail_location/mbox/
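The header filtering described above can be illustrated with a small Python sketch. This is not the project's Perl code: the function name `drop_unwanted` is hypothetical, and the list contents are assumed from the commit text (only Status and X-Status are named there).

```python
from email import message_from_string

# Assumed contents, taken from the headers named in the commit message.
UNWANTED_HEADERS = ("Status", "X-Status")

def drop_unwanted(msg):
    """Remove client-state headers that are unsuitable for public mail."""
    for name in UNWANTED_HEADERS:
        # Deleting by name removes every occurrence (case-insensitively)
        # and does not raise if the header is absent.
        del msg[name]
    return msg

raw = "From: a@example.com\nStatus: RO\nX-Status: A\nSubject: hi\n\nbody\n"
msg = drop_unwanted(message_from_string(raw))
```

Keeping the names in one list is what makes a later configuration override straightforward: the list can be replaced without touching the loop.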
2020-09-16 treewide: relax allow >=40 chars for git OID
This will help with eventual git SHA-256 transitions.
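As a rough illustration of the relaxed match (a Python sketch, not the project's Perl): SHA-1 object IDs are 40 hex digits and SHA-256 object IDs are 64, so accepting 40 or more covers both. The pattern and the helper name `looks_like_oid` are assumptions made for this example.

```python
import re

# Hypothetical pattern: 40 or more lowercase hex digits, nothing else.
OID_RE = re.compile(r"\A[0-9a-f]{40,}\Z")

def looks_like_oid(s: str) -> bool:
    """True if `s` could be a full git object ID (SHA-1 or longer)."""
    return OID_RE.match(s) is not None
```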
2020-09-01 watch: avoid unnecessary spawning on spam removals
This should further mitigate lock contention problems when -watch is configured to watch on a Maildir for spam while performing a large NNTP import. There is now a small risk a message won't get removed if it's in the current (uncommitted) fast-import batch, but that is unlikely given the batch size is now only 10 messages. If that small window is hit, flipping the \Seen flag (e.g. marking it unread, and then read again) will trigger another removal attempt via IMAP or Maildir.
2020-08-02 remove unnecessary ->header_obj calls
We used ->header_obj in the past as an optimization with Email::MIME. That optimization is no longer necessary with PublicInbox::Eml. This doesn't make any functional difference even if we were to go back to Email::MIME. However, it reduces the amount of code we have and slightly reduces allocations with PublicInbox::Eml.
2020-08-01 improve error handling on import fork / lock failures
v?fork failures seem to be the cause of locks not getting released in -watch. Ensure lock release doesn't get skipped in ->done for both v1 and v2 inboxes. We also need to do everything we can to ensure DB handles, pipes and processes get released even in the face of failure. While we're at it, make failures around `git update-server-info' non-fatal, since smart HTTP seems more popular anyways.
v2 changes:
- spawn: show failing command
- ensure waitpid is synchronous for inotify events
- teardown all fast-import processes on exception, not just the failing one
- beef up lock_release error handling
- release lock on fast-import spawn failure
2020-07-25 use consistent {ibx} field for writable code paths
This is a step which makes our use of abbreviations more consistent when referring to PublicInbox::Inbox objects. We'll also be reducing the number of redundant fields in SearchIdx and V2Writable code paths to make the object graph easier-to-follow.
2020-07-17 import: use common capitalization for filtering headers
In case this ends up in the same process as Mbox::msg_hdr, it can reduce memory use by sharing the cache key in PublicInbox::Eml::re_memo.
2020-07-17 drop binmode usage
We only support Unix-like platforms where binmode (":raw") is the default anyways, and v5.10 semantics means it won't do unicode_strings (unlike v5.12). So save some lines of code.
2020-06-30 watch: check for duplicates in ->over before spamcheck
It's cheaper to check for duplicates than run `spamc' repeatedly when rechecking. We already do this for v1 by using the "ls" command with fast-import, but v2 requires checking against over.sqlite3.
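The ordering argument (cheap duplicate lookup first, costly spam check second) can be sketched in Python. Everything here is hypothetical: a plain set stands in for the lookup against over.sqlite3, and `is_spam` stands in for running `spamc'.

```python
def filter_new_messages(messages, seen_mids, is_spam):
    """Return (mid, body) pairs that are neither duplicates nor spam."""
    accepted = []
    for mid, body in messages:
        if mid in seen_mids:
            continue  # cheap lookup first: duplicates never reach is_spam()
        if is_spam(body):
            continue  # the expensive check only runs for unseen messages
        seen_mids.add(mid)
        accepted.append((mid, body))
    return accepted
```

The saving grows with the number of rechecks, since every already-imported message skips the expensive call entirely.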
2020-06-25 lock: reduce inotify wakeups
We can reduce the amount of platform-specific code by always relying on IN_MODIFY/NOTE_WRITE notifications from lock release. This reduces the number of times our read-only daemons will need to wake up when -watch sees no-op message changes (e.g. replied, seen, recent flag changes).
2020-06-13 index: account for CRLF conversion when storing bytes
NNTP and IMAP both require CRLF conversions on the wire. They're also the only components which care about $smsg->{bytes}, so store the CRLF-adjusted value in over.sqlite3 and Xapian DBs. This will allow us to optimize RFC822.SIZE fetch item in IMAP without triggering size mismatch errors in some clients' default configurations (e.g. Mail::IMAPClient), but not most others. It could also fix hypothetical problems with NNTP clients that report discrepancies between overview and article data.
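One way to compute a CRLF-adjusted size without building the converted message is to add one byte per bare LF. This is a Python sketch under that assumption; the commit message does not say how the Perl code computes the value, and the function name is hypothetical.

```python
def crlf_adjusted_size(raw: bytes) -> int:
    """Size of `raw` once every bare LF is sent on the wire as CRLF."""
    # Every CRLF contains exactly one LF, so subtracting the CRLF count
    # leaves the LFs that are not already preceded by CR.
    bare_lf = raw.count(b"\n") - raw.count(b"\r\n")
    return len(raw) + bare_lf
```

Storing this number up front means a size query can be answered from the index alone, and it agrees with the byte count a client sees when it later fetches the converted message.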
2020-06-03 smsg: introduce ->populate method
This will eventually replace the __hdr() calling methods and eradicate {mime} usage from Smsg. For now, we can eliminate PublicInbox::Smsg->new since most callers already rely on an open `bless' to avoid the old {mime} arg.
2020-06-03 import: modernize to use Perl 5.10 features
First, prefer the leaner "parent" module over the heavy "base" module to establish ISA relationships, since "base" is only needed for "fields". The "//" and "//=" operators allow us to simplify our code and fix minor bugs where a value of "0" was disallowed. Yes, we'll allow "0" as an email address, too, since some twisted BOFH could theoretically use it as a local user name. Going forward, we'll also be avoiding "use warnings" and instead rely on `-w' in the shebang.
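The class of bug fixed by switching to defined-or can be shown with a Python analogue (not Perl, and both function names are hypothetical). Python's `or` falls back on any falsy value, much like Perl's `||`, while an explicit `None` check falls back only on a missing value, like Perl's `//`. Python treats the string "0" as true, so this sketch uses the integer 0 and the empty string to trigger the same pitfall.

```python
def with_or(value, default):
    # Falls back for every falsy value: None, 0, "" and so on.
    return value or default

def with_defined_or(value, default):
    # Falls back only when the value is actually missing.
    return default if value is None else value
```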
2020-05-19 favor readline() and print() as functions
In our inbox-writing code paths, ->getline as an OO method may be confused with the various definitions of `getline' used by the PSGI interface. It's also easier to do: "perldoc -f readline" than to figure out which class "->getline" belongs to (IO::Handle) and look up documentation for that. ->print is less confusing than the "readline" vs "getline" mismatch, but we can still make it clear we're using a real file handle and not a mock interface. Finally, functions are a bit faster than their OO counterparts.
2020-05-17 confine Email::MIME use even further
To avoid confusing future readers and users, recommend PublicInbox::Eml in our Import POD and refer to PublicInbox::Eml comments at the top of PublicInbox::MIME. mime_load() is now confined to t/eml.t, since we won't be using it anywhere else in our tests.
2020-05-12 rename "ContentId" to "ContentHash"
The old name may be confused with "Content-ID" as described in RFC 2392, so use an alternate name to avoid confusing future readers.
2020-05-09 remove most internal Email::MIME usage
We no longer load or use Email::MIME outside of comparison tests.
2020-05-09 replace most uses of PublicInbox::MIME with Eml
PublicInbox::Eml has enough functionality to replace the Email::MIME-based PublicInbox::MIME.
2020-04-30 mid: capitalize "ID" in "Message-ID"
Prefer the "ID" capitalization since it seems to be the preferred capitalization in RFC 5322. In theory, this allows the interpreter to deduplicate the string internally (I haven't checked if it does). Unfortunately, there are too many instances of "Message-Id" in the tests to be worth changing at this point.
2020-04-20 import: init_bare: use pure Perl
Even on systems with Inline::C spawn(), this cuts a primed "make check-run" time by 2-3% on Linux, and roughly 5-7% on FreeBSD when using vfork-enabled spawn. I doubt anybody cares: this omits the sample hooks and some empty and useless-for-us or obsolete directories created by git-init(1).
2020-04-20 import: init_bare: allow use as method, use in tests
Allowing ->init_bare to be used as a method saves some keystrokes, and we can save a little bit of time on systems with our vfork(2)-enabled spawn(). This also sets us up for future improvements where we can avoid spawning a process at all.
2020-03-22 *idx: pass smsg in even more places
We can finally get rid of the awkward, ad-hoc use of V2Writable, SearchIdx, and OverIdx args for passing {cotime} and {autime} between classes. We'll still use those git time fields internally within V2Writable and SearchIdx for (re)indexing, but that's not worth avoiding as a fallback.
2020-03-22 v2writable: preserve timestamps from import
While v2 indexing is triggered immediately after writing the commit to the git repository, there may be a gap between when PublicInbox::Import generates a timestamp and when PublicInbox::SearchIdx sees the message. So follow the mirror indexing behavior and take the to-be-indexed (time|date)stamps directly from the git commit.
2020-03-01 import: drop '<' and '>' characters in addresses
Some strange "From:" lines will cause Email::Address::XS to leave '<' (and presumably '>') in the address which git-fast-import won't accept even if quoted. Work around this problem by deleting '<' and '>' the same way we delete them for the ident name.
Reported-by: Leah Neukirchen <leah@vuxu.org>
Link: https://public-inbox.org/meta/87h7zfemur.fsf@vuxu.org/
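The workaround can be sketched in Python as follows. The helper name `fast_import_ident` is hypothetical and this is not the project's Perl; it only shows why stray angle brackets must go, since an ident takes the form "Name <email>" and uses the brackets as delimiters.

```python
# Translation table that deletes both angle brackets.
_STRIP_ANGLES = str.maketrans("", "", "<>")

def fast_import_ident(name: str, email: str) -> str:
    """Build a "Name <email>" ident with no stray '<' or '>' inside."""
    name = name.translate(_STRIP_ANGLES)
    email = email.translate(_STRIP_ANGLES)
    return f"{name} <{email}>"
```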
2020-02-06 treewide: run update-copyrights from gnulib for 2019
I didn't wait until September to do it, this year!
2020-01-13 use popen_rd for bidirectional pipes
popen_rd accepts arbitrary redirects, so we can reuse its code to setup the pipe end we want to read, saving each caller a few lines of code compared to calling pipe+spawn.
2020-01-11 spawn (and thus popen_rd) die on failure
Most spawn and popen_rd callers die on failure to spawn, anyways, and some are missing checks entirely. This saves us a bunch of verbose error-checking code in callers. This also makes popen_rd more consistent, since it already dies on pipe creation failures.
2020-01-06 treewide: "require" + "use" cleanup and docs
There's a bunch of leftover "require" and "use" statements we no longer need and can get rid of, along with some excessive imports via "use". IO::Handle usage isn't always obvious, so add comments describing why a package loads it. Along the same lines, document the tmpdir support as the reason we depend on File::Temp 0.19, even though every Perl 5.10.1+ user has it. While we're at it, favor "use" over "require", since it gives us extra compile-time checking.
2020-01-02 doc: fix a few spelling errors in user-facing docs
Found by codespell, there's a few more in comments and some debatable ones, but user-facing stuff is more important.
2019-12-30 spawn: allow passing GLOB handles for redirects
We can save callers the trouble of {-hold} and {-dev_null} refs as well as the trouble of calling fileno().
2019-12-11 import: (cleanup) drop redundant env arg to run_die
run_die() doesn't require an $env arg, so there's no point passing "undef" to it.
2019-11-29 replace: quiet "git gc" invocation
Since we give users no indication or control of how "git gc" runs, showing its progress is confusing.
2019-11-16 import: only pass Inbox object to SearchIdx->new
SearchIdx->new no longer accepts a GIT_DIR path as its argument since commit 585314673236d664729fe3ab2d4fb229d1c0f2d5 ("searchidx: require PublicInbox::Inbox (or InboxWritable) ref")
2019-10-15 PublicInbox::Import Smuggle a raw message into add
I don't trust the MIME type to not munge my email messages in horrible ways upon occasion. Therefore allow for passing in the raw message value instead of trusting the mime object to preserve it.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
[ew: use "//" from Perl 5.10+ for defined check]
2019-06-09 v2writable: implement ->replace call
Much of the existing purge code is repurposed to a general "replace" functionality. ->purge is simpler because it can just drop the information. Unlike ->purge, ->replace needs to edit existing git commits (in case of From: and Subject: headers) and reindex the modified message. We currently disallow editing of References:, In-Reply-To: and Message-ID headers because it can cause bad side effects with our threading (and our lack of rethreading support to deal with excessive matching from incorrect/invalid References).
2019-06-09 import: switch to "replace_oids" interface for purge
Continuing the work by Eric Biederman in commit a118d58a402bd31b ("Import.pm: When purging replace a purged file with a zero length file"), we can use a generic OID replacement mechanism to implement purge.
2019-06-09 import: extract_author_info becomes extract_commit_info
We will be reusing the same logic for extracting all the authorship and commit title logic for edits; so put it all into one sub.
2019-06-05 tighten up digit matches to ASCII for git output
While I don't expect git to suddenly start spewing non-ASCII digits in places I'd expect ASCII, this would make things easier for future hackers and reviewers.
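The difference between a Unicode-aware digit class and an explicit ASCII range can be demonstrated in Python, whose `\d` also matches non-ASCII decimal digits when applied to `str` patterns. This is an analogue for illustration only, not the project's Perl patterns; the constant and pattern names are made up for the example.

```python
import re

ARABIC_INDIC_THREE = "\u0663"  # a decimal digit outside the ASCII range

unicode_digits = re.compile(r"\A\d+\Z")   # matches any Unicode decimal digit
ascii_digits = re.compile(r"\A[0-9]+\Z")  # matches only 0 through 9

matches_unicode = unicode_digits.match(ARABIC_INDIC_THREE) is not None
matches_ascii = ascii_digits.match(ARABIC_INDIC_THREE) is not None
```

Spelling out `[0-9]` documents the expectation that the parsed output contains only ASCII digits, so a reviewer does not have to reason about what else `\d` might accept.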
2019-05-17 PublicInbox::Import::add: Consolidate subject handling
Consolidate subject handling in the add function to make it easier to read and understand.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>