about summary refs log tree commit homepage
path: root/lib/PublicInbox/Import.pm
DateCommit message (Collapse)
2021-10-16smsg: add ->oidbin method
This makes some of our code less noisy by reducing the amount of pack('H*', ...) use.
2021-09-12import: do not write a "description" file
The default value is worthless to us and git functions fine without the file. public-inbox-init will create a useful one in the next change.
2021-05-04lei index: new command to index mail w/o git storage
Since completely purging blobs from git is slow, users may wish to index messages in Maildirs (and eventually other local storage) without storing data in git. Much code from LeiImport and LeiInput is reused, and a new dummy FakeImport class supplies a non-storing $im->add and minimize changes to LeiStore. The tricky part of this command is to support "lei import" after a message has gone through "lei index". Relying on $smsg->{bytes} == 0 (as we do for external-only vmd storage) does not work here, since it would break searching for "z:" byte-ranges when not using externals. This eventually required PublicInbox::Import::add to use a SharedKV to keep track of imported blobs and prevent duplication.
2021-04-07import: convert init.defaultBranch to fully qualified ref
init.defaultBranch expects a branch name, not a fully qualified ref. git-init prepends "refs/heads/" automatically and unconditionally. PublicInbox::Import::default_branch, however, incorrectly passes on the init.defaultBranch value as is, leading to it being used in spots where a fully qualified ref is required. For example, with an init.defaultBranch value of "master", public-inbox-index for a v2 repository would lead to an all.git repository where HEAD's content is "ref: master" instead of "ref: refs/heads/master". Prepend "refs/heads/" to the incoming init.defaultBranch value. Fixes: 7c2f36de2fb49dd7 (import: respect init.defaultBranch)
2021-04-03lei: improve handling of Message-ID-less draft messages
We need a stable fallback time for digest2mid in the presence of messages without Received/Date headers. Furthermore, we must avoid using uninitialized smsg->{mid} when parsing References for draft replies.
2021-03-21lei import: vivify external-only messages
Keyword storage for external-only messages was preventing messages from being explicitly imported. Teach lei_store to vivify keyword-only entries into fully-indexed messages on import.
2021-02-24treewide: avoid "delete local" construct on hashes
Apparently this feature is only in Perl 5.12+, and we're still on Perl 5.10.
2021-02-10git: ->qx: respect caller's $/ in array context
This could lead to bad results when doing ls-tree -z for v2 import in case there's multiple files. In any case, the `local $/ = "\0"' in Import.pm is also eliminated to reduce potential confusion and surprises.
2021-02-01import: reap git-config(1) synchronously
This avoids a zombie if another step of the event loop takes too long.
2021-01-03use Eml (or MIME) objects for all indexing paths
We don't need to be keeping the raw message around after it hits git. Shard work now relies on Storable (or Sereal) and all of the indexing code relies on the Email::MIME-like API of Eml to access interesting parts of the message. Similarly, smsg->{raw_bytes} is no longer carried around and we do the CRLF adjustment when setting smsg->{bytes}. There's also a small simplification to t/import.t while we're in the area to use xqx instead of spawn/popen_rd.
2021-01-02import: switch to using ProcessPipe
This saves us a few lines of code, but also prevents misreaping by sibling processes.
2021-01-02import: unset GIT_CONFIG with `git config --global'
GIT_CONFIG is set by -convert, and user may have it set for other reasons. In either case, it conflicts with any any attempt to use `git config --global` so we have to unset it. This fixes t/multi-mid.t under TEST_RUN_MODE=0
2021-01-01update copyrights for 2021
Using "make update-copyrights" after setting GNULIB_PATH in my config.mak
2021-01-01spawn: move run_die here from PublicInbox::Import
It seems like a more logical place for it, but we'll favor the newly-added xsys_e() in tests for BAIL_OUT use.
2021-01-01lei_store: use per-machine refname as git HEAD
It may be helpful to identify the source of messages and perhaps avoid conflicting history. On the other hand, this may be a terrible idea for users who move portable storage (e.g. USB sticks) across computers...
2021-01-01import: respect init.defaultBranch
This matches git v2.28.0+ behavior in case users prefer a different name.
2020-12-31Merge remote-tracking branch 'origin/master' into lorelei
* origin/master: (58 commits) ds: flatten + reuse @events, epoll_wait style fixes ds: simplify EventLoop implementation check defined return value for localized slurp errors import: check for git->qx errors, clearer return values git: qx: avoid extra "local" for scalar context case search: remove {mset} option for ->mset method search: remove pointless {relevance} setting miscsearch: take reopen from Search and use it extsearch: unconditionally reopen on access extindex: allow using --all without EXTINDEX_DIR extindex: add undocumented --no-scan switch extindex: enable autoflush on STDOUT/STDERR extindex: various --watch signal handling fixes extindex: --watch for inotify-based updates eml: fix undefined vars on <Perl 5.28 t/config: test --get-urlmatch for git <2.26 default to CORE::warn in $SIG{__WARN__} handlers inbox: name variable for values loop iterator inboxidle: avoid needless syscalls on refresh inboxidle: clue users into resolving ENOSPC from inotify ...
2020-12-28import: check for git->qx errors, clearer return values
Those git commands can fail and git->qx will set $? when it fails. There's no need for the extra indirection of the @ret array, either. Improve git->qx coverage to check for $? while we're at it.
2020-12-19lei_store: local storage for Local Email Interface
Still unstable, this builds off the equally unstable extindex :P This will be used for caching/memoization of traditional mail stores (IMAP, Maildir, etc) while providing indexing via Xapian, along with compression, and checksumming from git. Most notably, this adds the ability to add/remove per-message keywords (draft, seen, flagged, answered) as described in the JMAP specification (RFC 8621 section 4.1.1). We'll use `.' (a single period) as an $eidx_key since it's an invalid {inboxdir} or {newsgroup} name.
2020-12-18import: drop X-Status in addition to Status
It's actually supported by mutt, dovecot[1], and likely some other software to augment the Status: header. While dovecot doesn't expose X-Status to clients, mutt will write 'A' (answered) and 'F' to X-Status (but not T (draft)). So we'll drop it like we do Status since it's not suitable for public mail, but stick it in an @UNWANTED_HEADERS array will allow us to configure an override if needed. [1] https://doc.dovecot.org/configuration_manual/mail_location/mbox/
2020-09-16treewide: relax allow >=40 chars for git OID
This will help with eventual git SHA-256 transitions.
2020-09-01watch: avoid unnecessary spawning on spam removals
This should further mitigate lock contention problems when -watch is configured to watch on a Maildir for spam while performing a large NNTP import. There is now a small risk a message won't get removed because if it's in the current (uncommitted) fast-import batch, but unlikely given the batch size is now only 10 messages. If a that small window is hit, flipping the \Seen flag (e.g. marking it unread, and then read again) will trigger another removal attempt via IMAP or Maildir.
2020-08-02remove unnecessary ->header_obj calls
We used ->header_obj in the past as an optimization with Email::MIME. That optimization is no longer necessary with PublicInbox::Eml. This doesn't make any functional difference even if we were to go back to Email::MIME. However, it reduces the amount of code we have and slightly reduces allocations with PublicInbox::Eml.
2020-08-01improve error handling on import fork / lock failures
v?fork failures seems to be the cause of locks not getting released in -watch. Ensure lock release doesn't get skipped in ->done for both v1 and v2 inboxes. We also need to do everything we can to ensure DB handles, pipes and processes get released even in the face of failure. While we're at it, make failures around `git update-server-info' non-fatal, since smart HTTP seems more popular anyways. v2 changes: - spawn: show failing command - ensure waitpid is synchronous for inotify events - teardown all fast-import processes on exception, not just the failing one - beef up lock_release error handling - release lock on fast-import spawn failure
2020-07-25use consistent {ibx} field for writable code paths
This is a step which makes our use of abbreviations more consistent when referring to PublicInbox::Inbox objects. We'll also be reducing the number of redundant fields in SearchIdx and V2Writable code paths to make the object graph easier-to-follow.
2020-07-17import: use common capitalization for filtering headers
In case this ends up in the same process as Mbox::msg_hdr, it can reduce memory use by sharing the cache key in PublicInbox::Eml::re_memo
2020-07-17drop binmode usage
We only support Unix-like platforms where binmode (":raw") is the default anyways, and v5.10 semantics means it won't do unicode_strings (unlike v5.12). So save some lines of code.
2020-06-30watch: check for duplicates in ->over before spamcheck
It's cheaper to check for duplicates than run `spamc' repeatedly when rechecking. We already do this for v1 with by using the "ls" command with fast-import, but v2 requires checking against over.sqlite3.
2020-06-25lock: reduce inotify wakeups
We can reduce the amount of platform-specific code by always relying on IN_MODIFY/NOTE_WRITE notifications from lock release. This reduces the number of times our read-only daemons will need to wake up when -watch sees no-op message changes (e.g. replied, seen, recent flag changes).
2020-06-13index: account for CRLF conversion when storing bytes
NNTP and IMAP both require CRLF conversions on the wire. They're also the only components which care about $smsg->{bytes}, so store the CRLF-adjusted value in over.sqlite3 and Xapian DBs.. This will allow us to optimize RFC822.SIZE fetch item in IMAP without triggering size mismatch errors in some clients' default configurations (e.g. Mail::IMAPClient), but not most others. It could also fix hypothetical problems with NNTP clients that report discrepancies between overview and article data.
2020-06-03smsg: introduce ->populate method
This will eventually replace the __hdr() calling methods and eradicate {mime} usage from Smsg. For now, we can eliminate PublicInbox::Smsg->new since most callers already rely on an open `bless' to avoid the old {mime} arg.
2020-06-03import: modernize to use Perl 5.10 features
First, prefer the leaner "parent" module over the heavy "base" module to establish ISA relationships, since "base" is only needed for "fields". The "//" and "//=" operators allow us simplify our code and fix minor bugs where a value of "0" was disallowed. Yes, we'll allow "0" as an email address, too, since some twisted BOFH could theoretically use it as a local user name. Going forward, we'll also be avoiding "use warnings" and instead rely on `-w' in the shebang.
2020-05-19favor readline() and print() as functions
In our inbox-writing code paths, ->getline as an OO method may be confused with the various definitions of `getline' used by the PSGI interface. It's also easier to do: "perldoc -f readline" than to figure out which class "->getline" belongs to (IO::Handle) and lookup documentation for that. ->print is less confusing than the "readline" vs "getline" mismatch, but we can still make it clear we're using a real file handle and not a mock interface. Finally, functions are a bit faster than their OO counterparts.
2020-05-17confine Email::MIME use even further
To avoid confusing future readers and users, recommend PublicInbox::Eml in our Import POD and refer to PublicInbox::Eml comments at the top of PublicInbox::MIME. mime_load() confined to t/eml.t, since we won't be using it anywhere else in our tests.
2020-05-12rename "ContentId" to "ContentHash"
The old name may be confused with "Content-ID" as described in RFC 2392, so use an alternate name to avoid confusing future readers.
2020-05-09remove most internal Email::MIME usage
We no longer load or use Email::MIME outside of comparison tests.
2020-05-09replace most uses of PublicInbox::MIME with Eml
PublicInbox::Eml has enough functionality to replace the Email::MIME-based PublicInbox::MIME.
2020-04-30mid: capitalize "ID" in "Message-ID"
Prefer the "ID" capitalization since it seems to to be the preferred capitalization in RFC 5322. In theory, this allows the interpreter to deduplicate the string internally (I haven't checked if it does). Unfortunately, there's too many instances of "Message-Id" in the tests to be worth changing at this point.
2020-04-20import: init_bare: use pure Perl
Even on systems with Inline::C spawn(), this cuts a primed "make check-run" time by 2-3% on Linux, and roughly 5-7% on FreeBSD when using vfork-enabled spawn. I doubt anybody cares: this omits the sample hooks and some empty and useless-for-us or obsolete directories created by git-init(1).
2020-04-20import: init_bare: allow use as method, use in tests
Allowing ->init_bare to be used as a method saves some keystrokes, and we can save a little bit of time on systems with our vfork(2)-enabled spawn(). This also sets us up for future improvements where we can avoid spawning a process at all.
2020-03-22*idx: pass smsg in even more places
We can finally get rid of the awkward, ad-hoc use of V2Writable, SearchIdx, and OverIdx args for passing {cotime} and {autime} between classes. We'll still use those git time fields internally within V2Writable and SearchIdx for (re)indexing, but that's not worth avoiding as a fallback.
2020-03-22v2writable: preserve timestamps from import
While v2 indexing is triggered immediately after writing the commit to the git repository, there may be a gap between when PublicInbox::Import generates a timestamp and when PublicInbox::SearchIdx sees the message. So follow the mirror indexing behavior and take the to-be-indexed (time|date)stamps directly from the git commit.
2020-03-01import: drop '<' and '>' characters in addresses
Some strange "From:" lines will cause Email::Address::XS to leave '<' (and presumably '>') in the address which git-fast-import won't accept even if quoted. Workaround this problem by deleting '<' and '>' the same way we delete them for the ident name. Reported-by: Leah Neukirchen <leah@vuxu.org> Link: https://public-inbox.org/meta/87h7zfemur.fsf@vuxu.org/
2020-02-06treewide: run update-copyrights from gnulib for 2019
I didn't wait until September to do it, this year!
2020-01-13use popen_rd for bidirectional pipes
popen_rd accepts arbitrary redirects, so we can reuse its code to setup the pipe end we want to read, saving each caller a few lines of code compared to calling pipe+spawn.
2020-01-11spawn (and thus popen_rd) die on failure
Most spawn and popen_rd callers die on failure to spawn, anyways, and some are missing checks entirely. This saves us a bunch of verbose error-checking code in callers. This also makes popen_rd more consistent, since it already dies on pipe creation failures.
2020-01-06treewide: "require" + "use" cleanup and docs
There's a bunch of leftover "require" and "use" statements we no longer need and can get rid of, along with some excessive imports via "use". IO::Handle usage isn't always obvious, so add comments describing why a package loads it. Along the same lines, document the tmpdir support as the reason we depend on File::Temp 0.19, even though every Perl 5.10.1+ user has it. While we're at it, favor "use" over "require", since it it gives us extra compile-time checking.
2020-01-02doc: fix a few spelling errors in user-facing docs
Found by codespell, there's a few more in comments and some debatable ones, but user-facing stuff is more important.
2019-12-30spawn: allow passing GLOB handles for redirects
We can save callers the trouble of {-hold} and {-dev_null} refs as well as the trouble of calling fileno().
2019-12-11import: (cleanup) drop redundant env arg to run_die
run_die() doesn't require an $env arg, so there's no point passing "undef" to it.