about summary refs log tree commit homepage
path: root/lib/PublicInbox/Import.pm
DateCommit message (Collapse)
2024-03-10import: fix handling of init.defaultBranch
We must chomp the newline in the branch name if it's set. Reported-by: Rob Herring <robh@kernel.org> Link: https://public-inbox.org/meta/CAL_JsqK7P4gjLPyvzxNEcYmxT4j6Ah5f3Pz1RqDHxmysTg3aEg@mail.gmail.com/ Fixes: 73830410e4336b77 (treewide: use run_qx where appropriate, 2023-10-27)
2024-03-10import: croak (instead of die) on write failures
This allows accurate reporting of the error location and can be made to dump a Perl backtrace via PERL5OPT='-MCarp=verbose'. Noticed while tracking down fast-import failures. Link: https://public-inbox.org/meta/CAL_JsqK7P4gjLPyvzxNEcYmxT4j6Ah5f3Pz1RqDHxmysTg3aEg@mail.gmail.com/
2024-02-01import: drop redundant `use' statement
We don't need multiple `use PublicInbox::IO' statements to import a subroutine.
2023-11-11mda|learn|watch: support dropUniqueUnsubscribe config
List-Unsubscribe headers with unique identifiers (such as those generated by our examples/unsubscribe.milter) should not end up in public archives. Add a new config knob to strip List-Unsubscribe headers if they have the `List-Unsubscribe-Post: List-Unsubscribe=One-Click' header. Unfortunately, this breaks DKIM signatures if the signature covers either of these List-Unsubscribe* headers. However, breaking DKIM is the lesser evil compared to any archive reader being able to stop archival by an independent archivist. As much as I would like this to be the default, it probably affects few users at the moment since very few mailing lists use unique identifiers in List-Unsubscribe (but that number has grown, recently).
2023-11-03move read_all, try_cat, and poll_in to PublicInbox::IO
The IO package seems like a better home for I/O subs than the Git package. We lose the 60 second read timeout for `git cat-file --batch-*' processes since it's probably not necessary given how reliable the code has proven and things would fall over hard in other ways if the storage device were completely hosed.
2023-11-03io: introduce write_file helper sub
This is pretty convenient way to create files for diff generation in both WWW and lei. The test suite should also be able to take advantage of it.
2023-11-03replace ProcessIO with untied PublicInbox::IO
This fixes two major problems with the use of tie for filehandles: * no way to do fcntl, stat, etc. calls directly on the tied handle, forcing callers to use the `tied' perlop to access the underlying IO::Handle * needing separate classes to handle blocking and non-blocking I/O As a result, Git->cleanup_if_unlinked, InputPipe->consume, and Qspawn->_yield_start have fewer bizzare bits and we can call `$io->blocking(0)' directly instead of `(tied *$io)->{fh}->blocking(0)' Having a PublicInbox::IO class will also allow us to support custom read buffering which allows inspecting the current state.
2023-11-03treewide: use ->close to call ProcessIO->CLOSE
This will open the door for us to drop `tie' usage from ProcessIO completely in favor of OO method dispatch. While OO method dispatches (e.g. `$fh->close') are slower than normal subroutine calls, it hardly matters in this case since process teardown is a fairly rare operation and we continue to use `close($fh)' for Maildir writes.
2023-10-28treewide: use run_qx where appropriate
This saves us some code, and is a small step towards getting ProcessIO working with stat, fcntl and other perlops that don't work with tied handles.
2023-10-18import: use read_all to detect short reads
Our Import package was depending on Git, anyways.
2023-10-11import: cat_blob is a no-op w/o live fast-import
cat_blob is a fallback for handling files which haven't made it onto disk to be readable by `git cat-file'. Thus spawning a new fast-import process to retrieve a blob is pointless, as cat_blob is only used as a last resort when `git cat-file' fails.
2023-10-11import: switch to Unix stream socket for fast-import
We use fewer file descriptors and fewer lines of code this way. I'm not aware of any place we rely on POSIX pipe semantics with `git fast-import', and sockets have bigger buffers by default in most cases (even if Linux allows larger pipe buffers).
2023-10-11treewide: consolidate "From " line removal
Aside from our prior import bugs (fixed in a0c07cba0e5d8b6a (mda: drop leading "From " lines again, 2016-06-26)), we'll always have to be dealing with mutt piping messages to us and `git format-patch' output. So just share the regexp so we can use it everywhere. In may be desirable to allow importing messages with a leading "From " line for FUSE, even. Additionally, some instances of this regexp needlessly added optional `\r?' (CR) checks ahead of the `\n' (LF) element; but they're pointless anyways since [^\n]* is enough to exclude all non-LF bytes.
2023-10-08import: use autodie, rely on PerlIO for retries
As documented in perlipc(1), the default :perlio layer retries the `read' perlop on EINTR. The :perlio layer also makes `read' perform read-in-full behavior; so there's no need to loop ourselves. Our responsibility is now only to detect short reads in case fast-import is killed mid-stream.
2023-04-20cindex: support sha256 coderepos alongside sha1
This special support is only needed for --prune at the moment since the indexing side works on a per-repo basis. There's no automated tests, yet, but it seems to work well on my sha256 projects when sharing a cindex with sha1 projects.
2023-02-22treewide: simplify File::Path mkpath/make_path callers
File::Path already accounts for the existence of directories, handles races from redundant mkdir(2), and croaks on unrecoverable errors. So there's no point in doing any of that on our end. Furthermore, avoiding the overhead of loading File::Path doesn't seem worth it to save 20-60ms given the overhead of loading our other code. Instead, try to reduce optree overhead on our code, instead, since File::Path gets used in a bunch of places. We'll also favor the newer make_path for multi-directory invocations to avoid bloating our own optree to create an arrayref, but mkpath is one fewer subroutine call within File::Path itself, right now.
2022-09-04import: pass --quiet to `git gc' if STDERR isn't a tty
No need to pollute non-interactive output for gc use. This suppresses notifications from my `lei up --all -q' cronjob.
2022-09-04lei/store: do not write info/refs file
That file is meant for dumb HTTP servers, so avoid wasting two inodes on something that should never be served for private email.
2022-08-20import: take advantage of some Perl 5.10.x features
We can save a few lines of code this way and reduce the verbosity of some lines we keep.
2021-10-16smsg: add ->oidbin method
This makes some of our code less noisy by reducing the amount of pack('H*', ...) use.
2021-09-12import: do not write a "description" file
The default value is worthless to us and git functions fine without the file. public-inbox-init will create a useful one in the next change.
2021-05-04lei index: new command to index mail w/o git storage
Since completely purging blobs from git is slow, users may wish to index messages in Maildirs (and eventually other local storage) without storing data in git. Much code from LeiImport and LeiInput is reused, and a new dummy FakeImport class supplies a non-storing $im->add and minimize changes to LeiStore. The tricky part of this command is to support "lei import" after a message has gone through "lei index". Relying on $smsg->{bytes} == 0 (as we do for external-only vmd storage) does not work here, since it would break searching for "z:" byte-ranges when not using externals. This eventually required PublicInbox::Import::add to use a SharedKV to keep track of imported blobs and prevent duplication.
2021-04-07import: convert init.defaultBranch to fully qualified ref
init.defaultBranch expects a branch name, not a fully qualified ref. git-init prepends "refs/heads/" automatically and unconditionally. PublicInbox::Import::default_branch, however, incorrectly passes on the init.defaultBranch value as is, leading to it being used in spots where a fully qualified ref is required. For example, with an init.defaultBranch value of "master", public-inbox-index for a v2 repository would lead to an all.git repository where HEAD's content is "ref: master" instead of "ref: refs/heads/master". Prepend "refs/heads/" to the incoming init.defaultBranch value. Fixes: 7c2f36de2fb49dd7 (import: respect init.defaultBranch)
2021-04-03lei: improve handling of Message-ID-less draft messages
We need a stable fallback time for digest2mid in the presence of messages without Received/Date headers. Furthermore, we must avoid using uninitialized smsg->{mid} when parsing References for draft replies.
2021-03-21lei import: vivify external-only messages
Keyword storage for external-only messages was preventing messages from being explicitly imported. Teach lei_store to vivify keyword-only entries into fully-indexed messages on import.
2021-02-24treewide: avoid "delete local" construct on hashes
Apparently this feature is only in Perl 5.12+, and we're still on Perl 5.10.
2021-02-10git: ->qx: respect caller's $/ in array context
This could lead to bad results when doing ls-tree -z for v2 import in case there's multiple files. In any case, the `local $/ = "\0"' in Import.pm is also eliminated to reduce potential confusion and surprises.
2021-02-01import: reap git-config(1) synchronously
This avoids a zombie if another step of the event loop takes too long.
2021-01-03use Eml (or MIME) objects for all indexing paths
We don't need to be keeping the raw message around after it hits git. Shard work now relies on Storable (or Sereal) and all of the indexing code relies on the Email::MIME-like API of Eml to access interesting parts of the message. Similarly, smsg->{raw_bytes} is no longer carried around and we do the CRLF adjustment when setting smsg->{bytes}. There's also a small simplification to t/import.t while we're in the area to use xqx instead of spawn/popen_rd.
2021-01-02import: switch to using ProcessPipe
This saves us a few lines of code, but also prevents misreaping by sibling processes.
2021-01-02import: unset GIT_CONFIG with `git config --global'
GIT_CONFIG is set by -convert, and user may have it set for other reasons. In either case, it conflicts with any any attempt to use `git config --global` so we have to unset it. This fixes t/multi-mid.t under TEST_RUN_MODE=0
2021-01-01update copyrights for 2021
Using "make update-copyrights" after setting GNULIB_PATH in my config.mak
2021-01-01spawn: move run_die here from PublicInbox::Import
It seems like a more logical place for it, but we'll favor the newly-added xsys_e() in tests for BAIL_OUT use.
2021-01-01lei_store: use per-machine refname as git HEAD
It may be helpful to identify the source of messages and perhaps avoid conflicting history. On the other hand, this may be a terrible idea for users who move portable storage (e.g. USB sticks) across computers...
2021-01-01import: respect init.defaultBranch
This matches git v2.28.0+ behavior in case users prefer a different name.
2020-12-31Merge remote-tracking branch 'origin/master' into lorelei
* origin/master: (58 commits) ds: flatten + reuse @events, epoll_wait style fixes ds: simplify EventLoop implementation check defined return value for localized slurp errors import: check for git->qx errors, clearer return values git: qx: avoid extra "local" for scalar context case search: remove {mset} option for ->mset method search: remove pointless {relevance} setting miscsearch: take reopen from Search and use it extsearch: unconditionally reopen on access extindex: allow using --all without EXTINDEX_DIR extindex: add undocumented --no-scan switch extindex: enable autoflush on STDOUT/STDERR extindex: various --watch signal handling fixes extindex: --watch for inotify-based updates eml: fix undefined vars on <Perl 5.28 t/config: test --get-urlmatch for git <2.26 default to CORE::warn in $SIG{__WARN__} handlers inbox: name variable for values loop iterator inboxidle: avoid needless syscalls on refresh inboxidle: clue users into resolving ENOSPC from inotify ...
2020-12-28import: check for git->qx errors, clearer return values
Those git commands can fail and git->qx will set $? when it fails. There's no need for the extra indirection of the @ret array, either. Improve git->qx coverage to check for $? while we're at it.
2020-12-19lei_store: local storage for Local Email Interface
Still unstable, this builds off the equally unstable extindex :P This will be used for caching/memoization of traditional mail stores (IMAP, Maildir, etc) while providing indexing via Xapian, along with compression, and checksumming from git. Most notably, this adds the ability to add/remove per-message keywords (draft, seen, flagged, answered) as described in the JMAP specification (RFC 8621 section 4.1.1). We'll use `.' (a single period) as an $eidx_key since it's an invalid {inboxdir} or {newsgroup} name.
2020-12-18import: drop X-Status in addition to Status
It's actually supported by mutt, dovecot[1], and likely some other software to augment the Status: header. While dovecot doesn't expose X-Status to clients, mutt will write 'A' (answered) and 'F' to X-Status (but not T (draft)). So we'll drop it like we do Status since it's not suitable for public mail, but stick it in an @UNWANTED_HEADERS array will allow us to configure an override if needed. [1] https://doc.dovecot.org/configuration_manual/mail_location/mbox/
2020-09-16treewide: relax allow >=40 chars for git OID
This will help with eventual git SHA-256 transitions.
2020-09-01watch: avoid unnecessary spawning on spam removals
This should further mitigate lock contention problems when -watch is configured to watch on a Maildir for spam while performing a large NNTP import. There is now a small risk a message won't get removed because if it's in the current (uncommitted) fast-import batch, but unlikely given the batch size is now only 10 messages. If a that small window is hit, flipping the \Seen flag (e.g. marking it unread, and then read again) will trigger another removal attempt via IMAP or Maildir.
2020-08-02remove unnecessary ->header_obj calls
We used ->header_obj in the past as an optimization with Email::MIME. That optimization is no longer necessary with PublicInbox::Eml. This doesn't make any functional difference even if we were to go back to Email::MIME. However, it reduces the amount of code we have and slightly reduces allocations with PublicInbox::Eml.
2020-08-01improve error handling on import fork / lock failures
v?fork failures seems to be the cause of locks not getting released in -watch. Ensure lock release doesn't get skipped in ->done for both v1 and v2 inboxes. We also need to do everything we can to ensure DB handles, pipes and processes get released even in the face of failure. While we're at it, make failures around `git update-server-info' non-fatal, since smart HTTP seems more popular anyways. v2 changes: - spawn: show failing command - ensure waitpid is synchronous for inotify events - teardown all fast-import processes on exception, not just the failing one - beef up lock_release error handling - release lock on fast-import spawn failure
2020-07-25use consistent {ibx} field for writable code paths
This is a step which makes our use of abbreviations more consistent when referring to PublicInbox::Inbox objects. We'll also be reducing the number of redundant fields in SearchIdx and V2Writable code paths to make the object graph easier-to-follow.
2020-07-17import: use common capitalization for filtering headers
In case this ends up in the same process as Mbox::msg_hdr, it can reduce memory use by sharing the cache key in PublicInbox::Eml::re_memo
2020-07-17drop binmode usage
We only support Unix-like platforms where binmode (":raw") is the default anyways, and v5.10 semantics means it won't do unicode_strings (unlike v5.12). So save some lines of code.
2020-06-30watch: check for duplicates in ->over before spamcheck
It's cheaper to check for duplicates than run `spamc' repeatedly when rechecking. We already do this for v1 with by using the "ls" command with fast-import, but v2 requires checking against over.sqlite3.
2020-06-25lock: reduce inotify wakeups
We can reduce the amount of platform-specific code by always relying on IN_MODIFY/NOTE_WRITE notifications from lock release. This reduces the number of times our read-only daemons will need to wake up when -watch sees no-op message changes (e.g. replied, seen, recent flag changes).
2020-06-13index: account for CRLF conversion when storing bytes
NNTP and IMAP both require CRLF conversions on the wire. They're also the only components which care about $smsg->{bytes}, so store the CRLF-adjusted value in over.sqlite3 and Xapian DBs.. This will allow us to optimize RFC822.SIZE fetch item in IMAP without triggering size mismatch errors in some clients' default configurations (e.g. Mail::IMAPClient), but not most others. It could also fix hypothetical problems with NNTP clients that report discrepancies between overview and article data.
2020-06-03smsg: introduce ->populate method
This will eventually replace the __hdr() calling methods and eradicate {mime} usage from Smsg. For now, we can eliminate PublicInbox::Smsg->new since most callers already rely on an open `bless' to avoid the old {mime} arg.