path: root/lib
Date Commit message
2019-11-16spawn: which: allow embedded slash for relative path
This makes the subroutine behave more like the which(1) command and will make using spawn() in tests easier.
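The lookup rule described above can be sketched as follows. This is a Python illustration only (the project itself is Perl); the function name `which` and its `path` parameter are hypothetical and do not reflect the project's actual subroutine.

```python
import os

def which(name, path=None):
    """Return a path to an executable, or None (illustrative sketch only)."""
    # A name with an embedded slash is treated as a path and checked
    # directly, without consulting PATH, as which(1) does.
    if "/" in name:
        ok = os.path.isfile(name) and os.access(name, os.X_OK)
        return name if ok else None
    if path is None:
        path = os.environ.get("PATH", "")
    for d in path.split(os.pathsep):
        # An empty PATH entry conventionally means the current directory.
        candidate = os.path.join(d or ".", name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None
```

With this rule a test can pass a relative path such as `./script/foo` and get it back unchanged when it is executable, without altering PATH.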
2019-11-16xapcmd: do not fire END and DESTROY handlers in child
We need to bypass whatever Test::More does with END/DESTROY handlers for use in long-lived processes. This doesn't affect any of our normal code since we don't use END/DESTROY for Xapcmd and its callers.
2019-11-16import: only pass Inbox object to SearchIdx->new
SearchIdx->new no longer accepts a GIT_DIR path as its argument since commit 585314673236d664729fe3ab2d4fb229d1c0f2d5 ("searchidx: require PublicInbox::Inbox (or InboxWritable) ref")
2019-11-16inboxwritable: add ->cleanup method
We've been using this in -edit, and will be using it in some more scripts and tests to optimize for run_mode=2 with run_script. Keeping this in the *Writable modules since I don't see it being useful for the WWW and NNTP read-only interfaces which use PublicInbox::Inbox.
2019-11-16admin: get rid of singleton $CFG var
PublicInbox::Admin::config() just adds an extra layer of indirection which we barely rely on. So get rid of this global variable and make it easier to run tests in the future without relying on global state.
2019-11-16mboxgz: use Compress::Raw::Zlib instead of IO::Compress::Gzip
IO::Compress::Gzip is a wrapper around Compress::Raw::Zlib, anyways, and being able to easily detach buffers to return them via ->getline is nice. This results in a 1-2% performance improvement when fetching giant mboxes.
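As a rough analogue of working against the low-level zlib binding, the Python sketch below compresses chunk by chunk and hands each output buffer straight back to the caller. The class name `GzipChunker` and its methods are hypothetical; this is not the project's Perl code, and it makes no claim about the performance figure quoted above.

```python
import gzip
import zlib

class GzipChunker:
    """Compress chunks incrementally and return whatever output is ready."""

    def __init__(self):
        # wbits = MAX_WBITS | 16 asks zlib to emit a gzip header and trailer.
        self._z = zlib.compressobj(wbits=zlib.MAX_WBITS | 16)

    def feed(self, data: bytes) -> bytes:
        # May return b"" while zlib is still buffering internally.
        return self._z.compress(data)

    def finish(self) -> bytes:
        # Flush remaining data and write the gzip trailer.
        return self._z.flush()
```

Concatenating every `feed()` result followed by `finish()` yields a complete gzip stream that `gzip.decompress` can read back.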
2019-11-16mbox: split mboxgz out into a separate file
It'll make using Compress::Raw::Zlib easier, since we can use that and import constants more easily.
2019-11-16mbox: unused mid_clean import
We're gradually phasing mid_clean out (in favor of mids()).
2019-11-14inboxwritable: drop {-importer} cyclic reference
InboxWritable caching the result of ->importer leads to a circular reference when the returned (V2Writable|Import) object holds onto the calling InboxWritable object. With public-inbox-watch, this leads to a memory leak if a user is reloading via SIGHUP after a message is imported (it would only become noticeable with SIGHUPs after every message imported). I would not expect anybody to notice this in real-world usage. I only noticed this since I was making -xcpdb suitable for long-lived process use (e.g. "mod_perl style") and a flock remained unreleased on v1 inboxes after resharding. WatchMaildir (used by -watch) already handles caching of the importer object itself, and all of our other real-world uses of ->importer are short-lived or designed for batch scripts, so there's no need to cache the importer result internally.
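The shape of the cycle can be reproduced in a few lines. The sketch below is Python, not the project's Perl, and every class name in it is hypothetical. It disables CPython's cycle collector to imitate a purely reference-counted runtime, so it is specific to CPython.

```python
import gc
import weakref

class Importer:
    def __init__(self, inbox):
        self.inbox = inbox  # back-reference to the owning inbox

class CachingInbox:
    def importer(self):
        # Caching the result forms a cycle: inbox -> importer -> inbox
        if not hasattr(self, "_importer"):
            self._importer = Importer(self)
        return self._importer

class PlainInbox:
    def importer(self):
        # No cache: the caller owns the importer, so no cycle is formed
        return Importer(self)

def freed_by_refcount(cls):
    """True if the inbox is released as soon as its last name goes away."""
    gc.disable()  # imitate a runtime without a cycle collector
    try:
        ibx = cls()
        im = ibx.importer()
        probe = weakref.ref(ibx)
        del ibx, im
        return probe() is None
    finally:
        gc.enable()
        gc.collect()  # clean up the deliberately leaked cycle
```

Here `freed_by_refcount(PlainInbox)` is true while `freed_by_refcount(CachingInbox)` is false, which mirrors why any resource tied to the object's lifetime (such as a lock) would stay held.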
2019-11-14xapcmd: localize %SIG changes using "local"
Perl's "local" allows changes to %SIG (and %ENV) to be limited to its enclosing block. This allows us to get rid of a global variable and ad-hoc method for restoring signal handlers.
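Python has no direct equivalent of Perl's "local", but a context manager gives a comparable block-scoped effect. The sketch below is an analogy under that assumption; the helper name `scoped_handlers` is hypothetical and the code must run in the main thread.

```python
import signal
from contextlib import contextmanager

@contextmanager
def scoped_handlers(handlers):
    """Install signal handlers for a with-block, then restore the old ones."""
    saved = {}
    try:
        for sig, handler in handlers.items():
            # signal.signal() returns the handler that was previously set
            saved[sig] = signal.signal(sig, handler)
        yield
    finally:
        # Runs on normal exit and on exceptions, like leaving a Perl block
        for sig, old in saved.items():
            signal.signal(sig, old)
```

Usage would look like `with scoped_handlers({signal.SIGTERM: signal.SIG_IGN}): ...`; no global variable is needed to remember the previous handlers.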
2019-11-14solvergit: use --unidiff-zero with git-apply(1)
I sometimes post context-free documentation patches generated with "-U0" to reduce size and bandwidth overhead when replacing URLs or updating copyright notices. git-apply(1) needs the --unidiff-zero switch to work properly with context-free patches. Given our search looks for blob OIDs, and we're never going to be running the code we regenerate, "--unidiff-zero" ought to be safe.
2019-11-04index: "git log" failures are fatal
While I've never seen "git log" fail on its own, it could happen one day and we should be prepared to abort indexing when it happens. Beef up tests for t/spawn.t to ensure close() behaves on popen_rd the way we expect it to.
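The general pattern of treating a failed child as fatal once its pipe is closed can be sketched as below. This is a Python illustration with a hypothetical helper `read_cmd`; it does not reproduce the project's popen_rd or its tests.

```python
import subprocess
import sys

def read_cmd(argv):
    """Return the stdout of argv, raising if the child exits non-zero."""
    with subprocess.Popen(argv, stdout=subprocess.PIPE) as proc:
        out = proc.stdout.read()
    # Leaving the with-block closes the pipe and waits for the child,
    # so returncode is populated at this point.
    if proc.returncode != 0:
        raise RuntimeError(f"{argv[0]} failed with status {proc.returncode}")
    return out
```

A caller that indexes the output can then abort on the exception instead of silently indexing a truncated stream.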
2019-11-03searchidxshard: reuse $SIG{__WARN__} callback from Admin
We don't want to define $SIG{__WARN__} in the worker to call an existing non-default callback. Instead update ->{current_info} the same way the V2Writable master process does. I noticed this while reindexing with a large XAPIAN_FLUSH_THRESHOLD and seeing the wrong epoch on my terminal from a shard because the shard worker was spawned while reindexing a higher-numbered epoch.
2019-10-31hval: replace "'" with "'" for compatibility
While testing 216light.css changes, I managed to hit some cases where dillo failed to render ' correctly, but I also can't reproduce it reliably. Anyways, it's definitely a problem with some old browsers and newer versions of highlight already work around it, but Debian 10.x has 3.41, so use "'" to maximize compatibility.
2019-10-31qspawn: psgi_qx: delay callback until waitpid returns
We need to detect "git apply" failures reliably when patches fail. This is necessary for solving for blob 81c1164ae5 in https://public-inbox.org/git/ when at least two messages can solve for it (and one of them fails):
  1. https://public-inbox.org/git/b9fb52b8-8168-6bf0-9a72-1e6c44a281a5@oracle.com/
  2. https://public-inbox.org/git/56664222-6c29-09dc-ef78-7b380b113c4a@oracle.com/
2019-10-31solvergit: deal with false-positive dfpost: results
When solving for blob 81c1164ae5 in https://public-inbox.org/git/, at least two messages get indexed with the dfpost result for that blob (after fixing MsgIter to decode all text/* parts):
  1. https://public-inbox.org/git/b9fb52b8-8168-6bf0-9a72-1e6c44a281a5@oracle.com/
  2. https://public-inbox.org/git/56664222-6c29-09dc-ef78-7b380b113c4a@oracle.com/
However, only the first message contains a usable patch. So we must adjust SolverGit to account for multiple messages hitting the same "dfpost:" search result and attempt "git apply" on all results, not just the first. In the future, changes to SearchIdx.pm may rid us of invalid search results and speed up performance (at the expense of developer/indexing time); but we need to account for old search indices, first.
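The control flow implied here, trying each search hit until one applies, can be sketched in a few lines. This is a Python illustration only; `solve_with_any` and the `try_apply` callback are hypothetical names standing in for whatever runs "git apply" on a candidate.

```python
def solve_with_any(candidates, try_apply):
    """Return the first candidate that try_apply accepts, or None."""
    for msg in candidates:
        # A failed attempt is not fatal: move on to the next search hit.
        if try_apply(msg):
            return msg
    return None
```

The important property is that a false-positive hit only costs one failed attempt rather than ending the search.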
2019-10-31msgiter: attempt to decode all text/* bodies
We want to index text/x-patch and text/x-diff, at least, since "git format-patch" can generate a patch series as attachments using --attach.
2019-10-31msgiter: do not assume UTF-8 if Email::MIME->body_str succeeds
ISO-2022-JP and other non-UTF-8 messages need to be displayed correctly.
Fixes: 7d82a8bc04ce ('handle "multipart/mixed" messages which are not multipart')
2019-10-30search: add note about SCHEMA_VERSION 15
--reindex has gotten better over the years, and having parallel Xapian DB directories would exceed all available disk space for some users with giant inboxes.
2019-10-30wwwlisting: fix spelling and clarify sub location
Spell "Schwartzian" correctly, and clarify the location of "modified" since we have multiple subs named "modified"
2019-10-30Merge branch 'learn'
* learn:
  doc: add public-inbox-learn(1) manpage
  mda: support multiple List-ID matches
  mda: prepare for multiple destinations
  inboxwritable: add assert_usable_dir sub
  mda: skip MIME parsing if spam
  mda: hoist out mda_filter_adjust
  filter/base: remove MAX_MID_SIZE constant
  mda: hoist out List-ID handling and reuse in -learn
  learn: hoist out remove_or_add subroutine
  learn: GIT_COMMITTER_<NAME|EMAIL> may be "" or "0"
  learn: update usage statement
  learn: only map recipient list on "ham" or "rm"
  learn: support multiple To/Cc headers
2019-10-30mda: support multiple List-ID matches
While it's not RFC2919-conformant, mail software can theoretically set multiple List-ID headers. Deliver to all inboxes which match a given List-ID since that's likely the intent.
Cc: Eric W. Biederman <ebiederm@xmission.com>
Link: https://public-inbox.org/meta/87pniltscf.fsf@x220.int.ebiederm.org/
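Matching every List-ID header against a set of configured inboxes might look like the sketch below. It is written in Python for illustration (the project is Perl); `inboxes_for` and the `listid_map` dictionary are hypothetical, and the bracket handling is a simplification of RFC 2919 parsing.

```python
from email import message_from_string

def inboxes_for(raw, listid_map):
    """Return every inbox whose configured List-ID appears in the message."""
    msg = message_from_string(raw)
    found = []
    # get_all() is case-insensitive and returns every occurrence
    for value in msg.get_all("List-Id", []):
        # The identifier sits in angle brackets, optionally after a phrase
        start, end = value.rfind("<"), value.rfind(">")
        listid = value[start + 1:end] if 0 <= start < end else value
        inbox = listid_map.get(listid.strip().lower())
        if inbox is not None and inbox not in found:
            found.append(inbox)
    return found
```

A message carrying two List-ID headers is thereby delivered to both matching inboxes instead of only the first.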
2019-10-30inboxwritable: add assert_usable_dir sub
And use it for mda, since "0" could be a usable directory if somebody insists on using relative paths...
2019-10-30filter/base: remove MAX_MID_SIZE constant
We don't need it in the filter, here, since we have one in the MDA package.
2019-10-30mda: hoist out List-ID handling and reuse in -learn
It's now possible to inject false-positive ham into an inbox the same way -mda does via List-ID.
2019-10-28view: show X-Alt-Message-ID in permalink view, too
Since we index X-Alt-Message-ID (because we need to placate some NNTP clients), we now display it as well, since that Message-ID could be the X-Alt-Message-ID that the reader is actually interested in.
2019-10-28index: allow search/lookups on X-Alt-Message-ID
Since we replace extra Message-ID headers with X-Alt-Message-ID to placate NNTP clients, we should allow searching and indexing on X-Alt-Message-ID just like we do with Message-ID.
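Collecting identifiers from both headers for indexing could be sketched as follows. This is a Python illustration with a hypothetical `all_mids` helper; it is not the project's mids() subroutine and it strips angle brackets naively.

```python
from email import message_from_string

def all_mids(raw):
    """Return unique IDs from Message-ID and X-Alt-Message-ID, in order."""
    msg = message_from_string(raw)
    mids = []
    for name in ("Message-ID", "X-Alt-Message-ID"):
        for value in msg.get_all(name, []):
            # Drop surrounding whitespace and angle brackets
            mid = value.strip().lstrip("<").rstrip(">")
            if mid and mid not in mids:
                mids.append(mid)
    return mids
```

Every ID returned would then be searchable, so a reader who only knows the alternate ID can still find the message.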
2019-10-28linkify: support adding "(raw)" link for Message-IDs
And use it for the per-message permalink display.
2019-10-28view: improve warning for multiple Message-IDs
"refer" is not the correct term, here; since that would mean multiple messages have the current message in the "References:" header, and that's a normal occurrence. Instead, we need to warn the reader that the given message itself has multiple Message-IDs.
2019-10-28view: move '<' and '>' outside <a>
Browsers may underline '<' and '>' in links, which may be confused with '≤' and '≥'. So have the Message-ID header display follow what we do with In-Reply-To headers and move the "&lt;" and "&gt;" outside of <a> in the HTML.
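The resulting markup can be illustrated with a tiny helper. This is a Python sketch; `mid_link` and its parameters are hypothetical and the exact attributes the project emits are not shown in this log.

```python
from html import escape

def mid_link(mid, href):
    """Render a Message-ID with the angle brackets outside the anchor."""
    # Only the ID itself is inside <a>, so any link underline cannot
    # run beneath the "&lt;" and "&gt;" characters.
    return '&lt;<a href="%s">%s</a>&gt;' % (escape(href, quote=True), escape(mid))
```

For example, `mid_link("a@b", "a@b/")` produces `&lt;<a href="a@b/">a@b</a>&gt;`.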
2019-10-28view: display redundant headers in permalink
Mail headers can contain multiple headers of any type, so ensure we don't hide any information we're getting in the per-message permalink views. This means it's possible to have multiple From, Date, To, Cc, Subject, and In-Reply-To headers displayed. The thread indices are a special case, I guess, since we run out of space on the line if the headers are too long and tools like mutt only show the first one.
2019-10-28search: support multiple From/To/Cc/Subject headers
We can easily support searching on messages with multiple From/To/Cc/Subject headers just like we do with multiple Message-ID headers. This matches the normal mutt pager display behavior.
2019-10-23Merge branch 'regen'
* regen:
  v2writable: use msgmap as multi_mid queue
  v2writable: move git->cleanup to the correct place
  v2writable: reindex handles 3-headered monsters
  v2writable: improve "num_for" API and disambiguate
  v2writable: set unindexed article number
2019-10-22syscall: get rid of sendfile wrappers for now
I'm not sure they'll make a measurable difference or will be worth the effort in the future given the prevalence of HTTPS and giant socket buffers. Using Inline::C for this may make more sense in the future, too, especially if we want to be able to use GnuTLS.
2019-10-22hval: remove new_oneline
commit 476fc666c223f0fb ('reduce "PublicInbox::Hval->new_oneline" use') was mis-titled, since it completely eliminated ->new_oneline use.
2019-10-22git: remove src_blob_url
This was intended for solver, but it's unused since commit 915cd090798069a4 ("solver: switch patch application to use a callback")
2019-10-22watchmaildir: remove redundant _path_to_mime
InboxWritable::maildir_path_load exists and we may support it for use with standalone scripts.
2019-10-22inboxwritable: import_maildir uses maildir_path_load
I'm not sure if this will get used anywhere, but at least call a function which exists in dead code.
2019-10-22www: remove unused ctx_get sub
This hasn't been used since commit 48b21cb662c1e17b7 in 2016: ("declare Inbox object for reusability")
2019-10-22overidx: remove unused delete_articles sub
This hasn't been used since commit 1b7e935ab1690e28 ("searchidx: fix incremental index with indexlevel=basic on v1")
2019-10-22v2writable: use msgmap as multi_mid queue
Instead of storing Message-IDs in the Msgmap object, we can store the blob OID. For initial indexing of mirrors, this lets us preserve $sync->{regen} by storing the intended article number in the queue. On --reindex, the article number we store in Msgmap is not used as an article number, only for ordering purposes. This also allows us to avoid ENOMEM errors if somebody abuses our system by reusing Message-IDs; but we now risk ENOSPC instead (but systems tend to have more FS storage than RAM).
2019-10-22v2writable: move git->cleanup to the correct place
We need to stop the git process to avoid leaking FDs to Xapian if we recurse ->index_sync on reindex.
2019-10-21v2writable: reindex handles 3-headered monsters
And maybe 8-headered ones, too... I noticed --reindex failing on the linux-renesas-soc mirror due to one 3-headed monster of a message having 3 sets of headers; while another normal message had a Message-ID that matched one of the 3 IDs of the 3-headed monster. We still try to do the majority of indexing backwards, but we defer indexing multi-Message-ID'd messages until the end to ensure we get all the "good" messages in before we process the multi-headered ones.
Link: https://public-inbox.org/meta/20191016211415.GA6084@dcvr/
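The deferral idea can be sketched as a simple two-pass ordering. This Python illustration uses a hypothetical `index_order` helper over in-memory tuples; the commits in this series describe the real queue as living in Msgmap.

```python
def index_order(messages):
    """Order blobs so single-ID messages come before multi-ID ones.

    messages is an iterable of (oid, message_ids) pairs.
    """
    normal, deferred = [], []
    for oid, mids in messages:
        # Messages with several Message-IDs wait until the end, so an
        # ordinary message sharing one of those IDs is indexed first.
        (deferred if len(mids) > 1 else normal).append(oid)
    return normal + deferred
```

Relative order within each group is preserved, so the bulk of the indexing still proceeds in its usual direction.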
2019-10-21v2writable: improve "num_for" API and disambiguate
Make it obvious that we're not the Msgmap sub and return an array because it's less awkward than providing a modifiable ref to a function to write to.
2019-10-21v2writable: set unindexed article number
We'll actually use the keys of this hash in future commits.
2019-10-17Merge remote-tracking branch 'origin/inboxdir'
* origin/inboxdir:
  config: remove redundant inboxdir check
  config: support "inboxdir" in addition to "mainrepo"
  examples/grok-pull.post_update_hook: use "inbox_dir"
2019-10-17doc: avoid [<directory>] arg for git-clone(1)
While it is possible to host source code from the root of a URL using git-http-backend(1), the lack of pathname in the URL can also be confusing to users. So just add the path name of the project into the URL itself so users can invoke "git clone" with one command-line argument instead of two. Of course, previously documented URLs continue to work as normal.
2019-10-16config: remove redundant inboxdir check
This was causing compatibility problems for old configs when using public-inbox-nntpd.
2019-10-16config: support "inboxdir" in addition to "mainrepo"
"mainrepo" was a bad name and an artifact from the early days when I intended for there to be a "spamrepo" (now just the ENV{PI_EMERGENCY} Maildir). With v2, "mainrepo" can be especially confusing, since v2 needs at least two git repositories (epoch + all.git) to function and we shouldn't confuse users by having them point to a git repository for v2. Much of our documentation already references "INBOX_DIR" for command-line arguments, so use "inboxdir" as the git-config(1)-friendly variant for that. "mainrepo" remains supported indefinitely for compatibility. Users may need to revert to old versions, or may be referring to old documentation and must not be forced to change config files to account for this change. So if you're using "mainrepo" today, I do NOT recommend changing it right away because other bugs can lurk.
Link: https://public-inbox.org/meta/874l0ice8v.fsf@alyssa.is/
2019-10-16Merge branch 'listid'
* listid:
  wwwtext: show listid config directive(s)
  mda, watch: wire up List-ID header support
  config: allow "0" as a valid mainrepo path
  config: avoid unnecessary '||' use
  config: simplify lookup* methods
  config: we always have {-section_order}
  Config.pm: Add support for mailing list information