about summary refs log tree commit homepage
path: root/lib/PublicInbox/Msgmap.pm
DateCommit message (Collapse)
2020-02-06treewide: run update-copyrights from gnulib for 2019
I didn't wait until September to do it, this year!
2019-09-09run update-copyrights from gnulib for 2019
2019-07-13nntp: support optional [range] arg in LISTGROUP
RFC3977 6.1.2.2 LISTGROUP allows a [range] arg after [group], and supporting it allows NNTP support in neomutt to work again. Tested with NeoMutt 20170113 (1.7.2) on Debian stretch (oldstable)
2019-06-24msgmap: mid_insert: use plain "INSERT" to detect duplicates
"INSERT OR IGNORE" still bumps the auto-increment counter in SQLite, which causes gaps to appear in NNTP article numbering. This bug appeared in v2 repos where V2Writable may call ->add repeatedly on the same message. This bug is apparent with public-inbox-watch and work-in-progress IMAP watchers which may rescan and (attempt to) reinsert the same message on mailbox changes. Most uses of public-inbox-mda were not affected, unless the same message is actually delivered multiple times to the mda. v1 is not affected, either, since deduplication is only based on Message-ID and msgmap never sees the duplicate. Reported-by: "Eric W. Biederman" <ebiederm@xmission.com>
2019-05-25msgmap: remove double negative
I have never not found double negatives to be confusing...
2018-08-03Msgmap.pm: Track the largest value of num ever assigned
Today the only thing that prevents public-inbox not reusing the message numbers of deleted messages is the sqlite autoincrement magic and that only works part of the time. The new incremental indexing test has revealed areas where today public-inbox does try to reuse numbers of deleted messages. Reusing the message numbers of existing messages is a problem because if a client ever sees messages that are subsequently deleted the client will not see the new messages with their old numbers. In practice this is difficult to trigger because it requires the most recently added message to be removed and have the removal show up in a separate pull request. Still it can happen and it should be handled. Instead of infering the highset number ever used by finding the maximum number in the message map, track the largest number ever assigned directly. Update Msgmap to track this value and update the indexers to use this value. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-04-24msgmap: add limit to response for NNTP
All callers in expect to iterate through results. This was causing unfairness when fetching large ranges via XHDR as rtin does :< Fixes: b8c41362f2a5c8fc "nntp: simplify the long_response API"
2018-04-22extmsg: use Xapian only for partial matches
"LIKE" in SQLite (and other SQL implementations I've seen) is expensive with nearly 3 million messages in the archives. This caused some partial Message-ID lookups to take over 600ms on my workstation (~300ms on a faster Xeon). Cut that to below under 30ms on average on my workstation by relying exclusively on Xapian for partial Message-ID lookups as we have in the past. Unlike in the past when we tried using Xapian to match partial Message-IDs; we now optimize our indexing of Message-IDs to break apart "words" in Message-IDs for searching, yielding (hopefully) "good enough" accuracy for folks who get long URLs broken across lines when copy+pasting. We'll also drop the (in retrospect) pointless stripping of "/[tTf]" suffixes for the partial match, since anybody who hits that codepath would be hitting an invalid message ID. Finally, limit wildcard expansion to prevent easy DoS vectors on short terms. And blame Pine and alpine for generating Message-IDs with low-entropy prefixes :P
2018-04-18ensure SQLite and Xapian files respect core.sharedRepository
We can't have files with permissions inconsistent with what's in git objects.
2018-04-18Merge remote-tracking branch 'origin/master' into v2
* origin/master: nntp: allow and ignore empty commands mbox: do not barf on queries which return no results nntp: fix NEWNEWS command searchview: fix non-numeric comparison Allow specification of the number of search results to return githttpbackend: avoid infinite loop on generic PSGI servers http: fix modification of read-only value extmsg: use news.gmane.org for Message-ID lookups extmsg: rework partial MID matching to favor current inbox Update the installation instructions with Fedora package names nntp: do not drain rbuf if there is a command pending nntp: improve fairness during XOVER and similar commands searchidx: do not modify Xapian DB while iterating Don't use LIMIT in UPDATE statements
2018-04-18searchidx: regenerate and avoid article number gaps on full index
Some messages to git@vger went missing from Msgmap from old bugs and became inaccessible via NNTP. Forcing NNTP article numbers when the overview DB came about made the problem more visible when reindexing old (v1) repositories as all removed spam messages took up AUTOINCREMENT numbers again before they were removed. Having large gaps in NNTP article numbers is not good since it throws off NNTP clients. This does NOT prevent NNTP clients from seeing some messages twice, but is better than having them miss several messages entirely. We also avoid depending on --reverse in git-log, as git requires storing an entire commit list in memory for --reverse, so it's cheaper to store only deleted blobs in the %D hash since they do not live long.
2018-04-07msgmap: speed up minmax with separate queries
This significantly improves the performance of the NNTP GROUP command with 2.7 million messages from over 250ms to 700us. SQLite is weird about this, but at least there's a way to optimize it.
2018-04-06v2writable: allow tracking parallel versions
For upgrades, this will let users keep an old version running while performing "public-inbox-index" on the newest version.
2018-04-04v2: support incremental indexing + purge
This is important for people running mirrors via "git fetch", as they need to be kept up-to-date. Purging is also now supported in mirrors. The short-lived "--regenerate" option is gone and is now implicitly enabled as a result. It's still cheap when article number regeneration is unnecessary, as we track the range for each git repository.
2018-04-03nntp: simplify the long_response API
We we worked around the default range/termination conditions of long_response in many cases to reduce calls to SQLite or Xapian. So continue that trend and become more like the PSGI API which doesn't force callers to specify an article range or work inside a loop.
2018-04-03msgmap: replace id_batch with ids_after
id_batch had a an overly complicated interface, replace it with id_batch which is simpler and takes advantage of selectcol_arrayref in DBI. This allows simplification of callers and the diffstat agrees with me.
2018-04-02replace Xapian skeleton with SQLite overview DB
This ought to provide better performance and scalability which is less dependent on inbox size. Xapian does not seem optimized for some queries used by the WWW homepage, Atom feeds, XOVER and NEWNEWS NNTP commands. This can actually make Xapian optional for NNTP usage, and allow more functionality to work without Xapian installed. Indexing performance was extremely bad at first, but DBI::Profile helped me optimize away problematic queries.
2018-03-22v2writable: support reindexing Xapian
This still requires a msgmap.sqlite3 file to exist, but it allows us to tweak Xapian indexing rules and reindex the Xapian database online while -watch is running.
2018-03-22msgmap: add tmp_clone to create an anonymous copy
This will be used to keep track of Message-ID <-> NNTP Article numbers to prevent article number reuse when reindexing.
2018-03-19v2writable: implement remove correctly
We need to hide removals from anybody hitting the search engine.
2018-02-22Don't use LIMIT in UPDATE statements
...not all distributions build SQLite with that enabled. [ew: LIMIT shouldn't be necessary because `key' is primary]
2018-02-07update copyrights for 2018
Using update-copyrights from gnulib While we're at it, use the SPDX identifier for AGPL-3.0+ to ease mechanical processing.
2017-06-26msgmap: reduce constant usage
It is needless bloat and doesn't seem to help with readability, in retrospect, either.
2017-06-23msgmap: ignore duplicates instead of dying
This prevents public-inbox-watch from dying when reloading (and thus rescanning) already-imported directories.
2017-06-22msgmap: mid_insert ignores duplicates instead of die-ing
This will allow smoother imports as occasional Message-ID duplicates happen and the best we can do is ignore the second one.
2016-08-11search: support alt-ID for mapping legacy serial numbers
For some existing mailing list archives, messages are identified by serial number (such as NNTP article numbers in gmane). Those links may become inaccessible (as is the current case for gmane), so ensure users can still search based on old serial numbers. Now, I run the following periodically to get article numbers from gmane (while news.gmane.org remains): NNTPSERVER=news.gmane.org export NNTPSERVER GROUP=gmane.comp.version-control.git perl -I lib scripts/xhdr-num2mid $GROUP --msgmap=/path/to/gmane.sqlite3 (I might integrate this further with public-inbox-* scripts one day). My ~/.public-inbox/config as an added "altid" snippet which now looks like this: [publicinbox "git"] address = git@vger.kernel.org mainrepo = /path/to/git.vger.git newsgroup = inbox.comp.version-control.git ; relative pathnames expand to $mainrepo/public-inbox/$file altid = serial:gmane:file=gmane.sqlite3 And run "public-inbox-index --reindex /path/to/git.vger.git" periodically. This ought to allow searching for "gmane:12345" to work for Xapian-enabled instances. Disclaimer: while public-inbox supports NNTP and stable article serial numbers, use of those for public links is discouraged since it encourages centralization.
2016-07-31msgmap: fix use of transactions
We want transactions to be the responsibility of the caller when possible; this fixes the potential for the msgmap to internally become inconsistent when using it from inside searchidx.
2015-11-20various internal documentation updates
Hopefully this gives new hackers a better overview of how the components relate to each other.
2015-10-02Msgmap: pass ReadOnly DBI flag for non-writable opens
This doesn't seem to do anything on my older system, but maybe it will in newer or future versions of DBD::SQLite. Anyways it can be helpful for documentation purposes, too.
2015-09-30remove unnecessary fields usage
It doesn't actually give performance improvements unless we use types with "my", but we don't do that. We'll only continue using fields with Danga::Socket-derived classes where they're required.
2015-09-21msgmap: minor cleanup to move constant declaration
This doesn't actually change anything as the constant is still usable in other subroutines, but helps with consistency and readability IMHO.
2015-09-19nntp: use long response API for LISTGROUP
LISTGROUP can be expensive for giant groups, too. Use the long response API to improve fairness and prevent excessive buffering.
2015-09-18read-only NNTP server
Implementing NEWNEWS, XHDR, XOVER efficiently will require additional caching on top of msgmap. This seems to work with lynx and slrnpull, haven't tried clients. DO NOT run in production, yet, denial-of-service vulnerabilities await!
2015-09-15msgmap: add message mapping via SQLite
This will allow us to maintain stable article numbers for an NNTP server independently of Xapian.