about summary refs log tree commit homepage
path: root/t
DateCommit message (Collapse)
2018-07-15v2writable: unindex deleted messages after incremental fetch
The normal behavior is to prevent the deleted messages from being indexed in the first place. However, when fetching incrementally via git; public-inbox-index needs to account for deleted files which were created outside of the most recent fetch/reindexing window. Reported-by: Eric W. Biederman <ebiederm@xmission.com>
2018-07-07Import: Don't copy nulls from emails into git
Recently I ran git --git-dir=lkml/git/1.git fsck and it reported: > warning in commit 299dbd50b6995c6debe2275f0df984ce697fb4cc: nulInCommit: NULL byte inthe commit object body Which I found quite scary. Nulls in the wrong place have a bad tendency to make programs misbehave. It turns out someone had placed "=?iso-8859-1?q?=00?=" at the end of their subject line. Which is the mime encoding for NULL. Email::Mime had correctly decoded the header, and then public-inbox had simply copied the contents of the header into the subject line of the git commit. To prevent that from causing problems replace nulls in such subject lines with spaces. Signed-off-by: Eric Biederman <ebiederm@xmission.com>
2018-07-06MsgTime.pm: Use strptime to compute the time zone
Recently I had trouble cloning lkml/git/0.git because git fsck on receive was failing. The output of git fsck was: > Checking object directories: 100% (256/256), done. > warning in commit 59173dc1fe67b113ace4ce83e7f522414b3e0404: badTimezone: invalid author/committer line - bad time zone > warning in commit ff22aaff22eb4479e49e93f697e385f76db51c55: badTimezone: invalid author/committer line - bad time zone > warning in commit 609b744909693f5f00aff5ed9928beeeee9ded2e: badTimezone: invalid author/committer line - bad time zone > warning in commit 084572141db8e0d879428afb278bd338f2dbb053: badTimezone: invalid author/committer line - bad time zone > warning in commit 789d204de27cd12c6da693d903390a241a1a4bca: badTimezone: invalid author/committer line - bad time zone > warning in commit 0d9a65948b0c957007ca387cd56b690f9bab9c08: badTimezone: invalid author/committer line - bad time zone > warning in commit f7468c42b4196ee6323afb373ab9323971c38d69: badTimezone: invalid author/committer line - bad time zone > warning in commit 85e0cd6dd527cd55ad0440f14384529b83818228: badTimezone: invalid author/committer line - bad time zone > warning in commit f31e19a2e772c9ed00728ef142af9c550ea5de6a: badTimezone: invalid author/committer line - bad time zone > warning in commit 56eb7384443ef84e17e29504a304a071b189ae67: badTimezone: invalid author/committer line - bad time zone > warning in commit e4470030471e6810414b9de5e3b52e16f2245d12: badTimezone: invalid author/committer line - bad time zone > warning in commit f913b48caa097c3b2cb3f491707944f88d52d89f: badTimezone: invalid author/committer line - bad time zone > warning in commit 4390f26923d572c6dab6cce8282c7cad5520d785: badTimezone: invalid author/committer line - bad time zone > warning in commit 0f66db71a06bd7d651a0cd80877d8043b70fda20: badTimezone: invalid author/committer line - bad time zone > warning in commit d71472c40b36dcdf0396afc9778f6137eea45887: badTimezone: invalid author/committer line - bad time zone > warning in commit e8d3b19a91a2d86b6a91bd19dc811e851398b519: badTimezone: invalid author/committer line - bad time zone > warning in commit afd9fc0cc87e56ed7736d633e17d0ef77817b3cc: badTimezone: invalid author/committer line - bad time zone > warning in commit 811b3217708358cf1b75fba4602a64a426fce0f5: badTimezone: invalid author/committer line - bad time zone > warning in commit e7a751a597c6f5e4770c61bdee6220d55a37cba9: badTimezone: invalid author/committer line - bad time zone > warning in commit 3e32ad6192fe093e03e6b9346c3a90b16d9905c0: badTimezone: invalid author/committer line - bad time zone > warning in commit 5e66b47528e79d3bbb769e137f036a1fa99cccf9: badTimezone: invalid author/committer line - bad time zone > warning in commit d90d67d94ca47142670dff13fcb81ab7afab07bb: badTimezone: invalid author/committer line - bad time zone > Checking objects: 100% (1711464/1711464), done. > Checking connectivity: 1711464, done. Upon examination with git show --pretty=raw all of the problem commits had a time zone that was not 4 digits long. This time zone had been passed straight from the Date line in the email into the author line of the commit. Looking into that I discovered that str2time takes into account the time zone, and was actually able to process these weird time zones. So get the normalized time zone with strptime and convert it from seconds from gmt to hours and minutes from gmt. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-07-04v2: fill alternates with old epochs on init from mirrors
For v2 repositories with multiple epochs, we must not forget about earlier epochs in clones. Ensure we update the alternates file with all known epochs up to the current one. Reported-by: Eric W. Biederman <ebiederm@xmission.com> https://public-inbox.org/meta/871scj2vzi.fsf@xmission.com/
2018-06-26additional tests for bad Message-IDs in URLs
Followup-to: 73cfed86d8a8287a ("www: use undecoded paths for Message-ID extraction") Reported-by: Leah Neukirchen <leah@vuxu.org> https://public-inbox.org/meta/8736xsb5s5.fsf@vuxu.org/
2018-06-26www: use undecoded paths for Message-ID extraction
In PSGI, PATH_INFO contains URI-decoded paths which cause problems when Message-IDs contain ambiguous characters for used for routing. Instead, extract the undecoded path from REQUEST_URI and use that. Reported-by: Leah Neukirchen <leah@vuxu.org> https://public-inbox.org/meta/8736xsb5s5.fsf@vuxu.org/
2018-05-30respect umask if core.sharedRepository is not set
This is consistent with git itself and the previous behavior was a result of misunderstanding of how git interprets this. And adjust tests slightly to match the new behavior. Reported-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org> <38873789-ab42-65a1-20c9-12c30b171f4f@linuxfoundation.org>
2018-05-24workaround Xapian OFD locks w/o close-on-exec
Xapian v1.2.21..v1.2.24 (inclusive) use OFD locks but failed to set the close-on-exec flag on those locks. So we must continue to work around those old versions by ensuring Xapian file descriptors aren't held any longer than necessary when in long-running git processes. Reported-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org>
2018-05-11t/search: quiet warning from Encode.pm
This was probably a typo on my part, and quiets a warning: Argument contains empty address at .../Email/MIME/Encode.pm line 70 Tested with Email::MIME 1.946
2018-05-11t/v2mda: setup emergency Maildir for test
We can't expect ~/.public-inbox to exist for tests nor should we be writing to non-temporary directories during tests.
2018-05-11convert+compact: fix when running without ~/.public-inbox/config
Some users may not have any public-inboxes configured, especially in tests.
2018-05-11content_id: workaround quote handling change in Email::* modules
I'm not entirely sure where the behavior change lies, but it seems to be in some of the latest CPAN versions of these modules. In any case, this only affects the test setup and not actual behavior. cf. https://public-inbox.org/meta/2a2bf0e1-fd1f-f8bf-95bc-dac47906ef43@linuxfoundation.org/
2018-04-25thread: sort incoming messages by Date
Improve the display by finding any parent when we see out-of-order References. This prevents us from having two roots in the test case like Mail::Thread does.
2018-04-25thread: prevent hidden threads in /$INBOX/ landing page
In retrospect, the loop prevention done by our indexer is not always sufficient since it can have an improperly sorted or incomplete References headers. This bug was triggered multiple bracketed Message-IDs in an In-Reply-To: header (not References) where the Message-IDs were in non-chronological order when somebody tried to reply to different leafs of a thread with a single message. So we must check for descendents before blindly trying to use the last one. Fixes: c6a8fdf71e2c336f ("thread: last Reference always wins")
2018-04-23view: wrap To: and Cc: headers in HTML display
It is common to have large amounts of addresses Cc:-ed in large mailing lists like LKML. Make them more readable by wrapping after addresses. Unfortunately, line breaks inserted by the MUA get lost when using the public Email::MIME API. Subject and body lines remain unwrapped, as it's the author's fault to have such long lines :P
2018-04-22extmsg: use Xapian only for partial matches
"LIKE" in SQLite (and other SQL implementations I've seen) is expensive with nearly 3 million messages in the archives. This caused some partial Message-ID lookups to take over 600ms on my workstation (~300ms on a faster Xeon). Cut that to below under 30ms on average on my workstation by relying exclusively on Xapian for partial Message-ID lookups as we have in the past. Unlike in the past when we tried using Xapian to match partial Message-IDs; we now optimize our indexing of Message-IDs to break apart "words" in Message-IDs for searching, yielding (hopefully) "good enough" accuracy for folks who get long URLs broken across lines when copy+pasting. We'll also drop the (in retrospect) pointless stripping of "/[tTf]" suffixes for the partial match, since anybody who hits that codepath would be hitting an invalid message ID. Finally, limit wildcard expansion to prevent easy DoS vectors on short terms. And blame Pine and alpine for generating Message-IDs with low-entropy prefixes :P
2018-04-20disallow "\t" and "\n" in OVER headers
For Subject/To/Cc/From headers, we squeeze them to a space (' '). For Message-IDs (including References/In-Reply-To), '\t', '\n', '\r' are deleted since some MUAs might screw them up: https://public-inbox.org/git/656C30A1EFC89F6B2082D9B6@localhost/raw
2018-04-19fix tests to run without Xapian installed
We'll be ensuring we can run more of the HTTP and all of the NNTP interface with only SQLite (and not Xapian) installed in the future.
2018-04-18ensure SQLite and Xapian files respect core.sharedRepository
We can't have files with permissions inconsistent with what's in git objects.
2018-04-18Merge remote-tracking branch 'origin/master' into v2
* origin/master: nntp: allow and ignore empty commands mbox: do not barf on queries which return no results nntp: fix NEWNEWS command searchview: fix non-numeric comparison Allow specification of the number of search results to return githttpbackend: avoid infinite loop on generic PSGI servers http: fix modification of read-only value extmsg: use news.gmane.org for Message-ID lookups extmsg: rework partial MID matching to favor current inbox Update the installation instructions with Fedora package names nntp: do not drain rbuf if there is a command pending nntp: improve fairness during XOVER and similar commands searchidx: do not modify Xapian DB while iterating Don't use LIMIT in UPDATE statements
2018-04-18nntp: allow and ignore empty commands
Somebody hitting "\n" into telnet shouldn't hold a client up indefinitely and prevent shutdown.
2018-04-18searchidx: regenerate and avoid article number gaps on full index
Some messages to git@vger went missing from Msgmap from old bugs and became inaccessible via NNTP. Forcing NNTP article numbers when the overview DB came about made the problem more visible when reindexing old (v1) repositories as all removed spam messages took up AUTOINCREMENT numbers again before they were removed. Having large gaps in NNTP article numbers is not good since it throws off NNTP clients. This does NOT prevent NNTP clients from seeing some messages twice, but is better than having them miss several messages entirely. We also avoid depending on --reverse in git-log, as git requires storing an entire commit list in memory for --reverse, so it's cheaper to store only deleted blobs in the %D hash since they do not live long.
2018-04-18v2: improve deduplication checks
First off, decode text portions of messages since some archived mail I got was converted from quoted-printable or base-64 to 8bit by the original recipient. Attempting to merge them with my own archives (which had no conversion done) led to unnecessary duplicates showing up. Then, normalize CRLF line endings in text portions to LF. In the headers, we relax the content_id hashing to ignore quotes and lower-case domain names in To, Cc, and From headers since some mail processors will alter them. Finally, I've discovered Email::MIME->new($mime->as_string) does not always round-trip reliably, so we calculate the content_id twice on user-supplied messages.
2018-04-18v2: generate better Message-IDs for duplicates
While hunting duplicates, I noticed a leading '-' in some Message-IDs as a result of RFC4648 encoding. While '-' seems allowed by RFC5322 and URL-friendly (RFC4648), they are uncommon and make using Message-IDs as arguments for command-line tools more difficult. So prefix them with a datestamp to at least give readers some sense of the age. And shorten the "localhost" hostname to "z" to save space.
2018-04-18search: preserve References in Xapian smsg for x=t view
I'm not sure how useful this view is, but it exists for now.
2018-04-18compact: do not merge v2 repos by default
--no-renumber does not allow merging, and merging is not ideal for reindexing, either.
2018-04-18v1: remove articles from overview DB
Otherwise articles show up again...
2018-04-07msgmap: speed up minmax with separate queries
This significantly improves the performance of the NNTP GROUP command with 2.7 million messages from over 250ms to 700us. SQLite is weird about this, but at least there's a way to optimize it.
2018-04-07store less data in the Xapian document
Since we only query the SQLite over DB for OVER/XOVER; do not need to waste space storing fields To/Cc/:bytes/:lines or the XNUM term. We only use From/Subject/References/Message-ID/:blob in various places of the PSGI code. For reindexing, we will take advantage of docid stability in "xapian-compact --no-renumber" to ensure duplicates do not show up in search results. Since the PSGI interface is the only consumer of Xapian at the moment, it has no need to search based on NNTP article number.
2018-04-07v2writable: reduce barriers
Since we handle the overview info synchronously, we only need barriers in tests, now. We will use asynchronous checkpoints to sync less-important Xapian data. For data deduplication, this requires us to hoist out the cat-blob support in ::Import for reading uncommitted data in git.
2018-04-06ensure Xapian and SQLite are still optional for v1 tests
Xapian is size-intensive and SQLite is not strictly necessary for v1.
2018-04-06www: favor reading more from SQLite, and less from Xapian
Favor simpler internal APIs this time around, this cuts a fair amount of code out and takes another step towards removing Xapian as a dependency for v2 repos.
2018-04-06nntp: set Xref across multiple inboxes
Noted by Jonathan Corbet in https://lwn.net/Articles/748184/
2018-04-06search: index and allow searching by date-time
Dscho found this useful for finding matching git commits based on AuthorDate in git. Add it to the overview DB format, too; so in the future we can support v2 repos without Xapian. https://public-inbox.org/git/nycvar.QRO.7.76.6.1804041821420.55@ZVAVAG-6OXH6DA.rhebcr.pbec.zvpebfbsg.pbz https://public-inbox.org/git/alpine.DEB.2.20.1702041206130.3496@virtualbox/
2018-04-05v2writable: remove redundant remove from Over DB
The Xapian partitions will trigger the removal anyways. Test this and fix some description/spelling errors while we're at it.
2018-04-05support altid mechanism for v2
There's enough gmane links out there in wild that it makes sense to maintain support for these mappings.
2018-04-04v2: support incremental indexing + purge
This is important for people running mirrors via "git fetch", as they need to be kept up-to-date. Purging is also now supported in mirrors. The short-lived "--regenerate" option is gone and is now implicitly enabled as a result. It's still cheap when article number regeneration is unnecessary, as we track the range for each git repository.
2018-04-04import: rewrite less history during purge
We do not need to rewrite old commits unaffected by the object_id purge, only newer commits. This was a state management bug :x We will also return the new commit ID of rewritten history to aid in incremental indexing of mirrors for the next change.
2018-04-04searchidx: ensure duplicated Message-IDs can be linked together
This allows us to emulate the display of thread-aware MUAs when multiple messages share the same Message-ID. This also is a place where "public-inbox-index --reindex" is useful to fix existing messages and no schema version bump is necessary.
2018-04-03nntp: simplify the long_response API
We we worked around the default range/termination conditions of long_response in many cases to reduce calls to SQLite or Xapian. So continue that trend and become more like the PSGI API which doesn't force callers to specify an article range or work inside a loop.
2018-04-03msgmap: replace id_batch with ids_after
id_batch had a an overly complicated interface, replace it with id_batch which is simpler and takes advantage of selectcol_arrayref in DBI. This allows simplification of callers and the diffstat agrees with me.
2018-04-03mbox: remove remaining OFFSET usage in SQLite
We can use id_batch in the common case to speed up full mbox retrievals. Gigantic msets are still a problem, but will be fixed in future commits.
2018-04-03nntp: make XOVER, XHDR, OVER, HDR and NEWNEWS faster
While SQLite is faster than Xapian for some queries we use, it sucks at handling OFFSET. Fortunately, we do not need offsets when retrieving sorted results and can bake it into the query. For inbox.comp.version-control.git (v1 Xapian), XOVER and XHDR are over 20x faster.
2018-04-03rename+rewrite test using Benchmark module
There'll be more performance-related tests in the future.
2018-04-03t/thread-all.t: modernize test to support modern inboxes
We'll be adding more tests in the same vein as this to improve NNTP performance.
2018-04-03mbox: do not barf on queries which return no results
Having zero search results means we never get a chance to populate the Content-Disposition header for mbox downloads.
2018-04-03nntp: fix NEWNEWS command
I guess nobody uses this command (slrnpull does not), and the breakage was not noticed until I started writing new tests for multi-MID handling. Fixes: 3fc411c772a21d8f ("search: drop pointless range processors for Unix timestamp")
2018-04-02www: rework query responses to avoid COUNT in SQLite
In many cases, we do not care about the total number of messages. It's a rather expensive operation in SQLite (Xapian only provides an estimate). For LKML, this brings top-level /$INBOX/ loading time from ~375ms to around 60ms on my system. Days ago, this operation was taking 800-900ms(!) for me before introducing the SQLite overview DB.
2018-04-02t/over: test empty Subject: line matching
We need to ensure we don't match NULL 'sid' columns in the `over' table.
2018-04-02replace Xapian skeleton with SQLite overview DB
This ought to provide better performance and scalability which is less dependent on inbox size. Xapian does not seem optimized for some queries used by the WWW homepage, Atom feeds, XOVER and NEWNEWS NNTP commands. This can actually make Xapian optional for NNTP usage, and allow more functionality to work without Xapian installed. Indexing performance was extremely bad at first, but DBI::Profile helped me optimize away problematic queries.