Date | Commit message (Collapse) |
|
This simplifies the code a bit and reduces the translation
overhead for looking directly at data from tools shipped
with Xapian.
While we're at it, fix thread-all.t :)
|
|
We cannot distinguish between legitimate ghosts and mis-threaded
messages before commit 83425ef12e4b65cdcecd11ddcb38175d4a91d5a0
("searchidx: deal with empty In-Reply-To and References headers")
so we must rebuild the index in parallel to fix it.
|
|
This should fix problems with multipart messages where
text/plain parts lack a header.
cf. git clone --mirror https://github.com/rjbs/Email-MIME.git
refs/pull/28/head
In the future, we may still introduce as streaming
interface to reduce memory usage on large emails.
|
|
Apparently it never actually got used, and the world seems
fine without it, so we can drop it.
While we're at it, consider removing our subject_path
usage from existence, too. We are not using fancy subject-line
based URLs, here.
|
|
This is faster, smaller, and more straighforward to me with
fewer layers of indirection.
|
|
We call lookup_mail all over the place, be sure we can handle
database modifications in those cases.
|
|
Instead, only preload the ->mid field for threading,
as we only need ->thread and ->path once in Search->get_thread
(but we will need the ->mid field repeatedly).
This more than doubles View->load_results performance on
according to thread-all on an inbox with over 300K messages.
|
|
In addition to needing to retry enquire queries, we also need
to protect document loading from the Xapian DB and retry on
modification, as it seems to throw the same errors.
Checking the $@ ref for Search::Xapian::DatabaseModifiedError
is actually in the test suite for both the XS and SWIG Xapian
bindings, so we should be good as far as forward/backwards
compatibility.
|
|
This makes life easier for the threading algorithm, as we can
use the implied ordering of timestamps to avoid temporary ghosts
and resulting container vivication.
This would've also allowed us to hide the bug (in most cases)
fixed by the patch titled "thread: last Reference always wins",
in case that needs to be reverted due to infinite looping.
|
|
Support (and document) 'a:' after all, as "mairix -h" uses it,
so this should reduce the learning curve for mairix users.
|
|
And while we're at it, ensure searching inside displayable
attachment bodies works.
|
|
Specifying the "d:" field only worked for
NumberValueRangeProcessor in older versions of Xapian, such
as the one in Debian wheezy (libsearch-xapian-perl=1.2.10.0-1)
This slipped through since I rarely use wheezy, anymore, and
perhaps nobody else does, either. Perhaps wheezy support may be
dropped, soon.
Unfortunately, this requires a schema version bump.
|
|
As of Xapian 1.0.4 (from 2007) is possible to use
Search::Xapian::QueryParser::add_prefix multiple times with the
same user field name but different term prefixes.
This brings my current git@vger mirror from 6.5GB to 2.1GB
(both sizes are after xapian-compact).
|
|
"bs:" and "b:" are adapted from mairix(1)
We will also support searching explicitly for quoted vs
non-quoted text via "q:" and "nq:" prefixes since sometimes
readers will not care for quoted text.
In the future, we will support parsing diffs (perhaps when
repobrowse integration is complete).
Note: this roughly doubles the size of the Xapian database due
to the additional information; so this change may not be worth
it.
|
|
We only document the "s:" anyways. While the long name is more
descriptive, the ambiguity makes agnostic caching (by Varnish or
similar) slightly harder and longer URLs are more likely to be
accidentally truncated when shared.
|
|
Sometimes it can be useful to search based on who the
message was sent to, sent by, or Cc:-ed. Of course,
headers can be faked, but they usually are not...
Anyways this mostly matches the behavior of mairix(1).
|
|
Begin documenting some basic help functionality.
I may tweak the anchor names of the various HTML endpoints
to be more consistent with each other (old ones will be
supported for a short while), so I'm not documenting
those, for now.
This may become part of a builtin key-value store for
basic texts, but this probably shouldn't become a wiki
engine, either.
|
|
This is similar to mairix in that it uses a "d:" prefix; but
only takes YYYYMMDD, for now. Using custom date/time parsers
via Perl will be much more work:
nntp://news.gmane.org/20151005222157.GE5880@survex.com
Anyhow, this ought to be more human-friendly than searching by
Unix timestamps, but it requires reindexing to take advantage of.
|
|
The Unix timestamp isn't meaningful for users searching,
we will start indexing the YYYYMMDD date stamp which may
use StringValueRangeProcessor, instead.
|
|
We can't blindly assume a ghost even exists in the DB, as the
rules can change internally for some corner-case Message-IDs.
|
|
For some existing mailing list archives, messages are identified
by serial number (such as NNTP article numbers in gmane). Those
links may become inaccessible (as is the current case for
gmane), so ensure users can still search based on old serial
numbers.
Now, I run the following periodically to get article numbers
from gmane (while news.gmane.org remains):
NNTPSERVER=news.gmane.org
export NNTPSERVER
GROUP=gmane.comp.version-control.git
perl -I lib scripts/xhdr-num2mid $GROUP --msgmap=/path/to/gmane.sqlite3
(I might integrate this further with public-inbox-* scripts one day).
My ~/.public-inbox/config as an added "altid" snippet which now
looks like this:
[publicinbox "git"]
address = git@vger.kernel.org
mainrepo = /path/to/git.vger.git
newsgroup = inbox.comp.version-control.git
; relative pathnames expand to $mainrepo/public-inbox/$file
altid = serial:gmane:file=gmane.sqlite3
And run "public-inbox-index --reindex /path/to/git.vger.git"
periodically.
This ought to allow searching for "gmane:12345" to work for
Xapian-enabled instances.
Disclaimer: while public-inbox supports NNTP and stable article
serial numbers, use of those for public links is discouraged
since it encourages centralization.
|
|
Some mailing lists allow empty Subject headers and we shall support
searching and indexing them.
|
|
We failed to discard old thread IDs when vivifying ghosts
due to out-of-order message arrival. This rectifies the
failure and will trigger a re-index.
|
|
This should help avoid having too many fake top-level
messages in the topic view since we only have a partial
window for threading results.
|
|
This seems like a nasty thing which breaks downloads of
large mailboxes.
|
|
Some threads are easily over 100 messages, so the 50 limit is
not enough. It is likely that 1000 messages is not enough,
either, and we will need to tune our threading to handle more
messages and supply options for configurability.
|
|
Hopefully this gives new hackers a better overview of
how the components relate to each other.
|
|
We use it as a general compressor for identifiers such as
subject paths, so using the "mid_" prefix probably is not
appropriate.
|
|
The document data of a search message already contains a good chunk
of the information needed to respond to OVER/XOVER commands quickly.
Expand on that and use the document data to implement OVER/XOVER
quickly.
This adds a dependency on Xapian being available for nntpd usage,
but is probably alright since nntpd is esoteric enough that anybody
willing to run nntpd will also want search functionality offered
by Xapian.
This also speeds up XHDR/HDR with the To: and Cc: headers and
:bytes/:lines article metadata used by some clients for header
displays and marking messages as read/unread.
|
|
We probably won't be supporting this in the public API
|
|
Implementing NEWNEWS, XHDR, XOVER efficiently will require
additional caching on top of msgmap.
This seems to work with lynx and slrnpull, haven't tried clients.
DO NOT run in production, yet, denial-of-service vulnerabilities
await!
|
|
DBI + DBD::SQLite has much better handling of prefix lookups
than Xapian. While we're at it, avoid linking blatantly wrong
Message-IDs to external services.
|
|
In the future, it should be possible to use this:
git ls-files | UPDATE_COPYRIGHT_HOLDER='all contributors' \
UPDATE_COPYRIGHT_USE_INTERVALS=2 \
xargs /path/to/gnulib/build-aux/update-copyright
|
|
In case a URL gets truncated (as is common with long URLs),
we can rely on Xapian for partial matches and bring the user
to their destination.
|
|
We should not need to use QueryParser for internal queries,
but rather for external ones.
We'll also be exposing searching Message-IDs with the "mid:" prefix
for broken mids on some servers, and enabling partial searching
with 'm' to help with URL truncations.
Since thread IDs may be volatile, they cannot be exposed to the
public, there's no reason to expose them to the query parser,
either.
Also, add 's:' as an alternative probabilistic prefix to 'subject'
as it is shorter.
|
|
Perhaps this can be optionally enabled in the future for smaller
sites.
|
|
Might as well give relevance some weight if the timestamp is tied.
|
|
This hopefully makes it easier to find things without resorting
to proprietary external services.
|
|
We'll continue to compress long Message-IDs in URLs (which we know
about), but we will store entire Message-IDs in the Xapian database
to facilitate ease-of-lookups in external databases.
|
|
Like revision control history, older stuff is less relevant,
so favor newer stuff, first.
|
|
This makes dumping recent topics easier, hopefully.
|
|
Redundant document data increases our database size, pull the
smsg->mid off the unique term, the smsg->ts off the value, and
only generate the formatted display date off smsg->ts.
|
|
We no longer need them, as we can rely on index-time thread
resolution and thread merging. This allows us to index less
data and hopefully increase efficiency.
|
|
Perl does not currently optimize for this.
ref (from p5p):
http://mid.gmane.org/D5C27970-9176-4C7A-8B99-7D78360E67A2@pobox.com
|
|
Our search query already filters out ghost messages,
so it's wasteful to have type information loaded.
|
|
We ought to summarize subjects to avoid exploding
line lengths in the web interface.
|
|
Consistently name mid_* functions as verbs.
|
|
Many of our internal search queries do not care about relevance,
but is used for proper thread displays.
|
|
Most of our special query functions require exact matches, so none
of the flags we normally use are necessary for query parsing.
|
|
This makes organization easier and reduces the amount of code
loaded for a PSGI, mod_perl or CGI instance.
|