Date | Commit message (Collapse) |
|
These fields are only necessary in NNTP and not even stored in
Xapian; so keeping them around for the PSGI web UI search
results wastes nearly 80K when loading large result sets.
|
|
Now that some of the indexes are optionals these tests might fail
so teach them to fail more cleanly.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
|
|
This is consistent with git itself and the previous behavior
was a result of misunderstanding of how git interprets this.
And adjust tests slightly to match the new behavior.
Reported-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org>
<38873789-ab42-65a1-20c9-12c30b171f4f@linuxfoundation.org>
|
|
This was probably a typo on my part, and quiets a warning:
Argument contains empty address at .../Email/MIME/Encode.pm line 70
Tested with Email::MIME 1.946
|
|
"LIKE" in SQLite (and other SQL implementations I've seen) is
expensive with nearly 3 million messages in the archives.
This caused some partial Message-ID lookups to take over 600ms
on my workstation (~300ms on a faster Xeon). Cut that to below
under 30ms on average on my workstation by relying exclusively
on Xapian for partial Message-ID lookups as we have in the past.
Unlike in the past when we tried using Xapian to match partial
Message-IDs; we now optimize our indexing of Message-IDs to
break apart "words" in Message-IDs for searching, yielding
(hopefully) "good enough" accuracy for folks who get long URLs
broken across lines when copy+pasting.
We'll also drop the (in retrospect) pointless stripping of
"/[tTf]" suffixes for the partial match, since anybody who
hits that codepath would be hitting an invalid message ID.
Finally, limit wildcard expansion to prevent easy DoS vectors
on short terms.
And blame Pine and alpine for generating Message-IDs with
low-entropy prefixes :P
|
|
We can't have files with permissions inconsistent with what's
in git objects.
|
|
Otherwise articles show up again...
|
|
Since we only query the SQLite over DB for OVER/XOVER; do not
need to waste space storing fields To/Cc/:bytes/:lines or the
XNUM term. We only use From/Subject/References/Message-ID/:blob
in various places of the PSGI code.
For reindexing, we will take advantage of docid stability
in "xapian-compact --no-renumber" to ensure duplicates do not
show up in search results. Since the PSGI interface is the
only consumer of Xapian at the moment, it has no need to
search based on NNTP article number.
|
|
Favor simpler internal APIs this time around, this cuts
a fair amount of code out and takes another step towards
removing Xapian as a dependency for v2 repos.
|
|
Dscho found this useful for finding matching git commits based
on AuthorDate in git. Add it to the overview DB format, too;
so in the future we can support v2 repos without Xapian.
https://public-inbox.org/git/nycvar.QRO.7.76.6.1804041821420.55@ZVAVAG-6OXH6DA.rhebcr.pbec.zvpebfbsg.pbz
https://public-inbox.org/git/alpine.DEB.2.20.1702041206130.3496@virtualbox/
|
|
In many cases, we do not care about the total number of
messages. It's a rather expensive operation in SQLite
(Xapian only provides an estimate).
For LKML, this brings top-level /$INBOX/ loading time from
~375ms to around 60ms on my system. Days ago, this operation
was taking 800-900ms(!) for me before introducing the SQLite
overview DB.
|
|
This ought to provide better performance and scalability
which is less dependent on inbox size. Xapian does not
seem optimized for some queries used by the WWW homepage,
Atom feeds, XOVER and NEWNEWS NNTP commands.
This can actually make Xapian optional for NNTP usage,
and allow more functionality to work without Xapian
installed.
Indexing performance was extremely bad at first, but
DBI::Profile helped me optimize away problematic queries.
|
|
We'll be making sure V2Writable uses this.
|
|
Too many similar functions doing the same basic thing was
redundant and misleading, especially since Message-ID is
no longer treated as a truly unique identifier.
For displaying threads in the HTML, this makes it clear
that we favor the primary Message-ID mapped to an NNTP
article number if a message cannot be found.
|
|
Using update-copyrights from gnulib
While we're at it, use the SPDX identifier for AGPL-3.0+ to
ease mechanical processing.
|
|
This simplifies the code a bit and reduces the translation
overhead for looking directly at data from tools shipped
with Xapian.
While we're at it, fix thread-all.t :)
|
|
Apparently it never actually got used, and the world seems
fine without it, so we can drop it.
While we're at it, consider removing our subject_path
usage from existence, too. We are not using fancy subject-line
based URLs, here.
|
|
Instead, only preload the ->mid field for threading,
as we only need ->thread and ->path once in Search->get_thread
(but we will need the ->mid field repeatedly).
This more than doubles View->load_results performance on
according to thread-all on an inbox with over 300K messages.
|
|
This roughly doubles performance due to the reduction in
object creation and abstraction layers.
|
|
And while we're at it, ensure searching inside displayable
attachment bodies works.
|
|
"bs:" and "b:" are adapted from mairix(1)
We will also support searching explicitly for quoted vs
non-quoted text via "q:" and "nq:" prefixes since sometimes
readers will not care for quoted text.
In the future, we will support parsing diffs (perhaps when
repobrowse integration is complete).
Note: this roughly doubles the size of the Xapian database due
to the additional information; so this change may not be worth
it.
|
|
We only document the "s:" anyways. While the long name is more
descriptive, the ambiguity makes agnostic caching (by Varnish or
similar) slightly harder and longer URLs are more likely to be
accidentally truncated when shared.
|
|
Sometimes it can be useful to search based on who the
message was sent to, sent by, or Cc:-ed. Of course,
headers can be faked, but they usually are not...
Anyways this mostly matches the behavior of mairix(1).
|
|
This is similar to mairix in that it uses a "d:" prefix; but
only takes YYYYMMDD, for now. Using custom date/time parsers
via Perl will be much more work:
nntp://news.gmane.org/20151005222157.GE5880@survex.com
Anyhow, this ought to be more human-friendly than searching by
Unix timestamps, but it requires reindexing to take advantage of.
|
|
This will allow us to release and re-acquire Xapian locks
due to the lack of FD_CLOEXEC on some FDs.
|
|
Noticed when using a long URL in the subject.
|
|
This should make identifiying leftover directories
due to SIGKILL-ed tests easier.
|
|
In case folks do not use eatmydata or tmpfs for testing,
use transactions to reduce the number of fsync calls
made and hopefully prevent drives from wearing out.
|
|
No point in loading Data::Dumper if we do not use it
in the tests.
|
|
The document data of a search message already contains a good chunk
of the information needed to respond to OVER/XOVER commands quickly.
Expand on that and use the document data to implement OVER/XOVER
quickly.
This adds a dependency on Xapian being available for nntpd usage,
but is probably alright since nntpd is esoteric enough that anybody
willing to run nntpd will also want search functionality offered
by Xapian.
This also speeds up XHDR/HDR with the To: and Cc: headers and
:bytes/:lines article metadata used by some clients for header
displays and marking messages as read/unread.
|
|
In the future, it should be possible to use this:
git ls-files | UPDATE_COPYRIGHT_HOLDER='all contributors' \
UPDATE_COPYRIGHT_USE_INTERVALS=2 \
xargs /path/to/gnulib/build-aux/update-copyright
|
|
We'll continue to compress long Message-IDs in URLs (which we know
about), but we will store entire Message-IDs in the Xapian database
to facilitate ease-of-lookups in external databases.
|
|
We no longer need them, as we can rely on index-time thread
resolution and thread merging. This allows us to index less
data and hopefully increase efficiency.
|
|
We ought to summarize subjects to avoid exploding
line lengths in the web interface.
|
|
Extend the purpose of core.sharedRepository to apply to
the $GIT_DIR/public-inbox/xapian* directory.
|
|
This makes organization easier and reduces the amount of code
loaded for a PSGI, mod_perl or CGI instance.
|
|
This is hopefully less ambiguous, as the word "count" confused
me, too.
|
|
Some mail software incorrectly creates circular references
and causes us to create ghosts before the actual mail doc
is created.
|
|
There's no need to make a transaction for each message when doing
incremental indexing against a git repository. While we're at it,
simplify the interface for callers, too and do not auto-create
the Xapian database if it was not explicitly enabled.
|
|
Replies are only direct replies, but followups could be any message
further down the thread. The latter is more useful.
|
|
We will not require Search::Xapian to be installed.
|
|
Quick-and-dirty wiring up of to Subject: paths.
This may prove more memorizable and easier-to-share than
/t/$MESSAGE_ID.html links, but less strict.
This changes our schema version to 1, since we now
use lower-case subject paths.
|
|
This will relieve callers of the need to decode the data
we store internally in Xapian
|
|
This shall allow us to search for replies/threads more easily.
|