Date | Commit message |
|
Introduce our own SearchThread class for threading messages.
This should allow us to specialize and optimize away objects
in future commits.
|
|
We will not care for inexact threading by subject or pruning.
|
|
Support (and document) 'a:' after all, as "mairix -h" uses it,
so this should reduce the learning curve for mairix users.
|
|
This clarifies the code somewhat, and we don't care to lazy-load
in NNTP.pm anyways since this is only used for a long-lived
daemon.
|
|
The existing string -> number date range Xapian query is good
enough, and having too much flexibility is probably bad for
caching (as well as increasing our attack surface, because
parsing queries is tricky).
Tags-as-skiplists are probably not worth the effort given
Xapian, and we may have to import old messages after-the-fact,
anyways, and message delivery for mirrors is never orderly.
Other items are all done and need to be maintained (like the
search engine docs for the mairix-compatibility features that
just got pushed out)
|
|
Output $! for diagnostic purposes since I've noticed this on
two slow machines, today (and seemingly, never prior).
|
|
And while we're at it, ensure searching inside displayable
attachment bodies works.
|
|
The basic rule is that if it is displayable via our WWW
interface, it should be indexable text for Xapian search.
|
|
It's not worth entering a complex codepath in Email::MIME to
save some (probably immeasurable amount of) memory, here. We've
already stopped doing this in our WWW code a while back, too.
If we really cared enough about it, we'd prioritize work on a
streaming replacement for Email::MIME.
|
|
Specifying the "d:" field only worked for
NumberValueRangeProcessor in older versions of Xapian, such
as the one in Debian wheezy (libsearch-xapian-perl=1.2.10.0-1).
This slipped through since I rarely use wheezy, anymore, and
perhaps nobody else does, either. Perhaps wheezy support may be
dropped, soon.
Unfortunately, this requires a schema version bump.
|
|
We pay a storage cost for storing positional information
in Xapian, so make good use of it by attempting to preserve
it for (hopefully) better search results.
|
|
This is stricter than the mutt quote_regexp default
("^([ \t]*[|>:}#])+" on Debian jessie),
but matches what we have in View.pm.
I prefer the stricter quote detection since it is less ambiguous
and less likely to hide/obscure important details.
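
To illustrate the difference in strictness, here is a minimal Python sketch comparing the mutt default quoted above with a stricter rule. The stricter pattern (`^>`) is an assumption made for illustration; the exact expression used in View.pm is not stated in this message, and `is_quoted` is a hypothetical helper.

```python
import re

# mutt's default quote_regexp on Debian jessie, as quoted in the commit message.
MUTT_QUOTE = re.compile(r"^([ \t]*[|>:}#])+")

# Assumed stricter rule for illustration only: '>' must be in the first column.
STRICT_QUOTE = re.compile(r"^>")


def is_quoted(line, pattern):
    """Return True if the line counts as quoted text under the given pattern."""
    return pattern.search(line) is not None


if __name__ == "__main__":
    samples = [
        "> quoted reply",
        "# shell comment in a patch description",
        ": a line starting with a colon",
        "  > indented quote",
    ]
    for line in samples:
        print(repr(line), is_quoted(line, MUTT_QUOTE), is_quoted(line, STRICT_QUOTE))
```

Under the looser pattern, lines starting with '#' or ':' are treated as quotes, which is the kind of ambiguity the stricter detection avoids.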
|
|
As of Xapian 1.0.4 (from 2007), it is possible to use
Search::Xapian::QueryParser::add_prefix multiple times with the
same user field name but different term prefixes.
This brings my current git@vger mirror from 6.5GB to 2.1GB
(both sizes are after xapian-compact).
|
|
"bs:" and "b:" are adapted from mairix(1).
We will also support searching explicitly for quoted vs
non-quoted text via "q:" and "nq:" prefixes since sometimes
readers will not care for quoted text.
In the future, we will support parsing diffs (perhaps when
repobrowse integration is complete).
Note: this roughly doubles the size of the Xapian database due
to the additional information; so this change may not be worth
it.
|
|
We only document "s:" anyways. While the long name is more
descriptive, the ambiguity makes agnostic caching (by Varnish or
similar) slightly harder and longer URLs are more likely to be
accidentally truncated when shared.
|
|
Sometimes it can be useful to search based on who the
message was sent to, sent by, or Cc:-ed. Of course,
headers can be faked, but they usually are not...
Anyways this mostly matches the behavior of mairix(1).
|
|
We need to prevent excessive repository growth for
public-inbox-watch and public-inbox-mda users.
|
|
We will be reusing this in the next commit, too.
|
|
For now, we will document this since it allows better
performance without the burden of extensions. Perhaps one day
far in the future Perl can natively support vfork(2) AND that
version of Perl will be widely available, but I suspect that day
is at least a decade away, if not two:
https://rt.perl.org/Ticket/Display.html?id=128227
|
|
This reduces duplication, slightly. We may be using it
yet again in a to-be-introduced function (or we may not
introduce it).
|
|
Email::MIME internally assumes "text/plain" for messages
missing a Content-Type, but does not expose that in the
Email::MIME::content_type API method. We must assume it
ourselves to avoid uninitialized value warnings for the
rare (nowadays) MUAs which do not set it.
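
The same defaulting idea can be sketched in Python with the standard `email` package. This is only an analogue, not the Email::MIME API; `effective_content_type` is a hypothetical helper name.

```python
from email import message_from_string


def effective_content_type(msg):
    """Return the media type, assuming text/plain when the header is absent."""
    raw = msg.get("Content-Type")  # None when the MUA did not set the header
    if raw is None:
        return "text/plain"
    # Drop parameters such as charset and normalize case.
    return raw.split(";", 1)[0].strip().lower()


if __name__ == "__main__":
    bare = message_from_string("Subject: hi\n\nbody\n")
    html = message_from_string("Content-Type: Text/HTML; charset=utf-8\n\n<p>x</p>\n")
    print(effective_content_type(bare))
    print(effective_content_type(html))
```

Handling the missing header in one place means callers never see an undefined value.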
|
|
And include it into the build + website
|
|
Hopefully more folks can download and run public-inbox,
nowadays.
|
|
Just having "limiter" in the prefix may confuse
it with something else. Use the full prefix to
avoid this confusion.
|
|
We want to encourage users to serve repositories. So enable
bitmaps by default so performance suffers less with smart HTTP.
|
|
We'll keep supporting "publicinboxlearn" indefinitely,
but "publicinboxwatch" is probably more appropriate
at the moment.
Noticed while writing documentation.
|
|
This will be important as we will have more of them.
|
|
This will allow reasonable titles to be generated for
manpages.
|
|
Since this is bundled with the source, we might as well use
internal APIs to avoid having duplicate code (and bugs :P)
|
|
Not everybody can run Tor; hopefully more can use Tor2web
even if it compromises their privacy. This should help
make the system more resilient for users unable to use Tor.
|
|
We want the pod2man(1) executable for handling certain
options. Also, use the correct year while we're at it :P
|
|
This makes us closer to git.git style (though I'm not quite sure
why we do this...)
|
|
We use perlpod nowadays since it's Perl, like our code base.
|
|
Centralization sucks, so we mirror everything.
|
|
The internal help text links to the Xapian query parser
documentation anyways, but also provides information
on which prefixes exist.
|
|
Begin documenting some basic help functionality.
I may tweak the anchor names of the various HTML endpoints
to be more consistent with each other (old ones will be
supported for a short while), so I'm not documenting
those, for now.
This may become part of a builtin key-value store for
basic texts, but this probably shouldn't become a wiki
engine, either.
|
|
We're not to-the-letter about percent-encoding, but
we should allow all the characters. This is mainly
so we can effectively use the link to some Wikipedia
pages with parentheses in them:
https://en.wikipedia.org/wiki/Atom_(standard)
https://en.wikipedia.org/wiki/Git_(software)
|
|
For some reason, Alpine will set X-UNKNOWN for valid UTF-8.
Since we favor UTF-8 HTML anyways, try forcing Email::MIME to
handle text/plain as UTF-8 which might show up better.
At least this change renders
<alpine.DEB.2.20.1608131214070.4924@virtualbox>
properly by showing "•" (•) instead of
"⠢" (•)
Reported-by: Thomas Ferris Nicolaisen <tfnico@gmail.com>
|
|
Alpine seems to set charset=X-UNKNOWN for valid UTF-8 text,
which causes Email::MIME::body_str to fail as X-UNKNOWN
is not a valid encoding. So, blindly display the body
as plain-text but warn users about possibly mangled text.
Reported-by: Thomas Ferris Nicolaisen <tfnico@gmail.com>
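
A rough Python sketch of this kind of fallback: if the declared charset is not a known encoding, decode anyway and flag the result so a warning can be shown. Falling back to UTF-8 specifically, and the name `decode_body`, are assumptions for illustration and not necessarily what the Perl code does.

```python
import codecs


def decode_body(raw, charset):
    """Decode a message body; return (text, warn).

    warn is True when the declared charset was unknown and the text
    may therefore be mangled.
    """
    try:
        codecs.lookup(charset)
    except LookupError:
        # Unknown label such as X-UNKNOWN: assume UTF-8 and warn the reader.
        return raw.decode("utf-8", errors="replace"), True
    return raw.decode(charset, errors="replace"), False


if __name__ == "__main__":
    text, warn = decode_body("\u2022 item".encode("utf-8"), "X-UNKNOWN")
    print(text, warn)
```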
|
|
There is no point in using an array to join on an
empty string (my original intention was probably to
join on "\n").
This is only preparation for the next change to show
a warning in the attachment link.
|
|
This is similar to mairix in that it uses a "d:" prefix; but
only takes YYYYMMDD, for now. Using custom date/time parsers
via Perl will be much more work:
nntp://news.gmane.org/20151005222157.GE5880@survex.com
Anyhow, this ought to be more human-friendly than searching by
Unix timestamps, but it requires reindexing to take advantage of.
|
|
The Unix timestamp isn't meaningful for users searching, so
we will start indexing the YYYYMMDD date stamp, which may
use StringValueRangeProcessor, instead.
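
The reason a string range works here is that fixed-width YYYYMMDD stamps sort lexicographically in the same order as the dates they represent. A minimal Python sketch, assuming UTC for the conversion (the time zone used by the indexer is not stated here) and using hypothetical helper names:

```python
from datetime import datetime, timezone


def yyyymmdd(ts):
    """Convert a Unix timestamp to a YYYYMMDD string (UTC assumed)."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y%m%d")


def in_range(stamp, begin, end):
    """Plain string comparison is enough because every stamp has 8 digits."""
    return begin <= stamp <= end


if __name__ == "__main__":
    stamp = yyyymmdd(1470000000)
    print(stamp, in_range(stamp, "20160701", "20160801"))
```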
|
|
Also, at least add one of the Tor mirrors (the rest will
be discoverable through the mirrors themselves).
|
|
Not sure why or how I missed this before; but the common address
parsing routine we have should be more correct.
Add a test to ensure excessively quoted names don't make it
through, either.
|
|
Plenty more to do!
|
|
Ensure we usually strip one level of '<>' from Message-IDs,
since our internal SQLite, Xapian, and SHA-1 storage all
assume that.
Realistically, we screw up if somebody has '<<' or '>>',
but those are screwed up mail clients and we can deal with
it another time. Currently, this means some messages with
'>>' in References or Message-Id are not handled correctly,
yet, but we match the behavior of Mail::Thread in keeping
the extra '>'.
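
The rule can be sketched as follows in Python; `strip_one_angle_level` is a hypothetical name and this is an illustration of the described behavior, not the project's code.

```python
def strip_one_angle_level(mid):
    """Remove exactly one enclosing '<' ... '>' pair from a Message-ID."""
    mid = mid.strip()
    if mid.startswith("<") and mid.endswith(">"):
        return mid[1:-1]
    return mid


if __name__ == "__main__":
    for raw in ("<a@example.com>", "<a@example.com>>", "a@example.com"):
        print(raw, "->", strip_one_angle_level(raw))
```

Only one level is removed, so an ID from a broken client such as `<a@example.com>>` keeps its extra '>'.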
|
|
Based on reading RFC 3986, it seems '@', ':', '!', '$', '&',
"'", '(', ')', '*', '+', ',', ';', '=' are all allowed
in path-absolute where we have the Message-ID.
In any case, it seems '@' is fairly common in path components
nowadays and too common in Message-IDs.
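
As a sketch of the effect, the Python snippet below percent-encodes a Message-ID for use as a single path component while leaving the characters listed above untouched. `mid_to_path` is a hypothetical helper, not part of the code base.

```python
from urllib.parse import quote

# ':' and '@' plus the RFC 3986 sub-delims, all permitted in a path segment.
PCHAR_EXTRA = "@:!$&'()*+,;="


def mid_to_path(mid):
    """Escape a Message-ID for one URL path component.

    '/' is deliberately not in the safe set, so it becomes %2F.
    """
    return quote(mid, safe=PCHAR_EXTRA)


if __name__ == "__main__":
    print(mid_to_path("20151005222157.GE5880@survex.com"))
    print(mid_to_path("odd/id with space@example.com"))
```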
|
|
I've seen 0x1b (\e) in at least one message and some other
possibly non-printable chars. In any case, make sure they're
valid XML with us-ascii encoding as far as xmlstarlet(1) thinks
so.
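
One way to get such output, sketched in Python: replace the C0 control characters that XML 1.0 forbids, escape markup, and emit non-ASCII characters as numeric references. The '?' replacement and the function name are assumptions for illustration; this message does not say how the characters are actually handled.

```python
import re
from xml.sax.saxutils import escape

# C0 controls other than tab, LF and CR are not allowed in XML 1.0 documents.
_XML_INVALID = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def xml_ascii_text(s, replacement="?"):
    """Return s as XML character data that is safe in a us-ascii document."""
    s = _XML_INVALID.sub(replacement, s)
    # escape() handles &, < and >; xmlcharrefreplace handles non-ASCII.
    return escape(s).encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    print(xml_ascii_text("subject with \x1b[0m escape"))
```

This sketch only covers the C0 range; other code points that XML forbids (such as U+FFFE) are left alone.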
|
|
Apparently there are some really screwed up In-Reply-To
fields out there.
|
|
We can't blindly assume a ghost even exists in the DB, as the
rules can change internally for some corner-case Message-IDs.
|