Date | Commit message |
|
Nobody is expected to use long options, but for consistency
with mairix(1), we'll use the pluralized option throughout
(including existing PublicInbox::{Search,SearchView}).
Link: https://public-inbox.org/meta/20210206090119.GA14519@dcvr/
|
|
Parallelism and interactivity with pager + SIGPIPE needs work;
but results are shown and phrase search works without shell
users having to apply Xapian quoting rules on top of standard
shell quoting.
|
|
Using "make update-copyrights" after setting GNULIB_PATH in my
config.mak
|
|
{ibx} is shorter and is the most prevalent abbreviation
in indexing and IMAP code, and the `$ibx' local variable
is already prevalent throughout.
In general, the codebase favors removal of vowels in variable
and field names to denote non-references (because references are
"lighter" than non-references).
So update WWW and Filter users to use the same code since
it reduces confusion and may allow easier code sharing.
|
|
Using "eidx_key:" boolean prefix to limit results to a given
inbox, we can use ->ALL to emulate and replace per-Inbox
xap15/[0-9] search indices.
With this change, the presence of "extindex.all.topdir" in the
$PI_CONFIG will cause the WWW code to use that extindex and
ignore per-inbox Xapian DBs in xap15/[0-9].
Unfortunately IMAP search still requires old per-inbox indices,
for now. Mapping extindex Xapian docids to per-Inbox UIDs and
vice-versa is proving tricky. Fortunately, IMAP search is
rarely used and optional. The RFCs don't specify expensive
phrase search, either, so `indexlevel=medium' can be used in
per-inbox Xapian indices to save space.
For primarily WWW (and future JMAP) users, this should result in
significant disk space, FD, and page cache footprint savings for
large instances with many inboxes and many cross-posted
messages.
|
|
There's no need to export it, as shown by the change to
SearchView. This should pave the way to making search
more flexible and allow per-Inbox search to reuse ->ALL.
|
|
Nearly all of the search uses in the production code rely on
a Xapian mset iterator being returned (instead of an array
of $smsg objects). So default to returning the mset and move
the burden of smsg array conversion into the test cases.
|
|
Being an easily confused person, I find "next" and "prev"
ambiguous as to whether messages on the next or previous page
will be newer or older than the current page. Clarify that for
the threaded /$INBOX/ view and search results.
For search results sorted by relevance, we'll use "[>= $SCORE]"
or "[<= $SCORE]" to indicate directionality.
This also fixes $INBOX/new.html for unindexed v1 inboxes.
|
|
Sometimes it's useful to quickly get to threads and messages
which are contemporaries of the current thread/message being
focused on. This hopefully improves navigation by:
a) making the top line (where $INBOX_DIR/description is shown)
a link to the latest topics in search results and
per-thread/per-message views.
b) providing a link to contemporaries ("~YYYY-MM-DD")
around the thread overview skeleton area for per-thread
and per-message views.
|
|
Expanding threads via over.sqlite3 for mbox.gz downloads without
Xapian effectively collapsing on the THREADID column leads to
repeated messages getting downloaded.
To avoid that situation, use a "has_threadid" Xapian metadata
flag that's only set on --reindex (and brand new Xapian DBs).
This allows admins to upgrade WWW or do --reindex in any order
without worrying about users eating up bandwidth and CPU cycles.
|
|
Finally, the addition of THREADID for collapsing results
in Xapian lets us emulate the "mairix --threads" feature.
That is, instead of returning only the matching messages,
the entire thread is included in the downloaded mbox.gz.
This requires a "public-inbox-index --reindex" to be usable.
|
|
Unlike w3m and links, the lynx browser seems to require a `name'
attribute for `<input type=submit>' elements. Maybe some other
browsers do, too. The `name' attribute for submit elements
doesn't seem to cause any harm for w3m or links users, either,
despite not (AFAIK) being part of historical or current HTML
specs.
|
|
We can avoid importing mdocid() in several places by using
this method, simplifying callers.
|
|
git blob retrieval dominates on these, "&x=t" (nested) is
roughly the same due to increased overhead for ->get_percent
storage balancing out the mass-loading from SQLite.
Atom "&x=A" is sped up slightly and uses less memory in the
long-lived response.
|
|
Instead of loading one article at-a-time from over.sqlite3, we
can use SQL to mass-load IN (?,?, ...) all results with a single
SQLite query. Despite SQLite being in-process and having no
network latency, the reduction in SQL query executions from
loading multiple rows at once speeds things up significantly.
We'll keep the over->get_art optimizations from the previous
commit, since it still speeds up long-lived responses, slightly.
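
A minimal sketch of the mass-loading idea, written in Python's sqlite3
module rather than the project's Perl. The `articles` table, its
columns, and the `load_articles` helper are assumptions made for
illustration; they are not the real over.sqlite3 schema or API.

```python
import sqlite3


def load_articles(conn, nums):
    """Fetch every requested article with one IN (?,?,...) query."""
    if not nums:
        return {}
    # One placeholder per article number; values are still bound, not
    # interpolated, so this stays safe against SQL injection.
    placeholders = ",".join("?" * len(nums))
    sql = f"SELECT num, subject FROM articles WHERE num IN ({placeholders})"
    return {num: subject for num, subject in conn.execute(sql, list(nums))}


def demo():
    # Hypothetical in-memory table standing in for the overview DB.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE articles (num INTEGER PRIMARY KEY, subject TEXT)")
    conn.executemany(
        "INSERT INTO articles VALUES (?, ?)",
        [(1, "hello"), (2, "re: hello"), (3, "patch")],
    )
    return load_articles(conn, [1, 3])
```

SQLite caps the number of bound parameters per statement, so a caller
with very large result pages would need to split `nums` into batches.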
|
|
This is a step towards improving kernel page cache hit rates by
relying on over.sqlite3 for document data instead of Xapian.
Some micro-optimization to over->get_art was required to
maintain performance.
|
|
Since this was already a separate package, split it off
into its own file since SearchView may not handle inbox
groups.
|
|
While this is unlikely to be a problem in current practice,
keeping Xapian DBs open for long responses can interfere with
free space recovery after -compact.
In the future, it will interfere with inbox search grouping
and lead to unexpected results.
|
|
This simplifies the primary callers of eml_entry while only making
mknews.perl worse.
|
|
We can save stack space and simplify subroutine calls, here.
|
|
Another 10% or so speedup when displaying full messages off
search results.
|
|
This will make it easier to support asynchronous blob
retrievals. The `$ctx->{nr}' counter is no longer implicitly
supplied since many users didn't care for it, so stack overhead
is slightly reduced.
|
|
Like with WwwAtomStream and MboxGz, we can bless the existing
$ctx object directly to avoid allocating a new hashref. We'll
also switch from "->" to "::" to reduce stack utilization.
|
|
To further simplify callers and avoid embarrassing memory
explosions[1], we can finally eliminate this method in
favor of smsg_eml.
[1] commit 7d02b9e64455831d3bda20cd2e64e0c15dc07df5
("view: stop storing all MIME objects on large threads")
fixed a huge memory blowup.
|
|
We can simplify WwwAtomStream callbacks by performing ->smsg_eml
calls in the `feed_entry' sub itself. This simplifies callers,
by reducing the number of places which can load an Eml object
into memory.
|
|
We can rid ourselves of a layer of indirection by subclassing
PublicInbox::Smsg instead of using a container object to hold
each $smsg. Furthermore, the `{id}' vs. `{mid}' field name
confusion is eliminated.
This reduces the size of the $rootset passed to walk_thread by
around 15%, that is over 50K memory when rendering a /$INBOX/
landing page.
|
|
Since the introduction of over.sqlite3, SearchMsg is not tied to
our search functionality in any way, so stop confusing ourselves
and future hackers by just calling it "PublicInbox::Smsg".
Add a missing "use" in ExtMsg while we're at it.
|
|
`%over' could be confused for the overview SQLite DB
instance, so call it `%override', instead. There's
also no need to write a loop to override a hash when
the language can do it for us.
|
|
We never look up `$ctx->{-obfuscate}' anywhere, as the
correct key is `$ctx->{-obfs_ibx}' since some of the
address obfuscation stuff is inbox-specific.
Note: some of the obfuscation stuff still needs tests,
but it's low-priority at the moment since I don't think
it's a good feature after all.
|
|
We need to escape ampersands (and some other characters for href
attributes), so introduce a `mid_href' sub to do just that.
'<', '>' and '"' were always escaped, so there's no risk of tag
or attribute injection, but creative Message-IDs could cause
confusion for some parsers and generate invalid URLs.
Start getting rid of the bloated, over-engineered OO Hval API
while we're at it; I only noticed this bug because I started
killing off Hval->new* callers.
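
A rough Python sketch of what such an escaping helper could look like.
This is not the actual `mid_href' sub: the function body and the choice
to leave only "@" unescaped are assumptions for illustration.

```python
from html import escape
from urllib.parse import quote


def mid_href(mid: str) -> str:
    """Return a Message-ID made safe for use inside an href attribute."""
    # Percent-encode everything except unreserved characters and "@",
    # so "&", "<", ">", '"' and "/" cannot alter the URL or the markup.
    encoded = quote(mid, safe="@")
    # With this safe set the HTML escape changes nothing; it is kept so
    # that widening `safe` later cannot let a raw "&" through.
    return escape(encoded, quote=True)
```

For example, `mid_href("a&b@example.com")` yields `a%26b@example.com`.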
|
|
We already pre-populate the hashref when loading $smsg
(PublicInbox::SearchMsg) objects out of over.sqlite3 or Xapian,
so making expensive method calls isn't necessary in those cases.
We only need to use the method calls when SQLite or Xapian are
not available or are being populated (such as during indexing).
|
|
I didn't wait until September to do it, this year!
|
|
It'll always be used as a callback, so there's no point in
giving it a name to be called non-anonymously. Making
assignments to it is slightly faster since there's no need
to repeatedly do a lookup by name.
|
|
Pass \&coderefs explicitly to walk_thread, and add some
prototypes + comments to describe what goes on.
|
|
This saves us a few comments and confusion. Yes, it's a
destination so "dst" can be appropriate, but we may be using
that term elsewhere.
|
|
There's a bunch of leftover "require" and "use" statements we no
longer need and can get rid of, along with some excessive
imports via "use".
IO::Handle usage isn't always obvious, so add comments
describing why a package loads it. Along the same lines,
document the tmpdir support as the reason we depend on
File::Temp 0.19, even though every Perl 5.10.1+ user has it.
While we're at it, favor "use" over "require", since it gives
us extra compile-time checking.
|
|
This allows callers to pass named (not anonymous) subs.
Update all retry_reopen callers to use this feature, and
fix some places where we failed to use retry_reopen :x
|
|
We don't need to return a closure or have a separate hash
for sorting threads by relevance. Instead, we can stuff
the relevance {pct} into the SearchMsg object itself and
use that.
Note: upon reviewing this code, the sort-by-relevance seems
bogus as it only considers the relevance of the topmost message.
Instead, it would make more sense to the user to sort by the
highest relevance of all messages in that particular thread.
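
The alternative ordering suggested in the note could be sketched as
follows, in Python. The dict-based thread nodes with `pct` and
`children` keys are hypothetical stand-ins, not the SearchMsg or
SearchThread structures.

```python
def max_pct(root):
    """Highest relevance found anywhere in the thread under `root`."""
    best = 0
    stack = [root]  # iterative walk, so deep threads cannot overflow recursion
    while stack:
        node = stack.pop()
        # Messages that did not match the query carry no score; count as 0.
        best = max(best, node.get("pct", 0))
        stack.extend(node.get("children", []))
    return best


def sort_threads(roots):
    """Order threads by their best-matching message, most relevant first."""
    return sorted(roots, key=max_pct, reverse=True)
```

With this ordering, a thread whose root scores poorly but which
contains a highly relevant reply sorts ahead of a thread whose only
match is a mediocre root message.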
|
|
Both WwwStream and WwwAtomStream ->response pass the WWW $ctx
to the callback nowadays, so we can pass named subs to them.
|
|
Displaying "100%" wastes a precious column. Show "99%" instead
since there's little practical difference and <xapian/mset.h>
states:
Note that these generally aren't percentages of anything meaningful
(unless you use a custom weighting formula where they are!)
And we're not using a custom weighting formula.
|
|
Instead of only passing an Inbox object, we'll pass the $ctx
reference as PublicInbox::SearchView::mset_thread did.
So although mset_thread was wrong, we now make its usage
of SearchThread::thread correct and update other callers to
favor the new style of passing the entire $ctx (with ->{-inbox})
instead of just the Inbox object.
This makes the thread skeleton at the bottom of the search
page show subjects of messages, but unfortunately links to
non-existent #anchors. The next commit will fix that.
While we're at it, favor "\&foo" over "*foo" since the former
makes the code reference (aka "function pointer") obvious so it
won't be confused for other things named "foo" in that
scope (e.g. $foo/@foo/%foo).
|
|
|
|
Displaying full path names of installed modules could expose
unnecessary information about user home directory names or other
potentially sensitive information. However, displaying a module
name could still be useful for diagnosing problems, so map full
paths to the relevant part of the path name which is relevant to
the package name.
Reported-by: Ali Alnubani <alialnu@mellanox.com>
https://public-inbox.org/meta/20190611193815.c4uovtlp574bid6x@dcvr/
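
One way to picture the path mapping, as a Python sketch. The
`display_name` helper and the list of library directories are
hypothetical; the actual change may derive the short name differently.

```python
def display_name(path, lib_dirs):
    """Trim a module's full path down to its package-relative part."""
    best = ""
    for d in lib_dirs:
        prefix = d.rstrip("/") + "/"
        # Prefer the longest matching library directory.
        if path.startswith(prefix) and len(prefix) > len(best):
            best = prefix
    if best:
        return path[len(best):]
    # Unknown location: show only the file name so no directory leaks.
    return path.rsplit("/", 1)[-1]
```

Given `/home/alice/perl5/lib/perl5/Plack/Util.pm` and a library
directory of `/home/alice/perl5/lib/perl5`, this returns
`Plack/Util.pm`, which still identifies the module without revealing
the home directory name.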
|
|
I could not find a place to put the link at the top without
making navigation too cluttered. Putting it at the bottom
of the page seems reasonable...
|
|
Taking a hint from Perl array access, we'll allow negative
offsets for the 'o' parameter to reverse the sort order.
|
|
Non-ASCII digits would be interpreted as zero when used as integers.
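
The pattern side of this class of bug can be shown in Python, where
`\d` also matches non-ASCII decimal digits by default. Note the
difference from the behavior described above: Python's `int()` would
convert such digits instead of yielding zero, so this sketch only
illustrates restricting input to ASCII digits. `parse_offset` is a
hypothetical helper.

```python
import re

UNICODE_DIGITS = re.compile(r"\d+")    # matches any Unicode decimal digit
ASCII_DIGITS = re.compile(r"[0-9]+")   # matches only 0 through 9

ARABIC_INDIC_123 = "\u0661\u0662\u0663"  # Arabic-Indic digits one, two, three


def parse_offset(value, default=0):
    """Accept only plain ASCII digits; anything else gets the default."""
    return int(value) if ASCII_DIGITS.fullmatch(value) else default
```

Here `UNICODE_DIGITS.fullmatch(ARABIC_INDIC_123)` succeeds while
`ASCII_DIGITS.fullmatch(ARABIC_INDIC_123)` does not, so validating
numeric parameters with an explicit `[0-9]` class avoids the surprise.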
|
|
We don't need to rely on Xapian search functionality for the
majority of the WWW code, even. subject_normalized is moved to
SearchMsg, where it (probably) makes more sense, anyways.
|
|
Empty subjects ("") and undefined Subjects: are now both
displayed as "(no subject)" for now.
|
|
'$inbox' is more human-readable, so it is used for the more
human-readable name in most cases. Making our variable naming
more consistent should make the code easier-to-review and
harder to screw up.
|
|
Unused since commit 5f09452bb7e6cf49fb6eb7e6cf166a7c3cdc5433
("view: cull redundant phrases in subjects")
|