Date | Commit message (Collapse) |
|
As stated in the previous change, conditional hash assignments
which trigger other hash assignments seem problematic, at times.
So replace:
$h->{k} //= do { $h->{x} = ...; $val };
$h->{k} // do {
$h->{x} = ...;
$hk->{k} = $val
};
"||=" is affected the same way, and some instances of "||=" are
replaced with "//=" or "// do {", now.
|
|
This is necessary for showing "found $OID in $CODEREPO_URL"
in solver-generated pages ($INBOX_URL/$OID/s/).
|
|
We're not using them, anywhere.
|
|
Instead of passing the prefix section and key separately, pass
them together as is commonly done with git-config(1) usage as
well as our ->get_all API. This inconsistency in the get_1 API
is a needless footgun and confused me a bit while working on
"lei up" the other week.
|
|
This allows PublicInbox::WWW hosts to advertise the existence of
IMAP servers in addition to NNTP servers.
|
|
We no longer waste a precious hash slot for a per-Inbox
{nntpserver} if it's only configured globally for all inboxes.
|
|
This should improve the users' chances of seeing errors in
various git config files we use.
|
|
lei shouldn't become unusable if a config file is invalid.
Instead, show the "git config" stderr and attempt to continue
gracefully.
Reported-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org>
Link: https://public-inbox.org/meta/20210910141157.6u5adehpx7wftkor@meerkat.local/
|
|
There's currently no support for altid with extindex, and
there's likely no legacy precedent for using altid like there is
with single public-inboxes.
|
|
This behaves identically the lei external "boost" parameter in
prioritizing raw messages for extindex.
Relying exclusively on the config file order doesn't work well
for mirrors since it's impossible to guarantee config file
ordering via grokmirror hooks.
Config file ordering remains the default if boost is
unconfigured, or in case of ties.
Note: I chose the name "boost" rather than "priority" or "rank"
since I always get confused by whether higher or lower numbers
take precedence when it comes to kernel scheduling. "weight" is
also a part of Xapian API terminology, which we currently do not
expose to configuration (but may in the future).
|
|
We'll be using this in lei for watch configs.
|
|
When dealing with thousands of inboxes, displaying all of
them on a single page isn't going to work. So steal some
pagination and search results code from the message search
to generate some basic HTML output that looks good in w3m.
|
|
LD_PRELOAD sent by a client can't affect lei-daemon.
|
|
I don't know if it's worth it to sub (or super)class
PublicInbox::Config into something more generic for
lei, but this change simplifies a good chunk of lei
code that reuses the public-inbox config parsing.
|
|
A user may wish to clobber/refine existing search parameters
by issuing "lei q --save" again. Support that by overwriting
the lei.saved-search state file entirely.
We continue to preserve over.sqlite3 for deduplication purposes.
This way, we don't get something redundant like:
[lei]
q = term1
q = term2
q = term1
q = term2
q = term3
...whenever a user wants to refine their search. Instead,
we'll just have:
[lei]
q = term1
q = term2
q = term3
On the second go.
|
|
git 2.11 and earlier could not handle git directories with
newlines in them, nor does libgit2 support them.
Followup-to: d87dd0e679587043 ("config: reject `\n' in `inboxdir'")
|
|
We can't completely instantiate our cgit wrapper without knowing
knowing cgit locations for serving static content.
Fixes: a5968dc059f655a ("config: lazy-load coderepos, support extindex")
|
|
Extsearch objects are duck-types of Inbox objects, and
are capable of supporting code repos all the same.
|
|
We'll try to share a bit more configuration with
extindex entries for WWW PSGI usage.
|
|
This fixes ->urlmatch use from lei, which already sets '-f'.
I noticed this because imap.$URL.compress was ignored in
my lei config file.
|
|
The features we use for SharedKV could probably be implemented
with GDBM_File or SDBM_File, but that doesn't seem worth it at
the moment since we depend on SQLite elsewhere.
|
|
Using "make update-copyrights" after setting GNULIB_PATH in my
config.mak
|
|
Instead of relying on split() and a regexp, we'll drop split()
entirely and rely on index() + two substr() calls to operate on
fixed strings. This brings PublicInbox::Config->new time down
from 0.98s down to 0.84s.
|
|
We can avoid a slow regexp capture and instead and rely on
rindex + substr to extract the section from the config file.
Then we use the defined-or-assignment (//=) operator combined
with the documented return value of `push' to ensure @section_order
is unique without repeating a hash lookup.
Finally, we avoid short-lived variables inside the loop and
declare them subroutine-wide to knock a teeny bit of allocation
time.
Combined, these optimizations bring the ~1.22s
PublicInbox::Config->new time down to ~0.98s with 50K inboxes.
|
|
It appears the Perl split() operator is not optimized for fixed
strings at all. With this change, PublicInbox::Config->new (w/o
->fill_all) time is reduced from 1.81s to 1.22s on a config file
with 50K inboxes.
|
|
Using substr() instead of a string copy + s// substitution here
reduces ->fill_all from 4.00s to 3.88s with 50K inboxes on my
workstation.
|
|
We need to canonicalize paths for inboxes which do not have
a newsgroup defined, otherwise ->eidx_key matches can fail
in unexpected ways.
|
|
We can rely on implicit join in string interpolation on die()
iff needed.
And just creating the arrayref up front to avoid an extra
backslash seems nicer at the moment.
|
|
->each_inbox will never attempt to iterate an object
without {inboxdir}, and simplify + short-circuit the
corresponding code
|
|
{pi_config} may be confused with the documented `PI_CONFIG'
environment variable, and we'll favor vowel-removal to be
consistent with our usage of object references.
The `pi_' prefix may stay in some places, for now; since a
separate namespace may come into this codebase for local/private
client-tooling.
For InboxIdle, we'll also remove an invalid comment about
holding a reference to the PublicInbox::Config object, too.
|
|
Using "eidx_key:" boolean prefix to limit results to a given
inbox, we can use ->ALL to emulate and replace per-Inbox
xap15/[0-9] search indices.
With this change, the presence of "extindex.all.topdir" in the
$PI_CONFIG will cause the WWW code to use that extindex and
ignore per-inbox Xapian DBs in xap15/[0-9].
Unfortunately IMAP search still requires old per-inbox indices,
for now. Mapping extindex Xapian docids to per-Inbox UIDs and
vice-versa is proving tricky. Fortunately, IMAP search is
rarely used and optional. The RFCs don't specify expensive
phrase search, either, so `indexlevel=medium' can be used in
per-inbox Xapian indices to save space.
For primarily WWW (and future JMAP) users; this should result in
significant disk space, FD, and page cache footprint savings for
large instances with many inboxes and many cross-posted
messages.
|
|
We can avoid doing a Message-ID lookup on every single inbox
by using ->ALL to scan its over.sqlite3 DB. This mimics NNTP
behavior and picks the first message indexed, though redirecting
to /all/$MESSAGE_ID/ could be done.
With the current lore.kernel.org set of inboxes (~140), this
provides a 10-40% speedup depending on inbox ordering.
|
|
With 50K newsgroups in the config file, this doesn't slow down
`PublicInbox::Config->new->fill_all' any measurable amount on
my busy old workstation.
This should prevent invalid newsgroup names from getting into
into extindex and catch user errors sooner, rather than later.
v2:
- delete {newsgroup} if invalid to avoid ->nntp_url link
- simplify -imapd and explain remaining check
|
|
There's no need to duplicate a potentially large hash,
but we can keep the inexpensive shortcut to it. We may
eventually drop the {groups} shortcut if it's no longer
useful.
|
|
For a mirror of lore.kernel.org with >140 inboxes, this speeds
up manifest.js.gz generation from ~1s to 40ms on my HW. This
is still unacceptable when dealing with thousands of inboxes,
but gets us closer to where we need to be.
|
|
We'll be using JSON in MiscIdx and MiscSearch, and
PublicInbox::Config seems like an appropriate place to put it.
|
|
Upon "eindex" rhymes with "reindex", which could be confusing;
so name the command and config prefix to use "extindex" which
is hopefully less confusing.
|
|
This lets us pretend an ExtSearch object is an Inbox object
in most of the existing WWW code.
|
|
This follows -watch commit b70473ab8296d31ebb600adb4fa8fe0ac5935ca8
to match List-Id headers case-insensitively.
Reported-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org>
Link: https://public-inbox.org/meta/20200921180152.uyqluod7qxbwqubo@chatter.i7.local/
|
|
Our code doesn't support multi-values for these, and having
unexpected arrays leads to unexpected results (e.g. showing
stuff like "ARRAY(0xDEADBEEFADD12E55)" in user interfaces). So
warn and only use the last value (matching git-config(1)
behavior without `--get-all').
|
|
We will need to allow simultaneous iterators on the same
config object, since we'll need this for ExtMsg, NNTPD,
WwwListing, NewsWWW, and other places.
|
|
In Perl, we can simplify callers by passing a single array
all the way down the stack instead of a single array ref which
needs to be expanded every call.
|
|
Just some golfing to reduce scrolling and hopefully readability.
|
|
This gives better page cache utilization for Xapian indexing on
slow storage by improving locality for random I/O activity on
the Xapian DB.
Instead of doing a single-pass to index both SQLite and Xapian;
this indexes them separately. The first pass is identical to
indexlevel=basic: it indexes both over.sqlite3 and msgmap.sqlite3.
Subsequent passes only operate on a single Xapian shard for
documents belonging to that shard. Given enough shards, each
individual shard can be made small enough to fit into the kernel
page cache and avoid HDD seeks for read activity.
Doing rough tests with a busy system with a 7200 RPM HDD with ext4,
full indexing of LKML (9 epochs) goes from ~80 hours (-j0) to
~30 hours (-j8) with 16GB RAM with 7 shards configured and fsync(2)
disabled (--no-sync) and `--batch-size=10m'.
|
|
"\n" and other characters requiring quoting and/or escaping in
in $GIT_DIR/objects/info/alternates was not supported in git 2.11
and earlier; nor does it seem supported at all in libgit2.
This will allow us to support sharing git-cat-file or similar
endpoints across multiple inboxes via alternates.
This breaks an existing use case for anybody wacky
enough to put `\n' in the `inboxdir' pathname; but I doubt
this affects anybody.
|
|
Since we have IMAP client support in -watch; make sure per-URL
settings are familiar to git users by taking advantage of git's
URL matching abilities.
This requires git 1.8.5+, which most users ought to have
(though base CentOS 7 is on 1.8.3).
|
|
Finish up the IMAP-only portion of iterative config reloading,
which allows us to create all sub-ranges of an inbox up front.
The InboxIdler still uses ->each_inbox which will struggle with
100K inboxes.
Having messages in the top-level newsgroup name of an inbox will
still waste bandwidth for clients which want to do full syncs
once there's a rollover to a new 50K range. So instead, make
every inbox accessible exclusively via 50K slices in the form of
"$NEWSGROUP.$UID_MIN-$UID_END".
This introduces the DummyInbox, which makes $NEWSGROUP
and every parent component a selectable, empty inbox.
This aids navigation with mutt and possibly other MUAs.
Finally, the xt/perf-imap-list maintainer test is broken, now,
so remove it. The grep perlfunc is already proven effective,
and we'll have separate tests for mocking out ~100k inboxes.
|
|
This will be used to prevent reloading a giant config with
tens/hundreds of thousands of inboxes from blocking the event
loop.
|
|
The watchheader key supports only a single value. Supporting multiple
watchheader values was mentioned in discussion [1] of 8d3e3bd8 (doc:
explain publicinbox.<name>.watchheader, 2019-10-09), and it wasn't
clear if there was a need.
One scenario in which matching multiple headers would be convenient is
when someone wants to set up public-inbox archives for some small
projects but does _not_ want to run mailing lists for them, instead
allowing others to follow the project by any of the pull mechanisms.
Using a common underlying address, an address alias for each project
is configured via a third-party email provider, with messages for each
alias being exposed as a separate public-inbox archive. In this
setup, messages for an inbox cannot be selected by a List-ID header
but can be identified by the inbox's address in either the To or Cc
header.
To support such a use case, update the watchheader handling to
consider multiple values, accepting a message if it matches any value.
While selecting a message based on matching _any_ rather than _all_
values is motivated by the above scenario, it's worth noting that the
"any" behavior is consistent with how multiple listid config values
are handled.
[1] https://public-inbox.org/meta/20191010085118.r3amey4cayazfycb@dcvr/
|
|
This allows for a setup where a central config file for the web server
includes per-user config files.
|