Date | Commit message |
|
We don't need to take extra trips through the event loop for a
single message (in the common case of Message-IDs being unique).
In fact, holding the body reference left behind by Email::Simple
could be harmful to memory usage, though in practice it's not a
big problem since code paths which use Email::MIME take far more.
|
|
Using `undef EXPR' like a function call actually frees the heap
memory associated with the scalar, whereas `$sv = undef' or
`$sv = ""' will hold the buffer around until $sv goes out
of scope.
The `sv_set_undef' documentation in the perlapi(1) manpage
explicitly states this:

    The perl equivalent is "$sv = undef;". Note that it doesn't
    free any string buffer, unlike "undef $sv".

And I've confirmed by reading Dump() output from Devel::Peek.
We'll also inline the old index_body sub in SearchIdx.pm to make
the scope of the scalar more obvious.
This change saves several hundred kB RSS on both -index and
-httpd when hitting large emails with thousands of lines.
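
A minimal sketch of the difference, assuming the core B module
is an acceptable stand-in for Devel::Peek here (B is used only
because its LEN accessor can be read programmatically). The sub
names are hypothetical and not part of public-inbox:

```perl
use strict;
use warnings;
use B ();

# Allocated size of the string buffer behind a scalar (SvLEN),
# or 0 if the scalar has no string buffer at all.
sub buf_len {
	my ($ref) = @_;
	my $b = B::svref_2object($ref);
	$b->can('LEN') ? $b->LEN : 0;
}

# Fill a scalar with a large string, then release it with `undef EXPR'.
sub len_after_undef_expr {
	my $sv = 'x' x (1 << 20);
	undef $sv;
	buf_len(\$sv);
}

# Same, but assign undef instead; perlapi says the buffer is kept.
sub len_after_assign_undef {
	my $sv = 'x' x (1 << 20);
	$sv = undef;
	buf_len(\$sv);
}

printf "undef \$sv:   %d bytes\n", len_after_undef_expr();
printf "\$sv = undef: %d bytes\n", len_after_assign_undef();
```

The second number may vary between Perl versions and with
copy-on-write strings, so treat it as something to inspect
rather than a guarantee.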
|
|
commit d857e7dc0d816b635a7ead09c3273f8c2d2434be
("msgtime: assume +0000 if TZ missing when using Date::Parse")
introduced a behavior change which causes false positives
when compared to the old code.
Update the "old" implementation to match this overdue behavior
change.
|
|
We don't want to confuse intermediate caches into serving
gzipped content to any clients which can't handle it. It
probably doesn't matter in practice, since every HTTP client
seems to handle "Content-Encoding: gzip" regardless of whether
it was requested, but I'd expect some nc/socat/telnet/s_client
users to be annoyed.
This also matches the behavior of Plack::Middleware::Deflater
and other deflater implementations.
|
|
No point in having an extra sub for a short, commonly
called function in the same file.
|
|
We're slowly moving towards doing all of our output buffering
into a single buffer, so passing that around on the stack as
a dedicated parameter is confusing.
|
|
While rare in practice (even by spammers), a single "0" could
theoretically be the entire contents of a Subject line. So
use the Perl 5.10+ defined-or operator to improve correctness
of subject deduplication.
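
A small sketch of why `//' matters here. The sub names and the
placeholder string are hypothetical and only illustrate the
operator; they are not the actual deduplication code:

```perl
use strict;
use warnings;

# Fall back to a placeholder only when the Subject is undefined,
# so a literal "0" survives.
sub subject_key {
	my ($subject) = @_;
	$subject // '(no subject)';
}

# The pre-5.10 idiom, shown for contrast: "0" and "" are false in
# Perl, so both are replaced by the placeholder.
sub subject_key_old {
	my ($subject) = @_;
	$subject || '(no subject)';
}
```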
|
|
We depend on Perl 5.10 features in other places. Shorten the
lifetime of the `$desc' scalar while we're at it.
|
|
Clarify that we're assuming the text is UTF-8, since users
may have no idea how it's mangled.
|
|
We can't rely on Email::MIME noticing the change to our
scalar ref after calling `PublicInbox::MIME->new'.
This is because Email::MIME::body_set (unlike
Email::Simple::body_set) will copy the contents of the body into
`->{body_raw}' as a new scalar.
Furthermore, we need to escape multiple From lines in the body,
not just the first one, using the `g' modifier to `s//'.
Reported-by: Kyle Meyer <kyle@kyleam.com>
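
To illustrate the `g' modifier part, here is a minimal sketch
which assumes mboxrd-style escaping (prefix any line matching
/^>*From / with one more ">"). The sub name and exact pattern
are assumptions for illustration:

```perl
use strict;
use warnings;

# Escape every From line in a body, not just the first one.
# /m lets ^ match at each line start, /g repeats the substitution.
sub escape_from_lines {
	my ($body) = @_;
	$body =~ s/^(>*From )/>$1/gm;
	$body;
}

# Without /g, only the first matching line is escaped.
sub escape_first_from_only {
	my ($body) = @_;
	$body =~ s/^(>*From )/>$1/m;
	$body;
}
```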
|
|
These seem mostly harmless since Perl will just truncate the
match and start a new one on a newline boundary in our case.
The only downside is we'd end up with redundant <span> tags in
HTML.
Limiting the number of lines matched ourselves with `{1,$NUM}'
doesn't seem prudent since lines vary in length, so we continue
to defer the job of limiting matches to the Perl regexp engine.
I've noticed this warning in practice on 100K+ line patches to
locale data.
|
|
There may be no topics for a given timestamp range,
so don't attempt to treat `undef' as an arrayref.
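
One way to guard against this, sketched with a hypothetical
helper (the real code may simply return early instead):

```perl
use strict;
use warnings;

# $topics may be undef when nothing falls inside the range.
# Under `use strict', @$topics would die with "Can't use an
# undefined value as an ARRAY reference" in that case.
sub topic_count {
	my ($topics) = @_;
	scalar @{$topics // []};
}
```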
|
|
While this is not a known problem in practice,
RFC 3977 section 3.1 states:

    Keywords and arguments MUST each be separated by one
    or more space or TAB characters.
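
A minimal sketch of splitting a command line by that rule. The
sub name is hypothetical and this is not the actual NNTP request
parser:

```perl
use strict;
use warnings;

# Split an NNTP command line into keyword and arguments on runs
# of space or TAB characters, after dropping the line terminator.
sub split_nntp_line {
	my ($line) = @_;
	$line =~ s/\r?\n\z//;
	split(/[ \t]+/, $line);
}
```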
|
|
This allows us to consistently enforce the same Message-ID
extraction rules everywhere and makes it easier for us to
make changes in the future.
Update scripts/ssoma-replay, as well, but don't rely on
PublicInbox::* modules in that since it's legacy and
public-inbox was never a dependency of ssoma.
|
|
We do not need to run mid_clean() since mid_mime() uses mids()
to extract the msgid from inside the angle brackets.
|
|
No need to keep an extra sub which isn't called anywhere else,
and the mid_clean call is redundant since mid_mime already
plucks the msgid out of the angle brackets.
|
|
It may not be immediately obvious why we should value text-based
stuff so much, so clarify that.
|
|
There will probably be a 1.4 release in a few days...
|
|
Message-IDs can apparently contain spaces and other weird
characters. Ensure we pass those properly to shard subprocesses
when importing messages in parallel mode.
Our NNTP request parser does not deal with spaces in the
Message-ID, yet, and I don't expect most NNTP clients to,
either. Nor does the Net::NNTP client handle them in responses.
|
|
While the v1 inbox in this test is created without Xapian,
the v2 inbox in this test defaults to having Xapian enabled
regardless of whether it's installed or not.
Fixes: c7acdfe78bda5bf3 ("v2: SDBM-based multi Message-ID queue")
|
|
git-cat-file(1) may return less than the $BIN_DETECT value for
some blobs, so ensure we repopulate the values in $ctx for
retries in that case, otherwise we'll lose `$ctx->{-res}' and
die when attempting to use `undef' as an array ref.
|
|
User-supplied callbacks may fail, so capture the error instead
of propagating it up the stack into the public-inbox-httpd event
loop.
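
The general pattern can be sketched as follows. The sub name is
hypothetical and the real code likely logs the error rather than
returning it:

```perl
use strict;
use warnings;

# Run a user-supplied callback without letting an exception
# escape. Returns undef on success, or the error it died with.
sub try_callback {
	my ($cb, @args) = @_;
	eval { $cb->(@args); 1 } ? undef : ($@ || 'unknown error');
}
```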
|
|
And use Exporter to make our life easier, since WwwAltId was
using a non-existent PublicInbox::WwwResponse namespace in error
paths which doesn't get noticed by `perl -c' or exercised by
tests on normal systems.
Fixes: 6512b1245ebc6fe3 ("www: add endpoint to retrieve altid dumps")
|
|
The "-" was never supported by Xapian in the prefix, but
it could still be used to make documentation and URLs more
readable in certain cases.
Fixes: 7909c5f7439777e3 ("altid: warn about non-word prefixes")
|
|
It's more convenient to specify `-c' / `--compact' on the
command-line when reindexing than it is to invoke
public-inbox-compact(1) separately.
This is especially convenient in low-space situations when
public-inbox-index is operating on multiple inboxes
sequentially, as compaction can happen immediately after
indexing each inbox, instead of waiting until all inboxes are
indexed.
|
|
For sharded v2 repositories with few-enough messages, it is
possible for shard[0] to go unused and never trigger the
->commit_txn_lazy to set the indexlevel field in Xapian
metadata.
So set it immediately at initialization and avoid this case.
While we're at it, avoid triggering needless pwrite syscalls
from ->set_metadata by checking with ->get_metadata, first.
|
|
This allows for a setup where a central config file for the web server
includes per-user config files.
|
|
Seeing the example config linkified, some users may inevitably
try to follow it in a browser with a GET request. Provide
a helpful message to inform users to use POST instead of
attempting to treat /$INBOX/$ALTID.sql.gz as a Message-Id.
|
|
Exposing altid dumps will help ensure total reproducibility
of existing instances.
AFAIK, sqlite3(1) can't execute arbitrary code, so it's not
quite as fashionable as the "curl | bash" stuff the cool people
are doing, these days :P
|
|
We want to be able to preload that, as well as to access it
in WwwText for a config comment in the config example.
|
|
This ensures all our indexed data, including data from altid
searches (e.g. "gmane:$ARTNUM") is retrievable.
It uses a "POST" request to avoid wasting cycles when invoked by
crawlers, since it could potentially be several megabytes of
data not indexable by search engines.
|
|
We only support searching on prefixes matching /\A\w+\z/ because
Xapian requires ':' to delimit the prefix and splits on spaces
without quotes.
I've also verified Xapian supports multibyte UTF-8 characters,
underscores, and bare numbers as search prefixes, so there's
no need to restrict it beyond what Perl's UTF-8 aware \w
character class offers.
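
The check itself is tiny. A sketch with a hypothetical sub name:

```perl
use strict;
use warnings;

# True if $prefix is usable as a search prefix: one or more word
# characters and nothing else (no ':', '-', or whitespace).
sub valid_altid_prefix {
	my ($prefix) = @_;
	defined($prefix) && $prefix =~ /\A\w+\z/ ? 1 : 0;
}
```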
|
|
And show contact info when there's no indexing, at all.
Installations where Xapian is too expensive can still support
threading since it only depends on SQLite, so we need to inform
users of what's available.
|
|
While we don't currently reinitialize the query parser for
the lifetime of a PublicInbox::Search object and have no plans
to, it's incorrect to be appending to an existing array in
case we reinitialize the query parser in the future.
|
|
As sqlite3(1) and other executables may become unavailable or
uninstalled while a daemon runs, we need to gracefully handle
errors in those cases.
|
|
This makes the error page more consistent.
Not that it really matters since Compress::Raw::Zlib and
IO::Compress packages have been distributed with Perl since
5.10.x. Of course, zlib itself is also a dependency of git.
|
|
PublicInbox::HTTP will chunk, otherwise, and that's
extra overhead which isn't needed.
|
|
No reason to use the ->getline interface for small responses.
|
|
The ->getline API is only useful for limiting memory use when
streaming responses containing multiple emails or log messages.
However it's unnecessary complexity and overhead for callers
(PublicInbox::HTTP) when there's only a single message.
|
|
zlib contexts are memory-intensive, particularly when used for
compression. Since the gzip filter may be sitting in a limiter
queue for a long period, delay the allocation until we actually
have data to translate, and not a moment sooner.
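
A rough sketch of the lazy allocation idea using
Compress::Raw::Zlib. The LazyGzip package and its method names
are hypothetical and do not mirror the real gzip filter:

```perl
package LazyGzip; # hypothetical name, not a public-inbox class
use strict;
use warnings;
use Compress::Raw::Zlib qw(Z_OK);

# Creating the filter allocates no zlib context.
sub new { bless { gz => undef }, shift }

# True once the zlib context exists.
sub allocated { defined $_[0]->{gz} }

# Compress one chunk, creating the zlib context on first use.
sub translate {
	my ($self, $data) = @_;
	$self->{gz} //= do {
		# 15 + 16 selects the maximum window with a gzip header
		my ($gz, $err) = Compress::Raw::Zlib::Deflate->new(
			-WindowBits => 15 + 16, -AppendOutput => 1);
		defined $gz or die "deflateInit failed: $err";
		$gz;
	};
	my $out = '';
	$self->{gz}->deflate($data, $out) == Z_OK or die 'deflate failed';
	$out;
}

# Flush the remaining output and release the context.
# Returns '' if no data was ever translated.
sub finish {
	my ($self) = @_;
	my $gz = delete $self->{gz} or return '';
	my $out = '';
	$gz->flush($out) == Z_OK or die 'flush failed';
	$out;
}

package main;
```

A filter object built this way can sit in a queue cheaply, since
the expensive state only appears on the first ->translate call.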
|
|
We'll be supporting gzipped output from sqlite3(1) dumps
for altid files in future commits.
In the future (and if we survive), we may replace
Plack::Middleware::Deflater with our own GzipFilter to work
better with asynchronous responses without relying on
memory-intensive anonymous subs.
|
|
We need to track the PID file having ".oldbin" appended
to it while a SIGUSR2 upgrade is in progress and ensure
it is unlinked on SIGQUIT.
|
|
Disabling workers via `-W0' blesses the contents of the
@listeners array, so we need to ensure we call fcntl on
the GLOB ref in ->{sock}.
Add tests to ensure USR2 works regardless of whether workers
are enabled or not.
|
|
This lets us store author and committer times for deferred
indexing messages with ambiguous Message-IDs. This allows
us to reproducibly reindex messages with the git commit
and author times when a rare message lacks Received and/or
Date headers while having ambiguous Message-IDs.
|
|
We can finally get rid of the awkward, ad-hoc use of V2Writable,
SearchIdx, and OverIdx args for passing {cotime} and {autime}
between classes.
We'll still use those git time fields internally within
V2Writable and SearchIdx for (re)indexing, but that's not
worth avoiding as a fallback.
|
|
We can pass fewer order-dependent args to V2Writable::do_idx and
SearchIdxShard::index_raw by passing the smsg object, instead.
|
|
We can pass blessed PublicInbox::Smsg objects to internal
indexing APIs instead of having long parameter lists in some
places. The end goal is to avoid parsing redundant information
each step of the way and hopefully make things more
understandable.
|
|
Favor `$smsg->{mid}' instead of `$mid0' to reduce parameters
down-the-line, but favor passing the Email::MIME::Header object
around instead of relying on the bloat-prone `$smsg->{mime}'
and calling ->header_obj on it.
|
|
No need to pass extra parameters to this method, since
smsg has universal meanings for {blob} and {mid}.
|