|
We don't need to take extra trips through the event loop for a
single message (in the common case of Message-IDs being unique).
In fact, holding the body reference left behind by Email::Simple
could be harmful to memory usage, though in practice it's not a
big problem since code paths which use Email::MIME consume far more.
|
|
And use Exporter to make our life easier, since WwwAltId was
using a non-existent PublicInbox::WwwResponse namespace in
error paths, which didn't get noticed by `perl -c' or exercised
by tests on normal systems.
Fixes: 6512b1245ebc6fe3 ("www: add endpoint to retrieve altid dumps")
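A minimal sketch of the Exporter setup this describes; the module and
function names here are illustrative, not the real public-inbox code:

```perl
package PublicInbox::ExampleUtil;   # illustrative name
use strict;
use warnings;
use parent 'Exporter';
our @EXPORT_OK = qw(greet);         # callers opt in to each name

sub greet { "hello, $_[0]" }

package main;
use strict;
use warnings;
# normally written as: use PublicInbox::ExampleUtil qw(greet);
PublicInbox::ExampleUtil->import('greet');
print greet('world'), "\n";
```

Imported names resolve in the caller's package, so a typo'd package
name can't silently linger in a rarely-hit error path.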
|
|
This makes the error page more consistent.
Not that it really matters since Compress::Raw::Zlib and
IO::Compress packages have been distributed with Perl since
5.10.x. Of course, zlib itself is also a dependency of git.
|
|
Since the introduction of over.sqlite3, SearchMsg is not tied to
our search functionality in any way, so stop confusing ourselves
and future hackers by just calling it "PublicInbox::Smsg".
Add a missing "use" in ExtMsg while we're at it.
|
|
I didn't wait until September to do it, this year!
|
|
We can't pass empty strings to `to_filename' without
triggering warnings, and `to_filename' on an empty string
makes no sense.
|
|
This allows callers to pass named (not anonymous) subs.
Update all retry_reopen callers to use this feature, and
fix some places where we failed to use retry_reopen :x
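A sketch of the calling convention this enables (the reopen-and-retry
detail is elided and the names are illustrative, not the real
public-inbox internals):

```perl
use strict;
use warnings;

# retry_reopen forwards extra arguments to the callback, so callers
# can pass named subs instead of capturing variables in closures:
sub retry_reopen {
    my ($self, $cb, @args) = @_;
    my $ret = eval { $cb->(@args) };
    if ($@) {
        # a real implementation reopens the Xapian DB here, then retries
        $ret = $cb->(@args);
    }
    $ret;
}

sub lookup_mid { # a named sub; no closure needed for $ctx
    my ($ctx, $mid) = @_;
    "$ctx->{inbox}: $mid";
}

my $res = retry_reopen(undef, \&lookup_mid, { inbox => 'test' }, 'a@b');
```

Named subs are compiled once, which also avoids re-creating a closure
on every call.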
|
|
Another place where we can rid ourselves of most anonymous subs
by passing the $ctx arg to the callback.
|
|
This was causing warnings to pop up in syslogs for messages with
empty Subject headers.
|
|
IO::Compress::Gzip is a wrapper around Compress::Raw::Zlib,
anyways, and being able to easily detach buffers to return them
via ->getline is nice. This results in a 1-2% performance
improvement when fetching giant mboxes.
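The core of the approach: drive the raw deflater with a gzip wrapper
and own the output buffer, so it can be detached and handed back to
the caller. A self-contained sketch (buffer handling simplified from
what public-inbox actually does):

```perl
use strict;
use warnings;
use Compress::Raw::Zlib qw(Z_OK Z_FINISH WANT_GZIP);

# WANT_GZIP makes the raw deflater emit a gzip header/trailer itself;
# -AppendOutput lets us accumulate into a buffer we fully control.
my ($gz, $err) = Compress::Raw::Zlib::Deflate->new(
    -WindowBits => WANT_GZIP,
    -AppendOutput => 1,
);
die "deflate init failed: $err" unless $gz;

my $out = '';
$gz->deflate("From: user\@example.com\n\nhello\n", $out) == Z_OK
    or die 'deflate failed';
$gz->flush($out, Z_FINISH) == Z_OK or die 'flush failed';
# $out now holds a complete gzip stream and can be returned as-is
```

Because $out is an ordinary scalar, returning it from a ->getline-style
method hands the bytes off without the copying an IO::Compress::Gzip
object layer implies.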
|
|
It'll make using Compress::Raw::Zlib easier, since we
can use that and import constants more easily.
|
|
We're gradually phasing mid_clean out (in favor of mids()).
|
|
While we avoid generating absolute URLs in most cases, our
"git clone" instructions and URL headers in mboxrd files
contain full URLs.
So do the same thing we do for WwwAtomStream and pre-generate
the full URL before Plack::App::URLMap changes $env->{PATH_INFO}
and $env->{SCRIPT_NAME} back to their original values.
Reported-by: edef <edef@edef.eu>
Link: https://public-inbox.org/meta/cover.0f97c47bb88db8b875be7497289d8fedd3b11991.1569296942.git-series.edef@edef.eu/
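The pre-generation step amounts to snapshotting the absolute URL from
the PSGI $env while SCRIPT_NAME still reflects the mapped location.
The $env fields below are standard PSGI; the function name is
illustrative:

```perl
use strict;
use warnings;

sub base_url {
    my ($env) = @_;
    my $scheme = $env->{'psgi.url_scheme'};
    my $host = $env->{HTTP_HOST} //
        "$env->{SERVER_NAME}:$env->{SERVER_PORT}";
    # capture SCRIPT_NAME *before* Plack::App::URLMap restores it
    "$scheme://$host$env->{SCRIPT_NAME}";
}

my $url = base_url({
    'psgi.url_scheme' => 'https',
    HTTP_HOST => 'example.com',
    SCRIPT_NAME => '/inbox.test',
});
```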
|
|
qmail.org seems unavailable.
|
|
|
|
When dealing with ~30MB messages, we can save another ~30MB by
splitting the header and body processing and not appending the
body string back to the header.
We'll rely on buffering in gzip or the kernel (via MSG_MORE)
to prevent silly packet sizes.
|
|
Email::Simple->new will split the head from the body in-place,
and we can avoid using Email::Simple::body. This saves us from
holding an extra copy of the message in memory, saving around
30MB when operating on ~30MB messages.
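The idea behind the in-place split, illustrated without Email::Simple:
locate the blank line and detach the body from the original buffer with
4-arg substr(), rather than copying header and body into fresh strings
alongside the full message:

```perl
use strict;
use warnings;

my $raw = "Subject: hi\nMessage-ID: <a\@b>\n\nbody line 1\nbody line 2\n";
my $idx = index($raw, "\n\n");  # end of the header block
# 4-arg substr removes the body from $raw and returns it, so only
# one copy of the body bytes ever exists:
my $body = substr($raw, $idx + 2, length($raw), '');
# $raw now holds only the header block (ending with the blank line)
```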
|
|
We don't need to rely on Xapian search functionality for the
majority of the WWW code, even. subject_normalized is moved to
SearchMsg, where it (probably) makes more sense, anyways.
|
|
Hopefully this helps people familiarize themselves with
the source code.
|
|
We only need to call get_thread beyond 1000 messages for
fetching entire mboxes. It's probably too much for the HTML
display otherwise.
|
|
Favor simpler internal APIs this time around; this cuts
a fair amount of code out and takes another step towards
removing Xapian as a dependency for v2 repos.
|
|
Sorting large msets is a waste when it comes to mboxes
since MUAs should thread and sort them as the user desires.
This forces us to rework each of the mbox download mechanisms
to be more independent of each other, but might make things
easier to reason about.
|
|
The previous batch interface was overly complicated; replace it
with id_batch, which is simpler and takes advantage of
selectcol_arrayref in DBI. This allows simplification of
callers, and the diffstat agrees with me.
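selectcol_arrayref returns the first column of every row as a single
array ref, so a batch of article numbers is one call. A sketch with
an in-memory SQLite DB; the table and column names are illustrative:

```perl
use strict;
use warnings;
use DBI;  # requires DBD::SQLite for this example

my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
    { RaiseError => 1, AutoCommit => 1 });
$dbh->do('CREATE TABLE over (num INTEGER PRIMARY KEY)');
$dbh->do("INSERT INTO over (num) VALUES ($_)") for (1..5);

# fetch the next batch of ids after a given number, in one call:
my $ids = $dbh->selectcol_arrayref(
    'SELECT num FROM over WHERE num > ? ORDER BY num LIMIT 1000',
    undef, 2);
# $ids is an array ref of the first column of each row
```

Callers then just iterate @$ids instead of managing statement handles
and fetch loops themselves.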
|
|
We can use id_batch in the common case to speed up full mbox
retrievals. Gigantic msets are still a problem, but will
be fixed in future commits.
|
|
In many cases, we do not care about the total number of
messages. It's a rather expensive operation in SQLite
(Xapian only provides an estimate).
For LKML, this brings top-level /$INBOX/ loading time from
~375ms to around 60ms on my system. Days ago, this operation
was taking 800-900ms(!) for me before introducing the SQLite
overview DB.
|
|
We can avoid a small amount of overhead and use the "preferred"
Message-ID based on what is in the SearchMsg object.
|
|
We do not need to care about ghosts at multiple call sites; they
cannot have a {blob} field and we've stored the blob field in
Xapian since SCHEMA_VERSION=13.
|
|
This needs tests and further refinement, but current tests pass.
|
|
Since v2 supports duplicate messages, we need to support
looking up different messages with the same Message-Id.
Fortunately, our "raw" endpoint has always been mboxrd,
so users won't need to change their parsing tools.
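mboxrd's reversible quoting is what makes the "raw" endpoint safe to
parse even with duplicate messages concatenated. A sketch of the
escaping rule (not the exact public-inbox code):

```perl
use strict;
use warnings;

# mboxrd: any body line already matching /^>*From / gets one more '>'
# prepended, so "From " at line-start can never be mistaken for a
# message delimiter, and unescaping (stripping one '>') is lossless.
sub mboxrd_escape {
    my ($body) = @_;
    $body =~ s/^(>*From )/>$1/gm;
    $body;
}

my $escaped = mboxrd_escape("From here\n>From there\nplain\n");
```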
|
|
Using update-copyrights from gnulib.
While we're at it, use the SPDX identifier for AGPL-3.0+ to
ease mechanical processing.
|
|
Allowing downloading of all search results as a gzipped mboxrd
file can be convenient for some users.
|
|
This is hopefully more sensible than naming the resulting
downloads "raw".
|
|
Sigh, yet another place to handle obfuscation for misguided
people who expect it. Maybe this will do something to prevent
spammers from getting addresses, while still allowing the
"curl $URL | git am" use case to work.
|
|
This makes life easier for the threading algorithm, as we can
use the implied ordering of timestamps to avoid temporary ghosts
and the resulting container vivification.
This would've also allowed us to hide the bug (in most cases)
fixed by the patch titled "thread: last Reference always wins",
in case that needs to be reverted due to infinite looping.
|
|
Based on reading RFC 3986, it seems '@', ':', '!', '$', '&',
"'", '(', ')', '*', '+', ',', ';', '=' are all allowed
in path-absolute where we have the Message-ID.
In any case, it seems '@' is fairly common in path components
nowadays and too common in Message-IDs.
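An escaping sketch along those lines: leave the characters RFC 3986
permits in a path segment (unreserved, sub-delims, ':' and '@')
untouched, and still escape '/' since the Message-ID occupies a single
segment. Illustrative, not the exact public-inbox escaping rules:

```perl
use strict;
use warnings;

sub mid_escape {
    my ($mid) = @_;
    # percent-encode everything outside unreserved + sub-delims + ':' '@'
    $mid =~ s/([^A-Za-z0-9\-._~!\$&'()*+,;=:@])/
        sprintf('%%%02X', ord($1))/ge;
    $mid;
}

my $path = '/test/' . mid_escape('a@example/b c') . '/';
```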
|
|
At least for public-inbox-httpd, this allows us to avoid having
a client monopolize one event loop tick of the server for too
long. It hurts throughput for the /all.mbox.gz endpoint, but I
doubt anybody cares and the latency improvement for other
clients would be appreciated.
We already do the same fairness thing for HTML pages.
|
|
Doing git tree lookups based on the SHA-1 of the Message-ID
is expensive as trees get larger; instead, use the SHA-1
object ID directly. This drastically reduces the amount
of time spent in the "git cat-file --batch" process for
fetching the /$INBOX/all.mbox.gz endpoint on the ~800MB
git@vger.kernel.org mirror.
This retains backwards compatibility and allows existing
indices to be transparently upgraded without performance
degradation.
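For context: "git cat-file --batch" answers each request line with a
"&lt;oid&gt; &lt;type&gt; &lt;size&gt;" header followed by the object body, so
requesting by blob OID skips the tree walk entirely. A parsing sketch
for that header (function name illustrative):

```perl
use strict;
use warnings;

sub parse_batch_header {
    my ($line) = @_;   # e.g. "<oid> <type> <size>\n"
    chomp $line;
    my ($oid, $type, $size) = split(/ /, $line);
    # unknown objects come back as "<oid> missing" with no size field
    return undef if !defined($size);
    { oid => $oid, type => $type, size => $size + 0 };
}

my $hdr = parse_batch_header(
    "3b18e512dba79e4c8300dd08aeb37f8e728b8dad blob 12\n");
```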
|
|
Hopefully this can reduce memory overhead for people that
use one-shot CGI.
|
|
This is lighter and we can work further towards eliminating
our Plack::Request dependency entirely.
|
|
We want to avoid sending 10 or 20-byte gzip headers as
separate TCP packets to reduce syscalls and avoid wasting
bandwidth.
|
|
Favor Inbox objects as our primary source of truth to simplify
our code. This increases our coupling with PSGI to make it
easier to write tests in the future.
A lot of this code was originally designed to be usable
standalone without PSGI or CGI at all; but that might increase
development effort.
|
|
Prefer to return strings instead, so Content-Length can be
calculated for caching and such.
|
|
We do not need feed options there (or anywhere, hopefully).
|
|
This allows consistency between different invocations from
roughly the same period and is no worse for caching than any of
our existing HTML and Atom feeds.
We cannot set the timestamp to the end date since messages
may be added to the repository while we are iterating
(and this streaming mechanism will pick them up).
|
|
This allows us to easily provide gigantic inboxes
with proper backpressure handling for slow clients.
It also eliminates public-inbox-httpd and Danga::Socket-specific
knowledge from this class, making it easier to follow for
those used to generic PSGI applications.
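The generic-PSGI shape this refers to: the app returns a coderef and
writes the mbox chunk-by-chunk through the server-supplied writer, so
any PSGI server can apply its own backpressure to slow clients. A
minimal sketch with a stand-in chunk source, not the real code:

```perl
use strict;
use warnings;

my @chunks = ("From a\@b Thu Jan  1 00:00:00 1970\n", "Subject: hi\n\n");
my $app = sub {
    my ($env) = @_;
    sub { # PSGI "delayed/streaming" response
        my ($responder) = @_;
        my $w = $responder->([200, ['Content-Type', 'application/mbox']]);
        $w->write($_) for @chunks;
        $w->close;
    };
};
```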
|
|
Allows easily downloading the entire archive without
special tools. In any case, it's not yet advertised via
HTML until we can test it better. It'll also support range
queries in the future to avoid wasting bandwidth.
|
|
This should make validating the output easier
when testing between different servers.
|
|
This allows messages to be read in chronological order when
read without a mail client (e.g. with "zcat t.mbox.gz | less").
|
|
When serving archives, it's more robust to keep existing
archive links working if one server goes down.
|
|
This may be necessary for compatibility with non-mboxrd aware
parsers which expect "\nFrom " for everything but the first
record.
|