Date | Commit message |
|
We need to abort both check-only and cat requests when
aborting, since we'll be aborting more aggressively
in read-write paths.
|
|
Some code paths may use maximum size checks, so ensure
any checks are waited on, too.
|
|
This may fix some extindex problems and should get rid of
the "Can't bless non-reference value" errors.
|
|
The use of `substr' here as an argument to `print' was causing Perl
to internally cache its target buffer. Since `syswrite()'
already offers a buffer offset arg and length limits, just use
`syswrite' directly. We were using autoflush anyways, so the
lack of buffering was of no concern performance-wise.
The target buffer could grow to roughly 10MB under some loads.
This was usually a cold path, yet the cached buffer held memory
which could be neither released nor reused in other places.
note: IO::Handle::write uses `substr' internally, too;
so nothing would be gained using IO::Handle::write.
|
|
The regexp in split_quotes relies on the presence of a
final "\n", so add it wherever we need to instead of
making it the responsibility of every caller.
This probably doesn't matter in practice since every
email seems to have a "\n" as the final byte (due to
the way SMTP works), but maybe there's some odd ones
that'll get imported via lei.
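A minimal sketch of the idea, written in Python for brevity rather
than the Perl used by the project. The regexp below is a hypothetical
stand-in, not the one split_quotes actually uses; it only shows why a
line-oriented pattern wants a final "\n", and how appending one inside
the function spares every caller from doing so:

```python
import re

# Hypothetical pattern: a quoted chunk is a run of complete ">" lines,
# each of which must end in "\n" to match.
QUOTE_RUNS = re.compile(r"((?:^>[^\n]*\n)+)", re.MULTILINE)

def split_quotes(body):
    """Split body into alternating non-quoted and quoted chunks."""
    # Add the final "\n" here so no caller has to remember it.
    if not body.endswith("\n"):
        body += "\n"
    return [part for part in QUOTE_RUNS.split(body) if part]
```

Without the appended "\n", a last line such as "> b" would not match
this pattern and would land in a non-quoted chunk.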
|
|
This should bring us closer to the "Base subject" definition in
IMAP ORDEREDSUBJECT (RFC 5256 2.1). Larger changes may cause
some breakage (until --reindex). But for now, a reindex will
prevent non-ASCII subjects from being normalized to the
same fuzzy "thread" in the thread view.
|
|
We need to ensure -extindex --gc runs don't prevent other
work from happening in the meantime. I actually caused
my -extindex to OOM due to the lack of checkpoints :x
We'll also hoist out the shard scanning into its own sub
in preparation for lei/store usage.
|
|
This helped me quickly reproduce a bug in Encode[1] and
will help me determine performance implications of workarounds
for the aforementioned bug.
[1] https://rt.cpan.org/Public/Bug/Display.html?id=139622
|
|
This lets administrators reindex specific time ranges
according to git "approxidate" formats. These arguments
are passed directly to underlying git-log(1) invocations
and may still reach into old epochs.
Since these options rely on git committer dates (which we infer
from the most recent Received: header), they are not guaranteed
to be strictly tied to git history and it's possible to
over/under-reindex some messages. It's probably not a major
problem in practice, though; reindexing a few extra messages
is generally harmless aside from some extra device wear.
Since this currently relies on git-log, these options do not
affect -extindex, yet.
|
|
As with most of our internal-only code, favor smaller
comparisons to reduce memory traffic.
|
|
`shard_remove_eidx_info' was made unnecessary with commit
82b805db3ad9 (searchidxshard: IPC conversion, part 2, 2021-01-03)
and we now call `remove_eidx_info' directly.
|
|
Both read(2) on inotify and kevent(2) return a finite number of
events. Let the kernel notify us again in cases where we'd
need to retry instead of looping ourselves. This can prevent
missed/delayed notifications while still ensuring fairness in
busy event loops.
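A rough Python illustration of the pattern, using a pipe and the
stdlib selectors module in place of inotify or kevent. MAX_READ and
drain_with_renotify are made-up names, and the 4-byte limit is an
assumption standing in for a finite event buffer:

```python
import os
import selectors

MAX_READ = 4  # assumed per-wakeup limit, not a value from the real code

def drain_with_renotify(rfd):
    """Read at most MAX_READ bytes per readiness notification."""
    chunks = []
    sel = selectors.DefaultSelector()
    sel.register(rfd, selectors.EVENT_READ)
    try:
        # Level-triggered readiness: the kernel reports rfd again while
        # unread data remains, so we never loop on read() ourselves.
        while sel.select(timeout=0):
            chunks.append(os.read(rfd, MAX_READ))
            # A real event loop would service other descriptors here.
    finally:
        sel.close()
    return chunks
```

Each pass through the loop does one bounded read and then returns to
the selector, which is where other watchers would get their turn.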
|
|
Making them immortal doesn't seem worth it, since doing immortal
allocations after process startup leads to fragmentation. While
the allocations made by highlight are small, those small
allocations can break up contiguous regions and prevent
consolidation by the malloc implementation.
Since instantiating code generators doesn't seem too expensive,
just use and delete them ASAP.
|
|
Unlike v1 inboxes (which don't accept duplicate Message-IDs at
all), and v2 inboxes (which generate a new Message-ID for
duplicates), extindex must accept duplicate Message-IDs as-is.
This was fine for storage, but prevented the reference-cycle
mechanism of our message threading display algorithm from working
reliably. It could no longer delete the ->{parent} field from
clobbered entries in the %id_table.
So we now take into account reused Message-IDs and never clobber
entries in %id_table. Instead, we mark reused Message-IDs as
"imposters" and special-case them by injecting them as children
after all other threading is complete.
This cycle was noticed using a pre-release of Devel::Mwrap::PSGI:
https://80x24.org/mwrap-perl.git
|
|
We only use it if Mail::Thread is available, and often it's not.
|
|
I'm not sure why they weren't emitted earlier.
|
|
This should prevent some false duplicates. I ran into this
while implementing "lei mail-diff", and only noticed it once
I implemented the ContentDigestDbg wrapper for mail-diff.
|
|
This is useful in finding the cause of deduplication bugs,
and possibly the cause of missing threads reported by
Konstantin in <20211001130527.z7eivotlgqbgetzz@meerkat.local>
Usage:

	u=https://yhbt.net/lore/all/87czop5j33.fsf@tynnyri.adurom.net/raw
	lei mail-diff $u
|
|
It doesn't seem to matter, actually, but this matches the
behavior of attach_inbox and the comment in ->new.
|
|
This fixes inspect for uninitialized instances, and adds Xapian
("xdoc") output if available.
Reported-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org>
Message-ID: <20211001204943.l4yl6xvc45c5eapz@meerkat.local>
|
|
These are always numeric, but none of the Perl code cares;
we want to prevent JSON from quoting them.
|
|
When indexing a single inbox, do not attempt reindexing code
paths without a full config, otherwise ordering comparisons
won't work.
|
|
In case users see "lei-daemon" in ps(1) or syslog and wonder.
Helped-by: Kyle Meyer <kyle@kyleam.com>
|
|
I'm thinking we can drop support for Linux <2.6.27 soonish and
just use EPOLL_CLOEXEC. Perl without signalfd (or
EVFILT_SIGNAL) is miserable, actually.
|
|
Having git processes outlive DB handles is likely to hurt
from a fragmentation perspective if the DB handle needs to
be recreated immediately due to a git->cat_async callback.
So only unref DB handles when we're sure there's no live
git users left, otherwise check the inodes.
We'll also avoid needless localization checks in git->cleanup
and make the return value more obvious since the pid fields are
unconditionally deleted nowadays.
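One way "check the inodes" can look, sketched in Python rather than
Perl. file_was_replaced is a hypothetical helper, not a function from
this codebase. It compares the file behind an open descriptor with
whatever currently lives at the path:

```python
import os

def file_was_replaced(fd, path):
    """True if path no longer refers to the file held open as fd."""
    try:
        current = os.stat(path)
    except FileNotFoundError:
        return True  # removed outright, so the handle is stale
    held = os.fstat(fd)
    # Same device and inode means the path still names our open file.
    return (held.st_dev, held.st_ino) != (current.st_dev, current.st_ino)
```

A handle for which this returns true is worth dropping even while git
users are still live, since reopening would see different data.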
|
|
It was probably incorrect to use it from max_git_epoch, and it's
small enough to inline into do_cleanup. We'll also eliminate
the unnecessary deletion of {-altid_map} while we're in the
area, since we no longer cache/memoize that.
Fixes: 7e5cea05f061e757 ("inbox: rewrite cleanup to be more aggressive")
|
|
Since signalfd is often combined with our event loop, give it a
convenient API and reduce the code duplication required to use it.
EventLoop is replaced with ::event_loop to allow consistent
parameter passing and avoid needlessly passing the package name
on stack.
We also avoid exporting SFD_NONBLOCK since it's the only flag we
support. There's no sense in having the memory overhead of a
constant function when it's in cold code.
|
|
Currently we don't use OpenSSL from child processes of parents
which use OpenSSL, but we may in the future. So ensure OpenSSL
initializes its PRNG after these forks to avoid one security
pitfall down the line.
|
|
Constant subroutines use more memory and there's no need to
optimize it for inlining since it's only used at startup.
|
|
On second thought, the ->requeue + accept retry code path isn't
worth the userspace complexity and overhead. Level-triggered
epoll has always annoyed me since it takes an inefficient code
path in the kernel; but taking our less-efficient code path in
Perl seems even worse. We also need to take load distribution
into account for multi-worker systems.
|
|
Virtual users will probably be used for read-write IMAP/JMAP
support. The potential for various kernel/hardware bugs and
attacks also needs to be highlighted.
|
|
This improves the "&x=t" navigation between the thread overview
(skeleton) section at the bottom and jumping back to the top for
the mbox download form. The "--links below ..." text ought to
be helpful for users unfamiliar with the /$MSGID/T/ and /$MSGID/t/
views.
|
|
Long pathnames are difficult to read and distinguish in ps(1)
output. Deep paths can also slow down pathname resolution
when dealing with loose objects, so we put "cat-file --batch"
deeper into the directory tree.
Since v2 processes are in the form of $INBOXDIR/all.git, keep
the basename of $INBOXDIR in --git-dir= so it's easy to
distinguish between processes just by looking at ps(1).
While "git -C" also exists, it's only present in git 1.8.5+.
We also need to keep in mind the "directory" pointed to by
--git-dir= need not be a directory (nor a symlink pointing
to one).
This reduces pathname resolution overhead for v1 and v2 inbox
git processes, but unfortunately not for extindex since that
needs to store alternates as absolute paths.
|
|
add_uniq_timer seems sufficient, and we'll drop the last
user of ::later (IMAP) and switch to unique timers.
|
|
While it doesn't look like $EXPMAP can be populated in
non-obvious ways via ->DESTROY, it still makes sense to keep it
close to some of our other code around cleanup to reduce
the likelihood of subtle bugs in case semantics change.
|
|
'git diff --abbrev=40' did not abbreviate /^index / lines of
diff output with git <2.29, and 40 will be insufficient for
SHA-256. --full-index has been around since 2005, so it's safe
to rely on.
Tested git version 2.20.0 (Debian buster).
Fixes: 751df49e7db8ba77 ("lei rediff: add --drq and --dequote-only")
|
|
This caused config->repo_objs to not fill in {-repo_objs}
properly before starting solver.
Reported-by: Kyle Meyer <kyle@kyleam.com>
Link: https://public-inbox.org/meta/87o88cqobd.fsf@kyleam.com/
Fixes: 63d7b8ceee55a34 ("daemons: revamp periodic cleanup task")
|
|
cloneurl, description, and base_url are no longer memoized. The
non-$env form of base_url is rare in WWW, and is fast enough to
not require memoization.
cloneurl and description are now expired during cleanup,
allowing admins to change these files without restarting
(or SIGHUP).
-altid_map is no longer cached nor memoized at all, since the
endpoint(s) which hit it seem rarely accessed.
nntp_url and imap_url are now cached (instead of memoized) in
case an inbox is unvisited for a long time. They remain cached
since the truthiness check gets called in every per-inbox HTML
page, which can potentially be expensive.
|
|
Avoid relying on a giant cleanup hash and instead use the new
DS->add_uniq_timer API to amortize the pause times associated
with having to clean up many inboxes. We can use smaller
intervals for this, as well.
We now discard SQLite DB handles at cleanup. Each of these can
use several megabytes of memory, which adds up with
hundreds/thousands of inboxes. Since per-inbox access intervals
are unpredictable and opening an SQLite handle is relatively
inexpensive, release memory more aggressively to avoid the heap
having to hit swap.
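A small Python model of unique timers, for readers unfamiliar with
the idea. Only the name add_uniq_timer comes from this log; the
UniqTimers class, its run_due method, and the exact semantics (a
second timer under a pending key is ignored) are assumptions made for
illustration:

```python
import heapq
import itertools

class UniqTimers:
    def __init__(self):
        self._pending = {}             # key -> callback
        self._heap = []                # (deadline, seq, key)
        self._seq = itertools.count()  # tie-breaker for equal deadlines

    def add_uniq_timer(self, key, deadline, callback):
        """Schedule callback unless a timer for key is already pending."""
        if key in self._pending:
            return False
        self._pending[key] = callback
        heapq.heappush(self._heap, (deadline, next(self._seq), key))
        return True

    def run_due(self, now):
        """Fire every timer whose deadline has passed; return the count."""
        fired = 0
        while self._heap and self._heap[0][0] <= now:
            _, _, key = heapq.heappop(self._heap)
            self._pending.pop(key)()
            fired += 1
        return fired
```

With one key per inbox, each cleanup is scheduled at most once and
fires on its own deadline, so no single pass has to walk every inbox.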
|
|
SQLite files may be replaced or removed by admins while
generating large thread or mailbox responses. Ensure we
don't hold onto DBI handles and associated file descriptors
past their cleanup.
|
|
While each git blob request is treated fairly w.r.t other git
blob requests, responses triggering thousands of git blob
requests can still noticeably increase latency for
less-expensive responses.
Move large mbox results and the nasty all.mbox endpoint to
a low priority queue which only fires once per event loop
iteration. This reduces the response time of short HTTP
responses while many gigantic mboxes are being downloaded
simultaneously, but still maximizes use of available I/O
when there are no inexpensive HTTP responses happening.
This only affects PublicInbox::WWW users who use
public-inbox-httpd, not generic PSGI servers.
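A toy Python model of the scheduling idea. The Loop class and its
queue names are hypothetical, and we assume here that "fires once"
means each queued low-priority job advances by a single step per
iteration; the real queue may behave differently:

```python
from collections import deque

class Loop:
    def __init__(self):
        self.ready = deque()     # cheap callbacks (short responses)
        self.low_prio = deque()  # generators for expensive responses

    def run_once(self):
        # Run the cheap callbacks queued before this iteration began.
        for _ in range(len(self.ready)):
            self.ready.popleft()()
        # The low priority queue fires once: every queued job advances
        # by one step and is requeued if it has more work to do.
        for _ in range(len(self.low_prio)):
            job = self.low_prio.popleft()
            try:
                next(job)
            except StopIteration:
                continue
            self.low_prio.append(job)
```

When `ready` is empty, iterations consist only of low-priority steps,
so a lone mbox download still proceeds as fast as the loop can spin.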
|
|
|
|
While `$argv[-1]' is `undef' on an empty @argv, using `$argv[-1]'
as a subroutine argument would fail incorrectly with:
Modification of non-creatable array value attempted, subscript -1 at ...
...even though we'd never attempt to modify @_ itself in the
subroutines being called. Work around the bug (tested on
5.16.3) by passing `undef' explicitly when `$argv[-1]' is
already `undef'.
Reported-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org>
Link: https://public-inbox.org/meta/20210927124056.kj5okiefvs4ztk27@meerkat.local/
|
|
"lei index" support for IMAP and NNTP is incomplete, so there's
no point in requiring them.
Reported-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org>
Link: https://public-inbox.org/meta/20210927124056.kj5okiefvs4ztk27@meerkat.local/
|
|
The "-w" perlop always succeeds as root, so we need to check
st_mode for writability bits to detect directories we shouldn't
write to.
Reported-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org>
Link: https://public-inbox.org/meta/20210927124056.kj5okiefvs4ztk27@meerkat.local/
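The same check can be sketched in Python, where os.access() has the
equivalent blind spot for root. dir_marked_writable is a hypothetical
name, and treating any of the user/group/other write bits as
"writable" is an assumption; the real check may be stricter:

```python
import os
import stat

WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH

def dir_marked_writable(path):
    """True if path is a directory with at least one write bit set."""
    st = os.stat(path)
    # Inspect the mode bits directly; an access(2)-style test would
    # report success for root regardless of these bits.
    return stat.S_ISDIR(st.st_mode) and bool(st.st_mode & WRITE_BITS)
```

A directory at mode 0555 is rejected by this test even when running
as root, which is the intent expressed by whoever set that mode.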
|
|
Apparently, sendmsg can fail in less common ways when
network buffers are gigantic. Add some diagnostics for
future failures, as well.
Reported-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org>
Link: https://public-inbox.org/meta/20210927124056.kj5okiefvs4ztk27@meerkat.local/
|
|
It can be useful to test with some of these, but we can't enable
them universally for all servers (and debug + compress is gross).
|
|
Instead of passing the prefix section and key separately, pass
them together as is commonly done with git-config(1) usage as
well as our ->get_all API. This inconsistency in the get_1 API
is a needless footgun and confused me a bit while working on
"lei up" the other week.
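The shape of the consistent API, modeled in Python. This Config class
is a hypothetical stand-in and "demo.color" is a made-up key; only
the get_1 and get_all names come from this log. Returning the last
value mirrors what "git config --get" does and is assumed here:

```python
class Config:
    def __init__(self, values):
        # values maps a full "section.key" name to a list of strings
        self._values = values

    def get_all(self, key):
        """Return every value for "section.key", or None if unset."""
        return self._values.get(key)

    def get_1(self, key):
        """Return a single value for the same "section.key" string."""
        vals = self.get_all(key)
        return vals[-1] if vals else None
```

Both accessors take the identical key string, so a caller can switch
between them without having to split or rejoin the section name.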
|
|
More switches which can be useful for users who pipe from text
editors. --drq can be helpful while writing patch review email
replies, and perhaps --dequote-only, too.
|
|
lei rediff is expected to see partial patch fragments and such,
so silence warnings when something isn't exactly a valid email
message.
|