Date | Commit message (Collapse) |
|
We don't need to rely on Xapian search functionality for the
majority of the WWW code, even. subject_normalized is moved to
SearchMsg, where it (probably) makes more sense, anyways.
|
|
PublicInbox::DS works for every platform we we care about,
nowadays; so checking for it is a waste of time. Cleanup a
few POSIX and Socket imports while we're in the area.
|
|
Avoiding reliance on environment variables is a bit cleaner
for writing tests
|
|
* origin/danga-bundle:
DS: epoll: fix misordered EPOLL_CTL_DEL call
DS: drop unused "_undef" sub
syscall: drop readahead wrapper
build: do not manify DS and Syscall pods
DS: handle EINTR in IO::Poll path, too
DS: workaround IO::Kqueue EINTR (mis-)handling
DS: drop profiling support
DS: remove unused fields and functions
listener: use EPOLLEXCLUSIVE for listen sockets
bundle Danga::Socket and Sys::Syscall
|
|
This can help users track down the source of warnings
when presented with imperfect emails.
While we're at it, make the __WARN__ callback in t/v2writable.t
a no-op since we don't check for warnings, there.
|
|
These modules are unmaintained upstream at the moment, but I'll
be able to help with the intended maintainer once/if CPAN
ownership is transferred. OTOH, we've been waiting for that
transfer for several years, now...
Changes I intend to make:
* EPOLLEXCLUSIVE for Linux
* remove unused fields wasting memory
* kqueue bugfixes e.g. https://rt.cpan.org/Ticket/Display.html?id=116615
* accept4 support
And some lower priority experiments:
* switch to EV_ONESHOT / EPOLLONESHOT (incompatible changes)
* nginx-style buffering to tmpfile instead of string array
* sendfile off tmpfile buffers
* io_uring maybe?
|
|
And doesn't try to access undef as an array ref.
|
|
This should probably use lower-level git plumbing, but until
then, consistently add a bunch of --no-* options to "git log"
to get more consistent output.
Noticed-by: Johannes Berg
https://public-inbox.org/meta/1538164205.14416.76.camel@sipsolutions.net/
|
|
This allows v1 tests to continue working on git 1.8.0 for
now. This allows git 2.1.4 packaged with Debian 8 ("jessie")
to run old tests, at least.
I suppose it's safe to drop Debian 7 ("wheezy") due to our
dependency on git 1.8.0 for "merge-base --is-ancestor".
Writing V2 repositories requires git 2.6 for "get-mark"
support, so mask out tests for older gits.
|
|
IPC::Run provides a nice simplification in several places; and
we already use it (optionally) on a lot of tests.
For the non-test code, we still rely on our vfork-capable
Inline::C stuff since real-world server processes can get large
enough to where vfork is an advantage. Maybe Perl5 can use
CLONE_VFORK somehow, one day:
https://rt.perl.org/Ticket/Display.html?id=128227
Ohg V'q engure cbeg choyvp-vaobk gb Ehol :C
|
|
Now that some of the indexes are optionals these tests might fail
so teach them to fail more cleanly.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
|
|
This is consistent with git itself and the previous behavior
was a result of misunderstanding of how git interprets this.
And adjust tests slightly to match the new behavior.
Reported-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org>
<38873789-ab42-65a1-20c9-12c30b171f4f@linuxfoundation.org>
|
|
We can't have files with permissions inconsistent with what's
in git objects.
|
|
While hunting duplicates, I noticed a leading '-' in some
Message-IDs as a result of RFC4648 encoding. While '-' seems
allowed by RFC5322 and URL-friendly (RFC4648), they are uncommon
and make using Message-IDs as arguments for command-line tools
more difficult. So prefix them with a datestamp to at least
give readers some sense of the age. And shorten the "localhost"
hostname to "z" to save space.
|
|
Since we only query the SQLite over DB for OVER/XOVER; do not
need to waste space storing fields To/Cc/:bytes/:lines or the
XNUM term. We only use From/Subject/References/Message-ID/:blob
in various places of the PSGI code.
For reindexing, we will take advantage of docid stability
in "xapian-compact --no-renumber" to ensure duplicates do not
show up in search results. Since the PSGI interface is the
only consumer of Xapian at the moment, it has no need to
search based on NNTP article number.
|
|
Since we handle the overview info synchronously, we only need
barriers in tests, now. We will use asynchronous checkpoints
to sync less-important Xapian data.
For data deduplication, this requires us to hoist out the
cat-blob support in ::Import for reading uncommitted data
in git.
|
|
Favor simpler internal APIs this time around, this cuts
a fair amount of code out and takes another step towards
removing Xapian as a dependency for v2 repos.
|
|
The Xapian partitions will trigger the removal anyways.
Test this and fix some description/spelling errors
while we're at it.
|
|
We do not need to rewrite old commits unaffected by the object_id
purge, only newer commits. This was a state management bug :x
We will also return the new commit ID of rewritten history to
aid in incremental indexing of mirrors for the next change.
|
|
We we worked around the default range/termination conditions of
long_response in many cases to reduce calls to SQLite or Xapian.
So continue that trend and become more like the PSGI API
which doesn't force callers to specify an article range or
work inside a loop.
|
|
id_batch had a an overly complicated interface, replace it
with id_batch which is simpler and takes advantage of
selectcol_arrayref in DBI. This allows simplification of
callers and the diffstat agrees with me.
|
|
In many cases, we do not care about the total number of
messages. It's a rather expensive operation in SQLite
(Xapian only provides an estimate).
For LKML, this brings top-level /$INBOX/ loading time from
~375ms to around 60ms on my system. Days ago, this operation
was taking 800-900ms(!) for me before introducing the SQLite
overview DB.
|
|
We need to stop ghost messages from generating longer
Message-IDs than Xapian can handle with terms.
|
|
We need to ensure there is only one file in the top-level tree
at any commit so the "add; remove; add;" sequence on the same
message is detected properly.
Otherwise, git will not detect the second "add" unless
a second message is added to history.
Deletes are now stored in "d" (and not "D" or "_/D") at the
top-level, now. There's no need to have a "_" to reduce churn
as "m" and "d" should never co-exist. It's now lowercased to
make it easier-to-distinguish from "D" in git-log output.
|
|
We have Git::qx nowadays.
|
|
Purging existing messages is fairly straightforward since we can
take advantage of Xapian and lookup the git object_id with it.
Unfortunately, purging an already "removed" message (which is
no longer in Xapian) is not as easy and we'll need to expose
->purge_oids to purge by the git object_id (currently SHA-1).
Furthermore, we expire reflogs and prune in hopes a dumb HTTP
client won't get the object.
|
|
The original Message-ID is still the most important when
discussing with other recipients who do not rely on a message
flowing through public-inbox. So whatever Message-ID we use
to deduplicate internally will be secondary and less important.
All of our front-end v2 code is order-independent, so we won't
let the message count against us, that way.
|
|
If we need to use content_id, we've already lost hope
in relying on Message-Id as a differentiator. This
prevents duplicates from showing up repeatedly with
-watch when Message-Ids are reused and we generate
new Message-Ids to disambiguate.
|
|
public-inbox-watch gets restarted on reboots and whatnot, so
it could get pointlessly noisy. This message was only useful
during initial development and imports.
|
|
While parallel processes improves import speed for initial
imports; they are probably not necessary for daily mail imports
via WatchMaildir and certainly not for public-inbox-init. Save
some memory for daily use and even helps improve readability of
some subroutines by showing which methods they call remotely.
|
|
This matches Import::done behavior
|
|
This will make reindexing easier.
|
|
Hexdigests are too long and shorter Message-IDs are easier
to deal with.
|
|
In the future, we may store "purged" content IDs or other
uncommon stuff under "_/" of the git tree. This keeps the
top-level tree small and more amenable to deltafication.
This helps the the common case where "m" is most commonly
changed file at the top level.
Also, use 'D' instead of 'd' since it matches git's '--raw'
output format.
|
|
This makes it easier to audit deletes with "git log -p" and
prevents an unstable specification of "content_id" from being
stored in history.
This should be cost-free if done in the same partition (and even
cheaper than before as it introduces no new blobs). It does
have a higher cost across partitions, but is probably irrelevant
given the typical ham:spam ratio.
|
|
We need to hide removals from anybody hitting the search engine.
|
|
Stopping and starting a bunch of processes to look up duplicates
or removals is inefficient. Take advantage of checkpointing
in "git fast-import" and transactions in Xapian and SQLite.
|
|
We need to detect the number of partitions the repository was
created with to ensure Xapian DBs can work across different
machines (or even CPU affinity changes) without leaving messages
unaffected by search.
|
|
We'll let the config of all.git dictate every other subrepo to
ease maintenance and configuration. The "include" directive has
been supported since git 1.7.10, so it's safe to depend on as v2
requires git 2.6.0+ anyways for "get-mark" in fast-import.
|
|
We can't rely on header order for Message-ID after all
since we fall back to existing MIDs if they exist and
are unseen. This lets us use SearchMsg->mid to get the
MID we associated with the NNTP article number to ensure
all NNTP article lookups roundtrip correctly.
|
|
This is to make SearchMsg behave more sanely under NNTP.
|
|
It's possible to have a message handle multiple terms;
so use this feature to ensure messages with multiple MIDs
can be found by either one.
|
|
Since we'll need to support multiple Message-IDs anyways,
inject a new one if we hit a duplicate (or don't get one at
all).
Try to use a deterministic Message-Id for consistency, but give
up determinism and use a random Message-Id if an "attacker"
wants to prevent their message from being archived.
|