Date | Commit message (Collapse) |
|
Having zero search results means we never get a chance
to populate the Content-Disposition header for mbox
downloads.
|
|
I guess nobody uses this command (slrnpull does not), and
the breakage was not noticed until I started writing new
tests for multi-MID handling.
Fixes: 3fc411c772a21d8f ("search: drop pointless range processors for Unix timestamp")
|
|
In many cases, we do not care about the total number of
messages. It's a rather expensive operation in SQLite
(Xapian only provides an estimate).
For LKML, this brings top-level /$INBOX/ loading time from
~375ms to around 60ms on my system. Days ago, this operation
was taking 800-900ms(!) for me before introducing the SQLite
overview DB.
|
|
We need to ensure we don't match NULL 'sid' columns in the
`over' table.
|
|
This ought to provide better performance and scalability
which is less dependent on inbox size. Xapian does not
seem optimized for some queries used by the WWW homepage,
Atom feeds, XOVER and NEWNEWS NNTP commands.
This can actually make Xapian optional for NNTP usage,
and allow more functionality to work without Xapian
installed.
Indexing performance was extremely bad at first, but
DBI::Profile helped me optimize away problematic queries.
|
|
We need to stop ghost messages from generating longer
Message-IDs than Xapian can handle with terms.
|
|
I was too aggressively disabling parallelization to speed up
the test suite and broke this :x Re-enable parallelization
for the v2reindex test so we can catch it later.
|
|
We need to ensure there is only one file in the top-level tree
at any commit so the "add; remove; add;" sequence on the same
message is detected properly.
Otherwise, git will not detect the second "add" unless
a second message is added to history.
Deletes are now stored in "d" (and not "D" or "_/D") at the
top-level, now. There's no need to have a "_" to reduce churn
as "m" and "d" should never co-exist. It's now lowercased to
make it easier-to-distinguish from "D" in git-log output.
|
|
We don't want non-fully-numeric limits being compared and
tripping warnings. While we're at it, avoid hard-coding
'200' and reuse $LIM as the default.
|
|
Ensure -convert and -compact do not make repositories
unreadable on live servers.
|
|
We have Git::qx nowadays.
|
|
We'll be making sure V2Writable uses this.
|
|
Some folks had bad mail clients which generated 3-digit years
around Y2K...
|
|
I mainly focus on -watch for mirroring busy mailing lists, but
using -mda should remain an option.
|
|
Having multiple Xapian partitions is mostly pointless after
the initial import. We can compact all the partitions into
one while keeping the skeleton separate.
|
|
And we do not want to start making confused repos if somebody
leaves out "-V2" the second time around.
|
|
Back in the day, we compressed long Message-IDs to SHA-1
hexdigests for the URL. This now redirects to a 301 in
the hopes we can remove these checks some day to reduce
overhead.
|
|
Too many similar functions doing the same basic thing was
redundant and misleading, especially since Message-ID is
no longer treated as a truly unique identifier.
For displaying threads in the HTML, this makes it clear
that we favor the primary Message-ID mapped to an NNTP
article number if a message cannot be found.
|
|
Purging existing messages is fairly straightforward since we can
take advantage of Xapian and lookup the git object_id with it.
Unfortunately, purging an already "removed" message (which is
no longer in Xapian) is not as easy and we'll need to expose
->purge_oids to purge by the git object_id (currently SHA-1).
Furthermore, we expire reflogs and prune in hopes a dumb HTTP
client won't get the object.
|
|
By using the "primary" Message-ID in WwwAttach, we can avoid
conflicts in the links we use for downloading attachments.
|
|
The original Message-ID is still the most important when
discussing with other recipients who do not rely on a message
flowing through public-inbox. So whatever Message-ID we use
to deduplicate internally will be secondary and less important.
All of our front-end v2 code is order-independent, so we won't
let the message count against us, that way.
|
|
This will require multiple client invocations, but should reduce
load on the server and make it easier for readers to only clone
the latest data.
Unfortunately, supporting a cloneurl file for externally-hosted
repos will be more difficult as we cannot easily know if the
clones use v1 or v2 repositories, or how many git partitions
they have.
|
|
Since we need to handle messages with multiple and duplicate
Message-ID headers, our thread skeleton display must account
for that.
Since we have a "preferred" Message-ID in case of conflicts,
use it as the UUID in an Atom feed so readers do not get
confused by conflicts.
|
|
This needs tests and further refinement, but current tests pass.
|
|
I forget this endpoint is still accessible (even if not linked).
This also simplifies new.html all around and removes some unused
clutter from the old days while we're at it.
|
|
Some test coverage is better than none, here.
|
|
This gives more-up-to-date data in case and allows us
to avoid reopening in more places ourselves.
|
|
Since v2 supports duplicate messages, we need to support
looking up different messages with the same Message-Id.
Fortunately, our "raw" endpoint has always been mboxrd,
so users won't need to change their parsing tools.
|
|
Allow best-effort regeneration of NNTP article numbers from
cloned git repositories in addition to indexing Xapian Article
numbers will not remain consistent when we add purge support,
though.
|
|
I'll be relying on some of this behavior for regenerating NNTP
article numbers off fresh clones.
|
|
This will be used to keep track of Message-ID <-> NNTP Article
numbers to prevent article number reuse when reindexing.
|
|
If we need to use content_id, we've already lost hope
in relying on Message-Id as a differentiator. This
prevents duplicates from showing up repeatedly with
-watch when Message-Ids are reused and we generate
new Message-Ids to disambiguate.
|
|
public-inbox-watch gets restarted on reboots and whatnot, so
it could get pointlessly noisy. This message was only useful
during initial development and imports.
|
|
While parallel processes improves import speed for initial
imports; they are probably not necessary for daily mail imports
via WatchMaildir and certainly not for public-inbox-init. Save
some memory for daily use and even helps improve readability of
some subroutines by showing which methods they call remotely.
|
|
Unfortunately this gives up some minor performance tweaks we
made to avoid reforking import processes.
|
|
This matches Import::done behavior
|
|
I had to dig through commit history for this and we should
better document our tests (along with everything else).
|
|
This will make reindexing easier.
|
|
Hexdigests are too long and shorter Message-IDs are easier
to deal with.
|
|
In the future, we may store "purged" content IDs or other
uncommon stuff under "_/" of the git tree. This keeps the
top-level tree small and more amenable to deltafication.
This helps the the common case where "m" is most commonly
changed file at the top level.
Also, use 'D' instead of 'd' since it matches git's '--raw'
output format.
|
|
This makes it easier to audit deletes with "git log -p" and
prevents an unstable specification of "content_id" from being
stored in history.
This should be cost-free if done in the same partition (and even
cheaper than before as it introduces no new blobs). It does
have a higher cost across partitions, but is probably irrelevant
given the typical ham:spam ratio.
|
|
We need to hide removals from anybody hitting the search engine.
|
|
Stopping and starting a bunch of processes to look up duplicates
or removals is inefficient. Take advantage of checkpointing
in "git fast-import" and transactions in Xapian and SQLite.
|
|
We need to detect the number of partitions the repository was
created with to ensure Xapian DBs can work across different
machines (or even CPU affinity changes) without leaving messages
unaffected by search.
|
|
We'll let the config of all.git dictate every other subrepo to
ease maintenance and configuration. The "include" directive has
been supported since git 1.7.10, so it's safe to depend on as v2
requires git 2.6.0+ anyways for "get-mark" in fast-import.
|
|
It's easier to store everything in one array ref similar
to what our Git->check routine returns
|
|
We can't rely on header order for Message-ID after all
since we fall back to existing MIDs if they exist and
are unseen. This lets us use SearchMsg->mid to get the
MID we associated with the NNTP article number to ensure
all NNTP article lookups roundtrip correctly.
|
|
I guess nobody uses this command (slrnpull does not), and
the breakage was not noticed until I started writing new
tests for multi-MID handling.
Fixes: 3fc411c772a21d8f ("search: drop pointless range processors for Unix timestamp")
|
|
This is to make SearchMsg behave more sanely under NNTP.
|
|
It's possible to have a message handle multiple terms;
so use this feature to ensure messages with multiple MIDs
can be found by either one.
|