about summary refs log tree commit homepage
path: root/lib/PublicInbox/V2Writable.pm
DateCommit message (Collapse)
2018-04-07over: remove forked subprocess
Since the overview stuff is a synchronization point anyways, move it into the main V2Writable process and allow us to drop a bunch of code. This is another step towards making Xapian optional for v2. In other words, the fan-out point is moved and the Xapian partitions no longer need to synchronize against each other: Before: /-------->\ /---------->\ v2writable -->+----parts----> over \---------->/ \-------->/ After: /----------> /-----------> v2writable --> over-->+----parts---> \-----------> \----------> Since the overview/threading logic needs to run on the same core that feeds git-fast-import, it's slower for small repos but is not noticeable in large imports where I/O wait in the partitions dominates.
2018-04-06www: favor reading more from SQLite, and less from Xapian
Favor simpler internal APIs this time around, this cuts a fair amount of code out and takes another step towards removing Xapian as a dependency for v2 repos.
2018-04-06over: use only supported and safe SQLite APIs
Some of this jankiness was from early performance problems and they turned out to be unnecessary measures.
2018-04-06v2writable: refer to git each repository as "epoch"
This hopefully helps for people who try to understand this design.
2018-04-06v2writable: allow tracking parallel versions
For upgrades, this will let users keep an old version running while performing "public-inbox-index" on the newest version.
2018-04-05v2writable: remove redundant remove from Over DB
The Xapian partitions will trigger the removal anyways. Test this and fix some description/spelling errors while we're at it.
2018-04-05v2writable: recount partitions after acquiring lock
The partition count can change if public-inbox-compact runs while public-inbox-watch or public-inbox-index is running.
2018-04-04v2writable: do not modify DBs while iterating for ->remove
Xapian may become unhappy if a DB is modified during iteration: nntp://news.gmane.org/20180228004400.GU12724@survex.com
2018-04-04v2: support incremental indexing + purge
This is important for people running mirrors via "git fetch", as they need to be kept up-to-date. Purging is also now supported in mirrors. The short-lived "--regenerate" option is gone and is now implicitly enabled as a result. It's still cheap when article number regeneration is unnecessary, as we track the range for each git repository.
2018-04-04import: rewrite less history during purge
We do not need to rewrite old commits unaffected by the object_id purge, only newer commits. This was a state management bug :x We will also return the new commit ID of rewritten history to aid in incremental indexing of mirrors for the next change.
2018-04-02v2writable: simplify barrier vs checkpoints
searchidx_checkpoint was too convoluted and confusing. Since barrier is mostly the same thing; use that instead and add an fsync option for the overview DB.
2018-04-02replace Xapian skeleton with SQLite overview DB
This ought to provide better performance and scalability which is less dependent on inbox size. Xapian does not seem optimized for some queries used by the WWW homepage, Atom feeds, XOVER and NEWNEWS NNTP commands. This can actually make Xapian optional for NNTP usage, and allow more functionality to work without Xapian installed. Indexing performance was extremely bad at first, but DBI::Profile helped me optimize away problematic queries.
2018-04-01v2writable: fix parallel termination
I was too aggressively disabling parallelization to speed up the test suite and broke this :x Re-enable parallelization for the v2reindex test so we can catch it later.
2018-04-01v2: one file, really
We need to ensure there is only one file in the top-level tree at any commit so the "add; remove; add;" sequence on the same message is detected properly. Otherwise, git will not detect the second "add" unless a second message is added to history. Deletes are now stored in "d" (and not "D" or "_/D") at the top-level, now. There's no need to have a "_" to reduce churn as "m" and "d" should never co-exist. It's now lowercased to make it easier-to-distinguish from "D" in git-log output.
2018-03-30v2: respect core.sharedRepository in git configs
Ensure -convert and -compact do not make repositories unreadable on live servers.
2018-03-30v2writable: go backwards through alternate Message-IDs
This is consistent with how we internally generate new Message-IDs to break conflicts and allows ->reindex to succeed while walking backwards through history
2018-03-30v2writable: convert some fatal reindex errors to warnings
By supporting purge and allowing users to delete git partitions, we can open up ourselves to gaps and un-reindexible data. Let that be.
2018-03-30v2writable: allow gaps in git partitions
Somebody may only care about the most recent history, so allow -init and -index to operate quietly on missing partitions.
2018-03-29v2writable: initializing an existing inbox is idempotent
And we do not want to start making confused repos if somebody leaves out "-V2" the second time around.
2018-03-29v2writable: cleanup: get rid of unused fields
The layout of this structure ended up being a bit different and the read-only access is handled through the ::Inbox class, instead.
2018-03-29v2writable: support purging messages from git entirely
Purging existing messages is fairly straightforward since we can take advantage of Xapian and lookup the git object_id with it. Unfortunately, purging an already "removed" message (which is no longer in Xapian) is not as easy and we'll need to expose ->purge_oids to purge by the git object_id (currently SHA-1). Furthermore, we expire reflogs and prune in hopes a dumb HTTP client won't get the object.
2018-03-29v2writable: append, instead of prepending generated Message-ID
The original Message-ID is still the most important when discussing with other recipients who do not rely on a message flowing through public-inbox. So whatever Message-ID we use to deduplicate internally will be secondary and less important. All of our front-end v2 code is order-independent, so we won't let the message count against us, that way.
2018-03-27v2writable: warn on unseen deleted files
It would be a bug to have deleted files marked but not seen in our histories.
2018-03-22import: consolidate mid prepend logic, here
This also quiets down warnings from -watch when spam training happens on messages without Message-Id.
2018-03-22v2writable: DEBUG_DIFF respects $TMPDIR
The File::Temp API is a bit tricky and needs TMPDIR explicitly enabled if a template is given.
2018-03-22v2writable: clarify header cleanups
We want to make it clear to the code and DEBUG_DIFF users that we do not introduce messages with unsuitable headers into public archives.
2018-03-22v2writable: add NNTP article number regeneration support
Allow best-effort regeneration of NNTP article numbers from cloned git repositories in addition to indexing Xapian Article numbers will not remain consistent when we add purge support, though.
2018-03-22v2writable: support reindexing Xapian
This still requires a msgmap.sqlite3 file to exist, but it allows us to tweak Xapian indexing rules and reindex the Xapian database online while -watch is running.
2018-03-20InboxWritable: add mbox/maildir parsing + import logic
This will make it easier to as well as supporting future Filter API users. It allows simplifying our ad-hoc import_vger_from_mbox script.
2018-03-19v2writable: remove "resent" message for duplicate Message-IDs
public-inbox-watch gets restarted on reboots and whatnot, so it could get pointlessly noisy. This message was only useful during initial development and imports.
2018-03-19v2writable: add DEBUG_DIFF env support
This can help us track down some differences during import, if needed.
2018-03-19v2writable: allow disabling parallelization
While parallel processes improves import speed for initial imports; they are probably not necessary for daily mail imports via WatchMaildir and certainly not for public-inbox-init. Save some memory for daily use and even helps improve readability of some subroutines by showing which methods they call remotely.
2018-03-19searchidxpart: s/barrier/remote_barrier/
Be consistent with our "remote_" prefix for other IPC subs
2018-03-19v2writable: ensure ->done is idempotent
This matches Import::done behavior
2018-03-19Lock: new base class for writable lockers
This reduces code duplication needed for locking and and hopefully makes things easier to understand.
2018-03-19import: enable locking under v2
Instead of using ssoma-based locking, enable locking via Import for now.
2018-03-19import: force Message-ID generation for v1 here
This allows us to share code for generating Message-IDs between v1 and v2 repos. For v1, this introduces a slight incompatibility in message removal iff the original message lacked a Message-ID AND the training request came from a message which did not pass through the public-inbox: The workaround for this would be to reuse the bad message from the archive itself.
2018-03-19import: implement barrier operation for v1 repos
This will allow WatchMaildir to use ->barrier operations instead of reaching inside for nchg. This also ensures dumb HTTP clients can see changes to V2 repos immediately.
2018-03-19import: (v2) delete writes the blob into history in subdir
This makes it easier to audit deletes with "git log -p" and prevents an unstable specification of "content_id" from being stored in history. This should be cost-free if done in the same partition (and even cheaper than before as it introduces no new blobs). It does have a higher cost across partitions, but is probably irrelevant given the typical ham:spam ratio.
2018-03-19v2writable: implement remove correctly
We need to hide removals from anybody hitting the search engine.
2018-03-19search: allow ->reopen to be chainable
Makes life a little easier for V2Writable...
2018-03-19v2writable: remove unnecessary idx_init call
We no longer need it with ->barrier working
2018-03-19v2writable: support "barrier" operation to avoid reforking
Stopping and starting a bunch of processes to look up duplicates or removals is inefficient. Take advantage of checkpointing in "git fast-import" and transactions in Xapian and SQLite.
2018-03-06v2writable: detect and use previous partition count
We need to detect the number of partitions the repository was created with to ensure Xapian DBs can work across different machines (or even CPU affinity changes) without leaving messages unaffected by search.
2018-03-06v2writable: remove unnecessary skeleton commit
Not a big deal since we still commit to the skeleton for every single partition (barrier work abandoned).
2018-03-03v2: avoid redundant/repeated configs for git partition repos
We'll let the config of all.git dictate every other subrepo to ease maintenance and configuration. The "include" directive has been supported since git 1.7.10, so it's safe to depend on as v2 requires git 2.6.0+ anyways for "get-mark" in fast-import.
2018-03-03import: consolidate object info for v2 imports
It's easier to store everything in one array ref similar to what our Git->check routine returns
2018-03-03searchidx: store the primary MID in doc data for NNTP
We can't rely on header order for Message-ID after all since we fall back to existing MIDs if they exist and are unseen. This lets us use SearchMsg->mid to get the MID we associated with the NNTP article number to ensure all NNTP article lookups roundtrip correctly.
2018-03-03v2writable: generated Message-ID goes first
This is to make SearchMsg behave more sanely under NNTP.
2018-03-02v2writable: inject new Message-IDs on true duplicates
Since we'll need to support multiple Message-IDs anyways, inject a new one if we hit a duplicate (or don't get one at all). Try to use a deterministic Message-Id for consistency, but give up determinism and use a random Message-Id if an "attacker" wants to prevent their message from being archived.