about summary refs log tree commit homepage
path: root/lib/PublicInbox/V2Writable.pm
DateCommit message (Collapse)
2018-03-27v2writable: warn on unseen deleted files
It would be a bug to have deleted files marked but not seen in our histories.
2018-03-22import: consolidate mid prepend logic, here
This also quiets down warnings from -watch when spam training happens on messages without Message-Id.
2018-03-22v2writable: DEBUG_DIFF respects $TMPDIR
The File::Temp API is a bit tricky and needs TMPDIR explicitly enabled if a template is given.
2018-03-22v2writable: clarify header cleanups
We want to make it clear to the code and DEBUG_DIFF users that we do not introduce messages with unsuitable headers into public archives.
2018-03-22v2writable: add NNTP article number regeneration support
Allow best-effort regeneration of NNTP article numbers from cloned git repositories in addition to indexing Xapian Article numbers will not remain consistent when we add purge support, though.
2018-03-22v2writable: support reindexing Xapian
This still requires a msgmap.sqlite3 file to exist, but it allows us to tweak Xapian indexing rules and reindex the Xapian database online while -watch is running.
2018-03-20InboxWritable: add mbox/maildir parsing + import logic
This will make it easier to as well as supporting future Filter API users. It allows simplifying our ad-hoc import_vger_from_mbox script.
2018-03-19v2writable: remove "resent" message for duplicate Message-IDs
public-inbox-watch gets restarted on reboots and whatnot, so it could get pointlessly noisy. This message was only useful during initial development and imports.
2018-03-19v2writable: add DEBUG_DIFF env support
This can help us track down some differences during import, if needed.
2018-03-19v2writable: allow disabling parallelization
While parallel processes improves import speed for initial imports; they are probably not necessary for daily mail imports via WatchMaildir and certainly not for public-inbox-init. Save some memory for daily use and even helps improve readability of some subroutines by showing which methods they call remotely.
2018-03-19searchidxpart: s/barrier/remote_barrier/
Be consistent with our "remote_" prefix for other IPC subs
2018-03-19v2writable: ensure ->done is idempotent
This matches Import::done behavior
2018-03-19Lock: new base class for writable lockers
This reduces code duplication needed for locking and and hopefully makes things easier to understand.
2018-03-19import: enable locking under v2
Instead of using ssoma-based locking, enable locking via Import for now.
2018-03-19import: force Message-ID generation for v1 here
This allows us to share code for generating Message-IDs between v1 and v2 repos. For v1, this introduces a slight incompatibility in message removal iff the original message lacked a Message-ID AND the training request came from a message which did not pass through the public-inbox: The workaround for this would be to reuse the bad message from the archive itself.
2018-03-19import: implement barrier operation for v1 repos
This will allow WatchMaildir to use ->barrier operations instead of reaching inside for nchg. This also ensures dumb HTTP clients can see changes to V2 repos immediately.
2018-03-19import: (v2) delete writes the blob into history in subdir
This makes it easier to audit deletes with "git log -p" and prevents an unstable specification of "content_id" from being stored in history. This should be cost-free if done in the same partition (and even cheaper than before as it introduces no new blobs). It does have a higher cost across partitions, but is probably irrelevant given the typical ham:spam ratio.
2018-03-19v2writable: implement remove correctly
We need to hide removals from anybody hitting the search engine.
2018-03-19search: allow ->reopen to be chainable
Makes life a little easier for V2Writable...
2018-03-19v2writable: remove unnecessary idx_init call
We no longer need it with ->barrier working
2018-03-19v2writable: support "barrier" operation to avoid reforking
Stopping and starting a bunch of processes to look up duplicates or removals is inefficient. Take advantage of checkpointing in "git fast-import" and transactions in Xapian and SQLite.
2018-03-06v2writable: detect and use previous partition count
We need to detect the number of partitions the repository was created with to ensure Xapian DBs can work across different machines (or even CPU affinity changes) without leaving messages unaffected by search.
2018-03-06v2writable: remove unnecessary skeleton commit
Not a big deal since we still commit to the skeleton for every single partition (barrier work abandoned).
2018-03-03v2: avoid redundant/repeated configs for git partition repos
We'll let the config of all.git dictate every other subrepo to ease maintenance and configuration. The "include" directive has been supported since git 1.7.10, so it's safe to depend on as v2 requires git 2.6.0+ anyways for "get-mark" in fast-import.
2018-03-03import: consolidate object info for v2 imports
It's easier to store everything in one array ref similar to what our Git->check routine returns
2018-03-03searchidx: store the primary MID in doc data for NNTP
We can't rely on header order for Message-ID after all since we fall back to existing MIDs if they exist and are unseen. This lets us use SearchMsg->mid to get the MID we associated with the NNTP article number to ensure all NNTP article lookups roundtrip correctly.
2018-03-03v2writable: generated Message-ID goes first
This is to make SearchMsg behave more sanely under NNTP.
2018-03-02v2writable: inject new Message-IDs on true duplicates
Since we'll need to support multiple Message-IDs anyways, inject a new one if we hit a duplicate (or don't get one at all). Try to use a deterministic Message-Id for consistency, but give up determinism and use a random Message-Id if an "attacker" wants to prevent their message from being archived.
2018-03-02v2writable: deduplicate detection on add
This is a bit expensive in a multi-process situation because we need to make our indices and packs visible to the read-only pieces.
2018-03-01v2writable: delete ::Import obj when ->done
As with the ::Import class this wraps, we want this to be usable as a checkpoint and be able to call ->add afterwards. We'll be relying on ->done to flush changes through all partition and skeleton DBs for deduplication checks.
2018-02-28v2/ui: get nntpd and init tests running on v2
A work-in-progress, but it appears the v2 UI pieces do will not require a lot of work to do.
2018-02-28v2writable: commit to skeleton via remote partitions
We need to ensure Xapian transaction commits are made to remote partitions before associated commits hit the skeleton DB. This causes unnecessary commits to be made to the skeleton DB; but they're mostly harmless. Further work will be necessary to ensure proper ordering and avoidance of unnecessary commits.
2018-02-28rename SearchIdxThread to SearchIdxSkeleton
Interchangably using "all", "skel", "threader", etc. were confusing. Standardize on the "skeleton" term to describe this class since it's also used for retrieval of basic headers.
2018-02-28use PublicInbox::MIME consistently
It works around some bugs in older Email::MIME which we'll find useful.
2018-02-28v2writable: cleanup unused pipes in partitions
Leaking these pipes to child processes wasn't harmful, but made determining relationships and dataflow between processes more confusing.
2018-02-22v2writable: warn on duplicate Message-IDs
This should give us an idea of how much a problem deduplication will be.
2018-02-22v2writable: round-robin to partitions based on article number
Instead of relying on the git object_id hash to partition, round-robin to these partitions based on the NNTP article number. This reduces the partition pipes as a source of contention when two (or more) sequential messages end up going to the same partition.
2018-02-22v2: parallelize Xapian indexing
The parallelization requires splitting Msgmap, text+term indexing, and thread-linking out into separate processes. git-fast-import is fast, so we don't bother parallelizing it. Msgmap (SQLite) and thread-linking (Xapian) must be serialized because they rely on monotonically increasing numbers (NNTP article number and internal thread_id, respectively). We handle msgmap in the main process which drives fast-import. When the article number is retrieved/generated, we write the entire message to per-partition subprocesses via pipes for expensive text+term indexing. When these per-partition subprocesses are done with the expensive text+term indexing, they write SearchMsg (small data) to a shared pipe (inherited from the main V2Writable process) back to the threader, which runs its own subprocess. The number of text+term Xapian partitions is chosen at import and can be made equal to the number of cores in a machine. V2Writable --> Import -> git-fast-import \-> SearchIdxThread -> Msgmap (synchronous) \-> SearchIdxPart[n] -> SearchIdx[*] \-> SearchIdxThread -> SearchIdx ("threader", a subprocess) [* ] each subprocess writes to threader
2018-02-20v2: support Xapian + SQLite indexing
This is too slow, currently. Working with only 2017 LKML archives: git-only: ~1 minute git + SQLite: ~12 minutes git+Xapian+SQlite: ~45 minutes So yes, it looks like we'll need to parallelize Xapian indexing, at least.
2018-02-19v2writable: initial cut for repo-rotation
Wrap the old Import package to enable creating new repos based on size thresholds. This is better than relying on time-based rotation as LKML traffic seems to be increasing.