path: root/lib/PublicInbox/SearchIdxPart.pm
Date Commit message
2019-05-06 index: warn with info about the message as context
This can help users track down the source of warnings when presented with imperfect emails. While we're at it, make the __WARN__ callback in t/v2writable.t a no-op since we don't check for warnings there.
2019-01-09 doc: various overview-level module comments
Hopefully this helps people familiarize themselves with the source code.
2018-04-07 over: remove forked subprocess
Since the overview stuff is a synchronization point anyways, move it into the main V2Writable process and allow us to drop a bunch of code. This is another step towards making Xapian optional for v2.

In other words, the fan-out point is moved and the Xapian partitions no longer need to synchronize against each other:

Before:

                   /-------->\
                  /---------->\
    v2writable -->+----parts----> over
                  \---------->/
                   \-------->/

After:

                          /---------->
                         /----------->
    v2writable --> over-->+----parts--->
                         \----------->
                          \---------->

Since the overview/threading logic needs to run on the same core that feeds git-fast-import, it's slower for small repos but is not noticeable in large imports where I/O wait in the partitions dominates.
2018-04-02 replace Xapian skeleton with SQLite overview DB
This ought to provide better performance and scalability which is less dependent on inbox size. Xapian does not seem optimized for some queries used by the WWW homepage, Atom feeds, XOVER and NEWNEWS NNTP commands. This can actually make Xapian optional for NNTP usage, and allow more functionality to work without Xapian installed. Indexing performance was extremely bad at first, but DBI::Profile helped me optimize away problematic queries.
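As a rough illustration of why an indexed SQLite table suits range and recency queries of the kind XOVER and the homepage need, here is a minimal sketch in Python (not the project's Perl). The table name `overview` and the columns `num`, `ts`, and `subject` are hypothetical and are not taken from public-inbox's actual schema.

```python
import sqlite3

def open_overview(path=":memory:"):
    """Open (or create) a hypothetical overview DB keyed by article number."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS overview ("
        " num INTEGER PRIMARY KEY,"   # NNTP article number
        " ts INTEGER NOT NULL,"       # message timestamp (epoch seconds)
        " subject TEXT)"
    )
    # An index on ts keeps "newest N messages" lookups cheap
    # regardless of how large the inbox grows.
    db.execute("CREATE INDEX IF NOT EXISTS idx_ts ON overview (ts)")
    return db

def add_overview(db, num, ts, subject):
    db.execute(
        "INSERT INTO overview (num, ts, subject) VALUES (?, ?, ?)",
        (num, ts, subject),
    )

def xover_range(db, first, last):
    """XOVER-style lookup: all articles in an inclusive number range."""
    cur = db.execute(
        "SELECT num, subject FROM overview"
        " WHERE num BETWEEN ? AND ? ORDER BY num",
        (first, last),
    )
    return cur.fetchall()

def newest(db, limit):
    """Homepage/Atom-style lookup: the most recent messages first."""
    cur = db.execute(
        "SELECT num, subject FROM overview ORDER BY ts DESC LIMIT ?",
        (limit,),
    )
    return cur.fetchall()
```

Both lookups are served by an index (the primary key for the range, `idx_ts` for recency), which is the property the commit message is after: cost that depends on the size of the result rather than the size of the inbox.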
2018-03-19 v2writable: allow disabling parallelization
While parallel processes improve import speed for initial imports, they are probably not necessary for daily mail imports via WatchMaildir and certainly not for public-inbox-init. Disabling them saves some memory for daily use and even helps improve readability of some subroutines by showing which methods they call remotely.
2018-03-19 searchidxpart: s/barrier/remote_barrier/
Be consistent with our "remote_" prefix for other IPC subs.
2018-03-19 v2writable: implement remove correctly
We need to hide removals from anybody hitting the search engine.
2018-03-19 v2writable: support "barrier" operation to avoid reforking
Stopping and starting a bunch of processes to look up duplicates or removals is inefficient. Take advantage of checkpointing in "git fast-import" and transactions in Xapian and SQLite.
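The idea of a barrier can be sketched as follows: a long-lived worker keeps its database handle open, and a "barrier" command makes it commit the current transaction and acknowledge, so the caller can see consistent data without tearing the worker down. This is a minimal Python sketch using a thread and queues in place of the forked processes and pipes the commit describes; the command names, the `docs` table, and the `worker` function are all hypothetical.

```python
import queue
import sqlite3
import threading

def worker(path, inbox, acks):
    """Long-lived indexer: stays alive across barriers instead of restarting."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS docs (num INTEGER PRIMARY KEY, body TEXT)"
    )
    # Read commands until the ("quit", None) sentinel arrives.
    for cmd, arg in iter(inbox.get, ("quit", None)):
        if cmd == "add":
            # The first INSERT implicitly opens a transaction.
            db.execute("INSERT INTO docs VALUES (?, ?)", arg)
        elif cmd == "barrier":
            # Make everything queued so far durable and visible,
            # then tell the caller it is safe to look things up.
            db.commit()
            acks.put(arg)
    db.commit()
    db.close()
```

A caller would queue `("add", (num, body))` items, then `("barrier", token)`, and block on `acks.get()` until the same token comes back; after that, a separate reader connection can see every row added before the barrier while the worker keeps running.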
2018-03-03 searchidx: store the primary MID in doc data for NNTP
We can't rely on header order for Message-ID after all since we fall back to existing MIDs if they exist and are unseen. This lets us use SearchMsg->mid to get the MID we associated with the NNTP article number to ensure all NNTP article lookups roundtrip correctly.
2018-02-28 v2writable: commit to skeleton via remote partitions
We need to ensure Xapian transaction commits are made to remote partitions before associated commits hit the skeleton DB. This causes unnecessary commits to be made to the skeleton DB, but they're mostly harmless. Further work will be necessary to ensure proper ordering and avoidance of unnecessary commits.
2018-02-28 rename SearchIdxThread to SearchIdxSkeleton
Interchangeably using "all", "skel", "threader", etc. was confusing. Standardize on the "skeleton" term to describe this class since it's also used for retrieval of basic headers.
2018-02-28 searchidxpart: force integers into add_message
Make data passed via Storable to the skeleton worker a little neater.
2018-02-28 searchidx: get rid of pointless index_blob wrapper
This used to look up the message in git, but no longer does, so remove a needless indirection layer and call add_message directly.
2018-02-28 searchidx*: name child subprocesses
This makes viewing "ps" output nicer.
2018-02-28 searchidxpart: chomp line before splitting
This was adding a needless newline into doc_data.
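The bug class is easy to reproduce: when a newline-terminated record is split on spaces, the last field keeps the trailing newline unless it is removed first. Below is a small Python sketch of the fix, where stripping a single trailing newline plays the role of Perl's `chomp`. The three-field record layout and the `parse_line` helper are hypothetical and not taken from the module.

```python
def parse_line(line):
    """Split a space-separated record after removing its line terminator."""
    # Like Perl's chomp: drop one trailing newline and nothing else,
    # so meaningful whitespace inside the record is preserved.
    if line.endswith("\n"):
        line = line[:-1]
    # Limit the split so a final free-form field may contain spaces.
    return line.split(" ", 2)
```

Without the strip, the last element of `"42 deadbeef <mid@example>\n".split(" ", 2)` would be `"<mid@example>\n"`, and that stray newline would travel into whatever is stored downstream.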
2018-02-28 searchidxpart: binmode
Probably unnecessary, but set binmode for consistency across platforms.
2018-02-28 v2writable: cleanup unused pipes in partitions
Leaking these pipes to child processes wasn't harmful, but made determining relationships and dataflow between processes more confusing.
2018-02-22 searchidxpart: increase pipe size for partitions
We want to reduce the time the main V2Writable process spends writing to the pipe, as the main process itself is the primary source of contention. While we're at it, always flush after writing to ensure the child sees it at once. (Grr... Perl doesn't use writev)
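On Linux, a pipe's kernel buffer can be enlarged with the `F_SETPIPE_SZ` fcntl, which lets a writer queue more data before it blocks. The Python sketch below shows that technique together with the flush-after-write habit the commit mentions. The 1 MiB size and the helper names `open_big_pipe` and `send` are assumptions for illustration; the values actually used by the module are not stated here.

```python
import fcntl
import os

# F_SETPIPE_SZ is Linux-specific. Python exposes it in the fcntl module
# from 3.10; on older interpreters fall back to the raw Linux constant.
F_SETPIPE_SZ = getattr(fcntl, "F_SETPIPE_SZ", 1031)

def open_big_pipe(size=1048576):
    """Create a pipe and try to enlarge its kernel buffer (assumed 1 MiB)."""
    r, w = os.pipe()
    try:
        fcntl.fcntl(w, F_SETPIPE_SZ, size)
    except OSError:
        # Not Linux, or the size exceeds the system limit:
        # keep the default buffer rather than failing.
        pass
    return os.fdopen(r, "rb"), os.fdopen(w, "wb")

def send(wfile, payload):
    """Write one payload and flush so the reader sees it immediately."""
    wfile.write(payload)
    wfile.flush()
```

The larger buffer reduces how often the writer stalls waiting for a busy reader, and the explicit flush keeps data from sitting in the writer's userspace buffer.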
2018-02-22 v2: parallelize Xapian indexing
The parallelization requires splitting Msgmap, text+term indexing, and thread-linking out into separate processes.

git-fast-import is fast, so we don't bother parallelizing it.

Msgmap (SQLite) and thread-linking (Xapian) must be serialized because they rely on monotonically increasing numbers (NNTP article number and internal thread_id, respectively).

We handle msgmap in the main process which drives fast-import. When the article number is retrieved/generated, we write the entire message to per-partition subprocesses via pipes for expensive text+term indexing.

When these per-partition subprocesses are done with the expensive text+term indexing, they write SearchMsg (small data) to a shared pipe (inherited from the main V2Writable process) back to the threader, which runs its own subprocess.

The number of text+term Xapian partitions is chosen at import and can be made equal to the number of cores in a machine.

    V2Writable --> Import -> git-fast-import
               \-> SearchIdxThread -> Msgmap (synchronous)
               \-> SearchIdxPart[n] -> SearchIdx[*]
               \-> SearchIdxThread -> SearchIdx ("threader", a subprocess)

    [*] each subprocess writes to threader
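The data flow described in this commit can be sketched in a few lines: article numbers are assigned serially in the main loop, each full message is handed to one of N partition workers for the expensive indexing, and every worker reports a small summary to a single shared channel for the threader. This Python sketch uses threads and queues where the commit uses subprocesses and pipes. Picking a partition by `num % nparts`, the word-splitting stand-in for text+term indexing, and all function names are assumptions for illustration only.

```python
import itertools
import queue
import threading

def partition_worker(inbox, threader_q):
    """Do the expensive per-message work, then report a small summary."""
    for num, raw in iter(inbox.get, None):  # None is the shutdown sentinel
        # Stand-in for text+term indexing of the full message.
        terms = set(raw.lower().split())
        # Only small data goes back to the shared threader channel.
        threader_q.put((num, len(terms)))

def run_import(messages, nparts):
    threader_q = queue.Queue()  # shared by all partitions
    inboxes = [queue.Queue() for _ in range(nparts)]
    workers = [
        threading.Thread(target=partition_worker, args=(q, threader_q))
        for q in inboxes
    ]
    for w in workers:
        w.start()

    # Serialized step: article numbers must increase monotonically,
    # so they are generated in the main loop only.
    article_nums = itertools.count(1)
    for raw in messages:
        num = next(article_nums)
        # Assumed partitioning rule: spread messages by article number.
        inboxes[num % nparts].put((num, raw))

    for q in inboxes:
        q.put(None)
    for w in workers:
        w.join()

    # Stand-in for the threader: drain the shared channel in the caller.
    results = []
    while not threader_q.empty():
        results.append(threader_q.get())
    return sorted(results)
```

For example, `run_import(["a b", "c c c", "d e f"], 2)` hands messages 1 and 3 to one worker and message 2 to the other, then collects one `(article number, distinct term count)` pair per message from the shared queue.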