Date | Commit message (Collapse) |
|
Hopefully this helps people familiarize themselves with
the source code.
|
|
Since the overview stuff is a synchronization point anyways,
move it into the main V2Writable process and allow us to
drop a bunch of code. This is another step towards making
Xapian optional for v2.
In other words, the fan-out point is moved and the Xapian
partitions no longer need to synchronize against each other:
Before:
/-------->\
/---------->\
v2writable -->+----parts----> over
\---------->/
\-------->/
After:
/---------->
/----------->
v2writable --> over-->+----parts--->
\----------->
\---------->
Since the overview/threading logic needs to run on the same core
that feeds git-fast-import, it's slower for small repos but is
not noticeable in large imports where I/O wait in the partitions
dominates.
|
|
This ought to provide better performance and scalability
which is less dependent on inbox size. Xapian does not
seem optimized for some queries used by the WWW homepage,
Atom feeds, XOVER and NEWNEWS NNTP commands.
This can actually make Xapian optional for NNTP usage,
and allow more functionality to work without Xapian
installed.
Indexing performance was extremely bad at first, but
DBI::Profile helped me optimize away problematic queries.
|
|
While parallel processes improves import speed for initial
imports; they are probably not necessary for daily mail imports
via WatchMaildir and certainly not for public-inbox-init. Save
some memory for daily use and even helps improve readability of
some subroutines by showing which methods they call remotely.
|
|
Be consistent with our "remote_" prefix for other IPC subs
|
|
We need to hide removals from anybody hitting the search engine.
|
|
Stopping and starting a bunch of processes to look up duplicates
or removals is inefficient. Take advantage of checkpointing
in "git fast-import" and transactions in Xapian and SQLite.
|
|
We can't rely on header order for Message-ID after all
since we fall back to existing MIDs if they exist and
are unseen. This lets us use SearchMsg->mid to get the
MID we associated with the NNTP article number to ensure
all NNTP article lookups roundtrip correctly.
|
|
We need to ensure Xapian transaction commits are made to remote
partitions before associated commits hit the skeleton DB.
This causes unnecessary commits to be made to the skeleton DB;
but they're mostly harmless. Further work will be necessary
to ensure proper ordering and avoidance of unnecessary commits.
|
|
Interchangably using "all", "skel", "threader", etc. were
confusing. Standardize on the "skeleton" term to describe
this class since it's also used for retrieval of basic headers.
|
|
Make data passed via Storable to the skeleton worker
a little neater.
|
|
This used to lookup the message in git, but no longer, so
remove a needless indirection layer and call add_message
directly.
|
|
This makes viewing "ps" output nicer.
|
|
This was adding a needless newline into doc_data
|
|
Probably unnecessary, but set binmode for consistency across
platforms.
|
|
Leaking these pipes to child processes wasn't harmful, but
made determining relationships and dataflow between processes
more confusing.
|
|
We want to reduce the time in the main V2Writable process
spends writing to the pipe, as the main process itself is
the primary source of contention.
While we're at it, always flush after writing to ensure
the child sees it at once. (Grr... Perl doesn't use writev)
|
|
The parallelization requires splitting Msgmap, text+term
indexing, and thread-linking out into separate processes.
git-fast-import is fast, so we don't bother parallelizing it.
Msgmap (SQLite) and thread-linking (Xapian) must be serialized
because they rely on monotonically increasing numbers (NNTP
article number and internal thread_id, respectively).
We handle msgmap in the main process which drives fast-import.
When the article number is retrieved/generated, we write the
entire message to per-partition subprocesses via pipes for
expensive text+term indexing.
When these per-partition subprocesses are done with the
expensive text+term indexing, they write SearchMsg (small data)
to a shared pipe (inherited from the main V2Writable process)
back to the threader, which runs its own subprocess.
The number of text+term Xapian partitions is chosen at import
and can be made equal to the number of cores in a machine.
V2Writable --> Import -> git-fast-import
\-> SearchIdxThread -> Msgmap (synchronous)
\-> SearchIdxPart[n] -> SearchIdx[*]
\-> SearchIdxThread -> SearchIdx ("threader", a subprocess)
[* ] each subprocess writes to threader
|