From 9ecbfc09928dada28094fd3fc79e91a5472b27ea Mon Sep 17 00:00:00 2001
From: "Eric Wong (Contractor, The Linux Foundation)"
Date: Thu, 22 Feb 2018 01:49:08 +0000
Subject: v2: parallelize Xapian indexing

The parallelization requires splitting Msgmap, text+term indexing,
and thread-linking out into separate processes.

git-fast-import is fast, so we don't bother parallelizing it.

Msgmap (SQLite) and thread-linking (Xapian) must be serialized
because they rely on monotonically increasing numbers (NNTP article
number and internal thread_id, respectively).

We handle msgmap in the main process which drives fast-import.
When the article number is retrieved/generated, we write the
entire message to per-partition subprocesses via pipes for
expensive text+term indexing.

When these per-partition subprocesses are done with the expensive
text+term indexing, they write SearchMsg (small data) to a shared
pipe (inherited from the main V2Writable process) back to the
threader, which runs its own subprocess.

The number of text+term Xapian partitions is chosen at import and
can be made equal to the number of cores in a machine.

V2Writable --> Import -> git-fast-import
           \-> SearchIdxThread -> Msgmap (synchronous)
           \-> SearchIdxPart[n] -> SearchIdx[*]
           \-> SearchIdxThread -> SearchIdx ("threader", a subprocess)

[* ] each subprocess writes to threader
---
 lib/PublicInbox/Search.pm | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

(limited to 'lib/PublicInbox/Search.pm')

diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index eac11bd4..3b280598 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -124,7 +124,10 @@ sub xdir {
 	if ($self->{version} == 1) {
 		"$self->{mainrepo}/public-inbox/xapian" . SCHEMA_VERSION;
 	} else {
-		"$self->{mainrepo}/xap" . SCHEMA_VERSION;
+		my $dir = "$self->{mainrepo}/xap" . SCHEMA_VERSION;
+		my $part = $self->{partition};
+		defined $part or die "partition not given";
+		$dir .= "/$part";
 	}
 }
 
-- 
cgit v1.2.3-24-ge0c7