author Eric Wong (Contractor, The Linux Foundation) <e@80x24.org> 2018-02-22 01:49:08 +0000
committer Eric Wong (Contractor, The Linux Foundation) <e@80x24.org> 2018-02-22 18:33:46 +0000
commit 9ecbfc09928dada28094fd3fc79e91a5472b27ea
tree a829ab7765f45e139e8a9d5de1c3784fc26bbf69 /lib/PublicInbox/Import.pm
parent a81ad9c4b1b5d8c2ae8444b6dcb8710bd361f628
The parallelization requires splitting Msgmap, text+term
indexing, and thread-linking out into separate processes.

git-fast-import is fast, so we don't bother parallelizing it.

Msgmap (SQLite) and thread-linking (Xapian) must be serialized
because they rely on monotonically increasing numbers (NNTP
article number and internal thread_id, respectively).
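
To illustrate why Msgmap has to stay in a single process, here is a minimal sketch (in Python rather than the project's Perl) of a SQLite-backed allocator that hands out monotonically increasing article numbers. The table and column names (`msgmap`, `num`, `mid`) and the helper names `open_msgmap` and `article_number` are assumptions made for illustration; they are not taken from this commit.

```python
import sqlite3


def open_msgmap(path=":memory:"):
    """Open (or create) a tiny Message-ID -> article number map.

    Hypothetical schema, for illustration only.
    """
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS msgmap ("
        "num INTEGER PRIMARY KEY AUTOINCREMENT, "
        "mid TEXT NOT NULL UNIQUE)"
    )
    return db


def article_number(db, mid):
    """Return the number already assigned to mid, or allocate the next one."""
    row = db.execute("SELECT num FROM msgmap WHERE mid = ?", (mid,)).fetchone()
    if row:
        return row[0]  # retrieved
    cur = db.execute("INSERT INTO msgmap (mid) VALUES (?)", (mid,))
    db.commit()
    return cur.lastrowid  # generated: one past the previous maximum
```

Because every new number depends on the one issued before it, letting several indexers allocate at once would force them to contend on the same counter, so a single writer is the simpler choice.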

We handle msgmap in the main process which drives fast-import.
When the article number is retrieved/generated, we write the
entire message to per-partition subprocesses via pipes for
expensive text+term indexing.

When these per-partition subprocesses are done with the
expensive text+term indexing, they write SearchMsg (small data)
to a shared pipe (inherited from the main V2Writable process)
back to the threader, which runs in its own subprocess.
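
The pipe layout described above can be sketched roughly as follows, again in Python rather than Perl. This is a simplified model under stated assumptions, not the code from this series: the partition is picked with `num % nparts`, the record format is JSON lines, "indexing" is a stand-in that counts distinct words, and the parent process plays the threader role itself instead of forking a separate threader. All function names are hypothetical.

```python
import json
import os
import struct

HEADER = struct.Struct("!II")  # article number, body length


def worker(part, rfd, wfd_threader):
    """Per-partition child: do the expensive work, report a small record."""
    try:
        with os.fdopen(rfd, "rb") as msgs:
            while True:
                head = msgs.read(HEADER.size)
                if len(head) < HEADER.size:
                    break  # parent closed our pipe: no more messages
                num, size = HEADER.unpack(head)
                body = msgs.read(size)
                # Stand-in for expensive text+term indexing.
                terms = len(set(body.lower().split()))
                rec = json.dumps({"num": num, "part": part, "terms": terms})
                # One small write per record; writes of at most PIPE_BUF
                # bytes to a pipe are atomic, so records from different
                # workers do not interleave on the shared pipe.
                os.write(wfd_threader, (rec + "\n").encode())
    finally:
        os._exit(0)  # never fall back into the parent's code


def run(messages, nparts=2):
    thr_r, thr_w = os.pipe()  # shared pipe, inherited by every worker
    pipes, pids = [], []
    for part in range(nparts):
        r, w = os.pipe()
        pid = os.fork()
        if pid == 0:
            os.close(thr_r)
            os.close(w)
            for other in pipes:
                os.close(other)  # write ends meant for earlier workers
            worker(part, r, thr_w)  # does not return
        os.close(r)
        pipes.append(w)
        pids.append(pid)
    os.close(thr_w)  # only the workers hold the write end now

    # Main process: assign numbers serially, then hand off the full message.
    for num, body in enumerate(messages, 1):
        os.write(pipes[num % nparts], HEADER.pack(num, len(body)) + body)
    for w in pipes:
        os.close(w)

    # "Threader" role (folded into the parent for this sketch only).
    with os.fdopen(thr_r, "r") as records:
        linked = [json.loads(line) for line in records]
    for pid in pids:
        os.waitpid(pid, 0)
    return sorted(linked, key=lambda rec: rec["num"])
```

Reading the shared pipe only after all messages are sent is safe here because the records are tiny; with a real workload the reader needs to run concurrently, which is one reason to give the threader its own subprocess.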

The number of text+term Xapian partitions is chosen at import
and can be made equal to the number of cores in a machine.

V2Writable --> Import -> git-fast-import
           \-> SearchIdxThread -> Msgmap (synchronous)
           \-> SearchIdxPart[n] -> SearchIdx[*]
           \-> SearchIdxThread -> SearchIdx ("threader", a subprocess)

[*] each subprocess writes to threader
Diffstat (limited to 'lib/PublicInbox/Import.pm')
-rw-r--r-- lib/PublicInbox/Import.pm | 4
1 file changed, 1 insertion, 3 deletions
diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm
index 1a2698a7..b650e4ef 100644
--- a/lib/PublicInbox/Import.pm
+++ b/lib/PublicInbox/Import.pm
@@ -280,14 +280,12 @@ sub add {
         $self->{bytes_added} += $n;
         print $w "blob\nmark :$blob\ndata ", $n, "\n" or wfail;
         print $w $str, "\n" or wfail;
-        $str = undef;
 
         # v2: we need this for Xapian
         if ($self->{want_object_id}) {
                 chomp($self->{last_object_id} = $self->get_mark(":$blob"));
-                $self->{last_object_size} = $n;
+                $self->{last_object} = [ $n, \$str ];
         }
-
         my $ref = $self->{ref};
         my $commit = $self->{mark}++;
         my $parent = $tip =~ /\A:/ ? $tip : undef;