author    Eric Wong (Contractor, The Linux Foundation) <e@80x24.org>  2018-02-27 20:25:23 +0000
committer Eric Wong <e@80x24.org>  2018-02-27 22:12:16 +0000
commit    ebb59815035b42c276a89a585e16e69f51dbdb98
tree      ec2f78f2cc5fbda67141ac7c79ed0b28f3b4aee3
parent    b400772bf3801cb29949cf2ae5021e8e3a8e2d94
download  public-inbox-ebb59815035b42c276a89a585e16e69f51dbdb98.tar.gz

Iterating through a list of documents while modifying them does
not seem to be supported in Xapian and it can trigger
DatabaseCorruptError exceptions.  This only worked with past
datasets out of dumb luck.  With the work-in-progress "v2"
public-inbox layout, this problem might become more visible
as the "thread skeleton" is partitioned out to a separate,
smaller Xapian database.

I've reproduced the problem on both Debian 8.x and 9.x with
Xapian 1.2.19 (chert backend) and 1.4.3 (glass backend)
respectively.
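
The collect-then-modify pattern described above can be sketched without Xapian at all. The following is a minimal toy model, not the real Search::Xapian API: `%toy_db`, `toy_doc_ids` and `toy_merge_threads` are hypothetical names invented for illustration, and a plain hash stands in for the database. It only shows the shape of the approach: snapshot a bounded batch of docids first, write afterwards, and re-query until no document carries the losing term.

```perl
use strict;
use warnings;

# Toy stand-in for a Xapian database: docid => { term => 1, ... }.
# Assumption: nothing here is the real Search::Xapian API.
my %toy_db = map { $_ => { 'G2' => 1 } } 1 .. 5;
$toy_db{6} = { 'G1' => 1 };

# Hypothetical helper: return at most $limit docids that contain $term.
sub toy_doc_ids {
        my ($db, $term, $limit) = @_;
        my @ids = grep { $db->{$_}->{$term} } sort { $a <=> $b } keys %$db;
        splice(@ids, $limit) if @ids > $limit;
        return @ids;
}

# Move every document from thread $loser_tid to thread $winner_tid.
# Returns the number of documents that were rewritten.
sub toy_merge_threads {
        my ($db, $winner_tid, $loser_tid, $batch_size) = @_;
        return 0 if $winner_tid == $loser_tid;
        my $moved = 0;
        while (1) {
                # Phase 1: snapshot a bounded batch of docids, no writes yet.
                my @ids = toy_doc_ids($db, 'G' . $loser_tid, $batch_size);
                return $moved unless @ids;

                # Phase 2: modify only after the listing is finished.
                foreach my $docid (@ids) {
                        delete $db->{$docid}->{'G' . $loser_tid};
                        $db->{$docid}->{'G' . $winner_tid} = 1;
                        $moved++;
                }
                # Each pass removes the loser term from the batch, so the
                # next query sees fewer matches and the loop terminates.
        }
}

my $moved = toy_merge_threads(\%toy_db, 1, 2, 2);
print "moved $moved documents\n";
```

With a batch size of 2 and five documents in the losing thread, the loop makes three passes before the query comes back empty. The batch bound trades extra queries for a cap on how many docids are held in memory at once.
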
-rw-r--r-- lib/PublicInbox/SearchIdx.pm | 21
1 files changed, 14 insertions, 7 deletions
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 66faed31..5559b39d 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -696,15 +696,22 @@ sub create_ghost {
 sub merge_threads {
         my ($self, $winner_tid, $loser_tid) = @_;
         return if $winner_tid == $loser_tid;
-        my ($head, $tail) = $self->find_doc_ids('G' . $loser_tid);
         my $db = $self->{xdb};
 
-        for (; $head != $tail; $head->inc) {
-                my $docid = $head->get_docid;
-                my $doc = $db->get_document($docid);
-                $doc->remove_term('G' . $loser_tid);
-                $doc->add_term('G' . $winner_tid);
-                $db->replace_document($docid, $doc);
+        my $batch_size = 1000; # don't let @ids grow too large to avoid OOM
+        while (1) {
+                my ($head, $tail) = $self->find_doc_ids('G' . $loser_tid);
+                return if $head == $tail;
+                my @ids;
+                for (; $head != $tail && @ids < $batch_size; $head->inc) {
+                        push @ids, $head->get_docid;
+                }
+                foreach my $docid (@ids) {
+                        my $doc = $db->get_document($docid);
+                        $doc->remove_term('G' . $loser_tid);
+                        $doc->add_term('G' . $winner_tid);
+                        $db->replace_document($docid, $doc);
+                }
         }
 }