user/dev discussion of public-inbox itself
From: "Eric Wong (Contractor, The Linux Foundation)" <e@80x24.org>
To: meta@public-inbox.org
Subject: [PATCH 19/21] searchidx: do not modify Xapian DB while iterating
Date: Wed, 28 Feb 2018 23:42:00 +0000
Message-ID: <20180228234202.8839-20-e@80x24.org>
In-Reply-To: <20180228234202.8839-1-e@80x24.org>

Iterating through a list of documents while modifying them does
not seem to be supported in Xapian and it can trigger
DatabaseCorruptError exceptions.  This only worked with past
datasets out of dumb luck.  With the work-in-progress "v2"
public-inbox layout, this problem might become more visible
as the "thread skeleton" is partitioned out to a separate,
smaller Xapian database.

I've reproduced the problem on both Debian 8.x and 9.x with
Xapian 1.2.19 (chert backend) and 1.4.3 (glass backend)
respectively.
---
 lib/PublicInbox/SearchIdx.pm | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 3259413..f4238fe 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -740,15 +740,22 @@ sub create_ghost {
 sub merge_threads {
 	my ($self, $winner_tid, $loser_tid) = @_;
 	return if $winner_tid == $loser_tid;
-	my ($head, $tail) = $self->find_doc_ids('G' . $loser_tid);
 	my $db = $self->{xdb};
 
-	for (; $head != $tail; $head->inc) {
-		my $docid = $head->get_docid;
-		my $doc = $db->get_document($docid);
-		$doc->remove_term('G' . $loser_tid);
-		$doc->add_term('G' . $winner_tid);
-		$db->replace_document($docid, $doc);
+	my $batch_size = 1000; # don't let @ids grow too large to avoid OOM
+	while (1) {
+		my ($head, $tail) = $self->find_doc_ids('G' . $loser_tid);
+		return if $head == $tail;
+		my @ids;
+		for (; $head != $tail && @ids < $batch_size; $head->inc) {
+			push @ids, $head->get_docid;
+		}
+		foreach my $docid (@ids) {
+			my $doc = $db->get_document($docid);
+			$doc->remove_term('G' . $loser_tid);
+			$doc->add_term('G' . $winner_tid);
+			$db->replace_document($docid, $doc);
+		}
 	}
 }
 
-- 
EW
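
The batching approach in the patch above (snapshot a bounded list of
docids first, then modify documents, then re-query until nothing is
left) can be illustrated without Xapian. The following is only a
minimal sketch under stated assumptions: %index, add_term, remove_term
and first_doc_ids are hypothetical in-memory stand-ins invented for this
example, not part of Search::Xapian or PublicInbox::SearchIdx.

```perl
use strict;
use warnings;

# Hypothetical in-memory stand-in for a term index (NOT the Xapian API):
# term => { docid => 1, ... }
my %index;

sub add_term {
	my ($term, $docid) = @_;
	$index{$term}{$docid} = 1;
}

sub remove_term {
	my ($term, $docid) = @_;
	delete $index{$term}{$docid};
	delete $index{$term} unless %{$index{$term}};
}

# Return a copy of up to $limit docids currently carrying $term.
# Because this is a snapshot, modifying %index afterwards is safe.
sub first_doc_ids {
	my ($term, $limit) = @_;
	my @ids = sort { $a <=> $b } keys %{$index{$term} || {}};
	splice(@ids, $limit) if @ids > $limit;
	return @ids;
}

sub merge_threads {
	my ($winner_tid, $loser_tid, $batch_size) = @_;
	return 0 if $winner_tid == $loser_tid;
	my $moved = 0;
	while (1) {
		# re-query on every pass; never hold an iterator across writes
		my @ids = first_doc_ids('G' . $loser_tid, $batch_size);
		return $moved unless @ids;
		foreach my $docid (@ids) {
			remove_term('G' . $loser_tid, $docid);
			add_term('G' . $winner_tid, $docid);
			$moved++;
		}
	}
}

add_term('G2', $_) for 1 .. 2500;	# loser thread
add_term('G1', $_) for 2501 .. 2505;	# winner thread
my $moved = merge_threads(1, 2, 1000);
print "moved $moved docs\n";
```

Each pass removes the loser term from every docid it collected, so the
next query returns a strictly smaller set and the loop terminates. The
batch size bounds how many docids are held in memory at once.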


Thread overview: 23+ messages
2018-02-28 23:41 [PATCH v2 0/21] UI bits and v2 import fixes Eric Wong (Contractor, The Linux Foundation)
2018-02-28 23:41 ` [PATCH 01/21] v2writable: warn on duplicate Message-IDs Eric Wong (Contractor, The Linux Foundation)
2018-02-28 23:41 ` [PATCH 02/21] v2/ui: some hacky things to get the PSGI UI to show up Eric Wong (Contractor, The Linux Foundation)
2018-02-28 23:41 ` [PATCH 03/21] v2/ui: retry DB reopens in a few more places Eric Wong (Contractor, The Linux Foundation)
2018-02-28 23:41 ` [PATCH 04/21] v2writable: cleanup unused pipes in partitions Eric Wong (Contractor, The Linux Foundation)
2018-02-28 23:41 ` [PATCH 05/21] searchidxpart: binmode Eric Wong (Contractor, The Linux Foundation)
2018-02-28 23:41 ` [PATCH 06/21] use PublicInbox::MIME consistently Eric Wong (Contractor, The Linux Foundation)
2018-02-28 23:41 ` [PATCH 07/21] searchidxpart: chomp line before splitting Eric Wong (Contractor, The Linux Foundation)
2018-02-28 23:41 ` [PATCH 08/21] searchidx*: name child subprocesses Eric Wong (Contractor, The Linux Foundation)
2018-02-28 23:41 ` [PATCH 09/21] searchidx: get rid of pointless index_blob wrapper Eric Wong (Contractor, The Linux Foundation)
2018-02-28 23:41 ` [PATCH 10/21] view: remove X-PI-TS reference Eric Wong (Contractor, The Linux Foundation)
2018-02-28 23:41 ` [PATCH 11/21] searchidxthread: load doc data for references Eric Wong (Contractor, The Linux Foundation)
2018-02-28 23:41 ` [PATCH 12/21] searchidxpart: force integers into add_message Eric Wong (Contractor, The Linux Foundation)
2018-02-28 23:41 ` [PATCH 13/21] search: reopen skeleton DB as well Eric Wong (Contractor, The Linux Foundation)
2018-02-28 23:41 ` [PATCH 14/21] searchidx: index values in the threader Eric Wong (Contractor, The Linux Foundation)
2018-02-28 23:41 ` [PATCH 15/21] search: use different Enquire object for skeleton queries Eric Wong (Contractor, The Linux Foundation)
2018-02-28 23:41 ` [PATCH 16/21] rename SearchIdxThread to SearchIdxSkeleton Eric Wong (Contractor, The Linux Foundation)
2018-02-28 23:41 ` [PATCH 17/21] v2writable: commit to skeleton via remote partitions Eric Wong (Contractor, The Linux Foundation)
2018-02-28 23:41 ` [PATCH 18/21] searchidxskeleton: extra error checking Eric Wong (Contractor, The Linux Foundation)
2018-02-28 23:42 ` Eric Wong (Contractor, The Linux Foundation) [this message]
2018-02-28 23:42 ` [PATCH 20/21] search: query_xover uses skeleton DB iff available Eric Wong (Contractor, The Linux Foundation)
2018-02-28 23:42 ` [PATCH 21/21] v2/ui: get nntpd and init tests running on v2 Eric Wong (Contractor, The Linux Foundation)
2018-03-01 23:40 ` [PATCH v2 0/21] UI bits and v2 import fixes Eric Wong
