From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net
X-Spam-Level:
X-Spam-ASN:
X-Spam-Status: No, score=-4.0 required=3.0 tests=ALL_TRUSTED,BAYES_00
	shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2
Received: from localhost (dcvr.yhbt.net [127.0.0.1])
	by dcvr.yhbt.net (Postfix) with ESMTP id 3ACB11F5AE;
	Sat, 24 Jul 2021 06:34:30 +0000 (UTC)
Date: Sat, 24 Jul 2021 06:34:30 +0000
From: Eric Wong
To: meta@public-inbox.org
Subject: Re: [WIP] extindex: support --dedupe[=MSGID]
Message-ID: <20210724063430.GA12660@dcvr>
References: <20210722085034.31363-1-e@80x24.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <20210722085034.31363-1-e@80x24.org>
List-Id:

Eric Wong wrote:
> Anyways, before this patch, "-extindex --dedupe" was taking ~5
> min to no-op every message (after the initial full --dedupe run
> which took over a day to run). Now that it appears to be doing
> many more dedupes, so it could take 40-ish hours to finish the
> dedupe run on the machine running .

Nearly 44 hours (on a busy system), but that was only because of
a bug which caused half the work to get skipped (fix below).

Xapian glass (1.4.x) really falls down with large shards and
making -xcpdb work with extindex to reshard into more shards
will be required to maintain performance as an extindex grows...

diff --git a/lib/PublicInbox/ExtSearchIdx.pm b/lib/PublicInbox/ExtSearchIdx.pm
index 442ded46..2311161e 100644
--- a/lib/PublicInbox/ExtSearchIdx.pm
+++ b/lib/PublicInbox/ExtSearchIdx.pm
@@ -895,12 +895,13 @@ sub eidx_dedupe ($$$) {
 	return unless eidxq_lock_acquire($self);
 	my ($iter, $cur_mid);
 	my $min_id = 0;
+	my $idx = 0;
 	local $sync->{-regen_fmt} = "dedupe %u/".$self->{oidx}->max."\n";

 	# note: we could write this query more intelligently,
 	# but that causes lock contention with read-only processes
 dedupe_restart:
-	$cur_mid = shift @$msgids;
+	$cur_mid = $msgids->[$idx];
 	if ($cur_mid eq '') { # all Message-IDs
 		$iter = $self->{oidx}->dbh->prepare(< ? ORDER BY id ASC
@@ -945,7 +946,7 @@ EOS
 			goto dedupe_restart;
 		}
 	}
-	goto dedupe_restart if scalar(@$msgids);
+	goto dedupe_restart if defined($msgids->[++$idx]);

 	my $n = delete $sync->{dedupe_cull};
 	if (my $pr = $sync->{-opt}->{-progress}) {