user/dev discussion of public-inbox itself
* [PATCH 00/14] more indexing related improvements
From: Eric Wong @ 2020-08-10  2:11 UTC (permalink / raw)
  To: meta

publicInbox.indexSequentialShard now works incrementally.

public-inbox-convert also learned all the options public-inbox-index
learned, so conversions can be less painful on HDDs.

Eric Wong (14):
  index: require --reindex when using --xapian-only
  index: --sequential-shard works incrementally
  doc: index: some more notes about latest changes
  doc: add some notes around -xcpdb / -edit / -purge
  index+xcpdb: improve SIG{INT,TERM,HUP,PIPE} behavior
  msgmap: tmp_clone: simplify + meaningful filename
  avoid File::Temp::tempfile in more places
  admin: use a generic variable name
  index: cleanup internal variables
  searchidx: use singular `$opt' for consistency with v2
  convert: support new -index options
  convert: speed up --help
  convert: check ARGV more correctly
  convert: set No_COW on copied SQLite files

 Documentation/public-inbox-convert.pod |  19 ++++
 Documentation/public-inbox-edit.pod    |  14 +++
 Documentation/public-inbox-index.pod   |  68 +++++++------
 Documentation/public-inbox-init.pod    |   2 +-
 Documentation/public-inbox-purge.pod   |  14 +++
 Documentation/public-inbox-xcpdb.pod   |  15 ++-
 lib/PublicInbox/Admin.pm               |  71 ++++++++++++--
 lib/PublicInbox/Msgmap.pm              |  19 ++--
 lib/PublicInbox/SearchIdx.pm           |  34 +++----
 lib/PublicInbox/V2Writable.pm          |  77 ++++++++-------
 lib/PublicInbox/Xapcmd.pm              |  28 ++++--
 script/public-inbox-convert            | 131 ++++++++++++++++---------
 script/public-inbox-index              |  69 ++++---------
 script/public-inbox-init               |  17 ++--
 t/import.t                             |   5 +-
 15 files changed, 357 insertions(+), 226 deletions(-)

* [PATCH 02/14] index: --sequential-shard works incrementally
From: Eric Wong @ 2020-08-10  2:11 UTC (permalink / raw)
  To: meta

We should never reindex all data in Xapian unless --reindex is
specified on the command-line.  This means users who put
publicInbox.indexSequentialShard in their config file won't have
to put up with a full reindex at every invocation, only when
they specify --reindex.

We'll also clean up the progress output so it no longer emits
nonsensical ranges where the starting number is higher than the end.
---
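The stepping described above can be modeled outside of Perl.  The
following is only a rough sketch in Python, not the code in this
patch: the function names are hypothetical, and it assumes that
article number N lands in shard N % shards, so each pass touches
a single shard.

```python
from typing import Iterator, List, Optional


def incremental_start(last_indexed: Optional[int], reindex: bool) -> int:
    """Pick the first article number for the Xapian-only pass.

    Assumption: last_indexed is the highest article number known
    before this run (None for an empty inbox), standing in for
    $sync->{mm_tmp}->max in the patch.
    """
    if reindex or last_indexed is None:
        return 0  # full pass, like `$art_beg //= 0`
    return last_indexed + 1  # only articles added since the last run


def shard_steps(art_beg: int, art_end: int,
                shards: int) -> Iterator[List[int]]:
    """Yield the article numbers visited by each sequential pass."""
    for offset in range(shards):
        beg = art_beg + offset
        if beg > art_end:
            continue  # nothing to do; no backwards range is reported
        # step by the shard count so a pass stays within one shard
        yield list(range(beg, art_end + 1, shards))


if __name__ == "__main__":
    start = incremental_start(last_indexed=10, reindex=False)
    for nums in shard_steps(start, art_end=20, shards=3):
        print(nums)
```

With 10 as the previous maximum, 20 as the new maximum and 3 shards,
the passes visit 11,14,17,20 then 12,15,18 then 13,16,19.  Passing
reindex=True restarts from 0, which models --reindex.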
 lib/PublicInbox/V2Writable.pm | 36 ++++++++++++++++++++++-------------
 1 file changed, 23 insertions(+), 13 deletions(-)

diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index f7a318e5..0b527f18 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -1198,20 +1198,20 @@ sub index_xap_only { # git->cat_async callback
 
 sub index_xap_step ($$$;$) {
 	my ($self, $sync, $beg, $step) = @_;
-	my $ibx = $self->{ibx};
-	my $all = $ibx->git;
-	my $over = $ibx->over;
-	my $batch_bytes = batch_bytes($self);
-	$step //= $self->{shards};
 	my $end = $sync->{art_end};
+	return if $beg > $end; # nothing to do
+
+	$step //= $self->{shards};
+	my $ibx = $self->{ibx};
 	if (my $pr = $sync->{-opt}->{-progress}) {
 		$pr->("Xapian indexlevel=$ibx->{indexlevel} ".
 			"$beg..$end (% $step)\n");
 	}
+	my $batch_bytes = batch_bytes($self);
 	for (my $num = $beg; $num <= $end; $num += $step) {
-		my $smsg = $over->get_art($num) or next;
+		my $smsg = $ibx->over->get_art($num) or next;
 		$smsg->{v2w} = $self;
-		$all->cat_async($smsg->{blob}, \&index_xap_only, $smsg);
+		$ibx->git->cat_async($smsg->{blob}, \&index_xap_only, $smsg);
 		if ($self->{transact_bytes} >= $batch_bytes) {
 			${$sync->{nr}} = $num;
 			reindex_checkpoint($self, $sync);
@@ -1253,8 +1253,9 @@ sub index_epoch ($$$) {
 }
 
 sub xapian_only {
-	my ($self, $opt, $sync) = @_;
+	my ($self, $opt, $sync, $art_beg) = @_;
 	my $seq = $opt->{sequentialshard};
+	$art_beg //= 0;
 	local $self->{parallel} = 0 if $seq;
 	$self->idx_init($opt); # acquire lock
 	if (my $art_end = $self->{ibx}->mm->max) {
@@ -1268,9 +1269,11 @@ sub xapian_only {
 		$sync->{art_end} = $art_end;
 		if ($seq || !$self->{parallel}) {
 			my $shard_end = $self->{shards} - 1;
-			index_xap_step($self, $sync, $_) for (0..$shard_end);
+			for (0..$shard_end) {
+				index_xap_step($self, $sync, $art_beg + $_)
+			}
 		} else { # parallel (maybe)
-			index_xap_step($self, $sync, 0, 1);
+			index_xap_step($self, $sync, $art_beg, 1);
 		}
 	}
 	$self->{ibx}->git->cat_async_wait;
@@ -1289,6 +1292,7 @@ sub index_sync {
 	return unless defined $latest;
 
 	my $seq = $opt->{sequentialshard};
+	my $art_beg; # the NNTP article number we start xapian_only at
 	my $idxlevel = $self->{ibx}->{indexlevel};
 	local $self->{ibx}->{indexlevel} = 'basic' if $seq;
 
@@ -1312,6 +1316,12 @@ sub index_sync {
 		$self->{mm}->{dbh}->begin_work;
 		$sync->{mm_tmp} =
 			$self->{mm}->tmp_clone($self->{ibx}->{inboxdir});
+
+		# xapian_only works incrementally w/o --reindex
+		if ($seq && !$opt->{reindex}) {
+			$art_beg = $sync->{mm_tmp}->max;
+			$art_beg++ if defined($art_beg);
+		}
 	}
 	if ($sync->{index_max_size} = $self->{ibx}->{index_max_size}) {
 		$sync->{index_oid} = \&index_oid;
@@ -1326,10 +1336,10 @@ sub index_sync {
 		$pr->('all.git '.sprintf($sync->{-regen_fmt}, $$nr)) if $pr;
 	}
 
-	if ($seq) { # deal with Xapian shards sequentially
+	# deal with Xapian shards sequentially
+	if ($seq && delete($sync->{mm_tmp})) {
 		$self->{ibx}->{indexlevel} = $idxlevel;
-		delete $sync->{mm_tmp};
-		xapian_only($self, $opt, $sync);
+		xapian_only($self, $opt, $sync, $art_beg);
 	}
 
 	# reindex does not pick up new changes, so we rerun w/o it:

Code repositories for project(s) associated with this public inbox

	https://80x24.org/public-inbox.git
