user/dev discussion of public-inbox itself
* [PATCH 2/2] v2: mirrors don't clobber msgs w/ reused Message-IDs
  @ 2021-10-18  5:09  5% ` Eric Wong
  0 siblings, 0 replies; 3+ results
From: Eric Wong @ 2021-10-18  5:09 UTC (permalink / raw)
  To: meta

For odd messages with reused Message-IDs, the second message
showing up in a mirror (via git-fetch + -index) should never
clobber an entry with a different blob in over.

This is noticeable only if the messages arrive in-between
indexing runs.

Fixes: 4441a38481ed ("v2: index forwards (via `git log --reverse')")
---
 MANIFEST                      |  1 +
 lib/PublicInbox/V2Writable.pm |  7 ++++++-
 t/v2index-late-dupe.t         | 37 +++++++++++++++++++++++++++++++++++
 3 files changed, 44 insertions(+), 1 deletion(-)
 create mode 100644 t/v2index-late-dupe.t

diff --git a/MANIFEST b/MANIFEST
index b5aae77747dd..af1522d71bd1 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -552,6 +552,7 @@ t/v1-add-remove-add.t
 t/v1reindex.t
 t/v2-add-remove-add.t
 t/v2dupindex.t
+t/v2index-late-dupe.t
 t/v2mda.t
 t/v2mirror.t
 t/v2reindex.t
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 3914383cc9d3..ed5182ae8460 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -813,8 +813,8 @@ sub index_oid { # cat_async callback
 			}
 		}
 	}
+	my $oidx = $self->{oidx};
 	if (!defined($num)) { # reuse if reindexing (or duplicates)
-		my $oidx = $self->{oidx};
 		for my $mid (@$mids) {
 			($num, $mid0) = $oidx->num_mid0_for_oid($oid, $mid);
 			last if defined $num;
@@ -822,6 +822,11 @@ sub index_oid { # cat_async callback
 	}
 	$mid0 //= do { # is this a number we got before?
 		$num = $arg->{mm_tmp}->num_for($mids->[0]);
+
+		# don't clobber existing if Message-ID is reused:
+		if (my $x = defined($num) ? $oidx->get_art($num) : undef) {
+			undef($num) if $x->{blob} ne $oid;
+		}
 		defined($num) ? $mids->[0] : undef;
 	};
 	if (!defined($num)) {
diff --git a/t/v2index-late-dupe.t b/t/v2index-late-dupe.t
new file mode 100644
index 000000000000..c83e3409044f
--- /dev/null
+++ b/t/v2index-late-dupe.t
@@ -0,0 +1,37 @@
+# Copyright (C) all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+#
+# this simulates a mirror path: git fetch && -index
+use strict; use v5.10.1; use PublicInbox::TestCommon;
+use Test::More; # redundant, used for bisect
+require_mods 'v2';
+require PublicInbox::Import;
+require PublicInbox::Inbox;
+require PublicInbox::Git;
+my ($tmpdir, $for_destroy) = tmpdir();
+my $inboxdir = "$tmpdir/i";
+PublicInbox::Import::init_bare(my $e0 = "$inboxdir/git/0.git");
+open my $fh, '>', "$inboxdir/inbox.lock" or xbail $!;
+my $git = PublicInbox::Git->new($e0);
+my $im = PublicInbox::Import->new($git, qw(i i@example.com));
+$im->{lock_path} = undef;
+$im->{path_type} = 'v2';
+my $eml = eml_load('t/plack-qp.eml');
+ok($im->add($eml), 'add original');
+$im->done;
+run_script([qw(-index -Lbasic), $inboxdir]);
+is($?, 0, 'basic index');
+my $ibx = PublicInbox::Inbox->new({ inboxdir => $inboxdir });
+my $orig = $ibx->over->get_art(1);
+
+my @mid = $eml->header_raw('Message-ID');
+$eml->header_set('Message-ID', @mid, '<extra@z>');
+ok($im->add($eml), 'add another');
+$im->done;
+run_script([qw(-index -Lbasic), $inboxdir]);
+is($?, 0, 'basic index again');
+
+my $after = $ibx->over->get_art(1);
+is_deeply($after, $orig, 'original unchanged') or note explain([$orig,$after]);
+
+done_testing;

^ permalink raw reply related	[relevance 5%]
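
The guard this patch adds to index_oid can be tried in isolation. The following is a minimal sketch, not the V2Writable code: %msgmap and %over are hypothetical in-memory stand-ins for the SQLite-backed msgmap and over tables, and num_for_blob is a made-up helper name. It only shows the rule that an article number found via a Message-ID is dropped when the stored blob differs from the blob being indexed.

```perl
use strict;
use warnings;

# Hypothetical stand-ins (assumptions, not the real classes):
#   %msgmap: Message-ID => article number
#   %over:   article number => { blob => OID }
my %msgmap = ('<a@example>' => 1);
my %over = (1 => { blob => 'aaaa' });
my $next_num = 2;

# pick an article number for ($oid, @mids) without clobbering
sub num_for_blob {
	my ($oid, @mids) = @_;
	my $num = $msgmap{$mids[0]};
	# don't reuse the number if it already belongs to a different blob
	if (defined($num) && (my $x = $over{$num})) {
		undef($num) if $x->{blob} ne $oid;
	}
	if (!defined($num)) {
		$num = $next_num++;
		# only claim Message-IDs which are not mapped, yet
		$msgmap{$_} //= $num for @mids;
	}
	$over{$num} = { blob => $oid };
	$num;
}

# same blob again (e.g. a reindex): the number is reused
my $same = num_for_blob('aaaa', '<a@example>');
# different blob reusing the Message-ID: gets a fresh number
my $dupe = num_for_blob('bbbb', '<a@example>', '<extra@z>');
print "same=$same dupe=$dupe blob1=$over{1}{blob}\n";
```

As in t/v2index-late-dupe.t, the entry for article 1 still refers to the original blob after the second message is added.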

* [PATCH 00/20] indexing changes and new features
@ 2020-07-24  5:55  7% Eric Wong
  2020-07-24  5:55  4% ` [PATCH 02/20] v2: index forwards (via `git log --reverse') Eric Wong
  0 siblings, 1 reply; 3+ results
From: Eric Wong @ 2020-07-24  5:55 UTC (permalink / raw)
  To: meta

--rethread and --no-sync options are now supported in
public-inbox-index.  --no-sync should be nice for users
of FSes with poor fsync(2) performance.

Now: I also wonder if --no-sync is a bad name since we
also use `sync' to mean synchronising indices.  Perhaps
--no-fsync would be a better name, though technically
SQLite and Xapian use fdatasync(2), nowadays.

Some of this is prep work for exposing THREADID via IMAP (and
JMAP) to aid in searching.

Since THREADID (`over.tid') will be exposed in a user-visible
way, I'm finally giving up on using the default (reverse
chronological) log order for indexing to ensure THREADID
ascends for newer threads.

This also simplifies the indexing code significantly.
To avoid pinning huge amounts of RAM, the working space is held
in an IdxStack temporary file.  This further simplifies our code
since we no longer have to worry about old Xapian versions
that did not use FD_CLOEXEC.

There's still more work on the horizon, here...

Eric Wong (20):
  index: support --rethread switch to fix old indices
  v2: index forwards (via `git log --reverse')
  v2writable: introduce idx_stack
  v2writable: index_sync: reduce fill_alternates calls
  v2writable: move {autime} and {cotime} into $sync state
  v2writable: allow >= 40 byte git object IDs
  v2writable: drop "EPOCH.git indexing $RANGE" progress message
  use consistent {ibx} field for writable code paths
  search: avoid copying {inboxdir}
  v2writable: use read-only PublicInbox::Git for cat_file
  v2writable: get rid of {reindex_pipe} field
  v2writable: clarify "epoch" for {last_commits}
  xapcmd: set {from} properly for v1 inboxes
  searchidx: rename _xdb_{acquire,release} => idx_
  searchidx: make v1 indexing closer to v2
  index+xcpdb: support --no-sync flag
  v2writable: share log2stack code with v1
  searchidx: support async git check
  searchidx: $batch_cb => v1_checkpoint
  v2writable: {unindexed} belongs in $sync state

 Documentation/public-inbox-index.pod |  30 +-
 Documentation/public-inbox-xcpdb.pod |   6 +
 MANIFEST                             |   3 +-
 lib/PublicInbox/Git.pm               |  72 ++++-
 lib/PublicInbox/IdxStack.pm          |  52 ++++
 lib/PublicInbox/Import.pm            |   6 +-
 lib/PublicInbox/Msgmap.pm            |  21 +-
 lib/PublicInbox/MultiMidQueue.pm     |  62 ----
 lib/PublicInbox/Over.pm              |   1 +
 lib/PublicInbox/OverIdx.pm           |  78 ++++-
 lib/PublicInbox/Search.pm            |  25 +-
 lib/PublicInbox/SearchIdx.pm         | 384 ++++++++++++------------
 lib/PublicInbox/SearchIdxShard.pm    |  12 +-
 lib/PublicInbox/Smsg.pm              |   8 +-
 lib/PublicInbox/V2Writable.pm        | 427 +++++++++------------------
 lib/PublicInbox/Xapcmd.pm            |  10 +-
 script/public-inbox-index            |   5 +-
 script/public-inbox-xcpdb            |   4 +-
 t/idx_stack.t                        |  56 ++++
 t/inbox_idle.t                       |   4 +-
 t/search.t                           |   4 +-
 t/v1reindex.t                        |  36 ++-
 t/v2reindex.t                        |  45 +++
 23 files changed, 744 insertions(+), 607 deletions(-)
 create mode 100644 lib/PublicInbox/IdxStack.pm
 delete mode 100644 lib/PublicInbox/MultiMidQueue.pm
 create mode 100644 t/idx_stack.t

^ permalink raw reply	[relevance 7%]
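
The idea of keeping the working space in a temporary file instead of RAM can be sketched with fixed-width records: push them in the (reverse chronological) order they are read, then pop from the end of the file to replay them oldest-first. This is only an illustration under assumptions, not the IdxStack.pm from this series; the record layout and the push_rec/pop_rec names here are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical fixed-width record: commit time, author time, 40-hex OID
my $FMT = 'NNH40';
my $REC_SIZE = length(pack($FMT, 0, 0, '0' x 40)); # 4 + 4 + 20 bytes

# anonymous temporary file, removed automatically by the OS
open(my $fh, '+>', undef) or die "anonymous tempfile: $!";
binmode $fh;
my $pos = 0; # logical top of the stack, in bytes

sub push_rec {
	my ($ct, $at, $oid) = @_;
	seek($fh, $pos, 0) or die "seek: $!";
	print $fh pack($FMT, $ct, $at, $oid) or die "print: $!";
	$pos += $REC_SIZE;
}

sub pop_rec {
	return if $pos <= 0; # empty list ends the caller's loop
	$pos -= $REC_SIZE;
	seek($fh, $pos, 0) or die "seek: $!";
	my $n = read($fh, my $buf, $REC_SIZE);
	die "short read" unless defined($n) && $n == $REC_SIZE;
	unpack($FMT, $buf);
}

# push newest-first, as the default log order would provide them
push_rec(300, 300, 'c' x 40);
push_rec(200, 200, 'b' x 40);
push_rec(100, 100, 'a' x 40);

my @order;
while (my ($ct, $at, $oid) = pop_rec()) {
	push @order, $ct;
}
print "@order\n"; # oldest commit time comes out first
```

Only one record is held in memory at a time, so the cost moves from RAM to $TMPDIR space.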

* [PATCH 02/20] v2: index forwards (via `git log --reverse')
  2020-07-24  5:55  7% [PATCH 00/20] indexing changes and new features Eric Wong
@ 2020-07-24  5:55  4% ` Eric Wong
  0 siblings, 0 replies; 3+ results
From: Eric Wong @ 2020-07-24  5:55 UTC (permalink / raw)
  To: meta

Since we'll need to expose THREADID to JMAP and IMAP users,
index all messages in the order they were committed to ensure
our `tid' (thread ID) column ascends in mirrors the same way
it does in the source inbox.

This drastically simplifies our code but increases memory
usage of `git-log'.  The next commit will bring memory use
back down at the expense of $TMPDIR usage.
---
 MANIFEST                         |   1 -
 lib/PublicInbox/MultiMidQueue.pm |  62 -------
 lib/PublicInbox/V2Writable.pm    | 279 +++++++++----------------------
 3 files changed, 81 insertions(+), 261 deletions(-)
 delete mode 100644 lib/PublicInbox/MultiMidQueue.pm

diff --git a/MANIFEST b/MANIFEST
index 963caad02..9d90c8c23 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -155,7 +155,6 @@ lib/PublicInbox/MboxGz.pm
 lib/PublicInbox/MsgIter.pm
 lib/PublicInbox/MsgTime.pm
 lib/PublicInbox/Msgmap.pm
-lib/PublicInbox/MultiMidQueue.pm
 lib/PublicInbox/NNTP.pm
 lib/PublicInbox/NNTPD.pm
 lib/PublicInbox/NNTPdeflate.pm
diff --git a/lib/PublicInbox/MultiMidQueue.pm b/lib/PublicInbox/MultiMidQueue.pm
deleted file mode 100644
index eb2ecf2f2..000000000
--- a/lib/PublicInbox/MultiMidQueue.pm
+++ /dev/null
@@ -1,62 +0,0 @@
-# Copyright (C) 2020 all contributors <meta@public-inbox.org>
-# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
-
-# temporary queue for public-inbox-index to support multi-Message-ID
-# messages on mirrors of v2 inboxes
-package PublicInbox::MultiMidQueue;
-use strict;
-use SDBM_File; # part of Perl standard library
-use Fcntl qw(O_RDWR O_CREAT);
-use File::Temp 0.19 (); # 0.19 for ->newdir
-my %e = (
-	freebsd => 0x100000,
-	linux => 0x80000,
-	netbsd => 0x400000,
-	openbsd => 0x10000,
-);
-my $O_CLOEXEC = $e{$^O} // 0;
-
-sub new {
-	my ($class) = @_;
-	my $tmpdir = File::Temp->newdir('multi-mid-q-XXXXXX', TMPDIR => 1);
-	my $base = $tmpdir->dirname . '/q';
-	my %sdbm;
-	my $flags = O_RDWR|O_CREAT;
-	if (!tie(%sdbm, 'SDBM_File', $base, $flags|$O_CLOEXEC, 0600)) {
-		if (!tie(%sdbm, 'SDBM_File', $base, $flags, 0600)) {
-			die "could not tie ($base): $!";
-		}
-		$O_CLOEXEC = 0;
-	}
-
-	bless {
-		cur => 1,
-		min => 1,
-		max => 0,
-		sdbm => \%sdbm,
-		tmpdir => $tmpdir,
-	}, $class;
-}
-
-sub set_oid {
-	my ($self, $i, $oid, $v2w) = @_;
-	$self->{max} = $i if $i > $self->{max};
-	$self->{min} = $i if $i < $self->{min};
-	$self->{sdbm}->{$i} = "$oid\t$v2w->{autime}\t$v2w->{cotime}";
-}
-
-sub get_oid {
-	my ($self, $i, $v2w) = @_;
-	my $rec = $self->{sdbm}->{$i} or return;
-	my ($oid, $autime, $cotime) = split(/\t/, $rec);
-	$v2w->{autime} = $autime;
-	$v2w->{cotime} = $cotime;
-	$oid
-}
-
-sub push_oid {
-	my ($self, $oid, $v2w) = @_;
-	set_oid($self, $self->{cur}++, $oid, $v2w);
-}
-
-1;
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 16556ddc2..c04ea5d77 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -18,10 +18,10 @@ use PublicInbox::OverIdx;
 use PublicInbox::Msgmap;
 use PublicInbox::Spawn qw(spawn popen_rd);
 use PublicInbox::SearchIdx;
-use PublicInbox::MultiMidQueue;
 use IO::Handle; # ->autoflush
 use File::Temp qw(tempfile);
 
+my $x40 = qr/[a-f0-9]{40}/;
 # an estimate of the post-packed size to the raw uncompressed size
 my $PACKING_FACTOR = 0.4;
 
@@ -862,18 +862,6 @@ sub atfork_child {
 	$self->{bnote}->[1];
 }
 
-sub mark_deleted ($$$$) {
-	my ($self, $sync, $git, $oid) = @_;
-	return if PublicInbox::SearchIdx::too_big($self, $git, $oid);
-	my $msgref = $git->cat_file($oid);
-	my $mime = PublicInbox::Eml->new($$msgref);
-	my $mids = mids($mime->header_obj);
-	my $chash = content_hash($mime);
-	foreach my $mid (@$mids) {
-		$sync->{D}->{"$mid\0$chash"} = $oid;
-	}
-}
-
 sub reindex_checkpoint ($$$) {
 	my ($self, $sync, $git) = @_;
 
@@ -891,107 +879,11 @@ sub reindex_checkpoint ($$$) {
 	$sync->{mm_tmp}->atfork_parent;
 }
 
-# only for a few odd messages with multiple Message-IDs
-sub reindex_oid_m ($$$$;$) {
-	my ($self, $sync, $git, $oid, $regen_num) = @_;
-	$self->{current_info} = "multi_mid $oid";
-	my ($num, $mid0, $len);
-	my $msgref = $git->cat_file($oid, \$len);
-	my $mime = PublicInbox::Eml->new($$msgref);
-	my $mids = mids($mime->header_obj);
-	my $chash = content_hash($mime);
-	die "BUG: reindex_oid_m called for <=1 mids" if scalar(@$mids) <= 1;
-
-	for my $mid (reverse @$mids) {
-		delete($sync->{D}->{"$mid\0$chash"}) and
-			die "BUG: reindex_oid should handle <$mid> delete";
-	}
-	my $over = $self->{over};
-	for my $mid (reverse @$mids) {
-		($num, $mid0) = $over->num_mid0_for_oid($oid, $mid);
-		next unless defined $num;
-		if (defined($regen_num) && $regen_num != $num) {
-			die "BUG: regen(#$regen_num) != over(#$num)";
-		}
-	}
-	unless (defined($num)) {
-		for my $mid (reverse @$mids) {
-			# is this a number we got before?
-			my $n = $sync->{mm_tmp}->num_for($mid);
-			next unless defined $n;
-			next if defined($regen_num) && $regen_num != $n;
-			($num, $mid0) = ($n, $mid);
-			last;
-		}
-	}
-	if (defined($num)) {
-		$sync->{mm_tmp}->num_delete($num);
-	} elsif (defined $regen_num) {
-		$num = $regen_num;
-		for my $mid (reverse @$mids) {
-			$self->{mm}->mid_set($num, $mid) == 1 or next;
-			$mid0 = $mid;
-			last;
-		}
-		unless (defined $mid0) {
-			warn "E: cannot regen #$num\n";
-			return;
-		}
-	} else { # fixup bugs in old mirrors on reindex
-		for my $mid (reverse @$mids) {
-			$num = $self->{mm}->mid_insert($mid);
-			next unless defined $num;
-			$mid0 = $mid;
-			last;
-		}
-		if (defined $mid0) {
-			if ($sync->{reindex}) {
-				warn "reindex added #$num <$mid0>\n";
-			}
-		} else {
-			warn "E: cannot find article #\n";
-			return;
-		}
-	}
-	$sync->{nr}++;
-	my $smsg = bless {
-		raw_bytes => $len,
-		num => $num,
-		blob => $oid,
-		mid => $mid0,
-	}, 'PublicInbox::Smsg';
-	$smsg->populate($mime, $self);
-	if (do_idx($self, $msgref, $mime, $smsg)) {
-		reindex_checkpoint($self, $sync, $git);
-	}
-}
-
-sub check_unindexed ($$$) {
-	my ($self, $num, $mid0) = @_;
-	my $unindexed = $self->{unindexed} // {};
-	my $n = delete($unindexed->{$mid0});
-	defined $n or return;
-	if ($n != $num) {
-		die "BUG: unindexed $n != $num <$mid0>\n";
-	} else {
-		$self->{mm}->mid_set($num, $mid0);
-	}
-}
-
-sub multi_mid_q_push ($$$) {
-	my ($self, $sync, $oid) = @_;
-	my $multi_mid = $sync->{multi_mid} //= PublicInbox::MultiMidQueue->new;
-	if ($sync->{reindex}) { # no regen on reindex
-		$multi_mid->push_oid($oid, $self);
-	} else {
-		my $num = $sync->{regen}--;
-		die "BUG: ran out of article numbers" if $num <= 0;
-		$multi_mid->set_oid($num, $oid, $self);
-	}
-}
-
 sub reindex_oid ($$$$) {
 	my ($self, $sync, $git, $oid) = @_;
+	if (my $D = $sync->{D}) { # don't waste I/O on deletes
+		return if $D->{pack('H*', $oid)};
+	}
 	return if PublicInbox::SearchIdx::too_big($self, $git, $oid);
 	my ($num, $mid0, $len);
 	my $msgref = $git->cat_file($oid, \$len);
@@ -1003,48 +895,57 @@ sub reindex_oid ($$$$) {
 	if (scalar(@$mids) == 0) {
 		warn "E: $oid has no Message-ID, skipping\n";
 		return;
-	} elsif (scalar(@$mids) == 1) {
-		my $mid = $mids->[0];
-
-		# was the file previously marked as deleted?, skip if so
-		if (delete($sync->{D}->{"$mid\0$chash"})) {
-			if (!$sync->{reindex}) {
-				$num = $sync->{regen}--;
-				$self->{mm}->num_highwater($num);
-			}
-			return;
-		}
+	}
 
-		# is this a number we got before?
-		$num = $sync->{mm_tmp}->num_for($mid);
+	# {unindexed} is unlikely
+	if ((my $unindexed = $self->{unindexed}) && scalar(@$mids) == 1) {
+		$num = delete($unindexed->{$mids->[0]});
 		if (defined $num) {
-			$mid0 = $mid;
-			check_unindexed($self, $num, $mid0);
-		} else {
-			$num = $sync->{regen}--;
-			die "BUG: ran out of article numbers" if $num <= 0;
-			if ($self->{mm}->mid_set($num, $mid) != 1) {
-				warn "E: unable to assign $num => <$mid>\n";
-				return;
-			}
-			$mid0 = $mid;
+			$mid0 = $mids->[0];
+			$self->{mm}->mid_set($num, $mid0);
+			delete($self->{unindexed}) if !keys(%$unindexed);
+		}
+	}
+	if (!defined($num)) { # reuse if reindexing (or duplicates)
+		my $over = $self->{over};
+		for my $mid (@$mids) {
+			($num, $mid0) = $over->num_mid0_for_oid($oid, $mid);
+			last if defined $num;
 		}
-	} else { # multiple MIDs are a weird case:
-		my $del = 0;
-		for (@$mids) {
-			$del += delete($sync->{D}->{"$_\0$chash"}) // 0;
+	}
+	$mid0 //= do { # is this a number we got before?
+		$num = $sync->{mm_tmp}->num_for($mids->[0]);
+		defined($num) ? $mids->[0] : undef;
+	};
+	if (!defined($num)) {
+		for (my $i = $#$mids; $i >= 1; $i--) {
+			$num = $sync->{mm_tmp}->num_for($mids->[$i]);
+			if (defined($num)) {
+				$mid0 = $mids->[$i];
+				last;
+			}
 		}
-		if ($del) {
-			unindex_oid_remote($self, $oid, $_) for @$mids;
-			# do not delete from {mm_tmp}, since another
-			# single-MID message may use it.
-		} else { # handle them at the end:
-			multi_mid_q_push($self, $sync, $oid);
+	}
+	if (defined($num)) {
+		$sync->{mm_tmp}->num_delete($num);
+	} else { # never seen
+		$num = $self->{mm}->mid_insert($mids->[0]);
+		if (defined($num)) {
+			$mid0 = $mids->[0];
+		} else { # rare, try the rest of them, backwards
+			for (my $i = $#$mids; $i >= 1; $i--) {
+				$num = $self->{mm}->mid_insert($mids->[$i]);
+				if (defined($num)) {
+					$mid0 = $mids->[$i];
+					last;
+				}
+			}
 		}
+	}
+	if (!defined($num)) {
+		warn "E: $oid <", join('> <', @$mids), "> is a duplicate\n";
 		return;
 	}
-	$sync->{mm_tmp}->mid_delete($mid0) or
-		die "failed to delete <$mid0> for article #$num\n";
 	$sync->{nr}++;
 	my $smsg = bless {
 		raw_bytes => $len,
@@ -1134,6 +1035,22 @@ $range
 	$range;
 }
 
+# don't bump num_highwater on --reindex
+sub mark_deleted ($$$) {
+	my ($git, $sync, $range) = @_;
+	my $D = $sync->{D} //= {}; # pack("H*", $oid) => NR
+	my $fh = $git->popen(qw(log --raw --no-abbrev
+			--pretty=tformat:%H
+			--no-notes --no-color --no-renames
+			--diff-filter=AM), $range, '--', 'd');
+	while (<$fh>) {
+		if (/\A:\d{6} 100644 $x40 ($x40) [AM]\td$/o) {
+			$D->{pack('H*', $1)}++;
+		}
+	}
+	close $fh or die "git log failed: \$?=$?";
+}
+
 sub sync_prepare ($$$) {
 	my ($self, $sync, $epoch_max) = @_;
 	my $pr = $sync->{-opt}->{-progress};
@@ -1144,7 +1061,7 @@ sub sync_prepare ($$$) {
 	# without {reindex}
 	my $reindex_heads = last_commits($self, $epoch_max) if $sync->{reindex};
 
-	for (my $i = $epoch_max; $i >= 0; $i--) {
+	for my $i (0..$epoch_max) {
 		die 'BUG: already indexing!' if $self->{reindex_pipe};
 		my $git_dir = git_dir_n($self, $i);
 		-d $git_dir or next; # missing epochs are fine
@@ -1168,8 +1085,8 @@ sub sync_prepare ($$$) {
 		close $fh or die "git log failed: \$?=$?";
 		$pr->("$n\n") if $pr;
 		$regen_max += $n;
+		mark_deleted($git, $sync, $range) if $sync->{reindex};
 	}
-
 	return 0 if (!$regen_max && !keys(%{$self->{unindex_range}}));
 
 	# reindex should NOT see new commits anymore, if we do,
@@ -1203,10 +1120,8 @@ sub unindex_oid ($$$;$) {
 		my ($id, $prev);
 		while (my $smsg = $over->next_by_mid($mid, \$id, \$prev)) {
 			$gone{$smsg->{num}} = 1 if $oid eq $smsg->{blob};
-			1; # continue
 		}
-		my $n = scalar keys %gone;
-		next unless $n;
+		my $n = scalar(keys(%gone)) or next;
 		if ($n > 1) {
 			warn "BUG: multiple articles linked to $oid\n",
 				join(',',sort keys %gone), "\n";
@@ -1222,7 +1137,6 @@ sub unindex_oid ($$$;$) {
 	}
 }
 
-my $x40 = qr/[a-f0-9]{40}/;
 sub unindex ($$$$) {
 	my ($self, $sync, $git, $unindex_range) = @_;
 	my $unindexed = $self->{unindexed} ||= {}; # $mid0 => $num
@@ -1276,22 +1190,29 @@ sub index_epoch ($$$) {
 	if (my $pr = $sync->{-opt}->{-progress}) {
 		$pr->("$i.git indexing $range\n");
 	}
-
-	my @cmd = qw(log --raw -r --pretty=tformat:%H.%at.%ct
+	my @cmd = qw(log --reverse --raw -r --pretty=tformat:%H.%at.%ct
 			--no-notes --no-color --no-abbrev --no-renames);
 	my $fh = $self->{reindex_pipe} = $git->popen(@cmd, $range);
 	my $cmt;
+	my $D = $sync->{D};
 	while (<$fh>) {
 		chomp;
 		$self->{current_info} = "$i.git $_";
 		if (/\A($x40)\.([0-9]+)\.([0-9]+)$/o) {
-			$cmt //= $1;
+			$cmt = $1;
 			$self->{autime} = $2;
 			$self->{cotime} = $3;
 		} elsif (/\A:\d{6} 100644 $x40 ($x40) [AM]\tm$/o) {
 			reindex_oid($self, $sync, $git, $1);
 		} elsif (/\A:\d{6} 100644 $x40 ($x40) [AM]\td$/o) {
-			mark_deleted($self, $sync, $git, $1);
+			# allow re-add if there was user error
+			my $oid = $1;
+			if ($D) {
+				my $oid_bin = pack('H*', $oid);
+				my $nr = --$D->{$oid_bin};
+				delete($D->{$oid_bin}) if $nr <= 0;
+			}
+			unindex_oid($self, $git, $oid);
 		}
 	}
 	close $fh or die "git log failed: \$?=$?";
@@ -1310,15 +1231,12 @@ sub index_sync {
 	$self->idx_init($opt); # acquire lock
 	$self->{over}->rethread_prepare($opt);
 	my $sync = {
-		D => {}, # "$mid\0$chash" => $oid
 		unindex_range => {}, # EPOCH => oid_old..oid_new
 		reindex => $opt->{reindex},
 		-opt => $opt
 	};
 	$sync->{ranges} = sync_ranges($self, $sync, $epoch_max);
-	$sync->{regen} = sync_prepare($self, $sync, $epoch_max);
-
-	if ($sync->{regen}) {
+	if (sync_prepare($self, $sync, $epoch_max)) {
 		# tmp_clone seems to fail if inside a transaction, so
 		# we rollback here (because we opened {mm} for reading)
 		# Note: we do NOT rely on DBI transactions for atomicity;
@@ -1328,43 +1246,8 @@ sub index_sync {
 		$sync->{mm_tmp} = $self->{mm}->tmp_clone;
 	}
 
-	# work backwards through history
-	for (my $i = $epoch_max; $i >= 0; $i--) {
-		index_epoch($self, $sync, $i);
-	}
-
-	# unindex is required for leftovers if "deletes" affect messages
-	# in a previous fetch+index window:
-	my $git;
-	if (my @leftovers = values %{delete $sync->{D}}) {
-		$git = $self->{-inbox}->git;
-		for my $oid (@leftovers) {
-			$self->{current_info} = "leftover $oid";
-			unindex_oid($self, $git, $oid);
-		}
-	}
-	if (my $multi_mid = delete $sync->{multi_mid}) {
-		$git //= $self->{-inbox}->git;
-		my $min = $multi_mid->{min};
-		my $max = $multi_mid->{max};
-		if ($sync->{reindex}) {
-			# we may need to create new Message-IDs if mirrors
-			# were initially indexed with old versions
-			for (my $i = $max; $i >= $min; $i--) {
-				my $oid;
-				$oid = $multi_mid->get_oid($i, $self) or next;
-				next unless defined $oid;
-				reindex_oid_m($self, $sync, $git, $oid);
-			}
-		} else { # regen on initial index
-			for my $num ($min..$max) {
-				my $oid;
-				$oid = $multi_mid->get_oid($num, $self) or next;
-				reindex_oid_m($self, $sync, $git, $oid, $num);
-			}
-		}
-	}
-	$git->cleanup if $git;
+	# work forwards through history
+	index_epoch($self, $sync, $_) for (0..$epoch_max);
 	$self->done;
 
 	if (my $nr = $sync->{nr}) {

^ permalink raw reply related	[relevance 4%]
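
The delete handling in this patch (the new mark_deleted and the `d' branch in index_epoch) can be illustrated without git. The sketch below uses made-up `git log --raw' lines and only plain hashes and arrays; the regexps and the pack('H*', ...) keys follow the patch, everything else (the sample OIDs, @raw, @indexed) is an assumption for the example. A first pass counts blobs written to the `d' path, and the forward pass skips indexing those blobs until the matching delete is reached.

```perl
use strict;
use warnings;

my $x40 = qr/[a-f0-9]{40}/;
my ($z, $A, $B) = ('0' x 40, 'a' x 40, 'b' x 40); # fake OIDs

# made-up raw diff lines in forward (--reverse) order, oldest first
my @raw = (
	":000000 100644 $z $A A\tm",
	":000000 100644 $z $B A\tm",
	":000000 100644 $z $A A\td", # later removal of the first message
);

# pass 1: binary OID => number of times it was written to `d'
my %D;
for (@raw) {
	$D{pack('H*', $1)}++ if /\A:\d{6} 100644 $x40 ($x40) [AM]\td$/;
}

# pass 2: walk forwards through history
my @indexed;
for (@raw) {
	if (/\A:\d{6} 100644 $x40 ($x40) [AM]\tm$/) {
		next if $D{pack('H*', $1)}; # don't waste I/O on deletes
		push @indexed, $1;
	} elsif (/\A:\d{6} 100644 $x40 ($x40) [AM]\td$/) {
		my $oid_bin = pack('H*', $1);
		# allow a later re-add once all deletes were seen
		delete($D{$oid_bin}) if --$D{$oid_bin} <= 0;
	}
}
print scalar(@indexed), " message(s) indexed\n";
```

Packing the 40-character hex OID into 20 bytes halves the size of each hash key, which matters when many deletes are tracked.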

2020-07-24  5:55  7% [PATCH 00/20] indexing changes and new features Eric Wong
2020-07-24  5:55  4% ` [PATCH 02/20] v2: index forwards (via `git log --reverse') Eric Wong
2021-10-18  5:09     [PATCH 0/2] fix v2 mirrors of reused Message-IDs Eric Wong
2021-10-18  5:09  5% ` [PATCH 2/2] v2: mirrors don't clobber msgs w/ " Eric Wong

Code repositories for project(s) associated with this public inbox

	https://80x24.org/public-inbox.git
