user/dev discussion of public-inbox itself
* [PATCH 0/5] scattered dev/CLI-oriented changes
@ 2020-05-10 22:37 Eric Wong
  2020-05-10 22:37 ` [PATCH 1/5] xt/eml_check_limits: check limits against an inbox Eric Wong
                   ` (4 more replies)
  0 siblings, 5 replies; 8+ messages in thread
From: Eric Wong @ 2020-05-10 22:37 UTC
  To: meta

I've been using the test in 1/5 while developing Eml for the
1.5.0 release, and it's probably a good starting point for
anybody who wants to run more stats or do more optimizations
there.

There are also a couple of comment and naming changes to make
life easier for developers.

For non-server-oriented stuff, I guess we can start
using XDG directories to avoid cluttering the top-level
of users' HOME directories.  This will make development
easier on platforms where `make' has limited `-include'
support and PERL_INLINE_DIRECTORY can't be set by a
developer's config.mak.

I'll probably integrate Eric Biederman's IMAPTracker work, soon:
https://public-inbox.org/meta/874l0i9vhc.fsf_-_@x220.int.ebiederm.org/

Eric Wong (5):
  xt/eml_check_limits: check limits against an inbox
  rename "ContentId" to "ContentHash"
  overidx: document the SQLite PRAGMA we use
  msgmap: use TRUNCATE for journal_mode, for now
  spawn: use ~/.cache/public-inbox/inline-c if writable

 Documentation/public-inbox-v2-format.pod      | 12 +--
 MANIFEST                                      |  5 +-
 .../{ContentId.pm => ContentHash.pm}          |  8 +-
 lib/PublicInbox/Import.pm                     |  2 +-
 lib/PublicInbox/Msgmap.pm                     |  4 +
 lib/PublicInbox/OverIdx.pm                    |  8 ++
 lib/PublicInbox/Spawn.pm                      | 13 +++-
 lib/PublicInbox/V2Writable.pm                 | 48 ++++++------
 script/public-inbox-edit                      | 16 ++--
 t/{content_id.t => content_hash.t}            | 14 ++--
 t/v1reindex.t                                 |  2 +-
 t/v2reindex.t                                 |  2 +-
 t/v2writable.t                                |  4 +-
 xt/eml_check_limits.t                         | 76 +++++++++++++++++++
 14 files changed, 154 insertions(+), 60 deletions(-)
 rename lib/PublicInbox/{ContentId.pm => ContentHash.pm} (93%)
 rename t/{content_id.t => content_hash.t} (64%)
 create mode 100644 xt/eml_check_limits.t


* [PATCH 1/5] xt/eml_check_limits: check limits against an inbox
  2020-05-10 22:37 [PATCH 0/5] scattered dev/CLI-oriented changes Eric Wong
@ 2020-05-10 22:37 ` Eric Wong
  2020-05-10 22:37 ` [PATCH 2/5] rename "ContentId" to "ContentHash" Eric Wong
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 8+ messages in thread
From: Eric Wong @ 2020-05-10 22:37 UTC
  To: meta

This allows maintainers to easily check limits against the
contents of existing inboxes.  This script covers most of
the new limits enforced by PublicInbox::Eml.

Usage is similar to most xt/*.t scripts:

  GIANT_INBOX_DIR=/path/to/inbox prove -bvw xt/eml_check_limits.t

Setting `TEST_CLASS=PublicInbox::MIME' allows us to check
performance and memory use against the old subclass of
Email::MIME.
---
 MANIFEST              |  1 +
 xt/eml_check_limits.t | 76 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 77 insertions(+)
 create mode 100644 xt/eml_check_limits.t

diff --git a/MANIFEST b/MANIFEST
index 9c804a0780e..b1512c7a919 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -333,6 +333,7 @@ t/x-unknown-alpine.eml
 t/xcpdb-reshard.t
 xt/cmp-msgstr.t
 xt/cmp-msgview.t
+xt/eml_check_limits.t
 xt/git-http-backend.t
 xt/git_async_cmp.t
 xt/mem-msgview.t
diff --git a/xt/eml_check_limits.t b/xt/eml_check_limits.t
new file mode 100644
index 00000000000..39de047645b
--- /dev/null
+++ b/xt/eml_check_limits.t
@@ -0,0 +1,76 @@
+#!perl -w
+# Copyright (C) 2020 all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+use Test::More;
+use PublicInbox::TestCommon;
+use PublicInbox::Eml;
+use PublicInbox::Inbox;
+use List::Util qw(max);
+use Benchmark qw(:all :hireswallclock);
+use PublicInbox::Spawn qw(popen_rd);
+use Carp ();
+require_git(2.19); # for --unordered
+require_mods(qw(BSD::Resource));
+BSD::Resource->import(qw(getrusage));
+my $cls = $ENV{TEST_CLASS};
+require_mods($cls) if $cls;
+$cls //= 'PublicInbox::Eml';
+my $inboxdir = $ENV{GIANT_INBOX_DIR};
+plan skip_all => "GIANT_INBOX_DIR not defined for $0" unless $inboxdir;
+local $PublicInbox::Eml::mime_nesting_limit = 0x7fffffff;
+local $PublicInbox::Eml::mime_parts_limit = 0x7fffffff;
+local $PublicInbox::Eml::header_size_limit = 0x7fffffff;
+my $ibx = PublicInbox::Inbox->new({ inboxdir => $inboxdir, name => 'x' });
+my $git = $ibx->git;
+my @cat = qw(cat-file --buffer --batch-check --batch-all-objects --unordered);
+my $fh = $git->popen(@cat);
+my ($m, $n);
+my $max_nest = [ 0, '' ]; # [ bytes, blob oid ]
+my $max_idx = [ 0, '' ];
+my $max_parts = [ 0, '' ];
+my $max_size = [ 0, '' ];
+my $max_hdr = [ 0, '' ];
+my $info = [ 0, '' ];
+my $each_part_cb = sub {
+	my ($p) = @_;
+	my ($part, $depth, $idx) = @$p;
+	$max_nest = [ $depth, $info->[1] ] if $depth > $max_nest->[0];
+	my $max = max(split(/\./, $idx));
+	$max_idx = [ $max, $info->[1] ] if $max > $max_idx->[0];
+	++$info->[0];
+};
+
+my ($bref, $oid, $size);
+local $SIG{__WARN__} = sub { diag "$inboxdir $oid ", @_ };
+my $cat_cb = sub {
+	($bref, $oid, undef, $size) = @_;
+	++$m;
+	$info = [ 0, $oid ];
+	my $eml = $cls->new($bref);
+	my $hdr_len = length($eml->header_obj->as_string);
+	$max_hdr = [ $hdr_len, $oid ] if $hdr_len > $max_hdr->[0];
+	$eml->each_part($each_part_cb, $info, 1);
+	$max_parts = $info if $info->[0] > $max_parts->[0];
+	$max_size = [ $size, $oid ] if $size > $max_size->[0];
+};
+
+my $t = timeit(1, sub {
+	$git->cat_async_begin;
+	my ($blob, $type);
+	while (<$fh>) {
+		($blob, $type) = split / /;
+		next if $type ne 'blob';
+		++$n;
+		$git->cat_async($blob, $cat_cb);
+	}
+	$git->cat_async_wait;
+});
+is($m, $n, 'scanned all messages');
+diag "$$ $inboxdir took ".timestr($t)." for $n <=> $m messages";
+diag "$$ max_nest $max_nest->[0] @ $max_nest->[1]";
+diag "$$ max_idx $max_idx->[0] @ $max_idx->[1]";
+diag "$$ max_parts $max_parts->[0] @ $max_parts->[1]";
+diag "$$ max_size $max_size->[0] @ $max_size->[1]";
+diag "$$ max_hdr $max_hdr->[0] @ $max_hdr->[1]";
+diag "$$ RSS ".getrusage()->maxrss. ' k';
+done_testing;

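For readers who want to see what max_nest and max_idx measure
without pointing the test at a real inbox, here is a minimal sketch
of the same bookkeeping.  scan_parts() is a hypothetical helper, and
the [ depth, dotted index ] pairs are made-up sample data standing
in for what the each_part callback receives; treating the top-level
part as depth 0 is an assumption.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical helper: takes [ depth, dotted_index ] pairs and returns
# (deepest nesting, largest index component, number of parts).
sub scan_parts {
	my ($max_nest, $max_idx, $nr) = (0, 0, 0);
	for my $p (@_) {
		my ($depth, $idx) = @$p;
		$max_nest = $depth if $depth > $max_nest;
		# largest sibling position anywhere along the dotted path
		my $m = max(split(/\./, $idx));
		$max_idx = $m if $m > $max_idx;
		++$nr;
	}
	($max_nest, $max_idx, $nr);
}

# made-up sample, not taken from a real inbox
my @parts = ([0, '1'], [1, '1.1'], [1, '1.2'], [2, '1.2.1'], [2, '1.2.14']);
print join(' ', scan_parts(@parts)), "\n"; # prints "2 14 5"
```

The test itself additionally records which blob OID produced each
maximum, so an outlier message can be inspected afterwards.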

* [PATCH 2/5] rename "ContentId" to "ContentHash"
  2020-05-10 22:37 [PATCH 0/5] scattered dev/CLI-oriented changes Eric Wong
  2020-05-10 22:37 ` [PATCH 1/5] xt/eml_check_limits: check limits against an inbox Eric Wong
@ 2020-05-10 22:37 ` Eric Wong
  2020-05-10 22:37 ` [PATCH 3/5] overidx: document the SQLite PRAGMA we use Eric Wong
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 8+ messages in thread
From: Eric Wong @ 2020-05-10 22:37 UTC
  To: meta

The old name may be confused with "Content-ID" as described in
RFC 2392, so use an alternate name to avoid confusing future
readers.
---
 Documentation/public-inbox-v2-format.pod      | 12 ++---
 MANIFEST                                      |  4 +-
 .../{ContentId.pm => ContentHash.pm}          |  8 ++--
 lib/PublicInbox/Import.pm                     |  2 +-
 lib/PublicInbox/V2Writable.pm                 | 48 +++++++++----------
 script/public-inbox-edit                      | 16 +++----
 t/{content_id.t => content_hash.t}            | 14 +++---
 t/v1reindex.t                                 |  2 +-
 t/v2reindex.t                                 |  2 +-
 t/v2writable.t                                |  4 +-
 10 files changed, 56 insertions(+), 56 deletions(-)
 rename lib/PublicInbox/{ContentId.pm => ContentHash.pm} (93%)
 rename t/{content_id.t => content_hash.t} (64%)

diff --git a/Documentation/public-inbox-v2-format.pod b/Documentation/public-inbox-v2-format.pod
index d87a717d40b..9e284a75431 100644
--- a/Documentation/public-inbox-v2-format.pod
+++ b/Documentation/public-inbox-v2-format.pod
@@ -159,7 +159,7 @@ top-level of the directory.
 
 =head1 OBJECT IDENTIFIERS
 
-There are three distinct type of identifiers.  content_id is the
+There are three distinct type of identifiers.  content_hash is the
 new one for v2 and should make message removal and deduplication
 easier.  object_id and Message-ID are already known.
 
@@ -179,11 +179,11 @@ The email header; duplicates allowed for archival purposes.
 This remains a searchable field in Xapian.  Note: it's possible
 for emails to have multiple Message-ID headers (and L<git-send-email(1)>
 had that bug for a bit); so we take all of them into account.
-In case of conflicts detected by content_id below, we generate a new
-Message-ID based on content_id; if the generated Message-ID still
+In case of conflicts detected by content_hash below, we generate a new
+Message-ID based on content_hash; if the generated Message-ID still
 conflicts, a random one is generated.
 
-=item content_id
+=item content_hash
 
 A hash of relevant headers and raw body content for
 purging of unwanted content.  This is not stored anywhere,
@@ -193,7 +193,7 @@ For now, the relevant headers are:
 
 	Subject, From, Date, References, In-Reply-To, To, Cc
 
-Received, List-Id, and similar headers are NOT part of content_id as
+Received, List-Id, and similar headers are NOT part of content_hash as
 they differ across lists and we will want removal to be able to cross
 lists.
 
@@ -203,7 +203,7 @@ raw body risks being broken by list signatures; but we can use
 filters (e.g. PublicInbox::Filter::Vger) to clean the body for
 imports.
 
-content_id is SHA-256 for now; but can be changed at any time
+content_hash is SHA-256 for now; but can be changed at any time
 without making DB changes.
 
 =back
diff --git a/MANIFEST b/MANIFEST
index b1512c7a919..7997bc9906c 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -99,7 +99,7 @@ lib/PublicInbox/AdminEdit.pm
 lib/PublicInbox/AltId.pm
 lib/PublicInbox/Cgit.pm
 lib/PublicInbox/Config.pm
-lib/PublicInbox/ContentId.pm
+lib/PublicInbox/ContentHash.pm
 lib/PublicInbox/DS.pm
 lib/PublicInbox/DSKQXS.pm
 lib/PublicInbox/DSPoll.pm
@@ -223,7 +223,7 @@ t/cgi.t
 t/check-www-inbox.perl
 t/config.t
 t/config_limiter.t
-t/content_id.t
+t/content_hash.t
 t/convert-compact.t
 t/data/0001.patch
 t/ds-kqxs.t
diff --git a/lib/PublicInbox/ContentId.pm b/lib/PublicInbox/ContentHash.pm
similarity index 93%
rename from lib/PublicInbox/ContentId.pm
rename to lib/PublicInbox/ContentHash.pm
index 8d77934f20a..420dc5e7c92 100644
--- a/lib/PublicInbox/ContentId.pm
+++ b/lib/PublicInbox/ContentHash.pm
@@ -6,11 +6,11 @@
 # This is not stored in any database anywhere and may change
 # as changes in duplicate detection are needed.
 # See L<public-inbox-v2-format(5)> manpage for more details.
-package PublicInbox::ContentId;
+package PublicInbox::ContentHash;
 use strict;
 use warnings;
 use base qw/Exporter/;
-our @EXPORT_OK = qw/content_id content_digest/;
+our @EXPORT_OK = qw/content_hash content_digest/;
 use PublicInbox::MID qw(mids references);
 use PublicInbox::MsgIter;
 
@@ -60,7 +60,7 @@ sub content_digest ($) {
 	# References: and In-Reply-To: get used interchangeably
 	# in some "duplicates" in LKML.  We treat them the same
 	# in SearchIdx, so treat them the same for this:
-	# do NOT consider the Message-ID as part of the content_id
+	# do NOT consider the Message-ID as part of the content_hash
 	# if we got here, we've already got Message-ID reuse
 	my %seen = map { $_ => 1 } @{mids($hdr)};
 	foreach my $mid (@{references($hdr)}) {
@@ -92,7 +92,7 @@ sub content_digest ($) {
 	$dig;
 }
 
-sub content_id ($) {
+sub content_hash ($) {
 	content_digest($_[0])->digest;
 }
 
diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm
index 07d18599200..fc61d06207c 100644
--- a/lib/PublicInbox/Import.pm
+++ b/lib/PublicInbox/Import.pm
@@ -13,7 +13,7 @@ use PublicInbox::Spawn qw(spawn popen_rd);
 use PublicInbox::MID qw(mids mid2path);
 use PublicInbox::Address;
 use PublicInbox::MsgTime qw(msg_timestamp msg_datestamp);
-use PublicInbox::ContentId qw(content_digest);
+use PublicInbox::ContentHash qw(content_digest);
 use PublicInbox::MDA;
 use PublicInbox::Eml;
 use POSIX qw(strftime);
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index f599e0a03d8..bf5a0df947a 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -13,7 +13,7 @@ use PublicInbox::Eml;
 use PublicInbox::Git;
 use PublicInbox::Import;
 use PublicInbox::MID qw(mids references);
-use PublicInbox::ContentId qw(content_id content_digest);
+use PublicInbox::ContentHash qw(content_hash content_digest);
 use PublicInbox::Inbox;
 use PublicInbox::OverIdx;
 use PublicInbox::Msgmap;
@@ -353,23 +353,23 @@ sub _replace_oids ($$$) {
 	$rewrites;
 }
 
-sub content_ids ($) {
+sub content_hashes ($) {
 	my ($mime) = @_;
-	my @cids = ( content_id($mime) );
+	my @chashes = ( content_hash($mime) );
 
 	# We still support Email::MIME, here, and
 	# Email::MIME->as_string doesn't always round-trip, so we may
-	# use a second content_id
-	my $rt = content_id(PublicInbox::Eml->new(\($mime->as_string)));
-	push @cids, $rt if $cids[0] ne $rt;
-	\@cids;
+	# use a second content_hash
+	my $rt = content_hash(PublicInbox::Eml->new(\($mime->as_string)));
+	push @chashes, $rt if $chashes[0] ne $rt;
+	\@chashes;
 }
 
 sub content_matches ($$) {
-	my ($cids, $existing) = @_;
-	my $cid = content_id($existing);
-	foreach (@$cids) {
-		return 1 if $_ eq $cid
+	my ($chashes, $existing) = @_;
+	my $chash = content_hash($existing);
+	foreach (@$chashes) {
+		return 1 if $_ eq $chash
 	}
 	0
 }
@@ -386,13 +386,13 @@ sub rewrite_internal ($$;$$$) {
 		$im = $self->importer;
 	}
 	my $over = $self->{over};
-	my $cids = content_ids($old_mime);
+	my $chashes = content_hashes($old_mime);
 	my @removed;
 	my $mids = mids($old_mime->header_obj);
 
 	# We avoid introducing new blobs into git since the raw content
 	# can be slightly different, so we do not need the user-supplied
-	# message now that we have the mids and content_id
+	# message now that we have the mids and content_hash
 	$old_mime = undef;
 	my $mark;
 
@@ -407,7 +407,7 @@ sub rewrite_internal ($$;$$$) {
 			}
 			my $orig = $$msg;
 			my $cur = PublicInbox::Eml->new($msg);
-			if (content_matches($cids, $cur)) {
+			if (content_matches($chashes, $cur)) {
 				$gone{$smsg->{num}} = [ $smsg, $cur, \$orig ];
 			}
 		}
@@ -835,7 +835,7 @@ sub get_blob ($$) {
 sub content_exists ($$$) {
 	my ($self, $mime, $mid) = @_;
 	my $over = $self->{over};
-	my $cids = content_ids($mime);
+	my $chashes = content_hashes($mime);
 	my ($id, $prev);
 	while (my $smsg = $over->next_by_mid($mid, \$id, \$prev)) {
 		my $msg = get_blob($self, $smsg);
@@ -844,7 +844,7 @@ sub content_exists ($$$) {
 			next;
 		}
 		my $cur = PublicInbox::Eml->new($msg);
-		return 1 if content_matches($cids, $cur);
+		return 1 if content_matches($chashes, $cur);
 
 		# XXX DEBUG_DIFF is experimental and may be removed
 		diff($mid, $cur, $mime) if $ENV{DEBUG_DIFF};
@@ -873,9 +873,9 @@ sub mark_deleted ($$$$) {
 	my $msgref = $git->cat_file($oid);
 	my $mime = PublicInbox::Eml->new($$msgref);
 	my $mids = mids($mime->header_obj);
-	my $cid = content_id($mime);
+	my $chash = content_hash($mime);
 	foreach my $mid (@$mids) {
-		$sync->{D}->{"$mid\0$cid"} = $oid;
+		$sync->{D}->{"$mid\0$chash"} = $oid;
 	}
 }
 
@@ -904,11 +904,11 @@ sub reindex_oid_m ($$$$;$) {
 	my $msgref = $git->cat_file($oid, \$len);
 	my $mime = PublicInbox::Eml->new($$msgref);
 	my $mids = mids($mime->header_obj);
-	my $cid = content_id($mime);
+	my $chash = content_hash($mime);
 	die "BUG: reindex_oid_m called for <=1 mids" if scalar(@$mids) <= 1;
 
 	for my $mid (reverse @$mids) {
-		delete($sync->{D}->{"$mid\0$cid"}) and
+		delete($sync->{D}->{"$mid\0$chash"}) and
 			die "BUG: reindex_oid should handle <$mid> delete";
 	}
 	my $over = $self->{over};
@@ -1002,7 +1002,7 @@ sub reindex_oid ($$$$) {
 	return if $len == 0; # purged
 	my $mime = PublicInbox::Eml->new($$msgref);
 	my $mids = mids($mime->header_obj);
-	my $cid = content_id($mime);
+	my $chash = content_hash($mime);
 
 	if (scalar(@$mids) == 0) {
 		warn "E: $oid has no Message-ID, skipping\n";
@@ -1011,7 +1011,7 @@ sub reindex_oid ($$$$) {
 		my $mid = $mids->[0];
 
 		# was the file previously marked as deleted?, skip if so
-		if (delete($sync->{D}->{"$mid\0$cid"})) {
+		if (delete($sync->{D}->{"$mid\0$chash"})) {
 			if (!$sync->{reindex}) {
 				$num = $sync->{regen}--;
 				$self->{mm}->num_highwater($num);
@@ -1036,7 +1036,7 @@ sub reindex_oid ($$$$) {
 	} else { # multiple MIDs are a weird case:
 		my $del = 0;
 		for (@$mids) {
-			$del += delete($sync->{D}->{"$_\0$cid"}) // 0;
+			$del += delete($sync->{D}->{"$_\0$chash"}) // 0;
 		}
 		if ($del) {
 			unindex_oid_remote($self, $oid, $_) for @$mids;
@@ -1309,7 +1309,7 @@ sub index_sync {
 	return unless defined $latest;
 	$self->idx_init($opt); # acquire lock
 	my $sync = {
-		D => {}, # "$mid\0$cid" => $oid
+		D => {}, # "$mid\0$chash" => $oid
 		unindex_range => {}, # EPOCH => oid_old..oid_new
 		reindex => $opt->{reindex},
 		-opt => $opt
diff --git a/script/public-inbox-edit b/script/public-inbox-edit
index e895a228386..d8e511b2ee4 100755
--- a/script/public-inbox-edit
+++ b/script/public-inbox-edit
@@ -9,7 +9,7 @@ use warnings;
 use Getopt::Long qw(:config gnu_getopt no_ignore_case auto_abbrev);
 use PublicInbox::AdminEdit;
 use File::Temp 0.19 (); # 0.19 for TMPDIR
-use PublicInbox::ContentId qw(content_id);
+use PublicInbox::ContentHash qw(content_hash);
 use PublicInbox::MID qw(mid_clean mids);
 PublicInbox::Admin::check_require('-index');
 use PublicInbox::Eml;
@@ -43,7 +43,7 @@ if (defined $mid && defined $file) {
 my @ibxs = PublicInbox::Admin::resolve_inboxes(\@ARGV, $opt, $cfg);
 PublicInbox::AdminEdit::check_editable(\@ibxs);
 
-my $found = {}; # cid => [ [ibx, smsg] [, [ibx, smsg] ] ]
+my $found = {}; # chash => [ [ibx, smsg] [, [ibx, smsg] ] ]
 
 sub find_mid ($$$) {
 	my ($found, $mid, $ibxs) = @_;
@@ -53,9 +53,9 @@ sub find_mid ($$$) {
 		while (my $smsg = $over->next_by_mid($mid, \$id, \$prev)) {
 			my $ref = $ibx->msg_by_smsg($smsg);
 			my $mime = PublicInbox::Eml->new($ref);
-			my $cid = content_id($mime);
+			my $chash = content_hash($mime);
 			my $tuple = [ $ibx, $smsg ];
-			push @{$found->{$cid} ||= []}, $tuple
+			push @{$found->{$chash} ||= []}, $tuple
 		}
 		PublicInbox::InboxWritable::cleanup($ibx);
 	}
@@ -96,8 +96,8 @@ Multiple messages with different content found matching
 		die "open($file) failed: $!";
 	my $mids = mids($mime->header_obj);
 	find_mid($found, $_, \@ibxs) for (@$mids); # populates $found
-	my $cid = content_id($mime);
-	my $to_edit = $found->{$cid};
+	my $chash = content_hash($mime);
+	my $to_edit = $found->{$chash};
 	unless ($to_edit) {
 		my $nr = scalar(keys %$found);
 		if ($nr > 0) {
@@ -115,7 +115,7 @@ $mids
 		}
 		exit 1;
 	}
-	$found = { $cid => $to_edit };
+	$found = { $chash => $to_edit };
 }
 
 my %tmpopt = (
@@ -218,7 +218,7 @@ W: possible message boundary splitting error
 	my $nhdr = $new_mime->header_obj;
 	my $ohdr = $old_mime->header_obj;
 	if (($nhdr->as_string eq $ohdr->as_string) &&
-	    (content_id($new_mime) eq content_id($old_mime))) {
+	    (content_hash($new_mime) eq content_hash($old_mime))) {
 		warn "No change detected to:\n", show_cmd($ibx, $smsg);
 
 		next unless $opt->{verbose};
diff --git a/t/content_id.t b/t/content_hash.t
similarity index 64%
rename from t/content_id.t
rename to t/content_hash.t
index 9df81aa8293..646aab07c9a 100644
--- a/t/content_id.t
+++ b/t/content_hash.t
@@ -3,7 +3,7 @@
 use strict;
 use warnings;
 use Test::More;
-use PublicInbox::ContentId qw(content_id);
+use PublicInbox::ContentHash qw(content_hash);
 use PublicInbox::Eml;
 
 my $mime = PublicInbox::Eml->new(<<'EOF');
@@ -16,17 +16,17 @@ Date: Fri, 02 Oct 1993 00:00:00 +0000
 hello world
 EOF
 
-my $orig = content_id($mime);
-my $reload = content_id(PublicInbox::Eml->new($mime->as_string));
-is($orig, $reload, 'content_id matches after serialization');
+my $orig = content_hash($mime);
+my $reload = content_hash(PublicInbox::Eml->new($mime->as_string));
+is($orig, $reload, 'content_hash matches after serialization');
 
 foreach my $h (qw(From To Cc)) {
 	my $n = q("Quoted N'Ame" <foo@EXAMPLE.com>);
 	$mime->header_set($h, "$n");
-	my $q = content_id($mime);
-	is($mime->header($h), $n, "content_id does not mutate $h:");
+	my $q = content_hash($mime);
+	is($mime->header($h), $n, "content_hash does not mutate $h:");
 	$mime->header_set($h, 'Quoted N\'Ame <foo@example.com>');
-	my $nq = content_id($mime);
+	my $nq = content_hash($mime);
 	is($nq, $q, "quotes ignored in $h:");
 }
 
diff --git a/t/v1reindex.t b/t/v1reindex.t
index 13605f8bd6c..9f23ef01e56 100644
--- a/t/v1reindex.t
+++ b/t/v1reindex.t
@@ -3,7 +3,7 @@
 use strict;
 use warnings;
 use Test::More;
-use PublicInbox::ContentId qw(content_digest);
+use PublicInbox::ContentHash qw(content_digest);
 use File::Path qw(remove_tree);
 use PublicInbox::TestCommon;
 use PublicInbox::Eml;
diff --git a/t/v2reindex.t b/t/v2reindex.t
index f16a0b0d81c..b99106d0fe7 100644
--- a/t/v2reindex.t
+++ b/t/v2reindex.t
@@ -4,7 +4,7 @@ use strict;
 use warnings;
 use Test::More;
 use PublicInbox::Eml;
-use PublicInbox::ContentId qw(content_digest);
+use PublicInbox::ContentHash qw(content_digest);
 use File::Path qw(remove_tree);
 use PublicInbox::TestCommon;
 require_git(2.6);
diff --git a/t/v2writable.t b/t/v2writable.t
index e5a565cea23..fa5c786e151 100644
--- a/t/v2writable.t
+++ b/t/v2writable.t
@@ -4,7 +4,7 @@ use strict;
 use warnings;
 use Test::More;
 use PublicInbox::Eml;
-use PublicInbox::ContentId qw(content_digest content_id);
+use PublicInbox::ContentHash qw(content_digest content_hash);
 use PublicInbox::TestCommon;
 use Cwd qw(abs_path);
 require_git(2.6);
@@ -215,7 +215,7 @@ EOF
 	$im = PublicInbox::V2Writable->new($ibx, {nproc => 2});
 	is($im->{shards}, 1, 'detected single shard from previous');
 	my ($mark, $rm_mime, $smsg) = $im->remove($mime, 'test removal');
-	is(content_id($rm_mime), content_id($mime),
+	is(content_hash($rm_mime), content_hash($mime),
 			'removed object returned matches');
 	ok(defined($mark), 'mark set');
 	$im->done;

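To illustrate why content_hash ignores headers such as Received and
List-Id, here is a rough sketch of the idea described in
public-inbox-v2-format(5).  sketch_content_hash() is a hypothetical
stand-in and NOT PublicInbox::ContentHash::content_digest: the real
code also handles Message-ID/References specially and (per
t/content_hash.t) ignores quoting differences in address headers.
The plain hash-of-headers input format is likewise an assumption made
to keep the example self-contained.

```perl
use strict;
use warnings;
use Digest::SHA ();

# Header list taken from the v2 format documentation
my @RELEVANT = qw(Subject From Date References In-Reply-To To Cc);

# Hypothetical helper: $hdr is { Name => value }, $body is a string.
# Returns the raw 32-byte SHA-256 digest.
sub sketch_content_hash {
	my ($hdr, $body) = @_;
	my $dig = Digest::SHA->new(256);
	for my $h (@RELEVANT) {
		my $v = $hdr->{$h};
		next unless defined $v;
		$dig->add("$h\0$v\0"); # NUL-delimit to avoid ambiguity
	}
	$dig->add($body);
	$dig->digest;
}

my %common = (Subject => 'hello', From => 'a@example.com');
my $list_a = sketch_content_hash({ %common, 'List-Id' => 'a' }, "hi\n");
my $list_b = sketch_content_hash({ %common, 'List-Id' => 'b' }, "hi\n");
print $list_a eq $list_b ? "same\n" : "different\n"; # prints "same"
```

Since only the listed headers and the body feed the digest, copies of
one message delivered via different lists compare equal, which is
what lets removal work across lists.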

* [PATCH 3/5] overidx: document the SQLite PRAGMA we use
  2020-05-10 22:37 [PATCH 0/5] scattered dev/CLI-oriented changes Eric Wong
  2020-05-10 22:37 ` [PATCH 1/5] xt/eml_check_limits: check limits against an inbox Eric Wong
  2020-05-10 22:37 ` [PATCH 2/5] rename "ContentId" to "ContentHash" Eric Wong
@ 2020-05-10 22:37 ` Eric Wong
  2020-05-10 22:37 ` [PATCH 4/5] msgmap: use TRUNCATE for journal_mode, for now Eric Wong
  2020-05-10 22:37 ` [PATCH 5/5] spawn: use ~/.cache/public-inbox/inline-c if writable Eric Wong
  4 siblings, 0 replies; 8+ messages in thread
From: Eric Wong @ 2020-05-10 22:37 UTC
  To: meta

This ought to prevent cargo-culting the cache_size PRAGMA
into smaller SQLite DBs we might use.
---
 lib/PublicInbox/OverIdx.pm | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/lib/PublicInbox/OverIdx.pm b/lib/PublicInbox/OverIdx.pm
index acbf2c8de60..cb15baadf2b 100644
--- a/lib/PublicInbox/OverIdx.pm
+++ b/lib/PublicInbox/OverIdx.pm
@@ -21,8 +21,16 @@ use PublicInbox::Search;
 sub dbh_new {
 	my ($self) = @_;
 	my $dbh = $self->SUPER::dbh_new(1);
+
+	# TRUNCATE reduces I/O compared to the default (DELETE)
 	$dbh->do('PRAGMA journal_mode = TRUNCATE');
+
+	# 80000 pages (80MiB on SQLite <3.12.0, 320MiB on 3.12.0+)
+	# was found to be good in 2018 during the large LKML import
+	# at the time.  This ought to be configurable based on HW
+	# and inbox size; I suspect it's overkill for many inboxes.
 	$dbh->do('PRAGMA cache_size = 80000');
+
 	create_tables($dbh);
 	$dbh;
 }

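The two figures in the comment come from the default SQLite page
size, which went from 1024 to 4096 bytes in 3.12.0; a positive
cache_size is a page count, not a byte count.  A small sketch of the
arithmetic (cache_mib() is a hypothetical helper, and it assumes the
DB uses the default page size for its SQLite version):

```perl
use strict;
use warnings;

# Hypothetical helper: memory used by a positive cache_size, in MiB
sub cache_mib {
	my ($pages, $page_size) = @_;
	$pages * $page_size / (1024 * 1024);
}

for my $page_size (1024, 4096) {
	printf "%d pages x %d bytes = %.3f MiB\n",
		80000, $page_size, cache_mib(80000, $page_size);
}
```

So the comment's "80MiB" and "320MiB" are round numbers for 78.125
and 312.5 MiB.  SQLite also accepts a negative cache_size, which is
interpreted as a size in KiB regardless of page size; that may be a
simpler knob if this ever becomes configurable.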

* [PATCH 4/5] msgmap: use TRUNCATE for journal_mode, for now
  2020-05-10 22:37 [PATCH 0/5] scattered dev/CLI-oriented changes Eric Wong
                   ` (2 preceding siblings ...)
  2020-05-10 22:37 ` [PATCH 3/5] overidx: document the SQLite PRAGMA we use Eric Wong
@ 2020-05-10 22:37 ` Eric Wong
  2020-05-10 22:37 ` [PATCH 5/5] spawn: use ~/.cache/public-inbox/inline-c if writable Eric Wong
  4 siblings, 0 replies; 8+ messages in thread
From: Eric Wong @ 2020-05-10 22:37 UTC
  To: meta

TRUNCATE avoids I/O on the directory itself (the journal file is
truncated instead of being unlinked), which could prolong the
lifetime of the storage device.
---
 lib/PublicInbox/Msgmap.pm | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/lib/PublicInbox/Msgmap.pm b/lib/PublicInbox/Msgmap.pm
index 9523752e9af..5fe14383ac4 100644
--- a/lib/PublicInbox/Msgmap.pm
+++ b/lib/PublicInbox/Msgmap.pm
@@ -48,6 +48,10 @@ sub new_file {
 
 	if ($writable) {
 		create_tables($dbh);
+
+		# TRUNCATE reduces I/O compared to the default (DELETE)
+		$dbh->do('PRAGMA journal_mode = TRUNCATE');
+
 		$dbh->begin_work;
 		$self->created_at(time) unless $self->created_at;
 


* [PATCH 5/5] spawn: use ~/.cache/public-inbox/inline-c if writable
  2020-05-10 22:37 [PATCH 0/5] scattered dev/CLI-oriented changes Eric Wong
                   ` (3 preceding siblings ...)
  2020-05-10 22:37 ` [PATCH 4/5] msgmap: use TRUNCATE for journal_mode, for now Eric Wong
@ 2020-05-10 22:37 ` Eric Wong
  2020-05-11  0:29   ` Eric Wong
  4 siblings, 1 reply; 8+ messages in thread
From: Eric Wong @ 2020-05-10 22:37 UTC
  To: meta

Despite several memory reductions and pure Perl performance
improvements, Inline::C spawn() still gives us a noticeable
performance boost.

More user-oriented command-line programs are likely coming;
setting PERL_INLINE_DIRECTORY is annoying to users, and so is
poor performance.  So allow users to opt in to using our
Inline::C code once by creating a `~/.cache/public-inbox/inline-c'
directory.

XDG_CACHE_HOME is respected to override the location of ~/.cache
independent of HOME, according to
https://specifications.freedesktop.org/basedir-spec/0.6/ar01s03.html
---
 lib/PublicInbox/Spawn.pm | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/lib/PublicInbox/Spawn.pm b/lib/PublicInbox/Spawn.pm
index ad6be1878a0..489472502fa 100644
--- a/lib/PublicInbox/Spawn.pm
+++ b/lib/PublicInbox/Spawn.pm
@@ -2,7 +2,8 @@
 # License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
 #
 # This allows vfork to be used for spawning subprocesses if
-# PERL_INLINE_DIRECTORY is explicitly defined in the environment.
+# ~/.cache/public-inbox/inline-c is writable or if PERL_INLINE_DIRECTORY
+# is explicitly defined in the environment (and writable).
 # Under Linux, vfork can make a big difference in spawning performance
 # as process size increases (fork still needs to mark pages for CoW use).
 # Currently, we only use this for code intended for long running
@@ -140,8 +141,12 @@ int pi_fork_exec(SV *redirref, SV *file, SV *cmdref, SV *envref, SV *rlimref,
 }
 VFORK_SPAWN
 
-my $inline_dir = $ENV{PERL_INLINE_DIRECTORY};
-$vfork_spawn = undef unless defined $inline_dir && -d $inline_dir && -w _;
+my $inline_dir = $ENV{PERL_INLINE_DIRECTORY} // (
+		$ENV{XDG_CACHE_HOME} //
+		(($ENV{HOME} // (getpwuid($>))[7]).'/.cache')
+	).'/public-inbox/inline-c';
+
+$vfork_spawn = undef unless -d $inline_dir && -w _;
 if (defined $vfork_spawn) {
 	# Inline 0.64 or later has locking in multi-process env,
 	# but we support 0.5 on Debian wheezy
@@ -150,7 +155,7 @@ if (defined $vfork_spawn) {
 		my $f = "$inline_dir/.public-inbox.lock";
 		open my $fh, '>', $f or die "failed to open $f: $!\n";
 		flock($fh, LOCK_EX) or die "LOCK_EX failed on $f: $!\n";
-		eval 'use Inline C => $vfork_spawn'; #, BUILD_NOISY => 1';
+		eval 'use Inline C => $vfork_spawn, directory => $inline_dir';
 		my $err = $@;
 		flock($fh, LOCK_UN) or die "LOCK_UN failed on $f: $!\n";
 		die $err if $err;


* Re: [PATCH 5/5] spawn: use ~/.cache/public-inbox/inline-c if writable
  2020-05-10 22:37 ` [PATCH 5/5] spawn: use ~/.cache/public-inbox/inline-c if writable Eric Wong
@ 2020-05-11  0:29   ` Eric Wong
  2020-05-11  4:27     ` [PATCH v2] " Eric Wong
  0 siblings, 1 reply; 8+ messages in thread
From: Eric Wong @ 2020-05-11  0:29 UTC
  To: meta

Eric Wong <e@yhbt.net> wrote:
> -my $inline_dir = $ENV{PERL_INLINE_DIRECTORY};
> -$vfork_spawn = undef unless defined $inline_dir && -d $inline_dir && -w _;
> +my $inline_dir = $ENV{PERL_INLINE_DIRECTORY} // (
> +		$ENV{XDG_CACHE_HOME} //
> +		(($ENV{HOME} // (getpwuid($>))[7]).'/.cache')

Erm, nobody runs perl with setuid, anyways, right?

diff --git a/lib/PublicInbox/Spawn.pm b/lib/PublicInbox/Spawn.pm
index 489472502fa..eaebb062393 100644
--- a/lib/PublicInbox/Spawn.pm
+++ b/lib/PublicInbox/Spawn.pm
@@ -143,7 +143,7 @@ VFORK_SPAWN
 
 my $inline_dir = $ENV{PERL_INLINE_DIRECTORY} // (
 		$ENV{XDG_CACHE_HOME} //
-		(($ENV{HOME} // (getpwuid($>))[7]).'/.cache')
+		(($ENV{HOME} // (getpwuid($<))[7]).'/.cache')
 	).'/public-inbox/inline-c';
 
 $vfork_spawn = undef unless -d $inline_dir && -w _;

On the other hand, I'm not sure if the getpwuid call is even
worth it when we can hard-code "/nonexistent" for "nobody"
when $HOME is undefined...


* [PATCH v2] spawn: use ~/.cache/public-inbox/inline-c if writable
  2020-05-11  0:29   ` Eric Wong
@ 2020-05-11  4:27     ` Eric Wong
  0 siblings, 0 replies; 8+ messages in thread
From: Eric Wong @ 2020-05-11  4:27 UTC
  To: meta

Despite several memory reductions and pure Perl performance
improvements, Inline::C spawn() still gives us a noticeable
performance boost.

More user-oriented command-line programs are likely coming;
setting PERL_INLINE_DIRECTORY is annoying to users, and so is
poor performance.  So allow users to opt in to using our
Inline::C code once by creating a `~/.cache/public-inbox/inline-c'
directory.

XDG_CACHE_HOME is respected to override the location of ~/.cache
independent of HOME, according to
https://specifications.freedesktop.org/basedir-spec/0.6/ar01s03.html

v2: use "/nonexistent" if HOME is undefined, since that's
the home of the "nobody" user on both FreeBSD and Debian.
---
 lib/PublicInbox/Spawn.pm | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/lib/PublicInbox/Spawn.pm b/lib/PublicInbox/Spawn.pm
index ad6be187..785ea865 100644
--- a/lib/PublicInbox/Spawn.pm
+++ b/lib/PublicInbox/Spawn.pm
@@ -2,7 +2,8 @@
 # License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
 #
 # This allows vfork to be used for spawning subprocesses if
-# PERL_INLINE_DIRECTORY is explicitly defined in the environment.
+# ~/.cache/public-inbox/inline-c is writable or if PERL_INLINE_DIRECTORY
+# is explicitly defined in the environment (and writable).
 # Under Linux, vfork can make a big difference in spawning performance
 # as process size increases (fork still needs to mark pages for CoW use).
 # Currently, we only use this for code intended for long running
@@ -140,8 +141,12 @@ int pi_fork_exec(SV *redirref, SV *file, SV *cmdref, SV *envref, SV *rlimref,
 }
 VFORK_SPAWN
 
-my $inline_dir = $ENV{PERL_INLINE_DIRECTORY};
-$vfork_spawn = undef unless defined $inline_dir && -d $inline_dir && -w _;
+my $inline_dir = $ENV{PERL_INLINE_DIRECTORY} // (
+		$ENV{XDG_CACHE_HOME} //
+		( ($ENV{HOME} // '/nonexistent').'/.cache' )
+	).'/public-inbox/inline-c';
+
+$vfork_spawn = undef unless -d $inline_dir && -w _;
 if (defined $vfork_spawn) {
 	# Inline 0.64 or later has locking in multi-process env,
 	# but we support 0.5 on Debian wheezy
@@ -150,7 +155,7 @@ if (defined $vfork_spawn) {
 		my $f = "$inline_dir/.public-inbox.lock";
 		open my $fh, '>', $f or die "failed to open $f: $!\n";
 		flock($fh, LOCK_EX) or die "LOCK_EX failed on $f: $!\n";
-		eval 'use Inline C => $vfork_spawn'; #, BUILD_NOISY => 1';
+		eval 'use Inline C => $vfork_spawn, directory => $inline_dir';
 		my $err = $@;
 		flock($fh, LOCK_UN) or die "LOCK_UN failed on $f: $!\n";
 		die $err if $err;

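The lookup order in v2 is easy to misread because `.' binds tighter
than `//', so here is the same expression as a small sketch that can
be run in isolation.  inline_dir() is a hypothetical wrapper that
takes a hash ref instead of reading %ENV directly; the expression
inside it is the one from the patch.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the v2 expression; $env stands in for %ENV
sub inline_dir {
	my ($env) = @_;
	$env->{PERL_INLINE_DIRECTORY} // (
		$env->{XDG_CACHE_HOME} //
		( ($env->{HOME} // '/nonexistent').'/.cache' )
	).'/public-inbox/inline-c';
}

print inline_dir({ HOME => '/home/u' }), "\n";
# prints "/home/u/.cache/public-inbox/inline-c"
print inline_dir({ HOME => '/home/u', XDG_CACHE_HOME => '/tmp/c' }), "\n";
# prints "/tmp/c/public-inbox/inline-c"
print inline_dir({ PERL_INLINE_DIRECTORY => '/srv/inline' }), "\n";
# prints "/srv/inline"
```

Note that PERL_INLINE_DIRECTORY is used as-is, while the other two
get `/public-inbox/inline-c' appended.  In every case the directory
must already exist and be writable, otherwise the pure Perl spawn
is used.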

Archives are clonable:
	git clone --mirror http://public-inbox.org/meta
	git clone --mirror http://czquwvybam4bgbro.onion/meta
	git clone --mirror http://hjrcffqmbrq6wope.onion/meta
	git clone --mirror http://ou63pmih66umazou.onion/meta

Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.mail.public-inbox.meta
	nntp://ou63pmih66umazou.onion/inbox.comp.mail.public-inbox.meta
	nntp://czquwvybam4bgbro.onion/inbox.comp.mail.public-inbox.meta
	nntp://hjrcffqmbrq6wope.onion/inbox.comp.mail.public-inbox.meta
	nntp://news.gmane.io/gmane.mail.public-inbox.general

 note: .onion URLs require Tor: https://www.torproject.org/

AGPL code for this site: git clone https://public-inbox.org/public-inbox.git