user/dev discussion of public-inbox itself
* how to gracefully handle spaces in Message-IDs?
@ 2020-03-31  8:32 Eric Wong
  2020-03-31  8:49 ` [WIP 1/?] v2writable: index Message-IDs w/ spaces properly Eric Wong
  0 siblings, 1 reply; 3+ messages in thread
From: Eric Wong @ 2020-03-31  8:32 UTC
  To: meta

Message-IDs with spaces in them exist in the wild (and possibly
with other strangeness, too).

Take this example:

https://lore.kernel.org/lkml/200203040330.g243URr05337@3%20(NXDOMAIN)%20/

That is:

	Message-ID: <200203040330.g243URr05337@3 (NXDOMAIN) >

RFC 3977 (NNTP) struggles with that in HDR/XHDR commands,
because of its split-on-spaces-or-tabs behavior.
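
To make the splitting problem concrete, here is a minimal Python
sketch (public-inbox itself is Perl; the function names below are
hypothetical and this is not PublicInbox::NNTP or Net::NNTP code).
It compares a plain split on whitespace with a split that keeps
everything after the header name as one argument:

```python
# Hypothetical illustration; not PublicInbox::NNTP or Net::NNTP code.
LINE = "HDR Message-ID <200203040330.g243URr05337@3 (NXDOMAIN) >"


def split_naive(line):
    """Split on any run of spaces/tabs, as a simple command parser would."""
    return line.split()


def split_keep_tail(line, nargs=3):
    """Split off the first nargs-1 words; keep the remainder as one argument."""
    return line.split(None, nargs - 1)


# The naive split breaks the Message-ID into three separate tokens.
print(split_naive(LINE))
# Limiting the split keeps the Message-ID, inner spaces included, intact.
print(split_keep_tail(LINE))
```

The limited split only works here because the Message-ID is the
last argument of the command, and it does nothing for clients that
have to parse a response containing such a Message-ID.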

Not only that, even with a successful attempt to handle
parsing of spaces in the Message-ID for -nntpd requests,
Net::NNTP has trouble parsing responses with spaces in the
Message-ID.  I haven't tried other NNTP clients, but I don't
expect clients to know what to do with invalid Message-IDs
in responses, either...

RFC 5322, Appendix A.6.3. Obsolete White Space and Comments
<https://tools.ietf.org/html/rfc5322#appendix-A.6.3> has
a particularly nasty example:

	Message-ID  : <1234   @   local(blah)  .machine .example>

And RFC 733 is full of examples with spaces in Message-IDs for
the historically-inclined: <https://tools.ietf.org/html/rfc733>

But I haven't found relevant docs on how to handle that case
for NNTP in RFC 977 or 3977...

In innd(*), the nnrpd/article.c::CMDpat function for HDR/XHDR
commands calls lib/messageid.c::IsValidMessageID with the
`stripspaces' parameter as `true', but `stripspaces' only strips
leading and trailing whitespace.

So I'm thinking we should at least strip leading and trailing
spaces, while preserving spaces in the middle of the
Message-ID.
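
A minimal Python sketch of that idea, under stated assumptions:
`strip_outer_whitespace` is a hypothetical name, and it only trims
whitespace around the raw header value.  Whether whitespace just
inside the angle brackets should also count as "leading/trailing"
is left open here, so it is preserved:

```python
def strip_outer_whitespace(raw_value):
    """Drop leading/trailing whitespace; leave everything in between alone."""
    return raw_value.strip(" \t\r\n")


raw = "  <200203040330.g243URr05337@3 (NXDOMAIN) >\t"
mid = strip_outer_whitespace(raw)
# Spaces inside the angle brackets survive, including the one before '>'.
print(repr(mid))
```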

Maybe non-printable control characters can also be filtered
out entirely, since I've definitely seen them in headers where
they don't belong.  I suspect those were introduced by hardware
errors or software bugs.
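
One possible shape for such a filter, as a Python sketch with a
hypothetical function name.  It assumes "control characters" means
the C0 range (0x00-0x1F) plus DEL (0x7F); note that this range
includes TAB, and keeping or dropping TAB is a policy choice the
thread does not settle:

```python
# Map every C0 control character (0x00-0x1F) and DEL (0x7F) to None,
# which tells str.translate() to delete them.
_CONTROL_CHARS = {code: None for code in list(range(0x20)) + [0x7F]}


def drop_control_chars(mid):
    """Remove control characters; ordinary spaces (0x20) are kept."""
    return mid.translate(_CONTROL_CHARS)


dirty = "<2002\x00.g243\x1f@3 (NXDOMAIN) \x7f>"
print(repr(drop_control_chars(dirty)))
```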

Anyways, my head hurts :<

(*) svn co https://inn.eyrie.org/svn/trunk innd


* [WIP 1/?] v2writable: index Message-IDs w/ spaces properly
  2020-03-31  8:32 how to gracefully handle spaces in Message-IDs? Eric Wong
@ 2020-03-31  8:49 ` Eric Wong
  2020-04-01  0:05   ` Eric Wong
  0 siblings, 1 reply; 3+ messages in thread
From: Eric Wong @ 2020-03-31  8:49 UTC
  To: meta

Message-IDs can apparently contain spaces and other weird
characters.  Ensure we pass those properly to shard subprocesses
when importing messages in parallel mode.

Our NNTP parser does not deal with spaces in the Message-ID,
yet, and I don't expect most NNTP clients to, either.
---
 lib/PublicInbox/SearchIdxShard.pm |  8 +++++---
 t/v2writable.t                    | 11 ++++++++++-
 2 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/lib/PublicInbox/SearchIdxShard.pm b/lib/PublicInbox/SearchIdxShard.pm
index 1ea01095..06bcd403 100644
--- a/lib/PublicInbox/SearchIdxShard.pm
+++ b/lib/PublicInbox/SearchIdxShard.pm
@@ -69,8 +69,9 @@ sub shard_worker_loop ($$$$$) {
 			$self->remove_by_oid($oid, $mid);
 		} else {
 			chomp $line;
-			my ($bytes, $num, $blob, $mid, $ds, $ts) =
-							split(/ /, $line);
+			# n.b. $mid may contain spaces(!)
+			my ($bytes, $num, $blob, $ds, $ts, $mid) =
+							split(/ /, $line, 6);
 			$self->begin_txn_lazy;
 			my $n = read($r, my $msg, $bytes) or die "read: $!\n";
 			$n == $bytes or die "short read: $n != $bytes\n";
@@ -93,7 +94,8 @@ sub shard_worker_loop ($$$$$) {
 sub index_raw {
 	my ($self, $msgref, $mime, $smsg) = @_;
 	if (my $w = $self->{w}) {
-		print $w join(' ', @$smsg{qw(bytes num blob mid ds ts)}),
+		# mid must be last, it can contain spaces (but not LF)
+		print $w join(' ', @$smsg{qw(bytes num blob ds ts mid)}),
 			"\n", $$msgref or die "failed to write shard $!\n";
 	} else {
 		$$msgref = undef;
diff --git a/t/v2writable.t b/t/v2writable.t
index cdcfe4d0..8167e4de 100644
--- a/t/v2writable.t
+++ b/t/v2writable.t
@@ -109,6 +109,11 @@ if ('ensure git configs are correct') {
 	@mids = $mime->header_obj->header_raw('Message-Id');
 	like($mids[0], $sane_mid, 'mid was generated');
 	is(scalar(@mids), 1, 'new generated');
+
+	@warn = ();
+	$mime->header_set('Message-Id', '<space@ (NXDOMAIN) >');
+	ok($im->add($mime), 'message added with space in Message-Id');
+	is_deeply([], \@warn);
 }
 
 {
@@ -175,8 +180,12 @@ EOF
 		is($uniq{$mid}++, 0, "MID for $num is unique in XOVER");
 		is_deeply($n->xhdr('Message-ID', $num),
 			 { $num => $mid }, "XHDR lookup OK on num $num");
+
+		# FIXME NNTP.pm doesn't handle spaces in Message-ID
+		next if $mid =~ / /;
+
 		is_deeply($n->xhdr('Message-ID', $mid),
-			 { $mid => $mid }, "XHDR lookup OK on MID $num");
+			 { $mid => $mid }, "XHDR lookup OK on MID $mid ($num)");
 	}
 	my %nn;
 	foreach my $mid (@{$n->newnews(0, $group)}) {


* Re: [WIP 1/?] v2writable: index Message-IDs w/ spaces properly
  2020-03-31  8:49 ` [WIP 1/?] v2writable: index Message-IDs w/ spaces properly Eric Wong
@ 2020-04-01  0:05   ` Eric Wong
  0 siblings, 0 replies; 3+ messages in thread
From: Eric Wong @ 2020-04-01  0:05 UTC
  To: meta

Eric Wong <e@yhbt.net> wrote:
> Message-IDs can apparently contain spaces and other weird
> characters.  Ensure we pass those properly to shard subprocesses
> when importing messages in parallel mode.
> 
> Our NNTP parser does not deal with spaces in the Message-ID,
> yet, and I don't expect most NNTP clients to, either.

Nor does Net::NNTP on the client side...
But regardless of what happens with Message-IDs on the NNTP
side, this patch will remain correct and fixes an indexing
problem when Message-IDs contain spaces.

This bug was exacerbated by the changes to pass date and
timestamps from the git commit into the shard when mirroring,
but has always been with us when using multi-process indexing.
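
The failure mode and the fix can be reproduced outside Perl.  The
following Python sketch is an illustration only: the function names
and sample values are hypothetical, and `str.split(" ", 5)` stands
in for Perl's `split(/ /, $line, 6)`.  It shows why putting the
Message-ID last and limiting the split keeps the fields aligned:

```python
def encode_record(nbytes, num, blob, ds, ts, mid):
    """Space-separated header line with the Message-ID as the last field."""
    if "\n" in mid:
        raise ValueError("Message-ID must not contain LF")
    return " ".join([str(nbytes), str(num), blob, str(ds), str(ts), mid]) + "\n"


def decode_record(line):
    """Split into at most 6 fields so spaces inside the Message-ID survive."""
    nbytes, num, blob, ds, ts, mid = line.rstrip("\n").split(" ", 5)
    return int(nbytes), int(num), blob, int(ds), int(ts), mid


def decode_old_order(line):
    """Old layout (mid in 4th position), unlimited split, first six fields."""
    return line.rstrip("\n").split(" ")[:6]


mid = "<space@ (NXDOMAIN) >"
new_line = encode_record(123, 1, "deadbeef", 1585643520, 1585643521, mid)
old_line = "123 1 deadbeef " + mid + " 1585643520 1585643521\n"

print(decode_record(new_line))     # mid comes back intact
print(decode_old_order(old_line))  # mid is truncated, ds/ts hold fragments
```

With the old field order, the pieces of the Message-ID spill into
the date and timestamp slots, which is why the problem became more
visible once those fields were passed to the shard.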

> diff --git a/t/v2writable.t b/t/v2writable.t
> index cdcfe4d0..8167e4de 100644
> --- a/t/v2writable.t
> +++ b/t/v2writable.t

> @@ -175,8 +180,12 @@ EOF
>  		is($uniq{$mid}++, 0, "MID for $num is unique in XOVER");
>  		is_deeply($n->xhdr('Message-ID', $num),
>  			 { $num => $mid }, "XHDR lookup OK on num $num");
> +
> +		# FIXME NNTP.pm doesn't handle spaces in Message-ID
> +		next if $mid =~ / /;
> +

Pushed with the following squashed in:

diff --git a/t/v2writable.t b/t/v2writable.t
index 8167e4de..66d5663e 100644
--- a/t/v2writable.t
+++ b/t/v2writable.t
@@ -181,7 +181,8 @@ EOF
 		is_deeply($n->xhdr('Message-ID', $num),
 			 { $num => $mid }, "XHDR lookup OK on num $num");
 
-		# FIXME NNTP.pm doesn't handle spaces in Message-ID
+		# FIXME PublicInbox::NNTP (server) doesn't handle spaces in
+		# Message-ID, but neither does Net::NNTP (client)
 		next if $mid =~ / /;
 
 		is_deeply($n->xhdr('Message-ID', $mid),



Code repositories for project(s) associated with this public inbox

	https://80x24.org/public-inbox.git
