"lei_to_mail+mbox_reader: fix handling of empty/bogus emails"

user/dev discussion of public-inbox itself

* Re: [PATCH] lei_to_mail+mbox_reader: fix handling of empty/bogus emails
  2021-09-07 21:20  9%           ` Konstantin Ryabitsev
@ 2021-09-07 22:22  9%             ` Eric Wong
  0 siblings, 0 replies; 5+ results
From: Eric Wong @ 2021-09-07 22:22 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> Okay, I guess it's not any different from someone doing the same thing over
> the web interface. It would be nice to have a way to limit how many messages
> are returned for gzipped mailbox downloads, seeing as they cannot be paginated
> in the same way web views are, but it's not a priority right away.

I'm thinking pagination would cause unnecessary hardship for
legitimate users.

The mbox.gz streaming doesn't hurt -httpd any more than
aggressive bots do.  HTML pagination is mainly needed to avoid
performance problems on the client/rendering side.

^ permalink raw reply	[relevance 9%]

* Re: [PATCH] lei_to_mail+mbox_reader: fix handling of empty/bogus emails
  2021-09-07 20:56  9%         ` Eric Wong
@ 2021-09-07 21:20  9%           ` Konstantin Ryabitsev
  2021-09-07 22:22  9%             ` Eric Wong
  0 siblings, 1 reply; 5+ results
From: Konstantin Ryabitsev @ 2021-09-07 21:20 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

On Tue, Sep 07, 2021 at 08:56:17PM +0000, Eric Wong wrote:
> > 1. this means that each "lei up" call will get larger and larger,
> >    since when we init the search with rt:, it gets resolved into a datestamp
> >    (e.g. rt:2.weeks.ago becomes rt:1625699031). I'm worried that this will be
> >    increasingly hard on the server side, especially if someone
> >    fires-and-forgets a cronjob that ends up downloading ever-growing mboxes
> >    every 5 minutes.
> 
> "rt:2.weeks.ago" stays "rt:2.weeks.ago" in saved searches :>

Oh, you're right. Apologies for not digging deeper.

> > 2. is there some sanity limit on the server side that would prevent someone's
> >    overly broad search query from gzipping and downloading gigabytes of mail?
> 
> Not right now.  With public-inbox-httpd, the actual git fetches
> are handled fairly w.r.t. other requests (and I could
> deprioritize them further, if needed...).  The Xapian query OTOH...

Okay, I guess it's not any different from someone doing the same thing over
the web interface. It would be nice to have a way to limit how many messages
are returned for gzipped mailbox downloads, seeing as they cannot be paginated
in the same way web views are, but it's not a priority right away.

Thanks,
-K

^ permalink raw reply	[relevance 9%]

* Re: [PATCH] lei_to_mail+mbox_reader: fix handling of empty/bogus emails
  2021-09-07 18:17  9%       ` Konstantin Ryabitsev
@ 2021-09-07 20:56  9%         ` Eric Wong
  2021-09-07 21:20  9%           ` Konstantin Ryabitsev
  0 siblings, 1 reply; 5+ results
From: Eric Wong @ 2021-09-07 20:56 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> On Sat, Sep 04, 2021 at 09:36:58PM +0000, Eric Wong wrote:
> > Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> > > Yep, that seems to work fine. Question -- I noticed that lei just issues a
> > > regular query, retrieves results with curl and then parses the output. Is
> > > there a danger of potentially running into issues with parsing the regular
> > > HTML output if it changes in the future?
> > 
> > It's actually parsing gzipped mboxrd (&x=m).  But you're right
> > we could use stronger safeguards in case we see gzipped HTML or
> > something else...
> 
> Ooh, okay, I guess I should actually look at the output of the curl call. :)
> The questions I have, then:
> 
> 1. this means that each "lei up" call will get larger and larger,
>    since when we init the search with rt:, it gets resolved into a datestamp
>    (e.g. rt:2.weeks.ago becomes rt:1625699031). I'm worried that this will be
>    increasingly hard on the server side, especially if someone
>    fires-and-forgets a cronjob that ends up downloading ever-growing mboxes
>    every 5 minutes.

"rt:2.weeks.ago" stays "rt:2.weeks.ago" in saved searches :>

It was one of my primary annoyances when I initially implemented
this and commit 2e4e4b0d6f30d9d4612066395ba694c7c7d61e6e solved it.
https://public-inbox.org/meta/20210416231035.31807-2-e@80x24.org/
("lei q: --save preserves relative time queries")

> 2. is there some sanity limit on the server side that would prevent someone's
>    overly broad search query from gzipping and downloading gigabytes of mail?

Not right now.  With public-inbox-httpd, the actual git fetches
are handled fairly w.r.t. other requests (and I could
deprioritize them further, if needed...).  The Xapian query OTOH...

^ permalink raw reply	[relevance 9%]
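
The difference between a saved search that keeps "rt:2.weeks.ago" and one
that freezes it into a timestamp can be sketched as follows.  This is a
Python illustration only, not lei's code: resolve_rt() is a hypothetical toy
parser that understands nothing beyond "N.days.ago" and "N.weeks.ago", and
window_days() is an assumed helper for measuring the span a re-run would
cover.

```python
import re
from datetime import datetime, timedelta, timezone

# Toy unit table (assumption): only days and weeks are understood here.
_UNIT_DAYS = {"day": 1, "week": 7}


def resolve_rt(expr: str, now: datetime) -> int:
    """Resolve a toy 'N.unit(s).ago' expression to a Unix timestamp at `now`."""
    m = re.fullmatch(r"(\d+)\.(day|week)s?\.ago", expr)
    if m is None:
        raise ValueError(f"unsupported expression: {expr!r}")
    days = int(m.group(1)) * _UNIT_DAYS[m.group(2)]
    return int((now - timedelta(days=days)).timestamp())


def window_days(saved: str, now: datetime) -> float:
    """Days covered when a saved search is re-run at `now`.

    `saved` is either a frozen Unix timestamp (all digits) or a relative
    expression that is resolved again on every run.
    """
    start = int(saved) if saved.isdigit() else resolve_rt(saved, now)
    return (now.timestamp() - start) / 86400


created = datetime(2021, 7, 21, tzinfo=timezone.utc)
later = created + timedelta(days=30)

relative = "2.weeks.ago"                         # stored as typed
frozen = str(resolve_rt("2.weeks.ago", created))  # resolved once, at creation

print(window_days(relative, later))  # window stays two weeks wide
print(window_days(frozen, later))    # window has grown by the 30 elapsed days
```

With the relative form, every run asks for the same two-week window; with
the frozen form, the window (and the download) grows with each day that
passes since the search was created.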

* Re: [PATCH] lei_to_mail+mbox_reader: fix handling of empty/bogus emails
  2021-09-04 21:36 14%     ` [PATCH] lei_to_mail+mbox_reader: fix handling of empty/bogus emails Eric Wong
@ 2021-09-07 18:17  9%       ` Konstantin Ryabitsev
  2021-09-07 20:56  9%         ` Eric Wong
  0 siblings, 1 reply; 5+ results
From: Konstantin Ryabitsev @ 2021-09-07 18:17 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

On Sat, Sep 04, 2021 at 09:36:58PM +0000, Eric Wong wrote:
> Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> > Yep, that seems to work fine. Question -- I noticed that lei just issues a
> > regular query, retrieves results with curl and then parses the output. Is
> > there a danger of potentially running into issues with parsing the regular
> > HTML output if it changes in the future?
> 
> It's actually parsing gzipped mboxrd (&x=m).  But you're right
> we could use stronger safeguards in case we see gzipped HTML or
> something else...

Ooh, okay, I guess I should actually look at the output of the curl call. :)
The questions I have, then:

1. this means that each "lei up" call will get larger and larger,
   since when we init the search with rt:, it gets resolved into a datestamp
   (e.g. rt:2.weeks.ago becomes rt:1625699031). I'm worried that this will be
   increasingly hard on the server side, especially if someone
   fires-and-forgets a cronjob that ends up downloading ever-growing mboxes
   every 5 minutes.
2. is there some sanity limit on the server side that would prevent someone's
   overly broad search query from gzipping and downloading gigabytes of mail?

-K

^ permalink raw reply	[relevance 9%]

* [PATCH] lei_to_mail+mbox_reader: fix handling of empty/bogus emails
  @ 2021-09-04 21:36 14%     ` Eric Wong
  2021-09-07 18:17  9%       ` Konstantin Ryabitsev
  0 siblings, 1 reply; 5+ results
From: Eric Wong @ 2021-09-04 21:36 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> Yep, that seems to work fine. Question -- I noticed that lei just issues a
> regular query, retrieves results with curl and then parses the output. Is
> there a danger of potentially running into issues with parsing the regular
> HTML output if it changes in the future?

It's actually parsing gzipped mboxrd (&x=m).  But you're right
we could use stronger safeguards in case we see gzipped HTML or
something else...

----------8<---------
Subject: [PATCH] lei_to_mail+mbox_reader: fix handling of empty/bogus emails

We may be handling invalid mboxes, so just return no objects in
that case.  While "lei q" on HTTP(S) externals expects a gzipped
mboxrd, there's always a chance something else gzipped can be
sent to us.

There are also changes to lei_to_mail to better handle emails
which lack a body and/or headers (e.g. t/solve/bare.patch).

Link: https://public-inbox.org/meta/20210903151500.h72mzcpqixgtytjs@meerkat.local/
---
 lib/PublicInbox/Eml.pm        |  8 ++++++++
 lib/PublicInbox/LeiToMail.pm  | 21 +++++++--------------
 lib/PublicInbox/MboxReader.pm |  3 ++-
 t/mbox_reader.t               | 23 +++++++++++++++++++++++
 4 files changed, 40 insertions(+), 15 deletions(-)

diff --git a/lib/PublicInbox/Eml.pm b/lib/PublicInbox/Eml.pm
index 955d6a96..0867a016 100644
--- a/lib/PublicInbox/Eml.pm
+++ b/lib/PublicInbox/Eml.pm
@@ -480,6 +480,14 @@ sub charset_set {
 
 sub crlf { $_[0]->{crlf} // "\n" }
 
+sub raw_size {
+	my ($self) = @_;
+	my $len = length(${$self->{hdr}});
+	defined($self->{bdy}) and
+		$len += length(${$self->{bdy}}) + length($self->{crlf});
+	$len;
+}
+
 # warnings to ignore when handling spam mailboxes and maybe other places
 sub warn_ignore {
 	my $s = "@_";
diff --git a/lib/PublicInbox/LeiToMail.pm b/lib/PublicInbox/LeiToMail.pm
index 6e102a1d..1221d3c7 100644
--- a/lib/PublicInbox/LeiToMail.pm
+++ b/lib/PublicInbox/LeiToMail.pm
@@ -109,32 +109,25 @@ sub _mboxcl_common ($$$) {
 	$$buf .= 'Content-Length: '.length($$bdy).$crlf.
 		'Lines: '.$lines.$crlf.$crlf;
 	substr($$bdy, 0, 0, $$buf); # prepend header
-	$_[0] = $bdy;
+	$$bdy .= $crlf;
+	$bdy;
 }
 
 # mboxcl still escapes "From " lines
 sub eml2mboxcl {
 	my ($eml, $smsg) = @_;
 	my $buf = _mbox_hdr_buf($eml, 'mboxcl', $smsg);
-	my $crlf = $eml->{crlf};
-	if (my $bdy = delete $eml->{bdy}) {
-		$$bdy =~ s/^From />From /gm;
-		_mboxcl_common($buf, $bdy, $crlf);
-	}
-	$$buf .= $crlf;
-	$buf;
+	my $bdy = delete($eml->{bdy}) // \(my $empty = '');
+	$$bdy =~ s/^From />From /gm;
+	_mboxcl_common($buf, $bdy, $eml->{crlf});
 }
 
 # mboxcl2 has no "From " escaping
 sub eml2mboxcl2 {
 	my ($eml, $smsg) = @_;
 	my $buf = _mbox_hdr_buf($eml, 'mboxcl2', $smsg);
-	my $crlf = $eml->{crlf};
-	if (my $bdy = delete $eml->{bdy}) {
-		_mboxcl_common($buf, $bdy, $crlf);
-	}
-	$$buf .= $crlf;
-	$buf;
+	my $bdy = delete($eml->{bdy}) // \(my $empty = '');
+	_mboxcl_common($buf, $bdy, $eml->{crlf});
 }
 
 sub git_to_mail { # git->cat_async callback
diff --git a/lib/PublicInbox/MboxReader.pm b/lib/PublicInbox/MboxReader.pm
index 9291f00b..5a754cb8 100644
--- a/lib/PublicInbox/MboxReader.pm
+++ b/lib/PublicInbox/MboxReader.pm
@@ -41,7 +41,7 @@ sub _mbox_from {
 			$raw =~ s/^\r?\n\z//ms;
 			$raw =~ s/$from_re/$1/gms;
 			my $eml = PublicInbox::Eml->new(\$raw);
-			$eml_cb->($eml, @arg);
+			$eml_cb->($eml, @arg) if $eml->raw_size;
 		}
 		return if $r == 0; # EOF
 	}
@@ -96,6 +96,7 @@ sub _mbox_cl ($$$;@) {
 			$$hdr =~ s/\A[\r\n]*From [^\n]*\n//s or
 				die "E: no 'From ' line in:\n", Dumper($hdr);
 			my $eml = PublicInbox::Eml->new($hdr);
+			next unless $eml->raw_size;
 			my @cl = $eml->header_raw('Content-Length');
 			my $n = scalar(@cl);
 			$n == 0 and die "E: Content-Length missing in:\n",
diff --git a/t/mbox_reader.t b/t/mbox_reader.t
index da0ce7f1..e5f57d7b 100644
--- a/t/mbox_reader.t
+++ b/t/mbox_reader.t
@@ -71,6 +71,12 @@ my $check_fmt = sub {
 				"Content-Length is correct $fmt $cur");
 			# clobber for ->as_string comparison below
 			$eml->header_set('Content-Length');
+
+			# special case for t/solve/bare.patch, not sure if we
+			# should even handle it...
+			if ($cl[0] eq '0' && ${$eml->{hdr}} eq '') {
+				delete $eml->{bdy};
+			}
 		} else {
 			is(scalar(@cl), 0, "Content-Length unset $fmt $cur");
 		}
@@ -121,4 +127,21 @@ exit 1
 	is(scalar(grep(/Final/, @x)), 0, 'no incomplete bit');
 }
 
+{
+	my $html = <<EOM;
+<html><head><title>hi,</title></head><body>how are you</body></html>
+EOM
+	for my $m (qw(mboxrd mboxcl mboxcl2 mboxo)) {
+		my (@w, @x);
+		local $SIG{__WARN__} = sub { push @w, @_ };
+		open my $fh, '<', \$html or xbail 'PerlIO::scalar';
+		PublicInbox::MboxReader->$m($fh, sub {
+			push @x, $_[0]->as_string
+		});
+		is_deeply(\@x, [], "messages in invalid $m");
+		is_deeply([grep(!/^W: leftover/, @w)], [],
+			"no extra warnings besides leftover ($m)");
+	}
+}
+
 done_testing;

^ permalink raw reply related	[relevance 14%]
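
The safeguard described in the commit message (return no messages when the
gzipped payload is not really an mboxrd, and skip messages with no content)
can be sketched outside of Perl as below.  This is a hedged Python
illustration under stated assumptions, not a port of
PublicInbox::MboxReader: read_mboxrd_gz() is a hypothetical name, the whole
payload is held in memory, and anything before the first "From " separator
is silently treated as leftover.

```python
import gzip
import re
import zlib

# A line starting with "From " separates messages in mboxrd.
_SEPARATOR = re.compile(rb"^From [^\n]*\n", re.MULTILINE)
# mboxrd escapes body lines matching ">*From " by prepending one ">".
_ESCAPED = re.compile(rb"^>(>*From )", re.MULTILINE)


def read_mboxrd_gz(blob: bytes) -> list:
    """Return raw messages from a gzipped mboxrd, or [] if nothing is valid."""
    try:
        data = gzip.decompress(blob)
    except (OSError, EOFError, zlib.error):
        return []  # not gzip at all, truncated, or corrupt

    chunks = _SEPARATOR.split(data)
    messages = []
    # chunks[0] is whatever precedes the first "From " line (e.g. an HTML
    # page); it is leftover data, never a message.
    for chunk in chunks[1:]:
        if chunk.endswith(b"\n"):
            chunk = chunk[:-1]  # drop the blank line that ends each message
        chunk = _ESCAPED.sub(rb"\1", chunk)
        if chunk:  # skip messages with neither headers nor body
            messages.append(chunk)
    return messages


html = b"<html><head><title>hi,</title></head><body>how are you</body></html>\n"
mbox = (
    b"From a@example.com Thu Jan  1 00:00:00 1970\n"
    b"Subject: hi\n\n>From here\nbody\n\n"
    b"From a@example.com Thu Jan  1 00:00:00 1970\n\n"
)

print(read_mboxrd_gz(gzip.compress(html)))  # gzipped HTML yields no messages
print(read_mboxrd_gz(gzip.compress(mbox)))  # empty second message is skipped
```

The design choice mirrors the patch: bogus input produces an empty result
instead of an exception or a phantom empty message, so a caller that writes
results to a mailbox never has to special-case it.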

2021-09-02 21:12     Showcasing lei at Linux Plumbers Konstantin Ryabitsev
2021-09-02 21:58     ` Eric Wong
2021-09-03 15:15       ` Konstantin Ryabitsev
2021-09-04 21:36 14%     ` [PATCH] lei_to_mail+mbox_reader: fix handling of empty/bogus emails Eric Wong
2021-09-07 18:17  9%       ` Konstantin Ryabitsev
2021-09-07 20:56  9%         ` Eric Wong
2021-09-07 21:20  9%           ` Konstantin Ryabitsev
2021-09-07 22:22  9%             ` Eric Wong
Code repositories for project(s) associated with this public inbox

	https://80x24.org/public-inbox.git
