user/dev discussion of public-inbox itself
 help / Atom feed
* Some points on public-inbox
@ 2018-06-09 17:06 Leah Neukirchen
  2018-06-12 10:09 ` Eric Wong
  0 siblings, 1 reply; 13+ messages in thread
From: Leah Neukirchen @ 2018-06-09 17:06 UTC (permalink / raw)
  To: meta

Hi,

over the last few days I've set up a public-inbox 1.1.0pre1 instance,
and noticed some things:

1) Makefile.PL only works properly when run from a checkout, not a tarball.
I replaced the beginning with

my @EXE_FILES = split("\n", `printf '%s\n' script/* 2>/dev/null`);
my $PM_FILES = `find lib 2>/dev/null`;

2) public-inbox-mda returns with status 1 when it gets a mail it
doesn't know where to deliver to.  I think status 67 would be more
appropriate (EX_NOUSER).

3) IPv6 support needs the Socket6 module, this is not stated anywhere.

4) I think it would be useful if the thread overview displayed
the name of the initial poster, could this be added as an option?

5) Is there a way for the HTML view to list all served lists?
/ results in 404.  How did you add links to meta/ and test/ on
https://public-inbox.org/ ?

6) I have a user account that uses .forward to call public-inbox-mda,
and use /etc/aliases to route the lists that are hosted primarily on
the server to it.  What's the best approach to do this for mailing
lists I only mirror? Subscribe with a "secret" second address to the
list, and add this second adress to publicinbox.<name>.address?
Or can public-inbox-mda also scan for List-Id etc and sort by it somehow?

Thank you very much for your work,
-- 
Leah Neukirchen  <leah@vuxu.org>  http://leah.zone

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Some points on public-inbox
  2018-06-09 17:06 Some points on public-inbox Leah Neukirchen
@ 2018-06-12 10:09 ` Eric Wong
  2018-06-12 11:31   ` Leah Neukirchen
                     ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Eric Wong @ 2018-06-12 10:09 UTC (permalink / raw)
  To: Leah Neukirchen; +Cc: meta

Leah Neukirchen <leah@vuxu.org> wrote:
> Hi,
> 
> over the last few days I've set up a public-inbox 1.1.0pre1 instance,
> and noticed some things:

Hey Leah, thanks for giving it a try!  Sorry for the late reply,
been trying to avoid being at the computer too much for health
reasons.

> 1) Makefile.PL only works properly when run from a checkout, not a tarball.
> I replaced the beginning with
> 
> my @EXE_FILES = split("\n", `printf '%s\n' script/* 2>/dev/null`);
> my $PM_FILES = `find lib 2>/dev/null`;

Thanks, I'd probably add "-name '*.pm'" to find(1) to filter out
directories.  But I wonder if it's better to grep the MANIFEST
file...

> 2) public-inbox-mda returns with status 1 when it gets a mail it
> doesn't know where to deliver to.  I think status 67 would be more
> appropriate (EX_NOUSER).

Sure.  There's a bunch of places where we just die() and ignore
sysexits.h or similar.  Could use some help checking for that
and patches are welcome :>

> 3) IPv6 support needs the Socket6 module, this is not stated anywhere.

Oops, I thought this was standard :x  Care to send a patch to
INSTALL for that?

> 4) I think it would be useful if the thread overview displayed
> the name of the initial poster, could this be added as an option?

If anything, I'd rather list ALL the recent participants in a
thread (it wouldn't require extra lookups).

But my philosophy is not to give anybody more credit/weight than
anybody else; and I'd rather people follow links if the Subject
seems interesting, not because who started a particular topic
(especially when it comes to emails with [PATCH] subjects).

I also prefer to avoid having too many options to reduce
support/documentation costs, so if we do something like this,
it would be the default.  But also, it's more clutter.

> 5) Is there a way for the HTML view to list all served lists?

Not currently...  I'm not sure how the UI or configuration
should be or how to avoid clutter/scalability problems with many
inboxes.  NNTP has standardized commands and clients can decide
how to show them, at least.

> / results in 404.  How did you add links to meta/ and test/ on
> https://public-inbox.org/ ?

mkdir /srv/public-inbox/{meta,test} # static directory listing

Rack::Builder routes meta/ and test/ to varnish => public-inbox-httpd

All the HTTPS, static files and reverse proxying is handled
using yet-another-horribly-named-server written in Ruby and Rack :)

==> config.ru snippet <==
# random crap from yahns extras/
require "autoindex"
require "try_gzip_static"
require "yahns/proxy_pass"
autoindex = lambda do |path|
  Autoindex.new(TryGzipStatic.new(path), skip_dotfiles: true)
end

pi = Rack::Builder.new do
  run Yahns::ProxyPass.new('http://127.0.0.1:6081', # varnish
      response_headers: {
        'Age' => :ignore,
        'X-Varnish' => :ignore,
        'Via' => :ignore
    })
end.to_app

unsub = Rack::Builder.new do
  run Yahns::ProxyPass.new('unix:/run/unsubscribe-psgi.sock')
end.to_app
unsub_re = %r{\A/u/[^/]+/[^/]+\z}

map('http://public-inbox.org/') do
  pfx = 'public-inbox'
  static = autoindex["/srv/#{pfx}"]
  txt_html = %r{\.txt\.html\z} # oops :x
  proxy = Yahns::ProxyPass.new("http://127.0.0.1:2080/$host$fullpath",
			        proxy_buffering: false)
  cascade = Rack::Cascade.new([static, proxy])
  run(lambda do |env|
    return redirect(env, nil, nil) if env['rack.url_scheme'] == -'http'
    case path_info = env["PATH_INFO"]
    when %r{\A/test(?:/?.*)\z}
      redirect(env, "try.public-inbox.org", path_info)
    when %r{\A/(?:git|meta|public-inbox(?:\.git)?)(?:/|\z)}x,
	 '/HEAD', '/info/refs', '/git-upload-pack', '/description', '/cloneurl',
         %r{\A/objects/}
      pi.call(env)
    when unsub_re
      unsub.call(env)
    when txt_html
      redirect(env, 'public-inbox.org', path_info.sub(txt_html, '.html'))
    else
      cascade.call(env)
    end
  end)
end
==> end snippet <=

> 6) I have a user account that uses .forward to call public-inbox-mda,
> and use /etc/aliases to route the lists that are hosted primarily on
> the server to it.  What's the best approach to do this for mailing
> lists I only mirror? Subscribe with a "secret" second address to the
> list, and add this second adress to publicinbox.<name>.address?
> Or can public-inbox-mda also scan for List-Id etc and sort by it somehow?

I prefer to use public-inbox-watch for mirroring existing lists.

-mda is also a bit strict and opinionated (though I have plans to
make it less so, optionally), so it's mainly for non-mirrored
inboxes.

-watch is also safer and less likely to lose/bounce mail since
it hits a Maildir, first.  -watch will scan for List-Id (or any
other header, such as X-Mailing-List) and put it into the
correct inbox.  If space is a problem, a cronjob to remove
old files will help, but maybe it can unlink-on-import-commit
in the future.

I haven't thought much about mirroring with -mda, but I suppose
having a per-list subscriber address and extra
publicinbox.<name>.address entry works, too.

> Thank you very much for your work,

No problem :>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Some points on public-inbox
  2018-06-12 10:09 ` Eric Wong
@ 2018-06-12 11:31   ` Leah Neukirchen
  2018-06-13  2:07     ` [PATCH] Makefile.PL: do not depend on git Eric Wong
  2018-06-13 21:40     ` Some points on public-inbox Eric Wong
  2018-06-12 13:19   ` Some points on public-inbox Leah Neukirchen
  2018-06-12 17:05   ` Konstantin Ryabitsev
  2 siblings, 2 replies; 13+ messages in thread
From: Leah Neukirchen @ 2018-06-12 11:31 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

Eric Wong <e@80x24.org> writes:

> Leah Neukirchen <leah@vuxu.org> wrote:
>> Hi,
>> 
>> over the last few days I've set up a public-inbox 1.1.0pre1 instance,
>> and noticed some things:
>
> Hey Leah, thanks for giving it a try!  Sorry for the late reply,
> been trying to avoid being at the computer too much for health
> reasons.

No problem, get well soon.

>> 1) Makefile.PL only works properly when run from a checkout, not a tarball.
>> I replaced the beginning with
>> 
>> my @EXE_FILES = split("\n", `printf '%s\n' script/* 2>/dev/null`);
>> my $PM_FILES = `find lib 2>/dev/null`;
>
> Thanks, I'd probably add "-name '*.pm'" to find(1) to filter out
> directories.  But I wonder if it's better to grep the MANIFEST
> file...

Yes, using MANIFEST is a better solution.

>> 2) public-inbox-mda returns with status 1 when it gets a mail it
>> doesn't know where to deliver to.  I think status 67 would be more
>> appropriate (EX_NOUSER).
>
> Sure.  There's a bunch of places where we just die() and ignore
> sysexits.h or similar.  Could use some help checking for that
> and patches are welcome :>

I'll have a look at this.

>> 3) IPv6 support needs the Socket6 module, this is not stated anywhere.
>
> Oops, I thought this was standard :x  Care to send a patch to
> INSTALL for that?

Will do.

>> 5) Is there a way for the HTML view to list all served lists?
>
> Not currently...  I'm not sure how the UI or configuration
> should be or how to avoid clutter/scalability problems with many
> inboxes.  NNTP has standardized commands and clients can decide
> how to show them, at least.

Yes, I was thinking of just having a list of "name - description",
straight from the config file?

>> / results in 404.  How did you add links to meta/ and test/ on
>> https://public-inbox.org/ ?
>
> mkdir /srv/public-inbox/{meta,test} # static directory listing

I built something similar with nginx now.

>> and use /etc/aliases to route the lists that are hosted primarily on
>> the server to it.  What's the best approach to do this for mailing
>> lists I only mirror? Subscribe with a "secret" second address to the
>> list, and add this second adress to publicinbox.<name>.address?
>> Or can public-inbox-mda also scan for List-Id etc and sort by it somehow?
>
> I prefer to use public-inbox-watch for mirroring existing lists.
>
> -mda is also a bit strict and opinionated (though I have plans to
> make it less so, optionally), so it's mainly for non-mirrored
> inboxes.
>
> -watch is also safer and less likely to lose/bounce mail since
> it hits a Maildir, first.  -watch will scan for List-Id (or any
> other header, such as X-Mailing-List) and put it into the
> correct inbox.  If space is a problem, a cronjob to remove
> old files will help, but maybe it can unlink-on-import-commit
> in the future.

Space is not an issue, and scanning for special headers will avoid
getting password reminders and administrative messages into the archive.

I'll use watch then.


During testing, we also found another thing when obscure characters
are used in Message-IDs, esp. / and ?.

E.g. using a Message-ID of <F1WYEAZPOF.3LOD2T7ZHY9I1@localdomain/raw/T>
will create a corrupt link.  Some more "ideas" are at
https://inbox.vuxu.org/pi-test/


Thanks,
-- 
Leah Neukirchen  <leah@vuxu.org>  http://leah.zone

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Some points on public-inbox
  2018-06-12 10:09 ` Eric Wong
  2018-06-12 11:31   ` Leah Neukirchen
@ 2018-06-12 13:19   ` Leah Neukirchen
  2018-06-12 17:05   ` Konstantin Ryabitsev
  2 siblings, 0 replies; 13+ messages in thread
From: Leah Neukirchen @ 2018-06-12 13:19 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

Eric Wong <e@80x24.org> writes:

>> 6) I have a user account that uses .forward to call public-inbox-mda,
>> and use /etc/aliases to route the lists that are hosted primarily on
>> the server to it.  What's the best approach to do this for mailing
>> lists I only mirror? Subscribe with a "secret" second address to the
>> list, and add this second adress to publicinbox.<name>.address?
>> Or can public-inbox-mda also scan for List-Id etc and sort by it somehow?
>
> I prefer to use public-inbox-watch for mirroring existing lists.
>
> -mda is also a bit strict and opinionated (though I have plans to
> make it less so, optionally), so it's mainly for non-mirrored
> inboxes.
>
> -watch is also safer and less likely to lose/bounce mail since
> it hits a Maildir, first.  -watch will scan for List-Id (or any
> other header, such as X-Mailing-List) and put it into the
> correct inbox.  If space is a problem, a cronjob to remove
> old files will help, but maybe it can unlink-on-import-commit
> in the future.

Ok, that was a bit more fiddly than expected, because now I needed to
run two MDA under the same user.

Also I noticed public-inbox-watch expects different Maildir for every
list, so I had to put a maildrop in front of it to pre-sort by
List-Id... couldn't -watch do that itself?

-- 
Leah Neukirchen  <leah@vuxu.org>  http://leah.zone

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Some points on public-inbox
  2018-06-12 10:09 ` Eric Wong
  2018-06-12 11:31   ` Leah Neukirchen
  2018-06-12 13:19   ` Some points on public-inbox Leah Neukirchen
@ 2018-06-12 17:05   ` Konstantin Ryabitsev
  2018-06-13  1:57     ` Eric Wong
  2 siblings, 1 reply; 13+ messages in thread
From: Konstantin Ryabitsev @ 2018-06-12 17:05 UTC (permalink / raw)
  To: Eric Wong; +Cc: Leah Neukirchen, meta

[-- Attachment #1: Type: text/plain, Size: 1617 bytes --]

On Tue, Jun 12, 2018 at 10:09:15AM +0000, Eric Wong wrote:
>I prefer to use public-inbox-watch for mirroring existing lists.
>
>-mda is also a bit strict and opinionated (though I have plans to
>make it less so, optionally), so it's mainly for non-mirrored
>inboxes.
>
>-watch is also safer and less likely to lose/bounce mail since
>it hits a Maildir, first.  -watch will scan for List-Id (or any
>other header, such as X-Mailing-List) and put it into the
>correct inbox.  If space is a problem, a cronjob to remove
>old files will help, but maybe it can unlink-on-import-commit
>in the future.

I opted in favour of -mda over -watch because Maildir performance
usually degrades linearly with the number of messages. A month of LKML
mail is anywhere from 25,000 to 40,000 messages, and maildirs tend to
handle that poorly due to peformance overhead of listing tens of
thousands of files in a single folder.

Obviously, I can set up an archival job, but then I'd have to worry
about messages that weren't actually imported into the archive (because
they didn't pass spam tests, but are actually ham, for example). The
-mda script gives me this for free, with such messages being put into
the emergency folder for later review.

>I haven't thought much about mirroring with -mda, but I suppose
>having a per-list subscriber address and extra
>publicinbox.<name>.address entry works, too.

It works, but cloning details at the bottom of the page expose both
addresses:

public-inbox-init -V2 lkml lkml/ https://[not-live-yet].kernel.org/lkml \
		linux-kernel@[not-live-yet].kernel.org linux-kernel@vger.kernel.org


-K

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Some points on public-inbox
  2018-06-12 17:05   ` Konstantin Ryabitsev
@ 2018-06-13  1:57     ` Eric Wong
  0 siblings, 0 replies; 13+ messages in thread
From: Eric Wong @ 2018-06-13  1:57 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: Leah Neukirchen, meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> On Tue, Jun 12, 2018 at 10:09:15AM +0000, Eric Wong wrote:
> > I prefer to use public-inbox-watch for mirroring existing lists.
> 
> I opted in favour of -mda over -watch because Maildir performance
> usually degrades linearly with the number of messages. A month of LKML
> mail is anywhere from 25,000 to 40,000 messages, and maildirs tend to
> handle that poorly due to peformance overhead of listing tens of
> thousands of files in a single folder.

Right; but with inotify, getdents/readdir overhead is not a
problem outside of initial startup (or rescanning via SIGUSR1
after config changes).

> Obviously, I can set up an archival job, but then I'd have to worry
> about messages that weren't actually imported into the archive (because
> they didn't pass spam tests, but are actually ham, for example). The
> -mda script gives me this for free, with such messages being put into
> the emergency folder for later review.

Interesting take on it, thanks for sharing.  I prefer to keep
the Maildir messages around for a bit and do my own reading off
that, for now[1].  I occasionally review syslog for spam notices
from -watch, but probably not enough :x

> > I haven't thought much about mirroring with -mda, but I suppose
> > having a per-list subscriber address and extra
> > publicinbox.<name>.address entry works, too.
> 
> It works, but cloning details at the bottom of the page expose both
> addresses:
> 
> public-inbox-init -V2 lkml lkml/ https://[not-live-yet].kernel.org/lkml \
> 		linux-kernel@[not-live-yet].kernel.org linux-kernel@vger.kernel.org

Hmm, I intended the multi-address support to work as a way to
have inboxes hosted simultaneously on multiple domains, either
temporarily as a migration strategy or permanently for redundancy.


So maybe there should be a way to specify an email address as
"hidden" for that, but still let -mda use it for routing.
Any thoughts on how to do it?

I'm thinking something like replacing '@' with '!' in the
.public-inbox/config file.



[1] I've thought about a Mairix/notmuch-like tool which extracts
    messages from public-inboxes, so I won't need a redundant
    copy in the Maildir.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH] Makefile.PL: do not depend on git
  2018-06-12 11:31   ` Leah Neukirchen
@ 2018-06-13  2:07     ` Eric Wong
  2018-06-13 14:26       ` Leah Neukirchen
  2018-06-13 21:40     ` Some points on public-inbox Eric Wong
  1 sibling, 1 reply; 13+ messages in thread
From: Eric Wong @ 2018-06-13  2:07 UTC (permalink / raw)
  To: Leah Neukirchen; +Cc: meta

Leah Neukirchen <leah@vuxu.org> wrote:
> Eric Wong <e@80x24.org> writes:
> > been trying to avoid being at the computer too much for health
> > reasons.
> 
> No problem, get well soon.

Thanks, but they're more preventative measures than anything.

> >> 1) Makefile.PL only works properly when run from a checkout, not a tarball.
> >> I replaced the beginning with
> >> 
> >> my @EXE_FILES = split("\n", `printf '%s\n' script/* 2>/dev/null`);
> >> my $PM_FILES = `find lib 2>/dev/null`;
> >
> > Thanks, I'd probably add "-name '*.pm'" to find(1) to filter out
> > directories.  But I wonder if it's better to grep the MANIFEST
> > file...
> 
> Yes, using MANIFEST is a better solution.

OK, will push this out:

---------8<-------
Subject: [PATCH] Makefile.PL: do not depend on git

Otherwise, things do not work from a tarball distribution.

Reported-by: Leah Neukirchen <leah@vuxu.org>
  https://public-inbox.org/meta/871sdfzy80.fsf@gmail.com/
---
 Makefile.PL | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/Makefile.PL b/Makefile.PL
index a47e17b..027c3e6 100644
--- a/Makefile.PL
+++ b/Makefile.PL
@@ -3,9 +3,10 @@
 # License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
 use strict;
 use ExtUtils::MakeMaker;
-my @EXE_FILES = split("\n", `git ls-files 'script/*' 2>/dev/null`);
-my $PM_FILES = `git ls-files lib '*.pm' 2>/dev/null`;
-$PM_FILES =~ tr/\n/ /;
+open my $m, '<', 'MANIFEST' or die "open(MANIFEST): $!\n";
+chomp(my @manifest = (<$m>));
+my @EXE_FILES = grep(m!^script/!, @manifest);
+my $PM_FILES = join(' ', grep(m!^lib/.*\.pm$!, @manifest));
 
 WriteMakefile(
 	NAME => 'PublicInbox',
-- 
EW

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] Makefile.PL: do not depend on git
  2018-06-13  2:07     ` [PATCH] Makefile.PL: do not depend on git Eric Wong
@ 2018-06-13 14:26       ` Leah Neukirchen
  2018-06-13 21:04         ` Eric Wong
  0 siblings, 1 reply; 13+ messages in thread
From: Leah Neukirchen @ 2018-06-13 14:26 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

Eric Wong <e@80x24.org> writes:

> Subject: [PATCH] Makefile.PL: do not depend on git
>
> Otherwise, things do not work from a tarball distribution.

The same issue also applies to ssoma.

-- 
Leah Neukirchen  <leah@vuxu.org>  http://leah.zone

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] Makefile.PL: do not depend on git
  2018-06-13 14:26       ` Leah Neukirchen
@ 2018-06-13 21:04         ` Eric Wong
  2018-06-13 21:20           ` Leah Neukirchen
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Wong @ 2018-06-13 21:04 UTC (permalink / raw)
  To: Leah Neukirchen; +Cc: meta

Leah Neukirchen <leah@vuxu.org> wrote:
> Eric Wong <e@80x24.org> writes:
> 
> > Subject: [PATCH] Makefile.PL: do not depend on git
> >
> > Otherwise, things do not work from a tarball distribution.
> 
> The same issue also applies to ssoma.

Thanks, pushed an identical change to
29b74d2c4cda7e06524ec92d5bc517cc07a05979 to ssoma.git

That said, I'm not sure if it's worth continuing work on ssoma...
I think focusing on existing standards such as NNTP is a better
use of our time.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] Makefile.PL: do not depend on git
  2018-06-13 21:04         ` Eric Wong
@ 2018-06-13 21:20           ` Leah Neukirchen
  0 siblings, 0 replies; 13+ messages in thread
From: Leah Neukirchen @ 2018-06-13 21:20 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

Eric Wong <e@80x24.org> writes:

> Leah Neukirchen <leah@vuxu.org> wrote:
>> Eric Wong <e@80x24.org> writes:
>> 
>> > Subject: [PATCH] Makefile.PL: do not depend on git
>> >
>> > Otherwise, things do not work from a tarball distribution.
>> 
>> The same issue also applies to ssoma.
>
> Thanks, pushed an identical change to
> 29b74d2c4cda7e06524ec92d5bc517cc07a05979 to ssoma.git
>
> That said, I'm not sure if it's worth continuing work on ssoma...
> I think focusing on existing standards such as NNTP is a better
> use of our time.

I wrote a prototype of a NNTP fetcher for mblaze:
http://git.vuxu.org/mblaze/tree/contrib/msuck

-- 
Leah Neukirchen  <leah@vuxu.org>  http://leah.zone

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Some points on public-inbox
  2018-06-12 11:31   ` Leah Neukirchen
  2018-06-13  2:07     ` [PATCH] Makefile.PL: do not depend on git Eric Wong
@ 2018-06-13 21:40     ` Eric Wong
  2018-06-13 22:43       ` [PATCH] www: use undecoded paths for Message-ID extraction Eric Wong
  1 sibling, 1 reply; 13+ messages in thread
From: Eric Wong @ 2018-06-13 21:40 UTC (permalink / raw)
  To: Leah Neukirchen; +Cc: meta

Leah Neukirchen <leah@vuxu.org> wrote:
> During testing, we also found another thing when obscure characters
> are used in Message-IDs, esp. / and ?.
> 
> E.g. using a Message-ID of <F1WYEAZPOF.3LOD2T7ZHY9I1@localdomain/raw/T>
> will create a corrupt link.  Some more "ideas" are at
> https://inbox.vuxu.org/pi-test/

Thanks for this.  I've overlooked some things and will
investigate :x

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH] www: use undecoded paths for Message-ID extraction
  2018-06-13 21:40     ` Some points on public-inbox Eric Wong
@ 2018-06-13 22:43       ` Eric Wong
  2018-06-26  7:46         ` [PATCH] additional tests for bad Message-IDs in URLs Eric Wong
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Wong @ 2018-06-13 22:43 UTC (permalink / raw)
  To: Leah Neukirchen; +Cc: meta

> Leah Neukirchen <leah@vuxu.org> wrote:
> > During testing, we also found another thing when obscure characters
> > are used in Message-IDs, esp. / and ?.
> > 
> > E.g. using a Message-ID of <F1WYEAZPOF.3LOD2T7ZHY9I1@localdomain/raw/T>
> > will create a corrupt link.  Some more "ideas" are at
> > https://inbox.vuxu.org/pi-test/

I guess I'm spoiled by Rack where PATH_INFO is undecoded :x
However, REQUEST_URI is specified in PSGI specs(*)

Very lightly tested, but this seems to work; additions to the
test suite will be necessary...

------8<----
Subject: [PATCH] www: use undecoded paths for Message-ID extraction

In PSGI, PATH_INFO contains URI-decoded paths which cause
problems when Message-IDs contain ambiguous characters for used
for routing.  Instead, extract the undecoded path from
REQUEST_URI and use that.

Reported-by: Leah Neukirchen <leah@vuxu.org>
  https://public-inbox.org/meta/8736xsb5s5.fsf@vuxu.org/
---
 lib/PublicInbox/WWW.pm | 40 ++++++++++++++++++++++++++++------------
 t/cgi.t                |  2 ++
 2 files changed, 30 insertions(+), 12 deletions(-)

diff --git a/lib/PublicInbox/WWW.pm b/lib/PublicInbox/WWW.pm
index 24e24f1..c1c3926 100644
--- a/lib/PublicInbox/WWW.pm
+++ b/lib/PublicInbox/WWW.pm
@@ -36,6 +36,17 @@ sub run {
 	PublicInbox::WWW->new->call($req->env);
 }
 
+my %path_re_cache;
+
+sub path_re ($) {
+	my $sn = $_[0]->{SCRIPT_NAME};
+	$path_re_cache{$sn} ||= do {
+		$sn = '/'.$sn unless index($sn, '/') == 0;
+		$sn =~ s!/\z!!;
+		qr!\A(?:https?://[^/]+)?\Q$sn\E(/[^\?\#]+)!;
+	};
+}
+
 sub call {
 	my ($self, $env) = @_;
 	my $ctx = { env => $env, www => $self };
@@ -50,7 +61,8 @@ sub call {
 	} split(/[&;]+/, $env->{QUERY_STRING});
 	$ctx->{qp} = \%qp;
 
-	my $path_info = $env->{PATH_INFO};
+	# not using $env->{PATH_INFO} here since that's already decoded
+	my ($path_info) = ($env->{REQUEST_URI} =~ path_re($env));
 	my $method = $env->{REQUEST_METHOD};
 
 	if ($method eq 'POST') {
@@ -91,13 +103,13 @@ sub call {
 		invalid_inbox_mid($ctx, $1, $2) || get_attach($ctx, $idx, $fn);
 	# in case people leave off the trailing slash:
 	} elsif ($path_info =~ m!$INBOX_RE/$MID_RE/(T|t)\z!o) {
-		my ($inbox, $mid, $suffix) = ($1, $2, $3);
+		my ($inbox, $mid_ue, $suffix) = ($1, $2, $3);
 		$suffix .= $suffix =~ /\A[tT]\z/ ? '/#u' : '/';
-		r301($ctx, $inbox, $mid, $suffix);
+		r301($ctx, $inbox, $mid_ue, $suffix);
 
 	} elsif ($path_info =~ m!$INBOX_RE/$MID_RE/R/?\z!o) {
-		my ($inbox, $mid) = ($1, $2);
-		r301($ctx, $inbox, $mid, '#R');
+		my ($inbox, $mid_ue) = ($1, $2);
+		r301($ctx, $inbox, $mid_ue, '#R');
 
 	} elsif ($path_info =~ m!$INBOX_RE/$MID_RE/f/?\z!o) {
 		r301($ctx, $1, $2);
@@ -164,11 +176,11 @@ sub invalid_inbox ($$) {
 
 # returns undef if valid, array ref response if invalid
 sub invalid_inbox_mid {
-	my ($ctx, $inbox, $mid) = @_;
+	my ($ctx, $inbox, $mid_ue) = @_;
 	my $ret = invalid_inbox($ctx, $inbox);
 	return $ret if $ret;
 
-	$ctx->{mid} = $mid;
+	my $mid = $ctx->{mid} = uri_unescape($mid_ue);
 	my $ibx = $ctx->{-inbox};
 	if ($mid =~ m!\A([a-f0-9]{2})([a-f0-9]{38})\z!) {
 		my ($x2, $x38) = ($1, $2);
@@ -177,7 +189,7 @@ sub invalid_inbox_mid {
 		require Email::Simple;
 		my $s = Email::Simple->new($str);
 		$mid = PublicInbox::MID::mid_clean($s->header('Message-ID'));
-		return r301($ctx, $inbox, $mid);
+		return r301($ctx, $inbox, mid_escape($mid));
 	}
 	undef;
 }
@@ -352,7 +364,7 @@ sub legacy_redirects {
 }
 
 sub r301 {
-	my ($ctx, $inbox, $mid, $suffix) = @_;
+	my ($ctx, $inbox, $mid_ue, $suffix) = @_;
 	my $obj = $ctx->{-inbox};
 	unless ($obj) {
 		my $r404 = invalid_inbox($ctx, $inbox);
@@ -361,7 +373,11 @@ sub r301 {
 	}
 	my $url = $obj->base_url($ctx->{env});
 	my $qs = $ctx->{env}->{QUERY_STRING};
-	$url .= (mid_escape($mid) . '/') if (defined $mid);
+	if (defined $mid_ue) {
+		# common, and much nicer as '@' than '%40':
+		$mid_ue =~ s/%40/@/g;
+		$url .= $mid_ue . '/';
+	}
 	$url .= $suffix if (defined $suffix);
 	$url .= "?$qs" if $qs ne '';
 
@@ -371,9 +387,9 @@ sub r301 {
 }
 
 sub msg_page {
-	my ($ctx, $inbox, $mid, $e) = @_;
+	my ($ctx, $inbox, $mid_ue, $e) = @_;
 	my $ret;
-	$ret = invalid_inbox_mid($ctx, $inbox, $mid) and return $ret;
+	$ret = invalid_inbox_mid($ctx, $inbox, $mid_ue) and return $ret;
 	'' eq $e and return get_mid_html($ctx);
 	'T/' eq $e and return get_thread($ctx, 1);
 	't/' eq $e and return get_thread($ctx);
diff --git a/t/cgi.t b/t/cgi.t
index bd92ca3..2e2476d 100644
--- a/t/cgi.t
+++ b/t/cgi.t
@@ -225,6 +225,8 @@ sub cgi_run {
 	my %env = (
 		PATH_INFO => $_[0],
 		QUERY_STRING => $_[1] || "",
+		SCRIPT_NAME => '',
+		REQUEST_URI => $_[0] . ($_[1] ? "?$_[1]" : ''),
 		REQUEST_METHOD => $_[2] || "GET",
 		GATEWAY_INTERFACE => 'CGI/1.1',
 		HTTP_ACCEPT => '*/*',
-- 
(*) git clone https://github.com/plack/psgi-specs.git

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH] additional tests for bad Message-IDs in URLs
  2018-06-13 22:43       ` [PATCH] www: use undecoded paths for Message-ID extraction Eric Wong
@ 2018-06-26  7:46         ` Eric Wong
  0 siblings, 0 replies; 13+ messages in thread
From: Eric Wong @ 2018-06-26  7:46 UTC (permalink / raw)
  To: Leah Neukirchen; +Cc: meta

Followup-to: 73cfed86d8a8287a
   ("www: use undecoded paths for Message-ID extraction")

Reported-by: Leah Neukirchen <leah@vuxu.org>
  https://public-inbox.org/meta/8736xsb5s5.fsf@vuxu.org/
---
 Oops, forgot this earlier :x

 MANIFEST          |  1 +
 t/psgi_bad_mids.t | 85 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 86 insertions(+)
 create mode 100644 t/psgi_bad_mids.t

diff --git a/MANIFEST b/MANIFEST
index 08a8ef4..68c79c9 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -182,6 +182,7 @@ t/perf-threading.t
 t/plack.t
 t/precheck.t
 t/psgi_attach.t
+t/psgi_bad_mids.t
 t/psgi_mount.t
 t/psgi_search.t
 t/psgi_text.t
diff --git a/t/psgi_bad_mids.t b/t/psgi_bad_mids.t
new file mode 100644
index 0000000..5008f5b
--- /dev/null
+++ b/t/psgi_bad_mids.t
@@ -0,0 +1,85 @@
+# Copyright (C) 2018 all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+use strict;
+use warnings;
+use Test::More;
+use File::Temp qw/tempdir/;
+use PublicInbox::MIME;
+use PublicInbox::Config;
+use PublicInbox::WWW;
+my @mods = qw(DBD::SQLite Search::Xapian HTTP::Request::Common Plack::Test
+		URI::Escape Plack::Builder);
+foreach my $mod (@mods) {
+	eval "require $mod";
+	plan skip_all => "$mod missing for psgi_bad_mids.t" if $@;
+}
+use_ok($_) for @mods;
+use_ok 'PublicInbox::V2Writable';
+my $mainrepo = tempdir('pi-bad-mids-XXXXXX', TMPDIR => 1, CLEANUP => 1);
+my $cfgpfx = "publicinbox.bad-mids";
+my $ibx = {
+	mainrepo => $mainrepo,
+	name => 'bad-mids',
+	version => 2,
+	-primary_address => 'test@example.com',
+};
+$ibx = PublicInbox::Inbox->new($ibx);
+my $im = PublicInbox::V2Writable->new($ibx, 1);
+$im->{parallel} = 0;
+
+my $msgs = <<'';
+F1V5OR6NMF.3M649JTLO9IXD@tux.localdomain/hehe1"'<foo
+F1V5NB0PTU.3U0DCVGAJ750Z@tux.localdomain"'<>/foo
+F1V5MIHGCU.2ABINKW6WBE8N@tux.localdomain/raw
+F1V5LF9D9C.2QT5PGXZQ050E@tux.localdomain/t.atom
+F1V58X3CMU.2DCCVAKQZGADV@tux.localdomain/../../../../foo
+F1TVKINT3G.2S6I36MXMHYG6@tux.localdomain" onclick="alert(1)"
+
+my @mids = split(/\n/, $msgs);
+my $i = 0;
+foreach my $mid (@mids) {
+	my $data = << "";
+Subject: test
+Message-ID: <$mid>
+From: a\@example.com
+To: b\@example.com
+Date: Fri, 02 Oct 1993 00:00:0$i +0000
+
+
+	my $mime = PublicInbox::MIME->new(\$data);
+	ok($im->add($mime), "added $mid");
+	$i++
+}
+$im->done;
+
+my $cfg = {
+	"$cfgpfx.address" => $ibx->{-primary_address},
+	"$cfgpfx.mainrepo" => $mainrepo,
+};
+my $config = PublicInbox::Config->new($cfg);
+my $www = PublicInbox::WWW->new($config);
+test_psgi(sub { $www->call(@_) }, sub {
+	my ($cb) = @_;
+	my $res = $cb->(GET('/bad-mids/'));
+	is($res->code, 200, 'got 200 OK listing');
+	my $raw = $res->content;
+	foreach my $mid (@mids) {
+		ok(index($raw, $mid) < 0, "escaped $mid");
+	}
+
+	my (@xmids) = ($raw =~ m!\bhref="([^"]+)/t\.mbox\.gz"!sg);
+	is(scalar(@xmids), scalar(@mids),
+		'got escaped links to all messages');
+
+	@xmids = reverse @xmids;
+	foreach my $i (0..$#xmids) {
+		$res = $cb->(GET("/bad-mids/$xmids[$i]/raw"));
+		is($res->code, 200, 'got 200 OK raw message');
+		like($res->content, qr/Message-ID: <\Q$mids[$i]\E>/s,
+			'retrieved correct message');
+	}
+});
+
+done_testing();
+
+1;
-- 
EW

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, back to index

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-09 17:06 Some points on public-inbox Leah Neukirchen
2018-06-12 10:09 ` Eric Wong
2018-06-12 11:31   ` Leah Neukirchen
2018-06-13  2:07     ` [PATCH] Makefile.PL: do not depend on git Eric Wong
2018-06-13 14:26       ` Leah Neukirchen
2018-06-13 21:04         ` Eric Wong
2018-06-13 21:20           ` Leah Neukirchen
2018-06-13 21:40     ` Some points on public-inbox Eric Wong
2018-06-13 22:43       ` [PATCH] www: use undecoded paths for Message-ID extraction Eric Wong
2018-06-26  7:46         ` [PATCH] additional tests for bad Message-IDs in URLs Eric Wong
2018-06-12 13:19   ` Some points on public-inbox Leah Neukirchen
2018-06-12 17:05   ` Konstantin Ryabitsev
2018-06-13  1:57     ` Eric Wong

user/dev discussion of public-inbox itself

Archives are clonable:
	git clone --mirror https://public-inbox.org/meta
	git clone --mirror http://czquwvybam4bgbro.onion/meta
	git clone --mirror http://hjrcffqmbrq6wope.onion/meta
	git clone --mirror http://ou63pmih66umazou.onion/meta

Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.mail.public-inbox.meta
	nntp://ou63pmih66umazou.onion/inbox.comp.mail.public-inbox.meta
	nntp://czquwvybam4bgbro.onion/inbox.comp.mail.public-inbox.meta
	nntp://hjrcffqmbrq6wope.onion/inbox.comp.mail.public-inbox.meta
	nntp://news.gmane.org/gmane.mail.public-inbox.general

 note: .onion URLs require Tor: https://www.torproject.org/
       or Tor2web: https://www.tor2web.org/

AGPL code for this site: git clone https://public-inbox.org/ public-inbox