user/dev discussion of public-inbox itself
* Relationship between public-inbox and ssoma?
@ 2018-03-05  0:54 Nicolás Ojeda Bär
  2018-03-05  2:07 ` Eric Wong
  0 siblings, 1 reply; 12+ messages in thread
From: Nicolás Ojeda Bär @ 2018-03-05  0:54 UTC (permalink / raw)
  To: meta

Hello,

Thanks very much for this great project.

I am a bit puzzled about the difference between public-inbox and ssoma. In particular:

- What is the difference between public-inbox-mda and ssoma-mda ?
- Are the git repository formats the same for public-inbox and ssoma ?

Any comments appreciated.

Thanks a lot!

Best wishes,
Nicolás


* Re: Relationship between public-inbox and ssoma?
  2018-03-05  0:54 Relationship between public-inbox and ssoma? Nicolás Ojeda Bär
@ 2018-03-05  2:07 ` Eric Wong
  2018-03-05 11:45   ` Nicolás Ojeda Bär
  2018-03-15 15:30   ` internal format (was: Relationship between public-inbox and ssoma?) Stefan Monnier
  0 siblings, 2 replies; 12+ messages in thread
From: Eric Wong @ 2018-03-05  2:07 UTC (permalink / raw)
  To: Nicolás Ojeda Bär; +Cc: meta

Nicolás Ojeda Bär <n.oje.bar@gmail.com> wrote:
> Hello,
> 
> Thanks very much for this great project.
> 
> I am a bit puzzled about the difference between public-inbox and ssoma. In particular:
> 
> - What is the difference between public-inbox-mda and ssoma-mda ?

public-inbox-mda is more suitable for public endpoints where
it's the primary entry point for publicly shared mail.
ssoma-mda is/was intended for personal mail.  Originally,
public-inbox depended on and used ssoma, but that was given up
for more performance.

Sidenote: I don't recommend public-inbox-mda for running
_mirrors_ of existing mailing lists since it's stricter than
what most lists accept.  public-inbox-watch is more lenient and
more performant (on Linux with inotify, at least); so I wrote
it for mirroring.

> - Are the git repository formats the same for public-inbox and ssoma ?

Currently they are the same with one exception: ssoma allows two
different messages (different blob SHA-1) to have the same
Message-Id by default; public-inbox (current version) does not.
(ssoma-mda has a "-1" option to disallow duplicate Message-Ids.)

The work-in-progress "v2" public-inbox format diverges and I
don't currently have plans to port ssoma to use it.  The v1
format will remain supported in public-inbox.

I'm not sure if ssoma is worth the effort any more, as it's too
much effort to promote a new sync protocol (even if based on
git).  I'd rather improve NNTP servers and clients as an option
for people to read public inboxes.

> Any comments appreciated.
> 
> Thanks a lot!

No problem, thanks for your interest.


* Re: Relationship between public-inbox and ssoma?
  2018-03-05  2:07 ` Eric Wong
@ 2018-03-05 11:45   ` Nicolás Ojeda Bär
  2018-03-05 17:50     ` Eric Wong
  2018-03-15 15:30   ` internal format (was: Relationship between public-inbox and ssoma?) Stefan Monnier
  1 sibling, 1 reply; 12+ messages in thread
From: Nicolás Ojeda Bär @ 2018-03-05 11:45 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

Hello Eric,

Thanks for the prompt reply.  I am trying to migrate a long-lived
mailing list (65k messages over 26 years).  Below are some
troubles/questions I am having; any suggestions would be greatly
appreciated.

- public-inbox-watch seems to struggle with very big maildirs; for now
  I am moving the data into the maildir a little at a time and that
  seems to work.  Is there a particular obstacle to making the
  importing process more incremental?

- Trouble due to missing/malformed headers (mostly on very old
messages). For example, here is the header of a message that trips
public-inbox-watch:

From weis@margaux  Fri Nov 27 16:24:50 1992
Received: by margaux.inria.fr, Fri, 27 Nov 92 16:24:50 +0100
Message-ID: <9211271524.AA29971@margaux.inria.fr>
To: caml-list@margaux
Sender: weis@margaux
Status: O

The error is: fatal: Invalid rfc2822 date "" in ident:  <> (I guess
due to the lack of a Date: field).  I added a Date: field just to test
and noticed that Author: in the git commit was empty, I guess due to
the use of Sender: rather than a From: header.

Do you think it is feasible to improve public-inbox-watch to try to
extract the date from some other header, like Received: above, and to
use Sender: when From: is not found?

- There are some messages that do not have Message-Id, but
  public-inbox-watch seems to be able to handle them.  Is it the case
  that Date: is the only header that is absolutely necessary for
  public-inbox-watch to process the message?

- Does public-inbox-watch ever modify the message data?

- In general public-inbox-watch prints very little about what it is
  doing, which makes it hard(er) to trace problems; a verbose flag
  would be a nice addition, I think.

Thanks!

Best wishes,
Nicolás



* Re: Relationship between public-inbox and ssoma?
  2018-03-05 11:45   ` Nicolás Ojeda Bär
@ 2018-03-05 17:50     ` Eric Wong
  2018-03-05 18:06       ` Nicolás Ojeda Bär
  0 siblings, 1 reply; 12+ messages in thread
From: Eric Wong @ 2018-03-05 17:50 UTC (permalink / raw)
  To: Nicolás Ojeda Bär; +Cc: meta

Nicolás Ojeda Bär <n.oje.bar@gmail.com> wrote:
> Hello Eric,
> 
> Thanks for the prompt reply.  I am trying to migrate a long-lived
> mailing list (65k messages over 26 years), below are some
> troubles/questions I am having;
> any suggestions would be greatly appreciated.
> 
> - public-inbox-watch seems to struggle with very big maildirs; for now
> I am moving the data into the maildir a little at a time and that
> seems to work. Is there a particular obstacle
>   to making the importing process more incremental?

Do you know if it's SpamAssassin being slow?

I disable network checks for large imports in ~/.spamassassin/user_prefs
(if I'm using SA at all during the imports):
# uncomment the following for importing archives:
# dns_available no
# skip_rbl_checks 1
# skip_uribl_checks 1

Fwiw, large directories are a performance killer in any
application.  Seek times and cache overheads are two problems,
at least, so an SSD will definitely help; and maybe even shorter
filenames.

I usually prefer one-off scripts like
scripts/import_vger_from_mbox for initial imports and store
large archives in compressed mboxes instead of Maildir.  Lack of
mbox support is one reason I never used notmuch despite studying
it.
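
A minimal sketch of reading a gzip-compressed mbox for a one-off
import, written in Python rather than the project's Perl.  It is not
scripts/import_vger_from_mbox; the function name `iter_gzipped_mbox`
is hypothetical, and the archive is decompressed to a temporary file
because Python's `mailbox.mbox` needs a real path.

```python
import gzip
import mailbox
import shutil
import tempfile
from pathlib import Path


def iter_gzipped_mbox(path):
    """Yield messages from a gzip-compressed mbox file, one at a time."""
    with tempfile.TemporaryDirectory() as tmpdir:
        plain = Path(tmpdir) / "archive.mbox"
        # Decompress once; mailbox.mbox cannot read from a gzip stream.
        with gzip.open(path, "rb") as src, open(plain, "wb") as dst:
            shutil.copyfileobj(src, dst)
        box = mailbox.mbox(str(plain), create=False)
        try:
            for msg in box:
                yield msg
        finally:
            box.close()
```

Each yielded message could then be handed to whatever importer is in
use, so the large archive never has to be unpacked into a Maildir.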

> - Trouble due to missing/malformed headers (mostly on very old
> messages). For example, here is the header of a message that trips
> public-inbox-watch:
> 
> From weis@margaux  Fri Nov 27 16:24:50 1992
> Received: by margaux.inria.fr, Fri, 27 Nov 92 16:24:50 +0100
> Message-ID: <9211271524.AA29971@margaux.inria.fr>
> To: caml-list@margaux
> Sender: weis@margaux
> Status: O
> 
> The error is: fatal: Invalid rfc2822 date "" in ident:  <> (I guess
> due to the lack of a Date: field). I added a Date: field just to test
> and
> noticed that Author: in the git commit was empty, I guess due to the
> use of Sender: rather than From: header.

I have a patch in the wings to use the Received: date:

 https://public-inbox.org/meta/20180215110840.30413-16-e@80x24.org/raw

And I'm thinking about favoring Received: over Date: if both
exist, since Date: headers are more often wrong...

> Do you think it is feasible to improve public-inbox-watch to try to
> extract the date from some other header like above?
> and to use Sender: when From: is not found?

Sure, I suppose falling back to Sender is correct if From is
missing.
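
A rough sketch of the two fallbacks discussed above (Received: when
Date: is missing, Sender: when From: is missing), in Python rather
than Perl.  It is not the linked patch; the function names are
hypothetical, and the comma-separated Received: form is handled only
because the quoted 1992 header uses it.

```python
from datetime import datetime, timezone
from email import message_from_string
from email.utils import mktime_tz, parseaddr, parsedate_tz


def _parse_date(text):
    """Return an aware UTC datetime, or None if text is not a mail date."""
    parsed = parsedate_tz(text.strip())
    if parsed is None:
        return None
    return datetime.fromtimestamp(mktime_tz(parsed), tz=timezone.utc)


def date_from_received(value):
    """Find the trailing timestamp of a Received: header value."""
    tail = value.rsplit(";", 1)[-1]  # usual form: "...; <date>"
    pieces = tail.split(",")
    # Old headers may read "by host, <date>", so drop leading pieces
    # until the remainder parses as a date.
    for start in range(len(pieces)):
        found = _parse_date(",".join(pieces[start:]))
        if found is not None:
            return found
    return None


def commit_metadata(msg):
    """Pick a date and an author address for a message, with fallbacks."""
    when = _parse_date(msg["Date"]) if msg["Date"] else None
    if when is None:
        for received in msg.get_all("Received", []):
            when = date_from_received(received)
            if when is not None:
                break
    _, author = parseaddr(msg["From"] or msg["Sender"] or "")
    return when, author


raw = """\
Received: by margaux.inria.fr, Fri, 27 Nov 92 16:24:50 +0100
Message-ID: <9211271524.AA29971@margaux.inria.fr>
To: caml-list@margaux
Sender: weis@margaux
Status: O

(body)
"""
when, author = commit_metadata(message_from_string(raw))
print(when.isoformat(), author)
```

This sketch prefers Date: when it parses; preferring Received: would
only mean swapping the order of the two lookups.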

> - There are some messages that do not have Message-Id, but
> public-inbox-watch seems to be able to handle them.

Yes, we generate a Message-Id if one is missing.

>   Is it the case that Date: is the only header that is absolutely
> necessary for public-inbox-watch to process the message?

Probably none of them are, actually.

> - Does public-inbox-watch ever modify the message data?

A Message-ID is generated if one is missing.
Status, Lines, Bytes, Content-Length, and the headers in @BAD_HEADERS
in lib/PublicInbox/MDA.pm are all dropped:

our @BAD_HEADERS = (
	# postfix
	qw(delivered-to x-original-to), # prevent training loops

	# The rest are taken from Mailman 2.1.15:
	# could contain passwords:
	qw(approved approve x-approved x-approve urgent),
	# could be used phishing:
	qw(return-receipt-to disposition-notification-to x-confirm-reading-to),
	# Pegasus mail:
	qw(x-pmrqc)
);

Email::MIME might modify invalid characters in the headers (or
other data if there are bugs in Email::MIME).  I don't think bodies
are modified outside of the not-really-documented
PublicInbox::Filter API.  You can check out some filters at
lib/PublicInbox/Filter/*.pm (some commit messages document them,
but I don't think there are manpages yet).
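
To make the header handling above concrete, here is a small Python
sketch (not the Perl in lib/PublicInbox/MDA.pm) that drops the listed
headers and fills in a missing Message-ID.  The function name `scrub`
and the `example.invalid` domain are made up for illustration; how
public-inbox actually derives a generated Message-ID is not stated
here.

```python
from email import message_from_string
from email.utils import make_msgid

# Header names copied from the message above, lowercase as in @BAD_HEADERS.
DROPPED_HEADERS = (
    "status", "lines", "bytes", "content-length",
    "delivered-to", "x-original-to",
    "approved", "approve", "x-approved", "x-approve", "urgent",
    "return-receipt-to", "disposition-notification-to",
    "x-confirm-reading-to",
    "x-pmrqc",
)


def scrub(msg):
    """Drop unwanted headers and make sure a Message-ID is present."""
    for name in DROPPED_HEADERS:
        del msg[name]  # removes every occurrence; no error if absent
    if not (msg["Message-ID"] or "").strip():
        del msg["Message-ID"]  # discard an empty header, if any
        # Assumption: any unique ID will do for this illustration.
        msg["Message-ID"] = make_msgid(domain="example.invalid")
    return msg


sample = message_from_string(
    "To: caml-list@margaux\n"
    "Sender: weis@margaux\n"
    "Delivered-To: someone@example.org\n"
    "Status: O\n"
    "\n"
    "(body)\n"
)
cleaned = scrub(sample)
print(cleaned.keys())
```

The body is left untouched, which matches the statement that bodies
are only changed by filters.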

> - In general public-inbox-watch prints very little about what it is
> doing, which makes it hard(er) to trace problems; a verbose flag would
> be a nice
>   addition, I think.

I usually use strace on Linux to track down problems.  I'm not
sure it's worth the effort to introduce new options/features
if generic tracing utilities are more detailed and accurate.


Also, I'm going to be mostly offline for about a week starting
tomorrow; so don't expect prompt replies for a bit.


* Re: Relationship between public-inbox and ssoma?
  2018-03-05 17:50     ` Eric Wong
@ 2018-03-05 18:06       ` Nicolás Ojeda Bär
  2018-03-19  7:43         ` watch performance [was: Relationship between public-inbox and ssoma?] Eric Wong
  0 siblings, 1 reply; 12+ messages in thread
From: Nicolás Ojeda Bär @ 2018-03-05 18:06 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

Hi Eric,

Thanks for the quick reply.

On Mon, Mar 5, 2018 at 6:50 PM, Eric Wong <e@80x24.org> wrote:
> Nicolás Ojeda Bär <n.oje.bar@gmail.com> wrote:
>> Hello Eric,
>>
>> Thanks for the prompt reply.  I am trying to migrate a long-lived
>> mailing list (65k messages over 26 years), below are some
>> troubles/questions I am having;
>> any suggestions would be greatly appreciated.
>>
>> - public-inbox-watch seems to struggle with very big maildirs; for now
>> I am moving the data into the maildir a little at a time and that
>> seems to work. Is there a particular obstacle
>>   to making the importing process more incremental?
>
> Do you know if it's SpamAssassin being slow?
>
> I disable network checks for large imports in ~/.spamassassin/user_prefs
> (if I'm using SA at all during the imports):
> # uncomment the following for importing archives:
> # dns_available no
> # skip_rbl_checks 1
> # skip_uribl_checks 1

I don't think it is even installed and I have not set it up at all, so
probably not.

> Fwiw, large directories are a performance killer in any
> application.  Seek times and cache overheads are two problems,
> at least, so an SSD will definitely help; and maybe even shorter
> filenames.

OK.

> I usually prefer one-off scripts like
> scripts/import_vger_from_mbox for initial imports and store
> large archives in compressed mboxes instead of Maildir.  Lack of
> mbox support is one reason I never used notmuch despite studying
> it.

Thanks for the pointer, I will take a look, hopefully it will nudge me
in the right direction.

>> - Trouble due to missing/malformed headers (mostly on very old
>> messages). For example, here is the header of a message that trips
>> public-inbox-watch:
>>
>> From weis@margaux  Fri Nov 27 16:24:50 1992
>> Received: by margaux.inria.fr, Fri, 27 Nov 92 16:24:50 +0100
>> Message-ID: <9211271524.AA29971@margaux.inria.fr>
>> To: caml-list@margaux
>> Sender: weis@margaux
>> Status: O
>>
>> The error is: fatal: Invalid rfc2822 date "" in ident:  <> (I guess
>> due to the lack of a Date: field). I added a Date: field just to test
>> and
>> noticed that Author: in the git commit was empty, I guess due to the
>> use of Sender: rather than From: header.
>
> I have a patch in the wings to use the Received: date:
>
>  https://public-inbox.org/meta/20180215110840.30413-16-e@80x24.org/raw
>
> And I'm thinking about favoring Received: over Date: if both
> exist, since Date: headers are more often wrong...

Great, I will try your patch to see if I can get my messages past
public-inbox-watch.

>> Do you think it is feasible to improve public-inbox-watch to try to
>> extract the date from some other header like above?
>> and to use Sender: when From: is not found?
>
> Sure, I suppose falling back to Sender is correct if From is
> missing.

OK, I will see if I can patch this on my own since I am keen on
getting this mailing list imported.

>> - There are some messages that do not have Message-Id, but
>> public-inbox-watch seems to be able to handle them.
>
> Yes, we generate a Message-Id if one is missing
>
>>   Is it the case that Date: is the only header that is absolutely
>> necessary for public-inbox-watch to process the message?
>
> Probably none of them are, actually.

Currently, public-inbox-watch refuses to process the message with the
header quoted above due to a missing Date: header.

>> - Does public-inbox-watch ever modify the message data?
>
> A Message-ID is generated if one is missing.
> Status, Lines, Bytes, Content-Length, and the headers in @BAD_HEADERS
> in lib/PublicInbox/MDA.pm are all dropped:
>
> our @BAD_HEADERS = (
>         # postfix
>         qw(delivered-to x-original-to), # prevent training loops
>
>         # The rest are taken from Mailman 2.1.15:
>         # could contain passwords:
>         qw(approved approve x-approved x-approve urgent),
>         # could be used phishing:
>         qw(return-receipt-to disposition-notification-to x-confirm-reading-to),
>         # Pegasus mail:
>         qw(x-pmrqc)
> );
>
> Email::MIME might modify invalid characters in the headers (or
> other data if there are bugs in Email::MIME).  I don't think bodies
> are modified outside of the not-really-documented
> PublicInbox::Filter API.  You can check out some filters at
> lib/PublicInbox/Filter/*.pm (some commit messages document them,
> but I don't think there are manpages yet).

OK, will take a look.

>> - In general public-inbox-watch prints very little about what it is
>> doing, which makes it hard(er) to trace problems; a verbose flag would
>> be a nice
>>   addition, I think.
>
> I usually use strace on Linux to track down problems.  I'm not
> sure it's worth the effort to introduce new options/features
> if generic tracing utilities are more detailed and accurate.
>

Makes sense. Thanks for the suggestion.

> Also, I'm going to be mostly offline for about a week starting
> tomorrow; so don't expect prompt replies for a bit.

Sure, thanks for the heads-up.

Best wishes,
Nicolás


* internal format (was: Relationship between public-inbox and ssoma?)
  2018-03-05  2:07 ` Eric Wong
  2018-03-05 11:45   ` Nicolás Ojeda Bär
@ 2018-03-15 15:30   ` Stefan Monnier
  2018-03-15 16:40     ` Eric Wong
  1 sibling, 1 reply; 12+ messages in thread
From: Stefan Monnier @ 2018-03-15 15:30 UTC (permalink / raw)
  To: meta

> The work-in-progress "v2" public-inbox format diverges and I
> don't currently have plans to port ssoma to use it.  The v1
> format will remain supported in public-inbox.

Which reminds me: do you have some document that explains the reasoning
behind the choice of format (especially which alternatives were
considered and dropped and why)?


        Stefan


* Re: internal format (was: Relationship between public-inbox and ssoma?)
  2018-03-15 15:30   ` internal format (was: Relationship between public-inbox and ssoma?) Stefan Monnier
@ 2018-03-15 16:40     ` Eric Wong
  2018-03-15 18:49       ` internal format Stefan Monnier
  0 siblings, 1 reply; 12+ messages in thread
From: Eric Wong @ 2018-03-15 16:40 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: meta

Stefan Monnier <monnier@IRO.UMontreal.CA> wrote:
> > The work-in-progress "v2" public-inbox format diverges and I
> > don't currently have plans to port ssoma to use it.  The v1
> > format will remain supported in public-inbox.
> 
> Which reminds me: do you have some document that explains the reasoning
> behind the choice of format (especially which alternatives were
> considered and dropped and why)?

v1 or v2?  Some of the reasoning for v2 was here:
  https://public-inbox.org/meta/20180209205140.GA11047@dcvr/

v1 was similar to what git did with loose objects and prevented
dupes based on Message-ID.  That worked well enough for small
non-mirror lists and wasn't designed with search (Xapian) in mind.

As for git itself: reliability, ease-of-replication, storage
efficiency.
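
For readers unfamiliar with the loose-object comparison, the sketch
below shows one way a v1-style path could be derived from a
Message-ID, in Python.  It rests on assumptions not spelled out in
this thread: that the path is the SHA-1 hex digest of the Message-ID
split 2/38 like git's loose objects, and that the angle brackets are
stripped before hashing.  The function name is hypothetical.

```python
import hashlib


def v1_style_path(message_id):
    """Map a Message-ID to a two-level path, git loose-object style."""
    mid = message_id.strip().strip("<>")  # assumption: hash without <>
    digest = hashlib.sha1(mid.encode("utf-8")).hexdigest()
    return f"{digest[:2]}/{digest[2:]}"


path = v1_style_path("<9211271524.AA29971@margaux.inria.fr>")
print(path)
```

Because the path depends only on the Message-ID, a second message
with the same Message-ID lands on the same path, which is how a
layout like this can reject duplicates.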


* Re: internal format
  2018-03-15 16:40     ` Eric Wong
@ 2018-03-15 18:49       ` Stefan Monnier
  2018-03-15 20:14         ` Eric Wong
  0 siblings, 1 reply; 12+ messages in thread
From: Stefan Monnier @ 2018-03-15 18:49 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

> v1 or v2?  Some of the reasoning for v2 was here:
>   https://public-inbox.org/meta/20180209205140.GA11047@dcvr/

IIUC, the issues you consider important are:

- Size
- Time to perform "git rev-list --objects --all"
- Flexibility, e.g. to be able to remove messages.

For size your benchmarks seem to indicate that as long as it's kept
inside Git, the choice of format doesn't actually affect it
significantly (and this matches my expectations).
Though I guess it's probably possible to improve on it with enough effort
(e.g. storing attachments separately, or splitting large messages into
chunks, e.g. like `bup` does), but I doubt it's worth the effort
(especially if you assume that the mailing-list imposes a limit on
message size).

For timing, I'm curious why you only consider
"git rev-list --objects --all".  Which operation does this correspond
to in public-inbox, and is that really the only one that is
performance-sensitive?

> As for git itself: reliability, ease-of-replication, storage
> efficiency.

Yes, that part I totally understand (same reason I used Git in BuGit
https://gitlab.com/monnier/bugit).  Part of my question was related to
the fact that in BuGit I store the messages in the commit-object rather
than in files (which trivially gives me conflict-free merges as well as
"discussion threads") so I was wondering if it would make sense in the
case of public-inbox to keep the email messages in the commit objects
rather than in files, but since I don't really know which operations are
frequent/important I really have no idea.

One thing that strikes me is that you don't seem to use its
"decentralization": IIUC public-inbox always assumes one of the
repositories is the "master" and others are mirrors (or mirrors of
mirrors), so you get efficient "fast-forward" updates, but you
don't do "merges".

This probably means that keeping the email messages in commit objects
wouldn't bring any benefits.

Also this means that public-inbox could freely rewrite history, for
example (which you'll need to really expunge messages) and just use
"forced updates" in mirrors.

Now I'm left wondering what it would mean for something like
public-inbox to support merging.


        Stefan


* Re: internal format
  2018-03-15 18:49       ` internal format Stefan Monnier
@ 2018-03-15 20:14         ` Eric Wong
  2018-03-15 21:05           ` Stefan Monnier
  0 siblings, 1 reply; 12+ messages in thread
From: Eric Wong @ 2018-03-15 20:14 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: meta

Stefan Monnier <monnier@IRO.UMontreal.CA> wrote:
> > v1 or v2?  Some of the reasoning for v2 was here:
> >   https://public-inbox.org/meta/20180209205140.GA11047@dcvr/
> 
> IIUC, the issues you consider important are:
> 
> - Size
> - Time to perform "git rev-list --objects --all"
> - Flexibility, e.g. to be able to remove messages.
> 
> For size your benchmarks seem to indicate that as long as it's kept
> inside Git, the choice of format doesn't actually affect it
> significantly (and this matches my expectations).
> Though I guess it's probably possible to improve on it with enough effort
> (e.g. storing attachments separately, or splitting large messages into
> chunks, e.g. like `bup` does), but I doubt it's worth the effort
> (especially if you assume that the mailing-list imposes a limit on
> message size).

Right, I decided splitting big messages wasn't worth the
complexity and we leave it up to the (usually reasonable)
mail server.

> For timing, I'm curious why you only consider
> "git rev-list --objects --all".  Which operation does this correspond
> to in public-inbox and is that really the only one that is
> performance-sensitive?

That traverses the object graph (same walk used for repacking
where bitmaps don't help).  I got it from Peff
https://public-inbox.org/git/20160805092805.w3nwv2l6jkbuwlzf@sigill.intra.peff.net/

That's the main thing we can control with repository layout.
Large packs are generally a problem with git, so v2 partitions
repositories at roughly 1G.

> > As for git itself: reliability, ease-of-replication, storage
> > efficiency.
> 
> Yes, that part I totally understand (same reason I used Git in BuGit
> https://gitlab.com/monnier/bugit).  Part of my question was related to
> the fact that in BuGit I store the messages in the commit-object rather
> than in files (which trivially gives me conflict-free merges as well as
> "discussion threads") so I was wondering if it would make sense in the
> case of public-inbox to keep the email messages in the commit objects
> rather than in files, but since I don't really know which operations are
> frequent/important I really have no idea.

I thought about storing messages in the commit object, but that
would break our current use of Xapian if history rewrites are
required for legal reasons.

> One thing that strikes me is that you don't seem to use its
> "decentralization": IIUC public-inbox always assumes one of the
> repositories is the "master" and others are mirrors (or mirrors of
> mirrors), so you get efficient "fast-forward" updates, but you
> don't do "merges".

Right, git merges require the use of pre-established
communications channels (e.g. email) to coordinate.  I don't
believe merging and keeping an authoritative history/order makes
sense with public-inbox (more on this later).

What's important to decentralization is that the "root" can
change easily (change of URLs / archival addresses) and that all
the messages eventually end up replicable.

I consider ease-of-replication and efficiency the building
blocks of decentralization.

Beyond that, I believe encouraging "pull" via NNTP and
discouraging "push" via SMTP with mlmmj/mailman/etc. can
eventually lend itself to entirely forkable communities.

> This probably means that keeping the email messages in commit objects
> wouldn't bring any benefits.
> 
> Also this means that public-inbox could freely rewrite history, for
> example (which you'll need to really expunge messages) and just use
> "forced updates" in mirrors.

We currently store blob SHA-1s in Xapian to avoid tree lookups
in git.  Having a history rewrite can break an entire chain of
unrelated messages if we store commit SHA-1 in Xapian instead of
blobs.
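
The reason blob IDs survive a history rewrite can be shown in a few
lines of Python: git computes a blob's SHA-1 from a short header plus
the file content only, so commits and trees do not enter into it.
The function name below is just for illustration.

```python
import hashlib


def git_blob_oid(data: bytes) -> str:
    """Return the SHA-1 object ID git assigns to a blob with this content."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


message = b"Message-ID: <x@example.org>\n\nhello\n"
print(git_blob_oid(message))
```

The same bytes always give the same ID, no matter which commit
introduced them; a commit ID, by contrast, changes whenever any
ancestor is rewritten.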

> Now I'm left wondering what it would mean for something like
> public-inbox to support merging.

I consider it a waste of effort to maintain an authoritative
commit history when archiving mail.  There are too many variables
when it comes to mail servers and headers and no guarantees on
message ordering.  Among other things, the last (top) Received:
header will surely differ if multiple people start archiving a
list independently of each other.

The email messages are what's important, so replaying an
mbox/Maildir into an importer will get the data that matters
(and deduplication checks will avoid redundant mails).
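
As an illustration of why replaying works even when two archivers
hold slightly different copies, here is a small Python sketch that
deduplicates on Message-ID rather than on the raw bytes.  It is not
public-inbox's importer; the function name `replay` and the in-memory
`seen` set are assumptions made for the example.

```python
from email import message_from_bytes


def replay(raw_messages, seen=None):
    """Return the raw messages whose Message-ID has not been seen yet."""
    seen = set() if seen is None else seen
    kept = []
    for raw in raw_messages:
        msg = message_from_bytes(raw)
        mid = (msg["Message-ID"] or "").strip()
        if mid in seen:
            continue  # already imported from another source
        if mid:
            seen.add(mid)
        kept.append(raw)  # messages without a Message-ID are always kept
    return kept


copy_a = (b"Received: by one.example; Mon, 5 Mar 2018 00:54:00 +0000\n"
          b"Message-ID: <x@example.org>\n\nhi\n")
copy_b = (b"Received: by two.example; Mon, 5 Mar 2018 00:54:09 +0000\n"
          b"Message-ID: <x@example.org>\n\nhi\n")
print(len(replay([copy_a, copy_b])))
```

The two copies differ in their top Received: header, so a byte-level
comparison would keep both; keying on Message-ID keeps one.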


* Re: internal format
  2018-03-15 20:14         ` Eric Wong
@ 2018-03-15 21:05           ` Stefan Monnier
  2018-03-15 21:21             ` Eric Wong
  0 siblings, 1 reply; 12+ messages in thread
From: Stefan Monnier @ 2018-03-15 21:05 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

>> For timing, I'm curious why you only consider
>> "git rev-list --objects --all".  Which operation does this correspond
>> to in public-inbox and is that really the only one that is
>> performance-sensitive?
> That traverses the object graph (same walk used for repacking
> where bitmaps don't help).

Yes, I understand what it does in Git, but I wonder why a full traversal
of the graph is the only/main operation you care about.

Hmm... I guess your other operations are:
- lookup by message-id (which is made efficient because you index files
  by the message-id).
- everything else is done by keeping another index (from NNTP article
  number to message-id (or to blob?)), as in the case of Xapian.

Actually, if you directly index the blobs, you don't really need to
index your file by message-id (you could keep the index from message-id
to blobs external, just as is done for Xapian, right?).

> We currently store blob SHA-1s in Xapian to avoid tree lookups
> in git.  Having a history rewrite can break an entire chain of
> unrelated messages if we store commit SHA-1 in Xapian instead of
> blobs.

Ah, indeed, keeping them as files means that the file's own SHA won't
change when you rewrite history so it makes it much easier to rewrite
history if you rely on this (also probably a lot more efficient within
Git).

>> Now I'm left wondering what it would mean for something like
>> public-inbox to support merging.
> I consider it a waste of effort to maintain an authoritative
> commit history when archiving mail.

Indeed, as long as we're left wondering what good it would do to be able
to merge, we're left with its downsides.


        Stefan


* Re: internal format
  2018-03-15 21:05           ` Stefan Monnier
@ 2018-03-15 21:21             ` Eric Wong
  0 siblings, 0 replies; 12+ messages in thread
From: Eric Wong @ 2018-03-15 21:21 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: meta

Stefan Monnier <monnier@IRO.UMontreal.CA> wrote:
> >> For timing, I'm curious why you only consider
> >> "git rev-list --objects --all".  Which operation does this correspond
> >> to in public-inbox and is that really the only one that is
> >> performance-sensitive?
> > That traverses the object graph (same walk used for repacking
> > where bitmaps don't help).
> 
> Yes, I understand what it does in Git, but I wonder why a full traversal
> of the graph is the only/main operation you care about.
> 
> Hmm... I guess your other operations are:
> - lookup by message-id (which is made efficient because you index files
>   by the message-id).
> - everything else is done by keeping another index (from NNTP article
>   number to message-id (or to blob?)), as in the case of Xapian.
> 
> Actually, if you directly index the blobs, you don't really need to
> index your file by message-id (you could keep the index from message-id
> to blobs external, just as is done for Xapian, right?).

Right, storing blob OIDs in Xapian means tree lookups are irrelevant
to read performance.  Since we can rely on Xapian for v2, we can
fix the graph traversal problem by simplifying the trees and
speed up writes by having smaller trees.

The only remaining performance pain point is the overall size of
repos (which we work around by partitioning).


* watch performance [was: Relationship between public-inbox and ssoma?]
  2018-03-05 18:06       ` Nicolás Ojeda Bär
@ 2018-03-19  7:43         ` Eric Wong
  0 siblings, 0 replies; 12+ messages in thread
From: Eric Wong @ 2018-03-19  7:43 UTC (permalink / raw)
  To: Nicolás Ojeda Bär; +Cc: meta

Nicolás Ojeda Bär <n.oje.bar@gmail.com> wrote:
> On Mon, Mar 5, 2018 at 6:50 PM, Eric Wong <e@80x24.org> wrote:
> > Nicolás Ojeda Bär <n.oje.bar@gmail.com> wrote:
> >> Hello Eric,
> >>
> >> Thanks for the prompt reply.  I am trying to migrate a long-lived
> >> mailing list (65k messages over 26 years), below are some
> >> troubles/questions I am having;
> >> any suggestions would be greatly appreciated.
> >>
> >> - public-inbox-watch seems to struggle with very big maildirs; for now
> >> I am moving the data into the maildir a little at a time and that
> >> seems to work. Is there a particular obstacle
> >>   to making the importing process more incremental?

Heh, I've been adjusting some of that code to support v2, and
-watch has actually been incremental for a while.

It tries to balance work between inboxes fairly and might be
writing data out to disk more often than you want it to for
initial imports.

It was a trade-off between letting readers see up-to-date data
and throughput.

Also, I forgot to ask: are you on Linux with inotify support?
I haven't tried Filesys::Notify::Simple (used by -watch) without it,
so maybe other OSes struggle.

> > I usually prefer one-off scripts like
> > scripts/import_vger_from_mbox for initial imports and store
> > large archives in compressed mboxes instead of Maildir.  Lack of
> > mbox support is one reason I never used notmuch despite studying
> > it.

Ah, another thing I do almost subconsciously for running imports
and tests is use "eatmydata" to disable fsync:
	https://www.flamingspork.com/projects/libeatmydata/

Running -watch with eatmydata on my desktop with an SSD,
I didn't notice any problems with ~28K mail from LKML from the
past month or so.

It might be a pain to support our own knobs for disabling fsync:
there's one knob for Xapian (only 1.4.x, I think), one knob for
SQLite, and git doesn't allow disabling fsync on packs yet,
only on loose objects at the moment; so "eatmydata" is probably
the easiest.

> > And I'm thinking about favoring Received: over Date: if both
> > exist, since Date: headers are more often wrong...

Ugh, but there's patchbombs and git adjusts Date: to get sorting
right for MUAs, so using Received: makes those out-of-order :<
So overall inbox sorting might use Received:, but sorting within
individual threads will need to use the Date: header.
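
The two orderings can be sketched in Python as two sort keys over the
same messages.  The `Mail` class, its fields, and the sample
timestamps are all hypothetical; the sketch only illustrates the
idea of sorting the inbox by arrival time and each thread by the
sender-supplied Date:.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Mail:
    subject: str
    thread: str
    date: datetime      # parsed from Date:
    received: datetime  # parsed from the top Received:


def inbox_order(mails):
    """Overall inbox view: order by when the mail actually arrived."""
    return sorted(mails, key=lambda m: m.received)


def thread_order(mails, thread):
    """Within one thread: order by Date:, which patch senders adjust."""
    return sorted((m for m in mails if m.thread == thread),
                  key=lambda m: m.date)


def utc(second):
    return datetime(2018, 3, 19, 7, 43, second, tzinfo=timezone.utc)


# A two-patch series whose second patch happened to arrive first.
mails = [
    Mail("[PATCH 2/2]", "series", date=utc(2), received=utc(10)),
    Mail("[PATCH 1/2]", "series", date=utc(1), received=utc(11)),
]
print([m.subject for m in thread_order(mails, "series")])
print([m.subject for m in inbox_order(mails)])
```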


Code repositories for project(s) associated with this public inbox

	https://80x24.org/public-inbox.git
