From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <ebiederm@xmission.com>
X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on dcvr.yhbt.net
X-Spam-Level: 
X-Spam-ASN: AS6315 166.70.0.0/16
X-Spam-Status: No, score=-3.7 required=3.0 tests=AWL,BAYES_00,
	RCVD_IN_DNSWL_LOW,SPF_PASS shortcircuit=no autolearn=ham autolearn_force=no
	version=3.4.1
Received: from out03.mta.xmission.com (out03.mta.xmission.com [166.70.13.233])
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
	(No client certificate requested)
	by dcvr.yhbt.net (Postfix) with ESMTPS id BF0121F85E;
	Thu, 12 Jul 2018 13:59:12 +0000 (UTC)
Received: from in02.mta.xmission.com ([166.70.13.52])
	by out03.mta.xmission.com with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128)
	(Exim 4.87)
	(envelope-from <ebiederm@xmission.com>)
	id 1fdc7j-00042t-FH; Thu, 12 Jul 2018 07:59:11 -0600
Received: from [97.119.167.31] (helo=x220.xmission.com)
	by in02.mta.xmission.com with esmtpsa (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128)
	(Exim 4.87)
	(envelope-from <ebiederm@xmission.com>)
	id 1fdc7T-0005dR-VY; Thu, 12 Jul 2018 07:59:11 -0600
From: ebiederm@xmission.com (Eric W. Biederman)
To: Eric Wong <e@80x24.org>
Cc: meta@public-inbox.org
References: <87k1q1bky6.fsf@xmission.com>
	<20180712014715.dn5aouayoa3uejp4@dcvr>
Date: Thu, 12 Jul 2018 08:58:51 -0500
In-Reply-To: <20180712014715.dn5aouayoa3uejp4@dcvr> (Eric Wong's message of
	"Thu, 12 Jul 2018 01:47:15 +0000")
Message-ID: <87k1q07dyc.fsf@xmission.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.1 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain
X-XM-SPF: eid=1fdc7T-0005dR-VY;;;mid=<87k1q07dyc.fsf@xmission.com>;;;hst=in02.mta.xmission.com;;;ip=97.119.167.31;;;frm=ebiederm@xmission.com;;;spf=neutral
X-XM-AID: U2FsdGVkX1+CpYqAFQ9iznk/q6QclC5FDiSpUFUVkfk=
X-SA-Exim-Connect-IP: 97.119.167.31
X-SA-Exim-Mail-From: ebiederm@xmission.com
Subject: Re: Q: V2 format
X-SA-Exim-Version: 4.2.1 (built Thu, 05 May 2016 13:38:54 -0600)
X-SA-Exim-Scanned: Yes (on in02.mta.xmission.com)
List-Id: <meta.public-inbox.org>

Eric Wong <e@80x24.org> writes:

> "Eric W. Biederman" <ebiederm@xmission.com> wrote:
>> I have been digging through the code looking so I can understand the v2
>> format and I have some ideas on how things might be improved, and some
>> questions so that I understand.
>
> Great to know you're interested!  Fwiw, I've still been meaning
> to turn my v2 docs into a POD manpage:
>
>   https://public-inbox.org/meta/20180419015813.GA20051@dcvr/

I have some personal mail archives that I need to do something better
with.  My goal is for day-to-day operations (aka mail delivery and
archiving) to be able to run on a smallish 32bit machine.

But archives are not valuable unless you have a fast search capability,
which makes all of the features of Xapian very interesting.

I need to compare Message-IDs to see if I have content missing from the
public linux-kernel archive.  It is probably Konrad's cleanup of the
headers, but my linux-kernel archive when imported into public-inbox is
slightly larger than Konrad's.
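
A minimal sketch of that comparison, assuming both archives can be
exported as mbox files.  The function names and file paths here are
hypothetical and not part of public-inbox; only the standard Python
mailbox module is used.

```python
import mailbox


def message_ids(messages):
    """Collect the Message-ID header of every message that has one."""
    ids = set()
    for msg in messages:
        mid = msg.get("Message-ID")
        if mid:
            ids.add(str(mid).strip())
    return ids


def missing_from(mine, theirs):
    """Message-IDs present in `mine` but absent from `theirs`, sorted."""
    return sorted(mine - theirs)


# Hypothetical usage; the paths are placeholders:
#   mine = message_ids(mailbox.mbox("my-lkml.mbox"))
#   theirs = message_ids(mailbox.mbox("public-lkml.mbox"))
#   print(missing_from(mine, theirs))
```

This only compares the Message-ID header as written, so messages that
lack the header, or whose header was rewritten during cleanup, would
need a content-based comparison instead.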

I also like the idea of being able to read and archive public lists that
I care about with just a git fetch and local tools.

Public mailing lists and their archives are more important, but
IMAP/regular email support, with its little bit of extra state, is
also on my radar.

>> V1 supported the concept of messages being added and deleted from
>> the git repository all while keeping a full history of everything that
>> went on.  The V2 code appears to have the name 'm' for added and 'd' for
>> deleted, but the public-inbox-index code appears to expect deletes to
>> happen by way of an altered history that totally purge the commits,
>> and does not process the 'd' entries.
>
> "Purge" is a new concept for v2 and not even exposed (yet)
> via tools.  Normal operations to remove files using 'd' (via
> -watch or -rm) don't rewrite old history so it won't disrupt
> non-force fetches.

This helps a lot in understanding the intent of the code.  Konrad had
mentioned something about being able to rebase when I pointed out
the buggy git commits in linux-kernel.

>> What is the thinking about deleted entries, and for v2 what is the
>> preferred way to delete mail from a public inbox git repository and why?
>
> Definitely prefer the normal way with 'd' files to not break
> people using non-force fetches.  "Purge" is too disruptive
> and reserved for extraordinary cases (e.g. legal reasons).

Then I am going to report a probable bug.  In V2 in public-inbox-index
I cannot find a path from finding a 'd' file to a call to unindex.  V1
unindexes deleted files.  Rebased heads for purges call unindex.  I
don't see that for ordinary 'd' files though.

>> Size.  Reading the history of the public inbox meta mailing list and
>> playing around I discovered that I can shave off about 100M of the V2
>> size of the public inbox git repository by pushing all of the
>> messages into a single commit.  Not great for day to day operation,
>> but if rebases are part of the plan, and old archives part of the
>> challenge I see quite a lot of potential for old archives to be reduced
>> to a git repository with a single commit.
>
> Rebases/rewriting history is definitely not part of the plan and
> a last resort.
>
>> Names.  Is there a good reason not to use message numbers as the names
>> in the git repositories?  (Other than the cost to change the code?) That
>> would remove the need to treat the sqlite msgmap database as precious,
>> and it would make it easier to recover if an nntp server goes away.  In
>> V2 format the git mailing list git repository is only about 2M larger if
>> each message has its msg number as its name.  Plus the git log
>> is easier to read as messages are all + or -.
>
> Big trees in git were a scalability problem in v1 because of the
> long 2/38 names.  With shorter names you propose (base-10 serial
> number?), the scalability problem gets pushed off a bit, I suppose.
> But not indefinitely; and later v2 partitions will suffer more
> from longer names.

Big trees were a scalability problem in git because they are quadratic.
Every commit mentioned every email.  So a walk of the history would
have to visit every file on every commit.  I expect those tree objects
in the history compress well with their parents but it doesn't simplify
the tree walker.
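
To put rough numbers on that, here is a back-of-the-envelope sketch
under a simplified model: one commit per message, and every commit's
tree lists all messages delivered so far.  It ignores the 2/38 fan-out
directories and delta compression, and the single-file case is my
assumption of how the V2 'm'/'d' layout behaves.  Both function names
are made up for illustration.

```python
def entries_walked_growing_tree(num_messages):
    """Tree entries visited by a full history walk when commit k lists k files."""
    # 1 + 2 + ... + n = n * (n + 1) / 2, i.e. quadratic in the message count.
    return sum(range(1, num_messages + 1))


def entries_walked_single_file_tree(num_messages):
    """Same walk when every commit's tree holds exactly one file."""
    return num_messages


print(entries_walked_growing_tree(1000))      # 500500
print(entries_walked_single_file_tree(1000))  # 1000
```

At a million messages the first model visits about 5 * 10^11 entries
while the second stays at 10^6, which is the gap I want to avoid
reopening.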

Would you like my test conversion script from V1 so you can take a look?

> I also want to limit the use and exposure of serial numbers as
> much as possible.  It's unavoidable with the NNTP interface;
> but reliance on serial numbers in public interfaces leads to
> centralization.

I completely agree about public web interfaces.  Message-ID is a much
better key to messages as it was generated by the message sender.

> The current v2 is also better for inode-starved users in case
> somebody forgets to type "--mirror" or "--bare" with clone.  For
> the most part (unless purge is used), the SQLite database is
> actually recoverable.

Because of the parallelism in V2 I have noticed messages numbered
in an order that does not correspond to their commit order.  So the
SQLite database isn't as recoverable as it might be.  Especially as the
parallelism introduces an element of non-determinism.

> So no, I don't think having serial numbers stored in filenames
> is the right thing.

I won't push it, but at the present time I respectfully disagree.

The big advantage I see with serial numbers (other than msgmap) is that
you can include multiple emails per commit (without going quadratic).  I
am also looking at potentially storing the other email states that IMAP
and maildir mailboxes track.  I can imagine that much more easily with
message numbers.  Still I want to avoid something that makes git go
quadratic again.

>> xapian.  Can the Xapian database be made optional in V2?  
>
> Definitely in the TODO :)
>
>> I absolutely
>> think a quick search for terms and other things very valuable, so I
>> would never suggest giving up Xapian.  On the other hand on my personal
>> laptop the xapian database for lkml takes ages and ages to build, and it
>> pushes the system into swap.  Which is all around unpleasant.  That
>> seems to eat into the distributed nature of the goal of public inbox.
>> I have tried to see what could be done that might shrink the size of
>> the xapian database.  The only thing I could think of is perhaps
>> sharding the xapian database by time/msgnum ranges.   That would allow
>> the old xapians databases to be compacted and forgotten about, and I
>> think it would allow less wastage in the current xapian database as it
>> would be smaller, so wasting 50% space (or whatever the btrees waste)
>> would be less of an issue.  And as smaller databases are faster I think
>> that would in general be a help.
>
> One big killer for Xapian is position information required for
> "quoted phrase searches".  I seem to remember deleting the position.*
> files was safe as it would only break phrase searches (but I
> haven't tried it).

I have a very ugly patch that removed all of Xapian, so for day-to-day
NNTP use it is certainly safe.

> So there should be an option to toggle between the "index_text"
> and "index_text_without_positions" routines in Xapian.

I might take a look at that.  I just looked and the position database is
huge.
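
A toy sketch of why positions cost so much and what is lost without
them.  This is plain Python and not Xapian's actual storage format;
build_index and phrase_match are hypothetical names.  With positions
the index keeps one integer per word occurrence; without them it keeps
one count per (term, document) pair.

```python
from collections import defaultdict


def build_index(docs, with_positions):
    """Toy inverted index: term -> {doc_id: [positions]} or {doc_id: count}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            if with_positions:
                index[term].setdefault(doc_id, []).append(pos)
            else:
                index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def phrase_match(index, doc_id, phrase):
    """True if the words of `phrase` are adjacent in doc_id.

    Only works on an index built with with_positions=True.
    """
    terms = phrase.lower().split()
    if not terms:
        return False
    starts = index.get(terms[0], {}).get(doc_id, [])
    return any(
        all(p + i in index.get(t, {}).get(doc_id, [])
            for i, t in enumerate(terms))
        for p in starts
    )


docs = {1: "git tree walk", 2: "walk the git tree"}
positional = build_index(docs, with_positions=True)
print(phrase_match(positional, 1, "git tree"))  # True
print(phrase_match(positional, 1, "tree git"))  # False
```

Plain term lookups work with either index; only the phrase check
needs the position lists.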

> Given the way the indexing only works on the most recent data;
> I think one could also write a script to delete old data/results
> from Xapian without affecting current/future indexing.
> That would pop back up if/when there's schema upgrades requiring
> a rebuild, though...

Good for testing.  Not for long term as it is the actual indexing that
is painful.

> I believe there should be 3 levels of v2 operation:
>
> 1) SQLite-only (NNTP and all the threading stuff works)
> 2) SQLite + Xapian w/o positions (good enough for most things)
> 3) SQLite + Xapian w/ positions (current, default)
>
> 2) seems like a reasonable trade-off for most sites; I'm not
> sure how often phrase searching gets used.

I will take a look at that.  That seems a straightforward place to
start that we can easily agree upon.

>> Time permitting I am willing to do some of this work so that
>> public-inbox works well for me.  I want to see what your vision is for
>> the code before I start anything.
>
> Thanks for running this by, first.  I'm not convinced git layout
> changes are warranted at this point for v2.
>
> Making Xapian optional and configurable to use
> index_text_without_positions is something I definitely want to
> see happen, though.

I will clean up my patches for that then.

Eric