From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <e@80x24.org>
X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on dcvr.yhbt.net
X-Spam-Level: 
X-Spam-ASN:  
X-Spam-Status: No, score=-4.0 required=3.0 tests=ALL_TRUSTED,BAYES_00
	shortcircuit=no autolearn=ham autolearn_force=no version=3.4.1
Received: from localhost (dcvr.yhbt.net [127.0.0.1])
	by dcvr.yhbt.net (Postfix) with ESMTP id 2FAE51F85E;
	Thu, 12 Jul 2018 23:09:47 +0000 (UTC)
Date: Thu, 12 Jul 2018 23:09:47 +0000
From: Eric Wong <e@80x24.org>
To: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: meta@public-inbox.org
Subject: Re: Q: V2 format
Message-ID: <20180712230946.mqv3yjw4aabf7xrf@dcvr.yhbt.net>
References: <87k1q1bky6.fsf@xmission.com>
 <20180712014715.dn5aouayoa3uejp4@dcvr>
 <87k1q07dyc.fsf@xmission.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <87k1q07dyc.fsf@xmission.com>
List-Id: <meta.public-inbox.org>

"Eric W. Biederman" <ebiederm@xmission.com> wrote:
> Eric Wong <e@80x24.org> writes:
> > "Eric W. Biederman" <ebiederm@xmission.com> wrote:
> >> I have been digging through the code looking so I can understand the v2
> >> format and I have some ideas on how things might be improved, and some
> >> questions so that I understand.
> >
> > Great to know you're interested!  Fwiw, I've still been meaning
> > to turn my v2 docs into a POD manpage:
> >
> >   https://public-inbox.org/meta/20180419015813.GA20051@dcvr/
> 
> I have some personal mail archives that I need to do something better
> with.  My goal is for day-to-day operations (aka mail delivery and
> archiving) to be able to run on a smallish 32bit machine.

Great to hear your interest in that!  public-inbox.org is still
32-bit on a $20/month VPS.  Xapian really does better with an
SSD (freshly TRIM-ed), though; so my low-end netbook with HDD
struggles on big inboxes at the moment.

> But archives are not valuable unless you have a fast search capability
> which makes all of the features of xapian very interesting.

Agreed.

> I need to compare message id's to see if I have content missing from the
> public linux-kernel archive.   It is probably Konrad's cleanup of the
> headers but my linux-kernel archive when imported into public-inbox is
> slightly larger than Konrads.

Konrad == Konstantin?  I haven't looked at what's in lore, yet,
but there were numerous header differences from the archives he
gave me for v2 development vs what I got from my own archives.

Off the top of my head:

* addresses in To:/Cc: lists rewritten for some old list addresses

* some addressee formatting/quoting changes as a result

* last (most recent) Received: header removed (but not actually
  enough to anonymize the original recipient in most cases).
  This affects sorting comparisons in search results

* reencoded some MIME parts to different encodings (to 8bit, I think)

Maybe some others.

> I also like the idea of being able to read and archive public lists that
> I care about with just a git fetch and local tools.

Yes.  I still use "git log -p -B" etc.  That said, I don't want
to give up too much to support that (the SQLite dependency doesn't
seem too expensive), and I try to keep public-inbox easy to install.
Making Xapian optional will be a huge part of that.

> Public mailing lists and their archives are more important, but on my
> radar is also IMAP/regular email support.  With its little bit of extra
> state.

Cool.  I've been thinking about something for personal mail,
too.  mairix is killing my beefier personal machine (because it
needs to rewrite the entire index every time) and
Maildirs+notmuch is a non-starter due to dentry cache overheads
and inode consumption.

> >> What is the thinking about deleted entries, and for v2 what is the
> >> preferred way to delete mail from a public inbox git repository and why?
> >
> > Definitely prefer the normal way with 'd' files to not break
> > people using non-force fetches.  "Purge" is too disruptive
> > and reserved for extraordinary cases (e.g. legal reasons).
> 
> Then I am going to report a probable bug.  In V2 in public-inbox-index
> I can not find a path from finding a 'd' file and a call to unindex.  V1
> unindexes deleted files.  Rebased heads for purges call unindex.  I
> don't see that for ordinary d files though.

It shouldn't need to call unindex because deleted messages never
get indexed on rebuilds.  V2 indexing walks history backwards
(normal "git log" behavior), so it remembers 'd' paths in the "$D"
hash and skips those blobs as it encounters them.

v1 needed to unindex because it used "git log --reverse" to walk
forward in history.
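
A rough sketch of the difference, in Python rather than the real
code.  It assumes a simplified model where each commit touches one
file, named 'm' (message added) or 'd' (message deleted), and where
a 'd' file carries the same blob as the message it deletes.  The
function names and the (filename, blob) tuples are made up for
illustration and are not public-inbox APIs:

```python
from typing import List, Set, Tuple

# Hypothetical model: one (filename, blob_id) pair per commit.
Commit = Tuple[str, str]


def index_backward(history_oldest_first: List[Commit]) -> List[str]:
    """v2-style walk: newest commit first, no unindex step needed."""
    deleted: Set[str] = set()   # plays the role of the "$D" hash
    indexed: List[str] = []
    for name, blob in reversed(history_oldest_first):
        if name == "d":
            # The deletion is seen before the message it removes.
            deleted.add(blob)
        elif name == "m":
            if blob in deleted:
                continue        # already known to be deleted: skip
            indexed.append(blob)
    return indexed


def index_forward(history_oldest_first: List[Commit]) -> List[str]:
    """v1-style walk ("git log --reverse"): must unindex on delete."""
    indexed: List[str] = []
    for name, blob in history_oldest_first:
        if name == "m":
            indexed.append(blob)
        elif name == "d" and blob in indexed:
            indexed.remove(blob)  # the explicit unindex step
    return indexed


history = [("m", "a1"), ("m", "b2"), ("d", "a1"), ("m", "c3")]
print(index_backward(history))
print(index_forward(history))
```

Both walks end up with the same set of indexed blobs; only the
backward walk gets there without ever indexing the deleted message.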

> >> Size.  Reading the history of the public inbox meta mailing list and
> >> playing around I discovered that I can shave off about 100M of the V2
> >> size of the git public inbox git repository by pushing all of the
> >> messages into a single commit.  Not great for day to day operation,
> >> but if rebases are part of the plan, and old archives part of the
> >> challenge I see quite a lot of potential for old archives to be reduced
> >> to a git repository with a single commit.
> >
> > Rebases/rewriting history is definitely not part of the plan and
> > a last resort.
> >
> >> Names.  Is there a good reason not to use message numbers as the names
> >> in the git repositories?  (Other than the cost to change the code?) That
> >> would remove the need to treat the sqlite msgmap database as precious,
> >> and it would make it easier to recover if an nntp server goes away.  In
> >> V2 format the git mailing list git repository is only about 2M larger if
> >> each message has its msg number as its name.  Plus the git log
> >> is easier to read as messages are all + or -.
> >
> > Big trees in git were a scalability problem in v1 because of the
> > long 2/38 names.  With shorter names you propose (base-10 serial
> > number?), the scalability problem gets pushed off a bit, I suppose.
> > But not indefinitely; and later v2 partitions will suffer more
> > from longer names.
> 
> Big trees were a scalability problem in git because they are quadratic.
> Every commit mentioned every email.  So a walk of the history would
> have to visit every file on every commit.  I expect those tree objects
> in the history compress well with their parents but it doesn't simplify
> the tree walker.
> 
> Would you like my test conversion script from V1 so you can take a look?

Sure, but I can't guarantee I can find the time to spend on it;
but others might be interested.

> > The current v2 is also better for inode-starved users in case
> > somebody forgets to type "--mirror" or "--bare" with clone.  For
> > the most part (unless purge is used), the SQLite database is
> > actually recoverable.
> 
> Because of the parallelism in V2 I have noticed messages numbered
> in an order that does not correspond to their commit order.  So the
> SQLite database isn't as recoverable as it might be.  Especially as the
> parallelism introduces an element of non-determinism.

*puzzled* were you able to reproduce that?  The serial number
generation + threading happens in the main process and the
parallelism is limited to Xapian text indexing.  -index
generates serial numbers by walking backwards with v2, and
complains on unexpected results.

As far as personal mail goes, I wouldn't want serial numbers at all
(more unnecessary state to keep track of).

> > So no, I don't think having serial numbers stored in filenames
> > is the right thing.
> 
> I won't push it, but at the present time I respectfully disagree.
> 
> The big advantage I see with serial numbers (other than msgmap) is that
> you can include multiple emails per commit (without going quadratic).  I
> am also looking at potentially storing the other email states that IMAP
> and maildir mailboxes track.  I can imagine that much more easily with
> message numbers.  Still I want to avoid something that makes git go
> quadratic again.

You'd still want deeper trees.  I'd still use hex, and maybe
truncate the blob hash to avoid having to keep track of any
serial number state.  Maybe 2/2/4 naming is enough while using
git history to resolve collisions.
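
To make the naming idea concrete, here is a small Python sketch.
It assumes "2/2/4" means the first eight hex digits of the blob
hash split into directory levels of 2, 2 and 4 characters; that
reading, the function name and the parameters are illustrative
assumptions.  Resolving collisions via git history is not shown:

```python
from typing import Sequence

HEX_DIGITS = set("0123456789abcdef")


def short_path(blob_hex: str, widths: Sequence[int] = (2, 2, 4)) -> str:
    """Build a tree path from a truncated hex blob hash.

    Hypothetical helper: splits the leading sum(widths) hex digits
    into path components of the given widths.
    """
    blob_hex = blob_hex.lower()
    needed = sum(widths)
    if len(blob_hex) < needed or not set(blob_hex) <= HEX_DIGITS:
        raise ValueError("expected at least %d hex digits" % needed)
    parts = []
    pos = 0
    for width in widths:
        parts.append(blob_hex[pos:pos + width])
        pos += width
    return "/".join(parts)


print(short_path("2fae51f85e0123456789abcdef0123456789abcd"))
```

Eight hex digits is only 32 bits, so collisions become likely well
before an archive reaches millions of messages, which is why some
collision handling would be needed.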

Multiple emails per-commit doesn't make sense for public
archives.  For personal archives, you could probably snap off
1-file-per-commit history periodically to make a big tree
to reduce commit objects.  The cost of losing compatibility,
rewriting history + repacking, to save 100M there out of 1G(?)
or so doesn't seem like a great trade-off, though.

I wonder how much can be saved with short author/committer info
and empty commit messages, even.  I'd rather do that than break
history and require repacking.

If I wanted to track replied/seen/etc... state in git for
personal mail, I'd probably use 'r', 's', etc filenames; but I'm
not sure it'd be in the same or different git repo from the
public one.

That said, I don't know if I want to store state in git or
SQLite or something else...

Looking forward to making Xapian and position data optional :>