From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on dcvr.yhbt.net
X-Spam-Level:
X-Spam-ASN: AS6315 166.70.0.0/16
X-Spam-Status: No, score=-3.7 required=3.0 tests=AWL,BAYES_00,
 RCVD_IN_DNSWL_LOW,RCVD_IN_MSPIKE_H3,RCVD_IN_MSPIKE_WL,SPF_PASS
 shortcircuit=no autolearn=ham autolearn_force=no version=3.4.1
Received: from out02.mta.xmission.com (out02.mta.xmission.com [166.70.13.232])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (No client certificate requested)
 by dcvr.yhbt.net (Postfix) with ESMTPS id AD1671F85E;
 Fri, 13 Jul 2018 13:39:40 +0000 (UTC)
Received: from in02.mta.xmission.com ([166.70.13.52])
 by out02.mta.xmission.com with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128)
 (Exim 4.87)
 (envelope-from )
 id 1fdyIN-00085q-BB; Fri, 13 Jul 2018 07:39:39 -0600
Received: from [97.119.167.31] (helo=x220.xmission.com)
 by in02.mta.xmission.com with esmtpsa (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128)
 (Exim 4.87)
 (envelope-from )
 id 1fdyIL-0007zd-Rd; Fri, 13 Jul 2018 07:39:39 -0600
From: ebiederm@xmission.com (Eric W.
 Biederman)
To: Eric Wong
Cc: meta@public-inbox.org
References: <87k1q1bky6.fsf@xmission.com>
 <20180712014715.dn5aouayoa3uejp4@dcvr>
 <87k1q07dyc.fsf@xmission.com>
 <20180712230946.mqv3yjw4aabf7xrf@dcvr.yhbt.net>
Date: Fri, 13 Jul 2018 08:39:32 -0500
In-Reply-To: <20180712230946.mqv3yjw4aabf7xrf@dcvr.yhbt.net> (Eric Wong's
 message of "Thu, 12 Jul 2018 23:09:47 +0000")
Message-ID: <878t6f1ch7.fsf@xmission.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.1 (gnu/linux)
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="
X-XM-SPF: eid=1fdyIL-0007zd-Rd;;;mid=<878t6f1ch7.fsf@xmission.com>;;;hst=in02.mta.xmission.com;;;ip=97.119.167.31;;;frm=ebiederm@xmission.com;;;spf=neutral
X-XM-AID: U2FsdGVkX19QcCWcM02mHoJ3LlymXbiEwiFBtglf//A=
X-SA-Exim-Connect-IP: 97.119.167.31
X-SA-Exim-Mail-From: ebiederm@xmission.com
Subject: Re: Q: V2 format
X-SA-Exim-Version: 4.2.1 (built Thu, 05 May 2016 13:38:54 -0600)
X-SA-Exim-Scanned: Yes (on in02.mta.xmission.com)
List-Id:

--=-=-=
Content-Type: text/plain

Eric Wong writes:

> "Eric W. Biederman" wrote:
>> Eric Wong writes:
>> > "Eric W. Biederman" wrote:
>> >> I have been digging through the code looking so I can understand the v2
>> >> format and I have some ideas on how things might be improved, and some
>> >> questions so that I understand.
>> >
>> > Great to know you're interested!  Fwiw, I've still been meaning
>> > to turn my v2 docs into a POD manpage:
>> >
>> > https://public-inbox.org/meta/20180419015813.GA20051@dcvr/
>>
>> I have some personal mail archives that I need to do something better
>> with.  My goal is for day-to-day operations (aka mail delivery and
>> archiving) to be able to run on a smallish 32bit machine.
>
> Great to hear your interest in that!  public-inbox.org is still
> 32-bit on a $20/month VPS.  Xapian really does better with an
> SSD (freshly TRIM-ed), though; so my low-end netbook with HDD
> struggles on big inboxes at the moment.

I am leery of SSDs at the moment.
It was probably bad luck, but my last mail setup using
cyrus (1 message per file) managed to kill an SSD in under a year.

>> But archives are not valuable unless you have a fast search capability
>> which makes all of the features of xapian very interesting.
>
> Agreed.
>
>> I need to compare message IDs to see if I have content missing from the
>> public linux-kernel archive.  It is probably Konrad's cleanup of the
>> headers but my linux-kernel archive when imported into public-inbox is
>> slightly larger than Konrads.
>
> Konrad == Konstantin?

Yes.  Konstantin Ryabitsev.

Konstantin, my apologies; I did not mean to scramble your name.

> I haven't looked at what's in lore, yet,
> but there were numerous header differences from the archives he
> gave me for v2 development vs what I got from my own archives.
>
> Off the top of my head:
>
> * addresses in To:/Cc: lists rewritten for some old list addresses
>
> * some addressee formatting/quoting changes as a result
>
> * last (most recent) Received: header removed (but not actually
>   enough to anonymize the original recipient in most cases).
>   This affects sorting comparisons in search results
>
> * reencoded some MIME parts to different encodings (to 8bit, I think)
>
> Maybe some others.
>
>> I also like the idea of being able to read and archive public lists that
>> I care about with just a git fetch and local tools.
>
> Yes.  I still use "git log -p -B" etc.  That said; I don't want
> to give up too much to support that (the SQLite dependency doesn't
> seem too expensive); and try to keep public-inbox easy-to-install.
> Making Xapian optional will be a huge part of that.

What I meant is that it is very useful not to need to sync
anything other than the git repository between machines.

>> Public mailing lists and their archives are more important, but on my
>> radar is also IMAP/regular email support.  With its little bit of extra
>> state.
>
> Cool.
> I've been thinking about something for personal mail,
> too.  mairix is killing my beefier personal machine (because it
> needs to rewrite the entire index every time) and
> Maildirs+notmuch is a non-starter due to dentry cache overheads
> and inode consumption.
>
>> >> What is the thinking about deleted entries, and for v2 what is the
>> >> preferred way to delete mail from a public inbox git repository and why?
>> >
>> > Definitely prefer the normal way with 'd' files to not break
>> > people using non-force fetches.  "Purge" is too disruptive
>> > and reserved for extraordinary cases (e.g. legal reasons).
>>
>> Then I am going to report a probable bug.  In V2 in public-inbox-index
>> I can not find a path from finding a 'd' file to a call to unindex.  V1
>> unindexes deleted files.  Rebased heads for purges call unindex.  I
>> don't see that for ordinary d files though.
>
> It shouldn't need to call unindex because they never get indexed
> on rebuilds.  V2 indexing walks history backwards (normal "git log"
> behavior) so it remembers 'd' paths in the "$D" hash; and skips blobs
> as it encounters them.
>
> v1 needed to unindex because it used "git log --reverse" to walk
> forward in history.

This assumes that you see them in the same git pull.  I would think
that, ideally, anything that is going to be deleted that quickly could
simply skip archiving.  In what time window do you expect 'd' messages
to appear?

>> >> Size.  Reading the history of the public inbox meta mailing list and
>> >> playing around I discovered that I can shave off about 100M of the V2
>> >> size of the git public inbox git repository by pushing all of the
>> >> messages into a single commit.  Not great for day to day operation,
>> >> but if rebases are part of the plan, and old archives part of the
>> >> challenge I see quite a lot of potential for old archives to be reduced
>> >> to a git repository with a single commit.
>> >
>> > Rebases/rewriting history is definitely not part of the plan and
>> > a last resort.
>> >
>> >> Names.  Is there a good reason not to use message numbers as the names
>> >> in the git repositories?  (Other than the cost to change the code?)  That
>> >> would remove the need to treat the sqlite msgmap database as precious,
>> >> and it would make it easier to recover if an nntp server goes away.  In
>> >> V2 format the git mailing list git repository is only about 2M larger if
>> >> each message has its msg number as its name.  Plus the git log
>> >> is easier to read as messages are all + or -.
>> >
>> > Big trees in git were a scalability problem in v1 because of the
>> > long 2/38 names.  With shorter names you propose (base-10 serial
>> > number?), the scalability problem gets pushed off a bit, I suppose.
>> > But not indefinitely; and later v2 partitions will suffer more
>> > from longer names.
>>
>> Big trees were a scalability problem in git because they are quadratic.
>> Every commit mentioned every email.  So a walk of the history would
>> have to visit every file on every commit.  I expect those tree objects
>> in the history compress well with their parents but it doesn't simplify
>> the tree walker.
>>
>> Would you like my test conversion script from V1 so you can take a look?
>
> Sure, but I can't guarantee I can find the time to spend on it;
> but others might be interested.
>
>> > The current v2 is also better for inode-starved users in case
>> > somebody forgets to type "--mirror" or "--bare" with clone.  For
>> > the most part (unless purge is used), the SQLite database is
>> > actually recoverable.
>>
>> Because of the parallelism in V2 I have noticed messages numbered
>> in an order that does not correspond to their commit order.  So the
>> SQLite database isn't as recoverable as it might be.  Especially as the
>> parallelism introduces an element of non-determinism.
>
> *puzzled* were you able to reproduce that?
> The serial number
> generation + threading happens in the main process and the
> parallelism is limited to Xapian text indexing.  -index
> generates serial numbers by walking backwards with v2, and
> complains on unexpected results.

I will have to look a bit deeper.  It was just something I noticed in
passing as I was rewriting mailboxes with msgnum extracted from
SQLite.  I will see if I can track that one down.

I very much value retaining enough information in the git archive to
reconstruct the serial numbers, so that all that needs to be backed up
is the git archive.  Even purge can insert a dummy entry, so I don't
think there is any time when we would not be able to preserve them with
the current setup.

> As far as personal mail goes, I wouldn't want serial numbers at all
> (more unnecessary state to keep track of).

At least IMAP requires serial numbers, and I imagine the easy transition
for mail clients is to have an IMAP server.  As you have mentioned, an
ordered list of commits is good enough to reconstruct the msgnum
reliably, so it is unlikely we would need to do anything special there.

>> > So no, I don't think having serial numbers stored in filenames
>> > is the right thing.
>>
>> I won't push it but at the present time I respectfully disagree.
>>
>> The big advantage I see with serial numbers (other than msgmap) is that
>> you can include multiple emails per commit (without going quadratic).  I
>> am also looking at potentially storing the other email states that IMAP
>> and maildir mailboxes track.  I can imagine that much more easily with
>> message numbers.  Still I want to avoid something that makes git go
>> quadratic again.
>
> You'd want deeper trees; still.  I'd still use hex, and maybe
> truncate the blob hash to avoid having to keep track of any
> serial number state.  Maybe 2/2/4 naming is enough while using
> git history to resolve collisions.

The key difference is whether you keep the same files from one
commit to another.
To demonstrate this I have attached a quick conversion script I
used to test it.  It uses h{40} names.  Totally flat.
"time git rev-list --objects --all | wc -l" on the git mailing list
archive takes just over 5 seconds.

Compared to your one file name case:

$ du -hs git/git/0.git/ git-long-names/git/0.git/
759M  git/git/0.git/
772M  git-long-names/git/0.git/

So the only difference is that using shorter filenames saves 13M.

The original git tree in V1 format is 1001M so still 30M larger.  And
"time git rev-list --objects --all | wc -l" takes 1m14s, making it
definitely slower.

> Multiple emails per-commit doesn't make sense for public
> archives.

I am not certain.  For a mailing list like linux-kernel, especially when
someone sends a patch series to the list and it arrives all at once, I
imagine there is potential there.  I believe this is visible in the mail
delivery pipeline if you implement LMTP.

> For personal archives, you could probably snap off
> 1-file-per-commit history periodically to make a big tree
> to reduce commit objects.  The cost of losing compatibility,
> rewriting history + repacking, to save 100M there out of 1G(?)
> or so doesn't seem like a great trade-off, though.

It is significant.  Mostly it seems to make sense for importing archives
or really compacting archives for storage.

> I wonder how much can be saved with short author/committer info
> and empty commit messages, even.  I'd rather do that than break
> history and require repacking.

You seem to have saved 13M with one character file names.

> If I wanted to track replied/seen/etc... state in git for
> personal mail, I'd probably use 'r', 's', etc filenames; but I'm
> not sure it'd be in the same or different git repo from the
> public one.
>
> That said; I don't know if I want to store state in git or
> SQLite or something else...

Agreed.  That all bears some careful looking into.
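
As a rough illustration of the "quadratic" point made earlier in this
message (every commit's tree listing every email, versus each commit
keeping only its own file), here is a small counting model.  It is only
a sketch under simplifying assumptions: it counts the tree entries a
naive history walk would touch, and it ignores git's packing, delta
compression and reuse of unchanged subtrees.  The function names are
invented for this illustration and are not part of public-inbox or git.

```python
def entries_visited_accumulating(n_messages: int) -> int:
    """History where commit i's tree lists all i messages added so far.

    A naive walk touches 1 + 2 + ... + n entries, which grows
    quadratically with the number of messages.
    """
    return sum(range(1, n_messages + 1))


def entries_visited_one_per_commit(n_messages: int) -> int:
    """History where each commit's tree holds only its own message.

    A naive walk touches one entry per commit, which grows linearly.
    """
    return n_messages


if __name__ == "__main__":
    # Compare the two models for a few archive sizes.
    for n in (1_000, 100_000):
        print(n,
              entries_visited_accumulating(n),
              entries_visited_one_per_commit(n))
```

The model says nothing about the measured timings above; it only shows
why the two layouts can scale so differently as an archive grows.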
> Looking forward to making Xapian and position data optional :>

--=-=-=
Content-Type: text/plain
Content-Disposition: inline; filename=public-inbox-convert-long-names

#!/usr/bin/perl -w
# Copyright (C) 2018 all contributors
# License: AGPL-3.0+
use strict;
use warnings;
use Getopt::Long qw(:config gnu_getopt no_ignore_case auto_abbrev);
use PublicInbox::MIME;
use PublicInbox::InboxWritable;
use PublicInbox::Config;
use PublicInbox::V2Writable;
use PublicInbox::Import;
use PublicInbox::Spawn qw(spawn);
use Cwd 'abs_path';
use File::Copy 'cp'; # preserves permissions:
my $usage = "Usage: public-inbox-convert OLD NEW\n";
my $jobs;
my $index = 1;
my %opts = (
    '--jobs|j=i' => \$jobs,
    '--index!' => \$index,
);
GetOptions(%opts) or die "bad command-line args\n$usage";
my $old_dir = shift or die $usage;
my $new_dir = shift or die $usage;
die "$new_dir exists\n" if -d $new_dir;
die "$old_dir not a directory\n" unless -d $old_dir;
my $config = eval { PublicInbox::Config->new };
$old_dir = abs_path($old_dir);
my $old;
if ($config) {
    $config->each_inbox(sub {
        $old = $_[0] if abs_path($_[0]->{mainrepo}) eq $old_dir;
    });
}
unless ($old) {
    warn "W: $old_dir not configured in " .
        PublicInbox::Config::default_file() .
        "\n";
    $old = {
        mainrepo => $old_dir,
        name => 'ignored',
        address => [ 'old@example.com' ],
    };
    $old = PublicInbox::Inbox->new($old);
}
$old = PublicInbox::InboxWritable->new($old);
if (($old->{version} || 1) >= 2) {
    die "Only conversion from v1 inboxes is supported\n";
}
my $new = { %$old };
$new->{mainrepo} = abs_path($new_dir);
$new->{version} = 2;
$new = PublicInbox::InboxWritable->new($new);
my $v2w;
$old->umask_prepare;

# hard-link when possible, fall back to a copy (e.g. across filesystems)
sub link_or_copy ($$) {
    my ($src, $dst) = @_;
    link($src, $dst) and return;
    $!{EXDEV} or warn "link $src, $dst failed: $!, trying cp\n";
    cp($src, $dst) or die "cp $src, $dst failed: $!\n";
}

$old->with_umask(sub {
    my $old_cfg = "$old->{mainrepo}/config";
    local $ENV{GIT_CONFIG} = $old_cfg;
    my $new_cfg = "$new->{mainrepo}/all.git/config";
    $v2w = PublicInbox::V2Writable->new($new, 1);
    $v2w->init_inbox($jobs);
    unlink $new_cfg;
    link_or_copy($old_cfg, $new_cfg);
    if (my $alt = $new->{altid}) {
        require PublicInbox::AltId;
        foreach my $i (0..$#$alt) {
            my $src = PublicInbox::AltId->new($old, $alt->[$i], 0);
            $src->mm_alt or next;
            my $dst = PublicInbox::AltId->new($new, $alt->[$i], 1);
            $dst = $dst->{filename};
            $src->mm_alt->{dbh}->sqlite_backup_to_file($dst);
        }
    }
    my $desc = "$old->{mainrepo}/description";
    link_or_copy($desc, "$new->{mainrepo}/description") if -e $desc;
    my $clone = "$old->{mainrepo}/cloneurl";
    if (-e $clone) {
        warn <<"";
$clone may not be valid after migrating to v2, not copying

    }
});
my $state = '';
my ($prev, $from);
my $head = $old->{ref_head} || 'HEAD';
my ($rd, $pid) = $old->git->popen(qw(fast-export --use-done-feature), $head);
$v2w->idx_init;
my $im = $v2w->importer;
my ($r, $w) = $im->gfi_start;
my $h = '[0-9a-f]';
my %D; # flat 40-hex path => fast-export mark of the blob stored there
my $purged = 0;
while (<$rd>) {
    if ($_ eq "blob\n") {
        $state = 'blob';
    } elsif (/^commit /) {
        $state = 'commit';
        $purged = 0;
    } elsif (/^data (\d+)/) {
        # pass the data payload through unchanged
        my $len = $1;
        $w->print($_) or $im->wfail;
        while ($len) {
            my $n = read($rd, my $tmp, $len) or die "read: $!";
            warn "$n != $len\n" if $n != $len;
            $len -= $n;
            $w->print($tmp) or $im->wfail;
        }
        next;
    } elsif ($state eq 'commit') {
        # emit one "deleteall" before the first file change of each
        # commit, so each new commit holds only its own file
        if (m/^([MDcRN] | deleteall)/) {
            if (!$purged) {
                $purged = 1;
                $w->print("deleteall\n") or $im->wfail;
            }
        }
        # v1 2/38 path becomes a flat 40-hex name
        if (m{^M 100644 :(\d+) (${h}{2})/(${h}{38})}o) {
            my ($mark, $path) = ($1, $2 . $3);
            $D{$path} = $mark;
            $w->print("M 100644 :$mark $path\n") or $im->wfail;
            next;
        }
        # a v1 deletion becomes a 'd' file pointing at the same blob
        if (m{^D (${h}{2})/(${h}{38})}o) {
            my $path = $1 . $2;
            my $mark = delete $D{$path};
            defined $mark or die "undeleted path: $path\n";
            $w->print("M 100644 :$mark d\n") or $im->wfail;
            next;
        }
        if (m{^from (:\d+)}) {
            $prev = $from;
            $from = $1;
            # no next
        }
    }
    last if $_ eq "done\n";
    $w->print($_) or $im->wfail;
}
$w = $r = undef;
close $rd or die "close fast-export: $!\n";
waitpid($pid, 0) or die "waitpid failed: $!\n";
$? == 0 or die "fast-export failed: $?\n";
my $mm = $old->mm;
$mm->{dbh}->sqlite_backup_to_file("$new_dir/msgmap.sqlite3") if $mm;
$v2w->done;
if ($index) {
    $v2w->index_sync;
    $v2w->done;
}

--=-=-=--