From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on dcvr.yhbt.net
X-Spam-Level:
X-Spam-ASN:
X-Spam-Status: No, score=-4.0 required=3.0 tests=ALL_TRUSTED,BAYES_00
	shortcircuit=no autolearn=ham autolearn_force=no version=3.4.0
Received: from localhost (dcvr.yhbt.net [127.0.0.1])
	by dcvr.yhbt.net (Postfix) with ESMTP id 644701F6C1;
	Sun, 21 Aug 2016 21:14:55 +0000 (UTC)
Date: Sun, 21 Aug 2016 21:14:55 +0000
From: Eric Wong
To: "W. Trevor King"
Cc: notmuch@notmuchmail.org, meta@public-inbox.org
Subject: Re: Mail archives in Git using ssoma
Message-ID: <20160821211455.GA11841@starla>
References: <20141107190321.GL23609@odin.tremily.us>
 <20160821043631.GA2338@odin.tremily.us>
 <20160821183704.GB11495@dcvr>
 <20160821202820.GC30347@odin.tremily.us>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <20160821202820.GC30347@odin.tremily.us>
List-Id:

"W. Trevor King" wrote:
> On Sun, Aug 21, 2016 at 06:37:04PM +0000, Eric Wong wrote:
> > Btw, for public-inbox, I'm using git-fast-import now, so imports are
> > a bit faster and $GIT_DIR/ssoma.index is no longer used.  This was
> > crucial for getting git@vger archives imported in a reasonable time.
>
> ssoma-mda imports 22k notmuch messages in around 15 minutes (with
> profiling enabled), and:

In contrast, git@vger is around 300K messages.  LKML is well
into the millions, and I hope public-inbox (and git!) can handle
that one day, even on cheap hardware (haven't tried).

One problem I noticed with ssoma-mda is that it gets slower as more
messages get imported, since all those files sit in the index,
and the git index format is bad for incremental updates with
big, flat trees.

Big trees are a general problem with git:  I'm now storing blob
IDs directly in Xapian and will be using them more to avoid tree
lookups.  tree creation lookups degrade the same way the index
does as they get bigger.

Currently it's using 2/38 of the SHA-1 like git loose objects;
a goal might be to move towards supporting 2/2/36 (or deeper)
as Jeff noted substantial object traversal improvements:

https://public-inbox.org/git/20160805092805.w3nwv2l6jkbuwlzf@sigill.intra.peff.net/

Of course, support for 2/38 will be retained for old
archives/messages.

>   $ python -m cProfile -o profile import.py notmuch.mbox
>   $ python -c "import pstats; p=pstats.Stats('profile'); p.sort_stats('cumulative').print_stats(10)"
>   Sun Aug 21 12:56:49 2016    profile
>
>            101823722 function calls (99078415 primitive calls) in 885.069 seconds
>
>      Ordered by: cumulative time
>      List reduced from 1145 to 10 due to restriction <10>
>
>      ncalls  tottime  percall  cumtime  percall filename:lineno(function)
>        70/1    0.002    0.000  885.069  885.069 {built-in method exec}
>           1    0.111    0.111  885.069  885.069 /home/wking/src/notmuch/notmuch-archives.git/import.py:9(<module>)
>           1    0.400    0.400  884.915  884.915 /home/wking/src/notmuch/notmuch-archives.git/import.py:17(import_mbox)
>       22875    0.601    0.000  863.371    0.038 /home/wking/src/notmuch/notmuch-archives.git/ssoma_mda.py:362(deliver)
>       22875    8.943    0.000  810.459    0.035 /home/wking/src/notmuch/notmuch-archives.git/ssoma_mda.py:207(append)
>       22875    0.418    0.000  308.353    0.013 /home/wking/.local/lib64/python3.4/site-packages/pygit2/index.py:146(write_tree)
>       22875  307.855    0.013  307.855    0.013 {built-in method git_index_write_tree}
>       22874    0.575    0.000  279.293    0.012 /home/wking/.local/lib64/python3.4/site-packages/pygit2/index.py:238(diff_to_tree)
>       22874  278.501    0.012  278.501    0.012 {built-in method git_diff_tree_to_index}

It looks like writing the index is already the slowest, here,
in terms of total time, too.  It might be interesting if you
profiled each *-mda invocation to see the degradation from
the first to last message.
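
A minimal sketch of that per-invocation measurement, assuming the
delivery step can be called as a plain Python function.  The names
time_deliveries and degradation are hypothetical and not part of
ssoma or import.py; deliver stands in for whatever callable performs
one delivery:

```python
import time
from statistics import mean


def time_deliveries(deliver, messages):
    """Call deliver(msg) once per message; return wall-clock seconds per call."""
    timings = []
    for msg in messages:
        start = time.perf_counter()
        deliver(msg)
        timings.append(time.perf_counter() - start)
    return timings


def degradation(timings, window=100):
    """Return (mean of the first `window` timings, mean of the last `window`)."""
    if not timings:
        raise ValueError("no timings recorded")
    # Clamp the window so short runs still produce a comparison.
    window = max(1, min(window, len(timings)))
    return mean(timings[:window]), mean(timings[-window:])
```

Comparing the two means from degradation() gives a rough idea of how
much slower late deliveries are than early ones; plotting the full
timings list would show whether the growth is gradual or abrupt.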

>       22875    0.088    0.000   80.413    0.004 /home/wking/.local/lib64/python3.4/site-packages/pygit2/index.py:99(read)
>
> 38 ms per ssoma delivery is probably fast enough, especially if you

Not even close for me :)

> are invoking ssoma-mda once per message, since process setup will
> take a similar amount of time:
>
>   $ time python -c 'print("hello")'
>   hello
>
>   real    0m0.016s
>   user    0m0.013s
>   sys     0m0.003s
>
> It's possible that fast-import would shave a few ms off the pygit2
> addition (I'm not sure, and maybe pygit2 is faster than fast-import).
> But I doubt it matters enough either way to be worth changing unless
> you are dealing with a really large corpus.

One key feature is fast-import avoids writing an index entirely.
I think pygit2 would have to learn that, too.
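
To illustrate the index-free approach, here is a rough sketch that
builds a git fast-import stream with one blob and one commit per
message.  It is not the public-inbox implementation.  The function
names, the committer identity, the ref, and the commit message text
are all made up for illustration, and the path is assumed to be the
SHA-1 hex of the Message-ID split 2/38:

```python
import hashlib


def msg_path(message_id):
    """Assumed layout: SHA-1 hex of the Message-ID, split 2/38."""
    digest = hashlib.sha1(message_id.encode("utf-8")).hexdigest()
    return digest[:2] + "/" + digest[2:]


def fast_import_stream(messages, ref="refs/heads/master",
                       committer="Example <example@example.com>", when=0):
    """Build a fast-import stream; messages yields (message_id, raw_bytes)."""
    out = []
    for mark, (message_id, raw) in enumerate(messages, start=1):
        log = ("deliver " + message_id).encode("utf-8")
        # blob command: the raw message, referenced below by its mark.
        out.append(b"blob\nmark :%d\ndata %d\n" % (mark, len(raw)))
        out.append(raw + b"\n")
        # commit command: with no "from" line, fast-import parents the
        # commit on the current tip of the ref.  No index file is written.
        out.append(b"commit " + ref.encode("utf-8") + b"\n")
        out.append(("committer %s %d +0000\n" % (committer, when)).encode("utf-8"))
        out.append(b"data %d\n" % len(log))
        out.append(log + b"\n")
        # filemodify: add the blob at its 2/38 path as a regular file.
        out.append(("M 100644 :%d %s\n\n"
                    % (mark, msg_path(message_id))).encode("utf-8"))
    return b"".join(out)
```

The resulting bytes could then be fed to "git fast-import" on stdin
from inside the target repository, for example with
subprocess.run(["git", "fast-import"], input=stream, cwd=repo).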