user/dev discussion of public-inbox itself
From: Eric Wong <e@80x24.org>
To: "W. Trevor King" <wking@tremily.us>
Cc: notmuch@notmuchmail.org, meta@public-inbox.org
Subject: Re: Mail archives in Git using ssoma
Date: Sun, 21 Aug 2016 21:14:55 +0000
Message-ID: <20160821211455.GA11841@starla>
In-Reply-To: <20160821202820.GC30347@odin.tremily.us>

"W. Trevor King" <wking@tremily.us> wrote:
> On Sun, Aug 21, 2016 at 06:37:04PM +0000, Eric Wong wrote:
> > Btw, for public-inbox, I'm using git-fast-import now, so imports are
> > a bit faster and $GIT_DIR/ssoma.index is no longer used.  This was
> > crucial for getting git@vger archives imported in a reasonable time.
> 
> ssoma-mda imports 22k notmuch messages in around 15 minutes (with
> profiling enabled), and:

In contrast, git@vger is around 300K messages.  LKML is well
into the millions, and I hope public-inbox (and git!) can handle
that one day, even on cheap hardware (haven't tried).

One problem I noticed with ssoma-mda is that it gets slower as
more messages get imported, since all those files sit in the
index, and the git index format is bad for incremental updates
with big, flat trees.  Big trees are a general problem with git:

    I'm now storing blob IDs directly in Xapian and will be
    using them more to avoid tree lookups.  Tree creation and
    lookups degrade the same way the index does as trees
    get bigger.

    Currently paths use a 2/38 split of the SHA-1, like git
    loose objects; a goal might be to move towards supporting
    2/2/36 (or deeper), as Jeff noted substantial object
    traversal improvements:

https://public-inbox.org/git/20160805092805.w3nwv2l6jkbuwlzf@sigill.intra.peff.net/

    Of course, support for 2/38 will be retained for old
    archives/messages.
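
    A minimal sketch of the two layouts, for illustration only.
    The helper name shard_path and the choice of hashing the
    Message-ID are assumptions made here, not code taken from
    public-inbox or ssoma:

```python
import hashlib


def shard_path(hex_id, widths=(2,)):
    """Split a hex digest into directory levels plus a leaf name.

    widths=(2,) gives the 2/38 layout; widths=(2, 2) gives 2/2/36.
    """
    parts = []
    pos = 0
    for width in widths:
        parts.append(hex_id[pos:pos + width])
        pos += width
    parts.append(hex_id[pos:])  # whatever remains becomes the leaf
    return "/".join(parts)


# Assumption: the path key is the SHA-1 of the Message-ID string.
mid = "20160821211455.GA11841@starla"
digest = hashlib.sha1(mid.encode("utf-8")).hexdigest()
print(shard_path(digest))          # 2/38 layout
print(shard_path(digest, (2, 2)))  # 2/2/36 layout
```

    With 2/38, each of the 256 top-level trees keeps growing as
    messages arrive; 2/2/36 spreads the same entries over 65536
    second-level trees, so each tree that must be rewritten on
    import stays smaller.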

>   $ python -m cProfile -o profile import.py notmuch.mbox
>   $ python -c "import pstats; p=pstats.Stats('profile'); p.sort_stats('cumulative').print_stats(10)"
>   Sun Aug 21 12:56:49 2016    profile
> 
>            101823722 function calls (99078415 primitive calls) in 885.069 seconds
> 
>      Ordered by: cumulative time
>      List reduced from 1145 to 10 due to restriction <10>
> 
>      ncalls  tottime  percall  cumtime  percall filename:lineno(function)
>        70/1    0.002    0.000  885.069  885.069 {built-in method exec}
>           1    0.111    0.111  885.069  885.069 /home/wking/src/notmuch/notmuch-archives.git/import.py:9(<module>)
>           1    0.400    0.400  884.915  884.915 /home/wking/src/notmuch/notmuch-archives.git/import.py:17(import_mbox)
>       22875    0.601    0.000  863.371    0.038 /home/wking/src/notmuch/notmuch-archives.git/ssoma_mda.py:362(deliver)
>       22875    8.943    0.000  810.459    0.035 /home/wking/src/notmuch/notmuch-archives.git/ssoma_mda.py:207(append)
>       22875    0.418    0.000  308.353    0.013 /home/wking/.local/lib64/python3.4/site-packages/pygit2/index.py:146(write_tree)
>       22875  307.855    0.013  307.855    0.013 {built-in method git_index_write_tree}
>       22874    0.575    0.000  279.293    0.012 /home/wking/.local/lib64/python3.4/site-packages/pygit2/index.py:238(diff_to_tree)
>       22874  278.501    0.012  278.501    0.012 {built-in method git_diff_tree_to_index}

It looks like writing out the index (write_tree) is already the
slowest step here in terms of total time, too.  It might be
interesting to profile each *-mda invocation to see the
degradation from the first message to the last.
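
One rough way to measure that, sketched under assumptions: the
`deliver` argument below stands in for whatever callable delivers one
message (hypothetical here), and this times in-process calls rather
than separate *-mda processes.

```python
import time


def time_deliveries(deliver, messages):
    """Return wall-clock seconds spent in each deliver(msg) call."""
    durations = []
    for msg in messages:
        start = time.perf_counter()
        deliver(msg)
        durations.append(time.perf_counter() - start)
    return durations


def mean(values):
    return sum(values) / len(values)


def degradation(durations, window=100):
    """Ratio of mean time over the last `window` calls to the first."""
    window = min(window, len(durations))
    return mean(durations[-window:]) / mean(durations[:window])
```

A ratio well above 1 would suggest per-message cost grows with the
size of the archive rather than staying flat at the overall average.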

>       22875    0.088    0.000   80.413    0.004 /home/wking/.local/lib64/python3.4/site-packages/pygit2/index.py:99(read)
> 
> 38 ms per ssoma delivery is probably fast enough, especially if you

Not even close for me :)

> are invoking ssoma-mda once per message, since process setup will
> take a similar amount of time:
> 
>   $ time python -c 'print("hello")'
>   hello
> 
>   real    0m0.016s
>   user    0m0.013s
>   sys     0m0.003s
> 
> It's possible that fast-import would shave a few ms off the pygit2
> addition (I'm not sure, and maybe pygit2 is faster than fast-import).
> But I doubt it matters enough either way to be worth changing unless
> you are dealing with a really large corpus.

One key feature is that fast-import avoids writing an index entirely.
I think pygit2 would have to learn that, too.
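
To illustrate what "no index" means in practice, here is a minimal
sketch that builds a git fast-import stream with one blob and one
commit per message.  It is not public-inbox's actual importer: the
function name, the committer identity, and the 2/38 path derived from
the SHA-1 of the Message-ID are assumptions for the example.

```python
import hashlib


def fast_import_stream(messages, ref="refs/heads/master",
                       ident="Archiver <archiver@example.com>"):
    """Build a fast-import stream from (message_id, raw_bytes, unix_time).

    Each message becomes a marked blob plus a commit that adds it to
    the tree; no index file or working tree is involved.
    """
    out = []
    mark = 0
    for message_id, raw, when in messages:
        mark += 1
        # Blob: the raw message, referenced later by its mark.
        out.append(b"blob\nmark :%d\ndata %d\n" % (mark, len(raw)))
        out.append(raw + b"\n")
        # Assumed layout: 2/38 split of SHA-1(Message-ID).
        digest = hashlib.sha1(message_id.encode("utf-8")).hexdigest()
        path = ("%s/%s" % (digest[:2], digest[2:])).encode("ascii")
        log = ("add %s\n" % message_id).encode("utf-8")
        stamp = ("%s %d +0000" % (ident, when)).encode("utf-8")
        # Commit: with no "from" line, a commit on a ref already seen
        # in this stream uses that ref's current tip as its parent.
        out.append(b"commit " + ref.encode("ascii") + b"\n")
        out.append(b"author " + stamp + b"\n")
        out.append(b"committer " + stamp + b"\n")
        out.append(b"data %d\n" % len(log) + log)
        out.append(b"M 100644 :%d %s\n\n" % (mark, path))
    return b"".join(out)
```

The resulting bytes could be fed to `git fast-import` on stdin inside
a bare repository.  A single long-lived process then handles every
message, which also sidesteps the per-message process setup cost.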


Thread overview: 5+ messages
     [not found] <20141107190321.GL23609@odin.tremily.us>
     [not found] ` <20160821043631.GA2338@odin.tremily.us>
     [not found]   ` <20160821094833.GB2338@odin.tremily.us>
2016-08-21 12:08     ` Mail archives in Git using ssoma (Docker image) Eric Wong
2016-08-21 17:36       ` W. Trevor King
2016-08-21 18:28         ` Eric Wong
     [not found]   ` <20160821183704.GB11495@dcvr>
2016-08-21 20:28     ` Mail archives in Git using ssoma W. Trevor King
2016-08-21 21:14       ` Eric Wong [this message]
