user/dev discussion of public-inbox itself
 help / color / Atom feed
From: Eric Wong <e@80x24.org>
To: meta@public-inbox.org
Subject: Re: [v2] one file to rule them all?
Date: Thu, 15 Feb 2018 10:55:09 +0000
Message-ID: <20180215105509.GA22409@dcvr> (raw)
In-Reply-To: <20180209205140.GA11047@dcvr>

Eric Wong <e@80x24.org> wrote:
> Timing "git rev-list --objects --all |wc -l" reveals much bigger
> differences.  Timings are a bit fuzzy since (AFAIK) this is a
> shared system, but it's not really a contest:
> 
> 2-38         ~5 minutes
> 2-2-2-34     ~30 seconds
> 2-2-36       ~30 seconds
> 1-file       ~5 seconds
> 
> Smaller trees are way faster :)

The LKML 2000-2017 archives (16GB uncompressed mboxes) I have
are 6.3G of objects with 1-file storage in git and took around
33 minutes to do a full import utilizing a single core and
single git repo (no deduplication checks).

"git repack -adb" takes about 2 minutes

"git rev-list --objects --all |wc -l"
takes around 1 minute with over 8 million objects

As a baseline, pure Perl parsing of the mboxes (no writing to
git) was around 23 minutes on a single core; so git-fast-import
does add some overhead but probably not as much as Xapian will
add for the initial import.

The v1 2-38 code slowed to a crawl as more data got into the
repo and I gave up after it hit 18G and hit snags with
badly-formatted dates (worked around by relying on Date::Parse
instead of git's RFC2822 parser).

Side note:  Using 4 parallel processes for the parse-only tests
took around 10.5 minutes; while 2 processes took around 11-12
minutes.  Then I realized 2 of the 4 processors were HT,
so it appears HT doesn't help much with Perl parsing...

> In other words, git scales infinitely well to deep history
> depths, but not to breadth of large trees[1].

Serving a ~6G for clones is still a lot of bandwidth; so
partitioning the git repos to limit the size of each clone
seems worth it.

Yearly partitions is probably too frequent and we'd end up with
too many packs (and resulting more open-FDs, cache-misses,
metadata stored in Xapian).  I think partitioning based on
message-count/sizes might be a better metric for splitting as
LKML seems to get more traffic year-after-year.

> Marking spam and handling message removals might be a little
> trickier as chronology will have to be taken into account...
> (will post more on this, later)

Keeping track of everything in Xapian while walking backwards
through git history shouldn't be a big problem, actually.
(Xapian has read-after-write consistency)

However, trying to reason about partitioning of Xapian DBs
across time/message-count boundaries was making my head hurt and
now I'm not sure if it's necessary to partition Xapian DBs.

While normal text indexing is trivial to partition and
parallelize, associating messages with thread IDs requires
"global" knowledge spanning all partitions (since mail threads
span multiple periods).  Unfortunately, this requires much
synchronization and synchronization hurts parallelism.

Partitioning Xapian DBs is useful to speed up full-indexing and
not much else.  Full-(re)indexing is a rare event, and can be
done on a cold DB while the hot one is taking traffic.  In fact,
I would expect lookups on partitioned DBs to be slower since it
has more files to go through and has to map things like
internal document_ids to non-conflicting ones.

Also, we don't serve Xapian data to be cloned; which is the main
reason to do partitioning of git storage...

  reply index

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-02-09 20:51 Eric Wong
2018-02-15 10:55 ` Eric Wong [this message]
2018-02-15 11:08   ` [WIP 0/17] initial v2 work based on one-file tree Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08     ` [WIP 01/17] AUTHORS: add The Linux Foundation Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08     ` [WIP 02/17] watch_maildir: allow '-' in mail filename Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08     ` [WIP 03/17] scripts/import_vger_from_mbox: relax From_ line match slightly Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08     ` [WIP 04/17] import: stop writing legacy ssoma.index by default Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08     ` [WIP 05/17] import: begin supporting this without ssoma.lock Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08     ` [WIP 06/17] import: initial handling for v2 Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08     ` [WIP 07/17] t/import: test for last_object_id insertion Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08     ` [WIP 08/17] content_id: add test case Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08     ` [WIP 09/17] searchmsg: add mid_mime import for _extract_mid Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08     ` [WIP 10/17] scripts/import_vger_from_mbox: support --dry-run option Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08     ` [WIP 11/17] import: APIs to support v2 use Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08     ` [WIP 12/17] search: free up 'Q' prefix for a real unique identifier Eric Wong (Contractor, The Linux Foundation)
2018-02-22 21:08       ` Eric Wong
2018-02-15 11:08     ` [WIP 13/17] searchidx: fix comment around next_thread_id Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08     ` [WIP 14/17] address: extract more characters from email addresses Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08     ` [WIP 15/17] import: pass "raw" dates to git-fast-import(1) Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08     ` [WIP 16/17] scripts/import_vger_from_mbox: use v2 layout for import Eric Wong (Contractor, The Linux Foundation)
2018-02-15 11:08     ` [WIP 17/17] import: quiet down warnings from bogus From: lines Eric Wong (Contractor, The Linux Foundation)

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://public-inbox.org/README

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180215105509.GA22409@dcvr \
    --to=e@80x24.org \
    --cc=meta@public-inbox.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

user/dev discussion of public-inbox itself

Archives are clonable:
	git clone --mirror https://public-inbox.org/meta
	git clone --mirror http://czquwvybam4bgbro.onion/meta
	git clone --mirror http://hjrcffqmbrq6wope.onion/meta
	git clone --mirror http://ou63pmih66umazou.onion/meta

Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.mail.public-inbox.meta
	nntp://ou63pmih66umazou.onion/inbox.comp.mail.public-inbox.meta
	nntp://czquwvybam4bgbro.onion/inbox.comp.mail.public-inbox.meta
	nntp://hjrcffqmbrq6wope.onion/inbox.comp.mail.public-inbox.meta
	nntp://news.gmane.org/gmane.mail.public-inbox.general

 note: .onion URLs require Tor: https://www.torproject.org/

AGPL code for this site: git clone https://public-inbox.org/ public-inbox