From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on dcvr.yhbt.net
X-Spam-Level:
X-Spam-ASN:
X-Spam-Status: No, score=-4.0 required=3.0 tests=ALL_TRUSTED,BAYES_00
	shortcircuit=no autolearn=ham autolearn_force=no version=3.4.0
Received: from localhost (dcvr.yhbt.net [127.0.0.1])
	by dcvr.yhbt.net (Postfix) with ESMTP id 7ABA31F576;
	Thu, 15 Feb 2018 10:55:09 +0000 (UTC)
Date: Thu, 15 Feb 2018 10:55:09 +0000
From: Eric Wong
To: meta@public-inbox.org
Subject: Re: [v2] one file to rule them all?
Message-ID: <20180215105509.GA22409@dcvr>
References: <20180209205140.GA11047@dcvr>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20180209205140.GA11047@dcvr>
List-Id:

Eric Wong wrote:
> Timing "git rev-list --objects --all |wc -l" reveals much bigger
> differences.  Timings are a bit fuzzy since (AFAIK) this is a
> shared system, but it's not really a contest:
>
>   2-38       ~5 minutes
>   2-2-2-34   ~30 seconds
>   2-2-36     ~30 seconds
>   1-file     ~5 seconds
>
> Smaller trees are way faster :)

The LKML 2000-2017 archives (16GB uncompressed mboxes) I have are 6.3G
of objects with 1-file storage in git and took around 33 minutes to do
a full import using a single core and a single git repo (no
deduplication checks).

"git repack -adb" takes about 2 minutes

"git rev-list --objects --all |wc -l" takes around 1 minute
with over 8 million objects

As a baseline, pure Perl parsing of the mboxes (no writing to git) took
around 23 minutes on a single core; so git-fast-import does add some
overhead, but probably not as much as Xapian will add for the initial
import.

The v1 2-38 code slowed to a crawl as more data got into the repo, and
I gave up after it hit 18G and hit snags with badly-formatted dates
(worked around by relying on Date::Parse instead of git's RFC2822
parser).

Side note: using 4 parallel processes for the parse-only tests took
around 10.5 minutes, while 2 processes took around 11-12 minutes.
Then I realized 2 of the 4 processors were HT, so it appears HT
doesn't help much with Perl parsing...

> In other words, git scales infinitely well to deep history
> depths, but not to breadth of large trees[1].

Serving ~6G for clones is still a lot of bandwidth, so partitioning
the git repos to limit the size of each clone seems worth it.

Yearly partitions are probably too frequent, and we'd end up with too
many packs (and, as a result, more open FDs, cache misses, and
metadata stored in Xapian).  I think message count or size might be a
better metric for splitting, as LKML seems to get more traffic year
after year (rough sketch of what that could look like at the end of
this mail).

> Marking spam and handling message removals might be a little
> trickier as chronology will have to be taken into account...
> (will post more on this, later)

Keeping track of everything in Xapian while walking backwards through
git history shouldn't be a big problem, actually.  (Xapian has
read-after-write consistency)

However, trying to reason about partitioning of Xapian DBs across
time/message-count boundaries was making my head hurt, and now I'm
not sure if it's necessary to partition Xapian DBs at all.

While normal text indexing is trivial to partition and parallelize,
associating messages with thread IDs requires "global" knowledge
spanning all partitions (since mail threads span multiple periods).
Unfortunately, this requires much synchronization, and synchronization
hurts parallelism.
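As a rough illustration (hypothetical names, not actual public-inbox
code), the shared state that thread association needs looks something
like the %mid2tid hash below; every partition doing the indexing would
have to see and update it:

  #!/usr/bin/perl
  # hypothetical sketch: assigning thread IDs from Message-ID/References
  use strict;
  use warnings;

  my %mid2tid;     # Message-ID => thread ID ("global" across partitions)
  my $next_tid = 0;

  # $mid is the new message's Message-ID, @refs its References/In-Reply-To
  sub thread_id_for {
      my ($mid, @refs) = @_;
      for my $ref (@refs) {
          if (defined(my $tid = $mid2tid{$ref})) {
              return $mid2tid{$mid} = $tid; # join an existing thread
          }
      }
      $mid2tid{$mid} = $next_tid++; # start a new thread
  }

  # a reply can land in a different partition than its parent,
  # so %mid2tid cannot be private to any one partition:
  my $t1 = thread_id_for('20180209205140.GA11047@dcvr');
  my $t2 = thread_id_for('20180215105509.GA22409@dcvr',
                         '20180209205140.GA11047@dcvr');
  print "same thread\n" if $t1 == $t2;

(And that ignores the nastier case where a late-arriving message ties
two existing threads together, which would mean merging thread IDs
across partitions.)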
Partitioning Xapian DBs is useful to speed up full-indexing and not
much else.  Full-(re)indexing is a rare event, and can be done on a
cold DB while the hot one is taking traffic.

In fact, I would expect lookups on partitioned DBs to be slower, since
there are more files to go through and internal document_ids have to
be mapped to non-conflicting ones.

Also, we don't serve Xapian data for cloning, and cloning is the main
reason to partition git storage in the first place...
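Here is the rough sketch of size-based splitting of the git repos
mentioned above (the lkml/git/N.git layout, current_epoch name, and
~1G limit are all made up for illustration): the importer checks the
current repo's object size and rolls over to a fresh bare repo once it
crosses the limit.

  #!/usr/bin/perl
  # hypothetical sketch of size-based git repo rollover ("epochs")
  use strict;
  use warnings;

  my $limit_kb = 1024 * 1024; # ~1G of objects per epoch (made up)
  my $base = 'lkml';          # repos: lkml/git/0.git, lkml/git/1.git, ...

  # packed + loose object size in KiB, per git-count-objects(1)
  sub repo_size_kb {
      my ($git_dir) = @_;
      my $kb = 0;
      open(my $fh, '-|', 'git', "--git-dir=$git_dir",
           'count-objects', '-v') or die "git: $!";
      while (<$fh>) {
          $kb += $1 if /^(?:size|size-pack): (\d+)/;
      }
      close $fh;
      $kb;
  }

  # pick the repo to import into, creating a new epoch on rollover
  sub current_epoch {
      my $n = 0;
      $n++ while -d "$base/git/" . ($n + 1) . '.git';
      my $git_dir = "$base/git/$n.git";
      if (-d $git_dir && repo_size_kb($git_dir) >= $limit_kb) {
          $git_dir = "$base/git/" . ++$n . '.git';
      }
      system('git', 'init', '--bare', '-q', $git_dir) unless -d $git_dir;
      $git_dir; # point git-fast-import at this for the next batch
  }

  print current_epoch(), "\n";

Clones would then top out around the per-epoch limit instead of the
full ~6G, at the cost of readers having to walk multiple repos.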