From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on dcvr.yhbt.net
X-Spam-Level:
X-Spam-ASN:
X-Spam-Status: No, score=-4.0 required=3.0 tests=ALL_TRUSTED,BAYES_00
	shortcircuit=no autolearn=ham autolearn_force=no version=3.4.0
Received: from localhost (dcvr.yhbt.net [127.0.0.1])
	by dcvr.yhbt.net (Postfix) with ESMTP id 4DF581F404
	for ; Thu, 22 Feb 2018 21:42:24 +0000 (UTC)
From: "Eric Wong (Contractor, The Linux Foundation)"
To: meta@public-inbox.org
Subject: [WIP PATCH 0/12] v2: git repo rotation + parallel Xapian indexing
Date: Thu, 22 Feb 2018 21:42:10 +0000
Message-Id: <20180222214222.1086-1-e@80x24.org>
List-Id:

The key thing is sharding git and sharding of Xapian are not tied
together:

  git repos are sharded to reduce clone/repack costs; so we
  shard them based on size (currently 1G or so).

  Xapian DBs are sharded to take advantage of SMP during the
  indexing phase.

Current import times are as follows:

  git-only:                  ~1 minute
  git+SQLite:                ~12 minutes
  git+Xapian+SQLite serial:  ~45 minutes
  git+Xapian+SQLite 4 parts: ~15 minutes (2 + 2 hyperthread)

More cores will help since the Xapian text+term indexing is the
slowest and the only partitioned work.  I also tested just the
December 2017 archives on an 8-core AMD FX-8320.  I forget the
specifics, but I seem to recall half the cores on that chip are
not full power:

  4 parts: 58s
  8 parts: 45s

Note: I use eatmydata (LD_PRELOAD to disable sync/fsync) for
development and I consider it perfectly safe to use for offline
updates/reindexing.
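
To illustrate the point above that the two kinds of sharding are
independent, here is a minimal Python sketch.  It is not the code
from this series.  The function names, the byte threshold constant
and the "article number modulo parts" rule are all assumptions made
up for illustration; the only things taken from the text are that
git repos rotate on size (about 1G) and that Xapian work is split
into a fixed number of parts for SMP.

```python
# Hypothetical sketch: size-based git rotation and Xapian partitioning
# are computed from different inputs, so neither constrains the other.

GIT_SHARD_MAX_BYTES = 1 << 30  # assumed rotation threshold ("1G or so")
XAPIAN_PARTS = 4               # assumed number of index partitions


def git_shards_for(sizes, max_bytes=GIT_SHARD_MAX_BYTES):
    """Map messages (sizes in bytes, in arrival order) to git shards.

    A new shard is started when adding the next message would push a
    non-empty shard past max_bytes.
    """
    shard, used, out = 0, 0, []
    for size in sizes:
        if used and used + size > max_bytes:
            shard += 1
            used = 0
        used += size
        out.append(shard)
    return out


def xapian_part_for(article_num, parts=XAPIAN_PARTS):
    """Pick an index partition from the article number (assumed rule)."""
    return article_num % parts


if __name__ == "__main__":
    sizes = [400, 400, 400, 400]  # tiny sizes so rotation is visible
    shards = git_shards_for(sizes, max_bytes=1000)
    for num, shard in enumerate(shards, start=1):
        print(f"article {num}: git shard {shard}, "
              f"xapian part {xapian_part_for(num)}")
```

The git shard depends only on accumulated size, while the Xapian
part depends only on the article number and the number of parts, so
changing the number of parts to match the available cores does not
require touching the git layout.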