From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net
X-Spam-Level:
X-Spam-ASN: AS6315 166.70.0.0/16
X-Spam-Status: No, score=-3.7 required=3.0 tests=BAYES_00,RCVD_IN_DNSWL_LOW,
	SPF_HELO_NONE,SPF_PASS shortcircuit=no autolearn=ham
	autolearn_force=no version=3.4.2
Received: from out03.mta.xmission.com (out03.mta.xmission.com [166.70.13.233])
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
	(No client certificate requested)
	by dcvr.yhbt.net (Postfix) with ESMTPS id 809D81F464;
	Thu, 26 Sep 2019 00:24:07 +0000 (UTC)
Received: from in02.mta.xmission.com ([166.70.13.52])
	by out03.mta.xmission.com with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128)
	(Exim 4.87)
	(envelope-from )
	id 1iDHZl-0007Wz-Ny; Wed, 25 Sep 2019 18:24:05 -0600
Received: from ip68-227-160-95.om.om.cox.net ([68.227.160.95] helo=x220.xmission.com)
	by in02.mta.xmission.com with esmtpsa (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
	(Exim 4.87)
	(envelope-from )
	id 1iDHZk-00045F-RW; Wed, 25 Sep 2019 18:24:05 -0600
From: ebiederm@xmission.com (Eric W. Biederman)
To: Eric Wong
Cc: Konstantin Ryabitsev , meta@public-inbox.org
References: <20190925182431.GA4628@chatter.i7.local>
	<20190925194503.GA21501@dcvr>
	<20190925195838.GB4628@chatter.i7.local>
	<20190925224500.GA28628@dcvr>
Date: Wed, 25 Sep 2019 19:23:35 -0500
In-Reply-To: <20190925224500.GA28628@dcvr> (Eric Wong's message of "Wed, 25
	Sep 2019 22:45:00 +0000")
Message-ID: <87zhir3ny0.fsf@x220.int.ebiederm.org>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.1 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain
X-XM-SPF: eid=1iDHZk-00045F-RW;;;mid=<87zhir3ny0.fsf@x220.int.ebiederm.org>;;;hst=in02.mta.xmission.com;;;ip=68.227.160.95;;;frm=ebiederm@xmission.com;;;spf=neutral
X-XM-AID: U2FsdGVkX184TUg6xwgzgm0+zqXn5Ou0ZlP6k3NHPPc=
X-SA-Exim-Connect-IP: 68.227.160.95
X-SA-Exim-Mail-From: ebiederm@xmission.com
Subject: Re: Git-only operation mode
X-SA-Exim-Version: 4.2.1 (built Thu, 05 May 2016 13:38:54 -0600)
X-SA-Exim-Scanned: Yes (on in02.mta.xmission.com)
List-Id:

Eric Wong writes:

> Konstantin Ryabitsev wrote:
>> On Wed, Sep 25, 2019 at 07:45:03PM +0000, Eric Wong wrote:
>> > > Is there a way to run just the archiver component of public-inbox --
>> > > just
>> > > writing to git repos without any of the indexing/frontend bits? One of the
>> > > idle conversations I had with vger.kernel.org folks was to see if we can
>> > > shift the source of truth archive generation to happen at their end. We
>> > > would then clone repositories from them and provide the frontend/search bits
>> > > on lore.kernel.org. From my cursory looking, it would seem that the
>> > > watch/delivery tools always expect to be taking care of xapian/indexing, but
>> > > I think being able to decouple git bits from search/frontend bits would be a
>> > > useful mode or operation.
>> >
>> > v1 was git-only (that led to scalability problems from big trees).
>> > v2 needs SQLite to do dedupe with indexlevel=basic, but not Xapian,
>> > anymore.
>> > We could get rid of dedupe for v2, but I'm not sure it's
>> > worth it...
>>
>> Needing sqlite is not a big deal -- compared to the size of the repos,
>> that's reasonably small (e.g. all of lkml git trees are 8.2GB, while
>> msgmap.sqlite3 is 600MB).
>
> Right, it'll also need xap15/over.sqlite* but that's still not too
> big.

For linux-kernel my copy looks to be about 2.4G, while the git repos
run 9.1G.

>> Is there an easy way to exclude xapian indexes from being generated during
>> watch/mda runs then?
>
>   public-inbox-init --indexlevel=basic
>
> Or setting publicinbox.$INBOX_NAME.indexlevel=basic in the
> config file after-the-fact. You should also be able to remove
> any non-SQLite files from xap15 after-the-fact, if you already
> generated them, too (but I haven't tested that).
>
> I started working on a public-inbox-init manpage the other day,
> still need to finish that...
>
>> A follow-up to that -- is running "public-inbox-index" on the repository
>> after it's been updated enough to update the xapian db? It would be easy to
>> do so as part of the grok-pull post-update hook.
>
> Yes, on a fresh clone. You'll need to change indexlevel to
> medium or full if it was setup using basic.
>
> I haven't figured out how to use a grok-pull post-update hook to
> run index on my clone of erol, since there's multiple epochs
> per-inbox to deal with.

I have a perl script I use, which boils down to:

	git remote update
	public-inbox-index

That is enough to get things up to date. The tricky bit is when you
have an archive like linux-kernel that uses multiple git repos. Given
that article numbers are stable, except in the case of bugs, it should
be completely possible to do this.

The nasty case is when someone rebases the git history. I have been
meaning to report this after tracking it down. To the best of my
knowledge, public-inbox-index throws out all of the history that was
rebased, which can be expensive. For me it meant I had to drop from
indexlevel=full to indexlevel=basic on linux-kernel, because my laptop
could not handle the reindexing of all of those messages.

Given that the message numbers remain stable in an event like that, it
should be possible to optimize and only reindex things if the blob in
git for a particular message number has changed. Maybe we already try,
and even that is too expensive. I haven't re-read that code since I
noticed the problem.

Eric
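
The update step described above could be sketched for a multi-epoch
inbox roughly as follows. This is not the perl script mentioned in the
message, only an illustration under some assumptions: the inbox is a v2
clone whose epochs live in "$INBOX_DIR/git/<N>.git", each epoch already
has its remote configured, and the function name update_inbox and the
GIT/INDEX override variables are made up for this example.

```shell
#!/bin/sh
# Sketch: refresh every epoch repository of a v2 inbox, then index once.
# Assumptions (not from the message): epochs live in
# "$INBOX_DIR/git/<N>.git" and each one has a remote configured.
# GIT and INDEX are hypothetical knobs so the commands can be swapped
# out for a dry run.
GIT=${GIT:-git}
INDEX=${INDEX:-public-inbox-index}

update_inbox() {
	inbox_dir=$1
	for epoch in "$inbox_dir"/git/*.git; do
		# Skip the unexpanded pattern when no epochs exist.
		[ -d "$epoch" ] || continue
		"$GIT" --git-dir="$epoch" remote update || return 1
	done
	# Index once, after all epochs are up to date.
	"$INDEX" "$inbox_dir"
}
```

It would be called as "update_inbox /path/to/inbox". Fetching all
epochs before indexing keeps it to a single public-inbox-index run per
update; it does nothing about the rebase case, which would still
trigger the expensive reindex described above.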