From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on dcvr.yhbt.net
X-Spam-Level:
X-Spam-ASN: AS6315 166.70.0.0/16
X-Spam-Status: No, score=-3.7 required=3.0 tests=AWL,BAYES_00,
    RCVD_IN_DNSWL_LOW,SPF_PASS shortcircuit=no autolearn=ham
    autolearn_force=no version=3.4.1
Received: from out02.mta.xmission.com (out02.mta.xmission.com [166.70.13.232])
    (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
    (No client certificate requested)
    by dcvr.yhbt.net (Postfix) with ESMTPS id 963AD1F597;
    Fri, 20 Jul 2018 12:37:19 +0000 (UTC)
Received: from in02.mta.xmission.com ([166.70.13.52])
    by out02.mta.xmission.com with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128)
    (Exim 4.87)
    (envelope-from )
    id 1fgUer-0004yE-Uu; Fri, 20 Jul 2018 06:37:18 -0600
Received: from [97.119.167.31] (helo=x220.xmission.com)
    by in02.mta.xmission.com with esmtpsa (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128)
    (Exim 4.87)
    (envelope-from )
    id 1fgUeq-0006aj-Vr; Fri, 20 Jul 2018 06:37:17 -0600
From: ebiederm@xmission.com (Eric W. Biederman)
To: Eric Wong
Cc: meta@public-inbox.org
References: <87in5bdkbv.fsf@xmission.com>
    <20180719211216.GA1984@dcvr>
    <87601adfo7.fsf@xmission.com>
    <20180720061106.4f2u2zpdxnsilrxt@dcvr>
Date: Fri, 20 Jul 2018 07:37:09 -0500
In-Reply-To: <20180720061106.4f2u2zpdxnsilrxt@dcvr> (Eric Wong's message of
    "Fri, 20 Jul 2018 06:11:07 +0000")
Message-ID: <8736weaxsa.fsf@xmission.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.1 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain
X-XM-SPF: eid=1fgUeq-0006aj-Vr;;;mid=<8736weaxsa.fsf@xmission.com>;;;hst=in02.mta.xmission.com;;;ip=97.119.167.31;;;frm=ebiederm@xmission.com;;;spf=neutral
X-XM-AID: U2FsdGVkX1+e3sxr2LaBdGlZcAvk4JF5MWLwUKPI+c8=
X-SA-Exim-Connect-IP: 97.119.167.31
X-SA-Exim-Mail-From: ebiederm@xmission.com
Subject: Re: Searching via git grep?
X-SA-Exim-Version: 4.2.1 (built Thu, 05 May 2016 13:38:54 -0600)
X-SA-Exim-Scanned: Yes (on in02.mta.xmission.com)
List-Id:

Eric Wong writes:

> "Eric W. Biederman" wrote:
>> My current goal is to make it pleasant to read linux-kernel and possibly
>> other large archives on my personal machine.  Right now the git
>> trees for linux-kernel are about 6.8G.  Small enough to fit in RAM.
>>
>> The Xapian indexes are about 63G.  Not small enough to fit in RAM.
>> They are also not fast to update when I pull in a new batch of messages
>> from linux-kernel.
>
> Interesting, how long does it take to do an incremental index
> medium/full for you?  Setting XAPIAN_FLUSH_THRESHOLD after my
> patch yesterday should help noticeably, especially if you're on
> HDD.

For a small sample, less than a day's worth of lkml messages, I get:

$ git --git-dir git/6.git/ fetch
Enter passphrase for key '/home/eric/.ssh/id_rsa':
Fetching origin
>From https://git.kernel.org/pub/scm/public-inbox/vger.kernel.org/lkml/6
   35280da650..0a97acb7e7  master     -> master
remote: Counting objects: 1791, done.
remote: Compressing objects: 100% (1085/1085), done.
remote: Total 1791 (delta 109), reused 1791 (delta 109)
Receiving objects: 100% (1791/1791), 1.94 MiB | 1.98 MiB/s, done.
Resolving deltas: 100% (109/109), done.
>From git:/public-inbox/vger.kernel.org/linux-kernel/6
   35280da65057..0a97acb7e709  master     -> master

$ time public-inbox-index
real    2m1.482s
user    0m26.084s
sys     0m20.792s

I am not on an HDD.  I will play with XAPIAN_FLUSH_THRESHOLD next time and
see if things get better.

Initially building the Xapian index was extremely painful: the machine
was swapping, and the build took over a day.

Subjectively, searching all of 6.git feels faster than those 2 minutes,
if for no other reason than that I get some of the results back immediately.

>> So I am looking at using git grep as a stand-in for the Xapian indexes
>> when indexlevel eq 'basic'.
>>
>> Given my personal ratio of searches to indexing I think I will save
>> time in doing that.
>> I don't have it all wired up yet to know if it will
>> work well, but I suspect it will.
>
> Totally understandable, and yes, if you can fit the LKML repos
> into RAM it should be usable enough for a single user.
>
> "git grep" also has the advantage of being able to use regexps,
> which isn't possible with Xapian at the moment.

My only concern with "git grep" for v2 is how to get it to exclude
messages that have been deleted.

>> Is it only the web interface where the advanced search functionality is
>> available?
>
> Yes.  I don't think there's a good way to implement search for
> NNTP on the server side...  IMAP has specs for implementing
> search; but I don't know how much overlap there is with what
> our web UI currently offers.

I skimmed the IMAP RFCs earlier, and the search sounds very close to what
Xapian makes available: roughly, terms and quoted terms (aka terms with
positions).

If the IMAP interface is sensible, it might be worth doing the work to
extend NNTP to provide a search interface modeled on it.

Eric