From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on dcvr.yhbt.net
X-Spam-Level:
X-Spam-ASN: AS6315 166.70.0.0/16
X-Spam-Status: No, score=-3.7 required=3.0 tests=AWL,BAYES_00,
    RCVD_IN_DNSWL_LOW,SPF_PASS shortcircuit=no autolearn=ham
    autolearn_force=no version=3.4.1
Received: from out02.mta.xmission.com (out02.mta.xmission.com [166.70.13.232])
    (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
    (No client certificate requested)
    by dcvr.yhbt.net (Postfix) with ESMTPS id 518A71F597;
    Thu, 19 Jul 2018 22:28:02 +0000 (UTC)
Received: from in01.mta.xmission.com ([166.70.13.51])
    by out02.mta.xmission.com with esmtps
    (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.87)
    (envelope-from ) id 1fgHOy-0006gg-W1;
    Thu, 19 Jul 2018 16:28:01 -0600
Received: from [97.119.167.31] (helo=x220.xmission.com)
    by in01.mta.xmission.com with esmtpsa
    (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.87)
    (envelope-from ) id 1fgHOy-0003r4-DO;
    Thu, 19 Jul 2018 16:28:00 -0600
From: ebiederm@xmission.com (Eric W. Biederman)
To: Eric Wong
Cc: meta@public-inbox.org
References: <87in5bdkbv.fsf@xmission.com> <20180719211216.GA1984@dcvr>
Date: Thu, 19 Jul 2018 17:27:52 -0500
In-Reply-To: <20180719211216.GA1984@dcvr> (Eric Wong's message of
    "Thu, 19 Jul 2018 21:12:16 +0000")
Message-ID: <87601adfo7.fsf@xmission.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.1 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain
X-XM-SPF: eid=1fgHOy-0003r4-DO;;;mid=<87601adfo7.fsf@xmission.com>;;;hst=in01.mta.xmission.com;;;ip=97.119.167.31;;;frm=ebiederm@xmission.com;;;spf=neutral
X-XM-AID: U2FsdGVkX1/p4/vaZuoTKr/U9zQ5JXEMrsXV7GoMnX4=
X-SA-Exim-Connect-IP: 97.119.167.31
X-SA-Exim-Mail-From: ebiederm@xmission.com
Subject: Re: Searching via git grep?
X-SA-Exim-Version: 4.2.1 (built Thu, 05 May 2016 13:38:54 -0600)
X-SA-Exim-Scanned: Yes (on in01.mta.xmission.com)
List-Id:

Eric Wong writes:

> "Eric W. Biederman" wrote:
>> Have you considered searching public inboxes via git grep?
>
> Not yet...
>
>> For a big server lore.kernel.org with a lot of searches and a lot of
>> clients it might not make sense. But for home use where searches are
>> rare and the indexes can not be kept in ram, but the mailbox might fit
>> git grep sounds attractive?
>>
>> I performed a preliminary test and just running git grep manually and
>> I was search all of the git mailling list archive pretty much
>> immediately.
>>
>> For v1 it is just 'git grep HEAD'
>> For v2 it is 'git --rev-list --all | xargs git grep '
>>
>> If this sounds reasonable to you I will take a look at what it takes to
>> wire that up over the next while.
>
> Having something like this on a potentially public-facing web UI
> seems like a liability support-wise(*).
>
> However, I'd be open to having this as a command-line tool.
> Maybe in the scripts/ directory for one-off scripts...
> If I were building a personal mail tool, I could use
> scripts/dupe-finder as a starting point.

My current goal is to make it pleasant to read linux-kernel and possibly
other large archives on my personal machine. Right now the git trees
for linux-kernel are about 6.8G. Small enough to fit in RAM. The
Xapian indexes are about 63G. Not small enough to fit in RAM. They are
also not fast to update when I pull in a new batch of messages from
linux-kernel.

So I am looking at using git grep as a stand-in for the Xapian indexes
when indexlevel eq 'basic'. Given my personal ratio of searches to
indexing I think I will save time by doing that. I don't have it all
wired up yet to know if it will work well, but I suspect it will.

Is the advanced search functionality available only through the web
interface?

> (*) I would also caution against having personal mail accessible
> over http://localhost/ on any port without a password; as
> there's attacks on browsers which could hit them.

Good point. While I may get there, that is not my primary focus. I
have, for example, emails archived from public mailing lists where,
because I was the author, the mail machine stripped the domain from the
From header. This was all about 20 years ago.

Eric
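
For reference, the two git grep invocations quoted in the message could be wrapped roughly as follows. This is a minimal sketch under stated assumptions, not code from public-inbox: the names `chunks`, `grep_argv`, and `search`, the batch size of 1000, and the `-F -i -l` options are all hypothetical choices made for illustration. The quoted commands show no search pattern, so the sketch takes one as a parameter, and it uses `git rev-list --all` because `rev-list` is a git subcommand rather than an option.

```python
import subprocess
from typing import Iterator, List


def chunks(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def grep_argv(git_dir: str, pattern: str, revs: List[str]) -> List[str]:
    """Build a `git grep` command that searches the trees of `revs`."""
    # -F: literal string, -i: ignore case, -l: print only matching names.
    # -e keeps a pattern starting with "-" from being read as an option.
    return ["git", f"--git-dir={git_dir}", "grep", "-F", "-i", "-l",
            "-e", pattern, *revs]


def search(git_dir: str, pattern: str, all_commits: bool = False) -> List[str]:
    """Return `rev:path` names of blobs that contain `pattern`."""
    revs = ["HEAD"]  # the v1 case from the message: grep the tree at HEAD
    if all_commits:  # the v2 case: grep every commit reachable from any ref
        listing = subprocess.run(
            ["git", f"--git-dir={git_dir}", "rev-list", "--all"],
            check=True, capture_output=True, text=True)
        revs = listing.stdout.split()
    hits: List[str] = []
    # Batch the revisions, as xargs would, to stay under argv size limits.
    for batch in chunks(revs, 1000):
        proc = subprocess.run(grep_argv(git_dir, pattern, batch),
                              capture_output=True, text=True)
        if proc.returncode not in (0, 1):  # git grep exits 1 on "no match"
            raise RuntimeError(proc.stderr.strip())
        hits.extend(proc.stdout.splitlines())
    return hits
```

With a hypothetical repository path, `search("/path/to/inbox.git", "some phrase")` would cover the v1 layout, and passing `all_commits=True` would cover the v2 layout. How this performs on an archive the size of linux-kernel is not something the sketch establishes.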