user/dev discussion of public-inbox itself
From: ebiederm@xmission.com (Eric W. Biederman)
To: Eric Wong <e@80x24.org>
Cc: meta@public-inbox.org
Subject: Re: Searching via git grep?
Date: Fri, 20 Jul 2018 07:37:09 -0500
Message-ID: <8736weaxsa.fsf@xmission.com>
In-Reply-To: <20180720061106.4f2u2zpdxnsilrxt@dcvr> (Eric Wong's message of "Fri, 20 Jul 2018 06:11:07 +0000")

Eric Wong <e@80x24.org> writes:

> "Eric W. Biederman" <ebiederm@xmission.com> wrote:
>> My current goal is to make it pleasant to read linux-kernel and possibly
>> other large archives on my personal machine.  Right now the git
>> trees for linux-kernel are about 6.8G.  Small enough to fit in RAM.
>> 
>> The Xapian indexes are about 63G.  Not small enough to fit in RAM.
>> They are also not fast to update when I pull in a new batch of messages
>> from linux-kernel.
>
> Interesting, how long does it take to do an incremental index
> medium/full for you?  Setting XAPIAN_FLUSH_THRESHOLD after my
> patch yesterday should help noticeably, especially if you're on
> HDD.

For a small sample, less than a day's worth of lkml messages,
I get:

$ git --git-dir git/6.git/ fetch
Enter passphrase for key '/home/eric/.ssh/id_rsa': 
Fetching origin
From https://git.kernel.org/pub/scm/public-inbox/vger.kernel.org/lkml/6
   35280da650..0a97acb7e7  master     -> master
remote: Counting objects: 1791, done.
remote: Compressing objects: 100% (1085/1085), done.
remote: Total 1791 (delta 109), reused 1791 (delta 109)
Receiving objects: 100% (1791/1791), 1.94 MiB | 1.98 MiB/s, done.
Resolving deltas: 100% (109/109), done.
From git:/public-inbox/vger.kernel.org/linux-kernel/6
   35280da65057..0a97acb7e709  master     -> master

$ time public-inbox-index 
real    2m1.482s
user    0m26.084s
sys     0m20.792s

I am not on an HDD.  I will play with XAPIAN_FLUSH_THRESHOLD next time
and see if things get better.  Initially building the Xapian index was
extremely painful: the machine swapped and the build took over a day.
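
A minimal sketch of that experiment, assuming XAPIAN_FLUSH_THRESHOLD
counts document changes between automatic flushes, so that a larger value
trades RAM for fewer writes.  The helper name run_with_threshold and the
value 100000 are made up for illustration; nothing here has been measured.

```shell
#!/bin/sh
# Hypothetical helper: run one command with XAPIAN_FLUSH_THRESHOLD set,
# then report wall-clock seconds and exit status on stderr.
run_with_threshold() {
    threshold=$1
    shift
    start=$(date +%s)
    status=0
    # env(1) sets the variable for the child process only.
    env XAPIAN_FLUSH_THRESHOLD="$threshold" "$@" || status=$?
    end=$(date +%s)
    echo "threshold=$threshold elapsed=$((end - start))s status=$status" >&2
    return "$status"
}

# Example (100000 is an arbitrary value to try, not a recommendation):
#   run_with_threshold 100000 public-inbox-index
```

Running the same incremental index at two or three thresholds and
comparing the elapsed times would show whether the setting matters on a
given disk.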

Subjectively, searching all of 6.git feels faster than those 2 minutes,
if for no other reason than that I get some of the results back immediately.

>> So I am looking at using git grep as a stand-in for the Xapian indexes
>> when indexlevel eq 'basic'.
>> 
>> Given my personal ratio of searches to indexing I think I will save
>> time in doing that.  I don't have it all wired up yet to know if it will
>> work well, but I suspect it will.
>
> Totally understandable, and yes, if you can fit the LKML repos
> into RAM it should be usable enough for a single user.
>
> "git grep" also has the advantage of being able to use regexps,
> which isn't possible with Xapian at the moment.

My only concern with "git grep" for v2 is how to get it to exclude
messages that have been deleted.
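
One possible approach, as a rough sketch only.  It assumes the v2 layout
in which every commit's tree holds a single file: "m" for a message, or
"d" for a deletion whose blob is the deleted message.  The function names
deleted_blobs and grep_live are hypothetical, and spawning git once per
commit is far too slow for an archive the size of lkml; this only shows
the filtering idea.

```shell
#!/bin/sh
# ASSUMPTION: one file per commit, "m" (message) or "d" (deleted message).

# Print the blob ids of every message marked deleted in the history of $1.
deleted_blobs() {
    git rev-list "$1" | while read -r c; do
        git rev-parse -q --verify "$c:d" 2>/dev/null || true
    done
}

# grep_live PATTERN [REV]: print "commit:m:line" for matches in messages
# whose blob is not also recorded as a "d" file somewhere in the history.
grep_live() {
    pattern=$1
    rev=${2:-HEAD}
    dead=$(deleted_blobs "$rev")
    git rev-list "$rev" | while read -r c; do
        # Skip commits that carry no "m" file (i.e. deletion commits).
        blob=$(git rev-parse -q --verify "$c:m" 2>/dev/null) || continue
        # Skip messages whose exact blob was later deleted.
        if printf '%s\n' "$dead" | grep -qxF "$blob"; then
            continue
        fi
        git grep -e "$pattern" "$c" -- m || true
    done
}
```

Run from inside one epoch (for example with GIT_DIR pointing at
git/6.git), "grep_live 'some regexp'" would then list hits only for
messages that have not been deleted.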

>> Is it only the web interface where the advanced search functionality is
>> available?
>
> Yes.  I don't think there's a good way to implement search for
> NNTP on the server side...  IMAP has specs for implementing
> search; but I don't know how much overlap there is with what
> our web UI currently offers.

I skimmed the IMAP RFCs earlier and the search sounds very close to what
Xapian makes available.  Roughly: terms and quoted terms (aka terms with
positions).

If the IMAP interface is sensible it might be worth doing the work to
extend NNTP to provide a search interface modeled on it.

Eric

Thread overview: 6+ messages
2018-07-19 20:47 Searching via git grep? Eric W. Biederman
2018-07-19 21:12 ` Eric Wong
2018-07-19 22:27   ` Eric W. Biederman
2018-07-20  6:11     ` Eric Wong
2018-07-20 12:37       ` Eric W. Biederman [this message]
2018-07-20 23:56         ` Eric W. Biederman
