user/dev discussion of public-inbox itself
* Searching via git grep?
@ 2018-07-19 20:47 ebiederm
  2018-07-19 21:12 ` Eric Wong
  0 siblings, 1 reply; 6+ messages in thread
From: ebiederm @ 2018-07-19 20:47 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta


Have you considered searching public inboxes via git grep?

For a big server like lore.kernel.org with a lot of searches and a lot of
clients it might not make sense.  But for home use, where searches are
rare and the indexes cannot be kept in RAM but the mailbox might fit,
git grep sounds attractive.

I performed a preliminary test, just running git grep manually, and
I was able to search all of the git mailing list archive pretty much
immediately.

For v1 it is just 'git grep <regexp> HEAD'
For v2 it is 'git rev-list --all | xargs git grep <regexp>'
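
A minimal sketch of the two invocations, run against a throwaway
repository rather than a real archive.  The one-file-per-commit layout
(a file named "m") and the helper add_msg are assumptions made for the
demo; only "git grep" and "git rev-list --all" come from the message.

```shell
#!/bin/sh
# Sketch: compare grepping HEAD only with grepping every revision.
set -e
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# add_msg is a hypothetical helper: each commit replaces the single
# file "m", mimicking a v2-style history with one message per commit.
add_msg() {
	printf 'Subject: %s\n\n%s\n' "$1" "$2" > m
	git add m
	git -c user.name=demo -c user.email=demo@example.com commit -q -m "$1"
}
add_msg 'first'  'hello namespaces'
add_msg 'second' 'nothing to see'
add_msg 'third'  'more about namespaces'

# v1 style: search only the tree of HEAD.
head_hits=$(git grep -l 'namespaces' HEAD | wc -l | tr -d ' ')

# v2 style: feed every commit to git grep so older messages are seen.
v2_hits=$(git rev-list --all | xargs git grep -l 'namespaces' | wc -l | tr -d ' ')

echo "matches in HEAD only:     $head_hits"
echo "matches in all revisions: $v2_hits"
```

In this layout HEAD holds only the newest message, so the first search
misses the older hit and the second one finds both.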

If this sounds reasonable to you I will take a look at what it takes to
wire that up over the next while.

Eric


* Re: Searching via git grep?
  2018-07-19 20:47 Searching via git grep? ebiederm
@ 2018-07-19 21:12 ` Eric Wong
  2018-07-19 22:27   ` ebiederm
  0 siblings, 1 reply; 6+ messages in thread
From: Eric Wong @ 2018-07-19 21:12 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: meta

"Eric W. Biederman" <ebiederm@xmission.com> wrote:
> Have you considered searching public inboxes via git grep?

Not yet...

> For a big server lore.kernel.org with a lot of searches and a lot of
> clients it might not make sense.  But for home use where searches are
> rare and the indexes can not be kept in ram, but the mailbox might fit
> git grep sounds attractive?
> 
> I performed a preliminary test and just running git grep manually and
> I was search all of the git mailling list archive pretty much
> immediately.
> 
> For v1 it is just 'git grep <regexp> HEAD'
> For v2 it is 'git --rev-list --all | xargs git grep <regexp>'
> 
> If this sounds reasonable to you I will take a look at what it takes to
> wire that up over the next while.

Having something like this on a potentially public-facing web UI
seems like a liability support-wise(*).

However, I'd be open to having this as a command-line tool.
Maybe in the scripts/ directory for one-off scripts...
If I were building a personal mail tool, I could use
scripts/dupe-finder as a starting point.


(*) I would also caution against having personal mail accessible
over http://localhost/ on any port without a password, as
there are attacks on browsers which could hit them.


* Re: Searching via git grep?
  2018-07-19 21:12 ` Eric Wong
@ 2018-07-19 22:27   ` ebiederm
  2018-07-20  6:11     ` Eric Wong
  0 siblings, 1 reply; 6+ messages in thread
From: ebiederm @ 2018-07-19 22:27 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

Eric Wong <e@80x24.org> writes:

> "Eric W. Biederman" <ebiederm@xmission.com> wrote:
>> Have you considered searching public inboxes via git grep?
>
> Not yet...
>
>> For a big server lore.kernel.org with a lot of searches and a lot of
>> clients it might not make sense.  But for home use where searches are
>> rare and the indexes can not be kept in ram, but the mailbox might fit
>> git grep sounds attractive?
>> 
>> I performed a preliminary test and just running git grep manually and
>> I was search all of the git mailling list archive pretty much
>> immediately.
>> 
>> For v1 it is just 'git grep <regexp> HEAD'
>> For v2 it is 'git --rev-list --all | xargs git grep <regexp>'
>> 
>> If this sounds reasonable to you I will take a look at what it takes to
>> wire that up over the next while.
>
> Having something like this on a potentially public-facing web UI
> seems like a liability support-wise(*).
>
> However, I'd be open to having this as a command-line tool.
> Maybe in the scripts/ directory for one-off scripts...
> If I were building a personal mail tool, I could use
> scripts/dupe-finder as a starting point.

My current goal is to make it pleasant to read linux-kernel and possibly
other large archives on my personal machine.  Right now the git
trees for linux-kernel are about 6.8G.  Small enough to fit in RAM.

The Xapian indexes are about 63G.  Not small enough to fit in RAM.
They are also not fast to update when I pull in a new batch of messages
from linux-kernel.

So I am looking at using git grep as a stand-in for the Xapian indexes
when indexlevel eq 'basic'.
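
A rough illustration of that switch, assuming the
publicinbox.<name>.indexlevel key from public-inbox-config(5) and a
placeholder inbox named "lkml".  The fallback branch is only the idea
under discussion here, not something public-inbox is known to do.

```shell
#!/bin/sh
# Sketch: choose a search backend from the configured index level.
set -e
# A scratch file stands in for ~/.public-inbox/config in this demo.
cfg=$(mktemp)
git config --file "$cfg" publicinbox.lkml.indexlevel basic

level=$(git config --file "$cfg" --get publicinbox.lkml.indexlevel)
if [ "$level" = basic ]; then
	echo "lkml: no Xapian index, fall back to git grep"
else
	echo "lkml: use the Xapian index"
fi
```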

Given my personal ratio of searches to indexing I think I will save
time in doing that.  I don't have it all wired up yet to know if it will
work well, but I suspect it will.

Is it only the web interface where the advanced search functionality is
available?

> (*) I would also caution against having personal mail accessible
> over http://localhost/ on any port without a password; as
> there's attacks on browsers which could hit them.

Good point.  While I may get there that is not my primary focus.

For example, I have emails archived from public mailing lists where,
because I was the author, the mail machine stripped the domain from
the From header.  This was all about 20 years ago.

Eric




* Re: Searching via git grep?
  2018-07-19 22:27   ` ebiederm
@ 2018-07-20  6:11     ` Eric Wong
  2018-07-20 12:37       ` ebiederm
  0 siblings, 1 reply; 6+ messages in thread
From: Eric Wong @ 2018-07-20  6:11 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: meta

"Eric W. Biederman" <ebiederm@xmission.com> wrote:
> My current goal is to make it pleasant to read linux-kernel and possibly
> other large archives on my personal machine.  Right now the git
> trees for linux-kernel are aboug 6.8G.  Small enough to fit in RAM.
> 
> The Xapian indexes are about 63G.  Not small enough to fit in ram.
> They are also not fast to update when I pull in a new batch of messages
> from linux-kernel.

Interesting, how long does it take to do an incremental index
medium/full for you?  Setting XAPIAN_FLUSH_THRESHOLD after my
patch yesterday should help noticeably, especially if you're on
HDD.

> So I am looking at using git grep as a stand-in for the Xapian indexes
> when indexlevel eq 'basic'.
> 
> Given my personal ratio of searches to indexing I think I will save
> time in doing that.  I don't have it all wired up yet to know if it will
> work well, but I suspect it will.

Totally understandable, and yes, if you can fit the LKML repos
into RAM it should be usable enough for a single user.

"git grep" also has the advantage of being able to use regexps,
which isn't possible with Xapian at the moment.
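
To make the regexp point concrete, here is a small sketch using
"git grep -E" on a few made-up message files.  The file names and
subjects are invented for the demo, and --no-index is used only so the
example runs without creating a repository.

```shell
#!/bin/sh
# Sketch: an anchored, case-insensitive extended regexp over mail files.
set -e
dir=$(mktemp -d)
cd "$dir"
printf 'Subject: [PATCH] pidns: fix leak\n\nbody\n' > msg1
printf 'Subject: Re: userns question\n\nbody\n' > msg2
printf 'Subject: unrelated\n\nmentions pidns in the body only\n' > msg3

# -E: extended regexp, -i: ignore case, -l: print matching file names.
# The pattern matches only Subject lines naming pidns or userns.
matches=$(git grep --no-index -E -i -l '^subject:.*(pid|user)ns')
count=$(printf '%s\n' "$matches" | wc -l | tr -d ' ')

echo "$matches"
echo "files with a matching Subject: $count"
```

The third file mentions pidns only in its body, so the anchored
pattern skips it.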

> Is it only the web interface where the advanced search functionality is
> available?

Yes.  I don't think there's a good way to implement search for
NNTP on the server side...  IMAP has specs for implementing
search; but I don't know how much overlap there is with what
our web UI currently offers.


* Re: Searching via git grep?
  2018-07-20  6:11     ` Eric Wong
@ 2018-07-20 12:37       ` ebiederm
  2018-07-20 23:56         ` ebiederm
  0 siblings, 1 reply; 6+ messages in thread
From: ebiederm @ 2018-07-20 12:37 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

Eric Wong <e@80x24.org> writes:

> "Eric W. Biederman" <ebiederm@xmission.com> wrote:
>> My current goal is to make it pleasant to read linux-kernel and possibly
>> other large archives on my personal machine.  Right now the git
>> trees for linux-kernel are aboug 6.8G.  Small enough to fit in RAM.
>> 
>> The Xapian indexes are about 63G.  Not small enough to fit in ram.
>> They are also not fast to update when I pull in a new batch of messages
>> from linux-kernel.
>
> Interesting, how long does it take to do an incremental index
> medium/full for you?  Setting XAPIAN_FLUSH_THRESHOLD after my
> patch yesterday should help noticeably, especially if you're on
> HDD.

For a small sample, less than a day's worth of lkml messages,
I get:

$ git --git-dir git/6.git/ fetch
Enter passphrase for key '/home/eric/.ssh/id_rsa': 
Fetching origin
From https://git.kernel.org/pub/scm/public-inbox/vger.kernel.org/lkml/6
   35280da650..0a97acb7e7  master     -> master
remote: Counting objects: 1791, done.
remote: Compressing objects: 100% (1085/1085), done.
remote: Total 1791 (delta 109), reused 1791 (delta 109)
Receiving objects: 100% (1791/1791), 1.94 MiB | 1.98 MiB/s, done.
Resolving deltas: 100% (109/109), done.
From git:/public-inbox/vger.kernel.org/linux-kernel/6
   35280da65057..0a97acb7e709  master     -> master

$ time public-inbox-index 
real    2m1.482s
user    0m26.084s
sys     0m20.792s

I am not on an HDD.  I will play with XAPIAN_FLUSH_THRESHOLD next time
and see if things get better.  Building the Xapian index initially was
extremely painful: it swapped and took over a day.

Subjectively, searching all of 6.git feels faster than those 2 minutes,
if for no other reason than that I get some of the results back immediately.

>> So I am looking at using git grep as a stand-in for the Xapian indexes
>> when indexlevel eq 'basic'.
>> 
>> Given my personal ratio of searches to indexing I think I will save
>> time in doing that.  I don't have it all wired up yet to know if it will
>> work well, but I suspect it will.
>
> Totally understandable, and yes, if you can fit the LKML repos
> into RAM it should be usable enough for a single user.
>
> "git grep" also has the advantage of being able to use regexps,
> which isn't possible with Xapian at the moment.

My only concern with "git grep" for v2 is how to get it to exclude
messages that have been deleted.

>> Is it only the web interface where the advanced search functionality is
>> available?
>
> Yes.  I don't think there's a good way to implement search for
> NNTP on the server side...  IMAP has specs for implementing
> search; but I don't know how much overlap there is with what
> our web UI currently offers.

I skimmed the IMAP RFCs earlier and the search sounds very close to what
Xapian makes available.  Roughly terms and quoted terms (aka terms with
positions).

If the IMAP interface is sensible it might be worth doing the work to
extend NNTP to provide a search interface modeled on it.

Eric


* Re: Searching via git grep?
  2018-07-20 12:37       ` ebiederm
@ 2018-07-20 23:56         ` ebiederm
  0 siblings, 0 replies; 6+ messages in thread
From: ebiederm @ 2018-07-20 23:56 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

ebiederm@xmission.com (Eric W. Biederman) writes:

> Eric Wong <e@80x24.org> writes:
>
>> "Eric W. Biederman" <ebiederm@xmission.com> wrote:
>>> My current goal is to make it pleasant to read linux-kernel and possibly
>>> other large archives on my personal machine.  Right now the git
>>> trees for linux-kernel are aboug 6.8G.  Small enough to fit in RAM.
>>> 
>>> The Xapian indexes are about 63G.  Not small enough to fit in ram.
>>> They are also not fast to update when I pull in a new batch of messages
>>> from linux-kernel.
>>
>> Interesting, how long does it take to do an incremental index
>> medium/full for you?  Setting XAPIAN_FLUSH_THRESHOLD after my
>> patch yesterday should help noticeably, especially if you're on
>> HDD.
>
> For a small sample less than a days worth of lkml messages
> I get:
>
> $ git --git-dir git/6.git/ fetch
> Enter passphrase for key '/home/eric/.ssh/id_rsa': 
> Fetching origin
> From https://git.kernel.org/pub/scm/public-inbox/vger.kernel.org/lkml/6
>    35280da650..0a97acb7e7  master     -> master
> remote: Counting objects: 1791, done.
> remote: Compressing objects: 100% (1085/1085), done.
> remote: Total 1791 (delta 109), reused 1791 (delta 109)
> Receiving objects: 100% (1791/1791), 1.94 MiB | 1.98 MiB/s, done.
> Resolving deltas: 100% (109/109), done.
> From git:/public-inbox/vger.kernel.org/linux-kernel/6
>    35280da65057..0a97acb7e709  master     -> master
>
> $ time public-inbox-index 
> real    2m1.482s
> user    0m26.084s
> sys     0m20.792s
>
> I am not on a HDD.  I will play with XAPIAN_FLUSH_THRESHOLD next time
> and see if things get better.  Initially building the Xapian index was
> extremely painful, with swapping and took over a day.
>
> Subjectively searcing all of 6.git feels faster than those 2 minutes.
> If for no other reason than I get some of the results back immediately.

XAPIAN_FLUSH_THRESHOLD seems to help.

$ git --git-dir git/6.git/ fetch
Enter passphrase for key '/home/eric/.ssh/id_rsa': 
Fetching origin
From https://git.kernel.org/pub/scm/public-inbox/vger.kernel.org/lkml/6
   0a97acb7e7..61d959b624  master     -> master
remote: Counting objects: 2562, done.
remote: Compressing objects: 100% (1384/1384), done.
remote: Total 2562 (delta 324), reused 2562 (delta 324)
Receiving objects: 100% (2562/2562), 2.45 MiB | 4.03 MiB/s, done.
Resolving deltas: 100% (324/324), done.
From git:/public-inbox/vger.kernel.org/linux-kernel/6
   0a97acb7e709..61d959b62473  master     -> master

$ (export XAPIAN_FLUSH_THRESHOLD=4000000000; time public-inbox-index )
Use of uninitialized value in lc at /usr/share/perl5/Email/Simple/Header.pm line 181, <$in_r> line 121.
Use of uninitialized value in lc at /usr/share/perl5/Email/Simple/Header.pm line 181, <$in_r> line 121.
Use of uninitialized value in lc at /usr/share/perl5/Email/Simple/Header.pm line 181, <$in_r> line 121.
Use of uninitialized value in lc at /usr/share/perl5/Email/Simple/Header.pm line 181, <$in_r> line 121.
Use of uninitialized value in lc at /usr/share/perl5/Email/Simple/Header.pm line 181, <$r> line 41.
Use of uninitialized value in lc at /usr/share/perl5/Email/Simple/Header.pm line 181, <$r> line 41.
Use of uninitialized value in lc at /usr/share/perl5/Email/Simple/Header.pm line 181, <$r> line 41.
Use of uninitialized value in lc at /usr/share/perl5/Email/Simple/Header.pm line 181, <$r> line 41.

real    0m58.239s
user    0m15.820s
sys     0m11.088s

It looks like it cut a minute off running with a slightly larger pool of
objects.
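
The arithmetic behind that estimate, as a small sketch.  The two
numbers are the "real" times from the runs quoted in this thread; the
runs indexed different batches, so this is only a rough comparison.

```shell
#!/bin/sh
# Wall-clock times of the two public-inbox-index runs, in seconds.
before=121.482   # 2m1.482s, without XAPIAN_FLUSH_THRESHOLD
after=58.239     # 0m58.239s, with XAPIAN_FLUSH_THRESHOLD=4000000000

saved=$(awk -v b="$before" -v a="$after" 'BEGIN { printf "%.3f", b - a }')
speedup=$(awk -v b="$before" -v a="$after" 'BEGIN { printf "%.2f", b / a }')

echo "saved ${saved}s, ${speedup}x faster"   # saved 63.243s, 2.09x faster
```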

Eric



Archives are clonable:
	git clone --mirror https://public-inbox.org/meta
	git clone --mirror http://czquwvybam4bgbro.onion/meta
	git clone --mirror http://hjrcffqmbrq6wope.onion/meta
	git clone --mirror http://ou63pmih66umazou.onion/meta

Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.mail.public-inbox.meta
	nntp://ou63pmih66umazou.onion/inbox.comp.mail.public-inbox.meta
	nntp://czquwvybam4bgbro.onion/inbox.comp.mail.public-inbox.meta
	nntp://hjrcffqmbrq6wope.onion/inbox.comp.mail.public-inbox.meta
	nntp://news.gmane.org/gmane.mail.public-inbox.general

 note: .onion URLs require Tor: https://www.torproject.org/
       or Tor2web: https://www.tor2web.org/

AGPL code for this site: git clone https://public-inbox.org/ public-inbox