user/dev discussion of public-inbox itself
* Git-only operation mode
@ 2019-09-25 18:24 Konstantin Ryabitsev
  2019-09-25 19:45 ` Eric Wong
  0 siblings, 1 reply; 9+ messages in thread
From: Konstantin Ryabitsev @ 2019-09-25 18:24 UTC (permalink / raw)
  To: meta

Hello:

Is there a way to run just the archiver component of public-inbox -- 
just writing to git repos without any of the indexing/frontend bits? One 
of the idle conversations I had with vger.kernel.org folks was to see if 
we can shift the source of truth archive generation to happen at their 
end. We would then clone repositories from them and provide the 
frontend/search bits on lore.kernel.org. From my cursory looking, it 
would seem that the watch/delivery tools always expect to be taking care 
of xapian/indexing, but I think being able to decouple git bits from 
search/frontend bits would be a useful mode of operation.

Best,
-K

* Re: Git-only operation mode
  2019-09-25 18:24 Git-only operation mode Konstantin Ryabitsev
@ 2019-09-25 19:45 ` Eric Wong
  2019-09-25 19:58   ` Konstantin Ryabitsev
  0 siblings, 1 reply; 9+ messages in thread
From: Eric Wong @ 2019-09-25 19:45 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> Hello:
> 
> Is there a way to run just the archiver component of public-inbox -- just
> writing to git repos without any of the indexing/frontend bits? One of the
> idle conversations I had with vger.kernel.org folks was to see if we can
> shift the source of truth archive generation to happen at their end. We
> would then clone repositories from them and provide the frontend/search bits
> on lore.kernel.org. From my cursory looking, it would seem that the
> watch/delivery tools always expect to be taking care of xapian/indexing, but
> I think being able to decouple git bits from search/frontend bits would be a
> useful mode or operation.

v1 was git-only (that led to scalability problems from big trees).
v2 needs SQLite to do dedupe with indexlevel=basic, but not Xapian,
anymore.  We could get rid of dedupe for v2, but I'm not sure it's
worth it...

* Re: Git-only operation mode
  2019-09-25 19:45 ` Eric Wong
@ 2019-09-25 19:58   ` Konstantin Ryabitsev
  2019-09-25 22:45     ` Eric Wong
  0 siblings, 1 reply; 9+ messages in thread
From: Konstantin Ryabitsev @ 2019-09-25 19:58 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

On Wed, Sep 25, 2019 at 07:45:03PM +0000, Eric Wong wrote:
>> Is there a way to run just the archiver component of public-inbox -- 
>> just
>> writing to git repos without any of the indexing/frontend bits? One of the
>> idle conversations I had with vger.kernel.org folks was to see if we can
>> shift the source of truth archive generation to happen at their end. We
>> would then clone repositories from them and provide the frontend/search bits
>> on lore.kernel.org. From my cursory looking, it would seem that the
>> watch/delivery tools always expect to be taking care of xapian/indexing, but
>> I think being able to decouple git bits from search/frontend bits would be a
>> useful mode or operation.
>
>v1 was git-only (that led to scalability problems from big trees).
>v2 needs SQLite to do dedupe with indexlevel=basic, but not Xapian,
>anymore.  We could get rid of dedupe for v2, but I'm not sure it's
>worth it...

Needing sqlite is not a big deal -- compared to the size of the repos, 
that's reasonably small (e.g. all of lkml git trees are 8.2GB, while 
msgmap.sqlite3 is 600MB). 

Is there an easy way to exclude xapian indexes from being generated 
during watch/mda runs then?

A follow-up to that -- is running "public-inbox-index" on the repository 
after it's been updated enough to update the xapian db? It would be easy 
to do so as part of the grok-pull post-update hook.

Best,
-K

* Re: Git-only operation mode
  2019-09-25 19:58   ` Konstantin Ryabitsev
@ 2019-09-25 22:45     ` Eric Wong
  2019-09-26  0:23       ` ebiederm
  2019-09-26 20:52       ` Konstantin Ryabitsev
  0 siblings, 2 replies; 9+ messages in thread
From: Eric Wong @ 2019-09-25 22:45 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> On Wed, Sep 25, 2019 at 07:45:03PM +0000, Eric Wong wrote:
> > > Is there a way to run just the archiver component of public-inbox --
> > > just
> > > writing to git repos without any of the indexing/frontend bits? One of the
> > > idle conversations I had with vger.kernel.org folks was to see if we can
> > > shift the source of truth archive generation to happen at their end. We
> > > would then clone repositories from them and provide the frontend/search bits
> > > on lore.kernel.org. From my cursory looking, it would seem that the
> > > watch/delivery tools always expect to be taking care of xapian/indexing, but
> > > I think being able to decouple git bits from search/frontend bits would be a
> > > useful mode or operation.
> > 
> > v1 was git-only (that led to scalability problems from big trees).
> > v2 needs SQLite to do dedupe with indexlevel=basic, but not Xapian,
> > anymore.  We could get rid of dedupe for v2, but I'm not sure it's
> > worth it...
> 
> Needing sqlite is not a big deal -- compared to the size of the repos,
> that's reasonably small (e.g. all of lkml git trees are 8.2GB, while
> msgmap.sqlite3 is 600MB).

Right, it'll also need xap15/over.sqlite* but that's still not too big.

> Is there an easy way to exclude xapian indexes from being generated during
> watch/mda runs then?

public-inbox-init --indexlevel=basic <usual args>

Or setting publicinbox.$INBOX_NAME.indexlevel=basic in the
config file after-the-fact.  You should also be able to remove
any non-SQLite files from xap15 after-the-fact, if you already
generated them, too (but I haven't tested that).
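
As a rough illustration of the config-file route: the sketch below assumes
the config is the usual git-config-format file, and uses a made-up inbox
name ("lkml") and a throwaway temp file instead of a real config path.

```shell
# Sketch only: "lkml" and the temp file are placeholders, not real settings.
PI_CONFIG=$(mktemp)   # stand-in for the real public-inbox config file
# The config uses git-config syntax, so plain git-config can edit it
git config -f "$PI_CONFIG" publicinbox.lkml.indexlevel basic
# Read the value back to confirm what later watch/mda runs would see
git config -f "$PI_CONFIG" --get publicinbox.lkml.indexlevel
```

The last command should print "basic"; pointing -f at the real config
file would make the same change for an existing inbox.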

I started working on a public-inbox-init manpage the other day,
still need to finish that...

> A follow-up to that -- is running "public-inbox-index" on the repository
> after it's been updated enough to update the xapian db? It would be easy to
> do so as part of the grok-pull post-update hook.

Yes, on a fresh clone.  You'll need to change indexlevel to
medium or full if it was setup using basic.

I haven't figured out how to use a grok-pull post-update hook to
run index on my clone of erol, since there's multiple epochs
per-inbox to deal with.

* Re: Git-only operation mode
  2019-09-25 22:45     ` Eric Wong
@ 2019-09-26  0:23       ` ebiederm
  2019-09-26 20:52       ` Konstantin Ryabitsev
  1 sibling, 0 replies; 9+ messages in thread
From: ebiederm @ 2019-09-26  0:23 UTC (permalink / raw)
  To: Eric Wong; +Cc: Konstantin Ryabitsev, meta

Eric Wong <e@80x24.org> writes:

> Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
>> On Wed, Sep 25, 2019 at 07:45:03PM +0000, Eric Wong wrote:
>> > > Is there a way to run just the archiver component of public-inbox --
>> > > just
>> > > writing to git repos without any of the indexing/frontend bits? One of the
>> > > idle conversations I had with vger.kernel.org folks was to see if we can
>> > > shift the source of truth archive generation to happen at their end. We
>> > > would then clone repositories from them and provide the frontend/search bits
>> > > on lore.kernel.org. From my cursory looking, it would seem that the
>> > > watch/delivery tools always expect to be taking care of xapian/indexing, but
>> > > I think being able to decouple git bits from search/frontend bits would be a
>> > > useful mode or operation.
>> > 
>> > v1 was git-only (that led to scalability problems from big trees).
>> > v2 needs SQLite to do dedupe with indexlevel=basic, but not Xapian,
>> > anymore.  We could get rid of dedupe for v2, but I'm not sure it's
>> > worth it...
>> 
>> Needing sqlite is not a big deal -- compared to the size of the repos,
>> that's reasonably small (e.g. all of lkml git trees are 8.2GB, while
>> msgmap.sqlite3 is 600MB).
>
> Right, it'll also need xap15/over.sqlite* but that's still not too
> big.

For linux-kernel my copy looks to be about 2.4G while the git repos
run 9.1G.

>> Is there an easy way to exclude xapian indexes from being generated during
>> watch/mda runs then?
>
> public-inbox-init --indexlevel=basic <usual args>
>
> Or setting publicinbox.$INBOX_NAME.indexlevel=basic in the
> config file after-the-fact.  You should also be able to remove
> any non-SQLite files from xap15 after-the-fact, if you already
> generated them, too (but I haven't tested that).
>
> I started working on a public-inbox-init manpage the other day,
> still need to finish that...
>
>> A follow-up to that -- is running "public-inbox-index" on the repository
>> after it's been updated enough to update the xapian db? It would be easy to
>> do so as part of the grok-pull post-update hook.
>
> Yes, on a fresh clone.  You'll need to change indexlevel to
> medium or full if it was setup using basic.
>
> I haven't figured out how to use a grok-pull post-update hook to
> run index on my clone of erol, since there's multiple epochs
> per-inbox to deal with.

I have a perl script I use, which boils down to:

	git remote update
	public-inbox-index

That is enough to get things up to date.

The tricky bit is when you have an archive like linux-kernel that uses
multiple git repos.

Given that article numbers are stable (except in the case of bugs), it
should be completely possible to do this.
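
For the multi-repo case, the two commands above could be wrapped roughly
like this.  It is only a sketch, assuming a v2 layout where every epoch
lives in $inbox_dir/git/N.git and each epoch was cloned as a mirror; the
function name update_inbox is made up for illustration.

```shell
# Sketch under assumptions: epochs live in "$inbox_dir"/git/N.git and
# each one already has its remote configured (e.g. via clone --mirror).
update_inbox() {
    inbox_dir=$1
    for epoch in "$inbox_dir"/git/*.git; do
        [ -d "$epoch" ] || continue      # the glob matched nothing
        git --git-dir="$epoch" remote update || return 1
    done
    # A single index run on the inbox directory follows all fetches
    public-inbox-index "$inbox_dir"
}
```

It would be called as "update_inbox /var/lib/public-inbox/lkml" (path
borrowed from elsewhere in this thread as an example).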

The nasty case is when someone rebases the git history.  I have been
meaning to report this after tracking it down.  To the best of my
knowledge public-inbox-index throws out all of the history that was
rebased, which can be expensive.  For me it meant I had to drop from
indexlevel=full to indexlevel=basic on linux-kernel, because my laptop
could not handle the reindexing of all of those messages.

Given that the message numbers remain stable in an event like that, it
should be possible to optimize and only reindex a message if the blob in
git for its message number has changed.  Maybe we already try
and even that is too expensive.  I haven't re-read that code since I
noticed the problem.
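
To make the idea concrete, here is a toy sketch of the comparison.  It
is purely hypothetical: it assumes two text files of "ARTICLE_NUM
BLOB_OID" lines (before and after the rebase), which is not how
public-inbox stores this information, and changed_nums is an invented
name.

```shell
# Hypothetical helper: each input file holds "ARTICLE_NUM BLOB_OID" lines.
# Prints article numbers that are new, or whose blob differs from the
# old map; only those would need reindexing.
changed_nums() {
    awk 'FILENAME == ARGV[1] { old[$1] = $2; next }
         !($1 in old) || old[$1] != $2 { print $1 }' "$1" "$2"
}
```

Messages whose number and blob are both unchanged produce no output, so
a rebase that rewrites commits but keeps the blobs would cost nothing
under this scheme.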

Eric


* Re: Git-only operation mode
  2019-09-25 22:45     ` Eric Wong
  2019-09-26  0:23       ` ebiederm
@ 2019-09-26 20:52       ` Konstantin Ryabitsev
  2019-09-26 21:10         ` Eric Wong
  2019-10-07  0:07         ` Eric Wong
  1 sibling, 2 replies; 9+ messages in thread
From: Konstantin Ryabitsev @ 2019-09-26 20:52 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

On Wed, Sep 25, 2019 at 10:45:00PM +0000, Eric Wong wrote:
>> A follow-up to that -- is running "public-inbox-index" on the 
>> repository
>> after it's been updated enough to update the xapian db? It would be easy to
>> do so as part of the grok-pull post-update hook.
>
>Yes, on a fresh clone.  You'll need to change indexlevel to
>medium or full if it was setup using basic.
>
>I haven't figured out how to use a grok-pull post-update hook to
>run index on my clone of erol, since there's multiple epochs
>per-inbox to deal with.

Theoretically, shouldn't be that difficult. The post-update hook fires 
on clone/update with the full path to the repo that got updated, e.g.

post-update-hook.sh /var/lib/public-inbox/lkml/git/7.git

Here's a quick and dirty start to the post-update-hook that I came up 
with:

-----
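#!/bin/bash
# $1 is the full path to the epoch repo that grok-pull just updated,
# e.g. /var/lib/public-inbox/lkml/git/7.git

# Strip the trailing /git/N.git to get the inbox's top-level directory
topdir=$(echo "$1" | sed 's|/git/[[:digit:]]*\.git$||')
pidir=$(basename "$topdir")
url="http://localhost:8080/${pidir}"

cd "$topdir/.." || exit 1

# First run for this inbox: initialize it before indexing
if [[ ! -f $pidir/msgmap.sqlite3 ]]; then
    # Derive the list address from the List-Id header of the latest message
    listid=$(git --git-dir="$1" show master:m | grep -i '^List-Id:' | sed 's|.*:.*<\(.*\)>$|\1|g')
    email=$(echo "$listid" | sed 's|\.|@|')
    public-inbox-init -V2 "$pidir" "$pidir/" "$url" "$email"
    # Need logic here for adding to the config file
fi

public-inbox-index "$pidir"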
-----

It needs some kind of a template entry for adding to the config file 
post-init, but this should at least do the right thing for running 
public-inbox-index on repo updates.
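
To show what the sed expressions in the hook produce, here is a small
worked example using the sample path from above.  The List-Id value is
only an illustrative input, and nothing in this snippet touches the
filesystem.

```shell
# Worked example of the path and address derivation used by the hook.
repo=/var/lib/public-inbox/lkml/git/7.git
topdir=$(echo "$repo" | sed 's|/git/[[:digit:]]*\.git$||')
pidir=$(basename "$topdir")
echo "$topdir"   # /var/lib/public-inbox/lkml
echo "$pidir"    # lkml

# Illustrative List-Id value; only the first dot is turned into "@"
listid=linux-kernel.vger.kernel.org
email=$(echo "$listid" | sed 's|\.|@|')
echo "$email"    # linux-kernel@vger.kernel.org
```

Note that the address guess only works when the List-Id follows the
"localpart.domain" pattern; other lists would need a different mapping.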

-K

* Re: Git-only operation mode
  2019-09-26 20:52       ` Konstantin Ryabitsev
@ 2019-09-26 21:10         ` Eric Wong
  2019-09-26 21:44           ` Konstantin Ryabitsev
  2019-10-07  0:07         ` Eric Wong
  1 sibling, 1 reply; 9+ messages in thread
From: Eric Wong @ 2019-09-26 21:10 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> On Wed, Sep 25, 2019 at 10:45:00PM +0000, Eric Wong wrote:
> > > A follow-up to that -- is running "public-inbox-index" on the
> > > repository
> > > after it's been updated enough to update the xapian db? It would be easy to
> > > do so as part of the grok-pull post-update hook.
> > 
> > Yes, on a fresh clone.  You'll need to change indexlevel to
> > medium or full if it was setup using basic.
> > 
> > I haven't figured out how to use a grok-pull post-update hook to
> > run index on my clone of erol, since there's multiple epochs
> > per-inbox to deal with.
> 
> Theoretically, shouldn't be that difficult. The post-update hook fires on
> clone/update with the full path to the repo that got updated, e.g.
> 
> post-update-hook.sh /var/lib/public-inbox/lkml/git/7.git
> 
> Here's a quick and dirty start to the post-update-hook that I came up with:
> 
> -----
> #!/bin/bash
> 
> topdir=$(echo $1 | sed 's|/git/[[:digit:]]*\.git$||g')
> pidir=$(basename $topdir)
> url="http://localhost:8080/${pidir}"
> 
> cd $topdir/..
> 
> if [[ ! -f $pidir/msgmap.sqlite3 ]]; then
>    listid=$(git --git-dir=$1 show master:m | grep -i '^List-Id:' | sed 's|.*:.*<\(.*\)>$|\1|g')
>    email=$(echo $listid | sed 's|\.|@|')
>    public-inbox-init -V2 $pidir $pidir/ $url $email

If grok-pull is using multiple threads, there can be a race
there because parallel runs of public-inbox-init can clobber
each other (which needs to be fixed :x)

>    # Need logic here for adding to the config file

Yeah, I've been meaning to add something like "$INBOX_URL/_/text/config"
so some of the config keys can be easily cloned, too.

Not sure if it's something that can be stuffed in manifest.js.gz
or better as a separate file...  Probably separate file?

> fi
> 
> public-inbox-index $pidir
> -----
> 
> It needs some kind of a template entry for adding to the config file
> post-init, but this should at least do the right thing for running
> public-inbox-index on repo updates.
> 
> -K

* Re: Git-only operation mode
  2019-09-26 21:10         ` Eric Wong
@ 2019-09-26 21:44           ` Konstantin Ryabitsev
  0 siblings, 0 replies; 9+ messages in thread
From: Konstantin Ryabitsev @ 2019-09-26 21:44 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

On Thu, Sep 26, 2019 at 09:10:25PM +0000, Eric Wong wrote:
>If grok-pull is using multiple threads, there can be a race
>there because parallel runs of public-inbox-init can clobber
>each other (which needs to be fixed :x)

Yes, this is true -- there should be a lockfile in the hook to keep 
multiple post-update hooks from operating on the same pidir.
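
One possible shape for that lock, as a sketch: it uses an atomic mkdir
instead of a lockfile so it stays portable, and the with_lock name and
the lock path are arbitrary choices for illustration.

```shell
# Sketch: serialize hook runs per inbox with an atomic mkdir lock.
with_lock() {
    lockdir=$1
    shift
    # mkdir is atomic: exactly one caller succeeds, the others wait
    until mkdir "$lockdir" 2>/dev/null; do
        sleep 1
    done
    rc=0
    "$@" || rc=$?        # run the command, remember its exit status
    rmdir "$lockdir"     # release the lock even if the command failed
    return "$rc"
}
```

In the hook this might look like
with_lock "$topdir/.hook.lock" public-inbox-index "$pidir".  A hook
killed mid-run would leave a stale lock behind; flock(1) from util-linux
avoids that where it is available.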

>>    # Need logic here for adding to the config file
>
>Yeah, I've been meaning to add something like "$INBOX_URL/_/text/config"
>so some of the config keys can be easily cloned, too.
>
>Not sure if it's something that can be stuffed in manifest.js.gz
>or better as a separate file...  Probably separate file?

Right -- the manifest only deals with very basic repository details, so 
it's easier to pass this info in some other way. Either via a remote 
URL, or perhaps via a file in a special ref in the repo itself 
(refs/meta/config?). We use this trick on git.kernel.org to let people 
tweak cgit parameters for their repos, see
https://korg.wiki.kernel.org/userdoc/cgit-meta-data.

-K

* Re: Git-only operation mode
  2019-09-26 20:52       ` Konstantin Ryabitsev
  2019-09-26 21:10         ` Eric Wong
@ 2019-10-07  0:07         ` Eric Wong
  1 sibling, 0 replies; 9+ messages in thread
From: Eric Wong @ 2019-10-07  0:07 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> On Wed, Sep 25, 2019 at 10:45:00PM +0000, Eric Wong wrote:
> > > A follow-up to that -- is running "public-inbox-index" on the
> > > repository
> > > after it's been updated enough to update the xapian db? It would be easy to
> > > do so as part of the grok-pull post-update hook.
> > 
> > Yes, on a fresh clone.  You'll need to change indexlevel to
> > medium or full if it was setup using basic.
> > 
> > I haven't figured out how to use a grok-pull post-update hook to
> > run index on my clone of erol, since there's multiple epochs
> > per-inbox to deal with.
> 
> Theoretically, shouldn't be that difficult. The post-update hook fires on
> clone/update with the full path to the repo that got updated, e.g.
> 
> post-update-hook.sh /var/lib/public-inbox/lkml/git/7.git
> 
> Here's a quick and dirty start to the post-update-hook that I came up with:
> 
> -----
> #!/bin/bash
> 
> topdir=$(echo $1 | sed 's|/git/[[:digit:]]*\.git$||g')
> pidir=$(basename $topdir)
> url="http://localhost:8080/${pidir}"
> 
> cd $topdir/..
> 
> if [[ ! -f $pidir/msgmap.sqlite3 ]]; then
>    listid=$(git --git-dir=$1 show master:m | grep -i '^List-Id:' | sed 's|.*:.*<\(.*\)>$|\1|g')
>    email=$(echo $listid | sed 's|\.|@|')
>    public-inbox-init -V2 $pidir $pidir/ $url $email
>    # Need logic here for adding to the config file
> fi
> 
> public-inbox-index $pidir

Running public-inbox-index blindly there can be
dangerous/surprising when multiple epochs are initially cloned in
non-sequential order.

The example I sent out won't index unless there's messages
in msgmap:

https://public-inbox.org/meta/20191006235651.5725-1-e@80x24.org/
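
A different and cruder guard than the msgmap check in that message
would be to refuse to index until every epoch from 0 up to the highest
one is present on disk.  The sketch below is an assumption-laden
illustration, not what the linked example does; epochs_contiguous is an
invented name, and it would wrongly refuse inboxes whose early epochs
were deliberately left out.

```shell
# Sketch: succeed only if "$1"/git holds epochs 0.git .. N.git, no gaps.
epochs_contiguous() {
    gitdir=$1/git
    max=-1
    count=0
    for d in "$gitdir"/*.git; do
        [ -d "$d" ] || continue
        n=$(basename "$d" .git)
        case $n in
            ''|*[!0-9]*) continue ;;     # skip non-numeric names
        esac
        count=$((count + 1))
        if [ "$n" -gt "$max" ]; then
            max=$n
        fi
    done
    # With no gaps, the number of epochs equals the highest epoch + 1
    [ "$count" -gt 0 ] && [ "$count" -eq $((max + 1)) ]
}
```

A hook could then run
epochs_contiguous "$pidir" && public-inbox-index "$pidir".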

> -----
> 
> It needs some kind of a template entry for adding to the config file
> post-init, but this should at least do the right thing for running
> public-inbox-index on repo updates.

I tried to use the $INBOX_URL/_/text/config/raw endpoint, which
fell down when I tried to clone erol.kernel.org :x (but works on
lore)

I don't have List-Id: as a fallback, yet...  Not sure if it's
really worth the effort, but it just creates a bogus
$inbox_name@$$.$(hostname).example.com address if curl fails on
the config URL.
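
Roughly, the fallback described above could look like the sketch below.
It assumes the fetched file is git-config syntax with the address under
publicinbox.<name>.address (an assumption about the endpoint's output);
the function name and the 10 second timeout are arbitrary.

```shell
# Sketch: try the config endpoint, else make up a placeholder address.
inbox_address() {
    inbox_name=$1
    inbox_url=$2
    cfg=$(mktemp)
    if curl -fsS -m 10 "$inbox_url/_/text/config/raw" -o "$cfg" 2>/dev/null; then
        # Assumed key; adjust if the served snippet is laid out differently
        git config -f "$cfg" --get "publicinbox.$inbox_name.address"
    else
        # $$ is the shell's PID, which keeps the bogus address distinct
        echo "$inbox_name@$$.$(hostname).example.com"
    fi
    rm -f "$cfg"
}
```

The address on the fallback path is deliberately bogus; it only exists
so that public-inbox-init has something to record.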

Thread overview: 9+ messages
2019-09-25 18:24 Git-only operation mode Konstantin Ryabitsev
2019-09-25 19:45 ` Eric Wong
2019-09-25 19:58   ` Konstantin Ryabitsev
2019-09-25 22:45     ` Eric Wong
2019-09-26  0:23       ` ebiederm
2019-09-26 20:52       ` Konstantin Ryabitsev
2019-09-26 21:10         ` Eric Wong
2019-09-26 21:44           ` Konstantin Ryabitsev
2019-10-07  0:07         ` Eric Wong

Archives are clonable:
	git clone --mirror https://public-inbox.org/meta
	git clone --mirror http://czquwvybam4bgbro.onion/meta
	git clone --mirror http://hjrcffqmbrq6wope.onion/meta
	git clone --mirror http://ou63pmih66umazou.onion/meta

Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.mail.public-inbox.meta
	nntp://ou63pmih66umazou.onion/inbox.comp.mail.public-inbox.meta
	nntp://czquwvybam4bgbro.onion/inbox.comp.mail.public-inbox.meta
	nntp://hjrcffqmbrq6wope.onion/inbox.comp.mail.public-inbox.meta
	nntp://news.gmane.org/gmane.mail.public-inbox.general

 note: .onion URLs require Tor: https://www.torproject.org/

AGPL code for this site: git clone https://public-inbox.org/ public-inbox