user/dev discussion of public-inbox itself
From: ebiederm@xmission.com (Eric W. Biederman)
To: Eric Wong <e@80x24.org>
Cc: Konstantin Ryabitsev <konstantin@linuxfoundation.org>,
	 meta@public-inbox.org
Subject: Re: Git-only operation mode
Date: Wed, 25 Sep 2019 19:23:35 -0500
Message-ID: <87zhir3ny0.fsf@x220.int.ebiederm.org>
In-Reply-To: <20190925224500.GA28628@dcvr> (Eric Wong's message of "Wed, 25 Sep 2019 22:45:00 +0000")

Eric Wong <e@80x24.org> writes:

> Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
>> On Wed, Sep 25, 2019 at 07:45:03PM +0000, Eric Wong wrote:
>> > > Is there a way to run just the archiver component of public-inbox --
>> > > just writing to git repos without any of the indexing/frontend bits? One of the
>> > > idle conversations I had with vger.kernel.org folks was to see if we can
>> > > shift the source of truth archive generation to happen at their end. We
>> > > would then clone repositories from them and provide the frontend/search bits
>> > > on lore.kernel.org. From my cursory looking, it would seem that the
>> > > watch/delivery tools always expect to be taking care of xapian/indexing, but
>> > > I think being able to decouple git bits from search/frontend bits would be a
>> > > useful mode of operation.
>> > 
>> > v1 was git-only (that led to scalability problems from big trees).
>> > v2 needs SQLite to do dedupe with indexlevel=basic, but not Xapian,
>> > anymore.  We could get rid of dedupe for v2, but I'm not sure it's
>> > worth it...
>> 
>> Needing sqlite is not a big deal -- compared to the size of the repos,
>> that's reasonably small (e.g. all of lkml git trees are 8.2GB, while
>> msgmap.sqlite3 is 600MB).
>
> Right, it'll also need xap15/over.sqlite* but that's still not too
> big.

For linux-kernel, my copy of those files looks to be about 2.4G, while
the git repos run 9.1G.

>> Is there an easy way to exclude xapian indexes from being generated during
>> watch/mda runs then?
>
> public-inbox-init --indexlevel=basic <usual args>
>
> Or setting publicinbox.$INBOX_NAME.indexlevel=basic in the
> config file after-the-fact.  You should also be able to remove
> any non-SQLite files from xap15 after-the-fact, if you already
> generated them, too (but I haven't tested that).
>
> I started working on a public-inbox-init manpage the other day,
> still need to finish that...
>
>> A follow-up to that -- is running "public-inbox-index" on the repository
>> after it's been updated enough to update the xapian db? It would be easy to
>> do so as part of the grok-pull post-update hook.
>
> Yes, on a fresh clone.  You'll need to change indexlevel to
> medium or full if it was setup using basic.
>
> I haven't figured out how to use a grok-pull post-update hook to
> run index on my clone of erol, since there's multiple epochs
> per-inbox to deal with.

I have a Perl script I use, which boils down to:

	git remote update
	public-inbox-index

That is enough to get things up to date.

The tricky bit is when you have an archive like linux-kernel that uses
multiple git repos.

Given that article numbers are stable, except in the case of bugs, it
should be completely possible to do this.
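
As a rough sketch of the multi-repo case (this is not the Perl script
mentioned above): assuming the v2 layout, where each epoch is a bare
repo at git/<N>.git under the inbox directory, a shell function could
update every epoch and then index once.  The function name
update_inbox and the GIT/INDEXER override variables are made up for
illustration.

```shell
# Sketch only: update every epoch of a v2 inbox, then index once.
# Assumes epochs live at INBOX_DIR/git/<N>.git (v2 layout).
# GIT and INDEXER are hypothetical overrides, handy for a dry run.
update_inbox() {
	inbox_dir=$1
	for epoch in "$inbox_dir"/git/*.git; do
		# Skip the unexpanded glob when no epochs exist.
		[ -d "$epoch" ] || continue
		${GIT:-git} --git-dir="$epoch" remote update || return 1
	done
	# One indexing pass over the whole inbox after all fetches.
	${INDEXER:-public-inbox-index} "$inbox_dir"
}
```

It would be called as "update_inbox /path/to/inbox"; setting GIT=echo
and INDEXER=echo first prints the commands instead of running them.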

The nasty case is when someone rebases the git history.  I have been
meaning to report this after tracking it down.  To the best of my
knowledge, public-inbox-index throws out all of the history that was
rebased, which can be expensive.  For me, it meant I had to drop from
indexlevel=full to indexlevel=basic on linux-kernel, because my laptop
could not handle the reindexing of all of those messages.

Given that the message numbers remain stable in an event like that, it
should be possible to optimize and only reindex things if the blob in
git for a particular message number has changed.  Maybe we already try,
and even that is too expensive.  I haven't re-read that code since I
noticed the problem.
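
To illustrate the comparison I have in mind (not how public-inbox
works today, as far as I know): assume two hypothetical listings, one
taken before the rebase and one after, where each line is
"<article number> <blob oid>".  Only the numbers printed by something
like the following would need reindexing.  The function name and the
listing format are assumptions made for this sketch.

```shell
# Sketch only: print article numbers whose blob changed or is new.
# $1 = listing before the rebase, $2 = listing after the rebase.
# Each line of a listing is "<article number> <blob oid>".
changed_articles() {
	awk '
		# First file: remember the old blob for each article number.
		FILENAME == ARGV[1] { old[$1] = $2; next }
		# Second file: report numbers that are new or point at a new blob.
		!($1 in old) || old[$1] != $2 { print $1 }
	' "$1" "$2"
}
```

Articles that keep the same blob are skipped, so the cost is
proportional to what actually changed rather than to the length of the
rebased history.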

Eric



Thread overview: 9+ messages
2019-09-25 18:24 Git-only operation mode Konstantin Ryabitsev
2019-09-25 19:45 ` Eric Wong
2019-09-25 19:58   ` Konstantin Ryabitsev
2019-09-25 22:45     ` Eric Wong
2019-09-26  0:23       ` Eric W. Biederman [this message]
2019-09-26 20:52       ` Konstantin Ryabitsev
2019-09-26 21:10         ` Eric Wong
2019-09-26 21:44           ` Konstantin Ryabitsev
2019-10-07  0:07         ` Eric Wong
