Re: public-inbox + mlmmj best practices?

From: Eric Wong <e@80x24.org>
To: Konstantin Ryabitsev <konstantin@linuxfoundation.org>
Cc: meta@public-inbox.org
Subject: Re: public-inbox + mlmmj best practices?
Date: Mon, 28 Dec 2020 21:31:39 +0000
Message-ID: <20201228213139.GA17600@dcvr>
In-Reply-To: <20201228162218.zcnqxkgwa2i3nt66@chatter.i7.local>

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> On Tue, Dec 22, 2020 at 06:28:08AM +0000, Eric Wong wrote:
> > Eric Wong <e@80x24.org> wrote:
> > > 
> > > There's scripts/ssoma-replay which was v1-only and dependent on
> > > ssoma.  I've been meaning to convert into something that reads
> > > NNTP so it's not locked into public-inbox.  Maybe it could be
> > > part of `lei', too, for piping to arbitrary commands, dunno...
> 
> I wrote grok-pi-piper a while back for the purpose of piping from git to
> patchwork.kernel.org. It's not complete yet, because we currently do not
> handle situations with rewritten history, but it's been working well enough. I
> have a write-up here:
> 
> https://people.kernel.org/monsieuricon/subscribing-to-lore-lists-with-grokmirror
> 
> What is the sanest way to recognize and handle history rewrites? Right now, we
> just keep track of the latest tip hash. On each subsequent run, we just iterate
> all commits between the recorded hash and the newest tip. My current thoughts
> are:
> 
> - in addition to the latest tip hash, keep track of author, authordate and
>   message-id of the last processed message
> - if we no longer find the tracked hash in the repo, use author+authordate to
>   find the new hash of the latest message we processed, and verify with
>   message-id
> - if we cannot find the exact match (i.e. our latest processed message is gone
>   from history), find the first commit that happens before our recorded
>   authordate and use that as the "latest processed" jump-off point

That's a lot of persistent state to keep track of.

> This should do the right thing in most situations except for when the message
> that was deleted from history was sent with a bogus Date: header with a date
> in the future. In this case, we can miss valid messages in the queue.

AFAIK, V2Writable always does the right thing on -purge/-edit;
at least for WWW users(*).

V2W does more work in rare cases when history gets rewritten,
but doesn't track anything beyond the latest indexed commit
hash.

In the V2Writable::log_range sub, it uses "git merge-base --is-ancestor"
(via is_ancestor wrapper) to cover the common case of contiguous history.

Otherwise, it attempts "git merge-base" to find a common ancestor:

	if (common_ancestor_found)
		unindex some history starting at common ancestor
		reindex from common ancestor
	else
		unindex all history in epoch
		reindex epoch from scratch

AFAIK, the common_ancestor_found case is always true unless
somebody was wacky enough to run a full gc+prune immediately
after fetching.  IOW, I don't think the else case happens
in practice.
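
A rough Python sketch of that decision, for illustration only.  It is
not the actual V2Writable code: plan_sync, git_is_ancestor and
git_merge_base are hypothetical names, and the exact revision ranges
are an assumption about what "unindex" and "reindex" would walk.  The
only state it takes is the last indexed commit hash and the new tip.

```python
import subprocess
from typing import Callable, Optional, Tuple


def git_is_ancestor(git_dir: str, old: str, new: str) -> bool:
    """True if `old` is an ancestor of `new` (git exits 0 in that case)."""
    r = subprocess.run(
        ["git", f"--git-dir={git_dir}", "merge-base", "--is-ancestor", old, new]
    )
    return r.returncode == 0


def git_merge_base(git_dir: str, old: str, new: str) -> Optional[str]:
    """Return a common ancestor of `old` and `new`, or None if git finds none."""
    r = subprocess.run(
        ["git", f"--git-dir={git_dir}", "merge-base", old, new],
        capture_output=True,
        text=True,
    )
    out = r.stdout.strip()
    return out if r.returncode == 0 and out else None


def plan_sync(
    last: str,
    tip: str,
    is_ancestor: Callable[[str, str], bool],
    merge_base: Callable[[str, str], Optional[str]],
) -> Tuple[Optional[str], str]:
    """Return (unindex_range, reindex_range) as git revision arguments.

    unindex_range is None when history is contiguous and nothing
    needs to be removed from the index.
    """
    if is_ancestor(last, tip):
        # Common case: new commits were simply appended.
        return None, f"{last}..{tip}"
    base = merge_base(last, tip)
    if base is not None:
        # History was rewritten after `base`: drop what we indexed
        # past it, then index the rewritten commits.
        return f"{base}..{last}", f"{base}..{tip}"
    # No common ancestor: drop everything reachable from the old
    # tip and index the whole epoch again.
    return last, tip
```

Against a real repository the callables would be bound to a git
directory, e.g. with functools.partial(git_is_ancestor, git_dir).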

(*) The downside to this approach is IMAP UIDs (NNTP article
numbers) get changed, but I think I can work around that.  The
workaround I'm thinking of involves capturing exact blob OIDs
during the unindex phase to create an OID => UID mapping.
reindex would reuse the OID => UID mapping to keep the same
IMAP UID.  It could be loosened to use ContentHash, or
whatever combination of Message-ID/From/Date/etc, too.
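
A minimal sketch of that idea in Python, under the assumption that
UIDs are handed out sequentially; UidAllocator and its methods are
hypothetical names, not anything that exists in public-inbox today.
It relies on blob OIDs being content hashes, so a message that
survives a history rewrite keeps its OID even though the commit
hashes around it change.

```python
from typing import Dict


class UidAllocator:
    """Hands out UIDs, reusing old ones for blobs seen before a rewrite."""

    def __init__(self, next_uid: int = 1) -> None:
        self.next_uid = next_uid
        self.uid_by_oid: Dict[str, int] = {}  # currently indexed blobs
        self.saved: Dict[str, int] = {}       # OID => UID captured by unindex

    def index(self, oid: str) -> int:
        # Reuse the UID captured during unindex if this exact blob is back.
        uid = self.saved.pop(oid, None)
        if uid is None:
            uid = self.next_uid
            self.next_uid += 1
        self.uid_by_oid[oid] = uid
        return uid

    def unindex(self, oid: str) -> None:
        # Remember the mapping instead of discarding it.
        uid = self.uid_by_oid.pop(oid, None)
        if uid is not None:
            self.saved[oid] = uid

    def finish_reindex(self) -> None:
        # Anything still saved was really removed; drop the leftover state.
        self.saved.clear()


alloc = UidAllocator()
uids = [alloc.index(oid) for oid in ("blob-a", "blob-b", "blob-c")]
# "blob-b" is purged; history after "blob-a" gets rewritten.
alloc.unindex("blob-c")
alloc.unindex("blob-b")
uid_c = alloc.index("blob-c")   # same blob, same UID as before
uid_d = alloc.index("blob-d")   # new message, fresh UID
alloc.finish_reindex()
```

The saved mapping only lives between unindex and reindex, which keeps
the extra state short-lived.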

> Any suggestions on how this can be improved?

Fwiw, my general approach is to keep track of and operate with
as little state as I can get away with (and discard it as soon
as possible).

IME it avoids bugs by simplifying things to accommodate my
limited mental capacity.

The lack of distinct POLL{IN|OUT|HUP|ERR} callbacks in the DS
event loop is another example of that approach, as is the lack
of explicit {state} fields for per-client sockets: all state
is implied from what's in (or not in) read/write buffers.
