From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net
X-Spam-Level:
X-Spam-Status: No, score=-4.0 required=3.0 tests=ALL_TRUSTED,BAYES_00
	shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2
Received: from localhost (dcvr.yhbt.net [127.0.0.1])
	by dcvr.yhbt.net (Postfix) with ESMTP id C0D0A1F4B4;
	Tue, 5 Jan 2021 01:06:43 +0000 (UTC)
Date: Tue, 5 Jan 2021 01:06:43 +0000
From: Eric Wong
To: meta@public-inbox.org
Subject: Re: public-inbox + mlmmj best practices?
Message-ID: <20210105010643.GA20926@dcvr>
References: <20201221212032.syunaxzrvcqcrose@chatter.i7.local>
	<20201221213914.GA9374@dcvr>
	<20201222062808.GA4522@dcvr>
	<20201228162218.zcnqxkgwa2i3nt66@chatter.i7.local>
	<20201228213139.GA17600@dcvr>
	<20210104201245.cbtqno6cyxw5iycu@chatter.i7.local>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <20210104201245.cbtqno6cyxw5iycu@chatter.i7.local>
List-Id:

Konstantin Ryabitsev wrote:
> On Mon, Dec 28, 2020 at 09:31:39PM +0000, Eric Wong wrote:
> > AFAIK, V2Writable always does the right thing on -purge/-edit;
> > at least for WWW users(*).
> >
> > V2W does more work in rare cases when history gets rewritten,
> > but doesn't track anything beyond the latest indexed commit
> > hash.
> >
> > In the V2Writable::log_range sub, it uses "git merge-base --is-ancestor"
> > (via is_ancestor wrapper) to cover the common case of contiguous history.
> >
> > Otherwise, it attempts "git merge-base" to find a common ancestor:
> >
> >   if (common_ancestor_found)
> >     unindex some history starting at common ancestor
> >     reindex from common ancestor
> >   else
> >     unindex all history in epoch
> >     reindex epoch from scratch
>
> I think I understand, but in the case of grok-pi-piper, unindexing is not an
> option, since we can't control what the receiving-end app has already done
> with the messages we have previously piped to it. We can't assume that it will
> do the right thing when it receives duplicate messages, so we need to somehow
> make sure that we don't pipe the same message twice.

Never mind, I just reread my code more carefully :x

Actually, the unindexing code currently stores an {unindexed} hash,
which is a { Message-ID => (NNTP )num } mapping.

That allows most unedited messages to keep the same NNTP article
number, so clients don't see them twice. "Most" meaning non-broken
messages which don't have reused Message-IDs.

I'm thinking {unindexed} should be a
{ OID => [ num, Message-ID ] } mapping.

That would allow the new version of the edited message to
be piped and seen by NNTP/IMAP readers.

You *do* want to pipe the new version of the message you've
edited, right?

> > AFAIK, the common_ancestor_found case is always true unless
> > somebody was wacky enough to run a full gc+prune immediately
> > after fetching. IOW, I don't think the else case happens
> > in practice.
>
> :) It kinda does in grok-pi-piper case, since one of the config options is to
> continuously "reshallow" the repository to basically contain no objects.
>
> https://git.kernel.org/pub/scm/utils/grokmirror/grokmirror.git/tree/grokmirror/pi_piper.py#n58
>
> I know that this is "wacky" as you say, but it helps save dramatic amounts of
> space when cloning most of lore.kernel.org repositories. We can still use "git
> fetch --deepen" when necessary, but this does make it impossible to use the
> common ancestor strategy when dealing with history rewrites.

Understood. So yeah, actually the current {unindexed} hash in
V2Writable mostly does what we want, but I'm preparing a patch
which does the aforementioned { OID => [ num, Message-ID ] } mapping.
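
To make the OID-keyed idea concrete, here is a rough sketch in Python
(chosen only because pi_piper.py is Python; V2Writable itself is Perl and
this is not its code). The function name renumber_after_rewrite, the
tuple layout, and the to_pipe list are all hypothetical. It assumes a
blob whose OID was seen before the rewrite is unchanged and keeps its
article number, while any unseen OID is treated as new or edited content
that gets a fresh number and should be piped.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical illustration only, not public-inbox's implementation.
# Maps a git blob OID to the (article number, Message-ID) it had
# before the history rewrite caused it to be unindexed.
Unindexed = Dict[str, Tuple[int, str]]
Entry = Tuple[int, str, str]  # (article number, OID, Message-ID)


def renumber_after_rewrite(
    messages: Iterable[Tuple[str, str]],
    unindexed: Unindexed,
    next_num: int,
) -> Tuple[List[Entry], List[Entry], int]:
    """Reindex (oid, message_id) pairs after a history rewrite.

    Returns (assigned, to_pipe, next_num). Entries left in `unindexed`
    afterwards are blobs that no longer exist in the rewritten history.
    """
    assigned: List[Entry] = []
    to_pipe: List[Entry] = []
    for oid, mid in messages:
        prev = unindexed.pop(oid, None)
        if prev is not None:
            # Same blob as before: reuse the old number so readers
            # (and the piped-to program) do not see it twice.
            num = prev[0]
        else:
            # Unknown OID: new or edited content, give it a new number
            # and mark it to be piped.
            num = next_num
            next_num += 1
            to_pipe.append((num, oid, mid))
        assigned.append((num, oid, mid))
    return assigned, to_pipe, next_num
```

With a Message-ID-keyed mapping, an edited message would be matched to
its old number and look like a duplicate; keying on OID is what lets the
edited version show up as something new while untouched messages stay
put.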