Date: Mon, 28 Jan 2019 03:32:07 +0000
From: Eric Wong
To: Konstantin Ryabitsev
Cc: meta@public-inbox.org
Subject: Re: RFC: Using public-inbox v2 repos for distributed patch lifecycle tracking
Message-ID: <20190128033207.jz7oxdsl7cjxjfan@dcvr>

Konstantin Ryabitsev wrote:
> Hi, all:
>
> Here's something I've been mulling over for a while, and sorta goes
> hand-in-hand with what Eric has been doing with diff highlights work.
> We have significant duplication of functionality on lore.kernel.org
> between the public-inbox repository for LKML and the patchwork
> instance for the same list (https://lore.kernel.org/patchwork). I am
> currently working on a tool that would move a lot of patchwork's
> functionality to the developers' workstation by using the public-inbox
> archive of the mailing list that receives patches. After it is done,
> the tool should be able to do all of the following:
>
> - Keep track of patches and patch series, including series revisions
> - Allow developers to easily apply patch series directly from the
>   public-inbox repository (by creating an mbox file of the series behind
>   the scenes with adjusted git by-lines -- e.g.
>   if someone replies to a patch with an "Acked-by: Foo Dev" or
>   "Tested-by: Foo Bot", the original patch trailers are adjusted to
>   reflect that new data)
> - Automatically recognising when patches have been applied to a repo
>   and auto-"accepting" them
> - Allowing developers to see changes between series using interdiff

Cool! I haven't looked into patchwork much, but it looks like a
centralized system that can't easily be replicated to every
developer's system? And your tool could be seen as a
local/replicable version of patchwork?

> Currently, the tool sticks various patch tracking information into a
> custom sqlite3 db, but I'm increasingly wondering if it makes more
> sense to have this machine-parseable data available as part of the
> repository itself -- say, as a git note on the commit-id of the
> message. In other words, if the patch is in a message in abcd1234:m, a
> refs/notes/patches entry for abcd1234 would have json-formatted data
> with patch tracking information similar to what patchwork has for it
> (see, for example,
> https://lore.kernel.org/patchwork/api/1.1/patches/1035986/), but
> without patchwork-specific bits and duplicate info like headers and
> patch contents. If we add these notes on lore.kernel.org itself, then
> we save a lot of redundant data processing on the client end. A
> developer who has mirror-cloned a public-inbox archive of a
> development mailing list would be able to start applying patches and
> series without having to parse tens of thousands of messages.

How much of this data is reproducible/recreatable using existing
mails + git repos?

If all of it is reproducible, I don't see storing extra/redundant
data in git as necessary... Especially since SQLite supports
online backup.

The overhead for parsing a lot of mails could be alleviated by
offering DB snapshots for download and seeding of developer
machines.
Incremental updates would be similar to public-inbox:

    git fetch && $FOO-index

(To that end, public-inbox could offer a way to download the
msgmap and overview DBs, too; but offering Xapian DB downloads
would be tough, as it doesn't offer online backups/snapshots).

> I have limited experience with notes, however, and I'm curious if they
> are a good candidate for such a task. A public-inbox repo of LKML, even
> after sharding, contains hundreds of thousands of messages. If many of
> them carry such notes, would that significantly increase the
> repository size and reduce its performance?

I'm not familiar with notes, either (or I forgot what I knew).
However, I know the pain points w.r.t. git performance are:

1) too many refs
2) big trees

At a quick glance, git notes seems to hit #2 by having a giant
top-level tree in "refs/notes/commits".

There's been some work in git.git over the years to fix #1
with reftable, reftrees or lmdb; but none of them is an accepted
solution in git.git yet.

> I have similar thoughts about publishing CI-related information as
> notes to public-inbox commits, as well. This would allow centralised
> archival of distributed CI efforts, which is a common problem in
> distributed projects. Currently, bots flood lists with automated email
> as patch follow-ups, but if they could publish their CI information as
> refs/notes/ci/projectname in a public repository, we could mirror that
> back to lore.kernel.org via regular pulls of that ref for each defined
> project and developers would be able to view CI reports from multiple
> distributed projects inside the same interface.

I'm thinking this can be done with the current public-inbox
using custom search queries (or perhaps enhancements/customizations
to search). Bots would still be flooding lists with automated mails,
but their emails could be searched by patch URL/Message-ID/Subject
and made into a CI report.

> Again, I'm not sure if git notes is the right tool for this.
> Another way to go about it would be to use something like a custom
> blockchain ledger, but I'm afraid that picking that would result in
> lower adoption rate due to general unease people have about blockchain
> (it's new, it's overhyped, and it's mired by association with
> cryptocoins).

Just call it "git" instead of "blockchain" :)

Philosophically, I am heavily influenced by the choices git made
around rename-detection and delta-compression. Doing these things
after-the-fact allows freedom for improvements later on as
algorithms improve.

To that end, public-inbox stores minimal information up-front
and relies heavily on search indices for after-the-fact querying.

Right now, one could search by a patch subject plus "Acked-by"
or "Tested-by" in non-quoted text:

    s:"$SUBJECT" AND ( nq:Acked-by OR nq:Tested-by )

More advanced queries are possible, of course; and the way
data is indexed is not set in stone at all (we're up to
"xap15", already).

> I'd love to hear your thoughts. One of my goals is to find a way to
> keep the distributed nature of Linux Kernel development without
> locking it into a single vendor (like GitHub) or a suite of tools
> (like GitLab, CircleCI, whatnot). We need to find a way to preserve
> and archive the data generated by such tools in a way that is easy to
> replicate and verify.

Good to know our goals are aligned. Beyond the Linux kernel,
my goal since 2004(*) has been to make ALL development distributed
and free from any lock-in. Sadly, the world's gone the opposite
direction :<

(*) with svn-arch-mirror and git-svn
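
To illustrate the after-the-fact approach (deriving Acked-by/Tested-by
from replies instead of storing them up front), here is a rough Python
sketch. It is not part of public-inbox: the function names
build_query() and collect_trailers() are made up for this example, and
dropping double quotes from the subject is only an assumption about
how one might keep the phrase query well-formed. The query string
itself follows the s:/nq: example above.

```python
import re

# Only the two trailers mentioned in this thread; extend as needed.
TRAILER_RE = re.compile(r"^(Acked-by|Tested-by):\s*(.+?)\s*$")


def build_query(subject):
    """Build a search string of the form shown above for one patch subject.

    Double quotes are dropped from the subject so the phrase stays
    well-formed (an assumption of this sketch, not a documented rule).
    """
    phrase = subject.replace('"', "")
    return f's:"{phrase}" AND ( nq:Acked-by OR nq:Tested-by )'


def collect_trailers(reply_body):
    """Return (trailer, value) pairs found in non-quoted lines of a reply."""
    found = []
    for line in reply_body.splitlines():
        stripped = line.strip()
        if stripped.startswith(">"):
            continue  # quoted text from an earlier message, ignore it
        match = TRAILER_RE.match(stripped)
        if match:
            found.append((match.group(1), match.group(2)))
    return found
```

A local tool could run build_query() against the search interface,
feed each matching reply body to collect_trailers(), and append the
results to the patch before applying it; nothing extra needs to be
stored in git for that.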