user/dev discussion of public-inbox itself
From: Eric Wong <>
To: Konstantin Ryabitsev <>
Subject: Re: RFC: Using public-inbox v2 repos for distributed patch lifecycle tracking
Date: Mon, 28 Jan 2019 03:32:07 +0000
Message-ID: <20190128033207.jz7oxdsl7cjxjfan@dcvr>
In-Reply-To: <>

Konstantin Ryabitsev <> wrote:
> Hi, all:
> Here's something I've been mulling over for a while, and sorta goes
> hand-in-hand with what Eric has been doing with diff highlights work.
> We have significant duplication of functionality on
> between the public-inbox repository for LKML and the patchwork
> instance for the same list ( I am
> currently working on a tool that would move a lot of patchwork's
> functionality to the developers' workstation by using the public-inbox
> archive of the mailing list that receives patches. After it is done,
> the tool should be able to do all of the following:
> - Keep track of patches and patch series, including series revisions
> - Allow developers to easily apply patch series directly from the
> public-inbox repository (by creating a mbox file of the series behind
> the scenes with adjusted git by-lines -- e.g. if someone replies to a
> patch with an "Acked-by: Foo Dev" or "Tested-by: Foo Bot", the original
> patch trailers are adjusted to reflect that new data)
> - Automatically recognising when patches have been applied to a repo
> and auto-"accepting" them
> - Allowing developers to see changes between series using interdiff

Cool!  I haven't looked into patchwork much, but it looks like
a centralized system that can't easily be replicated on every
developer's system?

And your tool could be seen as a local/replicable version of

> Currently, the tool sticks various patch tracking information into a
> custom sqlite3 db, but I'm increasingly wondering if it makes more
> sense to have this machine-parseable data available as part of the
> repository itself -- say, as a git note on the commit-id of the
> message. In other words, if the patch is in a message in abcd1234:m, a
> refs/notes/patches entry for abcd1234 would have json-formatted data
> with patch tracking information similar to what patchwork has for it
> (see, for example,
>, but
> without patchwork-specific bits and duplicate info like headers and
> patch contents. If we add these notes on itself, then
> we save a lot of redundant data processing on the client end. A
> developer who has mirror-cloned a public-inbox archive of a
> development mailing list would be able to start applying patches and
> series without having to parse tens of thousands of messages.

How much of this data is reproducible/recreatable using existing
mails + git repos?  If all of it is reproducible, I don't see
storing extra/redundant data in git as necessary...

Especially since SQLite supports online backup.  The overhead for
parsing a lot of mails could be alleviated by offering DB snapshots
for download and seeding of developer machines.  Incremental updates
would be similar to public-inbox: git fetch && $FOO-index

(To that end, public-inbox could offer a way to download the msgmap and
 overview DBs, too; but offering Xapian DB downloads would be tough,
 as it doesn't offer online backups/snapshots).
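A rough sketch of what producing such a snapshot could look like,
using the online backup API as exposed by Python's sqlite3 module.
The function name and the file paths are hypothetical; this is not
something public-inbox or the proposed tool ships today:

```python
import sqlite3


def snapshot_db(src_path, dst_path):
    """Copy a live SQLite DB to dst_path using the online backup API.

    Writers may keep using the source DB while the copy is made.
    Both paths are placeholders chosen for illustration.
    """
    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(dst_path)
    try:
        # Connection.backup() copies every page of src into dst
        src.backup(dst)
    finally:
        dst.close()
        src.close()
```

The resulting file could then be offered for download to seed a new
developer machine, with incremental updates handled by fetch + index
as described above.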

> I have limited experience with notes, however, and I'm curious if they
> are a good candidate for such task. A public-inbox repo of LKML, even
> after sharding, contains hundreds of thousands of messages. If many of
> them carry such notes, would that significantly increase the
> repository size and reduce its performance?

I'm not familiar with notes, either (or I forgot what I knew).
However, I know pain points w.r.t. git performance are:

1) too many refs
2) big trees

On a quick glance, git notes seem to hit #2 by having a giant
top-level tree in "refs/notes/commits".  There's been some
work in git.git over the years to fix #1 with reftable, reftrees
or lmdb; but it's not yet an accepted solution in git.git.

> I have similar thoughts about publishing CI-related information as
> notes to public-inbox commits, as well. This would allow centralised
> archival of distributed CI efforts, which is a common problem in
> distributed projects. Currently, bots flood lists with automated email
> as patch follow-ups, but if they could publish their CI information as
> refs/notes/ci/projectname in a public repository, we could mirror that
> back to via regular pulls of that ref for each defined
> projects and developers would be able to view CI reports from multiple
> distributed projects inside the same interface.

I'm thinking this can be done with the current public-inbox using
custom search queries (or perhaps enhancements/customizations to search).

Bots would still be flooding lists with automated mails, but their
emails could be searched by patch URL/Message-ID/Subject and made into
a CI report.
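
As a minimal sketch of the "made into a CI report" step, assuming the
bot replies have already been fetched as raw messages (e.g. from a
search result), one could group them by the patch they reply to.
The function name and the report layout are made up for illustration:

```python
from collections import defaultdict
from email import message_from_string


def ci_report(raw_messages):
    """Group raw RFC 822 messages by the Message-ID they reply to.

    Returns {parent Message-ID: [(From, Subject), ...]}.
    Messages without an In-Reply-To header are skipped.
    """
    report = defaultdict(list)
    for raw in raw_messages:
        msg = message_from_string(raw)
        parent = msg.get("In-Reply-To")
        if parent is None:
            continue
        report[parent.strip()].append(
            (msg.get("From", ""), msg.get("Subject", ""))
        )
    return dict(report)
```

Nothing here needs to be stored up front; the report is derived from
the mails themselves whenever it is wanted.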

> Again, I'm not sure if git notes is the right tool for this. Another
> way to go about it would be to use something like a custom blockchain
> ledger, but I'm afraid that picking that would result in lower
> adoption rate due to general unease people have about blockchain (it's
> new, it's overhyped, and it's mired by association with cryptocoins).

Just call it "git" instead of "blockchain" :)

Philosophically, I am heavily influenced by the choices git made around
rename-detection and delta-compression.  Doing these things
after-the-fact allows freedom for improvements later on as algorithms

To that end, public-inbox stores minimal information up-front and relies
heavily on search indices for after-the-fact querying.

Right now, one could search by a patch subject + "Acked-by" in
non-quoted text:

	s:"$SUBJECT" AND ( nq:Acked-by OR nq:Tested-by )

More advanced queries are possible, of course; and the way data is
indexed is not set in stone at all (we're up to "xap15", already).
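
For a client-side tool, building that query could look roughly like
the sketch below.  The helper names and the base URL are hypothetical,
and passing the query as a "q" URL parameter is an assumption about
the web interface rather than a documented contract:

```python
from urllib.parse import urlencode


def build_trailer_query(subject, trailers=("Acked-by", "Tested-by")):
    """Build a query matching a subject plus trailers in non-quoted text."""
    # Drop double quotes so the subject stays a single quoted phrase
    phrase = subject.replace('"', "")
    alternatives = " OR ".join(f"nq:{t}" for t in trailers)
    return f's:"{phrase}" AND ( {alternatives} )'


def search_url(base_url, query):
    """Append the query to a (hypothetical) inbox URL as ?q=..."""
    return base_url.rstrip("/") + "/?" + urlencode({"q": query})
```

Extra trailers (Reviewed-by, Reported-by, ...) can be passed via the
trailers argument without changing how anything is stored.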

> I'd love to hear your thoughts. One of my goals is to find a way to
> keep the distributed nature of Linux Kernel development without
> locking it into a single vendor (like GitHub) or a suite of tools
> (like GitLab, CircleCI, whatnot). We need to find a way to preserve
> and archive the data generated by such tools in a way that is easy to
> replicate and verify.

Good to know our goals are aligned.  Beyond the Linux kernel, my goal
since 2004(*) was to make ALL development distributed and free from
any lock-in.  Sadly, the world's gone the opposite direction :<

(*) with svn-arch-mirror and git-svn

Thread overview: 2+ messages
2019-01-27 18:37 RFC: Using public-inbox v2 repos for distributed patch lifecycle tracking Konstantin Ryabitsev
2019-01-28  3:32 ` Eric Wong [this message]
