RFC: Using public-inbox v2 repos for distributed patch lifecycle tracking


* RFC: Using public-inbox v2 repos for distributed patch lifecycle tracking
@ 2019-01-27 18:37 Konstantin Ryabitsev
  2019-01-28  3:32 ` Eric Wong
  0 siblings, 1 reply; 2+ messages in thread
From: Konstantin Ryabitsev @ 2019-01-27 18:37 UTC
  To: meta

Hi, all:

Here's something I've been mulling over for a while, and sorta goes
hand-in-hand with what Eric has been doing with diff highlights work.
We have significant duplication of functionality on lore.kernel.org
between the public-inbox repository for LKML and the patchwork
instance for the same list (https://lore.kernel.org/patchwork). I am
currently working on a tool that would move a lot of patchwork's
functionality to the developers' workstation by using the public-inbox
archive of the mailing list that receives patches. After it is done,
the tool should be able to do all of the following:

- Keep track of patches and patch series, including series revisions
- Allow developers to easily apply patch series directly from the
public-inbox repository (by creating an mbox file of the series behind
the scenes with adjusted git by-lines -- e.g. if someone replies to a
patch with an "Acked-by: Foo Dev" or "Tested-by: Foo Bot", the original
patch trailers are adjusted to reflect that new data)
- Automatically recognise when patches have been applied to a repo
and auto-"accept" them
- Allow developers to see changes between series using interdiff
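
As a rough illustration of the trailer adjustment mentioned in the list
above, here is a minimal Python sketch. It is not the actual tool: the
helper names and the set of recognised trailers are assumptions made for
this example. It picks trailers out of the non-quoted lines of a reply
and appends any new ones to the original commit message.

```python
import re

# Trailer names collected from replies; this set is an assumption.
TRAILER_RE = re.compile(r"^(Acked-by|Tested-by|Reviewed-by):\s*(.+)$")


def collect_trailers(reply_body):
    """Return trailers found in the non-quoted lines of a reply."""
    found = []
    for line in reply_body.splitlines():
        if line.startswith(">"):
            continue  # skip text quoted from the original patch
        m = TRAILER_RE.match(line.strip())
        if m:
            found.append(f"{m.group(1)}: {m.group(2).strip()}")
    return found


def add_trailers(commit_message, trailers):
    """Append trailers that the commit message does not already carry."""
    lines = commit_message.rstrip("\n").split("\n")
    for trailer in trailers:
        if trailer not in lines:
            lines.append(trailer)
    return "\n".join(lines) + "\n"
```

Skipping quoted lines matters here, because a reply usually quotes the
trailers the patch already has, and those should not be counted twice.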

Currently, the tool sticks various patch tracking information into a
custom sqlite3 db, but I'm increasingly wondering if it makes more
sense to have this machine-parseable data available as part of the
repository itself -- say, as a git note on the commit-id of the
message. In other words, if the patch is in a message in abcd1234:m, a
refs/notes/patches entry for abcd1234 would have json-formatted data
with patch tracking information similar to what patchwork has for it
(see, for example,
https://lore.kernel.org/patchwork/api/1.1/patches/1035986/), but
without patchwork-specific bits and duplicate info like headers and
patch contents. If we add these notes on lore.kernel.org itself, then
we save a lot of redundant data processing on the client end. A
developer who has mirror-cloned a public-inbox archive of a
development mailing list would be able to start applying patches and
series without having to parse tens of thousands of messages.
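
To make the idea concrete, here is a small Python sketch of what such a
note could look like and how it might be attached. The JSON field names
and the repository path are hypothetical; only refs/notes/patches and
the abcd1234 commit come from the description above. The second function
only builds the "git notes" command line, it does not run it.

```python
import json


def build_patch_note(state, series_id, revision, trailers):
    """Serialise tracking data; every field name here is hypothetical."""
    note = {
        "state": state,
        "series": series_id,
        "revision": revision,
        "trailers": sorted(trailers),
    }
    return json.dumps(note, sort_keys=True, indent=2)


def notes_add_command(git_dir, commit, note_text):
    """Return the git argv that would attach note_text to commit."""
    return [
        "git", f"--git-dir={git_dir}",
        "notes", "--ref", "refs/notes/patches",
        "add", "-f", "-m", note_text, commit,
    ]
```

For example, notes_add_command("/path/to/lkml/git/0.git", "abcd1234",
build_patch_note("new", "series-1", 2, [])) yields an argv that could be
handed to subprocess.run().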

I have limited experience with notes, however, and I'm curious if they
are a good candidate for such a task. A public-inbox repo of LKML, even
after sharding, contains hundreds of thousands of messages. If many of
them carry such notes, would that significantly increase the
repository size and reduce its performance?

I have similar thoughts about publishing CI-related information as
notes to public-inbox commits, as well. This would allow centralised
archival of distributed CI efforts, which is a common problem in
distributed projects. Currently, bots flood lists with automated email
as patch follow-ups, but if they could publish their CI information as
refs/notes/ci/projectname in a public repository, we could mirror that
back to lore.kernel.org via regular pulls of that ref for each defined
project, and developers would be able to view CI reports from multiple
distributed projects inside the same interface.
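
A sketch of the mirroring side, in Python, assuming one notes ref per
project named as above. The project names and URLs passed in are
placeholders, and the functions only build "git fetch" command lines
rather than running them.

```python
def ci_notes_refspec(project):
    """Refspec mirroring one project's CI notes ref under the same name."""
    ref = f"refs/notes/ci/{project}"
    # The leading "+" allows non-fast-forward updates; dropping it would
    # make the fetch refuse rewritten history. Which is preferable is an
    # open design choice.
    return f"+{ref}:{ref}"


def fetch_commands(sources):
    """Map {project: remote_url} to one 'git fetch' argv per project."""
    return [
        ["git", "fetch", url, ci_notes_refspec(project)]
        for project, url in sorted(sources.items())
    ]
```

Each resulting command would be run periodically inside the
public-inbox repository that carries the notes.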

Again, I'm not sure if git notes is the right tool for this. Another
way to go about it would be to use something like a custom blockchain
ledger, but I'm afraid that picking that would result in lower
adoption rate due to general unease people have about blockchain (it's
new, it's overhyped, and it's mired by association with cryptocoins).

I'd love to hear your thoughts. One of my goals is to find a way to
keep the distributed nature of Linux Kernel development without
locking it into a single vendor (like GitHub) or a suite of tools
(like GitLab, CircleCI, whatnot). We need to find a way to preserve
and archive the data generated by such tools in a way that is easy to
replicate and verify.

-K


* Re: RFC: Using public-inbox v2 repos for distributed patch lifecycle tracking
  2019-01-27 18:37 RFC: Using public-inbox v2 repos for distributed patch lifecycle tracking Konstantin Ryabitsev
@ 2019-01-28  3:32 ` Eric Wong
  0 siblings, 0 replies; 2+ messages in thread
From: Eric Wong @ 2019-01-28  3:32 UTC
  To: Konstantin Ryabitsev; +Cc: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> Hi, all:
> 
> Here's something I've been mulling over for a while, and sorta goes
> hand-in-hand with what Eric has been doing with diff highlights work.
> We have significant duplication of functionality on lore.kernel.org
> between the public-inbox repository for LKML and the patchwork
> instance for the same list (https://lore.kernel.org/patchwork). I am
> currently working on a tool that would move a lot of patchwork's
> functionality to the developers' workstation by using the public-inbox
> archive of the mailing list that receives patches. After it is done,
> the tool should be able to do all of the following:
> 
> - Keep track of patches and patch series, including series revisions
> - Allow developers to easily apply patch series directly from the
> public-inbox repository (by creating an mbox file of the series behind
> the scenes with adjusted git by-lines -- e.g. if someone replies to a
> patch with an "Acked-by: Foo Dev" or "Tested-by: Foo Bot", the original
> patch trailers are adjusted to reflect that new data)
> - Automatically recognise when patches have been applied to a repo
> and auto-"accept" them
> - Allow developers to see changes between series using interdiff

Cool!  I haven't looked into patchwork much, but it looks like
a centralized system that can't easily be replicated
to every developer system?

And your tool could be seen as a local/replicable version of
patchwork?

> Currently, the tool sticks various patch tracking information into a
> custom sqlite3 db, but I'm increasingly wondering if it makes more
> sense to have this machine-parseable data available as part of the
> repository itself -- say, as a git note on the commit-id of the
> message. In other words, if the patch is in a message in abcd1234:m, a
> refs/notes/patches entry for abcd1234 would have json-formatted data
> with patch tracking information similar to what patchwork has for it
> (see, for example,
> https://lore.kernel.org/patchwork/api/1.1/patches/1035986/), but
> without patchwork-specific bits and duplicate info like headers and
> patch contents. If we add these notes on lore.kernel.org itself, then
> we save a lot of redundant data processing on the client end. A
> developer who has mirror-cloned a public-inbox archive of a
> development mailing list would be able to start applying patches and
> series without having to parse tens of thousands of messages.

How much of this data is reproducible/recreatable using existing
mails + git repos?  If all of it is reproducible, I don't see
storing extra/redundant data in git as necessary...

Especially since SQLite supports online backup.  The overhead for
parsing a lot of mails could be alleviated by offering DB snapshots
for download and seeding of developer machines.  Incremental updates
would be similar to public-inbox: git fetch && $FOO-index

(To that end, public-inbox could offer a way to download the msgmap and
 overview DBs, too; but offering Xapian DB downloads would be tough,
 as it doesn't offer online backups/snapshots).
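
For what it's worth, the online backup is exposed in Python's sqlite3
module as Connection.backup(), so producing a snapshot could look
roughly like the sketch below. The function names are made up for this
example, and how the snapshot file gets published is left out.

```python
import sqlite3


def snapshot(src, dest):
    """Copy every page of the live database src into dest."""
    src.backup(dest)


def snapshot_to_file(src, dest_path):
    """Write a snapshot of src to dest_path, e.g. for offering as a download."""
    dest = sqlite3.connect(dest_path)
    try:
        snapshot(src, dest)
    finally:
        dest.close()
```

A developer machine would download the snapshot once, then keep it
current with the incremental fetch-and-index step.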

> I have limited experience with notes, however, and I'm curious if they
> are a good candidate for such a task. A public-inbox repo of LKML, even
> after sharding, contains hundreds of thousands of messages. If many of
> them carry such notes, would that significantly increase the
> repository size and reduce its performance?

I'm not familiar with notes, either (or I forgot what I knew).
However, I know pain points w.r.t. git performance are:

1) too many refs
2) big trees

On a quick glance, git notes seems to hit #2 by having a giant
top-level tree in "refs/notes/commits".  There's been some
work in git.git over the years to fix #1 with reftable, reftrees
or lmdb; but none of it is an accepted solution in git.git yet.

> I have similar thoughts about publishing CI-related information as
> notes to public-inbox commits, as well. This would allow centralised
> archival of distributed CI efforts, which is a common problem in
> distributed projects. Currently, bots flood lists with automated email
> as patch follow-ups, but if they could publish their CI information as
> refs/notes/ci/projectname in a public repository, we could mirror that
> back to lore.kernel.org via regular pulls of that ref for each defined
> project, and developers would be able to view CI reports from multiple
> distributed projects inside the same interface.

I'm thinking this can be done with the current public-inbox using
custom search queries (or perhaps enhancements/customizations to search).

Bots would still be flooding lists with automated mails, but their
emails could be searched by patch URL/Message-ID/Subject and made into
a CI report.

> Again, I'm not sure if git notes is the right tool for this. Another
> way to go about it would be to use something like a custom blockchain
> ledger, but I'm afraid that picking that would result in lower
> adoption rate due to general unease people have about blockchain (it's
> new, it's overhyped, and it's mired by association with cryptocoins).

Just call it "git" instead of "blockchain" :)

Philosophically, I am heavily influenced by the choices git made around
rename-detection and delta-compression.  Doing these things
after-the-fact allows freedom for improvements later on as algorithms
improve.

To that end, public-inbox stores minimal information up-front and relies
heavily on search indices for after-the-fact querying.

Right now, one could search by a patch subject + "Acked-by" in
non-quoted text:

	s:"$SUBJECT" AND ( nq:Acked-by OR nq:Tested-by )

More advanced queries are possible, of course; and the way data is
indexed is not set in stone at all (we're already up to "xap15").
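
As a rough sketch, a client could build that query string and a
matching search URL like this (Python). The function names are
hypothetical, and the "?q=" URL form is assumed to match how the inbox's
web search accepts queries; the s: and nq: prefixes are the ones used in
the query above.

```python
from urllib.parse import quote


def trailer_query(subject, trailers=("Acked-by", "Tested-by")):
    """Build the search string shown above for a given patch subject."""
    alternatives = " OR ".join(f"nq:{t}" for t in trailers)
    return f's:"{subject}" AND ( {alternatives} )'


def search_url(inbox_url, query):
    """Append the query as a ?q= parameter to an inbox URL."""
    return inbox_url.rstrip("/") + "/?q=" + quote(query, safe="")
```

A tool tracking patches would run one such query per patch subject and
derive the trailers from the matching replies after the fact, instead of
storing them up-front.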

> I'd love to hear your thoughts. One of my goals is to find a way to
> keep the distributed nature of Linux Kernel development without
> locking it into a single vendor (like GitHub) or a suite of tools
> (like GitLab, CircleCI, whatnot). We need to find a way to preserve
> and archive the data generated by such tools in a way that is easy to
> replicate and verify.

Good to know our goals are aligned.  Beyond the Linux kernel, my goal
since 2004(*) was to make ALL development distributed and free from
any lock-in.  Sadly, the world's gone the opposite direction :<

(*) with svn-arch-mirror and git-svn


