user/dev discussion of public-inbox itself
* Recording archiver origins in git
@ 2021-06-28 21:26 Konstantin Ryabitsev
  2021-06-28 22:12 ` Eric Wong
  0 siblings, 1 reply; 5+ messages in thread
From: Konstantin Ryabitsev @ 2021-06-28 21:26 UTC
  To: meta

Hello:

I'm working away on grokmirror+public-inbox replication, and I'm trying to
come up with a good solution for passing the "archiver origins" info. In
examples/grok-pull.post_update_hook.sh, we try to get this information out of
a curl call to the clone origin, but this may not be reliable for a number of
reasons:

1. we may be cloning from an intermediary location that only serves the git
   repositories and the manifest file (e.g. erol.kernel.org)
2. the call may retrieve information relevant to the intermediary, and not to
   the origins of the archive

I'm thinking of including a special location in the git repo itself for
passing some of the same info currently found in the config snippet, e.g. in
refs/meta/origins.

Imaginary code snippet:

$ git show refs/meta/origins:i
[metadata]
source = smtp
listaddress = linux-kernel@vger.kernel.org
listid = linux-kernel.vger.kernel.org
archive-url = https://lore.kernel.org/linux-kernel
archive-contact = postmaster@kernel.org

This way, even if the archive gets mirrored around a bunch of times, it's
still possible to track where it originated, and if the original archive info
becomes obsolete, someone can update the information without it affecting the
rest of the archive.

Does that sound sane?

-K
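
A minimal sketch of how a hook might read such a record, assuming the blob
lives at refs/meta/origins:i as in the imaginary snippet above. Nothing in
public-inbox or grokmirror defines this ref; the function names
parse_origins and read_origins are hypothetical, and the parser handles only
plain "[section]" headers and "key = value" lines, not full git-config syntax
(no subsections, quoting, or line continuations).

```python
import subprocess


def parse_origins(text):
    """Parse a simplified git-config-style blob.

    Returns {section: {key: [values, ...]}}; values are kept as lists so
    that a key may appear more than once.
    """
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            current = sections.setdefault(line[1:-1].strip().lower(), {})
        elif current is not None and "=" in line:
            key, value = line.split("=", 1)
            current.setdefault(key.strip().lower(), []).append(value.strip())
    return sections


def read_origins(git_dir, spec="refs/meta/origins:i"):
    """Return the parsed record, or None if `git show` fails.

    The default spec is the hypothetical location from the proposal.
    """
    proc = subprocess.run(
        ["git", f"--git-dir={git_dir}", "show", spec],
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0:
        return None
    return parse_origins(proc.stdout)
```

Because the record travels inside the repository, a mirror of a mirror could
call read_origins() on its local clone without contacting the clone origin.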


* Re: Recording archiver origins in git
  2021-06-28 21:26 Recording archiver origins in git Konstantin Ryabitsev
@ 2021-06-28 22:12 ` Eric Wong
  2021-06-29 12:56   ` Konstantin Ryabitsev
  0 siblings, 1 reply; 5+ messages in thread
From: Eric Wong @ 2021-06-28 22:12 UTC
  To: Konstantin Ryabitsev; +Cc: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> Hello:
> 
> I'm working away on grokmirror+public-inbox replication, and I'm trying to
> come up with a good solution for passing the "archiver origins" info. In
> examples/grok-pull.post_update_hook.sh, we try to get this information out of
> a curl call to the clone origin, but this may not be reliable for a number of
> reasons:
> 
> 1. we may be cloning from an intermediary location that only serves the git
>    repositories and the manifest file (e.g. erol.kernel.org)
> 2. the call may retrieve information relevant to the intermediary, and not to
>    the origins of the archive
> 
> I'm thinking of including a special location in the git repo itself for
> passing some of the same info currently found in the config snippet, e.g. in
> refs/meta/origins.

> Imaginary code snippet:
> 
> $ git show refs/meta/origins:i
> [metadata]
> source = smtp

Is "source" necessary?  It seems like something that could be
in the "description" file or noted in the contents of
publicinbox.$NAME.infourl.

> listaddress = linux-kernel@vger.kernel.org
> listid = linux-kernel.vger.kernel.org
> archive-url = https://lore.kernel.org/linux-kernel
> archive-contact = postmaster@kernel.org

I think the keys should match what we use in the config file, at
least.  So s/listaddress/address/ and s/archive-url/url/
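
A small sketch of that renaming, assuming the record has already been read
into a plain dict. Only the two substitutions named above are taken from
this thread; the names KEY_RENAMES and normalize_keys are hypothetical.

```python
# Map the keys from the imaginary snippet onto the names suggested above.
KEY_RENAMES = {
    "listaddress": "address",
    "archive-url": "url",
}


def normalize_keys(record):
    """Return a copy of record with the proposed keys renamed.

    Keys without an entry in KEY_RENAMES are passed through unchanged.
    """
    return {KEY_RENAMES.get(key, key): value for key, value in record.items()}
```

With matching names, a value from the record could be copied into the
corresponding publicinbox.$NAME.* setting without a translation table.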

I'm not sure if "contact" is necessary if the aforementioned
"infourl" exists.

> This way, even if the archive gets mirrored around a bunch of times, it's
> still possible to track where it originated, and if the original archive info
> becomes obsolete, someone can update the information without it affecting the
> rest of the archive.
> 
> Does that sound sane?

I think so.  Only the latest epoch would be taken into account,
I suppose.


* Re: Recording archiver origins in git
  2021-06-28 22:12 ` Eric Wong
@ 2021-06-29 12:56   ` Konstantin Ryabitsev
  2021-06-29 19:59     ` Eric Wong
  0 siblings, 1 reply; 5+ messages in thread
From: Konstantin Ryabitsev @ 2021-06-29 12:56 UTC
  To: Eric Wong; +Cc: meta

On Mon, Jun 28, 2021 at 10:12:36PM +0000, Eric Wong wrote:

Hope you're finding ways of staying cool and sane. It's hot here on the East
coast, but a) we're used to it, and b) it's not yikes degrees.

> > Imaginary code snippet:
> > 
> > $ git show refs/meta/origins:i
> > [metadata]
> > source = smtp
> 
> Is "source" necessary?  It seems like something that could be
> in the "description" file or noted in the contents of
> publicinbox.$NAME.infourl.

I wasn't sure what infourl was used for. :) Is it supposed to contain
structured data, or is it more of a "click here for more info" kind of thing?

> > listaddress = linux-kernel@vger.kernel.org
> > listid = linux-kernel.vger.kernel.org
> > archive-url = https://lore.kernel.org/linux-kernel
> > archive-contact = postmaster@kernel.org
> 
> I think the keys should match what we use in the config file, at
> least.  So s/listaddress/address/ and s/archive-url/url/

Okay.

> I'm not sure if "contact" is necessary if the aforementioned
> "infourl" exists.

My thinking is that with mirrors of mirrors of mirrors, if someone submits a
GDPR removal request, then there should be an easy way of figuring out where
these requests should actually go. Maybe infourl can cover this, but it's less
likely to be set up than an email address like postmaster@. So, I'm in favour
of keeping that in the info record.

> > Does that sound sane?
> 
> I think so.  Only the latest epoch would be taken into account,
> I suppose.

I'm actually thinking the other way around -- only the 0.git (or whatever is
the lowest number), simply because it's more likely to have that record. That
is, unless you want to add functionality to automatically copy these from
previous epochs on auto-rotation events.

-K


* Re: Recording archiver origins in git
  2021-06-29 12:56   ` Konstantin Ryabitsev
@ 2021-06-29 19:59     ` Eric Wong
  2021-06-29 20:10       ` Konstantin Ryabitsev
  0 siblings, 1 reply; 5+ messages in thread
From: Eric Wong @ 2021-06-29 19:59 UTC
  To: Konstantin Ryabitsev; +Cc: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> On Mon, Jun 28, 2021 at 10:12:36PM +0000, Eric Wong wrote:
> 
> Hope you're finding ways of staying cool and sane. It's hot here on the East
> coast, but a) we're used to it, and b) it's not yikes degrees.

Bugs are crazy invasive, again :<

> > > Imaginary code snippet:
> > > 
> > > $ git show refs/meta/origins:i
> > > [metadata]
> > > source = smtp
> > 
> > Is "source" necessary?  It seems like something that could be
> > in the "description" file or noted in the contents of
> > publicinbox.$NAME.infourl.
> 
> I wasn't sure what infourl was used for. :) Is it supposed to contain
> structured data, or is it more of a "click here for more info" kind of thing?

It's "click here for more info", so freeform.  It should
probably be added to the footer of every HTML page.

> > > listaddress = linux-kernel@vger.kernel.org
> > > listid = linux-kernel.vger.kernel.org
> > > archive-url = https://lore.kernel.org/linux-kernel
> > > archive-contact = postmaster@kernel.org
> > 
> > I think the keys should match what we use in the config file, at
> > least.  So s/listaddress/address/ and s/archive-url/url/
> 
> Okay.
> 
> > I'm not sure if "contact" is necessary if the aforementioned
> > "infourl" exists.
> 
> My thinking is that with mirrors of mirrors of mirrors, if someone submits a
> GDPR removal request, then there should be an easy way of figuring out where
> these requests should actually go. Maybe infourl can cover this, but it's less
> likely to be set up than an email address like postmaster@. So, I'm in favour
> of keeping that in the info record.

I'm a little worried about it having too many directions
(e.g. webmaster vs postmaster) and something freeform like
infourl can cover it.  Or maybe something with user-defined
labels would work:

	contact = postmaster mail-admin@example.com
	contact = GDPR https://our-lawyers.example.com/form.cgi
	contact = webmaster web-admin@example.com

> > > Does that sound sane?
> > 
> > I think so.  Only the latest epoch would be taken into account,
> > I suppose.
> 
> I'm actually thinking the other way around -- only the 0.git (or whatever is
> the lowest number), simply because it's more likely to have that record. That
> is, unless you want to add functionality to automatically copy these from
> previous epochs on auto-rotation events.

It should be easy to change this info down-the-line in case of
project name/hosting changes, forks, etc.  So whatever's latest
should supersede the older.  It would also allow partial mirrors
to work with only the latest epoch(s) (I expect partial
mirrors of recent epochs to be more useful than mirrors with
only old epochs).
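
A rough sketch of ordering epochs so that the latest is consulted first,
assuming the v2 layout where epochs live under $INBOX_DIR/git/ as 0.git,
1.git, and so on. The helper names order_epochs and epoch_dirs are
hypothetical. Sorting is numeric because a plain string sort would place
10.git before 2.git.

```python
import os
import re

EPOCH_RE = re.compile(r"^(\d+)\.git$")


def order_epochs(names, newest_first=True):
    """Sort epoch directory names numerically, ignoring other entries."""
    numbered = []
    for name in names:
        match = EPOCH_RE.match(name)
        if match:
            numbered.append((int(match.group(1)), name))
    numbered.sort(reverse=newest_first)
    return [name for _, name in numbered]


def epoch_dirs(inbox_dir, newest_first=True):
    """List full paths of the epoch repositories under inbox_dir/git."""
    git_root = os.path.join(inbox_dir, "git")
    names = order_epochs(os.listdir(git_root), newest_first)
    return [os.path.join(git_root, name) for name in names]
```

Passing newest_first=False gives the lowest-epoch-first order instead, so
either policy discussed here is a one-argument change.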


* Re: Recording archiver origins in git
  2021-06-29 19:59     ` Eric Wong
@ 2021-06-29 20:10       ` Konstantin Ryabitsev
  0 siblings, 0 replies; 5+ messages in thread
From: Konstantin Ryabitsev @ 2021-06-29 20:10 UTC
  To: Eric Wong; +Cc: meta

On Tue, Jun 29, 2021 at 07:59:57PM +0000, Eric Wong wrote:
> > My thinking is that with mirrors of mirrors of mirrors, if someone submits a
> > GDPR removal request, then there should be an easy way of figuring out where
> > these requests should actually go. Maybe infourl can cover this, but it's less
> > likely to be set up than an email address like postmaster@. So, I'm in favour
> > of keeping that in the info record.
> 
> I'm a little worried about it having too many directions
> (e.g. webmaster vs postmaster) and something freeform like
> infourl can cover it.  Or maybe something with user-defined
> labels would work:
> 
> 	contact = postmaster mail-admin@example.com
> 	contact = GDPR https://our-lawyers.example.com/form.cgi
> 	contact = webmaster web-admin@example.com

Maybe more RFC2822-like:

 	contact = postmaster <mail-admin@example.com>
 	contact = GDPR <https://our-lawyers.example.com/form.cgi>
 	contact = webmaster <web-admin@example.com>

Should be kosher with git-config and a bit more obvious that one is a label.
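
A minimal sketch of splitting one such value into its label and target,
assuming the "label <target>" shape shown above. The pattern and the name
parse_contact are hypothetical; this is a plain regular expression, not an
RFC 2822 address parser, since the target may be a URL rather than an email
address.

```python
import re

# A label, optional whitespace, then a target enclosed in angle brackets.
CONTACT_RE = re.compile(r"^\s*(?P<label>[^<>]+?)\s*<(?P<target>[^<>]+)>\s*$")


def parse_contact(value):
    """Split 'label <target>' into (label, target).

    Returns None when the value does not have that shape.
    """
    match = CONTACT_RE.match(value)
    if match is None:
        return None
    return match.group("label"), match.group("target")
```

Since "contact" would appear several times, a reader would need to collect
every value for the key (for example with git config --get-all) and pass
each one through parse_contact().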

> > I'm actually thinking the other way around -- only the 0.git (or whatever is
> > the lowest number), simply because it's more likely to have that record. That
> > is, unless you want to add functionality to automatically copy these from
> > previous epochs on auto-rotation events.
> 
> It should be easy to change this info down-the-line in case of
> project name/hosting changes, forks, etc.  So whatever's latest
> should supersede the older.  It would also allow partial mirrors
> to work with only the latest epoch(s) (I expect partial
> mirrors of recent epochs to be more useful than mirrors with
> only old epochs).

OK, I'm not picky and it's easy to reverse the dir listing.

I'll trial it out on some lore archives and then share the resulting hooks.

-K



Code repositories for project(s) associated with this public inbox

	https://80x24.org/public-inbox.git
