user/dev discussion of public-inbox itself
* RFC: monthly epochs for v2
@ 2019-10-24 19:53 Konstantin Ryabitsev
  2019-10-24 20:35 ` Eric Wong
  0 siblings, 1 reply; 10+ messages in thread
From: Konstantin Ryabitsev @ 2019-10-24 19:53 UTC (permalink / raw)
  To: meta

Hi, all:

With public-inbox now providing manifest files, it is easy to
communicate to mirroring services when an epoch rolls over. What do you
think about making these roll-overs month-based instead of size-based?
So, instead of:

git/
  0.git
  1.git
  2.git

it becomes

git/
  201908.git
  201909.git
  201910.git

Upsides:

- if history needs to be rewritten due to GDPR edits, the impact is 
  limited to just messages in that month's archive
- if someone is only interested in a few months worth of archives, they 
  don't have to clone the entire collection
- similarly, someone using public-inbox to feed messages to their inbox 
  (e.g. using the l2md tool [1]) doesn't need to waste gigabytes storing 
  archives they aren't interested in
- since the numbers are always auto-incrementing, this change can even 
  be done to repos currently using number-based epoch rotation, e.g.:

  git/
    0.git
    1.git
    201910.git
    201911.git

- there shouldn't be severe directory listing penalties with this, as 
  even 20 years worth of archives will only have 240 entries

Thoughts?

-K


* Re: RFC: monthly epochs for v2
  2019-10-24 19:53 RFC: monthly epochs for v2 Konstantin Ryabitsev
@ 2019-10-24 20:35 ` Eric Wong
  2019-10-24 21:21   ` Konstantin Ryabitsev
  0 siblings, 1 reply; 10+ messages in thread
From: Eric Wong @ 2019-10-24 20:35 UTC (permalink / raw)
  To: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> Hi, all:
> 
> With public-inbox now providing manifest files, it is easy to communicate to
> mirroring services when an epoch rolls over. What do you think about making
> these roll-overs month-based instead of size-based? So, instead of:
> 
> git/
>  0.git
>  1.git
>  2.git
> 
> it becomes
> 
> git/
>  201908.git
>  201909.git
>  201910.git
> 
> Upsides:
> 
> - if history needs to be rewritten due to GDPR edits, the impact is  limited
> to just messages in that month's archive

Epoch size should be configurable, yes.  But I'm against time
periods such as months or years being a factor for rollover.
Many inboxes (including this one) can go idle for weeks/months;
and activity can be unpredictable if there are surges.

> - if someone is only interested in a few months worth of archives, they
> don't have to clone the entire collection
> - similarly, someone using public-inbox to feed messages to their inbox
> (e.g. using the l2md tool [1]) doesn't need to waste gigabytes storing
> archives they aren't interested in

NNTP or d:YYYYMMDD..YYYYMMDD mboxrd downloads via HTTP search
are better suited for those cases.
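
For example, a rough sketch in plain Perl (this assumes the WWW
interface's gzipped mboxrd export, i.e. the ?q=...&x=m POST; adjust
the inbox and dates to taste):

  use HTTP::Tiny;  # core Perl; https support needs IO::Socket::SSL

  # grab October 2019 of LKML as gzipped mboxrd via the HTTP search
  my $url = 'https://lore.kernel.org/lkml/?q=d:20191001..20191031&x=m';
  my $res = HTTP::Tiny->new->post($url, { content => '' });
  die "download failed: $res->{status}\n" unless $res->{success};
  open my $fh, '>', 'lkml-201910.mbox.gz' or die "open: $!\n";
  binmode $fh;
  print $fh $res->{content};
  close $fh or die "close: $!\n";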

The HTTP search will be better once it can expand threads to
fetch messages in the same thread outside of date ranges
(e.g. "mairix -t").

The client side could still import into a local v2 inbox, use it
as a cache, and configure its own epoch size and expiration logic.

> - since the numbers are always auto-incrementing, this change can even  be
> done to repos currently using number-based epoch rotation, e.g.:
> 
>  git/
>    0.git
>    1.git
>    201910.git
>    201911.git
> 
> - there shouldn't be severe directory listing penalties with this, as  even
> 20 years worth of archives will only have 240 entries

That would still increase overhead for cloning + fetching, in terms
of installing and running extra tools.

Aside from LKML, most inboxes are pretty small and shouldn't
require more than an initial clone and then fetch via cron.

If people only want a backup via git (and not host HTTP/NNTP),
it's FAR easier for them to run ubiquitous commands such as
"git clone --mirror && git fetch" rather than
"install $TOOL which may be out-of-date-or-missing-on-your-distro"


* Re: RFC: monthly epochs for v2
  2019-10-24 20:35 ` Eric Wong
@ 2019-10-24 21:21   ` Konstantin Ryabitsev
  2019-10-24 22:34     ` Eric Wong
  0 siblings, 1 reply; 10+ messages in thread
From: Konstantin Ryabitsev @ 2019-10-24 21:21 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

On Thu, Oct 24, 2019 at 08:35:03PM +0000, Eric Wong wrote:
>Epoch size should be configurable, yes.  But I'm against time
>periods such as months or years being a factor for rollover.
>Many inboxes (including this one) can go idle for weeks/months;
>and activity can be unpredictable if there are surges.

Okay. It did just occur to me that the manifest file carries the 
"last-modified" stamp, so it's possible to figure out which repositories 
someone would need by parsing that data.
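
Something like this (rough, untested; it assumes the grokmirror
manifest format, where each repo entry carries a "modified" unix
timestamp) would pick out the repositories that changed since the
last sync:

  use HTTP::Tiny;  # all core Perl; https needs IO::Socket::SSL
  use IO::Uncompress::Gunzip qw(gunzip $GunzipError);
  use JSON::PP;

  my $since = shift(@ARGV) || 0;  # unix time of our last sync
  my $res = HTTP::Tiny->new->get('https://lore.kernel.org/lkml/manifest.js.gz');
  die "fetch failed: $res->{status}\n" unless $res->{success};
  gunzip(\($res->{content}) => \my $json) or die "gunzip: $GunzipError\n";
  my $manifest = JSON::PP->new->decode($json);
  for my $repo (sort keys %$manifest) {
      print "needs fetch: $repo\n" if $manifest->{$repo}{modified} > $since;
  }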

>> - if someone is only interested in a few months worth of archives, they
>> don't have to clone the entire collection
>> - similarly, someone using public-inbox to feed messages to their inbox
>> (e.g. using the l2md tool [1]) doesn't need to waste gigabytes storing
>> archives they aren't interested in
>
>NNTP or d:YYYYMMDD..YYYYMMDD mboxrd downloads via HTTP search
>are better suited for those cases.

I know you really like NNTP, but I'm worried that with Big Corp's love
of deep packet inspection and filtering, NNTP ports aren't going to be 
usable by a large subset of developers. We already have enough problems 
with port 9418 not being reachable (and sometimes not even port 22).  
Since Usenet's descent into mostly illegal content, many corporate
environments probably have ports 119 and 563 blocked off entirely and 
changing that would be futile.

>If people only want a backup via git (and not host HTTP/NNTP),
>it's FAR easier for them to run ubiquitous commands such as
>"git clone --mirror && git fetch" rather than
>"install $TOOL which may be out-of-date-or-missing-on-your-distro"

I think that anyone who uses public-inbox repositories for more than
just a copy of the archives is likely to be using some kind of tool. I
mean, SMTP can be used with "telnet", but nobody really does. :)
If we provide a convenient library that supports things like intelligent
selective cloning, indexing, fetching messages, etc., then that would
avoid everyone doing it badly. In fact, a libpublicinbox with bindings
for the most common languages is probably something that should happen
early on.

-K


* Re: RFC: monthly epochs for v2
  2019-10-24 21:21   ` Konstantin Ryabitsev
@ 2019-10-24 22:34     ` Eric Wong
  2019-10-25 12:22       ` Eric Wong
  0 siblings, 1 reply; 10+ messages in thread
From: Eric Wong @ 2019-10-24 22:34 UTC (permalink / raw)
  To: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> On Thu, Oct 24, 2019 at 08:35:03PM +0000, Eric Wong wrote:
> > > - if someone is only interested in a few months worth of archives, they
> > > don't have to clone the entire collection
> > > - similarly, someone using public-inbox to feed messages to their inbox
> > > (e.g. using the l2md tool [1]) doesn't need to waste gigabytes storing
> > > archives they aren't interested in
> > 
> > NNTP or d:YYYYMMDD..YYYYMMDD mboxrd downloads via HTTP search
> > are better suited for those cases.
> 
> I know you really like NNTP, but I'm worried that with Big Corp's love of
> deep packet inspection and filtering, NNTP ports aren't going to be usable
> by a large subset of developers. We already have enough problems with port
> 9418 not being reachable (and sometimes not even port 22).  Since Usenet's
> descent into mostly illegal content, many corporate environments probably
> have ports 119 and 563 blocked off entirely and changing that would be
> futile.

I would consider the possibility of an HTTP API which looks like
NNTP commands, too.  But it wouldn't work with existing NNTP
clients...  Maybe websockets can be used *shrug*

NNTP can also run off 80/443 if somebody has an extra IP.  Not
sure if supporting HTTP and NNTP off the same port is a
possibility since some HTTP clients pre-connect TCP and NNTP is
server-talks-first whereas HTTP is client-talks-first.
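
The usual workaround (it's what sslh does to share one port between
SSH and HTTPS) is to wait briefly for client bytes before falling
back to the server-talks-first protocol.  A sketch (handle_http and
handle_nntp are hypothetical); note an HTTP client that pre-connects
and sits silent would still be misread as NNTP:

  use IO::Select;

  # $client is a freshly accept()-ed socket on the shared port
  sub demux {
      my ($client) = @_;
      if (IO::Select->new($client)->can_read(2)) {
          handle_http($client);  # client spoke first: assume HTTP
      } else {
          # 2s of silence: assume NNTP, greet first (RFC 3977)
          print $client "201 server ready - posting prohibited\r\n";
          handle_nntp($client);
      }
  }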

> > If people only want a backup via git (and not host HTTP/NNTP),
> > it's FAR easier for them to run ubiquitous commands such as
> > "git clone --mirror && git fetch" rather than
> > "install $TOOL which may be out-of-date-or-missing-on-your-distro"
> 
> I think that anyone who uses public-inbox repositories for more than just a
> copy of the archives is likely to be using some kind of tool. I mean, SMTP
> can be used with "telnet", but nobody really does. :) If we provide a
> convenient library that supports things like intelligent selective cloning,
> indexing, fetching messages, etc., then that would avoid everyone doing it
> badly. In fact, a libpublicinbox with bindings for the most common languages
> is probably something that should happen early on.

I'm not sure about a libpublicinbox... I have been really
hesitant to depend on shared C/C++ libraries whenever I use Perl
or Ruby because of build and install complexity; especially for
stuff that's not-yet-available on distros.

Well-defined and stable protocols + data formats?
Yes. 100 times yes.

What would be nice is to have a local server so they could
access everything via HTTP using curl or whatever HTTP library
users want.  On shared systems, it could be HTTP over a UNIX
socket.  libcurl has supported Unix domain sockets for a while now
(curl --unix-socket), and HTTP/1.1 parsers are pretty common.
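
e.g., even by hand it's tiny (a sketch; the socket path is
hypothetical):

  use IO::Socket::UNIX;

  my $sock = IO::Socket::UNIX->new(Peer => '/run/public-inbox/httpd.sock')
      or die "connect: $!\n";
  print $sock "GET /meta/manifest.js.gz HTTP/1.1\r\n",
              "Host: localhost\r\nConnection: close\r\n\r\n";
  local $/;  # slurp: status line + headers + gzipped body
  my $response = <$sock>;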

JSON is a possibility, too; but I'm not sure if JSON is even
necessary if all that's exchanged are git blob OIDs and URLs for
mboxes.  Parsing MIME + RFC822(-ish) are already sunk costs.


* Re: RFC: monthly epochs for v2
  2019-10-24 22:34     ` Eric Wong
@ 2019-10-25 12:22       ` Eric Wong
  2019-10-25 20:56         ` Konstantin Ryabitsev
  0 siblings, 1 reply; 10+ messages in thread
From: Eric Wong @ 2019-10-25 12:22 UTC (permalink / raw)
  To: meta

Eric Wong <e@80x24.org> wrote:
> Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> > On Thu, Oct 24, 2019 at 08:35:03PM +0000, Eric Wong wrote:
> > > > - if someone is only interested in a few months worth of archives, they
> > > > don't have to clone the entire collection
> > > > - similarly, someone using public-inbox to feed messages to their inbox
> > > > (e.g. using the l2md tool [1]) doesn't need to waste gigabytes storing
> > > > archives they aren't interested in
> > > 
> > > NNTP or d:YYYYMMDD..YYYYMMDD mboxrd downloads via HTTP search
> > > are better suited for those cases.
> > 
> > I know you really like NNTP, but I'm worried that with Big Corp's love of
> > deep packet inspection and filtering, NNTP ports aren't going to be usable
> > by a large subset of developers. We already have enough problems with port
> > 9418 not being reachable (and sometimes not even port 22).  Since Usenet's
> > descent into mostly illegal content, many corporate environments probably
> > have ports 119 and 563 blocked off entirely and changing that would be
> > futile.
> 
> I would consider the possibility of an HTTP API which looks like
> NNTP commands, too.  But it wouldn't work with existing NNTP
> clients...  Maybe websockets can be used *shrug*

I forgot: HTTP CONNECT exists for tunneling anything over
HTTP/HTTPS, too; so a generic LD_PRELOAD tool like proxychains should
work with most NNTP clients/libraries.

Net::NNTP is in every Perl5 install I know of, and nearly every
hacker's *nix system has Perl5, but proxychains isn't ubiquitous
(Debian packages it, at least).
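
For instance, reading the latest message from this inbox needs
nothing beyond stock Perl5 (a sketch; the group name is what
news.public-inbox.org serves for this list):

  use Net::NNTP;

  my $nntp = Net::NNTP->new('news.public-inbox.org') or die "connect failed\n";
  my ($count, $first, $last) = $nntp->group('inbox.comp.mail.public-inbox.meta')
      or die "no such group\n";
  my $lines = $nntp->article($last);  # newest article, an arrayref of lines
  print @$lines;
  $nntp->quit;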

> NNTP can also run off 80/443 if somebody has an extra IP.  Not
> sure if supporting HTTP and NNTP off the same port is a
> possibility since some HTTP clients pre-connect TCP and NNTP is
> server-talks-first whereas HTTP is client-talks-first.


> > > If people only want a backup via git (and not host HTTP/NNTP),
> > > it's FAR easier for them to run ubiquitous commands such as
> > > "git clone --mirror && git fetch" rather than
> > > "install $TOOL which may be out-of-date-or-missing-on-your-distro"
> > 
> > I think that anyone who uses public-inbox repositories for more than
> > just a copy of the archives is likely to be using some kind of tool. I
> > mean, SMTP can be used with "telnet", but nobody really does. :) If we
> > provide a convenient library that supports things like intelligent
> > selective cloning, indexing, fetching messages, etc., then that would
> > avoid everyone doing it badly. In fact, a libpublicinbox with bindings
> > for the most common languages is probably something that should happen
> > early on.
> 
> I'm not sure about a libpublicinbox... I have been really
> hesitant to depend on shared C/C++ libraries whenever I use Perl
> or Ruby because of build and install complexity; especially for
> stuff that's not-yet-available on distros.
> 
> Well-defined and stable protocols + data formats?
> Yes. 100 times yes.
> 
> What would be nice is to have a local server so they could
> access everything via HTTP using curl or whatever HTTP library
> users want.  On shared systems, it could be HTTP over a UNIX
> socket.  libcurl has supported Unix domain sockets for a while now
> (curl --unix-socket), and HTTP/1.1 parsers are pretty common.
> 
> JSON is a possibility, too; but I'm not sure if JSON is even
> necessary if all that's exchanged are git blob OIDs and URLs for
> mboxes.  Parsing MIME + RFC822(-ish) are already sunk costs.

More on that. As much as I may be in favor of "software freedom",
I'm even more in favor of "freedom _from_ software".  Reusing
existing data formats as much as possible to minimize the
bug and attack surface is something I've been trying to do.


* Re: RFC: monthly epochs for v2
  2019-10-25 12:22       ` Eric Wong
@ 2019-10-25 20:56         ` Konstantin Ryabitsev
  2019-10-25 22:57           ` Eric Wong
  0 siblings, 1 reply; 10+ messages in thread
From: Konstantin Ryabitsev @ 2019-10-25 20:56 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

On Fri, Oct 25, 2019 at 12:22:14PM +0000, Eric Wong wrote:
>> I'm not sure about a libpublicinbox... I have been really
>> hesitant to depend on shared C/C++ libraries whenever I use Perl
>> or Ruby because of build and install complexity; especially for
>> stuff that's not-yet-available on distros.
>>
>> Well-defined and stable protocols + data formats?
>> Yes. 100 times yes.
>>
>> What would be nice is to have a local server so they could
>> access everything via HTTP using curl or whatever HTTP library
>> users want.  On shared systems, it could be HTTP over a UNIX
>> socket.  libcurl has supported Unix domain sockets for a while now
>> (curl --unix-socket), and HTTP/1.1 parsers are pretty common.
>>
>> JSON is a possibility, too; but I'm not sure if JSON is even
>> necessary if all that's exchanged are git blob OIDs and URLs for
>> mboxes.  Parsing MIME + RFC822(-ish) are already sunk costs.
>
>More on that. As much as I may be in favor of "software freedom",
>I'm even more in favor of "freedom _from_ software".  Reusing
>existing data formats as much as possible to minimize the
>bug and attack surface is something I've been trying to do.

I understand the sentiment, but it's the exact problem that kernel 
maintainers are struggling with. Almost every maintainer I've spoken to 
has complained that without dedicated tools automating a lot of tasks
for them, they quickly run out of scaling capacity. In fact, most of the 
ones I spoke with have rigged up some kind of fragile solution that 
sort-of works for them, often involving spittle and baling wire (someone 
I know runs patchwork in a container). After they set it up, they are 
hesitant to touch it, which means they don't want to perform system or 
library upgrades for fear that something may break at the worst
possible time during the merge window. Which means they are potentially 
leaving their systems exposed by not applying security updates.

It's little wonder why many are clamoring for a centralized forge 
solution that would put the responsibility of maintaining things on 
someone else's shoulders. 

If we want to avoid this, we need to provide them with convenient and 
robust tools that they can use and adapt to their needs. Otherwise we 
aren't really solving the problem.

(I know this really belongs on workflows more than on meta.)

-K


* Re: RFC: monthly epochs for v2
  2019-10-25 20:56         ` Konstantin Ryabitsev
@ 2019-10-25 22:57           ` Eric Wong
  2019-10-29 15:03             ` Eric W. Biederman
  0 siblings, 1 reply; 10+ messages in thread
From: Eric Wong @ 2019-10-25 22:57 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> On Fri, Oct 25, 2019 at 12:22:14PM +0000, Eric Wong wrote:
> > > I'm not sure about a libpublicinbox... I have been really
> > > hesitant to depend on shared C/C++ libraries whenever I use Perl
> > > or Ruby because of build and install complexity; especially for
> > > stuff that's not-yet-available on distros.
> > > 
> > > Well-defined and stable protocols + data formats?
> > > Yes. 100 times yes.
> > > 
> > > What would be nice is to have a local server so they could
> > > access everything via HTTP using curl or whatever HTTP library
> > > users want.  On shared systems, it could be HTTP over a UNIX
> > > socket.  libcurl has supported Unix domain sockets for a while now
> > > (curl --unix-socket), and HTTP/1.1 parsers are pretty common.
> > > 
> > > JSON is a possibility, too; but I'm not sure if JSON is even
> > > necessary if all that's exchanged are git blob OIDs and URLs for
> > > mboxes.  Parsing MIME + RFC822(-ish) are already sunk costs.
> > 
> > More on that. As much as I may be in favor of "software freedom",
> > I'm even more in favor of "freedom _from_ software".  Reusing
> > existing data formats as much as possible to minimize the
> > bug and attack surface is something I've been trying to do.
> 
> I understand the sentiment, but it's the exact problem that kernel
> maintainers are struggling with. Almost every maintainer I've spoken to has
> complained that without dedicated tools automating a lot of tasks for them,
> they quickly run out of scaling capacity. In fact, most of the ones I spoke
> with have rigged up some kind of fragile solution that sort-of works for
> them, often involving spittle and baling wire (someone I know runs patchwork
> in a container). After they set it up, they are hesitant to touch it, which
> means they don't want to perform system or library upgrades for fear
> that something may break at the worst possible time during the merge window.
> Which means they are potentially leaving their systems exposed by not
> applying security updates.
> 
> It's little wonder why many are clamoring for a centralized forge solution
> that would put the responsibility of maintaining things on someone else's
> shoulders.

Yeah.  I've tried to make public-inbox somewhat easy-to-install
compared to typical web apps, at least on Debian-based systems.
I guess the CentOS7 experience is acceptable, but maybe less than
ideal due to the lack of Search::Xapian.

Not sure what other distros and dependencies we'd have to worry
about with less package availability than CentOS/RHEL.

> If we want to avoid this, we need to provide them with convenient and robust
> tools that they can use and adapt to their needs. Otherwise we aren't really
> solving the problem.

Right.  Much of the public-inbox search and blob-solver logic
can be adapted to command-line tools (as they are for testing)
and/or exposed locally via a git-instaweb-style HTTP server
which can be curl-ed.  But ALSO hosting it on a giant server is
an option for organizations that want such things.

Maybe some tooling can be piggy-backed into git.git or its
contrib/ section, too.  I certainly want to make git operation
totally network transparent, at least.

> (I know this really belongs on workflows more than on meta.)

I posted a little bit more about local tools, here:
  https://lore.kernel.org/workflows/20191025223915.GA22959@dcvr/
(but I've posted to similar effect here, I think)


* Re: RFC: monthly epochs for v2
  2019-10-25 22:57           ` Eric Wong
@ 2019-10-29 15:03             ` Eric W. Biederman
  2019-10-29 15:55               ` Konstantin Ryabitsev
  0 siblings, 1 reply; 10+ messages in thread
From: Eric W. Biederman @ 2019-10-29 15:03 UTC (permalink / raw)
  To: Eric Wong; +Cc: Konstantin Ryabitsev, meta

Eric Wong <e@80x24.org> writes:

> Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
>> On Fri, Oct 25, 2019 at 12:22:14PM +0000, Eric Wong wrote:
>> > > I'm not sure about a libpublicinbox... I have been really
>> > > hesitant to depend on shared C/C++ libraries whenever I use Perl
>> > > or Ruby because of build and install complexity; especially for
>> > > stuff that's not-yet-available on distros.
>> > > 
>> > > Well-defined and stable protocols + data formats?
>> > > Yes. 100 times yes.
>> > > 
>> > > What would be nice is to have a local server so they could
>> > > access everything via HTTP using curl or whatever HTTP library
>> > > users want.  On shared systems, it could be HTTP over a UNIX
>> > > socket.  libcurl has supported Unix domain sockets for a while now
>> > > (curl --unix-socket), and HTTP/1.1 parsers are pretty common.
>> > > 
>> > > JSON is a possibility, too; but I'm not sure if JSON is even
>> > > necessary if all that's exchanged are git blob OIDs and URLs for
>> > > mboxes.  Parsing MIME + RFC822(-ish) are already sunk costs.
>> > 
>> > More on that. As much as I may be in favor of "software freedom",
>> > I'm even more in favor of "freedom _from_ software".  Reusing
>> > existing data formats as much as possible to minimize the
>> > bug and attack surface is something I've been trying to do.
>> 
>> I understand the sentiment, but it's the exact problem that kernel
>> maintainers are struggling with. Almost every maintainer I've spoken to has
>> complained that without dedicated tools automating a lot of tasks for them,
>> they quickly run out of scaling capacity. In fact, most of the ones I spoke
>> with have rigged up some kind of fragile solution that sort-of works for
>> them, often involving spittle and baling wire (someone I know runs patchwork
>> in a container). After they set it up, they are hesitant to touch it, which
>> means they don't want to perform system or library upgrades for fear
>> that something may break at the worst possible time during the merge window.
>> Which means they are potentially leaving their systems exposed by not
>> applying security updates.
>> 
>> It's little wonder why many are clamoring for a centralized forge solution
>> that would put the responsibility of maintaining things on someone else's
>> shoulders.
>
> Yeah.  I've tried to make public-inbox somewhat easy-to-install
> compared to typical web apps, at least on Debian-based systems.
> I guess the CentOS7 experience is acceptable, but maybe less than
> ideal due to the lack of Search::Xapian.
>
> Not sure what other distros and dependencies we'd have to worry
> about with less package availability than CentOS/RHEL.
>
>> If we want to avoid this, we need to provide them with convenient and robust
>> tools that they can use and adapt to their needs. Otherwise we aren't really
>> solving the problem.
>
> Right.  Much of the public-inbox search and blob-solver logic
> can be adapted to command-line tools (as they are for testing)
> and/or exposed locally via a git-instaweb-style HTTP server
> which can be curl-ed.  But ALSO hosting it on a giant server is
> an option for organizations that want such things.
>
> Maybe some tooling can be piggy-backed into git.git or its
> contrib/ section, too.  I certainly want to make git operation
> totally network transparent, at least.
>
>> (I know this really belongs on workflows more than on meta.)
>
> I posted a little bit more about local tools, here:
>   https://lore.kernel.org/workflows/20191025223915.GA22959@dcvr/
> (but I've posted to similar effect here, I think)

So, not monthly epochs.  But it would be very handy to have a
public-inbox command that refreshes git mirrors.  It would be even
more awesome if there were something like the IMAP IDLE command in
HTTP that would let a process block until an update happened and
then fetch the updated data.

ssoma had some of that.

I have a very rough script that works, but I periodically need to find
the new epochs and start a new mirror by hand.  So it would be nice to
have a more polished tool that I could just tell to mirror this
mailing list, or to mirror all of the mailing lists on lore.

I don't think it makes sense to optimize the git case for people who
have a one-time use and want a small download.  History tends to be
useful, having people mirror things tends to be useful, and the sizes
we are talking about are comparatively small.

For distributed use, I generally think we should make it cheap enough
that people don't have to optimize their current setup.  That is roughly
the policy git uses, and it has worked.

Eric



* Re: RFC: monthly epochs for v2
  2019-10-29 15:03             ` Eric W. Biederman
@ 2019-10-29 15:55               ` Konstantin Ryabitsev
  2019-10-29 22:46                 ` Eric Wong
  0 siblings, 1 reply; 10+ messages in thread
From: Konstantin Ryabitsev @ 2019-10-29 15:55 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Eric Wong, meta

On Tue, Oct 29, 2019 at 10:03:43AM -0500, Eric W. Biederman wrote:
> So, not monthly epochs.  But it would be very handy to have a
> public-inbox command that refreshes git mirrors.  It would be even
> more awesome if there were something like the IMAP IDLE command in
> HTTP that would let a process block until an update happened and
> then fetch the updated data.
> 
> ssoma had some of that.
> 
> I have a very rough script that works, but I periodically need to find
> the new epochs and start a new mirror by hand.  So it would be nice to
> have a more polished tool that I could just tell to mirror this
> mailing list, or to mirror all of the mailing lists on lore.

You can use manifest files to recognize when a new epoch is available
(and if there are new updates). E.g.:

http://lore.kernel.org/lkml/manifest.js.gz

It's written to be consumed by grokmirror, but it can be adapted to be
used by any other tool. The format is straightforward and simple to
understand.
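
For example, a rough sketch of epoch discovery (untested; it assumes
the manifest keys are repository paths like "/lkml/git/7.git"):

  use HTTP::Tiny;  # core Perl; https needs IO::Socket::SSL
  use IO::Uncompress::Gunzip qw(gunzip $GunzipError);
  use JSON::PP;

  my $res = HTTP::Tiny->new->get('https://lore.kernel.org/lkml/manifest.js.gz');
  die "fetch failed: $res->{status}\n" unless $res->{success};
  gunzip(\($res->{content}) => \my $json) or die "gunzip: $GunzipError\n";
  for my $repo (sort keys %{ JSON::PP->new->decode($json) }) {
      my ($epoch) = $repo =~ m!/git/([0-9]+)\.git\z! or next;
      next if -d "lkml/git/$epoch.git";  # epoch already mirrored
      system(qw(git clone --mirror),
             "https://lore.kernel.org$repo", "lkml/git/$epoch.git");
  }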

-K


* Re: RFC: monthly epochs for v2
  2019-10-29 15:55               ` Konstantin Ryabitsev
@ 2019-10-29 22:46                 ` Eric Wong
  0 siblings, 0 replies; 10+ messages in thread
From: Eric Wong @ 2019-10-29 22:46 UTC (permalink / raw)
  To: Eric W. Biederman, meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> On Tue, Oct 29, 2019 at 10:03:43AM -0500, Eric W. Biederman wrote:
> > So, not monthly epochs.  But it would be very handy to have a
> > public-inbox command that refreshes git mirrors.  It would be even
> > more awesome if there were something like the IMAP IDLE command in
> > HTTP that would let a process block until an update happened and
> > then fetch the updated data.
> > 
> > ssoma had some of that.

Did it?  It still runs via cronjob to drive mlmmj for some old lists I have.

> > I have a very rough script that works, but I periodically need to find
> > the new epochs and start a new mirror by hand.  So it would be nice to
> > have a more polished tool that I could just tell to mirror this
> > mailing list, or to mirror all of the mailing lists on lore.
> 
> You can use manifest files to recognize when a new epoch is available
> (and if there are new updates). E.g.:
> 
> http://lore.kernel.org/lkml/manifest.js.gz
> 
> It's written to be consumed by grokmirror, but it can be adapted to be
> used by any other tool. The format is straightforward and simple to
> understand.

To reduce polling, I'm thinking a non-standard "IDLE" HTTP
method could be used, and public-inbox-httpd could implement it
with inotify/kevent to wake up clients on changes.

The following would wait until the local ./manifest.js.gz is
older than the server's copy:

  curl -z ./manifest.js.gz -X IDLE http://example.com/manifest.js.gz

But getting a non-standard request method "IDLE" through proxies
could be tough, so perhaps using POST and a query parameter
would work, too:

  curl -z ./manifest.js.gz -X POST http://example.com/manifest.js.gz?idle=1

But then again, HTTP proxies probably time out after a minute or so;
I seem to recall nginx times out after 75s.  Maybe empty gzip fragments
can be sent to keep the connection alive?
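
Generating those fragments is cheap, at least.  A sketch with
Compress::Raw::Zlib (Z_SYNC_FLUSH emits an empty deflate block that
gzip decoders skip over):

  use Compress::Raw::Zlib;  # exports Z_SYNC_FLUSH and WANT_GZIP

  my ($gz, $err) = Compress::Raw::Zlib::Deflate->new(-WindowBits => WANT_GZIP);
  die "deflateInit: $err\n" unless $gz;
  # ...stream the gzipped response as usual; then, while blocked on
  # inotify/kevent, periodically:
  $gz->flush(my $keepalive, Z_SYNC_FLUSH);
  # write $keepalive to the client: it inflates to zero bytes, but
  # should keep intermediary proxies from timing out the connection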


Code repositories for project(s) associated with this public inbox

	https://80x24.org/public-inbox.git
