user/dev discussion of public-inbox itself
* Limited-history local archives
@ 2020-01-03 20:15 Konstantin Ryabitsev
  2020-01-03 22:02 ` Eric Wong
  0 siblings, 1 reply; 2+ messages in thread
From: Konstantin Ryabitsev @ 2020-01-03 20:15 UTC
  To: meta

Hi, all:

I wonder if it would be useful to have a feature allowing someone to run 
a limited-history local copy of a larger remote archive -- for example 
if someone only wanted a 3-month copy of LKML instead of the whole 
20-year enchilada.

It's possible to accomplish this with git already [^1], e.g. you can use 
the following to grab a copy of LKML starting with December 2019:

  $ git clone --bare --shallow-since 2019-12-01 https://lore.kernel.org/lkml/git/7 lkml-since-dec.git
  $ cd lkml-since-dec.git
  $ git config --add remote.origin.fetch '+refs/heads/master:refs/heads/master'

You can now run "git fetch" as usual and perform all the normal 
operations, such as "git show {rev}:m" to get the message contents.  
Obviously, if we try to get a revision from before December 1, the 
operation fails:

  $ git show dae740ca679710fbe8b97b3e704d63e3e7883fd9:m
  fatal: Path 'm' does not exist in 'dae740ca679710fbe8b97b3e704d63e3e7883fd9'

If we enable uploadpack.allowAnySHA1InWant on the server, we can then 
fetch this object directly:

  $ git fetch --depth 1 origin dae740ca679710fbe8b97b3e704d63e3e7883fd9
  remote: Counting objects: 3, done.
  remote: Compressing objects: 100% (2/2), done.
  remote: Total 3 (delta 0), reused 3 (delta 0)
  Unpacking objects: 100% (3/3), done.
  From https://lore.kernel.org/lkml/git/7
   * branch              dae740ca679710fbe8b97b3e704d63e3e7883fd9 -> FETCH_HEAD

Now this succeeds:

  $ git show dae740ca679710fbe8b97b3e704d63e3e7883fd9:m
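
The fetch-then-show steps above could be wrapped in a small helper. This
is only a sketch: the function name show_msg is made up for illustration,
and it assumes the server accepts fetches by object ID (for example with
uploadpack.allowAnySHA1InWant enabled, as described above).

```shell
# Hypothetical helper: print the message stored in commit $2 of the
# repository at git dir $1, fetching that single commit first when it is
# not available locally.
# Assumption: the remote is named "origin" and allows fetching by object
# ID; otherwise the fetch fails and the function returns non-zero.
show_msg() {
	repo=$1
	rev=$2
	# cat-file -e exits non-zero when the object is missing locally
	if ! git --git-dir="$repo" cat-file -e "$rev^{commit}" 2>/dev/null; then
		git --git-dir="$repo" fetch --depth 1 origin "$rev" || return 1
	fi
	git --git-dir="$repo" show "$rev:m"
}
```

Usage would be something like "show_msg lkml-since-dec.git {rev}" from
the parent directory of the clone.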

We can then periodically reshallow the archive (e.g. once a day) in 
order to get rid of older objects:

  $ git fetch --shallow-since 2019-12-15 --update-shallow origin master
  $ git gc --prune=now
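
For a daily job, the cutoff date would need to move forward on each run.
A minimal sketch of that, assuming a date(1) that accepts "-d @EPOCH"
(GNU coreutils does); the function names cutoff_date and reshallow are
hypothetical:

```shell
# Hypothetical helper: print the date (YYYY-MM-DD, UTC) that lies $1 days
# in the past. Assumes date(1) understands "-d @EPOCH".
cutoff_date() {
	date -u -d "@$(( $(date +%s) - $1 * 86400 ))" +%Y-%m-%d
}

# Hypothetical helper: re-shallow the repository at git dir $1 so that it
# keeps roughly the last $2 days of history, then prune what is no longer
# reachable. Uses the same two commands as shown above.
reshallow() {
	repo=$1
	days=$2
	since=$(cutoff_date "$days") || return 1
	git --git-dir="$repo" fetch --shallow-since "$since" --update-shallow origin master &&
	git --git-dir="$repo" gc --prune=now
}
```

Something like "reshallow lkml-since-dec.git 90" from cron would then
keep an approximately 3-month window.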

There isn't really an RFC or anything associated with this -- I just 
wanted to share this idea as a possibly useful way of reducing local 
storage requirements while still being able to operate directly on 
public-inbox git repositories -- e.g. with a tool like l2md 
(https://git.kernel.org/pub/scm/linux/kernel/git/dborkman/l2md.git/).

-K

[^1]: Theoretically, this will become even easier in the future with 
      partial-clone functionality, though I believe that's mostly 
      written to support fetching large blobs from CDNs and wouldn't be 
      as useful for very linear public-inbox repositories.


* Re: Limited-history local archives
  2020-01-03 20:15 Limited-history local archives Konstantin Ryabitsev
@ 2020-01-03 22:02 ` Eric Wong
  0 siblings, 0 replies; 2+ messages in thread
From: Eric Wong @ 2020-01-03 22:02 UTC
  To: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> Hi, all:
> 
> I wonder if it would be useful to have a feature allowing someone to run 
> a limited-history local copy of a larger remote archive -- for example 
> if someone only wanted a 3-month copy of LKML instead of the whole 
> 20-year enchilada.

Yes.

> It's possible to accomplish this with git already [^1], e.g. you can use 
> the following to grab a copy of LKML starting with December 2019:
> 
>   $ git clone --bare --shallow-since 2019-12-01 https://lore.kernel.org/lkml/git/7 lkml-since-dec.git
>   $ cd lkml-since-dec.git
>   $ git config --add remote.origin.fetch '+refs/heads/master:refs/heads/master'
> 
> You can now run "git fetch" as usual and perform all the normal 
> operations, such as "git show {rev}:m" to get the message contents.  
> Obviously, if we try to get a revision from before December 1, the 
> operation fails:
> 
>   $ git show dae740ca679710fbe8b97b3e704d63e3e7883fd9:m
>   fatal: Path 'm' does not exist in 'dae740ca679710fbe8b97b3e704d63e3e7883fd9'
> 
> If we enable uploadpack.allowAnySHA1InWant on the server, we can then 
> fetch this object directly:

Usability-wise, git itself seems pretty bad at this...

I haven't looked deeply at this, but could/should public-inbox
enable allowAnySHA1InWant by default?

>   $ git fetch --depth 1 origin dae740ca679710fbe8b97b3e704d63e3e7883fd9
>   remote: Counting objects: 3, done.
>   remote: Compressing objects: 100% (2/2), done.
>   remote: Total 3 (delta 0), reused 3 (delta 0)
>   Unpacking objects: 100% (3/3), done.
>   From https://lore.kernel.org/lkml/git/7
>    * branch              dae740ca679710fbe8b97b3e704d63e3e7883fd9 -> FETCH_HEAD
> 
> Now this succeeds:
> 
>   $ git show dae740ca679710fbe8b97b3e704d63e3e7883fd9:m
> 
> We can then periodically reshallow the archive (e.g. once a day) in 
> order to get rid of older objects:
> 
>   $ git fetch --shallow-since 2019-12-15 --update-shallow origin master
>   $ git gc --prune=now
> 
> There isn't really an RFC or anything associated with this -- I just 
> wanted to share this idea as a possibly useful way of reducing local 
> storage requirements while still being able to operate directly on 
> public-inbox git repositories -- e.g. with a tool like l2md 
> (https://git.kernel.org/pub/scm/linux/kernel/git/dborkman/l2md.git/).

Given allowAnySHA1InWant isn't enabled by default on servers
today, and the number of commands needed on the client,
I'm not sure git is really great for people who want to read
mail locally...

POST + "&x=m" search queries are the easiest alternative, I think:

	curl -X POST "$INBOX_URL/?q=d:$YYYYMMDD..&x=m" >mboxrd.gz
	(but I wish MUAs could keep track of which messages I've read in
	 between queries)
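
The query above could be scripted roughly as follows. This is a sketch,
not something public-inbox ships: the function names mbox_url and
fetch_since are made up, and only the "d:" date range and "x=m"
parameter are taken from the command above.

```shell
# Hypothetical helper: build the search URL for all messages dated on or
# after $2 (YYYYMMDD) in the inbox at $1. A single trailing slash on the
# inbox URL is dropped so the result has exactly one before the query.
mbox_url() {
	printf '%s/?q=d:%s..&x=m\n' "${1%/}" "$2"
}

# Hypothetical helper: POST the query and save the gzipped mboxrd to $3.
# -f makes curl return non-zero on HTTP errors instead of saving the
# error page.
fetch_since() {
	curl -sSf -X POST -o "$3" "$(mbox_url "$1" "$2")"
}
```

E.g. fetch_since "$INBOX_URL" 20191201 mboxrd.gz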

And NNTP, which ought to be tunnel-able over HTTPS CONNECT.

> [^1]: Theoretically, this will become even easier in the future with 
>       partial-clone functionality, though I believe that's mostly 
>       written to support fetching large blobs from CDNs and wouldn't be 
>       as useful for very linear public-inbox repositories.

Fwiw, I really wish "git --git-dir=$URL any-read-only-command"
could work one day like it does with SVN.

WebDAV would've been nice, but AFAIK davfs2 doesn't support
Range: yet, and having to mount FSes is a drag...



Archives are clonable:
	git clone --mirror https://public-inbox.org/meta
	git clone --mirror http://czquwvybam4bgbro.onion/meta
	git clone --mirror http://hjrcffqmbrq6wope.onion/meta
	git clone --mirror http://ou63pmih66umazou.onion/meta


Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.mail.public-inbox.meta
	nntp://ou63pmih66umazou.onion/inbox.comp.mail.public-inbox.meta
	nntp://czquwvybam4bgbro.onion/inbox.comp.mail.public-inbox.meta
	nntp://hjrcffqmbrq6wope.onion/inbox.comp.mail.public-inbox.meta
	nntp://news.gmane.io/gmane.mail.public-inbox.general

 note: .onion URLs require Tor: https://www.torproject.org/

AGPL code for this site: git clone https://public-inbox.org/public-inbox.git