Sparse clones (Was: Re: [PATCH 1/2] upload-pack: support subtree packing)

git@vger.kernel.org mailing list mirror (one of many)

* Sparse clones (Was: Re: [PATCH 1/2] upload-pack: support subtree  packing)
@ 2010-07-28  0:13 Elijah Newren
  2010-07-28  1:05 ` Avery Pennarun
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Elijah Newren @ 2010-07-28  0:13 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Nguyễn Thái Ngọc, git

Hi,

2010/7/27 Shawn O. Pearce <spearce@spearce.org>:
> I would prefer doing something more like what we do with shallow
> on the client side.  Record in a magic file the path(s) that we
> did actually obtain.  During fsck, rev-list, or read-tree the
> client skips over any paths that don't match that file's listing.
> Then we can keep the same commit SHA-1s, but we won't complain that
> there are objects missing.

I recently decided to take a crack at implementing sparse clones, due
to a crazy idea I had (which might not be as crazy as I thought since
you suggest something similar, though more limited).  I was going to
wait until I actually got somewhere tangible with it to post an RFC,
particularly since it may take me a while, but since it's fresh on
everyone's minds perhaps now is good anyway.

Does the following seem sane, or are there big gotchas that I'm just unaware of?

0) Sparse clones have "all" commit objects, but not all trees/blobs.

Note that "all" only means all that are reachable from the refs being
downloaded, of course.  I think this is widely agreed upon and has
been suggested many times on this list.

1) A user controls sparseness by passing rev-list arguments to clone.

This allows a user to control sparseness both in terms of span of
content (files/directories) and depth of history.  It can also be used
to limit to a subset of refs (cloning just one or two branches instead
of all branches and tags).  For example,
  $ git clone ssh://repo.git dst -- Documentation/
  $ git clone ssh://repo.git dst master~6..master
  $ git clone ssh://repo.git dst -3
(Note that the destination argument becomes mandatory for those doing
a sparse clone in order to disambiguate it from rev-list options.)

This method also means users don't need much training to learn how to
use sparse clones -- they just use syntax they've already learned with
log, and clone will pass this info on to upload-pack.
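
To make the argument layout above concrete, here is a minimal sketch of how
such a command line could be split up.  It is only an illustration:
split_clone_args and CloneArgs are hypothetical names, and nothing like this
exists in clone today.  It assumes the repository and destination always come
first, and that everything after them is a rev-list style limiter, with "--"
separating revisions from paths.

```python
from typing import List, NamedTuple


class CloneArgs(NamedTuple):
    repo: str
    dst: str
    revs: List[str]   # revision limiters, e.g. "master~6..master" or "-3"
    paths: List[str]  # path limiters given after "--"


def split_clone_args(argv: List[str]) -> CloneArgs:
    """Split hypothetical sparse-clone arguments into repo, dst and limiters."""
    if len(argv) < 2:
        # dst is mandatory here, otherwise "-3" or a path could be
        # mistaken for the destination directory.
        raise ValueError("a sparse clone needs both <repo> and <dst>")
    repo, dst, rest = argv[0], argv[1], argv[2:]
    if "--" in rest:
        cut = rest.index("--")
        revs, paths = rest[:cut], rest[cut + 1:]
    else:
        revs, paths = rest, []
    return CloneArgs(repo, dst, revs, paths)
```

With the three examples above, the first yields only a path limiter
("Documentation/") and the other two yield only revision limiters.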

There is a slight question as to whether users should have to specify
"--all HEAD" with all sparse clones or whether it should be assumed
when no other refs are listed.

2) Sparse checkouts are automatically invoked with the path(s) from
   the specified rev-list arguments.

Can't checkout content that we don't have.  :-)

This has a slight downside -- it makes sparse checkouts and sparse
clones slight misfits: the syntax (.gitignore style vs. rev-list
arguments) is a bit different, and sparse checkouts can exclude
certain paths whereas my sparse clones would only be able to *include*
paths.  I don't see this as a deal-breaker, but even if others
disagree I think a more general path-exclusion mechanism for the
revision walking machinery would be really nice for reasons beyond
just this one.  I've often wanted to do something like
  git log -S'important code phrase' --EXCLUDE-PATH=big-data-dir

3) The limiting rev-list arguments passed to clone are stored.

However, relative arguments such as "-3" or "master~6" first need to
be translated into one or more exclude ranges written as "^<sha1>".
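
A toy sketch of that translation, assuming a purely linear history.  The
commit ids (c0..c7), the PARENT and REFS tables and the function names are
all made up for illustration; with merges in the history, "-3" could need
several "^<sha1>" entries rather than the single one produced here.

```python
from typing import Dict, List, Optional

# Toy history: commit id -> first parent (None for the root).
# The ids are placeholders, not real SHA-1s.
PARENT: Dict[str, Optional[str]] = {
    "c7": "c6", "c6": "c5", "c5": "c4", "c4": "c3",
    "c3": "c2", "c2": "c1", "c1": "c0", "c0": None,
}
REFS = {"master": "c7"}


def resolve(rev: str) -> Optional[str]:
    """Resolve 'name' or 'name~N' by following first parents."""
    name, _, steps = rev.partition("~")
    commit: Optional[str] = REFS[name]
    for _ in range(int(steps or 0)):
        if commit is None:
            break
        commit = PARENT[commit]
    return commit


def to_stored_limits(arg: str, tip: str = "master") -> List[str]:
    """Turn '-N' or 'A..B' into absolute '<id>' / '^<id>' entries."""
    if arg.startswith("-") and arg[1:].isdigit():
        # "-N": keep N commits from the tip, exclude everything older.
        include, exclude = resolve(tip), resolve(f"{tip}~{arg[1:]}")
    else:
        lower, _, upper = arg.partition("..")
        include, exclude = resolve(upper), resolve(lower)
    limits = [include]
    if exclude is not None:
        limits.append("^" + exclude)
    return limits
```

The stored form no longer depends on where "master" points later on, which
is the reason for translating before storing.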

4) All revision-walking operations automatically use these limiting args.

This should be a simple code change, and would enable rev-list, log,
etc. to avoid missing blobs/trees and thus enable them to work with
sparse clones.  fsck would take a bit more work, since it doesn't use
the setup_revisions() and revision.h walking machinery, but shouldn't
be too bad (I hope).

There are also performance ramifications: there should be no
measurable performance overhead for non-sparse clones (something that
might be a problem with a different implementation that did a
does-this-exist check each time it referenced a blob).  It should also
be a significant performance boost for those using sparse clones, as
operations will only need to deal with the subset of the repository
they specify (faster downloads, stats, logs, etc.).

5) "Densifying" a sparse clone can be done

One can fetch a new pack and replace the limiting rev-list args with
the new choice.  The sparse checkout information needs to be updated
too.

(So users probably would want to densify a sparse clone with "pull"
rather than "fetch", as manually updating sparse checkouts may be a
bit of a hassle.)

6) Cloning-from/fetching-from/pushing-to sparse clones is supported.

Future fetches and pushes also make use of the limiting arguments.
Receives do as well, but only to make sure the pack obtained is not
"more sparse" than what the receiving repository already has.
(uploads ignore the stored rev-list arguments, instead using the
rev-list arguments passed to it -- it will die if asked for content
not locally available to it.)

7) Operations that need unavailable data simply error out

Examples: merge, cherry-pick, rebase (and upload-pack in a sparse
clone).  However, hopefully the error messages state what extra
information needs to be downloaded so the user can appropriately
"densify" their repository.

Thanks,
Elijah

* Re: Sparse clones (Was: Re: [PATCH 1/2] upload-pack: support subtree  packing)
  2010-07-28  0:13 Sparse clones (Was: Re: [PATCH 1/2] upload-pack: support subtree packing) Elijah Newren
@ 2010-07-28  1:05 ` Avery Pennarun
  2010-07-28  3:06   ` Nguyen Thai Ngoc Duy
  2010-07-28  3:31   ` Elijah Newren
  2010-07-28  3:36 ` Nguyen Thai Ngoc Duy
  2010-08-13 17:31 ` Enrico Weigelt
  2 siblings, 2 replies; 16+ messages in thread
From: Avery Pennarun @ 2010-07-28  1:05 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Shawn O. Pearce, Nguyễn Thái Ngọc, git

2010/7/27 Elijah Newren <newren@gmail.com>:
> 0) Sparse clones have "all" commit objects, but not all trees/blobs.
>
> Note that "all" only means all that are reachable from the refs being
> downloaded, of course.  I think this is widely agreed upon and has
> been suggested many times on this list.

I think downloading all commit objects would require very low
bandwidth and storage space, so it should be harmless.

In fact, I have a pretty strong impression that also downloading all
*tree* objects would be fine too.  But I've never actually gone and
counted them to see what the stats are like.  Still, I'd assume that
the vast majority of repo space is blobs, not trees, and that trees
are highly compatible with deltification.

Note that if you happen to want to implement it in a way that you'll
also get all the commit objects from your submodules too (which I
highly encourage :)) then downloading the trees is the easiest way.
Otherwise you won't know which submodule commits you need.

> 1) A user controls sparseness by passing rev-list arguments to clone.
>
> This allows a user to control sparseness both in terms of span of
> content (files/directories) and depth of history.  It can also be used
> to limit to a subset of refs (cloning just one or two branches instead
> of all branches and tags).  For example,
>  $ git clone ssh://repo.git dst -- Documentation/
>  $ git clone ssh://repo.git dst master~6..master
>  $ git clone ssh://repo.git dst -3
> (Note that the destination argument becomes mandatory for those doing
> a sparse clone in order to disambiguate it from rev-list options.)

It's really too bad that the dst argument took up that slot which, in
every other git command, is where the list of revs would go :(  Other
than that, I think the syntax looks nice.

> There is a slight question as to whether users should have to specify
> "--all HEAD" with all sparse clones or whether it should be assumed
> when no other refs are listed.

Since downloading commits is so cheap anyway, I'd suggest just
defaulting to downloading all the refs, as clone currently does.  If
people don't like it, they can do what they currently do:

   git init
   git remote add ...
   git fetch

Not that pretty, but then again, it's rarely needed.

> 2) Sparse checkouts are automatically invoked with the path(s) from
>   the specified rev-list arguments.
>
> Can't checkout content that we don't have.  :-)
>
> This has a slight downside -- it makes sparse checkouts and sparse
> clones slight misfits: the syntax (.gitignore style vs. rev-list
> arguments) is a bit different, and sparse checkouts can exclude
> certain paths whereas my sparse clones would only be able to *include*
> paths.  I don't see this as a deal-breaker, but even if others
> disagree I think a more general path-exclusion mechanism for the
> revision walking machinery would be really nice for reasons beyond
> just this one.  I've often wanted to do something like
>  git log -S'important code phrase' --EXCLUDE-PATH=big-data-dir

I don't totally understand what you mean here.  But I do think that if
you can *mostly* trim down a tree, excluding every little thing is not
that important.  As was discussed on the other thread, it seems like
*most* people are trimming down their trees (currently using
submodules) just to make stuff faster, and getting rid of 90% of the
unwanted cruft is probably fine; getting rid of 100% of it isn't that
much more of a speed boost.

I guess my point is, more complex exclusions could always be added
later but they aren't so important right away.

> 3) The limiting rev-list arguments passed to clone are stored.
>
> However, relative arguments such as "-3" or "master~6" first need to
> be translated into one or more exclude ranges written as "^<sha1>".

Just run them through rev-parse, I think.

> 4) All revision-walking operations automatically use these limiting args.
>
> This should be a simple code change, and would enable rev-list, log,
> etc. to avoid missing blobs/trees and thus enable them to work with
> sparse clones.  fsck would take a bit more work, since it doesn't use
> the setup_revisions() and revision.h walking machinery, but shouldn't
> be too bad (I hope).

I don't know if this implementation detail would be better or worse
than just having the tools auto-trim their activities when they run
into a missing object.  But maybe.  It does sound sort of elegant:
this way they *won't* run into the missing objects.

Beware, however, that

   git log -- Documentation

outputs a different set of commits than just

   git log

You don't want to enable history simplification here; I think that
means you want --full-history on by default for the "stored" path
limiting, but not for any command-line path limiting.  That could be
slightly messy.
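
To illustrate why the two commands differ, here is a deliberately simplified
model.  The history, commit ids and the log() helper are invented for this
sketch; real git compares trees against parents and has special rules for
merges (which is where --full-history matters), none of which is modelled
here.

```python
from typing import List, Sequence, Set, Tuple

# Toy linear history, newest first: (commit id, paths changed by that commit).
# Ids and paths are made up for illustration.
HISTORY: List[Tuple[str, Set[str]]] = [
    ("c4", {"Documentation/git-clone.txt"}),
    ("c3", {"builtin/clone.c"}),
    ("c2", {"Documentation/git-log.txt", "revision.c"}),
    ("c1", {"Makefile"}),
]


def log(path_limits: Sequence[str] = ()) -> List[str]:
    """Return commit ids, keeping only commits that touch a limited path."""
    if not path_limits:
        return [cid for cid, _ in HISTORY]
    return [
        cid for cid, changed in HISTORY
        if any(p.startswith(limit) for p in changed for limit in path_limits)
    ]
```

In this model an unlimited log() lists all four commits, while
log(["Documentation/"]) lists only the two that touch that directory.  If
stored limits were applied the same way as command-line limits, a sparse
clone would silently hide the other commits even though it has them.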

> 5) "Densifying" a sparse clone can be done
>
> One can fetch a new pack and replace the limiting rev-list args with
> the new choice.  The sparse checkout information needs to be updated
> too.
>
> (So users probably would want to densify a sparse clone with "pull"
> rather than "fetch", as manually updating sparse checkouts may be a
> bit of a hassle.)

I think this would work, but unless you want to re-download some
(possibly lots of) objects you've already got, it would require some
kind of extra support from the server, I think.  Maybe that's a rare
enough case that few people will care and it could be fixed later.

I don't think the pull vs. fetch distinction is valid; I would be very
surprised if pull un-sparsified my checkout, just as I would be
surprised if merge did.  And pull is just fetch+merge.

> 6) Cloning-from/fetching-from/pushing-to sparse clones is supported.
>
> Future fetches and pushes also make use of the limiting arguments.
> Receives do as well, but only to make sure the pack obtained is not
> "more sparse" than what the receiving repository already has.
> (uploads ignore the stored rev-list arguments, instead using the
> rev-list arguments passed to it -- it will die if asked for content
> not locally available to it.)

This scares me a little.  It's a reminder that it's all-too-easy to
get your repository into a really messed up state by going in and
screwing with your sparseness parameters at the wrong time.

It would make me more comfortable if there was some kind of "oh god,
just fix it by downloading any objects you think are missing" mode :)
In fact, git could benefit from that in general - every now and then
someone on the list asks about a repository they managed to mangle by
corrupting a pack or something, and there's no really good answer to
that.

> 7) Operations that need unavailable data simply error out
>
> Examples: merge, cherry-pick, rebase (and upload-pack in a sparse
> clone).  However, hopefully the error messages state what extra
> information needs to be downloaded so the user can appropriately
> "densify" their repository.

That sounds good to me.

Have fun,

Avery

* Re: Sparse clones (Was: Re: [PATCH 1/2] upload-pack: support subtree  packing)
  2010-07-28  1:05 ` Avery Pennarun
@ 2010-07-28  3:06   ` Nguyen Thai Ngoc Duy
  2010-07-28  3:38     ` Nguyen Thai Ngoc Duy
  2010-07-28  3:31   ` Elijah Newren
  1 sibling, 1 reply; 16+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2010-07-28  3:06 UTC (permalink / raw)
  To: Avery Pennarun; +Cc: Elijah Newren, Shawn O. Pearce, git

2010/7/28 Avery Pennarun <apenwarr@gmail.com>:
> 2010/7/27 Elijah Newren <newren@gmail.com>:
>> 0) Sparse clones have "all" commit objects, but not all trees/blobs.
>>
>> Note that "all" only means all that are reachable from the refs being
>> downloaded, of course.  I think this is widely agreed upon and has
>> been suggested many times on this list.
>
> I think downloading all commit objects would require very low
> bandwidth and storage space, so it should be harmless.

Here you go. A pack with only the commits and trees of git.git#master
is 15M. With blobs, it is 28M. git.git is not typical of repos with
large blobs, though.
-- 
Duy

* Re: Sparse clones (Was: Re: [PATCH 1/2] upload-pack: support subtree  packing)
  2010-07-28  1:05 ` Avery Pennarun
  2010-07-28  3:06   ` Nguyen Thai Ngoc Duy
@ 2010-07-28  3:31   ` Elijah Newren
  2010-07-31 22:36     ` Elijah Newren
  1 sibling, 1 reply; 16+ messages in thread
From: Elijah Newren @ 2010-07-28  3:31 UTC (permalink / raw)
  To: Avery Pennarun; +Cc: Shawn O. Pearce, Nguyễn Thái Ngọc, git

2010/7/27 Avery Pennarun <apenwarr@gmail.com>:
> Note that if you happen to want to implement it in a way that you'll
> also get all the commit objects from your submodules too (which I
> highly encourage :)) then downloading the trees is the easiest way.
> Otherwise you won't know which submodule commits you need.

Makes sense.  Seems like a good reason to include all the trees.

> Since downloading commits is so cheap anyway, I'd suggest just
> defaulting to downloading all the refs, as clone currently does.  If
> people don't like it, they can do what they currently do:
>
>   git init
>   git remote add ...
>   git fetch
>
> Not that pretty, but then again, it's rarely needed.

Would you suggest then parsing the limiting arguments passed to clone
and disallowing refs?  Or just making it non-useful by always
appending "--all HEAD"?

>> 2) Sparse checkouts are automatically invoked with the path(s) from
>>   the specified rev-list arguments.
<snip>
> I don't totally understand what you mean here.  But I do think that if

Basically, I mean what you stated much more succinctly and eloquently
right here:

> I guess my point is, more complex exclusions could always be added
> later but they aren't so important right away.

>> 4) All revision-walking operations automatically use these limiting args.
<snip>
> It does sound sort of elegant: this way they *won't* run into the missing objects.
> Beware, however, that
>
>   git log -- Documentation
>
> outputs a different set of commits than just
>
>   git log

Yes, exactly.  In a sparse clone, why wouldn't one want the behavior
of the former automatically, without having to specify the paths on
the command line every time they ran log (or rev-list or fast-export
or...etc., especially if they cloned N directories rather than just
1)?

Actually, I can kind of see the desire to see the 'real' log since the
users do happen to have all commits locally, but it almost seems like
it should be the case that requires a special option to be passed to
git log ('--ignore-sparse-limiting'?).  But trying to get that option
to work in conjunction with other options (--stat, -S, -p, etc.) would
be really hard, if not impossible.

>> 5) "Densifying" a sparse clone can be done
<snip>
> I think this would work, but unless you want to re-download some
> (possibly lots of) objects you've already got, it would require some
> kind of extra support from the server, I think.  Maybe that's a rare
> enough case that few people will care and it could be fixed later.

For my first implementation, my plan was to simply re-download ALL
(not just some or lots of) objects I've already got in such a case.  A
bit wasteful to be sure, but I was hoping it was rare enough to
"densify" a clone that it wouldn't be a big deal...and that support
for smarter downloads could be added later.

> I don't think the pull vs. fetch distinction is valid; I would be very
> surprised if pull un-sparsified my checkout, just as I would be
> surprised if merge did.  And pull is just fetch+merge.

Right, I don't think pull should un-sparsify either the checkout OR
the clone by default (it should have fetch pass the same limiting
arguments and only download an equivalently sparse set of updates).
Your point about pull=fetch+merge (or fetch+rebase) makes sense, which
I guess means that un-sparsifying a clone+checkout should be a
separate toplevel command ("densify"?) rather than a special option
for fetch/pull.

>> 6) Cloning-from/fetching-from/pushing-to sparse clones is supported.
>>
>> Future fetches and pushes also make use of the limiting arguments.
>> Receives do as well, but only to make sure the pack obtained is not
>> "more sparse" than what the receiving repository already has.
>> (uploads ignore the stored rev-list arguments, instead using the
>> rev-list arguments passed to it -- it will die if asked for content
>> not locally available to it.)
>
> This scares me a little.  It's a reminder that it's all-too-easy to
> get your repository into a really messed up state by going in and
> screwing with your sparseness parameters at the wrong time.

I don't follow.  Why would people be "screwing with sparseness parameters"?

My basic idea was that there would be only three ways to change
sparseness parameters for clones, with only the first two documented:
the initial clone command, the "densify" command (someone probably
needs to think of a better name), and reading the source code to
figure out what bits on your disk to change and changing them.

Here's why I want the clone-able/fetch-able/pull-able sparse clone
functionality:

I like having translators (who only need maybe one file) or technical
writers (who only need the Documentation/ subdirectory) or other
similar folks having the ability to collaborate on the subset of the
repository that they need to do their work.  Thus, it makes sense for
them to be able to clone from, pull from, and push to each other.  The
only two rules that I think are necessary to enable such behavior are:

* No repository can provide information that it doesn't have (should
be pretty easy to enforce...)
* No repository accepts less data than it expects in its repository
(i.e. you can push to a sparse clone or a real clone, but need to
provide data that fulfills its rev-list limiting arguments)
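
The two rules above can be sketched as a pair of checks on path limits.
This is a rough model only: covers(), can_serve() and can_accept() are
hypothetical names, an empty string stands for "the whole tree", and
revision-depth limits are ignored entirely.

```python
from typing import Iterable


def _norm(prefix: str) -> str:
    """Normalise a prefix so 'Documentation' and 'Documentation/' match."""
    return prefix.strip("/") + "/" if prefix.strip("/") else ""


def covers(have: Iterable[str], want: Iterable[str]) -> bool:
    """True if every prefix in `want` lies inside some prefix in `have`.

    An empty string stands for the whole tree (a non-sparse repository).
    """
    have_n = [_norm(p) for p in have]
    return all(any(_norm(w).startswith(h) for h in have_n) for w in want)


def can_serve(local_paths, requested_paths) -> bool:
    # Rule 1: a repository cannot provide what it does not have.
    return covers(local_paths, requested_paths)


def can_accept(local_paths, incoming_paths) -> bool:
    # Rule 2: an incoming pack must not be sparser than the receiver.
    return covers(incoming_paths, local_paths)
```

Under this model a Documentation/-only clone can serve another
Documentation/-only clone, but cannot serve builtin/, and a full clone
refuses a push that only carries Documentation/.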

> It would make me more comfortable if there was some kind of "oh god,
> just fix it by downloading any objects you think are missing" mode :)
> In fact, git could benefit from that in general - every now and then
> someone on the list asks about a repository they managed to mangle by
> corrupting a pack or something, and there's no really good answer to
> that.

For sparse clones, isn't that mode just running the "densify" command
with no limiting arguments?

>> 7) Operations that need unavailable data simply error out
>>
>> Examples: merge, cherry-pick, rebase (and upload-pack in a sparse
>> clone).  However, hopefully the error messages state what extra
>> information needs to be downloaded so the user can appropriately
>> "densify" their repository.
>
> That sounds good to me.

Thanks for the detailed feedback.  :-)

* Re: Sparse clones (Was: Re: [PATCH 1/2] upload-pack: support subtree  packing)
  2010-07-28  0:13 Sparse clones (Was: Re: [PATCH 1/2] upload-pack: support subtree packing) Elijah Newren
  2010-07-28  1:05 ` Avery Pennarun
@ 2010-07-28  3:36 ` Nguyen Thai Ngoc Duy
  2010-07-28  3:59   ` Elijah Newren
  2010-08-13 17:31 ` Enrico Weigelt
  2 siblings, 1 reply; 16+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2010-07-28  3:36 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Shawn O. Pearce, git

2010/7/28 Elijah Newren <newren@gmail.com>:
> 1) A user controls sparseness by passing rev-list arguments to clone.
>
> This allows a user to control sparseness both in terms of span of
> content (files/directories) and depth of history.  It can also be used
> to limit to a subset of refs (cloning just one or two branches instead
> of all branches and tags).  For example,
>  $ git clone ssh://repo.git dst -- Documentation/

Are pathspecs supported too, in addition to prefixes?

>  $ git clone ssh://repo.git dst master~6..master
>  $ git clone ssh://repo.git dst -3
> (Note that the destination argument becomes mandatory for those doing
> a sparse clone in order to disambiguate it from rev-list options.)
>
> This method also means users don't need much training to learn how to
> use sparse clones -- they just use syntax they've already learned with
> log, and clone will pass this info on to upload-pack.
>
> There is a slight question as to whether users should have to specify
> "--all HEAD" with all sparse clones or whether it should be assumed
> when no other refs are listed.

So you basically kill off shallow clone too, with "master~6..master".
I wonder what happens if user does "git clone ... master~6..master~3"?

> 4) All revision-walking operations automatically use these limiting args.
>
> This should be a simple code change, and would enable rev-list, log,
> etc. to avoid missing blobs/trees and thus enable them to work with
> sparse clones.  fsck would take a bit more work, since it doesn't use
> the setup_revisions() and revision.h walking machinery, but shouldn't
> be too bad (I hope).
>
> There are also performance ramifications: There should be no
> measurable performance overhead for non-sparse clones (something that
> might be a problem with a different implementation that did
> does-this-exist check each time it references a blob).  It should also
> be a significant performance boost for those using it, as operations
> will only need to deal with the subset of the repository they specify
> (faster downloads, stats, logs, etc.)

Revision walking is not the only gateway to objects. Others, like the
diff machinery, also need to be taught about rev-list limits.

> 5) "Densifying" a sparse clone can be done
>
> One can fetch a new pack and replace the limiting rev-list args with
> the new choice.  The sparse checkout information needs to be updated
> too.
>
> (So users probably would want to densify a sparse clone with "pull"
> rather than "fetch", as manually updating sparse checkouts may be a
> bit of a hassle.)

What information would you send to the server to request a new pack in
a sparse clone? Currently we send all commit tips. rev-list has a notion
of subtracting commit trees. I don't know if it can "add" or "subtract"
a tree prefix though.
-- 
Duy

* Re: Sparse clones (Was: Re: [PATCH 1/2] upload-pack: support subtree  packing)
  2010-07-28  3:06   ` Nguyen Thai Ngoc Duy
@ 2010-07-28  3:38     ` Nguyen Thai Ngoc Duy
  2010-07-28  3:58       ` Avery Pennarun
  0 siblings, 1 reply; 16+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2010-07-28  3:38 UTC (permalink / raw)
  To: Avery Pennarun; +Cc: Elijah Newren, Shawn O. Pearce, git

(corrected reply context, sorry)

On Wed, Jul 28, 2010 at 1:06 PM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
> 2010/7/28 Avery Pennarun <apenwarr@gmail.com>:
>> 2010/7/27 Elijah Newren <newren@gmail.com>:
>>> 0) Sparse clones have "all" commit objects, but not all trees/blobs.
>>>
>>> Note that "all" only means all that are reachable from the refs being
>>> downloaded, of course.  I think this is widely agreed upon and has
>>> been suggested many times on this list.
>>
>> I think downloading all commit objects would require very low
>> bandwidth and storage space, so it should be harmless.
>>
>> In fact, I have a pretty strong impression that also downloading
>> all *tree* objects would be fine too.
>
> Here you go. A pack with only commits and trees of git.git#master is
> 15M. With blobs, it is 28M. Git is not a typical repo with large blobs
> though.
> --
> Duy
>



-- 
Duy

* Re: Sparse clones (Was: Re: [PATCH 1/2] upload-pack: support subtree  packing)
  2010-07-28  3:38     ` Nguyen Thai Ngoc Duy
@ 2010-07-28  3:58       ` Avery Pennarun
  2010-07-28  6:12         ` Sverre Rabbelier
  2010-07-28  7:11         ` Nguyen Thai Ngoc Duy
  0 siblings, 2 replies; 16+ messages in thread
From: Avery Pennarun @ 2010-07-28  3:58 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: Elijah Newren, Shawn O. Pearce, git

On Wed, Jul 28, 2010 at 1:06 PM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
> 2010/7/28 Avery Pennarun <apenwarr@gmail.com>:
>> 2010/7/27 Elijah Newren <newren@gmail.com>:
>>> 0) Sparse clones have "all" commit objects, but not all trees/blobs.
>>>
>>> Note that "all" only means all that are reachable from the refs being
>>> downloaded, of course.  I think this is widely agreed upon and has
>>> been suggested many times on this list.
>>
>> I think downloading all commit objects would require very low
>> bandwidth and storage space, so it should be harmless.
>>
>> In fact, I have a pretty strong impression that also downloading
>> all *tree* objects would be fine too.
>
> Here you go. A pack with only commits and trees of git.git#master is
> 15M. With blobs, it is 28M. Git is not a typical repo with large blobs
> though.

Hmm, that's very interesting - more than half the repo is just tree
and commit objects?  Maybe that's not so shocking after all, given the
tendency in the git project to use long commit messages and relatively
short patches.

Was your pack carefully ordered for best deltification?

Knowing how much of that is commits vs. trees would also be very interesting.

But if so, only saving half the space is kind of disappointing.  If
you have a script around for generating this, it would be very
interesting to compare the results with, say, the Linux kernel repo
(especially since it seems to be the #1 example of "submodules people
don't want to check out because they're so bloody huge").

In bup, I know the trees+commits are much smaller than the blobs, so
my intuition was telling me it would be the same in git.  It's
entirely possible that I was wrong, though.  In retrospect, bup uses
really short computer-generated commit messages, and backs up large
numbers of files at once, most of which never change (and thus most of
the trees never change).  Commits+trees end up somewhere around 0.5%
of the total repo size.

Have fun,

Avery

* Re: Sparse clones (Was: Re: [PATCH 1/2] upload-pack: support subtree  packing)
  2010-07-28  3:36 ` Nguyen Thai Ngoc Duy
@ 2010-07-28  3:59   ` Elijah Newren
  2010-07-29 10:29     ` Nguyen Thai Ngoc Duy
  0 siblings, 1 reply; 16+ messages in thread
From: Elijah Newren @ 2010-07-28  3:59 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: Shawn O. Pearce, git

On Tue, Jul 27, 2010 at 9:36 PM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
> 2010/7/28 Elijah Newren <newren@gmail.com>:
>> 1) A user controls sparseness by passing rev-list arguments to clone.
<snip>
>> For example,
>>  $ git clone ssh://repo.git dst -- Documentation/
>
> Are pathspecs supported too, in addition to prefixes?

Basically, whatever git log or git rev-list accepts.  I think I saw
some other discussion about making those adopt some of the
code/capability of git grep, which would automatically benefit sparse
clones.  But until then, no, because I need to be able to take these
arguments and automatically pass them on to log, rev-list, etc.

> So you basically kill off shallow clone too, with "master~6..master".

Yes, that was part of the plan...extend the capabilities of shallow
clones in two ways: allowing the user to specify a cutoff via a
revision identifier as well as a number of commits, and allow people
to clone (and fetch-from/push-to) other "shallow" clones.

> I wonder what happens if user does "git clone ... master~6..master~3"?

Currently, that'd break -- just as it does for fast-export
(see t/t9350-fast-export.sh, 'no exact-ref revisions included').  I
had been thinking of trying to get that fixed for both cases by making
it result in a "master" branch that is "three commits behind" what you
clone/fast-export from.  You'd have to look for and disallow other
special cases like "git fast-export ... master^1 master^2" or "git
clone ... :/searchstring".

I'm not sure how this interacts with Avery's suggestion to just ignore
branch/tag limiting.

> Revision walking is not the only gateway to objects. Others, like the
> diff machinery, also need to be taught about rev-list limits.

Right, good point.  Are there others than the diff machinery (and the
fsck special case) that you know of?

> What information would you send to the server to request a new pack in
> a sparse clone? Currently we send all commit tips. rev-list has a notion
> of subtracting commit trees. I don't know if it can "add" or "subtract"
> a tree prefix though.

When "densifying" a sparse clone, I was (initially at least) just
going to treat it like an initial clone and re-download _everything_
(even if sparsifying rather than densifying).  I assumed it'd be rare
to want to do such an operation, but yeah, in the future someone might
want a smarter way to handle it.

* Re: Sparse clones (Was: Re: [PATCH 1/2] upload-pack: support subtree  packing)
  2010-07-28  3:58       ` Avery Pennarun
@ 2010-07-28  6:12         ` Sverre Rabbelier
  2010-07-28  7:59           ` Nguyen Thai Ngoc Duy
  2010-07-28  7:11         ` Nguyen Thai Ngoc Duy
  1 sibling, 1 reply; 16+ messages in thread
From: Sverre Rabbelier @ 2010-07-28  6:12 UTC (permalink / raw)
  To: Avery Pennarun; +Cc: Nguyen Thai Ngoc Duy, Elijah Newren, Shawn O. Pearce, git

Heya,

On Tue, Jul 27, 2010 at 22:58, Avery Pennarun <apenwarr@gmail.com> wrote:
> Hmm, that's very interesting - more than half the repo is just tree
> and commit objects?  Maybe that's not so shocking after all, given the
> tendency in the git project to use long commit messages and relatively
> short patches.

Note that in the case of the ginormous-tree this holds as well. Many
small files, but in insanely deeply nested directories with insane
fan-out. If you need to download all the trees you don't save that
much bandwidth. OTOH, I'm not only concerned about bandwidth, just
being able to run 'git status' without it taking half a minute would
be sweet.

-- 
Cheers,

Sverre Rabbelier

* Re: Sparse clones (Was: Re: [PATCH 1/2] upload-pack: support subtree  packing)
  2010-07-28  3:58       ` Avery Pennarun
  2010-07-28  6:12         ` Sverre Rabbelier
@ 2010-07-28  7:11         ` Nguyen Thai Ngoc Duy
  1 sibling, 0 replies; 16+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2010-07-28  7:11 UTC (permalink / raw)
  To: Avery Pennarun; +Cc: Elijah Newren, Shawn O. Pearce, git

On Wed, Jul 28, 2010 at 1:58 PM, Avery Pennarun <apenwarr@gmail.com> wrote:
> On Wed, Jul 28, 2010 at 1:06 PM, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
>> 2010/7/28 Avery Pennarun <apenwarr@gmail.com>:
>>> 2010/7/27 Elijah Newren <newren@gmail.com>:
>>>> 0) Sparse clones have "all" commit objects, but not all trees/blobs.
>>>>
>>>> Note that "all" only means all that are reachable from the refs being
>>>> downloaded, of course.  I think this is widely agreed upon and has
>>>> been suggested many times on this list.
>>>
>>> I think downloading all commit objects would require very low
>>> bandwidth and storage space, so it should be harmless.
>>>
>>> In fact, I have a pretty strong impression that also downloading
>>> all *tree* objects would be fine too.
>>
>> Here you go. A pack with only commits and trees of git.git#master is
>> 15M. With blobs, it is 28M. Git is not a typical repo with large blobs
>> though.
>
> Hmm, that's very interesting - more than half the repo is just tree
> and commit objects?  Maybe that's not so shocking after all, given the
> tendency in the git project to use long commit messages and relatively
> short patches.
>
> Was your pack carefully ordered for best deltification?

I did not do any optimization. It's pack-objects' defaults. I only
filtered blobs out and that's what fetch-pack would receive.

>
> Knowing how much of that is commits vs. trees would also be very interesting.

Commits alone take 7.8 MB. Well... all those commit messages...

> But if so, only saving half the space is kind of disappointing.  If
> you have a script around for generating this, it would be very
> interesting to compare the results with, say, the Linux kernel repo
> (especially since it seems to be the #1 example of "submodules people
> don't want to check out because they're so bloody huge").

I modified show_object() in pack-objects.c to skip certain objects. Actually
I started with git-fetch-pack, but you can do

git rev-parse master | git pack-objects --revs --delta-base-offset pack

Then verify what's in the pack with

git verify-pack -v pack-*.idx
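
For a per-type breakdown like the figures above, the verify-pack output
can be summed by object type. This is only a minimal sketch: it assumes
the usual `git verify-pack -v` line layout (object name, type, size, size
in pack, offset), and the helper name summarize_pack_types is made up for
illustration.

```shell
# Hypothetical helper: read `git verify-pack -v` output on stdin and print
# "<type> <object count> <bytes in pack>" for each object type.
# Summary lines ("non delta: ...", "chain length = ...", "<pack>: ok")
# are skipped because their second field is not an object type.
summarize_pack_types() {
    awk '$2 ~ /^(commit|tree|blob|tag)$/ {
             count[$2]++          # objects of this type
             bytes[$2] += $4      # fourth column: size inside the pack
         }
         END {
             for (t in bytes) printf "%s %d %d\n", t, count[t], bytes[t]
         }' | sort
}

# Usage (run next to the generated pack):
#   git verify-pack -v pack-*.idx | summarize_pack_types
```

The sizes are the in-pack (compressed, possibly deltified) sizes, so they
add up to roughly the pack size rather than the unpacked object sizes.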
-- 
Duy

* Re: Sparse clones (Was: Re: [PATCH 1/2] upload-pack: support subtree  packing)
  2010-07-28  6:12         ` Sverre Rabbelier
@ 2010-07-28  7:59           ` Nguyen Thai Ngoc Duy
  2010-07-28 14:48             ` Sverre Rabbelier
  0 siblings, 1 reply; 16+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2010-07-28  7:59 UTC (permalink / raw)
  To: Sverre Rabbelier; +Cc: Avery Pennarun, Elijah Newren, Shawn O. Pearce, git

On Wed, Jul 28, 2010 at 4:12 PM, Sverre Rabbelier <srabbelier@gmail.com> wrote:
> OTOH, I'm not only concerned about bandwidth, just
> being able to run 'git status' without it taking half a minute would
> be sweet.

Doesn't assume-unchanged bit or sparse checkout help?
-- 
Duy

* Re: Sparse clones (Was: Re: [PATCH 1/2] upload-pack: support subtree  packing)
  2010-07-28  7:59           ` Nguyen Thai Ngoc Duy
@ 2010-07-28 14:48             ` Sverre Rabbelier
  0 siblings, 0 replies; 16+ messages in thread
From: Sverre Rabbelier @ 2010-07-28 14:48 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: Avery Pennarun, Elijah Newren, Shawn O. Pearce, git

Heya,

On Wed, Jul 28, 2010 at 02:59, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
> Doesn't assume-unchanged bit or sparse checkout help?

See my earlier part, assume-unchanged doesn't help due to the
gitignore files. Sparse checkout isn't an option since I really need
those files to be there, I just don't ever modify them. Really what I
need is read-only checkout ;). "Git, please ignore the existence of
this directory and all its files/subdirectories, ktnx".

-- 
Cheers,

Sverre Rabbelier

* Re: Sparse clones (Was: Re: [PATCH 1/2] upload-pack: support subtree  packing)
  2010-07-28  3:59   ` Elijah Newren
@ 2010-07-29 10:29     ` Nguyen Thai Ngoc Duy
  0 siblings, 0 replies; 16+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2010-07-29 10:29 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Shawn O. Pearce, git

On Wed, Jul 28, 2010 at 1:59 PM, Elijah Newren <newren@gmail.com> wrote:
>> Revision walking is not the only gate to access objects. Others, like
>> the diff machinery, also need to be taught about rev-list limits.
>
> Right, good point.  Are there others than the diff machinery (and the
> fsck special case) that you know of?

A lot (just found out as I was pushing subtree clone as far as I
could). For merging, you can hardly limit sha1 access. When writing
a tree, git is particularly paranoid and checks for sha1 existence
(has_sha1_file and assert_sha1_type). You can find that has_sha1_file
is used in many places, not just commit/write-tree.

Narrow/Sparse/Subtree/Lazy/Whatever-it-is clone will have a hard time...
Oh... the lazy one does not.
-- 
Duy

* Re: Sparse clones (Was: Re: [PATCH 1/2] upload-pack: support subtree  packing)
  2010-07-28  3:31   ` Elijah Newren
@ 2010-07-31 22:36     ` Elijah Newren
  0 siblings, 0 replies; 16+ messages in thread
From: Elijah Newren @ 2010-07-31 22:36 UTC (permalink / raw)
  To: Avery Pennarun; +Cc: Shawn O. Pearce, Nguyễn Thái Ngọc, git

Hi,

2010/7/27 Elijah Newren <newren@gmail.com>:
> 2010/7/27 Avery Pennarun <apenwarr@gmail.com>:
>> Note that if you happen to want to implement it in a way that you'll
>> also get all the commit objects from your submodules too (which I
>> highly encourage :)) then downloading the trees is the easiest way.
>> Otherwise you won't know which submodule commits you need.
>
> Makes sense.  Seems like a good reason to include all the trees.

Actually, having thought about it more, I don't see the reason for
getting all the commit objects from submodules (unless those
submodules are at paths specified for download).  If a user has
specified that they just want the Documentation subdirectory, why
would it matter if the submodule under src/widgets was downloaded?
They don't want to do anything with any of its contents, so I don't
see why they'd need its trees or commits.  Am I missing something?

Also, I'm rethinking the download-all-commits aspect too.  This is
partially due to Nguyễn's stats (and special usecases like
translators), partially because of security issues (it has already
been stated that only including stuff meant to be public is an
important security concern for clone[1], and commit logs for changes
completely outside specified paths might be considered non-public
data[2]), and partially because it reinforces my whole rev-list
limiting args idea (it makes it really clear that 'git log' should
automatically behave like 'git log -- Documentation/' in a sparse
clone of just Documentation/).
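
The path-limited behaviour referred to here can already be seen in a full
clone. A minimal sketch, assuming an ordinary repository; the helper name
commits_touching is made up for illustration, and a sparse clone would
presumably apply such a pathspec implicitly:

```shell
# Hypothetical helper: list the commits reachable from HEAD in repository $1
# that touch path $2 (the same history simplification that
# `git log -- <path>` uses).
commits_touching() {
    git -C "$1" rev-list HEAD -- "$2"
}

# Usage (repository path and pathspec are placeholders):
#   commits_touching /path/to/repo Documentation/
```

Commits that only change files outside the given path are left out of the
listing, which is the set a Documentation/-only clone would be expected
to show by default.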

[1] e.g. http://article.gmane.org/gmane.comp.version-control.git/115835

[2] This isn't just theoretical either.  I have a couple big important
(to $dayjob and thus me) sparse-clone usecases in this situation and
have for a few years, but gave up on it thinking it wouldn't be
possible with sparse clones.  I instead wrote a fast filtering
mechanism using fast-export/fast-import that creates a new repository
and keeps track of the mapping between sha1sums in unfiltered and
filtered repos, allowing changes to be grafted between the two.  Kind
of a pain, and suboptimal for a few reasons.  It'd be really nice if I
could replace this stuff with sparse clones, but can't do that if
commit logs corresponding to changes completely outside the sparse
paths are included.

* Re: Sparse clones (Was: Re: [PATCH 1/2] upload-pack: support subtree packing)
  2010-07-28  0:13 Sparse clones (Was: Re: [PATCH 1/2] upload-pack: support subtree packing) Elijah Newren
  2010-07-28  1:05 ` Avery Pennarun
  2010-07-28  3:36 ` Nguyen Thai Ngoc Duy
@ 2010-08-13 17:31 ` Enrico Weigelt
  2010-08-13 19:19   ` Truncating history (Re: Sparse clones) Jonathan Nieder
  2 siblings, 1 reply; 16+ messages in thread
From: Enrico Weigelt @ 2010-08-13 17:31 UTC (permalink / raw)
  To: git

* Elijah Newren <newren@gmail.com> wrote:

Hi folks,

as I'm doing many backups via git (eg. hourly sql dumps), I'd 
like to cut off the history (eg. at the n'th past commit)
and reclaim the space - both on local and remote side (even
differently).

So let me propose another approach: fake-roots

Fake-roots are special refs that declare certain commit objects
as root commits. Each time git walks down the history, it checks
whether the current commit is a fake-root and, if so, treats it as
having no ancestor. That should be generic enough to let everything
else (commit, push, gc, etc.) work as usual.

The only tricky point is when to update remote fake-roots: the
remote should not cut off my local repo (unless explicitly asked).
So remote fake-roots should only be imported if the local/receiving
side no longer has the dropped commits.

Hmm, maybe it's even wiser to go one step back in history
and introduce fake-empties (which also have no parents) instead
of fake-roots? A fake-empty is imported as soon as the original
object is missing.

Of course, it's important that this feature has to be explicitly
enabled (maybe even on per-remote basis) to prevent security flaws.

cu
-- 
----------------------------------------------------------------------
 Enrico Weigelt, metux IT service -- http://www.metux.de/

 phone:  +49 36207 519931  email: weigelt@metux.de
 mobile: +49 151 27565287  icq:   210169427         skype: nekrad666
----------------------------------------------------------------------
 Embedded-Linux / Portierung / Opensource-QM / Verteilte Systeme
----------------------------------------------------------------------

* Truncating history (Re: Sparse clones)
  2010-08-13 17:31 ` Enrico Weigelt
@ 2010-08-13 19:19   ` Jonathan Nieder
  0 siblings, 0 replies; 16+ messages in thread
From: Jonathan Nieder @ 2010-08-13 19:19 UTC (permalink / raw)
  To: git

Hi,

Enrico Weigelt wrote:

> Fake-roots are special refs that declare certain commit objects
> as root commits. Each time git walks down the history, it checks
> whether the current commit is a fake-root and, if so, treats it as
> having no ancestor. That should be generic enough to let everything
> else (commit, push, gc, etc.) work as usual.
> 
> The only tricky point is when to update remote fake-roots: the
> remote should not cut off my local repo (unless explicitly asked).
> So remote fake-roots should only be imported if the local/receiving
> side no longer has the dropped commits.

You may be interested in grafts and replacement refs; see
git-filter-branch(1) and git-replace(1) for some hints.
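
As a rough illustration of the replacement-ref route, the sketch below
makes a chosen commit appear parentless. It assumes a git new enough to
have `git replace --graft` (newer than the git current when this thread
was written); the helper name truncate_history_at is made up for
illustration.

```shell
# Hypothetical helper: in repository $1, record a replacement for commit $2
# that has no parents, so history walks stop there as if it were a root.
truncate_history_at() {
    git -C "$1" replace --graft "$2"
}

# Usage (repository path and commit are placeholders):
#   truncate_history_at /path/to/repo HEAD~100
#   git -C /path/to/repo rev-list --count HEAD   # walk now ends at the graft
```

The replacement only changes what history walks see. The older commits
stay in the object store until history is actually rewritten (for example
with git-filter-branch) and the unreachable objects are pruned, so this
alone does not reclaim space.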

Good luck,
Jonathan

Thread overview: 16+ messages
2010-07-28  0:13 Sparse clones (Was: Re: [PATCH 1/2] upload-pack: support subtree packing) Elijah Newren
2010-07-28  1:05 ` Avery Pennarun
2010-07-28  3:06   ` Nguyen Thai Ngoc Duy
2010-07-28  3:38     ` Nguyen Thai Ngoc Duy
2010-07-28  3:58       ` Avery Pennarun
2010-07-28  6:12         ` Sverre Rabbelier
2010-07-28  7:59           ` Nguyen Thai Ngoc Duy
2010-07-28 14:48             ` Sverre Rabbelier
2010-07-28  7:11         ` Nguyen Thai Ngoc Duy
2010-07-28  3:31   ` Elijah Newren
2010-07-31 22:36     ` Elijah Newren
2010-07-28  3:36 ` Nguyen Thai Ngoc Duy
2010-07-28  3:59   ` Elijah Newren
2010-07-29 10:29     ` Nguyen Thai Ngoc Duy
2010-08-13 17:31 ` Enrico Weigelt
2010-08-13 19:19   ` Truncating history (Re: Sparse clones) Jonathan Nieder