Re: Sparse clones (Was: Re: [PATCH 1/2] upload-pack: support subtree packing)

From: Avery Pennarun <apenwarr@gmail.com>
To: Elijah Newren <newren@gmail.com>
Cc: "Shawn O. Pearce" <spearce@spearce.org>,
	"Nguyễn Thái Ngọc" <pclouds@gmail.com>,
	git@vger.kernel.org
Subject: Re: Sparse clones (Was: Re: [PATCH 1/2] upload-pack: support subtree packing)
Date: Tue, 27 Jul 2010 21:05:10 -0400
Message-ID: <AANLkTikMLOFet-VMT7MntPgoSkvqGAXPd8Z1aaDpY1xs@mail.gmail.com>
In-Reply-To: <AANLkTikJhSVJw2hXkp0j6XA+k-J-AtSYzKWumjnqqsgz@mail.gmail.com>

2010/7/27 Elijah Newren <newren@gmail.com>:
> 0) Sparse clones have "all" commit objects, but not all trees/blobs.
>
> Note that "all" only means all that are reachable from the refs being
> downloaded, of course.  I think this is widely agreed upon and has
> been suggested many times on this list.

I think downloading all commit objects would require very low
bandwidth and storage space, so it should be harmless.

In fact, I have a pretty strong impression that downloading all
*tree* objects would be fine too.  But I've never actually gone and
counted them to see what the stats are like.  Still, I'd assume that
the vast majority of repo space is blobs, not trees, and that trees
delta-compress very well.
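
One way to actually count them, as a rough sketch rather than a
benchmark: the helper below (the name object_stats is made up for
this example) prints, for each object type, the number of objects and
their total on-disk size in bytes.  It assumes a git new enough to
have "cat-file --batch-all-objects" (2.6 or later), which did not
exist when this thread was written.

```shell
# Sketch: summarise object count and on-disk bytes per object type
# for the repository at path $1.  object_stats is a hypothetical
# helper name, not a git command.
object_stats() {
    git -C "$1" cat-file --batch-all-objects \
        --batch-check='%(objecttype) %(objectsize:disk)' |
    awk '{ n[$1]++; bytes[$1] += $2 }
         END { for (t in n) printf "%s %d %d\n", t, n[t], bytes[t] }' |
    sort
}
```

Running "object_stats ." after "git gc" in a real repository should
show how the packed sizes of trees and blobs compare.  The on-disk
size of a deltified object depends on how the pack happened to be
built, so treat the numbers as indicative only.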

Note that if you want to implement it in a way that also gets you all
the commit objects from your submodules (which I highly encourage :)),
then downloading the trees is the easiest way.  Submodule commit ids
are recorded as entries in the superproject's trees, so without the
trees you won't know which submodule commits you need.

> 1) A user controls sparseness by passing rev-list arguments to clone.
>
> This allows a user to control sparseness both in terms of span of
> content (files/directories) and depth of history.  It can also be used
> to limit to a subset of refs (cloning just one or two branches instead
> of all branches and tags).  For example,
>  $ git clone ssh://repo.git dst -- Documentation/
>  $ git clone ssh://repo.git dst master~6..master
>  $ git clone ssh://repo.git dst -3
> (Note that the destination argument becomes mandatory for those doing
> a sparse clone in order to disambiguate it from rev-list options.)

It's really too bad that the dst argument took up that slot which, in
every other git command, is where the list of revs would go :(  Other
than that, I think the syntax looks nice.

> There is a slight question as to whether users should have to specify
> "--all HEAD" with all sparse clones or whether it should be assumed
> when no other refs are listed.

Since downloading commits is so cheap anyway, I'd suggest just
defaulting to downloading all the refs, as clone currently does.  If
people don't like it, they can do what they currently do:

   git init
   git remote add ...
   git fetch

Not that pretty, but then again, it's rarely needed.
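
As a sketch of that workaround for the common "just one branch" case
(the function name and its three arguments are made up for this
example; "remote add -t" restricts the remote's fetch refspec to the
named branch):

```shell
# Sketch: fetch a single branch into a fresh repository.
# $1 = source URL or path, $2 = branch name, $3 = destination directory.
# clone_one_branch is a hypothetical helper, not a git command.
clone_one_branch() {
    git init -q "$3" &&
    git -C "$3" remote add -t "$2" origin "$1" &&
    git -C "$3" fetch -q origin
}
```

For example, "clone_one_branch ssh://repo.git master dst" would leave
dst with only refs/remotes/origin/master fetched from the remote's
branches, and nothing checked out yet.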

> 2) Sparse checkouts are automatically invoked with the path(s) from
>   the specified rev-list arguments.
>
> Can't checkout content that we don't have.  :-)
>
> This has a slight downside -- it makes sparse checkouts and sparse
> clones slight misfits: the syntax (.gitignore style vs. rev-list
> arguments) is a bit different, and sparse checkouts can exclude
> certain paths whereas my sparse clones would only be able to *include*
> paths.  I don't see this as a deal-breaker, but even if others
> disagree I think a more general path-exclusion mechanism for the
> revision walking machinery would be really nice for reasons beyond
> just this one.  I've often wanted to do something like
>  git log -S'important code phrase' --EXCLUDE-PATH=big-data-dir

I don't totally understand what you mean here.  But I do think that if
you can *mostly* trim down a tree, excluding every little thing is not
that important.  As was discussed on the other thread, it seems like
*most* people are trimming down their trees (currently using
submodules) just to make stuff faster, and getting rid of 90% of the
unwanted cruft is probably fine; getting rid of 100% of it isn't that
much more of a speed boost.

I guess my point is, more complex exclusions could always be added
later but they aren't so important right away.

> 3) The limiting rev-list arguments passed to clone are stored.
>
> However, relative arguments such as "-3" or "master~6" first need to
> be translated into one or more exclude ranges written as "^<sha1>".

Just run them through rev-parse, I think.
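
A minimal sketch of what that could look like for a range (the
function name freeze_range is made up here).  For "A..B", rev-parse
prints the object name of B, then the object name of A prefixed with
"^", which is the exclude form described above.  A count limit such as
"-3" is not a revision expression, so this sketch does not cover it.

```shell
# Sketch: turn a relative range such as master~6..master into
# absolute object names that can be stored.
# $1 = repository path, $2 = revision range.
# freeze_range is a hypothetical helper, not a git command.
freeze_range() {
    git -C "$1" rev-parse "$2"
}
```

Calling "freeze_range . master~6..master" prints two lines: the
object name of master, then "^" followed by the object name of
master~6.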

> 4) All revision-walking operations automatically use these limiting args.
>
> This should be a simple code change, and would enable rev-list, log,
> etc. to avoid missing blobs/trees and thus enable them to work with
> sparse clones.  fsck would take a bit more work, since it doesn't use
> the setup_revisions() and revision.h walking machinery, but shouldn't
> be too bad (I hope).

I don't know whether this implementation detail would be better or
worse than just having the tools auto-trim their activities when they
run into a missing object.  It does sound sort of elegant, though:
this way they *won't* run into the missing objects.

Beware, however, that

   git log -- Documentation

outputs a different set of commits than just

   git log

You don't want to enable history simplification here; I think that
means you want --full-history on by default for the "stored" path
limiting, but not for any command-line path limiting.  That could be
slightly messy.
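
To see the difference concretely, here is a small sketch (the helper
name count_commits is made up) that counts the commits reachable from
HEAD, optionally with extra rev-list arguments such as a path limit:

```shell
# Sketch: count commits reachable from HEAD in the repository at $1.
# Any further arguments are passed through to rev-list.
# count_commits is a hypothetical helper, not a git command.
count_commits() {
    cc_repo=$1
    shift
    git -C "$cc_repo" rev-list --count HEAD "$@"
}
```

In a repository where only some commits touch Documentation/,
"count_commits . -- Documentation" reports fewer commits than
"count_commits ." does, which is exactly the effect the stored
limiting args must not have on the set of commits a sparse clone
considers present.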

> 5) "Densifying" a sparse clone can be done
>
> One can fetch a new pack and replace the limiting rev-list args with
> the new choice.  The sparse checkout information needs to be updated
> too.
>
> (So users probably would want to densify a sparse clone with "pull"
> rather than "fetch", as manually updating sparse checkouts may be a
> bit of a hassle.)

I think this would work, but unless you want to re-download some
(possibly lots of) objects you've already got, it would require some
kind of extra support from the server.  Maybe that's a rare enough
case that few people will care and it could be fixed later.

I don't think the pull vs. fetch distinction is valid; I would be very
surprised if pull un-sparsified my checkout, just as I would be
surprised if merge did.  And pull is just fetch+merge.

> 6) Cloning-from/fetching-from/pushing-to sparse clones is supported.
>
> Future fetches and pushes also make use of the limiting arguments.
> Receives do as well, but only to make sure the pack obtained is not
> "more sparse" than what the receiving repository already has.
> (uploads ignore the stored rev-list arguments, instead using the
> rev-list arguments passed to it -- it will die if asked for content
> not locally available to it.)

This scares me a little.  It's a reminder that it's all-too-easy to
get your repository into a really messed up state by going in and
screwing with your sparseness parameters at the wrong time.

It would make me more comfortable if there was some kind of "oh god,
just fix it by downloading any objects you think are missing" mode :)
In fact, git could benefit from that in general - every now and then
someone on the list asks about a repository they managed to mangle by
corrupting a pack or something, and there's no really good answer to
that.

> 7) Operations that need unavailable data simply error out
>
> Examples: merge, cherry-pick, rebase (and upload-pack in a sparse
> clone).  However, hopefully the error messages state what extra
> information needs to be downloaded so the user can appropriately
> "densify" their repository.

That sounds good to me.

Have fun,

Avery
