git@vger.kernel.org mailing list mirror (one of many)
* git-clone --single-branch clones objects outside of branch
@ 2020-01-26 12:39 Chris Jerdonek
  2020-01-27  5:55 ` Jeff King
  0 siblings, 1 reply; 6+ messages in thread
From: Chris Jerdonek @ 2020-01-26 12:39 UTC
  To: git

Hi,

I'm reporting some git-clone behavior regarding --single-branch that I
found unexpected after reading the docs. I'm using git 2.25.0.

The git-clone docs for --single-branch say:

> Clone only the history leading to the tip of a single branch, either specified by the --branch option or the primary branch remote’s HEAD points at.

(from: https://git-scm.com/docs/git-clone#Documentation/git-clone.txt---no-single-branch)

However, when I attempted this with a local repo, I found that objects
located only in branches other than the branch I specified are also
cloned. Also, this is true even if the remote repo has only loose
objects (i.e. no pack files). So it doesn't appear to be doing this
only to avoid creating new files.

In contrast, git-fetch behaves as expected (including locally).
git-fetch appears to fetch only objects in the given branch, when a
branch is specified.

Below are some commands to assist with reproducing this situation (but
you will need to update the path in the `git-remote add` invocation
below). At the least, it seems like the docs should clarify the
behavior. (The Python commands were for when I was doing some
experiments with pack files.)

mkdir repo1
cd repo1
git init
python -c "print('a\n' + 10 * 'x\n')" > a.txt
git add a.txt
git commit -m "add a"
# Get object id to check existence with `git cat-file -t` below.
git hash-object a.txt

git checkout -b dev
python -c "print('b\n' + 10 * 'x\n')" > b.txt
git add b.txt
git commit -m "add b"
# Get object id to check existence with `git cat-file -t` below.
git hash-object b.txt

git checkout master

cd ..
mkdir repo2
cd repo2
git init
git remote add other file:///<path-to-repo1>
git fetch other master

cd ..
git clone --branch master --single-branch repo1 repo3

Thanks,
--Chris

* Re: git-clone --single-branch clones objects outside of branch
  2020-01-26 12:39 git-clone --single-branch clones objects outside of branch Chris Jerdonek
@ 2020-01-27  5:55 ` Jeff King
  2020-01-27  6:46   ` Chris Jerdonek
  0 siblings, 1 reply; 6+ messages in thread
From: Jeff King @ 2020-01-27  5:55 UTC
  To: Chris Jerdonek; +Cc: git

On Sun, Jan 26, 2020 at 04:39:52AM -0800, Chris Jerdonek wrote:

> However, when I attempted this with a local repo, I found that objects
> located only in branches other than the branch I specified are also
> cloned. Also, this is true even if the remote repo has only loose
> objects (i.e. no pack files). So it doesn't appear to be doing this
> only to avoid creating new files.
> 
> In contrast, git-fetch behaves as expected (including locally).
> git-fetch appears to fetch only objects in the given branch, when a
> branch is specified.

This is the expected outcome, because in your example you're cloning on
the local filesystem. By default that enables some optimizations, one of
which is to hard-link the object files into the destination repository.
That avoids the cost of copying and re-hashing them (which a normal
cross-system clone would do). And it even avoids traversing the objects
to find which are necessary, instead just hard-linking everything.

So with:

> git clone --branch master --single-branch repo1 repo3

You should be able to see with "git for-each-ref" in repo3 that you only
got the "master" branch, but not "dev". But those extra objects are
available to you, because of the hard-links.
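
A minimal sketch of how one might check both halves of that statement. It
builds a throwaway repo1 with a blob that exists only on "dev", clones it with
the default local optimizations, and then inspects the refs, the dev-only
blob, and whether the loose object file is the same inode in both
repositories. The temporary directory, the demo identity, and the variable
names are assumptions made for illustration; the hard-link check assumes both
repositories sit on the same filesystem and that the shell's `[` supports
`-ef`.

```shell
# Throwaway workspace; the directory and identity below are illustrative.
work=$(mktemp -d) && cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q repo1
# Force the branch name used in this thread, whatever init.defaultBranch says.
git -C repo1 symbolic-ref HEAD refs/heads/master
echo a > repo1/a.txt
git -C repo1 add a.txt
git -C repo1 commit -q -m "add a"
git -C repo1 checkout -q -b dev
echo b > repo1/b.txt
git -C repo1 add b.txt
git -C repo1 commit -q -m "add b"
b_blob=$(git -C repo1 rev-parse dev:b.txt)   # blob reachable only from dev
git -C repo1 checkout -q master

git clone -q --branch master --single-branch repo1 repo3

# Refs: list what the clone knows about; "dev" should not appear.
refs=$(git -C repo3 for-each-ref --format='%(refname)')
echo "$refs"

# Object: is the dev-only blob readable in repo3 anyway?
if git -C repo3 cat-file -e "$b_blob" 2>/dev/null; then
  blob_in_repo3=yes
else
  blob_in_repo3=no
fi

# Hard link: is the loose object file the same inode in both repos?
obj="objects/$(echo "$b_blob" | cut -c1-2)/$(echo "$b_blob" | cut -c3-)"
if [ "repo1/.git/$obj" -ef "repo3/.git/$obj" ]; then
  hardlinked=yes
else
  hardlinked=no
fi
echo "dev-only blob in repo3: $blob_in_repo3 (hard-linked: $hardlinked)"
```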

If you do:

  git clone --branch master --single-branch --no-local repo1 repo4

then repo4 will not have the objects (we really will send a packfile
across a pipe, just as we would across the network for a cross-system
clone).
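
As a rough sketch of the comparison, assuming a throwaway directory and a demo
identity (both illustrative, as are the helper and variable names), the two
clones can be checked side by side for a blob that is reachable only from
"dev":

```shell
# Throwaway workspace; the directory and identity below are illustrative.
work=$(mktemp -d) && cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q repo1
git -C repo1 symbolic-ref HEAD refs/heads/master
echo a > repo1/a.txt
git -C repo1 add a.txt
git -C repo1 commit -q -m "add a"
git -C repo1 checkout -q -b dev
echo b > repo1/b.txt
git -C repo1 add b.txt
git -C repo1 commit -q -m "add b"
b_blob=$(git -C repo1 rev-parse dev:b.txt)   # blob reachable only from dev
git -C repo1 checkout -q master

# Same options, with and without the local optimizations.
git clone -q --branch master --single-branch            repo1 repo3
git clone -q --branch master --single-branch --no-local repo1 repo4

# Hypothetical helper: prints "yes" if the dev-only blob exists in repo $1.
has_blob() {
  if git -C "$1" cat-file -e "$b_blob" 2>/dev/null; then echo yes; else echo no; fi
}
repo3_has=$(has_blob repo3)
repo4_has=$(has_blob repo4)
echo "repo3 (default local clone): dev-only blob present? $repo3_has"
echo "repo4 (--no-local):          dev-only blob present? $repo4_has"
```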

> cd repo2
> git init
> git remote add other file:///<path-to-repo1>
> git fetch other master

This one behaves as you expected because git-fetch does not perform the
same optimizations (it wouldn't make as much sense there, as generally
in a fetch we already have most of the objects from the other side
anyway, so hard-linking would just give us duplicates).

-Peff

* Re: git-clone --single-branch clones objects outside of branch
  2020-01-27  5:55 ` Jeff King
@ 2020-01-27  6:46   ` Chris Jerdonek
  2020-01-28  9:48     ` Jeff King
  0 siblings, 1 reply; 6+ messages in thread
From: Chris Jerdonek @ 2020-01-27  6:46 UTC
  To: Jeff King, git

On Sun, Jan 26, 2020 at 9:55 PM Jeff King <peff@peff.net> wrote:
> On Sun, Jan 26, 2020 at 04:39:52AM -0800, Chris Jerdonek wrote:
> > However, when I attempted this with a local repo, I found that objects
> > located only in branches other than the branch I specified are also
> > cloned. Also, this is true even if the remote repo has only loose
> > objects (i.e. no pack files). So it doesn't appear to be doing this
> > only to avoid creating new files.
>
> This is the expected outcome, because in your example you're cloning on
> the local filesystem. By default that enables some optimizations, one of
> which is to hard-link the object files into the destination repository.
> That avoids the cost of copying and re-hashing them (which a normal
> cross-system clone would do). And it even avoids traversing the objects
> to find which are necessary, instead just hard-linking everything.

Thanks for the reply. It's okay for that to be the expected behavior.
My suggestion would just be that the documentation for --single-branch
be updated to clarify that objects unreachable from the specified
branch can still be in the cloned repo when run using the --local
optimizations. For example, it can matter for security if one is
trying to create a clone of a repo that doesn't include data from
branches with sensitive info (e.g. in following Git's advice to create
a separate repo if security of private data is desired:
https://git-scm.com/docs/gitnamespaces#_security ).

I'm guessing other flags also don't apply when --local is being used.
For example, I'm guessing --reference is also ignored when using
--local, but I haven't checked yet to confirm. It would be nice if the
documentation gave a heads up in cases like these. Even if hard links
are being used, it's not clear from the docs whether the objects are
filtered first, prior to hard linking, when flags like --single-branch
and --reference are passed.

> This one behaves as you expected because git-fetch does not perform the
> same optimizations (it wouldn't make as much sense there, as generally
> in a fetch we already have most of the objects from the other side
> anyway, so hard-linking would just give us duplicates).

Incidentally, here's a thread from 2010 requesting that this
optimization be available in the git-fetch case:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=573909
(I don't know how reports on that Debian list relate to this list.)

--Chris

* Re: git-clone --single-branch clones objects outside of branch
  2020-01-27  6:46   ` Chris Jerdonek
@ 2020-01-28  9:48     ` Jeff King
  2020-01-29  1:59       ` Chris Jerdonek
  0 siblings, 1 reply; 6+ messages in thread
From: Jeff King @ 2020-01-28  9:48 UTC
  To: Chris Jerdonek; +Cc: git

On Sun, Jan 26, 2020 at 10:46:07PM -0800, Chris Jerdonek wrote:

> Thanks for the reply. It's okay for that to be the expected behavior.
> My suggestion would just be that the documentation for --single-branch
> be updated to clarify that objects unreachable from the specified
> branch can still be in the cloned repo when run using the --local
> optimizations. For example, it can matter for security if one is
> trying to create a clone of a repo that doesn't include data from
> branches with sensitive info (e.g. in following Git's advice to create
> a separate repo if security of private data is desired:
> https://git-scm.com/docs/gitnamespaces#_security ).

I think it would make sense to talk about that under "--local". There
are other subtle reasons you might want "--no-local", too: it will
perform more consistency checks, which could be valuable if you're
thinking about deleting the old copy.

I'm not sure how much Git guarantees in general that you won't get extra
objects. It's true that we try to avoid sending objects that aren't
needed, but it's mostly as an optimization. If we later modified
pack-objects to sometimes send unneeded objects (say, because we're able
to compute the set more efficiently if we use an approximation that errs
on the conservative side), then I think that's something we'd consider.
And I suppose it's already possible with the dumb-http protocol, which
has to fetch whole packfiles. And the same would be true of recently
proposed schemes to point clients to a pre-generated packfile URL.

> I'm guessing other flags also don't apply when --local is being used.
> For example, I'm guessing --reference is also ignored when using
> --local, but I haven't checked yet to confirm. It would be nice if the
> documentation gave a heads up in cases like these. Even if hard links
> are being used, it's not clear from the docs whether the objects are
> filtered first, prior to hard linking, when flags like --single-branch
> and --reference are passed.

No, "--reference" behaves as usual. However, "--depth" is ignored (and
issues a warning). I don't think it would be wrong to issue a warning
when --single-branch is used locally (though it would not be "single
branch is ignored", since it does impact which refs are copied). But I
kind of wonder if it would be annoying for people who don't care about
having the extra objects reachable.

> > This one behaves as you expected because git-fetch does not perform the
> > same optimizations (it wouldn't make as much sense there, as generally
> > in a fetch we already have most of the objects from the other side
> > anyway, so hard-linking would just give us duplicates).
> 
> Incidentally, here's a thread from 2010 requesting that this
> optimization be available in the git-fetch case:
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=573909
> (I don't know how reports on that Debian list relate to this list.)

Sometimes they get forwarded here, and sometimes not. :)

I think there are subtle issues that make naively using the optimization
a bad idea, as it could actually backfire and cause more disk usage.
E.g., consider a sequence like this:

  1. Repo A has a 100MB packfile.

  2. "git clone A B" uses hardlinks. Now we have two copies of the repo,
     storing 100MB.

  3. Repo A adds a few more commits, and then does a "git gc", breaking
     the hardlinks. Now we have ~200MB used. We're no worse off than if
     we hadn't done the hardlinks in the first place.

  4. Repo B fetches from A. It wants the new commits, but they're in
     repo A's big packfile. So it hardlinks that. We have ~200MB in repo
     B, but half of that is hardlinked and shared with A. So we're still
     using ~200MB. So far so good.

  5. Repo A repacks again, breaking the hardlinks. Now it's using
     ~100MB, but repo B is still using ~200MB, for ~300MB in total.
     We're worse off than we would be without the optimization.
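
The total disk usage across both repos in that sequence can be tallied with a
small sketch. It assumes each full packfile is ~100MB, ignores the few extra
commits, and counts a pack that is hard-linked into both repos only once. The
pack labels p1, p2, p3 and the function name are hypothetical.

```shell
PACK_MB=100   # approximate size of each full packfile in the example

# Total on-disk usage: a pack hard-linked into both repos is stored once,
# so count the distinct pack labels across repo A ($1) and repo B ($2).
total_mb() {
  n=$(printf '%s\n' $1 $2 | sort -u | wc -l)
  echo $(( n * PACK_MB ))
}

step2=$(total_mb "p1" "p1")       # clone: B hard-links A's pack
step3=$(total_mb "p2" "p1")       # A repacks into p2, B keeps p1
step4=$(total_mb "p2" "p1 p2")    # B fetches by hard-linking p2
step5=$(total_mb "p3" "p1 p2")    # A repacks into p3, B keeps p1 and p2
echo "step 2: ~${step2}MB"
echo "step 3: ~${step3}MB"
echo "step 4: ~${step4}MB"
echo "step 5: ~${step5}MB"
```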

If you really want to keep sharing objects over time, I think using
"clone -s" is a better choice (though it comes with its own
complications and dangers, too; see the git-clone documentation).
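
For illustration, a minimal sketch of what "clone -s" sets up, assuming a
throwaway directory and a demo identity (both illustrative): instead of
hard-linking or copying objects, the new repo records the source's object
directory in its alternates file.

```shell
# Throwaway workspace; the directory and identity below are illustrative.
work=$(mktemp -d) && cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q A
git -C A commit -q --allow-empty -m "initial"

# -s (--shared): borrow objects from A rather than hard-linking them.
git clone -q -s A B

# The alternates file names the object directory that B borrows from.
alternates=$(cat B/.git/objects/info/alternates)
echo "B borrows objects from: $alternates"

# Show what B stores itself, plus the alternate it relies on.
git -C B count-objects -v
```

Because B depends on A's object directory staying intact, pruning or deleting
A can break B, which is one of the dangers the git-clone documentation
describes.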

-Peff

* Re: git-clone --single-branch clones objects outside of branch
  2020-01-28  9:48     ` Jeff King
@ 2020-01-29  1:59       ` Chris Jerdonek
  2020-01-29  2:23         ` Jeff King
  0 siblings, 1 reply; 6+ messages in thread
From: Chris Jerdonek @ 2020-01-29  1:59 UTC
  To: git; +Cc: Jeff King

On Tue, Jan 28, 2020 at 1:48 AM Jeff King <peff@peff.net> wrote:
> On Sun, Jan 26, 2020 at 10:46:07PM -0800, Chris Jerdonek wrote:
> > I'm guessing other flags also don't apply when --local is being used.
> > For example, I'm guessing --reference is also ignored when using
> > --local, but I haven't checked yet to confirm. It would be nice if the
> > documentation gave a heads up in cases like these. Even if hard links
> > are being used, it's not clear from the docs whether the objects are
> > filtered first, prior to hard linking, when flags like --single-branch
> > and --reference are passed.
>
> No, "--reference" behaves as usual.

On this, I found that --reference does behave differently in the way
that I suspected. For example, when run with the default --local, I
found that git-clone will create hard links in the new repo to loose
objects, even if those objects already exist in the reference
repository. When run with --no-local, the objects in the reference
repository weren't copied (I didn't find them in the cloned repo's
pack file).

So in addition to --single-branch, this seems to be another case where
`git-clone --local` will ignore the provided options when deciding
what files inside .git/objects/ to hard-link. It just hard-links
everything. This is another example of something that I think would be
worth mentioning in the docs in some form. Currently, the
documentation for --reference suggests that objects won't be created
in the new repo if they already exist in the reference repository.

--Chris

> However, "--depth" is ignored (and
> issues a warning). I don't think it would be wrong to issue a warning
> when --single-branch is used locally (though it would not be "single
> branch is ignored", since it does impact which refs are copied). But I
> kind of wonder if it would be annoying for people who don't care about
> having the extra objects reachable.

* Re: git-clone --single-branch clones objects outside of branch
  2020-01-29  1:59       ` Chris Jerdonek
@ 2020-01-29  2:23         ` Jeff King
  0 siblings, 0 replies; 6+ messages in thread
From: Jeff King @ 2020-01-29  2:23 UTC
  To: Chris Jerdonek; +Cc: git

On Tue, Jan 28, 2020 at 05:59:54PM -0800, Chris Jerdonek wrote:

> > No, "--reference" behaves as usual.
> 
> On this, I found that --reference does behave differently in the way
> that I suspected. For example, when run with the default --local, I
> found that git-clone will create hard links in the new repo to loose
> objects, even if those objects already exist in the reference
> repository. When run with --no-local, the objects in the reference
> repository weren't copied (I didn't find them in the cloned repo's
> pack file).

Sure, but I'd just consider that the same issue: hardlinking takes
everything in the object database without regard to what's needed. So
--reference still works (and when you repack and break the hardlinks, it
will drop any duplicates as usual), but obviously you have access to the
extra hardlinked objects in the meantime.

IOW, I don't think it's worth calling this out in the --reference
documentation, but could be in the --local documentation. Because the
unusual implication is the same whether --reference is used or not.

> So in addition to --single-branch, this seems to be another case where
> `git-clone --local` will ignore the provided options when deciding
> what files inside .git/objects/ to hard-link. It just hard-links
> everything. This is another example of something that I think would be
> worth mentioning in the docs in some form. Currently, the
> documentation for --reference suggests that objects won't be created
> in the new repo if they already exist in the reference repository.

It says:

  Using an already existing repository as an alternate will require
  fewer objects to be copied from the repository being cloned, reducing
  network and local storage costs.

which I think is still technically true. The hardlinks are free in terms
of storage (and there's no network cost by definition). So at worst
--reference is doing nothing to start with. And the repository has been
set up in such a way that it may yield benefits later (after you repack
and the hardlinks are broken).

This is maybe splitting hairs a bit. The thing that matters, I think, is
where a documentation change should go (and what it should say).

-Peff
