From: Junio C Hamano <gitster@pobox.com>
To: Andrew Keller <andrew@kellerfarm.com>
Cc: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
"Git List" <git@vger.kernel.org>
Subject: Re: Borrowing objects from nearby repositories
Date: Wed, 26 Mar 2014 10:29:22 -0700
Message-ID: <xmqqbnwskgwd.fsf@gitster.dls.corp.google.com>
In-Reply-To: <3533946C-DE97-4214-9B55-F5B788DDD952@kellerfarm.com> (Andrew Keller's message of "Wed, 26 Mar 2014 09:36:09 -0400")
Andrew Keller <andrew@kellerfarm.com> writes:
> On Mar 25, 2014, at 6:17 PM, Junio C Hamano <gitster@pobox.com> wrote:
> ...
>>> I think that the standard practice with the existing toolset is to
>>> clone with reference and then repack. That is:
>>>
>>> $ git clone --reference <borrowee> git://over/there mine
>>> $ cd mine
>>> $ git repack -a -d
>>>
>>> And then you can try this:
>>>
>>> $ mv .git/objects/info/alternates .git/objects/info/alternates.disabled
>>> $ git fsck
>>>
>>> to make sure that you are no longer borrowing anything from the
>>> borrowee. Once you are satisfied, you can remove the saved-away
>>> alternates.disabled file.
>>
>> Oh, I forgot to say that I am not opposed if somebody wants to teach
>> "git clone" a new option to copy its objects from two places,
>> (hopefully) the majority from near-by reference repository and the
>> remainder over the network, without permanently relying on the
>> former via the alternates mechanism. The implementation of such a
>> feature could even literally be "clone with reference first and then
>> repack" at least initially but even in the final version.
[Administrivia: please wrap your lines to a reasonable length]
> That was actually one of my first ideas - adding some sort of
> '--auto-repack' option to git-clone. It's a relatively small
> change, and would work. However, keeping in mind my end goal of
> automating the feature to the point where you could run simply
> 'git clone <url>', an '--auto-repack' option is more difficult to
> undo. You would need a new parameter to disable the automatic
> adding of reference repositories, and a new parameter to undo
> '--auto-repack', and you'd have to remember to actually undo both
> of those settings.
>
> In contrast, if the new feature was '--borrow', and the evolution
> of the feature was a global configuration 'fetch.autoBorrow', then
> to turn it off temporarily, one only needs a single new parameter
> '--no-auto-borrow'. I think this is a cleaner approach than the
> former, although much more work.
I think you may have misread me. With the "new option", I was
hinting that the "clone --reference && repack && rm alternates"
will be an acceptable internal implementation of the "--borrow"
option that was mentioned in the thread. I am not sure where you
got the "auto-repack" from.
One of the reasons you may have misread me may be because I made it
sound as if "this may work and when it works you will be happy, but
if it does not work you did not lose very much" by mentioning "mv &&
fsck". That wasn't what I meant.
The "repack -a" procedure is to make the borrower repository no
longer dependent on the borrowee, and it is supposed to always work.
In fact, this behaviour was the whole reason why "repack" later
learned its "-l" option to disable it, because people who cloned
with "--reference" in order to reduce the disk footprint by sharing
older and more common objects [*1*] were rightfully surprised to see
that the borrowed objects were copied over to their borrower
repository when they ran "repack" [*2*].
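The difference between the two forms of "repack" can be observed with a small sketch like the one below. The helper names own_packed_objects and compare_repack are made up for this illustration and are not part of Git; the sketch assumes it is run inside a repository that was created with "clone --reference" and still has its alternates file.

```shell
# Hypothetical helper: number of objects held in this repository's own
# packs (packs that live in alternate object stores are not counted).
own_packed_objects () {
    git count-objects -v | sed -n 's/^in-pack: //p'
}

# Hypothetical helper: run inside a repository created with
# "git clone --reference" to compare the two repack modes.
compare_repack () {
    # Keep borrowing: "-l" leaves out objects found in the alternates.
    git repack -a -d -l &&
    echo "with -l:    $(own_packed_objects) objects in own packs" &&
    # Stop borrowing: without "-l", borrowed objects are copied in too.
    git repack -a -d &&
    echo "without -l: $(own_packed_objects) objects in own packs"
}
```

The second count should be at least as large as the first, because the run without "-l" pulls every reachable object into the borrower's own pack.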
Because this is "clone", there is nothing complex to "undo". Either
it succeeds, or you remove the whole new directory if anything
fails.
I said "even in the final version" for a simple reason: you cannot
realistically do any better than the "clone --reference &&
repack -a -d && rm alternates" sequence.
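As a rough sketch only, that sequence could be wrapped in a shell function. The name clone_borrow and its argument order are assumptions made for this illustration; no such command exists in Git. It follows the "either it succeeds, or you remove the whole new directory" rule described above.

```shell
# Hypothetical helper (not part of Git): clone <url> into <dest>, using
# <reference> only as a temporary, nearby source of objects.
clone_borrow () {
    reference=$1 url=$2 dest=$3

    # Refuse to touch an existing path, so that the cleanup below can
    # only ever remove a directory that this function created.
    if test -e "$dest"
    then
        echo "clone_borrow: $dest already exists" >&2
        return 1
    fi

    git clone --reference "$reference" "$url" "$dest" &&
    (
        cd "$dest" &&
        # Without "-l", repack copies the borrowed objects into our
        # own pack.
        git repack -a -d &&
        # Stop depending on the reference repository.
        rm -f .git/objects/info/alternates &&
        # Check that nothing is missing now that we stand alone.
        git fsck
    ) || {
        # Nothing complex to undo: remove the whole new directory.
        rm -rf "$dest"
        return 1
    }
}
```

It would be called as, for example, clone_borrow /path/to/nearby git://over/there mine, where both the path and the URL are placeholders.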
But you would need to know a few things about how Git works in order
to come to that realisation. Here are some:
 * "clone --borrow" (or whatever we end up calling the option) must
   talk to two repositories:
   - We will need to have one upload-pack session with the distant
     origin repository over the network, which will send a complete
     pack.
   - We need to also copy objects that weren't sent from the
     distant origin to our repository from the reference one.
 * A single "repack -a -d" (without "-l") after "clone --reference"
   is already a way to do exactly what you need---enumerate what is
   missing in the packfile that was received from the distant origin
   and come up with packfile(s) that contain all and only objects
   the cloned repository needs.
 * You cannot easily concatenate multiple packfiles into a single
   one (or append runs of objects to an existing packfile) to come
   up with a single packfile.
You _could_ shoehorn the logic to "enumerate and read from the
reference, and append them at the end of the packfile received from
the distant origin repository" into the part that talks to the
distant origin repository, but the object layout in the resulting
packfile will be suboptimal [*3*] and the code complexity required
to do so is not worth it [*4*].
[Footnotes]
*1* From the point of view of supporting both camps, i.e. those who
want their borrower repositories to keep sharing the objects
with the borrowee repository and those who want to use a
borrowee repository temporarily while cloning only to reduce the
network cost from the distant upstream, the current option name
"--reference" and the proposed name "--borrow" are backwards.
The folks who want the original behaviour of continued dependence want
to "borrow" from the borrowee repository; those who want to
utilize the mechanism temporarily only while cloning would want
to merely "reference" it only while cloning.
*2* A repository created with "clone --reference" may want to set a
configuration variable in it to tell future invocations of "git
repack" to use the "-l" option by default, while allowing those
who want to fatten such a repository to override it with "repack
--no-local". Without such an arrangement, people who wanted that
"-l" option for "repack" in the first place risk fattening their
lean repositories by mistake when they forget to pass "-l".
Luckily "gc" always runs
"repack" with "-l", so the risk is limited to those who run
"repack" themselves, which may be why we heard no complaints on
this point.
*3* In fact, another program invoked during the object transfer,
"index-pack --fix-thin", does have "append selected objects at
the end of the packfile received over the wire and fix up the
whole thing" logic in it. The pack that results from it
suffers from the suboptimal layout because the appended objects
are "appended", not placed in their optimal positions in the
packfile to reduce seeks.
If "--borrow" did the same, the pack layout issue would be worse,
because the whole point of "--borrow" is to borrow the majority
of objects from the reference repository---we would be appending a
lot from the reference to a relatively small pack we receive
over the wire from the distant origin.
*4* The complexity of the code to implement "index-pack --fix-thin"
is not pretty, but it will be worse for "--borrow", as the
former at least does not have to walk the dag to find out what
objects need to be appended but the latter does.