git@vger.kernel.org mailing list mirror (one of many)
From: Junio C Hamano <gitster@pobox.com>
To: Andrew Keller <andrew@kellerfarm.com>
Cc: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"Git List" <git@vger.kernel.org>
Subject: Re: Borrowing objects from nearby repositories
Date: Wed, 26 Mar 2014 10:29:22 -0700
Message-ID: <xmqqbnwskgwd.fsf@gitster.dls.corp.google.com>
In-Reply-To: <3533946C-DE97-4214-9B55-F5B788DDD952@kellerfarm.com> (Andrew Keller's message of "Wed, 26 Mar 2014 09:36:09 -0400")

Andrew Keller <andrew@kellerfarm.com> writes:

> On Mar 25, 2014, at 6:17 PM, Junio C Hamano <gitster@pobox.com> wrote:
> ...
>>> I think that the standard practice with the existing toolset is to
>>> clone with reference and then repack.  That is:
>>> 
>>>    $ git clone --reference <borrowee> git://over/there mine
>>>    $ cd mine
>>>    $ git repack -a -d
>>> 
>>> And then you can try this:
>>> 
>>>    $ mv .git/objects/info/alternates .git/objects/info/alternates.disabled
>>>    $ git fsck
>>> 
>>> to make sure that you are no longer borrowing anything from the
>>> borrowee.  Once you are satisfied, you can remove the saved-away
>>> alternates.disabled file.
>> 
>> Oh, I forgot to say that I am not opposed if somebody wants to teach
>> "git clone" a new option to copy its objects from two places,
>> (hopefully) the majority from near-by reference repository and the
>> remainder over the network, without permanently relying on the
>> former via the alternates mechanism.  The implementation of such a
>> feature could even literally be "clone with reference first and then
>> repack" at least initially but even in the final version.

[Administrivia: please wrap your lines to a reasonable length]

> That was actually one of my first ideas - adding some sort of
> '--auto-repack' option to git-clone.  It's a relatively small
> change, and would work.  However, keeping in mind my end goal of
> automating the feature to the point where you could run simply
> 'git clone <url>', an '--auto-repack' option is more difficult to
> undo.  You would need a new parameter to disable the automatic
> adding of reference repositories, and a new parameter to undo
> '--auto-repack', and you'd have to remember to actually undo both
> of those settings.
>
> In contrast, if the new feature was '--borrow', and the evolution
> of the feature was a global configuration 'fetch.autoBorrow', then
> to turn it off temporarily, one only needs a single new parameter
> '--no-auto-borrow'.  I think this is a cleaner approach than the
> former, although much more work.

I think you may have misread me.  With the "new option", I was
hinting that the "clone --reference && repack && rm alternates"
will be an acceptable internal implementation of the "--borrow"
option that was mentioned in the thread.  I am not sure where you
got the "auto-repack" from.

One reason you may have misread me is that, by mentioning "mv &&
fsck", I made it sound as if "this may work and when it works you
will be happy, but if it does not work you did not lose very much".
That wasn't what I meant.
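
For completeness, the optional "mv && fsck" check could be scripted
so that a failed check puts the alternates file back.  This is only a
sketch: "check_standalone" is a name made up for illustration, and it
assumes a non-bare repository whose object store is under ".git".

```shell
# Hypothetical helper: succeed only if <repo> is complete without
# its alternates.
# usage: check_standalone <repo>
check_standalone () {
    alt=$1/.git/objects/info/alternates

    # No alternates file means nothing is borrowed to begin with.
    test -e "$alt" || return 0

    # Hide the alternates, then see if every object is still reachable.
    mv "$alt" "$alt.disabled" || return 1
    if git -C "$1" fsck --no-progress >/dev/null 2>&1
    then
        return 0    # alternates.disabled can now be removed by hand
    fi

    # Objects are still missing: restore the file and keep borrowing.
    mv "$alt.disabled" "$alt"
    return 1
}
```

On failure the repository is left exactly as it was, still borrowing
from the borrowee.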

The "repack -a" procedure is to make the borrower repository no
longer dependent on the borrowee, and it is supposed to always work.
In fact, this behaviour was the whole reason why "repack" later
learned its "-l" option to disable it, because people who cloned
with "--reference" in order to reduce the disk footprint by sharing
older and more common objects [*1*] were rightfully surprised to see
that the borrowed objects were copied over to their borrower
repository when they ran "repack" [*2*].
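
To make the difference concrete, here is a small sketch of the two
ways to repack a repository that still has an alternates file.  The
function names are made up for illustration.  The "in-pack" line is
taken from "git count-objects -v", which, as far as I know, counts
only the objects in the repository's own packs.

```shell
# Hypothetical helpers, each taking the path of a non-bare repository.

# Number of objects stored in the repository's own packfiles.
local_packed_objects () {
    git -C "$1" count-objects -v | sed -n 's/^in-pack: //p'
}

# Stay lean: "-l" packs only local objects and keeps borrowing the rest.
repack_lean () {
    git -C "$1" repack -a -d -l -q
}

# Fatten: without "-l", borrowed objects are copied into the new pack.
repack_fat () {
    git -C "$1" repack -a -d -q
}
```

Running "repack_lean" and then "repack_fat" on a borrower and
comparing "local_packed_objects" after each should show the count
grow once the borrowed objects are copied over.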

Because this is "clone", there is nothing complex to "undo".  Either
it succeeds, or you remove the whole new directory if anything
fails.

I said "even in the final version" for a simple reason: you cannot
realistically do any better than the "clone --reference &&
repack -a -d && rm alternates" sequence.
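
As a rough sketch of that sequence with the existing toolset (the
function name "clone_borrow" is hypothetical, not a git command, and
the ".git" path assumes a non-bare clone):

```shell
# Hypothetical helper emulating the proposed option.
# usage: clone_borrow <reference-repo> <origin-url> <new-dir>
clone_borrow () {
    reference=$1 origin=$2 dest=$3

    # Refuse to reuse an existing path, so that the cleanup below can
    # only ever remove a directory created by this function.
    if test -e "$dest"
    then
        echo "clone_borrow: $dest already exists" >&2
        return 1
    fi

    # Clone while borrowing, copy the borrowed objects into our own
    # pack, then stop depending on the reference repository.
    if git clone --quiet --reference "$reference" "$origin" "$dest" &&
       git -C "$dest" repack -a -d -q &&
       rm -f "$dest/.git/objects/info/alternates"
    then
        return 0
    fi

    # This is "clone": nothing complex to undo, drop the new directory.
    rm -rf "$dest"
    return 1
}
```

It would be called as "clone_borrow <borrowee> git://over/there mine";
either "mine" ends up self-contained or it does not exist at all.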

But you would need to know a few things about how Git works in order
to come to that realisation.  Here are some:

 * "clone --borrow" (or whatever we end up calling the option) must
   talk to two repositories:

    - We will need to have one upload-pack session with the distant
      origin repository over the network, which will send a complete
      pack.

    - We need to also copy objects that weren't sent from the
      distant origin to our repository from the reference one.

 * A single "repack -a -d" (without "-l") after "clone --reference"
   is already a way to do exactly what you need---enumerate what is
   missing from the packfile that was received from the distant
   origin and come up with packfile(s) that contain all and only
   the objects the cloned repository needs.

 * You cannot easily concatenate multiple packfiles into a single
   one (or append runs of objects to an existing packfile) to come
   up with a single packfile.

You _could_ shoehorn the logic to "enumerate and read from the
reference, and append them at the end of the packfile received from
the distant origin repository" into the part that talks to the
distant origin repository, but the object layout in the resulting
packfile will be suboptimal [*3*] and the code complexity required
to do so is not worth it [*4*].


[Footnotes]

*1* From the point of view of supporting both camps, i.e. those who
    want their borrower repositories to keep sharing the objects
    with the borrowee repository and those who want to use a
    borrowee repository temporarily while cloning only to reduce the
    network cost from the distant upstream, the current option name
    "--reference" and the proposed name "--borrow" are backwards.
    The folks who want the original behaviour of continuing to
    depend on the borrowee repository want to "borrow" from it;
    those who want to utilize the mechanism temporarily would want
    to merely "reference" it only while cloning.

*2* A repository created with "clone --reference" may want to set a
    configuration variable in it to tell future invocations of "git
    repack" to use the "-l" option by default, while allowing those
    who want to fatten such a repository to override it with "repack
    --no-local".  Without such an arrangement, the people who wanted
    that "-l" option for "repack" in the first place risk fattening
    their lean repositories by forgetting to pass it.  Luckily "gc"
    always runs "repack" with "-l", so the risk is limited to those
    who run "repack" themselves, which may be why we heard no
    complaints on this point.

*3* In fact, another program invoked during the object transfer,
    "index-pack --fix-thin", does have "append selected objects at
    the end of the packfile received over the wire and fix up the
    whole thing" logic in it.  The pack that results from it
    suffers from the suboptimal layout because the appended objects
    are "appended", not placed in their optimal positions in the
    packfile to reduce seeks.

    If "--borrow" did the same, the pack layout issue would be worse,
    because the whole point of "--borrow" is to borrow the majority
    of objects from the reference repository---we would be appending
    a lot from the reference to a relatively small pack we receive
    over the wire from the distant origin.

*4* The complexity of the code to implement "index-pack --fix-thin"
    is not pretty, but it will be worse for "--borrow", as the
    former at least does not have to walk the dag to find out what
    objects need to be appended but the latter does.

