git@vger.kernel.org mailing list mirror (one of many)
* Borrowing objects from nearby repositories
@ 2014-03-12  3:37 Andrew Keller
  2014-03-23 18:04 ` Phil Hord
  2014-03-24 21:21 ` Ævar Arnfjörð Bjarmason
  0 siblings, 2 replies; 10+ messages in thread
From: Andrew Keller @ 2014-03-12  3:37 UTC
  To: Git List

Hi all,

I am considering developing a new feature, and I'd like to poll the group for opinions.

Background: A couple years ago, I wrote a set of scripts that speed up cloning of frequently used repositories.  The scripts utilize a bare Git repository located at a known location, and automate providing a --reference parameter to `git clone` and `git submodule update`.  Recently, some coworkers of mine expressed an interest in using the scripts, so I published the current version of my scripts, called `git repocache`, described at the bottom of <https://github.com/andrewkeller/ak-git-tools>.

Slowly, it has occurred to me that this feature, or something similar to it, may be worth adding to Git, so I've been thinking about the best approach.  Here's my best idea so far:

1)  Introduce '--borrow' to `git-fetch`.  This would behave similarly to '--reference', except that it operates on a temporary basis, and does not assume that the reference repository will exist after the operation completes, so any used objects are copied into the local objects database.  In theory, this mechanism would be distinct from '--reference', so if both are used, some objects would be copied, and some objects would be accessible via a reference repository referenced by the alternates file.

2)  Teach `git fetch` to read 'repocache.path' (or a better-named configuration), and use it to automatically activate borrowing.

3)  For consistency, `git clone`, `git pull`, and `git submodule update` should probably all learn '--borrow', and forward it to `git fetch`.

4)  In some scenarios, it may be necessary to temporarily not automatically borrow, so `git fetch`, and everything that calls it may need an argument to do that.

Intended outcome: With 'repocache.path' set, and the cached repository properly updated, one could run `git clone <url>`, and the operation would complete much faster than it does now due to less load on the network.

Things I haven't figured out yet:

*  What's the best approach to copying the needed objects?  It's probably inefficient to copy individual objects out of pack files one at a time, but it could be wasteful to copy entire pack files just because you need one object.  Hard-linking could help, but that won't always be available.  One of my previous ideas was to add a '--auto-repack' option to `git-clone`, which solves this problem better, but introduces some other front-end usability problems.
*  To maintain optimal effectiveness, users would have to regularly run a fetch in the cache repository.  Not all users know how to set up a scheduled task on their computer, so this might become a maintenance problem for the user.  This kind of problem I think brings into question the viability of the underlying design here, assuming that the ultimate goal is to clone faster, with very little or no change in the use of git.


Thoughts?

Thanks,
Andrew Keller

* Re: Borrowing objects from nearby repositories
  2014-03-12  3:37 Borrowing objects from nearby repositories Andrew Keller
@ 2014-03-23 18:04 ` Phil Hord
  2014-03-24 21:21 ` Ævar Arnfjörð Bjarmason
  1 sibling, 0 replies; 10+ messages in thread
From: Phil Hord @ 2014-03-23 18:04 UTC
  To: Andrew Keller; +Cc: Git List

On Tue, Mar 11, 2014 at 11:37 PM, Andrew Keller <andrew@kellerfarm.com> wrote:
> I am considering developing a new feature, and I'd like to poll the group for opinions.
>
> Background: A couple years ago, I wrote a set of scripts that speed up cloning of frequently used repositories.  The scripts utilize a bare Git repository located at a known location, and automate providing a --reference parameter to `git clone` and `git submodule update`.  Recently, some coworkers of mine expressed an interest in using the scripts, so I published the current version of my scripts, called `git repocache`, described at the bottom of <https://github.com/andrewkeller/ak-git-tools>.
>
> Slowly, it has occurred to me that this feature, or something similar to it, may be worth adding to Git, so I've been thinking about the best approach.  Here's my best idea so far:
>
> 1)  Introduce '--borrow' to `git-fetch`.  This would behave similarly to '--reference', except that it operates on a temporary basis, and does not assume that the reference repository will exist after the operation completes, so any used objects are copied into the local objects database.  In theory, this mechanism would be distinct from '--reference', so if both are used, some objects would be copied, and some objects would be accessible via a reference repository referenced by the alternates file.

Interesting.  I do something similar on my CI Server to reduce
workload on Gerrit. Having a built-in to support submodules would be
nice.  Currently my script does this:

MIRROR=/path/to/local/mirror
NEW=ssh://gerrit-server
# BRANCH is assumed to be set by the CI job
git clone "${MIRROR}/project" && cd project

#-- Init/update submodules from our local mirror if possible
git submodule update --recursive --init

#-- Switch to the remote server URL
git config remote.origin.url \
    "$(git config remote.origin.url | sed -e "s|^${MIRROR}|${NEW}|")"
git submodule sync #--recursive ; recursive not supported :-[

#-- Checkout remote updates
git pull --ff-only --recurse-submodules origin "${BRANCH}"
git submodule update --recursive --init


Is that about the same as you are aiming for?


> 2)  Teach `git fetch` to read 'repocache.path' (or a better-named configuration), and use it to automatically activate borrowing.

Seems like this could be trouble if a local repo is coincidentally
named the same as some unrelated repo you want to clone.  But I can
see the value.

What about something similar to url.insteadOf?   Maybe
'url.${SERVER}.autoBorrow = ${MIRROR}', with replacement semantics
similar to insteadOf.
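
For comparison, here is a minimal sketch of how the existing
url.<base>.insteadOf rewriting behaves.  The throwaway repository and
the "project" path are made up for illustration, and "autoBorrow"
itself is only a proposal, not an existing Git setting.  Note that
insteadOf replaces the URL outright rather than borrowing objects.

```shell
#!/bin/sh
set -e

# Throwaway repository so no real configuration is touched.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# The remote is recorded with the server URL (hypothetical host).
git remote add origin ssh://gerrit-server/project

# Any URL starting with the server prefix is rewritten to the mirror prefix.
git config url."/path/to/local/mirror/".insteadOf "ssh://gerrit-server/"

# Show the URL Git would actually use, without contacting the remote.
git ls-remote --get-url origin
```

The key order is the reverse of the autoBorrow idea: the section name
holds the replacement and the value holds the prefix to match.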

> 3)  For consistency, `git clone`, `git pull`, and `git submodule update` should probably all learn '--borrow', and forward it to `git fetch`.
>
> 4)  In some scenarios, it may be necessary to temporarily not automatically borrow, so `git fetch`, and everything that calls it may need an argument to do that.

--no-borrow

Phil

* Re: Borrowing objects from nearby repositories
  2014-03-12  3:37 Borrowing objects from nearby repositories Andrew Keller
  2014-03-23 18:04 ` Phil Hord
@ 2014-03-24 21:21 ` Ævar Arnfjörð Bjarmason
  2014-03-25 13:13   ` Andrew Keller
  2014-03-25 17:02   ` Junio C Hamano
  1 sibling, 2 replies; 10+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2014-03-24 21:21 UTC
  To: Andrew Keller; +Cc: Git List

On Wed, Mar 12, 2014 at 4:37 AM, Andrew Keller <andrew@kellerfarm.com> wrote:
> Hi all,
>
> I am considering developing a new feature, and I'd like to poll the group for opinions.
>
> Background: A couple years ago, I wrote a set of scripts that speed up cloning of frequently used repositories.  The scripts utilize a bare Git repository located at a known location, and automate providing a --reference parameter to `git clone` and `git submodule update`.  Recently, some coworkers of mine expressed an interest in using the scripts, so I published the current version of my scripts, called `git repocache`, described at the bottom of <https://github.com/andrewkeller/ak-git-tools>.
>
> Slowly, it has occurred to me that this feature, or something similar to it, may be worth adding to Git, so I've been thinking about the best approach.  Here's my best idea so far:
>
> 1)  Introduce '--borrow' to `git-fetch`.  This would behave similarly to '--reference', except that it operates on a temporary basis, and does not assume that the reference repository will exist after the operation completes, so any used objects are copied into the local objects database.  In theory, this mechanism would be distinct from '--reference', so if both are used, some objects would be copied, and some objects would be accessible via a reference repository referenced by the alternates file.

Isn't this the same as git clone --reference <path> --no-hardlinks <url> ?

Also without --no-hardlinks we're not assuming that the other repo
doesn't go away (you could rm-rf it), just that the files won't be
*modified*, which Git won't do, but you could manually do with other
tools, so the default is to hardlink.

> 2)  Teach `git fetch` to read 'repocache.path' (or a better-named configuration), and use it to automatically activate borrowing.

So a default path for --reference <path> --no-hardlinks ?

> 3)  For consistency, `git clone`, `git pull`, and `git submodule update` should probably all learn '--borrow', and forward it to `git fetch`.
>
> 4)  In some scenarios, it may be necessary to temporarily not automatically borrow, so `git fetch`, and everything that calls it may need an argument to do that.
>
> Intended outcome: With 'repocache.path' set, and the cached repository properly updated, one could run `git clone <url>`, and the operation would complete much faster than it does now due to less load on the network.
>
> Things I haven't figured out yet:
>
> *  What's the best approach to copying the needed objects?  It's probably inefficient to copy individual objects out of pack files one at a time, but it could be wasteful to copy entire pack files just because you need one object.  Hard-linking could help, but that won't always be available.  One of my previous ideas was to add a '--auto-repack' option to `git-clone`, which solves this problem better, but introduces some other front-end usability problems.
> *  To maintain optimal effectiveness, users would have to regularly run a fetch in the cache repository.  Not all users know how to set up a scheduled task on their computer, so this might become a maintenance problem for the user.  This kind of problem I think brings into question the viability of the underlying design here, assuming that the ultimate goal is to clone faster, with very little or no change in the use of git.
>
>
> Thoughts?
>
> Thanks,
> Andrew Keller

* Re: Borrowing objects from nearby repositories
  2014-03-24 21:21 ` Ævar Arnfjörð Bjarmason
@ 2014-03-25 13:13   ` Andrew Keller
  2014-03-25 17:02   ` Junio C Hamano
  1 sibling, 0 replies; 10+ messages in thread
From: Andrew Keller @ 2014-03-25 13:13 UTC
  To: Ævar Arnfjörð Bjarmason; +Cc: Git List

On Mar 24, 2014, at 5:21 PM, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> On Wed, Mar 12, 2014 at 4:37 AM, Andrew Keller <andrew@kellerfarm.com> wrote:
>> Hi all,
>> 
>> I am considering developing a new feature, and I'd like to poll the group for opinions.
>> 
>> Background: A couple years ago, I wrote a set of scripts that speed up cloning of frequently used repositories.  The scripts utilize a bare Git repository located at a known location, and automate providing a --reference parameter to `git clone` and `git submodule update`.  Recently, some coworkers of mine expressed an interest in using the scripts, so I published the current version of my scripts, called `git repocache`, described at the bottom of <https://github.com/andrewkeller/ak-git-tools>.
>> 
>> Slowly, it has occurred to me that this feature, or something similar to it, may be worth adding to Git, so I've been thinking about the best approach.  Here's my best idea so far:
>> 
>> 1)  Introduce '--borrow' to `git-fetch`.  This would behave similarly to '--reference', except that it operates on a temporary basis, and does not assume that the reference repository will exist after the operation completes, so any used objects are copied into the local objects database.  In theory, this mechanism would be distinct from '--reference', so if both are used, some objects would be copied, and some objects would be accessible via a reference repository referenced by the alternates file.
> 
> Isn't this the same as git clone --reference <path> --no-hardlinks <url> ?

'--reference' adds an entry to 'info/alternates' inside the objects folder.  When an object is looked up, any objects folder listed in 'objects/info/alternates' is treated as an extension of the local objects folder.  So when fetch, for example, decides whether it already has a blob locally, it may answer "yes" and not download the blob at all, because the blob exists in one of the reference repositories.  If I clone one of my 80 GB repositories over SSH using a reference repository, the resulting clone is only about 175 KB, because it assumes the reference repository will continue to exist, so it doesn't own any objects itself.

The '--no-hardlinks' option is only applicable when hard linking is available in the first place - i.e., when cloning from one local folder to another on the same filesystem (assuming the filesystem supports hard links).
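
To see the alternates entry for yourself, here is a minimal sketch using throwaway local repositories.  The names 'upstream', 'cache.git' and 'mine' are made up for illustration, and a file:// URL stands in for the SSH remote.

```shell
#!/bin/sh
set -e
work=$(mktemp -d)
cd "$work"

# A stand-in "upstream" with a single commit (example content only).
git init -q upstream
echo hello > upstream/file.txt
git -C upstream add file.txt
git -C upstream -c user.name=Example -c user.email=example@example.com \
    commit -q -m "initial"

# A local bare cache of the upstream, like the one repocache maintains.
git clone -q --bare "file://$work/upstream" cache.git

# Clone over the file:// transport while borrowing from the cache.
git clone -q --reference "$work/cache.git" "file://$work/upstream" mine

# The borrowed object store is recorded here, one path per line.
cat mine/.git/objects/info/alternates
```

The file should list the 'objects' folder of the cache, which is why the new clone can stay so small.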

Thanks,
 - Andrew

* Re: Borrowing objects from nearby repositories
  2014-03-24 21:21 ` Ævar Arnfjörð Bjarmason
  2014-03-25 13:13   ` Andrew Keller
@ 2014-03-25 17:02   ` Junio C Hamano
  2014-03-25 22:17     ` Junio C Hamano
  1 sibling, 1 reply; 10+ messages in thread
From: Junio C Hamano @ 2014-03-25 17:02 UTC
  To: Ævar Arnfjörð Bjarmason; +Cc: Andrew Keller, Git List

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

>> 1) Introduce '--borrow' to `git-fetch`.  This would behave similarly
>> to '--reference', except that it operates on a temporary basis, and
>> does not assume that the reference repository will exist after the
>> operation completes, so any used objects are copied into the local
>> objects database.  In theory, this mechanism would be distinct from
>> '--reference', so if both are used, some objects would be copied, and
>> some objects would be accessible via a reference repository referenced
>> by the alternates file.
>
> Isn't this the same as git clone --reference <path> --no-hardlinks
> <url> ?
>
> Also without --no-hardlinks we're not assuming that the other repo
> doesn't go away (you could rm-rf it), just that the files won't be
> *modified*, which Git won't do, but you could manually do with other
> tools, so the default is to hardlink.

I think that the standard practice with the existing toolset is to
clone with reference and then repack.  That is:

    $ git clone --reference <borrowee> git://over/there mine
    $ cd mine
    $ git repack -a -d

And then you can try this:

    $ mv .git/objects/info/alternates .git/objects/info/alternates.disabled
    $ git fsck

to make sure that you are no longer borrowing anything from the
borrowee.  Once you are satisfied, you can remove the saved-away
alternates.disabled file.
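
The sequence above can be sketched as one script.  This is only an
illustration under stated assumptions: the repositories are throwaway
stand-ins created in a temporary directory, a file:// URL replaces the
distant origin, and all names are hypothetical.

```shell
#!/bin/sh
set -e
work=$(mktemp -d)
cd "$work"

# Stand-ins for the distant origin and the nearby borrowee.
git init -q upstream
echo hello > upstream/file.txt
git -C upstream add file.txt
git -C upstream -c user.name=Example -c user.email=example@example.com \
    commit -q -m "initial"
git clone -q --bare "file://$work/upstream" borrowee.git

# Clone with a reference, then copy the borrowed objects into our own pack.
git clone -q --reference "$work/borrowee.git" "file://$work/upstream" mine
cd mine
git repack -a -d -q

# Disable the alternates file and verify nothing is missing.
mv .git/objects/info/alternates .git/objects/info/alternates.disabled
git fsck

# fsck succeeded, so the saved-away file is no longer needed.
rm .git/objects/info/alternates.disabled
```

Because of "set -e", the script stops before the final "rm" if fsck
reports a problem, leaving the disabled file in place for inspection.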

* Re: Borrowing objects from nearby repositories
  2014-03-25 17:02   ` Junio C Hamano
@ 2014-03-25 22:17     ` Junio C Hamano
  2014-03-26 13:36       ` Andrew Keller
  0 siblings, 1 reply; 10+ messages in thread
From: Junio C Hamano @ 2014-03-25 22:17 UTC
  To: Ævar Arnfjörð Bjarmason; +Cc: Andrew Keller, Git List

Junio C Hamano <gitster@pobox.com> writes:

> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>
>>> 1) Introduce '--borrow' to `git-fetch`.  This would behave similarly
>>> to '--reference', except that it operates on a temporary basis, and
>>> does not assume that the reference repository will exist after the
>>> operation completes, so any used objects are copied into the local
>>> objects database.  In theory, this mechanism would be distinct from
>>> '--reference', so if both are used, some objects would be copied, and
>>> some objects would be accessible via a reference repository referenced
>>> by the alternates file.
>>
>> Isn't this the same as git clone --reference <path> --no-hardlinks
>> <url> ?
>>
>> Also without --no-hardlinks we're not assuming that the other repo
>> doesn't go away (you could rm-rf it), just that the files won't be
>> *modified*, which Git won't do, but you could manually do with other
>> tools, so the default is to hardlink.
>
> I think that the standard practice with the existing toolset is to
> clone with reference and then repack.  That is:
>
>     $ git clone --reference <borrowee> git://over/there mine
>     $ cd mine
>     $ git repack -a -d
>
> And then you can try this:
>
>     $ mv .git/objects/info/alternates .git/objects/info/alternates.disabled
>     $ git fsck
>
> to make sure that you are no longer borrowing anything from the
> borrowee.  Once you are satisfied, you can remove the saved-away
> alternates.disabled file.

Oh, I forgot to say that I am not opposed if somebody wants to teach
"git clone" a new option to copy its objects from two places,
(hopefully) the majority from near-by reference repository and the
remainder over the network, without permanently relying on the
former via the alternates mechanism.  The implementation of such a
feature could even literally be "clone with reference first and then
repack" at least initially but even in the final version.

* Re: Borrowing objects from nearby repositories
  2014-03-25 22:17     ` Junio C Hamano
@ 2014-03-26 13:36       ` Andrew Keller
  2014-03-26 17:29         ` Junio C Hamano
  0 siblings, 1 reply; 10+ messages in thread
From: Andrew Keller @ 2014-03-26 13:36 UTC
  To: Junio C Hamano; +Cc: Ævar Arnfjörð Bjarmason, Git List

On Mar 25, 2014, at 6:17 PM, Junio C Hamano <gitster@pobox.com> wrote:

> Junio C Hamano <gitster@pobox.com> writes:
> 
>> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>> 
>>>> 1) Introduce '--borrow' to `git-fetch`.  This would behave similarly
>>>> to '--reference', except that it operates on a temporary basis, and
>>>> does not assume that the reference repository will exist after the
>>>> operation completes, so any used objects are copied into the local
>>>> objects database.  In theory, this mechanism would be distinct from
>>>> '--reference', so if both are used, some objects would be copied, and
>>>> some objects would be accessible via a reference repository referenced
>>>> by the alternates file.
>>> 
>>> Isn't this the same as git clone --reference <path> --no-hardlinks
>>> <url> ?
>>> 
>>> Also without --no-hardlinks we're not assuming that the other repo
>>> doesn't go away (you could rm-rf it), just that the files won't be
>>> *modified*, which Git won't do, but you could manually do with other
>>> tools, so the default is to hardlink.
>> 
>> I think that the standard practice with the existing toolset is to
>> clone with reference and then repack.  That is:
>> 
>>    $ git clone --reference <borrowee> git://over/there mine
>>    $ cd mine
>>    $ git repack -a -d
>> 
>> And then you can try this:
>> 
>>    $ mv .git/objects/info/alternates .git/objects/info/alternates.disabled
>>    $ git fsck
>> 
>> to make sure that you are no longer borrowing anything from the
>> borrowee.  Once you are satisfied, you can remove the saved-away
>> alternates.disabled file.
> 
> Oh, I forgot to say that I am not opposed if somebody wants to teach
> "git clone" a new option to copy its objects from two places,
> (hopefully) the majority from near-by reference repository and the
> remainder over the network, without permanently relying on the
> former via the alternates mechanism.  The implementation of such a
> feature could even literally be "clone with reference first and then
> repack" at least initially but even in the final version.

That was actually one of my first ideas - adding some sort of '--auto-repack' option to git-clone.  It's a relatively small change, and would work.  However, keeping in mind my end goal of automating the feature to the point where you could run simply 'git clone <url>', an '--auto-repack' option is more difficult to undo.  You would need a new parameter to disable the automatic adding of reference repositories, and a new parameter to undo '--auto-repack', and you'd have to remember to actually undo both of those settings.

In contrast, if the new feature was '--borrow', and the evolution of the feature was a global configuration 'fetch.autoBorrow', then to turn it off temporarily, one only needs a single new parameter '--no-auto-borrow'.  I think this is a cleaner approach than the former, although much more work.

Thanks,
 - Andrew Keller

* Re: Borrowing objects from nearby repositories
  2014-03-26 13:36       ` Andrew Keller
@ 2014-03-26 17:29         ` Junio C Hamano
  2014-03-28 14:52           ` Andrew Keller
  0 siblings, 1 reply; 10+ messages in thread
From: Junio C Hamano @ 2014-03-26 17:29 UTC
  To: Andrew Keller; +Cc: Ævar Arnfjörð Bjarmason, Git List

Andrew Keller <andrew@kellerfarm.com> writes:

> On Mar 25, 2014, at 6:17 PM, Junio C Hamano <gitster@pobox.com> wrote:
> ...
>>> I think that the standard practice with the existing toolset is to
>>> clone with reference and then repack.  That is:
>>> 
>>>    $ git clone --reference <borrowee> git://over/there mine
>>>    $ cd mine
>>>    $ git repack -a -d
>>> 
>>> And then you can try this:
>>> 
>>>    $ mv .git/objects/info/alternates .git/objects/info/alternates.disabled
>>>    $ git fsck
>>> 
>>> to make sure that you are no longer borrowing anything from the
>>> borrowee.  Once you are satisfied, you can remove the saved-away
>>> alternates.disabled file.
>> 
>> Oh, I forgot to say that I am not opposed if somebody wants to teach
>> "git clone" a new option to copy its objects from two places,
>> (hopefully) the majority from near-by reference repository and the
>> remainder over the network, without permanently relying on the
>> former via the alternates mechanism.  The implementation of such a
>> feature could even literally be "clone with reference first and then
>> repack" at least initially but even in the final version.

[Administrivia: please wrap your lines to a reasonable length]

> That was actually one of my first ideas - adding some sort of
> '--auto-repack' option to git-clone.  It's a relatively small
> change, and would work.  However, keeping in mind my end goal of
> automating the feature to the point where you could run simply
> 'git clone <url>', an '--auto-repack' option is more difficult to
> undo.  You would need a new parameter to disable the automatic
> adding of reference repositories, and a new parameter to undo
> '--auto-repack', and you'd have to remember to actually undo both
> of those settings.
>
> In contrast, if the new feature was '--borrow', and the evolution
> of the feature was a global configuration 'fetch.autoBorrow', then
> to turn it off temporarily, one only needs a single new parameter
> '--no-auto-borrow'.  I think this is a cleaner approach than the
> former, although much more work.

I think you may have misread me.  With the "new option", I was
hinting that the "clone --reference && repack && rm alternates"
will be an acceptable internal implementation of the "--borrow"
option that was mentioned in the thread.  I am not sure where you
got the "auto-repack" from.

One of the reasons you may have misread me may be because I made it
sound as if "this may work and when it works you will be happy, but
if it does not work you did not lose very much" by mentioning "mv &&
fsck".  That wasn't what I meant.

The "repack -a" procedure is to make the borrower repository no
longer dependent on the borrowee, and it is supposed to always work.
In fact, this behaviour was the whole reason why "repack" later
learned its "-l" option to disable it, because people who cloned
with "--reference" in order to reduce the disk footprint by sharing
older and more common objects [*1*] were rightfully surprised to see
that the borrowed objects were copied over to their borrower
repository when they ran "repack" [*2*].
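
A minimal sketch of that difference, using throwaway repositories
under a temporary directory (all names are illustrative, and a file://
URL stands in for the distant origin): with "-l" the borrower keeps no
packs of its own, without it the borrowed objects are copied in.

```shell
#!/bin/sh
set -e
work=$(mktemp -d)
cd "$work"

# Stand-ins for the distant origin and the borrowee.
git init -q upstream
echo hello > upstream/file.txt
git -C upstream add file.txt
git -C upstream -c user.name=Example -c user.email=example@example.com \
    commit -q -m "initial"
git clone -q --bare "file://$work/upstream" borrowee.git
git clone -q --reference "$work/borrowee.git" "file://$work/upstream" mine
cd mine

# Count packfiles that live in the borrower's own object store.
count_packs() { find .git/objects/pack -name '*.pack' | wc -l | tr -d ' '; }

# With -l, objects reachable through alternates are left alone.
git repack -a -d -l -q
lean=$(count_packs)

# Without -l, borrowed objects are packed into the borrower.
git repack -a -d -q
fat=$(count_packs)

echo "local packs with -l: $lean, without -l: $fat"
```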

Because this is "clone", there is nothing complex to "undo".  Either
it succeeds, or you remove the whole new directory if anything
fails.

I said "even in the final version" for a simple reason: you cannot
realistically do any better than the "clone --reference &&
repack -a -d && rm alternates" sequence.

But you would need to know a few things about how Git works in order
to come to that realisation.  Here are some:

 * "clone --borrow" (or whatever we end up calling the option) must
   talk to two repositories:

    - We will need to have one upload-pack session with the distant
      origin repository over the network, which will send a complete
      pack.

    - We need to also copy objects that weren't sent from the
      distant origin to our repository from the reference one.

 * A single "repack -a -d" (without "-l") after "clone --reference"
   is already a way to do exactly what you need---enumerate what are
   missing in the packfile that was received from the distant origin
   and come up with packfile(s) that contain all and only objects
   the cloned repository needs.

 * You cannot easily concatenate multiple packfiles into a single
   one (or append runs of objects to an existing packfile) to come
   up with a single packfile.

You _could_ shoehorn the logic to "enumerate and read from the
reference, and append them at the end of the packfile received from
the distant origin repository" into the part that talks to the
distant origin repository, but the object layout in the resulting
packfile will be suboptimal [*3*] and the code complexity required
to do so is not worth it [*4*].


[Footnotes]

*1* From the point of view of supporting both camps, i.e. those who
    want their borrower repositories to keep sharing the objects
    with the borrowee repository and those who want to use a
    borrowee repository temporarily while cloning only to reduce the
    network cost from the distant upstream, the current option name
    "--reference" and the proposed name "--borrow" are backwards.
    The folks who want the original behaviour of continuing to
    depend on the borrowee repository want to "borrow" from it;
    those who want to use the mechanism only while cloning would
    merely want to "reference" it.

*2* A repository created with "clone --reference" may want to set a
    configuration variable in it to tell future invocations of "git
    repack" to use the "-l" option by default, while allowing those
    who want to fatten such a repository to override it with "repack
    --no-local".  Without such an arrangement, the people who
    wanted that "-l" option for "repack" in the first place risk
    fattening their lean repositories simply by forgetting to
    pass the "-l" option.  Luckily "gc" always runs
    "repack" with "-l", so the risk is limited to those who run
    "repack" themselves, which may be why we heard no complaints on
    this point.

*3* In fact, another program invoked during the object transfer,
    "index-pack --fix-thin", does have "append selected objects at
    the end of the packfile received over the wire and fix up the
    whole thing" logic in it.  The pack that results from it
    suffers from the suboptimal layout because the appended objects
    are "appended", not placed in their optimal positions in the
    packfile to reduce seeks.

    If "--borrow" did the same, the pack layout issue would be worse,
    because the whole point of "--borrow" is to borrow the majority
    of objects from the reference repository---we would be appending
    a lot from the reference to a relatively small pack we receive
    over the wire from the distant origin.

*4* The complexity of the code to implement "index-pack --fix-thin"
    is not pretty, but it will be worse for "--borrow", as the
    former at least does not have to walk the dag to find out what
    objects need to be appended but the latter does.

* Re: Borrowing objects from nearby repositories
  2014-03-26 17:29         ` Junio C Hamano
@ 2014-03-28 14:52           ` Andrew Keller
  2014-03-28 17:02             ` Junio C Hamano
  0 siblings, 1 reply; 10+ messages in thread
From: Andrew Keller @ 2014-03-28 14:52 UTC
  To: Junio C Hamano; +Cc: Ævar Arnfjörð Bjarmason, Git List

On Mar 26, 2014, at 1:29 PM, Junio C Hamano <gitster@pobox.com> wrote:

> Andrew Keller <andrew@kellerfarm.com> writes:
> 
>> On Mar 25, 2014, at 6:17 PM, Junio C Hamano <gitster@pobox.com> wrote:
>> ...
>>>> I think that the standard practice with the existing toolset is to
>>>> clone with reference and then repack.  That is:
>>>> 
>>>>   $ git clone --reference <borrowee> git://over/there mine
>>>>   $ cd mine
>>>>   $ git repack -a -d
>>>> 
>>>> And then you can try this:
>>>> 
>>>>   $ mv .git/objects/info/alternates .git/objects/info/alternates.disabled
>>>>   $ git fsck
>>>> 
>>>> to make sure that you are no longer borrowing anything from the
>>>> borrowee.  Once you are satisfied, you can remove the saved-away
>>>> alternates.disabled file.
>>> 
>>> Oh, I forgot to say that I am not opposed if somebody wants to teach
>>> "git clone" a new option to copy its objects from two places,
>>> (hopefully) the majority from near-by reference repository and the
>>> remainder over the network, without permanently relying on the
>>> former via the alternates mechanism.  The implementation of such a
>>> feature could even literally be "clone with reference first and then
>>> repack" at least initially but even in the final version.
> 
> [Administrivia: please wrap your lines to a reasonable length]
> 
>> That was actually one of my first ideas - adding some sort of
>> '--auto-repack' option to git-clone.  It's a relatively small
>> change, and would work.  However, keeping in mind my end goal of
>> automating the feature to the point where you could run simply
>> 'git clone <url>', an '--auto-repack' option is more difficult to
>> undo.  You would need a new parameter to disable the automatic
>> adding of reference repositories, and a new parameter to undo
>> '--auto-repack', and you'd have to remember to actually undo both
>> of those settings.
>> 
>> In contrast, if the new feature was '--borrow', and the evolution
>> of the feature was a global configuration 'fetch.autoBorrow', then
>> to turn it off temporarily, one only needs a single new parameter
>> '--no-auto-borrow'.  I think this is a cleaner approach than the
>> former, although much more work.
> 
> I think you may have misread me.  With the "new option", I was
> hinting that the "clone --reference && repack && rm alternates"
> will be an acceptable internal implementation of the "--borrow"
> option that was mentioned in the thread.  I am not sure where you
> got the "auto-repack" from.

Ah, yes - that is better than what I was thinking.  I was thinking a bit
too low-level, and using two arguments in the place of your one.

> One of the reasons you may have misread me may be because I made it
> sound as if "this may work and when it works you will be happy, but
> if it does not work you did not lose very much" by mentioning "mv &&
> fsck".  That wasn't what I meant.
> 
> The "repack -a" procedure is to make the borrower repository no
> longer dependent on the borrowee, and it is supposed to always work.
> In fact, this behaviour was the whole reason why "repack" later
> learned its "-l" option to disable it, because people who cloned
> with "--reference" in order to reduce the disk footprint by sharing
> older and more common objects [*1*] were rightfully surprised to see
> that the borrowed objects were copied over to their borrower
> repository when they ran "repack" [*2*].
> 
> Because this is "clone", there is nothing complex to "undo".  Either
> it succeeds, or you remove the whole new directory if anything
> fails.
> 
> I said "even in the final version" for a simple reason: you cannot
> realistically do any better than the "clone --reference &&
> repack -a -d && rm alternates" sequence.

Wow, that's very insightful - thanks!  So, it sounds like I was right about
the general areas of concern when trying to do this during a fetch, but
I underestimated just how complicated it would be.

Okay, so to re-frame my idea, like you said, the goal is to find a user-
friendly way for the user to tell git-clone to set up the alternates file
(or perhaps just use the --alternates parameter), and run a repack,
and disconnect the alternate.  And yet, we still want to be able to use
--reference on its own, because there are existing use cases for that.

Thanks!
 - Andrew Keller

* Re: Borrowing objects from nearby repositories
  2014-03-28 14:52           ` Andrew Keller
@ 2014-03-28 17:02             ` Junio C Hamano
  0 siblings, 0 replies; 10+ messages in thread
From: Junio C Hamano @ 2014-03-28 17:02 UTC (permalink / raw)
  To: Andrew Keller; +Cc: Ævar Arnfjörð Bjarmason, Git List

Andrew Keller <andrew@kellerfarm.com> writes:

> Okay, so to re-frame my idea, like you said, the goal is to find a user-
> friendly way for the user to tell git-clone to set up the alternates file
> (or perhaps just use the --alternates parameter), and run a repack,
> and disconnect the alternate.  And yet, we still want to be able to use
> --reference on its own, because there are existing use cases for that.

Here are a few possible action items that came out of this
discussion:

 1. Introduce a new "--borrow" option to "git clone".

    The updates to the SYNOPSIS section may go like this:

    -'git clone' [--reference <repository>] ...other options...
    +'git clone' [[--reference|--borrow] <repository>] ...other options...

    The new option can be used instead of "--reference" and they
    will be mutually incompatible.  The first implementation of the
    "--borrow" option would do the following:

      (1) run the same "git clone" with the same command line but
          replacing "--borrow" with "--reference"; if this fails, exit
          with the same failure.

      (2) in the resulting repository, run "git repack -a -d"; if this
          fails, remove the entire directory the first step created,
          and exit with failure.

      (3) remove .git/objects/info/alternates from the resulting
          repository and exit with success.

    and it may be acceptable as the final implementation as well.


 2. Make "git repack" safer for the users of "clone --reference" who
    want to keep sharing objects from the original.

    - Introduce the "repack.local" configuration variable that can
      be set to either true or false.  Missing variable defaults to
      "false".

    - A "repack" that is run without "-l" option on the command line
      will pretend as if it was given "-l" from the command line if
      "repack.local" is set to "true".  Add "repack --no-local"
      option to countermand this configuration variable from the
      command line.

    - Teach "git clone --reference" (but not "git clone --borrow")
      to set "repack.local = true" in the configuration of the
      resulting repository.
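
A minimal shell sketch of the two action items above, built only from
commands that exist today.  The function names borrow_clone and
repack_respecting_local are hypothetical, "repack.local" and
"--no-local" are only the proposals from this message (Git is not
assumed to know them), and the clone is assumed to be non-bare so
that the alternates file lives under .git/:

```shell
#!/bin/sh

# Item 1: emulate the proposed "git clone --borrow".
# usage: borrow_clone <reference-repo> <url> <directory>
borrow_clone() {
    ref=$1 url=$2 dir=$3

    # (1) Ordinary clone that borrows objects via the alternates file;
    #     on failure, exit with the same status.
    git clone --quiet --reference "$ref" "$url" "$dir" || return

    # (2) Copy every borrowed object into a local pack; on failure,
    #     remove the directory created by step (1).
    if ! git -C "$dir" repack -a -d -q
    then
        rm -rf "$dir"
        return 1
    fi

    # (3) Disconnect from the reference repository.
    rm -f "$dir/.git/objects/info/alternates"
}

# Item 2: emulate a "git repack" that honors the proposed
# "repack.local" setting.  A leading "--no-local" (also only a
# proposal) countermands the setting.  Run inside the repository.
repack_respecting_local() {
    if test "$1" = --no-local
    then
        shift
        git repack "$@"
        return
    fi

    # An unset variable yields empty output, i.e. the "false" default.
    if test "$(git config --bool --get repack.local)" = true
    then
        git repack -l "$@"
    else
        git repack "$@"
    fi
}
```

For example, "borrow_clone /srv/cache/project.git <url> project"
would leave a self-contained "project" directory, while a repository
cloned with plain "--reference" and configured with "repack.local =
true" keeps its alternates intact under "repack_respecting_local -a -d".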

Thread overview: 10+ messages
2014-03-12  3:37 Borrowing objects from nearby repositories Andrew Keller
2014-03-23 18:04 ` Phil Hord
2014-03-24 21:21 ` Ævar Arnfjörð Bjarmason
2014-03-25 13:13   ` Andrew Keller
2014-03-25 17:02   ` Junio C Hamano
2014-03-25 22:17     ` Junio C Hamano
2014-03-26 13:36       ` Andrew Keller
2014-03-26 17:29         ` Junio C Hamano
2014-03-28 14:52           ` Andrew Keller
2014-03-28 17:02             ` Junio C Hamano