From: Junio C Hamano <junkio@cox.net>
To: git@vger.kernel.org
Subject: [RFC] shallow clone
Date: Sun, 29 Jan 2006 23:18:50 -0800
Message-ID: <7voe1uchet.fsf@assigned-by-dhcp.cox.net>

Shallow History Cloning
=======================

One good thing about a git repository is that each clone is a
freestanding and complete entity, and you can keep developing in
it offline, without talking to the outside world, knowing that
you can sync with them later when online.

It is also a bad thing.  It gives people working on projects
with long development history stored in CVS a heart attack when
we tell them that their clones need to store the whole history.

There was a suggestion by Linus to allow a partial clone using a
syntax like this:

	$ git clone --since=v2.6.14 git://.../linux-2.6/ master

Here is an outline of what changes are needed to the current
core to do this.


Strategy
--------

We have the `info/grafts` mechanism to fake parent information for
commit objects.  Using this facility, we could roughly do:

. Download the full tree for the v2.6.14 commit and store its
  objects locally.

. Set up `info/grafts` to lie to the local git that the Linux
  kernel history began at v2.6.14.

. Run `git fetch git://.../linux-2.6 master`, with a local ref
  pointing at the v2.6.14 commit, to pretend to `upload-pack`
  running on the other end that we have everything up to
  v2.6.14.

. Update the `origin` branch with the master commit object name
  we just fetched from Linus.

There are some issues.

. In the fetch above to obtain everything after v2.6.14, and
  future runs of `git fetch origin`, if a blob that is in the
  commit being fetched happens to match what used to be in a
  commit that is older than v2.6.14 (e.g. a patch was reverted),
  `upload-pack` running on the other end is free to omit sending
  it, because we are telling it that we are up to date with
  respect to v2.6.14.  Although I think the current `rev-list
  --objects` implementation does not always do such a revert
  optimization if the revert is to a blob in a revision that is
  sufficiently old, it is free to optimize more aggressively in
  the future.

. Later when the user decides to fetch older history, the
  operation can become a bit cumbersome.

I think the latter one is cumbersome but is doable -- we could
do the equivalent of:

	$ git clone --since=v2.6.13 origin v2.6.14

then store all the objects obtained by such a clone/fetch
operation and record that our history now begins at v2.6.13.
So let's worry about that later.

For the first issue, we need to have the other end cooperate
while fetching from it.  If the other end also thinks the
development started at v2.6.14, even if we tell it that we have
the history up to v2.6.14 (or a commit we obtained since then),
there is no way for `upload-pack` running there to optimize too
aggressively and assume we have a blob that appeared in v2.6.13.
More simply, we do not have to tell them we have anything -- if
the other end thinks the epoch is at v2.6.14, only commits that
come later will be sent to us.
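
To make the first issue concrete, here is a toy model in shell.
It is not real git plumbing: object names are plain words, the
object lists are made up for illustration, and `to_send` is a
hypothetical helper that computes "objects wanted minus objects
the sender assumes the receiver already has".

```shell
#!/bin/sh
# Toy model only: objects are words, not real object names.
LC_ALL=C; export LC_ALL

# to_send WANTED ASSUMED
# Print the objects in WANTED that are not in ASSUMED, i.e. what a
# sender would pack if it believed the receiver already had ASSUMED.
to_send () {
	wanted=$(mktemp) && assumed=$(mktemp) || return 1
	# $1 and $2 are deliberately unquoted: split the lists into words.
	printf '%s\n' $1 | sort -u >"$wanted"
	printf '%s\n' $2 | sort -u >"$assumed"
	comm -23 "$wanted" "$assumed"
	rm -f "$wanted" "$assumed"
}

# Objects needed for the tip.  blob-reverted is identical to a blob
# that last appeared in v2.6.13 (a patch was reverted).
tip="commit-tip tree-tip blob-new blob-reverted"

# What the sender assumes we have when it sees the full history ...
full="commit-v2.6.13 blob-reverted commit-v2.6.14 blob-v2.6.14"
# ... and when it also thinks history begins at v2.6.14.
cauterized="commit-v2.6.14 blob-v2.6.14"

echo "full history view sends:"; to_send "$tip" "$full"
echo "cauterized view sends:";   to_send "$tip" "$cauterized"
```

With the full view, `blob-reverted` is left out even though we
never stored it; with the cauterized view it is sent.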


Design
------

First, to bootstrap the process, we would need to add a way to
obtain all objects associated with a commit.  We could do a new
program, or we could implement this as a protocol extension to
`upload-pack`.  My current inclination is the latter.

When talking with `upload-pack` that supports this extension,
the downloader can give one commit object name and get a pack
that contains all the objects in the tree associated with that
commit, plus the commit object itself.  This is a rough
equivalent of running the commit walker with the `-t` flag.

Another piece of functionality we would need is a way to tell
`upload-pack` to use `info/grafts` of the downloader's choice.
With this, after fetching the objects for the v2.6.14 commit,
the downloader can set up its own grafts file to cauterize the
development history at v2.6.14, and tell `upload-pack` to
pretend the kernel history starts at that commit, while sending
the tip of Linus' development track to us.

Using the extended protocol (let's call it 'shallow' extension),
a clone to create a repository that has only recent kernel
history since v2.6.14 goes like this:

The first phase is to fetch v2.6.14 itself.

[NOTE]
Most likely this is not directly run by the user but is run as
the first command invoked by the shallow clone script.

1. The `fetch-pack` command acquires a new option, `--single`:

	$ git-fetch-pack --single git://.../linux-2.6/ v2.6.14

   This talks with `upload-pack` on the kernel.org server via
   `git-daemon`.

2. `upload-pack` tells the fetcher what commits it has,
   what their refs are, and what protocol extensions it
   supports, as usual.

3. If the fetcher does not see the `shallow` extension
   advertised, there is no way to get a single tree, so things
   fail here.  Otherwise, it sends a `single X{40}\0` request
   instead of the usual `want` line.  The object name sent here
   is the desired commit.

4. `upload-pack` notices this is a single commit request, and
   sends an ACK if it can satisfy the request (or a NAK if it
   can't, e.g. it does not have the asked commit).  Instead of
   doing the usual `get_common_commits` followed by
   `create_pack_file`, it does:

	$ git rev-list -n1 --objects $commit | git pack-objects --stdout

   and sends the result out.

5. The fetcher checks the ACK and receives the objects.
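
The `single` request from step 3 might look like this on the
wire.  This is only a sketch: it assumes the request is framed
the way existing requests in the git protocol are, as a
pkt-line whose four hex digits give the total length including
the prefix itself.  `single_request` is a hypothetical helper,
and the RFC does not pin down the framing.

```shell
#!/bin/sh
# single_request COMMIT
# Emit a hypothetical "single" request for one 40-hex commit name,
# assuming pkt-line framing (4 hex digits of total length, payload).
single_request () {
	commit=$1
	[ ${#commit} -eq 40 ] || {
		echo "need a 40-character object name" >&2
		return 1
	}
	# 4 (length prefix) + 7 ("single ") + 40 (name) + 1 (NUL)
	len=$((4 + 7 + 40 + 1))
	printf '%04x' "$len"
	printf 'single %s\0' "$commit"
}

# Show the request with the NUL made visible as "@".
single_request 0123456789abcdef0123456789abcdef01234567 | tr '\0' '@'
echo
```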

After the above exchange, we have downloaded v2.6.14 commit and
its objects but not its history.  `git-fetch-pack` would output
the tag object name for `v2.6.14` and we would stash it away in
`$GIT_DIR/FETCH_HEAD` as usual.  Then we set up `info/grafts`
with this:

	$ git rev-parse FETCH_HEAD^{commit} >"$GIT_DIR/info/grafts"

This cauterizes the history on our end.
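
For reference, each line in the grafts file is a commit object
name followed by the names of its pretend parents, separated by
SP; a line with only the commit name means "no parents".  A
minimal sketch of a helper that writes such a line follows;
`graft_line` is a made-up name, not an existing command.

```shell
#!/bin/sh
# graft_line COMMIT [PARENT...]
# Print one grafts-file entry.  With no PARENT the commit is
# recorded as parentless, which is the cauterizing case.
graft_line () {
	[ $# -ge 1 ] || return 1
	for name in "$@"; do
		# Every name must be 40 lowercase hex digits.
		case $name in
		*[!0-9a-f]*) return 1 ;;
		esac
		[ ${#name} -eq 40 ] || return 1
	done
	# "$*" joins the names with SP (the default first IFS character).
	echo "$*"
}

# Example with a placeholder name; in the shallow clone script it
# would come from `git rev-parse FETCH_HEAD^{commit}`.
graft_line 0123456789abcdef0123456789abcdef01234567
```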

The second phase of the shallow clone is to fetch the history
since v2.6.14 to the tip.

1. The `fetch-pack` command is run as usual.  Most likely the
   command line run by the shallow clone script would be:

	$ git fetch-pack git://.../linux-2.6/ master

   Notice there is nothing magical about it.  It is just
   business as usual.

2. `upload-pack` does its usual greeting to the downloader.

3. We notice the `shallow` extension again, and first send out
   a `graft X{40}\0` request.  The syntax of a graft request
   would be `graft ` followed by one or more commit object names
   on a line separated with SP.  After sending out all the
   needed graft requests (in this example there is only one, to
   cauterize the history at v2.6.14), we send the usual `want
   X{40}\0multi_ack` and a flush.

4. `upload-pack` notices graft requests, reinitializes its graft
   information with what it receives from the other end, and
   then records `want`.

5. After the above steps, the usual `upload-pack` vs
   `fetch-pack` exchange continues and objects needed to
   complete Linus' tip of development for somebody who has
   v2.6.14 are sent in a pack.  The difference from the usual
   operation is that `upload-pack` during this run thinks the
   v2.6.14 commit does not have any parent.
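
What the downloader sends in step 3 could be put together
roughly as below.  As before, pkt-line framing (four hex digits
of total length, then the payload, with `0000` as the flush) is
an assumption about how the proposed requests would be carried;
`pkt_line` and `shallow_fetch_request` are hypothetical names.

```shell
#!/bin/sh
# pkt_line: read a payload on stdin and emit it with a 4-hex-digit
# length prefix that counts the prefix itself.
pkt_line () {
	tmp=$(mktemp) || return 1
	cat >"$tmp"
	printf '%04x' $(( $(wc -c <"$tmp") + 4 ))
	cat "$tmp"
	rm -f "$tmp"
}

# shallow_fetch_request WANT GRAFT...
# Emit one graft request naming every GRAFT commit, then the usual
# want line, then a flush.
shallow_fetch_request () {
	[ $# -ge 2 ] || return 1
	want=$1; shift
	printf 'graft %s\0' "$*" | pkt_line
	printf 'want %s\0multi_ack' "$want" | pkt_line
	printf '0000'
}

# Placeholder object names; NULs are shown as "@".
shallow_fetch_request \
	89abcdef0123456789abcdef0123456789abcdef \
	0123456789abcdef0123456789abcdef01234567 | tr '\0' '@'
echo
```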

The exact sequence from the second part of the initial "shallow
clone" can be used for further updates.

There is a small issue about the actual implementation.  In the
above description I pretended that `upload-pack` can be told to
use phony grafts information, but in the current implementation
the program that needs to use phony grafts information is
`rev-list` spawned from it.  We _could_ point the GIT_GRAFT_FILE
environment variable at a temporary file while we do so, but I'd
like to avoid using a temporary file if possible, given that
`upload-pack` is run from `git-daemon`.  Maybe we could give a
`--read-graft-from-stdin` flag to `rev-list` for this purpose.
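
For comparison, the temporary-file variant could be sketched as
below.  `with_grafts` is a hypothetical wrapper; it relies only
on the GIT_GRAFT_FILE environment variable mentioned above, and
does not use the proposed `--read-graft-from-stdin` flag, which
does not exist yet.

```shell
#!/bin/sh
# with_grafts COMMAND [ARG...]
# Read graft lines on stdin, store them in a temporary file, run
# COMMAND with GIT_GRAFT_FILE pointing at it, then clean up.
with_grafts () {
	tmp=$(mktemp) || return 1
	cat >"$tmp"
	GIT_GRAFT_FILE=$tmp "$@"
	status=$?
	rm -f "$tmp"
	return $status
}

# Intended use inside a repository (not run here):
#   echo "$commit" | with_grafts git rev-list --objects master
# Demonstration without git: show what the command would see.
echo 0123456789abcdef0123456789abcdef01234567 |
with_grafts sh -c 'cat "$GIT_GRAFT_FILE"'
```

The cleanup step is what makes this unattractive under
`git-daemon`: a killed `upload-pack` would leave the file behind.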


Anybody want to try?
