From: Jonathan Tan <jonathantanmy@google.com>
To: git@vger.kernel.org
Cc: Jonathan Tan <jonathantanmy@google.com>,
gitster@pobox.com, git@jeffhostetler.com, peartben@gmail.com,
christian.couder@gmail.com
Subject: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches)
Date: Fri, 29 Sep 2017 13:11:36 -0700
Message-ID: <cover.1506714999.git.jonathantanmy@google.com>
These patches are also available online:
https://github.com/jonathantanmy/git/commits/partialclone3
(I've announced it in another e-mail, but am now sending the patches to the
mailing list too.)
Here's an update of my work so far. Notable features:
- These 18 patches allow a user to clone with --blob-max-bytes=<bytes>,
creating a partial clone that is automatically configured to lazily
fetch missing objects from the origin. The local repo also has fsck
working offline, and GC working (albeit only on locally created
objects).
- Cloning and fetching are currently only able to exclude blobs by a
size threshold, but the local repository is already capable of
fetching missing objects of any type. For example, if a repository
with missing trees or commits is generated by any tool (for example,
a future version of Git), current Git with my patches will still be
able to operate on them, automatically fetching those missing trees
and commits when needed.
- Missing blobs are fetched all at once during checkout.
Jeff Hostetler has sent out some object-filtering patches [1] that are a
superset of the object-filtering functionality that I have (in the
pack-objects patches). I have gone for the minimal approach here, but if
his patches are merged, I'll update my patch set to use those.
[1] https://public-inbox.org/git/20170922203017.53986-6-git@jeffhostetler.com/
Demo
====
Obtain a repository.
$ make prefix=$HOME/local install
$ cd $HOME/tmp
$ git clone https://github.com/git/git
Make it advertise the new feature and allow requests for arbitrary blobs.
$ git -C git config uploadpack.advertiseblobmaxbytes 1
$ git -C git config uploadpack.allowanysha1inwant 1
Perform the partial clone and check that it is indeed smaller. Specify
"file://" in order to test the partial clone mechanism. (If not, Git will
perform a local clone, which unselectively copies every object.)
$ git clone --blob-max-bytes=0 "file://$(pwd)/git" git2
$ git clone "file://$(pwd)/git" git3
$ du -sh git2 git3
85M git2
130M git3
Observe that the new repo is automatically configured to fetch missing objects
from the original repo. Subsequent fetches will also be partial.
$ cat git2/.git/config
[core]
repositoryformatversion = 1
filemode = true
bare = false
logallrefupdates = true
[remote "origin"]
url = [snip]
fetch = +refs/heads/*:refs/remotes/origin/*
blobmaxbytes = 0
[extensions]
partialclone = origin
[branch "master"]
remote = origin
merge = refs/heads/master
Design
======
Local repository layout
-----------------------
A repository declares its dependence on a *promisor remote* (a remote that
declares that it can serve certain objects when requested) by a repository
extension "partialclone". `extensions.partialclone` must be set to the name of
the remote ("origin" in the demo above).
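As a rough illustration of the declaration above (not part of the patch set), a script could ask which remote a repository names as its promisor by reading the extension from the config file. The helper name `promisor_remote` and the scratch config file are made up for this sketch; `git config --file` and `--get` are existing options, and the two keys are the ones shown in the demo config.

```shell
# Hypothetical helper: print the promisor remote named in a config file,
# or fail if the file does not declare the extension.
promisor_remote() {
    git config --file "$1" --get extensions.partialclone
}

# Build a scratch config resembling the one from the demo.
cfg=$(mktemp)
git config --file "$cfg" extensions.partialclone origin
git config --file "$cfg" remote.origin.blobmaxbytes 0

promisor_remote "$cfg"                                      # prints: origin
git config --file "$cfg" --get remote.origin.blobmaxbytes   # prints: 0
```

In a real partial clone the file to inspect would be the repository's own `.git/config`, as in the `cat` output of the demo.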
A packfile can be annotated as originating from the promisor remote by the
existence of a "(packfile name).promisor" file with arbitrary contents (similar
to the ".keep" file). Whenever a promisor remote sends an object, it declares
that it can serve every object directly or indirectly referenced by the sent
object.
A promisor packfile is a packfile annotated with the ".promisor" file. A
promisor object is an object that the promisor remote is known to be able to
serve, because it is an object in a promisor packfile or directly referred to by
one.
(In the future, we might need to add ".promisor" support to loose objects.)
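The annotation scheme can be sketched with plain shell: a pack counts as a promisor packfile exactly when a sibling file with the ".promisor" suffix exists. The function name `list_packs` and the pack names below are invented for the example, and the directory is a scratch directory rather than a real `objects/pack`.

```shell
# Illustrative only: report, for each packfile in a directory, whether it
# is annotated with a sibling ".promisor" file.
list_packs() {
    pack_dir=$1
    for pack in "$pack_dir"/*.pack
    do
        # Skip the unexpanded glob when the directory has no packs.
        [ -e "$pack" ] || continue
        base=${pack%.pack}
        if [ -e "$base.promisor" ]
        then
            echo "promisor $(basename "$pack")"
        else
            echo "local $(basename "$pack")"
        fi
    done
}

# Made-up pack names in a scratch directory.
demo_dir=$(mktemp -d)
touch "$demo_dir/pack-aaaa.pack" "$demo_dir/pack-aaaa.idx" "$demo_dir/pack-aaaa.promisor"
touch "$demo_dir/pack-bbbb.pack" "$demo_dir/pack-bbbb.idx"
list_packs "$demo_dir"
```

Here `pack-aaaa.pack` is reported as a promisor packfile and `pack-bbbb.pack` as a local one. As with ".keep", only the existence of the file matters, not its contents.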
Connectivity check and gc
-------------------------
The object walk done by the connectivity check (as used by fsck and fetch) stops
at all promisor objects.
The object walk done by gc also stops at all promisor objects. Only non-promisor
packfiles are deleted (if pack deletion is requested); promisor packfiles are
left alone. This maintains the distinction between promisor packfiles and
non-promisor packfiles. (In the future, we might need to do something more
sophisticated with promisor packfiles.)
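To show why stopping at promisor objects lets the check succeed offline, here is a toy model of such a walk. It is not Git's implementation: objects are plain files whose lines name the objects they reference, the promisor objects are listed in a text file, and every name (`walk`, `c1`, `t1`, `p1`, `b1`, `b2`) is hypothetical.

```shell
# Toy model, not Git's implementation. Each file under $1 is an object whose
# lines name the objects it references; $2 lists promisor object names, one
# per line; the remaining arguments are the starting points of the walk.
walk() {
    graph=$1
    promisor=$2
    shift 2
    seen=" "
    # The positional parameters serve as the queue of objects to visit.
    while [ $# -gt 0 ]
    do
        obj=$1
        shift
        case "$seen" in
        *" $obj "*) continue ;;
        esac
        seen="$seen$obj "
        if grep -qxF -e "$obj" "$promisor"
        then
            echo "stop $obj"      # promisor object: do not look behind it
        elif [ -f "$graph/$obj" ]
        then
            echo "visit $obj"
            # Enqueue every object this one references.
            set -- "$@" $(cat "$graph/$obj")
        else
            echo "missing $obj"   # would be reported as a broken link
        fi
    done
}

# Local commit c1 -> local tree t1 -> local blob b1, and
# c1 -> promisor commit p1 -> blob b2, which is absent locally.
work=$(mktemp -d)
mkdir "$work/objects"
printf 't1\np1\n' >"$work/objects/c1"
printf 'b1\n' >"$work/objects/t1"
: >"$work/objects/b1"
printf 'b2\n' >"$work/objects/p1"
printf 'p1\nb2\n' >"$work/promisor"
walk "$work/objects" "$work/promisor" c1
```

The walk never looks behind `p1`, so the absent `b2` is not reported. With an empty promisor list, the same walk would reach `b2` and print `missing b2`.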
Fetching of missing objects
---------------------------
When `sha1_object_info_extended()` (or similar) is invoked, it will
automatically attempt to fetch a missing object from the promisor remote if that
object is not in the local repository. For efficiency, no check is made as to
whether that object is known to be a promisor object or not.
This automatic fetching can be toggled on and off by the `fetch_if_missing`
global variable, and it is on by default.
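The control flow described above can be modeled in a few lines of shell. This is only a sketch under loud assumptions: two scratch directories stand in for the local object store and the promisor remote, `cp` stands in for the real fetch, and `read_object` is a hypothetical name. Only `fetch_if_missing` is taken from the patches.

```shell
# Toy model of the lazy-fetch path; not the code in sha1_file.c.
fetch_if_missing=1   # mirrors the on-by-default toggle

read_object() {
    local_dir=$1
    remote_dir=$2
    oid=$3
    if [ ! -f "$local_dir/$oid" ]
    then
        # No check of whether $oid is a known promisor object: just try.
        [ "$fetch_if_missing" = 1 ] || return 1
        # Stand-in for the real fetch; fails if the remote lacks it too.
        cp "$remote_dir/$oid" "$local_dir/$oid" 2>/dev/null || return 1
    fi
    cat "$local_dir/$oid"
}

remote_store=$(mktemp -d)
local_store=$(mktemp -d)
echo hello >"$remote_store/blob1"
read_object "$local_store" "$remote_store" blob1   # prints: hello
```

After the call, `blob1` exists in the local directory, so a second read does not touch the remote. With `fetch_if_missing=0` and an empty local directory, the function returns non-zero instead of fetching.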
The actual fetch is done through the fetch-pack/upload-pack protocol. Right now,
this uses the fact that upload-pack allows blob and tree "want"s, and this
incurs the overhead of the unnecessary ref advertisement. I hope that protocol
v2 will allow us to declare that blob and tree "want"s are allowed, and allow
the client to declare that it does not want the ref advertisement. All packfiles
downloaded in this way are annotated with ".promisor".
Fetching with `git fetch`
-------------------------
The fetch-pack/upload-pack protocol has also been extended to support omission
of blobs above a certain size. The client only allows this when fetching from
the promisor remote, and will annotate any packs received from the promisor
remote with ".promisor".
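The effect of a size threshold on what the server sends can be sketched as a filter over "size name" pairs. The function name `filter_blobs` and the sample files are invented, and the sketch assumes that "above" means strictly larger than the threshold; it ignores any special cases the actual patches may apply.

```shell
# Illustrative filter: stdin holds "<size-in-bytes> <name>" lines, $1 is the
# threshold. Blobs strictly larger than the threshold are marked "omit".
filter_blobs() {
    awk -v max="$1" '{
        verdict = ($1 + 0 > max + 0) ? "omit" : "send"
        print verdict, $2
    }'
}

printf '%s\n' '0 empty.txt' '120 README' '5000000 big.bin' | filter_blobs 1000
```

With a threshold of 1000, only `big.bin` is marked "omit". With a threshold of 0, as in the demo, every non-empty blob in this model is omitted.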
Jonathan Tan (18):
fsck: introduce partialclone extension
fsck: support refs pointing to promisor objects
fsck: support referenced promisor objects
fsck: support promisor objects as CLI argument
index-pack: refactor writing of .keep files
introduce fetch-object: fetch one promisor object
sha1_file: support lazily fetching missing objects
rev-list: support termination at promisor objects
gc: do not repack promisor packfiles
pack-objects: rename want_.* to ignore_.*
pack-objects: support --blob-max-bytes
fetch-pack: support excluding large blobs
fetch: refactor calculation of remote list
fetch: support excluding large blobs
clone: support excluding large blobs
clone: configure blobmaxbytes in created repos
unpack-trees: batch fetching of missing blobs
fetch-pack: restore save_commit_buffer after use
Documentation/git-pack-objects.txt | 12 +-
Documentation/technical/pack-protocol.txt | 9 +
Documentation/technical/protocol-capabilities.txt | 7 +
Documentation/technical/repository-version.txt | 12 +
Makefile | 1 +
builtin/cat-file.c | 2 +
builtin/clone.c | 24 +-
builtin/fetch-pack.c | 21 ++
builtin/fetch.c | 36 ++-
builtin/fsck.c | 26 +-
builtin/gc.c | 3 +
builtin/index-pack.c | 113 ++++---
builtin/pack-objects.c | 97 ++++--
builtin/prune.c | 7 +
builtin/repack.c | 7 +-
builtin/rev-list.c | 13 +
cache.h | 13 +-
connected.c | 1 +
environment.c | 1 +
fetch-object.c | 45 +++
fetch-object.h | 11 +
fetch-pack.c | 23 +-
fetch-pack.h | 3 +
list-objects.c | 16 +-
object.c | 2 +-
packfile.c | 77 ++++-
packfile.h | 13 +
remote-curl.c | 21 +-
remote.c | 2 +
remote.h | 2 +
revision.c | 33 ++-
revision.h | 5 +-
setup.c | 7 +-
sha1_file.c | 38 ++-
t/t0410-partial-clone.sh | 343 ++++++++++++++++++++++
t/t5300-pack-object.sh | 45 +++
t/t5500-fetch-pack.sh | 115 ++++++++
t/t5601-clone.sh | 101 +++++++
t/test-lib-functions.sh | 12 +
transport-helper.c | 4 +
transport.c | 18 ++
transport.h | 12 +
unpack-trees.c | 22 ++
upload-pack.c | 16 +-
44 files changed, 1278 insertions(+), 113 deletions(-)
create mode 100644 fetch-object.c
create mode 100644 fetch-object.h
create mode 100755 t/t0410-partial-clone.sh
--
2.14.2.822.g60be5d43e6-goog