From: Jonathan Tan
To: git@vger.kernel.org
Cc: Jonathan Tan, gitster@pobox.com, git@jeffhostetler.com,
    peartben@gmail.com, christian.couder@gmail.com
Subject: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches)
Date: Fri, 29 Sep 2017 13:11:36 -0700
X-Mailer: git-send-email 2.14.1.748.g20475d2c7

These patches are also available online:
https://github.com/jonathantanmy/git/commits/partialclone3

(I've announced it in another e-mail, but am now sending the patches to
the mailing list too.)

Here's an update of my work so far. Notable features:

 - These 18 patches allow a user to clone with --blob-max-bytes=,
   creating a partial clone that is automatically configured to lazily
   fetch missing objects from the origin. The local repo also has fsck
   working offline, and GC working (albeit only on locally created
   objects).
 - Cloning and fetching are currently only able to exclude blobs by a
   size threshold, but the local repository is already capable of
   fetching missing objects of any type.
   For example, if a repository with missing trees or commits is
   generated by any tool (for example, a future version of Git),
   current Git with my patches will still be able to operate on it,
   automatically fetching those missing trees and commits when needed.
 - Missing blobs are fetched all at once during checkout.

Jeff Hostetler has sent out some object-filtering patches [1] that are a
superset of the object-filtering functionality that I have (in the
pack-objects patches). I have gone for the minimal approach here, but if
his patches are merged, I'll update my patch set to use those.

[1] https://public-inbox.org/git/20170922203017.53986-6-git@jeffhostetler.com/

Demo
====

Obtain a repository.

    $ make prefix=$HOME/local install
    $ cd $HOME/tmp
    $ git clone https://github.com/git/git

Make it advertise the new feature and allow requests for arbitrary blobs.

    $ git -C git config uploadpack.advertiseblobmaxbytes 1
    $ git -C git config uploadpack.allowanysha1inwant 1

Perform the partial clone and check that it is indeed smaller. Specify
"file://" in order to test the partial clone mechanism. (If not, Git will
perform a local clone, which unselectively copies every object.)

    $ git clone --blob-max-bytes=0 "file://$(pwd)/git" git2
    $ git clone "file://$(pwd)/git" git3
    $ du -sh git2 git3
    85M     git2
    130M    git3

Observe that the new repo is automatically configured to fetch missing
objects from the original repo. Subsequent fetches will also be partial.

    $ cat git2/.git/config
    [core]
        repositoryformatversion = 1
        filemode = true
        bare = false
        logallrefupdates = true
    [remote "origin"]
        url = [snip]
        fetch = +refs/heads/*:refs/remotes/origin/*
        blobmaxbytes = 0
    [extensions]
        partialclone = origin
    [branch "master"]
        remote = origin
        merge = refs/heads/master

Design
======

Local repository layout
-----------------------

A repository declares its dependence on a *promisor remote* (a remote
that declares that it can serve certain objects when requested) by a
repository extension "partialclone".
`extensions.partialclone` must be set to the name of the remote
("origin" in the demo above).

A packfile can be annotated as originating from the promisor remote by
the existence of a "(packfile name).promisor" file with arbitrary
contents (similar to the ".keep" file). Whenever a promisor remote sends
an object, it declares that it can serve every object directly or
indirectly referenced by the sent object.

A promisor packfile is a packfile annotated with the ".promisor" file. A
promisor object is an object that the promisor remote is known to be
able to serve, because it is an object in a promisor packfile or
directly referred to by one.

(In the future, we might need to add ".promisor" support to loose
objects.)

Connectivity check and gc
-------------------------

The object walk done by the connectivity check (as used by fsck and
fetch) stops at all promisor objects.

The object walk done by gc also stops at all promisor objects. Only
non-promisor packfiles are deleted (if pack deletion is requested);
promisor packfiles are left alone. This maintains the distinction
between promisor packfiles and non-promisor packfiles. (In the future,
we might need to do something more sophisticated with promisor
packfiles.)

Fetching of missing objects
---------------------------

When `sha1_object_info_extended()` (or similar) is invoked, it will
automatically attempt to fetch a missing object from the promisor remote
if that object is not in the local repository. For efficiency, no check
is made as to whether that object is known to be a promisor object or
not.

This automatic fetching can be toggled on and off by the
`fetch_if_missing` global variable, and it is on by default.

The actual fetch is done through the fetch-pack/upload-pack protocol.
Right now, this uses the fact that upload-pack allows blob and tree
"want"s, and this incurs the overhead of the unnecessary ref
advertisement.
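
To make the lazy-fetch behaviour described so far concrete, here is a
toy model in shell. It is only a sketch and not the C implementation in
these patches: plain directories stand in for object stores, and the
names LOCAL_ODB, PROMISOR_ODB and read_object are made up for
illustration. Only the `fetch_if_missing` toggle and the ".promisor"
annotation are taken from the design above.

```shell
# Toy model only. LOCAL_ODB, PROMISOR_ODB and read_object are
# hypothetical names; none of them exist in Git.
LOCAL_ODB=${LOCAL_ODB:-./local-odb}           # stand-in for the local repo
PROMISOR_ODB=${PROMISOR_ODB:-./promisor-odb}  # stand-in for the promisor remote
FETCH_IF_MISSING=${FETCH_IF_MISSING:-1}       # mirrors fetch_if_missing, on by default

# read_object <id>: print an object's contents. If the object is absent
# locally and fetching is enabled, copy it from the promisor store first.
read_object() {
    id=$1
    if [ ! -f "$LOCAL_ODB/$id" ]; then
        # As in the design, no check is made as to whether <id> is a
        # known promisor object; the fetch is simply attempted.
        if [ "$FETCH_IF_MISSING" != 1 ] || [ ! -f "$PROMISOR_ODB/$id" ]; then
            echo "error: missing object $id" >&2
            return 1
        fi
        mkdir -p "$LOCAL_ODB" || return 1
        cp "$PROMISOR_ODB/$id" "$LOCAL_ODB/$id" || return 1
        # Stand-in for annotating the downloaded packfile with ".promisor".
        : >"$LOCAL_ODB/$id.promisor" || return 1
    fi
    cat "$LOCAL_ODB/$id"
}
```

With FETCH_IF_MISSING=0, read_object fails on an absent object instead
of consulting the promisor store, which is the toy equivalent of
switching `fetch_if_missing` off.
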
I hope that protocol v2 will allow us to declare that blob and tree
"want"s are allowed, and allow the client to declare that it does not
want the ref advertisement. All packfiles downloaded in this way are
annotated with ".promisor".

Fetching with `git fetch`
-------------------------

The fetch-pack/upload-pack protocol has also been extended to support
omission of blobs above a certain size. The client only allows this when
fetching from the promisor remote, and will annotate any packs received
from the promisor remote with ".promisor".

Jonathan Tan (18):
  fsck: introduce partialclone extension
  fsck: support refs pointing to promisor objects
  fsck: support referenced promisor objects
  fsck: support promisor objects as CLI argument
  index-pack: refactor writing of .keep files
  introduce fetch-object: fetch one promisor object
  sha1_file: support lazily fetching missing objects
  rev-list: support termination at promisor objects
  gc: do not repack promisor packfiles
  pack-objects: rename want_.* to ignore_.*
  pack-objects: support --blob-max-bytes
  fetch-pack: support excluding large blobs
  fetch: refactor calculation of remote list
  fetch: support excluding large blobs
  clone: support excluding large blobs
  clone: configure blobmaxbytes in created repos
  unpack-trees: batch fetching of missing blobs
  fetch-pack: restore save_commit_buffer after use

 Documentation/git-pack-objects.txt                |  12 +-
 Documentation/technical/pack-protocol.txt         |   9 +
 Documentation/technical/protocol-capabilities.txt |   7 +
 Documentation/technical/repository-version.txt    |  12 +
 Makefile                                          |   1 +
 builtin/cat-file.c                                |   2 +
 builtin/clone.c                                   |  24 +-
 builtin/fetch-pack.c                              |  21 ++
 builtin/fetch.c                                   |  36 ++-
 builtin/fsck.c                                    |  26 +-
 builtin/gc.c                                      |   3 +
 builtin/index-pack.c                              | 113 ++++---
 builtin/pack-objects.c                            |  97 ++++--
 builtin/prune.c                                   |   7 +
 builtin/repack.c                                  |   7 +-
 builtin/rev-list.c                                |  13 +
 cache.h                                           |  13 +-
 connected.c                                       |   1 +
 environment.c                                     |   1 +
 fetch-object.c                                    |  45 +++
 fetch-object.h                                    |  11 +
 fetch-pack.c                                      |  23 +-
 fetch-pack.h                                      |   3 +
 list-objects.c                                    |  16 +-
 object.c                                          |   2 +-
 packfile.c                                        |  77 ++++-
 packfile.h                                        |  13 +
 remote-curl.c                                     |  21 +-
 remote.c                                          |   2 +
 remote.h                                          |   2 +
 revision.c                                        |  33 ++-
 revision.h                                        |   5 +-
 setup.c                                           |   7 +-
 sha1_file.c                                       |  38 ++-
 t/t0410-partial-clone.sh                          | 343 ++++++++++++++++++++++
 t/t5300-pack-object.sh                            |  45 +++
 t/t5500-fetch-pack.sh                             | 115 ++++++++
 t/t5601-clone.sh                                  | 101 +++++++
 t/test-lib-functions.sh                           |  12 +
 transport-helper.c                                |   4 +
 transport.c                                       |  18 ++
 transport.h                                       |  12 +
 unpack-trees.c                                    |  22 ++
 upload-pack.c                                     |  16 +-
 44 files changed, 1278 insertions(+), 113 deletions(-)
 create mode 100644 fetch-object.c
 create mode 100644 fetch-object.h
 create mode 100755 t/t0410-partial-clone.sh

-- 
2.14.2.822.g60be5d43e6-goog