From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 25 Aug 2017 14:23:00 -0700
From: Jonathan Tan
To: Christian Couder
Cc: Stefan Beller, "git@vger.kernel.org", Junio C Hamano, Jeff King,
 Ben Peart, Nguyen Thai Ngoc Duy, Mike Hommey, Lars Schneider,
 Eric Wong, Christian Couder
Subject: Re: [PATCH v5 35/40] Add Documentation/technical/external-odb.txt
Message-ID: <20170825142300.01b15d6b@twelve2.svl.corp.google.com>
References: <20170803091926.1755-1-chriscool@tuxfamily.org>
 <20170803091926.1755-36-chriscool@tuxfamily.org>
X-Mailing-List: git@vger.kernel.org

On Fri, 25 Aug 2017 08:14:08 +0200 Christian Couder wrote:

> As Git is used by more and more by people having different needs, I
> think it is not realistic to expect that we can optimize its object
> storage for all these different needs. So a better strategy is to just
> let them store objects in external stores.

[snip]

> About these many use cases, I gave the "really big binary files"
> example which is why Git LFS exists (and which GitLab is interested in
> better solving), and the "really big number of files that are fetched
> only as needed" example which Microsoft is interested in solving. I
> could also imagine that some people have both big text files and big
> binary files in which case the "core.bigfilethreshold" might not work
> well, or that some people already have blobs in some different stores
> (like HTTP servers, Docker registries, artifact stores, ...) and want
> to fetch them from there as much as possible.

Thanks for explaining the use cases - this makes sense, especially the
last one, which motivates the different modes for the "get" command
(return raw bytes vs. populating the Git repository with loose/packed
objects).

> And then letting people
> use different stores can make clones or fetches restartable which
> would solve another problem people have long been complaining about...

This is unrelated to the rest of my e-mail, but out of curiosity, how
would a different store make clones or fetches restartable? Do you mean
that Git would invoke a "fetch" command through the ODB protocol instead
of using its own native protocol?

> >> +Furthermore many improvements that are dependent on specific setups
> >> +could be implemented in the way Git objects are managed if it was
> >> +possible to customize how the Git objects are handled. For example a
> >> +restartable clone using the bundle mechanism has often been requested,
> >> +but implementing that would go against the current strict rules under
> >> +which the Git objects are currently handled.
> >
> > So in this example, you would use todays git-clone to obtain a small version
> > of the repo and then obtain other objects later?
>
> The problem with explaining how it would work is that the
> --initial-refspec option is added to git clone later in the patch
> series. And there could be changes in the later part of the patch
> series. So I don't want to promise or explain too much here.
> But maybe I could add another patch to better explain that at the end
> of the series.

Such an explanation, in whatever form (patch or e-mail), would be great,
because I'm not sure of the interaction between fetches and the
connectivity check.

The approach I have taken in my own patches [1] is to (1) declare that
if a lazy remote supplies an object, it promises to have everything
referred to by that object, and (2) we thus only need to check the
objects not obtained from the lazy remote. Translated to the ODB world,
(1) is possible in the Microsoft case and is trivial in all the cases
where the ODB provides only blobs (since blobs don't refer to any other
object), and for (2), a "list" command should suffice.

One constraint is that we do not want to obtain (from the remote) or
store a separate list of what it has, to avoid the overhead. (I saw the
--initial-refspec approach - that would not work if we want to avoid
the overhead.)

For fetches, we remember the objects obtained from that specific remote
by adding a special file, name to be determined (I used ".imported" in
[1]). (The same method is used to note objects lazily downloaded.)
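To make the ".imported" idea a bit more concrete, here is a rough sketch
of how the set of exempt objects could be collected; the sidecar-file
layout and the use of "git verify-pack -v" are assumptions for
illustration, not what my patches literally do:

#!/usr/bin/env python3
# Rough sketch only: collect the IDs of objects that live in packs marked
# with a hypothetical ".imported" sidecar file, so that a connectivity
# check could treat them (and, by the promise in (1), everything they
# refer to) as satisfied. Assumes a non-bare repo and that the marker
# sits next to the pack in .git/objects/pack.
import subprocess
from pathlib import Path

def imported_object_ids(repo="."):
    pack_dir = Path(repo) / ".git" / "objects" / "pack"
    oids = set()
    for marker in pack_dir.glob("pack-*.imported"):
        idx = marker.with_suffix(".idx")
        if not idx.exists():
            continue
        # "git verify-pack -v" prints one line per object in the pack;
        # the first two fields are the object ID and its type.
        out = subprocess.run(["git", "verify-pack", "-v", str(idx)],
                             capture_output=True, text=True,
                             check=True).stdout
        for line in out.splitlines():
            fields = line.split()
            if len(fields) >= 2 and fields[1] in ("commit", "tree",
                                                  "blob", "tag"):
                oids.add(fields[0])
    return oids

if __name__ == "__main__":
    exempt = imported_object_ids()
    print(f"{len(exempt)} objects would be exempt from the connectivity check")

The connectivity check could then stop traversing as soon as it reaches
an object in that set, relying on the promise in (1) for everything
behind it.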
The repack command understands the difference between these two types
of objects (patches for this are in progress).

I'm not sure if this can be translated to the ODB world. The ODB can
declare a special capability that fetch sends to the server in order to
inform the server that it can exclude certain objects, and fetch can
inform the ODB of the packfiles that it has written, but I'm not sure
how the ODB can "remember" what it has. The ODB could mark such packs
with ".managed" to note that they are managed by that ODB, so Git
shouldn't touch them, but this means (for example) that Git can't GC
them (and it seems also quite contradictory for an ODB to manage Git
packfiles).

[1] https://public-inbox.org/git/20170804145113.5ceafafa@twelve2.svl.corp.google.com/
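
P.S. To make the ".managed" marking above a bit more concrete, here is a
rough sketch; the sidecar-file name, its location, and the way fetch
would report pack names to the ODB are assumptions for illustration, not
a real design:

#!/usr/bin/env python3
# Rough sketch only: once fetch tells the ODB which packfiles it wrote on
# the ODB's behalf, the ODB could drop a ".managed" sidecar marker next
# to each pack so that later operations know to leave those packs alone.
from pathlib import Path

PACK_DIR = Path(".git/objects/pack")  # assumes a non-bare repository

def mark_managed(pack_names):
    """Record that the given packs (by basename) are ODB-managed."""
    for name in pack_names:
        (PACK_DIR / (name + ".managed")).touch()

def unmanaged_packs():
    """Packs that plain Git would still be free to repack or GC."""
    return [p for p in PACK_DIR.glob("pack-*.pack")
            if not p.with_suffix(".managed").exists()]

if __name__ == "__main__":
    # Pretend fetch reported that it wrote this pack for the ODB.
    mark_managed(["pack-0123456789abcdef0123456789abcdef01234567"])
    print("packs eligible for normal GC:",
          [p.name for p in unmanaged_packs()])

A GC or repack wrapper could then restrict itself to unmanaged_packs(),
though, as said above, it still seems contradictory for an ODB to be
managing Git packfiles.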