Subject: Re: How hard would it be to implement sparse fetching/pulling?
To: Vitaly Arbuzov, git@vger.kernel.org
From: Jeff Hostetler
Date: Thu, 30 Nov 2017 09:24:08 -0500

On 11/29/2017 10:16 PM, Vitaly Arbuzov wrote:
> Hi guys,
>
> I'm looking for ways to improve fetch/pull/clone time for large git
> (mono)repositories with unrelated source trees (that span across
> multiple services).
> I've found sparse checkout approach appealing and helpful for most of
> client-side operations (e.g. status, reset, commit, etc.)
> The problem is that there is no feature like sparse fetch/pull in git,
> this means that ALL objects in unrelated trees are always fetched.
> It may take a lot of time for large repositories and results in some
> practical scalability limits for git.
> This forced some large companies like Facebook and Google to move to
> Mercurial as they were unable to improve client-side experience with
> git while Microsoft has developed GVFS, which seems to be a step back
> to CVCS world.
>
> I want to get feedback (from more experienced git users than I am)
> on what it would take to implement sparse fetching/pulling.
> (Downloading only objects related to the sparse-checkout list)
> Are there any issues with missing hashes?
> Are there any fundamental problems why it can't be done?
> Can we get away with only client-side changes or would it require
> special features on the server side?
>
> If we had such a feature then all we would need on top is a separate
> tool that builds the right "sparse" scope for the workspace based on
> paths that developer wants to work on.
>
> In the world where more and more companies are moving towards large
> monorepos this improvement would provide a good way of scaling git to
> meet this demand.
>
> PS. Please don't advise to split things up, as there are some good
> reasons why many companies decide to keep their code in the monorepo,
> which you can easily find online. So let's keep that part out of
> scope.
>
> -Vitaly
>

This work is in progress now. A short summary of the current
parts 1, 2, and 3 can be found in [1].
> * jh/object-filtering (2017-11-22) 6 commits
> * jh/fsck-promisors (2017-11-22) 10 commits
> * jh/partial-clone (2017-11-22) 14 commits

[1] https://public-inbox.org/git/xmqq1skh6fyz.fsf@gitster.mtv.corp.google.com/T/

I have a branch that contains V5 of all 3 parts:
https://github.com/jeffhostetler/git/tree/core/pc5_p3

This is a WIP, so there are some rough edges.
I hope to have a V6 out before the weekend with some
bug fixes and cleanup.

Please give it a try and see if it fits your needs.
Currently, there are filter methods to filter all blobs,
all large blobs, and one to match a sparse-checkout
specification.

Let me know if you have any questions or problems.

Thanks,
Jeff
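
The three filter methods mentioned above can be pictured with a small
conceptual model in Python. This is only a sketch under stated
assumptions, not code from the branch: the names Blob, omit_all_blobs,
omit_large_blobs, match_sparse_prefixes and select are hypothetical,
the "smaller than the limit" size rule is a choice made for this model,
and directory-prefix matching stands in for the real sparse-checkout
pattern syntax, which is richer.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Blob:
    """Hypothetical stand-in for a blob object known to the server."""
    path: str  # path of the blob within the tree
    size: int  # size in bytes


# A filter returns True when the blob should be sent to the client.
BlobFilter = Callable[[Blob], bool]


def omit_all_blobs() -> BlobFilter:
    """Model of 'filter all blobs': no blob is sent up front."""
    return lambda blob: False


def omit_large_blobs(limit: int) -> BlobFilter:
    """Model of 'filter all large blobs': send only blobs smaller than
    `limit` bytes (the exact boundary is an assumption of this sketch)."""
    return lambda blob: blob.size < limit


def match_sparse_prefixes(prefixes: Iterable[str]) -> BlobFilter:
    """Model of 'match a sparse-checkout specification', simplified to
    plain directory prefixes instead of real sparse-checkout patterns."""
    wanted = tuple(prefixes)
    return lambda blob: blob.path.startswith(wanted)


def select(blobs: Iterable[Blob], keep: BlobFilter) -> List[str]:
    """Return the paths of the blobs that pass the filter, in order."""
    return [blob.path for blob in blobs if keep(blob)]


# Made-up sample data for illustration only.
SAMPLE = [
    Blob("services/a/main.c", 2_000),
    Blob("services/b/data.bin", 5_000_000),
    Blob("docs/README", 300),
]

if __name__ == "__main__":
    print(select(SAMPLE, omit_all_blobs()))
    print(select(SAMPLE, omit_large_blobs(1_000_000)))
    print(select(SAMPLE, match_sparse_prefixes(["services/a/"])))
```

In this model every omitted blob is one the client would have to
request later, on demand, when it is actually needed.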