git@vger.kernel.org mailing list mirror (one of many)
From: Ben Peart <peartben@gmail.com>
To: Jonathan Nieder <jrnieder@gmail.com>
Cc: Jonathan Tan <jonathantanmy@google.com>,
	Junio C Hamano <gitster@pobox.com>,
	git@vger.kernel.org, christian.couder@gmail.com
Subject: Re: Partial clone design (with connectivity check for locally-created objects)
Date: Tue, 8 Aug 2017 10:18:43 -0400
Message-ID: <12fbcba3-1b31-2240-a330-e7cc11820f4a@gmail.com>
In-Reply-To: <20170807192151.GX13924@aiede.mtv.corp.google.com>



On 8/7/2017 3:21 PM, Jonathan Nieder wrote:
> Hi,
> 
> Ben Peart wrote:
>>> On Fri, 04 Aug 2017 15:51:08 -0700
>>> Junio C Hamano <gitster@pobox.com> wrote:
>>>> Jonathan Tan <jonathantanmy@google.com> writes:
> 
>>>>> "Imported" objects must be in a packfile that has a "<pack name>.remote"
>>>>> file with arbitrary text (similar to the ".keep" file). They come from
>>>>> clones, fetches, and the object loader (see below).
>>>>> ...
>>>>>
>>>>> A "homegrown" object is valid if each object it references:
>>>>>   1. is a "homegrown" object,
>>>>>   2. is an "imported" object, or
>>>>>   3. is referenced by an "imported" object.
>>>>
>>>> Overall it captures what was discussed, and I think it is a good
>>>> start.
>>
>> I missed the offline discussion and so am trying to piece together
>> what this latest design is trying to do.  Please let me know if I'm
>> not understanding something correctly.
> 
> I believe
> https://public-inbox.org/git/cover.1501532294.git.jonathantanmy@google.com/
> and the surrounding thread (especially
> https://public-inbox.org/git/xmqqefsudjqk.fsf@gitster.mtv.corp.google.com/)
> is the discussion Junio is referring to.
> 
> [...]
>> This segmentation is what is driving the need for the object loader
>> to build a new local pack file for every command that has to fetch a
>> missing object.  For example, we can't just write a tree object from
>> a "partial" clone into the loose object store as we have no way for
>> fsck to treat them differently and ignore any missing objects
>> referenced by that tree object.
> 
> That's related and how it got lumped into this proposal, but it's not
> the only motivation.
> 
> Other aspects:
> 
>   1. using pack files instead of loose objects means we can use deltas.
>      This is the primary motivation.
> 
>   2. pack files can use reachability bitmaps (I realize there are
>      obstacles to getting benefit out of this because git's bitmap
>      format currently requires a pack to be self-contained, but I
>      thought it was worth mentioning for completeness).
> 
>   3. existing git servers are oriented around pack files; they can
>      more cheaply serve objects from pack files in pack format,
>      including reusing deltas from them.
> 
>   4. file systems cope better with a few large files than many small
>      files
> 
> [...]
>> We all know that git doesn't scale well with a lot of pack files as
>> it has to do a linear search through all the pack files when
>> attempting to find an object.  I can see that very quickly, there
>> would be a lot of pack files generated and with gc ignoring
>> "partial" pack files, this would never get corrected.
> 
> Yes, that's an important point.  Regardless of this proposal, we need
> to get more aggressive about concatenating pack files (e.g. by
> implementing exponential rollup in "git gc --auto").
> 
>> In our usage scenarios, _all_ of the objects come from "partial"
>> clones so all of our objects would end up in a series of "partial"
>> pack files and would have pretty poor performance as a result.
> 
> Can you say more about this?  Why would the pack files (or loose
> objects, for that matter) never end up being consolidated into few
> pack files?
> 

Our initial clone is very sparse: we pull down only the commit we are
about to check out and none of the blobs.  All missing objects are then
downloaded on demand (and, in this proposal, would end up in a "partial"
pack file).  For performance reasons, we also (by default) download a
server-computed pack file of commits and trees to pre-populate the local
cache.

Without modification, fsck, repack, prune, and gc will each cause every
object in the repo to be downloaded.  We punted for now and simply block
those commands, but eventually they need to be aware of missing objects
so that they do not trigger those downloads.  Jonathan is already working
on this for fsck in another patch series.
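
To make the consolidation idea concrete, here is a rough sketch of one
possible reading of the "exponential rollup" mentioned above.  It is not
what "git gc --auto" does today; the function name, the factor of 2, and
the rule itself are assumptions made only for illustration.  The idea is
to merge the smallest packs until every remaining pack is at least
`factor` times the combined size of everything smaller than it.

```python
from typing import List, Tuple


def plan_rollup(pack_sizes: List[int], factor: int = 2) -> Tuple[List[int], List[int]]:
    """Split pack sizes into (packs to merge, packs to keep).

    Hypothetical helper: sizes are in arbitrary units (e.g. object counts).
    """
    sizes = sorted(pack_sizes)
    split = 0          # number of smallest packs that should be merged
    smaller_total = 0  # combined size of all packs smaller than the current one
    for i, size in enumerate(sizes):
        # A pack that is not at least `factor` times everything below it
        # breaks the progression, so it and all smaller packs get merged.
        if size < factor * smaller_total:
            split = i + 1
        smaller_total += size
    return sizes[:split], sizes[split:]


to_merge, to_keep = plan_rollup([1, 1, 1, 1, 100])
print(to_merge, to_keep)
```

With many small on-demand packs next to one large pack, the small ones
are rolled into a single pack and the large one is left alone, so the
number of packs to search stays small without rewriting the big pack
each time.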

> [...]
>> That thinking did lead me back to wondering again if we could live
>> with a repo specific flag.  If any clone/fetch was "partial" the
>> flag is set and fsck ignores missing objects whether they came from a
>> "partial" remote or not.
>>
>> I'll admit it isn't as robust if someone is mixing and matching
>> remotes from different servers some of which are partial and some of
>> which are not.  I'm not sure how often that would actually happen
>> but I _am_ certain a single repo specific flag is a _much_ simpler
>> model than anything else we've come up with so far.
> 
> The primary motivation in this thread is locally-created objects, not
> objects obtained from other remotes.  Objects obtained from other
> remotes are more of an edge case.
> 

Thank you - that helps me better understand the requirements of the
problem we're trying to solve.  In short, what we really need is a way
to identify locally created objects so that fsck can do a complete
connectivity check on them.  I'll have to think about a good way to do
that - we've talked about a few, but each has a different set of
trade-offs and none of them is great (yet :)).
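
As a minimal sketch of the connectivity rule quoted at the top of this
message (a "homegrown" object is valid if everything it references is
homegrown, imported, or referenced by an imported object), assuming we
already had some way to tell the two kinds of object apart: the
dictionaries and the function name below are hypothetical and only model
object ids and their direct references, not real git data structures.

```python
from typing import Dict, List, Set


def invalid_homegrown(homegrown: Dict[str, List[str]],
                      imported: Dict[str, List[str]]) -> Set[str]:
    """Return ids of homegrown objects that fail the quoted validity rule."""
    # Rule 3: anything referenced by an imported object is allowed to be
    # missing locally, since the remote is expected to be able to supply it.
    promised = {ref for refs in imported.values() for ref in refs}
    bad = set()
    for oid, refs in homegrown.items():
        for ref in refs:
            if ref in homegrown or ref in imported or ref in promised:
                continue  # covered by rule 1, 2, or 3
            bad.add(oid)
            break
    return bad


# c1/t1 came from the remote; c2/t2 were created locally.
imported = {"c1": ["t1"], "t1": ["b1"]}
homegrown = {"c2": ["c1", "t2"], "t2": ["b1", "b9"]}
print(invalid_homegrown(homegrown, imported))
```

Here "b1" is missing locally but referenced by the imported tree "t1",
so it is fine; "b9" is referenced only by the locally created "t2" and
is covered by none of the three rules, so "t2" is reported.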

> Thanks for your thoughtful comments.
> 
> Jonathan
> 

Thread overview: 18+ messages
2017-08-04 21:51 Partial clone design (with connectivity check for locally-created objects) Jonathan Tan
2017-08-04 22:51 ` Junio C Hamano
2017-08-05  0:21   ` Jonathan Tan
2017-08-07 19:12     ` Ben Peart
2017-08-07 19:21       ` Jonathan Nieder
2017-08-08 14:18         ` Ben Peart [this message]
2017-08-07 19:41       ` Junio C Hamano
2017-08-08 16:45         ` Ben Peart
2017-08-08 17:03           ` Jonathan Nieder
2017-08-07 23:10       ` Jonathan Tan
2017-08-16  0:32 ` [RFC PATCH] Updated "imported object" design Jonathan Tan
2017-08-16 20:32   ` Junio C Hamano
2017-08-16 21:35     ` Jonathan Tan
2017-08-17 20:50       ` Ben Peart
2017-08-17 21:39         ` Jonathan Tan
2017-08-18 14:18           ` Ben Peart
2017-08-18 23:33             ` Jonathan Tan
2017-08-17 20:07   ` Ben Peart
