git@vger.kernel.org mailing list mirror (one of many)
From: Jonathan Tan <jonathantanmy@google.com>
To: Ben Peart <peartben@gmail.com>
Cc: git@vger.kernel.org, jrnieder@gmail.com, sbeller@google.com,
	git@jeffhostetler.com, philipoakley@iee.org
Subject: Re: [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects
Date: Thu, 20 Jul 2017 14:13:42 -0700
Message-ID: <20170720141342.6a89aace@twelve2.svl.corp.google.com>
In-Reply-To: <d8beb406-6d03-d893-4f37-83a7bdd6b9ac@gmail.com>

On Thu, 20 Jul 2017 15:58:51 -0400
Ben Peart <peartben@gmail.com> wrote:

> On 7/19/2017 8:21 PM, Jonathan Tan wrote:
> > Currently, Git does not support repos with very large numbers of objects
> > or repos that wish to minimize manipulation of certain blobs (for
> > example, because they are very large) very well, even if the user
> > operates mostly on part of the repo, because Git is designed on the
> > assumption that every referenced object is available somewhere in the
> > repo storage.
> > 
> 
> Great to see this idea making progress. Making git able to gracefully 
> handle partial clones (beyond the existing shallow clone support) is a 
> key piece of dealing with very large objects and repos.

Thanks.

> > As a first step to reducing this problem, introduce the concept of
> > promised objects. Each Git repo can contain a list of promised objects
> > and their sizes (if blobs) at $GIT_DIR/objects/promised. This patch
> > contains functions to query them; functions for creating and modifying
> > that file will be introduced in later patches.
> 
> If I'm reading this patch correctly, for a repo to successfully pass 
> "git fsck" either the object or a promise must exist for everything fsck 
> checks.  From the documentation for fsck it says "git fsck defaults to 
> using the index file, all SHA-1 references in refs namespace, and all 
> reflogs (unless --no-reflogs is given) as heads." Doesn't this then 
> imply objects or promises must exist for all objects referenced by any 
> of these locations?
> 
> We're currently in the hundreds of millions of objects on some of our 
> repos so even downloading the promises for all the objects in the index 
> is unreasonable as it is gigabytes of data and growing.

For the index to contain all the files, the repo must already have
downloaded all the trees for HEAD (at least). The trees collectively
contain entries for all the relevant blobs. We need one promise for each
blob, and the size of a promise is comparable to the size of a tree
entry, so the size (of download and storage) needed would be just double
what we would need if we didn't need promises. This is still only
linear growth, unless you have found that the absolute numbers are too
large?

Also, if the index is ever changed to not have one entry for every file,
we also wouldn't need one promise for every file.

> I think we should have a flag (off by default) that enables someone to 
> say that promised objects are optional. If the flag is set, 
> "is_promised_object" will return success and pass the OBJ_ANY type and a 
> size of -1.
> 
> Nothing today is using the size and in the two places where the object 
> type is being checked for consistency (fsck_cache_tree and 
> fsck_handle_ref) the test can add a test for OBJ_ANY as well.
> 
> This will enable very large numbers of objects to be omitted from the 
> clone without triggering a download of the corresponding number of 
> promised objects.

Eventually I plan to use the size when implementing parameters for
history-searching commands (e.g. "git log -S"), but it's true that
that's in the future.

Allowing promised objects to be optional would indeed solve the issue of
downloading too many promises. It would make the code more complicated,
but I'm not sure by how much.

For example, in this fsck patch, the easiest way I could think of to
have promised objects was to introduce a 3rd state, called "promised",
of "struct object" - one in which the type is known, but we don't have
access to the full "struct commit" or equivalent. And thus fsck could
assume that if the "struct object" is "parsed" or "promised", the type
is known. Having optional promised objects would require that we let
this "promised" state have a type of OBJ_UNKNOWN (or something like
that) - maybe that would be fine, but I haven't looked into this in
detail.

> > A repository that is missing an object but has that object promised is not
> > considered to be in error, so also teach fsck this. As part of doing
> > this, object.{h,c} has been modified to generate "struct object" based
> > on only the information available to promised objects, without requiring
> > the object itself.
> 
> In your work on this, did you investigate if there are other commands 
> (ie repack/gc) that will need to learn about promised objects? Have you 
> had a chance (or have plans) to hack up the test suite so that it runs 
> all tests with promised objects and see what (if anything) breaks?

In one of the subsequent patches, I tried to ensure that all
object-reading functions in sha1_file.c work, at least to some extent
(albeit slowly), in the presence of promised objects - that would cover
the functionality of the other commands. As for hacking up the test
suite to run with
promised objects, that would be ideal, but I haven't figured out how to
do that yet.

