Re: [RFC PATCH 1/3] promised-blob, fsck: introduce promised blobs

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

From: Jonathan Nieder <jrnieder@gmail.com>
To: Jeff Hostetler <git@jeffhostetler.com>
Cc: Jonathan Tan <jonathantanmy@google.com>, git@vger.kernel.org
Subject: Re: [RFC PATCH 1/3] promised-blob, fsck: introduce promised blobs
Date: Wed, 12 Jul 2017 12:28:31 -0700	[thread overview]
Message-ID: <20170712192831.GG93855@aiede.mtv.corp.google.com> (raw)
In-Reply-To: <890a36fc-be16-83bc-fec6-94e21d0f7d0a@jeffhostetler.com>

Hi,

Jeff Hostetler wrote:

> My primary concern is scale and managing the list of objects over time.
[...]
>                                                                  For
> example, on the Windows repo we have (conservatively) 100M+ blobs (and
> growing).  Assuming 28 bytes per, gives a 2.8GB list to be manipulated.
>
> If I understand your proposal, newly-omitted blobs would need to be
> merged into the promised-blob list after each fetch.  The fetch itself
> may not have that many new entries, but inserting them into the existing
> list will be slow.

This is a good point.  An alternative would be to avoid storing the
list and instead use a repository extension that treats all missing
blobs in the repository similarly to promised blobs (with a weaker
promise, where the server is allowed to return 404).  The downsides:

- blob sizes are not available without an additional request, e.g. for
  directory listings

- the semantics of has_object_file become more complicated.  If it
  is allowed to return false for omitted blobs, then callers have to
  be audited to tolerate that and try to look up the blob anyway.
  If it has to contact the server to find out whether an omitted blob
  is available, then callers have to be audited to skip this expensive
  operation when possible.

- similarly, the semantics of sha1_object_info{,_extended} become more
  complicated.  If they are allowed to return -1 for omitted blobs,
  then callers have to be audited to handle that. If they have to
  contact the server to find the object type and size, it becomes
  expensive in a way that affects callers.

- causes futile repeated requests to the server for objects that don't
  exist.  Caching negative lookups is fussy because a later push could
  cause those objects to exist --- though it should be possible for
  fetches to invalidate entries in such a cache using the list of
  promised blobs sent by the server.

[...]
> In such a "sparse clone", it would be nice to omit unneeded tree objects
> in addition to just blobs.   I say that because we are finding with GVFS
> on the Windows repo, that even with commits-and-trees-only filtering,
> the number of tree objects is overwhelming.  So I'm also concerned about
> limiting the list to just blobs.  If we need to have this list, it
> should be able to contain any object.  (Suggesting having an object type
> in the entry.)

Would it work to have a separate lists of blobs and trees (either in
separate files or in the same file)?

One option would be to add a version number / magic string to the start
of the file.  That would allow making format changes later without a
proliferation of distinct repository extensions.

[...]
> I assume that we'll also need a promised-blob.lock file to control
> access during list manipulation.  This is already a sore spot with the
> index; I'd hate to create another one.

Can you say more about this concern?  My experience with concurrent
fetches has already not been great (since one fetch process is not
aware of what the other has fetched) --- is your concern that the
promised-blob facility would affect pushes as well some day?

> I also have to wonder about the need to have a complete list of omitted
> blobs up front.  It may be better to just relax the consistency checks
> and assume a missing blob is "intentionally missing" rather than
> indicating a corruption somewhere.

We've discussed this previously on list and gone back and forth. :)

>                                     And then let the client do a later
> round-trip to either demand-load the object -or- demand-load the
> existence/size info if/when it really matters.

The cost of demand-loading this existence/size information is what
ultimately convinced me of this approach.

But I can see how the tradeoffs differ between the omit-large-blobs
case and the omit-all-blobs case.  We might end up having to support
both modes. :(

> Maybe we should add a verb to your new fetch-blob endpoint to just get
> the size of one or more objects to help with this.

No objections from me, though we don't need it yet.

Thanks,
Jonathan

next prev parent reply	other threads:[~2017-07-12 19:28 UTC|newest]

Thread overview: 45+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-07-11 19:48 [RFC PATCH 0/3] Partial clone: promised blobs (formerly "missing blobs") Jonathan Tan
2017-07-11 19:48 ` [RFC PATCH 1/3] promised-blob, fsck: introduce promised blobs Jonathan Tan
2017-07-11 22:02   ` Stefan Beller
2017-07-19 23:37     ` Jonathan Tan
2017-07-12 17:29   ` Jeff Hostetler
2017-07-12 19:28     ` Jonathan Nieder [this message]
2017-07-13 14:48       ` Jeff Hostetler
2017-07-13 15:05         ` Jeff Hostetler
2017-07-13 19:39     ` Jonathan Tan
2017-07-14 20:03       ` Jeff Hostetler
2017-07-14 21:30         ` Jonathan Nieder
2017-07-11 19:48 ` [RFC PATCH 2/3] sha1-array: support appending unsigned char hash Jonathan Tan
2017-07-11 22:06   ` Stefan Beller
2017-07-19 23:56     ` Jonathan Tan
2017-07-20  0:06       ` Stefan Beller
2017-07-11 19:48 ` [RFC PATCH 3/3] sha1_file: add promised blob hook support Jonathan Tan
2017-07-11 22:38   ` Stefan Beller
2017-07-12 17:40   ` Ben Peart
2017-07-12 20:38     ` Jonathan Nieder
2017-07-16 15:23 ` [RFC PATCH 0/3] Partial clone: promised blobs (formerly "missing blobs") Philip Oakley
2017-07-17 17:43   ` Ben Peart
2017-07-25 20:48     ` Philip Oakley
2017-07-17 18:03   ` Jonathan Nieder
2017-07-29 12:51     ` Philip Oakley
2017-07-20  0:21 ` [RFC PATCH v2 0/4] Partial clone: promised objects (not only blobs) Jonathan Tan
2017-07-20  0:21 ` [RFC PATCH v2 1/4] object: remove "used" field from struct object Jonathan Tan
2017-07-20  0:36   ` Stefan Beller
2017-07-20  0:55     ` Jonathan Tan
2017-07-20 17:44       ` Ben Peart
2017-07-20 21:20   ` Junio C Hamano
2017-07-20  0:21 ` [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects Jonathan Tan
2017-07-20 18:07   ` Stefan Beller
2017-07-20 19:17     ` Jonathan Tan
2017-07-20 19:58   ` Ben Peart
2017-07-20 21:13     ` Jonathan Tan
2017-07-21 16:24       ` Ben Peart
2017-07-21 20:33         ` Jonathan Tan
2017-07-25 15:10           ` Ben Peart
2017-07-29 13:26             ` Philip Oakley
2017-07-20  0:21 ` [RFC PATCH v2 3/4] sha1-array: support appending unsigned char hash Jonathan Tan
2017-07-20  0:21 ` [RFC PATCH v2 4/4] sha1_file: support promised object hook Jonathan Tan
2017-07-20 18:23   ` Stefan Beller
2017-07-20 20:58     ` Ben Peart
2017-07-20 21:18       ` Jonathan Tan
2017-07-21 16:27         ` Ben Peart

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170712192831.GG93855@aiede.mtv.corp.google.com \
    --to=jrnieder@gmail.com \
    --cc=git@jeffhostetler.com \
    --cc=git@vger.kernel.org \
    --cc=jonathantanmy@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).