git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jeff King <peff@peff.net>
To: Jonathan Tan <jonathantanmy@google.com>
Cc: git@vger.kernel.org
Subject: Re: [PATCH 2/2] index-pack: prefetch missing REF_DELTA bases
Date: Wed, 15 May 2019 19:16:18 -0400	[thread overview]
Message-ID: <20190515231617.GA1395@sigill.intra.peff.net> (raw)
In-Reply-To: <4fcaa4481b5fd2a76aa21263f997e00913db0e0f.1557868134.git.jonathantanmy@google.com>

On Tue, May 14, 2019 at 02:10:55PM -0700, Jonathan Tan wrote:

> Support for lazy fetching should still generally be turned off in
> index-pack because it is used as part of the lazy fetching process
> itself (if not, infinite loops may occur), but we do need to fetch the
> REF_DELTA bases. (When fetching REF_DELTA bases, it is unlikely that
> those are REF_DELTA themselves, because we do not send "have" when
> making such fetches.)

I agree that the current implementation (and probably any sane
implementation) would not send us a delta if we have not provided any
haves. But this does mean that a malicious server could send a client
into an infinite loop.

Pretty unlikely, but should we put some kind of circuit-breaker into the
client to ensure this?

> To resolve this, prefetch all missing REF_DELTA bases before attempting
> to resolve them. This both ensures that all bases are attempted to be
> fetched, and ensures that we make only one request per index-pack
> invocation, and not one request per missing object.

Ah, but now things get more tricky.

You are assuming that the server does not ever send a REF_DELTA unless
the base object is not present in the pack (it would use OFS_DELTA
instead). If we imagine a server which did, then there are two
implications:

  1. We might pre-fetch a full copy of an object that we don't need.
     It's just that it's stored as a delta in the pack which we are
     currently indexing.

  2. If we pre-fetch multiple objects, some of them may be REF_DELTAs
     against each other, leading to an infinite loop.

Off the top of my head, I am pretty sure your assumption holds for all
versions of Git that support delta-base-offset[1]. But that feels a lot
less certain to me. I could imagine an alternate server implementation,
for example, that is gluing together packs and does not try hard to
order the base before the delta, which would require it to use REF_DELTA
instead of OFS_DELTA.

That's sort of contrived, but it does feel like we're introducing a
really subtle requirement on the server here, which might close off
options to us in the future.

Unfortunately, I can't really think of a way for an existing client to
solve this without doing individual fetches for each REF_DELTA as we
encounter a need for it. In fact, even then we may lose if our ordering
is unlucky. E.g., imagine we have two packfile entries whose object ids
(which we don't know yet!) are X and Y, and they are stored as
REF_DELTAs with bases Z and X, respectively. Then either:

  1. We try to resolve X first. We fetch Z on-demand, and reconstruct X.
     We're lucky, because when we try to resolve Y, we see that we
     already have its base X.

  2. We try to resolve Y first. We fetch X on-demand, and reconstruct Y.
     We're unlucky; we then resolve X, but only after on-demand fetching
     Z and reconstructing it do we realize that we already had it.

So really, pre-fetching all of the REF_DELTAs just means we always hit
the unlucky case. But even with careful on-demand fetching, our worst
case is the same (and even worse in terms of latency).

I dunno. Maybe we should just ignore it. It's a fundamental issue with
partial clones that we're going to have to fetch extra junk here anyway,
because what the server _optimally_ would do is not send us deltas
against objects we don't have anyway. We just don't have an efficient
way to tell it what we do have.

If we're willing to modify the format, one thing we _could_ do is have
the server communicate the expectations for each base. I.e., introduce a
new THIN_DELTA type that behaves exactly as a REF_DELTA, but with the
extra 1-bit of knowledge that the server knows it is not including the
base in the pack. I'm not sure how painful that retro-fitting would be.
It would need at least a new capability and options to pack-objects and
index-pack. We might be tight on bits in the packfile type field.

-Peff

[1] Of course there are versions of Git that don't support
    delta-base-offset. They wouldn't support partial clones either, but
    it's possible to fetch into a partially-cloned repository from
    another remote. Given how old and rare such versions are, I think we
    can probably discount it entirely. I'm much more concerned about
    alternate implementations, or trying our hands for future
    pack-objects optimizations.

  parent reply	other threads:[~2019-05-15 23:25 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-05-14 21:10 [PATCH 0/2] Partial clone fix: handling received REF_DELTA Jonathan Tan
2019-05-14 21:10 ` [PATCH 1/2] t5616: refactor packfile replacement Jonathan Tan
2019-05-15  8:36   ` Johannes Schindelin
2019-05-15 18:22     ` Jonathan Tan
2019-05-14 21:10 ` [PATCH 2/2] index-pack: prefetch missing REF_DELTA bases Jonathan Tan
2019-05-15  8:46   ` Johannes Schindelin
2019-05-15 18:28     ` Jonathan Tan
2019-05-17 18:33       ` Johannes Schindelin
2019-05-15 23:16   ` Jeff King [this message]
2019-05-16  1:43     ` Junio C Hamano
2019-05-16  4:04       ` Jeff King
2019-05-16 18:26     ` Jonathan Tan
2019-05-16 21:12       ` Jeff King
2019-05-16 21:30         ` Jonathan Tan
2019-05-16 21:42           ` Jeff King
2019-05-16 23:15             ` Jonathan Tan
2019-05-17  1:09               ` Jeff King
2019-05-17  1:22                 ` Jeff King
2019-05-17  4:39                   ` Jeff King
2019-05-17  4:42                     ` Jeff King
2019-05-17  7:20                     ` Duy Nguyen
2019-05-17  8:55                       ` Jeff King
2019-05-18 11:39                         ` Duy Nguyen
2019-05-20 23:04                           ` Nicolas Pitre
2019-05-21 21:20                             ` Jeff King
2019-06-03 22:23   ` Jonathan Nieder

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190515231617.GA1395@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=git@vger.kernel.org \
    --cc=jonathantanmy@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).