From: Johannes Schindelin <Johannes.Schindelin@gmx.de>
To: Jonathan Tan <jonathantanmy@google.com>
Cc: git@vger.kernel.org
Subject: Re: [PATCH 2/2] index-pack: prefetch missing REF_DELTA bases
Date: Wed, 15 May 2019 10:46:42 +0200 (DST)
Message-ID: <nycvar.QRO.7.76.6.1905151040240.44@tvgsbejvaqbjf.bet>
In-Reply-To: <4fcaa4481b5fd2a76aa21263f997e00913db0e0f.1557868134.git.jonathantanmy@google.com>

Hi Jonathan,

On Tue, 14 May 2019, Jonathan Tan wrote:

> When fetching, the client sends "have" commit IDs indicating that the
> server does not need to send any object referenced by those commits,
> reducing network I/O. When the client is a partial clone, the client
> still sends "have"s in this way, even if it does not have every object
> referenced by a commit it sent as "have".
>
> If a server omits such an object, it is fine: the client could lazily
> fetch that object before this fetch, and it can still do so after.
>
> The issue is when the server sends a thin pack containing an object that
> is a REF_DELTA against such a missing object: index-pack fails to fix
> the thin pack. When support for lazily fetching missing objects was
> added in 8b4c0103a9 ("sha1_file: support lazily fetching missing
> objects", 2017-12-08), support in index-pack was turned off in the
> belief that it accesses the repo only to do hash collision checks.
> However, this is not true: it also needs to access the repo to resolve
> REF_DELTA bases.
>
> Support for lazy fetching should still generally be turned off in
> index-pack because it is used as part of the lazy fetching process
> itself (if not, infinite loops may occur), but we do need to fetch the
> REF_DELTA bases. (When fetching REF_DELTA bases, it is unlikely that
> those are REF_DELTA themselves, because we do not send "have" when
> making such fetches.)
>
> To resolve this, prefetch all missing REF_DELTA bases before attempting
> to resolve them. This both ensures that all bases are attempted to be
> fetched, and ensures that we make only one request per index-pack
> invocation, and not one request per missing object.

Hmm. I wonder whether this can lead to *really* undesirable behavior, e.g.
with deep delta chains. The client might have to fetch the base of a
REF_DELTA object, only to have that base delivered in a thin pack as
*another* REF_DELTA object against yet another missing base, and so on,
with each link of the chain costing a round trip and killing performance.
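
To make the worry concrete, here is a minimal sketch of the worst case. It
is a toy model and not git code: count_round_trips(), base_of[] and
present[] are names made up for illustration, and it assumes that every
lazily fetched base arrives as a REF_DELTA against the next object in the
chain.

```c
/*
 * Hypothetical model (not git code).
 *
 * base_of[i] is the index of the object that object i is a REF_DELTA
 * against, or -1 if object i is stored whole. present[i] is nonzero if
 * the client already has object i locally.
 *
 * Assumption: each fetched base is itself delivered as a REF_DELTA
 * against base_of[base], so every missing link costs one more fetch.
 */
int count_round_trips(const int *base_of, const int *present, int start)
{
	int trips = 0;
	int cur = base_of[start];

	while (cur >= 0 && !present[cur]) {
		trips++;             /* one lazy fetch for this missing base */
		cur = base_of[cur];  /* worst case: it came as a delta too */
	}
	return trips;
}
```

In this model, a chain of N missing bases costs N extra requests, one per
index-pack invocation, which is the behavior I am worried about.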

Wouldn't it make more sense to introduce a new term like `promised`
(instead of `have`)? Both client and server will have to know about this,
and it would be a new capability, of course, but that way the server could
know that it has to send the entire delta chain.

Of course, this would be quite a bit more involved than the current patch
:-(

Ciao,
Dscho

> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
>  builtin/index-pack.c     | 26 +++++++++++++++--
>  t/t5616-partial-clone.sh | 61 ++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 85 insertions(+), 2 deletions(-)
>
> diff --git a/builtin/index-pack.c b/builtin/index-pack.c
> index ccf4eb7e9b..0d55f73b0b 100644
> --- a/builtin/index-pack.c
> +++ b/builtin/index-pack.c
> @@ -14,6 +14,7 @@
>  #include "thread-utils.h"
>  #include "packfile.h"
>  #include "object-store.h"
> +#include "fetch-object.h"
>
>  static const char index_pack_usage[] =
>  "git index-pack [-v] [-o <index-file>] [--keep | --keep=<msg>] [--verify] [--strict] (<pack-file> | --stdin [--fix-thin] [<pack-file>])";
> @@ -1351,6 +1352,25 @@ static void fix_unresolved_deltas(struct hashfile *f)
>  		sorted_by_pos[i] = &ref_deltas[i];
>  	QSORT(sorted_by_pos, nr_ref_deltas, delta_pos_compare);
>
> +	if (repository_format_partial_clone) {
> +		/*
> +		 * Prefetch the delta bases.
> +		 */
> +		struct oid_array to_fetch = OID_ARRAY_INIT;
> +		for (i = 0; i < nr_ref_deltas; i++) {
> +			struct ref_delta_entry *d = sorted_by_pos[i];
> +			if (!oid_object_info_extended(the_repository, &d->oid,
> +						      NULL,
> +						      OBJECT_INFO_FOR_PREFETCH))
> +				continue;
> +			oid_array_append(&to_fetch, &d->oid);
> +		}
> +		if (to_fetch.nr)
> +			fetch_objects(repository_format_partial_clone,
> +				      to_fetch.oid, to_fetch.nr);
> +		oid_array_clear(&to_fetch);
> +	}
> +
>  	for (i = 0; i < nr_ref_deltas; i++) {
>  		struct ref_delta_entry *d = sorted_by_pos[i];
>  		enum object_type type;
> @@ -1650,8 +1670,10 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
>  	int report_end_of_input = 0;
>
>  	/*
> -	 * index-pack never needs to fetch missing objects, since it only
> -	 * accesses the repo to do hash collision checks
> +	 * index-pack never needs to fetch missing objects except when
> +	 * REF_DELTA bases are missing (which are explicitly handled). It only
> +	 * accesses the repo to do hash collision checks and to check which
> +	 * REF_DELTA bases need to be fetched.
>  	 */
>  	fetch_if_missing = 0;
>
> diff --git a/t/t5616-partial-clone.sh b/t/t5616-partial-clone.sh
> index 7cc0c71556..f1baf83502 100755
> --- a/t/t5616-partial-clone.sh
> +++ b/t/t5616-partial-clone.sh
> @@ -339,4 +339,65 @@ test_expect_success 'when partial cloning, tolerate server not sending target of
>  	! test -e "$HTTPD_ROOT_PATH/one-time-sed"
>  '
>
> +test_expect_success 'tolerate server sending REF_DELTA against missing promisor objects' '
> +	SERVER="$HTTPD_DOCUMENT_ROOT_PATH/server" &&
> +	rm -rf "$SERVER" repo &&
> +	test_create_repo "$SERVER" &&
> +	test_config -C "$SERVER" uploadpack.allowfilter 1 &&
> +	test_config -C "$SERVER" uploadpack.allowanysha1inwant 1 &&
> +
> +	# Create a commit with a blob to be used as a delta base.
> +	for i in $(test_seq 10)
> +	do
> +		echo "this is a line" >>"$SERVER/foo.txt"
> +	done &&
> +	git -C "$SERVER" add foo.txt &&
> +	git -C "$SERVER" commit -m bar &&
> +	git -C "$SERVER" rev-parse HEAD:foo.txt >deltabase &&
> +
> +	git -c protocol.version=2 clone --no-checkout \
> +		--filter=blob:none $HTTPD_URL/one_time_sed/server repo &&
> +
> +	# Sanity check to ensure that the client does not have that blob.
> +	git -C repo rev-list --objects --exclude-promisor-objects \
> +		-- $(cat deltabase) >objlist &&
> +	test_line_count = 0 objlist &&
> +
> +	# Another commit. This commit will be fetched by the client.
> +	echo "abcdefghijklmnopqrstuvwxyz" >>"$SERVER/foo.txt" &&
> +	git -C "$SERVER" add foo.txt &&
> +	git -C "$SERVER" commit -m baz &&
> +
> +	# Pack a thin pack containing, among other things, HEAD:foo.txt
> +	# delta-ed against HEAD^:foo.txt.
> +	printf "%s\n--not\n%s\n" \
> +		$(git -C "$SERVER" rev-parse HEAD) \
> +		$(git -C "$SERVER" rev-parse HEAD^) |
> +		git -C "$SERVER" pack-objects --thin --stdout >thin.pack &&
> +
> +	# Ensure that the pack contains one delta against HEAD^:foo.txt. Since
> +	# the delta contains at least 26 novel characters, the size cannot be
> +	# contained in 4 bits, so the object header will take up 2 bytes. The
> +	# most significant nybble of the first byte is 0b1111 (0b1 to indicate
> +	# that the header continues, and 0b111 to indicate REF_DELTA), followed
> +	# by any 3 nybbles, then the OID of the delta base.
> +	git -C "$SERVER" rev-parse HEAD^:foo.txt >deltabase &&
> +	printf "f.,..%s" $(intersperse "," <deltabase) >want &&
> +	hex_unpack <thin.pack | intersperse "," >have &&
> +	grep $(cat want) have &&
> +
> +	replace_packfile thin.pack &&
> +
> +	# Use protocol v2 because the sed command looks for the "packfile"
> +	# section header.
> +	test_config -C "$SERVER" protocol.version 2 &&
> +
> +	# Fetch the thin pack and ensure that index-pack is able to handle the
> +	# REF_DELTA object with a missing promisor delta base.
> +	git -C repo -c protocol.version=2 fetch &&
> +
> +	# Ensure that the one-time-sed script was used.
> +	! test -e "$HTTPD_ROOT_PATH/one-time-sed"
> +'
> +
>  test_done
> --
> 2.21.0.1020.gf2820cf01a-goog
>
>
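
As an aside on the test's comment about the object header: the first byte
of a pack entry carries a continuation bit, a 3-bit type and the low 4 bits
of the size, and each following byte carries 7 more size bits. A minimal
sketch of decoding that header is below. It is not the code index-pack
uses; parse_entry_header() and struct pack_entry_header are hypothetical
names chosen for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define OBJ_REF_DELTA 7 /* pack type: delta against a base named by OID */

/* Hypothetical holder for the decoded fields (not a git struct). */
struct pack_entry_header {
	unsigned type;     /* 3-bit object type from the first byte */
	uint64_t size;     /* size field of the entry */
	size_t header_len; /* number of header bytes consumed */
};

/* Returns 0 on success, -1 if the buffer ends inside the header. */
int parse_entry_header(const unsigned char *buf, size_t len,
		       struct pack_entry_header *out)
{
	size_t pos = 0;
	unsigned shift = 4;
	unsigned char c;

	if (!len)
		return -1;
	c = buf[pos++];
	out->type = (c >> 4) & 7; /* bits 6..4 */
	out->size = c & 0x0f;     /* bits 3..0 */
	while (c & 0x80) {        /* bit 7: header continues */
		if (pos >= len || shift >= 64)
			return -1; /* truncated or absurdly long header */
		c = buf[pos++];
		out->size |= (uint64_t)(c & 0x7f) << shift;
		shift += 7;
	}
	out->header_len = pos;
	return 0;
}
```

For example, the two bytes 0xf5 0x02 decode as type 7 (REF_DELTA), size 37
and a 2-byte header, which is the "f" nybble followed by three more nybbles
that the grep pattern in the test looks for before the base OID.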

