git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Ben Peart <peartben@gmail.com>
To: Jonathan Tan <jonathantanmy@google.com>, git@vger.kernel.org
Subject: Re: [RFC PATCH 3/3] sha1_file: add promised blob hook support
Date: Wed, 12 Jul 2017 13:40:49 -0400	[thread overview]
Message-ID: <8366d1de-3552-50fc-7a6c-7cfc3219181b@gmail.com> (raw)
In-Reply-To: <34efd9e9936fdab331655f5a33a098a72dc134f4.1499800530.git.jonathantanmy@google.com>



On 7/11/2017 3:48 PM, Jonathan Tan wrote:
> Teach sha1_file to invoke a hook whenever a blob is requested and
> unavailable but is promised. The hook is a shell command that can be
> configured through "git config"; this hook takes in a list of hashes and
> writes (if successful) the corresponding objects to the repo's local
> storage.
> 

This would seem to work well for promised blobs but does it currently 
support other object types?  In our large mono repos, we have to support 
missing commits and trees as well.  I'd rather not have 
separate/different ways to handle the various missing object types.

There are a couple of related patch series in flight that are playing in 
this area. We should definitely coordinate these various efforts.

[RFC] Add support for downloading blobs on demand
https://public-inbox.org/git/009d01d28deb$158563a0$40902ae0$@gmail.com/

This series adds read-object-process support to download on demand any 
missing object type.  This is based on the code we are using today. I'll 
update it to 1) take advantage of the refactored sub-process support 
(https://public-inbox.org/git/20170505152802.6724-1-benpeart@microsoft.com/) 
and 2) fix a bug we discovered with alternates support and resubmit it 
as an updated RFC.

[RFC/PATCH v4 00/49] Add initial experimental external ODB support
https://public-inbox.org/git/20170620075523.26961-1-chriscool@tuxfamily.org/

This is a broader patch series that enables you to register multiple 
handlers for "missing" objects. It also already incorporates the 
read-object-process patch series above.

The advantage of these other patch series is that they already support 
_any_ object type, not just blobs or promised blobs.

> This is meant as a temporary measure to ensure that all Git commands
> work in such a situation. Future patches will update some commands to
> either tolerate promised blobs (without invoking the hook) or be more
> efficient in invoking the promised blob hook.
> 
> In order to determine the code changes in sha1_file.c necessary, I
> investigated the following:
>   (1) functions in sha1_file that take in a hash, without the user
>       regarding how the object is stored (loose or packed)
>   (2) functions in sha1_file that operate on packed objects (because I
>       need to check callers that know about the loose/packed distinction
>       and operate on both differently, and ensure that they can handle
>       the concept of objects that are neither loose nor packed)
> 
> (1) is handled by the modification to sha1_object_info_extended().
> 
> For (2), I looked at for_each_packed_object and at the packed-related
> functions that take in a hash. For for_each_packed_object, the callers
> either already work or are fixed in this patch:
>   - reachable - only to find recent objects
>   - builtin/fsck - already knows about promised blobs
>   - builtin/cat-file - fixed in this commit
> 
> Callers of the other functions do not need to be changed:
>   - parse_pack_index
>     - http - indirectly from http_get_info_packs
>   - find_pack_entry_one
>     - this searches a single pack that is provided as an argument; the
>       caller already knows (through other means) that the sought object
>       is in a specific pack
>   - find_sha1_pack
>     - fast-import - appears to be an optimization to not store a
>       file if it is already in a pack
>     - http-walker - to search through a struct alt_base
>     - http-push - to search through remote packs
>   - has_sha1_pack
>     - builtin/fsck - already knows about promised blobs
>     - builtin/count-objects - informational purposes only (check if loose
>       object is also packed)
>     - builtin/prune-packed - check if object to be pruned is packed (if
>       not, don't prune it)
>     - revision - used to exclude packed objects if requested by user
>     - diff - just for optimization
> 
> An alternative design that I considered but rejected:
> 
>   - Adding a hook whenever a packed blob is requested, not on any blob.
>     That is, whenever we attempt to search the packfiles for a blob, if
>     it is missing (from the packfiles and from the loose object storage),
>     to invoke the hook (which must then store it as a packfile), open the
>     packfile the hook generated, and report that the blob is found in
>     that new packfile. This reduces the amount of analysis needed (in
>     that we only need to look at how packed blobs are handled), but
>     requires that the hook generate packfiles (or for sha1_file to pack
>     whatever loose objects are generated), creating one packfile for each
>     missing blob and potentially very many packfiles that must be
>     linearly searched. This may be tolerable now for repos that only have
>     a few missing blobs (for example, repos that only want to exclude
>     large blobs), and might be tolerable in the future if we have
>     batching support for the most commonly used commands, but is not
>     tolerable now for repos that exclude a large amount of blobs.
> 
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
>   Documentation/config.txt               |  8 ++++
>   Documentation/gitrepository-layout.txt |  8 ++++
>   builtin/cat-file.c                     |  9 ++++
>   promised-blob.c                        | 75 ++++++++++++++++++++++++++++++++++
>   promised-blob.h                        | 13 ++++++
>   sha1_file.c                            | 44 +++++++++++++-------
>   t/t3907-promised-blob.sh               | 36 ++++++++++++++++
>   7 files changed, 179 insertions(+), 14 deletions(-)
> 
> diff --git a/Documentation/config.txt b/Documentation/config.txt
> index d5c9c4cab..c293ac921 100644
> --- a/Documentation/config.txt
> +++ b/Documentation/config.txt
> @@ -393,6 +393,14 @@ The default is false, except linkgit:git-clone[1] or linkgit:git-init[1]
>   will probe and set core.ignoreCase true if appropriate when the repository
>   is created.
>   
> +core.promisedBlobCommand::
> +	If set, whenever a blob in the local repo is attempted to be read, but
> +	is both missing and a promised blob, invoke this shell command to
> +	generate or obtain that blob before reporting an error. This shell
> +	command should take one or more hashes, each terminated by a newline,
> +	as standard input, and (if successful) should write the corresponding
> +	objects to the local repo (packed or loose).
> +
>   core.precomposeUnicode::
>   	This option is only used by Mac OS implementation of Git.
>   	When core.precomposeUnicode=true, Git reverts the unicode decomposition
> diff --git a/Documentation/gitrepository-layout.txt b/Documentation/gitrepository-layout.txt
> index f51ed4e37..0ef9312fa 100644
> --- a/Documentation/gitrepository-layout.txt
> +++ b/Documentation/gitrepository-layout.txt
> @@ -47,6 +47,10 @@ use with dumb transports but otherwise is OK as long as
>   `objects/info/alternates` points at the object stores it
>   borrows from.
>   +
> +. You could have blobs that are merely promised by another source. When
> +Git requires those blobs, it will invoke the command in the
> +`core.promisedBlobCommand` configuration variable.
> ++
>   This directory is ignored if $GIT_COMMON_DIR is set and
>   "$GIT_COMMON_DIR/objects" will be used instead.
>   
> @@ -91,6 +95,10 @@ objects/info/http-alternates::
>   	this object store borrows objects from, to be used when
>   	the repository is fetched over HTTP.
>   
> +objects/promisedblob::
> +	This file records the sha1 object names and sizes of promised
> +	blobs.
> +
>   refs::
>   	References are stored in subdirectories of this
>   	directory.  The 'git prune' command knows to preserve
> diff --git a/builtin/cat-file.c b/builtin/cat-file.c
> index 96b786e48..3df00bf91 100644
> --- a/builtin/cat-file.c
> +++ b/builtin/cat-file.c
> @@ -12,6 +12,7 @@
>   #include "streaming.h"
>   #include "tree-walk.h"
>   #include "sha1-array.h"
> +#include "promised-blob.h"
>   
>   struct batch_options {
>   	int enabled;
> @@ -432,6 +433,13 @@ static int batch_packed_object(const struct object_id *oid,
>   	return 0;
>   }
>   
> +static int batch_promised_blob(const struct object_id *oid,
> +			       void *data)
> +{
> +	oid_array_append(data, oid);
> +	return 0;
> +}
> +
>   static int batch_objects(struct batch_options *opt)
>   {
>   	struct strbuf buf = STRBUF_INIT;
> @@ -473,6 +481,7 @@ static int batch_objects(struct batch_options *opt)
>   
>   		for_each_loose_object(batch_loose_object, &sa, 0);
>   		for_each_packed_object(batch_packed_object, &sa, 0);
> +		for_each_promised_blob(batch_promised_blob, &sa);
>   
>   		cb.opt = opt;
>   		cb.expand = &data;
> diff --git a/promised-blob.c b/promised-blob.c
> index 493808ed2..0fe04daf2 100644
> --- a/promised-blob.c
> +++ b/promised-blob.c
> @@ -2,6 +2,9 @@
>   #include "promised-blob.h"
>   #include "sha1-lookup.h"
>   #include "strbuf.h"
> +#include "run-command.h"
> +#include "sha1-array.h"
> +#include "config.h"
>   
>   #define ENTRY_SIZE (GIT_SHA1_RAWSZ + 8)
>   /*
> @@ -93,3 +96,75 @@ int for_each_promised_blob(each_promised_blob_fn cb, void *data)
>   	}
>   	return 0;
>   }
> +
> +static char *promised_blob_command;
> +
> +static int promised_blob_config(const char *conf_key, const char *value,
> +				void *cb)
> +{
> +	if (!strcmp(conf_key, "core.promisedblobcommand")) {
> +		promised_blob_command = xstrdup(value);
> +	}
> +	return 0;
> +}
> +
> +static void ensure_promised_blob_configured(void)
> +{
> +	static int configured;
> +	if (configured)
> +		return;
> +	git_config(promised_blob_config, NULL);
> +	configured = 1;
> +}
> +
> +int request_promised_blobs(const struct oid_array *blobs)
> +{
> +	struct child_process cp = CHILD_PROCESS_INIT;
> +	const char *argv[] = {NULL, NULL};
> +	const char *env[] = {"GIT_IGNORE_PROMISED_BLOBS=1", NULL};
> +	int blobs_requested = 0;
> +	int i;
> +
> +	for (i = 0; i < blobs->nr; i++) {
> +		if (is_promised_blob(&blobs->oid[i], NULL))
> +			break;
> +	}
> +
> +	if (i == blobs->nr)
> +		/* Nothing to fetch */
> +		return 0;
> +
> +	ensure_promised_blob_configured();
> +	if (!promised_blob_command)
> +		die("some promised blobs need to be fetched, but\n"
> +		    "no core.promisedblobcommand configured");
> +
> +	argv[0] = promised_blob_command;
> +	cp.argv = argv;
> +	cp.env = env;
> +	cp.use_shell = 1;
> +	cp.in = -1;
> +
> +	if (start_command(&cp))
> +		die("failed to start <%s>", promised_blob_command);
> +
> +	for (; i < blobs->nr; i++) {
> +		if (is_promised_blob(&blobs->oid[i], NULL)) {
> +			write_in_full(cp.in, oid_to_hex(&blobs->oid[i]),
> +				      GIT_SHA1_HEXSZ);
> +			write_in_full(cp.in, "\n", 1);
> +			blobs_requested++;
> +		}
> +	}
> +	close(cp.in);
> +
> +	if (finish_command(&cp))
> +		die("failed to finish <%s>", promised_blob_command);
> +
> +	/*
> +	 * The command above may have updated packfiles, so update our record
> +	 * of them.
> +	 */
> +	reprepare_packed_git();
> +	return blobs_requested;
> +}
> diff --git a/promised-blob.h b/promised-blob.h
> index a303ea1ff..34b1d3f9e 100644
> --- a/promised-blob.h
> +++ b/promised-blob.h
> @@ -1,6 +1,8 @@
>   #ifndef PROMISED_BLOB_H
>   #define PROMISED_BLOB_H
>   
> +#include "sha1-array.h"
> +
>   /*
>    * Returns 1 if oid is the name of a promised blob. If size is not NULL, also
>    * returns its size.
> @@ -11,4 +13,15 @@ extern int is_promised_blob(const struct object_id *oid,
>   typedef int each_promised_blob_fn(const struct object_id *oid, void *data);
>   int for_each_promised_blob(each_promised_blob_fn, void *);
>   
> +/*
> + * If any of the given blobs are promised blobs, invokes
> + * core.promisedblobcommand with those blobs and returns the number of blobs
> + * requested. No check is made as to whether the invocation actually populated
> + * the repository with the promised blobs.
> + *
> + * If none of the given blobs are promised blobs, this function does not invoke
> + * anything and returns 0.
> + */
> +int request_promised_blobs(const struct oid_array *blobs);
> +
>   #endif
> diff --git a/sha1_file.c b/sha1_file.c
> index 5862386cd..5778a6dbd 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -28,6 +28,11 @@
>   #include "list.h"
>   #include "mergesort.h"
>   #include "quote.h"
> +#include "iterator.h"
> +#include "dir-iterator.h"
> +#include "sha1-lookup.h"
> +#include "promised-blob.h"
> +#include "sha1-array.h"
>   
>   #define SZ_FMT PRIuMAX
>   static inline uintmax_t sz_fmt(size_t s) { return s; }
> @@ -2983,6 +2988,7 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
>   	const unsigned char *real = (flags & OBJECT_INFO_LOOKUP_REPLACE) ?
>   				    lookup_replace_object(sha1) :
>   				    sha1;
> +	int already_retried = 0;
>   
>   	if (!oi)
>   		oi = &blank_oi;
> @@ -3007,30 +3013,40 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
>   		}
>   	}
>   
> -	if (!find_pack_entry(real, &e)) {
> -		/* Most likely it's a loose object. */
> -		if (!sha1_loose_object_info(real, oi, flags)) {
> -			oi->whence = OI_LOOSE;
> -			return 0;
> -		}
> +retry:
> +	if (find_pack_entry(real, &e))
> +		goto found_packed;
>   
> -		/* Not a loose object; someone else may have just packed it. */
> -		if (flags & OBJECT_INFO_QUICK) {
> -			return -1;
> -		} else {
> -			reprepare_packed_git();
> -			if (!find_pack_entry(real, &e))
> -				return -1;
> +	/* Most likely it's a loose object. */
> +	if (!sha1_loose_object_info(real, oi, flags)) {
> +		oi->whence = OI_LOOSE;
> +		return 0;
> +	}
> +
> +	/* Not a loose object; someone else may have just packed it. */
> +	reprepare_packed_git();
> +	if (find_pack_entry(real, &e))
> +		goto found_packed;
> +
> +	/* Check if it is a promised blob */
> +	if (!already_retried) {
> +		struct oid_array promised = OID_ARRAY_INIT;
> +		oid_array_append_sha1(&promised, real);
> +		if (request_promised_blobs(&promised)) {
> +			already_retried = 1;
> +			goto retry;
>   		}
>   	}
>   
> +	return -1;
> +
> +found_packed:
>   	if (oi == &blank_oi)
>   		/*
>   		 * We know that the caller doesn't actually need the
>   		 * information below, so return early.
>   		 */
>   		return 0;
> -
>   	rtype = packed_object_info(e.p, e.offset, oi);
>   	if (rtype < 0) {
>   		mark_bad_packed_object(e.p, real);
> diff --git a/t/t3907-promised-blob.sh b/t/t3907-promised-blob.sh
> index 827072004..778afd1f6 100755
> --- a/t/t3907-promised-blob.sh
> +++ b/t/t3907-promised-blob.sh
> @@ -26,4 +26,40 @@ test_expect_success '...but fails again with GIT_IGNORE_PROMISED_BLOBS' '
>   	unset GIT_IGNORE_PROMISED_BLOBS
>   '
>   
> +test_expect_success 'sha1_object_info_extended (through git cat-file)' '
> +	rm -rf server client &&
> +
> +	git init server &&
> +	test_commit -C server 1 1.t abcdefgh &&
> +	HASH=$(git hash-object server/1.t) &&
> +
> +	git init client &&
> +	git -C client config core.promisedblobcommand \
> +		"git -C \"$(pwd)/server\" pack-objects --stdout |
> +		 git unpack-objects" &&
> +	
> +	test_must_fail git -C client cat-file -p "$HASH"
> +'
> +
> +test_expect_success '...succeeds if it is a promised blob' '
> +	printf "%s%016x" "$HASH" "$(wc -c <server/1.t)" |
> +		hex_pack >client/.git/objects/promisedblob &&
> +	git -C client cat-file -p "$HASH"
> +'
> +
> +test_expect_success 'cat-file --batch-all-objects with promised blobs' '
> +	rm -rf client &&
> +
> +	git init client &&
> +	git -C client config core.promisedblobcommand \
> +		"git -C \"$(pwd)/server\" pack-objects --stdout |
> +		 git unpack-objects" &&
> +	printf "%s%016x" "$HASH" "$(wc -c <server/1.t)" |
> +		hex_pack >client/.git/objects/promisedblob &&
> +
> +	# Verify that the promised blob is printed
> +	git -C client cat-file --batch --batch-all-objects | tee out |
> +		grep abcdefgh
> +'
> +
>   test_done
> 

  parent reply	other threads:[~2017-07-12 17:41 UTC|newest]

Thread overview: 45+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-07-11 19:48 [RFC PATCH 0/3] Partial clone: promised blobs (formerly "missing blobs") Jonathan Tan
2017-07-11 19:48 ` [RFC PATCH 1/3] promised-blob, fsck: introduce promised blobs Jonathan Tan
2017-07-11 22:02   ` Stefan Beller
2017-07-19 23:37     ` Jonathan Tan
2017-07-12 17:29   ` Jeff Hostetler
2017-07-12 19:28     ` Jonathan Nieder
2017-07-13 14:48       ` Jeff Hostetler
2017-07-13 15:05         ` Jeff Hostetler
2017-07-13 19:39     ` Jonathan Tan
2017-07-14 20:03       ` Jeff Hostetler
2017-07-14 21:30         ` Jonathan Nieder
2017-07-11 19:48 ` [RFC PATCH 2/3] sha1-array: support appending unsigned char hash Jonathan Tan
2017-07-11 22:06   ` Stefan Beller
2017-07-19 23:56     ` Jonathan Tan
2017-07-20  0:06       ` Stefan Beller
2017-07-11 19:48 ` [RFC PATCH 3/3] sha1_file: add promised blob hook support Jonathan Tan
2017-07-11 22:38   ` Stefan Beller
2017-07-12 17:40   ` Ben Peart [this message]
2017-07-12 20:38     ` Jonathan Nieder
2017-07-16 15:23 ` [RFC PATCH 0/3] Partial clone: promised blobs (formerly "missing blobs") Philip Oakley
2017-07-17 17:43   ` Ben Peart
2017-07-25 20:48     ` Philip Oakley
2017-07-17 18:03   ` Jonathan Nieder
2017-07-29 12:51     ` Philip Oakley
2017-07-20  0:21 ` [RFC PATCH v2 0/4] Partial clone: promised objects (not only blobs) Jonathan Tan
2017-07-20  0:21 ` [RFC PATCH v2 1/4] object: remove "used" field from struct object Jonathan Tan
2017-07-20  0:36   ` Stefan Beller
2017-07-20  0:55     ` Jonathan Tan
2017-07-20 17:44       ` Ben Peart
2017-07-20 21:20   ` Junio C Hamano
2017-07-20  0:21 ` [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects Jonathan Tan
2017-07-20 18:07   ` Stefan Beller
2017-07-20 19:17     ` Jonathan Tan
2017-07-20 19:58   ` Ben Peart
2017-07-20 21:13     ` Jonathan Tan
2017-07-21 16:24       ` Ben Peart
2017-07-21 20:33         ` Jonathan Tan
2017-07-25 15:10           ` Ben Peart
2017-07-29 13:26             ` Philip Oakley
2017-07-20  0:21 ` [RFC PATCH v2 3/4] sha1-array: support appending unsigned char hash Jonathan Tan
2017-07-20  0:21 ` [RFC PATCH v2 4/4] sha1_file: support promised object hook Jonathan Tan
2017-07-20 18:23   ` Stefan Beller
2017-07-20 20:58     ` Ben Peart
2017-07-20 21:18       ` Jonathan Tan
2017-07-21 16:27         ` Ben Peart

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=8366d1de-3552-50fc-7a6c-7cfc3219181b@gmail.com \
    --to=peartben@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=jonathantanmy@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).