* Proposal for missing blob support in Git repos
From: Jonathan Tan @ 2017-04-26 22:13 UTC
  To: git; +Cc: Jonathan Tan, markbt, git, kevin.david

Here is a proposal for missing blob support in Git repos. There have
been several other proposals [1] [2]; this is similar to those except
that I have provided a more comprehensive analysis of the changes that
need to be done, and have made those changes (see commit below the
scissors line).

This proposal is limited to local handling using a user-specified hook.
I have another proposal out [3] for what a potential default hook should
be - how Git on the server can serve blobs to repos that have missing
blobs.

[1] <20170304191901.9622-1-markbt@efaref.net>
[2] <1488999039-37631-1-git-send-email-git@jeffhostetler.com>
[3] <ffd92ad9-39fe-c76b-178d-6e3d6a425037@google.com>

-- 8< --
sha1_file, fsck: add missing blob hook support

Currently, Git does not handle well repos with very large numbers of
blobs, or repos that wish to minimize manipulation of certain blobs
(for example, because they are very large), even if the user operates
mostly on part of the repo. This is because Git is designed on the
assumption that every blob referenced by a tree object is available
somewhere in the repo storage.

As a first step to reducing this problem, add rudimentary support for
missing blobs by teaching sha1_file to invoke a hook whenever a blob is
requested but missing, and by updating fsck to tolerate missing blobs.
The hook is a shell command that can be configured through "git config";
this hook takes in a list of hashes and writes (if successful) the
corresponding objects to the repo's local storage.
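
For illustration, a configuration in the spirit of the test added below
might look like this (the server path is an example only; any command
that reads newline-terminated hashes on stdin and writes the
corresponding objects into the local repo would do):

    git config core.missingBlobCommand \
        'git -C /path/to/server pack-objects --stdout | git unpack-objects'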

This commit does not include support for generating such a repo, and
no command other than fsck has been modified to tolerate missing blobs
without invoking the hook or to invoke the hook more efficiently (for
example, by batching requests). Only a fallback is provided: sha1_file
invokes the missing blob hook whenever a blob turns out to be missing.

In order to determine the code changes necessary in sha1_file.c, I
investigated the following:
 (1) functions in sha1_file that take in a hash, where the caller does
     not care how the object is stored (loose or packed)
 (2) functions in sha1_file that operate on packed objects (because I
     need to check callers that know about the loose/packed distinction
     and operate on both differently, and ensure that they can handle
     the concept of objects that are neither loose nor packed)

For (1), I looked through all non-static functions in sha1_file.c that
take in an unsigned char * parameter. The ones that are relevant, and my
modifications to them to resolve this problem, are:
 - sha1_object_info_extended (fixed in this commit)
 - sha1_object_info (auto-fixed by sha1_object_info_extended)
 - read_sha1_file_extended (fixed by fixing read_object)
 - read_object_with_reference (auto-fixed by read_sha1_file_extended)
 - force_object_loose (only called from builtin/pack-objects.c, which
   already knows that at least one pack contains this object)
 - has_sha1_file_with_flags (fixed in this commit)
 - assert_sha1_type (auto-fixed by sha1_object_info)

As described in the list above, several changes have been included in
this commit to fix the necessary functions.

For (2), I looked through the same functions as in (1) and also
for_each_packed_object. The ones that are relevant are:
 - parse_pack_index
   - http - indirectly from http_get_info_packs
 - find_pack_entry_one
   - this searches a single pack that is provided as an argument; the
     caller already knows (through other means) that the sought object
     is in a specific pack
 - find_sha1_pack
   - fast-import - appears to be an optimization to not store a
     file if it is already in a pack
   - http-walker - to search through a struct alt_base
   - http-push - to search through remote packs
 - has_sha1_pack
   - builtin/fsck - fixed in this commit
   - builtin/count-objects - informational purposes only (check if loose
     object is also packed)
   - builtin/prune-packed - check if object to be pruned is packed (if
     not, don't prune it)
   - revision - used to exclude packed objects if requested by user
   - diff - just for optimization
 - for_each_packed_object
   - reachable - only to find recent objects
   - builtin/fsck - fixed in this commit
   - builtin/cat-file - see below

As described in the list above, builtin/fsck has been updated. I have
left builtin/cat-file alone; this means that cat-file
--batch-all-objects will only operate on objects physically in the repo.
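
Concretely, an enumeration such as

    git cat-file --batch-all-objects --batch-check

will keep listing only the objects physically present in the local
repo, without consulting the hook.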

Some alternative designs that I considered but rejected:

 - Storing a list of hashes of missing blobs, possibly with metadata
   (much like the shallow list). Having such a list would allow for
   things like better error messages, attaching metadata (for example,
   file size or binary/text nature) to each blob, and configuring
   different hooks for each blob, but it is difficult to scale to large
   repos.
 - Invoking the hook whenever a packed blob is requested, not whenever
   any blob is requested. That is, whenever we attempt to search the
   packfiles for a blob and it is missing (from both the packfiles and
   the loose object storage), we would invoke the hook (which must then
   store it as a packfile), open the packfile the hook generated, and
   report that the blob is found in that new packfile. This reduces the
   amount of analysis needed (we only need to look at how packed blobs
   are handled), but requires that the hook generate packfiles (or that
   sha1_file pack whatever loose objects the hook generates), creating
   one packfile for each missing blob and potentially very many
   packfiles that must be linearly searched. This may be tolerable now
   for repos that have only a few missing blobs (for example, repos
   that only want to exclude large blobs), and might be tolerable in
   the future if we have batching support for the most commonly used
   commands, but is not tolerable now for repos that exclude a large
   number of blobs.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 Documentation/config.txt | 10 +++++++
 builtin/fsck.c           |  6 +++++
 sha1_file.c              | 69 +++++++++++++++++++++++++++++++++++++++++++++---
 t/t3907-missing-blob.sh  | 50 +++++++++++++++++++++++++++++++++++
 4 files changed, 131 insertions(+), 4 deletions(-)
 create mode 100644 t/t3907-missing-blob.sh

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 475e874d5..04db83fe8 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -369,6 +369,16 @@ The default is false, except linkgit:git-clone[1] or linkgit:git-init[1]
 will probe and set core.ignoreCase true if appropriate when the repository
 is created.
 
+core.missingBlobCommand::
+	If set, whenever a blob in the local repo is attempted to be
+	read but is missing, invoke this shell command to generate or
+	obtain that blob before reporting an error. This shell command
+	should take one or more hashes, each terminated by a newline, as
+	standard input, and (if successful) should write the
+	corresponding objects to the local repo (packed or loose).
++
+If set, fsck will not treat a missing blob as an error condition.
+
 core.precomposeUnicode::
 	This option is only used by Mac OS implementation of Git.
 	When core.precomposeUnicode=true, Git reverts the unicode decomposition
diff --git a/builtin/fsck.c b/builtin/fsck.c
index f76e4163a..5c60c0916 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -37,6 +37,7 @@ static int verbose;
 static int show_progress = -1;
 static int show_dangling = 1;
 static int name_objects;
+static int missing_blob_ok;
 #define ERROR_OBJECT 01
 #define ERROR_REACHABLE 02
 #define ERROR_PACK 04
@@ -93,6 +94,9 @@ static int fsck_config(const char *var, const char *value, void *cb)
 		return 0;
 	}
 
+	if (!strcmp(var, "core.missingblobcommand"))
+		missing_blob_ok = 1;
+
 	return git_default_config(var, value, cb);
 }
 
@@ -222,6 +226,8 @@ static void check_reachable_object(struct object *obj)
 	if (!(obj->flags & HAS_OBJ)) {
 		if (has_sha1_pack(obj->oid.hash))
 			return; /* it is in pack - forget about it */
+		if (missing_blob_ok && obj->type == OBJ_BLOB)
+			return;
 		printf("missing %s %s\n", printable_type(obj),
 			describe_object(obj));
 		errors_found |= ERROR_REACHABLE;
diff --git a/sha1_file.c b/sha1_file.c
index 1577e2d7d..de8f137a9 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2955,6 +2955,49 @@ static int sha1_loose_object_info(const unsigned char *sha1,
 	return (status < 0) ? status : 0;
 }
 
+static char *missing_blob_command;
+static int sha1_file_config(const char *conf_key, const char *value, void *cb)
+{
+	if (!strcmp(conf_key, "core.missingblobcommand")) {
+		missing_blob_command = xstrdup(value);
+	}
+	return 0;
+}
+
+static int configured;
+static void ensure_configured(void)
+{
+	if (configured)
+		return;
+
+	git_config(sha1_file_config, NULL);
+	configured = 1;
+}
+
+static void handle_missing_blob(const unsigned char *sha1)
+{
+	struct child_process cp = CHILD_PROCESS_INIT;
+	const char *argv[] = {missing_blob_command, NULL};
+	char input[GIT_MAX_HEXSZ + 1];
+
+	memcpy(input, sha1_to_hex(sha1), 40);
+	input[40] = '\n';
+
+	cp.argv = argv;
+	cp.env = local_repo_env;
+	cp.use_shell = 1;
+
+	if (pipe_command(&cp, input, sizeof(input), NULL, 0, NULL, 0)) {
+		die("failed to load blob %s", sha1_to_hex(sha1));
+	}
+
+	/*
+	 * The command above may have updated packfiles, so update our record
+	 * of them.
+	 */
+	reprepare_packed_git();
+}
+
 int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi, unsigned flags)
 {
 	struct cached_object *co;
@@ -2979,6 +3022,8 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
 		return 0;
 	}
 
+	if (!has_sha1_file_with_flags(sha1, 0))
+		return -1;
 	if (!find_pack_entry(real, &e)) {
 		/* Most likely it's a loose object. */
 		if (!sha1_loose_object_info(real, oi, flags)) {
@@ -3091,6 +3136,8 @@ static void *read_object(const unsigned char *sha1, enum object_type *type,
 		return xmemdupz(co->buf, co->size);
 	}
 
+	if (!has_sha1_file_with_flags(sha1, 0))
+		return NULL;
 	buf = read_packed_sha1(sha1, type, size);
 	if (buf)
 		return buf;
@@ -3480,17 +3527,31 @@ int has_sha1_pack(const unsigned char *sha1)
 int has_sha1_file_with_flags(const unsigned char *sha1, int flags)
 {
 	struct pack_entry e;
+	int already_retried = 0;
 
 	if (!startup_info->have_repository)
 		return 0;
+retry:
 	if (find_pack_entry(sha1, &e))
 		return 1;
 	if (has_loose_object(sha1))
 		return 1;
-	if (flags & HAS_SHA1_QUICK)
-		return 0;
-	reprepare_packed_git();
-	return find_pack_entry(sha1, &e);
+	if (!(flags & HAS_SHA1_QUICK)) {
+		reprepare_packed_git();
+		if (find_pack_entry(sha1, &e))
+			return 1;
+	}
+
+	if (!already_retried) {
+		ensure_configured();
+		if (missing_blob_command) {
+			already_retried = 1;
+			handle_missing_blob(sha1);
+			goto retry;
+		}
+	}
+
+	return 0;
 }
 
 int has_object_file(const struct object_id *oid)
diff --git a/t/t3907-missing-blob.sh b/t/t3907-missing-blob.sh
new file mode 100644
index 000000000..f03e53c23
--- /dev/null
+++ b/t/t3907-missing-blob.sh
@@ -0,0 +1,50 @@
+#!/bin/sh
+
+test_description='core.missingblobcommand option'
+
+. ./test-lib.sh
+
+test_expect_success 'sha1_object_info_extended and read_sha1_file (through git cat-file -p)' '
+	rm -rf server client &&
+
+	git init server &&
+	test_commit -C server 1 &&
+	test_config -C server uploadpack.allowanysha1inwant 1 &&
+	HASH=$(git hash-object server/1.t) &&
+
+	git init client &&
+	test_config -C client core.missingblobcommand \
+		"git -C \"$(pwd)/server\" pack-objects --stdout | git unpack-objects" &&
+	git -C client cat-file -p "$HASH"
+'
+
+test_expect_success 'has_sha1_file (through git cat-file -e)' '
+	rm -rf server client &&
+
+	git init server &&
+	test_commit -C server 1 &&
+	test_config -C server uploadpack.allowanysha1inwant 1 &&
+	HASH=$(git hash-object server/1.t) &&
+
+	git init client &&
+	test_config -C client core.missingblobcommand \
+		"git -C \"$(pwd)/server\" pack-objects --stdout | git unpack-objects" &&
+	git -C client cat-file -e "$HASH"
+'
+
+test_expect_success 'fsck' '
+	rm -rf server client &&
+
+	git init server &&
+	test_commit -C server 1 &&
+	test_config -C server uploadpack.allowanysha1inwant 1 &&
+	HASH=$(git hash-object server/1.t) &&
+	echo hash is $HASH &&
+
+	cp -r server client &&
+	test_config -C client core.missingblobcommand "this-command-is-not-actually-run" &&
+	rm client/.git/objects/$(echo $HASH | cut -c1-2)/$(echo $HASH | cut -c3-40) &&
+	git -C client fsck
+'
+
+test_done
-- 
2.13.0.rc0.306.g87b477812d-goog
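
To exercise the new test on its own, it can be run from the t/
directory in the usual way, for example:

    cd t && sh t3907-missing-blob.sh -v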



* Re: Proposal for missing blob support in Git repos
From: Junio C Hamano @ 2017-05-01  3:57 UTC
  To: Jonathan Tan; +Cc: git, markbt, git, kevin.david

Jonathan Tan <jonathantanmy@google.com> writes:

> In order to determine the code changes in sha1_file.c necessary, I
> investigated the following:
>  (1) functions in sha1_file that take in a hash, without the user
>      regarding how the object is stored (loose or packed)
>  (2) functions in sha1_file that operate on packed objects (because I
>      need to check callers that know about the loose/packed distinction
>      and operate on both differently, and ensure that they can handle
>      the concept of objects that are neither loose nor packed)
>
> For (1), I looked through all non-static functions in sha1_file.c that
> take in an unsigned char * parameter. The ones that are relevant, and my
> modifications to them to resolve this problem, are:
>  - sha1_object_info_extended (fixed in this commit)
>  - sha1_object_info (auto-fixed by sha1_object_info_extended)
>  - read_sha1_file_extended (fixed by fixing read_object)
>  - read_object_with_reference (auto-fixed by read_sha1_file_extended)
>  - force_object_loose (only called from builtin/pack-objects.c, which
>    already knows that at least one pack contains this object)
>  - has_sha1_file_with_flags (fixed in this commit)
>  - assert_sha1_type (auto-fixed by sha1_object_info)
>
> As described in the list above, several changes have been included in
> this commit to fix the necessary functions.
>
> For (2), I looked through the same functions as in (1) and also
> for_each_packed_object. The ones that are relevant are:
>  - parse_pack_index
>    - http - indirectly from http_get_info_packs
>  - find_pack_entry_one
>    - this searches a single pack that is provided as an argument; the
>      caller already knows (through other means) that the sought object
>      is in a specific pack
>  - find_sha1_pack
>    - fast-import - appears to be an optimization to not store a
>      file if it is already in a pack
>    - http-walker - to search through a struct alt_base
>    - http-push - to search through remote packs
>  - has_sha1_pack
>    - builtin/fsck - fixed in this commit
>    - builtin/count-objects - informational purposes only (check if loose
>      object is also packed)
>    - builtin/prune-packed - check if object to be pruned is packed (if
>      not, don't prune it)
>    - revision - used to exclude packed objects if requested by user
>    - diff - just for optimization
>  - for_each_packed_object
>    - reachable - only to find recent objects
>    - builtin/fsck - fixed in this commit
>    - builtin/cat-file - see below
>
> As described in the list above, builtin/fsck has been updated. I have
> left builtin/cat-file alone; this means that cat-file
> --batch-all-objects will only operate on objects physically in the repo.

One thing I wonder about is the performance impact of a change like
this on the codepath that wants to see if an object does _not_ exist
in the repository.  When creating a new object by hashing raw data,
we see if an object with the same name already exists before writing
the compressed loose object out (or comparing the payload to detect
hash collision).  With "missing blob" support, we'd essentially
spawn an extra process every time we want to create a new blob
locally, and most of the time that is done only to hear the external
command say "no, we've never heard of such an object", with a
possibly large latency.

If we do not have to worry about that (or if it is no use to worry
about it, because we cannot avoid it if we wanted to do the lazy
loading of objects from elsewhere), then the patch presented here
looked like a sensible first step towards the stated goal.

Thanks.


* Re: Proposal for missing blob support in Git repos
From: Jonathan Tan @ 2017-05-01 19:12 UTC
  To: Junio C Hamano; +Cc: git, markbt, git, kevin.david

On 04/30/2017 08:57 PM, Junio C Hamano wrote:
> One thing I wonder is what the performance impact of a change like
> this to the codepath that wants to see if an object does _not_ exist
> in the repository.  When creating a new object by hashing raw data,
> we see if an object with the same name already exists before writing
> the compressed loose object out (or comparing the payload to detect
> hash collision).  With a "missing blob" support, we'd essentially
> spawn an extra process every time we want to create a new blob
> locally, and most of the time that is done only to hear the external
> command to say "no, we've never heard of such an object", with a
> possibly large latency.
>
> If we do not have to worry about that (or if it is no use to worry
> about it, because we cannot avoid it if we wanted to do the lazy
> loading of objects from elsewhere), then the patch presented here
> looked like a sensible first step towards the stated goal.
>
> Thanks.

Thanks for your comments. If you're referring to the codepath involving 
write_sha1_file() (for example, builtin/hash-object -> index_fd or 
builtin/unpack-objects), that is fine because write_sha1_file() invokes 
freshen_packed_object() and freshen_loose_object() directly to check if 
the object already exists (and thus does not invoke the new mechanism in 
this patch).

Having said that, looking at other parts of the fetching mechanism, 
there are a few calls to has_sha1_file() and others that might need to 
be checked. (We have already discussed one - the one in rev-list when 
invoked to check connectivity.) I could take a look at that, but was 
hoping for discussion on what I've sent so far (so that I know that I'm 
on the right track, and because it somewhat works, albeit slowly).


* Re: Proposal for missing blob support in Git repos
From: Junio C Hamano @ 2017-05-01 23:29 UTC
  To: Jonathan Tan; +Cc: git, markbt, git, kevin.david

Jonathan Tan <jonathantanmy@google.com> writes:

> Thanks for your comments. If you're referring to the codepath
> involving write_sha1_file() (for example, builtin/hash-object ->
> index_fd or builtin/unpack-objects), that is fine because
> write_sha1_file() invokes freshen_packed_object() and
> freshen_loose_object() directly to check if the object already exists
> (and thus does not invoke the new mechanism in this patch).

Is that a good thing, though?  It means that an attacker can
feed one version to the remote object store your "grab blob" hook
gets the blobs from, and have you add a colliding object locally,
and the usual "are we recording the same object as an existing one?"
check is bypassed.



* Re: Proposal for missing blob support in Git repos
From: Jonathan Tan @ 2017-05-02  0:33 UTC
  To: Junio C Hamano; +Cc: git, markbt, git, kevin.david

On 05/01/2017 04:29 PM, Junio C Hamano wrote:
> Jonathan Tan <jonathantanmy@google.com> writes:
>
>> Thanks for your comments. If you're referring to the codepath
>> involving write_sha1_file() (for example, builtin/hash-object ->
>> index_fd or builtin/unpack-objects), that is fine because
>> write_sha1_file() invokes freshen_packed_object() and
>> freshen_loose_object() directly to check if the object already exists
>> (and thus does not invoke the new mechanism in this patch).
>
> Is that a good thing, though?  It means that you an attacker can
> feed one version to the remote object store your "grab blob" hook
> gets the blobs from, and have you add a colliding object locally,
> and the usual "are we recording the same object as existing one?"
> check is bypassed.

If I understand this correctly, you mean the situation where the
hook adds an object to the local repo, overriding another object of the
same name? If yes, I think that is the nature of executing an arbitrary
command. If we really want to avoid that, we could drop the hook
functionality (and instead, for example, provide the URL of a Git repo
with which we can communicate using a new fetch-blob protocol), although
that would reduce the usefulness of this, especially during the
transition period in which we don't have any sort of batching of requests.


* Re: Proposal for missing blob support in Git repos
From: Brandon Williams @ 2017-05-02  0:38 UTC
  To: Jonathan Tan; +Cc: Junio C Hamano, git, markbt, git, kevin.david

On 05/01, Jonathan Tan wrote:
> On 05/01/2017 04:29 PM, Junio C Hamano wrote:
> >Jonathan Tan <jonathantanmy@google.com> writes:
> >
> >>Thanks for your comments. If you're referring to the codepath
> >>involving write_sha1_file() (for example, builtin/hash-object ->
> >>index_fd or builtin/unpack-objects), that is fine because
> >>write_sha1_file() invokes freshen_packed_object() and
> >>freshen_loose_object() directly to check if the object already exists
> >>(and thus does not invoke the new mechanism in this patch).
> >
> >Is that a good thing, though?  It means that you an attacker can
> >feed one version to the remote object store your "grab blob" hook
> >gets the blobs from, and have you add a colliding object locally,
> >and the usual "are we recording the same object as existing one?"
> >check is bypassed.
> 
> If I understand this correctly, what you mean is the situation where
> the hook adds an object to the local repo, overriding another object
> of the same name? If yes, I think that is the nature of executing an
> arbitrary command. If we really want to avoid that, we could drop
> the hook functionality (and instead, for example, provide the URL of
> a Git repo instead from which we can communicate using a new
> fetch-blob protocol), although that would reduce the usefulness of
> this, especially during the transition period in which we don't have
> any sort of batching of requests.

If I understand correctly, this is where we aim to be once all is said
and done.  I guess the question is what we are willing to do during the
transition phase.

-- 
Brandon Williams


* Re: Proposal for missing blob support in Git repos
From: Junio C Hamano @ 2017-05-02  1:41 UTC
  To: Jonathan Tan; +Cc: git, markbt, git, kevin.david

Jonathan Tan <jonathantanmy@google.com> writes:

> On 05/01/2017 04:29 PM, Junio C Hamano wrote:
>> Jonathan Tan <jonathantanmy@google.com> writes:
>>
>>> Thanks for your comments. If you're referring to the codepath
>>> involving write_sha1_file() (for example, builtin/hash-object ->
>>> index_fd or builtin/unpack-objects), that is fine because
>>> write_sha1_file() invokes freshen_packed_object() and
>>> freshen_loose_object() directly to check if the object already exists
>>> (and thus does not invoke the new mechanism in this patch).
>>
>> Is that a good thing, though?  It means that you an attacker can
>> feed one version to the remote object store your "grab blob" hook
>> gets the blobs from, and have you add a colliding object locally,
>> and the usual "are we recording the same object as existing one?"
>> check is bypassed.
>
> If I understand this correctly, what you mean is the situation where
> the hook adds an object to the local repo, overriding another object
> of the same name?

No.  

write_sha1_file() pays attention to objects already in the local
object store to avoid hash collisions that can be used to replace a
known-to-be-good object, and that is done as a security measure.
What I am reading in your response is that this new mechanism
bypasses that, and I was wondering if that is a good thing.



* Re: Proposal for missing blob support in Git repos
From: Jonathan Tan @ 2017-05-02 17:21 UTC
  To: Junio C Hamano; +Cc: Git mailing list, Mark Thomas, Jeff Hostetler, Kevin David

On Mon, May 1, 2017 at 6:41 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Jonathan Tan <jonathantanmy@google.com> writes:
>
>> On 05/01/2017 04:29 PM, Junio C Hamano wrote:
>>> Jonathan Tan <jonathantanmy@google.com> writes:
>>>
>>>> Thanks for your comments. If you're referring to the codepath
>>>> involving write_sha1_file() (for example, builtin/hash-object ->
>>>> index_fd or builtin/unpack-objects), that is fine because
>>>> write_sha1_file() invokes freshen_packed_object() and
>>>> freshen_loose_object() directly to check if the object already exists
>>>> (and thus does not invoke the new mechanism in this patch).
>>>
>>> Is that a good thing, though?  It means that you an attacker can
>>> feed one version to the remote object store your "grab blob" hook
>>> gets the blobs from, and have you add a colliding object locally,
>>> and the usual "are we recording the same object as existing one?"
>>> check is bypassed.
>>
>> If I understand this correctly, what you mean is the situation where
>> the hook adds an object to the local repo, overriding another object
>> of the same name?
>
> No.
>
> write_sha1_file() pays attention to objects already in the local
> object store to avoid hash collisions that can be used to replace a
> known-to-be-good object and that is done as a security measure.
> What I am reading in your response was that this new mechanism
> bypasses that, and I was wondering if that is a good thing.

Oh, what I meant was that write_sha1_file() bypasses the new
mechanism, not that the new mechanism bypasses the checks in
write_sha1_file().

To be clear, here's what happens when write_sha1_file() is invoked
(before and after this patch - this patch does not affect
write_sha1_file at all):
1. (some details omitted)
2. call freshen_packed_object
3. call freshen_loose_object if necessary
4. write object (only if freshen_packed_object and freshen_loose_object
both return 0, i.e. only if the object is not already present)

Nothing changes in this patch (whether a hook is defined or not).

And here's what happens when has_sha1_file (or another function listed
in the commit message) is invoked:
1. check for existence of packed object of the requested name
2. check for existence of loose object of the requested name
3. check again for existence of packed object of the requested name
4. if a hook is defined, invoke the hook and repeat 1-3

Here, in step 4, the hook could do whatever it wants to the repository.
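
A quick way to see this lookup-and-retry from the command line,
mirroring the t3907 tests in the patch ($HASH and the repo paths are
as in those tests, i.e. purely illustrative):

    # without a hook configured, the lookup fails after steps 1-3
    git -C client cat-file -e "$HASH"   # exits non-zero

    # with the hook configured, step 4 fetches the blob and the
    # retried steps 1-3 succeed
    git -C client config core.missingBlobCommand \
        'git -C /path/to/server pack-objects --stdout | git unpack-objects'
    git -C client cat-file -e "$HASH"   # exits 0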


* Re: Proposal for missing blob support in Git repos
From: Ævar Arnfjörð Bjarmason @ 2017-05-02 18:32 UTC
  To: Jonathan Tan
  Cc: Junio C Hamano, Git mailing list, Mark Thomas, Jeff Hostetler,
	Kevin David

On Tue, May 2, 2017 at 7:21 PM, Jonathan Tan <jonathantanmy@google.com> wrote:
> On Mon, May 1, 2017 at 6:41 PM, Junio C Hamano <gitster@pobox.com> wrote:
>> Jonathan Tan <jonathantanmy@google.com> writes:
>>
>>> On 05/01/2017 04:29 PM, Junio C Hamano wrote:
>>>> Jonathan Tan <jonathantanmy@google.com> writes:
>>>>
>>>>> Thanks for your comments. If you're referring to the codepath
>>>>> involving write_sha1_file() (for example, builtin/hash-object ->
>>>>> index_fd or builtin/unpack-objects), that is fine because
>>>>> write_sha1_file() invokes freshen_packed_object() and
>>>>> freshen_loose_object() directly to check if the object already exists
>>>>> (and thus does not invoke the new mechanism in this patch).
>>>>
>>>> Is that a good thing, though?  It means that you an attacker can
>>>> feed one version to the remote object store your "grab blob" hook
>>>> gets the blobs from, and have you add a colliding object locally,
>>>> and the usual "are we recording the same object as existing one?"
>>>> check is bypassed.
>>>
>>> If I understand this correctly, what you mean is the situation where
>>> the hook adds an object to the local repo, overriding another object
>>> of the same name?
>>
>> No.
>>
>> write_sha1_file() pays attention to objects already in the local
>> object store to avoid hash collisions that can be used to replace a
>> known-to-be-good object and that is done as a security measure.
>> What I am reading in your response was that this new mechanism
>> bypasses that, and I was wondering if that is a good thing.
>
> Oh, what I meant was that write_sha1_file() bypasses the new
> mechanism, not that the new mechanism bypasses the checks in
> write_sha1_file().
>
> To be clear, here's what happens when write_sha1_file() is invoked
> (before and after this patch - this patch does not affect
> write_sha1_file at all):
> 1. (some details omitted)
> 2. call freshen_packed_object
> 3, call freshen_loose_object if necessary
> 4. write object (if freshen_packed_object and freshen_loose_object do
> not both return 0)
>
> Nothing changes in this patch (whether a hook is defined or not).

But don't the semantics change in the sense that before
core.missingBlobCommand you couldn't write a new blob SHA1 that was
already part of your history, whereas with this change
write_sha1_file() might write what it considers to be a new blob that
actually collides with an existing blob, and write_sha1_file() doesn't
know that because knowing would involve asking the hook to fetch the
blob?

> And here's what happens when has_sha1_file (or another function listed
> in the commit message) is invoked:
> 1. check for existence of packed object of the requested name
> 2. check for existence of loose object of the requested name
> 3. check again for existence of packed object of the requested name
> 4. if a hook is defined, invoke the hook and repeat 1-3
>
> Here, in step 4, the hook could do whatever it wants to the repository.

This might be a bit of early bikeshedding, but then again whatever
escapes early bikeshedding tends to turn into a de facto standard.

Wouldn't it be better to name this core.missingObjectCommand & have
the hook take a list on stdin like:

    <id> <TAB> <object_type> <TAB> <object_id> <TAB> <request_type> <TAB> [....]

And have the hook respond:

    <id> <TAB> <status> <TAB> [....]

I.e. what you'd do now is send this to the hook:

    1 <TAB> blob <TAB> <sha1> <TAB> missing

And the hook would respond:

    1 <TAB> ok

But this leaves open the door addressing this potential edge case with
writing new blobs in the future, i.e. write_sha1_file() could call it
as:

    1 <TAB> blob <TAB> <sha1> <TAB> new

And the hook could either respond immediately as:

    1 <TAB> ok

If it's in some #YOLO mode where it's not going to check for colliding
blobs over the network; or, alternatively, it could ask the parent repo
whether it has those blobs, and if so print:

    1 <TAB> collision

Or something like that.

This also enables future lazy loading of trees/commits from the same
API, and for the hook to respond out-of-order to the input it gets as
it can, since each request is prefixed with an incrementing request
id.
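
To make the idea concrete, a hook speaking this proposed protocol could
be a small script along these lines (purely a sketch of the proposal
above, not anything Git implements; the upstream path and the "error"
status are made up):

    #!/bin/sh
    # read tab-separated requests: <id> <type> <oid> <request_type>
    UPSTREAM=/path/to/upstream.git
    TAB=$(printf '\t')
    while IFS="$TAB" read -r id type oid request
    do
        case "$request" in
        missing)
            # try to copy the object from the upstream repo into the
            # local one, then report the per-request status
            if echo "$oid" |
                    git -C "$UPSTREAM" pack-objects --stdout 2>/dev/null |
                    git unpack-objects -q
            then
                printf '%s\tok\n' "$id"
            else
                printf '%s\terror\n' "$id"
            fi
            ;;
        new)
            # "#YOLO mode": do not check upstream for collisions
            printf '%s\tok\n' "$id"
            ;;
        esac
    done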


* Re: Proposal for missing blob support in Git repos
From: Jonathan Tan @ 2017-05-02 21:45 UTC
  To: Ævar Arnfjörð Bjarmason
  Cc: Junio C Hamano, Git mailing list, Mark Thomas, Jeff Hostetler,
	Kevin David

On 05/02/2017 11:32 AM, Ævar Arnfjörð Bjarmason wrote:
> On Tue, May 2, 2017 at 7:21 PM, Jonathan Tan <jonathantanmy@google.com> wrote:
>> On Mon, May 1, 2017 at 6:41 PM, Junio C Hamano <gitster@pobox.com> wrote:
>>> Jonathan Tan <jonathantanmy@google.com> writes:
>>>
>>>> On 05/01/2017 04:29 PM, Junio C Hamano wrote:
>>>>> Jonathan Tan <jonathantanmy@google.com> writes:
>>>>>
>>>>>> Thanks for your comments. If you're referring to the codepath
>>>>>> involving write_sha1_file() (for example, builtin/hash-object ->
>>>>>> index_fd or builtin/unpack-objects), that is fine because
>>>>>> write_sha1_file() invokes freshen_packed_object() and
>>>>>> freshen_loose_object() directly to check if the object already exists
>>>>>> (and thus does not invoke the new mechanism in this patch).
>>>>>
>>>>> Is that a good thing, though?  It means that you an attacker can
>>>>> feed one version to the remote object store your "grab blob" hook
>>>>> gets the blobs from, and have you add a colliding object locally,
>>>>> and the usual "are we recording the same object as existing one?"
>>>>> check is bypassed.
>>>>
>>>> If I understand this correctly, what you mean is the situation where
>>>> the hook adds an object to the local repo, overriding another object
>>>> of the same name?
>>>
>>> No.
>>>
>>> write_sha1_file() pays attention to objects already in the local
>>> object store to avoid hash collisions that can be used to replace a
>>> known-to-be-good object and that is done as a security measure.
>>> What I am reading in your response was that this new mechanism
>>> bypasses that, and I was wondering if that is a good thing.
>>
>> Oh, what I meant was that write_sha1_file() bypasses the new
>> mechanism, not that the new mechanism bypasses the checks in
>> write_sha1_file().
>>
>> To be clear, here's what happens when write_sha1_file() is invoked
>> (before and after this patch - this patch does not affect
>> write_sha1_file at all):
>> 1. (some details omitted)
>> 2. call freshen_packed_object
>> 3, call freshen_loose_object if necessary
>> 4. write object (if freshen_packed_object and freshen_loose_object do
>> not both return 0)
>>
>> Nothing changes in this patch (whether a hook is defined or not).
>
> But don't the semantics change in the sense that before
> core.missingBlobCommand you couldn't write a new blob SHA1 that was
> already part of your history,

Strictly speaking, you can already do this if you don't have the blob in 
your local repo (for example, with shallow clones - you likely wouldn't 
have blobs pointed to by historical commits outside whatever depth is set).

 > whereas with this change
> write_sha1_file() might write what it considers to be a new blob, but
> it's actually colliding with an existing blob, but write_sha1_file()
> doesn't know that because knowing would involve asking the hook to
> fetch the blob?

Yes, this might happen.

I see the semantics as "don't write what you already have", where "have"
means what you have in local storage, but if you extend "have" to what
upstream has, then yes, you're right that the semantics change (ignoring
shallow clones).

This does remove a resistance that we have against hash collision (in 
that normally we would have the correct object for a given hash and can 
resist other servers trying to introduce a wrong object, but now that is 
no longer the case), but I think it's better than consulting the hook 
whenever you want to write anything (which is also a change in semantics 
in that you're consulting an external source whenever you're writing an 
object, besides the performance implications).

>> And here's what happens when has_sha1_file (or another function listed
>> in the commit message) is invoked:
>> 1. check for existence of packed object of the requested name
>> 2. check for existence of loose object of the requested name
>> 3. check again for existence of packed object of the requested name
>> 4. if a hook is defined, invoke the hook and repeat 1-3
>>
>> Here, in step 4, the hook could do whatever it wants to the repository.
>
> This might be a bit of early bikeshedding, but then again the lack of
> early bikeshedding tends to turn into standards.
>
> Wouldn't it be better to name this core.missingObjectCommand & have
> the hook take a list on stdin like:
>
>     <id> <TAB> <object_type> <TAB> <object_id> <TAB> <request_type> <TAB> [....]
>
> And have the hook respond:
>
>     <id> <TAB> <status> <TAB> [....]
>
> I.e. what you'd do now is send this to the hook:
>
>     1 <TAB> blob <TAB> <sha1> <TAB> missing
>
> And the hook would respond:
>
>     1 <TAB> ok
>
> But this leaves open the door addressing this potential edge case with
> writing new blobs in the future, i.e. write_sha1_file() could call it
> as:
>
>     1 <TAB> blob <TAB> <sha1> <TAB> new
>
> And the hook could either respond immediately as:
>
>     1 <TAB> ok
>
> If it's in some #YOLO mode where it's not going to check for colliding
> blobs over the network, or alternatively & ask the parent repo if it
> has those blobs, and if so print:
>
>     1 <TAB> collision
>
> Or something like that.
>
> This also enables future lazy loading of trees/commits from the same
> API, and for the hook to respond out-of-order to the input it gets as
> it can, since each request is prefixed with an incrementing request
> id.

My initial thought is that it would be better to extend hook support by
adding configuration options for separate hooks rather than by extending
a single hook protocol. For example, with the latter, the hook would
probably need to declare capabilities so that Git knows what the hook
supports; also, even if the user and the hook have no issue with Git
writing any object, Git would still have to invoke the hook, since it
would have no way of knowing that.

Having said that, I'm not too attached to the existing protocol and it 
can change to something like what you suggested if necessary.


* Re: Proposal for missing blob support in Git repos
From: Junio C Hamano @ 2017-05-04  4:29 UTC
  To: Jonathan Tan
  Cc: Ævar Arnfjörð Bjarmason, Git mailing list,
	Mark Thomas, Jeff Hostetler, Kevin David

Jonathan Tan <jonathantanmy@google.com> writes:

> I see the semantics as "don't write what you already have", where
> "have" means what you have in local storage, but if you extend "have"
> to what upstream has, then yes, you're right that this changes
> (ignoring shallow clones).
>
> This does remove a resistance that we have against hash collision (in
> that normally we would have the correct object for a given hash and
> can resist other servers trying to introduce a wrong object, but now
> that is no longer the case), but I think it's better than consulting
> the hook whenever you want to write anything (which is also a change
> in semantics in that you're consulting an external source whenever
> you're writing an object, besides the performance implications).

As long as the above pros-and-cons analysis is understood and we are
striking a balance between performance and strictness with such an
understanding of the implications, I am perfectly fine with the
proposal.  That is why my comment has never been "I think that is
wrong" but has consistently been "I wonder if that is a good thing."

Thanks.


* Re: Proposal for missing blob support in Git repos
From: Jonathan Tan @ 2017-05-04 17:09 UTC
  To: Junio C Hamano
  Cc: Ævar Arnfjörð Bjarmason, Git mailing list,
	Mark Thomas, Jeff Hostetler, Kevin David

On 05/03/2017 09:29 PM, Junio C Hamano wrote:
> Jonathan Tan <jonathantanmy@google.com> writes:
>
>> I see the semantics as "don't write what you already have", where
>> "have" means what you have in local storage, but if you extend "have"
>> to what upstream has, then yes, you're right that this changes
>> (ignoring shallow clones).
>>
>> This does remove a resistance that we have against hash collision (in
>> that normally we would have the correct object for a given hash and
>> can resist other servers trying to introduce a wrong object, but now
>> that is no longer the case), but I think it's better than consulting
>> the hook whenever you want to write anything (which is also a change
>> in semantics in that you're consulting an external source whenever
>> you're writing an object, besides the performance implications).
>
> As long as the above pros-and-cons analysis is understood and we are
> striking a balance between performance and strictness with such an
> understanding of the implications, I am perfectly fine with the
> proposal.  That is why my comment has never been "I think that is
> wrong" but consistently was "I wonder if that is a good thing."
>
> Thanks.

Noted - if/when I update the patch, I'll include this information in it.
