git@vger.kernel.org mailing list mirror (one of many)
* [RFC/PATCH v2 0/1] Add support for downloading blobs on demand
@ 2017-07-14 13:26 Ben Peart
  2017-07-14 13:26 ` [PATCH v2 1/1] sha1_file: " Ben Peart
  0 siblings, 1 reply; 6+ messages in thread
From: Ben Peart @ 2017-07-14 13:26 UTC (permalink / raw)
  To: git; +Cc: gitster, benpeart, pclouds, christian.couder, git, jonathantanmy


This patch series enables git to request missing objects when they are
not found in the object store. This is a fault-in model where the
"read-object" sub-process will fetch the missing object and store it in
the object store as a loose, alternate, or pack file. On success, git
will retry the operation and find the requested object.

It utilizes the recent sub-process refactoring to spawn a "read-object"
hook as a sub-process on the first request and then all subsequent
requests are made to that existing sub-process. This significantly
reduces the cost of making multiple requests within a single git command.

An earlier version [1] of this patch series was pulled into [2], which
enables registering multiple ODB handlers that can be implemented as a
simple command or a sub-process. My hope is that this patch series can
be completed to meet the needs of the various efforts where faulting in
missing objects is required.

In the meantime, now that the sub-process refactoring has made it to
master, I’ve refactored this patch series to be as small and focused as
possible so it can be easily incorporated in other patch series where it
makes sense.

[3] has a need for faulting in missing objects and will be reworked to
take advantage of this patch series for its next iteration.

[4] has a similar function that uses a registered command instead of a
sub-process. Spawning a separate process for every missing object only
works if there are very few missing objects.  It does not scale well
when there are many missing objects (especially small objects like
commits and trees).
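To make the scaling concern concrete, here is a standalone Python sketch (not part of the patch, and not the hook protocol itself) comparing the per-object-spawn approach with a single long-running child process that answers requests over a pipe:

```python
import subprocess
import sys
import time

N = 30  # simulated number of missing objects

# Approach 1: spawn a new process for every request (does not scale).
t0 = time.monotonic()
for _ in range(N):
    subprocess.run([sys.executable, "-c", "pass"], check=True)
per_spawn = time.monotonic() - t0

# Approach 2: one long-running child; each request is a pipe round-trip.
child = subprocess.Popen(
    [sys.executable, "-c",
     "import sys\n"
     "while True:\n"
     "    line = sys.stdin.readline()\n"
     "    if not line:\n"
     "        break\n"
     "    sys.stdout.write(line)\n"
     "    sys.stdout.flush()"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)
t0 = time.monotonic()
for i in range(N):
    child.stdin.write("get %d\n" % i)
    child.stdin.flush()
    assert child.stdout.readline() == "get %d\n" % i
persistent = time.monotonic() - t0
child.stdin.close()
child.wait()

print("per-spawn: %.3fs  persistent: %.3fs" % (per_spawn, persistent))
```

The persistent child wins by a wide margin once there are many requests, which is the motivation for reusing the sub-process machinery here.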

Patch is available here:
https://github.com/benpeart/git-for-windows/commits/read-object-process
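For readers unfamiliar with the framing: the read-object protocol in the patch below is layered on Git's pkt-line format. A minimal Python sketch of that framing follows; the helper names are mine, not from the patch:

```python
import io

def pkt_line(text):
    """Frame one text packet: four hex digits giving the total length
    (payload plus the 4 length digits), then the LF-terminated payload."""
    payload = (text + "\n").encode()
    return b"%04x" % (len(payload) + 4) + payload

def pkt_flush():
    """The special '0000' packet that terminates a packet list."""
    return b"0000"

def read_pkt(stream):
    """Read one packet from a byte stream; None means a flush packet."""
    size = int(stream.read(4), 16)
    if size == 0:
        return None
    return stream.read(size - 4).decode().rstrip("\n")

# The welcome/version exchange Git performs when it starts the hook:
handshake = (pkt_line("git-read-object-client")
             + pkt_line("version=1")
             + pkt_flush())

s = io.BytesIO(handshake)
assert read_pkt(s) == "git-read-object-client"
assert read_pkt(s) == "version=1"
assert read_pkt(s) is None  # flush ends the list
```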


[RFC] Add support for downloading blobs on demand 
[1] https://public-inbox.org/git/20170113155253.1644-1-benpeart@microsoft.com/

[RFC/PATCH v4 00/49] Add initial experimental external ODB support 
[2] https://public-inbox.org/git/20170620075523.26961-1-chriscool@tuxfamily.org/

[PATCH v2 00/19] WIP object filtering for partial clone
[3] https://public-inbox.org/git/20170713173459.3559-1-git@jeffhostetler.com/

[RFC PATCH 0/3] Partial clone: promised blobs (formerly "missing blobs")
[4] https://public-inbox.org/git/cover.1499800530.git.jonathantanmy@google.com/

Ben Peart (1):
  sha1_file: Add support for downloading blobs on demand

 Documentation/technical/read-object-protocol.txt | 102 ++++++++++++
 cache.h                                          |   1 +
 config.c                                         |   5 +
 contrib/long-running-read-object/example.pl      | 114 +++++++++++++
 environment.c                                    |   1 +
 sha1_file.c                                      | 193 ++++++++++++++++++++++-
 t/t0410-read-object.sh                           |  27 ++++
 t/t0410/read-object                              | 114 +++++++++++++
 8 files changed, 550 insertions(+), 7 deletions(-)
 create mode 100644 Documentation/technical/read-object-protocol.txt
 create mode 100755 contrib/long-running-read-object/example.pl
 create mode 100755 t/t0410-read-object.sh
 create mode 100755 t/t0410/read-object

-- 
2.13.2.windows.1


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH v2 1/1] sha1_file: Add support for downloading blobs on demand
  2017-07-14 13:26 [RFC/PATCH v2 0/1] Add support for downloading blobs on demand Ben Peart
@ 2017-07-14 13:26 ` Ben Peart
  2017-07-14 15:18   ` Christian Couder
  2017-07-17 18:06   ` Jonathan Tan
  0 siblings, 2 replies; 6+ messages in thread
From: Ben Peart @ 2017-07-14 13:26 UTC (permalink / raw)
  To: git; +Cc: gitster, benpeart, pclouds, christian.couder, git, jonathantanmy

This patch series enables git to request missing objects when they are
not found in the object store. This is a fault-in model where the
"read-object" sub-process will fetch the missing object and store it in
the object store as a loose, alternate, or pack file. On success, git
will retry the operation and find the requested object.

It utilizes the recent sub-process refactoring to spawn a "read-object"
hook as a sub-process on the first request and then all subsequent
requests are made to that existing sub-process. This significantly
reduces the cost of making multiple requests within a single git command.

Signed-off-by: Ben Peart <benpeart@microsoft.com>
---
 Documentation/technical/read-object-protocol.txt | 102 ++++++++++++
 cache.h                                          |   1 +
 config.c                                         |   5 +
 contrib/long-running-read-object/example.pl      | 114 +++++++++++++
 environment.c                                    |   1 +
 sha1_file.c                                      | 193 ++++++++++++++++++++++-
 t/t0410-read-object.sh                           |  27 ++++
 t/t0410/read-object                              | 114 +++++++++++++
 8 files changed, 550 insertions(+), 7 deletions(-)
 create mode 100644 Documentation/technical/read-object-protocol.txt
 create mode 100755 contrib/long-running-read-object/example.pl
 create mode 100755 t/t0410-read-object.sh
 create mode 100755 t/t0410/read-object

diff --git a/Documentation/technical/read-object-protocol.txt b/Documentation/technical/read-object-protocol.txt
new file mode 100644
index 0000000000..a893b46e7c
--- /dev/null
+++ b/Documentation/technical/read-object-protocol.txt
@@ -0,0 +1,102 @@
+Read Object Process
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The read-object process enables Git to read all missing blobs with a
+single process invocation for the entire life of a single Git command.
+This is achieved by using a packet format (pkt-line, see technical/
+protocol-common.txt) based protocol over standard input and standard
+output as follows. Every packet exchanged by this protocol, except for
+the "0000" flush packet, is considered text and is therefore
+terminated by an LF.
+
+Git starts the process when it encounters the first missing object that
+needs to be retrieved. After the process is started, Git sends a welcome
+message ("git-read-object-client"), a list of supported protocol version
+numbers, and a flush packet. Git expects to read a welcome response
+message ("git-read-object-server"), exactly one protocol version number
+from the previously sent list, and a flush packet. All further
+communication will be based on the selected version.
+
+The remaining protocol description below documents "version=1". Please
+note that "version=42" in the example below does not exist and is only
+there to illustrate how the protocol would look with more than one
+version.
+
+After the version negotiation Git sends a list of all capabilities that
+it supports and a flush packet. Git expects to read a list of desired
+capabilities, which must be a subset of the supported capabilities list,
+and a flush packet as response:
+------------------------
+packet: git> git-read-object-client
+packet: git> version=1
+packet: git> version=42
+packet: git> 0000
+packet: git< git-read-object-server
+packet: git< version=1
+packet: git< 0000
+packet: git> capability=get
+packet: git> capability=have
+packet: git> capability=put
+packet: git> capability=not-yet-invented
+packet: git> 0000
+packet: git< capability=get
+packet: git< 0000
+------------------------
+The only supported capability in version 1 is "get".
+
+Afterwards Git sends a list of "key=value" pairs terminated with a flush
+packet. The list will contain at least the command (based on the
+supported capabilities) and the sha1 of the object to retrieve. Please
+note that the process must not send any response before it has received
+the final flush packet.
+
+When the process receives the "get" command, it should make the requested
+object available in the git object store and then return success. Git will
+then check the object store again and this time find it and proceed.
+------------------------
+packet: git> command=get
+packet: git> sha1=0a214a649e1b3d5011e14a3dc227753f2bd2be05
+packet: git> 0000
+------------------------
+
+The process is expected to respond with a list of "key=value" pairs
+terminated with a flush packet. If the process does not experience
+problems then the list must contain a "success" status.
+------------------------
+packet: git< status=success
+packet: git< 0000
+------------------------
+
+In case the process cannot or does not want to retrieve the object, it
+is expected to respond with an "error" status.
+------------------------
+packet: git< status=error
+packet: git< 0000
+------------------------
+
+In case the process cannot or does not want to retrieve the object, or
+any future objects for the lifetime of the Git process, it is expected
+to respond with an "abort" status at any point in the
+protocol.
+------------------------
+packet: git< status=abort
+packet: git< 0000
+------------------------
+
+Git neither stops nor restarts the process in case the "error"/"abort"
+status is set.
+
+If the process dies during the communication or does not adhere to the
+protocol then Git will stop the process and restart it with the next
+object that needs to be processed.
+
+After the read-object process has processed an object it is expected to
+wait for the next "key=value" list containing a command. Git will close
+the command pipe on exit. The process is expected to detect EOF and exit
+gracefully on its own. Git will wait until the process has stopped.
+
+A demo implementation of a long-running read-object process can be
+found in `contrib/long-running-read-object/example.pl` in the Git core
+repository. If you develop your own long-running process, the
+`GIT_TRACE_PACKET` environment variable can be very helpful for
+debugging (see linkgit:git[1]).
diff --git a/cache.h b/cache.h
index 71fe092644..914379724f 100644
--- a/cache.h
+++ b/cache.h
@@ -804,6 +804,7 @@ enum log_refs_config {
 	LOG_REFS_ALWAYS
 };
 extern enum log_refs_config log_all_ref_updates;
+extern int core_virtualize_objects;
 
 enum branch_track {
 	BRANCH_TRACK_UNSPECIFIED = -1,
diff --git a/config.c b/config.c
index a9356c1383..cc6e8f3237 100644
--- a/config.c
+++ b/config.c
@@ -1241,6 +1241,11 @@ static int git_default_core_config(const char *var, const char *value)
 		return 0;
 	}
 
+	if (!strcmp(var, "core.virtualizeobjects")) {
+		core_virtualize_objects = git_config_bool(var, value);
+		return 0;
+	}
+
 	/* Add other config variables here and to Documentation/config.txt. */
 	return 0;
 }
diff --git a/contrib/long-running-read-object/example.pl b/contrib/long-running-read-object/example.pl
new file mode 100755
index 0000000000..b8f37f836a
--- /dev/null
+++ b/contrib/long-running-read-object/example.pl
@@ -0,0 +1,114 @@
+#!/usr/bin/perl
+#
+# Example implementation for the Git read-object protocol version 1
+# See Documentation/technical/read-object-protocol.txt
+#
+# Allows you to test the ability for blobs to be pulled from a host git repo
+# "on demand."  Called when git needs a blob it couldn't find locally due to
+# a lazy clone that only cloned the commits and trees.
+#
+# A lazy clone can be simulated via the following commands from the host repo
+# you wish to create a lazy clone of:
+#
+# cd /host_repo
+# git rev-parse HEAD
+# git init /guest_repo
+# git cat-file --batch-check --batch-all-objects | grep -v 'blob' |
+#	cut -d' ' -f1 | git pack-objects /guest_repo/.git/objects/pack/noblobs
+# cd /guest_repo
+# git config core.virtualizeobjects true
+# git reset --hard <sha from rev-parse call above>
+#
+# Please note, this sample is a minimal skeleton. No proper error handling
+# was implemented.
+#
+
+use strict;
+use warnings;
+
+#
+# Point $DIR to the folder where your host git repo is located so we can pull
+# missing objects from it
+#
+my $DIR = "/host_repo/.git/";
+
+sub packet_bin_read {
+	my $buffer;
+	my $bytes_read = read STDIN, $buffer, 4;
+	if ( $bytes_read == 0 ) {
+
+		# EOF - Git stopped talking to us!
+		exit();
+	}
+	elsif ( $bytes_read != 4 ) {
+		die "invalid packet: '$buffer'";
+	}
+	my $pkt_size = hex($buffer);
+	if ( $pkt_size == 0 ) {
+		return ( 1, "" );
+	}
+	elsif ( $pkt_size > 4 ) {
+		my $content_size = $pkt_size - 4;
+		$bytes_read = read STDIN, $buffer, $content_size;
+		if ( $bytes_read != $content_size ) {
+			die "invalid packet ($content_size bytes expected; $bytes_read bytes read)";
+		}
+		return ( 0, $buffer );
+	}
+	else {
+		die "invalid packet size: $pkt_size";
+	}
+}
+
+sub packet_txt_read {
+	my ( $res, $buf ) = packet_bin_read();
+	unless ( $buf =~ s/\n$// ) {
+		die "A non-binary line MUST be terminated by an LF.";
+	}
+	return ( $res, $buf );
+}
+
+sub packet_bin_write {
+	my $buf = shift;
+	print STDOUT sprintf( "%04x", length($buf) + 4 );
+	print STDOUT $buf;
+	STDOUT->flush();
+}
+
+sub packet_txt_write {
+	packet_bin_write( $_[0] . "\n" );
+}
+
+sub packet_flush {
+	print STDOUT sprintf( "%04x", 0 );
+	STDOUT->flush();
+}
+
+( packet_txt_read() eq ( 0, "git-read-object-client" ) ) || die "bad initialize";
+( packet_txt_read() eq ( 0, "version=1" ) )				 || die "bad version";
+( packet_bin_read() eq ( 1, "" ) )                       || die "bad version end";
+
+packet_txt_write("git-read-object-server");
+packet_txt_write("version=1");
+packet_flush();
+
+( packet_txt_read() eq ( 0, "capability=get" ) )    || die "bad capability";
+( packet_bin_read() eq ( 1, "" ) )                  || die "bad capability end";
+
+packet_txt_write("capability=get");
+packet_flush();
+
+while (1) {
+	my ($command) = packet_txt_read() =~ /^command=([^=]+)$/;
+
+	if ( $command eq "get" ) {
+		my ($sha1) = packet_txt_read() =~ /^sha1=([0-9a-f]{40})$/;
+		packet_bin_read();
+
+		system ('git --git-dir="' . $DIR . '" cat-file blob ' . $sha1 . ' | git -c core.virtualizeobjects=false hash-object -w --stdin >/dev/null 2>&1');
+		packet_txt_write(($?) ? "status=error" : "status=success");
+		packet_flush();
+	} else {
+		die "bad command '$command'";
+	}
+}
diff --git a/environment.c b/environment.c
index 3fd4b10845..ad9403ed6c 100644
--- a/environment.c
+++ b/environment.c
@@ -66,6 +66,7 @@ int precomposed_unicode = -1; /* see probe_utf8_pathname_composition() */
 unsigned long pack_size_limit_cfg;
 enum hide_dotfiles_type hide_dotfiles = HIDE_DOTFILES_DOTGITONLY;
 enum log_refs_config log_all_ref_updates = LOG_REFS_UNSET;
+int core_virtualize_objects;
 
 #ifndef PROTECT_HFS_DEFAULT
 #define PROTECT_HFS_DEFAULT 0
diff --git a/sha1_file.c b/sha1_file.c
index 5862386cd0..06290f8647 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -28,6 +28,9 @@
 #include "list.h"
 #include "mergesort.h"
 #include "quote.h"
+#include "sigchain.h"
+#include "sub-process.h"
+#include "pkt-line.h"
 
 #define SZ_FMT PRIuMAX
 static inline uintmax_t sz_fmt(size_t s) { return s; }
@@ -647,6 +650,162 @@ void prepare_alt_odb(void)
 	read_info_alternates(get_object_directory(), 0);
 }
 
+#define CAP_GET    (1u<<0)
+
+static int subprocess_map_initialized;
+static struct hashmap subprocess_map;
+
+struct read_object_process {
+	struct subprocess_entry subprocess;
+	unsigned int supported_capabilities;
+};
+
+int start_read_object_fn(struct subprocess_entry *subprocess)
+{
+	int err;
+	struct read_object_process *entry = (struct read_object_process *)subprocess;
+	struct child_process *process;
+	struct string_list cap_list = STRING_LIST_INIT_NODUP;
+	char *cap_buf;
+	const char *cap_name;
+
+	process = subprocess_get_child_process(&entry->subprocess);
+
+	sigchain_push(SIGPIPE, SIG_IGN);
+
+	err = packet_writel(process->in, "git-read-object-client", "version=1", NULL);
+	if (err)
+		goto done;
+
+	err = strcmp(packet_read_line(process->out, NULL), "git-read-object-server");
+	if (err) {
+		error("external process '%s' does not support read-object protocol version 1", subprocess->cmd);
+		goto done;
+	}
+	err = strcmp(packet_read_line(process->out, NULL), "version=1");
+	if (err)
+		goto done;
+	err = packet_read_line(process->out, NULL) != NULL;
+	if (err)
+		goto done;
+
+	err = packet_writel(process->in, "capability=get", NULL);
+	if (err)
+		goto done;
+
+	for (;;) {
+		cap_buf = packet_read_line(process->out, NULL);
+		if (!cap_buf)
+			break;
+		string_list_split_in_place(&cap_list, cap_buf, '=', 1);
+
+		if (cap_list.nr != 2 || strcmp(cap_list.items[0].string, "capability"))
+			continue;
+
+		cap_name = cap_list.items[1].string;
+		if (!strcmp(cap_name, "get")) {
+			entry->supported_capabilities |= CAP_GET;
+		}
+		else {
+			warning(
+				"external process '%s' requested unsupported read-object capability '%s'",
+				subprocess->cmd, cap_name
+			);
+		}
+
+		string_list_clear(&cap_list, 0);
+	}
+
+done:
+	sigchain_pop(SIGPIPE);
+
+	if (err || errno == EPIPE)
+		return err ? err : errno;
+
+	return 0;
+}
+
+static int read_object_process(const unsigned char *sha1)
+{
+	int err;
+	struct read_object_process *entry;
+	struct child_process *process;
+	struct strbuf status = STRBUF_INIT;
+	const char *cmd = find_hook("read-object");
+	uint64_t start;
+
+	start = getnanotime();
+
+	if (!subprocess_map_initialized) {
+		subprocess_map_initialized = 1;
+		hashmap_init(&subprocess_map, (hashmap_cmp_fn)cmd2process_cmp, 0);
+		entry = NULL;
+	} else {
+		entry = (struct read_object_process *)subprocess_find_entry(&subprocess_map, cmd);
+	}
+	if (!entry) {
+		entry = xmalloc(sizeof(*entry));
+		entry->supported_capabilities = 0;
+
+		if (subprocess_start(&subprocess_map, &entry->subprocess, cmd, start_read_object_fn)) {
+			free(entry);
+			return -1;
+		}
+	}
+	process = subprocess_get_child_process(&entry->subprocess);
+
+	if (!(CAP_GET & entry->supported_capabilities))
+		return -1;
+
+	sigchain_push(SIGPIPE, SIG_IGN);
+
+	err = packet_write_fmt_gently(process->in, "command=get\n");
+	if (err)
+		goto done;
+
+	err = packet_write_fmt_gently(process->in, "sha1=%s\n", sha1_to_hex(sha1));
+	if (err)
+		goto done;
+
+	err = packet_flush_gently(process->in);
+	if (err)
+		goto done;
+
+	err = subprocess_read_status(process->out, &status);
+	err = err ? err : strcmp(status.buf, "success");
+
+done:
+	sigchain_pop(SIGPIPE);
+
+	if (err || errno == EPIPE) {
+		err = err ? err : errno;
+		if (!strcmp(status.buf, "error")) {
+			/* The process signaled a problem with the file. */
+		}
+		else if (!strcmp(status.buf, "abort")) {
+			/*
+			 * The process signaled a permanent problem. Don't try to read
+			 * objects with the same command for the lifetime of the current
+			 * Git process.
+			 */
+			entry->supported_capabilities &= ~CAP_GET;
+		}
+		else {
+			/*
+			 * Something went wrong with the read-object process.
+			 * Force shutdown and restart if needed.
+			 */
+			error("external process '%s' failed", cmd);
+			subprocess_stop(&subprocess_map, (struct subprocess_entry *)entry);
+			free(entry);
+		}
+	}
+
+	trace_performance_since(start, "read_object_process");
+
+	return err;
+}
+
 /* Returns 1 if we have successfully freshened the file, 0 otherwise. */
 static int freshen_file(const char *fn)
 {
@@ -690,8 +849,19 @@ static int check_and_freshen_nonlocal(const unsigned char *sha1, int freshen)
 
 static int check_and_freshen(const unsigned char *sha1, int freshen)
 {
-	return check_and_freshen_local(sha1, freshen) ||
-	       check_and_freshen_nonlocal(sha1, freshen);
+	int ret;
+	int already_retried = 0;
+
+retry:
+	ret = check_and_freshen_local(sha1, freshen) ||
+		check_and_freshen_nonlocal(sha1, freshen);
+	if (!ret && core_virtualize_objects && !already_retried) {
+		already_retried = 1;
+		if (!read_object_process(sha1))
+			goto retry;
+	}
+
+	return ret;
 }
 
 int has_loose_object_nonlocal(const unsigned char *sha1)
@@ -2983,6 +3153,7 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
 	const unsigned char *real = (flags & OBJECT_INFO_LOOKUP_REPLACE) ?
 				    lookup_replace_object(sha1) :
 				    sha1;
+	int already_retried = 0;
 
 	if (!oi)
 		oi = &blank_oi;
@@ -3007,6 +3178,7 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
 		}
 	}
 
+retry:
 	if (!find_pack_entry(real, &e)) {
 		/* Most likely it's a loose object. */
 		if (!sha1_loose_object_info(real, oi, flags)) {
@@ -3015,13 +3187,19 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
 		}
 
 		/* Not a loose object; someone else may have just packed it. */
-		if (flags & OBJECT_INFO_QUICK) {
-			return -1;
-		} else {
+		if (!(flags & OBJECT_INFO_QUICK)) {
 			reprepare_packed_git();
-			if (!find_pack_entry(real, &e))
-				return -1;
+			if (find_pack_entry(real, &e))
+				goto found_packed;
+		}
+
+		/* Request the object be retrieved */
+		if (core_virtualize_objects && !already_retried) {
+			already_retried = 1;
+			if (!read_object_process(sha1))
+				goto retry;
 		}
+		return -1;
 	}
 
 	if (oi == &blank_oi)
@@ -3031,6 +3209,7 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
 		 */
 		return 0;
 
+found_packed:
 	rtype = packed_object_info(e.p, e.offset, oi);
 	if (rtype < 0) {
 		mark_bad_packed_object(e.p, real);
diff --git a/t/t0410-read-object.sh b/t/t0410-read-object.sh
new file mode 100755
index 0000000000..b8d7521c2c
--- /dev/null
+++ b/t/t0410-read-object.sh
@@ -0,0 +1,27 @@
+#!/bin/sh
+
+test_description='tests for long running read-object process'
+
+. ./test-lib.sh
+
+test_expect_success 'setup host repo with a root commit' '
+	test_commit zero &&
+	hash1=$(git ls-tree HEAD | grep zero.t | cut -f1 | cut -d\  -f3)
+'
+
+test_expect_success 'blobs can be retrieved from the host repo' '
+	git init guest-repo &&
+	(cd guest-repo &&
+	 mkdir -p .git/hooks &&
+	 cp $TEST_DIRECTORY/t0410/read-object .git/hooks/ &&
+	 git config core.virtualizeobjects true &&
+	 git cat-file blob "$hash1")
+'
+
+test_expect_success 'invalid blobs generate errors' '
+	(cd guest-repo &&
+	 test_must_fail git cat-file blob "invalid")
+'
+
+
+test_done
diff --git a/t/t0410/read-object b/t/t0410/read-object
new file mode 100755
index 0000000000..85e997c930
--- /dev/null
+++ b/t/t0410/read-object
@@ -0,0 +1,114 @@
+#!/usr/bin/perl
+#
+# Example implementation for the Git read-object protocol version 1
+# See Documentation/technical/read-object-protocol.txt
+#
+# Allows you to test the ability for blobs to be pulled from a host git repo
+# "on demand."  Called when git needs a blob it couldn't find locally due to
+# a lazy clone that only cloned the commits and trees.
+#
+# A lazy clone can be simulated via the following commands from the host repo
+# you wish to create a lazy clone of:
+#
+# cd /host_repo
+# git rev-parse HEAD
+# git init /guest_repo
+# git cat-file --batch-check --batch-all-objects | grep -v 'blob' |
+#	cut -d' ' -f1 | git pack-objects /guest_repo/.git/objects/pack/noblobs
+# cd /guest_repo
+# git config core.virtualizeobjects true
+# git reset --hard <sha from rev-parse call above>
+#
+# Please note, this sample is a minimal skeleton. No proper error handling
+# was implemented.
+#
+
+use strict;
+use warnings;
+
+#
+# Point $DIR to the folder where your host git repo is located so we can pull
+# missing objects from it
+#
+my $DIR = "../.git/";
+
+sub packet_bin_read {
+	my $buffer;
+	my $bytes_read = read STDIN, $buffer, 4;
+	if ( $bytes_read == 0 ) {
+
+		# EOF - Git stopped talking to us!
+		exit();
+	}
+	elsif ( $bytes_read != 4 ) {
+		die "invalid packet: '$buffer'";
+	}
+	my $pkt_size = hex($buffer);
+	if ( $pkt_size == 0 ) {
+		return ( 1, "" );
+	}
+	elsif ( $pkt_size > 4 ) {
+		my $content_size = $pkt_size - 4;
+		$bytes_read = read STDIN, $buffer, $content_size;
+		if ( $bytes_read != $content_size ) {
+			die "invalid packet ($content_size bytes expected; $bytes_read bytes read)";
+		}
+		return ( 0, $buffer );
+	}
+	else {
+		die "invalid packet size: $pkt_size";
+	}
+}
+
+sub packet_txt_read {
+	my ( $res, $buf ) = packet_bin_read();
+	unless ( $buf =~ s/\n$// ) {
+		die "A non-binary line MUST be terminated by an LF.";
+	}
+	return ( $res, $buf );
+}
+
+sub packet_bin_write {
+	my $buf = shift;
+	print STDOUT sprintf( "%04x", length($buf) + 4 );
+	print STDOUT $buf;
+	STDOUT->flush();
+}
+
+sub packet_txt_write {
+	packet_bin_write( $_[0] . "\n" );
+}
+
+sub packet_flush {
+	print STDOUT sprintf( "%04x", 0 );
+	STDOUT->flush();
+}
+
+( packet_txt_read() eq ( 0, "git-read-object-client" ) ) || die "bad initialize";
+( packet_txt_read() eq ( 0, "version=1" ) )				 || die "bad version";
+( packet_bin_read() eq ( 1, "" ) )                       || die "bad version end";
+
+packet_txt_write("git-read-object-server");
+packet_txt_write("version=1");
+packet_flush();
+
+( packet_txt_read() eq ( 0, "capability=get" ) )    || die "bad capability";
+( packet_bin_read() eq ( 1, "" ) )                  || die "bad capability end";
+
+packet_txt_write("capability=get");
+packet_flush();
+
+while (1) {
+	my ($command) = packet_txt_read() =~ /^command=([^=]+)$/;
+
+	if ( $command eq "get" ) {
+		my ($sha1) = packet_txt_read() =~ /^sha1=([0-9a-f]{40})$/;
+		packet_bin_read();
+
+		system ('git --git-dir="' . $DIR . '" cat-file blob ' . $sha1 . ' | git -c core.virtualizeobjects=false hash-object -w --stdin >/dev/null 2>&1');
+		packet_txt_write(($?) ? "status=error" : "status=success");
+		packet_flush();
+	} else {
+		die "bad command '$command'";
+	}
+}
-- 
2.13.2.windows.1



* Re: [PATCH v2 1/1] sha1_file: Add support for downloading blobs on demand
  2017-07-14 13:26 ` [PATCH v2 1/1] sha1_file: " Ben Peart
@ 2017-07-14 15:18   ` Christian Couder
  2017-07-17 18:06   ` Jonathan Tan
  1 sibling, 0 replies; 6+ messages in thread
From: Christian Couder @ 2017-07-14 15:18 UTC (permalink / raw)
  To: Ben Peart
  Cc: git, Junio C Hamano, Ben Peart, Nguyen Thai Ngoc Duy,
	Jeff Hostetler, Jonathan Tan

On Fri, Jul 14, 2017 at 3:26 PM, Ben Peart <peartben@gmail.com> wrote:

> diff --git a/contrib/long-running-read-object/example.pl b/contrib/long-running-read-object/example.pl
> new file mode 100755
> index 0000000000..b8f37f836a
> --- /dev/null
> +++ b/contrib/long-running-read-object/example.pl

[...]

> +sub packet_bin_read {
> +       my $buffer;
> +       my $bytes_read = read STDIN, $buffer, 4;
> +       if ( $bytes_read == 0 ) {
> +
> +               # EOF - Git stopped talking to us!
> +               exit();
> +       }
> +       elsif ( $bytes_read != 4 ) {
> +               die "invalid packet: '$buffer'";
> +       }
> +       my $pkt_size = hex($buffer);
> +       if ( $pkt_size == 0 ) {
> +               return ( 1, "" );
> +       }
> +       elsif ( $pkt_size > 4 ) {
> +               my $content_size = $pkt_size - 4;
> +               $bytes_read = read STDIN, $buffer, $content_size;
> +               if ( $bytes_read != $content_size ) {
> +                       die "invalid packet ($content_size bytes expected; $bytes_read bytes read)";
> +               }
> +               return ( 0, $buffer );
> +       }
> +       else {
> +               die "invalid packet size: $pkt_size";
> +       }
> +}
> +
> +sub packet_txt_read {
> +       my ( $res, $buf ) = packet_bin_read();
> +       unless ( $buf =~ s/\n$// ) {
> +               die "A non-binary line MUST be terminated by an LF.";
> +       }
> +       return ( $res, $buf );
> +}
> +
> +sub packet_bin_write {
> +       my $buf = shift;
> +       print STDOUT sprintf( "%04x", length($buf) + 4 );
> +       print STDOUT $buf;
> +       STDOUT->flush();
> +}
> +
> +sub packet_txt_write {
> +       packet_bin_write( $_[0] . "\n" );
> +}
> +
> +sub packet_flush {
> +       print STDOUT sprintf( "%04x", 0 );
> +       STDOUT->flush();
> +}

The above could reuse the refactoring of t0021/rot13-filter.pl into
perl/Git/Packet.pm that is at the beginning of my patch series.

> diff --git a/t/t0410/read-object b/t/t0410/read-object
> new file mode 100755
> index 0000000000..85e997c930
> --- /dev/null
> +++ b/t/t0410/read-object

[...]

> +sub packet_bin_read {
> +       my $buffer;
> +       my $bytes_read = read STDIN, $buffer, 4;
> +       if ( $bytes_read == 0 ) {
> +
> +               # EOF - Git stopped talking to us!
> +               exit();
> +       }
> +       elsif ( $bytes_read != 4 ) {
> +               die "invalid packet: '$buffer'";
> +       }
> +       my $pkt_size = hex($buffer);
> +       if ( $pkt_size == 0 ) {
> +               return ( 1, "" );
> +       }
> +       elsif ( $pkt_size > 4 ) {
> +               my $content_size = $pkt_size - 4;
> +               $bytes_read = read STDIN, $buffer, $content_size;
> +               if ( $bytes_read != $content_size ) {
> +                       die "invalid packet ($content_size bytes expected; $bytes_read bytes read)";
> +               }
> +               return ( 0, $buffer );
> +       }
> +       else {
> +               die "invalid packet size: $pkt_size";
> +       }
> +}
> +
> +sub packet_txt_read {
> +       my ( $res, $buf ) = packet_bin_read();
> +       unless ( $buf =~ s/\n$// ) {
> +               die "A non-binary line MUST be terminated by an LF.";
> +       }
> +       return ( $res, $buf );
> +}
> +
> +sub packet_bin_write {
> +       my $buf = shift;
> +       print STDOUT sprintf( "%04x", length($buf) + 4 );
> +       print STDOUT $buf;
> +       STDOUT->flush();
> +}
> +
> +sub packet_txt_write {
> +       packet_bin_write( $_[0] . "\n" );
> +}
> +
> +sub packet_flush {
> +       print STDOUT sprintf( "%04x", 0 );
> +       STDOUT->flush();
> +}

Same thing about the above and perl/Git/Packet.pm.

Otherwise thanks for updating this!


* Re: [PATCH v2 1/1] sha1_file: Add support for downloading blobs on demand
  2017-07-14 13:26 ` [PATCH v2 1/1] sha1_file: " Ben Peart
  2017-07-14 15:18   ` Christian Couder
@ 2017-07-17 18:06   ` Jonathan Tan
  2017-07-17 20:09     ` Ben Peart
  1 sibling, 1 reply; 6+ messages in thread
From: Jonathan Tan @ 2017-07-17 18:06 UTC (permalink / raw)
  To: Ben Peart; +Cc: git, gitster, benpeart, pclouds, christian.couder, git

About the difference between this patch and my patch set [1], besides
the fact that this patch does not spawn separate processes for each
missing object, which does seem like an improvement to me, this patch
(i) does not use a list of promised objects (but instead communicates
with the hook for each missing object), and (ii) provides backwards
compatibility with other Git code (that does not know about promised
objects) in a different way.

The costs and benefits of (i) are being discussed here [2]. As for (ii),
I still think that my approach is better - I have commented more about
this below.

Maybe the best approach is a combination of both our approaches.

[1] https://public-inbox.org/git/34efd9e9936fdab331655f5a33a098a72dc134f4.1499800530.git.jonathantanmy@google.com/

[2] https://public-inbox.org/git/20170713123951.5cab1adc@twelve2.svl.corp.google.com/

On Fri, 14 Jul 2017 09:26:51 -0400
Ben Peart <peartben@gmail.com> wrote:

> +------------------------
> +packet: git> command=get
> +packet: git> sha1=0a214a649e1b3d5011e14a3dc227753f2bd2be05
> +packet: git> 0000
> +------------------------

It would be useful to have this command support more than one SHA-1, so
that hooks that know how to batch can do so.
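A hypothetical batched exchange (my sketch, not something this patch implements) could keep the same pkt-line framing and simply allow several "sha1=" packets per "get" command:

```python
def pkt(text):
    """pkt-line framing: 4 hex length digits, then LF-terminated payload."""
    payload = (text + "\n").encode()
    return b"%04x" % (len(payload) + 4) + payload

# Two objects requested in one round-trip; the second SHA-1 is made up
# purely for illustration.
shas = [
    "0a214a649e1b3d5011e14a3dc227753f2bd2be05",
    "deadbeefdeadbeefdeadbeefdeadbeefdeadbeef",
]
request = (pkt("command=get")
           + b"".join(pkt("sha1=" + s) for s in shas)
           + b"0000")  # flush terminates the request

print(request.decode())
```

A hook that understands this shape could fetch all listed objects with one network round-trip instead of one per object.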

> +static int subprocess_map_initialized;
> +static struct hashmap subprocess_map;

The documentation of "tablesize" in "struct hashmap" states that it can
be used to check if the hashmap is initialized, so
subprocess_map_initialized is probably unnecessary.

>  static int check_and_freshen(const unsigned char *sha1, int freshen)
>  {
> -	return check_and_freshen_local(sha1, freshen) ||
> -	       check_and_freshen_nonlocal(sha1, freshen);
> +	int ret;
> +	int already_retried = 0;
> +
> +retry:
> +	ret = check_and_freshen_local(sha1, freshen) ||
> +		check_and_freshen_nonlocal(sha1, freshen);
> +	if (!ret && core_virtualize_objects && !already_retried) {
> +		already_retried = 1;
> +		if (!read_object_process(sha1))
> +			goto retry;
> +	}
> +
> +	return ret;
>  }

Is this change meant to ensure that Git code that operates on loose
objects directly (bypassing storage-agnostic functions such as
sha1_object_info_extended() and has_sha1_file()) still works? If yes,
this patch appears incomplete (for example, read_loose_object() needs to
be changed too), and this seems like a difficult task - in my patch set
[1], I ended up deciding to create a separate type of storage and
instead looked at the code that operates on *packed* objects directly
(because there were fewer such methods) to ensure that they would work
correctly in the presence of a separate type of storage.

[1] https://public-inbox.org/git/34efd9e9936fdab331655f5a33a098a72dc134f4.1499800530.git.jonathantanmy@google.com/
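The control flow of the quoted check_and_freshen() hunk, isolated as a stand-alone sketch with stubbed lookup and hook functions (the real ones live in sha1_file.c): try the local lookup, and on a miss consult the hook exactly once before giving up.

```c
#include <assert.h>

static int object_present;  /* stand-in for the object store */
static int hook_calls;      /* how many times the hook ran */

static int lookup_local(void)
{
	return object_present;
}

/* Stand-in for read_object_process(): returns 0 on success, after
 * which the object is expected to be findable locally. */
static int fetch_via_hook(void)
{
	hook_calls++;
	object_present = 1;
	return 0;
}

static int check_with_fault_in(void)
{
	int ret;
	int already_retried = 0;
retry:
	ret = lookup_local();
	if (!ret && !already_retried) {
		already_retried = 1;
		if (!fetch_via_hook())
			goto retry;
	}
	return ret;
}
```

The already_retried flag guarantees the hook is consulted at most once per lookup, so a hook that reports success without actually storing the object cannot cause an infinite loop.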


* Re: [PATCH v2 1/1] sha1_file: Add support for downloading blobs on demand
  2017-07-17 18:06   ` Jonathan Tan
@ 2017-07-17 20:09     ` Ben Peart
  2017-07-17 23:24       ` Jonathan Tan
  0 siblings, 1 reply; 6+ messages in thread
From: Ben Peart @ 2017-07-17 20:09 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, gitster, benpeart, pclouds, christian.couder, git



On 7/17/2017 2:06 PM, Jonathan Tan wrote:
> About the difference between this patch and my patch set [1], besides
> the fact that this patch does not spawn separate processes for each
> missing object, which does seem like an improvement to me, this patch
> (i) does not use a list of promised objects (but instead communicates
> with the hook for each missing object), and (ii) provides backwards
> compatibility with other Git code (that does not know about promised
> objects) in a different way.
> 
> The costs and benefits of (i) are being discussed here [2]. As for (ii),
> I still think that my approach is better - I have commented more about
> this below.
> 
> Maybe the best approach is a combination of both our approaches.

Yes, in the context of the promised objects model patch series, the 
value of this patch series is primarily as a sample of how to use the 
sub-process mechanism to create a versioned interface for retrieving 
objects.

> 
> [1] https://public-inbox.org/git/34efd9e9936fdab331655f5a33a098a72dc134f4.1499800530.git.jonathantanmy@google.com/
> 
> [2] https://public-inbox.org/git/20170713123951.5cab1adc@twelve2.svl.corp.google.com/
> 
> On Fri, 14 Jul 2017 09:26:51 -0400
> Ben Peart <peartben@gmail.com> wrote:
> 
>> +------------------------
>> +packet: git> command=get
>> +packet: git> sha1=0a214a649e1b3d5011e14a3dc227753f2bd2be05
>> +packet: git> 0000
>> +------------------------
> 
> It would be useful to have this command support more than one SHA-1, so
> that hooks that know how to batch can do so.
> 

I agree.  Since nothing was using that capability yet, I decided to keep 
it simple and leave batching out.  The reason the interface is versioned 
is so that if/when something does need that capability, it can be added.
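To illustrate why versioning makes that possible (modeled on git's long-running process protocol; the exact strings and version numbers are defined by the patch's documentation, not quoted here): client and hook agree on the highest version both support, so a later version can add batching without breaking older hooks.

```c
#include <assert.h>

/* Pick the highest protocol version supported by both sides;
 * -1 means no common version and the client should refuse to
 * use the hook. */
static int negotiate_version(const int *client, int nclient,
			     const int *server, int nserver)
{
	int best = -1, i, j;

	for (i = 0; i < nclient; i++)
		for (j = 0; j < nserver; j++)
			if (client[i] == server[j] && client[i] > best)
				best = client[i];
	return best;
}
```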

>> +static int subprocess_map_initialized;
>> +static struct hashmap subprocess_map;
> 
> The documentation of "tablesize" in "struct hashmap" states that it can
> be used to check if the hashmap is initialized, so
> subprocess_map_initialized is probably unnecessary.
> 

Nice.  That will make things a little simpler.

>>   static int check_and_freshen(const unsigned char *sha1, int freshen)
>>   {
>> -	return check_and_freshen_local(sha1, freshen) ||
>> -	       check_and_freshen_nonlocal(sha1, freshen);
>> +	int ret;
>> +	int already_retried = 0;
>> +
>> +retry:
>> +	ret = check_and_freshen_local(sha1, freshen) ||
>> +		check_and_freshen_nonlocal(sha1, freshen);
>> +	if (!ret && core_virtualize_objects && !already_retried) {
>> +		already_retried = 1;
>> +		if (!read_object_process(sha1))
>> +			goto retry;
>> +	}
>> +
>> +	return ret;
>>   }
> 
> Is this change meant to ensure that Git code that operates on loose
> objects directly (bypassing storage-agnostic functions such as
> sha1_object_info_extended() and has_sha1_file()) still works? If yes,
> this patch appears incomplete (for example, read_loose_object() needs to
> be changed too), and this seems like a difficult task - in my patch set
> [1], I ended up deciding to create a separate type of storage and
> instead looked at the code that operates on *packed* objects directly
> (because there were fewer such methods) to ensure that they would work
> correctly in the presence of a separate type of storage.
> 

Yes, with this set of patches, we've been running successfully on 
completely sparse clones (no commits, trees, or blobs) for several 
months.  read_loose_object() is only called by fsck when it is 
enumerating existing loose objects, so it does not need to be updated.

We have a few thousand developers making ~100K commits per week, so in 
our particular usage I'm fairly confident it works correctly.  That 
said, it is possible there is some code path I've missed. :)

> [1] https://public-inbox.org/git/34efd9e9936fdab331655f5a33a098a72dc134f4.1499800530.git.jonathantanmy@google.com/
> 


* Re: [PATCH v2 1/1] sha1_file: Add support for downloading blobs on demand
  2017-07-17 20:09     ` Ben Peart
@ 2017-07-17 23:24       ` Jonathan Tan
  0 siblings, 0 replies; 6+ messages in thread
From: Jonathan Tan @ 2017-07-17 23:24 UTC (permalink / raw)
  To: Ben Peart; +Cc: git, gitster, benpeart, pclouds, christian.couder, git

On Mon, 17 Jul 2017 16:09:17 -0400
Ben Peart <peartben@gmail.com> wrote:

> > Is this change meant to ensure that Git code that operates on loose
> > objects directly (bypassing storage-agnostic functions such as
> > sha1_object_info_extended() and has_sha1_file()) still works? If yes,
> > this patch appears incomplete (for example, read_loose_object() needs to
> > be changed too), and this seems like a difficult task - in my patch set
> > [1], I ended up deciding to create a separate type of storage and
> > instead looked at the code that operates on *packed* objects directly
> > (because there were fewer such methods) to ensure that they would work
> > correctly in the presence of a separate type of storage.
> > 
> 
> Yes, with this set of patches, we've been running successfully on 
> completely sparse clones (no commits, trees, or blobs) for several 
> months.  read_loose_object() is only called by fsck when it is 
> enumerating existing loose objects so does not need to be updated.

Ah, that's good to know. I think such an analysis (of the other
loose-related functions) in the commit message would be useful, like the
one I did for the packed-related functions [1].

[1] https://public-inbox.org/git/34efd9e9936fdab331655f5a33a098a72dc134f4.1499800530.git.jonathantanmy@google.com/

> We have a few thousand developers making ~100K commits per week so in 
> our particular usage, I'm fairly confident it works correctly.  That 
> said, it is possible there is some code path I've missed. :)

I think that code paths like the history-searching ones ("git log -S",
for example) should still work offline if possible - one of my ideas is
to have these commands take a size threshold parameter so that we do not
need to fetch large blobs during the invocation. (Hence my preference
for size information to be already available to the repo.)

Aside from that, does fsck of a partial repo work? Ability to fsck seems
quite important (or, at least, useful). I tried updating fsck to support
omitting commits and trees (in addition to blobs), and it seems
relatively involved (although I haven't looked into it deeply yet).

(That investigation also made me realize that, in my patch set, I
didn't handle the case where a tag references a blob - fsck doesn't work
when the blob is missing, even if it is promised. That's something for
me to look into.)


end of thread, other threads:[~2017-07-17 23:24 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-14 13:26 [RFC/PATCH v2 0/1] Add support for downloading blobs on demand Ben Peart
2017-07-14 13:26 ` [PATCH v2 1/1] sha1_file: " Ben Peart
2017-07-14 15:18   ` Christian Couder
2017-07-17 18:06   ` Jonathan Tan
2017-07-17 20:09     ` Ben Peart
2017-07-17 23:24       ` Jonathan Tan
