git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH 0/2] repack: add --filter=
@ 2022-01-27  1:49 John Cai via GitGitGadget
  2022-01-27  1:49 ` [PATCH 1/2] pack-objects: allow --filter without --stdout John Cai via GitGitGadget
                   ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: John Cai via GitGitGadget @ 2022-01-27  1:49 UTC (permalink / raw)
  To: git; +Cc: Christian Couder, John Cai

This patch series aims to make partial clones more useful by allowing repack
to create packfiles with promisor objects. The longer vision is to be able
to use partial clones on a git server to offload large blobs to an http
server. We can then store large blobs on said http server, and use a remote
helper to grab these objects when necessary.

This is the first step in allowing a repack to honor a filter spec.

John Cai (2):
  pack-objects: allow --filter without --stdout
  repack: add --filter=<filter-spec> option

 Documentation/git-repack.txt   |   5 +
 builtin/pack-objects.c         |   2 -
 builtin/repack.c               |  10 ++
 t/lib-httpd.sh                 |   2 +
 t/lib-httpd/apache.conf        |   8 ++
 t/lib-httpd/list.sh            |  43 +++++++++
 t/lib-httpd/upload.sh          |  46 +++++++++
 t/t0410-partial-clone.sh       |  52 ++++++++++
 t/t0410/git-remote-testhttpgit | 170 +++++++++++++++++++++++++++++++++
 t/t7700-repack.sh              |  20 ++++
 10 files changed, 356 insertions(+), 2 deletions(-)
 create mode 100644 t/lib-httpd/list.sh
 create mode 100644 t/lib-httpd/upload.sh
 create mode 100755 t/t0410/git-remote-testhttpgit


base-commit: 89bece5c8c96f0b962cfc89e63f82d603fd60bed
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1206%2Fjohn-cai%2Fjc-repack-filter-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1206/john-cai/jc-repack-filter-v1
Pull-Request: https://github.com/git/git/pull/1206
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 1/2] pack-objects: allow --filter without --stdout
  2022-01-27  1:49 [PATCH 0/2] repack: add --filter= John Cai via GitGitGadget
@ 2022-01-27  1:49 ` John Cai via GitGitGadget
  2022-01-27  1:49 ` [PATCH 2/2] repack: add --filter=<filter-spec> option John Cai via GitGitGadget
  2022-02-09  2:10 ` [PATCH v2 0/4] [RFC] repack: add --filter= John Cai via GitGitGadget
  2 siblings, 0 replies; 34+ messages in thread
From: John Cai via GitGitGadget @ 2022-01-27  1:49 UTC (permalink / raw)
  To: git; +Cc: Christian Couder, John Cai, John Cai

From: John Cai <johncai86@gmail.com>

9535ce7 taught pack-objects to use filtering, but added a requirement of
the --stdout since a partial clone mechanism was not yet in place to
handle missing objects. Since then, changes like 9e27beaa and others
added support to dynamically fetch objects that were missing.

Remove the --stdout requirement so that in the next commit, repack can
pass --filter to pack-objects to omit certain objects from the packfile.

Based-on-patch-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: John Cai <johncai86@gmail.com>
---
 builtin/pack-objects.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index ba2006f2212..2d1ecb18784 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -4075,8 +4075,6 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		unpack_unreachable_expiration = 0;
 
 	if (filter_options.choice) {
-		if (!pack_to_stdout)
-			die(_("cannot use --filter without --stdout"));
 		if (stdin_packs)
 			die(_("cannot use --filter with --stdin-packs"));
 	}
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 2/2] repack: add --filter=<filter-spec> option
  2022-01-27  1:49 [PATCH 0/2] repack: add --filter= John Cai via GitGitGadget
  2022-01-27  1:49 ` [PATCH 1/2] pack-objects: allow --filter without --stdout John Cai via GitGitGadget
@ 2022-01-27  1:49 ` John Cai via GitGitGadget
  2022-01-27 15:03   ` Derrick Stolee
  2022-02-09  2:10 ` [PATCH v2 0/4] [RFC] repack: add --filter= John Cai via GitGitGadget
  2 siblings, 1 reply; 34+ messages in thread
From: John Cai via GitGitGadget @ 2022-01-27  1:49 UTC (permalink / raw)
  To: git; +Cc: Christian Couder, John Cai, John Cai

From: John Cai <johncai86@gmail.com>

Currently, repack does not work with partial clones. When repack is run
on a partially cloned repository, it grabs all missing objects from
promisor remotes. This also means that when gc is run for repository
maintenance on a partially cloned repository, it will end up getting
missing objects, which is not what we want.

In order to make repack work with partial clone, teach repack a new
option --filter, which takes a <filter-spec> argument. repack will skip
any objects that are matched by <filter-spec> similar to how the clone
command will skip fetching certain objects.

The final goal of this feature, is to be able to store objects on a
server other than the regular git server itself.

There are several scripts added so we can test the process of using a
remote helper to upload blobs to an http server:

- t/lib-httpd/list.sh lists blobs uploaded to the http server.
- t/lib-httpd/upload.sh uploads blobs to the http server.
- t/t0410/git-remote-testhttpgit a remote helper that can access blobs
  onto from an http server. Copied over from t/t5801/git-remote-testhttpgit
  and modified to upload blobs to an http server.
- t/t0410/lib-http-promisor.sh convenience functions for uploading
  blobs

Based-on-patch-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: John Cai <johncai86@gmail.com>
---
 Documentation/git-repack.txt   |   5 +
 builtin/repack.c               |  10 ++
 t/lib-httpd.sh                 |   2 +
 t/lib-httpd/apache.conf        |   8 ++
 t/lib-httpd/list.sh            |  43 +++++++++
 t/lib-httpd/upload.sh          |  46 +++++++++
 t/t0410-partial-clone.sh       |  52 ++++++++++
 t/t0410/git-remote-testhttpgit | 170 +++++++++++++++++++++++++++++++++
 t/t7700-repack.sh              |  20 ++++
 9 files changed, 356 insertions(+)
 create mode 100644 t/lib-httpd/list.sh
 create mode 100644 t/lib-httpd/upload.sh
 create mode 100755 t/t0410/git-remote-testhttpgit

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index ee30edc178a..e394ec52ab1 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -126,6 +126,11 @@ depth is 4095.
 	a larger and slower repository; see the discussion in
 	`pack.packSizeLimit`.
 
+--filter=<filter-spec>::
+	Omits certain objects (usually blobs) from the resulting
+	packfile. See linkgit:git-rev-list[1] for valid
+	`<filter-spec>` forms.
+
 -b::
 --write-bitmap-index::
 	Write a reachability bitmap index as part of the repack. This
diff --git a/builtin/repack.c b/builtin/repack.c
index da1e364a756..9c2e5bcfe3b 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -152,6 +152,7 @@ struct pack_objects_args {
 	const char *depth;
 	const char *threads;
 	const char *max_pack_size;
+	const char *filter;
 	int no_reuse_delta;
 	int no_reuse_object;
 	int quiet;
@@ -172,6 +173,8 @@ static void prepare_pack_objects(struct child_process *cmd,
 		strvec_pushf(&cmd->args, "--threads=%s", args->threads);
 	if (args->max_pack_size)
 		strvec_pushf(&cmd->args, "--max-pack-size=%s", args->max_pack_size);
+	if (args->filter)
+		strvec_pushf(&cmd->args, "--filter=%s", args->filter);
 	if (args->no_reuse_delta)
 		strvec_pushf(&cmd->args, "--no-reuse-delta");
 	if (args->no_reuse_object)
@@ -660,6 +663,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 				N_("limits the maximum number of threads")),
 		OPT_STRING(0, "max-pack-size", &po_args.max_pack_size, N_("bytes"),
 				N_("maximum size of each packfile")),
+		OPT_STRING(0, "filter", &po_args.filter, N_("args"),
+				N_("object filtering")),
 		OPT_BOOL(0, "pack-kept-objects", &pack_kept_objects,
 				N_("repack objects in packs marked with .keep")),
 		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
@@ -819,6 +824,11 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		if (line.len != the_hash_algo->hexsz)
 			die(_("repack: Expecting full hex object ID lines only from pack-objects."));
 		string_list_append(&names, line.buf);
+		if (po_args.filter) {
+			char *promisor_name = mkpathdup("%s-%s.promisor", packtmp,
+							line.buf);
+			write_promisor_file(promisor_name, NULL, 0);
+		}
 	}
 	fclose(out);
 	ret = finish_command(&cmd);
diff --git a/t/lib-httpd.sh b/t/lib-httpd.sh
index 782891908d7..fc6587c6d39 100644
--- a/t/lib-httpd.sh
+++ b/t/lib-httpd.sh
@@ -136,6 +136,8 @@ prepare_httpd() {
 	install_script error-smart-http.sh
 	install_script error.sh
 	install_script apply-one-time-perl.sh
+	install_script upload.sh
+	install_script list.sh
 
 	ln -s "$LIB_HTTPD_MODULE_PATH" "$HTTPD_ROOT_PATH/modules"
 
diff --git a/t/lib-httpd/apache.conf b/t/lib-httpd/apache.conf
index 497b9b9d927..1ea382750f0 100644
--- a/t/lib-httpd/apache.conf
+++ b/t/lib-httpd/apache.conf
@@ -129,6 +129,8 @@ ScriptAlias /broken_smart/ broken-smart-http.sh/
 ScriptAlias /error_smart/ error-smart-http.sh/
 ScriptAlias /error/ error.sh/
 ScriptAliasMatch /one_time_perl/(.*) apply-one-time-perl.sh/$1
+ScriptAlias /upload/ upload.sh/
+ScriptAlias /list/ list.sh/
 <Directory ${GIT_EXEC_PATH}>
 	Options FollowSymlinks
 </Directory>
@@ -156,6 +158,12 @@ ScriptAliasMatch /one_time_perl/(.*) apply-one-time-perl.sh/$1
 <Files ${GIT_EXEC_PATH}/git-http-backend>
 	Options ExecCGI
 </Files>
+<Files upload.sh>
+  Options ExecCGI
+</Files>
+<Files list.sh>
+  Options ExecCGI
+</Files>
 
 RewriteEngine on
 RewriteRule ^/dumb-redir/(.*)$ /dumb/$1 [R=301]
diff --git a/t/lib-httpd/list.sh b/t/lib-httpd/list.sh
new file mode 100644
index 00000000000..e63406be3b2
--- /dev/null
+++ b/t/lib-httpd/list.sh
@@ -0,0 +1,43 @@
+#!/bin/sh
+
+# Used in the httpd test server to be called by a remote helper to list objects.
+
+FILES_DIR="www/files"
+
+OLDIFS="$IFS"
+IFS='&'
+set -- $QUERY_STRING
+IFS="$OLDIFS"
+
+while test $# -gt 0
+do
+	key=${1%%=*}
+	val=${1#*=}
+
+	case "$key" in
+	"sha1") sha1="$val" ;;
+	*) echo >&2 "unknown key '$key'" ;;
+	esac
+
+	shift
+done
+
+if test -d "$FILES_DIR"
+then
+	if test -z "$sha1"
+	then
+		echo 'Status: 200 OK'
+		echo
+		ls "$FILES_DIR" | tr '-' ' '
+	else
+		if test -f "$FILES_DIR/$sha1"-*
+		then
+			echo 'Status: 200 OK'
+			echo
+			cat "$FILES_DIR/$sha1"-*
+		else
+			echo 'Status: 404 Not Found'
+			echo
+		fi
+	fi
+fi
diff --git a/t/lib-httpd/upload.sh b/t/lib-httpd/upload.sh
new file mode 100644
index 00000000000..202de63b2dc
--- /dev/null
+++ b/t/lib-httpd/upload.sh
@@ -0,0 +1,46 @@
+#!/bin/sh
+
+# In part from http://codereview.stackexchange.com/questions/79549/bash-cgi-upload-file
+# Used in the httpd test server to for a remote helper to call to upload blobs.
+
+FILES_DIR="www/files"
+
+OLDIFS="$IFS"
+IFS='&'
+set -- $QUERY_STRING
+IFS="$OLDIFS"
+
+while test $# -gt 0
+do
+	key=${1%%=*}
+	val=${1#*=}
+
+	case "$key" in
+	"sha1") sha1="$val" ;;
+	"type") type="$val" ;;
+	"size") size="$val" ;;
+	"delete") delete=1 ;;
+	*) echo >&2 "unknown key '$key'" ;;
+	esac
+
+	shift
+done
+
+case "$REQUEST_METHOD" in
+POST)
+	if test "$delete" = "1"
+	then
+		rm -f "$FILES_DIR/$sha1-$size-$type"
+	else
+		mkdir -p "$FILES_DIR"
+		cat >"$FILES_DIR/$sha1-$size-$type"
+	fi
+
+	echo 'Status: 204 No Content'
+	echo
+	;;
+
+*)
+	echo 'Status: 405 Method Not Allowed'
+	echo
+esac
diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
index f17abd298c8..731f6bebc64 100755
--- a/t/t0410-partial-clone.sh
+++ b/t/t0410-partial-clone.sh
@@ -30,6 +30,31 @@ promise_and_delete () {
 	delete_object repo "$HASH"
 }
 
+upload_blob() {
+	SERVER_REPO="$1"
+	HASH="$2"
+
+	test -n "$HASH" || die "Invalid argument '$HASH'"
+	HASH_SIZE=$(git -C "$SERVER_REPO" cat-file -s "$HASH") || {
+		echo >&2 "Cannot get blob size of '$HASH'"
+		return 1
+	}
+
+	UPLOAD_URL="http://127.0.0.1:$LIB_HTTPD_PORT/upload/?sha1=$HASH&size=$HASH_SIZE&type=blob"
+
+	git -C "$SERVER_REPO" cat-file blob "$HASH" >object &&
+	curl --data-binary @object --include "$UPLOAD_URL"
+}
+
+upload_blobs_from_stdin() {
+	SERVER_REPO="$1"
+	while read -r blob
+	do
+		echo "uploading $blob"
+		upload_blob "$SERVER_REPO" "$blob" || return
+	done
+}
+
 test_expect_success 'extensions.partialclone without filter' '
 	test_create_repo server &&
 	git clone --filter="blob:none" "file://$(pwd)/server" client &&
@@ -668,6 +693,33 @@ test_expect_success 'fetching of missing objects from an HTTP server' '
 	grep "$HASH" out
 '
 
+PATH="$TEST_DIRECTORY/t0410:$PATH"
+
+test_expect_success 'fetch of missing objects through remote helper' '
+	rm -rf origin server &&
+	test_create_repo origin &&
+	dd if=/dev/zero of=origin/file1 bs=801k count=1 &&
+	git -C origin add file1 &&
+	git -C origin commit -m "large blob" &&
+	sha="$(git -C origin rev-parse :file1)" &&
+	expected="?$(git -C origin rev-parse :file1)" &&
+	git clone --bare --no-local origin server &&
+	git -C server remote add httpremote "testhttpgit::${PWD}/server" &&
+	git -C server config remote.httpremote.promisor true &&
+	git -C server config --remove-section remote.origin &&
+	git -C server rev-list --all --objects --filter-print-omitted \
+		--filter=blob:limit=800k | perl -ne "print if s/^[~]//" \
+		>large_blobs.txt &&
+	upload_blobs_from_stdin server <large_blobs.txt &&
+	git -C server -c repack.writebitmaps=false repack -a -d \
+		--filter=blob:limit=800k &&
+	git -C server rev-list --objects --all --missing=print >objects &&
+	grep "$expected" objects &&
+	HTTPD_URL=$HTTPD_URL git -C server show $sha &&
+	git -C server rev-list --objects --all --missing=print >objects &&
+	grep "$sha" objects
+'
+
 # DO NOT add non-httpd-specific tests here, because the last part of this
 # test script is only executed when httpd is available and enabled.
 
diff --git a/t/t0410/git-remote-testhttpgit b/t/t0410/git-remote-testhttpgit
new file mode 100755
index 00000000000..e5e187243ed
--- /dev/null
+++ b/t/t0410/git-remote-testhttpgit
@@ -0,0 +1,170 @@
+#!/bin/sh
+# Copyright (c) 2012 Felipe Contreras
+# Copyright (c) 2020 Christian Couder
+
+# This is a git remote helper that can be used to store blobs on an http server
+
+# The first argument can be a url when the fetch/push command was a url
+# instead of a configured remote. In this case, use a generic alias.
+if test "$1" = "testhttpgit::$2"; then
+	alias=_
+else
+	alias=$1
+fi
+url=$2
+
+unset GIT_DIR
+
+h_refspec="refs/heads/*:refs/testhttpgit/$alias/heads/*"
+t_refspec="refs/tags/*:refs/testhttpgit/$alias/tags/*"
+
+if test -n "$GIT_REMOTE_TESTHTTPGIT_NOREFSPEC"
+then
+	h_refspec=""
+	t_refspec=""
+fi
+
+die () {
+	echo >&2 "fatal: $*"
+	echo "fatal: $*" >>/tmp/t0430.txt
+	echo >>/tmp/t0430.txt
+	exit 1
+}
+
+force=
+
+mark_count_tmp=$(mktemp -t git-remote-http-mark-count_XXXXXX) || die "Failed to create temp file"
+echo "1" >"$mark_count_tmp"
+
+get_mark_count() {
+	mark=$(cat "$mark_count_tmp")
+	echo "$mark"
+	mark=$((mark+1))
+	echo "$mark" >"$mark_count_tmp"	
+}
+
+export_blob_from_file() {
+	file="$1"
+	echo "blob"
+	echo "mark :$(get_mark_count)"
+	size=$(wc -c <"$file") || return
+	echo "data $size"
+	cat "$file" || return
+	echo
+}
+
+while read line
+do
+	case $line in
+	capabilities)
+		echo 'import'
+		echo 'export'
+		test -n "$h_refspec" && echo "refspec $h_refspec"
+		test -n "$t_refspec" && echo "refspec $t_refspec"
+		test -n "$GIT_REMOTE_TESTHTTPGIT_SIGNED_TAGS" && echo "signed-tags"
+		test -n "$GIT_REMOTE_TESTHTTPGIT_NO_PRIVATE_UPDATE" && echo "no-private-update"
+		echo 'option'
+		echo
+		;;
+	list)
+		git -C "$url" for-each-ref --format='? %(refname)' 'refs/heads/' 'refs/tags/'
+		head=$(git -C "$url" symbolic-ref HEAD)
+		echo "@$head HEAD"
+		echo
+		;;
+	import*)
+		# read all import lines
+		while true
+		do
+			ref="${line#* }"
+			refs="$refs $ref"
+			read line
+			test "${line%% *}" != "import" && break
+		done
+
+		echo "refs: $refs" >>/tmp/t0430.txt
+
+		if test -n "$GIT_REMOTE_TESTHTTPGIT_FAILURE"
+		then
+			echo "feature done"
+			exit 1
+		fi
+
+		echo "feature done"
+
+		tmpdir=$(mktemp -d -t git-remote-http-import_XXXXXX) || die "Failed to create temp directory"
+
+		for ref in $refs
+		do
+			get_url="$HTTPD_URL/list/?sha1=$ref"
+			echo "curl url: $get_url" >>/tmp/t0430.txt
+			echo "curl output: $tmpdir/$ref" >>/tmp/t0430.txt
+			curl -s -o "$tmpdir/$ref" "$get_url" ||
+				die "curl '$get_url' failed"
+			echo "exporting from: $tmpdir/$ref" >>/tmp/t0430.txt
+			export_blob_from_file "$tmpdir/$ref" ||
+				die "failed to export blob from '$tmpdir/$ref'"
+			echo "done exporting" >>/tmp/t0430.txt
+		done
+
+		echo "done"
+		;;
+	export)
+		if test -n "$GIT_REMOTE_TESTHTTPGIT_FAILURE"
+		then
+			# consume input so fast-export doesn't get SIGPIPE;
+			# git would also notice that case, but we want
+			# to make sure we are exercising the later
+			# error checks
+			while read line; do
+				test "done" = "$line" && break
+			done
+			exit 1
+		fi
+
+		before=$(git -C "$url" for-each-ref --format=' %(refname) %(objectname) ')
+
+		git -C "$url" fast-import \
+			${force:+--force} \
+			${testhttpgitmarks:+"--import-marks=$testhttpgitmarks"} \
+			${testhttpgitmarks:+"--export-marks=$testhttpgitmarks"} \
+			--quiet
+
+		# figure out which refs were updated
+		git -C "$url" for-each-ref --format='%(refname) %(objectname)' |
+		while read ref a
+		do
+			case "$before" in
+			*" $ref $a "*)
+				continue ;;	# unchanged
+			esac
+			if test -z "$GIT_REMOTE_TESTHTTPGIT_PUSH_ERROR"
+			then
+				echo "ok $ref"
+			else
+				echo "error $ref $GIT_REMOTE_TESTHTTPGIT_PUSH_ERROR"
+			fi
+		done
+
+		echo
+		;;
+	option\ *)
+		read cmd opt val <<-EOF
+		$line
+		EOF
+		case $opt in
+		force)
+			test $val = "true" && force="true" || force=
+			echo "ok"
+			;;
+		*)
+			echo "unsupported"
+			;;
+		esac
+		;;
+	'')
+		exit
+		;;
+	esac
+done
+
diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index e489869dd94..78cc1858cb6 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -237,6 +237,26 @@ test_expect_success 'auto-bitmaps do not complain if unavailable' '
 	test_must_be_empty actual
 '
 
+test_expect_success 'repack with filter does not fetch from remote' '
+	rm -rf server client &&
+	test_create_repo server &&
+	git -C server config uploadpack.allowFilter true &&
+	git -C server config uploadpack.allowAnySHA1InWant true &&
+	echo content1 >server/file1 &&
+	git -C server add file1 &&
+	git -C server commit -m initial_commit &&
+	expected="?$(git -C server rev-parse :file1)" &&
+	git clone --bare --no-local server client &&
+	git -C client config remote.origin.promisor true &&
+	git -C client -c repack.writebitmaps=false repack -a -d --filter=blob:none &&
+	git -C client rev-list --objects --all --missing=print >objects &&
+	grep "$expected" objects &&
+	git -C client repack -a -d &&
+	expected="$(git -C server rev-parse :file1)" &&
+	git -C client rev-list --objects --all --missing=print >objects &&
+	grep "$expected" objects
+'
+
 objdir=.git/objects
 midx=$objdir/pack/multi-pack-index
 
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2] repack: add --filter=<filter-spec> option
  2022-01-27  1:49 ` [PATCH 2/2] repack: add --filter=<filter-spec> option John Cai via GitGitGadget
@ 2022-01-27 15:03   ` Derrick Stolee
  2022-01-29 19:14     ` John Cai
  0 siblings, 1 reply; 34+ messages in thread
From: Derrick Stolee @ 2022-01-27 15:03 UTC (permalink / raw)
  To: John Cai via GitGitGadget, git; +Cc: Christian Couder, John Cai

On 1/26/2022 8:49 PM, John Cai via GitGitGadget wrote:
> From: John Cai <johncai86@gmail.com>
> 
> Currently, repack does not work with partial clones. When repack is run
> on a partially cloned repository, it grabs all missing objects from
> promisor remotes. This also means that when gc is run for repository
> maintenance on a partially cloned repository, it will end up getting
> missing objects, which is not what we want.

This shouldn't be what is happening. Do you have a demonstration of
this happening? repack_promisor_objects() should be avoiding following
links outside of promisor packs so we can safely 'git gc' in a partial
clone without downloading all reachable blobs.

> In order to make repack work with partial clone, teach repack a new
> option --filter, which takes a <filter-spec> argument. repack will skip
> any objects that are matched by <filter-spec> similar to how the clone
> command will skip fetching certain objects.

This is a bit misleading, since 'git clone' doesn't "skip fetching",
but instead requests a filter and the server can choose to write a
pack-file using that filter. I'm not sure if it's worth how pedantic
I'm being here.

The thing that I find confusing here is that you are adding an option
that could be run on a _full_ repository. If I have a set of packs
and none of them are promisor (I have every reachable object), then
what is the end result after 'git repack -adf --filter=blob:none'?
Those existing pack-files shouldn't be deleted because they have
objects that are not in the newly-created pack-file.

I'd like to see some additional clarity on this before continuing
to review this series.

> The final goal of this feature, is to be able to store objects on a
> server other than the regular git server itself.
> 
> There are several scripts added so we can test the process of using a
> remote helper to upload blobs to an http server:
> 
> - t/lib-httpd/list.sh lists blobs uploaded to the http server.
> - t/lib-httpd/upload.sh uploads blobs to the http server.
> - t/t0410/git-remote-testhttpgit a remote helper that can access blobs
>   onto from an http server. Copied over from t/t5801/git-remote-testhttpgit
>   and modified to upload blobs to an http server.
> - t/t0410/lib-http-promisor.sh convenience functions for uploading
>   blobs

I think these changes to the tests should be extracted to a new
patch where this can be discussed in more detail. I didn't look
too closely at them because I want to focus on whether this
--filter option is a good direction for 'git repack'.

>  		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
> @@ -819,6 +824,11 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  		if (line.len != the_hash_algo->hexsz)
>  			die(_("repack: Expecting full hex object ID lines only from pack-objects."));
>  		string_list_append(&names, line.buf);
> +		if (po_args.filter) {
> +			char *promisor_name = mkpathdup("%s-%s.promisor", packtmp,
> +							line.buf);
> +			write_promisor_file(promisor_name, NULL, 0);

This code is duplicated in repack_promisor_objects(), so it would be
good to extract that logic into a helper method called by both places.

> +		}
>  	}
>  	fclose(out);
>  	ret = finish_command(&cmd);

> diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
> index e489869dd94..78cc1858cb6 100755
> --- a/t/t7700-repack.sh
> +++ b/t/t7700-repack.sh
> @@ -237,6 +237,26 @@ test_expect_success 'auto-bitmaps do not complain if unavailable' '
>  	test_must_be_empty actual
>  '
>  
> +test_expect_success 'repack with filter does not fetch from remote' '
> +	rm -rf server client &&
> +	test_create_repo server &&
> +	git -C server config uploadpack.allowFilter true &&
> +	git -C server config uploadpack.allowAnySHA1InWant true &&
> +	echo content1 >server/file1 &&
> +	git -C server add file1 &&
> +	git -C server commit -m initial_commit &&
> +	expected="?$(git -C server rev-parse :file1)" &&
> +	git clone --bare --no-local server client &&

You could use "file:://$(pwd)/server" here instead of "server".

> +	git -C client config remote.origin.promisor true &&
> +	git -C client -c repack.writebitmaps=false repack -a -d --filter=blob:none &&
This isn't testing what you want it to test, because your initial
clone doesn't use --filter=blob:none, so you already have all of
the objects in the client. You would never trigger a need for a
fetch from the remote.

> +	git -C client rev-list --objects --all --missing=print >objects &&
> +	grep "$expected" objects &&
> +	git -C client repack -a -d &&
> +	expected="$(git -C server rev-parse :file1)" &&

This is signalling to me that you are looking for a remote fetch
now that you are repacking everything, and that can only happen
if you deleted objects from the client during your first repack.
That seems incorrect.

> +	git -C client rev-list --objects --all --missing=print >objects &&
> +	grep "$expected" objects
> +'

Based on my current understanding, this patch seems unnecessary (repacks
should already be doing the right thing when in the presence of a partial
clone) and incorrect (we should not delete existing reachable objects
when repacking with a filter).

I look forward to hearing more about your intended use of this feature so
we can land on a better way to solve the problems you are having.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2] repack: add --filter=<filter-spec> option
  2022-01-27 15:03   ` Derrick Stolee
@ 2022-01-29 19:14     ` John Cai
  2022-01-30  8:16       ` Christian Couder
  2022-01-30 13:02       ` John Cai
  0 siblings, 2 replies; 34+ messages in thread
From: John Cai @ 2022-01-29 19:14 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: John Cai via GitGitGadget, git, Christian Couder

Hi Stolee,

Thanks for taking the time to review this patch! I added some points of clarification
down below.

On 27 Jan 2022, at 10:03, Derrick Stolee wrote:

> On 1/26/2022 8:49 PM, John Cai via GitGitGadget wrote:
>> From: John Cai <johncai86@gmail.com>
>>
>> Currently, repack does not work with partial clones. When repack is run
>> on a partially cloned repository, it grabs all missing objects from
>> promisor remotes. This also means that when gc is run for repository
>> maintenance on a partially cloned repository, it will end up getting
>> missing objects, which is not what we want.
>
> This shouldn't be what is happening. Do you have a demonstration of
> this happening? repack_promisor_objects() should be avoiding following
> links outside of promisor packs so we can safely 'git gc' in a partial
> clone without downloading all reachable blobs.

You're right, sorry I was mistaken about this detail of how partial clones work.
>
>> In order to make repack work with partial clone, teach repack a new
>> option --filter, which takes a <filter-spec> argument. repack will skip
>> any objects that are matched by <filter-spec> similar to how the clone
>> command will skip fetching certain objects.
>
> This is a bit misleading, since 'git clone' doesn't "skip fetching",
> but instead requests a filter and the server can choose to write a
> pack-file using that filter. I'm not sure if it's worth how pedantic
> I'm being here.

Thanks for the more precise description of the mechanics of partial clone.
I'll improve the wording in the next version of this patch series.

>
> The thing that I find confusing here is that you are adding an option
> that could be run on a _full_ repository. If I have a set of packs
> and none of them are promisor (I have every reachable object), then
> what is the end result after 'git repack -adf --filter=blob:none'?
> Those existing pack-files shouldn't be deleted because they have
> objects that are not in the newly-created pack-file.
>
> I'd like to see some additional clarity on this before continuing
> to review this series.

Apologies for the lack of clarity. Indeed, I can see why this is the most important
detail of this patch to provide enough context on, as it involves deleting
objects from a full repository as you said.

To back up a little, the goal is to be able to offload large
blobs to a separate http server. Christian Couder has a demo [1] that shows
this in detail.

If we had the following:
A. an http server to use as a generalized object store
B. a server update hook that uploads large blobs to 1.
C. a git server
D. a regular job that runs `git repack --filter` to remove large
blobs from C.

Clients would need to configure both C) and A) as promisor remotes to
be able to get everything. When they push new large blobs, they can
still push them to C), as B) will upload them to A), and D) will
regularly remove those large blobs from C).

This way with a little bit of client and server configuration, we can have
a native way to support offloading large files without git LFS.
It would be more flexible as you can easily tweak which blobs are considered large
files by tweaking B) and D).

>
>> The final goal of this feature, is to be able to store objects on a
>> server other than the regular git server itself.
>>
>> There are several scripts added so we can test the process of using a
>> remote helper to upload blobs to an http server:
>>
>> - t/lib-httpd/list.sh lists blobs uploaded to the http server.
>> - t/lib-httpd/upload.sh uploads blobs to the http server.
>> - t/t0410/git-remote-testhttpgit a remote helper that can access blobs
>>   onto from an http server. Copied over from t/t5801/git-remote-testhttpgit
>>   and modified to upload blobs to an http server.
>> - t/t0410/lib-http-promisor.sh convenience functions for uploading
>>   blobs
>
> I think these changes to the tests should be extracted to a new
> patch where this can be discussed in more detail. I didn't look
> too closely at them because I want to focus on whether this
> --filter option is a good direction for 'git repack'.
>
>>  		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
>> @@ -819,6 +824,11 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>>  		if (line.len != the_hash_algo->hexsz)
>>  			die(_("repack: Expecting full hex object ID lines only from pack-objects."));
>>  		string_list_append(&names, line.buf);
>> +		if (po_args.filter) {
>> +			char *promisor_name = mkpathdup("%s-%s.promisor", packtmp,
>> +							line.buf);
>> +			write_promisor_file(promisor_name, NULL, 0);
>
> This code is duplicated in repack_promisor_objects(), so it would be
> good to extract that logic into a helper method called by both places.

Thanks for pointing this out. I'll incorporate this into the next version.
>
>> +		}
>>  	}
>>  	fclose(out);
>>  	ret = finish_command(&cmd);
>
>> diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
>> index e489869dd94..78cc1858cb6 100755
>> --- a/t/t7700-repack.sh
>> +++ b/t/t7700-repack.sh
>> @@ -237,6 +237,26 @@ test_expect_success 'auto-bitmaps do not complain if unavailable' '
>>  	test_must_be_empty actual
>>  '
>>
>> +test_expect_success 'repack with filter does not fetch from remote' '
>> +	rm -rf server client &&
>> +	test_create_repo server &&
>> +	git -C server config uploadpack.allowFilter true &&
>> +	git -C server config uploadpack.allowAnySHA1InWant true &&
>> +	echo content1 >server/file1 &&
>> +	git -C server add file1 &&
>> +	git -C server commit -m initial_commit &&
>> +	expected="?$(git -C server rev-parse :file1)" &&
>> +	git clone --bare --no-local server client &&
>
> You could use "file:://$(pwd)/server" here instead of "server".

good point, thanks

>
>> +	git -C client config remote.origin.promisor true &&
>> +	git -C client -c repack.writebitmaps=false repack -a -d --filter=blob:none &&
> This isn't testing what you want it to test, because your initial
> clone doesn't use --filter=blob:none, so you already have all of
> the objects in the client. You would never trigger a need for a
> fetch from the remote.

right, so this test is actually testing that repack --filter would shed objects to show
that it can be used as D) as a regular cleanup job for git servers that utilize another
http server to host large blobs.

>
>> +	git -C client rev-list --objects --all --missing=print >objects &&
>> +	grep "$expected" objects &&
>> +	git -C client repack -a -d &&
>> +	expected="$(git -C server rev-parse :file1)" &&
>
> This is signalling to me that you are looking for a remote fetch
> now that you are repacking everything, and that can only happen
> if you deleted objects from the client during your first repack.
> That seems incorrect.
>
>> +	git -C client rev-list --objects --all --missing=print >objects &&
>> +	grep "$expected" objects
>> +'
>
> Based on my current understanding, this patch seems unnecessary (repacks
> should already be doing the right thing when in the presence of a partial
> clone) and incorrect (we should not delete existing reachable objects
> when repacking with a filter).
>
> I look forward to hearing more about your intended use of this feature so
> we can land on a better way to solve the problems you are having.

Thanks for the callouts on the big picture of this proposed change. Looking
forward to getting your thoughts on this!
>
> Thanks,
> -Stolee

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2] repack: add --filter=<filter-spec> option
  2022-01-29 19:14     ` John Cai
@ 2022-01-30  8:16       ` Christian Couder
  2022-01-30 13:02       ` John Cai
  1 sibling, 0 replies; 34+ messages in thread
From: Christian Couder @ 2022-01-30  8:16 UTC (permalink / raw)
  To: John Cai; +Cc: Derrick Stolee, John Cai via GitGitGadget, git

On Sat, Jan 29, 2022 at 8:14 PM John Cai <johncai86@gmail.com> wrote:

> Apologies for the lack of clarity. Indeed, I can see why this is the most important
> detail of this patch to provide enough context on, as it involves deleting
> objects from a full repository as you said.
>
> To back up a little, the goal is to be able to offload large
> blobs to a separate http server. Christian Couder has a demo [1] that shows
> this in detail.

You might have forgotten to provide a link for [1], also I am not sure
if you wanted to link to the repo:

https://gitlab.com/chriscool/partial-clone-demo/

or the demo itself in the repo:

https://gitlab.com/chriscool/partial-clone-demo/-/blob/master/http-promisor/server_demo.txt

> If we had the following:
> A. an http server to use as a generalized object store
> B. a server update hook that uploads large blobs to 1.

s/1./A./

> C. a git server
> D. a regular job that runs `git repack --filter` to remove large
> blobs from C.
>
> Clients would need to configure both C) and A) as promisor remotes to

Maybe s/C)/C./ and s/A)/A./

Also note that configuring A. as a promisor remote requires a remote helper.

> be able to get everything. When they push new large blobs, they can
> still push them to C), as B) will upload them to A), and D) will
> regularly remove those large blobs from C).
>
> This way with a little bit of client and server configuration, we can have
> a native way to support offloading large files without git LFS.
> It would be more flexible as you can easily tweak which blobs are considered large
> files by tweaking B) and D).

Yeah, that's the idea of the demo.

Thanks for working on this!

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2] repack: add --filter=<filter-spec> option
  2022-01-29 19:14     ` John Cai
  2022-01-30  8:16       ` Christian Couder
@ 2022-01-30 13:02       ` John Cai
  1 sibling, 0 replies; 34+ messages in thread
From: John Cai @ 2022-01-30 13:02 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: John Cai via GitGitGadget, git, Christian Couder

Sorry forgot to include the link to Christian's demo. included below

On 29 Jan 2022, at 14:14, John Cai wrote:

> Hi Stolee,
>
> Thanks for taking the time to review this patch! I added some points of clarification
> down below.
>
> On 27 Jan 2022, at 10:03, Derrick Stolee wrote:
>
>> On 1/26/2022 8:49 PM, John Cai via GitGitGadget wrote:
>>> From: John Cai <johncai86@gmail.com>
>>>
>>> Currently, repack does not work with partial clones. When repack is run
>>> on a partially cloned repository, it grabs all missing objects from
>>> promisor remotes. This also means that when gc is run for repository
>>> maintenance on a partially cloned repository, it will end up getting
>>> missing objects, which is not what we want.
>>
>> This shouldn't be what is happening. Do you have a demonstration of
>> this happening? repack_promisor_objects() should be avoiding following
>> links outside of promisor packs so we can safely 'git gc' in a partial
>> clone without downloading all reachable blobs.
>
> You're right, sorry I was mistaken about this detail of how partial clones work.
>>
>>> In order to make repack work with partial clone, teach repack a new
>>> option --filter, which takes a <filter-spec> argument. repack will skip
>>> any objects that are matched by <filter-spec> similar to how the clone
>>> command will skip fetching certain objects.
>>
>> This is a bit misleading, since 'git clone' doesn't "skip fetching",
>> but instead requests a filter and the server can choose to write a
>> pack-file using that filter. I'm not sure if it's worth how pedantic
>> I'm being here.
>
> Thanks for the more precise description of the mechanics of partial clone.
> I'll improve the wording in the next version of this patch series.
>
>>
>> The thing that I find confusing here is that you are adding an option
>> that could be run on a _full_ repository. If I have a set of packs
>> and none of them are promisor (I have every reachable object), then
>> what is the end result after 'git repack -adf --filter=blob:none'?
>> Those existing pack-files shouldn't be deleted because they have
>> objects that are not in the newly-created pack-file.
>>
>> I'd like to see some additional clarity on this before continuing
>> to review this series.
>
> Apologies for the lack of clarity. Indeed, I can see why this is the most important
> detail of this patch to provide enough context on, as it involves deleting
> objects from a full repository as you said.
>
> To back up a little, the goal is to be able to offload large
> blobs to a separate http server. Christian Couder has a demo [1] that shows
> this in detail.
>
> If we had the following:
> A. an http server to use as a generalized object store
> B. a server update hook that uploads large blobs to 1.
> C. a git server
> D. a regular job that runs `git repack --filter` to remove large
> blobs from C.
>
> Clients would need to configure both C) and A) as promisor remotes to
> be able to get everything. When they push new large blobs, they can
> still push them to C), as B) will upload them to A), and D) will
> regularly remove those large blobs from C).
>
> This way with a little bit of client and server configuration, we can have
> a native way to support offloading large files without git LFS.
> It would be more flexible as you can easily tweak which blobs are considered large
> files by tweaking B) and D).
>

[1] https://gitlab.com/chriscool/partial-clone-demo/-/blob/master/http-promisor/server_demo.txt

>>
>>> The final goal of this feature, is to be able to store objects on a
>>> server other than the regular git server itself.
>>>
>>> There are several scripts added so we can test the process of using a
>>> remote helper to upload blobs to an http server:
>>>
>>> - t/lib-httpd/list.sh lists blobs uploaded to the http server.
>>> - t/lib-httpd/upload.sh uploads blobs to the http server.
>>> - t/t0410/git-remote-testhttpgit a remote helper that can access blobs
>>>   onto from an http server. Copied over from t/t5801/git-remote-testhttpgit
>>>   and modified to upload blobs to an http server.
>>> - t/t0410/lib-http-promisor.sh convenience functions for uploading
>>>   blobs
>>
>> I think these changes to the tests should be extracted to a new
>> patch where this can be discussed in more detail. I didn't look
>> too closely at them because I want to focus on whether this
>> --filter option is a good direction for 'git repack'.
>>
>>>  		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
>>> @@ -819,6 +824,11 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>>>  		if (line.len != the_hash_algo->hexsz)
>>>  			die(_("repack: Expecting full hex object ID lines only from pack-objects."));
>>>  		string_list_append(&names, line.buf);
>>> +		if (po_args.filter) {
>>> +			char *promisor_name = mkpathdup("%s-%s.promisor", packtmp,
>>> +							line.buf);
>>> +			write_promisor_file(promisor_name, NULL, 0);
>>
>> This code is duplicated in repack_promisor_objects(), so it would be
>> good to extract that logic into a helper method called by both places.
>
> Thanks for pointing this out. I'll incorporate this into the next version.
>>
>>> +		}
>>>  	}
>>>  	fclose(out);
>>>  	ret = finish_command(&cmd);
>>
>>> diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
>>> index e489869dd94..78cc1858cb6 100755
>>> --- a/t/t7700-repack.sh
>>> +++ b/t/t7700-repack.sh
>>> @@ -237,6 +237,26 @@ test_expect_success 'auto-bitmaps do not complain if unavailable' '
>>>  	test_must_be_empty actual
>>>  '
>>>
>>> +test_expect_success 'repack with filter does not fetch from remote' '
>>> +	rm -rf server client &&
>>> +	test_create_repo server &&
>>> +	git -C server config uploadpack.allowFilter true &&
>>> +	git -C server config uploadpack.allowAnySHA1InWant true &&
>>> +	echo content1 >server/file1 &&
>>> +	git -C server add file1 &&
>>> +	git -C server commit -m initial_commit &&
>>> +	expected="?$(git -C server rev-parse :file1)" &&
>>> +	git clone --bare --no-local server client &&
>>
>> You could use "file:://$(pwd)/server" here instead of "server".
>
> good point, thanks
>
>>
>>> +	git -C client config remote.origin.promisor true &&
>>> +	git -C client -c repack.writebitmaps=false repack -a -d --filter=blob:none &&
>> This isn't testing what you want it to test, because your initial
>> clone doesn't use --filter=blob:none, so you already have all of
>> the objects in the client. You would never trigger a need for a
>> fetch from the remote.
>
> right, so this test is actually testing that repack --filter would shed objects to show
> that it can be used as D) as a regular cleanup job for git servers that utilize another
> http server to host large blobs.
>
>>
>>> +	git -C client rev-list --objects --all --missing=print >objects &&
>>> +	grep "$expected" objects &&
>>> +	git -C client repack -a -d &&
>>> +	expected="$(git -C server rev-parse :file1)" &&
>>
>> This is signalling to me that you are looking for a remote fetch
>> now that you are repacking everything, and that can only happen
>> if you deleted objects from the client during your first repack.
>> That seems incorrect.
>>
>>> +	git -C client rev-list --objects --all --missing=print >objects &&
>>> +	grep "$expected" objects
>>> +'
>>
>> Based on my current understanding, this patch seems unnecessary (repacks
>> should already be doing the right thing when in the presence of a partial
>> clone) and incorrect (we should not delete existing reachable objects
>> when repacking with a filter).
>>
>> I look forward to hearing more about your intended use of this feature so
>> we can land on a better way to solve the problems you are having.
>
> Thanks for the callouts on the big picture of this proposed change. Looking
> forward to getting your thoughts on this!
>>
>> Thanks,
>> -Stolee

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v2 0/4] [RFC] repack: add --filter=
  2022-01-27  1:49 [PATCH 0/2] repack: add --filter= John Cai via GitGitGadget
  2022-01-27  1:49 ` [PATCH 1/2] pack-objects: allow --filter without --stdout John Cai via GitGitGadget
  2022-01-27  1:49 ` [PATCH 2/2] repack: add --filter=<filter-spec> option John Cai via GitGitGadget
@ 2022-02-09  2:10 ` John Cai via GitGitGadget
  2022-02-09  2:10   ` [PATCH v2 1/4] pack-objects: allow --filter without --stdout John Cai via GitGitGadget
                     ` (5 more replies)
  2 siblings, 6 replies; 34+ messages in thread
From: John Cai via GitGitGadget @ 2022-02-09  2:10 UTC (permalink / raw)
  To: git; +Cc: John Cai

This patch series makes partial clone more useful by making it possible to
run repack to remove objects from a repository (replacing it with promisor
objects). This is useful when we want to offload large blobs from a git
server onto another git server, or even use an http server through a remote
helper.

In [A], a --refilter option on fetch and fetch-pack is being discussed where
either a less restrictive or more restrictive filter can be used. In the
more restrictive case, the objects that already exist will not be deleted.
But, one can imagine that users might want the ability to delete objects
when they apply a more restrictive filter in order to save space, and this
patch series would also allow that.

There are a couple of things we need to adjust to make this possible. This
patch has three parts.

 1. Allow --filter in pack-objects without --stdout
 2. Add a --filter flag for repack
 3. Allow missing promisor objects in upload-pack
 4. Tests that demonstrate the ability to offload objects onto an http
    remote

cc: Christian Couder christian.couder@gmail.com cc: Derrick Stolee
stolee@gmail.com cc: Robert Coup robert@coup.net.nz

A.
https://lore.kernel.org/git/pull.1138.git.1643730593.gitgitgadget@gmail.com/

John Cai (4):
  pack-objects: allow --filter without --stdout
  repack: add --filter=<filter-spec> option
  upload-pack: allow missing promisor objects
  tests for repack --filter mode

 Documentation/git-repack.txt   |   5 +
 builtin/pack-objects.c         |   2 -
 builtin/repack.c               |  22 +++--
 t/lib-httpd.sh                 |   2 +
 t/lib-httpd/apache.conf        |   8 ++
 t/lib-httpd/list.sh            |  43 +++++++++
 t/lib-httpd/upload.sh          |  46 +++++++++
 t/t0410-partial-clone.sh       |  81 ++++++++++++++++
 t/t0410/git-remote-testhttpgit | 170 +++++++++++++++++++++++++++++++++
 t/t7700-repack.sh              |  20 ++++
 upload-pack.c                  |   5 +
 11 files changed, 395 insertions(+), 9 deletions(-)
 create mode 100644 t/lib-httpd/list.sh
 create mode 100644 t/lib-httpd/upload.sh
 create mode 100755 t/t0410/git-remote-testhttpgit


base-commit: 38062e73e009f27ea192d50481fcb5e7b0e9d6eb
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1206%2Fjohn-cai%2Fjc-repack-filter-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1206/john-cai/jc-repack-filter-v2
Pull-Request: https://github.com/git/git/pull/1206

Range-diff vs v1:

 1:  0eec9b117da = 1:  f43b76ca650 pack-objects: allow --filter without --stdout
 -:  ----------- > 2:  6e7c8410b8d repack: add --filter=<filter-spec> option
 -:  ----------- > 3:  40612b9663b upload-pack: allow missing promisor objects
 2:  a3166381572 ! 4:  d76faa1f16e repack: add --filter=<filter-spec> option
     @@ Metadata
      Author: John Cai <johncai86@gmail.com>
      
       ## Commit message ##
     -    repack: add --filter=<filter-spec> option
     +    tests for repack --filter mode
      
     -    Currently, repack does not work with partial clones. When repack is run
     -    on a partially cloned repository, it grabs all missing objects from
     -    promisor remotes. This also means that when gc is run for repository
     -    maintenance on a partially cloned repository, it will end up getting
     -    missing objects, which is not what we want.
     -
     -    In order to make repack work with partial clone, teach repack a new
     -    option --filter, which takes a <filter-spec> argument. repack will skip
     -    any objects that are matched by <filter-spec> similar to how the clone
     -    command will skip fetching certain objects.
     -
     -    The final goal of this feature, is to be able to store objects on a
     -    server other than the regular git server itself.
     +    This patch adds tests to test both repack --filter functionality in
     +    isolation (in t7700-repack.sh) as well as how it can be used to offload
     +    large blobs (in t0410-partial-clone.sh)
      
          There are several scripts added so we can test the process of using a
     -    remote helper to upload blobs to an http server:
     +    remote helper to upload blobs to an http server.
      
          - t/lib-httpd/list.sh lists blobs uploaded to the http server.
          - t/lib-httpd/upload.sh uploads blobs to the http server.
     @@ Commit message
          Based-on-patch-by: Christian Couder <chriscool@tuxfamily.org>
          Signed-off-by: John Cai <johncai86@gmail.com>
      
     - ## Documentation/git-repack.txt ##
     -@@ Documentation/git-repack.txt: depth is 4095.
     - 	a larger and slower repository; see the discussion in
     - 	`pack.packSizeLimit`.
     - 
     -+--filter=<filter-spec>::
     -+	Omits certain objects (usually blobs) from the resulting
     -+	packfile. See linkgit:git-rev-list[1] for valid
     -+	`<filter-spec>` forms.
     -+
     - -b::
     - --write-bitmap-index::
     - 	Write a reachability bitmap index as part of the repack. This
     -
     - ## builtin/repack.c ##
     -@@ builtin/repack.c: struct pack_objects_args {
     - 	const char *depth;
     - 	const char *threads;
     - 	const char *max_pack_size;
     -+	const char *filter;
     - 	int no_reuse_delta;
     - 	int no_reuse_object;
     - 	int quiet;
     -@@ builtin/repack.c: static void prepare_pack_objects(struct child_process *cmd,
     - 		strvec_pushf(&cmd->args, "--threads=%s", args->threads);
     - 	if (args->max_pack_size)
     - 		strvec_pushf(&cmd->args, "--max-pack-size=%s", args->max_pack_size);
     -+	if (args->filter)
     -+		strvec_pushf(&cmd->args, "--filter=%s", args->filter);
     - 	if (args->no_reuse_delta)
     - 		strvec_pushf(&cmd->args, "--no-reuse-delta");
     - 	if (args->no_reuse_object)
     -@@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix)
     - 				N_("limits the maximum number of threads")),
     - 		OPT_STRING(0, "max-pack-size", &po_args.max_pack_size, N_("bytes"),
     - 				N_("maximum size of each packfile")),
     -+		OPT_STRING(0, "filter", &po_args.filter, N_("args"),
     -+				N_("object filtering")),
     - 		OPT_BOOL(0, "pack-kept-objects", &pack_kept_objects,
     - 				N_("repack objects in packs marked with .keep")),
     - 		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
     -@@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix)
     - 		if (line.len != the_hash_algo->hexsz)
     - 			die(_("repack: Expecting full hex object ID lines only from pack-objects."));
     - 		string_list_append(&names, line.buf);
     -+		if (po_args.filter) {
     -+			char *promisor_name = mkpathdup("%s-%s.promisor", packtmp,
     -+							line.buf);
     -+			write_promisor_file(promisor_name, NULL, 0);
     -+		}
     - 	}
     - 	fclose(out);
     - 	ret = finish_command(&cmd);
     -
       ## t/lib-httpd.sh ##
      @@ t/lib-httpd.sh: prepare_httpd() {
       	install_script error-smart-http.sh
     @@ t/t0410-partial-clone.sh: test_expect_success 'fetching of missing objects from
      +	git -C server rev-list --objects --all --missing=print >objects &&
      +	grep "$sha" objects
      +'
     ++
     ++test_expect_success 'fetch does not cause server to fetch missing objects' '
     ++	rm -rf origin server client &&
     ++	test_create_repo origin &&
     ++	dd if=/dev/zero of=origin/file1 bs=801k count=1 &&
     ++	git -C origin add file1 &&
     ++	git -C origin commit -m "large blob" &&
     ++	sha="$(git -C origin rev-parse :file1)" &&
     ++	expected="?$(git -C origin rev-parse :file1)" &&
     ++	git clone --bare --no-local origin server &&
     ++	git -C server remote add httpremote "testhttpgit::${PWD}/server" &&
     ++	git -C server config remote.httpremote.promisor true &&
     ++	git -C server config --remove-section remote.origin &&
     ++	git -C server rev-list --all --objects --filter-print-omitted \
     ++		--filter=blob:limit=800k | perl -ne "print if s/^[~]//" \
     ++		>large_blobs.txt &&
     ++	upload_blobs_from_stdin server <large_blobs.txt &&
     ++	git -C server -c repack.writebitmaps=false repack -a -d \
     ++		--filter=blob:limit=800k &&
     ++	git -C server config uploadpack.allowmissingpromisor true &&
     ++	git clone -c remote.httpremote.url="testhttpgit::${PWD}/server" \
     ++	-c remote.httpremote.fetch='+refs/heads/*:refs/remotes/httpremote/*' \
     ++	-c remote.httpremote.promisor=true --bare --no-local \
     ++	--filter=blob:limit=800k server client &&
     ++	git -C client rev-list --objects --all --missing=print >client_objects &&
     ++	grep "$expected" client_objects &&
     ++	git -C server rev-list --objects --all --missing=print >server_objects &&
     ++	grep "$expected" server_objects
     ++'
      +
       # DO NOT add non-httpd-specific tests here, because the last part of this
       # test script is only executed when httpd is available and enabled.

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v2 1/4] pack-objects: allow --filter without --stdout
  2022-02-09  2:10 ` [PATCH v2 0/4] [RFC] repack: add --filter= John Cai via GitGitGadget
@ 2022-02-09  2:10   ` John Cai via GitGitGadget
  2022-02-09  2:10   ` [PATCH v2 2/4] repack: add --filter=<filter-spec> option John Cai via GitGitGadget
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 34+ messages in thread
From: John Cai via GitGitGadget @ 2022-02-09  2:10 UTC (permalink / raw)
  To: git; +Cc: John Cai, John Cai

From: John Cai <johncai86@gmail.com>

9535ce7 taught pack-objects to use filtering, but added a requirement of
the --stdout since a partial clone mechanism was not yet in place to
handle missing objects. Since then, changes like 9e27beaa and others
added support to dynamically fetch objects that were missing.

Remove the --stdout requirement so that in the next commit, repack can
pass --filter to pack-objects to omit certain objects from the packfile.

Based-on-patch-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: John Cai <johncai86@gmail.com>
---
 builtin/pack-objects.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index ba2006f2212..2d1ecb18784 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -4075,8 +4075,6 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		unpack_unreachable_expiration = 0;
 
 	if (filter_options.choice) {
-		if (!pack_to_stdout)
-			die(_("cannot use --filter without --stdout"));
 		if (stdin_packs)
 			die(_("cannot use --filter with --stdin-packs"));
 	}
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v2 2/4] repack: add --filter=<filter-spec> option
  2022-02-09  2:10 ` [PATCH v2 0/4] [RFC] repack: add --filter= John Cai via GitGitGadget
  2022-02-09  2:10   ` [PATCH v2 1/4] pack-objects: allow --filter without --stdout John Cai via GitGitGadget
@ 2022-02-09  2:10   ` John Cai via GitGitGadget
  2022-02-09  2:10   ` [PATCH v2 3/4] upload-pack: allow missing promisor objects John Cai via GitGitGadget
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 34+ messages in thread
From: John Cai via GitGitGadget @ 2022-02-09  2:10 UTC (permalink / raw)
  To: git; +Cc: John Cai, John Cai

From: John Cai <johncai86@gmail.com>

In order to use a separate http server as a remote to offload large
blobs, imagine the following:

A. an http server to use as a generalized object store.
B. a server update hook that uploads large blobs to (A).
C. a git server
D. a remote helper that knows how to download objects from the http
server
E. a regular job that runs `git repack --filter` to remove large
blobs from (C).

Clients would need to configure both (C) and (A) as promisor remotes to
be able to get everything. When they push new large blobs, they can
still push them to (C), as (B) will upload them to (A), and (E) will
regularly remove those large blobs from (C).

This way with a little bit of client and server configuration, we can
have a native way to support offloading large files without git LFS.
It would be more flexible as you can easily tweak which blobs are
considered large files by tweaking (B) and (E).

A fuller demo can be found at http://tiny.cc/object_storage_demo

Based-on-patch-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: John Cai <johncai86@gmail.com>
---
 Documentation/git-repack.txt |  5 +++++
 builtin/repack.c             | 22 +++++++++++++++-------
 2 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index ee30edc178a..e394ec52ab1 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -126,6 +126,11 @@ depth is 4095.
 	a larger and slower repository; see the discussion in
 	`pack.packSizeLimit`.
 
+--filter=<filter-spec>::
+	Omits certain objects (usually blobs) from the resulting
+	packfile. See linkgit:git-rev-list[1] for valid
+	`<filter-spec>` forms.
+
 -b::
 --write-bitmap-index::
 	Write a reachability bitmap index as part of the repack. This
diff --git a/builtin/repack.c b/builtin/repack.c
index da1e364a756..3f1e8a39a2b 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -152,6 +152,7 @@ struct pack_objects_args {
 	const char *depth;
 	const char *threads;
 	const char *max_pack_size;
+	const char *filter;
 	int no_reuse_delta;
 	int no_reuse_object;
 	int quiet;
@@ -172,6 +173,8 @@ static void prepare_pack_objects(struct child_process *cmd,
 		strvec_pushf(&cmd->args, "--threads=%s", args->threads);
 	if (args->max_pack_size)
 		strvec_pushf(&cmd->args, "--max-pack-size=%s", args->max_pack_size);
+	if (args->filter)
+		strvec_pushf(&cmd->args, "--filter=%s", args->filter);
 	if (args->no_reuse_delta)
 		strvec_pushf(&cmd->args, "--no-reuse-delta");
 	if (args->no_reuse_object)
@@ -238,6 +241,13 @@ static unsigned populate_pack_exts(char *name)
 	return ret;
 }
 
+static void write_promisor_file_1(char *p)
+{
+	char *promisor_name = mkpathdup("%s-%s.promisor", packtmp, p);
+	write_promisor_file(promisor_name, NULL, 0);
+	free(promisor_name);
+}
+
 static void repack_promisor_objects(const struct pack_objects_args *args,
 				    struct string_list *names)
 {
@@ -269,7 +279,6 @@ static void repack_promisor_objects(const struct pack_objects_args *args,
 	out = xfdopen(cmd.out, "r");
 	while (strbuf_getline_lf(&line, out) != EOF) {
 		struct string_list_item *item;
-		char *promisor_name;
 
 		if (line.len != the_hash_algo->hexsz)
 			die(_("repack: Expecting full hex object ID lines only from pack-objects."));
@@ -286,13 +295,8 @@ static void repack_promisor_objects(const struct pack_objects_args *args,
 		 * concatenate the contents of all .promisor files instead of
 		 * just creating a new empty file.
 		 */
-		promisor_name = mkpathdup("%s-%s.promisor", packtmp,
-					  line.buf);
-		write_promisor_file(promisor_name, NULL, 0);
-
+		write_promisor_file_1(line.buf);
 		item->util = (void *)(uintptr_t)populate_pack_exts(item->string);
-
-		free(promisor_name);
 	}
 	fclose(out);
 	if (finish_command(&cmd))
@@ -660,6 +664,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 				N_("limits the maximum number of threads")),
 		OPT_STRING(0, "max-pack-size", &po_args.max_pack_size, N_("bytes"),
 				N_("maximum size of each packfile")),
+		OPT_STRING(0, "filter", &po_args.filter, N_("args"),
+				N_("object filtering")),
 		OPT_BOOL(0, "pack-kept-objects", &pack_kept_objects,
 				N_("repack objects in packs marked with .keep")),
 		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
@@ -819,6 +825,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		if (line.len != the_hash_algo->hexsz)
 			die(_("repack: Expecting full hex object ID lines only from pack-objects."));
 		string_list_append(&names, line.buf);
+		if (po_args.filter)
+			write_promisor_file_1(line.buf);
 	}
 	fclose(out);
 	ret = finish_command(&cmd);
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v2 3/4] upload-pack: allow missing promisor objects
  2022-02-09  2:10 ` [PATCH v2 0/4] [RFC] repack: add --filter= John Cai via GitGitGadget
  2022-02-09  2:10   ` [PATCH v2 1/4] pack-objects: allow --filter without --stdout John Cai via GitGitGadget
  2022-02-09  2:10   ` [PATCH v2 2/4] repack: add --filter=<filter-spec> option John Cai via GitGitGadget
@ 2022-02-09  2:10   ` John Cai via GitGitGadget
  2022-02-09  2:10   ` [PATCH v2 4/4] tests for repack --filter mode John Cai via GitGitGadget
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 34+ messages in thread
From: John Cai via GitGitGadget @ 2022-02-09  2:10 UTC (permalink / raw)
  To: git; +Cc: John Cai, John Cai

From: John Cai <johncai86@gmail.com>

When a git server (A) is being used alongside an http server (B) remote
that stores large blobs, and a client fetches objects from both (A) as
well as (B), we do not want (A) to fetch missing objects during object
traversal.

Add a config value uploadpack.allowmissingpromisor that, when set to
true, will allow (A) to skip fetching missing objects.

Based-on-patch-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: John Cai <johncai86@gmail.com>
---
 upload-pack.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/upload-pack.c b/upload-pack.c
index 8acc98741bb..39b56650b77 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -112,6 +112,7 @@ struct upload_pack_data {
 	unsigned allow_ref_in_want : 1;				/* v2 only */
 	unsigned allow_sideband_all : 1;			/* v2 only */
 	unsigned advertise_sid : 1;
+	unsigned allow_missing_promisor : 1;
 };
 
 static void upload_pack_data_init(struct upload_pack_data *data)
@@ -309,6 +310,8 @@ static void create_pack_file(struct upload_pack_data *pack_data,
 		strvec_push(&pack_objects.args, "--delta-base-offset");
 	if (pack_data->use_include_tag)
 		strvec_push(&pack_objects.args, "--include-tag");
+	if (pack_data->allow_missing_promisor)
+		strvec_push(&pack_objects.args, "--missing=allow-promisor");
 	if (pack_data->filter_options.choice) {
 		const char *spec =
 			expand_list_objects_filter_spec(&pack_data->filter_options);
@@ -1315,6 +1318,8 @@ static int upload_pack_config(const char *var, const char *value, void *cb_data)
 		data->allow_ref_in_want = git_config_bool(var, value);
 	} else if (!strcmp("uploadpack.allowsidebandall", var)) {
 		data->allow_sideband_all = git_config_bool(var, value);
+	} else if (!strcmp("uploadpack.allowmissingpromisor", var)) {
+		data->allow_missing_promisor = git_config_bool(var, value);
 	} else if (!strcmp("core.precomposeunicode", var)) {
 		precomposed_unicode = git_config_bool(var, value);
 	} else if (!strcmp("transfer.advertisesid", var)) {
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v2 4/4] tests for repack --filter mode
  2022-02-09  2:10 ` [PATCH v2 0/4] [RFC] repack: add --filter= John Cai via GitGitGadget
                     ` (2 preceding siblings ...)
  2022-02-09  2:10   ` [PATCH v2 3/4] upload-pack: allow missing promisor objects John Cai via GitGitGadget
@ 2022-02-09  2:10   ` John Cai via GitGitGadget
  2022-02-17 16:14     ` Robert Coup
  2022-02-09  2:27   ` [PATCH v2 0/4] [RFC] repack: add --filter= John Cai
  2022-02-16 15:39   ` Robert Coup
  5 siblings, 1 reply; 34+ messages in thread
From: John Cai via GitGitGadget @ 2022-02-09  2:10 UTC (permalink / raw)
  To: git; +Cc: John Cai, John Cai

From: John Cai <johncai86@gmail.com>

This patch adds tests to test both repack --filter functionality in
isolation (in t7700-repack.sh) as well as how it can be used to offload
large blobs (in t0410-partial-clone.sh)

There are several scripts added so we can test the process of using a
remote helper to upload blobs to an http server.

- t/lib-httpd/list.sh lists blobs uploaded to the http server.
- t/lib-httpd/upload.sh uploads blobs to the http server.
- t/t0410/git-remote-testhttpgit a remote helper that can access blobs
  onto from an http server. Copied over from t/t5801/git-remote-testhttpgit
  and modified to upload blobs to an http server.
- t/t0410/lib-http-promisor.sh convenience functions for uploading
  blobs

Based-on-patch-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: John Cai <johncai86@gmail.com>
---
 t/lib-httpd.sh                 |   2 +
 t/lib-httpd/apache.conf        |   8 ++
 t/lib-httpd/list.sh            |  43 +++++++++
 t/lib-httpd/upload.sh          |  46 +++++++++
 t/t0410-partial-clone.sh       |  81 ++++++++++++++++
 t/t0410/git-remote-testhttpgit | 170 +++++++++++++++++++++++++++++++++
 t/t7700-repack.sh              |  20 ++++
 7 files changed, 370 insertions(+)
 create mode 100644 t/lib-httpd/list.sh
 create mode 100644 t/lib-httpd/upload.sh
 create mode 100755 t/t0410/git-remote-testhttpgit

diff --git a/t/lib-httpd.sh b/t/lib-httpd.sh
index 782891908d7..fc6587c6d39 100644
--- a/t/lib-httpd.sh
+++ b/t/lib-httpd.sh
@@ -136,6 +136,8 @@ prepare_httpd() {
 	install_script error-smart-http.sh
 	install_script error.sh
 	install_script apply-one-time-perl.sh
+	install_script upload.sh
+	install_script list.sh
 
 	ln -s "$LIB_HTTPD_MODULE_PATH" "$HTTPD_ROOT_PATH/modules"
 
diff --git a/t/lib-httpd/apache.conf b/t/lib-httpd/apache.conf
index 497b9b9d927..1ea382750f0 100644
--- a/t/lib-httpd/apache.conf
+++ b/t/lib-httpd/apache.conf
@@ -129,6 +129,8 @@ ScriptAlias /broken_smart/ broken-smart-http.sh/
 ScriptAlias /error_smart/ error-smart-http.sh/
 ScriptAlias /error/ error.sh/
 ScriptAliasMatch /one_time_perl/(.*) apply-one-time-perl.sh/$1
+ScriptAlias /upload/ upload.sh/
+ScriptAlias /list/ list.sh/
 <Directory ${GIT_EXEC_PATH}>
 	Options FollowSymlinks
 </Directory>
@@ -156,6 +158,12 @@ ScriptAliasMatch /one_time_perl/(.*) apply-one-time-perl.sh/$1
 <Files ${GIT_EXEC_PATH}/git-http-backend>
 	Options ExecCGI
 </Files>
+<Files upload.sh>
+  Options ExecCGI
+</Files>
+<Files list.sh>
+  Options ExecCGI
+</Files>
 
 RewriteEngine on
 RewriteRule ^/dumb-redir/(.*)$ /dumb/$1 [R=301]
diff --git a/t/lib-httpd/list.sh b/t/lib-httpd/list.sh
new file mode 100644
index 00000000000..e63406be3b2
--- /dev/null
+++ b/t/lib-httpd/list.sh
@@ -0,0 +1,43 @@
+#!/bin/sh
+
+# Used in the httpd test server to be called by a remote helper to list objects.
+
+FILES_DIR="www/files"
+
+OLDIFS="$IFS"
+IFS='&'
+set -- $QUERY_STRING
+IFS="$OLDIFS"
+
+while test $# -gt 0
+do
+	key=${1%%=*}
+	val=${1#*=}
+
+	case "$key" in
+	"sha1") sha1="$val" ;;
+	*) echo >&2 "unknown key '$key'" ;;
+	esac
+
+	shift
+done
+
+if test -d "$FILES_DIR"
+then
+	if test -z "$sha1"
+	then
+		echo 'Status: 200 OK'
+		echo
+		ls "$FILES_DIR" | tr '-' ' '
+	else
+		if test -f "$FILES_DIR/$sha1"-*
+		then
+			echo 'Status: 200 OK'
+			echo
+			cat "$FILES_DIR/$sha1"-*
+		else
+			echo 'Status: 404 Not Found'
+			echo
+		fi
+	fi
+fi
diff --git a/t/lib-httpd/upload.sh b/t/lib-httpd/upload.sh
new file mode 100644
index 00000000000..202de63b2dc
--- /dev/null
+++ b/t/lib-httpd/upload.sh
@@ -0,0 +1,46 @@
+#!/bin/sh
+
+# In part from http://codereview.stackexchange.com/questions/79549/bash-cgi-upload-file
+# Used in the httpd test server to for a remote helper to call to upload blobs.
+
+FILES_DIR="www/files"
+
+OLDIFS="$IFS"
+IFS='&'
+set -- $QUERY_STRING
+IFS="$OLDIFS"
+
+while test $# -gt 0
+do
+	key=${1%%=*}
+	val=${1#*=}
+
+	case "$key" in
+	"sha1") sha1="$val" ;;
+	"type") type="$val" ;;
+	"size") size="$val" ;;
+	"delete") delete=1 ;;
+	*) echo >&2 "unknown key '$key'" ;;
+	esac
+
+	shift
+done
+
+case "$REQUEST_METHOD" in
+POST)
+	if test "$delete" = "1"
+	then
+		rm -f "$FILES_DIR/$sha1-$size-$type"
+	else
+		mkdir -p "$FILES_DIR"
+		cat >"$FILES_DIR/$sha1-$size-$type"
+	fi
+
+	echo 'Status: 204 No Content'
+	echo
+	;;
+
+*)
+	echo 'Status: 405 Method Not Allowed'
+	echo
+esac
diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
index f17abd298c8..0724043ffb7 100755
--- a/t/t0410-partial-clone.sh
+++ b/t/t0410-partial-clone.sh
@@ -30,6 +30,31 @@ promise_and_delete () {
 	delete_object repo "$HASH"
 }
 
+upload_blob() {
+	SERVER_REPO="$1"
+	HASH="$2"
+
+	test -n "$HASH" || die "Invalid argument '$HASH'"
+	HASH_SIZE=$(git -C "$SERVER_REPO" cat-file -s "$HASH") || {
+		echo >&2 "Cannot get blob size of '$HASH'"
+		return 1
+	}
+
+	UPLOAD_URL="http://127.0.0.1:$LIB_HTTPD_PORT/upload/?sha1=$HASH&size=$HASH_SIZE&type=blob"
+
+	git -C "$SERVER_REPO" cat-file blob "$HASH" >object &&
+	curl --data-binary @object --include "$UPLOAD_URL"
+}
+
+upload_blobs_from_stdin() {
+	SERVER_REPO="$1"
+	while read -r blob
+	do
+		echo "uploading $blob"
+		upload_blob "$SERVER_REPO" "$blob" || return
+	done
+}
+
 test_expect_success 'extensions.partialclone without filter' '
 	test_create_repo server &&
 	git clone --filter="blob:none" "file://$(pwd)/server" client &&
@@ -668,6 +693,62 @@ test_expect_success 'fetching of missing objects from an HTTP server' '
 	grep "$HASH" out
 '
 
+PATH="$TEST_DIRECTORY/t0410:$PATH"
+
+test_expect_success 'fetch of missing objects through remote helper' '
+	rm -rf origin server &&
+	test_create_repo origin &&
+	dd if=/dev/zero of=origin/file1 bs=801k count=1 &&
+	git -C origin add file1 &&
+	git -C origin commit -m "large blob" &&
+	sha="$(git -C origin rev-parse :file1)" &&
+	expected="?$(git -C origin rev-parse :file1)" &&
+	git clone --bare --no-local origin server &&
+	git -C server remote add httpremote "testhttpgit::${PWD}/server" &&
+	git -C server config remote.httpremote.promisor true &&
+	git -C server config --remove-section remote.origin &&
+	git -C server rev-list --all --objects --filter-print-omitted \
+		--filter=blob:limit=800k | perl -ne "print if s/^[~]//" \
+		>large_blobs.txt &&
+	upload_blobs_from_stdin server <large_blobs.txt &&
+	git -C server -c repack.writebitmaps=false repack -a -d \
+		--filter=blob:limit=800k &&
+	git -C server rev-list --objects --all --missing=print >objects &&
+	grep "$expected" objects &&
+	HTTPD_URL=$HTTPD_URL git -C server show $sha &&
+	git -C server rev-list --objects --all --missing=print >objects &&
+	grep "$sha" objects
+'
+
+test_expect_success 'fetch does not cause server to fetch missing objects' '
+	rm -rf origin server client &&
+	test_create_repo origin &&
+	dd if=/dev/zero of=origin/file1 bs=801k count=1 &&
+	git -C origin add file1 &&
+	git -C origin commit -m "large blob" &&
+	sha="$(git -C origin rev-parse :file1)" &&
+	expected="?$(git -C origin rev-parse :file1)" &&
+	git clone --bare --no-local origin server &&
+	git -C server remote add httpremote "testhttpgit::${PWD}/server" &&
+	git -C server config remote.httpremote.promisor true &&
+	git -C server config --remove-section remote.origin &&
+	git -C server rev-list --all --objects --filter-print-omitted \
+		--filter=blob:limit=800k | perl -ne "print if s/^[~]//" \
+		>large_blobs.txt &&
+	upload_blobs_from_stdin server <large_blobs.txt &&
+	git -C server -c repack.writebitmaps=false repack -a -d \
+		--filter=blob:limit=800k &&
+	git -C server config uploadpack.allowmissingpromisor true &&
+	git clone -c remote.httpremote.url="testhttpgit::${PWD}/server" \
+	-c remote.httpremote.fetch='+refs/heads/*:refs/remotes/httpremote/*' \
+	-c remote.httpremote.promisor=true --bare --no-local \
+	--filter=blob:limit=800k server client &&
+	git -C client rev-list --objects --all --missing=print >client_objects &&
+	grep "$expected" client_objects &&
+	git -C server rev-list --objects --all --missing=print >server_objects &&
+	grep "$expected" server_objects
+'
+
 # DO NOT add non-httpd-specific tests here, because the last part of this
 # test script is only executed when httpd is available and enabled.
 
diff --git a/t/t0410/git-remote-testhttpgit b/t/t0410/git-remote-testhttpgit
new file mode 100755
index 00000000000..e5e187243ed
--- /dev/null
+++ b/t/t0410/git-remote-testhttpgit
@@ -0,0 +1,170 @@
+#!/bin/sh
+# Copyright (c) 2012 Felipe Contreras
+# Copyright (c) 2020 Christian Couder
+
+# This is a git remote helper that can be used to store blobs on an http server
+
+# The first argument can be a url when the fetch/push command was a url
+# instead of a configured remote. In this case, use a generic alias.
+if test "$1" = "testhttpgit::$2"; then
+	alias=_
+else
+	alias=$1
+fi
+url=$2
+
+unset GIT_DIR
+
+h_refspec="refs/heads/*:refs/testhttpgit/$alias/heads/*"
+t_refspec="refs/tags/*:refs/testhttpgit/$alias/tags/*"
+
+if test -n "$GIT_REMOTE_TESTHTTPGIT_NOREFSPEC"
+then
+	h_refspec=""
+	t_refspec=""
+fi
+
+die () {
+	echo >&2 "fatal: $*"
+	echo "fatal: $*" >>/tmp/t0430.txt
+	echo >>/tmp/t0430.txt
+	exit 1
+}
+
+force=
+
+mark_count_tmp=$(mktemp -t git-remote-http-mark-count_XXXXXX) || die "Failed to create temp file"
+echo "1" >"$mark_count_tmp"
+
+get_mark_count() {
+	mark=$(cat "$mark_count_tmp")
+	echo "$mark"
+	mark=$((mark+1))
+	echo "$mark" >"$mark_count_tmp"	
+}
+
+export_blob_from_file() {
+	file="$1"
+	echo "blob"
+	echo "mark :$(get_mark_count)"
+	size=$(wc -c <"$file") || return
+	echo "data $size"
+	cat "$file" || return
+	echo
+}
+
+while read line
+do
+	case $line in
+	capabilities)
+		echo 'import'
+		echo 'export'
+		test -n "$h_refspec" && echo "refspec $h_refspec"
+		test -n "$t_refspec" && echo "refspec $t_refspec"
+		test -n "$GIT_REMOTE_TESTHTTPGIT_SIGNED_TAGS" && echo "signed-tags"
+		test -n "$GIT_REMOTE_TESTHTTPGIT_NO_PRIVATE_UPDATE" && echo "no-private-update"
+		echo 'option'
+		echo
+		;;
+	list)
+		git -C "$url" for-each-ref --format='? %(refname)' 'refs/heads/' 'refs/tags/'
+		head=$(git -C "$url" symbolic-ref HEAD)
+		echo "@$head HEAD"
+		echo
+		;;
+	import*)
+		# read all import lines
+		while true
+		do
+			ref="${line#* }"
+			refs="$refs $ref"
+			read line
+			test "${line%% *}" != "import" && break
+		done
+
+		echo "refs: $refs" >>/tmp/t0430.txt
+
+		if test -n "$GIT_REMOTE_TESTHTTPGIT_FAILURE"
+		then
+			echo "feature done"
+			exit 1
+		fi
+
+		echo "feature done"
+
+		tmpdir=$(mktemp -d -t git-remote-http-import_XXXXXX) || die "Failed to create temp directory"
+
+		for ref in $refs
+		do
+			get_url="$HTTPD_URL/list/?sha1=$ref"
+			echo "curl url: $get_url" >>/tmp/t0430.txt
+			echo "curl output: $tmpdir/$ref" >>/tmp/t0430.txt
+			curl -s -o "$tmpdir/$ref" "$get_url" ||
+				die "curl '$get_url' failed"
+			echo "exporting from: $tmpdir/$ref" >>/tmp/t0430.txt
+			export_blob_from_file "$tmpdir/$ref" ||
+				die "failed to export blob from '$tmpdir/$ref'"
+			echo "done exporting" >>/tmp/t0430.txt
+		done
+
+		echo "done"
+		;;
+	export)
+		if test -n "$GIT_REMOTE_TESTHTTPGIT_FAILURE"
+		then
+			# consume input so fast-export doesn't get SIGPIPE;
+			# git would also notice that case, but we want
+			# to make sure we are exercising the later
+			# error checks
+			while read line; do
+				test "done" = "$line" && break
+			done
+			exit 1
+		fi
+
+		before=$(git -C "$url" for-each-ref --format=' %(refname) %(objectname) ')
+
+		git -C "$url" fast-import \
+			${force:+--force} \
+			${testhttpgitmarks:+"--import-marks=$testhttpgitmarks"} \
+			${testhttpgitmarks:+"--export-marks=$testhttpgitmarks"} \
+			--quiet
+
+		# figure out which refs were updated
+		git -C "$url" for-each-ref --format='%(refname) %(objectname)' |
+		while read ref a
+		do
+			case "$before" in
+			*" $ref $a "*)
+				continue ;;	# unchanged
+			esac
+			if test -z "$GIT_REMOTE_TESTHTTPGIT_PUSH_ERROR"
+			then
+				echo "ok $ref"
+			else
+				echo "error $ref $GIT_REMOTE_TESTHTTPGIT_PUSH_ERROR"
+			fi
+		done
+
+		echo
+		;;
+	option\ *)
+		read cmd opt val <<-EOF
+		$line
+		EOF
+		case $opt in
+		force)
+			test $val = "true" && force="true" || force=
+			echo "ok"
+			;;
+		*)
+			echo "unsupported"
+			;;
+		esac
+		;;
+	'')
+		exit
+		;;
+	esac
+done
+
diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index e489869dd94..78cc1858cb6 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -237,6 +237,26 @@ test_expect_success 'auto-bitmaps do not complain if unavailable' '
 	test_must_be_empty actual
 '
 
+test_expect_success 'repack with filter does not fetch from remote' '
+	rm -rf server client &&
+	test_create_repo server &&
+	git -C server config uploadpack.allowFilter true &&
+	git -C server config uploadpack.allowAnySHA1InWant true &&
+	echo content1 >server/file1 &&
+	git -C server add file1 &&
+	git -C server commit -m initial_commit &&
+	expected="?$(git -C server rev-parse :file1)" &&
+	git clone --bare --no-local server client &&
+	git -C client config remote.origin.promisor true &&
+	git -C client -c repack.writebitmaps=false repack -a -d --filter=blob:none &&
+	git -C client rev-list --objects --all --missing=print >objects &&
+	grep "$expected" objects &&
+	git -C client repack -a -d &&
+	expected="$(git -C server rev-parse :file1)" &&
+	git -C client rev-list --objects --all --missing=print >objects &&
+	grep "$expected" objects
+'
+
 objdir=.git/objects
 midx=$objdir/pack/multi-pack-index
 
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 0/4] [RFC] repack: add --filter=
  2022-02-09  2:10 ` [PATCH v2 0/4] [RFC] repack: add --filter= John Cai via GitGitGadget
                     ` (3 preceding siblings ...)
  2022-02-09  2:10   ` [PATCH v2 4/4] tests for repack --filter mode John Cai via GitGitGadget
@ 2022-02-09  2:27   ` John Cai
  2022-02-16 15:39   ` Robert Coup
  5 siblings, 0 replies; 34+ messages in thread
From: John Cai @ 2022-02-09  2:27 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, Christian Couder, Derrick Stolee, robert

Hi Johannes

I'm not sure where I went wrong on GGG. Somehow the cc list didn't get translated into
cc fields. Here's the PR: https://github.com/git/git/pull/1206. Thanks!

cc'ing folks I meant to cc for this patch series

On 8 Feb 2022, at 21:10, John Cai via GitGitGadget wrote:

> This patch series makes partial clone more useful by making it possible to
> run repack to remove objects from a repository (replacing it with promisor
> objects). This is useful when we want to offload large blobs from a git
> server onto another git server, or even use an http server through a remote
> helper.
>
> In [A], a --refilter option on fetch and fetch-pack is being discussed where
> either a less restrictive or more restrictive filter can be used. In the
> more restrictive case, the objects that already exist will not be deleted.
> But, one can imagine that users might want the ability to delete objects
> when they apply a more restrictive filter in order to save space, and this
> patch series would also allow that.
>
> There are a couple of things we need to adjust to make this possible. This
> patch has three parts.
>
>  1. Allow --filter in pack-objects without --stdout
>  2. Add a --filter flag for repack
>  3. Allow missing promisor objects in upload-pack
>  4. Tests that demonstrate the ability to offload objects onto an http
>     remote
>
> cc: Christian Couder christian.couder@gmail.com cc: Derrick Stolee
> stolee@gmail.com cc: Robert Coup robert@coup.net.nz
>
> A.
> https://lore.kernel.org/git/pull.1138.git.1643730593.gitgitgadget@gmail.com/
>
> John Cai (4):
>   pack-objects: allow --filter without --stdout
>   repack: add --filter=<filter-spec> option
>   upload-pack: allow missing promisor objects
>   tests for repack --filter mode
>
>  Documentation/git-repack.txt   |   5 +
>  builtin/pack-objects.c         |   2 -
>  builtin/repack.c               |  22 +++--
>  t/lib-httpd.sh                 |   2 +
>  t/lib-httpd/apache.conf        |   8 ++
>  t/lib-httpd/list.sh            |  43 +++++++++
>  t/lib-httpd/upload.sh          |  46 +++++++++
>  t/t0410-partial-clone.sh       |  81 ++++++++++++++++
>  t/t0410/git-remote-testhttpgit | 170 +++++++++++++++++++++++++++++++++
>  t/t7700-repack.sh              |  20 ++++
>  upload-pack.c                  |   5 +
>  11 files changed, 395 insertions(+), 9 deletions(-)
>  create mode 100644 t/lib-httpd/list.sh
>  create mode 100644 t/lib-httpd/upload.sh
>  create mode 100755 t/t0410/git-remote-testhttpgit
>
>
> base-commit: 38062e73e009f27ea192d50481fcb5e7b0e9d6eb
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1206%2Fjohn-cai%2Fjc-repack-filter-v2
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1206/john-cai/jc-repack-filter-v2
> Pull-Request: https://github.com/git/git/pull/1206
>
> Range-diff vs v1:
>
>  1:  0eec9b117da = 1:  f43b76ca650 pack-objects: allow --filter without --stdout
>  -:  ----------- > 2:  6e7c8410b8d repack: add --filter=<filter-spec> option
>  -:  ----------- > 3:  40612b9663b upload-pack: allow missing promisor objects
>  2:  a3166381572 ! 4:  d76faa1f16e repack: add --filter=<filter-spec> option
>      @@ Metadata
>       Author: John Cai <johncai86@gmail.com>
>
>        ## Commit message ##
>      -    repack: add --filter=<filter-spec> option
>      +    tests for repack --filter mode
>
>      -    Currently, repack does not work with partial clones. When repack is run
>      -    on a partially cloned repository, it grabs all missing objects from
>      -    promisor remotes. This also means that when gc is run for repository
>      -    maintenance on a partially cloned repository, it will end up getting
>      -    missing objects, which is not what we want.
>      -
>      -    In order to make repack work with partial clone, teach repack a new
>      -    option --filter, which takes a <filter-spec> argument. repack will skip
>      -    any objects that are matched by <filter-spec> similar to how the clone
>      -    command will skip fetching certain objects.
>      -
>      -    The final goal of this feature, is to be able to store objects on a
>      -    server other than the regular git server itself.
>      +    This patch adds tests to test both repack --filter functionality in
>      +    isolation (in t7700-repack.sh) as well as how it can be used to offload
>      +    large blobs (in t0410-partial-clone.sh)
>
>           There are several scripts added so we can test the process of using a
>      -    remote helper to upload blobs to an http server:
>      +    remote helper to upload blobs to an http server.
>
>           - t/lib-httpd/list.sh lists blobs uploaded to the http server.
>           - t/lib-httpd/upload.sh uploads blobs to the http server.
>      @@ Commit message
>           Based-on-patch-by: Christian Couder <chriscool@tuxfamily.org>
>           Signed-off-by: John Cai <johncai86@gmail.com>
>
>      - ## Documentation/git-repack.txt ##
>      -@@ Documentation/git-repack.txt: depth is 4095.
>      - 	a larger and slower repository; see the discussion in
>      - 	`pack.packSizeLimit`.
>      -
>      -+--filter=<filter-spec>::
>      -+	Omits certain objects (usually blobs) from the resulting
>      -+	packfile. See linkgit:git-rev-list[1] for valid
>      -+	`<filter-spec>` forms.
>      -+
>      - -b::
>      - --write-bitmap-index::
>      - 	Write a reachability bitmap index as part of the repack. This
>      -
>      - ## builtin/repack.c ##
>      -@@ builtin/repack.c: struct pack_objects_args {
>      - 	const char *depth;
>      - 	const char *threads;
>      - 	const char *max_pack_size;
>      -+	const char *filter;
>      - 	int no_reuse_delta;
>      - 	int no_reuse_object;
>      - 	int quiet;
>      -@@ builtin/repack.c: static void prepare_pack_objects(struct child_process *cmd,
>      - 		strvec_pushf(&cmd->args, "--threads=%s", args->threads);
>      - 	if (args->max_pack_size)
>      - 		strvec_pushf(&cmd->args, "--max-pack-size=%s", args->max_pack_size);
>      -+	if (args->filter)
>      -+		strvec_pushf(&cmd->args, "--filter=%s", args->filter);
>      - 	if (args->no_reuse_delta)
>      - 		strvec_pushf(&cmd->args, "--no-reuse-delta");
>      - 	if (args->no_reuse_object)
>      -@@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix)
>      - 				N_("limits the maximum number of threads")),
>      - 		OPT_STRING(0, "max-pack-size", &po_args.max_pack_size, N_("bytes"),
>      - 				N_("maximum size of each packfile")),
>      -+		OPT_STRING(0, "filter", &po_args.filter, N_("args"),
>      -+				N_("object filtering")),
>      - 		OPT_BOOL(0, "pack-kept-objects", &pack_kept_objects,
>      - 				N_("repack objects in packs marked with .keep")),
>      - 		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
>      -@@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix)
>      - 		if (line.len != the_hash_algo->hexsz)
>      - 			die(_("repack: Expecting full hex object ID lines only from pack-objects."));
>      - 		string_list_append(&names, line.buf);
>      -+		if (po_args.filter) {
>      -+			char *promisor_name = mkpathdup("%s-%s.promisor", packtmp,
>      -+							line.buf);
>      -+			write_promisor_file(promisor_name, NULL, 0);
>      -+		}
>      - 	}
>      - 	fclose(out);
>      - 	ret = finish_command(&cmd);
>      -
>        ## t/lib-httpd.sh ##
>       @@ t/lib-httpd.sh: prepare_httpd() {
>        	install_script error-smart-http.sh
>      @@ t/t0410-partial-clone.sh: test_expect_success 'fetching of missing objects from
>       +	git -C server rev-list --objects --all --missing=print >objects &&
>       +	grep "$sha" objects
>       +'
>      ++
>      ++test_expect_success 'fetch does not cause server to fetch missing objects' '
>      ++	rm -rf origin server client &&
>      ++	test_create_repo origin &&
>      ++	dd if=/dev/zero of=origin/file1 bs=801k count=1 &&
>      ++	git -C origin add file1 &&
>      ++	git -C origin commit -m "large blob" &&
>      ++	sha="$(git -C origin rev-parse :file1)" &&
>      ++	expected="?$(git -C origin rev-parse :file1)" &&
>      ++	git clone --bare --no-local origin server &&
>      ++	git -C server remote add httpremote "testhttpgit::${PWD}/server" &&
>      ++	git -C server config remote.httpremote.promisor true &&
>      ++	git -C server config --remove-section remote.origin &&
>      ++	git -C server rev-list --all --objects --filter-print-omitted \
>      ++		--filter=blob:limit=800k | perl -ne "print if s/^[~]//" \
>      ++		>large_blobs.txt &&
>      ++	upload_blobs_from_stdin server <large_blobs.txt &&
>      ++	git -C server -c repack.writebitmaps=false repack -a -d \
>      ++		--filter=blob:limit=800k &&
>      ++	git -C server config uploadpack.allowmissingpromisor true &&
>      ++	git clone -c remote.httpremote.url="testhttpgit::${PWD}/server" \
>      ++	-c remote.httpremote.fetch='+refs/heads/*:refs/remotes/httpremote/*' \
>      ++	-c remote.httpremote.promisor=true --bare --no-local \
>      ++	--filter=blob:limit=800k server client &&
>      ++	git -C client rev-list --objects --all --missing=print >client_objects &&
>      ++	grep "$expected" client_objects &&
>      ++	git -C server rev-list --objects --all --missing=print >server_objects &&
>      ++	grep "$expected" server_objects
>      ++'
>       +
>        # DO NOT add non-httpd-specific tests here, because the last part of this
>        # test script is only executed when httpd is available and enabled.
>
> -- 
> gitgitgadget

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 0/4] [RFC] repack: add --filter=
  2022-02-09  2:10 ` [PATCH v2 0/4] [RFC] repack: add --filter= John Cai via GitGitGadget
                     ` (4 preceding siblings ...)
  2022-02-09  2:27   ` [PATCH v2 0/4] [RFC] repack: add --filter= John Cai
@ 2022-02-16 15:39   ` Robert Coup
  2022-02-16 21:07     ` John Cai
  5 siblings, 1 reply; 34+ messages in thread
From: Robert Coup @ 2022-02-16 15:39 UTC (permalink / raw)
  To: John Cai via GitGitGadget; +Cc: git, John Cai

Hi John,

On Wed, 9 Feb 2022 at 02:41, John Cai via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> This patch series makes partial clone more useful by making it possible to
> run repack to remove objects from a repository (replacing it with promisor
> objects). This is useful when we want to offload large blobs from a git
> server onto another git server, or even use an http server through a remote
> helper.
>
> In [A], a --refilter option on fetch and fetch-pack is being discussed where
> either a less restrictive or more restrictive filter can be used. In the
> more restrictive case, the objects that already exist will not be deleted.
> But, one can imagine that users might want the ability to delete objects
> when they apply a more restrictive filter in order to save space, and this
> patch series would also allow that.

This all makes sense to me, and the implementation is remarkably short -
gluing together capabilities that are already there, and writing tests.

*But*, running `repack --filter` drops objects from the object db.
That seems like
a capability Git shouldn't idly expose without people understanding the
consequences - mostly that they really have another copy elsewhere or they
will lose data, and it won't necessarily be obvious for a long time. Otherwise
it is a footgun.

I don't know whether that is just around naming (--delete-filter /
--drop-filter /
--expire-filter ?), and/or making the documentation very explicit that
this isn't so
much "omitting certain objects from a packfile" as irretrievably
deleting objects.

Rob :)

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 0/4] [RFC] repack: add --filter=
  2022-02-16 15:39   ` Robert Coup
@ 2022-02-16 21:07     ` John Cai
  2022-02-21  3:11       ` Taylor Blau
  0 siblings, 1 reply; 34+ messages in thread
From: John Cai @ 2022-02-16 21:07 UTC (permalink / raw)
  To: Robert Coup
  Cc: John Cai via GitGitGadget, git, Derrick Stolee, Christian Couder

Hi Rob,

glad these two efforts dovetail nicely!

On 16 Feb 2022, at 10:39, Robert Coup wrote:

> Hi John,
>
> On Wed, 9 Feb 2022 at 02:41, John Cai via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>>
>> This patch series makes partial clone more useful by making it possible to
>> run repack to remove objects from a repository (replacing it with promisor
>> objects). This is useful when we want to offload large blobs from a git
>> server onto another git server, or even use an http server through a remote
>> helper.
>>
>> In [A], a --refilter option on fetch and fetch-pack is being discussed where
>> either a less restrictive or more restrictive filter can be used. In the
>> more restrictive case, the objects that already exist will not be deleted.
>> But, one can imagine that users might want the ability to delete objects
>> when they apply a more restrictive filter in order to save space, and this
>> patch series would also allow that.
>
> This all makes sense to me, and the implementation is remarkably short -
> gluing together capabilities that are already there, and writing tests.
>
> *But*, running `repack --filter` drops objects from the object db.
> That seems like
> a capability Git shouldn't idly expose without people understanding the
> consequences - mostly that they really have another copy elsewhere or they
> will lose data, and it won't necessarily be obvious for a long time. Otherwise
> it is a footgun.

Yes, great point. I think there was concern from Stolee around this as well.
>
> I don't know whether that is just around naming (--delete-filter /
> --drop-filter /
> --expire-filter ?), and/or making the documentation very explicit that
> this isn't so
> much "omitting certain objects from a packfile" as irretrievably
> deleting objects.

Yeah, making the name very clear (I kind of like --delete-filter) would certainly help.
Also, to have more protection we can either

1. add a config value that needs to be set to true for repack to remove
objects (repack.allowDestroyFilter).

2. --filter is dry-run by default and prints out objects that would have been removed,
and it has to be combined with another flag --destroy in order for it to actually remove
objects from the odb.

>
> Rob :)

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 4/4] tests for repack --filter mode
  2022-02-09  2:10   ` [PATCH v2 4/4] tests for repack --filter mode John Cai via GitGitGadget
@ 2022-02-17 16:14     ` Robert Coup
  2022-02-17 20:36       ` John Cai
  0 siblings, 1 reply; 34+ messages in thread
From: Robert Coup @ 2022-02-17 16:14 UTC (permalink / raw)
  To: John Cai via GitGitGadget; +Cc: git, John Cai

Hi John,

Minor, but should we use oid rather than sha1 in the list.sh/upload.sh
scripts? wrt sha256 slowly coming along the pipe.

> diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
> index e489869dd94..78cc1858cb6 100755
> --- a/t/t7700-repack.sh
> +++ b/t/t7700-repack.sh
> @@ -237,6 +237,26 @@ test_expect_success 'auto-bitmaps do not complain if unavailable' '
>         test_must_be_empty actual
>  '
>
> +test_expect_success 'repack with filter does not fetch from remote' '
> +       rm -rf server client &&
> +       test_create_repo server &&
> +       git -C server config uploadpack.allowFilter true &&
> +       git -C server config uploadpack.allowAnySHA1InWant true &&
> +       echo content1 >server/file1 &&
> +       git -C server add file1 &&
> +       git -C server commit -m initial_commit &&
> +       expected="?$(git -C server rev-parse :file1)" &&
> +       git clone --bare --no-local server client &&
> +       git -C client config remote.origin.promisor true &&
> +       git -C client -c repack.writebitmaps=false repack -a -d --filter=blob:none &&

Does writing bitmaps have any effect/interaction here?

> +       git -C client rev-list --objects --all --missing=print >objects &&
> +       grep "$expected" objects &&

This is testing the object that was cloned initially is gone after the
repack, ok.

> +       git -C client repack -a -d &&
> +       expected="$(git -C server rev-parse :file1)" &&
> +       git -C client rev-list --objects --all --missing=print >objects &&
> +       grep "$expected" objects

But I'm not sure what you're testing here? A repack wouldn't fetch
missing objects for a promisor pack anyway... and because there's no
'^' in the pattern the grep will succeed regardless of whether the
object is missing/present.

Rob :)

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 4/4] tests for repack --filter mode
  2022-02-17 16:14     ` Robert Coup
@ 2022-02-17 20:36       ` John Cai
  0 siblings, 0 replies; 34+ messages in thread
From: John Cai @ 2022-02-17 20:36 UTC (permalink / raw)
  To: Robert Coup; +Cc: John Cai via GitGitGadget, git

Hi Rob,

On 17 Feb 2022, at 11:14, Robert Coup wrote:

> Hi John,
>
> Minor, but should we use oid rather than sha1 in the list.sh/upload.sh
> scripts? wrt sha256 slowly coming along the pipe.

good point, I'll make those adjustments.

>
>> diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
>> index e489869dd94..78cc1858cb6 100755
>> --- a/t/t7700-repack.sh
>> +++ b/t/t7700-repack.sh
>> @@ -237,6 +237,26 @@ test_expect_success 'auto-bitmaps do not complain if unavailable' '
>>         test_must_be_empty actual
>>  '
>>
>> +test_expect_success 'repack with filter does not fetch from remote' '
>> +       rm -rf server client &&
>> +       test_create_repo server &&
>> +       git -C server config uploadpack.allowFilter true &&
>> +       git -C server config uploadpack.allowAnySHA1InWant true &&
>> +       echo content1 >server/file1 &&
>> +       git -C server add file1 &&
>> +       git -C server commit -m initial_commit &&
>> +       expected="?$(git -C server rev-parse :file1)" &&
>> +       git clone --bare --no-local server client &&
>> +       git -C client config remote.origin.promisor true &&
>> +       git -C client -c repack.writebitmaps=false repack -a -d --filter=blob:none &&
>
> Does writing bitmaps have any effect/interaction here?

Currently writing bitmaps don't play well with promisor objects. If I'm reading
the code correctly, it seems that when we build a bitmap with
bitmap_writer_build(), find_object_pos() gets called and will complain if an
object is missing from the pack.

We probably need to do the work to allow bitmaps to play well with promisor
objects.

>
>> +       git -C client rev-list --objects --all --missing=print >objects &&
>> +       grep "$expected" objects &&
>
> This is testing the object that was cloned initially is gone after the
> repack, ok.
>
>> +       git -C client repack -a -d &&
>> +       expected="$(git -C server rev-parse :file1)" &&
>> +       git -C client rev-list --objects --all --missing=print >objects &&
>> +       grep "$expected" objects
>
> But I'm not sure what you're testing here? A repack wouldn't fetch
> missing objects for a promisor pack anyway... and because there's no
> '^' in the pattern the grep will succeed regardless of whether the
> object is missing/present.

Good point. I overlooked the fact that by this point in the test, repack has
already written a promisor file. I think I'll just remove these last couple of
lines.

>
> Rob :)

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 0/4] [RFC] repack: add --filter=
  2022-02-16 21:07     ` John Cai
@ 2022-02-21  3:11       ` Taylor Blau
  2022-02-21 15:38         ` Robert Coup
  2022-02-21 21:10         ` Christian Couder
  0 siblings, 2 replies; 34+ messages in thread
From: Taylor Blau @ 2022-02-21  3:11 UTC (permalink / raw)
  To: John Cai
  Cc: Robert Coup, John Cai via GitGitGadget, git, Derrick Stolee,
	Christian Couder

On Wed, Feb 16, 2022 at 04:07:14PM -0500, John Cai wrote:
> > I don't know whether that is just around naming (--delete-filter /
> > --drop-filter /
> > --expire-filter ?), and/or making the documentation very explicit that
> > this isn't so
> > much "omitting certain objects from a packfile" as irretrievably
> > deleting objects.
>
> Yeah, making the name very clear (I kind of like --delete-filter) would certainly help.
> Also, to have more protection we can either
>
> 1. add a config value that needs to be set to true for repack to remove
> objects (repack.allowDestroyFilter).
>
> 2. --filter is dry-run by default and prints out objects that would have been removed,
> and it has to be combined with another flag --destroy in order for it to actually remove
> objects from the odb.

I share the same concern as Robert and Stolee do. But I think this issue
goes deeper than just naming.

Even if we called this `git repack --delete-filter` and only ran it with
`--i-know-what-im-doing` flag, we would still be leaving repository
corruption on the table, just making it marginally more difficult to
achieve.

I'm not familiar enough with the proposal to comment authoritatively,
but it seems like we should be verifying that there is a promisor remote
which promises any objects that we are about to filter out of the
repository.

I think that this is basically what `pack-objects`'s
`--missing=allow-promisor` does, though I don't think that's the right
tool for this job, either. Because we pack-objects also knows the object
filter, by the time we are ready to construct a pack, we're traversing
the filtered list of objects.

So we don't even bother to call show_object (or, in this case,
builtin/pack-objects.c::show_objecT__ma_allow_promisor) on them.

So I wonder what your thoughts are on having pack-objects only allow an
object to get "filtered out" if a copy of it is promised by some
promisor remote. Alternatively, and perhaps a more straight-forward
option might be to have `git repack` look at any objects that exist in a
pack we're about to delete, but don't exist in any of the packs we are
going to leave around, and make sure that any of those objects are
either unreachable or exist on a promisor remote.

But as it stands right now, I worry that this feature is too easily
misused and could result in unintended repository corruption.

I think verifying that that any objects we're about to delete exist
somewhere should make this safer to use, though even then, I think we're
still open to a TOCTOU race whereby the promisor has the objects when
we're about to delete them (convincing Git that deleting those objects
is OK to do) but gets rid of them after objects have been deleted from
the local copy (leaving no copies of the object around).

So, I don't know exactly what the right path forward is. But I'm curious
to get your thoughts on the above.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 0/4] [RFC] repack: add --filter=
  2022-02-21  3:11       ` Taylor Blau
@ 2022-02-21 15:38         ` Robert Coup
  2022-02-21 17:57           ` Taylor Blau
  2022-02-21 21:10         ` Christian Couder
  1 sibling, 1 reply; 34+ messages in thread
From: Robert Coup @ 2022-02-21 15:38 UTC (permalink / raw)
  To: Taylor Blau
  Cc: John Cai, John Cai via GitGitGadget, git, Derrick Stolee,
	Christian Couder

Hi,

On Mon, 21 Feb 2022 at 03:11, Taylor Blau <me@ttaylorr.com> wrote:
>
> we would still be leaving repository
> corruption on the table, just making it marginally more difficult to
> achieve.

While reviewing John's patch I initially wondered if a better approach
might be something like `git repack -a -d --exclude-stdin`, passing a
list of specific objects to exclude from the new pack (sourced from
rev-list via a filter, etc). To me this seems like a less dangerous
approach, but my concern was it doesn't use the existing filter
capabilities of pack-objects, and we end up generating and passing
around a huge list of oids. And of course any mistakes in the list
generation aren't visible until it's too late.

I also wonder whether there's a race condition if the repository gets
updated? If you're moving large objects out in advance, then filtering
the remainder there's nothing to stop a new large object being pushed
between those two steps and getting dropped.

My other idea, which is growing on me, is whether repack could
generate two valid packs: one for the included objects via the filter
(as John's change does now), and one containing the filtered-out
objects. `git repack -a -d --split-filter=<filter>` Then a user could
then move/extract the second packfile to object storage, but there'd
be no way to *accidentally* corrupt the repository by using a bad
option. With this approach the race condition above goes away too.

    $ git repack -a -d -q --split-filter=blob:limit=1m
    pack-37b7443e3123549a2ddee31f616ae272c51cae90
    pack-10789d94fcd99ffe1403b63b167c181a9df493cd

First pack identifier being the objects that match the filter (ie:
commits/trees/blobs <1m), and the second pack identifier being the
objects that are excluded by the filter (blobs >1m).

An astute --i-know-what-im-doing reader could spot that you could just
delete the second packfile and achieve the same outcome as the current
proposed patch, subject to being confident the race condition hadn't
happened to you.

Thanks,
Rob :)

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 0/4] [RFC] repack: add --filter=
  2022-02-21 15:38         ` Robert Coup
@ 2022-02-21 17:57           ` Taylor Blau
  0 siblings, 0 replies; 34+ messages in thread
From: Taylor Blau @ 2022-02-21 17:57 UTC (permalink / raw)
  To: Robert Coup
  Cc: Taylor Blau, John Cai, John Cai via GitGitGadget, git,
	Derrick Stolee, Christian Couder

On Mon, Feb 21, 2022 at 03:38:03PM +0000, Robert Coup wrote:
> Hi,
>
> On Mon, 21 Feb 2022 at 03:11, Taylor Blau <me@ttaylorr.com> wrote:
> >
> > we would still be leaving repository
> > corruption on the table, just making it marginally more difficult to
> > achieve.
>
> While reviewing John's patch I initially wondered if a better approach
> might be something like `git repack -a -d --exclude-stdin`, passing a
> list of specific objects to exclude from the new pack (sourced from
> rev-list via a filter, etc). To me this seems like a less dangerous
> approach, but my concern was it doesn't use the existing filter
> capabilities of pack-objects, and we end up generating and passing
> around a huge list of oids. And of course any mistakes in the list
> generation aren't visible until it's too late.

Yeah; I think the most elegant approach would have pack-objects do as
much work as possible, and have repack be in charge of coordinating what
all the pack-objects invocation(s) have to do.

> I also wonder whether there's a race condition if the repository gets
> updated? If you're moving large objects out in advance, then filtering
> the remainder there's nothing to stop a new large object being pushed
> between those two steps and getting dropped.

Yeah, we will want to make sure that we're operating on a consistent
view of the repository. If this is all done in-process, it won't be a
problem since we'll capture an atomic snapshot of the reference states
once. If this is done across multiple processes, we'll need to make sure
we're passing around that snapshot where appropriate.

See the `--refs-snapshot`-related code in git-repack for when we write a
multi-pack bitmap for an example of the latter.

> My other idea, which is growing on me, is whether repack could
> generate two valid packs: one for the included objects via the filter
> (as John's change does now), and one containing the filtered-out
> objects. `git repack -a -d --split-filter=<filter>` Then a user could
> then move/extract the second packfile to object storage, but there'd
> be no way to *accidentally* corrupt the repository by using a bad
> option. With this approach the race condition above goes away too.
>
>     $ git repack -a -d -q --split-filter=blob:limit=1m
>     pack-37b7443e3123549a2ddee31f616ae272c51cae90
>     pack-10789d94fcd99ffe1403b63b167c181a9df493cd
>
> First pack identifier being the objects that match the filter (ie:
> commits/trees/blobs <1m), and the second pack identifier being the
> objects that are excluded by the filter (blobs >1m).

I like this idea quite a bit. We also have a lot of existing tools that
would make an implementation fairly lightweight, namely pack-objects'
`--stdin-packs` mode.

Using that would look something like first having `repack` generate the
filtered pack first, remembering its name [1]. After that, we would run
`pack-objects` again, this time with `--stdin-packs`, where the positive
packs are the ones we're going to delete, and the negative pack(s)
is/are the filtered one generated in the last step.

The second invocation would leave us with a single pack which represents
all of the objects in packs we are about to delete, skipping any objects
that are in the filtered pack we just generated. In other words, it
would leave the repository with two packs: one with all of the objects
that met the filter criteria, and one with all objects that don't meet
the filter criteria.

A client could then upload the "doesn't meet the filter criteria" pack
off elsewhere, and then delete it locally. (I'm assuming this last part
in particular is orchestrated by some other command, and we aren't
encouraging users to run "rm" inside of .git/objects/pack!)

> An astute --i-know-what-im-doing reader could spot that you could just
> delete the second packfile and achieve the same outcome as the current
> proposed patch, subject to being confident the race condition hadn't
> happened to you.

Yeah, and I think this goes to my joking remark in the last paragraph.
If we allow users to delete packs at will, all bets are off regardless
of how safely we generate those packs. But I think splitting the
repository into two packs and _then_ dealing with one of them separately
as opposed to deleting objects which don't meet the filter criteria
immediately is moving in a safer direction.

> Thanks,
> Rob :)

Thanks,
Taylor

[1]: Optionally "name_s_", if we passed the `--max-pack-size` option to
`git pack-objects`, which we can trigger via a `git repack` option of
the same name.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 0/4] [RFC] repack: add --filter=
  2022-02-21  3:11       ` Taylor Blau
  2022-02-21 15:38         ` Robert Coup
@ 2022-02-21 21:10         ` Christian Couder
  2022-02-21 21:42           ` Taylor Blau
  1 sibling, 1 reply; 34+ messages in thread
From: Christian Couder @ 2022-02-21 21:10 UTC (permalink / raw)
  To: Taylor Blau
  Cc: John Cai, Robert Coup, John Cai via GitGitGadget, git,
	Derrick Stolee

On Mon, Feb 21, 2022 at 4:11 AM Taylor Blau <me@ttaylorr.com> wrote:
>
> On Wed, Feb 16, 2022 at 04:07:14PM -0500, John Cai wrote:
> > > I don't know whether that is just around naming (--delete-filter /
> > > --drop-filter /
> > > --expire-filter ?), and/or making the documentation very explicit that
> > > this isn't so
> > > much "omitting certain objects from a packfile" as irretrievably
> > > deleting objects.
> >
> > Yeah, making the name very clear (I kind of like --delete-filter) would certainly help.

I am ok with making the name and doc very clear that it deletes
objects and they might be lost if they haven't been saved elsewhere
first.

> > Also, to have more protection we can either
> >
> > 1. add a config value that needs to be set to true for repack to remove
> > objects (repack.allowDestroyFilter).

I don't think it's of much value. We don't have such config values for
other possibly destructive operations.

> > 2. --filter is dry-run by default and prints out objects that would have been removed,
> > and it has to be combined with another flag --destroy in order for it to actually remove
> > objects from the odb.

I am not sure it's of much value either compared to naming it
--filter-destroy. It's likely to just make things more difficult for
users to understand.

> I share the same concern as Robert and Stolee do. But I think this issue
> goes deeper than just naming.
>
> Even if we called this `git repack --delete-filter` and only ran it with
> `--i-know-what-im-doing` flag, we would still be leaving repository
> corruption on the table, just making it marginally more difficult to
> achieve.

My opinion on this is that the promisor object mechanism assumes by
design that some objects are outside a repo, and that this repo
shouldn't care much about these objects possibly being corrupted.

It's the same for git LFS. As only a pointer file is stored in Git and
the real file is stored elsewhere, the Git repo doesn't care by design
about possible corruption of the real file.

I am not against a name and some docs that strongly state that users
should be very careful when using such a command, but otherwise I
think such a command is perfectly ok. We have other commands that by
design could lead to some objects or data being lost.

> I'm not familiar enough with the proposal to comment authoritatively,
> but it seems like we should be verifying that there is a promisor remote
> which promises any objects that we are about to filter out of the
> repository.

I think it could be a follow up mode that could be useful and safe,
but there should be no requirement for such a mode. In some cases you
know very much what you want and you don't want checks. For example if
you have taken proper care to transfer large objects to another
remote, you might just not need other possibly expansive checks.

[...]

> But as it stands right now, I worry that this feature is too easily
> misused and could result in unintended repository corruption.

Are you worrying about the UI or about what it does?

I am ok with improving the UI, but I think what it does is reasonable.

> I think verifying that that any objects we're about to delete exist
> somewhere should make this safer to use, though even then, I think we're
> still open to a TOCTOU race whereby the promisor has the objects when
> we're about to delete them (convincing Git that deleting those objects
> is OK to do) but gets rid of them after objects have been deleted from
> the local copy (leaving no copies of the object around).

Possible TOCTOU races are a good reason why something with no check is
perhaps a better goal for now.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 0/4] [RFC] repack: add --filter=
  2022-02-21 21:10         ` Christian Couder
@ 2022-02-21 21:42           ` Taylor Blau
  2022-02-22 17:11             ` Christian Couder
  2022-02-22 18:52             ` John Cai
  0 siblings, 2 replies; 34+ messages in thread
From: Taylor Blau @ 2022-02-21 21:42 UTC (permalink / raw)
  To: Christian Couder
  Cc: Taylor Blau, John Cai, Robert Coup, John Cai via GitGitGadget,
	git, Derrick Stolee

On Mon, Feb 21, 2022 at 10:10:15PM +0100, Christian Couder wrote:
> > > Also, to have more protection we can either
> > >
> > > 1. add a config value that needs to be set to true for repack to remove
> > > objects (repack.allowDestroyFilter).
>
> I don't think it's of much value. We don't have such config values for
> other possibly destructive operations.
>
> > > 2. --filter is dry-run by default and prints out objects that would have been removed,
> > > and it has to be combined with another flag --destroy in order for it to actually remove
> > > objects from the odb.
>
> I am not sure it's of much value either compared to naming it
> --filter-destroy. It's likely to just make things more difficult for
> users to understand.

On this and the above, I agree with Christian.

> > I share the same concern as Robert and Stolee do. But I think this issue
> > goes deeper than just naming.
> >
> > Even if we called this `git repack --delete-filter` and only ran it with
> > `--i-know-what-im-doing` flag, we would still be leaving repository
> > corruption on the table, just making it marginally more difficult to
> > achieve.
>
> My opinion on this is that the promisor object mechanism assumes by
> design that some objects are outside a repo, and that this repo
> shouldn't care much about these objects possibly being corrupted.

For what it's worth, I am fine having a mode of repack which allows us
to remove objects that we know are stored by a promisor remote. But this
series doesn't do that, so users could easily run `git repack -d
--filter=...` and find that they have irrecoverably corrupted their
repository.

I think that there are some other reasonable directions, though. One
which Robert and I discussed was making it possible to split a
repository into two packs, one which holds objects that match some
`--filter` criteria, and one which holds the objects that don't match
that filter.

Another option would be to prune the repository according to objects
that are already made available by a promisor remote.

An appealing quality about the above two directions is that the first
doesn't actually remove any objects, just makes it easier to push a
whole pack of unwanted objects off to a promsior remote. The second
prunes the repository according to objects that are already made
available by the promisor remote. (Yes, there is a TOCTOU race there,
too, but it's the same prune-while-pushing race that Git already has
today).

> I am not against a name and some docs that strongly state that users
> should be very careful when using such a command, but otherwise I
> think such a command is perfectly ok. We have other commands that by
> design could lead to some objects or data being lost.

I can think of a handful of ways to remove objects which are unreachable
from a repository, but I am not sure we have any ways to remove objects
which are reachable.

> > But as it stands right now, I worry that this feature is too easily
> > misused and could result in unintended repository corruption.
>
> Are you worrying about the UI or about what it does?
>
> I am ok with improving the UI, but I think what it does is reasonable.

I am more worried about the proposal's functionality than its UI,
hopefully my concerns there are summarized above.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 0/4] [RFC] repack: add --filter=
  2022-02-21 21:42           ` Taylor Blau
@ 2022-02-22 17:11             ` Christian Couder
  2022-02-22 17:33               ` Taylor Blau
                                 ` (2 more replies)
  2022-02-22 18:52             ` John Cai
  1 sibling, 3 replies; 34+ messages in thread
From: Christian Couder @ 2022-02-22 17:11 UTC (permalink / raw)
  To: Taylor Blau
  Cc: John Cai, Robert Coup, John Cai via GitGitGadget, git,
	Derrick Stolee

On Mon, Feb 21, 2022 at 10:42 PM Taylor Blau <me@ttaylorr.com> wrote:
>
> On Mon, Feb 21, 2022 at 10:10:15PM +0100, Christian Couder wrote:
> > > > Also, to have more protection we can either
> > > >
> > > > 1. add a config value that needs to be set to true for repack to remove
> > > > objects (repack.allowDestroyFilter).
> >
> > I don't think it's of much value. We don't have such config values for
> > other possibly destructive operations.
> >
> > > > 2. --filter is dry-run by default and prints out objects that would have been removed,
> > > > and it has to be combined with another flag --destroy in order for it to actually remove
> > > > objects from the odb.
> >
> > I am not sure it's of much value either compared to naming it
> > --filter-destroy. It's likely to just make things more difficult for
> > users to understand.
>
> On this and the above, I agree with Christian.
>
> > > I share the same concern as Robert and Stolee do. But I think this issue
> > > goes deeper than just naming.
> > >
> > > Even if we called this `git repack --delete-filter` and only ran it with
> > > `--i-know-what-im-doing` flag, we would still be leaving repository
> > > corruption on the table, just making it marginally more difficult to
> > > achieve.
> >
> > My opinion on this is that the promisor object mechanism assumes by
> > design that some objects are outside a repo, and that this repo
> > shouldn't care much about these objects possibly being corrupted.
>
> For what it's worth, I am fine having a mode of repack which allows us
> to remove objects that we know are stored by a promisor remote. But this
> series doesn't do that, so users could easily run `git repack -d
> --filter=...` and find that they have irrecoverably corrupted their
> repository.

In some cases we just know the objects we are removing are stored by a
promisor remote or are replicated on different physical machines or
both, so you should be fine with this.

If you are not fine with this because sometimes a user might use it
without knowing, then why are you ok with commands deleting refs not
checking that there isn't a regular repack removing dangling objects?

Also note that people who want to remove objects using a filter can
already do it by cloning with a filter and then replacing the original
packs with the packs from the clone. So refusing this new feature is
just making things more cumbersome.

> I think that there are some other reasonable directions, though. One
> which Robert and I discussed was making it possible to split a
> repository into two packs, one which holds objects that match some
> `--filter` criteria, and one which holds the objects that don't match
> that filter.

I am ok with someone implementing this feature, but if an option that
actually deletes the filtered objects is rejected then such a feature
will be used with some people just deleting one of the resulting packs
(and they might get it wrong), so I don't think any real safety will
be gained.

> Another option would be to prune the repository according to objects
> that are already made available by a promisor remote.

If the objects have just been properly transferred to the promisor
remote, the check will just waste resources.

> An appealing quality about the above two directions is that the first
> doesn't actually remove any objects, just makes it easier to push a
> whole pack of unwanted objects off to a promsior remote. The second
> prunes the repository according to objects that are already made
> available by the promisor remote. (Yes, there is a TOCTOU race there,
> too, but it's the same prune-while-pushing race that Git already has
> today).
>
> > I am not against a name and some docs that strongly state that users
> > should be very careful when using such a command, but otherwise I
> > think such a command is perfectly ok. We have other commands that by
> > design could lead to some objects or data being lost.
>
> I can think of a handful of ways to remove objects which are unreachable
> from a repository, but I am not sure we have any ways to remove objects
> which are reachable.

Cloning with a filter already does that. It's by design in the
promisor object and partial clone mechanisms that reachable objects
are removed. Having more than one promisor remote, which is an
existing mechanism, means that it's just wasteful to require all the
remotes to have all the reachable objects, so how could people easily
set up such remotes? Why make it unnecessarily hard and forbid a
straightforward way?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 0/4] [RFC] repack: add --filter=
  2022-02-22 17:11             ` Christian Couder
@ 2022-02-22 17:33               ` Taylor Blau
  2022-02-23 15:40               ` Robert Coup
  2022-02-23 19:31               ` Junio C Hamano
  2 siblings, 0 replies; 34+ messages in thread
From: Taylor Blau @ 2022-02-22 17:33 UTC (permalink / raw)
  To: Christian Couder
  Cc: Taylor Blau, John Cai, Robert Coup, John Cai via GitGitGadget,
	git, Derrick Stolee

Hi Christian,

I think my objections may be based on a misunderstanding of John and
your original proposal. From reading [1], it seemed to me like a
required step of this proposal was to upload the objects you want to
filter out ahead of time, and then run `git repack -ad --filter=...`.

So my concerns thus far have been around the lack of cohesion between
(1) the filter which describes the set of objects uploaded to the HTTP
server, and (2) the filter used when re-filtering the repository.

If (1) and (2) aren't inverses of each other, then in the case where (2)
leaves behind an object which wasn't caught by (1), we have lost that
object.

If instead the server used in your script at [1] is a stand-in for an
ordinary Git remote, that changes my thinking significantly. See below
for more details:

On Tue, Feb 22, 2022 at 06:11:11PM +0100, Christian Couder wrote:
> > > > I share the same concern as Robert and Stolee do. But I think this issue
> > > > goes deeper than just naming.
> > > >
> > > > Even if we called this `git repack --delete-filter` and only ran it with
> > > > `--i-know-what-im-doing` flag, we would still be leaving repository
> > > > corruption on the table, just making it marginally more difficult to
> > > > achieve.
> > >
> > > My opinion on this is that the promisor object mechanism assumes by
> > > design that some objects are outside a repo, and that this repo
> > > shouldn't care much about these objects possibly being corrupted.
> >
> > For what it's worth, I am fine having a mode of repack which allows us
> > to remove objects that we know are stored by a promisor remote. But this
> > series doesn't do that, so users could easily run `git repack -d
> > --filter=...` and find that they have irrecoverably corrupted their
> > repository.
>
> In some cases we just know the objects we are removing are stored by a
> promisor remote or are replicated on different physical machines or
> both, so you should be fine with this.

I am definitely OK with a convenient way to re-filter your repository
locally so long as you know that the objects you are filtering out are
available via some promisor remote.

But perhaps I have misunderstood what this proposal is for. Reading
through John's original cover letter and the link to your demo script, I
understood that a key part of this was being able to upload the pack of
objects you were about to filter out of your local copy to some server
(not necessarily Git) over HTTP.

My hesitation so far has been based on that understanding. Reading these
patches, I don't see a mechanism to upload objects we're about to
expunge to a promisor remote.

But perhaps I'm misunderstanding: if you are instead assuming that the
existing set of remotes can serve any objects that we deleted, and this
is the way to delete them, then I am OK with that approach. But I think
either way, I am missing some details in the original proposal that
would have perhaps made it easier for me to understand what your goals
are.

In any case, this patch series doesn't seem to correctly set up a
promisor remote for me, since doing the following on a fresh clone of
git.git (after running "make"):

    $ bin-wrappers/git repack -ad --filter=blob:none
    $ bin-wrappers/git fsck

results in many "missing blob" and "missing link" lines out of output.

(FWIW, I think what's missing here is correctly setting up the affected
remote(s) as promisors and indicating that we're now in a filtered
setting when going from a full clone down to a partial one.)

> If you are not fine with this because sometimes a user might use it
> without knowing, then why are you ok with commands deleting refs not
> checking that there isn't a regular repack removing dangling objects?

I'm not sure I totally understand your question, but my general sense
has been "because we typically make it difficult / impossible to remove
reachable objects".

Thanks,
Taylor

[1]: https://gitlab.com/chriscool/partial-clone-demo/-/blob/master/http-promisor/server_demo.txt#L47-52

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 0/4] [RFC] repack: add --filter=
  2022-02-21 21:42           ` Taylor Blau
  2022-02-22 17:11             ` Christian Couder
@ 2022-02-22 18:52             ` John Cai
  2022-02-22 19:35               ` Taylor Blau
  1 sibling, 1 reply; 34+ messages in thread
From: John Cai @ 2022-02-22 18:52 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Christian Couder, Robert Coup, John Cai via GitGitGadget, git,
	Derrick Stolee

Hi Taylor,

On 21 Feb 2022, at 16:42, Taylor Blau wrote:

> On Mon, Feb 21, 2022 at 10:10:15PM +0100, Christian Couder wrote:
>>>> Also, to have more protection we can either
>>>>
>>>> 1. add a config value that needs to be set to true for repack to remove
>>>> objects (repack.allowDestroyFilter).
>>
>> I don't think it's of much value. We don't have such config values for
>> other possibly destructive operations.
>>
>>>> 2. --filter is dry-run by default and prints out objects that would have been removed,
>>>> and it has to be combined with another flag --destroy in order for it to actually remove
>>>> objects from the odb.
>>
>> I am not sure it's of much value either compared to naming it
>> --filter-destroy. It's likely to just make things more difficult for
>> users to understand.
>
> On this and the above, I agree with Christian.
>
>>> I share the same concern as Robert and Stolee do. But I think this issue
>>> goes deeper than just naming.
>>>
>>> Even if we called this `git repack --delete-filter` and only ran it with
>>> `--i-know-what-im-doing` flag, we would still be leaving repository
>>> corruption on the table, just making it marginally more difficult to
>>> achieve.
>>
>> My opinion on this is that the promisor object mechanism assumes by
>> design that some objects are outside a repo, and that this repo
>> shouldn't care much about these objects possibly being corrupted.
>
> For what it's worth, I am fine having a mode of repack which allows us
> to remove objects that we know are stored by a promisor remote. But this
> series doesn't do that, so users could easily run `git repack -d
> --filter=...` and find that they have irrecoverably corrupted their
> repository.
>
> I think that there are some other reasonable directions, though. One
> which Robert and I discussed was making it possible to split a
> repository into two packs, one which holds objects that match some
> `--filter` criteria, and one which holds the objects that don't match
> that filter.
>
> Another option would be to prune the repository according to objects
> that are already made available by a promisor remote.

Thanks for the discussion around the two packfile idea. Definitely interesting.
However, I'm leaning towards the second option here where we ensure that objects that
are about to be deleted can be retrieved via a promisor remote. That way we have an
easy path to recovery.

>
> An appealing quality about the above two directions is that the first
> doesn't actually remove any objects, just makes it easier to push a
> whole pack of unwanted objects off to a promsior remote. The second
> prunes the repository according to objects that are already made
> available by the promisor remote. (Yes, there is a TOCTOU race there,
> too, but it's the same prune-while-pushing race that Git already has
> today).
>
>> I am not against a name and some docs that strongly state that users
>> should be very careful when using such a command, but otherwise I
>> think such a command is perfectly ok. We have other commands that by
>> design could lead to some objects or data being lost.
>
> I can think of a handful of ways to remove objects which are unreachable
> from a repository, but I am not sure we have any ways to remove objects
> which are reachable.
>
>>> But as it stands right now, I worry that this feature is too easily
>>> misused and could result in unintended repository corruption.
>>
>> Are you worrying about the UI or about what it does?
>>
>> I am ok with improving the UI, but I think what it does is reasonable.
>
> I am more worried about the proposal's functionality than its UI,
> hopefully my concerns there are summarized above.
>
> Thanks,
> Taylor

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 0/4] [RFC] repack: add --filter=
  2022-02-22 18:52             ` John Cai
@ 2022-02-22 19:35               ` Taylor Blau
  0 siblings, 0 replies; 34+ messages in thread
From: Taylor Blau @ 2022-02-22 19:35 UTC (permalink / raw)
  To: John Cai
  Cc: Taylor Blau, Christian Couder, Robert Coup,
	John Cai via GitGitGadget, git, Derrick Stolee

On Tue, Feb 22, 2022 at 01:52:09PM -0500, John Cai wrote:
> > Another option would be to prune the repository according to objects
> > that are already made available by a promisor remote.
>
> Thanks for the discussion around the two packfile idea. Definitely
> interesting.  However, I'm leaning towards the second option here
> where we ensure that objects that are about to be deleted can be
> retrieved via a promisor remote. That way we have an easy path to
> recovery.

Yeah, I think this may have all come from a potential misunderstanding I
had with the original proposal. More of the details there can be found
in [1].

But assuming that this proposal isn't about first offloading some
objects to an auxiliary (non-Git) server, then I think refiltering into
a single pack makes sense, because we trust the remote to still have
any objects we deleted.

(The snag I hit was that it seemed like your+Christian's proposal hinged
on using _two_ filters, one to produce the set of objects you wanted to
get rid of, and another to produce the set of objects you wanted to
keep. The lack of cohesion between the two is what gave me pause, but it
may not have been what either of you were thinking in the first place).

Anyway, I'm not sure "spitting" a repository along a `--filter` into two
packs is all that interesting of an idea, but it is something we could
do if it became useful to you without writing too much new code (and
instead leveraging `git pack-objects --stdin-packs`).

Thanks,
Taylor

[1]: https://lore.kernel.org/git/YhUeUCIetu%2FaOu6k@nand.local/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 0/4] [RFC] repack: add --filter=
  2022-02-22 17:11             ` Christian Couder
  2022-02-22 17:33               ` Taylor Blau
@ 2022-02-23 15:40               ` Robert Coup
  2022-02-23 19:31               ` Junio C Hamano
  2 siblings, 0 replies; 34+ messages in thread
From: Robert Coup @ 2022-02-23 15:40 UTC (permalink / raw)
  To: Christian Couder
  Cc: Taylor Blau, John Cai, John Cai via GitGitGadget, git,
	Derrick Stolee

Hi Christian,

On Tue, 22 Feb 2022 at 17:11, Christian Couder
<christian.couder@gmail.com> wrote:
>
> In some cases we just know the objects we are removing are stored by a
> promisor remote or are replicated on different physical machines or
> both, so you should be fine with this.

From my point of view I think the goal here is great.

> > Another option would be to prune the repository according to objects
> > that are already made available by a promisor remote.
>
> If the objects have just been properly transferred to the promisor
> remote, the check will just waste resources.

As far as I can see this patch doesn't know or check that any of the
filtered-out objects are held anywhere else... it simply applies a
filter during repacking and the excluded objects are dropped. That's
the aspect I have concerns about.

Maybe an approach where you build/get/maintain a list of
objects-I-definitely-have-elsewhere and pass it as an exclude list to
repack would be a cleaner/safer/easier solution? If you're confident
enough you don't need to check with the promisor remote then you can
use a local list, or even something generated with `rev-list
--filter=`.

Thanks,

Rob :)

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 0/4] [RFC] repack: add --filter=
  2022-02-22 17:11             ` Christian Couder
  2022-02-22 17:33               ` Taylor Blau
  2022-02-23 15:40               ` Robert Coup
@ 2022-02-23 19:31               ` Junio C Hamano
  2022-02-26 16:01                 ` John Cai
  2 siblings, 1 reply; 34+ messages in thread
From: Junio C Hamano @ 2022-02-23 19:31 UTC (permalink / raw)
  To: Christian Couder
  Cc: Taylor Blau, John Cai, Robert Coup, John Cai via GitGitGadget,
	git, Derrick Stolee

Christian Couder <christian.couder@gmail.com> writes:

>> For what it's worth, I am fine having a mode of repack which allows us
>> to remove objects that we know are stored by a promisor remote. But this
>> series doesn't do that, so users could easily run `git repack -d
>> --filter=...` and find that they have irrecoverably corrupted their
>> repository.
>
> In some cases we just know the objects we are removing are stored by a
> promisor remote or are replicated on different physical machines or
> both, so you should be fine with this.

So, we need to decide if an object we have that is outside the
narrowed filter definition was (and still is, but let's keep the
assumption the whole lazy clone mechanism makes: promisor remotes
will never shed objects that they once served) available at the
promisor remote, but I suspect we have too little information to
reliably do so.  It is OK to assume that objects in existing packs
taken from the promisor remotes and everything reachable from them
(but missing from our object store) will be available to us from
there.  But if we see an object that is outside of _new_ filter spec
(e.g. you fetched with "max 100MB", now you are refiltering with
"max 50MB", narrowing the spec, and you need to decide for an object
that weigh 70MB), can we tell if that can be retrieved from the
promisor or is it unique to our repository until we push it out?  I
am not sure.  For that matter, do we even have a way to compare if
the new filter spec is a subset, a superset, or neither, of the
original filter spec?

> If you are not fine with this because sometimes a user might use it
> without knowing, then why are you ok with commands deleting refs not
> checking that there isn't a regular repack removing dangling objects?

Sorry, I do not follow this argument.  Your user may do "branch -D"
because the branch deleted is no longer needed, which may mean that
objects only reachable from the deleted branch are no longer needed.
I do not see what repack has anything to do with that.  As long as
the filter spec does not change (in other words, before this series
is applied), the repack that discards objects that are known to be
reachable from objects in packs retrieved from promisor remote, the
objects that are no longer reachable may be removed and that will
not lose objects that we do not know to be retrievable from there
(which is different from objects that we know are unretrievable).
But with filter spec changing after the fact, I am not sure if that
is safe.  IOW, "commands deleting refs" may have been OK without
this series, but this series may be what makes it not OK, no?

Puzzled.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 0/4] [RFC] repack: add --filter=
  2022-02-23 19:31               ` Junio C Hamano
@ 2022-02-26 16:01                 ` John Cai
  2022-02-26 17:29                   ` Taylor Blau
  0 siblings, 1 reply; 34+ messages in thread
From: John Cai @ 2022-02-26 16:01 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Christian Couder, Taylor Blau, Robert Coup,
	John Cai via GitGitGadget, git, Derrick Stolee

Thank you for everyone's feedback. Really appreciate the collaboration!

On 23 Feb 2022, at 14:31, Junio C Hamano wrote:

> Christian Couder <christian.couder@gmail.com> writes:
>
>>> For what it's worth, I am fine having a mode of repack which allows us
>>> to remove objects that we know are stored by a promisor remote. But this
>>> series doesn't do that, so users could easily run `git repack -d
>>> --filter=...` and find that they have irrecoverably corrupted their
>>> repository.
>>
>> In some cases we just know the objects we are removing are stored by a
>> promisor remote or are replicated on different physical machines or
>> both, so you should be fine with this.
>
> So, we need to decide if an object we have that is outside the
> narrowed filter definition was (and still is, but let's keep the
> assumption the whole lazy clone mechanism makes: promisor remotes
> will never shed objects that they once served) available at the
> promisor remote, but I suspect we have too little information to
> reliably do so.  It is OK to assume that objects in existing packs
> taken from the promisor remotes and everything reachable from them
> (but missing from our object store) will be available to us from
> there.  But if we see an object that is outside of _new_ filter spec
> (e.g. you fetched with "max 100MB", now you are refiltering with
> "max 50MB", narrowing the spec, and you need to decide for an object
> that weigh 70MB), can we tell if that can be retrieved from the
> promisor or is it unique to our repository until we push it out?  I
> am not sure.  For that matter, do we even have a way to compare if
> the new filter spec is a subset, a superset, or neither, of the
> original filter spec?

let me try to summarize (perhaps over simplify) the main concern folks have
with this feature, so please correct me if I'm wrong!

As a user, if I apply a filter that ends up deleting objects that it turns
out do not exist anywhere else, then I have irrecoverably corrupted my
repository.

Before git allows me to delete objects from my repository, it should be pretty
certain that I have path to recover those objects if I need to.

Is that correct? It seems to me that, put another way, we don't want to give
users too much rope to hang themselves.

I can see why we would want to do this. In this case, there have been a couple
of alternative ideas proposed throughout this thread that I think are viable and
I wanted to get folks thoughts.

1. split pack file - (Rob gave this idea and Taylor provided some more detail on
   how using pack-objects would make it fairly straightforward to implement)

when a user wants to apply a filter that removes objects from their repository,
split the packfile into one containing objects that are filtered out, and
another packfile with objects that remain.

pros: simple to implement
cons: does not address the question "how sure am I that the objects I want to
filter out of my repository exist on a promsior remote?"

2. check the promisor remotes to see if they contain the objects that are about
   to get deleted. Only delete objects that we find on promisor remotes.

pros: provides assurance that I have access to objects I am about to delete from
a promsior remote.
cons: more complex to implement. [*]

Out of these two, I like 2 more for the aforementioned pros.

* I am beginning to look into how fetches work and am still pretty new to the
codebase so I don't know if this is even feasible, but I was thinking perhaps
we could write a function that fetches with a --filter and create a promisor
packfile containing promisor objects (this operaiton would have to somehow
ignore the presence of the actual objects in the repository). Then, we would
have a record of objects we have access to. Then, repack --filter can remove
only the objects contained in this promisor packfile.

>
>> If you are not fine with this because sometimes a user might use it
>> without knowing, then why are you ok with commands deleting refs not
>> checking that there isn't a regular repack removing dangling objects?
>
> Sorry, I do not follow this argument.  Your user may do "branch -D"
> because the branch deleted is no longer needed, which may mean that
> objects only reachable from the deleted branch are no longer needed.
> I do not see what repack has anything to do with that.  As long as
> the filter spec does not change (in other words, before this series
> is applied), the repack that discards objects that are known to be
> reachable from objects in packs retrieved from promisor remote, the
> objects that are no longer reachable may be removed and that will
> not lose objects that we do not know to be retrievable from there
> (which is different from objects that we know are unretrievable).
> But with filter spec changing after the fact, I am not sure if that
> is safe.  IOW, "commands deleting refs" may have been OK without
> this series, but this series may be what makes it not OK, no?
>
> Puzzled.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 0/4] [RFC] repack: add --filter=
  2022-02-26 16:01                 ` John Cai
@ 2022-02-26 17:29                   ` Taylor Blau
  2022-02-26 20:19                     ` John Cai
  0 siblings, 1 reply; 34+ messages in thread
From: Taylor Blau @ 2022-02-26 17:29 UTC (permalink / raw)
  To: John Cai
  Cc: Junio C Hamano, Christian Couder, Robert Coup,
	John Cai via GitGitGadget, git, Derrick Stolee

On Sat, Feb 26, 2022 at 11:01:46AM -0500, John Cai wrote:
> let me try to summarize (perhaps over simplify) the main concern folks
> have with this feature, so please correct me if I'm wrong!
>
> As a user, if I apply a filter that ends up deleting objects that it
> turns out do not exist anywhere else, then I have irrecoverably
> corrupted my repository.
>
> Before git allows me to delete objects from my repository, it should
> be pretty certain that I have path to recover those objects if I need
> to.
>
> Is that correct? It seems to me that, put another way, we don't want
> to give users too much rope to hang themselves.

I wrote about my concerns in some more detail in [1], but the thing I
was most unclear on was how your demo script[2] was supposed to work.

Namely, I wasn't sure if you had intended to use two separate filters to
"re-filter" a repository, one to filter objects to be uploaded to a
content store, and another to filter objects to be expunged from the
repository. I have major concerns with that approach, namely that if
each of the filters is not exactly the inverse of the other, then we
will either upload too few objects, or delete too many.

My other concern was around what guarantees we currently provide for a
promisor remote. My understanding is that we expect an object which was
received from the promisor remote to always be fetch-able later on. If
that's the case, then I don't mind the idea of refiltering a repository,
provided that you only need to specify a filter once.

So the suggestion about splitting a repository into two packs was a
potential way to mediate the "two filter" problem, since the two packs
you get exactly correspond to the set of objects that match the filter,
and the set of objects that _don't_ match the filter.

In either case, I tried to use the patches in [1] and was able to
corrupt my local repository (even when fetching from a remote that held
onto the objects I had pruned locally).

Thanks,
Taylor

[1]: https://lore.kernel.org/git/YhUeUCIetu%2FaOu6k@nand.local/
[2]: https://gitlab.com/chriscool/partial-clone-demo/-/blob/master/http-promisor/server_demo.txt#L47-52

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 0/4] [RFC] repack: add --filter=
  2022-02-26 17:29                   ` Taylor Blau
@ 2022-02-26 20:19                     ` John Cai
  2022-02-26 20:30                       ` Taylor Blau
  0 siblings, 1 reply; 34+ messages in thread
From: John Cai @ 2022-02-26 20:19 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Junio C Hamano, Christian Couder, Robert Coup,
	John Cai via GitGitGadget, git, Derrick Stolee

Hi Taylor,

(resending this because my email client misbehaved and set the mime type to html)

On 26 Feb 2022, at 12:29, Taylor Blau wrote:

> On Sat, Feb 26, 2022 at 11:01:46AM -0500, John Cai wrote:
>> let me try to summarize (perhaps over simplify) the main concern folks
>> have with this feature, so please correct me if I'm wrong!
>>
>> As a user, if I apply a filter that ends up deleting objects that it
>> turns out do not exist anywhere else, then I have irrecoverably
>> corrupted my repository.
>>
>> Before git allows me to delete objects from my repository, it should
>> be pretty certain that I have path to recover those objects if I need
>> to.
>>
>> Is that correct? It seems to me that, put another way, we don't want
>> to give users too much rope to hang themselves.
>
> I wrote about my concerns in some more detail in [1], but the thing I
> was most unclear on was how your demo script[2] was supposed to work.
>
> Namely, I wasn't sure if you had intended to use two separate filters to
> "re-filter" a repository, one to filter objects to be uploaded to a
> content store, and another to filter objects to be expunged from the
> repository. I have major concerns with that approach, namely that if
> each of the filters is not exactly the inverse of the other, then we
> will either upload too few objects, or delete too many.

Thanks for bringing this up again. I meant to write back regarding what you raised
in the other part of this thread. I think this is a valid concern. To attain the
goal of offloading certain blobs onto another server(B) and saving space on a git
server(A), then there will essentially be two steps. One to upload objects to (B),
and one to remove objects from (A). As you said, these two need to be the inverse of each
other or else you might end up with missing objects.

Thinking about it more, there is also an issue of timing. Even if the filters
are exact inverses of each other, let's say we have the following order of
events:

- (A)'s large blobs get upload to (B)
- large blob (C) get added to (A)
- (A) gets repacked with a filter

In this case we could lose (C) forever. So it does seem like we need some built in guarantee
that we only shed objects from the repo if we know we can retrieve them later on.

>
> My other concern was around what guarantees we currently provide for a
> promisor remote. My understanding is that we expect an object which was
> received from the promisor remote to always be fetch-able later on. If
> that's the case, then I don't mind the idea of refiltering a repository,
> provided that you only need to specify a filter once.

Could you clarify what you mean by re-filtering a repository? By that I assumed
it meant specifying a filter eg: 100mb, and then narrowing it by specifying a
50mb filter.
>
> So the suggestion about splitting a repository into two packs was a
> potential way to mediate the "two filter" problem, since the two packs
> you get exactly correspond to the set of objects that match the filter,
> and the set of objects that _don't_ match the filter.
>
> In either case, I tried to use the patches in [1] and was able to
> corrupt my local repository (even when fetching from a remote that held
> onto the objects I had pruned locally).
>
> Thanks,
> Taylor

Thanks!
John

>
> [1]: https://lore.kernel.org/git/YhUeUCIetu%2FaOu6k@nand.local/
> [2]: https://gitlab.com/chriscool/partial-clone-demo/-/blob/master/http-promisor/server_demo.txt#L47-52

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 0/4] [RFC] repack: add --filter=
  2022-02-26 20:19                     ` John Cai
@ 2022-02-26 20:30                       ` Taylor Blau
  2022-02-26 21:05                         ` John Cai
  0 siblings, 1 reply; 34+ messages in thread
From: Taylor Blau @ 2022-02-26 20:30 UTC (permalink / raw)
  To: John Cai
  Cc: Taylor Blau, Junio C Hamano, Christian Couder, Robert Coup,
	John Cai via GitGitGadget, git, Derrick Stolee

On Sat, Feb 26, 2022 at 03:19:11PM -0500, John Cai wrote:
> Thanks for bringing this up again. I meant to write back regarding what you raised
> in the other part of this thread. I think this is a valid concern. To attain the
> goal of offloading certain blobs onto another server(B) and saving space on a git
> server(A), then there will essentially be two steps. One to upload objects to (B),
> and one to remove objects from (A). As you said, these two need to be the inverse of each
> other or else you might end up with missing objects.

Do you mean that you want to offload objects both from a local clone of
some repository, _and_ the original remote it was cloned from?

I don't understand what the role of "another server" is here. If this
proposal was about making it easy to remove objects from a local copy of
a repository based on a filter provided that there was a Git server
elsewhere that could act as a promisor remote, than that makes sense to
me.

But I think I'm not quite understanding the rest of what you're
suggesting.

> > My other concern was around what guarantees we currently provide for a
> > promisor remote. My understanding is that we expect an object which was
> > received from the promisor remote to always be fetch-able later on. If
> > that's the case, then I don't mind the idea of refiltering a repository,
> > provided that you only need to specify a filter once.
>
> Could you clarify what you mean by re-filtering a repository? By that I assumed
> it meant specifying a filter eg: 100mb, and then narrowing it by specifying a
> 50mb filter.

I meant: applying a filter to a local clone (either where there wasn't a
filter before, or a filter which matched more objects) and then removing
objects that don't match the filter.

But your response makes me think of another potential issue. What
happens if I do the following:

    $ git repack -ad --filter=blob:limit=100k
    $ git repack -ad --filter=blob:limit=200k

What should the second invocation do? I would expect that it needs to do
a fetch from the promisor remote to recover any blobs between (100, 200]
KB in size, since they would be gone after the first repack.

This is a problem not just with two consecutive `git repack --filter`s,
I think, since you could cook up the same situation with:

    $ git clone --filter=blob:limit=100k git@github.com:git
    $ git -C git repack -ad --filter=blob:limit=200k

I don't think the existing patches handle this situation, so I'm curious
whether it's something you have considered or not before.

(Unrelated to the above, but please feel free to trim any quoted parts
of emails when responding if they get overly long.)

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 0/4] [RFC] repack: add --filter=
  2022-02-26 20:30                       ` Taylor Blau
@ 2022-02-26 21:05                         ` John Cai
  2022-02-26 21:44                           ` Taylor Blau
  0 siblings, 1 reply; 34+ messages in thread
From: John Cai @ 2022-02-26 21:05 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Junio C Hamano, Christian Couder, Robert Coup,
	John Cai via GitGitGadget, git, Derrick Stolee

Hi Taylor,

On 26 Feb 2022, at 15:30, Taylor Blau wrote:

> On Sat, Feb 26, 2022 at 03:19:11PM -0500, John Cai wrote:
>> Thanks for bringing this up again. I meant to write back regarding what you raised
>> in the other part of this thread. I think this is a valid concern. To attain the
>> goal of offloading certain blobs onto another server(B) and saving space on a git
>> server(A), then there will essentially be two steps. One to upload objects to (B),
>> and one to remove objects from (A). As you said, these two need to be the inverse of each
>> other or else you might end up with missing objects.
>
> Do you mean that you want to offload objects both from a local clone of
> some repository, _and_ the original remote it was cloned from?

yes, exactly. The "another server" would be something like an http server, OR another remote
which hosts a subset of the objects (let's say the large blobs).
>
> I don't understand what the role of "another server" is here. If this
> proposal was about making it easy to remove objects from a local copy of
> a repository based on a filter provided that there was a Git server
> elsewhere that could act as a promisor remote, than that makes sense to
> me.
>
> But I think I'm not quite understanding the rest of what you're
> suggesting.

Sorry for the lack of clarity here. The goal is to make it easy for a remote to offload a subset
of its objects to __another__ remote (either a Git server or an http server through a remote helper).
>
>>> My other concern was around what guarantees we currently provide for a
>>> promisor remote. My understanding is that we expect an object which was
>>> received from the promisor remote to always be fetch-able later on. If
>>> that's the case, then I don't mind the idea of refiltering a repository,
>>> provided that you only need to specify a filter once.
>>
>> Could you clarify what you mean by re-filtering a repository? By that I assumed
>> it meant specifying a filter eg: 100mb, and then narrowing it by specifying a
>> 50mb filter.
>
> I meant: applying a filter to a local clone (either where there wasn't a
> filter before, or a filter which matched more objects) and then removing
> objects that don't match the filter.
>
> But your response makes me think of another potential issue. What
> happens if I do the following:
>
>     $ git repack -ad --filter=blob:limit=100k
>     $ git repack -ad --filter=blob:limit=200k
>
> What should the second invocation do? I would expect that it needs to do
> a fetch from the promisor remote to recover any blobs between (100, 200]
> KB in size, since they would be gone after the first repack.
>
> This is a problem not just with two consecutive `git repack --filter`s,
> I think, since you could cook up the same situation with:
>
>     $ git clone --filter=blob:limit=100k git@github.com:git
>     $ git -C git repack -ad --filter=blob:limit=200k
>
> I don't think the existing patches handle this situation, so I'm curious
> whether it's something you have considered or not before.

I have not-will have to think through this case, but this sound similar to
what [1] is about.
is about.

>
> (Unrelated to the above, but please feel free to trim any quoted parts
> of emails when responding if they get overly long.)
>
> Thanks,
> Taylor

Thanks
John

1. https://lore.kernel.org/git/pull.1138.v2.git.1645719218.gitgitgadget@gmail.com/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 0/4] [RFC] repack: add --filter=
  2022-02-26 21:05                         ` John Cai
@ 2022-02-26 21:44                           ` Taylor Blau
  0 siblings, 0 replies; 34+ messages in thread
From: Taylor Blau @ 2022-02-26 21:44 UTC (permalink / raw)
  To: John Cai
  Cc: Taylor Blau, Junio C Hamano, Christian Couder, Robert Coup,
	John Cai via GitGitGadget, git, Derrick Stolee, Jonathan Tan

On Sat, Feb 26, 2022 at 04:05:37PM -0500, John Cai wrote:
> Hi Taylor,
>
> On 26 Feb 2022, at 15:30, Taylor Blau wrote:
>
> > On Sat, Feb 26, 2022 at 03:19:11PM -0500, John Cai wrote:
> >> Thanks for bringing this up again. I meant to write back regarding what you raised
> >> in the other part of this thread. I think this is a valid concern. To attain the
> >> goal of offloading certain blobs onto another server(B) and saving space on a git
> >> server(A), then there will essentially be two steps. One to upload objects to (B),
> >> and one to remove objects from (A). As you said, these two need to be the inverse of each
> >> other or else you might end up with missing objects.
> >
> > Do you mean that you want to offload objects both from a local clone of
> > some repository, _and_ the original remote it was cloned from?
>
> yes, exactly. The "another server" would be something like an http server, OR another remote
> which hosts a subset of the objects (let's say the large blobs).
> >
> > I don't understand what the role of "another server" is here. If this
> > proposal was about making it easy to remove objects from a local copy of
> > a repository based on a filter provided that there was a Git server
> > elsewhere that could act as a promisor remote, than that makes sense to
> > me.
> >
> > But I think I'm not quite understanding the rest of what you're
> > suggesting.
>
> Sorry for the lack of clarity here. The goal is to make it easy for a remote to offload a subset
> of its objects to __another__ remote (either a Git server or an http server through a remote helper).

Does the other server then act as a promisor remote in conjunction with
the Git server? I'm having trouble understanding why the _Git_ remote
you originally cloned from needs to offload its objects, too.

So I think the list would benefit from understanding some more of the
details and motivation there. But it would also benefit us to have some
understanding of how we'll ensure that any objects which are moved out
of a Git repository make their way to another server.

I am curious to hear Jonathan Tan's thoughts on this all, too.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2022-02-26 21:45 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-27  1:49 [PATCH 0/2] repack: add --filter= John Cai via GitGitGadget
2022-01-27  1:49 ` [PATCH 1/2] pack-objects: allow --filter without --stdout John Cai via GitGitGadget
2022-01-27  1:49 ` [PATCH 2/2] repack: add --filter=<filter-spec> option John Cai via GitGitGadget
2022-01-27 15:03   ` Derrick Stolee
2022-01-29 19:14     ` John Cai
2022-01-30  8:16       ` Christian Couder
2022-01-30 13:02       ` John Cai
2022-02-09  2:10 ` [PATCH v2 0/4] [RFC] repack: add --filter= John Cai via GitGitGadget
2022-02-09  2:10   ` [PATCH v2 1/4] pack-objects: allow --filter without --stdout John Cai via GitGitGadget
2022-02-09  2:10   ` [PATCH v2 2/4] repack: add --filter=<filter-spec> option John Cai via GitGitGadget
2022-02-09  2:10   ` [PATCH v2 3/4] upload-pack: allow missing promisor objects John Cai via GitGitGadget
2022-02-09  2:10   ` [PATCH v2 4/4] tests for repack --filter mode John Cai via GitGitGadget
2022-02-17 16:14     ` Robert Coup
2022-02-17 20:36       ` John Cai
2022-02-09  2:27   ` [PATCH v2 0/4] [RFC] repack: add --filter= John Cai
2022-02-16 15:39   ` Robert Coup
2022-02-16 21:07     ` John Cai
2022-02-21  3:11       ` Taylor Blau
2022-02-21 15:38         ` Robert Coup
2022-02-21 17:57           ` Taylor Blau
2022-02-21 21:10         ` Christian Couder
2022-02-21 21:42           ` Taylor Blau
2022-02-22 17:11             ` Christian Couder
2022-02-22 17:33               ` Taylor Blau
2022-02-23 15:40               ` Robert Coup
2022-02-23 19:31               ` Junio C Hamano
2022-02-26 16:01                 ` John Cai
2022-02-26 17:29                   ` Taylor Blau
2022-02-26 20:19                     ` John Cai
2022-02-26 20:30                       ` Taylor Blau
2022-02-26 21:05                         ` John Cai
2022-02-26 21:44                           ` Taylor Blau
2022-02-22 18:52             ` John Cai
2022-02-22 19:35               ` Taylor Blau

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).