git@vger.kernel.org mailing list mirror (one of many)
* [PATCH 0/3] Implement filtering repacks
@ 2022-10-12 13:51 Christian Couder
  2022-10-12 13:51 ` [PATCH 1/3] pack-objects: allow --filter without --stdout Christian Couder
                   ` (4 more replies)
  0 siblings, 5 replies; 29+ messages in thread
From: Christian Couder @ 2022-10-12 13:51 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Christian Couder

Earlier this year, John Cai sent 2 versions of a patch series to
implement `git repack --filter=<filter-spec>`:

https://lore.kernel.org/git/pull.1206.git.git.1643248180.gitgitgadget@gmail.com/

We tried to "sell" it as a way to use partial clone on a Git server to
offload large blobs to, for example, an http server, while using
multiple promisor remotes on the client side.

Even though it is still our end goal, it seems a bit far-fetched for
now, and unnecessary, as `git repack --filter=<filter-spec>` could be
useful on the client side too.

For example, one might want to clone with a filter to avoid too much
space being taken up by some large blobs, and one might realize after
some time that a number of the large blobs have still been downloaded
because some old branches referencing them were checked out. In this
case a filtering repack could remove some of those large blobs.
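The scenario can be sketched with a few commands (a minimal sketch
using only stock Git; the repo names "server" and "client" are made up
for the example, and the final commented-out command is the option this
series proposes, so it is not run here):

```shell
# Client-side scenario: a partial clone omits blobs up front.
set -e
rm -rf server client
git init -q server
echo "pretend this is a large blob" >server/big.bin
git -C server add big.bin
git -C server -c user.name=me -c user.email=me@example.com commit -qm one
git -C server config uploadpack.allowFilter true
git -C server config uploadpack.allowAnySHA1InWant true
# Partial clone: the blob of big.bin is not transferred.
git clone -q --bare --no-local --filter=blob:none server client
missing=$(git -C client rev-list --objects --all --missing=print | grep -c "^?")
echo "objects omitted by the filter: $missing"
# Checking out old branches would dynamically fetch such blobs back;
# with this series applied, they could then be dropped again with:
#   git -C client repack -a -d --filter=blob:none
```

The `--missing=print` output prefixes each missing object ID with `?`,
which is also what the tests in this series count.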

Some of the comments on the patch series that John sent were related
to the possible data loss and repo corruption that a filtering repack
could cause. It's indeed true that it could be very dangerous, and we
agree that improvements were needed in this area.

To address this, in the patch 2/3 introducing --filter, we warn users
launching such a repack on the command line and ask them if they
really want to do it. If such a repack is not launched from a
terminal, we die().

A new patch 3/3, though, introduces --force to allow users to launch
such a repack without a terminal and without having to confirm it on
the command line.

Patch 1/3 is a preparatory patch.

In short, this small patch series tries to reboot the previous one
with a focus on the client side and a focus on safety.

Thanks to John Cai, who worked on the previous versions, and to
Jonathan Nieder, Jonathan Tan and Taylor Blau, who recently discussed
this with me at the Git Merge and Contributor Summit.

Christian Couder (3):
  pack-objects: allow --filter without --stdout
  repack: add --filter=<filter-spec> option
  repack: introduce --force to force filtering

 Documentation/git-repack.txt | 17 ++++++++++++++
 builtin/pack-objects.c       |  6 +----
 builtin/repack.c             | 44 ++++++++++++++++++++++++++++++------
 t/t7700-repack.sh            | 20 ++++++++++++++++
 4 files changed, 75 insertions(+), 12 deletions(-)

-- 
2.38.0.4.g7f9724c7bf.dirty



* [PATCH 1/3] pack-objects: allow --filter without --stdout
  2022-10-12 13:51 [PATCH 0/3] Implement filtering repacks Christian Couder
@ 2022-10-12 13:51 ` Christian Couder
  2022-10-12 13:51 ` [PATCH 2/3] repack: add --filter=<filter-spec> option Christian Couder
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 29+ messages in thread
From: Christian Couder @ 2022-10-12 13:51 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Christian Couder

From: Christian Couder <chriscool@tuxfamily.org>

9535ce7337 (pack-objects: add list-objects filtering, 2017-11-21)
taught pack-objects to use --filter, but required the use of
--stdout since a partial clone mechanism was not yet in place to
handle missing objects. Since then, changes like 9e27beaa23
(promisor-remote: implement promisor_remote_get_direct(), 2019-06-25)
and others have added support for dynamically fetching missing objects.

Remove the --stdout requirement so that in a follow-up commit, repack
can pass --filter to pack-objects to omit certain objects from the
resulting packfile.

Signed-off-by: John Cai <johncai86@gmail.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 builtin/pack-objects.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 3658c05caf..d707e6230d 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -4385,12 +4385,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (!rev_list_all || !rev_list_reflog || !rev_list_index)
 		unpack_unreachable_expiration = 0;
 
-	if (pfd.have_revs && pfd.revs.filter.choice) {
-		if (!pack_to_stdout)
-			die(_("cannot use --filter without --stdout"));
-		if (stdin_packs)
+	if (pfd.have_revs && pfd.revs.filter.choice && stdin_packs)
 			die(_("cannot use --filter with --stdin-packs"));
-	}
 
 	if (stdin_packs && use_internal_rev_list)
 		die(_("cannot use internal rev list with --stdin-packs"));
-- 
2.38.0.4.g7f9724c7bf.dirty



* [PATCH 2/3] repack: add --filter=<filter-spec> option
  2022-10-12 13:51 [PATCH 0/3] Implement filtering repacks Christian Couder
  2022-10-12 13:51 ` [PATCH 1/3] pack-objects: allow --filter without --stdout Christian Couder
@ 2022-10-12 13:51 ` Christian Couder
  2022-10-12 13:51 ` [PATCH 3/3] repack: introduce --force to force filtering Christian Couder
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 29+ messages in thread
From: Christian Couder @ 2022-10-12 13:51 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Christian Couder

From: Christian Couder <chriscool@tuxfamily.org>

After cloning with --filter=<filter-spec>, for example to avoid
getting unneeded large files on a user's machine, it's possible
that some of these large files still get fetched over time for
various reasons (like checking out old branches).

In this case the repo size could grow too much for no good reason
and `git repack --filter=<filter-spec>` would be useful to remove
the unneeded large files.

This command could be dangerous to use though, as it might remove
local objects that haven't been pushed, which would lose data and
corrupt the repo. On a server this command could also corrupt a
repo unless ALL the removed objects are already available in
another remote that clients can access.

To avoid, as much as possible, data being lost and repos being
corrupted, let's warn users and ask them to confirm that they
really want to use this command.

If no terminal is used, let's just die() for now. A follow-up
commit will introduce a --force option that will allow using
this option when no terminal is available.

It will be easier to test that --filter is working correctly
in the follow-up commit that adds --force, so let's just test
that we die() if no terminal is used for now.

Signed-off-by: John Cai <johncai86@gmail.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/git-repack.txt |  7 ++++++
 builtin/repack.c             | 41 ++++++++++++++++++++++++++++++------
 t/t7700-repack.sh            |  5 +++++
 3 files changed, 46 insertions(+), 7 deletions(-)

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index 0bf13893d8..230f176e10 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -137,6 +137,13 @@ depth is 4095.
 	a larger and slower repository; see the discussion in
 	`pack.packSizeLimit`.
 
+--filter=<filter-spec>::
+	Omits certain objects (usually blobs) from the resulting
+	packfile. WARNING: this could easily corrupt the current repo
+	and lose data if ANY of the omitted objects hasn't been
+	already pushed to a remote. See linkgit:git-rev-list[1] for
+	valid `<filter-spec>` forms.
+
 -b::
 --write-bitmap-index::
 	Write a reachability bitmap index as part of the repack. This
diff --git a/builtin/repack.c b/builtin/repack.c
index a5bacc7797..0a93f38da4 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -17,6 +17,7 @@
 #include "pack.h"
 #include "pack-bitmap.h"
 #include "refs.h"
+#include "prompt.h"
 
 #define ALL_INTO_ONE 1
 #define LOOSEN_UNREACHABLE 2
@@ -50,6 +51,7 @@ struct pack_objects_args {
 	const char *depth;
 	const char *threads;
 	const char *max_pack_size;
+	const char *filter;
 	int no_reuse_delta;
 	int no_reuse_object;
 	int quiet;
@@ -201,6 +203,8 @@ static void prepare_pack_objects(struct child_process *cmd,
 		strvec_pushf(&cmd->args, "--threads=%s", args->threads);
 	if (args->max_pack_size)
 		strvec_pushf(&cmd->args, "--max-pack-size=%s", args->max_pack_size);
+	if (args->filter)
+		strvec_pushf(&cmd->args, "--filter=%s", args->filter);
 	if (args->no_reuse_delta)
 		strvec_pushf(&cmd->args, "--no-reuse-delta");
 	if (args->no_reuse_object)
@@ -268,6 +272,13 @@ static unsigned populate_pack_exts(char *name)
 	return ret;
 }
 
+static void write_promisor_file_1(char *p)
+{
+	char *promisor_name = mkpathdup("%s-%s.promisor", packtmp, p);
+	write_promisor_file(promisor_name, NULL, 0);
+	free(promisor_name);
+}
+
 static void repack_promisor_objects(const struct pack_objects_args *args,
 				    struct string_list *names)
 {
@@ -299,7 +310,6 @@ static void repack_promisor_objects(const struct pack_objects_args *args,
 	out = xfdopen(cmd.out, "r");
 	while (strbuf_getline_lf(&line, out) != EOF) {
 		struct string_list_item *item;
-		char *promisor_name;
 
 		if (line.len != the_hash_algo->hexsz)
 			die(_("repack: Expecting full hex object ID lines only from pack-objects."));
@@ -316,13 +326,8 @@ static void repack_promisor_objects(const struct pack_objects_args *args,
 		 * concatenate the contents of all .promisor files instead of
 		 * just creating a new empty file.
 		 */
-		promisor_name = mkpathdup("%s-%s.promisor", packtmp,
-					  line.buf);
-		write_promisor_file(promisor_name, NULL, 0);
-
+		write_promisor_file_1(line.buf);
 		item->util = (void *)(uintptr_t)populate_pack_exts(item->string);
-
-		free(promisor_name);
 	}
 	fclose(out);
 	if (finish_command(&cmd))
@@ -786,6 +791,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 				N_("limits the maximum number of threads")),
 		OPT_STRING(0, "max-pack-size", &po_args.max_pack_size, N_("bytes"),
 				N_("maximum size of each packfile")),
+		OPT_STRING(0, "filter", &po_args.filter, N_("args"),
+				N_("object filtering")),
 		OPT_BOOL(0, "pack-kept-objects", &pack_kept_objects,
 				N_("repack objects in packs marked with .keep")),
 		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
@@ -818,6 +825,24 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 			die(_("options '%s' and '%s' cannot be used together"), "--cruft", "-k");
 	}
 
+	if (po_args.filter) {
+		const char *yesno;
+
+		if (!isatty(STDIN_FILENO))
+			die (_("Repacking with a filter is not allowed "
+			       "yet unless a terminal is used!"));
+
+		/*
+		 * TRANSLATORS: Make sure to include [y] and [N] in your translation.
+		 * The program will only accept English input at this point.
+		 */
+		yesno = git_prompt(_("Repacking with a filter will lose data and corrupt the repo\n"
+				     "if ANY of the filtered out object hasn't been already pushed!\n"
+				     "Repack with a filter anyway [y/N]? "), PROMPT_ECHO);
+		if (tolower(*yesno) != 'y')
+			exit(1);
+	}
+
 	if (write_bitmaps < 0) {
 		if (!write_midx &&
 		    (!(pack_everything & ALL_INTO_ONE) || !is_bare_repository()))
@@ -955,6 +980,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		if (line.len != the_hash_algo->hexsz)
 			die(_("repack: Expecting full hex object ID lines only from pack-objects."));
 		string_list_append(&names, line.buf);
+		if (po_args.filter)
+			write_promisor_file_1(line.buf);
 	}
 	fclose(out);
 	ret = finish_command(&cmd);
diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index ca45c4cd2c..f8764d1dd9 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -237,6 +237,11 @@ test_expect_success 'auto-bitmaps do not complain if unavailable' '
 	test_must_be_empty actual
 '
 
+test_expect_success 'repacking with a filter is not allowed' '
+	test_must_fail git repack -a -d --filter=blob:none 2>actual &&
+	test_i18ngrep "Repacking with a filter is not allowed" actual
+'
+
 objdir=.git/objects
 midx=$objdir/pack/multi-pack-index
 
-- 
2.38.0.4.g7f9724c7bf.dirty



* [PATCH 3/3] repack: introduce --force to force filtering
  2022-10-12 13:51 [PATCH 0/3] Implement filtering repacks Christian Couder
  2022-10-12 13:51 ` [PATCH 1/3] pack-objects: allow --filter without --stdout Christian Couder
  2022-10-12 13:51 ` [PATCH 2/3] repack: add --filter=<filter-spec> option Christian Couder
@ 2022-10-12 13:51 ` Christian Couder
  2022-10-14 16:46 ` [PATCH 0/3] Implement filtering repacks Junio C Hamano
  2022-10-25 12:28 ` [PATCH v2 0/2] " Christian Couder
  4 siblings, 0 replies; 29+ messages in thread
From: Christian Couder @ 2022-10-12 13:51 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Christian Couder, Christian Couder

A previous commit introduced --filter=<filter-spec>, but disallowed it
when stdin doesn't refer to a terminal, and asked the user for
confirmation when it does. This is because this option is dangerous,
as it could lose data and corrupt the repo.

In some cases people might want to run this command automatically when
stdin doesn't refer to a terminal, or when no question should be
asked. Let's introduce --force for this purpose.

Unfortunately, we cannot use OPT__FORCE() to implement this because -f is
already a repack option (to pass --no-reuse-delta to git-pack-objects),
so we use OPT_BOOL() instead.

This is also a good time to test that --filter works properly, as
that wasn't done in the previous commit that introduced --filter.

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/git-repack.txt | 12 +++++++++++-
 builtin/repack.c             |  7 +++++--
 t/t7700-repack.sh            | 15 +++++++++++++++
 3 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index 230f176e10..574f569fbe 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -142,7 +142,17 @@ depth is 4095.
 	packfile. WARNING: this could easily corrupt the current repo
 	and lose data if ANY of the omitted objects hasn't been
 	already pushed to a remote. See linkgit:git-rev-list[1] for
-	valid `<filter-spec>` forms.
+	valid `<filter-spec>` forms. See also `--force` option below.
+
+--force::
+	By default when using `--filter=<filter-spec>` and stdin
+	refers to a terminal, the user will be warned and asked if the
+	filtering repack should really be launched. This tries to
+	avoid possible data loss and repo corruption. See `--filter`
+	option above. If stdin doesn't refer to a terminal, the repack
+	is aborted. `--force` allows the repack to go on anyway when
+	stdin doesn't refer to a terminal or when no question should
+	be asked.
 
 -b::
 --write-bitmap-index::
diff --git a/builtin/repack.c b/builtin/repack.c
index 0a93f38da4..3660dbb7a7 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -56,6 +56,7 @@ struct pack_objects_args {
 	int no_reuse_object;
 	int quiet;
 	int local;
+	int force;
 };
 
 static int repack_config(const char *var, const char *value, void *cb)
@@ -793,6 +794,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 				N_("maximum size of each packfile")),
 		OPT_STRING(0, "filter", &po_args.filter, N_("args"),
 				N_("object filtering")),
+		OPT_BOOL(0, "force", &po_args.force,
+				N_("force object filtering")),
 		OPT_BOOL(0, "pack-kept-objects", &pack_kept_objects,
 				N_("repack objects in packs marked with .keep")),
 		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
@@ -825,12 +828,12 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 			die(_("options '%s' and '%s' cannot be used together"), "--cruft", "-k");
 	}
 
-	if (po_args.filter) {
+	if (po_args.filter && !po_args.force) {
 		const char *yesno;
 
 		if (!isatty(STDIN_FILENO))
 			die (_("Repacking with a filter is not allowed "
-			       "yet unless a terminal is used!"));
+			       "unless a terminal is used or --force is passed!"));
 
 		/*
 		 * TRANSLATORS: Make sure to include [y] and [N] in your translation.
diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index f8764d1dd9..20727112b5 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -242,6 +242,21 @@ test_expect_success 'repacking with a filter is not allowed' '
 	test_i18ngrep "Repacking with a filter is not allowed" actual
 '
 
+test_expect_success 'repacking with a filter works when --force is passed' '
+	test_when_finished "rm -rf server client" &&
+	test_create_repo server &&
+	git -C server config uploadpack.allowFilter true &&
+	git -C server config uploadpack.allowAnySHA1InWant true &&
+	test_commit -C server 1 &&
+	git clone --bare --no-local server client &&
+	git -C client config remote.origin.promisor true &&
+	git -C client rev-list --objects --all --missing=print >objects &&
+	test $(grep "^?" objects | wc -l) = 0 &&
+	git -C client -c repack.writebitmaps=false repack -a -d --filter=blob:none --force &&
+	git -C client rev-list --objects --all --missing=print >objects &&
+	test $(grep "^?" objects | wc -l) = 1
+'
+
 objdir=.git/objects
 midx=$objdir/pack/multi-pack-index
 
-- 
2.38.0.4.g7f9724c7bf.dirty



* Re: [PATCH 0/3] Implement filtering repacks
  2022-10-12 13:51 [PATCH 0/3] Implement filtering repacks Christian Couder
                   ` (2 preceding siblings ...)
  2022-10-12 13:51 ` [PATCH 3/3] repack: introduce --force to force filtering Christian Couder
@ 2022-10-14 16:46 ` Junio C Hamano
  2022-10-20 11:23   ` Christian Couder
  2022-10-25 12:28 ` [PATCH v2 0/2] " Christian Couder
  4 siblings, 1 reply; 29+ messages in thread
From: Junio C Hamano @ 2022-10-14 16:46 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, John Cai, Jonathan Tan, Jonathan Nieder, Taylor Blau

Christian Couder <christian.couder@gmail.com> writes:

> For example, one might want to clone with a filter to avoid too much
> space being taken up by some large blobs, and one might realize after
> some time that a number of the large blobs have still been downloaded
> because some old branches referencing them were checked out. In this
> case a filtering repack could remove some of those large blobs.
>
> Some of the comments on the patch series that John sent were related
> to the possible data loss and repo corruption that a filtering repack
> could cause. It's indeed true that it could be very dangerous, and we
> agree that improvements were needed in this area.

The wish is understandable, but I do not think this gives a good UI.

This feature is, from an end-user's point of view, very similar to
"git prune-packed", in that we prune data that is not necessary due
to redundancy.  Nobody runs "prune-packed" directly; most people are
even unaware of it being run on their behalf when they run "git gc".

Reusing pack-objects as an underlying mechanism is OK, but this
needs to be plumbed through to "git gc" for a more consistent
experience for the end users.

Is there a way to check if the "promisor remote" is still willing to
keep the previous promise it made, so that the users do not have to
see "we may corrupt the repository as the result of this operation,
you have been warned", by the way?  Possibly with a protocol
extension?

In a sense, once you made a partial clone, your repository is at the
mercy of the remote.  They can disappear any time with majority of
the data you depend on, leaving only what you created locally and
haven't pruned away, in a repository that may technically pass
"fsck", because the things that are supposed to exist locally
exist, but may not be usable in practice.  So from that point of
view, a simple check that asks "I earlier fetched from you with this
filter and these tips of histories; are you still willing to support
me?" and gets yes/no answer might be sufficient.  A remote that is
not trustworthy can say "yes" and still change their mind later, so
such a check may not even be needed.

The above two big paragraphs are a way to say that I am not all that
worried about losing objects that we should be able to refetch again
by adding this feature.  The perceived need for "--force" or "must
run from terminal" may be overblown.  I do not think this negatively
affects correctness or robustness at all (as long as the pruning is
not buggy, of course).

HOWEVER

Unlike prune-packed, pruning objects that could be refetched has
negative performance impact.  So adding an option to enable/disable
such pruning is needed.  If the frontmost UI entry point were "gc",
it needs an opt-in option to invoke the "filtering repack", in other
words.  "pack-objects" should not need any more work than what you
have here (with the "terminal" and "force" discarded), as "--filter"
is such an opt-in option already.



* Re: [PATCH 0/3] Implement filtering repacks
  2022-10-14 16:46 ` [PATCH 0/3] Implement filtering repacks Junio C Hamano
@ 2022-10-20 11:23   ` Christian Couder
  2022-10-28 19:49     ` Taylor Blau
  0 siblings, 1 reply; 29+ messages in thread
From: Christian Couder @ 2022-10-20 11:23 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, John Cai, Jonathan Tan, Jonathan Nieder, Taylor Blau

On Fri, Oct 14, 2022 at 6:46 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> > For example, one might want to clone with a filter to avoid too much
> > space being taken up by some large blobs, and one might realize after
> > some time that a number of the large blobs have still been downloaded
> > because some old branches referencing them were checked out. In this
> > case a filtering repack could remove some of those large blobs.
> >
> > Some of the comments on the patch series that John sent were related
> > to the possible data loss and repo corruption that a filtering repack
> > could cause. It's indeed true that it could be very dangerous, and we
> > agree that improvements were needed in this area.
>
> The wish is understandable, but I do not think this gives a good UI.
>
> This feature is, from an end-user's point of view, very similar to
> "git prune-packed", in that we prune data that is not necessary due
> to redundancy.  Nobody runs "prune-packed" directly; most people are
> even unaware of it being run on their behalf when they run "git gc".

I am OK with adding the --filter option to `git gc`, or a config
option with a similar effect. I wonder how `git gc` should implement
that option though.

If we implement a new command called for example `git filter-packed`,
similar to `git prune-packed`, then this new command will call `git
pack-objects --filter=...`.

`git gc` is already running `git repack` under the hood in a number of
cases though. So running `git gc --filter=...` would in many cases
call `git pack-objects` twice, as it would call it once through
`git repack` and once through `git filter-packed`. Or am I missing
something here?

If, on the other hand, --filter were implemented in some way in `git
repack`, then `git gc --filter=...` could just call `git repack` once.

So even if the new feature should be run only through `git gc` and
perhaps a new command possibly called `git filter-packed`, I think it
might make sense for efficiency to implement it in some way, maybe
with an undocumented option or flag, in `git repack`.

> Reusing pack-objects as an underlying mechanism is OK, but this
> needs to be plumbed through to "git gc" for a more consistent
> experience for the end users.

It seems to me that `git prune-packed` only removes loose objects
that are already present in pack files. So there is no risk of losing
data or corrupting the repo.

Instead, the new feature could in some cases lose data and corrupt the
repo if some removed objects haven't yet been pushed. So on the client
side, it seems dangerous to me to make it run automatically without a
check that everything has been pushed.

Unfortunately some users might already run `git gc` automatically, in
cron scripts for example, and they might be tempted to just add
`--filter=...` to their `git gc` invocation, or to set up a config
option with a similar effect, without always properly checking that
everything has been pushed.

So I am OK with trying to make the experience consistent, but I worry
that it would let people shoot themselves in the foot too easily.

I feel that an obscure `git repack` option would be less likely to be
run automatically.

> Is there a way to check if the "promisor remote" is still willing to
> keep the previous promise it made, so that the users do not have to
> see "we may corrupt the repository as the result of this operation,
> you have been warned", by the way?  Possibly with a protocol
> extension?
>
> In a sense, once you made a partial clone, your repository is at the
> mercy of the remote.  They can disappear any time with majority of
> the data you depend on, leaving only what you created locally and
> haven't pruned away, in a repository that may technically pass
> "fsck", because the things that are supposed to exist locally
> exists, but may not be usable in practice.

Yeah, when a user clones using --filter=..., the remote can disappear
anytime, and we haven't been very worried about that.

> So from that point of
> view, a simple check that asks "I earlier fetched from you with this
> filter and these tips of histories; are you still willing to support
> me?" and gets yes/no answer might be sufficient.  A remote that is
> not trustworthy can say "yes" and still change their mind later, so
> such a check may not even be needed.

Yeah, or a remote that uses some kind of high-availability system
underneath might consider it too expensive and unnecessary to check
that everything is properly saved everywhere, as the underlying
system has been designed for that purpose and already runs enough
internal checks.

> The above two big paragraphs are a way to say that I am not all that
> worried about losing objects that we should be able to refetch again
> by adding this feature.

I agree. I think it's more important to worry about objects that might
have been added locally to the repo, but might not have been pushed
somewhere else (yet).

> The perceived need for "--force" or "must
> run from terminal" may be overblown.  I do not think this negatively
> affects correctness or robustness at all (as long as the pruning is
> not buggy, of course).

I am OK with removing the "must run from terminal" check and
"--force" if we consider that people using this feature should know
what they are doing.

> HOWEVER
>
> Unlike prune-packed, pruning objects that could be refetched has
> negative performance impact.  So adding an option to enable/disable
> such pruning is needed.

I think a command line option like `--filter=...` is what makes it the
most obvious that something special is going on, compared to a config
option.

> If the frontmost UI entry point were "gc",
> it needs an opt-in option to invoke the "filtering repack", in other
> words.  "pack-objects" should not need any more work than what you
> have here (with the "terminal" and "force" discarded), as "--filter"
> is such an opt-in option already.

Yeah. So to sum up, it looks like you are OK with `git gc
--filter=...`, which is fine for me, even if I wonder whether `git
repack --filter=...` could be a good first step, as it is less likely
to be used automatically (so safer in a way) and it might be better
for implementation-related performance reasons.


* [PATCH v2 0/2] Implement filtering repacks
  2022-10-12 13:51 [PATCH 0/3] Implement filtering repacks Christian Couder
                   ` (3 preceding siblings ...)
  2022-10-14 16:46 ` [PATCH 0/3] Implement filtering repacks Junio C Hamano
@ 2022-10-25 12:28 ` Christian Couder
  2022-10-25 12:28   ` [PATCH v2 1/2] pack-objects: allow --filter without --stdout Christian Couder
                     ` (3 more replies)
  4 siblings, 4 replies; 29+ messages in thread
From: Christian Couder @ 2022-10-25 12:28 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Christian Couder

Earlier this year, John Cai sent 2 versions of a patch series to
implement `git repack --filter=<filter-spec>`:

https://lore.kernel.org/git/pull.1206.git.git.1643248180.gitgitgadget@gmail.com/

We tried to "sell" it as a way to use partial clone on a Git server to
offload large blobs to, for example, an http server, while using
multiple promisor remotes on the client side.

Even though it is still our end goal, it seems a bit far-fetched for
now, and unnecessary, as `git repack --filter=<filter-spec>` could be
useful on the client side too.

For example, one might want to clone with a filter to avoid too much
space being taken up by some large blobs, and one might realize after
some time that a number of the large blobs have still been downloaded
because some old branches referencing them were checked out. In this
case a filtering repack could remove some of those large blobs.

Some of the comments on the patch series that John sent were related
to the possible data loss and repo corruption that a filtering repack
could cause. It's indeed true that it could be very dangerous, so the
first version of this patch series asked the user to confirm the
command, either by answering 'Y' on the command line or by passing
`--force`.

In the discussion with Junio following that first version though, it
appeared that asking for such confirmation might not be necessary, so
the differences in this v2 compared to the v1 are:

  - the patch 3/3 that introduced the `--force` option has been
    removed,
    
  - the test that checked that the command couldn't be launched
    without a terminal and without --force has been removed,

  - the code that checked if the command was launched from a terminal
    and that asked for confirmation on the command line has been
    removed,

  - the documentation of --filter has been improved to make it clearer
    that objects created locally which haven't been pushed could be
    deleted.

So there are only 2 patches now in this v2 series:

  - Patch 1/2 is a preparatory patch.

  - Patch 2/2 introduces the `--filter=<filter-spec>` option.

In the discussion with Junio following the first version, Junio
suggested adding `--filter=<filter-spec>` to `git gc`, and I am still
OK with doing that, either later in a follow-up patch or in a v3. I
haven't done it yet, as I think it's better for performance reasons
if the filtering is done underneath by `git repack` anyway (even if,
for example, the --filter option is undocumented in `git repack`),
and I'd like the approach in this v2 series to be reviewed first.

Thanks to Junio for discussing the v1, to John Cai, who worked
on the previous versions, to Jonathan Nieder, Jonathan Tan and Taylor
Blau, who discussed this with me at the Git Merge and Contributor
Summit, and to Stolee, Taylor, Robert Coup and Junio who discussed the
versions John sent.

Not sure it's very useful as it's quite big, but here is the
range-diff compared to v1:

1:  0ac7ebf48c = 1:  d1c65ff1f5 pack-objects: allow --filter without --stdout
2:  144356a97e ! 2:  ac21b4ec8f repack: add --filter=<filter-spec> option
    @@ Commit message
     
         This command could be dangerous to use though, as it might remove
         local objects that haven't been pushed which would lose data and
    -    corrupt the repo. On a server this command could also corrupt a
    +    corrupt the repo. On a server, this command could also corrupt a
         repo unless ALL the removed objects aren't already available in
         another remote that clients can access.
     
    -    To avoid as much as possible data to be lost and repos to be
    -    corrupted, let's warn users and ask them to confirm that they
    -    really want to use this command.
    -
    -    If no terminal is used, let's just die() for now. A follow-up
    -    commit will introduce a --force option that will allow using
    -    this option when no terminal is available.
    -
    -    It will be easier to test that --filter is working correctly
    -    in the follow-up commit that adds --force, so let's just test
    -    that we die() if no terminal is used for now.
    -
         Signed-off-by: John Cai <johncai86@gmail.com>
         Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
     
    @@ Documentation/git-repack.txt: depth is 4095.
     +--filter=<filter-spec>::
     +  Omits certain objects (usually blobs) from the resulting
     +  packfile. WARNING: this could easily corrupt the current repo
    -+  and lose data if ANY of the omitted objects hasn't been
    -+  already pushed to a remote. See linkgit:git-rev-list[1] for
    -+  valid `<filter-spec>` forms.
    ++  and lose data if ANY of the omitted objects hasn't been already
    ++  pushed to a remote. Be very careful about objects that might
    ++  have been created locally! See linkgit:git-rev-list[1] for valid
    ++  `<filter-spec>` forms.
     +
      -b::
      --write-bitmap-index::
        Write a reachability bitmap index as part of the repack. This
     
      ## builtin/repack.c ##
    -@@
    - #include "pack.h"
    - #include "pack-bitmap.h"
    - #include "refs.h"
    -+#include "prompt.h"
    - 
    - #define ALL_INTO_ONE 1
    - #define LOOSEN_UNREACHABLE 2
     @@ builtin/repack.c: struct pack_objects_args {
        const char *depth;
        const char *threads;
    @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix
                OPT_BOOL(0, "pack-kept-objects", &pack_kept_objects,
                                N_("repack objects in packs marked with .keep")),
                OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
    -@@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix)
    -                   die(_("options '%s' and '%s' cannot be used together"), "--cruft", "-k");
    -   }
    - 
    -+  if (po_args.filter) {
    -+          const char *yesno;
    -+
    -+          if (!isatty(STDIN_FILENO))
    -+                  die (_("Repacking with a filter is not allowed "
    -+                         "yet unless a terminal is used!"));
    -+
    -+          /*
    -+           * TRANSLATORS: Make sure to include [y] and [N] in your translation.
    -+           * The program will only accept English input at this point.
    -+           */
    -+          yesno = git_prompt(_("Repacking with a filter will lose data and corrupt the repo\n"
    -+                               "if ANY of the filtered out object hasn't been already pushed!\n"
    -+                               "Repack with a filter anyway [y/N]? "), PROMPT_ECHO);
    -+          if (tolower(*yesno) != 'y')
    -+                  exit(1);
    -+  }
    -+
    -   if (write_bitmaps < 0) {
    -           if (!write_midx &&
    -               (!(pack_everything & ALL_INTO_ONE) || !is_bare_repository()))
     @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix)
                if (line.len != the_hash_algo->hexsz)
                        die(_("repack: Expecting full hex object ID lines only from pack-objects."));
    @@ t/t7700-repack.sh: test_expect_success 'auto-bitmaps do not complain if unavaila
        test_must_be_empty actual
      '
      
    -+test_expect_success 'repacking with a filter is not allowed' '
    -+  test_must_fail git repack -a -d --filter=blob:none 2>actual &&
    -+  test_i18ngrep "Repacking with a filter is not allowed" actual
    ++test_expect_success 'repacking with a filter works' '
    ++  test_when_finished "rm -rf server client" &&
    ++  test_create_repo server &&
    ++  git -C server config uploadpack.allowFilter true &&
    ++  git -C server config uploadpack.allowAnySHA1InWant true &&
    ++  test_commit -C server 1 &&
    ++  git clone --bare --no-local server client &&
    ++  git -C client config remote.origin.promisor true &&
    ++  git -C client rev-list --objects --all --missing=print >objects &&
    ++  test $(grep "^?" objects | wc -l) = 0 &&
    ++  git -C client -c repack.writebitmaps=false repack -a -d --filter=blob:none &&
    ++  git -C client rev-list --objects --all --missing=print >objects &&
    ++  test $(grep "^?" objects | wc -l) = 1
     +'
     +
      objdir=.git/objects
3:  a23e19796e < -:  ---------- repack: introduce --force to force filtering

Christian Couder (2):
  pack-objects: allow --filter without --stdout
  repack: add --filter=<filter-spec> option

 Documentation/git-repack.txt |  8 ++++++++
 builtin/pack-objects.c       |  6 +-----
 builtin/repack.c             | 22 +++++++++++++++-------
 t/t7700-repack.sh            | 15 +++++++++++++++
 4 files changed, 39 insertions(+), 12 deletions(-)

-- 
2.38.1.145.g80cce38e46.dirty


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v2 1/2] pack-objects: allow --filter without --stdout
  2022-10-25 12:28 ` [PATCH v2 0/2] " Christian Couder
@ 2022-10-25 12:28   ` Christian Couder
  2022-10-25 12:28   ` [PATCH v2 2/2] repack: add --filter=<filter-spec> option Christian Couder
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 29+ messages in thread
From: Christian Couder @ 2022-10-25 12:28 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Christian Couder

From: Christian Couder <chriscool@tuxfamily.org>

9535ce7337 (pack-objects: add list-objects filtering, 2017-11-21)
taught pack-objects to use --filter, but required the use of
--stdout since a partial clone mechanism was not yet in place to
handle missing objects. Since then, changes like 9e27beaa23
(promisor-remote: implement promisor_remote_get_direct(), 2019-06-25)
and others added support to dynamically fetch objects that were missing.

Remove the --stdout requirement so that in a follow-up commit, repack
can pass --filter to pack-objects to omit certain objects from the
resulting packfile.

Signed-off-by: John Cai <johncai86@gmail.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 builtin/pack-objects.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 3658c05caf..d707e6230d 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -4385,12 +4385,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (!rev_list_all || !rev_list_reflog || !rev_list_index)
 		unpack_unreachable_expiration = 0;
 
-	if (pfd.have_revs && pfd.revs.filter.choice) {
-		if (!pack_to_stdout)
-			die(_("cannot use --filter without --stdout"));
-		if (stdin_packs)
+	if (pfd.have_revs && pfd.revs.filter.choice && stdin_packs)
 			die(_("cannot use --filter with --stdin-packs"));
-	}
 
 	if (stdin_packs && use_internal_rev_list)
 		die(_("cannot use internal rev list with --stdin-packs"));
-- 
2.38.1.145.g80cce38e46.dirty



* [PATCH v2 2/2] repack: add --filter=<filter-spec> option
  2022-10-25 12:28 ` [PATCH v2 0/2] " Christian Couder
  2022-10-25 12:28   ` [PATCH v2 1/2] pack-objects: allow --filter without --stdout Christian Couder
@ 2022-10-25 12:28   ` Christian Couder
  2022-10-28 19:54   ` [PATCH v2 0/2] Implement filtering repacks Taylor Blau
  2022-11-22 17:51   ` [PATCH v3 " Christian Couder
  3 siblings, 0 replies; 29+ messages in thread
From: Christian Couder @ 2022-10-25 12:28 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Christian Couder

From: Christian Couder <chriscool@tuxfamily.org>

After cloning with --filter=<filter-spec>, for example to avoid
getting unneeded large files on a user machine, it's possible
that some of these large files still get fetched for some reasons
(like checking out old branches) over time.

In this case the repo size could grow too much for no good reason
and `git repack --filter=<filter-spec>` would be useful to remove
the unneeded large files.

This command could be dangerous to use though, as it might remove
local objects that haven't been pushed which would lose data and
corrupt the repo. On a server, this command could also corrupt a
repo unless ALL the removed objects aren't already available in
another remote that clients can access.

Signed-off-by: John Cai <johncai86@gmail.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/git-repack.txt |  8 ++++++++
 builtin/repack.c             | 22 +++++++++++++++-------
 t/t7700-repack.sh            | 15 +++++++++++++++
 3 files changed, 38 insertions(+), 7 deletions(-)

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index 0bf13893d8..d28bb68cfc 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -137,6 +137,14 @@ depth is 4095.
 	a larger and slower repository; see the discussion in
 	`pack.packSizeLimit`.
 
+--filter=<filter-spec>::
+	Omits certain objects (usually blobs) from the resulting
+	packfile. WARNING: this could easily corrupt the current repo
+	and lose data if ANY of the omitted objects hasn't been already
+	pushed to a remote. Be very careful about objects that might
+	have been created locally! See linkgit:git-rev-list[1] for valid
+	`<filter-spec>` forms.
+
 -b::
 --write-bitmap-index::
 	Write a reachability bitmap index as part of the repack. This
diff --git a/builtin/repack.c b/builtin/repack.c
index a5bacc7797..d8b8a58f73 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -50,6 +50,7 @@ struct pack_objects_args {
 	const char *depth;
 	const char *threads;
 	const char *max_pack_size;
+	const char *filter;
 	int no_reuse_delta;
 	int no_reuse_object;
 	int quiet;
@@ -201,6 +202,8 @@ static void prepare_pack_objects(struct child_process *cmd,
 		strvec_pushf(&cmd->args, "--threads=%s", args->threads);
 	if (args->max_pack_size)
 		strvec_pushf(&cmd->args, "--max-pack-size=%s", args->max_pack_size);
+	if (args->filter)
+		strvec_pushf(&cmd->args, "--filter=%s", args->filter);
 	if (args->no_reuse_delta)
 		strvec_pushf(&cmd->args, "--no-reuse-delta");
 	if (args->no_reuse_object)
@@ -268,6 +271,13 @@ static unsigned populate_pack_exts(char *name)
 	return ret;
 }
 
+static void write_promisor_file_1(char *p)
+{
+	char *promisor_name = mkpathdup("%s-%s.promisor", packtmp, p);
+	write_promisor_file(promisor_name, NULL, 0);
+	free(promisor_name);
+}
+
 static void repack_promisor_objects(const struct pack_objects_args *args,
 				    struct string_list *names)
 {
@@ -299,7 +309,6 @@ static void repack_promisor_objects(const struct pack_objects_args *args,
 	out = xfdopen(cmd.out, "r");
 	while (strbuf_getline_lf(&line, out) != EOF) {
 		struct string_list_item *item;
-		char *promisor_name;
 
 		if (line.len != the_hash_algo->hexsz)
 			die(_("repack: Expecting full hex object ID lines only from pack-objects."));
@@ -316,13 +325,8 @@ static void repack_promisor_objects(const struct pack_objects_args *args,
 		 * concatenate the contents of all .promisor files instead of
 		 * just creating a new empty file.
 		 */
-		promisor_name = mkpathdup("%s-%s.promisor", packtmp,
-					  line.buf);
-		write_promisor_file(promisor_name, NULL, 0);
-
+		write_promisor_file_1(line.buf);
 		item->util = (void *)(uintptr_t)populate_pack_exts(item->string);
-
-		free(promisor_name);
 	}
 	fclose(out);
 	if (finish_command(&cmd))
@@ -786,6 +790,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 				N_("limits the maximum number of threads")),
 		OPT_STRING(0, "max-pack-size", &po_args.max_pack_size, N_("bytes"),
 				N_("maximum size of each packfile")),
+		OPT_STRING(0, "filter", &po_args.filter, N_("args"),
+				N_("object filtering")),
 		OPT_BOOL(0, "pack-kept-objects", &pack_kept_objects,
 				N_("repack objects in packs marked with .keep")),
 		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
@@ -955,6 +961,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		if (line.len != the_hash_algo->hexsz)
 			die(_("repack: Expecting full hex object ID lines only from pack-objects."));
 		string_list_append(&names, line.buf);
+		if (po_args.filter)
+			write_promisor_file_1(line.buf);
 	}
 	fclose(out);
 	ret = finish_command(&cmd);
diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index ca45c4cd2c..363eaf8fea 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -237,6 +237,21 @@ test_expect_success 'auto-bitmaps do not complain if unavailable' '
 	test_must_be_empty actual
 '
 
+test_expect_success 'repacking with a filter works' '
+	test_when_finished "rm -rf server client" &&
+	test_create_repo server &&
+	git -C server config uploadpack.allowFilter true &&
+	git -C server config uploadpack.allowAnySHA1InWant true &&
+	test_commit -C server 1 &&
+	git clone --bare --no-local server client &&
+	git -C client config remote.origin.promisor true &&
+	git -C client rev-list --objects --all --missing=print >objects &&
+	test $(grep "^?" objects | wc -l) = 0 &&
+	git -C client -c repack.writebitmaps=false repack -a -d --filter=blob:none &&
+	git -C client rev-list --objects --all --missing=print >objects &&
+	test $(grep "^?" objects | wc -l) = 1
+'
+
 objdir=.git/objects
 midx=$objdir/pack/multi-pack-index
 
-- 
2.38.1.145.g80cce38e46.dirty



* Re: [PATCH 0/3] Implement filtering repacks
  2022-10-20 11:23   ` Christian Couder
@ 2022-10-28 19:49     ` Taylor Blau
  2022-10-28 20:26       ` Junio C Hamano
  2022-11-07  9:00       ` Christian Couder
  0 siblings, 2 replies; 29+ messages in thread
From: Taylor Blau @ 2022-10-28 19:49 UTC (permalink / raw)
  To: Christian Couder
  Cc: Junio C Hamano, git, John Cai, Jonathan Tan, Jonathan Nieder

On Thu, Oct 20, 2022 at 01:23:02PM +0200, Christian Couder wrote:
> On Fri, Oct 14, 2022 at 6:46 PM Junio C Hamano <gitster@pobox.com> wrote:
> >
> > Christian Couder <christian.couder@gmail.com> writes:
> >
> > > For example one might want to clone with a filter to avoid too many
> > > space to be taken by some large blobs, and one might realize after
> > > some time that a number of the large blobs have still be downloaded
> > > because some old branches referencing them were checked out. In this
> > > case a filtering repack could remove some of those large blobs.
> > >
> > > Some of the comments on the patch series that John sent were related
> > > to the possible data loss and repo corruption that a filtering repack
> > > could cause. It's indeed true that it could be very dangerous, and we
> > > agree that improvements were needed in this area.
> >
> > The wish is understandable, but I do not think this gives a good UI.
> >
> > This feature is, from an end-user's point of view, very similar to
> > "git prune-packed", in that we prune data that is not necessary due
> > to redundancy.  Nobody runs "prune-packed" directly; most people are
> > even unaware of it being run on their behalf when they run "git gc".
>
> I am Ok with adding the --filter option to `git gc`, or a config
> option with a similar effect. I wonder how `git gc` should implement
> that option though.
>
> If we implement a new command called for example `git filter-packed`,
> similar to `git prune-packed`, then this new command will call `git
> pack-objects --filter=...`.

Conceptually, yes, the two are similar. Though `prune-filtered` is
necessarily going to differ in implementation from `prune-packed`, since
we will have to write new pack(s), not just delete loose objects which
appear in packs already.

So it's really not just a matter of purely deleting redundant loose
copies of objects like in the case of prune-packed. Here we really do
care about potentially writing a new set of packs to satisfy the new
filter constraint.

Presumably that tool would implement creating the new packs according to
the given --filter, and would similarly delete existing packs. That is
basically what your implementation in repack already does, so I am not
sure what the difference would be.

> Yeah. So to sum up, it looks like you are Ok with `git gc
> --filter=...`  which is fine for me, even if I wonder if `git repack
> --filter=...` could be a good first step as it is less likely to be
> used automatically (so safer in a way) and it might be better for
> implementation related performance reasons.

If we don't intend to have `git repack --filter` part of our backwards
compatibility guarantee, then I would prefer to see the implementation
just live in git-gc from start to finish.

Thanks,
Taylor


* Re: [PATCH v2 0/2] Implement filtering repacks
  2022-10-25 12:28 ` [PATCH v2 0/2] " Christian Couder
  2022-10-25 12:28   ` [PATCH v2 1/2] pack-objects: allow --filter without --stdout Christian Couder
  2022-10-25 12:28   ` [PATCH v2 2/2] repack: add --filter=<filter-spec> option Christian Couder
@ 2022-10-28 19:54   ` Taylor Blau
  2022-11-07  9:29     ` Christian Couder
  2022-11-22 17:51   ` [PATCH v3 " Christian Couder
  3 siblings, 1 reply; 29+ messages in thread
From: Taylor Blau @ 2022-10-28 19:54 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee

On Tue, Oct 25, 2022 at 02:28:54PM +0200, Christian Couder wrote:
> So there are only 2 patches now in this v2 series:
>
>   - Patch 1/2 is a preparatory patch.
>
>   - Patch 2/2 introduces the `--filter=<filter-spec>` option.

One thing that I wasn't clear on in this or the previous round(s) was
how we handle setting remote.<name>.promisor and partialclonefilter.

If there is a single remote, then it's obvious that we should set
promisor to "true" and partialCloneFilter to whatever value of
`--filter` the user provided when repacking / GCing.

But what happens if there are multiple remotes? Which get the new
configuration settings modified?

I wonder what breakage happens if we fail to do that (and why such
breakage isn't yet noticed by CI).

Thanks,
Taylor


* Re: [PATCH 0/3] Implement filtering repacks
  2022-10-28 19:49     ` Taylor Blau
@ 2022-10-28 20:26       ` Junio C Hamano
  2022-11-07  9:12         ` Christian Couder
  2022-11-07  9:00       ` Christian Couder
  1 sibling, 1 reply; 29+ messages in thread
From: Junio C Hamano @ 2022-10-28 20:26 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Christian Couder, git, John Cai, Jonathan Tan, Jonathan Nieder

Taylor Blau <me@ttaylorr.com> writes:

>> > This feature is, from an end-user's point of view, very similar to
>> > "git prune-packed", in that we prune data that is not necessary due
>> > to redundancy.  Nobody runs "prune-packed" directly; most people are
>> > even unaware of it being run on their behalf when they run "git gc".
>> ...
> If we don't intend to have `git repack --filter` part of our backwards
> compatibility guarantee, then I would prefer to see the implementation
> just live in git-gc from start to finish.

I am OK with that, too.  We do not have to expose the option to
users of pack-objects at all.  Just make "gc" the single entry point
and that should be quite clean.


* Re: [PATCH 0/3] Implement filtering repacks
  2022-10-28 19:49     ` Taylor Blau
  2022-10-28 20:26       ` Junio C Hamano
@ 2022-11-07  9:00       ` Christian Couder
  1 sibling, 0 replies; 29+ messages in thread
From: Christian Couder @ 2022-11-07  9:00 UTC (permalink / raw)
  To: Taylor Blau; +Cc: Junio C Hamano, git, John Cai, Jonathan Tan, Jonathan Nieder

On Fri, Oct 28, 2022 at 9:49 PM Taylor Blau <me@ttaylorr.com> wrote:
>
> On Thu, Oct 20, 2022 at 01:23:02PM +0200, Christian Couder wrote:
> > On Fri, Oct 14, 2022 at 6:46 PM Junio C Hamano <gitster@pobox.com> wrote:
> > >
> > > Christian Couder <christian.couder@gmail.com> writes:
> > >
> > > > For example one might want to clone with a filter to avoid too many
> > > > space to be taken by some large blobs, and one might realize after
> > > > some time that a number of the large blobs have still be downloaded
> > > > because some old branches referencing them were checked out. In this
> > > > case a filtering repack could remove some of those large blobs.
> > > >
> > > > Some of the comments on the patch series that John sent were related
> > > > to the possible data loss and repo corruption that a filtering repack
> > > > could cause. It's indeed true that it could be very dangerous, and we
> > > > agree that improvements were needed in this area.
> > >
> > > The wish is understandable, but I do not think this gives a good UI.
> > >
> > > This feature is, from an end-user's point of view, very similar to
> > > "git prune-packed", in that we prune data that is not necessary due
> > > to redundancy.  Nobody runs "prune-packed" directly; most people are
> > > even unaware of it being run on their behalf when they run "git gc".
> >
> > I am Ok with adding the --filter option to `git gc`, or a config
> > option with a similar effect. I wonder how `git gc` should implement
> > that option though.
> >
> > If we implement a new command called for example `git filter-packed`,
> > similar to `git prune-packed`, then this new command will call `git
> > pack-objects --filter=...`.
>
> Conceptually, yes, the two are similar. Though `prune-filtered` is
> necessarily going to differ in implementation from `prune-packed`, since
> we will have to write new pack(s), not just delete loose objects which
> appear in packs already.

Yeah, that's why I said `prune-filtered` would call `git pack-objects
--filter=...`.

> So it's really not just a matter of purely deleting redundant loose
> copies of objects like in the case of prune-packed. Here we really do
> care about potentially writing a new set of packs to satisfy the new
> filter constraint.

Yeah, I agree.

> Presumably that tool would implement creating the new packs according to
> the given --filter, and would similarly delete existing packs. That is
> basically what your implementation in repack already does, so I am not
> sure what the difference would be.

Indeed, there wouldn't be much difference implementation-wise between
a new `git filter-packed` command like Junio suggested and the current
implementation I sent, which implements the feature in `git repack`.
(A new `git filter-packed` would just duplicate the repack features
that are needed and call `git pack-objects --filter=...`.) That's why
I don't really see the point of a new `git filter-packed` command, and
the version 2 I sent still implements the feature in `git repack`.

So I have a hard time understanding your comment unless you just agree with me.

> > Yeah. So to sum up, it looks like you are Ok with `git gc
> > --filter=...`  which is fine for me, even if I wonder if `git repack
> > --filter=...` could be a good first step as it is less likely to be
> > used automatically (so safer in a way) and it might be better for
> > implementation related performance reasons.
>
> If we don't intend to have `git repack --filter` part of our backwards
> compatibility guarantee, then I would prefer to see the implementation
> just live in git-gc from start to finish.

About the implementation living in `git gc` I wrote the following:

>>> `git gc` is already running `git repack` under the hood in a number of
>>> cases though. So running `git gc --filter=...` would in many cases
>>> call `git pack-objects` twice, as it would call it once through git
>>> repack and once through `git filter-packed`. Or am I missing something
>>> here?

Even if we don't have a `git filter-packed` command, if the feature is
implemented in `git gc` (but not in `git repack`), `git gc` would just
call `git pack-objects --filter=...` directly. That means `git
pack-objects` would be called twice by `git gc` in some cases (once
through `git repack` and once for this new feature), instead of just
once if the feature was implemented in `git repack`, as `git gc` could
then just call `git repack ... --filter=...` once.
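
That `git gc` already drives `git repack` as a child process is easy
to observe with GIT_TRACE in a throwaway repo. This is only an
illustration of the point above; the exact trace format varies between
git versions:

```shell
# Observe with GIT_TRACE that `git gc` runs `git repack` as a child
# process; this is why implementing --filter in `git repack` avoids a
# second `git pack-objects` invocation from `git gc`.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q repo && cd repo
git -c user.name=u -c user.email=u@example.com commit -q --allow-empty -m one

# The trace (on stderr) records every subprocess gc spawns.
GIT_TRACE=1 git gc -q 2>trace.log
grep "run_command:.*repack" trace.log | head -n 1
```

The grep should surface a `trace: run_command: git repack ...` line,
showing repack (and through it pack-objects) already runs underneath gc.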

That's why I think it's better for performance reasons if the feature
is implemented in `git repack`. If you don't want for some reason to
have `git repack --filter=...` part of our backwards compatibility
guarantee, then --filter can be a hidden and undocumented option in
`git repack`. Or maybe we could use a new env variable to instruct
`git repack` to pass some --filter option to `git pack-objects`, but
my opinion is that it's much simpler to just accept --filter to be a
regular, though dangerous, `git repack` option, and then add --filter
to `git gc`.

I am also Ok with adding --filter to `git gc` in this patch series and
have the doc say that it's better to use `git gc --filter` instead of
`git repack --filter` so that users could learn right away to use the
feature through `git gc` instead of through `git repack`.


* Re: [PATCH 0/3] Implement filtering repacks
  2022-10-28 20:26       ` Junio C Hamano
@ 2022-11-07  9:12         ` Christian Couder
  0 siblings, 0 replies; 29+ messages in thread
From: Christian Couder @ 2022-11-07  9:12 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Taylor Blau, git, John Cai, Jonathan Tan, Jonathan Nieder

On Fri, Oct 28, 2022 at 10:26 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Taylor Blau <me@ttaylorr.com> writes:
>
> >> > This feature is, from an end-user's point of view, very similar to
> >> > "git prune-packed", in that we prune data that is not necessary due
> >> > to redundancy.  Nobody runs "prune-packed" directly; most people are
> >> > even unaware of it being run on their behalf when they run "git gc".
> >> ...
> > If we don't intend to have `git repack --filter` part of our backwards
> > compatibility guarantee, then I would prefer to see the implementation
> > just live in git-gc from start to finish.
>
> I am OK with that, too.  We do not have to expose the option to
> users of pack-objects at all.

You might mean "to users of `git repack`", as `git pack-objects`
already has a `--filter=<filter-spec>` option.

> Just make "gc" the single entry point
> and that should be quite clean.

I am Ok with that if there is a simple way to not call `git
pack-objects` twice in some cases (once through git repack and once
through this new feature). I may be missing something, but
unfortunately I don't see one.

(Or maybe you mean that `--filter` in `git gc` should be an option
that is incompatible with other `git gc` options and configuration
parameters, so that when `git gc --filter=XXX` is called it only
performs what `git repack --filter=XXX` performs in the implementation
I sent, but then people might want to call `git gc` twice instead of
once, if they also want other usual gc housekeeping tasks to be
performed. I don't think that's how `git gc` has been designed to
work so far, though.)


* Re: [PATCH v2 0/2] Implement filtering repacks
  2022-10-28 19:54   ` [PATCH v2 0/2] Implement filtering repacks Taylor Blau
@ 2022-11-07  9:29     ` Christian Couder
  0 siblings, 0 replies; 29+ messages in thread
From: Christian Couder @ 2022-11-07  9:29 UTC (permalink / raw)
  To: Taylor Blau
  Cc: git, Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Derrick Stolee

On Fri, Oct 28, 2022 at 9:54 PM Taylor Blau <me@ttaylorr.com> wrote:
>
> On Tue, Oct 25, 2022 at 02:28:54PM +0200, Christian Couder wrote:
> > So there are only 2 patches now in this v2 series:
> >
> >   - Patch 1/2 is a preparatory patch.
> >
> >   - Patch 2/2 introduces the `--filter=<filter-spec>` option.
>
> One thing that I wasn't clear on in this or the previous round(s) was
> how we handle setting remote.<name>.promisor and partialclonefilter.

Yeah, I agree that it's an interesting question that I overlooked.

> If there is a single remote, then it's obvious that we should set
> promisor to "true" and partialCloneFilter to whatever value of
> `--filter` the user provided when repacking / GCing.

I would be Ok with setting remote.<name>.promisor to true in this
case, but I am not sure we really need to do it.

Maybe the user is mostly interested in reducing the size of the repo
for now and plans to set up a promisor remote afterwards.

Another perhaps better way to handle this would be to just die() if no
remote.<name>.promisor is set to true. This way we can make sure that
users will not forget to set up at least one promisor remote. This
could also give users the opportunity to think about whether their
configured remotes contain all the objects they are going to remove.

About remote.<name>.partialclonefilter I don't think we need to do
anything. Maybe the user would be Ok with having different filters
when fetching and when cleaning up.

> But what happens if there are multiple remotes? Which get the new
> configuration settings modified?

I agree that if we want this feature to modify settings, then there is
no good and simple solution in this case.

> I wonder what breakage happens if we fail to do that (and why such
> breakage isn't yet noticed by CI).

If we want to avoid breakages as much as possible, then die()ing when
no remote.<name>.promisor is set to true seems to be the best
solution, instead of trying to modify settings.
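
As a shell analogue of the die()ing behavior suggested above (the real
series would do such a check in C), the pre-flight test could look
like this; the repo and remote names are made up for illustration:

```shell
# Illustrative pre-flight check matching the idea above: refuse a
# filtering repack unless some remote.<name>.promisor is set to true.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q repo && cd repo

has_promisor_remote() {
    # Succeeds only if some remote.<name>.promisor is set to true.
    git config --get-regexp '^remote\..*\.promisor$' true >/dev/null 2>&1
}

if has_promisor_remote; then before=yes; else before=no; fi

# Once a promisor remote is configured, the check passes.
git config remote.origin.promisor true
if has_promisor_remote; then after=yes; else after=no; fi
echo "promisor configured: before=$before after=$after"
```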


* [PATCH v3 0/2] Implement filtering repacks
  2022-10-25 12:28 ` [PATCH v2 0/2] " Christian Couder
                     ` (2 preceding siblings ...)
  2022-10-28 19:54   ` [PATCH v2 0/2] Implement filtering repacks Taylor Blau
@ 2022-11-22 17:51   ` Christian Couder
  2022-11-22 17:51     ` [PATCH v3 1/2] pack-objects: allow --filter without --stdout Christian Couder
                       ` (4 more replies)
  3 siblings, 5 replies; 29+ messages in thread
From: Christian Couder @ 2022-11-22 17:51 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Christian Couder

Earlier this year, John Cai sent 2 versions of a patch series to
implement `git repack --filter=<filter-spec>`:

https://lore.kernel.org/git/pull.1206.git.git.1643248180.gitgitgadget@gmail.com/

We tried to "sell" it as a way to use partial clone on a Git server to
offload large blobs to, for example, an http server, while using
multiple promisor remotes on the client side.

Even though it is still our end goal, it seems a bit far fetched for
now and unnecessary as `git repack --filter=<filter-spec>` could be
useful on the client side too.

For example, one might want to clone with a filter to avoid too much
space being taken up by some large blobs, and one might realize after
some time that a number of those large blobs have still been
downloaded because some old branches referencing them were checked
out. In this case a filtering repack could remove some of those large
blobs.
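
The scenario above can be sketched with stock git commands; the repo
layout and the big blob are made up for illustration, and only the
final, commented-out step needs the --filter option from this series:

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"

# A "server" repository holding a large blob, allowing filtered fetches.
git init -q server
git -C server config uploadpack.allowFilter true
git -C server config uploadpack.allowAnySHA1InWant true
dd if=/dev/zero of=server/big.bin bs=1024 count=64 2>/dev/null
git -C server add big.bin
git -C server -c user.name=u -c user.email=u@example.com commit -qm big

# A blobless clone: the large blob starts out missing locally.
git clone -q --bare --no-local --filter=blob:none "$tmp/server" client
missing_before=$(git -C client rev-list --objects --all --missing=print | grep -c "^?" || :)

# Reading old data faults the blob back in from the promisor remote...
git -C client cat-file blob HEAD:big.bin >/dev/null
missing_after=$(git -C client rev-list --objects --all --missing=print | grep -c "^?" || :)
echo "missing blobs: $missing_before -> $missing_after"

# ...and with this series applied, a filtering repack could drop it again:
# git -C client -c repack.writebitmaps=false repack -a -d --filter=blob:none
```

The blob count should go from one missing object after the clone to
zero once it has been lazily fetched back, which is exactly the growth
a filtering repack would undo.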

Some of the comments on the patch series that John sent were related
to the possible data loss and repo corruption that a filtering repack
could cause. It's indeed true that it could be very dangerous, so the
first version of this patch series asked the user to confirm the
command, either by answering 'Y' on the command line or by passing
`--force`.

In the discussion with Junio following that first version though, it
appeared that asking for such confirmation might not be necessary, so
the v2 removed those checks. Taylor, though, asked what would happen to
the remote.<name>.promisor and remote.<name>.partialclonefilter config
variables when a filtering repack is run. I replied that it seemed to
me that we should just check that a promisor remote has been
configured and fail if that's not the case.
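For context, that check keys off the standard promisor-remote
configuration. Something like the following (existing config variable
names; the 'demo' repo is just for illustration) is what would have to
be present for the filtering repack not to fail:

```shell
git init -q demo
# A promisor remote is any remote with remote.<name>.promisor set;
# that is what the proposed check looks for before filtering.
git -C demo config remote.origin.promisor true
# Usually set by `git clone --filter=...`, recording the clone filter:
git -C demo config remote.origin.partialclonefilter blob:none
```
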

In the discussions with Junio and Taylor following the first and
second versions, Junio suggested adding `--filter=<filter-spec>` to
`git gc` and I am still Ok with doing it, either later in a followup
patch or in a v4. I haven't done it yet, as it's not clear how to
implement it efficiently only in `git gc`.

`git gc` already runs `git repack` when it's passed some options
(either on the command line or via the config), so it would be
simpler for `git gc` to just pass on the --filter=<filter-spec> it
is given to `git repack`. If `git gc` instead implemented that by
calling `git pack-objects` directly, then `git pack-objects` could
possibly be called twice from `git gc`: once through `git repack` and
once triggered by the --filter option.

So the only changes in this v3 compared to v2 are the following:

  - rebased on top of a0789512c5 (The thirteenth batch, 2022-11-18) to
    avoid a simple conflict,

  - patch 2/2 uses has_promisor_remote() to check if a promisor remote
    is configured and die() otherwise.

Thanks to Junio and Taylor for discussing the v1 and v2, to John Cai,
who worked on the previous versions, to Jonathan Nieder, Jonathan Tan
and Taylor Blau, who discussed this with me at the Git Merge and
Contributor Summit, and to Stolee, Taylor, Robert Coup and Junio who
discussed the versions John sent.

Range diff with v2:

1:  d1c65ff1f5 = 1:  1e64cac782 pack-objects: allow --filter without --stdout
2:  ac21b4ec8f ! 2:  7216a7bc05 repack: add --filter=<filter-spec> option
    @@ Commit message
         repo unless ALL the removed objects aren't already available in
         another remote that clients can access.
     
    +    To mitigate that risk, we check that a promisor remote has at
    +    least been configured.
    +
         Signed-off-by: John Cai <johncai86@gmail.com>
         Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
     
    @@ builtin/repack.c: static void prepare_pack_objects(struct child_process *cmd,
        if (args->no_reuse_delta)
                strvec_pushf(&cmd->args, "--no-reuse-delta");
        if (args->no_reuse_object)
    -@@ builtin/repack.c: static unsigned populate_pack_exts(char *name)
    -   return ret;
    +@@ builtin/repack.c: static struct generated_pack_data *populate_pack_exts(const char *name)
    +   return data;
      }
      
     +static void write_promisor_file_1(char *p)
    @@ builtin/repack.c: static void repack_promisor_objects(const struct pack_objects_
     -          write_promisor_file(promisor_name, NULL, 0);
     -
     +          write_promisor_file_1(line.buf);
    -           item->util = (void *)(uintptr_t)populate_pack_exts(item->string);
    +           item->util = populate_pack_exts(item->string);
     -
     -          free(promisor_name);
        }
    @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix
                OPT_BOOL(0, "pack-kept-objects", &pack_kept_objects,
                                N_("repack objects in packs marked with .keep")),
                OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
    +@@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix)
    +                   die(_("options '%s' and '%s' cannot be used together"), "--cruft", "-k");
    +   }
    + 
    ++  if (po_args.filter && !has_promisor_remote())
    ++          die("a promisor remote must be setup\n"
    ++              "Also please push all the objects "
    ++              "that might be filtered to that remote!\n"
    ++              "Otherwise they will be lost!");
    ++
    +   if (write_bitmaps < 0) {
    +           if (!write_midx &&
    +               (!(pack_everything & ALL_INTO_ONE) || !is_bare_repository()))
     @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix)
                if (line.len != the_hash_algo->hexsz)
                        die(_("repack: Expecting full hex object ID lines only from pack-objects."));
    -           string_list_append(&names, line.buf);
    +           item = string_list_append(&names, line.buf);
     +          if (po_args.filter)
     +                  write_promisor_file_1(line.buf);
    +           item->util = populate_pack_exts(item->string);
        }
        fclose(out);
    -   ret = finish_command(&cmd);
     
      ## t/t7700-repack.sh ##
     @@ t/t7700-repack.sh: test_expect_success 'auto-bitmaps do not complain if unavailable' '


Christian Couder (2):
  pack-objects: allow --filter without --stdout
  repack: add --filter=<filter-spec> option

 Documentation/git-repack.txt |  8 ++++++++
 builtin/pack-objects.c       |  6 +-----
 builtin/repack.c             | 28 +++++++++++++++++++++-------
 t/t7700-repack.sh            | 15 +++++++++++++++
 4 files changed, 45 insertions(+), 12 deletions(-)

-- 
2.38.1.475.g7216a7bc05


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v3 1/2] pack-objects: allow --filter without --stdout
  2022-11-22 17:51   ` [PATCH v3 " Christian Couder
@ 2022-11-22 17:51     ` Christian Couder
  2022-11-22 17:51     ` [PATCH v3 2/2] repack: add --filter=<filter-spec> option Christian Couder
                       ` (3 subsequent siblings)
  4 siblings, 0 replies; 29+ messages in thread
From: Christian Couder @ 2022-11-22 17:51 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Christian Couder

From: Christian Couder <chriscool@tuxfamily.org>

9535ce7337 (pack-objects: add list-objects filtering, 2017-11-21)
taught pack-objects to use --filter, but required the use of
--stdout since a partial clone mechanism was not yet in place to
handle missing objects. Since then, changes like 9e27beaa23
(promisor-remote: implement promisor_remote_get_direct(), 2019-06-25)
and others added support to dynamically fetch objects that were missing.

Remove the --stdout requirement so that in a follow-up commit, repack
can pass --filter to pack-objects to omit certain objects from the
resulting packfile.

Signed-off-by: John Cai <johncai86@gmail.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 builtin/pack-objects.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 573d0b20b7..a4dccca4cd 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -4385,12 +4385,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (!rev_list_all || !rev_list_reflog || !rev_list_index)
 		unpack_unreachable_expiration = 0;
 
-	if (pfd.have_revs && pfd.revs.filter.choice) {
-		if (!pack_to_stdout)
-			die(_("cannot use --filter without --stdout"));
-		if (stdin_packs)
+	if (pfd.have_revs && pfd.revs.filter.choice && stdin_packs)
 			die(_("cannot use --filter with --stdin-packs"));
-	}
 
 	if (stdin_packs && use_internal_rev_list)
 		die(_("cannot use internal rev list with --stdin-packs"));
-- 
2.38.1.475.g7216a7bc05


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v3 2/2] repack: add --filter=<filter-spec> option
  2022-11-22 17:51   ` [PATCH v3 " Christian Couder
  2022-11-22 17:51     ` [PATCH v3 1/2] pack-objects: allow --filter without --stdout Christian Couder
@ 2022-11-22 17:51     ` Christian Couder
  2022-11-23  0:31     ` [PATCH v3 0/2] Implement filtering repacks Junio C Hamano
                       ` (2 subsequent siblings)
  4 siblings, 0 replies; 29+ messages in thread
From: Christian Couder @ 2022-11-22 17:51 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Christian Couder

From: Christian Couder <chriscool@tuxfamily.org>

After cloning with --filter=<filter-spec>, for example to avoid
getting unneeded large files on a user machine, it's possible
that some of these large files still get fetched over time for
various reasons (like checking out old branches).

In this case the repo size could grow too much for no good reason
and `git repack --filter=<filter-spec>` would be useful to remove
the unneeded large files.

This command could be dangerous to use though, as it might remove
local objects that haven't been pushed, which would lose data and
corrupt the repo. On a server, this command could also corrupt a
repo unless ALL the removed objects are already available in
another remote that clients can access.

To mitigate that risk, we check that a promisor remote has at
least been configured.

Signed-off-by: John Cai <johncai86@gmail.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/git-repack.txt |  8 ++++++++
 builtin/repack.c             | 28 +++++++++++++++++++++-------
 t/t7700-repack.sh            | 15 +++++++++++++++
 3 files changed, 44 insertions(+), 7 deletions(-)

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index 4017157949..2539ee0a02 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -143,6 +143,14 @@ depth is 4095.
 	a larger and slower repository; see the discussion in
 	`pack.packSizeLimit`.
 
+--filter=<filter-spec>::
+	Omits certain objects (usually blobs) from the resulting
+	packfile. WARNING: this could easily corrupt the current repo
+	and lose data if ANY of the omitted objects hasn't been already
+	pushed to a remote. Be very careful about objects that might
+	have been created locally! See linkgit:git-rev-list[1] for valid
+	`<filter-spec>` forms.
+
 -b::
 --write-bitmap-index::
 	Write a reachability bitmap index as part of the repack. This
diff --git a/builtin/repack.c b/builtin/repack.c
index 65eb1b8bd2..3586d97f3f 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -49,6 +49,7 @@ struct pack_objects_args {
 	const char *depth;
 	const char *threads;
 	const char *max_pack_size;
+	const char *filter;
 	int no_reuse_delta;
 	int no_reuse_object;
 	int quiet;
@@ -163,6 +164,8 @@ static void prepare_pack_objects(struct child_process *cmd,
 		strvec_pushf(&cmd->args, "--threads=%s", args->threads);
 	if (args->max_pack_size)
 		strvec_pushf(&cmd->args, "--max-pack-size=%s", args->max_pack_size);
+	if (args->filter)
+		strvec_pushf(&cmd->args, "--filter=%s", args->filter);
 	if (args->no_reuse_delta)
 		strvec_pushf(&cmd->args, "--no-reuse-delta");
 	if (args->no_reuse_object)
@@ -234,6 +237,13 @@ static struct generated_pack_data *populate_pack_exts(const char *name)
 	return data;
 }
 
+static void write_promisor_file_1(char *p)
+{
+	char *promisor_name = mkpathdup("%s-%s.promisor", packtmp, p);
+	write_promisor_file(promisor_name, NULL, 0);
+	free(promisor_name);
+}
+
 static void repack_promisor_objects(const struct pack_objects_args *args,
 				    struct string_list *names)
 {
@@ -265,7 +275,6 @@ static void repack_promisor_objects(const struct pack_objects_args *args,
 	out = xfdopen(cmd.out, "r");
 	while (strbuf_getline_lf(&line, out) != EOF) {
 		struct string_list_item *item;
-		char *promisor_name;
 
 		if (line.len != the_hash_algo->hexsz)
 			die(_("repack: Expecting full hex object ID lines only from pack-objects."));
@@ -282,13 +291,8 @@ static void repack_promisor_objects(const struct pack_objects_args *args,
 		 * concatenate the contents of all .promisor files instead of
 		 * just creating a new empty file.
 		 */
-		promisor_name = mkpathdup("%s-%s.promisor", packtmp,
-					  line.buf);
-		write_promisor_file(promisor_name, NULL, 0);
-
+		write_promisor_file_1(line.buf);
 		item->util = populate_pack_exts(item->string);
-
-		free(promisor_name);
 	}
 	fclose(out);
 	if (finish_command(&cmd))
@@ -800,6 +804,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 				N_("limits the maximum number of threads")),
 		OPT_STRING(0, "max-pack-size", &po_args.max_pack_size, N_("bytes"),
 				N_("maximum size of each packfile")),
+		OPT_STRING(0, "filter", &po_args.filter, N_("args"),
+				N_("object filtering")),
 		OPT_BOOL(0, "pack-kept-objects", &pack_kept_objects,
 				N_("repack objects in packs marked with .keep")),
 		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
@@ -834,6 +840,12 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 			die(_("options '%s' and '%s' cannot be used together"), "--cruft", "-k");
 	}
 
+	if (po_args.filter && !has_promisor_remote())
+		die("a promisor remote must be setup\n"
+		    "Also please push all the objects "
+		    "that might be filtered to that remote!\n"
+		    "Otherwise they will be lost!");
+
 	if (write_bitmaps < 0) {
 		if (!write_midx &&
 		    (!(pack_everything & ALL_INTO_ONE) || !is_bare_repository()))
@@ -971,6 +983,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		if (line.len != the_hash_algo->hexsz)
 			die(_("repack: Expecting full hex object ID lines only from pack-objects."));
 		item = string_list_append(&names, line.buf);
+		if (po_args.filter)
+			write_promisor_file_1(line.buf);
 		item->util = populate_pack_exts(item->string);
 	}
 	fclose(out);
diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index c630e0d52d..fef144bace 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -237,6 +237,21 @@ test_expect_success 'auto-bitmaps do not complain if unavailable' '
 	test_must_be_empty actual
 '
 
+test_expect_success 'repacking with a filter works' '
+	test_when_finished "rm -rf server client" &&
+	test_create_repo server &&
+	git -C server config uploadpack.allowFilter true &&
+	git -C server config uploadpack.allowAnySHA1InWant true &&
+	test_commit -C server 1 &&
+	git clone --bare --no-local server client &&
+	git -C client config remote.origin.promisor true &&
+	git -C client rev-list --objects --all --missing=print >objects &&
+	test $(grep "^?" objects | wc -l) = 0 &&
+	git -C client -c repack.writebitmaps=false repack -a -d --filter=blob:none &&
+	git -C client rev-list --objects --all --missing=print >objects &&
+	test $(grep "^?" objects | wc -l) = 1
+'
+
 objdir=.git/objects
 midx=$objdir/pack/multi-pack-index
 
-- 
2.38.1.475.g7216a7bc05


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 0/2] Implement filtering repacks
  2022-11-22 17:51   ` [PATCH v3 " Christian Couder
  2022-11-22 17:51     ` [PATCH v3 1/2] pack-objects: allow --filter without --stdout Christian Couder
  2022-11-22 17:51     ` [PATCH v3 2/2] repack: add --filter=<filter-spec> option Christian Couder
@ 2022-11-23  0:31     ` Junio C Hamano
  2022-12-21  3:53       ` Christian Couder
  2022-11-23  0:35     ` Junio C Hamano
  2022-12-21  4:04     ` [PATCH v4 0/3] " Christian Couder
  4 siblings, 1 reply; 29+ messages in thread
From: Junio C Hamano @ 2022-11-23  0:31 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, John Cai, Jonathan Tan, Jonathan Nieder, Taylor Blau,
	Derrick Stolee

Christian Couder <christian.couder@gmail.com> writes:

> In the discussions with Junio and Taylor following the first and
> second versions, Junio suggested adding `--filter=<filter-spec>` to
> `git gc` and I am still Ok with doing it, either later in a followup
> patch or in a v4. I haven't done it yet, as it's not clear how to
> implement it efficiently only in `git gc`.

Did I make such a suggestion?  I only said --filter=<filter-spec>
does not seem like a good option to surface and stress at the
end-user level for "git pack-objects".

The similarity of "git prune-packed" with this new operation was
brought up to illustrate the point that users of "git gc" do not
want to even know about having to remove redundantly available
objects via "prune-packed" by giving an extra option to "git gc"
(hence --prune is the default), and honoring the filter spec when
objects are known to be available via the configured promisor remote
is likely what users at the "git gc" level would want to see happen
automatically.  IOW, adding more options to "gc" would be the last
thing I would want to see.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 0/2] Implement filtering repacks
  2022-11-22 17:51   ` [PATCH v3 " Christian Couder
                       ` (2 preceding siblings ...)
  2022-11-23  0:31     ` [PATCH v3 0/2] Implement filtering repacks Junio C Hamano
@ 2022-11-23  0:35     ` Junio C Hamano
  2022-12-21  4:04     ` [PATCH v4 0/3] " Christian Couder
  4 siblings, 0 replies; 29+ messages in thread
From: Junio C Hamano @ 2022-11-23  0:35 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, John Cai, Jonathan Tan, Jonathan Nieder, Taylor Blau,
	Derrick Stolee

Christian Couder <christian.couder@gmail.com> writes:

> So the only changes in this v3 compared to v2 are the following:
>
>   - rebased on top of a0789512c5 (The thirteenth batch, 2022-11-18) to
>     avoid a simple conflict,
>
>   - patch 2/2 uses has_promisor_remote() to check if a promisor remote
>     is configured and die() otherwise.

Both sound like sensible things to do.  Thanks.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 0/2] Implement filtering repacks
  2022-11-23  0:31     ` [PATCH v3 0/2] Implement filtering repacks Junio C Hamano
@ 2022-12-21  3:53       ` Christian Couder
  0 siblings, 0 replies; 29+ messages in thread
From: Christian Couder @ 2022-12-21  3:53 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, John Cai, Jonathan Tan, Jonathan Nieder, Taylor Blau,
	Derrick Stolee

On Wed, Nov 23, 2022 at 1:31 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> > In the discussions with Junio and Taylor following the first and
> > second versions, Junio suggested adding `--filter=<filter-spec>` to
> > `git gc` and I am still Ok with doing it, either later in a followup
> > patch or in a v4. I haven't done it yet, as it's not clear how to
> > implement it efficiently only in `git gc`.
>
> Did I make such a suggestion?  I only said --filter=<filter-spec>
> does not seem like a good option to surface and stress at the
> end-user level for "git pack-objects".

In https://lore.kernel.org/git/xmqqilkm9wv6.fsf@gitster.g/ you wrote:

-> Unlike prune-packed, pruning objects that could be refetched has
-> negative performance impact.  So adding an option to enable/disable
-> such pruning is needed.  If the frontmost UI entry point were "gc",
-> it needs an opt-in option to invoke the "filtering repack", in other
-> words.  "pack-objects" should not need any more work than what you
-> have here (with the "terminal" and "force" discarded), as "--filter"
-> is such an opt-in option already.

I understood the above to mean that you would be Ok with adding
a "--filter" flag to `git gc`.

> The similarity of "git prune-packed" with this new operation was
> brought up to illustrate the point that users of "git gc" do not
> want to even know about having to remove redundantly available
> objects via "prune-packed" by giving an extra option to "git gc"
> (hence --prune is the default), and honoring the filter spec when
> objects are known to be available via the configured promisor remote
> is likely what users at the "git gc" level would want to see happen
> automatically.  IOW, adding more options to "gc" would be the last
> thing I would want to see.

I still think that users should be able to control whether `git gc`
performs such filtering when repacking, and I don't see how that
could be done without any new option at all. For example, only
checking the 'remote.<name>.partialclonefilter' config variable to
decide whether it should be done is just too dangerous for people
already using this variable.

In the v4, which I will send really soon now, I tentatively implemented
a 'gc.repackFilter' config option to tell `git gc` that it should
perform filtering repacks with the specified filter. I hope that this
will help move the discussion forward.

BTW in the latest "What's cooking in git.git" emails there is the
following about this patch series:

-> Seems to break CI.
-> cf. https://github.com/git/git/actions/runs/3560918726

I took a look at the failures in the different failing tests, and they
don't seem related to this patch series; they seem quite random:

=== Failed test: t3106-ls-tree-pattern ===
The full logs are in the 'print test failures' step below.
See also the 'failed-tests-*' artifacts attached to this run.
Error: failed: t3106.3 combine with "--object-only"
failure: t3106.3 combine with "--object-only"
Error: failed: t3106.5 combine with "--long"
failure: t3106.5 combine with "--long"

=== Failed test: t5601-clone ===
The full logs are in the 'print test failures' step below.
See also the 'failed-tests-*' artifacts attached to this run.
skip: t5601.60 clone c:temp is dos drive
skip: t5601.101 colliding file detection
Error: failed: t5601.110 auto-discover bundle URI from HTTP clone
failure: t5601.110 auto-discover bundle URI from HTTP clone
Error: failed: t5601.111 auto-discover multiple bundles from HTTP clone
failure: t5601.111 auto-discover multiple bundles from HTTP clone

=== Failed test: t5556-http-auth ===
The full logs are in the 'print test failures' step below.
See also the 'failed-tests-*' artifacts attached to this run.
Error: failed: t5556.2 http auth anonymous no challenge
failure: t5556.2 http auth anonymous no challenge
Error: failed: t5556.3 http auth www-auth headers to credential helper
bearer valid
failure: t5556.3 http auth www-auth headers to credential helper bearer valid

=== Failed test: t4203-mailmap ===
The full logs are in the 'print test failures' step below.
See also the 'failed-tests-*' artifacts attached to this run.
Error: failed: t4203.66 git cat-file -s returns correct size with --use-mailmap
failure: t4203.66 git cat-file -s returns correct size with --use-mailmap
Error: failed: t4203.67 git cat-file -s returns correct size with
--use-mailmap for tag objects
failure: t4203.67 git cat-file -s returns correct size with
--use-mailmap for tag objects

etc

I tried different build flags but can't reproduce locally.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v4 0/3] Implement filtering repacks
  2022-11-22 17:51   ` [PATCH v3 " Christian Couder
                       ` (3 preceding siblings ...)
  2022-11-23  0:35     ` Junio C Hamano
@ 2022-12-21  4:04     ` Christian Couder
  2022-12-21  4:04       ` [PATCH v4 1/3] pack-objects: allow --filter without --stdout Christian Couder
                         ` (2 more replies)
  4 siblings, 3 replies; 29+ messages in thread
From: Christian Couder @ 2022-12-21  4:04 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Christian Couder

Earlier this year, John Cai sent 2 versions of a patch series to
implement `git repack --filter=<filter-spec>`:

https://lore.kernel.org/git/pull.1206.git.git.1643248180.gitgitgadget@gmail.com/

We tried to "sell" it as a way to use partial clone on a Git server to
offload large blobs to, for example, an http server, while using
multiple promisor remotes on the client side.

Even though that is still our end goal, it seems a bit far-fetched for
now, and unnecessary, as `git repack --filter=<filter-spec>` could be
useful on the client side too.

For example, one might want to clone with a filter to avoid too much
space being taken up by some large blobs, and then realize after some
time that a number of those large blobs have still been downloaded
because some old branches referencing them were checked out. In this
case a filtering repack could remove some of those large blobs.

Some of the comments on the patch series that John sent were related
to the possible data loss and repo corruption that a filtering repack
could cause. It's indeed true that it could be very dangerous, so the
first version of this patch series asked the user to confirm the
command, either by answering 'Y' on the command line or by passing
`--force`.

In the discussion with Junio following that first version though, it
appeared that asking for such confirmation might not be necessary, so
the v2 removed those checks.

Taylor though asked what would happen to the 'remote.<name>.promisor'
and 'remote.<name>.partialclonefilter' config variables when a
filtering repack is run. As it seemed to me that we should just check
that a promisor remote has been configured and fail if that's not the
case, that was implemented in the third version of this patch series.

In the discussions following the first, second and third versions,
Junio commented that `git gc` was a better way for users to launch
filtering repacks than `git repack`, so in this v4 a new
'gc.repackFilter' config option is implemented that allows `git gc` to
perform filtering repacks. When this config option is set to a
non-empty string, `git gc` will just add a `--filter=<filter-spec>`
argument to the repack processes it launches, with '<filter-spec>' set
to the value of 'gc.repackFilter'.
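Assuming the option lands as described, the opt-in would look something
like this (a sketch; 'gc.repackFilter' is the new option from patch 3/3,
a promisor remote is still required by patch 2/3, and the 'repo' name is
just for illustration):

```shell
git init -q repo
# A promisor remote must be configured, or the filtering repack dies:
git -C repo config remote.origin.promisor true
# Opt in to filtering repacks at the `git gc` level:
git -C repo config gc.repackFilter blob:none
# With this set, `git gc` would append --filter=blob:none to the
# `git repack` command it spawns; an empty value keeps the old behavior.
```
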

So the changes in this v4 compared to v3 are the following:

  - rebased on top of 57e2c6ebbe (Start the 2.40 cycle, 2022-12-14) to
    avoid a simple conflict,

  - simplified the test in patch 2/3 by using `grep -c ...` instead of
    `grep ... | wc -l`,

  - added patch 3/3 which implements a new 'gc.repackFilter' config
    option so that `git gc` can perform filtering repacks.

Thanks to Junio and Taylor for discussing the v1, v2 and v3, to John
Cai, who worked on the previous versions, to Jonathan Nieder, Jonathan
Tan and Taylor, who discussed this with me at the Git Merge and
Contributor Summit, and to Stolee, Taylor, Robert Coup and Junio who
discussed the versions John sent.

Range diff with v3:

1:  1e64cac782 < -:  ---------- pack-objects: allow --filter without --stdout
-:  ---------- > 1:  c2dca82dee pack-objects: allow --filter without --stdout
2:  7216a7bc05 ! 2:  1dcdba4b1d repack: add --filter=<filter-spec> option
    @@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix
     +                  write_promisor_file_1(line.buf);
                item->util = populate_pack_exts(item->string);
        }
    -   fclose(out);
    +   strbuf_release(&line);
     
      ## t/t7700-repack.sh ##
     @@ t/t7700-repack.sh: test_expect_success 'auto-bitmaps do not complain if unavailable' '
    @@ t/t7700-repack.sh: test_expect_success 'auto-bitmaps do not complain if unavaila
     +  git clone --bare --no-local server client &&
     +  git -C client config remote.origin.promisor true &&
     +  git -C client rev-list --objects --all --missing=print >objects &&
    -+  test $(grep "^?" objects | wc -l) = 0 &&
    ++  test $(grep -c "^?" objects) = 0 &&
     +  git -C client -c repack.writebitmaps=false repack -a -d --filter=blob:none &&
     +  git -C client rev-list --objects --all --missing=print >objects &&
    -+  test $(grep "^?" objects | wc -l) = 1
    ++  test $(grep -c "^?" objects) = 1
     +'
     +
      objdir=.git/objects
-:  ---------- > 3:  6bb98b4b00 gc: add gc.repackFilter config option


Christian Couder (3):
  pack-objects: allow --filter without --stdout
  repack: add --filter=<filter-spec> option
  gc: add gc.repackFilter config option

 Documentation/config/gc.txt  |  9 +++++++++
 Documentation/git-repack.txt |  8 ++++++++
 builtin/gc.c                 |  6 ++++++
 builtin/pack-objects.c       |  8 ++------
 builtin/repack.c             | 28 +++++++++++++++++++++-------
 t/t6500-gc.sh                | 19 +++++++++++++++++++
 t/t7700-repack.sh            | 15 +++++++++++++++
 7 files changed, 80 insertions(+), 13 deletions(-)

-- 
2.39.0.59.g395bcb85bc.dirty


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v4 1/3] pack-objects: allow --filter without --stdout
  2022-12-21  4:04     ` [PATCH v4 0/3] " Christian Couder
@ 2022-12-21  4:04       ` Christian Couder
  2023-01-04 14:56         ` Patrick Steinhardt
  2022-12-21  4:04       ` [PATCH v4 2/3] repack: add --filter=<filter-spec> option Christian Couder
  2022-12-21  4:04       ` [PATCH v4 3/3] gc: add gc.repackFilter config option Christian Couder
  2 siblings, 1 reply; 29+ messages in thread
From: Christian Couder @ 2022-12-21  4:04 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Christian Couder

From: Christian Couder <chriscool@tuxfamily.org>

9535ce7337 (pack-objects: add list-objects filtering, 2017-11-21)
taught pack-objects to use --filter, but required the use of
--stdout since a partial clone mechanism was not yet in place to
handle missing objects. Since then, changes like 9e27beaa23
(promisor-remote: implement promisor_remote_get_direct(), 2019-06-25)
and others added support to dynamically fetch objects that were missing.

Remove the --stdout requirement so that in a follow-up commit, repack
can pass --filter to pack-objects to omit certain objects from the
resulting packfile.

Signed-off-by: John Cai <johncai86@gmail.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 builtin/pack-objects.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 2193f80b89..aa0b13d015 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -4371,12 +4371,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (!rev_list_all || !rev_list_reflog || !rev_list_index)
 		unpack_unreachable_expiration = 0;
 
-	if (filter_options.choice) {
-		if (!pack_to_stdout)
-			die(_("cannot use --filter without --stdout"));
-		if (stdin_packs)
-			die(_("cannot use --filter with --stdin-packs"));
-	}
+	if (stdin_packs && filter_options.choice)
+		die(_("cannot use --filter with --stdin-packs"));
 
 	if (stdin_packs && use_internal_rev_list)
 		die(_("cannot use internal rev list with --stdin-packs"));
-- 
2.39.0.59.g395bcb85bc.dirty


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v4 2/3] repack: add --filter=<filter-spec> option
  2022-12-21  4:04     ` [PATCH v4 0/3] " Christian Couder
  2022-12-21  4:04       ` [PATCH v4 1/3] pack-objects: allow --filter without --stdout Christian Couder
@ 2022-12-21  4:04       ` Christian Couder
  2023-01-04 14:56         ` Patrick Steinhardt
  2022-12-21  4:04       ` [PATCH v4 3/3] gc: add gc.repackFilter config option Christian Couder
  2 siblings, 1 reply; 29+ messages in thread
From: Christian Couder @ 2022-12-21  4:04 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Christian Couder

From: Christian Couder <chriscool@tuxfamily.org>

After cloning with --filter=<filter-spec>, for example to avoid
getting unneeded large files on a user machine, it's possible
that some of these large files still get fetched over time for
various reasons (like checking out old branches).

In this case the repo size could grow too much for no good reason
and `git repack --filter=<filter-spec>` would be useful to remove
the unneeded large files.

This command could be dangerous to use though, as it might remove
local objects that haven't been pushed, which would lose data and
corrupt the repo. On a server, this command could also corrupt a
repo unless ALL the removed objects are already available in
another remote that clients can access.

To mitigate that risk, we check that a promisor remote has at
least been configured.

Signed-off-by: John Cai <johncai86@gmail.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/git-repack.txt |  8 ++++++++
 builtin/repack.c             | 28 +++++++++++++++++++++-------
 t/t7700-repack.sh            | 15 +++++++++++++++
 3 files changed, 44 insertions(+), 7 deletions(-)

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index 4017157949..2539ee0a02 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -143,6 +143,14 @@ depth is 4095.
 	a larger and slower repository; see the discussion in
 	`pack.packSizeLimit`.
 
+--filter=<filter-spec>::
+	Omits certain objects (usually blobs) from the resulting
+	packfile. WARNING: this could easily corrupt the current repo
+	and lose data if ANY of the omitted objects hasn't already
+	been pushed to a remote. Be very careful about objects that
+	might have been created locally! See linkgit:git-rev-list[1]
+	for valid `<filter-spec>` forms.
+
 -b::
 --write-bitmap-index::
 	Write a reachability bitmap index as part of the repack. This
diff --git a/builtin/repack.c b/builtin/repack.c
index c1402ad038..8e5ac9c171 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -49,6 +49,7 @@ struct pack_objects_args {
 	const char *depth;
 	const char *threads;
 	const char *max_pack_size;
+	const char *filter;
 	int no_reuse_delta;
 	int no_reuse_object;
 	int quiet;
@@ -163,6 +164,8 @@ static void prepare_pack_objects(struct child_process *cmd,
 		strvec_pushf(&cmd->args, "--threads=%s", args->threads);
 	if (args->max_pack_size)
 		strvec_pushf(&cmd->args, "--max-pack-size=%s", args->max_pack_size);
+	if (args->filter)
+		strvec_pushf(&cmd->args, "--filter=%s", args->filter);
 	if (args->no_reuse_delta)
 		strvec_pushf(&cmd->args, "--no-reuse-delta");
 	if (args->no_reuse_object)
@@ -234,6 +237,13 @@ static struct generated_pack_data *populate_pack_exts(const char *name)
 	return data;
 }
 
+static void write_promisor_file_1(char *p)
+{
+	char *promisor_name = mkpathdup("%s-%s.promisor", packtmp, p);
+	write_promisor_file(promisor_name, NULL, 0);
+	free(promisor_name);
+}
+
 static void repack_promisor_objects(const struct pack_objects_args *args,
 				    struct string_list *names)
 {
@@ -265,7 +275,6 @@ static void repack_promisor_objects(const struct pack_objects_args *args,
 	out = xfdopen(cmd.out, "r");
 	while (strbuf_getline_lf(&line, out) != EOF) {
 		struct string_list_item *item;
-		char *promisor_name;
 
 		if (line.len != the_hash_algo->hexsz)
 			die(_("repack: Expecting full hex object ID lines only from pack-objects."));
@@ -282,13 +291,8 @@ static void repack_promisor_objects(const struct pack_objects_args *args,
 		 * concatenate the contents of all .promisor files instead of
 		 * just creating a new empty file.
 		 */
-		promisor_name = mkpathdup("%s-%s.promisor", packtmp,
-					  line.buf);
-		write_promisor_file(promisor_name, NULL, 0);
-
+		write_promisor_file_1(line.buf);
 		item->util = populate_pack_exts(item->string);
-
-		free(promisor_name);
 	}
 	fclose(out);
 	if (finish_command(&cmd))
@@ -800,6 +804,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 				N_("limits the maximum number of threads")),
 		OPT_STRING(0, "max-pack-size", &po_args.max_pack_size, N_("bytes"),
 				N_("maximum size of each packfile")),
+		OPT_STRING(0, "filter", &po_args.filter, N_("args"),
+				N_("object filtering")),
 		OPT_BOOL(0, "pack-kept-objects", &pack_kept_objects,
 				N_("repack objects in packs marked with .keep")),
 		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
@@ -834,6 +840,12 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 			die(_("options '%s' and '%s' cannot be used together"), "--cruft", "-k");
 	}
 
+	if (po_args.filter && !has_promisor_remote())
+		die(_("a promisor remote must be set up;\n"
+		    "please also push all the objects "
+		    "that might be filtered to that remote,\n"
+		    "otherwise they will be lost!"));
+
 	if (write_bitmaps < 0) {
 		if (!write_midx &&
 		    (!(pack_everything & ALL_INTO_ONE) || !is_bare_repository()))
@@ -971,6 +983,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		if (line.len != the_hash_algo->hexsz)
 			die(_("repack: Expecting full hex object ID lines only from pack-objects."));
 		item = string_list_append(&names, line.buf);
+		if (po_args.filter)
+			write_promisor_file_1(line.buf);
 		item->util = populate_pack_exts(item->string);
 	}
 	strbuf_release(&line);
diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index 4aabe98139..3a6ad9f623 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -253,6 +253,21 @@ test_expect_success 'auto-bitmaps do not complain if unavailable' '
 	test_must_be_empty actual
 '
 
+test_expect_success 'repacking with a filter works' '
+	test_when_finished "rm -rf server client" &&
+	test_create_repo server &&
+	git -C server config uploadpack.allowFilter true &&
+	git -C server config uploadpack.allowAnySHA1InWant true &&
+	test_commit -C server 1 &&
+	git clone --bare --no-local server client &&
+	git -C client config remote.origin.promisor true &&
+	git -C client rev-list --objects --all --missing=print >objects &&
+	test $(grep -c "^?" objects) = 0 &&
+	git -C client -c repack.writebitmaps=false repack -a -d --filter=blob:none &&
+	git -C client rev-list --objects --all --missing=print >objects &&
+	test $(grep -c "^?" objects) = 1
+'
+
 objdir=.git/objects
 midx=$objdir/pack/multi-pack-index
 
-- 
2.39.0.59.g395bcb85bc.dirty


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v4 3/3] gc: add gc.repackFilter config option
  2022-12-21  4:04     ` [PATCH v4 0/3] " Christian Couder
  2022-12-21  4:04       ` [PATCH v4 1/3] pack-objects: allow --filter without --stdout Christian Couder
  2022-12-21  4:04       ` [PATCH v4 2/3] repack: add --filter=<filter-spec> option Christian Couder
@ 2022-12-21  4:04       ` Christian Couder
  2023-01-04 14:57         ` Patrick Steinhardt
  2 siblings, 1 reply; 29+ messages in thread
From: Christian Couder @ 2022-12-21  4:04 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Christian Couder, Christian Couder

A previous commit has implemented `git repack --filter=<filter-spec>` to
allow users to remove objects that are available on a promisor remote
but that they don't want locally.

Users might want to perform such a cleanup regularly at the same time as
they perform other repacks and cleanups, so as part of `git gc`.

Let's allow them to configure a <filter-spec> for that purpose using a
new gc.repackFilter config option.

Now, when `git gc` performs a repack and this option is configured
to a non-empty <filter-spec>, the repack process is passed a
corresponding `--filter=<filter-spec>` argument.
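For instance, a user could opt in with a snippet like the following
in their config (an illustrative sketch only; `blob:none` is just one
possible filter-spec):

```ini
# Sketch: make gc's repack omit all blobs, relying on the promisor
# remote to serve them back on demand.
[gc]
	repackFilter = blob:none
```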

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/config/gc.txt |  9 +++++++++
 builtin/gc.c                |  6 ++++++
 t/t6500-gc.sh               | 19 +++++++++++++++++++
 3 files changed, 34 insertions(+)

diff --git a/Documentation/config/gc.txt b/Documentation/config/gc.txt
index 38fea076a2..9359136f14 100644
--- a/Documentation/config/gc.txt
+++ b/Documentation/config/gc.txt
@@ -130,6 +130,15 @@ or rebase occurring.  Since these changes are not part of the current
 project most users will want to expire them sooner, which is why the
 default is more aggressive than `gc.reflogExpire`.
 
+gc.repackFilter::
+	When repacking, use the specified filter to omit certain
+	objects from the resulting packfile. WARNING: this could
+	easily corrupt the current repo and lose data if ANY of the
+	omitted objects hasn't already been pushed to a remote. Be
+	very careful about objects that might have been created
+	locally! See the `--filter=<filter-spec>` option of
+	linkgit:git-repack[1].
+
 gc.rerereResolved::
 	Records of conflicted merge you resolved earlier are
 	kept for this many days when 'git rerere gc' is run.
diff --git a/builtin/gc.c b/builtin/gc.c
index 02455fdcd7..bf28619723 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -52,6 +52,7 @@ static timestamp_t gc_log_expire_time;
 static const char *gc_log_expire = "1.day.ago";
 static const char *prune_expire = "2.weeks.ago";
 static const char *prune_worktrees_expire = "3.months.ago";
+static char *repack_filter;
 static unsigned long big_pack_threshold;
 static unsigned long max_delta_cache_size = DEFAULT_DELTA_CACHE_SIZE;
 
@@ -161,6 +162,8 @@ static void gc_config(void)
 	git_config_get_ulong("gc.bigpackthreshold", &big_pack_threshold);
 	git_config_get_ulong("pack.deltacachesize", &max_delta_cache_size);
 
+	git_config_get_string("gc.repackfilter", &repack_filter);
+
 	git_config(git_default_config, NULL);
 }
 
@@ -346,6 +349,9 @@ static void add_repack_all_option(struct string_list *keep_pack)
 
 	if (keep_pack)
 		for_each_string_list(keep_pack, keep_one_pack, NULL);
+
+	if (repack_filter && *repack_filter)
+		strvec_pushf(&repack, "--filter=%s", repack_filter);
 }
 
 static void add_repack_incremental_option(void)
diff --git a/t/t6500-gc.sh b/t/t6500-gc.sh
index d9acb63951..b1492b521a 100755
--- a/t/t6500-gc.sh
+++ b/t/t6500-gc.sh
@@ -56,6 +56,7 @@ test_expect_success 'gc -h with invalid configuration' '
 '
 
 test_expect_success 'gc is not aborted due to a stale symref' '
+	test_when_finished "rm -rf remote client" &&
 	git init remote &&
 	(
 		cd remote &&
@@ -202,6 +203,24 @@ test_expect_success 'one of gc.reflogExpire{Unreachable,}=never does not skip "e
 	grep -E "^trace: (built-in|exec|run_command): git reflog expire --" trace.out
 '
 
+test_expect_success 'gc.repackFilter launches repack with a filter' '
+	test_when_finished "rm -rf server client" &&
+	test_create_repo server &&
+	git -C server config uploadpack.allowFilter true &&
+	git -C server config uploadpack.allowAnySHA1InWant true &&
+	test_commit -C server 1 &&
+	git clone --bare --no-local server client &&
+	git -C client config remote.origin.promisor true &&
+	git -C client rev-list --objects --all --missing=print >objects &&
+	test $(grep -c "^?" objects) = 0 &&
+
+	GIT_TRACE=$(pwd)/trace.out git -C client -c gc.repackFilter=blob:none -c repack.writeBitmaps=false -c gc.pruneExpire=now gc &&
+
+	grep -E "^trace: (built-in|exec|run_command): git repack .* --filter=blob:none ?.*" trace.out &&
+	git -C client rev-list --objects --all --missing=print >objects &&
+	test $(grep -c "^?" objects) = 1
+'
+
 prepare_cruft_history () {
 	test_commit base &&
 
-- 
2.39.0.59.g395bcb85bc.dirty


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 1/3] pack-objects: allow --filter without --stdout
  2022-12-21  4:04       ` [PATCH v4 1/3] pack-objects: allow --filter without --stdout Christian Couder
@ 2023-01-04 14:56         ` Patrick Steinhardt
  0 siblings, 0 replies; 29+ messages in thread
From: Patrick Steinhardt @ 2023-01-04 14:56 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Christian Couder

[-- Attachment #1: Type: text/plain, Size: 1803 bytes --]

On Wed, Dec 21, 2022 at 05:04:44AM +0100, Christian Couder wrote:
> From: Christian Couder <chriscool@tuxfamily.org>
> 
> 9535ce7337 (pack-objects: add list-objects filtering, 2017-11-21)
> taught pack-objects to use --filter, but required the use of
> --stdout since a partial clone mechanism was not yet in place to
> handle missing objects. Since then, changes like 9e27beaa23
> (promisor-remote: implement promisor_remote_get_direct(), 2019-06-25)
> and others added support to dynamically fetch objects that were missing.
> 
> Remove the --stdout requirement so that in a follow-up commit, repack
> can pass --filter to pack-objects to omit certain objects from the
> resulting packfile.
> 
> Signed-off-by: John Cai <johncai86@gmail.com>
> Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
> ---
>  builtin/pack-objects.c | 8 ++------
>  1 file changed, 2 insertions(+), 6 deletions(-)
> 
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 2193f80b89..aa0b13d015 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -4371,12 +4371,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
>  	if (!rev_list_all || !rev_list_reflog || !rev_list_index)
>  		unpack_unreachable_expiration = 0;
>  
> -	if (filter_options.choice) {
> -		if (!pack_to_stdout)
> -			die(_("cannot use --filter without --stdout"));
> -		if (stdin_packs)
> -			die(_("cannot use --filter with --stdin-packs"));
> -	}
> +	if (stdin_packs && filter_options.choice)
> +		die(_("cannot use --filter with --stdin-packs"));
>  
>  	if (stdin_packs && use_internal_rev_list)
>  		die(_("cannot use internal rev list with --stdin-packs"));

Can we add a test to exercise this now-allowed combination of options?

Patrick


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 2/3] repack: add --filter=<filter-spec> option
  2022-12-21  4:04       ` [PATCH v4 2/3] repack: add --filter=<filter-spec> option Christian Couder
@ 2023-01-04 14:56         ` Patrick Steinhardt
  2023-01-05  1:39           ` Junio C Hamano
  0 siblings, 1 reply; 29+ messages in thread
From: Patrick Steinhardt @ 2023-01-04 14:56 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Christian Couder

[-- Attachment #1: Type: text/plain, Size: 7591 bytes --]

On Wed, Dec 21, 2022 at 05:04:45AM +0100, Christian Couder wrote:
> From: Christian Couder <chriscool@tuxfamily.org>
> 
> After cloning with --filter=<filter-spec>, for example to avoid
> getting unneeded large files on a user machine, it's possible
> that some of these large files still get fetched over time for
> various reasons (like checking out old branches).
> 
> In this case the repo size could grow too much for no good reason
> and `git repack --filter=<filter-spec>` would be useful to remove
> the unneeded large files.
> 
> This command could be dangerous to use though, as it might remove
> local objects that haven't been pushed, which would lose data and
> corrupt the repo. On a server, this command could also corrupt a
> repo unless ALL the removed objects are already available in
> another remote that clients can access.
> 
> To mitigate that risk, we check that at least one promisor remote
> has been configured.

While this is a nice safeguard, I wonder whether it is sufficient.
Suppose, for example, you have a non-bare repository that already has
blobs checked out that would be removed by the filtering repack --
does Git handle this situation gracefully?

A quick check seems to indicate that it does. But not quite as well as
I'd have hoped: when I switch to a detached HEAD with an arbitrary
commit and then execute `git repack --filter=blob:none` then it also
removes blobs that are referenced by the currently checked-out commit.
This may or may not be what the user is asking for, but I'd rather lean
towards this behaviour being surprising.
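A minimal way to observe which objects a repository lacks locally
(illustrative sketch using stock partial-clone machinery, not code from
this series; the `server`/`client.git` names are made up):

```shell
#!/bin/sh
# Sketch: set up a partial clone and print which objects are missing
# locally ("?<oid>" lines). The same check can reveal whether a
# filtering repack removed blobs the checked-out commit still needs.
set -e
demo=$(mktemp -d) && cd "$demo"

git init -q server
git -C server config uploadpack.allowFilter true
git -C server config uploadpack.allowAnySHA1InWant true
git -C server config user.name demo
git -C server config user.email demo@example.com
echo content >server/file.txt
git -C server add file.txt && git -C server commit -qm one

# A bare blob:none clone never checks anything out, so its blobs are
# missing from the start -- much like blobs dropped by a filtering repack.
git clone -q --bare --no-local --filter=blob:none server client.git
git -C client.git rev-list --objects --all --missing=print >objects
grep -c "^?" objects
```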

Patrick

> Signed-off-by: John Cai <johncai86@gmail.com>
> Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
> ---
>  Documentation/git-repack.txt |  8 ++++++++
>  builtin/repack.c             | 28 +++++++++++++++++++++-------
>  t/t7700-repack.sh            | 15 +++++++++++++++
>  3 files changed, 44 insertions(+), 7 deletions(-)
> 
> diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
> index 4017157949..2539ee0a02 100644
> --- a/Documentation/git-repack.txt
> +++ b/Documentation/git-repack.txt
> @@ -143,6 +143,14 @@ depth is 4095.
>  	a larger and slower repository; see the discussion in
>  	`pack.packSizeLimit`.
>  
> +--filter=<filter-spec>::
> +	Omits certain objects (usually blobs) from the resulting
> +	packfile. WARNING: this could easily corrupt the current repo
> +	and lose data if ANY of the omitted objects hasn't already
> +	been pushed to a remote. Be very careful about objects that
> +	might have been created locally! See linkgit:git-rev-list[1]
> +	for valid `<filter-spec>` forms.
> +
>  -b::
>  --write-bitmap-index::
>  	Write a reachability bitmap index as part of the repack. This
> diff --git a/builtin/repack.c b/builtin/repack.c
> index c1402ad038..8e5ac9c171 100644
> --- a/builtin/repack.c
> +++ b/builtin/repack.c
> @@ -49,6 +49,7 @@ struct pack_objects_args {
>  	const char *depth;
>  	const char *threads;
>  	const char *max_pack_size;
> +	const char *filter;
>  	int no_reuse_delta;
>  	int no_reuse_object;
>  	int quiet;
> @@ -163,6 +164,8 @@ static void prepare_pack_objects(struct child_process *cmd,
>  		strvec_pushf(&cmd->args, "--threads=%s", args->threads);
>  	if (args->max_pack_size)
>  		strvec_pushf(&cmd->args, "--max-pack-size=%s", args->max_pack_size);
> +	if (args->filter)
> +		strvec_pushf(&cmd->args, "--filter=%s", args->filter);
>  	if (args->no_reuse_delta)
>  		strvec_pushf(&cmd->args, "--no-reuse-delta");
>  	if (args->no_reuse_object)
> @@ -234,6 +237,13 @@ static struct generated_pack_data *populate_pack_exts(const char *name)
>  	return data;
>  }
>  
> +static void write_promisor_file_1(char *p)
> +{
> +	char *promisor_name = mkpathdup("%s-%s.promisor", packtmp, p);
> +	write_promisor_file(promisor_name, NULL, 0);
> +	free(promisor_name);
> +}
> +
>  static void repack_promisor_objects(const struct pack_objects_args *args,
>  				    struct string_list *names)
>  {
> @@ -265,7 +275,6 @@ static void repack_promisor_objects(const struct pack_objects_args *args,
>  	out = xfdopen(cmd.out, "r");
>  	while (strbuf_getline_lf(&line, out) != EOF) {
>  		struct string_list_item *item;
> -		char *promisor_name;
>  
>  		if (line.len != the_hash_algo->hexsz)
>  			die(_("repack: Expecting full hex object ID lines only from pack-objects."));
> @@ -282,13 +291,8 @@ static void repack_promisor_objects(const struct pack_objects_args *args,
>  		 * concatenate the contents of all .promisor files instead of
>  		 * just creating a new empty file.
>  		 */
> -		promisor_name = mkpathdup("%s-%s.promisor", packtmp,
> -					  line.buf);
> -		write_promisor_file(promisor_name, NULL, 0);
> -
> +		write_promisor_file_1(line.buf);
>  		item->util = populate_pack_exts(item->string);
> -
> -		free(promisor_name);
>  	}
>  	fclose(out);
>  	if (finish_command(&cmd))
> @@ -800,6 +804,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  				N_("limits the maximum number of threads")),
>  		OPT_STRING(0, "max-pack-size", &po_args.max_pack_size, N_("bytes"),
>  				N_("maximum size of each packfile")),
> +		OPT_STRING(0, "filter", &po_args.filter, N_("args"),
> +				N_("object filtering")),
>  		OPT_BOOL(0, "pack-kept-objects", &pack_kept_objects,
>  				N_("repack objects in packs marked with .keep")),
>  		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
> @@ -834,6 +840,12 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  			die(_("options '%s' and '%s' cannot be used together"), "--cruft", "-k");
>  	}
>  
> +	if (po_args.filter && !has_promisor_remote())
> +		die(_("a promisor remote must be set up;\n"
> +		    "please also push all the objects "
> +		    "that might be filtered to that remote,\n"
> +		    "otherwise they will be lost!"));
> +

>  	if (write_bitmaps < 0) {
>  		if (!write_midx &&
>  		    (!(pack_everything & ALL_INTO_ONE) || !is_bare_repository()))
> @@ -971,6 +983,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  		if (line.len != the_hash_algo->hexsz)
>  			die(_("repack: Expecting full hex object ID lines only from pack-objects."));
>  		item = string_list_append(&names, line.buf);
> +		if (po_args.filter)
> +			write_promisor_file_1(line.buf);
>  		item->util = populate_pack_exts(item->string);
>  	}
>  	strbuf_release(&line);
> diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
> index 4aabe98139..3a6ad9f623 100755
> --- a/t/t7700-repack.sh
> +++ b/t/t7700-repack.sh
> @@ -253,6 +253,21 @@ test_expect_success 'auto-bitmaps do not complain if unavailable' '
>  	test_must_be_empty actual
>  '
>  
> +test_expect_success 'repacking with a filter works' '
> +	test_when_finished "rm -rf server client" &&
> +	test_create_repo server &&
> +	git -C server config uploadpack.allowFilter true &&
> +	git -C server config uploadpack.allowAnySHA1InWant true &&
> +	test_commit -C server 1 &&
> +	git clone --bare --no-local server client &&
> +	git -C client config remote.origin.promisor true &&
> +	git -C client rev-list --objects --all --missing=print >objects &&
> +	test $(grep -c "^?" objects) = 0 &&
> +	git -C client -c repack.writebitmaps=false repack -a -d --filter=blob:none &&
> +	git -C client rev-list --objects --all --missing=print >objects &&
> +	test $(grep -c "^?" objects) = 1
> +'
> +
>  objdir=.git/objects
>  midx=$objdir/pack/multi-pack-index
>  
> -- 
> 2.39.0.59.g395bcb85bc.dirty
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 3/3] gc: add gc.repackFilter config option
  2022-12-21  4:04       ` [PATCH v4 3/3] gc: add gc.repackFilter config option Christian Couder
@ 2023-01-04 14:57         ` Patrick Steinhardt
  0 siblings, 0 replies; 29+ messages in thread
From: Patrick Steinhardt @ 2023-01-04 14:57 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Junio C Hamano, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Christian Couder

[-- Attachment #1: Type: text/plain, Size: 6314 bytes --]

On Wed, Dec 21, 2022 at 05:04:46AM +0100, Christian Couder wrote:
> A previous commit has implemented `git repack --filter=<filter-spec>` to
> allow users to remove objects that are available on a promisor remote
> but that they don't want locally.
> 
> Users might want to perform such a cleanup regularly at the same time as
> they perform other repacks and cleanups, so as part of `git gc`.
> 
> Let's allow them to configure a <filter-spec> for that purpose using a
> new gc.repackFilter config option.
> 
> Now, when `git gc` performs a repack and this option is configured
> to a non-empty <filter-spec>, the repack process is passed a
> corresponding `--filter=<filter-spec>` argument.
> 
> Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
> ---
>  Documentation/config/gc.txt |  9 +++++++++
>  builtin/gc.c                |  6 ++++++
>  t/t6500-gc.sh               | 19 +++++++++++++++++++
>  3 files changed, 34 insertions(+)
> 
> diff --git a/Documentation/config/gc.txt b/Documentation/config/gc.txt
> index 38fea076a2..9359136f14 100644
> --- a/Documentation/config/gc.txt
> +++ b/Documentation/config/gc.txt
> @@ -130,6 +130,15 @@ or rebase occurring.  Since these changes are not part of the current
>  project most users will want to expire them sooner, which is why the
>  default is more aggressive than `gc.reflogExpire`.
>  
> +gc.repackFilter::
> +	When repacking, use the specified filter to omit certain
> +	objects from the resulting packfile. WARNING: this could
> +	easily corrupt the current repo and lose data if ANY of the
> +	omitted objects hasn't already been pushed to a remote. Be
> +	very careful about objects that might have been created
> +	locally! See the `--filter=<filter-spec>` option of
> +	linkgit:git-repack[1].

This limitation means that no user can ever configure this unless they
also set `gc.auto=0`, as otherwise Git might periodically delete their
local objects without any way to restore them.

I see though that this might be a useful thing to have for partial clone
repositories: right now they only ever grow in size, so it would be nice
if Git could gracefully cull that size every now and then. But for this
to be automatic I'd expect a few safeguards:

    - Given a remote with a partial clone filter we only ever delete
      objects that seem to exist on the remote. This might for example
      be determined via a graph-walk of the remote's references. This
      protects against deleting objects that only exist locally.

    - Objects that are referenced by any of the currently checked-out
      HEADs should not get pruned. This protects against deleting objects
      that you'd have to re-download anyway.

    - It would be great to have a grace period after which objects get
      pruned. Given that we have no access times for packed objects,
      though, this doesn't seem that easy to implement. Still,
      this would protect the user from objects getting deleted when they
      frequently switch between different commits.

From my point of view we should not expose a high-level interface around
git-gc(1) at this point in time because of these missing safeguards.
Otherwise it is only a footgun waiting to go off.

Until then I don't see much of a problem with exposing the low-level
interface via git-repack(1) though.

Patrick

>  gc.rerereResolved::
>  	Records of conflicted merge you resolved earlier are
>  	kept for this many days when 'git rerere gc' is run.
> diff --git a/builtin/gc.c b/builtin/gc.c
> index 02455fdcd7..bf28619723 100644
> --- a/builtin/gc.c
> +++ b/builtin/gc.c
> @@ -52,6 +52,7 @@ static timestamp_t gc_log_expire_time;
>  static const char *gc_log_expire = "1.day.ago";
>  static const char *prune_expire = "2.weeks.ago";
>  static const char *prune_worktrees_expire = "3.months.ago";
> +static char *repack_filter;
>  static unsigned long big_pack_threshold;
>  static unsigned long max_delta_cache_size = DEFAULT_DELTA_CACHE_SIZE;
>  
> @@ -161,6 +162,8 @@ static void gc_config(void)
>  	git_config_get_ulong("gc.bigpackthreshold", &big_pack_threshold);
>  	git_config_get_ulong("pack.deltacachesize", &max_delta_cache_size);
>  
> +	git_config_get_string("gc.repackfilter", &repack_filter);
> +
>  	git_config(git_default_config, NULL);
>  }
>  
> @@ -346,6 +349,9 @@ static void add_repack_all_option(struct string_list *keep_pack)
>  
>  	if (keep_pack)
>  		for_each_string_list(keep_pack, keep_one_pack, NULL);
> +
> +	if (repack_filter && *repack_filter)
> +		strvec_pushf(&repack, "--filter=%s", repack_filter);
>  }
>  
>  static void add_repack_incremental_option(void)
> diff --git a/t/t6500-gc.sh b/t/t6500-gc.sh
> index d9acb63951..b1492b521a 100755
> --- a/t/t6500-gc.sh
> +++ b/t/t6500-gc.sh
> @@ -56,6 +56,7 @@ test_expect_success 'gc -h with invalid configuration' '
>  '
>  
>  test_expect_success 'gc is not aborted due to a stale symref' '
> +	test_when_finished "rm -rf remote client" &&
>  	git init remote &&
>  	(
>  		cd remote &&
> @@ -202,6 +203,24 @@ test_expect_success 'one of gc.reflogExpire{Unreachable,}=never does not skip "e
>  	grep -E "^trace: (built-in|exec|run_command): git reflog expire --" trace.out
>  '
>  
> +test_expect_success 'gc.repackFilter launches repack with a filter' '
> +	test_when_finished "rm -rf server client" &&
> +	test_create_repo server &&
> +	git -C server config uploadpack.allowFilter true &&
> +	git -C server config uploadpack.allowAnySHA1InWant true &&
> +	test_commit -C server 1 &&
> +	git clone --bare --no-local server client &&
> +	git -C client config remote.origin.promisor true &&
> +	git -C client rev-list --objects --all --missing=print >objects &&
> +	test $(grep -c "^?" objects) = 0 &&
> +
> +	GIT_TRACE=$(pwd)/trace.out git -C client -c gc.repackFilter=blob:none -c repack.writeBitmaps=false -c gc.pruneExpire=now gc &&
> +
> +	grep -E "^trace: (built-in|exec|run_command): git repack .* --filter=blob:none ?.*" trace.out &&
> +	git -C client rev-list --objects --all --missing=print >objects &&
> +	test $(grep -c "^?" objects) = 1
> +'
> +
>  prepare_cruft_history () {
>  	test_commit base &&
>  
> -- 
> 2.39.0.59.g395bcb85bc.dirty
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 2/3] repack: add --filter=<filter-spec> option
  2023-01-04 14:56         ` Patrick Steinhardt
@ 2023-01-05  1:39           ` Junio C Hamano
  0 siblings, 0 replies; 29+ messages in thread
From: Junio C Hamano @ 2023-01-05  1:39 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: Christian Couder, git, John Cai, Jonathan Tan, Jonathan Nieder,
	Taylor Blau, Derrick Stolee, Christian Couder

Patrick Steinhardt <ps@pks.im> writes:

> While this is a nice safeguard, I wonder whether it is sufficient.
> Suppose, for example, you have a non-bare repository that already has
> blobs checked out that would be removed by the filtering repack --
> does Git handle this situation gracefully?
>
> A quick check seems to indicate that it does. But not quite as well as
> I'd have hoped: when I switch to a detached HEAD with an arbitrary
> commit and then execute `git repack --filter=blob:none` then it also
> removes blobs that are referenced by the currently checked-out commit.
> This may or may not be what the user is asking for, but I'd rather lean
> towards this behaviour being surprising.

Hmph, the user asked not to have blobs that came from the remote
locally and instead to refetch them from the promisor on demand, so I
would expect some pruning to happen (I am not a lazy-clone user, though).

As long as we do the pruning sensibly, that is.

Unless you are always following somebody else without doing any work
on your own, you are likely to have objects that exist only locally
and nowhere else.  It would be unexpected and surprising if we lost
them only because they are of type 'blob' and because there is a
promisor remote configured.

Even if that is documented, that would be an unacceptable foot-gun
misfeature.  It is not just a local repository corruption that can
be recovered by cloning from elsewhere.  We are looking at lost
work that cannot be recovered.

I wonder if this topic can be salvaged by making it less aggressive
in pruning, perhaps by traversing from the tips of remote-tracking
branches of the promisor remote to identify which blobs can safely
be pruned (by definition, promisor remote cannot lose objects that
it once published, or its cloners will immediately have corrupt
repositories).  That may turn this from a misfeature into a feature.
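A rough sketch of that kind of safety computation (illustrative only,
not this series' implementation; the `origin` tracking ref is set up
artificially here):

```shell
#!/bin/sh
# Sketch: compute the objects reachable from local refs but NOT from the
# promisor remote's tracking refs. Anything in this set exists only
# locally and must never be pruned by a filtering repack.
set -e
demo=$(mktemp -d)
git init -q "$demo" && cd "$demo"
git config user.name demo && git config user.email demo@example.com

echo shared >shared.txt && git add shared.txt && git commit -qm shared
# Pretend the promisor remote has published everything up to this commit.
git update-ref refs/remotes/origin/main HEAD

echo private >private.txt && git add private.txt && git commit -qm private

# Objects that only exist locally; pruning these would lose work.
git rev-list --objects --all --not --remotes=origin |
	cut -d" " -f1 | sort >local-only-oids.txt
cat local-only-oids.txt
```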

Thanks.

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2023-01-05  1:39 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-12 13:51 [PATCH 0/3] Implement filtering repacks Christian Couder
2022-10-12 13:51 ` [PATCH 1/3] pack-objects: allow --filter without --stdout Christian Couder
2022-10-12 13:51 ` [PATCH 2/3] repack: add --filter=<filter-spec> option Christian Couder
2022-10-12 13:51 ` [PATCH 3/3] repack: introduce --force to force filtering Christian Couder
2022-10-14 16:46 ` [PATCH 0/3] Implement filtering repacks Junio C Hamano
2022-10-20 11:23   ` Christian Couder
2022-10-28 19:49     ` Taylor Blau
2022-10-28 20:26       ` Junio C Hamano
2022-11-07  9:12         ` Christian Couder
2022-11-07  9:00       ` Christian Couder
2022-10-25 12:28 ` [PATCH v2 0/2] " Christian Couder
2022-10-25 12:28   ` [PATCH v2 1/2] pack-objects: allow --filter without --stdout Christian Couder
2022-10-25 12:28   ` [PATCH v2 2/2] repack: add --filter=<filter-spec> option Christian Couder
2022-10-28 19:54   ` [PATCH v2 0/2] Implement filtering repacks Taylor Blau
2022-11-07  9:29     ` Christian Couder
2022-11-22 17:51   ` [PATCH v3 " Christian Couder
2022-11-22 17:51     ` [PATCH v3 1/2] pack-objects: allow --filter without --stdout Christian Couder
2022-11-22 17:51     ` [PATCH v3 2/2] repack: add --filter=<filter-spec> option Christian Couder
2022-11-23  0:31     ` [PATCH v3 0/2] Implement filtering repacks Junio C Hamano
2022-12-21  3:53       ` Christian Couder
2022-11-23  0:35     ` Junio C Hamano
2022-12-21  4:04     ` [PATCH v4 0/3] " Christian Couder
2022-12-21  4:04       ` [PATCH v4 1/3] pack-objects: allow --filter without --stdout Christian Couder
2023-01-04 14:56         ` Patrick Steinhardt
2022-12-21  4:04       ` [PATCH v4 2/3] repack: add --filter=<filter-spec> option Christian Couder
2023-01-04 14:56         ` Patrick Steinhardt
2023-01-05  1:39           ` Junio C Hamano
2022-12-21  4:04       ` [PATCH v4 3/3] gc: add gc.repackFilter config option Christian Couder
2023-01-04 14:57         ` Patrick Steinhardt

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).