git@vger.kernel.org mailing list mirror (one of many)
* [PATCH 0/3] list-object-filter: introduce depth filter
@ 2022-09-01  9:41 ZheNing Hu via GitGitGadget
  2022-09-01  9:41 ` [PATCH 1/3] commit-graph: let commit graph respect commit graft ZheNing Hu via GitGitGadget
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: ZheNing Hu via GitGitGadget @ 2022-09-01  9:41 UTC (permalink / raw)
  To: git
  Cc: Christian Couder, Ævar Arnfjörð Bjarmason,
	Jeff King, Jeff Hostetler, Junio C Hamano, Derrick Stolee,
	Johannes Schindelin, ZheNing Hu

This patch series gives partial clone capabilities similar to those of a
shallow clone created with git clone --depth=<depth>.

Disadvantage of git clone --depth=<depth> --filter=blob:none: we must run
git fetch --unshallow to lift the shallow-clone restriction, and doing so
downloads the entire history of the current commit.

Disadvantage of git clone --filter=blob:none with git sparse-checkout: the
git client needs to send a large number of missing object ids to the
server, which can waste a lot of network traffic.

Now we can use git clone --filter="depth:<depth>" to omit all commits
whose depth is >= <depth>. This way we get the advantages of both shallow
clone and partial clone: the depth of commits is limited, and other
objects are fetched on demand.
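
To make the intended semantics concrete, here is a minimal sketch of the
boundary a depth filter draws, in C with purely illustrative types and
names (a toy model, not the actual Git implementation, which computes
this with get_shallow_commits()); the tip is counted as depth 0, so
everything at depth >= <depth> is omitted:

```c
#include <assert.h>
#include <stddef.h>

/* Toy commit model: a linear parent chain (illustrative only). */
struct toy_commit {
	int id;
	struct toy_commit *parent;
	int shallow; /* set when the commit falls outside the depth limit */
};

/*
 * Mark every commit whose depth from the tip is >= `depth` as omitted,
 * counting the tip as depth 0.
 */
static void mark_shallow_by_depth(struct toy_commit *tip, unsigned long depth)
{
	unsigned long d = 0;
	struct toy_commit *c;

	for (c = tip; c; c = c->parent, d++)
		c->shallow = (d >= depth);
}
```

Under this model, depth:1 keeps only the tip and depth:2 keeps the tip
plus its first parent, matching the commit counts asserted by the t5616
tests in patch 3.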

Unfinished business for now:

 1. git fetch has not yet learned the depth filter; if we can solve this
    problem, we may be able to implement a better "batch fetch" for some
    needed commits (see [1]).
 2. Sometimes we may want a partial clone that avoids automatically
    downloading missing objects, e.g. when running git log we might want
    results similar to those of a shallow clone (without commit grafts).

[1]:
https://lore.kernel.org/git/16633d89-6ccd-859d-8533-9861ad831c45@github.com/

ZheNing Hu (3):
  commit-graph: let commit graph respect commit graft
  list-object-filter: pass traversal_context in filter_init_fn
  list-object-filter: introduce depth filter

 Documentation/rev-list-options.txt  |   6 ++
 builtin/clone.c                     |  10 ++-
 commit-graph.c                      |  36 +++++++--
 list-objects-filter-options.c       |  30 +++++++
 list-objects-filter-options.h       |   6 ++
 list-objects-filter.c               |  78 ++++++++++++++++++-
 list-objects-filter.h               |   2 +
 list-objects.c                      |  10 +--
 list-objects.h                      |   8 ++
 shallow.c                           |  16 ++++
 shallow.h                           |   2 +
 t/t5616-partial-clone.sh            | 116 ++++++++++++++++++++++++++++
 t/t6112-rev-list-filters-objects.sh |  14 ++++
 upload-pack.c                       |  14 ----
 upload-pack.h                       |  14 ++++
 15 files changed, 330 insertions(+), 32 deletions(-)


base-commit: d42b38dfb5edf1a7fddd9542d722f91038407819
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1343%2Fadlternative%2Fzh%2Ffilter_depth-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1343/adlternative/zh/filter_depth-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1343
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/3] commit-graph: let commit graph respect commit graft
  2022-09-01  9:41 [PATCH 0/3] list-object-filter: introduce depth filter ZheNing Hu via GitGitGadget
@ 2022-09-01  9:41 ` ZheNing Hu via GitGitGadget
  2022-09-01 19:18   ` Derrick Stolee
  2022-09-01  9:41 ` [PATCH 2/3] list-object-filter: pass traversal_context in filter_init_fn ZheNing Hu via GitGitGadget
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 12+ messages in thread
From: ZheNing Hu via GitGitGadget @ 2022-09-01  9:41 UTC (permalink / raw)
  To: git
  Cc: Christian Couder, Ævar Arnfjörð Bjarmason,
	Jeff King, Jeff Hostetler, Junio C Hamano, Derrick Stolee,
	Johannes Schindelin, ZheNing Hu, ZheNing Hu

From: ZheNing Hu <adlternative@gmail.com>

In repo_parse_commit_internal(), if we want to use the
commit graph, it calls parse_commit_in_graph() to parse
the commit's content from the commit graph; otherwise it
calls repo_read_object_file() to parse the commit's
content from the commit object.

repo_read_object_file() respects commit grafts, which
correctly amend a commit's parents, but
parse_commit_in_graph() does not. This inconsistency may
result in incorrect processing of shallow clones.

So let parse_commit_in_graph() respect commit grafts as
repo_read_object_file() does, which solves this problem.
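
The graft-substitution rule this commit wants the commit-graph path to
follow can be sketched like this (a simplified toy model with
illustrative types, not the real commit-graph code; a graft with a
negative parent count, i.e. a shallow cut, yields no parents just as an
empty list does):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative model of a graft entry: an override of a commit's parents. */
struct toy_graft {
	int nr_parent;     /* replacement parent count; negative = shallow cut */
	const int *parent; /* replacement parent ids */
};

/*
 * When a graft exists, the parents recorded in the commit graph are
 * discarded and the graft's parents are used instead. Returns how many
 * parents were written to `out`.
 */
static int effective_parents(const int *recorded, int nr_recorded,
			     const struct toy_graft *graft,
			     int *out, int max_out)
{
	int n = 0, i;

	if (graft) {
		for (i = 0; i < graft->nr_parent && n < max_out; i++)
			out[n++] = graft->parent[i];
		return n;
	}
	for (i = 0; i < nr_recorded && n < max_out; i++)
		out[n++] = recorded[i];
	return n;
}
```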

Signed-off-by: ZheNing Hu <adlternative@gmail.com>
---
 commit-graph.c | 36 ++++++++++++++++++++++++++++++------
 1 file changed, 30 insertions(+), 6 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index f2a36032f84..89bb6f87079 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -820,6 +820,7 @@ static int fill_commit_in_graph(struct repository *r,
 	struct commit_list **pptr;
 	const unsigned char *commit_data;
 	uint32_t lex_index;
+	struct commit_graft *graft;
 
 	while (pos < g->num_commits_in_base)
 		g = g->base_graph;
@@ -833,31 +834,54 @@ static int fill_commit_in_graph(struct repository *r,
 
 	set_commit_tree(item, NULL);
 
+	graft = lookup_commit_graft(r, &item->object.oid);
+	if (graft)
+		r->parsed_objects->substituted_parent = 1;
+
 	pptr = &item->parents;
 
 	edge_value = get_be32(commit_data + g->hash_len);
 	if (edge_value == GRAPH_PARENT_NONE)
 		return 1;
-	pptr = insert_parent_or_die(r, g, edge_value, pptr);
+	if (!(graft && (graft->nr_parent < 0 || grafts_replace_parents)))
+		pptr = insert_parent_or_die(r, g, edge_value, pptr);
 
 	edge_value = get_be32(commit_data + g->hash_len + 4);
 	if (edge_value == GRAPH_PARENT_NONE)
 		return 1;
 	if (!(edge_value & GRAPH_EXTRA_EDGES_NEEDED)) {
-		pptr = insert_parent_or_die(r, g, edge_value, pptr);
-		return 1;
+		if (!(graft && (graft->nr_parent < 0 || grafts_replace_parents))) {
+			pptr = insert_parent_or_die(r, g, edge_value, pptr);
+			return 1;
+		}
 	}
 
 	parent_data_ptr = (uint32_t*)(g->chunk_extra_edges +
 			  4 * (uint64_t)(edge_value & GRAPH_EDGE_LAST_MASK));
 	do {
 		edge_value = get_be32(parent_data_ptr);
-		pptr = insert_parent_or_die(r, g,
-					    edge_value & GRAPH_EDGE_LAST_MASK,
-					    pptr);
+		if (!(graft && (graft->nr_parent < 0 || grafts_replace_parents))) {
+			pptr = insert_parent_or_die(r, g,
+						    edge_value & GRAPH_EDGE_LAST_MASK,
+						    pptr);
+		}
 		parent_data_ptr++;
 	} while (!(edge_value & GRAPH_LAST_EDGE));
 
+	if (graft) {
+		int i;
+		struct commit *new_parent;
+		for (i = 0; i < graft->nr_parent; i++) {
+			new_parent = lookup_commit(r,
+						   &graft->parent[i]);
+			if (!new_parent)
+				die(_("bad graft parent %s in commit %s"),
+				       oid_to_hex(&graft->parent[i]),
+				       oid_to_hex(&item->object.oid));
+			pptr = &commit_list_insert(new_parent, pptr)->next;
+		}
+	}
+
 	return 1;
 }
 
-- 
gitgitgadget



* [PATCH 2/3] list-object-filter: pass traversal_context in filter_init_fn
  2022-09-01  9:41 [PATCH 0/3] list-object-filter: introduce depth filter ZheNing Hu via GitGitGadget
  2022-09-01  9:41 ` [PATCH 1/3] commit-graph: let commit graph respect commit graft ZheNing Hu via GitGitGadget
@ 2022-09-01  9:41 ` ZheNing Hu via GitGitGadget
  2022-09-01  9:41 ` [PATCH 3/3] list-object-filter: introduce depth filter ZheNing Hu via GitGitGadget
  2022-09-01 19:24 ` [PATCH 0/3] " Derrick Stolee
  3 siblings, 0 replies; 12+ messages in thread
From: ZheNing Hu via GitGitGadget @ 2022-09-01  9:41 UTC (permalink / raw)
  To: git
  Cc: Christian Couder, Ævar Arnfjörð Bjarmason,
	Jeff King, Jeff Hostetler, Junio C Hamano, Derrick Stolee,
	Johannes Schindelin, ZheNing Hu, ZheNing Hu

From: ZheNing Hu <adlternative@gmail.com>

Pass traversal_context to all filter init functions, so that
a filter's initialization function can read or modify the
rev_info of the traversal_context.
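
The shape of the change can be sketched as follows (toy types and names,
not the real list-objects-filter code): the init callback type gains a
context parameter, and the dispatcher threads it through to every
filter:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model of the API change: every filter init callback now receives
 * the traversal context, so a filter can inspect rev_info at setup time.
 */
struct toy_ctx { int have_revs; };
struct toy_filter { int saw_ctx; };

typedef void (*toy_init_fn)(struct toy_ctx *ctx, struct toy_filter *filter);

static void toy_depth_init(struct toy_ctx *ctx, struct toy_filter *filter)
{
	/* the depth filter is the first user that needs ctx->revs */
	filter->saw_ctx = (ctx && ctx->have_revs);
}

static void toy_filter_init(struct toy_ctx *ctx, struct toy_filter *filter,
			    toy_init_fn init_fn)
{
	init_fn(ctx, filter); /* ctx is threaded through, as in the patch */
}
```

Existing filters that do not need the context simply ignore the extra
parameter, which is why the patch only touches each init function's
signature.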

Signed-off-by: ZheNing Hu <adlternative@gmail.com>
---
 list-objects-filter.c | 12 ++++++++++--
 list-objects-filter.h |  2 ++
 list-objects.c        | 10 +---------
 list-objects.h        |  8 ++++++++
 4 files changed, 21 insertions(+), 11 deletions(-)

diff --git a/list-objects-filter.c b/list-objects-filter.c
index 1c1ee3d1bb1..76e8659ea73 100644
--- a/list-objects-filter.c
+++ b/list-objects-filter.c
@@ -112,6 +112,7 @@ static enum list_objects_filter_result filter_blobs_none(
 }
 
 static void filter_blobs_none__init(
+	struct traversal_context *ctx,
 	struct list_objects_filter_options *filter_options,
 	struct filter *filter)
 {
@@ -249,6 +250,7 @@ static void filter_trees_free(void *filter_data) {
 }
 
 static void filter_trees_depth__init(
+	struct traversal_context *ctx,
 	struct list_objects_filter_options *filter_options,
 	struct filter *filter)
 {
@@ -336,6 +338,7 @@ include_it:
 }
 
 static void filter_blobs_limit__init(
+	struct traversal_context *ctx,
 	struct list_objects_filter_options *filter_options,
 	struct filter *filter)
 {
@@ -519,6 +522,7 @@ static void filter_sparse_free(void *filter_data)
 }
 
 static void filter_sparse_oid__init(
+	struct traversal_context *ctx,
 	struct list_objects_filter_options *filter_options,
 	struct filter *filter)
 {
@@ -609,6 +613,7 @@ static enum list_objects_filter_result filter_object_type(
 }
 
 static void filter_object_type__init(
+	struct traversal_context *ctx,
 	struct list_objects_filter_options *filter_options,
 	struct filter *filter)
 {
@@ -734,6 +739,7 @@ static void filter_combine__finalize_omits(
 }
 
 static void filter_combine__init(
+	struct traversal_context *ctx,
 	struct list_objects_filter_options *filter_options,
 	struct filter* filter)
 {
@@ -744,7 +750,7 @@ static void filter_combine__init(
 	CALLOC_ARRAY(d->sub, d->nr);
 	for (sub = 0; sub < d->nr; sub++)
 		d->sub[sub].filter = list_objects_filter__init(
-			filter->omits ? &d->sub[sub].omits : NULL,
+			ctx, filter->omits ? &d->sub[sub].omits : NULL,
 			&filter_options->sub[sub]);
 
 	filter->filter_data = d;
@@ -754,6 +760,7 @@ static void filter_combine__init(
 }
 
 typedef void (*filter_init_fn)(
+	struct traversal_context *ctx,
 	struct list_objects_filter_options *filter_options,
 	struct filter *filter);
 
@@ -771,6 +778,7 @@ static filter_init_fn s_filters[] = {
 };
 
 struct filter *list_objects_filter__init(
+	struct traversal_context *ctx,
 	struct oidset *omitted,
 	struct list_objects_filter_options *filter_options)
 {
@@ -792,7 +800,7 @@ struct filter *list_objects_filter__init(
 
 	CALLOC_ARRAY(filter, 1);
 	filter->omits = omitted;
-	init_fn(filter_options, filter);
+	init_fn(ctx, filter_options, filter);
 	return filter;
 }
 
diff --git a/list-objects-filter.h b/list-objects-filter.h
index 9e98814111c..0a3cb500976 100644
--- a/list-objects-filter.h
+++ b/list-objects-filter.h
@@ -5,6 +5,7 @@ struct list_objects_filter_options;
 struct object;
 struct oidset;
 struct repository;
+struct traversal_context;
 
 /*
  * During list-object traversal we allow certain objects to be
@@ -72,6 +73,7 @@ struct filter;
  * filter *`.
  */
 struct filter *list_objects_filter__init(
+	struct traversal_context *ctx,
 	struct oidset *omitted,
 	struct list_objects_filter_options *filter_options);
 
diff --git a/list-objects.c b/list-objects.c
index 250d9de41cb..698e4dbe8ff 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -13,14 +13,6 @@
 #include "object-store.h"
 #include "trace.h"
 
-struct traversal_context {
-	struct rev_info *revs;
-	show_object_fn show_object;
-	show_commit_fn show_commit;
-	void *show_data;
-	struct filter *filter;
-};
-
 static void show_commit(struct traversal_context *ctx,
 			struct commit *commit)
 {
@@ -448,7 +440,7 @@ void traverse_commit_list_filtered(
 	};
 
 	if (revs->filter.choice)
-		ctx.filter = list_objects_filter__init(omitted, &revs->filter);
+		ctx.filter = list_objects_filter__init(&ctx, omitted, &revs->filter);
 
 	do_traverse(&ctx);
 
diff --git a/list-objects.h b/list-objects.h
index 9eaf4de8449..44c598e9ce8 100644
--- a/list-objects.h
+++ b/list-objects.h
@@ -16,6 +16,14 @@ void mark_edges_uninteresting(struct rev_info *revs,
 struct oidset;
 struct list_objects_filter_options;
 
+struct traversal_context {
+	struct rev_info *revs;
+	show_object_fn show_object;
+	show_commit_fn show_commit;
+	void *show_data;
+	struct filter *filter;
+};
+
 void traverse_commit_list_filtered(
 	struct rev_info *revs,
 	show_commit_fn show_commit,
-- 
gitgitgadget



* [PATCH 3/3] list-object-filter: introduce depth filter
  2022-09-01  9:41 [PATCH 0/3] list-object-filter: introduce depth filter ZheNing Hu via GitGitGadget
  2022-09-01  9:41 ` [PATCH 1/3] commit-graph: let commit graph respect commit graft ZheNing Hu via GitGitGadget
  2022-09-01  9:41 ` [PATCH 2/3] list-object-filter: pass traversal_context in filter_init_fn ZheNing Hu via GitGitGadget
@ 2022-09-01  9:41 ` ZheNing Hu via GitGitGadget
  2022-09-01 19:24 ` [PATCH 0/3] " Derrick Stolee
  3 siblings, 0 replies; 12+ messages in thread
From: ZheNing Hu via GitGitGadget @ 2022-09-01  9:41 UTC (permalink / raw)
  To: git
  Cc: Christian Couder, Ævar Arnfjörð Bjarmason,
	Jeff King, Jeff Hostetler, Junio C Hamano, Derrick Stolee,
	Johannes Schindelin, ZheNing Hu, ZheNing Hu

From: ZheNing Hu <adlternative@gmail.com>

'git clone --depth=<depth>' has an obvious disadvantage:
we cannot run git commands that need to look at deeper
commits. We might use 'git fetch --unshallow' to lift this
restriction, but that is not a good idea either: it
downloads too many objects.

Rethink this question: why not integrate the functionality
of shallow clone into partial clone? Partial clone has a
very clear advantage: it downloads objects only when the
user needs them.

Therefore, add a filter 'depth:<depth>' which omits all
commits whose depth is >= <depth> (<depth> > 0). It looks
like '--depth=<depth>' in git clone, but the git client
does not treat it as a shallow clone.

'--filter=depth:<depth>' cannot be used with '--depth',
'--shallow-since', '--shallow-exclude', '--shallow-submodules'.
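
A simplified sketch of the option parsing this implies (an illustrative
stand-in for the real gently_parse_list_objects_filter() code): accept
only a strictly positive decimal integer after the "depth:" prefix and
reject everything else:

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse "depth:<depth>". Returns 0 and stores the value on success,
 * -1 on anything malformed or non-positive.
 */
static int parse_depth_filter(const char *arg, unsigned long *depth)
{
	const char *v;
	char *end;
	unsigned long val;

	if (strncmp(arg, "depth:", 6))
		return -1; /* not a depth filter at all */
	v = arg + 6;
	if (!isdigit((unsigned char)*v))
		return -1; /* reject empty value and negative numbers */
	errno = 0;
	val = strtoul(v, &end, 10);
	if (errno || *end || !val)
		return -1; /* overflow, trailing junk, or depth:0 */
	*depth = val;
	return 0;
}
```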

Signed-off-by: ZheNing Hu <adlternative@gmail.com>
---
 Documentation/rev-list-options.txt  |   6 ++
 builtin/clone.c                     |  10 ++-
 list-objects-filter-options.c       |  30 +++++++
 list-objects-filter-options.h       |   6 ++
 list-objects-filter.c               |  66 ++++++++++++++++
 shallow.c                           |  16 ++++
 shallow.h                           |   2 +
 t/t5616-partial-clone.sh            | 116 ++++++++++++++++++++++++++++
 t/t6112-rev-list-filters-objects.sh |  14 ++++
 upload-pack.c                       |  14 ----
 upload-pack.h                       |  14 ++++
 11 files changed, 279 insertions(+), 15 deletions(-)

diff --git a/Documentation/rev-list-options.txt b/Documentation/rev-list-options.txt
index 1837509566a..4e2905b9e1e 100644
--- a/Documentation/rev-list-options.txt
+++ b/Documentation/rev-list-options.txt
@@ -954,6 +954,12 @@ Note that the form '--filter=sparse:path=<path>' that wants to read
 from an arbitrary path on the filesystem has been dropped for security
 reasons.
 +
+The form '--filter=depth:<depth>' omits all commits whose depth is
+>= <depth>. It looks like '--depth=<depth>' in git clone, but the
+result is not treated as a shallow clone, so if you want to see
+deeper commits you can freely run git commands such as git diff to
+refetch missing objects without 'git fetch --unshallow'.
++
 Multiple '--filter=' flags can be specified to combine filters. Only
 objects which are accepted by every filter are included.
 +
diff --git a/builtin/clone.c b/builtin/clone.c
index c4ff4643ecd..0b168ec18ef 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -916,7 +916,8 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 		usage_msg_opt(_("You must specify a repository to clone."),
 			builtin_clone_usage, builtin_clone_options);
 
-	if (option_depth || option_since || option_not.nr)
+	if (option_depth || option_since || option_not.nr ||
+	    list_objects_filter_choice_exists(&filter_options, LOFC_DEPTH))
 		deepen = 1;
 	if (option_single_branch == -1)
 		option_single_branch = deepen ? 1 : 0;
@@ -1113,6 +1114,13 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 		die(_("the option '%s' requires '%s'"),
 		    "--also-filter-submodules", "--recurse-submodules");
 
+	if ((option_depth || option_since || option_not.nr ||
+	     option_shallow_submodules) &&
+	     list_objects_filter_choice_exists(&filter_options, LOFC_DEPTH))
+		die(_("--filter='depth:<depth>' cannot be used with "
+		      "--depth, --shallow-since, --shallow-exclude, "
+		      "--shallow-submodules"));
+
 	/*
 	 * apply the remote name provided by --origin only after this second
 	 * call to git_config, to ensure it overrides all config-based values.
diff --git a/list-objects-filter-options.c b/list-objects-filter-options.c
index 4b25287886d..687010b84d4 100644
--- a/list-objects-filter-options.c
+++ b/list-objects-filter-options.c
@@ -31,6 +31,8 @@ const char *list_object_filter_config_name(enum list_objects_filter_choice c)
 		return "sparse:oid";
 	case LOFC_OBJECT_TYPE:
 		return "object:type";
+	case LOFC_DEPTH:
+		return "depth";
 	case LOFC_COMBINE:
 		return "combine";
 	case LOFC__COUNT:
@@ -40,6 +42,23 @@ const char *list_object_filter_config_name(enum list_objects_filter_choice c)
 	BUG("list_object_filter_config_name: invalid argument '%d'", c);
 }
 
+int list_objects_filter_choice_exists(
+	struct list_objects_filter_options *filter_options,
+	enum list_objects_filter_choice choice) {
+	int i;
+
+	if (!filter_options)
+		return 0;
+
+	if (filter_options->choice == choice)
+		return 1;
+	if (filter_options->sub_nr)
+		for (i = 0; i < filter_options->sub_nr; i++)
+			if (filter_options->sub[i].choice == choice)
+				return 1;
+	return 0;
+}
+
 int gently_parse_list_objects_filter(
 	struct list_objects_filter_options *filter_options,
 	const char *arg,
@@ -97,6 +116,17 @@ int gently_parse_list_objects_filter(
 
 		return 0;
 
+	} else if (skip_prefix(arg, "depth:", &v0)) {
+		if (!git_parse_ulong(v0, &filter_options->depth)) {
+			strbuf_addstr(errbuf, _("expected 'depth:<depth>'"));
+			return 1;
+		} else if (atoi(v0) <= 0) {
+			strbuf_addf(errbuf, _("depth %s is not a positive number"), v0);
+			return 1;
+		}
+		filter_options->choice = LOFC_DEPTH;
+		return 0;
+
 	} else if (skip_prefix(arg, "combine:", &v0)) {
 		return parse_combine_filter(filter_options, v0, errbuf);
 
diff --git a/list-objects-filter-options.h b/list-objects-filter-options.h
index ffc02d77e76..a4fd40567d2 100644
--- a/list-objects-filter-options.h
+++ b/list-objects-filter-options.h
@@ -15,6 +15,7 @@ enum list_objects_filter_choice {
 	LOFC_TREE_DEPTH,
 	LOFC_SPARSE_OID,
 	LOFC_OBJECT_TYPE,
+	LOFC_DEPTH,
 	LOFC_COMBINE,
 	LOFC__COUNT /* must be last */
 };
@@ -54,6 +55,7 @@ struct list_objects_filter_options {
 	 */
 
 	char *sparse_oid_name;
+	unsigned long depth;
 	unsigned long blob_limit_value;
 	unsigned long tree_exclude_depth;
 	enum object_type object_type;
@@ -69,6 +71,10 @@ struct list_objects_filter_options {
 	 */
 };
 
+int list_objects_filter_choice_exists(
+	struct list_objects_filter_options *filter_options,
+	enum list_objects_filter_choice choice);
+
 /*
  * Parse value of the argument to the "filter" keyword.
  * On the command line this looks like:
diff --git a/list-objects-filter.c b/list-objects-filter.c
index 76e8659ea73..5b4d8348b54 100644
--- a/list-objects-filter.c
+++ b/list-objects-filter.c
@@ -13,6 +13,8 @@
 #include "oidmap.h"
 #include "oidset.h"
 #include "object-store.h"
+#include "shallow.h"
+#include "upload-pack.h"
 
 /* Remember to update object flag allocation in object.h */
 /*
@@ -69,6 +71,69 @@ struct filter {
 	struct oidset *omits;
 };
 
+static enum list_objects_filter_result filter_noop(
+	struct repository *r,
+	enum list_objects_filter_situation filter_situation,
+	struct object *obj,
+	const char *pathname,
+	const char *filename,
+	struct oidset *omits,
+	void *filter_data_)
+{
+	switch (filter_situation) {
+	default:
+		BUG("unknown filter_situation: %d", filter_situation);
+
+	case LOFS_TAG:
+		assert(obj->type == OBJ_TAG);
+		/* always include all tag objects */
+		return LOFR_MARK_SEEN | LOFR_DO_SHOW;
+
+	case LOFS_COMMIT:
+		assert(obj->type == OBJ_COMMIT);
+		/* always include all commit objects */
+		return LOFR_MARK_SEEN | LOFR_DO_SHOW;
+
+	case LOFS_BEGIN_TREE:
+		assert(obj->type == OBJ_TREE);
+		/* always include all tree objects */
+		return LOFR_MARK_SEEN | LOFR_DO_SHOW;
+
+	case LOFS_END_TREE:
+		assert(obj->type == OBJ_TREE);
+		return LOFR_ZERO;
+
+	case LOFS_BLOB:
+		assert(obj->type == OBJ_BLOB);
+		/* always include all blob objects */
+		return LOFR_MARK_SEEN | LOFR_DO_SHOW;
+	}
+}
+
+static void noop_free(void *filter_data) {
+	/* noop */
+}
+
+static void filter_depth__init(
+	struct traversal_context *ctx,
+	struct list_objects_filter_options *filter_options,
+	struct filter *filter)
+{
+	struct commit_list *result = get_shallow_commits_by_commits(ctx->revs->commits,
+					filter_options->depth,
+					SHALLOW, NOT_SHALLOW);
+
+	while (result) {
+		register_shallow(the_repository, &result->item->object.oid);
+		result = result->next;
+	}
+	free_commit_list(result);
+
+	filter->filter_object_fn = filter_noop;
+	filter->free_fn = noop_free;
+}
+
+
 static enum list_objects_filter_result filter_blobs_none(
 	struct repository *r,
 	enum list_objects_filter_situation filter_situation,
@@ -774,6 +839,7 @@ static filter_init_fn s_filters[] = {
 	filter_trees_depth__init,
 	filter_sparse_oid__init,
 	filter_object_type__init,
+	filter_depth__init,
 	filter_combine__init,
 };
 
diff --git a/shallow.c b/shallow.c
index 8cb768ee5f8..9d1d7668ad8 100644
--- a/shallow.c
+++ b/shallow.c
@@ -122,6 +122,22 @@ static void free_depth_in_slab(int **ptr)
 {
 	FREE_AND_NULL(*ptr);
 }
+
+struct commit_list *get_shallow_commits_by_commits(struct commit_list *commits, int depth,
+		int shallow_flag, int not_shallow_flag) {
+	struct object_array array = OBJECT_ARRAY_INIT;
+	struct commit_list *result = NULL;
+	struct commit_list *commit = commits;
+
+	while (commit) {
+		add_object_array(&commit->item->object, NULL, &array);
+		commit = commit->next;
+	}
+	result = get_shallow_commits(&array, depth, shallow_flag, not_shallow_flag);
+	object_array_clear(&array);
+	return result;
+}
+
 struct commit_list *get_shallow_commits(struct object_array *heads, int depth,
 		int shallow_flag, int not_shallow_flag)
 {
diff --git a/shallow.h b/shallow.h
index aba6ff58294..ed425b72796 100644
--- a/shallow.h
+++ b/shallow.h
@@ -34,6 +34,8 @@ void rollback_shallow_file(struct repository *r, struct shallow_lock *lk);
 
 struct commit_list *get_shallow_commits(struct object_array *heads,
 					int depth, int shallow_flag, int not_shallow_flag);
+struct commit_list *get_shallow_commits_by_commits(
+		struct commit_list *commits, int depth, int shallow_flag, int not_shallow_flag);
 struct commit_list *get_shallow_commits_by_rev_list(
 		int ac, const char **av, int shallow_flag, int not_shallow_flag);
 int write_shallow_commits(struct strbuf *out, int use_pack_protocol,
diff --git a/t/t5616-partial-clone.sh b/t/t5616-partial-clone.sh
index 9aeacc2f6a5..c328f5d76bc 100755
--- a/t/t5616-partial-clone.sh
+++ b/t/t5616-partial-clone.sh
@@ -458,6 +458,122 @@ test_expect_success 'partial clone with unresolvable sparse filter fails cleanly
 	test_i18ngrep "unable to parse sparse filter data in" err
 '
 
+test_expect_success 'setup src repo for depth filter' '
+	git init depth-src &&
+	git -C depth-src config --local uploadpack.allowfilter 1 &&
+	git -C depth-src config --local uploadpack.allowanysha1inwant 1 &&
+	test_commit -C depth-src one &&
+	test_commit -C depth-src two &&
+	test_commit -C depth-src three &&
+	git -C depth-src rm -rf two.t &&
+	git -C depth-src commit -m four
+'
+
+test_expect_success 'partial clone with depth=1 filter succeeds' '
+	rm -rf dst.git &&
+	git clone --no-local --bare \
+		  --filter=depth:1 \
+		  depth-src dst.git &&
+	(
+		cd dst.git &&
+		git cat-file --batch-check --batch-all-objects >object &&
+		grep blob object >blob_count &&
+		test_line_count = 2 blob_count &&
+		grep tree object >tree_count &&
+		test_line_count = 1 tree_count &&
+		grep commit object >commit_count &&
+		test_line_count = 1 commit_count
+	)
+'
+
+test_expect_success 'partial clone with depth=2 filter succeeds' '
+	rm -rf dst.git &&
+	git clone --no-local --bare \
+		  --filter=depth:2 \
+		  depth-src dst.git &&
+	(
+		cd dst.git &&
+		git cat-file --batch-check --batch-all-objects >object &&
+		grep blob object >blob_count &&
+		test_line_count = 3 blob_count &&
+		grep tree object >tree_count &&
+		test_line_count = 2 tree_count &&
+		grep commit object >commit_count &&
+		test_line_count = 2 commit_count
+	)
+'
+
+test_expect_success 'partial clone depth filter combine with blob:none filter succeeds' '
+	rm -rf dst.git &&
+	git clone --no-local --bare \
+		  --filter="combine:depth:1+blob:none" \
+		  depth-src dst.git &&
+	(
+		cd dst.git &&
+		git cat-file --batch-check --batch-all-objects >object &&
+		! grep blob object &&
+		grep tree object >tree_count &&
+		test_line_count = 1 tree_count &&
+		grep commit object >commit_count &&
+		test_line_count = 1 commit_count
+	)
+'
+
+test_expect_success 'refetch other commits after partial clone with depth filter' '
+	rm -rf dst.git &&
+	git clone --no-local --bare \
+		  --filter=depth:1 \
+		  depth-src dst.git &&
+	(
+		cd dst.git &&
+		git cat-file --batch-check --batch-all-objects >object &&
+		grep blob object >blob_count &&
+		test_line_count = 2 blob_count &&
+		grep tree object >tree_count &&
+		test_line_count = 1 tree_count &&
+		grep commit object >commit_count &&
+		test_line_count = 1 commit_count &&
+		# git log will trigger refetch commits
+		git log &&
+		git cat-file --batch-check --batch-all-objects >object &&
+		grep blob object >blob_count &&
+		test_line_count = 2 blob_count &&
+		grep tree object >tree_count &&
+		test_line_count = 4 tree_count &&
+		grep commit object >commit_count &&
+		test_line_count = 4 commit_count &&
+		# git diff will trigger refetch blobs
+		git diff HEAD^ HEAD &&
+		git cat-file --batch-check --batch-all-objects >object &&
+		grep blob object >blob_count &&
+		test_line_count = 3 blob_count &&
+		grep tree object >tree_count &&
+		test_line_count = 4 tree_count &&
+		grep commit object >commit_count &&
+		test_line_count = 4 commit_count
+	)
+'
+
+test_expect_success 'partial clone with depth filter with shallow clone failed' "
+	rm -rf dst.git &&
+	test_must_fail git clone --no-local --bare \
+		  --filter=depth:1 --depth=1 \
+		  depth-src dst.git 2>err &&
+	test_i18ngrep \"fatal: --filter='depth:<depth>' cannot be used with --depth, --shallow-since, --shallow-exclude, --shallow-submodules\" err &&
+	test_must_fail git clone --no-local --bare \
+		  --filter=depth:1  --shallow-since '300000000 +0700' \
+		  depth-src dst.git 2>err &&
+	test_i18ngrep \"fatal: --filter='depth:<depth>' cannot be used with --depth, --shallow-since, --shallow-exclude, --shallow-submodules\" err &&
+	test_must_fail git clone --no-local --bare \
+		  --filter=depth:1 --shallow-exclude one.t \
+		  depth-src dst.git 2>err &&
+	test_i18ngrep \"fatal: --filter='depth:<depth>' cannot be used with --depth, --shallow-since, --shallow-exclude, --shallow-submodules\" err &&
+	test_must_fail git clone --no-local --bare \
+		  --filter=depth:1 --shallow-submodules \
+		  depth-src dst.git 2>err &&
+	test_i18ngrep \"fatal: --filter='depth:<depth>' cannot be used with --depth, --shallow-since, --shallow-exclude, --shallow-submodules\" err
+"
+
 setup_triangle () {
 	rm -rf big-blob.txt server client promisor-remote &&
 
diff --git a/t/t6112-rev-list-filters-objects.sh b/t/t6112-rev-list-filters-objects.sh
index 8d9d6604f05..a562bdffc80 100755
--- a/t/t6112-rev-list-filters-objects.sh
+++ b/t/t6112-rev-list-filters-objects.sh
@@ -417,6 +417,20 @@ test_expect_success 'verify tree:3 includes everything expected' '
 	test_line_count = 10 actual
 '
 
+test_expect_success 'verify depth:1 includes tip commit expected' '
+	git -C r3 rev-list --objects --filter=depth:1 --no-object-names HEAD >objects &&
+	cat objects | git -C r3 cat-file --batch-check="%(objecttype)" >types &&
+	grep commit types >actual &&
+	test_line_count = 1 actual
+'
+
+test_expect_success 'verify depth:2 includes two commits expected' '
+	git -C r3 rev-list --objects --filter=depth:2 --no-object-names HEAD >objects &&
+	cat objects | git -C r3 cat-file --batch-check="%(objecttype)" >types &&
+	grep commit types >actual &&
+	test_line_count = 2 actual
+'
+
 test_expect_success 'combine:... for a simple combination' '
 	git -C r3 rev-list --objects --filter=combine:tree:2+blob:none HEAD \
 		>actual &&
diff --git a/upload-pack.c b/upload-pack.c
index b217a1f469e..d4ffebfa6ab 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -28,20 +28,6 @@
 #include "commit-reach.h"
 #include "shallow.h"
 
-/* Remember to update object flag allocation in object.h */
-#define THEY_HAVE	(1u << 11)
-#define OUR_REF		(1u << 12)
-#define WANTED		(1u << 13)
-#define COMMON_KNOWN	(1u << 14)
-
-#define SHALLOW		(1u << 16)
-#define NOT_SHALLOW	(1u << 17)
-#define CLIENT_SHALLOW	(1u << 18)
-#define HIDDEN_REF	(1u << 19)
-
-#define ALL_FLAGS (THEY_HAVE | OUR_REF | WANTED | COMMON_KNOWN | SHALLOW | \
-		NOT_SHALLOW | CLIENT_SHALLOW | HIDDEN_REF)
-
 /* Enum for allowed unadvertised object request (UOR) */
 enum allow_uor {
 	/* Allow specifying sha1 if it is a ref tip. */
diff --git a/upload-pack.h b/upload-pack.h
index d6ee25ea98e..36add62f6bc 100644
--- a/upload-pack.h
+++ b/upload-pack.h
@@ -1,6 +1,20 @@
 #ifndef UPLOAD_PACK_H
 #define UPLOAD_PACK_H
 
+/* Remember to update object flag allocation in object.h */
+#define THEY_HAVE	(1u << 11)
+#define OUR_REF		(1u << 12)
+#define WANTED		(1u << 13)
+#define COMMON_KNOWN	(1u << 14)
+
+#define SHALLOW		(1u << 16)
+#define NOT_SHALLOW	(1u << 17)
+#define CLIENT_SHALLOW	(1u << 18)
+#define HIDDEN_REF	(1u << 19)
+
+#define ALL_FLAGS (THEY_HAVE | OUR_REF | WANTED | COMMON_KNOWN | SHALLOW | \
+		NOT_SHALLOW | CLIENT_SHALLOW | HIDDEN_REF)
+
 void upload_pack(const int advertise_refs, const int stateless_rpc,
 		 const int timeout);
 
-- 
gitgitgadget


* Re: [PATCH 1/3] commit-graph: let commit graph respect commit graft
  2022-09-01  9:41 ` [PATCH 1/3] commit-graph: let commit graph respect commit graft ZheNing Hu via GitGitGadget
@ 2022-09-01 19:18   ` Derrick Stolee
  2022-09-04  5:57     ` ZheNing Hu
  0 siblings, 1 reply; 12+ messages in thread
From: Derrick Stolee @ 2022-09-01 19:18 UTC (permalink / raw)
  To: ZheNing Hu via GitGitGadget, git
  Cc: Christian Couder, Ævar Arnfjörð Bjarmason,
	Jeff King, Jeff Hostetler, Junio C Hamano, Johannes Schindelin,
	ZheNing Hu

On 9/1/2022 5:41 AM, ZheNing Hu via GitGitGadget wrote:
> From: ZheNing Hu <adlternative@gmail.com>
> 
> In repo_parse_commit_internal(), if we want to use
> commit graph, it will call parse_commit_in_graph() to
> parse commit's content from commit graph, otherwise
> call repo_read_object_file() to parse commit's content
> from commit object.
> 
> repo_read_object_file() will respect commit graft,
> which can correctly amend commit's parents. But
> parse_commit_in_graph() not. Inconsistencies here may
> result in incorrect processing of shallow clone.
> 
> So let parse_commit_in_graph() respect commit graft as
> repo_read_object_file() does, which can solve this problem.

If grafts or replace-objects exist, then the commit-graph
is disabled and this code will never be called. I would
expect a test case demonstrating the change in behavior
here, but that is impossible.

The commit-graph parsing should not be bogged down with
this logic.
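
For context, the compatibility check referred to here can be sketched as
follows (a toy model with illustrative field names, not the real
commit_graph_compatible() in commit-graph.c): when replace objects,
grafts, or a shallow file are in effect, Git declines to use the
commit-graph at all, which is why the graft-aware parsing added by this
patch would be unreachable:

```c
#include <assert.h>

/* Illustrative stand-in for the repository state that disables the graph. */
struct toy_repo {
	int has_replace_objects;
	int has_grafts;
	int is_shallow;
};

static int toy_commit_graph_compatible(const struct toy_repo *r)
{
	if (r->has_replace_objects || r->has_grafts || r->is_shallow)
		return 0; /* commit-graph disabled */
	return 1;
}
```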

Thanks,
-Stolee


* Re: [PATCH 0/3] list-object-filter: introduce depth filter
  2022-09-01  9:41 [PATCH 0/3] list-object-filter: introduce depth filter ZheNing Hu via GitGitGadget
                   ` (2 preceding siblings ...)
  2022-09-01  9:41 ` [PATCH 3/3] list-object-filter: introduce depth filter ZheNing Hu via GitGitGadget
@ 2022-09-01 19:24 ` Derrick Stolee
  2022-09-02 13:48   ` Johannes Schindelin
  2022-09-04  7:27   ` ZheNing Hu
  3 siblings, 2 replies; 12+ messages in thread
From: Derrick Stolee @ 2022-09-01 19:24 UTC (permalink / raw)
  To: ZheNing Hu via GitGitGadget, git
  Cc: Christian Couder, Ævar Arnfjörð Bjarmason,
	Jeff King, Jeff Hostetler, Junio C Hamano, Johannes Schindelin,
	ZheNing Hu

On 9/1/2022 5:41 AM, ZheNing Hu via GitGitGadget wrote:
> This patch let partial clone have the similar capabilities of the shallow
> clone git clone --depth=<depth>.
...
> Now we can use git clone --filter="depth=<depth>" to omit all commits whose
> depth is >= <depth>. By this way, we can have the advantages of both shallow
> clone and partial clone: Limiting the depth of commits, get other objects on
> demand.

I have several concerns about this proposal.

The first is that "depth=X" doesn't mean anything after the first
clone. What will happen when we fetch the remaining objects?

Partial clone is designed to download a subset of objects, but make
the remaining reachable objects downloadable on demand. By dropping
reachable commits, the normal partial clone mechanism would result
in a 'git rev-list' call asking for a missing commit. Would this
inherit the "depth=X" but result in a huge amount of over-downloading
the trees and blobs in that commit range? Would it result in downloading
commits one-by-one, and then their root trees (and all reachable objects
from those root trees)?
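
That on-demand behavior is easy to observe with a toy setup (invented repository
names and contents; a sketch of the existing blob:none machinery, not of the
proposed depth filter): in a partial clone, merely reading a filtered-out
object triggers a fetch from the promisor remote.

```shell
# Toy demonstration of partial-clone lazy fetching (invented names).
set -e
tmp=$(mktemp -d); cd "$tmp"

# Build a tiny server-side repository.
git init -q server
echo hello >server/file.txt
git -C server add file.txt
git -C server -c user.email=a@b.c -c user.name=t commit -q -m one
git -C server config uploadpack.allowfilter true

# --no-local forces the fetch machinery, so the filter is honored.
git clone -q --no-local --no-checkout --filter=blob:none server client

# Reading the blob triggers an on-demand fetch from the promisor remote.
git -C client cat-file -p HEAD:file.txt
```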

Finally, computing the set of objects to send is just as expensive as
if we had a shallow clone (we can't use bitmaps). However, we get the
additional problem where fetches do not have a shallow boundary, so
the server will send deltas based on objects that are not necessarily
present locally, triggering extra requests to resolve those deltas.

This fallout remains undocumented and unexplored in this series, but I
doubt the investigation would result in positive outcomes.

> Disadvantages of git clone --depth=<depth> --filter=blob:none: we must call
> git fetch --unshallow to lift the shallow clone restriction, it will
> download all history of current commit.

How does your proposal fix this? Instead of unshallowing, users will
stumble across these objects and trigger huge downloads by accident.
 
> Disadvantages of git clone --filter=blob:none with git sparse-checkout: The
> git client needs to send a lot of missing objects' id to the server, this
> can be very wasteful of network traffic.

Asking for a list of blobs (especially limited to a sparse-checkout) is
much more efficient than what will happen when a user tries to do almost
anything in a repository formed the way you did here.

Thinking about this idea, I don't think it is viable. I would need to
see a lot of work done to test these scenarios closely to believe that
this type of partial clone is a desirable working state.

Thanks,
-Stolee

* Re: [PATCH 0/3] list-object-filter: introduce depth filter
  2022-09-01 19:24 ` [PATCH 0/3] " Derrick Stolee
@ 2022-09-02 13:48   ` Johannes Schindelin
  2022-09-04  9:14     ` ZheNing Hu
  2022-09-04  7:27   ` ZheNing Hu
  1 sibling, 1 reply; 12+ messages in thread
From: Johannes Schindelin @ 2022-09-02 13:48 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: ZheNing Hu via GitGitGadget, git, Christian Couder,
	Ævar Arnfjörð Bjarmason, Jeff King, Jeff Hostetler,
	Junio C Hamano, ZheNing Hu

Hi ZheNing,

first of all: thank you for working on this. In the past, I thought that
this feature would likely be something we would want to have in Git.

But Stolee's concerns are valid, and made me think about it more. See
below for a more detailed analysis.

On Thu, 1 Sep 2022, Derrick Stolee wrote:

> On 9/1/2022 5:41 AM, ZheNing Hu via GitGitGadget wrote:
>
> > [...]
> >
> > Disadvantages of git clone --filter=blob:none with git
> > sparse-checkout: The git client needs to send a lot of missing
> > objects' id to the server, this can be very wasteful of network
> > traffic.
>
> Asking for a list of blobs (especially limited to a sparse-checkout) is
> much more efficient than what will happen when a user tries to do almost
> anything in a repository formed the way you did here.

I agree. When you have all the commit and tree objects on the local side,
you can enumerate all the blob objects you need in one fell swoop, then
fetch them in a single network round trip.

When you lack tree objects, or worse, commit objects, this is not true.
You may very well need to fetch _quite_ a bunch of objects, then inspect
them to find out that you need to fetch more tree/commit objects, and then
a couple more round trips, before you can enumerate all of the objects you
need.

Concrete example: let's assume that you clone git.git with a "partial
depth" of 50. That is, while cloning, all of the tip commits' graphs will
be traversed up until the commits that are removed by 49 edges in the
commit graph. For example, v0.99~49 will be present locally after cloning,
but not v0.99~50.

Now, the first-parent depth of v0.99 is 955 (verify with `git rev-list
--count --first-parent v0.99`). None of the commits reachable from v0.99
other than the tip itself seem to be closer to any other tag, so all
commits reachable from v0.99~49 will be missing locally. And since reverts
are rare, we must assume that the vast majority of the associated root
tree objects are missing, too.

Digging through history, a contributor might need to investigate where,
say, `t/t4100/t-apply-7.expect` was introduced (it was in v0.99~206)
because they found something looking like a bug and they need to read the
commit message to see whether it was intentional. They know that this file
was already present in v0.99. Naturally, the command-line to investigate
that is:

	git log --diff-filter=A v0.99 -- t/t4100/t-apply-7.expect

So what does Git do in that operation? It traverses the commits starting
from v0.99, following the chain along the commit parents. When it
encounters v0.99~49, it figures out that it has to fetch v0.99~50. To see
whether v0.99~49 introduced that file, it then has to inspect that commit
object and then fetch the tree object (v0.99~50^{tree}). Then, Git
inspects that tree to find out the object ID for v0.99~50^{tree}:t/, sees
that it is identical to v0.99~49^{tree}:t/ and therefore the pathspec
filter skips this commit from the output of the `git log` command. A
couple of parent traversals later (always fetching the parent commit
object individually, then the associated tree object, then figuring out
that `t/` is unchanged) Git will encounter v0.99~55 where `t/` _did_
change. So now it also has to fetch _that_ tree object.

In total, we are looking at 400+ individual network round trips just to
fetch the required tree/commit objects, i.e. before Git can show you the
output of that `git log` command. And that's just for back-filling the
missing tree/commit objects.

If we had done this using a shallow clone, Git would have stopped at the
shallow boundary, the user would have had a chance to increase the depth
in bigger chunks (probably first extending the depth by 50, then maybe
100, then maybe going for 500) and while it would have been a lot of
manual labor, the total time would be still a lot shorter than those 400+
network round trips (which likely would incur some throttling on the
server side).
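
That chunked-deepening workflow can be driven with `git fetch --deepen`; here
is a toy sketch with invented names (a local repository over `file://`, so the
real fetch protocol is exercised):

```shell
# Toy demonstration of extending a shallow clone's depth in chunks.
set -e
tmp=$(mktemp -d); cd "$tmp"

git init -q server
for i in 1 2 3 4 5; do
  git -C server -c user.email=a@b.c -c user.name=t \
    commit -q --allow-empty -m "c$i"
done

# Shallow clone containing only the tip commit.
git clone -q --depth=1 "file://$tmp/server" client
git -C client rev-list --count HEAD    # 1

# Push the shallow boundary two commits further back.
git -C client fetch -q --deepen=2
git -C client rev-list --count HEAD    # 3
```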

> Thinking about this idea, I don't think it is viable. I would need to
> see a lot of work done to test these scenarios closely to believe that
> this type of partial clone is a desirable working state.

Indeed, it is hard to think of a way how the design could result in
anything but undesirable behavior, both on the client and the server side.

We also have to consider that our experience with large repositories
demonstrates that tree and commit objects delta pretty well and are
virtually never a concern when cloning. It is always the sheer amount of
blob objects that is causing poor user experience when performing
non-partial clones of large repositories.

Now, I can be totally wrong in my expectation that there is _no_ scenario
where cloning with a "partial depth" would cause anything but poor
performance. If I am wrong, then there is value in having this feature,
but since it causes undesirable performance in all cases I can think of,
it definitely should be guarded behind an opt-in flag.

Ciao,
Dscho

* Re: [PATCH 1/3] commit-graph: let commit graph respect commit graft
  2022-09-01 19:18   ` Derrick Stolee
@ 2022-09-04  5:57     ` ZheNing Hu
  0 siblings, 0 replies; 12+ messages in thread
From: ZheNing Hu @ 2022-09-04  5:57 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: ZheNing Hu via GitGitGadget, Git List, Christian Couder,
	Ævar Arnfjörð Bjarmason, Jeff King, Jeff Hostetler,
	Junio C Hamano, Johannes Schindelin

Derrick Stolee <derrickstolee@github.com> 于2022年9月2日周五 03:18写道:
>
> On 9/1/2022 5:41 AM, ZheNing Hu via GitGitGadget wrote:
> > From: ZheNing Hu <adlternative@gmail.com>
> >
> > In repo_parse_commit_internal(), if we want to use
> > commit graph, it will call parse_commit_in_graph() to
> > parse commit's content from commit graph, otherwise
> > call repo_read_object_file() to parse commit's content
> > from commit object.
> >
> > repo_read_object_file() will respect commit graft,
> > which can correctly amend commit's parents. But
> > parse_commit_in_graph() not. Inconsistencies here may
> > result in incorrect processing of shallow clone.
> >
> > So let parse_commit_in_graph() respect commit graft as
> > repo_read_object_file() does, which can solve this problem.
>
> If grafts or replace-objects exist, then the commit-graph
> is disabled and this code will never be called. I would
> expect a test case demonstrating the change in behavior
> here, but that is impossible.
>

Thanks for the clarification.
I don't really know what's wrong here, but let's do a little test:

1. Revert this commit 19fd72c34dcd1332df638d76b0b028e9d9da3d41
$ git revert 19fd72

2. Clone the git repo
$ git clone --bare git@github.com:git/git.git

3. Write commit graph
$ git commit-graph write

4. Use --depth=<depth> to clone (depth=1)
$ git clone --no-checkout --no-local --depth=1 git.git git1
Cloning into 'git1'...
remote: Enumerating objects: 4306, done.
remote: Counting objects: 100% (4306/4306), done.
remote: Compressing objects: 100% (3785/3785), done.

5. Use --depth=<depth> to clone (depth=2)
$ git clone --no-checkout --no-local --depth=2 git.git git2
Cloning into 'git2'...
remote: Enumerating objects: 4311, done.
remote: Counting objects: 100% (4311/4311), done.
remote: Compressing objects: 100% (3788/3788), done.

6. Use the depth filter to clone (depth=1)
$ git clone --no-checkout --no-local --filter=depth:1 git.git git3
Cloning into 'git3'...
remote: Enumerating objects: 4306, done.
remote: Counting objects: 100% (4306/4306), done.
remote: Compressing objects: 100% (3785/3785), done.

7. Use the depth filter to clone (depth=2)
$ git clone --no-checkout --no-local --filter=depth:2 git.git git4
Cloning into 'git4'...
remote: Enumerating objects: 322987, done.
remote: Counting objects: 100% (322987/322987), done.
remote: Compressing objects: 100% (77441/77441), done.

As we can see, when we use --filter=depth:<depth> (depth >= 2),
it seems like we clone a lot of objects. The result is significantly
different from git clone --depth=<depth> (depth >= 2).

So I debugged it by tracing the git pack-objects process:

I found that --filter=depth:<depth> and --depth=<depth> behave
differently.

--filter=depth:<depth> successfully resolves commit parents in
parse_commit_in_graph().

Call stack( cmd_pack_objects -> get_object_list -> traverse_commit_list ->
traverse_commit_list_filtered -> do_traverse -> get_revision ->
get_revision_internal -> get_revision_1 -> process_parents ->
repo_parse_commit_gently -> repo_parse_commit_internal ->
parse_commit_in_graph)

--depth=<depth> fails in parse_commit_in_graph(), and falls back to
repo_read_object_file() to resolve commit parents.

Call stack( cmd_pack_objects -> get_object_list -> traverse_commit_list ->
traverse_commit_list_filtered -> do_traverse -> get_revision ->
get_revision_internal -> get_revision_1 -> process_parents ->
repo_parse_commit_gently -> repo_parse_commit_internal ->
repo_read_object_file)

> The commit-graph parsing should not be bogged down with
> this logic.
>

So I tried to fix this problem by letting the commit-graph respect commit
grafts. I don't know if I overlooked something...

> Thanks,
> -Stolee
>

Thanks,
ZheNing Hu

* Re: [PATCH 0/3] list-object-filter: introduce depth filter
  2022-09-01 19:24 ` [PATCH 0/3] " Derrick Stolee
  2022-09-02 13:48   ` Johannes Schindelin
@ 2022-09-04  7:27   ` ZheNing Hu
  1 sibling, 0 replies; 12+ messages in thread
From: ZheNing Hu @ 2022-09-04  7:27 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: ZheNing Hu via GitGitGadget, Git List, Christian Couder,
	Ævar Arnfjörð Bjarmason, Jeff King, Jeff Hostetler,
	Junio C Hamano, Johannes Schindelin

Derrick Stolee <derrickstolee@github.com> 于2022年9月2日周五 03:24写道:
>
> On 9/1/2022 5:41 AM, ZheNing Hu via GitGitGadget wrote:
> > This patch let partial clone have the similar capabilities of the shallow
> > clone git clone --depth=<depth>.
> ...
> > Now we can use git clone --filter="depth=<depth>" to omit all commits whose
> > depth is >= <depth>. By this way, we can have the advantages of both shallow
> > clone and partial clone: Limiting the depth of commits, get other objects on
> > demand.
>
> I have several concerns about this proposal.
>
> The first is that "depth=X" doesn't mean anything after the first
> clone. What will happen when we fetch the remaining objects?
>

According to the current results, yes, it still downloads a large number
of commits.

Let's do a little test again:

$ git clone --filter=depth:2 git.git git
Cloning into 'git'...
remote: Enumerating objects: 4311, done.
remote: Counting objects: 100% (4311/4311), done.
remote: Compressing objects: 100% (3788/3788), done.

Just see how many objects...
$ git cat-file --batch-check --batch-all-objects | grep blob | wc -l
warning: This repository uses promisor remotes. Some objects may not be loaded.
    4098
$ git cat-file --batch-check --batch-all-objects | grep tree | wc -l
warning: This repository uses promisor remotes. Some objects may not be loaded.
     211
$ git cat-file --batch-check --batch-all-objects | grep commit | wc -l
warning: This repository uses promisor remotes. Some objects may not be loaded.
       2

$ git checkout HEAD~

This fetches nothing... because depth=2, so the parent commit is already present.

$  git checkout HEAD~
remote: Enumerating objects: 198514, done.
remote: Counting objects: 100% (198514/198514), done.
remote: Compressing objects: 100% (68511/68511), done.
remote: Total 198514 (delta 128408), reused 198509 (delta 128406), pack-reused 0
Receiving objects: 100% (198514/198514), 77.07 MiB | 9.58 MiB/s, done.
Resolving deltas: 100% (128408/128408), done.
remote: Enumerating objects: 1, done.
remote: Counting objects: 100% (1/1), done.
remote: Total 1 (delta 0), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (1/1), 14.35 KiB | 14.35 MiB/s, done.
remote: Enumerating objects: 198014, done.
remote: Counting objects: 100% (198014/198014), done.
remote: Compressing objects: 100% (68362/68362), done.
remote: Total 198014 (delta 128056), reused 198012 (delta 128055), pack-reused 0
Receiving objects: 100% (198014/198014), 76.55 MiB | 14.00 MiB/s, done.
Resolving deltas: 100% (128056/128056), done.
Previous HEAD position was 624a936234 Merge branch 'en/merge-multi-strategies'
HEAD is now at 014a9ea207 Merge branch 'en/t4301-more-merge-tree-tests'

Fetch a lot of objects... (three times!)

$ git cat-file --batch-check --batch-all-objects | grep blob | wc -l
warning: This repository uses promisor remotes. Some objects may not be loaded.
    4099
$ git cat-file --batch-check --batch-all-objects | grep tree | wc -l
warning: This repository uses promisor remotes. Some objects may not be loaded.
    130712
$ git cat-file --batch-check --batch-all-objects | grep commit | wc -l
warning: This repository uses promisor remotes. Some objects may not be loaded.
    67815

It fetched too many commits and trees... but surprisingly, only one
more blob was downloaded.

I admit that this is very bad behavior; that's because we
have no commit history locally...

Maybe one solution: we could also provide a commit-id parameter
inside the depth filter, like --filter="commit:014a9ea207, depth:1"...
We could clone with the blob:none filter to download all trees/commits,
then fetch blobs with this "commit-depth" filter... We could even
provide a more complex filter: --filter="commit:014a9ea207, depth:1, type=blob".
This may avoid downloading too many unneeded commits and trees...

git fetch --filter="commit:014a9ea207, depth:1, type=blob"

If git fetch learned this filter, then git checkout or other commands could
use it internally as a heuristic:

e.g.

git checkout HEAD~
if HEAD~ missing | 75% blobs/trees in HEAD~ missing -> use "commit-depth" filter
else -> use blob:none filter

We can even make this commit-depth filter support multiple commits later.

> Partial clone is designed to download a subset of objects, but make
> the remaining reachable objects downloadable on demand. By dropping
> reachable commits, the normal partial clone mechanism would result
> in a 'git rev-list' call asking for a missing commit. Would this
> inherit the "depth=X" but result in a huge amount of over-downloading
> the trees and blobs in that commit range? Would it result in downloading
> commits one-by-one, and then their root trees (and all reachable objects
> from those root trees)?
>

I don't know if it's possible to let git rev-list know that commits are
missing and stop downloading them (just like git cat-file --batch
--batch-all-objects does).

Similarly, git log or other commands could be taught to understand this...

Probably a config var: fetch.skipmissingcommits...
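
For missing blobs and trees (though not yet for absent commit ranges),
git rev-list can already do something like this: its --missing option
disables on-demand fetching and reports absent objects instead. A toy
sketch with invented names:

```shell
# Enumerating objects that are absent locally, without fetching them.
set -e
tmp=$(mktemp -d); cd "$tmp"

git init -q server
echo data >server/file.txt
git -C server add file.txt
git -C server -c user.email=a@b.c -c user.name=t commit -q -m one
git -C server config uploadpack.allowfilter true

git clone -q --no-local --no-checkout --filter=blob:none server client

# --missing=print suppresses lazy fetching and prints absent objects
# with a leading '?'; here only the filtered-out blob is missing.
git -C client rev-list --objects --missing=print HEAD | grep -c '^?'
```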

> Finally, computing the set of objects to send is just as expensive as
> if we had a shallow clone (we can't use bitmaps). However, we get the
> additional problem where fetches do not have a shallow boundary, so
> the server will send deltas based on objects that are not necessarily
> present locally, triggering extra requests to resolve those deltas.
>

Agreed, I think this may be a problem, but there is no good solution for it.

> This fallout remains undocumented and unexplored in this series, but I
> doubt the investigation would result in positive outcomes.
>
> > Disadvantages of git clone --depth=<depth> --filter=blob:none: we must call
> > git fetch --unshallow to lift the shallow clone restriction, it will
> > download all history of current commit.
>
> How does your proposal fix this? Instead of unshallowing, users will
> stumble across these objects and trigger huge downloads by accident.
>

As mentioned above, I would expect a commit-depth filter to fix this.

> > Disadvantages of git clone --filter=blob:none with git sparse-checkout: The
> > git client needs to send a lot of missing objects' id to the server, this
> > can be very wasteful of network traffic.
>
> Asking for a list of blobs (especially limited to a sparse-checkout) is
> much more efficient than what will happen when a user tries to do almost
> anything in a repository formed the way you did here.
>

Yes. Also, as mentioned above, this filter could be enabled in some specific
cases, e.g. when we have the commit but not all of its trees/blobs.

> Thinking about this idea, I don't think it is viable. I would need to
> see a lot of work done to test these scenarios closely to believe that
> this type of partial clone is a desirable working state.
>

Agree.

> Thanks,
> -Stolee

Thanks for these reviews and criticisms; they make me think more :)

ZheNing Hu

* Re: [PATCH 0/3] list-object-filter: introduce depth filter
  2022-09-02 13:48   ` Johannes Schindelin
@ 2022-09-04  9:14     ` ZheNing Hu
  2022-09-07 10:18       ` Johannes Schindelin
  0 siblings, 1 reply; 12+ messages in thread
From: ZheNing Hu @ 2022-09-04  9:14 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Derrick Stolee, ZheNing Hu via GitGitGadget, Git List,
	Christian Couder, Ævar Arnfjörð Bjarmason,
	Jeff King, Jeff Hostetler, Junio C Hamano

Johannes Schindelin <Johannes.Schindelin@gmx.de> 于2022年9月2日周五 21:48写道:
>
> Hi ZheNing,
>
> first of all: thank you for working on this. In the past, I thought that
> this feature would be likely something we would want to have in Git.
>

Originally, I just found that a "full git checkout" after a partial clone
will send so many blob ids:

$ git clone --filter=blob:none --no-checkout --sparse
git@github.com:derrickstolee/sparse-checkout-example.git
$ cd sparse-checkout-example
$ GIT_TRACE_PACKET=$HOME/packet.trace git checkout HEAD
$ grep want $HOME/packet.trace  | wc -l
4060

So I wondered whether this process could be simplified between
the client and the server. In git checkout, users only need all the objects
in a commit. So maybe we can let the git client tell the server this
commit-id, and then the server sends all objects in this commit. Then I
found it just looks like git clone|fetch --depth=1, but a shallow clone
doesn't make fetching missing objects as easy as a partial clone does.

https://git-scm.com/docs/partial-clone#_non_tasks also said:

Every time the subject of "demand loading blobs" comes up it seems
that someone suggests that the server be allowed to "guess" and send
additional objects that may be related to the requested objects.

So I guess --filter=depth:<depth> may be a solution, but as you and
Derrick have said, there are still many problems with this depth filter.

> But Stolee's concerns are valid, and made me think about it more. See
> below for a more detailed analysis.
>
> On Thu, 1 Sep 2022, Derrick Stolee wrote:
>
> > On 9/1/2022 5:41 AM, ZheNing Hu via GitGitGadget wrote:
> >
> > > [...]
> > >
> > > Disadvantages of git clone --filter=blob:none with git
> > > sparse-checkout: The git client needs to send a lot of missing
> > > objects' id to the server, this can be very wasteful of network
> > > traffic.
> >
> > Asking for a list of blobs (especially limited to a sparse-checkout) is
> > much more efficient than what will happen when a user tries to do almost
> > anything in a repository formed the way you did here.
>
> I agree. When you have all the commit and tree objects on the local side,
> you can enumerate all the blob objects you need in one fell swoop, then
> fetch them in a single network round trip.
>
> When you lack tree objects, or worse, commit objects, this is not true.
> You may very well need to fetch _quite_ a bunch of objects, then inspect
> them to find out that you need to fetch more tree/commit objects, and then
> a couple more round trips, before you can enumerate all of the objects you
> need.
>

I think this is because the previous design was that you had to fetch
these missing commits (also trees) and all their ancestors. Maybe we can
modify git rev-list to make it understand missing commits...

> Concrete example: let's assume that you clone git.git with a "partial
> depth" of 50. That is, while cloning, all of the tip commits' graphs will
> be traversed up until the commits that are removed by 49 edges in the
> commit graph. For example, v0.99~49 will be present locally after cloning,
> but not v0.99~50.
>
> Now, the first-parent depth of v0.99 is 955 (verify with `git rev-list
> --count --first-parent v0.99`). None of the commits reachable from v0.99
> other than the tip itself seem to be closer to any other tag, so all
> commits reachable from v0.99~49 will be missing locally. And since reverts
> are rare, we must assume that the vast majority of the associated root
> tree objects are missing, too.
>
> Digging through history, a contributor might need to investigate where,
> say, `t/t4100/t-apply-7.expect` was introduced (it was in v0.99~206)
> because they found something looking like a bug and they need to read the
> commit message to see whether it was intentional. They know that this file
> was already present in v0.99. Naturally, the command-line to investigate
> that is:
>
>         git log --diff-filter=A v0.99 -- t/t4100/t-apply-7.expect
>
> So what does Git do in that operation? It traverses the commits starting
> from v0.99, following the chain along the commit parents. When it
> encounters v0.99~49, it figures out that it has to fetch v0.99~50. To see
> whether v0.99~49 introduced that file, it then has to inspect that commit
> object and then fetch the tree object (v0.99~50^{tree}). Then, Git
> inspects that tree to find out the object ID for v0.99~50^{tree}:t/, sees
> that it is identical to v0.99~49^{tree}:t/ and therefore the pathspec
> filter skips this commit from the output of the `git log` command. A
> couple of parent traversals later (always fetching the parent commit
> object individually, then the associated tree object, then figuring out
> that `t/` is unchanged) Git will encounter v0.99~55 where `t/` _did_
> change. So now it also has to fetch _that_ tree object.
>

A very convincing example. I think git commands that may require all of the
missing commit history should fetch all commits in a batch (so this depth
filter is not very useful here).

> In total, we are looking at 400+ individual network round trips just to
> fetch the required tree/commit objects, i.e. before Git can show you the
> output of that `git log` command. And that's just for back-filling the
> missing tree/commit objects.
>
> If we had done this using a shallow clone, Git would have stopped at the
> shallow boundary, the user would have had a chance to increase the depth
> in bigger chunks (probably first extending the depth by 50, then maybe
> 100, then maybe going for 500) and while it would have been a lot of
> manual labor, the total time would be still a lot shorter than those 400+
> network round trips (which likely would incur some throttling on the
> server side).
>

Agree.

> > Thinking about this idea, I don't think it is viable. I would need to
> > see a lot of work done to test these scenarios closely to believe that
> > this type of partial clone is a desirable working state.
>
> Indeed, it is hard to think of a way how the design could result in
> anything but undesirable behavior, both on the client and the server side.
>
> We also have to consider that our experience with large repositories
> demonstrates that tree and commit objects delta pretty well and are
> virtually never a concern when cloning. It is always the sheer amount of
> blob objects that is causing poor user experience when performing
> non-partial clones of large repositories.
>

Thanks, I think I understand the problem here. By the way, does it make
sense to download just some of the commits/trees in a big repository
which has several million commits/trees?

> Now, I can be totally wrong in my expectation that there is _no_ scenario
> where cloning with a "partial depth" would cause anything but poor
> performance. If I am wrong, then there is value in having this feature,
> but since it causes undesirable performance in all cases I can think of,
> it definitely should be guarded behind an opt-in flag.
>

Well, now I think this depth filter might be a better fit for git fetch.

If git checkout or other commands just need to check a few commits,
and find that almost all objects (maybe >= 75%) in a commit are not
local, they could use this depth filter to download them.

> Ciao,
> Dscho

Thanks,
ZheNing Hu

* Re: [PATCH 0/3] list-object-filter: introduce depth filter
  2022-09-04  9:14     ` ZheNing Hu
@ 2022-09-07 10:18       ` Johannes Schindelin
  2022-09-11 10:59         ` ZheNing Hu
  0 siblings, 1 reply; 12+ messages in thread
From: Johannes Schindelin @ 2022-09-07 10:18 UTC (permalink / raw)
  To: ZheNing Hu
  Cc: Derrick Stolee, ZheNing Hu via GitGitGadget, Git List,
	Christian Couder, Ævar Arnfjörð Bjarmason,
	Jeff King, Jeff Hostetler, Junio C Hamano

Hi ZheNing,

On Sun, 4 Sep 2022, ZheNing Hu wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> 于2022年9月2日周五 21:48写道:
>
> > [...]
> > When you have all the commit and tree objects on the local side,
> > you can enumerate all the blob objects you need in one fell swoop, then
> > fetch them in a single network round trip.
> >
> > When you lack tree objects, or worse, commit objects, this is not true.
> > You may very well need to fetch _quite_ a bunch of objects, then inspect
> > them to find out that you need to fetch more tree/commit objects, and then
> > a couple more round trips, before you can enumerate all of the objects you
> > need.
>
> I think this is because the previous design was that you had to fetch
> these missing commits (also trees) and all their ancestors. Maybe we can
> modify git rev-list to make it understand missing commits...

We do have such a modification, and it is called "shallow clone" ;-)

Granted, shallow clones are not a complete solution and turned out to be a
dead end (i.e. that design cannot be extended into anything more useful).
But that approach demonstrates what it would take to implement a logic
whereby Git understands that some commit ranges are missing and should not
be fetched automatically.

> > [...] it is hard to think of a way how the design could result in
> > anything but undesirable behavior, both on the client and the server
> > side.
> >
> > We also have to consider that our experience with large repositories
> > demonstrates that tree and commit objects delta pretty well and are
> > virtually never a concern when cloning. It is always the sheer amount
> > of blob objects that is causing poor user experience when performing
> > non-partial clones of large repositories.
>
> Thanks, I think I understand the problem here. By the way, does it make
> sense to download just some of the commits/trees in some big repository
> which have several million commits/trees?

It probably only makes sense if we can come up with a good idea how to
teach Git the trick to stop downloading so many objects in costly
roundtrips.

But I wonder whether your scenarios are so different from the ones I
encountered, in that commit and tree objects do _not_ delta well on your
side?

If they _do_ delta well, i.e. if it is comparatively cheap to just fetch
them all in one go, it probably makes more sense to just drop the idea of
fetching only some commit/tree objects but not others in a partial clone,
and always fetch all of 'em.
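
Fetching all of them in one go is what the existing blob:none filter already
does for commits and trees; a toy sketch with invented names, confirming the
full history is walkable locally right after the clone:

```shell
# blob:none brings over every commit and tree in the initial clone.
set -e
tmp=$(mktemp -d); cd "$tmp"

git init -q server
for i in 1 2 3; do
  echo "$i" >server/f.txt
  git -C server add f.txt
  git -C server -c user.email=a@b.c -c user.name=t commit -q -m "c$i"
done
git -C server config uploadpack.allowfilter true

git clone -q --no-local --no-checkout --filter=blob:none server client

# Full commit history is present locally; no extra round trips needed.
git -C client rev-list --count HEAD
```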

> > Now, I can be totally wrong in my expectation that there is _no_ scenario
> > where cloning with a "partial depth" would cause anything but poor
> > performance. If I am wrong, then there is value in having this feature,
> > but since it causes undesirable performance in all cases I can think of,
> > it definitely should be guarded behind an opt-in flag.
>
> Well, now I think this depth filter might be a better fit for git fetch.

I disagree here, because I see all the same challenges as I described for
clones missing entire commit ranges.

> If git checkout or other commands which just need to check
> few commits, and find almost all objects (maybe >= 75%) in a
> commit are not local, it can use this depth filter to download them.

If you want a clone that does not show any reasonable commit history
because it does not fetch commit objects on-the-fly, then we already have
such a thing with shallow clones.

The only way to make Git's revision walking logic perform _somewhat_
reasonably would be to teach it to fetch not just a single commit object
when it was asked for, but to somehow pass a desired depth by which to
"unshallow" automatically.

However, such a feature would come with the same undesirable implications
on the server side as shallow clones (fetches into shallow clones are
_really_ expensive on the server side).

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 12+ messages in thread


* Re: [PATCH 0/3] list-object-filter: introduce depth filter
  2022-09-07 10:18       ` Johannes Schindelin
@ 2022-09-11 10:59         ` ZheNing Hu
  0 siblings, 0 replies; 12+ messages in thread
From: ZheNing Hu @ 2022-09-11 10:59 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Derrick Stolee, ZheNing Hu via GitGitGadget, Git List,
	Christian Couder, Ævar Arnfjörð Bjarmason,
	Jeff King, Jeff Hostetler, Junio C Hamano

Johannes Schindelin <Johannes.Schindelin@gmx.de> 于2022年9月7日周三 18:18写道:
>
> Hi ZheNing,
>
> On Sun, 4 Sep 2022, ZheNing Hu wrote:
>
> > Johannes Schindelin <Johannes.Schindelin@gmx.de> 于2022年9月2日周五 21:48写道:
> >
> > > [...]
> > > When you have all the commit and tree objects on the local side,
> > > you can enumerate all the blob objects you need in one fell swoop, then
> > > fetch them in a single network round trip.
> > >
> > > When you lack tree objects, or worse, commit objects, this is not true.
> > > You may very well need to fetch _quite_ a bunch of objects, then inspect
> > > them to find out that you need to fetch more tree/commit objects, and then
> > > a couple more round trips, before you can enumerate all of the objects you
> > > need.
> >
> > I think this is because the previous design was that you had to fetch
> > these missing commits (also trees) and all their ancestors. Maybe we can
> > modify git rev-list to make it understand missing commits...
>
> We do have such a modification, and it is called "shallow clone" ;-)
>
> Granted, shallow clones are not a complete solution and turned out to be a
> dead end (i.e. that design cannot be extended into anything more useful).

Yeah, the depth filter might overcome this shortcoming, but it could incur
a lot of network overhead in some special cases.

> But that approach demonstrates what it would take to implement a logic
> whereby Git understands that some commit ranges are missing and should not
> be fetched automatically.
>

Agreed. Git uses commit grafts to do so.

> > > [...] it is hard to think of a way how the design could result in
> > > anything but undesirable behavior, both on the client and the server
> > > side.
> > >
> > > We also have to consider that our experience with large repositories
> > > demonstrates that tree and commit objects delta pretty well and are
> > > virtually never a concern when cloning. It is always the sheer amount
> > > of blob objects that is causing poor user experience when performing
> > > non-partial clones of large repositories.
> >
> > Thanks, I think I understand the problem here. By the way, does it make
> > sense to download just some of the commits/trees in some big repository
> > which have several million commits/trees?
>
> It probably only makes sense if we can come up with a good idea how to
> teach Git the trick to stop downloading so many objects in costly
> roundtrips.
>

Good advice. Perhaps we could merge these multiple requests into one.
Maybe we should use a blob:none filter to download all missing trees/commits
when we need to iterate through the whole commit history.
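For the record, stock git already behaves roughly this way: a `blob:none` clone transfers all commits and trees up front, and a later checkout enumerates the blobs it needs and backfills them from the promisor remote. A rough local sketch (throwaway paths; `--no-local` is only there to force a transport that honors filters):

```shell
# Sketch: blob:none partial clone, then on-demand blob backfill at checkout.
set -e
tmp=$(mktemp -d)
git -c init.defaultBranch=main init -q "$tmp/src"
echo one > "$tmp/src/a.txt"; echo two > "$tmp/src/b.txt"
git -C "$tmp/src" add .
git -C "$tmp/src" -c user.email=a@example.com -c user.name=a commit -q -m files
git -C "$tmp/src" config uploadpack.allowfilter true
git clone -q --filter=blob:none --no-checkout --no-local "$tmp/src" "$tmp/dst"
# all commits/trees are present; both blobs are still missing
git -C "$tmp/dst" rev-list --objects --missing=print HEAD | grep -c '^?'   # 2
# checkout enumerates the needed blobs and fetches them from the promisor remote
git -C "$tmp/dst" checkout -q HEAD -- .
cat "$tmp/dst/a.txt"   # one
```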

> But I wonder whether your scenarios are so different from the ones I
> encountered, in that commit and tree objects do _not_ delta well on your
> side?
>
> If they _do_ delta well, i.e. if it is comparatively cheap to just fetch
> them all in one go, it probably makes more sense to just drop the idea of
> fetching only some commit/tree objects but not others in a partial clone,
> and always fetch all of 'em.
>

Delta is a wonderful thing most of the time (in cases where bulk acquisition
is required). But sometimes users just want to see the message of a single
commit, so why should they have to download other commits/trees that are
not required?

Sometimes users may understand the access patterns of their git objects
better than the git server does. It would be nice if the user could download
a specified object mapped directly by its object id (that only works for
blobs now, right?)
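Fetching a single missing blob by its object id is in fact what the lazy-fetch path does in a partial clone. A local sketch, assuming the serving repository opts in with `uploadpack.allowFilter` and `uploadpack.allowAnySHA1InWant` (as git's own partial-clone tests do):

```shell
# Sketch: in a blob:none partial clone, ask for exactly one missing blob by id.
set -e
tmp=$(mktemp -d)
git -c init.defaultBranch=main init -q "$tmp/src"
echo hello > "$tmp/src/file.txt"
git -C "$tmp/src" add file.txt
git -C "$tmp/src" -c user.email=a@example.com -c user.name=a commit -q -m add
git -C "$tmp/src" config uploadpack.allowfilter true
git -C "$tmp/src" config uploadpack.allowanysha1inwant true
git clone -q --filter=blob:none --no-checkout --no-local "$tmp/src" "$tmp/dst"
# resolving the path only reads trees, which are local; no blob is fetched yet
oid=$(git -C "$tmp/dst" rev-parse HEAD:file.txt)
# printing the blob triggers an on-demand fetch of just that object
git -C "$tmp/dst" cat-file blob "$oid"   # hello
```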

> > > Now, I can be totally wrong in my expectation that there is _no_ scenario
> > > where cloning with a "partial depth" would cause anything but poor
> > > performance. If I am wrong, then there is value in having this feature,
> > > but since it causes undesirable performance in all cases I can think of,
> > > it definitely should be guarded behind an opt-in flag.
> >
> > Well, now I think this depth filter might be a better fit for git fetch.
>
> I disagree here, because I see all the same challenges as I described for
> clones missing entire commit ranges.
>

Oh, a prerequisite is missing here: after we have all the commits and trees,
we would then use the depth filter to download the missing blobs.

> > If git checkout or other commands which just need to check
> > few commits, and find almost all objects (maybe >= 75%) in a
> > commit are not local, it can use this depth filter to download them.
>
> If you want a clone that does not show any reasonable commit history
> because it does not fetch commit objects on-the-fly, then we already have
> such a thing with shallow clones.
>
> The only way to make Git's revision walking logic perform _somewhat_
> reasonably would be to teach it to fetch not just a single commit object
> when it was asked for, but to somehow pass a desired depth by which to
> "unshallow" automatically.
>
> However, such a feature would come with the same undesirable implications
> on the server side as shallow clones (fetches into shallow clones are
> _really_ expensive on the server side).
>

Agreed. Teaching git shallow clone to be smarter may work, but there are
big challenges too.

> Ciao,
> Dscho

Thanks,
ZheNing Hu


end of thread, other threads:[~2022-09-11 10:59 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-01  9:41 [PATCH 0/3] list-object-filter: introduce depth filter ZheNing Hu via GitGitGadget
2022-09-01  9:41 ` [PATCH 1/3] commit-graph: let commit graph respect commit graft ZheNing Hu via GitGitGadget
2022-09-01 19:18   ` Derrick Stolee
2022-09-04  5:57     ` ZheNing Hu
2022-09-01  9:41 ` [PATCH 2/3] list-object-filter: pass traversal_context in filter_init_fn ZheNing Hu via GitGitGadget
2022-09-01  9:41 ` [PATCH 3/3] list-object-filter: introduce depth filter ZheNing Hu via GitGitGadget
2022-09-01 19:24 ` [PATCH 0/3] " Derrick Stolee
2022-09-02 13:48   ` Johannes Schindelin
2022-09-04  9:14     ` ZheNing Hu
2022-09-07 10:18       ` Johannes Schindelin
2022-09-11 10:59         ` ZheNing Hu
2022-09-04  7:27   ` ZheNing Hu
