git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH] index-pack: remove fetch_if_missing=0
@ 2023-02-25  5:24 Kousik Sanagavarapu
  2023-02-27 16:56 ` Kousik Sanagavarapu
                   ` (3 more replies)
  0 siblings, 4 replies; 20+ messages in thread
From: Kousik Sanagavarapu @ 2023-02-25  5:24 UTC (permalink / raw)
  To: git; +Cc: jonathantanmy, five231003

A collision test is triggered in sha1_object(), whenever there is an
object file in our repo. If our repo is a partial clone, then checking
for this file existence has the behavior of lazy-fetching the object
because we have one or more promisor remotes.

This behavior is controlled by setting fetch_if_missing to 0, but this
global was added in the first place as a temporary measure to suppress
the fetching of missing objects and can be removed once the commands
have been taught to handle these cases.

Hence, use has_object() to check for the existence of an object, which
has the default behavior of not lazy-fetching in a partial clone.

Signed-off-by: Kousik Sanagavarapu <five231003@gmail.com>
---
 builtin/index-pack.c     | 11 +----------
 t/t5616-partial-clone.sh | 28 ++++++++++++++++++++++++++++
 2 files changed, 29 insertions(+), 10 deletions(-)

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 6648f2daef..8c0f36a49e 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -800,8 +800,7 @@ static void sha1_object(const void *data, struct object_entry *obj_entry,
 
 	if (startup_info->have_repository) {
 		read_lock();
-		collision_test_needed =
-			has_object_file_with_flags(oid, OBJECT_INFO_QUICK);
+		collision_test_needed = has_object(the_repository, oid, 0);
 		read_unlock();
 	}
 
@@ -1728,14 +1727,6 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 	int report_end_of_input = 0;
 	int hash_algo = 0;
 
-	/*
-	 * index-pack never needs to fetch missing objects except when
-	 * REF_DELTA bases are missing (which are explicitly handled). It only
-	 * accesses the repo to do hash collision checks and to check which
-	 * REF_DELTA bases need to be fetched.
-	 */
-	fetch_if_missing = 0;
-
 	if (argc == 2 && !strcmp(argv[1], "-h"))
 		usage(index_pack_usage);
 
diff --git a/t/t5616-partial-clone.sh b/t/t5616-partial-clone.sh
index 037941b95d..4658ce0866 100755
--- a/t/t5616-partial-clone.sh
+++ b/t/t5616-partial-clone.sh
@@ -644,6 +644,34 @@ test_expect_success 'repack does not loosen promisor objects' '
 	grep "loosen_unused_packed_objects/loosened:0" trace
 '
 
+test_expect_success 'index-pack does not lazy-fetch when checking for sha1 collisions' '
+	rm -rf server promisor-remote client &&
+	rm -rf object-count &&
+
+	git init server &&
+	for i in 1 2 3 4
+	do
+		echo $i >$(pwd)/server/file$i &&
+		git -C server add file$i &&
+		git -C server commit -am "Commit $i" || return 1
+	done &&
+	git -C server config --local uploadpack.allowFilter 1 &&
+	git -C server config --local uploadpack.allowAnySha1InWant 1 &&
+	HASH=$(git -C server hash-object file3) &&
+
+	git init promisor-remote &&
+	git -C promisor-remote fetch --keep "file://$(pwd)/server" $HASH &&
+
+	git clone --no-checkout --filter=blob:none "file://$(pwd)/server" client &&
+	git -C client remote set-url origin "file://$(pwd)/promisor-remote" &&
+	git -C client config extensions.partialClone 1 &&
+	git -C client config remote.origin.promisor 1 &&
+
+	# make sure that index-pack is run from within the repository
+	git -C client index-pack $(pwd)/client/.git/objects/pack/*.pack &&
+	test_path_is_missing $(pwd)/client/file3
+'
+
 . "$TEST_DIRECTORY"/lib-httpd.sh
 start_httpd
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH] index-pack: remove fetch_if_missing=0
  2023-02-25  5:24 [PATCH] index-pack: remove fetch_if_missing=0 Kousik Sanagavarapu
@ 2023-02-27 16:56 ` Kousik Sanagavarapu
  2023-02-27 22:14 ` Jonathan Tan
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 20+ messages in thread
From: Kousik Sanagavarapu @ 2023-02-27 16:56 UTC (permalink / raw)
  To: git; +Cc: five231003

Waiting for review.

Thanks

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] index-pack: remove fetch_if_missing=0
  2023-02-25  5:24 [PATCH] index-pack: remove fetch_if_missing=0 Kousik Sanagavarapu
  2023-02-27 16:56 ` Kousik Sanagavarapu
@ 2023-02-27 22:14 ` Jonathan Tan
  2023-02-28  3:54   ` Kousik Sanagavarapu
  2023-03-10 18:30 ` [PATCH v2] " Kousik Sanagavarapu
  2023-03-11 20:01 ` [PATCH] " Sean Allred
  3 siblings, 1 reply; 20+ messages in thread
From: Jonathan Tan @ 2023-02-27 22:14 UTC (permalink / raw)
  To: Kousik Sanagavarapu; +Cc: Jonathan Tan, git

Kousik Sanagavarapu <five231003@gmail.com> writes:
> A collision test is triggered in sha1_object(), whenever there is an
> object file in our repo. If our repo is a partial clone, then checking
> for this file existence has the behavior of lazy-fetching the object
> because we have one or more promisor remotes.

Hmm...this is not true, because (as you said)...
 
> This behavior is controlled by setting fetch_if_missing to 0,

...this makes it so that we don't fetch in this situation.

> but this
> global was added in the first place as a temporary measure to suppress
> the fetching of missing objects and can be removed once the commands
> have been taught to handle these cases.

Yes, that's true.

> @@ -1728,14 +1727,6 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
>  	int report_end_of_input = 0;
>  	int hash_algo = 0;
>  
> -	/*
> -	 * index-pack never needs to fetch missing objects except when
> -	 * REF_DELTA bases are missing (which are explicitly handled). It only
> -	 * accesses the repo to do hash collision checks and to check which
> -	 * REF_DELTA bases need to be fetched.
> -	 */
> -	fetch_if_missing = 0;

I think that the author of such a commit (you) should also independently
verify that this comment is true (and if it is, then yes, all the
remaining cases are handled and we can remove this assignment to
fetch_if_missing). I believe this comment to be true, but I haven't
checked the code in a while so I'm not sure myself.

> +test_expect_success 'index-pack does not lazy-fetch when checking for sha1 collisions' '
> +	rm -rf server promisor-remote client &&
> +	rm -rf object-count &&
> +
> +	git init server &&
> +	for i in 1 2 3 4
> +	do
> +		echo $i >$(pwd)/server/file$i &&
> +		git -C server add file$i &&
> +		git -C server commit -am "Commit $i" || return 1
> +	done &&
> +	git -C server config --local uploadpack.allowFilter 1 &&
> +	git -C server config --local uploadpack.allowAnySha1InWant 1 &&
> +	HASH=$(git -C server hash-object file3) &&
> +
> +	git init promisor-remote &&
> +	git -C promisor-remote fetch --keep "file://$(pwd)/server" $HASH &&
> +
> +	git clone --no-checkout --filter=blob:none "file://$(pwd)/server" client &&
> +	git -C client remote set-url origin "file://$(pwd)/promisor-remote" &&
> +	git -C client config extensions.partialClone 1 &&
> +	git -C client config remote.origin.promisor 1 &&
> +
> +	# make sure that index-pack is run from within the repository
> +	git -C client index-pack $(pwd)/client/.git/objects/pack/*.pack &&
> +	test_path_is_missing $(pwd)/client/file3
> +'

How does this check that no lazy fetch has occurred? It seems to me
that you're just checking the existence of a file in the worktree,
which does not indicate the presence or absence of a lazy fetch.

I think the way to test needs to be more complicated: you need
to create a partial clone, fetch into it from another repo, and
then verify that no fetches were made to the original partial
clone.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] index-pack: remove fetch_if_missing=0
  2023-02-27 22:14 ` Jonathan Tan
@ 2023-02-28  3:54   ` Kousik Sanagavarapu
  0 siblings, 0 replies; 20+ messages in thread
From: Kousik Sanagavarapu @ 2023-02-28  3:54 UTC (permalink / raw)
  To: jonathantanmy; +Cc: git, five231003

On Tue, 28 Feb 2023 at 03:44, Jonathan Tan <jonathantanmy@google.com> wrote:
>
> Kousik Sanagavarapu <five231003@gmail.com> writes:
> > A collision test is triggered in sha1_object(), whenever there is an
> > object file in our repo. If our repo is a partial clone, then checking
> > for this file existence has the behavior of lazy-fetching the object
> > because we have one or more promisor remotes.
>
> Hmm...this is not true, because (as you said)...
>
> > This behavior is controlled by setting fetch_if_missing to 0,
>
> ...this makes it so that we don't fetch in this situation.

Yes, that statement is false if fetch_if_missing is set to 0. But my original
thought in writing it was so that the anyone who is reading the commit message
understands the motivation as to why we are setting fetch_if_missing to 0.

> [...]
>
> > @@ -1728,14 +1727,6 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
> >       int report_end_of_input = 0;
> >       int hash_algo = 0;
> >
> > -     /*
> > -      * index-pack never needs to fetch missing objects except when
> > -      * REF_DELTA bases are missing (which are explicitly handled). It only
> > -      * accesses the repo to do hash collision checks and to check which
> > -      * REF_DELTA bases need to be fetched.
> > -      */
> > -     fetch_if_missing = 0;
>
> I think that the author of such a commit (you) should also independently
> verify that this comment is true (and if it is, then yes, all the
> remaining cases are handled and we can remove this assignment to
> fetch_if_missing). I believe this comment to be true, but I haven't
> checked the code in a while so I'm not sure myself.

It seems indeed that this is the only place where lazy-fetching is possible.
I checked this by looking up the calls for oid_object_info_extended() or
any other function in object-file.c which depends on it.

In builtin/index-pack.c, we have (in the order that these functions appear)

- check_object()
    Call to oid_object_info(), but we return early with
    0 if we don't have an object.

- sha1_object()
    Call to has_object_file_with_flags() (which this patch replaces with
    has_object()), where lazy-fetching is possible.
    
    Calls to oid_object_info() and read_object_file(), which trigger only
    when the above has_object_file_with_flags() succeeds.

- fix_unresolved_deltas()
    Call to oid_object_info_extended(), we prefetch delta bases.

    Call to read_object_file(), but we only read data from ref_delta_entry.
    In case it was a delta base, we already prefetched it.

There are cases where we fsck objects, but lazy-fetching is already handled
in fsck (although by setting fetch_if_missing to 0).

Do we need to be explicit about this in the commit message? That sha1_object()
is the only place where there is a chance to lazy-fetch if it is a partial clone?

> > +test_expect_success 'index-pack does not lazy-fetch when checking for sha1 collisions' '
> > +     rm -rf server promisor-remote client &&
> > +     rm -rf object-count &&
> > +
> > +     git init server &&
> > +     for i in 1 2 3 4
> > +     do
> > +             echo $i >$(pwd)/server/file$i &&
> > +             git -C server add file$i &&
> > +             git -C server commit -am "Commit $i" || return 1
> > +     done &&
> > +     git -C server config --local uploadpack.allowFilter 1 &&
> > +     git -C server config --local uploadpack.allowAnySha1InWant 1 &&
> > +     HASH=$(git -C server hash-object file3) &&
> > +
> > +     git init promisor-remote &&
> > +     git -C promisor-remote fetch --keep "file://$(pwd)/server" $HASH &&
> > +
> > +     git clone --no-checkout --filter=blob:none "file://$(pwd)/server" client &&
> > +     git -C client remote set-url origin "file://$(pwd)/promisor-remote" &&
> > +     git -C client config extensions.partialClone 1 &&
> > +     git -C client config remote.origin.promisor 1 &&
> > +
> > +     # make sure that index-pack is run from within the repository
> > +     git -C client index-pack $(pwd)/client/.git/objects/pack/*.pack &&
> > +     test_path_is_missing $(pwd)/client/file3
> > +'
>
> How does this check that no lazy fetch has occurred? It seems to me
> that you're just checking the existence of a file in the worktree,
> which does not indicate the presence or absence of a lazy fetch.

What I had in mind was if the file was lazy-fetched (because of the failure
of has_object_file_with_flags() and fetch_if_missing not set to 0), then
it would be unpacked and we would find it in the worktree. Since, we
prevent this exact behavior by using has_object(), we should not find
such a file in our repo.

> I think the way to test needs to be more complicated: you need
> to create a partial clone, fetch into it from another repo, and
> then verify that no fetches were made to the original partial
> clone.

So, after the fetch, during the pack indexing phase, we look for
any additional fetches made. This makes more sense and it would
be way more clear, to anyone reading, than what I wrote.

Will do a reroll. If there needs to be a change in the commit message
as well, please let me know.

Thanks for the review

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v2] index-pack: remove fetch_if_missing=0
  2023-02-25  5:24 [PATCH] index-pack: remove fetch_if_missing=0 Kousik Sanagavarapu
  2023-02-27 16:56 ` Kousik Sanagavarapu
  2023-02-27 22:14 ` Jonathan Tan
@ 2023-03-10 18:30 ` Kousik Sanagavarapu
  2023-03-10 20:30   ` Junio C Hamano
  2023-03-13 18:15   ` [PATCH v3] " Kousik Sanagavarapu
  2023-03-11 20:01 ` [PATCH] " Sean Allred
  3 siblings, 2 replies; 20+ messages in thread
From: Kousik Sanagavarapu @ 2023-03-10 18:30 UTC (permalink / raw)
  To: git; +Cc: jonathantanmy, five231003

A collision test is triggered in sha1_object(), whenever there is an
object file in our repo. If our repo is a partial clone, then checking
for this file existence does not lazy-fetch the object (if the object
is missing and if there are one or more promisor remotes) when
fetch_if_missing is set to 0.

This global was added as a temporary measure to suppress the fetching
of missing objects and can be removed once the commandshave been taught
to handle these cases.

Hence, use has_object() to check for the existence of an object, which
has the default behavior of not lazy-fetching in a partial clone. It is
worth mentioning that this is the only place where there is potential for
lazy-fetching and all other cases are properly handled, making it safe to
remove this global here.

Signed-off-by: Kousik Sanagavarapu <five231003@gmail.com>
---

Sorry for the late reroll, I was having semester-end exams.

Changes since v1:
- Changed the commit message to be more clear about the
  change done here.

- Changed the test according to the previous review.

 builtin/index-pack.c     | 11 +----------
 t/t5616-partial-clone.sh | 35 +++++++++++++++++++++++++++++++++++
 2 files changed, 36 insertions(+), 10 deletions(-)

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 6648f2daef..8c0f36a49e 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -800,8 +800,7 @@ static void sha1_object(const void *data, struct object_entry *obj_entry,
 
 	if (startup_info->have_repository) {
 		read_lock();
-		collision_test_needed =
-			has_object_file_with_flags(oid, OBJECT_INFO_QUICK);
+		collision_test_needed = has_object(the_repository, oid, 0);
 		read_unlock();
 	}
 
@@ -1728,14 +1727,6 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 	int report_end_of_input = 0;
 	int hash_algo = 0;
 
-	/*
-	 * index-pack never needs to fetch missing objects except when
-	 * REF_DELTA bases are missing (which are explicitly handled). It only
-	 * accesses the repo to do hash collision checks and to check which
-	 * REF_DELTA bases need to be fetched.
-	 */
-	fetch_if_missing = 0;
-
 	if (argc == 2 && !strcmp(argv[1], "-h"))
 		usage(index_pack_usage);
 
diff --git a/t/t5616-partial-clone.sh b/t/t5616-partial-clone.sh
index f519d2a87a..46af8698ce 100755
--- a/t/t5616-partial-clone.sh
+++ b/t/t5616-partial-clone.sh
@@ -644,6 +644,41 @@ test_expect_success 'repack does not loosen promisor objects' '
 	grep "loosen_unused_packed_objects/loosened:0" trace
 '
 
+test_expect_success 'index-pack does not lazy-fetch when checking for sha1 collsions' '
+	rm -rf server promisor-remote client repo trace &&
+
+	# setup
+	git init server &&
+	for i in 1 2 3 4
+	do
+		echo $i >server/file$i &&
+		git -C server add file$i &&
+		git -C server commit -am "Commit $i" || return 1
+	done &&
+	git -C server config --local uploadpack.allowFilter 1 &&
+	git -C server config --local uploadpack.allowAnySha1InWant 1 &&
+	HASH=$(git -C server hash-object file3) &&
+
+	git init promisor-remote &&
+	git -C promisor-remote fetch --keep "file://$(pwd)/server" &&
+
+	git clone --no-checkout --filter=blob:none "file://$(pwd)/server" client &&
+	git -C client remote set-url origin "file://$(pwd)/promisor-remote" &&
+	git -C client config extensions.partialClone 1 &&
+	git -C client config remote.origin.promisor 1 &&
+
+	git init repo &&
+	echo "5" >repo/file5 &&
+	git -C repo config --local uploadpack.allowFilter 1 &&
+	git -C repo config --local uploadpack.allowAnySha1InWant 1 &&
+
+	# verify that no lazy-fetching is done when fetching from another repo
+	GIT_TRACE_PACKET="$(pwd)/trace" git -C client \
+					fetch --keep "file://$(pwd)/repo" main &&
+
+	! grep "want $HASH" trace
+'
+
 test_expect_success 'lazy-fetch in submodule succeeds' '
 	# setup
 	test_config_global protocol.file.allow always &&
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v2] index-pack: remove fetch_if_missing=0
  2023-03-10 18:30 ` [PATCH v2] " Kousik Sanagavarapu
@ 2023-03-10 20:30   ` Junio C Hamano
  2023-03-10 21:13     ` Jonathan Tan
  2023-03-11  6:00     ` Kousik Sanagavarapu
  2023-03-13 18:15   ` [PATCH v3] " Kousik Sanagavarapu
  1 sibling, 2 replies; 20+ messages in thread
From: Junio C Hamano @ 2023-03-10 20:30 UTC (permalink / raw)
  To: Kousik Sanagavarapu; +Cc: git, jonathantanmy

Kousik Sanagavarapu <five231003@gmail.com> writes:

> This global was added as a temporary measure to suppress the fetching
> of missing objects and can be removed once the commandshave been taught
> to handle these cases.

Two requests.

 * Could you substantiate this claim for future readers of "git
   log"?  A reference to an old mailing list discussion or a log
   message of the commit that added the temporary measure that says
   the above plan would be perfect.

 * What exactly does "once the commands have been taught"?  Which
   commands?  Could you clarify?

> Hence, use has_object() to check for the existence of an object, which
> has the default behavior of not lazy-fetching in a partial clone. It is
> worth mentioning that this is the only place where there is potential for
> lazy-fetching and all other cases are properly handled, making it safe to
> remove this global here.

This paragraph is very well explained.

> @@ -1728,14 +1727,6 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
>  	int report_end_of_input = 0;
>  	int hash_algo = 0;
>  
> -	/*
> -	 * index-pack never needs to fetch missing objects except when
> -	 * REF_DELTA bases are missing (which are explicitly handled). It only
> -	 * accesses the repo to do hash collision checks and to check which
> -	 * REF_DELTA bases need to be fetched.
> -	 */

OK.  The comment describes the design choice we made to flip the
fetch_if_missing flag off.  The old world-view was that we would
notice a breakage by non-functioning index-pack when a lazy clone is
missing objects that we need by disabling auto-fetching, and we
instead explicitly handle any missing and necessary objects by lazy
fetching (like "when we lack REF_DELTA bases").  It does sound like
a conservative thing to do, compared to the opposite approach we are
taking with this patch, i.e. we would not fail if we tried to access
objects we do not need to, because we have lazy fetching enabled,
and we just ended up with bloated object store nobody may notice.

To protect us from future breakage that can come from the new
approach, it is a very good thing that you added new tests to ensure
no unnecessary lazy fetching is done (I am not offhand sure if that
test is sufficient, though).

> -	fetch_if_missing = 0;

Looking good to me.  Jonathan, who reviewed the previous round, do
you have any comments?

Thanks, all.  Will queue.

> diff --git a/t/t5616-partial-clone.sh b/t/t5616-partial-clone.sh
> index f519d2a87a..46af8698ce 100755
> --- a/t/t5616-partial-clone.sh
> +++ b/t/t5616-partial-clone.sh
> @@ -644,6 +644,41 @@ test_expect_success 'repack does not loosen promisor objects' '
>  	grep "loosen_unused_packed_objects/loosened:0" trace
>  '
>  
> +test_expect_success 'index-pack does not lazy-fetch when checking for sha1 collsions' '
> +	rm -rf server promisor-remote client repo trace &&
> +
> +	# setup
> +	git init server &&
> +	for i in 1 2 3 4
> +	do
> +		echo $i >server/file$i &&
> +		git -C server add file$i &&
> +		git -C server commit -am "Commit $i" || return 1
> +	done &&
> +	git -C server config --local uploadpack.allowFilter 1 &&
> +	git -C server config --local uploadpack.allowAnySha1InWant 1 &&
> +	HASH=$(git -C server hash-object file3) &&
> +
> +	git init promisor-remote &&
> +	git -C promisor-remote fetch --keep "file://$(pwd)/server" &&
> +
> +	git clone --no-checkout --filter=blob:none "file://$(pwd)/server" client &&
> +	git -C client remote set-url origin "file://$(pwd)/promisor-remote" &&
> +	git -C client config extensions.partialClone 1 &&
> +	git -C client config remote.origin.promisor 1 &&
> +
> +	git init repo &&
> +	echo "5" >repo/file5 &&
> +	git -C repo config --local uploadpack.allowFilter 1 &&
> +	git -C repo config --local uploadpack.allowAnySha1InWant 1 &&
> +
> +	# verify that no lazy-fetching is done when fetching from another repo
> +	GIT_TRACE_PACKET="$(pwd)/trace" git -C client \
> +					fetch --keep "file://$(pwd)/repo" main &&
> +
> +	! grep "want $HASH" trace
> +'
> +
>  test_expect_success 'lazy-fetch in submodule succeeds' '
>  	# setup
>  	test_config_global protocol.file.allow always &&

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2] index-pack: remove fetch_if_missing=0
  2023-03-10 20:30   ` Junio C Hamano
@ 2023-03-10 21:13     ` Jonathan Tan
  2023-03-10 21:41       ` Junio C Hamano
  2023-03-11  6:22       ` [PATCH] " Kousik Sanagavarapu
  2023-03-11  6:00     ` Kousik Sanagavarapu
  1 sibling, 2 replies; 20+ messages in thread
From: Jonathan Tan @ 2023-03-10 21:13 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jonathan Tan, Kousik Sanagavarapu, git

Junio C Hamano <gitster@pobox.com> writes:
> > Hence, use has_object() to check for the existence of an object, which
> > has the default behavior of not lazy-fetching in a partial clone. It is
> > worth mentioning that this is the only place where there is potential for
> > lazy-fetching and all other cases are properly handled, making it safe to
> > remove this global here.
> 
> This paragraph is very well explained.

It might be good if the "all other cases" were enumerated here in the
commit message (since the consequence of missing a case might be an
infinite loop of fetching).

> OK.  The comment describes the design choice we made to flip the
> fetch_if_missing flag off.  The old world-view was that we would
> notice a breakage by non-functioning index-pack when a lazy clone is
> missing objects that we need by disabling auto-fetching, and we
> instead explicitly handle any missing and necessary objects by lazy
> fetching (like "when we lack REF_DELTA bases").  It does sound like
> a conservative thing to do, compared to the opposite approach we are
> taking with this patch, i.e. we would not fail if we tried to access
> objects we do not need to, because we have lazy fetching enabled,
> and we just ended up with bloated object store nobody may notice.
> 
> To protect us from future breakage that can come from the new
> approach, it is a very good thing that you added new tests to ensure
> no unnecessary lazy fetching is done (I am not offhand sure if that
> test is sufficient, though).

I don't think the test is sufficient - I'll explain that below.

> > +test_expect_success 'index-pack does not lazy-fetch when checking for sha1 collsions' '
> > +	rm -rf server promisor-remote client repo trace &&
> > +
> > +	# setup
> > +	git init server &&
> > +	for i in 1 2 3 4
> > +	do
> > +		echo $i >server/file$i &&
> > +		git -C server add file$i &&
> > +		git -C server commit -am "Commit $i" || return 1
> > +	done &&
> > +	git -C server config --local uploadpack.allowFilter 1 &&
> > +	git -C server config --local uploadpack.allowAnySha1InWant 1 &&
> > +	HASH=$(git -C server hash-object file3) &&
> > +
> > +	git init promisor-remote &&
> > +	git -C promisor-remote fetch --keep "file://$(pwd)/server" &&
> > +
> > +	git clone --no-checkout --filter=blob:none "file://$(pwd)/server" client &&
> > +	git -C client remote set-url origin "file://$(pwd)/promisor-remote" &&
> > +	git -C client config extensions.partialClone 1 &&
> > +	git -C client config remote.origin.promisor 1 &&
> > +
> > +	git init repo &&
> > +	echo "5" >repo/file5 &&
> > +	git -C repo config --local uploadpack.allowFilter 1 &&
> > +	git -C repo config --local uploadpack.allowAnySha1InWant 1 &&

The file5 isn't committed?

> > +
> > +	# verify that no lazy-fetching is done when fetching from another repo
> > +	GIT_TRACE_PACKET="$(pwd)/trace" git -C client \
> > +					fetch --keep "file://$(pwd)/repo" main &&
> > +
> > +	! grep "want $HASH" trace
> > +'

It seems to me that this test clones a repo and then attempts to fetch
from another repo: so far, so good. But I don't think this tests what
we want: firstly, the file5 isn't committed, so it is never fetched. And
even if it was, we only check that file3 was never fetched from "$(pwd)/
server". But file3 has nothing to do with the subsequent fetch: we are
only fetching file5. It is the hash of file5 that we are checking for
collisions, and thus it is file5 that we want to verify is not fetched.

So I think the way to do this is to have 3 repositories like the author
is doing now (server, client, and repo), and do it as follows:
 - create "server", one commit will do
 - clone "server" into "client" (partial clone)
 - clone "server" into "another-remote" (not partial clone)
 - add a file ("new-file") to "server", commit it, and pull from "another-remote"
 - fetch from "another-remote" into "client"

This way, "client" will need to verify that the hash of "new-file" has
no collisions with any object it currently has. If there is no bug,
"new-file" will never be fetched from "server", and if there is a bug,
"new-file" will be fetched.

One problem is that if there is a bug, such a test will cause an
infinite loop (we fetch "new-file", so we want to check it for
collisions, and because of the bug, we fetch "new-file" again, which we
check for collisions, and so on) which might be problematic for things
like CI. But we might be able to treat timeouts as the same as test
failures, so this should be OK.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2] index-pack: remove fetch_if_missing=0
  2023-03-10 21:13     ` Jonathan Tan
@ 2023-03-10 21:41       ` Junio C Hamano
  2023-03-11  2:59         ` Jonathan Tan
  2023-03-12 17:16         ` Kousik Sanagavarapu
  2023-03-11  6:22       ` [PATCH] " Kousik Sanagavarapu
  1 sibling, 2 replies; 20+ messages in thread
From: Junio C Hamano @ 2023-03-10 21:41 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Kousik Sanagavarapu, git

Jonathan Tan <jonathantanmy@google.com> writes:

> Junio C Hamano <gitster@pobox.com> writes:
>> > Hence, use has_object() to check for the existence of an object, which
>> > has the default behavior of not lazy-fetching in a partial clone. It is
>> > worth mentioning that this is the only place where there is potential for
>> > lazy-fetching and all other cases are properly handled, making it safe to
>> > remove this global here.
>> 
>> This paragraph is very well explained.
>
> It might be good if the "all other cases" were enumerated here in the
> commit message (since the consequence of missing a case might be an
> infinite loop of fetching).
>
>> OK.  The comment describes the design choice we made to flip the
>> fetch_if_missing flag off.  The old world-view was that we would
>> notice a breakage by non-functioning index-pack when a lazy clone is
>> missing objects that we need by disabling auto-fetching, and we
>> instead explicitly handle any missing and necessary objects by lazy
>> fetching (like "when we lack REF_DELTA bases").  It does sound like
>> a conservative thing to do, compared to the opposite approach we are
>> taking with this patch, i.e. we would not fail if we tried to access
>> objects we do not need to, because we have lazy fetching enabled,
>> and we just ended up with bloated object store nobody may notice.
>> 
>> To protect us from future breakage that can come from the new
>> approach, it is a very good thing that you added new tests to ensure
>> no unnecessary lazy fetching is done (I am not offhand sure if that
>> test is sufficient, though).
>
> I don't think the test is sufficient - I'll explain that below.

I admit I haven't thought about it any longer than anybody who
touched this topic, but should "fetch_if_missing=0" really be
treated as "it was a dirty hack in the past, now we do not need it,
as all callers into the object layer avoids lazy fetching when they
do not have to, so let's remove it"?  It looks to me more and more
that the old world-view to disable lazy fetching by default and have
individual calls to the object layer opt into fetching as needed may
give us a better resulting code, or is it just me?  The possible
error modes in new code that fails to follow the world-view with and
without this change are:

 * If the lazy fetching is disabled by default (i.e. without this
   patch), a new code can by mistake call has_object(), which does
   not lazy fetch, when it does need to have the object and should
   be using something like has_object_file_with_flags(), and dies
   loudly.

 * If the lazy fetching is enabled by default, on the other hand, a
   new code can by mistake call has_object_file_with_flags(), which
   does lazy fetch, when it does not need to have the object.  It
   does not die, it just lazily fetches objects it does not need.
   The (performance) "bug" will stay hidden until somebody complains.

In short, the world-view of the current code seems to give us
tighter control over what gets lazy fetched, simply because we do
not allow lazy fetching without thinking.

Do we have other uses of fetch_if_missing (i.e. disable lazy
fetching)?

    $ git grep -l fetch_if_missing
    Documentation/technical/partial-clone.txt
    builtin/fetch-pack.c
    builtin/fsck.c
    builtin/pack-objects.c
    builtin/prune.c
    builtin/rev-list.c
    cache.h
    midx.c
    object-file.c
    revision.c

As the default is 1, all these hits (outside the header, doc, and
object-file.c) are to disable lazy fetching.  Judging from the list
of "family" that want tighter control over what gets fetched, I have
a feeling that pack-index may want to stay to be in the family.

Or am I missing some big picture goal to eventually getting rid of
this mechanism and always allowing lazy fetching?

Thanks.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2] index-pack: remove fetch_if_missing=0
  2023-03-10 21:41       ` Junio C Hamano
@ 2023-03-11  2:59         ` Jonathan Tan
  2023-03-12 17:16         ` Kousik Sanagavarapu
  1 sibling, 0 replies; 20+ messages in thread
From: Jonathan Tan @ 2023-03-11  2:59 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jonathan Tan, Kousik Sanagavarapu, git

Junio C Hamano <gitster@pobox.com> writes:
> Do we have other uses of fetch_if_missing (i.e. disable lazy
> fetching)?
> 
>     $ git grep -l fetch_if_missing
>     Documentation/technical/partial-clone.txt
>     builtin/fetch-pack.c
>     builtin/fsck.c
>     builtin/pack-objects.c
>     builtin/prune.c
>     builtin/rev-list.c
>     cache.h
>     midx.c
>     object-file.c
>     revision.c
> 
> As the default is 1, all these hits (outside the header, doc, and
> object-file.c) are to disable lazy fetching.  Judging from the list
> of "family" that want tighter control over what gets fetched, I have
> a feeling that pack-index may want to stay to be in the family.

I think this "family" concept is a good way to think of it. I did
use to think that it would be better to be consistent throughout Git
and choose one world-view, and if I had to choose, it would be the one
without fetch_if_missing=0. But now it does make sense to me to have
two families:

 (a) The more low-level code that the lazy fetching itself relies on
     (and maybe things like builtin/fsck.c as well) where we really need
     to be careful about what we fetch, and it would be better to err
     on the side of not fetching. The test cases for these would need to
     cover both the partial clone cases and the regular cases.

     For these cases, the consequence of lazy-fetching when we shouldn't
     might be as bad as an infinite loop, so it makes sense to default
     not lazy-fetching here.

 (b) The more high-level code, in which I think that it is better to err
     on the side of fetching. The test cases would generally not need to
     cover the partial clone cases (except when there are specific
     optimizations needed, such as in checkout where we bulk prefetch
     missing objects).

     For these cases, the consequences of lazy-fetching when we shouldn't
     are generally performance-related, so it might not be so bad to let
     development happen in these areas of code without great
     consideration to whether a lazy-fetch would happen if an object
     didn't exist. (I do think it would be ideal for all new code to pay
     attention to when they read objects, which would help not only in
     partial clone but also in a potential future in which we have non-
     disk object stores, but we're probably not there yet as a project.)

And indeed, pack-index would go in (a).

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] index-pack: remove fetch_if_missing=0
  2023-03-10 20:30   ` Junio C Hamano
  2023-03-10 21:13     ` Jonathan Tan
@ 2023-03-11  6:00     ` Kousik Sanagavarapu
  1 sibling, 0 replies; 20+ messages in thread
From: Kousik Sanagavarapu @ 2023-03-11  6:00 UTC (permalink / raw)
  To: gitster; +Cc: git, jonathantanmy

On Sat, 11 Mar 2023 at 02:01, Junio C Hamano <gitster@pobox.com> wrote:
>
> [...]
>
> Two requests.
>
>  * Could you substantiate this claim for future readers of "git
>    log"?  A reference to an old mailing list discussion or a log
>    message of the commit that added the temporary measure that says
>    the above plan would be perfect.
>
>  * What exactly does "once the commands have been taught"?  Which
>    commands?  Could you clarify?
>

Will do the change.

> > Hence, use has_object() to check for the existence of an object, which
> > has the default behavior of not lazy-fetching in a partial clone. It is
> > worth mentioning that this is the only place where there is potential for
> > lazy-fetching and all other cases are properly handled, making it safe to
> > remove this global here.
>
> This paragraph is very well explained.
>

Thanks.

> > @@ -1728,14 +1727,6 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
> >       int report_end_of_input = 0;
> >       int hash_algo = 0;
> > 
> > -     /*
> > -      * index-pack never needs to fetch missing objects except when
> > -      * REF_DELTA bases are missing (which are explicitly handled). It only
> > -      * accesses the repo to do hash collision checks and to check which
> > -      * REF_DELTA bases need to be fetched.
> > -      */
>
> OK.  The comment describes the design choice we made to flip the
> fetch_if_missing flag off.  The old world-view was that we would
> notice a breakage by non-functioning index-pack when a lazy clone is
> missing objects that we need by disabling auto-fetching, and we
> instead explicitly handle any missing and necessary objects by lazy
> fetching (like "when we lack REF_DELTA bases").  It does sound like
> a conservative thing to do, compared to the opposite approach we are
> taking with this patch, i.e. we would not fail if we tried to access
> objects we do not need to, because we have lazy fetching enabled,
> and we just ended up with bloated object store nobody may notice.
>
> To protect us from future breakage that can come from the new
> approach, it is a very good thing that you added new tests to ensure
> no unnecessary lazy fetching is done (I am not offhand sure if that
> test is sufficient, though).
>
> > -     fetch_if_missing = 0;
>
> Looking good to me.  Jonathan, who reviewed the previous round, do
> you have any comments?
>
> Thanks, all.  Will queue.
>

It does seem that the tests need to be changed significantly. Will do.

> > diff --git a/t/t5616-partial-clone.sh b/t/t5616-partial-clone.sh
> > index f519d2a87a..46af8698ce 100755
> > --- a/t/t5616-partial-clone.sh
> > +++ b/t/t5616-partial-clone.sh
> > @@ -644,6 +644,41 @@ test_expect_success 'repack does not loosen promisor objects' '
> >       grep "loosen_unused_packed_objects/loosened:0" trace
> >  '
> > 
> > +test_expect_success 'index-pack does not lazy-fetch when checking for sha1 collsions' '
> > +     rm -rf server promisor-remote client repo trace &&
> > +
> > +     # setup
> > +     git init server &&
> > +     for i in 1 2 3 4
> > +     do
> > +             echo $i >server/file$i &&
> > +             git -C server add file$i &&
> > +             git -C server commit -am "Commit $i" || return 1
> > +     done &&
> > +     git -C server config --local uploadpack.allowFilter 1 &&
> > +     git -C server config --local uploadpack.allowAnySha1InWant 1 &&
> > +     HASH=$(git -C server hash-object file3) &&
> > +
> > +     git init promisor-remote &&
> > +     git -C promisor-remote fetch --keep "file://$(pwd)/server" &&
> > +
> > +     git clone --no-checkout --filter=blob:none "file://$(pwd)/server" client &&
> > +     git -C client remote set-url origin "file://$(pwd)/promisor-remote" &&
> > +     git -C client config extensions.partialClone 1 &&
> > +     git -C client config remote.origin.promisor 1 &&
> > +
> > +     git init repo &&
> > +     echo "5" >repo/file5 &&
> > +     git -C repo config --local uploadpack.allowFilter 1 &&
> > +     git -C repo config --local uploadpack.allowAnySha1InWant 1 &&
> > +
> > +     # verify that no lazy-fetching is done when fetching from another repo
> > +     GIT_TRACE_PACKET="$(pwd)/trace" git -C client \
> > +                                     fetch --keep "file://$(pwd)/repo" main &&
> > +
> > +     ! grep "want $HASH" trace
> > +'
> > +
> >  test_expect_success 'lazy-fetch in submodule succeeds' '
> >       # setup
> >       test_config_global protocol.file.allow always &&

Thanks

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] index-pack: remove fetch_if_missing=0
  2023-03-10 21:13     ` Jonathan Tan
  2023-03-10 21:41       ` Junio C Hamano
@ 2023-03-11  6:22       ` Kousik Sanagavarapu
  1 sibling, 0 replies; 20+ messages in thread
From: Kousik Sanagavarapu @ 2023-03-11  6:22 UTC (permalink / raw)
  To: jonathantanmy; +Cc: git, gitster

On Sat, 11 Mar 2023 at 02:43, Jonathan Tan <jonathantanmy@google.com> wrote:
>
> Junio C Hamano <gitster@pobox.com> writes:
> > > Hence, use has_object() to check for the existence of an object, which
> > > has the default behavior of not lazy-fetching in a partial clone. It is
> > > worth mentioning that this is the only place where there is potential for
> > > lazy-fetching and all other cases are properly handled, making it safe to
> > > remove this global here.
> >
> > This paragraph is very well explained.
>
> It might be good if the "all other cases" were enumerated here in the
> commit message (since the consequence of missing a case might be an
> infinite loop of fetching).
>

I will make the change.

> > OK.  The comment describes the design choice we made to flip the
> > fetch_if_missing flag off.  The old world-view was that we would
> > notice a breakage by non-functioning index-pack when a lazy clone is
> > missing objects that we need by disabling auto-fetching, and we
> > instead explicitly handle any missing and necessary objects by lazy
> > fetching (like "when we lack REF_DELTA bases").  It does sound like
> > a conservative thing to do, compared to the opposite approach we are
> > taking with this patch, i.e. we would not fail if we tried to access
> > objects we do not need to, because we have lazy fetching enabled,
> > and we just ended up with bloated object store nobody may notice.
> >
> > To protect us from future breakage that can come from the new
> > approach, it is a very good thing that you added new tests to ensure
> > no unnecessary lazy fetching is done (I am not offhand sure if that
> > test is sufficient, though).
>
> I don't think the test is sufficient - I'll explain that below.
>
> > > +test_expect_success 'index-pack does not lazy-fetch when checking for sha1 collsions' '
> > > +   rm -rf server promisor-remote client repo trace &&
> > > +
> > > +   # setup
> > > +   git init server &&
> > > +   for i in 1 2 3 4
> > > +   do
> > > +           echo $i >server/file$i &&
> > > +           git -C server add file$i &&
> > > +           git -C server commit -am "Commit $i" || return 1
> > > +   done &&
> > > +   git -C server config --local uploadpack.allowFilter 1 &&
> > > +   git -C server config --local uploadpack.allowAnySha1InWant 1 &&
> > > +   HASH=$(git -C server hash-object file3) &&
> > > +
> > > +   git init promisor-remote &&
> > > +   git -C promisor-remote fetch --keep "file://$(pwd)/server" &&
> > > +
> > > +   git clone --no-checkout --filter=blob:none "file://$(pwd)/server" client &&
> > > +   git -C client remote set-url origin "file://$(pwd)/promisor-remote" &&
> > > +   git -C client config extensions.partialClone 1 &&
> > > +   git -C client config remote.origin.promisor 1 &&
> > > +
> > > +   git init repo &&
> > > +   echo "5" >repo/file5 &&
> > > +   git -C repo config --local uploadpack.allowFilter 1 &&
> > > +   git -C repo config --local uploadpack.allowAnySha1InWant 1 &&
>
> The file5 isn't committed?

That is a blunder.

>
> [...]
>
> So I think the way to do this is to have 3 repositories like the author
> is doing now (server, client, and repo), and do it as follows:
>  - create "server", one commit will do
>  - clone "server" into "client" (partial clone)
>  - clone "server" into "another-remote" (not partial clone)
>  - add a file ("new-file") to "server", commit it, and pull from "another-remote"
>  - fetch from "another-remote" into "client"
>
> This way, "client" will need to verify that the hash of "new-file" has
> no collisions with any object it currently has. If there is no bug,
> "new-file" will never be fetched from "server", and if there is a bug,
> "new-file" will be fetched.
>

So, we can lose the "promisor-remote" in the original test and make the
"server" itself a promisor-remote?

Thanks for the review

> One problem is that if there is a bug, such a test will cause an
> infinite loop (we fetch "new-file", so we want to check it for
> collisions, and because of the bug, we fetch "new-file" again, which we
> check for collisions, and so on) which might be problematic for things
> like CI. But we might be able to treat timeouts as the same as test
> failures, so this should be OK.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] index-pack: remove fetch_if_missing=0
  2023-02-25  5:24 [PATCH] index-pack: remove fetch_if_missing=0 Kousik Sanagavarapu
                   ` (2 preceding siblings ...)
  2023-03-10 18:30 ` [PATCH v2] " Kousik Sanagavarapu
@ 2023-03-11 20:01 ` Sean Allred
  2023-03-11 20:37   ` Junio C Hamano
  3 siblings, 1 reply; 20+ messages in thread
From: Sean Allred @ 2023-03-11 20:01 UTC (permalink / raw)
  To: Kousik Sanagavarapu; +Cc: git, jonathantanmy


Kousik Sanagavarapu <five231003@gmail.com> writes:

> Hence, use has_object() to check for the existence of an object, which
> has the default behavior of not lazy-fetching in a partial clone.

Any chance this fixes behavior[1] I was seeing where pushing from a
depth=1, treeless clone was fetching content?

[1]: https://lore.kernel.org/git/m0lekp4rjv.fsf@epic96565.epic.com/T/#m83ba30c1bf91106fca61efff6619c491543034e4

--
Sean Allred

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] index-pack: remove fetch_if_missing=0
  2023-03-11 20:01 ` [PATCH] " Sean Allred
@ 2023-03-11 20:37   ` Junio C Hamano
  0 siblings, 0 replies; 20+ messages in thread
From: Junio C Hamano @ 2023-03-11 20:37 UTC (permalink / raw)
  To: Sean Allred; +Cc: Kousik Sanagavarapu, git, jonathantanmy

Sean Allred <allred.sean@gmail.com> writes:

> Kousik Sanagavarapu <five231003@gmail.com> writes:
>
>> Hence, use has_object() to check for the existence of an object, which
>> has the default behavior of not lazy-fetching in a partial clone.
>
> Any chance this fixes behavior[1] I was seeing ...

If I understand correctly, Kousik's change is meant to be a clean-up
that does not change any behaviour, so if it changes some behaviour
for you, it means the patch is buggy.

But I think you can build a version of Git without the patch
(earlier this week would have been a good time to do so to test the
release candidate #2 for the upcoming 2.40) to see if you have the
behaviour still, and then apply this patch to see if it fixes it,
and then report your findings here.  It would make a great
contribution to the community.

Thanks.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2] index-pack: remove fetch_if_missing=0
  2023-03-10 21:41       ` Junio C Hamano
  2023-03-11  2:59         ` Jonathan Tan
@ 2023-03-12 17:16         ` Kousik Sanagavarapu
  1 sibling, 0 replies; 20+ messages in thread
From: Kousik Sanagavarapu @ 2023-03-12 17:16 UTC (permalink / raw)
  To: gitster; +Cc: git, jonathantanmy

On Sat, 11 Mar 2023 at 03:11, Junio C Hamano <gitster@pobox.com> wrote:
>
> I admit I haven't thought about it any longer than anybody who
> touched this topic, but should "fetch_if_missing=0" really be
> treated as "it was a dirty hack in the past, now we do not need it,
> as all callers into the object layer avoids lazy fetching when they
> do not have to, so let's remove it"?  It looks to me more and more
> that the old world-view to disable lazy fetching by default and have
> individual calls to the object layer opt into fetching as needed may
> give us a better resulting code, or is it just me?

I think having a single function to check for object existence, which is
compatible with partial clones is better, because the end goal is to
completely integrate the concept of partial clones with git's codebase
and not have code that worries "Oh, there maybe bugs here because
what if the user has a partial clone", everytime the code does an object
existence check (that is, a call to has_object_file() or any of its
related functions) and just have fetch_if_missing set to 1 or 0, according
to the particular command.

It is also true that has_object() is not that "single function", because
in cases where we are missing an object in partial clone and want it,
has_object() has no way of fetching it. With or without flags (it only
supports one flag which, when set, rechecks packed storage), it does not
lazy-fetch in a partial clone. But there are cases where we need such
objects, such as the commands that come into "family (b)" [1].

So, why not use oid_object_info_extended() directly, instead of wrapping
it with some other function, whenever we are checking for an object's
existence. We can skip lazy-fetches whenever we want with
OBJECT_INFO_SKIP_FETCH_OBJECT and can also prefetch with
OBJECT_INFO_FOR_PREFETCH [2].

[1] https://lore.kernel.org/git/20230311025906.4170554-1-jonathantanmy@google.com/

[2] pack-index itself is one example of where this is done.

    When we don't have REF_DELTA bases, we bulk prefetch them

	    if (has_promisor_remote()) {
                     /*
                      * Prefetch the delta bases.
                      */
                     struct oid_array to_fetch = OID_ARRAY_INIT;
		     for (i = 0; i < nr_ref_deltas; i++) {
                             struct ref_delta_entry *d = sorted_by_pos[i];
                             if (!oid_object_info_extended(the_repository, &d->oid,
                                                           NULL,
                                                           OBJECT_INFO_FOR_PREFETCH))
                                     continue;
                             oid_array_append(&to_fetch, &d->oid);
                     }
                     promisor_remote_get_direct(the_repository,
                                                to_fetch.oid, to_fetch.nr);
                     oid_array_clear(&to_fetch);
             }

    Instead of going object-by-object, which is basically like an
    infinite loop in large repos and partial clones are widely used
    in large repos.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v3] index-pack: remove fetch_if_missing=0
  2023-03-10 18:30 ` [PATCH v2] " Kousik Sanagavarapu
  2023-03-10 20:30   ` Junio C Hamano
@ 2023-03-13 18:15   ` Kousik Sanagavarapu
  2023-03-13 19:17     ` Junio C Hamano
                       ` (2 more replies)
  1 sibling, 3 replies; 20+ messages in thread
From: Kousik Sanagavarapu @ 2023-03-13 18:15 UTC (permalink / raw)
  To: git; +Cc: jonathanmy, gitster, fve231003, Kousik Sanagavarapu, Jonathan Tan

A collision test is triggered in sha1_object(), whenever there is an
object file in our repo. If our repo is a partial clone, then checking
for this file existence does not lazy-fetch the object (if the object
is missing and if there are one or more promisor remotes) when
fetch_if_missing is set to 0.

This global was added as a temporary measure to suppress the fetching
of missing objects [1] and can be removed once the remaining commands:
 - fetch-pack
 - fsck
 - pack-objects
 - prune
 - rev-list
can handle lazy-fetching without fetch_if_missing.

Hence, use has_object() to check for the existence of an object, which
has the default behavior of not lazy-fetching in a partial clone. It is
worth mentioning that this is the only place where there is potential for
lazy-fetching and all other cases [2] are properly handled, making it safe
to remove this global here.

[1] See 8b4c0103a9 (sha1_file: support lazily fetching missing objects,
		   2017-12-08)
[2] These cases are:
    - When we check objects, but we return with 0 early if the object
      doesn't exist.
    - We prefetch delta bases in a partial clone, if we don't have them
      (as the comment outlines).
    - There are some cases where we fsck objects, but lazy-fetching is
      already handled in fsck.

Helped-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Kousik Sanagavarapu <five231003@gmail.com>
---
 builtin/index-pack.c     | 11 +----------
 t/t5616-partial-clone.sh | 28 ++++++++++++++++++++++++++++
 2 files changed, 29 insertions(+), 10 deletions(-)

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 6648f2daef..8c0f36a49e 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -800,8 +800,7 @@ static void sha1_object(const void *data, struct object_entry *obj_entry,
 
 	if (startup_info->have_repository) {
 		read_lock();
-		collision_test_needed =
-			has_object_file_with_flags(oid, OBJECT_INFO_QUICK);
+		collision_test_needed = has_object(the_repository, oid, 0);
 		read_unlock();
 	}
 
@@ -1728,14 +1727,6 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 	int report_end_of_input = 0;
 	int hash_algo = 0;
 
-	/*
-	 * index-pack never needs to fetch missing objects except when
-	 * REF_DELTA bases are missing (which are explicitly handled). It only
-	 * accesses the repo to do hash collision checks and to check which
-	 * REF_DELTA bases need to be fetched.
-	 */
-	fetch_if_missing = 0;
-
 	if (argc == 2 && !strcmp(argv[1], "-h"))
 		usage(index_pack_usage);
 
diff --git a/t/t5616-partial-clone.sh b/t/t5616-partial-clone.sh
index f519d2a87a..fdb34a0b50 100755
--- a/t/t5616-partial-clone.sh
+++ b/t/t5616-partial-clone.sh
@@ -644,6 +644,34 @@ test_expect_success 'repack does not loosen promisor objects' '
 	grep "loosen_unused_packed_objects/loosened:0" trace
 '
 
+test_expect_success 'index-pack does not lazy-fetch when checking for sha1 collsions' '
+	rm -rf server client another-remote &&
+
+	git init server &&
+	echo "line" >server/file &&
+	git -C server add file &&
+	git -C server commit -am "file" &&
+	git -C server config --local uploadpack.allowFilter 1 &&
+	git -C server config --local uploadpack.allowAnySha1InWant 1 &&
+
+	git clone --no-checkout --filter=blob:none "file://$(pwd)/server" client &&
+	git -C client config extensions.partialClone 1 &&
+	git -C client config remote.origin.promisor 1 &&
+
+	git clone "file://$(pwd)/server" another-remote &&
+
+	echo "new line" >server/new-file &&
+	git -C server add new-file &&
+	git -C server commit -am "new-file" &&
+
+	git -C another-remote pull &&
+
+	# Try to fetch so that "client" will have to do a collision-check.
+	# This should, however, not fetch "new-file" because "client" is a
+	# partial clone.
+	git -C client fetch "file://$(pwd)/another-remote" main
+'
+
 test_expect_success 'lazy-fetch in submodule succeeds' '
 	# setup
 	test_config_global protocol.file.allow always &&
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v3] index-pack: remove fetch_if_missing=0
  2023-03-13 18:15   ` [PATCH v3] " Kousik Sanagavarapu
@ 2023-03-13 19:17     ` Junio C Hamano
  2023-03-13 19:18     ` Junio C Hamano
  2023-03-17 17:56     ` [PATCH v4] " Kousik Sanagavarapu
  2 siblings, 0 replies; 20+ messages in thread
From: Junio C Hamano @ 2023-03-13 19:17 UTC (permalink / raw)
  To: Kousik Sanagavarapu; +Cc: git, jonathanmy, fve231003, Jonathan Tan

Kousik Sanagavarapu <five231003@gmail.com> writes:

> A collision test is triggered in sha1_object(), whenever there is an
> object file in our repo. If our repo is a partial clone, then checking
> for this file existence does not lazy-fetch the object (if the object
> is missing and if there are one or more promisor remotes) when
> fetch_if_missing is set to 0.
> ...
> Hence, use has_object() to check for the existence of an object, which
> has the default behavior of not lazy-fetching in a partial clone. It is
> worth mentioning that this is the only place where there is potential for
> lazy-fetching and all other cases [2] are properly handled, making it safe
> to remove this global here.
>
> [1] See 8b4c0103a9 (sha1_file: support lazily fetching missing objects,
> 		   2017-12-08)

Thanks for this reference.

The way I read the "lazy fetching is by default not suppressed, and
this is a temporary measure" described in the log message is quite
opposite from where this patch wants to go, though.  

I think the commit envisioned the world where all the accesses to
the object layer are aware of the characteristics of a lazy clone
(e.g. has_object() reporting "we do not have that object locally
(yet)" is not an immediate sign of a repository corruption) and when
to lazily fetch and when to tolerate locally missing objects is
controlled more tightly, and to reach that world one step at a time,
introduced the global, so that for now everybody lazily fetches, but
the commands individual patches concentrates to "fix" can turn off
the "by default all missing objects are lazily fetched" so that they
can either allow certain objects to be locally missing, or fetch
them from promisor remotes when they need the contents of such
objects.  By fixing each commands one by one, eventually we would be
able to wean ourselves away from this "by default everything is
lazily fetched" global---in other words, in the ideal endgame, the
fetch_if_missing should be set to 0 everywhere.

So, if this patch was made in reaction to the "it was a temporary
measure" in 8b4c0103 (sha1_file: support lazily fetching missing
objects, 2017-12-08), I think it goes in the completely opposite
direction.  If the patch shared the cause with 8b4c0103 and wanted
to help realize the ideal world, it instead should have left this
command who can already work with fetch_if_missing=0 alone and fixed
somebody else who still depends on fetch_if_missing=1, I think.

Now it is a separate issue to argue if "everybody knows exactly when
to trigger lazy fetching and fetch_if_missing is set to false
everywhere" is really the ideal endgame.  I do not think "Future
patches will update some commands to either tolerate missing objects
or be more efficient in fetching them." proposed by the commit from
late 2017 has seen that much advance recently.

But for commands that need to deal with many missing objects,
enumerating the objects that are missing locally and need fetching
first and then requesting them in a batch should be vastly more
efficient than the default lazy fetch logic that lets the caller
request a single object, realize it is missing locally, make a
connection to fetch that single object and disconnect.  So I have to
suspect that ...

> This global was added as a temporary measure to suppress the fetching
> of missing objects [1] and can be removed once the remaining commands:
>  - fetch-pack
>  - fsck
>  - pack-objects
>  - prune
>  - rev-list
> can handle lazy-fetching without fetch_if_missing.

... this "can handle" may be a misguided direction to go in.  They
were taught not to lazy fetch because blindly lazy fetching was bad,
weren't they?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3] index-pack: remove fetch_if_missing=0
  2023-03-13 18:15   ` [PATCH v3] " Kousik Sanagavarapu
  2023-03-13 19:17     ` Junio C Hamano
@ 2023-03-13 19:18     ` Junio C Hamano
  2023-03-17 17:56     ` [PATCH v4] " Kousik Sanagavarapu
  2 siblings, 0 replies; 20+ messages in thread
From: Junio C Hamano @ 2023-03-13 19:18 UTC (permalink / raw)
  To: Kousik Sanagavarapu; +Cc: git, jonathanmy, fve231003, Jonathan Tan

Kousik Sanagavarapu <five231003@gmail.com> writes:

> A collision test is triggered in sha1_object(), whenever there is an
> object file in our repo. If our repo is a partial clone, then checking
> for this file existence does not lazy-fetch the object (if the object
> is missing and if there are one or more promisor remotes) when
> fetch_if_missing is set to 0.
> ...
> Hence, use has_object() to check for the existence of an object, which
> has the default behavior of not lazy-fetching in a partial clone. It is
> worth mentioning that this is the only place where there is potential for
> lazy-fetching and all other cases [2] are properly handled, making it safe
> to remove this global here.
>
> [1] See 8b4c0103a9 (sha1_file: support lazily fetching missing objects,
> 		   2017-12-08)

Thanks for the reference.

The way I read the "lazy fetching is by default not suppressed, and
this is a temporary measure" described in the log message is quite
opposite from where this patch wants to go, though.  

I think the commit envisioned the world where all the accesses to
the object layer are aware of the characteristics of a lazy clone
(e.g. has_object() reporting "we do not have that object locally
(yet)" is not an immediate sign of a repository corruption) and when
to lazily fetch and when to tolerate locally missing objects is
controlled more tightly, and to reach that world one step at a time,
introduced the global, so that for now everybody lazily fetches, but
the commands individual patches concentrates to "fix" can turn off
the "by default all missing objects are lazily fetched" so that they
can either allow certain objects to be locally missing, or fetch
them from promisor remotes when they need the contents of such
objects.  By fixing each commands one by one, eventually we would be
able to wean ourselves away from this "by default everything is
lazily fetched" global---in other words, in the ideal endgame, the
fetch_if_missing should be set to 0 everywhere.

So, if this patch was made in reaction to the "it was a temporary
measure" in 8b4c0103 (sha1_file: support lazily fetching missing
objects, 2017-12-08), I think it goes in the completely opposite
direction.  If the patch shared the cause with 8b4c0103 and wanted
to help realize the ideal world, it instead should have left this
command who can already work with fetch_if_missing=0 alone and fixed
somebody else who still depends on fetch_if_missing=1, I think.

Now it is a separate issue to argue if "everybody knows exactly when
to trigger lazy fetching and fetch_if_missing is set to false
everywhere" is really the ideal endgame.  I do not think "Future
patches will update some commands to either tolerate missing objects
or be more efficient in fetching them." proposed by the commit from
late 2017 has seen that much advance recently.

But for commands that need to deal with many missing objects,
enumerating the objects that are missing locally and need fetching
first and then requesting them in a batch should be vastly more
efficient than the default lazy fetch logic that lets the caller
request a single object, realize it is missing locally, make a
connection to fetch that single object and disconnect.  So I have to
suspect that ...

> This global was added as a temporary measure to suppress the fetching
> of missing objects [1] and can be removed once the remaining commands:
>  - fetch-pack
>  - fsck
>  - pack-objects
>  - prune
>  - rev-list
> can handle lazy-fetching without fetch_if_missing.

... this "can handle" may be a misguided direction to go in.  They
were taught not to lazy fetch because blindly lazy fetching was bad,
weren't they?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v4] index-pack: remove fetch_if_missing=0
  2023-03-13 18:15   ` [PATCH v3] " Kousik Sanagavarapu
  2023-03-13 19:17     ` Junio C Hamano
  2023-03-13 19:18     ` Junio C Hamano
@ 2023-03-17 17:56     ` Kousik Sanagavarapu
  2023-03-17 22:58       ` Junio C Hamano
  2 siblings, 1 reply; 20+ messages in thread
From: Kousik Sanagavarapu @ 2023-03-17 17:56 UTC (permalink / raw)
  To: git; +Cc: gitster, Kousik Sanagavarapu, Jonathan Tan

A collision test is triggered in sha1_object(), whenever there is an
object file in our repo. If our repo is a partial clone, then checking
for this file existence does not lazy-fetch the object (if the object
is missing and if there are one or more promisor remotes) when
fetch_if_missing is set to 0.

Though this global lets us control lazy-fetching in regions of code,
it prevents multi-threading [1].

Hence, use has_object() to check for the existence of an object, which
has the default behavior of not lazy-fetching in a partial clone, even
when fetch_if_missing is set to 1. It is worth mentioning that this is
the only place where there is potential for lazy-fetching and all other
cases [2] are properly handled, making it safe to remove this global
here.

[1] See https://lore.kernel.org/git/xmqqv9sdeeif.fsf@gitster-ct.c.googlers.com/
    and the discussion that follows from it.

[2] These cases are:
    - When we check objects, but we return with 0 early if the object
      doesn't exist.
    - We prefetch delta bases in a partial clone, if we don't have them
      (as the comment outlines).
    - There are some cases where we fsck objects, but lazy-fetching is
      already handled in fsck.

Helped-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Kousik Sanagavarapu <five231003@gmail.com>
---

Changes since v3:
- Changed the commit message to give a stronger reason as to
  why we should reduce the use of this global and in this
  process remove it from builtin/index-pack.c

- Also made an addition to the test which I overlooked last time
  and without which the does not make sense

 builtin/index-pack.c     | 11 +----------
 t/t5616-partial-clone.sh | 33 +++++++++++++++++++++++++++++++++
 2 files changed, 34 insertions(+), 10 deletions(-)

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 6648f2daef..8c0f36a49e 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -800,8 +800,7 @@ static void sha1_object(const void *data, struct object_entry *obj_entry,
 
 	if (startup_info->have_repository) {
 		read_lock();
-		collision_test_needed =
-			has_object_file_with_flags(oid, OBJECT_INFO_QUICK);
+		collision_test_needed = has_object(the_repository, oid, 0);
 		read_unlock();
 	}
 
@@ -1728,14 +1727,6 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 	int report_end_of_input = 0;
 	int hash_algo = 0;
 
-	/*
-	 * index-pack never needs to fetch missing objects except when
-	 * REF_DELTA bases are missing (which are explicitly handled). It only
-	 * accesses the repo to do hash collision checks and to check which
-	 * REF_DELTA bases need to be fetched.
-	 */
-	fetch_if_missing = 0;
-
 	if (argc == 2 && !strcmp(argv[1], "-h"))
 		usage(index_pack_usage);
 
diff --git a/t/t5616-partial-clone.sh b/t/t5616-partial-clone.sh
index f519d2a87a..41fa7130f1 100755
--- a/t/t5616-partial-clone.sh
+++ b/t/t5616-partial-clone.sh
@@ -644,6 +644,39 @@ test_expect_success 'repack does not loosen promisor objects' '
 	grep "loosen_unused_packed_objects/loosened:0" trace
 '
 
+test_expect_success 'index-pack does not lazy-fetch
+		     when checking for sha1 collsions' '
+	rm -rf server client another-remote &&
+
+	git init server &&
+	echo "line" >server/file &&
+	git -C server add file &&
+	git -C server commit -am "file" &&
+	git -C server config --local uploadpack.allowFilter 1 &&
+	git -C server config --local uploadpack.allowAnySha1InWant 1 &&
+
+	git clone --no-checkout --filter=blob:none \
+				"file://$(pwd)/server" client &&
+	git -C client config extensions.partialClone 1 &&
+	git -C client config remote.origin.promisor 1 &&
+
+	git clone "file://$(pwd)/server" another-remote &&
+
+	echo "new line" >server/new-file &&
+	git -C server add new-file &&
+	git -C server commit -am "new-file" &&
+	HASH=$(git -C server hash-object new-file) &&
+
+	git -C another-remote pull &&
+
+	# Try to fetch so that "client" will have to do a collision-check.
+	# This should, however, not fetch "new-file" because "client" is a
+	# partial clone.
+	GIT_TRACE_PACKET=git -C client fetch \
+				"file://$(pwd)/another-remote" main &&
+	! grep "want $HASH" trace
+'
+
 test_expect_success 'lazy-fetch in submodule succeeds' '
 	# setup
 	test_config_global protocol.file.allow always &&
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v4] index-pack: remove fetch_if_missing=0
  2023-03-17 17:56     ` [PATCH v4] " Kousik Sanagavarapu
@ 2023-03-17 22:58       ` Junio C Hamano
  2023-03-19  6:17         ` Kousik Sanagavarapu
  0 siblings, 1 reply; 20+ messages in thread
From: Junio C Hamano @ 2023-03-17 22:58 UTC (permalink / raw)
  To: Kousik Sanagavarapu; +Cc: git, Jonathan Tan

Kousik Sanagavarapu <five231003@gmail.com> writes:

> A collision test is triggered in sha1_object(), whenever there is an
> object file in our repo. If our repo is a partial clone, then checking
> for this file existence does not lazy-fetch the object (if the object
> is missing and if there are one or more promisor remotes) when
> fetch_if_missing is set to 0.
>
> Though this global lets us control lazy-fetching in regions of code,
> it prevents multi-threading [1].

Sorry, but I really do not see the point.

We already have read_lock/read_unlock to prevent multiple threads
from stomping on the in-core object database structure either way.

If somebody needs to dynamically change the value of fetch_if_missing
after the program started and spawned multiple threads, yes, the update
to the single variable would become a problem point in multi-threading.

But that is not what we are doing, and you already discovered that
this was done as "a temporary measure" to selectively let some
programs use 0 and others use 1 for lazy-fetching, at a very early
part of these programs.

If we are to reduce this global, perhaps we should teach more
codepaths not to lazy fetch by default.  Once everybody gets
converted like so, then index-pack can lose the assignment of 0 to
the variable, as the global variable would be initialized to 0 and
nobody will flip it to 1 to "temporarily opt into lazy fetching by
default until it gets fixed".  At that point, we can lose the global
variable.

So "we want to reduce the use of this global" is not a good reason
to do this change at all, without a convincing argument that says
why everybody should do automatic lazy fetching of objects.  If
everybody should avoid doing automatic lazy fetching, a good first
step to reduce the use of this global is not to touch index-pack
that has already been fixed not to do so, no?



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4] index-pack: remove fetch_if_missing=0
  2023-03-17 22:58       ` Junio C Hamano
@ 2023-03-19  6:17         ` Kousik Sanagavarapu
  0 siblings, 0 replies; 20+ messages in thread
From: Kousik Sanagavarapu @ 2023-03-19  6:17 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Jonathan Tan

On 18/03/23 04:28, Junio C Hamano wrote:

> Kousik Sanagavarapu <five231003@gmail.com> writes:
>
>> A collision test is triggered in sha1_object(), whenever there is an
>> object file in our repo. If our repo is a partial clone, then checking
>> for this file existence does not lazy-fetch the object (if the object
>> is missing and if there are one or more promisor remotes) when
>> fetch_if_missing is set to 0.
>>
>> Though this global lets us control lazy-fetching in regions of code,
>> it prevents multi-threading [1].
> Sorry, but I really do not see the point.
>
> We already have read_lock/read_unlock to prevent multiple threads
> from stomping on the in-core object database structure either way.
>
> If somebody needs to dynamically change the value of fetch_if_missing
> after the program started and spawned multiple threads, yes, the update
> to the single variable would become a problem point in multi-threading.
>
> But that is not what we are doing, and you already discovered that
> this was done as "a temporary measure" to selectively let some
> programs use 0 and others use 1 for lazy-fetching, at a very early
> part of these programs.
>
> If we are to reduce this global, perhaps we should teach more
> codepaths not to lazy fetch by default.  Once everybody gets
> converted like so, then index-pack can lose the assignment of 0 to
> the variable, as the global variable would be initialized to 0 and
> nobody will flip it to 1 to "temporarily opt into lazy fetching by
> default until it gets fixed".  At that point, we can lose the global
> variable.
>
> So "we want to reduce the use of this global" is not a good reason
> to do this change at all, without a convincing argument that says
> why everybody should do automatic lazy fetching of objects.  If
> everybody should avoid doing automatic lazy fetching, a good first
> step to reduce the use of this global is not to touch index-pack
> that has already been fixed not to do so, no?

Thanks for the review.


Also, thanks for pointing out the direction of work in this area.

Really helpful.


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2023-03-19  6:17 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-25  5:24 [PATCH] index-pack: remove fetch_if_missing=0 Kousik Sanagavarapu
2023-02-27 16:56 ` Kousik Sanagavarapu
2023-02-27 22:14 ` Jonathan Tan
2023-02-28  3:54   ` Kousik Sanagavarapu
2023-03-10 18:30 ` [PATCH v2] " Kousik Sanagavarapu
2023-03-10 20:30   ` Junio C Hamano
2023-03-10 21:13     ` Jonathan Tan
2023-03-10 21:41       ` Junio C Hamano
2023-03-11  2:59         ` Jonathan Tan
2023-03-12 17:16         ` Kousik Sanagavarapu
2023-03-11  6:22       ` [PATCH] " Kousik Sanagavarapu
2023-03-11  6:00     ` Kousik Sanagavarapu
2023-03-13 18:15   ` [PATCH v3] " Kousik Sanagavarapu
2023-03-13 19:17     ` Junio C Hamano
2023-03-13 19:18     ` Junio C Hamano
2023-03-17 17:56     ` [PATCH v4] " Kousik Sanagavarapu
2023-03-17 22:58       ` Junio C Hamano
2023-03-19  6:17         ` Kousik Sanagavarapu
2023-03-11 20:01 ` [PATCH] " Sean Allred
2023-03-11 20:37   ` Junio C Hamano

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).