git@vger.kernel.org mailing list mirror (one of many)
* [PATCH 0/5] Optimization batch 13: partial clone optimizations for merge-ort
@ 2021-06-05  1:27 Elijah Newren via GitGitGadget
  2021-06-05  1:28 ` [PATCH 1/5] promisor-remote: output trace2 statistics for number of objects fetched Elijah Newren via GitGitGadget
                   ` (5 more replies)
  0 siblings, 6 replies; 30+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-06-05  1:27 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Derrick Stolee, Taylor Blau, Elijah Newren

This series optimizes blob downloading during merges in partial clones. It
applies cleanly on master.

=== Conflict heads up ===

This series has two minor textual conflicts with
https://lore.kernel.org/git/cover.1622580781.git.jonathantanmy@google.com/,
because we both add a repo parameter to fetch_objects(), but Jonathan makes
additional changes nearby. If it would help, I can rebase my series on top
of his once he resolves the GIT_TEST_DEFAULT_HASH=sha256 issue, but I
thought I'd send this out while the topic is fresh on Jonathan's mind.

(This topic is semantically and textually independent of my other in-flight
and not-yet-submitted ort-perf-batch-* topics, so this series need not
include those.)

=== Basic optimization idea ===

merge-ort was designed to minimize the computation needed to complete a
merge, and much of that (particularly the "irrelevant rename"
determinations) also dramatically reduced the amount of data needed for the
merge. Reducing the amount of data needed to do computations ought to
benefit partial clones as well by enabling them to download less
information. However, my previous series did not modify the prefetch()
function in diffcore-rename to take advantage of these reduced data
requirements. This series changes that.

Further, although diffcore-rename batched downloads of objects for rename
detection, the merge machinery did not do the same for three-way content
merges of files. This series adds batch downloading of objects within
merge-ort to correct that.

=== Modified performance measurement method ===

The testcases I've been using so far to measure performance were not run in
a partial clone, so they aren't directly usable for comparison. Further,
partial clone performance depends on network speed, which can be highly
variable. So I want to modify one of the existing testcases slightly and
focus on two different but more stable metrics:

 1. Number of git fetch operations during rebase
 2. Number of objects fetched during rebase

The first of these should already be decent due to Jonathan Tan's work to
batch fetching of missing blobs during rename detection (see commit
7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-05)), so we are
mostly looking to optimize the second but would like to also decrease the
first if possible.
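Both metrics can be pulled out of a GIT_TRACE2_PERF log with ordinary text
tools. Below is a minimal sketch run against synthetic trace lines; the
exact pipe-delimited field layout and the fetch_count /
fetch.negotiationAlgorithm markers are assumptions for illustration, not
the authoritative trace2 format:

```shell
# Synthetic stand-in for a GIT_TRACE2_PERF log.  The field layout and the
# marker strings below are assumptions for illustration only.
cat >trace.output <<'EOF'
d0|main|data|...|...|...|...|promisor|fetch_count:2
d1|main|exec|...|...|...|...|git|fetch.negotiationAlgorithm=default
d0|main|data|...|...|...|...|promisor|fetch_count:1
d1|main|exec|...|...|...|...|git|fetch.negotiationAlgorithm=default
EOF

# Metric 1: number of `git fetch` operations (count exec events)
grep -c "fetch.negotiationAlgorithm" trace.output

# Metric 2: total number of objects fetched (sum the fetch_count values)
grep fetch_count trace.output |
	sed 's/.*fetch_count://' |
	awk '{ total += $1 } END { print total }'
```

In a real measurement you would point GIT_TRACE2_PERF at a file during the
rebase and run the same extraction over it.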

The testcase we will look at will be a modification of the mega-renames
testcase from commit 557ac0350d ("merge-ort: begin performance work;
instrument with trace2_region_* calls", 2020-10-28). In particular, we
change

$ git clone \
    git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git


to

$ git clone --sparse --filter=blob:none \
    https://github.com/github/linux


(The change in clone URL is just to get a server that supports the filter
predicate.)

We otherwise keep the test the same (in particular, we do not add any calls
to "git sparse-checkout {set,add}", which means that the resulting
repository will only have 7 total blobs from files in the toplevel
directory before starting the rebase).

=== Results ===

For the mega-renames testcase noted above (which rebases 35 commits across
an upstream with ~26K renames in a partial clone), I found the following
results for our metrics of interest:

     Number of `git fetch` ops during rebase

                     Before Series   After Series
merge-recursive:          62              63
merge-ort:                30              20


     Number of objects fetched during rebase

                     Before Series   After Series
merge-recursive:         11423          11423
merge-ort:               11391             63


So, we have a significant reduction (factor of ~3 relative to
merge-recursive) in the number of git fetch operations that have to be
performed in a partial clone to complete the rebase, and a dramatic
reduction (factor of ~180) in the number of objects that need to be fetched.
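As a quick sanity check, these factors (and the ~1.8 blobs-per-commit
figure in the summary) follow directly from the tables:

```shell
# Arithmetic behind the quoted reduction factors (numbers from the tables).
awk 'BEGIN {
	printf "fetch ops:  63 / 20    = %.1f  (~3x vs merge-recursive)\n", 63 / 20
	printf "objects:    11391 / 63 = %.1f  (~180x)\n", 11391 / 63
	printf "per commit: 63 / 35    = %.1f  blobs per transplanted commit\n", 63 / 35
}'
```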

=== Summary ===

It's worth pointing out that merge-ort after this series needs only ~1.8
blobs per commit being transplanted to complete this particular rebase.
Essentially, this reinforces that the optimization work so far has taken
rename detection from often being an overwhelmingly costly portion of a
merge (leading many to simply give up on it) to, in my experience so far,
just a minor cost of the merge.

Elijah Newren (5):
  promisor-remote: output trace2 statistics for number of objects
    fetched
  t6421: add tests checking for excessive object downloads during merge
  diffcore-rename: allow different missing_object_cb functions
  diffcore-rename: use a different prefetch for basename comparisons
  merge-ort: add prefetching for content merges

 diffcore-rename.c              | 149 +++++++++---
 merge-ort.c                    |  50 ++++
 promisor-remote.c              |   7 +-
 t/t6421-merge-partial-clone.sh | 433 +++++++++++++++++++++++++++++++++
 4 files changed, 605 insertions(+), 34 deletions(-)
 create mode 100755 t/t6421-merge-partial-clone.sh


base-commit: 01352fcdf3a96480ffa4e25a103a83a9e5d7f67a
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-969%2Fnewren%2Fort-perf-batch-13-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-969/newren/ort-perf-batch-13-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/969
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 1/5] promisor-remote: output trace2 statistics for number of objects fetched
  2021-06-05  1:27 [PATCH 0/5] Optimization batch 13: partial clone optimizations for merge-ort Elijah Newren via GitGitGadget
@ 2021-06-05  1:28 ` Elijah Newren via GitGitGadget
  2021-06-09 21:12   ` Derrick Stolee
  2021-06-05  1:28 ` [PATCH 2/5] t6421: add tests checking for excessive object downloads during merge Elijah Newren via GitGitGadget
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 30+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-06-05  1:28 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Derrick Stolee, Taylor Blau, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 promisor-remote.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/promisor-remote.c b/promisor-remote.c
index da3f2ca2615e..d465377d7d32 100644
--- a/promisor-remote.c
+++ b/promisor-remote.c
@@ -12,7 +12,8 @@ void set_repository_format_partial_clone(char *partial_clone)
 	repository_format_partial_clone = xstrdup_or_null(partial_clone);
 }
 
-static int fetch_objects(const char *remote_name,
+static int fetch_objects(struct repository *repo,
+			 const char *remote_name,
 			 const struct object_id *oids,
 			 int oid_nr)
 {
@@ -30,6 +31,8 @@ static int fetch_objects(const char *remote_name,
 		die(_("promisor-remote: unable to fork off fetch subprocess"));
 	child_in = xfdopen(child.in, "w");
 
+	trace2_data_intmax("promisor", repo, "fetch_count", oid_nr);
+
 	for (i = 0; i < oid_nr; i++) {
 		if (fputs(oid_to_hex(&oids[i]), child_in) < 0)
 			die_errno(_("promisor-remote: could not write to fetch subprocess"));
@@ -238,7 +241,7 @@ int promisor_remote_get_direct(struct repository *repo,
 	promisor_remote_init();
 
 	for (r = promisors; r; r = r->next) {
-		if (fetch_objects(r->name, remaining_oids, remaining_nr) < 0) {
+		if (fetch_objects(repo, r->name, remaining_oids, remaining_nr) < 0) {
 			if (remaining_nr == 1)
 				continue;
 			remaining_nr = remove_fetched_oids(repo, &remaining_oids,
-- 
gitgitgadget



* [PATCH 2/5] t6421: add tests checking for excessive object downloads during merge
  2021-06-05  1:27 [PATCH 0/5] Optimization batch 13: partial clone optimizations for merge-ort Elijah Newren via GitGitGadget
  2021-06-05  1:28 ` [PATCH 1/5] promisor-remote: output trace2 statistics for number of objects fetched Elijah Newren via GitGitGadget
@ 2021-06-05  1:28 ` Elijah Newren via GitGitGadget
  2021-06-09 21:16   ` Derrick Stolee
  2021-06-05  1:28 ` [PATCH 3/5] diffcore-rename: allow different missing_object_cb functions Elijah Newren via GitGitGadget
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 30+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-06-05  1:28 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Derrick Stolee, Taylor Blau, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 t/t6421-merge-partial-clone.sh | 433 +++++++++++++++++++++++++++++++++
 1 file changed, 433 insertions(+)
 create mode 100755 t/t6421-merge-partial-clone.sh

diff --git a/t/t6421-merge-partial-clone.sh b/t/t6421-merge-partial-clone.sh
new file mode 100755
index 000000000000..028d876be2c8
--- /dev/null
+++ b/t/t6421-merge-partial-clone.sh
@@ -0,0 +1,433 @@
+#!/bin/sh
+
+test_description="limiting blob downloads when merging with partial clones"
+# Uses a methodology similar to
+#   t6042: corner cases with renames but not criss-cross merges
+#   t6036: corner cases with both renames and criss-cross merges
+#   t6423: directory rename detection
+#
+# The setup for all of them, pictorially, is:
+#
+#      A
+#      o
+#     / \
+#  O o   ?
+#     \ /
+#      o
+#      B
+#
+# To help make it easier to follow the flow of tests, they have been
+# divided into sections and each test will start with a quick explanation
+# of what commits O, A, and B contain.
+#
+# Notation:
+#    z/{b,c}   means  files z/b and z/c both exist
+#    x/d_1     means  file x/d exists with content d1.  (Purpose of the
+#                     underscore notation is to differentiate different
+#                     files that might be renamed into each other's paths.)
+
+. ./test-lib.sh
+. "$TEST_DIRECTORY"/lib-merge.sh
+
+test_setup_repo () {
+	test -d server && return
+	test_create_repo server &&
+	(
+		cd server &&
+
+		git config uploadpack.allowfilter 1 &&
+		git config uploadpack.allowanysha1inwant 1 &&
+
+		mkdir -p general &&
+		test_seq 2 9 >general/leap1 &&
+		cp general/leap1 general/leap2 &&
+		echo leap2 >>general/leap2 &&
+
+		mkdir -p basename &&
+		cp general/leap1 basename/numbers &&
+		cp general/leap1 basename/sequence &&
+		cp general/leap1 basename/values &&
+		echo numbers >>basename/numbers &&
+		echo sequence >>basename/sequence &&
+		echo values >>basename/values &&
+
+		mkdir -p dir/unchanged &&
+		mkdir -p dir/subdir/tweaked &&
+		echo a >dir/subdir/a &&
+		echo b >dir/subdir/b &&
+		echo c >dir/subdir/c &&
+		echo d >dir/subdir/d &&
+		echo e >dir/subdir/e &&
+		cp general/leap1 dir/subdir/Makefile &&
+		echo toplevel makefile >>dir/subdir/Makefile &&
+		echo f >dir/subdir/tweaked/f &&
+		echo g >dir/subdir/tweaked/g &&
+		echo h >dir/subdir/tweaked/h &&
+		echo subdirectory makefile >dir/subdir/tweaked/Makefile &&
+		for i in $(test_seq 1 88); do
+			echo content $i >dir/unchanged/file_$i
+		done &&
+		git add . &&
+		git commit -m "O" &&
+
+		git branch O &&
+		git branch A &&
+		git branch B-single &&
+		git branch B-dir &&
+		git branch B-many &&
+
+		git switch A &&
+
+		git rm general/leap* &&
+		mkdir general/ &&
+		test_seq 1 9 >general/jump1 &&
+		cp general/jump1 general/jump2 &&
+		echo leap2 >>general/jump2 &&
+
+		rm basename/numbers basename/sequence basename/values &&
+		mkdir -p basename/subdir/ &&
+		cp general/jump1 basename/subdir/numbers &&
+		cp general/jump1 basename/subdir/sequence &&
+		cp general/jump1 basename/subdir/values &&
+		echo numbers >>basename/subdir/numbers &&
+		echo sequence >>basename/subdir/sequence &&
+		echo values >>basename/subdir/values &&
+
+		git rm dir/subdir/tweaked/f &&
+		echo more >>dir/subdir/e &&
+		echo more >>dir/subdir/Makefile &&
+		echo more >>dir/subdir/tweaked/Makefile &&
+		mkdir dir/subdir/newsubdir &&
+		echo rust code >dir/subdir/newsubdir/newfile.rs &&
+		git mv dir/subdir/e dir/subdir/newsubdir/ &&
+		git mv dir folder &&
+		git add . &&
+		git commit -m "A" &&
+
+		git switch B-single &&
+		echo new first line >dir/subdir/Makefile &&
+		cat general/leap1 >>dir/subdir/Makefile &&
+		echo toplevel makefile >>dir/subdir/Makefile &&
+		echo perl code >general/newfile.pl &&
+		git add . &&
+		git commit -m "B-single" &&
+
+		git switch B-dir &&
+		echo java code >dir/subdir/newfile.java &&
+		echo scala code >dir/subdir/newfile.scala &&
+		echo groovy code >dir/subdir/newfile.groovy &&
+		git add . &&
+		git commit -m "B-dir" &&
+
+		git switch B-many &&
+		test_seq 2 10 >general/leap1 &&
+		rm general/leap2 &&
+		cp general/leap1 general/leap2 &&
+		echo leap2 >>general/leap2 &&
+
+		rm basename/numbers basename/sequence basename/values &&
+		mkdir -p basename/subdir/ &&
+		cp general/leap1 basename/subdir/numbers &&
+		cp general/leap1 basename/subdir/sequence &&
+		cp general/leap1 basename/subdir/values &&
+		echo numbers >>basename/subdir/numbers &&
+		echo sequence >>basename/subdir/sequence &&
+		echo values >>basename/subdir/values &&
+
+		mkdir dir/subdir/newsubdir/ &&
+		echo c code >dir/subdir/newfile.c &&
+		echo python code >dir/subdir/newsubdir/newfile.py &&
+		git add . &&
+		git commit -m "B-many" &&
+
+		git switch A
+	)
+}
+
+# Testcase: Objects downloaded for single relevant rename
+#   Commit O:
+#              general/{leap1_O, leap2_O}
+#              basename/{numbers_O, sequence_O, values_O}
+#              dir/subdir/{a,b,c,d,e_O,Makefile_TOP_O}
+#              dir/subdir/tweaked/{f,g,h,Makefile_SUB_O}
+#              dir/unchanged/<LOTS OF FILES>
+#   Commit A:
+#     (Rename leap->jump, rename basename/ -> basename/subdir/, rename dir/
+#      -> folder/, move e into newsubdir, add newfile.rs, remove f, modify
+#      both Makefiles and jumps)
+#              general/{jump1_A, jump2_A}
+#              basename/subdir/{numbers_A, sequence_A, values_A}
+#              folder/subdir/{a,b,c,d,Makefile_TOP_A}
+#              folder/subdir/newsubdir/{e_A,newfile.rs}
+#              folder/subdir/tweaked/{g,h,Makefile_SUB_A}
+#              folder/unchanged/<LOTS OF FILES>
+#   Commit B(-single):
+#     (add newfile.pl, tweak Makefile_TOP)
+#              general/{leap1_O, leap2_O,newfile.pl}
+#              basename/{numbers_O, sequence_O, values_O}
+#              dir/{a,b,c,d,e_O,Makefile_TOP_B}
+#              dir/tweaked/{f,g,h,Makefile_SUB_O}
+#              dir/unchanged/<LOTS OF FILES>
+#   Expected:
+#              general/{jump1_A, jump2_A,newfile.pl}
+#              basename/subdir/{numbers_A, sequence_A, values_A}
+#              folder/subdir/{a,b,c,d,Makefile_TOP_Merged}
+#              folder/subdir/newsubdir/{e_A,newfile.rs}
+#              folder/subdir/tweaked/{g,h,Makefile_SUB_A}
+#              folder/unchanged/<LOTS OF FILES>
+#
+# Objects that need to be fetched:
+#   Rename detection:
+#     Side1 (O->A):
+#       Basename-matches rename detection only needs to fetch these objects:
+#         Makefile_TOP_O, Makefile_TOP_A
+#         (Despite many renames, all others are content irrelevant.  They
+#          are also location irrelevant because newfile.rs was added on
+#          the side doing the directory rename, and newfile.pl was added to
+#          a directory that was not renamed on either side.)
+#       General rename detection only needs to fetch these objects:
+#         <None>
+#          (Even though newfile.rs, jump[12], basename/subdir/*, and e
+#          could all be used as destinations in rename detection, the
+#          basename detection for Makefile matches up all relevant
+#          sources, so these other files never end up needing to be
+#          used)
+#     Side2 (O->B):
+#       Basename-matches rename detection only needs to fetch these objects:
+#         <None>
+#         (there are no deleted files, so no possible sources)
+#       General rename detection only needs to fetch these objects:
+#         <None>
+#         (there are no deleted files, so no possible sources)
+#   Merge:
+#     3-way content merge needs to grab these objects:
+#       Makefile_TOP_B
+#   Nothing else needs to fetch objects
+#
+#   Summary: 2 fetches (1 for 2 objects, 1 for 1 object)
+#
+test_expect_merge_algorithm failure failure 'Objects downloaded for single relevant rename' '
+	test_setup_repo &&
+	git clone --sparse --filter=blob:none "file://$(pwd)/server" objects-single &&
+	(
+		cd objects-single &&
+
+		git rev-list --objects --all --missing=print |
+			grep "^?" >missing-objects-before &&
+
+		git checkout -q origin/A &&
+
+		GIT_TRACE2_PERF="$(pwd)/trace.output" git -c merge.directoryRenames=true merge --no-stat --no-progress origin/B-single &&
+
+		# Check the number of objects we reported we would fetch
+		cat >expect <<-EOF &&
+		 ..........fetch_count:2
+		 ......fetch_count:1
+		EOF
+		grep fetch_count trace.output | cut -d "|" -f 9 >actual &&
+		test_cmp expect actual &&
+
+		# Check the number of fetch commands exec-ed
+		grep d0.*fetch.negotiationAlgorithm trace.output >fetches &&
+		test_line_count = 2 fetches &&
+
+		git rev-list --objects --all --missing=print |
+			grep "^?" >missing-objects-after &&
+		test_cmp missing-objects-before missing-objects-after |
+			grep "^[-+]?" >found-and-new-objects &&
+		# We should not have any NEW missing objects
+		! grep ^+ found-and-new-objects &&
+		# Fetched 2+1=3 objects, so should have 3 fewer missing objects
+		test_line_count = 3 found-and-new-objects
+	)
+'
+
+# Testcase: Objects downloaded for directory rename
+#   Commit O:
+#              general/{leap1_O, leap2_O}
+#              basename/{numbers_O, sequence_O, values_O}
+#              dir/subdir/{a,b,c,d,e_O,Makefile_TOP_O}
+#              dir/subdir/tweaked/{f,g,h,Makefile_SUB_O}
+#              dir/unchanged/<LOTS OF FILES>
+#   Commit A:
+#     (Rename leap->jump, rename basename/ -> basename/subdir/, rename dir/ ->
+#      folder/, move e into newsubdir, add newfile.rs, remove f, modify
+#      both Makefiles and jumps)
+#              general/{jump1_A, jump2_A}
+#              basename/subdir/{numbers_A, sequence_A, values_A}
+#              folder/subdir/{a,b,c,d,Makefile_TOP_A}
+#              folder/subdir/newsubdir/{e_A,newfile.rs}
+#              folder/subdir/tweaked/{g,h,Makefile_SUB_A}
+#              folder/unchanged/<LOTS OF FILES>
+#   Commit B(-dir):
+#     (add dir/subdir/newfile.{java,scala,groovy})
+#              general/{leap1_O, leap2_O}
+#              basename/{numbers_O, sequence_O, values_O}
+#              dir/subdir/{a,b,c,d,e_O,Makefile_TOP_O,
+#                          newfile.java,newfile.scala,newfile.groovy}
+#              dir/subdir/tweaked/{f,g,h,Makefile_SUB_O}
+#              dir/unchanged/<LOTS OF FILES>
+#   Expected:
+#              general/{jump1_A, jump2_A}
+#              basename/subdir/{numbers_A, sequence_A, values_A}
+#              folder/subdir/{a,b,c,d,Makefile_TOP_A,
+#                             newfile.java,newfile.scala,newfile.groovy}
+#              folder/subdir/newsubdir/{e_A,newfile.rs}
+#              folder/subdir/tweaked/{g,h,Makefile_SUB_A}
+#              folder/unchanged/<LOTS OF FILES>
+#
+# Objects that need to be fetched:
+#   Makefile_TOP_O, Makefile_TOP_A
+#   Makefile_SUB_O, Makefile_SUB_A
+#   e_O, e_A
+#   * Despite A's rename of jump->leap, those renames are irrelevant.
+#   * Despite A's rename of basename/ -> basename/subdir/, those renames are
+#     irrelevant.
+#   * Because of A's rename of dir/ -> folder/ and B-dir's addition of
+#     newfile.* into dir/subdir/, we need to determine directory renames.
+#     (Technically, there are enough exact renames to determine directory
+#      rename detection, but the current implementation always does
+#      basename searching before directory rename detection.  Running it
+#      also before basename searching would mean doing directory rename
+#      detection twice, but it's a bit expensive to do that and cases like
+#      this are not all that common.)
+#   Summary: 1 fetch for 6 objects
+#
+test_expect_merge_algorithm failure failure 'Objects downloaded when a directory rename is triggered' '
+	test_setup_repo &&
+	git clone --sparse --filter=blob:none "file://$(pwd)/server" objects-dir &&
+	(
+		cd objects-dir &&
+
+		git rev-list --objects --all --missing=print |
+			grep "^?" >missing-objects-before &&
+
+		git checkout -q origin/A &&
+
+		GIT_TRACE2_PERF="$(pwd)/trace.output" git -c merge.directoryRenames=true merge --no-stat --no-progress origin/B-dir &&
+
+		# Check the number of objects we reported we would fetch
+		cat >expect <<-EOF &&
+		 ..........fetch_count:6
+		EOF
+		grep fetch_count trace.output | cut -d "|" -f 9 >actual &&
+		test_cmp expect actual &&
+
+		# Check the number of fetch commands exec-ed
+		grep d0.*fetch.negotiationAlgorithm trace.output >fetches &&
+		test_line_count = 1 fetches &&
+
+		git rev-list --objects --all --missing=print |
+			grep "^?" >missing-objects-after &&
+		test_cmp missing-objects-before missing-objects-after |
+			grep "^[-+]?" >found-and-new-objects &&
+		# We should not have any NEW missing objects
+		! grep ^+ found-and-new-objects &&
+		# Fetched 6 objects, so should have 6 fewer missing objects
+		test_line_count = 6 found-and-new-objects
+	)
+'
+
+# Testcase: Objects downloaded with lots of renames and modifications
+#   Commit O:
+#              general/{leap1_O, leap2_O}
+#              basename/{numbers_O, sequence_O, values_O}
+#              dir/subdir/{a,b,c,d,e_O,Makefile_TOP_O}
+#              dir/subdir/tweaked/{f,g,h,Makefile_SUB_O}
+#              dir/unchanged/<LOTS OF FILES>
+#   Commit A:
+#     (Rename leap->jump, rename basename/ -> basename/subdir/, rename dir/
+#      -> folder/, move e into newsubdir, add newfile.rs, remove f, modify
+#      both Makefiles and jumps)
+#              general/{jump1_A, jump2_A}
+#              basename/subdir/{numbers_A, sequence_A, values_A}
+#              folder/subdir/{a,b,c,d,Makefile_TOP_A}
+#              folder/subdir/newsubdir/{e_A,newfile.rs}
+#              folder/subdir/tweaked/{g,h,Makefile_SUB_A}
+#              folder/unchanged/<LOTS OF FILES>
+#   Commit B(-many):
+#     (modify both leaps, rename basename/ -> basename/subdir/, add
+#      newfile.{c,py})
+#              general/{leap1_B, leap2_B}
+#              basename/subdir/{numbers_B, sequence_B, values_B}
+#              dir/{a,b,c,d,e_O,Makefile_TOP_O,newfile.c}
+#              dir/tweaked/{f,g,h,Makefile_SUB_O,newfile.py}
+#              dir/unchanged/<LOTS OF FILES>
+#   Expected:
+#              general/{jump1_Merged, jump2_Merged}
+#              basename/subdir/{numbers_Merged, sequence_Merged, values_Merged}
+#              folder/subdir/{a,b,c,d,Makefile_TOP_A,newfile.c}
+#              folder/subdir/newsubdir/e_A
+#              folder/subdir/tweaked/{g,h,Makefile_SUB_A,newfile.py}
+#              folder/unchanged/<LOTS OF FILES>
+#
+# Objects that need to be fetched:
+#   Rename detection:
+#     Side1 (O->A):
+#       Basename-matches rename detection only needs to fetch these objects:
+#         numbers_O, numbers_A
+#         sequence_O, sequence_A
+#         values_O, values_A
+#         Makefile_TOP_O, Makefile_TOP_A
+#         Makefile_SUB_O, Makefile_SUB_A
+#         e_O, e_A
+#       General rename detection only needs to fetch these objects:
+#         leap1_O, leap2_O
+#         jump1_A, jump2_A, newfile.rs
+#         (only need remaining relevant sources, but any relevant sources need
+#          to be matched against all possible unpaired destinations)
+#     Side2 (O->B):
+#       Basename-matches rename detection only needs to fetch these objects:
+#         numbers_B
+#         sequence_B
+#         values_B
+#       (because numbers_O, sequence_O, and values_O already fetched above)
+#       General rename detection only needs to fetch these objects:
+#         <None>
+#   Merge:
+#     3-way content merge needs to grab these objects:
+#       leap1_B
+#       leap2_B
+#   Nothing else needs to fetch objects
+#
+#   Summary: 4 fetches (1 for 12 objects, 1 for 5, 1 for 3, 1 for 2)
+#
+test_expect_merge_algorithm failure failure 'Objects downloaded with lots of renames and modifications' '
+	test_setup_repo &&
+	git clone --sparse --filter=blob:none "file://$(pwd)/server" objects-many &&
+	(
+		cd objects-many &&
+
+		git rev-list --objects --all --missing=print |
+			grep "^?" >missing-objects-before &&
+
+		git checkout -q origin/A &&
+
+		GIT_TRACE2_PERF="$(pwd)/trace.output" git -c merge.directoryRenames=true merge --no-stat --no-progress origin/B-many &&
+
+		# Check the number of objects we reported we would fetch
+		cat >expect <<-EOF &&
+		 ..........fetch_count:12
+		 ..........fetch_count:5
+		 ..........fetch_count:3
+		 ......fetch_count:2
+		EOF
+		grep fetch_count trace.output | cut -d "|" -f 9 >actual &&
+		test_cmp expect actual &&
+
+		# Check the number of fetch commands exec-ed
+		grep d0.*fetch.negotiationAlgorithm trace.output >fetches &&
+		test_line_count = 4 fetches &&
+
+		git rev-list --objects --all --missing=print |
+			grep "^?" >missing-objects-after &&
+		test_cmp missing-objects-before missing-objects-after |
+			grep "^[-+]?" >found-and-new-objects &&
+		# We should not have any NEW missing objects
+		! grep ^+ found-and-new-objects &&
+		# Fetched 12 + 5 + 3 + 2 == 22 objects
+		test_line_count = 22 found-and-new-objects
+	)
+'
+
+test_done
-- 
gitgitgadget



* [PATCH 3/5] diffcore-rename: allow different missing_object_cb functions
  2021-06-05  1:27 [PATCH 0/5] Optimization batch 13: partial clone optimizations for merge-ort Elijah Newren via GitGitGadget
  2021-06-05  1:28 ` [PATCH 1/5] promisor-remote: output trace2 statistics for number of objects fetched Elijah Newren via GitGitGadget
  2021-06-05  1:28 ` [PATCH 2/5] t6421: add tests checking for excessive object downloads during merge Elijah Newren via GitGitGadget
@ 2021-06-05  1:28 ` Elijah Newren via GitGitGadget
  2021-06-05  1:28 ` [PATCH 4/5] diffcore-rename: use a different prefetch for basename comparisons Elijah Newren via GitGitGadget
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 30+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-06-05  1:28 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Derrick Stolee, Taylor Blau, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

estimate_similarity() was setting up a diff_populate_filespec_options
every time it was called, requiring the caller of estimate_similarity()
to pass in some data needed to set up this option.  Currently the needed
data consisted of a single variable (skip_unmodified), but we want to
also have the different estimate_similarity() callsites start using
different missing_object_cb functions as well.  Rather than also passing
that data in, just have the caller pass in the whole
diff_populate_filespec_options, and reduce the number of times we need to
set it up.

As a side note, this also drops the number of calls to
has_promisor_remote() dramatically.  If L is the number of basename
paths to compare, M is the number of inexact sources, and N is the
number of inexact destinations, then the number of calls to
has_promisor_remote() drops from L+M*N down to at most 2 -- one for each
of the sites that calls estimate_similarity().  has_promisor_remote() is
a very fast function so this almost certainly has no measurable
performance impact, but it seems cleaner to avoid calling that function
so many times.
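To put illustrative numbers on that claim (L, M, and N below are
hypothetical sizes, not measurements from any repository):

```shell
# Hypothetical sizes, purely to illustrate the L+M*N -> 2 reduction.
L=100     # basename paths compared
M=1000    # inexact rename sources
N=1000    # inexact rename destinations

echo "before: $(( L + M * N )) calls to has_promisor_remote()"
echo "after:  2 calls (one per estimate_similarity() call site)"
```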

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 58 +++++++++++++++++++++++++++++++----------------
 1 file changed, 39 insertions(+), 19 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 35378d84e8f1..e13e52046026 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -126,7 +126,7 @@ static int estimate_similarity(struct repository *r,
 			       struct diff_filespec *src,
 			       struct diff_filespec *dst,
 			       int minimum_score,
-			       int skip_unmodified)
+			       struct diff_populate_filespec_options *dpf_opt)
 {
 	/* src points at a file that existed in the original tree (or
 	 * optionally a file in the destination tree) and dst points
@@ -143,15 +143,6 @@ static int estimate_similarity(struct repository *r,
 	 */
 	unsigned long max_size, delta_size, base_size, src_copied, literal_added;
 	int score;
-	struct diff_populate_filespec_options dpf_options = {
-		.check_size_only = 1
-	};
-	struct prefetch_options prefetch_options = {r, skip_unmodified};
-
-	if (r == the_repository && has_promisor_remote()) {
-		dpf_options.missing_object_cb = prefetch;
-		dpf_options.missing_object_data = &prefetch_options;
-	}
 
 	/* We deal only with regular files.  Symlink renames are handled
 	 * only when they are exact matches --- in other words, no edits
@@ -169,11 +160,13 @@ static int estimate_similarity(struct repository *r,
 	 * is a possible size - we really should have a flag to
 	 * say whether the size is valid or not!)
 	 */
+	dpf_opt->check_size_only = 1;
+
 	if (!src->cnt_data &&
-	    diff_populate_filespec(r, src, &dpf_options))
+	    diff_populate_filespec(r, src, dpf_opt))
 		return 0;
 	if (!dst->cnt_data &&
-	    diff_populate_filespec(r, dst, &dpf_options))
+	    diff_populate_filespec(r, dst, dpf_opt))
 		return 0;
 
 	max_size = ((src->size > dst->size) ? src->size : dst->size);
@@ -191,11 +184,11 @@ static int estimate_similarity(struct repository *r,
 	if (max_size * (MAX_SCORE-minimum_score) < delta_size * MAX_SCORE)
 		return 0;
 
-	dpf_options.check_size_only = 0;
+	dpf_opt->check_size_only = 0;
 
-	if (!src->cnt_data && diff_populate_filespec(r, src, &dpf_options))
+	if (!src->cnt_data && diff_populate_filespec(r, src, dpf_opt))
 		return 0;
-	if (!dst->cnt_data && diff_populate_filespec(r, dst, &dpf_options))
+	if (!dst->cnt_data && diff_populate_filespec(r, dst, dpf_opt))
 		return 0;
 
 	if (diffcore_count_changes(r, src, dst,
@@ -862,7 +855,11 @@ static int find_basename_matches(struct diff_options *options,
 	int i, renames = 0;
 	struct strintmap sources;
 	struct strintmap dests;
-
+	struct diff_populate_filespec_options dpf_options = {
+		.check_binary = 0,
+		.missing_object_cb = NULL,
+		.missing_object_data = NULL
+	};
 	/*
+	 * The prefetching stuff wants to know if it can skip prefetching
 	 * blobs that are unmodified...and will then do a little extra work
@@ -873,7 +870,10 @@ static int find_basename_matches(struct diff_options *options,
 	 * the extra work necessary to check if rename_src entries are
 	 * unmodified would be a small waste.
 	 */
-	int skip_unmodified = 0;
+	struct prefetch_options prefetch_options = {
+		.repo = options->repo,
+		.skip_unmodified = 0
+	};
 
 	/*
 	 * Create maps of basename -> fullname(s) for remaining sources and
@@ -910,6 +910,11 @@ static int find_basename_matches(struct diff_options *options,
 			strintmap_set(&dests, base, i);
 	}
 
+	if (options->repo == the_repository && has_promisor_remote()) {
+		dpf_options.missing_object_cb = prefetch;
+		dpf_options.missing_object_data = &prefetch_options;
+	}
+
 	/* Now look for basename matchups and do similarity estimation */
 	for (i = 0; i < rename_src_nr; ++i) {
 		char *filename = rename_src[i].p->one->path;
@@ -953,7 +958,7 @@ static int find_basename_matches(struct diff_options *options,
 			one = rename_src[src_index].p->one;
 			two = rename_dst[dst_index].p->two;
 			score = estimate_similarity(options->repo, one, two,
-						    minimum_score, skip_unmodified);
+						    minimum_score, &dpf_options);
 
 			/* If sufficiently similar, record as rename pair */
 			if (score < minimum_score)
@@ -1272,6 +1277,14 @@ void diffcore_rename_extended(struct diff_options *options,
 	int num_sources, want_copies;
 	struct progress *progress = NULL;
 	struct dir_rename_info info;
+	struct diff_populate_filespec_options dpf_options = {
+		.check_binary = 0,
+		.missing_object_cb = NULL,
+		.missing_object_data = NULL
+	};
+	struct prefetch_options prefetch_options = {
+		.repo = options->repo
+	};
 
 	trace2_region_enter("diff", "setup", options->repo);
 	info.setup = 0;
@@ -1433,6 +1446,13 @@ void diffcore_rename_extended(struct diff_options *options,
 				(uint64_t)num_destinations * (uint64_t)num_sources);
 	}
 
+	/* Finish setting up dpf_options */
+	prefetch_options.skip_unmodified = skip_unmodified;
+	if (options->repo == the_repository && has_promisor_remote()) {
+		dpf_options.missing_object_cb = prefetch;
+		dpf_options.missing_object_data = &prefetch_options;
+	}
+
 	CALLOC_ARRAY(mx, st_mult(NUM_CANDIDATE_PER_DST, num_destinations));
 	for (dst_cnt = i = 0; i < rename_dst_nr; i++) {
 		struct diff_filespec *two = rename_dst[i].p->two;
@@ -1458,7 +1478,7 @@ void diffcore_rename_extended(struct diff_options *options,
 			this_src.score = estimate_similarity(options->repo,
 							     one, two,
 							     minimum_score,
-							     skip_unmodified);
+							     &dpf_options);
 			this_src.name_score = basename_same(one, two);
 			this_src.dst = i;
 			this_src.src = j;
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 4/5] diffcore-rename: use a different prefetch for basename comparisons
  2021-06-05  1:27 [PATCH 0/5] Optimization batch 13: partial clone optimizations for merge-ort Elijah Newren via GitGitGadget
                   ` (2 preceding siblings ...)
  2021-06-05  1:28 ` [PATCH 3/5] diffcore-rename: allow different missing_object_cb functions Elijah Newren via GitGitGadget
@ 2021-06-05  1:28 ` Elijah Newren via GitGitGadget
  2021-06-05  1:28 ` [PATCH 5/5] merge-ort: add prefetching for content merges Elijah Newren via GitGitGadget
  2021-06-15 22:41 ` [PATCH v2 0/5] Optimization batch 13: partial clone optimizations for merge-ort Elijah Newren via GitGitGadget
  5 siblings, 0 replies; 30+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-06-05  1:28 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Derrick Stolee, Taylor Blau, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

merge-ort was designed to minimize the amount of data needed and used,
and several changes were made to diffcore-rename to take advantage of
extra metadata to enable this data minimization (particularly the
relevant_sources variable for skipping "irrelevant" renames).  This
effort obviously succeeded in drastically reducing computation times,
but should also theoretically allow partial clones to download much less
information.  Previously, though, the prefetch() function used in
diffcore-rename had never been updated accordingly and still downloaded
many blobs that were unnecessary for merge-ort.  This commit corrects that.

When doing basename comparisons, we want to fetch only the objects that
will be used for basename comparisons.  If after basename fetching this
leaves us with no more relevant sources (or no more destinations), then
we won't need to do the full inexact rename detection and can skip
downloading additional source and destination files.  Even if we have to
do that later full inexact rename detection, irrelevant sources are
culled after basename matching and before the full inexact rename
detection, so we can still avoid downloading the blobs for irrelevant
sources.  Rename prefetch() to inexact_prefetch(), and introduce a
new basename_prefetch() to take advantage of this.

If we modify the testcase from commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2021-01-23)
to pass
    --sparse --filter=blob:none
to the clone command, and use the new trace2 "fetch_count" output from
a few commits ago to track both the number of fetch subcommands invoked
and the number of objects fetched across all those fetches, then for
the mega-renames testcase we observe the following:

BEFORE this commit, rebasing 35 patches:
    strategy     # of fetches    total # of objects fetched
    ---------    ------------    --------------------------
    recursive    62              11423
    ort          30              11391

AFTER this commit, rebasing the same 35 patches:
    ort          32                 63

This means that the new code needs to download fewer than 2 blobs
per patch being rebased.  That is especially interesting given that the
repository at the start only had approximately half a dozen TOTAL blobs
downloaded to start with (because the default sparse-checkout of just
the toplevel directory was in use).

So, for this particular linux kernel testcase that involved ~26,000
renames on the upstream side (drivers/ -> pilots/) across which 35
patches were being rebased, this change reduces the number of blobs that
need to be downloaded by a factor of ~180.
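
The "# of fetches" column is derived from the same GIT_TRACE2_PERF output:
the new t6421 tests count fetch invocations by grepping for the
fetch.negotiationAlgorithm parameter line that each spawned fetch process
emits.  As a rough sketch, using mock trace lines whose field layout is a
shortened, hypothetical stand-in for the real perf format:

```shell
# Two fetch subprocesses would each leave one such def_param line in the
# trace, so counting matching lines counts the fetches.
cat >trace.output <<'EOF'
d0|20210605T012800|def_param|fetch.negotiationAlgorithm:default
d0|20210605T012800|def_param|fetch.negotiationAlgorithm:default
EOF

grep -c "d0.*fetch.negotiationAlgorithm" trace.output
# prints: 2
```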

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c              | 101 +++++++++++++++++++++++++++------
 t/t6421-merge-partial-clone.sh |   4 +-
 2 files changed, 85 insertions(+), 20 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index e13e52046026..4ef0459cfb50 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -87,13 +87,13 @@ struct diff_score {
 	short name_score;
 };
 
-struct prefetch_options {
+struct inexact_prefetch_options {
 	struct repository *repo;
 	int skip_unmodified;
 };
-static void prefetch(void *prefetch_options)
+static void inexact_prefetch(void *prefetch_options)
 {
-	struct prefetch_options *options = prefetch_options;
+	struct inexact_prefetch_options *options = prefetch_options;
 	int i;
 	struct oid_array to_fetch = OID_ARRAY_INIT;
 
@@ -816,6 +816,78 @@ static int idx_possible_rename(char *filename, struct dir_rename_info *info)
 	return idx;
 }
 
+struct basename_prefetch_options {
+	struct repository *repo;
+	struct strintmap *relevant_sources;
+	struct strintmap *sources;
+	struct strintmap *dests;
+	struct dir_rename_info *info;
+};
+static void basename_prefetch(void *prefetch_options)
+{
+	struct basename_prefetch_options *options = prefetch_options;
+	struct strintmap *relevant_sources = options->relevant_sources;
+	struct strintmap *sources = options->sources;
+	struct strintmap *dests = options->dests;
+	struct dir_rename_info *info = options->info;
+	int i;
+	struct oid_array to_fetch = OID_ARRAY_INIT;
+
+	/*
+	 * TODO: The following loops mirror the code/logic from
+	 * find_basename_matches(), though not quite exactly.  Maybe
+	 * abstract the iteration logic out somehow?
+	 */
+	for (i = 0; i < rename_src_nr; ++i) {
+		char *filename = rename_src[i].p->one->path;
+		const char *base = NULL;
+		intptr_t src_index;
+		intptr_t dst_index;
+
+		/* Skip irrelevant sources */
+		if (relevant_sources &&
+		    !strintmap_contains(relevant_sources, filename))
+			continue;
+
+		/*
+		 * If the basename is unique among remaining sources, then
+		 * src_index will equal 'i' and we can attempt to match it
+		 * to a unique basename in the destinations.  Otherwise,
+		 * use directory rename heuristics, if possible.
+		 */
+		base = get_basename(filename);
+		src_index = strintmap_get(sources, base);
+		assert(src_index == -1 || src_index == i);
+
+		if (strintmap_contains(dests, base)) {
+			struct diff_filespec *one, *two;
+
+			/* Find a matching destination, if possible */
+			dst_index = strintmap_get(dests, base);
+			if (src_index == -1 || dst_index == -1) {
+				src_index = i;
+				dst_index = idx_possible_rename(filename, info);
+			}
+			if (dst_index == -1)
+				continue;
+
+			/* Ignore this dest if already used in a rename */
+			if (rename_dst[dst_index].is_rename)
+				continue; /* already used previously */
+
+			one = rename_src[src_index].p->one;
+			two = rename_dst[dst_index].p->two;
+
+			/* Add the pairs */
+			diff_add_if_missing(options->repo, &to_fetch, two);
+			diff_add_if_missing(options->repo, &to_fetch, one);
+		}
+	}
+
+	promisor_remote_get_direct(options->repo, to_fetch.oid, to_fetch.nr);
+	oid_array_clear(&to_fetch);
+}
+
 static int find_basename_matches(struct diff_options *options,
 				 int minimum_score,
 				 struct dir_rename_info *info,
@@ -860,19 +932,12 @@ static int find_basename_matches(struct diff_options *options,
 		.missing_object_cb = NULL,
 		.missing_object_data = NULL
 	};
-	/*
-	 * The prefeteching stuff wants to know if it can skip prefetching
-	 * blobs that are unmodified...and will then do a little extra work
-	 * to verify that the oids are indeed different before prefetching.
-	 * Unmodified blobs are only relevant when doing copy detection;
-	 * when limiting to rename detection, diffcore_rename[_extended]()
-	 * will never be called with unmodified source paths fed to us, so
-	 * the extra work necessary to check if rename_src entries are
-	 * unmodified would be a small waste.
-	 */
-	struct prefetch_options prefetch_options = {
+	struct basename_prefetch_options prefetch_options = {
 		.repo = options->repo,
-		.skip_unmodified = 0
+		.relevant_sources = relevant_sources,
+		.sources = &sources,
+		.dests = &dests,
+		.info = info
 	};
 
 	/*
@@ -911,7 +976,7 @@ static int find_basename_matches(struct diff_options *options,
 	}
 
 	if (options->repo == the_repository && has_promisor_remote()) {
-		dpf_options.missing_object_cb = prefetch;
+		dpf_options.missing_object_cb = basename_prefetch;
 		dpf_options.missing_object_data = &prefetch_options;
 	}
 
@@ -1282,7 +1347,7 @@ void diffcore_rename_extended(struct diff_options *options,
 		.missing_object_cb = NULL,
 		.missing_object_data = NULL
 	};
-	struct prefetch_options prefetch_options = {
+	struct inexact_prefetch_options prefetch_options = {
 		.repo = options->repo
 	};
 
@@ -1449,7 +1514,7 @@ void diffcore_rename_extended(struct diff_options *options,
 	/* Finish setting up dpf_options */
 	prefetch_options.skip_unmodified = skip_unmodified;
 	if (options->repo == the_repository && has_promisor_remote()) {
-		dpf_options.missing_object_cb = prefetch;
+		dpf_options.missing_object_cb = inexact_prefetch;
 		dpf_options.missing_object_data = &prefetch_options;
 	}
 
diff --git a/t/t6421-merge-partial-clone.sh b/t/t6421-merge-partial-clone.sh
index 028d876be2c8..fb7eb18cc80c 100755
--- a/t/t6421-merge-partial-clone.sh
+++ b/t/t6421-merge-partial-clone.sh
@@ -206,7 +206,7 @@ test_setup_repo () {
 #
 #   Summary: 2 fetches (1 for 2 objects, 1 for 1 object)
 #
-test_expect_merge_algorithm failure failure 'Objects downloaded for single relevant rename' '
+test_expect_merge_algorithm failure success 'Objects downloaded for single relevant rename' '
 	test_setup_repo &&
 	git clone --sparse --filter=blob:none "file://$(pwd)/server" objects-single &&
 	(
@@ -293,7 +293,7 @@ test_expect_merge_algorithm failure failure 'Objects downloaded for single relev
 #      this are not all that common.)
 #   Summary: 1 fetches for 6 objects
 #
-test_expect_merge_algorithm failure failure 'Objects downloaded when a directory rename triggered' '
+test_expect_merge_algorithm failure success 'Objects downloaded when a directory rename triggered' '
 	test_setup_repo &&
 	git clone --sparse --filter=blob:none "file://$(pwd)/server" objects-dir &&
 	(
-- 
gitgitgadget



* [PATCH 5/5] merge-ort: add prefetching for content merges
  2021-06-05  1:27 [PATCH 0/5] Optimization batch 13: partial clone optimizations for merge-ort Elijah Newren via GitGitGadget
                   ` (3 preceding siblings ...)
  2021-06-05  1:28 ` [PATCH 4/5] diffcore-rename: use a different prefetch for basename comparisons Elijah Newren via GitGitGadget
@ 2021-06-05  1:28 ` Elijah Newren via GitGitGadget
  2021-06-15 22:41 ` [PATCH v2 0/5] Optimization batch 13: partial clone optimizations for merge-ort Elijah Newren via GitGitGadget
  5 siblings, 0 replies; 30+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-06-05  1:28 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Derrick Stolee, Taylor Blau, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

Commit 7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-05)
introduced batching of fetching missing blobs, so that the diff
machinery would have one fetch subprocess grab N blobs instead of N
processes each grabbing 1.

However, the diff machinery is not the only thing in a merge that needs
to work on blobs.  The 3-way content merges need them as well.  Rather
than download all the blobs 1 at a time, prefetch all the blobs needed
for regular content merges.

This does not cover all possible paths in merge-ort that might need to
download blobs.  Others include:
  - The blob_unchanged() calls to avoid modify/delete conflicts (when
    blob renormalization results in an "unchanged" file)
  - Preliminary content merges needed for rename/add and
    rename/rename(2to1) style conflicts.  (Both of these types of
    conflicts can result in nested conflict markers from the need to do
    two levels of content merging; the first happens before our new
    prefetch_for_content_merges() function.)

The first of these wouldn't be an extreme amount of work to support, and
even the second could be theoretically supported in batching, but all of
these cases seem unusual to me, and this is a minor performance
optimization anyway; in the worst case we only get some of the fetches
batched and have a few additional one-off fetches.  So for now, just
handle the regular 3-way content merges in our prefetching.

For the testcase from the previous commit, the number of downloaded
objects remains at 63, but this drops the number of fetches needed from
32 down to 20, a sizeable reduction.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c                    | 50 ++++++++++++++++++++++++++++++++++
 t/t6421-merge-partial-clone.sh |  2 +-
 2 files changed, 51 insertions(+), 1 deletion(-)

diff --git a/merge-ort.c b/merge-ort.c
index cfa751053b01..e3a5dfc7b312 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -29,6 +29,7 @@
 #include "entry.h"
 #include "ll-merge.h"
 #include "object-store.h"
+#include "promisor-remote.h"
 #include "revision.h"
 #include "strmap.h"
 #include "submodule.h"
@@ -3485,6 +3486,54 @@ static void process_entry(struct merge_options *opt,
 	record_entry_for_tree(dir_metadata, path, &ci->merged);
 }
 
+static void prefetch_for_content_merges(struct merge_options *opt,
+					struct string_list *plist)
+{
+	struct string_list_item *e;
+	struct oid_array to_fetch = OID_ARRAY_INIT;
+
+	if (opt->repo != the_repository || !has_promisor_remote())
+		return;
+
+	for (e = &plist->items[plist->nr-1]; e >= plist->items; --e) {
+		/* char *path = e->string; */
+		struct conflict_info *ci = e->util;
+		int i;
+
+		/* Ignore clean entries */
+		if (ci->merged.clean)
+			continue;
+
+		/* Ignore entries that don't need a content merge */
+		if (ci->match_mask || ci->filemask < 6 ||
+		    !S_ISREG(ci->stages[1].mode) ||
+		    !S_ISREG(ci->stages[2].mode) ||
+		    oideq(&ci->stages[1].oid, &ci->stages[2].oid))
+			continue;
+
+		/* Also don't need content merge if base matches either side */
+		if (ci->filemask == 7 &&
+		    S_ISREG(ci->stages[0].mode) &&
+		    (oideq(&ci->stages[0].oid, &ci->stages[1].oid) ||
+		     oideq(&ci->stages[0].oid, &ci->stages[2].oid)))
+			continue;
+
+		for (i = 0; i < 3; i++) {
+			unsigned side_mask = (1 << i);
+			struct version_info *vi = &ci->stages[i];
+
+			if ((ci->filemask & side_mask) &&
+			    S_ISREG(vi->mode) &&
+			    oid_object_info_extended(opt->repo, &vi->oid, NULL,
+						     OBJECT_INFO_FOR_PREFETCH))
+				oid_array_append(&to_fetch, &vi->oid);
+		}
+	}
+
+	promisor_remote_get_direct(opt->repo, to_fetch.oid, to_fetch.nr);
+	oid_array_clear(&to_fetch);
+}
+
 static void process_entries(struct merge_options *opt,
 			    struct object_id *result_oid)
 {
@@ -3531,6 +3580,7 @@ static void process_entries(struct merge_options *opt,
 	 * the way when it is time to process the file at the same path).
 	 */
 	trace2_region_enter("merge", "processing", opt->repo);
+	prefetch_for_content_merges(opt, &plist);
 	for (entry = &plist.items[plist.nr-1]; entry >= plist.items; --entry) {
 		char *path = entry->string;
 		/*
diff --git a/t/t6421-merge-partial-clone.sh b/t/t6421-merge-partial-clone.sh
index fb7eb18cc80c..84642cb3831c 100755
--- a/t/t6421-merge-partial-clone.sh
+++ b/t/t6421-merge-partial-clone.sh
@@ -392,7 +392,7 @@ test_expect_merge_algorithm failure success 'Objects downloaded when a directory
 #
 #   Summary: 4 fetches (1 for 6 objects, 1 for 8, 1 for 3, 1 for 2)
 #
-test_expect_merge_algorithm failure failure 'Objects downloaded with lots of renames and modifications' '
+test_expect_merge_algorithm failure success 'Objects downloaded with lots of renames and modifications' '
 	test_setup_repo &&
 	git clone --sparse --filter=blob:none "file://$(pwd)/server" objects-many &&
 	(
-- 
gitgitgadget


* Re: [PATCH 1/5] promisor-remote: output trace2 statistics for number of objects fetched
  2021-06-05  1:28 ` [PATCH 1/5] promisor-remote: output trace2 statistics for number of objects fetched Elijah Newren via GitGitGadget
@ 2021-06-09 21:12   ` Derrick Stolee
  0 siblings, 0 replies; 30+ messages in thread
From: Derrick Stolee @ 2021-06-09 21:12 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget, git
  Cc: Jonathan Tan, Derrick Stolee, Taylor Blau, Elijah Newren

On 6/4/2021 9:28 PM, Elijah Newren via GitGitGadget wrote:
> +	trace2_data_intmax("promisor", repo, "fetch_count", oid_nr);

This seems like a helpful value for lots of reasons.

Thanks,
-Stolee


* Re: [PATCH 2/5] t6421: add tests checking for excessive object downloads during merge
  2021-06-05  1:28 ` [PATCH 2/5] t6421: add tests checking for excessive object downloads during merge Elijah Newren via GitGitGadget
@ 2021-06-09 21:16   ` Derrick Stolee
  0 siblings, 0 replies; 30+ messages in thread
From: Derrick Stolee @ 2021-06-09 21:16 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget, git
  Cc: Jonathan Tan, Derrick Stolee, Taylor Blau, Elijah Newren

On 6/4/2021 9:28 PM, Elijah Newren via GitGitGadget wrote:
> From: Elijah Newren <newren@gmail.com>
...
> +test_expect_merge_algorithm failure failure 'Objects downloaded for single relevant rename' '
> +	test_setup_repo &&
> +	git clone --sparse --filter=blob:none "file://$(pwd)/server" objects-single &&
> +	(
> +		cd objects-single &&
> +
> +		git rev-list --objects --all --missing=print |
> +			grep '\?' >missing-objects-before &&
> +
> +		git checkout -q origin/A &&
> +
> +		GIT_TRACE2_PERF="$(pwd)/trace.output" git -c merge.directoryRenames=true merge --no-stat --no-progress origin/B-single &&

nit: this long line could be split up with \ marks.

> +
> +		# Check the number of objects we reported we would fetch
> +		cat >expect <<-EOF &&
> +		 ..........fetch_count:2
> +		 ......fetch_count:1

I'm concerned that the number of dots here could change for various
reasons (e.g. adding regions). Could the 'cut' below cut everything
after the dots, so we only care about "fetch_count:X"?

> +		EOF
> +		grep fetch_count trace.output | cut -d "|" -f 9 >actual &&
> +		test_cmp expect actual &&
> +
> +		# Check the number of fetch commands exec-ed
> +		grep d0.*fetch.negotiationAlgorithm trace.output >fetches &&
> +		test_line_count = 2 fetches &&
> +
> +		git rev-list --objects --all --missing=print |
> +			grep ^? >missing-objects-after &&
> +		test_cmp missing-objects-before missing-objects-after |
> +			grep "^[-+]?" >found-and-new-objects &&
> +		# We should not have any NEW missing objects
> +		! grep ^+ found-and-new-objects &&
> +		# Fetched 2+1=3 objects, so should have 3 fewer missing objects
> +		test_line_count = 3 found-and-new-objects
> +	)
> +'

Generally, this idea of using trace2 to determine internal activity
that affects performance is a good one. The balance to strike is to
ensure that future changes don't suffer too much due to the how
rigid the tests are. The dots are the only example of a way to relax
these tests a bit more.

Thanks,
-Stolee


* [PATCH v2 0/5] Optimization batch 13: partial clone optimizations for merge-ort
  2021-06-05  1:27 [PATCH 0/5] Optimization batch 13: partial clone optimizations for merge-ort Elijah Newren via GitGitGadget
                   ` (4 preceding siblings ...)
  2021-06-05  1:28 ` [PATCH 5/5] merge-ort: add prefetching for content merges Elijah Newren via GitGitGadget
@ 2021-06-15 22:41 ` Elijah Newren via GitGitGadget
  2021-06-15 22:41   ` [PATCH v2 1/5] promisor-remote: output trace2 statistics for number of objects fetched Elijah Newren via GitGitGadget
                     ` (7 more replies)
  5 siblings, 8 replies; 30+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-06-15 22:41 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Derrick Stolee, Taylor Blau, Elijah Newren

This series optimizes blob downloading in merges for partial clones. It can
apply on master. It's independent of ort-perf-batch-12.

Changes since v1:

 * Incorporated the suggestions from Stolee on patch 2.

This series has a minor conflict with jt/partial-clone-submodule-1. I asked
about this previously and it was decided to just submit these topics
independently[1]. The conflict is that both topics add a "repo" argument to
fetch_objects(), but jt/partial-clone-submodule-1 also makes additional
nearby changes.

[1]
https://lore.kernel.org/git/20210609045804.2329079-1-jonathantanmy@google.com/

=== Basic Optimization idea ===

merge-ort was designed to minimize the computation needed to complete a
merge, and much of that (particularly the "irrelevant rename"
determinations) also dramatically reduced the amount of data needed for the
merge. Reducing the amount of data needed to do computations ought to
benefit partial clones as well by enabling them to download less
information. However, my previous series didn't modify the prefetch()
command in diffcore-rename to take advantage of these reduced data
requirements. This series changes that.

Further, although diffcore-rename batched downloads of objects for rename
detection, the merge machinery did not do the same for three-way content
merges of files. This series adds batch downloading of objects within
merge-ort to correct that.

=== Modified performance measurement method ===

The testcases I've been using so far to measure performance were not run in
a partial clone, so they aren't directly usable for comparison. Further,
partial clone performance depends on network speed, which can be highly
variable. So I want to modify one of the existing testcases slightly and
focus on two different but more stable metrics:

 1. Number of git fetch operations during rebase
 2. Number of objects fetched during rebase

The first of these should already be decent due to Jonathan Tan's work to
batch fetching of missing blobs during rename detection (see commit
7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-05)), so we are
mostly looking to optimize the second but would like to also decrease the
first if possible.

The testcase we will look at will be a modification of the mega-renames
testcase from commit 557ac0350d ("merge-ort: begin performance work;
instrument with trace2_region_* calls", 2020-10-28). In particular, we
change

$ git clone \
    git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git


to

$ git clone --sparse --filter=blob:none \
    https://github.com/github/linux


(The change in clone URL is just to get a server that supports the filter
predicate.)

We otherwise keep the test the same (in particular, we do not add any calls
to "git sparse-checkout {set,add}", which means that the resulting repository
will only have 7 total blobs from files in the toplevel directory before
starting the rebase).
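
To make the extraction concrete, here is a sketch of how the second metric
is parsed out of the trace in the new t6421 tests.  The trace lines below
are shortened, hypothetical mocks; what matters is that real
GIT_TRACE2_PERF lines are pipe-delimited with the data payload in field 9,
and that the leading dots encode region nesting depth:

```shell
# Mock GIT_TRACE2_PERF output containing two promisor "data" events.
cat >trace.output <<'EOF'
d0|20210605T012800|main.c:100|  0.10|  0.10|promisor|data|r0|..........fetch_count:2
d0|20210605T012800|main.c:200|  0.20|  0.20|promisor|data|r0|......fetch_count:1
EOF

# Keep field 9 and strip the nesting dots (and padding spaces).
grep fetch_count trace.output | cut -d "|" -f 9 | tr -d " ."
# prints:
# fetch_count:2
# fetch_count:1
```

Summing the fetch_count values then gives the total number of objects
fetched across all fetch subprocesses.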

=== Results ===

For the mega-renames testcase noted above (which rebases 35 commits across
an upstream with ~26K renames in a partial clone), I found the following
results for our metrics of interest:

     Number of `git fetch` ops during rebase

                     Before Series   After Series
merge-recursive:          62              63
merge-ort:                30              20


     Number of objects fetched during rebase

                     Before Series   After Series
merge-recursive:         11423          11423
merge-ort:               11391             63


So, we have a significant reduction (factor of ~3 relative to
merge-recursive) in the number of git fetch operations that have to be
performed in a partial clone to complete the rebase, and a dramatic
reduction (factor of ~180) in the number of objects that need to be fetched.

=== Summary ===

It's worth pointing out that merge-ort after this series needs only ~1.8
blobs per commit being transplanted to complete this particular rebase.
Essentially, this reinforces the fact that the optimization work so far has
taken rename detection from often being an overwhelmingly costly portion of
a merge (leading many to simply give up on it), to what I have observed so
far as being just a minor cost of merges.
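
The ~1.8 blobs-per-commit figure is just the total object count after the
series divided by the number of patches rebased:

```shell
# 63 objects fetched across a rebase of 35 patches
awk 'BEGIN { printf "%.1f\n", 63 / 35 }'
# prints: 1.8
```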

Elijah Newren (5):
  promisor-remote: output trace2 statistics for number of objects
    fetched
  t6421: add tests checking for excessive object downloads during merge
  diffcore-rename: allow different missing_object_cb functions
  diffcore-rename: use a different prefetch for basename comparisons
  merge-ort: add prefetching for content merges

 diffcore-rename.c              | 149 ++++++++---
 merge-ort.c                    |  50 ++++
 promisor-remote.c              |   7 +-
 t/t6421-merge-partial-clone.sh | 439 +++++++++++++++++++++++++++++++++
 4 files changed, 611 insertions(+), 34 deletions(-)
 create mode 100755 t/t6421-merge-partial-clone.sh


base-commit: 6de569e6ac492213e81321ca35f1f1b365ba31e3
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-969%2Fnewren%2Fort-perf-batch-13-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-969/newren/ort-perf-batch-13-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/969

Range-diff vs v1:

 1:  ad9b451d6823 = 1:  04f5ebdabe14 promisor-remote: output trace2 statistics for number of objects fetched
 2:  6462bb63310d ! 2:  0f786cfb4c95 t6421: add tests checking for excessive object downloads during merge
     @@ t/t6421-merge-partial-clone.sh (new)
      +
      +		git checkout -q origin/A &&
      +
     -+		GIT_TRACE2_PERF="$(pwd)/trace.output" git -c merge.directoryRenames=true merge --no-stat --no-progress origin/B-single &&
     ++		GIT_TRACE2_PERF="$(pwd)/trace.output" git \
     ++			-c merge.directoryRenames=true merge --no-stat \
     ++			--no-progress origin/B-single &&
      +
      +		# Check the number of objects we reported we would fetch
      +		cat >expect <<-EOF &&
     -+		 ..........fetch_count:2
     -+		 ......fetch_count:1
     ++		fetch_count:2
     ++		fetch_count:1
      +		EOF
     -+		grep fetch_count trace.output | cut -d "|" -f 9 >actual &&
     ++		grep fetch_count trace.output | cut -d "|" -f 9 | tr -d " ." >actual &&
      +		test_cmp expect actual &&
      +
      +		# Check the number of fetch commands exec-ed
     @@ t/t6421-merge-partial-clone.sh (new)
      +
      +		git checkout -q origin/A &&
      +
     -+		GIT_TRACE2_PERF="$(pwd)/trace.output" git -c merge.directoryRenames=true merge --no-stat --no-progress origin/B-dir &&
     ++		GIT_TRACE2_PERF="$(pwd)/trace.output" git \
     ++			-c merge.directoryRenames=true merge --no-stat \
     ++			--no-progress origin/B-dir &&
      +
      +		# Check the number of objects we reported we would fetch
      +		cat >expect <<-EOF &&
     -+		 ..........fetch_count:6
     ++		fetch_count:6
      +		EOF
     -+		grep fetch_count trace.output | cut -d "|" -f 9 >actual &&
     ++		grep fetch_count trace.output | cut -d "|" -f 9 | tr -d " ." >actual &&
      +		test_cmp expect actual &&
      +
      +		# Check the number of fetch commands exec-ed
     @@ t/t6421-merge-partial-clone.sh (new)
      +
      +		git checkout -q origin/A &&
      +
     -+		GIT_TRACE2_PERF="$(pwd)/trace.output" git -c merge.directoryRenames=true merge --no-stat --no-progress origin/B-many &&
     ++		GIT_TRACE2_PERF="$(pwd)/trace.output" git \
     ++			-c merge.directoryRenames=true merge --no-stat \
     ++			--no-progress origin/B-many &&
      +
      +		# Check the number of objects we reported we would fetch
      +		cat >expect <<-EOF &&
     -+		 ..........fetch_count:12
     -+		 ..........fetch_count:5
     -+		 ..........fetch_count:3
     -+		 ......fetch_count:2
     ++		fetch_count:12
     ++		fetch_count:5
     ++		fetch_count:3
     ++		fetch_count:2
      +		EOF
     -+		grep fetch_count trace.output | cut -d "|" -f 9 >actual &&
     ++		grep fetch_count trace.output | cut -d "|" -f 9 | tr -d " ." >actual &&
      +		test_cmp expect actual &&
      +
      +		# Check the number of fetch commands exec-ed
 3:  c4b3109c3b08 = 3:  9f2a8ed8d61f diffcore-rename: allow different missing_object_cb functions
 4:  f4ade3996d3f = 4:  f753f8035564 diffcore-rename: use a different prefetch for basename comparisons
 5:  ca3b2a743b8e = 5:  317bcc7f56cb merge-ort: add prefetching for content merges

-- 
gitgitgadget


* [PATCH v2 1/5] promisor-remote: output trace2 statistics for number of objects fetched
  2021-06-15 22:41 ` [PATCH v2 0/5] Optimization batch 13: partial clone optimizations for merge-ort Elijah Newren via GitGitGadget
@ 2021-06-15 22:41   ` Elijah Newren via GitGitGadget
  2021-06-15 22:41   ` [PATCH v2 2/5] t6421: add tests checking for excessive object downloads during merge Elijah Newren via GitGitGadget
                     ` (6 subsequent siblings)
  7 siblings, 0 replies; 30+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-06-15 22:41 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Derrick Stolee, Taylor Blau, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 promisor-remote.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/promisor-remote.c b/promisor-remote.c
index da3f2ca2615e..d465377d7d32 100644
--- a/promisor-remote.c
+++ b/promisor-remote.c
@@ -12,7 +12,8 @@ void set_repository_format_partial_clone(char *partial_clone)
 	repository_format_partial_clone = xstrdup_or_null(partial_clone);
 }
 
-static int fetch_objects(const char *remote_name,
+static int fetch_objects(struct repository *repo,
+			 const char *remote_name,
 			 const struct object_id *oids,
 			 int oid_nr)
 {
@@ -30,6 +31,8 @@ static int fetch_objects(const char *remote_name,
 		die(_("promisor-remote: unable to fork off fetch subprocess"));
 	child_in = xfdopen(child.in, "w");
 
+	trace2_data_intmax("promisor", repo, "fetch_count", oid_nr);
+
 	for (i = 0; i < oid_nr; i++) {
 		if (fputs(oid_to_hex(&oids[i]), child_in) < 0)
 			die_errno(_("promisor-remote: could not write to fetch subprocess"));
@@ -238,7 +241,7 @@ int promisor_remote_get_direct(struct repository *repo,
 	promisor_remote_init();
 
 	for (r = promisors; r; r = r->next) {
-		if (fetch_objects(r->name, remaining_oids, remaining_nr) < 0) {
+		if (fetch_objects(repo, r->name, remaining_oids, remaining_nr) < 0) {
 			if (remaining_nr == 1)
 				continue;
 			remaining_nr = remove_fetched_oids(repo, &remaining_oids,
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v2 2/5] t6421: add tests checking for excessive object downloads during merge
  2021-06-15 22:41 ` [PATCH v2 0/5] Optimization batch 13: partial clone optimizations for merge-ort Elijah Newren via GitGitGadget
  2021-06-15 22:41   ` [PATCH v2 1/5] promisor-remote: output trace2 statistics for number of objects fetched Elijah Newren via GitGitGadget
@ 2021-06-15 22:41   ` Elijah Newren via GitGitGadget
  2021-06-17  4:49     ` Junio C Hamano
  2021-06-15 22:41   ` [PATCH v2 3/5] diffcore-rename: allow different missing_object_cb functions Elijah Newren via GitGitGadget
                     ` (5 subsequent siblings)
  7 siblings, 1 reply; 30+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-06-15 22:41 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Derrick Stolee, Taylor Blau, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 t/t6421-merge-partial-clone.sh | 439 +++++++++++++++++++++++++++++++++
 1 file changed, 439 insertions(+)
 create mode 100755 t/t6421-merge-partial-clone.sh

diff --git a/t/t6421-merge-partial-clone.sh b/t/t6421-merge-partial-clone.sh
new file mode 100755
index 000000000000..c9d6860de9c9
--- /dev/null
+++ b/t/t6421-merge-partial-clone.sh
@@ -0,0 +1,439 @@
+#!/bin/sh
+
+test_description="limiting blob downloads when merging with partial clones"
+# Uses a methodology similar to
+#   t6042: corner cases with renames but not criss-cross merges
+#   t6036: corner cases with both renames and criss-cross merges
+#   t6423: directory rename detection
+#
+# The setup for all of them, pictorially, is:
+#
+#      A
+#      o
+#     / \
+#  O o   ?
+#     \ /
+#      o
+#      B
+#
+# To help make it easier to follow the flow of tests, they have been
+# divided into sections and each test will start with a quick explanation
+# of what commits O, A, and B contain.
+#
+# Notation:
+#    z/{b,c}   means  files z/b and z/c both exist
+#    x/d_1     means  file x/d exists with content d1.  (Purpose of the
+#                     underscore notation is to differentiate different
+#                     files that might be renamed into each other's paths.)
+
+. ./test-lib.sh
+. "$TEST_DIRECTORY"/lib-merge.sh
+
+test_setup_repo () {
+	test -d server && return
+	test_create_repo server &&
+	(
+		cd server &&
+
+		git config uploadpack.allowfilter 1 &&
+		git config uploadpack.allowanysha1inwant 1 &&
+
+		mkdir -p general &&
+		test_seq 2 9 >general/leap1 &&
+		cp general/leap1 general/leap2 &&
+		echo leap2 >>general/leap2 &&
+
+		mkdir -p basename &&
+		cp general/leap1 basename/numbers &&
+		cp general/leap1 basename/sequence &&
+		cp general/leap1 basename/values &&
+		echo numbers >>basename/numbers &&
+		echo sequence >>basename/sequence &&
+		echo values >>basename/values &&
+
+		mkdir -p dir/unchanged &&
+		mkdir -p dir/subdir/tweaked &&
+		echo a >dir/subdir/a &&
+		echo b >dir/subdir/b &&
+		echo c >dir/subdir/c &&
+		echo d >dir/subdir/d &&
+		echo e >dir/subdir/e &&
+		cp general/leap1 dir/subdir/Makefile &&
+		echo toplevel makefile >>dir/subdir/Makefile &&
+		echo f >dir/subdir/tweaked/f &&
+		echo g >dir/subdir/tweaked/g &&
+		echo h >dir/subdir/tweaked/h &&
+		echo subdirectory makefile >dir/subdir/tweaked/Makefile &&
+		for i in $(test_seq 1 88); do
+			echo content $i >dir/unchanged/file_$i
+		done &&
+		git add . &&
+		git commit -m "O" &&
+
+		git branch O &&
+		git branch A &&
+		git branch B-single &&
+		git branch B-dir &&
+		git branch B-many &&
+
+		git switch A &&
+
+		git rm general/leap* &&
+		mkdir general/ &&
+		test_seq 1 9 >general/jump1 &&
+		cp general/jump1 general/jump2 &&
+		echo leap2 >>general/jump2 &&
+
+		rm basename/numbers basename/sequence basename/values &&
+		mkdir -p basename/subdir/ &&
+		cp general/jump1 basename/subdir/numbers &&
+		cp general/jump1 basename/subdir/sequence &&
+		cp general/jump1 basename/subdir/values &&
+		echo numbers >>basename/subdir/numbers &&
+		echo sequence >>basename/subdir/sequence &&
+		echo values >>basename/subdir/values &&
+
+		git rm dir/subdir/tweaked/f &&
+		echo more >>dir/subdir/e &&
+		echo more >>dir/subdir/Makefile &&
+		echo more >>dir/subdir/tweaked/Makefile &&
+		mkdir dir/subdir/newsubdir &&
+		echo rust code >dir/subdir/newsubdir/newfile.rs &&
+		git mv dir/subdir/e dir/subdir/newsubdir/ &&
+		git mv dir folder &&
+		git add . &&
+		git commit -m "A" &&
+
+		git switch B-single &&
+		echo new first line >dir/subdir/Makefile &&
+		cat general/leap1 >>dir/subdir/Makefile &&
+		echo toplevel makefile >>dir/subdir/Makefile &&
+		echo perl code >general/newfile.pl &&
+		git add . &&
+		git commit -m "B-single" &&
+
+		git switch B-dir &&
+		echo java code >dir/subdir/newfile.java &&
+		echo scala code >dir/subdir/newfile.scala &&
+		echo groovy code >dir/subdir/newfile.groovy &&
+		git add . &&
+		git commit -m "B-dir" &&
+
+		git switch B-many &&
+		test_seq 2 10 >general/leap1 &&
+		rm general/leap2 &&
+		cp general/leap1 general/leap2 &&
+		echo leap2 >>general/leap2 &&
+
+		rm basename/numbers basename/sequence basename/values &&
+		mkdir -p basename/subdir/ &&
+		cp general/leap1 basename/subdir/numbers &&
+		cp general/leap1 basename/subdir/sequence &&
+		cp general/leap1 basename/subdir/values &&
+		echo numbers >>basename/subdir/numbers &&
+		echo sequence >>basename/subdir/sequence &&
+		echo values >>basename/subdir/values &&
+
+		mkdir dir/subdir/newsubdir/ &&
+		echo c code >dir/subdir/newfile.c &&
+		echo python code >dir/subdir/newsubdir/newfile.py &&
+		git add . &&
+		git commit -m "B-many" &&
+
+		git switch A
+	)
+}
+
+# Testcase: Objects downloaded for single relevant rename
+#   Commit O:
+#              general/{leap1_O, leap2_O}
+#              basename/{numbers_O, sequence_O, values_O}
+#              dir/subdir/{a,b,c,d,e_O,Makefile_TOP_O}
+#              dir/subdir/tweaked/{f,g,h,Makefile_SUB_O}
+#              dir/unchanged/<LOTS OF FILES>
+#   Commit A:
+#     (Rename leap->jump, rename basename/ -> basename/subdir/, rename dir/
+#      -> folder/, move e into newsubdir, add newfile.rs, remove f, modify
+#      both Makefiles and jumps)
+#              general/{jump1_A, jump2_A}
+#              basename/subdir/{numbers_A, sequence_A, values_A}
+#              folder/subdir/{a,b,c,d,Makefile_TOP_A}
+#              folder/subdir/newsubdir/{e_A,newfile.rs}
+#              folder/subdir/tweaked/{g,h,Makefile_SUB_A}
+#              folder/unchanged/<LOTS OF FILES>
+#   Commit B(-single):
+#     (add newfile.pl, tweak Makefile_TOP)
+#              general/{leap1_O, leap2_O,newfile.pl}
+#              basename/{numbers_O, sequence_O, values_O}
+#              dir/{a,b,c,d,e_O,Makefile_TOP_B}
+#              dir/tweaked/{f,g,h,Makefile_SUB_O}
+#              dir/unchanged/<LOTS OF FILES>
+#   Expected:
+#              general/{jump1_A, jump2_A,newfile.pl}
+#              basename/subdir/{numbers_A, sequence_A, values_A}
+#              folder/subdir/{a,b,c,d,Makefile_TOP_Merged}
+#              folder/subdir/newsubdir/{e_A,newfile.rs}
+#              folder/subdir/tweaked/{g,h,Makefile_SUB_A}
+#              folder/unchanged/<LOTS OF FILES>
+#
+# Objects that need to be fetched:
+#   Rename detection:
+#     Side1 (O->A):
+#       Basename-matches rename detection only needs to fetch these objects:
+#         Makefile_TOP_O, Makefile_TOP_A
+#         (Despite many renames, all others are content irrelevant.  They
+#          are also location irrelevant because newfile.rs was added on
+#          the side doing the directory rename, and newfile.pl was added to
+#          a directory that was not renamed on either side.)
+#       General rename detection only needs to fetch these objects:
+#         <None>
+#          (Even though newfile.rs, jump[12], basename/subdir/*, and e
+#          could all be used as destinations in rename detection, the
+#          basename detection for Makefile matches up all relevant
+#          sources, so these other files never end up needing to be
+#          used)
+#     Side2 (O->B):
+#       Basename-matches rename detection only needs to fetch these objects:
+#         <None>
+#         (there are no deleted files, so no possible sources)
+#       General rename detection only needs to fetch these objects:
+#         <None>
+#         (there are no deleted files, so no possible sources)
+#   Merge:
+#     3-way content merge needs to grab these objects:
+#       Makefile_TOP_B
+#   Nothing else needs to fetch objects
+#
+#   Summary: 2 fetches (1 for 2 objects, 1 for 1 object)
+#
+test_expect_merge_algorithm failure failure 'Objects downloaded for single relevant rename' '
+	test_setup_repo &&
+	git clone --sparse --filter=blob:none "file://$(pwd)/server" objects-single &&
+	(
+		cd objects-single &&
+
+		git rev-list --objects --all --missing=print |
+			grep "^?" >missing-objects-before &&
+
+		git checkout -q origin/A &&
+
+		GIT_TRACE2_PERF="$(pwd)/trace.output" git \
+			-c merge.directoryRenames=true merge --no-stat \
+			--no-progress origin/B-single &&
+
+		# Check the number of objects we reported we would fetch
+		cat >expect <<-EOF &&
+		fetch_count:2
+		fetch_count:1
+		EOF
+		grep fetch_count trace.output | cut -d "|" -f 9 | tr -d " ." >actual &&
+		test_cmp expect actual &&
+
+		# Check the number of fetch commands exec-ed
+		grep "d0.*fetch.negotiationAlgorithm" trace.output >fetches &&
+		test_line_count = 2 fetches &&
+
+		git rev-list --objects --all --missing=print |
+			grep "^?" >missing-objects-after &&
+		test_cmp missing-objects-before missing-objects-after |
+			grep "^[-+]?" >found-and-new-objects &&
+		# We should not have any NEW missing objects
+		! grep ^+ found-and-new-objects &&
+		# Fetched 2+1=3 objects, so should have 3 fewer missing objects
+		test_line_count = 3 found-and-new-objects
+	)
+'
+
+# Testcase: Objects downloaded for directory rename
+#   Commit O:
+#              general/{leap1_O, leap2_O}
+#              basename/{numbers_O, sequence_O, values_O}
+#              dir/subdir/{a,b,c,d,e_O,Makefile_TOP_O}
+#              dir/subdir/tweaked/{f,g,h,Makefile_SUB_O}
+#              dir/unchanged/<LOTS OF FILES>
+#   Commit A:
+#     (Rename leap->jump, rename basename/ -> basename/subdir/, rename dir/ ->
+#      folder/, move e into newsubdir, add newfile.rs, remove f, modify
+#      both Makefiles and jumps)
+#              general/{jump1_A, jump2_A}
+#              basename/subdir/{numbers_A, sequence_A, values_A}
+#              folder/subdir/{a,b,c,d,Makefile_TOP_A}
+#              folder/subdir/newsubdir/{e_A,newfile.rs}
+#              folder/subdir/tweaked/{g,h,Makefile_SUB_A}
+#              folder/unchanged/<LOTS OF FILES>
+#   Commit B(-dir):
+#     (add dir/subdir/newfile.{java,scala,groovy})
+#              general/{leap1_O, leap2_O}
+#              basename/{numbers_O, sequence_O, values_O}
+#              dir/subdir/{a,b,c,d,e_O,Makefile_TOP_O,
+#                          newfile.java,newfile.scala,newfile.groovy}
+#              dir/subdir/tweaked/{f,g,h,Makefile_SUB_O}
+#              dir/unchanged/<LOTS OF FILES>
+#   Expected:
+#              general/{jump1_A, jump2_A}
+#              basename/subdir/{numbers_A, sequence_A, values_A}
+#              folder/subdir/{a,b,c,d,Makefile_TOP_A,
+#                             newfile.java,newfile.scala,newfile.groovy}
+#              folder/subdir/newsubdir/{e_A,newfile.rs}
+#              folder/subdir/tweaked/{g,h,Makefile_SUB_A}
+#              folder/unchanged/<LOTS OF FILES>
+#
+# Objects that need to be fetched:
+#   Makefile_TOP_O, Makefile_TOP_A
+#   Makefile_SUB_O, Makefile_SUB_A
+#   e_O, e_A
+#   * Despite A's rename of jump->leap, those renames are irrelevant.
+#   * Despite A's rename of basename/ -> basename/subdir/, those renames are
+#     irrelevant.
+#   * Because of A's rename of dir/ -> folder/ and B-dir's addition of
+#     newfile.* into dir/subdir/, we need to determine directory renames.
+#     (Technically, there are enough exact renames to determine directory
+#      rename detection, but the current implementation always does
+#      basename searching before directory rename detection.  Running it
+#      also before basename searching would mean doing directory rename
+#      detection twice, but it's a bit expensive to do that and cases like
+#      this are not all that common.)
+#   Summary: 1 fetch for 6 objects
+#
+test_expect_merge_algorithm failure failure 'Objects downloaded when a directory rename triggered' '
+	test_setup_repo &&
+	git clone --sparse --filter=blob:none "file://$(pwd)/server" objects-dir &&
+	(
+		cd objects-dir &&
+
+		git rev-list --objects --all --missing=print |
+			grep "^?" >missing-objects-before &&
+
+		git checkout -q origin/A &&
+
+		GIT_TRACE2_PERF="$(pwd)/trace.output" git \
+			-c merge.directoryRenames=true merge --no-stat \
+			--no-progress origin/B-dir &&
+
+		# Check the number of objects we reported we would fetch
+		cat >expect <<-EOF &&
+		fetch_count:6
+		EOF
+		grep fetch_count trace.output | cut -d "|" -f 9 | tr -d " ." >actual &&
+		test_cmp expect actual &&
+
+		# Check the number of fetch commands exec-ed
+		grep "d0.*fetch.negotiationAlgorithm" trace.output >fetches &&
+		test_line_count = 1 fetches &&
+
+		git rev-list --objects --all --missing=print |
+			grep "^?" >missing-objects-after &&
+		test_cmp missing-objects-before missing-objects-after |
+			grep "^[-+]?" >found-and-new-objects &&
+		# We should not have any NEW missing objects
+		! grep ^+ found-and-new-objects &&
+		# Fetched 6 objects, so should have 6 fewer missing objects
+		test_line_count = 6 found-and-new-objects
+	)
+'
+
+# Testcase: Objects downloaded with lots of renames and modifications
+#   Commit O:
+#              general/{leap1_O, leap2_O}
+#              basename/{numbers_O, sequence_O, values_O}
+#              dir/subdir/{a,b,c,d,e_O,Makefile_TOP_O}
+#              dir/subdir/tweaked/{f,g,h,Makefile_SUB_O}
+#              dir/unchanged/<LOTS OF FILES>
+#   Commit A:
+#     (Rename leap->jump, rename basename/ -> basename/subdir/, rename dir/
+#      -> folder/, move e into newsubdir, add newfile.rs, remove f, modify
+#      both Makefiles and jumps)
+#              general/{jump1_A, jump2_A}
+#              basename/subdir/{numbers_A, sequence_A, values_A}
+#              folder/subdir/{a,b,c,d,Makefile_TOP_A}
+#              folder/subdir/newsubdir/{e_A,newfile.rs}
+#              folder/subdir/tweaked/{g,h,Makefile_SUB_A}
+#              folder/unchanged/<LOTS OF FILES>
+#   Commit B(-many):
+#     (modify both leaps, rename basename/ -> basename/subdir/, add
+#      newfile.{c,py})
+#              general/{leap1_B, leap2_B}
+#              basename/subdir/{numbers_B, sequence_B, values_B}
+#              dir/{a,b,c,d,e_O,Makefile_TOP_O,newfile.c}
+#              dir/tweaked/{f,g,h,Makefile_SUB_O,newfile.py}
+#              dir/unchanged/<LOTS OF FILES>
+#   Expected:
+#              general/{jump1_Merged, jump2_Merged}
+#              basename/subdir/{numbers_Merged, sequence_Merged, values_Merged}
+#              folder/subdir/{a,b,c,d,Makefile_TOP_A,newfile.c}
+#              folder/subdir/newsubdir/e_A
+#              folder/subdir/tweaked/{g,h,Makefile_SUB_A,newfile.py}
+#              folder/unchanged/<LOTS OF FILES>
+#
+# Objects that need to be fetched:
+#   Rename detection:
+#     Side1 (O->A):
+#       Basename-matches rename detection only needs to fetch these objects:
+#         numbers_O, numbers_A
+#         sequence_O, sequence_A
+#         values_O, values_A
+#         Makefile_TOP_O, Makefile_TOP_A
+#         Makefile_SUB_O, Makefile_SUB_A
+#         e_O, e_A
+#       General rename detection only needs to fetch these objects:
+#         leap1_O, leap2_O
+#         jump1_A, jump2_A, newfile.rs
+#         (only need remaining relevant sources, but any relevant sources need
+#          to be matched against all possible unpaired destinations)
+#     Side2 (O->B):
+#       Basename-matches rename detection only needs to fetch these objects:
+#         numbers_B
+#         sequence_B
+#         values_B
+#       (because numbers_O, sequence_O, and values_O already fetched above)
+#       General rename detection only needs to fetch these objects:
+#         <None>
+#   Merge:
+#     3-way content merge needs to grab these objects:
+#       leap1_B
+#       leap2_B
+#   Nothing else needs to fetch objects
+#
+#   Summary: 4 fetches (1 for 12 objects, 1 for 5, 1 for 3, 1 for 2)
+#
+test_expect_merge_algorithm failure failure 'Objects downloaded with lots of renames and modifications' '
+	test_setup_repo &&
+	git clone --sparse --filter=blob:none "file://$(pwd)/server" objects-many &&
+	(
+		cd objects-many &&
+
+		git rev-list --objects --all --missing=print |
+			grep "^?" >missing-objects-before &&
+
+		git checkout -q origin/A &&
+
+		GIT_TRACE2_PERF="$(pwd)/trace.output" git \
+			-c merge.directoryRenames=true merge --no-stat \
+			--no-progress origin/B-many &&
+
+		# Check the number of objects we reported we would fetch
+		cat >expect <<-EOF &&
+		fetch_count:12
+		fetch_count:5
+		fetch_count:3
+		fetch_count:2
+		EOF
+		grep fetch_count trace.output | cut -d "|" -f 9 | tr -d " ." >actual &&
+		test_cmp expect actual &&
+
+		# Check the number of fetch commands exec-ed
+		grep "d0.*fetch.negotiationAlgorithm" trace.output >fetches &&
+		test_line_count = 4 fetches &&
+
+		git rev-list --objects --all --missing=print |
+			grep "^?" >missing-objects-after &&
+		test_cmp missing-objects-before missing-objects-after |
+			grep "^[-+]?" >found-and-new-objects &&
+		# We should not have any NEW missing objects
+		! grep ^+ found-and-new-objects &&
+		# Fetched 12 + 5 + 3 + 2 == 22 objects
+		test_line_count = 22 found-and-new-objects
+	)
+'
+
+test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v2 3/5] diffcore-rename: allow different missing_object_cb functions
  2021-06-15 22:41 ` [PATCH v2 0/5] Optimization batch 13: partial clone optimizations for merge-ort Elijah Newren via GitGitGadget
  2021-06-15 22:41   ` [PATCH v2 1/5] promisor-remote: output trace2 statistics for number of objects fetched Elijah Newren via GitGitGadget
  2021-06-15 22:41   ` [PATCH v2 2/5] t6421: add tests checking for excessive object downloads during merge Elijah Newren via GitGitGadget
@ 2021-06-15 22:41   ` Elijah Newren via GitGitGadget
  2021-06-15 22:41   ` [PATCH v2 4/5] diffcore-rename: use a different prefetch for basename comparisons Elijah Newren via GitGitGadget
                     ` (4 subsequent siblings)
  7 siblings, 0 replies; 30+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-06-15 22:41 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Derrick Stolee, Taylor Blau, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

estimate_similarity() was setting up a struct
diff_populate_filespec_options every time it was called, requiring its
caller to pass in the data needed to set that option up.  Currently the
needed data consists of a single variable (skip_unmodified), but we also
want the different estimate_similarity() callsites to start using
different missing_object_cb functions.  Rather than passing that extra
data in as well, just have the caller pass in the whole struct
diff_populate_filespec_options, and reduce the number of times we need
to set it up.

As a side note, this also drops the number of calls to
has_promisor_remote() dramatically.  If L is the number of basename
paths to compare, M is the number of inexact sources, and N is the
number of inexact destinations, then the number of calls to
has_promisor_remote() drops from L+M*N down to at most 2 -- one for each
of the sites that calls estimate_similarity().  has_promisor_remote() is
a very fast function so this almost certainly has no measurable
performance impact, but it seems cleaner to avoid calling that function
so many times.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 58 +++++++++++++++++++++++++++++++----------------
 1 file changed, 39 insertions(+), 19 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 35378d84e8f1..e13e52046026 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -126,7 +126,7 @@ static int estimate_similarity(struct repository *r,
 			       struct diff_filespec *src,
 			       struct diff_filespec *dst,
 			       int minimum_score,
-			       int skip_unmodified)
+			       struct diff_populate_filespec_options *dpf_opt)
 {
 	/* src points at a file that existed in the original tree (or
 	 * optionally a file in the destination tree) and dst points
@@ -143,15 +143,6 @@ static int estimate_similarity(struct repository *r,
 	 */
 	unsigned long max_size, delta_size, base_size, src_copied, literal_added;
 	int score;
-	struct diff_populate_filespec_options dpf_options = {
-		.check_size_only = 1
-	};
-	struct prefetch_options prefetch_options = {r, skip_unmodified};
-
-	if (r == the_repository && has_promisor_remote()) {
-		dpf_options.missing_object_cb = prefetch;
-		dpf_options.missing_object_data = &prefetch_options;
-	}
 
 	/* We deal only with regular files.  Symlink renames are handled
 	 * only when they are exact matches --- in other words, no edits
@@ -169,11 +160,13 @@ static int estimate_similarity(struct repository *r,
 	 * is a possible size - we really should have a flag to
 	 * say whether the size is valid or not!)
 	 */
+	dpf_opt->check_size_only = 1;
+
 	if (!src->cnt_data &&
-	    diff_populate_filespec(r, src, &dpf_options))
+	    diff_populate_filespec(r, src, dpf_opt))
 		return 0;
 	if (!dst->cnt_data &&
-	    diff_populate_filespec(r, dst, &dpf_options))
+	    diff_populate_filespec(r, dst, dpf_opt))
 		return 0;
 
 	max_size = ((src->size > dst->size) ? src->size : dst->size);
@@ -191,11 +184,11 @@ static int estimate_similarity(struct repository *r,
 	if (max_size * (MAX_SCORE-minimum_score) < delta_size * MAX_SCORE)
 		return 0;
 
-	dpf_options.check_size_only = 0;
+	dpf_opt->check_size_only = 0;
 
-	if (!src->cnt_data && diff_populate_filespec(r, src, &dpf_options))
+	if (!src->cnt_data && diff_populate_filespec(r, src, dpf_opt))
 		return 0;
-	if (!dst->cnt_data && diff_populate_filespec(r, dst, &dpf_options))
+	if (!dst->cnt_data && diff_populate_filespec(r, dst, dpf_opt))
 		return 0;
 
 	if (diffcore_count_changes(r, src, dst,
@@ -862,7 +855,11 @@ static int find_basename_matches(struct diff_options *options,
 	int i, renames = 0;
 	struct strintmap sources;
 	struct strintmap dests;
-
+	struct diff_populate_filespec_options dpf_options = {
+		.check_binary = 0,
+		.missing_object_cb = NULL,
+		.missing_object_data = NULL
+	};
 	/*
 	 * The prefeteching stuff wants to know if it can skip prefetching
 	 * blobs that are unmodified...and will then do a little extra work
@@ -873,7 +870,10 @@ static int find_basename_matches(struct diff_options *options,
 	 * the extra work necessary to check if rename_src entries are
 	 * unmodified would be a small waste.
 	 */
-	int skip_unmodified = 0;
+	struct prefetch_options prefetch_options = {
+		.repo = options->repo,
+		.skip_unmodified = 0
+	};
 
 	/*
 	 * Create maps of basename -> fullname(s) for remaining sources and
@@ -910,6 +910,11 @@ static int find_basename_matches(struct diff_options *options,
 			strintmap_set(&dests, base, i);
 	}
 
+	if (options->repo == the_repository && has_promisor_remote()) {
+		dpf_options.missing_object_cb = prefetch;
+		dpf_options.missing_object_data = &prefetch_options;
+	}
+
 	/* Now look for basename matchups and do similarity estimation */
 	for (i = 0; i < rename_src_nr; ++i) {
 		char *filename = rename_src[i].p->one->path;
@@ -953,7 +958,7 @@ static int find_basename_matches(struct diff_options *options,
 			one = rename_src[src_index].p->one;
 			two = rename_dst[dst_index].p->two;
 			score = estimate_similarity(options->repo, one, two,
-						    minimum_score, skip_unmodified);
+						    minimum_score, &dpf_options);
 
 			/* If sufficiently similar, record as rename pair */
 			if (score < minimum_score)
@@ -1272,6 +1277,14 @@ void diffcore_rename_extended(struct diff_options *options,
 	int num_sources, want_copies;
 	struct progress *progress = NULL;
 	struct dir_rename_info info;
+	struct diff_populate_filespec_options dpf_options = {
+		.check_binary = 0,
+		.missing_object_cb = NULL,
+		.missing_object_data = NULL
+	};
+	struct prefetch_options prefetch_options = {
+		.repo = options->repo
+	};
 
 	trace2_region_enter("diff", "setup", options->repo);
 	info.setup = 0;
@@ -1433,6 +1446,13 @@ void diffcore_rename_extended(struct diff_options *options,
 				(uint64_t)num_destinations * (uint64_t)num_sources);
 	}
 
+	/* Finish setting up dpf_options */
+	prefetch_options.skip_unmodified = skip_unmodified;
+	if (options->repo == the_repository && has_promisor_remote()) {
+		dpf_options.missing_object_cb = prefetch;
+		dpf_options.missing_object_data = &prefetch_options;
+	}
+
 	CALLOC_ARRAY(mx, st_mult(NUM_CANDIDATE_PER_DST, num_destinations));
 	for (dst_cnt = i = 0; i < rename_dst_nr; i++) {
 		struct diff_filespec *two = rename_dst[i].p->two;
@@ -1458,7 +1478,7 @@ void diffcore_rename_extended(struct diff_options *options,
 			this_src.score = estimate_similarity(options->repo,
 							     one, two,
 							     minimum_score,
-							     skip_unmodified);
+							     &dpf_options);
 			this_src.name_score = basename_same(one, two);
 			this_src.dst = i;
 			this_src.src = j;
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v2 4/5] diffcore-rename: use a different prefetch for basename comparisons
  2021-06-15 22:41 ` [PATCH v2 0/5] Optimization batch 13: partial clone optimizations for merge-ort Elijah Newren via GitGitGadget
                     ` (2 preceding siblings ...)
  2021-06-15 22:41   ` [PATCH v2 3/5] diffcore-rename: allow different missing_object_cb functions Elijah Newren via GitGitGadget
@ 2021-06-15 22:41   ` Elijah Newren via GitGitGadget
  2021-06-15 22:41   ` [PATCH v2 5/5] merge-ort: add prefetching for content merges Elijah Newren via GitGitGadget
                     ` (3 subsequent siblings)
  7 siblings, 0 replies; 30+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-06-15 22:41 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Derrick Stolee, Taylor Blau, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

merge-ort was designed to minimize the amount of data needed and used,
and several changes were made to diffcore-rename to take advantage of
extra metadata to enable this data minimization (particularly the
relevant_sources variable for skipping "irrelevant" renames).  This
effort obviously succeeded in drastically reducing computation times,
but should also theoretically allow partial clones to download much less
information.  Previously, though, the prefetch() function used in
diffcore-rename had never been updated to match, and still downloaded
many blobs that were unnecessary for merge-ort.  This commit corrects
that.

When doing basename comparisons, we want to fetch only the objects that
will be used for basename comparisons.  If after basename fetching this
leaves us with no more relevant sources (or no more destinations), then
we won't need to do the full inexact rename detection and can skip
downloading additional source and destination files.  Even if we have to
do that later full inexact rename detection, irrelevant sources are
culled after basename matching and before the full inexact rename
detection, so we can still avoid downloading the blobs for irrelevant
sources.  Rename prefetch() to inexact_prefetch(), and introduce a
new basename_prefetch() to take advantage of this.

If we modify the testcase from commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2021-01-23)
to pass
    --sparse --filter=blob:none
to the clone command, and use the new trace2 "fetch_count" output from
a few commits ago to track both the number of fetch subcommands invoked
and the number of objects fetched across all those fetches, then for
the mega-renames testcase we observe the following:

BEFORE this commit, rebasing 35 patches:
    strategy     # of fetches    total # of objects fetched
    ---------    ------------    --------------------------
    recursive    62              11423
    ort          30              11391

AFTER this commit, rebasing the same 35 patches:
    ort          32                 63

This means that the new code needs to download fewer than 2 blobs per
patch being rebased.  That is especially interesting given that the
repository began with only approximately half a dozen blobs downloaded
in total (because the default sparse-checkout of just the toplevel
directory was in use).

So, for this particular linux kernel testcase that involved ~26,000
renames on the upstream side (drivers/ -> pilots/) across which 35
patches were being rebased, this change reduces the number of blobs that
need to be downloaded by a factor of ~180.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c              | 101 +++++++++++++++++++++++++++------
 t/t6421-merge-partial-clone.sh |   4 +-
 2 files changed, 85 insertions(+), 20 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index e13e52046026..4ef0459cfb50 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -87,13 +87,13 @@ struct diff_score {
 	short name_score;
 };
 
-struct prefetch_options {
+struct inexact_prefetch_options {
 	struct repository *repo;
 	int skip_unmodified;
 };
-static void prefetch(void *prefetch_options)
+static void inexact_prefetch(void *prefetch_options)
 {
-	struct prefetch_options *options = prefetch_options;
+	struct inexact_prefetch_options *options = prefetch_options;
 	int i;
 	struct oid_array to_fetch = OID_ARRAY_INIT;
 
@@ -816,6 +816,78 @@ static int idx_possible_rename(char *filename, struct dir_rename_info *info)
 	return idx;
 }
 
+struct basename_prefetch_options {
+	struct repository *repo;
+	struct strintmap *relevant_sources;
+	struct strintmap *sources;
+	struct strintmap *dests;
+	struct dir_rename_info *info;
+};
+static void basename_prefetch(void *prefetch_options)
+{
+	struct basename_prefetch_options *options = prefetch_options;
+	struct strintmap *relevant_sources = options->relevant_sources;
+	struct strintmap *sources = options->sources;
+	struct strintmap *dests = options->dests;
+	struct dir_rename_info *info = options->info;
+	int i;
+	struct oid_array to_fetch = OID_ARRAY_INIT;
+
+	/*
+	 * TODO: The following loops mirror the code/logic from
+	 * find_basename_matches(), though not quite exactly.  Maybe
+	 * abstract the iteration logic out somehow?
+	 */
+	for (i = 0; i < rename_src_nr; ++i) {
+		char *filename = rename_src[i].p->one->path;
+		const char *base = NULL;
+		intptr_t src_index;
+		intptr_t dst_index;
+
+		/* Skip irrelevant sources */
+		if (relevant_sources &&
+		    !strintmap_contains(relevant_sources, filename))
+			continue;
+
+		/*
+		 * If the basename is unique among remaining sources, then
+		 * src_index will equal 'i' and we can attempt to match it
+		 * to a unique basename in the destinations.  Otherwise,
+		 * use directory rename heuristics, if possible.
+		 */
+		base = get_basename(filename);
+		src_index = strintmap_get(sources, base);
+		assert(src_index == -1 || src_index == i);
+
+		if (strintmap_contains(dests, base)) {
+			struct diff_filespec *one, *two;
+
+			/* Find a matching destination, if possible */
+			dst_index = strintmap_get(dests, base);
+			if (src_index == -1 || dst_index == -1) {
+				src_index = i;
+				dst_index = idx_possible_rename(filename, info);
+			}
+			if (dst_index == -1)
+				continue;
+
+			/* Ignore this dest if already used in a rename */
+			if (rename_dst[dst_index].is_rename)
+				continue; /* already used previously */
+
+			one = rename_src[src_index].p->one;
+			two = rename_dst[dst_index].p->two;
+
+			/* Add the pairs */
+			diff_add_if_missing(options->repo, &to_fetch, two);
+			diff_add_if_missing(options->repo, &to_fetch, one);
+		}
+	}
+
+	promisor_remote_get_direct(options->repo, to_fetch.oid, to_fetch.nr);
+	oid_array_clear(&to_fetch);
+}
+
 static int find_basename_matches(struct diff_options *options,
 				 int minimum_score,
 				 struct dir_rename_info *info,
@@ -860,19 +932,12 @@ static int find_basename_matches(struct diff_options *options,
 		.missing_object_cb = NULL,
 		.missing_object_data = NULL
 	};
-	/*
-	 * The prefeteching stuff wants to know if it can skip prefetching
-	 * blobs that are unmodified...and will then do a little extra work
-	 * to verify that the oids are indeed different before prefetching.
-	 * Unmodified blobs are only relevant when doing copy detection;
-	 * when limiting to rename detection, diffcore_rename[_extended]()
-	 * will never be called with unmodified source paths fed to us, so
-	 * the extra work necessary to check if rename_src entries are
-	 * unmodified would be a small waste.
-	 */
-	struct prefetch_options prefetch_options = {
+	struct basename_prefetch_options prefetch_options = {
 		.repo = options->repo,
-		.skip_unmodified = 0
+		.relevant_sources = relevant_sources,
+		.sources = &sources,
+		.dests = &dests,
+		.info = info
 	};
 
 	/*
@@ -911,7 +976,7 @@ static int find_basename_matches(struct diff_options *options,
 	}
 
 	if (options->repo == the_repository && has_promisor_remote()) {
-		dpf_options.missing_object_cb = prefetch;
+		dpf_options.missing_object_cb = basename_prefetch;
 		dpf_options.missing_object_data = &prefetch_options;
 	}
 
@@ -1282,7 +1347,7 @@ void diffcore_rename_extended(struct diff_options *options,
 		.missing_object_cb = NULL,
 		.missing_object_data = NULL
 	};
-	struct prefetch_options prefetch_options = {
+	struct inexact_prefetch_options prefetch_options = {
 		.repo = options->repo
 	};
 
@@ -1449,7 +1514,7 @@ void diffcore_rename_extended(struct diff_options *options,
 	/* Finish setting up dpf_options */
 	prefetch_options.skip_unmodified = skip_unmodified;
 	if (options->repo == the_repository && has_promisor_remote()) {
-		dpf_options.missing_object_cb = prefetch;
+		dpf_options.missing_object_cb = inexact_prefetch;
 		dpf_options.missing_object_data = &prefetch_options;
 	}
 
diff --git a/t/t6421-merge-partial-clone.sh b/t/t6421-merge-partial-clone.sh
index c9d6860de9c9..a011f8d27867 100755
--- a/t/t6421-merge-partial-clone.sh
+++ b/t/t6421-merge-partial-clone.sh
@@ -206,7 +206,7 @@ test_setup_repo () {
 #
 #   Summary: 2 fetches (1 for 2 objects, 1 for 1 object)
 #
-test_expect_merge_algorithm failure failure 'Objects downloaded for single relevant rename' '
+test_expect_merge_algorithm failure success 'Objects downloaded for single relevant rename' '
 	test_setup_repo &&
 	git clone --sparse --filter=blob:none "file://$(pwd)/server" objects-single &&
 	(
@@ -295,7 +295,7 @@ test_expect_merge_algorithm failure failure 'Objects downloaded for single relev
 #      this are not all that common.)
 #   Summary: 1 fetches for 6 objects
 #
-test_expect_merge_algorithm failure failure 'Objects downloaded when a directory rename triggered' '
+test_expect_merge_algorithm failure success 'Objects downloaded when a directory rename triggered' '
 	test_setup_repo &&
 	git clone --sparse --filter=blob:none "file://$(pwd)/server" objects-dir &&
 	(
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v2 5/5] merge-ort: add prefetching for content merges
  2021-06-15 22:41 ` [PATCH v2 0/5] Optimization batch 13: partial clone optimizations for merge-ort Elijah Newren via GitGitGadget
                     ` (3 preceding siblings ...)
  2021-06-15 22:41   ` [PATCH v2 4/5] diffcore-rename: use a different prefetch for basename comparisons Elijah Newren via GitGitGadget
@ 2021-06-15 22:41   ` Elijah Newren via GitGitGadget
  2021-06-17  5:04     ` Junio C Hamano
  2021-06-16 17:54   ` [PATCH v2 0/5] Optimization batch 13: partial clone optimizations for merge-ort Derrick Stolee
                     ` (2 subsequent siblings)
  7 siblings, 1 reply; 30+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-06-15 22:41 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, Derrick Stolee, Taylor Blau, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

Commit 7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-05)
introduced batching of fetching missing blobs, so that the diff
machinery would have one fetch subprocess grab N blobs instead of N
processes each grabbing 1.

However, the diff machinery is not the only thing in a merge that needs
to work on blobs.  The 3-way content merges need them as well.  Rather
than download all the blobs 1 at a time, prefetch all the blobs needed
for regular content merges.

This does not cover all possible paths in merge-ort that might need to
download blobs.  Others include:
  - The blob_unchanged() calls to avoid modify/delete conflicts (when
    blob renormalization results in an "unchanged" file)
  - Preliminary content merges needed for rename/add and
    rename/rename(2to1) style conflicts.  (Both of these types of
    conflicts can result in nested conflict markers from the need to do
    two levels of content merging; the first happens before our new
    prefetch_for_content_merges() function.)

The first of these wouldn't be an extreme amount of work to support, and
even the second could be theoretically supported in batching, but all of
these cases seem unusual to me, and this is a minor performance
optimization anyway; in the worst case we only get some of the fetches
batched and have a few additional one-off fetches.  So for now, just
handle the regular 3-way content merges in our prefetching.

For the testcase from the previous commit, the number of downloaded
objects remains at 63, but this drops the number of fetches needed from
32 down to 20, a sizeable reduction.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c                    | 50 ++++++++++++++++++++++++++++++++++
 t/t6421-merge-partial-clone.sh |  2 +-
 2 files changed, 51 insertions(+), 1 deletion(-)

diff --git a/merge-ort.c b/merge-ort.c
index cfa751053b01..e3a5dfc7b312 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -29,6 +29,7 @@
 #include "entry.h"
 #include "ll-merge.h"
 #include "object-store.h"
+#include "promisor-remote.h"
 #include "revision.h"
 #include "strmap.h"
 #include "submodule.h"
@@ -3485,6 +3486,54 @@ static void process_entry(struct merge_options *opt,
 	record_entry_for_tree(dir_metadata, path, &ci->merged);
 }
 
+static void prefetch_for_content_merges(struct merge_options *opt,
+					struct string_list *plist)
+{
+	struct string_list_item *e;
+	struct oid_array to_fetch = OID_ARRAY_INIT;
+
+	if (opt->repo != the_repository || !has_promisor_remote())
+		return;
+
+	for (e = &plist->items[plist->nr-1]; e >= plist->items; --e) {
+		/* char *path = e->string; */
+		struct conflict_info *ci = e->util;
+		int i;
+
+		/* Ignore clean entries */
+		if (ci->merged.clean)
+			continue;
+
+		/* Ignore entries that don't need a content merge */
+		if (ci->match_mask || ci->filemask < 6 ||
+		    !S_ISREG(ci->stages[1].mode) ||
+		    !S_ISREG(ci->stages[2].mode) ||
+		    oideq(&ci->stages[1].oid, &ci->stages[2].oid))
+			continue;
+
+		/* Also don't need content merge if base matches either side */
+		if (ci->filemask == 7 &&
+		    S_ISREG(ci->stages[0].mode) &&
+		    (oideq(&ci->stages[0].oid, &ci->stages[1].oid) ||
+		     oideq(&ci->stages[0].oid, &ci->stages[2].oid)))
+			continue;
+
+		for (i = 0; i < 3; i++) {
+			unsigned side_mask = (1 << i);
+			struct version_info *vi = &ci->stages[i];
+
+			if ((ci->filemask & side_mask) &&
+			    S_ISREG(vi->mode) &&
+			    oid_object_info_extended(opt->repo, &vi->oid, NULL,
+						     OBJECT_INFO_FOR_PREFETCH))
+				oid_array_append(&to_fetch, &vi->oid);
+		}
+	}
+
+	promisor_remote_get_direct(opt->repo, to_fetch.oid, to_fetch.nr);
+	oid_array_clear(&to_fetch);
+}
+
 static void process_entries(struct merge_options *opt,
 			    struct object_id *result_oid)
 {
@@ -3531,6 +3580,7 @@ static void process_entries(struct merge_options *opt,
 	 * the way when it is time to process the file at the same path).
 	 */
 	trace2_region_enter("merge", "processing", opt->repo);
+	prefetch_for_content_merges(opt, &plist);
 	for (entry = &plist.items[plist.nr-1]; entry >= plist.items; --entry) {
 		char *path = entry->string;
 		/*
diff --git a/t/t6421-merge-partial-clone.sh b/t/t6421-merge-partial-clone.sh
index a011f8d27867..26964aa56256 100755
--- a/t/t6421-merge-partial-clone.sh
+++ b/t/t6421-merge-partial-clone.sh
@@ -396,7 +396,7 @@ test_expect_merge_algorithm failure success 'Objects downloaded when a directory
 #
 #   Summary: 4 fetches (1 for 6 objects, 1 for 8, 1 for 3, 1 for 2)
 #
-test_expect_merge_algorithm failure failure 'Objects downloaded with lots of renames and modifications' '
+test_expect_merge_algorithm failure success 'Objects downloaded with lots of renames and modifications' '
 	test_setup_repo &&
 	git clone --sparse --filter=blob:none "file://$(pwd)/server" objects-many &&
 	(
-- 
gitgitgadget


* Re: [PATCH v2 0/5] Optimization batch 13: partial clone optimizations for merge-ort
  2021-06-15 22:41 ` [PATCH v2 0/5] Optimization batch 13: partial clone optimizations for merge-ort Elijah Newren via GitGitGadget
                     ` (4 preceding siblings ...)
  2021-06-15 22:41   ` [PATCH v2 5/5] merge-ort: add prefetching for content merges Elijah Newren via GitGitGadget
@ 2021-06-16 17:54   ` Derrick Stolee
  2021-06-17  5:05   ` Junio C Hamano
  2021-06-22  8:04   ` [PATCH v3 " Elijah Newren via GitGitGadget
  7 siblings, 0 replies; 30+ messages in thread
From: Derrick Stolee @ 2021-06-16 17:54 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget, git
  Cc: Jonathan Tan, Derrick Stolee, Taylor Blau, Elijah Newren

On 6/15/2021 6:41 PM, Elijah Newren via GitGitGadget wrote:
> This series optimizes blob downloading in merges for partial clones. It can
> apply on master. It's independent of ort-perf-batch-12.
> 
> Changes since v1:
> 
>  * Incorporated the suggestions from Stolee on patch 2.

Thank you for these.

>      -+		 ..........fetch_count:12
>      -+		 ..........fetch_count:5
>      -+		 ..........fetch_count:3
>      -+		 ......fetch_count:2
>      ++		fetch_count:12
>      ++		fetch_count:5
>      ++		fetch_count:3
>      ++		fetch_count:2

In particular, I think this simplification will pay dividends in the
future.

Also, I am in full support of the goals of this series. With such
changes, we could potentially re-enable merge.renames in Scalar and
VFS for Git. Currently, we avoid this so we don't download a huge
list of missing blobs whenever pulling the latest changes in these
enormous repos.

I am still late in doing more meaningful testing which would allow me
to give this a full stamp of approval. Specifically, I want to merge
this code and the rest of merge-ort into the vfs-2.32.0 branch of
microsoft/git and run some sample merges on some private monorepos
and see how it behaves. I'm very eager to include merge-ort as a
recommended config in Scalar, but I haven't had the time to devote
to testing like this.

I just created a calendar event for Tuesday, June 22nd to hopefully
devote the entire day to such an effort. Thank you for your
patience.

Thanks,
-Stolee


* Re: [PATCH v2 2/5] t6421: add tests checking for excessive object downloads during merge
  2021-06-15 22:41   ` [PATCH v2 2/5] t6421: add tests checking for excessive object downloads during merge Elijah Newren via GitGitGadget
@ 2021-06-17  4:49     ` Junio C Hamano
  0 siblings, 0 replies; 30+ messages in thread
From: Junio C Hamano @ 2021-06-17  4:49 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget
  Cc: git, Jonathan Tan, Derrick Stolee, Taylor Blau, Elijah Newren

"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +		for i in `test_seq 1 88`; do
> +			echo content $i >dir/unchanged/file_$i
> +		done &&

        for i in $(...)
        do
                ... || return 1
        done &&

> +test_expect_merge_algorithm failure failure 'Objects downloaded for single relevant rename' '
> +	test_setup_repo &&
> +	git clone --sparse --filter=blob:none "file://$(pwd)/server" objects-single &&
> +	(
> +		cd objects-single &&
> +
> +		git rev-list --objects --all --missing=print |
> +			grep '\?' >missing-objects-before &&

The closing and reopening of single quote here is somewhat
misleading.  Wouldn't

		grep "?" >... &&

work just as well?
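A quick standalone illustration of how the unanchored and anchored forms differ (the sample lines below are fabricated stand-ins for rev-list output; in a POSIX BRE, an unquantified '?' is an ordinary character):

```shell
# Two sample lines: one missing-object line, one path that merely
# contains a question mark somewhere in the middle
printf '?deadbeef\npath/with?mark\n' >objs

grep -c '?' objs    # 2 -- unanchored: matches any line containing '?'
grep -c '^?' objs   # 1 -- anchored: only the missing-object line
```

For rev-list --missing=print output the two happen to agree, since '?' only ever appears as the leading marker, but the anchored form states the intent more precisely.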

> +		git rev-list --objects --all --missing=print |
> +			grep ^? >missing-objects-after &&
> +		test_cmp missing-objects-before missing-objects-after |
> +			grep "^[-+]?" >found-and-new-objects &&

I do not think we specify what test_cmp's output looks like; we only
guarantee that it is the right tool to use for expecting two paths
have the same contents, and it may give some human readable output.

Do not assume you can grep in the output; use "diff -u" if you mean
you expect things that exist on the left side and missing from the
right hand side are prefixed with '-' and vice versa for '+'.

Or sort "missing-objects-after" and "missing-objects-before" and run 
"comm" on them, which might be more stable.  Depending on the order
two invocations of rev-list spews out the "missing" objects, you may
even see the same object mentioned with "-" and "+" in the diff output
if lines appear to be moved around.
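The comm-based approach can be sketched standalone (the file contents below are fabricated stand-ins for sorted rev-list --missing=print output):

```shell
# Sorted "missing object" lists, before and after the operation;
# the hashes are made up for illustration
printf '?aaaa\n?bbbb\n?cccc\n' >missing-objects-before
printf '?bbbb\n?cccc\n' >missing-objects-after

# Only in "before": objects that were fetched during the operation
comm -2 -3 missing-objects-before missing-objects-after >old
# Only in "after": objects that newly became missing (expected empty)
comm -1 -3 missing-objects-before missing-objects-after >new

cat old    # ?aaaa
```

Since comm requires sorted input and compares line sets rather than line positions, this stays stable regardless of the order in which rev-list emits the missing objects.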

> +		# We should not have any NEW missing objects
> +		! grep ^+ found-and-new-objects &&
> +		# Fetched 2+1=3 objects, so should have 3 fewer missing objects
> +		test_line_count = 3 found-and-new-objects
> +	)
> +'

I think similar comments apply to the tests in the remainder of the
patch.

Thanks.


* Re: [PATCH v2 5/5] merge-ort: add prefetching for content merges
  2021-06-15 22:41   ` [PATCH v2 5/5] merge-ort: add prefetching for content merges Elijah Newren via GitGitGadget
@ 2021-06-17  5:04     ` Junio C Hamano
  2021-06-22  8:02       ` Elijah Newren
  0 siblings, 1 reply; 30+ messages in thread
From: Junio C Hamano @ 2021-06-17  5:04 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget
  Cc: git, Jonathan Tan, Derrick Stolee, Taylor Blau, Elijah Newren

"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +		/* Ignore clean entries */
> +		if (ci->merged.clean)
> +			continue;
> +
> +		/* Ignore entries that don't need a content merge */
> +		if (ci->match_mask || ci->filemask < 6 ||
> +		    !S_ISREG(ci->stages[1].mode) ||
> +		    !S_ISREG(ci->stages[2].mode) ||
> +		    oideq(&ci->stages[1].oid, &ci->stages[2].oid))
> +			continue;
> +
> +		/* Also don't need content merge if base matches either side */
> +		if (ci->filemask == 7 &&
> +		    S_ISREG(ci->stages[0].mode) &&
> +		    (oideq(&ci->stages[0].oid, &ci->stages[1].oid) ||
> +		     oideq(&ci->stages[0].oid, &ci->stages[2].oid)))
> +			continue;

Even though this is unlikely to change, it is unsatisfactory that we
reproduce the knowledge on the situations when a merge will
trivially resolve and when it will need to go content level.

One obvious way to solve it would be to fold this logic into the
main code that actually merges a list of "ci"s by making it a two
pass process (the first pass does essentially the same as this new
function, the second pass does the tree-level merge where the above
says "continue", fills mmfiles with the loop below, and calls into
ll_merge() after the loop to merge), but the logic duplication is
not too big and it may not be worth such a code churn.

> +		for (i = 0; i < 3; i++) {
> +			unsigned side_mask = (1 << i);
> +			struct version_info *vi = &ci->stages[i];
> +
> +			if ((ci->filemask & side_mask) &&
> +			    S_ISREG(vi->mode) &&
> +			    oid_object_info_extended(opt->repo, &vi->oid, NULL,
> +						     OBJECT_INFO_FOR_PREFETCH))
> +				oid_array_append(&to_fetch, &vi->oid);
> +		}
> +	}
> +
> +	promisor_remote_get_direct(opt->repo, to_fetch.oid, to_fetch.nr);
> +	oid_array_clear(&to_fetch);
> +}
> +


* Re: [PATCH v2 0/5] Optimization batch 13: partial clone optimizations for merge-ort
  2021-06-15 22:41 ` [PATCH v2 0/5] Optimization batch 13: partial clone optimizations for merge-ort Elijah Newren via GitGitGadget
                     ` (5 preceding siblings ...)
  2021-06-16 17:54   ` [PATCH v2 0/5] Optimization batch 13: partial clone optimizations for merge-ort Derrick Stolee
@ 2021-06-17  5:05   ` Junio C Hamano
  2021-06-22  8:04   ` [PATCH v3 " Elijah Newren via GitGitGadget
  7 siblings, 0 replies; 30+ messages in thread
From: Junio C Hamano @ 2021-06-17  5:05 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget
  Cc: git, Jonathan Tan, Derrick Stolee, Taylor Blau, Elijah Newren

"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> This series optimizes blob downloading in merges for partial clones. It can
> apply on master. It's independent of ort-perf-batch-12.

Overall this has a good feel of going in the right direction.
I left comments on a few of the patches, though.

Thanks.


* Re: [PATCH v2 5/5] merge-ort: add prefetching for content merges
  2021-06-17  5:04     ` Junio C Hamano
@ 2021-06-22  8:02       ` Elijah Newren
  0 siblings, 0 replies; 30+ messages in thread
From: Elijah Newren @ 2021-06-22  8:02 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Jonathan Tan,
	Derrick Stolee, Taylor Blau

On Wed, Jun 16, 2021 at 10:04 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > +             /* Ignore clean entries */
> > +             if (ci->merged.clean)
> > +                     continue;
> > +
> > +             /* Ignore entries that don't need a content merge */
> > +             if (ci->match_mask || ci->filemask < 6 ||
> > +                 !S_ISREG(ci->stages[1].mode) ||
> > +                 !S_ISREG(ci->stages[2].mode) ||
> > +                 oideq(&ci->stages[1].oid, &ci->stages[2].oid))
> > +                     continue;
> > +
> > +             /* Also don't need content merge if base matches either side */
> > +             if (ci->filemask == 7 &&
> > +                 S_ISREG(ci->stages[0].mode) &&
> > +                 (oideq(&ci->stages[0].oid, &ci->stages[1].oid) ||
> > +                  oideq(&ci->stages[0].oid, &ci->stages[2].oid)))
> > +                     continue;
>
> Even though this is unlikely to change, it is unsatisfactory that we
> reproduce the knowledge on the situations when a merge will
> trivially resolve and when it will need to go content level.

I agree, it's not the nicest.

> One obvious way to solve it would be to fold this logic into the
> main code that actually merges a list of "ci"s by making it a two
> pass process (the first pass does essentially the same as this new
> function, the second pass does the tree-level merge where the above
> says "continue", fills mmfiles with the loop below, and calls into
> ll_merge() after the loop to merge), but the logic duplication is
> not too big and it may not be worth such a code churn.

I'm worried even more about the resulting complexity than the code
churn.  The two-pass model, which I considered, would require special
casing so many of the branches of process_entry() that it feels like
it'd be increasing code complexity more than introducing a function
with a few duplicated checks.  process_entry() was already a function
that Stolee reported as coming across as pretty complex to him in
earlier rounds of review, but that seems to just be intrinsic based on
the number of special cases: handling anything from entries with D/F
conflicts, to different file types, to match_mask being precomputed,
to recursive vs. normal cases, to modify/delete, to normalization, to
added on one side, to deleted on both side, to three-way content
merges.  The three-way content merges are just one of 9-ish different
branches, and are the only one that we're prefetching for.  It just
seems easier and cleaner overall to add these three checks to pick off
the cases that will end up going through the three-way content merges.
I've looked at it again a couple times over the past few days based on
your comment, but I still can't see a way to restructure it that feels
cleaner than what I've currently got.

Also, it may be worth noting here that if these checks fell out of
date with process_entry() in some manner, it still would not affect
the correctness of the code.  At worst, it'd only affect whether
enough or too many objects are prefetched.  If too many, then some
extra objects would be downloaded, and if too few, then we'd end up
later fetching additional objects 1-by-1 on demand later.

So I'm going to agree with the not-worth-it portion of your final
sentence and leave this out of the next roll.


* [PATCH v3 0/5] Optimization batch 13: partial clone optimizations for merge-ort
  2021-06-15 22:41 ` [PATCH v2 0/5] Optimization batch 13: partial clone optimizations for merge-ort Elijah Newren via GitGitGadget
                     ` (6 preceding siblings ...)
  2021-06-17  5:05   ` Junio C Hamano
@ 2021-06-22  8:04   ` Elijah Newren via GitGitGadget
  2021-06-22  8:04     ` [PATCH v3 1/5] promisor-remote: output trace2 statistics for number of objects fetched Elijah Newren via GitGitGadget
                       ` (5 more replies)
  7 siblings, 6 replies; 30+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-06-22  8:04 UTC (permalink / raw)
  To: git
  Cc: Jonathan Tan, Derrick Stolee, Taylor Blau, Derrick Stolee,
	Elijah Newren

This series optimizes blob downloading in merges for partial clones. It can
apply on master. It's independent of ort-perf-batch-12.

Changes since v2:

 * Incorporated the suggestions from Junio on patch 2.

Changes since v1:

 * Incorporated the suggestions from Stolee on patch 2.

=== High level summary ===

 1. diffcore-rename.c has had a prefetch() to get data needed for inexact
    renames for a while.
 2. find_basename_matches() only requires a small subset of what prefetch()
    provides.
 3. I added a basename_prefetch() for find_basename_matches()

In the worst case, the above means:

 * We download the same number of objects, in 2 steps instead of 1.

However, in practice, since rename detection can usually quit after
find_basename_matches() (typically due to the check for irrelevant sources,
which cannot be performed until after find_basename_matches()):

 * We download far fewer objects, and use barely more download steps than
   before.

Adding some prefetching to merge-ort.c allows us to also drop the number of
downloads overall.

=== Modified performance measurement method ===

The testcases I've been using so far to measure performance were not run in
a partial clone, so they aren't directly usable for comparison. Further,
partial clone performance depends on network speed which can be highly
variable. So I want to modify one of the existing testcases slightly and
focus on two different but more stable metrics:

 1. Number of git fetch operations during rebase
 2. Number of objects fetched during rebase

The first of these should already be decent due to Jonathan Tan's work to
batch fetching of missing blobs during rename detection (see commit
7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-05)), so we are
mostly looking to optimize the second but would like to also decrease the
first if possible.
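Both metrics can be read out of the trace2 data added by the first patch in the series; a rough sketch, using fabricated lines in place of real GIT_TRACE2_PERF output (real trace2 records carry many more fields):

```shell
# Fabricated stand-ins for the "fetch_count" data events emitted by
# promisor-remote.c; imagine this came from GIT_TRACE2_PERF=trace git rebase ...
cat >trace <<'EOF'
d0|main|data|promisor|fetch_count:12
d0|main|data|promisor|fetch_count:5
d0|main|data|promisor|fetch_count:3
EOF

# Metric 1: number of fetch operations
grep -c 'fetch_count' trace    # 3
# Metric 2: total number of objects fetched, summed across operations
sed -n 's/.*fetch_count:\([0-9]*\).*/\1/p' trace |
	awk '{sum += $1} END {print sum}'    # 20
```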

The testcase we will look at will be a modification of the mega-renames
testcase from commit 557ac0350d ("merge-ort: begin performance work;
instrument with trace2_region_* calls", 2020-10-28). In particular, we
change

$ git clone \
    git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git


to

$ git clone --sparse --filter=blob:none \
    https://github.com/github/linux


(The change in clone URL is just to get a server that supports the filter
predicate.)

We otherwise keep the test the same (in particular, we do not add any calls
to "git sparse-checkout {set,add}", which means that the resulting repository
will only have 7 total blobs from files in the toplevel directory before
starting the rebase).

=== Results ===

For the mega-renames testcase noted above (which rebases 35 commits across
an upstream with ~26K renames in a partial clone), I found the following
results for our metrics of interest:

     Number of `git fetch` ops during rebase

                     Before Series   After Series
merge-recursive:          62              63
merge-ort:                30              20


     Number of objects fetched during rebase

                     Before Series   After Series
merge-recursive:         11423          11423
merge-ort:               11391             63


So, we have a significant reduction (factor of ~3 relative to
merge-recursive) in the number of git fetch operations that have to be
performed in a partial clone to complete the rebase, and a dramatic
reduction (factor of ~180) in the number of objects that need to be fetched.

=== Summary ===

It's worth pointing out that merge-ort after the series needs only ~1.8
blobs per commit being transplanted to complete this particular rebase.
Essentially, this reinforces the fact that the optimization work so far has taken
rename detection from often being an overwhelmingly costly portion of a
merge (leading many to just capitulate on it), to what I have observed in my
experience so far as being just a minor cost for merges.

Elijah Newren (5):
  promisor-remote: output trace2 statistics for number of objects
    fetched
  t6421: add tests checking for excessive object downloads during merge
  diffcore-rename: allow different missing_object_cb functions
  diffcore-rename: use a different prefetch for basename comparisons
  merge-ort: add prefetching for content merges

 diffcore-rename.c              | 149 ++++++++---
 merge-ort.c                    |  50 ++++
 promisor-remote.c              |   7 +-
 t/t6421-merge-partial-clone.sh | 440 +++++++++++++++++++++++++++++++++
 4 files changed, 612 insertions(+), 34 deletions(-)
 create mode 100755 t/t6421-merge-partial-clone.sh


base-commit: 6de569e6ac492213e81321ca35f1f1b365ba31e3
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-969%2Fnewren%2Fort-perf-batch-13-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-969/newren/ort-perf-batch-13-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/969

Range-diff vs v2:

 1:  04f5ebdabe14 = 1:  04f5ebdabe14 promisor-remote: output trace2 statistics for number of objects fetched
 2:  0f786cfb4c95 ! 2:  4796e096fdb4 t6421: add tests checking for excessive object downloads during merge
     @@ t/t6421-merge-partial-clone.sh (new)
      +		echo g >dir/subdir/tweaked/g &&
      +		echo h >dir/subdir/tweaked/h &&
      +		echo subdirectory makefile >dir/subdir/tweaked/Makefile &&
     -+		for i in `test_seq 1 88`; do
     ++		for i in $(test_seq 1 88)
     ++		do
      +			echo content $i >dir/unchanged/file_$i
      +		done &&
      +		git add . &&
     @@ t/t6421-merge-partial-clone.sh (new)
      +		cd objects-single &&
      +
      +		git rev-list --objects --all --missing=print |
     -+			grep '\?' >missing-objects-before &&
     ++			grep "^?" | sort >missing-objects-before &&
      +
      +		git checkout -q origin/A &&
      +
     @@ t/t6421-merge-partial-clone.sh (new)
      +		test_line_count = 2 fetches &&
      +
      +		git rev-list --objects --all --missing=print |
     -+			grep ^? >missing-objects-after &&
     -+		test_cmp missing-objects-before missing-objects-after |
     -+			grep "^[-+]?" >found-and-new-objects &&
     -+		# We should not have any NEW missing objects
     -+		! grep ^+ found-and-new-objects &&
     -+		# Fetched 2+1=3 objects, so should have 3 fewer missing objects
     -+		test_line_count = 3 found-and-new-objects
     ++			grep "^?" | sort >missing-objects-after &&
     ++		comm -2 -3 missing-objects-before missing-objects-after >old &&
     ++		comm -1 -3 missing-objects-before missing-objects-after >new &&
     ++		# No new missing objects
     ++		test_must_be_empty new &&
     ++		# Fetched 2 + 1 = 3 objects
     ++		test_line_count = 3 old
      +	)
      +'
      +
     @@ t/t6421-merge-partial-clone.sh (new)
      +		cd objects-dir &&
      +
      +		git rev-list --objects --all --missing=print |
     -+			grep '\?' >missing-objects-before &&
     ++			grep "^?" | sort >missing-objects-before &&
      +
      +		git checkout -q origin/A &&
      +
     @@ t/t6421-merge-partial-clone.sh (new)
      +		test_line_count = 1 fetches &&
      +
      +		git rev-list --objects --all --missing=print |
     -+			grep ^? >missing-objects-after &&
     -+		test_cmp missing-objects-before missing-objects-after |
     -+			grep "^[-+]?" >found-and-new-objects &&
     -+		# We should not have any NEW missing objects
     -+		! grep ^+ found-and-new-objects &&
     -+		# Fetched 6 objects, so should have 6 fewer missing objects
     -+		test_line_count = 6 found-and-new-objects
     ++			grep "^?" | sort >missing-objects-after &&
     ++		comm -2 -3 missing-objects-before missing-objects-after >old &&
     ++		comm -1 -3 missing-objects-before missing-objects-after >new &&
     ++		# No new missing objects
     ++		test_must_be_empty new &&
     ++		# Fetched 6 objects
     ++		test_line_count = 6 old
      +	)
      +'
      +
     @@ t/t6421-merge-partial-clone.sh (new)
      +		cd objects-many &&
      +
      +		git rev-list --objects --all --missing=print |
     -+			grep '\?' >missing-objects-before &&
     ++			grep "^?" | sort >missing-objects-before &&
      +
      +		git checkout -q origin/A &&
      +
     @@ t/t6421-merge-partial-clone.sh (new)
      +		test_line_count = 4 fetches &&
      +
      +		git rev-list --objects --all --missing=print |
     -+			grep ^? >missing-objects-after &&
     -+		test_cmp missing-objects-before missing-objects-after |
     -+			grep "^[-+]?" >found-and-new-objects &&
     -+		# We should not have any NEW missing objects
     -+		! grep ^+ found-and-new-objects &&
     -+		# Fetched 12 + 5 + 3 + 2 == 22 objects
     -+		test_line_count = 22 found-and-new-objects
     ++			grep "^?" | sort >missing-objects-after &&
     ++		comm -2 -3 missing-objects-before missing-objects-after >old &&
     ++		comm -1 -3 missing-objects-before missing-objects-after >new &&
     ++		# No new missing objects
     ++		test_must_be_empty new &&
     ++		# Fetched 12 + 5 + 3 + 2 = 22 objects
     ++		test_line_count = 22 old
      +	)
      +'
      +
 3:  9f2a8ed8d61f = 3:  7ed0162cdb4e diffcore-rename: allow different missing_object_cb functions
 4:  f753f8035564 = 4:  c9b55241d831 diffcore-rename: use a different prefetch for basename comparisons
 5:  317bcc7f56cb = 5:  69011cfe9fae merge-ort: add prefetching for content merges

-- 
gitgitgadget


* [PATCH v3 1/5] promisor-remote: output trace2 statistics for number of objects fetched
  2021-06-22  8:04   ` [PATCH v3 " Elijah Newren via GitGitGadget
@ 2021-06-22  8:04     ` Elijah Newren via GitGitGadget
  2021-06-22  8:04     ` [PATCH v3 2/5] t6421: add tests checking for excessive object downloads during merge Elijah Newren via GitGitGadget
                       ` (4 subsequent siblings)
  5 siblings, 0 replies; 30+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-06-22  8:04 UTC (permalink / raw)
  To: git
  Cc: Jonathan Tan, Derrick Stolee, Taylor Blau, Derrick Stolee,
	Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 promisor-remote.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/promisor-remote.c b/promisor-remote.c
index da3f2ca2615e..d465377d7d32 100644
--- a/promisor-remote.c
+++ b/promisor-remote.c
@@ -12,7 +12,8 @@ void set_repository_format_partial_clone(char *partial_clone)
 	repository_format_partial_clone = xstrdup_or_null(partial_clone);
 }
 
-static int fetch_objects(const char *remote_name,
+static int fetch_objects(struct repository *repo,
+			 const char *remote_name,
 			 const struct object_id *oids,
 			 int oid_nr)
 {
@@ -30,6 +31,8 @@ static int fetch_objects(const char *remote_name,
 		die(_("promisor-remote: unable to fork off fetch subprocess"));
 	child_in = xfdopen(child.in, "w");
 
+	trace2_data_intmax("promisor", repo, "fetch_count", oid_nr);
+
 	for (i = 0; i < oid_nr; i++) {
 		if (fputs(oid_to_hex(&oids[i]), child_in) < 0)
 			die_errno(_("promisor-remote: could not write to fetch subprocess"));
@@ -238,7 +241,7 @@ int promisor_remote_get_direct(struct repository *repo,
 	promisor_remote_init();
 
 	for (r = promisors; r; r = r->next) {
-		if (fetch_objects(r->name, remaining_oids, remaining_nr) < 0) {
+		if (fetch_objects(repo, r->name, remaining_oids, remaining_nr) < 0) {
 			if (remaining_nr == 1)
 				continue;
 			remaining_nr = remove_fetched_oids(repo, &remaining_oids,
-- 
gitgitgadget
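As an aside: the tests added later in this series read this statistic back out of GIT_TRACE2_PERF output with a grep/cut/tr pipeline. A minimal sketch of that extraction, using a synthetic trace line (the exact column layout shown here is an assumption for illustration; only the pipe-separated fields and the trailing "category | key:value" columns matter):

```shell
# Synthetic GIT_TRACE2_PERF "data" line (layout assumed for illustration);
# a real trace file would be produced by, e.g.:
#   GIT_TRACE2_PERF=trace.output git merge ...
line='10:26:01.611233 promisor-remote.c:33 | d0 | main | data | r0 | 0.011 | 0.001 | promisor | fetch_count:2'

# Same pipeline the t6421 tests use: keep field 9, strip spaces and dots
result=$(printf '%s\n' "$line" | grep fetch_count | cut -d '|' -f 9 | tr -d ' .')
echo "$result"
```

With the synthetic line above, this prints `fetch_count:2`, which is the shape the tests compare against with test_cmp.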


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v3 2/5] t6421: add tests checking for excessive object downloads during merge
  2021-06-22  8:04   ` [PATCH v3 " Elijah Newren via GitGitGadget
  2021-06-22  8:04     ` [PATCH v3 1/5] promisor-remote: output trace2 statistics for number of objects fetched Elijah Newren via GitGitGadget
@ 2021-06-22  8:04     ` Elijah Newren via GitGitGadget
  2021-06-22  8:04     ` [PATCH v3 3/5] diffcore-rename: allow different missing_object_cb functions Elijah Newren via GitGitGadget
                       ` (3 subsequent siblings)
  5 siblings, 0 replies; 30+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-06-22  8:04 UTC (permalink / raw)
  To: git
  Cc: Jonathan Tan, Derrick Stolee, Taylor Blau, Derrick Stolee,
	Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 t/t6421-merge-partial-clone.sh | 440 +++++++++++++++++++++++++++++++++
 1 file changed, 440 insertions(+)
 create mode 100755 t/t6421-merge-partial-clone.sh

diff --git a/t/t6421-merge-partial-clone.sh b/t/t6421-merge-partial-clone.sh
new file mode 100755
index 000000000000..3dcffc15f801
--- /dev/null
+++ b/t/t6421-merge-partial-clone.sh
@@ -0,0 +1,440 @@
+#!/bin/sh
+
+test_description="limiting blob downloads when merging with partial clones"
+# Uses a methodology similar to
+#   t6042: corner cases with renames but not criss-cross merges
+#   t6036: corner cases with both renames and criss-cross merges
+#   t6423: directory rename detection
+#
+# The setup for all of them, pictorially, is:
+#
+#      A
+#      o
+#     / \
+#  O o   ?
+#     \ /
+#      o
+#      B
+#
 +# To make it easier to follow the flow of tests, they have been
+# divided into sections and each test will start with a quick explanation
+# of what commits O, A, and B contain.
+#
+# Notation:
+#    z/{b,c}   means  files z/b and z/c both exist
+#    x/d_1     means  file x/d exists with content d1.  (Purpose of the
+#                     underscore notation is to differentiate different
+#                     files that might be renamed into each other's paths.)
+
+. ./test-lib.sh
+. "$TEST_DIRECTORY"/lib-merge.sh
+
+test_setup_repo () {
+	test -d server && return
+	test_create_repo server &&
+	(
+		cd server &&
+
+		git config uploadpack.allowfilter 1 &&
+		git config uploadpack.allowanysha1inwant 1 &&
+
+		mkdir -p general &&
+		test_seq 2 9 >general/leap1 &&
+		cp general/leap1 general/leap2 &&
+		echo leap2 >>general/leap2 &&
+
+		mkdir -p basename &&
+		cp general/leap1 basename/numbers &&
+		cp general/leap1 basename/sequence &&
+		cp general/leap1 basename/values &&
+		echo numbers >>basename/numbers &&
+		echo sequence >>basename/sequence &&
+		echo values >>basename/values &&
+
+		mkdir -p dir/unchanged &&
+		mkdir -p dir/subdir/tweaked &&
+		echo a >dir/subdir/a &&
+		echo b >dir/subdir/b &&
+		echo c >dir/subdir/c &&
+		echo d >dir/subdir/d &&
+		echo e >dir/subdir/e &&
+		cp general/leap1 dir/subdir/Makefile &&
+		echo toplevel makefile >>dir/subdir/Makefile &&
+		echo f >dir/subdir/tweaked/f &&
+		echo g >dir/subdir/tweaked/g &&
+		echo h >dir/subdir/tweaked/h &&
+		echo subdirectory makefile >dir/subdir/tweaked/Makefile &&
+		for i in $(test_seq 1 88)
+		do
+			echo content $i >dir/unchanged/file_$i
+		done &&
+		git add . &&
+		git commit -m "O" &&
+
+		git branch O &&
+		git branch A &&
+		git branch B-single &&
+		git branch B-dir &&
+		git branch B-many &&
+
+		git switch A &&
+
+		git rm general/leap* &&
+		mkdir general/ &&
+		test_seq 1 9 >general/jump1 &&
+		cp general/jump1 general/jump2 &&
+		echo leap2 >>general/jump2 &&
+
+		rm basename/numbers basename/sequence basename/values &&
 +		mkdir -p basename/subdir/ &&
+		cp general/jump1 basename/subdir/numbers &&
+		cp general/jump1 basename/subdir/sequence &&
+		cp general/jump1 basename/subdir/values &&
+		echo numbers >>basename/subdir/numbers &&
+		echo sequence >>basename/subdir/sequence &&
+		echo values >>basename/subdir/values &&
+
+		git rm dir/subdir/tweaked/f &&
+		echo more >>dir/subdir/e &&
+		echo more >>dir/subdir/Makefile &&
+		echo more >>dir/subdir/tweaked/Makefile &&
+		mkdir dir/subdir/newsubdir &&
+		echo rust code >dir/subdir/newsubdir/newfile.rs &&
+		git mv dir/subdir/e dir/subdir/newsubdir/ &&
+		git mv dir folder &&
+		git add . &&
+		git commit -m "A" &&
+
+		git switch B-single &&
+		echo new first line >dir/subdir/Makefile &&
+		cat general/leap1 >>dir/subdir/Makefile &&
+		echo toplevel makefile >>dir/subdir/Makefile &&
+		echo perl code >general/newfile.pl &&
+		git add . &&
+		git commit -m "B-single" &&
+
+		git switch B-dir &&
+		echo java code >dir/subdir/newfile.java &&
+		echo scala code >dir/subdir/newfile.scala &&
+		echo groovy code >dir/subdir/newfile.groovy &&
+		git add . &&
+		git commit -m "B-dir" &&
+
+		git switch B-many &&
+		test_seq 2 10 >general/leap1 &&
+		rm general/leap2 &&
+		cp general/leap1 general/leap2 &&
+		echo leap2 >>general/leap2 &&
+
+		rm basename/numbers basename/sequence basename/values &&
 +		mkdir -p basename/subdir/ &&
+		cp general/leap1 basename/subdir/numbers &&
+		cp general/leap1 basename/subdir/sequence &&
+		cp general/leap1 basename/subdir/values &&
+		echo numbers >>basename/subdir/numbers &&
+		echo sequence >>basename/subdir/sequence &&
+		echo values >>basename/subdir/values &&
+
+		mkdir dir/subdir/newsubdir/ &&
+		echo c code >dir/subdir/newfile.c &&
+		echo python code >dir/subdir/newsubdir/newfile.py &&
+		git add . &&
+		git commit -m "B-many" &&
+
+		git switch A
+	)
+}
+
+# Testcase: Objects downloaded for single relevant rename
+#   Commit O:
+#              general/{leap1_O, leap2_O}
+#              basename/{numbers_O, sequence_O, values_O}
+#              dir/subdir/{a,b,c,d,e_O,Makefile_TOP_O}
+#              dir/subdir/tweaked/{f,g,h,Makefile_SUB_O}
+#              dir/unchanged/<LOTS OF FILES>
+#   Commit A:
+#     (Rename leap->jump, rename basename/ -> basename/subdir/, rename dir/
+#      -> folder/, move e into newsubdir, add newfile.rs, remove f, modify
 +#      both Makefiles and jumps)
+#              general/{jump1_A, jump2_A}
+#              basename/subdir/{numbers_A, sequence_A, values_A}
+#              folder/subdir/{a,b,c,d,Makefile_TOP_A}
+#              folder/subdir/newsubdir/{e_A,newfile.rs}
+#              folder/subdir/tweaked/{g,h,Makefile_SUB_A}
+#              folder/unchanged/<LOTS OF FILES>
+#   Commit B(-single):
+#     (add newfile.pl, tweak Makefile_TOP)
+#              general/{leap1_O, leap2_O,newfile.pl}
+#              basename/{numbers_O, sequence_O, values_O}
+#              dir/{a,b,c,d,e_O,Makefile_TOP_B}
+#              dir/tweaked/{f,g,h,Makefile_SUB_O}
+#              dir/unchanged/<LOTS OF FILES>
+#   Expected:
+#              general/{jump1_A, jump2_A,newfile.pl}
+#              basename/subdir/{numbers_A, sequence_A, values_A}
+#              folder/subdir/{a,b,c,d,Makefile_TOP_Merged}
+#              folder/subdir/newsubdir/{e_A,newfile.rs}
+#              folder/subdir/tweaked/{g,h,Makefile_SUB_A}
+#              folder/unchanged/<LOTS OF FILES>
+#
+# Objects that need to be fetched:
+#   Rename detection:
+#     Side1 (O->A):
+#       Basename-matches rename detection only needs to fetch these objects:
+#         Makefile_TOP_O, Makefile_TOP_A
+#         (Despite many renames, all others are content irrelevant.  They
+#          are also location irrelevant because newfile.rs was added on
+#          the side doing the directory rename, and newfile.pl was added to
+#          a directory that was not renamed on either side.)
+#       General rename detection only needs to fetch these objects:
+#         <None>
+#          (Even though newfile.rs, jump[12], basename/subdir/*, and e
+#          could all be used as destinations in rename detection, the
+#          basename detection for Makefile matches up all relevant
+#          sources, so these other files never end up needing to be
+#          used)
+#     Side2 (O->B):
+#       Basename-matches rename detection only needs to fetch these objects:
+#         <None>
+#         (there are no deleted files, so no possible sources)
+#       General rename detection only needs to fetch these objects:
+#         <None>
+#         (there are no deleted files, so no possible sources)
+#   Merge:
+#     3-way content merge needs to grab these objects:
+#       Makefile_TOP_B
+#   Nothing else needs to fetch objects
+#
+#   Summary: 2 fetches (1 for 2 objects, 1 for 1 object)
+#
+test_expect_merge_algorithm failure failure 'Objects downloaded for single relevant rename' '
+	test_setup_repo &&
+	git clone --sparse --filter=blob:none "file://$(pwd)/server" objects-single &&
+	(
+		cd objects-single &&
+
+		git rev-list --objects --all --missing=print |
+			grep "^?" | sort >missing-objects-before &&
+
+		git checkout -q origin/A &&
+
+		GIT_TRACE2_PERF="$(pwd)/trace.output" git \
+			-c merge.directoryRenames=true merge --no-stat \
+			--no-progress origin/B-single &&
+
+		# Check the number of objects we reported we would fetch
+		cat >expect <<-EOF &&
+		fetch_count:2
+		fetch_count:1
+		EOF
+		grep fetch_count trace.output | cut -d "|" -f 9 | tr -d " ." >actual &&
+		test_cmp expect actual &&
+
+		# Check the number of fetch commands exec-ed
+		grep d0.*fetch.negotiationAlgorithm trace.output >fetches &&
+		test_line_count = 2 fetches &&
+
+		git rev-list --objects --all --missing=print |
+			grep "^?" | sort >missing-objects-after &&
+		comm -2 -3 missing-objects-before missing-objects-after >old &&
+		comm -1 -3 missing-objects-before missing-objects-after >new &&
+		# No new missing objects
+		test_must_be_empty new &&
+		# Fetched 2 + 1 = 3 objects
+		test_line_count = 3 old
+	)
+'
+
+# Testcase: Objects downloaded for directory rename
+#   Commit O:
+#              general/{leap1_O, leap2_O}
+#              basename/{numbers_O, sequence_O, values_O}
+#              dir/subdir/{a,b,c,d,e_O,Makefile_TOP_O}
+#              dir/subdir/tweaked/{f,g,h,Makefile_SUB_O}
+#              dir/unchanged/<LOTS OF FILES>
+#   Commit A:
+#     (Rename leap->jump, rename basename/ -> basename/subdir/, rename dir/ ->
+#      folder/, move e into newsubdir, add newfile.rs, remove f, modify
+#      both Makefiles and jumps)
+#              general/{jump1_A, jump2_A}
+#              basename/subdir/{numbers_A, sequence_A, values_A}
+#              folder/subdir/{a,b,c,d,Makefile_TOP_A}
+#              folder/subdir/newsubdir/{e_A,newfile.rs}
+#              folder/subdir/tweaked/{g,h,Makefile_SUB_A}
+#              folder/unchanged/<LOTS OF FILES>
+#   Commit B(-dir):
 +#     (add dir/subdir/newfile.{java,scala,groovy})
+#              general/{leap1_O, leap2_O}
+#              basename/{numbers_O, sequence_O, values_O}
+#              dir/subdir/{a,b,c,d,e_O,Makefile_TOP_O,
+#                          newfile.java,newfile.scala,newfile.groovy}
+#              dir/subdir/tweaked/{f,g,h,Makefile_SUB_O}
+#              dir/unchanged/<LOTS OF FILES>
+#   Expected:
+#              general/{jump1_A, jump2_A}
+#              basename/subdir/{numbers_A, sequence_A, values_A}
+#              folder/subdir/{a,b,c,d,Makefile_TOP_A,
+#                             newfile.java,newfile.scala,newfile.groovy}
+#              folder/subdir/newsubdir/{e_A,newfile.rs}
+#              folder/subdir/tweaked/{g,h,Makefile_SUB_A}
+#              folder/unchanged/<LOTS OF FILES>
+#
+# Objects that need to be fetched:
+#   Makefile_TOP_O, Makefile_TOP_A
+#   Makefile_SUB_O, Makefile_SUB_A
+#   e_O, e_A
+#   * Despite A's rename of jump->leap, those renames are irrelevant.
+#   * Despite A's rename of basename/ -> basename/subdir/, those renames are
+#     irrelevant.
+#   * Because of A's rename of dir/ -> folder/ and B-dir's addition of
+#     newfile.* into dir/subdir/, we need to determine directory renames.
+#     (Technically, there are enough exact renames to determine directory
+#      rename detection, but the current implementation always does
+#      basename searching before directory rename detection.  Running it
+#      also before basename searching would mean doing directory rename
+#      detection twice, but it's a bit expensive to do that and cases like
+#      this are not all that common.)
 +#   Summary: 1 fetch for 6 objects
+#
+test_expect_merge_algorithm failure failure 'Objects downloaded when a directory rename triggered' '
+	test_setup_repo &&
+	git clone --sparse --filter=blob:none "file://$(pwd)/server" objects-dir &&
+	(
+		cd objects-dir &&
+
+		git rev-list --objects --all --missing=print |
+			grep "^?" | sort >missing-objects-before &&
+
+		git checkout -q origin/A &&
+
+		GIT_TRACE2_PERF="$(pwd)/trace.output" git \
+			-c merge.directoryRenames=true merge --no-stat \
+			--no-progress origin/B-dir &&
+
+		# Check the number of objects we reported we would fetch
+		cat >expect <<-EOF &&
+		fetch_count:6
+		EOF
+		grep fetch_count trace.output | cut -d "|" -f 9 | tr -d " ." >actual &&
+		test_cmp expect actual &&
+
+		# Check the number of fetch commands exec-ed
+		grep d0.*fetch.negotiationAlgorithm trace.output >fetches &&
+		test_line_count = 1 fetches &&
+
+		git rev-list --objects --all --missing=print |
+			grep "^?" | sort >missing-objects-after &&
+		comm -2 -3 missing-objects-before missing-objects-after >old &&
+		comm -1 -3 missing-objects-before missing-objects-after >new &&
+		# No new missing objects
+		test_must_be_empty new &&
+		# Fetched 6 objects
+		test_line_count = 6 old
+	)
+'
+
+# Testcase: Objects downloaded with lots of renames and modifications
+#   Commit O:
+#              general/{leap1_O, leap2_O}
+#              basename/{numbers_O, sequence_O, values_O}
+#              dir/subdir/{a,b,c,d,e_O,Makefile_TOP_O}
+#              dir/subdir/tweaked/{f,g,h,Makefile_SUB_O}
+#              dir/unchanged/<LOTS OF FILES>
+#   Commit A:
+#     (Rename leap->jump, rename basename/ -> basename/subdir/, rename dir/
+#      -> folder/, move e into newsubdir, add newfile.rs, remove f, modify
 +#      both Makefiles and jumps)
+#              general/{jump1_A, jump2_A}
+#              basename/subdir/{numbers_A, sequence_A, values_A}
+#              folder/subdir/{a,b,c,d,Makefile_TOP_A}
+#              folder/subdir/newsubdir/{e_A,newfile.rs}
+#              folder/subdir/tweaked/{g,h,Makefile_SUB_A}
+#              folder/unchanged/<LOTS OF FILES>
 +#   Commit B(-many):
+#     (modify both leaps, rename basename/ -> basename/subdir/, add
+#      newfile.{c,py})
+#              general/{leap1_B, leap2_B}
+#              basename/subdir/{numbers_B, sequence_B, values_B}
+#              dir/{a,b,c,d,e_O,Makefile_TOP_O,newfile.c}
+#              dir/tweaked/{f,g,h,Makefile_SUB_O,newfile.py}
+#              dir/unchanged/<LOTS OF FILES>
+#   Expected:
+#              general/{jump1_Merged, jump2_Merged}
+#              basename/subdir/{numbers_Merged, sequence_Merged, values_Merged}
+#              folder/subdir/{a,b,c,d,Makefile_TOP_A,newfile.c}
+#              folder/subdir/newsubdir/e_A
+#              folder/subdir/tweaked/{g,h,Makefile_SUB_A,newfile.py}
+#              folder/unchanged/<LOTS OF FILES>
+#
+# Objects that need to be fetched:
+#   Rename detection:
+#     Side1 (O->A):
+#       Basename-matches rename detection only needs to fetch these objects:
+#         numbers_O, numbers_A
+#         sequence_O, sequence_A
+#         values_O, values_A
+#         Makefile_TOP_O, Makefile_TOP_A
+#         Makefile_SUB_O, Makefile_SUB_A
+#         e_O, e_A
+#       General rename detection only needs to fetch these objects:
+#         leap1_O, leap2_O
+#         jump1_A, jump2_A, newfile.rs
+#         (only need remaining relevant sources, but any relevant sources need
+#          to be matched against all possible unpaired destinations)
+#     Side2 (O->B):
+#       Basename-matches rename detection only needs to fetch these objects:
+#         numbers_B
+#         sequence_B
+#         values_B
+#       (because numbers_O, sequence_O, and values_O already fetched above)
+#       General rename detection only needs to fetch these objects:
+#         <None>
+#   Merge:
+#     3-way content merge needs to grab these objects:
+#       leap1_B
+#       leap2_B
+#   Nothing else needs to fetch objects
+#
+#   Summary: 4 fetches (1 for 6 objects, 1 for 8, 1 for 3, 1 for 2)
+#
+test_expect_merge_algorithm failure failure 'Objects downloaded with lots of renames and modifications' '
+	test_setup_repo &&
+	git clone --sparse --filter=blob:none "file://$(pwd)/server" objects-many &&
+	(
+		cd objects-many &&
+
+		git rev-list --objects --all --missing=print |
+			grep "^?" | sort >missing-objects-before &&
+
+		git checkout -q origin/A &&
+
+		GIT_TRACE2_PERF="$(pwd)/trace.output" git \
+			-c merge.directoryRenames=true merge --no-stat \
+			--no-progress origin/B-many &&
+
+		# Check the number of objects we reported we would fetch
+		cat >expect <<-EOF &&
+		fetch_count:12
+		fetch_count:5
+		fetch_count:3
+		fetch_count:2
+		EOF
+		grep fetch_count trace.output | cut -d "|" -f 9 | tr -d " ." >actual &&
+		test_cmp expect actual &&
+
+		# Check the number of fetch commands exec-ed
+		grep d0.*fetch.negotiationAlgorithm trace.output >fetches &&
+		test_line_count = 4 fetches &&
+
+		git rev-list --objects --all --missing=print |
+			grep "^?" | sort >missing-objects-after &&
+		comm -2 -3 missing-objects-before missing-objects-after >old &&
+		comm -1 -3 missing-objects-before missing-objects-after >new &&
+		# No new missing objects
+		test_must_be_empty new &&
+		# Fetched 12 + 5 + 3 + 2 = 22 objects
+		test_line_count = 22 old
+	)
+'
+
+test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v3 3/5] diffcore-rename: allow different missing_object_cb functions
  2021-06-22  8:04   ` [PATCH v3 " Elijah Newren via GitGitGadget
  2021-06-22  8:04     ` [PATCH v3 1/5] promisor-remote: output trace2 statistics for number of objects fetched Elijah Newren via GitGitGadget
  2021-06-22  8:04     ` [PATCH v3 2/5] t6421: add tests checking for excessive object downloads during merge Elijah Newren via GitGitGadget
@ 2021-06-22  8:04     ` Elijah Newren via GitGitGadget
  2021-06-22  8:04     ` [PATCH v3 4/5] diffcore-rename: use a different prefetch for basename comparisons Elijah Newren via GitGitGadget
                       ` (2 subsequent siblings)
  5 siblings, 0 replies; 30+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-06-22  8:04 UTC (permalink / raw)
  To: git
  Cc: Jonathan Tan, Derrick Stolee, Taylor Blau, Derrick Stolee,
	Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

estimate_similarity() was setting up a diff_populate_filespec_options
every time it was called, requiring the caller of estimate_similarity()
to pass in some data needed to set up this option.  Currently the needed
data consisted of a single variable (skip_unmodified), but we also want
the different estimate_similarity() callsites to start using different
missing_object_cb functions.  Rather than also passing
that data in, just have the caller pass in the whole
diff_populate_filespec_options, and reduce the number of times we need to
set it up.

As a side note, this also drops the number of calls to
has_promisor_remote() dramatically.  If L is the number of basename
paths to compare, M is the number of inexact sources, and N is the
number of inexact destinations, then the number of calls to
has_promisor_remote() drops from L+M*N down to at most 2 -- one for each
of the sites that calls estimate_similarity().  has_promisor_remote() is
a very fast function so this almost certainly has no measurable
performance impact, but it seems cleaner to avoid calling that function
so many times.
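
To make the counting argument concrete, here is a tiny sketch with
made-up sizes (L, M, and N are purely illustrative; the point is the
asymptotic drop, not the specific numbers):

```python
# Illustrative (made-up) sizes for one rename-detection run
L = 50      # basename-pair comparisons
M = 1000    # inexact rename sources
N = 1000    # inexact rename destinations

# Before: has_promisor_remote() ran once per estimate_similarity() call
calls_before = L + M * N

# After: at most once per callsite that sets up the
# diff_populate_filespec_options (two such callsites)
calls_after = 2

print(calls_before, calls_after)  # 1000050 2
```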

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 58 +++++++++++++++++++++++++++++++----------------
 1 file changed, 39 insertions(+), 19 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 35378d84e8f1..e13e52046026 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -126,7 +126,7 @@ static int estimate_similarity(struct repository *r,
 			       struct diff_filespec *src,
 			       struct diff_filespec *dst,
 			       int minimum_score,
-			       int skip_unmodified)
+			       struct diff_populate_filespec_options *dpf_opt)
 {
 	/* src points at a file that existed in the original tree (or
 	 * optionally a file in the destination tree) and dst points
@@ -143,15 +143,6 @@ static int estimate_similarity(struct repository *r,
 	 */
 	unsigned long max_size, delta_size, base_size, src_copied, literal_added;
 	int score;
-	struct diff_populate_filespec_options dpf_options = {
-		.check_size_only = 1
-	};
-	struct prefetch_options prefetch_options = {r, skip_unmodified};
-
-	if (r == the_repository && has_promisor_remote()) {
-		dpf_options.missing_object_cb = prefetch;
-		dpf_options.missing_object_data = &prefetch_options;
-	}
 
 	/* We deal only with regular files.  Symlink renames are handled
 	 * only when they are exact matches --- in other words, no edits
@@ -169,11 +160,13 @@ static int estimate_similarity(struct repository *r,
 	 * is a possible size - we really should have a flag to
 	 * say whether the size is valid or not!)
 	 */
+	dpf_opt->check_size_only = 1;
+
 	if (!src->cnt_data &&
-	    diff_populate_filespec(r, src, &dpf_options))
+	    diff_populate_filespec(r, src, dpf_opt))
 		return 0;
 	if (!dst->cnt_data &&
-	    diff_populate_filespec(r, dst, &dpf_options))
+	    diff_populate_filespec(r, dst, dpf_opt))
 		return 0;
 
 	max_size = ((src->size > dst->size) ? src->size : dst->size);
@@ -191,11 +184,11 @@ static int estimate_similarity(struct repository *r,
 	if (max_size * (MAX_SCORE-minimum_score) < delta_size * MAX_SCORE)
 		return 0;
 
-	dpf_options.check_size_only = 0;
+	dpf_opt->check_size_only = 0;
 
-	if (!src->cnt_data && diff_populate_filespec(r, src, &dpf_options))
+	if (!src->cnt_data && diff_populate_filespec(r, src, dpf_opt))
 		return 0;
-	if (!dst->cnt_data && diff_populate_filespec(r, dst, &dpf_options))
+	if (!dst->cnt_data && diff_populate_filespec(r, dst, dpf_opt))
 		return 0;
 
 	if (diffcore_count_changes(r, src, dst,
@@ -862,7 +855,11 @@ static int find_basename_matches(struct diff_options *options,
 	int i, renames = 0;
 	struct strintmap sources;
 	struct strintmap dests;
-
+	struct diff_populate_filespec_options dpf_options = {
+		.check_binary = 0,
+		.missing_object_cb = NULL,
+		.missing_object_data = NULL
+	};
 	/*
 	 * The prefeteching stuff wants to know if it can skip prefetching
 	 * blobs that are unmodified...and will then do a little extra work
@@ -873,7 +870,10 @@ static int find_basename_matches(struct diff_options *options,
 	 * the extra work necessary to check if rename_src entries are
 	 * unmodified would be a small waste.
 	 */
-	int skip_unmodified = 0;
+	struct prefetch_options prefetch_options = {
+		.repo = options->repo,
+		.skip_unmodified = 0
+	};
 
 	/*
 	 * Create maps of basename -> fullname(s) for remaining sources and
@@ -910,6 +910,11 @@ static int find_basename_matches(struct diff_options *options,
 			strintmap_set(&dests, base, i);
 	}
 
+	if (options->repo == the_repository && has_promisor_remote()) {
+		dpf_options.missing_object_cb = prefetch;
+		dpf_options.missing_object_data = &prefetch_options;
+	}
+
 	/* Now look for basename matchups and do similarity estimation */
 	for (i = 0; i < rename_src_nr; ++i) {
 		char *filename = rename_src[i].p->one->path;
@@ -953,7 +958,7 @@ static int find_basename_matches(struct diff_options *options,
 			one = rename_src[src_index].p->one;
 			two = rename_dst[dst_index].p->two;
 			score = estimate_similarity(options->repo, one, two,
-						    minimum_score, skip_unmodified);
+						    minimum_score, &dpf_options);
 
 			/* If sufficiently similar, record as rename pair */
 			if (score < minimum_score)
@@ -1272,6 +1277,14 @@ void diffcore_rename_extended(struct diff_options *options,
 	int num_sources, want_copies;
 	struct progress *progress = NULL;
 	struct dir_rename_info info;
+	struct diff_populate_filespec_options dpf_options = {
+		.check_binary = 0,
+		.missing_object_cb = NULL,
+		.missing_object_data = NULL
+	};
+	struct prefetch_options prefetch_options = {
+		.repo = options->repo
+	};
 
 	trace2_region_enter("diff", "setup", options->repo);
 	info.setup = 0;
@@ -1433,6 +1446,13 @@ void diffcore_rename_extended(struct diff_options *options,
 				(uint64_t)num_destinations * (uint64_t)num_sources);
 	}
 
+	/* Finish setting up dpf_options */
+	prefetch_options.skip_unmodified = skip_unmodified;
+	if (options->repo == the_repository && has_promisor_remote()) {
+		dpf_options.missing_object_cb = prefetch;
+		dpf_options.missing_object_data = &prefetch_options;
+	}
+
 	CALLOC_ARRAY(mx, st_mult(NUM_CANDIDATE_PER_DST, num_destinations));
 	for (dst_cnt = i = 0; i < rename_dst_nr; i++) {
 		struct diff_filespec *two = rename_dst[i].p->two;
@@ -1458,7 +1478,7 @@ void diffcore_rename_extended(struct diff_options *options,
 			this_src.score = estimate_similarity(options->repo,
 							     one, two,
 							     minimum_score,
-							     skip_unmodified);
+							     &dpf_options);
 			this_src.name_score = basename_same(one, two);
 			this_src.dst = i;
 			this_src.src = j;
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v3 4/5] diffcore-rename: use a different prefetch for basename comparisons
  2021-06-22  8:04   ` [PATCH v3 " Elijah Newren via GitGitGadget
                       ` (2 preceding siblings ...)
  2021-06-22  8:04     ` [PATCH v3 3/5] diffcore-rename: allow different missing_object_cb functions Elijah Newren via GitGitGadget
@ 2021-06-22  8:04     ` Elijah Newren via GitGitGadget
  2021-06-22  8:04     ` [PATCH v3 5/5] merge-ort: add prefetching for content merges Elijah Newren via GitGitGadget
  2021-06-22 16:10     ` [PATCH v3 0/5] Optimization batch 13: partial clone optimizations for merge-ort Derrick Stolee
  5 siblings, 0 replies; 30+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-06-22  8:04 UTC (permalink / raw)
  To: git
  Cc: Jonathan Tan, Derrick Stolee, Taylor Blau, Derrick Stolee,
	Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

merge-ort was designed to minimize the amount of data needed and used,
and several changes were made to diffcore-rename to take advantage of
extra metadata to enable this data minimization (particularly the
relevant_sources variable for skipping "irrelevant" renames).  This
effort obviously succeeded in drastically reducing computation times,
but should also theoretically allow partial clones to download much less
information.  Previously, though, the "prefetch" command used in
diffcore-rename had never been modified and downloaded many blobs that
were unnecessary for merge-ort.  This commit corrects that.

When doing basename comparisons, we want to fetch only the objects that
will be used for basename comparisons.  If after basename fetching this
leaves us with no more relevant sources (or no more destinations), then
we won't need to do the full inexact rename detection and can skip
downloading additional source and destination files.  Even if we have to
do that later full inexact rename detection, irrelevant sources are
culled after basename matching and before the full inexact rename
detection, so we can still avoid downloading the blobs for irrelevant
sources.  Rename prefetch() to inexact_prefetch(), and introduce a
new basename_prefetch() to take advantage of this.

If we modify the testcase from commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2021-01-23)
to pass
    --sparse --filter=blob:none
to the clone command, and use the new trace2 "fetch_count" output from
a few commits ago to track both the number of fetch subcommands invoked
and the number of objects fetched across all those fetches, then for
the mega-renames testcase we observe the following:

BEFORE this commit, rebasing 35 patches:
    strategy     # of fetches    total # of objects fetched
    ---------    ------------    --------------------------
    recursive    62              11423
    ort          30              11391

AFTER this commit, rebasing the same 35 patches:
    ort          32                 63

This means that the new code only needs to download less than 2 blobs
per patch being rebased.  That is especially interesting given that the
repository at the start only had approximately half a dozen TOTAL blobs
downloaded to start with (because the default sparse-checkout of just
the toplevel directory was in use).

So, for this particular linux kernel testcase that involved ~26,000
renames on the upstream side (drivers/ -> pilots/) across which 35
patches were being rebased, this change reduces the number of blobs that
need to be downloaded by a factor of ~180.
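
Sanity-checking the arithmetic from the tables above (the numbers are
copied from the measurements; nothing new is assumed):

```python
# ort row from the before/after measurements above
objects_before = 11391   # total objects fetched before this commit
objects_after = 63       # total objects fetched after this commit
patches = 35

reduction_factor = objects_before / objects_after   # ~180x fewer blobs
blobs_per_patch = objects_after / patches           # under 2 per patch

print(round(reduction_factor), blobs_per_patch)
```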

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c              | 101 +++++++++++++++++++++++++++------
 t/t6421-merge-partial-clone.sh |   4 +-
 2 files changed, 85 insertions(+), 20 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index e13e52046026..4ef0459cfb50 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -87,13 +87,13 @@ struct diff_score {
 	short name_score;
 };
 
-struct prefetch_options {
+struct inexact_prefetch_options {
 	struct repository *repo;
 	int skip_unmodified;
 };
-static void prefetch(void *prefetch_options)
+static void inexact_prefetch(void *prefetch_options)
 {
-	struct prefetch_options *options = prefetch_options;
+	struct inexact_prefetch_options *options = prefetch_options;
 	int i;
 	struct oid_array to_fetch = OID_ARRAY_INIT;
 
@@ -816,6 +816,78 @@ static int idx_possible_rename(char *filename, struct dir_rename_info *info)
 	return idx;
 }
 
+struct basename_prefetch_options {
+	struct repository *repo;
+	struct strintmap *relevant_sources;
+	struct strintmap *sources;
+	struct strintmap *dests;
+	struct dir_rename_info *info;
+};
+static void basename_prefetch(void *prefetch_options)
+{
+	struct basename_prefetch_options *options = prefetch_options;
+	struct strintmap *relevant_sources = options->relevant_sources;
+	struct strintmap *sources = options->sources;
+	struct strintmap *dests = options->dests;
+	struct dir_rename_info *info = options->info;
+	int i;
+	struct oid_array to_fetch = OID_ARRAY_INIT;
+
+	/*
+	 * TODO: The following loops mirror the code/logic from
+	 * find_basename_matches(), though not quite exactly.  Maybe
+	 * abstract the iteration logic out somehow?
+	 */
+	for (i = 0; i < rename_src_nr; ++i) {
+		char *filename = rename_src[i].p->one->path;
+		const char *base = NULL;
+		intptr_t src_index;
+		intptr_t dst_index;
+
+		/* Skip irrelevant sources */
+		if (relevant_sources &&
+		    !strintmap_contains(relevant_sources, filename))
+			continue;
+
+		/*
+		 * If the basename is unique among remaining sources, then
+		 * src_index will equal 'i' and we can attempt to match it
+		 * to a unique basename in the destinations.  Otherwise,
+		 * use directory rename heuristics, if possible.
+		 */
+		base = get_basename(filename);
+		src_index = strintmap_get(sources, base);
+		assert(src_index == -1 || src_index == i);
+
+		if (strintmap_contains(dests, base)) {
+			struct diff_filespec *one, *two;
+
+			/* Find a matching destination, if possible */
+			dst_index = strintmap_get(dests, base);
+			if (src_index == -1 || dst_index == -1) {
+				src_index = i;
+				dst_index = idx_possible_rename(filename, info);
+			}
+			if (dst_index == -1)
+				continue;
+
+			/* Ignore this dest if already used in a rename */
+			if (rename_dst[dst_index].is_rename)
+				continue; /* already used previously */
+
+			one = rename_src[src_index].p->one;
+			two = rename_dst[dst_index].p->two;
+
+			/* Add the pairs */
+			diff_add_if_missing(options->repo, &to_fetch, two);
+			diff_add_if_missing(options->repo, &to_fetch, one);
+		}
+	}
+
+	promisor_remote_get_direct(options->repo, to_fetch.oid, to_fetch.nr);
+	oid_array_clear(&to_fetch);
+}
+
 static int find_basename_matches(struct diff_options *options,
 				 int minimum_score,
 				 struct dir_rename_info *info,
@@ -860,19 +932,12 @@ static int find_basename_matches(struct diff_options *options,
 		.missing_object_cb = NULL,
 		.missing_object_data = NULL
 	};
-	/*
-	 * The prefeteching stuff wants to know if it can skip prefetching
-	 * blobs that are unmodified...and will then do a little extra work
-	 * to verify that the oids are indeed different before prefetching.
-	 * Unmodified blobs are only relevant when doing copy detection;
-	 * when limiting to rename detection, diffcore_rename[_extended]()
-	 * will never be called with unmodified source paths fed to us, so
-	 * the extra work necessary to check if rename_src entries are
-	 * unmodified would be a small waste.
-	 */
-	struct prefetch_options prefetch_options = {
+	struct basename_prefetch_options prefetch_options = {
 		.repo = options->repo,
-		.skip_unmodified = 0
+		.relevant_sources = relevant_sources,
+		.sources = &sources,
+		.dests = &dests,
+		.info = info
 	};
 
 	/*
@@ -911,7 +976,7 @@ static int find_basename_matches(struct diff_options *options,
 	}
 
 	if (options->repo == the_repository && has_promisor_remote()) {
-		dpf_options.missing_object_cb = prefetch;
+		dpf_options.missing_object_cb = basename_prefetch;
 		dpf_options.missing_object_data = &prefetch_options;
 	}
 
@@ -1282,7 +1347,7 @@ void diffcore_rename_extended(struct diff_options *options,
 		.missing_object_cb = NULL,
 		.missing_object_data = NULL
 	};
-	struct prefetch_options prefetch_options = {
+	struct inexact_prefetch_options prefetch_options = {
 		.repo = options->repo
 	};
 
@@ -1449,7 +1514,7 @@ void diffcore_rename_extended(struct diff_options *options,
 	/* Finish setting up dpf_options */
 	prefetch_options.skip_unmodified = skip_unmodified;
 	if (options->repo == the_repository && has_promisor_remote()) {
-		dpf_options.missing_object_cb = prefetch;
+		dpf_options.missing_object_cb = inexact_prefetch;
 		dpf_options.missing_object_data = &prefetch_options;
 	}
 
diff --git a/t/t6421-merge-partial-clone.sh b/t/t6421-merge-partial-clone.sh
index 3dcffc15f801..aad975d24ddb 100755
--- a/t/t6421-merge-partial-clone.sh
+++ b/t/t6421-merge-partial-clone.sh
@@ -207,7 +207,7 @@ test_setup_repo () {
 #
 #   Summary: 2 fetches (1 for 2 objects, 1 for 1 object)
 #
-test_expect_merge_algorithm failure failure 'Objects downloaded for single relevant rename' '
+test_expect_merge_algorithm failure success 'Objects downloaded for single relevant rename' '
 	test_setup_repo &&
 	git clone --sparse --filter=blob:none "file://$(pwd)/server" objects-single &&
 	(
@@ -296,7 +296,7 @@ test_expect_merge_algorithm failure failure 'Objects downloaded for single relev
 #      this are not all that common.)
 #   Summary: 1 fetches for 6 objects
 #
-test_expect_merge_algorithm failure failure 'Objects downloaded when a directory rename triggered' '
+test_expect_merge_algorithm failure success 'Objects downloaded when a directory rename triggered' '
 	test_setup_repo &&
 	git clone --sparse --filter=blob:none "file://$(pwd)/server" objects-dir &&
 	(
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v3 5/5] merge-ort: add prefetching for content merges
  2021-06-22  8:04   ` [PATCH v3 " Elijah Newren via GitGitGadget
                       ` (3 preceding siblings ...)
  2021-06-22  8:04     ` [PATCH v3 4/5] diffcore-rename: use a different prefetch for basename comparisons Elijah Newren via GitGitGadget
@ 2021-06-22  8:04     ` Elijah Newren via GitGitGadget
  2021-06-22 16:10     ` [PATCH v3 0/5] Optimization batch 13: partial clone optimizations for merge-ort Derrick Stolee
  5 siblings, 0 replies; 30+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-06-22  8:04 UTC (permalink / raw)
  To: git
  Cc: Jonathan Tan, Derrick Stolee, Taylor Blau, Derrick Stolee,
	Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

Commit 7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-05)
introduced batching of fetching missing blobs, so that the diff
machinery would have one fetch subprocess grab N blobs instead of N
processes each grabbing 1.

However, the diff machinery is not the only thing in a merge that needs
to work on blobs.  The 3-way content merges need them as well.  Rather
than download all the blobs 1 at a time, prefetch all the blobs needed
for regular content merges.

This does not cover all possible paths in merge-ort that might need to
download blobs.  Others include:
  - The blob_unchanged() calls to avoid modify/delete conflicts (when
    blob renormalization results in an "unchanged" file)
  - Preliminary content merges needed for rename/add and
    rename/rename(2to1) style conflicts.  (Both of these types of
    conflicts can result in nested conflict markers from the need to do
    two levels of content merging; the first happens before our new
    prefetch_for_content_merges() function.)

The first of these wouldn't be an extreme amount of work to support, and
even the second could be theoretically supported in batching, but all of
these cases seem unusual to me, and this is a minor performance
optimization anyway; in the worst case we only get some of the fetches
batched and have a few additional one-off fetches.  So for now, just
handle the regular 3-way content merges in our prefetching.

For the testcase from the previous commit, the number of downloaded
objects remains at 63, but this drops the number of fetches needed from
32 down to 20, a sizeable reduction.
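A fetch count like the ones quoted above can be observed on a toy
repository via trace2 events; this sketch is illustrative only (made-up
repo and file names, not the linux.git testcase):

```shell
# Count the fetch subprocesses spawned while a merge performs a 3-way
# content merge in a blob:none partial clone, using trace2 events.
set -e
tmp=$(mktemp -d)
git init -q -b main "$tmp/server"
git -C "$tmp/server" config uploadpack.allowfilter true
git -C "$tmp/server" config uploadpack.allowanysha1inwant true
git -C "$tmp/server" config user.name tester
git -C "$tmp/server" config user.email tester@example.com
printf 'one\ntwo\nthree\n' >"$tmp/server/file.txt"
git -C "$tmp/server" add file.txt
git -C "$tmp/server" commit -qm base
git -C "$tmp/server" checkout -qb side
printf 'one\ntwo\nTHREE\n' >"$tmp/server/file.txt"
git -C "$tmp/server" commit -aqm side
git -C "$tmp/server" checkout -q main
printf 'ONE\ntwo\nthree\n' >"$tmp/server/file.txt"
git -C "$tmp/server" commit -aqm ours

git clone -q --filter=blob:none "file://$tmp/server" "$tmp/client"
git -C "$tmp/client" config user.name tester
git -C "$tmp/client" config user.email tester@example.com

# The merge needs the base and "theirs" blobs, which the clone skipped.
GIT_TRACE2_EVENT="$tmp/events.json" \
	git -C "$tmp/client" merge -q --no-edit origin/side

# Each lazy fetch shows up as a child_start event running "fetch";
# with batched prefetching this is typically a single fetch.
fetches=$(grep '"child_start"' "$tmp/events.json" | grep -c '"fetch"') || true
echo "lazy fetches during merge: $fetches"
```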

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c                    | 50 ++++++++++++++++++++++++++++++++++
 t/t6421-merge-partial-clone.sh |  2 +-
 2 files changed, 51 insertions(+), 1 deletion(-)

diff --git a/merge-ort.c b/merge-ort.c
index cfa751053b01..e3a5dfc7b312 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -29,6 +29,7 @@
 #include "entry.h"
 #include "ll-merge.h"
 #include "object-store.h"
+#include "promisor-remote.h"
 #include "revision.h"
 #include "strmap.h"
 #include "submodule.h"
@@ -3485,6 +3486,54 @@ static void process_entry(struct merge_options *opt,
 	record_entry_for_tree(dir_metadata, path, &ci->merged);
 }
 
+static void prefetch_for_content_merges(struct merge_options *opt,
+					struct string_list *plist)
+{
+	struct string_list_item *e;
+	struct oid_array to_fetch = OID_ARRAY_INIT;
+
+	if (opt->repo != the_repository || !has_promisor_remote())
+		return;
+
+	for (e = &plist->items[plist->nr-1]; e >= plist->items; --e) {
+		/* char *path = e->string; */
+		struct conflict_info *ci = e->util;
+		int i;
+
+		/* Ignore clean entries */
+		if (ci->merged.clean)
+			continue;
+
+		/* Ignore entries that don't need a content merge */
+		if (ci->match_mask || ci->filemask < 6 ||
+		    !S_ISREG(ci->stages[1].mode) ||
+		    !S_ISREG(ci->stages[2].mode) ||
+		    oideq(&ci->stages[1].oid, &ci->stages[2].oid))
+			continue;
+
+		/* Also don't need content merge if base matches either side */
+		if (ci->filemask == 7 &&
+		    S_ISREG(ci->stages[0].mode) &&
+		    (oideq(&ci->stages[0].oid, &ci->stages[1].oid) ||
+		     oideq(&ci->stages[0].oid, &ci->stages[2].oid)))
+			continue;
+
+		for (i = 0; i < 3; i++) {
+			unsigned side_mask = (1 << i);
+			struct version_info *vi = &ci->stages[i];
+
+			if ((ci->filemask & side_mask) &&
+			    S_ISREG(vi->mode) &&
+			    oid_object_info_extended(opt->repo, &vi->oid, NULL,
+						     OBJECT_INFO_FOR_PREFETCH))
+				oid_array_append(&to_fetch, &vi->oid);
+		}
+	}
+
+	promisor_remote_get_direct(opt->repo, to_fetch.oid, to_fetch.nr);
+	oid_array_clear(&to_fetch);
+}
+
 static void process_entries(struct merge_options *opt,
 			    struct object_id *result_oid)
 {
@@ -3531,6 +3580,7 @@ static void process_entries(struct merge_options *opt,
 	 * the way when it is time to process the file at the same path).
 	 */
 	trace2_region_enter("merge", "processing", opt->repo);
+	prefetch_for_content_merges(opt, &plist);
 	for (entry = &plist.items[plist.nr-1]; entry >= plist.items; --entry) {
 		char *path = entry->string;
 		/*
diff --git a/t/t6421-merge-partial-clone.sh b/t/t6421-merge-partial-clone.sh
index aad975d24ddb..36bcd7c32801 100755
--- a/t/t6421-merge-partial-clone.sh
+++ b/t/t6421-merge-partial-clone.sh
@@ -397,7 +397,7 @@ test_expect_merge_algorithm failure success 'Objects downloaded when a directory
 #
 #   Summary: 4 fetches (1 for 6 objects, 1 for 8, 1 for 3, 1 for 2)
 #
-test_expect_merge_algorithm failure failure 'Objects downloaded with lots of renames and modifications' '
+test_expect_merge_algorithm failure success 'Objects downloaded with lots of renames and modifications' '
 	test_setup_repo &&
 	git clone --sparse --filter=blob:none "file://$(pwd)/server" objects-many &&
 	(
-- 
gitgitgadget


* Re: [PATCH v3 0/5] Optimization batch 13: partial clone optimizations for merge-ort
  2021-06-22  8:04   ` [PATCH v3 " Elijah Newren via GitGitGadget
                       ` (4 preceding siblings ...)
  2021-06-22  8:04     ` [PATCH v3 5/5] merge-ort: add prefetching for content merges Elijah Newren via GitGitGadget
@ 2021-06-22 16:10     ` Derrick Stolee
  2021-06-22 18:45       ` Elijah Newren
  5 siblings, 1 reply; 30+ messages in thread
From: Derrick Stolee @ 2021-06-22 16:10 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget, git
  Cc: Jonathan Tan, Derrick Stolee, Taylor Blau, Elijah Newren

On 6/22/2021 4:04 AM, Elijah Newren via GitGitGadget wrote:
> This series optimizes blob downloading in merges for partial clones. It can
> apply on master. It's independent of ort-perf-batch-12.

As promised, I completed a performance evaluation of this series as well
as ort-perf-batch-12 (and all earlier batches) against our microsoft/git
fork and running in one of our large monorepos that has over 2 million
files at HEAD. Here are my findings.

In my comparisons, I compare the recursive merge strategy with renames
disabled against the ORT strategy (which always has renames enabled). When
I enabled renames for the recursive merge I saw the partial clone logic
kick in and start downloading many files in every case, so I dropped that
from consideration.


Experiment #1: Large RI/FI merges
---------------------------------

Most of the merge commits in this repo's history are merges of several
long-lived branches as code is merged across organizational boundaries. I
focused on the merge commits in the first-parent history, which mean these
are the merges that brought the latest changes from several areas into the
canonical version of the software.

These merges are all automated merges created by libgit2's implementation
of the recursive merge strategy. Since they are created on the server,
these will not have any merge conflicts.

They are still interesting because the sheer number of files that change
can be large. This is a pain point for the recursive merge because many
index entries need to update with the merge. For ORT, some of the updates
are simple because only one side changed a certain subtree (the
organizational boundaries also correspond to the directory structure in
many cases).

Across these merges I tested, ORT was _always_ faster and was consistent
with the recursive strategy. Even more interesting was the fact that the
recursive strategy had very slow outliers while the ORT strategy was much
more consistent:

     Recursive     ORT
-----------------------
MAX     34.97s    4.74s
P90     30.04s    4.50s
P75     15.35s    3.74s
P50      7.22s    3.39s
P10      3.61s    3.08s

(I'm not testing ORT with the sparse-index yet. A significant portion of
this 3 second lower bound is due to reading and writing the index file
with 2 million entries. I _am_ using sparse-checkout with only the files
at root, which minimizes the time required to update the working directory
with any changed files.)

For these merges, ORT is a clear win.


Experiment #2: User-created merges
----------------------------------

To find merges that might be created by actual user cases, I ran
'git rev-list --grep="^Merge branch"' to get merges that had default
messages from 'git merge' runs. (The merges from Experiment #1 had other
automated names that did not appear in this search.)

Here, the differences are less striking, but still valuable:

     Recursive     ORT
-----------------------
MAX     10.61s   6.27s
P75      8.81s   3.92s
P50      4.32s   3.21s
P10      3.53s   2.95s

The ORT strategy had more variance in these examples, though still not as
much as the recursive strategy. Here the variance is due to conflicting
files needing content merges, which usually were automatically resolved.

This version of the experiment provided interesting observations in a few
cases:

1. One case had the recursive merge strategy result in a root tree that
   disagreed with what the user committed, but the ORT strategy _did_ the
   correct resolution. Likely, this is due to the rename detection and
   resolution. The user probably had to manually resolve the merge to
   match their expected renames since we turn off merge.renames in their
   config.

2. I watched for the partial clone logic to kick in and download blobs.
   Some of these were inevitable: we need the blobs to resolve edit/edit
   conflicts. Most cases none were downloaded at all, so this series is
   working as advertised. There _was_ a case where the inexact rename
   detection requested a large list of files (~2900 in three batches) but
   _then_ said "inexact rename detection was skipped due to too many
   files". This is a case that would be nice to resolve in this series. I
   will try to find exactly where in the code this is being triggered and
   report back.

3. As I mentioned, I was using sparse-checkout to limit the size of the
   working directory. In one case of a conflict that could not be
   automatically resolved, the ORT strategy output this error:

   error: could not open '<X>': No such file or directory

   It seems we are looking for a file on disk without considering if it
   might have the SKIP_WORKTREE bit on in the index. I don't think this is
   an issue for this series, but might require a follow-up on top of the
   other ORT work.


Conclusions
-----------

I continue to be excited about the ORT strategy and will likely be
focusing on it in a month or so to integrate it with the sparse-index. I
think we would be interested in making the ORT strategy a new default for
Scalar, but we might really want it to respect merge.renames=false if only
so we can deploy the settings in stages (first, change the strategy, then
enable renames as an independent step) so we can isolate concerns.


Thanks!
-Stolee


* Re: [PATCH v3 0/5] Optimization batch 13: partial clone optimizations for merge-ort
  2021-06-22 16:10     ` [PATCH v3 0/5] Optimization batch 13: partial clone optimizations for merge-ort Derrick Stolee
@ 2021-06-22 18:45       ` Elijah Newren
  2021-06-23  2:14         ` Derrick Stolee
  0 siblings, 1 reply; 30+ messages in thread
From: Elijah Newren @ 2021-06-22 18:45 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Jonathan Tan,
	Derrick Stolee, Taylor Blau

On Tue, Jun 22, 2021 at 9:10 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 6/22/2021 4:04 AM, Elijah Newren via GitGitGadget wrote:
> > This series optimizes blob downloading in merges for partial clones. It can
> > apply on master. It's independent of ort-perf-batch-12.
>
> As promised, I completed a performance evaluation of this series as well
> as ort-perf-batch-12 (and all earlier batches) against our microsoft/git
> fork and running in one of our large monorepos that has over 2 million
> files at HEAD. Here are my findings.

Nice, thanks for the investigation and the detailed report!

> In my comparisons, I compare the recursive merge strategy with renames
> disabled against the ORT strategy (which always has renames enabled). When
> I enabled renames for the recursive merge I saw the partial clone logic
> kick in and start downloading many files in every case, so I dropped that
> from consideration.

I understand it'd take way too long to do an exhaustive testing, but
just doing a couple example comparisons against recursive with rename
detection turned on (and merge.renameLimit set to e.g. 0, or to some
high number if needed) would still be interesting.
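For concreteness, the kind of per-invocation override being suggested,
on a made-up toy repo with one rename-plus-modify so that inexact
detection actually has to run:

```shell
# Toy example: a rename with an edit on one side and an in-place edit on
# the other; inexact rename detection must connect old.txt -> new.txt.
set -e
tmp=$(mktemp -d)
git init -q -b main "$tmp/repo"
git -C "$tmp/repo" config user.name tester
git -C "$tmp/repo" config user.email tester@example.com
seq 1 10 >"$tmp/repo/old.txt"
git -C "$tmp/repo" add old.txt
git -C "$tmp/repo" commit -qm base
git -C "$tmp/repo" checkout -qb rename-side
git -C "$tmp/repo" mv old.txt new.txt
printf '%s\n' 1 2 3 4 5 6 7 8 9 TEN >"$tmp/repo/new.txt"
git -C "$tmp/repo" commit -aqm 'rename and tweak'
git -C "$tmp/repo" checkout -q main
printf '%s\n' ONE 2 3 4 5 6 7 8 9 10 >"$tmp/repo/old.txt"
git -C "$tmp/repo" commit -aqm 'modify in place'

# recursive, with renames explicitly on and a raised limit as suggested:
git -C "$tmp/repo" -c merge.renames=true -c merge.renameLimit=32767 \
	merge -q --no-edit -s recursive rename-side

# rename detection routed both edits into new.txt
grep -qx ONE "$tmp/repo/new.txt" && grep -qx TEN "$tmp/repo/new.txt"
```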

> Experiment #1: Large RI/FI merges
> ---------------------------------
>
...
>      Recursive     ORT
> -----------------------
> MAX     34.97s    4.74s
> P90     30.04s    4.50s
> P75     15.35s    3.74s
> P50      7.22s    3.39s
> P10      3.61s    3.08s
>
> (I'm not testing ORT with the sparse-index yet. A significant portion of
> this 3 second lower bound is due to reading and writing the index file
> with 2 million entries. I _am_ using sparse-checkout with only the files
> at root, which minimizes the time required to update the working directory
> with any changed files.)
>
> For these merges, ORT is a clear win.

Wahoo!

I suspect I know what you're seeing from recursive as well; recursive
had some quadratic behaviors found within it, separate from the calls
into diffcore-rename.  It was much less likely to be triggered with
rename detection turned off, but could still be triggered.  I suspect
you just had enough testcases that some of them were affected.  (I
particularly suspect the item described at (5a) of
https://lore.kernel.org/git/pull.842.git.1612331345.gitgitgadget@gmail.com/)

Given the 2 million entries, I'm curious if ort-perf-batch-14 (not yet
submitted; I've been waiting a few weeks for ort-perf-batch-12 to hit
next) will provide further significant benefits.  If it really does
take 3s or so to do index reading and writing, then it'd only be able
to help with the slower cases, but I'm still curious.  It helped
rather dramatically on my linux kernel testcase with lots of renames.


> Experiment #2: User-created merges
> ----------------------------------
>
> To find merges that might be created by actual user cases, I ran
> 'git rev-list --grep="^Merge branch"' to get merges that had default
> messages from 'git merge' runs. (The merges from Experiment #1 had other
> automated names that did not appear in this search.)
>
> Here, the differences are less striking, but still valuable:
>
>      Recursive     ORT
> -----------------------
> MAX     10.61s   6.27s
> P75      8.81s   3.92s
> P50      4.32s   3.21s
> P10      3.53s   2.95s
>
> The ORT strategy had more variance in these examples, though still not as
> much as the recursive strategy. Here the variance is due to conflicting
> files needing content merges, which usually were automatically resolved.

Is the max case on the ort side related to item #2 below, i.e. perhaps
a case where some file could not be automatically resolved?

> This version of the experiment provided interesting observations in a few
> cases:
>
> 1. One case had the recursive merge strategy result in a root tree that
>    disagreed with what the user committed, but the ORT strategy _did_ the
>    correct resolution. Likely, this is due to the rename detection and
>    resolution. The user probably had to manually resolve the merge to
>    match their expected renames since we turn off merge.renames in their
>    config.

I agree that seems like the likely explanation.

> 2. I watched for the partial clone logic to kick in and download blobs.
>    Some of these were inevitable: we need the blobs to resolve edit/edit
>    conflicts. Most cases none were downloaded at all, so this series is
>    working as advertised. There _was_ a case where the inexact rename
>    detection requested a large list of files (~2900 in three batches) but
>    _then_ said "inexact rename detection was skipped due to too many
>    files". This is a case that would be nice to resolve in this series. I
>    will try to find exactly where in the code this is being triggered and
>    report back.

This suggests perhaps that EITHER there was a real modify/delete
conflict (because you have to do full rename detection to rule out
that the modify/delete was part of some rename), OR that there was a
renamed file modified on both sides that did not keep its original
basename (because that combination is needed to bypass the various
optimizations and make it fall back to full inexact rename detection).
Further, in either case, there were enough adds/deletes that full
inexact detection is still a bit expensive.  It'd be interesting to
know which case it was.  What happens if you set merge.renameLimit to
something higher (the default is surprisingly small)?

> 3. As I mentioned, I was using sparse-checkout to limit the size of the
>    working directory. In one case of a conflict that could not be
>    automatically resolved, the ORT strategy output this error:
>
>    error: could not open '<X>': No such file or directory
>
>    It seems we are looking for a file on disk without considering if it
>    might have the SKIP_WORKTREE bit on in the index. I don't think this is
>    an issue for this series, but might require a follow-up on top of the
>    other ORT work.

I suspect either checkout() or record_conflicted_index_entries() are
to blame, and yeah, it'd be an issue from a previous series.  If you
have any further details or even hints about reproducing (doesn't have
to be a complete set of steps or minimal testcase), I would love to
hear it.  (Incidentally, of all the reviews on merge-ort, those two
functions were left out; look for "17" and "19" in
https://lore.kernel.org/git/pull.923.git.git.1606635803.gitgitgadget@gmail.com/
)


> Conclusions
> -----------
>
> I continue to be excited about the ORT strategy and will likely be
> focusing on it in a month or so to integrate it with the sparse-index. I
> think we would be interested in making the ORT strategy a new default for
> Scalar, but we might really want it to respect merge.renames=false if only
> so we can deploy the settings in stages (first, change the strategy, then
> enable renames as an independent step) so we can isolate concerns.

Given that my ort performance work concentrated on rename detection,
it's encouraging to see that ort-WITH-renames is generally faster for
you than recursive-WITHOUT-renames.  That means not only that rename
detection can be done in reasonable time now, but that the
improvements outside of rename detection show some real benefits as
well.  It'll be interesting to see how the final two performance
series for merge-ort, which will concentrate on performance on areas
other than rename detection, will affect these above results.


And by the way, thanks for the herculean review effort you've put in.
You've reviewed nearly every merge-ort and diffcore-rename patch
(which included many big, tricky patches)...and a quick search says
there have been at least 128 patches so far in the last 8 months.  And
the code is better (cleaner, clearer, and faster) for all your
reviews.


* Re: [PATCH v3 0/5] Optimization batch 13: partial clone optimizations for merge-ort
  2021-06-22 18:45       ` Elijah Newren
@ 2021-06-23  2:14         ` Derrick Stolee
  2021-06-23  8:11           ` Elijah Newren
  0 siblings, 1 reply; 30+ messages in thread
From: Derrick Stolee @ 2021-06-23  2:14 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Jonathan Tan,
	Derrick Stolee, Taylor Blau

On 6/22/2021 2:45 PM, Elijah Newren wrote:
> On Tue, Jun 22, 2021 at 9:10 AM Derrick Stolee <stolee@gmail.com> wrote:

I want to focus on this item:

>> 2. I watched for the partial clone logic to kick in and download blobs.
>>    Some of these were inevitable: we need the blobs to resolve edit/edit
>>    conflicts. Most cases none were downloaded at all, so this series is
>>    working as advertised. There _was_ a case where the inexact rename
>>    detection requested a large list of files (~2900 in three batches) but
>>    _then_ said "inexact rename detection was skipped due to too many
>>    files". This is a case that would be nice to resolve in this series. I
>>    will try to find exactly where in the code this is being triggered and
>>    report back.
> 
> This suggests perhaps that EITHER there was a real modify/delete
> conflict (because you have to do full rename detection to rule out
> that the modify/delete was part of some rename), OR that there was a
> renamed file modified on both sides that did not keep its original
> basename (because that combination is needed to bypass the various
> optimizations and make it fall back to full inexact rename detection).
> Further, in either case, there were enough adds/deletes that full
> inexact detection is still a bit expensive.  It'd be interesting to
> know which case it was.  What happens if you set merge.renameLimit to
> something higher (the default is surprisingly small)?

The behavior I'd like to see is that the partial clone logic is not
run if we are going to download more than merge.renameLimit files.
Whatever is getting these missing blobs is earlier than the limit
check, but it should be after instead.

It's particularly problematic that Git does all the work to get the
blobs, but then gives up and doesn't even use them for rename
detection.

I'm happy that we download necessary blobs when there are a few
dozen files that need inexact renames. When it gets into the
thousands, then we jump into a different category of user experience.

Having a stop-gap of rename detection limits is an important way to
avoid huge amounts of file downloads in these huge repo cases. Users
can always opt into a larger limit if they really do want that rename
detection to work at such a large scale, but we still need protections
for the vast majority of cases where a user isn't willing to pay the
cost of downloading these blobs.

Thanks,
-Stolee


* Re: [PATCH v3 0/5] Optimization batch 13: partial clone optimizations for merge-ort
  2021-06-23  2:14         ` Derrick Stolee
@ 2021-06-23  8:11           ` Elijah Newren
  2021-06-23 17:31             ` Elijah Newren
  0 siblings, 1 reply; 30+ messages in thread
From: Elijah Newren @ 2021-06-23  8:11 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Jonathan Tan,
	Derrick Stolee, Taylor Blau

On Tue, Jun 22, 2021 at 7:14 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 6/22/2021 2:45 PM, Elijah Newren wrote:
> > On Tue, Jun 22, 2021 at 9:10 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> I want to focus on this item:
>
> >> 2. I watched for the partial clone logic to kick in and download blobs.
> >>    Some of these were inevitable: we need the blobs to resolve edit/edit
> >>    conflicts. Most cases none were downloaded at all, so this series is
> >>    working as advertised. There _was_ a case where the inexact rename
> >>    detection requested a large list of files (~2900 in three batches) but
> >>    _then_ said "inexact rename detection was skipped due to too many
> >>    files". This is a case that would be nice to resolve in this series. I
> >>    will try to find exactly where in the code this is being triggered and
> >>    report back.
> >
> > This suggests perhaps that EITHER there was a real modify/delete
> > conflict (because you have to do full rename detection to rule out
> > that the modify/delete was part of some rename), OR that there was a
> > renamed file modified on both sides that did not keep its original
> > basename (because that combination is needed to bypass the various
> > optimizations and make it fall back to full inexact rename detection).
> > Further, in either case, there were enough adds/deletes that full
> > inexact detection is still a bit expensive.  It'd be interesting to
> > know which case it was.  What happens if you set merge.renameLimit to
> > something higher (the default is surprisingly small)?
>
> The behavior I'd like to see is that the partial clone logic is not
> run if we are going to download more than merge.renameLimit files.
> Whatever is getting these missing blobs is earlier than the limit
> check, but it should be after instead.
>
> It's particularly problematic that Git does all the work to get the
> blobs, but then gives up and doesn't even use them for rename
> detection.

I agree with what should happen, but I'm surprised it's not already
happening.  The give-up check comes from too_many_rename_candidates(),
which is called before dpf_options.missing_object_cb is even set to
inexact_prefetch.  So I'm not sure how the fetching comes first.  Is
there a microsoft specific patch that changes the order somehow?  Is
there something I'm mis-reading in the code?

> I'm happy that we download necessary blobs when there are a few
> dozen files that need inexact renames. When it gets into the
> thousands, then we jump into a different category of user experience.
>
> Having a stop-gap of rename detection limits is an important way to
> avoid huge amounts of file downloads in these huge repo cases. Users
> can always opt into a larger limit if they really do want that rename
> detection to work at such a large scale, but we still need protections
> for the vast majority of cases where a user isn't willing to pay the
> cost of downloading these blobs.

Sure, that's fair.

But I'm still curious what the particular shape is for the data in
question.  What does the error say merge.renameLimit would be need to
set to?  If it's set higher, do some of the files resolve nicely (i.e.
they really were renames modified on both sides with a different
basename), or are they modify/delete conflicts and we're paying the
cost of rename detection to verify they were deleted and we really do
have a conflict?  I'm curious if there's more to learn and more
optimization potential, basically.  Your repository is bigger, so
there may be more to learn from it than from the testcases I've tried
so far.  :-)


* Re: [PATCH v3 0/5] Optimization batch 13: partial clone optimizations for merge-ort
  2021-06-23  8:11           ` Elijah Newren
@ 2021-06-23 17:31             ` Elijah Newren
  0 siblings, 0 replies; 30+ messages in thread
From: Elijah Newren @ 2021-06-23 17:31 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Jonathan Tan,
	Derrick Stolee, Taylor Blau

On Wed, Jun 23, 2021 at 1:11 AM Elijah Newren <newren@gmail.com> wrote:
>
> On Tue, Jun 22, 2021 at 7:14 PM Derrick Stolee <stolee@gmail.com> wrote:
> >
> > On 6/22/2021 2:45 PM, Elijah Newren wrote:
> > > On Tue, Jun 22, 2021 at 9:10 AM Derrick Stolee <stolee@gmail.com> wrote:
> >
> > I want to focus on this item:
> >
> > >> 2. I watched for the partial clone logic to kick in and download blobs.
> > >>    Some of these were inevitable: we need the blobs to resolve edit/edit
> > >>    conflicts. Most cases none were downloaded at all, so this series is
> > >>    working as advertised. There _was_ a case where the inexact rename
> > >>    detection requested a large list of files (~2900 in three batches) but
> > >>    _then_ said "inexact rename detection was skipped due to too many
> > >>    files". This is a case that would be nice to resolve in this series. I
> > >>    will try to find exactly where in the code this is being triggered and
> > >>    report back.
> > >
> > > This suggests perhaps that EITHER there was a real modify/delete
> > > conflict (because you have to do full rename detection to rule out
> > > that the modify/delete was part of some rename), OR that there was a
> > > renamed file modified on both sides that did not keep its original
> > > basename (because that combination is needed to bypass the various
> > > optimizations and make it fall back to full inexact rename detection).
> > > Further, in either case, there were enough adds/deletes that full
> > > inexact detection is still a bit expensive.  It'd be interesting to
> > > know which case it was.  What happens if you set merge.renameLimit to
> > > something higher (the default is surprisingly small)?
> >
> > The behavior I'd like to see is that the partial clone logic is not
> > run if we are going to download more than merge.renameLimit files.
> > Whatever is getting these missing blobs is earlier than the limit
> > check, but it should be after instead.
> >
> > It's particularly problematic that Git does all the work to get the
> > blobs, but then gives up and doesn't even use them for rename
> > detection.
>
> I agree with what should happen, but I'm surprised it's not already
> happening.  The give-up check comes from too_many_rename_candidates(),
> which is called before dpf_options.missing_object_cb is even set to
> inexact_prefetch.  So I'm not sure how the fetching comes first.  Is
> there a Microsoft-specific patch that changes the order somehow?  Is
> there something I'm mis-reading in the code?

After thinking about it more, I wonder if the following is what is happening:

* Some directory was renamed (and likely not a leaf), resulting in a
large pile of renames (or some other change that gives lots of renames
without changing basename)
* basename_prefetch() kicks in to download files whose basenames have
changed (even for "irrelevant" renames)
* Basename matching is performed (which is linear in the number of
renames, and not subject to the merge.renameLimit)
* After basename matching, there are still unmatched destinations and
relevant sources, which would need to be handled by the quadratic
inexact matching algorithm.
* Since there are enough renames left to trigger the renameLimit, you
get the "inexact rename detection was skipped due to too many files"
warning.

and then you assumed the prefetching was for the quadratic inexact
rename detection when in reality it was just for the linear basename
matching.
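If that's the right explanation, the control flow would look roughly
like this (a minimal Python sketch of the hypothesized ordering; all
function and parameter names here are hypothetical stand-ins, not
git's actual API, and content-similarity checks are omitted):

```python
import os

def detect_renames(sources, dests, rename_limit, prefetch, warn):
    """Hypothesized ordering: the basename prefetch/match runs
    unconditionally, while only the quadratic inexact pass is guarded
    by the rename limit."""
    renames = {}

    # Linear basename matching: prefetches candidate blobs and is NOT
    # subject to rename_limit.
    dest_by_base = {os.path.basename(d): d for d in dests}
    candidates = [s for s in sources if os.path.basename(s) in dest_by_base]
    prefetch("basename", candidates)
    for src in candidates:
        renames[src] = dest_by_base[os.path.basename(src)]

    # Quadratic inexact matching: guarded by the rename limit, so it
    # can bail out after the basename prefetch already happened.
    left_src = [s for s in sources if s not in renames]
    left_dst = [d for d in dests if d not in renames.values()]
    if len(left_src) * len(left_dst) > rename_limit * rename_limit:
        warn("inexact rename detection was skipped due to too many files")
        return renames
    prefetch("inexact", left_src + left_dst)
    # ... content comparisons would pair up left_src/left_dst here ...
    return renames
```

With enough unmatched paths left over after the basename pass, the
"basename" prefetch still runs before the warning fires, which would
explain seeing downloads followed by the skip message.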

If so, there's a couple things we could do here:

* The easy change: modify the warning, perhaps to something like
"quadratic inexact rename detection was skipped...", to make it better
reflect its meaning (basename matching is also an inexact match)
* The more complicated change: be more aggressive about not detecting
renames via basename match for "irrelevant" sources...perhaps avoiding
it when it involves prefetching, or maybe just when we can somehow
determine that we'd bail on the quadratic inexact rename detection
anyway.

The second point perhaps deserves a bit more explanation.  Basename
matching has an advantage of removing both sources and destinations
from the later quadratic (O(N*M)) inexact rename detection, so it was
generally beneficial to just always do the basename matching even if
the path wasn't relevant for either content or location reasons.  But
that was presuming later quadratic inexact rename detection was going
to run and not be skipped.  It was one of those tradeoffs in
optimization.  But if the file hasn't been prefetched and we're likely
to skip the quadratic inexact rename detection, then trying to do
basename matching for an irrelevant rename is wasted work, as is
prefetching the blobs in order to be able to detect that irrelevant
rename.  But how do we know if we'll skip the quadratic rename
detection if we don't match the basenames that we can, since every
matched basename removes both a source and a destination from the
unmatched pairs?

Maybe we should reorder the basename matching as a two-pass algorithm,
where we first detect basename-matching renames for all relevant
sources, and then repeat for non-relevant sources.

That would also allow us to insert a check before the second pass,
where if the first pass removed all relevant sources (meaning both
that no relevant sources were left out of basename matching and we
found a match for every relevant source included in basename
matching), then we can exit without doing the second pass.  That might
even improve the performance in cases without prefetching.  But it
would mean adding another prefetch step.
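That two-pass split might look something like this (an illustrative
Python sketch with hypothetical helper names, similarity checks
omitted; the real implementation would live in diffcore-rename.c):

```python
import os

def basename_match_two_pass(sources, dests, is_relevant):
    """Pass 1 matches relevant sources by basename; pass 2 (irrelevant
    sources) runs only if some relevant source is still unmatched."""
    dest_by_base = {os.path.basename(d): d for d in dests}
    matches = {}

    def run_pass(candidates):
        for src in candidates:
            # Pop so a matched destination is removed from the pool.
            dst = dest_by_base.pop(os.path.basename(src), None)
            if dst is not None:
                matches[src] = dst

    relevant = [s for s in sources if is_relevant(s)]
    run_pass(relevant)                     # pass 1: relevant sources
    if all(s in matches for s in relevant):
        # Every relevant source was matched: no quadratic pass would be
        # needed, so skip pass 2 (and its prefetch) entirely.
        return matches
    run_pass([s for s in sources if not is_relevant(s)])  # pass 2
    return matches
```

The early-exit between the passes is where the extra prefetch step
would otherwise have to go.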

After doing the above splitting, then we could add extra conditions
before the second step, such as bailing on it if we think we'd bail on
quadratic inexact rename detection (which may still be hard to guess).
Or maybe even just bailing on the second step if prefetching would be
involved, because we'd rather lump those all into the quadratic
inexact rename detection anyway.
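For scale: since each basename match removes one source and one
destination from the quadratic pool, the savings compound
multiplicatively.  A quick back-of-the-envelope (illustrative numbers
only):

```python
def inexact_comparisons(num_sources, num_dests, basename_matches):
    """Content comparisons left for O(N*M) inexact detection after
    basename matching removed `basename_matches` source/dest pairs."""
    return (num_sources - basename_matches) * (num_dests - basename_matches)

# 1000 deleted and 1000 added paths: 1,000,000 comparisons up front;
# matching 900 basenames leaves only 100 * 100 = 10,000.
print(inexact_comparisons(1000, 1000, 0))
print(inexact_comparisons(1000, 1000, 900))
```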

> But I'm still curious what the particular shape is for the data in
> question.  What does the error say merge.renameLimit would need to be
> set to?  If it's set higher, do some of the files resolve nicely (i.e.
> they really were renames modified on both sides with a different
> basename), or are they modify/delete conflicts and we're paying the
> cost of rename detection to verify they were deleted and we really do
> have a conflict?  I'm curious if there's more to learn and more
> optimization potential, basically.  Your repository is bigger, so
> there may be more to learn from it than from the testcases I've tried
> so far.  :-)

We might have just identified multiple additional optimization
opportunities above, none of which are included in my pending
optimization series.  It would be helpful to get more details about
how frequently these kinds of cases occur, the particular renameLimit
in use, the number of paths involved (number of unmatched sources and
destinations), how many of the sources are relevant (perhaps even
broken down into content relevant and location relevant), etc., as
these all may help inform the implementation and whatever tests we
want to add to the testsuite.

However, some of that info is currently hard to gather.  I could
probably start by adding some trace2 statistics to diffcore-rename to
print out the original number of sources and destinations, the number
matched via exact matching, the number matched via basename matching,
the number of sources removed due to being irrelevant, and anything
else I might be overlooking at the moment to help gather relevant
data.


end of thread, other threads:[~2021-06-23 17:31 UTC | newest]

Thread overview: 30+ messages
2021-06-05  1:27 [PATCH 0/5] Optimization batch 13: partial clone optimizations for merge-ort Elijah Newren via GitGitGadget
2021-06-05  1:28 ` [PATCH 1/5] promisor-remote: output trace2 statistics for number of objects fetched Elijah Newren via GitGitGadget
2021-06-09 21:12   ` Derrick Stolee
2021-06-05  1:28 ` [PATCH 2/5] t6421: add tests checking for excessive object downloads during merge Elijah Newren via GitGitGadget
2021-06-09 21:16   ` Derrick Stolee
2021-06-05  1:28 ` [PATCH 3/5] diffcore-rename: allow different missing_object_cb functions Elijah Newren via GitGitGadget
2021-06-05  1:28 ` [PATCH 4/5] diffcore-rename: use a different prefetch for basename comparisons Elijah Newren via GitGitGadget
2021-06-05  1:28 ` [PATCH 5/5] merge-ort: add prefetching for content merges Elijah Newren via GitGitGadget
2021-06-15 22:41 ` [PATCH v2 0/5] Optimization batch 13: partial clone optimizations for merge-ort Elijah Newren via GitGitGadget
2021-06-15 22:41   ` [PATCH v2 1/5] promisor-remote: output trace2 statistics for number of objects fetched Elijah Newren via GitGitGadget
2021-06-15 22:41   ` [PATCH v2 2/5] t6421: add tests checking for excessive object downloads during merge Elijah Newren via GitGitGadget
2021-06-17  4:49     ` Junio C Hamano
2021-06-15 22:41   ` [PATCH v2 3/5] diffcore-rename: allow different missing_object_cb functions Elijah Newren via GitGitGadget
2021-06-15 22:41   ` [PATCH v2 4/5] diffcore-rename: use a different prefetch for basename comparisons Elijah Newren via GitGitGadget
2021-06-15 22:41   ` [PATCH v2 5/5] merge-ort: add prefetching for content merges Elijah Newren via GitGitGadget
2021-06-17  5:04     ` Junio C Hamano
2021-06-22  8:02       ` Elijah Newren
2021-06-16 17:54   ` [PATCH v2 0/5] Optimization batch 13: partial clone optimizations for merge-ort Derrick Stolee
2021-06-17  5:05   ` Junio C Hamano
2021-06-22  8:04   ` [PATCH v3 " Elijah Newren via GitGitGadget
2021-06-22  8:04     ` [PATCH v3 1/5] promisor-remote: output trace2 statistics for number of objects fetched Elijah Newren via GitGitGadget
2021-06-22  8:04     ` [PATCH v3 2/5] t6421: add tests checking for excessive object downloads during merge Elijah Newren via GitGitGadget
2021-06-22  8:04     ` [PATCH v3 3/5] diffcore-rename: allow different missing_object_cb functions Elijah Newren via GitGitGadget
2021-06-22  8:04     ` [PATCH v3 4/5] diffcore-rename: use a different prefetch for basename comparisons Elijah Newren via GitGitGadget
2021-06-22  8:04     ` [PATCH v3 5/5] merge-ort: add prefetching for content merges Elijah Newren via GitGitGadget
2021-06-22 16:10     ` [PATCH v3 0/5] Optimization batch 13: partial clone optimizations for merge-ort Derrick Stolee
2021-06-22 18:45       ` Elijah Newren
2021-06-23  2:14         ` Derrick Stolee
2021-06-23  8:11           ` Elijah Newren
2021-06-23 17:31             ` Elijah Newren

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).