[PATCH v5 0/6] submodule: parallelize diff

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* [PATCH v5 0/6] submodule: parallelize diff
       [not found] <https://lore.kernel.org/git/20221108184200.2813458-1-calvinwan@google.com/>
@ 2023-01-04 21:54 ` Calvin Wan
  2023-01-05 23:23   ` Calvin Wan
                     ` (7 more replies)
  2023-01-04 21:54 ` [PATCH v5 1/6] run-command: add duplicate_output_fn to run_processes_parallel_opts Calvin Wan
                   ` (5 subsequent siblings)
  6 siblings, 8 replies; 86+ messages in thread
From: Calvin Wan @ 2023-01-04 21:54 UTC (permalink / raw)
  To: git
  Cc: Calvin Wan, emilyshaffer, avarab, phillip.wood123, chooglen,
	newren, jonathantanmy

Original cover letter for context:
https://lore.kernel.org/git/20221011232604.839941-1-calvinwan@google.com/

Thank you again everyone for the numerous reviews! For this reroll, I
incorporated most of the feedback given, fixed a bug I found, and made
some stylistic refactors. I also added a new patch at the end that swaps
the serial implementation in is_submodule_modified for the new parallel
one. While I had patch 6 originally smushed with the previous one,
the diff came out not very reviewer friendly so it has been separated
out.

Changes since v4

(Patch 1)
The code in run-command.c that calls duplicate_output_fn has been
cleaned up and no longer passes a separate strbuf for the output. It
instead passes an offset that represents the starting point in the
original strbuf.

(Patch 5)
Moved status parsing from status_duplicate_output to status_finish. In
pp_buffer_stderr::run-command.c, output is gathered by strbuf_read_once
which reads 8192 bytes at once so a longer status message would error
out during status parsing since part of it would be cut off. Therefore,
status parsing must happen at the end of the process rather than in
duplicate_output_fn (and has subsequently been moved).

(Patch 6)
New patch swapping serial implementation in is_submodule_modified for
the new parallel one.

Calvin Wan (6):
  run-command: add duplicate_output_fn to run_processes_parallel_opts
  submodule: strbuf variable rename
  submodule: move status parsing into function
  diff-lib: refactor match_stat_with_submodule
  diff-lib: parallelize run_diff_files for submodules
  submodule: call parallel code from serial status

 Documentation/config/submodule.txt |  12 ++
 diff-lib.c                         | 104 ++++++++++--
 run-command.c                      |  16 +-
 run-command.h                      |  27 ++++
 submodule.c                        | 250 ++++++++++++++++++++++-------
 submodule.h                        |   9 ++
 t/helper/test-run-command.c        |  21 +++
 t/t0061-run-command.sh             |  39 +++++
 t/t4027-diff-submodule.sh          |  19 +++
 t/t7506-status-submodule.sh        |  19 +++
 10 files changed, 441 insertions(+), 75 deletions(-)

-- 
2.39.0.314.g84b9a713c41-goog

^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH v5 1/6] run-command: add duplicate_output_fn to run_processes_parallel_opts
       [not found] <https://lore.kernel.org/git/20221108184200.2813458-1-calvinwan@google.com/>
  2023-01-04 21:54 ` [PATCH v5 0/6] submodule: parallelize diff Calvin Wan
@ 2023-01-04 21:54 ` Calvin Wan
  2023-01-04 21:54 ` [PATCH v5 2/6] submodule: strbuf variable rename Calvin Wan
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 86+ messages in thread
From: Calvin Wan @ 2023-01-04 21:54 UTC (permalink / raw)
  To: git
  Cc: Calvin Wan, emilyshaffer, avarab, phillip.wood123, chooglen,
	newren, jonathantanmy

Add duplicate_output_fn as an optionally set function in
run_process_parallel_opts. If set, output from each child process is
copied and passed to the callback function whenever output from the
child process is buffered to allow for separate parsing.

Signed-off-by: Calvin Wan <calvinwan@google.com>
---
 run-command.c               | 16 ++++++++++++---
 run-command.h               | 27 +++++++++++++++++++++++++
 t/helper/test-run-command.c | 21 ++++++++++++++++++++
 t/t0061-run-command.sh      | 39 +++++++++++++++++++++++++++++++++++++
 4 files changed, 100 insertions(+), 3 deletions(-)

diff --git a/run-command.c b/run-command.c
index 756f1839aa..cad88befe0 100644
--- a/run-command.c
+++ b/run-command.c
@@ -1526,6 +1526,9 @@ static void pp_init(struct parallel_processes *pp,
 	if (!opts->get_next_task)
 		BUG("you need to specify a get_next_task function");
 
+	if (opts->duplicate_output && opts->ungroup)
+		BUG("duplicate_output and ungroup are incompatible with each other");
+
 	CALLOC_ARRAY(pp->children, n);
 	if (!opts->ungroup)
 		CALLOC_ARRAY(pp->pfd, n);
@@ -1645,14 +1648,21 @@ static void pp_buffer_stderr(struct parallel_processes *pp,
 	for (size_t i = 0; i < opts->processes; i++) {
 		if (pp->children[i].state == GIT_CP_WORKING &&
 		    pp->pfd[i].revents & (POLLIN | POLLHUP)) {
-			int n = strbuf_read_once(&pp->children[i].err,
-						 pp->children[i].process.err, 0);
+			ssize_t n = strbuf_read_once(&pp->children[i].err,
+						     pp->children[i].process.err, 0);
 			if (n == 0) {
 				close(pp->children[i].process.err);
 				pp->children[i].state = GIT_CP_WAIT_CLEANUP;
-			} else if (n < 0)
+			} else if (n < 0) {
 				if (errno != EAGAIN)
 					die_errno("read");
+			} else {
+				if (opts->duplicate_output)
+					opts->duplicate_output(&pp->children[i].err,
+					       strlen(pp->children[i].err.buf) - n,
+					       opts->data,
+					       pp->children[i].data);
+			}
 		}
 	}
 }
diff --git a/run-command.h b/run-command.h
index 072db56a4d..6dcf999f6c 100644
--- a/run-command.h
+++ b/run-command.h
@@ -408,6 +408,27 @@ typedef int (*start_failure_fn)(struct strbuf *out,
 				void *pp_cb,
 				void *pp_task_cb);
 
+/**
+ * This callback is called whenever output from a child process is buffered
+ * 
+ * See run_processes_parallel() below for a discussion of the "struct
+ * strbuf *out" parameter.
+ * 
+ * The offset refers to the number of bytes originally in "out" before
+ * the output from the child process was buffered. Therefore, the buffer
+ * range, "out + buf" to the end of "out", would contain the buffer of
+ * the child process output.
+ *
+ * pp_cb is the callback cookie as passed into run_processes_parallel,
+ * pp_task_cb is the callback cookie as passed into get_next_task_fn.
+ *
+ * This function is incompatible with "ungroup"
+ */
+typedef void (*duplicate_output_fn)(struct strbuf *out,
+				    size_t offset,
+				    void *pp_cb,
+				    void *pp_task_cb);
+
 /**
  * This callback is called on every child process that finished processing.
  *
@@ -461,6 +482,12 @@ struct run_process_parallel_opts
 	 */
 	start_failure_fn start_failure;
 
+	/**
+	 * duplicate_output: See duplicate_output_fn() above. This should be
+	 * NULL unless process specific output is needed
+	 */
+	duplicate_output_fn duplicate_output;
+
 	/**
 	 * task_finished: See task_finished_fn() above. This can be
 	 * NULL to omit any special handling.
diff --git a/t/helper/test-run-command.c b/t/helper/test-run-command.c
index 3ecb830f4a..ffd3cd0045 100644
--- a/t/helper/test-run-command.c
+++ b/t/helper/test-run-command.c
@@ -52,6 +52,21 @@ static int no_job(struct child_process *cp,
 	return 0;
 }
 
+static void duplicate_output(struct strbuf *out,
+			size_t offset,
+			void *pp_cb UNUSED,
+			void *pp_task_cb UNUSED)
+{
+	struct string_list list = STRING_LIST_INIT_DUP;
+
+	string_list_split(&list, out->buf + offset, '\n', -1);
+	for (size_t i = 0; i < list.nr; i++) {
+		if (strlen(list.items[i].string) > 0)
+			fprintf(stderr, "duplicate_output: %s\n", list.items[i].string);
+	}
+	string_list_clear(&list, 0);
+}
+
 static int task_finished(int result,
 			 struct strbuf *err,
 			 void *pp_cb,
@@ -439,6 +454,12 @@ int cmd__run_command(int argc, const char **argv)
 		opts.ungroup = 1;
 	}
 
+	if (!strcmp(argv[1], "--duplicate-output")) {
+		argv += 1;
+		argc -= 1;
+		opts.duplicate_output = duplicate_output;
+	}
+
 	jobs = atoi(argv[2]);
 	strvec_clear(&proc.args);
 	strvec_pushv(&proc.args, (const char **)argv + 3);
diff --git a/t/t0061-run-command.sh b/t/t0061-run-command.sh
index e2411f6a9b..879e536638 100755
--- a/t/t0061-run-command.sh
+++ b/t/t0061-run-command.sh
@@ -135,6 +135,15 @@ test_expect_success 'run_command runs in parallel with more jobs available than
 	test_cmp expect actual
 '
 
+test_expect_success 'run_command runs in parallel with more jobs available than tasks --duplicate-output' '
+	test-tool run-command --duplicate-output run-command-parallel 5 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
+	test_must_be_empty out &&
+	test 4 = $(grep -c "duplicate_output: Hello" err) &&
+	test 4 = $(grep -c "duplicate_output: World" err) &&
+	sed "/duplicate_output/d" err > err1 &&
+	test_cmp expect err1
+'
+
 test_expect_success 'run_command runs ungrouped in parallel with more jobs available than tasks' '
 	test-tool run-command --ungroup run-command-parallel 5 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
 	test_line_count = 8 out &&
@@ -147,6 +156,15 @@ test_expect_success 'run_command runs in parallel with as many jobs as tasks' '
 	test_cmp expect actual
 '
 
+test_expect_success 'run_command runs in parallel with as many jobs as tasks --duplicate-output' '
+	test-tool run-command --duplicate-output run-command-parallel 4 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
+	test_must_be_empty out &&
+	test 4 = $(grep -c "duplicate_output: Hello" err) &&
+	test 4 = $(grep -c "duplicate_output: World" err) &&
+	sed "/duplicate_output/d" err > err1 &&
+	test_cmp expect err1
+'
+
 test_expect_success 'run_command runs ungrouped in parallel with as many jobs as tasks' '
 	test-tool run-command --ungroup run-command-parallel 4 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
 	test_line_count = 8 out &&
@@ -159,6 +177,15 @@ test_expect_success 'run_command runs in parallel with more tasks than jobs avai
 	test_cmp expect actual
 '
 
+test_expect_success 'run_command runs in parallel with more tasks than jobs available --duplicate-output' '
+	test-tool run-command --duplicate-output run-command-parallel 3 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
+	test_must_be_empty out &&
+	test 4 = $(grep -c "duplicate_output: Hello" err) &&
+	test 4 = $(grep -c "duplicate_output: World" err) &&
+	sed "/duplicate_output/d" err > err1 &&
+	test_cmp expect err1
+'
+
 test_expect_success 'run_command runs ungrouped in parallel with more tasks than jobs available' '
 	test-tool run-command --ungroup run-command-parallel 3 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
 	test_line_count = 8 out &&
@@ -180,6 +207,12 @@ test_expect_success 'run_command is asked to abort gracefully' '
 	test_cmp expect actual
 '
 
+test_expect_success 'run_command is asked to abort gracefully --duplicate-output' '
+	test-tool run-command --duplicate-output run-command-abort 3 false >out 2>err &&
+	test_must_be_empty out &&
+	test_cmp expect err
+'
+
 test_expect_success 'run_command is asked to abort gracefully (ungroup)' '
 	test-tool run-command --ungroup run-command-abort 3 false >out 2>err &&
 	test_must_be_empty out &&
@@ -196,6 +229,12 @@ test_expect_success 'run_command outputs ' '
 	test_cmp expect actual
 '
 
+test_expect_success 'run_command outputs --duplicate-output' '
+	test-tool run-command --duplicate-output run-command-no-jobs 3 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
+	test_must_be_empty out &&
+	test_cmp expect err
+'
+
 test_expect_success 'run_command outputs (ungroup) ' '
 	test-tool run-command --ungroup run-command-no-jobs 3 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
 	test_must_be_empty out &&
-- 
2.39.0.314.g84b9a713c41-goog


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v5 2/6] submodule: strbuf variable rename
       [not found] <https://lore.kernel.org/git/20221108184200.2813458-1-calvinwan@google.com/>
  2023-01-04 21:54 ` [PATCH v5 0/6] submodule: parallelize diff Calvin Wan
  2023-01-04 21:54 ` [PATCH v5 1/6] run-command: add duplicate_output_fn to run_processes_parallel_opts Calvin Wan
@ 2023-01-04 21:54 ` Calvin Wan
  2023-01-04 21:54 ` [PATCH v5 3/6] submodule: move status parsing into function Calvin Wan
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 86+ messages in thread
From: Calvin Wan @ 2023-01-04 21:54 UTC (permalink / raw)
  To: git
  Cc: Calvin Wan, emilyshaffer, avarab, phillip.wood123, chooglen,
	newren, jonathantanmy

A prepatory change for a future patch that moves the status parsing
logic to a separate function.

Signed-off-by: Calvin Wan <calvinwan@google.com>
---
 submodule.c | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/submodule.c b/submodule.c
index fae24ef34a..faf37c1101 100644
--- a/submodule.c
+++ b/submodule.c
@@ -1906,25 +1906,28 @@ unsigned is_submodule_modified(const char *path, int ignore_untracked)
 
 	fp = xfdopen(cp.out, "r");
 	while (strbuf_getwholeline(&buf, fp, '\n') != EOF) {
+		char *str = buf.buf;
+		const size_t len = buf.len;
+
 		/* regular untracked files */
-		if (buf.buf[0] == '?')
+		if (str[0] == '?')
 			dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
 
-		if (buf.buf[0] == 'u' ||
-		    buf.buf[0] == '1' ||
-		    buf.buf[0] == '2') {
+		if (str[0] == 'u' ||
+		    str[0] == '1' ||
+		    str[0] == '2') {
 			/* T = line type, XY = status, SSSS = submodule state */
-			if (buf.len < strlen("T XY SSSS"))
+			if (len < strlen("T XY SSSS"))
 				BUG("invalid status --porcelain=2 line %s",
-				    buf.buf);
+				    str);
 
-			if (buf.buf[5] == 'S' && buf.buf[8] == 'U')
+			if (str[5] == 'S' && str[8] == 'U')
 				/* nested untracked file */
 				dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
 
-			if (buf.buf[0] == 'u' ||
-			    buf.buf[0] == '2' ||
-			    memcmp(buf.buf + 5, "S..U", 4))
+			if (str[0] == 'u' ||
+			    str[0] == '2' ||
+			    memcmp(str + 5, "S..U", 4))
 				/* other change */
 				dirty_submodule |= DIRTY_SUBMODULE_MODIFIED;
 		}
-- 
2.39.0.314.g84b9a713c41-goog


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v5 3/6] submodule: move status parsing into function
       [not found] <https://lore.kernel.org/git/20221108184200.2813458-1-calvinwan@google.com/>
                   ` (2 preceding siblings ...)
  2023-01-04 21:54 ` [PATCH v5 2/6] submodule: strbuf variable rename Calvin Wan
@ 2023-01-04 21:54 ` Calvin Wan
  2023-01-04 21:54 ` [PATCH v5 4/6] diff-lib: refactor match_stat_with_submodule Calvin Wan
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 86+ messages in thread
From: Calvin Wan @ 2023-01-04 21:54 UTC (permalink / raw)
  To: git
  Cc: Calvin Wan, emilyshaffer, avarab, phillip.wood123, chooglen,
	newren, jonathantanmy

A future patch requires the ability to parse the output of git
status --porcelain=2. Move parsing code from is_submodule_modified to
parse_status_porcelain.

Signed-off-by: Calvin Wan <calvinwan@google.com>
---
 submodule.c | 74 ++++++++++++++++++++++++++++++-----------------------
 1 file changed, 42 insertions(+), 32 deletions(-)

diff --git a/submodule.c b/submodule.c
index faf37c1101..768d4b4cd7 100644
--- a/submodule.c
+++ b/submodule.c
@@ -1870,6 +1870,45 @@ int fetch_submodules(struct repository *r,
 	return spf.result;
 }
 
+static int parse_status_porcelain(char *str, size_t len,
+				  unsigned *dirty_submodule,
+				  int ignore_untracked)
+{
+	/* regular untracked files */
+	if (str[0] == '?')
+		*dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
+
+	if (str[0] == 'u' ||
+	    str[0] == '1' ||
+	    str[0] == '2') {
+		/* T = line type, XY = status, SSSS = submodule state */
+		if (len < strlen("T XY SSSS"))
+			BUG("invalid status --porcelain=2 line %s",
+			    str);
+
+		if (str[5] == 'S' && str[8] == 'U')
+			/* nested untracked file */
+			*dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
+
+		if (str[0] == 'u' ||
+		    str[0] == '2' ||
+		    memcmp(str + 5, "S..U", 4))
+			/* other change */
+			*dirty_submodule |= DIRTY_SUBMODULE_MODIFIED;
+	}
+
+	if ((*dirty_submodule & DIRTY_SUBMODULE_MODIFIED) &&
+	    ((*dirty_submodule & DIRTY_SUBMODULE_UNTRACKED) ||
+	     ignore_untracked)) {
+		/*
+		* We're not interested in any further information from
+		* the child any more, neither output nor its exit code.
+		*/
+		return 1;
+	}
+	return 0;
+}
+
 unsigned is_submodule_modified(const char *path, int ignore_untracked)
 {
 	struct child_process cp = CHILD_PROCESS_INIT;
@@ -1909,39 +1948,10 @@ unsigned is_submodule_modified(const char *path, int ignore_untracked)
 		char *str = buf.buf;
 		const size_t len = buf.len;
 
-		/* regular untracked files */
-		if (str[0] == '?')
-			dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
-
-		if (str[0] == 'u' ||
-		    str[0] == '1' ||
-		    str[0] == '2') {
-			/* T = line type, XY = status, SSSS = submodule state */
-			if (len < strlen("T XY SSSS"))
-				BUG("invalid status --porcelain=2 line %s",
-				    str);
-
-			if (str[5] == 'S' && str[8] == 'U')
-				/* nested untracked file */
-				dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
-
-			if (str[0] == 'u' ||
-			    str[0] == '2' ||
-			    memcmp(str + 5, "S..U", 4))
-				/* other change */
-				dirty_submodule |= DIRTY_SUBMODULE_MODIFIED;
-		}
-
-		if ((dirty_submodule & DIRTY_SUBMODULE_MODIFIED) &&
-		    ((dirty_submodule & DIRTY_SUBMODULE_UNTRACKED) ||
-		     ignore_untracked)) {
-			/*
-			 * We're not interested in any further information from
-			 * the child any more, neither output nor its exit code.
-			 */
-			ignore_cp_exit_code = 1;
+		ignore_cp_exit_code = parse_status_porcelain(str, len, &dirty_submodule,
+							     ignore_untracked);
+		if (ignore_cp_exit_code)
 			break;
-		}
 	}
 	fclose(fp);
 
-- 
2.39.0.314.g84b9a713c41-goog


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v5 4/6] diff-lib: refactor match_stat_with_submodule
       [not found] <https://lore.kernel.org/git/20221108184200.2813458-1-calvinwan@google.com/>
                   ` (3 preceding siblings ...)
  2023-01-04 21:54 ` [PATCH v5 3/6] submodule: move status parsing into function Calvin Wan
@ 2023-01-04 21:54 ` Calvin Wan
  2023-01-04 21:54 ` [PATCH v5 5/6] diff-lib: parallelize run_diff_files for submodules Calvin Wan
  2023-01-04 21:54 ` [PATCH v5 6/6] submodule: call parallel code from serial status Calvin Wan
  6 siblings, 0 replies; 86+ messages in thread
From: Calvin Wan @ 2023-01-04 21:54 UTC (permalink / raw)
  To: git
  Cc: Calvin Wan, emilyshaffer, avarab, phillip.wood123, chooglen,
	newren, jonathantanmy

Flatten out the if statements in match_stat_with_submodule so the
logic is more readable and easier for future patches to add to.
orig_flags didn't need to be set if the cache entry wasn't a
GITLINK so defer setting it.

Signed-off-by: Calvin Wan <calvinwan@google.com>
---
 diff-lib.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/diff-lib.c b/diff-lib.c
index dec040c366..64583fded0 100644
--- a/diff-lib.c
+++ b/diff-lib.c
@@ -73,18 +73,24 @@ static int match_stat_with_submodule(struct diff_options *diffopt,
 				     unsigned *dirty_submodule)
 {
 	int changed = ie_match_stat(diffopt->repo->index, ce, st, ce_option);
-	if (S_ISGITLINK(ce->ce_mode)) {
-		struct diff_flags orig_flags = diffopt->flags;
-		if (!diffopt->flags.override_submodule_config)
-			set_diffopt_flags_from_submodule_config(diffopt, ce->name);
-		if (diffopt->flags.ignore_submodules)
-			changed = 0;
-		else if (!diffopt->flags.ignore_dirty_submodules &&
-			 (!changed || diffopt->flags.dirty_submodules))
-			*dirty_submodule = is_submodule_modified(ce->name,
-								 diffopt->flags.ignore_untracked_in_submodules);
-		diffopt->flags = orig_flags;
+	struct diff_flags orig_flags;
+
+	if (!S_ISGITLINK(ce->ce_mode))
+		return changed;
+
+	orig_flags = diffopt->flags;
+	if (!diffopt->flags.override_submodule_config)
+		set_diffopt_flags_from_submodule_config(diffopt, ce->name);
+	if (diffopt->flags.ignore_submodules) {
+		changed = 0;
+		goto cleanup;
 	}
+	if (!diffopt->flags.ignore_dirty_submodules &&
+	    (!changed || diffopt->flags.dirty_submodules))
+		*dirty_submodule = is_submodule_modified(ce->name,
+					 diffopt->flags.ignore_untracked_in_submodules);
+cleanup:
+	diffopt->flags = orig_flags;
 	return changed;
 }
 
-- 
2.39.0.314.g84b9a713c41-goog


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v5 5/6] diff-lib: parallelize run_diff_files for submodules
       [not found] <https://lore.kernel.org/git/20221108184200.2813458-1-calvinwan@google.com/>
                   ` (4 preceding siblings ...)
  2023-01-04 21:54 ` [PATCH v5 4/6] diff-lib: refactor match_stat_with_submodule Calvin Wan
@ 2023-01-04 21:54 ` Calvin Wan
  2023-01-04 21:54 ` [PATCH v5 6/6] submodule: call parallel code from serial status Calvin Wan
  6 siblings, 0 replies; 86+ messages in thread
From: Calvin Wan @ 2023-01-04 21:54 UTC (permalink / raw)
  To: git
  Cc: Calvin Wan, emilyshaffer, avarab, phillip.wood123, chooglen,
	newren, jonathantanmy

During the iteration of the index entries in run_diff_files, whenever
a submodule is found and needs its status checked, a subprocess is
spawned for it. Instead of spawning the subprocess immediately and
waiting for its completion to continue, hold onto all submodules and
relevant information in a list. Then use that list to create tasks for
run_processes_parallel. Subprocess output is duplicated and passed to
status_pipe_output which stores it to be parsed on completion of the
subprocess.

Add config option submodule.diffJobs to set the maximum number
of parallel jobs. The option defaults to 1 if unset. If set to 0, the
number of jobs is set to online_cpus().

Since run_diff_files is called from many different commands, I chose
to grab the config option in the function rather than adding variables
to every git command and then figuring out how to pass them all in.

Signed-off-by: Calvin Wan <calvinwan@google.com>
---
 Documentation/config/submodule.txt |  12 +++
 diff-lib.c                         |  84 +++++++++++++--
 submodule.c                        | 168 +++++++++++++++++++++++++++++
 submodule.h                        |   9 ++
 t/t4027-diff-submodule.sh          |  19 ++++
 t/t7506-status-submodule.sh        |  19 ++++
 6 files changed, 304 insertions(+), 7 deletions(-)

diff --git a/Documentation/config/submodule.txt b/Documentation/config/submodule.txt
index 6490527b45..3209eb8117 100644
--- a/Documentation/config/submodule.txt
+++ b/Documentation/config/submodule.txt
@@ -93,6 +93,18 @@ submodule.fetchJobs::
 	in parallel. A value of 0 will give some reasonable default.
 	If unset, it defaults to 1.
 
+submodule.diffJobs::
+	Specifies how many submodules are diffed at the same time. A
+	positive integer allows up to that number of submodules diffed
+	in parallel. A value of 0 will give some reasonable default.
+	If unset, it defaults to 1. The diff operation is used by many
+	other git commands such as add, merge, diff, status, stash and
+	more. Note that the expensive part of the diff operation is
+	reading the index from cache or memory. Therefore multiple jobs
+	may be detrimental to performance if your hardware does not
+	support parallel reads or if the number of jobs greatly exceeds
+	the amount of supported reads.
+
 submodule.alternateLocation::
 	Specifies how the submodules obtain alternates when submodules are
 	cloned. Possible values are `no`, `superproject`.
diff --git a/diff-lib.c b/diff-lib.c
index 64583fded0..f51ea07f36 100644
--- a/diff-lib.c
+++ b/diff-lib.c
@@ -14,6 +14,7 @@
 #include "dir.h"
 #include "fsmonitor.h"
 #include "commit-reach.h"
+#include "config.h"
 
 /*
  * diff-files
@@ -65,18 +66,23 @@ static int check_removed(const struct index_state *istate, const struct cache_en
  * Return 1 when changes are detected, 0 otherwise. If the DIRTY_SUBMODULES
  * option is set, the caller does not only want to know if a submodule is
  * modified at all but wants to know all the conditions that are met (new
- * commits, untracked content and/or modified content).
+ * commits, untracked content and/or modified content). If
+ * defer_submodule_status bit is set, dirty_submodule will be left to the
+ * caller to set. defer_submodule_status can also be set to 0 in this
+ * function if there is no need to check if the submodule is modified.
  */
 static int match_stat_with_submodule(struct diff_options *diffopt,
 				     const struct cache_entry *ce,
 				     struct stat *st, unsigned ce_option,
-				     unsigned *dirty_submodule)
+				     unsigned *dirty_submodule, int *defer_submodule_status,
+				     unsigned *ignore_untracked)
 {
 	int changed = ie_match_stat(diffopt->repo->index, ce, st, ce_option);
 	struct diff_flags orig_flags;
+	int defer = 0;
 
 	if (!S_ISGITLINK(ce->ce_mode))
-		return changed;
+		goto ret;
 
 	orig_flags = diffopt->flags;
 	if (!diffopt->flags.override_submodule_config)
@@ -86,11 +92,20 @@ static int match_stat_with_submodule(struct diff_options *diffopt,
 		goto cleanup;
 	}
 	if (!diffopt->flags.ignore_dirty_submodules &&
-	    (!changed || diffopt->flags.dirty_submodules))
-		*dirty_submodule = is_submodule_modified(ce->name,
+	    (!changed || diffopt->flags.dirty_submodules)) {
+		if (defer_submodule_status && *defer_submodule_status) {
+			defer = 1;
+			*ignore_untracked = diffopt->flags.ignore_untracked_in_submodules;
+		} else {
+			*dirty_submodule = is_submodule_modified(ce->name,
 					 diffopt->flags.ignore_untracked_in_submodules);
+		}
+	}
 cleanup:
 	diffopt->flags = orig_flags;
+ret:
+	if (defer_submodule_status)
+		*defer_submodule_status = defer;
 	return changed;
 }
 
@@ -102,6 +117,7 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
 			      ? CE_MATCH_RACY_IS_DIRTY : 0);
 	uint64_t start = getnanotime();
 	struct index_state *istate = revs->diffopt.repo->index;
+	struct string_list submodules = STRING_LIST_INIT_NODUP;
 
 	diff_set_mnemonic_prefix(&revs->diffopt, "i/", "w/");
 
@@ -226,6 +242,8 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
 			newmode = ce->ce_mode;
 		} else {
 			struct stat st;
+			unsigned ignore_untracked = 0;
+			int defer_submodule_status = !!revs->repo;
 
 			changed = check_removed(istate, ce, &st);
 			if (changed) {
@@ -247,8 +265,26 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
 			}
 
 			changed = match_stat_with_submodule(&revs->diffopt, ce, &st,
-							    ce_option, &dirty_submodule);
+							    ce_option, &dirty_submodule,
+							    &defer_submodule_status,
+							    &ignore_untracked);
 			newmode = ce_mode_from_stat(ce, st.st_mode);
+			if (defer_submodule_status) {
+				struct submodule_status_util tmp = {
+					.changed = changed,
+					.dirty_submodule = 0,
+					.ignore_untracked = ignore_untracked,
+					.newmode = newmode,
+					.ce = ce,
+					.path = ce->name,
+				};
+				struct string_list_item *item;
+
+				item = string_list_append(&submodules, ce->name);
+				item->util = xmalloc(sizeof(tmp));
+				memcpy(item->util, &tmp, sizeof(tmp));
+				continue;
+			}
 		}
 
 		if (!changed && !dirty_submodule) {
@@ -267,6 +303,40 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
 			    ce->name, 0, dirty_submodule);
 
 	}
+	if (submodules.nr > 0) {
+		int parallel_jobs;
+		if (git_config_get_int("submodule.diffjobs", &parallel_jobs))
+			parallel_jobs = 1;
+		else if (!parallel_jobs)
+			parallel_jobs = online_cpus();
+		else if (parallel_jobs < 0)
+			die(_("submodule.diffjobs cannot be negative"));
+
+		if (get_submodules_status(&submodules, parallel_jobs))
+			die(_("submodule status failed"));
+		for (size_t i = 0; i < submodules.nr; i++) {
+			struct submodule_status_util *util = submodules.items[i].util;
+			struct cache_entry *ce = util->ce;
+			unsigned int oldmode;
+			const struct object_id *old_oid, *new_oid;
+
+			if (!util->changed && !util->dirty_submodule) {
+				ce_mark_uptodate(ce);
+				mark_fsmonitor_valid(istate, ce);
+				if (!revs->diffopt.flags.find_copies_harder)
+					continue;
+			}
+			oldmode = ce->ce_mode;
+			old_oid = &ce->oid;
+			new_oid = util->changed ? null_oid() : &ce->oid;
+			diff_change(&revs->diffopt, oldmode, util->newmode,
+				    old_oid, new_oid,
+				    !is_null_oid(old_oid),
+				    !is_null_oid(new_oid),
+				    ce->name, 0, util->dirty_submodule);
+		}
+	}
+	string_list_clear(&submodules, 1);
 	diffcore_std(&revs->diffopt);
 	diff_flush(&revs->diffopt);
 	trace_performance_since(start, "diff-files");
@@ -314,7 +384,7 @@ static int get_stat_data(const struct index_state *istate,
 			return -1;
 		}
 		changed = match_stat_with_submodule(diffopt, ce, &st,
-						    0, dirty_submodule);
+						    0, dirty_submodule, NULL, NULL);
 		if (changed) {
 			mode = ce_mode_from_stat(ce, st.st_mode);
 			oid = null_oid();
diff --git a/submodule.c b/submodule.c
index 768d4b4cd7..a0ca646d9b 100644
--- a/submodule.c
+++ b/submodule.c
@@ -1369,6 +1369,17 @@ int submodule_touches_in_range(struct repository *r,
 	return ret;
 }
 
+struct submodule_parallel_status {
+	size_t index_count;
+	int result;
+
+	struct string_list *submodule_names;
+
+	/* Pending statuses by OIDs */
+	struct status_task **oid_status_tasks;
+	int oid_status_tasks_nr, oid_status_tasks_alloc;
+};
+
 struct submodule_parallel_fetch {
 	/*
 	 * The index of the last index entry processed by
@@ -1451,6 +1462,12 @@ struct fetch_task {
 	struct oid_array *commits; /* Ensure these commits are fetched */
 };
 
+struct status_task {
+	const char *path;
+	struct strbuf out;
+	int ignore_untracked;
+};
+
 /**
  * When a submodule is not defined in .gitmodules, we cannot access it
  * via the regular submodule-config. Create a fake submodule, which we can
@@ -1909,6 +1926,25 @@ static int parse_status_porcelain(char *str, size_t len,
 	return 0;
 }
 
+static void parse_status_porcelain_strbuf(struct strbuf *buf,
+				   unsigned *dirty_submodule,
+				   int ignore_untracked)
+{
+	struct string_list list = STRING_LIST_INIT_DUP;
+	struct string_list_item *item;
+
+	string_list_split(&list, buf->buf, '\n', -1);
+
+	for_each_string_list_item(item, &list) {
+		if (parse_status_porcelain(item->string,
+					   strlen(item->string),
+					   dirty_submodule,
+					   ignore_untracked))
+			break;
+	}
+	string_list_clear(&list, 0);
+}
+
 unsigned is_submodule_modified(const char *path, int ignore_untracked)
 {
 	struct child_process cp = CHILD_PROCESS_INIT;
@@ -1962,6 +1998,138 @@ unsigned is_submodule_modified(const char *path, int ignore_untracked)
 	return dirty_submodule;
 }
 
+static struct status_task *
+get_status_task_from_index(struct submodule_parallel_status *sps,
+			   struct strbuf *err)
+{
+	for (; sps->index_count < sps->submodule_names->nr; sps->index_count++) {
+		struct submodule_status_util *util = sps->submodule_names->items[sps->index_count].util;
+		struct status_task *task;
+		struct strbuf buf = STRBUF_INIT;
+		const char *git_dir;
+
+		strbuf_addf(&buf, "%s/.git", util->path);
+		git_dir = read_gitfile(buf.buf);
+		if (!git_dir)
+			git_dir = buf.buf;
+		if (!is_git_directory(git_dir)) {
+			if (is_directory(git_dir))
+				die(_("'%s' not recognized as a git repository"), git_dir);
+			strbuf_release(&buf);
+			/* The submodule is not checked out, so it is not modified */
+			util->dirty_submodule = 0;
+			continue;
+		}
+		strbuf_release(&buf);
+
+		task = xmalloc(sizeof(*task));
+		task->path = util->path;
+		task->ignore_untracked = util->ignore_untracked;
+		strbuf_init(&task->out, 0);
+		sps->index_count++;
+		return task;
+	}
+	return NULL;
+}
+
+static int get_next_submodule_status(struct child_process *cp,
+				     struct strbuf *err, void *data,
+				     void **task_cb)
+{
+	struct submodule_parallel_status *sps = data;
+	struct status_task *task = get_status_task_from_index(sps, err);
+
+	if (!task)
+		return 0;
+
+	child_process_init(cp);
+	prepare_submodule_repo_env_in_gitdir(&cp->env);
+
+	strvec_init(&cp->args);
+	strvec_pushl(&cp->args, "status", "--porcelain=2", NULL);
+	if (task->ignore_untracked)
+		strvec_push(&cp->args, "-uno");
+
+	prepare_submodule_repo_env(&cp->env);
+	cp->git_cmd = 1;
+	cp->dir = task->path;
+	*task_cb = task;
+	return 1;
+}
+
+static int status_start_failure(struct strbuf *err,
+				void *cb, void *task_cb)
+{
+	struct submodule_parallel_status *sps = cb;
+	struct status_task *task = task_cb;
+
+	sps->result = 1;
+	strbuf_addf(err,
+	    _("could not run 'git status --porcelain=2' in submodule %s"),
+	    task->path);
+	return 0;
+}
+
+static void status_duplicate_output(struct strbuf *out,
+				    size_t offset,
+				    void *cb, void *task_cb)
+{
+	struct status_task *task = task_cb;
+
+	strbuf_add(&task->out, out->buf + offset, out->len - offset);
+	strbuf_setlen(out, offset);
+}
+
+static int status_finish(int retvalue, struct strbuf *err,
+			 void *cb, void *task_cb)
+{
+	struct submodule_parallel_status *sps = cb;
+	struct status_task *task = task_cb;
+	struct string_list_item *it =
+		string_list_lookup(sps->submodule_names, task->path);
+	struct submodule_status_util *util = it->util;
+
+	if (retvalue) {
+		sps->result = 1;
+		strbuf_addf(err,
+		    _("'git status --porcelain=2' failed in submodule %s"),
+		    task->path);
+	}
+
+	parse_status_porcelain_strbuf(&task->out,
+			      &util->dirty_submodule,
+			      util->ignore_untracked);
+
+	free(task);
+
+	return 0;
+}
+
+int get_submodules_status(struct string_list *submodules,
+			  int max_parallel_jobs)
+{
+	struct submodule_parallel_status sps = {
+		.submodule_names = submodules,
+	};
+	const struct run_process_parallel_opts opts = {
+		.tr2_category = "submodule",
+		.tr2_label = "parallel/status",
+
+		.processes = max_parallel_jobs,
+
+		.get_next_task = get_next_submodule_status,
+		.start_failure = status_start_failure,
+		.duplicate_output = status_duplicate_output,
+		.task_finished = status_finish,
+		.data = &sps,
+	};
+
+	string_list_sort(sps.submodule_names);
+	run_processes_parallel(&opts);
+
+	return sps.result;
+}
+
 int submodule_uses_gitfile(const char *path)
 {
 	struct child_process cp = CHILD_PROCESS_INIT;
diff --git a/submodule.h b/submodule.h
index b52a4ff1e7..08d278a414 100644
--- a/submodule.h
+++ b/submodule.h
@@ -41,6 +41,13 @@ struct submodule_update_strategy {
 	.type = SM_UPDATE_UNSPECIFIED, \
 }
 
+struct submodule_status_util {
+	int changed, ignore_untracked;
+	unsigned dirty_submodule, newmode;
+	struct cache_entry *ce;
+	const char *path;
+};
+
 int is_gitmodules_unmerged(struct index_state *istate);
 int is_writing_gitmodules_ok(void);
 int is_staging_gitmodules_ok(struct index_state *istate);
@@ -94,6 +101,8 @@ int fetch_submodules(struct repository *r,
 		     int command_line_option,
 		     int default_option,
 		     int quiet, int max_parallel_jobs);
+int get_submodules_status(struct string_list *submodules,
+			  int max_parallel_jobs);
 unsigned is_submodule_modified(const char *path, int ignore_untracked);
 int submodule_uses_gitfile(const char *path);
 
diff --git a/t/t4027-diff-submodule.sh b/t/t4027-diff-submodule.sh
index 40164ae07d..e08ee315a7 100755
--- a/t/t4027-diff-submodule.sh
+++ b/t/t4027-diff-submodule.sh
@@ -34,6 +34,25 @@ test_expect_success setup '
 	subtip=$3 subprev=$2
 '
 
+test_expect_success 'diff in superproject with submodules respects parallel settings' '
+	test_when_finished "rm -f trace.out" &&
+	(
+		GIT_TRACE=$(pwd)/trace.out git diff &&
+		grep "1 tasks" trace.out &&
+		>trace.out &&
+
+		git config submodule.diffJobs 8 &&
+		GIT_TRACE=$(pwd)/trace.out git diff &&
+		grep "8 tasks" trace.out &&
+		>trace.out &&
+
+		GIT_TRACE=$(pwd)/trace.out git -c submodule.diffJobs=0 diff &&
+		grep "preparing to run up to [0-9]* tasks" trace.out &&
+		! grep "up to 0 tasks" trace.out &&
+		>trace.out
+	)
+'
+
 test_expect_success 'git diff --raw HEAD' '
 	hexsz=$(test_oid hexsz) &&
 	git diff --raw --abbrev=$hexsz HEAD >actual &&
diff --git a/t/t7506-status-submodule.sh b/t/t7506-status-submodule.sh
index d050091345..52a82b703f 100755
--- a/t/t7506-status-submodule.sh
+++ b/t/t7506-status-submodule.sh
@@ -412,4 +412,23 @@ test_expect_success 'status with added file in nested submodule (short)' '
 	EOF
 '
 
+test_expect_success 'status in superproject with submodules respects parallel settings' '
+	test_when_finished "rm -f trace.out" &&
+	(
+		GIT_TRACE=$(pwd)/trace.out git status &&
+		grep "1 tasks" trace.out &&
+		>trace.out &&
+
+		git config submodule.diffJobs 8 &&
+		GIT_TRACE=$(pwd)/trace.out git status &&
+		grep "8 tasks" trace.out &&
+		>trace.out &&
+
+		GIT_TRACE=$(pwd)/trace.out git -c submodule.diffJobs=0 status &&
+		grep "preparing to run up to [0-9]* tasks" trace.out &&
+		! grep "up to 0 tasks" trace.out &&
+		>trace.out
+	)
+'
+
 test_done
-- 
2.39.0.314.g84b9a713c41-goog


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v5 6/6] submodule: call parallel code from serial status
       [not found] <https://lore.kernel.org/git/20221108184200.2813458-1-calvinwan@google.com/>
                   ` (5 preceding siblings ...)
  2023-01-04 21:54 ` [PATCH v5 5/6] diff-lib: parallelize run_diff_files for submodules Calvin Wan
@ 2023-01-04 21:54 ` Calvin Wan
  6 siblings, 0 replies; 86+ messages in thread
From: Calvin Wan @ 2023-01-04 21:54 UTC (permalink / raw)
  To: git
  Cc: Calvin Wan, emilyshaffer, avarab, phillip.wood123, chooglen,
	newren, jonathantanmy

Remove the serial implementation of status inside of
is_submodule_modified since the parallel implementation of status with
one job accomplishes the same task.

Combine parse_status_porcelain and parse_status_porcelain_strbuf since
the only other caller of parse_status_porcelain was in
is_submodule_modified

Signed-off-by: Calvin Wan <calvinwan@google.com>
---
 submodule.c | 143 ++++++++++++++++++----------------------------------
 1 file changed, 48 insertions(+), 95 deletions(-)

diff --git a/submodule.c b/submodule.c
index a0ca646d9b..042e26137f 100644
--- a/submodule.c
+++ b/submodule.c
@@ -1887,46 +1887,7 @@ int fetch_submodules(struct repository *r,
 	return spf.result;
 }
 
-static int parse_status_porcelain(char *str, size_t len,
-				  unsigned *dirty_submodule,
-				  int ignore_untracked)
-{
-	/* regular untracked files */
-	if (str[0] == '?')
-		*dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
-
-	if (str[0] == 'u' ||
-	    str[0] == '1' ||
-	    str[0] == '2') {
-		/* T = line type, XY = status, SSSS = submodule state */
-		if (len < strlen("T XY SSSS"))
-			BUG("invalid status --porcelain=2 line %s",
-			    str);
-
-		if (str[5] == 'S' && str[8] == 'U')
-			/* nested untracked file */
-			*dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
-
-		if (str[0] == 'u' ||
-		    str[0] == '2' ||
-		    memcmp(str + 5, "S..U", 4))
-			/* other change */
-			*dirty_submodule |= DIRTY_SUBMODULE_MODIFIED;
-	}
-
-	if ((*dirty_submodule & DIRTY_SUBMODULE_MODIFIED) &&
-	    ((*dirty_submodule & DIRTY_SUBMODULE_UNTRACKED) ||
-	     ignore_untracked)) {
-		/*
-		* We're not interested in any further information from
-		* the child any more, neither output nor its exit code.
-		*/
-		return 1;
-	}
-	return 0;
-}
-
-static void parse_status_porcelain_strbuf(struct strbuf *buf,
+static void parse_status_porcelain(struct strbuf *buf,
 				   unsigned *dirty_submodule,
 				   int ignore_untracked)
 {
@@ -1936,66 +1897,58 @@ static void parse_status_porcelain_strbuf(struct strbuf *buf,
 	string_list_split(&list, buf->buf, '\n', -1);
 
 	for_each_string_list_item(item, &list) {
-		if (parse_status_porcelain(item->string,
-					   strlen(item->string),
-					   dirty_submodule,
-					   ignore_untracked))
+		char *str = item->string;
+		/* regular untracked files */
+		if (str[0] == '?')
+			*dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
+
+		if (str[0] == 'u' ||
+		str[0] == '1' ||
+		str[0] == '2') {
+			/* T = line type, XY = status, SSSS = submodule state */
+			if (strlen(str) < strlen("T XY SSSS"))
+				BUG("invalid status --porcelain=2 line %s",
+				str);
+
+			if (str[5] == 'S' && str[8] == 'U')
+				/* nested untracked file */
+				*dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
+
+			if (str[0] == 'u' ||
+			str[0] == '2' ||
+			memcmp(str + 5, "S..U", 4))
+				/* other change */
+				*dirty_submodule |= DIRTY_SUBMODULE_MODIFIED;
+		}
+
+		if ((*dirty_submodule & DIRTY_SUBMODULE_MODIFIED) &&
+		    ((*dirty_submodule & DIRTY_SUBMODULE_UNTRACKED) ||
+		    ignore_untracked)) {
+			/*
+			* We're not interested in any further information from
+			* the child any more, neither output nor its exit code.
+			*/
 			break;
+		}
 	}
 	string_list_clear(&list, 0);
 }
 
 unsigned is_submodule_modified(const char *path, int ignore_untracked)
 {
-	struct child_process cp = CHILD_PROCESS_INIT;
-	struct strbuf buf = STRBUF_INIT;
-	FILE *fp;
-	unsigned dirty_submodule = 0;
-	const char *git_dir;
-	int ignore_cp_exit_code = 0;
-
-	strbuf_addf(&buf, "%s/.git", path);
-	git_dir = read_gitfile(buf.buf);
-	if (!git_dir)
-		git_dir = buf.buf;
-	if (!is_git_directory(git_dir)) {
-		if (is_directory(git_dir))
-			die(_("'%s' not recognized as a git repository"), git_dir);
-		strbuf_release(&buf);
-		/* The submodule is not checked out, so it is not modified */
-		return 0;
-	}
-	strbuf_reset(&buf);
-
-	strvec_pushl(&cp.args, "status", "--porcelain=2", NULL);
-	if (ignore_untracked)
-		strvec_push(&cp.args, "-uno");
-
-	prepare_submodule_repo_env(&cp.env);
-	cp.git_cmd = 1;
-	cp.no_stdin = 1;
-	cp.out = -1;
-	cp.dir = path;
-	if (start_command(&cp))
-		die(_("Could not run 'git status --porcelain=2' in submodule %s"), path);
-
-	fp = xfdopen(cp.out, "r");
-	while (strbuf_getwholeline(&buf, fp, '\n') != EOF) {
-		char *str = buf.buf;
-		const size_t len = buf.len;
-
-		ignore_cp_exit_code = parse_status_porcelain(str, len, &dirty_submodule,
-							     ignore_untracked);
-		if (ignore_cp_exit_code)
-			break;
-	}
-	fclose(fp);
-
-	if (finish_command(&cp) && !ignore_cp_exit_code)
-		die(_("'git status --porcelain=2' failed in submodule %s"), path);
+	struct submodule_status_util util = {
+		.dirty_submodule = 0,
+		.ignore_untracked = ignore_untracked,
+		.path = path,
+	};
+	struct string_list sub = STRING_LIST_INIT_NODUP;
+	struct string_list_item *item;
 
-	strbuf_release(&buf);
-	return dirty_submodule;
+	item = string_list_append(&sub, path);
+	item->util = &util;
+	if (get_submodules_status(&sub, 1))
+		die(_("submodule status failed"));
+	return util.dirty_submodule;
 }
 
 static struct status_task *
@@ -2096,9 +2049,9 @@ static int status_finish(int retvalue, struct strbuf *err,
 		    task->path);
 	}
 
-	parse_status_porcelain_strbuf(&task->out,
-			      &util->dirty_submodule,
-			      util->ignore_untracked);
+	parse_status_porcelain(&task->out,
+			       &util->dirty_submodule,
+			       util->ignore_untracked);
 
 	free(task);
 
-- 
2.39.0.314.g84b9a713c41-goog


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [PATCH v5 0/6] submodule: parallelize diff
  2023-01-04 21:54 ` [PATCH v5 0/6] submodule: parallelize diff Calvin Wan
@ 2023-01-05 23:23   ` Calvin Wan
  2023-01-17 19:30   ` [PATCH v6 " Calvin Wan
                     ` (6 subsequent siblings)
  7 siblings, 0 replies; 86+ messages in thread
From: Calvin Wan @ 2023-01-05 23:23 UTC (permalink / raw)
  To: git; +Cc: emilyshaffer, avarab, phillip.wood123, chooglen, newren,
	jonathantanmy

Apologies for the broken link to the previous versions. Looks like I had
some encoding issues with copy/paste. Here are the previous versions

v4: https://lore.kernel.org/git/20221108184200.2813458-1-calvinwan@google.com/
v3: https://lore.kernel.org/git/20221020232532.1128326-1-calvinwan@google.com/
v2: https://lore.kernel.org/git/20221011232604.839941-1-calvinwan@google.com/
v1: https://lore.kernel.org/git/20220922232947.631309-1-calvinwan@google.com/

On Wed, Jan 4, 2023 at 1:54 PM Calvin Wan <calvinwan@google.com> wrote:
>
> Original cover letter for context:
> https://lore.kernel.org/git/20221011232604.839941-1-calvinwan@google.com/
>
> Thank you again everyone for the numerous reviews! For this reroll, I
> incorporated most of the feedback given, fixed a bug I found, and made
> some stylistic refactors. I also added a new patch at the end that swaps
> the serial implementation in is_submodule_modified for the new parallel
> one. While I had patch 6 originally smushed with the previous one,
> the diff came out not very reviewer friendly so it has been separated
> out.
>
> Changes since v4
>
> (Patch 1)
> The code in run-command.c that calls duplicate_output_fn has been
> cleaned up and no longer passes a separate strbuf for the output. It
> instead passes an offset that represents the starting point in the
> original strbuf.
>
> (Patch 5)
> Moved status parsing from status_duplicate_output to status_finish. In
> pp_buffer_stderr::run-command.c, output is gathered by strbuf_read_once
> which reads 8192 bytes at once so a longer status message would error
> out during status parsing since part of it would be cut off. Therefore,
> status parsing must happen at the end of the process rather than in
> duplicate_output_fn (and has subsequently been moved).
>
> (Patch 6)
> New patch swapping serial implementation in is_submodule_modified for
> the new parallel one.
>
> Calvin Wan (6):
>   run-command: add duplicate_output_fn to run_processes_parallel_opts
>   submodule: strbuf variable rename
>   submodule: move status parsing into function
>   diff-lib: refactor match_stat_with_submodule
>   diff-lib: parallelize run_diff_files for submodules
>   submodule: call parallel code from serial status
>
>  Documentation/config/submodule.txt |  12 ++
>  diff-lib.c                         | 104 ++++++++++--
>  run-command.c                      |  16 +-
>  run-command.h                      |  27 ++++
>  submodule.c                        | 250 ++++++++++++++++++++++-------
>  submodule.h                        |   9 ++
>  t/helper/test-run-command.c        |  21 +++
>  t/t0061-run-command.sh             |  39 +++++
>  t/t4027-diff-submodule.sh          |  19 +++
>  t/t7506-status-submodule.sh        |  19 +++
>  10 files changed, 441 insertions(+), 75 deletions(-)
>
> --
> 2.39.0.314.g84b9a713c41-goog
>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH v6 0/6] submodule: parallelize diff
  2023-01-04 21:54 ` [PATCH v5 0/6] submodule: parallelize diff Calvin Wan
  2023-01-05 23:23   ` Calvin Wan
@ 2023-01-17 19:30   ` Calvin Wan
  2023-02-07 18:16     ` [PATCH v7 0/7] " Calvin Wan
                       ` (7 more replies)
  2023-01-17 19:30   ` [PATCH v6 1/6] run-command: add duplicate_output_fn to run_processes_parallel_opts Calvin Wan
                     ` (5 subsequent siblings)
  7 siblings, 8 replies; 86+ messages in thread
From: Calvin Wan @ 2023-01-17 19:30 UTC (permalink / raw)
  To: git
  Cc: Calvin Wan, emilyshaffer, avarab, phillip.wood123, chooglen,
	newren, jonathantanmy

Original cover letter for context:
https://lore.kernel.org/git/20221011232604.839941-1-calvinwan@google.com/

(Quick reroll to fix leaks from v5)

Thank you again everyone for the numerous reviews! For this reroll, I
incorporated most of the feedback given, fixed a bug I found, and made
some stylistic refactors. I also added a new patch at the end that swaps
the serial implementation in is_submodule_modified for the new parallel
one. While I had patch 6 originally smushed with the previous one,
the diff came out not very reviewer friendly so it has been separated
out.

Changes since v4

(Patch 1)
The code in run-command.c that calls duplicate_output_fn has been
cleaned up and no longer passes a separate strbuf for the output. It
instead passes an offset that represents the starting point in the
original strbuf.

(Patch 5)
Moved status parsing from status_duplicate_output to status_finish. In
pp_buffer_stderr::run-command.c, output is gathered by strbuf_read_once
which reads 8192 bytes at once so a longer status message would error
out during status parsing since part of it would be cut off. Therefore,
status parsing must happen at the end of the process rather than in
duplicate_output_fn (and has subsequently been moved).

(Patch 6)
New patch swapping serial implementation in is_submodule_modified for
the new parallel one.

Calvin Wan (6):
  run-command: add duplicate_output_fn to run_processes_parallel_opts
  submodule: strbuf variable rename
  submodule: move status parsing into function
  diff-lib: refactor match_stat_with_submodule
  diff-lib: parallelize run_diff_files for submodules
  submodule: call parallel code from serial status

 Documentation/config/submodule.txt |  12 ++
 diff-lib.c                         | 104 ++++++++++--
 run-command.c                      |  16 +-
 run-command.h                      |  27 +++
 submodule.c                        | 254 ++++++++++++++++++++++-------
 submodule.h                        |   9 +
 t/helper/test-run-command.c        |  21 +++
 t/t0061-run-command.sh             |  39 +++++
 t/t4027-diff-submodule.sh          |  19 +++
 t/t7506-status-submodule.sh        |  19 +++
 10 files changed, 445 insertions(+), 75 deletions(-)

-- 
2.39.0.314.g84b9a713c41-goog


^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH v6 1/6] run-command: add duplicate_output_fn to run_processes_parallel_opts
  2023-01-04 21:54 ` [PATCH v5 0/6] submodule: parallelize diff Calvin Wan
  2023-01-05 23:23   ` Calvin Wan
  2023-01-17 19:30   ` [PATCH v6 " Calvin Wan
@ 2023-01-17 19:30   ` Calvin Wan
  2023-01-17 19:30   ` [PATCH v6 2/6] submodule: strbuf variable rename Calvin Wan
                     ` (4 subsequent siblings)
  7 siblings, 0 replies; 86+ messages in thread
From: Calvin Wan @ 2023-01-17 19:30 UTC (permalink / raw)
  To: git
  Cc: Calvin Wan, emilyshaffer, avarab, phillip.wood123, chooglen,
	newren, jonathantanmy

Add duplicate_output_fn as an optionally set function in
run_process_parallel_opts. If set, output from each child process is
copied and passed to the callback function whenever output from the
child process is buffered to allow for separate parsing.

Signed-off-by: Calvin Wan <calvinwan@google.com>
---
 run-command.c               | 16 ++++++++++++---
 run-command.h               | 27 +++++++++++++++++++++++++
 t/helper/test-run-command.c | 21 ++++++++++++++++++++
 t/t0061-run-command.sh      | 39 +++++++++++++++++++++++++++++++++++++
 4 files changed, 100 insertions(+), 3 deletions(-)

diff --git a/run-command.c b/run-command.c
index 756f1839aa..cad88befe0 100644
--- a/run-command.c
+++ b/run-command.c
@@ -1526,6 +1526,9 @@ static void pp_init(struct parallel_processes *pp,
 	if (!opts->get_next_task)
 		BUG("you need to specify a get_next_task function");
 
+	if (opts->duplicate_output && opts->ungroup)
+		BUG("duplicate_output and ungroup are incompatible with each other");
+
 	CALLOC_ARRAY(pp->children, n);
 	if (!opts->ungroup)
 		CALLOC_ARRAY(pp->pfd, n);
@@ -1645,14 +1648,21 @@ static void pp_buffer_stderr(struct parallel_processes *pp,
 	for (size_t i = 0; i < opts->processes; i++) {
 		if (pp->children[i].state == GIT_CP_WORKING &&
 		    pp->pfd[i].revents & (POLLIN | POLLHUP)) {
-			int n = strbuf_read_once(&pp->children[i].err,
-						 pp->children[i].process.err, 0);
+			ssize_t n = strbuf_read_once(&pp->children[i].err,
+						     pp->children[i].process.err, 0);
 			if (n == 0) {
 				close(pp->children[i].process.err);
 				pp->children[i].state = GIT_CP_WAIT_CLEANUP;
-			} else if (n < 0)
+			} else if (n < 0) {
 				if (errno != EAGAIN)
 					die_errno("read");
+			} else {
+				if (opts->duplicate_output)
+					opts->duplicate_output(&pp->children[i].err,
+					       strlen(pp->children[i].err.buf) - n,
+					       opts->data,
+					       pp->children[i].data);
+			}
 		}
 	}
 }
diff --git a/run-command.h b/run-command.h
index 072db56a4d..6dcf999f6c 100644
--- a/run-command.h
+++ b/run-command.h
@@ -408,6 +408,27 @@ typedef int (*start_failure_fn)(struct strbuf *out,
 				void *pp_cb,
 				void *pp_task_cb);
 
+/**
+ * This callback is called whenever output from a child process is buffered
+ * 
+ * See run_processes_parallel() below for a discussion of the "struct
+ * strbuf *out" parameter.
+ * 
+ * The offset refers to the number of bytes originally in "out" before
+ * the output from the child process was buffered. Therefore, the buffer
+ * range, "out + buf" to the end of "out", would contain the buffer of
+ * the child process output.
+ *
+ * pp_cb is the callback cookie as passed into run_processes_parallel,
+ * pp_task_cb is the callback cookie as passed into get_next_task_fn.
+ *
+ * This function is incompatible with "ungroup"
+ */
+typedef void (*duplicate_output_fn)(struct strbuf *out,
+				    size_t offset,
+				    void *pp_cb,
+				    void *pp_task_cb);
+
 /**
  * This callback is called on every child process that finished processing.
  *
@@ -461,6 +482,12 @@ struct run_process_parallel_opts
 	 */
 	start_failure_fn start_failure;
 
+	/**
+	 * duplicate_output: See duplicate_output_fn() above. This should be
+	 * NULL unless process specific output is needed
+	 */
+	duplicate_output_fn duplicate_output;
+
 	/**
 	 * task_finished: See task_finished_fn() above. This can be
 	 * NULL to omit any special handling.
diff --git a/t/helper/test-run-command.c b/t/helper/test-run-command.c
index 3ecb830f4a..ffd3cd0045 100644
--- a/t/helper/test-run-command.c
+++ b/t/helper/test-run-command.c
@@ -52,6 +52,21 @@ static int no_job(struct child_process *cp,
 	return 0;
 }
 
+static void duplicate_output(struct strbuf *out,
+			size_t offset,
+			void *pp_cb UNUSED,
+			void *pp_task_cb UNUSED)
+{
+	struct string_list list = STRING_LIST_INIT_DUP;
+
+	string_list_split(&list, out->buf + offset, '\n', -1);
+	for (size_t i = 0; i < list.nr; i++) {
+		if (strlen(list.items[i].string) > 0)
+			fprintf(stderr, "duplicate_output: %s\n", list.items[i].string);
+	}
+	string_list_clear(&list, 0);
+}
+
 static int task_finished(int result,
 			 struct strbuf *err,
 			 void *pp_cb,
@@ -439,6 +454,12 @@ int cmd__run_command(int argc, const char **argv)
 		opts.ungroup = 1;
 	}
 
+	if (!strcmp(argv[1], "--duplicate-output")) {
+		argv += 1;
+		argc -= 1;
+		opts.duplicate_output = duplicate_output;
+	}
+
 	jobs = atoi(argv[2]);
 	strvec_clear(&proc.args);
 	strvec_pushv(&proc.args, (const char **)argv + 3);
diff --git a/t/t0061-run-command.sh b/t/t0061-run-command.sh
index e2411f6a9b..879e536638 100755
--- a/t/t0061-run-command.sh
+++ b/t/t0061-run-command.sh
@@ -135,6 +135,15 @@ test_expect_success 'run_command runs in parallel with more jobs available than
 	test_cmp expect actual
 '
 
+test_expect_success 'run_command runs in parallel with more jobs available than tasks --duplicate-output' '
+	test-tool run-command --duplicate-output run-command-parallel 5 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
+	test_must_be_empty out &&
+	test 4 = $(grep -c "duplicate_output: Hello" err) &&
+	test 4 = $(grep -c "duplicate_output: World" err) &&
+	sed "/duplicate_output/d" err > err1 &&
+	test_cmp expect err1
+'
+
 test_expect_success 'run_command runs ungrouped in parallel with more jobs available than tasks' '
 	test-tool run-command --ungroup run-command-parallel 5 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
 	test_line_count = 8 out &&
@@ -147,6 +156,15 @@ test_expect_success 'run_command runs in parallel with as many jobs as tasks' '
 	test_cmp expect actual
 '
 
+test_expect_success 'run_command runs in parallel with as many jobs as tasks --duplicate-output' '
+	test-tool run-command --duplicate-output run-command-parallel 4 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
+	test_must_be_empty out &&
+	test 4 = $(grep -c "duplicate_output: Hello" err) &&
+	test 4 = $(grep -c "duplicate_output: World" err) &&
+	sed "/duplicate_output/d" err > err1 &&
+	test_cmp expect err1
+'
+
 test_expect_success 'run_command runs ungrouped in parallel with as many jobs as tasks' '
 	test-tool run-command --ungroup run-command-parallel 4 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
 	test_line_count = 8 out &&
@@ -159,6 +177,15 @@ test_expect_success 'run_command runs in parallel with more tasks than jobs avai
 	test_cmp expect actual
 '
 
+test_expect_success 'run_command runs in parallel with more tasks than jobs available --duplicate-output' '
+	test-tool run-command --duplicate-output run-command-parallel 3 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
+	test_must_be_empty out &&
+	test 4 = $(grep -c "duplicate_output: Hello" err) &&
+	test 4 = $(grep -c "duplicate_output: World" err) &&
+	sed "/duplicate_output/d" err > err1 &&
+	test_cmp expect err1
+'
+
 test_expect_success 'run_command runs ungrouped in parallel with more tasks than jobs available' '
 	test-tool run-command --ungroup run-command-parallel 3 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
 	test_line_count = 8 out &&
@@ -180,6 +207,12 @@ test_expect_success 'run_command is asked to abort gracefully' '
 	test_cmp expect actual
 '
 
+test_expect_success 'run_command is asked to abort gracefully --duplicate-output' '
+	test-tool run-command --duplicate-output run-command-abort 3 false >out 2>err &&
+	test_must_be_empty out &&
+	test_cmp expect err
+'
+
 test_expect_success 'run_command is asked to abort gracefully (ungroup)' '
 	test-tool run-command --ungroup run-command-abort 3 false >out 2>err &&
 	test_must_be_empty out &&
@@ -196,6 +229,12 @@ test_expect_success 'run_command outputs ' '
 	test_cmp expect actual
 '
 
+test_expect_success 'run_command outputs --duplicate-output' '
+	test-tool run-command --duplicate-output run-command-no-jobs 3 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
+	test_must_be_empty out &&
+	test_cmp expect err
+'
+
 test_expect_success 'run_command outputs (ungroup) ' '
 	test-tool run-command --ungroup run-command-no-jobs 3 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
 	test_must_be_empty out &&
-- 
2.39.0.314.g84b9a713c41-goog


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v6 2/6] submodule: strbuf variable rename
  2023-01-04 21:54 ` [PATCH v5 0/6] submodule: parallelize diff Calvin Wan
                     ` (2 preceding siblings ...)
  2023-01-17 19:30   ` [PATCH v6 1/6] run-command: add duplicate_output_fn to run_processes_parallel_opts Calvin Wan
@ 2023-01-17 19:30   ` Calvin Wan
  2023-01-17 19:30   ` [PATCH v6 3/6] submodule: move status parsing into function Calvin Wan
                     ` (3 subsequent siblings)
  7 siblings, 0 replies; 86+ messages in thread
From: Calvin Wan @ 2023-01-17 19:30 UTC (permalink / raw)
  To: git
  Cc: Calvin Wan, emilyshaffer, avarab, phillip.wood123, chooglen,
	newren, jonathantanmy

A prepatory change for a future patch that moves the status parsing
logic to a separate function.

Signed-off-by: Calvin Wan <calvinwan@google.com>
---
 submodule.c | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/submodule.c b/submodule.c
index fae24ef34a..faf37c1101 100644
--- a/submodule.c
+++ b/submodule.c
@@ -1906,25 +1906,28 @@ unsigned is_submodule_modified(const char *path, int ignore_untracked)
 
 	fp = xfdopen(cp.out, "r");
 	while (strbuf_getwholeline(&buf, fp, '\n') != EOF) {
+		char *str = buf.buf;
+		const size_t len = buf.len;
+
 		/* regular untracked files */
-		if (buf.buf[0] == '?')
+		if (str[0] == '?')
 			dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
 
-		if (buf.buf[0] == 'u' ||
-		    buf.buf[0] == '1' ||
-		    buf.buf[0] == '2') {
+		if (str[0] == 'u' ||
+		    str[0] == '1' ||
+		    str[0] == '2') {
 			/* T = line type, XY = status, SSSS = submodule state */
-			if (buf.len < strlen("T XY SSSS"))
+			if (len < strlen("T XY SSSS"))
 				BUG("invalid status --porcelain=2 line %s",
-				    buf.buf);
+				    str);
 
-			if (buf.buf[5] == 'S' && buf.buf[8] == 'U')
+			if (str[5] == 'S' && str[8] == 'U')
 				/* nested untracked file */
 				dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
 
-			if (buf.buf[0] == 'u' ||
-			    buf.buf[0] == '2' ||
-			    memcmp(buf.buf + 5, "S..U", 4))
+			if (str[0] == 'u' ||
+			    str[0] == '2' ||
+			    memcmp(str + 5, "S..U", 4))
 				/* other change */
 				dirty_submodule |= DIRTY_SUBMODULE_MODIFIED;
 		}
-- 
2.39.0.314.g84b9a713c41-goog


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v6 3/6] submodule: move status parsing into function
  2023-01-04 21:54 ` [PATCH v5 0/6] submodule: parallelize diff Calvin Wan
                     ` (3 preceding siblings ...)
  2023-01-17 19:30   ` [PATCH v6 2/6] submodule: strbuf variable rename Calvin Wan
@ 2023-01-17 19:30   ` Calvin Wan
  2023-01-17 19:30   ` [PATCH v6 4/6] diff-lib: refactor match_stat_with_submodule Calvin Wan
                     ` (2 subsequent siblings)
  7 siblings, 0 replies; 86+ messages in thread
From: Calvin Wan @ 2023-01-17 19:30 UTC (permalink / raw)
  To: git
  Cc: Calvin Wan, emilyshaffer, avarab, phillip.wood123, chooglen,
	newren, jonathantanmy

A future patch requires the ability to parse the output of git
status --porcelain=2. Move parsing code from is_submodule_modified to
parse_status_porcelain.

Signed-off-by: Calvin Wan <calvinwan@google.com>
---
 submodule.c | 74 ++++++++++++++++++++++++++++++-----------------------
 1 file changed, 42 insertions(+), 32 deletions(-)

diff --git a/submodule.c b/submodule.c
index faf37c1101..768d4b4cd7 100644
--- a/submodule.c
+++ b/submodule.c
@@ -1870,6 +1870,45 @@ int fetch_submodules(struct repository *r,
 	return spf.result;
 }
 
+static int parse_status_porcelain(char *str, size_t len,
+				  unsigned *dirty_submodule,
+				  int ignore_untracked)
+{
+	/* regular untracked files */
+	if (str[0] == '?')
+		*dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
+
+	if (str[0] == 'u' ||
+	    str[0] == '1' ||
+	    str[0] == '2') {
+		/* T = line type, XY = status, SSSS = submodule state */
+		if (len < strlen("T XY SSSS"))
+			BUG("invalid status --porcelain=2 line %s",
+			    str);
+
+		if (str[5] == 'S' && str[8] == 'U')
+			/* nested untracked file */
+			*dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
+
+		if (str[0] == 'u' ||
+		    str[0] == '2' ||
+		    memcmp(str + 5, "S..U", 4))
+			/* other change */
+			*dirty_submodule |= DIRTY_SUBMODULE_MODIFIED;
+	}
+
+	if ((*dirty_submodule & DIRTY_SUBMODULE_MODIFIED) &&
+	    ((*dirty_submodule & DIRTY_SUBMODULE_UNTRACKED) ||
+	     ignore_untracked)) {
+		/*
+		* We're not interested in any further information from
+		* the child any more, neither output nor its exit code.
+		*/
+		return 1;
+	}
+	return 0;
+}
+
 unsigned is_submodule_modified(const char *path, int ignore_untracked)
 {
 	struct child_process cp = CHILD_PROCESS_INIT;
@@ -1909,39 +1948,10 @@ unsigned is_submodule_modified(const char *path, int ignore_untracked)
 		char *str = buf.buf;
 		const size_t len = buf.len;
 
-		/* regular untracked files */
-		if (str[0] == '?')
-			dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
-
-		if (str[0] == 'u' ||
-		    str[0] == '1' ||
-		    str[0] == '2') {
-			/* T = line type, XY = status, SSSS = submodule state */
-			if (len < strlen("T XY SSSS"))
-				BUG("invalid status --porcelain=2 line %s",
-				    str);
-
-			if (str[5] == 'S' && str[8] == 'U')
-				/* nested untracked file */
-				dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
-
-			if (str[0] == 'u' ||
-			    str[0] == '2' ||
-			    memcmp(str + 5, "S..U", 4))
-				/* other change */
-				dirty_submodule |= DIRTY_SUBMODULE_MODIFIED;
-		}
-
-		if ((dirty_submodule & DIRTY_SUBMODULE_MODIFIED) &&
-		    ((dirty_submodule & DIRTY_SUBMODULE_UNTRACKED) ||
-		     ignore_untracked)) {
-			/*
-			 * We're not interested in any further information from
-			 * the child any more, neither output nor its exit code.
-			 */
-			ignore_cp_exit_code = 1;
+		ignore_cp_exit_code = parse_status_porcelain(str, len, &dirty_submodule,
+							     ignore_untracked);
+		if (ignore_cp_exit_code)
 			break;
-		}
 	}
 	fclose(fp);
 
-- 
2.39.0.314.g84b9a713c41-goog


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v6 4/6] diff-lib: refactor match_stat_with_submodule
  2023-01-04 21:54 ` [PATCH v5 0/6] submodule: parallelize diff Calvin Wan
                     ` (4 preceding siblings ...)
  2023-01-17 19:30   ` [PATCH v6 3/6] submodule: move status parsing into function Calvin Wan
@ 2023-01-17 19:30   ` Calvin Wan
  2023-01-17 19:30   ` [PATCH v6 5/6] diff-lib: parallelize run_diff_files for submodules Calvin Wan
  2023-01-17 19:30   ` [PATCH v6 6/6] submodule: call parallel code from serial status Calvin Wan
  7 siblings, 0 replies; 86+ messages in thread
From: Calvin Wan @ 2023-01-17 19:30 UTC (permalink / raw)
  To: git
  Cc: Calvin Wan, emilyshaffer, avarab, phillip.wood123, chooglen,
	newren, jonathantanmy

Flatten out the if statements in match_stat_with_submodule so the
logic is more readable and easier for future patches to add to.
orig_flags didn't need to be set if the cache entry wasn't a
GITLINK so defer setting it.

Signed-off-by: Calvin Wan <calvinwan@google.com>
---
 diff-lib.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/diff-lib.c b/diff-lib.c
index dec040c366..64583fded0 100644
--- a/diff-lib.c
+++ b/diff-lib.c
@@ -73,18 +73,24 @@ static int match_stat_with_submodule(struct diff_options *diffopt,
 				     unsigned *dirty_submodule)
 {
 	int changed = ie_match_stat(diffopt->repo->index, ce, st, ce_option);
-	if (S_ISGITLINK(ce->ce_mode)) {
-		struct diff_flags orig_flags = diffopt->flags;
-		if (!diffopt->flags.override_submodule_config)
-			set_diffopt_flags_from_submodule_config(diffopt, ce->name);
-		if (diffopt->flags.ignore_submodules)
-			changed = 0;
-		else if (!diffopt->flags.ignore_dirty_submodules &&
-			 (!changed || diffopt->flags.dirty_submodules))
-			*dirty_submodule = is_submodule_modified(ce->name,
-								 diffopt->flags.ignore_untracked_in_submodules);
-		diffopt->flags = orig_flags;
+	struct diff_flags orig_flags;
+
+	if (!S_ISGITLINK(ce->ce_mode))
+		return changed;
+
+	orig_flags = diffopt->flags;
+	if (!diffopt->flags.override_submodule_config)
+		set_diffopt_flags_from_submodule_config(diffopt, ce->name);
+	if (diffopt->flags.ignore_submodules) {
+		changed = 0;
+		goto cleanup;
 	}
+	if (!diffopt->flags.ignore_dirty_submodules &&
+	    (!changed || diffopt->flags.dirty_submodules))
+		*dirty_submodule = is_submodule_modified(ce->name,
+					 diffopt->flags.ignore_untracked_in_submodules);
+cleanup:
+	diffopt->flags = orig_flags;
 	return changed;
 }
 
-- 
2.39.0.314.g84b9a713c41-goog


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v6 5/6] diff-lib: parallelize run_diff_files for submodules
  2023-01-04 21:54 ` [PATCH v5 0/6] submodule: parallelize diff Calvin Wan
                     ` (5 preceding siblings ...)
  2023-01-17 19:30   ` [PATCH v6 4/6] diff-lib: refactor match_stat_with_submodule Calvin Wan
@ 2023-01-17 19:30   ` Calvin Wan
  2023-01-26  9:09     ` Glen Choo
  2023-01-26  9:16     ` Glen Choo
  2023-01-17 19:30   ` [PATCH v6 6/6] submodule: call parallel code from serial status Calvin Wan
  7 siblings, 2 replies; 86+ messages in thread
From: Calvin Wan @ 2023-01-17 19:30 UTC (permalink / raw)
  To: git
  Cc: Calvin Wan, emilyshaffer, avarab, phillip.wood123, chooglen,
	newren, jonathantanmy

During the iteration of the index entries in run_diff_files, whenever
a submodule is found and needs its status checked, a subprocess is
spawned for it. Instead of spawning the subprocess immediately and
waiting for its completion to continue, hold onto all submodules and
relevant information in a list. Then use that list to create tasks for
run_processes_parallel. Subprocess output is duplicated and passed to
status_pipe_output which stores it to be parsed on completion of the
subprocess.

Add config option submodule.diffJobs to set the maximum number
of parallel jobs. The option defaults to 1 if unset. If set to 0, the
number of jobs is set to online_cpus().

Since run_diff_files is called from many different commands, I chose
to grab the config option in the function rather than adding variables
to every git command and then figuring out how to pass them all in.

Signed-off-by: Calvin Wan <calvinwan@google.com>
---
 Documentation/config/submodule.txt |  12 ++
 diff-lib.c                         |  84 ++++++++++++--
 submodule.c                        | 169 +++++++++++++++++++++++++++++
 submodule.h                        |   9 ++
 t/t4027-diff-submodule.sh          |  19 ++++
 t/t7506-status-submodule.sh        |  19 ++++
 6 files changed, 305 insertions(+), 7 deletions(-)

diff --git a/Documentation/config/submodule.txt b/Documentation/config/submodule.txt
index 6490527b45..3209eb8117 100644
--- a/Documentation/config/submodule.txt
+++ b/Documentation/config/submodule.txt
@@ -93,6 +93,18 @@ submodule.fetchJobs::
 	in parallel. A value of 0 will give some reasonable default.
 	If unset, it defaults to 1.
 
+submodule.diffJobs::
+	Specifies how many submodules are diffed at the same time. A
+	positive integer allows up to that number of submodules diffed
+	in parallel. A value of 0 will give some reasonable default.
+	If unset, it defaults to 1. The diff operation is used by many
+	other git commands such as add, merge, diff, status, stash and
+	more. Note that the expensive part of the diff operation is
+	reading the index from cache or memory. Therefore multiple jobs
+	may be detrimental to performance if your hardware does not
+	support parallel reads or if the number of jobs greatly exceeds
+	the amount of supported reads.
+
 submodule.alternateLocation::
 	Specifies how the submodules obtain alternates when submodules are
 	cloned. Possible values are `no`, `superproject`.
diff --git a/diff-lib.c b/diff-lib.c
index 64583fded0..f51ea07f36 100644
--- a/diff-lib.c
+++ b/diff-lib.c
@@ -14,6 +14,7 @@
 #include "dir.h"
 #include "fsmonitor.h"
 #include "commit-reach.h"
+#include "config.h"
 
 /*
  * diff-files
@@ -65,18 +66,23 @@ static int check_removed(const struct index_state *istate, const struct cache_en
  * Return 1 when changes are detected, 0 otherwise. If the DIRTY_SUBMODULES
  * option is set, the caller does not only want to know if a submodule is
  * modified at all but wants to know all the conditions that are met (new
- * commits, untracked content and/or modified content).
+ * commits, untracked content and/or modified content). If
+ * defer_submodule_status bit is set, dirty_submodule will be left to the
+ * caller to set. defer_submodule_status can also be set to 0 in this
+ * function if there is no need to check if the submodule is modified.
  */
 static int match_stat_with_submodule(struct diff_options *diffopt,
 				     const struct cache_entry *ce,
 				     struct stat *st, unsigned ce_option,
-				     unsigned *dirty_submodule)
+				     unsigned *dirty_submodule, int *defer_submodule_status,
+				     unsigned *ignore_untracked)
 {
 	int changed = ie_match_stat(diffopt->repo->index, ce, st, ce_option);
 	struct diff_flags orig_flags;
+	int defer = 0;
 
 	if (!S_ISGITLINK(ce->ce_mode))
-		return changed;
+		goto ret;
 
 	orig_flags = diffopt->flags;
 	if (!diffopt->flags.override_submodule_config)
@@ -86,11 +92,20 @@ static int match_stat_with_submodule(struct diff_options *diffopt,
 		goto cleanup;
 	}
 	if (!diffopt->flags.ignore_dirty_submodules &&
-	    (!changed || diffopt->flags.dirty_submodules))
-		*dirty_submodule = is_submodule_modified(ce->name,
+	    (!changed || diffopt->flags.dirty_submodules)) {
+		if (defer_submodule_status && *defer_submodule_status) {
+			defer = 1;
+			*ignore_untracked = diffopt->flags.ignore_untracked_in_submodules;
+		} else {
+			*dirty_submodule = is_submodule_modified(ce->name,
 					 diffopt->flags.ignore_untracked_in_submodules);
+		}
+	}
 cleanup:
 	diffopt->flags = orig_flags;
+ret:
+	if (defer_submodule_status)
+		*defer_submodule_status = defer;
 	return changed;
 }
 
@@ -102,6 +117,7 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
 			      ? CE_MATCH_RACY_IS_DIRTY : 0);
 	uint64_t start = getnanotime();
 	struct index_state *istate = revs->diffopt.repo->index;
+	struct string_list submodules = STRING_LIST_INIT_NODUP;
 
 	diff_set_mnemonic_prefix(&revs->diffopt, "i/", "w/");
 
@@ -226,6 +242,8 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
 			newmode = ce->ce_mode;
 		} else {
 			struct stat st;
+			unsigned ignore_untracked = 0;
+			int defer_submodule_status = !!revs->repo;
 
 			changed = check_removed(istate, ce, &st);
 			if (changed) {
@@ -247,8 +265,26 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
 			}
 
 			changed = match_stat_with_submodule(&revs->diffopt, ce, &st,
-							    ce_option, &dirty_submodule);
+							    ce_option, &dirty_submodule,
+							    &defer_submodule_status,
+							    &ignore_untracked);
 			newmode = ce_mode_from_stat(ce, st.st_mode);
+			if (defer_submodule_status) {
+				struct submodule_status_util tmp = {
+					.changed = changed,
+					.dirty_submodule = 0,
+					.ignore_untracked = ignore_untracked,
+					.newmode = newmode,
+					.ce = ce,
+					.path = ce->name,
+				};
+				struct string_list_item *item;
+
+				item = string_list_append(&submodules, ce->name);
+				item->util = xmalloc(sizeof(tmp));
+				memcpy(item->util, &tmp, sizeof(tmp));
+				continue;
+			}
 		}
 
 		if (!changed && !dirty_submodule) {
@@ -267,6 +303,40 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
 			    ce->name, 0, dirty_submodule);
 
 	}
+	if (submodules.nr > 0) {
+		int parallel_jobs;
+		if (git_config_get_int("submodule.diffjobs", &parallel_jobs))
+			parallel_jobs = 1;
+		else if (!parallel_jobs)
+			parallel_jobs = online_cpus();
+		else if (parallel_jobs < 0)
+			die(_("submodule.diffjobs cannot be negative"));
+
+		if (get_submodules_status(&submodules, parallel_jobs))
+			die(_("submodule status failed"));
+		for (size_t i = 0; i < submodules.nr; i++) {
+			struct submodule_status_util *util = submodules.items[i].util;
+			struct cache_entry *ce = util->ce;
+			unsigned int oldmode;
+			const struct object_id *old_oid, *new_oid;
+
+			if (!util->changed && !util->dirty_submodule) {
+				ce_mark_uptodate(ce);
+				mark_fsmonitor_valid(istate, ce);
+				if (!revs->diffopt.flags.find_copies_harder)
+					continue;
+			}
+			oldmode = ce->ce_mode;
+			old_oid = &ce->oid;
+			new_oid = util->changed ? null_oid() : &ce->oid;
+			diff_change(&revs->diffopt, oldmode, util->newmode,
+				    old_oid, new_oid,
+				    !is_null_oid(old_oid),
+				    !is_null_oid(new_oid),
+				    ce->name, 0, util->dirty_submodule);
+		}
+	}
+	string_list_clear(&submodules, 1);
 	diffcore_std(&revs->diffopt);
 	diff_flush(&revs->diffopt);
 	trace_performance_since(start, "diff-files");
@@ -314,7 +384,7 @@ static int get_stat_data(const struct index_state *istate,
 			return -1;
 		}
 		changed = match_stat_with_submodule(diffopt, ce, &st,
-						    0, dirty_submodule);
+						    0, dirty_submodule, NULL, NULL);
 		if (changed) {
 			mode = ce_mode_from_stat(ce, st.st_mode);
 			oid = null_oid();
diff --git a/submodule.c b/submodule.c
index 768d4b4cd7..da95ea1f5e 100644
--- a/submodule.c
+++ b/submodule.c
@@ -1369,6 +1369,17 @@ int submodule_touches_in_range(struct repository *r,
 	return ret;
 }
 
+struct submodule_parallel_status {
+	size_t index_count;
+	int result;
+
+	struct string_list *submodule_names;
+
+	/* Pending statuses by OIDs */
+	struct status_task **oid_status_tasks;
+	int oid_status_tasks_nr, oid_status_tasks_alloc;
+};
+
 struct submodule_parallel_fetch {
 	/*
 	 * The index of the last index entry processed by
@@ -1451,6 +1462,12 @@ struct fetch_task {
 	struct oid_array *commits; /* Ensure these commits are fetched */
 };
 
+struct status_task {
+	const char *path;
+	struct strbuf out;
+	int ignore_untracked;
+};
+
 /**
  * When a submodule is not defined in .gitmodules, we cannot access it
  * via the regular submodule-config. Create a fake submodule, which we can
@@ -1909,6 +1926,25 @@ static int parse_status_porcelain(char *str, size_t len,
 	return 0;
 }
 
+static void parse_status_porcelain_strbuf(struct strbuf *buf,
+				   unsigned *dirty_submodule,
+				   int ignore_untracked)
+{
+	struct string_list list = STRING_LIST_INIT_DUP;
+	struct string_list_item *item;
+
+	string_list_split(&list, buf->buf, '\n', -1);
+
+	for_each_string_list_item(item, &list) {
+		if (parse_status_porcelain(item->string,
+					   strlen(item->string),
+					   dirty_submodule,
+					   ignore_untracked))
+			break;
+	}
+	string_list_clear(&list, 0);
+}
+
 unsigned is_submodule_modified(const char *path, int ignore_untracked)
 {
 	struct child_process cp = CHILD_PROCESS_INIT;
@@ -1962,6 +1998,139 @@ unsigned is_submodule_modified(const char *path, int ignore_untracked)
 	return dirty_submodule;
 }
 
+static struct status_task *
+get_status_task_from_index(struct submodule_parallel_status *sps,
+			   struct strbuf *err)
+{
+	for (; sps->index_count < sps->submodule_names->nr; sps->index_count++) {
+		struct submodule_status_util *util = sps->submodule_names->items[sps->index_count].util;
+		struct status_task *task;
+		struct strbuf buf = STRBUF_INIT;
+		const char *git_dir;
+
+		strbuf_addf(&buf, "%s/.git", util->path);
+		git_dir = read_gitfile(buf.buf);
+		if (!git_dir)
+			git_dir = buf.buf;
+		if (!is_git_directory(git_dir)) {
+			if (is_directory(git_dir))
+				die(_("'%s' not recognized as a git repository"), git_dir);
+			strbuf_release(&buf);
+			/* The submodule is not checked out, so it is not modified */
+			util->dirty_submodule = 0;
+			continue;
+		}
+		strbuf_release(&buf);
+
+		task = xmalloc(sizeof(*task));
+		task->path = util->path;
+		task->ignore_untracked = util->ignore_untracked;
+		strbuf_init(&task->out, 0);
+		sps->index_count++;
+		return task;
+	}
+	return NULL;
+}
+
+static int get_next_submodule_status(struct child_process *cp,
+				     struct strbuf *err, void *data,
+				     void **task_cb)
+{
+	struct submodule_parallel_status *sps = data;
+	struct status_task *task = get_status_task_from_index(sps, err);
+
+	if (!task)
+		return 0;
+
+	child_process_init(cp);
+	prepare_submodule_repo_env_in_gitdir(&cp->env);
+
+	strvec_init(&cp->args);
+	strvec_pushl(&cp->args, "status", "--porcelain=2", NULL);
+	if (task->ignore_untracked)
+		strvec_push(&cp->args, "-uno");
+
+	prepare_submodule_repo_env(&cp->env);
+	cp->git_cmd = 1;
+	cp->dir = task->path;
+	*task_cb = task;
+	return 1;
+}
+
+static int status_start_failure(struct strbuf *err,
+				void *cb, void *task_cb)
+{
+	struct submodule_parallel_status *sps = cb;
+	struct status_task *task = task_cb;
+
+	sps->result = 1;
+	strbuf_addf(err,
+	    _("could not run 'git status --porcelain=2' in submodule %s"),
+	    task->path);
+	return 0;
+}
+
+static void status_duplicate_output(struct strbuf *out,
+				    size_t offset,
+				    void *cb, void *task_cb)
+{
+	struct status_task *task = task_cb;
+
+	strbuf_add(&task->out, out->buf + offset, out->len - offset);
+	strbuf_setlen(out, offset);
+}
+
+static int status_finish(int retvalue, struct strbuf *err,
+			 void *cb, void *task_cb)
+{
+	struct submodule_parallel_status *sps = cb;
+	struct status_task *task = task_cb;
+	struct string_list_item *it =
+		string_list_lookup(sps->submodule_names, task->path);
+	struct submodule_status_util *util = it->util;
+
+	if (retvalue) {
+		sps->result = 1;
+		strbuf_addf(err,
+		    _("'git status --porcelain=2' failed in submodule %s"),
+		    task->path);
+	}
+
+	parse_status_porcelain_strbuf(&task->out,
+			      &util->dirty_submodule,
+			      util->ignore_untracked);
+
+	strbuf_release(&task->out);
+	free(task);
+
+	return 0;
+}
+
+int get_submodules_status(struct string_list *submodules,
+			  int max_parallel_jobs)
+{
+	struct submodule_parallel_status sps = {
+		.submodule_names = submodules,
+	};
+	const struct run_process_parallel_opts opts = {
+		.tr2_category = "submodule",
+		.tr2_label = "parallel/status",
+
+		.processes = max_parallel_jobs,
+
+		.get_next_task = get_next_submodule_status,
+		.start_failure = status_start_failure,
+		.duplicate_output = status_duplicate_output,
+		.task_finished = status_finish,
+		.data = &sps,
+	};
+
+	string_list_sort(sps.submodule_names);
+	run_processes_parallel(&opts);
+
+	return sps.result;
+}
+
 int submodule_uses_gitfile(const char *path)
 {
 	struct child_process cp = CHILD_PROCESS_INIT;
diff --git a/submodule.h b/submodule.h
index b52a4ff1e7..08d278a414 100644
--- a/submodule.h
+++ b/submodule.h
@@ -41,6 +41,13 @@ struct submodule_update_strategy {
 	.type = SM_UPDATE_UNSPECIFIED, \
 }
 
+struct submodule_status_util {
+	int changed, ignore_untracked;
+	unsigned dirty_submodule, newmode;
+	struct cache_entry *ce;
+	const char *path;
+};
+
 int is_gitmodules_unmerged(struct index_state *istate);
 int is_writing_gitmodules_ok(void);
 int is_staging_gitmodules_ok(struct index_state *istate);
@@ -94,6 +101,8 @@ int fetch_submodules(struct repository *r,
 		     int command_line_option,
 		     int default_option,
 		     int quiet, int max_parallel_jobs);
+int get_submodules_status(struct string_list *submodules,
+			  int max_parallel_jobs);
 unsigned is_submodule_modified(const char *path, int ignore_untracked);
 int submodule_uses_gitfile(const char *path);
 
diff --git a/t/t4027-diff-submodule.sh b/t/t4027-diff-submodule.sh
index 40164ae07d..e08ee315a7 100755
--- a/t/t4027-diff-submodule.sh
+++ b/t/t4027-diff-submodule.sh
@@ -34,6 +34,25 @@ test_expect_success setup '
 	subtip=$3 subprev=$2
 '
 
+test_expect_success 'diff in superproject with submodules respects parallel settings' '
+	test_when_finished "rm -f trace.out" &&
+	(
+		GIT_TRACE=$(pwd)/trace.out git diff &&
+		grep "1 tasks" trace.out &&
+		>trace.out &&
+
+		git config submodule.diffJobs 8 &&
+		GIT_TRACE=$(pwd)/trace.out git diff &&
+		grep "8 tasks" trace.out &&
+		>trace.out &&
+
+		GIT_TRACE=$(pwd)/trace.out git -c submodule.diffJobs=0 diff &&
+		grep "preparing to run up to [0-9]* tasks" trace.out &&
+		! grep "up to 0 tasks" trace.out &&
+		>trace.out
+	)
+'
+
 test_expect_success 'git diff --raw HEAD' '
 	hexsz=$(test_oid hexsz) &&
 	git diff --raw --abbrev=$hexsz HEAD >actual &&
diff --git a/t/t7506-status-submodule.sh b/t/t7506-status-submodule.sh
index d050091345..52a82b703f 100755
--- a/t/t7506-status-submodule.sh
+++ b/t/t7506-status-submodule.sh
@@ -412,4 +412,23 @@ test_expect_success 'status with added file in nested submodule (short)' '
 	EOF
 '
 
+test_expect_success 'status in superproject with submodules respects parallel settings' '
+	test_when_finished "rm -f trace.out" &&
+	(
+		GIT_TRACE=$(pwd)/trace.out git status &&
+		grep "1 tasks" trace.out &&
+		>trace.out &&
+
+		git config submodule.diffJobs 8 &&
+		GIT_TRACE=$(pwd)/trace.out git status &&
+		grep "8 tasks" trace.out &&
+		>trace.out &&
+
+		GIT_TRACE=$(pwd)/trace.out git -c submodule.diffJobs=0 status &&
+		grep "preparing to run up to [0-9]* tasks" trace.out &&
+		! grep "up to 0 tasks" trace.out &&
+		>trace.out
+	)
+'
+
 test_done
-- 
2.39.0.314.g84b9a713c41-goog


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v6 6/6] submodule: call parallel code from serial status
  2023-01-04 21:54 ` [PATCH v5 0/6] submodule: parallelize diff Calvin Wan
                     ` (6 preceding siblings ...)
  2023-01-17 19:30   ` [PATCH v6 5/6] diff-lib: parallelize run_diff_files for submodules Calvin Wan
@ 2023-01-17 19:30   ` Calvin Wan
  2023-01-26  8:09     ` Glen Choo
  7 siblings, 1 reply; 86+ messages in thread
From: Calvin Wan @ 2023-01-17 19:30 UTC (permalink / raw)
  To: git
  Cc: Calvin Wan, emilyshaffer, avarab, phillip.wood123, chooglen,
	newren, jonathantanmy

Remove the serial implementation of status inside of
is_submodule_modified since the parallel implementation of status with
one job accomplishes the same task.

Combine parse_status_porcelain and parse_status_porcelain_strbuf since
the only other caller of parse_status_porcelain was in
is_submodule_modified

Signed-off-by: Calvin Wan <calvinwan@google.com>
---
 submodule.c | 146 ++++++++++++++++++----------------------------------
 1 file changed, 51 insertions(+), 95 deletions(-)

diff --git a/submodule.c b/submodule.c
index da95ea1f5e..2009748d9f 100644
--- a/submodule.c
+++ b/submodule.c
@@ -1887,46 +1887,7 @@ int fetch_submodules(struct repository *r,
 	return spf.result;
 }
 
-static int parse_status_porcelain(char *str, size_t len,
-				  unsigned *dirty_submodule,
-				  int ignore_untracked)
-{
-	/* regular untracked files */
-	if (str[0] == '?')
-		*dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
-
-	if (str[0] == 'u' ||
-	    str[0] == '1' ||
-	    str[0] == '2') {
-		/* T = line type, XY = status, SSSS = submodule state */
-		if (len < strlen("T XY SSSS"))
-			BUG("invalid status --porcelain=2 line %s",
-			    str);
-
-		if (str[5] == 'S' && str[8] == 'U')
-			/* nested untracked file */
-			*dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
-
-		if (str[0] == 'u' ||
-		    str[0] == '2' ||
-		    memcmp(str + 5, "S..U", 4))
-			/* other change */
-			*dirty_submodule |= DIRTY_SUBMODULE_MODIFIED;
-	}
-
-	if ((*dirty_submodule & DIRTY_SUBMODULE_MODIFIED) &&
-	    ((*dirty_submodule & DIRTY_SUBMODULE_UNTRACKED) ||
-	     ignore_untracked)) {
-		/*
-		* We're not interested in any further information from
-		* the child any more, neither output nor its exit code.
-		*/
-		return 1;
-	}
-	return 0;
-}
-
-static void parse_status_porcelain_strbuf(struct strbuf *buf,
+static void parse_status_porcelain(struct strbuf *buf,
 				   unsigned *dirty_submodule,
 				   int ignore_untracked)
 {
@@ -1936,65 +1897,60 @@ static void parse_status_porcelain_strbuf(struct strbuf *buf,
 	string_list_split(&list, buf->buf, '\n', -1);
 
 	for_each_string_list_item(item, &list) {
-		if (parse_status_porcelain(item->string,
-					   strlen(item->string),
-					   dirty_submodule,
-					   ignore_untracked))
+		char *str = item->string;
+		/* regular untracked files */
+		if (str[0] == '?')
+			*dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
+
+		if (str[0] == 'u' ||
+		str[0] == '1' ||
+		str[0] == '2') {
+			/* T = line type, XY = status, SSSS = submodule state */
+			if (strlen(str) < strlen("T XY SSSS"))
+				BUG("invalid status --porcelain=2 line %s",
+				str);
+
+			if (str[5] == 'S' && str[8] == 'U')
+				/* nested untracked file */
+				*dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
+
+			if (str[0] == 'u' ||
+			str[0] == '2' ||
+			memcmp(str + 5, "S..U", 4))
+				/* other change */
+				*dirty_submodule |= DIRTY_SUBMODULE_MODIFIED;
+		}
+
+		if ((*dirty_submodule & DIRTY_SUBMODULE_MODIFIED) &&
+		    ((*dirty_submodule & DIRTY_SUBMODULE_UNTRACKED) ||
+		    ignore_untracked)) {
+			/*
+			* We're not interested in any further information from
+			* the child any more, neither output nor its exit code.
+			*/
 			break;
+		}
 	}
 	string_list_clear(&list, 0);
 }
 
 unsigned is_submodule_modified(const char *path, int ignore_untracked)
 {
-	struct child_process cp = CHILD_PROCESS_INIT;
-	struct strbuf buf = STRBUF_INIT;
-	FILE *fp;
-	unsigned dirty_submodule = 0;
-	const char *git_dir;
-	int ignore_cp_exit_code = 0;
-
-	strbuf_addf(&buf, "%s/.git", path);
-	git_dir = read_gitfile(buf.buf);
-	if (!git_dir)
-		git_dir = buf.buf;
-	if (!is_git_directory(git_dir)) {
-		if (is_directory(git_dir))
-			die(_("'%s' not recognized as a git repository"), git_dir);
-		strbuf_release(&buf);
-		/* The submodule is not checked out, so it is not modified */
-		return 0;
-	}
-	strbuf_reset(&buf);
-
-	strvec_pushl(&cp.args, "status", "--porcelain=2", NULL);
-	if (ignore_untracked)
-		strvec_push(&cp.args, "-uno");
-
-	prepare_submodule_repo_env(&cp.env);
-	cp.git_cmd = 1;
-	cp.no_stdin = 1;
-	cp.out = -1;
-	cp.dir = path;
-	if (start_command(&cp))
-		die(_("Could not run 'git status --porcelain=2' in submodule %s"), path);
-
-	fp = xfdopen(cp.out, "r");
-	while (strbuf_getwholeline(&buf, fp, '\n') != EOF) {
-		char *str = buf.buf;
-		const size_t len = buf.len;
-
-		ignore_cp_exit_code = parse_status_porcelain(str, len, &dirty_submodule,
-							     ignore_untracked);
-		if (ignore_cp_exit_code)
-			break;
-	}
-	fclose(fp);
-
-	if (finish_command(&cp) && !ignore_cp_exit_code)
-		die(_("'git status --porcelain=2' failed in submodule %s"), path);
-
-	strbuf_release(&buf);
+	struct submodule_status_util util = {
+		.dirty_submodule = 0,
+		.ignore_untracked = ignore_untracked,
+		.path = path,
+	};
+	struct string_list sub = STRING_LIST_INIT_NODUP;
+	struct string_list_item *item;
+	int dirty_submodule;
+
+	item = string_list_append(&sub, path);
+	item->util = &util;
+	if (get_submodules_status(&sub, 1))
+		die(_("submodule status failed"));
+	dirty_submodule = util.dirty_submodule;
+	string_list_clear(&sub, 0);
 	return dirty_submodule;
 }
 
@@ -2096,9 +2052,9 @@ static int status_finish(int retvalue, struct strbuf *err,
 		    task->path);
 	}
 
-	parse_status_porcelain_strbuf(&task->out,
-			      &util->dirty_submodule,
-			      util->ignore_untracked);
+	parse_status_porcelain(&task->out,
+			       &util->dirty_submodule,
+			       util->ignore_untracked);
 
 	strbuf_release(&task->out);
 	free(task);
-- 
2.39.0.314.g84b9a713c41-goog


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [PATCH v6 6/6] submodule: call parallel code from serial status
  2023-01-17 19:30   ` [PATCH v6 6/6] submodule: call parallel code from serial status Calvin Wan
@ 2023-01-26  8:09     ` Glen Choo
  2023-01-26  8:45       ` Glen Choo
  0 siblings, 1 reply; 86+ messages in thread
From: Glen Choo @ 2023-01-26  8:09 UTC (permalink / raw)
  To: Calvin Wan, git
  Cc: Calvin Wan, emilyshaffer, avarab, phillip.wood123, newren,
	jonathantanmy

Calvin Wan <calvinwan@google.com> writes:

> Remove the serial implementation of status inside of
> is_submodule_modified since the parallel implementation of status with
> one job accomplishes the same task.
>
> Combine parse_status_porcelain and parse_status_porcelain_strbuf since
> the only other caller of parse_status_porcelain was in
> is_submodule_modified

I see that this is in direct response to Jonathan's earlier comment [1]
that we should have only one implementation. Thanks, this is helpful.
Definitely a step in the right direction.

That said, I don't think this patch's position in the series makes
sense. I would have expected a patch like this to come before 5/6. I.e.
this series duplicates code in 5/6 and deletes it in 6/6 so that we only
have one implementation for both serial and parallel submodule status.

Instead, I would have expected we would refactor out the serial
implementation, then use the refactored code for the parallel
implementation. Not having duplicated code in 5/6 would shrink the line
count a lot and make it easier to review.

[1] https://lore.kernel.org/git/20221128210125.2751300-1-jonathantanmy@google.com/

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v6 6/6] submodule: call parallel code from serial status
  2023-01-26  8:09     ` Glen Choo
@ 2023-01-26  8:45       ` Glen Choo
  0 siblings, 0 replies; 86+ messages in thread
From: Glen Choo @ 2023-01-26  8:45 UTC (permalink / raw)
  To: Calvin Wan, git
  Cc: Calvin Wan, emilyshaffer, avarab, phillip.wood123, newren,
	jonathantanmy

Glen Choo <chooglen@google.com> writes:

> Calvin Wan <calvinwan@google.com> writes:
>
>> Remove the serial implementation of status inside of
>> is_submodule_modified since the parallel implementation of status with
>> one job accomplishes the same task.
>
> I see that this is in direct response to Jonathan's earlier comment [1]
> that we should have only one implementation. Thanks, this is helpful.
> Definitely a step in the right direction.
>
> That said, I don't think this patch's position in the series makes
> sense. I would have expected a patch like this to come before 5/6. I.e.
> this series duplicates code in 5/6 and deletes it in 6/6 so that we only
> have one implementation for both serial and parallel submodule status.
>
> Instead, I would have expected we would refactor out the serial
> implementation, then use the refactored code for the parallel
> implementation. Not having duplicated code in 5/6 would shrink the line
> count a lot and make it easier to review.
>
> [1] https://lore.kernel.org/git/20221128210125.2751300-1-jonathantanmy@google.com/

Ah, I realize I completely misunderstood this patch. I thought that this
was deleting code that was duplicated between the serial and parallel
implementations in 5/6 such that both ended up sharing just one copy of
the code.

Instead, this patch deletes the serial implementation altogether and
replaces it with the parallel one. As such, this patch can't come
earlier than 5/6, because we need the parallel implementation to exist
before we can use it.

For reviewability of 5/6, I'd still strongly prefer that we refactor out
functions (I'll leave more specific comments on that patch). We could
still consider replacing the serial implementation with "parallel with a
single job", though I suspect that it will be unnecessary if we do the
refactoring well. I'm also not sure how idiomatic it is to call
run_processes_parallel() with a hardcoded value of 1, but I don't feel
too strongly about that.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v6 5/6] diff-lib: parallelize run_diff_files for submodules
  2023-01-17 19:30   ` [PATCH v6 5/6] diff-lib: parallelize run_diff_files for submodules Calvin Wan
@ 2023-01-26  9:09     ` Glen Choo
  2023-01-26  9:16     ` Glen Choo
  1 sibling, 0 replies; 86+ messages in thread
From: Glen Choo @ 2023-01-26  9:09 UTC (permalink / raw)
  To: Calvin Wan, git
  Cc: Calvin Wan, emilyshaffer, avarab, phillip.wood123, newren,
	jonathantanmy

As Jonathan mentioned in [1], I think we should refactor functions out
from the serial implementation in a preparatory patch, then use those
functions to implement the parallel version in this patch. In its
current form, there is a fair amount of duplicated code, which makes it
tricky to review because of the additional overhead of checking what the
duplicated code does and whether we've copied it correcly.

For cleanliness, I'll only point out the duplicated code in this email;
I'll comment on other things I spotted in another one.

[1] https://lore.kernel.org/git/20221128210125.2751300-1-jonathantanmy@google.com/

Calvin Wan <calvinwan@google.com> writes:

> +		for (size_t i = 0; i < submodules.nr; i++) {
> +			struct submodule_status_util *util = submodules.items[i].util;
> +			struct cache_entry *ce = util->ce;
> +			unsigned int oldmode;
> +			const struct object_id *old_oid, *new_oid;
> +
> +			if (!util->changed && !util->dirty_submodule) {
> +				ce_mark_uptodate(ce);
> +				mark_fsmonitor_valid(istate, ce);
> +				if (!revs->diffopt.flags.find_copies_harder)
> +					continue;
> +			}
> +			oldmode = ce->ce_mode;
> +			old_oid = &ce->oid;
> +			new_oid = util->changed ? null_oid() : &ce->oid;
> +			diff_change(&revs->diffopt, oldmode, util->newmode,
> +				    old_oid, new_oid,
> +				    !is_null_oid(old_oid),
> +				    !is_null_oid(new_oid),
> +				    ce->name, 0, util->dirty_submodule);
> +		}
> +	}

The lines from "if (!util->changed && !util->dirty_submodule)" onwards
are copied from earlier in run_diff_files(). This might be refactored
into something like diff_submodule_change().

> +static struct status_task *
> +get_status_task_from_index(struct submodule_parallel_status *sps,
> +			   struct strbuf *err)
> +{
> +	for (; sps->index_count < sps->submodule_names->nr; sps->index_count++) {
> +		struct submodule_status_util *util = sps->submodule_names->items[sps->index_count].util;
> +		struct status_task *task;
> +		struct strbuf buf = STRBUF_INIT;
> +		const char *git_dir;
> +
> +		strbuf_addf(&buf, "%s/.git", util->path);
> +		git_dir = read_gitfile(buf.buf);

This...

> +static int get_next_submodule_status(struct child_process *cp,
> +				     struct strbuf *err, void *data,
> +				     void **task_cb)
> +{
> +	struct submodule_parallel_status *sps = data;
> +	struct status_task *task = get_status_task_from_index(sps, err);
> +
> +	if (!task)
> +		return 0;
> +
> +	child_process_init(cp);
> +	prepare_submodule_repo_env_in_gitdir(&cp->env);
> +
> +	strvec_init(&cp->args);
> +	strvec_pushl(&cp->args, "status", "--porcelain=2", NULL);
> +	if (task->ignore_untracked)
> +		strvec_push(&cp->args, "-uno");
> +
> +	prepare_submodule_repo_env(&cp->env);
> +	cp->git_cmd = 1;

this...

> +static int status_start_failure(struct strbuf *err,
> +				void *cb, void *task_cb)
> +{
> +	struct submodule_parallel_status *sps = cb;
> +	struct status_task *task = task_cb;
> +
> +	sps->result = 1;
> +	strbuf_addf(err,
> +	    _("could not run 'git status --porcelain=2' in submodule %s"),
> +	    task->path);
> +	return 0;
> +}

this...

> +static int status_finish(int retvalue, struct strbuf *err,
> +			 void *cb, void *task_cb)
> +{
> +	struct submodule_parallel_status *sps = cb;
> +	struct status_task *task = task_cb;
> +	struct string_list_item *it =
> +		string_list_lookup(sps->submodule_names, task->path);
> +	struct submodule_status_util *util = it->util;
> +
> +	if (retvalue) {
> +		sps->result = 1;
> +		strbuf_addf(err,
> +		    _("'git status --porcelain=2' failed in submodule %s"),
> +		    task->path);
> +	}

and this are all copied from different parts of is_submodule_modified().
To refactor them out, I think we could combine the first two into
"setup_submodule_status()". The last one could be moved into
"process_submodule_status_result()" or perhaps we could find a way to
combine it into parse_status_porcelain().

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v6 5/6] diff-lib: parallelize run_diff_files for submodules
  2023-01-17 19:30   ` [PATCH v6 5/6] diff-lib: parallelize run_diff_files for submodules Calvin Wan
  2023-01-26  9:09     ` Glen Choo
@ 2023-01-26  9:16     ` Glen Choo
  2023-01-26 18:52       ` Calvin Wan
  1 sibling, 1 reply; 86+ messages in thread
From: Glen Choo @ 2023-01-26  9:16 UTC (permalink / raw)
  To: Calvin Wan, git
  Cc: Calvin Wan, emilyshaffer, avarab, phillip.wood123, newren,
	jonathantanmy


Calvin Wan <calvinwan@google.com> writes:

> @@ -226,6 +242,8 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
>  			newmode = ce->ce_mode;
>  		} else {
>  			struct stat st;
> +			unsigned ignore_untracked = 0;
> +			int defer_submodule_status = !!revs->repo;

What is the reasoning behind this condition? I would expect revs->repo
to always be set, and we would always end up deferring.

>  			newmode = ce_mode_from_stat(ce, st.st_mode);
> +			if (defer_submodule_status) {
> +				struct submodule_status_util tmp = {
> +					.changed = changed,
> +					.dirty_submodule = 0,
> +					.ignore_untracked = ignore_untracked,
> +					.newmode = newmode,
> +					.ce = ce,
> +					.path = ce->name,
> +				};
> +				struct string_list_item *item;
> +
> +				item = string_list_append(&submodules, ce->name);
> +				item->util = xmalloc(sizeof(tmp));
> +				memcpy(item->util, &tmp, sizeof(tmp));

(Not a C expert) Since we don't return the string list, I wonder if we
can avoid the memcpy() by using &tmp like so:

  struct string_list_item *item;
  item = string_list_append(&submodules, ce->name);
  item->util = &tmp;

And then when we call string_list_clear(), we wouldn't need to free the
util since we exit the stack frame.

> +test_expect_success 'diff in superproject with submodules respects parallel settings' '
> +	test_when_finished "rm -f trace.out" &&
> +	(
> +		GIT_TRACE=$(pwd)/trace.out git diff &&
> +		grep "1 tasks" trace.out &&
> +		>trace.out &&
> +
> +		git config submodule.diffJobs 8 &&
> +		GIT_TRACE=$(pwd)/trace.out git diff &&
> +		grep "8 tasks" trace.out &&
> +		>trace.out &&
> +
> +		GIT_TRACE=$(pwd)/trace.out git -c submodule.diffJobs=0 diff &&
> +		grep "preparing to run up to [0-9]* tasks" trace.out &&
> +		! grep "up to 0 tasks" trace.out &&
> +		>trace.out
> +	)
> +'
> +

Could we get tests to check that the output of git diff isn't changed by
setting parallelism? This might not be feasible for submodule.diffJobs >
1 due to raciness, but it would be good to see for submodule.diffJobs =
1 at least.

>  test_expect_success 'git diff --raw HEAD' '
>  	hexsz=$(test_oid hexsz) &&
>  	git diff --raw --abbrev=$hexsz HEAD >actual &&
> diff --git a/t/t7506-status-submodule.sh b/t/t7506-status-submodule.sh
> index d050091345..52a82b703f 100755
> --- a/t/t7506-status-submodule.sh
> +++ b/t/t7506-status-submodule.sh
> @@ -412,4 +412,23 @@ test_expect_success 'status with added file in nested submodule (short)' '
>  	EOF
>  '
>  
> +test_expect_success 'status in superproject with submodules respects parallel settings' '
> +	test_when_finished "rm -f trace.out" &&
> +	(
> +		GIT_TRACE=$(pwd)/trace.out git status &&
> +		grep "1 tasks" trace.out &&
> +		>trace.out &&
> +
> +		git config submodule.diffJobs 8 &&
> +		GIT_TRACE=$(pwd)/trace.out git status &&
> +		grep "8 tasks" trace.out &&
> +		>trace.out &&
> +
> +		GIT_TRACE=$(pwd)/trace.out git -c submodule.diffJobs=0 status &&
> +		grep "preparing to run up to [0-9]* tasks" trace.out &&
> +		! grep "up to 0 tasks" trace.out &&
> +		>trace.out
> +	)
> +'
> +

Ditto for "status".

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v6 5/6] diff-lib: parallelize run_diff_files for submodules
  2023-01-26  9:16     ` Glen Choo
@ 2023-01-26 18:52       ` Calvin Wan
  0 siblings, 0 replies; 86+ messages in thread
From: Calvin Wan @ 2023-01-26 18:52 UTC (permalink / raw)
  To: Glen Choo
  Cc: git, emilyshaffer, avarab, phillip.wood123, newren, jonathantanmy

On Thu, Jan 26, 2023 at 1:16 AM Glen Choo <chooglen@google.com> wrote:
>
>
> Calvin Wan <calvinwan@google.com> writes:
>
> > @@ -226,6 +242,8 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
> >                       newmode = ce->ce_mode;
> >               } else {
> >                       struct stat st;
> > +                     unsigned ignore_untracked = 0;
> > +                     int defer_submodule_status = !!revs->repo;
>
> What is the reasoning behind this condition? I would expect revs->repo
> to always be set, and we would always end up deferring.

Ah looks like a vestigial sanity check. You're correct that we would
always be deferring anyways.

>
> >                       newmode = ce_mode_from_stat(ce, st.st_mode);
> > +                     if (defer_submodule_status) {
> > +                             struct submodule_status_util tmp = {
> > +                                     .changed = changed,
> > +                                     .dirty_submodule = 0,
> > +                                     .ignore_untracked = ignore_untracked,
> > +                                     .newmode = newmode,
> > +                                     .ce = ce,
> > +                                     .path = ce->name,
> > +                             };
> > +                             struct string_list_item *item;
> > +
> > +                             item = string_list_append(&submodules, ce->name);
> > +                             item->util = xmalloc(sizeof(tmp));
> > +                             memcpy(item->util, &tmp, sizeof(tmp));
>
> (Not a C expert) Since we don't return the string list, I wonder if we
> can avoid the memcpy() by using &tmp like so:
>
>   struct string_list_item *item;
>   item = string_list_append(&submodules, ce->name);
>   item->util = &tmp;
>
> And then when we call string_list_clear(), we wouldn't need to free the
> util since we exit the stack frame.

Unfortunately this doesn't work because tmp is deallocated off the stack
after changing scope.

> > +test_expect_success 'diff in superproject with submodules respects parallel settings' '
> > +     test_when_finished "rm -f trace.out" &&
> > +     (
> > +             GIT_TRACE=$(pwd)/trace.out git diff &&
> > +             grep "1 tasks" trace.out &&
> > +             >trace.out &&
> > +
> > +             git config submodule.diffJobs 8 &&
> > +             GIT_TRACE=$(pwd)/trace.out git diff &&
> > +             grep "8 tasks" trace.out &&
> > +             >trace.out &&
> > +
> > +             GIT_TRACE=$(pwd)/trace.out git -c submodule.diffJobs=0 diff &&
> > +             grep "preparing to run up to [0-9]* tasks" trace.out &&
> > +             ! grep "up to 0 tasks" trace.out &&
> > +             >trace.out
> > +     )
> > +'
> > +
>
> Could we get tests to check that the output of git diff isn't changed by
> setting parallelism? This might not be feasible for submodule.diffJobs >
> 1 due to raciness, but it would be good to see for submodule.diffJobs =
> 1 at least.

ack.

>
> >  test_expect_success 'git diff --raw HEAD' '
> >       hexsz=$(test_oid hexsz) &&
> >       git diff --raw --abbrev=$hexsz HEAD >actual &&
> > diff --git a/t/t7506-status-submodule.sh b/t/t7506-status-submodule.sh
> > index d050091345..52a82b703f 100755
> > --- a/t/t7506-status-submodule.sh
> > +++ b/t/t7506-status-submodule.sh
> > @@ -412,4 +412,23 @@ test_expect_success 'status with added file in nested submodule (short)' '
> >       EOF
> >  '
> >
> > +test_expect_success 'status in superproject with submodules respects parallel settings' '
> > +     test_when_finished "rm -f trace.out" &&
> > +     (
> > +             GIT_TRACE=$(pwd)/trace.out git status &&
> > +             grep "1 tasks" trace.out &&
> > +             >trace.out &&
> > +
> > +             git config submodule.diffJobs 8 &&
> > +             GIT_TRACE=$(pwd)/trace.out git status &&
> > +             grep "8 tasks" trace.out &&
> > +             >trace.out &&
> > +
> > +             GIT_TRACE=$(pwd)/trace.out git -c submodule.diffJobs=0 status &&
> > +             grep "preparing to run up to [0-9]* tasks" trace.out &&
> > +             ! grep "up to 0 tasks" trace.out &&
> > +             >trace.out
> > +     )
> > +'
> > +
>
> Ditto for "status".

ack.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH v7 0/7] submodule: parallelize diff
  2023-01-17 19:30   ` [PATCH v6 " Calvin Wan
@ 2023-02-07 18:16     ` Calvin Wan
  2023-02-08  0:55       ` Ævar Arnfjörð Bjarmason
                         ` (7 more replies)
  2023-02-07 18:17     ` [PATCH v7 1/7] run-command: add duplicate_output_fn to run_processes_parallel_opts Calvin Wan
                       ` (6 subsequent siblings)
  7 siblings, 8 replies; 86+ messages in thread
From: Calvin Wan @ 2023-02-07 18:16 UTC (permalink / raw)
  To: git; +Cc: Calvin Wan, avarab, chooglen, newren, jonathantanmy

Original cover letter for context:
https://lore.kernel.org/git/20221011232604.839941-1-calvinwan@google.com/

Changes since v6

Added patches 4 and 5 to refactor out more functionality so that it is
clear what changes my final patch makes. Since the large majority of
the functionality between the serial and parallel implementation is now
shared, I no longer remove the serial implementation.

Added additional tests to verify setting parallelism doesn't alter
output

Calvin Wan (7):
  run-command: add duplicate_output_fn to run_processes_parallel_opts
  submodule: strbuf variable rename
  submodule: move status parsing into function
  submodule: refactor is_submodule_modified()
  diff-lib: refactor out diff_change logic
  diff-lib: refactor match_stat_with_submodule
  diff-lib: parallelize run_diff_files for submodules

 Documentation/config/submodule.txt |  12 ++
 diff-lib.c                         | 133 +++++++++++---
 run-command.c                      |  16 +-
 run-command.h                      |  27 +++
 submodule.c                        | 274 ++++++++++++++++++++++++-----
 submodule.h                        |   9 +
 t/helper/test-run-command.c        |  21 +++
 t/t0061-run-command.sh             |  39 ++++
 t/t4027-diff-submodule.sh          |  31 ++++
 t/t7506-status-submodule.sh        |  25 +++
 10 files changed, 508 insertions(+), 79 deletions(-)

-- 
2.39.1.519.gcb327c4b5f-goog


^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH v7 1/7] run-command: add duplicate_output_fn to run_processes_parallel_opts
  2023-01-17 19:30   ` [PATCH v6 " Calvin Wan
  2023-02-07 18:16     ` [PATCH v7 0/7] " Calvin Wan
@ 2023-02-07 18:17     ` Calvin Wan
  2023-02-07 22:16       ` Ævar Arnfjörð Bjarmason
  2023-02-08 14:19       ` Phillip Wood
  2023-02-07 18:17     ` [PATCH v7 2/7] submodule: strbuf variable rename Calvin Wan
                       ` (5 subsequent siblings)
  7 siblings, 2 replies; 86+ messages in thread
From: Calvin Wan @ 2023-02-07 18:17 UTC (permalink / raw)
  To: git; +Cc: Calvin Wan, avarab, chooglen, newren, jonathantanmy

Add duplicate_output_fn as an optionally set function in
run_process_parallel_opts. If set, output from each child process is
copied and passed to the callback function whenever output from the
child process is buffered to allow for separate parsing.

Signed-off-by: Calvin Wan <calvinwan@google.com>
---
 run-command.c               | 16 ++++++++++++---
 run-command.h               | 27 +++++++++++++++++++++++++
 t/helper/test-run-command.c | 21 ++++++++++++++++++++
 t/t0061-run-command.sh      | 39 +++++++++++++++++++++++++++++++++++++
 4 files changed, 100 insertions(+), 3 deletions(-)

diff --git a/run-command.c b/run-command.c
index 756f1839aa..cad88befe0 100644
--- a/run-command.c
+++ b/run-command.c
@@ -1526,6 +1526,9 @@ static void pp_init(struct parallel_processes *pp,
 	if (!opts->get_next_task)
 		BUG("you need to specify a get_next_task function");
 
+	if (opts->duplicate_output && opts->ungroup)
+		BUG("duplicate_output and ungroup are incompatible with each other");
+
 	CALLOC_ARRAY(pp->children, n);
 	if (!opts->ungroup)
 		CALLOC_ARRAY(pp->pfd, n);
@@ -1645,14 +1648,21 @@ static void pp_buffer_stderr(struct parallel_processes *pp,
 	for (size_t i = 0; i < opts->processes; i++) {
 		if (pp->children[i].state == GIT_CP_WORKING &&
 		    pp->pfd[i].revents & (POLLIN | POLLHUP)) {
-			int n = strbuf_read_once(&pp->children[i].err,
-						 pp->children[i].process.err, 0);
+			ssize_t n = strbuf_read_once(&pp->children[i].err,
+						     pp->children[i].process.err, 0);
 			if (n == 0) {
 				close(pp->children[i].process.err);
 				pp->children[i].state = GIT_CP_WAIT_CLEANUP;
-			} else if (n < 0)
+			} else if (n < 0) {
 				if (errno != EAGAIN)
 					die_errno("read");
+			} else {
+				if (opts->duplicate_output)
+					opts->duplicate_output(&pp->children[i].err,
+					       strlen(pp->children[i].err.buf) - n,
+					       opts->data,
+					       pp->children[i].data);
+			}
 		}
 	}
 }
diff --git a/run-command.h b/run-command.h
index 072db56a4d..6dcf999f6c 100644
--- a/run-command.h
+++ b/run-command.h
@@ -408,6 +408,27 @@ typedef int (*start_failure_fn)(struct strbuf *out,
 				void *pp_cb,
 				void *pp_task_cb);
 
+/**
+ * This callback is called whenever output from a child process is buffered
+ * 
+ * See run_processes_parallel() below for a discussion of the "struct
+ * strbuf *out" parameter.
+ * 
+ * The offset refers to the number of bytes originally in "out" before
+ * the output from the child process was buffered. Therefore, the buffer
+ * range, "out + buf" to the end of "out", would contain the buffer of
+ * the child process output.
+ *
+ * pp_cb is the callback cookie as passed into run_processes_parallel,
+ * pp_task_cb is the callback cookie as passed into get_next_task_fn.
+ *
+ * This function is incompatible with "ungroup"
+ */
+typedef void (*duplicate_output_fn)(struct strbuf *out,
+				    size_t offset,
+				    void *pp_cb,
+				    void *pp_task_cb);
+
 /**
  * This callback is called on every child process that finished processing.
  *
@@ -461,6 +482,12 @@ struct run_process_parallel_opts
 	 */
 	start_failure_fn start_failure;
 
+	/**
+	 * duplicate_output: See duplicate_output_fn() above. This should be
+	 * NULL unless process specific output is needed
+	 */
+	duplicate_output_fn duplicate_output;
+
 	/**
 	 * task_finished: See task_finished_fn() above. This can be
 	 * NULL to omit any special handling.
diff --git a/t/helper/test-run-command.c b/t/helper/test-run-command.c
index 3ecb830f4a..ffd3cd0045 100644
--- a/t/helper/test-run-command.c
+++ b/t/helper/test-run-command.c
@@ -52,6 +52,21 @@ static int no_job(struct child_process *cp,
 	return 0;
 }
 
+static void duplicate_output(struct strbuf *out,
+			size_t offset,
+			void *pp_cb UNUSED,
+			void *pp_task_cb UNUSED)
+{
+	struct string_list list = STRING_LIST_INIT_DUP;
+
+	string_list_split(&list, out->buf + offset, '\n', -1);
+	for (size_t i = 0; i < list.nr; i++) {
+		if (strlen(list.items[i].string) > 0)
+			fprintf(stderr, "duplicate_output: %s\n", list.items[i].string);
+	}
+	string_list_clear(&list, 0);
+}
+
 static int task_finished(int result,
 			 struct strbuf *err,
 			 void *pp_cb,
@@ -439,6 +454,12 @@ int cmd__run_command(int argc, const char **argv)
 		opts.ungroup = 1;
 	}
 
+	if (!strcmp(argv[1], "--duplicate-output")) {
+		argv += 1;
+		argc -= 1;
+		opts.duplicate_output = duplicate_output;
+	}
+
 	jobs = atoi(argv[2]);
 	strvec_clear(&proc.args);
 	strvec_pushv(&proc.args, (const char **)argv + 3);
diff --git a/t/t0061-run-command.sh b/t/t0061-run-command.sh
index e2411f6a9b..879e536638 100755
--- a/t/t0061-run-command.sh
+++ b/t/t0061-run-command.sh
@@ -135,6 +135,15 @@ test_expect_success 'run_command runs in parallel with more jobs available than
 	test_cmp expect actual
 '
 
+test_expect_success 'run_command runs in parallel with more jobs available than tasks --duplicate-output' '
+	test-tool run-command --duplicate-output run-command-parallel 5 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
+	test_must_be_empty out &&
+	test 4 = $(grep -c "duplicate_output: Hello" err) &&
+	test 4 = $(grep -c "duplicate_output: World" err) &&
+	sed "/duplicate_output/d" err > err1 &&
+	test_cmp expect err1
+'
+
 test_expect_success 'run_command runs ungrouped in parallel with more jobs available than tasks' '
 	test-tool run-command --ungroup run-command-parallel 5 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
 	test_line_count = 8 out &&
@@ -147,6 +156,15 @@ test_expect_success 'run_command runs in parallel with as many jobs as tasks' '
 	test_cmp expect actual
 '
 
+test_expect_success 'run_command runs in parallel with as many jobs as tasks --duplicate-output' '
+	test-tool run-command --duplicate-output run-command-parallel 4 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
+	test_must_be_empty out &&
+	test 4 = $(grep -c "duplicate_output: Hello" err) &&
+	test 4 = $(grep -c "duplicate_output: World" err) &&
+	sed "/duplicate_output/d" err > err1 &&
+	test_cmp expect err1
+'
+
 test_expect_success 'run_command runs ungrouped in parallel with as many jobs as tasks' '
 	test-tool run-command --ungroup run-command-parallel 4 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
 	test_line_count = 8 out &&
@@ -159,6 +177,15 @@ test_expect_success 'run_command runs in parallel with more tasks than jobs avai
 	test_cmp expect actual
 '
 
+test_expect_success 'run_command runs in parallel with more tasks than jobs available --duplicate-output' '
+	test-tool run-command --duplicate-output run-command-parallel 3 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
+	test_must_be_empty out &&
+	test 4 = $(grep -c "duplicate_output: Hello" err) &&
+	test 4 = $(grep -c "duplicate_output: World" err) &&
+	sed "/duplicate_output/d" err > err1 &&
+	test_cmp expect err1
+'
+
 test_expect_success 'run_command runs ungrouped in parallel with more tasks than jobs available' '
 	test-tool run-command --ungroup run-command-parallel 3 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
 	test_line_count = 8 out &&
@@ -180,6 +207,12 @@ test_expect_success 'run_command is asked to abort gracefully' '
 	test_cmp expect actual
 '
 
+test_expect_success 'run_command is asked to abort gracefully --duplicate-output' '
+	test-tool run-command --duplicate-output run-command-abort 3 false >out 2>err &&
+	test_must_be_empty out &&
+	test_cmp expect err
+'
+
 test_expect_success 'run_command is asked to abort gracefully (ungroup)' '
 	test-tool run-command --ungroup run-command-abort 3 false >out 2>err &&
 	test_must_be_empty out &&
@@ -196,6 +229,12 @@ test_expect_success 'run_command outputs ' '
 	test_cmp expect actual
 '
 
+test_expect_success 'run_command outputs --duplicate-output' '
+	test-tool run-command --duplicate-output run-command-no-jobs 3 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
+	test_must_be_empty out &&
+	test_cmp expect err
+'
+
 test_expect_success 'run_command outputs (ungroup) ' '
 	test-tool run-command --ungroup run-command-no-jobs 3 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
 	test_must_be_empty out &&
-- 
2.39.1.519.gcb327c4b5f-goog


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v7 2/7] submodule: strbuf variable rename
  2023-01-17 19:30   ` [PATCH v6 " Calvin Wan
  2023-02-07 18:16     ` [PATCH v7 0/7] " Calvin Wan
  2023-02-07 18:17     ` [PATCH v7 1/7] run-command: add duplicate_output_fn to run_processes_parallel_opts Calvin Wan
@ 2023-02-07 18:17     ` Calvin Wan
  2023-02-07 22:47       ` Ævar Arnfjörð Bjarmason
  2023-02-07 18:17     ` [PATCH v7 3/7] submodule: move status parsing into function Calvin Wan
                       ` (4 subsequent siblings)
  7 siblings, 1 reply; 86+ messages in thread
From: Calvin Wan @ 2023-02-07 18:17 UTC (permalink / raw)
  To: git; +Cc: Calvin Wan, avarab, chooglen, newren, jonathantanmy

A prepatory change for a future patch that moves the status parsing
logic to a separate function.

Signed-off-by: Calvin Wan <calvinwan@google.com>
---
 submodule.c | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/submodule.c b/submodule.c
index fae24ef34a..faf37c1101 100644
--- a/submodule.c
+++ b/submodule.c
@@ -1906,25 +1906,28 @@ unsigned is_submodule_modified(const char *path, int ignore_untracked)
 
 	fp = xfdopen(cp.out, "r");
 	while (strbuf_getwholeline(&buf, fp, '\n') != EOF) {
+		char *str = buf.buf;
+		const size_t len = buf.len;
+
 		/* regular untracked files */
-		if (buf.buf[0] == '?')
+		if (str[0] == '?')
 			dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
 
-		if (buf.buf[0] == 'u' ||
-		    buf.buf[0] == '1' ||
-		    buf.buf[0] == '2') {
+		if (str[0] == 'u' ||
+		    str[0] == '1' ||
+		    str[0] == '2') {
 			/* T = line type, XY = status, SSSS = submodule state */
-			if (buf.len < strlen("T XY SSSS"))
+			if (len < strlen("T XY SSSS"))
 				BUG("invalid status --porcelain=2 line %s",
-				    buf.buf);
+				    str);
 
-			if (buf.buf[5] == 'S' && buf.buf[8] == 'U')
+			if (str[5] == 'S' && str[8] == 'U')
 				/* nested untracked file */
 				dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
 
-			if (buf.buf[0] == 'u' ||
-			    buf.buf[0] == '2' ||
-			    memcmp(buf.buf + 5, "S..U", 4))
+			if (str[0] == 'u' ||
+			    str[0] == '2' ||
+			    memcmp(str + 5, "S..U", 4))
 				/* other change */
 				dirty_submodule |= DIRTY_SUBMODULE_MODIFIED;
 		}
-- 
2.39.1.519.gcb327c4b5f-goog


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v7 3/7] submodule: move status parsing into function
  2023-01-17 19:30   ` [PATCH v6 " Calvin Wan
                       ` (2 preceding siblings ...)
  2023-02-07 18:17     ` [PATCH v7 2/7] submodule: strbuf variable rename Calvin Wan
@ 2023-02-07 18:17     ` Calvin Wan
  2023-02-07 18:17     ` [PATCH v7 4/7] submodule: refactor is_submodule_modified() Calvin Wan
                       ` (3 subsequent siblings)
  7 siblings, 0 replies; 86+ messages in thread
From: Calvin Wan @ 2023-02-07 18:17 UTC (permalink / raw)
  To: git; +Cc: Calvin Wan, avarab, chooglen, newren, jonathantanmy

A future patch requires the ability to parse the output of git
status --porcelain=2. Move parsing code from is_submodule_modified to
parse_status_porcelain.

Signed-off-by: Calvin Wan <calvinwan@google.com>
---
 submodule.c | 74 ++++++++++++++++++++++++++++++-----------------------
 1 file changed, 42 insertions(+), 32 deletions(-)

diff --git a/submodule.c b/submodule.c
index faf37c1101..768d4b4cd7 100644
--- a/submodule.c
+++ b/submodule.c
@@ -1870,6 +1870,45 @@ int fetch_submodules(struct repository *r,
 	return spf.result;
 }
 
+static int parse_status_porcelain(char *str, size_t len,
+				  unsigned *dirty_submodule,
+				  int ignore_untracked)
+{
+	/* regular untracked files */
+	if (str[0] == '?')
+		*dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
+
+	if (str[0] == 'u' ||
+	    str[0] == '1' ||
+	    str[0] == '2') {
+		/* T = line type, XY = status, SSSS = submodule state */
+		if (len < strlen("T XY SSSS"))
+			BUG("invalid status --porcelain=2 line %s",
+			    str);
+
+		if (str[5] == 'S' && str[8] == 'U')
+			/* nested untracked file */
+			*dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
+
+		if (str[0] == 'u' ||
+		    str[0] == '2' ||
+		    memcmp(str + 5, "S..U", 4))
+			/* other change */
+			*dirty_submodule |= DIRTY_SUBMODULE_MODIFIED;
+	}
+
+	if ((*dirty_submodule & DIRTY_SUBMODULE_MODIFIED) &&
+	    ((*dirty_submodule & DIRTY_SUBMODULE_UNTRACKED) ||
+	     ignore_untracked)) {
+		/*
+		* We're not interested in any further information from
+		* the child any more, neither output nor its exit code.
+		*/
+		return 1;
+	}
+	return 0;
+}
+
 unsigned is_submodule_modified(const char *path, int ignore_untracked)
 {
 	struct child_process cp = CHILD_PROCESS_INIT;
@@ -1909,39 +1948,10 @@ unsigned is_submodule_modified(const char *path, int ignore_untracked)
 		char *str = buf.buf;
 		const size_t len = buf.len;
 
-		/* regular untracked files */
-		if (str[0] == '?')
-			dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
-
-		if (str[0] == 'u' ||
-		    str[0] == '1' ||
-		    str[0] == '2') {
-			/* T = line type, XY = status, SSSS = submodule state */
-			if (len < strlen("T XY SSSS"))
-				BUG("invalid status --porcelain=2 line %s",
-				    str);
-
-			if (str[5] == 'S' && str[8] == 'U')
-				/* nested untracked file */
-				dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
-
-			if (str[0] == 'u' ||
-			    str[0] == '2' ||
-			    memcmp(str + 5, "S..U", 4))
-				/* other change */
-				dirty_submodule |= DIRTY_SUBMODULE_MODIFIED;
-		}
-
-		if ((dirty_submodule & DIRTY_SUBMODULE_MODIFIED) &&
-		    ((dirty_submodule & DIRTY_SUBMODULE_UNTRACKED) ||
-		     ignore_untracked)) {
-			/*
-			 * We're not interested in any further information from
-			 * the child any more, neither output nor its exit code.
-			 */
-			ignore_cp_exit_code = 1;
+		ignore_cp_exit_code = parse_status_porcelain(str, len, &dirty_submodule,
+							     ignore_untracked);
+		if (ignore_cp_exit_code)
 			break;
-		}
 	}
 	fclose(fp);
 
-- 
2.39.1.519.gcb327c4b5f-goog


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v7 4/7] submodule: refactor is_submodule_modified()
  2023-01-17 19:30   ` [PATCH v6 " Calvin Wan
                       ` (3 preceding siblings ...)
  2023-02-07 18:17     ` [PATCH v7 3/7] submodule: move status parsing into function Calvin Wan
@ 2023-02-07 18:17     ` Calvin Wan
  2023-02-07 22:59       ` Ævar Arnfjörð Bjarmason
  2023-02-07 18:17     ` [PATCH v7 5/7] diff-lib: refactor out diff_change logic Calvin Wan
                       ` (2 subsequent siblings)
  7 siblings, 1 reply; 86+ messages in thread
From: Calvin Wan @ 2023-02-07 18:17 UTC (permalink / raw)
  To: git; +Cc: Calvin Wan, avarab, chooglen, newren, jonathantanmy

Refactor out submodule status logic and error messages that will be
used in a future patch.

Signed-off-by: Calvin Wan <calvinwan@google.com>
---
 submodule.c | 65 ++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 42 insertions(+), 23 deletions(-)

diff --git a/submodule.c b/submodule.c
index 768d4b4cd7..d88aa2c573 100644
--- a/submodule.c
+++ b/submodule.c
@@ -28,6 +28,10 @@ static int config_update_recurse_submodules = RECURSE_SUBMODULES_OFF;
 static int initialized_fetch_ref_tips;
 static struct oid_array ref_tips_before_fetch;
 static struct oid_array ref_tips_after_fetch;
+static const char *status_porcelain_start_error =
+	N_("could not run 'git status --porcelain=2' in submodule %s");
+static const char *status_porcelain_fail_error =
+	N_("'git status --porcelain=2' failed in submodule %s");
 
 /*
  * Check if the .gitmodules file is unmerged. Parsing of the .gitmodules file
@@ -1870,6 +1874,40 @@ int fetch_submodules(struct repository *r,
 	return spf.result;
 }
 
+static int verify_submodule_git_directory(const char *path)
+{
+	const char *git_dir;
+	struct strbuf buf = STRBUF_INIT;
+
+	strbuf_addf(&buf, "%s/.git", path);
+	git_dir = read_gitfile(buf.buf);
+	if (!git_dir)
+		git_dir = buf.buf;
+	if (!is_git_directory(git_dir)) {
+		if (is_directory(git_dir))
+			die(_("'%s' not recognized as a git repository"), git_dir);
+		strbuf_release(&buf);
+		/* The submodule is not checked out, so it is not modified */
+		return 0;
+	}
+	strbuf_release(&buf);
+	return 1;
+}
+
+static void prepare_status_porcelain(struct child_process *cp,
+			     const char *path, int ignore_untracked)
+{
+	strvec_pushl(&cp->args, "status", "--porcelain=2", NULL);
+	if (ignore_untracked)
+		strvec_push(&cp->args, "-uno");
+
+	prepare_submodule_repo_env(&cp->env);
+	cp->git_cmd = 1;
+	cp->no_stdin = 1;
+	cp->out = -1;
+	cp->dir = path;
+}
+
 static int parse_status_porcelain(char *str, size_t len,
 				  unsigned *dirty_submodule,
 				  int ignore_untracked)
@@ -1915,33 +1953,14 @@ unsigned is_submodule_modified(const char *path, int ignore_untracked)
 	struct strbuf buf = STRBUF_INIT;
 	FILE *fp;
 	unsigned dirty_submodule = 0;
-	const char *git_dir;
 	int ignore_cp_exit_code = 0;
 
-	strbuf_addf(&buf, "%s/.git", path);
-	git_dir = read_gitfile(buf.buf);
-	if (!git_dir)
-		git_dir = buf.buf;
-	if (!is_git_directory(git_dir)) {
-		if (is_directory(git_dir))
-			die(_("'%s' not recognized as a git repository"), git_dir);
-		strbuf_release(&buf);
-		/* The submodule is not checked out, so it is not modified */
+	if (!verify_submodule_git_directory(path))
 		return 0;
-	}
-	strbuf_reset(&buf);
-
-	strvec_pushl(&cp.args, "status", "--porcelain=2", NULL);
-	if (ignore_untracked)
-		strvec_push(&cp.args, "-uno");
 
-	prepare_submodule_repo_env(&cp.env);
-	cp.git_cmd = 1;
-	cp.no_stdin = 1;
-	cp.out = -1;
-	cp.dir = path;
+	prepare_status_porcelain(&cp, path, ignore_untracked);
 	if (start_command(&cp))
-		die(_("Could not run 'git status --porcelain=2' in submodule %s"), path);
+		die(_(status_porcelain_start_error), path);
 
 	fp = xfdopen(cp.out, "r");
 	while (strbuf_getwholeline(&buf, fp, '\n') != EOF) {
@@ -1956,7 +1975,7 @@ unsigned is_submodule_modified(const char *path, int ignore_untracked)
 	fclose(fp);
 
 	if (finish_command(&cp) && !ignore_cp_exit_code)
-		die(_("'git status --porcelain=2' failed in submodule %s"), path);
+		die(_(status_porcelain_fail_error), path);
 
 	strbuf_release(&buf);
 	return dirty_submodule;
-- 
2.39.1.519.gcb327c4b5f-goog


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v7 5/7] diff-lib: refactor out diff_change logic
  2023-01-17 19:30   ` [PATCH v6 " Calvin Wan
                       ` (4 preceding siblings ...)
  2023-02-07 18:17     ` [PATCH v7 4/7] submodule: refactor is_submodule_modified() Calvin Wan
@ 2023-02-07 18:17     ` Calvin Wan
  2023-02-08 14:28       ` Phillip Wood
  2023-02-07 18:17     ` [PATCH v7 6/7] diff-lib: refactor match_stat_with_submodule Calvin Wan
  2023-02-07 18:17     ` [PATCH v7 7/7] diff-lib: parallelize run_diff_files for submodules Calvin Wan
  7 siblings, 1 reply; 86+ messages in thread
From: Calvin Wan @ 2023-02-07 18:17 UTC (permalink / raw)
  To: git; +Cc: Calvin Wan, avarab, chooglen, newren, jonathantanmy

Refactor out logic that sets up the diff_change call into a helper
function for a future patch.

Signed-off-by: Calvin Wan <calvinwan@google.com>
---
 diff-lib.c | 46 +++++++++++++++++++++++++++++-----------------
 1 file changed, 29 insertions(+), 17 deletions(-)

diff --git a/diff-lib.c b/diff-lib.c
index dec040c366..7101cfda3f 100644
--- a/diff-lib.c
+++ b/diff-lib.c
@@ -88,6 +88,31 @@ static int match_stat_with_submodule(struct diff_options *diffopt,
 	return changed;
 }
 
+static int diff_change_helper(struct diff_options *options,
+	      unsigned newmode, unsigned dirty_submodule,
+	      int changed, struct index_state *istate,
+	      struct cache_entry *ce)
+{
+	unsigned int oldmode;
+	const struct object_id *old_oid, *new_oid;
+
+	if (!changed && !dirty_submodule) {
+		ce_mark_uptodate(ce);
+		mark_fsmonitor_valid(istate, ce);
+		if (!options->flags.find_copies_harder)
+			return 1;
+	}
+	oldmode = ce->ce_mode;
+	old_oid = &ce->oid;
+	new_oid = changed ? null_oid() : &ce->oid;
+	diff_change(options, oldmode, newmode,
+			old_oid, new_oid,
+			!is_null_oid(old_oid),
+			!is_null_oid(new_oid),
+			ce->name, 0, dirty_submodule);
+	return 0;
+}
+
 int run_diff_files(struct rev_info *revs, unsigned int option)
 {
 	int entries, i;
@@ -105,11 +130,10 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
 		diff_unmerged_stage = 2;
 	entries = istate->cache_nr;
 	for (i = 0; i < entries; i++) {
-		unsigned int oldmode, newmode;
+		unsigned int newmode;
 		struct cache_entry *ce = istate->cache[i];
 		int changed;
 		unsigned dirty_submodule = 0;
-		const struct object_id *old_oid, *new_oid;
 
 		if (diff_can_quit_early(&revs->diffopt))
 			break;
@@ -245,21 +269,9 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
 			newmode = ce_mode_from_stat(ce, st.st_mode);
 		}
 
-		if (!changed && !dirty_submodule) {
-			ce_mark_uptodate(ce);
-			mark_fsmonitor_valid(istate, ce);
-			if (!revs->diffopt.flags.find_copies_harder)
-				continue;
-		}
-		oldmode = ce->ce_mode;
-		old_oid = &ce->oid;
-		new_oid = changed ? null_oid() : &ce->oid;
-		diff_change(&revs->diffopt, oldmode, newmode,
-			    old_oid, new_oid,
-			    !is_null_oid(old_oid),
-			    !is_null_oid(new_oid),
-			    ce->name, 0, dirty_submodule);
-
+		if (diff_change_helper(&revs->diffopt, newmode, dirty_submodule,
+				       changed, istate, ce))
+			continue;
 	}
 	diffcore_std(&revs->diffopt);
 	diff_flush(&revs->diffopt);
-- 
2.39.1.519.gcb327c4b5f-goog


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v7 6/7] diff-lib: refactor match_stat_with_submodule
  2023-01-17 19:30   ` [PATCH v6 " Calvin Wan
                       ` (5 preceding siblings ...)
  2023-02-07 18:17     ` [PATCH v7 5/7] diff-lib: refactor out diff_change logic Calvin Wan
@ 2023-02-07 18:17     ` Calvin Wan
  2023-02-08  8:18       ` Ævar Arnfjörð Bjarmason
  2023-02-08 14:22       ` Phillip Wood
  2023-02-07 18:17     ` [PATCH v7 7/7] diff-lib: parallelize run_diff_files for submodules Calvin Wan
  7 siblings, 2 replies; 86+ messages in thread
From: Calvin Wan @ 2023-02-07 18:17 UTC (permalink / raw)
  To: git; +Cc: Calvin Wan, avarab, chooglen, newren, jonathantanmy

Flatten out the if statements in match_stat_with_submodule so the
logic is more readable and easier for future patches to add to.
orig_flags didn't need to be set if the cache entry wasn't a
GITLINK so defer setting it.

Signed-off-by: Calvin Wan <calvinwan@google.com>
---
 diff-lib.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/diff-lib.c b/diff-lib.c
index 7101cfda3f..e18c886a80 100644
--- a/diff-lib.c
+++ b/diff-lib.c
@@ -73,18 +73,24 @@ static int match_stat_with_submodule(struct diff_options *diffopt,
 				     unsigned *dirty_submodule)
 {
 	int changed = ie_match_stat(diffopt->repo->index, ce, st, ce_option);
-	if (S_ISGITLINK(ce->ce_mode)) {
-		struct diff_flags orig_flags = diffopt->flags;
-		if (!diffopt->flags.override_submodule_config)
-			set_diffopt_flags_from_submodule_config(diffopt, ce->name);
-		if (diffopt->flags.ignore_submodules)
-			changed = 0;
-		else if (!diffopt->flags.ignore_dirty_submodules &&
-			 (!changed || diffopt->flags.dirty_submodules))
-			*dirty_submodule = is_submodule_modified(ce->name,
-								 diffopt->flags.ignore_untracked_in_submodules);
-		diffopt->flags = orig_flags;
+	struct diff_flags orig_flags;
+
+	if (!S_ISGITLINK(ce->ce_mode))
+		return changed;
+
+	orig_flags = diffopt->flags;
+	if (!diffopt->flags.override_submodule_config)
+		set_diffopt_flags_from_submodule_config(diffopt, ce->name);
+	if (diffopt->flags.ignore_submodules) {
+		changed = 0;
+		goto cleanup;
 	}
+	if (!diffopt->flags.ignore_dirty_submodules &&
+	    (!changed || diffopt->flags.dirty_submodules))
+		*dirty_submodule = is_submodule_modified(ce->name,
+					 diffopt->flags.ignore_untracked_in_submodules);
+cleanup:
+	diffopt->flags = orig_flags;
 	return changed;
 }
 
-- 
2.39.1.519.gcb327c4b5f-goog


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v7 7/7] diff-lib: parallelize run_diff_files for submodules
  2023-01-17 19:30   ` [PATCH v6 " Calvin Wan
                       ` (6 preceding siblings ...)
  2023-02-07 18:17     ` [PATCH v7 6/7] diff-lib: refactor match_stat_with_submodule Calvin Wan
@ 2023-02-07 18:17     ` Calvin Wan
  2023-02-07 23:06       ` Ævar Arnfjörð Bjarmason
  7 siblings, 1 reply; 86+ messages in thread
From: Calvin Wan @ 2023-02-07 18:17 UTC (permalink / raw)
  To: git; +Cc: Calvin Wan, avarab, chooglen, newren, jonathantanmy

During the iteration of the index entries in run_diff_files, whenever
a submodule is found and needs its status checked, a subprocess is
spawned for it. Instead of spawning the subprocess immediately and
waiting for its completion to continue, hold onto all submodules and
relevant information in a list. Then use that list to create tasks for
run_processes_parallel. Subprocess output is duplicated and passed to
status_pipe_output which stores it to be parsed on completion of the
subprocess.

Add config option submodule.diffJobs to set the maximum number
of parallel jobs. The option defaults to 1 if unset. If set to 0, the
number of jobs is set to online_cpus().

Since run_diff_files is called from many different commands, I chose
to grab the config option in the function rather than adding variables
to every git command and then figuring out how to pass them all in.

Signed-off-by: Calvin Wan <calvinwan@google.com>
---
 Documentation/config/submodule.txt |  12 +++
 diff-lib.c                         |  71 ++++++++++++--
 submodule.c                        | 148 +++++++++++++++++++++++++++++
 submodule.h                        |   9 ++
 t/t4027-diff-submodule.sh          |  31 ++++++
 t/t7506-status-submodule.sh        |  25 +++++
 6 files changed, 289 insertions(+), 7 deletions(-)

diff --git a/Documentation/config/submodule.txt b/Documentation/config/submodule.txt
index 6490527b45..3209eb8117 100644
--- a/Documentation/config/submodule.txt
+++ b/Documentation/config/submodule.txt
@@ -93,6 +93,18 @@ submodule.fetchJobs::
 	in parallel. A value of 0 will give some reasonable default.
 	If unset, it defaults to 1.
 
+submodule.diffJobs::
+	Specifies how many submodules are diffed at the same time. A
+	positive integer allows up to that number of submodules diffed
+	in parallel. A value of 0 will give some reasonable default.
+	If unset, it defaults to 1. The diff operation is used by many
+	other git commands such as add, merge, diff, status, stash and
+	more. Note that the expensive part of the diff operation is
+	reading the index from cache or memory. Therefore multiple jobs
+	may be detrimental to performance if your hardware does not
+	support parallel reads or if the number of jobs greatly exceeds
+	the amount of supported reads.
+
 submodule.alternateLocation::
 	Specifies how the submodules obtain alternates when submodules are
 	cloned. Possible values are `no`, `superproject`.
diff --git a/diff-lib.c b/diff-lib.c
index e18c886a80..f91cd73ae7 100644
--- a/diff-lib.c
+++ b/diff-lib.c
@@ -14,6 +14,7 @@
 #include "dir.h"
 #include "fsmonitor.h"
 #include "commit-reach.h"
+#include "config.h"
 
 /*
  * diff-files
@@ -65,18 +66,23 @@ static int check_removed(const struct index_state *istate, const struct cache_en
  * Return 1 when changes are detected, 0 otherwise. If the DIRTY_SUBMODULES
  * option is set, the caller does not only want to know if a submodule is
  * modified at all but wants to know all the conditions that are met (new
- * commits, untracked content and/or modified content).
+ * commits, untracked content and/or modified content). If
+ * defer_submodule_status bit is set, dirty_submodule will be left to the
+ * caller to set. defer_submodule_status can also be set to 0 in this
+ * function if there is no need to check if the submodule is modified.
  */
 static int match_stat_with_submodule(struct diff_options *diffopt,
 				     const struct cache_entry *ce,
 				     struct stat *st, unsigned ce_option,
-				     unsigned *dirty_submodule)
+				     unsigned *dirty_submodule, int *defer_submodule_status,
+				     unsigned *ignore_untracked)
 {
 	int changed = ie_match_stat(diffopt->repo->index, ce, st, ce_option);
 	struct diff_flags orig_flags;
+	int defer = 0;
 
 	if (!S_ISGITLINK(ce->ce_mode))
-		return changed;
+		goto ret;
 
 	orig_flags = diffopt->flags;
 	if (!diffopt->flags.override_submodule_config)
@@ -86,11 +92,20 @@ static int match_stat_with_submodule(struct diff_options *diffopt,
 		goto cleanup;
 	}
 	if (!diffopt->flags.ignore_dirty_submodules &&
-	    (!changed || diffopt->flags.dirty_submodules))
-		*dirty_submodule = is_submodule_modified(ce->name,
+	    (!changed || diffopt->flags.dirty_submodules)) {
+		if (defer_submodule_status && *defer_submodule_status) {
+			defer = 1;
+			*ignore_untracked = diffopt->flags.ignore_untracked_in_submodules;
+		} else {
+			*dirty_submodule = is_submodule_modified(ce->name,
 					 diffopt->flags.ignore_untracked_in_submodules);
+		}
+	}
 cleanup:
 	diffopt->flags = orig_flags;
+ret:
+	if (defer_submodule_status)
+		*defer_submodule_status = defer;
 	return changed;
 }
 
@@ -127,6 +142,7 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
 			      ? CE_MATCH_RACY_IS_DIRTY : 0);
 	uint64_t start = getnanotime();
 	struct index_state *istate = revs->diffopt.repo->index;
+	struct string_list submodules = STRING_LIST_INIT_NODUP;
 
 	diff_set_mnemonic_prefix(&revs->diffopt, "i/", "w/");
 
@@ -250,6 +266,8 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
 			newmode = ce->ce_mode;
 		} else {
 			struct stat st;
+			unsigned ignore_untracked = 0;
+			int defer_submodule_status = 1;
 
 			changed = check_removed(istate, ce, &st);
 			if (changed) {
@@ -271,14 +289,53 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
 			}
 
 			changed = match_stat_with_submodule(&revs->diffopt, ce, &st,
-							    ce_option, &dirty_submodule);
+							    ce_option, &dirty_submodule,
+							    &defer_submodule_status,
+							    &ignore_untracked);
 			newmode = ce_mode_from_stat(ce, st.st_mode);
+			if (defer_submodule_status) {
+				struct submodule_status_util tmp = {
+					.changed = changed,
+					.dirty_submodule = 0,
+					.ignore_untracked = ignore_untracked,
+					.newmode = newmode,
+					.ce = ce,
+					.path = ce->name,
+				};
+				struct string_list_item *item;
+
+				item = string_list_append(&submodules, ce->name);
+				item->util = xmalloc(sizeof(tmp));
+				memcpy(item->util, &tmp, sizeof(tmp));
+				continue;
+			}
 		}
 
 		if (diff_change_helper(&revs->diffopt, newmode, dirty_submodule,
 				       changed, istate, ce))
 			continue;
 	}
+	if (submodules.nr > 0) {
+		int parallel_jobs;
+		if (git_config_get_int("submodule.diffjobs", &parallel_jobs))
+			parallel_jobs = 1;
+		else if (!parallel_jobs)
+			parallel_jobs = online_cpus();
+		else if (parallel_jobs < 0)
+			die(_("submodule.diffjobs cannot be negative"));
+
+		if (get_submodules_status(&submodules, parallel_jobs))
+			die(_("submodule status failed"));
+		for (size_t i = 0; i < submodules.nr; i++) {
+			struct submodule_status_util *util = submodules.items[i].util;
+
+			if (diff_change_helper(&revs->diffopt, util->newmode,
+				       util->dirty_submodule, util->changed,
+				       istate, util->ce))
+				continue;
+		}
+	}
+	string_list_clear(&submodules, 1);
 	diffcore_std(&revs->diffopt);
 	diff_flush(&revs->diffopt);
 	trace_performance_since(start, "diff-files");
@@ -326,7 +383,7 @@ static int get_stat_data(const struct index_state *istate,
 			return -1;
 		}
 		changed = match_stat_with_submodule(diffopt, ce, &st,
-						    0, dirty_submodule);
+						    0, dirty_submodule, NULL, NULL);
 		if (changed) {
 			mode = ce_mode_from_stat(ce, st.st_mode);
 			oid = null_oid();
diff --git a/submodule.c b/submodule.c
index d88aa2c573..3e1811691a 100644
--- a/submodule.c
+++ b/submodule.c
@@ -1373,6 +1373,17 @@ int submodule_touches_in_range(struct repository *r,
 	return ret;
 }
 
+struct submodule_parallel_status {
+	size_t index_count;
+	int result;
+
+	struct string_list *submodule_names;
+
+	/* Pending statuses by OIDs */
+	struct status_task **oid_status_tasks;
+	int oid_status_tasks_nr, oid_status_tasks_alloc;
+};
+
 struct submodule_parallel_fetch {
 	/*
 	 * The index of the last index entry processed by
@@ -1455,6 +1466,12 @@ struct fetch_task {
 	struct oid_array *commits; /* Ensure these commits are fetched */
 };
 
+struct status_task {
+	const char *path;
+	struct strbuf out;
+	int ignore_untracked;
+};
+
 /**
  * When a submodule is not defined in .gitmodules, we cannot access it
  * via the regular submodule-config. Create a fake submodule, which we can
@@ -1947,6 +1964,25 @@ static int parse_status_porcelain(char *str, size_t len,
 	return 0;
 }
 
+static void parse_status_porcelain_strbuf(struct strbuf *buf,
+				   unsigned *dirty_submodule,
+				   int ignore_untracked)
+{
+	struct string_list list = STRING_LIST_INIT_DUP;
+	struct string_list_item *item;
+
+	string_list_split(&list, buf->buf, '\n', -1);
+
+	for_each_string_list_item(item, &list) {
+		if (parse_status_porcelain(item->string,
+					   strlen(item->string),
+					   dirty_submodule,
+					   ignore_untracked))
+			break;
+	}
+	string_list_clear(&list, 0);
+}
+
 unsigned is_submodule_modified(const char *path, int ignore_untracked)
 {
 	struct child_process cp = CHILD_PROCESS_INIT;
@@ -1981,6 +2017,118 @@ unsigned is_submodule_modified(const char *path, int ignore_untracked)
 	return dirty_submodule;
 }
 
+static struct status_task *
+get_status_task_from_index(struct submodule_parallel_status *sps,
+			   struct strbuf *err)
+{
+	for (; sps->index_count < sps->submodule_names->nr; sps->index_count++) {
+		struct submodule_status_util *util = sps->submodule_names->items[sps->index_count].util;
+		struct status_task *task;
+
+		if (!verify_submodule_git_directory(util->path))
+			continue;
+
+		task = xmalloc(sizeof(*task));
+		task->path = util->path;
+		task->ignore_untracked = util->ignore_untracked;
+		strbuf_init(&task->out, 0);
+		sps->index_count++;
+		return task;
+	}
+	return NULL;
+}
+
+static int get_next_submodule_status(struct child_process *cp,
+				     struct strbuf *err, void *data,
+				     void **task_cb)
+{
+	struct submodule_parallel_status *sps = data;
+	struct status_task *task = get_status_task_from_index(sps, err);
+
+	if (!task)
+		return 0;
+
+	child_process_init(cp);
+	prepare_submodule_repo_env_in_gitdir(&cp->env);
+	prepare_status_porcelain(cp, task->path, task->ignore_untracked);
+	*task_cb = task;
+	return 1;
+}
+
+static int status_start_failure(struct strbuf *err,
+				void *cb, void *task_cb)
+{
+	struct submodule_parallel_status *sps = cb;
+	struct status_task *task = task_cb;
+
+	sps->result = 1;
+	strbuf_addf(err,
+	    _(status_porcelain_start_error),
+	    task->path);
+	return 0;
+}
+
+static void status_duplicate_output(struct strbuf *out,
+				    size_t offset,
+				    void *cb, void *task_cb)
+{
+	struct status_task *task = task_cb;
+
+	strbuf_add(&task->out, out->buf + offset, out->len - offset);
+	strbuf_setlen(out, offset);
+}
+
+static int status_finish(int retvalue, struct strbuf *err,
+			 void *cb, void *task_cb)
+{
+	struct submodule_parallel_status *sps = cb;
+	struct status_task *task = task_cb;
+	struct string_list_item *it =
+		string_list_lookup(sps->submodule_names, task->path);
+	struct submodule_status_util *util = it->util;
+
+	if (retvalue) {
+		sps->result = 1;
+		strbuf_addf(err,
+		    _(status_porcelain_fail_error),
+		    task->path);
+	}
+
+	parse_status_porcelain_strbuf(&task->out,
+			      &util->dirty_submodule,
+			      util->ignore_untracked);
+
+	strbuf_release(&task->out);
+	free(task);
+
+	return 0;
+}
+
+int get_submodules_status(struct string_list *submodules,
+			  int max_parallel_jobs)
+{
+	struct submodule_parallel_status sps = {
+		.submodule_names = submodules,
+	};
+	const struct run_process_parallel_opts opts = {
+		.tr2_category = "submodule",
+		.tr2_label = "parallel/status",
+
+		.processes = max_parallel_jobs,
+
+		.get_next_task = get_next_submodule_status,
+		.start_failure = status_start_failure,
+		.duplicate_output = status_duplicate_output,
+		.task_finished = status_finish,
+		.data = &sps,
+	};
+
+	string_list_sort(sps.submodule_names);
+	run_processes_parallel(&opts);
+
+	return sps.result;
+}
+
 int submodule_uses_gitfile(const char *path)
 {
 	struct child_process cp = CHILD_PROCESS_INIT;
diff --git a/submodule.h b/submodule.h
index b52a4ff1e7..08d278a414 100644
--- a/submodule.h
+++ b/submodule.h
@@ -41,6 +41,13 @@ struct submodule_update_strategy {
 	.type = SM_UPDATE_UNSPECIFIED, \
 }
 
+struct submodule_status_util {
+	int changed, ignore_untracked;
+	unsigned dirty_submodule, newmode;
+	struct cache_entry *ce;
+	const char *path;
+};
+
 int is_gitmodules_unmerged(struct index_state *istate);
 int is_writing_gitmodules_ok(void);
 int is_staging_gitmodules_ok(struct index_state *istate);
@@ -94,6 +101,8 @@ int fetch_submodules(struct repository *r,
 		     int command_line_option,
 		     int default_option,
 		     int quiet, int max_parallel_jobs);
+int get_submodules_status(struct string_list *submodules,
+			  int max_parallel_jobs);
 unsigned is_submodule_modified(const char *path, int ignore_untracked);
 int submodule_uses_gitfile(const char *path);
 
diff --git a/t/t4027-diff-submodule.sh b/t/t4027-diff-submodule.sh
index 40164ae07d..1c747cc325 100755
--- a/t/t4027-diff-submodule.sh
+++ b/t/t4027-diff-submodule.sh
@@ -34,6 +34,25 @@ test_expect_success setup '
 	subtip=$3 subprev=$2
 '
 
+test_expect_success 'diff in superproject with submodules respects parallel settings' '
+	test_when_finished "rm -f trace.out" &&
+	(
+		GIT_TRACE=$(pwd)/trace.out git diff &&
+		grep "1 tasks" trace.out &&
+		>trace.out &&
+
+		git config submodule.diffJobs 8 &&
+		GIT_TRACE=$(pwd)/trace.out git diff &&
+		grep "8 tasks" trace.out &&
+		>trace.out &&
+
+		GIT_TRACE=$(pwd)/trace.out git -c submodule.diffJobs=0 diff &&
+		grep "preparing to run up to [0-9]* tasks" trace.out &&
+		! grep "up to 0 tasks" trace.out &&
+		>trace.out
+	)
+'
+
 test_expect_success 'git diff --raw HEAD' '
 	hexsz=$(test_oid hexsz) &&
 	git diff --raw --abbrev=$hexsz HEAD >actual &&
@@ -70,6 +89,18 @@ test_expect_success 'git diff HEAD with dirty submodule (work tree)' '
 	test_cmp expect.body actual.body
 '
 
+test_expect_success 'git diff HEAD with dirty submodule (work tree, parallel)' '
+	(
+		cd sub &&
+		git reset --hard &&
+		echo >>world
+	) &&
+	git -c submodule.diffJobs=8 diff HEAD >actual &&
+	sed -e "1,/^@@/d" actual >actual.body &&
+	expect_from_to >expect.body $subtip $subprev-dirty &&
+	test_cmp expect.body actual.body
+'
+
 test_expect_success 'git diff HEAD with dirty submodule (index)' '
 	(
 		cd sub &&
diff --git a/t/t7506-status-submodule.sh b/t/t7506-status-submodule.sh
index d050091345..7da64e4c4c 100755
--- a/t/t7506-status-submodule.sh
+++ b/t/t7506-status-submodule.sh
@@ -412,4 +412,29 @@ test_expect_success 'status with added file in nested submodule (short)' '
 	EOF
 '
 
+test_expect_success 'status in superproject with submodules respects parallel settings' '
+	test_when_finished "rm -f trace.out" &&
+	(
+		GIT_TRACE=$(pwd)/trace.out git status &&
+		grep "1 tasks" trace.out &&
+		>trace.out &&
+
+		git config submodule.diffJobs 8 &&
+		GIT_TRACE=$(pwd)/trace.out git status &&
+		grep "8 tasks" trace.out &&
+		>trace.out &&
+
+		GIT_TRACE=$(pwd)/trace.out git -c submodule.diffJobs=0 status &&
+		grep "preparing to run up to [0-9]* tasks" trace.out &&
+		! grep "up to 0 tasks" trace.out &&
+		>trace.out
+	)
+'
+
+test_expect_success 'status in superproject with submodules (parallel)' '
+	git -C super status --porcelain >output &&
+	git -C super -c submodule.diffJobs=8 status --porcelain >output_parallel &&
+	diff output output_parallel
+'
+
 test_done
-- 
2.39.1.519.gcb327c4b5f-goog


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 1/7] run-command: add duplicate_output_fn to run_processes_parallel_opts
  2023-02-07 18:17     ` [PATCH v7 1/7] run-command: add duplicate_output_fn to run_processes_parallel_opts Calvin Wan
@ 2023-02-07 22:16       ` Ævar Arnfjörð Bjarmason
  2023-02-08 22:50         ` Calvin Wan
  2023-02-08 14:19       ` Phillip Wood
  1 sibling, 1 reply; 86+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2023-02-07 22:16 UTC (permalink / raw)
  To: Calvin Wan; +Cc: git, chooglen, newren, jonathantanmy

On Tue, Feb 07 2023, Calvin Wan wrote:

> diff --git a/run-command.c b/run-command.c
> index 756f1839aa..cad88befe0 100644
> --- a/run-command.c
> +++ b/run-command.c
> @@ -1526,6 +1526,9 @@ static void pp_init(struct parallel_processes *pp,
>  	if (!opts->get_next_task)
>  		BUG("you need to specify a get_next_task function");
>  
> +	if (opts->duplicate_output && opts->ungroup)
> +		BUG("duplicate_output and ungroup are incompatible with each other");
> +
>  	CALLOC_ARRAY(pp->children, n);
>  	if (!opts->ungroup)
>  		CALLOC_ARRAY(pp->pfd, n);

A trivial request, not worth a re-roll in itself: The "prep" topic[1] I
have for Emily's eventual config-based hooks doesn't need to add new
run-command.c modes that are incompatible with ungroup, but that happens
in the next stage of that saga.

When I merge your topic here with that, the end result here is:

	if (opts->ungroup) {
		if (opts->feed_pipe)
			BUG(".ungroup=1 is incompatible with .feed_pipe != NULL");
		if (opts->consume_sideband)
			BUG(".ungroup=1 is incompatible with .consume_sideband != NULL");
	}

	if (!opts->get_next_task)
		BUG("you need to specify a get_next_task function");

	if (opts->duplicate_output && opts->ungroup)
		BUG("duplicate_output and ungroup are incompatible with each other");

So, whether do the incompatibility check before or after
"get_next_task" is arbitrary. If I had to pick, I think doing it after as you're
doing here probably makes more sense.

But would ou mind if this addition of yours were instead:

	if (opts->ungroup) {
		if (opts->duplicate_output)
			BUG("duplicate_output and ungroup are incompatible with each other")
	}

Like I said, a trivial request.

But it will save us the eventual refactoring of that into nested checks
as we add more of these options.

To the extent that we need to mention the seemingly odd looking pattern
we could just say that we're future-proofing this for future
incompatible modes.

1. https://lore.kernel.org/git/cover-0.5-00000000000-20230123T170550Z-avarab@gmail.com/#t

> @@ -1645,14 +1648,21 @@ static void pp_buffer_stderr(struct parallel_processes *pp,
>  	for (size_t i = 0; i < opts->processes; i++) {
>  		if (pp->children[i].state == GIT_CP_WORKING &&
>  		    pp->pfd[i].revents & (POLLIN | POLLHUP)) {
> -			int n = strbuf_read_once(&pp->children[i].err,
> -						 pp->children[i].process.err, 0);
> +			ssize_t n = strbuf_read_once(&pp->children[i].err,
> +						     pp->children[i].process.err, 0);

This s/int/ssize_t/ change is a good on, but not mentioned in the commit
message. Maybe worth splitting out?

If I revert that back to "int" on top of this entire topic our tests
still pass, so while it's a good change it seems entirely unrelated to
the "duplicate_output" subject of this patch.

>  			if (n == 0) {
>  				close(pp->children[i].process.err);
>  				pp->children[i].state = GIT_CP_WAIT_CLEANUP;
> -			} else if (n < 0)
> +			} else if (n < 0) {

Here you're adding braces, which is an otherwise good change (but maybe
worth splitting up, I haven't read the rest of this topic to see if
there's even more style changes).

In this case we should/could have done this change with the pre-image,
before "duplicate_output".

>  				if (errno != EAGAIN)
>  					die_errno("read");
> +			} else {
> +				if (opts->duplicate_output)

I've read ahead and this topic adds nothing new to this "else" block, so
why the extra indentation instead of:

	} else if (opts->duplicate_output) {
		[...];

> +					opts->duplicate_output(&pp->children[i].err,
> +					       strlen(pp->children[i].err.buf) - n,

Uh, why are we getting the length of strbuf with strlen()? Am I missing
something obvious here, or should this be:

	pp->children[i].err.len - n

?

> +					       opts->data,
> +					       pp->children[i].data);

Especially with how otherwise painful the wrapping is here (well, not
very, but we can easily save a \t-indent here).

> +			}
>  		}
>  	}
>  }
> diff --git a/run-command.h b/run-command.h
> index 072db56a4d..6dcf999f6c 100644
> --- a/run-command.h
> +++ b/run-command.h
> @@ -408,6 +408,27 @@ typedef int (*start_failure_fn)(struct strbuf *out,
>  				void *pp_cb,
>  				void *pp_task_cb);
>  
> +/**
> + * This callback is called whenever output from a child process is buffered
> + * 
> + * See run_processes_parallel() below for a discussion of the "struct
> + * strbuf *out" parameter.
> + * 
> + * The offset refers to the number of bytes originally in "out" before
> + * the output from the child process was buffered. Therefore, the buffer
> + * range, "out + buf" to the end of "out", would contain the buffer of
> + * the child process output.
> + *
> + * pp_cb is the callback cookie as passed into run_processes_parallel,
> + * pp_task_cb is the callback cookie as passed into get_next_task_fn.
> + *
> + * This function is incompatible with "ungroup"
> + */
> +typedef void (*duplicate_output_fn)(struct strbuf *out,
> +				    size_t offset,
> +				    void *pp_cb,
> +				    void *pp_task_cb);

There's some over-wrapping here, I see some existing code does it, but
for new code we could follow our usual style, which would put this on
two lines.

> +
>  /**
>   * This callback is called on every child process that finished processing.
>   *
> @@ -461,6 +482,12 @@ struct run_process_parallel_opts
>  	 */
>  	start_failure_fn start_failure;
>  
> +	/**
> +	 * duplicate_output: See duplicate_output_fn() above. This should be
> +	 * NULL unless process specific output is needed
> +	 */

Here we mostly refer to the previous docs, but the "unless process
specific output is neeed" is very confusing. Without seeing the name or
having read the above I'd think this were some "do_not_pipe_to_dev_null"
feature.

Shouldn't we say "Unless you need to capture the output... leave this at
NULL" or something?

> +static void duplicate_output(struct strbuf *out,
> +			size_t offset,
> +			void *pp_cb UNUSED,
> +			void *pp_task_cb UNUSED)
> +{
> +	struct string_list list = STRING_LIST_INIT_DUP;
> +
> +	string_list_split(&list, out->buf + offset, '\n', -1);
> +	for (size_t i = 0; i < list.nr; i++) {
> +		if (strlen(list.items[i].string) > 0)

First, you can use for_each_string_list_item() here to make this look
much nicer/simpler.

Second, don't use strlen(s) > 0, just use strlen(s).

Third, you can git rid of the {} braces for the "for" here.

But just getting rid of that strlen() check and printing makes all your
tests pass.

And why is this thing that wants to prove to us that we're capturing the
output wanting to strip successive newlines?

Using a struct string_list for this is also pretty wasteful, we could
just make this a while-loop that printed this string when it sees "\n".

But it's just test code, so we don't care, I think it's fine for it to
be wastful, I just don't see why it's doing what it's doing, and what
it's going out of its way to do isn't tested for here.

> +test_expect_success 'run_command runs in parallel with more jobs available than tasks --duplicate-output' '
> +	test-tool run-command --duplicate-output run-command-parallel 5 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
> +	test_must_be_empty out &&
> +	test 4 = $(grep -c "duplicate_output: Hello" err) &&
> +	test 4 = $(grep -c "duplicate_output: World" err) &&
> +	sed "/duplicate_output/d" err > err1 &&

Style: ">f" not "> f".

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 2/7] submodule: strbuf variable rename
  2023-02-07 18:17     ` [PATCH v7 2/7] submodule: strbuf variable rename Calvin Wan
@ 2023-02-07 22:47       ` Ævar Arnfjörð Bjarmason
  2023-02-08 22:59         ` Calvin Wan
  0 siblings, 1 reply; 86+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2023-02-07 22:47 UTC (permalink / raw)
  To: Calvin Wan; +Cc: git, chooglen, newren, jonathantanmy


On Tue, Feb 07 2023, Calvin Wan wrote:

> A prepatory change for a future patch that moves the status parsing
> logic to a separate function.

Ah, I think I suggested splitting this up in some previous round, and
coming back to this this + the next patch look very nice with the move
detection, thanks!
>  	fp = xfdopen(cp.out, "r");
>  	while (strbuf_getwholeline(&buf, fp, '\n') != EOF) {
> +		char *str = buf.buf;
> +		const size_t len = buf.len;
> +
>  		/* regular untracked files */
> -		if (buf.buf[0] == '?')
> +		if (str[0] == '?')
>  			dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;

I'll only add that we could also do this on top:
	
	diff --git a/submodule.c b/submodule.c
	index c7c6bfb2e26..eeb940d96a0 100644
	--- a/submodule.c
	+++ b/submodule.c
	@@ -1875,7 +1875,7 @@ unsigned is_submodule_modified(const char *path, int ignore_untracked)
	 	struct child_process cp = CHILD_PROCESS_INIT;
	 	struct strbuf buf = STRBUF_INIT;
	 	FILE *fp;
	-	unsigned dirty_submodule = 0;
	+	unsigned dirty_submodule0 = 0;
	 	const char *git_dir;
	 	int ignore_cp_exit_code = 0;
	 
	@@ -1908,10 +1908,11 @@ unsigned is_submodule_modified(const char *path, int ignore_untracked)
	 	while (strbuf_getwholeline(&buf, fp, '\n') != EOF) {
	 		char *str = buf.buf;
	 		const size_t len = buf.len;
	+		unsigned *dirty_submodule = &dirty_submodule0;
	 
	 		/* regular untracked files */
	 		if (str[0] == '?')
	-			dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
	+			*dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
	 
	 		if (str[0] == 'u' ||
	 		    str[0] == '1' ||
	@@ -1923,17 +1924,17 @@ unsigned is_submodule_modified(const char *path, int ignore_untracked)
	 
	 			if (str[5] == 'S' && str[8] == 'U')
	 				/* nested untracked file */
	-				dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
	+				*dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
	 
	 			if (str[0] == 'u' ||
	 			    str[0] == '2' ||
	 			    memcmp(str + 5, "S..U", 4))
	 				/* other change */
	-				dirty_submodule |= DIRTY_SUBMODULE_MODIFIED;
	+				*dirty_submodule |= DIRTY_SUBMODULE_MODIFIED;
	 		}
	 
	-		if ((dirty_submodule & DIRTY_SUBMODULE_MODIFIED) &&
	-		    ((dirty_submodule & DIRTY_SUBMODULE_UNTRACKED) ||
	+		if ((*dirty_submodule & DIRTY_SUBMODULE_MODIFIED) &&
	+		    ((*dirty_submodule & DIRTY_SUBMODULE_UNTRACKED) ||
	 		     ignore_untracked)) {
	 			/*
	 			 * We're not interested in any further information from
	@@ -1949,7 +1950,7 @@ unsigned is_submodule_modified(const char *path, int ignore_untracked)
	 		die(_("'git status --porcelain=2' failed in submodule %s"), path);
	 
	 	strbuf_release(&buf);
	-	return dirty_submodule;
	+	return dirty_submodule0;
	 }
	 
	 int submodule_uses_gitfile(const char *path)

Which, if we're massaging this for a subsequent smaller diff we can do
to make only the comment adjustment part of this be a non-moved line.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 4/7] submodule: refactor is_submodule_modified()
  2023-02-07 18:17     ` [PATCH v7 4/7] submodule: refactor is_submodule_modified() Calvin Wan
@ 2023-02-07 22:59       ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 86+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2023-02-07 22:59 UTC (permalink / raw)
  To: Calvin Wan; +Cc: git, chooglen, newren, jonathantanmy


On Tue, Feb 07 2023, Calvin Wan wrote:

> diff --git a/submodule.c b/submodule.c
> index 768d4b4cd7..d88aa2c573 100644
> --- a/submodule.c
> +++ b/submodule.c
> @@ -28,6 +28,10 @@ static int config_update_recurse_submodules = RECURSE_SUBMODULES_OFF;
>  static int initialized_fetch_ref_tips;
>  static struct oid_array ref_tips_before_fetch;
>  static struct oid_array ref_tips_after_fetch;
> +static const char *status_porcelain_start_error =
> +	N_("could not run 'git status --porcelain=2' in submodule %s");
> +static const char *status_porcelain_fail_error =
> +	N_("'git status --porcelain=2' failed in submodule %s");

Let's instead do:

	#define STATUS_PORCELAIN_START_ERROR \
        	N_("could not run 'git status --porcelain=2' in submodule %s")
	#define STATUS_PORCELAIN_FAIL_ERROR \
        	N_("'git status --porcelain=2' failed in submodule %s")

Because a thing you're not discussing in the commit message is that the
disadvantage of doing this sort of thing is that we lose the checking
that -Wformat gives us (try to add an extra "%s" to these in your
version, then the macro version, with gcc and/or clang).

Personally I'd prefer just copy/pasting over losing that, but using a
macro instead of a variable allows us to have our cake and eat it too.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 7/7] diff-lib: parallelize run_diff_files for submodules
  2023-02-07 18:17     ` [PATCH v7 7/7] diff-lib: parallelize run_diff_files for submodules Calvin Wan
@ 2023-02-07 23:06       ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 86+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2023-02-07 23:06 UTC (permalink / raw)
  To: Calvin Wan; +Cc: git, chooglen, newren, jonathantanmy

On Tue, Feb 07 2023, Calvin Wan wrote:

> [...]
> +	sps->result = 1;
> +	strbuf_addf(err,
> +	    _(status_porcelain_start_error),
> +	    task->path);
> +	return 0;
> [...]
> +	if (retvalue) {
> +		sps->result = 1;
> +		strbuf_addf(err,
> +		    _(status_porcelain_fail_error),
> +		    task->path);
> [...]

This is nitpicky, but what's with the short lines and over-wrapping?

If you change these two to (just using my macro version on top, but it's
the same with yours):

	strbuf_addf(err, _(STATUS_PORCELAIN_START_ERROR), task->path);

And:

	strbuf_addf(err, _(STATUS_PORCELAIN_FAIL_ERROR), task->path);

Both of these are under our usual line limit at their respective
indentation (the latter at 77, rule of thumb is to wrap at 79-80).

> +	if (submodules.nr > 0) {

Don't compare unsigned to >0, just use "submodules.nr".

> +		int parallel_jobs;

nit: add extra \n, or maybe just call this "int v", as it's clear from
the scope what it's about...

> +		if (git_config_get_int("submodule.diffjobs", &parallel_jobs))
> +			parallel_jobs = 1;
> +		else if (!parallel_jobs)
> +			parallel_jobs = online_cpus();
> +		else if (parallel_jobs < 0)
> +			die(_("submodule.diffjobs cannot be negative"));

Can't you use the "ulong" instead of "int" and have it handle this "is
negative?" error check for you?

> +
> +		if (get_submodules_status(&submodules, parallel_jobs))
> +			die(_("submodule status failed"));
> +		for (size_t i = 0; i < submodules.nr; i++) {

Another case that can use for_each_string_list_item().

> +struct submodule_parallel_status {
> +	size_t index_count;
> +	int result;
> +
> +	struct string_list *submodule_names;
> +
> +	/* Pending statuses by OIDs */
> +	struct status_task **oid_status_tasks;
> +	int oid_status_tasks_nr, oid_status_tasks_alloc;

For new structs, let's use size_t, not "int" for alloc/nr.

Also, as this is 7/7 and we're not adding another such pattern for the
forseeable future, can we just call these "size_t nr", "size_t alloc"
and "tasks"?

And having said all that, it turns out this is just dead code that can
be removed? Blindly copied from submodule_parallel_fetch?

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 0/7] submodule: parallelize diff
  2023-02-07 18:16     ` [PATCH v7 0/7] " Calvin Wan
@ 2023-02-08  0:55       ` Ævar Arnfjörð Bjarmason
  2023-02-09  0:02       ` [PATCH v8 0/6] " Calvin Wan
                         ` (6 subsequent siblings)
  7 siblings, 0 replies; 86+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2023-02-08  0:55 UTC (permalink / raw)
  To: Calvin Wan; +Cc: git, chooglen, newren, jonathantanmy


On Tue, Feb 07 2023, Calvin Wan wrote:

> Original cover letter for context:
> https://lore.kernel.org/git/20221011232604.839941-1-calvinwan@google.com/

I went over this, noticed some issues, some nits, but definitely some
things worth an eventual re-roll.

> Changes since v6

I would very much appreciate for future iterations if you can start
including a range-diff to the previous version.

> Added patches 4 and 5 to refactor out more functionality so that it is
> clear what changes my final patch makes. Since the large majority of
> the functionality between the serial and parallel implementation is now
> shared, I no longer remove the serial implementation.
>
> Added additional tests to verify setting parallelism doesn't alter
> output

I could have, but didn't manually apply both v6 and v7 and produce a
range-diff, having it in the CL would really help to track the changes
across re-rolls.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 6/7] diff-lib: refactor match_stat_with_submodule
  2023-02-07 18:17     ` [PATCH v7 6/7] diff-lib: refactor match_stat_with_submodule Calvin Wan
@ 2023-02-08  8:18       ` Ævar Arnfjörð Bjarmason
  2023-02-08 17:07         ` Phillip Wood
  2023-02-08 14:22       ` Phillip Wood
  1 sibling, 1 reply; 86+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2023-02-08  8:18 UTC (permalink / raw)
  To: Calvin Wan; +Cc: git, chooglen, newren, jonathantanmy


On Tue, Feb 07 2023, Calvin Wan wrote:

> diff --git a/diff-lib.c b/diff-lib.c
> index 7101cfda3f..e18c886a80 100644
> --- a/diff-lib.c
> +++ b/diff-lib.c
> @@ -73,18 +73,24 @@ static int match_stat_with_submodule(struct diff_options *diffopt,
>  				     unsigned *dirty_submodule)
>  {
>  	int changed = ie_match_stat(diffopt->repo->index, ce, st, ce_option);
> -	if (S_ISGITLINK(ce->ce_mode)) {
> -		struct diff_flags orig_flags = diffopt->flags;
> -		if (!diffopt->flags.override_submodule_config)
> -			set_diffopt_flags_from_submodule_config(diffopt, ce->name);
> -		if (diffopt->flags.ignore_submodules)
> -			changed = 0;
> -		else if (!diffopt->flags.ignore_dirty_submodules &&
> -			 (!changed || diffopt->flags.dirty_submodules))
> -			*dirty_submodule = is_submodule_modified(ce->name,
> -								 diffopt->flags.ignore_untracked_in_submodules);
> -		diffopt->flags = orig_flags;
> +	struct diff_flags orig_flags;
> +
> +	if (!S_ISGITLINK(ce->ce_mode))
> +		return changed;
> +
> +	orig_flags = diffopt->flags;
> +	if (!diffopt->flags.override_submodule_config)
> +		set_diffopt_flags_from_submodule_config(diffopt, ce->name);
> +	if (diffopt->flags.ignore_submodules) {
> +		changed = 0;
> +		goto cleanup;
>  	}
> +	if (!diffopt->flags.ignore_dirty_submodules &&
> +	    (!changed || diffopt->flags.dirty_submodules))
> +		*dirty_submodule = is_submodule_modified(ce->name,
> +					 diffopt->flags.ignore_untracked_in_submodules);
> +cleanup:
> +	diffopt->flags = orig_flags;
>  	return changed;
>  }

Parallel to reviewing your topic I started wondering if we couldn't get
rid of this "orig_flags" flip-flopping, i.e. can't we just set the
specific flags we want in output parameters.

Anyway, having looked at this closely I think this patch should be
dropped entirely. I don't understand how this refactoring is meant to
make the end result easier to read, reason about, or how it helps the
subsequent patch.

In addition to the above diff in 7/7 you do (and that's the change this
is meant to help):
	
	 static int match_stat_with_submodule(struct diff_options *diffopt,
	 				     const struct cache_entry *ce,
	 				     struct stat *st, unsigned ce_option,
	-				     unsigned *dirty_submodule)
	+				     unsigned *dirty_submodule, int *defer_submodule_status,
	+				     unsigned *ignore_untracked)
	 {
	 	int changed = ie_match_stat(diffopt->repo->index, ce, st, ce_option);
	 	struct diff_flags orig_flags;
	+	int defer = 0;
	 
	 	if (!S_ISGITLINK(ce->ce_mode))
	-		return changed;
	+		goto ret;
	 
	 	orig_flags = diffopt->flags;
	 	if (!diffopt->flags.override_submodule_config)
	@@ -86,11 +92,20 @@ static int match_stat_with_submodule(struct diff_options *diffopt,
	 		goto cleanup;
	 	}
	 	if (!diffopt->flags.ignore_dirty_submodules &&
	-	    (!changed || diffopt->flags.dirty_submodules))
	-		*dirty_submodule = is_submodule_modified(ce->name,
	+	    (!changed || diffopt->flags.dirty_submodules)) {
	+		if (defer_submodule_status && *defer_submodule_status) {
	+			defer = 1;
	+			*ignore_untracked = diffopt->flags.ignore_untracked_in_submodules;
	+		} else {
	+			*dirty_submodule = is_submodule_modified(ce->name,
	 					 diffopt->flags.ignore_untracked_in_submodules);
	+		}
	+	}
	 cleanup:
	 	diffopt->flags = orig_flags;
	+ret:
	+	if (defer_submodule_status)
	+		*defer_submodule_status = defer;
	 	return changed;
	 }

But if I rebase out this 6/7 patch and solve the conflict for 7/7 it
becomes:
	
	@@ -65,14 +66,20 @@ static int check_removed(const struct index_state *istate, const struct cache_en
	  * Return 1 when changes are detected, 0 otherwise. If the DIRTY_SUBMODULES
	  * option is set, the caller does not only want to know if a submodule is
	  * modified at all but wants to know all the conditions that are met (new
	- * commits, untracked content and/or modified content).
	+ * commits, untracked content and/or modified content). If
	+ * defer_submodule_status bit is set, dirty_submodule will be left to the
	+ * caller to set. defer_submodule_status can also be set to 0 in this
	+ * function if there is no need to check if the submodule is modified.
	  */
	 static int match_stat_with_submodule(struct diff_options *diffopt,
	 				     const struct cache_entry *ce,
	 				     struct stat *st, unsigned ce_option,
	-				     unsigned *dirty_submodule)
	+				     unsigned *dirty_submodule, int *defer_submodule_status,
	+				     unsigned *ignore_untracked)
	 {
	 	int changed = ie_match_stat(diffopt->repo->index, ce, st, ce_option);
	+	int defer = 0;
	+
	 	if (S_ISGITLINK(ce->ce_mode)) {
	 		struct diff_flags orig_flags = diffopt->flags;
	 		if (!diffopt->flags.override_submodule_config)
	@@ -80,11 +87,20 @@ static int match_stat_with_submodule(struct diff_options *diffopt,
	 		if (diffopt->flags.ignore_submodules)
	 			changed = 0;
	 		else if (!diffopt->flags.ignore_dirty_submodules &&
	-			 (!changed || diffopt->flags.dirty_submodules))
	-			*dirty_submodule = is_submodule_modified(ce->name,
	-								 diffopt->flags.ignore_untracked_in_submodules);
	+			 (!changed || diffopt->flags.dirty_submodules)) {
	+			if (defer_submodule_status && *defer_submodule_status) {
	+				defer = 1;
	+				*ignore_untracked = diffopt->flags.ignore_untracked_in_submodules;
	+			} else {
	+				*dirty_submodule = is_submodule_modified(ce->name,
	+									 diffopt->flags.ignore_untracked_in_submodules);
	+			}
	+		}
	 		diffopt->flags = orig_flags;
	 	}
	+
	+	if (defer_submodule_status)
	+		*defer_submodule_status = defer;
	 	return changed;
	 }
	 

I can see how there's some room for *a* refactoring to reduce the
subsequent diff, but not by mutch.

But this commit didn't help at all. This whole "goto ret", and "goto
cleanup" is just working around the fact that you pulled "orig_flags"
out of the "if" scope. Normally the de-indentation would be worth it,
but here it's not. The control flow becomes more complex to reason about
as a result.


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 1/7] run-command: add duplicate_output_fn to run_processes_parallel_opts
  2023-02-07 18:17     ` [PATCH v7 1/7] run-command: add duplicate_output_fn to run_processes_parallel_opts Calvin Wan
  2023-02-07 22:16       ` Ævar Arnfjörð Bjarmason
@ 2023-02-08 14:19       ` Phillip Wood
  2023-02-08 22:54         ` Calvin Wan
  1 sibling, 1 reply; 86+ messages in thread
From: Phillip Wood @ 2023-02-08 14:19 UTC (permalink / raw)
  To: Calvin Wan, git; +Cc: avarab, chooglen, newren, jonathantanmy

Hi Calvin

On 07/02/2023 18:17, Calvin Wan wrote:
> Add duplicate_output_fn as an optionally set function in
> run_process_parallel_opts. If set, output from each child process is
> copied and passed to the callback function whenever output from the
> child process is buffered to allow for separate parsing.
> 
> Signed-off-by: Calvin Wan <calvinwan@google.com>
> ---
>   run-command.c               | 16 ++++++++++++---
>   run-command.h               | 27 +++++++++++++++++++++++++
>   t/helper/test-run-command.c | 21 ++++++++++++++++++++
>   t/t0061-run-command.sh      | 39 +++++++++++++++++++++++++++++++++++++
>   4 files changed, 100 insertions(+), 3 deletions(-)
> 
> diff --git a/run-command.c b/run-command.c
> index 756f1839aa..cad88befe0 100644
> --- a/run-command.c
> +++ b/run-command.c
> @@ -1526,6 +1526,9 @@ static void pp_init(struct parallel_processes *pp,
>   	if (!opts->get_next_task)
>   		BUG("you need to specify a get_next_task function");
>   
> +	if (opts->duplicate_output && opts->ungroup)
> +		BUG("duplicate_output and ungroup are incompatible with each other");
> +
>   	CALLOC_ARRAY(pp->children, n);
>   	if (!opts->ungroup)
>   		CALLOC_ARRAY(pp->pfd, n);
> @@ -1645,14 +1648,21 @@ static void pp_buffer_stderr(struct parallel_processes *pp,
>   	for (size_t i = 0; i < opts->processes; i++) {
>   		if (pp->children[i].state == GIT_CP_WORKING &&
>   		    pp->pfd[i].revents & (POLLIN | POLLHUP)) {
> -			int n = strbuf_read_once(&pp->children[i].err,
> -						 pp->children[i].process.err, 0);
> +			ssize_t n = strbuf_read_once(&pp->children[i].err,
> +						     pp->children[i].process.err, 0);
>   			if (n == 0) {
>   				close(pp->children[i].process.err);
>   				pp->children[i].state = GIT_CP_WAIT_CLEANUP;
> -			} else if (n < 0)
> +			} else if (n < 0) {
>   				if (errno != EAGAIN)
>   					die_errno("read");
> +			} else {
> +				if (opts->duplicate_output)
> +					opts->duplicate_output(&pp->children[i].err,
> +					       strlen(pp->children[i].err.buf) - n,

Looking at how this is used in patch 7 I think it would be better to 
pass a const char*, length pair rather than a struct strbuf*, offset pair.
i.e.
	opts->duplicate_output(pp->children[i].err.buf + 
pp->children[i].err.len - n, n, ...)

That would make it clear that we do not expect duplicate_output() to 
alter the buffer and would avoid the duplicate_output() having to add 
the offset to the start of the buffer to find the new data.

Best Wishes

Phillip


> +					       opts->data,
> +					       pp->children[i].data);
> +			}
>   		}
>   	}
>   }
> diff --git a/run-command.h b/run-command.h
> index 072db56a4d..6dcf999f6c 100644
> --- a/run-command.h
> +++ b/run-command.h
> @@ -408,6 +408,27 @@ typedef int (*start_failure_fn)(struct strbuf *out,
>   				void *pp_cb,
>   				void *pp_task_cb);
>   
> +/**
> + * This callback is called whenever output from a child process is buffered
> + *
> + * See run_processes_parallel() below for a discussion of the "struct
> + * strbuf *out" parameter.
> + *
> + * The offset refers to the number of bytes originally in "out" before
> + * the output from the child process was buffered. Therefore, the buffer
> + * range, "out + buf" to the end of "out", would contain the buffer of
> + * the child process output.
> + *
> + * pp_cb is the callback cookie as passed into run_processes_parallel,
> + * pp_task_cb is the callback cookie as passed into get_next_task_fn.
> + *
> + * This function is incompatible with "ungroup"
> + */
> +typedef void (*duplicate_output_fn)(struct strbuf *out,
> +				    size_t offset,
> +				    void *pp_cb,
> +				    void *pp_task_cb);
> +
>   /**
>    * This callback is called on every child process that finished processing.
>    *
> @@ -461,6 +482,12 @@ struct run_process_parallel_opts
>   	 */
>   	start_failure_fn start_failure;
>   
> +	/**
> +	 * duplicate_output: See duplicate_output_fn() above. This should be
> +	 * NULL unless process specific output is needed
> +	 */
> +	duplicate_output_fn duplicate_output;
> +
>   	/**
>   	 * task_finished: See task_finished_fn() above. This can be
>   	 * NULL to omit any special handling.
> diff --git a/t/helper/test-run-command.c b/t/helper/test-run-command.c
> index 3ecb830f4a..ffd3cd0045 100644
> --- a/t/helper/test-run-command.c
> +++ b/t/helper/test-run-command.c
> @@ -52,6 +52,21 @@ static int no_job(struct child_process *cp,
>   	return 0;
>   }
>   
> +static void duplicate_output(struct strbuf *out,
> +			size_t offset,
> +			void *pp_cb UNUSED,
> +			void *pp_task_cb UNUSED)
> +{
> +	struct string_list list = STRING_LIST_INIT_DUP;
> +
> +	string_list_split(&list, out->buf + offset, '\n', -1);
> +	for (size_t i = 0; i < list.nr; i++) {
> +		if (strlen(list.items[i].string) > 0)
> +			fprintf(stderr, "duplicate_output: %s\n", list.items[i].string);
> +	}
> +	string_list_clear(&list, 0);
> +}
> +
>   static int task_finished(int result,
>   			 struct strbuf *err,
>   			 void *pp_cb,
> @@ -439,6 +454,12 @@ int cmd__run_command(int argc, const char **argv)
>   		opts.ungroup = 1;
>   	}
>   
> +	if (!strcmp(argv[1], "--duplicate-output")) {
> +		argv += 1;
> +		argc -= 1;
> +		opts.duplicate_output = duplicate_output;
> +	}
> +
>   	jobs = atoi(argv[2]);
>   	strvec_clear(&proc.args);
>   	strvec_pushv(&proc.args, (const char **)argv + 3);
> diff --git a/t/t0061-run-command.sh b/t/t0061-run-command.sh
> index e2411f6a9b..879e536638 100755
> --- a/t/t0061-run-command.sh
> +++ b/t/t0061-run-command.sh
> @@ -135,6 +135,15 @@ test_expect_success 'run_command runs in parallel with more jobs available than
>   	test_cmp expect actual
>   '
>   
> +test_expect_success 'run_command runs in parallel with more jobs available than tasks --duplicate-output' '
> +	test-tool run-command --duplicate-output run-command-parallel 5 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
> +	test_must_be_empty out &&
> +	test 4 = $(grep -c "duplicate_output: Hello" err) &&
> +	test 4 = $(grep -c "duplicate_output: World" err) &&
> +	sed "/duplicate_output/d" err > err1 &&
> +	test_cmp expect err1
> +'
> +
>   test_expect_success 'run_command runs ungrouped in parallel with more jobs available than tasks' '
>   	test-tool run-command --ungroup run-command-parallel 5 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
>   	test_line_count = 8 out &&
> @@ -147,6 +156,15 @@ test_expect_success 'run_command runs in parallel with as many jobs as tasks' '
>   	test_cmp expect actual
>   '
>   
> +test_expect_success 'run_command runs in parallel with as many jobs as tasks --duplicate-output' '
> +	test-tool run-command --duplicate-output run-command-parallel 4 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
> +	test_must_be_empty out &&
> +	test 4 = $(grep -c "duplicate_output: Hello" err) &&
> +	test 4 = $(grep -c "duplicate_output: World" err) &&
> +	sed "/duplicate_output/d" err > err1 &&
> +	test_cmp expect err1
> +'
> +
>   test_expect_success 'run_command runs ungrouped in parallel with as many jobs as tasks' '
>   	test-tool run-command --ungroup run-command-parallel 4 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
>   	test_line_count = 8 out &&
> @@ -159,6 +177,15 @@ test_expect_success 'run_command runs in parallel with more tasks than jobs avai
>   	test_cmp expect actual
>   '
>   
> +test_expect_success 'run_command runs in parallel with more tasks than jobs available --duplicate-output' '
> +	test-tool run-command --duplicate-output run-command-parallel 3 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
> +	test_must_be_empty out &&
> +	test 4 = $(grep -c "duplicate_output: Hello" err) &&
> +	test 4 = $(grep -c "duplicate_output: World" err) &&
> +	sed "/duplicate_output/d" err > err1 &&
> +	test_cmp expect err1
> +'
> +
>   test_expect_success 'run_command runs ungrouped in parallel with more tasks than jobs available' '
>   	test-tool run-command --ungroup run-command-parallel 3 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
>   	test_line_count = 8 out &&
> @@ -180,6 +207,12 @@ test_expect_success 'run_command is asked to abort gracefully' '
>   	test_cmp expect actual
>   '
>   
> +test_expect_success 'run_command is asked to abort gracefully --duplicate-output' '
> +	test-tool run-command --duplicate-output run-command-abort 3 false >out 2>err &&
> +	test_must_be_empty out &&
> +	test_cmp expect err
> +'
> +
>   test_expect_success 'run_command is asked to abort gracefully (ungroup)' '
>   	test-tool run-command --ungroup run-command-abort 3 false >out 2>err &&
>   	test_must_be_empty out &&
> @@ -196,6 +229,12 @@ test_expect_success 'run_command outputs ' '
>   	test_cmp expect actual
>   '
>   
> +test_expect_success 'run_command outputs --duplicate-output' '
> +	test-tool run-command --duplicate-output run-command-no-jobs 3 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
> +	test_must_be_empty out &&
> +	test_cmp expect err
> +'
> +
>   test_expect_success 'run_command outputs (ungroup) ' '
>   	test-tool run-command --ungroup run-command-no-jobs 3 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
>   	test_must_be_empty out &&

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 6/7] diff-lib: refactor match_stat_with_submodule
  2023-02-07 18:17     ` [PATCH v7 6/7] diff-lib: refactor match_stat_with_submodule Calvin Wan
  2023-02-08  8:18       ` Ævar Arnfjörð Bjarmason
@ 2023-02-08 14:22       ` Phillip Wood
  1 sibling, 0 replies; 86+ messages in thread
From: Phillip Wood @ 2023-02-08 14:22 UTC (permalink / raw)
  To: Calvin Wan, git; +Cc: avarab, chooglen, newren, jonathantanmy

Hi Calvin

On 07/02/2023 18:17, Calvin Wan wrote:
> Flatten out the if statements in match_stat_with_submodule so the
> logic is more readable and easier for future patches to add to.
> orig_flags didn't need to be set if the cache entry wasn't a
> GITLINK so defer setting it.
> 
> Signed-off-by: Calvin Wan <calvinwan@google.com>
> ---
>   diff-lib.c | 28 +++++++++++++++++-----------
>   1 file changed, 17 insertions(+), 11 deletions(-)
> 
> diff --git a/diff-lib.c b/diff-lib.c
> index 7101cfda3f..e18c886a80 100644
> --- a/diff-lib.c
> +++ b/diff-lib.c
> @@ -73,18 +73,24 @@ static int match_stat_with_submodule(struct diff_options *diffopt,
>   				     unsigned *dirty_submodule)
>   {
>   	int changed = ie_match_stat(diffopt->repo->index, ce, st, ce_option);
> -	if (S_ISGITLINK(ce->ce_mode)) {
> -		struct diff_flags orig_flags = diffopt->flags;
> -		if (!diffopt->flags.override_submodule_config)
> -			set_diffopt_flags_from_submodule_config(diffopt, ce->name);
> -		if (diffopt->flags.ignore_submodules)
> -			changed = 0;
> -		else if (!diffopt->flags.ignore_dirty_submodules &&
> -			 (!changed || diffopt->flags.dirty_submodules))
> -			*dirty_submodule = is_submodule_modified(ce->name,
> -								 diffopt->flags.ignore_untracked_in_submodules);
> -		diffopt->flags = orig_flags;
> +	struct diff_flags orig_flags;
> +
> +	if (!S_ISGITLINK(ce->ce_mode))
> +		return changed;
> +
> +	orig_flags = diffopt->flags;
> +	if (!diffopt->flags.override_submodule_config)
> +		set_diffopt_flags_from_submodule_config(diffopt, ce->name);
> +	if (diffopt->flags.ignore_submodules) {
> +		changed = 0;
> +		goto cleanup;

Looking ahead to patch 7 there are no new uses of the "cleanup" label so 
I think it would be simpler to leave the code as it was, rather than 
changing the "else if" below to "if" and adding the goto here.

Best Wishes

Phillip

>   	}
> +	if (!diffopt->flags.ignore_dirty_submodules &&
> +	    (!changed || diffopt->flags.dirty_submodules))
> +		*dirty_submodule = is_submodule_modified(ce->name,
> +					 diffopt->flags.ignore_untracked_in_submodules);
> +cleanup:
> +	diffopt->flags = orig_flags;
>   	return changed;
>   }
>   

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 5/7] diff-lib: refactor out diff_change logic
  2023-02-07 18:17     ` [PATCH v7 5/7] diff-lib: refactor out diff_change logic Calvin Wan
@ 2023-02-08 14:28       ` Phillip Wood
  2023-02-08 23:12         ` Calvin Wan
  0 siblings, 1 reply; 86+ messages in thread
From: Phillip Wood @ 2023-02-08 14:28 UTC (permalink / raw)
  To: Calvin Wan, git; +Cc: avarab, chooglen, newren, jonathantanmy

Hi Calvin

On 07/02/2023 18:17, Calvin Wan wrote:
> Refactor out logic that sets up the diff_change call into a helper
> function for a future patch.
> 
> Signed-off-by: Calvin Wan <calvinwan@google.com>
> ---
>   diff-lib.c | 46 +++++++++++++++++++++++++++++-----------------
>   1 file changed, 29 insertions(+), 17 deletions(-)
> 
> diff --git a/diff-lib.c b/diff-lib.c
> index dec040c366..7101cfda3f 100644
> --- a/diff-lib.c
> +++ b/diff-lib.c
> @@ -88,6 +88,31 @@ static int match_stat_with_submodule(struct diff_options *diffopt,
>   	return changed;
>   }
>   
> +static int diff_change_helper(struct diff_options *options,
> +	      unsigned newmode, unsigned dirty_submodule,
> +	      int changed,

I worry that having three integer parameters next to each other makes it 
very easy to mix them up with out getting any errors from the compiler 
because the types are all compatible. Could the last two be combined 
into a flags argument? A similar issues occurs in 
match_stat_with_submodule() in patch 7

Best Wishes

Phillip

  struct index_state *istate,
> +	      struct cache_entry *ce)
> +{
> +	unsigned int oldmode;
> +	const struct object_id *old_oid, *new_oid;
> +
> +	if (!changed && !dirty_submodule) {
> +		ce_mark_uptodate(ce);
> +		mark_fsmonitor_valid(istate, ce);
> +		if (!options->flags.find_copies_harder)
> +			return 1;
> +	}
> +	oldmode = ce->ce_mode;
> +	old_oid = &ce->oid;
> +	new_oid = changed ? null_oid() : &ce->oid;
> +	diff_change(options, oldmode, newmode,
> +			old_oid, new_oid,
> +			!is_null_oid(old_oid),
> +			!is_null_oid(new_oid),
> +			ce->name, 0, dirty_submodule);
> +	return 0;
> +}
> +
>   int run_diff_files(struct rev_info *revs, unsigned int option)
>   {
>   	int entries, i;
> @@ -105,11 +130,10 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
>   		diff_unmerged_stage = 2;
>   	entries = istate->cache_nr;
>   	for (i = 0; i < entries; i++) {
> -		unsigned int oldmode, newmode;
> +		unsigned int newmode;
>   		struct cache_entry *ce = istate->cache[i];
>   		int changed;
>   		unsigned dirty_submodule = 0;
> -		const struct object_id *old_oid, *new_oid;
>   
>   		if (diff_can_quit_early(&revs->diffopt))
>   			break;
> @@ -245,21 +269,9 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
>   			newmode = ce_mode_from_stat(ce, st.st_mode);
>   		}
>   
> -		if (!changed && !dirty_submodule) {
> -			ce_mark_uptodate(ce);
> -			mark_fsmonitor_valid(istate, ce);
> -			if (!revs->diffopt.flags.find_copies_harder)
> -				continue;
> -		}
> -		oldmode = ce->ce_mode;
> -		old_oid = &ce->oid;
> -		new_oid = changed ? null_oid() : &ce->oid;
> -		diff_change(&revs->diffopt, oldmode, newmode,
> -			    old_oid, new_oid,
> -			    !is_null_oid(old_oid),
> -			    !is_null_oid(new_oid),
> -			    ce->name, 0, dirty_submodule);
> -
> +		if (diff_change_helper(&revs->diffopt, newmode, dirty_submodule,
> +				       changed, istate, ce))
> +			continue;
>   	}
>   	diffcore_std(&revs->diffopt);
>   	diff_flush(&revs->diffopt);

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 6/7] diff-lib: refactor match_stat_with_submodule
  2023-02-08  8:18       ` Ævar Arnfjörð Bjarmason
@ 2023-02-08 17:07         ` Phillip Wood
  2023-02-08 23:13           ` Calvin Wan
  0 siblings, 1 reply; 86+ messages in thread
From: Phillip Wood @ 2023-02-08 17:07 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, Calvin Wan
  Cc: git, chooglen, newren, jonathantanmy

On 08/02/2023 08:18, Ævar Arnfjörð Bjarmason wrote:
> 
> On Tue, Feb 07 2023, Calvin Wan wrote:

> Anyway, having looked at this closely I think this patch should be
> dropped entirely. I don't understand how this refactoring is meant to
> make the end result easier to read, reason about, or how it helps the
> subsequent patch.

That's my feeling too c.f. 
<19f91fea-a2a9-7dc6-d940-cc10f384fe76@dunelm.org.uk>. This patch has 
improved since that comment on v4 but I still think we'd be better off 
without it.

Best Wishes

Phillip


> In addition to the above diff in 7/7 you do (and that's the change this
> is meant to help):
> 	
> 	 static int match_stat_with_submodule(struct diff_options *diffopt,
> 	 				     const struct cache_entry *ce,
> 	 				     struct stat *st, unsigned ce_option,
> 	-				     unsigned *dirty_submodule)
> 	+				     unsigned *dirty_submodule, int *defer_submodule_status,
> 	+				     unsigned *ignore_untracked)
> 	 {
> 	 	int changed = ie_match_stat(diffopt->repo->index, ce, st, ce_option);
> 	 	struct diff_flags orig_flags;
> 	+	int defer = 0;
> 	
> 	 	if (!S_ISGITLINK(ce->ce_mode))
> 	-		return changed;
> 	+		goto ret;
> 	
> 	 	orig_flags = diffopt->flags;
> 	 	if (!diffopt->flags.override_submodule_config)
> 	@@ -86,11 +92,20 @@ static int match_stat_with_submodule(struct diff_options *diffopt,
> 	 		goto cleanup;
> 	 	}
> 	 	if (!diffopt->flags.ignore_dirty_submodules &&
> 	-	    (!changed || diffopt->flags.dirty_submodules))
> 	-		*dirty_submodule = is_submodule_modified(ce->name,
> 	+	    (!changed || diffopt->flags.dirty_submodules)) {
> 	+		if (defer_submodule_status && *defer_submodule_status) {
> 	+			defer = 1;
> 	+			*ignore_untracked = diffopt->flags.ignore_untracked_in_submodules;
> 	+		} else {
> 	+			*dirty_submodule = is_submodule_modified(ce->name,
> 	 					 diffopt->flags.ignore_untracked_in_submodules);
> 	+		}
> 	+	}
> 	 cleanup:
> 	 	diffopt->flags = orig_flags;
> 	+ret:
> 	+	if (defer_submodule_status)
> 	+		*defer_submodule_status = defer;
> 	 	return changed;
> 	 }
> 
> But if I rebase out this 6/7 patch and solve the conflict for 7/7 it
> becomes:
> 	
> 	@@ -65,14 +66,20 @@ static int check_removed(const struct index_state *istate, const struct cache_en
> 	  * Return 1 when changes are detected, 0 otherwise. If the DIRTY_SUBMODULES
> 	  * option is set, the caller does not only want to know if a submodule is
> 	  * modified at all but wants to know all the conditions that are met (new
> 	- * commits, untracked content and/or modified content).
> 	+ * commits, untracked content and/or modified content). If
> 	+ * defer_submodule_status bit is set, dirty_submodule will be left to the
> 	+ * caller to set. defer_submodule_status can also be set to 0 in this
> 	+ * function if there is no need to check if the submodule is modified.
> 	  */
> 	 static int match_stat_with_submodule(struct diff_options *diffopt,
> 	 				     const struct cache_entry *ce,
> 	 				     struct stat *st, unsigned ce_option,
> 	-				     unsigned *dirty_submodule)
> 	+				     unsigned *dirty_submodule, int *defer_submodule_status,
> 	+				     unsigned *ignore_untracked)
> 	 {
> 	 	int changed = ie_match_stat(diffopt->repo->index, ce, st, ce_option);
> 	+	int defer = 0;
> 	+
> 	 	if (S_ISGITLINK(ce->ce_mode)) {
> 	 		struct diff_flags orig_flags = diffopt->flags;
> 	 		if (!diffopt->flags.override_submodule_config)
> 	@@ -80,11 +87,20 @@ static int match_stat_with_submodule(struct diff_options *diffopt,
> 	 		if (diffopt->flags.ignore_submodules)
> 	 			changed = 0;
> 	 		else if (!diffopt->flags.ignore_dirty_submodules &&
> 	-			 (!changed || diffopt->flags.dirty_submodules))
> 	-			*dirty_submodule = is_submodule_modified(ce->name,
> 	-								 diffopt->flags.ignore_untracked_in_submodules);
> 	+			 (!changed || diffopt->flags.dirty_submodules)) {
> 	+			if (defer_submodule_status && *defer_submodule_status) {
> 	+				defer = 1;
> 	+				*ignore_untracked = diffopt->flags.ignore_untracked_in_submodules;
> 	+			} else {
> 	+				*dirty_submodule = is_submodule_modified(ce->name,
> 	+									 diffopt->flags.ignore_untracked_in_submodules);
> 	+			}
> 	+		}
> 	 		diffopt->flags = orig_flags;
> 	 	}
> 	+
> 	+	if (defer_submodule_status)
> 	+		*defer_submodule_status = defer;
> 	 	return changed;
> 	 }
> 	
> 
> I can see how there's some room for *a* refactoring to reduce the
> subsequent diff, but not by mutch.
> 
> But this commit didn't help at all. This whole "goto ret", and "goto
> cleanup" is just working around the fact that you pulled "orig_flags"
> out of the "if" scope. Normally the de-indentation would be worth it,
> but here it's not. The control flow becomes more complex to reason about
> as a result.
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 1/7] run-command: add duplicate_output_fn to run_processes_parallel_opts
  2023-02-07 22:16       ` Ævar Arnfjörð Bjarmason
@ 2023-02-08 22:50         ` Calvin Wan
  0 siblings, 0 replies; 86+ messages in thread
From: Calvin Wan @ 2023-02-08 22:50 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, chooglen, newren, jonathantanmy

> But would ou mind if this addition of yours were instead:
>
>         if (opts->ungroup) {
>                 if (opts->duplicate_output)
>                         BUG("duplicate_output and ungroup are incompatible with each other")
>         }

I don't see why not -- will change.

> > @@ -1645,14 +1648,21 @@ static void pp_buffer_stderr(struct parallel_processes *pp,
> >       for (size_t i = 0; i < opts->processes; i++) {
> >               if (pp->children[i].state == GIT_CP_WORKING &&
> >                   pp->pfd[i].revents & (POLLIN | POLLHUP)) {
> > -                     int n = strbuf_read_once(&pp->children[i].err,
> > -                                              pp->children[i].process.err, 0);
> > +                     ssize_t n = strbuf_read_once(&pp->children[i].err,
> > +                                                  pp->children[i].process.err, 0);
>
> This s/int/ssize_t/ change is a good on, but not mentioned in the commit
> message. Maybe worth splitting out?

I'll call this and the style change out in the commit message instead of
splitting it out.

> And why is this thing that wants to prove to us that we're capturing the
> output wanting to strip successive newlines?

I added it as a sanity check originally, but you're right that this is
unnecessary. Thanks for your comments on the other stylistic nits. I've
gone ahead and fixed them all.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 1/7] run-command: add duplicate_output_fn to run_processes_parallel_opts
  2023-02-08 14:19       ` Phillip Wood
@ 2023-02-08 22:54         ` Calvin Wan
  2023-02-09 20:37           ` Phillip Wood
  0 siblings, 1 reply; 86+ messages in thread
From: Calvin Wan @ 2023-02-08 22:54 UTC (permalink / raw)
  To: phillip.wood; +Cc: git, avarab, chooglen, newren, jonathantanmy

> > +                     } else {
> > +                             if (opts->duplicate_output)
> > +                                     opts->duplicate_output(&pp->children[i].err,
> > +                                            strlen(pp->children[i].err.buf) - n,
>
> Looking at how this is used in patch 7 I think it would be better to
> pass a const char*, length pair rather than a struct strbuf*, offset pair.
> i.e.
>         opts->duplicate_output(pp->children[i].err.buf +
> pp->children[i].err.len - n, n, ...)
>
> That would make it clear that we do not expect duplicate_output() to
> alter the buffer and would avoid the duplicate_output() having to add
> the offset to the start of the buffer to find the new data.

I don't think that would work since
pp->children[i].err.buf + pp->children[i].err.len - n
wouldn't end up as a const char* unless I'm missing something?

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 2/7] submodule: strbuf variable rename
  2023-02-07 22:47       ` Ævar Arnfjörð Bjarmason
@ 2023-02-08 22:59         ` Calvin Wan
  0 siblings, 0 replies; 86+ messages in thread
From: Calvin Wan @ 2023-02-08 22:59 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, chooglen, newren, jonathantanmy

> I'll only add that we could also do this on top:
>
>         diff --git a/submodule.c b/submodule.c
>         index c7c6bfb2e26..eeb940d96a0 100644
>         --- a/submodule.c
>         +++ b/submodule.c
>         @@ -1875,7 +1875,7 @@ unsigned is_submodule_modified(const char *path, int ignore_untracked)
>                 struct child_process cp = CHILD_PROCESS_INIT;
>                 struct strbuf buf = STRBUF_INIT;
>                 FILE *fp;
>         -       unsigned dirty_submodule = 0;
>         +       unsigned dirty_submodule0 = 0;
>                 const char *git_dir;
>                 int ignore_cp_exit_code = 0;
>
>         @@ -1908,10 +1908,11 @@ unsigned is_submodule_modified(const char *path, int ignore_untracked)
>                 while (strbuf_getwholeline(&buf, fp, '\n') != EOF) {
>                         char *str = buf.buf;
>                         const size_t len = buf.len;
>         +               unsigned *dirty_submodule = &dirty_submodule0;
>
>                         /* regular untracked files */
>                         if (str[0] == '?')
>         -                       dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
>         +                       *dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
>
>                         if (str[0] == 'u' ||
>                             str[0] == '1' ||
>         @@ -1923,17 +1924,17 @@ unsigned is_submodule_modified(const char *path, int ignore_untracked)
>
>                                 if (str[5] == 'S' && str[8] == 'U')
>                                         /* nested untracked file */
>         -                               dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
>         +                               *dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
>
>                                 if (str[0] == 'u' ||
>                                     str[0] == '2' ||
>                                     memcmp(str + 5, "S..U", 4))
>                                         /* other change */
>         -                               dirty_submodule |= DIRTY_SUBMODULE_MODIFIED;
>         +                               *dirty_submodule |= DIRTY_SUBMODULE_MODIFIED;
>                         }
>
>         -               if ((dirty_submodule & DIRTY_SUBMODULE_MODIFIED) &&
>         -                   ((dirty_submodule & DIRTY_SUBMODULE_UNTRACKED) ||
>         +               if ((*dirty_submodule & DIRTY_SUBMODULE_MODIFIED) &&
>         +                   ((*dirty_submodule & DIRTY_SUBMODULE_UNTRACKED) ||
>                              ignore_untracked)) {
>                                 /*
>                                  * We're not interested in any further information from
>         @@ -1949,7 +1950,7 @@ unsigned is_submodule_modified(const char *path, int ignore_untracked)
>                         die(_("'git status --porcelain=2' failed in submodule %s"), path);
>
>                 strbuf_release(&buf);
>         -       return dirty_submodule;
>         +       return dirty_submodule0;
>          }
>
>          int submodule_uses_gitfile(const char *path)
>
> Which, if we're massaging this for a subsequent smaller diff we can do
> to make only the comment adjustment part of this be a non-moved line.

Ah that's a neat little trick -- I'll save this one for the next time I do a
refactor like this :)

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 5/7] diff-lib: refactor out diff_change logic
  2023-02-08 14:28       ` Phillip Wood
@ 2023-02-08 23:12         ` Calvin Wan
  2023-02-09 20:53           ` Phillip Wood
  0 siblings, 1 reply; 86+ messages in thread
From: Calvin Wan @ 2023-02-08 23:12 UTC (permalink / raw)
  To: phillip.wood; +Cc: git, avarab, chooglen, newren, jonathantanmy

> I worry that having three integer parameters next to each other makes it
> very easy to mix them up with out getting any errors from the compiler
> because the types are all compatible. Could the last two be combined
> into a flags argument? A similar issues occurs in
> match_stat_with_submodule() in patch 7

I'm not sure how much more I want to engineer a static helper function
that is only being called in one other place. I also don't understand what
you mean by combining the last two into paramters a flags argument.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 6/7] diff-lib: refactor match_stat_with_submodule
  2023-02-08 17:07         ` Phillip Wood
@ 2023-02-08 23:13           ` Calvin Wan
  0 siblings, 0 replies; 86+ messages in thread
From: Calvin Wan @ 2023-02-08 23:13 UTC (permalink / raw)
  To: phillip.wood
  Cc: Ævar Arnfjörð Bjarmason, git, chooglen, newren,
	jonathantanmy

I agree that this patch should be dropped. Thanks for catching this.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH v8 0/6] submodule: parallelize diff
  2023-02-07 18:16     ` [PATCH v7 0/7] " Calvin Wan
  2023-02-08  0:55       ` Ævar Arnfjörð Bjarmason
@ 2023-02-09  0:02       ` Calvin Wan
  2023-02-09  1:42         ` Ævar Arnfjörð Bjarmason
                           ` (3 more replies)
  2023-02-09  0:02       ` [PATCH v8 1/6] run-command: add duplicate_output_fn to run_processes_parallel_opts Calvin Wan
                         ` (5 subsequent siblings)
  7 siblings, 4 replies; 86+ messages in thread
From: Calvin Wan @ 2023-02-09  0:02 UTC (permalink / raw)
  To: git; +Cc: Calvin Wan, avarab, chooglen, newren, jonathantanmy,
	phillip.wood123

Original cover letter for context:
https://lore.kernel.org/git/20221011232604.839941-1-calvinwan@google.com/

This reroll contains stylistic changes suggested by Avar and Phillip,
and includes a range-diff below.

Calvin Wan (6):
  run-command: add duplicate_output_fn to run_processes_parallel_opts
  submodule: strbuf variable rename
  submodule: move status parsing into function
  submodule: refactor is_submodule_modified()
  diff-lib: refactor out diff_change logic
  diff-lib: parallelize run_diff_files for submodules

 Documentation/config/submodule.txt |  12 ++
 diff-lib.c                         | 133 +++++++++++----
 run-command.c                      |  16 +-
 run-command.h                      |  25 +++
 submodule.c                        | 266 ++++++++++++++++++++++++-----
 submodule.h                        |   9 +
 t/helper/test-run-command.c        |  20 +++
 t/t0061-run-command.sh             |  39 +++++
 t/t4027-diff-submodule.sh          |  31 ++++
 t/t7506-status-submodule.sh        |  25 +++
 10 files changed, 497 insertions(+), 79 deletions(-)

Range-diff against v7:
1:  311b1abfbe ! 1:  5d51250c67 run-command: add duplicate_output_fn to run_processes_parallel_opts
    @@ run-command.c: static void pp_init(struct parallel_processes *pp,
      	if (!opts->get_next_task)
      		BUG("you need to specify a get_next_task function");
      
    -+	if (opts->duplicate_output && opts->ungroup)
    -+		BUG("duplicate_output and ungroup are incompatible with each other");
    ++	if (opts->ungroup) {
    ++		if (opts->duplicate_output)
    ++			BUG("duplicate_output and ungroup are incompatible with each other");
    ++	}
     +
      	CALLOC_ARRAY(pp->children, n);
      	if (!opts->ungroup)
    @@ run-command.c: static void pp_buffer_stderr(struct parallel_processes *pp,
     +			} else if (n < 0) {
      				if (errno != EAGAIN)
      					die_errno("read");
    -+			} else {
    -+				if (opts->duplicate_output)
    -+					opts->duplicate_output(&pp->children[i].err,
    -+					       strlen(pp->children[i].err.buf) - n,
    -+					       opts->data,
    -+					       pp->children[i].data);
    ++			} else if (opts->duplicate_output) {
    ++				opts->duplicate_output(&pp->children[i].err,
    ++					pp->children[i].err.len - n,
    ++					opts->data, pp->children[i].data);
     +			}
      		}
      	}
    @@ run-command.h: typedef int (*start_failure_fn)(struct strbuf *out,
     + *
     + * This function is incompatible with "ungroup"
     + */
    -+typedef void (*duplicate_output_fn)(struct strbuf *out,
    -+				    size_t offset,
    -+				    void *pp_cb,
    -+				    void *pp_task_cb);
    ++typedef void (*duplicate_output_fn)(struct strbuf *out, size_t offset,
    ++				    void *pp_cb, void *pp_task_cb);
     +
      /**
       * This callback is called on every child process that finished processing.
    @@ run-command.h: struct run_process_parallel_opts
      	start_failure_fn start_failure;
      
     +	/**
    -+	 * duplicate_output: See duplicate_output_fn() above. This should be
    -+	 * NULL unless process specific output is needed
    ++	 * duplicate_output: See duplicate_output_fn() above. Unless you need
    ++	 * to capture output from child processes, leave this as NULL.
     +	 */
     +	duplicate_output_fn duplicate_output;
     +
    @@ t/helper/test-run-command.c: static int no_job(struct child_process *cp,
     +			void *pp_task_cb UNUSED)
     +{
     +	struct string_list list = STRING_LIST_INIT_DUP;
    ++	struct string_list_item *item;
     +
     +	string_list_split(&list, out->buf + offset, '\n', -1);
    -+	for (size_t i = 0; i < list.nr; i++) {
    -+		if (strlen(list.items[i].string) > 0)
    -+			fprintf(stderr, "duplicate_output: %s\n", list.items[i].string);
    -+	}
    ++	for_each_string_list_item(item, &list)
    ++		fprintf(stderr, "duplicate_output: %s\n", item->string);
     +	string_list_clear(&list, 0);
     +}
     +
    @@ t/t0061-run-command.sh: test_expect_success 'run_command runs in parallel with m
     +	test_must_be_empty out &&
     +	test 4 = $(grep -c "duplicate_output: Hello" err) &&
     +	test 4 = $(grep -c "duplicate_output: World" err) &&
    -+	sed "/duplicate_output/d" err > err1 &&
    ++	sed "/duplicate_output/d" err >err1 &&
     +	test_cmp expect err1
     +'
     +
    @@ t/t0061-run-command.sh: test_expect_success 'run_command runs in parallel with a
     +	test_must_be_empty out &&
     +	test 4 = $(grep -c "duplicate_output: Hello" err) &&
     +	test 4 = $(grep -c "duplicate_output: World" err) &&
    -+	sed "/duplicate_output/d" err > err1 &&
    ++	sed "/duplicate_output/d" err >err1 &&
     +	test_cmp expect err1
     +'
     +
    @@ t/t0061-run-command.sh: test_expect_success 'run_command runs in parallel with m
     +	test_must_be_empty out &&
     +	test 4 = $(grep -c "duplicate_output: Hello" err) &&
     +	test 4 = $(grep -c "duplicate_output: World" err) &&
    -+	sed "/duplicate_output/d" err > err1 &&
    ++	sed "/duplicate_output/d" err >err1 &&
     +	test_cmp expect err1
     +'
     +
2:  d00a18dd84 = 2:  6ded5b6788 submodule: strbuf variable rename
3:  dcda518922 = 3:  0c71cea8cd submodule: move status parsing into function
4:  c6fc5ba13b ! 4:  5c8cc93f9f submodule: refactor is_submodule_modified()
    @@ submodule.c: static int config_update_recurse_submodules = RECURSE_SUBMODULES_OF
      static int initialized_fetch_ref_tips;
      static struct oid_array ref_tips_before_fetch;
      static struct oid_array ref_tips_after_fetch;
    -+static const char *status_porcelain_start_error =
    -+	N_("could not run 'git status --porcelain=2' in submodule %s");
    -+static const char *status_porcelain_fail_error =
    -+	N_("'git status --porcelain=2' failed in submodule %s");
    ++#define STATUS_PORCELAIN_START_ERROR \
    ++	N_("could not run 'git status --porcelain=2' in submodule %s")
    ++#define STATUS_PORCELAIN_FAIL_ERROR \
    ++	N_("'git status --porcelain=2' failed in submodule %s")
      
      /*
       * Check if the .gitmodules file is unmerged. Parsing of the .gitmodules file
    @@ submodule.c: unsigned is_submodule_modified(const char *path, int ignore_untrack
     +	prepare_status_porcelain(&cp, path, ignore_untracked);
      	if (start_command(&cp))
     -		die(_("Could not run 'git status --porcelain=2' in submodule %s"), path);
    -+		die(_(status_porcelain_start_error), path);
    ++		die(_(STATUS_PORCELAIN_START_ERROR), path);
      
      	fp = xfdopen(cp.out, "r");
      	while (strbuf_getwholeline(&buf, fp, '\n') != EOF) {
    @@ submodule.c: unsigned is_submodule_modified(const char *path, int ignore_untrack
      
      	if (finish_command(&cp) && !ignore_cp_exit_code)
     -		die(_("'git status --porcelain=2' failed in submodule %s"), path);
    -+		die(_(status_porcelain_fail_error), path);
    ++		die(_(STATUS_PORCELAIN_FAIL_ERROR), path);
      
      	strbuf_release(&buf);
      	return dirty_submodule;
5:  1ea8eae9c9 = 5:  6c2b62abc8 diff-lib: refactor out diff_change logic
6:  0d35fcc38d < -:  ---------- diff-lib: refactor match_stat_with_submodule
7:  fd1eec974d ! 6:  bb25dadbe5 diff-lib: parallelize run_diff_files for submodules
    @@ diff-lib.c: static int check_removed(const struct index_state *istate, const str
     +				     unsigned *ignore_untracked)
      {
      	int changed = ie_match_stat(diffopt->repo->index, ce, st, ce_option);
    - 	struct diff_flags orig_flags;
    +-	if (S_ISGITLINK(ce->ce_mode)) {
    +-		struct diff_flags orig_flags = diffopt->flags;
    +-		if (!diffopt->flags.override_submodule_config)
    +-			set_diffopt_flags_from_submodule_config(diffopt, ce->name);
    +-		if (diffopt->flags.ignore_submodules)
    +-			changed = 0;
    +-		else if (!diffopt->flags.ignore_dirty_submodules &&
    +-			 (!changed || diffopt->flags.dirty_submodules))
    ++	struct diff_flags orig_flags;
     +	int defer = 0;
    - 
    - 	if (!S_ISGITLINK(ce->ce_mode))
    --		return changed;
    ++
    ++	if (!S_ISGITLINK(ce->ce_mode))
     +		goto ret;
    - 
    - 	orig_flags = diffopt->flags;
    - 	if (!diffopt->flags.override_submodule_config)
    -@@ diff-lib.c: static int match_stat_with_submodule(struct diff_options *diffopt,
    - 		goto cleanup;
    - 	}
    - 	if (!diffopt->flags.ignore_dirty_submodules &&
    --	    (!changed || diffopt->flags.dirty_submodules))
    --		*dirty_submodule = is_submodule_modified(ce->name,
    ++
    ++	orig_flags = diffopt->flags;
    ++	if (!diffopt->flags.override_submodule_config)
    ++		set_diffopt_flags_from_submodule_config(diffopt, ce->name);
    ++	if (diffopt->flags.ignore_submodules) {
    ++		changed = 0;
    ++		goto cleanup;
    ++	}
    ++	if (!diffopt->flags.ignore_dirty_submodules &&
     +	    (!changed || diffopt->flags.dirty_submodules)) {
     +		if (defer_submodule_status && *defer_submodule_status) {
     +			defer = 1;
     +			*ignore_untracked = diffopt->flags.ignore_untracked_in_submodules;
     +		} else {
    -+			*dirty_submodule = is_submodule_modified(ce->name,
    - 					 diffopt->flags.ignore_untracked_in_submodules);
    + 			*dirty_submodule = is_submodule_modified(ce->name,
    +-								 diffopt->flags.ignore_untracked_in_submodules);
    +-		diffopt->flags = orig_flags;
    ++					 diffopt->flags.ignore_untracked_in_submodules);
     +		}
    -+	}
    - cleanup:
    - 	diffopt->flags = orig_flags;
    + 	}
    ++cleanup:
    ++	diffopt->flags = orig_flags;
     +ret:
     +	if (defer_submodule_status)
     +		*defer_submodule_status = defer;
    @@ diff-lib.c: int run_diff_files(struct rev_info *revs, unsigned int option)
      				       changed, istate, ce))
      			continue;
      	}
    -+	if (submodules.nr > 0) {
    -+		int parallel_jobs;
    -+		if (git_config_get_int("submodule.diffjobs", &parallel_jobs))
    ++	if (submodules.nr) {
    ++		unsigned long parallel_jobs;
    ++		struct string_list_item *item;
    ++
    ++		if (git_config_get_ulong("submodule.diffjobs", &parallel_jobs))
     +			parallel_jobs = 1;
     +		else if (!parallel_jobs)
     +			parallel_jobs = online_cpus();
    -+		else if (parallel_jobs < 0)
    -+			die(_("submodule.diffjobs cannot be negative"));
     +
     +		if (get_submodules_status(&submodules, parallel_jobs))
     +			die(_("submodule status failed"));
    -+		for (size_t i = 0; i < submodules.nr; i++) {
    -+			struct submodule_status_util *util = submodules.items[i].util;
    ++		for_each_string_list_item(item, &submodules) {
    ++			struct submodule_status_util *util = item->util;
     +
     +			if (diff_change_helper(&revs->diffopt, util->newmode,
     +				       util->dirty_submodule, util->changed,
    @@ submodule.c: int submodule_touches_in_range(struct repository *r,
     +	int result;
     +
     +	struct string_list *submodule_names;
    -+
    -+	/* Pending statuses by OIDs */
    -+	struct status_task **oid_status_tasks;
    -+	int oid_status_tasks_nr, oid_status_tasks_alloc;
     +};
     +
      struct submodule_parallel_fetch {
    @@ submodule.c: unsigned is_submodule_modified(const char *path, int ignore_untrack
     +	struct status_task *task = task_cb;
     +
     +	sps->result = 1;
    -+	strbuf_addf(err,
    -+	    _(status_porcelain_start_error),
    -+	    task->path);
    ++	strbuf_addf(err, _(STATUS_PORCELAIN_START_ERROR), task->path);
     +	return 0;
     +}
     +
    @@ submodule.c: unsigned is_submodule_modified(const char *path, int ignore_untrack
     +
     +	if (retvalue) {
     +		sps->result = 1;
    -+		strbuf_addf(err,
    -+		    _(status_porcelain_fail_error),
    -+		    task->path);
    ++		strbuf_addf(err, _(STATUS_PORCELAIN_FAIL_ERROR), task->path);
     +	}
     +
     +	parse_status_porcelain_strbuf(&task->out,
-- 
2.39.1.519.gcb327c4b5f-goog


^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH v8 1/6] run-command: add duplicate_output_fn to run_processes_parallel_opts
  2023-02-07 18:16     ` [PATCH v7 0/7] " Calvin Wan
  2023-02-08  0:55       ` Ævar Arnfjörð Bjarmason
  2023-02-09  0:02       ` [PATCH v8 0/6] " Calvin Wan
@ 2023-02-09  0:02       ` Calvin Wan
  2023-02-13  6:34         ` Glen Choo
  2023-02-09  0:02       ` [PATCH v8 2/6] submodule: strbuf variable rename Calvin Wan
                         ` (4 subsequent siblings)
  7 siblings, 1 reply; 86+ messages in thread
From: Calvin Wan @ 2023-02-09  0:02 UTC (permalink / raw)
  To: git; +Cc: Calvin Wan, avarab, chooglen, newren, jonathantanmy,
	phillip.wood123

Add duplicate_output_fn as an optionally set function in
run_process_parallel_opts. If set, output from each child process is
copied and passed to the callback function whenever output from the
child process is buffered to allow for separate parsing.

Fix two items in pp_buffer_stderr:
 * strbuf_read_once returns a ssize_t but the variable it is set to is
   an int so fix that.
 * Add missing brackets to "else if" statement

The ungroup/duplicate_output incompatibility check is nested to
prepare for future imcompatibles modes with ungroup.

Signed-off-by: Calvin Wan <calvinwan@google.com>
---
 run-command.c               | 16 ++++++++++++---
 run-command.h               | 25 ++++++++++++++++++++++++
 t/helper/test-run-command.c | 20 +++++++++++++++++++
 t/t0061-run-command.sh      | 39 +++++++++++++++++++++++++++++++++++++
 4 files changed, 97 insertions(+), 3 deletions(-)

diff --git a/run-command.c b/run-command.c
index 756f1839aa..50f741f2ab 100644
--- a/run-command.c
+++ b/run-command.c
@@ -1526,6 +1526,11 @@ static void pp_init(struct parallel_processes *pp,
 	if (!opts->get_next_task)
 		BUG("you need to specify a get_next_task function");
 
+	if (opts->ungroup) {
+		if (opts->duplicate_output)
+			BUG("duplicate_output and ungroup are incompatible with each other");
+	}
+
 	CALLOC_ARRAY(pp->children, n);
 	if (!opts->ungroup)
 		CALLOC_ARRAY(pp->pfd, n);
@@ -1645,14 +1650,19 @@ static void pp_buffer_stderr(struct parallel_processes *pp,
 	for (size_t i = 0; i < opts->processes; i++) {
 		if (pp->children[i].state == GIT_CP_WORKING &&
 		    pp->pfd[i].revents & (POLLIN | POLLHUP)) {
-			int n = strbuf_read_once(&pp->children[i].err,
-						 pp->children[i].process.err, 0);
+			ssize_t n = strbuf_read_once(&pp->children[i].err,
+						     pp->children[i].process.err, 0);
 			if (n == 0) {
 				close(pp->children[i].process.err);
 				pp->children[i].state = GIT_CP_WAIT_CLEANUP;
-			} else if (n < 0)
+			} else if (n < 0) {
 				if (errno != EAGAIN)
 					die_errno("read");
+			} else if (opts->duplicate_output) {
+				opts->duplicate_output(&pp->children[i].err,
+					pp->children[i].err.len - n,
+					opts->data, pp->children[i].data);
+			}
 		}
 	}
 }
diff --git a/run-command.h b/run-command.h
index 072db56a4d..0c16d7f251 100644
--- a/run-command.h
+++ b/run-command.h
@@ -408,6 +408,25 @@ typedef int (*start_failure_fn)(struct strbuf *out,
 				void *pp_cb,
 				void *pp_task_cb);
 
+/**
+ * This callback is called whenever output from a child process is buffered
+ * 
+ * See run_processes_parallel() below for a discussion of the "struct
+ * strbuf *out" parameter.
+ * 
+ * The offset refers to the number of bytes originally in "out" before
+ * the output from the child process was buffered. Therefore, the buffer
+ * range, "out + buf" to the end of "out", would contain the buffer of
+ * the child process output.
+ *
+ * pp_cb is the callback cookie as passed into run_processes_parallel,
+ * pp_task_cb is the callback cookie as passed into get_next_task_fn.
+ *
+ * This function is incompatible with "ungroup"
+ */
+typedef void (*duplicate_output_fn)(struct strbuf *out, size_t offset,
+				    void *pp_cb, void *pp_task_cb);
+
 /**
  * This callback is called on every child process that finished processing.
  *
@@ -461,6 +480,12 @@ struct run_process_parallel_opts
 	 */
 	start_failure_fn start_failure;
 
+	/**
+	 * duplicate_output: See duplicate_output_fn() above. Unless you need
+	 * to capture output from child processes, leave this as NULL.
+	 */
+	duplicate_output_fn duplicate_output;
+
 	/**
 	 * task_finished: See task_finished_fn() above. This can be
 	 * NULL to omit any special handling.
diff --git a/t/helper/test-run-command.c b/t/helper/test-run-command.c
index 3ecb830f4a..4596ba68a8 100644
--- a/t/helper/test-run-command.c
+++ b/t/helper/test-run-command.c
@@ -52,6 +52,20 @@ static int no_job(struct child_process *cp,
 	return 0;
 }
 
+static void duplicate_output(struct strbuf *out,
+			size_t offset,
+			void *pp_cb UNUSED,
+			void *pp_task_cb UNUSED)
+{
+	struct string_list list = STRING_LIST_INIT_DUP;
+	struct string_list_item *item;
+
+	string_list_split(&list, out->buf + offset, '\n', -1);
+	for_each_string_list_item(item, &list)
+		fprintf(stderr, "duplicate_output: %s\n", item->string);
+	string_list_clear(&list, 0);
+}
+
 static int task_finished(int result,
 			 struct strbuf *err,
 			 void *pp_cb,
@@ -439,6 +453,12 @@ int cmd__run_command(int argc, const char **argv)
 		opts.ungroup = 1;
 	}
 
+	if (!strcmp(argv[1], "--duplicate-output")) {
+		argv += 1;
+		argc -= 1;
+		opts.duplicate_output = duplicate_output;
+	}
+
 	jobs = atoi(argv[2]);
 	strvec_clear(&proc.args);
 	strvec_pushv(&proc.args, (const char **)argv + 3);
diff --git a/t/t0061-run-command.sh b/t/t0061-run-command.sh
index e2411f6a9b..31f1db96fc 100755
--- a/t/t0061-run-command.sh
+++ b/t/t0061-run-command.sh
@@ -135,6 +135,15 @@ test_expect_success 'run_command runs in parallel with more jobs available than
 	test_cmp expect actual
 '
 
+test_expect_success 'run_command runs in parallel with more jobs available than tasks --duplicate-output' '
+	test-tool run-command --duplicate-output run-command-parallel 5 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
+	test_must_be_empty out &&
+	test 4 = $(grep -c "duplicate_output: Hello" err) &&
+	test 4 = $(grep -c "duplicate_output: World" err) &&
+	sed "/duplicate_output/d" err >err1 &&
+	test_cmp expect err1
+'
+
 test_expect_success 'run_command runs ungrouped in parallel with more jobs available than tasks' '
 	test-tool run-command --ungroup run-command-parallel 5 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
 	test_line_count = 8 out &&
@@ -147,6 +156,15 @@ test_expect_success 'run_command runs in parallel with as many jobs as tasks' '
 	test_cmp expect actual
 '
 
+test_expect_success 'run_command runs in parallel with as many jobs as tasks --duplicate-output' '
+	test-tool run-command --duplicate-output run-command-parallel 4 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
+	test_must_be_empty out &&
+	test 4 = $(grep -c "duplicate_output: Hello" err) &&
+	test 4 = $(grep -c "duplicate_output: World" err) &&
+	sed "/duplicate_output/d" err >err1 &&
+	test_cmp expect err1
+'
+
 test_expect_success 'run_command runs ungrouped in parallel with as many jobs as tasks' '
 	test-tool run-command --ungroup run-command-parallel 4 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
 	test_line_count = 8 out &&
@@ -159,6 +177,15 @@ test_expect_success 'run_command runs in parallel with more tasks than jobs avai
 	test_cmp expect actual
 '
 
+test_expect_success 'run_command runs in parallel with more tasks than jobs available --duplicate-output' '
+	test-tool run-command --duplicate-output run-command-parallel 3 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
+	test_must_be_empty out &&
+	test 4 = $(grep -c "duplicate_output: Hello" err) &&
+	test 4 = $(grep -c "duplicate_output: World" err) &&
+	sed "/duplicate_output/d" err >err1 &&
+	test_cmp expect err1
+'
+
 test_expect_success 'run_command runs ungrouped in parallel with more tasks than jobs available' '
 	test-tool run-command --ungroup run-command-parallel 3 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
 	test_line_count = 8 out &&
@@ -180,6 +207,12 @@ test_expect_success 'run_command is asked to abort gracefully' '
 	test_cmp expect actual
 '
 
+test_expect_success 'run_command is asked to abort gracefully --duplicate-output' '
+	test-tool run-command --duplicate-output run-command-abort 3 false >out 2>err &&
+	test_must_be_empty out &&
+	test_cmp expect err
+'
+
 test_expect_success 'run_command is asked to abort gracefully (ungroup)' '
 	test-tool run-command --ungroup run-command-abort 3 false >out 2>err &&
 	test_must_be_empty out &&
@@ -196,6 +229,12 @@ test_expect_success 'run_command outputs ' '
 	test_cmp expect actual
 '
 
+test_expect_success 'run_command outputs --duplicate-output' '
+	test-tool run-command --duplicate-output run-command-no-jobs 3 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
+	test_must_be_empty out &&
+	test_cmp expect err
+'
+
 test_expect_success 'run_command outputs (ungroup) ' '
 	test-tool run-command --ungroup run-command-no-jobs 3 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
 	test_must_be_empty out &&
-- 
2.39.1.519.gcb327c4b5f-goog


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v8 2/6] submodule: strbuf variable rename
  2023-02-07 18:16     ` [PATCH v7 0/7] " Calvin Wan
                         ` (2 preceding siblings ...)
  2023-02-09  0:02       ` [PATCH v8 1/6] run-command: add duplicate_output_fn to run_processes_parallel_opts Calvin Wan
@ 2023-02-09  0:02       ` Calvin Wan
  2023-02-13  8:37         ` Glen Choo
  2023-02-09  0:02       ` [PATCH v8 3/6] submodule: move status parsing into function Calvin Wan
                         ` (3 subsequent siblings)
  7 siblings, 1 reply; 86+ messages in thread
From: Calvin Wan @ 2023-02-09  0:02 UTC (permalink / raw)
  To: git; +Cc: Calvin Wan, avarab, chooglen, newren, jonathantanmy,
	phillip.wood123

A prepatory change for a future patch that moves the status parsing
logic to a separate function.

Signed-off-by: Calvin Wan <calvinwan@google.com>
---
 submodule.c | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/submodule.c b/submodule.c
index fae24ef34a..faf37c1101 100644
--- a/submodule.c
+++ b/submodule.c
@@ -1906,25 +1906,28 @@ unsigned is_submodule_modified(const char *path, int ignore_untracked)
 
 	fp = xfdopen(cp.out, "r");
 	while (strbuf_getwholeline(&buf, fp, '\n') != EOF) {
+		char *str = buf.buf;
+		const size_t len = buf.len;
+
 		/* regular untracked files */
-		if (buf.buf[0] == '?')
+		if (str[0] == '?')
 			dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
 
-		if (buf.buf[0] == 'u' ||
-		    buf.buf[0] == '1' ||
-		    buf.buf[0] == '2') {
+		if (str[0] == 'u' ||
+		    str[0] == '1' ||
+		    str[0] == '2') {
 			/* T = line type, XY = status, SSSS = submodule state */
-			if (buf.len < strlen("T XY SSSS"))
+			if (len < strlen("T XY SSSS"))
 				BUG("invalid status --porcelain=2 line %s",
-				    buf.buf);
+				    str);
 
-			if (buf.buf[5] == 'S' && buf.buf[8] == 'U')
+			if (str[5] == 'S' && str[8] == 'U')
 				/* nested untracked file */
 				dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
 
-			if (buf.buf[0] == 'u' ||
-			    buf.buf[0] == '2' ||
-			    memcmp(buf.buf + 5, "S..U", 4))
+			if (str[0] == 'u' ||
+			    str[0] == '2' ||
+			    memcmp(str + 5, "S..U", 4))
 				/* other change */
 				dirty_submodule |= DIRTY_SUBMODULE_MODIFIED;
 		}
-- 
2.39.1.519.gcb327c4b5f-goog


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v8 3/6] submodule: move status parsing into function
  2023-02-07 18:16     ` [PATCH v7 0/7] " Calvin Wan
                         ` (3 preceding siblings ...)
  2023-02-09  0:02       ` [PATCH v8 2/6] submodule: strbuf variable rename Calvin Wan
@ 2023-02-09  0:02       ` Calvin Wan
  2023-02-09  0:02       ` [PATCH v8 4/6] submodule: refactor is_submodule_modified() Calvin Wan
                         ` (2 subsequent siblings)
  7 siblings, 0 replies; 86+ messages in thread
From: Calvin Wan @ 2023-02-09  0:02 UTC (permalink / raw)
  To: git; +Cc: Calvin Wan, avarab, chooglen, newren, jonathantanmy,
	phillip.wood123

A future patch requires the ability to parse the output of git
status --porcelain=2. Move parsing code from is_submodule_modified to
parse_status_porcelain.

Signed-off-by: Calvin Wan <calvinwan@google.com>
---
 submodule.c | 74 ++++++++++++++++++++++++++++++-----------------------
 1 file changed, 42 insertions(+), 32 deletions(-)

diff --git a/submodule.c b/submodule.c
index faf37c1101..768d4b4cd7 100644
--- a/submodule.c
+++ b/submodule.c
@@ -1870,6 +1870,45 @@ int fetch_submodules(struct repository *r,
 	return spf.result;
 }
 
+static int parse_status_porcelain(char *str, size_t len,
+				  unsigned *dirty_submodule,
+				  int ignore_untracked)
+{
+	/* regular untracked files */
+	if (str[0] == '?')
+		*dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
+
+	if (str[0] == 'u' ||
+	    str[0] == '1' ||
+	    str[0] == '2') {
+		/* T = line type, XY = status, SSSS = submodule state */
+		if (len < strlen("T XY SSSS"))
+			BUG("invalid status --porcelain=2 line %s",
+			    str);
+
+		if (str[5] == 'S' && str[8] == 'U')
+			/* nested untracked file */
+			*dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
+
+		if (str[0] == 'u' ||
+		    str[0] == '2' ||
+		    memcmp(str + 5, "S..U", 4))
+			/* other change */
+			*dirty_submodule |= DIRTY_SUBMODULE_MODIFIED;
+	}
+
+	if ((*dirty_submodule & DIRTY_SUBMODULE_MODIFIED) &&
+	    ((*dirty_submodule & DIRTY_SUBMODULE_UNTRACKED) ||
+	     ignore_untracked)) {
+		/*
+		* We're not interested in any further information from
+		* the child any more, neither output nor its exit code.
+		*/
+		return 1;
+	}
+	return 0;
+}
+
 unsigned is_submodule_modified(const char *path, int ignore_untracked)
 {
 	struct child_process cp = CHILD_PROCESS_INIT;
@@ -1909,39 +1948,10 @@ unsigned is_submodule_modified(const char *path, int ignore_untracked)
 		char *str = buf.buf;
 		const size_t len = buf.len;
 
-		/* regular untracked files */
-		if (str[0] == '?')
-			dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
-
-		if (str[0] == 'u' ||
-		    str[0] == '1' ||
-		    str[0] == '2') {
-			/* T = line type, XY = status, SSSS = submodule state */
-			if (len < strlen("T XY SSSS"))
-				BUG("invalid status --porcelain=2 line %s",
-				    str);
-
-			if (str[5] == 'S' && str[8] == 'U')
-				/* nested untracked file */
-				dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
-
-			if (str[0] == 'u' ||
-			    str[0] == '2' ||
-			    memcmp(str + 5, "S..U", 4))
-				/* other change */
-				dirty_submodule |= DIRTY_SUBMODULE_MODIFIED;
-		}
-
-		if ((dirty_submodule & DIRTY_SUBMODULE_MODIFIED) &&
-		    ((dirty_submodule & DIRTY_SUBMODULE_UNTRACKED) ||
-		     ignore_untracked)) {
-			/*
-			 * We're not interested in any further information from
-			 * the child any more, neither output nor its exit code.
-			 */
-			ignore_cp_exit_code = 1;
+		ignore_cp_exit_code = parse_status_porcelain(str, len, &dirty_submodule,
+							     ignore_untracked);
+		if (ignore_cp_exit_code)
 			break;
-		}
 	}
 	fclose(fp);
 
-- 
2.39.1.519.gcb327c4b5f-goog


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v8 4/6] submodule: refactor is_submodule_modified()
  2023-02-07 18:16     ` [PATCH v7 0/7] " Calvin Wan
                         ` (4 preceding siblings ...)
  2023-02-09  0:02       ` [PATCH v8 3/6] submodule: move status parsing into function Calvin Wan
@ 2023-02-09  0:02       ` Calvin Wan
  2023-02-13  7:06         ` Glen Choo
  2023-02-09  0:02       ` [PATCH v8 5/6] diff-lib: refactor out diff_change logic Calvin Wan
  2023-02-09  0:02       ` [PATCH v8 6/6] diff-lib: parallelize run_diff_files for submodules Calvin Wan
  7 siblings, 1 reply; 86+ messages in thread
From: Calvin Wan @ 2023-02-09  0:02 UTC (permalink / raw)
  To: git; +Cc: Calvin Wan, avarab, chooglen, newren, jonathantanmy,
	phillip.wood123

Refactor out submodule status logic and error messages that will be
used in a future patch.

Signed-off-by: Calvin Wan <calvinwan@google.com>
---
 submodule.c | 65 ++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 42 insertions(+), 23 deletions(-)

diff --git a/submodule.c b/submodule.c
index 768d4b4cd7..426074cebb 100644
--- a/submodule.c
+++ b/submodule.c
@@ -28,6 +28,10 @@ static int config_update_recurse_submodules = RECURSE_SUBMODULES_OFF;
 static int initialized_fetch_ref_tips;
 static struct oid_array ref_tips_before_fetch;
 static struct oid_array ref_tips_after_fetch;
+#define STATUS_PORCELAIN_START_ERROR \
+	N_("could not run 'git status --porcelain=2' in submodule %s")
+#define STATUS_PORCELAIN_FAIL_ERROR \
+	N_("'git status --porcelain=2' failed in submodule %s")
 
 /*
  * Check if the .gitmodules file is unmerged. Parsing of the .gitmodules file
@@ -1870,6 +1874,40 @@ int fetch_submodules(struct repository *r,
 	return spf.result;
 }
 
+static int verify_submodule_git_directory(const char *path)
+{
+	const char *git_dir;
+	struct strbuf buf = STRBUF_INIT;
+
+	strbuf_addf(&buf, "%s/.git", path);
+	git_dir = read_gitfile(buf.buf);
+	if (!git_dir)
+		git_dir = buf.buf;
+	if (!is_git_directory(git_dir)) {
+		if (is_directory(git_dir))
+			die(_("'%s' not recognized as a git repository"), git_dir);
+		strbuf_release(&buf);
+		/* The submodule is not checked out, so it is not modified */
+		return 0;
+	}
+	strbuf_release(&buf);
+	return 1;
+}
+
+static void prepare_status_porcelain(struct child_process *cp,
+			     const char *path, int ignore_untracked)
+{
+	strvec_pushl(&cp->args, "status", "--porcelain=2", NULL);
+	if (ignore_untracked)
+		strvec_push(&cp->args, "-uno");
+
+	prepare_submodule_repo_env(&cp->env);
+	cp->git_cmd = 1;
+	cp->no_stdin = 1;
+	cp->out = -1;
+	cp->dir = path;
+}
+
 static int parse_status_porcelain(char *str, size_t len,
 				  unsigned *dirty_submodule,
 				  int ignore_untracked)
@@ -1915,33 +1953,14 @@ unsigned is_submodule_modified(const char *path, int ignore_untracked)
 	struct strbuf buf = STRBUF_INIT;
 	FILE *fp;
 	unsigned dirty_submodule = 0;
-	const char *git_dir;
 	int ignore_cp_exit_code = 0;
 
-	strbuf_addf(&buf, "%s/.git", path);
-	git_dir = read_gitfile(buf.buf);
-	if (!git_dir)
-		git_dir = buf.buf;
-	if (!is_git_directory(git_dir)) {
-		if (is_directory(git_dir))
-			die(_("'%s' not recognized as a git repository"), git_dir);
-		strbuf_release(&buf);
-		/* The submodule is not checked out, so it is not modified */
+	if (!verify_submodule_git_directory(path))
 		return 0;
-	}
-	strbuf_reset(&buf);
-
-	strvec_pushl(&cp.args, "status", "--porcelain=2", NULL);
-	if (ignore_untracked)
-		strvec_push(&cp.args, "-uno");
 
-	prepare_submodule_repo_env(&cp.env);
-	cp.git_cmd = 1;
-	cp.no_stdin = 1;
-	cp.out = -1;
-	cp.dir = path;
+	prepare_status_porcelain(&cp, path, ignore_untracked);
 	if (start_command(&cp))
-		die(_("Could not run 'git status --porcelain=2' in submodule %s"), path);
+		die(_(STATUS_PORCELAIN_START_ERROR), path);
 
 	fp = xfdopen(cp.out, "r");
 	while (strbuf_getwholeline(&buf, fp, '\n') != EOF) {
@@ -1956,7 +1975,7 @@ unsigned is_submodule_modified(const char *path, int ignore_untracked)
 	fclose(fp);
 
 	if (finish_command(&cp) && !ignore_cp_exit_code)
-		die(_("'git status --porcelain=2' failed in submodule %s"), path);
+		die(_(STATUS_PORCELAIN_FAIL_ERROR), path);
 
 	strbuf_release(&buf);
 	return dirty_submodule;
-- 
2.39.1.519.gcb327c4b5f-goog


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v8 5/6] diff-lib: refactor out diff_change logic
  2023-02-07 18:16     ` [PATCH v7 0/7] " Calvin Wan
                         ` (5 preceding siblings ...)
  2023-02-09  0:02       ` [PATCH v8 4/6] submodule: refactor is_submodule_modified() Calvin Wan
@ 2023-02-09  0:02       ` Calvin Wan
  2023-02-09  1:48         ` Ævar Arnfjörð Bjarmason
  2023-02-13  8:42         ` Glen Choo
  2023-02-09  0:02       ` [PATCH v8 6/6] diff-lib: parallelize run_diff_files for submodules Calvin Wan
  7 siblings, 2 replies; 86+ messages in thread
From: Calvin Wan @ 2023-02-09  0:02 UTC (permalink / raw)
  To: git; +Cc: Calvin Wan, avarab, chooglen, newren, jonathantanmy,
	phillip.wood123

Refactor out logic that sets up the diff_change call into a helper
function for a future patch.

Signed-off-by: Calvin Wan <calvinwan@google.com>
---
 diff-lib.c | 46 +++++++++++++++++++++++++++++-----------------
 1 file changed, 29 insertions(+), 17 deletions(-)

diff --git a/diff-lib.c b/diff-lib.c
index dec040c366..7101cfda3f 100644
--- a/diff-lib.c
+++ b/diff-lib.c
@@ -88,6 +88,31 @@ static int match_stat_with_submodule(struct diff_options *diffopt,
 	return changed;
 }
 
+static int diff_change_helper(struct diff_options *options,
+	      unsigned newmode, unsigned dirty_submodule,
+	      int changed, struct index_state *istate,
+	      struct cache_entry *ce)
+{
+	unsigned int oldmode;
+	const struct object_id *old_oid, *new_oid;
+
+	if (!changed && !dirty_submodule) {
+		ce_mark_uptodate(ce);
+		mark_fsmonitor_valid(istate, ce);
+		if (!options->flags.find_copies_harder)
+			return 1;
+	}
+	oldmode = ce->ce_mode;
+	old_oid = &ce->oid;
+	new_oid = changed ? null_oid() : &ce->oid;
+	diff_change(options, oldmode, newmode,
+			old_oid, new_oid,
+			!is_null_oid(old_oid),
+			!is_null_oid(new_oid),
+			ce->name, 0, dirty_submodule);
+	return 0;
+}
+
 int run_diff_files(struct rev_info *revs, unsigned int option)
 {
 	int entries, i;
@@ -105,11 +130,10 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
 		diff_unmerged_stage = 2;
 	entries = istate->cache_nr;
 	for (i = 0; i < entries; i++) {
-		unsigned int oldmode, newmode;
+		unsigned int newmode;
 		struct cache_entry *ce = istate->cache[i];
 		int changed;
 		unsigned dirty_submodule = 0;
-		const struct object_id *old_oid, *new_oid;
 
 		if (diff_can_quit_early(&revs->diffopt))
 			break;
@@ -245,21 +269,9 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
 			newmode = ce_mode_from_stat(ce, st.st_mode);
 		}
 
-		if (!changed && !dirty_submodule) {
-			ce_mark_uptodate(ce);
-			mark_fsmonitor_valid(istate, ce);
-			if (!revs->diffopt.flags.find_copies_harder)
-				continue;
-		}
-		oldmode = ce->ce_mode;
-		old_oid = &ce->oid;
-		new_oid = changed ? null_oid() : &ce->oid;
-		diff_change(&revs->diffopt, oldmode, newmode,
-			    old_oid, new_oid,
-			    !is_null_oid(old_oid),
-			    !is_null_oid(new_oid),
-			    ce->name, 0, dirty_submodule);
-
+		if (diff_change_helper(&revs->diffopt, newmode, dirty_submodule,
+				       changed, istate, ce))
+			continue;
 	}
 	diffcore_std(&revs->diffopt);
 	diff_flush(&revs->diffopt);
-- 
2.39.1.519.gcb327c4b5f-goog


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v8 6/6] diff-lib: parallelize run_diff_files for submodules
  2023-02-07 18:16     ` [PATCH v7 0/7] " Calvin Wan
                         ` (6 preceding siblings ...)
  2023-02-09  0:02       ` [PATCH v8 5/6] diff-lib: refactor out diff_change logic Calvin Wan
@ 2023-02-09  0:02       ` Calvin Wan
  2023-02-13  8:36         ` Glen Choo
  7 siblings, 1 reply; 86+ messages in thread
From: Calvin Wan @ 2023-02-09  0:02 UTC (permalink / raw)
  To: git; +Cc: Calvin Wan, avarab, chooglen, newren, jonathantanmy,
	phillip.wood123

During the iteration of the index entries in run_diff_files, whenever
a submodule is found and needs its status checked, a subprocess is
spawned for it. Instead of spawning the subprocess immediately and
waiting for its completion to continue, hold onto all submodules and
relevant information in a list. Then use that list to create tasks for
run_processes_parallel. Subprocess output is duplicated and passed to
status_pipe_output which stores it to be parsed on completion of the
subprocess.

Add config option submodule.diffJobs to set the maximum number
of parallel jobs. The option defaults to 1 if unset. If set to 0, the
number of jobs is set to online_cpus().

Since run_diff_files is called from many different commands, I chose
to grab the config option in the function rather than adding variables
to every git command and then figuring out how to pass them all in.

Signed-off-by: Calvin Wan <calvinwan@google.com>
---
 Documentation/config/submodule.txt |  12 +++
 diff-lib.c                         |  91 ++++++++++++++++---
 submodule.c                        | 140 +++++++++++++++++++++++++++++
 submodule.h                        |   9 ++
 t/t4027-diff-submodule.sh          |  31 +++++++
 t/t7506-status-submodule.sh        |  25 ++++++
 6 files changed, 294 insertions(+), 14 deletions(-)

diff --git a/Documentation/config/submodule.txt b/Documentation/config/submodule.txt
index 6490527b45..3209eb8117 100644
--- a/Documentation/config/submodule.txt
+++ b/Documentation/config/submodule.txt
@@ -93,6 +93,18 @@ submodule.fetchJobs::
 	in parallel. A value of 0 will give some reasonable default.
 	If unset, it defaults to 1.
 
+submodule.diffJobs::
+	Specifies how many submodules are diffed at the same time. A
+	positive integer allows up to that number of submodules diffed
+	in parallel. A value of 0 will give some reasonable default.
+	If unset, it defaults to 1. The diff operation is used by many
+	other git commands such as add, merge, diff, status, stash and
+	more. Note that the expensive part of the diff operation is
+	reading the index from cache or memory. Therefore multiple jobs
+	may be detrimental to performance if your hardware does not
+	support parallel reads or if the number of jobs greatly exceeds
+	the amount of supported reads.
+
 submodule.alternateLocation::
 	Specifies how the submodules obtain alternates when submodules are
 	cloned. Possible values are `no`, `superproject`.
diff --git a/diff-lib.c b/diff-lib.c
index 7101cfda3f..2dde575ec6 100644
--- a/diff-lib.c
+++ b/diff-lib.c
@@ -14,6 +14,7 @@
 #include "dir.h"
 #include "fsmonitor.h"
 #include "commit-reach.h"
+#include "config.h"
 
 /*
  * diff-files
@@ -65,26 +66,46 @@ static int check_removed(const struct index_state *istate, const struct cache_en
  * Return 1 when changes are detected, 0 otherwise. If the DIRTY_SUBMODULES
  * option is set, the caller does not only want to know if a submodule is
  * modified at all but wants to know all the conditions that are met (new
- * commits, untracked content and/or modified content).
+ * commits, untracked content and/or modified content). If
+ * defer_submodule_status bit is set, dirty_submodule will be left to the
+ * caller to set. defer_submodule_status can also be set to 0 in this
+ * function if there is no need to check if the submodule is modified.
  */
 static int match_stat_with_submodule(struct diff_options *diffopt,
 				     const struct cache_entry *ce,
 				     struct stat *st, unsigned ce_option,
-				     unsigned *dirty_submodule)
+				     unsigned *dirty_submodule, int *defer_submodule_status,
+				     unsigned *ignore_untracked)
 {
 	int changed = ie_match_stat(diffopt->repo->index, ce, st, ce_option);
-	if (S_ISGITLINK(ce->ce_mode)) {
-		struct diff_flags orig_flags = diffopt->flags;
-		if (!diffopt->flags.override_submodule_config)
-			set_diffopt_flags_from_submodule_config(diffopt, ce->name);
-		if (diffopt->flags.ignore_submodules)
-			changed = 0;
-		else if (!diffopt->flags.ignore_dirty_submodules &&
-			 (!changed || diffopt->flags.dirty_submodules))
+	struct diff_flags orig_flags;
+	int defer = 0;
+
+	if (!S_ISGITLINK(ce->ce_mode))
+		goto ret;
+
+	orig_flags = diffopt->flags;
+	if (!diffopt->flags.override_submodule_config)
+		set_diffopt_flags_from_submodule_config(diffopt, ce->name);
+	if (diffopt->flags.ignore_submodules) {
+		changed = 0;
+		goto cleanup;
+	}
+	if (!diffopt->flags.ignore_dirty_submodules &&
+	    (!changed || diffopt->flags.dirty_submodules)) {
+		if (defer_submodule_status && *defer_submodule_status) {
+			defer = 1;
+			*ignore_untracked = diffopt->flags.ignore_untracked_in_submodules;
+		} else {
 			*dirty_submodule = is_submodule_modified(ce->name,
-								 diffopt->flags.ignore_untracked_in_submodules);
-		diffopt->flags = orig_flags;
+					 diffopt->flags.ignore_untracked_in_submodules);
+		}
 	}
+cleanup:
+	diffopt->flags = orig_flags;
+ret:
+	if (defer_submodule_status)
+		*defer_submodule_status = defer;
 	return changed;
 }
 
@@ -121,6 +142,7 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
 			      ? CE_MATCH_RACY_IS_DIRTY : 0);
 	uint64_t start = getnanotime();
 	struct index_state *istate = revs->diffopt.repo->index;
+	struct string_list submodules = STRING_LIST_INIT_NODUP;
 
 	diff_set_mnemonic_prefix(&revs->diffopt, "i/", "w/");
 
@@ -244,6 +266,8 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
 			newmode = ce->ce_mode;
 		} else {
 			struct stat st;
+			unsigned ignore_untracked = 0;
+			int defer_submodule_status = 1;
 
 			changed = check_removed(istate, ce, &st);
 			if (changed) {
@@ -265,14 +289,53 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
 			}
 
 			changed = match_stat_with_submodule(&revs->diffopt, ce, &st,
-							    ce_option, &dirty_submodule);
+							    ce_option, &dirty_submodule,
+							    &defer_submodule_status,
+							    &ignore_untracked);
 			newmode = ce_mode_from_stat(ce, st.st_mode);
+			if (defer_submodule_status) {
+				struct submodule_status_util tmp = {
+					.changed = changed,
+					.dirty_submodule = 0,
+					.ignore_untracked = ignore_untracked,
+					.newmode = newmode,
+					.ce = ce,
+					.path = ce->name,
+				};
+				struct string_list_item *item;
+
+				item = string_list_append(&submodules, ce->name);
+				item->util = xmalloc(sizeof(tmp));
+				memcpy(item->util, &tmp, sizeof(tmp));
+				continue;
+			}
 		}
 
 		if (diff_change_helper(&revs->diffopt, newmode, dirty_submodule,
 				       changed, istate, ce))
 			continue;
 	}
+	if (submodules.nr) {
+		unsigned long parallel_jobs;
+		struct string_list_item *item;
+
+		if (git_config_get_ulong("submodule.diffjobs", &parallel_jobs))
+			parallel_jobs = 1;
+		else if (!parallel_jobs)
+			parallel_jobs = online_cpus();
+
+		if (get_submodules_status(&submodules, parallel_jobs))
+			die(_("submodule status failed"));
+		for_each_string_list_item(item, &submodules) {
+			struct submodule_status_util *util = item->util;
+
+			if (diff_change_helper(&revs->diffopt, util->newmode,
+				       util->dirty_submodule, util->changed,
+				       istate, util->ce))
+				continue;
+		}
+	}
+	string_list_clear(&submodules, 1);
 	diffcore_std(&revs->diffopt);
 	diff_flush(&revs->diffopt);
 	trace_performance_since(start, "diff-files");
@@ -320,7 +383,7 @@ static int get_stat_data(const struct index_state *istate,
 			return -1;
 		}
 		changed = match_stat_with_submodule(diffopt, ce, &st,
-						    0, dirty_submodule);
+						    0, dirty_submodule, NULL, NULL);
 		if (changed) {
 			mode = ce_mode_from_stat(ce, st.st_mode);
 			oid = null_oid();
diff --git a/submodule.c b/submodule.c
index 426074cebb..e175fb8d3f 100644
--- a/submodule.c
+++ b/submodule.c
@@ -1373,6 +1373,13 @@ int submodule_touches_in_range(struct repository *r,
 	return ret;
 }
 
+struct submodule_parallel_status {
+	size_t index_count;
+	int result;
+
+	struct string_list *submodule_names;
+};
+
 struct submodule_parallel_fetch {
 	/*
 	 * The index of the last index entry processed by
@@ -1455,6 +1462,12 @@ struct fetch_task {
 	struct oid_array *commits; /* Ensure these commits are fetched */
 };
 
+struct status_task {
+	const char *path;
+	struct strbuf out;
+	int ignore_untracked;
+};
+
 /**
  * When a submodule is not defined in .gitmodules, we cannot access it
  * via the regular submodule-config. Create a fake submodule, which we can
@@ -1947,6 +1960,25 @@ static int parse_status_porcelain(char *str, size_t len,
 	return 0;
 }
 
+static void parse_status_porcelain_strbuf(struct strbuf *buf,
+				   unsigned *dirty_submodule,
+				   int ignore_untracked)
+{
+	struct string_list list = STRING_LIST_INIT_DUP;
+	struct string_list_item *item;
+
+	string_list_split(&list, buf->buf, '\n', -1);
+
+	for_each_string_list_item(item, &list) {
+		if (parse_status_porcelain(item->string,
+					   strlen(item->string),
+					   dirty_submodule,
+					   ignore_untracked))
+			break;
+	}
+	string_list_clear(&list, 0);
+}
+
 unsigned is_submodule_modified(const char *path, int ignore_untracked)
 {
 	struct child_process cp = CHILD_PROCESS_INIT;
@@ -1981,6 +2013,114 @@ unsigned is_submodule_modified(const char *path, int ignore_untracked)
 	return dirty_submodule;
 }
 
+static struct status_task *
+get_status_task_from_index(struct submodule_parallel_status *sps,
+			   struct strbuf *err)
+{
+	for (; sps->index_count < sps->submodule_names->nr; sps->index_count++) {
+		struct submodule_status_util *util = sps->submodule_names->items[sps->index_count].util;
+		struct status_task *task;
+
+		if (!verify_submodule_git_directory(util->path))
+			continue;
+
+		task = xmalloc(sizeof(*task));
+		task->path = util->path;
+		task->ignore_untracked = util->ignore_untracked;
+		strbuf_init(&task->out, 0);
+		sps->index_count++;
+		return task;
+	}
+	return NULL;
+}
+
+static int get_next_submodule_status(struct child_process *cp,
+				     struct strbuf *err, void *data,
+				     void **task_cb)
+{
+	struct submodule_parallel_status *sps = data;
+	struct status_task *task = get_status_task_from_index(sps, err);
+
+	if (!task)
+		return 0;
+
+	child_process_init(cp);
+	prepare_submodule_repo_env_in_gitdir(&cp->env);
+	prepare_status_porcelain(cp, task->path, task->ignore_untracked);
+	*task_cb = task;
+	return 1;
+}
+
+static int status_start_failure(struct strbuf *err,
+				void *cb, void *task_cb)
+{
+	struct submodule_parallel_status *sps = cb;
+	struct status_task *task = task_cb;
+
+	sps->result = 1;
+	strbuf_addf(err, _(STATUS_PORCELAIN_START_ERROR), task->path);
+	return 0;
+}
+
+static void status_duplicate_output(struct strbuf *out,
+				    size_t offset,
+				    void *cb, void *task_cb)
+{
+	struct status_task *task = task_cb;
+
+	strbuf_add(&task->out, out->buf + offset, out->len - offset);
+	strbuf_setlen(out, offset);
+}
+
+static int status_finish(int retvalue, struct strbuf *err,
+			 void *cb, void *task_cb)
+{
+	struct submodule_parallel_status *sps = cb;
+	struct status_task *task = task_cb;
+	struct string_list_item *it =
+		string_list_lookup(sps->submodule_names, task->path);
+	struct submodule_status_util *util = it->util;
+
+	if (retvalue) {
+		sps->result = 1;
+		strbuf_addf(err, _(STATUS_PORCELAIN_FAIL_ERROR), task->path);
+	}
+
+	parse_status_porcelain_strbuf(&task->out,
+			      &util->dirty_submodule,
+			      util->ignore_untracked);
+
+	strbuf_release(&task->out);
+	free(task);
+
+	return 0;
+}
+
+int get_submodules_status(struct string_list *submodules,
+			  int max_parallel_jobs)
+{
+	struct submodule_parallel_status sps = {
+		.submodule_names = submodules,
+	};
+	const struct run_process_parallel_opts opts = {
+		.tr2_category = "submodule",
+		.tr2_label = "parallel/status",
+
+		.processes = max_parallel_jobs,
+
+		.get_next_task = get_next_submodule_status,
+		.start_failure = status_start_failure,
+		.duplicate_output = status_duplicate_output,
+		.task_finished = status_finish,
+		.data = &sps,
+	};
+
+	string_list_sort(sps.submodule_names);
+	run_processes_parallel(&opts);
+
+	return sps.result;
+}
+
 int submodule_uses_gitfile(const char *path)
 {
 	struct child_process cp = CHILD_PROCESS_INIT;
diff --git a/submodule.h b/submodule.h
index b52a4ff1e7..08d278a414 100644
--- a/submodule.h
+++ b/submodule.h
@@ -41,6 +41,13 @@ struct submodule_update_strategy {
 	.type = SM_UPDATE_UNSPECIFIED, \
 }
 
+struct submodule_status_util {
+	int changed, ignore_untracked;
+	unsigned dirty_submodule, newmode;
+	struct cache_entry *ce;
+	const char *path;
+};
+
 int is_gitmodules_unmerged(struct index_state *istate);
 int is_writing_gitmodules_ok(void);
 int is_staging_gitmodules_ok(struct index_state *istate);
@@ -94,6 +101,8 @@ int fetch_submodules(struct repository *r,
 		     int command_line_option,
 		     int default_option,
 		     int quiet, int max_parallel_jobs);
+int get_submodules_status(struct string_list *submodules,
+			  int max_parallel_jobs);
 unsigned is_submodule_modified(const char *path, int ignore_untracked);
 int submodule_uses_gitfile(const char *path);
 
diff --git a/t/t4027-diff-submodule.sh b/t/t4027-diff-submodule.sh
index 40164ae07d..1c747cc325 100755
--- a/t/t4027-diff-submodule.sh
+++ b/t/t4027-diff-submodule.sh
@@ -34,6 +34,25 @@ test_expect_success setup '
 	subtip=$3 subprev=$2
 '
 
+test_expect_success 'diff in superproject with submodules respects parallel settings' '
+	test_when_finished "rm -f trace.out" &&
+	(
+		GIT_TRACE=$(pwd)/trace.out git diff &&
+		grep "1 tasks" trace.out &&
+		>trace.out &&
+
+		git config submodule.diffJobs 8 &&
+		GIT_TRACE=$(pwd)/trace.out git diff &&
+		grep "8 tasks" trace.out &&
+		>trace.out &&
+
+		GIT_TRACE=$(pwd)/trace.out git -c submodule.diffJobs=0 diff &&
+		grep "preparing to run up to [0-9]* tasks" trace.out &&
+		! grep "up to 0 tasks" trace.out &&
+		>trace.out
+	)
+'
+
 test_expect_success 'git diff --raw HEAD' '
 	hexsz=$(test_oid hexsz) &&
 	git diff --raw --abbrev=$hexsz HEAD >actual &&
@@ -70,6 +89,18 @@ test_expect_success 'git diff HEAD with dirty submodule (work tree)' '
 	test_cmp expect.body actual.body
 '
 
+test_expect_success 'git diff HEAD with dirty submodule (work tree, parallel)' '
+	(
+		cd sub &&
+		git reset --hard &&
+		echo >>world
+	) &&
+	git -c submodule.diffJobs=8 diff HEAD >actual &&
+	sed -e "1,/^@@/d" actual >actual.body &&
+	expect_from_to >expect.body $subtip $subprev-dirty &&
+	test_cmp expect.body actual.body
+'
+
 test_expect_success 'git diff HEAD with dirty submodule (index)' '
 	(
 		cd sub &&
diff --git a/t/t7506-status-submodule.sh b/t/t7506-status-submodule.sh
index d050091345..7da64e4c4c 100755
--- a/t/t7506-status-submodule.sh
+++ b/t/t7506-status-submodule.sh
@@ -412,4 +412,29 @@ test_expect_success 'status with added file in nested submodule (short)' '
 	EOF
 '
 
+test_expect_success 'status in superproject with submodules respects parallel settings' '
+	test_when_finished "rm -f trace.out" &&
+	(
+		GIT_TRACE=$(pwd)/trace.out git status &&
+		grep "1 tasks" trace.out &&
+		>trace.out &&
+
+		git config submodule.diffJobs 8 &&
+		GIT_TRACE=$(pwd)/trace.out git status &&
+		grep "8 tasks" trace.out &&
+		>trace.out &&
+
+		GIT_TRACE=$(pwd)/trace.out git -c submodule.diffJobs=0 status &&
+		grep "preparing to run up to [0-9]* tasks" trace.out &&
+		! grep "up to 0 tasks" trace.out &&
+		>trace.out
+	)
+'
+
+test_expect_success 'status in superproject with submodules (parallel)' '
+	git -C super status --porcelain >output &&
+	git -C super -c submodule.diffJobs=8 status --porcelain >output_parallel &&
+	diff output output_parallel
+'
+
 test_done
-- 
2.39.1.519.gcb327c4b5f-goog


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [PATCH v8 0/6] submodule: parallelize diff
  2023-02-09  0:02       ` [PATCH v8 0/6] " Calvin Wan
@ 2023-02-09  1:42         ` Ævar Arnfjörð Bjarmason
  2023-02-09 19:50         ` Junio C Hamano
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 86+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2023-02-09  1:42 UTC (permalink / raw)
  To: Calvin Wan; +Cc: git, chooglen, newren, jonathantanmy, phillip.wood123


On Thu, Feb 09 2023, Calvin Wan wrote:

> 6:  0d35fcc38d < -:  ---------- diff-lib: refactor match_stat_with_submodule
> 7:  fd1eec974d ! 6:  bb25dadbe5 diff-lib: parallelize run_diff_files for submodules
>     @@ diff-lib.c: static int check_removed(const struct index_state *istate, const str
>      +				     unsigned *ignore_untracked)
>       {
>       	int changed = ie_match_stat(diffopt->repo->index, ce, st, ce_option);
>     - 	struct diff_flags orig_flags;
>     +-	if (S_ISGITLINK(ce->ce_mode)) {
>     +-		struct diff_flags orig_flags = diffopt->flags;
>     +-		if (!diffopt->flags.override_submodule_config)
>     +-			set_diffopt_flags_from_submodule_config(diffopt, ce->name);
>     +-		if (diffopt->flags.ignore_submodules)
>     +-			changed = 0;
>     +-		else if (!diffopt->flags.ignore_dirty_submodules &&
>     +-			 (!changed || diffopt->flags.dirty_submodules))
>     ++	struct diff_flags orig_flags;
>      +	int defer = 0;
>     - 
>     - 	if (!S_ISGITLINK(ce->ce_mode))
>     --		return changed;
>     ++
>     ++	if (!S_ISGITLINK(ce->ce_mode))
>      +		goto ret;
>     - 
>     - 	orig_flags = diffopt->flags;
>     - 	if (!diffopt->flags.override_submodule_config)
>     -@@ diff-lib.c: static int match_stat_with_submodule(struct diff_options *diffopt,
>     - 		goto cleanup;
>     - 	}
>     - 	if (!diffopt->flags.ignore_dirty_submodules &&
>     --	    (!changed || diffopt->flags.dirty_submodules))
>     --		*dirty_submodule = is_submodule_modified(ce->name,
>     ++
>     ++	orig_flags = diffopt->flags;
>     ++	if (!diffopt->flags.override_submodule_config)
>     ++		set_diffopt_flags_from_submodule_config(diffopt, ce->name);
>     ++	if (diffopt->flags.ignore_submodules) {
>     ++		changed = 0;
>     ++		goto cleanup;
>     ++	}
>     ++	if (!diffopt->flags.ignore_dirty_submodules &&
>      +	    (!changed || diffopt->flags.dirty_submodules)) {
>      +		if (defer_submodule_status && *defer_submodule_status) {
>      +			defer = 1;
>      +			*ignore_untracked = diffopt->flags.ignore_untracked_in_submodules;
>      +		} else {
>     -+			*dirty_submodule = is_submodule_modified(ce->name,
>     - 					 diffopt->flags.ignore_untracked_in_submodules);
>     + 			*dirty_submodule = is_submodule_modified(ce->name,
>     +-								 diffopt->flags.ignore_untracked_in_submodules);
>     +-		diffopt->flags = orig_flags;
>     ++					 diffopt->flags.ignore_untracked_in_submodules);
>      +		}
>     -+	}
>     - cleanup:
>     - 	diffopt->flags = orig_flags;
>     + 	}
>     ++cleanup:
>     ++	diffopt->flags = orig_flags;
>      +ret:
>      +	if (defer_submodule_status)
>      +		*defer_submodule_status = defer;
>     @@ diff-lib.c: int run_diff_files(struct rev_info *revs, unsigned int option)
>       				       changed, istate, ce))

I think you dropped the 7/8 per my suggestion in [1]. I think this 6/6
is actually worse than the v6.

I.e. it seems you dropped the previous refactoring commit by squashing
the refactoring+functional change together.

What I was pointing out in [1] was that you don't need the refactoring,
and that both the change itself and the end-state is much easier to look
at and reason about as a result

I.e. I think the diff in your 6/6 should just be what's after "it
becomes" in [1] (maybe with some pre-refactoring, e.g. we could add the
braces first or whatever).

But in case you strongly prefer the current end-state I think having
your previous refactoring prep would be better, because it would at
least split off some of the refactoring & functional change.

I haven't looked as deeply at this v8 as v7 for the rest, but from
skimming the range-diff it all looked good otherwise.

1. https://lore.kernel.org/git/230208.861qn01s4g.gmgdl@evledraar.gmail.com/

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v8 5/6] diff-lib: refactor out diff_change logic
  2023-02-09  0:02       ` [PATCH v8 5/6] diff-lib: refactor out diff_change logic Calvin Wan
@ 2023-02-09  1:48         ` Ævar Arnfjörð Bjarmason
  2023-02-13  8:42         ` Glen Choo
  1 sibling, 0 replies; 86+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2023-02-09  1:48 UTC (permalink / raw)
  To: Calvin Wan; +Cc: git, chooglen, newren, jonathantanmy, phillip.wood123

On Thu, Feb 09 2023, Calvin Wan wrote:

> +	diff_change(options, oldmode, newmode,
> +			old_oid, new_oid,
> +			!is_null_oid(old_oid),
> +			!is_null_oid(new_oid),
> +			ce->name, 0, dirty_submodule);

Nit: This has odd not-our-usual-style indentation (to align with the
"("). I didn't spot it before, but I vaguely recall seeing something
like this in another one of your patches, but maybe I misrecall. In case
not maybe some editor settings need tweaking?

I haven't looked carefully at the rest to see if the same issue occurs
in other code here.

> -		if (!changed && !dirty_submodule) {
> -			ce_mark_uptodate(ce);
> -			mark_fsmonitor_valid(istate, ce);
> -			if (!revs->diffopt.flags.find_copies_harder)
> -				continue;
> -		}
> -		oldmode = ce->ce_mode;
> -		old_oid = &ce->oid;
> -		new_oid = changed ? null_oid() : &ce->oid;
> -		diff_change(&revs->diffopt, oldmode, newmode,
> -			    old_oid, new_oid,
> -			    !is_null_oid(old_oid),
> -			    !is_null_oid(new_oid),
> -			    ce->name, 0, dirty_submodule);

So in this case it's not new code, but code moving, note the four spaces
after the sequence of tabs that aren't in your version.

So perhaps your editor on re-indentation is configured not to just strip
off the leading \t to re-indent (which is all that's needed here) but
strips all whitespace, then re-indents after its own mind?

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v8 0/6] submodule: parallelize diff
  2023-02-09  0:02       ` [PATCH v8 0/6] " Calvin Wan
  2023-02-09  1:42         ` Ævar Arnfjörð Bjarmason
@ 2023-02-09 19:50         ` Junio C Hamano
  2023-02-09 21:52           ` Calvin Wan
  2023-02-09 20:50         ` Phillip Wood
  2023-03-02 21:52         ` [PATCH v9 " Calvin Wan
  3 siblings, 1 reply; 86+ messages in thread
From: Junio C Hamano @ 2023-02-09 19:50 UTC (permalink / raw)
  To: Calvin Wan; +Cc: git, avarab, chooglen, newren, jonathantanmy, phillip.wood123

Calvin Wan <calvinwan@google.com> writes:

> Original cover letter for context:
> https://lore.kernel.org/git/20221011232604.839941-1-calvinwan@google.com/

Thanks.  I'll try to take a look at this today.

By the way, how are you driving send-email when sending a
multi-patch series with a cover letter?  It seems that all
messages in this series including its cover are marked as if they
are replies to the cover letter of the previous round, which is a
bit harder to follow than making only [v8 0/6] as a reply to [v7 0/X]
and all [v8 n/6] (n > 0) to be replies to [v8 0/6].

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 1/7] run-command: add duplicate_output_fn to run_processes_parallel_opts
  2023-02-08 22:54         ` Calvin Wan
@ 2023-02-09 20:37           ` Phillip Wood
  0 siblings, 0 replies; 86+ messages in thread
From: Phillip Wood @ 2023-02-09 20:37 UTC (permalink / raw)
  To: Calvin Wan; +Cc: git, avarab, chooglen, newren, jonathantanmy

Hi Calvin

On 08/02/2023 22:54, Calvin Wan wrote:
>>> +                     } else {
>>> +                             if (opts->duplicate_output)
>>> +                                     opts->duplicate_output(&pp->children[i].err,
>>> +                                            strlen(pp->children[i].err.buf) - n,
>>
>> Looking at how this is used in patch 7 I think it would be better to
>> pass a const char*, length pair rather than a struct strbuf*, offset pair.
>> i.e.
>>          opts->duplicate_output(pp->children[i].err.buf +
>> pp->children[i].err.len - n, n, ...)
>>
>> That would make it clear that we do not expect duplicate_output() to
>> alter the buffer and would avoid the duplicate_output() having to add
>> the offset to the start of the buffer to find the new data.
> 
> I don't think that would work since
> pp->children[i].err.buf + pp->children[i].err.len - n
> wouldn't end up as a const char* unless I'm missing something?

You can still pass it to a function that takes a const char* though and 
change type of the callback to

typedef void (*duplicate_output_fn)(const char *out, size_t offset,
				    void *pp_cb, void *pp_task_cb);

Best Wishes

Phillip

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v8 0/6] submodule: parallelize diff
  2023-02-09  0:02       ` [PATCH v8 0/6] " Calvin Wan
  2023-02-09  1:42         ` Ævar Arnfjörð Bjarmason
  2023-02-09 19:50         ` Junio C Hamano
@ 2023-02-09 20:50         ` Phillip Wood
  2023-03-02 21:52         ` [PATCH v9 " Calvin Wan
  3 siblings, 0 replies; 86+ messages in thread
From: Phillip Wood @ 2023-02-09 20:50 UTC (permalink / raw)
  To: Calvin Wan, git; +Cc: avarab, chooglen, newren, jonathantanmy, phillip.wood123

Hi Calvin

On 09/02/2023 00:02, Calvin Wan wrote:
> Original cover letter for context:
> https://lore.kernel.org/git/20221011232604.839941-1-calvinwan@google.com/
> 
> This reroll contains stylistic changes suggested by Avar and Phillip,
> and includes a range-diff below.
> 
> Calvin Wan (6):
>    run-command: add duplicate_output_fn to run_processes_parallel_opts
>    submodule: strbuf variable rename
>    submodule: move status parsing into function
>    submodule: refactor is_submodule_modified()
>    diff-lib: refactor out diff_change logic
>    diff-lib: parallelize run_diff_files for submodules
> 
>   Documentation/config/submodule.txt |  12 ++
>   diff-lib.c                         | 133 +++++++++++----
>   run-command.c                      |  16 +-
>   run-command.h                      |  25 +++
>   submodule.c                        | 266 ++++++++++++++++++++++++-----
>   submodule.h                        |   9 +
>   t/helper/test-run-command.c        |  20 +++
>   t/t0061-run-command.sh             |  39 +++++
>   t/t4027-diff-submodule.sh          |  31 ++++
>   t/t7506-status-submodule.sh        |  25 +++
>   10 files changed, 497 insertions(+), 79 deletions(-)
> 
> Range-diff against v7:
> 6:  0d35fcc38d < -:  ---------- diff-lib: refactor match_stat_with_submodule
> 7:  fd1eec974d ! 6:  bb25dadbe5 diff-lib: parallelize run_diff_files for submodules
>      @@ diff-lib.c: static int check_removed(const struct index_state *istate, const str
>       +				     unsigned *ignore_untracked)
>        {
>        	int changed = ie_match_stat(diffopt->repo->index, ce, st, ce_option);
>      - 	struct diff_flags orig_flags;
>      +-	if (S_ISGITLINK(ce->ce_mode)) {
>      +-		struct diff_flags orig_flags = diffopt->flags;
>      +-		if (!diffopt->flags.override_submodule_config)
>      +-			set_diffopt_flags_from_submodule_config(diffopt, ce->name);
>      +-		if (diffopt->flags.ignore_submodules)
>      +-			changed = 0;
>      +-		else if (!diffopt->flags.ignore_dirty_submodules &&
>      +-			 (!changed || diffopt->flags.dirty_submodules))
>      ++	struct diff_flags orig_flags;
>       +	int defer = 0;
>      -
>      - 	if (!S_ISGITLINK(ce->ce_mode))
>      --		return changed;
>      ++
>      ++	if (!S_ISGITLINK(ce->ce_mode))
>       +		goto ret;
>      -
>      - 	orig_flags = diffopt->flags;
>      - 	if (!diffopt->flags.override_submodule_config)
>      -@@ diff-lib.c: static int match_stat_with_submodule(struct diff_options *diffopt,
>      - 		goto cleanup;
>      - 	}
>      - 	if (!diffopt->flags.ignore_dirty_submodules &&
>      --	    (!changed || diffopt->flags.dirty_submodules))
>      --		*dirty_submodule = is_submodule_modified(ce->name,
>      ++
>      ++	orig_flags = diffopt->flags;
>      ++	if (!diffopt->flags.override_submodule_config)
>      ++		set_diffopt_flags_from_submodule_config(diffopt, ce->name);
>      ++	if (diffopt->flags.ignore_submodules) {
>      ++		changed = 0;
>      ++		goto cleanup;
>      ++	}
>      ++	if (!diffopt->flags.ignore_dirty_submodules &&
>       +	    (!changed || diffopt->flags.dirty_submodules)) {
>       +		if (defer_submodule_status && *defer_submodule_status) {
>       +			defer = 1;
>       +			*ignore_untracked = diffopt->flags.ignore_untracked_in_submodules;
>       +		} else {
>      -+			*dirty_submodule = is_submodule_modified(ce->name,
>      - 					 diffopt->flags.ignore_untracked_in_submodules);
>      + 			*dirty_submodule = is_submodule_modified(ce->name,
>      +-								 diffopt->flags.ignore_untracked_in_submodules);
>      +-		diffopt->flags = orig_flags;
>      ++					 diffopt->flags.ignore_untracked_in_submodules);
>       +		}
>      -+	}
>      - cleanup:
>      - 	diffopt->flags = orig_flags;
>      + 	}
>      ++cleanup:
>      ++	diffopt->flags = orig_flags;
>       +ret:
>       +	if (defer_submodule_status)

The idea behind the suggestion to drop the previous patch from the last 
version was to stop refactoring the if block and get away from having 
these labels. Can't you just add the "if (defer_submodule_status && 
...)" into the existing code?

Best Wishes

Phillip

>       +		*defer_submodule_status = defer;
>      @@ diff-lib.c: int run_diff_files(struct rev_info *revs, unsigned int option)
>        				       changed, istate, ce))
>        			continue;
>        	}
>      -+	if (submodules.nr > 0) {
>      -+		int parallel_jobs;
>      -+		if (git_config_get_int("submodule.diffjobs", &parallel_jobs))
>      ++	if (submodules.nr) {
>      ++		unsigned long parallel_jobs;
>      ++		struct string_list_item *item;
>      ++
>      ++		if (git_config_get_ulong("submodule.diffjobs", &parallel_jobs))
>       +			parallel_jobs = 1;
>       +		else if (!parallel_jobs)
>       +			parallel_jobs = online_cpus();
>      -+		else if (parallel_jobs < 0)
>      -+			die(_("submodule.diffjobs cannot be negative"));
>       +
>       +		if (get_submodules_status(&submodules, parallel_jobs))
>       +			die(_("submodule status failed"));
>      -+		for (size_t i = 0; i < submodules.nr; i++) {
>      -+			struct submodule_status_util *util = submodules.items[i].util;
>      ++		for_each_string_list_item(item, &submodules) {
>      ++			struct submodule_status_util *util = item->util;
>       +
>       +			if (diff_change_helper(&revs->diffopt, util->newmode,
>       +				       util->dirty_submodule, util->changed,
>      @@ submodule.c: int submodule_touches_in_range(struct repository *r,
>       +	int result;
>       +
>       +	struct string_list *submodule_names;
>      -+
>      -+	/* Pending statuses by OIDs */
>      -+	struct status_task **oid_status_tasks;
>      -+	int oid_status_tasks_nr, oid_status_tasks_alloc;
>       +};
>       +
>        struct submodule_parallel_fetch {
>      @@ submodule.c: unsigned is_submodule_modified(const char *path, int ignore_untrack
>       +	struct status_task *task = task_cb;
>       +
>       +	sps->result = 1;
>      -+	strbuf_addf(err,
>      -+	    _(status_porcelain_start_error),
>      -+	    task->path);
>      ++	strbuf_addf(err, _(STATUS_PORCELAIN_START_ERROR), task->path);
>       +	return 0;
>       +}
>       +
>      @@ submodule.c: unsigned is_submodule_modified(const char *path, int ignore_untrack
>       +
>       +	if (retvalue) {
>       +		sps->result = 1;
>      -+		strbuf_addf(err,
>      -+		    _(status_porcelain_fail_error),
>      -+		    task->path);
>      ++		strbuf_addf(err, _(STATUS_PORCELAIN_FAIL_ERROR), task->path);
>       +	}
>       +
>       +	parse_status_porcelain_strbuf(&task->out,

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 5/7] diff-lib: refactor out diff_change logic
  2023-02-08 23:12         ` Calvin Wan
@ 2023-02-09 20:53           ` Phillip Wood
  0 siblings, 0 replies; 86+ messages in thread
From: Phillip Wood @ 2023-02-09 20:53 UTC (permalink / raw)
  To: Calvin Wan; +Cc: git, avarab, chooglen, newren, jonathantanmy

Hi Calvin

On 08/02/2023 23:12, Calvin Wan wrote:
>> I worry that having three integer parameters next to each other makes it
>> very easy to mix them up with out getting any errors from the compiler
>> because the types are all compatible. Could the last two be combined
>> into a flags argument? A similar issues occurs in
>> match_stat_with_submodule() in patch 7
> 
> I'm not sure how much more I want to engineer a static helper function
> that is only being called in one other place. I also don't understand what
> you mean by combining the last two into paramters a flags argument.

Are `dirty_submodule` and `changed` booleans? If so then you can have a 
single bit flags argument made up of

#define SUBMODULE_DIRTY 1
#define SUBMODULE_CHANGED 2

Best Wishes

Phillip

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v8 0/6] submodule: parallelize diff
  2023-02-09 19:50         ` Junio C Hamano
@ 2023-02-09 21:52           ` Calvin Wan
  2023-02-09 22:25             ` Junio C Hamano
  2023-02-10 13:24             ` Ævar Arnfjörð Bjarmason
  0 siblings, 2 replies; 86+ messages in thread
From: Calvin Wan @ 2023-02-09 21:52 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, avarab, chooglen, newren, jonathantanmy, phillip.wood123

> By the way, how are you driving send-email when sending a
> multi-patch series with a cover letter?  It seems that all
> messages in this series including its cover are marked as if they
> are replies to the cover letter of the previous round, which is a
> bit harder to follow than making only [v8 0/6] as a reply to [v7 0/X]
> and all [v8 n/6] (n > 0) to be replies to [v8 0/6].

I'll do that from now on -- didn't realize that make it harder to follow

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v8 0/6] submodule: parallelize diff
  2023-02-09 21:52           ` Calvin Wan
@ 2023-02-09 22:25             ` Junio C Hamano
  2023-02-10 13:24             ` Ævar Arnfjörð Bjarmason
  1 sibling, 0 replies; 86+ messages in thread
From: Junio C Hamano @ 2023-02-09 22:25 UTC (permalink / raw)
  To: Calvin Wan; +Cc: git, avarab, chooglen, newren, jonathantanmy, phillip.wood123

Calvin Wan <calvinwan@google.com> writes:

>> By the way, how are you driving send-email when sending a
>> multi-patch series with a cover letter?  It seems that all
>> messages in this series including its cover are marked as if they
>> are replies to the cover letter of the previous round, which is a
>> bit harder to follow than making only [v8 0/6] as a reply to [v7 0/X]
>> and all [v8 n/6] (n > 0) to be replies to [v8 0/6].
>
> I'll do that from now on -- didn't realize that make it harder to follow

"a bit harder" may even have been an exaggeration.  It is just being
different from how other topics by many other people are formatted.

Thanks.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v8 0/6] submodule: parallelize diff
  2023-02-09 21:52           ` Calvin Wan
  2023-02-09 22:25             ` Junio C Hamano
@ 2023-02-10 13:24             ` Ævar Arnfjörð Bjarmason
  2023-02-10 17:42               ` Junio C Hamano
  1 sibling, 1 reply; 86+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2023-02-10 13:24 UTC (permalink / raw)
  To: Calvin Wan
  Cc: Junio C Hamano, git, chooglen, newren, jonathantanmy,
	phillip.wood123


On Thu, Feb 09 2023, Calvin Wan wrote:

>> By the way, how are you driving send-email when sending a
>> multi-patch series with a cover letter?  It seems that all
>> messages in this series including its cover are marked as if they
>> are replies to the cover letter of the previous round, which is a
>> bit harder to follow than making only [v8 0/6] as a reply to [v7 0/X]
>> and all [v8 n/6] (n > 0) to be replies to [v8 0/6].
>
> I'll do that from now on -- didn't realize that make it harder to follow

Welcome to the club :)

This came up before when I'd been sending mails like this for years,
without realizing the difference:
https://lore.kernel.org/git/nycvar.QRO.7.76.6.2103191540330.57@tvgsbejvaqbjf.bet/
& https://lore.kernel.org/git/xmqqr1k9k2w7.fsf@gitster.g/

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v8 0/6] submodule: parallelize diff
  2023-02-10 13:24             ` Ævar Arnfjörð Bjarmason
@ 2023-02-10 17:42               ` Junio C Hamano
  0 siblings, 0 replies; 86+ messages in thread
From: Junio C Hamano @ 2023-02-10 17:42 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Calvin Wan, git, chooglen, newren, jonathantanmy, phillip.wood123

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> Welcome to the club :)
>
> This came up before when I'd been sending mails like this for years,
> without realizing the difference:
> https://lore.kernel.org/git/nycvar.QRO.7.76.6.2103191540330.57@tvgsbejvaqbjf.bet/
> & https://lore.kernel.org/git/xmqqr1k9k2w7.fsf@gitster.g/

The organization makes it easier to identify the cover letter,
mechanically from the thread structure without relying on the
subject line [*], and that is one of the things that the procedure
to prepare the "What's cooking" report needs to do.

	Side note: When the "What's cooking" report is updated, it
	knows individual commits on a topic, and the message ID of
	the patch for each of these commits (they are recorded in
	refs/notes/amlog).  But the message ID of the cover letter
	is not recorded anywhere because it does not become any
	commit, so it looks at these messages to find references
	and/or in-reply-to.  A flat "everything including cover is
	reply to previous cover" organization would not help to find
	the cover of this iteration at all.

In Documentation/SubmittingPatches, we give unnecessary either-or
recommendation.  We should clearly spell it out instead, perhaps
something like this.

 Documentation/SubmittingPatches | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git c/Documentation/SubmittingPatches w/Documentation/SubmittingPatches
index 927f7329a5..af7f2a4045 100644
--- c/Documentation/SubmittingPatches
+++ w/Documentation/SubmittingPatches
@@ -346,8 +346,9 @@ your code.  For this reason, each patch should be submitted
 
 Multiple related patches should be grouped into their own e-mail
 thread to help readers find all parts of the series.  To that end,
-send them as replies to either an additional "cover letter" message
-(see below), the first patch, or the respective preceding patch.
+send them as replies to an additional "cover letter" message
+(see below), which should be a reply to the "cover letter" of
+the previous iteration.
 
 If your log message (including your name on the
 `Signed-off-by` trailer) is not writable in ASCII, make sure that

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [PATCH v8 1/6] run-command: add duplicate_output_fn to run_processes_parallel_opts
  2023-02-09  0:02       ` [PATCH v8 1/6] run-command: add duplicate_output_fn to run_processes_parallel_opts Calvin Wan
@ 2023-02-13  6:34         ` Glen Choo
  2023-02-13 17:52           ` Junio C Hamano
  0 siblings, 1 reply; 86+ messages in thread
From: Glen Choo @ 2023-02-13  6:34 UTC (permalink / raw)
  To: Calvin Wan, git
  Cc: Calvin Wan, avarab, newren, jonathantanmy, phillip.wood123

Calvin Wan <calvinwan@google.com> writes:

> @@ -1645,14 +1650,19 @@ static void pp_buffer_stderr(struct parallel_processes *pp,
>  	for (size_t i = 0; i < opts->processes; i++) {
>  		if (pp->children[i].state == GIT_CP_WORKING &&
>  		    pp->pfd[i].revents & (POLLIN | POLLHUP)) {
> -			int n = strbuf_read_once(&pp->children[i].err,
> -						 pp->children[i].process.err, 0);
> +			ssize_t n = strbuf_read_once(&pp->children[i].err,
> +						     pp->children[i].process.err, 0);
>  			if (n == 0) {
>  				close(pp->children[i].process.err);
>  				pp->children[i].state = GIT_CP_WAIT_CLEANUP;
> -			} else if (n < 0)
> +			} else if (n < 0) {
>  				if (errno != EAGAIN)
>  					die_errno("read");
> +			} else if (opts->duplicate_output) {
> +				opts->duplicate_output(&pp->children[i].err,
> +					pp->children[i].err.len - n,
> +					opts->data, pp->children[i].data);
> +			}
>  		}
>  	}
>  }

What do we think of the name "duplicate_output"? IMO it made sense in
earlier versions when we were copying the output to a separate buffer (I
believe it was renamed in response to [1]), but now that we're just
calling a callback on the main buffer, it seems misleading. Maybe
"output_buffered" would be better?

Sidenote: One convention from JS that I like is to name such event
listeners as "on_<event_name>", e.g. "on_output_buffered". This makes
naming a lot easier sometimes because you don't have to worry about
having your event listener being mistaken for something else. It
wouldn't be idiomatic for Git today, but I wonder what others think
about adopting this.

[1] https://lore.kernel.org/git/xmqq4jvxpw46.fsf@gitster.g/

> +/**
> + * This callback is called whenever output from a child process is buffered
> + * 
> + * See run_processes_parallel() below for a discussion of the "struct
> + * strbuf *out" parameter.
> + * 
> + * The offset refers to the number of bytes originally in "out" before
> + * the output from the child process was buffered. Therefore, the buffer
> + * range, "out + buf" to the end of "out", would contain the buffer of
> + * the child process output.

Looks like there's extra whitespace on the 'blank' lines.


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v8 4/6] submodule: refactor is_submodule_modified()
  2023-02-09  0:02       ` [PATCH v8 4/6] submodule: refactor is_submodule_modified() Calvin Wan
@ 2023-02-13  7:06         ` Glen Choo
  0 siblings, 0 replies; 86+ messages in thread
From: Glen Choo @ 2023-02-13  7:06 UTC (permalink / raw)
  To: Calvin Wan, git
  Cc: Calvin Wan, avarab, newren, jonathantanmy, phillip.wood123

Calvin Wan <calvinwan@google.com> writes:

> Refactor out submodule status logic and error messages that will be
> used in a future patch.

This improves the readability of the last patch by quite a lot. Thanks
for taking the suggestion :)

(This patch was actually introduced in the previous round, but I missed
that, sorry.)

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v8 6/6] diff-lib: parallelize run_diff_files for submodules
  2023-02-09  0:02       ` [PATCH v8 6/6] diff-lib: parallelize run_diff_files for submodules Calvin Wan
@ 2023-02-13  8:36         ` Glen Choo
  0 siblings, 0 replies; 86+ messages in thread
From: Glen Choo @ 2023-02-13  8:36 UTC (permalink / raw)
  To: Calvin Wan, git
  Cc: Calvin Wan, avarab, newren, jonathantanmy, phillip.wood123

Calvin Wan <calvinwan@google.com> writes:

> @@ -244,6 +266,8 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
>  			newmode = ce->ce_mode;
>  		} else {
>  			struct stat st;
> +			unsigned ignore_untracked = 0;
> +			int defer_submodule_status = 1;
>  
>  			changed = check_removed(istate, ce, &st);
>  			if (changed) {

Previously [1] it wasn't entirely clear whether we intended to always
parallelize submodule diffing, but now it seems that we always try to
parallelize. In essence, this means that we don't have a serial
implementation any more, but maybe that's okay.

[1] https://lore.kernel.org/git/kl6lilgtveoe.fsf@chooglen-macbookpro.roam.corp.google.com/

> @@ -265,14 +289,53 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
>  			}
>  
>  			changed = match_stat_with_submodule(&revs->diffopt, ce, &st,
> -							    ce_option, &dirty_submodule);
> +							    ce_option, &dirty_submodule,
> +							    &defer_submodule_status,
> +							    &ignore_untracked);

Here we get the 'changed' bit of the submodule. Because we always defer,
we never call is_submodule_modified() inside
match_stat_with_submodule() and as such, we never set "dirty_submodule"
here. If so, could we remove the variable altogether?

>  			newmode = ce_mode_from_stat(ce, st.st_mode);
> +			if (defer_submodule_status) {
> +				struct submodule_status_util tmp = {
> +					.changed = changed,
> +					.dirty_submodule = 0,
> +					.ignore_untracked = ignore_untracked,
> +					.newmode = newmode,
> +					.ce = ce,
> +					.path = ce->name,
> +				};
> +				struct string_list_item *item;
> +
> +				item = string_list_append(&submodules, ce->name);
> +				item->util = xmalloc(sizeof(tmp));
> +				memcpy(item->util, &tmp, sizeof(tmp));
> +				continue;
> +			}
>  		}
>  
>  		if (diff_change_helper(&revs->diffopt, newmode, dirty_submodule,
>  				       changed, istate, ce))

I'm surprised to see that we still call "diff_change_helper()" even
though we've 'deferred' the submodule diff, especially since "changed"
is set and "dirty_submodule" is unset. Even if this is safe, I think we
shouldn't do this because...

> +	if (submodules.nr) {
> +		unsigned long parallel_jobs;
> +		struct string_list_item *item;
> +
> +		if (git_config_get_ulong("submodule.diffjobs", &parallel_jobs))
> +			parallel_jobs = 1;
> +		else if (!parallel_jobs)
> +			parallel_jobs = online_cpus();
> +
> +		if (get_submodules_status(&submodules, parallel_jobs))
> +			die(_("submodule status failed"));
> +		for_each_string_list_item(item, &submodules) {
> +			struct submodule_status_util *util = item->util;
> +
> +			if (diff_change_helper(&revs->diffopt, util->newmode,
> +				       util->dirty_submodule, util->changed,
> +				       istate, util->ce))

Here we call "diff_change_helper()" again on the deferred submodule, but
now with the "dirty_submodule" value we expected. At best this is
wasteful, but at worst this is possibly wrong.

For good measure, I applied this patch to see if we needed either
"dirty_submodule" or the second "diff_change_helper()" call; our
test suite still passes after I remove both of them.

  diff --git a/diff-lib.c b/diff-lib.c
  index 2dde575ec6..21adcc7fd6 100644
  --- a/diff-lib.c
  +++ b/diff-lib.c
  @@ -156,6 +156,7 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
      struct cache_entry *ce = istate->cache[i];
      int changed;
      unsigned dirty_submodule = 0;
  +		int defer_submodule_status = 1;

      if (diff_can_quit_early(&revs->diffopt))
        break;
  @@ -267,7 +268,6 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
      } else {
        struct stat st;
        unsigned ignore_untracked = 0;
  -			int defer_submodule_status = 1;

        changed = check_removed(istate, ce, &st);
        if (changed) {
  @@ -311,9 +311,9 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
        }
      }

  -		if (diff_change_helper(&revs->diffopt, newmode, dirty_submodule,
  -				       changed, istate, ce))
  -			continue;
  +		if (!defer_submodule_status)
  +			diff_change_helper(&revs->diffopt, newmode, 0,
  +					   changed, istate, ce);
    }
    if (submodules.nr) {
      unsigned long parallel_jobs;


> +static void parse_status_porcelain_strbuf(struct strbuf *buf,
> +				   unsigned *dirty_submodule,
> +				   int ignore_untracked)
> +{
> +	struct string_list list = STRING_LIST_INIT_DUP;
> +	struct string_list_item *item;
> +
> +	string_list_split(&list, buf->buf, '\n', -1);
> +
> +	for_each_string_list_item(item, &list) {
> +		if (parse_status_porcelain(item->string,
> +					   strlen(item->string),
> +					   dirty_submodule,
> +					   ignore_untracked))
> +			break;
> +	}
> +	string_list_clear(&list, 0);
> +}

Given that this function only has one caller, is quite simple, and isn't
actually a strbuf version of "parse_status_porcelain()" (it's actually a
multiline version that also happens to accept a strbuf), I think this
might be better inlined.

> +test_expect_success 'status in superproject with submodules (parallel)' '
> +	git -C super status --porcelain >output &&
> +	git -C super -c submodule.diffJobs=8 status --porcelain >output_parallel &&
> +	diff output output_parallel
> +'
> +
>  test_done

When I first suggested this test, I thought that we would sometimes
defer submodule status and sometimes not, so this would be a good way to
check between both modes. Now this is less useful, since this is only
checking that parallelism > 1 doesn't affect the output, but it's still
a useful reasonableness check IMO. Thanks.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v8 2/6] submodule: strbuf variable rename
  2023-02-09  0:02       ` [PATCH v8 2/6] submodule: strbuf variable rename Calvin Wan
@ 2023-02-13  8:37         ` Glen Choo
  0 siblings, 0 replies; 86+ messages in thread
From: Glen Choo @ 2023-02-13  8:37 UTC (permalink / raw)
  To: Calvin Wan, git
  Cc: Calvin Wan, avarab, newren, jonathantanmy, phillip.wood123

Calvin Wan <calvinwan@google.com> writes:

> Subject: [PATCH v8 2/6] submodule: strbuf variable rename

This should probably be "submodule: rename strbuf variable".

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v8 5/6] diff-lib: refactor out diff_change logic
  2023-02-09  0:02       ` [PATCH v8 5/6] diff-lib: refactor out diff_change logic Calvin Wan
  2023-02-09  1:48         ` Ævar Arnfjörð Bjarmason
@ 2023-02-13  8:42         ` Glen Choo
  2023-02-13 18:29           ` Calvin Wan
  1 sibling, 1 reply; 86+ messages in thread
From: Glen Choo @ 2023-02-13  8:42 UTC (permalink / raw)
  To: Calvin Wan, git
  Cc: Calvin Wan, avarab, newren, jonathantanmy, phillip.wood123

Calvin Wan <calvinwan@google.com> writes:

> Refactor out logic that sets up the diff_change call into a helper
> function for a future patch.

This seems underspecified; there are two diff_change calls in diff-lib,
and the call in show_modified() is not changed in this patch.

> +static int diff_change_helper(struct diff_options *options,
> +	      unsigned newmode, unsigned dirty_submodule,
> +	      int changed, struct index_state *istate,
> +	      struct cache_entry *ce)

The function name is very generic, and it's not clear:

- What this does besides calling "diff_change()".
- When I should call this instead of "diff_change()".
- What the return value means.

Both of these should be documented in a comment, and I also suggest
renaming the function.


> @@ -245,21 +269,9 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
>  			newmode = ce_mode_from_stat(ce, st.st_mode);
>  		}
>  
> -		if (!changed && !dirty_submodule) {
> -			ce_mark_uptodate(ce);
> -			mark_fsmonitor_valid(istate, ce);
> -			if (!revs->diffopt.flags.find_copies_harder)
> -				continue;
> -		}
> -		oldmode = ce->ce_mode;
> -		old_oid = &ce->oid;
> -		new_oid = changed ? null_oid() : &ce->oid;
> -		diff_change(&revs->diffopt, oldmode, newmode,
> -			    old_oid, new_oid,
> -			    !is_null_oid(old_oid),
> -			    !is_null_oid(new_oid),
> -			    ce->name, 0, dirty_submodule);
> -
> +		if (diff_change_helper(&revs->diffopt, newmode, dirty_submodule,
> +				       changed, istate, ce))
> +			continue;
>  	}

If I'm reading the indentation correctly, the "continue" comes right
before the end of the for-loop block, so it's a no-op and should be
removed.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v8 1/6] run-command: add duplicate_output_fn to run_processes_parallel_opts
  2023-02-13  6:34         ` Glen Choo
@ 2023-02-13 17:52           ` Junio C Hamano
  2023-02-13 18:26             ` Calvin Wan
  0 siblings, 1 reply; 86+ messages in thread
From: Junio C Hamano @ 2023-02-13 17:52 UTC (permalink / raw)
  To: Glen Choo; +Cc: Calvin Wan, git, avarab, newren, jonathantanmy, phillip.wood123

Glen Choo <chooglen@google.com> writes:

> What do we think of the name "duplicate_output"? IMO it made sense in
> earlier versions when we were copying the output to a separate buffer (I
> believe it was renamed in response to [1]), but now that we're just
> calling a callback on the main buffer, it seems misleading. Maybe
> "output_buffered" would be better?

Yeah, we do not even know what the callback does to the data we are
giving it.  The only thing we know is that we have output from the
child, and in addition to the usual buffering we do ourselves, we
are allowing the callback to peek into the buffered data in advance.

If the callback does consume it *and* remove the buffered data it
consumed right away, then as you say, "duplicate" becomes a word
that totally misses the point.  There is no duplication, as the
callback consumed and we no longer has our own copy, either.

If the callback consumes it but leaves the buffered data as-is, and
we would show that once the child finishes anyway, you can say that
we are feeding a duplicate of buffered data to the callback.  The
mechanism could be used merely to count how much output we have
accumulated so far to update the progress-bar, for example, and the
output may be given after the process is done.  But note that we are
not doing an "output" of "buffered" data in such a case.

To me, both "duplicate_output" and "output_buffered" sound like they
are names that are quite specific to the expected use case the
person who proposed the names had in mind, yet it is a bit hard to
guess exactly what the expected use cases they had in mind were,
because the names are not quite specific enough.

> Sidenote: One convention from JS that I like is to name such event
> listeners as "on_<event_name>", e.g. "on_output_buffered".

Thanks for bringing this up.  I agree that "Upon X happening, do
this" is a very good convention to follow.  I think the callback is
made whenever the child emits to the standard error stream, so
"on_error_output" (if we are worried that "error" has a too strong
"something bad happend" connotation, then perhaps "on_stderr_output"
may dampen it) perhaps?

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v8 1/6] run-command: add duplicate_output_fn to run_processes_parallel_opts
  2023-02-13 17:52           ` Junio C Hamano
@ 2023-02-13 18:26             ` Calvin Wan
  0 siblings, 0 replies; 86+ messages in thread
From: Calvin Wan @ 2023-02-13 18:26 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Glen Choo, git, avarab, newren, jonathantanmy, phillip.wood123

> > Sidenote: One convention from JS that I like is to name such event
> > listeners as "on_<event_name>", e.g. "on_output_buffered".
>
> Thanks for bringing this up.  I agree that "Upon X happening, do
> this" is a very good convention to follow.  I think the callback is
> made whenever the child emits to the standard error stream, so
> "on_error_output" (if we are worried that "error" has a too strong
> "something bad happend" connotation, then perhaps "on_stderr_output"
> may dampen it) perhaps?

"on_stderr_output" sounds much better than "duplicate_output". I
did spend much time trying to come up with a better name, but
couldn't find anything that conveyed what the expected use case
of this function was. Thanks, I'll rename it on my next reroll.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v8 5/6] diff-lib: refactor out diff_change logic
  2023-02-13  8:42         ` Glen Choo
@ 2023-02-13 18:29           ` Calvin Wan
  2023-02-14  4:03             ` Glen Choo
  0 siblings, 1 reply; 86+ messages in thread
From: Calvin Wan @ 2023-02-13 18:29 UTC (permalink / raw)
  To: Glen Choo; +Cc: git, avarab, newren, jonathantanmy, phillip.wood123

On Mon, Feb 13, 2023 at 12:42 AM Glen Choo <chooglen@google.com> wrote:
>
> Calvin Wan <calvinwan@google.com> writes:
>
> > Refactor out logic that sets up the diff_change call into a helper
> > function for a future patch.
>
> This seems underspecified; there are two diff_change calls in diff-lib,
> and the call in show_modified() is not changed in this patch.
>
> > +static int diff_change_helper(struct diff_options *options,
> > +           unsigned newmode, unsigned dirty_submodule,
> > +           int changed, struct index_state *istate,
> > +           struct cache_entry *ce)
>
> The function name is very generic, and it's not clear:
>
> - What this does besides calling "diff_change()".
> - When I should call this instead of "diff_change()".
> - What the return value means.
>
> Both of these should be documented in a comment, and I also suggest
> renaming the function.

ack.

> > @@ -245,21 +269,9 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
> >                       newmode = ce_mode_from_stat(ce, st.st_mode);
> >               }
> >
> > -             if (!changed && !dirty_submodule) {
> > -                     ce_mark_uptodate(ce);
> > -                     mark_fsmonitor_valid(istate, ce);
> > -                     if (!revs->diffopt.flags.find_copies_harder)
> > -                             continue;
> > -             }
> > -             oldmode = ce->ce_mode;
> > -             old_oid = &ce->oid;
> > -             new_oid = changed ? null_oid() : &ce->oid;
> > -             diff_change(&revs->diffopt, oldmode, newmode,
> > -                         old_oid, new_oid,
> > -                         !is_null_oid(old_oid),
> > -                         !is_null_oid(new_oid),
> > -                         ce->name, 0, dirty_submodule);
> > -
> > +             if (diff_change_helper(&revs->diffopt, newmode, dirty_submodule,
> > +                                    changed, istate, ce))
> > +                     continue;
> >       }
>
> If I'm reading the indentation correctly, the "continue" comes right
> before the end of the for-loop block, so it's a no-op and should be
> removed.

It is a no-op, but I left it in as future-proofing in case more code is
added after that block later. I'm not sure whether that line of
reasoning is enough to leave it in though.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v8 5/6] diff-lib: refactor out diff_change logic
  2023-02-13 18:29           ` Calvin Wan
@ 2023-02-14  4:03             ` Glen Choo
  0 siblings, 0 replies; 86+ messages in thread
From: Glen Choo @ 2023-02-14  4:03 UTC (permalink / raw)
  To: Calvin Wan; +Cc: git, avarab, newren, jonathantanmy, phillip.wood123

Calvin Wan <calvinwan@google.com> writes:

>> > @@ -245,21 +269,9 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
>> >                       newmode = ce_mode_from_stat(ce, st.st_mode);
>> >               }
>> >
>> > -             if (!changed && !dirty_submodule) {
>> > -                     ce_mark_uptodate(ce);
>> > -                     mark_fsmonitor_valid(istate, ce);
>> > -                     if (!revs->diffopt.flags.find_copies_harder)
>> > -                             continue;
>> > -             }
>> > -             oldmode = ce->ce_mode;
>> > -             old_oid = &ce->oid;
>> > -             new_oid = changed ? null_oid() : &ce->oid;
>> > -             diff_change(&revs->diffopt, oldmode, newmode,
>> > -                         old_oid, new_oid,
>> > -                         !is_null_oid(old_oid),
>> > -                         !is_null_oid(new_oid),
>> > -                         ce->name, 0, dirty_submodule);
>> > -
>> > +             if (diff_change_helper(&revs->diffopt, newmode, dirty_submodule,
>> > +                                    changed, istate, ce))
>> > +                     continue;
>> >       }
>>
>> If I'm reading the indentation correctly, the "continue" comes right
>> before the end of the for-loop block, so it's a no-op and should be
>> removed.
>
> It is a no-op, but I left it in as future-proofing in case more code is
> added after that block later. I'm not sure whether that line of
> reasoning is enough to leave it in though.

I don't think it is. If we haven't thought of the reason why we would
need to skip code, that seems like YAGNI to me.

As a matter of personal taste, I wouldn't leave a trailing "continue" in
an earlier patch even if I were going to change it in a later patch,
because it looks too much like an unintentional mistake.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH v9 0/6] submodule: parallelize diff
  2023-02-09  0:02       ` [PATCH v8 0/6] " Calvin Wan
                           ` (2 preceding siblings ...)
  2023-02-09 20:50         ` Phillip Wood
@ 2023-03-02 21:52         ` Calvin Wan
  2023-03-02 22:02           ` [PATCH v9 1/6] run-command: add on_stderr_output_fn to run_processes_parallel_opts Calvin Wan
                             ` (5 more replies)
  3 siblings, 6 replies; 86+ messages in thread
From: Calvin Wan @ 2023-03-02 21:52 UTC (permalink / raw)
  To: git; +Cc: Calvin Wan, avarab, chooglen, newren, jonathantanmy,
	phillip.wood123

Original cover letter for context:
https://lore.kernel.org/git/20221011232604.839941-1-calvinwan@google.com/

I appreciate all the reviewers that have stuck through this entire series!
Hoping this can be the final reroll as I believe I've addressed all feedback
and personally am happy with the state of the patches.

Changes from v8
 - renamed duplicate_output_fn to on_stderr_output_fn
 - renamed diff_change_helper() to record_file_diff() and added comments
 - reworded commit message for patch 5
 - removed the refactoring of match_stat_with_submodule()
 - inlined parse_status_porcelain_strbuf()
 - fixed stylistic nits and cleaned up unnecessary variables and logic

Calvin Wan (6):
  run-command: add on_stderr_output_fn to run_processes_parallel_opts
  submodule: rename strbuf variable
  submodule: move status parsing into function
  submodule: refactor is_submodule_modified()
  diff-lib: refactor out diff_change logic
  diff-lib: parallelize run_diff_files for submodules

 Documentation/config/submodule.txt |  12 ++
 diff-lib.c                         | 123 +++++++++++---
 run-command.c                      |  16 +-
 run-command.h                      |  25 +++
 submodule.c                        | 254 +++++++++++++++++++++++------
 submodule.h                        |   9 +
 t/helper/test-run-command.c        |  20 +++
 t/t0061-run-command.sh             |  39 +++++
 t/t4027-diff-submodule.sh          |  31 ++++
 t/t7506-status-submodule.sh        |  25 +++
 10 files changed, 478 insertions(+), 76 deletions(-)

Range-diff against v8:
1:  5d51250c67 ! 1:  49749ae3a5 run-command: add duplicate_output_fn to run_processes_parallel_opts
    @@ Metadata
     Author: Calvin Wan <calvinwan@google.com>
     
      ## Commit message ##
    -    run-command: add duplicate_output_fn to run_processes_parallel_opts
    +    run-command: add on_stderr_output_fn to run_processes_parallel_opts
     
      ## run-command.c ##
     @@ run-command.c: static void pp_init(struct parallel_processes *pp,
    @@ run-command.c: static void pp_init(struct parallel_processes *pp,
      		BUG("you need to specify a get_next_task function");
      
     +	if (opts->ungroup) {
    -+		if (opts->duplicate_output)
    -+			BUG("duplicate_output and ungroup are incompatible with each other");
    ++		if (opts->on_stderr_output)
    ++			BUG("on_stderr_output and ungroup are incompatible with each other");
     +	}
     +
      	CALLOC_ARRAY(pp->children, n);
    @@ run-command.c: static void pp_buffer_stderr(struct parallel_processes *pp,
     +			} else if (n < 0) {
      				if (errno != EAGAIN)
      					die_errno("read");
    -+			} else if (opts->duplicate_output) {
    -+				opts->duplicate_output(&pp->children[i].err,
    ++			} else if (opts->on_stderr_output) {
    ++				opts->on_stderr_output(&pp->children[i].err,
     +					pp->children[i].err.len - n,
     +					opts->data, pp->children[i].data);
     +			}
    @@ run-command.h: typedef int (*start_failure_fn)(struct strbuf *out,
      
     +/**
     + * This callback is called whenever output from a child process is buffered
    -+ * 
    ++ *
     + * See run_processes_parallel() below for a discussion of the "struct
     + * strbuf *out" parameter.
    -+ * 
    ++ *
     + * The offset refers to the number of bytes originally in "out" before
     + * the output from the child process was buffered. Therefore, the buffer
     + * range, "out + buf" to the end of "out", would contain the buffer of
    @@ run-command.h: typedef int (*start_failure_fn)(struct strbuf *out,
     + *
     + * This function is incompatible with "ungroup"
     + */
    -+typedef void (*duplicate_output_fn)(struct strbuf *out, size_t offset,
    ++typedef void (*on_stderr_output_fn)(struct strbuf *out, size_t offset,
     +				    void *pp_cb, void *pp_task_cb);
     +
      /**
    @@ run-command.h: struct run_process_parallel_opts
      	start_failure_fn start_failure;
      
     +	/**
    -+	 * duplicate_output: See duplicate_output_fn() above. Unless you need
    ++	 * on_stderr_output: See on_stderr_output_fn() above. Unless you need
     +	 * to capture output from child processes, leave this as NULL.
     +	 */
    -+	duplicate_output_fn duplicate_output;
    ++	on_stderr_output_fn on_stderr_output;
     +
      	/**
      	 * task_finished: See task_finished_fn() above. This can be
    @@ t/helper/test-run-command.c: static int no_job(struct child_process *cp,
      	return 0;
      }
      
    -+static void duplicate_output(struct strbuf *out,
    ++static void on_stderr_output(struct strbuf *out,
     +			size_t offset,
     +			void *pp_cb UNUSED,
     +			void *pp_task_cb UNUSED)
    @@ t/helper/test-run-command.c: static int no_job(struct child_process *cp,
     +
     +	string_list_split(&list, out->buf + offset, '\n', -1);
     +	for_each_string_list_item(item, &list)
    -+		fprintf(stderr, "duplicate_output: %s\n", item->string);
    ++		fprintf(stderr, "on_stderr_output: %s\n", item->string);
     +	string_list_clear(&list, 0);
     +}
     +
    @@ t/helper/test-run-command.c: int cmd__run_command(int argc, const char **argv)
      		opts.ungroup = 1;
      	}
      
    -+	if (!strcmp(argv[1], "--duplicate-output")) {
    ++	if (!strcmp(argv[1], "--on-stderr-output")) {
     +		argv += 1;
     +		argc -= 1;
    -+		opts.duplicate_output = duplicate_output;
    ++		opts.on_stderr_output = on_stderr_output;
     +	}
     +
      	jobs = atoi(argv[2]);
    @@ t/t0061-run-command.sh: test_expect_success 'run_command runs in parallel with m
      	test_cmp expect actual
      '
      
    -+test_expect_success 'run_command runs in parallel with more jobs available than tasks --duplicate-output' '
    -+	test-tool run-command --duplicate-output run-command-parallel 5 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
    ++test_expect_success 'run_command runs in parallel with more jobs available than tasks --on-stderr-output' '
    ++	test-tool run-command --on-stderr-output run-command-parallel 5 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
     +	test_must_be_empty out &&
    -+	test 4 = $(grep -c "duplicate_output: Hello" err) &&
    -+	test 4 = $(grep -c "duplicate_output: World" err) &&
    -+	sed "/duplicate_output/d" err >err1 &&
    ++	test 4 = $(grep -c "on_stderr_output: Hello" err) &&
    ++	test 4 = $(grep -c "on_stderr_output: World" err) &&
    ++	sed "/on_stderr_output/d" err >err1 &&
     +	test_cmp expect err1
     +'
     +
    @@ t/t0061-run-command.sh: test_expect_success 'run_command runs in parallel with a
      	test_cmp expect actual
      '
      
    -+test_expect_success 'run_command runs in parallel with as many jobs as tasks --duplicate-output' '
    -+	test-tool run-command --duplicate-output run-command-parallel 4 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
    ++test_expect_success 'run_command runs in parallel with as many jobs as tasks --on-stderr-output' '
    ++	test-tool run-command --on-stderr-output run-command-parallel 4 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
     +	test_must_be_empty out &&
    -+	test 4 = $(grep -c "duplicate_output: Hello" err) &&
    -+	test 4 = $(grep -c "duplicate_output: World" err) &&
    -+	sed "/duplicate_output/d" err >err1 &&
    ++	test 4 = $(grep -c "on_stderr_output: Hello" err) &&
    ++	test 4 = $(grep -c "on_stderr_output: World" err) &&
    ++	sed "/on_stderr_output/d" err >err1 &&
     +	test_cmp expect err1
     +'
     +
    @@ t/t0061-run-command.sh: test_expect_success 'run_command runs in parallel with m
      	test_cmp expect actual
      '
      
    -+test_expect_success 'run_command runs in parallel with more tasks than jobs available --duplicate-output' '
    -+	test-tool run-command --duplicate-output run-command-parallel 3 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
    ++test_expect_success 'run_command runs in parallel with more tasks than jobs available --on-stderr-output' '
    ++	test-tool run-command --on-stderr-output run-command-parallel 3 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
     +	test_must_be_empty out &&
    -+	test 4 = $(grep -c "duplicate_output: Hello" err) &&
    -+	test 4 = $(grep -c "duplicate_output: World" err) &&
    -+	sed "/duplicate_output/d" err >err1 &&
    ++	test 4 = $(grep -c "on_stderr_output: Hello" err) &&
    ++	test 4 = $(grep -c "on_stderr_output: World" err) &&
    ++	sed "/on_stderr_output/d" err >err1 &&
     +	test_cmp expect err1
     +'
     +
    @@ t/t0061-run-command.sh: test_expect_success 'run_command is asked to abort grace
      	test_cmp expect actual
      '
      
    -+test_expect_success 'run_command is asked to abort gracefully --duplicate-output' '
    -+	test-tool run-command --duplicate-output run-command-abort 3 false >out 2>err &&
    ++test_expect_success 'run_command is asked to abort gracefully --on-stderr-output' '
    ++	test-tool run-command --on-stderr-output run-command-abort 3 false >out 2>err &&
     +	test_must_be_empty out &&
     +	test_cmp expect err
     +'
    @@ t/t0061-run-command.sh: test_expect_success 'run_command outputs ' '
      	test_cmp expect actual
      '
      
    -+test_expect_success 'run_command outputs --duplicate-output' '
    -+	test-tool run-command --duplicate-output run-command-no-jobs 3 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
    ++test_expect_success 'run_command outputs --on-stderr-output' '
    ++	test-tool run-command --on-stderr-output run-command-no-jobs 3 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
     +	test_must_be_empty out &&
     +	test_cmp expect err
     +'
2:  6ded5b6788 ! 2:  6c62e670f9 submodule: strbuf variable rename
    @@ Metadata
     Author: Calvin Wan <calvinwan@google.com>
     
      ## Commit message ##
    -    submodule: strbuf variable rename
    +    submodule: rename strbuf variable
     
      ## submodule.c ##
     @@ submodule.c: unsigned is_submodule_modified(const char *path, int ignore_untracked)
3:  0c71cea8cd = 3:  24e02f2a24 submodule: move status parsing into function
4:  5c8cc93f9f = 4:  86c1f734a0 submodule: refactor is_submodule_modified()
5:  6c2b62abc8 ! 5:  811a1fee55 diff-lib: refactor out diff_change logic
    @@ diff-lib.c: static int match_stat_with_submodule(struct diff_options *diffopt,
      	return changed;
      }
      
    -+static int diff_change_helper(struct diff_options *options,
    -+	      unsigned newmode, unsigned dirty_submodule,
    -+	      int changed, struct index_state *istate,
    -+	      struct cache_entry *ce)
    ++/**
    ++ * Records diff_change if there is a change in the entry from run_diff_files.
    ++ * If there is no change, then the cache entry is marked CE_UPTODATE and
    ++ * CE_FSMONITOR_VALID. If there is no change and the find_copies_harder flag
    ++ * is not set, then the function returns early.
    ++ */
    ++static void record_file_diff(struct diff_options *options, unsigned newmode,
    ++			     unsigned dirty_submodule, int changed,
    ++			     struct index_state *istate,
    ++			     struct cache_entry *ce)
     +{
     +	unsigned int oldmode;
     +	const struct object_id *old_oid, *new_oid;
    @@ diff-lib.c: static int match_stat_with_submodule(struct diff_options *diffopt,
     +		ce_mark_uptodate(ce);
     +		mark_fsmonitor_valid(istate, ce);
     +		if (!options->flags.find_copies_harder)
    -+			return 1;
    ++			return;
     +	}
     +	oldmode = ce->ce_mode;
     +	old_oid = &ce->oid;
     +	new_oid = changed ? null_oid() : &ce->oid;
    -+	diff_change(options, oldmode, newmode,
    -+			old_oid, new_oid,
    -+			!is_null_oid(old_oid),
    -+			!is_null_oid(new_oid),
    -+			ce->name, 0, dirty_submodule);
    -+	return 0;
    ++	diff_change(options, oldmode, newmode, old_oid, new_oid,
    ++		    !is_null_oid(old_oid), !is_null_oid(new_oid),
    ++		    ce->name, 0, dirty_submodule);
     +}
     +
      int run_diff_files(struct rev_info *revs, unsigned int option)
    @@ diff-lib.c: int run_diff_files(struct rev_info *revs, unsigned int option)
     -			    !is_null_oid(new_oid),
     -			    ce->name, 0, dirty_submodule);
     -
    -+		if (diff_change_helper(&revs->diffopt, newmode, dirty_submodule,
    -+				       changed, istate, ce))
    -+			continue;
    ++		record_file_diff(&revs->diffopt, newmode, dirty_submodule,
    ++				 changed, istate, ce);
      	}
      	diffcore_std(&revs->diffopt);
      	diff_flush(&revs->diffopt);
6:  bb25dadbe5 ! 6:  17010fc179 diff-lib: parallelize run_diff_files for submodules
    @@ diff-lib.c: static int check_removed(const struct index_state *istate, const str
     +				     unsigned *ignore_untracked)
      {
      	int changed = ie_match_stat(diffopt->repo->index, ce, st, ce_option);
    --	if (S_ISGITLINK(ce->ce_mode)) {
    --		struct diff_flags orig_flags = diffopt->flags;
    --		if (!diffopt->flags.override_submodule_config)
    --			set_diffopt_flags_from_submodule_config(diffopt, ce->name);
    ++	int defer = 0;
    ++
    + 	if (S_ISGITLINK(ce->ce_mode)) {
    + 		struct diff_flags orig_flags = diffopt->flags;
    + 		if (!diffopt->flags.override_submodule_config)
    + 			set_diffopt_flags_from_submodule_config(diffopt, ce->name);
     -		if (diffopt->flags.ignore_submodules)
    --			changed = 0;
    ++		if (diffopt->flags.ignore_submodules) {
    + 			changed = 0;
     -		else if (!diffopt->flags.ignore_dirty_submodules &&
     -			 (!changed || diffopt->flags.dirty_submodules))
    -+	struct diff_flags orig_flags;
    -+	int defer = 0;
    -+
    -+	if (!S_ISGITLINK(ce->ce_mode))
    -+		goto ret;
    -+
    -+	orig_flags = diffopt->flags;
    -+	if (!diffopt->flags.override_submodule_config)
    -+		set_diffopt_flags_from_submodule_config(diffopt, ce->name);
    -+	if (diffopt->flags.ignore_submodules) {
    -+		changed = 0;
    -+		goto cleanup;
    -+	}
    -+	if (!diffopt->flags.ignore_dirty_submodules &&
    -+	    (!changed || diffopt->flags.dirty_submodules)) {
    -+		if (defer_submodule_status && *defer_submodule_status) {
    -+			defer = 1;
    -+			*ignore_untracked = diffopt->flags.ignore_untracked_in_submodules;
    -+		} else {
    - 			*dirty_submodule = is_submodule_modified(ce->name,
    +-			*dirty_submodule = is_submodule_modified(ce->name,
     -								 diffopt->flags.ignore_untracked_in_submodules);
    --		diffopt->flags = orig_flags;
    ++		} else if (!diffopt->flags.ignore_dirty_submodules &&
    ++			   (!changed || diffopt->flags.dirty_submodules)) {
    ++			if (defer_submodule_status && *defer_submodule_status) {
    ++				defer = 1;
    ++				*ignore_untracked = diffopt->flags.ignore_untracked_in_submodules;
    ++			} else {
    ++				*dirty_submodule = is_submodule_modified(ce->name,
     +					 diffopt->flags.ignore_untracked_in_submodules);
    ++			}
     +		}
    + 		diffopt->flags = orig_flags;
      	}
    -+cleanup:
    -+	diffopt->flags = orig_flags;
    -+ret:
    ++
     +	if (defer_submodule_status)
     +		*defer_submodule_status = defer;
      	return changed;
    @@ diff-lib.c: int run_diff_files(struct rev_info *revs, unsigned int option)
      
      	diff_set_mnemonic_prefix(&revs->diffopt, "i/", "w/");
      
    +@@ diff-lib.c: int run_diff_files(struct rev_info *revs, unsigned int option)
    + 		unsigned int newmode;
    + 		struct cache_entry *ce = istate->cache[i];
    + 		int changed;
    +-		unsigned dirty_submodule = 0;
    ++		int defer_submodule_status = 1;
    + 
    + 		if (diff_can_quit_early(&revs->diffopt))
    + 			break;
     @@ diff-lib.c: int run_diff_files(struct rev_info *revs, unsigned int option)
      			newmode = ce->ce_mode;
      		} else {
      			struct stat st;
     +			unsigned ignore_untracked = 0;
    -+			int defer_submodule_status = 1;
      
      			changed = check_removed(istate, ce, &st);
      			if (changed) {
    @@ diff-lib.c: int run_diff_files(struct rev_info *revs, unsigned int option)
      
      			changed = match_stat_with_submodule(&revs->diffopt, ce, &st,
     -							    ce_option, &dirty_submodule);
    -+							    ce_option, &dirty_submodule,
    ++							    ce_option, NULL,
     +							    &defer_submodule_status,
     +							    &ignore_untracked);
      			newmode = ce_mode_from_stat(ce, st.st_mode);
    @@ diff-lib.c: int run_diff_files(struct rev_info *revs, unsigned int option)
     +			}
      		}
      
    - 		if (diff_change_helper(&revs->diffopt, newmode, dirty_submodule,
    - 				       changed, istate, ce))
    - 			continue;
    - 	}
    +-		record_file_diff(&revs->diffopt, newmode, dirty_submodule,
    +-				 changed, istate, ce);
    ++		if (!defer_submodule_status)
    ++			record_file_diff(&revs->diffopt, newmode, 0,
    ++					   changed,istate, ce);
    ++	}
     +	if (submodules.nr) {
     +		unsigned long parallel_jobs;
     +		struct string_list_item *item;
    @@ diff-lib.c: int run_diff_files(struct rev_info *revs, unsigned int option)
     +		for_each_string_list_item(item, &submodules) {
     +			struct submodule_status_util *util = item->util;
     +
    -+			if (diff_change_helper(&revs->diffopt, util->newmode,
    -+				       util->dirty_submodule, util->changed,
    -+				       istate, util->ce))
    -+				continue;
    ++			record_file_diff(&revs->diffopt, util->newmode,
    ++					 util->dirty_submodule, util->changed,
    ++					 istate, util->ce);
     +		}
    -+	}
    + 	}
     +	string_list_clear(&submodules, 1);
      	diffcore_std(&revs->diffopt);
      	diff_flush(&revs->diffopt);
    @@ submodule.c: struct fetch_task {
      /**
       * When a submodule is not defined in .gitmodules, we cannot access it
       * via the regular submodule-config. Create a fake submodule, which we can
    -@@ submodule.c: static int parse_status_porcelain(char *str, size_t len,
    - 	return 0;
    - }
    - 
    -+static void parse_status_porcelain_strbuf(struct strbuf *buf,
    -+				   unsigned *dirty_submodule,
    -+				   int ignore_untracked)
    -+{
    -+	struct string_list list = STRING_LIST_INIT_DUP;
    -+	struct string_list_item *item;
    -+
    -+	string_list_split(&list, buf->buf, '\n', -1);
    -+
    -+	for_each_string_list_item(item, &list) {
    -+		if (parse_status_porcelain(item->string,
    -+					   strlen(item->string),
    -+					   dirty_submodule,
    -+					   ignore_untracked))
    -+			break;
    -+	}
    -+	string_list_clear(&list, 0);
    -+}
    -+
    - unsigned is_submodule_modified(const char *path, int ignore_untracked)
    - {
    - 	struct child_process cp = CHILD_PROCESS_INIT;
     @@ submodule.c: unsigned is_submodule_modified(const char *path, int ignore_untracked)
      	return dirty_submodule;
      }
    @@ submodule.c: unsigned is_submodule_modified(const char *path, int ignore_untrack
     +	return 0;
     +}
     +
    -+static void status_duplicate_output(struct strbuf *out,
    ++static void status_on_stderr_output(struct strbuf *out,
     +				    size_t offset,
     +				    void *cb, void *task_cb)
     +{
    @@ submodule.c: unsigned is_submodule_modified(const char *path, int ignore_untrack
     +	struct string_list_item *it =
     +		string_list_lookup(sps->submodule_names, task->path);
     +	struct submodule_status_util *util = it->util;
    ++	struct string_list list = STRING_LIST_INIT_DUP;
    ++	struct string_list_item *item;
     +
     +	if (retvalue) {
     +		sps->result = 1;
     +		strbuf_addf(err, _(STATUS_PORCELAIN_FAIL_ERROR), task->path);
     +	}
     +
    -+	parse_status_porcelain_strbuf(&task->out,
    -+			      &util->dirty_submodule,
    -+			      util->ignore_untracked);
    -+
    ++	string_list_split(&list, task->out.buf, '\n', -1);
    ++	for_each_string_list_item(item, &list) {
    ++		if (parse_status_porcelain(item->string,
    ++					   strlen(item->string),
    ++					   &util->dirty_submodule,
    ++					   util->ignore_untracked))
    ++			break;
    ++	}
    ++	string_list_clear(&list, 0);
     +	strbuf_release(&task->out);
     +	free(task);
     +
    @@ submodule.c: unsigned is_submodule_modified(const char *path, int ignore_untrack
     +
     +		.get_next_task = get_next_submodule_status,
     +		.start_failure = status_start_failure,
    -+		.duplicate_output = status_duplicate_output,
    ++		.on_stderr_output = status_on_stderr_output,
     +		.task_finished = status_finish,
     +		.data = &sps,
     +	};
-- 
2.40.0.rc0.216.gc4246ad0f0-goog


^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH v9 1/6] run-command: add on_stderr_output_fn to run_processes_parallel_opts
  2023-03-02 21:52         ` [PATCH v9 " Calvin Wan
@ 2023-03-02 22:02           ` Calvin Wan
  2023-03-02 22:02           ` [PATCH v9 2/6] submodule: rename strbuf variable Calvin Wan
                             ` (4 subsequent siblings)
  5 siblings, 0 replies; 86+ messages in thread
From: Calvin Wan @ 2023-03-02 22:02 UTC (permalink / raw)
  To: git; +Cc: Calvin Wan, avarab, chooglen, newren, jonathantanmy,
	phillip.wood123

Add duplicate_output_fn as an optionally set function in
run_process_parallel_opts. If set, output from each child process is
copied and passed to the callback function whenever output from the
child process is buffered to allow for separate parsing.

Fix two items in pp_buffer_stderr:
 * strbuf_read_once returns a ssize_t but the variable it is set to is
   an int so fix that.
 * Add missing brackets to "else if" statement

The ungroup/duplicate_output incompatibility check is nested to
prepare for future imcompatibles modes with ungroup.

Signed-off-by: Calvin Wan <calvinwan@google.com>
---
 run-command.c               | 16 ++++++++++++---
 run-command.h               | 25 ++++++++++++++++++++++++
 t/helper/test-run-command.c | 20 +++++++++++++++++++
 t/t0061-run-command.sh      | 39 +++++++++++++++++++++++++++++++++++++
 4 files changed, 97 insertions(+), 3 deletions(-)

diff --git a/run-command.c b/run-command.c
index 756f1839aa..7eed4e98c2 100644
--- a/run-command.c
+++ b/run-command.c
@@ -1526,6 +1526,11 @@ static void pp_init(struct parallel_processes *pp,
 	if (!opts->get_next_task)
 		BUG("you need to specify a get_next_task function");
 
+	if (opts->ungroup) {
+		if (opts->on_stderr_output)
+			BUG("on_stderr_output and ungroup are incompatible with each other");
+	}
+
 	CALLOC_ARRAY(pp->children, n);
 	if (!opts->ungroup)
 		CALLOC_ARRAY(pp->pfd, n);
@@ -1645,14 +1650,19 @@ static void pp_buffer_stderr(struct parallel_processes *pp,
 	for (size_t i = 0; i < opts->processes; i++) {
 		if (pp->children[i].state == GIT_CP_WORKING &&
 		    pp->pfd[i].revents & (POLLIN | POLLHUP)) {
-			int n = strbuf_read_once(&pp->children[i].err,
-						 pp->children[i].process.err, 0);
+			ssize_t n = strbuf_read_once(&pp->children[i].err,
+						     pp->children[i].process.err, 0);
 			if (n == 0) {
 				close(pp->children[i].process.err);
 				pp->children[i].state = GIT_CP_WAIT_CLEANUP;
-			} else if (n < 0)
+			} else if (n < 0) {
 				if (errno != EAGAIN)
 					die_errno("read");
+			} else if (opts->on_stderr_output) {
+				opts->on_stderr_output(&pp->children[i].err,
+					pp->children[i].err.len - n,
+					opts->data, pp->children[i].data);
+			}
 		}
 	}
 }
diff --git a/run-command.h b/run-command.h
index 072db56a4d..8f08e41fae 100644
--- a/run-command.h
+++ b/run-command.h
@@ -408,6 +408,25 @@ typedef int (*start_failure_fn)(struct strbuf *out,
 				void *pp_cb,
 				void *pp_task_cb);
 
+/**
+ * This callback is called whenever output from a child process is buffered
+ *
+ * See run_processes_parallel() below for a discussion of the "struct
+ * strbuf *out" parameter.
+ *
+ * The offset refers to the number of bytes originally in "out" before
+ * the output from the child process was buffered. Therefore, the buffer
+ * range, "out + buf" to the end of "out", would contain the buffer of
+ * the child process output.
+ *
+ * pp_cb is the callback cookie as passed into run_processes_parallel,
+ * pp_task_cb is the callback cookie as passed into get_next_task_fn.
+ *
+ * This function is incompatible with "ungroup"
+ */
+typedef void (*on_stderr_output_fn)(struct strbuf *out, size_t offset,
+				    void *pp_cb, void *pp_task_cb);
+
 /**
  * This callback is called on every child process that finished processing.
  *
@@ -461,6 +480,12 @@ struct run_process_parallel_opts
 	 */
 	start_failure_fn start_failure;
 
+	/**
+	 * on_stderr_output: See on_stderr_output_fn() above. Unless you need
+	 * to capture output from child processes, leave this as NULL.
+	 */
+	on_stderr_output_fn on_stderr_output;
+
 	/**
 	 * task_finished: See task_finished_fn() above. This can be
 	 * NULL to omit any special handling.
diff --git a/t/helper/test-run-command.c b/t/helper/test-run-command.c
index 3ecb830f4a..a2fac6f762 100644
--- a/t/helper/test-run-command.c
+++ b/t/helper/test-run-command.c
@@ -52,6 +52,20 @@ static int no_job(struct child_process *cp,
 	return 0;
 }
 
+static void on_stderr_output(struct strbuf *out,
+			size_t offset,
+			void *pp_cb UNUSED,
+			void *pp_task_cb UNUSED)
+{
+	struct string_list list = STRING_LIST_INIT_DUP;
+	struct string_list_item *item;
+
+	string_list_split(&list, out->buf + offset, '\n', -1);
+	for_each_string_list_item(item, &list)
+		fprintf(stderr, "on_stderr_output: %s\n", item->string);
+	string_list_clear(&list, 0);
+}
+
 static int task_finished(int result,
 			 struct strbuf *err,
 			 void *pp_cb,
@@ -439,6 +453,12 @@ int cmd__run_command(int argc, const char **argv)
 		opts.ungroup = 1;
 	}
 
+	if (!strcmp(argv[1], "--on-stderr-output")) {
+		argv += 1;
+		argc -= 1;
+		opts.on_stderr_output = on_stderr_output;
+	}
+
 	jobs = atoi(argv[2]);
 	strvec_clear(&proc.args);
 	strvec_pushv(&proc.args, (const char **)argv + 3);
diff --git a/t/t0061-run-command.sh b/t/t0061-run-command.sh
index e2411f6a9b..883d871dfb 100755
--- a/t/t0061-run-command.sh
+++ b/t/t0061-run-command.sh
@@ -135,6 +135,15 @@ test_expect_success 'run_command runs in parallel with more jobs available than
 	test_cmp expect actual
 '
 
+test_expect_success 'run_command runs in parallel with more jobs available than tasks --on-stderr-output' '
+	test-tool run-command --on-stderr-output run-command-parallel 5 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
+	test_must_be_empty out &&
+	test 4 = $(grep -c "on_stderr_output: Hello" err) &&
+	test 4 = $(grep -c "on_stderr_output: World" err) &&
+	sed "/on_stderr_output/d" err >err1 &&
+	test_cmp expect err1
+'
+
 test_expect_success 'run_command runs ungrouped in parallel with more jobs available than tasks' '
 	test-tool run-command --ungroup run-command-parallel 5 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
 	test_line_count = 8 out &&
@@ -147,6 +156,15 @@ test_expect_success 'run_command runs in parallel with as many jobs as tasks' '
 	test_cmp expect actual
 '
 
+test_expect_success 'run_command runs in parallel with as many jobs as tasks --on-stderr-output' '
+	test-tool run-command --on-stderr-output run-command-parallel 4 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
+	test_must_be_empty out &&
+	test 4 = $(grep -c "on_stderr_output: Hello" err) &&
+	test 4 = $(grep -c "on_stderr_output: World" err) &&
+	sed "/on_stderr_output/d" err >err1 &&
+	test_cmp expect err1
+'
+
 test_expect_success 'run_command runs ungrouped in parallel with as many jobs as tasks' '
 	test-tool run-command --ungroup run-command-parallel 4 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
 	test_line_count = 8 out &&
@@ -159,6 +177,15 @@ test_expect_success 'run_command runs in parallel with more tasks than jobs avai
 	test_cmp expect actual
 '
 
+test_expect_success 'run_command runs in parallel with more tasks than jobs available --on-stderr-output' '
+	test-tool run-command --on-stderr-output run-command-parallel 3 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
+	test_must_be_empty out &&
+	test 4 = $(grep -c "on_stderr_output: Hello" err) &&
+	test 4 = $(grep -c "on_stderr_output: World" err) &&
+	sed "/on_stderr_output/d" err >err1 &&
+	test_cmp expect err1
+'
+
 test_expect_success 'run_command runs ungrouped in parallel with more tasks than jobs available' '
 	test-tool run-command --ungroup run-command-parallel 3 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
 	test_line_count = 8 out &&
@@ -180,6 +207,12 @@ test_expect_success 'run_command is asked to abort gracefully' '
 	test_cmp expect actual
 '
 
+test_expect_success 'run_command is asked to abort gracefully --on-stderr-output' '
+	test-tool run-command --on-stderr-output run-command-abort 3 false >out 2>err &&
+	test_must_be_empty out &&
+	test_cmp expect err
+'
+
 test_expect_success 'run_command is asked to abort gracefully (ungroup)' '
 	test-tool run-command --ungroup run-command-abort 3 false >out 2>err &&
 	test_must_be_empty out &&
@@ -196,6 +229,12 @@ test_expect_success 'run_command outputs ' '
 	test_cmp expect actual
 '
 
+test_expect_success 'run_command outputs --on-stderr-output' '
+	test-tool run-command --on-stderr-output run-command-no-jobs 3 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
+	test_must_be_empty out &&
+	test_cmp expect err
+'
+
 test_expect_success 'run_command outputs (ungroup) ' '
 	test-tool run-command --ungroup run-command-no-jobs 3 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
 	test_must_be_empty out &&
-- 
2.40.0.rc0.216.gc4246ad0f0-goog


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v9 2/6] submodule: rename strbuf variable
  2023-03-02 21:52         ` [PATCH v9 " Calvin Wan
  2023-03-02 22:02           ` [PATCH v9 1/6] run-command: add on_stderr_output_fn to run_processes_parallel_opts Calvin Wan
@ 2023-03-02 22:02           ` Calvin Wan
  2023-03-03  0:25             ` Junio C Hamano
  2023-03-02 22:02           ` [PATCH v9 3/6] submodule: move status parsing into function Calvin Wan
                             ` (3 subsequent siblings)
  5 siblings, 1 reply; 86+ messages in thread
From: Calvin Wan @ 2023-03-02 22:02 UTC (permalink / raw)
  To: git; +Cc: Calvin Wan, avarab, chooglen, newren, jonathantanmy,
	phillip.wood123

A prepatory change for a future patch that moves the status parsing
logic to a separate function.

Signed-off-by: Calvin Wan <calvinwan@google.com>
---
 submodule.c | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/submodule.c b/submodule.c
index fae24ef34a..faf37c1101 100644
--- a/submodule.c
+++ b/submodule.c
@@ -1906,25 +1906,28 @@ unsigned is_submodule_modified(const char *path, int ignore_untracked)
 
 	fp = xfdopen(cp.out, "r");
 	while (strbuf_getwholeline(&buf, fp, '\n') != EOF) {
+		char *str = buf.buf;
+		const size_t len = buf.len;
+
 		/* regular untracked files */
-		if (buf.buf[0] == '?')
+		if (str[0] == '?')
 			dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
 
-		if (buf.buf[0] == 'u' ||
-		    buf.buf[0] == '1' ||
-		    buf.buf[0] == '2') {
+		if (str[0] == 'u' ||
+		    str[0] == '1' ||
+		    str[0] == '2') {
 			/* T = line type, XY = status, SSSS = submodule state */
-			if (buf.len < strlen("T XY SSSS"))
+			if (len < strlen("T XY SSSS"))
 				BUG("invalid status --porcelain=2 line %s",
-				    buf.buf);
+				    str);
 
-			if (buf.buf[5] == 'S' && buf.buf[8] == 'U')
+			if (str[5] == 'S' && str[8] == 'U')
 				/* nested untracked file */
 				dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
 
-			if (buf.buf[0] == 'u' ||
-			    buf.buf[0] == '2' ||
-			    memcmp(buf.buf + 5, "S..U", 4))
+			if (str[0] == 'u' ||
+			    str[0] == '2' ||
+			    memcmp(str + 5, "S..U", 4))
 				/* other change */
 				dirty_submodule |= DIRTY_SUBMODULE_MODIFIED;
 		}
-- 
2.40.0.rc0.216.gc4246ad0f0-goog


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v9 3/6] submodule: move status parsing into function
  2023-03-02 21:52         ` [PATCH v9 " Calvin Wan
  2023-03-02 22:02           ` [PATCH v9 1/6] run-command: add on_stderr_output_fn to run_processes_parallel_opts Calvin Wan
  2023-03-02 22:02           ` [PATCH v9 2/6] submodule: rename strbuf variable Calvin Wan
@ 2023-03-02 22:02           ` Calvin Wan
  2023-03-17 20:42             ` Glen Choo
  2023-03-02 22:02           ` [PATCH v9 4/6] submodule: refactor is_submodule_modified() Calvin Wan
                             ` (2 subsequent siblings)
  5 siblings, 1 reply; 86+ messages in thread
From: Calvin Wan @ 2023-03-02 22:02 UTC (permalink / raw)
  To: git; +Cc: Calvin Wan, avarab, chooglen, newren, jonathantanmy,
	phillip.wood123

A future patch requires the ability to parse the output of git
status --porcelain=2. Move parsing code from is_submodule_modified to
parse_status_porcelain.

Signed-off-by: Calvin Wan <calvinwan@google.com>
---
 submodule.c | 74 ++++++++++++++++++++++++++++++-----------------------
 1 file changed, 42 insertions(+), 32 deletions(-)

diff --git a/submodule.c b/submodule.c
index faf37c1101..768d4b4cd7 100644
--- a/submodule.c
+++ b/submodule.c
@@ -1870,6 +1870,45 @@ int fetch_submodules(struct repository *r,
 	return spf.result;
 }
 
+static int parse_status_porcelain(char *str, size_t len,
+				  unsigned *dirty_submodule,
+				  int ignore_untracked)
+{
+	/* regular untracked files */
+	if (str[0] == '?')
+		*dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
+
+	if (str[0] == 'u' ||
+	    str[0] == '1' ||
+	    str[0] == '2') {
+		/* T = line type, XY = status, SSSS = submodule state */
+		if (len < strlen("T XY SSSS"))
+			BUG("invalid status --porcelain=2 line %s",
+			    str);
+
+		if (str[5] == 'S' && str[8] == 'U')
+			/* nested untracked file */
+			*dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
+
+		if (str[0] == 'u' ||
+		    str[0] == '2' ||
+		    memcmp(str + 5, "S..U", 4))
+			/* other change */
+			*dirty_submodule |= DIRTY_SUBMODULE_MODIFIED;
+	}
+
+	if ((*dirty_submodule & DIRTY_SUBMODULE_MODIFIED) &&
+	    ((*dirty_submodule & DIRTY_SUBMODULE_UNTRACKED) ||
+	     ignore_untracked)) {
+		/*
+		* We're not interested in any further information from
+		* the child any more, neither output nor its exit code.
+		*/
+		return 1;
+	}
+	return 0;
+}
+
 unsigned is_submodule_modified(const char *path, int ignore_untracked)
 {
 	struct child_process cp = CHILD_PROCESS_INIT;
@@ -1909,39 +1948,10 @@ unsigned is_submodule_modified(const char *path, int ignore_untracked)
 		char *str = buf.buf;
 		const size_t len = buf.len;
 
-		/* regular untracked files */
-		if (str[0] == '?')
-			dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
-
-		if (str[0] == 'u' ||
-		    str[0] == '1' ||
-		    str[0] == '2') {
-			/* T = line type, XY = status, SSSS = submodule state */
-			if (len < strlen("T XY SSSS"))
-				BUG("invalid status --porcelain=2 line %s",
-				    str);
-
-			if (str[5] == 'S' && str[8] == 'U')
-				/* nested untracked file */
-				dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
-
-			if (str[0] == 'u' ||
-			    str[0] == '2' ||
-			    memcmp(str + 5, "S..U", 4))
-				/* other change */
-				dirty_submodule |= DIRTY_SUBMODULE_MODIFIED;
-		}
-
-		if ((dirty_submodule & DIRTY_SUBMODULE_MODIFIED) &&
-		    ((dirty_submodule & DIRTY_SUBMODULE_UNTRACKED) ||
-		     ignore_untracked)) {
-			/*
-			 * We're not interested in any further information from
-			 * the child any more, neither output nor its exit code.
-			 */
-			ignore_cp_exit_code = 1;
+		ignore_cp_exit_code = parse_status_porcelain(str, len, &dirty_submodule,
+							     ignore_untracked);
+		if (ignore_cp_exit_code)
 			break;
-		}
 	}
 	fclose(fp);
 
-- 
2.40.0.rc0.216.gc4246ad0f0-goog


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v9 4/6] submodule: refactor is_submodule_modified()
  2023-03-02 21:52         ` [PATCH v9 " Calvin Wan
                             ` (2 preceding siblings ...)
  2023-03-02 22:02           ` [PATCH v9 3/6] submodule: move status parsing into function Calvin Wan
@ 2023-03-02 22:02           ` Calvin Wan
  2023-03-02 22:02           ` [PATCH v9 5/6] diff-lib: refactor out diff_change logic Calvin Wan
  2023-03-02 22:02           ` [PATCH v9 6/6] diff-lib: parallelize run_diff_files for submodules Calvin Wan
  5 siblings, 0 replies; 86+ messages in thread
From: Calvin Wan @ 2023-03-02 22:02 UTC (permalink / raw)
  To: git; +Cc: Calvin Wan, avarab, chooglen, newren, jonathantanmy,
	phillip.wood123

Refactor out submodule status logic and error messages that will be
used in a future patch.

Signed-off-by: Calvin Wan <calvinwan@google.com>
---
 submodule.c | 65 ++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 42 insertions(+), 23 deletions(-)

diff --git a/submodule.c b/submodule.c
index 768d4b4cd7..426074cebb 100644
--- a/submodule.c
+++ b/submodule.c
@@ -28,6 +28,10 @@ static int config_update_recurse_submodules = RECURSE_SUBMODULES_OFF;
 static int initialized_fetch_ref_tips;
 static struct oid_array ref_tips_before_fetch;
 static struct oid_array ref_tips_after_fetch;
+#define STATUS_PORCELAIN_START_ERROR \
+	N_("could not run 'git status --porcelain=2' in submodule %s")
+#define STATUS_PORCELAIN_FAIL_ERROR \
+	N_("'git status --porcelain=2' failed in submodule %s")
 
 /*
  * Check if the .gitmodules file is unmerged. Parsing of the .gitmodules file
@@ -1870,6 +1874,40 @@ int fetch_submodules(struct repository *r,
 	return spf.result;
 }
 
+static int verify_submodule_git_directory(const char *path)
+{
+	const char *git_dir;
+	struct strbuf buf = STRBUF_INIT;
+
+	strbuf_addf(&buf, "%s/.git", path);
+	git_dir = read_gitfile(buf.buf);
+	if (!git_dir)
+		git_dir = buf.buf;
+	if (!is_git_directory(git_dir)) {
+		if (is_directory(git_dir))
+			die(_("'%s' not recognized as a git repository"), git_dir);
+		strbuf_release(&buf);
+		/* The submodule is not checked out, so it is not modified */
+		return 0;
+	}
+	strbuf_release(&buf);
+	return 1;
+}
+
+static void prepare_status_porcelain(struct child_process *cp,
+			     const char *path, int ignore_untracked)
+{
+	strvec_pushl(&cp->args, "status", "--porcelain=2", NULL);
+	if (ignore_untracked)
+		strvec_push(&cp->args, "-uno");
+
+	prepare_submodule_repo_env(&cp->env);
+	cp->git_cmd = 1;
+	cp->no_stdin = 1;
+	cp->out = -1;
+	cp->dir = path;
+}
+
 static int parse_status_porcelain(char *str, size_t len,
 				  unsigned *dirty_submodule,
 				  int ignore_untracked)
@@ -1915,33 +1953,14 @@ unsigned is_submodule_modified(const char *path, int ignore_untracked)
 	struct strbuf buf = STRBUF_INIT;
 	FILE *fp;
 	unsigned dirty_submodule = 0;
-	const char *git_dir;
 	int ignore_cp_exit_code = 0;
 
-	strbuf_addf(&buf, "%s/.git", path);
-	git_dir = read_gitfile(buf.buf);
-	if (!git_dir)
-		git_dir = buf.buf;
-	if (!is_git_directory(git_dir)) {
-		if (is_directory(git_dir))
-			die(_("'%s' not recognized as a git repository"), git_dir);
-		strbuf_release(&buf);
-		/* The submodule is not checked out, so it is not modified */
+	if (!verify_submodule_git_directory(path))
 		return 0;
-	}
-	strbuf_reset(&buf);
-
-	strvec_pushl(&cp.args, "status", "--porcelain=2", NULL);
-	if (ignore_untracked)
-		strvec_push(&cp.args, "-uno");
 
-	prepare_submodule_repo_env(&cp.env);
-	cp.git_cmd = 1;
-	cp.no_stdin = 1;
-	cp.out = -1;
-	cp.dir = path;
+	prepare_status_porcelain(&cp, path, ignore_untracked);
 	if (start_command(&cp))
-		die(_("Could not run 'git status --porcelain=2' in submodule %s"), path);
+		die(_(STATUS_PORCELAIN_START_ERROR), path);
 
 	fp = xfdopen(cp.out, "r");
 	while (strbuf_getwholeline(&buf, fp, '\n') != EOF) {
@@ -1956,7 +1975,7 @@ unsigned is_submodule_modified(const char *path, int ignore_untracked)
 	fclose(fp);
 
 	if (finish_command(&cp) && !ignore_cp_exit_code)
-		die(_("'git status --porcelain=2' failed in submodule %s"), path);
+		die(_(STATUS_PORCELAIN_FAIL_ERROR), path);
 
 	strbuf_release(&buf);
 	return dirty_submodule;
-- 
2.40.0.rc0.216.gc4246ad0f0-goog


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v9 5/6] diff-lib: refactor out diff_change logic
  2023-03-02 21:52         ` [PATCH v9 " Calvin Wan
                             ` (3 preceding siblings ...)
  2023-03-02 22:02           ` [PATCH v9 4/6] submodule: refactor is_submodule_modified() Calvin Wan
@ 2023-03-02 22:02           ` Calvin Wan
  2023-03-02 22:02           ` [PATCH v9 6/6] diff-lib: parallelize run_diff_files for submodules Calvin Wan
  5 siblings, 0 replies; 86+ messages in thread
From: Calvin Wan @ 2023-03-02 22:02 UTC (permalink / raw)
  To: git; +Cc: Calvin Wan, avarab, chooglen, newren, jonathantanmy,
	phillip.wood123

In run_diff_files, there is logic that records the diff and updates
relevant bits at the end of each entry iteration. Refactor out that
logic into a helper function so a future patch can call it.

Signed-off-by: Calvin Wan <calvinwan@google.com>
---
 diff-lib.c | 48 +++++++++++++++++++++++++++++++-----------------
 1 file changed, 31 insertions(+), 17 deletions(-)

diff --git a/diff-lib.c b/diff-lib.c
index dec040c366..744ae98a69 100644
--- a/diff-lib.c
+++ b/diff-lib.c
@@ -88,6 +88,34 @@ static int match_stat_with_submodule(struct diff_options *diffopt,
 	return changed;
 }
 
+/**
+ * Records diff_change if there is a change in the entry from run_diff_files.
+ * If there is no change, then the cache entry is marked CE_UPTODATE and
+ * CE_FSMONITOR_VALID. If there is no change and the find_copies_harder flag
+ * is not set, then the function returns early.
+ */
+static void record_file_diff(struct diff_options *options, unsigned newmode,
+			     unsigned dirty_submodule, int changed,
+			     struct index_state *istate,
+			     struct cache_entry *ce)
+{
+	unsigned int oldmode;
+	const struct object_id *old_oid, *new_oid;
+
+	if (!changed && !dirty_submodule) {
+		ce_mark_uptodate(ce);
+		mark_fsmonitor_valid(istate, ce);
+		if (!options->flags.find_copies_harder)
+			return;
+	}
+	oldmode = ce->ce_mode;
+	old_oid = &ce->oid;
+	new_oid = changed ? null_oid() : &ce->oid;
+	diff_change(options, oldmode, newmode, old_oid, new_oid,
+		    !is_null_oid(old_oid), !is_null_oid(new_oid),
+		    ce->name, 0, dirty_submodule);
+}
+
 int run_diff_files(struct rev_info *revs, unsigned int option)
 {
 	int entries, i;
@@ -105,11 +133,10 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
 		diff_unmerged_stage = 2;
 	entries = istate->cache_nr;
 	for (i = 0; i < entries; i++) {
-		unsigned int oldmode, newmode;
+		unsigned int newmode;
 		struct cache_entry *ce = istate->cache[i];
 		int changed;
 		unsigned dirty_submodule = 0;
-		const struct object_id *old_oid, *new_oid;
 
 		if (diff_can_quit_early(&revs->diffopt))
 			break;
@@ -245,21 +272,8 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
 			newmode = ce_mode_from_stat(ce, st.st_mode);
 		}
 
-		if (!changed && !dirty_submodule) {
-			ce_mark_uptodate(ce);
-			mark_fsmonitor_valid(istate, ce);
-			if (!revs->diffopt.flags.find_copies_harder)
-				continue;
-		}
-		oldmode = ce->ce_mode;
-		old_oid = &ce->oid;
-		new_oid = changed ? null_oid() : &ce->oid;
-		diff_change(&revs->diffopt, oldmode, newmode,
-			    old_oid, new_oid,
-			    !is_null_oid(old_oid),
-			    !is_null_oid(new_oid),
-			    ce->name, 0, dirty_submodule);
-
+		record_file_diff(&revs->diffopt, newmode, dirty_submodule,
+				 changed, istate, ce);
 	}
 	diffcore_std(&revs->diffopt);
 	diff_flush(&revs->diffopt);
-- 
2.40.0.rc0.216.gc4246ad0f0-goog


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v9 6/6] diff-lib: parallelize run_diff_files for submodules
  2023-03-02 21:52         ` [PATCH v9 " Calvin Wan
                             ` (4 preceding siblings ...)
  2023-03-02 22:02           ` [PATCH v9 5/6] diff-lib: refactor out diff_change logic Calvin Wan
@ 2023-03-02 22:02           ` Calvin Wan
  2023-03-07  8:41             ` Ævar Arnfjörð Bjarmason
                               ` (2 more replies)
  5 siblings, 3 replies; 86+ messages in thread
From: Calvin Wan @ 2023-03-02 22:02 UTC (permalink / raw)
  To: git; +Cc: Calvin Wan, avarab, chooglen, newren, jonathantanmy,
	phillip.wood123

During the iteration of the index entries in run_diff_files, whenever a
submodule is found and needs its status checked, a subprocess is spawned
for it. Instead of spawning the subprocess immediately and waiting for
its completion to continue, hold onto all submodules and relevant
information in a list. Then use that list to create tasks for
run_processes_parallel. Subprocess output is passed to
status_on_stderr_output which stores it to be parsed on completion of
the subprocess.

Add config option submodule.diffJobs to set the maximum number of
parallel jobs. The option defaults to 1 if unset. If set to 0, the
number of jobs is set to online_cpus().

Since run_diff_files is called from many different commands, I chose to
grab the config option in the function rather than adding variables to
every git command and then figuring out how to pass them all in.

Signed-off-by: Calvin Wan <calvinwan@google.com>
---
 Documentation/config/submodule.txt |  12 +++
 diff-lib.c                         |  81 +++++++++++++++---
 submodule.c                        | 128 +++++++++++++++++++++++++++++
 submodule.h                        |   9 ++
 t/t4027-diff-submodule.sh          |  31 +++++++
 t/t7506-status-submodule.sh        |  25 ++++++
 6 files changed, 274 insertions(+), 12 deletions(-)

diff --git a/Documentation/config/submodule.txt b/Documentation/config/submodule.txt
index 6490527b45..3209eb8117 100644
--- a/Documentation/config/submodule.txt
+++ b/Documentation/config/submodule.txt
@@ -93,6 +93,18 @@ submodule.fetchJobs::
 	in parallel. A value of 0 will give some reasonable default.
 	If unset, it defaults to 1.
 
+submodule.diffJobs::
+	Specifies how many submodules are diffed at the same time. A
+	positive integer allows up to that number of submodules diffed
+	in parallel. A value of 0 will give some reasonable default.
+	If unset, it defaults to 1. The diff operation is used by many
+	other git commands such as add, merge, diff, status, stash and
+	more. Note that the expensive part of the diff operation is
+	reading the index from cache or memory. Therefore multiple jobs
+	may be detrimental to performance if your hardware does not
+	support parallel reads or if the number of jobs greatly exceeds
+	the amount of supported reads.
+
 submodule.alternateLocation::
 	Specifies how the submodules obtain alternates when submodules are
 	cloned. Possible values are `no`, `superproject`.
diff --git a/diff-lib.c b/diff-lib.c
index 744ae98a69..7fe6ced950 100644
--- a/diff-lib.c
+++ b/diff-lib.c
@@ -14,6 +14,7 @@
 #include "dir.h"
 #include "fsmonitor.h"
 #include "commit-reach.h"
+#include "config.h"
 
 /*
  * diff-files
@@ -65,26 +66,41 @@ static int check_removed(const struct index_state *istate, const struct cache_en
  * Return 1 when changes are detected, 0 otherwise. If the DIRTY_SUBMODULES
  * option is set, the caller does not only want to know if a submodule is
  * modified at all but wants to know all the conditions that are met (new
- * commits, untracked content and/or modified content).
+ * commits, untracked content and/or modified content). If
+ * defer_submodule_status bit is set, dirty_submodule will be left to the
+ * caller to set. defer_submodule_status can also be set to 0 in this
+ * function if there is no need to check if the submodule is modified.
  */
 static int match_stat_with_submodule(struct diff_options *diffopt,
 				     const struct cache_entry *ce,
 				     struct stat *st, unsigned ce_option,
-				     unsigned *dirty_submodule)
+				     unsigned *dirty_submodule, int *defer_submodule_status,
+				     unsigned *ignore_untracked)
 {
 	int changed = ie_match_stat(diffopt->repo->index, ce, st, ce_option);
+	int defer = 0;
+
 	if (S_ISGITLINK(ce->ce_mode)) {
 		struct diff_flags orig_flags = diffopt->flags;
 		if (!diffopt->flags.override_submodule_config)
 			set_diffopt_flags_from_submodule_config(diffopt, ce->name);
-		if (diffopt->flags.ignore_submodules)
+		if (diffopt->flags.ignore_submodules) {
 			changed = 0;
-		else if (!diffopt->flags.ignore_dirty_submodules &&
-			 (!changed || diffopt->flags.dirty_submodules))
-			*dirty_submodule = is_submodule_modified(ce->name,
-								 diffopt->flags.ignore_untracked_in_submodules);
+		} else if (!diffopt->flags.ignore_dirty_submodules &&
+			   (!changed || diffopt->flags.dirty_submodules)) {
+			if (defer_submodule_status && *defer_submodule_status) {
+				defer = 1;
+				*ignore_untracked = diffopt->flags.ignore_untracked_in_submodules;
+			} else {
+				*dirty_submodule = is_submodule_modified(ce->name,
+					 diffopt->flags.ignore_untracked_in_submodules);
+			}
+		}
 		diffopt->flags = orig_flags;
 	}
+
+	if (defer_submodule_status)
+		*defer_submodule_status = defer;
 	return changed;
 }
 
@@ -124,6 +140,7 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
 			      ? CE_MATCH_RACY_IS_DIRTY : 0);
 	uint64_t start = getnanotime();
 	struct index_state *istate = revs->diffopt.repo->index;
+	struct string_list submodules = STRING_LIST_INIT_NODUP;
 
 	diff_set_mnemonic_prefix(&revs->diffopt, "i/", "w/");
 
@@ -136,7 +153,7 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
 		unsigned int newmode;
 		struct cache_entry *ce = istate->cache[i];
 		int changed;
-		unsigned dirty_submodule = 0;
+		int defer_submodule_status = 1;
 
 		if (diff_can_quit_early(&revs->diffopt))
 			break;
@@ -247,6 +264,7 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
 			newmode = ce->ce_mode;
 		} else {
 			struct stat st;
+			unsigned ignore_untracked = 0;
 
 			changed = check_removed(istate, ce, &st);
 			if (changed) {
@@ -268,13 +286,52 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
 			}
 
 			changed = match_stat_with_submodule(&revs->diffopt, ce, &st,
-							    ce_option, &dirty_submodule);
+							    ce_option, NULL,
+							    &defer_submodule_status,
+							    &ignore_untracked);
 			newmode = ce_mode_from_stat(ce, st.st_mode);
+			if (defer_submodule_status) {
+				struct submodule_status_util tmp = {
+					.changed = changed,
+					.dirty_submodule = 0,
+					.ignore_untracked = ignore_untracked,
+					.newmode = newmode,
+					.ce = ce,
+					.path = ce->name,
+				};
+				struct string_list_item *item;
+
+				item = string_list_append(&submodules, ce->name);
+				item->util = xmalloc(sizeof(tmp));
+				memcpy(item->util, &tmp, sizeof(tmp));
+				continue;
+			}
 		}
 
-		record_file_diff(&revs->diffopt, newmode, dirty_submodule,
-				 changed, istate, ce);
+		if (!defer_submodule_status)
+			record_file_diff(&revs->diffopt, newmode, 0,
+					   changed,istate, ce);
+	}
+	if (submodules.nr) {
+		unsigned long parallel_jobs;
+		struct string_list_item *item;
+
+		if (git_config_get_ulong("submodule.diffjobs", &parallel_jobs))
+			parallel_jobs = 1;
+		else if (!parallel_jobs)
+			parallel_jobs = online_cpus();
+
+		if (get_submodules_status(&submodules, parallel_jobs))
+			die(_("submodule status failed"));
+		for_each_string_list_item(item, &submodules) {
+			struct submodule_status_util *util = item->util;
+
+			record_file_diff(&revs->diffopt, util->newmode,
+					 util->dirty_submodule, util->changed,
+					 istate, util->ce);
+		}
 	}
+	string_list_clear(&submodules, 1);
 	diffcore_std(&revs->diffopt);
 	diff_flush(&revs->diffopt);
 	trace_performance_since(start, "diff-files");
@@ -322,7 +379,7 @@ static int get_stat_data(const struct index_state *istate,
 			return -1;
 		}
 		changed = match_stat_with_submodule(diffopt, ce, &st,
-						    0, dirty_submodule);
+						    0, dirty_submodule, NULL, NULL);
 		if (changed) {
 			mode = ce_mode_from_stat(ce, st.st_mode);
 			oid = null_oid();
diff --git a/submodule.c b/submodule.c
index 426074cebb..6f6e150a3f 100644
--- a/submodule.c
+++ b/submodule.c
@@ -1373,6 +1373,13 @@ int submodule_touches_in_range(struct repository *r,
 	return ret;
 }
 
+struct submodule_parallel_status {
+	size_t index_count;
+	int result;
+
+	struct string_list *submodule_names;
+};
+
 struct submodule_parallel_fetch {
 	/*
 	 * The index of the last index entry processed by
@@ -1455,6 +1462,12 @@ struct fetch_task {
 	struct oid_array *commits; /* Ensure these commits are fetched */
 };
 
+struct status_task {
+	const char *path;
+	struct strbuf out;
+	int ignore_untracked;
+};
+
 /**
  * When a submodule is not defined in .gitmodules, we cannot access it
  * via the regular submodule-config. Create a fake submodule, which we can
@@ -1981,6 +1994,121 @@ unsigned is_submodule_modified(const char *path, int ignore_untracked)
 	return dirty_submodule;
 }
 
+static struct status_task *
+get_status_task_from_index(struct submodule_parallel_status *sps,
+			   struct strbuf *err)
+{
+	for (; sps->index_count < sps->submodule_names->nr; sps->index_count++) {
+		struct submodule_status_util *util = sps->submodule_names->items[sps->index_count].util;
+		struct status_task *task;
+
+		if (!verify_submodule_git_directory(util->path))
+			continue;
+
+		task = xmalloc(sizeof(*task));
+		task->path = util->path;
+		task->ignore_untracked = util->ignore_untracked;
+		strbuf_init(&task->out, 0);
+		sps->index_count++;
+		return task;
+	}
+	return NULL;
+}
+
+static int get_next_submodule_status(struct child_process *cp,
+				     struct strbuf *err, void *data,
+				     void **task_cb)
+{
+	struct submodule_parallel_status *sps = data;
+	struct status_task *task = get_status_task_from_index(sps, err);
+
+	if (!task)
+		return 0;
+
+	child_process_init(cp);
+	prepare_submodule_repo_env_in_gitdir(&cp->env);
+	prepare_status_porcelain(cp, task->path, task->ignore_untracked);
+	*task_cb = task;
+	return 1;
+}
+
+static int status_start_failure(struct strbuf *err,
+				void *cb, void *task_cb)
+{
+	struct submodule_parallel_status *sps = cb;
+	struct status_task *task = task_cb;
+
+	sps->result = 1;
+	strbuf_addf(err, _(STATUS_PORCELAIN_START_ERROR), task->path);
+	return 0;
+}
+
+static void status_on_stderr_output(struct strbuf *out,
+				    size_t offset,
+				    void *cb, void *task_cb)
+{
+	struct status_task *task = task_cb;
+
+	strbuf_add(&task->out, out->buf + offset, out->len - offset);
+	strbuf_setlen(out, offset);
+}
+
+static int status_finish(int retvalue, struct strbuf *err,
+			 void *cb, void *task_cb)
+{
+	struct submodule_parallel_status *sps = cb;
+	struct status_task *task = task_cb;
+	struct string_list_item *it =
+		string_list_lookup(sps->submodule_names, task->path);
+	struct submodule_status_util *util = it->util;
+	struct string_list list = STRING_LIST_INIT_DUP;
+	struct string_list_item *item;
+
+	if (retvalue) {
+		sps->result = 1;
+		strbuf_addf(err, _(STATUS_PORCELAIN_FAIL_ERROR), task->path);
+	}
+
+	string_list_split(&list, task->out.buf, '\n', -1);
+	for_each_string_list_item(item, &list) {
+		if (parse_status_porcelain(item->string,
+					   strlen(item->string),
+					   &util->dirty_submodule,
+					   util->ignore_untracked))
+			break;
+	}
+	string_list_clear(&list, 0);
+	strbuf_release(&task->out);
+	free(task);
+
+	return 0;
+}
+
+int get_submodules_status(struct string_list *submodules,
+			  int max_parallel_jobs)
+{
+	struct submodule_parallel_status sps = {
+		.submodule_names = submodules,
+	};
+	const struct run_process_parallel_opts opts = {
+		.tr2_category = "submodule",
+		.tr2_label = "parallel/status",
+
+		.processes = max_parallel_jobs,
+
+		.get_next_task = get_next_submodule_status,
+		.start_failure = status_start_failure,
+		.on_stderr_output = status_on_stderr_output,
+		.task_finished = status_finish,
+		.data = &sps,
+	};
+
+	string_list_sort(sps.submodule_names);
+	run_processes_parallel(&opts);
+
+	return sps.result;
+}
+
 int submodule_uses_gitfile(const char *path)
 {
 	struct child_process cp = CHILD_PROCESS_INIT;
diff --git a/submodule.h b/submodule.h
index b52a4ff1e7..08d278a414 100644
--- a/submodule.h
+++ b/submodule.h
@@ -41,6 +41,13 @@ struct submodule_update_strategy {
 	.type = SM_UPDATE_UNSPECIFIED, \
 }
 
+struct submodule_status_util {
+	int changed, ignore_untracked;
+	unsigned dirty_submodule, newmode;
+	struct cache_entry *ce;
+	const char *path;
+};
+
 int is_gitmodules_unmerged(struct index_state *istate);
 int is_writing_gitmodules_ok(void);
 int is_staging_gitmodules_ok(struct index_state *istate);
@@ -94,6 +101,8 @@ int fetch_submodules(struct repository *r,
 		     int command_line_option,
 		     int default_option,
 		     int quiet, int max_parallel_jobs);
+int get_submodules_status(struct string_list *submodules,
+			  int max_parallel_jobs);
 unsigned is_submodule_modified(const char *path, int ignore_untracked);
 int submodule_uses_gitfile(const char *path);
 
diff --git a/t/t4027-diff-submodule.sh b/t/t4027-diff-submodule.sh
index 40164ae07d..1c747cc325 100755
--- a/t/t4027-diff-submodule.sh
+++ b/t/t4027-diff-submodule.sh
@@ -34,6 +34,25 @@ test_expect_success setup '
 	subtip=$3 subprev=$2
 '
 
+test_expect_success 'diff in superproject with submodules respects parallel settings' '
+	test_when_finished "rm -f trace.out" &&
+	(
+		GIT_TRACE=$(pwd)/trace.out git diff &&
+		grep "1 tasks" trace.out &&
+		>trace.out &&
+
+		git config submodule.diffJobs 8 &&
+		GIT_TRACE=$(pwd)/trace.out git diff &&
+		grep "8 tasks" trace.out &&
+		>trace.out &&
+
+		GIT_TRACE=$(pwd)/trace.out git -c submodule.diffJobs=0 diff &&
+		grep "preparing to run up to [0-9]* tasks" trace.out &&
+		! grep "up to 0 tasks" trace.out &&
+		>trace.out
+	)
+'
+
 test_expect_success 'git diff --raw HEAD' '
 	hexsz=$(test_oid hexsz) &&
 	git diff --raw --abbrev=$hexsz HEAD >actual &&
@@ -70,6 +89,18 @@ test_expect_success 'git diff HEAD with dirty submodule (work tree)' '
 	test_cmp expect.body actual.body
 '
 
+test_expect_success 'git diff HEAD with dirty submodule (work tree, parallel)' '
+	(
+		cd sub &&
+		git reset --hard &&
+		echo >>world
+	) &&
+	git -c submodule.diffJobs=8 diff HEAD >actual &&
+	sed -e "1,/^@@/d" actual >actual.body &&
+	expect_from_to >expect.body $subtip $subprev-dirty &&
+	test_cmp expect.body actual.body
+'
+
 test_expect_success 'git diff HEAD with dirty submodule (index)' '
 	(
 		cd sub &&
diff --git a/t/t7506-status-submodule.sh b/t/t7506-status-submodule.sh
index d050091345..7da64e4c4c 100755
--- a/t/t7506-status-submodule.sh
+++ b/t/t7506-status-submodule.sh
@@ -412,4 +412,29 @@ test_expect_success 'status with added file in nested submodule (short)' '
 	EOF
 '
 
+test_expect_success 'status in superproject with submodules respects parallel settings' '
+	test_when_finished "rm -f trace.out" &&
+	(
+		GIT_TRACE=$(pwd)/trace.out git status &&
+		grep "1 tasks" trace.out &&
+		>trace.out &&
+
+		git config submodule.diffJobs 8 &&
+		GIT_TRACE=$(pwd)/trace.out git status &&
+		grep "8 tasks" trace.out &&
+		>trace.out &&
+
+		GIT_TRACE=$(pwd)/trace.out git -c submodule.diffJobs=0 status &&
+		grep "preparing to run up to [0-9]* tasks" trace.out &&
+		! grep "up to 0 tasks" trace.out &&
+		>trace.out
+	)
+'
+
+test_expect_success 'status in superproject with submodules (parallel)' '
+	git -C super status --porcelain >output &&
+	git -C super -c submodule.diffJobs=8 status --porcelain >output_parallel &&
+	diff output output_parallel
+'
+
 test_done
-- 
2.40.0.rc0.216.gc4246ad0f0-goog


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [PATCH v9 2/6] submodule: rename strbuf variable
  2023-03-02 22:02           ` [PATCH v9 2/6] submodule: rename strbuf variable Calvin Wan
@ 2023-03-03  0:25             ` Junio C Hamano
  2023-03-06 17:37               ` Calvin Wan
  0 siblings, 1 reply; 86+ messages in thread
From: Junio C Hamano @ 2023-03-03  0:25 UTC (permalink / raw)
  To: Calvin Wan; +Cc: git, avarab, chooglen, newren, jonathantanmy, phillip.wood123

Calvin Wan <calvinwan@google.com> writes:

> A prepatory change for a future patch that moves the status parsing
> logic to a separate function.
>
> Signed-off-by: Calvin Wan <calvinwan@google.com>
> ---
>  submodule.c | 23 +++++++++++++----------
>  1 file changed, 13 insertions(+), 10 deletions(-)

> Subject: Re: [PATCH v9 2/6] submodule: rename strbuf variable

What strbuf variable renamed to what?

I have a feeling that squashing this and 3/6 into a single patch,
and pass buf.buf and buf.len to the new helper function without
introducing an intermediate variables in the caller, would make the
resulting code easier to follow.

In any case, nice factoring out of a useful helper function.

> diff --git a/submodule.c b/submodule.c
> index fae24ef34a..faf37c1101 100644
> --- a/submodule.c
> +++ b/submodule.c
> @@ -1906,25 +1906,28 @@ unsigned is_submodule_modified(const char *path, int ignore_untracked)
>  
>  	fp = xfdopen(cp.out, "r");
>  	while (strbuf_getwholeline(&buf, fp, '\n') != EOF) {
> +		char *str = buf.buf;
> +		const size_t len = buf.len;
> +
>  		/* regular untracked files */
> -		if (buf.buf[0] == '?')
> +		if (str[0] == '?')
>  			dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
>  
> -		if (buf.buf[0] == 'u' ||
> -		    buf.buf[0] == '1' ||
> -		    buf.buf[0] == '2') {
> +		if (str[0] == 'u' ||
> +		    str[0] == '1' ||
> +		    str[0] == '2') {
>  			/* T = line type, XY = status, SSSS = submodule state */
> -			if (buf.len < strlen("T XY SSSS"))
> +			if (len < strlen("T XY SSSS"))
>  				BUG("invalid status --porcelain=2 line %s",
> -				    buf.buf);
> +				    str);
>  
> -			if (buf.buf[5] == 'S' && buf.buf[8] == 'U')
> +			if (str[5] == 'S' && str[8] == 'U')
>  				/* nested untracked file */
>  				dirty_submodule |= DIRTY_SUBMODULE_UNTRACKED;
>  
> -			if (buf.buf[0] == 'u' ||
> -			    buf.buf[0] == '2' ||
> -			    memcmp(buf.buf + 5, "S..U", 4))
> +			if (str[0] == 'u' ||
> +			    str[0] == '2' ||
> +			    memcmp(str + 5, "S..U", 4))
>  				/* other change */
>  				dirty_submodule |= DIRTY_SUBMODULE_MODIFIED;
>  		}

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v9 2/6] submodule: rename strbuf variable
  2023-03-03  0:25             ` Junio C Hamano
@ 2023-03-06 17:37               ` Calvin Wan
  2023-03-06 18:30                 ` Junio C Hamano
  0 siblings, 1 reply; 86+ messages in thread
From: Calvin Wan @ 2023-03-06 17:37 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, avarab, chooglen, newren, jonathantanmy, phillip.wood123

On Thu, Mar 2, 2023 at 4:25 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Calvin Wan <calvinwan@google.com> writes:
>
> > A prepatory change for a future patch that moves the status parsing
> > logic to a separate function.
> >
> > Signed-off-by: Calvin Wan <calvinwan@google.com>
> > ---
> >  submodule.c | 23 +++++++++++++----------
> >  1 file changed, 13 insertions(+), 10 deletions(-)
>
> > Subject: Re: [PATCH v9 2/6] submodule: rename strbuf variable
>
> What strbuf variable renamed to what?
>
> I have a feeling that squashing this and 3/6 into a single patch,
> and pass buf.buf and buf.len to the new helper function without
> introducing an intermediate variables in the caller, would make the
> resulting code easier to follow.
>
> In any case, nice factoring out of a useful helper function.
>

A much earlier version squashed those changes together, but it was
recommended to split those changes up; I think I am indifferent either way
since the refactoring is clear to me whether it is split up or not.
https://lore.kernel.org/git/221012.868rllo545.gmgdl@evledraar.gmail.com/

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v9 2/6] submodule: rename strbuf variable
  2023-03-06 17:37               ` Calvin Wan
@ 2023-03-06 18:30                 ` Junio C Hamano
  2023-03-06 19:00                   ` Calvin Wan
  0 siblings, 1 reply; 86+ messages in thread
From: Junio C Hamano @ 2023-03-06 18:30 UTC (permalink / raw)
  To: Calvin Wan; +Cc: git, avarab, chooglen, newren, jonathantanmy, phillip.wood123

Calvin Wan <calvinwan@google.com> writes:

> On Thu, Mar 2, 2023 at 4:25 PM Junio C Hamano <gitster@pobox.com> wrote:
>>
>> Calvin Wan <calvinwan@google.com> writes:
>>
>> > A prepatory change for a future patch that moves the status parsing
>> > logic to a separate function.
>> >
>> > Signed-off-by: Calvin Wan <calvinwan@google.com>
>> > ---
>> >  submodule.c | 23 +++++++++++++----------
>> >  1 file changed, 13 insertions(+), 10 deletions(-)
>>
>> > Subject: Re: [PATCH v9 2/6] submodule: rename strbuf variable
>>
>> What strbuf variable renamed to what?
>>
>> I have a feeling that squashing this and 3/6 into a single patch,
>> and pass buf.buf and buf.len to the new helper function without
>> introducing an intermediate variables in the caller, would make the
>> resulting code easier to follow.
>>
>> In any case, nice factoring out of a useful helper function.
>>
>
> A much earlier version squashed those changes together, but it was
> recommended to split those changes up; I think I am indifferent either way
> since the refactoring is clear to me whether it is split up or not.
> https://lore.kernel.org/git/221012.868rllo545.gmgdl@evledraar.gmail.com/

I am indifferent, either, but with or without them squashed into a
single patch, "rename strbuf" would not be how you would describe
the value of this refactoring, which is to make the interface not
depend on strbuf.  Some callers may have separate <ptr,len> pair
that is not in strbuf, and with the current interface they are
forced to wrap the pair in a throw-away strbuf which is not nice.

And squashing them together into a single patch, it becomes a lot
clear what the point of these two steps combined is.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v9 2/6] submodule: rename strbuf variable
  2023-03-06 18:30                 ` Junio C Hamano
@ 2023-03-06 19:00                   ` Calvin Wan
  0 siblings, 0 replies; 86+ messages in thread
From: Calvin Wan @ 2023-03-06 19:00 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, avarab, chooglen, newren, jonathantanmy, phillip.wood123

On Mon, Mar 6, 2023 at 10:30 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Calvin Wan <calvinwan@google.com> writes:
>
> > On Thu, Mar 2, 2023 at 4:25 PM Junio C Hamano <gitster@pobox.com> wrote:
> >>
> >> Calvin Wan <calvinwan@google.com> writes:
> >>
> >> > A prepatory change for a future patch that moves the status parsing
> >> > logic to a separate function.
> >> >
> >> > Signed-off-by: Calvin Wan <calvinwan@google.com>
> >> > ---
> >> >  submodule.c | 23 +++++++++++++----------
> >> >  1 file changed, 13 insertions(+), 10 deletions(-)
> >>
> >> > Subject: Re: [PATCH v9 2/6] submodule: rename strbuf variable
> >>
> >> What strbuf variable renamed to what?
> >>
> >> I have a feeling that squashing this and 3/6 into a single patch,
> >> and pass buf.buf and buf.len to the new helper function without
> >> introducing an intermediate variables in the caller, would make the
> >> resulting code easier to follow.
> >>
> >> In any case, nice factoring out of a useful helper function.
> >>
> >
> > A much earlier version squashed those changes together, but it was
> > recommended to split those changes up; I think I am indifferent either way
> > since the refactoring is clear to me whether it is split up or not.
> > https://lore.kernel.org/git/221012.868rllo545.gmgdl@evledraar.gmail.com/
>
> I am indifferent, either, but with or without them squashed into a
> single patch, "rename strbuf" would not be how you would describe
> the value of this refactoring, which is to make the interface not
> depend on strbuf.  Some callers may have separate <ptr,len> pair
> that is not in strbuf, and with the current interface they are
> forced to wrap the pair in a throw-away strbuf which is not nice.

I see what you mean here; will reword the commit message, thanks!

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v9 6/6] diff-lib: parallelize run_diff_files for submodules
  2023-03-02 22:02           ` [PATCH v9 6/6] diff-lib: parallelize run_diff_files for submodules Calvin Wan
@ 2023-03-07  8:41             ` Ævar Arnfjörð Bjarmason
  2023-03-07 10:21             ` Ævar Arnfjörð Bjarmason
  2023-03-17  1:09             ` Glen Choo
  2 siblings, 0 replies; 86+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2023-03-07  8:41 UTC (permalink / raw)
  To: Calvin Wan; +Cc: git, chooglen, newren, jonathantanmy, phillip.wood123

On Thu, Mar 02 2023, Calvin Wan wrote:

Some of this is stuff I probably should have noted in earlier rounds,
sorry, but then again the diff-churn in those made it harder to review,
now that that's mostly out of the way (yay!) ....

> +submodule.diffJobs::
> +	Specifies how many submodules are diffed at the same time. A
> +	positive integer allows up to that number of submodules diffed
> +	in parallel. A value of 0 will give some reasonable default.
> +	If unset, it defaults to 1. The diff operation is used by many

Nit: Maybe start a new paragraph as of "The diff..."?

> +	other git commands such as add, merge, diff, status, stash and
> +	more. Note that the expensive part of the diff operation is

Nit: Maybe change 'add', 'merge' etc. to linkgit:git-add[1], or quote
them?

> +	reading the index from cache or memory. Therefore multiple jobs

With how much we conflate "the cache" and "index" saying "the index from
cache" might be especially confusing. I think we can just skip " from
cache or memory" here.

>  static int match_stat_with_submodule(struct diff_options *diffopt,
>  				     const struct cache_entry *ce,
>  				     struct stat *st, unsigned ce_option,
> -				     unsigned *dirty_submodule)
> +				     unsigned *dirty_submodule, int *defer_submodule_status,

Nit: The other one is an "unsigned", shouldn't "defer_submodule_status"
also be (more on this below).

> +				     unsigned *ignore_untracked)
>  {
>  	int changed = ie_match_stat(diffopt->repo->index, ce, st, ce_option);
> +	int defer = 0;
> +
>  	if (S_ISGITLINK(ce->ce_mode)) {
>  		struct diff_flags orig_flags = diffopt->flags;
>  		if (!diffopt->flags.override_submodule_config)
>  			set_diffopt_flags_from_submodule_config(diffopt, ce->name);

The meaty functional change here looks *much* better, thanks! I.e. this
is pretty much what I suggested in
https://lore.kernel.org/git/230208.861qn01s4g.gmgdl@evledraar.gmail.com/

> -		if (diffopt->flags.ignore_submodules)
> +		if (diffopt->flags.ignore_submodules) {

Not worth a re-roll in itself, but FWIW I think this change would be
marginally easier to follow with *a* preceding refactoring change, but
per the above &
https://lore.kernel.org/git/230209.867cwrzk1l.gmgdl@evledraar.gmail.com/
I just didn't think v7's 6/7
(https://lore.kernel.org/git/20230207181706.363453-7-calvinwan@google.com/)
was what we needed there.

I.e. in this case a leading change that would add these braces would
make this a bit easier to read...

>  			changed = 0;
> -		else if (!diffopt->flags.ignore_dirty_submodules &&

...ditto this line, which would stay the same.

> -			 (!changed || diffopt->flags.dirty_submodules))
> -			*dirty_submodule = is_submodule_modified(ce->name,
> -								 diffopt->flags.ignore_untracked_in_submodules);

Here you are incorrectly changing the indentation of this away from our
usual coding style, which...

> +		} else if (!diffopt->flags.ignore_dirty_submodules &&
> +			   (!changed || diffopt->flags.dirty_submodules)) {
> +			if (defer_submodule_status && *defer_submodule_status) {

Hrm, if if I remove that "&& *defer_submodule_status" all of our tests
pass, the only two callers of this function are one where this is NULL,
and where it's non-NULL but pre-initilized to 1, and the caller will
check if it's then flipped to 0.

> +				defer = 1;
> +				*ignore_untracked = diffopt->flags.ignore_untracked_in_submodules;
> +			} else {
> +				*dirty_submodule = is_submodule_modified(ce->name,
> +					 diffopt->flags.ignore_untracked_in_submodules);

...needlessly inflates the diff here, at least under -w and move
detection, as we correctly detect the "*dirty_submodule" line as the
same, but the "diffopt->flags" line also has a re-indentation change
unrelated to adding the "else" scope.

> +			}
> +		}
>  		diffopt->flags = orig_flags;
>  	}
> +
> +	if (defer_submodule_status)
> +		*defer_submodule_status = defer;

Having read this whole thing to the end again I think this on top would
be much simpler (if I'm right about it being functionally equivalent),
and would address some of the above:

	diff --git a/diff-lib.c b/diff-lib.c
	index 7fe6ced9501..d5c823f512a 100644
	--- a/diff-lib.c
	+++ b/diff-lib.c
	@@ -78,7 +78,6 @@ static int match_stat_with_submodule(struct diff_options *diffopt,
	 				     unsigned *ignore_untracked)
	 {
	 	int changed = ie_match_stat(diffopt->repo->index, ce, st, ce_option);
	-	int defer = 0;

	 	if (S_ISGITLINK(ce->ce_mode)) {
	 		struct diff_flags orig_flags = diffopt->flags;
	@@ -88,8 +87,8 @@ static int match_stat_with_submodule(struct diff_options *diffopt,
	 			changed = 0;
	 		} else if (!diffopt->flags.ignore_dirty_submodules &&
	 			   (!changed || diffopt->flags.dirty_submodules)) {
	-			if (defer_submodule_status && *defer_submodule_status) {
	-				defer = 1;
	+			if (defer_submodule_status) {
	+				*defer_submodule_status = 1;
	 				*ignore_untracked = diffopt->flags.ignore_untracked_in_submodules;
	 			} else {
	 				*dirty_submodule = is_submodule_modified(ce->name,
	@@ -99,8 +98,6 @@ static int match_stat_with_submodule(struct diff_options *diffopt,
	 		diffopt->flags = orig_flags;
	 	}

	-	if (defer_submodule_status)
	-		*defer_submodule_status = defer;
	 	return changed;
	 }

	@@ -153,7 +150,7 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
	 		unsigned int newmode;
	 		struct cache_entry *ce = istate->cache[i];
	 		int changed;
	-		int defer_submodule_status = 1;
	+		int defer_submodule_status = 0;

	 		if (diff_can_quit_early(&revs->diffopt))
	 			break;

We could also just leave it, but I for one found it a bit hard to follow
that this interface seems to be a tri-state (NULL, set to 0, set to 1),
but really it's dual-state, i.e. NULL or a "tell me to defer this" bit.

>  	return changed;
>  }
>  
> @@ -124,6 +140,7 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
>  			      ? CE_MATCH_RACY_IS_DIRTY : 0);
>  	uint64_t start = getnanotime();
>  	struct index_state *istate = revs->diffopt.repo->index;
> +	struct string_list submodules = STRING_LIST_INIT_NODUP;
>  
>  	diff_set_mnemonic_prefix(&revs->diffopt, "i/", "w/");
>  
> @@ -136,7 +153,7 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
>  		unsigned int newmode;
>  		struct cache_entry *ce = istate->cache[i];
>  		int changed;
> -		unsigned dirty_submodule = 0;
> +		int defer_submodule_status = 1;

Hrm, having suggested the diff above I just noticed this now, I ended up
inverting this, but found the "defer_submodule_status" name a bit odd,
can't we just keep "unsigned dirty_submodule"? (that would also address
the change from "unsigned" to "int" noted above, which is seeminly
unnecessary).

But maybe I'm missing a subtlety here, and we should have "deferred
status" as apposed to "dirty submodule", but in any case the new one
looks like it doesn't need negative values.

> +	}
> +	if (submodules.nr) {
> +		unsigned long parallel_jobs;
> +		struct string_list_item *item;
> +
> +		if (git_config_get_ulong("submodule.diffjobs", &parallel_jobs))
> +			parallel_jobs = 1;
> +		else if (!parallel_jobs)
> +			parallel_jobs = online_cpus();

Given that online_cpus() returns int the "unsigned long" is slightly odd
here, but it's because git_config_get_ulong() exist, but we have no
git_config_get_uint(), so this is OK (but could be cleaned up as some
#leftoverbits).

> +		if (get_submodules_status(&submodules, parallel_jobs))
> +			die(_("submodule status failed"));

Here we're adding get_submodules_status(), and returning the actual
error code from "status", but then ignoring it here, and returning 128
for any non-zero.

I think this would be better as either:

	code = get_submodules_status(...);
	die_message(...)
	exit(code);

Or to just have the function itself return !!status, i.e. a "ok" or "not
ok".

Admittedly a nit, but I have spent quite a bit of time chasing down
various exit-code losses in the submodule code, and it would be nice if
we just carry the code up, or more explicitly ignore it, but don't add
code that seems to care about it, but really doesn't.

I also changed this "die" to a "BUG" and our tests passed, so we have no
tests for when "status" failed, will such a thing even happen in
practice?

> +		for_each_string_list_item(item, &submodules) {
> +			struct submodule_status_util *util = item->util;
> +
> +			record_file_diff(&revs->diffopt, util->newmode,
> +					 util->dirty_submodule, util->changed,
> +					 istate, util->ce);
> +		}
>  	}
> +	string_list_clear(&submodules, 1);
>  	diffcore_std(&revs->diffopt);
>  	diff_flush(&revs->diffopt);
>  	trace_performance_since(start, "diff-files");
> @@ -322,7 +379,7 @@ static int get_stat_data(const struct index_state *istate,
>  			return -1;
>  		}
>  		changed = match_stat_with_submodule(diffopt, ce, &st,
> -						    0, dirty_submodule);
> +						    0, dirty_submodule, NULL, NULL);
>  		if (changed) {
>  			mode = ce_mode_from_stat(ce, st.st_mode);
>  			oid = null_oid();
> diff --git a/submodule.c b/submodule.c
> index 426074cebb..6f6e150a3f 100644
> --- a/submodule.c
> +++ b/submodule.c
> @@ -1373,6 +1373,13 @@ int submodule_touches_in_range(struct repository *r,
>  	return ret;
>  }
>  
> +struct submodule_parallel_status {
> +	size_t index_count;
> +	int result;
> +
> +	struct string_list *submodule_names;
> +};

Hrm, actually reading a bit more I think part of my comments above are
incorrect, i.e. this "result" seems like an exit code, but really in the
guts of the API we're ignoring the actual code we get, and just setting
this to 1.

Per the above I think it might be OK to ignore the exit code (or not),
but I really wish we did this more explicitly, e.g. if you want to
ignore it call this something like "failed", not "result", and make it
an "unsigned int failed:1" to firmly indicate that it's a boolean at the
API level.

> +struct status_task {
> +	const char *path;

I think we should call this "ce_path", but more on that below.

> +	struct strbuf out;
> +	int ignore_untracked;

Continued type mismatch commentary: Elsewhere in this diff this is
"unsigned", and this compiles for me if I make it "unsigned int
ignore_untracked:1", so let's set it to such a flag instead?

> +static int status_finish(int retvalue, struct strbuf *err,
> +			 void *cb, void *task_cb)
> +{
> +	struct submodule_parallel_status *sps = cb;
> +	struct status_task *task = task_cb;
> +	struct string_list_item *it =
> +		string_list_lookup(sps->submodule_names, task->path);
> +	struct submodule_status_util *util = it->util;
> +	struct string_list list = STRING_LIST_INIT_DUP;
> +	struct string_list_item *item;
> +
> +	if (retvalue) {
> +		sps->result = 1;
> +		strbuf_addf(err, _(STATUS_PORCELAIN_FAIL_ERROR), task->path);
> +	}
> +
> +	string_list_split(&list, task->out.buf, '\n', -1);

I think I noted in some earlier round that taking a string and splitting
it by \n was a bit wasteful in the test code, but this uses the same
pattern.

Maybe it's not a performance concern here either, but won't we
potentially have to parse some very large statuses here?

Aside from that, I haven't tried or reviewed this bit in detail, but
this seems to be making things harder than they need to be. Why are we
buffering up all of the output into "out" here, only to split it by "\n"
later on, and then consider each line as a status line?

Shouldn't we be allocating this string_list to begin with, and append to
it in the "status_on_stderr_output" callback instead?

> +	for_each_string_list_item(item, &list) {
> +		if (parse_status_porcelain(item->string,
> +					   strlen(item->string),
> +					   &util->dirty_submodule,
> +					   util->ignore_untracked))

OK, this seemingly buggy bit of error handling seems to actually be OK
on further review, because we'll BUG() out in the function if it fails,
so the non-zero return here just means "we're done here".

> +			break;
> +	}

Style: drop the braces here, as this is just a for/if/body with a single
body line.

> +int get_submodules_status(struct string_list *submodules,
> +			  int max_parallel_jobs)
> +{
> +	struct submodule_parallel_status sps = {
> +		.submodule_names = submodules,
> +	};
> +	const struct run_process_parallel_opts opts = {
> +		.tr2_category = "submodule",
> +		.tr2_label = "parallel/status",
> +
> +		.processes = max_parallel_jobs,
> +
> +		.get_next_task = get_next_submodule_status,
> +		.start_failure = status_start_failure,
> +		.on_stderr_output = status_on_stderr_output,
> +		.task_finished = status_finish,
> +		.data = &sps,
> +	};
> +
> +	string_list_sort(sps.submodule_names);
> +	run_processes_parallel(&opts);
> +
> +	return sps.result;

All OK, except as noted above the "result" here is just "did we fail?".

> +}
> +
>  int submodule_uses_gitfile(const char *path)
>  {
>  	struct child_process cp = CHILD_PROCESS_INIT;
> diff --git a/submodule.h b/submodule.h
> index b52a4ff1e7..08d278a414 100644
> --- a/submodule.h
> +++ b/submodule.h
> @@ -41,6 +41,13 @@ struct submodule_update_strategy {
>  	.type = SM_UPDATE_UNSPECIFIED, \
>  }
>  
> +struct submodule_status_util {
> +	int changed, ignore_untracked;
> +	unsigned dirty_submodule, newmode;
> +	struct cache_entry *ce;
> +	const char *path;

Re "ce_path" above: What's the point of adding a "path" here if we
already have "ce"? You just seem to assign "path" to "ce->name"
always. I tried this fix-up on top & it worked:

	diff --git a/diff-lib.c b/diff-lib.c
	index d5c823f512a..39d8179f0ed 100644
	--- a/diff-lib.c
	+++ b/diff-lib.c
	@@ -294,7 +294,6 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
	 					.ignore_untracked = ignore_untracked,
	 					.newmode = newmode,
	 					.ce = ce,
	-					.path = ce->name,
	 				};
	 				struct string_list_item *item;

	diff --git a/submodule.c b/submodule.c
	index 3eba00f1533..c220d85815a 100644
	--- a/submodule.c
	+++ b/submodule.c
	@@ -2002,11 +2002,11 @@ get_status_task_from_index(struct submodule_parallel_status *sps,
	 		struct submodule_status_util *util = sps->submodule_names->items[sps->index_count].util;
	 		struct status_task *task;

	-		if (!verify_submodule_git_directory(util->path))
	+		if (!verify_submodule_git_directory(util->ce->name))
	 			continue;

	 		task = xmalloc(sizeof(*task));
	-		task->path = util->path;
	+		task->path = util->ce->name;
	 		task->ignore_untracked = util->ignore_untracked;
	 		strbuf_init(&task->out, 0);
	 		sps->index_count++;
	diff --git a/submodule.h b/submodule.h
	index 3b6abca05cd..3427c495573 100644
	--- a/submodule.h
	+++ b/submodule.h
	@@ -45,7 +45,6 @@ struct submodule_status_util {
	 	int changed, ignore_untracked;
	 	unsigned dirty_submodule, newmode;
	 	struct cache_entry *ce;
	-	const char *path;
	 };

	 int is_gitmodules_unmerged(struct index_state *istate);

I'd be all for actually narrowing the scope of data we get in general,
i.e. do we need all of the "ce" members? I didn't check, but doing this
just seems like needless duplication.

> @@ -94,6 +101,8 @@ int fetch_submodules(struct repository *r,
>  		     int command_line_option,
>  		     int default_option,
>  		     int quiet, int max_parallel_jobs);
> +int get_submodules_status(struct string_list *submodules,
> +			  int max_parallel_jobs);

It would be nice to get some API docs for the new function, re its
"result" behavior etc. noted above

>  unsigned is_submodule_modified(const char *path, int ignore_untracked);
>  int submodule_uses_gitfile(const char *path);
>  
> diff --git a/t/t4027-diff-submodule.sh b/t/t4027-diff-submodule.sh
> index 40164ae07d..1c747cc325 100755
> --- a/t/t4027-diff-submodule.sh
> +++ b/t/t4027-diff-submodule.sh
> @@ -34,6 +34,25 @@ test_expect_success setup '
>  	subtip=$3 subprev=$2
>  '
>  
> +test_expect_success 'diff in superproject with submodules respects parallel settings' '
> +	test_when_finished "rm -f trace.out" &&
> +	(
> +		GIT_TRACE=$(pwd)/trace.out git diff &&
> +		grep "1 tasks" trace.out &&
> +		>trace.out &&
> +
> +		git config submodule.diffJobs 8 &&
> +		GIT_TRACE=$(pwd)/trace.out git diff &&
> +		grep "8 tasks" trace.out &&
> +		>trace.out &&
> +
> +		GIT_TRACE=$(pwd)/trace.out git -c submodule.diffJobs=0 diff &&
> +		grep "preparing to run up to [0-9]* tasks" trace.out &&
> +		! grep "up to 0 tasks" trace.out &&
> +		>trace.out
> +	)
> +'
> +
>  test_expect_success 'git diff --raw HEAD' '
>  	hexsz=$(test_oid hexsz) &&
>  	git diff --raw --abbrev=$hexsz HEAD >actual &&
> @@ -70,6 +89,18 @@ test_expect_success 'git diff HEAD with dirty submodule (work tree)' '
>  	test_cmp expect.body actual.body
>  '
>  
> +test_expect_success 'git diff HEAD with dirty submodule (work tree, parallel)' '
> +	(
> +		cd sub &&
> +		git reset --hard &&
> +		echo >>world
> +	) &&
> +	git -c submodule.diffJobs=8 diff HEAD >actual &&
> +	sed -e "1,/^@@/d" actual >actual.body &&
> +	expect_from_to >expect.body $subtip $subprev-dirty &&
> +	test_cmp expect.body actual.body
> +'
> +
>  test_expect_success 'git diff HEAD with dirty submodule (index)' '
>  	(
>  		cd sub &&
> diff --git a/t/t7506-status-submodule.sh b/t/t7506-status-submodule.sh
> index d050091345..7da64e4c4c 100755
> --- a/t/t7506-status-submodule.sh
> +++ b/t/t7506-status-submodule.sh
> @@ -412,4 +412,29 @@ test_expect_success 'status with added file in nested submodule (short)' '
>  	EOF
>  '
>  
> +test_expect_success 'status in superproject with submodules respects parallel settings' '
> +	test_when_finished "rm -f trace.out" &&
> +	(
> +		GIT_TRACE=$(pwd)/trace.out git status &&
> +		grep "1 tasks" trace.out &&
> +		>trace.out &&
> +
> +		git config submodule.diffJobs 8 &&
> +		GIT_TRACE=$(pwd)/trace.out git status &&
> +		grep "8 tasks" trace.out &&
> +		>trace.out &&
> +
> +		GIT_TRACE=$(pwd)/trace.out git -c submodule.diffJobs=0 status &&
> +		grep "preparing to run up to [0-9]* tasks" trace.out &&
> +		! grep "up to 0 tasks" trace.out &&
> +		>trace.out
> +	)
> +'
> +
> +test_expect_success 'status in superproject with submodules (parallel)' '
> +	git -C super status --porcelain >output &&
> +	git -C super -c submodule.diffJobs=8 status --porcelain >output_parallel &&
> +	diff output output_parallel

Shouldn't this be a "test_cmp" instead of "diff", and use "actual" and
"expect" instead of "output" and "output_parallel"?

I'd also rename the test to something like "output with
submodule.diffJobs=N equals submodule.diffJobs=1".

Except is that even correct? Don't we need to set submodule.diffJobs=1
explicitly so it doesn't default to online_cpus() here? Maybe I missed
an earlier config setup...

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v9 6/6] diff-lib: parallelize run_diff_files for submodules
  2023-03-02 22:02           ` [PATCH v9 6/6] diff-lib: parallelize run_diff_files for submodules Calvin Wan
  2023-03-07  8:41             ` Ævar Arnfjörð Bjarmason
@ 2023-03-07 10:21             ` Ævar Arnfjörð Bjarmason
  2023-03-07 17:55               ` Junio C Hamano
  2023-03-17  1:09             ` Glen Choo
  2 siblings, 1 reply; 86+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2023-03-07 10:21 UTC (permalink / raw)
  To: Calvin Wan; +Cc: git, chooglen, newren, jonathantanmy, phillip.wood123


On Thu, Mar 02 2023, Calvin Wan wrote:

> +		if (git_config_get_ulong("submodule.diffjobs", &parallel_jobs))
> +			parallel_jobs = 1;

Something I missed when eyeballing this in my just-sent review, here we
have a "revs->repo" already, so let's not fall back on "the_repository",
but use it. I think you want this as a fix-up:
	
	diff --git a/diff-lib.c b/diff-lib.c
	index 925d64ff58c..ec8a0f98085 100644
	--- a/diff-lib.c
	+++ b/diff-lib.c
	@@ -312,7 +312,8 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
	 		unsigned long parallel_jobs;
	 		struct string_list_item *item;
	 
	-		if (git_config_get_ulong("submodule.diffjobs", &parallel_jobs))
	+		if (repo_config_get_ulong(revs->repo, "submodule.diffjobs",
	+					  &parallel_jobs))
	 			parallel_jobs = 1;
	 		else if (!parallel_jobs)
	 			parallel_jobs = online_cpus();

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v9 6/6] diff-lib: parallelize run_diff_files for submodules
  2023-03-07 10:21             ` Ævar Arnfjörð Bjarmason
@ 2023-03-07 17:55               ` Junio C Hamano
  0 siblings, 0 replies; 86+ messages in thread
From: Junio C Hamano @ 2023-03-07 17:55 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Calvin Wan, git, chooglen, newren, jonathantanmy, phillip.wood123

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> On Thu, Mar 02 2023, Calvin Wan wrote:
>
>> +		if (git_config_get_ulong("submodule.diffjobs", &parallel_jobs))
>> +			parallel_jobs = 1;
>
> Something I missed when eyeballing this in my just-sent review, here we
> have a "revs->repo" already, so let's not fall back on "the_repository",
> but use it. I think you want this as a fix-up:
> 	
> 	diff --git a/diff-lib.c b/diff-lib.c
> 	index 925d64ff58c..ec8a0f98085 100644
> 	--- a/diff-lib.c
> 	+++ b/diff-lib.c
> 	@@ -312,7 +312,8 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
> 	 		unsigned long parallel_jobs;
> 	 		struct string_list_item *item;
> 	 
> 	-		if (git_config_get_ulong("submodule.diffjobs", &parallel_jobs))
> 	+		if (repo_config_get_ulong(revs->repo, "submodule.diffjobs",
> 	+					  &parallel_jobs))
> 	 			parallel_jobs = 1;
> 	 		else if (!parallel_jobs)
> 	 			parallel_jobs = online_cpus();

Good eyes.  Thanks for a careful review.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v9 6/6] diff-lib: parallelize run_diff_files for submodules
  2023-03-02 22:02           ` [PATCH v9 6/6] diff-lib: parallelize run_diff_files for submodules Calvin Wan
  2023-03-07  8:41             ` Ævar Arnfjörð Bjarmason
  2023-03-07 10:21             ` Ævar Arnfjörð Bjarmason
@ 2023-03-17  1:09             ` Glen Choo
  2023-03-17  2:51               ` Glen Choo
  2 siblings, 1 reply; 86+ messages in thread
From: Glen Choo @ 2023-03-17  1:09 UTC (permalink / raw)
  To: Calvin Wan, git
  Cc: Calvin Wan, avarab, newren, jonathantanmy, phillip.wood123

I haven't verified if the code in this version is correct or not, as I
found it a bit difficult to follow through the churn. After reading this
series again, I've established a better mental model of the code, and I
think there are some renames and documentation changes we can make to
make this clearer.

Unfortunately, I think the biggest clarification would be _yet_ another
refactor, and I'm not sure if we actually want to bear so much churn. I
might do this refactor locally to see if it really is _much_ cleaner or
not.

If anyone has thoughts on the refactor, do chime in.

Calvin Wan <calvinwan@google.com> writes:

> diff --git a/diff-lib.c b/diff-lib.c
> index 744ae98a69..7fe6ced950 100644
> --- a/diff-lib.c
> +++ b/diff-lib.c
> @@ -65,26 +66,41 @@ static int check_removed(const struct index_state *istate, const struct cache_en
>   * Return 1 when changes are detected, 0 otherwise. If the DIRTY_SUBMODULES
>   * option is set, the caller does not only want to know if a submodule is
>   * modified at all but wants to know all the conditions that are met (new
> - * commits, untracked content and/or modified content).
> + * commits, untracked content and/or modified content). If
> + * defer_submodule_status bit is set, dirty_submodule will be left to the
> + * caller to set. defer_submodule_status can also be set to 0 in this
> + * function if there is no need to check if the submodule is modified.
>   */
>  static int match_stat_with_submodule(struct diff_options *diffopt,
>  				     const struct cache_entry *ce,
>  				     struct stat *st, unsigned ce_option,
> -				     unsigned *dirty_submodule)
> +				     unsigned *dirty_submodule, int *defer_submodule_status,
> +				     unsigned *ignore_untracked)
>  {
>  	int changed = ie_match_stat(diffopt->repo->index, ce, st, ce_option);
> +	int defer = 0;
> +
>  	if (S_ISGITLINK(ce->ce_mode)) {
>  		struct diff_flags orig_flags = diffopt->flags;
>  		if (!diffopt->flags.override_submodule_config)
>  			set_diffopt_flags_from_submodule_config(diffopt, ce->name);
> -		if (diffopt->flags.ignore_submodules)
> +		if (diffopt->flags.ignore_submodules) {
>  			changed = 0;
> -		else if (!diffopt->flags.ignore_dirty_submodules &&
> -			 (!changed || diffopt->flags.dirty_submodules))
> -			*dirty_submodule = is_submodule_modified(ce->name,
> -								 diffopt->flags.ignore_untracked_in_submodules);
> +		} else if (!diffopt->flags.ignore_dirty_submodules &&
> +			   (!changed || diffopt->flags.dirty_submodules)) {
> +			if (defer_submodule_status && *defer_submodule_status) {
> +				defer = 1;
> +				*ignore_untracked = diffopt->flags.ignore_untracked_in_submodules;
> +			} else {
> +				*dirty_submodule = is_submodule_modified(ce->name,
> +					 diffopt->flags.ignore_untracked_in_submodules);
> +			}
> +		}
>  		diffopt->flags = orig_flags;
>  	}
> +
> +	if (defer_submodule_status)
> +		*defer_submodule_status = defer;

The crux of this patch is that we are replacing some serial operation
with a parallel operation. The replacement happens here, where we are
replacing is_submodule_modified() by 'deferring' it.

So to verify if the parallel implementation is correct, we should
compare the "setup" and "finish" steps in is_submodule_modified() and
get_submodules_status(). Eyeballing it, it looks correct, especially
because we made sure to refactor out the shared logic in previous
patches.

To reflect this, I think it would be clearer to rename
get_submodules_status() to something similar (e.g.
are_submodules_modified_parallel()), with an explicit comment saying
that it is meant to be a parallel implementation of
is_submodule_modified().

Except, I told a little white lie in the previous paragraph, because
get_submodules_status() isn't _just_ a parallel implementation of
is_submodule_modified()...

> @@ -268,13 +286,52 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
>  			}
>  
>  			changed = match_stat_with_submodule(&revs->diffopt, ce, &st,
> -							    ce_option, &dirty_submodule);
> +							    ce_option, NULL,
> +							    &defer_submodule_status,
> +							    &ignore_untracked);
>  			newmode = ce_mode_from_stat(ce, st.st_mode);
> +			if (defer_submodule_status) {
> +				struct submodule_status_util tmp = {
> +					.changed = changed,
> +					.dirty_submodule = 0,
> +					.ignore_untracked = ignore_untracked,
> +					.newmode = newmode,
> +					.ce = ce,
> +					.path = ce->name,
> +				};
> +				struct string_list_item *item;
> +
> +				item = string_list_append(&submodules, ce->name);
> +				item->util = xmalloc(sizeof(tmp));
> +				memcpy(item->util, &tmp, sizeof(tmp));
> +				continue;
> +			}

because get_submodules_status() doesn't just contain the results of
the parallel processes, it is _also_ shuttling "changed" and
"ignore_untracked" from match_stat_with_submodule(), as well as
.newmode, .ce and .path from run_diff_files() (basically everything
except .dirty_submodule)...

>  		}
>  
> -		record_file_diff(&revs->diffopt, newmode, dirty_submodule,
> -				 changed, istate, ce);
> +		if (!defer_submodule_status)
> +			record_file_diff(&revs->diffopt, newmode, 0,
> +					   changed,istate, ce);
> +	}
> +	if (submodules.nr) {
> +		unsigned long parallel_jobs;
> +		struct string_list_item *item;
> +
> +		if (git_config_get_ulong("submodule.diffjobs", &parallel_jobs))
> +			parallel_jobs = 1;
> +		else if (!parallel_jobs)
> +			parallel_jobs = online_cpus();
> +
> +		if (get_submodules_status(&submodules, parallel_jobs))
> +			die(_("submodule status failed"));
> +		for_each_string_list_item(item, &submodules) {
> +			struct submodule_status_util *util = item->util;
> +
> +			record_file_diff(&revs->diffopt, util->newmode,
> +					 util->dirty_submodule, util->changed,
> +					 istate, util->ce);
> +		}

so that we can pass all of this back into record_file_diff(). The only
member that is changed by the parallel process is .dirty_submodule,
which is exactly what we would expect from a parallel version of
is_submodule_modified().

If we don't want to do a bigger refactor, I think we should also add
comments to members of "struct submodule_status_util" to document where
they come from and what they are used for.

The rest of the comments are refactor-related.

It would be good if we could avoid mixing unrelated information sources
in "struct submodule_status_util", since a) this makes it very tightly
coupled to run_diff_files() and b) it causes us to repeat ourselves in
the same function (.changed = changed, record_file_diff()).

The only reason why the code looks this way right now is that
match_stat_with_submodule() sets defer_submodule_status based on whether
or not we should ignore the submodule, and this eventually tells
get_submodule_status() what submodules it needs to care about. But,
deciding whether to spawn a subprocess for which submodule is exactly
what the .get_next_task member is for.

> diff --git a/submodule.c b/submodule.c
> index 426074cebb..6f6e150a3f 100644
> --- a/submodule.c
> +++ b/submodule.c
> @@ -1981,6 +1994,121 @@ unsigned is_submodule_modified(const char *path, int ignore_untracked)
>  	return dirty_submodule;
>  }
>  
> +static struct status_task *
> +get_status_task_from_index(struct submodule_parallel_status *sps,
> +			   struct strbuf *err)
> +{
> +	for (; sps->index_count < sps->submodule_names->nr; sps->index_count++) {
> +		struct submodule_status_util *util = sps->submodule_names->items[sps->index_count].util;
> +		struct status_task *task;
> +
> +		if (!verify_submodule_git_directory(util->path))
> +			continue;

So right here, we could use the "check if this submodule should be
ignored" logic form match_stat_with_submodule() to decide whether or not
to spawn the subprocess. IOW, I am advocating for
get_submodules_status() to be a parallel version of
match_stat_with_submodule() (not a parallel version of
is_submodule_modified() that shuttles extra information).

Another sign that this refactor is a good idea is that it lets us
simplify _existing_ submodule logic in run_diff_files(). Prior to this
patch, we have:

      unsigned dirty_submodule = 0;
      ...
			changed = match_stat_with_submodule(&revs->diffopt, ce, &st,
							    ce_option, NULL,
							    &defer_submodule_status,
							    &ignore_untracked);
      // If submodule was deferred, shuttle a bunch of information
      // If not, call record_file_diff()

but the body of match_stat_with_submodule() is just ie_match_stat() +
some additional submodule logic. Post refactor, this would look
something like:

    struct string_list submodules;
    ...
    // For any submodule, just append it to a list and let the
    // parallel thing take care of it.
    if (S_ISGITLINK(ce->ce_mode) {
      // Probably pass .newmode and .ce to the util too...
      string_list_append(submodules, ce->name);
    } else {
      changed = ie_match_stat(foo, bar, baz);
      record_file_diff();
    }
    ...
    if (submodules.nr) {
      parallel_match_stat_with_submodule_wip_name(&submodules);
      for_each_string_list_item(item, &submodules) {
        record_file_diff(&item);
      }
    }

Which I think is easier to follow, since we won't need
defer_submodule_status any more, and we don't shuttle information from
match_stat_with_submodule(). Though I'm a bit unhappy that it's still
pretty coupled to run_diff_files() (it still has to shuttle .newmode,
.ce). Also, I don't think this refactor lets us avoid the refactors we
did in the previous patches.

> +
> +		task = xmalloc(sizeof(*task));
> +		task->path = util->path;
> +		task->ignore_untracked = util->ignore_untracked;
> +		strbuf_init(&task->out, 0);
> +		sps->index_count++;
> +		return task;
> +	}
> +	return NULL;
> +}
> +
> +static int get_next_submodule_status(struct child_process *cp,
> +				     struct strbuf *err, void *data,
> +				     void **task_cb)
> +{
> +	struct submodule_parallel_status *sps = data;
> +	struct status_task *task = get_status_task_from_index(sps, err);

As an aside, I think we can inline get_status_task_from_index(). I
suspect this pattern was copied from get_next_submodule(), which
gets fetch tasks from two different places (hence _from_index and
_from_changed), but here I don't think we will ever get status tasks
from more than one place.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v9 6/6] diff-lib: parallelize run_diff_files for submodules
  2023-03-17  1:09             ` Glen Choo
@ 2023-03-17  2:51               ` Glen Choo
  0 siblings, 0 replies; 86+ messages in thread
From: Glen Choo @ 2023-03-17  2:51 UTC (permalink / raw)
  To: Calvin Wan, git
  Cc: Calvin Wan, avarab, newren, jonathantanmy, phillip.wood123

Glen Choo <chooglen@google.com> writes:

> It would be good if we could avoid mixing unrelated information sources
> in "struct submodule_status_util", since a) this makes it very tightly
> coupled to run_diff_files() and b) it causes us to repeat ourselves in
> the same function (.changed = changed, record_file_diff()).
>
> The only reason why the code looks this way right now is that
> match_stat_with_submodule() sets defer_submodule_status based on whether
> or not we should ignore the submodule, and this eventually tells
> get_submodule_status() what submodules it needs to care about. But,
> deciding whether to spawn a subprocess for which submodule is exactly
> what the .get_next_task member is for.
>
>> diff --git a/submodule.c b/submodule.c
>> index 426074cebb..6f6e150a3f 100644
>> --- a/submodule.c
>> +++ b/submodule.c
>> @@ -1981,6 +1994,121 @@ unsigned is_submodule_modified(const char *path, int ignore_untracked)
>>  	return dirty_submodule;
>>  }
>>  
>> +static struct status_task *
>> +get_status_task_from_index(struct submodule_parallel_status *sps,
>> +			   struct strbuf *err)
>> +{
>> +	for (; sps->index_count < sps->submodule_names->nr; sps->index_count++) {
>> +		struct submodule_status_util *util = sps->submodule_names->items[sps->index_count].util;
>> +		struct status_task *task;
>> +
>> +		if (!verify_submodule_git_directory(util->path))
>> +			continue;
>
> So right here, we could use the "check if this submodule should be
> ignored" logic form match_stat_with_submodule() to decide whether or not
> to spawn the subprocess. IOW, I am advocating for
> get_submodules_status() to be a parallel version of
> match_stat_with_submodule() (not a parallel version of
> is_submodule_modified() that shuttles extra information).

It turns out to be quite difficult to implement a parallel
match_stat_with_submodule():

  a) we can't remove it because it still has another caller
  b) its internals are quite hard to refactor: one conditional arm depends
    on "changed", which is set by calling ie_match_stat(), which in turn
    requires the "struct stat" to have already been lstat()-ed...

So even though this series adds a lot, it is just about as minimally
invasive as possible.

I suspect that there are some possible cleanups down the line, e.g.
is_submodule_modified() is rightfully only called by diff-lib.c , so I
think it should be a static function there. And once we move that, we
can make our parallel function static, and then we don't have to worry
about tight coupling to run_diff_files(). To keep the range-diff
manageable, that can be left for a future cleanup though.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v9 3/6] submodule: move status parsing into function
  2023-03-02 22:02           ` [PATCH v9 3/6] submodule: move status parsing into function Calvin Wan
@ 2023-03-17 20:42             ` Glen Choo
  0 siblings, 0 replies; 86+ messages in thread
From: Glen Choo @ 2023-03-17 20:42 UTC (permalink / raw)
  To: Calvin Wan, git
  Cc: Calvin Wan, avarab, newren, jonathantanmy, phillip.wood123

Calvin Wan <calvinwan@google.com> writes:

> A future patch requires the ability to parse the output of git
> status --porcelain=2. Move parsing code from is_submodule_modified to
> parse_status_porcelain.

If my mental model is correct [1], i.e. that we are implementing a
parallel version of is_submodule_modified(). I think we should be more
explicit in this patch and the next, e.g.:

  In a later patch, we will implement a parallel version of
  is_submodule_modified(). Refactor its "git status --porcelain=2"
  parsing code so that we can reuse it both the parallel and
  non-parallel versions.

If so, then this is pretty much doing the same thing as the next patch,
so if the --color-moved diff isn't too bad, I think we can squash them,
which will make the commit message easier to write too:

  In a later patch, we will implement a parallel version of
  is_submodule_modified(). Refactor its setup and parsing code so that
  we can reuse it both the parallel and non-parallel versions.

  - Setting up the subprocess is moved to prepare_status_porcelain()
  - XYZ is moved to verify_submodule_git_directory()
  - ABC is moved to parse_foobarbaz()

Just an idea. I don't think squashing is necessarily better, but being
explciit that we want a parallel version of is_submodule_modified() will
make this easier to follow.

[1] https://lore.kernel.org/git/kl6ljzzguqss.fsf@chooglen-macbookpro.roam.corp.google.com

^ permalink raw reply	[flat|nested] 86+ messages in thread

end of thread, other threads:[~2023-03-17 20:42 UTC | newest]

Thread overview: 86+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <https://lore.kernel.org/git/20221108184200.2813458-1-calvinwan@google.com/>
2023-01-04 21:54 ` [PATCH v5 0/6] submodule: parallelize diff Calvin Wan
2023-01-05 23:23   ` Calvin Wan
2023-01-17 19:30   ` [PATCH v6 " Calvin Wan
2023-02-07 18:16     ` [PATCH v7 0/7] " Calvin Wan
2023-02-08  0:55       ` Ævar Arnfjörð Bjarmason
2023-02-09  0:02       ` [PATCH v8 0/6] " Calvin Wan
2023-02-09  1:42         ` Ævar Arnfjörð Bjarmason
2023-02-09 19:50         ` Junio C Hamano
2023-02-09 21:52           ` Calvin Wan
2023-02-09 22:25             ` Junio C Hamano
2023-02-10 13:24             ` Ævar Arnfjörð Bjarmason
2023-02-10 17:42               ` Junio C Hamano
2023-02-09 20:50         ` Phillip Wood
2023-03-02 21:52         ` [PATCH v9 " Calvin Wan
2023-03-02 22:02           ` [PATCH v9 1/6] run-command: add on_stderr_output_fn to run_processes_parallel_opts Calvin Wan
2023-03-02 22:02           ` [PATCH v9 2/6] submodule: rename strbuf variable Calvin Wan
2023-03-03  0:25             ` Junio C Hamano
2023-03-06 17:37               ` Calvin Wan
2023-03-06 18:30                 ` Junio C Hamano
2023-03-06 19:00                   ` Calvin Wan
2023-03-02 22:02           ` [PATCH v9 3/6] submodule: move status parsing into function Calvin Wan
2023-03-17 20:42             ` Glen Choo
2023-03-02 22:02           ` [PATCH v9 4/6] submodule: refactor is_submodule_modified() Calvin Wan
2023-03-02 22:02           ` [PATCH v9 5/6] diff-lib: refactor out diff_change logic Calvin Wan
2023-03-02 22:02           ` [PATCH v9 6/6] diff-lib: parallelize run_diff_files for submodules Calvin Wan
2023-03-07  8:41             ` Ævar Arnfjörð Bjarmason
2023-03-07 10:21             ` Ævar Arnfjörð Bjarmason
2023-03-07 17:55               ` Junio C Hamano
2023-03-17  1:09             ` Glen Choo
2023-03-17  2:51               ` Glen Choo
2023-02-09  0:02       ` [PATCH v8 1/6] run-command: add duplicate_output_fn to run_processes_parallel_opts Calvin Wan
2023-02-13  6:34         ` Glen Choo
2023-02-13 17:52           ` Junio C Hamano
2023-02-13 18:26             ` Calvin Wan
2023-02-09  0:02       ` [PATCH v8 2/6] submodule: strbuf variable rename Calvin Wan
2023-02-13  8:37         ` Glen Choo
2023-02-09  0:02       ` [PATCH v8 3/6] submodule: move status parsing into function Calvin Wan
2023-02-09  0:02       ` [PATCH v8 4/6] submodule: refactor is_submodule_modified() Calvin Wan
2023-02-13  7:06         ` Glen Choo
2023-02-09  0:02       ` [PATCH v8 5/6] diff-lib: refactor out diff_change logic Calvin Wan
2023-02-09  1:48         ` Ævar Arnfjörð Bjarmason
2023-02-13  8:42         ` Glen Choo
2023-02-13 18:29           ` Calvin Wan
2023-02-14  4:03             ` Glen Choo
2023-02-09  0:02       ` [PATCH v8 6/6] diff-lib: parallelize run_diff_files for submodules Calvin Wan
2023-02-13  8:36         ` Glen Choo
2023-02-07 18:17     ` [PATCH v7 1/7] run-command: add duplicate_output_fn to run_processes_parallel_opts Calvin Wan
2023-02-07 22:16       ` Ævar Arnfjörð Bjarmason
2023-02-08 22:50         ` Calvin Wan
2023-02-08 14:19       ` Phillip Wood
2023-02-08 22:54         ` Calvin Wan
2023-02-09 20:37           ` Phillip Wood
2023-02-07 18:17     ` [PATCH v7 2/7] submodule: strbuf variable rename Calvin Wan
2023-02-07 22:47       ` Ævar Arnfjörð Bjarmason
2023-02-08 22:59         ` Calvin Wan
2023-02-07 18:17     ` [PATCH v7 3/7] submodule: move status parsing into function Calvin Wan
2023-02-07 18:17     ` [PATCH v7 4/7] submodule: refactor is_submodule_modified() Calvin Wan
2023-02-07 22:59       ` Ævar Arnfjörð Bjarmason
2023-02-07 18:17     ` [PATCH v7 5/7] diff-lib: refactor out diff_change logic Calvin Wan
2023-02-08 14:28       ` Phillip Wood
2023-02-08 23:12         ` Calvin Wan
2023-02-09 20:53           ` Phillip Wood
2023-02-07 18:17     ` [PATCH v7 6/7] diff-lib: refactor match_stat_with_submodule Calvin Wan
2023-02-08  8:18       ` Ævar Arnfjörð Bjarmason
2023-02-08 17:07         ` Phillip Wood
2023-02-08 23:13           ` Calvin Wan
2023-02-08 14:22       ` Phillip Wood
2023-02-07 18:17     ` [PATCH v7 7/7] diff-lib: parallelize run_diff_files for submodules Calvin Wan
2023-02-07 23:06       ` Ævar Arnfjörð Bjarmason
2023-01-17 19:30   ` [PATCH v6 1/6] run-command: add duplicate_output_fn to run_processes_parallel_opts Calvin Wan
2023-01-17 19:30   ` [PATCH v6 2/6] submodule: strbuf variable rename Calvin Wan
2023-01-17 19:30   ` [PATCH v6 3/6] submodule: move status parsing into function Calvin Wan
2023-01-17 19:30   ` [PATCH v6 4/6] diff-lib: refactor match_stat_with_submodule Calvin Wan
2023-01-17 19:30   ` [PATCH v6 5/6] diff-lib: parallelize run_diff_files for submodules Calvin Wan
2023-01-26  9:09     ` Glen Choo
2023-01-26  9:16     ` Glen Choo
2023-01-26 18:52       ` Calvin Wan
2023-01-17 19:30   ` [PATCH v6 6/6] submodule: call parallel code from serial status Calvin Wan
2023-01-26  8:09     ` Glen Choo
2023-01-26  8:45       ` Glen Choo
2023-01-04 21:54 ` [PATCH v5 1/6] run-command: add duplicate_output_fn to run_processes_parallel_opts Calvin Wan
2023-01-04 21:54 ` [PATCH v5 2/6] submodule: strbuf variable rename Calvin Wan
2023-01-04 21:54 ` [PATCH v5 3/6] submodule: move status parsing into function Calvin Wan
2023-01-04 21:54 ` [PATCH v5 4/6] diff-lib: refactor match_stat_with_submodule Calvin Wan
2023-01-04 21:54 ` [PATCH v5 5/6] diff-lib: parallelize run_diff_files for submodules Calvin Wan
2023-01-04 21:54 ` [PATCH v5 6/6] submodule: call parallel code from serial status Calvin Wan

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).