Re: [PATCH 4/4] diff-lib: parallelize run_diff_files for submodules

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

From: Junio C Hamano <gitster@pobox.com>
To: Calvin Wan <calvinwan@google.com>
Cc: git@vger.kernel.org, emilyshaffer@google.com
Subject: Re: [PATCH 4/4] diff-lib: parallelize run_diff_files for submodules
Date: Fri, 23 Sep 2022 14:44:57 -0700	[thread overview]
Message-ID: <xmqqy1u9sr3a.fsf@gitster.g> (raw)
In-Reply-To: <20220922232947.631309-5-calvinwan@google.com> (Calvin Wan's message of "Thu, 22 Sep 2022 23:29:47 +0000")

Calvin Wan <calvinwan@google.com> writes:

> During the iteration of the index entries in run_diff_files, whenever
> a submodule is found and needs its status checked, a subprocess is
> spawned for it. Instead of spawning the subprocess immediately and
> waiting for its completion to continue, hold onto all submodules and
> relevant information in a list. Then use that list to create tasks for
> run_processes_parallel. Finished subprocesses pipe their output to
> status_finish which parses it and sets the relevant variables.

Excellent---I like the idea very much.

You make it sound as if this is only for "git status", but shouldn't
it benefit the usual "git diff" the same way when you have
submodules that can be dirty?

> Add config option status.parallelSubmodules to set the maximum number
> of parallel jobs. 

Configuration is fine, but I am having a hard time justifying the
addition of an extra parameter to run_diff_files().  It might be
more palatable to give a new bit (default off) in the rev structure
that tells it that it is OK to go into the "defer_submodule_status"
mode, if we absolutely want to change the behaviour of
run_diff_files() only for a single caller or something, but I doubt
it should even be needed.

I cannot think of a sane user interface that says "when
run_diff_files() is invoked as an implementation detail of 'status',
use this value, and when it is running for another command 'foo',
use this other value".  How would a user decide to pick a different
value for different commands?

Letting a single configuration variable to decide the degree of
parallelism inside run_diff_files() would be sufficient, and we
shouldn't have to touch all the existing calling sites of
run_diff_files() all over the place.  If you absolutely want to do
this, I'd rather see the new member for the configuration variable
not in wt_status but in rev_info.

> +status.parallelSubmodules::
> +	When linkgit:git-status[1] is run in a superproject with
> +	submodules, a status subprocess is spawned for every submodule.
> +	This option specifies the number of submodule status subprocesses
> +	to run in parallel. If unset, it defaults to 1.

As I said, I am not sure <cmd>.parallelSubmodules per command makes
much sense.  "If unset, it defaults to" sounds a bit redundant (if
it is explicitly set, the default value should not matter).

> @@ -83,11 +88,20 @@ static int match_stat_with_submodule(struct diff_options *diffopt,
>  		goto cleanup;
>  	}
>  	if (!diffopt->flags.ignore_dirty_submodules &&
> -		(!changed || diffopt->flags.dirty_submodules))
> +		(!changed || diffopt->flags.dirty_submodules)) {
> +		if (defer_submodule_status && *defer_submodule_status) {
> +			defer = 1;
> +			*ignore_untracked_in_submodules =
> +							diffopt->flags.ignore_untracked_in_submodules;
> +		} else {
>  			*dirty_submodule = is_submodule_modified(ce->name,
>  							diffopt->flags.ignore_untracked_in_submodules);
> +		}
> +	}

OK, so the caller can look at *defer

>  cleanup:
>  	diffopt->flags = orig_flags;
> +	if (defer_submodule_status)
> +		*defer_submodule_status = defer;
>  	return changed;
>  }
>  
> @@ -117,7 +131,7 @@ static void finish_run_diff_files(struct rev_info *revs,
>  			    ce->name, 0, dirty_submodule);
>  }
>  
> -int run_diff_files(struct rev_info *revs, unsigned int option)
> +int run_diff_files(struct rev_info *revs, unsigned int option, int parallel_jobs)
>  {
>  	int entries, i;
>  	int diff_unmerged_stage = revs->max_count;
> @@ -125,6 +139,7 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
>  			      ? CE_MATCH_RACY_IS_DIRTY : 0);
>  	uint64_t start = getnanotime();
>  	struct index_state *istate = revs->diffopt.repo->index;
> +	struct string_list submodules = STRING_LIST_INIT_NODUP;
>  
>  	diff_set_mnemonic_prefix(&revs->diffopt, "i/", "w/");
>  
> @@ -138,6 +153,9 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
>  		struct cache_entry *ce = istate->cache[i];
>  		int changed;
>  		unsigned dirty_submodule = 0;
> +		int defer_submodule_status = revs && revs->repo &&
> +							!strcmp(revs->repo->gitdir, ".git");

What is this overly deeply indented comparison with ".git" doing?
What are we checking and why?  Is that something we need to do for
each and every active_cache[] entry, or isn't it constant over the
loop?

> +		int ignore_untracked_in_submodules;
>  
>  		if (diff_can_quit_early(&revs->diffopt))
>  			break;
> @@ -269,11 +287,36 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
>  			}
>  
>  			changed = match_stat_with_submodule(&revs->diffopt, ce, &st,
> +							ce_option, &dirty_submodule,
> +							&defer_submodule_status,
> +							&ignore_untracked_in_submodules);
>  			newmode = ce_mode_from_stat(ce, st.st_mode);
> +			if (defer_submodule_status) {
> +				struct string_list_item *item =
> +								string_list_append(&submodules, ce->name);
> +				struct submodule_status_util *util = xmalloc(sizeof(*util));
> +				util->changed = changed;
> +				util->dirty_submodule = 0;
> +				util->ignore_untracked = ignore_untracked_in_submodules;
> +				util->newmode = newmode;
> +				util->ce = ce;
> +				item->util = util;
> +				continue;

This makes me wonder if defer_submodule_status should be a string
list that receives the punted submodules---iow, don't these details
belong to match_stat_with_submodule() rather than its caller here?

Even better may be to start a new child task for the submodule here,
letting it work while the parent process moves on to the next entry
in the active_cache[].  I do not know if you thought about doing it
that way.

I dunno.

> +			}
>  		}
>  		finish_run_diff_files(revs, ce, istate, changed, dirty_submodule, newmode);
>  	}
> +	if (submodules.nr > 0) {
> +		if (get_submodules_status(revs->repo, &submodules,
> +						parallel_jobs > 0 ? parallel_jobs : 1))
> +			BUG("Submodule status failed");
> +		for (int i = 0; i < submodules.nr; i++) {

We still do not allow "for (type var = init; ...)" if I am not
mistaken.  Check the coding guidelines.

next prev parent reply	other threads:[~2022-09-23 21:45 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-09-22 23:29 [PATCH 0/4] submodule: parallelize status Calvin Wan
2022-09-22 23:29 ` [PATCH 1/4] run-command: add pipe_output to run_processes_parallel Calvin Wan
2022-09-23  7:52   ` Ævar Arnfjörð Bjarmason
2022-09-26 16:59     ` Calvin Wan
2022-09-27 10:52       ` Ævar Arnfjörð Bjarmason
2022-09-23 18:58   ` Junio C Hamano
2022-09-26 17:31     ` Calvin Wan
2022-09-27  4:45       ` Junio C Hamano
2022-09-27 18:10         ` Calvin Wan
2022-09-27 21:40           ` Junio C Hamano
2022-09-27  9:05       ` Ævar Arnfjörð Bjarmason
2022-09-27 17:55         ` Calvin Wan
2022-09-27 19:34           ` Ævar Arnfjörð Bjarmason
2022-09-27 20:45             ` Calvin Wan
2022-09-28  5:40               ` Ævar Arnfjörð Bjarmason
2022-09-29 20:52                 ` Calvin Wan
2022-09-22 23:29 ` [PATCH 2/4] submodule: move status parsing into function Calvin Wan
2022-09-22 23:29 ` [PATCH 3/4] diff-lib: refactor functions Calvin Wan
2022-09-23 20:36   ` Junio C Hamano
2022-09-26 17:35     ` Calvin Wan
2022-09-22 23:29 ` [PATCH 4/4] diff-lib: parallelize run_diff_files for submodules Calvin Wan
2022-09-23  8:06   ` Ævar Arnfjörð Bjarmason
2022-09-24 20:17     ` Junio C Hamano
2022-09-26 17:50     ` Calvin Wan
2022-09-23 21:44   ` Junio C Hamano [this message]
2022-09-26 19:12     ` Calvin Wan
2022-09-25 13:59   ` Phillip Wood
2022-09-26 17:11     ` Junio C Hamano
2022-09-26 19:22     ` Calvin Wan
2022-09-27 18:40   ` Emily Shaffer
2022-09-23 22:56 ` [PATCH 0/4] submodule: parallelize status Junio C Hamano
2022-09-26 16:33   ` Calvin Wan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=xmqqy1u9sr3a.fsf@gitster.g \
    --to=gitster@pobox.com \
    --cc=calvinwan@google.com \
    --cc=emilyshaffer@google.com \
    --cc=git@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).