git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Glen Choo <chooglen@google.com>
To: Calvin Wan <calvinwan@google.com>, git@vger.kernel.org
Cc: Calvin Wan <calvinwan@google.com>,
	avarab@gmail.com, newren@gmail.com, jonathantanmy@google.com,
	phillip.wood123@gmail.com
Subject: Re: [PATCH v9 6/6] diff-lib: parallelize run_diff_files for submodules
Date: Thu, 16 Mar 2023 18:09:39 -0700	[thread overview]
Message-ID: <kl6ljzzguqss.fsf@chooglen-macbookpro.roam.corp.google.com> (raw)
In-Reply-To: <20230302220251.1474923-6-calvinwan@google.com>


I haven't verified if the code in this version is correct or not, as I
found it a bit difficult to follow through the churn. After reading this
series again, I've established a better mental model of the code, and I
think there are some renames and documentation changes we can make to
make this clearer.

Unfortunately, I think the biggest clarification would be _yet_ another
refactor, and I'm not sure if we actually want to bear so much churn. I
might do this refactor locally to see if it really is _much_ cleaner or
not.

If anyone has thoughts on the refactor, do chime in.

Calvin Wan <calvinwan@google.com> writes:

> diff --git a/diff-lib.c b/diff-lib.c
> index 744ae98a69..7fe6ced950 100644
> --- a/diff-lib.c
> +++ b/diff-lib.c
> @@ -65,26 +66,41 @@ static int check_removed(const struct index_state *istate, const struct cache_en
>   * Return 1 when changes are detected, 0 otherwise. If the DIRTY_SUBMODULES
>   * option is set, the caller does not only want to know if a submodule is
>   * modified at all but wants to know all the conditions that are met (new
> - * commits, untracked content and/or modified content).
> + * commits, untracked content and/or modified content). If
> + * defer_submodule_status bit is set, dirty_submodule will be left to the
> + * caller to set. defer_submodule_status can also be set to 0 in this
> + * function if there is no need to check if the submodule is modified.
>   */
>  static int match_stat_with_submodule(struct diff_options *diffopt,
>  				     const struct cache_entry *ce,
>  				     struct stat *st, unsigned ce_option,
> -				     unsigned *dirty_submodule)
> +				     unsigned *dirty_submodule, int *defer_submodule_status,
> +				     unsigned *ignore_untracked)
>  {
>  	int changed = ie_match_stat(diffopt->repo->index, ce, st, ce_option);
> +	int defer = 0;
> +
>  	if (S_ISGITLINK(ce->ce_mode)) {
>  		struct diff_flags orig_flags = diffopt->flags;
>  		if (!diffopt->flags.override_submodule_config)
>  			set_diffopt_flags_from_submodule_config(diffopt, ce->name);
> -		if (diffopt->flags.ignore_submodules)
> +		if (diffopt->flags.ignore_submodules) {
>  			changed = 0;
> -		else if (!diffopt->flags.ignore_dirty_submodules &&
> -			 (!changed || diffopt->flags.dirty_submodules))
> -			*dirty_submodule = is_submodule_modified(ce->name,
> -								 diffopt->flags.ignore_untracked_in_submodules);
> +		} else if (!diffopt->flags.ignore_dirty_submodules &&
> +			   (!changed || diffopt->flags.dirty_submodules)) {
> +			if (defer_submodule_status && *defer_submodule_status) {
> +				defer = 1;
> +				*ignore_untracked = diffopt->flags.ignore_untracked_in_submodules;
> +			} else {
> +				*dirty_submodule = is_submodule_modified(ce->name,
> +					 diffopt->flags.ignore_untracked_in_submodules);
> +			}
> +		}
>  		diffopt->flags = orig_flags;
>  	}
> +
> +	if (defer_submodule_status)
> +		*defer_submodule_status = defer;

The crux of this patch is that we are replacing some serial operation
with a parallel operation. The replacement happens here, where we are
replacing is_submodule_modified() by 'deferring' it.

So to verify if the parallel implementation is correct, we should
compare the "setup" and "finish" steps in is_submodule_modified() and
get_submodules_status(). Eyeballing it, it looks correct, especially
because we made sure to refactor out the shared logic in previous
patches.

To reflect this, I think it would be clearer to rename
get_submodules_status() to something similar (e.g.
are_submodules_modified_parallel()), with an explicit comment saying
that it is meant to be a parallel implementation of
is_submodule_modified().

Except, I told a little white lie in the previous paragraph, because
get_submodules_status() isn't _just_ a parallel implementation of
is_submodule_modified()...

> @@ -268,13 +286,52 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
>  			}
>  
>  			changed = match_stat_with_submodule(&revs->diffopt, ce, &st,
> -							    ce_option, &dirty_submodule);
> +							    ce_option, NULL,
> +							    &defer_submodule_status,
> +							    &ignore_untracked);
>  			newmode = ce_mode_from_stat(ce, st.st_mode);
> +			if (defer_submodule_status) {
> +				struct submodule_status_util tmp = {
> +					.changed = changed,
> +					.dirty_submodule = 0,
> +					.ignore_untracked = ignore_untracked,
> +					.newmode = newmode,
> +					.ce = ce,
> +					.path = ce->name,
> +				};
> +				struct string_list_item *item;
> +
> +				item = string_list_append(&submodules, ce->name);
> +				item->util = xmalloc(sizeof(tmp));
> +				memcpy(item->util, &tmp, sizeof(tmp));
> +				continue;
> +			}

because get_submodules_status() doesn't just contain the results of
the parallel processes, it is _also_ shuttling "changed" and
"ignore_untracked" from match_stat_with_submodule(), as well as
.newmode, .ce and .path from run_diff_files() (basically everything
except .dirty_submodule)...

>  		}
>  
> -		record_file_diff(&revs->diffopt, newmode, dirty_submodule,
> -				 changed, istate, ce);
> +		if (!defer_submodule_status)
> +			record_file_diff(&revs->diffopt, newmode, 0,
> +					   changed,istate, ce);
> +	}
> +	if (submodules.nr) {
> +		unsigned long parallel_jobs;
> +		struct string_list_item *item;
> +
> +		if (git_config_get_ulong("submodule.diffjobs", &parallel_jobs))
> +			parallel_jobs = 1;
> +		else if (!parallel_jobs)
> +			parallel_jobs = online_cpus();
> +
> +		if (get_submodules_status(&submodules, parallel_jobs))
> +			die(_("submodule status failed"));
> +		for_each_string_list_item(item, &submodules) {
> +			struct submodule_status_util *util = item->util;
> +
> +			record_file_diff(&revs->diffopt, util->newmode,
> +					 util->dirty_submodule, util->changed,
> +					 istate, util->ce);
> +		}

so that we can pass all of this back into record_file_diff(). The only
member that is changed by the parallel process is .dirty_submodule,
which is exactly what we would expect from a parallel version of
is_submodule_modified().

If we don't want to do a bigger refactor, I think we should also add
comments to members of "struct submodule_status_util" to document where
they come from and what they are used for.

The rest of the comments are refactor-related.

It would be good if we could avoid mixing unrelated information sources
in "struct submodule_status_util", since a) this makes it very tightly
coupled to run_diff_files() and b) it causes us to repeat ourselves in
the same function (.changed = changed, record_file_diff()).

The only reason why the code looks this way right now is that
match_stat_with_submodule() sets defer_submodule_status based on whether
or not we should ignore the submodule, and this eventually tells
get_submodule_status() what submodules it needs to care about. But,
deciding whether to spawn a subprocess for which submodule is exactly
what the .get_next_task member is for.

> diff --git a/submodule.c b/submodule.c
> index 426074cebb..6f6e150a3f 100644
> --- a/submodule.c
> +++ b/submodule.c
> @@ -1981,6 +1994,121 @@ unsigned is_submodule_modified(const char *path, int ignore_untracked)
>  	return dirty_submodule;
>  }
>  
> +static struct status_task *
> +get_status_task_from_index(struct submodule_parallel_status *sps,
> +			   struct strbuf *err)
> +{
> +	for (; sps->index_count < sps->submodule_names->nr; sps->index_count++) {
> +		struct submodule_status_util *util = sps->submodule_names->items[sps->index_count].util;
> +		struct status_task *task;
> +
> +		if (!verify_submodule_git_directory(util->path))
> +			continue;

So right here, we could use the "check if this submodule should be
ignored" logic form match_stat_with_submodule() to decide whether or not
to spawn the subprocess. IOW, I am advocating for
get_submodules_status() to be a parallel version of
match_stat_with_submodule() (not a parallel version of
is_submodule_modified() that shuttles extra information).

Another sign that this refactor is a good idea is that it lets us
simplify _existing_ submodule logic in run_diff_files(). Prior to this
patch, we have:

      unsigned dirty_submodule = 0;
      ...
			changed = match_stat_with_submodule(&revs->diffopt, ce, &st,
							    ce_option, NULL,
							    &defer_submodule_status,
							    &ignore_untracked);
      // If submodule was deferred, shuttle a bunch of information
      // If not, call record_file_diff()

but the body of match_stat_with_submodule() is just ie_match_stat() +
some additional submodule logic. Post refactor, this would look
something like:

    struct string_list submodules;
    ...
    // For any submodule, just append it to a list and let the
    // parallel thing take care of it.
    if (S_ISGITLINK(ce->ce_mode) {
      // Probably pass .newmode and .ce to the util too...
      string_list_append(submodules, ce->name);
    } else {
      changed = ie_match_stat(foo, bar, baz);
      record_file_diff();
    }
    ...
    if (submodules.nr) {
      parallel_match_stat_with_submodule_wip_name(&submodules);
      for_each_string_list_item(item, &submodules) {
        record_file_diff(&item);
      }
    }

Which I think is easier to follow, since we won't need
defer_submodule_status any more, and we don't shuttle information from
match_stat_with_submodule(). Though I'm a bit unhappy that it's still
pretty coupled to run_diff_files() (it still has to shuttle .newmode,
.ce). Also, I don't think this refactor lets us avoid the refactors we
did in the previous patches.

> +
> +		task = xmalloc(sizeof(*task));
> +		task->path = util->path;
> +		task->ignore_untracked = util->ignore_untracked;
> +		strbuf_init(&task->out, 0);
> +		sps->index_count++;
> +		return task;
> +	}
> +	return NULL;
> +}
> +
> +static int get_next_submodule_status(struct child_process *cp,
> +				     struct strbuf *err, void *data,
> +				     void **task_cb)
> +{
> +	struct submodule_parallel_status *sps = data;
> +	struct status_task *task = get_status_task_from_index(sps, err);

As an aside, I think we can inline get_status_task_from_index(). I
suspect this pattern was copied from get_next_submodule(), which
gets fetch tasks from two different places (hence _from_index and
_from_changed), but here I don't think we will ever get status tasks
from more than one place.

  parent reply	other threads:[~2023-03-17  1:09 UTC|newest]

Thread overview: 86+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <https://lore.kernel.org/git/20221108184200.2813458-1-calvinwan@google.com/>
2023-01-04 21:54 ` [PATCH v5 0/6] submodule: parallelize diff Calvin Wan
2023-01-05 23:23   ` Calvin Wan
2023-01-17 19:30   ` [PATCH v6 " Calvin Wan
2023-02-07 18:16     ` [PATCH v7 0/7] " Calvin Wan
2023-02-08  0:55       ` Ævar Arnfjörð Bjarmason
2023-02-09  0:02       ` [PATCH v8 0/6] " Calvin Wan
2023-02-09  1:42         ` Ævar Arnfjörð Bjarmason
2023-02-09 19:50         ` Junio C Hamano
2023-02-09 21:52           ` Calvin Wan
2023-02-09 22:25             ` Junio C Hamano
2023-02-10 13:24             ` Ævar Arnfjörð Bjarmason
2023-02-10 17:42               ` Junio C Hamano
2023-02-09 20:50         ` Phillip Wood
2023-03-02 21:52         ` [PATCH v9 " Calvin Wan
2023-03-02 22:02           ` [PATCH v9 1/6] run-command: add on_stderr_output_fn to run_processes_parallel_opts Calvin Wan
2023-03-02 22:02           ` [PATCH v9 2/6] submodule: rename strbuf variable Calvin Wan
2023-03-03  0:25             ` Junio C Hamano
2023-03-06 17:37               ` Calvin Wan
2023-03-06 18:30                 ` Junio C Hamano
2023-03-06 19:00                   ` Calvin Wan
2023-03-02 22:02           ` [PATCH v9 3/6] submodule: move status parsing into function Calvin Wan
2023-03-17 20:42             ` Glen Choo
2023-03-02 22:02           ` [PATCH v9 4/6] submodule: refactor is_submodule_modified() Calvin Wan
2023-03-02 22:02           ` [PATCH v9 5/6] diff-lib: refactor out diff_change logic Calvin Wan
2023-03-02 22:02           ` [PATCH v9 6/6] diff-lib: parallelize run_diff_files for submodules Calvin Wan
2023-03-07  8:41             ` Ævar Arnfjörð Bjarmason
2023-03-07 10:21             ` Ævar Arnfjörð Bjarmason
2023-03-07 17:55               ` Junio C Hamano
2023-03-17  1:09             ` Glen Choo [this message]
2023-03-17  2:51               ` Glen Choo
2023-02-09  0:02       ` [PATCH v8 1/6] run-command: add duplicate_output_fn to run_processes_parallel_opts Calvin Wan
2023-02-13  6:34         ` Glen Choo
2023-02-13 17:52           ` Junio C Hamano
2023-02-13 18:26             ` Calvin Wan
2023-02-09  0:02       ` [PATCH v8 2/6] submodule: strbuf variable rename Calvin Wan
2023-02-13  8:37         ` Glen Choo
2023-02-09  0:02       ` [PATCH v8 3/6] submodule: move status parsing into function Calvin Wan
2023-02-09  0:02       ` [PATCH v8 4/6] submodule: refactor is_submodule_modified() Calvin Wan
2023-02-13  7:06         ` Glen Choo
2023-02-09  0:02       ` [PATCH v8 5/6] diff-lib: refactor out diff_change logic Calvin Wan
2023-02-09  1:48         ` Ævar Arnfjörð Bjarmason
2023-02-13  8:42         ` Glen Choo
2023-02-13 18:29           ` Calvin Wan
2023-02-14  4:03             ` Glen Choo
2023-02-09  0:02       ` [PATCH v8 6/6] diff-lib: parallelize run_diff_files for submodules Calvin Wan
2023-02-13  8:36         ` Glen Choo
2023-02-07 18:17     ` [PATCH v7 1/7] run-command: add duplicate_output_fn to run_processes_parallel_opts Calvin Wan
2023-02-07 22:16       ` Ævar Arnfjörð Bjarmason
2023-02-08 22:50         ` Calvin Wan
2023-02-08 14:19       ` Phillip Wood
2023-02-08 22:54         ` Calvin Wan
2023-02-09 20:37           ` Phillip Wood
2023-02-07 18:17     ` [PATCH v7 2/7] submodule: strbuf variable rename Calvin Wan
2023-02-07 22:47       ` Ævar Arnfjörð Bjarmason
2023-02-08 22:59         ` Calvin Wan
2023-02-07 18:17     ` [PATCH v7 3/7] submodule: move status parsing into function Calvin Wan
2023-02-07 18:17     ` [PATCH v7 4/7] submodule: refactor is_submodule_modified() Calvin Wan
2023-02-07 22:59       ` Ævar Arnfjörð Bjarmason
2023-02-07 18:17     ` [PATCH v7 5/7] diff-lib: refactor out diff_change logic Calvin Wan
2023-02-08 14:28       ` Phillip Wood
2023-02-08 23:12         ` Calvin Wan
2023-02-09 20:53           ` Phillip Wood
2023-02-07 18:17     ` [PATCH v7 6/7] diff-lib: refactor match_stat_with_submodule Calvin Wan
2023-02-08  8:18       ` Ævar Arnfjörð Bjarmason
2023-02-08 17:07         ` Phillip Wood
2023-02-08 23:13           ` Calvin Wan
2023-02-08 14:22       ` Phillip Wood
2023-02-07 18:17     ` [PATCH v7 7/7] diff-lib: parallelize run_diff_files for submodules Calvin Wan
2023-02-07 23:06       ` Ævar Arnfjörð Bjarmason
2023-01-17 19:30   ` [PATCH v6 1/6] run-command: add duplicate_output_fn to run_processes_parallel_opts Calvin Wan
2023-01-17 19:30   ` [PATCH v6 2/6] submodule: strbuf variable rename Calvin Wan
2023-01-17 19:30   ` [PATCH v6 3/6] submodule: move status parsing into function Calvin Wan
2023-01-17 19:30   ` [PATCH v6 4/6] diff-lib: refactor match_stat_with_submodule Calvin Wan
2023-01-17 19:30   ` [PATCH v6 5/6] diff-lib: parallelize run_diff_files for submodules Calvin Wan
2023-01-26  9:09     ` Glen Choo
2023-01-26  9:16     ` Glen Choo
2023-01-26 18:52       ` Calvin Wan
2023-01-17 19:30   ` [PATCH v6 6/6] submodule: call parallel code from serial status Calvin Wan
2023-01-26  8:09     ` Glen Choo
2023-01-26  8:45       ` Glen Choo
2023-01-04 21:54 ` [PATCH v5 1/6] run-command: add duplicate_output_fn to run_processes_parallel_opts Calvin Wan
2023-01-04 21:54 ` [PATCH v5 2/6] submodule: strbuf variable rename Calvin Wan
2023-01-04 21:54 ` [PATCH v5 3/6] submodule: move status parsing into function Calvin Wan
2023-01-04 21:54 ` [PATCH v5 4/6] diff-lib: refactor match_stat_with_submodule Calvin Wan
2023-01-04 21:54 ` [PATCH v5 5/6] diff-lib: parallelize run_diff_files for submodules Calvin Wan
2023-01-04 21:54 ` [PATCH v5 6/6] submodule: call parallel code from serial status Calvin Wan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=kl6ljzzguqss.fsf@chooglen-macbookpro.roam.corp.google.com \
    --to=chooglen@google.com \
    --cc=avarab@gmail.com \
    --cc=calvinwan@google.com \
    --cc=git@vger.kernel.org \
    --cc=jonathantanmy@google.com \
    --cc=newren@gmail.com \
    --cc=phillip.wood123@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).