From: Glen Choo <chooglen@google.com>
To: Calvin Wan <calvinwan@google.com>, git@vger.kernel.org
Cc: Calvin Wan <calvinwan@google.com>,
avarab@gmail.com, newren@gmail.com, jonathantanmy@google.com,
phillip.wood123@gmail.com
Subject: Re: [PATCH v9 6/6] diff-lib: parallelize run_diff_files for submodules
Date: Thu, 16 Mar 2023 18:09:39 -0700
Message-ID: <kl6ljzzguqss.fsf@chooglen-macbookpro.roam.corp.google.com>
In-Reply-To: <20230302220251.1474923-6-calvinwan@google.com>
I haven't verified if the code in this version is correct or not, as I
found it a bit difficult to follow through the churn. After reading this
series again, I've established a better mental model of the code, and I
think there are some renames and documentation changes we can make to
make this clearer.
Unfortunately, I think the biggest clarification would be _yet_ another
refactor, and I'm not sure if we actually want to bear so much churn. I
might do this refactor locally to see if it really is _much_ cleaner or
not.
If anyone has thoughts on the refactor, do chime in.
Calvin Wan <calvinwan@google.com> writes:
> diff --git a/diff-lib.c b/diff-lib.c
> index 744ae98a69..7fe6ced950 100644
> --- a/diff-lib.c
> +++ b/diff-lib.c
> @@ -65,26 +66,41 @@ static int check_removed(const struct index_state *istate, const struct cache_en
> * Return 1 when changes are detected, 0 otherwise. If the DIRTY_SUBMODULES
> * option is set, the caller does not only want to know if a submodule is
> * modified at all but wants to know all the conditions that are met (new
> - * commits, untracked content and/or modified content).
> + * commits, untracked content and/or modified content). If
> + * defer_submodule_status bit is set, dirty_submodule will be left to the
> + * caller to set. defer_submodule_status can also be set to 0 in this
> + * function if there is no need to check if the submodule is modified.
> */
> static int match_stat_with_submodule(struct diff_options *diffopt,
> const struct cache_entry *ce,
> struct stat *st, unsigned ce_option,
> - unsigned *dirty_submodule)
> + unsigned *dirty_submodule, int *defer_submodule_status,
> + unsigned *ignore_untracked)
> {
> int changed = ie_match_stat(diffopt->repo->index, ce, st, ce_option);
> + int defer = 0;
> +
> if (S_ISGITLINK(ce->ce_mode)) {
> struct diff_flags orig_flags = diffopt->flags;
> if (!diffopt->flags.override_submodule_config)
> set_diffopt_flags_from_submodule_config(diffopt, ce->name);
> - if (diffopt->flags.ignore_submodules)
> + if (diffopt->flags.ignore_submodules) {
> changed = 0;
> - else if (!diffopt->flags.ignore_dirty_submodules &&
> - (!changed || diffopt->flags.dirty_submodules))
> - *dirty_submodule = is_submodule_modified(ce->name,
> - diffopt->flags.ignore_untracked_in_submodules);
> + } else if (!diffopt->flags.ignore_dirty_submodules &&
> + (!changed || diffopt->flags.dirty_submodules)) {
> + if (defer_submodule_status && *defer_submodule_status) {
> + defer = 1;
> + *ignore_untracked = diffopt->flags.ignore_untracked_in_submodules;
> + } else {
> + *dirty_submodule = is_submodule_modified(ce->name,
> + diffopt->flags.ignore_untracked_in_submodules);
> + }
> + }
> diffopt->flags = orig_flags;
> }
> +
> + if (defer_submodule_status)
> + *defer_submodule_status = defer;
The crux of this patch is that we are replacing some serial operation
with a parallel operation. The replacement happens here, where we are
replacing is_submodule_modified() by 'deferring' it.
So to verify if the parallel implementation is correct, we should
compare the "setup" and "finish" steps in is_submodule_modified() and
get_submodules_status(). Eyeballing it, it looks correct, especially
because we made sure to refactor out the shared logic in previous
patches.
To reflect this, I think it would be clearer to rename
get_submodules_status() to something similar (e.g.
are_submodules_modified_parallel()), with an explicit comment saying
that it is meant to be a parallel implementation of
is_submodule_modified().
Except, I told a little white lie in the previous paragraph, because
get_submodules_status() isn't _just_ a parallel implementation of
is_submodule_modified()...
> @@ -268,13 +286,52 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
> }
>
> changed = match_stat_with_submodule(&revs->diffopt, ce, &st,
> - ce_option, &dirty_submodule);
> + ce_option, NULL,
> + &defer_submodule_status,
> + &ignore_untracked);
> newmode = ce_mode_from_stat(ce, st.st_mode);
> + if (defer_submodule_status) {
> + struct submodule_status_util tmp = {
> + .changed = changed,
> + .dirty_submodule = 0,
> + .ignore_untracked = ignore_untracked,
> + .newmode = newmode,
> + .ce = ce,
> + .path = ce->name,
> + };
> + struct string_list_item *item;
> +
> + item = string_list_append(&submodules, ce->name);
> + item->util = xmalloc(sizeof(tmp));
> + memcpy(item->util, &tmp, sizeof(tmp));
> + continue;
> + }
because get_submodules_status() doesn't just contain the results of
the parallel processes, it is _also_ shuttling "changed" and
"ignore_untracked" from match_stat_with_submodule(), as well as
.newmode, .ce and .path from run_diff_files() (basically everything
except .dirty_submodule)...
> }
>
> - record_file_diff(&revs->diffopt, newmode, dirty_submodule,
> - changed, istate, ce);
> + if (!defer_submodule_status)
> + record_file_diff(&revs->diffopt, newmode, 0,
> + changed,istate, ce);
> + }
> + if (submodules.nr) {
> + unsigned long parallel_jobs;
> + struct string_list_item *item;
> +
> + if (git_config_get_ulong("submodule.diffjobs", &parallel_jobs))
> + parallel_jobs = 1;
> + else if (!parallel_jobs)
> + parallel_jobs = online_cpus();
> +
> + if (get_submodules_status(&submodules, parallel_jobs))
> + die(_("submodule status failed"));
> + for_each_string_list_item(item, &submodules) {
> + struct submodule_status_util *util = item->util;
> +
> + record_file_diff(&revs->diffopt, util->newmode,
> + util->dirty_submodule, util->changed,
> + istate, util->ce);
> + }
so that we can pass all of this back into record_file_diff(). The only
member that is changed by the parallel process is .dirty_submodule,
which is exactly what we would expect from a parallel version of
is_submodule_modified().
If we don't want to do a bigger refactor, I think we should also add
comments to members of "struct submodule_status_util" to document where
they come from and what they are used for.
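A rough sketch of what such member comments could look like is below.
The member names come from the initializer in the hunk above; the member
types are guesses from how the patch uses them, not copied from the
actual header, so treat them as assumptions.

```c
#include <stddef.h>

struct cache_entry; /* opaque here; the real definition lives in git */

/*
 * Sketch only: member types are assumed from how the patch initializes
 * and reads the struct, not taken from the real declaration.
 */
struct submodule_status_util {
	/* Set in run_diff_files() from match_stat_with_submodule()'s return
	 * value; passed through unchanged to record_file_diff(). */
	int changed;
	/* The only member written by the parallel status run; starts at 0. */
	unsigned dirty_submodule;
	/* Set by match_stat_with_submodule() from
	 * diffopt->flags.ignore_untracked_in_submodules; read when the
	 * status task for this submodule is created. */
	unsigned ignore_untracked;
	/* Set in run_diff_files() from ce_mode_from_stat(); passed through
	 * to record_file_diff(). */
	unsigned newmode;
	/* Set in run_diff_files(); passed through to record_file_diff(). */
	const struct cache_entry *ce;
	/* Set in run_diff_files() from ce->name; identifies the submodule
	 * whose status is checked. */
	const char *path;
};
```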
The rest of the comments are refactor-related.
It would be good if we could avoid mixing unrelated information sources
in "struct submodule_status_util", since a) this makes it very tightly
coupled to run_diff_files() and b) it causes us to repeat ourselves in
the same function (.changed = changed, record_file_diff()).
The only reason why the code looks this way right now is that
match_stat_with_submodule() sets defer_submodule_status based on whether
or not we should ignore the submodule, and this eventually tells
get_submodules_status() which submodules it needs to care about. But
deciding which submodules to spawn a subprocess for is exactly what the
.get_next_task member is for.
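To illustrate the idea of filtering inside the "next task" callback
rather than before it, here is a small self-contained model. It does not
use git's run_processes_parallel() API; "struct candidate", "struct
status_queue" and next_status_task() are hypothetical names, and the
"ignored" flag stands in for whatever the submodule config decides.

```c
#include <stddef.h>

/* Hypothetical stand-ins; these are not git's real types. */
struct candidate {
	const char *path;
	int ignored;	/* stand-in for "this submodule should be ignored" */
};

struct status_queue {
	const struct candidate *items;
	size_t nr;
	size_t next;	/* cursor, similar in spirit to sps->index_count */
};

/*
 * Shaped like a get_next_task callback: return the next item that needs
 * a subprocess, or NULL when nothing is left. Ignored entries are
 * skipped here, so the caller can hand over the unfiltered list.
 */
static const struct candidate *next_status_task(struct status_queue *q)
{
	while (q->next < q->nr) {
		const struct candidate *c = &q->items[q->next++];

		if (!c->ignored)
			return c;
	}
	return NULL;
}
```

With the skip decision living in the callback, the caller no longer
needs a separate "defer" flag per entry; it only needs to append every
submodule to the list.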
> diff --git a/submodule.c b/submodule.c
> index 426074cebb..6f6e150a3f 100644
> --- a/submodule.c
> +++ b/submodule.c
> @@ -1981,6 +1994,121 @@ unsigned is_submodule_modified(const char *path, int ignore_untracked)
> return dirty_submodule;
> }
>
> +static struct status_task *
> +get_status_task_from_index(struct submodule_parallel_status *sps,
> + struct strbuf *err)
> +{
> + for (; sps->index_count < sps->submodule_names->nr; sps->index_count++) {
> + struct submodule_status_util *util = sps->submodule_names->items[sps->index_count].util;
> + struct status_task *task;
> +
> + if (!verify_submodule_git_directory(util->path))
> + continue;
So right here, we could use the "check if this submodule should be
ignored" logic from match_stat_with_submodule() to decide whether or not
to spawn the subprocess. IOW, I am advocating for
get_submodules_status() to be a parallel version of
match_stat_with_submodule() (not a parallel version of
is_submodule_modified() that shuttles extra information).
Another sign that this refactor is a good idea is that it lets us
simplify _existing_ submodule logic in run_diff_files(). With this
patch, we have:
unsigned dirty_submodule = 0;
...
changed = match_stat_with_submodule(&revs->diffopt, ce, &st,
ce_option, NULL,
&defer_submodule_status,
&ignore_untracked);
// If submodule was deferred, shuttle a bunch of information
// If not, call record_file_diff()
but the body of match_stat_with_submodule() is just ie_match_stat() +
some additional submodule logic. Post refactor, this would look
something like:
struct string_list submodules;
...
// For any submodule, just append it to a list and let the
// parallel thing take care of it.
if (S_ISGITLINK(ce->ce_mode)) {
// Probably pass .newmode and .ce to the util too...
string_list_append(&submodules, ce->name);
} else {
changed = ie_match_stat(foo, bar, baz);
record_file_diff();
}
...
if (submodules.nr) {
parallel_match_stat_with_submodule_wip_name(&submodules);
for_each_string_list_item(item, &submodules) {
record_file_diff(&item);
}
}
Which I think is easier to follow, since we won't need
defer_submodule_status any more, and we don't shuttle information from
match_stat_with_submodule(). Though I'm a bit unhappy that it's still
pretty coupled to run_diff_files() (it still has to shuttle .newmode,
.ce). Also, I don't think this refactor lets us avoid the refactors we
did in the previous patches.
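For readers who want to run the control flow, below is a toy,
self-contained model of the "append submodules to a list, batch them,
then record" shape. Nothing in it is git code: "struct entry", "struct
deferred", "struct diff_log", record(), batch_status() and
diff_entries() are all made-up names, and batch_status() is a serial
placeholder for the parallel step that simply marks every submodule
dirty.

```c
#include <stddef.h>

#define MAX_ENTRIES 16

/* Toy stand-ins for git's types; every name here is hypothetical. */
struct entry {
	const char *name;
	int is_gitlink;
};

struct deferred {
	const struct entry *e;
	unsigned dirty;	/* filled in by the batch step */
};

struct diff_log {
	const char *names[MAX_ENTRIES];
	unsigned dirty[MAX_ENTRIES];
	size_t nr;
};

static void record(struct diff_log *log, const struct entry *e, unsigned dirty)
{
	log->names[log->nr] = e->name;
	log->dirty[log->nr] = dirty;
	log->nr++;
}

/* Serial placeholder for the parallel run: marks everything dirty. */
static void batch_status(struct deferred *d, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		d[i].dirty = 1;
}

/* Returns 0 on success, -1 if the input does not fit the toy buffers. */
static int diff_entries(const struct entry *entries, size_t nr,
			struct diff_log *log)
{
	struct deferred subs[MAX_ENTRIES];
	size_t nr_subs = 0;

	if (nr > MAX_ENTRIES || log->nr + nr > MAX_ENTRIES)
		return -1;

	for (size_t i = 0; i < nr; i++) {
		if (entries[i].is_gitlink) {
			/* Submodule: defer, no per-entry flag needed. */
			subs[nr_subs].e = &entries[i];
			subs[nr_subs].dirty = 0;
			nr_subs++;
		} else {
			record(log, &entries[i], 0);
		}
	}
	if (nr_subs) {
		batch_status(subs, nr_subs);
		for (size_t i = 0; i < nr_subs; i++)
			record(log, subs[i].e, subs[i].dirty);
	}
	return 0;
}
```

Note that in this toy, deferred entries are recorded after all regular
entries, since they are only recorded once the batch step has finished.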
> +
> + task = xmalloc(sizeof(*task));
> + task->path = util->path;
> + task->ignore_untracked = util->ignore_untracked;
> + strbuf_init(&task->out, 0);
> + sps->index_count++;
> + return task;
> + }
> + return NULL;
> +}
> +
> +static int get_next_submodule_status(struct child_process *cp,
> + struct strbuf *err, void *data,
> + void **task_cb)
> +{
> + struct submodule_parallel_status *sps = data;
> + struct status_task *task = get_status_task_from_index(sps, err);
As an aside, I think we can inline get_status_task_from_index(). I
suspect this pattern was copied from get_next_submodule(), which
gets fetch tasks from two different places (hence _from_index and
_from_changed), but here I don't think we will ever get status tasks
from more than one place.