git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Derrick Stolee <stolee@gmail.com>
To: Elijah Newren via GitGitGadget <gitgitgadget@gmail.com>,
	git@vger.kernel.org
Cc: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"Elijah Newren" <newren@gmail.com>
Subject: Re: [PATCH v2 3/7] merge-ort: add data structures for allowable trivial directory resolves
Date: Thu, 15 Jul 2021 09:54:26 -0400	[thread overview]
Message-ID: <f8d4eef7-9768-571f-2fb9-76284d69108f@gmail.com> (raw)
In-Reply-To: <f7ac01055d9d2e9e2dfdfd780ff7f10fbfd05d5b.1626204784.git.gitgitgadget@gmail.com>

On 7/13/2021 3:32 PM, Elijah Newren via GitGitGadget wrote:
> From: Elijah Newren <newren@gmail.com>

The first two patches were easy to follow. This one requires a more
careful read, so here is my attempt to double-check that I am
reading it right.

> In commit f89b4f2bee ("merge-ort: skip rename detection entirely if
> possible", 2021-03-11), we added an additional reason that rename
> detection could be skipped entirely -- namely, if no *relevant* sources
> were present.  Without completing collect_merge_info_callback(), we do
> not yet know if there are no relevant sources.  However, we do know that
> if the current directory on one side matches the merge base, then every
> source file within that directory will not be RELEVANT_CONTENT, and a
> few simple checks can often let us rule out RELEVANT_LOCATION as well.
> This suggests we can just defer recursing into such directories until
> the end of collect_merge_info.

We are not making early decisions, but collecting a list of directories
that cannot be relevant sources. These directories could still be
relevant locations (or content), but we don't want to pay the cost of
recursing right away.

> Since the deferred directories are known to not add any relevant sources
> due to the above properties, then if there are no relevant sources after
> we've traversed all paths other than the deferred ones, then we know
> there are not any relevant sources.  Under those conditions, rename
> detection is unnecessary, and that means we can resolve the deferred
> directories without recursing into them.

Only in the case where there are no relevant sources at the end of
scanning all directories (possibly non-recursively), then we can skip
recursing into these directories because finding relevant locations
is wasted effort now.

This means the performance benefit will be limited to cases where
rename detection isn't needed at all, but that is a frequent enough
case to merit the overhead of tracking this information differently.

> Note that the logic for skipping rename detection was also modified
> further in commit 76e253793c ("merge-ort, diffcore-rename: employ cached
> renames when possible", 2021-01-30); in particular rename detection can
> be skipped if we already have cached renames for each relevant source.
> We can take advantage of this information as well with our deferral of
> recursing into directories where one side matches the merge base.
> 
> Add some data structures that we will use to do these deferrals, with
> some lengthy comments explaining their purpose.

And now that the idea is understood, we are just laying the groundwork
for the process by creating the tracking data structures, not actually
the algorithm to use them. Makes sense to split them up.

> +	/*
> +	 * possible_trivial_merges: directories we defer recursing into

Not only are we deferring the recursion, we might be able to avoid
it entirely. Perhaps "directories to be explored only when needed"
would emphasize this point differently?

> +	 *
> +	 * possible_trivial_merges is a map of directory names to
> +	 * dir_rename_mask.  When we detect that a directory is unchanged on
> +	 * one side, we can sometimes resolve the directory without recursing
> +	 * into it.  Renames are the only things that can prevent such an
> +	 * optimization.  However, for rename sources:
> +	 *   - If no parent directory needed directory rename detection, then
> +	 *     no path under such a directory can be a relevant_source.
> +	 * and for rename destinations:
> +	 *   - If no cached rename has a target path under the directory AND
> +	 *   - If there are no unpaired relevant_sources elsewhere in the
> +	 *     repository
> +	 * then we don't need any path under this directory for a rename
> +	 * destination.  The only way to know the last item above is to defer
> +	 * handling such directories until the end of collect_merge_info(),
> +	 * in handle_deferred_entries().
> +	 *
> +	 * For each we store dir_rename_mask, since that's the only bit of
> +	 * information we need, other than the path, to resume the recursive
> +	 * traversal.
> +	 */
> +	struct strintmap possible_trivial_merges[3];
> +
> +	/*
> +	 * trivial_merges_okay: if trivial directory merges are okay
> +	 *
> +	 * See possible_trivial_merges above.  The "no unpaired
> +	 * relevant_sources elsewhere in the repository" is a single boolean
> +	 * per merge side, which we store here.  Note that while 0 means no,
> +	 * 1 only means "maybe" rather than "yes"; we optimistically set it
> +	 * to 1 initially and only clear when we determine it is unsafe to
> +	 * do trivial directory merges.
> +	 */
> +	unsigned trivial_merges_okay[3];
> +
> +	/*
> +	 * target_dirs: ancestor directories of rename targets
> +	 *
> +	 * target_dirs contains all directory names that are an ancestor of
> +	 * any rename destination.
> +	 */
> +	struct strset target_dirs[3];

These three members seem tightly coupled. Would it be beneficial to
group the data into a 'struct trivial_merge_data' that could then
have an array of three here?

>  	/*
>  	 * dir_rename_mask:
>  	 *   0: optimization removing unmodified potential rename source okay
> @@ -490,6 +535,9 @@ static void clear_or_reinit_internal_opts(struct merge_options_internal *opti,
>  		strintmap_func(&renames->dirs_removed[i]);
>  		strmap_func(&renames->dir_renames[i], 0);
>  		strintmap_func(&renames->relevant_sources[i]);
> +		strintmap_func(&renames->possible_trivial_merges[i]);
> +		strset_func(&renames->target_dirs[i]);
> +		renames->trivial_merges_okay[i] = 1; /* 1 == maybe */

This could be collected into an initializer method for the data
required by this optimization instead of three initializations
among many.

>  		if (!reinitialize)
>  			assert(renames->cached_pairs_valid_side == 0);
>  		if (i != renames->cached_pairs_valid_side) {
> @@ -4045,12 +4093,17 @@ static void merge_start(struct merge_options *opt, struct merge_result *result)
>  		strintmap_init_with_options(&renames->relevant_sources[i],
>  					    -1 /* explicitly invalid */,
>  					    NULL, 0);
> +		strintmap_init_with_options(&renames->possible_trivial_merges[i],
> +					    0, NULL, 0);
> +		strset_init_with_options(&renames->target_dirs[i],
> +					 NULL, 1);
>  		strmap_init_with_options(&renames->cached_pairs[i],
>  					 NULL, 1);
>  		strset_init_with_options(&renames->cached_irrelevant[i],
>  					 NULL, 1);
>  		strset_init_with_options(&renames->cached_target_names[i],
>  					 NULL, 0);
> +		renames->trivial_merges_okay[i] = 1; /* 1 == maybe */

Here, there is even a separation between the grouped data. If you
don't like the idea of creating helper methods, then at least these
new lines could be grouped together. Bonus points for adding
whitespace between these lines and the others.

Thanks,
-Stolee

  reply	other threads:[~2021-07-15 13:54 UTC|newest]

Thread overview: 52+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-07-01  3:46 [PATCH 0/7] Optimization batch 14: trivial directory resolution Elijah Newren via GitGitGadget
2021-07-01  3:46 ` [PATCH 1/7] merge-ort: resolve paths early when we have sufficient information Elijah Newren via GitGitGadget
2021-07-01  3:46 ` [PATCH 2/7] merge-ort: add some more explanations in collect_merge_info_callback() Elijah Newren via GitGitGadget
2021-07-01  3:46 ` [PATCH 3/7] merge-ort: add data structures for allowable trivial directory resolves Elijah Newren via GitGitGadget
2021-07-01  3:46 ` [PATCH 4/7] merge-ort: add a handle_deferred_entries() helper function Elijah Newren via GitGitGadget
2021-07-01  3:46 ` [PATCH 5/7] merge-ort: defer recursing into directories when merge base is matched Elijah Newren via GitGitGadget
2021-07-01  3:46 ` [PATCH 6/7] merge-ort: avoid recursing into directories when we don't need to Elijah Newren via GitGitGadget
2021-07-01  3:46 ` [PATCH 7/7] merge-ort: restart merge with cached renames to reduce process entry cost Elijah Newren via GitGitGadget
2021-07-01 13:21 ` [PATCH 0/7] Optimization batch 14: trivial directory resolution Ævar Arnfjörð Bjarmason
2021-07-01 15:04   ` Elijah Newren
2021-07-01 19:22     ` Elijah Newren
2021-07-13 19:32 ` [PATCH v2 " Elijah Newren via GitGitGadget
2021-07-13 19:32   ` [PATCH v2 1/7] merge-ort: resolve paths early when we have sufficient information Elijah Newren via GitGitGadget
2021-07-13 19:32   ` [PATCH v2 2/7] merge-ort: add some more explanations in collect_merge_info_callback() Elijah Newren via GitGitGadget
2021-07-13 23:34     ` Bagas Sanjaya
2021-07-14  0:19       ` Elijah Newren
2021-07-13 19:32   ` [PATCH v2 3/7] merge-ort: add data structures for allowable trivial directory resolves Elijah Newren via GitGitGadget
2021-07-15 13:54     ` Derrick Stolee [this message]
2021-07-15 15:54       ` Elijah Newren
2021-07-13 19:33   ` [PATCH v2 4/7] merge-ort: add a handle_deferred_entries() helper function Elijah Newren via GitGitGadget
2021-07-15 14:32     ` Derrick Stolee
2021-07-15 15:59       ` Elijah Newren
2021-07-13 19:33   ` [PATCH v2 5/7] merge-ort: defer recursing into directories when merge base is matched Elijah Newren via GitGitGadget
2021-07-15 14:43     ` Derrick Stolee
2021-07-15 16:03       ` Elijah Newren
2021-07-15 17:14         ` Derrick Stolee
2021-07-13 19:33   ` [PATCH v2 6/7] merge-ort: avoid recursing into directories when we don't need to Elijah Newren via GitGitGadget
2021-07-15 14:55     ` Derrick Stolee
2021-07-15 16:28       ` Elijah Newren
2021-07-13 19:33   ` [PATCH v2 7/7] merge-ort: restart merge with cached renames to reduce process entry cost Elijah Newren via GitGitGadget
2021-07-15 15:09     ` Derrick Stolee
2021-07-15 16:53       ` Elijah Newren
2021-07-15 17:19         ` Derrick Stolee
2021-07-15 17:32           ` Elijah Newren
2021-07-16  5:22   ` [PATCH v3 0/7] Optimization batch 14: trivial directory resolution Elijah Newren via GitGitGadget
2021-07-16  5:22     ` [PATCH v3 1/7] merge-ort: resolve paths early when we have sufficient information Elijah Newren via GitGitGadget
2021-07-16  5:22     ` [PATCH v3 2/7] merge-ort: add some more explanations in collect_merge_info_callback() Elijah Newren via GitGitGadget
2021-07-16  5:22     ` [PATCH v3 3/7] merge-ort: add data structures for allowable trivial directory resolves Elijah Newren via GitGitGadget
2021-07-16  5:22     ` [PATCH v3 4/7] merge-ort: add a handle_deferred_entries() helper function Elijah Newren via GitGitGadget
2021-07-16  5:22     ` [PATCH v3 5/7] merge-ort: defer recursing into directories when merge base is matched Elijah Newren via GitGitGadget
2021-07-16  5:22     ` [PATCH v3 6/7] merge-ort: avoid recursing into directories when we don't need to Elijah Newren via GitGitGadget
2021-07-16  5:22     ` [PATCH v3 7/7] merge-ort: restart merge with cached renames to reduce process entry cost Elijah Newren via GitGitGadget
2021-07-20 13:00     ` [PATCH v3 0/7] Optimization batch 14: trivial directory resolution Derrick Stolee
2021-07-20 21:43       ` Junio C Hamano
2021-07-21  4:23     ` [PATCH v4 " Elijah Newren via GitGitGadget
2021-07-21  4:23       ` [PATCH v4 1/7] merge-ort: resolve paths early when we have sufficient information Elijah Newren via GitGitGadget
2021-07-21  4:23       ` [PATCH v4 2/7] merge-ort: add some more explanations in collect_merge_info_callback() Elijah Newren via GitGitGadget
2021-07-21  4:24       ` [PATCH v4 3/7] merge-ort: add data structures for allowable trivial directory resolves Elijah Newren via GitGitGadget
2021-07-21  4:24       ` [PATCH v4 4/7] merge-ort: add a handle_deferred_entries() helper function Elijah Newren via GitGitGadget
2021-07-21  4:24       ` [PATCH v4 5/7] merge-ort: defer recursing into directories when merge base is matched Elijah Newren via GitGitGadget
2021-07-21  4:24       ` [PATCH v4 6/7] merge-ort: avoid recursing into directories when we don't need to Elijah Newren via GitGitGadget
2021-07-21  4:24       ` [PATCH v4 7/7] merge-ort: restart merge with cached renames to reduce process entry cost Elijah Newren via GitGitGadget

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=f8d4eef7-9768-571f-2fb9-76284d69108f@gmail.com \
    --to=stolee@gmail.com \
    --cc=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=newren@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).