From: Elijah Newren <newren@gmail.com>
To: Derrick Stolee <stolee@gmail.com>
Cc: "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com>,
"Git Mailing List" <git@vger.kernel.org>,
"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Subject: Re: [PATCH v2 3/7] merge-ort: add data structures for allowable trivial directory resolves
Date: Thu, 15 Jul 2021 08:54:36 -0700 [thread overview]
Message-ID: <CABPp-BEeZuzngVfeuykmTN8R1+fX=UG4ckW=eVcrYkO-8MPO3Q@mail.gmail.com> (raw)
In-Reply-To: <f8d4eef7-9768-571f-2fb9-76284d69108f@gmail.com>
On Thu, Jul 15, 2021 at 6:54 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 7/13/2021 3:32 PM, Elijah Newren via GitGitGadget wrote:
> > From: Elijah Newren <newren@gmail.com>
>
> The first two patches were easy to follow. This one requires a more
> careful read, so here is my attempt to double-check that I am
> reading it right.
Yep, you seem to be understanding all of it just fine (does that mean
I did a good job explaining?), and made some good suggestions to
improve it further.
> > In commit f89b4f2bee ("merge-ort: skip rename detection entirely if
> > possible", 2021-03-11), we added an additional reason that rename
> > detection could be skipped entirely -- namely, if no *relevant* sources
> > were present. Without completing collect_merge_info_callback(), we do
> > not yet know if there are no relevant sources. However, we do know that
> > if the current directory on one side matches the merge base, then every
> > source file within that directory will not be RELEVANT_CONTENT, and a
> > few simple checks can often let us rule out RELEVANT_LOCATION as well.
> > This suggests we can just defer recursing into such directories until
> > the end of collect_merge_info.
>
> We are not making early decisions, but collecting a list of directories
> that cannot be relevant sources. These directories could still be
> relevant locations (or content), but we don't want to pay the cost of
> recursing right away.
>
> > Since the deferred directories are known to not add any relevant sources
> > due to the above properties, then if there are no relevant sources after
> > we've traversed all paths other than the deferred ones, then we know
> > there are not any relevant sources. Under those conditions, rename
> > detection is unnecessary, and that means we can resolve the deferred
> > directories without recursing into them.
>
> Only in the case where there are no relevant sources at the end of
> scanning all directories (possibly non-recursively), then we can skip
> recursing into these directories because finding relevant locations
> is wasted effort now.
>
> This means the performance benefit will be limited to cases where
> rename detection isn't needed at all, but that is a frequent enough
> case to merit the overhead of tracking this information differently.
>
> > Note that the logic for skipping rename detection was also modified
> > further in commit 76e253793c ("merge-ort, diffcore-rename: employ cached
> > renames when possible", 2021-01-30); in particular rename detection can
> > be skipped if we already have cached renames for each relevant source.
> > We can take advantage of this information as well with our deferral of
> > recursing into directories where one side matches the merge base.
> >
> > Add some data structures that we will use to do these deferrals, with
> > some lengthy comments explaining their purpose.
>
> And now that the idea is understood, we are just laying the groundwork
> for the process by creating the tracking data structures, not actually
> the algorithm to use them. Makes sense to split them up.
>
> > + /*
> > + * possible_trivial_merges: directories we defer recursing into
>
> Not only are we deferring the recursion, we might be able to avoid
> it entirely. Perhaps "directories to be explored only when needed"
> would emphasize this point differently?
I like the sound of that; seems clearer. Will update.
> > + *
> > + * possible_trivial_merges is a map of directory names to
> > + * dir_rename_mask. When we detect that a directory is unchanged on
> > + * one side, we can sometimes resolve the directory without recursing
> > + * into it. Renames are the only things that can prevent such an
> > + * optimization. However, for rename sources:
> > + * - If no parent directory needed directory rename detection, then
> > + * no path under such a directory can be a relevant_source.
> > + * and for rename destinations:
> > + * - If no cached rename has a target path under the directory AND
> > + * - If there are no unpaired relevant_sources elsewhere in the
> > + * repository
> > + * then we don't need any path under this directory for a rename
> > + * destination. The only way to know the last item above is to defer
> > + * handling such directories until the end of collect_merge_info(),
> > + * in handle_deferred_entries().
> > + *
> > + * For each we store dir_rename_mask, since that's the only bit of
> > + * information we need, other than the path, to resume the recursive
> > + * traversal.
> > + */
> > + struct strintmap possible_trivial_merges[3];
> > +
> > + /*
> > + * trivial_merges_okay: if trivial directory merges are okay
> > + *
> > + * See possible_trivial_merges above. The "no unpaired
> > + * relevant_sources elsewhere in the repository" is a single boolean
> > + * per merge side, which we store here. Note that while 0 means no,
> > + * 1 only means "maybe" rather than "yes"; we optimistically set it
> > + * to 1 initially and only clear when we determine it is unsafe to
> > + * do trivial directory merges.
> > + */
> > + unsigned trivial_merges_okay[3];
> > +
> > + /*
> > + * target_dirs: ancestor directories of rename targets
> > + *
> > + * target_dirs contains all directory names that are an ancestor of
> > + * any rename destination.
> > + */
> > + struct strset target_dirs[3];
>
> These three members seem tightly coupled. Would it be beneficial to
> group the data into a 'struct trivial_merge_data' that could then
> have an array of three here?
Ooh, I like that idea; makes sense. That'll naturally flow with your
last two comments too.
> > /*
> > * dir_rename_mask:
> > * 0: optimization removing unmodified potential rename source okay
> > @@ -490,6 +535,9 @@ static void clear_or_reinit_internal_opts(struct merge_options_internal *opti,
> > strintmap_func(&renames->dirs_removed[i]);
> > strmap_func(&renames->dir_renames[i], 0);
> > strintmap_func(&renames->relevant_sources[i]);
> > + strintmap_func(&renames->possible_trivial_merges[i]);
> > + strset_func(&renames->target_dirs[i]);
> > + renames->trivial_merges_okay[i] = 1; /* 1 == maybe */
>
> This could be collected into an initializer method for the data
> required by this optimization instead of three initializations
> among many.
>
> > if (!reinitialize)
> > assert(renames->cached_pairs_valid_side == 0);
> > if (i != renames->cached_pairs_valid_side) {
> > @@ -4045,12 +4093,17 @@ static void merge_start(struct merge_options *opt, struct merge_result *result)
> > strintmap_init_with_options(&renames->relevant_sources[i],
> > -1 /* explicitly invalid */,
> > NULL, 0);
> > + strintmap_init_with_options(&renames->possible_trivial_merges[i],
> > + 0, NULL, 0);
> > + strset_init_with_options(&renames->target_dirs[i],
> > + NULL, 1);
> > strmap_init_with_options(&renames->cached_pairs[i],
> > NULL, 1);
> > strset_init_with_options(&renames->cached_irrelevant[i],
> > NULL, 1);
> > strset_init_with_options(&renames->cached_target_names[i],
> > NULL, 0);
> > + renames->trivial_merges_okay[i] = 1; /* 1 == maybe */
>
> Here, there is even a separation between the grouped data. If you
> don't like the idea of creating helper methods, then at least these
> new lines could be grouped together. Bonus points for adding
> whitespace between these lines and the others.
next prev parent reply other threads:[~2021-07-15 15:54 UTC|newest]
Thread overview: 52+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-07-01 3:46 [PATCH 0/7] Optimization batch 14: trivial directory resolution Elijah Newren via GitGitGadget
2021-07-01 3:46 ` [PATCH 1/7] merge-ort: resolve paths early when we have sufficient information Elijah Newren via GitGitGadget
2021-07-01 3:46 ` [PATCH 2/7] merge-ort: add some more explanations in collect_merge_info_callback() Elijah Newren via GitGitGadget
2021-07-01 3:46 ` [PATCH 3/7] merge-ort: add data structures for allowable trivial directory resolves Elijah Newren via GitGitGadget
2021-07-01 3:46 ` [PATCH 4/7] merge-ort: add a handle_deferred_entries() helper function Elijah Newren via GitGitGadget
2021-07-01 3:46 ` [PATCH 5/7] merge-ort: defer recursing into directories when merge base is matched Elijah Newren via GitGitGadget
2021-07-01 3:46 ` [PATCH 6/7] merge-ort: avoid recursing into directories when we don't need to Elijah Newren via GitGitGadget
2021-07-01 3:46 ` [PATCH 7/7] merge-ort: restart merge with cached renames to reduce process entry cost Elijah Newren via GitGitGadget
2021-07-01 13:21 ` [PATCH 0/7] Optimization batch 14: trivial directory resolution Ævar Arnfjörð Bjarmason
2021-07-01 15:04 ` Elijah Newren
2021-07-01 19:22 ` Elijah Newren
2021-07-13 19:32 ` [PATCH v2 " Elijah Newren via GitGitGadget
2021-07-13 19:32 ` [PATCH v2 1/7] merge-ort: resolve paths early when we have sufficient information Elijah Newren via GitGitGadget
2021-07-13 19:32 ` [PATCH v2 2/7] merge-ort: add some more explanations in collect_merge_info_callback() Elijah Newren via GitGitGadget
2021-07-13 23:34 ` Bagas Sanjaya
2021-07-14 0:19 ` Elijah Newren
2021-07-13 19:32 ` [PATCH v2 3/7] merge-ort: add data structures for allowable trivial directory resolves Elijah Newren via GitGitGadget
2021-07-15 13:54 ` Derrick Stolee
2021-07-15 15:54 ` Elijah Newren [this message]
2021-07-13 19:33 ` [PATCH v2 4/7] merge-ort: add a handle_deferred_entries() helper function Elijah Newren via GitGitGadget
2021-07-15 14:32 ` Derrick Stolee
2021-07-15 15:59 ` Elijah Newren
2021-07-13 19:33 ` [PATCH v2 5/7] merge-ort: defer recursing into directories when merge base is matched Elijah Newren via GitGitGadget
2021-07-15 14:43 ` Derrick Stolee
2021-07-15 16:03 ` Elijah Newren
2021-07-15 17:14 ` Derrick Stolee
2021-07-13 19:33 ` [PATCH v2 6/7] merge-ort: avoid recursing into directories when we don't need to Elijah Newren via GitGitGadget
2021-07-15 14:55 ` Derrick Stolee
2021-07-15 16:28 ` Elijah Newren
2021-07-13 19:33 ` [PATCH v2 7/7] merge-ort: restart merge with cached renames to reduce process entry cost Elijah Newren via GitGitGadget
2021-07-15 15:09 ` Derrick Stolee
2021-07-15 16:53 ` Elijah Newren
2021-07-15 17:19 ` Derrick Stolee
2021-07-15 17:32 ` Elijah Newren
2021-07-16 5:22 ` [PATCH v3 0/7] Optimization batch 14: trivial directory resolution Elijah Newren via GitGitGadget
2021-07-16 5:22 ` [PATCH v3 1/7] merge-ort: resolve paths early when we have sufficient information Elijah Newren via GitGitGadget
2021-07-16 5:22 ` [PATCH v3 2/7] merge-ort: add some more explanations in collect_merge_info_callback() Elijah Newren via GitGitGadget
2021-07-16 5:22 ` [PATCH v3 3/7] merge-ort: add data structures for allowable trivial directory resolves Elijah Newren via GitGitGadget
2021-07-16 5:22 ` [PATCH v3 4/7] merge-ort: add a handle_deferred_entries() helper function Elijah Newren via GitGitGadget
2021-07-16 5:22 ` [PATCH v3 5/7] merge-ort: defer recursing into directories when merge base is matched Elijah Newren via GitGitGadget
2021-07-16 5:22 ` [PATCH v3 6/7] merge-ort: avoid recursing into directories when we don't need to Elijah Newren via GitGitGadget
2021-07-16 5:22 ` [PATCH v3 7/7] merge-ort: restart merge with cached renames to reduce process entry cost Elijah Newren via GitGitGadget
2021-07-20 13:00 ` [PATCH v3 0/7] Optimization batch 14: trivial directory resolution Derrick Stolee
2021-07-20 21:43 ` Junio C Hamano
2021-07-21 4:23 ` [PATCH v4 " Elijah Newren via GitGitGadget
2021-07-21 4:23 ` [PATCH v4 1/7] merge-ort: resolve paths early when we have sufficient information Elijah Newren via GitGitGadget
2021-07-21 4:23 ` [PATCH v4 2/7] merge-ort: add some more explanations in collect_merge_info_callback() Elijah Newren via GitGitGadget
2021-07-21 4:24 ` [PATCH v4 3/7] merge-ort: add data structures for allowable trivial directory resolves Elijah Newren via GitGitGadget
2021-07-21 4:24 ` [PATCH v4 4/7] merge-ort: add a handle_deferred_entries() helper function Elijah Newren via GitGitGadget
2021-07-21 4:24 ` [PATCH v4 5/7] merge-ort: defer recursing into directories when merge base is matched Elijah Newren via GitGitGadget
2021-07-21 4:24 ` [PATCH v4 6/7] merge-ort: avoid recursing into directories when we don't need to Elijah Newren via GitGitGadget
2021-07-21 4:24 ` [PATCH v4 7/7] merge-ort: restart merge with cached renames to reduce process entry cost Elijah Newren via GitGitGadget
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: http://vger.kernel.org/majordomo-info.html
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to='CABPp-BEeZuzngVfeuykmTN8R1+fX=UG4ckW=eVcrYkO-8MPO3Q@mail.gmail.com' \
--to=newren@gmail.com \
--cc=avarab@gmail.com \
--cc=git@vger.kernel.org \
--cc=gitgitgadget@gmail.com \
--cc=stolee@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).