git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Elijah Newren <newren@gmail.com>
To: Derrick Stolee <stolee@gmail.com>
Cc: "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com>,
	"Git Mailing List" <git@vger.kernel.org>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Subject: Re: [PATCH v2 3/7] merge-ort: add data structures for allowable trivial directory resolves
Date: Thu, 15 Jul 2021 08:54:36 -0700	[thread overview]
Message-ID: <CABPp-BEeZuzngVfeuykmTN8R1+fX=UG4ckW=eVcrYkO-8MPO3Q@mail.gmail.com> (raw)
In-Reply-To: <f8d4eef7-9768-571f-2fb9-76284d69108f@gmail.com>

On Thu, Jul 15, 2021 at 6:54 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 7/13/2021 3:32 PM, Elijah Newren via GitGitGadget wrote:
> > From: Elijah Newren <newren@gmail.com>
>
> The first two patches were easy to follow. This one requires a more
> careful read, so here is my attempt to double-check that I am
> reading it right.

Yep, you seem to be understanding all of it just fine (does that mean
I did a good job explaining?), and made some good suggestions to
improve it further.

> > In commit f89b4f2bee ("merge-ort: skip rename detection entirely if
> > possible", 2021-03-11), we added an additional reason that rename
> > detection could be skipped entirely -- namely, if no *relevant* sources
> > were present.  Without completing collect_merge_info_callback(), we do
> > not yet know if there are no relevant sources.  However, we do know that
> > if the current directory on one side matches the merge base, then every
> > source file within that directory will not be RELEVANT_CONTENT, and a
> > few simple checks can often let us rule out RELEVANT_LOCATION as well.
> > This suggests we can just defer recursing into such directories until
> > the end of collect_merge_info.
>
> We are not making early decisions, but collecting a list of directories
> that cannot be relevant sources. These directories could still be
> relevant locations (or content), but we don't want to pay the cost of
> recursing right away.
>
> > Since the deferred directories are known to not add any relevant sources
> > due to the above properties, then if there are no relevant sources after
> > we've traversed all paths other than the deferred ones, then we know
> > there are not any relevant sources.  Under those conditions, rename
> > detection is unnecessary, and that means we can resolve the deferred
> > directories without recursing into them.
>
> Only in the case where there are no relevant sources at the end of
> scanning all directories (possibly non-recursively), then we can skip
> recursing into these directories because finding relevant locations
> is wasted effort now.
>
> This means the performance benefit will be limited to cases where
> rename detection isn't needed at all, but that is a frequent enough
> case to merit the overhead of tracking this information differently.
>
> > Note that the logic for skipping rename detection was also modified
> > further in commit 76e253793c ("merge-ort, diffcore-rename: employ cached
> > renames when possible", 2021-01-30); in particular rename detection can
> > be skipped if we already have cached renames for each relevant source.
> > We can take advantage of this information as well with our deferral of
> > recursing into directories where one side matches the merge base.
> >
> > Add some data structures that we will use to do these deferrals, with
> > some lengthy comments explaining their purpose.
>
> And now that the idea is understood, we are just laying the groundwork
> for the process by creating the tracking data structures, not actually
> the algorithm to use them. Makes sense to split them up.
>
> > +     /*
> > +      * possible_trivial_merges: directories we defer recursing into
>
> Not only are we deferring the recursion, we might be able to avoid
> it entirely. Perhaps "directories to be explored only when needed"
> would emphasize this point differently?

I like the sound of that; seems clearer.  Will update.

> > +      *
> > +      * possible_trivial_merges is a map of directory names to
> > +      * dir_rename_mask.  When we detect that a directory is unchanged on
> > +      * one side, we can sometimes resolve the directory without recursing
> > +      * into it.  Renames are the only things that can prevent such an
> > +      * optimization.  However, for rename sources:
> > +      *   - If no parent directory needed directory rename detection, then
> > +      *     no path under such a directory can be a relevant_source.
> > +      * and for rename destinations:
> > +      *   - If no cached rename has a target path under the directory AND
> > +      *   - If there are no unpaired relevant_sources elsewhere in the
> > +      *     repository
> > +      * then we don't need any path under this directory for a rename
> > +      * destination.  The only way to know the last item above is to defer
> > +      * handling such directories until the end of collect_merge_info(),
> > +      * in handle_deferred_entries().
> > +      *
> > +      * For each we store dir_rename_mask, since that's the only bit of
> > +      * information we need, other than the path, to resume the recursive
> > +      * traversal.
> > +      */
> > +     struct strintmap possible_trivial_merges[3];
> > +
> > +     /*
> > +      * trivial_merges_okay: if trivial directory merges are okay
> > +      *
> > +      * See possible_trivial_merges above.  The "no unpaired
> > +      * relevant_sources elsewhere in the repository" is a single boolean
> > +      * per merge side, which we store here.  Note that while 0 means no,
> > +      * 1 only means "maybe" rather than "yes"; we optimistically set it
> > +      * to 1 initially and only clear when we determine it is unsafe to
> > +      * do trivial directory merges.
> > +      */
> > +     unsigned trivial_merges_okay[3];
> > +
> > +     /*
> > +      * target_dirs: ancestor directories of rename targets
> > +      *
> > +      * target_dirs contains all directory names that are an ancestor of
> > +      * any rename destination.
> > +      */
> > +     struct strset target_dirs[3];
>
> These three members seem tightly coupled. Would it be beneficial to
> group the data into a 'struct trivial_merge_data' that could then
> have an array of three here?

Ooh, I like that idea; makes sense.  That'll naturally flow with your
last two comments too.

> >       /*
> >        * dir_rename_mask:
> >        *   0: optimization removing unmodified potential rename source okay
> > @@ -490,6 +535,9 @@ static void clear_or_reinit_internal_opts(struct merge_options_internal *opti,
> >               strintmap_func(&renames->dirs_removed[i]);
> >               strmap_func(&renames->dir_renames[i], 0);
> >               strintmap_func(&renames->relevant_sources[i]);
> > +             strintmap_func(&renames->possible_trivial_merges[i]);
> > +             strset_func(&renames->target_dirs[i]);
> > +             renames->trivial_merges_okay[i] = 1; /* 1 == maybe */
>
> This could be collected into an initializer method for the data
> required by this optimization instead of three initializations
> among many.
>
> >               if (!reinitialize)
> >                       assert(renames->cached_pairs_valid_side == 0);
> >               if (i != renames->cached_pairs_valid_side) {
> > @@ -4045,12 +4093,17 @@ static void merge_start(struct merge_options *opt, struct merge_result *result)
> >               strintmap_init_with_options(&renames->relevant_sources[i],
> >                                           -1 /* explicitly invalid */,
> >                                           NULL, 0);
> > +             strintmap_init_with_options(&renames->possible_trivial_merges[i],
> > +                                         0, NULL, 0);
> > +             strset_init_with_options(&renames->target_dirs[i],
> > +                                      NULL, 1);
> >               strmap_init_with_options(&renames->cached_pairs[i],
> >                                        NULL, 1);
> >               strset_init_with_options(&renames->cached_irrelevant[i],
> >                                        NULL, 1);
> >               strset_init_with_options(&renames->cached_target_names[i],
> >                                        NULL, 0);
> > +             renames->trivial_merges_okay[i] = 1; /* 1 == maybe */
>
> Here, there is even a separation between the grouped data. If you
> don't like the idea of creating helper methods, then at least these
> new lines could be grouped together. Bonus points for adding
> whitespace between these lines and the others.

  reply	other threads:[~2021-07-15 15:54 UTC|newest]

Thread overview: 52+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-07-01  3:46 [PATCH 0/7] Optimization batch 14: trivial directory resolution Elijah Newren via GitGitGadget
2021-07-01  3:46 ` [PATCH 1/7] merge-ort: resolve paths early when we have sufficient information Elijah Newren via GitGitGadget
2021-07-01  3:46 ` [PATCH 2/7] merge-ort: add some more explanations in collect_merge_info_callback() Elijah Newren via GitGitGadget
2021-07-01  3:46 ` [PATCH 3/7] merge-ort: add data structures for allowable trivial directory resolves Elijah Newren via GitGitGadget
2021-07-01  3:46 ` [PATCH 4/7] merge-ort: add a handle_deferred_entries() helper function Elijah Newren via GitGitGadget
2021-07-01  3:46 ` [PATCH 5/7] merge-ort: defer recursing into directories when merge base is matched Elijah Newren via GitGitGadget
2021-07-01  3:46 ` [PATCH 6/7] merge-ort: avoid recursing into directories when we don't need to Elijah Newren via GitGitGadget
2021-07-01  3:46 ` [PATCH 7/7] merge-ort: restart merge with cached renames to reduce process entry cost Elijah Newren via GitGitGadget
2021-07-01 13:21 ` [PATCH 0/7] Optimization batch 14: trivial directory resolution Ævar Arnfjörð Bjarmason
2021-07-01 15:04   ` Elijah Newren
2021-07-01 19:22     ` Elijah Newren
2021-07-13 19:32 ` [PATCH v2 " Elijah Newren via GitGitGadget
2021-07-13 19:32   ` [PATCH v2 1/7] merge-ort: resolve paths early when we have sufficient information Elijah Newren via GitGitGadget
2021-07-13 19:32   ` [PATCH v2 2/7] merge-ort: add some more explanations in collect_merge_info_callback() Elijah Newren via GitGitGadget
2021-07-13 23:34     ` Bagas Sanjaya
2021-07-14  0:19       ` Elijah Newren
2021-07-13 19:32   ` [PATCH v2 3/7] merge-ort: add data structures for allowable trivial directory resolves Elijah Newren via GitGitGadget
2021-07-15 13:54     ` Derrick Stolee
2021-07-15 15:54       ` Elijah Newren [this message]
2021-07-13 19:33   ` [PATCH v2 4/7] merge-ort: add a handle_deferred_entries() helper function Elijah Newren via GitGitGadget
2021-07-15 14:32     ` Derrick Stolee
2021-07-15 15:59       ` Elijah Newren
2021-07-13 19:33   ` [PATCH v2 5/7] merge-ort: defer recursing into directories when merge base is matched Elijah Newren via GitGitGadget
2021-07-15 14:43     ` Derrick Stolee
2021-07-15 16:03       ` Elijah Newren
2021-07-15 17:14         ` Derrick Stolee
2021-07-13 19:33   ` [PATCH v2 6/7] merge-ort: avoid recursing into directories when we don't need to Elijah Newren via GitGitGadget
2021-07-15 14:55     ` Derrick Stolee
2021-07-15 16:28       ` Elijah Newren
2021-07-13 19:33   ` [PATCH v2 7/7] merge-ort: restart merge with cached renames to reduce process entry cost Elijah Newren via GitGitGadget
2021-07-15 15:09     ` Derrick Stolee
2021-07-15 16:53       ` Elijah Newren
2021-07-15 17:19         ` Derrick Stolee
2021-07-15 17:32           ` Elijah Newren
2021-07-16  5:22   ` [PATCH v3 0/7] Optimization batch 14: trivial directory resolution Elijah Newren via GitGitGadget
2021-07-16  5:22     ` [PATCH v3 1/7] merge-ort: resolve paths early when we have sufficient information Elijah Newren via GitGitGadget
2021-07-16  5:22     ` [PATCH v3 2/7] merge-ort: add some more explanations in collect_merge_info_callback() Elijah Newren via GitGitGadget
2021-07-16  5:22     ` [PATCH v3 3/7] merge-ort: add data structures for allowable trivial directory resolves Elijah Newren via GitGitGadget
2021-07-16  5:22     ` [PATCH v3 4/7] merge-ort: add a handle_deferred_entries() helper function Elijah Newren via GitGitGadget
2021-07-16  5:22     ` [PATCH v3 5/7] merge-ort: defer recursing into directories when merge base is matched Elijah Newren via GitGitGadget
2021-07-16  5:22     ` [PATCH v3 6/7] merge-ort: avoid recursing into directories when we don't need to Elijah Newren via GitGitGadget
2021-07-16  5:22     ` [PATCH v3 7/7] merge-ort: restart merge with cached renames to reduce process entry cost Elijah Newren via GitGitGadget
2021-07-20 13:00     ` [PATCH v3 0/7] Optimization batch 14: trivial directory resolution Derrick Stolee
2021-07-20 21:43       ` Junio C Hamano
2021-07-21  4:23     ` [PATCH v4 " Elijah Newren via GitGitGadget
2021-07-21  4:23       ` [PATCH v4 1/7] merge-ort: resolve paths early when we have sufficient information Elijah Newren via GitGitGadget
2021-07-21  4:23       ` [PATCH v4 2/7] merge-ort: add some more explanations in collect_merge_info_callback() Elijah Newren via GitGitGadget
2021-07-21  4:24       ` [PATCH v4 3/7] merge-ort: add data structures for allowable trivial directory resolves Elijah Newren via GitGitGadget
2021-07-21  4:24       ` [PATCH v4 4/7] merge-ort: add a handle_deferred_entries() helper function Elijah Newren via GitGitGadget
2021-07-21  4:24       ` [PATCH v4 5/7] merge-ort: defer recursing into directories when merge base is matched Elijah Newren via GitGitGadget
2021-07-21  4:24       ` [PATCH v4 6/7] merge-ort: avoid recursing into directories when we don't need to Elijah Newren via GitGitGadget
2021-07-21  4:24       ` [PATCH v4 7/7] merge-ort: restart merge with cached renames to reduce process entry cost Elijah Newren via GitGitGadget

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CABPp-BEeZuzngVfeuykmTN8R1+fX=UG4ckW=eVcrYkO-8MPO3Q@mail.gmail.com' \
    --to=newren@gmail.com \
    --cc=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).