From: Derrick Stolee <stolee@gmail.com>
To: Elijah Newren via GitGitGadget <gitgitgadget@gmail.com>,
git@vger.kernel.org
Cc: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
"Elijah Newren" <newren@gmail.com>
Subject: Re: [PATCH v2 7/7] merge-ort: restart merge with cached renames to reduce process entry cost
Date: Thu, 15 Jul 2021 11:09:59 -0400 [thread overview]
Message-ID: <d91ed8a0-b37b-7dfa-10bf-e068f30e9691@gmail.com> (raw)
In-Reply-To: <7133f0efa520b3d0cadb059151daa12484fdb003.1626204784.git.gitgitgadget@gmail.com>
On 7/13/2021 3:33 PM, Elijah Newren via GitGitGadget wrote:
> From: Elijah Newren <newren@gmail.com>
...
> Often, I noticed that when the optimization does not apply, it is
> because there are a handful of relevant sources -- maybe even only one.
> It felt frustrating to need to recurse into potentially hundreds or even
> thousands of directories just for a single rename, but it was needed for
> correctness.
>
> However, staring at this list of functions and noticing that
> process_entries() is the most expensive and knowing I could avoid it if
> I had cached renames suggested a simple idea: change
> collect_merge_info()
> detect_and_process_renames()
> process_entries()
> into
> collect_merge_info()
> detect_and_process_renames()
> <cache all the renames, and restart>
> collect_merge_info()
> detect_and_process_renames()
> process_entries()
>
> This may seem odd and look like more work. However, note that although
> we run collect_merge_info() twice, the second time we get to employ
> trivial directory resolves, which makes it much faster, so the increased
> time in collect_merge_info() is small. While we run
> detect_and_process_renames() again, all renames are cached so it's
> nearly a no-op (we don't call into diffcore_rename_extended() but we do
> have a little bit of data structure checking and fixing up). And the
> big payoff comes from the fact that process_entries(), will be much
> faster due to having far fewer entries to process.
I enjoyed the story you tell here.
> + if (path_count_after) {
> + /*
> + * Not sure were the right cut-off is for the optimization
> + * to redo collect_merge_info after we've cached the
> + * regular renames is. Basically, collect_merge_info(),
> + * detect_regular_renames(), and process_entries() are
> + * similar costs and all big tent poles. Caching the
> + * result of detect_regular_renames() means that redoing
> + * that one function will cost us virtually 0 extra, so it
> + * depends on the other two functions, which are both O(N)
> + * cost in the number of paths. Thus, it makes sense that
> + * if we can cut the number of paths in half, then redoing
> + * collect_merge_info() at half cost in order to get
> + * process_entries() at half cost should be about equal
> + * cost. If we can cut by more than half, then we would
> + * win. The fact that process_entries() is about 10%-20%
> + * more expensive than collect_merge_info() suggests we
> + * could make the factor be less than two. The fact that
> + * even when we have renames cached, we still have to
> + * traverse down to the individual (relevant) renames,
> + * which suggests we should perhaps use a bigger factor.
> + *
> + * The exact number isn't critical, since the code will
> + * work even if we get the factor wrong -- it just might be
> + * slightly slower if we're a bit off. For now, just error
> + * on the side of a bigger fudge. For the linux kernel
super-nit: s/linux/Linux/
> + * testcases I was looking at with massive renames, the
> + * ratio came in around 50 to 250, which clearly would
> + * trigger this optimization and provided some *very* nice
> + * speedups.
This bit of your testing might be more appropriate for your commit
message. This discussion of a test made at a certain point in time
is more likely to go stale than the description of how this factor
does not change correctness, only performance.
> + */
> + int wanted_factor = 3;
and perhaps make it 'const'?
> +
> + /* We should only redo collect_merge_info one time */
> + assert(renames->redo_after_renames == 0);
> +
> + if (path_count_after / path_count_before > wanted_factor) {
With truncation from integer division, this condition is equivalent* to
path_count_after >= 4 * path_count_before
Or, do you want to change this to a ">=" so that the factor of 3 seems
more accurate?
*I understand the intention of using division to avoid (unlikely)
overflow via multiplication. The truncation is causing some confusion.
> -test_expect_merge_algorithm failure failure '12f: Trivial directory resolve, caching, all kinds of fun' '
> +test_expect_merge_algorithm failure success '12f: Trivial directory resolve, caching, all kinds of fun' '
> test_setup_12f &&
> (
> cd 12f &&
>
Oh, and a bonus test success! Excellent.
Thanks,
-Stolee
next prev parent reply other threads:[~2021-07-15 15:10 UTC|newest]
Thread overview: 52+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-07-01 3:46 [PATCH 0/7] Optimization batch 14: trivial directory resolution Elijah Newren via GitGitGadget
2021-07-01 3:46 ` [PATCH 1/7] merge-ort: resolve paths early when we have sufficient information Elijah Newren via GitGitGadget
2021-07-01 3:46 ` [PATCH 2/7] merge-ort: add some more explanations in collect_merge_info_callback() Elijah Newren via GitGitGadget
2021-07-01 3:46 ` [PATCH 3/7] merge-ort: add data structures for allowable trivial directory resolves Elijah Newren via GitGitGadget
2021-07-01 3:46 ` [PATCH 4/7] merge-ort: add a handle_deferred_entries() helper function Elijah Newren via GitGitGadget
2021-07-01 3:46 ` [PATCH 5/7] merge-ort: defer recursing into directories when merge base is matched Elijah Newren via GitGitGadget
2021-07-01 3:46 ` [PATCH 6/7] merge-ort: avoid recursing into directories when we don't need to Elijah Newren via GitGitGadget
2021-07-01 3:46 ` [PATCH 7/7] merge-ort: restart merge with cached renames to reduce process entry cost Elijah Newren via GitGitGadget
2021-07-01 13:21 ` [PATCH 0/7] Optimization batch 14: trivial directory resolution Ævar Arnfjörð Bjarmason
2021-07-01 15:04 ` Elijah Newren
2021-07-01 19:22 ` Elijah Newren
2021-07-13 19:32 ` [PATCH v2 " Elijah Newren via GitGitGadget
2021-07-13 19:32 ` [PATCH v2 1/7] merge-ort: resolve paths early when we have sufficient information Elijah Newren via GitGitGadget
2021-07-13 19:32 ` [PATCH v2 2/7] merge-ort: add some more explanations in collect_merge_info_callback() Elijah Newren via GitGitGadget
2021-07-13 23:34 ` Bagas Sanjaya
2021-07-14 0:19 ` Elijah Newren
2021-07-13 19:32 ` [PATCH v2 3/7] merge-ort: add data structures for allowable trivial directory resolves Elijah Newren via GitGitGadget
2021-07-15 13:54 ` Derrick Stolee
2021-07-15 15:54 ` Elijah Newren
2021-07-13 19:33 ` [PATCH v2 4/7] merge-ort: add a handle_deferred_entries() helper function Elijah Newren via GitGitGadget
2021-07-15 14:32 ` Derrick Stolee
2021-07-15 15:59 ` Elijah Newren
2021-07-13 19:33 ` [PATCH v2 5/7] merge-ort: defer recursing into directories when merge base is matched Elijah Newren via GitGitGadget
2021-07-15 14:43 ` Derrick Stolee
2021-07-15 16:03 ` Elijah Newren
2021-07-15 17:14 ` Derrick Stolee
2021-07-13 19:33 ` [PATCH v2 6/7] merge-ort: avoid recursing into directories when we don't need to Elijah Newren via GitGitGadget
2021-07-15 14:55 ` Derrick Stolee
2021-07-15 16:28 ` Elijah Newren
2021-07-13 19:33 ` [PATCH v2 7/7] merge-ort: restart merge with cached renames to reduce process entry cost Elijah Newren via GitGitGadget
2021-07-15 15:09 ` Derrick Stolee [this message]
2021-07-15 16:53 ` Elijah Newren
2021-07-15 17:19 ` Derrick Stolee
2021-07-15 17:32 ` Elijah Newren
2021-07-16 5:22 ` [PATCH v3 0/7] Optimization batch 14: trivial directory resolution Elijah Newren via GitGitGadget
2021-07-16 5:22 ` [PATCH v3 1/7] merge-ort: resolve paths early when we have sufficient information Elijah Newren via GitGitGadget
2021-07-16 5:22 ` [PATCH v3 2/7] merge-ort: add some more explanations in collect_merge_info_callback() Elijah Newren via GitGitGadget
2021-07-16 5:22 ` [PATCH v3 3/7] merge-ort: add data structures for allowable trivial directory resolves Elijah Newren via GitGitGadget
2021-07-16 5:22 ` [PATCH v3 4/7] merge-ort: add a handle_deferred_entries() helper function Elijah Newren via GitGitGadget
2021-07-16 5:22 ` [PATCH v3 5/7] merge-ort: defer recursing into directories when merge base is matched Elijah Newren via GitGitGadget
2021-07-16 5:22 ` [PATCH v3 6/7] merge-ort: avoid recursing into directories when we don't need to Elijah Newren via GitGitGadget
2021-07-16 5:22 ` [PATCH v3 7/7] merge-ort: restart merge with cached renames to reduce process entry cost Elijah Newren via GitGitGadget
2021-07-20 13:00 ` [PATCH v3 0/7] Optimization batch 14: trivial directory resolution Derrick Stolee
2021-07-20 21:43 ` Junio C Hamano
2021-07-21 4:23 ` [PATCH v4 " Elijah Newren via GitGitGadget
2021-07-21 4:23 ` [PATCH v4 1/7] merge-ort: resolve paths early when we have sufficient information Elijah Newren via GitGitGadget
2021-07-21 4:23 ` [PATCH v4 2/7] merge-ort: add some more explanations in collect_merge_info_callback() Elijah Newren via GitGitGadget
2021-07-21 4:24 ` [PATCH v4 3/7] merge-ort: add data structures for allowable trivial directory resolves Elijah Newren via GitGitGadget
2021-07-21 4:24 ` [PATCH v4 4/7] merge-ort: add a handle_deferred_entries() helper function Elijah Newren via GitGitGadget
2021-07-21 4:24 ` [PATCH v4 5/7] merge-ort: defer recursing into directories when merge base is matched Elijah Newren via GitGitGadget
2021-07-21 4:24 ` [PATCH v4 6/7] merge-ort: avoid recursing into directories when we don't need to Elijah Newren via GitGitGadget
2021-07-21 4:24 ` [PATCH v4 7/7] merge-ort: restart merge with cached renames to reduce process entry cost Elijah Newren via GitGitGadget
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: http://vger.kernel.org/majordomo-info.html
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=d91ed8a0-b37b-7dfa-10bf-e068f30e9691@gmail.com \
--to=stolee@gmail.com \
--cc=avarab@gmail.com \
--cc=git@vger.kernel.org \
--cc=gitgitgadget@gmail.com \
--cc=newren@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).