From: Emily Shaffer <emilyshaffer@google.com>
To: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Cc: Junio C Hamano <gitster@pobox.com>, git@vger.kernel.org
Subject: Re: [PATCH v5 4/4] submodule: record superproject gitdir during 'update'
Date: Tue, 9 Nov 2021 12:36:50 -0800 [thread overview]
Message-ID: <YYrb4oCaIXIwN/bF@google.com> (raw)
In-Reply-To: <211109.86v912dtfw.gmgdl@evledraar.gmail.com>
On Tue, Nov 09, 2021 at 01:42:03AM +0100, Ævar Arnfjörð Bjarmason wrote:
>
>
> On Mon, Nov 08 2021, Emily Shaffer wrote:
>
> > On Fri, Nov 05, 2021 at 09:43:56AM +0100, Ævar Arnfjörð Bjarmason wrote:
> >>
> >>
> >> On Thu, Nov 04 2021, Junio C Hamano wrote:
> >>
> >> > Emily Shaffer <emilyshaffer@google.com> writes:
> >> >
> >> >> A recorded hint path to the superproject's gitdir might be added during
> >> >> 'git submodule add', but in some cases - like submodules which were
> >> >> created before 'git submodule add' learned to record that info - it might
> >> >> be useful to update the hint. Let's do it during 'git submodule
> >> >> update', when we already have a handle to the superproject while calling
> >> >> operations on the submodules.
> >> >
> >> > We are hearing repeated mention of "cache" and "hint". Do we ever
> >> > invalidate it, or if we have such a record, do we blindly trust it
> >> > and use it without verifying if it is still fresh?
> >> >
> >> > Also, this step and the previous step both say we record gitdir on
> >> > their title, but we instead record common dir. Whichever is the
> >> > right choice to record, let's be consistent.
> >>
> >> I had similar (AFAICT still unaddressed) feedback on the v2[1]. I'd lost
> >> track of this series, and see one reason is that the In-Reply-Chain was
> >> broken between v3..v4.
> >>
> >> I.e. it seems to me that this whole thing started as a way to avoid
> >> shellscript overhead by calling git-rev-parse from git-submodule.sh, but
> >> now that the relevant bits are moved to C we could just call some
> >> slightly adjusted code in setup.c.
> >
> > No, that is not the case. It is the case that `git -C .. rev-parse
> > --git-dir` is *very* expensive in the case where `../` is not, in fact,
> > a gitdir; when I attempted another series which relied on finding the
> > parent superproject's gitdir in this way, our testsuite took something
> > like 5x longer to run than before. In other words, the expensive part is
> > not the shelling out overhead - the expensive part is searching up the
> > entire filesystem directory structure in the worst-case ("we are not a
> > submodule") scenario. This is still needed, even with 'git-submodule.sh'
> > moving to C.
>
> Do you have that test code somewhere?
I messed around with it a little more, rebasing the no-caches-involved
older implementation and using an in-process lookup with
setup_git_directory_gently_1.
https://github.com/nasamuffin/git/tree/config-inheritance-no-cache
The recent experiments are in the tip commit, and the original series is
in the two commits prior if you're interested.
The upshot, though, is that I think there is still not a way around a
second subprocess. Before, we determined the superproject's gitdir like
so:
# Does a git project at .. think I belong to it?
git -C .. ls-files <args> -- path/to/submodule
# Where does that git project's gitdir live?
git -C .. rev-parse --absolute-git-dir
Even if we can do the second call in-process, we still will be
performing this ls-files call to ensure that the parent repo is actually
our superproject. (One good example of a time when the parent repo is
*not*: the entire Git test suite, where '/path/to/git/t/trash directory.t1234-abcd'
is not a submodule of '/path/to/git/.git'.)
We could reverse the checks, which will make this much less painful in
the real world, but will still slow down our test suites (and hopefully
you'll forgive me for combining C and bash so brazenly, but it's for
illustration purposes only):
# Is there a git project above us?
setup_git_directory_gently_1("..", out, 0);
# Does it think we're its submodule?
git -C $out rev-parse --absolute-git-dir
That will still result in an extra out-of-process call for every line in
the Git test suite, though, because of the trash directory layout.
I looked briefly at `git ls-files` // `cmd_ls_files()` and it's fairly
close to being callable on an arbitrary 'struct repository', but not
quite there. But I am pretty afraid of the rabbit hole ;)
And anyway, even with those possible changes, it turns out that
`setup_git_directory_gently_1` - which I had to munge a bit to make
non-private, anyway - wants to take shortcuts like looking at
getenv(GIT_DIR_ENVIRONMENT), which means it's will notice the
submodule's envvar before it will notice the path passed to it. I
actually am a little surprised that your experiment worked at all,
because of this wrinkle, but your printf lines show that it did somehow.
Taken as a whole, I'm not quite certain which is worse.
Approach "cache a pointer from the submodule":
- all the normal caching gotchas - creation, invalidation, staleness,
etc
+ very easy lookup without need for significant refactoring
- since it's a config, if we do it wrong we're stuck supporting it
forever anyway
Approach "do lots of in-process heuristics":
- need to refactor code areas unrelated to submodules or configs, like
setup.c
- performance might differ based on filesystem speed
+ still pretty fast (compared to subprocess calls)
+ avoid all cache correctness issues
- still kind of based on heuristics; will someone envision a wonky way
of organizing a submodule that breaks the heuristic?
I'll think on it more....
- Emily
> I tried to reproduce this &
> can't. I run my tests in /home/avar/*, and just created this:
>
> $ find /tmp/some/ -name '.git' -type d
> /tmp/some/dir/.git
> /tmp/some/dir/a/b/c/d/e/f/g/i/j/k/.git
>
> I.e. a deeply nested structure in /tmp, if you ask for the git-dir in
> /tmp/some/**/k you'll need to search several levels up.
>
> Then with the patch below we'll instrument almost all git commands to
> optionally do that search, i.e. anything that does setup_git_directory()
> at all:
>
> $ GIT_TEST_SETUP=true GIT_TEST_SETUP_PRINT=true ~/g/git/git rev-parse HEAD
> warning: from '/tmp/some/dir/a/b/c/d/e/f/g/i/j' found '/tmp/some/dir' ('/tmp/some/dir/.git')
> <some hash here>
>
> And as a quick test to run a few tests I tried:
>
> rm -rf test-results/; GIT_TEST_SETUP=true GIT_TEST_SETUP_PRINT=true prove -j8 t741*submod*.sh :: -V
>
> Which runs quickly enough for a tight test loop, and does that work >600 times:
>
> $ cat test-results/*.out|grep -c warning.*from.*found
> 662
>
> I can't get that to show me any meaningful difference, just to pick on
> one test (since it was easier to run repeatedly):
>
> $ hyperfine -L v true,false "GIT_TEST_SETUP={v} ./t7416-submodule-dash-url.sh --root=/dev/shm/git"
> Benchmark #1: GIT_TEST_SETUP=true ./t7416-submodule-dash-url.sh --root=/dev/shm/git
> Time (mean ± σ): 527.5 ms ± 7.2 ms [User: 431.6 ms, System: 125.9 ms]
> Range (min … max): 522.6 ms … 542.5 ms 10 runs
>
> Benchmark #2: GIT_TEST_SETUP=false ./t7416-submodule-dash-url.sh --root=/dev/shm/git
> Time (mean ± σ): 526.7 ms ± 10.8 ms [User: 421.1 ms, System: 131.6 ms]
> Range (min … max): 518.2 ms … 546.8 ms 10 runs
>
> Summary
> 'GIT_TEST_SETUP=false ./t7416-submodule-dash-url.sh --root=/dev/shm/git' ran
> 1.00 ± 0.02 times faster than 'GIT_TEST_SETUP=true ./t7416-submodule-dash-url.sh --root=/dev/shm/git'
>
> I.e. it's all fuzzy and within the error margins.
>
> Now if we do e.g.:
>
> GIT_TEST_SETUP=false strace -f -c -U calls,name,time -S calls ./t7416-submodule-dash-url.sh 2>&1 >/dev/null | grep -A9000 calls >a
> GIT_TEST_SETUP=true strace -f -c -U calls,name,time -S calls ./t7416-submodule-dash-url.sh 2>&1 >/dev/null | grep -A9000 calls >b
>
> We'll see the syscall difference, in summary:
>
> - 110086 total 100.00
> + 114765 total 100.00
>
> And some of the real big differences are:
>
> $ diff -u <(head -n 12 a) <(head -n 12 b)
> --- /dev/fd/63 2021-11-09 01:57:16.023991556 +0100
> +++ /dev/fd/62 2021-11-09 01:57:16.019991593 +0100
> @@ -1,12 +1,12 @@
> calls syscall % time
> --------- ---------------- ------
> - 11504 openat 2.15
> - 11496 close 2.07
> - 10672 read 4.15
> - 10465 rt_sigaction 0.32
> - 9913 lstat 1.50
> - 9456 mmap 0.67
> - 7545 stat 0.81
> - 6349 fstat 0.62
> - 3896 access 0.61
> - 3490 mprotect 0.26
> + 11783 lstat 1.80
> + 11600 close 1.89
> + 11599 openat 2.19
> + 10887 read 4.36
> + 10465 rt_sigaction 0.33
> + 9742 stat 1.07
> + 9455 mmap 0.75
> + 6346 fstat 0.57
> + 4113 access 0.65
> + 3490 mprotect 0.28
>
> But syscalls are fast, so it doesn't show up in real results.
>
> Now, of course a real implementation could be less stupid, e.g. even if
> we think we need *a cache* if these are the performance numbers why do
> we need to risk the cache being incorrect v.s. say just writing "I am a
> submodule" somewhere in the config (if we don't have that).
>
> Then we'll only do that work for submodules, so not all git invocations
> will pay the cost (and it this point we'll usually have read config
> anyway).
>
> But I really just don't think it's that expensive at all. I can see how
> it would be for actually shelling out, but we don't need to do that.
>
> This could also just be that I'm running this on a really fast FS, which
> is true. So I went and tested on an AIX machine I have access to.
>
> I/O on AIX is slow, *really slow*, so slow that if you "rm -rfv"
> something you'll have time to read individual lines scrolling by.
>
> That ~500ms t7416-submodule-dash-url.sh test runs in around 50s on that
> AIX box (power-aix.osuosl.org), most of which is I/O overhead, I created
> the same /tmp/ directory structure and tried with
> GIT_TEST_SETUP=[false|true] and it's ~55s with/without the env variable,
> with no clear winner.
>
> I don't have access to hyperfine on that box, or the patience to wait
> for AIX I/O to wait for meaningful results, but to a first approximation
> that seems to indicate that it doesn't really matter there either.
>
> diff --git a/setup.c b/setup.c
> index 347d7181ae9..8453d397676 100644
> --- a/setup.c
> +++ b/setup.c
> @@ -1209,6 +1209,8 @@ const char *setup_git_directory_gently(int *nongit_ok)
> struct strbuf dir = STRBUF_INIT, gitdir = STRBUF_INIT;
> const char *prefix = NULL;
> struct repository_format repo_fmt = REPOSITORY_FORMAT_INIT;
> + const char *str = "/tmp/some/dir/a/b/c/d/e/f/g/i/j";
> + struct strbuf a = STRBUF_INIT, b = STRBUF_INIT;
>
> /*
> * We may have read an incomplete configuration before
> @@ -1231,6 +1233,14 @@ const char *setup_git_directory_gently(int *nongit_ok)
> die_errno(_("Unable to read current working directory"));
> strbuf_addbuf(&dir, &cwd);
>
> + if (git_env_bool("GIT_TEST_SETUP", 0)) {
> + strbuf_addstr(&a, str);
> + setup_git_directory_gently_1(&a, &b, 0);
> +
> + if (strcmp(a.buf, str) && git_env_bool("GIT_TEST_SETUP_PRINT", 0))
> + warning("from '%s' found '%s' ('%s/%s')", str, a.buf, a.buf, b.buf);
> + }
> +
> switch (setup_git_directory_gently_1(&dir, &gitdir, 1)) {
> case GIT_DIR_EXPLICIT:
> prefix = setup_explicit_git_dir(gitdir.buf, &cwd, &repo_fmt, nongit_ok);
>
next prev parent reply other threads:[~2021-11-09 20:37 UTC|newest]
Thread overview: 31+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-10-14 20:34 [PATCH v4 0/4] cache parent project's gitdir in submodules Emily Shaffer
2021-10-14 20:34 ` [PATCH v4 1/4] t7400-submodule-basic: modernize inspect() helper Emily Shaffer
2021-10-14 20:34 ` [PATCH v4 2/4] introduce submodule.superprojectGitDir record Emily Shaffer
2021-10-14 20:34 ` [PATCH v4 3/4] submodule: record superproject gitdir during absorbgitdirs Emily Shaffer
2021-10-18 23:18 ` Jonathan Tan
2021-10-25 16:14 ` Derrick Stolee
2021-11-04 23:22 ` Emily Shaffer
2021-11-08 1:07 ` Derrick Stolee
2021-11-04 22:09 ` Emily Shaffer
2021-10-14 20:34 ` [PATCH v4 4/4] submodule: record superproject gitdir during 'update' Emily Shaffer
2021-10-25 16:17 ` Derrick Stolee
2021-10-25 16:19 ` [PATCH v4 0/4] cache parent project's gitdir in submodules Derrick Stolee
2021-11-04 23:49 ` [PATCH v5 " Emily Shaffer
2021-11-04 23:49 ` [PATCH v5 1/4] t7400-submodule-basic: modernize inspect() helper Emily Shaffer
2021-11-04 23:49 ` [PATCH v5 2/4] introduce submodule.superprojectGitDir record Emily Shaffer
2021-11-05 7:50 ` Junio C Hamano
2021-11-08 23:16 ` Emily Shaffer
2021-11-04 23:49 ` [PATCH v5 3/4] submodule: record superproject gitdir during absorbgitdirs Emily Shaffer
2021-11-04 23:49 ` [PATCH v5 4/4] submodule: record superproject gitdir during 'update' Emily Shaffer
2021-11-05 4:49 ` Junio C Hamano
2021-11-05 8:43 ` Ævar Arnfjörð Bjarmason
2021-11-08 23:21 ` Emily Shaffer
2021-11-09 0:42 ` Ævar Arnfjörð Bjarmason
2021-11-09 20:36 ` Emily Shaffer [this message]
2021-11-09 21:46 ` Emily Shaffer
2021-11-05 8:51 ` Ævar Arnfjörð Bjarmason
2021-11-08 23:22 ` Emily Shaffer
2021-11-09 1:12 ` Ævar Arnfjörð Bjarmason
2021-11-08 1:24 ` [PATCH v5 0/4] cache parent project's gitdir in submodules Derrick Stolee
2021-11-08 23:11 ` Emily Shaffer
2021-11-09 15:58 ` Derrick Stolee
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: http://vger.kernel.org/majordomo-info.html
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=YYrb4oCaIXIwN/bF@google.com \
--to=emilyshaffer@google.com \
--cc=avarab@gmail.com \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).