git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Shaoxuan Yuan <shaoxuan.yuan02@gmail.com>
To: Derrick Stolee <derrickstolee@github.com>, git@vger.kernel.org
Cc: vdye@github.com
Subject: Re: [PATCH v1 2/2] builtin/grep.c: integrate with sparse index
Date: Thu, 25 Aug 2022 05:06:42 +0800	[thread overview]
Message-ID: <666dc1a3-5f18-c487-6290-44b0646f5724@gmail.com> (raw)
In-Reply-To: <19dea639-389a-7258-e424-4912bde226df@github.com>

On 8/17/2022 10:23 PM, Derrick Stolee wrote:
 > On 8/17/2022 3:56 AM, Shaoxuan Yuan wrote:
 >> Turn on sparse index and remove ensure_full_index().
 >>
 >> Change it to only expands the index when using --sparse.
 >>
 >> The p2000 tests demonstrate a ~99.4% execution time reduction for
 >> `git grep` using a sparse index.
 >>
 >> Test                                           HEAD~1 HEAD
 >> 
-----------------------------------------------------------------------------
 >> 2000.78: git grep --cached bogus (full-v3)     0.019 0.018  (-5.2%)
 >> 2000.79: git grep --cached bogus (full-v4)     0.017 0.016  (-5.8%)
 >> 2000.80: git grep --cached bogus (sparse-v3)   0.29 0.0015 (-99.4%)
 >> 2000.81: git grep --cached bogus (sparse-v4)   0.30 0.0018 (-99.4%)
 >
 > Good results.
 >
 > I think we could get interesting results even with the --sparse
 > option if you go another step further (perhaps as a patch after
 > this one).

OK.

 >>
 >> Optional reading about performance test results
 >> -----------------------------------------------
 >> Notice that because `git-grep` needs to parse blobs in the index, the
 >> index reading time is minuscule comparing to the object parsing time.
 >> And because of this, the p2000 test results cannot clearly reflect the
 >> speedup for index reading: combining with the object parsing time,
 >> the aggregated time difference is extremely close between HEAD~1 and
 >> HEAD.
 >>
 >> Hence, the results presenting here are not directly extracted from the
 >> p2000 test results. Instead, to make the performance difference more
 >> visible, the test command is manually ran with GIT_TRACE2_PERF in the
 >> four repos (full-v3, sparse-v3, full-v4, sparse-v4). The numbers here
 >> are then extracted from the time difference between "region_enter" and
 >> "region_leave" of label "do_read_index".
 >
 > This is a good point, but I don't recommend displaying them as if they
 > were the output of a "./run HEAD~1 HEAD -- p2000-sparse-operations.sh"
 > command. Instead, point out that the performance test does not show a
 > major improvement and instead you have these "Before" and "After" results
 > from testing manually and extracting trace2 regions.

OK.

 >> @@ -519,11 +519,15 @@ static int grep_cache(struct grep_opt *opt,
 >>          strbuf_addstr(&name, repo->submodule_prefix);
 >>      }
 >>
 >> +    prepare_repo_settings(repo);
 >> +    repo->settings.command_requires_full_index = 0;
 >> +
 >
 > The best pattern is to put this in cmd_grep() immediately after parsing
 > options. This guarantees that we don't parse and expand the index in any
 > other code path.

Got it.

 >>      if (repo_read_index(repo) < 0)
 >>          die(_("index file corrupt"));
 >>
 >> -    /* TODO: audit for interaction with sparse-index. */
 >> -    ensure_full_index(repo->index);
 >> +    if (grep_sparse)

A side note: this condition should be `grep_sparse && cached`.

 >> +        ensure_full_index(repo->index);
 >> +
 > As mentioned before, this approach is the simplest way to make the case
 > without --sparse faster, but the case _with_ --sparse will still be slow.
 > The way to fix this would be to modify this portion of the loop:

I'm not sure. If --sparse here means we want to expand the index, it
is expected to be slow (ensure_full_index is slow), isn't it?

 >     if (S_ISREG(ce->ce_mode) &&
 >         match_pathspec(repo->index, pathspec, name.buf, name.len, 0, 
NULL,
 >                S_ISDIR(ce->ce_mode) ||
 >                S_ISGITLINK(ce->ce_mode))) {
 >
 > by adding an initial case
 >
 >     if (S_ISSPARSEDIR(ce->ce_mode)) {
 >         hit |= grep_tree(opt, &ce->oid, name.buf, 0, name.buf);
 >     } else if (S_ISREG(ce->ce_mode) &&
 >            match_pathspec(repo->index, pathspec, name.buf, name.len, 
0, NULL,
 >                   S_ISDIR(ce->ce_mode) ||
 >                   S_ISGITLINK(ce->ce_mode))) {
 >
 > and appropriately implement "grep_tree()" to walk the tree at ce->oid to
 > find all matching files within, then call grep_oid() for each of those
 > paths.

Tree walking is faster, yes. So, for this approach to be faster, I
think you are suggesting we should not expand the index, even when
--sparse is given? Instead, we just rely on the tree walking logic,
right?

 > Bonus points if you recognize that the pathspec uses prefix checks that
 > allow pruning the search space and not parsing all of the trees
 > recursively. But that can definitely be delayed for a future enhancement.

OK.

 >> +test_expect_success 'grep expands index using --sparse' '
 >> +    init_repos &&
 >> +
 >> +    # With --sparse and --cached, do not ignore sparse entries and
 >> +    # expand the index.
 >> +    test_all_match git grep --sparse --cached a
 >> +'
 >
 > Here, you're testing that the behavior matches, but not testing that the
 > index expands. (It does describe why you didn't include it in the later
 > ensure_not_expanded tests.)

I was trying to "imply" the index expansion because of the behavior
match. Yes, I think the test should be more explicit.

 >> +
 >> +test_expect_success 'grep is not expanded' '
 >> +    init_repos &&
 >> +
 >> +    ensure_not_expanded grep a &&
 >> +    ensure_not_expanded grep a -- deep/* &&
 >> +    # grep does not match anything per se, so ! is used
 >
 > It can be helpful to say why:
 >
 >     # All files within the folder1/* pathspec are sparse,
 >     # so this command does not find any matches.

OK.

--
Thanks,
Shaoxuan



  reply	other threads:[~2022-08-24 21:06 UTC|newest]

Thread overview: 69+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-08-17  7:56 [PATCH v1 0/2] grep: integrate with sparse index Shaoxuan Yuan
2022-08-17  7:56 ` [PATCH v1 1/2] builtin/grep.c: add --sparse option Shaoxuan Yuan
2022-08-17 14:12   ` Derrick Stolee
2022-08-17 17:13     ` Junio C Hamano
2022-08-17 17:34       ` Victoria Dye
2022-08-17 17:43         ` Derrick Stolee
2022-08-17 18:47           ` Junio C Hamano
2022-08-17 17:37     ` Elijah Newren
2022-08-24 18:20     ` Shaoxuan Yuan
2022-08-24 19:08       ` Derrick Stolee
2022-08-17  7:56 ` [PATCH v1 2/2] builtin/grep.c: integrate with sparse index Shaoxuan Yuan
2022-08-17 14:23   ` Derrick Stolee
2022-08-24 21:06     ` Shaoxuan Yuan [this message]
2022-08-25  0:39       ` Derrick Stolee
2022-08-17 13:46 ` [PATCH v1 0/2] grep: " Derrick Stolee
2022-08-29 23:28 ` [PATCH v2 " Shaoxuan Yuan
2022-08-29 23:28   ` [PATCH v2 1/2] builtin/grep.c: add --sparse option Shaoxuan Yuan
2022-08-29 23:28   ` [PATCH v2 2/2] builtin/grep.c: integrate with sparse index Shaoxuan Yuan
2022-08-30 13:45     ` Derrick Stolee
2022-09-01  4:57 ` [PATCH v3 0/3] grep: " Shaoxuan Yuan
2022-09-01  4:57   ` [PATCH v3 1/3] builtin/grep.c: add --sparse option Shaoxuan Yuan
2022-09-01  4:57   ` [PATCH v3 2/3] builtin/grep.c: integrate with sparse index Shaoxuan Yuan
2022-09-01  4:57   ` [PATCH v3 3/3] builtin/grep.c: walking tree instead of expanding index with --sparse Shaoxuan Yuan
2022-09-01 17:03     ` Derrick Stolee
2022-09-01 18:31       ` Shaoxuan Yuan
2022-09-01 17:17     ` Junio C Hamano
2022-09-01 17:27       ` Junio C Hamano
2022-09-01 22:49         ` Shaoxuan Yuan
2022-09-01 22:36       ` Shaoxuan Yuan
2022-09-02  3:28     ` Victoria Dye
2022-09-02 18:47       ` Shaoxuan Yuan
2022-09-03  0:36 ` [PATCH v4 0/3] grep: integrate with sparse index Shaoxuan Yuan
2022-09-03  0:36   ` [PATCH v4 1/3] builtin/grep.c: add --sparse option Shaoxuan Yuan
2022-09-03  0:36   ` [PATCH v4 2/3] builtin/grep.c: integrate with sparse index Shaoxuan Yuan
2022-09-03  0:36   ` [PATCH v4 3/3] builtin/grep.c: walking tree instead of expanding index with --sparse Shaoxuan Yuan
2022-09-03  4:39     ` Junio C Hamano
2022-09-08  0:24       ` Shaoxuan Yuan
2022-09-08  0:18 ` [PATCH v5 0/3] grep: integrate with sparse index Shaoxuan Yuan
2022-09-08  0:18   ` [PATCH v5 1/3] builtin/grep.c: add --sparse option Shaoxuan Yuan
2022-09-10  1:07     ` Victoria Dye
2022-09-14  6:08     ` Elijah Newren
2022-09-15  2:57       ` Junio C Hamano
2022-09-18  2:14         ` Elijah Newren
2022-09-18 19:52           ` Victoria Dye
2022-09-19  1:23             ` Junio C Hamano
2022-09-19  4:27             ` Shaoxuan Yuan
2022-09-19 11:03             ` Ævar Arnfjörð Bjarmason
2022-09-20  7:13             ` Elijah Newren
2022-09-17  3:34       ` Shaoxuan Yuan
2022-09-18  4:24         ` Elijah Newren
2022-09-19  4:13           ` Shaoxuan Yuan
2022-09-17  3:45       ` Shaoxuan Yuan
2022-09-08  0:18   ` [PATCH v5 2/3] builtin/grep.c: integrate with sparse index Shaoxuan Yuan
2022-09-08  0:18   ` [PATCH v5 3/3] builtin/grep.c: walking tree instead of expanding index with --sparse Shaoxuan Yuan
2022-09-08 17:59     ` Junio C Hamano
2022-09-08 20:46       ` Derrick Stolee
2022-09-08 20:56         ` Junio C Hamano
2022-09-08 21:06           ` Shaoxuan Yuan
2022-09-09 12:49           ` Derrick Stolee
2022-09-13 17:23         ` Junio C Hamano
2022-09-10  2:04     ` Victoria Dye
2022-09-23  4:18 ` [PATCH v6 0/1] grep: integrate with sparse index Shaoxuan Yuan
2022-09-23  4:18   ` [PATCH v6 1/1] builtin/grep.c: " Shaoxuan Yuan
2022-09-23 16:40     ` Junio C Hamano
2022-09-23 16:58     ` Junio C Hamano
2022-09-26 17:28       ` Junio C Hamano
2022-09-23 14:13   ` [PATCH v6 0/1] grep: " Derrick Stolee
2022-09-23 16:01   ` Victoria Dye
2022-09-23 17:08     ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=666dc1a3-5f18-c487-6290-44b0646f5724@gmail.com \
    --to=shaoxuan.yuan02@gmail.com \
    --cc=derrickstolee@github.com \
    --cc=git@vger.kernel.org \
    --cc=vdye@github.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).