git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: ZheNing Hu <adlternative@gmail.com>
Cc: ZheNing Hu via GitGitGadget <gitgitgadget@gmail.com>,
	Git List <git@vger.kernel.org>,
	Junio C Hamano <gitster@pobox.com>,
	Christian Couder <christian.couder@gmail.com>,
	Hariom Verma <hariom18599@gmail.com>,
	Bagas Sanjaya <bagasdotme@gmail.com>, Jeff King <peff@peff.net>
Subject: Re: [PATCH 12/15] [GSOC] cat-file: reuse ref-filter logic
Date: Sat, 03 Jul 2021 15:37:33 +0200	[thread overview]
Message-ID: <87eecf8ork.fsf@evledraar.gmail.com> (raw)
In-Reply-To: <CAOLTT8RdujpQ2uKEWPyG0HGkUz_EsONw3hEZ6YAhpmQc5rgohA@mail.gmail.com>


On Sat, Jul 03 2021, ZheNing Hu wrote:

> Ævar Arnfjörð Bjarmason <avarab@gmail.com> 于2021年7月2日周五 下午9:39写道:
>>
>>
>> On Thu, Jul 01 2021, ZheNing Hu via GitGitGadget wrote:
>>
>> > From: ZheNing Hu <adlternative@gmail.com>
>> >
>> > In order to let cat-file use ref-filter logic, let's do the
>> > following:
>> >
>> > 1. Change the type of member `format` in struct `batch_options`
>> > to `ref_format`, we will pass it to ref-filter later.
>> > 2. Let `batch_objects()` add atoms to format, and use
>> > `verify_ref_format()` to check atoms.
>> > 3. Use `format_ref_array_item()` in `batch_object_write()` to
>> > get the formatted data corresponding to the object. If the
>> > return value of `format_ref_array_item()` is equals to zero,
>> > use `batch_write()` to print object data; else if the return
>> > value is less than zero, use `die()` to print the error message
>> > and exit; else if return value is greater than zero, only print
>> > the error message, but don't exit.
>> > 4. Use free_ref_array_item_value() to free ref_array_item's
>> > value.
>> >
>> > Most of the atoms in `for-each-ref --format` are now supported,
>> > such as `%(tree)`, `%(parent)`, `%(author)`, `%(tagger)`, `%(if)`,
>> > `%(then)`, `%(else)`, `%(end)`. But these atoms will be rejected:
>> > `%(refname)`, `%(symref)`, `%(upstream)`, `%(push)`, `%(worktreepath)`,
>> > `%(flag)`, `%(HEAD)`, because these atoms are unique to those objects
>> > that pointed to by a ref, "for-each-ref"'s family can naturally use
>> > these atoms, but not all objects are pointed to be a ref, so "cat-file"
>> > will not be able to use them.
>> >
>> > The performance for `git cat-file --batch-all-objects
>> > --batch-check` on the Git repository itself with performance
>> > testing tool `hyperfine` changes from 669.4 ms ±  31.1 ms to
>> > 1.134 s ±  0.063 s.
>> >
>> > The performance for `git cat-file --batch-all-objects --batch
>> >>/dev/null` on the Git repository itself with performance testing
>> > tool `time` change from "27.37s user 0.29s system 98% cpu 28.089
>> > total" to "33.69s user 1.54s system 87% cpu 40.258 total".
>>
>> This new feature is really nice, but that's a really bad performance
>> regression. A lot of software in the wild relies on "cat-file --batch"
>> to be *the* performant interface to git for mass-extrction of object
>> data.
>>
>
> Thanks, this performance is indeed worrying.
>
>> That's in increase of ~70% and ~20%, respectively. Have you dug into
>> (e.g. with a profiler) where we're now spending all this time?
>
> See this two attachment about performance flame graph,
> oid_object_info_extended() in get_object() is the key to performance
> limitations.

Most of the problem, although this may not entirely fix the performance
regression, is that you're either looking up everything twice now, or
taking a much more expensive path.

I think using gprof is probably much more handy here. See [1. I did a
`git rev-list --all >rla` and ran that piped into 'git cat-file --batch'
with/without your pathces. Results:
    
    $ gprof ./git-master ./gmon-master.out | head -n 10
    Flat profile:
    
    Each sample counts as 0.01 seconds.
      %   cumulative   self              self     total           
     time   seconds   seconds    calls  ms/call  ms/call  name    
     14.29      0.02     0.02   475186     0.00     0.00  nth_packed_object_offset
     14.29      0.04     0.02   237835     0.00     0.00  hash_to_hex_algop_r
      7.14      0.05     0.01  5220425     0.00     0.00  hashcmp_algop
      7.14      0.06     0.01  4757120     0.00     0.00  hex2chr
      7.14      0.07     0.01  1732023     0.00     0.00  find_entry_ptr

And:
    
    $ gprof ./git-new ./gmon-new.out |head -n 10
    Flat profile:
    
    Each sample counts as 0.01 seconds.
      %   cumulative   self              self     total
     time   seconds   seconds    calls  ms/call  ms/call  name
      7.32      0.06     0.06   764570     0.00     0.00  lookup_object
      7.32      0.12     0.06   237835     0.00     0.00  parse_commit_date
      4.88      0.16     0.04   712779     0.00     0.00  nth_packed_object_offset
      3.66      0.19     0.03   964574     0.00     0.00  bsearch_hash
      3.66      0.22     0.03   237835     0.00     0.00  grab_sub_body_contents

If you e.g. make lookup_object() simply die when it's called you'll see
that before we don't call it at all, after your patch it's our #1
function.

Before when we have the simplest case of writing out an object this is
our callstack:
    
    (gdb) bt
    #0  batch_write (opt=0x7fffffffde50, data=0x555555ab9470, len=52) at builtin/cat-file.c:298
    #1  0x000055555558b160 in batch_object_write (obj_name=0x55555597cef0 "504fe6b39f7747be6427f28d9ca97decf5e6cecf", scratch=0x7fffffffd8c0,
        opt=0x7fffffffde50, data=0x7fffffffd7f0) at builtin/cat-file.c:375
    #2  0x000055555558b36e in batch_one_object (obj_name=0x55555597cef0 "504fe6b39f7747be6427f28d9ca97decf5e6cecf", scratch=0x7fffffffd8c0, opt=0x7fffffffde50,
        data=0x7fffffffd7f0) at builtin/cat-file.c:431
    #3  0x000055555558b8ed in batch_objects (opt=0x7fffffffde50) at builtin/cat-file.c:588
    #4  0x000055555558c0d3 in cmd_cat_file (argc=0, argv=0x7fffffffe1e0, prefix=0x0) at builtin/cat-file.c:716
    #5  0x0000555555573adb in run_builtin (p=0x555555941870 <commands+240>, argc=2, argv=0x7fffffffe1e0) at git.c:461
    #6  0x0000555555573f00 in handle_builtin (argc=2, argv=0x7fffffffe1e0) at git.c:714
    #7  0x0000555555574182 in run_argv (argcp=0x7fffffffe08c, argv=0x7fffffffe080) at git.c:781
    #8  0x000055555557460f in cmd_main (argc=2, argv=0x7fffffffe1e0) at git.c:912
    #9  0x000055555565b508 in main (argc=3, argv=0x7fffffffe1d8) at common-main.c:52

After (well, here we're not even to writing it, just looking it up), the
BUG() is my addition:
    
    (gdb) bt
    #0  BUG_fl (file=0x5555558ade71 "object.c", line=91, fmt=0x5555558ade6e "yo") at usage.c:290
    #1  0x00005555557441ca in lookup_object (r=0x5555559755c0 <the_repo>, oid=0x555555975160 <oi>) at object.c:91
    #2  0x000055555569dfc8 in lookup_commit (r=0x5555559755c0 <the_repo>, oid=0x555555975160 <oi>) at commit.c:62
    #3  0x00005555557445f5 in parse_object_buffer (r=0x5555559755c0 <the_repo>, oid=0x555555975160 <oi>, type=OBJ_COMMIT, size=342, buffer=0x555555ab48e0,
        eaten_p=0x7fffffffd36c) at object.c:215
    #4  0x0000555555785094 in get_object (ref=0x7fffffffd6b0, deref=0, obj=0x7fffffffd520, oi=0x555555975160 <oi>, err=0x7fffffffd860) at ref-filter.c:1803
    #5  0x0000555555785c99 in populate_value (ref=0x7fffffffd6b0, err=0x7fffffffd860) at ref-filter.c:2030
    #6  0x0000555555785d7b in get_ref_atom_value (ref=0x7fffffffd6b0, atom=0, v=0x7fffffffd628, err=0x7fffffffd860) at ref-filter.c:2064
    #7  0x000055555578742f in format_ref_array_item (info=0x7fffffffd6b0, format=0x7fffffffde30, final_buf=0x7fffffffd880, error_buf=0x7fffffffd860)
        at ref-filter.c:2659
    #8  0x000055555558ab1c in batch_object_write (obj_name=0x55555597e3f0 "504fe6b39f7747be6427f28d9ca97decf5e6cecf", scratch=0x7fffffffd880,
        err=0x7fffffffd860, opt=0x7fffffffde10, data=0x7fffffffd800) at builtin/cat-file.c:225
    #9  0x000055555558ade5 in batch_one_object (obj_name=0x55555597e3f0 "504fe6b39f7747be6427f28d9ca97decf5e6cecf", scratch=0x7fffffffd880, err=0x7fffffffd860,
        opt=0x7fffffffde10, data=0x7fffffffd800) at builtin/cat-file.c:298
    #10 0x000055555558b394 in batch_objects (batch=0x7fffffffde10, options=0x7fffffffd900) at builtin/cat-file.c:458
    #11 0x000055555558bbd5 in cmd_cat_file (argc=0, argv=0x7fffffffe1d0, prefix=0x0) at builtin/cat-file.c:585
    #12 0x0000555555573adb in run_builtin (p=0x555555942850 <commands+240>, argc=2, argv=0x7fffffffe1d0) at git.c:461
    #13 0x0000555555573f00 in handle_builtin (argc=2, argv=0x7fffffffe1d0) at git.c:714
    #14 0x0000555555574182 in run_argv (argcp=0x7fffffffe07c, argv=0x7fffffffe070) at git.c:781
    #15 0x000055555557460f in cmd_main (argc=2, argv=0x7fffffffe1d0) at git.c:912
    #16 0x000055555565afc1 in main (argc=3, argv=0x7fffffffe1c8) at common-main.c:52

I.e. before in batch_object_write() we could use a cheap path of doing
oid_object_info_extended() and directly emitting the content. With your
version we're all the way down to parse_object_buffer(). Meaning that
we're going to be creating a "struct commit" or whatever if we're
looking at a commit, just to print out the raw contents.

I think the best next step here is to add a t/perf/t1006-cat-file.sh
test to stress these various cases, i.e. a plain --batch without a
format, with format, with --batch-all-objects etc. Try to then run that
on each of your commits against the preceding one and see if/when you
have regressions.

Aside from any double-lookups etc, the problem is also that you're
trying to handle a really general case (e.g. with textconv) in a
codepath that needs to be really fast. If anything we should be
inserting some more more optimization shortcuts for common cases into
it. E.g. I was able to trivially speed up 'cat-file --batch-check' on
"master" by hardcoding a path for our default format (patch at the end
of this mail):
    
    # passed all 2 test(s)
    1..2
    Test                             origin/master     HEAD
    -------------------------------------------------------------------------
    1006.2: cat-file --batch-check   0.60(0.37+0.23)   0.35(0.33+0.02) -41.7%

Anything that needs to handle general format patching is going to be
slower. I think /some/ performance regression if we're using something
that's not just the current light strbuf_expand() probably can't be
avoided, but we could/should try to make up the difference at least for
the common case of --batch or --batch-check without --textconv and
perhaps hardcode (and document that it's faster) a path for the default
formats).

1. https://sourceware.org/binutils/docs/gprof/Output.html

diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 5ebf13359e8..775b7dd1b01 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -360,6 +360,11 @@ static void batch_object_write(const char *obj_name,
 			       struct batch_options *opt,
 			       struct expand_data *data)
 {
+	int default_format = !strcmp(opt->format, "%(objectname) %(objecttype) %(objectsize)");
+	struct strbuf type_name = STRBUF_INIT;
+	if (default_format)
+		data->info.type_name = &type_name;
+		
 	if (!data->skip_object_info &&
 	    oid_object_info_extended(the_repository, &data->oid, &data->info,
 				     OBJECT_INFO_LOOKUP_REPLACE) < 0) {
@@ -369,14 +374,20 @@ static void batch_object_write(const char *obj_name,
 		return;
 	}
 
-	strbuf_reset(scratch);
-	strbuf_expand(scratch, opt->format, expand_format, data);
-	strbuf_addch(scratch, '\n');
-	batch_write(opt, scratch->buf, scratch->len);
-
-	if (opt->print_contents) {
-		print_object_or_die(opt, data);
-		batch_write(opt, "\n", 1);
+	if (default_format && !opt->print_contents) {
+		fprintf(stdout, "%s %s %"PRIuMAX"\n", oid_to_hex(&data->oid),
+			data->info.type_name->buf, 
+			(uintmax_t)*data->info.sizep);
+	} else {
+		strbuf_reset(scratch);
+		strbuf_expand(scratch, opt->format, expand_format, data);
+		strbuf_addch(scratch, '\n');
+		batch_write(opt, scratch->buf, scratch->len);
+
+		if (opt->print_contents) {
+			print_object_or_die(opt, data);
+			batch_write(opt, "\n", 1);
+		}
 	}
 }
 
diff --git a/t/perf/p1006-cat-file.sh b/t/perf/p1006-cat-file.sh
new file mode 100755
index 00000000000..a295d334715
--- /dev/null
+++ b/t/perf/p1006-cat-file.sh
@@ -0,0 +1,16 @@
+#!/bin/sh
+
+test_description='Basic sort performance tests'
+. ./perf-lib.sh
+
+test_perf_default_repo
+
+test_expect_success 'setup' '
+	git rev-list --all >rla
+'
+
+test_perf 'cat-file --batch-check' '
+	git cat-file --batch-check <rla
+'
+
+test_done

  reply	other threads:[~2021-07-03 14:18 UTC|newest]

Thread overview: 39+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-07-01 16:07 [PATCH 00/15] [GSOC] cat-file: reuse ref-filter logic ZheNing Hu via GitGitGadget
2021-07-01 16:07 ` [PATCH 01/15] [GSOC] ref-filter: add obj-type check in grab contents ZheNing Hu via GitGitGadget
2021-07-01 16:08 ` [PATCH 02/15] [GSOC] ref-filter: add %(raw) atom ZheNing Hu via GitGitGadget
2021-07-02 13:22   ` Ævar Arnfjörð Bjarmason
2021-07-03  5:14     ` ZheNing Hu
2021-07-01 16:08 ` [PATCH 03/15] [GSOC] ref-filter: --format=%(raw) re-support --perl ZheNing Hu via GitGitGadget
2021-07-02 13:29   ` Ævar Arnfjörð Bjarmason
2021-07-03  5:14     ` ZheNing Hu
2021-07-01 16:08 ` [PATCH 04/15] [GSOC] ref-filter: use non-const ref_format in *_atom_parser() ZheNing Hu via GitGitGadget
2021-07-01 16:08 ` [PATCH 05/15] [GSOC] ref-filter: add %(rest) atom ZheNing Hu via GitGitGadget
2021-07-01 16:08 ` [PATCH 06/15] [GSOC] ref-filter: pass get_object() return value to their callers ZheNing Hu via GitGitGadget
2021-07-01 16:08 ` [PATCH 07/15] [GSOC] ref-filter: introduce free_ref_array_item_value() function ZheNing Hu via GitGitGadget
2021-07-01 16:08 ` [PATCH 08/15] [GSOC] ref-filter: add cat_file_mode in struct ref_format ZheNing Hu via GitGitGadget
2021-07-02 13:32   ` Ævar Arnfjörð Bjarmason
2021-07-02 19:28     ` Eric Sunshine
2021-07-02 22:11       ` Christian Couder
2021-07-03  5:55         ` ZheNing Hu
2021-07-05  7:17           ` Ævar Arnfjörð Bjarmason
2021-07-01 16:08 ` [PATCH 09/15] [GSOC] ref-filter: modify the error message and value in get_object ZheNing Hu via GitGitGadget
2021-07-01 16:08 ` [PATCH 10/15] [GSOC] cat-file: add has_object_file() check ZheNing Hu via GitGitGadget
2021-07-02 13:34   ` Ævar Arnfjörð Bjarmason
2021-07-03  5:50     ` ZheNing Hu
2021-07-01 16:08 ` [PATCH 11/15] [GSOC] cat-file: change batch_objects parameter name ZheNing Hu via GitGitGadget
2021-07-01 16:08 ` [PATCH 12/15] [GSOC] cat-file: reuse ref-filter logic ZheNing Hu via GitGitGadget
2021-07-02 13:36   ` Ævar Arnfjörð Bjarmason
2021-07-02 13:45     ` Christian Couder
2021-07-03 11:45     ` ZheNing Hu
2021-07-03 13:37       ` Ævar Arnfjörð Bjarmason [this message]
2021-07-04 11:10         ` ZheNing Hu
2021-07-05  7:18           ` Ævar Arnfjörð Bjarmason
2021-07-03 14:17       ` ZheNing Hu
2021-07-01 16:08 ` [PATCH 13/15] [GSOC] cat-file: reuse err buf in batch_object_write() ZheNing Hu via GitGitGadget
2021-07-01 16:08 ` [PATCH 14/15] [GSOC] cat-file: re-implement --textconv, --filters options ZheNing Hu via GitGitGadget
2021-07-01 19:55   ` Junio C Hamano
2021-07-01 20:11   ` Junio C Hamano
2021-07-02 12:46     ` ZheNing Hu
2021-07-02 15:27       ` Junio C Hamano
2021-07-03  6:17         ` ZheNing Hu
2021-07-01 16:08 ` [PATCH 15/15] [GSOC] ref-filter: remove grab_oid() function ZheNing Hu via GitGitGadget

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87eecf8ork.fsf@evledraar.gmail.com \
    --to=avarab@gmail.com \
    --cc=adlternative@gmail.com \
    --cc=bagasdotme@gmail.com \
    --cc=christian.couder@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=gitster@pobox.com \
    --cc=hariom18599@gmail.com \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).