git@vger.kernel.org list mirror (unofficial, one of many)
* [PATCH] hash-object: don't pointlessly zlib compress without -w
@ 2019-05-20 22:29 Ævar Arnfjörð Bjarmason
  2019-05-22  5:32 ` Jeff King
  0 siblings, 1 reply; 2+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-05-20 22:29 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Ævar Arnfjörð Bjarmason

When hash-object hashes something the size of core.bigFileThreshold or
larger (512MB by default) it'll be streamed through
stream_to_pack().

That streaming code, added in 568508e765 ("bulk-checkin: replace
fast-import based implementation", 2011-10-28), would compress the file
with zlib, but was oblivious to whether the content would actually be
written out to disk, which isn't the case unless hash-object is called
with the "-w" option.

Hashing is much slower if we need to compress the content, so let's
check if the HASH_WRITE_OBJECT flag has been given.

An accompanying perf test shows how much this improves things. With
CFLAGS=-O3 and OPENSSL_SHA1=Y the relevant change is (manually
reformatted to avoid long lines):

    1007.6: 'git hash-object <file>' with threshold=32M
        -> 1.57(1.55+0.01)   0.09(0.09+0.00) -94.3%
    1007.7: 'git hash-object --stdin < <file>' with threshold=32M
        -> 1.57(1.57+0.00)   0.09(0.07+0.01) -94.3%
    1007.8: 'echo <file> | git hash-object --stdin-paths' threshold=32M
        -> 1.59(1.56+0.00)   0.09(0.08+0.00) -94.3%

The same tests with "-w" still take that long, since those still need
to zlib compress the object. With the sha1collisiondetection library
(our default) the improvement is smaller, since the hashing itself is
slower; the corresponding numbers are:

    1.71(1.65+0.01)   0.19(0.18+0.01) -88.9%
    1.70(1.66+0.02)   0.19(0.19+0.00) -88.8%
    1.69(1.66+0.00)   0.19(0.18+0.00) -88.8%
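Skipping compression must not change the resulting object id; only the
deflate step is dropped. That can be sanity-checked from the command
line (a throwaway check, not part of this patch; assumes git and dd are
available):

```shell
# Force the streaming (bulk-checkin) path for a small file by lowering
# core.bigFileThreshold, and check that it yields the same object id
# as the regular code path.
git init -q demo && cd demo
dd if=/dev/zero of=big.file bs=1024 count=64 2>/dev/null
regular=$(git hash-object big.file)
streamed=$(git -c core.bigFileThreshold=4k hash-object big.file)
test "$regular" = "$streamed" && echo "object ids match"
```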

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 bulk-checkin.c              |  3 ++-
 t/perf/p1007-hash-object.sh | 53 +++++++++++++++++++++++++++++++++++++
 2 files changed, 55 insertions(+), 1 deletion(-)
 create mode 100755 t/perf/p1007-hash-object.sh

diff --git a/bulk-checkin.c b/bulk-checkin.c
index 39ee7d6107..a26126ee76 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -105,8 +105,9 @@ static int stream_to_pack(struct bulk_checkin_state *state,
 	int status = Z_OK;
 	int write_object = (flags & HASH_WRITE_OBJECT);
 	off_t offset = 0;
+	int level = write_object ? pack_compression_level : Z_NO_COMPRESSION;
 
-	git_deflate_init(&s, pack_compression_level);
+	git_deflate_init(&s, level);
 
 	hdrlen = encode_in_pack_object_header(obuf, sizeof(obuf), type, size);
 	s.next_out = obuf + hdrlen;
diff --git a/t/perf/p1007-hash-object.sh b/t/perf/p1007-hash-object.sh
new file mode 100755
index 0000000000..8df6dc59a5
--- /dev/null
+++ b/t/perf/p1007-hash-object.sh
@@ -0,0 +1,53 @@
+#!/bin/sh
+
+test_description="Tests performance of hash-object"
+. ./perf-lib.sh
+
+test_perf_fresh_repo
+
+test_lazy_prereq SHA1SUM_AND_SANE_DD_AND_URANDOM '
+	>empty &&
+	sha1sum empty >empty.sha1sum &&
+	grep -q -w da39a3ee5e6b4b0d3255bfef95601890afd80709 empty.sha1sum &&
+	dd if=/dev/urandom of=random.test bs=1024 count=1 &&
+	stat -c %s random.test >random.size &&
+	grep -q -x 1024 random.size
+'
+
+if test_have_prereq !SHA1SUM_AND_SANE_DD_AND_URANDOM
+then
+	skip_all='failed prereq check for sha1sum/dd/stat'
+	test_perf 'dummy p1007 test (skipped all tests)' 'true'
+	test_done
+fi
+
+test_expect_success 'setup 64MB file.random file' '
+	dd if=/dev/urandom of=file.random count=$((64*1024)) bs=1024
+'
+
+test_perf 'sha1sum(1) on file.random (for comparison)' '
+	sha1sum file.random
+'
+
+for threshold in 32M 64M
+do
+	for write in '' ' -w'
+	do
+		for literally in ' --literally -t commit' ''
+		do
+			test_perf "'git hash-object$write$literally <file>' with threshold=$threshold" "
+				git -c core.bigFileThreshold=$threshold hash-object$write$literally file.random
+			"
+
+			test_perf "'git hash-object$write$literally --stdin < <file>' with threshold=$threshold" "
+				git -c core.bigFileThreshold=$threshold hash-object$write$literally --stdin <file.random
+			"
+
+			test_perf "'echo <file> | git hash-object$write$literally --stdin-paths' threshold=$threshold" "
+				echo file.random | git -c core.bigFileThreshold=$threshold hash-object$write$literally --stdin-paths
+			"
+		done
+	done
+done
+
+test_done
-- 
2.21.0.1020.gf2820cf01a



* Re: [PATCH] hash-object: don't pointlessly zlib compress without -w
  2019-05-20 22:29 [PATCH] hash-object: don't pointlessly zlib compress without -w Ævar Arnfjörð Bjarmason
@ 2019-05-22  5:32 ` Jeff King
  0 siblings, 0 replies; 2+ messages in thread
From: Jeff King @ 2019-05-22  5:32 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: git, Junio C Hamano

On Tue, May 21, 2019 at 12:29:32AM +0200, Ævar Arnfjörð Bjarmason wrote:

> That streaming code, added in 568508e765 ("bulk-checkin: replace
> fast-import based implementation", 2011-10-28), would compress the file
> with zlib, but was oblivious to whether the content would actually be
> written out to disk, which isn't the case unless hash-object is called
> with the "-w" option.

Yeah, this seems like an obvious and easy improvement. I'd guess that
not many people use hash-object without "-w", so nobody ever really
cared much.

> diff --git a/bulk-checkin.c b/bulk-checkin.c
> index 39ee7d6107..a26126ee76 100644
> --- a/bulk-checkin.c
> +++ b/bulk-checkin.c
> @@ -105,8 +105,9 @@ static int stream_to_pack(struct bulk_checkin_state *state,
>  	int status = Z_OK;
>  	int write_object = (flags & HASH_WRITE_OBJECT);
>  	off_t offset = 0;
> +	int level = write_object ? pack_compression_level : Z_NO_COMPRESSION;
>  
> -	git_deflate_init(&s, pack_compression_level);
> +	git_deflate_init(&s, level);

Hmm. It really seems like if we're not writing, that we could avoid this
whole pack-encoding business altogether, and just stream the contents
through the hash, no matter the size. E.g., we're paying extra here to
compute a pack crc32 that nobody cares about, not to mention the zlib
crc.

That said, this is just not that common a case, and you've regained the
vast majority of the performance with this simple change. So it may not
be worth pursuing further.
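For what such a hash-only path would compute: the object id is just
SHA-1 over the loose-object header plus the raw contents, with no
deflate anywhere; e.g. with coreutils (an illustration, not git's
code):

```shell
# An object id is SHA-1("<type> <size>\0" + contents); no zlib needed.
printf 'hello\n' >file
oid=$(printf 'blob %d\0' "$(wc -c <file)" | cat - file |
	sha1sum | cut -d' ' -f1)
echo "$oid"   # matches: git hash-object file
```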

> +test_lazy_prereq SHA1SUM_AND_SANE_DD_AND_URANDOM '
> +	>empty &&
> +	sha1sum empty >empty.sha1sum &&
> +	grep -q -w da39a3ee5e6b4b0d3255bfef95601890afd80709 empty.sha1sum &&
> +	dd if=/dev/urandom of=random.test bs=1024 count=1 &&
> +	stat -c %s random.test >random.size &&
> +	grep -q -x 1024 random.size
> +'

I don't think this needs to be cryptographically random. You could just
use "test-tool genrandom" instead, which gets rid of the need for dd and
/dev/urandom (and as a bonus, is more deterministic since it's a seeded
LCG).
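Concretely, the setup could then lose both the prereq and dd; an
untested sketch of the same perf-script step (relies on test-tool being
built):

```shell
test_expect_success 'setup 64MB file.random file' '
	test-tool genrandom seed $((64*1024*1024)) >file.random
'
```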

> +test_perf 'sha1sum(1) on file.random (for comparison)' '
> +	sha1sum file.random
> +'

I'm not sure this is worth the prereq it entails. Who really cares how
fast sha1sum is? What we generally care about in the perf suite is how
Git speeds compare. Primarily between versions (to show off
improvements, or find regressions), but sometimes between two builds
(but sadly the perf output is not good at comparing those numbers
automatically).

I know this sets a baseline, but it feels a bit like apples to oranges.

> +for threshold in 32M 64M
> +do
> +	for write in '' ' -w'
> +	do
> +		for literally in ' --literally -t commit' ''

Is this "-t commit" worth it? I guess it proves that --literally is
being used, since we'd choke otherwise. But for a perf test, it seems
like you'd want to change as few variables as possible.

> +			test_perf "'git hash-object$write$literally --stdin < <file>' with threshold=$threshold" "

This test title violates our shell style (whitespace after "<"). ;)

(Yes, I know it was probably to avoid "<<").

> +			test_perf "'echo <file> | git hash-object$write$literally --stdin-paths' threshold=$threshold" "
> +				echo file.random | git -c core.bigFileThreshold=$threshold hash-object$write$literally --stdin-paths

You have to bunch up the option variables in the title to get the
whitespace right. But in the command itself, you can do:

  hash-object $write $literally ...

and the shell's whitespace handling will make it work. Maybe worth it to
make the source a bit more readable.

-Peff
