git@vger.kernel.org mailing list mirror (one of many)
From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: git@vger.kernel.org
Cc: "Junio C Hamano" <gitster@pobox.com>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Subject: [PATCH] hash-object: don't pointlessly zlib compress without -w
Date: Tue, 21 May 2019 00:29:32 +0200
Message-ID: <20190520222932.22843-1-avarab@gmail.com>

When hash-object hashes something the size of core.bigFileThreshold or
larger (512MB by default) it'll be streamed through
stream_to_pack().

That function, added in 568508e765 ("bulk-checkin: replace fast-import
based implementation", 2011-10-28), compresses the content with zlib
regardless of whether the result will actually be written out to
disk, which only happens when hash-object is called with the "-w"
option.

Hashing is much slower if we also need to compress the content, so
let's check whether the HASH_WRITE_OBJECT flag has been given, and
initialize zlib with Z_NO_COMPRESSION when it hasn't.
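
To illustrate why compression is wasted work when nothing is written:
the object ID depends only on the uncompressed content. A minimal
sketch, assuming a SHA-1 repository and a sha1sum(1) on PATH
(blob_oid is a hypothetical helper, not a git command), that computes
a blob's ID with no zlib involved:

```shell
#!/bin/sh
# blob_oid <file>: print the SHA-1 object ID git would assign to <file>
# as a blob. Hypothetical helper for illustration; needs sha1sum(1).
blob_oid () {
	size=$(( $(wc -c <"$1") )) &&
	{
		# The hash covers a "blob <size>\0" header plus the raw bytes.
		printf 'blob %d\0' "$size" &&
		cat "$1"
	} | sha1sum | cut -d' ' -f1
}

printf 'hello\n' >hello.txt
blob_oid hello.txt	# ce013625030ba8dba906f756967f9e9ca394464a
```

In a SHA-1 repository this should print the same ID as
"git hash-object --no-filters hello.txt".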

An accompanying perf test shows how much this improves things. With
CFLAGS=-O3 and OPENSSL_SHA1=Y the relevant change is (manually
reformatted to avoid long lines):

    1007.6: 'git hash-object <file>' with threshold=32M
        -> 1.57(1.55+0.01)   0.09(0.09+0.00) -94.3%
    1007.7: 'git hash-object --stdin < <file>' with threshold=32M
        -> 1.57(1.57+0.00)   0.09(0.07+0.01) -94.3%
    1007.8: 'echo <file> | git hash-object --stdin-paths' threshold=32M
        -> 1.59(1.56+0.00)   0.09(0.08+0.00) -94.3%

The same tests using "-w" are as slow as before, since those still
need to zlib compress the object. With the sha1collisiondetection
library (our default) the difference is smaller since the hashing
itself is slower. For the same three tests, respectively:

    1.71(1.65+0.01)   0.19(0.18+0.01) -88.9%
    1.70(1.66+0.02)   0.19(0.19+0.00) -88.8%
    1.69(1.66+0.00)   0.19(0.18+0.00) -88.8%
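
The percentages in these tables are the relative change in elapsed
time between the two columns. A minimal sketch of that arithmetic,
assuming the first column is the timing before this patch and the
second the timing after it (pct_change is a hypothetical helper, not
part of t/perf):

```shell
#!/bin/sh
# pct_change <before> <after>: relative change in percent, one decimal.
# Hypothetical helper for illustration; not part of git's t/perf.
pct_change () {
	awk -v before="$1" -v after="$2" \
		'BEGIN { printf "%.1f%%\n", (after - before) / before * 100 }'
}

pct_change 1.57 0.09	# OPENSSL_SHA1=Y: prints -94.3%
pct_change 1.71 0.19	# sha1collisiondetection: prints -88.9%
```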

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 bulk-checkin.c              |  3 ++-
 t/perf/p1007-hash-object.sh | 53 +++++++++++++++++++++++++++++++++++++
 2 files changed, 55 insertions(+), 1 deletion(-)
 create mode 100755 t/perf/p1007-hash-object.sh

diff --git a/bulk-checkin.c b/bulk-checkin.c
index 39ee7d6107..a26126ee76 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -105,8 +105,9 @@ static int stream_to_pack(struct bulk_checkin_state *state,
 	int status = Z_OK;
 	int write_object = (flags & HASH_WRITE_OBJECT);
 	off_t offset = 0;
+	int level = write_object ? pack_compression_level : Z_NO_COMPRESSION;
 
-	git_deflate_init(&s, pack_compression_level);
+	git_deflate_init(&s, level);
 
 	hdrlen = encode_in_pack_object_header(obuf, sizeof(obuf), type, size);
 	s.next_out = obuf + hdrlen;
diff --git a/t/perf/p1007-hash-object.sh b/t/perf/p1007-hash-object.sh
new file mode 100755
index 0000000000..8df6dc59a5
--- /dev/null
+++ b/t/perf/p1007-hash-object.sh
@@ -0,0 +1,53 @@
+#!/bin/sh
+
+test_description="Tests performance of hash-object"
+. ./perf-lib.sh
+
+test_perf_fresh_repo
+
+test_lazy_prereq SHA1SUM_AND_SANE_DD_AND_URANDOM '
+	>empty &&
+	sha1sum empty >empty.sha1sum &&
+	grep -q -w da39a3ee5e6b4b0d3255bfef95601890afd80709 empty.sha1sum &&
+	dd if=/dev/urandom of=random.test bs=1024 count=1 &&
+	stat -c %s random.test >random.size &&
+	grep -q -x 1024 random.size
+'
+
+if test_have_prereq !SHA1SUM_AND_SANE_DD_AND_URANDOM
+then
+	skip_all='failed prereq check for sha1sum/dd/stat'
+	test_perf 'dummy p1007 test (skipped all tests)' 'true'
+	test_done
+fi
+
+test_expect_success 'setup 64MB file.random file' '
+	dd if=/dev/urandom of=file.random count=$((64*1024)) bs=1024
+'
+
+test_perf 'sha1sum(1) on file.random (for comparison)' '
+	sha1sum file.random
+'
+
+for threshold in 32M 64M
+do
+	for write in '' ' -w'
+	do
+		for literally in ' --literally -t commit' ''
+		do
+			test_perf "'git hash-object$write$literally <file>' with threshold=$threshold" "
+				git -c core.bigFileThreshold=$threshold hash-object$write$literally file.random
+			"
+
+			test_perf "'git hash-object$write$literally --stdin < <file>' with threshold=$threshold" "
+				git -c core.bigFileThreshold=$threshold hash-object$write$literally --stdin <file.random
+			"
+
+			test_perf "'echo <file> | git hash-object$write$literally --stdin-paths' threshold=$threshold" "
+				echo file.random | git -c core.bigFileThreshold=$threshold hash-object$write$literally --stdin-paths
+			"
+		done
+	done
+done
+
+test_done
-- 
2.21.0.1020.gf2820cf01a


