[PATCH] csum-file: flush less often - Derrick Stolee via GitGitGadget

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

From: "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com>
To: git@vger.kernel.org
Cc: gitster@pobox.com, peff@peff.net, me@ttaylorr.com,
	Derrick Stolee <derrickstolee@github.com>,
	Derrick Stolee <dstolee@microsoft.com>
Subject: [PATCH] csum-file: flush less often
Date: Wed, 24 Mar 2021 17:50:18 +0000	[thread overview]
Message-ID: <pull.914.git.1616608219602.gitgitgadget@gmail.com> (raw)

From: Derrick Stolee <dstolee@microsoft.com>

In an independent investigation, I noticed that do_write_index() in
read-cache.c has its own hashing logic and buffering mechanism.
Specifically, the ce_write() method was introduced by 4990aadc (Speed up
index file writing by chunking it nicely, 2005-04-20) and similar
mechanisms were introduced a few months later in c38138cd
(git-pack-objects: write the pack files with a SHA1 csum, 2005-06-26).
Based on the timing, in the early days of the Git codebase, I figured
that these roughly equivalent code paths were never unified only because
it got lost in the shuffle. The hashfile API has since been used
extensively in other file formats, such as pack-indexes,
mult-pack-indexes, and commit-graphs. Therefore, it seems prudent to
unify the index writing code to use the same mechanism.

However, upon doing that refactoring process, I noticed that this caused
some commands that write the index to slow down by 1-2%! I then looked
for a reason why this could be.

First, I noticed that the mechanisms use different buffer sizes. The
hashfile uses an 8KB buffer while the index uses an 128KB buffer.
Testing with a variety of different buffer sizes made little difference.

Next, I inspected the buffering code itself, and found an important
difference. Specifically, every call to hashwrite() was causing a flush
of the filestream, even if it was a very small write. With many callers
using helpers like hashwrite_be32() to write integers in network-byte
order, this was leading to many more file flushes than necessary.

This change modifies hashwrite() to always populate the hashfile buffer,
and only flush when that buffer is full. This is safe to do because all
consumers of a hashfile must call finalize_hashfile(), which flushes the
buffer at the start.

It is worth noting that this is modifying logic introduced by a8032d12
(sha1write: don't copy full sized buffers, 2008-09-02) which reduces
memcpy() calls when the input buffer is sufficiently longer than the
hashfile's buffer, causing nr to be the length of the full buffer. Use
the input buffer directly in these cases. Since we don't guarantee that
the buffer is flushed by the end of hashwrite(), we need to group some
offset logic into the condition that memcpy() is necessary. Note that nr
is equal to sizeof(f->buffer) only when f->offset is zero, so that
condition does not need to be added here.

As for performance, I focused on known commands that spend a significant
amount of time writing through the hashfile API, especially if using
small buffers as in hashwrite_be32(). 'git multi-pack-index write' was
an excellent example (deleting the multi-pack-index file between runs)
and demonstrated this performance change in the Linux kernal repo:

Benchmark #1: old
  Time (mean ± σ):      2.229 s ±  0.143 s    [User: 1.409 s, System: 0.327 s]
  Range (min … max):    2.160 s …  2.636 s    10 runs

Benchmark #2: new
  Time (mean ± σ):      2.162 s ±  0.005 s    [User: 1.392 s, System: 0.323 s]
  Range (min … max):    2.152 s …  2.172 s    10 runs

Summary
  'new' ran
    1.03 ± 0.07 times faster than 'old'

Similarly, the same command on the Git repository gave these numbers:

Benchmark #1: old
  Time (mean ± σ):     230.5 ms ±   6.3 ms    [User: 140.5 ms, System: 42.9 ms]
  Range (min … max):   221.7 ms … 240.6 ms    12 runs

Benchmark #2: new
  Time (mean ± σ):     220.6 ms ±   5.1 ms    [User: 139.5 ms, System: 34.1 ms]
  Range (min … max):   214.0 ms … 229.0 ms    13 runs

Summary
  'new' ran
    1.05 ± 0.04 times faster than 'old'

Finally, to demonstrate that performance holds when frequently using
large buffers, the numbers below are for 'git pack-objects' packing all
objects in the Git repository between v2.30.0 and v2.31.1:

Benchmark #1: old
  Time (mean ± σ):      1.003 s ±  0.045 s    [User: 1.877 s, System: 0.167 s]
  Range (min … max):    0.931 s …  1.044 s    10 runs

Benchmark #2: new
  Time (mean ± σ):     976.4 ms ±  42.2 ms    [User: 1.854 s, System: 0.192 s]
  Range (min … max):   940.1 ms … 1049.3 ms    10 runs

Summary
  'new' ran
    1.03 ± 0.06 times faster than 'old'

With these consistent improvements of 3-5%, it will be possible to move
the index writing logic over to hashfile without performance
degradation.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
    csum-file: flush less often

    I found this while poking around the index.

    Thanks, -Stolee

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-914%2Fderrickstolee%2Fhashfile-flush-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-914/derrickstolee/hashfile-flush-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/914

 csum-file.c | 22 ++++++++--------------
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/csum-file.c b/csum-file.c
index 0f35fa5ee47c..39644af590a5 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -89,32 +89,26 @@ int finalize_hashfile(struct hashfile *f, unsigned char *result, unsigned int fl
 void hashwrite(struct hashfile *f, const void *buf, unsigned int count)
 {
 	while (count) {
-		unsigned offset = f->offset;
-		unsigned left = sizeof(f->buffer) - offset;
+		unsigned left = sizeof(f->buffer) - f->offset;
 		unsigned nr = count > left ? left : count;
-		const void *data;

 		if (f->do_crc)
 			f->crc32 = crc32(f->crc32, buf, nr);

 		if (nr == sizeof(f->buffer)) {
 			/* process full buffer directly without copy */
-			data = buf;
+			the_hash_algo->update_fn(&f->ctx, buf, nr);
+			flush(f, buf, nr);
 		} else {
-			memcpy(f->buffer + offset, buf, nr);
-			data = f->buffer;
+			memcpy(f->buffer + f->offset, buf, nr);
+			f->offset += nr;
+			left -= nr;
+			if (!left)
+				hashflush(f);
 		}

 		count -= nr;
-		offset += nr;
 		buf = (char *) buf + nr;
-		left -= nr;
-		if (!left) {
-			the_hash_algo->update_fn(&f->ctx, data, offset);
-			flush(f, data, offset);
-			offset = 0;
-		}
-		f->offset = offset;
 	}
 }

base-commit: 142430338477d9d1bb25be66267225fb58498d92
-- 
gitgitgadget

next             reply	other threads:[~2021-03-24 17:51 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-03-24 17:50 Derrick Stolee via GitGitGadget [this message]
2021-03-25 11:55 ` [PATCH] csum-file: flush less often Derrick Stolee
2021-03-25 18:46   ` Junio C Hamano
2021-03-25 18:52     ` Junio C Hamano
2021-03-26  3:16       ` Jeff King
2021-03-26 12:38 ` [PATCH v2] csum-file: make hashwrite() more readable Derrick Stolee via GitGitGadget
2021-03-26 21:38   ` Junio C Hamano
2021-03-28  8:38   ` Jeff King

find likely ancestor, descendant, or conflicting patches for this message:
dfblob:0f35fa5ee47 dfblob:39644af590a
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=pull.914.git.1616608219602.gitgitgadget@gmail.com \
    --to=gitgitgadget@gmail.com \
    --cc=derrickstolee@github.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=me@ttaylorr.com \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).