[PATCH] csum-file: flush less often

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* [PATCH] csum-file: flush less often
@ 2021-03-24 17:50 Derrick Stolee via GitGitGadget
  2021-03-25 11:55 ` Derrick Stolee
  2021-03-26 12:38 ` [PATCH v2] csum-file: make hashwrite() more readable Derrick Stolee via GitGitGadget
  0 siblings, 2 replies; 8+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-24 17:50 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, me, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In an independent investigation, I noticed that do_write_index() in
read-cache.c has its own hashing logic and buffering mechanism.
Specifically, the ce_write() method was introduced by 4990aadc (Speed up
index file writing by chunking it nicely, 2005-04-20) and similar
mechanisms were introduced a few months later in c38138cd
(git-pack-objects: write the pack files with a SHA1 csum, 2005-06-26).
Based on the timing, in the early days of the Git codebase, I figured
that these roughly equivalent code paths were never unified only because
it got lost in the shuffle. The hashfile API has since been used
extensively in other file formats, such as pack-indexes,
mult-pack-indexes, and commit-graphs. Therefore, it seems prudent to
unify the index writing code to use the same mechanism.

However, upon doing that refactoring process, I noticed that this caused
some commands that write the index to slow down by 1-2%! I then looked
for a reason why this could be.

First, I noticed that the mechanisms use different buffer sizes. The
hashfile uses an 8KB buffer while the index uses an 128KB buffer.
Testing with a variety of different buffer sizes made little difference.

Next, I inspected the buffering code itself, and found an important
difference. Specifically, every call to hashwrite() was causing a flush
of the filestream, even if it was a very small write. With many callers
using helpers like hashwrite_be32() to write integers in network-byte
order, this was leading to many more file flushes than necessary.

This change modifies hashwrite() to always populate the hashfile buffer,
and only flush when that buffer is full. This is safe to do because all
consumers of a hashfile must call finalize_hashfile(), which flushes the
buffer at the start.

It is worth noting that this is modifying logic introduced by a8032d12
(sha1write: don't copy full sized buffers, 2008-09-02) which reduces
memcpy() calls when the input buffer is sufficiently longer than the
hashfile's buffer, causing nr to be the length of the full buffer. Use
the input buffer directly in these cases. Since we don't guarantee that
the buffer is flushed by the end of hashwrite(), we need to group some
offset logic into the condition that memcpy() is necessary. Note that nr
is equal to sizeof(f->buffer) only when f->offset is zero, so that
condition does not need to be added here.

As for performance, I focused on known commands that spend a significant
amount of time writing through the hashfile API, especially if using
small buffers as in hashwrite_be32(). 'git multi-pack-index write' was
an excellent example (deleting the multi-pack-index file between runs)
and demonstrated this performance change in the Linux kernal repo:

Benchmark #1: old
  Time (mean ± σ):      2.229 s ±  0.143 s    [User: 1.409 s, System: 0.327 s]
  Range (min … max):    2.160 s …  2.636 s    10 runs

Benchmark #2: new
  Time (mean ± σ):      2.162 s ±  0.005 s    [User: 1.392 s, System: 0.323 s]
  Range (min … max):    2.152 s …  2.172 s    10 runs

Summary
  'new' ran
    1.03 ± 0.07 times faster than 'old'

Similarly, the same command on the Git repository gave these numbers:

Benchmark #1: old
  Time (mean ± σ):     230.5 ms ±   6.3 ms    [User: 140.5 ms, System: 42.9 ms]
  Range (min … max):   221.7 ms … 240.6 ms    12 runs

Benchmark #2: new
  Time (mean ± σ):     220.6 ms ±   5.1 ms    [User: 139.5 ms, System: 34.1 ms]
  Range (min … max):   214.0 ms … 229.0 ms    13 runs

Summary
  'new' ran
    1.05 ± 0.04 times faster than 'old'

Finally, to demonstrate that performance holds when frequently using
large buffers, the numbers below are for 'git pack-objects' packing all
objects in the Git repository between v2.30.0 and v2.31.1:

Benchmark #1: old
  Time (mean ± σ):      1.003 s ±  0.045 s    [User: 1.877 s, System: 0.167 s]
  Range (min … max):    0.931 s …  1.044 s    10 runs

Benchmark #2: new
  Time (mean ± σ):     976.4 ms ±  42.2 ms    [User: 1.854 s, System: 0.192 s]
  Range (min … max):   940.1 ms … 1049.3 ms    10 runs

Summary
  'new' ran
    1.03 ± 0.06 times faster than 'old'

With these consistent improvements of 3-5%, it will be possible to move
the index writing logic over to hashfile without performance
degradation.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
    csum-file: flush less often

    I found this while poking around the index.

    Thanks, -Stolee

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-914%2Fderrickstolee%2Fhashfile-flush-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-914/derrickstolee/hashfile-flush-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/914

 csum-file.c | 22 ++++++++--------------
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/csum-file.c b/csum-file.c
index 0f35fa5ee47c..39644af590a5 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -89,32 +89,26 @@ int finalize_hashfile(struct hashfile *f, unsigned char *result, unsigned int fl
 void hashwrite(struct hashfile *f, const void *buf, unsigned int count)
 {
 	while (count) {
-		unsigned offset = f->offset;
-		unsigned left = sizeof(f->buffer) - offset;
+		unsigned left = sizeof(f->buffer) - f->offset;
 		unsigned nr = count > left ? left : count;
-		const void *data;

 		if (f->do_crc)
 			f->crc32 = crc32(f->crc32, buf, nr);

 		if (nr == sizeof(f->buffer)) {
 			/* process full buffer directly without copy */
-			data = buf;
+			the_hash_algo->update_fn(&f->ctx, buf, nr);
+			flush(f, buf, nr);
 		} else {
-			memcpy(f->buffer + offset, buf, nr);
-			data = f->buffer;
+			memcpy(f->buffer + f->offset, buf, nr);
+			f->offset += nr;
+			left -= nr;
+			if (!left)
+				hashflush(f);
 		}

 		count -= nr;
-		offset += nr;
 		buf = (char *) buf + nr;
-		left -= nr;
-		if (!left) {
-			the_hash_algo->update_fn(&f->ctx, data, offset);
-			flush(f, data, offset);
-			offset = 0;
-		}
-		f->offset = offset;
 	}
 }

base-commit: 142430338477d9d1bb25be66267225fb58498d92
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] csum-file: flush less often
  2021-03-24 17:50 [PATCH] csum-file: flush less often Derrick Stolee via GitGitGadget
@ 2021-03-25 11:55 ` Derrick Stolee
  2021-03-25 18:46   ` Junio C Hamano
  2021-03-26 12:38 ` [PATCH v2] csum-file: make hashwrite() more readable Derrick Stolee via GitGitGadget
  1 sibling, 1 reply; 8+ messages in thread
From: Derrick Stolee @ 2021-03-25 11:55 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget, git
  Cc: gitster, peff, me, Derrick Stolee, Derrick Stolee

On 3/24/2021 1:50 PM, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>

Let me walk this back a bit.

> Next, I inspected the buffering code itself, and found an important
> difference. Specifically, every call to hashwrite() was causing a flush
> of the filestream, even if it was a very small write. With many callers
> using helpers like hashwrite_be32() to write integers in network-byte
> order, this was leading to many more file flushes than necessary.

This is incorrect. I misinterpreted the logic inside the loop, and I
later confirmed using trace2 that the number of flushes is the same
between versions.

So, what happened with my performance tests?

> As for performance, I focused on known commands that spend a significant
> amount of time writing through the hashfile API, especially if using
> small buffers as in hashwrite_be32(). 'git multi-pack-index write' was
> an excellent example (deleting the multi-pack-index file between runs)
> and demonstrated this performance change in the Linux kernal repo:
...
> Summary
>   'new' ran
>     1.03 ± 0.07 times faster than 'old'
> 
> Similarly, the same command on the Git repository gave these numbers:
...
> Summary
>   'new' ran
>     1.05 ± 0.04 times faster than 'old'
> 
> Finally, to demonstrate that performance holds when frequently using
> large buffers, the numbers below are for 'git pack-objects' packing all
> objects in the Git repository between v2.30.0 and v2.31.1:
...
> Summary
>   'new' ran
>     1.03 ± 0.06 times faster than 'old'
>
> With these consistent improvements of 3-5%, ...

These numbers seems consistent, across repos and test commands. They
seem to be the inverse of the slowdown I was seeing in the index
refactor. These caused me to use confirmation bias to assume I had
done something clever.

I was using hyperfine to run these numbers, with the hope that it
provides a consistent scenario worthy of testing. I used this command,
roughly (in a script):

hyperfine \
        -n "old" "$1 && $OLD_GIT $2 <input" \
        -n "new" "$1 && $NEW_GIT $2 <input" \
        --warmup=3 \
        --min-runs=20

where I would pass some preparatory step as "$1" and the Git commands
to run as "$2", and have an input file (necessary for the pack-objects
command).

The first thing I did when confronted with the flush problem was swap
the order of the "old" and "new" lines, and that caused the performance
difference to go away, hinting that the number of warmups needed to
increase. Changing to "--warmup=20" and "--min-runs=50", the change in
timing went away entirely.

I did the same with my draft changes to the index write code, and that
caused the 1-2% performance drop go away, too. So, this whole adventure
was based on a faulty performance test.

But...is there something we could still do here?

My confusion about flushing is mostly due to my error, but upon
reflection the loop is doing a lot of different things, but most of
the time we know which behavior we need at the start, in the middle,
and at the end:

     1. Fill the existing buffer with the beginning of 'buf'. If the
        hashfile's buffer is full, then flush.
    
     2. Flush sizeof(f->buffer) chunks directly out of 'buf' as long as
        possible.
    
     3. Copy the remaining byes out of 'buf' into the hashfile's buffer.

Here is a rewrite that more explicitly follows this flow:

void hashwrite(struct hashfile *f, const void *buf, unsigned int count)
{
	const int full_buffer = sizeof(f->buffer);
	unsigned left = full_buffer - f->offset;
	unsigned nr = count > left ? left : count;

	/*
	 * Initially fill the buffer in a batch until it
	 * is full, then flush.
	 */
	if (f->do_crc)
		f->crc32 = crc32(f->crc32, buf, nr);

	memcpy(f->buffer + f->offset, buf, nr);
	f->offset += nr;
	count -= nr;
	buf = (char *) buf + nr;

	if (left == nr)
		hashflush(f);

	/*
	 * After filling the hashfile's buffer and flushing, take
	 * batches of full_buffer bytes directly from the input
	 * buffer.
	 */
	while (count >= full_buffer) {
		if (f->do_crc)
			f->crc32 = crc32(f->crc32, buf, full_buffer);

		the_hash_algo->update_fn(&f->ctx, buf, full_buffer);
		flush(f, buf, full_buffer);

		count -= full_buffer;
		buf = (char *) buf + full_buffer;
	}

	/*
	 * Capture any remaining bytes at the end of the input buffer
	 * into the hashfile's buffer. We do not need to flush because
	 * count is strictly less than full_buffer here.
	 */
	if (count) {
		if (f->do_crc)
			f->crc32 = crc32(f->crc32, buf, count);

		memcpy(f->buffer + f->offset, buf, count);
		f->offset = count;
	}
	
	if (f->base)
		hashwrite(f->base, buf, count);
}

With this implementation (and the more robust performance test), the
performance for pack-objects and index-pack remains constant, but
there is a slight improvement for 'git multi-pack-index write', which
is mostly translating data from the pack-indexes into a multi-pack-
index:

    Using the Git repository:
    
    Benchmark #1: old
      Time (mean ± σ):     270.4 ms ±   6.9 ms    [User: 184.6 ms, System: 38.6 ms]
      Range (min … max):   258.6 ms … 283.2 ms    50 runs
    
    Benchmark #2: new
      Time (mean ± σ):     265.3 ms ±   6.0 ms    [User: 180.9 ms, System: 37.8 ms]
      Range (min … max):   257.4 ms … 282.0 ms    50 runs
    
    Summary
      'new' ran
        1.02 ± 0.03 times faster than 'old'
    
    Using the Linux kernel repository:
    
    Benchmark #1: old
      Time (mean ± σ):      2.321 s ±  0.011 s    [User: 1.538 s, System: 0.335 s]
      Range (min … max):    2.301 s …  2.353 s    50 runs
    
    Benchmark #2: new
      Time (mean ± σ):      2.290 s ±  0.011 s    [User: 1.513 s, System: 0.329 s]
      Range (min … max):    2.273 s …  2.318 s    50 runs
    
    Summary
      'new' ran
        1.01 ± 0.01 times faster than 'old'

Again, variance might be at play here, but after running this
test multiple times, I was never able to see less than 1% reported
here.

So, I'm of two minds here:

 1. This is embarassing. I wasted everyone's time for nothing. I can retract
    this patch.

 2. This is embarassing. I overstated the problem here. But we might be able
    to eke out a tiny performance boost here.

I'm open to either. I think we should default to dropping this patch unless
someone thinks the rewrite above is a better organization of the logic. (I
can then send a v2 including that version and an updated commit message.)

Thanks,
-Stolee

P.S. Special thanks to Peff who pointed out my error in private.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] csum-file: flush less often
  2021-03-25 11:55 ` Derrick Stolee
@ 2021-03-25 18:46   ` Junio C Hamano
  2021-03-25 18:52     ` Junio C Hamano
  0 siblings, 1 reply; 8+ messages in thread
From: Junio C Hamano @ 2021-03-25 18:46 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, peff, me, Derrick Stolee,
	Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

> But...is there something we could still do here?
>
> My confusion about flushing is mostly due to my error, but upon
> reflection the loop is doing a lot of different things, but most of
> the time we know which behavior we need at the start, in the middle,
> and at the end:
>
>      1. Fill the existing buffer with the beginning of 'buf'. If the
>         hashfile's buffer is full, then flush.

"But do not do this if f->buffer is empty, and we are writing out
more than sizeof(f->buffer)." is missing, isn't it?

>      2. Flush sizeof(f->buffer) chunks directly out of 'buf' as long as
>         possible.
>     
>      3. Copy the remaining bytes out of 'buf' into the hashfile's buffer.

It is debatable if the existing loop, which came mostly from Nico's
a8032d12 (sha1write: don't copy full sized buffers, 2008-09-02), is
too clever; I personally find it concise and readable enough, but my
reading is tainted.

If you use a couple of helpers to reduce the repeated "crc and hash"
pattern in your variant, it may become easier to follow than the
original, but I dunno.

> Here is a rewrite that more explicitly follows this flow:
>
> void hashwrite(struct hashfile *f, const void *buf, unsigned int count)
> {
> 	const int full_buffer = sizeof(f->buffer);
> 	unsigned left = full_buffer - f->offset;
> 	unsigned nr = count > left ? left : count;
>
> 	/*
> 	 * Initially fill the buffer in a batch until it
> 	 * is full, then flush.
> 	 */
> 	if (f->do_crc)
> 		f->crc32 = crc32(f->crc32, buf, nr);
>
> 	memcpy(f->buffer + f->offset, buf, nr);

Here, if the f->buffer was empty, we end up memcpy a full bufferful
unconditionally.  Nico's original cleverly takes advantage of the
fact that 'nr' would be the full buffer size only when the f->buffer
was empty upon entry to the function and we have more byte than the
size of the buffer to copy out directly from 'buf'.

> 	f->offset += nr;
> 	count -= nr;
> 	buf = (char *) buf + nr;
>
> 	if (left == nr)
> 		hashflush(f);
>
> 	/*
> 	 * After filling the hashfile's buffer and flushing, take
> 	 * batches of full_buffer bytes directly from the input
> 	 * buffer.
> 	 */
> 	while (count >= full_buffer) {
> 		if (f->do_crc)
> 			f->crc32 = crc32(f->crc32, buf, full_buffer);
>
> 		the_hash_algo->update_fn(&f->ctx, buf, full_buffer);
> 		flush(f, buf, full_buffer);
>
> 		count -= full_buffer;
> 		buf = (char *) buf + full_buffer;
> 	}
>
> 	/*
> 	 * Capture any remaining bytes at the end of the input buffer
> 	 * into the hashfile's buffer. We do not need to flush because
> 	 * count is strictly less than full_buffer here.
> 	 */
> 	if (count) {
> 		if (f->do_crc)
> 			f->crc32 = crc32(f->crc32, buf, count);
>
> 		memcpy(f->buffer + f->offset, buf, count);
> 		f->offset = count;
> 	}
> 	
> 	if (f->base)
> 		hashwrite(f->base, buf, count);
> }
> ...
> So, I'm of two minds here:
>
>  1. This is embarassing. I wasted everyone's time for nothing. I can retract
>     this patch.
>
>  2. This is embarassing. I overstated the problem here. But we might be able
>     to eke out a tiny performance boost here.
>
> I'm open to either. I think we should default to dropping this patch unless
> someone thinks the rewrite above is a better organization of the logic. (I
> can then send a v2 including that version and an updated commit message.)

3. The current code around "if (nr == sizeof(f->buffer))" might be a
   bit too clever for readers who try to understand what is going
   on, and the whole "while" loop may deserve a comment based on
   what you wrote before your replacement implementation.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] csum-file: flush less often
  2021-03-25 18:46   ` Junio C Hamano
@ 2021-03-25 18:52     ` Junio C Hamano
  2021-03-26  3:16       ` Jeff King
  0 siblings, 1 reply; 8+ messages in thread
From: Junio C Hamano @ 2021-03-25 18:52 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, peff, me, Derrick Stolee,
	Derrick Stolee

Junio C Hamano <gitster@pobox.com> writes:

>> So, I'm of two minds here:
>>
>>  1. This is embarassing. I wasted everyone's time for nothing. I can retract
>>     this patch.
>>
>>  2. This is embarassing. I overstated the problem here. But we might be able
>>     to eke out a tiny performance boost here.
>>
>> I'm open to either. I think we should default to dropping this patch unless
>> someone thinks the rewrite above is a better organization of the logic. (I
>> can then send a v2 including that version and an updated commit message.)
>
> 3. The current code around "if (nr == sizeof(f->buffer))" might be a
>    bit too clever for readers who try to understand what is going
>    on, and the whole "while" loop may deserve a comment based on
>    what you wrote before your replacement implementation.

Having said all that, comparing the original and the version updated
with your "flush less often" patch, I find the latter quite easier
to read, so as long as the update does not give us 1% slowdown, it
may be worth adopting for the readability improvement alone.

Of course, if we were to go that route, the sales pitch in the log
message needs to be updated.

Thanks.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] csum-file: flush less often
  2021-03-25 18:52     ` Junio C Hamano
@ 2021-03-26  3:16       ` Jeff King
  0 siblings, 0 replies; 8+ messages in thread
From: Jeff King @ 2021-03-26  3:16 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Derrick Stolee, Derrick Stolee via GitGitGadget, git, me,
	Derrick Stolee, Derrick Stolee

On Thu, Mar 25, 2021 at 11:52:29AM -0700, Junio C Hamano wrote:

> Junio C Hamano <gitster@pobox.com> writes:
> 
> >> So, I'm of two minds here:
> >>
> >>  1. This is embarassing. I wasted everyone's time for nothing. I can retract
> >>     this patch.
> >>
> >>  2. This is embarassing. I overstated the problem here. But we might be able
> >>     to eke out a tiny performance boost here.
> >>
> >> I'm open to either. I think we should default to dropping this patch unless
> >> someone thinks the rewrite above is a better organization of the logic. (I
> >> can then send a v2 including that version and an updated commit message.)
> >
> > 3. The current code around "if (nr == sizeof(f->buffer))" might be a
> >    bit too clever for readers who try to understand what is going
> >    on, and the whole "while" loop may deserve a comment based on
> >    what you wrote before your replacement implementation.

Yes, my first thought on reading Stolee's post-image was: wait, how do
we know when data needed flushed from the buffer? But that is not new in
his patch. It is confusing before and after. :)

> Having said all that, comparing the original and the version updated
> with your "flush less often" patch, I find the latter quite easier
> to read, so as long as the update does not give us 1% slowdown, it
> may be worth adopting for the readability improvement alone.
> 
> Of course, if we were to go that route, the sales pitch in the log
> message needs to be updated.

Yeah, I am OK with either version, as long as it is justified correctly
in the commit message. IMHO the big difference is that the original is
using local data/offset variables in order to provide a layer of
indirection when we get to the hash+flush code. And Stolee's patch is
calling the same code in the two places instead.

It's quite possible that gives the compiler slightly more opportunity to
micro-optimize (which doesn't matter if you are feeding big blocks, but
may if you are feeding 4 bytes at a time as in the midx code; though in
that case it is entirely possible that the caller allocating a single
array, writing it, and then feeding it to hashwrite() would be faster
still, though a little more cumbersome).

-Peff

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v2] csum-file: make hashwrite() more readable
  2021-03-24 17:50 [PATCH] csum-file: flush less often Derrick Stolee via GitGitGadget
  2021-03-25 11:55 ` Derrick Stolee
@ 2021-03-26 12:38 ` Derrick Stolee via GitGitGadget
  2021-03-26 21:38   ` Junio C Hamano
  2021-03-28  8:38   ` Jeff King
  1 sibling, 2 replies; 8+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-26 12:38 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, me, Derrick Stolee, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The hashwrite() method takes an input buffer and updates a hashfile's
hash function while writing the data to a file. To avoid overuse of
flushes, the hashfile has an internal buffer and most writes will use
memcpy() to transfer data from the input 'buf' to the hashfile's buffer
of size 8 * 1024 bytes.

Logic introduced by a8032d12 (sha1write: don't copy full sized buffers,
2008-09-02) reduces the number of memcpy() calls when the input buffer
is sufficiently longer than the hashfile's buffer, causing nr to be the
length of the full buffer. In these cases, the input buffer is used
directly in chunks equal to the hashfile's buffer size.

This method caught my attention while investigating some performance
issues, but it turns out that these performance issues were noise within
the variance of the experiment.

However, during this investigation, I inspected hashwrite() and
misunderstood it, even after looking closely and trying to make it
faster. This change simply reorganizes some parts of the loop within
hashwrite() to make it clear that each batch either uses memcpy() to the
hashfile's buffer or writes directly from the input buffer. The previous
code relied on indirection through local variables and essentially
inlined the implementation of hashflush() to reduce lines of code.

Helped-by: Jeff King <peff@peff.net>
Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
    csum-file: make hashwrite() more readable
    
    v2 adds comments to the implementation to make some assumptions very
    clear about what is happening in each case.
    
    The commit message is almost completely rewritten based on the
    discussion on-list.
    
    Thanks, -Stolee

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-914%2Fderrickstolee%2Fhashfile-flush-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-914/derrickstolee/hashfile-flush-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/914

Range-diff vs v1:

 1:  91995baaaa42 ! 1:  68a429b0ba7e csum-file: flush less often
     @@ Metadata
      Author: Derrick Stolee <dstolee@microsoft.com>
      
       ## Commit message ##
     -    csum-file: flush less often
     -
     -    In an independent investigation, I noticed that do_write_index() in
     -    read-cache.c has its own hashing logic and buffering mechanism.
     -    Specifically, the ce_write() method was introduced by 4990aadc (Speed up
     -    index file writing by chunking it nicely, 2005-04-20) and similar
     -    mechanisms were introduced a few months later in c38138cd
     -    (git-pack-objects: write the pack files with a SHA1 csum, 2005-06-26).
     -    Based on the timing, in the early days of the Git codebase, I figured
     -    that these roughly equivalent code paths were never unified only because
     -    it got lost in the shuffle. The hashfile API has since been used
     -    extensively in other file formats, such as pack-indexes,
     -    mult-pack-indexes, and commit-graphs. Therefore, it seems prudent to
     -    unify the index writing code to use the same mechanism.
     -
     -    However, upon doing that refactoring process, I noticed that this caused
     -    some commands that write the index to slow down by 1-2%! I then looked
     -    for a reason why this could be.
     -
     -    First, I noticed that the mechanisms use different buffer sizes. The
     -    hashfile uses an 8KB buffer while the index uses an 128KB buffer.
     -    Testing with a variety of different buffer sizes made little difference.
     -
     -    Next, I inspected the buffering code itself, and found an important
     -    difference. Specifically, every call to hashwrite() was causing a flush
     -    of the filestream, even if it was a very small write. With many callers
     -    using helpers like hashwrite_be32() to write integers in network-byte
     -    order, this was leading to many more file flushes than necessary.
     -
     -    This change modifies hashwrite() to always populate the hashfile buffer,
     -    and only flush when that buffer is full. This is safe to do because all
     -    consumers of a hashfile must call finalize_hashfile(), which flushes the
     -    buffer at the start.
     -
     -    It is worth noting that this is modifying logic introduced by a8032d12
     -    (sha1write: don't copy full sized buffers, 2008-09-02) which reduces
     -    memcpy() calls when the input buffer is sufficiently longer than the
     -    hashfile's buffer, causing nr to be the length of the full buffer. Use
     -    the input buffer directly in these cases. Since we don't guarantee that
     -    the buffer is flushed by the end of hashwrite(), we need to group some
     -    offset logic into the condition that memcpy() is necessary. Note that nr
     -    is equal to sizeof(f->buffer) only when f->offset is zero, so that
     -    condition does not need to be added here.
     -
     -    As for performance, I focused on known commands that spend a significant
     -    amount of time writing through the hashfile API, especially if using
     -    small buffers as in hashwrite_be32(). 'git multi-pack-index write' was
     -    an excellent example (deleting the multi-pack-index file between runs)
     -    and demonstrated this performance change in the Linux kernal repo:
     -
     -    Benchmark #1: old
     -      Time (mean ± σ):      2.229 s ±  0.143 s    [User: 1.409 s, System: 0.327 s]
     -      Range (min … max):    2.160 s …  2.636 s    10 runs
     -
     -    Benchmark #2: new
     -      Time (mean ± σ):      2.162 s ±  0.005 s    [User: 1.392 s, System: 0.323 s]
     -      Range (min … max):    2.152 s …  2.172 s    10 runs
     -
     -    Summary
     -      'new' ran
     -        1.03 ± 0.07 times faster than 'old'
     -
     -    Similarly, the same command on the Git repository gave these numbers:
     -
     -    Benchmark #1: old
     -      Time (mean ± σ):     230.5 ms ±   6.3 ms    [User: 140.5 ms, System: 42.9 ms]
     -      Range (min … max):   221.7 ms … 240.6 ms    12 runs
     -
     -    Benchmark #2: new
     -      Time (mean ± σ):     220.6 ms ±   5.1 ms    [User: 139.5 ms, System: 34.1 ms]
     -      Range (min … max):   214.0 ms … 229.0 ms    13 runs
     -
     -    Summary
     -      'new' ran
     -        1.05 ± 0.04 times faster than 'old'
     -
     -    Finally, to demonstrate that performance holds when frequently using
     -    large buffers, the numbers below are for 'git pack-objects' packing all
     -    objects in the Git repository between v2.30.0 and v2.31.1:
     -
     -    Benchmark #1: old
     -      Time (mean ± σ):      1.003 s ±  0.045 s    [User: 1.877 s, System: 0.167 s]
     -      Range (min … max):    0.931 s …  1.044 s    10 runs
     -
     -    Benchmark #2: new
     -      Time (mean ± σ):     976.4 ms ±  42.2 ms    [User: 1.854 s, System: 0.192 s]
     -      Range (min … max):   940.1 ms … 1049.3 ms    10 runs
     -
     -    Summary
     -      'new' ran
     -        1.03 ± 0.06 times faster than 'old'
     -
     -    With these consistent improvements of 3-5%, it will be possible to move
     -    the index writing logic over to hashfile without performance
     -    degradation.
     -
     +    csum-file: make hashwrite() more readable
     +
     +    The hashwrite() method takes an input buffer and updates a hashfile's
     +    hash function while writing the data to a file. To avoid overuse of
     +    flushes, the hashfile has an internal buffer and most writes will use
     +    memcpy() to transfer data from the input 'buf' to the hashfile's buffer
     +    of size 8 * 1024 bytes.
     +
     +    Logic introduced by a8032d12 (sha1write: don't copy full sized buffers,
     +    2008-09-02) reduces the number of memcpy() calls when the input buffer
     +    is sufficiently longer than the hashfile's buffer, causing nr to be the
     +    length of the full buffer. In these cases, the input buffer is used
     +    directly in chunks equal to the hashfile's buffer size.
     +
     +    This method caught my attention while investigating some performance
     +    issues, but it turns out that these performance issues were noise within
     +    the variance of the experiment.
     +
     +    However, during this investigation, I inspected hashwrite() and
     +    misunderstood it, even after looking closely and trying to make it
     +    faster. This change simply reorganizes some parts of the loop within
     +    hashwrite() to make it clear that each batch either uses memcpy() to the
     +    hashfile's buffer or writes directly from the input buffer. The previous
     +    code relied on indirection through local variables and essentially
     +    inlined the implementation of hashflush() to reduce lines of code.
     +
     +    Helped-by: Jeff King <peff@peff.net>
     +    Helped-by: Junio C Hamano <gitster@pobox.com>
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
       ## csum-file.c ##
     @@ csum-file.c: int finalize_hashfile(struct hashfile *f, unsigned char *result, un
       			f->crc32 = crc32(f->crc32, buf, nr);
       
       		if (nr == sizeof(f->buffer)) {
     - 			/* process full buffer directly without copy */
     +-			/* process full buffer directly without copy */
      -			data = buf;
     ++			/*
     ++			 * Flush a full batch worth of data directly
     ++			 * from the input, skipping the memcpy() to
     ++			 * the hashfile's buffer. In this block,
     ++			 * f->offset is necessarily zero.
     ++			 */
      +			the_hash_algo->update_fn(&f->ctx, buf, nr);
      +			flush(f, buf, nr);
       		} else {
      -			memcpy(f->buffer + offset, buf, nr);
      -			data = f->buffer;
     ++			/*
     ++			 * Copy to the hashfile's buffer, flushing only
     ++			 * if it became full.
     ++			 */
      +			memcpy(f->buffer + f->offset, buf, nr);
      +			f->offset += nr;
      +			left -= nr;


 csum-file.c | 33 ++++++++++++++++++---------------
 1 file changed, 18 insertions(+), 15 deletions(-)

diff --git a/csum-file.c b/csum-file.c
index 0f35fa5ee47c..7510950fa3e9 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -89,32 +89,35 @@ int finalize_hashfile(struct hashfile *f, unsigned char *result, unsigned int fl
 void hashwrite(struct hashfile *f, const void *buf, unsigned int count)
 {
 	while (count) {
-		unsigned offset = f->offset;
-		unsigned left = sizeof(f->buffer) - offset;
+		unsigned left = sizeof(f->buffer) - f->offset;
 		unsigned nr = count > left ? left : count;
-		const void *data;
 
 		if (f->do_crc)
 			f->crc32 = crc32(f->crc32, buf, nr);
 
 		if (nr == sizeof(f->buffer)) {
-			/* process full buffer directly without copy */
-			data = buf;
+			/*
+			 * Flush a full batch worth of data directly
+			 * from the input, skipping the memcpy() to
+			 * the hashfile's buffer. In this block,
+			 * f->offset is necessarily zero.
+			 */
+			the_hash_algo->update_fn(&f->ctx, buf, nr);
+			flush(f, buf, nr);
 		} else {
-			memcpy(f->buffer + offset, buf, nr);
-			data = f->buffer;
+			/*
+			 * Copy to the hashfile's buffer, flushing only
+			 * if it became full.
+			 */
+			memcpy(f->buffer + f->offset, buf, nr);
+			f->offset += nr;
+			left -= nr;
+			if (!left)
+				hashflush(f);
 		}
 
 		count -= nr;
-		offset += nr;
 		buf = (char *) buf + nr;
-		left -= nr;
-		if (!left) {
-			the_hash_algo->update_fn(&f->ctx, data, offset);
-			flush(f, data, offset);
-			offset = 0;
-		}
-		f->offset = offset;
 	}
 }
 

base-commit: 142430338477d9d1bb25be66267225fb58498d92
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] csum-file: make hashwrite() more readable
  2021-03-26 12:38 ` [PATCH v2] csum-file: make hashwrite() more readable Derrick Stolee via GitGitGadget
@ 2021-03-26 21:38   ` Junio C Hamano
  2021-03-28  8:38   ` Jeff King
  1 sibling, 0 replies; 8+ messages in thread
From: Junio C Hamano @ 2021-03-26 21:38 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, peff, me, Derrick Stolee, Derrick Stolee, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

>  		if (nr == sizeof(f->buffer)) {
> -			/* process full buffer directly without copy */
> -			data = buf;
> +			/*
> +			 * Flush a full batch worth of data directly
> +			 * from the input, skipping the memcpy() to
> +			 * the hashfile's buffer. In this block,
> +			 * f->offset is necessarily zero.
> +			 */

What made me a bit confused was the fact that, in order to exercise
the "bypass memcpy and take a full bufferful from the incoming data
directly" optimization, there are two preconditions.  The incoming
data must be large enough, and we do not have anything kept in the
buffer that needs to be emitted before the incoming data.  And the
cleverness of the original code was that both are checked by this
single "nr == sizeof(f->buffer)" condition.

So I do appreciate this extra comment, and I think future readers of
the code will, too.

> +			the_hash_algo->update_fn(&f->ctx, buf, nr);
> +			flush(f, buf, nr);
>  		} else {
> -			memcpy(f->buffer + offset, buf, nr);
> -			data = f->buffer;
> +			/*
> +			 * Copy to the hashfile's buffer, flushing only
> +			 * if it became full.
> +			 */
> +			memcpy(f->buffer + f->offset, buf, nr);
> +			f->offset += nr;
> +			left -= nr;
> +			if (!left)
> +				hashflush(f);
>  		}
>  
>  		count -= nr;
> -		offset += nr;
>  		buf = (char *) buf + nr;
> -		left -= nr;
> -		if (!left) {
> -			the_hash_algo->update_fn(&f->ctx, data, offset);
> -			flush(f, data, offset);
> -			offset = 0;
> -		}
> -		f->offset = offset;
>  	}
>  }
>  
>
> base-commit: 142430338477d9d1bb25be66267225fb58498d92

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] csum-file: make hashwrite() more readable
  2021-03-26 12:38 ` [PATCH v2] csum-file: make hashwrite() more readable Derrick Stolee via GitGitGadget
  2021-03-26 21:38   ` Junio C Hamano
@ 2021-03-28  8:38   ` Jeff King
  1 sibling, 0 replies; 8+ messages in thread
From: Jeff King @ 2021-03-28  8:38 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, gitster, me, Derrick Stolee, Derrick Stolee, Derrick Stolee

On Fri, Mar 26, 2021 at 12:38:11PM +0000, Derrick Stolee via GitGitGadget wrote:

> However, during this investigation, I inspected hashwrite() and
> misunderstood it, even after looking closely and trying to make it
> faster. This change simply reorganizes some parts of the loop within
> hashwrite() to make it clear that each batch either uses memcpy() to the
> hashfile's buffer or writes directly from the input buffer. The previous
> code relied on indirection through local variables and essentially
> inlined the implementation of hashflush() to reduce lines of code.

Thanks, I think this nicely argues for the change here (and I agree the
result, especially with the comments, is much easier to understand).

-Peff

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2021-03-28  8:45 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-24 17:50 [PATCH] csum-file: flush less often Derrick Stolee via GitGitGadget
2021-03-25 11:55 ` Derrick Stolee
2021-03-25 18:46   ` Junio C Hamano
2021-03-25 18:52     ` Junio C Hamano
2021-03-26  3:16       ` Jeff King
2021-03-26 12:38 ` [PATCH v2] csum-file: make hashwrite() more readable Derrick Stolee via GitGitGadget
2021-03-26 21:38   ` Junio C Hamano
2021-03-28  8:38   ` Jeff King

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).