git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Patrick Steinhardt <ps@pks.im>
To: git@vger.kernel.org
Subject: [PATCH 8/9] reftable/block: open-code call to `uncompress2()`
Date: Wed, 27 Mar 2024 07:37:25 +0100	[thread overview]
Message-ID: <bf4c1ab797d32284b469034c5272383fd998439e.1711519925.git.ps@pks.im> (raw)
In-Reply-To: <cover.1711519925.git.ps@pks.im>

[-- Attachment #1: Type: text/plain, Size: 4364 bytes --]

The reftable format stores log blocks in a compressed format. Thus,
whenever we want to read such a block we first need to decompress it.
This is done by calling the convenience function `uncompress2()` of the
zlib library, which is a simple wrapper that manages the lifecycle of
the `zstream` structure for us.

While nice for one-off inflation of data, when iterating through reflogs
we will likely end up inflating many such log blocks. This requires us
to reallocate the state of the `zstream` every single time, which adds
up over time. It would thus be great to reuse the `zstream` instead of
discarding it after every inflation.

Open-code the call to `uncompress2()` such that we can start reusing the
`zstream` in the subsequent commit. Note that our open-coded variant is
different from `uncompress2()` in two ways:

  - We do not loop around `inflate()` until we have processed all input.
    As our input is limited by the maximum block size, which is 16MB, we
    should not hit limits of `inflate()`.

  - We use `Z_FINISH` instead of `Z_NO_FLUSH`. Quoting the `inflate()`
    documentation: "inflate() should normally be called until it returns
    Z_STREAM_END or an error. However if all decompression is to be
    performed in a single step (a single call of inflate), the parameter
    flush should be set to Z_FINISH."

    Furthermore, "Z_FINISH also informs inflate to not maintain a
    sliding window if the stream completes, which reduces inflate's
    memory footprint."

Other than that this commit is expected to be functionally equivalent
and does not yet reuse the `zstream`.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 reftable/block.c | 38 ++++++++++++++++++++++++++++----------
 1 file changed, 28 insertions(+), 10 deletions(-)

diff --git a/reftable/block.c b/reftable/block.c
index 31af075c1d..31e7255056 100644
--- a/reftable/block.c
+++ b/reftable/block.c
@@ -195,10 +195,10 @@ int block_reader_init(struct block_reader *br, struct reftable_block *block,
 	}
 
 	if (typ == BLOCK_TYPE_LOG) {
-		int block_header_skip = 4 + header_off;
-		uLongf dst_len = sz - block_header_skip; /* total size of dest
-							    buffer. */
-		uLongf src_len = block->len - block_header_skip;
+		uint32_t block_header_skip = 4 + header_off;
+		uLong dst_len = sz - block_header_skip;
+		uLong src_len = block->len - block_header_skip;
+		z_stream stream = {0};
 
 		/* Log blocks specify the *uncompressed* size in their header. */
 		REFTABLE_ALLOC_GROW(br->uncompressed_data, sz,
@@ -207,15 +207,33 @@ int block_reader_init(struct block_reader *br, struct reftable_block *block,
 		/* Copy over the block header verbatim. It's not compressed. */
 		memcpy(br->uncompressed_data, block->data, block_header_skip);
 
-		/* Uncompress */
-		if (Z_OK !=
-		    uncompress2(br->uncompressed_data + block_header_skip, &dst_len,
-				block->data + block_header_skip, &src_len)) {
+		err = inflateInit(&stream);
+		if (err != Z_OK) {
 			err = REFTABLE_ZLIB_ERROR;
 			goto done;
 		}
 
-		if (dst_len + block_header_skip != sz) {
+		stream.next_in = block->data + block_header_skip;
+		stream.avail_in = src_len;
+		stream.next_out = br->uncompressed_data + block_header_skip;
+		stream.avail_out = dst_len;
+
+		/*
+		 * We know both input as well as output size, and we know that
+		 * the sizes should never be bigger than `uInt_MAX` because
+		 * blocks can at most be 16MB large. We can thus use `Z_FINISH`
+		 * here to instruct zlib to inflate the data in one go, which
+		 * is more efficient than using `Z_NO_FLUSH`.
+		 */
+		err = inflate(&stream, Z_FINISH);
+		inflateEnd(&stream);
+		if (err != Z_STREAM_END) {
+			err = REFTABLE_ZLIB_ERROR;
+			goto done;
+		}
+		err = 0;
+
+		if (stream.total_out + block_header_skip != sz) {
 			err = REFTABLE_FORMAT_ERROR;
 			goto done;
 		}
@@ -224,7 +242,7 @@ int block_reader_init(struct block_reader *br, struct reftable_block *block,
 		reftable_block_done(block);
 		block->data = br->uncompressed_data;
 		block->len = sz;
-		full_block_size = src_len + block_header_skip;
+		full_block_size = src_len + block_header_skip - stream.avail_in;
 	} else if (full_block_size == 0) {
 		full_block_size = sz;
 	} else if (sz < full_block_size && sz < block->len &&
-- 
2.44.GIT


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

  parent reply	other threads:[~2024-03-27  6:38 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-03-27  6:36 [PATCH 0/9] reftable: optimize table and block iterators Patrick Steinhardt
2024-03-27  6:36 ` [PATCH 1/9] reftable/block: rename `block_reader_start()` Patrick Steinhardt
2024-03-27  6:37 ` [PATCH 2/9] reftable/block: merge `block_iter_seek()` and `block_reader_seek()` Patrick Steinhardt
2024-03-27  6:37 ` [PATCH 3/9] reftable/block: better grouping of functions Patrick Steinhardt
2024-03-27  6:37 ` [PATCH 4/9] reftable/block: introduce `block_reader_release()` Patrick Steinhardt
2024-04-03 13:16   ` Karthik Nayak
2024-04-08 12:10     ` Patrick Steinhardt
2024-03-27  6:37 ` [PATCH 5/9] reftable/block: move ownership of block reader into `struct table_iter` Patrick Steinhardt
2024-04-03  4:52   ` Justin Tobler
2024-04-03 13:10     ` Patrick Steinhardt
2024-03-27  6:37 ` [PATCH 6/9] reftable/reader: iterate to next block in place Patrick Steinhardt
2024-03-27  6:37 ` [PATCH 7/9] reftable/block: reuse uncompressed blocks Patrick Steinhardt
2024-03-27  6:37 ` Patrick Steinhardt [this message]
2024-03-27  6:37 ` [PATCH 9/9] reftable/block: reuse `zstream` state on inflation Patrick Steinhardt
2024-04-03 13:33 ` [PATCH 0/9] reftable: optimize table and block iterators Karthik Nayak
2024-04-08 12:16 ` [PATCH v2 00/10] " Patrick Steinhardt
2024-04-08 12:16   ` [PATCH v2 01/10] reftable/block: rename `block_reader_start()` Patrick Steinhardt
2024-04-08 12:16   ` [PATCH v2 02/10] reftable/block: merge `block_iter_seek()` and `block_reader_seek()` Patrick Steinhardt
2024-04-08 12:16   ` [PATCH v2 03/10] reftable/block: better grouping of functions Patrick Steinhardt
2024-04-08 12:16   ` [PATCH v2 04/10] reftable/block: introduce `block_reader_release()` Patrick Steinhardt
2024-04-08 12:16   ` [PATCH v2 05/10] reftable/block: move ownership of block reader into `struct table_iter` Patrick Steinhardt
2024-04-08 12:16   ` [PATCH v2 06/10] reftable/reader: iterate to next block in place Patrick Steinhardt
2024-04-08 12:16   ` [PATCH v2 07/10] reftable/block: reuse uncompressed blocks Patrick Steinhardt
2024-04-08 12:16   ` [PATCH v2 08/10] reftable/block: open-code call to `uncompress2()` Patrick Steinhardt
2024-04-08 12:17   ` [PATCH v2 09/10] reftable/block: reuse `zstream` state on inflation Patrick Steinhardt
2024-04-10 10:15     ` Karthik Nayak
2024-04-08 12:17   ` [PATCH v2 10/10] reftable/block: avoid copying block iterators on seek Patrick Steinhardt
2024-04-09  1:29     ` Justin Tobler
2024-04-09  3:18       ` Patrick Steinhardt
2024-04-09  1:32   ` [PATCH v2 00/10] reftable: optimize table and block iterators Justin Tobler
2024-04-10 11:35   ` Karthik Nayak

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=bf4c1ab797d32284b469034c5272383fd998439e.1711519925.git.ps@pks.im \
    --to=ps@pks.im \
    --cc=git@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).