From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net X-Spam-Level: X-Spam-ASN: AS53758 23.128.96.0/24 X-Spam-Status: No, score=-4.2 required=3.0 tests=AWL,BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,RCVD_IN_DNSWL_LOW, SPF_HELO_PASS,SPF_PASS shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2 Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by dcvr.yhbt.net (Postfix) with ESMTP id DA79B1F934 for ; Tue, 18 May 2021 07:31:46 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S241980AbhERHcr (ORCPT ); Tue, 18 May 2021 03:32:47 -0400 Received: from cloud.peff.net ([104.130.231.41]:57446 "EHLO cloud.peff.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S240235AbhERHcq (ORCPT ); Tue, 18 May 2021 03:32:46 -0400 Received: (qmail 11241 invoked by uid 109); 18 May 2021 07:31:28 -0000 Received: from Unknown (HELO peff.net) (10.0.1.2) by cloud.peff.net (qpsmtpd/0.94) with ESMTP; Tue, 18 May 2021 07:31:28 +0000 Authentication-Results: cloud.peff.net; auth=none Received: (qmail 5388 invoked by uid 111); 18 May 2021 07:31:30 -0000 Received: from coredump.intra.peff.net (HELO sigill.intra.peff.net) (10.0.0.2) by peff.net (qpsmtpd/0.94) with (TLS_AES_256_GCM_SHA384 encrypted) ESMTPS; Tue, 18 May 2021 03:31:30 -0400 Authentication-Results: peff.net; auth=none Date: Tue, 18 May 2021 03:31:28 -0400 From: Jeff King To: Derrick Stolee via GitGitGadget Cc: git@vger.kernel.org, gitster@pobox.com, stolee@gmail.com, git@jeffhostetler.com, Derrick Stolee , Derrick Stolee Subject: Re: [PATCH v2 2/4] csum-file.h: increase hashfile buffer size Message-ID: References: <9dc602f6c4221e2259778842ec3d1eda57508333.1621254292.git.gitgitgadget@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <9dc602f6c4221e2259778842ec3d1eda57508333.1621254292.git.gitgitgadget@gmail.com> Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org On Mon, May 17, 2021 at 12:24:50PM +0000, Derrick Stolee via GitGitGadget wrote: > From: Derrick Stolee > > The hashfile API uses a hard-coded buffer size of 8KB and has ever since > it was introduced in c38138c (git-pack-objects: write the pack files > with a SHA1 csum, 2005-06-26). It performs a similar function to the > hashing buffers in read-cache.c, but that code was updated from 8KB to > 128KB in f279894 (read-cache: make the index write buffer size 128K, > 2021-02-18). The justification there was that do_write_index() improves > from 1.02s to 0.72s. > > There is a buffer, check_buffer, that is used to verify the check_fd > file descriptor. When this buffer increases to 128K to fit the data > being flushed, it causes the stack to overflow the limits placed in the > test suite. By moving this to a static buffer, we stop using stack data > for this purpose, but we lose some thread-safety. This change makes it > unsafe to write to multiple hashfiles across different threads. > > By adding a new trace2 region in the chunk-format API, we can see that > the writing portion of 'git multi-pack-index write' lowers from ~1.49s > to ~1.47s on a Linux machine. These effects may be more pronounced or > diminished on other filesystems. The end-to-end timing is too noisy to > have a definitive change either way. I think there is one thing missing from this commit message: why we want to do this. You mentioned that read-cache got larger by using a bigger buffer. But here we use a bigger buffer, and it produces no improvement larger than the noise. And on top of it, you describe the static-buffer downsides. So why not just skip it? :) And the answer is in the larger series: we want to be able to make use of the hashfile API in read-cache, but without regressing the performance. One sentence at the end of the first paragraph would clarify that quite a bit, I think. > +static void verify_buffer_or_die(struct hashfile *f, > + const void *buf, > + unsigned int count) > +{ > + static unsigned char check_buffer[WRITE_BUFFER_SIZE]; > + ssize_t ret = read_in_full(f->check_fd, check_buffer, count); > + > + if (ret < 0) > + die_errno("%s: sha1 file read error", f->name); > + if (ret != count) > + die("%s: sha1 file truncated", f->name); > + if (memcmp(buf, check_buffer, count)) > + die("sha1 file '%s' validation error", f->name); > +} Does this have to use the same-size buffer? We could read and check smaller chunks, like: while (count > 0) { static unsigned char chunk[1024]; unsigned int chunk_len = sizeof(chunk) < count ? sizeof(chunk) : count; ssize_t ret = read_in_full(f->check_fd, chunk, chunk_len); if (ret < 0) ... if (ret != count) ... if (memcmp(buf, chunk, chunk_len)) ... buf += chunk_len; count -= chunk_len; } We may prefer to use the larger buffer size for performance, but I think this "check" mode is only used for "index-pack --verify" and similar. The performance may matter a lot less to us there than for more frequently used code paths like index writing. I don't have a strong preference either way, but it's nice to avoid introducing non-reentrancy to a function (Junio's heap suggestion is also quite reasonable). -Peff