From: Han Xin
Date: Tue, 11 Jan 2022 18:33:50 +0800
Subject: Re: [PATCH v8 2/6] object-file.c: refactor write_loose_object() to several steps
To: René Scharfe
Cc: Junio C Hamano, Git List, Jeff King, Jiang Xin, Philip Oakley, Ævar Arnfjörð Bjarmason, Derrick Stolee, Han Xin
References: <20211217112629.12334-1-chiyutianyi@gmail.com> <20220108085419.79682-3-chiyutianyi@gmail.com>
X-Mailing-List: git@vger.kernel.org

On Sat, Jan 8, 2022 at 8:28 PM René Scharfe wrote:
>
> On 08.01.22 at 09:54, Han Xin wrote:
> > From: Han Xin
> >
> > When writing a large blob using "write_loose_object()", we have to pass
> > a buffer with the whole content of the blob, and this behavior will
> > consume lots of memory and may cause OOM. We will introduce a stream
> > version function ("stream_loose_object()") in a later commit to resolve
> > this issue.
> >
> > Before introducing a stream version function for writing loose objects,
> > do some refactoring on "write_loose_object()" to reuse code for both
> > versions.
> >
> > Rewrite "write_loose_object()" as follows:
> >
> >  1. Figure out a path for the (temp) object file. This step is only
> >     used in "write_loose_object()".
> >
> >  2. Move common steps for starting to write loose objects into a new
> >     function "start_loose_object_common()".
> >
> >  3. Compress data.
> >
> >  4. Move common steps for ending zlib stream into a new function
> >     "end_loose_object_common()".
> >
> >  5. Close fd and finalize the object file.
> >
> > Helped-by: Ævar Arnfjörð Bjarmason
> > Helped-by: Jiang Xin
> > Signed-off-by: Han Xin
> > ---
> >  object-file.c | 149 +++++++++++++++++++++++++++++++++++---------------
> >  1 file changed, 105 insertions(+), 44 deletions(-)
> >
> > diff --git a/object-file.c b/object-file.c
> > index eb1426f98c..5d163081b1 100644
> > --- a/object-file.c
> > +++ b/object-file.c
> > @@ -1743,6 +1743,25 @@ static void write_object_file_prepare(const struct git_hash_algo *algo,
> >  	algo->final_oid_fn(oid, &c);
> >  }
> >
> > +/*
> > + * Move the just written object with proper mtime into its final resting place.
> > + */
> > +static int finalize_object_file_with_mtime(const char *tmpfile,
> > +					   const char *filename,
> > +					   time_t mtime,
> > +					   unsigned flags)
>
> This function is called only once after your series. Should it be used by
> stream_loose_object()? Probably not -- the latter doesn't have a way to
> force a certain modification time and its caller doesn't need one. So
> creating finalize_object_file_with_mtime() seems unnecessary for this
> series.
>

After accepting the suggestion by Ævar Arnfjörð Bjarmason[1] to remove
finalize_object_file_with_mtime() from stream_loose_object(), it seems
to be overkill for write_loose_object() now. I'll put it back into
write_loose_object().

1. https://lore.kernel.org/git/211221.86pmpqq9aj.gmgdl@evledraar.gmail.com/

Thanks
-Han Xin

> > +{
> > +	struct utimbuf utb;
> > +
> > +	if (mtime) {
> > +		utb.actime = mtime;
> > +		utb.modtime = mtime;
> > +		if (utime(tmpfile, &utb) < 0 && !(flags & HASH_SILENT))
> > +			warning_errno(_("failed utime() on %s"), tmpfile);
> > +	}
> > +	return finalize_object_file(tmpfile, filename);
> > +}
> > +
> >  /*
> >   * Move the just written object into its final resting place.
> >   */
> > @@ -1828,7 +1847,8 @@ static inline int directory_size(const char *filename)
> >   * We want to avoid cross-directory filename renames, because those
> >   * can have problems on various filesystems (FAT, NFS, Coda).
> >   */
> > -static int create_tmpfile(struct strbuf *tmp, const char *filename)
> > +static int create_tmpfile(struct strbuf *tmp, const char *filename,
> > +			  unsigned flags)
>
> create_tmpfile() is not mentioned in the commit message, yet it's
> changed here. Hrm.
>
> >  {
> >  	int fd, dirlen = directory_size(filename);
> >
> > @@ -1836,7 +1856,9 @@ static int create_tmpfile(struct strbuf *tmp, const char *filename)
> >  	strbuf_add(tmp, filename, dirlen);
> >  	strbuf_addstr(tmp, "tmp_obj_XXXXXX");
> >  	fd = git_mkstemp_mode(tmp->buf, 0444);
> > -	if (fd < 0 && dirlen && errno == ENOENT) {
> > +	do {
> > +		if (fd >= 0 || !dirlen || errno != ENOENT)
> > +			break;
>
> Why turn this branch into a loop? Is this done to mkdir multiple
> components, e.g. with filename being "a/b/c/file" to create "a", "a/b",
> and "a/b/c"? It's only used for loose objects, so a fan-out directory
> (e.g. ".git/objects/ff") can certainly be missing, but can their parent
> be missing as well sometimes? If that's the point then such a fix
> would be worth its own patch. (Which probably would benefit from using
> safe_create_leading_directories()).
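
As an aside on the retry logic questioned above, here is a minimal sketch
that is not Git code. It assumes plain POSIX mkstemp() and mkdir() in place
of git_mkstemp_mode() and adjust_shared_perm(), and make_tmp_in_dir() is a
hypothetical name. It only illustrates that a plain "if" is enough to retry
once after creating a single missing fan-out directory, which is what the
pre-patch branch appears to do.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical helper, not Git's create_tmpfile(): create a temporary
 * file inside "dir", creating "dir" itself (one level only) if it is
 * missing. On success the file name is left in "path" and the open
 * file descriptor is returned; on failure -1 is returned.
 */
static int make_tmp_in_dir(const char *dir, char *path, size_t pathlen)
{
	int fd, n;

	assert(dir && path);
	n = snprintf(path, pathlen, "%s/tmp_obj_XXXXXX", dir);
	if (n < 0 || (size_t)n >= pathlen) {
		errno = ENAMETOOLONG;
		return -1;
	}

	fd = mkstemp(path);
	if (fd < 0 && errno == ENOENT) {
		/* Create the missing directory; EEXIST means we lost a race. */
		if (mkdir(dir, 0777) && errno != EEXIST)
			return -1;
		/* The template is undefined after a failed mkstemp(): rebuild it. */
		snprintf(path, pathlen, "%s/tmp_obj_XXXXXX", dir);
		fd = mkstemp(path);
	}
	return fd;
}
```

Creating several missing path components would need a real loop or, in Git,
something like the safe_create_leading_directories() mentioned above.
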
>
> >  		/*
> >  		 * Make sure the directory exists; note that the contents
> >  		 * of the buffer are undefined after mkstemp returns an
> > @@ -1846,17 +1868,72 @@ static int create_tmpfile(struct strbuf *tmp, const char *filename)
> >  		strbuf_reset(tmp);
> >  		strbuf_add(tmp, filename, dirlen - 1);
> >  		if (mkdir(tmp->buf, 0777) && errno != EEXIST)
> > -			return -1;
> > +			break;
> >  		if (adjust_shared_perm(tmp->buf))
> > -			return -1;
> > +			break;
>
> Or is it just to replace these returns with a jump to the new error
> reporting section?
>
> >
> >  		/* Try again */
> >  		strbuf_addstr(tmp, "/tmp_obj_XXXXXX");
> >  		fd = git_mkstemp_mode(tmp->buf, 0444);
>
> In that case a break would be missing here.
>
> > +	} while (0);
> > +
> > +	if (fd < 0 && !(flags & HASH_SILENT)) {
> > +		if (errno == EACCES)
> > +			return error(_("insufficient permission for adding an "
> > +				       "object to repository database %s"),
> > +				     get_object_directory());
> > +		else
> > +			return error_errno(_("unable to create temporary file"));
> >  	}
>
> Why move this error reporting code into create_tmpfile()? This function
> has a single caller both before and after your series, so the code could
> just as well stay at its call-site, avoiding the need to add the flags
> parameter.
>

This is a leftover from v7: there is no longer a step called "Figuring
out a path for the (temp) object file.", and it's only used in
start_loose_object_common(). I will bring it back to what it was.

Thanks
-Han Xin

> > +
> >  	return fd;
> >  }
> >
> > +static int start_loose_object_common(struct strbuf *tmp_file,
> > +				     const char *filename, unsigned flags,
> > +				     git_zstream *stream,
> > +				     unsigned char *buf, size_t buflen,
> > +				     git_hash_ctx *c,
> > +				     enum object_type type, size_t len,
>
> The parameters type and len are not used by this function and thus can
> be dropped.
>

*nod*

> > +				     char *hdr, int hdrlen)
> > +{
> > +	int fd;
> > +
> > +	fd = create_tmpfile(tmp_file, filename, flags);
> > +	if (fd < 0)
> > +		return -1;
> > +
> > +	/* Setup zlib stream for compression */
> > +	git_deflate_init(stream, zlib_compression_level);
> > +	stream->next_out = buf;
> > +	stream->avail_out = buflen;
> > +	the_hash_algo->init_fn(c);
> > +
> > +	/* Start to feed header to zlib stream */
> > +	stream->next_in = (unsigned char *)hdr;
> > +	stream->avail_in = hdrlen;
> > +	while (git_deflate(stream, 0) == Z_OK)
> > +		; /* nothing */
> > +	the_hash_algo->update_fn(c, hdr, hdrlen);
> > +
> > +	return fd;
> > +}
> > +
> > +static void end_loose_object_common(int ret, git_hash_ctx *c,
> > +				    git_zstream *stream,
> > +				    struct object_id *parano_oid,
> > +				    const struct object_id *expected_oid,
> > +				    const char *die_msg1_fmt,
> > +				    const char *die_msg2_fmt)
>
> Hmm, the signature needs as many lines as the function body.
>
> > +{
> > +	if (ret != Z_STREAM_END)
> > +		die(_(die_msg1_fmt), ret, expected_oid);
> > +	ret = git_deflate_end_gently(stream);
> > +	if (ret != Z_OK)
> > +		die(_(die_msg2_fmt), ret, expected_oid);
>
> These format strings cannot be checked by the compiler.
>
> Considering those two together I think I'd either unify the error
> messages and move their strings here (losing the ability for users
> to see if streaming was used) or not extract the function and
> duplicate its few shared lines. Just a feeling, though.
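
To illustrate the point about unchecked format strings, here is a small
sketch outside of Git. report() and the two wrappers are hypothetical
names, and the attribute is the GCC/Clang "format" attribute rather than
anything Git-specific. With a literal format at the call site the compiler
can compare "%s (%d)" against the argument types; once the format arrives
through a parameter, it cannot. The quoted hunk appears to pass
(ret, expected_oid) to a "%s (%d)" message, which looks like the kind of
mismatch such a check would flag.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

#if defined(__GNUC__)
#define FORMAT_PRINTF(fmt_idx, args_idx) \
	__attribute__((format(printf, fmt_idx, args_idx)))
#else
#define FORMAT_PRINTF(fmt_idx, args_idx)
#endif

/* Hypothetical reporter: formats a message into "buf". */
FORMAT_PRINTF(3, 4)
static int report(char *buf, size_t len, const char *fmt, ...)
{
	va_list ap;
	int n;

	assert(buf && fmt);
	va_start(ap, fmt);
	n = vsnprintf(buf, len, fmt, ap);
	va_end(ap);
	return n;
}

/* Checked: the literal is visible here, so -Wformat can verify the arguments. */
static int report_deflate_failure(char *buf, size_t len,
				  const char *hex, int ret)
{
	return report(buf, len, "unable to deflate new object %s (%d)", hex, ret);
}

/* Unchecked: the format is a run-time value, so a mismatch goes unnoticed. */
static int report_with_caller_format(char *buf, size_t len, const char *fmt,
				     const char *hex, int ret)
{
	return report(buf, len, fmt, hex, ret);
}
```

Swapping "hex" and "ret" in report_deflate_failure() draws a -Wformat
warning from GCC and Clang; the same swap in report_with_caller_format()
compiles without one.
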
>
> > +	the_hash_algo->final_oid_fn(parano_oid, c);
> > +}
> > +
> >  static int write_loose_object(const struct object_id *oid, char *hdr,
> >  			      int hdrlen, const void *buf, unsigned long len,
> >  			      time_t mtime, unsigned flags)
> > @@ -1871,28 +1948,18 @@ static int write_loose_object(const struct object_id *oid, char *hdr,
> >
> >  	loose_object_path(the_repository, &filename, oid);
> >
> > -	fd = create_tmpfile(&tmp_file, filename.buf);
> > -	if (fd < 0) {
> > -		if (flags & HASH_SILENT)
> > -			return -1;
> > -		else if (errno == EACCES)
> > -			return error(_("insufficient permission for adding an object to repository database %s"), get_object_directory());
> > -		else
> > -			return error_errno(_("unable to create temporary file"));
> > -	}
> > -
> > -	/* Set it up */
> > -	git_deflate_init(&stream, zlib_compression_level);
> > -	stream.next_out = compressed;
> > -	stream.avail_out = sizeof(compressed);
> > -	the_hash_algo->init_fn(&c);
> > -
> > -	/* First header.. */
> > -	stream.next_in = (unsigned char *)hdr;
> > -	stream.avail_in = hdrlen;
> > -	while (git_deflate(&stream, 0) == Z_OK)
> > -		; /* nothing */
> > -	the_hash_algo->update_fn(&c, hdr, hdrlen);
> > +	/* Common steps for write_loose_object and stream_loose_object to
> > +	 * start writing loose object:
> > +	 *
> > +	 *  - Create tmpfile for the loose object.
> > +	 *  - Setup zlib stream for compression.
> > +	 *  - Start to feed header to zlib stream.
> > +	 */
> > +	fd = start_loose_object_common(&tmp_file, filename.buf, flags,
> > +				       &stream, compressed, sizeof(compressed),
> > +				       &c, OBJ_NONE, 0, hdr, hdrlen);
> > +	if (fd < 0)
> > +		return -1;
> >
> >  	/* Then the data itself.. */
> >  	stream.next_in = (void *)buf;
> > @@ -1907,30 +1974,24 @@ static int write_loose_object(const struct object_id *oid, char *hdr,
> >  		stream.avail_out = sizeof(compressed);
> >  	} while (ret == Z_OK);
> >
> > -	if (ret != Z_STREAM_END)
> > -		die(_("unable to deflate new object %s (%d)"), oid_to_hex(oid),
> > -		    ret);
> > -	ret = git_deflate_end_gently(&stream);
> > -	if (ret != Z_OK)
> > -		die(_("deflateEnd on object %s failed (%d)"), oid_to_hex(oid),
> > -		    ret);
> > -	the_hash_algo->final_oid_fn(&parano_oid, &c);
> > +	/* Common steps for write_loose_object and stream_loose_object to
> > +	 * end writing loose object:
> > +	 *
> > +	 *  - End the compression of zlib stream.
> > +	 *  - Get the calculated oid to "parano_oid".
> > +	 */
> > +	end_loose_object_common(ret, &c, &stream, &parano_oid, oid,
> > +				N_("unable to deflate new object %s (%d)"),
> > +				N_("deflateEnd on object %s failed (%d)"));
> > +
> >  	if (!oideq(oid, &parano_oid))
> >  		die(_("confused by unstable object source data for %s"),
> >  		    oid_to_hex(oid));
> >
> >  	close_loose_object(fd);
> >
> > -	if (mtime) {
> > -		struct utimbuf utb;
> > -		utb.actime = mtime;
> > -		utb.modtime = mtime;
> > -		if (utime(tmp_file.buf, &utb) < 0 &&
> > -		    !(flags & HASH_SILENT))
> > -			warning_errno(_("failed utime() on %s"), tmp_file.buf);
> > -	}
> > -
> > -	return finalize_object_file(tmp_file.buf, filename.buf);
> > +	return finalize_object_file_with_mtime(tmp_file.buf, filename.buf,
> > +					       mtime, flags);
> >  }
> >
> >  static int freshen_loose_object(const struct object_id *oid)
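
For reference, a minimal sketch of the mtime handling in the last hunk,
assuming plain POSIX utime() and rename(). finalize_with_mtime() and
file_mtime() are hypothetical names, and Git's finalize_object_file() does
more than a rename, so this only illustrates the ordering: set the
timestamp on the temporary file first, then move it into place.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>
#include <utime.h>

/*
 * Hypothetical stand-in for the quoted helper: apply "mtime" (if nonzero)
 * to the temporary file, then rename it to its final name. A failing
 * utime() is only reported, as in the quoted code.
 */
static int finalize_with_mtime(const char *tmpfile, const char *filename,
			       time_t mtime)
{
	assert(tmpfile && filename);
	if (mtime) {
		struct utimbuf utb;

		utb.actime = mtime;
		utb.modtime = mtime;
		if (utime(tmpfile, &utb) < 0)
			perror("failed utime()");
	}
	return rename(tmpfile, filename);
}

/* Return the modification time of "path", or (time_t)-1 on error. */
static time_t file_mtime(const char *path)
{
	struct stat st;

	return stat(path, &st) < 0 ? (time_t)-1 : st.st_mtime;
}
```

Because rename() keeps the inode, the timestamp set on the temporary file
is the one seen under the final name.
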