From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net X-Spam-Level: X-Spam-ASN: AS3215 2.6.0.0/16 X-Spam-Status: No, score=-3.4 required=3.0 tests=AWL,BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE, SPF_PASS,T_SCC_BODY_TEXT_LINE shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2 Received: from out1.vger.email (out1.vger.email [IPv6:2620:137:e000::1:20]) by dcvr.yhbt.net (Postfix) with ESMTP id C46B21F8C4 for ; Sat, 19 Mar 2022 00:24:20 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S241068AbiCSAZf (ORCPT ); Fri, 18 Mar 2022 20:25:35 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38284 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S241664AbiCSAZa (ORCPT ); Fri, 18 Mar 2022 20:25:30 -0400 Received: from mail-wr1-x431.google.com (mail-wr1-x431.google.com [IPv6:2a00:1450:4864:20::431]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id B6B3713665B for ; Fri, 18 Mar 2022 17:23:44 -0700 (PDT) Received: by mail-wr1-x431.google.com with SMTP id u10so13691139wra.9 for ; Fri, 18 Mar 2022 17:23:44 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=AOeNIRp44Mv2o6ue+ZHw19CaEV4VWuq394+OG/scwCc=; b=by0iMjvAuFtbLLA5VdHZpWhqVUjZvjyp4gfy8ZBkQxdO9Z0om0UcRk/r2Q7M8llRjr A7tOD+O+EE3+aRGjgw8GLwabprLt6m2kyoC+08XVjRCRdfqrQbwUzm+16RY8voWb/xto xJxYGYMxSbuu5gWEDbydywsn04GgUot40MSLLwa5a1XepP+khkvuig8xfpibvvEDZGKB brRkjeedAvxlBxoxawjX4n4NG5uHKhu2//kAeQELuICe3HKgPnTDduej3UnLFpoBQ7Lj uJ+/jGqpy5XZdtjQqbCq7DbIGCpjgIr7c22JxRov/bO5B/78Uqzik+rmC71jv6/usgIU 96lw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=AOeNIRp44Mv2o6ue+ZHw19CaEV4VWuq394+OG/scwCc=; b=Cy9/ij0vo8WxlMrJKFqmzWklL/LkUWP3Nmz8xNlD5unlHNr7Lw7UtqZISOyClYs7af JPloWod8PT7Pou5vkATCtDVzP9Z9LRvgIE6yYlRqh8bsLJO3CPp3fCWzZDIwRWbMyCJN rmFutkRQLJvKvunb9bSC8g9Sv+8x47wO3WvU0UB/2TH8WG3rCicHgrcBwisV7u6lzpBg +USUws4szSFAeu02jrOiOzqDBodmQa9Gig6rtCgbs/1UZHIe3vHm7kLPZ1o4UDzuLA4h h8frX3ny4uG3RIRe/eyTCk57EJm/mN+IjOzKQfjU1qf2VDTVxgzUQT1eYKf29eKzGvfA /dEg== X-Gm-Message-State: AOAM531cql+rwoDFOEQkyDnr7EpeevPaJyb1DXtH6MdQrnrJHIeWYeIU g84cHanbB9zG0rIsRTN+1q0Nyoj3DaHUQw== X-Google-Smtp-Source: ABdhPJx2jhhZsVI37YThExjW8b00uCYTI6nwGmHMvIycwbLA7/B6iJNzQt+Ob/Iu0mpMAkPi9dbI6A== X-Received: by 2002:adf:f192:0:b0:203:e38f:afa1 with SMTP id h18-20020adff192000000b00203e38fafa1mr9571114wro.120.1647649422267; Fri, 18 Mar 2022 17:23:42 -0700 (PDT) Received: from vm.nix.is (vm.nix.is. [2a01:4f8:120:2468::2]) by smtp.gmail.com with ESMTPSA id u25-20020a05600c211900b00389d4bdb3d2sm14865519wml.36.2022.03.18.17.23.41 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 18 Mar 2022 17:23:41 -0700 (PDT) From: =?UTF-8?q?=C3=86var=20Arnfj=C3=B6r=C3=B0=20Bjarmason?= To: git@vger.kernel.org Cc: Junio C Hamano , Han Xin , Jiang Xin , =?UTF-8?q?Ren=C3=A9=20Scharfe?= , Derrick Stolee , Philip Oakley , Han Xin , =?UTF-8?q?=C3=86var=20Arnfj=C3=B6r=C3=B0=20Bjarmason?= , Jiang Xin Subject: [PATCH v11 8/8] unpack-objects: use stream_loose_object() to unpack large objects Date: Sat, 19 Mar 2022 01:23:25 +0100 Message-Id: X-Mailer: git-send-email 2.35.1.1438.g8874c8eeb35 In-Reply-To: References: MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Han Xin Make use of the stream_loose_object() function introduced in the preceding commit to unpack large objects. Before this we'd need to malloc() the size of the blob before unpacking it, which could cause OOM with very large blobs. We could use the new streaming interface to unpack all blobs, but doing so would be much slower, as demonstrated e.g. with this benchmark using git-hyperfine[0]: rm -rf /tmp/scalar.git && git clone --bare https://github.com/Microsoft/scalar.git /tmp/scalar.git && mv /tmp/scalar.git/objects/pack/*.pack /tmp/scalar.git/my.pack && git hyperfine \ -r 2 --warmup 1 \ -L rev origin/master,HEAD -L v "10,512,1k,1m" \ -s 'make' \ -p 'git init --bare dest.git' \ -c 'rm -rf dest.git' \ './git -C dest.git -c core.bigFileThreshold={v} unpack-objects &1 | grep Maximum' Using this test we'll always use >100MB of memory on origin/master (around ~105MB), but max out at e.g. ~55MB if we set core.bigFileThreshold=50m. The relevant "Maximum resident set size" lines were manually added below the relevant benchmark: '/usr/bin/time -v ./git -C dest.git -c core.bigFileThreshold=50m unpack-objects &1 | grep Maximum' in 'origin/master' ran Maximum resident set size (kbytes): 107080 1.02 ± 0.78 times faster than '/usr/bin/time -v ./git -C dest.git -c core.bigFileThreshold=512 unpack-objects &1 | grep Maximum' in 'origin/master' Maximum resident set size (kbytes): 106968 1.09 ± 0.79 times faster than '/usr/bin/time -v ./git -C dest.git -c core.bigFileThreshold=100m unpack-objects &1 | grep Maximum' in 'origin/master' Maximum resident set size (kbytes): 107032 1.42 ± 1.07 times faster than '/usr/bin/time -v ./git -C dest.git -c core.bigFileThreshold=100m unpack-objects &1 | grep Maximum' in 'HEAD' Maximum resident set size (kbytes): 107072 1.83 ± 1.02 times faster than '/usr/bin/time -v ./git -C dest.git -c core.bigFileThreshold=50m unpack-objects &1 | grep Maximum' in 'HEAD' Maximum resident set size (kbytes): 55704 2.16 ± 1.19 times faster than '/usr/bin/time -v ./git -C dest.git -c core.bigFileThreshold=512 unpack-objects &1 | grep Maximum' in 'HEAD' Maximum resident set size (kbytes): 4564 This shows that if you have enough memory this new streaming method is slower the lower you set the streaming threshold, but the benefit is more bounded memory use. An earlier version of this patch introduced a new "core.bigFileStreamingThreshold" instead of re-using the existing "core.bigFileThreshold" variable[1]. As noted in a detailed overview of its users in [2] using it has several different meanings. Still, we consider it good enough to simply re-use it. While it's possible that someone might want to e.g. consider objects "small" for the purposes of diffing but "big" for the purposes of writing them such use-cases are probably too obscure to worry about. We can always split up "core.bigFileThreshold" in the future if there's a need for that. 0. https://github.com/avar/git-hyperfine/ 1. https://lore.kernel.org/git/20211210103435.83656-1-chiyutianyi@gmail.com/ 2. https://lore.kernel.org/git/20220120112114.47618-5-chiyutianyi@gmail.com/ Helped-by: Ævar Arnfjörð Bjarmason Helped-by: Derrick Stolee Helped-by: Jiang Xin Signed-off-by: Han Xin Signed-off-by: Ævar Arnfjörð Bjarmason --- Documentation/config/core.txt | 4 +- builtin/unpack-objects.c | 67 +++++++++++++++++++++++++++++++++ t/t5351-unpack-large-objects.sh | 26 +++++++++++-- 3 files changed, 92 insertions(+), 5 deletions(-) diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt index b6a12218665..5aca987632c 100644 --- a/Documentation/config/core.txt +++ b/Documentation/config/core.txt @@ -436,8 +436,8 @@ usage, at the slight expense of increased disk usage. * Will be generally be streamed when written, which avoids excessive memory usage, at the cost of some fixed overhead. Commands that make use of this include linkgit:git-archive[1], -linkgit:git-fast-import[1], linkgit:git-index-pack[1] and -linkgit:git-fsck[1]. +linkgit:git-fast-import[1], linkgit:git-index-pack[1], +linkgit:git-unpack-objects[1] and linkgit:git-fsck[1]. core.excludesFile:: Specifies the pathname to the file that contains patterns to diff --git a/builtin/unpack-objects.c b/builtin/unpack-objects.c index d374599d544..9d7b325c23b 100644 --- a/builtin/unpack-objects.c +++ b/builtin/unpack-objects.c @@ -338,6 +338,68 @@ static void added_object(unsigned nr, enum object_type type, } } +struct input_zstream_data { + git_zstream *zstream; + unsigned char buf[8192]; + int status; +}; + +static const void *feed_input_zstream(struct input_stream *in_stream, + unsigned long *readlen) +{ + struct input_zstream_data *data = in_stream->data; + git_zstream *zstream = data->zstream; + void *in = fill(1); + + if (in_stream->is_finished) { + *readlen = 0; + return NULL; + } + + zstream->next_out = data->buf; + zstream->avail_out = sizeof(data->buf); + zstream->next_in = in; + zstream->avail_in = len; + + data->status = git_inflate(zstream, 0); + + in_stream->is_finished = data->status != Z_OK; + use(len - zstream->avail_in); + *readlen = sizeof(data->buf) - zstream->avail_out; + + return data->buf; +} + +static void stream_blob(unsigned long size, unsigned nr) +{ + git_zstream zstream = { 0 }; + struct input_zstream_data data = { 0 }; + struct input_stream in_stream = { + .read = feed_input_zstream, + .data = &data, + }; + struct obj_info *info = &obj_list[nr]; + + data.zstream = &zstream; + git_inflate_init(&zstream); + + if (stream_loose_object(&in_stream, size, &info->oid)) + die(_("failed to write object in stream")); + + if (data.status != Z_STREAM_END) + die(_("inflate returned (%d)"), data.status); + git_inflate_end(&zstream); + + if (strict) { + struct blob *blob = lookup_blob(the_repository, &info->oid); + + if (!blob) + die(_("invalid blob object from stream")); + blob->object.flags |= FLAG_WRITTEN; + } + info->obj = NULL; +} + static int resolve_against_held(unsigned nr, const struct object_id *base, void *delta_data, unsigned long delta_size) { @@ -471,6 +533,11 @@ static void unpack_one(unsigned nr) switch (type) { case OBJ_BLOB: + if (!dry_run && size > big_file_threshold) { + stream_blob(size, nr); + return; + } + /* fallthrough */ case OBJ_COMMIT: case OBJ_TREE: case OBJ_TAG: diff --git a/t/t5351-unpack-large-objects.sh b/t/t5351-unpack-large-objects.sh index 8d84313221c..461ca060b2b 100755 --- a/t/t5351-unpack-large-objects.sh +++ b/t/t5351-unpack-large-objects.sh @@ -9,7 +9,8 @@ test_description='git unpack-objects with large objects' prepare_dest () { test_when_finished "rm -rf dest.git" && - git init --bare dest.git + git init --bare dest.git && + git -C dest.git config core.bigFileThreshold "$1" } test_expect_success "create large objects (1.5 MB) and PACK" ' @@ -26,16 +27,35 @@ test_expect_success 'set memory limitation to 1MB' ' ' test_expect_success 'unpack-objects failed under memory limitation' ' - prepare_dest && + prepare_dest 2m && test_must_fail git -C dest.git unpack-objects err && grep "fatal: attempting to allocate" err ' test_expect_success 'unpack-objects works with memory limitation in dry-run mode' ' - prepare_dest && + prepare_dest 2m && git -C dest.git unpack-objects -n