* Re: [PATCH v6 5/6] unpack-objects.c: add dry_run mode for get_data()
@ 2021-12-21 11:40 Han Xin
2021-12-21 17:17 ` Junio C Hamano
0 siblings, 1 reply; 4+ messages in thread
From: Han Xin @ 2021-12-21 11:40 UTC (permalink / raw)
To: l.s.r
Cc: avarab, chiyutianyi, git, gitster, hanxin.hx, peff, philipoakley,
stolee, zhiyou.jx
> Clever. Looks good to me.
>
> For some reason I was expecting this patch to have some connection to
> one of the earlier ones (perhaps because get_data() was mentioned),
> but it is technically independent.
I think I should adjust the order of the series and move this patch earlier.
Thanks
-Han Xin
* [PATCH v5 0/6] unpack large blobs in stream
@ 2021-12-10 10:34 Han Xin
2021-12-17 11:26 ` [PATCH v6 5/6] unpack-objects.c: add dry_run mode for get_data() Han Xin
0 siblings, 1 reply; 4+ messages in thread
From: Han Xin @ 2021-12-10 10:34 UTC (permalink / raw)
To: Junio C Hamano, Git List, Jeff King, Jiang Xin, Philip Oakley,
Ævar Arnfjörð Bjarmason, Derrick Stolee
Cc: Han Xin
From: Han Xin <hanxin.hx@alibaba-inc.com>
Changes since v4:
* Refactor the "struct input_stream" implementations so that we can
reduce the changes to "write_loose_object()", as suggested by
Ævar Arnfjörð Bjarmason.
* Add a new flag called "HASH_STREAM" to support this feature.
* Add a new config "core.bigFileStreamingThreshold" instead of reusing
"core.bigFileThreshold", as suggested by Ævar Arnfjörð Bjarmason[1].
* Roll destination repository preparation into a function in
"t5590-unpack-non-delta-objects.sh", so that we can run test cases
with --run=setup,3,4.
1. https://lore.kernel.org/git/211203.86zgphsu5a.gmgdl@evledraar.gmail.com/
Han Xin (6):
object-file: refactor write_loose_object() to support read from stream
object-file.c: handle undetermined oid in write_loose_object()
object-file.c: read stream in a loop in write_loose_object()
unpack-objects.c: add dry_run mode for get_data()
object-file.c: make "write_object_file_flags()" to support "HASH_STREAM"
unpack-objects: unpack_non_delta_entry() read data in a stream
Documentation/config/core.txt | 11 ++++
builtin/unpack-objects.c | 86 +++++++++++++++++++++++++++--
cache.h | 2 +
config.c | 5 ++
environment.c | 1 +
object-file.c | 73 +++++++++++++++++++-----
object-store.h | 5 ++
t/t5590-unpack-non-delta-objects.sh | 70 +++++++++++++++++++++++
8 files changed, 234 insertions(+), 19 deletions(-)
create mode 100755 t/t5590-unpack-non-delta-objects.sh
Range-diff against v4:
1: af707ef304 < -: ---------- object-file: refactor write_loose_object() to read buffer from stream
2: 321ad90d8e < -: ---------- object-file.c: handle undetermined oid in write_loose_object()
3: 1992ac39af < -: ---------- object-file.c: read stream in a loop in write_loose_object()
-: ---------- > 1: f3595e68cc object-file: refactor write_loose_object() to support read from stream
-: ---------- > 2: c25fdd1fe5 object-file.c: handle undetermined oid in write_loose_object()
-: ---------- > 3: ed226f2f9f object-file.c: read stream in a loop in write_loose_object()
4: c41eb06533 ! 4: 2f91e540f6 unpack-objects.c: add dry_run mode for get_data()
@@ builtin/unpack-objects.c: static void use(int bytes)
{
git_zstream stream;
- void *buf = xmallocz(size);
-+ unsigned long bufsize = dry_run ? 4096 : size;
++ unsigned long bufsize = dry_run ? 8192 : size;
+ void *buf = xmallocz(bufsize);
memset(&stream, 0, sizeof(stream));
-: ---------- > 5: 7698938eac object-file.c: make "write_object_file_flags()" to support "HASH_STREAM"
5: 9427775bdc ! 6: 103bb1db06 unpack-objects: unpack_non_delta_entry() read data in a stream
@@ Commit message
However, unpacking non-delta objects from a stream instead of from an entire
buffer incurs a ~10% performance penalty. Therefore, only unpack objects
- larger than the "big_file_threshold" in zstream. See the following
+ larger than the "core.BigFileStreamingThreshold" in zstream. See the following
benchmarks:
hyperfine \
--setup \
'if ! test -d scalar.git; then git clone --bare https://github.com/microsoft/scalar.git; cp scalar.git/objects/pack/*.pack small.pack; fi' \
- --prepare 'rm -rf dest.git && git init --bare dest.git' \
- -n 'old' 'git -C dest.git unpack-objects <small.pack' \
- -n 'new' 'new/git -C dest.git unpack-objects <small.pack' \
- -n 'new (small threshold)' \
- 'new/git -c core.bigfilethreshold=16k -C dest.git unpack-objects <small.pack'
- Benchmark 1: old
- Time (mean ± σ): 6.075 s ± 0.069 s [User: 5.047 s, System: 0.991 s]
- Range (min … max): 6.018 s … 6.189 s 10 runs
-
- Benchmark 2: new
- Time (mean ± σ): 6.090 s ± 0.033 s [User: 5.075 s, System: 0.976 s]
- Range (min … max): 6.030 s … 6.142 s 10 runs
-
- Benchmark 3: new (small threshold)
- Time (mean ± σ): 6.755 s ± 0.029 s [User: 5.150 s, System: 1.560 s]
- Range (min … max): 6.711 s … 6.809 s 10 runs
+ --prepare 'rm -rf dest.git && git init --bare dest.git'
Summary
- 'old' ran
- 1.00 ± 0.01 times faster than 'new'
- 1.11 ± 0.01 times faster than 'new (small threshold)'
+ './git -C dest.git -c core.bigfilethreshold=512m unpack-objects <small.pack' in 'origin/master'
+ 1.01 ± 0.04 times faster than './git -C dest.git -c core.bigfilethreshold=512m unpack-objects <small.pack' in 'HEAD~1'
+ 1.01 ± 0.04 times faster than './git -C dest.git -c core.bigfilethreshold=512m unpack-objects <small.pack' in 'HEAD~0'
+ 1.03 ± 0.10 times faster than './git -C dest.git -c core.bigfilethreshold=16k unpack-objects <small.pack' in 'origin/master'
+ 1.02 ± 0.07 times faster than './git -C dest.git -c core.bigfilethreshold=16k unpack-objects <small.pack' in 'HEAD~0'
+ 1.10 ± 0.04 times faster than './git -C dest.git -c core.bigfilethreshold=16k unpack-objects <small.pack' in 'HEAD~1'
+ Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Helped-by: Derrick Stolee <stolee@gmail.com>
Helped-by: Jiang Xin <zhiyou.jx@alibaba-inc.com>
Signed-off-by: Han Xin <hanxin.hx@alibaba-inc.com>
+ ## Documentation/config/core.txt ##
+@@ Documentation/config/core.txt: be delta compressed, but larger binary media files won't be.
+ +
+ Common unit suffixes of 'k', 'm', or 'g' are supported.
+
++core.bigFileStreamingThreshold::
++ Files larger than this will be streamed out to a temporary
++ object file while being hashed, which will then be renamed
++ in-place to a loose object, particularly if the
++ `core.bigFileThreshold` setting dictates that they're always
++ written out as loose objects.
+++
++Default is 128 MiB on all platforms.
+++
++Common unit suffixes of 'k', 'm', or 'g' are supported.
++
+ core.excludesFile::
+ Specifies the pathname to the file that contains patterns to
+ describe paths that are not meant to be tracked, in addition
+
## builtin/unpack-objects.c ##
@@ builtin/unpack-objects.c: static void added_object(unsigned nr, enum object_type type,
}
@@ builtin/unpack-objects.c: static void added_object(unsigned nr, enum object_type
+
+static void write_stream_blob(unsigned nr, unsigned long size)
+{
-+ char hdr[32];
-+ int hdrlen;
+ git_zstream zstream;
+ struct input_zstream_data data;
+ struct input_stream in_stream = {
+ .read = feed_input_zstream,
+ .data = &data,
-+ .size = size,
+ };
-+ struct object_id *oid = &obj_list[nr].oid;
+ int ret;
+
+ memset(&zstream, 0, sizeof(zstream));
@@ builtin/unpack-objects.c: static void added_object(unsigned nr, enum object_type
+ data.zstream = &zstream;
+ git_inflate_init(&zstream);
+
-+ /* Generate the header */
-+ hdrlen = xsnprintf(hdr, sizeof(hdr), "%s %"PRIuMAX, type_name(OBJ_BLOB), (uintmax_t)size) + 1;
-+
-+ if ((ret = write_loose_object(oid, hdr, hdrlen, &in_stream, 0, 0)))
++ if ((ret = write_object_file_flags(&in_stream, size, type_name(OBJ_BLOB), &obj_list[nr].oid, HASH_STREAM)))
+ die(_("failed to write object in stream %d"), ret);
+
+ if (zstream.total_out != size || data.status != Z_STREAM_END)
@@ builtin/unpack-objects.c: static void added_object(unsigned nr, enum object_type
+ git_inflate_end(&zstream);
+
+ if (strict && !dry_run) {
-+ struct blob *blob = lookup_blob(the_repository, oid);
++ struct blob *blob = lookup_blob(the_repository, &obj_list[nr].oid);
+ if (blob)
+ blob->object.flags |= FLAG_WRITTEN;
+ else
@@ builtin/unpack-objects.c: static void added_object(unsigned nr, enum object_type
+ void *buf;
+
+ /* Write large blob in stream without allocating full buffer. */
-+ if (!dry_run && type == OBJ_BLOB && size > big_file_threshold) {
++ if (!dry_run && type == OBJ_BLOB && size > big_file_streaming_threshold) {
+ write_stream_blob(nr, size);
+ return;
+ }
@@ builtin/unpack-objects.c: static void added_object(unsigned nr, enum object_type
write_object(nr, type, buf, size);
else
- ## object-file.c ##
-@@ object-file.c: static const void *feed_simple_input_stream(struct input_stream *in_stream, unsi
- return data->buf;
- }
+ ## cache.h ##
+@@ cache.h: extern size_t packed_git_window_size;
+ extern size_t packed_git_limit;
+ extern size_t delta_base_cache_limit;
+ extern unsigned long big_file_threshold;
++extern unsigned long big_file_streaming_threshold;
+ extern unsigned long pack_size_limit_cfg;
--static int write_loose_object(const struct object_id *oid, char *hdr,
-- int hdrlen, struct input_stream *in_stream,
-- time_t mtime, unsigned flags)
-+int write_loose_object(const struct object_id *oid, char *hdr,
-+ int hdrlen, struct input_stream *in_stream,
-+ time_t mtime, unsigned flags)
- {
- int fd, ret;
- unsigned char compressed[4096];
+ /*
- ## object-store.h ##
-@@ object-store.h: int hash_object_file(const struct git_hash_algo *algo, const void *buf,
- unsigned long len, const char *type,
- struct object_id *oid);
+ ## config.c ##
+@@ config.c: static int git_default_core_config(const char *var, const char *value, void *cb)
+ return 0;
+ }
-+int write_loose_object(const struct object_id *oid, char *hdr,
-+ int hdrlen, struct input_stream *in_stream,
-+ time_t mtime, unsigned flags);
++ if (!strcmp(var, "core.bigfilestreamingthreshold")) {
++ big_file_streaming_threshold = git_config_ulong(var, value);
++ return 0;
++ }
+
- int write_object_file_flags(const void *buf, unsigned long len,
- const char *type, struct object_id *oid,
- unsigned flags);
+ if (!strcmp(var, "core.packedgitlimit")) {
+ packed_git_limit = git_config_ulong(var, value);
+ return 0;
+
+ ## environment.c ##
+@@ environment.c: size_t packed_git_window_size = DEFAULT_PACKED_GIT_WINDOW_SIZE;
+ size_t packed_git_limit = DEFAULT_PACKED_GIT_LIMIT;
+ size_t delta_base_cache_limit = 96 * 1024 * 1024;
+ unsigned long big_file_threshold = 512 * 1024 * 1024;
++unsigned long big_file_streaming_threshold = 128 * 1024 * 1024;
+ int pager_use_color = 1;
+ const char *editor_program;
+ const char *askpass_program;
## t/t5590-unpack-non-delta-objects.sh (new) ##
@@
@@ t/t5590-unpack-non-delta-objects.sh (new)
+
+. ./test-lib.sh
+
-+test_expect_success "create commit with big blobs (1.5 MB)" '
++prepare_dest () {
++ test_when_finished "rm -rf dest.git" &&
++ git init --bare dest.git &&
++ git -C dest.git config core.bigFileStreamingThreshold $1 &&
++ git -C dest.git config core.bigFileThreshold $1
++}
++
++test_expect_success "setup repo with big blobs (1.5 MB)" '
+ test-tool genrandom foo 1500000 >big-blob &&
+ test_commit --append foo big-blob &&
+ test-tool genrandom bar 1500000 >big-blob &&
@@ t/t5590-unpack-non-delta-objects.sh (new)
+ cd .git &&
+ find objects/?? -type f | sort
+ ) >expect &&
-+ PACK=$(echo main | git pack-objects --progress --revs test)
++ PACK=$(echo main | git pack-objects --revs test)
+'
+
-+test_expect_success 'setup GIT_ALLOC_LIMIT to 1MB' '
++test_expect_success 'setup env: GIT_ALLOC_LIMIT to 1MB' '
+ GIT_ALLOC_LIMIT=1m &&
+ export GIT_ALLOC_LIMIT
+'
+
-+test_expect_success 'prepare dest repository' '
-+ git init --bare dest.git &&
-+ git -C dest.git config core.bigFileThreshold 2m &&
-+ git -C dest.git config receive.unpacklimit 100
-+'
-+
+test_expect_success 'fail to unpack-objects: cannot allocate' '
++ prepare_dest 2m &&
+ test_must_fail git -C dest.git unpack-objects <test-$PACK.pack 2>err &&
-+ test_i18ngrep "fatal: attempting to allocate" err &&
++ grep "fatal: attempting to allocate" err &&
+ (
+ cd dest.git &&
+ find objects/?? -type f | sort
+ ) >actual &&
++ test_file_not_empty actual &&
+ ! test_cmp expect actual
+'
+
-+test_expect_success 'set a lower bigfile threshold' '
-+ git -C dest.git config core.bigFileThreshold 1m
-+'
-+
+test_expect_success 'unpack big object in stream' '
++ prepare_dest 1m &&
+ git -C dest.git unpack-objects <test-$PACK.pack &&
+ git -C dest.git fsck &&
+ (
@@ t/t5590-unpack-non-delta-objects.sh (new)
+ test_cmp expect actual
+'
+
-+test_expect_success 'setup for unpack-objects dry-run test' '
-+ git init --bare unpack-test.git
-+'
-+
+test_expect_success 'unpack-objects dry-run' '
++ prepare_dest 1m &&
++ git -C dest.git unpack-objects -n <test-$PACK.pack &&
+ (
-+ cd unpack-test.git &&
-+ git unpack-objects -n <../test-$PACK.pack
-+ ) &&
-+ (
-+ cd unpack-test.git &&
++ cd dest.git &&
+ find objects/ -type f
+ ) >actual &&
+ test_must_be_empty actual
--
2.34.0
* [PATCH v6 5/6] unpack-objects.c: add dry_run mode for get_data()
2021-12-10 10:34 [PATCH v5 0/6] unpack large blobs in stream Han Xin
@ 2021-12-17 11:26 ` Han Xin
2021-12-17 21:22 ` René Scharfe
0 siblings, 1 reply; 4+ messages in thread
From: Han Xin @ 2021-12-17 11:26 UTC (permalink / raw)
To: Junio C Hamano, Git List, Jeff King, Jiang Xin, Philip Oakley,
Ævar Arnfjörð Bjarmason, Derrick Stolee
Cc: Han Xin
From: Han Xin <hanxin.hx@alibaba-inc.com>
In dry_run mode, "get_data()" is used to verify the inflation of data,
and the returned buffer will not be used at all and will be freed
immediately. Even in dry_run mode, it is dangerous to allocate a
full-size buffer for a large blob object. Therefore, allocate only a
small fixed-size buffer when calling "get_data()" in dry_run mode.
Suggested-by: Jiang Xin <zhiyou.jx@alibaba-inc.com>
Signed-off-by: Han Xin <hanxin.hx@alibaba-inc.com>
---
builtin/unpack-objects.c | 23 +++++++++++++++++------
1 file changed, 17 insertions(+), 6 deletions(-)
diff --git a/builtin/unpack-objects.c b/builtin/unpack-objects.c
index 4a9466295b..c4a17bdb44 100644
--- a/builtin/unpack-objects.c
+++ b/builtin/unpack-objects.c
@@ -96,15 +96,21 @@ static void use(int bytes)
display_throughput(progress, consumed_bytes);
}
-static void *get_data(unsigned long size)
+static void *get_data(unsigned long size, int dry_run)
{
git_zstream stream;
- void *buf = xmallocz(size);
+ unsigned long bufsize;
+ void *buf;
memset(&stream, 0, sizeof(stream));
+ if (dry_run && size > 8192)
+ bufsize = 8192;
+ else
+ bufsize = size;
+ buf = xmallocz(bufsize);
stream.next_out = buf;
- stream.avail_out = size;
+ stream.avail_out = bufsize;
stream.next_in = fill(1);
stream.avail_in = len;
git_inflate_init(&stream);
@@ -124,6 +130,11 @@ static void *get_data(unsigned long size)
}
stream.next_in = fill(1);
stream.avail_in = len;
+ if (dry_run) {
+ /* reuse the buffer in dry_run mode */
+ stream.next_out = buf;
+ stream.avail_out = bufsize;
+ }
}
git_inflate_end(&stream);
return buf;
@@ -323,7 +334,7 @@ static void added_object(unsigned nr, enum object_type type,
static void unpack_non_delta_entry(enum object_type type, unsigned long size,
unsigned nr)
{
- void *buf = get_data(size);
+ void *buf = get_data(size, dry_run);
if (!dry_run && buf)
write_object(nr, type, buf, size);
@@ -357,7 +368,7 @@ static void unpack_delta_entry(enum object_type type, unsigned long delta_size,
if (type == OBJ_REF_DELTA) {
oidread(&base_oid, fill(the_hash_algo->rawsz));
use(the_hash_algo->rawsz);
- delta_data = get_data(delta_size);
+ delta_data = get_data(delta_size, dry_run);
if (dry_run || !delta_data) {
free(delta_data);
return;
@@ -396,7 +407,7 @@ static void unpack_delta_entry(enum object_type type, unsigned long delta_size,
if (base_offset <= 0 || base_offset >= obj_list[nr].offset)
die("offset value out of bound for delta base object");
- delta_data = get_data(delta_size);
+ delta_data = get_data(delta_size, dry_run);
if (dry_run || !delta_data) {
free(delta_data);
return;
--
2.34.1.52.gfcc2252aea.agit.6.5.6
* Re: [PATCH v6 5/6] unpack-objects.c: add dry_run mode for get_data()
2021-12-17 11:26 ` [PATCH v6 5/6] unpack-objects.c: add dry_run mode for get_data() Han Xin
@ 2021-12-17 21:22 ` René Scharfe
0 siblings, 0 replies; 4+ messages in thread
From: René Scharfe @ 2021-12-17 21:22 UTC (permalink / raw)
To: Han Xin, Junio C Hamano, Git List, Jeff King, Jiang Xin,
Philip Oakley, Ævar Arnfjörð Bjarmason,
Derrick Stolee
Cc: Han Xin
Am 17.12.21 um 12:26 schrieb Han Xin:
> From: Han Xin <hanxin.hx@alibaba-inc.com>
>
> In dry_run mode, "get_data()" is used to verify the inflation of data,
> and the returned buffer will not be used at all and will be freed
> immediately. Even in dry_run mode, it is dangerous to allocate a
> full-size buffer for a large blob object. Therefore, allocate only a
> small fixed-size buffer when calling "get_data()" in dry_run mode.
Clever. Looks good to me.
For some reason I was expecting this patch to have some connection to
one of the earlier ones (perhaps because get_data() was mentioned),
but it is technically independent.
>
> Suggested-by: Jiang Xin <zhiyou.jx@alibaba-inc.com>
> Signed-off-by: Han Xin <hanxin.hx@alibaba-inc.com>
> ---
> builtin/unpack-objects.c | 23 +++++++++++++++++------
> 1 file changed, 17 insertions(+), 6 deletions(-)
>
> diff --git a/builtin/unpack-objects.c b/builtin/unpack-objects.c
> index 4a9466295b..c4a17bdb44 100644
> --- a/builtin/unpack-objects.c
> +++ b/builtin/unpack-objects.c
> @@ -96,15 +96,21 @@ static void use(int bytes)
> display_throughput(progress, consumed_bytes);
> }
>
> -static void *get_data(unsigned long size)
> +static void *get_data(unsigned long size, int dry_run)
> {
> git_zstream stream;
> - void *buf = xmallocz(size);
> + unsigned long bufsize;
> + void *buf;
>
> memset(&stream, 0, sizeof(stream));
> + if (dry_run && size > 8192)
> + bufsize = 8192;
> + else
> + bufsize = size;
> + buf = xmallocz(bufsize);
>
> stream.next_out = buf;
> - stream.avail_out = size;
> + stream.avail_out = bufsize;
> stream.next_in = fill(1);
> stream.avail_in = len;
> git_inflate_init(&stream);
> @@ -124,6 +130,11 @@ static void *get_data(unsigned long size)
> }
> stream.next_in = fill(1);
> stream.avail_in = len;
> + if (dry_run) {
> + /* reuse the buffer in dry_run mode */
> + stream.next_out = buf;
> + stream.avail_out = bufsize;
> + }
> }
> git_inflate_end(&stream);
> return buf;
> @@ -323,7 +334,7 @@ static void added_object(unsigned nr, enum object_type type,
> static void unpack_non_delta_entry(enum object_type type, unsigned long size,
> unsigned nr)
> {
> - void *buf = get_data(size);
> + void *buf = get_data(size, dry_run);
>
> if (!dry_run && buf)
> write_object(nr, type, buf, size);
> @@ -357,7 +368,7 @@ static void unpack_delta_entry(enum object_type type, unsigned long delta_size,
> if (type == OBJ_REF_DELTA) {
> oidread(&base_oid, fill(the_hash_algo->rawsz));
> use(the_hash_algo->rawsz);
> - delta_data = get_data(delta_size);
> + delta_data = get_data(delta_size, dry_run);
> if (dry_run || !delta_data) {
> free(delta_data);
> return;
> @@ -396,7 +407,7 @@ static void unpack_delta_entry(enum object_type type, unsigned long delta_size,
> if (base_offset <= 0 || base_offset >= obj_list[nr].offset)
> die("offset value out of bound for delta base object");
>
> - delta_data = get_data(delta_size);
> + delta_data = get_data(delta_size, dry_run);
> if (dry_run || !delta_data) {
> free(delta_data);
> return;
end of thread, other threads:[~2021-12-21 17:18 UTC | newest]
2021-12-21 11:40 [PATCH v6 5/6] unpack-objects.c: add dry_run mode for get_data() Han Xin
2021-12-21 17:17 ` Junio C Hamano
-- strict thread matches above, loose matches on Subject: below --
2021-12-10 10:34 [PATCH v5 0/6] unpack large blobs in stream Han Xin
2021-12-17 11:26 ` [PATCH v6 5/6] unpack-objects.c: add dry_run mode for get_data() Han Xin
2021-12-17 21:22 ` René Scharfe