From: Jeff King <peff@peff.net>
To: git@vger.kernel.org
Cc: Kyle Meyer <kyle@kyleam.com>,
Eric Sunshine <sunshine@sunshineco.com>,
Taylor Blau <me@ttaylorr.com>
Subject: [PATCH v2] rev-list --disk-usage
Date: Tue, 9 Feb 2021 05:52:28 -0500
Message-ID: <YCJpbPIlSpCAKSBF@coredump.intra.peff.net>
In-Reply-To: <YBHlGPBSJC++CnPy@coredump.intra.peff.net>
Here's a re-roll of my series to add "rev-list --disk-usage", for
counting up object storage used for various slices of history.

This fixes the minor bits mentioned in review for v1, but the big change
is that "--disk-usage" no longer implies "--objects". I think you
generally would want to use the two together, but having one option
silently turn on the other really seemed to violate the principle of
least surprise for the user.
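
To make the new semantics concrete, here is a minimal toy model. It is
not git code: every name in it (toy_object, toy_flags, toy_disk_usage)
is hypothetical, and the flags only loosely mirror the tree_objects,
blob_objects and tag_objects fields of struct rev_info. It just shows
that commits are always counted, while other types are counted only when
they were asked for.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for git's object types. */
enum toy_type { TOY_COMMIT, TOY_TREE, TOY_BLOB, TOY_TAG };

struct toy_object {
	enum toy_type type;
	uint64_t disk_size; /* bytes used on disk by this object */
};

/* Assumed to be set by "--objects"; all zero without it. */
struct toy_flags {
	int tree_objects;
	int blob_objects;
	int tag_objects;
};

/* Commits always count; other types only if requested. */
int toy_type_wanted(enum toy_type type, const struct toy_flags *flags)
{
	switch (type) {
	case TOY_COMMIT:
		return 1;
	case TOY_TREE:
		return flags->tree_objects;
	case TOY_BLOB:
		return flags->blob_objects;
	case TOY_TAG:
		return flags->tag_objects;
	}
	return 0;
}

/* Sum the disk sizes of the objects whose type is wanted. */
uint64_t toy_disk_usage(const struct toy_object *objs, size_t nr,
			const struct toy_flags *flags)
{
	uint64_t total = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		if (toy_type_wanted(objs[i].type, flags))
			total += objs[i].disk_size;
	}
	return total;
}
```

With one commit of 100 bytes, a tree of 40, a blob of 500 and a tag of
10, all-zero flags give 100 (commits only), and setting all three flags
gives 650.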

That requires handling each object type independently, but the code for
that turned out to be not too bad (and is modeled after the similar
logic in traverse_bitmap_commit_list()). I was slightly concerned that
it would slow things down to walk over the bitmap multiple times, but it
doesn't seem to make much of a difference in practice.
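
As a rough illustration of the per-type walk, here is a simplified,
self-contained sketch. It is an assumption-laden model rather than the
patch itself: plain uint64_t arrays stand in for the result bitmap and
the EWAH type bitmap, an offsets[] array stands in for the reverse
index, and the name toy_disk_usage_for_type is made up. The idea is the
same, though: mask each word of the result with the type filter, then
add up the distance between consecutive pack offsets for every bit that
survives.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_BITS_IN_WORD 64

/*
 * result[]  - one bit per pack position; set if the object is selected
 * type[]    - same layout; set if the object has the type of interest
 * offsets[] - byte offset of each pack position, plus one trailing
 *             entry marking where the last object ends
 *             (so it holds nr_objects + 1 entries)
 */
uint64_t toy_disk_usage_for_type(const uint64_t *result,
				 const uint64_t *type,
				 size_t nr_words,
				 const uint64_t *offsets,
				 size_t nr_objects)
{
	uint64_t total = 0;
	size_t i;

	for (i = 0; i < nr_words; i++) {
		/* Keep only selected objects of the wanted type. */
		uint64_t word = result[i] & type[i];
		unsigned bit;

		if (!word)
			continue; /* nothing of this type in this word */

		for (bit = 0; bit < TOY_BITS_IN_WORD; bit++) {
			size_t pos = i * TOY_BITS_IN_WORD + bit;

			if (!((word >> bit) & 1))
				continue;
			if (pos >= nr_objects)
				break; /* padding bits past the last object */

			/* On-disk size is the gap to the next object. */
			total += offsets[pos + 1] - offsets[pos];
		}
	}
	return total;
}
```

For example, with offsets {0, 10, 30, 60, 100}, a result word of 0xB
(positions 0, 1 and 3) and a type word of 0x6 (positions 1 and 2), only
position 1 is counted, for a total of 20 bytes. Calling this once per
type means walking the result once per type, which is the repeated pass
mentioned above.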

There's a range-diff below, but it's not really worth looking at. All of
the interesting parts were rewritten completely, so you're better off
just reading patch 2 again (and patch 1 did not change at all).
[1/2]: t: add --no-tag option to test_commit
[2/2]: rev-list: add --disk-usage option for calculating disk usage
Documentation/rev-list-options.txt | 9 ++++
builtin/rev-list.c | 46 +++++++++++++++++
pack-bitmap.c | 81 ++++++++++++++++++++++++++++++
pack-bitmap.h | 2 +
t/t4208-log-magic-pathspec.sh | 9 +---
t/t6114-rev-list-du.sh | 51 +++++++++++++++++++
t/test-lib-functions.sh | 9 +++-
7 files changed, 199 insertions(+), 8 deletions(-)
create mode 100755 t/t6114-rev-list-du.sh
1: 20f8edeff1 = 1: 6365cd94bd t: add --no-tag option to test_commit
2: 64e28cb6c9 ! 2: 8a93583dee rev-list: add --disk-usage option for calculating disk usage
@@ Commit message
You can find that out by generating a list of objects, getting their
sizes from cat-file, and then summing them, like:
- git rev-list --objects main..branch |
- cut -d' ' -f1 |
+ git rev-list --objects --no-object-names main..branch |
git cat-file --batch-check='%(objectsize:disk)' |
perl -lne '$total += $_; END { print $total }'
@@ Commit message
torvalds/linux:
[rev-list piped to cat-file, no bitmaps]
- $ time git rev-list --objects --all |
- cut -d' ' -f1 |
+ $ time git rev-list --objects --no-object-names --all |
git cat-file --buffer --batch-check='%(objectsize:disk)' |
perl -lne '$total += $_; END { print $total }'
- 1455691059
- real 0m34.336s
- user 0m46.533s
- sys 0m2.953s
+ 1459938510
+ real 0m29.635s
+ user 0m38.003s
+ sys 0m1.093s
[internal, no bitmaps]
- $ time git rev-list --disk-usage --all
- 1455691059
- real 0m32.662s
- user 0m32.306s
- sys 0m0.353s
+ $ time git rev-list --disk-usage --objects --all
+ 1459938510
+ real 0m31.262s
+ user 0m30.885s
+ sys 0m0.376s
- The wall-clock times aren't that different because of parallelism, but
- notice the CPU savings between the two. We saved 35% of the CPU just by
+ Even though the wall-clock time is slightly worse (the piped commands
+ run in parallel), notice the CPU savings between the two. We saved 21%
+ of the CPU just by
avoiding the pipes.
But the real win is with bitmaps. If we use them without the new option:
[rev-list piped to cat-file, bitmaps]
- $ time git rev-list --objects --all --use-bitmap-index |
- cut -d' ' -f1 |
+ $ time git rev-list --objects --no-object-names --all --use-bitmap-index |
git cat-file --batch-check='%(objectsize:disk)' |
perl -lne '$total += $_; END { print $total }'
- real 0m9.954s
- user 0m11.234s
- sys 0m8.522s
+ 1459938510
+ real 0m6.244s
+ user 0m8.452s
+ sys 0m0.311s
then we're faster to generate the list of objects, but we still spend a
lot of time piping and looking things up. But if we do both together:
[internal, bitmaps]
- $ time git rev-list --disk-usage --all --use-bitmap-index
- 1455691059
- real 0m0.235s
- user 0m0.186s
+ $ time git rev-list --disk-usage --objects --all --use-bitmap-index
+ 1459938510
+ real 0m0.219s
+ user 0m0.169s
sys 0m0.049s
then we get the same answer much faster.
@@ Commit message
of course. But we're actually checking reachability here, so we're still
fast when we ask for more interesting things:
- $ time git rev-list --disk-usage --all --use-bitmap-index v5.0..v5.10
+ $ time git rev-list --disk-usage --use-bitmap-index v5.0..v5.10
374798628
real 0m0.429s
user 0m0.356s
@@ Documentation/rev-list-options.txt: ifdef::git-rev-list[]
+
+--disk-usage::
+ Suppress normal output; instead, print the sum of the bytes used
-+ for on-disk storage by the selected objects. This is equivalent
-+ to piping the output of `rev-list --objects` into
-+ `git cat-file --batch-check='%(objectsize:disk)', except that it
-+ runs much faster (especially with `--use-bitmap-index`). See the
-+ `CAVEATS` section in linkgit:git-cat-file[1] for the limitations
-+ of what "on-disk storage" means.
++ for on-disk storage by the selected commits or objects. This is
++ equivalent to piping the output into `git cat-file
++ --batch-check='%(objectsize:disk)'`, except that it runs much
++ faster (especially with `--use-bitmap-index`). See the `CAVEATS`
++ section in linkgit:git-cat-file[1] for the limitations of what
++ "on-disk storage" means.
endif::git-rev-list[]
--cherry-mark::
@@ builtin/rev-list.c: static int try_bitmap_traversal(struct rev_info *revs,
+ return -1;
+
+ printf("%"PRIuMAX"\n",
-+ (uintmax_t)get_disk_usage_from_bitmap(bitmap_git));
++ (uintmax_t)get_disk_usage_from_bitmap(bitmap_git, revs));
+ return 0;
+}
+
@@ builtin/rev-list.c: int cmd_rev_list(int argc, const char **argv, const char *pr
+ if (!strcmp(arg, "--disk-usage")) {
+ show_disk_usage = 1;
-+ revs.tag_objects = 1;
-+ revs.tree_objects = 1;
-+ revs.blob_objects = 1;
+ info.flags |= REV_LIST_QUIET;
+ continue;
+ }
@@ pack-bitmap.c: int bitmap_has_oid_in_uninteresting(struct bitmap_index *bitmap_g
bitmap_walk_contains(bitmap_git, bitmap_git->haves, oid);
}
+
-+off_t get_disk_usage_from_bitmap(struct bitmap_index *bitmap_git)
++static off_t get_disk_usage_for_type(struct bitmap_index *bitmap_git,
++ enum object_type object_type)
+{
+ struct bitmap *result = bitmap_git->result;
+ struct packed_git *pack = bitmap_git->pack;
-+ struct eindex *eindex = &bitmap_git->ext_index;
-+ struct object_info oi = OBJECT_INFO_INIT;
-+ off_t object_size;
+ off_t total = 0;
++ struct ewah_iterator it;
++ eword_t filter;
+ size_t i;
+
-+ oi.disk_sizep = &object_size;
-+
-+ for (i = 0; i < result->word_alloc; i++) {
-+ eword_t word = result->words[i];
++ init_type_iterator(&it, bitmap_git, object_type);
++ for (i = 0; i < result->word_alloc &&
++ ewah_iterator_next(&filter, &it); i++) {
++ eword_t word = result->words[i] & filter;
+ size_t base = (i * BITS_IN_EWORD);
+ unsigned offset;
+
++ if (!word)
++ continue;
++
+ for (offset = 0; offset < BITS_IN_EWORD; offset++) {
+ size_t pos;
+
@@ pack-bitmap.c: int bitmap_has_oid_in_uninteresting(struct bitmap_index *bitmap_g
+
+ offset += ewah_bit_ctz64(word >> offset);
+ pos = base + offset;
-+
-+ /*
-+ * If it's in the pack, we can use the fast path
-+ * and just check the revindex. Otherwise, we
-+ * fall back to looking it up.
-+ */
-+ if (pos < pack->num_objects) {
-+ object_size =
-+ pack_pos_to_offset(pack, pos + 1) -
-+ pack_pos_to_offset(pack, pos);
-+ } else {
-+ struct object *obj;
-+ obj = eindex->objects[pos - pack->num_objects];
-+ if (oid_object_info_extended(the_repository, &obj->oid, &oi, 0) < 0)
-+ die(_("unable to get disk usage of %s"),
-+ oid_to_hex(&obj->oid));
-+ }
-+
-+ total += object_size;
++ total += pack_pos_to_offset(pack, pos + 1) -
++ pack_pos_to_offset(pack, pos);
+ }
+ }
+
+ return total;
++}
++
++static off_t get_disk_usage_for_extended(struct bitmap_index *bitmap_git)
++{
++ struct bitmap *result = bitmap_git->result;
++ struct packed_git *pack = bitmap_git->pack;
++ struct eindex *eindex = &bitmap_git->ext_index;
++ off_t total = 0;
++ struct object_info oi = OBJECT_INFO_INIT;
++ off_t object_size;
++ size_t i;
++
++ oi.disk_sizep = &object_size;
++
++ for (i = 0; i < eindex->count; i++) {
++ struct object *obj = eindex->objects[i];
++
++ if (!bitmap_get(result, pack->num_objects + i))
++ continue;
++
++ if (oid_object_info_extended(the_repository, &obj->oid, &oi, 0) < 0)
++ die(_("unable to get disk usage of %s"),
++ oid_to_hex(&obj->oid));
++
++ total += object_size;
++ }
++ return total;
++}
++
++off_t get_disk_usage_from_bitmap(struct bitmap_index *bitmap_git,
++ struct rev_info *revs)
++{
++ off_t total = 0;
++
++ total += get_disk_usage_for_type(bitmap_git, OBJ_COMMIT);
++ if (revs->tree_objects)
++ total += get_disk_usage_for_type(bitmap_git, OBJ_TREE);
++ if (revs->blob_objects)
++ total += get_disk_usage_for_type(bitmap_git, OBJ_BLOB);
++ if (revs->tag_objects)
++ total += get_disk_usage_for_type(bitmap_git, OBJ_TAG);
++
++ total += get_disk_usage_for_extended(bitmap_git);
++
++ return total;
+}
## pack-bitmap.h ##
@@ pack-bitmap.h: int bitmap_walk_contains(struct bitmap_index *,
*/
int bitmap_has_oid_in_uninteresting(struct bitmap_index *, const struct object_id *oid);
-+off_t get_disk_usage_from_bitmap(struct bitmap_index *);
++off_t get_disk_usage_from_bitmap(struct bitmap_index *, struct rev_info *);
+
void bitmap_writer_show_progress(int show);
void bitmap_writer_set_checksum(unsigned char *sha1);
@@ t/t6114-rev-list-du.sh (new)
+# packing, zlib, etc. We'll assume that the regular rev-list and cat-file
+# machinery works and compare the --disk-usage output to that.
+disk_usage_slow () {
-+ git rev-list --objects "$@" |
-+ cut -d' ' -f1 |
++ git rev-list --no-object-names "$@" |
+ git cat-file --batch-check="%(objectsize:disk)" |
+ perl -lne '$total += $_; END { print $total}'
+}
@@ t/t6114-rev-list-du.sh (new)
+}
+
+check_du HEAD
-+check_du HEAD^..HEAD
++check_du --objects HEAD
++check_du --objects HEAD^..HEAD
+
+test_done