From: Jeff King <firstname.lastname@example.org>
To: Rasmus Villemoes <email@example.com>
Cc: "Git Mailing List" <firstname.lastname@example.org>,
	"Stefan Beller" <email@example.com>,
	"Nguyễn Thái Ngọc Duy" <firstname.lastname@example.org>
Subject: Re: approximate_object_count_valid never set?
Date: Thu, 17 Sep 2020 08:53:33 -0400
Message-ID: <20200917125333.GA3024501@coredump.intra.peff.net>
In-Reply-To: <email@example.com>

On Thu, Sep 17, 2020 at 10:20:03AM +0200, Rasmus Villemoes wrote:

> Hi,
>
> While poking around the code, I noticed that it seems
> ->approximate_object_count_valid is never set to 1, and it never has
> been, not even back when it was a global variable. So perhaps it can
> just be removed and the logic depending on it simplified? Or am I
> missing some preprocessor trickery.
>
> Nobody seems to have noticed the lack of caching - and actually setting
> it to 1 after the count has been computed might be a little dangerous
> unless one takes care to invalidate the cache anywhere that might be
> relevant.

We should be able to construct a test where it matters. The main cost
that the flag is overcoming is the iteration through the packs. So we'd
want a lot of packs. And the primary place the function would get called
a lot is when abbreviating commits. So doing:

  for i in $(seq 1000); do
    echo blob
    echo 'data <<EOF'
    echo $i
    echo EOF
    echo checkpoint
  done |
  git -c transfer.unpacklimit=0 fast-import

will get us a lot of packs. I tried that in linux.git. And then this
should get us a baseline for how much it costs to traverse and print out
object names:

  $ time git rev-list --format=%H HEAD >/dev/null
  real	0m6.636s
  user	0m6.492s
  sys	0m0.144s

And now let's see how long it takes with abbreviation:

  $ time git rev-list --format=%h HEAD >/dev/null
  real	0m34.518s
  user	0m34.253s
  sys	0m0.264s

Yow. That's a lot.
But part of the cost is that we have to look up each abbreviated hash in
each pack to see if it's present there, so we'd expect it to be a lot
more expensive. But let's try it with the caching flag:

diff --git a/packfile.c b/packfile.c
index 9ef27508f2..e69012e7f2 100644
--- a/packfile.c
+++ b/packfile.c
@@ -923,6 +923,7 @@ unsigned long repo_approximate_object_count(struct repository *r)
 			count += p->num_objects;
 		}
 		r->objects->approximate_object_count = count;
+		r->objects->approximate_object_count_valid = 1;
 	}
 	return r->objects->approximate_object_count;
 }

  $ time git rev-list --format=%h HEAD >/dev/null
  real	0m29.411s
  user	0m29.150s
  sys	0m0.260s

Still not great, but caching the count did save us 15%. That seems worth
it to me (1000 packs is more than we'd hope for, but not uncommon in a
poorly maintained repo).

The failure to set the flag is just a bug; looks like mine from
8e3f52d778 (find_unique_abbrev: move logic out of get_short_sha1(),
2016-10-03).

You're right that caching runs the risk of the cache being invalidated.
But in this case I think we're covered. We'd generally modify packed_git
only via reprepare_packed_git(), prepare_packed_git(), and we do reset
the flag there. Plus count is only meant to be approximate, so even if
it ended up stale within a single process, I don't think it would be
that big a deal.

-Peff
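
For illustration, the compute-once-then-reuse pattern with a validity
flag can be sketched in isolation roughly as follows. This is not git's
code: struct pack, struct store, add_pack(), approximate_count() and the
walks counter are hypothetical names invented for the sketch, and the
real invalidation lives in the prepare/reprepare paths mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one entry in a list of packs. */
struct pack {
	unsigned long num_objects;
	struct pack *next;
};

/* Hypothetical stand-in for the object store holding the cache. */
struct store {
	struct pack *packs;
	unsigned long approximate_count;
	unsigned approximate_count_valid : 1;
	unsigned long walks; /* instrumentation only: number of list walks */
};

/* Return the cached count; walk the pack list only if the cache is stale. */
unsigned long approximate_count(struct store *s)
{
	assert(s != NULL);
	if (!s->approximate_count_valid) {
		unsigned long count = 0;
		struct pack *p;

		for (p = s->packs; p; p = p->next)
			count += p->num_objects;
		s->approximate_count = count;
		/* Without this assignment every call repeats the walk. */
		s->approximate_count_valid = 1;
		s->walks++;
	}
	return s->approximate_count;
}

/* Anything that changes the pack list has to drop the cached value. */
void add_pack(struct store *s, struct pack *p)
{
	assert(s != NULL && p != NULL);
	p->next = s->packs;
	s->packs = p;
	s->approximate_count_valid = 0;
}
```

Calling approximate_count() twice in a row walks the list once; after
add_pack() the next call walks it again. Dropping the line that sets the
flag keeps the result correct but makes every call pay for the walk,
which is the behavior the timings above were measuring.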