From: "René Scharfe" <l.s.r@web.de>
To: Jeff King <peff@peff.net>, git@vger.kernel.org
Subject: Re: [PATCH] fast-import: replace custom hash with hashmap.c
Date: Fri, 3 Apr 2020 20:53:30 +0200
Message-ID: <6926f6fe-2e5c-bcbc-fd0a-5b9a70eed06c@web.de>
In-Reply-To: <20200403121508.GA638328@coredump.intra.peff.net>
On 03.04.20 at 14:15, Jeff King wrote:
> We use a custom hash in fast-import to store the set of objects we've
> imported so far. It has a fixed set of 2^16 buckets and chains any
> collisions with a linked list. As the number of objects grows larger
> than that, the load factor increases and we degrade to O(n) lookups and
> O(n^2) insertions.
>
> We can scale better by using our hashmap.c implementation, which will
> resize the bucket count as we grow. This does incur an extra memory cost
> of 8 bytes per object, as hashmap stores the integer hash value for each
> entry in its hashmap_entry struct (which we really don't care about
> here, because we're just reusing the embedded object hash). But I think
> the numbers below justify this (and our per-object memory cost is
> already much higher).
>
> I also looked at using khash, but it seemed to perform slightly worse
> than hashmap at all sizes, and worse even than the existing code for
> small sizes. It's also awkward to use here, because we want to look up a
> "struct object_entry" from a "struct object_id", and it doesn't handle
> mismatched keys as well. Making a mapping of object_id to object_entry
> would be more natural, but that would require pulling the embedded oid
> out of the object_entry or incurring an extra 32 bytes per object.
>
> In a synthetic test creating as many cheap, tiny objects as possible
>
>   perl -e '
>   my $bits = shift;
>   my $nr = 2**$bits;
>
>   for (my $i = 0; $i < $nr; $i++) {
>           print "blob\n";
>           print "data 4\n";
>           print pack("N", $i);
>   }
>   ' $bits | git fast-import
>
> I got these results:
>
>   nr_objects   master       khash        hashmap
>   2^20         0m4.317s     0m5.109s     0m3.890s
>   2^21         0m10.204s    0m9.702s     0m7.933s
>   2^22         0m27.159s    0m17.911s    0m16.751s
>   2^23         1m19.038s    0m35.080s    0m31.963s
>   2^24         4m18.766s    1m10.233s    1m6.793s
I get similar numbers.
>
> which points to hashmap as the winner. We didn't have any perf tests for
> fast-export or fast-import, so I added one as a more real-world case.
> It uses an export without blobs since that's significantly cheaper than
> a full one, but still is an interesting case people might use (e.g., for
> rewriting history). It will emphasize this change in some ways (as a
> percentage we spend more time making objects and less shuffling blob
> bytes around) and less in others (the total object count is lower).
>
> Here are the results for linux.git:
>
>   Test                        HEAD^               HEAD
>   ---------------------------------------------------------------------------
>   9300.1: export (no-blobs)   67.64(66.96+0.67)   67.81(67.06+0.75) +0.3%
>   9300.2: import (no-blobs)   284.04(283.34+0.69) 198.09(196.01+0.92) -30.3%
My numbers look a bit different for this, not sure why:
  Test                        origin/master       HEAD
  ---------------------------------------------------------------------------
  9300.1: export (no-blobs)   69.36(66.44+1.56)   67.89(66.07+1.56) -2.1%
  9300.2: import (no-blobs)   295.10(293.83+1.19) 283.83(282.91+0.91) -3.8%
They are still in favor of the patch, just not as strongly as yours.
>
> It only has ~5.2M commits and trees, so this is a larger effect than I
> expected (the 2^23 case above only improved by 50s or so, but here we
> gained almost 90s). This is probably due to actually performing more
> object lookups in a real import with trees and commits, as opposed to
> just dumping a bunch of blobs into a pack.
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  fast-import.c                      | 62 +++++++++++++++++++-----------
>  t/perf/p9300-fast-import-export.sh | 23 +++++++++++
>  2 files changed, 62 insertions(+), 23 deletions(-)
>  create mode 100755 t/perf/p9300-fast-import-export.sh
>
I have to warn you up front: I'm not familiar with hashmap or
fast-import, so I'll focus on stylistic topics -- bikeshedding. The
actual change looks reasonable to me overall and the performance is
convincing.
> diff --git a/fast-import.c b/fast-import.c
> index 202dda11a6..0ef6defc10 100644
> --- a/fast-import.c
> +++ b/fast-import.c
> @@ -39,12 +39,28 @@
>
>  struct object_entry {
>  	struct pack_idx_entry idx;
> -	struct object_entry *next;
> +	struct hashmap_entry ent;
>  	uint32_t type : TYPE_BITS,
>  		pack_id : PACK_ID_BITS,
>  		depth : DEPTH_BITS;
>  };
>
> +static int object_entry_hashcmp(const void *map_data,
> +				const struct hashmap_entry *eptr,
> +				const struct hashmap_entry *entry_or_key,
> +				const void *keydata)
> +{
> +	const struct object_id *oid = keydata;
> +	const struct object_entry *e1, *e2;
> +
> +	e1 = container_of(eptr, const struct object_entry, ent);
> +	if (oid)
> +		return oidcmp(&e1->idx.oid, oid);
> +
> +	e2 = container_of(entry_or_key, const struct object_entry, ent);
> +	return oidcmp(&e1->idx.oid, &e2->idx.oid);
Other hashmap users express this more tersely, similar to this:

	const struct object_entry *e1, *e2;

	e1 = container_of(eptr, const struct object_entry, ent);
	e2 = container_of(entry_or_key, const struct object_entry, ent);
	return oidcmp(&e1->idx.oid, keydata ? keydata : &e2->idx.oid);
> +}
> +
>  struct object_entry_pool {
>  	struct object_entry_pool *next_pool;
>  	struct object_entry *next_free;
> @@ -178,7 +194,7 @@ static off_t pack_size;
>  /* Table of objects we've written. */
>  static unsigned int object_entry_alloc = 5000;
>  static struct object_entry_pool *blocks;
> -static struct object_entry *object_table[1 << 16];
> +static struct hashmap object_table;
>  static struct mark_set *marks;
>  static const char *export_marks_file;
>  static const char *import_marks_file;
> @@ -455,44 +471,42 @@ static struct object_entry *new_object(struct object_id *oid)
>
>  static struct object_entry *find_object(struct object_id *oid)
>  {
> -	unsigned int h = oid->hash[0] << 8 | oid->hash[1];
> -	struct object_entry *e;
> -	for (e = object_table[h]; e; e = e->next)
> -		if (oideq(oid, &e->idx.oid))
> -			return e;
> +	struct hashmap_entry lookup_entry, *e;
> +
> +	hashmap_entry_init(&lookup_entry, oidhash(oid));
> +	e = hashmap_get(&object_table, &lookup_entry, oid);
> +	if (e)
> +		return container_of(e, struct object_entry, ent);
(Worth repeating:) No casting trick, yay!
>  	return NULL;
>  }
>
>  static struct object_entry *insert_object(struct object_id *oid)
>  {
> -	unsigned int h = oid->hash[0] << 8 | oid->hash[1];
> -	struct object_entry *e = object_table[h];
> +	struct object_entry *e;
> +	struct hashmap_entry lookup_entry, *hashent;
>
> -	while (e) {
> -		if (oideq(oid, &e->idx.oid))
> -			return e;
> -		e = e->next;
> -	}
> +	hashmap_entry_init(&lookup_entry, oidhash(oid));
> +	hashent = hashmap_get(&object_table, &lookup_entry, oid);
> +	if (hashent)
> +		return container_of(hashent, struct object_entry, ent);
That duplicates find_object()...
>
>  	e = new_object(oid);
> -	e->next = object_table[h];
>  	e->idx.offset = 0;
> -	object_table[h] = e;
> +	e->ent.hash = lookup_entry.hash;
... except for this part. Would it make sense to replace this line with
a call to hashmap_entry_init()? Seems cleaner to me. It would look
like this:
	struct object_entry *e = find_object(oid);

	if (e)
		return e;
	e = new_object(oid);
	e->idx.offset = 0;
	hashmap_entry_init(&e->ent, oidhash(oid));
	hashmap_add(&object_table, &e->ent);
	return e;
> +	hashmap_add(&object_table, &e->ent);
>  	return e;
>  }
>
>  static void invalidate_pack_id(unsigned int id)
>  {
> -	unsigned int h;
>  	unsigned long lu;
>  	struct tag *t;
> +	struct hashmap_iter iter;
> +	struct object_entry *e;
>
> -	for (h = 0; h < ARRAY_SIZE(object_table); h++) {
> -		struct object_entry *e;
> -
> -		for (e = object_table[h]; e; e = e->next)
> -			if (e->pack_id == id)
> -				e->pack_id = MAX_PACK_ID;
> +	hashmap_for_each_entry(&object_table, &iter, e, ent) {
> +		if (e->pack_id == id)
> +			e->pack_id = MAX_PACK_ID;
>  	}
>
>  	for (lu = 0; lu < branch_table_sz; lu++) {
> @@ -3511,6 +3525,8 @@ int cmd_main(int argc, const char **argv)
>  	avail_tree_table = xcalloc(avail_tree_table_sz, sizeof(struct avail_tree_content*));
>  	marks = mem_pool_calloc(&fi_mem_pool, 1, sizeof(struct mark_set));
>
> +	hashmap_init(&object_table, object_entry_hashcmp, NULL, 0);
> +
>  	/*
>  	 * We don't parse most options until after we've seen the set of
>  	 * "feature" lines at the start of the stream (which allows the command
> diff --git a/t/perf/p9300-fast-import-export.sh b/t/perf/p9300-fast-import-export.sh
> new file mode 100755
> index 0000000000..c3c743d04a
> --- /dev/null
> +++ b/t/perf/p9300-fast-import-export.sh
> @@ -0,0 +1,23 @@
> +#!/bin/sh
> +
> +test_description='test fast-import and fast-export performance'
> +. ./perf-lib.sh
> +
> +test_perf_default_repo
> +
> +# Use --no-data here to produce a vastly smaller export file.
> +# This is much cheaper to work with but should still exercise
> +# fast-import pretty well (we'll still process all commits and
> +# trees, which account for 60% or more of objects in most repos).
> +#
> +# Use --reencode to avoid the default of aborting on non-utf8 commits,
> +# which lets this test run against a wider variety of sample repos.
> +test_perf 'export (no-blobs)' '
> +	git fast-export --reencode=yes --no-data HEAD >export
> +'
> +
> +test_perf 'import (no-blobs)' '
> +	git fast-import --force <export
> +'
> +
> +test_done
>