From: Ramkumar Ramachandra <artagnon@gmail.com>
To: Vicent Marti <tanoku@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: [PATCH 03/16] pack-objects: use a faster hash table
Date: Tue, 25 Jun 2013 23:28:12 +0530 [thread overview]
Message-ID: <CALkWK0krR=PaGC6iX8abmSswr7HBtzU61-7DR00xA3CyW80dtA@mail.gmail.com> (raw)
In-Reply-To: <1372116193-32762-4-git-send-email-tanoku@gmail.com>
Vicent Marti wrote:
> When replacing the old hash table implementation in `pack-objects` with
> the khash_sha1 table, the insertion time is greatly reduced:
Why? What is the exact change?
> The big win here, however, is in the massively reduced amount of hash
> collisions
Okay, so there seems to be some problem with how collisions are
handled in the hashtable.
> -static int locate_object_entry_hash(const unsigned char *sha1)
> -{
> - int i;
> - unsigned int ui;
> - memcpy(&ui, sha1, sizeof(unsigned int));
> - i = ui % object_ix_hashsz;
> - while (0 < object_ix[i]) {
> - if (!hashcmp(sha1, objects[object_ix[i] - 1].idx.sha1))
> - return i;
> - if (++i == object_ix_hashsz)
> - i = 0;
> - }
> - return -1 - i;
> -}
Classical chaining to handle collisions: very naive. Deserves to be thrown out.
> -static void rehash_objects(void)
> -{
> - uint32_t i;
> - struct object_entry *oe;
> -
> - object_ix_hashsz = nr_objects * 3;
> - if (object_ix_hashsz < 1024)
> - object_ix_hashsz = 1024;
> - object_ix = xrealloc(object_ix, sizeof(int) * object_ix_hashsz);
> - memset(object_ix, 0, sizeof(int) * object_ix_hashsz);
> - for (i = 0, oe = objects; i < nr_objects; i++, oe++) {
> - int ix = locate_object_entry_hash(oe->idx.sha1);
> - if (0 <= ix)
> - continue;
> - ix = -1 - ix;
> - object_ix[ix] = i + 1;
> - }
> -}
This is called when the hashtable runs out of space. It didn't appear
in your profiler because it doesn't appear to be a bottleneck, right?
Growing aggressively to 3x times the number of objects probably
explains it. Just for comparison, how does khash grow?
> static struct object_entry *locate_object_entry(const unsigned char *sha1)
> {
> - int i;
> + khiter_t pos = kh_get_sha1(packed_objects, sha1);
>
> - if (!object_ix_hashsz)
> - return NULL;
> + if (pos < kh_end(packed_objects)) {
Wait, why is this required? When will kh_get_sha1() return a position
beyond kh_end()? What does that mean?
> + return kh_value(packed_objects, pos);
> + }
>
> - i = locate_object_entry_hash(sha1);
> - if (0 <= i)
> - return &objects[object_ix[i]-1];
> return NULL;
> }
Overall, replaced call to locate_object_entry_hash() with a call to
kh_get_sha1(). Okay.
> -static int add_object_entry(const unsigned char *sha1, enum object_type type,
> - const char *name, int exclude)
> +static int add_object_entry_1(const unsigned char *sha1, enum object_type type,
> + uint32_t hash, int exclude, struct packed_git *found_pack,
> + off_t found_offset)
> {
> struct object_entry *entry;
> - struct packed_git *p, *found_pack = NULL;
> - off_t found_offset = 0;
> - int ix;
> - unsigned hash = name_hash(name);
> + struct packed_git *p;
> + khiter_t ix;
> + int hash_ret;
>
> - ix = nr_objects ? locate_object_entry_hash(sha1) : -1;
> - if (ix >= 0) {
> + ix = kh_put_sha1(packed_objects, sha1, &hash_ret);
You don't need to call locate_object_entry() to check for collisions
because kh_put_sha1() takes care of that?
> + if (hash_ret == 0) {
> if (exclude) {
> - entry = objects + object_ix[ix] - 1;
> + entry = kh_value(packed_objects, ix);
Superficial change: using kh_value(), because we stripped out the
chaining logic.
> @@ -966,19 +965,30 @@ static int add_object_entry(const unsigned char *sha1, enum object_type type,
> entry->in_pack_offset = found_offset;
> }
>
> - if (object_ix_hashsz * 3 <= nr_objects * 4)
> - rehash_objects();
> - else
> - object_ix[-1 - ix] = nr_objects;
> + kh_value(packed_objects, ix) = entry;
> + kh_key(packed_objects, ix) = entry->idx.sha1;
> + objects[nr_objects++] = entry;
Wait, what? Why didn't you use kh_put_sha1()?
I didn't look very carefully, but the patch seems to be okay overall.
On the issue of which hashtable replacement to use (why khash, and not
something else?), I briefly looked at linux.git's linux/hashtable.h
and git.git's hash.h; both of them are chaining hashes. From a brief
look at khash.h, it seems to be somewhat less naive and sane: my only
concern is that it is written entirely in using CPP macros which is a
great for syntax/performance, but not-so-great for debugging. I don't
know if there's a better off-the-shelf implementation out there, but I
haven't been looking for one either. By the way, it's MIT license
authored by an anonymous person (sources at:
https://github.com/attractivechaos/klib/blob/master/khash.h), but I
don't know if that's a problem.
Thanks.
next prev parent reply other threads:[~2013-06-25 17:58 UTC|newest]
Thread overview: 64+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-06-24 23:22 [PATCH 00/16] Speed up Counting Objects with bitmap data Vicent Marti
2013-06-24 23:22 ` [PATCH 01/16] list-objects: mark tree as unparsed when we free its buffer Vicent Marti
2013-06-24 23:22 ` [PATCH 02/16] sha1_file: refactor into `find_pack_object_pos` Vicent Marti
2013-06-25 13:59 ` Thomas Rast
2013-06-24 23:23 ` [PATCH 03/16] pack-objects: use a faster hash table Vicent Marti
2013-06-25 14:03 ` Thomas Rast
2013-06-26 2:14 ` Jeff King
2013-06-26 4:47 ` Jeff King
2013-06-25 17:58 ` Ramkumar Ramachandra [this message]
2013-06-25 22:48 ` Junio C Hamano
2013-06-25 23:09 ` Vicent Martí
2013-06-24 23:23 ` [PATCH 04/16] pack-objects: make `pack_name_hash` global Vicent Marti
2013-06-24 23:23 ` [PATCH 05/16] revision: allow setting custom limiter function Vicent Marti
2013-06-24 23:23 ` [PATCH 06/16] sha1_file: export `git_open_noatime` Vicent Marti
2013-06-24 23:23 ` [PATCH 07/16] compat: add endinanness helpers Vicent Marti
2013-06-25 13:08 ` Peter Krefting
2013-06-25 13:25 ` Vicent Martí
2013-06-27 5:56 ` Peter Krefting
2013-06-24 23:23 ` [PATCH 08/16] ewah: compressed bitmap implementation Vicent Marti
2013-06-25 1:10 ` Junio C Hamano
2013-06-25 22:51 ` Junio C Hamano
2013-06-25 15:38 ` Thomas Rast
2013-06-24 23:23 ` [PATCH 09/16] documentation: add documentation for the bitmap format Vicent Marti
2013-06-25 5:42 ` Shawn Pearce
2013-06-25 19:33 ` Vicent Martí
2013-06-25 21:17 ` Junio C Hamano
2013-06-25 22:08 ` Vicent Martí
2013-06-27 1:11 ` Shawn Pearce
2013-06-27 2:36 ` Vicent Martí
2013-06-27 2:45 ` Jeff King
2013-06-27 16:07 ` Shawn Pearce
2013-06-27 17:17 ` Jeff King
2013-07-01 18:47 ` Colby Ranger
2013-07-01 19:13 ` Shawn Pearce
2013-07-07 9:46 ` Jeff King
2013-07-07 17:27 ` Shawn Pearce
2013-06-26 5:11 ` Jeff King
2013-06-26 18:41 ` Colby Ranger
2013-06-26 22:33 ` Colby Ranger
2013-06-27 0:53 ` Colby Ranger
2013-06-27 1:32 ` Shawn Pearce
2013-06-27 1:29 ` Shawn Pearce
2013-06-25 15:58 ` Thomas Rast
2013-06-25 22:30 ` Vicent Martí
2013-06-26 23:12 ` Thomas Rast
2013-06-26 23:19 ` Thomas Rast
2013-06-24 23:23 ` [PATCH 10/16] pack-objects: use bitmaps when packing objects Vicent Marti
2013-06-25 12:48 ` Ramkumar Ramachandra
2013-06-25 15:58 ` Thomas Rast
2013-06-25 23:06 ` Junio C Hamano
2013-06-25 23:14 ` Vicent Martí
2013-06-24 23:23 ` [PATCH 11/16] rev-list: add bitmap mode to speed up lists Vicent Marti
2013-06-25 16:22 ` Thomas Rast
2013-06-26 1:45 ` Vicent Martí
2013-06-26 23:13 ` Thomas Rast
2013-06-26 5:22 ` Jeff King
2013-06-24 23:23 ` [PATCH 12/16] pack-objects: implement bitmap writing Vicent Marti
2013-06-24 23:23 ` [PATCH 13/16] repack: consider bitmaps when performing repacks Vicent Marti
2013-06-25 23:00 ` Junio C Hamano
2013-06-25 23:16 ` Vicent Martí
2013-06-24 23:23 ` [PATCH 14/16] sha1_file: implement `nth_packed_object_info` Vicent Marti
2013-06-24 23:23 ` [PATCH 15/16] write-bitmap: implement new git command to write bitmaps Vicent Marti
2013-06-24 23:23 ` [PATCH 16/16] rev-list: Optimize --count using bitmaps too Vicent Marti
2013-06-25 16:05 ` [PATCH 00/16] Speed up Counting Objects with bitmap data Thomas Rast
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: http://vger.kernel.org/majordomo-info.html
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to='CALkWK0krR=PaGC6iX8abmSswr7HBtzU61-7DR00xA3CyW80dtA@mail.gmail.com' \
--to=artagnon@gmail.com \
--cc=git@vger.kernel.org \
--cc=tanoku@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).