git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Ramkumar Ramachandra <artagnon@gmail.com>
To: Vicent Marti <tanoku@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: [PATCH 03/16] pack-objects: use a faster hash table
Date: Tue, 25 Jun 2013 23:28:12 +0530	[thread overview]
Message-ID: <CALkWK0krR=PaGC6iX8abmSswr7HBtzU61-7DR00xA3CyW80dtA@mail.gmail.com> (raw)
In-Reply-To: <1372116193-32762-4-git-send-email-tanoku@gmail.com>

Vicent Marti wrote:
> When replacing the old hash table implementation in `pack-objects` with
> the khash_sha1 table, the insertion time is greatly reduced:

Why?  What is the exact change?

> The big win here, however, is in the massively reduced amount of hash
> collisions

Okay, so there seems to be some problem with how collisions are
handled in the hashtable.

> -static int locate_object_entry_hash(const unsigned char *sha1)
> -{
> -       int i;
> -       unsigned int ui;
> -       memcpy(&ui, sha1, sizeof(unsigned int));
> -       i = ui % object_ix_hashsz;
> -       while (0 < object_ix[i]) {
> -               if (!hashcmp(sha1, objects[object_ix[i] - 1].idx.sha1))
> -                       return i;
> -               if (++i == object_ix_hashsz)
> -                       i = 0;
> -       }
> -       return -1 - i;
> -}

Classical chaining to handle collisions: very naive.  Deserves to be thrown out.

> -static void rehash_objects(void)
> -{
> -       uint32_t i;
> -       struct object_entry *oe;
> -
> -       object_ix_hashsz = nr_objects * 3;
> -       if (object_ix_hashsz < 1024)
> -               object_ix_hashsz = 1024;
> -       object_ix = xrealloc(object_ix, sizeof(int) * object_ix_hashsz);
> -       memset(object_ix, 0, sizeof(int) * object_ix_hashsz);
> -       for (i = 0, oe = objects; i < nr_objects; i++, oe++) {
> -               int ix = locate_object_entry_hash(oe->idx.sha1);
> -               if (0 <= ix)
> -                       continue;
> -               ix = -1 - ix;
> -               object_ix[ix] = i + 1;
> -       }
> -}

This is called when the hashtable runs out of space.  It didn't appear
in your profiler because it doesn't appear to be a bottleneck, right?
Growing aggressively to 3x times the number of objects probably
explains it.  Just for comparison, how does khash grow?

>  static struct object_entry *locate_object_entry(const unsigned char *sha1)
>  {
> -       int i;
> +       khiter_t pos = kh_get_sha1(packed_objects, sha1);
>
> -       if (!object_ix_hashsz)
> -               return NULL;
> +       if (pos < kh_end(packed_objects)) {

Wait, why is this required?  When will kh_get_sha1() return a position
beyond kh_end()?  What does that mean?

> +               return kh_value(packed_objects, pos);
> +       }
>
> -       i = locate_object_entry_hash(sha1);
> -       if (0 <= i)
> -               return &objects[object_ix[i]-1];
>         return NULL;
>  }

Overall, replaced call to locate_object_entry_hash() with a call to
kh_get_sha1().  Okay.

> -static int add_object_entry(const unsigned char *sha1, enum object_type type,
> -                           const char *name, int exclude)
> +static int add_object_entry_1(const unsigned char *sha1, enum object_type type,
> +                           uint32_t hash, int exclude, struct packed_git *found_pack,
> +                               off_t found_offset)
>  {
>         struct object_entry *entry;
> -       struct packed_git *p, *found_pack = NULL;
> -       off_t found_offset = 0;
> -       int ix;
> -       unsigned hash = name_hash(name);
> +       struct packed_git *p;
> +       khiter_t ix;
> +       int hash_ret;
>
> -       ix = nr_objects ? locate_object_entry_hash(sha1) : -1;
> -       if (ix >= 0) {
> +       ix = kh_put_sha1(packed_objects, sha1, &hash_ret);

You don't need to call locate_object_entry() to check for collisions
because kh_put_sha1() takes care of that?

> +       if (hash_ret == 0) {
>                 if (exclude) {
> -                       entry = objects + object_ix[ix] - 1;
> +                       entry = kh_value(packed_objects, ix);

Superficial change: using kh_value(), because we stripped out the
chaining logic.

> @@ -966,19 +965,30 @@ static int add_object_entry(const unsigned char *sha1, enum object_type type,
>                 entry->in_pack_offset = found_offset;
>         }
>
> -       if (object_ix_hashsz * 3 <= nr_objects * 4)
> -               rehash_objects();
> -       else
> -               object_ix[-1 - ix] = nr_objects;
> +       kh_value(packed_objects, ix) = entry;
> +       kh_key(packed_objects, ix) = entry->idx.sha1;
> +       objects[nr_objects++] = entry;

Wait, what?  Why didn't you use kh_put_sha1()?

I didn't look very carefully, but the patch seems to be okay overall.
On the issue of which hashtable replacement to use (why khash, and not
something else?), I briefly looked at linux.git's linux/hashtable.h
and git.git's hash.h; both of them are chaining hashes.  From a brief
look at khash.h, it seems to be somewhat less naive and sane: my only
concern is that it is written entirely in using CPP macros which is a
great for syntax/performance, but not-so-great for debugging.  I don't
know if there's a better off-the-shelf implementation out there, but I
haven't been looking for one either.  By the way, it's MIT license
authored by an anonymous person (sources at:
https://github.com/attractivechaos/klib/blob/master/khash.h), but I
don't know if that's a problem.

Thanks.

  parent reply	other threads:[~2013-06-25 17:58 UTC|newest]

Thread overview: 64+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-06-24 23:22 [PATCH 00/16] Speed up Counting Objects with bitmap data Vicent Marti
2013-06-24 23:22 ` [PATCH 01/16] list-objects: mark tree as unparsed when we free its buffer Vicent Marti
2013-06-24 23:22 ` [PATCH 02/16] sha1_file: refactor into `find_pack_object_pos` Vicent Marti
2013-06-25 13:59   ` Thomas Rast
2013-06-24 23:23 ` [PATCH 03/16] pack-objects: use a faster hash table Vicent Marti
2013-06-25 14:03   ` Thomas Rast
2013-06-26  2:14     ` Jeff King
2013-06-26  4:47       ` Jeff King
2013-06-25 17:58   ` Ramkumar Ramachandra [this message]
2013-06-25 22:48   ` Junio C Hamano
2013-06-25 23:09     ` Vicent Martí
2013-06-24 23:23 ` [PATCH 04/16] pack-objects: make `pack_name_hash` global Vicent Marti
2013-06-24 23:23 ` [PATCH 05/16] revision: allow setting custom limiter function Vicent Marti
2013-06-24 23:23 ` [PATCH 06/16] sha1_file: export `git_open_noatime` Vicent Marti
2013-06-24 23:23 ` [PATCH 07/16] compat: add endinanness helpers Vicent Marti
2013-06-25 13:08   ` Peter Krefting
2013-06-25 13:25     ` Vicent Martí
2013-06-27  5:56       ` Peter Krefting
2013-06-24 23:23 ` [PATCH 08/16] ewah: compressed bitmap implementation Vicent Marti
2013-06-25  1:10   ` Junio C Hamano
2013-06-25 22:51     ` Junio C Hamano
2013-06-25 15:38   ` Thomas Rast
2013-06-24 23:23 ` [PATCH 09/16] documentation: add documentation for the bitmap format Vicent Marti
2013-06-25  5:42   ` Shawn Pearce
2013-06-25 19:33     ` Vicent Martí
2013-06-25 21:17       ` Junio C Hamano
2013-06-25 22:08         ` Vicent Martí
2013-06-27  1:11           ` Shawn Pearce
2013-06-27  2:36             ` Vicent Martí
2013-06-27  2:45               ` Jeff King
2013-06-27 16:07                 ` Shawn Pearce
2013-06-27 17:17                   ` Jeff King
2013-07-01 18:47                   ` Colby Ranger
2013-07-01 19:13                     ` Shawn Pearce
2013-07-07  9:46                     ` Jeff King
2013-07-07 17:27                       ` Shawn Pearce
2013-06-26  5:11       ` Jeff King
2013-06-26 18:41         ` Colby Ranger
2013-06-26 22:33           ` Colby Ranger
2013-06-27  0:53             ` Colby Ranger
2013-06-27  1:32               ` Shawn Pearce
2013-06-27  1:29         ` Shawn Pearce
2013-06-25 15:58   ` Thomas Rast
2013-06-25 22:30     ` Vicent Martí
2013-06-26 23:12       ` Thomas Rast
2013-06-26 23:19         ` Thomas Rast
2013-06-24 23:23 ` [PATCH 10/16] pack-objects: use bitmaps when packing objects Vicent Marti
2013-06-25 12:48   ` Ramkumar Ramachandra
2013-06-25 15:58   ` Thomas Rast
2013-06-25 23:06   ` Junio C Hamano
2013-06-25 23:14     ` Vicent Martí
2013-06-24 23:23 ` [PATCH 11/16] rev-list: add bitmap mode to speed up lists Vicent Marti
2013-06-25 16:22   ` Thomas Rast
2013-06-26  1:45     ` Vicent Martí
2013-06-26 23:13       ` Thomas Rast
2013-06-26  5:22     ` Jeff King
2013-06-24 23:23 ` [PATCH 12/16] pack-objects: implement bitmap writing Vicent Marti
2013-06-24 23:23 ` [PATCH 13/16] repack: consider bitmaps when performing repacks Vicent Marti
2013-06-25 23:00   ` Junio C Hamano
2013-06-25 23:16     ` Vicent Martí
2013-06-24 23:23 ` [PATCH 14/16] sha1_file: implement `nth_packed_object_info` Vicent Marti
2013-06-24 23:23 ` [PATCH 15/16] write-bitmap: implement new git command to write bitmaps Vicent Marti
2013-06-24 23:23 ` [PATCH 16/16] rev-list: Optimize --count using bitmaps too Vicent Marti
2013-06-25 16:05 ` [PATCH 00/16] Speed up Counting Objects with bitmap data Thomas Rast

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CALkWK0krR=PaGC6iX8abmSswr7HBtzU61-7DR00xA3CyW80dtA@mail.gmail.com' \
    --to=artagnon@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=tanoku@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).