git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "René Scharfe" <l.s.r@web.de>
To: Jeff King <peff@peff.net>
Cc: git@vger.kernel.org
Subject: Re: [PATCH] fast-import: replace custom hash with hashmap.c
Date: Mon, 6 Apr 2020 20:07:45 +0200	[thread overview]
Message-ID: <020726d4-5a5c-e751-2824-d05004823326@web.de> (raw)
In-Reply-To: <20200405185951.GA1309762@coredump.intra.peff.net>

Am 05.04.20 um 20:59 schrieb Jeff King:
> On Fri, Apr 03, 2020 at 08:53:30PM +0200, René Scharfe wrote:
>
>>> Here are the results for linux.git:
>>>
>>>   Test                        HEAD^                 HEAD
>>>   ----------------------------------------------------------------------------
>>>   9300.1: export (no-blobs)   67.64(66.96+0.67)     67.81(67.06+0.75) +0.3%
>>>   9300.2: import (no-blobs)   284.04(283.34+0.69)   198.09(196.01+0.92) -30.3%
>>
>> My numbers look a bit different for this, not sure why:
>>
>> Test                        origin/master         HEAD
>> ---------------------------------------------------------------------------
>> 9300.1: export (no-blobs)   69.36(66.44+1.56)     67.89(66.07+1.56) -2.1%
>> 9300.2: import (no-blobs)   295.10(293.83+1.19)   283.83(282.91+0.91) -3.8%
>>
>> They are still in favor of the patch, just not as strongly as yours.
>
> Interesting. I re-ran mine just to double check, and got:
>
> Test                        HEAD^                 HEAD
> ----------------------------------------------------------------------------
> 9300.1: export (no-blobs)   63.11(62.31+0.79)     62.00(61.21+0.79) -1.8%
> 9300.2: import (no-blobs)   261.79(261.06+0.72)   188.55(187.87+0.66) -28.0%
>
> (for some reason my whole machine is faster today; maybe I had closed
> slack).
>
> This is on a fresh-ish clone of linux.git.

A second run today reported:

Test                        origin/master         HEAD
---------------------------------------------------------------------------
9300.1: export (no-blobs)   61.58(59.93+1.40)     60.63(59.35+1.22) -1.5%
9300.2: import (no-blobs)   239.64(239.00+0.63)   246.02(245.18+0.82) +2.7%

git describe says v5.4-3890-g86fd3e9df543 in that repo.

Dunno.  My PC has thermal issues and stressing it for half an hour straight
may cause it to throttle?

With Git's own repo I just got this:

Test                        origin/master       HEAD
-----------------------------------------------------------------------
9300.1: export (no-blobs)   2.58(2.45+0.05)     2.81(2.75+0.05) +8.9%
9300.2: import (no-blobs)   10.41(10.37+0.03)   10.75(10.72+0.02) +3.3%

>
>>> +	e1 = container_of(eptr, const struct object_entry, ent);
>>> +	if (oid)
>>> +		return oidcmp(&e1->idx.oid, oid);
>>> +
>>> +	e2 = container_of(entry_or_key, const struct object_entry, ent);
>>> +	return oidcmp(&e1->idx.oid, &e2->idx.oid);
>>
>> Other hashmap users express this more tersely, similar to this:
>>
>> 	const struct object_entry *e1, *e2;
>>
>> 	e1 = container_of(eptr, const struct object_entry, ent);
>> 	e2 = container_of(entry_or_key, const struct object_entry, ent);
>> 	return oidcmp(&e1->idx.oid, keydata ? keydata : &e2->idx.oid);
>
> I wasn't sure if we'd ever see entry_or_key as NULL, in which case the
> second container_of() would be undefined behavior. There's a
> container_of_or_null() for this, but it seemed simpler to just avoid the
> deref entirely if we didn't need it.

OK.  The other users might have the same issue then, though.

> I think in practice we won't ever pass NULL, though. Even a
> hashmap_get() needs to pass in a hashmap_entry with the hash value in
> it. Though I think that's technically _also_ undefined behavior. If I
> have a struct where the entry is not at the beginning, like:
>
>   struct foo {
>     const char *bar;
>     struct hashmap_entry ent;
>   };
>
> then:
>
>   e2 = container_of(ptr_to_entry, struct foo, ent);
>
> is going to form a pointer at 8 bytes before ptr_to_entry. If it really
> is a "struct hashmap_entry" on the stack, then it's pointing at garbage.
>
> We don't dereference it, so it's likely fine in practice, but even
> computing such a pointer does violate the standard.

Really?  If ptr_to_entry was NULL then any pointer arithmetic on it would
be bad.  If it points to a bare hashmap_entry and we lied to container_of
by telling it that it's embedded in some other struct then that would be
bad as well.  But if there actually is a surrounding struct of the
specified type then the resulting pointer must surely be valid, no matter
if the object it points to was initialized?  Do you have the relevant
chapter and verse handy?

>
>>> +	hashmap_entry_init(&lookup_entry, oidhash(oid));
>>> +	hashent = hashmap_get(&object_table, &lookup_entry, oid);
>>> +	if (hashent)
>>> +		return container_of(hashent, struct object_entry, ent);
>>
>> That duplicates find_object()...
>
> Yes. This is how most hashmap users do it. It's only a few lines, and
> sharing without extra work would mean passing the hash value around
> (otherwise we'd have to recompute it again), which is awkward.

I find touching hashmap_entry members directly more awkward.

Compilers likely inline find_object() here, so the text size and
performance should be about the same.

> Though in our case oidhash() is cheap enough that perhaps it's not worth
> worrying about?

Probably, but I didn't measure.  My system is quite noisy..

>
>>>  	e = new_object(oid);
>>> -	e->next = object_table[h];
>>>  	e->idx.offset = 0;
>>> -	object_table[h] = e;
>>> +	e->ent.hash = lookup_entry.hash;
>>
>> ... except for this part.  Would it make sense to replace this line with
>> a call to hashmap_entry_init()?  Seems cleaner to me.  It would look
>> like this:
>> 	struct object_entry *e = find_object(oid);
>>
>> 	if (e)
>> 		return e;
>>
>> 	e = new_object(oid);
>> 	e->idx.offset = 0;
>> 	hashmap_entry_init(&e->ent, oidhash(oid));
>> 	hashmap_add(&object_table, &e->ent);
>> 	return e;
>
> Right, that calls oidhash() again. If we're OK with that, I agree it's a
> bit shorter and not any more awkward. I do worry slightly that it sets a
> bad example, though (you wouldn't want tusing only the API to o repeat a strhash(), for
> example).

Stuffing the oidhash() result into a variable and using it twice with
hashmap_entry_init() would work as well.  This would make the reason for
the duplicate find_object() code obvious, while keeping struct
hashmap_entry opaque.

René

  reply	other threads:[~2020-04-06 18:08 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-04-03 12:15 [PATCH] fast-import: replace custom hash with hashmap.c Jeff King
2020-04-03 18:53 ` René Scharfe
2020-04-05 18:59   ` Jeff King
2020-04-06 18:07     ` René Scharfe [this message]
2020-04-06 18:35       ` Jeff King
2020-04-06 18:46         ` Jeff King
2020-04-07 18:37         ` René Scharfe
2020-04-06 19:49 ` [PATCH v2] " Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=020726d4-5a5c-e751-2824-d05004823326@web.de \
    --to=l.s.r@web.de \
    --cc=git@vger.kernel.org \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).