git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jeff King <peff@peff.net>
To: Stefan Beller <sbeller@google.com>
Cc: "git@vger.kernel.org" <git@vger.kernel.org>
Subject: Re: [PATCH 1/2] obj_hash: convert to a critbit tree
Date: Wed, 14 Sep 2016 18:13:54 -0700	[thread overview]
Message-ID: <20160915011353.l5lx6ltcn2zegbnx@sigill.intra.peff.net> (raw)
In-Reply-To: <CAGZ79kb4_k=+O6pzQkZBGHQty+kdYTnQX_3110RNt097AVn+Kg@mail.gmail.com>

On Wed, Sep 14, 2016 at 05:52:06PM -0700, Stefan Beller wrote:

> > So finding "1011" involves traversing the trie: down the "1"
> > side, then the "0" side, and then check that the rest
> > matches "11".
> 
> So we stop building a tree as soon as we hit a unique data
> element (i.e. if we stick to the idea of encoding the hash along the way,
> we would have needed another node each for "11" as well as "00"
> that points to NULL and the ending data respectively.
> 
> So we stop short as soon as we have a unique.
> 
> That makes insertion very easy, because as soon as we hit
> a unique, we'd just introduce a node and add the two uniques
> left and right. (Well what happens if we were to insert
> 101010111 and 101010101 ? Both have a long prefix,
> I suppose we'd have 7 nodes and then the distiguishing
> node for those 2.)

I think it actually can represent the interior sequence as a single
node. That's what the "critical" part of critbit is for; the critical
bit is the one where we diverge. But I'm not 100% sure that it's
implemented that way.

In practice, though, sha1 would give us a pretty full tree near the
front, so I'd expect each bit to be "critical".

> > It's possible that (2) would be better if instead of a
> > critbit tree, we used a "critbyte" tree. That would involve
> > fewer node traversals, at the cost of making each node
> > larger (right now the internal nodes store 2 pointer slots;
> > they'd have to store 256 to handle a full byte). I didn't
> > try it, but I suspect it would still be slower for the same
> > reasons.
> 
> I would expect to go for a crit-"variable-length" tree instead.
> 
> The reason for this is that a higher fan out seems to be more
> beneficial in the earlier stages, e.g. we could use critbyte trees
> for the first 1-3 layers in the tree as that will have good memory
> efficiency (all 256 slots filled), but will be faster than the critbit trees
> (one lookup instead of 8 conditional jumps).

Yeah, I suspect a hybrid approach may be a better tradeoff. I encourage
you to try and measure. :)

> I guess when trying to improve the hashsets, someone tried trees
> as a collision resolver?

I don't think so. hashmap.c uses a linked list, but obj_hash just does
linear probing. Both are linear, but the latter is more memory efficient
and especially has a more compact cache footprint when resolving
collisions. The downside of open probing like that is that it's very
hard to delete an entry, but we don't care about that for obj_hash.

I don't think that improving collision resolution would help that much
for obj_hash, though. The fact that quadratic probing and cuckoo hashing
didn't yield big benefits implies that we don't spend most of our time
on collisions in the first place (OTOH, my 9a414486d9f0 that you found
does help _only_ when there are collisions, so maybe I'm wrong).

> So maybe we could have some software sided cache for hot entries?
> (I imagine a data structure consisting of 2 hash sets here, one
> hashset containing
> the complete data set, and the other hashset is a very small hashset with e.g.
> just 256(?) entries that are an LRU cache for the cache entries.
> Though this sounds like we'd be trying to outsmart the hardware... Not sure
> I'd expect gains from that)

Yeah. Basically what we want is the reverse of a bloom filter: it's OK
to be wrong, but it most be wrong to say "it's not here" when it is, not
the other way around. And so that's basically...a cache of a smaller
data-structure in front of the larger one.

But given that the hash table is already O(1)-ish, it's hard to beat
that (because remember when you have a cache miss, you then have to do
an extra lookup in the full table anyway).

I did play around with stuff like that when I was coming up with
9a414486d9f0, but was never able to make it faster. Patches welcome, of
course.

> I guess we rather want to split up the data sets on the application
> side: i.e. instead of
> having so many objects, have hash sets for e.g. blobs, trees, commits separately
> and then use slightly different strategies there (different load factors?)

My gut is that would not make a big difference (and sometimes it would
be worse, because we don't know what the object type is, or we want to
make _sure_ that we don't have the object known as a different type).

> Unrelated note about hashmaps:
> I wonder if we rather want to give good initial estimates of the size
> as one improvement

My recollection is that the growth isn't really relevant. At least for
obj_hash, we do _way_ more lookups than insertions, so the only thing
that really matters is lookup speed.

But as with everything, if you can come up with improved numbers, I'm
happy to look at the patches. :)

-Peff

  reply	other threads:[~2016-09-15  1:14 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-09-14 23:55 Journal of Failed Git Experiments, Volume 1 Jeff King
2016-09-14 23:56 ` [PATCH 1/2] obj_hash: convert to a critbit tree Jeff King
2016-09-15  0:52   ` Stefan Beller
2016-09-15  1:13     ` Jeff King [this message]
2016-09-14 23:58 ` [PATCH 2/2] use zstd zlib wrapper Jeff King
2016-09-15  1:22   ` Stefan Beller
2016-09-15  6:28     ` Jeff King
2016-09-25 14:17 ` Journal of Failed Git Experiments, Volume 1 Johannes Schindelin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160915011353.l5lx6ltcn2zegbnx@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=git@vger.kernel.org \
    --cc=sbeller@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).