Re: [PATCH] oidmap: map with OID as key

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

From: Jonathan Tan <jonathantanmy@google.com>
To: Jeff King <peff@peff.net>
Cc: Brandon Williams <bmwill@google.com>,
	git@vger.kernel.org, git@jeffhostetler.com
Subject: Re: [PATCH] oidmap: map with OID as key
Date: Fri, 29 Sep 2017 12:04:56 -0700	[thread overview]
Message-ID: <20170929120456.3bb8021de1c7aebee7bb5026@google.com> (raw)
In-Reply-To: <20170928200556.grysihlj7cbzocfq@sigill.intra.peff.net>

On Thu, 28 Sep 2017 16:05:57 -0400
Jeff King <peff@peff.net> wrote:

> When I saw that you were implementing "oidset" in terms of "oidmap", I
> was all ready to be crabby about this extra memory. But then I saw that
> the implementation tries hard not to waste any memory. :)
> 
> All of which is to say I gave this some thought when I was in the "ready
> to be crabby" phase, and came to the conclusion that it probably isn't
> that painful. An unused pointer is 8 bytes per entry. We're already
> spending 20 for the oid itself (which is likely to grow to 32
> eventually), plus 8 for the chained "next" pointer. Plus potentially 8
> for a padded version of the hash, if we just use a straight hashmap that
> duplicates the hash field.
> 
> So depending how you count it, we're wasting between 28% (sha1 and no
> extra hash) and 16% (sha256 plus reusing hashmap). That's not great, but
> it's probably not breaking the bank.

Hmm...how did you get the 16% figure? The values, as I see it, are:
 - 32 for the sha256 hash itself
 - 8 for the "next" chain pointer
 - 8 for the padded hash
 - 8 for the "util" pointer

For an oidset, the padded hash and the "util" pointer are wasted, which is
16/56=0.286. (If you assume, say, 8 bytes of malloc bookkeeping overhead, then
16/64=0.25.)

For other uses of an oidmap, the padded hash and "util" pointer are still
wasted, so the numbers are the same as above (except that the malloc
bookkeeping overhead is doubled).

> Another way of thinking about it. Large-ish (but not insane) repos have
> on the order of 5-10 million objects. If we had an oidset that mentioned
> every single object in the repository, that's 40-80MB wasted in the
> worst case. For current uses of oidset, that's probably fine. It's
> generally used only to collect ref tips (so probably two orders of
> magnitude less).
> 
> If you're planning on using an oidset to mark every object in a
> 100-million-object monorepo, we'd probably care more. But I'd venture to
> say that any scheme which involves generating that hash table on the fly
> is doing it wrong. At at that scale we'd want to look at compact
> mmap-able on-disk representations.

In a 100-million-object monorepo, we will probably end up only operating on the
"frontier" objects (objects that we do not have but we know about because some
object we have reference them) at the worst. I don't have numbers but I think
that that is at least an order of magnitude less than 100M.

> So I think we may be better off going with the solution here that's
> simpler and requires introducing less code. If it does turn out to be a
> memory problem in the future, this is a _really_ easy thing to optimize
> after the fact, because we have these nice abstractions.

Optimizing away the padded hash should be straightforward. Optimizing away the
"util" pointer after we have code that uses it is (slightly?) less
straightforward, but still doable.

I still lean towards saving memory by eliminating the padded hash and
not using the "util" pointer, because the complication is contained
within only two files (oidmap.{c,h}), and because the constant factors
in memory savings might end up mattering. But I don't feel strongly
about that - I think any of the oidmaps that we have discussed is an
improvement over what we have now.

next prev parent reply	other threads:[~2017-09-29 19:05 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-09-27 22:19 [PATCH] oidmap: map with OID as key Jonathan Tan
2017-09-28  0:41 ` Brandon Williams
2017-09-28 17:46   ` Jonathan Tan
2017-09-28 20:05     ` Jeff King
2017-09-29 19:04       ` Jonathan Tan [this message]
2017-09-29 19:26         ` Jeff King
2017-09-29 21:43       ` Johannes Schindelin
2017-09-29 23:24         ` Jeff King
2017-09-28  3:13 ` Junio C Hamano
2017-09-28 17:38   ` Jonathan Tan
2017-09-29 22:54 ` [PATCH v2] " Jonathan Tan
2017-10-02 23:48   ` Brandon Williams
2017-10-03  6:31     ` Jeff King
2017-10-04  0:29       ` Jonathan Tan
2017-10-04  7:45         ` Jeff King
2017-10-04  8:48           ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170929120456.3bb8021de1c7aebee7bb5026@google.com \
    --to=jonathantanmy@google.com \
    --cc=bmwill@google.com \
    --cc=git@jeffhostetler.com \
    --cc=git@vger.kernel.org \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).