git@vger.kernel.org mailing list mirror (one of many)
From: Jeff King <peff@peff.net>
To: Eric Wong <e@80x24.org>
Cc: Johannes Schindelin <Johannes.Schindelin@gmx.de>,
	Josh Triplett <josh@joshtriplett.org>,
	git@vger.kernel.org
Subject: Re: Cross-referencing the Git mailing list archive with their corresponding commits in `pu`
Date: Mon, 6 Feb 2017 17:07:55 -0500
Message-ID: <20170206220754.5q2oddr5ej7c6qcg@sigill.intra.peff.net>
In-Reply-To: <20170206204820.GA7128@starla>

On Mon, Feb 06, 2017 at 08:48:20PM +0000, Eric Wong wrote:

> I haven't hit insurmountable performance problems, even on
> low-end hardware; especially since I started storing blob ids in
> Xapian itself, avoiding the expensive tree lookup via git.

The painful thing is traversing the object graph for clones and fetches.
Bitmaps help, but you still have to generate them.

> The main problem seems to be tree size.  Deepening (2/2/36 vs
> 2/38) might be an option (I think Peff brought that up); but it
> might be easier to switch to YYYYMM refs (working like
> logrotate) and rely on Xapian to tie the entire thing together.

Yes, the hashing is definitely one issue. Some numbers here:

  http://public-inbox.org/git/20160805092805.w3nwv2l6jkbuwlzf@sigill.intra.peff.net/

If you have C commits on a tree with T entries, you have to do C*T hash
lookups for a flat tree (for each commit, you have to see "yup, already
saw that object"). Sharding that across H entries at the top level drops
the tree cost from T to H + T/H (actually, it's a bit worse because we
have to read the secondary tree, too). Sharding again (at H') gets you
H + H' + T/H/H'.

Let's imagine you do one message per commit, so C=T. At 400K messages,
that's about 160 billion hash lookups flat. At H=256, it's about 700
million. If you shard again with H'=256, it's 200 million. After that,
the additive terms start to dominate, and it's not worth going any
further (and also, we're ignoring the extra-tree cost to each level).
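
As a rough check of the arithmetic above, here is a minimal Python sketch
of the cost model. The function name `hash_lookups` and its parameters are
made up for illustration; it assumes every commit sees the full tree of T
entries and, like the estimate above, ignores the cost of reading the
secondary trees.

```python
import math

def hash_lookups(commits, entries, fanouts=()):
    """Estimate hash lookups needed to walk the tree of every commit.

    Each commit costs the sum of the shard fan-outs plus the size of one
    leaf tree, entries / product(fanouts). An empty `fanouts` is the flat
    layout, which costs `entries` lookups per commit.
    """
    per_commit = sum(fanouts) + entries / math.prod(fanouts)
    return commits * per_commit

n = 400_000  # one message per commit, so C = T
for label, fanouts in [("flat", ()), ("H=256", (256,)), ("H=H'=256", (256, 256))]:
    print(f"{label:>9}: {hash_lookups(n, n, fanouts):,.0f}")
```

Under these assumptions the three cases come out at roughly 160 billion,
727 million, and 207 million lookups.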

At that point you're better off having fewer commits. I know
that the schema you use does put useful information into the commit
message, but it's also redundant with what's in the messages themselves.
And it sounds like you push most of that out to Xapian anyway.

Imagine your repo had one commit holding the 400K historical messages,
and then grouped new messages so that on average we got about 10
messages per commit. This doesn't seem unrealistic for something that
commits every few minutes, since messages tend to be bunched in time (I
ran some numbers against a 10-minute mark in the earlier message).

Then after another 100K messages, we'd have C=10,001 and T=500K. With
two levels of hashing at 256 each, that's ~5 million hash lookups to
walk the graph. And those numbers would be reasonable for a hosting site
like GitHub.
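
To illustrate how much the batching helps, the following Python sketch
applies the same simplified model (per-commit cost H + H' + T/H/H', extra
tree reads ignored) to the batched history and to a hypothetical
one-commit-per-message history of the same size. The helper name
`walk_cost` is an assumption made for this example only.

```python
import math

def walk_cost(commits, entries, fanouts):
    """Hash lookups to walk `commits` commits, each seeing `entries`
    tree entries sharded by the given fan-outs (simplified model)."""
    return commits * (sum(fanouts) + entries / math.prod(fanouts))

fanouts = (256, 256)   # two levels of hashing at 256 each
entries = 500_000      # T after another 100K messages

# One big historical commit plus 100K new messages at ~10 per commit.
batched = walk_cost(10_001, entries, fanouts)
# Hypothetical comparison: one commit per message, so C = T.
per_message = walk_cost(entries, entries, fanouts)

print(f"batched:     {batched:,.0f}")
print(f"per message: {per_message:,.0f}")
```

With these inputs the batched history needs about 5.2 million lookups,
against roughly 260 million when every message gets its own commit.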

I don't know what C is for the kernel repo, but I suspect that with the
right tuning it could be made large-but-reasonable.

-Peff

Thread overview: 11+ messages
2017-02-06 15:34 Cross-referencing the Git mailing list archive with their corresponding commits in `pu` Johannes Schindelin
2017-02-06 19:10 ` Junio C Hamano
2017-02-09 14:11   ` Lars Schneider
2017-02-09 21:53     ` Johannes Schindelin
2017-02-09 22:18       ` Junio C Hamano
2017-02-06 20:48 ` Eric Wong
2017-02-06 22:07   ` Jeff King [this message]
2017-02-07  0:14     ` Eric Wong
2017-02-17 17:50 ` Johannes Schindelin
2017-02-20 19:33   ` Junio C Hamano
2017-02-20 20:06     ` Junio C Hamano
