git@vger.kernel.org mailing list mirror (one of many)
 help / Atom feed
From: Jeff King <peff@peff.net>
To: Jonathan Nieder <jrnieder@gmail.com>
Cc: Shawn Pearce <spearce@spearce.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Git Mailing List <git@vger.kernel.org>,
	Stefan Beller <sbeller@google.com>,
	bmwill@google.com, Jonathan Tan <jonathantanmy@google.com>,
	David Lang <david@lang.hm>,
	"brian m. carlson" <sandals@crustytoothpaste.net>
Subject: Re: RFC v3: Another proposed hash function transition plan
Date: Fri, 10 Mar 2017 14:38:35 -0500
Message-ID: <20170310193835.t7syswueuu7nmkjz@sigill.intra.peff.net> (raw)
In-Reply-To: <20170309202408.GA17847@aiede.mtv.corp.google.com>

On Thu, Mar 09, 2017 at 12:24:08PM -0800, Jonathan Nieder wrote:

> > SHA-1 to SHA-3: lookup SHA-1 in .msha1, reverse .idx, find offset to
> > read the SHA-3.
> > SHA-3 to SHA-1: lookup SHA-3 in .idx, and reverse the .msha1 file to
> > translate offset to SHA-1.
> 
> Thanks for this suggestion.  I was initially vaguely nervous about
> lookup times in an idx-style file, but as you say, object reads from a
> packfile already have to deal with this kind of lookup and work fine.

Not exactly. The "reverse .idx" step has to build the reverse mapping on
the fly, and it's non-trivial. For instance, try:

  sha1=$(git rev-parse HEAD)
  time echo $sha1 | git cat-file --batch-check='%(objectsize)'
  time echo $sha1 | git cat-file --batch-check='%(objectsize:disk)'

on a large repo (where HEAD is in a big pack). The on-disk size is
conceptually simpler, as we only need to look at the offset of the
object versus the offset of the object after it. But in practice it
takes much longer, because it has to build the revindex on the fly (I
get 7ms versus 179ms on linux.git).

The effort is linear in the number of objects (we create the revindex
with a radix sort).

The reachability bitmaps suffer from this, too, as they need the
revindex to know which object is at which bit position. At GitHub we
added an extension to the .bitmap files that stores this "bit cache".
Here are timings before and after on linux.git:

  $ time git rev-list --use-bitmap-index --count master
  659371

  real	0m0.182s
  user	0m0.136s
  sys	0m0.044s

  $ time git.gh rev-list --use-bitmap-index --count master
  659371

  real	0m0.016s
  user	0m0.008s
  sys	0m0.004s

It's not a full revindex, but it's enough for bitmap use. You can also
use it to generate the revindex slightly more quickly, because you can
skip the sorting step (you just insert the entries in the correct order
by walking the bit cache and dereferencing the offsets from the .idx
portion). So it's still linear, but with a smaller constant factor.

I think for the purposes here, though, we don't actually care about the
offsets. For the cost of one uint32_t per object, you can keep a list
mapping positions in the sha1 index into the sha3 index. So then you do
the log-n binary search to find the sha1, a constant-time lookup in the
mapping array, and that gives you the position in the sha3 index, from
which you can then access the sha3 (or the actual pack offset, for that
matter).

So I think it's solvable, but I suspect we would want an extension to
the .idx format to store the mapping array, in order to keep it log-n.

-Peff

  reply index

Thread overview: 113+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-03-04  1:12 RFC: " Jonathan Nieder
2017-03-05  2:35 ` Linus Torvalds
2017-03-06  0:26   ` brian m. carlson
2017-03-06 18:24     ` Brandon Williams
2017-06-15 10:30       ` Which hash function to use, was " Johannes Schindelin
2017-06-15 11:05         ` Mike Hommey
2017-06-15 13:01           ` Jeff King
2017-06-15 16:30             ` Ævar Arnfjörð Bjarmason
2017-06-15 19:34               ` Johannes Schindelin
2017-06-15 21:59                 ` Adam Langley
2017-06-15 22:41                   ` brian m. carlson
2017-06-15 23:36                     ` Ævar Arnfjörð Bjarmason
2017-06-16  0:17                       ` brian m. carlson
2017-06-16  6:25                         ` Ævar Arnfjörð Bjarmason
2017-06-16 13:24                           ` Johannes Schindelin
2017-06-16 17:38                             ` Adam Langley
2017-06-16 20:52                               ` Junio C Hamano
2017-06-16 21:12                                 ` Junio C Hamano
2017-06-16 21:24                                   ` Jonathan Nieder
2017-06-16 21:39                                     ` Ævar Arnfjörð Bjarmason
2017-06-16 20:42                             ` Jeff King
2017-06-19  9:26                               ` Johannes Schindelin
2017-06-15 21:10             ` Mike Hommey
2017-06-16  4:30               ` Jeff King
2017-06-15 17:36         ` Brandon Williams
2017-06-15 19:20           ` Junio C Hamano
2017-06-15 19:13         ` Jonathan Nieder
2017-03-07  0:17   ` RFC v3: " Jonathan Nieder
2017-03-09 19:14     ` Shawn Pearce
2017-03-09 20:24       ` Jonathan Nieder
2017-03-10 19:38         ` Jeff King [this message]
2017-03-10 19:55           ` Jonathan Nieder
2017-09-28  4:43       ` [PATCH v4] technical doc: add a design doc for hash function transition Jonathan Nieder
2017-09-29  6:06         ` Junio C Hamano
2017-09-29  8:09           ` Junio C Hamano
2017-09-29 17:34           ` Jonathan Nieder
2017-10-02  8:25             ` Junio C Hamano
2017-10-02 19:41             ` Jason Cooper
2017-10-02  9:02         ` Junio C Hamano
2017-10-02 19:23         ` Jason Cooper
2017-10-03  5:40         ` Junio C Hamano
2017-10-03 13:08           ` Jason Cooper
2017-10-04  1:44         ` Junio C Hamano
2017-09-06  6:28     ` RFC v3: Another proposed hash function transition plan Junio C Hamano
2017-09-08  2:40       ` Junio C Hamano
2017-09-08  3:34         ` Jeff King
2017-09-11 18:59         ` Brandon Williams
2017-09-13 12:05           ` Johannes Schindelin
2017-09-13 13:43             ` demerphq
2017-09-13 22:51               ` Jonathan Nieder
2017-09-14 18:26                 ` Johannes Schindelin
2017-09-14 18:40                   ` Jonathan Nieder
2017-09-14 22:09                     ` Johannes Schindelin
2017-09-13 23:30               ` Linus Torvalds
2017-09-14 18:45                 ` Johannes Schindelin
2017-09-18 12:17                   ` Gilles Van Assche
2017-09-18 22:16                     ` Johannes Schindelin
2017-09-19 16:45                       ` Gilles Van Assche
2017-09-29 13:17                         ` Johannes Schindelin
2017-09-29 14:54                           ` Joan Daemen
2017-09-29 22:33                             ` Johannes Schindelin
2017-09-30 22:02                               ` Joan Daemen
2017-10-02 14:26                                 ` Johannes Schindelin
2017-09-18 22:25                     ` Jonathan Nieder
2017-09-26 17:05                   ` Jason Cooper
2017-09-26 22:11                     ` Johannes Schindelin
2017-09-26 22:25                       ` [PATCH] technical doc: add a design doc for hash function transition Stefan Beller
2017-09-26 23:38                         ` Jonathan Nieder
2017-09-26 23:51                       ` RFC v3: Another proposed hash function transition plan Jonathan Nieder
2017-10-02 14:54                         ` Jason Cooper
2017-10-02 16:50                           ` Brandon Williams
2017-10-02 14:00                       ` Jason Cooper
2017-10-02 17:18                         ` Linus Torvalds
2017-10-02 19:37                           ` Jeff King
2017-09-13 16:30             ` Jonathan Nieder
2017-09-13 21:52               ` Junio C Hamano
2017-09-13 22:07                 ` Stefan Beller
2017-09-13 22:18                   ` Jonathan Nieder
2017-09-14  2:13                     ` Junio C Hamano
2017-09-14 15:23                       ` Johannes Schindelin
2017-09-14 15:45                         ` demerphq
2017-09-14 22:06                           ` Johannes Schindelin
2017-09-13 22:15                 ` Junio C Hamano
2017-09-13 22:27                   ` Jonathan Nieder
2017-09-14  2:10                     ` Junio C Hamano
2017-09-14 12:39               ` Johannes Schindelin
2017-09-14 16:36                 ` Brandon Williams
2017-09-14 18:49                 ` Jonathan Nieder
2017-09-15 20:42                   ` Philip Oakley
2017-03-05 11:02 ` RFC: " David Lang
     [not found]   ` <CA+dhYEXHbQfJ6KUB1tWS9u1MLEOJL81fTYkbxu4XO-i+379LPw@mail.gmail.com>
2017-03-06  9:43     ` Jeff King
2017-03-06 23:40   ` Jonathan Nieder
2017-03-07  0:03     ` Mike Hommey
2017-03-06  8:43 ` Jeff King
2017-03-06 18:39   ` Jonathan Tan
2017-03-06 19:22     ` Linus Torvalds
2017-03-06 19:59       ` Brandon Williams
2017-03-06 21:53       ` Junio C Hamano
2017-03-07  8:59     ` Jeff King
2017-03-06 18:43   ` Junio C Hamano
2017-03-07 18:57 ` Ian Jackson
2017-03-07 19:15   ` Linus Torvalds
2017-03-08 11:20     ` Ian Jackson
2017-03-08 15:37       ` Johannes Schindelin
2017-03-08 15:40       ` Johannes Schindelin
2017-03-20  5:21         ` Use base32? Jason Hennessey
2017-03-20  5:58           ` Michael Steuer
2017-03-20  8:05             ` Jacob Keller
2017-03-21  3:07               ` Michael Steuer
2017-03-13  9:24 ` RFC: Another proposed hash function transition plan The Keccak Team
2017-03-13 17:48   ` Jonathan Nieder
2017-03-13 18:34     ` ankostis
2017-03-17 11:07       ` Johannes Schindelin

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170310193835.t7syswueuu7nmkjz@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=bmwill@google.com \
    --cc=david@lang.hm \
    --cc=git@vger.kernel.org \
    --cc=jonathantanmy@google.com \
    --cc=jrnieder@gmail.com \
    --cc=sandals@crustytoothpaste.net \
    --cc=sbeller@google.com \
    --cc=spearce@spearce.org \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

git@vger.kernel.org mailing list mirror (one of many)

Archives are clonable:
	git clone --mirror https://public-inbox.org/git
	git clone --mirror http://ou63pmih66umazou.onion/git
	git clone --mirror http://czquwvybam4bgbro.onion/git
	git clone --mirror http://hjrcffqmbrq6wope.onion/git

Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.version-control.git
	nntp://ou63pmih66umazou.onion/inbox.comp.version-control.git
	nntp://czquwvybam4bgbro.onion/inbox.comp.version-control.git
	nntp://hjrcffqmbrq6wope.onion/inbox.comp.version-control.git
	nntp://news.gmane.org/gmane.comp.version-control.git

 note: .onion URLs require Tor: https://www.torproject.org/
       or Tor2web: https://www.tor2web.org/

AGPL code for this site: git clone https://public-inbox.org/ public-inbox