Date: Thu, 13 Jul 2017 16:35:33 -0400
From: Jeff King
To: Stefan Beller
Cc: Shawn Pearce, git <git@vger.kernel.org>
Subject: Re: reftable: new ref storage format
Message-ID: <20170713203533.vcfyf5iei46g4tcf@sigill.intra.peff.net>
References: <20170713193234.fkxf73t6jevj4svg@sigill.intra.peff.net>

On Thu, Jul 13, 2017 at 12:56:54PM -0700, Stefan Beller wrote:

> >> ### Problem statement
> >>
> >> Some repositories contain a lot of references (e.g. android at 866k,
> >> rails at 31k). The existing packed-refs format takes up a lot of
> >> space (e.g. 62M), and does not scale with additional references.
> >> Lookup of a single reference requires linearly scanning the file.
> >
> > I think the linear scan is actually an implementation shortcoming. Even
> > though the records aren't fixed-length, the fact that newlines can only
> > appear as end-of-record is sufficient to mmap and binary search a
> > packed-refs file (you just have to backtrack a little when you land in
> > the middle of a record).
>
> Except that a record is a "delta" to the previous record, so it's not
> just finding a record, but reconstructing it. Example for records:

I was still talking about the existing packed-refs implementation here.
I agree that a full binary search of a reftable is harder because of
the prefix compression (it may still be possible by scanning backwards,
but I think there are ambiguities when you land in the middle of a
record, since there's no unambiguous end-of-record character).

But I don't think it matters. If you binary-search to a constant-sized
block, then a linear scan of the block is acceptable.
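To make the packed-refs point concrete, here's a rough sketch of that
binary search (it cheats: plain seek/read instead of mmap, a dumb
byte-at-a-time backtrack, and it skips the header and peeled "^" lines
that a real reader would have to handle; the filename and refname are
just example arguments):

  perl -e '
    use strict;
    my ($file, $want) = @ARGV;
    open(my $fh, "<", $file) or die "open: $!";
    my ($lo, $hi) = (0, -s $fh);
    while ($lo < $hi) {
      my $mid = int(($lo + $hi) / 2);
      # back up until we hit the newline that ends the previous
      # record (or the start of the file)
      my $start = $mid;
      while ($start > 0) {
        seek($fh, $start - 1, 0);
        read($fh, my $c, 1);
        last if $c eq "\n";
        $start--;
      }
      # read the whole record we landed in and compare
      seek($fh, $start, 0);
      chomp(my $line = <$fh>);
      my ($sha1, $ref) = split / /, $line, 2;
      if ($ref eq $want) { print "$sha1\n"; exit 0 }
      elsif ($ref lt $want) { $lo = tell($fh) }  # start of next record
      else { $hi = $start }
    }
    exit 1;
  ' packed-refs refs/heads/master

The invariant is that $lo always sits on a record boundary, so the
backtrack can never escape the current search window.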
> >> - Occupy less disk space for large repositories.
> >
> > Good goal. Just to play devil's advocate, the simplest way to do that
> > with the current code would be to gzip packed-refs (and/or store sha1s
> > as binary). That works against the "mmap and binary search" plan,
> > though. :)
>
> Given the compression by delta-ing the name to the previous change and
> the fact that Gerrit has
>
> refs/heads/changes/1
> refs/heads/changes/2
> refs/heads/changes/3
> ...
>
> I think this format would trump a "dumb" zip.
> (Github having sequentially numbered pull requests would also
> benefit here)

You may be surprised. Let's imagine that you have a set of 4096 refs in
refs/changes/1, refs/changes/2, etc:

  for i in $(seq 1 4096)
  do
    echo refs/changes/$i
  done >input

Now let's do a prefix compression, with a single byte for "how many
characters to reuse from the last entry":

  perl -lne '
    my $common;
    if (defined $last) {
      chop $last while !/^\Q$last\E/;
      $common = length($last);
    } else {
      $common = 0;
    }
    print chr($common), substr($_, $common);
    $last = $_;
  ' <input >prefix

And a gzip:

  gzip -c -9 <input >zip

And the results:

  $ wc -c prefix; wc -c zip
  12754 prefix
  10116 zip

The secret sauce is most likely that gzip is bit-packing, using only a
few bits per character and not aligning with byte boundaries.

Not that I'm recommending just gzipping the whole packed-refs file. It
ruins the fast-lookup. We _could_ consider gzipping individual blocks
of a reftable (or any structure that allows you to search to a
constant-sized block and do a linear search from there). But given that
they're in the same ballpark, I'm happy with whatever ends up the
simplest to code and debug. ;)

Just for fun, here's the decoding script for the prefix-compression:

  perl -e '
    while (read(STDIN, $common, 1)) {
      $common = ord($common);
      $rest = <STDIN>;
      if ($common > 0) {
        $rest = substr($last, 0, $common) . $rest;
      }
      print $rest;
      $last = $rest;
    }
  '

> OK, let me try to summarize to see if I understand.
>
> When Shawn presented the proposal, a couple of colleagues here
> were as excited as I was, but the daring question is, why Shawn
> did not give the whole thing in BNF format from top down:
>
>   initial-block
>   content-blocks*
>   (index-block)
>   footer

Yeah, I agree it took me a bit to figure out what was going on. A
high-level overview of the format would have been nice.

> > So lookup really is more
> > like O(block_size * log(n/block_size)), but block_size being a constant,
> > it drops out to O(log n).
>
> There is also an index block such that you can binary search across
> blocks, so
>
>   O( log(block_count) + log(intra_block_restarting_points) + small linear scan)
>
> There are 2 binary searches, and the block size is an interesting
> thing to look at when making up trade offs.

Right, the cross-block index was what I was trying to account for.
Either way, from a big-O perspective the block size and the number of
restarts are constants with respect to the total number of entries.

I'm happy with log(n), though. It's hard to do better.
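Just to put back-of-the-envelope numbers on those two binary searches
(the per-block figures here are made up; the real ones depend on the
block size and the ref names):

  perl -e '
    my $refs = 866_000;   # android, from the problem statement
    my $per_block = 64;   # made up: refs per block
    my $restarts = 16;    # made up: entries between restart points
    my $blocks = $refs / $per_block;
    printf "block-index probes: ~%d\n", log($blocks) / log(2);
    printf "restart probes:     ~%d\n", log($per_block / $restarts) / log(2);
  '

That's roughly 13 probes in the index plus a couple within the block
and a short linear scan, versus ~20 probes for a flat binary search
over all 866k entries. Either way, log(n).

-Peff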