Re: reftable [v4]: new ref storage format

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

From: Michael Haggerty <mhagger@alum.mit.edu>
To: Shawn Pearce <spearce@spearce.org>
Cc: git <git@vger.kernel.org>, Jeff King <peff@peff.net>,
	Junio C Hamano <gitster@pobox.com>,
	David Borowitz <dborowitz@google.com>
Subject: Re: reftable [v4]: new ref storage format
Date: Tue, 1 Aug 2017 17:49:47 -0700	[thread overview]
Message-ID: <CAMy9T_FA1RV+NFxaXR65gw7G6OxL7Z8Ve3VNLrF4oyCdqtdahg@mail.gmail.com> (raw)
In-Reply-To: <CAJo=hJtrdCOF-RxzXfyLx7R-1f2-7pZVO_UOg28J=wUDNdf3yw@mail.gmail.com>

On Tue, Aug 1, 2017 at 1:23 PM, Shawn Pearce <spearce@spearce.org> wrote:
> On Mon, Jul 31, 2017 at 11:41 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>> On Sun, Jul 30, 2017 at 8:51 PM, Shawn Pearce <spearce@spearce.org> wrote:
>>> 4th iteration of the reftable storage format.
>>> [...]
>>
>> Before we commit to Shawn's reftable proposal, I wanted to explore
>> what a contrasting design that is not block based would look like. So
>> I threw together a sketch of such a design. It is not as refined as
>> Shawn's, and I haven't had time to program an implementation or
>> collect statistics, but maybe it is interesting anyway.
>
> Thanks for taking the time to write this out. Its constructive to see
> other possible approaches.
>
> I roughly implemented your proposed design in JGit for references
> only. I skipped the object lookup and log handling for the sake of
> studying the basics of this approach.

Wow, that's awesome, thanks!

>        | size   | seek_cold | seek_hot  |
> mh     | 28.3 M | 24.5 usec | 14.5 usec |
> sp  4k | 29.2 M | 63.0 usec |  5.8 usec |
> sp 64k | 27.7 M | 35.6 usec | 23.3 usec |

OK, so it's at least in the right ballpark.

>> * Multiple related chunks can be stored in a 64 kiB block (interesting
>>   for people who prefer to read files in 64k segments). By storing
>>   related chunks near each other, the number of seeks will probably
>>   typically be smaller than the number of chunks that have to be
>>   accessed.
>
> This is difficult. The most related chunks are actually index chunks
> in the multi-level index. You want the next step(s) near the current
> step to avoid an actual disk seek. But that is hard to do when the
> index is several levels deep.

Given the 64k read size that you prefer, it seems to me that it would
probably be possible to keep pairs of index layers (i.e., one index
node and all of its direct children) in the same 64k range, though
this might require adjusting their sizes somewhat. And possibly to
keep the lowest index layer close to the reference chunks that it
describes (analogous to your format, where the restart records form
something like an index that is located close to the references). So I
would think that it is plausible to arrange things so that a reference
can be read within two 64k reads even if the index has three levels.

>> * The file can be written in two possible ways: with cross-references
>>   between chunks always pointing to later parts of the file (probably
>>   preferable if the data will be read from a local hard disk) or
>>   always pointing to earlier parts of the file (to accomodate writers
>>   who need to write their data in order). See the `offsets_negative`
>>   field in the header. (I'm not sure that the former variant is
>>   actually needed.)
>
> This seems like unnecessary complexity. I'd prefer to just say the
> offsets are negative, and reference backwards.

Maybe. I just wasn't sure whether a hard disk would perform
significantly worse when reading backwards rather than forwards
(because of pre-fetch of blocks). It could very well be unnecessary.

Michael

next prev parent reply	other threads:[~2017-08-02  0:49 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-07-31  3:51 reftable [v4]: new ref storage format Shawn Pearce
2017-07-31 17:41 ` Dave Borowitz
2017-07-31 19:01 ` Stefan Beller
2017-07-31 23:05   ` Shawn Pearce
2017-07-31 19:42 ` Junio C Hamano
2017-07-31 23:43   ` Shawn Pearce
2017-08-01 16:08     ` Shawn Pearce
2017-08-01  6:41 ` Michael Haggerty
2017-08-01 20:23   ` Shawn Pearce
2017-08-02  0:49     ` Michael Haggerty [this message]
2017-08-01 23:27   ` Shawn Pearce
2017-08-01 23:54     ` Shawn Pearce
2017-08-02  1:51     ` Michael Haggerty
2017-08-02  2:38       ` Shawn Pearce
2017-08-02  9:28         ` Jeff King
2017-08-02 15:17           ` Shawn Pearce
2017-08-02 16:51             ` Junio C Hamano
2017-08-02 17:28             ` Jeff King
2017-08-02 12:20         ` Dave Borowitz
2017-08-02 17:18           ` Jeff King
2017-08-03 18:38         ` Michael Haggerty
2017-08-03 22:26           ` Shawn Pearce
2017-08-03 22:48             ` Michael Haggerty
2017-08-04  2:50               ` Shawn Pearce
2017-08-05 21:00       ` Shawn Pearce
2017-08-01 13:54 ` Dave Borowitz
2017-08-01 15:27   ` Shawn Pearce
2017-08-02 19:50 ` Junio C Hamano
2017-08-02 20:28   ` Jeff King
2017-08-03 22:17     ` Shawn Pearce
2017-08-03  1:50   ` Junio C Hamano
2017-08-03  2:21     ` Shawn Pearce
2017-08-03  2:36       ` Junio C Hamano
2017-08-02 19:54 ` Stefan Beller

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAMy9T_FA1RV+NFxaXR65gw7G6OxL7Z8Ve3VNLrF4oyCdqtdahg@mail.gmail.com \
    --to=mhagger@alum.mit.edu \
    --cc=dborowitz@google.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=peff@peff.net \
    --cc=spearce@spearce.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).