Re: reftable [v4]: new ref storage format

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

From: Shawn Pearce <spearce@spearce.org>
To: Junio C Hamano <gitster@pobox.com>
Cc: git <git@vger.kernel.org>, Jeff King <peff@peff.net>,
	Michael Haggerty <mhagger@alum.mit.edu>,
	David Borowitz <dborowitz@google.com>
Subject: Re: reftable [v4]: new ref storage format
Date: Mon, 31 Jul 2017 16:43:07 -0700	[thread overview]
Message-ID: <CAJo=hJurMO=eQP3xctwTX9cO3yTZogJsw5HMztWjB8JHHtJ=fQ@mail.gmail.com> (raw)
In-Reply-To: <xmqqlgn4ieaw.fsf@gitster.mtv.corp.google.com>

On Mon, Jul 31, 2017 at 12:42 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Shawn Pearce <spearce@spearce.org> writes:
>
>> ### Peeling
>>
>> References in a reftable are always peeled.
>
> This hopefully means "a record for an annotated (or signed) tag
> records both the tag object and the object it refers to", and does
> not include peeling a commit down to its tree.

Yes, thanks for the clarification. I've added that to this section.


>> ### Reference name encoding
>>
>> Reference names should be encoded with UTF-8.
>
> If we limited ourselves to say that a refname is an uninterpreted
> sequence of bytes that must pass check_refname_format() test, mthen
> we won't open us to confusion such as "are two refs with the same
> Unicode name encoded with different normalization form considered
> the same?"

Ack. I'll reword this as bytes, and point to git-check-ref-format.

> In what way does this "should be in UTF-8" help describing the file
> format, I wonder.

In JGit we have a String object for a reference name. Dropping that
down to a sequence of bytes requires converting it with a character
encoding, such as US-ASCII or UTF-8.


>> ### First ref block
>>
>> The first ref block shares the same block as the file header, and is
>> 24 bytes smaller than all other blocks in the file.  The first block
>> immediately begins after the file header, at offset 24.
>>
>> If the first block is a log block (a log only table), its block header
>> begins immediately at offset 24.
>
> A minor nit: You called such a file "a log-only file"; let's be consistent.

Fixed. I found another that also was inconsistent. Thanks for the
attention to detail.


>> ### Ref block format
>>
>> A ref block is written as:
>>
>>     'r'
>>     uint24( block_len )
>>     ref_record+
>>     uint32( restart_offset )+
>>     uint16( restart_count )
>>     padding?
>>
>> Blocks begin with `block_type = 'r'` and a 3-byte `block_len` which
>> encodes the number of bytes in the block up to, but not including the
>> optional `padding`.  This is almost always shorter than the file's
>> `block_size`.  In the first ref block, `block_len` includes 24 bytes
>> for the file header.
>
> As a block cannot be longer than 16MB, allocating uint32 to a
> restart offset may be a bit overkill.  I do not know if it is worth
> attempting to pack 1/3 more restart entries in a block by using
> uint24, though.

I don't think its worth the annoyance in code that has to deal with
this field. An example 87k ref repository with a 4k block size has ~9
restarts per block and ~19 bytes of padding. Using uint24 saves us 9
bytes, yet the average ref here costs 27 bytes. We aren't likely to
fill the block with another ref, or another restart point.


>> #### ref record
[...]
>> The `value` follows.  Its format is determined by `value_type`, one of
>> the following:
>>
>> - `0x0`: deletion; no value data (see transactions, below)
>> - `0x1`: one 20-byte object id; value of the ref
>> - `0x2`: two 20-byte object ids; value of the ref, peeled target
>> - `0x3`: symref and text: `varint( text_len ) text`
>> - `0x4`: index record (see below)
>> - `0x5`: log record (see below)
[...]
>> Types `0x6..0x7` are reserved for future use.
>
> I wondered if we regret the apparent limited extensibility later,
> but giving 4 bits to value-type would limit suffix length that can
> be represented by a single varint() only to 7, while the format
> described would give us up to 15 bytes.  We can say type 0x7 would
> be followed by another varint() to record the extended type, or
> something, to extend it, so probably what you did here strikes a
> good balance.

Thanks. I think we actually have 0x4-0x7 available in value_type. I
used 0x4 and 0x5 as above only as a suggestion from Stefan to help
someone debug a reader. The block type differs where 0x0-0x3 and 0x4
and 0x5 are used, so there is already sufficient information available
to disambiguate the value_type such that we don't actually have to use
0x4 and 0x5 as I have here.

So I agree with myself, and with you, 3 bits for value_type and 4 bits
for suffix_len is the right balance.


>> ### Ref index
>> ...
>> An index block should only be written if there are at least 4 blocks
>> in the file, as cold reads using the index requires 2 disk reads (read
>> index, read block), and binary searching <= 4 blocks also requires <=
>> 2 reads.  Omitting the index block from smaller files saves space.
>
> I think the last "<= 4" should be "< 4" here.  That is consistent
> with an earlier part of the paragraph that requires at least 4
> ref-blocks in the file, because a reftable with only 3 ref-blocks
> still can be accessed with 2 reads (a reftable with 4 ref-blocks
> without index may need 3 reads as there is no "middle" for binary
> search).
>
> The first sentence should be "if there are at least 4 ref-blocks", I
> guess.

Thanks, clarified.


>> ### Obj block format
>>
>> Object blocks use unique, abbreviated 2-20 byte SHA-1 keys, mapping
>> to ref blocks containing references pointing to that object directly,
>> or as the peeled value of an annotated tag.  Like ref blocks, object
>> blocks use the file's standard `block_size`.
>>
>> To save space in small files, object blocks may be omitted if the ref
>> index is not present.  When missing readers should brute force a
>> linear search of all references to lookup by SHA-1.
>
> I want a comma after "When missing".

Added comma.

> It is a bit unclear why the presense of ref-index is linked to the
> presense and/or need of this block. I first thought that the reason
> is because the data in this table is to index into ref-index table,
> in which case of course it would not make sense to have this table
> if ref-index table is not present, but that is not the case. Am I
> correct to read the above advice/suggestion to mean "if the reftable
> has so few blocks not to require ref-index blocks, we hypothesize
> that it is not worth having obj-block table either"?

Yes.


>> #### obj record
>>
>> An `obj_record` describes a single object abbreviation, and the blocks
>> containing references using that unique abbreviation:
>>
>>     varint( prefix_length )
>>     varint( (suffix_length << 3) | cnt_3 )
>>     suffix
>>     varint( cnt_large )?
>>     varint( block_delta )+
>>
>> Like in reference blocks, abbreviations are prefix compressed within
>> an obj block.  On large reftables with many unique objects, higher
>> block sizes (64k), and higher restart interval (128), a
>> `prefix_length` of 2 or 3 and `suffix_length` of 3 may be common in
>> obj records (unique abbreviation of 5-6 raw bytes, 10-12 hex digits).
>
> OK.
>
>> Each record contains `block_count` number of block identifiers for ref
>> blocks.  For 1-7 blocks the block count is stored in `cnt_3`.  When
>> `cnt_3 = 0` the actual block count follows in a varint, `cnt_large`.
>
> I feel a bit lost here.  Is this about a single object pointed by
> multiple refs, and that we expect to have not too many refs pointing
> at a single object?

Yes, correct. I've added a paragraph to clarify:

  The use of `cnt_3` bets most objects are pointed to by only a single
  reference, some may be pointed to be a couple of references, and very
  few (if any) are pointed to by more than 7 references.


>> The first `block_delta` is the absolute block identifier counting from
>> the start of the file. The offset of that block can be obtained by
>> `block_delta[0] * block_size`.  Additional `block_delta` entries are
>> relative to the prior entry, e.g. a reader would perform:
>>
>>     block_id = block_delta[0]
>>     prior = block_id
>>     for (j = 1; j < block_count; j++) {
>>       block_id = prior + block_delta[j]
>>       prior = block_id
>>     }
>>
>> With a `block_id` in hand, a reader must linearly scan the ref block
>> at `block_id * block_size` offset in the file, starting from the first
>> `ref_record`, testing each reference's SHA-1s (for `value_type = 0x1`
>> or `0x2`) for full equality.  Faster searching by SHA-1 within a
>> single ref block is not supported by the reftable format.  Smaller
>> block sizes reduces the number of candidates this step must consider.
>
> Assuming varint() yields an unsigned quantity, the writer needs to
> sort the refs that point at the same object by their block numbers
> first and record from the smallest one to the larger ones?  Not a
> complaint, but just seeking help to make sure I understood it.

Yes. I added a "sorted ascending" to the paragraph to clarify.

  Additional `block_delta` entries are sorted ascending and relative
  to the prior entry, e.g.  a reader would perform:


>> ### Log block format
>>
>> Unlike ref and obj blocks, log block sizes are variable in size, and
>> do not match the `block_size` specified in the file header or footer.
>> Writers should choose an appropriate buffer size to prepare a log block
>> for deflation, such as `2 * block_size`.
>>
>> A log block is written as:
>>
>>     'g'
>>     uint24( block_len )
>>     zlib_deflate {
>>       log_record+
>>       int32( restart_offset )+
>>       int16( restart_count )
>>     }
>>
>> Log blocks look similar to ref blocks, except `block_type = 'g'`.
>>
>> The 4-byte block header is followed by the deflated block contents
>> using zlib deflate.  The `block_len` in the header is the inflated
>> size (including 4-byte block header), and should be used by readers to
>> preallocate the inflation output buffer.  Offsets within the block
>> (e.g.  `restart_offset`) still include the 4-byte header.  Readers may
>> prefer prefixing the inflation output buffer with the 4-byte header.
>
> Is block_len allowed to exceed the file-global block_size?

Yes, clarified that with an additional sentence.


>> #### log record
>>
>> Log record keys are structured as:
>>
>>     ref_name '\0' reverse_int64( update_index )
>>
>> where `update_index` is the unique transaction identifier.  The
>> `update_index` field must be unique within the scope of a `ref_name`.
>> See the update index section below for further details.
>>
>> The `reverse_int64` function inverses the value so lexographical
>> ordering the network byte order encoding sorts the more recent records
>> with higher `update_index` values first:
>>
>>     reverse_int64(int64 t) {
>>       return 0xffffffffffffffff - t;
>>     }
>
> OK, so this no longer is linked to timestamp, which makes things
> simpler.
>
> All log records in a single reftable is sorted with the log record
> key, so the reflog entries for a specific ref are adjacent to each
> other and are ordered in reverse "chronological" order, assuming
> that the update_index transaction numbers monotonically increase?

Correct.

>> The value data following the key suffix is complex:
>>
>> - two 20-byte SHA-1s (old id, new id)
>> - varint time in seconds since epoch (Jan 1, 1970)
>> - 2-byte timezone offset (signed)
>
> "offset in minutes"

Fixed.

>> - varint string of committer's name
>> - varint string of committer's email
>
> We might want to clarify that this is without surrounding <>.

Good idea, added.


>> #### Importing logs
>>
>> When importing from `$GIT_DIR/logs` writers should globally order all
>> log records roughly by timestamp while preserving file order, and
>> assign unique, increasing `update_index` values for each log line.
>
> I am not quite sure why you need global order (I am assuming that
> you mean "consistency across multiple logs, e.g. $GIT_DIR/logs/HEAD
> and $GIT_DIR/logs/refs/heads/master"), as the resulting log table
> sorts the entries essentially with refname as the first key and then
> recentness of the entry as the second key.  Wouldn't the resulting
> log table in a reftable be sorted in the same way as
>
>     cd $GIT_DIR/logs &&
>     find -type f -print |
>     sort |
>     xargs -n 1 tac
>
> anyway?
>
> ... Ah, you want to give the same (or at least close-enough)
> transaction ID to two reflog entries that result from a commit while
> 'master' branch is checked out, one for 'refs/heads/master' and the
> other for HEAD.  Then the suggestion makes sense.  I wonder if the
> existing log records that are migrated are known to have timestamps
> that fit in int64_t, using the timestamp from the original is
> sufficient?
>
> ... The answer is no; if the original records clock rewind, you'd
> still want to assign the update_index number that is in line with
> the order of the entries, not with the skewed clock value.
>
> OK.

Correct. :)


>> ## Repository format
>>
>> ### Version 1
>>
>> A repository must set its `$GIT_DIR/config` to configure reftable:
>>
>>     [core]
>>         repositoryformatversion = 1
>>     [extensions]
>>         reftable = 1
>
> The expectation is that this number matches the version number
> recorded in the reftable itself?

Honestly, I'm not sure why its "1" and not "true". I was asking myself
that yesterday before I posted this iteration to the list. I think we
can use "true" here.

next prev parent reply	other threads:[~2017-07-31 23:43 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-07-31  3:51 reftable [v4]: new ref storage format Shawn Pearce
2017-07-31 17:41 ` Dave Borowitz
2017-07-31 19:01 ` Stefan Beller
2017-07-31 23:05   ` Shawn Pearce
2017-07-31 19:42 ` Junio C Hamano
2017-07-31 23:43   ` Shawn Pearce [this message]
2017-08-01 16:08     ` Shawn Pearce
2017-08-01  6:41 ` Michael Haggerty
2017-08-01 20:23   ` Shawn Pearce
2017-08-02  0:49     ` Michael Haggerty
2017-08-01 23:27   ` Shawn Pearce
2017-08-01 23:54     ` Shawn Pearce
2017-08-02  1:51     ` Michael Haggerty
2017-08-02  2:38       ` Shawn Pearce
2017-08-02  9:28         ` Jeff King
2017-08-02 15:17           ` Shawn Pearce
2017-08-02 16:51             ` Junio C Hamano
2017-08-02 17:28             ` Jeff King
2017-08-02 12:20         ` Dave Borowitz
2017-08-02 17:18           ` Jeff King
2017-08-03 18:38         ` Michael Haggerty
2017-08-03 22:26           ` Shawn Pearce
2017-08-03 22:48             ` Michael Haggerty
2017-08-04  2:50               ` Shawn Pearce
2017-08-05 21:00       ` Shawn Pearce
2017-08-01 13:54 ` Dave Borowitz
2017-08-01 15:27   ` Shawn Pearce
2017-08-02 19:50 ` Junio C Hamano
2017-08-02 20:28   ` Jeff King
2017-08-03 22:17     ` Shawn Pearce
2017-08-03  1:50   ` Junio C Hamano
2017-08-03  2:21     ` Shawn Pearce
2017-08-03  2:36       ` Junio C Hamano
2017-08-02 19:54 ` Stefan Beller

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAJo=hJurMO=eQP3xctwTX9cO3yTZogJsw5HMztWjB8JHHtJ=fQ@mail.gmail.com' \
    --to=spearce@spearce.org \
    --cc=dborowitz@google.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=mhagger@alum.mit.edu \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).