Re: reftable [v4]: new ref storage format

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* Re: reftable [v4]: new ref storage format
@ 2017-07-31  3:51 Shawn Pearce
  2017-07-31 17:41 ` Dave Borowitz
                   ` (6 more replies)
  0 siblings, 7 replies; 34+ messages in thread
From: Shawn Pearce @ 2017-07-31  3:51 UTC (permalink / raw)
  To: git, Jeff King, Michael Haggerty; +Cc: Junio C Hamano, David Borowitz

4th iteration of the reftable storage format.

You can read a rendered version of this here:
https://googlers.googlesource.com/sop/jgit/+/reftable/Documentation/technical/reftable.md

Significant changes from v3:
- Incorporated Michael Haggerty's update_index concept for reflog.
- Explicitly documented log-only tables.

- Magic changed to 'REFT'
- Symlinks now use type 0x3 with "ref: " prefix.
- Ref-like files (FETCH_HEAD, MERGE_HEAD) also use type 0x3.
- restart_count simplified, obj block_count simplified

# reftable

[TOC]

## Overview

### Problem statement

Some repositories contain a lot of references (e.g.  android at 866k,
rails at 31k).  The existing packed-refs format takes up a lot of
space (e.g.  62M), and does not scale with additional references.
Lookup of a single reference requires linearly scanning the file.

Atomic pushes modifying multiple references require copying the
entire packed-refs file, which can be a considerable amount of data
moved (e.g. 62M in, 62M out) for even small transactions (2 refs
modified).

Repositories with many loose references occupy a large number of disk
blocks from the local file system, as each reference is its own file
storing 41 bytes (and another file for the corresponding reflog).
This negatively affects the number of inodes available when a large
number of repositories are stored on the same filesystem.  Readers can
be penalized due to the larger number of syscalls required to traverse
and read the `$GIT_DIR/refs` directory.

### Objectives

- Near constant time lookup for any single reference, even when the
  repository is cold and not in process or kernel cache.
- Near constant time verification a SHA-1 is referred to by at least
  one reference (for allow-tip-sha1-in-want).
- Efficient lookup of an entire namespace, such as `refs/tags/`.
- Support atomic push with `O(size_of_update)` operations.
- Combine reflog storage with ref storage for small transactions.
- Separate reflog storage for base refs and historical logs.

### Description

A reftable file is a portable binary file format customized for
reference storage. References are sorted, enabling linear scans,
binary search lookup, and range scans.

Storage in the file is organized into blocks.  Prefix compression
is used within a single block to reduce disk space.  Block size is
tunable by the writer.

### Performance

Space used, packed-refs vs. reftable:

repository | packed-refs | reftable | % original | avg ref  | avg obj
-----------|------------:|---------:|-----------:|---------:|--------:
android    |      62.2 M |   34.4 M |     55.2%  | 33 bytes | 8 bytes
rails      |       1.8 M |    1.1 M |     57.7%  | 29 bytes | 6 bytes
git        |      78.7 K |   44.0 K |     60.0%  | 50 bytes | 6 bytes
git (heads)|       332 b |    276 b |     83.1%  | 34 bytes | 0 bytes

Scan (read 866k refs), by reference name lookup (single ref from 866k
refs), and by SHA-1 lookup (refs with that SHA-1, from 866k refs):

format      | cache | scan    | by name        | by SHA-1
------------|------:|--------:|---------------:|---------------:
packed-refs | cold  |  402 ms | 409,660.1 usec | 412,535.8 usec
packed-refs | hot   |         |   6,844.6 usec |  20,110.1 usec
reftable    | cold  |  112 ms |      33.9 usec |     323.2 usec
reftable    | hot   |         |      20.2 usec |     320.8 usec

Space used for 149,932 log entries for 43,061 refs,
reflog vs. reftable:

format        | size  | avg entry
--------------|------:|-----------:
$GIT_DIR/logs | 173 M | 1209 bytes
reftable      |   5 M |   37 bytes

## Details

### Peeling

References in a reftable are always peeled.

### Reference name encoding

Reference names should be encoded with UTF-8.

### Network byte order

All multi-byte, fixed width fields are in network byte order.

### Ordering

Blocks are lexicographically ordered by their first reference.

### Directory/file conflicts

The reftable format accepts both `refs/heads/foo` and
`refs/heads/foo/bar` as distinct references.

This property is useful for retaining log records in reftable, but may
confuse versions of Git using `$GIT_DIR/refs` directory tree to
maintain references.  Users of reftable may choose to continue to
reject `foo` and `foo/bar` type conflicts to prevent problems for
peers.

## File format

### Structure

A reftable file has the following high-level structure:

    first_block {
      header
      first_ref_block
    }
    ref_blocks*
    ref_index?
    obj_blocks*
    obj_index?
    log_blocks*
    log_index?
    footer

A log-only file omits the `ref_blocks`, `ref_index`, `obj_blocks` and
`obj_index` sections, containing only the file header and log blocks:

    first_block {
      header
    }
    log_blocks*
    log_index?
    footer

in a log-only file the first log block immediately follows the file
header, without padding to block alignment.

### Block size

The `block_size` is arbitrarily determined by the writer, and does not
have to be a power of 2.  The block size must be larger than the
longest reference name or log entry used in the repository, as
references cannot span blocks.

Powers of two that are friendly to the virtual memory system or
filesystem (such as 4k or 8k) are recommended.  Larger sizes (64k) can
yield better compression, with a possible increased cost incurred by
readers during access.

The largest block size is `16777215` bytes (15.99 MiB).

### Header

A 24-byte header appears at the beginning of the file:

    'REFT'
    uint8( version_number = 1 )
    uint24( block_size )
    uint64( min_update_index )
    uint64( max_update_index )

The `min_update_index` and `max_update_index` describe bounds for the
`update_index` field of all log records in this file.  When reftables
are used in a stack for transactions (see below), these fields can
order the files such that the prior file's `max_update_index + 1` is
the next file's `min_update_index`.

### First ref block

The first ref block shares the same block as the file header, and is
24 bytes smaller than all other blocks in the file.  The first block
immediately begins after the file header, at offset 24.

If the first block is a log block (a log only table), its block header
begins immediately at offset 24.

### Ref block format

A ref block is written as:

    'r'
    uint24( block_len )
    ref_record+
    uint32( restart_offset )+
    uint16( restart_count )
    padding?

Blocks begin with `block_type = 'r'` and a 3-byte `block_len` which
encodes the number of bytes in the block up to, but not including the
optional `padding`.  This is almost always shorter than the file's
`block_size`.  In the first ref block, `block_len` includes 24 bytes
for the file header.

The 4-byte block header is followed by a variable number of
`ref_record`, describing reference names and values.  The format
is described below.

A variable number of 4-byte `restart_offset` values follow the
records.  Offsets are relative to the start of the block and refer to
the first byte of any `ref_record` whose name has not been prefix
compressed.  Entries in the `restart_offset` list must be sorted,
ascending.  Readers can start linear scans from any of these records.

As the first ref block shares the first file block with the file
header, offsets in the first block are relative to the start of the
file (position 0), and include the file header.  This requires the
first restart in the first block to be at offset 24.  Restarts in
subsequent ref blocks are relative to the start of the ref block.

The 2-byte `restart_count` stores the number of entries in the
`restart_offset` list, which must not be empty.

Readers can use the restart count to binary search between restarts
before starting a linear scan.  The `restart_count` field must be
the last 2 bytes of the block as specified by `block_len` from the
block header.

The end of the block may be filled with `padding` NUL bytes to fill
out the block to the common `block_size` as specified in the file
header.  Padding may be necessary to ensure the following block starts
at a block alignment, and does not spill into the tail of this block.
Padding may be omitted if the block is the last block of the file, and
there is no index block.  This allows reftable to efficiently scale
down to a small number of refs.

#### ref record

A `ref_record` describes a single reference, storing both the name and
its value(s). Records are formatted as:

    varint( prefix_length )
    varint( (suffix_length << 3) | value_type )
    suffix
    value?

The `prefix_length` field specifies how many leading bytes of the
prior reference record's name should be copied to obtain this
reference's name.  This must be 0 for the first reference in any
block, and also must be 0 for any `ref_record` whose offset is listed
in the `restart_offset` table at the end of the block.

Recovering a reference name from any `ref_record` is a simple concat:

    this_name = prior_name[0..prefix_length] + suffix

The `suffix_length` value provides the number of bytes to copy from
`suffix` to complete the reference name.

The `value` follows.  Its format is determined by `value_type`, one of
the following:

- `0x0`: deletion; no value data (see transactions, below)
- `0x1`: one 20-byte object id; value of the ref
- `0x2`: two 20-byte object ids; value of the ref, peeled target
- `0x3`: symref and text: `varint( text_len ) text`
- `0x4`: index record (see below)
- `0x5`: log record (see below)

Symbolic references use `0x3` with a `text` string starting with `"ref: "`,
followed by the complete name of the reference target.  No
compression is applied to the target name.  Other types of contents
that are also reference like, such as `FETCH_HEAD` and `MERGE_HEAD`,
may also be stored using type `0x3`.

Types `0x6..0x7` are reserved for future use.

### Ref index

The ref index stores the name of the last reference from every ref
block in the file, enabling constant O(1) disk seeks for all lookups.
Any reference can be found by searching the index, identifying the
containing block, and searching within that block.

If present the ref index block appears after the last ref block.  The
prior ref block should be padded to ensure the ref index starts on a
block alignment.

An index block should only be written if there are at least 4 blocks
in the file, as cold reads using the index requires 2 disk reads (read
index, read block), and binary searching <= 4 blocks also requires <=
2 reads.  Omitting the index block from smaller files saves space.

Index block format:

    uint32( (0x80 << 24) | block_len )
    index_record+
    uint32( restart_offset )+
    uint16( restart_count )
    padding?

The index block header starts with the high bit set.  This identifies
the block as an index block, and not as a ref block, log block or file
footer.  The `block_len` field in an index block is 30-bits network
byte order, and allowed to occupy space normally used by the block
type in other blocks.  This supports indexes significantly larger than
the file's `block_size`.

The `restart_offset` and `restart_count` fields are identical in
format, meaning and usage as in ref blocks.

To reduce the number of reads required for random access in very large
files, the index block may be larger than the other blocks.  However,
readers must hold the entire index in memory to benefit from this, so
it's a time-space tradeoff in both file size and reader memory.
Increasing the block size decreases the index size.

When object blocks are present the ref index block is padded with
`padding` to maintain alignment for the next block. No padding is
necessary if log blocks or the file trailer follows the ref index.

#### index record

An index record describes the last entry in another block.
Index records are written as:

    varint( prefix_length )
    varint( (suffix_length << 3) | 0x4 )
    suffix
    varint( block_position )

Index records use prefix compression exactly like `ref_record`.

Index records store `block_position` after the suffix, specifying the
absolute position in bytes (from the start of the file) of the block
that ends with this reference. Readers can seek to `block_position` to
begin reading the block header.

#### Reading the index

Readers loading the ref index must first read the footer (below) to
obtain `ref_index_offset`. If not present, the offset will be 0.

### Obj block format

Object blocks use unique, abbreviated 2-20 byte SHA-1 keys, mapping
to ref blocks containing references pointing to that object directly,
or as the peeled value of an annotated tag.  Like ref blocks, object
blocks use the file's standard `block_size`.

To save space in small files, object blocks may be omitted if the ref
index is not present.  When missing readers should brute force a
linear search of all references to lookup by SHA-1.

An object block is written as:

    'o'
    uint24( block_len )
    obj_record+
    uint32( restart_offset )+
    uint16( restart_count )
    padding?

Fields are identical to ref block.  Binary search using the restart
table works the same as in reference blocks.

Because object identifiers are abbreviated by writers to the shortest
unique abbreviation within the reftable, obj key lengths are variable
between 2 and 20 bytes.  Readers must compare only for common prefix
match within an obj block or obj index.

Object blocks should be block aligned, according to `block_size` from
the file header.  The `padding` field is filled with NULs to maintain
alignment for the next block.

#### obj record

An `obj_record` describes a single object abbreviation, and the blocks
containing references using that unique abbreviation:

    varint( prefix_length )
    varint( (suffix_length << 3) | cnt_3 )
    suffix
    varint( cnt_large )?
    varint( block_delta )+

Like in reference blocks, abbreviations are prefix compressed within
an obj block.  On large reftables with many unique objects, higher
block sizes (64k), and higher restart interval (128), a
`prefix_length` of 2 or 3 and `suffix_length` of 3 may be common in
obj records (unique abbreviation of 5-6 raw bytes, 10-12 hex digits).

Each record contains `block_count` number of block identifiers for ref
blocks.  For 1-7 blocks the block count is stored in `cnt_3`.  When
`cnt_3 = 0` the actual block count follows in a varint, `cnt_large`.

The first `block_delta` is the absolute block identifier counting from
the start of the file. The offset of that block can be obtained by
`block_delta[0] * block_size`.  Additional `block_delta` entries are
relative to the prior entry, e.g. a reader would perform:

    block_id = block_delta[0]
    prior = block_id
    for (j = 1; j < block_count; j++) {
      block_id = prior + block_delta[j]
      prior = block_id
    }

With a `block_id` in hand, a reader must linearly scan the ref block
at `block_id * block_size` offset in the file, starting from the first
`ref_record`, testing each reference's SHA-1s (for `value_type = 0x1`
or `0x2`) for full equality.  Faster searching by SHA-1 within a
single ref block is not supported by the reftable format.  Smaller
block sizes reduces the number of candidates this step must consider.

### Obj index

The obj index stores the abbreviation from the last entry for every
obj block in the file, enabling constant O(1) disk seeks for all
lookups.  It is formatted exactly the same as the ref index, but
refers to obj blocks.

The obj index should be present if obj blocks are present, as
obj blocks should only be written in larger files.

The obj index should be block aligned, according to `block_size` from
the file header.  This requires padding the last obj block to maintain
alignment.

Readers loading the obj index must first read the footer (below) to
obtain `obj_index_offset`.  If not present, the offset will be 0.

### Log block format

Unlike ref and obj blocks, log block sizes are variable in size, and
do not match the `block_size` specified in the file header or footer.
Writers should choose an appropriate buffer size to prepare a log block
for deflation, such as `2 * block_size`.

A log block is written as:

    'g'
    uint24( block_len )
    zlib_deflate {
      log_record+
      int32( restart_offset )+
      int16( restart_count )
    }

Log blocks look similar to ref blocks, except `block_type = 'g'`.

The 4-byte block header is followed by the deflated block contents
using zlib deflate.  The `block_len` in the header is the inflated
size (including 4-byte block header), and should be used by readers to
preallocate the inflation output buffer.  Offsets within the block
(e.g.  `restart_offset`) still include the 4-byte header.  Readers may
prefer prefixing the inflation output buffer with the 4-byte header.

Within the deflate container, a variable number of `log_record`
describe reference changes.  The log record format is described
below.  See ref block format (above) for a description of
`restart_offset` and `restart_count`.

Unlike ref blocks, log blocks are written at any alignment, without
padding.  The first log block immediately follows the end of the prior
block, which omits its trailing padding.  In very small files the log
block may appear in the first block.

Because log blocks have no alignment or padding between blocks,
readers must keep track of the bytes consumed by the inflater to
know where the next log block begins.

#### log record

Log record keys are structured as:

    ref_name '\0' reverse_int64( update_index )

where `update_index` is the unique transaction identifier.  The
`update_index` field must be unique within the scope of a `ref_name`.
See the update index section below for further details.

The `reverse_int64` function inverses the value so lexographical
ordering the network byte order encoding sorts the more recent records
with higher `update_index` values first:

    reverse_int64(int64 t) {
      return 0xffffffffffffffff - t;
    }

Log records have a similar starting structure to ref and index
records, utilizing the same prefix compression scheme applied to the
log record key described above.

```
    varint( prefix_length )
    varint( (suffix_length << 3) | 0x5 )
    suffix

    old_id
    new_id
    varint( time_seconds )
    sint16( tz_offset )
    varint( name_length    )  name
    varint( email_length   )  email
    varint( message_length )  message
```

The value data following the key suffix is complex:

- two 20-byte SHA-1s (old id, new id)
- varint time in seconds since epoch (Jan 1, 1970)
- 2-byte timezone offset (signed)
- varint string of committer's name
- varint string of committer's email
- varint string of message

`tz_offset` is the absolute number of minutes from GMT the committer
was at the time of the update.  For example `GMT-0800` is encoded in
reftable as `sint16(-480)` and `GMT+0230` is `sint16(150)`.

The `message_length` may be 0, in which case there was no message
supplied for the update.

#### Reading the log

Readers accessing the log must first read the footer (below) to
determine the `log_offset`.  The first block of the log begins at
`log_offset` bytes since the start of the file.  The `log_offset` is
not block aligned.

#### Importing logs

When importing from `$GIT_DIR/logs` writers should globally order all
log records roughly by timestamp while preserving file order, and
assign unique, increasing `update_index` values for each log line.

### Log index

The log index stores the log key (`refname \0 reverse_int64(update_index)`)
for the last log record of every log block in the file, supporting
bounded-time lookup.

A log index block must be written if 2 or more log blocks are written
to the file.  If present, the log index appears after the last log
block.  There is no padding used to align the log index to block
alignment.

Log index format is identical to ref index, except the keys are 9
bytes longer to include `'\0'` and the 8-byte
`reverse_int64(update_index)`.  Records use `block_position` to
refer to the start of a log block.

#### Reading the index

Readers loading the log index must first read the footer (below) to
obtain `log_index_offset`. If not present, the offset will be 0.

### Footer

After the last block of the file, a file footer is written.  It begins
like the file header, but is extended with additional data.

A 68-byte footer appears at the end:

```
    'REFT'
    uint8( version_number = 1 )
    uint24( block_size )
    uint64( min_update_index )
    uint64( max_update_index )

    uint64( ref_index_offset )
    uint64( obj_offset )
    uint64( obj_index_offset )

    uint64( log_offset )
    uint64( log_index_offset )

    uint32( CRC-32 of above )
```

If a section is missing (e.g. ref index) the corresponding offset
field (e.g. `ref_index_offset`) will be 0.

- `obj_offset`: byte offset for the first obj block.
- `log_offset`: byte offset for the first log block.
- `ref_index_offset`: byte offset for the start of the ref index.
- `obj_index_offset`: byte offset for the start of the obj index.
- `log_index_offset`: byte offset for the start of the log index.

#### Reading the footer

Readers must seek to `file_length - 68` to access the footer.  A
trusted external source (such as `stat(2)`) is necessary to obtain
`file_length`.  When reading the footer, readers must verify:

- 4-byte magic is correct
- 1-byte version number is recognized
- 4-byte CRC-32 matches the other 64 bytes (including magic, and version)

Once verified, the other fields of the footer can be accessed.

### Varint encoding

Varint encoding is identical to the ofs-delta encoding method used
within pack files.

Decoder works such as:

    val = buf[ptr] & 0x7f
    while (buf[ptr] & 0x80) {
      ptr++
      val = ((val + 1) << 7) | (buf[ptr] & 0x7f)
    }

### Binary search

Binary search within a block is supported by the `restart_offset`
fields at the end of the block.  Readers can binary search through the
restart table to locate between which two restart points the sought
reference or key should appear.

Each record identified by a `restart_offset` stores the complete key
in the `suffix` field of the record, making the compare operation
during binary search straightforward.

Once a restart point lexicographically before the sought reference has
been identified, readers can linearly scan through the following
record entries to locate the sought record, terminating if the current
record sorts after (and therefore the sought key is not present).

#### Restart point selection

Writers determine the restart points at file creation.  The process is
arbitrary, but every 16 or 64 records is recommended.  Every 16 may
be more suitable for smaller block sizes (4k or 8k), every 64 for
larger block sizes (64k).

More frequent restart points reduces prefix compression and increases
space consumed by the restart table, both of which increase file size.

Less frequent restart points makes prefix compression more effective,
decreasing overall file size, with increased penalities for readers
walking through more records after the binary search step.

A maximum of `65535` restart points per block is supported.

## Considerations

### Lightweight refs dominate

The reftable format assumes the vast majority of references are single
SHA-1 valued with common prefixes, such as Gerrit Code Review's
`refs/changes/` namespace, GitHub's `refs/pulls/` namespace, or many
lightweight tags in the `refs/tags/` namespace.

Annotated tags storing the peeled object cost only an additional 20
bytes per reference.

### Low overhead

A reftable with very few references (e.g. git.git with 5 heads)
is 276 bytes for reftable, vs. 332 bytes for packed-refs.  This
supports reftable scaling down for transaction logs (below).

### Block size

For a Gerrit Code Review type repository with many change refs, larger
block sizes (64 KiB) and less frequent restart points (every 64) yield
better compression due to more references within the block compressing
against the prior reference.

Larger block sizes reduces the index size, as the reftable will
require fewer blocks to store the same number of references.

### Minimal disk seeks

Assuming the index block has been loaded into memory, binary searching
for any single reference requires exactly 1 disk seek to load the
containing block.

### Scans and lookups dominate

Scanning all references and lookup by name (or namespace such as
`refs/heads/`) are the most common activities performed by repositories.
SHA-1s are stored twice when obj blocks are present, avoiding disk
seeks for the common cases of scan and lookup by name.

### Logs are infrequently read

Logs are infrequently accessed, but can be large.  Deflating log
blocks saves disk space, with some increased penalty at read time.

Logs are stored in an isolated section from refs, reducing the burden
on reference readers that want to ignore logs.  Further, historical
logs can be isolated into log-only reftables.

### Logs are read backwards

Logs are frequently accessed backwards (most recent N records for
master to answer `master@{4}`), so log records are grouped by
reference, and sorted descending by update index.

## Repository format

### Version 1

A repository must set its `$GIT_DIR/config` to configure reftable:

    [core]
        repositoryformatversion = 1
    [extensions]
        reftable = 1

### Layout

The `$GIT_DIR/refs` path is a file when reftable is configured, not a
directory.  This prevents loose references from being stored.

A collection of reftable files are stored in the `$GIT_DIR/reftable/`
directory:

    00000001_UF4paF
    00000002_bUVgy4

where reftable files are named by a unique name such as produced by
the function:

    mktemp "${update_index}_XXXXXX"

The stack ordering file is `$GIT_DIR/refs` and lists the current
files, one per line, in order, from oldest (base) to newest (most
recent):

    $ cat .git/refs
    00000001_UF4paF
    00000002_bUVgy4

Readers must read `$GIT_DIR/refs` to determine which files are
relevant right now, and search through the stack in reverse order
(last reftable is examined first).

Reftable files not listed in `refs` may be new (and about to be added
to the stack by the active writer), or ancient and ready to be pruned.

### Update transactions

Although reftables are immutable, mutations are supported by writing a
new reftable and atomically appending it to the stack:

1. Acquire `refs.lock`.
2. Read `refs` to determine current reftables.
3. Select `update_index` to be most recent file's `max_update_index + 1`.
4. Prepare new reftable `${update_index}_XXXXXX`, including log entries.
5. Copy `refs` to `refs.lock`, appending file from (4).
6. Rename `refs.lock` to `refs`.

During step 4 the new file's `min_update_index` and `max_update_index`
are both set to the `update_index` selected by step 3.  All log
records for the transaction use the same `update_index` in their keys.
This enables later correlation of which references were updated by the
same transaction.

Because a single `refs.lock` file is used to manage locking, the
repository is single-threaded for writers.  Writers may have to
busy-spin (with backoff) around creating `refs.lock`, for up to an
acceptable wait period, aborting if the repository is too busy to
mutate.  Application servers wrapped around repositories (e.g.  Gerrit
Code Review) can layer their own lock/wait queue to improve fairness
to writers.

### Reference deletions

Deletion of any reference can be explicitly stored by setting the
`type` to `0x0` and omitting the `value` field of the `ref_record`.
This entry shadows the reference in earlier files in the stack.

### Compaction

A partial stack of reftables can be compacted by merging references
using a straightforward merge join across reftables, selecting the
most recent value for output, and omitting deleted references that do
not appear in remaining, lower reftables.

A compacted reftable should set its `min_update_index` to the smallest of
the input files' `min_update_index`, and its `max_update_index`
likewise to the largest input `max_update_index`.

For sake of illustration, assume the stack currently consists of
reftable files (from oldest to newest): A, B, C, and D. The compactor
is going to compact B and C, leaving A and D alone.

1.  Obtain lock `refs.lock` and read the `refs` file.
2.  Obtain locks `B.lock` and `C.lock`.
    Ownership of these locks prevents other processes from trying
    to compact these files.
3.  Release `refs.lock`.
4.  Compact `B` and `C` into a new file `${min_update_index}_XXXXXX`.
5.  Reacquire lock `refs.lock`.
6.  Verify that `B` and `C` are still in the stack, in that order. This
    should always be the case, assuming that other processes are adhering
    to the locking protocol.
7.  Write the new stack to `refs.lock`, replacing `B` and `C` with the
    file from (4).
8.  Rename `refs.lock` to `refs`.
9.  Delete `B` and `C`, perhaps after a short sleep to avoid forcing
    readers to backtrack.

This strategy permits compactions to proceed independently of updates.

## Alternatives considered

### bzip packed-refs

`bzip2` can significantly shrink a large packed-refs file (e.g. 62
MiB compresses to 23 MiB, 37%).  However the bzip format does not support
random access to a single reference. Readers must inflate and discard
while performing a linear scan.

Breaking packed-refs into chunks (individually compressing each chunk)
would reduce the amount of data a reader must inflate, but still
leaves the problem of indexing chunks to support readers efficiently
locating the correct chunk.

Given the compression achieved by reftable's encoding, it does not
seem necessary to add the complexity of bzip/gzip/zlib.

### JGit Ketch RefTree

[JGit Ketch][ketch] proposed [RefTree][reftree], an encoding of
references inside Git tree objects stored as part of the repository's
object database.

The RefTree format adds additional load on the object database storage
layer (more loose objects, more objects in packs), and relies heavily
on the packer's delta compression to save space.  Namespaces which are
flat (e.g.  thousands of tags in refs/tags) initially create very
large loose objects, and so RefTree does not address the problem of
copying many references to modify a handful.

Flat namespaces are not efficiently searchable in RefTree, as tree
objects in canonical formatting cannot be binary searched. This fails
the need to handle a large number of references in a single namespace,
such as GitHub's `refs/pulls`, or a project with many tags.

[ketch]: https://dev.eclipse.org/mhonarc/lists/jgit-dev/msg03073.html
[reftree]: https://public-inbox.org/git/CAJo=hJvnAPNAdDcAAwAvU9C4RVeQdoS3Ev9WTguHx4fD0V_nOg@mail.gmail.com/

### LMDB

David Turner proposed [using LMDB][dt-lmdb], as LMDB is lightweight
(64k of runtime code) and GPL-compatible license.

A downside of LMDB is its reliance on a single C implementation.  This
makes embedding inside JGit (a popular reimplemenation of Git)
difficult, and hoisting onto virtual storage (for JGit DFS) virtually
impossible.

A common format that can be supported by all major Git implementations
(git-core, JGit, libgit2) is strongly preferred.

[dt-lmdb]: https://public-inbox.org/git/1455772670-21142-26-git-send-email-dturner@twopensource.com/

## Future

### Longer hashes

Version will bump (e.g.  2) to indicate `value` uses a different
object id length other than 20.  The length could be stored in an
expanded file header, or hardcoded as part of the version.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: reftable [v4]: new ref storage format
  2017-07-31  3:51 reftable [v4]: new ref storage format Shawn Pearce
@ 2017-07-31 17:41 ` Dave Borowitz
  2017-07-31 19:01 ` Stefan Beller
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 34+ messages in thread
From: Dave Borowitz @ 2017-07-31 17:41 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git, Jeff King, Michael Haggerty, Junio C Hamano

On Sun, Jul 30, 2017 at 11:51 PM, Shawn Pearce <spearce@spearce.org> wrote:
> - Near constant time verification a SHA-1 is referred to by at least
>   one reference (for allow-tip-sha1-in-want).

I think I understated the importance of this when I originally brought
up allow-tip-sha1-in-want. This is an important optimization for *any*
HTTP server, even without allow-tip-sha1-in-want, in order to validate
the SHA-1s sent in the upload-pack request, which doesn't share memory
state with the /info/refs request processing.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: reftable [v4]: new ref storage format
  2017-07-31  3:51 reftable [v4]: new ref storage format Shawn Pearce
  2017-07-31 17:41 ` Dave Borowitz
@ 2017-07-31 19:01 ` Stefan Beller
  2017-07-31 23:05   ` Shawn Pearce
  2017-07-31 19:42 ` Junio C Hamano
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 34+ messages in thread
From: Stefan Beller @ 2017-07-31 19:01 UTC (permalink / raw)
  To: Shawn Pearce
  Cc: git, Jeff King, Michael Haggerty, Junio C Hamano, David Borowitz

On Sun, Jul 30, 2017 at 8:51 PM, Shawn Pearce <spearce@spearce.org> wrote:
> 4th iteration of the reftable storage format.
>
> You can read a rendered version of this here:
> https://googlers.googlesource.com/sop/jgit/+/reftable/Documentation/technical/reftable.md
>
> Significant changes from v3:
> - Incorporated Michael Haggerty's update_index concept for reflog.
> - Explicitly documented log-only tables.

I have read through v4 and I was missing the rationale
to why this is a good idea, digging up the discussion for
v3 seems to indicate that reflogs and refs themselves have
different usage patterns such that different compaction patterns
are desired, hence we need to have different files for them.

> ### Ref block format
>
> A ref block is written as:
>
>     'r'
>     uint24( block_len )
>     ref_record+
>     uint32( restart_offset )+

As the block_size is encoded in uint24, (and so is block_len),
the restart offsets could be made uint24 as well, though that may
have alignment issues, such that reading may be slower.
But as the ref_records may produce unaligned 32 bit ints
already, I am not worried about that.

>     uint16( restart_count )

When looking for 16/32/64 bit hard coded ints I came across
this one once again. How much more expensive is reading
a varint? As the block_len points at the restart_count, we cannot
replace it with a varint easily, but we could use a byte-reversed
varint instead. If we do this step, all restart offsets could also be
(reverse) varints?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: reftable [v4]: new ref storage format
  2017-07-31  3:51 reftable [v4]: new ref storage format Shawn Pearce
  2017-07-31 17:41 ` Dave Borowitz
  2017-07-31 19:01 ` Stefan Beller
@ 2017-07-31 19:42 ` Junio C Hamano
  2017-07-31 23:43   ` Shawn Pearce
  2017-08-01  6:41 ` Michael Haggerty
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 34+ messages in thread
From: Junio C Hamano @ 2017-07-31 19:42 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git, Jeff King, Michael Haggerty, David Borowitz

Shawn Pearce <spearce@spearce.org> writes:

> ### Peeling
>
> References in a reftable are always peeled.

This hopefully means "a record for an annotated (or signed) tag
records both the tag object and the object it refers to", and does
not include peeling a commit down to its tree.

> ### Reference name encoding
>
> Reference names should be encoded with UTF-8.

If we limited ourselves to say that a refname is an uninterpreted
sequence of bytes that must pass check_refname_format() test, then
we won't open us to confusion such as "are two refs with the same
Unicode name encoded with different normalization form considered
the same?"

In what way does this "should be in UTF-8" help describing the file
format, I wonder.

> ### Directory/file conflicts
>
> The reftable format accepts both `refs/heads/foo` and
> `refs/heads/foo/bar` as distinct references.
>
> This property is useful for retaining log records in reftable, but may
> confuse versions of Git using `$GIT_DIR/refs` directory tree to
> maintain references.  Users of reftable may choose to continue to
> reject `foo` and `foo/bar` type conflicts to prevent problems for
> peers.

This is an interesting one.  I do agree that preserving reflog for
removed refs is a nice propertly to have.

> ### First ref block
>
> The first ref block shares the same block as the file header, and is
> 24 bytes smaller than all other blocks in the file.  The first block
> immediately begins after the file header, at offset 24.
>
> If the first block is a log block (a log only table), its block header
> begins immediately at offset 24.

A minor nit: You called such a file "a log-only file"; let's be consistent.

>
> ### Ref block format
>
> A ref block is written as:
>
>     'r'
>     uint24( block_len )
>     ref_record+
>     uint32( restart_offset )+
>     uint16( restart_count )
>     padding?
>
> Blocks begin with `block_type = 'r'` and a 3-byte `block_len` which
> encodes the number of bytes in the block up to, but not including the
> optional `padding`.  This is almost always shorter than the file's
> `block_size`.  In the first ref block, `block_len` includes 24 bytes
> for the file header.

As a block cannot be longer than 16MB, allocating uint32 to a
restart offset may be a bit overkill.  I do not know if it is worth
attempting to pack 1/3 more restart entries in a block by using
uint24, though.

> The end of the block may be filled with `padding` NUL bytes to fill
> out the block to the common `block_size` as specified in the file
> header.  Padding may be necessary to ensure the following block starts
> at a block alignment, and does not spill into the tail of this block.
> Padding may be omitted if the block is the last block of the file, and
> there is no index block.  This allows reftable to efficiently scale
> down to a small number of refs.

We may want to phrase it in a way that it is more clear that
padding, if exists, must be filled with NUL bytes, not arbitrary
garbage.  Your version may be clear enough already.  I dunno.

> #### ref record
>
> A `ref_record` describes a single reference, storing both the name and
> its value(s). Records are formatted as:
>
>     varint( prefix_length )

Just like we saw that "uintX are network byte order" upfront, it may
be easier to give the definition, or at least an outline, of varint()
near there.

>     varint( (suffix_length << 3) | value_type )
>     suffix
>     value?
>
> The `prefix_length` field specifies how many leading bytes of the
> prior reference record's name should be copied to obtain this
> reference's name.  This must be 0 for the first reference in any
> block, and also must be 0 for any `ref_record` whose offset is listed
> in the `restart_offset` table at the end of the block.
>
> Recovering a reference name from any `ref_record` is a simple concat:
>
>     this_name = prior_name[0..prefix_length] + suffix
>
> The `suffix_length` value provides the number of bytes to copy from
> `suffix` to complete the reference name.
>
> The `value` follows.  Its format is determined by `value_type`, one of
> the following:
>
> - `0x0`: deletion; no value data (see transactions, below)
> - `0x1`: one 20-byte object id; value of the ref
> - `0x2`: two 20-byte object ids; value of the ref, peeled target
> - `0x3`: symref and text: `varint( text_len ) text`
> - `0x4`: index record (see below)
> - `0x5`: log record (see below)
>
> Symbolic references use `0x3` with a `text` string starting with `"ref: "`,
> followed by the complete name of the reference target.  No
> compression is applied to the target name.  Other types of contents
> that are also reference like, such as `FETCH_HEAD` and `MERGE_HEAD`,
> may also be stored using type `0x3`.
>
> Types `0x6..0x7` are reserved for future use.

I wondered if we regret the apparent limited extensibility later,
but giving 4 bits to value-type would limit suffix length that can
be represented by a single varint() only to 7, while the format
described would give us up to 15 bytes.  We can say type 0x7 would
be followed by another varint() to record the extended type, or
something, to extend it, so probably what you did here strikes a
good balance.

> ### Ref index
> ...
> An index block should only be written if there are at least 4 blocks
> in the file, as cold reads using the index requires 2 disk reads (read
> index, read block), and binary searching <= 4 blocks also requires <=
> 2 reads.  Omitting the index block from smaller files saves space.

I think the last "<= 4" should be "< 4" here.  That is consistent
with an earlier part of the paragraph that requires at least 4
ref-blocks in the file, because a reftable with only 3 ref-blocks
still can be accessed with 2 reads (a reftable with 4 ref-blocks
without index may need 3 reads as there is no "middle" for binary
search).

The first sentence should be "if there are at least 4 ref-blocks", I
guess.

> ### Obj block format
>
> Object blocks use unique, abbreviated 2-20 byte SHA-1 keys, mapping
> to ref blocks containing references pointing to that object directly,
> or as the peeled value of an annotated tag.  Like ref blocks, object
> blocks use the file's standard `block_size`.
>
> To save space in small files, object blocks may be omitted if the ref
> index is not present.  When missing readers should brute force a
> linear search of all references to lookup by SHA-1.

I want a comma after "When missing".

It is a bit unclear why the presense of ref-index is linked to the
presense and/or need of this block. I first thought that the reason
is because the data in this table is to index into ref-index table,
in which case of course it would not make sense to have this table
if ref-index table is not present, but that is not the case. Am I
correct to read the above advice/suggestion to mean "if the reftable
has so few blocks not to require ref-index blocks, we hypothesize
that it is not worth having obj-block table either"?

> #### obj record
>
> An `obj_record` describes a single object abbreviation, and the blocks
> containing references using that unique abbreviation:
>
>     varint( prefix_length )
>     varint( (suffix_length << 3) | cnt_3 )
>     suffix
>     varint( cnt_large )?
>     varint( block_delta )+
>
> Like in reference blocks, abbreviations are prefix compressed within
> an obj block.  On large reftables with many unique objects, higher
> block sizes (64k), and higher restart interval (128), a
> `prefix_length` of 2 or 3 and `suffix_length` of 3 may be common in
> obj records (unique abbreviation of 5-6 raw bytes, 10-12 hex digits).

OK.

> Each record contains `block_count` number of block identifiers for ref
> blocks.  For 1-7 blocks the block count is stored in `cnt_3`.  When
> `cnt_3 = 0` the actual block count follows in a varint, `cnt_large`.

I feel a bit lost here.  Is this about a single object pointed by
multiple refs, and that we expect to have not too many refs pointing
at a single object?

> The first `block_delta` is the absolute block identifier counting from
> the start of the file. The offset of that block can be obtained by
> `block_delta[0] * block_size`.  Additional `block_delta` entries are
> relative to the prior entry, e.g. a reader would perform:
>
>     block_id = block_delta[0]
>     prior = block_id
>     for (j = 1; j < block_count; j++) {
>       block_id = prior + block_delta[j]
>       prior = block_id
>     }
>
> With a `block_id` in hand, a reader must linearly scan the ref block
> at `block_id * block_size` offset in the file, starting from the first
> `ref_record`, testing each reference's SHA-1s (for `value_type = 0x1`
> or `0x2`) for full equality.  Faster searching by SHA-1 within a
> single ref block is not supported by the reftable format.  Smaller
> block sizes reduces the number of candidates this step must consider.

Assuming varint() yields an unsigned quantity, the writer needs to
sort the refs that point at the same object by their block numbers
first and record from the smallest one to the larger ones?  Not a
complaint, but just seeking help to make sure I understood it.

> ### Log block format
>
> Unlike ref and obj blocks, log block sizes are variable in size, and
> do not match the `block_size` specified in the file header or footer.
> Writers should choose an appropriate buffer size to prepare a log block
> for deflation, such as `2 * block_size`.
>
> A log block is written as:
>
>     'g'
>     uint24( block_len )
>     zlib_deflate {
>       log_record+
>       int32( restart_offset )+
>       int16( restart_count )
>     }
>
> Log blocks look similar to ref blocks, except `block_type = 'g'`.
>
> The 4-byte block header is followed by the deflated block contents
> using zlib deflate.  The `block_len` in the header is the inflated
> size (including 4-byte block header), and should be used by readers to
> preallocate the inflation output buffer.  Offsets within the block
> (e.g.  `restart_offset`) still include the 4-byte header.  Readers may
> prefer prefixing the inflation output buffer with the 4-byte header.

Is block_len allowed to exceed the file-global block_size?

> #### log record
>
> Log record keys are structured as:
>
>     ref_name '\0' reverse_int64( update_index )
>
> where `update_index` is the unique transaction identifier.  The
> `update_index` field must be unique within the scope of a `ref_name`.
> See the update index section below for further details.
>
> The `reverse_int64` function inverses the value so lexographical
> ordering the network byte order encoding sorts the more recent records
> with higher `update_index` values first:
>
>     reverse_int64(int64 t) {
>       return 0xffffffffffffffff - t;
>     }

OK, so this no longer is linked to timestamp, which makes things
simpler.

All log records in a single reftable is sorted with the log record
key, so the reflog entries for a specific ref are adjacent to each
other and are ordered in reverse "chronological" order, assuming
that the update_index transaction numbers monotonically increase?

> The value data following the key suffix is complex:
>
> - two 20-byte SHA-1s (old id, new id)
> - varint time in seconds since epoch (Jan 1, 1970)
> - 2-byte timezone offset (signed)

"offset in minutes"

> - varint string of committer's name
> - varint string of committer's email

We might want to clarify that this is without surrounding <>.

> #### Reading the log
>
> Readers accessing the log must first read the footer (below) to
> determine the `log_offset`.  The first block of the log begins at
> `log_offset` bytes since the start of the file.  The `log_offset` is
> not block aligned.
>
> #### Importing logs
>
> When importing from `$GIT_DIR/logs` writers should globally order all
> log records roughly by timestamp while preserving file order, and
> assign unique, increasing `update_index` values for each log line.

I am not quite sure why you need global order (I am assuming that
you mean "consistency across multiple logs, e.g. $GIT_DIR/logs/HEAD
and $GIT_DIR/logs/refs/heads/master"), as the resulting log table
sorts the entries essentially with refname as the first key and then
recentness of the entry as the second key.  Wouldn't the resulting
log table in a reftable be sorted in the same way as

    cd $GIT_DIR/logs &&
    find -type f -print |
    sort |
    xargs -n 1 tac

anyway?

... Ah, you want to give the same (or at least close-enough)
transaction ID to two reflog entries that result from a commit while
'master' branch is checked out, one for 'refs/heads/master' and the
other for HEAD.  Then the suggestion makes sense.  I wonder if the
existing log records that are migrated are known to have timestamps
that fit in int64_t, using the timestamp from the original is
sufficient?

... The answer is no; if the original records clock rewind, you'd
still want to assign the update_index number that is in line with
the order of the entries, not with the skewed clock value.

OK.

> ## Repository format
>
> ### Version 1
>
> A repository must set its `$GIT_DIR/config` to configure reftable:
>
>     [core]
>         repositoryformatversion = 1
>     [extensions]
>         reftable = 1

The expectation is that this number matches the version number
recorded in the reftable itself?

That's it for now.  I'll comment on the part after Update
transactions in a separate message.

Thanks.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: reftable [v4]: new ref storage format
  2017-07-31 19:01 ` Stefan Beller
@ 2017-07-31 23:05   ` Shawn Pearce
  0 siblings, 0 replies; 34+ messages in thread
From: Shawn Pearce @ 2017-07-31 23:05 UTC (permalink / raw)
  To: Stefan Beller
  Cc: git, Jeff King, Michael Haggerty, Junio C Hamano, David Borowitz

On Mon, Jul 31, 2017 at 12:01 PM, Stefan Beller <sbeller@google.com> wrote:
> On Sun, Jul 30, 2017 at 8:51 PM, Shawn Pearce <spearce@spearce.org> wrote:
>> 4th iteration of the reftable storage format.
>>
>> You can read a rendered version of this here:
>> https://googlers.googlesource.com/sop/jgit/+/reftable/Documentation/technical/reftable.md
>>
>
>>     uint16( restart_count )
>
> When looking for 16/32/64 bit hard coded ints I came across
> this one once again. How much more expensive is reading
> a varint? As the block_len points at the restart_count, we cannot
> replace it with a varint easily, but we could use a byte-reversed
> varint instead. If we do this step, all restart offsets could also be
> (reverse) varints?

Its not the expense of decoding a varint, its making the file
structure predictable so its simple to binary search within restart
points. If these were varints, it becomes much more difficult to
divide the remaining range in half and update the boundary conditions
to locate the next mid-point.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: reftable [v4]: new ref storage format
  2017-07-31 19:42 ` Junio C Hamano
@ 2017-07-31 23:43   ` Shawn Pearce
  2017-08-01 16:08     ` Shawn Pearce
  0 siblings, 1 reply; 34+ messages in thread
From: Shawn Pearce @ 2017-07-31 23:43 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Jeff King, Michael Haggerty, David Borowitz

On Mon, Jul 31, 2017 at 12:42 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Shawn Pearce <spearce@spearce.org> writes:
>
>> ### Peeling
>>
>> References in a reftable are always peeled.
>
> This hopefully means "a record for an annotated (or signed) tag
> records both the tag object and the object it refers to", and does
> not include peeling a commit down to its tree.

Yes, thanks for the clarification. I've added that to this section.


>> ### Reference name encoding
>>
>> Reference names should be encoded with UTF-8.
>
> If we limited ourselves to say that a refname is an uninterpreted
> sequence of bytes that must pass check_refname_format() test, mthen
> we won't open us to confusion such as "are two refs with the same
> Unicode name encoded with different normalization form considered
> the same?"

Ack. I'll reword this as bytes, and point to git-check-ref-format.

> In what way does this "should be in UTF-8" help describing the file
> format, I wonder.

In JGit we have a String object for a reference name. Dropping that
down to a sequence of bytes requires converting it with a character
encoding, such as US-ASCII or UTF-8.


>> ### First ref block
>>
>> The first ref block shares the same block as the file header, and is
>> 24 bytes smaller than all other blocks in the file.  The first block
>> immediately begins after the file header, at offset 24.
>>
>> If the first block is a log block (a log only table), its block header
>> begins immediately at offset 24.
>
> A minor nit: You called such a file "a log-only file"; let's be consistent.

Fixed. I found another that also was inconsistent. Thanks for the
attention to detail.


>> ### Ref block format
>>
>> A ref block is written as:
>>
>>     'r'
>>     uint24( block_len )
>>     ref_record+
>>     uint32( restart_offset )+
>>     uint16( restart_count )
>>     padding?
>>
>> Blocks begin with `block_type = 'r'` and a 3-byte `block_len` which
>> encodes the number of bytes in the block up to, but not including the
>> optional `padding`.  This is almost always shorter than the file's
>> `block_size`.  In the first ref block, `block_len` includes 24 bytes
>> for the file header.
>
> As a block cannot be longer than 16MB, allocating uint32 to a
> restart offset may be a bit overkill.  I do not know if it is worth
> attempting to pack 1/3 more restart entries in a block by using
> uint24, though.

I don't think its worth the annoyance in code that has to deal with
this field. An example 87k ref repository with a 4k block size has ~9
restarts per block and ~19 bytes of padding. Using uint24 saves us 9
bytes, yet the average ref here costs 27 bytes. We aren't likely to
fill the block with another ref, or another restart point.


>> #### ref record
[...]
>> The `value` follows.  Its format is determined by `value_type`, one of
>> the following:
>>
>> - `0x0`: deletion; no value data (see transactions, below)
>> - `0x1`: one 20-byte object id; value of the ref
>> - `0x2`: two 20-byte object ids; value of the ref, peeled target
>> - `0x3`: symref and text: `varint( text_len ) text`
>> - `0x4`: index record (see below)
>> - `0x5`: log record (see below)
[...]
>> Types `0x6..0x7` are reserved for future use.
>
> I wondered if we regret the apparent limited extensibility later,
> but giving 4 bits to value-type would limit suffix length that can
> be represented by a single varint() only to 7, while the format
> described would give us up to 15 bytes.  We can say type 0x7 would
> be followed by another varint() to record the extended type, or
> something, to extend it, so probably what you did here strikes a
> good balance.

Thanks. I think we actually have 0x4-0x7 available in value_type. I
used 0x4 and 0x5 as above only as a suggestion from Stefan to help
someone debug a reader. The block type differs where 0x0-0x3 and 0x4
and 0x5 are used, so there is already sufficient information available
to disambiguate the value_type such that we don't actually have to use
0x4 and 0x5 as I have here.

So I agree with myself, and with you, 3 bits for value_type and 4 bits
for suffix_len is the right balance.


>> ### Ref index
>> ...
>> An index block should only be written if there are at least 4 blocks
>> in the file, as cold reads using the index requires 2 disk reads (read
>> index, read block), and binary searching <= 4 blocks also requires <=
>> 2 reads.  Omitting the index block from smaller files saves space.
>
> I think the last "<= 4" should be "< 4" here.  That is consistent
> with an earlier part of the paragraph that requires at least 4
> ref-blocks in the file, because a reftable with only 3 ref-blocks
> still can be accessed with 2 reads (a reftable with 4 ref-blocks
> without index may need 3 reads as there is no "middle" for binary
> search).
>
> The first sentence should be "if there are at least 4 ref-blocks", I
> guess.

Thanks, clarified.


>> ### Obj block format
>>
>> Object blocks use unique, abbreviated 2-20 byte SHA-1 keys, mapping
>> to ref blocks containing references pointing to that object directly,
>> or as the peeled value of an annotated tag.  Like ref blocks, object
>> blocks use the file's standard `block_size`.
>>
>> To save space in small files, object blocks may be omitted if the ref
>> index is not present.  When missing readers should brute force a
>> linear search of all references to lookup by SHA-1.
>
> I want a comma after "When missing".

Added comma.

> It is a bit unclear why the presense of ref-index is linked to the
> presense and/or need of this block. I first thought that the reason
> is because the data in this table is to index into ref-index table,
> in which case of course it would not make sense to have this table
> if ref-index table is not present, but that is not the case. Am I
> correct to read the above advice/suggestion to mean "if the reftable
> has so few blocks not to require ref-index blocks, we hypothesize
> that it is not worth having obj-block table either"?

Yes.


>> #### obj record
>>
>> An `obj_record` describes a single object abbreviation, and the blocks
>> containing references using that unique abbreviation:
>>
>>     varint( prefix_length )
>>     varint( (suffix_length << 3) | cnt_3 )
>>     suffix
>>     varint( cnt_large )?
>>     varint( block_delta )+
>>
>> Like in reference blocks, abbreviations are prefix compressed within
>> an obj block.  On large reftables with many unique objects, higher
>> block sizes (64k), and higher restart interval (128), a
>> `prefix_length` of 2 or 3 and `suffix_length` of 3 may be common in
>> obj records (unique abbreviation of 5-6 raw bytes, 10-12 hex digits).
>
> OK.
>
>> Each record contains `block_count` number of block identifiers for ref
>> blocks.  For 1-7 blocks the block count is stored in `cnt_3`.  When
>> `cnt_3 = 0` the actual block count follows in a varint, `cnt_large`.
>
> I feel a bit lost here.  Is this about a single object pointed by
> multiple refs, and that we expect to have not too many refs pointing
> at a single object?

Yes, correct. I've added a paragraph to clarify:

  The use of `cnt_3` bets most objects are pointed to by only a single
  reference, some may be pointed to be a couple of references, and very
  few (if any) are pointed to by more than 7 references.


>> The first `block_delta` is the absolute block identifier counting from
>> the start of the file. The offset of that block can be obtained by
>> `block_delta[0] * block_size`.  Additional `block_delta` entries are
>> relative to the prior entry, e.g. a reader would perform:
>>
>>     block_id = block_delta[0]
>>     prior = block_id
>>     for (j = 1; j < block_count; j++) {
>>       block_id = prior + block_delta[j]
>>       prior = block_id
>>     }
>>
>> With a `block_id` in hand, a reader must linearly scan the ref block
>> at `block_id * block_size` offset in the file, starting from the first
>> `ref_record`, testing each reference's SHA-1s (for `value_type = 0x1`
>> or `0x2`) for full equality.  Faster searching by SHA-1 within a
>> single ref block is not supported by the reftable format.  Smaller
>> block sizes reduces the number of candidates this step must consider.
>
> Assuming varint() yields an unsigned quantity, the writer needs to
> sort the refs that point at the same object by their block numbers
> first and record from the smallest one to the larger ones?  Not a
> complaint, but just seeking help to make sure I understood it.

Yes. I added a "sorted ascending" to the paragraph to clarify.

  Additional `block_delta` entries are sorted ascending and relative
  to the prior entry, e.g.  a reader would perform:


>> ### Log block format
>>
>> Unlike ref and obj blocks, log block sizes are variable in size, and
>> do not match the `block_size` specified in the file header or footer.
>> Writers should choose an appropriate buffer size to prepare a log block
>> for deflation, such as `2 * block_size`.
>>
>> A log block is written as:
>>
>>     'g'
>>     uint24( block_len )
>>     zlib_deflate {
>>       log_record+
>>       int32( restart_offset )+
>>       int16( restart_count )
>>     }
>>
>> Log blocks look similar to ref blocks, except `block_type = 'g'`.
>>
>> The 4-byte block header is followed by the deflated block contents
>> using zlib deflate.  The `block_len` in the header is the inflated
>> size (including 4-byte block header), and should be used by readers to
>> preallocate the inflation output buffer.  Offsets within the block
>> (e.g.  `restart_offset`) still include the 4-byte header.  Readers may
>> prefer prefixing the inflation output buffer with the 4-byte header.
>
> Is block_len allowed to exceed the file-global block_size?

Yes, clarified that with an additional sentence.


>> #### log record
>>
>> Log record keys are structured as:
>>
>>     ref_name '\0' reverse_int64( update_index )
>>
>> where `update_index` is the unique transaction identifier.  The
>> `update_index` field must be unique within the scope of a `ref_name`.
>> See the update index section below for further details.
>>
>> The `reverse_int64` function inverses the value so lexographical
>> ordering the network byte order encoding sorts the more recent records
>> with higher `update_index` values first:
>>
>>     reverse_int64(int64 t) {
>>       return 0xffffffffffffffff - t;
>>     }
>
> OK, so this no longer is linked to timestamp, which makes things
> simpler.
>
> All log records in a single reftable is sorted with the log record
> key, so the reflog entries for a specific ref are adjacent to each
> other and are ordered in reverse "chronological" order, assuming
> that the update_index transaction numbers monotonically increase?

Correct.

>> The value data following the key suffix is complex:
>>
>> - two 20-byte SHA-1s (old id, new id)
>> - varint time in seconds since epoch (Jan 1, 1970)
>> - 2-byte timezone offset (signed)
>
> "offset in minutes"

Fixed.

>> - varint string of committer's name
>> - varint string of committer's email
>
> We might want to clarify that this is without surrounding <>.

Good idea, added.


>> #### Importing logs
>>
>> When importing from `$GIT_DIR/logs` writers should globally order all
>> log records roughly by timestamp while preserving file order, and
>> assign unique, increasing `update_index` values for each log line.
>
> I am not quite sure why you need global order (I am assuming that
> you mean "consistency across multiple logs, e.g. $GIT_DIR/logs/HEAD
> and $GIT_DIR/logs/refs/heads/master"), as the resulting log table
> sorts the entries essentially with refname as the first key and then
> recentness of the entry as the second key.  Wouldn't the resulting
> log table in a reftable be sorted in the same way as
>
>     cd $GIT_DIR/logs &&
>     find -type f -print |
>     sort |
>     xargs -n 1 tac
>
> anyway?
>
> ... Ah, you want to give the same (or at least close-enough)
> transaction ID to two reflog entries that result from a commit while
> 'master' branch is checked out, one for 'refs/heads/master' and the
> other for HEAD.  Then the suggestion makes sense.  I wonder if the
> existing log records that are migrated are known to have timestamps
> that fit in int64_t, using the timestamp from the original is
> sufficient?
>
> ... The answer is no; if the original records clock rewind, you'd
> still want to assign the update_index number that is in line with
> the order of the entries, not with the skewed clock value.
>
> OK.

Correct. :)


>> ## Repository format
>>
>> ### Version 1
>>
>> A repository must set its `$GIT_DIR/config` to configure reftable:
>>
>>     [core]
>>         repositoryformatversion = 1
>>     [extensions]
>>         reftable = 1
>
> The expectation is that this number matches the version number
> recorded in the reftable itself?

Honestly, I'm not sure why its "1" and not "true". I was asking myself
that yesterday before I posted this iteration to the list. I think we
can use "true" here.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: reftable [v4]: new ref storage format
  2017-07-31  3:51 reftable [v4]: new ref storage format Shawn Pearce
                   ` (2 preceding siblings ...)
  2017-07-31 19:42 ` Junio C Hamano
@ 2017-08-01  6:41 ` Michael Haggerty
  2017-08-01 20:23   ` Shawn Pearce
  2017-08-01 23:27   ` Shawn Pearce
  2017-08-01 13:54 ` Dave Borowitz
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 34+ messages in thread
From: Michael Haggerty @ 2017-08-01  6:41 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git, Jeff King, Junio C Hamano, David Borowitz

On Sun, Jul 30, 2017 at 8:51 PM, Shawn Pearce <spearce@spearce.org> wrote:
> 4th iteration of the reftable storage format.
> [...]

Before we commit to Shawn's reftable proposal, I wanted to explore
what a contrasting design that is not block based would look like. So
I threw together a sketch of such a design. It is not as refined as
Shawn's, and I haven't had time to program an implementation or
collect statistics, but maybe it is interesting anyway.

The design benefits from a lot of Shawn's ideas, plus a lot of the
discussion from the mailing list. First let me explain in words the
major differences and why I think they might be promising.

It seems to me that the block-orientation of Shawn's spec adds a lot
of complication to the design (e.g., padding, optional blocks, varying
block sizes for different parts of the file). It also seems that the
format has been heavily optimized to be read in 64k blocks, whereas
most users will be storing their references to local hard disks or
(increasingly) to local SSDs. So I tried to come up with a design that
is friendlier to the latter, hopefully without being much worse for
the former.

One thing that I realized is that, assuming that reference values and
reflogs are usually compacted separately, there will basically be
three types of reftable files:

1. Accumulated references and their values (potentially very large)

2. Accumulated reflogs (potentially gigantic)

3. Recent transaction(s), which include both refs and reflogs (usually
   quite small).

The first two types of file don't benefit from separating reference
data from reflog data in the file (because they only include one or
the other). The third kind of file doesn't care because such files
will be small.

So let's store all of the data related to a single reference in one
spot. This simplifies the file format and removes the need for two
separate indexes and/or variants that special-case the omission of an
index. A disadvantage is that, it prevents compression of reflog
entries across references. I think that is a reasonable tradeoff,
because most reflog entries (in the repositories I checked; that might
differ in Gerrit-managed repos) correspond to references that have
been updated many times. To partly compensate for the resulting
reduction in compression, I added some features to reduce the storage
of redundant information in reflog entries.

I also wanted to support multiple-level indexes, so that one isn't
forced to choose rather large blocksizes for repositories with lots of
references.

On the other hand, this design does not support binary searching for
refnames. Instead, the refnames are distributed into a multiple-level
tree. When looking for a reference, each tree node on the way to the
reference needs to be parsed serially. But the nodes are small enough
(due to the multilevel indexing scheme) that I hope that isn't a
problem. I'd rather read less data at the expense of parsing a little
bit more.

The primary data structure consists of `INDEX` and `REFS` chunks.
Basically the `INDEX` chunks are interior nodes of the tree, and the
`REFS` chunks are the leaf nodes. But both types of nodes are N-way.

* Prefix compression occurs within a single node chunk, but not across
  nodes. In that way, the node chunks are analogous to a restart plus
  the subsequent prefix-compressed references in Shawn's design (but
  they will typically be bigger).

* A chunk must be read and processed serially to make sense of it.
  Chunks are also the unit of addressing.

* Indexes can be multi-level to keep each level of the tree small.
  This reduces the total amount of data that needs to be read to
  access a reference, but might require an additional seek if the
  required index chunks are not close together.

* Node chunks can be read serially starting at any chunk without
  referring to any index. Also, data are always written in order by
  reference name, both within single chunks and regarding the ordering
  of multiple `REFS` chunks. So references can be iterated over
  without regard to the indexes.

* I feel like the size of a chunk should be roughly 1-4 kiB, and will
  usually fit in one disk block (though it might cross disk blocks,
  requiring two to be read).

* Multiple related chunks can be stored in a 64 kiB block (interesting
  for people who prefer to read files in 64k segments). By storing
  related chunks near each other, the number of seeks will probably
  typically be smaller than the number of chunks that have to be
  accessed.

* Padding can be added between any two chunks (e.g., to avoid
  splitting chunks across 64 kiB blocks), but it is entirely optional
  and probably would not be used for data stored on local storage.

* In many cases, cross-references between chunks are relative, so that
  they can be stored compactly using varint encoding. This is more
  important than in Shawn's proposal because we don't have block
  indexes, so we have to store arbitrary disk offsets.

* The file can be written in two possible ways: with cross-references
  between chunks always pointing to later parts of the file (probably
  preferable if the data will be read from a local hard disk) or
  always pointing to earlier parts of the file (to accomodate writers
  who need to write their data in order). See the `offsets_negative`
  field in the header. (I'm not sure that the former variant is
  actually needed.)

Because we don't have blocks, it is more difficult to store the object
references compactly. I've arranged them in a radix-256 tree with
relative references. The lowest internal nodes are stored sparsely,
and contain absolute references that point at the containing chunk.

Definitions:

Varint encoding is similar to the unsigned varint encoding used for
protocol buffers; i.e., a little-endian base-128 format where the most
significant bit is used to indicate that another byte is coming.
Decoding is as follows:

```
val = *ptr & 0x7f
while (*ptr++ & 0x80) {
    val = (val << 7) | (*ptr & 0x7f)
}
```

Strings are usually encoded using `varstr` encoding, which is the same
as how strings are encoded in protocol buffers. A `varstr` consists of
a `varint` length followed by the specified number of bytes holding
the contents of the string. `varstr` strings are not intrinsically
NUL-terminated.

The `chunk_length` fields encode the total size of the containing
chunk, including its `chunk_type` field.

```
varstr(s) = {
    varint(strlen(s))
    s
}
```

```
reftable = {
    header

    # The choice to order ref_chunks before obj_chunks is not crucial
    # to the design, but probably makes sense.
    [ref_chunk | padding]*
    [obj_chunk | padding]*

    footer
}

header = {
    'REFT'
    uint8( version_number = 1 )

    # if `contains_values` is false, this file *must not* contain any
    # reference values; i.e., `value_type` must always be `NO_VALUE`.
    contains_values : bool

    # if `contains_logs` is false, this file *must not* contain any
    # reflog entries; i.e., `log_type` must always be `NO_REFLOG`.

    contains_logs : bool

    # If `offsets_negative` is true, then all `*_offset` fields point
    # backwards in the file; i.e., the corresponding varint value is
    # negated before use.
    offsets_negative : bool

    min_update_index : uint64
    max_update_index : uint64

    # To accomodate systems that have to write files serially, the
    # following two entries can be zeroed out in the header to tell
    # the reader that it has to read the corresponding values from the
    # footer.

    # The file offset of the chunk containing the root of the
    # reference tree:
    ref_root_chunk_addr : uint64

    # The file offset of the chunk containing the root of the object
    # tree. This value must be zero if `contains_values` is false:
    obj_root_chunk_addr : uint64
}

footer = {
    header
    uint32(CRC-32 of previous part of footer)
}

chunk = {
    # The `chunk_type` determines how to interpret the payload, and
    # influences how to compute its length (which is needed to advance
    # to the next chunk).

    chunk_type : enum PAD_BYTE | PADDING
                    | INDEX | REFS
                    | OBJS_INDEX | OBJS

    if chunk_type == PAD_BYTE {
        # This chunk type can be used to add a single byte of padding,
        # which would otherwise be impossible because a `PADDING`
        # chunk requires a minimum of two bytes.
    }
    elif chunk_type == PADDING {
        # A form of padding that's cheaper to skip over than
        # `PAD_BYTE`.

        # The total number of bytes in this chunk, including
        # `chunk_type`. The contents will otherwise be ignored:

        chunk_length : varint
    }
    elif chunk_type == INDEX {
        chunk_length : varint
        first_child : {
            refname : varstr
            index_payload
        }
        other_children : {
            # Length of prefix being carried over from the previous
            # record:
            prefix_len : varint
            suffix : varstr
            index_payload
        }*
    }
    elif chunk_type == REFS {
        chunk_length : varint
        first_ref : {
            refname : varstr
            ref_payload
        }*
        other_refs : {
            # Length of prefix being carried over from the previous
            # record:
            prefix_len : varint
            suffix : varstr
            ref_payload
        }*
    }
    elif chunk_type == OBJS_INDEX {
        chunk_length : varint

        # The offset, relative to the start of this chunk, of the
        # chunk containing the next level of the obj index, for each
        # of the possible "next" bytes in the SHA-1, or zero if there
        # are no references with the given next byte.
        child_offset : varint * 256
    }
    elif chunk_type == OBJS {
        chunk_length : varint
        obj_record*
    }
}
```

```
index_payload = {
    # The number of bytes from the begining of this chunk to the child
    # chunk. If child_offset is zero, then there are no entries in
    # this reftable whose refnames start with the specified prefix.
    #
    # The child pointed at is of type INDEX (another index chunk
    # containing the next finer level of detail) or of type REFS. In
    # either case, the first record in the pointed-to chunk must have
    # `prefix_len == 0` and contain the entire key as `suffix`.
    child_offset : varint
}

ref_payload = {
    value_type : enum NO_VALUE
                    | DELETED
                    | VALUE | VALUE_PEELED
                    | SYMREF | SYMREF_PEELED
                    | SPECIAL
    log_type : enum NO_REFLOG | REFLOG | REFLOG_COMPRESSED
    symref_target : bool

    if value_type == NO_VALUE {
        # This type is used if we need to store a reflog entry but
        # have no reference value to store in this file.
    }
    elif value_type == DELETED {
        # This type indicates that the reference has been deleted,
        # regardless of what any reftables deeper in the stack claim.
    }
    elif value_type == VALUE {
        # This is a normal (non-symbolic) reference.
        sha1 : uchar[20]
    }
    elif value_type == VALUE_PEELED {
        # This is a normal (non-symbolic) reference that points at a
        # tag. `peeled` is the reference peeled down to a non-tag.
        sha1 : uchar[20]
        peeled : uchar[20]
    }
    elif value_type == SYMREF {
        # This is a symref that points at a non-existent branch.
        target : varstr
    }
    elif value_type == SYMREF_PEELED {
        # This is a symref that points at a branch that has the
        # specified value.
        target : varstr
        peeled : uchar[20]
    }
    elif value_type == SPECIAL {
        # This is one of the special references (like FETCH_HEAD,
        # MERGE_HEAD). The contents are stored as they would be in a
        # loose reference file:
        contents : varstr
    }

    # This field is used to keep backwards links from references to
    # any symrefs that point at them, to make it tractable to update
    # the reflog of the symref if the reference is changed directly:
    if symref_target {
        referer : varstr
        varint(0)
    }

    if log_type == NO_REFLOG {
    }
    elif log_type == REFLOG {
        log_entry_length : varint
        log_entry
    elif log_type == REFLOGS_COMPRESSED {
        zlib_length : varint
        zlib_deflate(
            log_entry*
        )
    }
}

# Log entries are stored from oldest to newest. The "chained" variants
# of `log_type` take their `new_id` from the current value of the
# reference (if it is the first log entry for the references) or from
# the preceding (i.e., next newest) log record's `old_id` value.
log_entry = {
    # `CREATE_CHAINED` and `UPDATE_CHAINED` take their `new_id` from
    # the preceding (i.e., next newest) record's `old_id` value.
    log_type : enum END_OF_LOG | LOG
    old_type : enum ABSENT | REF | SYMREF
    new_type : enum ABSENT | REF | SYMREF | CHAINED

    if log_type != END_OF_LOG {
        # The update_index of this entry, or 0 if there are no more entries:
        update_index : varint

        # `PUSHER_CHAINED` takes its `name` and `email` from the preceding
        # (i.e., next newest) record's fields.
        pusher_type : enum PUSHER_EXPLICIT | PUSHER_CHAINED

        if new_type == ABSENT {
        }
        elif new_type == REF {
            new_id : uchar[20]
        }
        elif new_type == SYMREF {
            new_ref : varstr
        }
        elif new_type == CHAINED {
        }

        if old_type == ABSENT {
        }
        elif old_type == REF {
            old_id : uchar[20]
        }
        elif old_type == SYMREF {
            old_ref : varstr
        }

        time_seconds : uint32
        tz_offset_minutes : sint16
        if pusher_type == PUSHER_EXPLICIT {
            name : varstr
            email : varstr
        }

        # The reflog message. For a bit more compression:
#
# * Many messages contain old/new SHA-1s for the reference
#   updates. That redundancy could be eliminated by replacing
#   the old SHA-1 with `%o`, the new one with `%n`, and `%`
#   with `%%`.
        #
        # * Some "standard" log messages (i.e., the ones generated by
#   Git itself) could be hard-coded as additional `log_type`
#   constants.
        message : varstr
    }
}

obj_record = {
    # The "next" two bytes of the SHA-1 being described:
    bytes : uchar[2]

    # The number of of child_addrs:
    count : varint

    # File offsets of the chunks containing references that point at
    # objects with this prefix:
    child_addr+ : varint
}
```

Michael

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: reftable [v4]: new ref storage format
  2017-07-31  3:51 reftable [v4]: new ref storage format Shawn Pearce
                   ` (3 preceding siblings ...)
  2017-08-01  6:41 ` Michael Haggerty
@ 2017-08-01 13:54 ` Dave Borowitz
  2017-08-01 15:27   ` Shawn Pearce
  2017-08-02 19:50 ` Junio C Hamano
  2017-08-02 19:54 ` Stefan Beller
  6 siblings, 1 reply; 34+ messages in thread
From: Dave Borowitz @ 2017-08-01 13:54 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git, Jeff King, Michael Haggerty, Junio C Hamano

On Sun, Jul 30, 2017 at 11:51 PM, Shawn Pearce <spearce@spearce.org> wrote:
> - Ref-like files (FETCH_HEAD, MERGE_HEAD) also use type 0x3.

> - Combine reflog storage with ref storage for small transactions.
> - Separate reflog storage for base refs and historical logs.

How is the stash implemented in reftable? In particular, "git stash
drop" needs to be able to remove an arbitrary single entry from a
reflog.

I don't think the current proposal supports writing tombstones for
reflog entries, so this maybe implies that "stash drop" would have to
be implemented by rewriting the whole reflog for the stash ref during
compaction. Can the current compaction algorithm support this?

I suppose there are more exotic alternatives:
* Go back to the normal ref(log) format for refs/stash. I figure you
probably don't want to do this, given that you already moved other
ref-like files into the reftable in a later revision of this proposal.
* Implement the whole stash storage/command in some other way that
doesn't depend on reflog.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: reftable [v4]: new ref storage format
  2017-08-01 13:54 ` Dave Borowitz
@ 2017-08-01 15:27   ` Shawn Pearce
  0 siblings, 0 replies; 34+ messages in thread
From: Shawn Pearce @ 2017-08-01 15:27 UTC (permalink / raw)
  To: Dave Borowitz; +Cc: git, Jeff King, Michael Haggerty, Junio C Hamano

On Tue, Aug 1, 2017 at 6:54 AM, Dave Borowitz <dborowitz@google.com> wrote:
> On Sun, Jul 30, 2017 at 11:51 PM, Shawn Pearce <spearce@spearce.org> wrote:
>> - Ref-like files (FETCH_HEAD, MERGE_HEAD) also use type 0x3.
>
>> - Combine reflog storage with ref storage for small transactions.
>> - Separate reflog storage for base refs and historical logs.
>
> How is the stash implemented in reftable? In particular, "git stash
> drop" needs to be able to remove an arbitrary single entry from a
> reflog.
>
> I don't think the current proposal supports writing tombstones for
> reflog entries, so this maybe implies that "stash drop" would have to
> be implemented by rewriting the whole reflog for the stash ref during
> compaction. Can the current compaction algorithm support this?

Yes, the reftable holding the stash would have to be completely
rewritten for a "stash drop" command to be able to remove a reflog
entry.

> I suppose there are more exotic alternatives:
> * Go back to the normal ref(log) format for refs/stash. I figure you
> probably don't want to do this, given that you already moved other
> ref-like files into the reftable in a later revision of this proposal.
> * Implement the whole stash storage/command in some other way that
> doesn't depend on reflog.

I think the simplest is just put refs/stash in its own reftable, and
put that reftable in the stack somewhere. Editing refs/stash reflog
requires rewriting and replacing that reftable in the stack, but
otherwise doesn't impact any other (possibly much larger) reflog.

Pushing things onto refs/stash would be creating new small transaction
reftables on the top of the stack, which should be compacted with the
lower refs/stash table. I guess that means refs/stash needs special
handling in some of the compaction code to keep it isolated.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: reftable [v4]: new ref storage format
  2017-07-31 23:43   ` Shawn Pearce
@ 2017-08-01 16:08     ` Shawn Pearce
  0 siblings, 0 replies; 34+ messages in thread
From: Shawn Pearce @ 2017-08-01 16:08 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Jeff King, Michael Haggerty, David Borowitz

On Mon, Jul 31, 2017 at 4:43 PM, Shawn Pearce <spearce@spearce.org> wrote:
> On Mon, Jul 31, 2017 at 12:42 PM, Junio C Hamano <gitster@pobox.com> wrote:
>>
>> As a block cannot be longer than 16MB, allocating uint32 to a
>> restart offset may be a bit overkill.  I do not know if it is worth
>> attempting to pack 1/3 more restart entries in a block by using
>> uint24, though.
>
> I don't think its worth the annoyance in code that has to deal with
> this field. An example 87k ref repository with a 4k block size has ~9
> restarts per block and ~19 bytes of padding. Using uint24 saves us 9
> bytes, yet the average ref here costs 27 bytes. We aren't likely to
> fill the block with another ref, or another restart point.

I thought about this more. We can fit an additional ref per block in
some cases. It saves ~20 KiB for one of my example repositories if
restart_offset is a uint24.

Given that readers have to deal with these being unaligned loads
anyway, its not significantly harder to work with uint24 vs. uint32.
So I've changed the definition to be uint24.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: reftable [v4]: new ref storage format
  2017-08-01  6:41 ` Michael Haggerty
@ 2017-08-01 20:23   ` Shawn Pearce
  2017-08-02  0:49     ` Michael Haggerty
  2017-08-01 23:27   ` Shawn Pearce
  1 sibling, 1 reply; 34+ messages in thread
From: Shawn Pearce @ 2017-08-01 20:23 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: git, Jeff King, Junio C Hamano, David Borowitz

On Mon, Jul 31, 2017 at 11:41 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
> On Sun, Jul 30, 2017 at 8:51 PM, Shawn Pearce <spearce@spearce.org> wrote:
>> 4th iteration of the reftable storage format.
>> [...]
>
> Before we commit to Shawn's reftable proposal, I wanted to explore
> what a contrasting design that is not block based would look like. So
> I threw together a sketch of such a design. It is not as refined as
> Shawn's, and I haven't had time to program an implementation or
> collect statistics, but maybe it is interesting anyway.

Thanks for taking the time to write this out. Its constructive to see
other possible approaches.

I roughly implemented your proposed design in JGit for references
only. I skipped the object lookup and log handling for the sake of
studying the basics of this approach.

       | size   | seek_cold | seek_hot  |
mh     | 28.3 M | 24.5 usec | 14.5 usec |
sp  4k | 29.2 M | 63.0 usec |  5.8 usec |
sp 64k | 27.7 M | 35.6 usec | 23.3 usec |

All tests were run with a 4K chunk/block size on an 866k ref example.
The mh approach does not use padding between chunks, so chunks are
variable sized, but not exceeding 4K. The differences between cold and
hot is whether or not the file was held open, and the root of the ref
index stays in memory.

In the mh approach with a 4K chunk size and 866k refs, the index is 2
levels deep. A cold seek needs to perform 3 disk reads to navigate the
index, hot seek is 2 disk reads.

> It seems to me that the block-orientation of Shawn's spec adds a lot
> of complication to the design (e.g., padding, optional blocks, varying
> block sizes for different parts of the file). It also seems that the
> format has been heavily optimized to be read in 64k blocks, whereas
> most users will be storing their references to local hard disks or
> (increasingly) to local SSDs. So I tried to come up with a design that
> is friendlier to the latter, hopefully without being much worse for
> the former.

Yes, my design is biased to run well on a 64k block size, where the
underlying disk is slow. Ideally reads require only one or two block
reads. Torn blocks (where a range of data lies on two different
blocks) are expensive, as both 64k blocks have to be loaded to acquire
data that lies across the boundary. Given how fast SSDs are, I don't
think this causes any problems for SSD users.

> One thing that I realized is that, assuming that reference values and
> reflogs are usually compacted separately, there will basically be
> three types of reftable files:
>
> 1. Accumulated references and their values (potentially very large)
>
> 2. Accumulated reflogs (potentially gigantic)
>
> 3. Recent transaction(s), which include both refs and reflogs (usually
>    quite small).
>
> The first two types of file don't benefit from separating reference
> data from reflog data in the file (because they only include one or
> the other). The third kind of file doesn't care because such files
> will be small.

In principal, I agree with your simplification. The important thing is
keeping the reflog away from the accumulated references, such that
scans of references (e.g. current ls-remote advertisement) is
efficient. I think I was also aiming to do this even for the
transaction log files, as ls-remote operations don't need the log
record data.

> So let's store all of the data related to a single reference in one
> spot. This simplifies the file format and removes the need for two
> separate indexes and/or variants that special-case the omission of an
> index. A disadvantage is that, it prevents compression of reflog
> entries across references. I think that is a reasonable tradeoff,
> because most reflog entries (in the repositories I checked; that might
> differ in Gerrit-managed repos) correspond to references that have
> been updated many times. To partly compensate for the resulting
> reduction in compression, I added some features to reduce the storage
> of redundant information in reflog entries.

I found the encoding of reflog entries below somewhat complicated, but
I haven't been able to implement it to compare storage size vs. the
deflated block I proposed.

> On the other hand, this design does not support binary searching for
> refnames. Instead, the refnames are distributed into a multiple-level
> tree. When looking for a reference, each tree node on the way to the
> reference needs to be parsed serially. But the nodes are small enough
> (due to the multilevel indexing scheme) that I hope that isn't a
> problem. I'd rather read less data at the expense of parsing a little
> bit more.

Looking at the numbers above, the binary search within a 4k block does
reduce lookup time. E.g. reftable has ~7 binary search restart points
within each 4k block, vs. your approach needing to parse up to 119
refs in a 4k chunk.

> * Multiple related chunks can be stored in a 64 kiB block (interesting
>   for people who prefer to read files in 64k segments). By storing
>   related chunks near each other, the number of seeks will probably
>   typically be smaller than the number of chunks that have to be
>   accessed.

This is difficult. The most related chunks are actually index chunks
in the multi-level index. You want the next step(s) near the current
step to avoid an actual disk seek. But that is hard to do when the
index is several levels deep.

> * The file can be written in two possible ways: with cross-references
>   between chunks always pointing to later parts of the file (probably
>   preferable if the data will be read from a local hard disk) or
>   always pointing to earlier parts of the file (to accomodate writers
>   who need to write their data in order). See the `offsets_negative`
>   field in the header. (I'm not sure that the former variant is
>   actually needed.)

This seems like unnecessary complexity. I'd prefer to just say the
offsets are negative, and reference backwards.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: reftable [v4]: new ref storage format
  2017-08-01  6:41 ` Michael Haggerty
  2017-08-01 20:23   ` Shawn Pearce
@ 2017-08-01 23:27   ` Shawn Pearce
  2017-08-01 23:54     ` Shawn Pearce
  2017-08-02  1:51     ` Michael Haggerty
  1 sibling, 2 replies; 34+ messages in thread
From: Shawn Pearce @ 2017-08-01 23:27 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: git, Jeff King, Junio C Hamano, David Borowitz

On Mon, Jul 31, 2017 at 11:41 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
> On Sun, Jul 30, 2017 at 8:51 PM, Shawn Pearce <spearce@spearce.org> wrote:
>> 4th iteration of the reftable storage format.
>> [...]
>
> Before we commit to Shawn's reftable proposal, I wanted to explore
> what a contrasting design that is not block based would look like.

I forgot to look at a 1k chunk size, as you suggested that might also
be suitable. Here is the more complete experiment table:

       | size   | seek_cold | seek_hot  |
mh  1k | 36.6 M | 20.6 usec | 10.7 usec |
mh  4k | 28.3 M | 24.5 usec | 14.5 usec |
sp  4k | 29.2 M | 63.0 usec |  5.8 usec |
sp 64k | 27.7 M | 35.6 usec | 23.3 usec |

A couple of other notes about your contrasting design:

>     elif chunk_type == INDEX {
>         chunk_length : varint

Using a varint for the chunk length made for a complicated reader.
JGit doesn't have the luxury of mmap to access the file, so we have to
allocate a byte[] and read data from a file descriptor to do anything
fancy like decoding a varint. For my experiment I wound up just
hardcoding the IO to read 1k or 4k from whatever address.

A "real" implementation would likely prefer to read a fixed width
field here such that chunks have a 3 byte header (1 byte chunk_type, 2
byte chunk_length), and then issue a second read to acquire the rest
of the chunk. Given that encoding a chunk length of 1024 or 4096 both
requires 2 bytes of varint, its always going to be 2 bytes in your
design anyway. With the way chunks are scanned, I don't think you want
chunks as large as 16k, which would have caused the varint to go to 3
bytes (but still fits in a fixed 2-byte chunk_length).

My reftable proposal should still do well in a mmap region. Most of
the cold start penalty for reftable is JGit copying the ref index from
the file descriptor to the memory block where we can parse the format.
That is why the cold_seek time declines for a larger block size, the
index is smaller.

>         first_child : {
>             refname : varstr
>             index_payload
>         }
>         other_children : {
>             # Length of prefix being carried over from the previous
>             # record:
>             prefix_len : varint
>             suffix : varstr
>             index_payload

Having no prefix_len on first_child made for a slightly funkier
parser. It does save you a byte, but the parser has to know if its
looking at the first child, or an other_children to know if it should
expect the prefix_len. Its a simple condition, but it kind of grated
on me when I wrote that particular section of the experiment. For the
majority of records the parser considers, the prefix_len is always
present.

That is why I proposed the restart_offsets point to the prefix_len,
and prefix_len = 0 at restart points. It slightly simplified the
parser.

>     elif chunk_type == OBJS_INDEX {
>         chunk_length : varint
>
>         # The offset, relative to the start of this chunk, of the
>         # chunk containing the next level of the obj index, for each
>         # of the possible "next" bytes in the SHA-1, or zero if there
>         # are no references with the given next byte.
>         child_offset : varint * 256

This is space saving and cute, but kind of annoying. If it was fixed
width 32 bit you can address up to 4G away from this chunk's address,
and you can directly jump to the byte of interest. By being varints
you do save a little space, as most files will probably only need 3
byte varints, and the 0s do collapse down to 1 byte, but you have to
linearly walk the list to find any specific byte.

> ref_payload = {
>     value_type : enum NO_VALUE
>                     | DELETED
>                     | VALUE | VALUE_PEELED
>                     | SYMREF | SYMREF_PEELED
>                     | SPECIAL
>     log_type : enum NO_REFLOG | REFLOG | REFLOG_COMPRESSED
>     symref_target : bool

FWIW I didn't implement log_type or symref_target in my experiment, so
the size per ref was maybe a few bytes smaller than what you outlined
here.

>     # This field is used to keep backwards links from references to
>     # any symrefs that point at them, to make it tractable to update
>     # the reflog of the symref if the reference is changed directly:
>     if symref_target {
>         referer : varstr
>         varint(0)
>     }

I wonder how desirable this feature is. Most updates are done through
HEAD, which is a symref and can therefore update both HEAD and the
target's reflogs in the same operation. It seems to me its rare to
issue an update directly on the ref that HEAD points at. Its even
rarer to have a non-HEAD symbolic reference whose reflog you expect to
track something else.

Is this for refs/remotes/origin/HEAD to be a symref and have its
reflog mirror the fetch operations that touched the underlying ref?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: reftable [v4]: new ref storage format
  2017-08-01 23:27   ` Shawn Pearce
@ 2017-08-01 23:54     ` Shawn Pearce
  2017-08-02  1:51     ` Michael Haggerty
  1 sibling, 0 replies; 34+ messages in thread
From: Shawn Pearce @ 2017-08-01 23:54 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: git, Jeff King, Junio C Hamano, David Borowitz

On Tue, Aug 1, 2017 at 4:27 PM, Shawn Pearce <spearce@spearce.org> wrote:
> On Mon, Jul 31, 2017 at 11:41 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>> On Sun, Jul 30, 2017 at 8:51 PM, Shawn Pearce <spearce@spearce.org> wrote:
>>> 4th iteration of the reftable storage format.
>>> [...]
>>
>> Before we commit to Shawn's reftable proposal, I wanted to explore
>> what a contrasting design that is not block based would look like.
>
> I forgot to look at a 1k chunk size, as you suggested that might also
> be suitable. Here is the more complete experiment table:
>
>        | size   | seek_cold | seek_hot  |
> mh  1k | 36.6 M | 20.6 usec | 10.7 usec |
> mh  4k | 28.3 M | 24.5 usec | 14.5 usec |
> sp  4k | 29.2 M | 63.0 usec |  5.8 usec |
> sp 64k | 27.7 M | 35.6 usec | 23.3 usec |

Argh. I got that mh 1k size wrong, its actually 29.4M (not 36.6M!).
Sorry for the noise.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: reftable [v4]: new ref storage format
  2017-08-01 20:23   ` Shawn Pearce
@ 2017-08-02  0:49     ` Michael Haggerty
  0 siblings, 0 replies; 34+ messages in thread
From: Michael Haggerty @ 2017-08-02  0:49 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git, Jeff King, Junio C Hamano, David Borowitz

On Tue, Aug 1, 2017 at 1:23 PM, Shawn Pearce <spearce@spearce.org> wrote:
> On Mon, Jul 31, 2017 at 11:41 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>> On Sun, Jul 30, 2017 at 8:51 PM, Shawn Pearce <spearce@spearce.org> wrote:
>>> 4th iteration of the reftable storage format.
>>> [...]
>>
>> Before we commit to Shawn's reftable proposal, I wanted to explore
>> what a contrasting design that is not block based would look like. So
>> I threw together a sketch of such a design. It is not as refined as
>> Shawn's, and I haven't had time to program an implementation or
>> collect statistics, but maybe it is interesting anyway.
>
> Thanks for taking the time to write this out. Its constructive to see
> other possible approaches.
>
> I roughly implemented your proposed design in JGit for references
> only. I skipped the object lookup and log handling for the sake of
> studying the basics of this approach.

Wow, that's awesome, thanks!

>        | size   | seek_cold | seek_hot  |
> mh     | 28.3 M | 24.5 usec | 14.5 usec |
> sp  4k | 29.2 M | 63.0 usec |  5.8 usec |
> sp 64k | 27.7 M | 35.6 usec | 23.3 usec |

OK, so it's at least in the right ballpark.

>> * Multiple related chunks can be stored in a 64 kiB block (interesting
>>   for people who prefer to read files in 64k segments). By storing
>>   related chunks near each other, the number of seeks will probably
>>   typically be smaller than the number of chunks that have to be
>>   accessed.
>
> This is difficult. The most related chunks are actually index chunks
> in the multi-level index. You want the next step(s) near the current
> step to avoid an actual disk seek. But that is hard to do when the
> index is several levels deep.

Given the 64k read size that you prefer, it seems to me that it would
probably be possible to keep pairs of index layers (i.e., one index
node and all of its direct children) in the same 64k range, though
this might require adjusting their sizes somewhat. And possibly to
keep the lowest index layer close to the reference chunks that it
describes (analogous to your format, where the restart records form
something like an index that is located close to the references). So I
would think that it is plausible to arrange things so that a reference
can be read within two 64k reads even if the index has three levels.

>> * The file can be written in two possible ways: with cross-references
>>   between chunks always pointing to later parts of the file (probably
>>   preferable if the data will be read from a local hard disk) or
>>   always pointing to earlier parts of the file (to accomodate writers
>>   who need to write their data in order). See the `offsets_negative`
>>   field in the header. (I'm not sure that the former variant is
>>   actually needed.)
>
> This seems like unnecessary complexity. I'd prefer to just say the
> offsets are negative, and reference backwards.

Maybe. I just wasn't sure whether a hard disk would perform
significantly worse when reading backwards rather than forwards
(because of pre-fetch of blocks). It could very well be unnecessary.

Michael

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: reftable [v4]: new ref storage format
  2017-08-01 23:27   ` Shawn Pearce
  2017-08-01 23:54     ` Shawn Pearce
@ 2017-08-02  1:51     ` Michael Haggerty
  2017-08-02  2:38       ` Shawn Pearce
  2017-08-05 21:00       ` Shawn Pearce
  1 sibling, 2 replies; 34+ messages in thread
From: Michael Haggerty @ 2017-08-02  1:51 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git, Jeff King, Junio C Hamano, David Borowitz

On Tue, Aug 1, 2017 at 4:27 PM, Shawn Pearce <spearce@spearce.org> wrote:
> On Mon, Jul 31, 2017 at 11:41 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>> On Sun, Jul 30, 2017 at 8:51 PM, Shawn Pearce <spearce@spearce.org> wrote:
>>> 4th iteration of the reftable storage format.
>>> [...]
>>
>> Before we commit to Shawn's reftable proposal, I wanted to explore
>> what a contrasting design that is not block based would look like.
>
> I forgot to look at a 1k chunk size, as you suggested that might also
> be suitable. Here is the more complete experiment table:
>
>        | size   | seek_cold | seek_hot  |
> mh  1k | 29.4 M | 20.6 usec | 10.7 usec |   <== Row fixed
> mh  4k | 28.3 M | 24.5 usec | 14.5 usec |
> sp  4k | 29.2 M | 63.0 usec |  5.8 usec |
> sp 64k | 27.7 M | 35.6 usec | 23.3 usec |

At least in this case, the 1k block size seems like a good tradeoff.
Did 1k vs 4k change the number of index levels required?

> A couple of other notes about your contrasting design:
>
>>     elif chunk_type == INDEX {
>>         chunk_length : varint
>
> Using a varint for the chunk length made for a complicated reader.
> JGit doesn't have the luxury of mmap to access the file, so we have to
> allocate a byte[] and read data from a file descriptor to do anything
> fancy like decoding a varint. For my experiment I wound up just
> hardcoding the IO to read 1k or 4k from whatever address.
>
> A "real" implementation would likely prefer to read a fixed width
> field here such that chunks have a 3 byte header (1 byte chunk_type, 2
> byte chunk_length), and then issue a second read to acquire the rest
> of the chunk. Given that encoding a chunk length of 1024 or 4096 both
> requires 2 bytes of varint, its always going to be 2 bytes in your
> design anyway. With the way chunks are scanned, I don't think you want
> chunks as large as 16k, which would have caused the varint to go to 3
> bytes (but still fits in a fixed 2-byte chunk_length).

That's a good point for INDEX and OBJS_INDEX blocks. Though for REFS
blocks that include reflogs, the block size has to be large enough to
hold the whole reflog for a reference, which can be arbitrarily large.
(Maybe this is a weakness of the design?) OBJS blocks can also be
unbounded in size if very many references point at the same object,
thought that is perhaps only a theoretical problem.

> My reftable proposal should still do well in a mmap region. Most of
> the cold start penalty for reftable is JGit copying the ref index from
> the file descriptor to the memory block where we can parse the format.
> That is why the cold_seek time declines for a larger block size, the
> index is smaller.
>
>
>>         first_child : {
>>             refname : varstr
>>             index_payload
>>         }
>>         other_children : {
>>             # Length of prefix being carried over from the previous
>>             # record:
>>             prefix_len : varint
>>             suffix : varstr
>>             index_payload
>
> Having no prefix_len on first_child made for a slightly funkier
> parser. It does save you a byte, but the parser has to know if its
> looking at the first child, or an other_children to know if it should
> expect the prefix_len. Its a simple condition, but it kind of grated
> on me when I wrote that particular section of the experiment. For the
> majority of records the parser considers, the prefix_len is always
> present.
>
> That is why I proposed the restart_offsets point to the prefix_len,
> and prefix_len = 0 at restart points. It slightly simplified the
> parser.

Yes, that seems like a reasonable compromise.

>>     elif chunk_type == OBJS_INDEX {
>>         chunk_length : varint
>>
>>         # The offset, relative to the start of this chunk, of the
>>         # chunk containing the next level of the obj index, for each
>>         # of the possible "next" bytes in the SHA-1, or zero if there
>>         # are no references with the given next byte.
>>         child_offset : varint * 256
>
> This is space saving and cute, but kind of annoying. If it was fixed
> width 32 bit you can address up to 4G away from this chunk's address,
> and you can directly jump to the byte of interest. By being varints
> you do save a little space, as most files will probably only need 3
> byte varints, and the 0s do collapse down to 1 byte, but you have to
> linearly walk the list to find any specific byte.

The reason I used varint here was mostly because I expect the lowest
level of the OBJS_INDEX to be placed close to the OBJS chunks that it
refers to; hopefully within a 2-byte varint. Let me consider more
carefully whether that is realistic...

An OBJS_INDEX should be well under 1kB in size. The lowest-level
OBJS_INDEX will presumably be relatively but not totally full
(otherwise you'd add another OBJS_INDEX level); suppose 1/2 of its
entries are filled. Most of those entries should contain only a single
SHA-1.

An obj_record will typically be 3 bytes plus the size of one
`child_addr`. The latter is in the earlier part of the file, but for
your 30 MB example, that will probably be 4 bytes per address. So the
total size of the obj_records would be about 7 bytes * 128 which is
also less than a kB.

So a 2-byte varint should be more than enough for child_offset for the
lowest-level OBJS_INDEXes (which in turn are by far the most common
types of OBJS_INDEXes). And if the OBJ_INDEX is less full, the zeros
can be stored in one byte.

Indeed, it might make sense to special-case the lowest-level
OBJS_INDEX chunk type to use uint16 pointers and store the obj_records
directly in the chunk, following the child_offsets.

The higher-level OBJS_INDEXes might have to point further away. I
think it would be a bad idea to limit reftable file sizes to 4 GB,
though maybe it's not unreasonable to limit the OBJS_INDEX part of the
file to that size so that a uint32 to be used?

Peff and I discussed off-list whether the lookup-by-SHA-1 feature is
so important in the first place. Currently, all references must be
scanned for the advertisement anyway, so avoiding a second scan to vet
SHA-1s received from the client is at best going to reduce the effort
by a constant factor. Do you have numbers showing that this
optimization is worth it?

OTOH a mythical protocol v2 might reduce the need to scan the
references for advertisement, so maybe this optimization will be more
helpful in the future?

>> ref_payload = {
>>     value_type : enum NO_VALUE
>>                     | DELETED
>>                     | VALUE | VALUE_PEELED
>>                     | SYMREF | SYMREF_PEELED
>>                     | SPECIAL
>>     log_type : enum NO_REFLOG | REFLOG | REFLOG_COMPRESSED
>>     symref_target : bool
>
> FWIW I didn't implement log_type or symref_target in my experiment, so
> the size per ref was maybe a few bytes smaller than what you outlined
> here.

But value_type, log_type, and symref_target should all fit within a
single byte, no?

>>     # This field is used to keep backwards links from references to
>>     # any symrefs that point at them, to make it tractable to update
>>     # the reflog of the symref if the reference is changed directly:
>>     if symref_target {
>>         referer : varstr
>>         varint(0)
>>     }
>
> I wonder how desirable this feature is. Most updates are done through
> HEAD, which is a symref and can therefore update both HEAD and the
> target's reflogs in the same operation. It seems to me its rare to
> issue an update directly on the ref that HEAD points at. Its even
> rarer to have a non-HEAD symbolic reference whose reflog you expect to
> track something else.
>
> Is this for refs/remotes/origin/HEAD to be a symref and have its
> reflog mirror the fetch operations that touched the underlying ref?

What I was thinking is that we don't update symrefs' reflogs correctly
if the pointed-to reference is updated (except for HEAD) and it would
be nice to fix that problem. Your case of `refs/remotes/origin/HEAD`
is an example. Or on GitHub's servers, if we wanted to store all of
the forks of a repo in a single repository, we might want to use a
namespace for each fork, like `refs/forks/NNNN/*`, and store the
fork's default branch in `refs/forks/NNNN/HEAD`. The current refs code
wouldn't know to update such a symref's reflog (though of course it
could be special-cased in like HEAD is now).

That's what I was thinking. But I've yet to hear anybody complain
about missing reflogs for symrefs if the underlying reference is
updated directly, so maybe we should just leave that "problem"
unsolved. It is certainly simpler and less brittle not to have to keep
backreferences like these in sync with the forward references.

Michael

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: reftable [v4]: new ref storage format
  2017-08-02  1:51     ` Michael Haggerty
@ 2017-08-02  2:38       ` Shawn Pearce
  2017-08-02  9:28         ` Jeff King
                           ` (2 more replies)
  2017-08-05 21:00       ` Shawn Pearce
  1 sibling, 3 replies; 34+ messages in thread
From: Shawn Pearce @ 2017-08-02  2:38 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: git, Jeff King, Junio C Hamano, David Borowitz

On Tue, Aug 1, 2017 at 6:51 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
> On Tue, Aug 1, 2017 at 4:27 PM, Shawn Pearce <spearce@spearce.org> wrote:
>> On Mon, Jul 31, 2017 at 11:41 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>>> On Sun, Jul 30, 2017 at 8:51 PM, Shawn Pearce <spearce@spearce.org> wrote:
>>>> 4th iteration of the reftable storage format.
>>>> [...]
>>>
>>> Before we commit to Shawn's reftable proposal, I wanted to explore
>>> what a contrasting design that is not block based would look like.
>>
>> I forgot to look at a 1k chunk size, as you suggested that might also
>> be suitable. Here is the more complete experiment table:
>>
>>        | size   | seek_cold | seek_hot  |
>> mh  1k | 29.4 M | 20.6 usec | 10.7 usec |   <== Row fixed
>> mh  4k | 28.3 M | 24.5 usec | 14.5 usec |
>> sp  4k | 29.2 M | 63.0 usec |  5.8 usec |
>> sp 64k | 27.7 M | 35.6 usec | 23.3 usec |
>
> At least in this case, the 1k block size seems like a good tradeoff.
> Did 1k vs 4k change the number of index levels required?

Yes. 1k needed 4 levels of index; 4k needed 2 levels.
More chunks, with smaller index chunks, forces a deeper index depth.

>> A couple of other notes about your contrasting design:
>>
>>>     elif chunk_type == INDEX {
>>>         chunk_length : varint
>>
>> Using a varint for the chunk length made for a complicated reader.
>> JGit doesn't have the luxury of mmap to access the file, so we have to
>> allocate a byte[] and read data from a file descriptor to do anything
>> fancy like decoding a varint. For my experiment I wound up just
>> hardcoding the IO to read 1k or 4k from whatever address.
>>
>> A "real" implementation would likely prefer to read a fixed width
>> field here such that chunks have a 3 byte header (1 byte chunk_type, 2
>> byte chunk_length), and then issue a second read to acquire the rest
>> of the chunk. Given that encoding a chunk length of 1024 or 4096 both
>> requires 2 bytes of varint, its always going to be 2 bytes in your
>> design anyway. With the way chunks are scanned, I don't think you want
>> chunks as large as 16k, which would have caused the varint to go to 3
>> bytes (but still fits in a fixed 2-byte chunk_length).
>
> That's a good point for INDEX and OBJS_INDEX blocks. Though for REFS
> blocks that include reflogs, the block size has to be large enough to
> hold the whole reflog for a reference, which can be arbitrarily large.
> (Maybe this is a weakness of the design?)

Gah. I missed that part about the whole reflog needing to fit in the
same chunk when I wrote the quoted text above. That is a pretty big
downside for very busy refs.

Even when you isolate logs into their own file, the reflog for a busy
ref could be huge. A naive reader would want to "page in" the entire
chunk_length before parsing. That isn't tenable if the reflog for a
busy ref was say 50 MiB. It complicates the reader more. My reftable
proposal deals with this by breaking the log up with its special key
structure.

Requiring the entire reflog of a single ref to fit into a single chunk
does seem to have its downsides.

> OBJS blocks can also be
> unbounded in size if very many references point at the same object,
> thought that is perhaps only a theoretical problem.

Gah, I missed that in reftable. The block id pointer list could cause
a single object id to exceed what fits in a block, and that will cause
the writer to fail unless its caller sets the block size larger. I
basically assumed this overflow condition is very unlikely, as its not
common to have a huge number of refs pointing to the same object.

>>>     elif chunk_type == OBJS_INDEX {
>>>         chunk_length : varint
>>>
>>>         # The offset, relative to the start of this chunk, of the
>>>         # chunk containing the next level of the obj index, for each
>>>         # of the possible "next" bytes in the SHA-1, or zero if there
>>>         # are no references with the given next byte.
>>>         child_offset : varint * 256
>>
>> This is space saving and cute, but kind of annoying. If it was fixed
>> width 32 bit you can address up to 4G away from this chunk's address,
>> and you can directly jump to the byte of interest. By being varints
>> you do save a little space, as most files will probably only need 3
>> byte varints, and the 0s do collapse down to 1 byte, but you have to
>> linearly walk the list to find any specific byte.
>
> The reason I used varint here was mostly because I expect the lowest
> level of the OBJS_INDEX to be placed close to the OBJS chunks that it
> refers to; hopefully within a 2-byte varint. Let me consider more
> carefully whether that is realistic...

Ah, ok. So you were expecting the writer to interleave OBJS and
OBJS_INDEX chunks, with the OBJS_INDEX appearing every ~256 OBJS
chunks. And then storing the next level of OBJS_INDEX consecutively.

I think your math is still wrong about the lowest level OBJS_INDEX
needing only 2-byte varint for child_offset. Given a 1k chunk size,
you still have to address backwards about 256k, which requires a
3-byte varint.

Given the uniform distribution of SHA-1, I think you still wind up
with most of the OBJS_INDEX populated. E.g. in my 866k ref/865k obj
example the unique object abbreviation is 6 raw bytes. The OBJS chunk
storing 2 bytes means we have 4 levels of OBJS_INDEX to cover the
first 4 bytes. That leaves ~844 obj_records in each OBJS chunk. If
those are 7 bytes/obj_record, the OBJS chunk is ~5.7 KiB.

Forcing the OBJS chunk to stay under say 4 KiB will absolutely cause a
lot more holes in the OBJS_INDEX levels.

> Peff and I discussed off-list whether the lookup-by-SHA-1 feature is
> so important in the first place. Currently, all references must be
> scanned for the advertisement anyway,

Not really. You can hide refs and allow-tip-sha1 so clients can fetch
a ref even if it wasn't in the advertisement. We really want to use
that wire protocol capability with Gerrit Code Review to hide the
refs/changes/ namespace from the advertisement, but allow clients to
fetch any of those refs if they send its current SHA-1 in a want line
anyway.

So a server could scan only the refs/{heads,tags}/ prefixes for the
advertisement, and then leverage the lookup-by-SHA1 to verify other
SHA-1s sent by the client.

> so avoiding a second scan to vet
> SHA-1s received from the client is at best going to reduce the effort
> by a constant factor. Do you have numbers showing that this
> optimization is worth it?

No, but I don't think I need to do much to prove it. My 866k ref
example advertisement right now is >62 MiB. If we do what I'm
suggesting in the paragraphs above, the advertisement is ~51 KiB.

> OTOH a mythical protocol v2 might reduce the need to scan the
> references for advertisement, so maybe this optimization will be more
> helpful in the future?

Yes, I'm hopeful we can get a v2 protocol built on the work Jonathan
Tan is doing, and switch the advertisement around to "client speaks
first", so that we can be smarter on the server about which refs are
read and sent. That is a long way off, lets say 3-5 years before its
really common in clients.

>>> ref_payload = {
>>>     value_type : enum NO_VALUE
>>>                     | DELETED
>>>                     | VALUE | VALUE_PEELED
>>>                     | SYMREF | SYMREF_PEELED
>>>                     | SPECIAL
>>>     log_type : enum NO_REFLOG | REFLOG | REFLOG_COMPRESSED
>>>     symref_target : bool
>>
>> FWIW I didn't implement log_type or symref_target in my experiment, so
>> the size per ref was maybe a few bytes smaller than what you outlined
>> here.
>
> But value_type, log_type, and symref_target should all fit within a
> single byte, no?

Yes. So my experiment was maybe ~866 KiB off in total file size. :)

>>>     # This field is used to keep backwards links from references to
>>>     # any symrefs that point at them, to make it tractable to update
>>>     # the reflog of the symref if the reference is changed directly:
>>>     if symref_target {
>>>         referer : varstr
>>>         varint(0)
>>>     }
>>
>> I wonder how desirable this feature is. Most updates are done through
>> HEAD, which is a symref and can therefore update both HEAD and the
>> target's reflogs in the same operation. It seems to me its rare to
>> issue an update directly on the ref that HEAD points at. Its even
>> rarer to have a non-HEAD symbolic reference whose reflog you expect to
>> track something else.
>>
>> Is this for refs/remotes/origin/HEAD to be a symref and have its
>> reflog mirror the fetch operations that touched the underlying ref?
>
> What I was thinking is that we don't update symrefs' reflogs correctly
> if the pointed-to reference is updated (except for HEAD) and it would
> be nice to fix that problem. Your case of `refs/remotes/origin/HEAD`
> is an example. Or on GitHub's servers, if we wanted to store all of
> the forks of a repo in a single repository, we might want to use a
> namespace for each fork, like `refs/forks/NNNN/*`, and store the
> fork's default branch in `refs/forks/NNNN/HEAD`. The current refs code
> wouldn't know to update such a symref's reflog (though of course it
> could be special-cased in like HEAD is now).

Servers today don't update HEAD reflog when a branch is pushed. I
think trying to record that is overkill, you have the reflog data in
the ref itself that the client sent the command to modify.

> That's what I was thinking. But I've yet to hear anybody complain
> about missing reflogs for symrefs if the underlying reference is
> updated directly, so maybe we should just leave that "problem"
> unsolved. It is certainly simpler and less brittle not to have to keep
> backreferences like these in sync with the forward references.

Yup, that is my take on it, and why I didn't try to put this into
reftable drafts, even though it was discussed between us on the list
in earlier messages.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: reftable [v4]: new ref storage format
  2017-08-02  2:38       ` Shawn Pearce
@ 2017-08-02  9:28         ` Jeff King
  2017-08-02 15:17           ` Shawn Pearce
  2017-08-02 12:20         ` Dave Borowitz
  2017-08-03 18:38         ` Michael Haggerty
  2 siblings, 1 reply; 34+ messages in thread
From: Jeff King @ 2017-08-02  9:28 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Michael Haggerty, git, Junio C Hamano, David Borowitz

On Tue, Aug 01, 2017 at 07:38:37PM -0700, Shawn Pearce wrote:

> > OBJS blocks can also be
> > unbounded in size if very many references point at the same object,
> > thought that is perhaps only a theoretical problem.
> 
> Gah, I missed that in reftable. The block id pointer list could cause
> a single object id to exceed what fits in a block, and that will cause
> the writer to fail unless its caller sets the block size larger. I
> basically assumed this overflow condition is very unlikely, as its not
> common to have a huge number of refs pointing to the same object.

It's actually quite common for us, as we have big shared-object repos
that contain a copy of the refs of all of their child repos (for
reachability during packing, etc). So tags, where the value is the same
in each fork, you have one ref per fork pointing to it.

Just peeking at torvalds/linux, we have some objects with ~35K refs
pointing to them (e.g., the v2.6.11 tag).

> > Peff and I discussed off-list whether the lookup-by-SHA-1 feature is
> > so important in the first place. Currently, all references must be
> > scanned for the advertisement anyway,
> 
> Not really. You can hide refs and allow-tip-sha1 so clients can fetch
> a ref even if it wasn't in the advertisement. We really want to use
> that wire protocol capability with Gerrit Code Review to hide the
> refs/changes/ namespace from the advertisement, but allow clients to
> fetch any of those refs if they send its current SHA-1 in a want line
> anyway.
> 
> So a server could scan only the refs/{heads,tags}/ prefixes for the
> advertisement, and then leverage the lookup-by-SHA1 to verify other
> SHA-1s sent by the client.

Yeah, that makes sense (though I hope in the end that strategy will go
away in favor of a better protocol, as getting the sha1 out-of-band has
obvious UX complexities).

> > OTOH a mythical protocol v2 might reduce the need to scan the
> > references for advertisement, so maybe this optimization will be more
> > helpful in the future?
> 
> Yes, I'm hopeful we can get a v2 protocol built on the work Jonathan
> Tan is doing, and switch the advertisement around to "client speaks
> first", so that we can be smarter on the server about which refs are
> read and sent. That is a long way off, lets say 3-5 years before its
> really common in clients.

I was actually planning to spend some time on this in the next month or
two. I don't think it needs to be that complicated. We don't need a
whole protocol revamp. We just need a way to get a few bits from the
client before the advertisement, and from there we can bootstrap any
more radical protocol changes we want.

I know it will take a while before it's something we can expect in
clients, but it's definitely worth planning around. And sometimes a
feature like this can drive upgrades, if it's something that produces an
immediate and obvious benefit to the client.

> Servers today don't update HEAD reflog when a branch is pushed. I
> think trying to record that is overkill, you have the reflog data in
> the ref itself that the client sent the command to modify.

I think they do, at least for C git:

  $ git init --bare dst.git
  $ git -C dst.git config core.logallrefupdates
  $ git push dst.git
  ...
  To dst.git
   * [new branch]      master -> master
  $ find dst.git/logs -type f | xargs wc -l
    1 dst.git/logs/refs/heads/master
    1 dst.git/logs/HEAD

The special logic for "see if we're updating the ref that HEAD points
to" is deep in the ref machinery, so it gets triggered for all updates,
including pushes.

I agree it's not actually that interesting for a bare repo, where HEAD
isn't that meaningful (and doesn't tend to change a lot anyway).

> > That's what I was thinking. But I've yet to hear anybody complain
> > about missing reflogs for symrefs if the underlying reference is
> > updated directly, so maybe we should just leave that "problem"
> > unsolved. It is certainly simpler and less brittle not to have to keep
> > backreferences like these in sync with the forward references.
> 
> Yup, that is my take on it, and why I didn't try to put this into
> reftable drafts, even though it was discussed between us on the list
> in earlier messages.

Yeah, I'd agree. It might be worth doing a better job of showing the
before/after destinations in the reflog when updating a symbolic ref,
which would let you reconstruct the state from the pointed-to reflogs if
you cared to. But that's orthogonal to the storage format (you can do it
already if you bother to pass a good message to "symbolic-ref -m").

-Peff

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: reftable [v4]: new ref storage format
  2017-08-02  2:38       ` Shawn Pearce
  2017-08-02  9:28         ` Jeff King
@ 2017-08-02 12:20         ` Dave Borowitz
  2017-08-02 17:18           ` Jeff King
  2017-08-03 18:38         ` Michael Haggerty
  2 siblings, 1 reply; 34+ messages in thread
From: Dave Borowitz @ 2017-08-02 12:20 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Michael Haggerty, git, Jeff King, Junio C Hamano

On Tue, Aug 1, 2017 at 10:38 PM, Shawn Pearce <spearce@spearce.org> wrote:
>> Peff and I discussed off-list whether the lookup-by-SHA-1 feature is
>> so important in the first place. Currently, all references must be
>> scanned for the advertisement anyway,
>
> Not really. You can hide refs and allow-tip-sha1 so clients can fetch
> a ref even if it wasn't in the advertisement. We really want to use
> that wire protocol capability with Gerrit Code Review to hide the
> refs/changes/ namespace from the advertisement, but allow clients to
> fetch any of those refs if they send its current SHA-1 in a want line
> anyway.
>
> So a server could scan only the refs/{heads,tags}/ prefixes for the
> advertisement, and then leverage the lookup-by-SHA1 to verify other
> SHA-1s sent by the client.
>
>> so avoiding a second scan to vet
>> SHA-1s received from the client is at best going to reduce the effort
>> by a constant factor. Do you have numbers showing that this
>> optimization is worth it?
>
> No, but I don't think I need to do much to prove it. My 866k ref
> example advertisement right now is >62 MiB. If we do what I'm
> suggesting in the paragraphs above, the advertisement is ~51 KiB.

That being said, our bias towards minimizing the number of ref scans
is rooted in our experience where scanning 866k refs takes 5 seconds
to get the response from the storage backend into the git server.
Cutting ref scans from 2 to 1 (or 1 to 0) is a big deal in that case.
But that 5s number is based on our current, slow storage, not on
reftable. If migrating to reftable turns each 5s scan into a 400ms
scan, we might be able to live with that, even if we don't have fast
lookup by SHA-1.

>> OTOH a mythical protocol v2 might reduce the need to scan the
>> references for advertisement, so maybe this optimization will be more
>> helpful in the future?

I haven't been following the status of the proposal, but I was
assuming a client-speaks-first protocol would also imply the client
asking for refnames, not SHA-1s, in which case lookup by SHA-1 is no
longer relevant.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: reftable [v4]: new ref storage format
  2017-08-02  9:28         ` Jeff King
@ 2017-08-02 15:17           ` Shawn Pearce
  2017-08-02 16:51             ` Junio C Hamano
  2017-08-02 17:28             ` Jeff King
  0 siblings, 2 replies; 34+ messages in thread
From: Shawn Pearce @ 2017-08-02 15:17 UTC (permalink / raw)
  To: Jeff King; +Cc: Michael Haggerty, git, Junio C Hamano, David Borowitz

On Wed, Aug 2, 2017 at 2:28 AM, Jeff King <peff@peff.net> wrote:
> On Tue, Aug 01, 2017 at 07:38:37PM -0700, Shawn Pearce wrote:
>
>> > OBJS blocks can also be
>> > unbounded in size if very many references point at the same object,
>> > thought that is perhaps only a theoretical problem.
>>
>> Gah, I missed that in reftable. The block id pointer list could cause
>> a single object id to exceed what fits in a block, and that will cause
>> the writer to fail unless its caller sets the block size larger. I
>> basically assumed this overflow condition is very unlikely, as its not
>> common to have a huge number of refs pointing to the same object.
>
> It's actually quite common for us, as we have big shared-object repos
> that contain a copy of the refs of all of their child repos (for
> reachability during packing, etc). So tags, where the value is the same
> in each fork, you have one ref per fork pointing to it.
>
> Just peeking at torvalds/linux, we have some objects with ~35K refs
> pointing to them (e.g., the v2.6.11 tag).

Oy. I'll bet that every occurrence winds up in its own block due to
the layout of the namespace, and so the obj block list needs 35k
varint pointers. That requires a larger block size if it has any
chance of fitting into the reftable format.

Another option is disable the obj table for these shared-object repos.
Its an optional part of the format and can be omitted if the reader
isn't likely to need to lookup by SHA-1, or is willing to pay the
brute force cost of scanning every ref.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: reftable [v4]: new ref storage format
  2017-08-02 15:17           ` Shawn Pearce
@ 2017-08-02 16:51             ` Junio C Hamano
  2017-08-02 17:28             ` Jeff King
  1 sibling, 0 replies; 34+ messages in thread
From: Junio C Hamano @ 2017-08-02 16:51 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Jeff King, Michael Haggerty, git, David Borowitz

Shawn Pearce <spearce@spearce.org> writes:

> On Wed, Aug 2, 2017 at 2:28 AM, Jeff King <peff@peff.net> wrote:
>> On Tue, Aug 01, 2017 at 07:38:37PM -0700, Shawn Pearce wrote:
>>
>>> > OBJS blocks can also be
>>> > unbounded in size if very many references point at the same object,
>>> > thought that is perhaps only a theoretical problem.
>>>
>>> Gah, I missed that in reftable. The block id pointer list could cause
>>> a single object id to exceed what fits in a block, and that will cause
>>> the writer to fail unless its caller sets the block size larger. I
>>> basically assumed this overflow condition is very unlikely, as its not
>>> common to have a huge number of refs pointing to the same object.
>>
>> It's actually quite common for us, as we have big shared-object repos
>> that contain a copy of the refs of all of their child repos (for
>> reachability during packing, etc). So tags, where the value is the same
>> in each fork, you have one ref per fork pointing to it.
>>
>> Just peeking at torvalds/linux, we have some objects with ~35K refs
>> pointing to them (e.g., the v2.6.11 tag).
>
> Oy. I'll bet that every occurrence winds up in its own block due to
> the layout of the namespace, and so the obj block list needs 35k
> varint pointers. That requires a larger block size if it has any
> chance of fitting into the reftable format.
>
> Another option is disable the obj table for these shared-object repos.
> Its an optional part of the format and can be omitted if the reader
> isn't likely to need to lookup by SHA-1, or is willing to pay the
> brute force cost of scanning every ref.

I am wondering if we need the reverse look-up for a regular
repository that allows "fetch anything at the tip".  It only needs
"I got this request for an object name--does it sit at the tip of
any ref?  Yes/No".  It does not need to know exactly which ref
points at the asked object.

Yes, I know Gerrit has its own ACL that limits the visibility of
refs so the question becomes "does it sit at the tip of any ref that
is visible to the current user?".  An Yes/No answer for "any ref?"
may still give you a short-cut when rejecting, but you'd then need
to scan to give a positive answer without a full reverse mapping.

For the use of "consolidated object store for all forks", I do not
think it makes sense to even have "Does it sit at a tip, Yes/No"
information.  Even when Linus's repository and a fork share the same
object store, I do not think anybody expects to be able to fetch a
commit at the tip of a branch in the fork but not in Linus's
repository.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: reftable [v4]: new ref storage format
  2017-08-02 12:20         ` Dave Borowitz
@ 2017-08-02 17:18           ` Jeff King
  0 siblings, 0 replies; 34+ messages in thread
From: Jeff King @ 2017-08-02 17:18 UTC (permalink / raw)
  To: Dave Borowitz; +Cc: Shawn Pearce, Michael Haggerty, git, Junio C Hamano

On Wed, Aug 02, 2017 at 08:20:44AM -0400, Dave Borowitz wrote:

> >> OTOH a mythical protocol v2 might reduce the need to scan the
> >> references for advertisement, so maybe this optimization will be more
> >> helpful in the future?
> 
> I haven't been following the status of the proposal, but I was
> assuming a client-speaks-first protocol would also imply the client
> asking for refnames, not SHA-1s, in which case lookup by SHA-1 is no
> longer relevant.

Good point. The hidden-refs thing Shawn described is a trick that would
be used because the current protocol is so lousy. It's not clear how a
stateless-rpc request would work, but in theory the follow-up requests
could also say "hey, I'm only interested in refs/{heads,tags}" and the
stateless server could limit is marking to that.

But that would still leave something like allowReachableSHA1InWant
having to look at all refs (and I don't see why a client requesting by
sha1 would give any limiting refspec at all). On the other hand, looking
at the ref tips would probably be dominated by actually traversing the
commits in most cases. Of course one could come up with a pathological
case pretty easily (tons of refs, and people asking for submodules at
tip commits).

So I do think there are cases where the optimization would help, but I
still not sure how much. If it's an optional bit of the design, we at
least have the option of just not generating if it turns out not to be
useful. My only concern would be if we make other protocol sacrifices or
complications to include it.

-Peff

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: reftable [v4]: new ref storage format
  2017-08-02 15:17           ` Shawn Pearce
  2017-08-02 16:51             ` Junio C Hamano
@ 2017-08-02 17:28             ` Jeff King
  1 sibling, 0 replies; 34+ messages in thread
From: Jeff King @ 2017-08-02 17:28 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Michael Haggerty, git, Junio C Hamano, David Borowitz

On Wed, Aug 02, 2017 at 08:17:29AM -0700, Shawn Pearce wrote:

> > Just peeking at torvalds/linux, we have some objects with ~35K refs
> > pointing to them (e.g., the v2.6.11 tag).
> 
> Oy. I'll bet that every occurrence winds up in its own block due to
> the layout of the namespace, and so the obj block list needs 35k
> varint pointers. That requires a larger block size if it has any
> chance of fitting into the reftable format.
> 
> Another option is disable the obj table for these shared-object repos.
> Its an optional part of the format and can be omitted if the reader
> isn't likely to need to lookup by SHA-1, or is willing to pay the
> brute force cost of scanning every ref.

Yeah, sorry, I meant to write a few more paragraphs. I think refusing to
generate the object table for these repos would be OK. We don't serve
any user-facing operations out of them directly[1].

I'm also open to the argument that they're simply insane. Most of the
time we don't need them to be a real repository at all. They could exist
as a bare "objects/" directory. It's only at repack time that we
actually need to know which objects are reachable[2], so we could
do a special repack that generates the list on the fly from the child
repositories.

-Peff

[1] We actually disable the ".have" advertisements, because the cost of
    showing all of the shared-storage ref tips is too high. One thing
    I'd like to do is be able to advertise a subset of the alternate
    refs (if you're a fork of torvalds/linux, then share _just_ the refs
    from there). But with the current ref code, I can't even ask for a
    subset of the refs without paying the cost to walk all of them.
    That's one of the things I'd like to build on top of the mmap'd
    packed-refs solution (and naturally would work with reftables, too).

[2] It's a bit more complicated than just knowing the list of reachable
    objects. We also want to know which ones are reachable from which
    fork, as we do some trickery to avoid creating deltas across forks.
    So we really do want the whole ref-list, and not something like a
    de-duped set of reachable tips. I don't think that makes a
    difference for anything we're discussing here, but just a bit of
    trivia in case somebody is thinking about the shared-storage problem
    space.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: reftable [v4]: new ref storage format
  2017-07-31  3:51 reftable [v4]: new ref storage format Shawn Pearce
                   ` (4 preceding siblings ...)
  2017-08-01 13:54 ` Dave Borowitz
@ 2017-08-02 19:50 ` Junio C Hamano
  2017-08-02 20:28   ` Jeff King
  2017-08-03  1:50   ` Junio C Hamano
  2017-08-02 19:54 ` Stefan Beller
  6 siblings, 2 replies; 34+ messages in thread
From: Junio C Hamano @ 2017-08-02 19:50 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git, Jeff King, Michael Haggerty, David Borowitz

Shawn Pearce <spearce@spearce.org> writes:

> ### Layout
>
> The `$GIT_DIR/refs` path is a file when reftable is configured, not a
> directory.  This prevents loose references from being stored.
>
> A collection of reftable files are stored in the `$GIT_DIR/reftable/`
> directory:
>
>     00000001_UF4paF
>     00000002_bUVgy4
>
> where reftable files are named by a unique name such as produced by
> the function:
>
>     mktemp "${update_index}_XXXXXX"
>
> The stack ordering file is `$GIT_DIR/refs` and lists the current
> files, one per line, in order, from oldest (base) to newest (most
> recent):
>
>     $ cat .git/refs
>     00000001_UF4paF
>     00000002_bUVgy4
>
> Readers must read `$GIT_DIR/refs` to determine which files are
> relevant right now, and search through the stack in reverse order
> (last reftable is examined first).
>
> Reftable files not listed in `refs` may be new (and about to be added
> to the stack by the active writer), or ancient and ready to be pruned.

I like the general idea, what the file format can represent and how
it does so, but I am a bit uneasy about how well this "stacked" part
would work for desktop clients.  The structure presented here is for
optimizing the "we want to learn about many (or all) refs" access
pattern, which probably matters a lot on the server implementations,
but I do not feel comfortable without knowing how much it penalizes
"I want the current value of this single ref" access pattern.

With the traditional "packed-refs plus loose" layout, no matter how
many times a handful of selected busy refs are updated during the
day, you'd need to open at most two files to find out the current
value of a single ref (admittedly, the accessing of the second file,
after we realize that there is no loose one, would be very costly).
If you make a few commits on a topic branch A, then build a 100
commit series on top of another topic branch B, finding the current
value of A is still one open and read of refs/heads/A.

With the reftable format, we'd need to open and read all 100
incremental transactions that touch branch B before realizing that
none of them talk about A, and read the next transaction file to
find the current value of A.  To keep this number low, we'd need
quite a frequent compaction.

We can just declare that reftable format is not for desktop clients
but for server implementations where frequent compaction would not
be an annoyance to the users, but I'd wish we do not have to.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: reftable [v4]: new ref storage format
  2017-07-31  3:51 reftable [v4]: new ref storage format Shawn Pearce
                   ` (5 preceding siblings ...)
  2017-08-02 19:50 ` Junio C Hamano
@ 2017-08-02 19:54 ` Stefan Beller
  6 siblings, 0 replies; 34+ messages in thread
From: Stefan Beller @ 2017-08-02 19:54 UTC (permalink / raw)
  To: Shawn Pearce
  Cc: git, Jeff King, Michael Haggerty, Junio C Hamano, David Borowitz

> ### Ref block format
>
> A ref block is written as:
>
>     'r'
>     uint24( block_len )
>     ref_record+
>     uint32( restart_offset )+
>     uint16( restart_count )
>     padding?
>

So I learned that your current writer is a two block pass,
i.e. the block is first written into memory and then once
the block looks complete it is written out to disk.

This would allow us to shuffle the data around during
the actual out-to-disk-phase, such as this:

  'r'
  uint24( restart_count )
  uint32( restart_offset )+
  ref_record+
  ref_record_endmarker
  padding?

(A) In nearby emails we discussed to have the restart offsets
to be 24 bit, but now they are 32-bit aligned to the start of a block
so we could keep them 32 bit for simplicity of reading.

(B) Note how there is no block_len encoding, which was originally
only needed to lookup the position of restart_count. (so even for that
we could rename it to padding_len, such that the position of
restart_count can be decoded easily)

We no longer need the block_len as the restart_count comes right
after the 'r'.

Instead we'll have a ref_record_endmarker that reads as a ref
with both prefix and suffix to '0', type deletion (such that there is
no further cost). The end marker would only need two '0's, which
makes it indistinguishable from padding.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: reftable [v4]: new ref storage format
  2017-08-02 19:50 ` Junio C Hamano
@ 2017-08-02 20:28   ` Jeff King
  2017-08-03 22:17     ` Shawn Pearce
  2017-08-03  1:50   ` Junio C Hamano
  1 sibling, 1 reply; 34+ messages in thread
From: Jeff King @ 2017-08-02 20:28 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Shawn Pearce, git, Michael Haggerty, David Borowitz

On Wed, Aug 02, 2017 at 12:50:39PM -0700, Junio C Hamano wrote:

> With the traditional "packed-refs plus loose" layout, no matter how
> many times a handful of selected busy refs are updated during the
> day, you'd need to open at most two files to find out the current
> value of a single ref (admittedly, the accessing of the second file,
> after we realize that there is no loose one, would be very costly).
> If you make a few commits on a topic branch A, then build a 100
> commit series on top of another topic branch B, finding the current
> value of A is still one open and read of refs/heads/A.
> 
> With the reftable format, we'd need to open and read all 100
> incremental transactions that touch branch B before realizing that
> none of them talk about A, and read the next transaction file to
> find the current value of A.  To keep this number low, we'd need
> quite a frequent compaction.

I think this is where compaction cleverness can come in.

One relatively easy optimization would be to note when the most recent
reftable contains a subset of the refs we are currently updating (and
the obvious string-of-updates to a single ref falls under that), and do
a "quick" compaction where we simply drop[1] that reftable in favor of
ours. That compaction is essentially free, because we know those entries
aren't valid anymore anyway.

I'm actually not sure if this is a strict "drop", though, because of
reflogs. If the reflog is written into the same file as the ref update,
then you'd need to roll its entry into your new update, too. But see
below anyway.

For more complicated cases, there's some cleverness you can do with
small on-the-fly compactions. Even if there are entries in the last few
reftables that we're not currently updating, it's pretty cheap to roll a
few of them up into our new reftable if it lets us drop some
intermediate reftables. E.g., if we're writing a new reftable with a 4K
block size but only have 100 bytes of new data, we're probably best to
roll up a previous 500-byte reftable.

That one's an obvious optimization because we know that the filesystem
is going to make us spend 4K either way, so rounding up to that is
generally free-ish.

What's less obvious is when we should roll up a bunch of 4K tables into
one (let's say) 32K table.  I think there's a formula for doing this
geometrically so that the amortized cost of writing stays under a
certain bound (linear?). But I haven't thought it through (or looked it
up); I was hoping Shawn had already done so and could dump that wisdom
on us.

> We can just declare that reftable format is not for desktop clients
> but for server implementations where frequent compaction would not
> be an annoyance to the users, but I'd wish we do not have to.

Yeah, I agree we should avoid that. I would love to eventually kill off
the loose/packed backend (or at least make it historical and read-only,
so that a reasonable response to people complaining about its races and
lack of atomicity is "please upgrade to reftables").

-Peff

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: reftable [v4]: new ref storage format
  2017-08-02 19:50 ` Junio C Hamano
  2017-08-02 20:28   ` Jeff King
@ 2017-08-03  1:50   ` Junio C Hamano
  2017-08-03  2:21     ` Shawn Pearce
  1 sibling, 1 reply; 34+ messages in thread
From: Junio C Hamano @ 2017-08-03  1:50 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git, Jeff King, Michael Haggerty, David Borowitz

Junio C Hamano <gitster@pobox.com> writes:

> I like the general idea, what the file format can represent and how
> it does so, but I am a bit uneasy about how well this "stacked" part
> would work for desktop clients.

Two more random things before I forget.

 * I understand that you would want to allow both a ref "ref/D" and
   "ref/D/F" to appear in the same reftable file.  A refname is an
   uninterpreted sequence of bytes and refnames are sorted in the
   table.

   Would it benefit us if we define the sort order of bytes slightly
   different from the ASCII order, so that a slash '/' sorts between
   NUL '\000' and SOH '\001', which is the order we should have used
   when storing the entries in the index?

 * Even though readers can continue accessing, starting from the
   $GIT_DIR/refs, without locking and get consistent views, any
   transaction that groups one or more ref updates would need to
   take a global lock on $GIT_DIR/refs file.  

   Would it become a problem in a busy repository?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: reftable [v4]: new ref storage format
  2017-08-03  1:50   ` Junio C Hamano
@ 2017-08-03  2:21     ` Shawn Pearce
  2017-08-03  2:36       ` Junio C Hamano
  0 siblings, 1 reply; 34+ messages in thread
From: Shawn Pearce @ 2017-08-03  2:21 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Jeff King, Michael Haggerty, David Borowitz

On Wed, Aug 2, 2017 at 6:50 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Junio C Hamano <gitster@pobox.com> writes:
>
>> I like the general idea, what the file format can represent and how
>> it does so, but I am a bit uneasy about how well this "stacked" part
>> would work for desktop clients.
>
> Two more random things before I forget.
>
>  * I understand that you would want to allow both a ref "ref/D" and
>    "ref/D/F" to appear in the same reftable file.  A refname is an
>    uninterpreted sequence of bytes and refnames are sorted in the
>    table.
>
>    Would it benefit us if we define the sort order of bytes slightly
>    different from the ASCII order, so that a slash '/' sorts between
>    NUL '\000' and SOH '\001', which is the order we should have used
>    when storing the entries in the index?

I'm not really with that. It complicates the compare routine, and
makes the data in reftable sorted differently than we announce in the
wire protocol. That cuts off any sort of optimizations we were
considering at $DAY_JOB to plumb the wire protocol code in JGit closer
to reftable code so that a large advertisement is more or less just
dumped straight from reftable to the wire protocol with only minimal
formatting changes.

>  * Even though readers can continue accessing, starting from the
>    $GIT_DIR/refs, without locking and get consistent views, any
>    transaction that groups one or more ref updates would need to
>    take a global lock on $GIT_DIR/refs file.
>
>    Would it become a problem in a busy repository?

Potentially yes. Writers may need to use a randomized exponential
backoff while trying to acquire the lock. For big application servers,
I recommended wrapping the file IO locking code with an application
server managed lock + fair wait queue. This is fairly straightforward
for JGit to implement within Gerrit Code Review.

I'm not really sure how one could safely do the same thing with
git-core. Its certainly not going to be portable or safe on NFS if we
tried to do anything fancy with flock(2), fcntl(F_SETLKW), or
semop(2).

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: reftable [v4]: new ref storage format
  2017-08-03  2:21     ` Shawn Pearce
@ 2017-08-03  2:36       ` Junio C Hamano
  0 siblings, 0 replies; 34+ messages in thread
From: Junio C Hamano @ 2017-08-03  2:36 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git, Jeff King, Michael Haggerty, David Borowitz

Shawn Pearce <spearce@spearce.org> writes:

> On Wed, Aug 2, 2017 at 6:50 PM, Junio C Hamano <gitster@pobox.com> wrote:
> ...
>>    Would it benefit us if we define the sort order of bytes slightly
>>    different from the ASCII order, so that a slash '/' sorts between
>>    NUL '\000' and SOH '\001', which is the order we should have used
>>    when storing the entries in the index?
>
> I'm not really with that. It complicates the compare routine, and
> makes the data in reftable sorted differently than we announce in the
> wire protocol.

Fair enough.  It was not like I had some operations in mind that
would benefit from such a sort order (i.e. walking two of these
things in parallel and merging them, which would have been the case
for the index when we walk it together with one or more trees); if
there is no such operation that benefit, there is no reason to try
to be clever here.

>>  * Even though readers can continue accessing, starting from the
>>    $GIT_DIR/refs, without locking and get consistent views, any
>>    transaction that groups one or more ref updates would need to
>>    take a global lock on $GIT_DIR/refs file.
>>
>>    Would it become a problem in a busy repository?
> ...
>
> I'm not really sure how one could safely do the same thing with
> git-core. Its certainly not going to be portable or safe on NFS if we
> tried to do anything fancy with flock(2), fcntl(F_SETLKW), or
> semop(2).

Yes.

And for public record, another thing we privately discussed was that
we currently do not know if we would want to make this design mesh
well with the use of multiple worktrees (IIUC, HEAD and things
outside refs/, and refs/bisect/, need to be per- worktree, while
others are to be common), and if so how.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: reftable [v4]: new ref storage format
  2017-08-02  2:38       ` Shawn Pearce
  2017-08-02  9:28         ` Jeff King
  2017-08-02 12:20         ` Dave Borowitz
@ 2017-08-03 18:38         ` Michael Haggerty
  2017-08-03 22:26           ` Shawn Pearce
  2 siblings, 1 reply; 34+ messages in thread
From: Michael Haggerty @ 2017-08-03 18:38 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git, Jeff King, Junio C Hamano, David Borowitz

On Tue, Aug 1, 2017 at 7:38 PM, Shawn Pearce <spearce@spearce.org> wrote:
> On Tue, Aug 1, 2017 at 6:51 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>> On Tue, Aug 1, 2017 at 4:27 PM, Shawn Pearce <spearce@spearce.org> wrote:
>>> On Mon, Jul 31, 2017 at 11:41 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>>>> [...]
>>> A couple of other notes about your contrasting design:
>>>
>>>>     elif chunk_type == INDEX {
>>>>         chunk_length : varint
>>>
>>> Using a varint for the chunk length made for a complicated reader.
>>> JGit doesn't have the luxury of mmap to access the file, so we have to
>>> allocate a byte[] and read data from a file descriptor to do anything
>>> fancy like decoding a varint. For my experiment I wound up just
>>> hardcoding the IO to read 1k or 4k from whatever address.
>>>
>>> A "real" implementation would likely prefer to read a fixed width
>>> field here such that chunks have a 3 byte header (1 byte chunk_type, 2
>>> byte chunk_length), and then issue a second read to acquire the rest
>>> of the chunk. Given that encoding a chunk length of 1024 or 4096 both
>>> requires 2 bytes of varint, its always going to be 2 bytes in your
>>> design anyway. With the way chunks are scanned, I don't think you want
>>> chunks as large as 16k, which would have caused the varint to go to 3
>>> bytes (but still fits in a fixed 2-byte chunk_length).
>>
>> That's a good point for INDEX and OBJS_INDEX blocks. Though for REFS
>> blocks that include reflogs, the block size has to be large enough to
>> hold the whole reflog for a reference, which can be arbitrarily large.
>> (Maybe this is a weakness of the design?)
>
> Gah. I missed that part about the whole reflog needing to fit in the
> same chunk when I wrote the quoted text above. That is a pretty big
> downside for very busy refs.
>
> Even when you isolate logs into their own file, the reflog for a busy
> ref could be huge. A naive reader would want to "page in" the entire
> chunk_length before parsing. That isn't tenable if the reflog for a
> busy ref was say 50 MiB. It complicates the reader more. My reftable
> proposal deals with this by breaking the log up with its special key
> structure.
>
> Requiring the entire reflog of a single ref to fit into a single chunk
> does seem to have its downsides.

I was assuming that readers would uncompress the data streamily, in
which case I don't think that it would be much of a problem: reflogs
are usually read either in their entirety, or just the most recent few
entries are read, either of which could be done efficiently despite
the whole reflog being in a single zlib-compressed blob. If streamy
reading is thought to be too complicated for readers, it wouldn't be a
big deal to add a `log_type` of `REFLOG_COMPRESSED_SEGMENTED` and put
the entries into multiple, smaller zlib-compressed chunks.

>> OBJS blocks can also be
>> unbounded in size if very many references point at the same object,
>> thought that is perhaps only a theoretical problem.
>
> Gah, I missed that in reftable. The block id pointer list could cause
> a single object id to exceed what fits in a block, and that will cause
> the writer to fail unless its caller sets the block size larger. I
> basically assumed this overflow condition is very unlikely, as its not
> common to have a huge number of refs pointing to the same object.

Given what Peff pointed out, let's just leave this as a varint for OBJS blocks.

>>>>     elif chunk_type == OBJS_INDEX {
>>>>         chunk_length : varint
>>>>
>>>>         # The offset, relative to the start of this chunk, of the
>>>>         # chunk containing the next level of the obj index, for each
>>>>         # of the possible "next" bytes in the SHA-1, or zero if there
>>>>         # are no references with the given next byte.
>>>>         child_offset : varint * 256
>>>
>>> This is space saving and cute, but kind of annoying. If it was fixed
>>> width 32 bit you can address up to 4G away from this chunk's address,
>>> and you can directly jump to the byte of interest. By being varints
>>> you do save a little space, as most files will probably only need 3
>>> byte varints, and the 0s do collapse down to 1 byte, but you have to
>>> linearly walk the list to find any specific byte.
>>
>> The reason I used varint here was mostly because I expect the lowest
>> level of the OBJS_INDEX to be placed close to the OBJS chunks that it
>> refers to; hopefully within a 2-byte varint. Let me consider more
>> carefully whether that is realistic...
>
> Ah, ok. So you were expecting the writer to interleave OBJS and
> OBJS_INDEX chunks, with the OBJS_INDEX appearing every ~256 OBJS
> chunks. And then storing the next level of OBJS_INDEX consecutively.
>
> I think your math is still wrong about the lowest level OBJS_INDEX
> needing only 2-byte varint for child_offset. Given a 1k chunk size,
> you still have to address backwards about 256k, which requires a
> 3-byte varint.

I think that you would want the lowest OBJS_INDEX to be mostly
populated, meaning that the next level nodes in the tree would mostly
only have zero or one entry. The true distribution [1] would be close
to a Poisson distribution with a mean of `λ ≲ 1` (where λ is basically
the filling factor), which looks like

    P(λ, n) = λⁿ exp(-λ) / n!

For `λ = 1`, that looks like

    P(1, 0) = 36.8%
    P(1, 1) = 36.8%
    P(1, 2) = 18.4%
    P(1, 3) = 06.1%
    P(1, 4) = 01.5%
    P(1, 5) = 00.3%
    P(1, 6) = 00.1%

so almost all OBJS chunks would have fewer than five entries. And in
fact the mean number of entries would be `λ`. So the total size of the
OBJS chunks pointed at by a given OBJS_INDEX chunk should be something
like

    256 * λ * sizeof(OBJS chunk) ≲ 1800 bytes.

[1] This is assuming randomness of SHA-1s, which AFAIK is normally a
good approximation. It is of course possible that somebody would try
to grief the system by creating a repository with lots of objects that
have similar SHA-1 prefixes, but I think the worst result would be
that some operations would scale like O(number of references) rather
than O(256), which isn't all that pathological, and the distinct
pattern would be clear evidence of malice that would justify banning
the user.

> Given the uniform distribution of SHA-1, I think you still wind up
> with most of the OBJS_INDEX populated. E.g. in my 866k ref/865k obj
> example the unique object abbreviation is 6 raw bytes. The OBJS chunk
> storing 2 bytes means we have 4 levels of OBJS_INDEX to cover the
> first 4 bytes. That leaves ~844 obj_records in each OBJS chunk. If
> those are 7 bytes/obj_record, the OBJS chunk is ~5.7 KiB.
>
> Forcing the OBJS chunk to stay under say 4 KiB will absolutely cause a
> lot more holes in the OBJS_INDEX levels.

I was thinking more about how to store the objects lookup table. Let's
assume that we combine the lowest-level OBJS_INDEX chunk along with
the OBJS chunks that it points to into a new OBJS_LEAF chunk. I think
the goals are as follows:

* The lowest OBJS_LEAF nodes should have a filling factor of
approximately `λ = 1`.
* It should be possible to adjust the size of the lowest OBJS_LEAF
nodes based on the preferred read size. (This might mean that all of
the OBJS_LEAF nodes referred to by the next-higher node in the tree
fit together with it in a single 64k block.)
* That higher-level nodes also can be packed efficiently in blocks of
roughly the preferred read size.
* That false positives be the exception. If we ask a higher-level
reftable whether it might contain a reference to $sha1, it would be
nice to be able to get a "no" answer without having to read actual
references. I think that is an argument why we *don't* want to store
only the largest unique prefix of the contained SHA-1s. I think we
want to *always* store enough bits of SHA-1 prefix to make the
referred-to SHA-1s represent only a small fraction (say, 1/256th) of
the total prefix space. E.g., if references in a particular repository
point to 2ⁿ distinct SHA-1s, then we'd want the prefix to include
something like n+8 bits.

To achieve these goals, I think we it would be best to make the number
of bits used in each level of the tree be variable rather than
hard-coded to be 8. By changing the radix of different levels of the
tree, I think you could arrange for block sizes that are optimal for
your filesystem at the same time that you ensure that the OBJS_LEAF
nodes have the desired filling factor.

If anybody's interested, I can flesh out these ideas more next week
when I have time.

>> Peff and I discussed off-list whether the lookup-by-SHA-1 feature is
>> so important in the first place. Currently, all references must be
>> scanned for the advertisement anyway,
>
> Not really. You can hide refs and allow-tip-sha1 so clients can fetch
> a ref even if it wasn't in the advertisement. We really want to use
> that wire protocol capability with Gerrit Code Review to hide the
> refs/changes/ namespace from the advertisement, but allow clients to
> fetch any of those refs if they send its current SHA-1 in a want line
> anyway.
>
> So a server could scan only the refs/{heads,tags}/ prefixes for the
> advertisement, and then leverage the lookup-by-SHA1 to verify other
> SHA-1s sent by the client.
>
>> so avoiding a second scan to vet
>> SHA-1s received from the client is at best going to reduce the effort
>> by a constant factor. Do you have numbers showing that this
>> optimization is worth it?
>
> No, but I don't think I need to do much to prove it. My 866k ref
> example advertisement right now is >62 MiB. If we do what I'm
> suggesting in the paragraphs above, the advertisement is ~51 KiB.
>
>> OTOH a mythical protocol v2 might reduce the need to scan the
>> references for advertisement, so maybe this optimization will be more
>> helpful in the future?
>
> Yes, I'm hopeful we can get a v2 protocol built on the work Jonathan
> Tan is doing, and switch the advertisement around to "client speaks
> first", so that we can be smarter on the server about which refs are
> read and sent. That is a long way off, lets say 3-5 years before its
> really common in clients.

My takeaway from the discussion about whether object lookup is an
important feature is that it should very much be optional, and by
extension we should probably leave room for other optional extensions
in the file format.

> [...]
>> That's what I was thinking. But I've yet to hear anybody complain
>> about missing reflogs for symrefs if the underlying reference is
>> updated directly, so maybe we should just leave that "problem"
>> unsolved. It is certainly simpler and less brittle not to have to keep
>> backreferences like these in sync with the forward references.
>
> Yup, that is my take on it, and why I didn't try to put this into
> reftable drafts, even though it was discussed between us on the list
> in earlier messages.

Yes, let's forget the idea of including backreferences from refs to
the symrefs that refer to them.

Michael

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: reftable [v4]: new ref storage format
  2017-08-02 20:28   ` Jeff King
@ 2017-08-03 22:17     ` Shawn Pearce
  0 siblings, 0 replies; 34+ messages in thread
From: Shawn Pearce @ 2017-08-03 22:17 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git, Michael Haggerty, David Borowitz

On Wed, Aug 2, 2017 at 1:28 PM, Jeff King <peff@peff.net> wrote:
> On Wed, Aug 02, 2017 at 12:50:39PM -0700, Junio C Hamano wrote:
>
>> With the traditional "packed-refs plus loose" layout, no matter how
>> many times a handful of selected busy refs are updated during the
>> day, you'd need to open at most two files to find out the current
>> value of a single ref (admittedly, the accessing of the second file,
>> after we realize that there is no loose one, would be very costly).
>> If you make a few commits on a topic branch A, then build a 100
>> commit series on top of another topic branch B, finding the current
>> value of A is still one open and read of refs/heads/A.
>>
>> With the reftable format, we'd need to open and read all 100
>> incremental transactions that touch branch B before realizing that
>> none of them talk about A, and read the next transaction file to
>> find the current value of A.  To keep this number low, we'd need
>> quite a frequent compaction.
>
> I think this is where compaction cleverness can come in.
>
> One relatively easy optimization would be to note when the most recent
> reftable contains a subset of the refs we are currently updating (and
> the obvious string-of-updates to a single ref falls under that), and do
> a "quick" compaction where we simply drop[1] that reftable in favor of
> ours. That compaction is essentially free, because we know those entries
> aren't valid anymore anyway.
>
> I'm actually not sure if this is a strict "drop", though, because of
> reflogs. If the reflog is written into the same file as the ref update,
> then you'd need to roll its entry into your new update, too. But see
> below anyway.
>
> For more complicated cases, there's some cleverness you can do with
> small on-the-fly compactions. Even if there are entries in the last few
> reftables that we're not currently updating, it's pretty cheap to roll a
> few of them up into our new reftable if it lets us drop some
> intermediate reftables. E.g., if we're writing a new reftable with a 4K
> block size but only have 100 bytes of new data, we're probably best to
> roll up a previous 500-byte reftable.
>
> That one's an obvious optimization because we know that the filesystem
> is going to make us spend 4K either way, so rounding up to that is
> generally free-ish.

Yes. I was trying to propose exactly this in the first draft of
reftable, but someone on list didn't like the idea of an update
transaction stalling to perform a compaction, so I took that text out
in later drafts.

What I had envisioned is exactly what you mention; an update of
refs/heads/B that is just going to overlay another small reftable that
also recently updated refs/heads/B should just replace that table
instead of pushing onto the stack. Or if the combined top of stack +
new table is under 4K, they should just combine together instead of
pushing a new table onto the stack.


> What's less obvious is when we should roll up a bunch of 4K tables into
> one (let's say) 32K table.  I think there's a formula for doing this
> geometrically so that the amortized cost of writing stays under a
> certain bound (linear?). But I haven't thought it through (or looked it
> up); I was hoping Shawn had already done so and could dump that wisdom
> on us.

Sorry, I haven't done that. :)


>> We can just declare that reftable format is not for desktop clients
>> but for server implementations where frequent compaction would not
>> be an annoyance to the users, but I'd wish we do not have to.
>
> Yeah, I agree we should avoid that. I would love to eventually kill off
> the loose/packed backend (or at least make it historical and read-only,
> so that a reasonable response to people complaining about its races and
> lack of atomicity is "please upgrade to reftables").

I'd also really like reftable to be useful for "desktop clients". I'd
consider it something of a design failure if the format was horribly
unsuitable in some way to that use case. I think the small compactions
during updates will be essential to supporting the typical developer's
workflow.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: reftable [v4]: new ref storage format
  2017-08-03 18:38         ` Michael Haggerty
@ 2017-08-03 22:26           ` Shawn Pearce
  2017-08-03 22:48             ` Michael Haggerty
  0 siblings, 1 reply; 34+ messages in thread
From: Shawn Pearce @ 2017-08-03 22:26 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: git, Jeff King, Junio C Hamano, David Borowitz

On Thu, Aug 3, 2017 at 11:38 AM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
> On Tue, Aug 1, 2017 at 7:38 PM, Shawn Pearce <spearce@spearce.org> wrote:
>> On Tue, Aug 1, 2017 at 6:51 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>>> On Tue, Aug 1, 2017 at 4:27 PM, Shawn Pearce <spearce@spearce.org> wrote:
>>>> On Mon, Jul 31, 2017 at 11:41 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>>>>> [...]
>>>> A couple of other notes about your contrasting design:
>>>>
...
>>> OBJS blocks can also be
>>> unbounded in size if very many references point at the same object,
>>> thought that is perhaps only a theoretical problem.
>>
>> Gah, I missed that in reftable. The block id pointer list could cause
>> a single object id to exceed what fits in a block, and that will cause
>> the writer to fail unless its caller sets the block size larger. I
>> basically assumed this overflow condition is very unlikely, as its not
>> common to have a huge number of refs pointing to the same object.
>
> Given what Peff pointed out, let's just leave this as a varint for OBJS blocks.

We discussed this at $DAY_JOB yesterday. We realized that if an obj
block has that many ref pointers present, it may be more efficient for
a reader to scan all references instead of chasing those pointers
individually. Latest draft of reftable now omits the ref pointer list
in an obj block if it exceeds the obj block size, which only occurs
when a high proportion of the ref blocks contain that SHA-1.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: reftable [v4]: new ref storage format
  2017-08-03 22:26           ` Shawn Pearce
@ 2017-08-03 22:48             ` Michael Haggerty
  2017-08-04  2:50               ` Shawn Pearce
  0 siblings, 1 reply; 34+ messages in thread
From: Michael Haggerty @ 2017-08-03 22:48 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git, Jeff King, Junio C Hamano, David Borowitz

I've revised the blockless reftable proposal to address some feedback:

* Don't omit `prefix_len` for the first ref and first child in a
block. It doesn't save much but makes the reader more complicated.
* Get rid of `symref_target` (the backlink from a reference to the
symlink(s) that point at it). It is a solution to a problem that is
not worth solving.
* Get rid of the `SYMREF_PEELED` type. It is impossible to maintain
this field affordably without `symref_target`, but it seems fragile
and unnecessary anyway.
* Add a way to mark a reflog entry "deleted" without having to rewrite
everything. This is mostly meant to deal with `refs/stash`.
* Define an extension mechanism.
* Define the SHA-1 → object mapping as an extension rather than as
part of the main spec. My gut feeling is that it will never be
implemented for git-core.
* Revise how the SHA-1 → object mapping works:
    * Merge the bottommost OBJ_INDEX node with the old OBJ nodes to
form a new OBJ_LEAF node.
    * Allow the branching factor of each node to be specified
independently (to allow the node sizes to be matched more closely to
the preferred read sizes).
* Rename some fields for clarity.

I currently lean towards the opinion that we should store pseudorefs
(like `FETCH_HEAD`, `MERGE_HEAD` *outside of* reftables, except for
`HEAD` (which behaves more like a normal reference, which is
considered for reachability, and for which we want to retain reflogs),
which we should store *in* reftables.

I don't talk at all about how individual reftable files are stacked
together, because that is pretty much unchanged from Shawn's proposal.
The only thing I would suggest is to give the reftable files names
that indicate whether they include reference values, reflogs, or both,
so that readers can tell from the table of contents which files they
have to open.

The new proposal follows:

Definitions:

Varint encoding is similar to the unsigned varint encoding used for
protocol buffers; i.e., a little-endian base-128 format where the most
significant bit is used to indicate that another byte is coming.
Decoding is as follows:

```
val = *ptr & 0x7f
while (*ptr++ & 0x80) {
    val = (val << 7) | (*ptr & 0x7f)
}
```

Strings are usually encoded using `varstr` encoding, which is the same
as how strings are encoded in protocol buffers. A `varstr` consists of
a `varint` length followed by the specified number of bytes holding
the contents of the string. `varstr` strings are not intrinsically
NUL-terminated.

The `chunk_length` fields encode the total size of the containing
chunk, including its `chunk_type` field.

```
varstr(s) = {
    varint(strlen(s))
    s
}
```

```
reftable = {
    header
    [chunk | padding]*
    footer
}

header = {
    'REFT'
    uint8( version_number = 1 )

    # The length of the whole header, in bytes:
    header_length : uint16

    metadata
}

footer = {
    'REFT'
    uint8( version_number = 1 )
    metadata

    # The length of the whole footer, in bytes:
    footer_length : uint16

    uint32(CRC-32 of previous part of footer)
}

metadata = {
    # if `contains_values` is false, this file *must not* contain any
    # reference values; i.e., `value_type` must always be `NO_VALUE`.
    contains_values : bool

    # if `contains_logs` is false, this file *must not* contain any
    # reflog entries; i.e., `log_type` must always be `NO_REFLOG`.

    contains_logs : bool

    # If `offsets_negative` is true, then all `*_offset` fields point
    # backwards in the file; i.e., the corresponding varint value is
    # negated before use.
    offsets_negative : bool

    # update_index is a counter that is incremented by one for each
    # atomic reference update affecting a stack of reftables (even if
    # the reference update changes multiple references). Counting
    # starts at 1 (0 is used as a special value). The following two
    # fields indicate that this reftable file covers update_indexes in
    # the range
    #
    #     min_update_index < update_index <= max_update_index
    #
    # , with the possible exception of LOG_DELETED entries.

    max_update_index : uint64
    min_update_index : uint64

    # To accomodate systems that have to write files serially, the
    # following two entries can be zeroed out in the header to tell
    # the reader that it has to read the corresponding values from the
    # footer.

    # The file offset of the chunk containing the root of the
    # reference tree:
    ref_root_chunk_addr : uint64

    # This space can be used to add extensions to the file format. It
    # is included in `header_length` / `footer_length` to allo older
    # readers to skip over any extensions that it doesn't understand.
    extension*
}

extension = {
    # The name of the extension:
    extension_name : varstr

    # The address of the first block of data associated with the extension:
    extension_root_chunk_addr : uint64
}

chunk = {
    # The `chunk_type` determines how to interpret the payload, and
    # influences how to compute its length (which is needed to advance
    # to the next chunk).

    chunk_type : enum PAD_BYTE | PADDING
                    | INDEX | REFS
                    | EXTENSION

    if chunk_type == PAD_BYTE {
        # This chunk type can be used to add a single byte of padding,
        # which would otherwise be impossible because a `PADDING`
        # chunk requires a minimum of two bytes.
    }
    elif chunk_type == PADDING {
        # A form of padding that's cheaper to skip over than
        # `PAD_BYTE`.

        # The total number of bytes in this chunk, including
        # `chunk_type`. The contents will otherwise be ignored:

        chunk_length : varint
    }
    elif chunk_type == INDEX {
        chunk_length : varint
        children : {
            # Length of prefix being carried over from the previous
            # record (always zero for the first child in a chunk):
            prefix_len : varint
            suffix : varstr
            index_payload
        }*
    }
    elif chunk_type == REFS {
        chunk_length : varint
        refs : {
            # Length of prefix being carried over from the previous
            # record (always zero for the first reference in a chunk):
            prefix_len : varint
            suffix : varstr
            ref_payload
        }*
    }
    elif chunk_type == EXTENSION {
        chunk_length : varint
        # Additional data may be present here. It should be ignored by
        # readers that don't know about the extension.
    }
}

index_payload = {
    # The number of bytes from the begining of this chunk to the child
    # chunk. If child_offset is zero, then there are no entries in
    # this reftable whose refnames start with the specified prefix.
    #
    # The child pointed at is of type INDEX (another index chunk
    # containing the next finer level of detail) or of type REFS. In
    # either case, the first record in the pointed-to chunk must have
    # `prefix_len == 0` and contain the entire key as `suffix`.
    child_offset : varint
}

ref_payload = {
    value_type : enum NO_VALUE
                    | DELETED
                    | VALUE | VALUE_PEELED
                    | SYMREF
                    | SPECIAL
    log_type : enum NO_REFLOG | REFLOG | REFLOG_COMPRESSED

    if value_type == NO_VALUE {
        # This type is used if we need to store a reflog entry but
        # have no reference value to store in this file.
    }
    elif value_type == DELETED {
        # This type indicates that the reference has been deleted,
        # regardless of what any reftables deeper in the stack claim.
    }
    elif value_type == VALUE {
        # This is a normal (non-symbolic) reference.
        sha1 : uchar[20]
    }
    elif value_type == VALUE_PEELED {
        # This is a normal (non-symbolic) reference that points at a
        # tag. `peeled` is the reference peeled down to a non-tag.
        sha1 : uchar[20]
        peeled : uchar[20]
    }
    elif value_type == SYMREF {
        # This is a symref that points at a non-existent branch.
        target : varstr
    }
    elif value_type == SPECIAL {
        # This is one of the special references (like FETCH_HEAD,
        # MERGE_HEAD). The contents are stored as they would be in a
        # loose reference file:
        contents : varstr
    }

    if log_type == NO_REFLOG {
    }
    elif log_type == REFLOG {
        log_entry_length : varint
        log_entry
    }
    elif log_type == REFLOGS_COMPRESSED {
        compressed_length : varint
        zlib_deflate(
            log_entry*
        )
    }
}

# Log entries are stored from oldest to newest. The "chained" variants
# of `log_type` take their `new_id` from the current value of the
# reference (if it is the first log entry for the references) or from
# the preceding (i.e., next newest) log record's `old_id` value.
log_entry = {
    # `CREATE_CHAINED` and `UPDATE_CHAINED` take their `new_id` from
    # the preceding (i.e., next newest) record's `old_id` value.
    log_type : enum END_OF_LOG | LOG
    old_type : enum ABSENT | REF | SYMREF
    new_type : enum ABSENT | REF | SYMREF | CHAINED

    if log_type == LOG {
        # The update_index of this entry, or 0 if there are no more
        # entries. log_entries are always stored in order from largest
        # to smallest update_index.
        update_index : varint

        # `PUSHER_CHAINED` takes its `name` and `email` from the preceding
        # (i.e., next newest) record's fields.
        pusher_type : enum PUSHER_EXPLICIT | PUSHER_CHAINED

        if new_type == ABSENT {
        }
        elif new_type == REF {
            new_id : uchar[20]
        }
        elif new_type == SYMREF {
            new_ref : varstr
        }
        elif new_type == CHAINED {
        }

        if old_type == ABSENT {
        }
        elif old_type == REF {
            old_id : uchar[20]
        }
        elif old_type == SYMREF {
            old_ref : varstr
        }

        time_seconds : uint32
        tz_offset_minutes : sint16
        if pusher_type == PUSHER_EXPLICIT {
            name : varstr
            email : varstr
        }

        # The reflog message. For a bit more compression:
        #
        # * Many messages contain old/new SHA-1s for the reference
        #   updates. That redundancy could be eliminated by replacing
        #   the old SHA-1 with `%o`, the new one with `%n`, and `%`
        #   with `%%`.
        #
        # * Some "standard" log messages (i.e., the ones generated by
        #   Git itself) could be hard-coded as additional `log_type`
        #   constants.
        message : varstr
    } else if log_type == LOG_DELETED {
        # An entry of this type indicates that any reflog entries for
        # this reference in reftables with `update_index >
        # next_retained_update_index` should be ignored (either they
        # have been deleted or they have been superseded by entries in
        # this reftable). This feature will mainly be used for `git
        # stash drop` and related commands. Such an entry, if it
        # exists, must be the last (or only) log_entry in the
        # ref_payload.
        next_retained_update_index : varint
    } else if log_type == END_OF_LOG {
    }
}
```


## Object → reference lookup extension

This extension allows one to determine efficiently which references
(if any) point at an object with a specified name.

This data is structured as an N-way trie whose branching factors can
vary from level to level. There are two kinds of nodes:

* Internal nodes (composed of chunks of type `OBJ_INDEX`): always
  point to other chunks of type `OBJ_INDEX` or `OBJ_LEAF`.

* Leaf nodes (composed of chunks of type `OBJ_LEAF`): can also include
  branching, but the individual items point at `obj_record`s within
  the same `OBJ_LEAF`.

Each `obj_record` within an `OBJ_LEAF` node points at one or more
`REFS` chunks that contain references that point at an object whose
name starts with the indicated prefix. To find the name(s) of
reference(s) that point at your SHA-1, you have to read the `REFS`
chunk and compare the target SHA-1s to yours.

```
objref_extension = {
    extension_name : varstr("objref")

    # The file offset of the chunk containing the root of the object
    # trie. This value must be zero (or the extension must be absent)
    # if `contains_values` is false:
    obj_root_chunk_addr : uint64
}

objref_chunk = {
    chunk_length : varint

    objref_chunk_type = OBJ_INDEX | OBJ_LEAF

    # The degree of branching at this node is `1 << radix_bits`. For
    # `OBJ_LEAF` nodes, it is allowed for this to be zero:
    radix_bits : uint4

    if objref_chunk_type == OBJ_INDEX {
        # The offset, relative to the start of this chunk, of the
        # chunk containing the next level of the obj index, for each
        # of the possible "next" bytes in the SHA-1, or zero if there
        # are no references with the given next byte.
        child_offset : varint * (1 << radix_bits)
    }
    elif objref_chunk_type == OBJ_LEAF {
        # The total number of SHA-1 bytes that are included in the
        # SHA-1 prefixes corresponding to each child, including those
        # accounted for by the `radix_bits` of all of the ancestor
        # nodes:
        prefix_bytes : uint4

        # The relative address, within this chunk, of the obj_record
        # for each combination of the next `radix_bits` bits of the
        # SHA-1, or zero if there are no references with the given
        # next byte.
        obj_record_addr : varint[1 << radix_bits]

        # The object records for the SHA-1 prefixes referred to above:
        obj_record*
    }
}

obj_record = {
    # The bytes `sha1[floor(sum(radix_bits) / 8) : prefix_bytes]` of
    # the SHA-1 being described (see below):
    prefix : uchar[n]

    # The number of child_addrs:
    count : varint

    # File offsets of the chunks containing references that point at
    # objects with this prefix:
    child_addr+ : varint
}
```

On Thu, Aug 3, 2017 at 3:26 PM, Shawn Pearce <spearce@spearce.org> wrote:
> On Thu, Aug 3, 2017 at 11:38 AM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>> On Tue, Aug 1, 2017 at 7:38 PM, Shawn Pearce <spearce@spearce.org> wrote:
>>> On Tue, Aug 1, 2017 at 6:51 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>>>> On Tue, Aug 1, 2017 at 4:27 PM, Shawn Pearce <spearce@spearce.org> wrote:
>>>>> On Mon, Jul 31, 2017 at 11:41 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>>>>>> [...]
>>>>> A couple of other notes about your contrasting design:
>>>>>
> ...
>>>> OBJS blocks can also be
>>>> unbounded in size if very many references point at the same object,
>>>> thought that is perhaps only a theoretical problem.
>>>
>>> Gah, I missed that in reftable. The block id pointer list could cause
>>> a single object id to exceed what fits in a block, and that will cause
>>> the writer to fail unless its caller sets the block size larger. I
>>> basically assumed this overflow condition is very unlikely, as its not
>>> common to have a huge number of refs pointing to the same object.
>>
>> Given what Peff pointed out, let's just leave this as a varint for OBJS blocks.
>
> We discussed this at $DAY_JOB yesterday. We realized that if an obj
> block has that many ref pointers present, it may be more efficient for
> a reader to scan all references instead of chasing those pointers
> individually. Latest draft of reftable now omits the ref pointer list
> in an obj block if it exceeds the obj block size, which only occurs
> when a high proportion of the ref blocks contain that SHA-1.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: reftable [v4]: new ref storage format
  2017-08-03 22:48             ` Michael Haggerty
@ 2017-08-04  2:50               ` Shawn Pearce
  0 siblings, 0 replies; 34+ messages in thread
From: Shawn Pearce @ 2017-08-04  2:50 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: git, Jeff King, Junio C Hamano, David Borowitz

On Thu, Aug 3, 2017 at 3:48 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
> I've revised the blockless reftable proposal to address some feedback:

I've been thinking more about your blockless proposal.

I experimentally modified my reftable implementation to omit padding
between blocks, bringing it a tiny bit closer to your blockless
proposal. Unfortunately this slightly increased latency for lookups on
a 4k chunk size. I speculate this is because chunks are no longer
aligned with the filesystem page, and I'm forcing Mac OS X to give me
two pages out of the filesystem. Using padding to align to a 4k block
is slightly faster, and the average wasted per block is <=20 bytes,
too small to fit another ref.

The restart table and binary search within the 4k block is a
performance win. Disabling the restart table significantly increased
lookup latency.

tl;dr:  I think the block alignment and restart table are wins vs. the
multi-level index.

A suggested downside of my reftable design is the ref index at 4k
block size for 866k refs is 199 KiB, and must be paged in for binary
search to locate the correct block for any lookup. The pack idx for
the two main packs in this repository is 210 MiB. We think fairly
little of mmap'ing 210 MiB to perform binary search to find object
data. 199 KiB for ref data seems to be a bargain. An advantage of the
single level index is its only one page touched after the index is
loaded.

Hot reftable reads (5.6 usec) are faster than loose ref reads (6.5
usec). Once the ref index is loaded, reftable can read a ref more
quickly than the time required to open-read-close a loose ref.
Admittedly, a large index slows down a cold read.

tl;dr:  I just don't think the size of the index is a concern.

I really favor the reflog data in a different section from the ref
values themselves. Even for smaller transaction files, it improves
scan and lookup time by allowing readers who just care about the name
and SHA-1 value of a ref to not be paging in or skipping over log
record payloads. However, I also agree that the aggregates may benefit
from ref and log being separate files.

> * Add a way to mark a reflog entry "deleted" without having to rewrite
> everything. This is mostly meant to deal with `refs/stash`.

This is an interesting idea. Given how I implemented reftable in JGit,
just inserting a deletion record for the same (ref,update_index) tuple
would make it trivial to hide the prior entry.

> * Define an extension mechanism.
> * Define the SHA-1 → object mapping as an extension rather than as
> part of the main spec. My gut feeling is that it will never be
> implemented for git-core.

While the SHA-1 -> object mapping may never be implemented for
git-core, I'd still prefer to see it as an optional part of the file
specification, rather than an extension that is specified. IMHO the
extension stuff in DIRC has made it unnecessarily complicated, and
we've still revved that file through many revisions.

> * Revise how the SHA-1 → object mapping works:
>     * Merge the bottommost OBJ_INDEX node with the old OBJ nodes to
> form a new OBJ_LEAF node.
>     * Allow the branching factor of each node to be specified
> independently (to allow the node sizes to be matched more closely to
> the preferred read sizes).

I'm not sure objects warrant this kind of complexity. The obj support
in reftable is nearly identical to the ref support. I have a
significant amount of code that is common between them. Your approach
has objects different enough from refs that they need their own code,
increasing complexity in both the writer and reader.

> I currently lean towards the opinion that we should store pseudorefs
> (like `FETCH_HEAD`, `MERGE_HEAD` *outside of* reftables, except for
> `HEAD` (which behaves more like a normal reference, which is
> considered for reachability, and for which we want to retain reflogs),
> which we should store *in* reftables.

I'm on the fence, and don't really have a strong opinion about where
we store the pseudorefs. Happy to keep them in $GIT_DIR, happy to have
them supported inside a reftable.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: reftable [v4]: new ref storage format
  2017-08-02  1:51     ` Michael Haggerty
  2017-08-02  2:38       ` Shawn Pearce
@ 2017-08-05 21:00       ` Shawn Pearce
  1 sibling, 0 replies; 34+ messages in thread
From: Shawn Pearce @ 2017-08-05 21:00 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: git, Jeff King, Junio C Hamano, David Borowitz

On Tue, Aug 1, 2017 at 6:51 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
> On Tue, Aug 1, 2017 at 4:27 PM, Shawn Pearce <spearce@spearce.org> wrote:
>> On Mon, Jul 31, 2017 at 11:41 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>>> On Sun, Jul 30, 2017 at 8:51 PM, Shawn Pearce <spearce@spearce.org> wrote:
>>>> 4th iteration of the reftable storage format.
>>>> [...]
>>>
>>> Before we commit to Shawn's reftable proposal, I wanted to explore
>>> what a contrasting design that is not block based would look like.
>>
>> I forgot to look at a 1k chunk size, as you suggested that might also
>> be suitable. Here is the more complete experiment table:
>>
>>        | size   | seek_cold | seek_hot  |
>> mh  1k | 29.4 M | 20.6 usec | 10.7 usec |   <== Row fixed
>> mh  4k | 28.3 M | 24.5 usec | 14.5 usec |
>> sp  4k | 29.2 M | 63.0 usec |  5.8 usec |
>> sp 64k | 27.7 M | 35.6 usec | 23.3 usec |

I modified reftable to use a mutli-level index as recommended by
Michael Haggerty, and re-ran the above table:

fmt |  bs | idx  |  size  | seek_cold | seek_hot  |
----|-----|------|--------|-----------|-----------|
 mh |  1k | 4 lv | 29.4 M | 20.1 usec |  7.1 usec |
 sp |  1k | 4 lv | 30.7 M | 21.0 usec |  5.5 usec |

 mh |  4k | 2 lv | 28.3 M | 23.4 usec | 11.2 usec |
 sp |  4k | 2 lv | 29.2 M | 19.9 usec |  5.4 usec |

 sp |  4k | 1 lv | 29.2 M | 62.9 usec |  5.6 usec |
 sp | 64k | 1 lv | 27.7 M | 35.6 usec | 21.6 usec |

fmt:  mh = Michael's proposal, sp = Shawn's reftable
bs:  chunk size or block size in bytes
idx:  how many levels of index
size:  total file size in bytes

I can't explain the slightly slower sp-1k-4lv vs. mh-1k-4lv cold_seek
in the first two rows. It might simply be the larger footer read
slowing down JGit. Its reliably flipped in the next two (at 4k).

reftable is more efficient in seek_hot at finding and parsing a
reference. For multi-level indexes both sp and mh implementations used
a lazy caching strategy that caches index blocks along the path to the
ref, but doesn't cache the final ref block. The tests amortized this
by performing 60,000 lookups on the same ref.

>> At least in this case, the 1k block size seems like a good tradeoff.

With the multi-level index now in reftable, it seems like reftable at
4k, 2 level index is a better tradeoff. It aligns with the most common
filesystem block size, and has lower seek times, both cold and hot.
Its slightly larger than Michael's alternative proposal (+921 K), but
compresses better than reftable at 1k (-1.5 M).

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2017-08-05 21:01 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-31  3:51 reftable [v4]: new ref storage format Shawn Pearce
2017-07-31 17:41 ` Dave Borowitz
2017-07-31 19:01 ` Stefan Beller
2017-07-31 23:05   ` Shawn Pearce
2017-07-31 19:42 ` Junio C Hamano
2017-07-31 23:43   ` Shawn Pearce
2017-08-01 16:08     ` Shawn Pearce
2017-08-01  6:41 ` Michael Haggerty
2017-08-01 20:23   ` Shawn Pearce
2017-08-02  0:49     ` Michael Haggerty
2017-08-01 23:27   ` Shawn Pearce
2017-08-01 23:54     ` Shawn Pearce
2017-08-02  1:51     ` Michael Haggerty
2017-08-02  2:38       ` Shawn Pearce
2017-08-02  9:28         ` Jeff King
2017-08-02 15:17           ` Shawn Pearce
2017-08-02 16:51             ` Junio C Hamano
2017-08-02 17:28             ` Jeff King
2017-08-02 12:20         ` Dave Borowitz
2017-08-02 17:18           ` Jeff King
2017-08-03 18:38         ` Michael Haggerty
2017-08-03 22:26           ` Shawn Pearce
2017-08-03 22:48             ` Michael Haggerty
2017-08-04  2:50               ` Shawn Pearce
2017-08-05 21:00       ` Shawn Pearce
2017-08-01 13:54 ` Dave Borowitz
2017-08-01 15:27   ` Shawn Pearce
2017-08-02 19:50 ` Junio C Hamano
2017-08-02 20:28   ` Jeff King
2017-08-03 22:17     ` Shawn Pearce
2017-08-03  1:50   ` Junio C Hamano
2017-08-03  2:21     ` Shawn Pearce
2017-08-03  2:36       ` Junio C Hamano
2017-08-02 19:54 ` Stefan Beller

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).