git@vger.kernel.org mailing list mirror (one of many)
* reftable: new ref storage format
@ 2017-07-13  0:17 Shawn Pearce
  2017-07-13 19:32 ` Jeff King
  2017-07-16 17:33 ` Michael Haggerty
  0 siblings, 2 replies; 25+ messages in thread
From: Shawn Pearce @ 2017-07-13  0:17 UTC (permalink / raw)
  To: git

We've been having scaling problems with an insane number of references
(>866k), so I started thinking a lot about improving ref storage.

I've written a simple approach, and implemented it in JGit.
Performance is promising:

  - 62M packed-refs compresses to 27M
  - 42.3 usec lookup

You can read a rendered version of this here:
https://googlers.googlesource.com/sop/jgit/+/reftable/Documentation/technical/reftable.md


## Overview

### Problem statement

Some repositories contain a lot of references (e.g.  android at 866k,
rails at 31k).  The existing packed-refs format takes up a lot of
space (e.g.  62M), and does not scale with additional references.
Lookup of a single reference requires linearly scanning the file.

Atomic pushes modifying multiple references require copying the
entire packed-refs file, which can be a considerable amount of data
moved (e.g. 62M in, 62M out) for even small transactions (2 refs
modified).

Repositories with many loose references occupy a large number of disk
blocks from the local file system, as each reference is its own file
storing 41 bytes.  This negatively affects the number of inodes
available when a large number of repositories are stored on the same
filesystem.  Readers are also penalized due to the larger number of
syscalls required to traverse and read the `$GIT_DIR/refs` directory.

### Objectives

- Near constant time lookup for any single reference, even when the
  repository is cold and not in process or kernel cache.
- Occupy less disk space for large repositories.
- Support atomic pushes with lower copying penalties.

### Description

A reftable file is a portable binary file format customized for
reference storage. References are sorted, enabling linear scans,
binary search lookup, and range scans.

Storage in the file is organized into blocks.  Prefix compression
is used within a single block to reduce disk space.  Block size is
tunable by the writer.

### Performance

Space used, packed-refs vs. reftable:

repository | packed-refs | reftable | % original | avg ref
-----------|------------:|---------:|-----------:|---------:
android    |      62.2 M |   27.7 M |     44.4%  | 33 bytes
rails      |       1.8 M |  896.2 K |     47.6%  | 29 bytes
git        |      78.7 K |   27.9 K |     40.0%  | 43 bytes
git (heads)|       332 b |    204 b |     61.4%  | 34 bytes

Scan (read 866k refs) and lookup (single ref from 866k refs):

format      | scan    | lookup
------------|--------:|---------------:
packed-refs |  380 ms | 375420.0 usec
reftable    |  125 ms |     42.3 usec

## Details

### Peeling

References in a reftable are always peeled.

### Reference name encoding

Reference names should be encoded with UTF-8.

### Ordering

Blocks are lexicographically ordered by their first reference.


## File format

### Header

An 8-byte header appears at the beginning of each file:

- 4-byte magic is: `'\1', 'R', 'E', 'F'`
- 1-byte version number, `1`.
- 3-byte `block_size` in bytes (network byte order).
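
A minimal sketch of parsing this header in C (the function name and
error handling are illustrative, not part of the proposal):

    #include <stdint.h>
    #include <string.h>

    /* Sketch: parse the 8-byte file header; returns 0 on success. */
    static int parse_header(const uint8_t buf[8], uint32_t *block_size)
    {
        if (memcmp(buf, "\1REF", 4))   /* magic: '\1', 'R', 'E', 'F' */
            return -1;
        if (buf[4] != 1)               /* version */
            return -1;
        /* 3-byte block_size, network byte order */
        *block_size = ((uint32_t)buf[5] << 16) | (buf[6] << 8) | buf[7];
        return 0;
    }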

### Block size

The `block_size` is arbitrarily determined by the writer, and does not
have to be a power of 2.  The block size must be larger than the
longest reference name used in the repository, as references cannot
span blocks.

### First block

The first block shares its space with the file header, and is thus 8
bytes smaller than all other blocks in the file.  It begins
immediately after the file header, at offset 8.

### Block format

A block is written as:

    ref_record*
    padding?
    int32( restart_offset )*
    int32( record_end_offset )
    int32( number_of_restarts )

Blocks begin with a variable number of `ref_record` entries describing
reference names and values; the record format is described below.

The middle of the block may be filled with `padding` NUL bytes to
fill out the block to the common `block_size` as specified in the file
header.  Padding may be necessary to ensure `number_of_restarts`
occupies the last 4 bytes of the block.  Padding may be omitted if the
block is the last block of the file, and there is no index block.
This allows reftable to efficiently scale down to a small number of
refs.

A variable number of 4-byte, network byte order `restart_offset`
values follows the padding.  Offsets are relative to the start of the
block and refer to the first byte of any `ref_record` whose name has
not been prefix compressed.  Readers can start linear scans from any
of these records.

The 4-byte, network byte order `record_end_offset` follows, providing
the block-relative offset after the end of the last `ref_record`.  If
`padding` is present this is the offset of the first byte of padding;
otherwise it is the offset of the first byte of the first
`restart_offset` entry.

The 4-byte, network byte order `number_of_restarts` stores the number
of entries in the `restart_offset` list.  Readers can use the restart
count to binary search between restarts before starting a linear scan.
This field must be the last 4 bytes of the block; the `padding` field
must be used to ensure this is true.
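
These tail fields let a reader locate the restart table without
parsing any records.  A sketch of doing so, assuming a big-endian
helper `get_be32` (names are illustrative):

    #include <stdint.h>

    static uint32_t get_be32(const uint8_t *p)
    {
        return ((uint32_t)p[0] << 24) | (p[1] << 16) | (p[2] << 8) | p[3];
    }

    /* Sketch: locate the tail fields of one block. */
    static void read_block_tail(const uint8_t *block, uint32_t block_size,
                                uint32_t *record_end, uint32_t *nrestarts,
                                const uint8_t **restart_table)
    {
        *nrestarts = get_be32(block + block_size - 4);
        *record_end = get_be32(block + block_size - 8);
        *restart_table = block + block_size - 8 - 4 * *nrestarts;
    }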

#### ref record

A `ref_record` describes a single reference, storing both the name and
its value(s). Records are formatted as:

    varint( prefix_length )
    varint( (suffix_length << 2) | type )
    suffix
    value?

The `prefix_length` field specifies how many leading bytes of the
prior reference record's name should be copied to obtain this
reference's name.  This must be 0 for the first reference in any
block, and also must be 0 for any `ref_record` whose offset is listed
in the `restart_offset` table at the end of the block.

Recovering a reference name from any `ref_record` is a simple concat:

    this_name = prior_name[0..prefix_length] + suffix

The second varint carries both `suffix_length` and `type`.  The
`suffix_length` value provides the number of bytes to copy from
`suffix` to complete the reference name.

The `value` immediately follows.  Its format is determined by `type`,
a 2-bit code, one of the following:

- `0x0`: deletion; no value data (see transactions, below)
- `0x1`: one 20-byte object id; value of the ref
- `0x2`: two 20-byte object ids; value of the ref, peeled target
- `0x3`: symbolic reference: `varint( target_len ) target`

Symbolic references use a varint length followed by a variable number
of bytes to encode the complete reference target.  No compression is
applied to the target name.
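
Putting the pieces together, a sketch of decoding one `ref_record`;
`read_varint` is the decoder given under "Varint encoding" below, and
the value is skipped rather than returned to keep the sketch short:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    uint64_t read_varint(const uint8_t *buf, size_t *pos); /* see below */

    /* Sketch: decode the record at buf[*pos].  On entry, name holds the
     * prior record's name; on exit, it holds this record's name. */
    static int read_ref_record(const uint8_t *buf, size_t *pos, char *name)
    {
        size_t prefix_len = read_varint(buf, pos);
        uint64_t sl = read_varint(buf, pos);
        size_t suffix_len = sl >> 2;
        int type = sl & 0x3;

        /* this_name = prior_name[0..prefix_length] + suffix */
        memcpy(name + prefix_len, buf + *pos, suffix_len);
        name[prefix_len + suffix_len] = '\0';
        *pos += suffix_len;

        switch (type) {
        case 0x0:                       /* deletion: no value */
            break;
        case 0x1:                       /* one object id */
            *pos += 20;
            break;
        case 0x2:                       /* value + peeled target */
            *pos += 40;
            break;
        case 0x3:                       /* symbolic reference */
            *pos += read_varint(buf, pos);
            break;
        }
        return type;
    }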

### Index block

The index stores the name of the last reference from every block in
the file, enabling O(1) disk seeks for any lookup.  Any
reference can be found by binary searching the index, identifying the
containing block, and searching within that block.

If present, the index block appears after the last block of the file.

An index block should only be written if there are at least 4 blocks
in the file, as cold reads using the index require 2 disk reads, and
binary searching <= 4 blocks also requires <= 2 reads.  Omitting the
index block from smaller files saves space.

Index block format:

    '\1' 'i'
    index_record*
    int32( restart_offset )*
    int32( record_end_offset )
    int32( number_of_restarts )

Index blocks begin with a magic prefix, `\1i`, where other blocks
would have started with `\0` for the first ref record's prefix length.
This supports stopping sequential scans at the index block, without
prior knowledge of its position.

Unlike other blocks, the index block is not padded.

The `restart_offset`, `record_end_offset`, and `number_of_restarts`
fields are identical in format, meaning, and usage to their
counterparts in ref blocks.

To reduce the number of reads required for random access in very large
files, the index block may be larger than the other blocks.  However,
readers must hold the entire index in memory to benefit from this, so
it's a time-space tradeoff in both file size and reader memory.
Increasing the block size in the writer decreases the index size.

#### index record

An index record describes the last reference of another block.
Index records are written as:

    varint( prefix_length )
    varint( (suffix_length << 2) )
    suffix
    varint( block_idx )

Index records use prefix compression exactly like `ref_record`.  The
`suffix_length` is still shifted left 2 bits, with no `type` bits set,
to keep reader/writer code unified across both block types.

Index records store `block_idx` after the suffix, specifying which
block of the file ends with this reference. The block is located at
position `block_idx * block_size`.

### Reading the index

Readers loading the index must first read the footer (below) to
determine `index_size`.  The index is located at position:

    file_length - (index_size + 16)
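
For example, assuming `file_length` from `fstat(2)` (a sketch; the
variable names are illustrative):

    /* Sketch: the index block sits at a fixed position relative to
     * the 16-byte footer. */
    off_t index_pos = file_length - ((off_t)index_size + 16);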

### Footer

After the last block of the file (or index block, if present), a file
footer is written.  This is similar in structure to the file header,
but extended with additional data.

A 16-byte footer appears at the end:

- 4-byte magic is: `'\1', 'R', 'E', 'F'`
- 1-byte version number, `1`.
- 3-byte `block_size` in bytes (network byte order).
- 4-byte `index_size` in bytes (network byte order).
- 4-byte CRC-32 of the preceding 12 bytes (network byte order).

Like the index block magic header, the footer begins with `\1R`,
allowing sequential scans to recognize that the end of the file has
been reached.

#### Reading the footer

Readers must seek to `file_length - 16` to access the footer.  A
trusted external source (such as `stat(2)`) is necessary to obtain
`file_length`.  When reading the footer, readers must verify:

- 4-byte magic is correct
- 1-byte version number is recognized
- 4-byte CRC-32 matches the other 12 bytes read

Once verified, the `block_size` and `index_size` may be accessed from
the footer.
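
A sketch of this procedure in C, using zlib's `crc32()`; `get_be32` is
the helper assumed in the block format sketch above:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>

    /* Sketch: read and verify the footer.  file_length must come from a
     * trusted source such as fstat(2).  Returns 0 on success. */
    static int read_footer(FILE *f, long file_length,
                           uint32_t *block_size, uint32_t *index_size)
    {
        uint8_t buf[16];

        if (fseek(f, file_length - 16, SEEK_SET))
            return -1;
        if (fread(buf, 1, 16, f) != 16)
            return -1;
        if (memcmp(buf, "\1REF", 4) || buf[4] != 1)
            return -1;
        if (crc32(0L, buf, 12) != get_be32(buf + 12))
            return -1;
        *block_size = ((uint32_t)buf[5] << 16) | (buf[6] << 8) | buf[7];
        *index_size = get_be32(buf + 8);
        return 0;
    }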

### Varint encoding

Varint encoding is identical to the ofs-delta encoding method used
within pack files.

A decoder, written here in C rather than pseudocode, works as follows:

    #include <stddef.h>
    #include <stdint.h>

    /* Read one varint (pack file ofs-delta encoding) from buf at *ptr. */
    uint64_t read_varint(const uint8_t *buf, size_t *ptr)
    {
        uint8_t c = buf[(*ptr)++];
        uint64_t val = c & 0x7f;
        while (c & 0x80) {
            c = buf[(*ptr)++];
            val = ((val + 1) << 7) | (c & 0x7f);
        }
        return val;
    }
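
The matching encoder, sketched here to mirror git's ofs-delta writer
(the function name is illustrative):

    /* Sketch: encode val into buf; returns the number of bytes used. */
    static int write_varint(uint8_t *buf, uint64_t val)
    {
        uint8_t tmp[10];
        int n = 0, i;

        tmp[n] = val & 0x7f;
        while (val >>= 7)
            tmp[++n] = 0x80 | (--val & 0x7f);
        for (i = 0; i <= n; i++)    /* bytes were built low-order first */
            buf[i] = tmp[n - i];
        return n + 1;
    }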

### Binary search

Binary search within a block is supported by the `restart_offset`
fields at the end of the block.  Readers can binary search through the
restart table to locate between which two restart points the sought
reference should appear.

Each reference identified by a `restart_offset` stores the complete
reference name in the `suffix` field of the `ref_record`, making the
compare operation during the binary search straightforward.

Once a restart point lexicographically before the sought reference has
been identified, readers can linearly scan through the following
`ref_record` entries to locate the sought reference, stopping when the
current `ref_record` sorts after the sought name (at which point the
sought reference is not present).
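
A sketch of the whole in-block lookup, reusing the helpers assumed in
the earlier sketches and additionally assuming the writer lists the
first record of the block as a restart point:

    #include <string.h>

    /* Sketch: compare the name at a restart point against want;
     * records at restarts carry their complete name as the suffix. */
    static int cmp_at_restart(const uint8_t *block, uint32_t off,
                              const char *want)
    {
        size_t pos = off, want_len = strlen(want), suffix_len, n;
        int c;

        read_varint(block, &pos);       /* prefix_length, always 0 here */
        suffix_len = read_varint(block, &pos) >> 2;
        n = suffix_len < want_len ? suffix_len : want_len;
        c = memcmp(block + pos, want, n);
        if (c)
            return c;
        return (suffix_len > want_len) - (suffix_len < want_len);
    }

    /* Sketch: binary search the restart table, then scan linearly.
     * Returns 1 if want was found, with its name rebuilt in name[]. */
    static int find_in_block(const uint8_t *block, uint32_t block_size,
                             const char *want, char *name)
    {
        uint32_t record_end, nrestarts, lo = 0, hi;
        const uint8_t *restarts;
        size_t pos;

        read_block_tail(block, block_size, &record_end, &nrestarts,
                        &restarts);
        hi = nrestarts;
        while (lo < hi) {               /* first restart sorting after want */
            uint32_t mid = lo + (hi - lo) / 2;
            if (cmp_at_restart(block, get_be32(restarts + 4 * mid), want) > 0)
                hi = mid;
            else
                lo = mid + 1;
        }
        if (!lo)
            return 0;                   /* sorts before the whole block */
        pos = get_be32(restarts + 4 * (lo - 1));
        name[0] = '\0';
        while (pos < record_end) {
            int c;
            read_ref_record(block, &pos, name);
            c = strcmp(name, want);
            if (!c)
                return 1;               /* found */
            if (c > 0)
                return 0;               /* sorted after: not present */
        }
        return 0;
    }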

#### Restart point selection

Writers determine the restart points at file creation.  The process is
arbitrary, but every 16 or 64 references is recommended.  Every 16 may
be more suitable for smaller block sizes (4k or 8k), every 64 for
larger block sizes (64k).

More frequent restart points reduce prefix compression and increase
space consumed by the restart table, both of which increase the
overall file size.

Less frequent restart points make prefix compression more effective,
decreasing overall file size, with increased penalties for readers
who must walk through more references after the binary search step.

## Considerations

### Lightweight refs dominate

The reftable format assumes the vast majority of references are single
SHA-1 valued with common prefixes, such as Gerrit Code Review's
`refs/changes/` namespace, GitHub's `refs/pulls/` namespace, or many
lightweight tags in the `refs/tags/` namespace.

Annotated tags storing the peeled object cost only an additional 20
bytes per reference.

### Low overhead

A reftable with very few references (e.g.  git.git with 5 heads) uses
only 204 bytes, vs.  332 bytes for packed-refs.  This supports
reftable scaling down, to be used for transaction logs (below).

### Block size

For a Gerrit Code Review type repository with many change refs, larger
block sizes (64 KiB) and less frequent restart points (every 64) yield
better compression, as more references within each block are able to
compress against the prior reference.

Larger block sizes reduce the index size, as the reftable will
require fewer blocks to store the same number of references.

### Minimal disk seeks

Assuming the index block has been loaded into memory, binary searching
for any single reference requires exactly 1 disk seek to load the
containing block.

## Repository format

When reftable is stored in a file-backed Git repository, the stack is
represented as a series of reftable files:

    $GIT_DIR/reftable
    $GIT_DIR/reftable.1
    $GIT_DIR/reftable.2
    $GIT_DIR/reftable.3
    ...
    $GIT_DIR/reftable.10

where a larger suffix ordinal indicates a more recent table.

### Transactions

Although reftables are immutable, they can be stacked in a search
pattern, with each reference transaction adding a new reftable to the
top of the stack.  Readers scan down the reftable stack from
most-recent (`reftable.10`) to the base file (`reftable`).
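
The read side is thus a merge-join across the stack.  A sketch, with
the iterator type and `iter_advance` assumed here rather than
specified by the proposal:

    #include <string.h>

    struct ref_iter {
        const char *name;   /* current refname, NULL at end of table */
        int type;           /* record type; 0x0 is a deletion */
        /* ... per-table cursor state ... */
    };
    void iter_advance(struct ref_iter *it);    /* assumed helper */

    /* Sketch: return the index of the iterator holding the next visible
     * reference, or -1 at end of stack.  iters[0] is the most recent
     * table; the caller consumes iters[i] and then advances it. */
    static int merged_next(struct ref_iter *iters, int n)
    {
        for (;;) {
            int win = -1, i;
            for (i = 0; i < n; i++) {   /* smallest name; newest wins ties */
                if (!iters[i].name)
                    continue;
                if (win < 0 || strcmp(iters[i].name, iters[win].name) < 0)
                    win = i;
            }
            if (win < 0)
                return -1;
            for (i = win + 1; i < n; i++)   /* skip shadowed older entries */
                if (iters[i].name && !strcmp(iters[i].name, iters[win].name))
                    iter_advance(&iters[i]);
            if (iters[win].type == 0x0) {   /* deletion hides the name */
                iter_advance(&iters[win]);
                continue;
            }
            return win;
        }
    }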

### Update process

Reference updates follow this protocol:

1. Atomically create `$GIT_DIR/reftable.lock`.
2. `readdir($GIT_DIR)` to determine the highest suffix ordinal, `n`.
3. Compute the update transaction (e.g. compare expected values).
4. Write only modified references as a reftable to `reftable.lock`.
5. Rename `reftable.lock` to `reftable.${n + 1}`.

Because a single `reftable.lock` file is used to manage locking, the
repository is single-threaded for writers.  Writers may have to
busy-spin (with some small backoff) around creating `reftable.lock`,
for up to an acceptable wait period, aborting if the repository is too
busy to mutate.  Application servers wrapped around repositories (e.g.
Gerrit Code Review) can layer their own in-memory lock/wait queue to
provide fairness.
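
A sketch of the locking step with POSIX calls; `O_CREAT|O_EXCL` makes
the creation atomic, and the retry bounds are illustrative:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Sketch: acquire $GIT_DIR/reftable.lock, retrying briefly.
     * Returns an fd to write the new table to, or -1 if too busy. */
    static int acquire_reftable_lock(const char *gitdir)
    {
        char path[4096];
        int attempt;

        snprintf(path, sizeof(path), "%s/reftable.lock", gitdir);
        for (attempt = 0; attempt < 50; attempt++) {
            int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0666);
            if (fd >= 0)
                return fd;      /* later: rename to reftable.${n + 1} */
            usleep(10000);      /* small backoff before retrying */
        }
        return -1;              /* repository too busy to mutate */
    }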

### Reference deletions

Deletion of any reference can be explicitly stored by setting the
`type` to `0x0` and omitting the `value` field of the `ref_record`.
This entry shadows the reference in lower files in the stack.

### Compaction

A stack of reftables can be compacted by merging references using a
straightforward merge join across all reftables, selecting the most
recent value for output, and omitting deleted references that do not
appear in remaining, lower reftables.

The stack can be collapsed as part of any update transaction.  If the
current number of files is larger than a threshold (e.g.  4), writers
can perform an lstat(2) on each reftable file to determine how many
bytes would have to be read and copied from each existing file into
the new file, enabling the existing files to be deleted afterwards.

Writers can choose to collapse the most recent files (e.g.  10, 9, 8,
...), up to a collapse IO threshold (e.g.  4 MiB).  Each file selected
for collapse must have its references merged into the new reftable
that is being prepared.
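
A sketch of that selection, walking down from the highest ordinal
until the IO budget is exhausted (the 4 MiB figure is the example from
above; names are illustrative):

    #include <stdio.h>
    #include <sys/stat.h>

    /* Sketch: how many of the most recent tables fit the IO budget? */
    static int select_for_collapse(const char *gitdir, int n)
    {
        off_t budget = 4 << 20;         /* e.g. 4 MiB threshold */
        int count = 0, i;

        for (i = n; i >= 1; i--) {
            char path[4096];
            struct stat st;

            snprintf(path, sizeof(path), "%s/reftable.%d", gitdir, i);
            if (lstat(path, &st) < 0 || st.st_size > budget)
                break;
            budget -= st.st_size;
            count++;            /* reftable.${n-count+1}..reftable.${n} */
        }
        return count;
    }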

Compaction is similar to the update process, but an explicit temporary
file must be used:

1. Atomically create `$GIT_DIR/reftable.lock`.
2. `readdir($GIT_DIR)` to determine the highest suffix ordinal, `n`.
3. Compute the update transaction (e.g. compare expected values).
4. Select files from (2) to collapse as part of this transaction.
5. Create temp file by `mktemp("$GIT_DIR/.reftableXXXXXX")`.
6. Write modified and collapsed references to temp file.
7. Rename temp file to `reftable.${n + 1}`.
8. Delete collapsed files `reftable.${n}`, `reftable.${n - 1}`, ...
9. Delete `reftable.lock`.

Because `reftable.9` can disappear after `reftable.10` is created,
readers receiving ENOENT when opening `reftable.9` must perform
another readdir to look for new reftables.

Rebuilding the base `$GIT_DIR/reftable` follows the same protocol,
except in step 7 the temp file is renamed to `reftable`, and step 8
removes all files with an ordinal suffix.

## Alternatives considered

### bzip packed-refs

`bzip2` can significantly shrink a large packed-refs file (e.g. 62
MiB compresses to 23 MiB, 37%).  However, the bzip2 format does not support
random access to a single reference. Readers must inflate and discard
while performing a linear scan.

Breaking packed-refs into chunks (individually compressing each chunk)
would reduce the amount of data a reader must inflate, but still
leaves the problem of indexing chunks to support readers efficiently
locating the correct chunk.

Given the compression ratio achieved by reftable's simple encoding
(e.g.  44% of the original size) without using a standard compression
algorithm, it does not seem necessary to add the complexity of
bzip/gzip/zlib.

### JGit Ketch RefTree

[JGit Ketch][ketch] proposed [RefTree][reftree], an encoding of
references inside Git tree objects stored as part of the repository's
object database.

The RefTree format adds additional load on the object database storage
layer (more loose objects, more objects in packs), and relies heavily
on the packer's delta compression to save space.  Namespaces which are
flat (e.g.  thousands of tags in refs/tags) initially create very
large loose objects, and so RefTree does not address the problem of
copying many references to modify a handful.

Flat namespaces are not efficiently searchable in RefTree, as tree
objects in canonical formatting cannot be binary searched. This fails
the need to handle a large number of references in a single namespace,
such as GitHub's `refs/pulls`, or a project with many tags.

[ketch]: https://dev.eclipse.org/mhonarc/lists/jgit-dev/msg03073.html
[reftree]: https://public-inbox.org/git/CAJo=hJvnAPNAdDcAAwAvU9C4RVeQdoS3Ev9WTguHx4fD0V_nOg@mail.gmail.com/

### LMDB

David Turner proposed [using LMDB][dt-lmdb], as LMDB is lightweight
(64k of runtime code) and has a GPL-compatible license.

A downside of LMDB is its reliance on a single C implementation.  This
makes embedding inside JGit (a popular reimplementation of Git)
difficult, and hoisting onto virtual storage (for JGit DFS) virtually
impossible.

A common format that can be supported by all major Git implementations
(git-core, JGit, libgit2) is strongly preferred.

[dt-lmdb]: https://public-inbox.org/git/1455772670-21142-26-git-send-email-dturner@twopensource.com/

## Future

### Longer hashes

The version number will be bumped (e.g.  to 2) to indicate that
`value` uses an object id length other than 20 bytes.  The length
could be stored in an expanded file header, or hardcoded as part of
the version.


* Re: reftable: new ref storage format
  2017-07-13  0:17 reftable: new ref storage format Shawn Pearce
@ 2017-07-13 19:32 ` Jeff King
  2017-07-13 19:56   ` Stefan Beller
  2017-07-14  0:11   ` Shawn Pearce
  2017-07-16 17:33 ` Michael Haggerty
  1 sibling, 2 replies; 25+ messages in thread
From: Jeff King @ 2017-07-13 19:32 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git

On Wed, Jul 12, 2017 at 05:17:58PM -0700, Shawn Pearce wrote:

> We've been having scaling problems with an insane number of references
> (>866k), so I started thinking a lot about improving ref storage.
> 
> I've written a simple approach, and implemented it in JGit.
> Performance is promising:
> 
>   - 62M packed-refs compresses to 27M
>   - 42.3 usec lookup

Exciting. I'd love for us to have a simple-ish on-disk structure that
scales well and doesn't involve a dependency on a third-party database
structure.

Let me see what holes I can poke in your proposal, though. :)

> ### Problem statement
> 
> Some repositories contain a lot of references (e.g.  android at 866k,
> rails at 31k).  The existing packed-refs format takes up a lot of
> space (e.g.  62M), and does not scale with additional references.
> Lookup of a single reference requires linearly scanning the file.

I think the linear scan is actually an implementation short-coming. Even
though the records aren't fixed-length, the fact that newlines can only
appear as end-of-record is sufficient to mmap and binary search a
packed-refs file (you just have to backtrack a little when you land in
the middle of a record).

I wrote a proof of concept a while ago, but got stuck on integrating it
into the ref code, because of some of the assumptions that it made.
Michael Haggerty has been doing several rounds of refactors to remove
those assumptions. I think we're pretty close (I've actually seen the
endgame where packed-refs is fully binary searched, but I think there
are a few more cleanups necessary to cover all cases).

> Atomic pushes modifying multiple references require copying the
> entire packed-refs file, which can be a considerable amount of data
> moved (e.g. 62M in, 62M out) for even small transactions (2 refs
> modified).

I think your definition of atomic here doesn't match what git.git does.

Our atomic push just takes the lock on all of the refs, and then once it
has all of them, commits all of the locks. So it's atomic in the sense
that you either get all or none of the writes (modulo a commit failure
in the middle, which we naturally have no rollback plan for). But it can
be done without touching the packed-refs file at all.

I imagine that you're looking at atomicity from the perspective of a
reader. In the git.git scheme, the reader may see a half-committed
transaction. If you dispense with loose refs entirely and treat the
packed-refs file as a single poorly-implemented key/value database, then
you get reader atomicity (but O(size_of_database) write performance).

> Repositories with many loose references occupy a large number of disk
> blocks from the local file system, as each reference is its own file
> storing 41 bytes.  This negatively affects the number of inodes
> available when a large number of repositories are stored on the same
> filesystem.  Readers are also penalized due to the larger number of
> syscalls required to traverse and read the `$GIT_DIR/refs` directory.

In my experience, the syscalls involved in loose refs aren't usually a
big deal. If you have 800k refs, they're not all changing constantly. So
a single pack-refs "fixes" performance going forward. What _is_ a big
deal is that the packing process is complicated, readers have a very
un-atomic view because of the myriad of files involved, and you get
annoying lock contention during packing, as well as between deletions
that have to rewrite packed-refs.

But I'm not sure if you meant to contrast here a system where we didn't
use packed-refs at all (though of course such a system is very much not
atomic by the definition above).


> ### Objectives
> 
> - Near constant time lookup for any single reference, even when the
>   repository is cold and not in process or kernel cache.

Good goal, though TBH I'd be happy with O(log n).

A related one is being able to traverse a subset of refs in
O(nr_traversed). E.g., "git tag -l" should not have to do work
proportional to what is in refs/changes. That falls out of most
proposals that allow fast lookups, but notably not a straight
hash-table.

> - Occupy less disk space for large repositories.

Good goal.  Just to play devil's advocate, the simplest way to do that
with the current code would be to gzip packed-refs (and/or store sha1s
as binary). That works against the "mmap and binary search" plan,
though. :)

> - Support atomic pushes with lower copying penalties.

"Lower" is vague. I'd hope we could do updates with effort linear to the
number of updated refs (it's OK if there's a constant factor, like
writing extra blocks, as long as a single-ref update is about as
expensive in a 10-ref repo as in a 10K-ref repo).

> Scan (read 866k refs) and lookup (single ref from 866k refs):
> 
> format      | scan    | lookup
> ------------|--------:|---------------:
> packed-refs |  380 ms | 375420.0 usec
> reftable    |  125 ms |     42.3 usec

Don't forget in git.git that the memory usage for packed-refs is
atrocious (because we parse the whole thing into RAM).

> ### Peeling
> 
> References in a reftable are always peeled.

Good. This is a very important optimization to retain.

> ### Reference name encoding
> 
> Reference names should be encoded with UTF-8.

Don't we usually treat refnames as byte sequences (subject to a few
rules, as in check_ref_format())? It seems like the encoding should be
out-of-scope for the storage format.

> ## File format

OK, let me try to summarize to see if I understand.

The reftable file is a sequence of blocks, each of which contains a
finite set of heavily-compressed refs. You have to read each block
sequentially, but since they're a fixed size, that's still a
constant-time operation (I'm ignoring the "restarts" thing for now). You
find the right block by reading the index.  So lookup really is more
like O(block_size * log(n/block_size)), but block_size being a constant,
it drops out to O(log n).

Linear scans are easy, because everything is in sorted order. So you
just find the first entry via binary search, and then walk forward.

Updates are where things get dicier. It looks like you just write a new
partial reftable file with your updates. And then if there are N
reftables present, readers actually have to do a list-merge of the
results they get from all of them (where the results from reftable.5
trump ones from reftable.4).

So basically we're just journaling updates into a directory of atomic
reftable updates. And then to keep the reader's job from getting too
painful, a write occasionally has to compact into a single reftable,
rewriting the entire ref store. That's what I see as the biggest
weakness here. If you keep too large a reftable stack, then readers have
to spend a lot of extra effort on lookups. But if you keep too small a
stack, then you are frequently rewriting the whole database.

Technically writes are still O(n). Because of the journaling you
amortize the whole-rewrite cost across several updates, but it's still
O(n/c). That seems like the biggest weakness of the scheme to me.

I think there's some cleverness you can use with compacting in a
geometric scheme, though, to amortize up to a certain bound. I didn't
see any discussion of that, though.

It's also possible I'm misunderstanding the writes. See below.

> Compaction is similar to the update process, but an explicit temporary
> file must be used:
> 
> 1. Atomically create `$GIT_DIR/reftable.lock`.
> 2. `readdir($GIT_DIR)` to determine the highest suffix ordinal, `n`.
> 3. Compute the update transaction (e.g. compare expected values).
> 4. Select files from (2) to collapse as part of this transaction.
> 5. Create temp file by `mktemp("$GIT_DIR/.reftableXXXXXX")`.
> 6. Write modified and collapsed references to temp file.
> 7. Rename temp file to `reftable.${n + 1}`.
> 8. Delete collapsed files `reftable.${n}`, `reftable.${n - 1}`, ...
> 9. Delete `reftable.lock`.

I had originally assumed you'd just compact back down to the reftable
file after some N updates (say, 10). But here, it looks like you'd
always compact 0-9 into 10, and then 10-19 into 20, and so on, and the
ordinal would go up forever.

I think that's OK, as it would take a long time to get unwieldy. And I
think you have to do it that way, as you can't atomically replace
"reftable" and delete .1-.9 at the same time.

> Because `reftable.9` can disappear after `reftable.10` is created,
> readers receiving ENOENT when opening `reftable.9` must perform
> another readdir to look for new reftables.

But after compaction, won't having "reftable.10" but no ".9" be the
steady state? As a reader, how can I tell the difference between these
two cases:

  1. Somebody created .10 and deleted .9 and lower.

  2. Somebody created .11 and deleted .10 and lower, while I was trying
     to read .9.

Is basically every read going to require:

  1. readdir to find the highest ordinal

  2. keep walking down the stack until you get ENOENT

  3. readdir again to make sure there's not a new ordinal

But in that case, if step 3 turns up a new reftable.11, how do I know
whether it's a compaction (in which case I need to restart my read from
.11) or if it's just another update-on-top? In a busy repository, you
might see a lot of update-on-tops.

> [...specifics...]

I liked a lot of what I saw in the rest of it (e.g., handling symrefs,
which packed-refs does not). Some bits seemed complicated. E.g., I
actually wonder how much restarts help in practice if you have
reasonably-sized blocks (and they complicate things a lot). Likewise
some bits are optional for very small reftable files to reduce overhead.
But if you have very small reftables, it's going to be fast either way.
If you waste 4K to store 200 bytes, that's fine as long as you're still
wasting only 4K when you store 200 megabytes.

That's all hand-waving, of course. But all things being equal, I'd
prefer to focus on algorithmic speedups, starting with the simplest thing
that could work, and then seeing real experimental numbers from
additions.

I also realize that beggars can't be choosers. If you have a working
system that performs well, I should consider shutting up. :)

One thing I didn't see is reflogs. They don't strictly have to be part
of a ref-storage solution. But they do still consume at least one inode
per ref in the current system, if you reflog everything (which in my
opinion you should). Having an audit trail of ref updates is very useful
for debugging (both bugs in Git, and trying to figure out what went
wrong when a user goes berserk with "git push --prune").

-Peff


* Re: reftable: new ref storage format
  2017-07-13 19:32 ` Jeff King
@ 2017-07-13 19:56   ` Stefan Beller
  2017-07-13 20:35     ` Jeff King
  2017-07-14  0:11   ` Shawn Pearce
  1 sibling, 1 reply; 25+ messages in thread
From: Stefan Beller @ 2017-07-13 19:56 UTC (permalink / raw)
  To: Jeff King; +Cc: Shawn Pearce, git

On Thu, Jul 13, 2017 at 12:32 PM, Jeff King <peff@peff.net> wrote:
> On Wed, Jul 12, 2017 at 05:17:58PM -0700, Shawn Pearce wrote:
>
>> We've been having scaling problems with an insane number of references
>> (>866k), so I started thinking a lot about improving ref storage.
>>
>> I've written a simple approach, and implemented it in JGit.
>> Performance is promising:
>>
>>   - 62M packed-refs compresses to 27M
>>   - 42.3 usec lookup
>
> Exciting. I'd love for us to have a simple-ish on-disk structure that
> scales well and doesn't involve a dependency on a third-party database
> structure.
>
> Let me see what holes I can poke in your proposal, though. :)
>
>> ### Problem statement
>>
>> Some repositories contain a lot of references (e.g.  android at 866k,
>> rails at 31k).  The existing packed-refs format takes up a lot of
>> space (e.g.  62M), and does not scale with additional references.
>> Lookup of a single reference requires linearly scanning the file.
>
> I think the linear scan is actually an implementation short-coming. Even
> though the records aren't fixed-length, the fact that newlines can only
> appear as end-of-record is sufficient to mmap and binary search a
> packed-refs file (you just have to backtrack a little when you land in
> the middle of a record).

Except that a record is a "delta" to the previous record, so it's not
just finding a record, but reconstructing it. Example for records:

    varint( prefix_length )
    varint( (suffix_length << 2) | type )
    suffix
    value?

First record:

    0,
    16 << 2 | 0x01,
    "refs/heads/maint",
    08f9c32463bf9e578acb7ac5f77afd36e803c6bc

next record (refs/heads/master):

    13,
    4 << 2 | 0x01,
    "ster",
    80145b1e412719c960036c8c62a9e35dd23a5b2d

Now if you found the second one, you cannot reconstruct its
real name (refs/heads/master) without knowing the name
of the first. The name of the first is easy because the prefix_length
is 0. If it also had a prefix_length != 0 you'd have to go back further.


>> - Occupy less disk space for large repositories.
>
> Good goal.  Just to play devil's advocate, the simplest way to do that
> with the current code would be to gzip packed-refs (and/or store sha1s
> as binary). That works against the "mmap and binary search" plan,
> though. :)

Given the compression by delta-ing the name to the previous change and
the fact that Gerrit has

  refs/heads/changes/1
  refs/heads/changes/2
  refs/heads/changes/3
  ...

I think this format would trump a "dumb" zip.
(Github having sequentially numbered pull requests would also
benefit here)

>> ## File format
>
> OK, let me try to summarize to see if I understand.

When Shawn presented the proposal, a couple of colleagues here
were as excited as I was, but the daring question is, why Shawn
did not give the whole thing in BNF format from top down:

  initial-block
  content-blocks*
  (index-block)
  footer

> The reftable file is a sequence of blocks, each of which contains a
> finite set of heavily-compressed refs. You have to read each block
> sequentially,

Each block may have restart points, which allow for intra-block
binary search.

> but since they're a fixed size, that's still a
> constant-time operation (I'm ignoring the "restarts" thing for now). You
> find the right block by reading the index.

or by reading the footer at the end. If the footer and the index differ
in block size (one bit flipped), we can ask the CRC of the footer
for more guidance.

>  So lookup really is more
> like O(block_size * log(n/block_size)), but block_size being a constant,
> it drops out to O(log n).

There is also an index block such that you can binary search across
blocks, so

O( log(block_count) + log(intra_block_restarting_points) + small linear scan)

There are 2 binary searches, and the block size is an interesting
thing to look at when making up trade offs.


* Re: reftable: new ref storage format
  2017-07-13 19:56   ` Stefan Beller
@ 2017-07-13 20:35     ` Jeff King
  2017-07-13 21:51       ` Eric Wong
  2017-07-14  0:27       ` Shawn Pearce
  0 siblings, 2 replies; 25+ messages in thread
From: Jeff King @ 2017-07-13 20:35 UTC (permalink / raw)
  To: Stefan Beller; +Cc: Shawn Pearce, git

On Thu, Jul 13, 2017 at 12:56:54PM -0700, Stefan Beller wrote:

> >> ### Problem statement
> >>
> >> Some repositories contain a lot of references (e.g.  android at 866k,
> >> rails at 31k).  The existing packed-refs format takes up a lot of
> >> space (e.g.  62M), and does not scale with additional references.
> >> Lookup of a single reference requires linearly scanning the file.
> >
> > I think the linear scan is actually an implementation short-coming. Even
> > though the records aren't fixed-length, the fact that newlines can only
> > appear as end-of-record is sufficient to mmap and binary search a
> > packed-refs file (you just have to backtrack a little when you land in
> > the middle of a record).
> 
> Except that a record is a "delta" to the previous record, so it's not
> just finding a record, but reconstructing it. Example for records:

I was still talking about the existing packed-refs implementation here.

I agree that a full binary search of a reftable is harder because of the
prefix compression (it may still be possible by scanning backwards, but
I think there are ambiguities when you land in the middle of a record,
since there's no unambiguous end-of-record character). But I don't think
it matters. If you binary-search to a constant-sized block, then a
linear scan of the block is acceptable.

> >> - Occupy less disk space for large repositories.
> >
> > Good goal.  Just to play devil's advocate, the simplest way to do that
> > with the current code would be to gzip packed-refs (and/or store sha1s
> > as binary). That works against the "mmap and binary search" plan,
> > though. :)
> 
> Given the compression by delta-ing the name to the previous change and
> the fact that Gerrit has
> 
>   refs/heads/changes/1
>   refs/heads/changes/2
>   refs/heads/changes/3
>   ...
> 
> I think this format would trump a "dumb" zip.
> (Github having sequentially numbered pull requests would also
> benefit here)

You may be surprised. Let's imagine that you have a set of 4096 refs in
refs/changes/1, refs/changes/2, etc:

  for i in $(seq 1 4096)
  do
    echo refs/changes/$i
  done >input

Now let's do a prefix compression, with a single byte for "how many
characters to reuse from the last entry":

  perl -lne '
    my $common;
    if (defined $last) {
      chop $last while !/\Q$last\E/;
      $common = length($last);
    } else {
      $common = 0;
    }
    print chr($common), substr($_, $common);
    $last = $_;
  ' <input >prefix

And a gzip:

  gzip -c -9 <input >zip

And the results:

  $ wc -c prefix; wc -c zip
  12754 prefix
  10116 zip

The secret sauce is most likely that gzip is bit-packing, using only a
few bits per character and not aligning with byte boundaries.

Not that I'm recommending just gzipping the whole packed-refs file. It
ruins the fast-lookup. We _could_ consider gzipping individual blocks of
a reftable (or any structure that allows you to search to a
constant-sized block and do a linear search from there). But given that
they're in the same ballpark, I'm happy with whatever ends up the
simplest to code and debug. ;)

Just for fun, here's the decoding script for the prefix-compression:

  perl -e '
    while (read(STDIN, $common, 1)) {
      $common = ord($common);
      $rest = <STDIN>;
      if ($common > 0) {
        $rest = substr($last, 0, $common) . $rest;
      }
      print $rest;
      $last = $rest;
    }
  ' <prefix

> > OK, let me try to summarize to see if I understand.
> 
> When Shawn presented the proposal, a couple of colleagues here
> were as excited as I was, but the daring question is, why Shawn
> did not give the whole thing in BNF format from top down:
> 
>   initial-block
>   content-blocks*
>   (index-block)
>   footer

Yeah, I agree it took me a bit to figure out what was going on. A
high-level overview of the format would have been nice.

> >  So lookup really is more
> > like O(block_size * log(n/block_size)), but block_size being a constant,
> > it drops out to O(log n).
> 
> There is also an index block such that you can binary search across
> blocks, so
> 
> O( log(block_count) + log(intra_block_restarting_points) + small linear scan)
> 
> There are 2 binary searches, and the block size is an interesting
> thing to look at when making up trade offs.

Right, the cross-block index was what I was trying to account for.
Either way, from a big-O perspective the block size and the number of
restarts are constants with respect to the total number of entries. I'm
happy with log(n), though. It's hard to do better.

-Peff


* Re: reftable: new ref storage format
  2017-07-13 20:35     ` Jeff King
@ 2017-07-13 21:51       ` Eric Wong
  2017-07-14  0:27       ` Shawn Pearce
  1 sibling, 0 replies; 25+ messages in thread
From: Eric Wong @ 2017-07-13 21:51 UTC (permalink / raw)
  To: Jeff King; +Cc: Stefan Beller, Shawn Pearce, git

Jeff King <peff@peff.net> wrote:
> I agree that a full binary search of a reftable is harder because of the
> prefix compression (it may still be possible by scanning backwards, but
> I think there are ambiguities when you land in the middle of a record,
> since there's no unambiguous end-of-record character). But I don't think
> it matters. If you binary-search to a constant-sized block, then a
> linear scan of the block is acceptable.

For a new packed-refs, I think an intrusive critbit tree would
be a good way to store refs which have many common prefixes and
I've always wanted to apply critbit to an on-disk storage
format...

Several years ago, I started writing one in Perl using
pwrite/pread to provide Message-ID <=> NNTP article number
mapping, but gave up on it out of laziness:

  https://80x24.org/spew/1441508596-19511-1-git-send-email-e@80x24.org/raw

The end goal would've been to have two tries sharing the same
storage struct: one keyed by Message-ID, the other keyed by NNTP
article number (figuring out the node using offsets like
we do with (container_of|list_entry) in list.h).

For git, being able to do an O(hashlength) prefix search based
on the object_id from the reftable would speed up decorations, I
think.  And of course, the O(refnamelength) prefix search would
also apply to the refnames themselves.


* Re: reftable: new ref storage format
  2017-07-13 19:32 ` Jeff King
  2017-07-13 19:56   ` Stefan Beller
@ 2017-07-14  0:11   ` Shawn Pearce
  2017-07-14 14:27     ` Dave Borowitz
  2017-07-14 20:08     ` Jeff King
  1 sibling, 2 replies; 25+ messages in thread
From: Shawn Pearce @ 2017-07-14  0:11 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Thu, Jul 13, 2017 at 12:32 PM, Jeff King <peff@peff.net> wrote:
> On Wed, Jul 12, 2017 at 05:17:58PM -0700, Shawn Pearce wrote:
>
>> ### Problem statement
>>
>> Some repositories contain a lot of references (e.g.  android at 866k,
>> rails at 31k).  The existing packed-refs format takes up a lot of
>> space (e.g.  62M), and does not scale with additional references.
>> Lookup of a single reference requires linearly scanning the file.
>
> I think the linear scan is actually an implementation short-coming. Even
> though the records aren't fixed-length, the fact that newlines can only
> appear as end-of-record is sufficient to mmap and binary search a
> packed-refs file (you just have to backtrack a little when you land in
> the middle of a record).
>
> I wrote a proof of concept a while ago, but got stuck on integrating it
> into the ref code, because of some of the assumptions that it made.
> Michael Haggerty has been doing several rounds of refactors to remove
> those assumptions. I think we're pretty close (I've actually seen the
> endgame where packed-refs is fully binary searched, but I think there
> are a few more cleanups necessary to cover all cases).

You are correct, this is possible with the current packed-refs format.
It just hasn't materialized in a shipping implementation yet.


>> Atomic pushes modifying multiple references require copying the
>> entire packed-refs file, which can be a considerable amount of data
>> moved (e.g. 62M in, 62M out) for even small transactions (2 refs
>> modified).
>
> I think your definition of atomic here doesn't match what git.git does.

:-(


> Our atomic push just takes the lock on all of the refs, and then once it
> has all of them, commits all of the locks. So it's atomic in the sense
> that you either get all or none of the writes (modulo a commit failure
> in the middle, which we naturally have no rollback plan for). But it can
> be done without touching the packed-refs file at all.
>
> I imagine that you're looking at atomicity from the perspective of a
> reader. In the git.git scheme, the reader may see a half-committed
> transaction. If you dispense with loose refs entirely and treat the
> packed-refs file as a single poorly-implemented key/value database, then
> you get reader atomicity (but O(size_of_database) write performance).

Yes, I was hoping for reader atomicity. But I may be OK foregoing that if
the transaction either all goes through, or all fails. A partially
stuck transaction because the process died in the middle of the commit
step creates a mess for an administrator to undo. Does she rename
"foo.lock" to "foo"? Or delete "foo.lock"?


>> Repositories with many loose references occupy a large number of disk
>> blocks from the local file system, as each reference is its own file
>> storing 41 bytes.  This negatively affects the number of inodes
>> available when a large number of repositories are stored on the same
>> filesystem.  Readers are also penalized due to the larger number of
>> syscalls required to traverse and read the `$GIT_DIR/refs` directory.
>
> In my experience, the syscalls involved in loose refs aren't usually a
> big deal. If you have 800k refs, they're not all changing constantly. So
> a single pack-refs "fixes" performance going forward. What _is_ a big
> deal is that the packing process is complicated, readers have a very
> un-atomic view because of the myriad of files involved, and you get
> annoying lock contention during packing, as well as between deletions
> that have to rewrite packed-refs.
>
> But I'm not sure if you meant to contrast here a system where we didn't
> use packed-refs at all (though of course such a system is very much not
> atomic by the definition above).

No, I really did mean the current system. Gerrit Code Review servers
create a lot of references throughout the day. It's easy to accumulate
a few thousand new loose references in a 24 hour period. Even if you
have 600k existing refs in packed-refs, you still have 2k new/modified
refs since last nightly cron ran git gc.


>> ### Objectives
>>
>> - Near constant time lookup for any single reference, even when the
>>   repository is cold and not in process or kernel cache.
>
> Good goal, though TBH I'd be happy with O(log n).
>
> A related one is being able to traverse a subset of refs in
> O(nr_traversed). E.g., "git tag -l" should not have to do work
> proportional to what is in refs/changes. That falls out of most
> proposals that allow fast lookups, but notably not a straight
> hash-table.

Thanks, I missed that in this list, even though it was an explicit
objective going into this work. I added:

- Efficient lookup of an entire namespace, such as `refs/tags/`.


>> - Occupy less disk space for large repositories.
>
> Good goal.  Just to play devil's advocate, the simplest way to do that
> with the current code would be to gzip packed-refs (and/or store sha1s
> as binary). That works against the "mmap and binary search" plan,
> though. :)

Yes it does. I tried to cover that later under "Alternatives
considered > bzip". :)


>> ### Reference name encoding
>>
>> Reference names should be encoded with UTF-8.
>
> Don't we usually treat refnames as byte sequences (subject to a few
> rules, as in check_ref_format())? It seems like the encoding should be
> out-of-scope for the storage format.

True that git-core treats them as byte sequences, but JGit treats them as UTF-8.


>> ## File format
>
> OK, let me try to summarize to see if I understand.
>
> The reftable file is a sequence of blocks, each of which contains a
> finite set of heavily-compressed refs. You have to read each block
> sequentially, but since they're a fixed size, that's still a
> constant-time operation (I'm ignoring the "restarts" thing for now). You
> find the right block by reading the index.  So lookup really is more
> like O(block_size * log(n/block_size)), but block_size being a constant,
> it drops out to O(log n).

Yes. My "near constant time" claim was because I had my head buried
thinking about disk IO operations when I wrote this, not the algorithm
that happens on the CPU. One of the applications for reftable is on a
slower-than-usual storage system, where reading a block of a file
costs enough milliseconds that it doesn't matter how good or bad the
CPU algorithm is.


> Linear scans are easy, because everything is in sorted order. So you
> just find the first entry via binary search, and then walk forward.

Yup.


> Updates are where things get dicier. It looks like you just write a new
> partial reftable file with your updates. And then if there are N
> reftables present, readers actually have to do a list-merge of the
> results they get from all of them (where the results from reftable.5
> trump ones from reftable.4).

Correct.


> So basically we're just journaling updates into a directory of atomic
> reftable updates. And then to keep the reader's job from getting too
> painful, a write occasionally has to compact into a single reftable,
> rewriting the entire ref store.

Correct.


> That's what I see as the biggest
> weakness here. If you keep too large a reftable stack, then readers have
> to spend a lot of extra effort on lookups. But if you keep too small a
> stack, then you are frequently rewriting the whole database.

But as you say yourself below, you can do this using a geometric
scheme or something to bound the cost of these rewrites such that you
aren't frequently rewriting the whole database.


> Technically writes are still O(n). Because of the journaling you
> amortize the whole-rewrite cost across several updates, but it's still
> O(n/c). That seems like the biggest weakness of the scheme to me.
>
> I think there's some cleverness you can use with compacting in a
> geometric scheme, though, to amortize up to a certain bound. I didn't
> see any discussion of that, though.

I think I left this as an exercise for the implementer. The
"Transactions > Compaction" section talks about the compaction
algorithm being applied only near the top of the stack, ignoring the
base table(s).


>> Compaction is similar to the update process, but an explicit temporary
>> file must be used:
>>
>> 1. Atomically create `$GIT_DIR/reftable.lock`.
>> 2. `readdir($GIT_DIR)` to determine the highest suffix ordinal, `n`.
>> 3. Compute the update transaction (e.g. compare expected values).
>> 4. Select files from (2) to collapse as part of this transaction.
>> 5. Create temp file by `mktemp("$GIT_DIR/.reftableXXXXXX")`.
>> 6. Write modified and collapsed references to temp file.
>> 7. Rename temp file to `reftable.${n + 1}`.
>> 8. Delete collapsed files `reftable.${n}`, `reftable.${n - 1}`, ...
>> 9. Delete `reftable.lock`.
>
> I had originally assumed you'd just compact back down to the reftable
> file after some N updates (say, 10). But here, it looks like you'd
> always compact 0-9 into 10, and then 10-19 into 20, and so on, and the
> ordinal would go up forever.
>
> I think that's OK, as it would take a long time to get unwieldy. And I
> think you have to do it that way, as you can't atomically replace
> "reftable" and delete .1-.9 at the same time.
>
>> Because `reftable.9` can disappear after `reftable.10` is created,
>> readers receiving ENOENT when opening `reftable.9` must perform
>> another readdir to look for new reftables.
>
> But after compaction, won't having "reftable.10" but no ".9" be the
> steady state?

Yes, it will be. You'd have maybe "reftable" and "reftable.10", and
nothing else.


> As a reader, how can I tell the difference between these
> two cases:
>
>   1. Somebody created .10 and deleted .9 and lower.
>
>   2. Somebody created .11 and deleted .10 and lower, while I was trying
>      to read .9.
>
> Is basically every read going to require:
>
>   1. readdir to find the highest ordinal
>
>   2. keep walking down the stack until you get ENOENT
>
>   3. readdir again to make sure there's not a new ordinal
>
> But in that case, if step 3 turns up a new reftable.11, how do I know
> whether it's a compaction (in which case I need to restart my read from
> .11) or if it's just another update-on-top? In a busy repository, you
> might see a lot of update-on-tops.

I think I was imagining updates are less frequent than reads, and a
reader is going to readdir(), and then immediately open every file in
the stack to setup the merge-join iteration. If the reader retains the
file descriptor, the reader can keep that file in their stack.

There is risk of a reader live-locking; the reader might have done a
readdir, starts opening the stack, sees ENOENT. In which case the
reader starts over. If an updater is performing compactions faster
than a reader can readdir and open paths, its live-lock for the
reader. That certainly is one motivation to not always perform a
compaction.


>> [...specifics...]
>
> I liked a lot of what I saw in the rest of it (e.g., handling symrefs,
> which packed-refs does not). Some bits seemed complicated. E.g., I
> actually wonder how much restarts help in practice if you have
> reasonably-sized blocks (and they complicate things a lot).

My implementation lets me tweak the restart table with a command line
option, so I was able to run a number of experiments for you. A 64k
block doesn't fit more than a few thousand references, so restart 4000
effectively disables restarts.

block  restart  size  lookup
4k     16       29M    90.2 usec
8k     16       29M    76.4 usec

64k    16       29M   147.7 usec
64k    64       28M   134.3 usec
64k    256      27M   143.4 usec

4k     4000     28M   104.0 usec
8k     4000     28M   117.1 usec
64k    4000     27M   288.5 usec

Turning off restarts shrinks the file, and increases lookup time.

For $REASONS, I favor a larger block size in some cases, even though
the lookup times get worse. For example, being able to use 64k/64 may
be a sweet spot for that particular IO system I mentioned above.

Fun fact: gzip packed-refs for this data set is 27M. The 64k/256 is
only 432 KiB larger than gzip (default compression).


> Likewise
> some bits are optional for very small reftable files to reduce overhead.
> But if you have very small reftables, it's going to be fast either way.
> If you waste 4K to store 200 bytes, that's fine

Not really. I store a lot of very small pack files; the 1k idx v2
header is pretty annoying for these. It dwarfs the rest of the
information, and in some cases it dwarfs the pack file itself. Not
being able to forgo the fanout table when you have a pack of 4 objects
of 3 trees and 1 commit is pretty damn annoying.

> as long as you're still
> wasting only 4K when you store 200 megabytes.

I don't think this is fair. It's better thought of as a ratio.

It depends on the parameters to the writer, but reftable was "wasting"
anywhere between 20K-44K of NUL byte padding for the various
experiments above, and the index was anywhere from 12K-185K, depending
on the block size (smaller block size == larger index).

Wasting 44K on padding, 12K on index, to compress 62M down to 27M...
a penalty of 0.2% of that 27M. That seems acceptable.

5 refs in reftable is ~204 bytes, because of the optional features
being disabled on small files. If reftable was forced to fill out to a
full 4K block, that is a penalty of 1907%. 4K might seem like
nothing, but for cases where the table has to travel on the network,
or is being stored in a tail-block-optimized filesystem, it's a huge
waste to pad the file out.


> I also realize that beggars can't be choosers. If you have a working
> system that performs well, I should consider shutting up. :)

I have it in Java for JGit; I don't yet have a C implementation.


> One thing I didn't see is reflogs. They don't strictly have to be part
> of a ref-storage solution. But they do still consume at least one inode
> per ref in the current system, if you reflog everything (which in my
> opinion you should). Having an audit trail of ref updates is very useful
> for debugging (both bugs in Git, and trying to figure out what went
> wrong when a user goes berserk with "git push --prune").

Yea, I glossed over that and ignored them. Mostly because one system
where I want to use reftable already has a separate database handling
the reflogs. In another (Gerrit Code Review), we disable reflogs for
the insane refs/changes/ namespace, as nearly every reference is
created once, and never modified.

One could abuse the reftable format to store a reflog. If the refname
field is just a sequence of bytes, one could store keys like refname +
'\0' + timestamp, and reuse the symbolic ref value format to store the
old/new/message, as it's just a length-delimited string.

I'm loath to allocate more bits to denote a reflog entry vs. ref entry
in the same file, but I could see some advantages to that. Commits
would only have to write one new reftable for the combined update +
log record.


* Re: reftable: new ref storage format
  2017-07-13 20:35     ` Jeff King
  2017-07-13 21:51       ` Eric Wong
@ 2017-07-14  0:27       ` Shawn Pearce
  2017-07-14 20:10         ` Jeff King
  1 sibling, 1 reply; 25+ messages in thread
From: Shawn Pearce @ 2017-07-14  0:27 UTC (permalink / raw)
  To: Jeff King; +Cc: Stefan Beller, git

On Thu, Jul 13, 2017 at 1:35 PM, Jeff King <peff@peff.net> wrote:
> On Thu, Jul 13, 2017 at 12:56:54PM -0700, Stefan Beller wrote:
>
> I agree that a full binary search of a reftable is harder because of the
> prefix compression (it may still be possible by scanning backwards, but
> I think there are ambiguities when you land in the middle of a record,
> since there's no unambiguous end-of-record character).

It's impossible to safely binary search this reftable format with a
naive "divide the byte count in half and find a record boundary"
approach. I actually did design an earlier version of reftable that
was safe to binary search within blocks that way, and wound up
discarding it. It was a slower and more complex implementation than
the format I shared with the list.
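
The format side-steps the ambiguity with the restart table instead:
only records whose offsets appear in that table are guaranteed to
begin with a complete, uncompressed refname, so those are the only
safe probe points. A rough sketch of a block lookup along those lines
(fullNameAt() and restartOffset() are assumed helpers):

    // Sketch: binary search over a block's restart table. Records at
    // restart offsets store their full name (prefix_length == 0), so
    // comparisons are well defined; a linear scan finishes the job.
    int lo = 0, hi = restartCount;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      String name = fullNameAt(restartOffset(mid));
      if (name.compareTo(target) < 0)
        lo = mid + 1;
      else
        hi = mid;
    }
    // Scan forward from restart (lo - 1), undoing prefix compression,
    // until reaching a record that sorts at or after target.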


> But I don't think
> it matters. If you binary-search to a constant-sized block, then a
> linear scan of the block is acceptable.

Depends on the block size. :)


> Not that I'm recommending just gzipping the whole packed-refs file. It
> ruins the fast-lookup.

As I just mentioned elsewhere in the thread:

  src file    65306185
  gzip        28338906
  reftable    28782292

The reftable format (for 64k block, 256 restart) is within spitting
distance (432 KiB) of a default-level gzip of packed-refs. We can get
fast lookup and OK compression.


> We _could_ consider gzipping individual blocks of
> a reftable (or any structure that allows you to search to a
> constant-sized block and do a linear search from there). But given that
> they're in the same ballpark, I'm happy with whatever ends up the
> simplest to code and debug. ;)

This does help to shrink the file, e.g. it drops from 28M to 23M.

It makes it more CPU costly to access a block, as we have to inflate
it before walking through the records. It also messes with alignment:
a block you touch may straddle two virtual memory pages in your
kernel/filesystem.

I'm not sure those penalties are worth the additional 16% reduction in size.


>> When Shawn presented the proposal, a couple of colleagues here
>> were as excited as I was, but the daring question is, why Shawn
>> did not give the whole thing in BNF format from top down:
>>
>>   initial-block
>>   content-blocks*
>>   (index-block)
>>   footer
>
> Yeah, I agree it took me a bit to figure out what was going on. A
> high-level overview of the format would have been nice.

Noted, I've added this to my writeup.

* Re: reftable: new ref storage format
  2017-07-14  0:11   ` Shawn Pearce
@ 2017-07-14 14:27     ` Dave Borowitz
  2017-07-14 15:31       ` Shawn Pearce
  2017-07-14 20:08     ` Jeff King
  1 sibling, 1 reply; 25+ messages in thread
From: Dave Borowitz @ 2017-07-14 14:27 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Jeff King, git

On Thu, Jul 13, 2017 at 8:11 PM, Shawn Pearce <spearce@spearce.org> wrote:
> In another (Gerrit Code Review), we disable reflogs for
> the insane refs/changes/ namespace, as nearly every reference is
> created once, and never modified.

Apologies for the tangent, but this is not true in the most recent
Gerrit implementation. We update refs/changes/CD/ABCD/1 and
refs/changes/CD/ABCD/meta in a single BatchRefUpdate, and we set a
reflog message on the BatchRefUpdate instance, which updates the
reflog for all refs in the batch. The reflog message on /meta is
important, and arguably it's useful to be able to correlate that with
the reflog on /1.

If you think storing reflogs on patch set refs is going to be a
problem wrt on-disk storage, we should discuss this offline :)

* Re: reftable: new ref storage format
  2017-07-14 14:27     ` Dave Borowitz
@ 2017-07-14 15:31       ` Shawn Pearce
  0 siblings, 0 replies; 25+ messages in thread
From: Shawn Pearce @ 2017-07-14 15:31 UTC (permalink / raw)
  To: Dave Borowitz; +Cc: Jeff King, git

On Fri, Jul 14, 2017 at 7:27 AM, Dave Borowitz <dborowitz@google.com> wrote:
> On Thu, Jul 13, 2017 at 8:11 PM, Shawn Pearce <spearce@spearce.org> wrote:
>> In another (Gerrit Code Review), we disable reflogs for
>> the insane refs/changes/ namespace, as nearly every reference is
>> created once, and never modified.
>
> Apologies for the tangent, but this is not true in the most recent
> Gerrit implementation. We update refs/changes/CD/ABCD/1 and
> refs/changes/CD/ABCD/meta in a single BatchRefUpdate, and we set a
> reflog message on the BatchRefUpdate instance, which updates the
> reflog for all refs in the batch. The reflog message on /meta is
> important, and arguably it's useful to be able to correlate that with
> the reflog on /1.
>
> If you think storing reflogs on patch set refs is going to be a
> problem wrt on-disk storage, we should discuss this offline :)

Reflog storage is a problem for Gerrit. It was a problem in early 2009
when servers had far fewer changes. It's going to be even more of a
problem now. Sounds like we have to support reflogs in reftable, or
something like it.

* Re: reftable: new ref storage format
  2017-07-14  0:11   ` Shawn Pearce
  2017-07-14 14:27     ` Dave Borowitz
@ 2017-07-14 20:08     ` Jeff King
  2017-07-16  6:01       ` Shawn Pearce
  2017-07-16  8:07       ` Johannes Sixt
  1 sibling, 2 replies; 25+ messages in thread
From: Jeff King @ 2017-07-14 20:08 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git

On Thu, Jul 13, 2017 at 05:11:52PM -0700, Shawn Pearce wrote:

> Yes, I was hoping for reader atomicity. But I may be OK foregoing that
> if the transaction either all goes through, or all fails. A partially
> stuck transaction because the process died in the middle of the commit
> step creates a mess for an administrator to undo. Does she rename
> "foo.lock" to "foo"? Or delete "foo.lock"?

Agreed, there's no real rollback or recovery process. I do think
shooting for reader atomicity is worth doing. Lack of atomicity can
cause odd things to happen with operations like pruning. If I'm trying
to get a list of all of the reachable objects, I might have to
readdir() a bunch of directories (let's forget even that a single
readdir() is not necessarily atomic). If I try to atomically move
"refs/heads/z/foo" to "refs/heads/a/foo" there is a reasonable chance
that a reader may see only the deletion and not the addition.

I don't have any known cases of this biting anyone, but it's somewhat
scary.

> > But I'm not sure if you meant to contrast here a system where we didn't
> > use packed-refs at all (though of course such a system is very much not
> > atomic by the definition above).
> 
> No, I really did mean the current system. Gerrit Code Review servers
> create a lot of references throughout the day. It's easy to accumulate
> a few thousand new loose references in a 24 hour period. Even if you
> have 600k existing refs in packed-refs, you still have 2k new/modified
> refs since the last nightly cron ran git gc.

I do think you'd be better served by just calling pack-refs more
frequently, then. Nightly is too infrequent for a busy repo. And under
something like reftables, you'd end up doing the equivalent of a
pack-refs every N updates anyway.

We actually pack refs quite aggressively at GitHub. Way more than I
would consider reasonable, but it's never been a big bottleneck, so I've
never looked into it. We don't do it for every update, but every update
triggers a "consider syncing objects into shared storage" job, which
will pack the refs. So in a hypothetical repo that's constantly updating
we probably pack refs at least once a minute.

But we're generally on low-latency local disks. It sounds like you
emphatically are not.

> > Good goal.  Just to play devil's advocate, the simplest way to do that
> > with the current code would be to gzip packed-refs (and/or store sha1s
> > as binary). That works against the "mmap and binary search" plan,
> > though. :)
> 
> Yes it does. I tried to cover that later under "Alternatives
> considered > bzip". :)

Yeah, sorry, I read and responded to the document a bit out of order. I
agree it's a dead-end. :)

> >> Reference names should be encoded with UTF-8.
> >
> > Don't we usually treat refnames as byte sequences (subject to a few
> > rules, as in check_ref_format())? It seems like the encoding should be
> > out-of-scope for the storage format.
> 
> True that git-core treats them as byte sequences, but JGit treats them as UTF-8.

I think we can probably stick with that, unless the UTF-8ness is really
important? I guess it might matter if it impacts the sorting order.

> Yes. My "near constant time" claim was because I had my head buried
> thinking about disk IO operations when I wrote this, not the algorithm
> that happens on the CPU. One of the applications for reftable is on a
> slower-than-usual storage system, where reading a block of a file
> costs enough milliseconds that it doesn't matter how good or bad the
> CPU algorithm is.

OK, that explains a lot of the decisions better. If you're planning on
evolving this proposal document, I think it would make sense to talk
about that in the objectives section. (I say "if" because I am happy for
the mailing list discussion to serve as a rationale document).

> But you say this yourself below, you can do this using a geometric
> scheme or something to bound the cost of these rewrites such that you
> aren't frequently rewriting the whole database.

Right, but that sounds like math. I wanted you to spoonfeed me the
geometric algorithm (and its bound proof). ;)

> >> Compaction is similar to the update process, but an explicit temporary
> >> file must be used:
> >>
> >> 1. Atomically create `$GIT_DIR/reftable.lock`.
> >> 2. `readdir($GIT_DIR)` to determine the highest suffix ordinal, `n`.
> >> 3. Compute the update transaction (e.g. compare expected values).
> >> 4. Select files from (2) to collapse as part of this transaction.
> >> 5. Create temp file by `mktemp("$GIT_DIR/.reftableXXXXXX")`.
> >> 6. Write modified and collapsed references to temp file.
> >> 7. Rename temp file to `reftable.${n + 1}`.
> >> 8. Delete collapsed files `reftable.${n}`, `reftable.${n - 1}`, ...
> >> 9. Delete `reftable.lock`.

I think the "stack" implementation is what makes me most uncomfortable
with this proposal. Atomicity with filesystem operations and especially
readdir() is one of the things I think is most flaky about the current
system. Here's an idea for an alternative implementation.

  1. Write out reftables to files named after the hash of their content
     (e.g., $GIT_DIR/reftables/1234abcd...).

  2. The first block of each reftable has a backpointer to the
     previous table.

  3. There's a well-known name (e.g., $GIT_DIR/reftable) that represents
     the current state. We update it with the usual .lock/rename dance.

That gives readers an easy atomic view; once they've called open() on
"reftable", they follow back-pointers instead of computing the names
(n-1, n-2, etc). They may still find that a compaction has removed a
file they need, but:

  - they only have to restart due to an actual compaction. They'll never
    be tricked by a regular update.

  - you can compact without immediately deleting the old reftables. So
    you might compact, and then delete the reftables N seconds later. Any
    reader which completes the read within N seconds wouldn't have to
    restart.
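
For illustration, the reader's snapshot under this scheme might look
like the sketch below (Reftable.open() and previousTableName() are
assumed helpers, not an existing API):

    // Sketch: snapshot the stack by opening the well-known tip, then
    // following backpointers. ENOENT mid-walk means a compaction
    // removed a table, and the reader restarts from "reftable".
    List<Reftable> stack = new ArrayList<>();
    Reftable t = Reftable.open(gitDir.resolve("reftable"));
    while (t != null) {
      stack.add(t);
      String prev = t.previousTableName(); // from the first block
      t = (prev == null)
          ? null
          : Reftable.open(gitDir.resolve("reftables").resolve(prev));
    }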

I think I can anticipate your answer, though. If you have a system where
the latency to open and read a file is high, then you've just serialized
the latencies as you walk the chain. Whereas with predictable names, you
can pre-fetch refname.i through refname.j in parallel.

How deep would you anticipate stacks getting? Would it be feasible for
the tip to contain the names of the tables in the entire chain? If we're
talking about 20 (or even 32) bytes per name, you could still fit over a
hundred names in a 4K inode.

It doesn't escape me that I'm basically reinventing RefTree here, with
reftables instead of tree objects. But I think breaking away from using
real Git objects opens up a lot of efficiency tricks (like the prefix
compression, and the parallel-fetch thing above). And it removes a lot
of the gc complexity.

> I think I was imagining updates are less frequent than reads, and a
> reader is going to readdir(), and then immediately open every file in
> the stack to setup the merge-join iteration. If the reader retains the
> file descriptor, the reader can keep that file in their stack.
> 
> There is risk of a reader live-locking; the reader might have done a
> readdir, starts opening the stack, sees ENOENT. In which case the
> reader starts over. If an updater is performing compactions faster
> than a reader can readdir and open paths, it's live-lock for the
> reader. That certainly is one motivation to not always perform a
> compaction.

I guess I don't have much faith in the atomicity of readdir(). And it
would be nice if we could tell the difference between "oops, reftable.5
was racily deleted" and "it is not supposed to be there due to a
previous compaction". So I foresee always walking back to 0, stopping at
the first ENOENT, and then doing a final readdir() to see if any new
items appeared. If one has, we can't tell if it's a compaction in
progress or a regular update, and we have to restart.

So I'm worried about live-locking with a regular updater, not even a
compacting one.

> > I liked a lot of what I saw in the rest of it (e.g., handling symrefs,
> > which packed-refs does not). Some bits seemed complicated. E.g., I
> > actually wonder how much restarts help in practice if you have
> > reasonably-sized blocks, and they complicate things a lot.
> 
> My implementation lets me tweak the restart table with a command line
> option, so I was able to run a number of experiments for you. A 64k
> block doesn't fit more than a few thousand references, so restart 4000
> effectively disables restarts.
> 
> block  restart  size  lookup
> 4k     16       29M    90.2 usec
> 8k     16       29M    76.4 usec
> 
> 64k    16       29M   147.7 usec
> 64k    64       28M   134.3 usec
> 64k    256      27M   143.4 usec
> 
> 4k     4000     28M   104.0 usec
> 8k     4000     28M   117.1 usec
> 64k    4000     27M   288.5 usec

Thanks for these numbers. I was really thinking that blocks would be on
the order of 4K (where you can see that the restarts help very little).
For local disk that's a pretty reasonable size. For high-latency fetches
to a specialized database, maybe not.

> For $REASONS, I favor a larger block size in some cases, even though
> the lookup times get worse. For example, being able to use 64k/64 may
> be a sweet spot for that particular IO system I mentioned above.

Right, that makes a lot more sense.

> > as long as you're still
> > wasting only 4K when you store 200 megabytes.
> 
> I don't think this is fair. It's better thought of as a ratio.
> [...]
> 5 refs in reftable is ~204 bytes, because the optional features are
> disabled on small files. If reftable were forced to fill out a full 4K
> block, that is a penalty of ~1907%. 4K might seem like nothing, but
> for cases where the table has to travel on the network, or is stored
> in a tail-block-optimized filesystem, it's a huge waste to pad the
> file out.

Yeah, my assumption was that anything under 4K is basically going to
take 4K. But I agree it's very dependent on the underlying storage
mechanism.

> > I also realize that beggars can't be choosers. If you have a working
> > system that performs well, I should consider shutting up. :)
> 
> I have it in Java for JGit; I don't yet have a C implementation.

I'm a beggar; I'll take even a well-developed plan. :)

The implementation on this doesn't seem overly complex. My main concerns
are what we're asking from the filesystem in terms of atomicity, and
what possible races there are.

> > One thing I didn't see is reflogs. They don't strictly have to be part
> > of a ref-storage solution. But they do still consume at least one inode
> > per ref in the current system. If you reflog everything, which in my
> > opinion you should. Having an audit trail of ref updates is very useful
> > for debugging (both bugs in Git, and trying to figure out what went
> > wrong when a user goes berserk with "git push --prune").
> 
> Yeah, I glossed over that and ignored them, mostly because one system
> where I want to use reftable already has a separate database handling
> the reflogs. In another (Gerrit Code Review), we disable reflogs for
> the insane refs/changes/ namespace, as nearly every reference is
> created once and never modified.

Even for created-once refs, I've found an audit trail of when a ref was
created, by whom, and using what program to be quite valuable. Long ago
we tried to
use reflogs for that, but these days we literally just write the refname
and its reflog entry to an "audit_log" file. It's not used for
reachability and it's never pruned. It just grows forever without bound.

I think some variant of that could work for reflog storage (with
reachability and whole-file-rewrite expiration, obviously). The biggest
drawback is that traversing the reflogs for one ref requires walking
over the entries for all refs. But I'm not sure how much that would hurt
in practice. Many reflog operations look at all the reflogs anyway
(e.g., reachability, expiration). And finding the entry for a single ref
(e.g., ref@{1.day.ago}) is bounded in how far back it has to walk.

> One could abuse the reftable format to store a reflog. If the refname
> field is just a sequence of bytes, one could store keys like refname +
> '\0' + timestamp, and reuse the symbolic ref value format to store the
> old/new/message, as it's just a length-delimited string.

Gross, but it could work. I actually think David's LMDB proposal did
something similar (encoding the entries in the keyname), but I'd have to
double-check.

> I'm loath to allocate more bits to denote a reflog entry vs. ref entry
> in the same file, but I could see some advantages to that. Commits
> would only have to write one new reftable for the combined update +
> log record.

Yes, but I'd worry that the reflog entries (which would have to hang
around in the reftable until being expired) would slow performance for
normal ref lookups. In my experience ref lookups are very frequent, and
reflog lookups are not. It's OK to segment them and have worse
performance for the reflogs.

-Peff

* Re: reftable: new ref storage format
  2017-07-14  0:27       ` Shawn Pearce
@ 2017-07-14 20:10         ` Jeff King
  0 siblings, 0 replies; 25+ messages in thread
From: Jeff King @ 2017-07-14 20:10 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Stefan Beller, git

On Thu, Jul 13, 2017 at 05:27:44PM -0700, Shawn Pearce wrote:

> > We _could_ consider gzipping individual blocks of
> > a reftable (or any structure that allows you to search to a
> > constant-sized block and do a linear search from there). But given that
> > they're in the same ballpark, I'm happy with whatever ends up the
> > simplest to code and debug. ;)
> 
> This does help to shrink the file, e.g. it drops from 28M to 23M.
> 
> It makes it more CPU costly to access a block, as we have to inflate
> it before walking through the records. It also messes with alignment:
> a block you touch may straddle two virtual memory pages in your
> kernel/filesystem.
> 
> I'm not sure those penalties are worth the additional 16% reduction in size.

Yeah, I don't really care about a 16% reduction in size. I care much
more about simplicity of implementation and debugging. Using zlib is
kind-of simple to implement. But if you've ever had to debug it (or
figure out what is going on with maybe-corrupted output), it's pretty
nasty.

So I don't mind a more readable custom compression if it's not too
complicated. And especially if it buys us extra performance by being
able to jump around non-sequentially in the block.

-Peff

* Re: reftable: new ref storage format
  2017-07-14 20:08     ` Jeff King
@ 2017-07-16  6:01       ` Shawn Pearce
  2017-07-16 10:01         ` Jeff King
  2017-07-16  8:07       ` Johannes Sixt
  1 sibling, 1 reply; 25+ messages in thread
From: Shawn Pearce @ 2017-07-16  6:01 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Fri, Jul 14, 2017 at 1:08 PM, Jeff King <peff@peff.net> wrote:
> On Thu, Jul 13, 2017 at 05:11:52PM -0700, Shawn Pearce wrote:
>
> I think the "stack" implementation is what makes me most uncomfortable
> with this proposal. Atomicity with filesystem operations and especially
> readdir() is one of the things I think is most flaky about the current
> system. Here's an idea for an alternative implementation.
>
>   1. Write out reftables to files named after the hash of their content
>      (e.g., $GIT_DIR/reftables/1234abcd...).
>
>   2. The first block of each reftable has a backpointer to the
>      previous table.
>
>   3. There's a well-known name (e.g., $GIT_DIR/reftable) that represents
>      the current state. We update it with the usual .lock/rename dance.
>
> That gives readers an easy atomic view; once they've called open() on
> "reftable", they follow back-pointers instead of computing the names
> (n-1, n-2, etc). They may still find that a compaction has removed a
> file they need, but:
>
>   - they only have to restart due to an actual compaction. They'll never
>     be tricked by a regular update.
>
>   - you can compact without immediately deleting the old reftables. So
>     you might compact, and then delete the reftables N seconds later. Any
>     reader which completes the read within N seconds wouldn't have to
>     restart.
>
> I think I can anticipate your answer, though. If you have a system where
> the latency to open and read a file is high, then you've just serialized
> the latencies as you walk the chain. Whereas with predictable names, you
> can pre-fetch refname.i through refname.j in parallel.
>
> How deep would you anticipate stacks getting? Would it be feasible for
> the tip to contain the names of the tables in the entire chain? If we're
> talking about 20 (or even 32) bytes per name, you could still fit over a
> hundred names in a 4K inode.

I think we'd want to keep the stacks under 32, which is a reasonable
amount of space used in the header of each reftable. I don't have this
yet in my updated document + implementation, but I'll look at trying
to add it over the next couple of days. Your idea to hold the explicit
list of the stack in each reftable makes for a very safe atomic reader
view.


> It doesn't escape me that I'm basically reinventing RefTree here, with
> reftables instead of tree objects. But I think breaking away from using
> real Git objects opens up a lot of efficiency tricks (like the prefix
> compression, and the parallel-fetch thing above). And it removes a lot
> of the gc complexity.

Yes, I agree. Also RefTree has trouble scaling due to the flat tree
object format. It depends on the user to sensibly break up the
reference space with '/' characters sprinkled about. This reftable
proposal does not suffer from that limitation; a user can use any
valid ref name structure.


> So I'm worried about live-locking with a regular updater, not even a
> compacting one.

Ok. I think your idea of tracking an explicit list of the stack in the
top of every reftable solves this in a very neat way, so I'll look to
switch to that.


>> block  restart  size  lookup
>> 4k     16       29M    90.2 usec
>> 8k     16       29M    76.4 usec
>
> Thanks for these numbers. I was really thinking that blocks would be on
> the order of 4K (where you can see that the restarts help very little).
> For local disk that's a pretty reasonable size. For high-latency fetches
> to a specialized database, maybe not.
...
>> > One thing I didn't see is reflogs. They don't strictly have to be part
>> > of a ref-storage solution. But they do still consume at least one inode
>> > per ref in the current system. If you reflog everything, which in my
>> > opinion you should. Having an audit trail of ref updates is very useful
>> > for debugging (both bugs in Git, and trying to figure out what went
>> > wrong when a user goes berserk with "git push --prune").
>>
>> Yeah, I glossed over that and ignored them, mostly because one system
>> where I want to use reftable already has a separate database handling
>> the reflogs. In another (Gerrit Code Review), we disable reflogs for
>> the insane refs/changes/ namespace, as nearly every reference is
>> created once and never modified.
>
> Even for created-once refs, I've found an audit trail of when a ref
> was created, by whom, and using what program to be quite valuable.

Dave Borowitz agrees with you: Gerrit Code Review should be recording
reflog data, even for create-once refs. It's a limitation of the
current $GIT_DIR/logs that has forced it to be disabled. I'd really
like something like reftable, so we can record reflogs.


>> One could abuse the reftable format to store a reflog. If the refname
>> field is just a sequence of bytes, one could store keys like refname +
>> '\0' + timestamp, and reuse the symbolic ref value format to store the
>> old/new/message, as it's just a length-delimited string.
>
> Gross, but it could work. I actually think David's LMDB proposal did
> something similar (encoding the entries in the keyname), but I'd have to
> double-check.

I added log support to the reftable format. I updated [1] to reflect
log blocks at the end of the file. To test, I ran a year's worth of
log records, 149,932 log entries on 43,061 refs:

format         size
$GIT_DIR/logs  173 M
reftable       4 M  (avg 30 bytes)

[1]: https://googlers.googlesource.com/sop/jgit/+/reftable/Documentation/technical/reftable.md

reftable gets these kinds of savings by packing many logs into a
single file (so no disk block overheads), clustering log records by
ref name, prefix compressing, and deflating log records using 128k
input blocks. I'm OK with using deflate for log records (and bigger
block sizes), as log records are infrequently accessed compared to
current refs.

There is a log index to perform efficient point-in-time lookup for a
single ref, and then iterate over its log records from that time and
older. That answers most reflog queries quickly. Large batch log reads
like gc reachability can simply iterate through all log records,
clustered by ref, ordered by time descending.
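
One cheap way to get "ordered by time descending" out of plain
byte-sorted keys is to store an inverted timestamp, so a point-in-time
query becomes a simple seek. This is only an assumption of how it
could be encoded, building on the LogKey sketch earlier in the thread:

    // Within one ref, storing (Long.MAX_VALUE - time) makes newer
    // entries sort first, so ref@{<time>} becomes "seek to
    // key(ref, MAX - time), then scan forward".
    byte[] newer = LogKey.of("refs/heads/master",
                             Long.MAX_VALUE - 1500000000L);
    byte[] older = LogKey.of("refs/heads/master",
                             Long.MAX_VALUE - 1400000000L);
    // the newer entry sorts before the older one:
    assert java.util.Arrays.compareUnsigned(newer, older) < 0;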

* Re: reftable: new ref storage format
  2017-07-14 20:08     ` Jeff King
  2017-07-16  6:01       ` Shawn Pearce
@ 2017-07-16  8:07       ` Johannes Sixt
  2017-07-16 10:03         ` Jeff King
  1 sibling, 1 reply; 25+ messages in thread
From: Johannes Sixt @ 2017-07-16  8:07 UTC (permalink / raw)
  To: Jeff King, Shawn Pearce; +Cc: git

On 14.07.2017 at 22:08, Jeff King wrote:
> The implementation on this doesn't seem overly complex. My main concerns
> are what we're asking from the filesystem in terms of atomicity, and
> what possible races there are.

One of the failure modes is that on Windows a file cannot be deleted 
while it is open in any process. It can happen that a compacting updater 
wants to remove a reftable file that is still open in a reader.

-- Hannes

* Re: reftable: new ref storage format
  2017-07-16  6:01       ` Shawn Pearce
@ 2017-07-16 10:01         ` Jeff King
  0 siblings, 0 replies; 25+ messages in thread
From: Jeff King @ 2017-07-16 10:01 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git

On Sat, Jul 15, 2017 at 11:01:47PM -0700, Shawn Pearce wrote:

> > How deep would you anticipate stacks getting? Would it be feasible for
> > the tip to contain the names of the tables in the entire chain? If we're
> > talking about 20 (or even 32) bytes per name, you could still fit over a
> > hundred names in a 4K inode.
> 
> I think we'd want to keep the stacks under 32, which is a reasonable
> amount of space used in the header of each reftable. I don't have this
> yet in my updated document + implementation, but I'll look at trying
> to add it over the next couple of days. Your idea to hold the explicit
> list of the stack in each reftable makes for a very safe atomic reader
> view.

Great.  I was thinking about this a bit and have one more possible
tweak.

If you store the names of the dependent reftables in each table, then
you end up using storage quadratic in the size of the stack: the
bottom-most table has 0 pointers, the next one has 1, the next one has
2, and so on, until the nth one has n, for n(n+1)/2 pointers in total.

Now we're talking about n=32 here, so that's probably OK.

But one variant is that the reftables _don't_ know about their
ancestors. Instead, the list of reftables is kept in a top-level pointer
file, and it's that pointer file which is rewritten on update. I.e., a
write is something like:
 
   1. Take reftable.lock

   2. Write reftables/1234abcd to represent your update.

   3. Copy the old reftable to reftable.lock, then append "1234abcd".

   4. Atomic rename into place.

And the reader is just:

  1. Open reftable, read the list of tables.

  2. In parallel, open/fetch each of the tables and find your starting
     pointer for iteration/lookup.

  3. Do a list-merge on the open tables.
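
A self-contained sketch of the writer side of that dance, using plain
file operations (the paths, one-name-per-line list format, and error
handling are simplifications assumed for illustration):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;
    import java.util.List;

    class PointerFileUpdate {
      // Sketch: publish reftables/<hash> by appending its name to a
      // copy of the current list, then renaming the copy into place.
      static void commit(Path gitDir, String newTableHash)
          throws IOException {
        Path ptr = gitDir.resolve("reftable");     // assumed to exist
        Path lock = gitDir.resolve("reftable.lock");
        Files.createFile(lock);     // 1. take the lock; fails if held
        try {
          // 2. reftables/<newTableHash> was already written.
          List<String> names = Files.readAllLines(ptr);
          names.add(newTableHash);  // 3. old list + the new name
          Files.write(lock, names);
          Files.move(lock, ptr,     // 4. atomic rename into place
              StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
          Files.deleteIfExists(lock);
          throw e;
        }
      }
    }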

The one thing you lose is that "unreachable" reftables no longer form a
meaningful hierarchy. With the pointers inside the reftables themselves,
if your "reftable" file got corrupted, you could find the dangling table
at the apex of the graph and have a good guess at the ref state.
Without them, you just have a jumble of states and you don't know which
takes precedence (though you could probably make a good guess from
mtimes).

> I added log support to the reftable format. I updated [1] to reflect
> log blocks at the end of the file. I ran a year's worth of log
> records, 149,932 log entries on 43,061 refs to test:

Cool. I'll be on vacation for the next week, so apologies if I don't
keep the discussion going. But I'm very excited about the overall
direction. :)

-Peff

* Re: reftable: new ref storage format
  2017-07-16  8:07       ` Johannes Sixt
@ 2017-07-16 10:03         ` Jeff King
  2017-07-16 10:10           ` Johannes Sixt
  0 siblings, 1 reply; 25+ messages in thread
From: Jeff King @ 2017-07-16 10:03 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: Shawn Pearce, git

On Sun, Jul 16, 2017 at 10:07:57AM +0200, Johannes Sixt wrote:

> On 14.07.2017 at 22:08, Jeff King wrote:
> > The implementation on this doesn't seem overly complex. My main concerns
> > are what we're asking from the filesystem in terms of atomicity, and
> > what possible races there are.
> 
> One of the failure modes is that on Windows a file cannot be deleted while
> it is open in any process. It can happen that a compacting updater wants to
> remove a reftable file that is still open in a reader.

Good point. I think the explicit pointers I mentioned are an improvement
there, because a compacting updater _can_ leave the file in place if the
delete fails (and later, another compaction can clean up cruft that was
left).

I assume that's more or less how pack deletion works on Windows.

-Peff

* Re: reftable: new ref storage format
  2017-07-16 10:03         ` Jeff King
@ 2017-07-16 10:10           ` Johannes Sixt
  0 siblings, 0 replies; 25+ messages in thread
From: Johannes Sixt @ 2017-07-16 10:10 UTC (permalink / raw)
  To: Jeff King; +Cc: Shawn Pearce, git

On 16.07.2017 at 12:03, Jeff King wrote:
> On Sun, Jul 16, 2017 at 10:07:57AM +0200, Johannes Sixt wrote:
> 
>> On 14.07.2017 at 22:08, Jeff King wrote:
>>> The implementation on this doesn't seem overly complex. My main concerns
>>> are what we're asking from the filesystem in terms of atomicity, and
>>> what possible races there are.
>>
>> One of the failure modes is that on Windows a file cannot be deleted while
>> it is open in any process. It can happen that a compacting updater wants to
>> remove a reftable file that is still open in a reader.
> 
> Good point. I think the explicit pointers I mentioned are an improvement
> there, because a compacting updater _can_ leave the file in place if the
> delete fails (and later, another compaction can clean up cruft that was
> left).

Yes, I think so, too. The pointers make things so much simpler.

> I assume that's more or less how pack deletion works on Windows.

Correct.

-- Hannes

* Re: reftable: new ref storage format
  2017-07-13  0:17 reftable: new ref storage format Shawn Pearce
  2017-07-13 19:32 ` Jeff King
@ 2017-07-16 17:33 ` Michael Haggerty
  2017-07-16 19:43   ` Shawn Pearce
  1 sibling, 1 reply; 25+ messages in thread
From: Michael Haggerty @ 2017-07-16 17:33 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git

Thanks for your reftable proposal. It would solve a lot of scalability
problems that we currently have, and do it in a way that is
implementable in both C and Java, which is very nice.

There are two mostly orthogonal components to your proposal:

1. What does a single reftable file look like?
2. How can multiple reftable files be used together to avoid having to
rewrite data more than necessary?

For example (just for the sake of argument), many of the goals could
be achieved by stacking traditional packed-refs files together (i.e.,
component 2 of your proposal), with the single extension that a
packed-refs file can assign some "nil" value to a reference to
indicate that the reference has been deleted. As long as references
are sought in packed-refs files using binary search, lookup and
iteration would both be O(n + lg N) (where N is the total number of
references and n is the number being iterated over), and updates would
(I think) be amortized O(n), where n is the number of references being
updated or deleted. Stacked packed-refs files would, of course, not
yield the compression benefits of your reftable proposal.

Overall comments/questions:

* Do you propose to store *all* references (i.e., including the
references that we call pseudorefs, like `HEAD`, `FETCH_HEAD`, etc) in
reftables, or only the references under `refs/`? If the former, then
you need to consider that some pseudorefs contain additional
information besides the SHA-1 or linked-to reference. If the latter,
then you could get some additional compression by making the `refs/`
prefix implicit rather than storing it in the "restart" records
explicitly.

  Personally, I don't think it makes sense to store pseudorefs in
reftables. HEAD, though it doesn't include any supplemental
information, is read so frequently that it seems like a bad idea to
have to go through the lookup process rather than storing it in a
separate flat file. Moreover, HEAD is written very *infrequently*, so
(absent special treatment) it would tend to sink deep in the reftable
stack, making reads even more expensive.

* You have obviously designed your proposal to be friendly to whatever
non-local storage system you are using at Google. It would probably
help us understand your design tradeoffs better if you could tell us a
little about that storage system.

So let's examine the components one after the other...

1. The structure of a single reftable file

* You use int32 and int24 values in a number of places. Couldn't these
be uint32 and uint24?

* You use int32 values in a number of places where a smaller size
would suffice. For example, the block size is limited to 24 bits, so
surely restart_offset, record_end_offset, and number_of_restarts
needn't be larger than 24 bits?

   OK, actually the *index* block size is not limited to 24 bits, so
it's not a no-brainer.

* I'm a little bit concerned that the structure is fixed to be a one-
or two-level indexing scheme; i.e., the (optional) index block plus
the restart index table within a block. This forces (as I understand
it) the block size to be chosen based on the overall file size to
prevent the index block from becoming too large, whereas for a
low-latency local filesystem implementation like SSDs it would
probably be preferable to set the block size to agree with the page
size to minimize reads and memory usage.

  So I ask myself whether it might be worthwhile to allow deeper
levels of indexing. The main index block could divide the namespace
into as many segments as fit in one block, but if necessary point at a
second-level index block that further divides the namespace before
pointing at the ref block that ultimately contains the value. Those of
us using local SSDs might tune the system to use 4k block sizes
throughout, basically preferring to read three or even four disjoint
4k blocks rather than two disjoint 64k blocks.

* The tuning parameter number_of_restarts currently trades off space
(for the full refnames and the restart_offsets) against the need to
read and parse more ref_records to get the full refnames. ISTM that
this tradeoff could be made less painful by defining a blockwide
prefix that is omitted from the refnames as used in the restarts. So
the full refname would change from

      this_name = prior_name[0..prefix_length] + suffix

  to

      this_name = block_prefix + prior_name[0..prefix_length] + suffix

  I would expect this to allow more frequent restarts at lower space
cost. Combining this idea with the previous idea, non-top-level index
blocks could also have a block_prefix to make their restarts cheaper,
too.

  If we are willing to force all reads to consider the indexes, the
block_prefix could be taken from the index_record that points at it.
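
  To make the saving concrete, a made-up example for a Gerrit-style
block:

      block_prefix = "refs/changes/"

      stored restart key:  "34/1234/1"
      recovered full name: "refs/changes/" + "34/1234/1"
                         = "refs/changes/34/1234/1"

  Each restart then stores 13 fewer bytes, while its key is still
directly comparable once the block_prefix is prepended.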

* Does the format need to insist on fixed-size blocks? The alternative
would be to allow index_records to point at arbitrary offsets in the
file. Admittedly, the file offsets would take more bits to store them
than the block_idx. But it would allow the blocks to be matched better
to the logical structure of the refs namespace, making the
block_prefix work better and making it easier to find a particular
namespace in one jump. For example, you could add an extra top-level
index entry for the HEAD branch, on the assumption that it will be
read frequently. You could also add an entry for `refs/replace`, which
is read for almost every command invocation. You could even add an
entry for `refs/replace` and an entry for the first reference
following `refs/replace`, which would allow the client to determine
that there are zero `refs/replace` references (the usual case) from
the top-level index alone by seeing that the two offsets are
identical.

  BTW, just because the file format doesn't insist on fixed-size
blocks wouldn't mean that particular implementations couldn't choose
to write fixed-size blocks, as your implementation might do. If this
is an important use case, the extra bits needed to store arbitrary
file offsets could be avoided by allowing a filewide offset_multiplier
to be defined.

* What is the rationale for the footer? Is it important for the file
to be written as a stream, as opposed to seeking back to the beginning
of the file to add the index_size there? The CRC, I suppose, is meant
to make it possible to detect a file that has been truncated
accidentally?

2. The stacking of multiple reftable files together

* I like the ideas later in the thread to have a file of pointers to
the files that make up the stack. (I think of it as a "table of
contents"). I think without it there will be unavoidable races
involving readdir().

* The spec should describe the protocol for reading references. The
only interesting thing here is that

* I think that geometric repacking of reftable files will be essential
to performance for very active repositories with many references, and
that the steady state of such a repository will involve O(lg N)
reftable files.

* I would like to propose adding another design goal: that reference
repacking can be done without blocking other *writers* (let alone
readers, obviously). I think that something like this could work:

  Suppose the stack currently consists of reftable files (from oldest
to newest) A, B, C, and D, with table-of-contents file reftable. If I
want to repack files B and C together, then I

  * Obtain lock reftable.lock and read the file.
  * Obtain locks B.lock and C.lock. My ownership of these locks
prevents other processes from trying to repack these files.
  * Release reftable.lock.
  * Repack B and C into a new file, X.lock.
  * Obtain lock reftable.lock.
  * Verify that B and C are still in the stack, in that order. This
should always be the case, assuming that other processes are adhering
to the locking protocol.
  * Rename X.lock to X.
  * Write the new stack to reftable.lock, replacing B and C with X.
  * Rename reftable.lock on top of reftable.
  * Delete B and C (perhaps after a short sleep to avoid forcing
readers to backtrack).

  I think that this algorithm would allow reference updates to
continue while the repack is happening, and would even allow reftables
higher or lower in the stack than B and C to be repacked at the same
time (though maybe that won't be necessary).
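
  A sketch of that protocol in code, with mergeTables() standing in
for the actual reftable merge and all error handling omitted (the file
names and the one-table-per-line contents of "reftable" are
assumptions for illustration):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;
    import java.util.List;

    class Repack {
      static void repack(Path dir, String b, String c)
          throws IOException {
        Path toc = dir.resolve("reftable");
        Path tocLock = dir.resolve("reftable.lock");

        Files.createFile(tocLock);             // obtain reftable.lock
        // (a real implementation reads the stack here to choose B, C)
        Files.createFile(dir.resolve(b + ".lock"));   // claim B
        Files.createFile(dir.resolve(c + ".lock"));   // claim C
        Files.delete(tocLock);                 // release reftable.lock

        Path x = dir.resolve("X.lock");
        mergeTables(dir.resolve(b), dir.resolve(c), x);

        Files.createFile(tocLock);             // re-obtain the lock
        List<String> now = Files.readAllLines(toc);
        int i = now.indexOf(b);                // B, C must still be
        if (i < 0 || i + 1 >= now.size()       // stacked, in order
            || !now.get(i + 1).equals(c))
          throw new IllegalStateException("locking protocol violated");
        Files.move(x, dir.resolve("X"),
            StandardCopyOption.ATOMIC_MOVE);   // X.lock -> X
        now.set(i, "X");                       // replace B, C with X
        now.remove(i + 1);
        Files.write(tocLock, now);
        Files.move(tocLock, toc,               // publish the new stack
            StandardCopyOption.ATOMIC_MOVE);
        // B, C and their .lock files are deleted after a short sleep.
      }

      static void mergeTables(Path b, Path c, Path out) {
        throw new UnsupportedOperationException("illustrative only");
      }
    }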

* A very frequent operation will be to read a single reference. This
could get expensive if a reftable stack grows deep, because every
single reftable might have to be inspected. One way to reduce this
cost might be to store a bloom filter in large reftable files. This
could allow a quick determination that the file doesn't contain the
reference being sought.
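
  A bloom filter check is cheap on the read side; a toy sketch (the
sizes and hash scheme are made up, and a real one would be built at
write time and stored somewhere in the file):

    import java.util.BitSet;

    class RefBloom {
      // Toy filter with k=2 hashes. A miss proves the table cannot
      // contain the ref; a hit only means "actually search the table".
      final BitSet bits; // loaded from the (hypothetical) file section
      final int m;       // number of bits

      RefBloom(BitSet bits, int m) { this.bits = bits; this.m = m; }

      boolean mightContain(String refname) {
        int h1 = refname.hashCode();
        int h2 = Integer.reverse(h1) * 0x9E3779B9; // derived 2nd hash
        return bits.get(Math.floorMod(h1, m))
            && bits.get(Math.floorMod(h2, m));
      }
    }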

I haven't reviewed your proposal for storing reflogs in reftables in
any detail, though I must say that my first reaction is surprise that
you want to store reflogs (which are mostly immutable, rarely read,
and can be an order of magnitude bigger) in the same files as
references.

Michael

On Wed, Jul 12, 2017 at 5:17 PM, Shawn Pearce <spearce@spearce.org> wrote:
> We've been having scaling problems with insane number of references
> (>866k), so I started thinking a lot about improving ref storage.
>
> I've written a simple approach, and implemented it in JGit.
> Performance is promising:
>
>   - 62M packed-refs compresses to 27M
>   - 42.3 usec lookup
>
> You can read a rendered version of this here:
> https://googlers.googlesource.com/sop/jgit/+/reftable/Documentation/technical/reftable.md
>
>
> ## Overview
>
> ### Problem statement
>
> Some repositories contain a lot of references (e.g.  android at 866k,
> rails at 31k).  The existing packed-refs format takes up a lot of
> space (e.g.  62M), and does not scale with additional references.
> Lookup of a single reference requires linearly scanning the file.
>
> Atomic pushes modifying multiple references require copying the
> entire packed-refs file, which can be a considerable amount of data
> moved (e.g. 62M in, 62M out) for even small transactions (2 refs
> modified).
>
> Repositories with many loose references occupy a large number of disk
> blocks from the local file system, as each reference is its own file
> storing 41 bytes.  This negatively affects the number of inodes
> available when a large number of repositories are stored on the same
> filesystem.  Readers are also penalized due to the larger number of
> syscalls required to traverse and read the `$GIT_DIR/refs` directory.
>
> ### Objectives
>
> - Near constant time lookup for any single reference, even when the
>   repository is cold and not in process or kernel cache.
> - Occupy less disk space for large repositories.
> - Support atomic pushes with lower copying penalities.
>
> ### Description
>
> A reftable file is a portable binary file format customized for
> reference storage. References are sorted, enabling linear scans,
> binary search lookup, and range scans.
>
> Storage in the file is organized into blocks.  Prefix compression
> is used within a single block to reduce disk space.  Block size is
> tunable by the writer.
>
> ### Performance
>
> Space used, packed-refs vs. reftable:
>
> repository | packed-refs | reftable | % original | avg ref
> -----------|------------:|---------:|-----------:|---------:
> android    |      62.2 M |   27.7 M |     44.4%  | 33 bytes
> rails      |       1.8 M |  896.2 K |     47.6%  | 29 bytes
> git        |      78.7 K |   27.9 K |     40.0%  | 43 bytes
> git (heads)|       332 b |    204 b |     61.4%  | 34 bytes
>
> Scan (read 866k refs) and lookup (single ref from 866k refs):
>
> format      | scan    | lookup
> ------------|--------:|---------------:
> packed-refs |  380 ms | 375420.0 usec
> reftable    |  125 ms |     42.3 usec
>
> ## Details
>
> ### Peeling
>
> References in a reftable are always peeled.
>
> ### Reference name encoding
>
> Reference names should be encoded with UTF-8.
>
> ### Ordering
>
> Blocks are lexicographically ordered by their first reference.
>
>
> ## File format
>
> ### Header
>
> A 8-byte header appears at the beginning of each file:
>
> - 4-byte magic is: `\'1', 'R', 'E', 'F'`
> - 1-byte version number, `1`.
> - 3-byte `block_size` in bytes (network byte order).
>
> ### Block size
>
> The `block_size` is arbitrarily determined by the writer, and does not
> have to be a power of 2.  The block size must be larger than the
> longest reference name used in the repository, as references cannot
> span blocks.
>
> ### First block
>
> The first block shares the same block as the file header, and is 8
> bytes smaller than all other blocks in the file.  The first block
> immediately begins after the file header, at offset 8.
>
> ### Block format
>
> A block is written as:
>
>     ref_record*
>     padding?
>     int32( restart_offset )*
>     int32( record_end_offset )
>     int32( number_of_restarts )
>
> Blocks begin with a variable number of `ref_record`, describing
> reference names and values. The format is described below.
>
> The middle of the record may be filled with `padding` NUL bytes to
> fill out the block to the common `block_size` as specified in the file
> header.  Padding may be necessary to ensure `number_of_restarts`
> occupies the last 4 bytes of the block.  Padding may be omitted if the
> block is the last block of the file, and there is no index block.
> This allows reftable to efficiently scale down to a small number of
> refs.
>
> A variable number of 4-byte, network byte order `restart_offset`
> values follows the padding.  Offsets are relative to the start of the
> block and refer to the first byte of any `ref_record` whose name has
> not been prefix compressed.  Readers can start linear scans from any
> of these records.
>
> The 4-byte, network byte order `record_end_offset` follows, providing
> the block-relative offset after the end of the last `ref_record`.  If
> `padding` is present this is the offset of the first byte of padding,
> or the first byte of the first `restart_offset` entry.
>
> The 4-byte, network byte order `number_of_restarts` stores the number
> of entries in the `restart_offset` list.  Readers can use the restart
> count to binary search between restarts before starting a linear scan.
> This field must be the last 4 bytes of the block; the `padding` field
> must be used to ensure this is true.
>
> #### ref record
>
> A `ref_record` describes a single reference, storing both the name and
> its value(s). Records are formatted as:
>
>     varint( prefix_length )
>     varint( (suffix_length << 2) | type )
>     suffix
>     value?
>
> The `prefix_length` field specifies how many leading bytes of the
> prior reference record's name should be copied to obtain this
> reference's name.  This must be 0 for the first reference in any
> block, and also must be 0 for any `ref_record` whose offset is listed
> in the `restart_offset` table at the end of the block.
>
> Recovering a reference name from any `ref_record` is a simple concat:
>
>     this_name = prior_name[0..prefix_length] + suffix
>
> The second varint carries both `suffix_length` and `type`.  The
> `suffix_length` value provides the number of bytes to copy from
> `suffix` to complete the reference name.
>
> The `value` immediately follows.  Its format is determined by `type`,
> a 2 bit code, one of the following:
>
> - `0x0`: deletion; no value data (see transactions, below)
> - `0x1`: one 20-byte object id; value of the ref
> - `0x2`: two 20-byte object ids; value of the ref, peeled target
> - `0x3`: symbolic reference: `varint( target_len ) target`
>
> Symbolic references use a varint length followed by a variable number
> of bytes to encode the complete reference target.  No compression is
> applied to the target name.
>
> ### Index block
>
> The index stores the name of the last reference from every block in
> the file, enabling constant O(1) disk seeks for all lookups.  Any
> reference can be found by binary searching the index, identifying the
> containing block, and searching within that block.
>
> If present, the index block appears after the last block of the file.
>
> An index block should only be written if there are at least 4 blocks
> in the file, as cold reads using the index require 2 disk reads, and
> binary searching <= 4 blocks also requires <= 2 reads.  Omitting the
> index block from smaller files saves space.
>
> Index block format:
>
>     '\1' 'i'
>     index_record*
>     int32( restart_offset )*
>     int32( record_end_offset )
>     int32( number_of_restarts )
>
> Index blocks begin with a magic prefix, `\1i`, where other blocks
> would have started with `\0` for the first ref record's prefix length.
> This supports stopping sequential scans at the index block, without
> prior knowledge of its position.
>
> Unlike other blocks, the index block is not padded.
>
> The `restart_offset`, `record_end_offset`, and `number_of_restarts`
> fields are identical in format, meaning and usage as in `ref_record`.
>
> To reduce the number of reads required for random access in very large
> files, the index block may be larger than the other blocks.  However,
> readers must hold the entire index in memory to benefit from this, so
> it's a time-space tradeoff in both file size and reader memory.
> Increasing the block size in the writer decreases the index size.
>
> #### index record
>
> An index record describes the last reference of another block.
> Index records are written as:
>
>     varint( prefix_length )
>     varint( (suffix_length << 2) )
>     suffix
>     varint( block_idx )
>
> Index records use prefix compression exactly like `ref_record`.  The
> `suffix_length` is shifted 2 bits without a `type` to simplify unified
> reader/writer code for both block types.
>
> Index records store `block_idx` after the suffix, specifying which
> block of the file ends with this reference. The block is located at
> position `block_idx * block_size`.
>
> ### Reading the index
>
> Readers loading the index must first read the footer (below) to
> determine `index_size`.  The index is located at position:
>
>     file_length - (index_size + 16)
>
> ### Footer
>
> After the last block of the file (or index block, if present), a file
> footer is written.  This is similar in structure to the file header,
> but extended with additional data.
>
> A 16-byte footer appears at the end:
>
> - 4-byte magic is: `\'1', 'R', 'E', 'F'`
> - 1-byte version number, 1.
> - 3-byte `block_size` in bytes (network byte order).
> - 4-byte `index_size` in bytes (network byte order).
> - 4-byte CRC-32 of the preceding 12 bytes (network byte order).
>
> Like the index block magic header, the footer begins with `\1R` to
> allow sequential scans to recognize the end of file has been reached.
>
> #### Reading the footer
>
> Readers must seek to `file_length - 16` to access the footer.  A
> trusted external source (such as `stat(2)`) is necessary to obtain
> `file_length`.  When reading the footer, readers must verify:
>
> - 4-byte magic is correct
> - 1-byte version number is recognized
> - 4-byte CRC-32 matches the other 12 bytes read
>
> Once verified, the `block_size` and `index_size` may be accessed from
> the footer.
>
> ### Varint encoding
>
> Varint encoding is identical to the ofs-delta encoding method used
> within pack files.
>
> The decoder works as follows:
>
>     val = buf[ptr] & 0x7f
>     while (buf[ptr] & 0x80) {
>       ptr++
>       val++
>       val = val << 7
>       val = val | (buf[ptr] & 0x7f)
>     }
>
> ### Binary search
>
> Binary search within a block is supported by the `restart_offset`
> fields at the end of the block.  Readers can binary search through the
> restart table to locate between which two restart points the sought
> reference should appear.
>
> Each reference identified by a `restart_offset` stores the complete
> reference name in the `suffix` field of the `ref_record`, making the
> compare operation during the binary search straightforward.
>
> Once a restart point lexicographically before the sought reference has
> been identified, readers can linearly scan through the following
> `ref_record` entries to locate the sought reference, stopping when the
> current `ref_record` sorts after (and therefore the sought reference
> is not present).
>
> #### Restart point selection
>
> Writers determine the restart points at file creation.  The process is
> arbitrary, but every 16 or 64 references is recommended.  Every 16 may
> be more suitable for smaller block sizes (4k or 8k), every 64 for
> larger block sizes (64k).
>
> More frequent restart points reduce prefix compression and increase
> space consumed by the restart table, both of which will increase the
> overall file size.
>
> Less frequent restart points make prefix compression more effective,
> decreasing overall file size, with increased penalties for readers
> who must walk through more references after the binary search step.
>
> ## Considerations
>
> ### Lightweight refs dominate
>
> The reftable format assumes the vast majority of references are single
> SHA-1 valued with common prefixes, such as Gerrit Code Review's
> `refs/changes/` namespace, GitHub's `refs/pulls/` namespace, or many
> lightweight tags in the `refs/tags/` namespace.
>
> Annotated tags storing the peeled object cost only an additional 20
> bytes per reference.
>
> ### Low overhead
>
> A reftable with very few references (e.g.  git.git with 5 heads) uses
> only 204 bytes for reftable vs.  332 bytes for packed-refs.  This
> supports reftable scaling down, to be used for transaction logs
> (below).
>
> ### Block size
>
> For a Gerrit Code Review type repository with many change refs, larger
> block sizes (64 KiB) and less frequent restart points (every 64) yield
> better compression due to more references within the block able to
> compress against the prior reference.
>
> Larger block sizes reduce the index size, as the reftable will
> require fewer blocks to store the same number of references.
>
> ### Minimal disk seeks
>
> Assuming the index block has been loaded into memory, binary searching
> for any single reference requires exactly 1 disk seek to load the
> containing block.
>
> ## Repository format
>
> When reftable is stored in a file-backed Git repository, the stack is
> represented as a series of reftable files:
>
>     $GIT_DIR/reftable
>     $GIT_DIR/reftable.1
>     $GIT_DIR/reftable.2
>     $GIT_DIR/reftable.3
>     ...
>     $GIT_DIR/reftable.10
>
> where a larger suffix ordinal indicates a more recent table.
>
> ### Transactions
>
> Although reftables are immutable, they can be stacked in a search
> pattern, with each reference transaction adding a new reftable to the
> top of the stack.  Readers scan down the reftable stack from
> most-recent (`reftable.10`) to the base file (`reftable`).
>
> ### Update process
>
> Updating references follows an update protocol:
>
> 1. Atomically create `$GIT_DIR/reftable.lock`.
> 2. `readdir($GIT_DIR)` to determine the highest suffix ordinal, `n`.
> 3. Compute the update transaction (e.g. compare expected values).
> 4. Write only modified references as a reftable to `reftable.lock`.
> 5. Rename `reftable.lock` to `reftable.${n + 1}`.
>
> Because a single `reftable.lock` file is used to manage locking, the
> repository is single-threaded for writers.  Writers may have to
> busy-spin (with some small backoff) around creating `reftable.lock`,
> for up to an acceptable wait period, aborting if the repository is too
> busy to mutate.  Application servers wrapped around repositories (e.g.
> Gerrit Code Review) can layer their own in memory thread lock/wait
> queue to provide fairness.
>
> ### Reference deletions
>
> Deletion of any reference can be explicitly stored by setting the
> `type` to `0x0` and omitting the `value` field of the `ref_record`.
> This entry shadows the reference in lower files in the stack.
>
> ### Compaction
>
> A stack of reftables can be compacted by merging references using a
> straightforward merge join across all reftables, selecting the most
> recent value for output, and omitting deleted references that do not
> appear in remaining, lower reftables.
>
> The stack can be collapsed as part of any update transaction.  If the
> current number of files is larger than a threshold (e.g.  4), writers
> can perform an lstat(2) on each reftable file to determine how many
> bytes would have to be read and copied from the existing files into
> the new file before those existing files can be deleted.
>
> Writers can elect to collapse the most recent files (e.g.  10, 9, 8,
> ...), up to a collapse IO threshold (e.g.  4 MiB).  Each file selected
> for collapse must have its references merged into the new reftable
> that is being prepared.
>
> Compaction is similar to the update process, but an explicit temporary
> file must be used:
>
> 1. Atomically create `$GIT_DIR/reftable.lock`.
> 2. `readdir($GIT_DIR)` to determine the highest suffix ordinal, `n`.
> 3. Compute the update transaction (e.g. compare expected values).
> 4. Select files from (2) to collapse as part of this transaction.
> 5. Create temp file by `mktemp("$GIT_DIR/.reftableXXXXXX")`.
> 6. Write modified and collapsed references to temp file.
> 7. Rename temp file to `reftable.${n + 1}`.
> 8. Delete collapsed files `reftable.${n}`, `reftable.${n - 1}`, ...
> 9. Delete `reftable.lock`.
>
> Because `reftable.9` can disappear after `reftable.10` is created,
> readers receiving ENOENT when opening `reftable.9` must perform
> another readdir to look for new reftables.
>
> Rebuilding the base `$GIT_DIR/reftable` follows the same protocol,
> except in step 7 the temp file is renamed to `reftable`, and step 8
> removes all files with an ordinal suffix.
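>
> A sketch of the merge-join semantics in C (the merging iterator and
> writer are hypothetical; only the shadowing rules are shown):
>
>     #include <string.h>
>
>     struct table;
>     struct writer;
>
>     /* merge_next() yields records in refname order; for equal names
>        it yields the entry from the newest table first, and returns
>        stable name storage */
>     struct rec { const char *name; int deleted; /* plus value */ };
>     struct rec *merge_next(struct table **stack, int n);
>     void write_rec(struct writer *out, struct rec *r);
>
>     void compact(struct table **stack, int n, int includes_base,
>                  struct writer *out)
>     {
>         struct rec *r;
>         const char *last = NULL;
>
>         while ((r = merge_next(stack, n))) {
>             if (last && !strcmp(last, r->name))
>                 continue;      /* older entry shadowed by a newer one */
>             last = r->name;
>             if (r->deleted && includes_base)
>                 continue;      /* tombstone no longer needed */
>             write_rec(out, r); /* newest value, or a kept tombstone */
>         }
>     }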
>
> ## Alternatives considered
>
> ### bzip packed-refs
>
> `bzip2` can significantly shrink a large packed-refs file (e.g. 62
> MiB compresses to 23 MiB, 37%).  However, the bzip2 format does not
> support random access to a single reference.  Readers must inflate
> and discard while performing a linear scan.
>
> Breaking packed-refs into chunks (individually compressing each chunk)
> would reduce the amount of data a reader must inflate, but still
> leaves the problem of indexing chunks to support readers efficiently
> locating the correct chunk.
>
> Given the compression ratios achieved by reftable's simple encoding
> (e.g.  44%), without using a standard compression algorithm, it does
> not seem necessary to add the complexity of bzip/gzip/zlib.
>
> ### JGit Ketch RefTree
>
> [JGit Ketch][ketch] proposed [RefTree][reftree], an encoding of
> references inside Git tree objects stored as part of the repository's
> object database.
>
> The RefTree format adds additional load on the object database storage
> layer (more loose objects, more objects in packs), and relies heavily
> on the packer's delta compression to save space.  Namespaces which are
> flat (e.g.  thousands of tags in refs/tags) initially create very
> large loose objects, and so RefTree does not address the problem of
> copying many references to modify a handful.
>
> Flat namespaces are not efficiently searchable in RefTree, as tree
> objects in canonical formatting cannot be binary searched. This fails
> the need to handle a large number of references in a single namespace,
> such as GitHub's `refs/pulls`, or a project with many tags.
>
> [ketch]: https://dev.eclipse.org/mhonarc/lists/jgit-dev/msg03073.html
> [reftree]: https://public-inbox.org/git/CAJo=hJvnAPNAdDcAAwAvU9C4RVeQdoS3Ev9WTguHx4fD0V_nOg@mail.gmail.com/
>
> ### LMDB
>
> David Turner proposed [using LMDB][dt-lmdb], as LMDB is lightweight
> (64k of runtime code) and has a GPL-compatible license.
>
> A downside of LMDB is its reliance on a single C implementation.  This
> makes embedding inside JGit (a popular reimplementation of Git)
> difficult, and hoisting onto virtual storage (for JGit DFS) virtually
> impossible.
>
> A common format that can be supported by all major Git implementations
> (git-core, JGit, libgit2) is strongly preferred.
>
> [dt-lmdb]: https://public-inbox.org/git/1455772670-21142-26-git-send-email-dturner@twopensource.com/
>
> ## Future
>
> ### Longer hashes
>
> The version number will bump (e.g.  to 2) to indicate `value` uses an
> object id length other than 20.  The length could be stored in an
> expanded file header, or hardcoded as part of the version.


* Re: reftable: new ref storage format
  2017-07-16 17:33 ` Michael Haggerty
@ 2017-07-16 19:43   ` Shawn Pearce
  2017-07-16 21:12     ` Shawn Pearce
                       ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Shawn Pearce @ 2017-07-16 19:43 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: git

On Sun, Jul 16, 2017 at 10:33 AM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
> Thanks for your reftable proposal.

Thanks for your time reading the proposal. I really was looking
forward to your insights on this project.

> It would solve a lot of scalability
> problems that we currently have, and do it in a way that is
> implementable in both C and Java, which is very nice.
>
> There are two mostly orthogonal components to your proposal:
>
> 1. What does a single reftable file look like?
> 2. How can multiple reftable files be used together to avoid having to
> rewrite data more than necessary?
>
> For example (just for the sake of argument), many of the goals could
> be achieved by stacking traditional packed-refs files together (i.e.,
> component 2 of your proposal),

Agreed. I actually started from stacked packed-refs as the format
proposal and iterated on that several times before starting to draft
reftable. I like that packed-refs is already a format, and its human
readable. It unfortunately still didn't solve enough other objectives,
which led me towards reftable.


> * Do you propose to store *all* references (i.e., including the
> references that we call pseudorefs, like `HEAD`, `FETCH_HEAD`, etc) in
> reftables, or only the references under `refs/`? If the former, then
> you need to consider that some pseudorefs contain additional
> information besides the SHA-1 or linked-to reference. If the latter,
> then you could get some additional compression by making the `refs/`
> prefix implicit rather than storing it in the "restart" records
> explicitly.

Great question I didn't have an answer for. I was planning to store
HEAD, but not FETCH_HEAD/MERGE_HEAD/etc. Mostly because our storage
system at $DAY_JOB doesn't have a place to store HEAD itself, it's
buried down in the reference system. So reftable has to do the job.

The file format for reftable describes a type 0x3 which is just a
length delimited byte string. Provided that the data fits into a
single block, reftable can store these larger files with their
auxiliary data.

I'm open to the idea of HEAD and friends being outside of reftable in
a git-core accessed repository, but the backend storage I want to run
reftable on would likely need to store HEAD inside reftable.


>   Personally, I don't think it makes sense to store pseudorefs in
> reftables. HEAD, though it doesn't include any supplemental
> information, is read so frequently that it seems like a bad idea to
> have to go through the lookup process rather than storing it in a
> separate flat file. Moreover, HEAD is written very *infrequently*, so
> (absent special treatment) it would tend to sink deep in the reftable
> stack, making reads even more expensive.

That is a very fair argument for keeping HEAD outside.

A counter argument is that HEAD is written very frequently by following
its indirect pointer to the branch (e.g. refs/heads/master). HEAD is
consequently reflogged very frequently. reftable stores the logs
inside the table shards. HEAD could be floated onto the top on every
write to accompany its log record.


> * You have obviously designed your proposal to be friendly to whatever
> non-local storage system you are using at Google. It would probably
> help us understand your design tradeoffs better if you could tell us a
> little about that storage system.

Remote network attached disk. Random 64k seeks are O(40ms).
Smaller seeks cost about the same as 64k.
Larger seeks start to see the block size increase latency.

Caching is the responsibility of the application (e.g. JGit).

Tail-block packing is available for small files (ReiserFS-ish, but it's
not ReiserFS). A 4k minimum file size occupies the entire 4k, and has
to transfer between the CPU and the disk. A 300 byte file occupies 300
bytes. The more the file scales down with its content, the better
(for this system).


> So let's examine the components one after the other...
>
> 1. The structure of a single reftable file
>
> * You use int32 and int24 values in a number of places. Couldn't these
> be uint32 and uint24?

Yes. I'll clarify that in the document.


> * You use int32 values in a number of places where a smaller size
> would suffice. For example, the block size is limited to 24 bits, so
> surely restart_offset, record_end_offset, and number_of_restarts
> needn't be larger than 24 bits?
>
>    OK, actually the *index* block size is not limited to 24 bits, so
> it's not a no-brainer.

Right, it's the index block that really pushes the restart_offsets to
32 bits. And I wanted to leverage code between the multiple block
formats as much as possible, as the structure is similar.


> * I'm a little bit concerned that the structure is fixed to be a one-
> or two-level indexing scheme; i.e., the (optional) index block plus
> the restart index table within a block. This forces (as I understand
> it) the block size to be chosen based on the overall file size to
> prevent the index block from becoming too large, whereas for a
> low-latency local filesystem implementation like SSDs it would
> probably be preferable to set the block size to agree with the page
> size to minimize reads and memory usage.

True... but... in my "android" example repository we have 866,456 live
refs. A block size of 64k needs only 443 blocks, and a 12k index, to
get the file to compress to 28M (vs. 62M packed-refs).

Index records are averaging 28 bytes per block. That gives us room for
about 2,340 blocks, or 4,574,700 refs before the index block exceeds
64k.

In other words, I'm happy with this. Given my storage is network
attached, by the time I bring in 4k I might as well bring in 64k.
Given that a 64k index lets refs grow 5.2x before I start to see
longer read times to load the index, I'm OK with that.

If you are assuming locally attached SSD with a quality kernel (Linux),
I think you can mmap the reftable and basically stop worrying about IO
costs. reftable is still more disk friendly for random seeks than pack
idx v1 or v2.


>   So I ask myself whether it might be worthwhile to allow deeper
> levels of indexing. The main index block could divide the namespace
> into as many segments as fit in one block, but if necessary point at a
> second-level index block that further divides the namespace before
> pointing at the ref block that ultimately contains the value. Those of
> us using local SSDs might tune the system to use 4k block sizes
> throughout, basically preferring to read three or even four disjoint
> 4k blocks rather than two disjoint 64k blocks.

I did consider this, and discarded it as unnecessary complexity. See
some of my discussion above. Nobody is complaining about pack idx
v1/v2 lookup speeds right now, and they are worse for disks.


> * The tuning parameter number_of_restarts currently trades off space
> (for the full refnames and the restart_offsets) against the need to
> read and parse more ref_records to get the full refnames. ISTM that
> this tradeoff could be made less painful by defining a blockwide
> prefix that is omitted from the refnames as used in the restarts. So
> the full refname would change from
>
>       this_name = prior_name[0..prefix_length] + suffix
>
>   to
>
>       this_name = block_prefix + prior_name[0..prefix_length] + suffix
>
>   I would expect this to allow more frequent restarts at lower space
> cost.

I've been on the fence about the value of this. It makes the search
with restarts more difficult to implement, but does allow shrinking a
handful of very popular prefixes like "refs/" and "refs/pulls/" in
some blocks.

An older format of reftable used only a block_prefix, and could not
get nearly as good compression as too many blocks contained references
with different prefixes.
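
As a sketch of how a reader applies these deltas while scanning a
block (the varint/byte helpers here are stand-ins; in the real format
the suffix length shares its varint with the value type):

    char name[4096];    /* longest refname must fit in a block anyway */
    size_t len = 0;

    while (has_more_records(blk)) {
        size_t plen = read_varint(blk);  /* bytes shared with prior name */
        size_t slen = read_varint(blk);  /* suffix stored in this record */
        read_bytes(blk, name + plen, slen);
        len = plen + slen;
        /* name[0..len) is now this_name; at a restart, plen == 0 */
    }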


>   If we are willing to force all reads to consider the indexes, the
> block_prefix could be taken from the index_record that points at it.

No, it's important to be able to scan the file sequentially without
touching the index. Ref advertisement for the current wire protocol
demands all references. Being able to scan from the start of the file
to the end of the last ref block and more-or-less just dump those to
the client is a really nice gain.

Having to load the index first to walk the blocks in order and
assemble them with data from the index detracts from the simplicity of
just walking the ref blocks.


> * Does the format need to insist on fixed-size blocks?

Unfortunately, yes. For my network storage to perform efficient random
access to data, I need the block sizes fixed. I'm willing to trade off
inefficient torn block reads for the index and the log blocks/index,
but not the ref blocks.


> * What is the rationale for the footer? Is it important for the file
> to be written as a stream, as opposed to seeking back to the beginning
> of the file to add the index_size there?

Yes, my storage doesn't allow me to overwrite a prior section. It's
write-once. So I can't patch this data into the header.

> The CRC, I suppose, is meant
> to make it possible to detect a file that has been truncated
> accidentally?

Correct, or that the file footer is somehow otherwise damaged, or that
the file was accidentally appended to and now the footer location no
longer contains footer bits.


> 2. The stacking of multiple reftable files together
>
> * I like the ideas later in the thread to have a file of pointers to
> the files that make up the stack. (I think of it as a "table of
> contents"). I think without it there will be unavoidable races
> involving readdir().

I do too. It's a fantastic idea I am going to steal from Peff and work
into this document and implementation.


> * The spec should describe the protocol for reading references. The
> only interesting thing here is that [...]
>
> * I think that geometric repacking of reftable files will be essential
> to performance for very active repositories with many references, and
> that the steady state of such a repository will involve O(lg N)
> retable files.
>
> * I would like to propose adding another design goal: that reference
> repacking can be done without blocking other *writers* (let alone
> readers, obviously). I think that something like this could work:
[...]
>   I think that this algorithm would allow reference updates to
> continue while the repack is happening, and would even allow reftables
> higher or lower in the stack than B and C to be repacked at the same
> time (though maybe that won't be necessary).

I agree, this is clever. I'll work that into the document. Thank you.


> * A very frequent operation will be to read a single reference. This
> could get expensive if a reftable stack grows deep, because every
> single reftable might have to be inspected. One way to reduce this
> cost might be to store a bloom filter in large reftable files. This
> could allow a quick determination that the file doesn't contain the
> reference being sought.

Yes, of course. But I'm torn on that premise. My gut theory says most
references that are read individually were/are recently modified. Thus
they should be higher in the reftable stack, and a reader will
terminate with its result. It's only the not-founds that will have a
penalty of reaching the base of the stack.

I'd also really like to see repositories holding stacks <4 deep, not
32 deep. If we can get the compaction working well, we should be able
to see most repositories with 1 large base file, 1 medium-ish
compaction, 2 recent updates.

At $DAY_JOB we can do this successfully with pack files, which are
larger and more costly to combine than reftable. I think we can get
reftable to support a reasonable stack depth.
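
If we did later add a filter, the probe side is tiny. A hedged sketch
(size and hash mixing purely illustrative, not part of the format):

    #include <stdint.h>

    #define BLOOM_BITS (1u << 20)           /* illustrative size */

    static uint32_t hash_name(const char *name, uint32_t seed)
    {
        uint32_t h = seed;
        while (*name)
            h = h * 31 + (unsigned char)*name++;
        return h % BLOOM_BITS;
    }

    /* 0 means the table definitely lacks the ref; 1 means "maybe" */
    static int maybe_contains(const uint8_t *bloom, const char *name)
    {
        static const uint32_t seeds[] = { 5, 17, 91 };
        int i;

        for (i = 0; i < 3; i++) {
            uint32_t b = hash_name(name, seeds[i]);
            if (!(bloom[b >> 3] & (1u << (b & 7))))
                return 0;
        }
        return 1;
    }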


> I haven't reviewed your proposal for storing reflogs in reftables in
> any detail, though I must say that my first reaction is surprise that
> you want to store reflogs (which are mostly immutable, rarely read,
> and can be an order of magnitude bigger) in the same files as
> references.

reftable compressed 87,393 references into 3M, and a year's worth of
their log updates into another 4M. Keeping them together actually
simplifies a lot of nasty corner cases in Git.

D/F conflicts ("refs/heads/foo", "refs/heads/foo/bar") go away, but
assuming we continue to reject these to protect existing clients, we
can also still retain logs for deleted names that were cleared out.

Logs can be written in the same atomic filesystem operation as the
refs are mutated, avoiding any skew about the log being dropped or
logged early. Logs are now searchable in reverse time order, which
accelerates log queries.

I really think it's worth storing the logs inside the reftable. I'm
pretty sold on that part of the design at this point. There are many
advantages, and I've been able to sufficiently push off the downsides
by storing the logs in a separate region of the reftable.


* Re: reftable: new ref storage format
  2017-07-16 19:43   ` Shawn Pearce
@ 2017-07-16 21:12     ` Shawn Pearce
  2017-07-16 21:13     ` Dave Borowitz
  2017-07-18  1:43     ` Michael Haggerty
  2 siblings, 0 replies; 25+ messages in thread
From: Shawn Pearce @ 2017-07-16 21:12 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: git

On Sun, Jul 16, 2017 at 12:43 PM, Shawn Pearce <spearce@spearce.org> wrote:
> On Sun, Jul 16, 2017 at 10:33 AM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>
>> * The tuning parameter number_of_restarts currently trades off space
>> (for the full refnames and the restart_offsets) against the need to
>> read and parse more ref_records to get the full refnames. ISTM that
>> this tradeoff could be made less painful by defining a blockwide
>> prefix that is omitted from the refnames as used in the restarts. So
>> the full refname would change from
>>
>>       this_name = prior_name[0..prefix_length] + suffix
>>
>>   to
>>
>>       this_name = block_prefix + prior_name[0..prefix_length] + suffix
>>
>>   I would expect this to allow more frequent restarts at lower space
>> cost.
>
> I've been on the fence about the value of this. It makes the search
> with restarts more difficult to implement, but does allow shrinking a
> handful of very popular prefixes like "refs/" and "refs/pulls/" in
> some blocks.
>
> An older format of reftable used only a block_prefix, and could not
> get nearly as good compression as too many blocks contained references
> with different prefixes.


I ran an experiment on my 866k ref data set. Using a block_prefix gets
less compression, and doesn't improve packing in the file. Given the
additional code complexity, it really isn't worth it:

format       | size | blocks | avg ref/blk
-------------|------|--------|------------
original     | 28 M |    443 |        1955
block_prefix | 29 M |    464 |        1867

:-(


* Re: reftable: new ref storage format
  2017-07-16 19:43   ` Shawn Pearce
  2017-07-16 21:12     ` Shawn Pearce
@ 2017-07-16 21:13     ` Dave Borowitz
  2017-07-16 21:31       ` Shawn Pearce
  2017-07-18  1:43     ` Michael Haggerty
  2 siblings, 1 reply; 25+ messages in thread
From: Dave Borowitz @ 2017-07-16 21:13 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Michael Haggerty, git

On Sun, Jul 16, 2017 at 3:43 PM, Shawn Pearce <spearce@spearce.org> wrote:
> True... but... in my "android" example repository we have 866,456 live
> refs. A block size of 64k needs only 443 blocks, and a 12k index, to
> get the file to compress to 28M (vs. 62M packed-refs).
>
> Index records are averaging 28 bytes per block. That gives us room for
> about 2,340 blocks, or 4,574,700 refs before the index block exceeds
> 64k.

That's only a 5x increase over the current number of refs in this
android repo. I would not be so sure this repo doesn't grow another 5x
in the next few years. Especially as the other optimizations for
working with large repos start to be applied, so it won't be
prohibitively painful to work with such a repo.

Are we ok with increasing the block size when this eventually happens?
(At least I think that's what we would have to do, I haven't been
following closely the discussion on scaling limits.)


* Re: reftable: new ref storage format
  2017-07-16 21:13     ` Dave Borowitz
@ 2017-07-16 21:31       ` Shawn Pearce
  0 siblings, 0 replies; 25+ messages in thread
From: Shawn Pearce @ 2017-07-16 21:31 UTC (permalink / raw)
  To: Dave Borowitz; +Cc: Michael Haggerty, git

On Sun, Jul 16, 2017 at 2:13 PM, Dave Borowitz <dborowitz@google.com> wrote:
> On Sun, Jul 16, 2017 at 3:43 PM, Shawn Pearce <spearce@spearce.org> wrote:
>> True... but... in my "android" example repository we have 866,456 live
>> refs. A block size of 64k needs only 443 blocks, and a 12k index, to
>> get the file to compress to 28M (vs. 62M packed-refs).
>>
>> Index records are averaging 28 bytes per block. That gives us room for
>> about 2,340 blocks, or 4,574,700 refs before the index block exceeds
>> 64k.
>
> That's only a 5x increase over the current number of refs in this
> android repo. I would not be so sure this repo doesn't grow another 5x
> in the next few years. Especially as the other optimizations for
> working with large repos start to be applied, so it won't be
> prohibitively painful to work with such a repo.
>
> Are we ok with increasing the block size when this eventually happens?
> (At least I think that's what we would have to do, I haven't been
> following closely the discussion on scaling limits.)

I think I'd try letting the index grow to 4 blocks (256k) before I
considered increasing the block size. Remember pack idx files are much
larger, and are loaded wholesale into memory by JGit. A ref idx at
256k might not be problematic.


* Re: reftable: new ref storage format
  2017-07-16 19:43   ` Shawn Pearce
  2017-07-16 21:12     ` Shawn Pearce
  2017-07-16 21:13     ` Dave Borowitz
@ 2017-07-18  1:43     ` Michael Haggerty
  2017-07-18 18:53       ` Junio C Hamano
  2017-07-23 22:56       ` Shawn Pearce
  2 siblings, 2 replies; 25+ messages in thread
From: Michael Haggerty @ 2017-07-18  1:43 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git

On Sun, Jul 16, 2017 at 12:43 PM, Shawn Pearce <spearce@spearce.org> wrote:
> On Sun, Jul 16, 2017 at 10:33 AM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>> * Do you propose to store *all* references (i.e., including the
>> references that we call pseudorefs, like `HEAD`, `FETCH_HEAD`, etc) in
>> reftables, or only the references under `refs/`? If the former, then
>> you need to consider that some pseudorefs contain additional
>> information besides the SHA-1 or linked-to reference. If the latter,
>> then you could get some additional compression by making the `refs/`
>> prefix implicit rather than storing it in the "restart" records
>> explicitly.
>
> Great question I didn't have an answer for. I was planning to store
> HEAD, but not FETCH_HEAD/MERGE_HEAD/etc. Mostly because our storage
> system at $DAY_JOB doesn't have a place to store HEAD itself, it's
> buried down in the reference system. So reftable has to do the job.
>
> The file format for reftable describes a type 0x3 which is just a
> length delimited byte string. Provided that the data fits into a
> single block, reftable can store these larger files with their
> auxiliary data.
>
> I'm open to the idea of HEAD and friends being outside of reftable in
> a git-core accessed repository, but the backend storage I want to run
> reftable on would likely need to store HEAD inside reftable.
>
>>   Personally, I don't think it makes sense to store pseudorefs in
>> reftables. HEAD, though it doesn't include any supplemental
>> information, is read so frequently that it seems like a bad idea to
>> have to go through the lookup process rather than storing it in a
>> separate flat file. Moreover, HEAD is written very *infrequently*, so
>> (absent special treatment) it would tend to sink deep in the reftable
>> stack, making reads even more expensive.
>
> That is a very fair argument for keeping HEAD outside.
>
> A counter argument is that HEAD is written very frequently by following
> its indirect pointer to the branch (e.g. refs/heads/master). HEAD is
> consequently reflogged very frequently. reftable stores the logs
> inside the table shards. HEAD could be floated onto the top on every
> write to accompany its log record.

On second thought, the idea of having HEAD (or maybe all pseudorefs)
in the same system would open a few interesting possibilities that
derive from having a global, atomic view of all references:

1. We could store backlinks from references to the symbolic references
that refer to them. This would allow us to update the reflogs for
symbolic refs properly. (Currently, there is special-case code to
update the reflogs for HEAD when the reference that it points at is
modified, but not for other symrefs.)

2. We could store "peeled" versions of symbolic refs. These would have
to be updated whenever the pointed-at reference is updated, but that
would have two nice advantages: HEAD would usually be resolvable based
on the top reftable in the stack, and it would be resolvable in one
step (without having to follow the symref explicitly).

> [...]
>> * The tuning parameter number_of_restarts currently trades off space
>> (for the full refnames and the restart_offsets) against the need to
>> read and parse more ref_records to get the full refnames. ISTM that
>> this tradeoff could be made less painful by defining a blockwide
>> prefix that is omitted from the refnames as used in the restarts. So
>> the full refname would change from
>>
>>       this_name = prior_name[0..prefix_length] + suffix
>>
>>   to
>>
>>       this_name = block_prefix + prior_name[0..prefix_length] + suffix
>>
>>   I would expect this to allow more frequent restarts at lower space
>> cost.
>
> I've been on the fence about the value of this. It makes the search
> with restarts more difficult to implement, but does allow shrinking a
> handful of very popular prefixes like "refs/" and "refs/pulls/" in
> some blocks.
>
> An older format of reftable used only a block_prefix, and could not
> get nearly as good compression as too many blocks contained references
> with different prefixes.

It's clear that using *only* a block_prefix wouldn't yield as good
compression as your proposed scheme. And it feels plausible that when
using 64k blocks, a block_prefix wouldn't help much.

I'm still not quite resigned to non-Google users wanting to use blocks
as large as 64k, but (short of doing actual experiments, yuck!) I
can't estimate whether it would make any detectable difference in the
real world.

On the other end of the spectrum, I might mention that the
shared-storage "network.git" repositories that we use at GitHub often
have a colossal number of references (basically, the sum of the number
of references in all of the forks in a "repository network", including
some hidden references that users don't see). For example, one
"network.git" repository has 56M references(!) Mercifully, we
currently only have to access these repositories for batch jobs, but,
given a better reference storage backend, that might change.

> [...]
>> 2. The stacking of multiple reftable files together
>>
>> * A very frequent operation will be to read a single reference. This
>> could get expensive if a reftable stack grows deep, because every
>> single reftable might have to be inspected. One way to reduce this
>> cost might be to store a bloom filter in large reftable files. This
>> could allow a quick determination that the file doesn't contain the
>> reference being sought.
>
> Yes, of course. But I'm torn on that premise. My gut theory says most
> references that are read individually were/are recently modified. Thus
> they should be higher in the reftable stack, and a reader will
> terminate with its result. It's only the not-founds that will have a
> penalty of reaching the base of the stack.
>
> I'd also really like to see repositories holding stacks <4 deep, not
> 32 deep. If we can get the compaction working well, we should be able
> to see most repositories with 1 large base file, 1 medium-ish
> compaction, 2 recent updates.
>
> At $DAY_JOB we can do this successfully with pack files, which are
> larger and more costly to combine than reftable. I think we can get
> reftable to support a reasonable stack depth.

Are you saying that you merge subsets of packfiles without merging all
of them? Does this work together with bitmaps, or do you only have
bitmaps for the biggest packfile?

We've thought about merging packfiles in that way, but don't want to
give up the benefits of bitmaps.

>> I haven't reviewed your proposal for storing reflogs in reftables in
>> any detail, though I must say that my first reaction is surprise that
>> you want to store reflogs (which are mostly immutable, rarely read,
>> and can be an order of magnitude bigger) in the same files as
>> references.
>
> reftable compressed 87,393 references into 3M, and a year's worth of
> their log updates into another 4M. Keeping them together actually
> simplifies a lot of nasty corner cases in Git.
>
> D/F conflicts ("refs/heads/foo", "refs/heads/foo/bar") go away, but
> assuming we continue to reject these to protect existing clients, we
> can also still retain logs for deleted names that were cleared out.
>
> Logs can be written in the same atomic filesystem operation as the
> refs are mutated, avoiding any skew about the log being dropped or
> logged early. Logs are now searchable in reverse time order, which
> accelerates log queries.
>
> I really think it's worth storing the logs inside the reftable. I'm
> pretty sold on that part of the design at this point. There are many
> advantages, and I've been able to sufficiently push off the downsides
> by storing the logs in a separate region of the reftable.

Those sizes don't sound that scary. Do your reflogs include
significant information in the log messages, or are they all "push"
"push" "push"? We record quite a bit of information in our audit_log
entries (our equivalent of reflogs), so I would expect ours to
compress much less well.

We also tend to use our audit_logs to see what was happening in a
repository; e.g., around the time that a problem occurred. So for us
it is useful that the entries are in chronological order across
references, as opposed to having the entries for each reference
grouped together. We might be the oddballs here though, and in fact it
is possible that this would be an argument for us to stick to our
audit_log scheme rather than use reflogs stored in reftables.

I agree that it would be nice to have logs in the same atomic store as
the references, so if it can be made efficient, I'm all for it.

I've since read over the reflog part of your proposal. Comments:

* With file-based reflogs, the presence or absence of a reflog file
(even if it is empty) is sometimes used to decide whether to log new
updates for the corresponding reference [1]. Is there an equivalent
for reftable-based reflogs?

[1] https://github.com/git/git/blob/f3da2b79be9565779e4f76dc5812c68e156afdf0/refs/files-backend.c#L1976-L1991

* It is not clear from your proposal whether
refname+reverse_int32(time_sec) is meant to be a unique key.
  * If yes, think again. It is not at all unusual for a reference to
be updated more than once in a second.
  * If no, then how can reflogs be expired? It seems that there would
often be no alternative to rewriting all of the reftable files.

* As reflog entries accumulate and reftable files are compacted, it
will often be the case that compacted reftable files will contain many
reflog entries for the some references. Normally (though not always),
the old_id of one entry should be identical to the new_id of the next.
It seems that it should be possible to save quite a bit of space by
representing such entries as a group rather than singly.

Regarding your updated proposal for how to name and stack reftable files:

* You say that ".ref files are named by the SHA-1 hash of the contents
of the reftable". I assume that means "the contents of that particular
file". However, this is not entirely straightforward. It is thinkable
for two reftable files to have the exact same contents. For example,
if reflogs are turned off, and I (1) make commit A; (2) make commit B;
(3) reset the branch back to commit A; then I think that the first and
third reftable files would have identical contents. This would not
*necessarily* be a problem—given that the two reftable files would
have identical contents, the same file could serve for both of them.
But it could very easily lead to confusion, for example if some
process that is compacting the first two reftables decides that it can
delete the file at the same moment that the `git reset` process has
just rewritten the file or decided that it doesn't have to rewrite the
file.

  We could avoid this situation by including the name of the
predecessor reftable file in the body of each new reftable, or even by
including it in the SHA-1 without writing it into the file.

  It *seems* as if it would be an advantage to include the name of the
predecessor reftable in a new reftable, but that info would become
obsolete if some deeper reftables are compacted while new reftables
are being written (which I think is a more useful design goal than
being able to chain the reftable files to each other ab initio).

  We could have both properties if the SHA-1 of a reftable file were a
hash of the *logical* contents of the whole stack of
references+reflogs, including its predecessors. That hash would be
invariant under compaction, so if we compact files A, B, and C, the
results would necessarily have the same hash as file C did previously.
However, it would be expensive to compute the hash of the whole
contents, because to do so one would have to iterate through all of
the references and reflog entries. Moreover, IIUC, on Windows it would
not be possible to rename the "new C" file on top of the "old C" file
if any reader has that file open.

But I don't think there is any reason that the files have to be named
using the hash of their contents. As far as I understand, any unique
filename (i.e., even something as simple as `mktemp XXXXXX.ref`) would
serve just fine. It might also be convenient to embed the timestamp or
N+1 of the predecessor file in the filename for human consumption.

Some other random comments:

* Do we really want to use Git-style varints here? It seems to me that
protocol-buffer-style varints are more familiar and are a lot easier
to understand (albeit a miniscule bit larger on average). They also
have the advantage that they can be padded by inserting 0x80 bytes, a
property that would have come in handy in a packfile-related project
that we were working on internally.

* What would you think about being extravagant and making the
value_type a full byte? It would make the format a tiny bit easier to
work with, and would leave room for future enhancements (e.g.,
pseudorefs, peeled symrefs, support for the successors of SHA-1s)
without having to change the file format dramatically.

* Speaking of the successors of SHA-1s, if you haven't already, it
would make sense for somebody to read the proposal for the transition
to a post-SHA-1 world with an eye to considering whether the reftable
file should play a role, and whether that should have an effect on the
current design. I'm not saying we should build in support already, but
it would be a pity if the format had to be changed radically at the
time of the transition.

* The transition to reftable will require a new repository format
version and an extension [2]. You might as well mention that in your
proposal and suggest a name.

[2] https://github.com/git/git/commit/00a09d57eb8a041e6a6b0470c53533719c049bab

* Current versions of git look for `$GIT_DIR/refs` as part of
determining whether a directory might be a plausible git directory
[3]. Although current versions of Git won't be able to read a
reftable-based repository, they *should* be able to recognize the
directory as a plausible Git directory. So ISTM that a reftable-based
Git repository should put *something* at that path. For example, that
path could be used as the name of the "stack" file. The fact that it
is a file rather than a directory shouldn't bother is_git_directory()
but should be enough to prevent it from accidentally being used as a
loose refs directory.

[3] https://github.com/git/git/blob/f3da2b79be9565779e4f76dc5812c68e156afdf0/setup.c#L335-L339

Yours,
Michael


* Re: reftable: new ref storage format
  2017-07-18  1:43     ` Michael Haggerty
@ 2017-07-18 18:53       ` Junio C Hamano
  2017-07-23 22:56       ` Shawn Pearce
  1 sibling, 0 replies; 25+ messages in thread
From: Junio C Hamano @ 2017-07-18 18:53 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Shawn Pearce, git

Michael Haggerty <mhagger@alum.mit.edu> writes:

> On second thought, the idea of having HEAD (or maybe all pseudorefs)
> in the same system would open a few interesting possibilities that
> derive from having a global, atomic view of all references:
>
> 1. We could store backlinks from references to the symbolic references
> that refer to them. This would allow us to update the reflogs for
> symbolic refs properly. (Currently, there is special-case code to
> update the reflogs for HEAD when the reference that it points at is
> modified, but not for other symrefs.)
>
> 2. We could store "peeled" versions of symbolic refs. These would have
> to be updated whenever the pointed-at reference is updated, but that
> would have two nice advantages: HEAD would usually be resolvable based
> on the top reftable in the stack, and it would be resolvable in one
> step (without having to follow the symref explicitly).

Interesting.  I think FETCH_HEAD is the only thing that would not
mesh well with the "one ref records one object name, or refers to
another ref" paradigm, and I think it is OK to leave it that way,
even in the new pluggable ref backend world order.

It still bothers me that the way refs.c special-cases what you call
pseudo refs seems somewhat inconsistent, which I found by accident
the other day while running t1405 and reported in a separate thread.
When we start adding a reftable backend, I suspect we'd need to dig
further, but if I recall the symptom correctly, writing them out is
still passed to the main (i.e. files) backend, while reading them is
done directly in refs.c layer without consulting any backend, and
that made it impossible to optionally tweak the filename used to
store refs in the files backend.






* Re: reftable: new ref storage format
  2017-07-18  1:43     ` Michael Haggerty
  2017-07-18 18:53       ` Junio C Hamano
@ 2017-07-23 22:56       ` Shawn Pearce
  2017-07-23 23:03         ` Shawn Pearce
  1 sibling, 1 reply; 25+ messages in thread
From: Shawn Pearce @ 2017-07-23 22:56 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: git

+git@vger.kernel.org. I originally sent the below reply privately by mistake.

On Mon, Jul 17, 2017 at 6:43 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
> On Sun, Jul 16, 2017 at 12:43 PM, Shawn Pearce <spearce@spearce.org> wrote:
>> On Sun, Jul 16, 2017 at 10:33 AM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>
> On second thought, the idea of having HEAD (or maybe all pseudorefs)
> in the same system would open a few interesting possibilities that
> derive from having a global, atomic view of all references:
>
> 1. We could store backlinks from references to the symbolic references
> that refer to them. This would allow us to update the reflogs for
> symbolic refs properly. (Currently, there is special-case code to
> update the reflogs for HEAD when the reference that it points at is
> modified, but not for other symrefs.)

This is a good idea, but makes for some difficult transition code. We
have to keep the special case for HEAD, but other symrefs would log
when in a reftable.

> 2. We could store "peeled" versions of symbolic refs. These would have
> to be updated whenever the pointed-at reference is updated, but that
> would have two nice advantages: HEAD would usually be resolvable based
> on the top reftable in the stack, and it would be resolvable in one
> step (without having to follow the symref explicitly).

Great observation. I wish I'd seen that sooner. It's a pain in the neck to
resolve symrefs, and has caused us a few bugs in JGit on our current
non-standard storage. It depends on the back pointer being present and
accurate to ensure an update of master also updates the cached HEAD.

I'll have to mull on these a bit. I'm not folding them into my
documentation and implementation just yet.


[...]
> I'm still not quite resigned to non-Google users wanting to use blocks
> as large as 64k, but (short of doing actual experiments, yuck!) I
> can't estimate whether it would make any detectable difference in the
> real world.

I think it is only likely to matter with NFS, and then it's a balancing
act of how much of that block did you need vs. not need. :)

> On the other end of the spectrum, I might mention that the
> shared-storage "network.git" repositories that we use at GitHub often
> have a colossal number of references (basically, the sum of the number
> of references in all of the forks in a "repository network", including
> some hidden references that users don't see). For example, one
> "network.git" repository has 56M references(!) Mercifully, we
> currently only have to access these repositories for batch jobs, but,
> given a better reference storage backend, that might change.

A larger block size right now has the advantage of a smaller index,
which could make a single ref lookup more efficient. Otherwise, the
block size doesn't have a big impact on streaming through many
references.


>>> 2. The stacking of multiple reftable files together
[...]
>> At $DAY_JOB we can do this successfully with pack files, which are
>> larger and more costly to combine than reftable. I think we can get
>> reftable to support a reasonable stack depth.
>
> Are you saying that you merge subsets of packfiles without merging all
> of them? Does this work together with bitmaps, or do you only have
> bitmaps for the biggest packfile?
>
> We've thought about merging packfiles in that way, but don't want to
> give up the benefits of bitmaps.

Yes. We compact smaller pack files together into a larger pack file,
and try to keep a repository at:

 - 2 compacted packs, each <20 MiB
 - 1 base pack + bitmap

We issue a daily GC for any repository that isn't just 1 base pack.
But during a business day this compacting process lets us handle most
read traffic quite well, despite the bitmaps being incomplete.


>>> I haven't reviewed your proposal for storing reflogs in reftables in
[...]
>
> Those sizes don't sound that scary. Do your reflogs include
> significant information in the log messages, or are they all "push"
> "push" "push"? We record quite a bit of information in our audit_log
> entries (our equivalent of reflogs), so I would expect ours to
> compress much less well.

These were pretty sparse in the comment field, with frequent reuse of a
message. So it may not be representative of what you are storing.

> We also tend to use our audit_logs to see what was happening in a
> repository; e.g., around the time that a problem occurred. So for us
> it is useful that the entries are in chronological order across
> references, as opposed to having the entries for each reference
> grouped together. We might be the oddballs here though, and in fact it
> is possible that this would be an argument for us to stick to our
> audit_log scheme rather than use reflogs stored in reftables.

I think of reflogs as being about a single ref, not the whole
repository. So I'm inclined to say the reftable storage of them should
be by ref, then time. Anyone who wants a repository view must either
scan the entire log segment or roll their own log with an
update/post-receive hook.


> I've since read over the reflog part of your proposal. Comments:
>
> * With file-based reflogs, the presence or absence of a reflog file
> (even if it is empty) is sometimes used to decide whether to log new
> updates for the corresponding reference [1]. Is there an equivalent
> for reftable-based reflogs?
>
> [1] https://github.com/git/git/blob/f3da2b79be9565779e4f76dc5812c68e156afdf0/refs/files-backend.c#L1976-L1991

No, I was going to store everything. The flag you are talking about
was added because IIRC Gerrit Code Review was making a lot of
create-only references that were never modified and the reflogs were
growing out of control in terms of inodes used. So disabling reflogs
by default and then explicitly creating files for refs under
refs/heads/ provided an escape hatch.

With integrated storage in reftable, these problems go away.


> * It is not clear from your proposal whether
> refname+reverse_int32(time_sec) is meant to be a unique key.
>   * If yes, think again. It is not at all unusual for a reference to
> be updated more than once in a second.
>   * If no, then how can reflogs be expired? It seems that there would
> often be no alternative to rewriting all of the reftable files.

Great catch, I missed that. I've expanded the format to be
int64(time_usec) using microseconds since the epoch, with the requirement
that the time be unique.

Writers working with a second precision clock (e.g. current reflog)
should pad the timestamp with a suffix of 999999, and decrement a
microsecond from each older log record that would otherwise have a
duplicate time.
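
Concretely, visiting records newest-first (a sketch of that rule):

    #include <stdint.h>

    /* prev starts at UINT64_MAX; records are visited newest first */
    static uint64_t unique_usec(uint32_t time_sec, uint64_t *prev)
    {
        uint64_t t = (uint64_t)time_sec * 1000000 + 999999;

        if (t >= *prev)
            t = *prev - 1;   /* older record: force a smaller key */
        *prev = t;
        return t;
    }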


> * As reflog entries accumulate and reftable files are compacted, it
> will often be the case that compacted reftable files will contain many
> reflog entries for the same references. Normally (though not always),
> the old_id of one entry should be identical to the new_id of the next.
> It seems that it should be possible to save quite a bit of space by
> representing such entries as a group rather than singly.

I'm betting on grouping of log records by ref and time, and then libz
deflate over the log block to reduce duplications. I actually tried
adding a flag byte to recycle the id across records, and it was
slightly larger than letting libz handle it.


> Regarding your updated proposal for how to name and stack reftable files:
>
> * You say that ".ref files are named by the SHA-1 hash of the contents
> of the reftable". I assume that means "the contents of that particular
> file". However, this is not entirely straightforward. It is thinkable
> for two reftable files to have the exact same contents. For example,
> if reflogs are turned off, and I (1) make commit A; (2) make commit B;
> (3) reset the branch back to commit A; then I think that the first and
> third reftable files would have identical contents. This would not
> *necessarily* be a problem—given that the two reftable files would
> have identical contents, the same file could serve for both of them.
> But it could very easily lead to confusion, for example if some
> process that is compacting the first two reftables decides that it can
> delete the file at the same moment that the `git reset` process has
> just rewritten the file or decided that it doesn't have to rewrite the
> file.

Yikes!  Good catch.

>   We could avoid this situation by including the name of the
> predecessor reftable file in the body of each new reftable, or even by
> including it in the SHA-1 without writing it into the file.
>
>   It *seems* as if it would be an advantage to include the name of the
> predecessor reftable in a new reftable, but that info would become
> obsolete if some deeper reftables are compacted while new reftables
> are being written (which I think is a more useful design goal than
> being able to chain the reftable files to each other ab initio).

I agree. We don't want to embed the predecessor table names into the
table itself.

>   We could have both properties if the SHA-1 of a reftable file were a
> hash of the *logical* contents of the whole stack of
> references+reflogs, including its predecessors. That hash would be
> invariant under compaction, so if we compact files A, B, and C, the
> results would necessarily have the same hash as file C did previously.
> However, it would be expensive to compute the hash of the whole
> contents, because to do so one would have to iterate through all of
> the references and reflog entries. Moreover, IIUC, on Windows it would
> not be possible to rename the "new C" file on top of the "old C" file
> if any reader has that file open.
>
> But I don't think there is any reason that the files have to be named
> using the hash of their contents. As far as I understand, any unique
> filename (i.e., even something as simple as `mktemp XXXXXX.ref`) would
> serve just fine. It might also be convenient to embed the timestamp or
> N+1 of the predecessor file in the filename for human consumption.

Yes, you've sold me on this. We don't want to name them by SHA-1 of
contents, instead any suitable unique name is fine, such as one
constructed by mktemp.


> * Do we really want to use Git-style varints here? It seems to me that
> protocol-buffer-style varints are more familiar and are a lot easier
> to understand (albeit a miniscule bit larger on average). They also
> have the advantage that they can be padded by inserting 0x80 bytes, a
> property that would have come in handy in a packfile-related project
> that we were working on internally.

Yes, I would prefer to stick with a varint already in use in the git
project, vs. introducing yet another varint format. It's bad enough we
have 2 varint types in the pack file format. And I really don't see a
reason to support people doing weird things with a reftable, like
padding a file by inserting no-op 0x80 varint bytes.
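
For reference, the decoder for git's existing varint (varint.c) is
tiny; the `val += 1` on each continuation means every value has
exactly one encoding:

    uintmax_t decode_varint(const unsigned char **bufp)
    {
        const unsigned char *buf = *bufp;
        unsigned char c = *buf++;
        uintmax_t val = c & 127;

        while (c & 128) {
            val += 1;
            c = *buf++;
            val = (val << 7) + (c & 127);
        }
        *bufp = buf;
        return val;
    }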

> * What would you think about being extravagant and making the
> value_type a full byte? It would make the format a tiny bit easier to
> work with, and would leave room for future enhancements (e.g.,
> pseudorefs, peeled symrefs, support for the successors of SHA-1s)
> without having to change the file format dramatically.

I reran my 866k file with full byte value_type. It pushes up the
average bytes per ref from 33 to 34, but the overall file size is
still 28M (with 64k block size). I think it's reasonable to expand this
to the full byte as you suggest.


> * The transition to reftable will require a new repository format
> version and an extension [2]. You might as well mention that in your
> proposal and suggest a name.
>
> [2] https://github.com/git/git/commit/00a09d57eb8a041e6a6b0470c53533719c049bab

Thanks, documented.

> * Current versions of git look for `$GIT_DIR/refs` as part of
> determining whether a directory might be a plausible git directory
> [3]. Although current versions of Git won't be able to read a
> reftable-based repository, they *should* be able to recognize the
> directory as a plausible Git directory. So ISTM that a reftable-based
> Git repository should put *something* at that path. For example, that
> path could be used as the name of the "stack" file. The fact that it
> is a file rather than a directory shouldn't bother is_git_directory()
> but should be enough to prevent it from accidentally being used as a
> loose refs directory.
>
> [3] https://github.com/git/git/blob/f3da2b79be9565779e4f76dc5812c68e156afdf0/setup.c#L335-L339

Thanks, documented.


* Re: reftable: new ref storage format
  2017-07-23 22:56       ` Shawn Pearce
@ 2017-07-23 23:03         ` Shawn Pearce
  0 siblings, 0 replies; 25+ messages in thread
From: Shawn Pearce @ 2017-07-23 23:03 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: git

On Sun, Jul 23, 2017 at 3:56 PM, Shawn Pearce <spearce@spearce.org> wrote:
> On Mon, Jul 17, 2017 at 6:43 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>> On Sun, Jul 16, 2017 at 12:43 PM, Shawn Pearce <spearce@spearce.org> wrote:
>>> On Sun, Jul 16, 2017 at 10:33 AM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>
>> * What would you think about being extravagant and making the
>> value_type a full byte? It would make the format a tiny bit easier to
>> work with, and would leave room for future enhancements (e.g.,
>> pseudorefs, peeled symrefs, support for the successors of SHA-1s)
>> without having to change the file format dramatically.
>
> I reran my 866k file with full byte value_type. It pushes up the
> average bytes per ref from 33 to 34, but the overall file size is
> still 28M (with 64k block size). I think it's reasonable to expand this
> to the full byte as you suggest.

FYI, I went back on this in the v3 draft I posted on Jul 22 in
https://public-inbox.org/git/CAJo=hJvxWg2J-yRiCK3szux=eYM2ThjT0KWo-SFFOOc1RkxXzg@mail.gmail.com/

I expanded value_type from 2 bits to 3 bits, but kept it as a bit
field in a varint. I just couldn't justify the additional byte per ref
in these large files. The prefix compression works well enough that
many refs are still able to use only a single byte for the
suffix_length << 3 | value_type varint, keeping the average at 33
bytes per ref.

The reftable format uses values 0-3, leaving 4-7 available. I reserved
4 for an arbitrary payload like MERGE_HEAD type files.
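
So the leading varint of each ref record packs both fields (a sketch;
`write_varint()` stands in for the real writer):

    /* suffix_length in the high bits, 3-bit value_type in the low */
    unsigned prefix = (suffix_length << 3) | (value_type & 0x7);
    write_varint(out, prefix);
    /* suffix_length <= 15 keeps this varint to a single byte, which
       is what holds the average at 33 bytes per ref */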

