git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Stefan Beller <sbeller@google.com>
To: Shawn Pearce <spearce@spearce.org>
Cc: git <git@vger.kernel.org>
Subject: Re: reftable [v2]: new ref storage format
Date: Mon, 17 Jul 2017 11:53:23 -0700	[thread overview]
Message-ID: <CAGZ79kaFmxRSp1YBXc=6YEeDGUO-VLBF0yk+Bb0np7x80z_Vog@mail.gmail.com> (raw)
In-Reply-To: <CAJo=hJtTp2eA3z9wW9cHo-nA7kK40vVThqh6inXpbCcqfdMP9g@mail.gmail.com>

On Mon, Jul 17, 2017 at 8:01 AM, Shawn Pearce <spearce@spearce.org> wrote:

> A ref block is written as:
>
>     'r'
>     uint24 ( block_len )
>     ref_record*
>     uint32( restart_offset )*
>     uint16( number_of_restarts )
>     padding?
>
...
>
> The 2-byte `number_of_restarts + 1` stores the number of entries in
> the `restart_offset` list.

This means uint16( number_of_restarts ) cannot be 0, but has to be 1
to indicate no restarts?

Why do we need to be non-NUL here, but the padding is all NUL?

When starting to write a block, we need to know exactly how large
the ref_records* and restart offsets need to be to put the
number_of_restarts at the position as promised via block_len.
This sounds complicated unless I missed the obvious.

Going by this, would it rather make sense to omit the block_len
and then scan backwards from *block_size-1 to find the first non-NUL
and that will be the number_of_restarts?

Or we could allow additional padding between ref_record and
restart_offsets, such that the writer implementation has wiggle room
for the restarting logic.

>
> #### log record
>
> Log record keys are structured as:
>
>     ref_name '\0' reverse_int32( time_sec )
>
> where `time_sec` is the update time in seconds since the epoch.

The epoch ends 2038, which is in 21 years. (Did you just volunteer
to fixup the time issues in 20 years?)
If possible I'd prefer a date to be encoded with more range available.

>  The
> `reverse_int32` function inverses the value so lexographical ordering
> the network byte order time sorts more recent records first:
>
>     reverse_int(int32 t) {
>       return 0xffffffff - t;
>     }
>
> Log records have a similar starting structure to ref and index
> records, utilizing the same prefix compression scheme applied to the
> key described above.

The ref names itself are compressed, would we also want to compress
the timing information? The time_sec could be a varint indicating a delta
to the previous change of a ref, or fixed to the epoch if it is the first change
of that ref.

Just to be clear, the ordering here would be

  refs/heads/maint <number>
  refs/heads/maint <smaller number>
  ...
  refs/heads/master <number>
  refs/heads/master <smaller number>

such that refs that have more than one entry in its reflog in a given
refstable file, would have perfect prefix compression for the ref?

> ### Update transactions
>
> Although reftables are immutable, mutations are supported by writing a
> new reftable and atomically appending it to the stack:
>
> 1. Atomically create `stack.lock`
> 2. Copy current stack to `stack.lock`.
> 3. Prepare new reftable in temp file `.refXXXXXXX`.
>    Include log entries.
> 4. Rename (3) to `${sha1}.ref`.
> 5. Append `${sha1}.ref` to `stack.lock`
> 6. Atomically rename `stack.lock` to `stack`.

In case 3.+4. becomes time consuming, it can be prepared outside
the lock, such that inside the lock we'd only need to check
for contention of refs? For example if I'd want to update one ref and
another write wants to update another refs, we'd both be preparing
the a new refstable containing one ref and log, then one of us obtains
the lock and writes. The second writer would then need to inspect
the delta of the stack and see if any ref that they want to change
was touched.

> ### Compaction
>
> A partial stack of reftables can be compacted by merging references
> using a straightforward merge join across reftables, selecting the
> most recent value for output, and omitting deleted references that do
> not appear in remaining, lower reftables.
>
> For sake of illustration, assume the stack currently consists of
> reftable files (from oldest to newest): A, B, C, and D. The compactor
> is going to compact B and C, leaving A and D alone.
>
> 1.  Obtain lock `stack.lock` and read the `stack` file.
> 2.  Obtain locks `B.lock` and `C.lock`.
>     Ownership of these locks prevents other processes from trying
> to compact these files.
> 3.  Release `stack.lock`.
> 4.  Compact `B` and `C` in temp file `.refXXXXXXX`.
> 5.  Reacquire lock `stack.lock`.
> 6.  Verify that `B` and `C` are still in the stack, in that order. This
>     should always be the case, assuming that other processes are adhering
>     to the locking protocol.
> 7.  Rename `.refXXXXXXX` to `X`.
> 8.  Write the new stack to `stack.lock`, replacing `B` and `C` with `X`.
> 9.  Atomically rename `stack.lock` to `stack`.
> 10. Delete `B` and `C`, perhaps after a short sleep to avoid forcing
>     readers to backtrack.
>
> This strategy permits compactions to proceed independently of updates.

10. could be deferred to gc as well. auto gc would need to learn about
the number
of loose refstables in that case.

Thanks,
Stefan

  reply	other threads:[~2017-07-17 18:53 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-07-17 15:01 reftable [v2]: new ref storage format Shawn Pearce
2017-07-17 18:53 ` Stefan Beller [this message]
2017-07-17 19:04   ` Shawn Pearce
2017-07-17 19:56     ` Stefan Beller
2017-07-17 19:51 ` Junio C Hamano
2017-07-18 20:54   ` Shawn Pearce
2017-07-19 14:02     ` Ævar Arnfjörð Bjarmason
2017-07-23 21:46       ` Shawn Pearce
2017-07-23 23:47         ` Ævar Arnfjörð Bjarmason

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAGZ79kaFmxRSp1YBXc=6YEeDGUO-VLBF0yk+Bb0np7x80z_Vog@mail.gmail.com' \
    --to=sbeller@google.com \
    --cc=git@vger.kernel.org \
    --cc=spearce@spearce.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).