From: Stefan Beller
Date: Mon, 17 Jul 2017 11:53:23 -0700
Subject: Re: reftable [v2]: new ref storage format
To: Shawn Pearce
Cc: git

On Mon, Jul 17, 2017 at 8:01 AM, Shawn Pearce wrote:
> A ref block is written as:
>
>     'r'
>     uint24 ( block_len )
>     ref_record*
>     uint32( restart_offset )*
>     uint16( number_of_restarts )
>     padding?
> ...
>
> The 2-byte `number_of_restarts + 1` stores the number of entries in
> the `restart_offset` list.

This means uint16( number_of_restarts ) cannot be 0, but has to be 1
to indicate no restarts? Why does this field need to be non-NUL when
the padding is all NUL?

When starting to write a block, we need to know exactly how large the
ref_record* stream and the restart offsets are going to be, so that
number_of_restarts ends up at the position promised via block_len.
This sounds complicated, unless I missed the obvious.

Going by this, would it rather make sense to omit block_len and
instead scan backwards from block_size - 1 to find the first non-NUL
byte, which would be the number_of_restarts? Or we could allow
additional padding between the ref_records and the restart_offsets,
such that the writer implementation has some wiggle room for the
restart logic.
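To make the layout question above concrete, here is a rough
writer-side sketch (purely illustrative; the helper names and the
buffer handling are made up, only the field widths follow the quoted
description):

#include <stdint.h>
#include <string.h>

static void put_u16(uint8_t *p, uint16_t v) { p[0] = v >> 8; p[1] = v; }
static void put_u24(uint8_t *p, uint32_t v) { p[0] = v >> 16; p[1] = v >> 8; p[2] = v; }
static void put_u32(uint8_t *p, uint32_t v) { p[0] = v >> 24; p[1] = v >> 16; p[2] = v >> 8; p[3] = v; }

/*
 * records/records_len: the already-encoded ref_record stream.
 * restarts/nr_restarts: restart offsets collected while encoding it.
 * Returns 0 on success, -1 if the block does not fit into block_size.
 */
static int finish_ref_block(uint8_t *block, uint32_t block_size,
                            const uint8_t *records, uint32_t records_len,
                            const uint32_t *restarts, uint16_t nr_restarts)
{
	uint32_t used = 4 + records_len    /* 'r' + uint24 block_len */
		+ 4u * nr_restarts         /* uint32 restart_offset* */
		+ 2;                       /* uint16 number_of_restarts */
	uint32_t off, i;

	if (used > block_size)
		return -1;

	block[0] = 'r';
	/*
	 * block_len is taken here as the length up to and including
	 * number_of_restarts; whether it also covers the padding is
	 * part of the question above.
	 */
	put_u24(block + 1, used);
	memcpy(block + 4, records, records_len);

	off = 4 + records_len;
	for (i = 0; i < nr_restarts; i++, off += 4)
		put_u32(block + off, restarts[i]);
	/* Or nr_restarts + 1, if the field really stores N + 1. */
	put_u16(block + off, nr_restarts);

	memset(block + off + 2, 0, block_size - used);  /* NUL padding */
	return 0;
}

The point being: records_len and nr_restarts must both be final before
block_len and the tail of the block can be written.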
> #### log record
>
> Log record keys are structured as:
>
>     ref_name '\0' reverse_int32( time_sec )
>
> where `time_sec` is the update time in seconds since the epoch.

A signed 32-bit seconds-since-the-epoch counter overflows in 2038,
which is only 21 years away. (Did you just volunteer to fix up the
time issues in 20 years?) If possible, I'd prefer a date encoding with
more range available.

> The
> `reverse_int32` function inverses the value so lexographical ordering
> the network byte order time sorts more recent records first:
>
>     reverse_int(int32 t) {
>         return 0xffffffff - t;
>     }
>
> Log records have a similar starting structure to ref and index
> records, utilizing the same prefix compression scheme applied to the
> key described above.

The ref names themselves are compressed; would we also want to
compress the timing information? The time_sec could be a varint
encoding the delta to the previous change of the same ref, or relative
to the epoch (i.e. the absolute time) if it is the first change of
that ref.

Just to be clear, the ordering here would be

refs/heads/maint
refs/heads/maint
...
refs/heads/master
refs/heads/master

such that a ref that has more than one entry in its reflog in a given
reftable file would get perfect prefix compression for its name?
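As a toy sketch of those two points (nothing here is part of the
proposed format; the 64-bit width, the varint flavor and the helper
names are just for illustration of "more range than int32" plus a
per-ref delta, which stays small exactly because entries for the same
ref are adjacent and sorted newest first):

#include <stdint.h>
#include <stddef.h>

/* Same trick as reverse_int32, but on a 64-bit timestamp. */
static uint64_t reverse_time(uint64_t time_sec)
{
	return UINT64_MAX - time_sec;
}

/* Minimal LEB128-style varint, little-endian groups of 7 bits. */
static size_t put_varint(uint8_t *out, uint64_t v)
{
	size_t n = 0;
	do {
		uint8_t b = v & 0x7f;
		v >>= 7;
		out[n++] = b | (v ? 0x80 : 0);
	} while (v);
	return n;
}

/*
 * prev_time_sec == 0 means "first reflog entry for this ref in the
 * file": store the absolute time.  Otherwise store the delta to the
 * previous (newer) entry of the same ref, which is typically tiny.
 */
static size_t encode_log_time(uint8_t *out, uint64_t time_sec,
                              uint64_t prev_time_sec)
{
	if (!prev_time_sec)
		return put_varint(out, time_sec);
	return put_varint(out, prev_time_sec - time_sec);
}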
> ### Update transactions
>
> Although reftables are immutable, mutations are supported by writing a
> new reftable and atomically appending it to the stack:
>
> 1. Atomically create `stack.lock`
> 2. Copy current stack to `stack.lock`.
> 3. Prepare new reftable in temp file `.refXXXXXXX`.
>    Include log entries.
> 4. Rename (3) to `${sha1}.ref`.
> 5. Append `${sha1}.ref` to `stack.lock`
> 6. Atomically rename `stack.lock` to `stack`.

In case steps 3 and 4 become time consuming, could they be prepared
outside the lock, such that inside the lock we'd only need to check
for contention on refs? For example, if I want to update one ref and
another writer wants to update a different ref, we'd both prepare a
new reftable containing one ref and its log entry; then one of us
obtains the lock and writes. The second writer would then need to
inspect the delta of the stack and see whether any ref they want to
change was touched.

> ### Compaction
>
> A partial stack of reftables can be compacted by merging references
> using a straightforward merge join across reftables, selecting the
> most recent value for output, and omitting deleted references that do
> not appear in remaining, lower reftables.
>
> For sake of illustration, assume the stack currently consists of
> reftable files (from oldest to newest): A, B, C, and D. The compactor
> is going to compact B and C, leaving A and D alone.
>
> 1. Obtain lock `stack.lock` and read the `stack` file.
> 2. Obtain locks `B.lock` and `C.lock`.
>    Ownership of these locks prevents other processes from trying
>    to compact these files.
> 3. Release `stack.lock`.
> 4. Compact `B` and `C` in temp file `.refXXXXXXX`.
> 5. Reacquire lock `stack.lock`.
> 6. Verify that `B` and `C` are still in the stack, in that order. This
>    should always be the case, assuming that other processes are adhering
>    to the locking protocol.
> 7. Rename `.refXXXXXXX` to `X`.
> 8. Write the new stack to `stack.lock`, replacing `B` and `C` with `X`.
> 9. Atomically rename `stack.lock` to `stack`.
> 10. Delete `B` and `C`, perhaps after a short sleep to avoid forcing
>     readers to backtrack.
>
> This strategy permits compactions to proceed independently of updates.

Step 10 could be deferred to gc as well; auto gc would need to learn
about the number of loose reftables in that case.

Thanks,
Stefan