From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net X-Spam-Level: X-Spam-Status: No, score=-2.9 required=3.0 tests=AWL,BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_PASS, SPF_PASS shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2 Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by dcvr.yhbt.net (Postfix) with ESMTP id EDAB41F55B for ; Wed, 20 May 2020 18:52:10 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726850AbgETSwG (ORCPT ); Wed, 20 May 2020 14:52:06 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41048 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726775AbgETSwG (ORCPT ); Wed, 20 May 2020 14:52:06 -0400 Received: from mail-pf1-x441.google.com (mail-pf1-x441.google.com [IPv6:2607:f8b0:4864:20::441]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 55182C061A0E for ; Wed, 20 May 2020 11:52:06 -0700 (PDT) Received: by mail-pf1-x441.google.com with SMTP id 145so1993794pfw.13 for ; Wed, 20 May 2020 11:52:06 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:content-transfer-encoding:in-reply-to; bh=pFj1izBOpy2lNNeFdz/gqcwm+QSMTxlb6phuEGxsA4Y=; b=os/AjTC4O9OFd6KQoAyqgFHneBcsJo6EjzOAD1fIxPWezLFOQ27BIcjmbiAmF2T/g5 uKQSOdRJfsb8rfPtPuEnXu1MuiZTFbGxXQlEaXF5QcDRVLK3zYGKCNiYBvnhK2QLrSBF noOlE4W5gwM6ZWzj6uYxXpnH1cGmYmLgYQHr0OdgeRtkxfE6gKB/I808XdcAnoEzA+on 7mjLyUWNs1PDr0vf/Mh62h7J43aywDkAcADw1Q7rTPU3B0ToNTWbUoHfoLU0H72nfIMw CEKQj7mme1eqis7+pNoTRaC5pGykMaxsjso1evjhjXa+2uNNLqW6A1Ywju8JZ6T/izL6 k2tA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:content-transfer-encoding :in-reply-to; bh=pFj1izBOpy2lNNeFdz/gqcwm+QSMTxlb6phuEGxsA4Y=; b=AaVOA6Ec4gFkoHQrx64xWWV3T6rtKm/AW6/cmP0DTqSWc7KogmcXP4D4SmfjTijEgK 8VCK3C1qFlf5RmUtPdVutUh9EcbzpgmsLBPdfDc1O85nn1/KQUr8Qur4aN3QMqK9KPJY zJ5gDjNRWTKyJKmt4dIkt2rRS8QZaU5MkqOjPdLBDxWupclBH9tw6GvqMwiqXhCRM0oV YFC9yR2TDABQzWof0S+crhwMDH4kbwZKXdjujdE8aEqToHiwEuffYissRPWZE8mMOFfv lJL5f8jl92GXgYa/9W5/p34KbVYpRIX7YTXfdbdWzAEkIWJEZoTL8rsjn/HkMijF2pwL AZ+Q== X-Gm-Message-State: AOAM530Tieqz6cFZ7fLX9NvbyzExzH10SWt4wzhRQSfCd9O7W16JxyX6 T+olEtxF81oJk1TY7EhBuyA= X-Google-Smtp-Source: ABdhPJyNOWNdCP5jO8LmDjcauNczKQOXLqa0+6T2w274YKSRQfxIVKXXgWhUMfgerVnPuFK42ry0aw== X-Received: by 2002:a65:498f:: with SMTP id r15mr5331532pgs.345.1590000725580; Wed, 20 May 2020 11:52:05 -0700 (PDT) Received: from google.com ([2620:15c:2ce:200:cf67:1de0:170f:be65]) by smtp.gmail.com with ESMTPSA id a85sm2685580pfd.181.2020.05.20.11.52.03 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 20 May 2020 11:52:04 -0700 (PDT) Date: Wed, 20 May 2020 11:52:02 -0700 From: Jonathan Nieder To: Junio C Hamano Cc: Jonathan Nieder via GitGitGadget , git@vger.kernel.org, Han-Wen Nienhuys Subject: Re: [PATCH v13 04/13] reftable: file format documentation Message-ID: <20200520185202.GA3648@google.com> References: <96fd9814a6753e87fb99c6f9d55a0728d3dba6cb.1589226388.git.gitgitgadget@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Junio C Hamano wrote: > Han-wen wrote: >> From: Jonathan Nieder >> >> Shawn Pearce explains: >> >> Some repositories contain a lot of references (e.g. android at 866k, >> rails at 31k). The reftable format provides: >> >> - Near constant time lookup for any single reference, even when the >> repository is cold and not in process or kernel cache. >> - Near constant time verification a SHA-1 is referred to by at least >> one reference (for allow-tip-sha1-in-want). > > Not quite grammatical sentence? Perhaps "if" after "verification? Good catch, thanks. [...] >> using pandoc 2.2.1. The result required the following additional >> minor changes: >> >> - removed the [TOC] directive to add a table of contents, since >> asciidoc does not support it >> - replaced git-scm.com/docs links with linkgit: directives that link >> to other pages within Git's documentation > > There are many > > ’ > > funny-quotes where we would prefer to place vanilla single quotes, > which may also need to be corrected in the conversion toolchain. Looks like Han-Wen is taking care of this (thanks!). > Typoes pointed out below may probably be from the original where > they should be corrected. I'm happy to do one final update the doc in JGit to match what we end up with and then replace it with a pointer to Git's copy once that lands. [...] >> +Repositories with many loose references occupy a large number of disk >> +blocks from the local file system, as each reference is its own file >> +storing 41 bytes (and another file for the corresponding reflog). This >> +negatively affects the number of inodes available when a large number of >> +repositories are stored on the same filesystem. Readers can be penalized >> +due to the larger number of syscalls required to traverse and read the >> +`$GIT_DIR/refs` directory. > > Another downside is that we cannot arrange atomic updates to > multiple refs over loose refs, even though the "lookup of a single > reference does not require linear scan" unlike packed-refs, (as long > as the filesystem does its job). Worth mentioning? Yes, this was another major part of the motivation (avoiding the complication of the "atomic" multi-ref updates to packed-refs that Git and JGit had to learn). [...] >> +References stored in a reftable are peeled, a record for an annotated >> +(or signed) tag records both the tag object, and the object it refers >> +to. > > OK. Peeled results are recorded in packed-refs file because quite > often when we use a tag object, what we actually want to access is > the commit object it points at. We do so here for the same reason? > > Not a rhetorical question, but if it invites a question from a > reader, it may deserve to be described before readers ask it. For a single tag ref, peeling to a commit is not very expensive. But for batch lookups e.g. when serving a response to an ls-remote request, it adds up, and having the peeled results recorded helps. [...] >> +Directory/file conflicts >> +^^^^^^^^^^^^^^^^^^^^^^^^ >> + >> +The reftable format accepts both `refs/heads/foo` and >> +`refs/heads/foo/bar` as distinct references. >> + >> +This property is useful for retaining log records in reftable, but may >> +confuse versions of Git using `$GIT_DIR/refs` directory tree to maintain >> +references. Users of reftable may choose to continue to reject `foo` and >> +`foo/bar` type conflicts to prevent problems for peers. [...] > "users ... may choose" implies that it is up to the implementation > of reftable user which one to show, so given a single repository, > "jgit" may show "refs/heads/foo" while "libgit2" may choose to show > the other one. > > I am not sure if that is desirable---I suspect that we want to > record which one needs to be chosen so that these "D/F conflicts > disallowing" users can make consistent choices, but I dunno. Yes, I think it would be better to explicitly say that Git will continue to reject D/F conflicts for refs (*not* reflogs) even though the format can support them in principle. If we choose to permit them some day in the future, I believe that would be a separate repository format extension and protocol capability to avoid confusing old versions of Git. [...] >> +Symbolic references use `0x3`, followed by the complete name of the >> +reference target. No compression is applied to the target name. > > Is there a place in the file format where an incomplete name can be > stored? If not, I think it makes it easier to read if we drop > "complete" from the sentence. The sentence about "no compression" covers the lack of prefix encoding, so I suppose I agree. Might make sense to say "full name" to convey that we're talking about rev-parse --symbolic-full-name, not a relative path like symlinks support. [...] >> +Log block format >> +^^^^^^^^^^^^^^^^ >> + >> +Unlike ref and obj blocks, log blocks are always unaligned. >> + >> +Log blocks are variable in size, and do not match the `block_size` >> +specified in the file header or footer. Writers should choose an >> +appropriate buffer size to prepare a log block for deflation, such as >> +`2 * block_size`. > > I can guess the reason behind this design decision, but the readers > may not be able to. Should we write it down here, or would it make > too much irrelevant details? I don't have a strong opinion. It sounds like Han-Wen sees something to explain there, so I suppose it would be nice to spell out. (My take: reflog lookups are not on the critical path for most operations; especially, random accesses do not need to be fast. From a performance perspective, the best we can do is to compress them well to decrease I/O cost, hence there's not much value to alignment.) [...] > This is a tangent but in a repository at hosting provider, whose > primary (and often the only) source of updates are by end-user > pushing into it, if reflogs are enabled, whose name and email are > recorded in the logs? The committer or tagger of the object that > sits at the tip of the ref after the update? What happens when a > blob is pushed to update a ref? Or would it be just a single "user" > that represents the "server operator"? The latter, "server operator" (GIT_COMMITTER_IDENT at the server). Committer in commit objects is forgeable, hence wouldn't be very useful here. > We know in a non-bare repository an individual contributor works on > typically records only one in the reflog: the user who > works in it. > > What I am trying to get at is if it makes more sense to have a small > table of unique pairs used in the file and have > log_data record a single varint that is the index into that > "committer ident" table. I would suspect that it would give us > significantly more gain than mere <> two bytes per log_data entry. That's true, and a good idea for the next rev of the format. [...] >> +A 68-byte footer appears at the end: >> + >> +.... >> + 'REFT' >> + uint8( version_number = 1 ) >> + uint24( block_size ) >> + uint64( min_update_index ) >> + uint64( max_update_index ) >> + >> + uint64( ref_index_position ) >> + uint64( (obj_position << 5) | obj_id_len ) [...] >> +* `obj_id_len`: number of bytes used to abbreviate object identifiers in >> +obj blocks. > > Should we write "this can be up to 31" somewhere? It is more than > enough for SHA-1 and not quite sufficient for SHA-256 (unless we say > "we store obj_id_len-1 here")? Oh! I'll take a closer look and then follow up. Thanks for looking it over, Jonathan