From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on dcvr.yhbt.net X-Spam-Level: X-Spam-ASN: AS31976 209.132.180.0/23 X-Spam-Status: No, score=-3.1 required=3.0 tests=AWL,BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,RCVD_IN_DNSWL_HI, RCVD_IN_SORBS_SPAM,RP_MATCHES_RCVD shortcircuit=no autolearn=no autolearn_force=no version=3.4.0 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by dcvr.yhbt.net (Postfix) with ESMTP id 3DEAA2047F for ; Wed, 2 Aug 2017 02:39:02 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751677AbdHBCjA (ORCPT ); Tue, 1 Aug 2017 22:39:00 -0400 Received: from mail-vk0-f41.google.com ([209.85.213.41]:33926 "EHLO mail-vk0-f41.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751571AbdHBCi7 (ORCPT ); Tue, 1 Aug 2017 22:38:59 -0400 Received: by mail-vk0-f41.google.com with SMTP id n125so13134367vke.1 for ; Tue, 01 Aug 2017 19:38:58 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=spearce.org; s=google; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc; bh=o9oIvPRx9mlCsvYn5qyhmaWvOnrfxwkcct2kQwEAYaY=; b=Zv+0IBwQzQOuNqEEQeSrFD/XFZVXm7uN2HA75I8oD71JCxd3E4a4jtfLI29L9wATEp LJWjDRIqCfCVuIOIm5xbmenid1TFbF28XH6xw1qnzGwaZQxT+rvmKBrLTq5a9q7daa/S OX12hBoG4AzLq1Be+19S3jNlzZmu3xeHuMVqo= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to:cc; bh=o9oIvPRx9mlCsvYn5qyhmaWvOnrfxwkcct2kQwEAYaY=; b=ghLV9n8GOtDGh0xxlAiYKmf+ZhVd2WjqRYvtgakAmStKRfBp57BVkxeQffk2cTyLTb xhGqsdlJCZYrF+rwxBqhMt5K9oG5BpsD60GZkU2Txsp74lYsw651zw3JfqbKldj6Otx8 yyjlkrjfIMcYQ3zJslO4et1bDKCvn+HVUkWKknm6eTFHYseTGFa1kZye5o32HqSQe3Ps NZnqSCl7fLW/ProkLa9iyCUESgFG4Gh5TRmZM3wl0slYdSefKXdfVo8DBO5cLgQrzlbx HYUSDadBtCer/NhqBaU4L4dBcy6wD/hmJywdEKqdh4jxy81J0FB7WmhqEyi8vO4O4VZ1 +T5g== X-Gm-Message-State: AIVw113lkYVM0IrDgsalJzxB2k6z7CmgufcEu4qUptj0bv7ptwsRIFPc 8yATy87TPCcD4TXzgwtTx7tdP2E3h/9K X-Received: by 10.31.195.3 with SMTP id t3mr14461227vkf.98.1501641538245; Tue, 01 Aug 2017 19:38:58 -0700 (PDT) MIME-Version: 1.0 Received: by 10.103.83.7 with HTTP; Tue, 1 Aug 2017 19:38:37 -0700 (PDT) In-Reply-To: References: From: Shawn Pearce Date: Tue, 1 Aug 2017 19:38:37 -0700 Message-ID: Subject: Re: reftable [v4]: new ref storage format To: Michael Haggerty Cc: git , Jeff King , Junio C Hamano , David Borowitz Content-Type: text/plain; charset="UTF-8" Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org On Tue, Aug 1, 2017 at 6:51 PM, Michael Haggerty wrote: > On Tue, Aug 1, 2017 at 4:27 PM, Shawn Pearce wrote: >> On Mon, Jul 31, 2017 at 11:41 PM, Michael Haggerty wrote: >>> On Sun, Jul 30, 2017 at 8:51 PM, Shawn Pearce wrote: >>>> 4th iteration of the reftable storage format. >>>> [...] >>> >>> Before we commit to Shawn's reftable proposal, I wanted to explore >>> what a contrasting design that is not block based would look like. >> >> I forgot to look at a 1k chunk size, as you suggested that might also >> be suitable. Here is the more complete experiment table: >> >> | size | seek_cold | seek_hot | >> mh 1k | 29.4 M | 20.6 usec | 10.7 usec | <== Row fixed >> mh 4k | 28.3 M | 24.5 usec | 14.5 usec | >> sp 4k | 29.2 M | 63.0 usec | 5.8 usec | >> sp 64k | 27.7 M | 35.6 usec | 23.3 usec | > > At least in this case, the 1k block size seems like a good tradeoff. > Did 1k vs 4k change the number of index levels required? Yes. 1k needed 4 levels of index; 4k needed 2 levels. More chunks, with smaller index chunks, forces a deeper index depth. >> A couple of other notes about your contrasting design: >> >>> elif chunk_type == INDEX { >>> chunk_length : varint >> >> Using a varint for the chunk length made for a complicated reader. >> JGit doesn't have the luxury of mmap to access the file, so we have to >> allocate a byte[] and read data from a file descriptor to do anything >> fancy like decoding a varint. For my experiment I wound up just >> hardcoding the IO to read 1k or 4k from whatever address. >> >> A "real" implementation would likely prefer to read a fixed width >> field here such that chunks have a 3 byte header (1 byte chunk_type, 2 >> byte chunk_length), and then issue a second read to acquire the rest >> of the chunk. Given that encoding a chunk length of 1024 or 4096 both >> requires 2 bytes of varint, its always going to be 2 bytes in your >> design anyway. With the way chunks are scanned, I don't think you want >> chunks as large as 16k, which would have caused the varint to go to 3 >> bytes (but still fits in a fixed 2-byte chunk_length). > > That's a good point for INDEX and OBJS_INDEX blocks. Though for REFS > blocks that include reflogs, the block size has to be large enough to > hold the whole reflog for a reference, which can be arbitrarily large. > (Maybe this is a weakness of the design?) Gah. I missed that part about the whole reflog needing to fit in the same chunk when I wrote the quoted text above. That is a pretty big downside for very busy refs. Even when you isolate logs into their own file, the reflog for a busy ref could be huge. A naive reader would want to "page in" the entire chunk_length before parsing. That isn't tenable if the reflog for a busy ref was say 50 MiB. It complicates the reader more. My reftable proposal deals with this by breaking the log up with its special key structure. Requiring the entire reflog of a single ref to fit into a single chunk does seem to have its downsides. > OBJS blocks can also be > unbounded in size if very many references point at the same object, > thought that is perhaps only a theoretical problem. Gah, I missed that in reftable. The block id pointer list could cause a single object id to exceed what fits in a block, and that will cause the writer to fail unless its caller sets the block size larger. I basically assumed this overflow condition is very unlikely, as its not common to have a huge number of refs pointing to the same object. >>> elif chunk_type == OBJS_INDEX { >>> chunk_length : varint >>> >>> # The offset, relative to the start of this chunk, of the >>> # chunk containing the next level of the obj index, for each >>> # of the possible "next" bytes in the SHA-1, or zero if there >>> # are no references with the given next byte. >>> child_offset : varint * 256 >> >> This is space saving and cute, but kind of annoying. If it was fixed >> width 32 bit you can address up to 4G away from this chunk's address, >> and you can directly jump to the byte of interest. By being varints >> you do save a little space, as most files will probably only need 3 >> byte varints, and the 0s do collapse down to 1 byte, but you have to >> linearly walk the list to find any specific byte. > > The reason I used varint here was mostly because I expect the lowest > level of the OBJS_INDEX to be placed close to the OBJS chunks that it > refers to; hopefully within a 2-byte varint. Let me consider more > carefully whether that is realistic... Ah, ok. So you were expecting the writer to interleave OBJS and OBJS_INDEX chunks, with the OBJS_INDEX appearing every ~256 OBJS chunks. And then storing the next level of OBJS_INDEX consecutively. I think your math is still wrong about the lowest level OBJS_INDEX needing only 2-byte varint for child_offset. Given a 1k chunk size, you still have to address backwards about 256k, which requires a 3-byte varint. Given the uniform distribution of SHA-1, I think you still wind up with most of the OBJS_INDEX populated. E.g. in my 866k ref/865k obj example the unique object abbreviation is 6 raw bytes. The OBJS chunk storing 2 bytes means we have 4 levels of OBJS_INDEX to cover the first 4 bytes. That leaves ~844 obj_records in each OBJS chunk. If those are 7 bytes/obj_record, the OBJS chunk is ~5.7 KiB. Forcing the OBJS chunk to stay under say 4 KiB will absolutely cause a lot more holes in the OBJS_INDEX levels. > Peff and I discussed off-list whether the lookup-by-SHA-1 feature is > so important in the first place. Currently, all references must be > scanned for the advertisement anyway, Not really. You can hide refs and allow-tip-sha1 so clients can fetch a ref even if it wasn't in the advertisement. We really want to use that wire protocol capability with Gerrit Code Review to hide the refs/changes/ namespace from the advertisement, but allow clients to fetch any of those refs if they send its current SHA-1 in a want line anyway. So a server could scan only the refs/{heads,tags}/ prefixes for the advertisement, and then leverage the lookup-by-SHA1 to verify other SHA-1s sent by the client. > so avoiding a second scan to vet > SHA-1s received from the client is at best going to reduce the effort > by a constant factor. Do you have numbers showing that this > optimization is worth it? No, but I don't think I need to do much to prove it. My 866k ref example advertisement right now is >62 MiB. If we do what I'm suggesting in the paragraphs above, the advertisement is ~51 KiB. > OTOH a mythical protocol v2 might reduce the need to scan the > references for advertisement, so maybe this optimization will be more > helpful in the future? Yes, I'm hopeful we can get a v2 protocol built on the work Jonathan Tan is doing, and switch the advertisement around to "client speaks first", so that we can be smarter on the server about which refs are read and sent. That is a long way off, lets say 3-5 years before its really common in clients. >>> ref_payload = { >>> value_type : enum NO_VALUE >>> | DELETED >>> | VALUE | VALUE_PEELED >>> | SYMREF | SYMREF_PEELED >>> | SPECIAL >>> log_type : enum NO_REFLOG | REFLOG | REFLOG_COMPRESSED >>> symref_target : bool >> >> FWIW I didn't implement log_type or symref_target in my experiment, so >> the size per ref was maybe a few bytes smaller than what you outlined >> here. > > But value_type, log_type, and symref_target should all fit within a > single byte, no? Yes. So my experiment was maybe ~866 KiB off in total file size. :) >>> # This field is used to keep backwards links from references to >>> # any symrefs that point at them, to make it tractable to update >>> # the reflog of the symref if the reference is changed directly: >>> if symref_target { >>> referer : varstr >>> varint(0) >>> } >> >> I wonder how desirable this feature is. Most updates are done through >> HEAD, which is a symref and can therefore update both HEAD and the >> target's reflogs in the same operation. It seems to me its rare to >> issue an update directly on the ref that HEAD points at. Its even >> rarer to have a non-HEAD symbolic reference whose reflog you expect to >> track something else. >> >> Is this for refs/remotes/origin/HEAD to be a symref and have its >> reflog mirror the fetch operations that touched the underlying ref? > > What I was thinking is that we don't update symrefs' reflogs correctly > if the pointed-to reference is updated (except for HEAD) and it would > be nice to fix that problem. Your case of `refs/remotes/origin/HEAD` > is an example. Or on GitHub's servers, if we wanted to store all of > the forks of a repo in a single repository, we might want to use a > namespace for each fork, like `refs/forks/NNNN/*`, and store the > fork's default branch in `refs/forks/NNNN/HEAD`. The current refs code > wouldn't know to update such a symref's reflog (though of course it > could be special-cased in like HEAD is now). Servers today don't update HEAD reflog when a branch is pushed. I think trying to record that is overkill, you have the reflog data in the ref itself that the client sent the command to modify. > That's what I was thinking. But I've yet to hear anybody complain > about missing reflogs for symrefs if the underlying reference is > updated directly, so maybe we should just leave that "problem" > unsolved. It is certainly simpler and less brittle not to have to keep > backreferences like these in sync with the forward references. Yup, that is my take on it, and why I didn't try to put this into reftable drafts, even though it was discussed between us on the list in earlier messages.