git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / mirror / code / Atom feed
From: Michael Haggerty <mhagger@alum.mit.edu>
To: Shawn Pearce <spearce@spearce.org>
Cc: git <git@vger.kernel.org>, Jeff King <peff@peff.net>,
	Junio C Hamano <gitster@pobox.com>,
	David Borowitz <dborowitz@google.com>
Subject: Re: reftable [v4]: new ref storage format
Date: Thu, 3 Aug 2017 11:38:27 -0700	[thread overview]
Message-ID: <CAMy9T_G5xorPGp=5=p_ku3RhB1E-c9+4mEgYhbyAhLdc1V=JBg@mail.gmail.com> (raw)
In-Reply-To: <CAJo=hJv=zJvbzfAZwspxECXrnBJR4XfJbGZegsNUCx=6uheO2Q@mail.gmail.com>

On Tue, Aug 1, 2017 at 7:38 PM, Shawn Pearce <spearce@spearce.org> wrote:
> On Tue, Aug 1, 2017 at 6:51 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>> On Tue, Aug 1, 2017 at 4:27 PM, Shawn Pearce <spearce@spearce.org> wrote:
>>> On Mon, Jul 31, 2017 at 11:41 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>>>> [...]
>>> A couple of other notes about your contrasting design:
>>>
>>>>     elif chunk_type == INDEX {
>>>>         chunk_length : varint
>>>
>>> Using a varint for the chunk length made for a complicated reader.
>>> JGit doesn't have the luxury of mmap to access the file, so we have to
>>> allocate a byte[] and read data from a file descriptor to do anything
>>> fancy like decoding a varint. For my experiment I wound up just
>>> hardcoding the IO to read 1k or 4k from whatever address.
>>>
>>> A "real" implementation would likely prefer to read a fixed width
>>> field here such that chunks have a 3 byte header (1 byte chunk_type, 2
>>> byte chunk_length), and then issue a second read to acquire the rest
>>> of the chunk. Given that encoding a chunk length of 1024 or 4096 both
>>> requires 2 bytes of varint, its always going to be 2 bytes in your
>>> design anyway. With the way chunks are scanned, I don't think you want
>>> chunks as large as 16k, which would have caused the varint to go to 3
>>> bytes (but still fits in a fixed 2-byte chunk_length).
>>
>> That's a good point for INDEX and OBJS_INDEX blocks. Though for REFS
>> blocks that include reflogs, the block size has to be large enough to
>> hold the whole reflog for a reference, which can be arbitrarily large.
>> (Maybe this is a weakness of the design?)
>
> Gah. I missed that part about the whole reflog needing to fit in the
> same chunk when I wrote the quoted text above. That is a pretty big
> downside for very busy refs.
>
> Even when you isolate logs into their own file, the reflog for a busy
> ref could be huge. A naive reader would want to "page in" the entire
> chunk_length before parsing. That isn't tenable if the reflog for a
> busy ref was say 50 MiB. It complicates the reader more. My reftable
> proposal deals with this by breaking the log up with its special key
> structure.
>
> Requiring the entire reflog of a single ref to fit into a single chunk
> does seem to have its downsides.

I was assuming that readers would uncompress the data streamily, in
which case I don't think that it would be much of a problem: reflogs
are usually read either in their entirety, or just the most recent few
entries are read, either of which could be done efficiently despite
the whole reflog being in a single zlib-compressed blob. If streamy
reading is thought to be too complicated for readers, it wouldn't be a
big deal to add a `log_type` of `REFLOG_COMPRESSED_SEGMENTED` and put
the entries into multiple, smaller zlib-compressed chunks.

>> OBJS blocks can also be
>> unbounded in size if very many references point at the same object,
>> thought that is perhaps only a theoretical problem.
>
> Gah, I missed that in reftable. The block id pointer list could cause
> a single object id to exceed what fits in a block, and that will cause
> the writer to fail unless its caller sets the block size larger. I
> basically assumed this overflow condition is very unlikely, as its not
> common to have a huge number of refs pointing to the same object.

Given what Peff pointed out, let's just leave this as a varint for OBJS blocks.

>>>>     elif chunk_type == OBJS_INDEX {
>>>>         chunk_length : varint
>>>>
>>>>         # The offset, relative to the start of this chunk, of the
>>>>         # chunk containing the next level of the obj index, for each
>>>>         # of the possible "next" bytes in the SHA-1, or zero if there
>>>>         # are no references with the given next byte.
>>>>         child_offset : varint * 256
>>>
>>> This is space saving and cute, but kind of annoying. If it was fixed
>>> width 32 bit you can address up to 4G away from this chunk's address,
>>> and you can directly jump to the byte of interest. By being varints
>>> you do save a little space, as most files will probably only need 3
>>> byte varints, and the 0s do collapse down to 1 byte, but you have to
>>> linearly walk the list to find any specific byte.
>>
>> The reason I used varint here was mostly because I expect the lowest
>> level of the OBJS_INDEX to be placed close to the OBJS chunks that it
>> refers to; hopefully within a 2-byte varint. Let me consider more
>> carefully whether that is realistic...
>
> Ah, ok. So you were expecting the writer to interleave OBJS and
> OBJS_INDEX chunks, with the OBJS_INDEX appearing every ~256 OBJS
> chunks. And then storing the next level of OBJS_INDEX consecutively.
>
> I think your math is still wrong about the lowest level OBJS_INDEX
> needing only 2-byte varint for child_offset. Given a 1k chunk size,
> you still have to address backwards about 256k, which requires a
> 3-byte varint.

I think that you would want the lowest OBJS_INDEX to be mostly
populated, meaning that the next level nodes in the tree would mostly
only have zero or one entry. The true distribution [1] would be close
to a Poisson distribution with a mean of `λ ≲ 1` (where λ is basically
the filling factor), which looks like

    P(λ, n) = λⁿ exp(-λ) / n!

For `λ = 1`, that looks like

    P(1, 0) = 36.8%
    P(1, 1) = 36.8%
    P(1, 2) = 18.4%
    P(1, 3) = 06.1%
    P(1, 4) = 01.5%
    P(1, 5) = 00.3%
    P(1, 6) = 00.1%

so almost all OBJS chunks would have fewer than five entries. And in
fact the mean number of entries would be `λ`. So the total size of the
OBJS chunks pointed at by a given OBJS_INDEX chunk should be something
like

    256 * λ * sizeof(OBJS chunk) ≲ 1800 bytes.

[1] This is assuming randomness of SHA-1s, which AFAIK is normally a
good approximation. It is of course possible that somebody would try
to grief the system by creating a repository with lots of objects that
have similar SHA-1 prefixes, but I think the worst result would be
that some operations would scale like O(number of references) rather
than O(256), which isn't all that pathological, and the distinct
pattern would be clear evidence of malice that would justify banning
the user.

> Given the uniform distribution of SHA-1, I think you still wind up
> with most of the OBJS_INDEX populated. E.g. in my 866k ref/865k obj
> example the unique object abbreviation is 6 raw bytes. The OBJS chunk
> storing 2 bytes means we have 4 levels of OBJS_INDEX to cover the
> first 4 bytes. That leaves ~844 obj_records in each OBJS chunk. If
> those are 7 bytes/obj_record, the OBJS chunk is ~5.7 KiB.
>
> Forcing the OBJS chunk to stay under say 4 KiB will absolutely cause a
> lot more holes in the OBJS_INDEX levels.

I was thinking more about how to store the objects lookup table. Let's
assume that we combine the lowest-level OBJS_INDEX chunk along with
the OBJS chunks that it points to into a new OBJS_LEAF chunk. I think
the goals are as follows:

* The lowest OBJS_LEAF nodes should have a filling factor of
approximately `λ = 1`.
* It should be possible to adjust the size of the lowest OBJS_LEAF
nodes based on the preferred read size. (This might mean that all of
the OBJS_LEAF nodes referred to by the next-higher node in the tree
fit together with it in a single 64k block.)
* That higher-level nodes also can be packed efficiently in blocks of
roughly the preferred read size.
* That false positives be the exception. If we ask a higher-level
reftable whether it might contain a reference to $sha1, it would be
nice to be able to get a "no" answer without having to read actual
references. I think that is an argument why we *don't* want to store
only the largest unique prefix of the contained SHA-1s. I think we
want to *always* store enough bits of SHA-1 prefix to make the
referred-to SHA-1s represent only a small fraction (say, 1/256th) of
the total prefix space. E.g., if references in a particular repository
point to 2ⁿ distinct SHA-1s, then we'd want the prefix to include
something like n+8 bits.

To achieve these goals, I think we it would be best to make the number
of bits used in each level of the tree be variable rather than
hard-coded to be 8. By changing the radix of different levels of the
tree, I think you could arrange for block sizes that are optimal for
your filesystem at the same time that you ensure that the OBJS_LEAF
nodes have the desired filling factor.

If anybody's interested, I can flesh out these ideas more next week
when I have time.

>> Peff and I discussed off-list whether the lookup-by-SHA-1 feature is
>> so important in the first place. Currently, all references must be
>> scanned for the advertisement anyway,
>
> Not really. You can hide refs and allow-tip-sha1 so clients can fetch
> a ref even if it wasn't in the advertisement. We really want to use
> that wire protocol capability with Gerrit Code Review to hide the
> refs/changes/ namespace from the advertisement, but allow clients to
> fetch any of those refs if they send its current SHA-1 in a want line
> anyway.
>
> So a server could scan only the refs/{heads,tags}/ prefixes for the
> advertisement, and then leverage the lookup-by-SHA1 to verify other
> SHA-1s sent by the client.
>
>> so avoiding a second scan to vet
>> SHA-1s received from the client is at best going to reduce the effort
>> by a constant factor. Do you have numbers showing that this
>> optimization is worth it?
>
> No, but I don't think I need to do much to prove it. My 866k ref
> example advertisement right now is >62 MiB. If we do what I'm
> suggesting in the paragraphs above, the advertisement is ~51 KiB.
>
>> OTOH a mythical protocol v2 might reduce the need to scan the
>> references for advertisement, so maybe this optimization will be more
>> helpful in the future?
>
> Yes, I'm hopeful we can get a v2 protocol built on the work Jonathan
> Tan is doing, and switch the advertisement around to "client speaks
> first", so that we can be smarter on the server about which refs are
> read and sent. That is a long way off, lets say 3-5 years before its
> really common in clients.

My takeaway from the discussion about whether object lookup is an
important feature is that it should very much be optional, and by
extension we should probably leave room for other optional extensions
in the file format.

> [...]
>> That's what I was thinking. But I've yet to hear anybody complain
>> about missing reflogs for symrefs if the underlying reference is
>> updated directly, so maybe we should just leave that "problem"
>> unsolved. It is certainly simpler and less brittle not to have to keep
>> backreferences like these in sync with the forward references.
>
> Yup, that is my take on it, and why I didn't try to put this into
> reftable drafts, even though it was discussed between us on the list
> in earlier messages.

Yes, let's forget the idea of including backreferences from refs to
the symrefs that refer to them.

Michael

  parent reply	other threads:[~2017-08-03 18:38 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-07-31  3:51 Shawn Pearce
2017-07-31 17:41 ` Dave Borowitz
2017-07-31 19:01 ` Stefan Beller
2017-07-31 23:05   ` Shawn Pearce
2017-07-31 19:42 ` Junio C Hamano
2017-07-31 23:43   ` Shawn Pearce
2017-08-01 16:08     ` Shawn Pearce
2017-08-01  6:41 ` Michael Haggerty
2017-08-01 20:23   ` Shawn Pearce
2017-08-02  0:49     ` Michael Haggerty
2017-08-01 23:27   ` Shawn Pearce
2017-08-01 23:54     ` Shawn Pearce
2017-08-02  1:51     ` Michael Haggerty
2017-08-02  2:38       ` Shawn Pearce
2017-08-02  9:28         ` Jeff King
2017-08-02 15:17           ` Shawn Pearce
2017-08-02 16:51             ` Junio C Hamano
2017-08-02 17:28             ` Jeff King
2017-08-02 12:20         ` Dave Borowitz
2017-08-02 17:18           ` Jeff King
2017-08-03 18:38         ` Michael Haggerty [this message]
2017-08-03 22:26           ` Shawn Pearce
2017-08-03 22:48             ` Michael Haggerty
2017-08-04  2:50               ` Shawn Pearce
2017-08-05 21:00       ` Shawn Pearce
2017-08-01 13:54 ` Dave Borowitz
2017-08-01 15:27   ` Shawn Pearce
2017-08-02 19:50 ` Junio C Hamano
2017-08-02 20:28   ` Jeff King
2017-08-03 22:17     ` Shawn Pearce
2017-08-03  1:50   ` Junio C Hamano
2017-08-03  2:21     ` Shawn Pearce
2017-08-03  2:36       ` Junio C Hamano
2017-08-02 19:54 ` Stefan Beller

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAMy9T_G5xorPGp=5=p_ku3RhB1E-c9+4mEgYhbyAhLdc1V=JBg@mail.gmail.com' \
    --to=mhagger@alum.mit.edu \
    --cc=dborowitz@google.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=peff@peff.net \
    --cc=spearce@spearce.org \
    --subject='Re: reftable [v4]: new ref storage format' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Code repositories for project(s) associated with this inbox:

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).