From mboxrd@z Thu Jan 1 00:00:00 1970 From: Shawn Pearce Subject: Re: [PATCH 09/16] documentation: add documentation for the bitmap format Date: Mon, 24 Jun 2013 23:42:49 -0600 Message-ID: References: <1372116193-32762-1-git-send-email-tanoku@gmail.com> <1372116193-32762-10-git-send-email-tanoku@gmail.com> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Cc: git To: Vicent Marti , Colby Ranger X-From: git-owner@vger.kernel.org Tue Jun 25 07:43:29 2013 Return-path: Envelope-to: gcvg-git-2@plane.gmane.org Received: from vger.kernel.org ([209.132.180.67]) by plane.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1UrM2K-0000xY-Ms for gcvg-git-2@plane.gmane.org; Tue, 25 Jun 2013 07:43:29 +0200 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751254Ab3FYFnM (ORCPT ); Tue, 25 Jun 2013 01:43:12 -0400 Received: from mail-ie0-f169.google.com ([209.85.223.169]:46183 "EHLO mail-ie0-f169.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751054Ab3FYFnK (ORCPT ); Tue, 25 Jun 2013 01:43:10 -0400 Received: by mail-ie0-f169.google.com with SMTP id 10so26919062ied.14 for ; Mon, 24 Jun 2013 22:43:10 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=spearce.org; s=google; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc:content-type; bh=hi9yguPNkTTbCsFRr+4uBcuM0qbHSCeuJZvOp8W6tI4=; b=YadzyTYse28HsPxZn4/pXNWv/JdBf5ULJB0IJcwY+CQJHeGidKivNroM8ERnQYir3U JgRJAhb7BR/1dpm90Q6Lb4BRfQT0pnclPKtwfoffTa621M/LshymJ2kjF0A9AqXv46Jk 266TK69v92Gqmh56ZkUy1x7ZZAcx319SIpd2M= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc:content-type:x-gm-message-state; bh=hi9yguPNkTTbCsFRr+4uBcuM0qbHSCeuJZvOp8W6tI4=; b=VRVKCFFihnwprErwJRt4kXtMNOeuyaOCyP1qJBcdZqUWQpUEtTEHIilZs85PKO91+D x9+A599UFuxxlZMKbLYPrIDFdpvcqYNt9wI0bfXzlnAq6Y+bq9aYDyFzNpWrsGLrICn2 CfFf6wn9uQdvtqZb22HeZ4GB4SjX0ao9oZDmsJL/m3xJXTaBenGQoDPEoVoBnyDDU5Xm 3q6aR7A48nASY9LrcC4Qt41Z/oiRc/dv98zKxzBuQcZ9Pb04ubSsohkdkOyWGxckwP/K Z99mH+CoDc/M4FIfYuiHTvT7q23c2xn4Yf08rXXsA2ZyWldJgv57rLn+BdTAbT38Ai/z hoDw== X-Received: by 10.42.98.69 with SMTP id r5mr2235523icn.95.1372138989977; Mon, 24 Jun 2013 22:43:09 -0700 (PDT) Received: by 10.64.82.38 with HTTP; Mon, 24 Jun 2013 22:42:49 -0700 (PDT) In-Reply-To: <1372116193-32762-10-git-send-email-tanoku@gmail.com> X-Gm-Message-State: ALoCoQkGLJi9ZCEz3rI56V98c7VE01rQoBlyhiDqApznT9bwEp7+spRdL69jFfQgfcuUwLwPeXsC Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Archived-At: On Mon, Jun 24, 2013 at 5:23 PM, Vicent Marti wrote: > This is the technical documentation and design rationale for the new > Bitmap v2 on-disk format. > --- > Documentation/technical/bitmap-format.txt | 235 +++++++++++++++++++++++++++++ > 1 file changed, 235 insertions(+) > create mode 100644 Documentation/technical/bitmap-format.txt > > diff --git a/Documentation/technical/bitmap-format.txt b/Documentation/technical/bitmap-format.txt > new file mode 100644 > index 0000000..5400082 > --- /dev/null > +++ b/Documentation/technical/bitmap-format.txt > @@ -0,0 +1,235 @@ > +GIT bitmap v2 format & rationale > +================================ > + > + - A header appears at the beginning, using the same format > + as JGit's original bitmap indexes. > + > + 4-byte signature: {'B', 'I', 'T', 'M'} > + > + 2-byte version number (network byte order) > + The current implementation only supports version 2 > + of the bitmap index. The rationale for this is explained > + in this document. > + > + 2-byte flags (network byte order) > + > + The folowing flags are supported: > + > + - BITMAP_OPT_FULL_DAG (0x1) REQUIRED > + This flag must always be present. It implies that the bitmap > + index has been generated for a packfile with full closure > + (i.e. where every single object in the packfile can find > + its parent links inside the same packfile). This is a > + requirement for the bitmap index format, also present in JGit, > + that greatly reduces the complexity of the implementation. > + > + - BITMAP_OPT_LE_BITMAPS (0x2) > + If present, this implies that that the EWAH bitmaps in this > + index has been serialized to disk in little-endian byte order. > + Note that this only applies to the actual bitmaps, not to the > + Git data structures in the index, which are always in Network > + Byte order as it's costumary. > + > + - BITMAP_OPT_BE_BITMAPS (0x4) > + If present, this implies that the EWAH bitmaps have been serialized > + using big-endian byte order (NWO). If the flag is missing, **the > + default is to assume that the bitmaps are in big-endian**. I very much hate seeing a file format that is supposed to be portable that supports both big-endian and little-endian encoding. Such a specification forces everyone to implement two code paths to handle reading data from the file, on the off-chance they are on the wrong platform. Or it forces one platform to be unable to use the file. In which case a repository may want to build two files, one for each platform, but you blocked that off by allowing only one file. What is wrong with picking one encoding and sticking to it? The .idx file and the dircache use big-endian format. Why not just use big-endian here too and convert words on demand as they are accessed from the mmap region? That is what the .idx format does when accessing an offset. > + - BITMAP_OPT_HASH_CACHE (0x8) > + If present, a hash cache for finding delta bases will be available > + right after the header block in this index. See the following > + section for details. > + > + 4-byte entry count (network byte order) > + > + The total count of entries (bitmapped commits) in this bitmap index. > + > + 20-byte checksum > + > + The SHA1 checksum of the pack this bitmap index belongs to. > + > + - An OPTIONAL delta cache follows the header. Some may find the name "delta cache" confusing as it does not cache deltas of objects. May I suggest "path hash cache" as an alternative name? > + The cache is formed by `n` 4-byte hashes in a row, where `n` is > + the amount of objects in the indexed packfile. Note that this amount > + is the **total number of objects** and is not related to the > + number of commits that have been selected and indexed in the > + bitmap index. > + > + The hashes are stored in Network Byte Order and they are the same > + values generated by a normal revision walk during the `pack-objects` > + phase. I find it interesting this is network byte order and not big-endian or little-endian based on the flag in the header. > + The `n`nth hash in the cache is the name hash for the `n`th object > + in the index for the indexed packfile. > + > + [RATIONALE]: > + > + The bitmap index allows us to skip the Counting Objects phase > + during `pack-objects` and yield all the OIDs that would be reachable > + ("WANTS") when generating the pack. > + > + This optimization, however, means that we're adding objects to the > + packfile straight from the packfile index, and hence we are lacking > + path information for the objects that would normally be generated > + during the "Counting Objects" phase. > + > + This path information for each object is hashed and used as a very > + effective way to find good delta bases when compressing the packfile; > + without these hashes, the resulting packfiles are much less optimal. > + > + By storing all the hashes in a cache together with the bitmapsin > + the bitmap index, we can yield not only the SHA1 of all the reachable > + objects, but also their hashes, and allow Git to be much smarter when > + finding delta bases for packing. > + > + If the delta cache is not available, the bitmap index will obviously > + be smaller in disk, but the packfiles generated using this index will > + be between 20% and 30% bigger, because of the lack of name/path > + information when finding delta bases. JGit does not encode this because we were afraid of freezing the hash function into the file format. Indeed we are not certain JGit even uses the same path hash function as C Git does, because C Git's implementation is covered by the GPL and JGit prefers to license its work under BSD. If the path hash is going to become part of the format, the algorithm for computing the hash should also be specified in the format so that non-GPL implementations have an opportunity to be compatible. One way we side-stepped the size inflation problem in JGit was to only use the bitmap index information when sending data on the wire to a client. Here delta reuse plays a significant factor in building the pack, and we don't have to be as accurate on matching deltas. During the equivalent of `git repack` bitmaps are not used, allowing the traditional graph enumeration algorithm to generate path hash information. > + - 4 EWAH bitmaps that act as type indexes > + > + Type indexes are serialized after the hash cache in the shape > + of four EWAH bitmaps stored consecutively (see Appendix A for > + the serialization format of an EWAH bitmap). > + > + There is a bitmap for each Git object type, stored in the following > + order: > + > + - Commits > + - Trees > + - Blobs > + - Tags > + > + In each bitmap, the `n`th bit is set to true if the `n`th object > + in the packfile index is of that type. > + > + The obvious consequence is that the XOR of all 4 bitmaps will result > + in a full set (all bits sets), and the AND of all 4 bitmaps will > + result in an empty bitmap (no bits set). Instead of XOR did you mean OR here? > + - N EWAH bitmaps, one for each indexed commit > + > + Where `N` is the total amount of entries in this bitmap index. > + See Appendix A for the serialization format of an EWAH bitmap. > + > + - An entry index with `N` entries for the indexed commits > + > + Index entries are stored consecutively, and each entry has the > + following format: > + > + - 20-byte SHA1 > + The SHA1 of the commit that this bitmap indexes > + > + - 4-byte offset (Network Byte Order) > + The offset **from the beginning of the file** where the > + bitmap for this commit is stored. Eh, another network byte order field in a file that also has selective ordering. *sigh* > + - 1-byte XOR-offset > + The xor offset used to compress this bitmap. For an entry > + in position `x`, a XOR offset of `y` means that the actual > + bitmap representing for this commit is composed by XORing the > + bitmap for this entry with the bitmap in entry `x-y` (i.e. > + the bitmap `y` entries before this one). > + > + Note that this compression can be recursive. In order to > + XOR this entry with a previous one, the previous entry needs > + to be decompressed first, and so on. > + > + The hard-limit for this offset is 160 (an entry can only be > + xor'ed against one of the 160 entries preceding it). This > + number is always positivea, and hence entries are always xor'ed > + with **previous** bitmaps, not bitmaps that will come afterwards > + in the index. What order are these entries in? Sorted by SHA-1 or random? Colby found that doing an XOR against the descendant commit yielded very small bitmaps, so JGit tries to XOR-compress bitmaps along common linear slices of history. This is trivial in Linus' kernel tree where there is effectively only one history, but its more relevant with long-running side branches that have release tags that may not have fully merged into "master". > + - 1-byte flags for this bitmap > + At the moment the only available flag is `0x1`, which hints > + that this bitmap can be re-used when rebuilding bitmap indexes > + for the repository. > + > + - 2 bytes of RESERVED data (used right now for better packing). > + > +== Rationale for changes from the Bitmap Format v1 > + > +- Serialized EWAH bitmaps can be stored in Little-Endian byte order, > + if defined by the BITMAP_OPT_LE_BITMAPS flag in the header. > + > + The original JGit implementation stored bitmaps in Big-Endian byte > + order (NWO) because it was unable to `mmap` the serialized format, > + and hence always required a full parse of the bitmap index to memory, > + where the BE->LE conversion could be performed. You can mmap a file and convert each word on access if your machine is not using the same byte order. It is not necessary to convert the entire file before using it through a mmap region. > + This full parse, however, requires prohibitive loading times in LE > + machines (i.e. all modern server hardware): a repository like > + `torvalds/linux` can have about 8mb of bitmap indexes, resulting > + in roughly 400ms of parse time. This makes me wonder what the JGit parse time is. It is ugly if we are spending 400ms to load the bitmap index for the kernel repository. > + This is not an issue in JGit, which is capable of serving repositories > + from a single-process daemon running on the JVM, but `git-daemon` in > + git has been implemented with a process-based design (a new > + `pack-objects` is spawned for each request), and the boot times > + of parsing the bitmap index every time `pack-objects` is spawned can > + seriously slow down requests (particularly for small fetches, where we'd > + spend about 1.5s booting up and 300ms performing the Counting Objects > + phase). There are other strategies that Git could use to handle request processing at scale. But I guess its reasonable to assume these aren't viable for Git for a number of reasons. E.g. "long tail" access effect that many servers have, where most requests are to a large number of repositories that themselves receive very few requests, an environment that does not lend itself to caching. > + By storing the bitmaps in Little-Endian, we're able to `mmap` their > + compressed data straight in memory without parsing it beforehand, and > + since most queries don't require accessing all the serialized bitmaps, > + we'll only page in the minimal amount of bitmaps necessary to perform > + the reachability analysis as they are accessed. FWIW the .idx and .pack file formats `mmap` the compressed data straight into memory without parsing it beforehand, and do not use little-endian byte order. It is possible to have a single compressed file format definition that is portable to all architectures, and is accessed by mmap, at scale, with reasonable efficiency. > +- An index of all the bitmapped commits is written at the end of the packfile, > + instead of interpersed with the serialized bitmaps in the middle of the > + file. This is probably a mistake in the JGit design. Your approach is slightly more complex, but in general I agree with having a table of the SHA-1s isolated from the bitmaps themselves so that a reader can access specific bitmaps at random without needing to wade through all compressed bitmaps. I would have proposed putting the table at the start of the file, not the end. The writer making the file can completely serialize the bitmaps into memory before writing them to disk, and thus knows the full layout of the resulting file. If the bitmaps don't fit in RAM at writing time, game over, the optimization of having a very compact representation of the graph is no longer helping you. > + Again, the old design implied a full parse of the whole bitmap index > + (which JGit can afford because its daemon is single-process), but it made > + impossible `mmaping` the bitmap index file and accessing only the parts > + required to actually solve the query. > + > + With an index at the end of the file, we can load only this index in memory, > + allowing for very efficient access to all the available bitmaps lazily (we > + have their offsets in the mmaped file). > + > +- The ordering of the objects in each bitmap has changed from > + packfile-order (the nth bit in the bitmap is the nth object in the > + packfile) to index-order (the nth bit in the bitmap is the nth object > + in the INDEX of the packfile). Did you notice an increase in bitmap size when you did this? Colby tested both orderings and we observed the bitmaps were quantifiably smaller when using the pack file ordering, due to the pack file locality rules and the EWAH compression. Using the pack file ordering was a very conscious design decision. > + There is not a noticeable performance difference when actually converting > + from bitmap position to SHA1 and from SHA1 to bitmap position, but when > + using packfile ordering like JGit does, queries need to go through the > + reverse index (pack-revindex.c). > + > + Generating this reverse index at runtime is **not** free (around 900ms > + generation time for a repository like `torvalds/linux`), and once again, > + this generation time needs to happen every time `pack-objects` is > + spawned. Did you know the packer needs the reverse index in order to compute the end offset of an object it will copy as-is during delta reuse? How have you avoided making the reverse index? Again this is why we chose to pin the JGit bitmap on the reverse index being present. It already had to be present to support as-is reuse. Once we knew we had to have that reverse index it was OK to rely on it to get better compression on the bitmaps, and thus make them take up less memory when loaded into a server. Even if you mmap a file you want it to be small so it is more likely to retain in the kernel buffer cache across process invocations.