From: "SZEDER Gábor" <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>,
Garima Singh <garima.singh@microsoft.com>,
Derrick Stolee <stolee@gmail.com>,
Jakub Narebski <jnareb@gmail.com>, Jeff King <peff@peff.net>,
Taylor Blau <me@ttaylorr.com>
Subject: Re: [PATCH 15/34] commit-graph-format.txt: document the modified path Bloom filter chunks
Date: Tue, 2 Jun 2020 19:59:49 +0200 [thread overview]
Message-ID: <20200602175949.GA2898@szeder.dev> (raw)
In-Reply-To: <20200529085038.26008-16-szeder.dev@gmail.com>
On Fri, May 29, 2020 at 10:50:19AM +0200, SZEDER Gábor wrote:
> Modified Path Bloom Filters
> Each modified path Bloom filter consists of:
>
> - 4 bytes specifying the number of bits in the Bloom filter's bit
> array.
>
> For practical purposes 32 bit is more than sufficient to store the
> number of bits in the Bloom filter's array. When using k = 7
> hashes, i.e. 10 bits per path, then we could store over 400
> million paths in a single Bloom filter; with k = 32 hashes we'd
> use 46 bit per path, which is still over 93 million paths.
> Alternatives considered
> -----------------------
>
> Here are some alternatives that I've considered but discarded and
> ideas that I haven't (yet) followed through:
> - Varints. Using 4 bytes for the size of each Bloom filter in the
> Modified Path Bloom Filters chunk is a lot more than necessary.
> How much space can be saved by using varints?
Not that much, at least compared to the whole commit-graph file.
Since 63 bit modified path Bloom filters are embedded in the index
entries, it's pointless to store smaller Bloom filters, so the size of
non-embedded Bloom filters can be defined as
nr_bits = 64 + decode_varint(...)
A one byte varint can encode values up to 127, while a two bytes
varint can encode values up to 16511. So the max nr_bits is 191 and
16575, respectively, which means max 19 or 1657 paths with 7 hashes,
i.e. 10 bits per path. The table below shows the percentage of
commits with embedded modified path Bloom filters and commits with
non-embedded Bloom filters with one byte and two bytes varints, and
how much space is saved.
Percentage of commits |
modifying <= N paths | varint commit-graph
compared to first parent | size diff file size
<=6 <=19 <=1657 | (bytes) (ls -lh)
------------------------------------------+-------------------------------
elasticsearch 18.32% 65.56% 99.79% | -90158 9.5M -0.9%
jdk 26.62% 70.46% 97.94% | -90262 16M -0.6%
webkit 38.42% 82.24% 99.92% | -298824 19M -1.6%
android-base 42.32% 88.02% 99.98% | -132327 30M -0.5%
llvm-project 53.60% 93.86% 99.99% | -344239 24M -1.4%
gecko-dev 54.54% 87.15% 99.87% | -625029 63M -1.0%
tensorflow 55.17% 90.76% 99.52% | -83758 9.0M -0.9%
rails 58.71% 95.13% 99.99% | -53153 6.0M -0.9%
rust 59.29% 88.03% 99.96% | -90677 8.2M -1.1%
glibc 61.59% 91.11% 99.96% | -38422 3.8M -1.0%
gcc 63.80% 95.39% 99.97% | -179463 18M -1.0%
go 65.31% 95.19% 99.99% | -39109 3.2M -1.2%
cmssw 67.58% 93.02% 99.91% | -154440 23M -0.7%
linux 72.79% 95.27% 99.78% | -527045 78M -0.7%
cpython 81.91% 97.78% 100.00% | -40678 7.8M -0.5%
git 90.28% 98.56% 100.00% | -13406 3.9M -0.4%
homebrew-cask 98.61% 99.58% 99.99% | -2992 6.6M -0.1%
homebrew-core 99.81% 99.93% 100.00% | -810 11M -0.1%
> How much is the runtime overhead of decoding those varints?
Not enough to be above noise level. ("a lot of paths",
MurmurHash3, k = 7, enhanced double hashing, 32 bit uint arithmetic)
| Average time spent
Average runtime | loading and querying
| Bloom filters
uint32 varint | uint32 varint
-------------------------------------+-----------------------------
android-base 0.1456s 0.144172s | 0.01550s 0.01571s 1.3%
cmssw 0.0313s 0.032046s | 0.00383s 0.00395s 3.1%
cpython 0.0810s 0.081649s | 0.00765s 0.00785s 2.6%
elasticsearch 0.0136s 0.013899s | 0.00246s 0.00258s 4.8%
gcc 0.2114s 0.211080s | 0.02259s 0.02261s 0%
gecko-dev 0.4815s 0.474903s | 0.06004s 0.06028s 0.3%
git 0.0310s 0.031156s | 0.00192s 0.00195s 1.5%
glibc 0.0282s 0.029032s | 0.00320s 0.00342s 6.8%
go 0.0403s 0.039408s | 0.00428s 0.00414s -3.3%
jdk 0.0068s 0.006666s | 0.00117s 0.00113s -3.5%
linux 0.0873s 0.087438s | 0.01109s 0.01133s 2.1%
llvm-project 0.4135s 0.418390s | 0.04716s 0.04834s 2.5%
rails 0.0391s 0.038997s | 0.00449s 0.00448s -0.3%
rust 0.0439s 0.044569s | 0.00444s 0.00461s 3.8%
tensorflow 0.0487s 0.049166s | 0.00618s 0.00634s 2.5%
webkit 0.2420s 0.241807s | 0.03353s 0.03379s 0.7%
> How can we ensure that varint decoding doesn't read past the end
> of the mmapped memory region?
With a decode_varint() variant that takes the max number of bytes to
read as an additional parameter, and returns 0 if the varint is too
long.
next prev parent reply other threads:[~2020-06-02 18:00 UTC|newest]
Thread overview: 43+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-05-29 8:50 [PoC PATCH 00/34] An alternative modified path Bloom filters implementation SZEDER Gábor
2020-05-28 22:09 ` Johannes Schindelin
2020-05-29 8:50 ` [PATCH 01/34] tree-walk.c: don't match submodule entries for 'submod/anything' SZEDER Gábor
2020-05-29 8:50 ` [PATCH 02/34] commit-graph: fix parsing the Chunk Lookup table SZEDER Gábor
2020-05-29 8:50 ` [PATCH 03/34] commit-graph-format.txt: all multi-byte numbers are in network byte order SZEDER Gábor
2020-05-29 8:50 ` [PATCH 04/34] commit-slab: add a function to deep free entries on the slab SZEDER Gábor
2020-06-04 16:43 ` Derrick Stolee
2020-06-27 15:56 ` SZEDER Gábor
2020-06-29 11:30 ` Derrick Stolee
2020-05-29 8:50 ` [PATCH 05/34] diff.h: drop diff_tree_oid() & friends' return value SZEDER Gábor
2020-05-29 8:50 ` [PATCH 06/34] commit-graph: clean up #includes SZEDER Gábor
2020-05-29 8:50 ` [PATCH 07/34] commit-graph: simplify parse_commit_graph() #1 SZEDER Gábor
2020-05-29 8:50 ` [PATCH 08/34] commit-graph: simplify parse_commit_graph() #2 SZEDER Gábor
2020-05-29 8:50 ` [PATCH 09/34] commit-graph: simplify write_commit_graph_file() #1 SZEDER Gábor
2020-05-29 8:50 ` [PATCH 10/34] commit-graph: simplify write_commit_graph_file() #2 SZEDER Gábor
2020-05-29 8:50 ` [PATCH 11/34] commit-graph: allocate the 'struct chunk_info' array dinamically SZEDER Gábor
2020-05-29 8:50 ` [PATCH 12/34] commit-graph: unify the signatures of all write_graph_chunk_*() functions SZEDER Gábor
2020-05-29 8:50 ` [PATCH 13/34] commit-graph: simplify write_commit_graph_file() #3 SZEDER Gábor
2020-05-29 8:50 ` [PATCH 14/34] commit-graph: check chunk sizes after writing SZEDER Gábor
2020-05-29 8:50 ` [PATCH 15/34] commit-graph-format.txt: document the modified path Bloom filter chunks SZEDER Gábor
2020-06-02 17:59 ` SZEDER Gábor [this message]
2020-05-29 8:50 ` [PATCH 16/34] Add a generic and minimal Bloom filter implementation SZEDER Gábor
2020-05-29 8:50 ` [PATCH 17/34] Import a streaming-capable Murmur3 hash function implementation SZEDER Gábor
2020-05-29 8:50 ` [PATCH 18/34] commit-graph: write "empty" Modified Path Bloom Filter Index chunk SZEDER Gábor
2020-05-29 8:50 ` [PATCH 19/34] commit-graph: add commit slab for modified path Bloom filters SZEDER Gábor
2020-05-29 8:50 ` [PATCH 20/34] commit-graph: fill the Modified Path Bloom Filter Index chunk SZEDER Gábor
2020-05-29 8:50 ` [PATCH 21/34] commit-graph: load and use " SZEDER Gábor
2020-05-29 8:50 ` [PATCH 22/34] commit-graph: write the Modified Path Bloom Filters chunk SZEDER Gábor
2020-05-29 8:50 ` [PATCH 23/34] commit-graph: load and use " SZEDER Gábor
2020-05-29 8:50 ` [PATCH 24/34] commit-graph: check all leading directories in modified path Bloom filters SZEDER Gábor
2020-05-29 8:50 ` [PATCH 25/34] commit-graph: check embedded modified path Bloom filters with a mask SZEDER Gábor
2020-05-29 8:50 ` [PATCH 26/34] commit-graph: deduplicate modified path Bloom filters SZEDER Gábor
2020-05-29 8:50 ` [PATCH 27/34] commit-graph: load modified path Bloom filters for merge commits SZEDER Gábor
2020-05-29 8:50 ` [PATCH 28/34] commit-graph: write Modified Path Bloom Filter Merge Index chunk SZEDER Gábor
2020-05-29 8:50 ` [PATCH 29/34] commit-graph: extract init and free write_commit_graph_context SZEDER Gábor
2020-05-29 8:50 ` [PATCH 30/34] commit-graph: move write_commit_graph_reachable below write_commit_graph SZEDER Gábor
2020-05-29 8:50 ` [PATCH 31/34] t7007-show: make the first test compatible with the next patch SZEDER Gábor
2020-05-29 8:50 ` [PATCH 32/34] PoC commit-graph: use revision walk machinery for '--reachable' SZEDER Gábor
2020-05-29 8:50 ` [PATCH 33/34] commit-graph: write modified path Bloom filters in "history order" SZEDER Gábor
2020-05-29 8:50 ` [PATCH 34/34] commit-graph: use modified path Bloom filters with wildcards, if possible SZEDER Gábor
2020-05-29 13:59 ` [PoC PATCH 00/34] An alternative modified path Bloom filters implementation Derrick Stolee
2020-06-01 23:25 ` Taylor Blau
2020-06-02 17:08 ` Junio C Hamano
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: http://vger.kernel.org/majordomo-info.html
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20200602175949.GA2898@szeder.dev \
--to=szeder.dev@gmail.com \
--cc=garima.singh@microsoft.com \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=jnareb@gmail.com \
--cc=me@ttaylorr.com \
--cc=peff@peff.net \
--cc=stolee@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).