git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / mirror / Atom feed
From: "SZEDER Gábor" <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: "Junio C Hamano" <gitster@pobox.com>,
	"Garima Singh" <garima.singh@microsoft.com>,
	"Derrick Stolee" <stolee@gmail.com>,
	"Jakub Narebski" <jnareb@gmail.com>, "Jeff King" <peff@peff.net>,
	"Taylor Blau" <me@ttaylorr.com>,
	"SZEDER Gábor" <szeder.dev@gmail.com>
Subject: [PATCH 33/34] commit-graph: write modified path Bloom filters in "history order"
Date: Fri, 29 May 2020 10:50:37 +0200
Message-ID: <20200529085038.26008-34-szeder.dev@gmail.com> (raw)
In-Reply-To: <20200529085038.26008-1-szeder.dev@gmail.com>

[Explain!  Try topo order!]

I don't have recent benchmark results at hand yet, but as far as I can
recall some old results this reduces the time spent loading and
checking modified path Bloom filters by another ~10%.  Checking Bloom
filters is of course only a small part of the whole runtime of a
pathspec-limited revision walk, so the overall improvement is only
about 1-2%.
Oh, well, I expected more from this change... But perhaps it has
larger impact with less embedded modified path Bloom filters?

This is the last patch in this series that improves pathspec-limited
revision walk performance, so it's time to compare average runtime and
memory usage of

  git rev-list HEAD -- "$path"

for 5000+ randomly selected paths from each repository with and
without modified path Bloom filters:

                  Average runtime     Average  |  Average max RSS
                 without      with    speedup  |  without    with
  ---------------------------------------------+----------------------------
  android-base   0.8780s    0.1456s     6.03x  |  387MB    91.2MB    -76.5%
  cmssw          0.3143s    0.0313s    10.04x  |  181MB    37.4MB    -79.4%
  cpython        0.7453s    0.0810s     9.20x  |  159MB    43.8MB    -72.5%
  elasticsearch  0.1492s    0.0136s    10.95x  |  134MB    21.8MB    -83.7%
  gcc            7.1852s    0.2114s    34.00x  |  297MB    91.1MB    -69.3%
  gecko-dev      4.6113s    0.4815s     9.58x  |  832MB   175.2MB    -79.0%
  git            0.6180s    0.0310s    19.94x  |  131MB    31.5MB    -76.0%
  glibc          0.5618s    0.0282s    19.92x  |  135MB    25.0MB    -81.6%
  go             0.4913s    0.0403s    12.19x  |  130MB    24.7MB    -81.1%
  jdk            0.0482s    0.0068s     7.09x  |   52MB    27.1MB    -48.2%
  linux          0.7043s    0.0873s     8.08x  |  438MB   150.9MB    -65.5%
  llvm-project   2.6844s    0.4135s     6.49x  |  384MB   126.5MB    -67.1%
  rails          0.2784s    0.0391s     7.12x  |   88MB    28.8MB    -67.3%
  rust           0.7757s    0.0439s    17.67x  |  344MB    42.0MB    -87.8%
  tensorflow     0.6258s    0.0487s    12.85x  |  233MB    36.7MB    -84.2%
  webkit         1.9137s    0.2420s     7.91x  |  940MB    84.7MB    -91.0%

The astute reader may notice that in several cases the achieved
speedup is a bit higher than the calculated potential speedup shown in
the log message of the patch adding the modified path Bloom filter
specs, e.g. 34x vs. 29.7x in the gcc repository.  I suspect that by
eliminating much of the tree-diff overhead we load a lot less (tree)
objects, putting less strain on the caches, and in turn making other
parts of the process faster as well.

Alas, this is not without extra cost: writing the commit-graph file
with modified path Bloom filters takes a lot longer and the resulting
commit-graph file is considerably larger:

          Writing a commit-graph file from      |
             scratch with '--reachable'         | commit-graph file size
                 without       with             |   without      with
  ----------------------------------------------+--------------------------------
  android-base    6.230s     40.880s     6.56x  |  25077512    31278605    +24.7%
  cmssw           3.177s     25.691s     8.09x  |  13017516    23971497    +84.1%
  cpython         1.344s      8.951s     6.66x  |   6808068     8152146    +19.7%
  gcc             2.886s     36.917s    12.79x  |  12839660    18068286    +40.7%
  gecko-dev       9.331s     97.729s    10.47x  |  43335208    66568549    +53.6%
  git             0.750s      5.245s     6.99x  |   3388208     4011595    +18.3%
  glibc           0.565s      4.146s     7.34x  |   2832460     3936588    +38.9%
  homebrew-cask   1.083s     27.024s    24.95x  |   5928644     6859160    +15.7%
  homebrew-core   1.702s     55.478s    32.60x  |   9171324    10509040    +14.5%
  jdk             0.568s     19.418s    34.19x  |   3237508    15858410   +389.8%
  linux          13.525s    100.837s     7.46x  |  49800744    81939893    +64.5%
  llvm-project    4.275s     31.188s     7.30x  |  19309116    25336867    +31.2%
  rails           0.986s      5.607s     5.69x  |   4990252     6317541    +26.5%
  rust            1.260s     13.250s    10.52x  |   6053824     8601440    +42.0%
  webkit          3.658s     30.469s     8.33x  |  12288620    19438512    +58.1%
---
 commit-graph.c | 28 +++++++++++++++++++++-------
 1 file changed, 21 insertions(+), 7 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 1677e4fb3e..8eb0cbedaf 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -67,6 +67,8 @@ struct modified_path_bloom_filter_info {
 	uint64_t offset;
 	uint32_t merge_index_pos;
 	struct modified_path_bloom_filter_info *next;
+	/* TODO: need better names for 'next' and 'next_commit_bfi' */
+	struct modified_path_bloom_filter_info *next_commit_bfi;
 };
 
 static void free_modified_path_bloom_filter_info_in_slab(
@@ -1159,6 +1161,10 @@ struct write_commit_graph_context {
 
 		/* Used to find identical modified path Bloom filters */
 		struct hashmap dedup_hashmap;
+
+		/* To chain up Bloom filters in history order */
+		struct modified_path_bloom_filter_info *first_commit_bfi;
+		struct modified_path_bloom_filter_info *prev_commit_bfi;
 	} mpbfctx;
 };
 
@@ -1363,18 +1369,18 @@ static int write_graph_chunk_modified_path_bloom_filters(struct hashfile *f,
 		struct write_commit_graph_context *ctx)
 {
 	uint64_t offset = 0;
-	int i;
+	struct modified_path_bloom_filter_info *next_commit_bfi;
 
-	for (i = 0; i < ctx->commits.nr; i++) {
-		struct commit *commit = ctx->commits.list[i];
-		struct modified_path_bloom_filter_info *bfi;
-		unsigned int filter_size;
+	next_commit_bfi = ctx->mpbfctx.first_commit_bfi;
+	while (next_commit_bfi) {
+		struct modified_path_bloom_filter_info *bfi = next_commit_bfi;
+
+		next_commit_bfi = next_commit_bfi->next_commit_bfi;
 
 		display_progress(ctx->progress, ++ctx->progress_cnt);
 
-		bfi = modified_path_bloom_filters_peek(
-				&modified_path_bloom_filters, commit);
 		for (; bfi; bfi = bfi->next) {
+			unsigned int filter_size;
 			if (bfi->duplicate_of)
 				continue;
 			if (!bfi->filter.nr_bits)
@@ -1762,6 +1768,14 @@ static void create_modified_path_bloom_filter(
 	bloom_filter_set_bits(&bfi->filter, mpbfctx->hashes,
 			      mpbfctx->hashes_nr);
 
+	if (!mpbfctx->first_commit_bfi) {
+		mpbfctx->first_commit_bfi = bfi;
+		mpbfctx->prev_commit_bfi = bfi;
+	} else if (!nth_parent) {
+		mpbfctx->prev_commit_bfi->next_commit_bfi = bfi;
+		mpbfctx->prev_commit_bfi = bfi;
+	}
+
 	if (path_component_count > mpbfctx->embedded_limit &&
 	    !handle_duplicate_modified_path_bloom_filter(mpbfctx, bfi))
 		mpbfctx->total_filter_size += sizeof(uint32_t) +
-- 
2.27.0.rc1.431.g5c813f95dc


  parent reply	other threads:[~2020-05-29  8:52 UTC|newest]

Thread overview: 43+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-05-29  8:50 [PoC PATCH 00/34] An alternative modified path Bloom filters implementation SZEDER Gábor
2020-05-28 22:09 ` Johannes Schindelin
2020-05-29  8:50 ` [PATCH 01/34] tree-walk.c: don't match submodule entries for 'submod/anything' SZEDER Gábor
2020-05-29  8:50 ` [PATCH 02/34] commit-graph: fix parsing the Chunk Lookup table SZEDER Gábor
2020-05-29  8:50 ` [PATCH 03/34] commit-graph-format.txt: all multi-byte numbers are in network byte order SZEDER Gábor
2020-05-29  8:50 ` [PATCH 04/34] commit-slab: add a function to deep free entries on the slab SZEDER Gábor
2020-06-04 16:43   ` Derrick Stolee
2020-06-27 15:56     ` SZEDER Gábor
2020-06-29 11:30       ` Derrick Stolee
2020-05-29  8:50 ` [PATCH 05/34] diff.h: drop diff_tree_oid() & friends' return value SZEDER Gábor
2020-05-29  8:50 ` [PATCH 06/34] commit-graph: clean up #includes SZEDER Gábor
2020-05-29  8:50 ` [PATCH 07/34] commit-graph: simplify parse_commit_graph() #1 SZEDER Gábor
2020-05-29  8:50 ` [PATCH 08/34] commit-graph: simplify parse_commit_graph() #2 SZEDER Gábor
2020-05-29  8:50 ` [PATCH 09/34] commit-graph: simplify write_commit_graph_file() #1 SZEDER Gábor
2020-05-29  8:50 ` [PATCH 10/34] commit-graph: simplify write_commit_graph_file() #2 SZEDER Gábor
2020-05-29  8:50 ` [PATCH 11/34] commit-graph: allocate the 'struct chunk_info' array dinamically SZEDER Gábor
2020-05-29  8:50 ` [PATCH 12/34] commit-graph: unify the signatures of all write_graph_chunk_*() functions SZEDER Gábor
2020-05-29  8:50 ` [PATCH 13/34] commit-graph: simplify write_commit_graph_file() #3 SZEDER Gábor
2020-05-29  8:50 ` [PATCH 14/34] commit-graph: check chunk sizes after writing SZEDER Gábor
2020-05-29  8:50 ` [PATCH 15/34] commit-graph-format.txt: document the modified path Bloom filter chunks SZEDER Gábor
2020-06-02 17:59   ` SZEDER Gábor
2020-05-29  8:50 ` [PATCH 16/34] Add a generic and minimal Bloom filter implementation SZEDER Gábor
2020-05-29  8:50 ` [PATCH 17/34] Import a streaming-capable Murmur3 hash function implementation SZEDER Gábor
2020-05-29  8:50 ` [PATCH 18/34] commit-graph: write "empty" Modified Path Bloom Filter Index chunk SZEDER Gábor
2020-05-29  8:50 ` [PATCH 19/34] commit-graph: add commit slab for modified path Bloom filters SZEDER Gábor
2020-05-29  8:50 ` [PATCH 20/34] commit-graph: fill the Modified Path Bloom Filter Index chunk SZEDER Gábor
2020-05-29  8:50 ` [PATCH 21/34] commit-graph: load and use " SZEDER Gábor
2020-05-29  8:50 ` [PATCH 22/34] commit-graph: write the Modified Path Bloom Filters chunk SZEDER Gábor
2020-05-29  8:50 ` [PATCH 23/34] commit-graph: load and use " SZEDER Gábor
2020-05-29  8:50 ` [PATCH 24/34] commit-graph: check all leading directories in modified path Bloom filters SZEDER Gábor
2020-05-29  8:50 ` [PATCH 25/34] commit-graph: check embedded modified path Bloom filters with a mask SZEDER Gábor
2020-05-29  8:50 ` [PATCH 26/34] commit-graph: deduplicate modified path Bloom filters SZEDER Gábor
2020-05-29  8:50 ` [PATCH 27/34] commit-graph: load modified path Bloom filters for merge commits SZEDER Gábor
2020-05-29  8:50 ` [PATCH 28/34] commit-graph: write Modified Path Bloom Filter Merge Index chunk SZEDER Gábor
2020-05-29  8:50 ` [PATCH 29/34] commit-graph: extract init and free write_commit_graph_context SZEDER Gábor
2020-05-29  8:50 ` [PATCH 30/34] commit-graph: move write_commit_graph_reachable below write_commit_graph SZEDER Gábor
2020-05-29  8:50 ` [PATCH 31/34] t7007-show: make the first test compatible with the next patch SZEDER Gábor
2020-05-29  8:50 ` [PATCH 32/34] PoC commit-graph: use revision walk machinery for '--reachable' SZEDER Gábor
2020-05-29  8:50 ` SZEDER Gábor [this message]
2020-05-29  8:50 ` [PATCH 34/34] commit-graph: use modified path Bloom filters with wildcards, if possible SZEDER Gábor
2020-05-29 13:59 ` [PoC PATCH 00/34] An alternative modified path Bloom filters implementation Derrick Stolee
2020-06-01 23:25 ` Taylor Blau
2020-06-02 17:08   ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200529085038.26008-34-szeder.dev@gmail.com \
    --to=szeder.dev@gmail.com \
    --cc=garima.singh@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jnareb@gmail.com \
    --cc=me@ttaylorr.com \
    --cc=peff@peff.net \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

git@vger.kernel.org list mirror (unofficial, one of many)

This inbox may be cloned and mirrored by anyone:

	git clone --mirror https://public-inbox.org/git
	git clone --mirror http://ou63pmih66umazou.onion/git
	git clone --mirror http://czquwvybam4bgbro.onion/git
	git clone --mirror http://hjrcffqmbrq6wope.onion/git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V1 git git/ https://public-inbox.org/git \
		git@vger.kernel.org
	public-inbox-index git

Example config snippet for mirrors.
Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.version-control.git
	nntp://ou63pmih66umazou.onion/inbox.comp.version-control.git
	nntp://czquwvybam4bgbro.onion/inbox.comp.version-control.git
	nntp://hjrcffqmbrq6wope.onion/inbox.comp.version-control.git
	nntp://news.gmane.io/gmane.comp.version-control.git
 note: .onion URLs require Tor: https://www.torproject.org/

code repositories for the project(s) associated with this inbox:

	https://80x24.org/mirrors/git.git

AGPL code for this site: git clone https://public-inbox.org/public-inbox.git