git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / Atom feed
From: Derrick Stolee <stolee@gmail.com>
To: Jeff King <peff@peff.net>,
	Garima Singh via GitGitGadget <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org, szeder.dev@gmail.com,
	jonathantanmy@google.com, jeffhost@microsoft.com,
	me@ttaylorr.com, Junio C Hamano <gitster@pobox.com>
Subject: Re: [PATCH 0/9] [RFC] Changed Paths Bloom Filters
Date: Fri, 27 Dec 2019 11:11:37 -0500
Message-ID: <e9a4e4ff-5466-dc39-c3f5-c9a8b8f2f11d@gmail.com> (raw)
In-Reply-To: <20191222093036.GA3449072@coredump.intra.peff.net>

On 12/22/2019 4:30 AM, Jeff King wrote:
> On Fri, Dec 20, 2019 at 10:05:11PM +0000, Garima Singh via GitGitGadget wrote:
> 
>> Adopting changed path bloom filters has been discussed on the list before,
>> and a prototype version was worked on by SZEDER Gábor, Jonathan Tan and Dr.
>> Derrick Stolee [1]. This series is based on Dr. Stolee's approach [2] and
>> presents an updated and more polished RFC version of the feature.
> 
> Great to see progress here. I probably won't have time to review this
> carefully before the new year, but I did notice some low-hanging fruit
> on the generation side.
> 
> So here are a few patches to reduce the CPU and memory usage. They could
> be squashed in at the appropriate spots, or perhaps taken as inspiration
> if there are better solutions (especially for the first one).

I tested these patches with the Linux kernel repo and reported my results
on each patch. However, I wanted to also test on a larger internal repo
(the AzureDevOps repo), which has ~500 commits with more than 512 changes,
and generally has larger diffs than the Linux kernel repo.

| Version  | Time   | Memory |
|----------|--------|--------|
| Garima   | 16m36s | 27.0gb |
| Peff 1   | 6m32s  | 28.0gb |
| Peff 2   | 6m48s  |  5.6gb |
| Peff 3   | 6m14s  |  4.5gb |
| Shortcut | 3m47s  |  4.5gb |

For reference, I found the time and memory information using
"/usr/bin/time --verbose" in a bash script.
 
> I think we could go further still, by actually doing a non-recursive
> diff_tree_oid(), and then recursing into sub-trees ourselves. That would
> save us having to split apart each path to add the leading paths to the
> hashmap (most of which will be duplicates if the commit touched "a/b/c"
> and "a/b/d", etc). I doubt it would be that huge a speedup though. We
> have to keep a list of the touched paths anyway (since the bloom key
> parameters depend on the number of entries), and most of the time is
> almost certainly spent inflating the trees in the first place. However
> it might be easier to follow the code, and it would make it simpler to
> stop traversing at the 512-entry limit, rather than generating a huge
> diff only to throw it away.

By "Shortcut" in the table above, I mean the following patch on top of
Garima's and Peff's changes. It inserts a max_changes option into struct
diff_options to halt the diff early. This seemed like an easier change
than creating a new tree diff algorithm wholesale.

Thanks,
-Stolee

-->8--

From: Derrick Stolee <dstolee@microsoft.com>
Date: Fri, 27 Dec 2019 10:13:48 -0500
Subject: [PATCH] diff: halt tree-diff early after max_changes

When computing the changed-paths bloom filters for the commit-graph,
we limit the size of the filter by restricting the number of paths
in the diff. Instead of computing a large diff and then ignoring the
result, it is better to halt the diff computation early.

Create a new "max_changes" option in struct diff_options. If non-zero,
then halt the diff computation after discovering strictly more changed
paths. This includes paths corresponding to trees that change.

Use this max_changes option in the bloom filter calculations. This
reduces the time taken to compute the filters for the Linux kernel
repo from 2m50s to 2m35s. For a larger repo with more commits changing
many paths, the time reduces from 6 minutes to under 4 minutes.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 bloom.c     | 4 +++-
 diff.h      | 5 +++++
 tree-diff.c | 5 +++++
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/bloom.c b/bloom.c
index ea77631cc2..83dde2378b 100644
--- a/bloom.c
+++ b/bloom.c
@@ -155,6 +155,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 	struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
 	int i;
 	struct diff_options diffopt;
+	int max_changes = 512;
 
 	filter = bloom_filter_slab_at(&bloom_filters, c);
 
@@ -171,6 +172,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 
 	repo_diff_setup(r, &diffopt);
 	diffopt.flags.recursive = 1;
+	diffopt.max_changes = max_changes;
 	diff_setup_done(&diffopt);
 
 	if (c->parents)
@@ -179,7 +181,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 		diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
 	diffcore_std(&diffopt);
 
-	if (diff_queued_diff.nr <= 512) {
+	if (diff_queued_diff.nr <= max_changes) {
 		struct hashmap pathmap;
 		struct pathmap_hash_entry* e;
 		struct hashmap_iter iter;
diff --git a/diff.h b/diff.h
index 6febe7e365..9443dc1b00 100644
--- a/diff.h
+++ b/diff.h
@@ -285,6 +285,11 @@ struct diff_options {
 	/* Number of hexdigits to abbreviate raw format output to. */
 	int abbrev;
 
+	/* If non-zero, then stop computing after this many changes. */
+	int max_changes;
+	/* For internal use only. */
+	int num_changes;
+
 	int ita_invisible_in_index;
 /* white-space error highlighting */
 #define WSEH_NEW (1<<12)
diff --git a/tree-diff.c b/tree-diff.c
index 33ded7f8b3..16a21d9f34 100644
--- a/tree-diff.c
+++ b/tree-diff.c
@@ -434,6 +434,9 @@ static struct combine_diff_path *ll_diff_tree_paths(
 		if (diff_can_quit_early(opt))
 			break;
 
+		if (opt->max_changes && opt->num_changes > opt->max_changes)
+			break;
+
 		if (opt->pathspec.nr) {
 			skip_uninteresting(&t, base, opt);
 			for (i = 0; i < nparent; i++)
@@ -518,6 +521,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
 
 			/* t↓ */
 			update_tree_entry(&t);
+			opt->num_changes++;
 		}
 
 		/* t > p[imin] */
@@ -535,6 +539,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
 		skip_emit_tp:
 			/* ∀ pi=p[imin]  pi↓ */
 			update_tp_entries(tp, nparent);
+			opt->num_changes++;
 		}
 	}
 
-- 
2.25.0.rc0


  parent reply index

Thread overview: 141+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-12-20 22:05 Garima Singh via GitGitGadget
2019-12-20 22:05 ` [PATCH 1/9] commit-graph: add --changed-paths option to write Garima Singh via GitGitGadget
2020-01-01 20:20   ` Jakub Narebski
2019-12-20 22:05 ` [PATCH 2/9] commit-graph: write changed paths bloom filters Garima Singh via GitGitGadget
2019-12-21 16:48   ` Philip Oakley
2020-01-06 18:44   ` Jakub Narebski
2020-01-13 19:48     ` Garima Singh
2019-12-20 22:05 ` [PATCH 3/9] commit-graph: use MAX_NUM_CHUNKS Garima Singh via GitGitGadget
2020-01-07 12:19   ` Jakub Narebski
2019-12-20 22:05 ` [PATCH 4/9] commit-graph: document bloom filter format Garima Singh via GitGitGadget
2020-01-07 14:46   ` Jakub Narebski
2019-12-20 22:05 ` [PATCH 5/9] commit-graph: write changed path bloom filters to commit-graph file Garima Singh via GitGitGadget
2020-01-07 16:01   ` Jakub Narebski
2020-01-14 15:14     ` Garima Singh
2019-12-20 22:05 ` [PATCH 6/9] commit-graph: test commit-graph write --changed-paths Garima Singh via GitGitGadget
2020-01-08  0:32   ` Jakub Narebski
2019-12-20 22:05 ` [PATCH 7/9] commit-graph: reuse existing bloom filters during write Garima Singh via GitGitGadget
2020-01-09 19:12   ` Jakub Narebski
2019-12-20 22:05 ` [PATCH 8/9] revision.c: use bloom filters to speed up path based revision walks Garima Singh via GitGitGadget
2020-01-11  0:27   ` Jakub Narebski
2020-01-15  0:08     ` Garima Singh
2019-12-20 22:05 ` [PATCH 9/9] commit-graph: add GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS test flag Garima Singh via GitGitGadget
2020-01-11 19:56   ` Jakub Narebski
2020-01-15  0:55     ` Garima Singh
2019-12-20 22:14 ` [PATCH 0/9] [RFC] Changed Paths Bloom Filters Junio C Hamano
2019-12-22  9:26 ` Christian Couder
2019-12-22  9:38   ` Jeff King
2020-01-01 12:04     ` Jakub Narebski
2019-12-22  9:30 ` Jeff King
2019-12-22  9:32   ` [PATCH 1/3] commit-graph: examine changed-path objects in pack order Jeff King
2019-12-27 14:51     ` Derrick Stolee
2019-12-29  6:12       ` Jeff King
2019-12-29  6:28         ` Jeff King
2019-12-30 14:37         ` Derrick Stolee
2019-12-30 14:51           ` Derrick Stolee
2019-12-22  9:32   ` [PATCH 2/3] commit-graph: free large diffs, too Jeff King
2019-12-27 14:52     ` Derrick Stolee
2019-12-22  9:32   ` [PATCH 3/3] commit-graph: stop using full rev_info for diffs Jeff King
2019-12-27 14:53     ` Derrick Stolee
2019-12-26 14:21   ` [PATCH 0/9] [RFC] Changed Paths Bloom Filters Derrick Stolee
2019-12-29  6:03     ` Jeff King
2019-12-27 16:11   ` Derrick Stolee [this message]
2019-12-29  6:24     ` Jeff King
2019-12-30 16:04       ` Derrick Stolee
2019-12-30 17:02       ` Junio C Hamano
2019-12-31 16:45 ` Jakub Narebski
2020-01-13 16:54   ` Garima Singh
2020-01-20 13:48     ` Jakub Narebski
2020-01-21 16:14       ` Garima Singh
2020-02-02 18:43         ` Jakub Narebski
2020-01-21 23:40 ` Emily Shaffer
2020-01-27 18:24   ` Garima Singh
2020-02-01 23:32   ` Jakub Narebski
2020-02-05 22:56 ` [PATCH v2 00/11] " Garima Singh via GitGitGadget
2020-02-05 22:56   ` [PATCH v2 01/11] commit-graph: use MAX_NUM_CHUNKS Garima Singh via GitGitGadget
2020-02-09 12:39     ` Jakub Narebski
2020-02-05 22:56   ` [PATCH v2 02/11] bloom: core Bloom filter implementation for changed paths Garima Singh via GitGitGadget
2020-02-15 17:17     ` Jakub Narebski
2020-02-16 16:49     ` Jakub Narebski
2020-02-22  0:32       ` Garima Singh
2020-02-23 13:38         ` Jakub Narebski
2020-02-24 17:34           ` Garima Singh
2020-02-24 18:20             ` Jakub Narebski
2020-02-05 22:56   ` [PATCH v2 03/11] diff: halt tree-diff early after max_changes Derrick Stolee via GitGitGadget
2020-02-17  0:00     ` Jakub Narebski
2020-02-22  0:37       ` Garima Singh
2020-02-05 22:56   ` [PATCH v2 04/11] commit-graph: compute Bloom filters for changed paths Garima Singh via GitGitGadget
2020-02-17 21:56     ` Jakub Narebski
2020-02-22  0:55       ` Garima Singh
2020-02-23 17:34         ` Jakub Narebski
2020-02-05 22:56   ` [PATCH v2 05/11] commit-graph: examine changed-path objects in pack order Jeff King via GitGitGadget
2020-02-18 17:59     ` Jakub Narebski
2020-02-24 18:29       ` Garima Singh
2020-02-05 22:56   ` [PATCH v2 06/11] commit-graph: examine commits by generation number Derrick Stolee via GitGitGadget
2020-02-19  0:32     ` Jakub Narebski
2020-02-24 20:45       ` Garima Singh
2020-02-05 22:56   ` [PATCH v2 07/11] commit-graph: write Bloom filters to commit graph file Garima Singh via GitGitGadget
2020-02-19 15:13     ` Jakub Narebski
2020-02-24 21:14       ` Garima Singh
2020-02-25 11:40         ` Jakub Narebski
2020-02-25 15:58           ` Garima Singh
2020-02-05 22:56   ` [PATCH v2 08/11] commit-graph: reuse existing Bloom filters during write Garima Singh via GitGitGadget
2020-02-20 18:48     ` Jakub Narebski
2020-02-24 21:45       ` Garima Singh
2020-02-05 22:56   ` [PATCH v2 09/11] commit-graph: add --changed-paths option to write subcommand Garima Singh via GitGitGadget
2020-02-20 20:28     ` Jakub Narebski
2020-02-24 21:51       ` Garima Singh
2020-02-25 12:10         ` Jakub Narebski
2020-02-20 22:10     ` Bryan Turner
2020-02-22  1:44       ` Garima Singh
2020-02-05 22:56   ` [PATCH v2 10/11] revision.c: use Bloom filters to speed up path based revision walks Garima Singh via GitGitGadget
2020-02-21 17:31     ` Jakub Narebski
2020-02-21 22:45     ` Jakub Narebski
2020-02-05 22:56   ` [PATCH v2 11/11] commit-graph: add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag Garima Singh via GitGitGadget
2020-02-22  0:11     ` Jakub Narebski
2020-02-07 13:52   ` [PATCH v2 00/11] Changed Paths Bloom Filters SZEDER Gábor
2020-02-07 15:09     ` Garima Singh
2020-02-07 15:36       ` Derrick Stolee
2020-02-07 16:15         ` SZEDER Gábor
2020-02-07 16:33           ` Derrick Stolee
2020-02-11 19:08       ` Garima Singh
2020-02-08 23:04   ` Jakub Narebski
2020-02-21 17:41     ` Garima Singh
2020-03-29 18:36       ` Junio C Hamano
2020-03-30  0:31   ` [PATCH v3 00/16] " Garima Singh via GitGitGadget
2020-03-30  0:31     ` [PATCH v3 01/16] commit-graph: define and use MAX_NUM_CHUNKS Garima Singh via GitGitGadget
2020-03-30  0:31     ` [PATCH v3 02/16] bloom.c: add the murmur3 hash implementation Garima Singh via GitGitGadget
2020-03-30  0:31     ` [PATCH v3 03/16] bloom.c: introduce core Bloom filter constructs Garima Singh via GitGitGadget
2020-03-30  0:31     ` [PATCH v3 04/16] bloom.c: core Bloom filter implementation for changed paths Garima Singh via GitGitGadget
2020-03-30  0:31     ` [PATCH v3 05/16] diff: halt tree-diff early after max_changes Derrick Stolee via GitGitGadget
2020-03-30  0:31     ` [PATCH v3 06/16] commit-graph: compute Bloom filters for changed paths Garima Singh via GitGitGadget
2020-03-30  0:31     ` [PATCH v3 07/16] commit-graph: examine changed-path objects in pack order Jeff King via GitGitGadget
2020-03-30  0:31     ` [PATCH v3 08/16] commit-graph: examine commits by generation number Garima Singh via GitGitGadget
2020-03-30  0:31     ` [PATCH v3 09/16] diff: skip batch object download when possible Garima Singh via GitGitGadget
2020-03-30  0:31     ` [PATCH v3 10/16] commit-graph: write Bloom filters to commit graph file Garima Singh via GitGitGadget
2020-03-30  0:31     ` [PATCH v3 11/16] commit-graph: reuse existing Bloom filters during write Garima Singh via GitGitGadget
2020-03-30  0:31     ` [PATCH v3 12/16] commit-graph: add --changed-paths option to write subcommand Garima Singh via GitGitGadget
2020-03-30  0:31     ` [PATCH v3 13/16] revision.c: use Bloom filters to speed up path based revision walks Garima Singh via GitGitGadget
2020-03-30  0:31     ` [PATCH v3 14/16] revision.c: add trace2 stats around Bloom filter usage Garima Singh via GitGitGadget
2020-03-30  0:31     ` [PATCH v3 15/16] t4216: add end to end tests for git log with Bloom filters Garima Singh via GitGitGadget
2020-03-30  0:31     ` [PATCH v3 16/16] commit-graph: add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag Garima Singh via GitGitGadget
2020-04-06 16:59     ` [PATCH v4 00/15] Changed Paths Bloom Filters Garima Singh via GitGitGadget
2020-04-06 16:59       ` [PATCH v4 01/15] commit-graph: define and use MAX_NUM_CHUNKS Garima Singh via GitGitGadget
2020-04-06 16:59       ` [PATCH v4 02/15] bloom.c: add the murmur3 hash implementation Garima Singh via GitGitGadget
2020-04-06 16:59       ` [PATCH v4 03/15] bloom.c: introduce core Bloom filter constructs Garima Singh via GitGitGadget
2020-04-06 16:59       ` [PATCH v4 04/15] bloom.c: core Bloom filter implementation for changed paths Garima Singh via GitGitGadget
2020-04-06 16:59       ` [PATCH v4 05/15] diff: halt tree-diff early after max_changes Derrick Stolee via GitGitGadget
2020-04-06 16:59       ` [PATCH v4 06/15] commit-graph: compute Bloom filters for changed paths Garima Singh via GitGitGadget
2020-04-06 16:59       ` [PATCH v4 07/15] commit-graph: examine changed-path objects in pack order Jeff King via GitGitGadget
2020-04-06 16:59       ` [PATCH v4 08/15] commit-graph: examine commits by generation number Garima Singh via GitGitGadget
2020-04-06 16:59       ` [PATCH v4 09/15] commit-graph: write Bloom filters to commit graph file Garima Singh via GitGitGadget
2020-04-06 16:59       ` [PATCH v4 10/15] commit-graph: reuse existing Bloom filters during write Garima Singh via GitGitGadget
2020-04-06 16:59       ` [PATCH v4 11/15] commit-graph: add --changed-paths option to write subcommand Garima Singh via GitGitGadget
2020-04-06 16:59       ` [PATCH v4 12/15] revision.c: use Bloom filters to speed up path based revision walks Garima Singh via GitGitGadget
2020-04-06 16:59       ` [PATCH v4 13/15] revision.c: add trace2 stats around Bloom filter usage Garima Singh via GitGitGadget
2020-04-06 16:59       ` [PATCH v4 14/15] t4216: add end to end tests for git log with Bloom filters Garima Singh via GitGitGadget
2020-04-06 16:59       ` [PATCH v4 15/15] commit-graph: add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag Garima Singh via GitGitGadget
2020-04-08 15:51       ` [PATCH v4 00/15] Changed Paths Bloom Filters Derrick Stolee
2020-04-08 19:21         ` Junio C Hamano
2020-04-08 20:05         ` Jakub Narębski
2020-03-05 19:49 ` [PATCH 0/9] [RFC] " Garima Singh

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=e9a4e4ff-5466-dc39-c3f5-c9a8b8f2f11d@gmail.com \
    --to=stolee@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=gitster@pobox.com \
    --cc=jeffhost@microsoft.com \
    --cc=jonathantanmy@google.com \
    --cc=me@ttaylorr.com \
    --cc=peff@peff.net \
    --cc=szeder.dev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

git@vger.kernel.org list mirror (unofficial, one of many)

Archives are clonable:
	git clone --mirror https://public-inbox.org/git
	git clone --mirror http://ou63pmih66umazou.onion/git
	git clone --mirror http://czquwvybam4bgbro.onion/git
	git clone --mirror http://hjrcffqmbrq6wope.onion/git

Example config snippet for mirrors

Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.version-control.git
	nntp://ou63pmih66umazou.onion/inbox.comp.version-control.git
	nntp://czquwvybam4bgbro.onion/inbox.comp.version-control.git
	nntp://hjrcffqmbrq6wope.onion/inbox.comp.version-control.git
	nntp://news.gmane.io/gmane.comp.version-control.git

 note: .onion URLs require Tor: https://www.torproject.org/

AGPL code for this site: git clone https://public-inbox.org/public-inbox.git