From: "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com>
To: git@vger.kernel.org
Cc: peff@peff.net, me@ttaylorr.com, garimasigit@gmail.com,
	szeder.dev@gmail.com, jnareb@gmail.com,
	Derrick Stolee <dstolee@microsoft.com>
Subject: [PATCH v2 06/12] bloom: use num_changes not nr for limit detection
Date: Mon, 11 May 2020 11:56:13 +0000
Message-ID: <8278b5c0918c6b357e1f1ab3ee65ed2091ec3f61.1589198180.git.gitgitgadget@gmail.com>
In-Reply-To: <pull.622.v2.git.1589198180.gitgitgadget@gmail.com>

From: Derrick Stolee <dstolee@microsoft.com>

As diff_tree_oid() computes a diff, it will terminate early if the
total number of changed paths is strictly larger than max_changes.
This includes the directories that changed, not just the file paths.
However, only the file paths are reflected in the resulting diff
queue's "nr" value.

Use the "num_changes" from diffopt to check if the diff terminated
early. This matters because checking "nr" instead can produce
incorrect filters. For example, the first commit in the Linux kernel
repo reports only 471 changes in "nr", but since these are nested
inside several directories they expand to 513 "real" changes, so the
diff terminates early and the full list of changes is not reported.
Thus, the computed filter for this commit is incorrect.

Demonstrate the subtle difference by using one fewer file change
in the 'get bloom filter for commit with 513 changes' test. Before,
this test edited 513 files inside "bigDir", so even the check on "nr"
saw the limit exceeded. With 512 files plus the "bigDir" directory
itself there are still 513 changes, but "nr" is only 512: the previous
inequality wrongly accepts the truncated diff, while the new one
detects the early termination.
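
The counting difference can be sketched outside of Git with a few
lines of shell. This is only a simplified model, not Git's code:
count_files and count_changes are hypothetical helpers, "seq" stands
in for the test suite's test_seq, and the limit of 512 is an
assumption inferred from the "513 changes" test name.

```shell
#!/bin/sh
# Simplified model of the two counts compared in this patch.
# Assumption: the limit is 512 (inferred from the test name).
max_changes=512

# Hypothetical helper: number of file paths on stdin, which is
# what the diff queue's "nr" reflects.
count_files () {
	awk 'END { print NR }'
}

# Hypothetical helper: file paths plus each distinct leading
# directory, modelling how "num_changes" also counts directories.
count_changes () {
	awk -F/ '
	{
		files++
		dir = ""
		for (i = 1; i < NF; i++) {
			dir = (i == 1 ? $i : dir "/" $i)
			if (!(dir in seen)) {
				seen[dir] = 1
				dirs++
			}
		}
	}
	END { print files + dirs }'
}

# The 512 paths written by the updated test: bigDir/0 .. bigDir/511.
paths () {
	seq 0 511 | sed 's|^|bigDir/|'
}

nr=$(paths | count_files)
num_changes=$(paths | count_changes)

echo "nr=$nr num_changes=$num_changes max_changes=$max_changes"
test "$nr" -le "$max_changes" && echo "old check: accepted"
test "$num_changes" -le "$max_changes" || echo "new check: over the limit"
```

Under these assumptions the script reports nr=512 and
num_changes=513, so the old comparison accepts the commit while the
new one flags it as over the limit.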

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 bloom.c          | 2 +-
 t/t0095-bloom.sh | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/bloom.c b/bloom.c
index 196cda0a1bd..e2ede44126c 100644
--- a/bloom.c
+++ b/bloom.c
@@ -215,7 +215,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 		diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
 	diffcore_std(&diffopt);
 
-	if (diff_queued_diff.nr <= max_changes) {
+	if (diffopt.num_changes <= max_changes) {
 		struct hashmap pathmap;
 		struct pathmap_hash_entry *e;
 		struct hashmap_iter iter;
diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
index 6defeb544f1..48a90625596 100755
--- a/t/t0095-bloom.sh
+++ b/t/t0095-bloom.sh
@@ -100,7 +100,7 @@ test_expect_success EXPENSIVE 'get bloom filter for commit with 513 changes' '
 	rm actual &&
 	rm expect &&
 	mkdir bigDir &&
-	for i in $(test_seq 0 512)
+	for i in $(test_seq 0 511)
 	do
 		echo $i >bigDir/$i
 	done &&
-- 
gitgitgadget

