[PATCH 00/12] Integrating line-log and changed-path Bloom filters

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* [PATCH 00/12] Integrating line-log and changed-path Bloom filters
@ 2020-05-01 15:30 Derrick Stolee via GitGitGadget
  2020-05-01 15:30 ` [PATCH 01/12] bloom: fix whitespace around tab length Derrick Stolee via GitGitGadget
                   ` (13 more replies)
  0 siblings, 14 replies; 42+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-05-01 15:30 UTC (permalink / raw)
  To: git; +Cc: peff, me, garimasigit, szeder.dev, jnareb, Derrick Stolee

The best way to learn how a feature works is to build on it. That's also the
best way to find defects.

I started on this patch series with two goals:

 1. Understand and submit Szeder's incremental line-log RFC [1].
 2. Integrate changed-path Bloom filters with line-log.

[1] https://lore.kernel.org/git/20190818182801.7580-1-szeder.dev@gmail.com/

Since I knew a lot about the Bloom filters and not much about line-log.c, I
expected to learn a lot that could contribute to improving Szeder's patches.
Instead, I discovered some deficiencies in the Bloom filter code, including
a few meaningful bugs that should be fixed before the topic is merge to
master.

This series is organized as follows:

 * Patches 1-4 are minor Bloom filter cleanups, including a documentation
   bug.

 * Patch 5 fixes a problem where the Bloom filters use much larger filters
   than expected.

 * Patch 6 fixes a bug related to the short-circuit of large diffs from
   e369698016 (diff: halt tree-diff early after max_changes, 2020-03-30).
   The condition for halting the diff computation is different than the
   check in bloom.c to see if that short-circuit was hit, which leads to
   some commits with large diffs getting an incomplete Bloom filter. This
   happened on the root commit of the Linux kernel repo, for example.

 * Patches 7-11 are Szeder's RFC, with my sign-off added. I have not changed
   the code from those submissions because I didn't find anything wrong with
   them as I was testing. However, I will take responsibility to respond to
   the feedback in this series.

 * Patch 12 integrates Bloom filters with the line-log code.

I organized these patches so patches 1-6 could be split into its own topic,
if beneficial to take earlier than the line-log changes.

The end result of this combined effort is the following: git log -L commands
feel much more responsive to a terminal user because Szeder's changes make
the computation incremental, and they have better end-to-end performance
because of the integration with Bloom filters to reduce time spent running
tree diffs for TREESAME commits.

The following table is also in patch 12, but it summarizes the results of
this series. These are timings for running git log -L 1,10:$path for these
recently-edited paths. The "Entire History" columns wait for the full
command to complete. The "First Result" columns add "-n 1" to the command,
so we see the benefits of the incremental algorithm. Thus, the performance
improvements from "Entire History" to "First Result" are due entirely to
Szeder's patches. The performance improvements from "Before" to "After" are
due to the changed-path Bloom filters.

|                              | Entire History  | First Result    |
| Path                         | Before | After  | Before | After  |
|------------------------------|--------|--------|--------|--------|
| MAINTAINERS                  | 4.26 s | 3.87 s | 0.41 s | 0.39 s |
| fs/namei.c                   | 1.99 s | 0.99 s | 0.42 s | 0.21 s |
| arch/x86/kvm/cpuid.c         | 5.28 s | 1.12 s | 0.16 s | 0.09 s |
| fs/io_uring.c                | 4.34 s | 0.99 s | 0.94 s | 0.27 s |
| arch/x86/kvm/vmx/vmx.c       | 5.01 s | 1.34 s | 0.21 s | 0.12 s |
| arch/x86/kvm/x86.c           | 2.24 s | 1.18 s | 0.21 s | 0.14 s |
| fs/btrfs/disk-io.c           | 1.82 s | 1.01 s | 0.06 s | 0.05 s |
| Documentation/scsi/index.rst | 3.30 s | 0.89 s | 1.46 s | 0.03 s |

Some splashy numbers include a 110x speedup for finding the first result for
Documentation/scsi/index.rst. That's certainly an outlier, but the rest have
an at least 10x speedup.

Thanks, -Stolee

Derrick Stolee (7):
  bloom: fix whitespace around tab length
  test-bloom: fix usage typo
  Documentation: changed-path Bloom filters use byte words
  bloom: de-duplicate directory entries
  bloom: parse commit before computing filters
  bloom: use num_changes not nr for limit detection
  line-log: integrate with changed-path Bloom filters

SZEDER Gábor (5):
  completion: offer '--(no-)patch' among 'git log' options
  line-log: remove unused fields from 'struct line_log_data'
  t4211-line-log: add tests for parent oids
  line-log: more responsive, incremental 'git log -L'
  line-log: try to use generation number-based topo-ordering

 .../technical/commit-graph-format.txt         |  8 +--
 bloom.c                                       | 61 ++++++++++++-----
 bloom.h                                       |  5 +-
 contrib/completion/git-completion.bash        |  1 +
 line-log.c                                    | 43 +++++++++++-
 line-log.h                                    |  5 +-
 revision.c                                    | 43 ++++++++++--
 t/helper/test-bloom.c                         |  2 +-
 t/t0095-bloom.sh                              |  6 +-
 t/t4211-line-log.sh                           | 68 +++++++++++++++++++
 10 files changed, 201 insertions(+), 41 deletions(-)

base-commit: 1b4c57fa87e121f155863f898dc39d06cf4a1d99
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-622%2Fderrickstolee%2Flog-l-on-bloom-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-622/derrickstolee/log-l-on-bloom-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/622
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 01/12] bloom: fix whitespace around tab length
  2020-05-01 15:30 [PATCH 00/12] Integrating line-log and changed-path Bloom filters Derrick Stolee via GitGitGadget
@ 2020-05-01 15:30 ` Derrick Stolee via GitGitGadget
  2020-05-01 22:51   ` Taylor Blau
  2020-05-01 15:30 ` [PATCH 02/12] test-bloom: fix usage typo Derrick Stolee via GitGitGadget
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 42+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-05-01 15:30 UTC (permalink / raw)
  To: git
  Cc: peff, me, garimasigit, szeder.dev, jnareb, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Fix alignment issues that were likely introduced due to an editor
using tab lengths of 4 instead of 8.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 bloom.c | 16 ++++++++--------
 bloom.h |  4 ++--
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/bloom.c b/bloom.c
index dd9bab9bbd6..2e3e0f5037b 100644
--- a/bloom.c
+++ b/bloom.c
@@ -29,8 +29,8 @@ static inline unsigned char get_bitmask(uint32_t pos)
 }
 
 static int load_bloom_filter_from_graph(struct commit_graph *g,
-				   struct bloom_filter *filter,
-				   struct commit *c)
+					struct bloom_filter *filter,
+					struct commit *c)
 {
 	uint32_t lex_pos, start_index, end_index;
 
@@ -123,9 +123,9 @@ uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len)
 }
 
 void fill_bloom_key(const char *data,
-					size_t len,
-					struct bloom_key *key,
-					const struct bloom_filter_settings *settings)
+		    size_t len,
+		    struct bloom_key *key,
+		    const struct bloom_filter_settings *settings)
 {
 	int i;
 	const uint32_t seed0 = 0x293ae76f;
@@ -139,8 +139,8 @@ void fill_bloom_key(const char *data,
 }
 
 void add_key_to_filter(const struct bloom_key *key,
-					   struct bloom_filter *filter,
-					   const struct bloom_filter_settings *settings)
+		       struct bloom_filter *filter,
+		       const struct bloom_filter_settings *settings)
 {
 	int i;
 	uint64_t mod = filter->len * BITS_PER_WORD;
@@ -160,7 +160,7 @@ void init_bloom_filters(void)
 
 struct bloom_filter *get_bloom_filter(struct repository *r,
 				      struct commit *c,
-					  int compute_if_not_present)
+				      int compute_if_not_present)
 {
 	struct bloom_filter *filter;
 	struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
diff --git a/bloom.h b/bloom.h
index b935186425d..a51e3715296 100644
--- a/bloom.h
+++ b/bloom.h
@@ -74,8 +74,8 @@ void fill_bloom_key(const char *data,
 		    const struct bloom_filter_settings *settings);
 
 void add_key_to_filter(const struct bloom_key *key,
-					   struct bloom_filter *filter,
-					   const struct bloom_filter_settings *settings);
+		       struct bloom_filter *filter,
+		       const struct bloom_filter_settings *settings);
 
 void init_bloom_filters(void);
 
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 02/12] test-bloom: fix usage typo
  2020-05-01 15:30 [PATCH 00/12] Integrating line-log and changed-path Bloom filters Derrick Stolee via GitGitGadget
  2020-05-01 15:30 ` [PATCH 01/12] bloom: fix whitespace around tab length Derrick Stolee via GitGitGadget
@ 2020-05-01 15:30 ` Derrick Stolee via GitGitGadget
  2020-05-01 22:51   ` Taylor Blau
  2020-05-01 15:30 ` [PATCH 03/12] Documentation: changed-path Bloom filters use byte words Derrick Stolee via GitGitGadget
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 42+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-05-01 15:30 UTC (permalink / raw)
  To: git
  Cc: peff, me, garimasigit, szeder.dev, jnareb, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/helper/test-bloom.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index 77eb27adac7..00c1d44a561 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -44,7 +44,7 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
 }
 
 static const char *bloom_usage = "\n"
-"  test-tool bloom get_murmer3 <string>\n"
+"  test-tool bloom get_murmur3 <string>\n"
 "  test-tool bloom generate_filter <string> [<string>...]\n"
 "  test-tool get_filter_for_commit <commit-hex>\n";
 
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 03/12] Documentation: changed-path Bloom filters use byte words
  2020-05-01 15:30 [PATCH 00/12] Integrating line-log and changed-path Bloom filters Derrick Stolee via GitGitGadget
  2020-05-01 15:30 ` [PATCH 01/12] bloom: fix whitespace around tab length Derrick Stolee via GitGitGadget
  2020-05-01 15:30 ` [PATCH 02/12] test-bloom: fix usage typo Derrick Stolee via GitGitGadget
@ 2020-05-01 15:30 ` Derrick Stolee via GitGitGadget
  2020-05-01 22:55   ` Taylor Blau
  2020-05-01 15:30 ` [PATCH 04/12] bloom: de-duplicate directory entries Derrick Stolee via GitGitGadget
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 42+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-05-01 15:30 UTC (permalink / raw)
  To: git
  Cc: peff, me, garimasigit, szeder.dev, jnareb, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In Documentation/technical/commit-graph-format.txt, the definition
of the BIDX chunk specifies the length is a number of 8-byte words.
During development we discovered that using 8-byte words in the
Murmur3 hash algorithm causes issues with Big-Endian versus Little-
Endian machines. Thus, the hash algorithm was adapted to work on a
byte-by-byte basis. However, this caused a change in the definition
of a "word" in bloom.h. Now, a "word" is a single byte, which allows
filters to be as small as two bytes. These length-two filters are
demonstrated in t0095-bloom.sh, and a larger filter of length 25 is
demonstrated as well.

The original point of using 8-byte words was for alignment reasons.
It also presented opportunities for extremely sparse Bloom filters
when there were a small number of changes at a commit, creating a
very low false-positive rate. However, modifying the format at this
point is unlikely to be a valuable exercise. Also, this use of
single-byte granularity does present opportunities to save space.
It is unclear if 8-byte alignment of the filters would present any
meaningful performance benefits.

Modify the format document to reflect reality.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/commit-graph-format.txt | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index de56f9f1efd..1beef171822 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -97,10 +97,10 @@ CHUNK DATA:
       bit on. The other bits correspond to the position of the last parent.
 
   Bloom Filter Index (ID: {'B', 'I', 'D', 'X'}) (N * 4 bytes) [Optional]
-    * The ith entry, BIDX[i], stores the number of 8-byte word blocks in all
-      Bloom filters from commit 0 to commit i (inclusive) in lexicographic
-      order. The Bloom filter for the i-th commit spans from BIDX[i-1] to
-      BIDX[i] (plus header length), where BIDX[-1] is 0.
+    * The ith entry, BIDX[i], stores the number of bytes in all Bloom filters
+      from commit 0 to commit i (inclusive) in lexicographic order. The Bloom
+      filter for the i-th commit spans from BIDX[i-1] to BIDX[i] (plus header
+      length), where BIDX[-1] is 0.
     * The BIDX chunk is ignored if the BDAT chunk is not present.
 
   Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 04/12] bloom: de-duplicate directory entries
  2020-05-01 15:30 [PATCH 00/12] Integrating line-log and changed-path Bloom filters Derrick Stolee via GitGitGadget
                   ` (2 preceding siblings ...)
  2020-05-01 15:30 ` [PATCH 03/12] Documentation: changed-path Bloom filters use byte words Derrick Stolee via GitGitGadget
@ 2020-05-01 15:30 ` Derrick Stolee via GitGitGadget
  2020-05-01 23:06   ` Taylor Blau
  2020-05-01 15:30 ` [PATCH 05/12] bloom: parse commit before computing filters Derrick Stolee via GitGitGadget
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 42+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-05-01 15:30 UTC (permalink / raw)
  To: git
  Cc: peff, me, garimasigit, szeder.dev, jnareb, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When computing a changed-path Bloom filter, we need to take the
files that changed from the diff computation and extract the parent
directories. That way, a directory pathspec such as "Documentation"
could match commits that change "Documentation/git.txt".

However, the current code does a poor job of this process. The paths
are added to a hashmap, but we do not check if an entry already
exists with that path. This can create many duplicate entries and
cause the filter to have a much larger length than it should. This
means that the filter is more sparse than intended, which helps the
false positive rate, but wastes a lot of space.

Properly use hashmap_get() before hashmap_add(). Also be sure to
include a comparison function so these can be matched correctly.

This has an effect on a test in t0095-bloom.sh. This makes sense,
there are ten changes inside "smallDir" so the total number of
paths in the filter should be 11. This would result in 11 * 10 bits
required, and with 8 bits per byte, this results in 14 bytes.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 bloom.c          | 35 ++++++++++++++++++++++++++---------
 t/t0095-bloom.sh |  4 ++--
 2 files changed, 28 insertions(+), 11 deletions(-)

diff --git a/bloom.c b/bloom.c
index 2e3e0f5037b..eb08571c628 100644
--- a/bloom.c
+++ b/bloom.c
@@ -158,6 +158,19 @@ void init_bloom_filters(void)
 	init_bloom_filter_slab(&bloom_filters);
 }
 
+static int pathmap_cmp(const void *hashmap_cmp_fn_data,
+		       const struct hashmap_entry *eptr,
+		       const struct hashmap_entry *entry_or_key,
+		       const void *keydata)
+{
+	const struct pathmap_hash_entry *e1, *e2;
+
+	e1 = container_of(eptr, const struct pathmap_hash_entry, entry);
+	e2 = container_of(entry_or_key, const struct pathmap_hash_entry, entry);
+
+	return strcmp(e1->path, e2->path);
+}
+
 struct bloom_filter *get_bloom_filter(struct repository *r,
 				      struct commit *c,
 				      int compute_if_not_present)
@@ -203,25 +216,29 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 		struct hashmap pathmap;
 		struct pathmap_hash_entry *e;
 		struct hashmap_iter iter;
-		hashmap_init(&pathmap, NULL, NULL, 0);
+		hashmap_init(&pathmap, pathmap_cmp, NULL, 0);
 
 		for (i = 0; i < diff_queued_diff.nr; i++) {
 			const char *path = diff_queued_diff.queue[i]->two->path;
 
 			/*
-			* Add each leading directory of the changed file, i.e. for
-			* 'dir/subdir/file' add 'dir' and 'dir/subdir' as well, so
-			* the Bloom filter could be used to speed up commands like
-			* 'git log dir/subdir', too.
-			*
-			* Note that directories are added without the trailing '/'.
-			*/
+			 * Add each leading directory of the changed file, i.e. for
+			 * 'dir/subdir/file' add 'dir' and 'dir/subdir' as well, so
+			 * the Bloom filter could be used to speed up commands like
+			 * 'git log dir/subdir', too.
+			 *
+			 * Note that directories are added without the trailing '/'.
+			 */
 			do {
 				char *last_slash = strrchr(path, '/');
 
 				FLEX_ALLOC_STR(e, path, path);
 				hashmap_entry_init(&e->entry, strhash(path));
-				hashmap_add(&pathmap, &e->entry);
+
+				if (!hashmap_get(&pathmap, &e->entry, NULL))
+					hashmap_add(&pathmap, &e->entry);
+				else
+					free(e);
 
 				if (!last_slash)
 					last_slash = (char*)path;
diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
index 8f9eef116dc..6defeb544f1 100755
--- a/t/t0095-bloom.sh
+++ b/t/t0095-bloom.sh
@@ -89,8 +89,8 @@ test_expect_success 'get bloom filter for commit with 10 changes' '
 	git add smallDir &&
 	git commit -m "commit with 10 changes" &&
 	cat >expect <<-\EOF &&
-	Filter_Length:25
-	Filter_Data:82|a0|65|47|0c|92|90|c0|a1|40|02|a0|e2|40|e0|04|0a|9a|66|cf|80|19|85|42|23|
+	Filter_Length:14
+	Filter_Data:02|b3|c4|a0|34|e7|fe|eb|cb|47|fe|a0|e8|72|
 	EOF
 	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
 	test_cmp expect actual
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 05/12] bloom: parse commit before computing filters
  2020-05-01 15:30 [PATCH 00/12] Integrating line-log and changed-path Bloom filters Derrick Stolee via GitGitGadget
                   ` (3 preceding siblings ...)
  2020-05-01 15:30 ` [PATCH 04/12] bloom: de-duplicate directory entries Derrick Stolee via GitGitGadget
@ 2020-05-01 15:30 ` Derrick Stolee via GitGitGadget
  2020-05-01 23:10   ` Taylor Blau
  2020-05-01 15:30 ` [PATCH 06/12] bloom: use num_changes not nr for limit detection Derrick Stolee via GitGitGadget
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 42+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-05-01 15:30 UTC (permalink / raw)
  To: git
  Cc: peff, me, garimasigit, szeder.dev, jnareb, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When computing changed-path Bloom filters for a commit, we need to
know if the commit has a parent or not. If the commit is not parsed,
then its parent pointer will be NULL.

As far as I can tell, the only opportunity to reach this code
without parsing the commit is inside "test-tool bloom
get_filter_for_commit" but it is best to be safe.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 bloom.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/bloom.c b/bloom.c
index eb08571c628..c3b81b1a38a 100644
--- a/bloom.c
+++ b/bloom.c
@@ -206,6 +206,9 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 	diffopt.max_changes = max_changes;
 	diff_setup_done(&diffopt);
 
+	/* ensure commit is parsed so we have parent information */
+	parse_commit(c);
+
 	if (c->parents)
 		diff_tree_oid(&c->parents->item->object.oid, &c->object.oid, "", &diffopt);
 	else
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 06/12] bloom: use num_changes not nr for limit detection
  2020-05-01 15:30 [PATCH 00/12] Integrating line-log and changed-path Bloom filters Derrick Stolee via GitGitGadget
                   ` (4 preceding siblings ...)
  2020-05-01 15:30 ` [PATCH 05/12] bloom: parse commit before computing filters Derrick Stolee via GitGitGadget
@ 2020-05-01 15:30 ` Derrick Stolee via GitGitGadget
  2020-05-01 23:12   ` Taylor Blau
  2020-05-01 15:30 ` [PATCH 07/12] completion: offer '--(no-)patch' among 'git log' options SZEDER Gábor via GitGitGadget
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 42+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-05-01 15:30 UTC (permalink / raw)
  To: git
  Cc: peff, me, garimasigit, szeder.dev, jnareb, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

As diff_tree_oid() computes a diff, it will terminate early if the
total number of changed paths is strictly larger than max_changes.
This includes the directories that changed, not just the file paths.
However, only the file paths are reflected in the resulting diff
queue "nr" value.

Use the "num_changes" from diffopt to check if the diff terminated
early. This is incredibly important, as it can result in incorrect
filters! For example, the first commit in the Linux kernel repo
reports only 471 changes, but since these are nested inside several
directories they expand to 513 "real" changes, and in fact the
total list of changes is not reported. Thus, the computed filter
for this commit is incorrect.

Demonstrate the subtle difference by using one fewer file change
in the 'get bloom filter for commit with 513 changes' test. Before,
this edited 513 files inside "bigDir" which hit this inequality.
However, dropping the file count by one demonstrates how the
previous inequality was incorrect but the new one is correct.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 bloom.c          | 2 +-
 t/t0095-bloom.sh | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/bloom.c b/bloom.c
index c3b81b1a38a..9a8386ac676 100644
--- a/bloom.c
+++ b/bloom.c
@@ -215,7 +215,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 		diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
 	diffcore_std(&diffopt);
 
-	if (diff_queued_diff.nr <= max_changes) {
+	if (diffopt.num_changes <= max_changes) {
 		struct hashmap pathmap;
 		struct pathmap_hash_entry *e;
 		struct hashmap_iter iter;
diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
index 6defeb544f1..48a90625596 100755
--- a/t/t0095-bloom.sh
+++ b/t/t0095-bloom.sh
@@ -100,7 +100,7 @@ test_expect_success EXPENSIVE 'get bloom filter for commit with 513 changes' '
 	rm actual &&
 	rm expect &&
 	mkdir bigDir &&
-	for i in $(test_seq 0 512)
+	for i in $(test_seq 0 511)
 	do
 		echo $i >bigDir/$i
 	done &&
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 07/12] completion: offer '--(no-)patch' among 'git log' options
  2020-05-01 15:30 [PATCH 00/12] Integrating line-log and changed-path Bloom filters Derrick Stolee via GitGitGadget
                   ` (5 preceding siblings ...)
  2020-05-01 15:30 ` [PATCH 06/12] bloom: use num_changes not nr for limit detection Derrick Stolee via GitGitGadget
@ 2020-05-01 15:30 ` SZEDER Gábor via GitGitGadget
  2020-05-01 23:44   ` Taylor Blau
  2020-05-01 15:30 ` [PATCH 08/12] line-log: remove unused fields from 'struct line_log_data' SZEDER Gábor via GitGitGadget
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 42+ messages in thread
From: SZEDER Gábor via GitGitGadget @ 2020-05-01 15:30 UTC (permalink / raw)
  To: git
  Cc: peff, me, garimasigit, szeder.dev, jnareb, Derrick Stolee,
	SZEDER Gábor

From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= <szeder.dev@gmail.com>

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 contrib/completion/git-completion.bash | 1 +
 1 file changed, 1 insertion(+)

diff --git a/contrib/completion/git-completion.bash b/contrib/completion/git-completion.bash
index c21786f2fd0..ec6ff1d5fb8 100644
--- a/contrib/completion/git-completion.bash
+++ b/contrib/completion/git-completion.bash
@@ -1860,6 +1860,7 @@ _git_log ()
 			$merge
 			$__git_diff_common_options
 			--pickaxe-all --pickaxe-regex
+			--patch --no-patch
 			"
 		return
 		;;
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 08/12] line-log: remove unused fields from 'struct line_log_data'
  2020-05-01 15:30 [PATCH 00/12] Integrating line-log and changed-path Bloom filters Derrick Stolee via GitGitGadget
                   ` (6 preceding siblings ...)
  2020-05-01 15:30 ` [PATCH 07/12] completion: offer '--(no-)patch' among 'git log' options SZEDER Gábor via GitGitGadget
@ 2020-05-01 15:30 ` SZEDER Gábor via GitGitGadget
  2020-05-01 23:46   ` Taylor Blau
  2020-05-01 15:30 ` [PATCH 09/12] t4211-line-log: add tests for parent oids SZEDER Gábor via GitGitGadget
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 42+ messages in thread
From: SZEDER Gábor via GitGitGadget @ 2020-05-01 15:30 UTC (permalink / raw)
  To: git
  Cc: peff, me, garimasigit, szeder.dev, jnareb, Derrick Stolee,
	SZEDER Gábor

From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= <szeder.dev@gmail.com>

Remove the unused fields 'status', 'arg_alloc', 'arg_nr' and 'args'
from 'struct line_log_data'.  They were already part of the struct
when it was introduced in commit 12da1d1f6 (Implement line-history
search (git log -L), 2013-03-28), but as far as I can tell none of
them have ever been actually used.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 line-log.h | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/line-log.h b/line-log.h
index 8ee7a2bd4a1..882c5055bb8 100644
--- a/line-log.h
+++ b/line-log.h
@@ -46,10 +46,7 @@ void sort_and_merge_range_set(struct range_set *);
 struct line_log_data {
 	struct line_log_data *next;
 	char *path;
-	char status;
 	struct range_set ranges;
-	int arg_alloc, arg_nr;
-	const char **args;
 	struct diff_filepair *pair;
 	struct diff_ranges diff;
 };
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 09/12]  t4211-line-log: add tests for parent oids
  2020-05-01 15:30 [PATCH 00/12] Integrating line-log and changed-path Bloom filters Derrick Stolee via GitGitGadget
                   ` (7 preceding siblings ...)
  2020-05-01 15:30 ` [PATCH 08/12] line-log: remove unused fields from 'struct line_log_data' SZEDER Gábor via GitGitGadget
@ 2020-05-01 15:30 ` SZEDER Gábor via GitGitGadget
  2020-05-04 17:31   ` Taylor Blau
  2020-05-01 15:30 ` [PATCH 10/12] line-log: more responsive, incremental 'git log -L' SZEDER Gábor via GitGitGadget
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 42+ messages in thread
From: SZEDER Gábor via GitGitGadget @ 2020-05-01 15:30 UTC (permalink / raw)
  To: git
  Cc: peff, me, garimasigit, szeder.dev, jnareb, Derrick Stolee,
	SZEDER Gábor

From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= <szeder.dev@gmail.com>

None of the tests in 't4211-line-log.sh' really check which parent
object IDs are shown in the output, either implicitly as part of
"Merge: ..." lines [1] or explicitly via the '%p' or '%P' format
specifiers in a custom pretty format.

Add two tests to 't4211-line-log.sh' to check which parent object IDs
are shown, one without and one with explicitly requested parent
rewriting, IOW without and with the '--parents' option.

The test without '--parents' is marked as failing, because without
that option parent rewriting should not be performed, and thus the
parent object ID should be that of the immediate parent, just like in
case of a pathspec-limited history traversal without parent rewriting.
The current line-level log implementation, however, performs parent
rewriting unconditionally and without a possibility to turn it off,
and, consequently, it shows the object ID of the most recent ancestor
that modified the given line range.

In both of these new tests we only really care about the object IDs of
the listed commits and their parents, but not the diffs of the line
ranges; the diffs have already been thoroughly checked in the previous
tests.

[1] While one of the tests ('-M -L ':f:b.c' parallel-change') does
    list a merge commit, both of its parents happen to modify the
    given line range and are listed as well, so the implications of
    parent rewriting remained hidden and untested.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t4211-line-log.sh | 68 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 68 insertions(+)

diff --git a/t/t4211-line-log.sh b/t/t4211-line-log.sh
index cda58186c2d..ea4a9398365 100755
--- a/t/t4211-line-log.sh
+++ b/t/t4211-line-log.sh
@@ -215,4 +215,72 @@ test_expect_success 'fancy rename following #2' '
 	test_cmp expect actual
 '
 
+# Create the following linear history, where each commit does what its
+# subject line promises:
+#
+#   * 66c6410 Modify func2() in file.c
+#   * 50834e5 Modify other-file
+#   * fe5851c Modify func1() in file.c
+#   * 8c7c7dd Add other-file
+#   * d5f4417 Add func1() and func2() in file.c
+test_expect_success 'setup for checking line-log and parent oids' '
+	git checkout --orphan parent-oids &&
+	git reset --hard &&
+
+	cat >file.c <<-\EOF &&
+	int func1()
+	{
+	    return F1;
+	}
+
+	int func2()
+	{
+	    return F2;
+	}
+	EOF
+	git add file.c &&
+	test_tick &&
+	git commit -m "Add func1() and func2() in file.c" &&
+
+	echo 1 >other-file &&
+	git add other-file &&
+	git commit -m "Add other-file" &&
+
+	sed -e "s/F1/F1 + 1/" file.c >tmp &&
+	mv tmp file.c &&
+	git commit -a -m "Modify func1() in file.c" &&
+
+	echo 2 >other-file &&
+	git commit -a -m "Modify other-file" &&
+
+	sed -e "s/F2/F2 + 2/" file.c >tmp &&
+	mv tmp file.c &&
+	git commit -a -m "Modify func2() in file.c" &&
+
+	head_oid=$(git rev-parse --short HEAD) &&
+	prev_oid=$(git rev-parse --short HEAD^) &&
+	root_oid=$(git rev-parse --short HEAD~4)
+'
+
+# Parent oid should be from immediate parent.
+test_expect_failure 'parent oids without parent rewriting' '
+	cat >expect <<-EOF &&
+	$head_oid $prev_oid Modify func2() in file.c
+	$root_oid  Add func1() and func2() in file.c
+	EOF
+	git log --format="%h %p %s" --no-patch -L:func2:file.c >actual &&
+	test_cmp expect actual
+'
+
+# Parent oid should be from the most recent ancestor touching func2(),
+# i.e. in this case from the root commit.
+test_expect_success 'parent oids with parent rewriting' '
+	cat >expect <<-EOF &&
+	$head_oid $root_oid Modify func2() in file.c
+	$root_oid  Add func1() and func2() in file.c
+	EOF
+	git log --format="%h %p %s" --no-patch -L:func2:file.c --parents >actual &&
+	test_cmp expect actual
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 10/12] line-log: more responsive, incremental 'git log -L'
  2020-05-01 15:30 [PATCH 00/12] Integrating line-log and changed-path Bloom filters Derrick Stolee via GitGitGadget
                   ` (8 preceding siblings ...)
  2020-05-01 15:30 ` [PATCH 09/12] t4211-line-log: add tests for parent oids SZEDER Gábor via GitGitGadget
@ 2020-05-01 15:30 ` SZEDER Gábor via GitGitGadget
  2020-05-04 17:52   ` Taylor Blau
  2020-05-01 15:30 ` [PATCH 11/12] line-log: try to use generation number-based topo-ordering SZEDER Gábor via GitGitGadget
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 42+ messages in thread
From: SZEDER Gábor via GitGitGadget @ 2020-05-01 15:30 UTC (permalink / raw)
  To: git
  Cc: peff, me, garimasigit, szeder.dev, jnareb, Derrick Stolee,
	SZEDER Gábor

From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= <szeder.dev@gmail.com>

The current line-level log implementation performs a preprocessing
step in prepare_revision_walk(), during which the line_log_filter()
function filters and rewrites history to keep only commits modifying
the given line range.  This preprocessing affects both responsiveness
and correctness:

  - Git doesn't produce any output during this preprocessing step.
    Checking whether a commit modified the given line range is
    somewhat expensive, so depending on the size of the given revision
    range this preprocessing can result in a significant delay before
    the first commit is shown.

  - Limiting the number of displayed commits (e.g. 'git log -3 -L...')
    doesn't limit the amount of work during preprocessing, because
    that limit is applied during history traversal.  Alas, by that
    point this expensive preprocessing step has already churned
    through the whole revision range to find all commits modifying the
    revision range, even though only a few of them need to be shown.

  - It rewrites parents, with no way to turn it off.  Without the user
    explicitly requesting parent rewriting any parent object ID shown
    should be that of the immediate parent, just like in case of a
    pathspec-limited history traversal without parent rewriting.

    However, after that preprocessing step rewrote history, the
    subsequent "regular" history traversal (i.e. get_revision() in a
    loop) only sees commits modifying the given line range.
    Consequently, it can only show the object ID of the last ancestor
    that modified the given line range (which might happen to be the
    immediate parent, but many-many times it isn't).

This patch addresses both the correctness and, at least for the common
case, the responsiveness issues by integrating line-level log
filtering into the regular revision walking machinery:

  - Make process_ranges_arbitrary_commit(), the static function in
    'line-log.c' deciding whether a commit modifies the given line
    range, public by removing the static keyword and adding the
    'line_log_' prefix, so it can be called from other parts of the
    revision walking machinery.

  - If the user didn't explicitly ask for parent rewriting (which, I
    believe, is the most common case):

    - Call this now-public function during regular history traversal,
      namely from get_commit_action() to ignore any commits not
      modifying the given line range.

      Note that while this check is relatively expensive, it must be
      performed before other, much cheaper conditions, because the
      tracked line range must be adjusted even when the commit will
      end up being ignored by other conditions.

    - Skip the line_log_filter() call, i.e. the expensive
      preprocessing step, in prepare_revision_walk(), because, thanks
      to the above points, the revision walking machinery is now able
      to filter out commits not modifying the given line range while
      traversing history.

      This way the regular history traversal sees the unmodified
      history, and is therefore able to print the object ids of the
      immediate parents of the listed commits.  The eliminated
      preprocessing step can greatly reduce the delay before the first
      commit is shown, see the numbers below.

  - However, if the user did explicitly ask for parent rewriting via
    '--parents' or a similar option, then stick with the current
    implementation for now, i.e. perform that expensive filtering and
    history rewriting in the preprocessing step just like we did
    before, leaving the initial delay as long as it was.

I tried to integrate line-level log filtering with parent rewriting
into the regular history traversal, but, unfortunately, several
subtleties resisted... :)  Maybe someday we'll figure out how to do
that, but until then at least the simple and common (i.e. without
parent rewriting) 'git log -L:func:file' commands can benefit from the
reduced delay.

This change makes the failing 'parent oids without parent rewriting'
test in 't4211-line-log.sh' succeed.

The reduced delay is most noticable when there's a commit modifying
the line range near the tip of a large-ish revision range:

  # no parent rewriting requested, no commit-graph present
  $ time git --no-pager log -L:read_alternate_refs:sha1-file.c -1 v2.23.0

  Before:

    real    0m9.570s
    user    0m9.494s
    sys     0m0.076s

  After:

    real    0m0.718s
    user    0m0.674s
    sys     0m0.044s

A significant part of the remaining delay is spent reading and parsing
commit objects in limit_list().  With the help of the commit-graph we
can eliminate most of that reading and parsing overhead, so here are
the timing results of the same command as above, but this time using
the commit-graph:

  Before:

    real    0m8.874s
    user    0m8.816s
    sys     0m0.057s

  After:

    real    0m0.107s
    user    0m0.091s
    sys     0m0.013s

The next patch will further reduce the remaining delay.

To be clear: this patch doesn't actually optimize the line-level log,
but merely moves most of the work from the preprocessing step to the
history travelsal, so the commits modifying the line range can be
shown as soon as they are processed, and the traversal can be
terminated as soon as the given number of commits are shown.
Consequently, listing the full history of a line range, potentially
all the way to the root commit, will take the same time as before (but
at least the user might start reading the output earlier).
Furthermore, if the most recent commit modifying the line range is far
away from the starting revision, then that initial delay will still be
significant.

Additional testing by Derrick Stolee: In the Linux kernel repository,
the MAINTAINERS file was changed ~3,500 times across the ~915,000
commits. In addition to that edit frequency, the file itself is quite
large (~18,700 lines). This means that a significant portion of the
computation is taken up by computing the patch-diff of the file. This
patch improves the time it takes to output the first result quite a
bit:

Command: git log -L 100,200:MAINTAINERS -n 1 >/dev/null
 Before: 3.88 s
  After: 0.71 s

If we drop the "-n 1" in the command, then there is no change in
end-to-end process time. This is because the command still needs to
walk the entire commit history, which negates the point of this
patch. This is expected.

As a note for future reference, the ~4.3 seconds in the old code
spends ~2.6 seconds computing the patch-diffs, and the rest of the
time is spent walking commits and computing diffs for which paths
changed at each commit. The changed-path Bloom filters could improve
the end-to-end computation time (i.e. no "-n 1" in the command).

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 line-log.c          |  4 ++--
 line-log.h          |  2 ++
 revision.c          | 27 ++++++++++++++++++++++++++-
 t/t4211-line-log.sh |  2 +-
 4 files changed, 31 insertions(+), 4 deletions(-)

diff --git a/line-log.c b/line-log.c
index 9010e00950b..520ee715bcd 100644
--- a/line-log.c
+++ b/line-log.c
@@ -1227,7 +1227,7 @@ static int process_ranges_merge_commit(struct rev_info *rev, struct commit *comm
 	/* NEEDSWORK leaking like a sieve */
 }
 
-static int process_ranges_arbitrary_commit(struct rev_info *rev, struct commit *commit)
+int line_log_process_ranges_arbitrary_commit(struct rev_info *rev, struct commit *commit)
 {
 	struct line_log_data *range = lookup_line_range(rev, commit);
 	int changed = 0;
@@ -1270,7 +1270,7 @@ int line_log_filter(struct rev_info *rev)
 	while (list) {
 		struct commit_list *to_free = NULL;
 		commit = list->item;
-		if (process_ranges_arbitrary_commit(rev, commit)) {
+		if (line_log_process_ranges_arbitrary_commit(rev, commit)) {
 			*pp = list;
 			pp = &list->next;
 		} else
diff --git a/line-log.h b/line-log.h
index 882c5055bb8..82ae8d98a40 100644
--- a/line-log.h
+++ b/line-log.h
@@ -54,6 +54,8 @@ struct line_log_data {
 void line_log_init(struct rev_info *rev, const char *prefix, struct string_list *args);
 
 int line_log_filter(struct rev_info *rev);
+int line_log_process_ranges_arbitrary_commit(struct rev_info *rev,
+						    struct commit *commit);
 
 int line_log_print(struct rev_info *rev, struct commit *commit);
 
diff --git a/revision.c b/revision.c
index f78c636e4d0..3228db9af6d 100644
--- a/revision.c
+++ b/revision.c
@@ -39,6 +39,8 @@ static const char *term_good;
 
 implement_shared_commit_slab(revision_sources, char *);
 
+static inline int want_ancestry(const struct rev_info *revs);
+
 void show_object_with_name(FILE *out, struct object *obj, const char *name)
 {
 	const char *p;
@@ -3511,7 +3513,14 @@ int prepare_revision_walk(struct rev_info *revs)
 			sort_in_topological_order(&revs->commits, revs->sort_order);
 	} else if (revs->topo_order)
 		init_topo_walk(revs);
-	if (revs->line_level_traverse)
+	if (revs->line_level_traverse && want_ancestry(revs))
+		/*
+		 * At the moment we can only do line-level log with parent
+		 * rewriting by performing this expensive pre-filtering step.
+		 * If parent rewriting is not requested, then we rather
+		 * perform the line-level log filtering during the regular
+		 * history traversal.
+		 */
 		line_log_filter(revs);
 	if (revs->simplify_merges)
 		simplify_merges(revs);
@@ -3722,6 +3731,22 @@ enum commit_action get_commit_action(struct rev_info *revs, struct commit *commi
 		return commit_ignore;
 	if (commit->object.flags & UNINTERESTING)
 		return commit_ignore;
+	if (revs->line_level_traverse && !want_ancestry(revs)) {
+		/*
+		 * In case of line-level log with parent rewriting
+		 * prepare_revision_walk() already took care of all line-level
+		 * log filtering, and there is nothing left to do here.
+		 *
+		 * If parent rewriting was not requested, then this is the
+		 * place to perform the line-level log filtering.  Notably,
+		 * this check, though expensive, must come before the other,
+		 * cheaper filtering conditions, because the tracked line
+		 * ranges must be adjusted even when the commit will end up
+		 * being ignored based on other conditions.
+		 */
+		if (!line_log_process_ranges_arbitrary_commit(revs, commit))
+			return commit_ignore;
+	}
 	if (revs->min_age != -1 &&
 	    comparison_date(revs, commit) > revs->min_age)
 			return commit_ignore;
diff --git a/t/t4211-line-log.sh b/t/t4211-line-log.sh
index ea4a9398365..1428eae2629 100755
--- a/t/t4211-line-log.sh
+++ b/t/t4211-line-log.sh
@@ -263,7 +263,7 @@ test_expect_success 'setup for checking line-log and parent oids' '
 '
 
 # Parent oid should be from immediate parent.
-test_expect_failure 'parent oids without parent rewriting' '
+test_expect_success 'parent oids without parent rewriting' '
 	cat >expect <<-EOF &&
 	$head_oid $prev_oid Modify func2() in file.c
 	$root_oid  Add func1() and func2() in file.c
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 11/12] line-log: try to use generation number-based topo-ordering
  2020-05-01 15:30 [PATCH 00/12] Integrating line-log and changed-path Bloom filters Derrick Stolee via GitGitGadget
                   ` (9 preceding siblings ...)
  2020-05-01 15:30 ` [PATCH 10/12] line-log: more responsive, incremental 'git log -L' SZEDER Gábor via GitGitGadget
@ 2020-05-01 15:30 ` SZEDER Gábor via GitGitGadget
  2020-05-04 21:25   ` Taylor Blau
  2020-05-01 15:30 ` [PATCH 12/12] line-log: integrate with changed-path Bloom filters Derrick Stolee via GitGitGadget
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 42+ messages in thread
From: SZEDER Gábor via GitGitGadget @ 2020-05-01 15:30 UTC (permalink / raw)
  To: git
  Cc: peff, me, garimasigit, szeder.dev, jnareb, Derrick Stolee,
	SZEDER Gábor

From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= <szeder.dev@gmail.com>

The previous patch made it possible to perform line-level filtering
during history traversal instead of in an expensive preprocessing
step, but it still requires some simpler preprocessing steps, notably
topo-ordering.  However, nowadays we have commit-graphs storing
generation numbers, which make it possible to incrementally traverse
the history in topological order, without the preparatory limit_list()
and sort_in_topological_order() steps; see b45424181e (revision.c:
generation-based topo-order algorithm, 2018-11-01).

This patch combines the two, so we can do both the topo-ordering and
the line-level filtering during history traversal, eliminating even
those simpler preprocessing steps, and thus further reducing the delay
before showing the first commit modifying the given line range.

The 'revs->limited' flag plays the central role in this, because, due
to limitations of the current implementation, the generation
number-based topo-ordering is only enabled when this flag remains
unset.  Line-level log, however, always sets this flag in
setup_revisions() ever since the feature was introduced in 12da1d1f6f
(Implement line-history search (git log -L), 2013-03-28).  The reason
for setting 'limited' is unclear, though, because the line-level log
itself doesn't directly depend on it, and it doesn't affect how the
limit_list() function limits the revision range.  However, there is an
indirect dependency: the line-level log requires topo-ordering, and
the "traditional" sort_in_topological_order() requires an already
limited commit list since e6c3505b44 (Make sure we generate the whole
commit list before trying to sort it topologically, 2005-07-06).  The
new, generation numbers-based topo-ordering doesn't require a limited
commit list anymore.

So don't set 'revs->limited' for line-level log, unless it is really
necessary, namely:

  - The user explicitly requested parent rewriting, because that is
    still done in the line_log_filter() preprocessing step (see
    previous patch), which requires sort_in_topological_order() and in
    turn limit_list() as well.

  - A commit-graph file is not available or it doesn't yet contain
    generation numbers.  In these cases we had to fall back on
    sort_in_topological_order() and in turn limit_list().  The
    existing condition with generation_numbers_enabled() has already
    ensured that the 'limited' flag is set in these cases; this patch
    just makes sure that the line-level log sets 'revs->topo_order'
    before that condition.

While the reduced delay before showing the first commit is measurable
in git.git, it takes a bigger repository to make it clearly noticable.
In both cases below the line ranges were chosen so that they were
modified rather close to the starting revisions, so the effect of this
change is most noticable.

  # git.git
  $ time git --no-pager log -L:read_alternate_refs:sha1-file.c -1 v2.23.0

  Before:

    real    0m0.107s
    user    0m0.091s
    sys     0m0.013s

  After:

    real    0m0.058s
    user    0m0.050s
    sys     0m0.005s

  # linux.git
  $ time git --no-pager log \
    -L:build_restore_work_registers:arch/mips/mm/tlbex.c -1 v5.2

  Before:

    real   0m1.129s
    user   0m1.061s
    sys    0m0.069s

  After:

    real   0m0.096s
    user   0m0.087s
    sys    0m0.009s

Additional testing by Derrick Stolee: Since this patch improves
the performance for the first result, I repeated the experiment
from the previous patch on the Linux kernel repository:

    Command: git log -L 100,200:MAINTAINERS -n 1 >/dev/null
     Before: 0.71 s
      After: 0.05 s

Now, we have dropped the full topo-order of all ~910,000 commits
before reporting the first result. The remaining performance
improvements then are:

 1. Update the parent-rewriting logic to be incremental similar to
    how "git log --graph" behaves.

 2. Use changed-path Bloom filters to reduce the time spend in the
    tree-diff to see if the path(s) changed.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 revision.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/revision.c b/revision.c
index 3228db9af6d..3356ede9a20 100644
--- a/revision.c
+++ b/revision.c
@@ -2790,6 +2790,12 @@ int setup_revisions(int argc, const char **argv, struct rev_info *revs, struct s
 	if (revs->diffopt.objfind)
 		revs->simplify_history = 0;
 
+	if (revs->line_level_traverse) {
+		if (want_ancestry(revs))
+			revs->limited = 1;
+		revs->topo_order = 1;
+	}
+
 	if (revs->topo_order && !generation_numbers_enabled(the_repository))
 		revs->limited = 1;
 
@@ -2809,11 +2815,6 @@ int setup_revisions(int argc, const char **argv, struct rev_info *revs, struct s
 
 	revs->diffopt.abbrev = revs->abbrev;
 
-	if (revs->line_level_traverse) {
-		revs->limited = 1;
-		revs->topo_order = 1;
-	}
-
 	diff_setup_done(&revs->diffopt);
 
 	grep_commit_pattern_type(GREP_PATTERN_TYPE_UNSPECIFIED,
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 12/12] line-log: integrate with changed-path Bloom filters
  2020-05-01 15:30 [PATCH 00/12] Integrating line-log and changed-path Bloom filters Derrick Stolee via GitGitGadget
                   ` (10 preceding siblings ...)
  2020-05-01 15:30 ` [PATCH 11/12] line-log: try to use generation number-based topo-ordering SZEDER Gábor via GitGitGadget
@ 2020-05-01 15:30 ` Derrick Stolee via GitGitGadget
  2020-05-04 21:50   ` Taylor Blau
  2020-05-01 17:34 ` [PATCH 00/12] Integrating line-log and " Junio C Hamano
  2020-05-11 11:56 ` [PATCH v2 " Derrick Stolee via GitGitGadget
  13 siblings, 1 reply; 42+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-05-01 15:30 UTC (permalink / raw)
  To: git
  Cc: peff, me, garimasigit, szeder.dev, jnareb, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The previous changes to the line-log machinery focused on making the
first result appear faster. This was achieved by no longer walking the
entire commit history before returning the early results. There is still
another way to improve the performance: walk most commits much faster.
Let's use the changed-path Bloom filters to reduce time spent computing
diffs.

Since the line-log computation requires opening blobs and checking the
content-diff, there is still a lot of necessary computation that cannot
be replaced with changed-path Bloom filters. The part that we can reduce
is most effective when checking the history of a file that is deep in
several directories and those directories are modified frequently. In
this case, the computation to check if a commit is TREESAME to its first
parent takes a large fraction of the time. That is ripe for improvement
with changed-path Bloom filters.

We must ensure that prepare_to_use_bloom_filters() is called in
revision.c so that the bloom_filter_settings are loaded into the struct
rev_info from the commit-graph. Of course, some cases are still
forbidden, but in the line-log case the pathspec is provided in a
different way than normal.

Since multiple paths and segments could be requested, we compute the
struct bloom_key data dynamically during the commit walk. This could
likely be improved, but adds code complexity that is not valuable at
this time.

There are two cases to care about: merge commits and "ordinary" commits.
Merge commits have multiple parents, but if we are TREESAME to our first
parent in every range, then pass the blame for all ranges to the first
parent. Ordinary commits have the same condition, but each is done
slightly differently in the process_ranges_[merge|ordinary]_commit()
methods. By checking if the changed-path Bloom filter can guarantee
TREESAME, we can avoid that tree-diff cost. If the filter says "probably
changed", then we need to run the tree-diff and then the blob-diff if
there was a real edit.

The Linux kernel repository is a good testing ground for the performance
improvements claimed here. There are two different cases to test. The
first is the "entire history" case, where we output the entire history
to /dev/null to see how long it would take to compute the full line-log
history. The second is the "first result" case, where we find how long
it takes to show the first value, which is an indicator of how quickly a
user would see responses when waiting at a terminal.

To test, I selected the paths that were changed most frequently in the
top 10,000 commits using this command (stolen from StackOverflow [1]):

	git log --pretty=format: --name-only -n 10000 | sort | \
		uniq -c | sort -rg | head -10

which results in

    121 MAINTAINERS
     63 fs/namei.c
     60 arch/x86/kvm/cpuid.c
     59 fs/io_uring.c
     58 arch/x86/kvm/vmx/vmx.c
     51 arch/x86/kvm/x86.c
     45 arch/x86/kvm/svm.c
     42 fs/btrfs/disk-io.c
     42 Documentation/scsi/index.rst

(along with a bogus first result). It appears that the path
arch/x86/kvm/svm.c was renamed, so we ignore that entry. This leaves the
following results:

|                              | Entire History  | First Result    |
| Path                         | Before | After  | Before | After  |
|------------------------------|--------|--------|--------|--------|
| MAINTAINERS                  | 4.26 s | 3.87 s | 0.41 s | 0.39 s |
| fs/namei.c                   | 1.99 s | 0.99 s | 0.42 s | 0.21 s |
| arch/x86/kvm/cpuid.c         | 5.28 s | 1.12 s | 0.16 s | 0.09 s |
| fs/io_uring.c                | 4.34 s | 0.99 s | 0.94 s | 0.27 s |
| arch/x86/kvm/vmx/vmx.c       | 5.01 s | 1.34 s | 0.21 s | 0.12 s |
| arch/x86/kvm/x86.c           | 2.24 s | 1.18 s | 0.21 s | 0.14 s |
| fs/btrfs/disk-io.c           | 1.82 s | 1.01 s | 0.06 s | 0.05 s |
| Documentation/scsi/index.rst | 3.30 s | 0.89 s | 1.46 s | 0.03 s |

It is worth noting that the least speedup comes for the MAINTAINERS file
which is

 * edited frequently,
 * low in the directory heirarchy, and
 * quite a large file.

All of those points lead to spending more time doing the blob diff and
less time doing the tree diff. Still, we see some improvement in that
case and significant improvement in other cases. A 2-4x speedup is
likely the more typical case as opposed to the small 5% change for that
file.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 bloom.c    |  5 +++++
 bloom.h    |  1 +
 line-log.c | 39 ++++++++++++++++++++++++++++++++++++++-
 revision.c |  5 ++++-
 4 files changed, 48 insertions(+), 2 deletions(-)

diff --git a/bloom.c b/bloom.c
index 9a8386ac676..e22c0108f39 100644
--- a/bloom.c
+++ b/bloom.c
@@ -138,6 +138,11 @@ void fill_bloom_key(const char *data,
 		key->hashes[i] = hash0 + i * hash1;
 }
 
+void clear_bloom_key(struct bloom_key *key)
+{
+	FREE_AND_NULL(key->hashes);
+}
+
 void add_key_to_filter(const struct bloom_key *key,
 		       struct bloom_filter *filter,
 		       const struct bloom_filter_settings *settings)
diff --git a/bloom.h b/bloom.h
index a51e3715296..d0c69172e67 100644
--- a/bloom.h
+++ b/bloom.h
@@ -72,6 +72,7 @@ void fill_bloom_key(const char *data,
 		    size_t len,
 		    struct bloom_key *key,
 		    const struct bloom_filter_settings *settings);
+void clear_bloom_key(struct bloom_key *key);
 
 void add_key_to_filter(const struct bloom_key *key,
 		       struct bloom_filter *filter,
diff --git a/line-log.c b/line-log.c
index 520ee715bcd..7dc411da8f6 100644
--- a/line-log.c
+++ b/line-log.c
@@ -15,6 +15,7 @@
 #include "userdiff.h"
 #include "line-log.h"
 #include "argv-array.h"
+#include "bloom.h"
 
 static void range_set_grow(struct range_set *rs, size_t extra)
 {
@@ -1146,6 +1147,37 @@ int line_log_print(struct rev_info *rev, struct commit *commit)
 	return 1;
 }
 
+static int bloom_filter_check(struct rev_info *rev,
+			      struct commit *commit,
+			      struct line_log_data *range)
+{
+	struct bloom_filter *filter;
+	struct bloom_key key;
+	int result = 0;
+
+	if (!commit->parents)
+		return 1;
+
+	if (!rev->bloom_filter_settings ||
+	    !(filter = get_bloom_filter(rev->repo, commit, 0)))
+		return 1;
+
+	if (!range)
+		return 0;
+
+	while (!result && range) {
+		fill_bloom_key(range->path, strlen(range->path), &key, rev->bloom_filter_settings);
+
+		if (bloom_filter_contains(filter, &key, rev->bloom_filter_settings))
+			result = 1;
+
+		clear_bloom_key(&key);
+		range = range->next;
+	}
+
+	return result;
+}
+
 static int process_ranges_ordinary_commit(struct rev_info *rev, struct commit *commit,
 					  struct line_log_data *range)
 {
@@ -1159,6 +1191,7 @@ static int process_ranges_ordinary_commit(struct rev_info *rev, struct commit *c
 
 	queue_diffs(range, &rev->diffopt, &queue, commit, parent);
 	changed = process_all_files(&parent_range, rev, &queue, range);
+
 	if (parent)
 		add_line_range(rev, parent, parent_range);
 	free_line_log_data(parent_range);
@@ -1233,7 +1266,11 @@ int line_log_process_ranges_arbitrary_commit(struct rev_info *rev, struct commit
 	int changed = 0;
 
 	if (range) {
-		if (!commit->parents || !commit->parents->next)
+		if (commit->parents && !bloom_filter_check(rev, commit, range)) {
+			struct line_log_data *prange = line_log_data_copy(range);
+			add_line_range(rev, commit->parents->item, prange);
+			clear_commit_line_range(rev, commit);
+		} else if (!commit->parents || !commit->parents->next)
 			changed = process_ranges_ordinary_commit(rev, commit, range);
 		else
 			changed = process_ranges_merge_commit(rev, commit, range);
diff --git a/revision.c b/revision.c
index 3356ede9a20..cbf4b61aa67 100644
--- a/revision.c
+++ b/revision.c
@@ -689,6 +689,9 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 	if (!revs->bloom_filter_settings)
 		return;
 
+	if (!revs->pruning.pathspec.nr)
+		return;
+
 	pi = &revs->pruning.pathspec.items[0];
 	last_index = pi->len - 1;
 
@@ -3501,7 +3504,7 @@ int prepare_revision_walk(struct rev_info *revs)
 				       FOR_EACH_OBJECT_PROMISOR_ONLY);
 	}
 
-	if (revs->pruning.pathspec.nr == 1 && !revs->reflog_info)
+	if (!revs->reflog_info)
 		prepare_to_use_bloom_filter(revs);
 	if (revs->no_walk != REVISION_WALK_NO_WALK_UNSORTED)
 		commit_list_sort_by_date(&revs->commits);
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH 00/12] Integrating line-log and changed-path Bloom filters
  2020-05-01 15:30 [PATCH 00/12] Integrating line-log and changed-path Bloom filters Derrick Stolee via GitGitGadget
                   ` (11 preceding siblings ...)
  2020-05-01 15:30 ` [PATCH 12/12] line-log: integrate with changed-path Bloom filters Derrick Stolee via GitGitGadget
@ 2020-05-01 17:34 ` Junio C Hamano
  2020-05-11 11:56 ` [PATCH v2 " Derrick Stolee via GitGitGadget
  13 siblings, 0 replies; 42+ messages in thread
From: Junio C Hamano @ 2020-05-01 17:34 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, peff, me, garimasigit, szeder.dev, jnareb, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> This series is organized as follows:
>
>  * Patches 1-4 are minor Bloom filter cleanups, including a documentation
>    bug.
>    
>    
>  * Patch 5 fixes a problem where the Bloom filters use much larger filters
>    than expected.
>    
>    
>  * Patch 6 fixes a bug related to the short-circuit of large diffs from
>    e369698016 (diff: halt tree-diff early after max_changes, 2020-03-30).
>    The condition for halting the diff computation is different than the
>    check in bloom.c to see if that short-circuit was hit, which leads to
>    some commits with large diffs getting an incomplete Bloom filter. This
>    happened on the root commit of the Linux kernel repo, for example.
> ...   
> I organized these patches so patches 1-6 could be split into its own topic,
> if beneficial to take earlier than the line-log changes.

Nice.

> The end result of this combined effort is the following: git log -L commands
> feel much more responsive to a terminal user because Szeder's changes make
> the computation incremental, and they have better end-to-end performance
> because of the integration with Bloom filters to reduce time spent running
> tree diffs for TREESAME commits.

Nice, again.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 01/12] bloom: fix whitespace around tab length
  2020-05-01 15:30 ` [PATCH 01/12] bloom: fix whitespace around tab length Derrick Stolee via GitGitGadget
@ 2020-05-01 22:51   ` Taylor Blau
  0 siblings, 0 replies; 42+ messages in thread
From: Taylor Blau @ 2020-05-01 22:51 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, peff, me, garimasigit, szeder.dev, jnareb, Derrick Stolee

On Fri, May 01, 2020 at 03:30:18PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> Fix alignment issues that were likely introduced due to an editor
> using tab lengths of 4 instead of 8.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---

All looks good, thanks.

  Reviewed-by: Taylor Blau <me@ttaylorr.com>

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 02/12] test-bloom: fix usage typo
  2020-05-01 15:30 ` [PATCH 02/12] test-bloom: fix usage typo Derrick Stolee via GitGitGadget
@ 2020-05-01 22:51   ` Taylor Blau
  0 siblings, 0 replies; 42+ messages in thread
From: Taylor Blau @ 2020-05-01 22:51 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, peff, me, garimasigit, szeder.dev, jnareb, Derrick Stolee

On Fri, May 01, 2020 at 03:30:19PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  t/helper/test-bloom.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
> index 77eb27adac7..00c1d44a561 100644
> --- a/t/helper/test-bloom.c
> +++ b/t/helper/test-bloom.c
> @@ -44,7 +44,7 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
>  }
>
>  static const char *bloom_usage = "\n"
> -"  test-tool bloom get_murmer3 <string>\n"
> +"  test-tool bloom get_murmur3 <string>\n"

I stared at this change for _far_ longer than I'm willing to admit
trying to find the difference between these lines. Looks good, though.

>  "  test-tool bloom generate_filter <string> [<string>...]\n"
>  "  test-tool get_filter_for_commit <commit-hex>\n";
>
> --
> gitgitgadget

  Reviewed-by: Taylor Blau <me@ttaylorr.com>

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 03/12] Documentation: changed-path Bloom filters use byte words
  2020-05-01 15:30 ` [PATCH 03/12] Documentation: changed-path Bloom filters use byte words Derrick Stolee via GitGitGadget
@ 2020-05-01 22:55   ` Taylor Blau
  0 siblings, 0 replies; 42+ messages in thread
From: Taylor Blau @ 2020-05-01 22:55 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, peff, me, garimasigit, szeder.dev, jnareb, Derrick Stolee

On Fri, May 01, 2020 at 03:30:20PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> In Documentation/technical/commit-graph-format.txt, the definition
> of the BIDX chunk specifies the length is a number of 8-byte words.
> During development we discovered that using 8-byte words in the
> Murmur3 hash algorithm causes issues with Big-Endian versus Little-
> Endian machines. Thus, the hash algorithm was adapted to work on a

I'm nit-picking here, but I believe that "Little Endian" is usually
spelled "little endian", right?

At least, 'git log --oneline -Gendian | wc -l' doesn't return many/any
results for me, but '-GEndian' does.

> byte-by-byte basis. However, this caused a change in the definition
> of a "word" in bloom.h. Now, a "word" is a single byte, which allows
> filters to be as small as two bytes. These length-two filters are
> demonstrated in t0095-bloom.sh, and a larger filter of length 25 is
> demonstrated as well.
>
> The original point of using 8-byte words was for alignment reasons.
> It also presented opportunities for extremely sparse Bloom filters
> when there were a small number of changes at a commit, creating a
> very low false-positive rate. However, modifying the format at this
> point is unlikely to be a valuable exercise. Also, this use of
> single-byte granularity does present opportunities to save space.
> It is unclear if 8-byte alignment of the filters would present any
> meaningful performance benefits.
>
> Modify the format document to reflect reality.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

The rest of this patch looks good, and indeed matches reality ;).

  Reviewed-by: Taylor Blau <me@ttaylorr.com>

> ---
>  Documentation/technical/commit-graph-format.txt | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
> index de56f9f1efd..1beef171822 100644
> --- a/Documentation/technical/commit-graph-format.txt
> +++ b/Documentation/technical/commit-graph-format.txt
> @@ -97,10 +97,10 @@ CHUNK DATA:
>        bit on. The other bits correspond to the position of the last parent.
>
>    Bloom Filter Index (ID: {'B', 'I', 'D', 'X'}) (N * 4 bytes) [Optional]
> -    * The ith entry, BIDX[i], stores the number of 8-byte word blocks in all
> -      Bloom filters from commit 0 to commit i (inclusive) in lexicographic
> -      order. The Bloom filter for the i-th commit spans from BIDX[i-1] to
> -      BIDX[i] (plus header length), where BIDX[-1] is 0.
> +    * The ith entry, BIDX[i], stores the number of bytes in all Bloom filters
> +      from commit 0 to commit i (inclusive) in lexicographic order. The Bloom
> +      filter for the i-th commit spans from BIDX[i-1] to BIDX[i] (plus header
> +      length), where BIDX[-1] is 0.
>      * The BIDX chunk is ignored if the BDAT chunk is not present.
>
>    Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]
> --
> gitgitgadget

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 04/12] bloom: de-duplicate directory entries
  2020-05-01 15:30 ` [PATCH 04/12] bloom: de-duplicate directory entries Derrick Stolee via GitGitGadget
@ 2020-05-01 23:06   ` Taylor Blau
  0 siblings, 0 replies; 42+ messages in thread
From: Taylor Blau @ 2020-05-01 23:06 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, peff, me, garimasigit, szeder.dev, jnareb, Derrick Stolee

On Fri, May 01, 2020 at 03:30:21PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> When computing a changed-path Bloom filter, we need to take the
> files that changed from the diff computation and extract the parent
> directories. That way, a directory pathspec such as "Documentation"
> could match commits that change "Documentation/git.txt".
>
> However, the current code does a poor job of this process. The paths
> are added to a hashmap, but we do not check if an entry already
> exists with that path. This can create many duplicate entries and
> cause the filter to have a much larger length than it should. This
> means that the filter is more sparse than intended, which helps the
> false positive rate, but wastes a lot of space.
>
> Properly use hashmap_get() before hashmap_add(). Also be sure to
> include a comparison function so these can be matched correctly.
>
> This has an effect on a test in t0095-bloom.sh. This makes sense,
> there are ten changes inside "smallDir" so the total number of
> paths in the filter should be 11. This would result in 11 * 10 bits
> required, and with 8 bits per byte, this results in 14 bytes.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  bloom.c          | 35 ++++++++++++++++++++++++++---------
>  t/t0095-bloom.sh |  4 ++--
>  2 files changed, 28 insertions(+), 11 deletions(-)
>
> diff --git a/bloom.c b/bloom.c
> index 2e3e0f5037b..eb08571c628 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -158,6 +158,19 @@ void init_bloom_filters(void)
>  	init_bloom_filter_slab(&bloom_filters);
>  }
>
> +static int pathmap_cmp(const void *hashmap_cmp_fn_data,
> +		       const struct hashmap_entry *eptr,
> +		       const struct hashmap_entry *entry_or_key,
> +		       const void *keydata)
> +{
> +	const struct pathmap_hash_entry *e1, *e2;
> +
> +	e1 = container_of(eptr, const struct pathmap_hash_entry, entry);
> +	e2 = container_of(entry_or_key, const struct pathmap_hash_entry, entry);
> +
> +	return strcmp(e1->path, e2->path);
> +}
> +
>  struct bloom_filter *get_bloom_filter(struct repository *r,
>  				      struct commit *c,
>  				      int compute_if_not_present)
> @@ -203,25 +216,29 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>  		struct hashmap pathmap;
>  		struct pathmap_hash_entry *e;
>  		struct hashmap_iter iter;
> -		hashmap_init(&pathmap, NULL, NULL, 0);
> +		hashmap_init(&pathmap, pathmap_cmp, NULL, 0);
>
>  		for (i = 0; i < diff_queued_diff.nr; i++) {
>  			const char *path = diff_queued_diff.queue[i]->two->path;
>
>  			/*
> -			* Add each leading directory of the changed file, i.e. for
> -			* 'dir/subdir/file' add 'dir' and 'dir/subdir' as well, so
> -			* the Bloom filter could be used to speed up commands like
> -			* 'git log dir/subdir', too.
> -			*
> -			* Note that directories are added without the trailing '/'.
> -			*/
> +			 * Add each leading directory of the changed file, i.e. for
> +			 * 'dir/subdir/file' add 'dir' and 'dir/subdir' as well, so
> +			 * the Bloom filter could be used to speed up commands like
> +			 * 'git log dir/subdir', too.
> +			 *
> +			 * Note that directories are added without the trailing '/'.
> +			 */
>  			do {
>  				char *last_slash = strrchr(path, '/');
>
>  				FLEX_ALLOC_STR(e, path, path);
>  				hashmap_entry_init(&e->entry, strhash(path));
> -				hashmap_add(&pathmap, &e->entry);
> +
> +				if (!hashmap_get(&pathmap, &e->entry, NULL))
> +					hashmap_add(&pathmap, &e->entry);
> +				else
> +					free(e);

Yep; this seems right on: if we have a match in the hashmap already,
there's no need to add another one. In fact, it's much better not to, as
you point out.

In the case that we didn't need to add one, we can free up the
corresponding resources (I had to double-check that 'free()' would
appropriately discard the flex array, but that's fine).

Nice find!

>  				if (!last_slash)
>  					last_slash = (char*)path;
> diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
> index 8f9eef116dc..6defeb544f1 100755
> --- a/t/t0095-bloom.sh
> +++ b/t/t0095-bloom.sh
> @@ -89,8 +89,8 @@ test_expect_success 'get bloom filter for commit with 10 changes' '
>  	git add smallDir &&
>  	git commit -m "commit with 10 changes" &&
>  	cat >expect <<-\EOF &&
> -	Filter_Length:25
> -	Filter_Data:82|a0|65|47|0c|92|90|c0|a1|40|02|a0|e2|40|e0|04|0a|9a|66|cf|80|19|85|42|23|
> +	Filter_Length:14
> +	Filter_Data:02|b3|c4|a0|34|e7|fe|eb|cb|47|fe|a0|e8|72|
>  	EOF
>  	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
>  	test_cmp expect actual
> --
> gitgitgadget

Makes sense, I figured that there'd be a little bit of test-fallout
here.

  Reviewed-by: Taylor Blau <me@ttaylorr.com>

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 05/12] bloom: parse commit before computing filters
  2020-05-01 15:30 ` [PATCH 05/12] bloom: parse commit before computing filters Derrick Stolee via GitGitGadget
@ 2020-05-01 23:10   ` Taylor Blau
  0 siblings, 0 replies; 42+ messages in thread
From: Taylor Blau @ 2020-05-01 23:10 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, peff, me, garimasigit, szeder.dev, jnareb, Derrick Stolee

On Fri, May 01, 2020 at 03:30:22PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> When computing changed-path Bloom filters for a commit, we need to
> know if the commit has a parent or not. If the commit is not parsed,
> then its parent pointer will be NULL.
>
> As far as I can tell, the only opportunity to reach this code
> without parsing the commit is inside "test-tool bloom
> get_filter_for_commit" but it is best to be safe.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  bloom.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/bloom.c b/bloom.c
> index eb08571c628..c3b81b1a38a 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -206,6 +206,9 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>  	diffopt.max_changes = max_changes;
>  	diff_setup_done(&diffopt);
>
> +	/* ensure commit is parsed so we have parent information */

I don't think that this comment is critical to the change, since
familiar readers will know why this 'parse_commit' call is here, but I
don't really mind either way...

> +	parse_commit(c);
> +

OK. I figure that we could force the callers to parse these commits, but
I think that this approach makes more sense. This should be a no-op for
already-parsed commits, which means we'll get out of this call pretty
quickly.

The benefit there, of course, would be that we don't have to worry about
some callers forgetting to make sure that they are passing
already-parsed commits, and that the trade-off we make to allow such a
thing is pretty cheap, relatively speaking.

>  	if (c->parents)
>  		diff_tree_oid(&c->parents->item->object.oid, &c->object.oid, "", &diffopt);
>  	else
> --
> gitgitgadget

Makes sense, thanks.

  Reviewed-by: Taylor Blau <me@ttaylorr.com>

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 06/12] bloom: use num_changes not nr for limit detection
  2020-05-01 15:30 ` [PATCH 06/12] bloom: use num_changes not nr for limit detection Derrick Stolee via GitGitGadget
@ 2020-05-01 23:12   ` Taylor Blau
  0 siblings, 0 replies; 42+ messages in thread
From: Taylor Blau @ 2020-05-01 23:12 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, peff, me, garimasigit, szeder.dev, jnareb, Derrick Stolee

On Fri, May 01, 2020 at 03:30:23PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> As diff_tree_oid() computes a diff, it will terminate early if the
> total number of changed paths is strictly larger than max_changes.
> This includes the directories that changed, not just the file paths.
> However, only the file paths are reflected in the resulting diff
> queue "nr" value.

Nit; s/queue/queue's

>
> Use the "num_changes" from diffopt to check if the diff terminated
> early. This is incredibly important, as it can result in incorrect
> filters! For example, the first commit in the Linux kernel repo
> reports only 471 changes, but since these are nested inside several
> directories they expand to 513 "real" changes, and in fact the
> total list of changes is not reported. Thus, the computed filter
> for this commit is incorrect.

Wow, this is a great find. Nicely done!

> Demonstrate the subtle difference by using one fewer file change
> in the 'get bloom filter for commit with 513 changes' test. Before,
> this edited 513 files inside "bigDir" which hit this inequality.
> However, dropping the file count by one demonstrates how the
> previous inequality was incorrect but the new one is correct.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  bloom.c          | 2 +-
>  t/t0095-bloom.sh | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/bloom.c b/bloom.c
> index c3b81b1a38a..9a8386ac676 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -215,7 +215,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>  		diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
>  	diffcore_std(&diffopt);
>
> -	if (diff_queued_diff.nr <= max_changes) {
> +	if (diffopt.num_changes <= max_changes) {
>  		struct hashmap pathmap;
>  		struct pathmap_hash_entry *e;
>  		struct hashmap_iter iter;
> diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
> index 6defeb544f1..48a90625596 100755
> --- a/t/t0095-bloom.sh
> +++ b/t/t0095-bloom.sh
> @@ -100,7 +100,7 @@ test_expect_success EXPENSIVE 'get bloom filter for commit with 513 changes' '
>  	rm actual &&
>  	rm expect &&
>  	mkdir bigDir &&
> -	for i in $(test_seq 0 512)
> +	for i in $(test_seq 0 511)

Thanks for demonstrating the fix. This is change is very subtle, but I
think that it's right on.

>  	do
>  		echo $i >bigDir/$i
>  	done &&
> --
> gitgitgadget

  Reviewed-by: Taylor Blau <me@ttaylorr.com>

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 07/12] completion: offer '--(no-)patch' among 'git log' options
  2020-05-01 15:30 ` [PATCH 07/12] completion: offer '--(no-)patch' among 'git log' options SZEDER Gábor via GitGitGadget
@ 2020-05-01 23:44   ` Taylor Blau
  0 siblings, 0 replies; 42+ messages in thread
From: Taylor Blau @ 2020-05-01 23:44 UTC (permalink / raw)
  To: SZEDER Gábor via GitGitGadget
  Cc: git, peff, me, garimasigit, szeder.dev, jnareb, Derrick Stolee

On Fri, May 01, 2020 at 03:30:24PM +0000, SZEDER Gábor via GitGitGadget wrote:
> From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= <szeder.dev@gmail.com>

This doesn't look quite right ;). Interestingly, Git applies it just
fine, and your encoding looks reasonable to me.

> Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  contrib/completion/git-completion.bash | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/contrib/completion/git-completion.bash b/contrib/completion/git-completion.bash
> index c21786f2fd0..ec6ff1d5fb8 100644
> --- a/contrib/completion/git-completion.bash
> +++ b/contrib/completion/git-completion.bash
> @@ -1860,6 +1860,7 @@ _git_log ()
>  			$merge
>  			$__git_diff_common_options
>  			--pickaxe-all --pickaxe-regex
> +			--patch --no-patch
>  			"
>  		return
>  		;;
> --
> gitgitgadget
>
Thanks,
Taylor

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 08/12] line-log: remove unused fields from 'struct line_log_data'
  2020-05-01 15:30 ` [PATCH 08/12] line-log: remove unused fields from 'struct line_log_data' SZEDER Gábor via GitGitGadget
@ 2020-05-01 23:46   ` Taylor Blau
  0 siblings, 0 replies; 42+ messages in thread
From: Taylor Blau @ 2020-05-01 23:46 UTC (permalink / raw)
  To: SZEDER Gábor via GitGitGadget
  Cc: git, peff, me, garimasigit, szeder.dev, jnareb, Derrick Stolee

On Fri, May 01, 2020 at 03:30:25PM +0000, SZEDER Gábor via GitGitGadget wrote:
> From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= <szeder.dev@gmail.com>
>
> Remove the unused fields 'status', 'arg_alloc', 'arg_nr' and 'args'
> from 'struct line_log_data'.  They were already part of the struct
> when it was introduced in commit 12da1d1f6 (Implement line-history
> search (git log -L), 2013-03-28), but as far as I can tell none of
> them have ever been actually used.
>
> Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  line-log.h | 3 ---
>  1 file changed, 3 deletions(-)
>
> diff --git a/line-log.h b/line-log.h
> index 8ee7a2bd4a1..882c5055bb8 100644
> --- a/line-log.h
> +++ b/line-log.h
> @@ -46,10 +46,7 @@ void sort_and_merge_range_set(struct range_set *);
>  struct line_log_data {
>  	struct line_log_data *next;
>  	char *path;
> -	char status;
>  	struct range_set ranges;
> -	int arg_alloc, arg_nr;
> -	const char **args;
>  	struct diff_filepair *pair;
>  	struct diff_ranges diff;
>  };
> --
> gitgitgadget

Very sensible.

  Reviewed-by: Taylor Blau <me@ttaylorr.com>

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 09/12]  t4211-line-log: add tests for parent oids
  2020-05-01 15:30 ` [PATCH 09/12] t4211-line-log: add tests for parent oids SZEDER Gábor via GitGitGadget
@ 2020-05-04 17:31   ` Taylor Blau
  0 siblings, 0 replies; 42+ messages in thread
From: Taylor Blau @ 2020-05-04 17:31 UTC (permalink / raw)
  To: SZEDER Gábor via GitGitGadget
  Cc: git, peff, me, garimasigit, szeder.dev, jnareb, Derrick Stolee

On Fri, May 01, 2020 at 03:30:26PM +0000, SZEDER Gábor via GitGitGadget wrote:
> From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= <szeder.dev@gmail.com>
>
> None of the tests in 't4211-line-log.sh' really check which parent
> object IDs are shown in the output, either implicitly as part of
> "Merge: ..." lines [1] or explicitly via the '%p' or '%P' format
> specifiers in a custom pretty format.
>
> Add two tests to 't4211-line-log.sh' to check which parent object IDs
> are shown, one without and one with explicitly requested parent
> rewriting, IOW without and with the '--parents' option.
>
> The test without '--parents' is marked as failing, because without
> that option parent rewriting should not be performed, and thus the
> parent object ID should be that of the immediate parent, just like in
> case of a pathspec-limited history traversal without parent rewriting.
> The current line-level log implementation, however, performs parent
> rewriting unconditionally and without a possibility to turn it off,
> and, consequently, it shows the object ID of the most recent ancestor
> that modified the given line range.
>
> In both of these new tests we only really care about the object IDs of
> the listed commits and their parents, but not the diffs of the line
> ranges; the diffs have already been thoroughly checked in the previous
> tests.
>
> [1] While one of the tests ('-M -L ':f:b.c' parallel-change') does
>     list a merge commit, both of its parents happen to modify the
>     given line range and are listed as well, so the implications of
>     parent rewriting remained hidden and untested.
>
> Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---

These tests make sense, and demonstrate the problem nicely. I don't
think that I have a ton to add to this patch, so:

  Reviewed-by: Taylor Blau <me@ttaylorr.com>

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 10/12] line-log: more responsive, incremental 'git log -L'
  2020-05-01 15:30 ` [PATCH 10/12] line-log: more responsive, incremental 'git log -L' SZEDER Gábor via GitGitGadget
@ 2020-05-04 17:52   ` Taylor Blau
  2020-05-04 17:55     ` Derrick Stolee
  0 siblings, 1 reply; 42+ messages in thread
From: Taylor Blau @ 2020-05-04 17:52 UTC (permalink / raw)
  To: SZEDER Gábor via GitGitGadget
  Cc: git, peff, me, garimasigit, szeder.dev, jnareb, Derrick Stolee

On Fri, May 01, 2020 at 03:30:27PM +0000, SZEDER Gábor via GitGitGadget wrote:
> From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= <szeder.dev@gmail.com>
>
> The current line-level log implementation performs a preprocessing
> step in prepare_revision_walk(), during which the line_log_filter()
> function filters and rewrites history to keep only commits modifying
> the given line range.  This preprocessing affects both responsiveness
> and correctness:
>
>   - Git doesn't produce any output during this preprocessing step.
>     Checking whether a commit modified the given line range is
>     somewhat expensive, so depending on the size of the given revision
>     range this preprocessing can result in a significant delay before
>     the first commit is shown.
>
>   - Limiting the number of displayed commits (e.g. 'git log -3 -L...')
>     doesn't limit the amount of work during preprocessing, because
>     that limit is applied during history traversal.  Alas, by that
>     point this expensive preprocessing step has already churned
>     through the whole revision range to find all commits modifying the
>     revision range, even though only a few of them need to be shown.
>
>   - It rewrites parents, with no way to turn it off.  Without the user
>     explicitly requesting parent rewriting any parent object ID shown
>     should be that of the immediate parent, just like in case of a
>     pathspec-limited history traversal without parent rewriting.
>
>     However, after that preprocessing step rewrote history, the
>     subsequent "regular" history traversal (i.e. get_revision() in a
>     loop) only sees commits modifying the given line range.
>     Consequently, it can only show the object ID of the last ancestor
>     that modified the given line range (which might happen to be the
>     immediate parent, but many-many times it isn't).

Thanks for a very clear and straightforward explanation of the problem
that you are trying to solve, here.

> This patch addresses both the correctness and, at least for the common
> case, the responsiveness issues by integrating line-level log
> filtering into the regular revision walking machinery:
>
>   - Make process_ranges_arbitrary_commit(), the static function in
>     'line-log.c' deciding whether a commit modifies the given line
>     range, public by removing the static keyword and adding the
>     'line_log_' prefix, so it can be called from other parts of the
>     revision walking machinery.
>
>   - If the user didn't explicitly ask for parent rewriting (which, I
>     believe, is the most common case):
>
>     - Call this now-public function during regular history traversal,
>       namely from get_commit_action() to ignore any commits not
>       modifying the given line range.
>
>       Note that while this check is relatively expensive, it must be
>       performed before other, much cheaper conditions, because the
>       tracked line range must be adjusted even when the commit will
>       end up being ignored by other conditions.
>
>     - Skip the line_log_filter() call, i.e. the expensive
>       preprocessing step, in prepare_revision_walk(), because, thanks
>       to the above points, the revision walking machinery is now able
>       to filter out commits not modifying the given line range while
>       traversing history.
>
>       This way the regular history traversal sees the unmodified
>       history, and is therefore able to print the object ids of the
>       immediate parents of the listed commits.  The eliminated
>       preprocessing step can greatly reduce the delay before the first
>       commit is shown, see the numbers below.

Nicely done.

>   - However, if the user did explicitly ask for parent rewriting via
>     '--parents' or a similar option, then stick with the current
>     implementation for now, i.e. perform that expensive filtering and
>     history rewriting in the preprocessing step just like we did
>     before, leaving the initial delay as long as it was.

Right; here we're stuck taking the slow path. Maybe in the future that
could become faster, too, if the revision walking machinery knew how to
perform this filtering, but not today. Makes sense.

> I tried to integrate line-level log filtering with parent rewriting
> into the regular history traversal, but, unfortunately, several
> subtleties resisted... :)  Maybe someday we'll figure out how to do
> that, but until then at least the simple and common (i.e. without
> parent rewriting) 'git log -L:func:file' commands can benefit from the
> reduced delay.

:)

> This change makes the failing 'parent oids without parent rewriting'
> test in 't4211-line-log.sh' succeed.
>
> The reduced delay is most noticable when there's a commit modifying
> the line range near the tip of a large-ish revision range:
>
>   # no parent rewriting requested, no commit-graph present
>   $ time git --no-pager log -L:read_alternate_refs:sha1-file.c -1 v2.23.0
>
>   Before:
>
>     real    0m9.570s
>     user    0m9.494s
>     sys     0m0.076s
>
>   After:
>
>     real    0m0.718s
>     user    0m0.674s
>     sys     0m0.044s
>
> A significant part of the remaining delay is spent reading and parsing
> commit objects in limit_list().  With the help of the commit-graph we
> can eliminate most of that reading and parsing overhead, so here are
> the timing results of the same command as above, but this time using
> the commit-graph:
>
>   Before:
>
>     real    0m8.874s
>     user    0m8.816s
>     sys     0m0.057s
>
>   After:
>
>     real    0m0.107s
>     user    0m0.091s
>     sys     0m0.013s
>
> The next patch will further reduce the remaining delay.
>
> To be clear: this patch doesn't actually optimize the line-level log,
> but merely moves most of the work from the preprocessing step to the
> history travelsal, so the commits modifying the line range can be

s/travelsal/traversal

> shown as soon as they are processed, and the traversal can be
> terminated as soon as the given number of commits are shown.
> Consequently, listing the full history of a line range, potentially
> all the way to the root commit, will take the same time as before (but
> at least the user might start reading the output earlier).
> Furthermore, if the most recent commit modifying the line range is far
> away from the starting revision, then that initial delay will still be
> significant.
>
> Additional testing by Derrick Stolee: In the Linux kernel repository,
> the MAINTAINERS file was changed ~3,500 times across the ~915,000
> commits. In addition to that edit frequency, the file itself is quite
> large (~18,700 lines). This means that a significant portion of the
> computation is taken up by computing the patch-diff of the file. This
> patch improves the time it takes to output the first result quite a
> bit:
>
> Command: git log -L 100,200:MAINTAINERS -n 1 >/dev/null
>  Before: 3.88 s
>   After: 0.71 s

Are these 'real' times, or user/sys times?

> If we drop the "-n 1" in the command, then there is no change in
> end-to-end process time. This is because the command still needs to
> walk the entire commit history, which negates the point of this
> patch. This is expected.
>
> As a note for future reference, the ~4.3 seconds in the old code
> spends ~2.6 seconds computing the patch-diffs, and the rest of the
> time is spent walking commits and computing diffs for which paths
> changed at each commit. The changed-path Bloom filters could improve
> the end-to-end computation time (i.e. no "-n 1" in the command).
>
> Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  line-log.c          |  4 ++--
>  line-log.h          |  2 ++
>  revision.c          | 27 ++++++++++++++++++++++++++-
>  t/t4211-line-log.sh |  2 +-
>  4 files changed, 31 insertions(+), 4 deletions(-)
>
> diff --git a/line-log.c b/line-log.c
> index 9010e00950b..520ee715bcd 100644
> --- a/line-log.c
> +++ b/line-log.c
> @@ -1227,7 +1227,7 @@ static int process_ranges_merge_commit(struct rev_info *rev, struct commit *comm
>  	/* NEEDSWORK leaking like a sieve */
>  }
>
> -static int process_ranges_arbitrary_commit(struct rev_info *rev, struct commit *commit)
> +int line_log_process_ranges_arbitrary_commit(struct rev_info *rev, struct commit *commit)
>  {
>  	struct line_log_data *range = lookup_line_range(rev, commit);
>  	int changed = 0;
> @@ -1270,7 +1270,7 @@ int line_log_filter(struct rev_info *rev)
>  	while (list) {
>  		struct commit_list *to_free = NULL;
>  		commit = list->item;
> -		if (process_ranges_arbitrary_commit(rev, commit)) {
> +		if (line_log_process_ranges_arbitrary_commit(rev, commit)) {
>  			*pp = list;
>  			pp = &list->next;
>  		} else
> diff --git a/line-log.h b/line-log.h
> index 882c5055bb8..82ae8d98a40 100644
> --- a/line-log.h
> +++ b/line-log.h
> @@ -54,6 +54,8 @@ struct line_log_data {
>  void line_log_init(struct rev_info *rev, const char *prefix, struct string_list *args);
>
>  int line_log_filter(struct rev_info *rev);
> +int line_log_process_ranges_arbitrary_commit(struct rev_info *rev,
> +						    struct commit *commit);
>
>  int line_log_print(struct rev_info *rev, struct commit *commit);
>
> diff --git a/revision.c b/revision.c
> index f78c636e4d0..3228db9af6d 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -39,6 +39,8 @@ static const char *term_good;
>
>  implement_shared_commit_slab(revision_sources, char *);
>
> +static inline int want_ancestry(const struct rev_info *revs);
> +

I wouldn't be opposed to simply moving the definition of 'want_ancestry'
up here, since it is self-contained and doesn't depend on any other
definitions, except 'struct rev_info' itself, which comes from
'revision.h'.

>  void show_object_with_name(FILE *out, struct object *obj, const char *name)
>  {
>  	const char *p;
> @@ -3511,7 +3513,14 @@ int prepare_revision_walk(struct rev_info *revs)
>  			sort_in_topological_order(&revs->commits, revs->sort_order);
>  	} else if (revs->topo_order)
>  		init_topo_walk(revs);
> -	if (revs->line_level_traverse)
> +	if (revs->line_level_traverse && want_ancestry(revs))
> +		/*
> +		 * At the moment we can only do line-level log with parent
> +		 * rewriting by performing this expensive pre-filtering step.
> +		 * If parent rewriting is not requested, then we rather
> +		 * perform the line-level log filtering during the regular
> +		 * history traversal.

I think that this is just a style comment at this point, but I'm not
overly attached to the last three lines of this comment. I think that
the commit message that these lines would blame down to explains the
situation quite clearly, and the reader can search for the
'!want_ancestry' call a couple hundred lines later.

That said, I certainly won't be offended if you want to keep this
comment around; I just don't think that it's strictly necessary.

> +		 */
>  		line_log_filter(revs);
>  	if (revs->simplify_merges)
>  		simplify_merges(revs);
> @@ -3722,6 +3731,22 @@ enum commit_action get_commit_action(struct rev_info *revs, struct commit *commi
>  		return commit_ignore;
>  	if (commit->object.flags & UNINTERESTING)
>  		return commit_ignore;
> +	if (revs->line_level_traverse && !want_ancestry(revs)) {
> +		/*
> +		 * In case of line-level log with parent rewriting
> +		 * prepare_revision_walk() already took care of all line-level
> +		 * log filtering, and there is nothing left to do here.
> +		 *
> +		 * If parent rewriting was not requested, then this is the
> +		 * place to perform the line-level log filtering.  Notably,
> +		 * this check, though expensive, must come before the other,
> +		 * cheaper filtering conditions, because the tracked line
> +		 * ranges must be adjusted even when the commit will end up
> +		 * being ignored based on other conditions.
> +		 */
> +		if (!line_log_process_ranges_arbitrary_commit(revs, commit))
> +			return commit_ignore;
> +	}
>  	if (revs->min_age != -1 &&
>  	    comparison_date(revs, commit) > revs->min_age)
>  			return commit_ignore;
> diff --git a/t/t4211-line-log.sh b/t/t4211-line-log.sh
> index ea4a9398365..1428eae2629 100755
> --- a/t/t4211-line-log.sh
> +++ b/t/t4211-line-log.sh
> @@ -263,7 +263,7 @@ test_expect_success 'setup for checking line-log and parent oids' '
>  '
>
>  # Parent oid should be from immediate parent.
> -test_expect_failure 'parent oids without parent rewriting' '
> +test_expect_success 'parent oids without parent rewriting' '
>  	cat >expect <<-EOF &&
>  	$head_oid $prev_oid Modify func2() in file.c
>  	$root_oid  Add func1() and func2() in file.c
> --
> gitgitgadget

All makes sense, and the ending is very satisfying. Someone with a
little more familiarity with line-log should review this also. But, for
my part this has been:

  Reviewed-by: Taylor Blau <me@ttaylorr.com>

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 10/12] line-log: more responsive, incremental 'git log -L'
  2020-05-04 17:52   ` Taylor Blau
@ 2020-05-04 17:55     ` Derrick Stolee
  0 siblings, 0 replies; 42+ messages in thread
From: Derrick Stolee @ 2020-05-04 17:55 UTC (permalink / raw)
  To: Taylor Blau, SZEDER Gábor via GitGitGadget
  Cc: git, peff, garimasigit, szeder.dev, jnareb, Derrick Stolee

On 5/4/2020 1:52 PM, Taylor Blau wrote:
> On Fri, May 01, 2020 at 03:30:27PM +0000, SZEDER Gábor via GitGitGadget wrote:

>> To be clear: this patch doesn't actually optimize the line-level log,
>> but merely moves most of the work from the preprocessing step to the
>> history travelsal, so the commits modifying the line range can be
> 
> s/travelsal/traversal

Thanks!

>> shown as soon as they are processed, and the traversal can be
>> terminated as soon as the given number of commits are shown.
>> Consequently, listing the full history of a line range, potentially
>> all the way to the root commit, will take the same time as before (but
>> at least the user might start reading the output earlier).
>> Furthermore, if the most recent commit modifying the line range is far
>> away from the starting revision, then that initial delay will still be
>> significant.
>>
>> Additional testing by Derrick Stolee: In the Linux kernel repository,
>> the MAINTAINERS file was changed ~3,500 times across the ~915,000
>> commits. In addition to that edit frequency, the file itself is quite
>> large (~18,700 lines). This means that a significant portion of the
>> computation is taken up by computing the patch-diff of the file. This
>> patch improves the time it takes to output the first result quite a
>> bit:
>>
>> Command: git log -L 100,200:MAINTAINERS -n 1 >/dev/null
>>  Before: 3.88 s
>>   After: 0.71 s
> 
> Are these 'real' times, or user/sys times?

True, this is a difference between my reporting and Szeder's. I only
report "real" time. 

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 11/12] line-log: try to use generation number-based topo-ordering
  2020-05-01 15:30 ` [PATCH 11/12] line-log: try to use generation number-based topo-ordering SZEDER Gábor via GitGitGadget
@ 2020-05-04 21:25   ` Taylor Blau
  0 siblings, 0 replies; 42+ messages in thread
From: Taylor Blau @ 2020-05-04 21:25 UTC (permalink / raw)
  To: SZEDER Gábor via GitGitGadget
  Cc: git, peff, me, garimasigit, szeder.dev, jnareb, Derrick Stolee

On Fri, May 01, 2020 at 03:30:28PM +0000, SZEDER Gábor via GitGitGadget wrote:
> From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= <szeder.dev@gmail.com>
>
> The previous patch made it possible to perform line-level filtering
> during history traversal instead of in an expensive preprocessing
> step, but it still requires some simpler preprocessing steps, notably
> topo-ordering.  However, nowadays we have commit-graphs storing
> generation numbers, which make it possible to incrementally traverse
> the history in topological order, without the preparatory limit_list()
> and sort_in_topological_order() steps; see b45424181e (revision.c:
> generation-based topo-order algorithm, 2018-11-01).
>
> [snip]

This one makes sense, too. Sorry for a long gap in reviewing parts of
this series. A few other things have popped up throughout the day. I'll
get to the last one shortly.


Thanks,
Taylor

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 12/12] line-log: integrate with changed-path Bloom filters
  2020-05-01 15:30 ` [PATCH 12/12] line-log: integrate with changed-path Bloom filters Derrick Stolee via GitGitGadget
@ 2020-05-04 21:50   ` Taylor Blau
  0 siblings, 0 replies; 42+ messages in thread
From: Taylor Blau @ 2020-05-04 21:50 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, peff, me, garimasigit, szeder.dev, jnareb, Derrick Stolee

On Fri, May 01, 2020 at 03:30:29PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>

Extremely well done. These changes all make sense to me, and the
performance improvements are all very appealing. Thanks for a most
enjoyable read.

  Reviewed-by: Taylor Blau <me@ttaylorr.com>

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH v2 00/12] Integrating line-log and changed-path Bloom filters
  2020-05-01 15:30 [PATCH 00/12] Integrating line-log and changed-path Bloom filters Derrick Stolee via GitGitGadget
                   ` (12 preceding siblings ...)
  2020-05-01 17:34 ` [PATCH 00/12] Integrating line-log and " Junio C Hamano
@ 2020-05-11 11:56 ` Derrick Stolee via GitGitGadget
  2020-05-11 11:56   ` [PATCH v2 01/12] bloom: fix whitespace around tab length Derrick Stolee via GitGitGadget
                     ` (11 more replies)
  13 siblings, 12 replies; 42+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-05-11 11:56 UTC (permalink / raw)
  To: git; +Cc: peff, me, garimasigit, szeder.dev, jnareb, Derrick Stolee

The best way to learn how a feature works is to build on it. That's also the
best way to find defects.

I started on this patch series with two goals:

 1. Understand and submit Szeder's incremental line-log RFC [1].
 2. Integrate changed-path Bloom filters with line-log.

[1] https://lore.kernel.org/git/20190818182801.7580-1-szeder.dev@gmail.com/

Since I knew a lot about the Bloom filters and not much about line-log.c, I
expected to learn a lot that could contribute to improving Szeder's patches.
Instead, I discovered some deficiencies in the Bloom filter code, including
a few meaningful bugs that should be fixed before the topic is merge to
master.

This series is organized as follows:

 * Patches 1-4 are minor Bloom filter cleanups, including a documentation
   bug.
   
   
 * Patch 5 fixes a problem where the Bloom filters use much larger filters
   than expected.
   
   
 * Patch 6 fixes a bug related to the short-circuit of large diffs from
   e369698016 (diff: halt tree-diff early after max_changes, 2020-03-30).
   The condition for halting the diff computation is different than the
   check in bloom.c to see if that short-circuit was hit, which leads to
   some commits with large diffs getting an incomplete Bloom filter. This
   happened on the root commit of the Linux kernel repo, for example.
   
   
 * Patches 7-11 are Szeder's RFC, with my sign-off added. I have not changed
   the code from those submissions because I didn't find anything wrong with
   them as I was testing. However, I will take responsibility to respond to
   the feedback in this series.
   
   
 * Patch 12 integrates Bloom filters with the line-log code.
   
   

I organized these patches so patches 1-6 could be split into its own topic,
if beneficial to take earlier than the line-log changes.

The end result of this combined effort is the following: git log -L commands
feel much more responsive to a terminal user because Szeder's changes make
the computation incremental, and they have better end-to-end performance
because of the integration with Bloom filters to reduce time spent running
tree diffs for TREESAME commits.

The following table is also in patch 12, but it summarizes the results of
this series. These are timings for running git log -L 1,10:$path for these
recently-edited paths. The "Entire History" columns wait for the full
command to complete. The "First Result" columns add "-n 1" to the command,
so we see the benefits of the incremental algorithm. Thus, the performance
improvements from "Entire History" to "First Result" are due entirely to
Szeder's patches. The performance improvements from "Before" to "After" are
due to the changed-path Bloom filters.

|                              | Entire History  | First Result    |
| Path                         | Before | After  | Before | After  |
|------------------------------|--------|--------|--------|--------|
| MAINTAINERS                  | 4.26 s | 3.87 s | 0.41 s | 0.39 s |
| fs/namei.c                   | 1.99 s | 0.99 s | 0.42 s | 0.21 s |
| arch/x86/kvm/cpuid.c         | 5.28 s | 1.12 s | 0.16 s | 0.09 s |
| fs/io_uring.c                | 4.34 s | 0.99 s | 0.94 s | 0.27 s |
| arch/x86/kvm/vmx/vmx.c       | 5.01 s | 1.34 s | 0.21 s | 0.12 s |
| arch/x86/kvm/x86.c           | 2.24 s | 1.18 s | 0.21 s | 0.14 s |
| fs/btrfs/disk-io.c           | 1.82 s | 1.01 s | 0.06 s | 0.05 s |
| Documentation/scsi/index.rst | 3.30 s | 0.89 s | 1.46 s | 0.03 s |

Some splashy numbers include a 110x speedup for finding the first result for
Documentation/scsi/index.rst. That's certainly an outlier, but the rest have
an at least 10x speedup.

Update in V2: Modified commit messages according to Taylor's feedback.

Thanks, -Stolee

Derrick Stolee (7):
  bloom: fix whitespace around tab length
  test-bloom: fix usage typo
  bloom: parse commit before computing filters
  Documentation: changed-path Bloom filters use byte words
  bloom: de-duplicate directory entries
  bloom: use num_changes not nr for limit detection
  line-log: integrate with changed-path Bloom filters

SZEDER Gábor (5):
  completion: offer '--(no-)patch' among 'git log' options
  line-log: remove unused fields from 'struct line_log_data'
  t4211-line-log: add tests for parent oids
  line-log: more responsive, incremental 'git log -L'
  line-log: try to use generation number-based topo-ordering

 .../technical/commit-graph-format.txt         |  8 +--
 bloom.c                                       | 61 ++++++++++++-----
 bloom.h                                       |  5 +-
 contrib/completion/git-completion.bash        |  1 +
 line-log.c                                    | 43 +++++++++++-
 line-log.h                                    |  5 +-
 revision.c                                    | 43 ++++++++++--
 t/helper/test-bloom.c                         |  2 +-
 t/t0095-bloom.sh                              |  6 +-
 t/t4211-line-log.sh                           | 68 +++++++++++++++++++
 10 files changed, 201 insertions(+), 41 deletions(-)


base-commit: 1b4c57fa87e121f155863f898dc39d06cf4a1d99
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-622%2Fderrickstolee%2Flog-l-on-bloom-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-622/derrickstolee/log-l-on-bloom-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/622

Range-diff vs v1:

  1:  89625b0efae =  1:  89625b0efae bloom: fix whitespace around tab length
  2:  572d0508fe0 =  2:  572d0508fe0 test-bloom: fix usage typo
  5:  ef4c08e401b !  3:  0e08cec78d3 bloom: parse commit before computing filters
     @@ bloom.c: struct bloom_filter *get_bloom_filter(struct repository *r,
       	diff_setup_done(&diffopt);
       
      +	/* ensure commit is parsed so we have parent information */
     -+	parse_commit(c);
     ++	repo_parse_commit(r, c);
      +
       	if (c->parents)
       		diff_tree_oid(&c->parents->item->object.oid, &c->object.oid, "", &diffopt);
  3:  03b2c84db36 !  4:  61f78a2b0dd Documentation: changed-path Bloom filters use byte words
     @@ Commit message
          In Documentation/technical/commit-graph-format.txt, the definition
          of the BIDX chunk specifies the length is a number of 8-byte words.
          During development we discovered that using 8-byte words in the
     -    Murmur3 hash algorithm causes issues with Big-Endian versus Little-
     -    Endian machines. Thus, the hash algorithm was adapted to work on a
     +    Murmur3 hash algorithm causes issues with big-endian versus little-
     +    endian machines. Thus, the hash algorithm was adapted to work on a
          byte-by-byte basis. However, this caused a change in the definition
          of a "word" in bloom.h. Now, a "word" is a single byte, which allows
          filters to be as small as two bytes. These length-two filters are
  4:  07d0a25f1c4 =  5:  ba0d8c1f539 bloom: de-duplicate directory entries
  6:  7d5561575d5 !  6:  8278b5c0918 bloom: use num_changes not nr for limit detection
     @@ Commit message
          total number of changed paths is strictly larger than max_changes.
          This includes the directories that changed, not just the file paths.
          However, only the file paths are reflected in the resulting diff
     -    queue "nr" value.
     +    queue's "nr" value.
      
          Use the "num_changes" from diffopt to check if the diff terminated
          early. This is incredibly important, as it can result in incorrect
  7:  35d2901957e =  7:  05f8ee14752 completion: offer '--(no-)patch' among 'git log' options
  8:  1f326612da0 =  8:  c3f87ec4379 line-log: remove unused fields from 'struct line_log_data'
  9:  4e3d7233095 !  9:  69c2c2f775f  t4211-line-log: add tests for parent oids
     @@ Metadata
      Author: SZEDER Gábor <szeder.dev@gmail.com>
      
       ## Commit message ##
     -     t4211-line-log: add tests for parent oids
     +    t4211-line-log: add tests for parent oids
      
          None of the tests in 't4211-line-log.sh' really check which parent
          object IDs are shown in the output, either implicitly as part of
 10:  d9991d6158d ! 10:  009da4b93f6 line-log: more responsive, incremental 'git log -L'
     @@ Commit message
      
          To be clear: this patch doesn't actually optimize the line-level log,
          but merely moves most of the work from the preprocessing step to the
     -    history travelsal, so the commits modifying the line range can be
     +    history traversal, so the commits modifying the line range can be
          shown as soon as they are processed, and the traversal can be
          terminated as soon as the given number of commits are shown.
          Consequently, listing the full history of a line range, potentially
     @@ Commit message
          commits. In addition to that edit frequency, the file itself is quite
          large (~18,700 lines). This means that a significant portion of the
          computation is taken up by computing the patch-diff of the file. This
     -    patch improves the time it takes to output the first result quite a
     -    bit:
     +    patch improves the real time it takes to output the first result quite
     +    a bit:
      
          Command: git log -L 100,200:MAINTAINERS -n 1 >/dev/null
           Before: 3.88 s
 11:  3abc7130924 ! 11:  da087d2acbb line-log: try to use generation number-based topo-ordering
     @@ Commit message
      
          Additional testing by Derrick Stolee: Since this patch improves
          the performance for the first result, I repeated the experiment
     -    from the previous patch on the Linux kernel repository:
     +    from the previous patch on the Linux kernel repository, reporting
     +    real time here:
      
              Command: git log -L 100,200:MAINTAINERS -n 1 >/dev/null
               Before: 0.71 s
 12:  7e0c2871cf7 ! 12:  305f1eb0982 line-log: integrate with changed-path Bloom filters
     @@ Commit message
      
          (along with a bogus first result). It appears that the path
          arch/x86/kvm/svm.c was renamed, so we ignore that entry. This leaves the
     -    following results:
     +    following results for the real command time:
      
          |                              | Entire History  | First Result    |
          | Path                         | Before | After  | Before | After  |

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH v2 01/12] bloom: fix whitespace around tab length
  2020-05-11 11:56 ` [PATCH v2 " Derrick Stolee via GitGitGadget
@ 2020-05-11 11:56   ` Derrick Stolee via GitGitGadget
  2020-05-11 11:56   ` [PATCH v2 02/12] test-bloom: fix usage typo Derrick Stolee via GitGitGadget
                     ` (10 subsequent siblings)
  11 siblings, 0 replies; 42+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-05-11 11:56 UTC (permalink / raw)
  To: git
  Cc: peff, me, garimasigit, szeder.dev, jnareb, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Fix alignment issues that were likely introduced due to an editor
using tab lengths of 4 instead of 8.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 bloom.c | 16 ++++++++--------
 bloom.h |  4 ++--
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/bloom.c b/bloom.c
index dd9bab9bbd6..2e3e0f5037b 100644
--- a/bloom.c
+++ b/bloom.c
@@ -29,8 +29,8 @@ static inline unsigned char get_bitmask(uint32_t pos)
 }
 
 static int load_bloom_filter_from_graph(struct commit_graph *g,
-				   struct bloom_filter *filter,
-				   struct commit *c)
+					struct bloom_filter *filter,
+					struct commit *c)
 {
 	uint32_t lex_pos, start_index, end_index;
 
@@ -123,9 +123,9 @@ uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len)
 }
 
 void fill_bloom_key(const char *data,
-					size_t len,
-					struct bloom_key *key,
-					const struct bloom_filter_settings *settings)
+		    size_t len,
+		    struct bloom_key *key,
+		    const struct bloom_filter_settings *settings)
 {
 	int i;
 	const uint32_t seed0 = 0x293ae76f;
@@ -139,8 +139,8 @@ void fill_bloom_key(const char *data,
 }
 
 void add_key_to_filter(const struct bloom_key *key,
-					   struct bloom_filter *filter,
-					   const struct bloom_filter_settings *settings)
+		       struct bloom_filter *filter,
+		       const struct bloom_filter_settings *settings)
 {
 	int i;
 	uint64_t mod = filter->len * BITS_PER_WORD;
@@ -160,7 +160,7 @@ void init_bloom_filters(void)
 
 struct bloom_filter *get_bloom_filter(struct repository *r,
 				      struct commit *c,
-					  int compute_if_not_present)
+				      int compute_if_not_present)
 {
 	struct bloom_filter *filter;
 	struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
diff --git a/bloom.h b/bloom.h
index b935186425d..a51e3715296 100644
--- a/bloom.h
+++ b/bloom.h
@@ -74,8 +74,8 @@ void fill_bloom_key(const char *data,
 		    const struct bloom_filter_settings *settings);
 
 void add_key_to_filter(const struct bloom_key *key,
-					   struct bloom_filter *filter,
-					   const struct bloom_filter_settings *settings);
+		       struct bloom_filter *filter,
+		       const struct bloom_filter_settings *settings);
 
 void init_bloom_filters(void);
 
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v2 02/12] test-bloom: fix usage typo
  2020-05-11 11:56 ` [PATCH v2 " Derrick Stolee via GitGitGadget
  2020-05-11 11:56   ` [PATCH v2 01/12] bloom: fix whitespace around tab length Derrick Stolee via GitGitGadget
@ 2020-05-11 11:56   ` Derrick Stolee via GitGitGadget
  2020-05-11 11:56   ` [PATCH v2 03/12] bloom: parse commit before computing filters Derrick Stolee via GitGitGadget
                     ` (9 subsequent siblings)
  11 siblings, 0 replies; 42+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-05-11 11:56 UTC (permalink / raw)
  To: git
  Cc: peff, me, garimasigit, szeder.dev, jnareb, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/helper/test-bloom.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index 77eb27adac7..00c1d44a561 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -44,7 +44,7 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
 }
 
 static const char *bloom_usage = "\n"
-"  test-tool bloom get_murmer3 <string>\n"
+"  test-tool bloom get_murmur3 <string>\n"
 "  test-tool bloom generate_filter <string> [<string>...]\n"
 "  test-tool get_filter_for_commit <commit-hex>\n";
 
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v2 03/12] bloom: parse commit before computing filters
  2020-05-11 11:56 ` [PATCH v2 " Derrick Stolee via GitGitGadget
  2020-05-11 11:56   ` [PATCH v2 01/12] bloom: fix whitespace around tab length Derrick Stolee via GitGitGadget
  2020-05-11 11:56   ` [PATCH v2 02/12] test-bloom: fix usage typo Derrick Stolee via GitGitGadget
@ 2020-05-11 11:56   ` Derrick Stolee via GitGitGadget
  2020-05-11 11:56   ` [PATCH v2 04/12] Documentation: changed-path Bloom filters use byte words Derrick Stolee via GitGitGadget
                     ` (8 subsequent siblings)
  11 siblings, 0 replies; 42+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-05-11 11:56 UTC (permalink / raw)
  To: git
  Cc: peff, me, garimasigit, szeder.dev, jnareb, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When computing changed-path Bloom filters for a commit, we need to
know if the commit has a parent or not. If the commit is not parsed,
then its parent pointer will be NULL.

As far as I can tell, the only opportunity to reach this code
without parsing the commit is inside "test-tool bloom
get_filter_for_commit" but it is best to be safe.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 bloom.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/bloom.c b/bloom.c
index 2e3e0f5037b..96792782719 100644
--- a/bloom.c
+++ b/bloom.c
@@ -193,6 +193,9 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 	diffopt.max_changes = max_changes;
 	diff_setup_done(&diffopt);
 
+	/* ensure commit is parsed so we have parent information */
+	repo_parse_commit(r, c);
+
 	if (c->parents)
 		diff_tree_oid(&c->parents->item->object.oid, &c->object.oid, "", &diffopt);
 	else
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v2 04/12] Documentation: changed-path Bloom filters use byte words
  2020-05-11 11:56 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (2 preceding siblings ...)
  2020-05-11 11:56   ` [PATCH v2 03/12] bloom: parse commit before computing filters Derrick Stolee via GitGitGadget
@ 2020-05-11 11:56   ` Derrick Stolee via GitGitGadget
  2020-05-11 11:56   ` [PATCH v2 05/12] bloom: de-duplicate directory entries Derrick Stolee via GitGitGadget
                     ` (7 subsequent siblings)
  11 siblings, 0 replies; 42+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-05-11 11:56 UTC (permalink / raw)
  To: git
  Cc: peff, me, garimasigit, szeder.dev, jnareb, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In Documentation/technical/commit-graph-format.txt, the definition
of the BIDX chunk specifies the length is a number of 8-byte words.
During development we discovered that using 8-byte words in the
Murmur3 hash algorithm causes issues with big-endian versus little-
endian machines. Thus, the hash algorithm was adapted to work on a
byte-by-byte basis. However, this caused a change in the definition
of a "word" in bloom.h. Now, a "word" is a single byte, which allows
filters to be as small as two bytes. These length-two filters are
demonstrated in t0095-bloom.sh, and a larger filter of length 25 is
demonstrated as well.

The original point of using 8-byte words was for alignment reasons.
It also presented opportunities for extremely sparse Bloom filters
when there were a small number of changes at a commit, creating a
very low false-positive rate. However, modifying the format at this
point is unlikely to be a valuable exercise. Also, this use of
single-byte granularity does present opportunities to save space.
It is unclear if 8-byte alignment of the filters would present any
meaningful performance benefits.

Modify the format document to reflect reality.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/commit-graph-format.txt | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index de56f9f1efd..1beef171822 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -97,10 +97,10 @@ CHUNK DATA:
       bit on. The other bits correspond to the position of the last parent.

   Bloom Filter Index (ID: {'B', 'I', 'D', 'X'}) (N * 4 bytes) [Optional]
-    * The ith entry, BIDX[i], stores the number of 8-byte word blocks in all
-      Bloom filters from commit 0 to commit i (inclusive) in lexicographic
-      order. The Bloom filter for the i-th commit spans from BIDX[i-1] to
-      BIDX[i] (plus header length), where BIDX[-1] is 0.
+    * The ith entry, BIDX[i], stores the number of bytes in all Bloom filters
+      from commit 0 to commit i (inclusive) in lexicographic order. The Bloom
+      filter for the i-th commit spans from BIDX[i-1] to BIDX[i] (plus header
+      length), where BIDX[-1] is 0.
     * The BIDX chunk is ignored if the BDAT chunk is not present.

   Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v2 05/12] bloom: de-duplicate directory entries
  2020-05-11 11:56 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (3 preceding siblings ...)
  2020-05-11 11:56   ` [PATCH v2 04/12] Documentation: changed-path Bloom filters use byte words Derrick Stolee via GitGitGadget
@ 2020-05-11 11:56   ` Derrick Stolee via GitGitGadget
  2020-06-07 21:45     ` SZEDER Gábor
  2020-05-11 11:56   ` [PATCH v2 06/12] bloom: use num_changes not nr for limit detection Derrick Stolee via GitGitGadget
                     ` (6 subsequent siblings)
  11 siblings, 1 reply; 42+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-05-11 11:56 UTC (permalink / raw)
  To: git
  Cc: peff, me, garimasigit, szeder.dev, jnareb, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When computing a changed-path Bloom filter, we need to take the
files that changed from the diff computation and extract the parent
directories. That way, a directory pathspec such as "Documentation"
could match commits that change "Documentation/git.txt".

However, the current code does a poor job of this process. The paths
are added to a hashmap, but we do not check if an entry already
exists with that path. This can create many duplicate entries and
cause the filter to have a much larger length than it should. This
means that the filter is more sparse than intended, which helps the
false positive rate, but wastes a lot of space.

Properly use hashmap_get() before hashmap_add(). Also be sure to
include a comparison function so these can be matched correctly.

This has an effect on a test in t0095-bloom.sh. This makes sense,
there are ten changes inside "smallDir" so the total number of
paths in the filter should be 11. This would result in 11 * 10 bits
required, and with 8 bits per byte, this results in 14 bytes.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 bloom.c          | 35 ++++++++++++++++++++++++++---------
 t/t0095-bloom.sh |  4 ++--
 2 files changed, 28 insertions(+), 11 deletions(-)

diff --git a/bloom.c b/bloom.c
index 96792782719..196cda0a1bd 100644
--- a/bloom.c
+++ b/bloom.c
@@ -158,6 +158,19 @@ void init_bloom_filters(void)
 	init_bloom_filter_slab(&bloom_filters);
 }
 
+static int pathmap_cmp(const void *hashmap_cmp_fn_data,
+		       const struct hashmap_entry *eptr,
+		       const struct hashmap_entry *entry_or_key,
+		       const void *keydata)
+{
+	const struct pathmap_hash_entry *e1, *e2;
+
+	e1 = container_of(eptr, const struct pathmap_hash_entry, entry);
+	e2 = container_of(entry_or_key, const struct pathmap_hash_entry, entry);
+
+	return strcmp(e1->path, e2->path);
+}
+
 struct bloom_filter *get_bloom_filter(struct repository *r,
 				      struct commit *c,
 				      int compute_if_not_present)
@@ -206,25 +219,29 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 		struct hashmap pathmap;
 		struct pathmap_hash_entry *e;
 		struct hashmap_iter iter;
-		hashmap_init(&pathmap, NULL, NULL, 0);
+		hashmap_init(&pathmap, pathmap_cmp, NULL, 0);
 
 		for (i = 0; i < diff_queued_diff.nr; i++) {
 			const char *path = diff_queued_diff.queue[i]->two->path;
 
 			/*
-			* Add each leading directory of the changed file, i.e. for
-			* 'dir/subdir/file' add 'dir' and 'dir/subdir' as well, so
-			* the Bloom filter could be used to speed up commands like
-			* 'git log dir/subdir', too.
-			*
-			* Note that directories are added without the trailing '/'.
-			*/
+			 * Add each leading directory of the changed file, i.e. for
+			 * 'dir/subdir/file' add 'dir' and 'dir/subdir' as well, so
+			 * the Bloom filter could be used to speed up commands like
+			 * 'git log dir/subdir', too.
+			 *
+			 * Note that directories are added without the trailing '/'.
+			 */
 			do {
 				char *last_slash = strrchr(path, '/');
 
 				FLEX_ALLOC_STR(e, path, path);
 				hashmap_entry_init(&e->entry, strhash(path));
-				hashmap_add(&pathmap, &e->entry);
+
+				if (!hashmap_get(&pathmap, &e->entry, NULL))
+					hashmap_add(&pathmap, &e->entry);
+				else
+					free(e);
 
 				if (!last_slash)
 					last_slash = (char*)path;
diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
index 8f9eef116dc..6defeb544f1 100755
--- a/t/t0095-bloom.sh
+++ b/t/t0095-bloom.sh
@@ -89,8 +89,8 @@ test_expect_success 'get bloom filter for commit with 10 changes' '
 	git add smallDir &&
 	git commit -m "commit with 10 changes" &&
 	cat >expect <<-\EOF &&
-	Filter_Length:25
-	Filter_Data:82|a0|65|47|0c|92|90|c0|a1|40|02|a0|e2|40|e0|04|0a|9a|66|cf|80|19|85|42|23|
+	Filter_Length:14
+	Filter_Data:02|b3|c4|a0|34|e7|fe|eb|cb|47|fe|a0|e8|72|
 	EOF
 	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
 	test_cmp expect actual
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v2 06/12] bloom: use num_changes not nr for limit detection
  2020-05-11 11:56 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (4 preceding siblings ...)
  2020-05-11 11:56   ` [PATCH v2 05/12] bloom: de-duplicate directory entries Derrick Stolee via GitGitGadget
@ 2020-05-11 11:56   ` Derrick Stolee via GitGitGadget
  2020-08-04 14:51     ` SZEDER Gábor
  2020-05-11 11:56   ` [PATCH v2 07/12] completion: offer '--(no-)patch' among 'git log' options SZEDER Gábor via GitGitGadget
                     ` (5 subsequent siblings)
  11 siblings, 1 reply; 42+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-05-11 11:56 UTC (permalink / raw)
  To: git
  Cc: peff, me, garimasigit, szeder.dev, jnareb, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

As diff_tree_oid() computes a diff, it will terminate early if the
total number of changed paths is strictly larger than max_changes.
This includes the directories that changed, not just the file paths.
However, only the file paths are reflected in the resulting diff
queue's "nr" value.

Use the "num_changes" from diffopt to check if the diff terminated
early. This is incredibly important, as it can result in incorrect
filters! For example, the first commit in the Linux kernel repo
reports only 471 changes, but since these are nested inside several
directories they expand to 513 "real" changes, and in fact the
total list of changes is not reported. Thus, the computed filter
for this commit is incorrect.

Demonstrate the subtle difference by using one fewer file change
in the 'get bloom filter for commit with 513 changes' test. Before,
this edited 513 files inside "bigDir" which hit this inequality.
However, dropping the file count by one demonstrates how the
previous inequality was incorrect but the new one is correct.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 bloom.c          | 2 +-
 t/t0095-bloom.sh | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/bloom.c b/bloom.c
index 196cda0a1bd..e2ede44126c 100644
--- a/bloom.c
+++ b/bloom.c
@@ -215,7 +215,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 		diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
 	diffcore_std(&diffopt);
 
-	if (diff_queued_diff.nr <= max_changes) {
+	if (diffopt.num_changes <= max_changes) {
 		struct hashmap pathmap;
 		struct pathmap_hash_entry *e;
 		struct hashmap_iter iter;
diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
index 6defeb544f1..48a90625596 100755
--- a/t/t0095-bloom.sh
+++ b/t/t0095-bloom.sh
@@ -100,7 +100,7 @@ test_expect_success EXPENSIVE 'get bloom filter for commit with 513 changes' '
 	rm actual &&
 	rm expect &&
 	mkdir bigDir &&
-	for i in $(test_seq 0 512)
+	for i in $(test_seq 0 511)
 	do
 		echo $i >bigDir/$i
 	done &&
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v2 07/12] completion: offer '--(no-)patch' among 'git log' options
  2020-05-11 11:56 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (5 preceding siblings ...)
  2020-05-11 11:56   ` [PATCH v2 06/12] bloom: use num_changes not nr for limit detection Derrick Stolee via GitGitGadget
@ 2020-05-11 11:56   ` SZEDER Gábor via GitGitGadget
  2020-05-11 11:56   ` [PATCH v2 08/12] line-log: remove unused fields from 'struct line_log_data' SZEDER Gábor via GitGitGadget
                     ` (4 subsequent siblings)
  11 siblings, 0 replies; 42+ messages in thread
From: SZEDER Gábor via GitGitGadget @ 2020-05-11 11:56 UTC (permalink / raw)
  To: git
  Cc: peff, me, garimasigit, szeder.dev, jnareb, Derrick Stolee,
	SZEDER Gábor

From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= <szeder.dev@gmail.com>

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 contrib/completion/git-completion.bash | 1 +
 1 file changed, 1 insertion(+)

diff --git a/contrib/completion/git-completion.bash b/contrib/completion/git-completion.bash
index c21786f2fd0..ec6ff1d5fb8 100644
--- a/contrib/completion/git-completion.bash
+++ b/contrib/completion/git-completion.bash
@@ -1860,6 +1860,7 @@ _git_log ()
 			$merge
 			$__git_diff_common_options
 			--pickaxe-all --pickaxe-regex
+			--patch --no-patch
 			"
 		return
 		;;
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v2 08/12] line-log: remove unused fields from 'struct line_log_data'
  2020-05-11 11:56 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (6 preceding siblings ...)
  2020-05-11 11:56   ` [PATCH v2 07/12] completion: offer '--(no-)patch' among 'git log' options SZEDER Gábor via GitGitGadget
@ 2020-05-11 11:56   ` SZEDER Gábor via GitGitGadget
  2020-05-11 11:56   ` [PATCH v2 09/12] t4211-line-log: add tests for parent oids SZEDER Gábor via GitGitGadget
                     ` (3 subsequent siblings)
  11 siblings, 0 replies; 42+ messages in thread
From: SZEDER Gábor via GitGitGadget @ 2020-05-11 11:56 UTC (permalink / raw)
  To: git
  Cc: peff, me, garimasigit, szeder.dev, jnareb, Derrick Stolee,
	SZEDER Gábor

From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= <szeder.dev@gmail.com>

Remove the unused fields 'status', 'arg_alloc', 'arg_nr' and 'args'
from 'struct line_log_data'.  They were already part of the struct
when it was introduced in commit 12da1d1f6 (Implement line-history
search (git log -L), 2013-03-28), but as far as I can tell none of
them have ever been actually used.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 line-log.h | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/line-log.h b/line-log.h
index 8ee7a2bd4a1..882c5055bb8 100644
--- a/line-log.h
+++ b/line-log.h
@@ -46,10 +46,7 @@ void sort_and_merge_range_set(struct range_set *);
 struct line_log_data {
 	struct line_log_data *next;
 	char *path;
-	char status;
 	struct range_set ranges;
-	int arg_alloc, arg_nr;
-	const char **args;
 	struct diff_filepair *pair;
 	struct diff_ranges diff;
 };
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v2 09/12] t4211-line-log: add tests for parent oids
  2020-05-11 11:56 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (7 preceding siblings ...)
  2020-05-11 11:56   ` [PATCH v2 08/12] line-log: remove unused fields from 'struct line_log_data' SZEDER Gábor via GitGitGadget
@ 2020-05-11 11:56   ` SZEDER Gábor via GitGitGadget
  2020-05-11 11:56   ` [PATCH v2 10/12] line-log: more responsive, incremental 'git log -L' SZEDER Gábor via GitGitGadget
                     ` (2 subsequent siblings)
  11 siblings, 0 replies; 42+ messages in thread
From: SZEDER Gábor via GitGitGadget @ 2020-05-11 11:56 UTC (permalink / raw)
  To: git
  Cc: peff, me, garimasigit, szeder.dev, jnareb, Derrick Stolee,
	SZEDER Gábor

From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= <szeder.dev@gmail.com>

None of the tests in 't4211-line-log.sh' really check which parent
object IDs are shown in the output, either implicitly as part of
"Merge: ..." lines [1] or explicitly via the '%p' or '%P' format
specifiers in a custom pretty format.

Add two tests to 't4211-line-log.sh' to check which parent object IDs
are shown, one without and one with explicitly requested parent
rewriting, IOW without and with the '--parents' option.

The test without '--parents' is marked as failing, because without
that option parent rewriting should not be performed, and thus the
parent object ID should be that of the immediate parent, just like in
case of a pathspec-limited history traversal without parent rewriting.
The current line-level log implementation, however, performs parent
rewriting unconditionally and without a possibility to turn it off,
and, consequently, it shows the object ID of the most recent ancestor
that modified the given line range.

In both of these new tests we only really care about the object IDs of
the listed commits and their parents, but not the diffs of the line
ranges; the diffs have already been thoroughly checked in the previous
tests.

[1] While one of the tests ('-M -L ':f:b.c' parallel-change') does
    list a merge commit, both of its parents happen to modify the
    given line range and are listed as well, so the implications of
    parent rewriting remained hidden and untested.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t4211-line-log.sh | 68 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 68 insertions(+)

diff --git a/t/t4211-line-log.sh b/t/t4211-line-log.sh
index cda58186c2d..ea4a9398365 100755
--- a/t/t4211-line-log.sh
+++ b/t/t4211-line-log.sh
@@ -215,4 +215,72 @@ test_expect_success 'fancy rename following #2' '
 	test_cmp expect actual
 '
 
+# Create the following linear history, where each commit does what its
+# subject line promises:
+#
+#   * 66c6410 Modify func2() in file.c
+#   * 50834e5 Modify other-file
+#   * fe5851c Modify func1() in file.c
+#   * 8c7c7dd Add other-file
+#   * d5f4417 Add func1() and func2() in file.c
+test_expect_success 'setup for checking line-log and parent oids' '
+	git checkout --orphan parent-oids &&
+	git reset --hard &&
+
+	cat >file.c <<-\EOF &&
+	int func1()
+	{
+	    return F1;
+	}
+
+	int func2()
+	{
+	    return F2;
+	}
+	EOF
+	git add file.c &&
+	test_tick &&
+	git commit -m "Add func1() and func2() in file.c" &&
+
+	echo 1 >other-file &&
+	git add other-file &&
+	git commit -m "Add other-file" &&
+
+	sed -e "s/F1/F1 + 1/" file.c >tmp &&
+	mv tmp file.c &&
+	git commit -a -m "Modify func1() in file.c" &&
+
+	echo 2 >other-file &&
+	git commit -a -m "Modify other-file" &&
+
+	sed -e "s/F2/F2 + 2/" file.c >tmp &&
+	mv tmp file.c &&
+	git commit -a -m "Modify func2() in file.c" &&
+
+	head_oid=$(git rev-parse --short HEAD) &&
+	prev_oid=$(git rev-parse --short HEAD^) &&
+	root_oid=$(git rev-parse --short HEAD~4)
+'
+
+# Parent oid should be from immediate parent.
+test_expect_failure 'parent oids without parent rewriting' '
+	cat >expect <<-EOF &&
+	$head_oid $prev_oid Modify func2() in file.c
+	$root_oid  Add func1() and func2() in file.c
+	EOF
+	git log --format="%h %p %s" --no-patch -L:func2:file.c >actual &&
+	test_cmp expect actual
+'
+
+# Parent oid should be from the most recent ancestor touching func2(),
+# i.e. in this case from the root commit.
+test_expect_success 'parent oids with parent rewriting' '
+	cat >expect <<-EOF &&
+	$head_oid $root_oid Modify func2() in file.c
+	$root_oid  Add func1() and func2() in file.c
+	EOF
+	git log --format="%h %p %s" --no-patch -L:func2:file.c --parents >actual &&
+	test_cmp expect actual
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v2 10/12] line-log: more responsive, incremental 'git log -L'
  2020-05-11 11:56 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (8 preceding siblings ...)
  2020-05-11 11:56   ` [PATCH v2 09/12] t4211-line-log: add tests for parent oids SZEDER Gábor via GitGitGadget
@ 2020-05-11 11:56   ` SZEDER Gábor via GitGitGadget
  2020-05-11 11:56   ` [PATCH v2 11/12] line-log: try to use generation number-based topo-ordering SZEDER Gábor via GitGitGadget
  2020-05-11 11:56   ` [PATCH v2 12/12] line-log: integrate with changed-path Bloom filters Derrick Stolee via GitGitGadget
  11 siblings, 0 replies; 42+ messages in thread
From: SZEDER Gábor via GitGitGadget @ 2020-05-11 11:56 UTC (permalink / raw)
  To: git
  Cc: peff, me, garimasigit, szeder.dev, jnareb, Derrick Stolee,
	SZEDER Gábor

From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= <szeder.dev@gmail.com>

The current line-level log implementation performs a preprocessing
step in prepare_revision_walk(), during which the line_log_filter()
function filters and rewrites history to keep only commits modifying
the given line range.  This preprocessing affects both responsiveness
and correctness:

  - Git doesn't produce any output during this preprocessing step.
    Checking whether a commit modified the given line range is
    somewhat expensive, so depending on the size of the given revision
    range this preprocessing can result in a significant delay before
    the first commit is shown.

  - Limiting the number of displayed commits (e.g. 'git log -3 -L...')
    doesn't limit the amount of work during preprocessing, because
    that limit is applied during history traversal.  Alas, by that
    point this expensive preprocessing step has already churned
    through the whole revision range to find all commits modifying the
    revision range, even though only a few of them need to be shown.

  - It rewrites parents, with no way to turn it off.  Without the user
    explicitly requesting parent rewriting any parent object ID shown
    should be that of the immediate parent, just like in case of a
    pathspec-limited history traversal without parent rewriting.

    However, after that preprocessing step rewrote history, the
    subsequent "regular" history traversal (i.e. get_revision() in a
    loop) only sees commits modifying the given line range.
    Consequently, it can only show the object ID of the last ancestor
    that modified the given line range (which might happen to be the
    immediate parent, but many-many times it isn't).

This patch addresses both the correctness and, at least for the common
case, the responsiveness issues by integrating line-level log
filtering into the regular revision walking machinery:

  - Make process_ranges_arbitrary_commit(), the static function in
    'line-log.c' deciding whether a commit modifies the given line
    range, public by removing the static keyword and adding the
    'line_log_' prefix, so it can be called from other parts of the
    revision walking machinery.

  - If the user didn't explicitly ask for parent rewriting (which, I
    believe, is the most common case):

    - Call this now-public function during regular history traversal,
      namely from get_commit_action() to ignore any commits not
      modifying the given line range.

      Note that while this check is relatively expensive, it must be
      performed before other, much cheaper conditions, because the
      tracked line range must be adjusted even when the commit will
      end up being ignored by other conditions.

    - Skip the line_log_filter() call, i.e. the expensive
      preprocessing step, in prepare_revision_walk(), because, thanks
      to the above points, the revision walking machinery is now able
      to filter out commits not modifying the given line range while
      traversing history.

      This way the regular history traversal sees the unmodified
      history, and is therefore able to print the object ids of the
      immediate parents of the listed commits.  The eliminated
      preprocessing step can greatly reduce the delay before the first
      commit is shown, see the numbers below.

  - However, if the user did explicitly ask for parent rewriting via
    '--parents' or a similar option, then stick with the current
    implementation for now, i.e. perform that expensive filtering and
    history rewriting in the preprocessing step just like we did
    before, leaving the initial delay as long as it was.

I tried to integrate line-level log filtering with parent rewriting
into the regular history traversal, but, unfortunately, several
subtleties resisted... :)  Maybe someday we'll figure out how to do
that, but until then at least the simple and common (i.e. without
parent rewriting) 'git log -L:func:file' commands can benefit from the
reduced delay.

This change makes the failing 'parent oids without parent rewriting'
test in 't4211-line-log.sh' succeed.

The reduced delay is most noticable when there's a commit modifying
the line range near the tip of a large-ish revision range:

  # no parent rewriting requested, no commit-graph present
  $ time git --no-pager log -L:read_alternate_refs:sha1-file.c -1 v2.23.0

  Before:

    real    0m9.570s
    user    0m9.494s
    sys     0m0.076s

  After:

    real    0m0.718s
    user    0m0.674s
    sys     0m0.044s

A significant part of the remaining delay is spent reading and parsing
commit objects in limit_list().  With the help of the commit-graph we
can eliminate most of that reading and parsing overhead, so here are
the timing results of the same command as above, but this time using
the commit-graph:

  Before:

    real    0m8.874s
    user    0m8.816s
    sys     0m0.057s

  After:

    real    0m0.107s
    user    0m0.091s
    sys     0m0.013s

The next patch will further reduce the remaining delay.

To be clear: this patch doesn't actually optimize the line-level log,
but merely moves most of the work from the preprocessing step to the
history traversal, so the commits modifying the line range can be
shown as soon as they are processed, and the traversal can be
terminated as soon as the given number of commits are shown.
Consequently, listing the full history of a line range, potentially
all the way to the root commit, will take the same time as before (but
at least the user might start reading the output earlier).
Furthermore, if the most recent commit modifying the line range is far
away from the starting revision, then that initial delay will still be
significant.

Additional testing by Derrick Stolee: In the Linux kernel repository,
the MAINTAINERS file was changed ~3,500 times across the ~915,000
commits. In addition to that edit frequency, the file itself is quite
large (~18,700 lines). This means that a significant portion of the
computation is taken up by computing the patch-diff of the file. This
patch improves the real time it takes to output the first result quite
a bit:

Command: git log -L 100,200:MAINTAINERS -n 1 >/dev/null
 Before: 3.88 s
  After: 0.71 s

If we drop the "-n 1" in the command, then there is no change in
end-to-end process time. This is because the command still needs to
walk the entire commit history, which negates the point of this
patch. This is expected.

As a note for future reference, the ~4.3 seconds in the old code
spends ~2.6 seconds computing the patch-diffs, and the rest of the
time is spent walking commits and computing diffs for which paths
changed at each commit. The changed-path Bloom filters could improve
the end-to-end computation time (i.e. no "-n 1" in the command).

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 line-log.c          |  4 ++--
 line-log.h          |  2 ++
 revision.c          | 27 ++++++++++++++++++++++++++-
 t/t4211-line-log.sh |  2 +-
 4 files changed, 31 insertions(+), 4 deletions(-)

diff --git a/line-log.c b/line-log.c
index 9010e00950b..520ee715bcd 100644
--- a/line-log.c
+++ b/line-log.c
@@ -1227,7 +1227,7 @@ static int process_ranges_merge_commit(struct rev_info *rev, struct commit *comm
 	/* NEEDSWORK leaking like a sieve */
 }
 
-static int process_ranges_arbitrary_commit(struct rev_info *rev, struct commit *commit)
+int line_log_process_ranges_arbitrary_commit(struct rev_info *rev, struct commit *commit)
 {
 	struct line_log_data *range = lookup_line_range(rev, commit);
 	int changed = 0;
@@ -1270,7 +1270,7 @@ int line_log_filter(struct rev_info *rev)
 	while (list) {
 		struct commit_list *to_free = NULL;
 		commit = list->item;
-		if (process_ranges_arbitrary_commit(rev, commit)) {
+		if (line_log_process_ranges_arbitrary_commit(rev, commit)) {
 			*pp = list;
 			pp = &list->next;
 		} else
diff --git a/line-log.h b/line-log.h
index 882c5055bb8..82ae8d98a40 100644
--- a/line-log.h
+++ b/line-log.h
@@ -54,6 +54,8 @@ struct line_log_data {
 void line_log_init(struct rev_info *rev, const char *prefix, struct string_list *args);
 
 int line_log_filter(struct rev_info *rev);
+int line_log_process_ranges_arbitrary_commit(struct rev_info *rev,
+						    struct commit *commit);
 
 int line_log_print(struct rev_info *rev, struct commit *commit);
 
diff --git a/revision.c b/revision.c
index f78c636e4d0..3228db9af6d 100644
--- a/revision.c
+++ b/revision.c
@@ -39,6 +39,8 @@ static const char *term_good;
 
 implement_shared_commit_slab(revision_sources, char *);
 
+static inline int want_ancestry(const struct rev_info *revs);
+
 void show_object_with_name(FILE *out, struct object *obj, const char *name)
 {
 	const char *p;
@@ -3511,7 +3513,14 @@ int prepare_revision_walk(struct rev_info *revs)
 			sort_in_topological_order(&revs->commits, revs->sort_order);
 	} else if (revs->topo_order)
 		init_topo_walk(revs);
-	if (revs->line_level_traverse)
+	if (revs->line_level_traverse && want_ancestry(revs))
+		/*
+		 * At the moment we can only do line-level log with parent
+		 * rewriting by performing this expensive pre-filtering step.
+		 * If parent rewriting is not requested, then we rather
+		 * perform the line-level log filtering during the regular
+		 * history traversal.
+		 */
 		line_log_filter(revs);
 	if (revs->simplify_merges)
 		simplify_merges(revs);
@@ -3722,6 +3731,22 @@ enum commit_action get_commit_action(struct rev_info *revs, struct commit *commi
 		return commit_ignore;
 	if (commit->object.flags & UNINTERESTING)
 		return commit_ignore;
+	if (revs->line_level_traverse && !want_ancestry(revs)) {
+		/*
+		 * In case of line-level log with parent rewriting
+		 * prepare_revision_walk() already took care of all line-level
+		 * log filtering, and there is nothing left to do here.
+		 *
+		 * If parent rewriting was not requested, then this is the
+		 * place to perform the line-level log filtering.  Notably,
+		 * this check, though expensive, must come before the other,
+		 * cheaper filtering conditions, because the tracked line
+		 * ranges must be adjusted even when the commit will end up
+		 * being ignored based on other conditions.
+		 */
+		if (!line_log_process_ranges_arbitrary_commit(revs, commit))
+			return commit_ignore;
+	}
 	if (revs->min_age != -1 &&
 	    comparison_date(revs, commit) > revs->min_age)
 			return commit_ignore;
diff --git a/t/t4211-line-log.sh b/t/t4211-line-log.sh
index ea4a9398365..1428eae2629 100755
--- a/t/t4211-line-log.sh
+++ b/t/t4211-line-log.sh
@@ -263,7 +263,7 @@ test_expect_success 'setup for checking line-log and parent oids' '
 '
 
 # Parent oid should be from immediate parent.
-test_expect_failure 'parent oids without parent rewriting' '
+test_expect_success 'parent oids without parent rewriting' '
 	cat >expect <<-EOF &&
 	$head_oid $prev_oid Modify func2() in file.c
 	$root_oid  Add func1() and func2() in file.c
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v2 11/12] line-log: try to use generation number-based topo-ordering
  2020-05-11 11:56 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (9 preceding siblings ...)
  2020-05-11 11:56   ` [PATCH v2 10/12] line-log: more responsive, incremental 'git log -L' SZEDER Gábor via GitGitGadget
@ 2020-05-11 11:56   ` SZEDER Gábor via GitGitGadget
  2020-05-11 11:56   ` [PATCH v2 12/12] line-log: integrate with changed-path Bloom filters Derrick Stolee via GitGitGadget
  11 siblings, 0 replies; 42+ messages in thread
From: SZEDER Gábor via GitGitGadget @ 2020-05-11 11:56 UTC (permalink / raw)
  To: git
  Cc: peff, me, garimasigit, szeder.dev, jnareb, Derrick Stolee,
	SZEDER Gábor

From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= <szeder.dev@gmail.com>

The previous patch made it possible to perform line-level filtering
during history traversal instead of in an expensive preprocessing
step, but it still requires some simpler preprocessing steps, notably
topo-ordering.  However, nowadays we have commit-graphs storing
generation numbers, which make it possible to incrementally traverse
the history in topological order, without the preparatory limit_list()
and sort_in_topological_order() steps; see b45424181e (revision.c:
generation-based topo-order algorithm, 2018-11-01).

This patch combines the two, so we can do both the topo-ordering and
the line-level filtering during history traversal, eliminating even
those simpler preprocessing steps, and thus further reducing the delay
before showing the first commit modifying the given line range.

The 'revs->limited' flag plays the central role in this, because, due
to limitations of the current implementation, the generation
number-based topo-ordering is only enabled when this flag remains
unset.  Line-level log, however, always sets this flag in
setup_revisions() ever since the feature was introduced in 12da1d1f6f
(Implement line-history search (git log -L), 2013-03-28).  The reason
for setting 'limited' is unclear, though, because the line-level log
itself doesn't directly depend on it, and it doesn't affect how the
limit_list() function limits the revision range.  However, there is an
indirect dependency: the line-level log requires topo-ordering, and
the "traditional" sort_in_topological_order() requires an already
limited commit list since e6c3505b44 (Make sure we generate the whole
commit list before trying to sort it topologically, 2005-07-06).  The
new, generation numbers-based topo-ordering doesn't require a limited
commit list anymore.

So don't set 'revs->limited' for line-level log, unless it is really
necessary, namely:

  - The user explicitly requested parent rewriting, because that is
    still done in the line_log_filter() preprocessing step (see
    previous patch), which requires sort_in_topological_order() and in
    turn limit_list() as well.

  - A commit-graph file is not available or it doesn't yet contain
    generation numbers.  In these cases we had to fall back on
    sort_in_topological_order() and in turn limit_list().  The
    existing condition with generation_numbers_enabled() has already
    ensured that the 'limited' flag is set in these cases; this patch
    just makes sure that the line-level log sets 'revs->topo_order'
    before that condition.

While the reduced delay before showing the first commit is measurable
in git.git, it takes a bigger repository to make it clearly noticable.
In both cases below the line ranges were chosen so that they were
modified rather close to the starting revisions, so the effect of this
change is most noticable.

  # git.git
  $ time git --no-pager log -L:read_alternate_refs:sha1-file.c -1 v2.23.0

  Before:

    real    0m0.107s
    user    0m0.091s
    sys     0m0.013s

  After:

    real    0m0.058s
    user    0m0.050s
    sys     0m0.005s

  # linux.git
  $ time git --no-pager log \
    -L:build_restore_work_registers:arch/mips/mm/tlbex.c -1 v5.2

  Before:

    real   0m1.129s
    user   0m1.061s
    sys    0m0.069s

  After:

    real   0m0.096s
    user   0m0.087s
    sys    0m0.009s

Additional testing by Derrick Stolee: Since this patch improves
the performance for the first result, I repeated the experiment
from the previous patch on the Linux kernel repository, reporting
real time here:

    Command: git log -L 100,200:MAINTAINERS -n 1 >/dev/null
     Before: 0.71 s
      After: 0.05 s

Now, we have dropped the full topo-order of all ~910,000 commits
before reporting the first result. The remaining performance
improvements then are:

 1. Update the parent-rewriting logic to be incremental similar to
    how "git log --graph" behaves.

 2. Use changed-path Bloom filters to reduce the time spend in the
    tree-diff to see if the path(s) changed.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 revision.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/revision.c b/revision.c
index 3228db9af6d..3356ede9a20 100644
--- a/revision.c
+++ b/revision.c
@@ -2790,6 +2790,12 @@ int setup_revisions(int argc, const char **argv, struct rev_info *revs, struct s
 	if (revs->diffopt.objfind)
 		revs->simplify_history = 0;

+	if (revs->line_level_traverse) {
+		if (want_ancestry(revs))
+			revs->limited = 1;
+		revs->topo_order = 1;
+	}
+
 	if (revs->topo_order && !generation_numbers_enabled(the_repository))
 		revs->limited = 1;

@@ -2809,11 +2815,6 @@ int setup_revisions(int argc, const char **argv, struct rev_info *revs, struct s

 	revs->diffopt.abbrev = revs->abbrev;

-	if (revs->line_level_traverse) {
-		revs->limited = 1;
-		revs->topo_order = 1;
-	}
-
 	diff_setup_done(&revs->diffopt);

 	grep_commit_pattern_type(GREP_PATTERN_TYPE_UNSPECIFIED,
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v2 12/12] line-log: integrate with changed-path Bloom filters
  2020-05-11 11:56 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (10 preceding siblings ...)
  2020-05-11 11:56   ` [PATCH v2 11/12] line-log: try to use generation number-based topo-ordering SZEDER Gábor via GitGitGadget
@ 2020-05-11 11:56   ` Derrick Stolee via GitGitGadget
  11 siblings, 0 replies; 42+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-05-11 11:56 UTC (permalink / raw)
  To: git
  Cc: peff, me, garimasigit, szeder.dev, jnareb, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The previous changes to the line-log machinery focused on making the
first result appear faster. This was achieved by no longer walking the
entire commit history before returning the early results. There is still
another way to improve the performance: walk most commits much faster.
Let's use the changed-path Bloom filters to reduce time spent computing
diffs.

Since the line-log computation requires opening blobs and checking the
content-diff, there is still a lot of necessary computation that cannot
be replaced with changed-path Bloom filters. The part that we can reduce
is most effective when checking the history of a file that is deep in
several directories and those directories are modified frequently. In
this case, the computation to check if a commit is TREESAME to its first
parent takes a large fraction of the time. That is ripe for improvement
with changed-path Bloom filters.

We must ensure that prepare_to_use_bloom_filters() is called in
revision.c so that the bloom_filter_settings are loaded into the struct
rev_info from the commit-graph. Of course, some cases are still
forbidden, but in the line-log case the pathspec is provided in a
different way than normal.

Since multiple paths and segments could be requested, we compute the
struct bloom_key data dynamically during the commit walk. This could
likely be improved, but adds code complexity that is not valuable at
this time.

There are two cases to care about: merge commits and "ordinary" commits.
Merge commits have multiple parents, but if we are TREESAME to our first
parent in every range, then pass the blame for all ranges to the first
parent. Ordinary commits have the same condition, but each is done
slightly differently in the process_ranges_[merge|ordinary]_commit()
methods. By checking if the changed-path Bloom filter can guarantee
TREESAME, we can avoid that tree-diff cost. If the filter says "probably
changed", then we need to run the tree-diff and then the blob-diff if
there was a real edit.

The Linux kernel repository is a good testing ground for the performance
improvements claimed here. There are two different cases to test. The
first is the "entire history" case, where we output the entire history
to /dev/null to see how long it would take to compute the full line-log
history. The second is the "first result" case, where we find how long
it takes to show the first value, which is an indicator of how quickly a
user would see responses when waiting at a terminal.

To test, I selected the paths that were changed most frequently in the
top 10,000 commits using this command (stolen from StackOverflow [1]):

	git log --pretty=format: --name-only -n 10000 | sort | \
		uniq -c | sort -rg | head -10

which results in

    121 MAINTAINERS
     63 fs/namei.c
     60 arch/x86/kvm/cpuid.c
     59 fs/io_uring.c
     58 arch/x86/kvm/vmx/vmx.c
     51 arch/x86/kvm/x86.c
     45 arch/x86/kvm/svm.c
     42 fs/btrfs/disk-io.c
     42 Documentation/scsi/index.rst

(along with a bogus first result). It appears that the path
arch/x86/kvm/svm.c was renamed, so we ignore that entry. This leaves the
following results for the real command time:

|                              | Entire History  | First Result    |
| Path                         | Before | After  | Before | After  |
|------------------------------|--------|--------|--------|--------|
| MAINTAINERS                  | 4.26 s | 3.87 s | 0.41 s | 0.39 s |
| fs/namei.c                   | 1.99 s | 0.99 s | 0.42 s | 0.21 s |
| arch/x86/kvm/cpuid.c         | 5.28 s | 1.12 s | 0.16 s | 0.09 s |
| fs/io_uring.c                | 4.34 s | 0.99 s | 0.94 s | 0.27 s |
| arch/x86/kvm/vmx/vmx.c       | 5.01 s | 1.34 s | 0.21 s | 0.12 s |
| arch/x86/kvm/x86.c           | 2.24 s | 1.18 s | 0.21 s | 0.14 s |
| fs/btrfs/disk-io.c           | 1.82 s | 1.01 s | 0.06 s | 0.05 s |
| Documentation/scsi/index.rst | 3.30 s | 0.89 s | 1.46 s | 0.03 s |

It is worth noting that the least speedup comes for the MAINTAINERS file
which is

 * edited frequently,
 * low in the directory heirarchy, and
 * quite a large file.

All of those points lead to spending more time doing the blob diff and
less time doing the tree diff. Still, we see some improvement in that
case and significant improvement in other cases. A 2-4x speedup is
likely the more typical case as opposed to the small 5% change for that
file.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 bloom.c    |  5 +++++
 bloom.h    |  1 +
 line-log.c | 39 ++++++++++++++++++++++++++++++++++++++-
 revision.c |  5 ++++-
 4 files changed, 48 insertions(+), 2 deletions(-)

diff --git a/bloom.c b/bloom.c
index e2ede44126c..c38d1cff0c6 100644
--- a/bloom.c
+++ b/bloom.c
@@ -138,6 +138,11 @@ void fill_bloom_key(const char *data,
 		key->hashes[i] = hash0 + i * hash1;
 }

+void clear_bloom_key(struct bloom_key *key)
+{
+	FREE_AND_NULL(key->hashes);
+}
+
 void add_key_to_filter(const struct bloom_key *key,
 		       struct bloom_filter *filter,
 		       const struct bloom_filter_settings *settings)
diff --git a/bloom.h b/bloom.h
index a51e3715296..d0c69172e67 100644
--- a/bloom.h
+++ b/bloom.h
@@ -72,6 +72,7 @@ void fill_bloom_key(const char *data,
 		    size_t len,
 		    struct bloom_key *key,
 		    const struct bloom_filter_settings *settings);
+void clear_bloom_key(struct bloom_key *key);

 void add_key_to_filter(const struct bloom_key *key,
 		       struct bloom_filter *filter,
diff --git a/line-log.c b/line-log.c
index 520ee715bcd..7dc411da8f6 100644
--- a/line-log.c
+++ b/line-log.c
@@ -15,6 +15,7 @@
 #include "userdiff.h"
 #include "line-log.h"
 #include "argv-array.h"
+#include "bloom.h"

 static void range_set_grow(struct range_set *rs, size_t extra)
 {
@@ -1146,6 +1147,37 @@ int line_log_print(struct rev_info *rev, struct commit *commit)
 	return 1;
 }

+static int bloom_filter_check(struct rev_info *rev,
+			      struct commit *commit,
+			      struct line_log_data *range)
+{
+	struct bloom_filter *filter;
+	struct bloom_key key;
+	int result = 0;
+
+	if (!commit->parents)
+		return 1;
+
+	if (!rev->bloom_filter_settings ||
+	    !(filter = get_bloom_filter(rev->repo, commit, 0)))
+		return 1;
+
+	if (!range)
+		return 0;
+
+	while (!result && range) {
+		fill_bloom_key(range->path, strlen(range->path), &key, rev->bloom_filter_settings);
+
+		if (bloom_filter_contains(filter, &key, rev->bloom_filter_settings))
+			result = 1;
+
+		clear_bloom_key(&key);
+		range = range->next;
+	}
+
+	return result;
+}
+
 static int process_ranges_ordinary_commit(struct rev_info *rev, struct commit *commit,
 					  struct line_log_data *range)
 {
@@ -1159,6 +1191,7 @@ static int process_ranges_ordinary_commit(struct rev_info *rev, struct commit *c

 	queue_diffs(range, &rev->diffopt, &queue, commit, parent);
 	changed = process_all_files(&parent_range, rev, &queue, range);
+
 	if (parent)
 		add_line_range(rev, parent, parent_range);
 	free_line_log_data(parent_range);
@@ -1233,7 +1266,11 @@ int line_log_process_ranges_arbitrary_commit(struct rev_info *rev, struct commit
 	int changed = 0;

 	if (range) {
-		if (!commit->parents || !commit->parents->next)
+		if (commit->parents && !bloom_filter_check(rev, commit, range)) {
+			struct line_log_data *prange = line_log_data_copy(range);
+			add_line_range(rev, commit->parents->item, prange);
+			clear_commit_line_range(rev, commit);
+		} else if (!commit->parents || !commit->parents->next)
 			changed = process_ranges_ordinary_commit(rev, commit, range);
 		else
 			changed = process_ranges_merge_commit(rev, commit, range);
diff --git a/revision.c b/revision.c
index 3356ede9a20..cbf4b61aa67 100644
--- a/revision.c
+++ b/revision.c
@@ -689,6 +689,9 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 	if (!revs->bloom_filter_settings)
 		return;

+	if (!revs->pruning.pathspec.nr)
+		return;
+
 	pi = &revs->pruning.pathspec.items[0];
 	last_index = pi->len - 1;

@@ -3501,7 +3504,7 @@ int prepare_revision_walk(struct rev_info *revs)
 				       FOR_EACH_OBJECT_PROMISOR_ONLY);
 	}

-	if (revs->pruning.pathspec.nr == 1 && !revs->reflog_info)
+	if (!revs->reflog_info)
 		prepare_to_use_bloom_filter(revs);
 	if (revs->no_walk != REVISION_WALK_NO_WALK_UNSORTED)
 		commit_list_sort_by_date(&revs->commits);
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 05/12] bloom: de-duplicate directory entries
  2020-05-11 11:56   ` [PATCH v2 05/12] bloom: de-duplicate directory entries Derrick Stolee via GitGitGadget
@ 2020-06-07 21:45     ` SZEDER Gábor
  0 siblings, 0 replies; 42+ messages in thread
From: SZEDER Gábor @ 2020-06-07 21:45 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, peff, me, garimasigit, jnareb, Derrick Stolee

On Mon, May 11, 2020 at 11:56:12AM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
> 
> When computing a changed-path Bloom filter, we need to take the
> files that changed from the diff computation and extract the parent
> directories. That way, a directory pathspec such as "Documentation"
> could match commits that change "Documentation/git.txt".
> 
> However, the current code does a poor job of this process. The paths
> are added to a hashmap, but we do not check if an entry already
> exists with that path. This can create many duplicate entries and
> cause the filter to have a much larger length than it should. This
> means that the filter is more sparse than intended, which helps the
> false positive rate, but wastes a lot of space.
> 
> Properly use hashmap_get() before hashmap_add(). Also be sure to
> include a comparison function so these can be matched correctly.
> 
> This has an effect on a test in t0095-bloom.sh. This makes sense,
> there are ten changes inside "smallDir" so the total number of
> paths in the filter should be 11. This would result in 11 * 10 bits
> required, and with 8 bits per byte, this results in 14 bytes.

This is the first patch where the chunk format and the specs are in
sync and the Bloom filters are as large as promised, so it would have
been nice to see how the false positive rate, the runtime of
pathspec-limited revision walks, and the size of the Bloom filters
chunk or the commit-graph file turned out...

> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  bloom.c          | 35 ++++++++++++++++++++++++++---------
>  t/t0095-bloom.sh |  4 ++--
>  2 files changed, 28 insertions(+), 11 deletions(-)
> 
> diff --git a/bloom.c b/bloom.c
> index 96792782719..196cda0a1bd 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -158,6 +158,19 @@ void init_bloom_filters(void)
>  	init_bloom_filter_slab(&bloom_filters);
>  }
>  
> +static int pathmap_cmp(const void *hashmap_cmp_fn_data,
> +		       const struct hashmap_entry *eptr,
> +		       const struct hashmap_entry *entry_or_key,
> +		       const void *keydata)
> +{
> +	const struct pathmap_hash_entry *e1, *e2;
> +
> +	e1 = container_of(eptr, const struct pathmap_hash_entry, entry);
> +	e2 = container_of(entry_or_key, const struct pathmap_hash_entry, entry);
> +
> +	return strcmp(e1->path, e2->path);
> +}
> +
>  struct bloom_filter *get_bloom_filter(struct repository *r,
>  				      struct commit *c,
>  				      int compute_if_not_present)
> @@ -206,25 +219,29 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>  		struct hashmap pathmap;
>  		struct pathmap_hash_entry *e;
>  		struct hashmap_iter iter;
> -		hashmap_init(&pathmap, NULL, NULL, 0);
> +		hashmap_init(&pathmap, pathmap_cmp, NULL, 0);
>  
>  		for (i = 0; i < diff_queued_diff.nr; i++) {
>  			const char *path = diff_queued_diff.queue[i]->two->path;
>  
>  			/*
> -			* Add each leading directory of the changed file, i.e. for
> -			* 'dir/subdir/file' add 'dir' and 'dir/subdir' as well, so
> -			* the Bloom filter could be used to speed up commands like
> -			* 'git log dir/subdir', too.
> -			*
> -			* Note that directories are added without the trailing '/'.
> -			*/
> +			 * Add each leading directory of the changed file, i.e. for
> +			 * 'dir/subdir/file' add 'dir' and 'dir/subdir' as well, so
> +			 * the Bloom filter could be used to speed up commands like
> +			 * 'git log dir/subdir', too.
> +			 *
> +			 * Note that directories are added without the trailing '/'.
> +			 */
>  			do {
>  				char *last_slash = strrchr(path, '/');
>  
>  				FLEX_ALLOC_STR(e, path, path);

Allocating a hashmap entry each time we query whether a path is
already present in the hashmap has some overhead.  It's cheaper to
create an entry on the stack just for the query and allocate an entry
on the heap only if the path isn't in the map.

Deduplicating without a hashmap is faster still.

>  				hashmap_entry_init(&e->entry, strhash(path));
> -				hashmap_add(&pathmap, &e->entry);
> +
> +				if (!hashmap_get(&pathmap, &e->entry, NULL))
> +					hashmap_add(&pathmap, &e->entry);
> +				else
> +					free(e);
>  
>  				if (!last_slash)
>  					last_slash = (char*)path;
> diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
> index 8f9eef116dc..6defeb544f1 100755
> --- a/t/t0095-bloom.sh
> +++ b/t/t0095-bloom.sh
> @@ -89,8 +89,8 @@ test_expect_success 'get bloom filter for commit with 10 changes' '
>  	git add smallDir &&
>  	git commit -m "commit with 10 changes" &&
>  	cat >expect <<-\EOF &&
> -	Filter_Length:25
> -	Filter_Data:82|a0|65|47|0c|92|90|c0|a1|40|02|a0|e2|40|e0|04|0a|9a|66|cf|80|19|85|42|23|
> +	Filter_Length:14
> +	Filter_Data:02|b3|c4|a0|34|e7|fe|eb|cb|47|fe|a0|e8|72|
>  	EOF
>  	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
>  	test_cmp expect actual
> -- 
> gitgitgadget
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 06/12] bloom: use num_changes not nr for limit detection
  2020-05-11 11:56   ` [PATCH v2 06/12] bloom: use num_changes not nr for limit detection Derrick Stolee via GitGitGadget
@ 2020-08-04 14:51     ` SZEDER Gábor
  0 siblings, 0 replies; 42+ messages in thread
From: SZEDER Gábor @ 2020-08-04 14:51 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, peff, me, garimasigit, jnareb, Derrick Stolee

On Mon, May 11, 2020 at 11:56:13AM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
> 
> As diff_tree_oid() computes a diff, it will terminate early if the
> total number of changed paths is strictly larger than max_changes.
> This includes the directories that changed, not just the file paths.
> However, only the file paths are reflected in the resulting diff
> queue's "nr" value.
> 
> Use the "num_changes" from diffopt to check if the diff terminated
> early. This is incredibly important, as it can result in incorrect
> filters! For example, the first commit in the Linux kernel repo
> reports only 471 changes, but since these are nested inside several
> directories they expand to 513 "real" changes, and in fact the
> total list of changes is not reported. Thus, the computed filter
> for this commit is incorrect.
> 
> Demonstrate the subtle difference by using one fewer file change
> in the 'get bloom filter for commit with 513 changes' test. Before,
> this edited 513 files inside "bigDir" which hit this inequality.
> However, dropping the file count by one demonstrates how the
> previous inequality was incorrect but the new one is correct.
> 
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  bloom.c          | 2 +-
>  t/t0095-bloom.sh | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/bloom.c b/bloom.c
> index 196cda0a1bd..e2ede44126c 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -215,7 +215,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>  		diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
>  	diffcore_std(&diffopt);
>  
> -	if (diff_queued_diff.nr <= max_changes) {
> +	if (diffopt.num_changes <= max_changes) {

The field 'diffopt.num_changes' is marked with:

  /* For internal use only. */

>  		struct hashmap pathmap;
>  		struct pathmap_hash_entry *e;
>  		struct hashmap_iter iter;
> diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
> index 6defeb544f1..48a90625596 100755
> --- a/t/t0095-bloom.sh
> +++ b/t/t0095-bloom.sh
> @@ -100,7 +100,7 @@ test_expect_success EXPENSIVE 'get bloom filter for commit with 513 changes' '
>  	rm actual &&
>  	rm expect &&
>  	mkdir bigDir &&
> -	for i in $(test_seq 0 512)
> +	for i in $(test_seq 0 511)
>  	do
>  		echo $i >bigDir/$i
>  	done &&
> -- 
> gitgitgadget
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2020-08-04 14:51 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-01 15:30 [PATCH 00/12] Integrating line-log and changed-path Bloom filters Derrick Stolee via GitGitGadget
2020-05-01 15:30 ` [PATCH 01/12] bloom: fix whitespace around tab length Derrick Stolee via GitGitGadget
2020-05-01 22:51   ` Taylor Blau
2020-05-01 15:30 ` [PATCH 02/12] test-bloom: fix usage typo Derrick Stolee via GitGitGadget
2020-05-01 22:51   ` Taylor Blau
2020-05-01 15:30 ` [PATCH 03/12] Documentation: changed-path Bloom filters use byte words Derrick Stolee via GitGitGadget
2020-05-01 22:55   ` Taylor Blau
2020-05-01 15:30 ` [PATCH 04/12] bloom: de-duplicate directory entries Derrick Stolee via GitGitGadget
2020-05-01 23:06   ` Taylor Blau
2020-05-01 15:30 ` [PATCH 05/12] bloom: parse commit before computing filters Derrick Stolee via GitGitGadget
2020-05-01 23:10   ` Taylor Blau
2020-05-01 15:30 ` [PATCH 06/12] bloom: use num_changes not nr for limit detection Derrick Stolee via GitGitGadget
2020-05-01 23:12   ` Taylor Blau
2020-05-01 15:30 ` [PATCH 07/12] completion: offer '--(no-)patch' among 'git log' options SZEDER Gábor via GitGitGadget
2020-05-01 23:44   ` Taylor Blau
2020-05-01 15:30 ` [PATCH 08/12] line-log: remove unused fields from 'struct line_log_data' SZEDER Gábor via GitGitGadget
2020-05-01 23:46   ` Taylor Blau
2020-05-01 15:30 ` [PATCH 09/12] t4211-line-log: add tests for parent oids SZEDER Gábor via GitGitGadget
2020-05-04 17:31   ` Taylor Blau
2020-05-01 15:30 ` [PATCH 10/12] line-log: more responsive, incremental 'git log -L' SZEDER Gábor via GitGitGadget
2020-05-04 17:52   ` Taylor Blau
2020-05-04 17:55     ` Derrick Stolee
2020-05-01 15:30 ` [PATCH 11/12] line-log: try to use generation number-based topo-ordering SZEDER Gábor via GitGitGadget
2020-05-04 21:25   ` Taylor Blau
2020-05-01 15:30 ` [PATCH 12/12] line-log: integrate with changed-path Bloom filters Derrick Stolee via GitGitGadget
2020-05-04 21:50   ` Taylor Blau
2020-05-01 17:34 ` [PATCH 00/12] Integrating line-log and " Junio C Hamano
2020-05-11 11:56 ` [PATCH v2 " Derrick Stolee via GitGitGadget
2020-05-11 11:56   ` [PATCH v2 01/12] bloom: fix whitespace around tab length Derrick Stolee via GitGitGadget
2020-05-11 11:56   ` [PATCH v2 02/12] test-bloom: fix usage typo Derrick Stolee via GitGitGadget
2020-05-11 11:56   ` [PATCH v2 03/12] bloom: parse commit before computing filters Derrick Stolee via GitGitGadget
2020-05-11 11:56   ` [PATCH v2 04/12] Documentation: changed-path Bloom filters use byte words Derrick Stolee via GitGitGadget
2020-05-11 11:56   ` [PATCH v2 05/12] bloom: de-duplicate directory entries Derrick Stolee via GitGitGadget
2020-06-07 21:45     ` SZEDER Gábor
2020-05-11 11:56   ` [PATCH v2 06/12] bloom: use num_changes not nr for limit detection Derrick Stolee via GitGitGadget
2020-08-04 14:51     ` SZEDER Gábor
2020-05-11 11:56   ` [PATCH v2 07/12] completion: offer '--(no-)patch' among 'git log' options SZEDER Gábor via GitGitGadget
2020-05-11 11:56   ` [PATCH v2 08/12] line-log: remove unused fields from 'struct line_log_data' SZEDER Gábor via GitGitGadget
2020-05-11 11:56   ` [PATCH v2 09/12] t4211-line-log: add tests for parent oids SZEDER Gábor via GitGitGadget
2020-05-11 11:56   ` [PATCH v2 10/12] line-log: more responsive, incremental 'git log -L' SZEDER Gábor via GitGitGadget
2020-05-11 11:56   ` [PATCH v2 11/12] line-log: try to use generation number-based topo-ordering SZEDER Gábor via GitGitGadget
2020-05-11 11:56   ` [PATCH v2 12/12] line-log: integrate with changed-path Bloom filters Derrick Stolee via GitGitGadget

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).