git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH 00/12] Integrating line-log and changed-path Bloom filters
@ 2020-05-01 15:30 Derrick Stolee via GitGitGadget
  2020-05-01 15:30 ` [PATCH 01/12] bloom: fix whitespace around tab length Derrick Stolee via GitGitGadget
                   ` (13 more replies)
  0 siblings, 14 replies; 42+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-05-01 15:30 UTC (permalink / raw)
  To: git; +Cc: peff, me, garimasigit, szeder.dev, jnareb, Derrick Stolee

The best way to learn how a feature works is to build on it. That's also the
best way to find defects.

I started on this patch series with two goals:

 1. Understand and submit Szeder's incremental line-log RFC [1].
 2. Integrate changed-path Bloom filters with line-log.

[1] https://lore.kernel.org/git/20190818182801.7580-1-szeder.dev@gmail.com/

Since I knew a lot about the Bloom filters and not much about line-log.c, I
expected to learn a lot that could contribute to improving Szeder's patches.
Instead, I discovered some deficiencies in the Bloom filter code, including
a few meaningful bugs that should be fixed before the topic is merge to
master.

This series is organized as follows:

 * Patches 1-4 are minor Bloom filter cleanups, including a documentation
   bug.
   
   
 * Patch 5 fixes a problem where the Bloom filters use much larger filters
   than expected.
   
   
 * Patch 6 fixes a bug related to the short-circuit of large diffs from
   e369698016 (diff: halt tree-diff early after max_changes, 2020-03-30).
   The condition for halting the diff computation is different than the
   check in bloom.c to see if that short-circuit was hit, which leads to
   some commits with large diffs getting an incomplete Bloom filter. This
   happened on the root commit of the Linux kernel repo, for example.
   
   
 * Patches 7-11 are Szeder's RFC, with my sign-off added. I have not changed
   the code from those submissions because I didn't find anything wrong with
   them as I was testing. However, I will take responsibility to respond to
   the feedback in this series.
   
   
 * Patch 12 integrates Bloom filters with the line-log code.
   
   

I organized these patches so patches 1-6 could be split into its own topic,
if beneficial to take earlier than the line-log changes.

The end result of this combined effort is the following: git log -L commands
feel much more responsive to a terminal user because Szeder's changes make
the computation incremental, and they have better end-to-end performance
because of the integration with Bloom filters to reduce time spent running
tree diffs for TREESAME commits.

The following table is also in patch 12, but it summarizes the results of
this series. These are timings for running git log -L 1,10:$path for these
recently-edited paths. The "Entire History" columns wait for the full
command to complete. The "First Result" columns add "-n 1" to the command,
so we see the benefits of the incremental algorithm. Thus, the performance
improvements from "Entire History" to "First Result" are due entirely to
Szeder's patches. The performance improvements from "Before" to "After" are
due to the changed-path Bloom filters.

|                              | Entire History  | First Result    |
| Path                         | Before | After  | Before | After  |
|------------------------------|--------|--------|--------|--------|
| MAINTAINERS                  | 4.26 s | 3.87 s | 0.41 s | 0.39 s |
| fs/namei.c                   | 1.99 s | 0.99 s | 0.42 s | 0.21 s |
| arch/x86/kvm/cpuid.c         | 5.28 s | 1.12 s | 0.16 s | 0.09 s |
| fs/io_uring.c                | 4.34 s | 0.99 s | 0.94 s | 0.27 s |
| arch/x86/kvm/vmx/vmx.c       | 5.01 s | 1.34 s | 0.21 s | 0.12 s |
| arch/x86/kvm/x86.c           | 2.24 s | 1.18 s | 0.21 s | 0.14 s |
| fs/btrfs/disk-io.c           | 1.82 s | 1.01 s | 0.06 s | 0.05 s |
| Documentation/scsi/index.rst | 3.30 s | 0.89 s | 1.46 s | 0.03 s |

Some splashy numbers include a 110x speedup for finding the first result for
Documentation/scsi/index.rst. That's certainly an outlier, but the rest have
an at least 10x speedup.

Thanks, -Stolee

Derrick Stolee (7):
  bloom: fix whitespace around tab length
  test-bloom: fix usage typo
  Documentation: changed-path Bloom filters use byte words
  bloom: de-duplicate directory entries
  bloom: parse commit before computing filters
  bloom: use num_changes not nr for limit detection
  line-log: integrate with changed-path Bloom filters

SZEDER Gábor (5):
  completion: offer '--(no-)patch' among 'git log' options
  line-log: remove unused fields from 'struct line_log_data'
  t4211-line-log: add tests for parent oids
  line-log: more responsive, incremental 'git log -L'
  line-log: try to use generation number-based topo-ordering

 .../technical/commit-graph-format.txt         |  8 +--
 bloom.c                                       | 61 ++++++++++++-----
 bloom.h                                       |  5 +-
 contrib/completion/git-completion.bash        |  1 +
 line-log.c                                    | 43 +++++++++++-
 line-log.h                                    |  5 +-
 revision.c                                    | 43 ++++++++++--
 t/helper/test-bloom.c                         |  2 +-
 t/t0095-bloom.sh                              |  6 +-
 t/t4211-line-log.sh                           | 68 +++++++++++++++++++
 10 files changed, 201 insertions(+), 41 deletions(-)


base-commit: 1b4c57fa87e121f155863f898dc39d06cf4a1d99
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-622%2Fderrickstolee%2Flog-l-on-bloom-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-622/derrickstolee/log-l-on-bloom-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/622
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2020-08-04 14:51 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-01 15:30 [PATCH 00/12] Integrating line-log and changed-path Bloom filters Derrick Stolee via GitGitGadget
2020-05-01 15:30 ` [PATCH 01/12] bloom: fix whitespace around tab length Derrick Stolee via GitGitGadget
2020-05-01 22:51   ` Taylor Blau
2020-05-01 15:30 ` [PATCH 02/12] test-bloom: fix usage typo Derrick Stolee via GitGitGadget
2020-05-01 22:51   ` Taylor Blau
2020-05-01 15:30 ` [PATCH 03/12] Documentation: changed-path Bloom filters use byte words Derrick Stolee via GitGitGadget
2020-05-01 22:55   ` Taylor Blau
2020-05-01 15:30 ` [PATCH 04/12] bloom: de-duplicate directory entries Derrick Stolee via GitGitGadget
2020-05-01 23:06   ` Taylor Blau
2020-05-01 15:30 ` [PATCH 05/12] bloom: parse commit before computing filters Derrick Stolee via GitGitGadget
2020-05-01 23:10   ` Taylor Blau
2020-05-01 15:30 ` [PATCH 06/12] bloom: use num_changes not nr for limit detection Derrick Stolee via GitGitGadget
2020-05-01 23:12   ` Taylor Blau
2020-05-01 15:30 ` [PATCH 07/12] completion: offer '--(no-)patch' among 'git log' options SZEDER Gábor via GitGitGadget
2020-05-01 23:44   ` Taylor Blau
2020-05-01 15:30 ` [PATCH 08/12] line-log: remove unused fields from 'struct line_log_data' SZEDER Gábor via GitGitGadget
2020-05-01 23:46   ` Taylor Blau
2020-05-01 15:30 ` [PATCH 09/12] t4211-line-log: add tests for parent oids SZEDER Gábor via GitGitGadget
2020-05-04 17:31   ` Taylor Blau
2020-05-01 15:30 ` [PATCH 10/12] line-log: more responsive, incremental 'git log -L' SZEDER Gábor via GitGitGadget
2020-05-04 17:52   ` Taylor Blau
2020-05-04 17:55     ` Derrick Stolee
2020-05-01 15:30 ` [PATCH 11/12] line-log: try to use generation number-based topo-ordering SZEDER Gábor via GitGitGadget
2020-05-04 21:25   ` Taylor Blau
2020-05-01 15:30 ` [PATCH 12/12] line-log: integrate with changed-path Bloom filters Derrick Stolee via GitGitGadget
2020-05-04 21:50   ` Taylor Blau
2020-05-01 17:34 ` [PATCH 00/12] Integrating line-log and " Junio C Hamano
2020-05-11 11:56 ` [PATCH v2 " Derrick Stolee via GitGitGadget
2020-05-11 11:56   ` [PATCH v2 01/12] bloom: fix whitespace around tab length Derrick Stolee via GitGitGadget
2020-05-11 11:56   ` [PATCH v2 02/12] test-bloom: fix usage typo Derrick Stolee via GitGitGadget
2020-05-11 11:56   ` [PATCH v2 03/12] bloom: parse commit before computing filters Derrick Stolee via GitGitGadget
2020-05-11 11:56   ` [PATCH v2 04/12] Documentation: changed-path Bloom filters use byte words Derrick Stolee via GitGitGadget
2020-05-11 11:56   ` [PATCH v2 05/12] bloom: de-duplicate directory entries Derrick Stolee via GitGitGadget
2020-06-07 21:45     ` SZEDER Gábor
2020-05-11 11:56   ` [PATCH v2 06/12] bloom: use num_changes not nr for limit detection Derrick Stolee via GitGitGadget
2020-08-04 14:51     ` SZEDER Gábor
2020-05-11 11:56   ` [PATCH v2 07/12] completion: offer '--(no-)patch' among 'git log' options SZEDER Gábor via GitGitGadget
2020-05-11 11:56   ` [PATCH v2 08/12] line-log: remove unused fields from 'struct line_log_data' SZEDER Gábor via GitGitGadget
2020-05-11 11:56   ` [PATCH v2 09/12] t4211-line-log: add tests for parent oids SZEDER Gábor via GitGitGadget
2020-05-11 11:56   ` [PATCH v2 10/12] line-log: more responsive, incremental 'git log -L' SZEDER Gábor via GitGitGadget
2020-05-11 11:56   ` [PATCH v2 11/12] line-log: try to use generation number-based topo-ordering SZEDER Gábor via GitGitGadget
2020-05-11 11:56   ` [PATCH v2 12/12] line-log: integrate with changed-path Bloom filters Derrick Stolee via GitGitGadget

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).