git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com>
To: git@vger.kernel.org
Cc: peff@peff.net, me@ttaylorr.com, garimasigit@gmail.com,
	szeder.dev@gmail.com, jnareb@gmail.com,
	Derrick Stolee <dstolee@microsoft.com>
Subject: [PATCH v2 00/12] Integrating line-log and changed-path Bloom filters
Date: Mon, 11 May 2020 11:56:07 +0000	[thread overview]
Message-ID: <pull.622.v2.git.1589198180.gitgitgadget@gmail.com> (raw)
In-Reply-To: <pull.622.git.1588347029.gitgitgadget@gmail.com>

The best way to learn how a feature works is to build on it. That's also the
best way to find defects.

I started on this patch series with two goals:

 1. Understand and submit Szeder's incremental line-log RFC [1].
 2. Integrate changed-path Bloom filters with line-log.

[1] https://lore.kernel.org/git/20190818182801.7580-1-szeder.dev@gmail.com/

Since I knew a lot about the Bloom filters and not much about line-log.c, I
expected to learn a lot that could contribute to improving Szeder's patches.
Instead, I discovered some deficiencies in the Bloom filter code, including
a few meaningful bugs that should be fixed before the topic is merge to
master.

This series is organized as follows:

 * Patches 1-4 are minor Bloom filter cleanups, including a documentation
   bug.
   
   
 * Patch 5 fixes a problem where the Bloom filters use much larger filters
   than expected.
   
   
 * Patch 6 fixes a bug related to the short-circuit of large diffs from
   e369698016 (diff: halt tree-diff early after max_changes, 2020-03-30).
   The condition for halting the diff computation is different than the
   check in bloom.c to see if that short-circuit was hit, which leads to
   some commits with large diffs getting an incomplete Bloom filter. This
   happened on the root commit of the Linux kernel repo, for example.
   
   
 * Patches 7-11 are Szeder's RFC, with my sign-off added. I have not changed
   the code from those submissions because I didn't find anything wrong with
   them as I was testing. However, I will take responsibility to respond to
   the feedback in this series.
   
   
 * Patch 12 integrates Bloom filters with the line-log code.
   
   

I organized these patches so patches 1-6 could be split into its own topic,
if beneficial to take earlier than the line-log changes.

The end result of this combined effort is the following: git log -L commands
feel much more responsive to a terminal user because Szeder's changes make
the computation incremental, and they have better end-to-end performance
because of the integration with Bloom filters to reduce time spent running
tree diffs for TREESAME commits.

The following table is also in patch 12, but it summarizes the results of
this series. These are timings for running git log -L 1,10:$path for these
recently-edited paths. The "Entire History" columns wait for the full
command to complete. The "First Result" columns add "-n 1" to the command,
so we see the benefits of the incremental algorithm. Thus, the performance
improvements from "Entire History" to "First Result" are due entirely to
Szeder's patches. The performance improvements from "Before" to "After" are
due to the changed-path Bloom filters.

|                              | Entire History  | First Result    |
| Path                         | Before | After  | Before | After  |
|------------------------------|--------|--------|--------|--------|
| MAINTAINERS                  | 4.26 s | 3.87 s | 0.41 s | 0.39 s |
| fs/namei.c                   | 1.99 s | 0.99 s | 0.42 s | 0.21 s |
| arch/x86/kvm/cpuid.c         | 5.28 s | 1.12 s | 0.16 s | 0.09 s |
| fs/io_uring.c                | 4.34 s | 0.99 s | 0.94 s | 0.27 s |
| arch/x86/kvm/vmx/vmx.c       | 5.01 s | 1.34 s | 0.21 s | 0.12 s |
| arch/x86/kvm/x86.c           | 2.24 s | 1.18 s | 0.21 s | 0.14 s |
| fs/btrfs/disk-io.c           | 1.82 s | 1.01 s | 0.06 s | 0.05 s |
| Documentation/scsi/index.rst | 3.30 s | 0.89 s | 1.46 s | 0.03 s |

Some splashy numbers include a 110x speedup for finding the first result for
Documentation/scsi/index.rst. That's certainly an outlier, but the rest have
an at least 10x speedup.

Update in V2: Modified commit messages according to Taylor's feedback.

Thanks, -Stolee

Derrick Stolee (7):
  bloom: fix whitespace around tab length
  test-bloom: fix usage typo
  bloom: parse commit before computing filters
  Documentation: changed-path Bloom filters use byte words
  bloom: de-duplicate directory entries
  bloom: use num_changes not nr for limit detection
  line-log: integrate with changed-path Bloom filters

SZEDER Gábor (5):
  completion: offer '--(no-)patch' among 'git log' options
  line-log: remove unused fields from 'struct line_log_data'
  t4211-line-log: add tests for parent oids
  line-log: more responsive, incremental 'git log -L'
  line-log: try to use generation number-based topo-ordering

 .../technical/commit-graph-format.txt         |  8 +--
 bloom.c                                       | 61 ++++++++++++-----
 bloom.h                                       |  5 +-
 contrib/completion/git-completion.bash        |  1 +
 line-log.c                                    | 43 +++++++++++-
 line-log.h                                    |  5 +-
 revision.c                                    | 43 ++++++++++--
 t/helper/test-bloom.c                         |  2 +-
 t/t0095-bloom.sh                              |  6 +-
 t/t4211-line-log.sh                           | 68 +++++++++++++++++++
 10 files changed, 201 insertions(+), 41 deletions(-)


base-commit: 1b4c57fa87e121f155863f898dc39d06cf4a1d99
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-622%2Fderrickstolee%2Flog-l-on-bloom-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-622/derrickstolee/log-l-on-bloom-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/622

Range-diff vs v1:

  1:  89625b0efae =  1:  89625b0efae bloom: fix whitespace around tab length
  2:  572d0508fe0 =  2:  572d0508fe0 test-bloom: fix usage typo
  5:  ef4c08e401b !  3:  0e08cec78d3 bloom: parse commit before computing filters
     @@ bloom.c: struct bloom_filter *get_bloom_filter(struct repository *r,
       	diff_setup_done(&diffopt);
       
      +	/* ensure commit is parsed so we have parent information */
     -+	parse_commit(c);
     ++	repo_parse_commit(r, c);
      +
       	if (c->parents)
       		diff_tree_oid(&c->parents->item->object.oid, &c->object.oid, "", &diffopt);
  3:  03b2c84db36 !  4:  61f78a2b0dd Documentation: changed-path Bloom filters use byte words
     @@ Commit message
          In Documentation/technical/commit-graph-format.txt, the definition
          of the BIDX chunk specifies the length is a number of 8-byte words.
          During development we discovered that using 8-byte words in the
     -    Murmur3 hash algorithm causes issues with Big-Endian versus Little-
     -    Endian machines. Thus, the hash algorithm was adapted to work on a
     +    Murmur3 hash algorithm causes issues with big-endian versus little-
     +    endian machines. Thus, the hash algorithm was adapted to work on a
          byte-by-byte basis. However, this caused a change in the definition
          of a "word" in bloom.h. Now, a "word" is a single byte, which allows
          filters to be as small as two bytes. These length-two filters are
  4:  07d0a25f1c4 =  5:  ba0d8c1f539 bloom: de-duplicate directory entries
  6:  7d5561575d5 !  6:  8278b5c0918 bloom: use num_changes not nr for limit detection
     @@ Commit message
          total number of changed paths is strictly larger than max_changes.
          This includes the directories that changed, not just the file paths.
          However, only the file paths are reflected in the resulting diff
     -    queue "nr" value.
     +    queue's "nr" value.
      
          Use the "num_changes" from diffopt to check if the diff terminated
          early. This is incredibly important, as it can result in incorrect
  7:  35d2901957e =  7:  05f8ee14752 completion: offer '--(no-)patch' among 'git log' options
  8:  1f326612da0 =  8:  c3f87ec4379 line-log: remove unused fields from 'struct line_log_data'
  9:  4e3d7233095 !  9:  69c2c2f775f  t4211-line-log: add tests for parent oids
     @@ Metadata
      Author: SZEDER Gábor <szeder.dev@gmail.com>
      
       ## Commit message ##
     -     t4211-line-log: add tests for parent oids
     +    t4211-line-log: add tests for parent oids
      
          None of the tests in 't4211-line-log.sh' really check which parent
          object IDs are shown in the output, either implicitly as part of
 10:  d9991d6158d ! 10:  009da4b93f6 line-log: more responsive, incremental 'git log -L'
     @@ Commit message
      
          To be clear: this patch doesn't actually optimize the line-level log,
          but merely moves most of the work from the preprocessing step to the
     -    history travelsal, so the commits modifying the line range can be
     +    history traversal, so the commits modifying the line range can be
          shown as soon as they are processed, and the traversal can be
          terminated as soon as the given number of commits are shown.
          Consequently, listing the full history of a line range, potentially
     @@ Commit message
          commits. In addition to that edit frequency, the file itself is quite
          large (~18,700 lines). This means that a significant portion of the
          computation is taken up by computing the patch-diff of the file. This
     -    patch improves the time it takes to output the first result quite a
     -    bit:
     +    patch improves the real time it takes to output the first result quite
     +    a bit:
      
          Command: git log -L 100,200:MAINTAINERS -n 1 >/dev/null
           Before: 3.88 s
 11:  3abc7130924 ! 11:  da087d2acbb line-log: try to use generation number-based topo-ordering
     @@ Commit message
      
          Additional testing by Derrick Stolee: Since this patch improves
          the performance for the first result, I repeated the experiment
     -    from the previous patch on the Linux kernel repository:
     +    from the previous patch on the Linux kernel repository, reporting
     +    real time here:
      
              Command: git log -L 100,200:MAINTAINERS -n 1 >/dev/null
               Before: 0.71 s
 12:  7e0c2871cf7 ! 12:  305f1eb0982 line-log: integrate with changed-path Bloom filters
     @@ Commit message
      
          (along with a bogus first result). It appears that the path
          arch/x86/kvm/svm.c was renamed, so we ignore that entry. This leaves the
     -    following results:
     +    following results for the real command time:
      
          |                              | Entire History  | First Result    |
          | Path                         | Before | After  | Before | After  |

-- 
gitgitgadget

  parent reply	other threads:[~2020-05-11 11:56 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-05-01 15:30 [PATCH 00/12] Integrating line-log and changed-path Bloom filters Derrick Stolee via GitGitGadget
2020-05-01 15:30 ` [PATCH 01/12] bloom: fix whitespace around tab length Derrick Stolee via GitGitGadget
2020-05-01 22:51   ` Taylor Blau
2020-05-01 15:30 ` [PATCH 02/12] test-bloom: fix usage typo Derrick Stolee via GitGitGadget
2020-05-01 22:51   ` Taylor Blau
2020-05-01 15:30 ` [PATCH 03/12] Documentation: changed-path Bloom filters use byte words Derrick Stolee via GitGitGadget
2020-05-01 22:55   ` Taylor Blau
2020-05-01 15:30 ` [PATCH 04/12] bloom: de-duplicate directory entries Derrick Stolee via GitGitGadget
2020-05-01 23:06   ` Taylor Blau
2020-05-01 15:30 ` [PATCH 05/12] bloom: parse commit before computing filters Derrick Stolee via GitGitGadget
2020-05-01 23:10   ` Taylor Blau
2020-05-01 15:30 ` [PATCH 06/12] bloom: use num_changes not nr for limit detection Derrick Stolee via GitGitGadget
2020-05-01 23:12   ` Taylor Blau
2020-05-01 15:30 ` [PATCH 07/12] completion: offer '--(no-)patch' among 'git log' options SZEDER Gábor via GitGitGadget
2020-05-01 23:44   ` Taylor Blau
2020-05-01 15:30 ` [PATCH 08/12] line-log: remove unused fields from 'struct line_log_data' SZEDER Gábor via GitGitGadget
2020-05-01 23:46   ` Taylor Blau
2020-05-01 15:30 ` [PATCH 09/12] t4211-line-log: add tests for parent oids SZEDER Gábor via GitGitGadget
2020-05-04 17:31   ` Taylor Blau
2020-05-01 15:30 ` [PATCH 10/12] line-log: more responsive, incremental 'git log -L' SZEDER Gábor via GitGitGadget
2020-05-04 17:52   ` Taylor Blau
2020-05-04 17:55     ` Derrick Stolee
2020-05-01 15:30 ` [PATCH 11/12] line-log: try to use generation number-based topo-ordering SZEDER Gábor via GitGitGadget
2020-05-04 21:25   ` Taylor Blau
2020-05-01 15:30 ` [PATCH 12/12] line-log: integrate with changed-path Bloom filters Derrick Stolee via GitGitGadget
2020-05-04 21:50   ` Taylor Blau
2020-05-01 17:34 ` [PATCH 00/12] Integrating line-log and " Junio C Hamano
2020-05-11 11:56 ` Derrick Stolee via GitGitGadget [this message]
2020-05-11 11:56   ` [PATCH v2 01/12] bloom: fix whitespace around tab length Derrick Stolee via GitGitGadget
2020-05-11 11:56   ` [PATCH v2 02/12] test-bloom: fix usage typo Derrick Stolee via GitGitGadget
2020-05-11 11:56   ` [PATCH v2 03/12] bloom: parse commit before computing filters Derrick Stolee via GitGitGadget
2020-05-11 11:56   ` [PATCH v2 04/12] Documentation: changed-path Bloom filters use byte words Derrick Stolee via GitGitGadget
2020-05-11 11:56   ` [PATCH v2 05/12] bloom: de-duplicate directory entries Derrick Stolee via GitGitGadget
2020-06-07 21:45     ` SZEDER Gábor
2020-05-11 11:56   ` [PATCH v2 06/12] bloom: use num_changes not nr for limit detection Derrick Stolee via GitGitGadget
2020-08-04 14:51     ` SZEDER Gábor
2020-05-11 11:56   ` [PATCH v2 07/12] completion: offer '--(no-)patch' among 'git log' options SZEDER Gábor via GitGitGadget
2020-05-11 11:56   ` [PATCH v2 08/12] line-log: remove unused fields from 'struct line_log_data' SZEDER Gábor via GitGitGadget
2020-05-11 11:56   ` [PATCH v2 09/12] t4211-line-log: add tests for parent oids SZEDER Gábor via GitGitGadget
2020-05-11 11:56   ` [PATCH v2 10/12] line-log: more responsive, incremental 'git log -L' SZEDER Gábor via GitGitGadget
2020-05-11 11:56   ` [PATCH v2 11/12] line-log: try to use generation number-based topo-ordering SZEDER Gábor via GitGitGadget
2020-05-11 11:56   ` [PATCH v2 12/12] line-log: integrate with changed-path Bloom filters Derrick Stolee via GitGitGadget

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=pull.622.v2.git.1589198180.gitgitgadget@gmail.com \
    --to=gitgitgadget@gmail.com \
    --cc=dstolee@microsoft.com \
    --cc=garimasigit@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=jnareb@gmail.com \
    --cc=me@ttaylorr.com \
    --cc=peff@peff.net \
    --cc=szeder.dev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).