git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "SZEDER Gábor" <szeder.dev@gmail.com>
To: Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org, peff@peff.net, me@ttaylorr.com,
	garimasigit@gmail.com, jnareb@gmail.com,
	Derrick Stolee <dstolee@microsoft.com>
Subject: Re: [PATCH v2 05/12] bloom: de-duplicate directory entries
Date: Sun, 7 Jun 2020 23:45:50 +0200	[thread overview]
Message-ID: <20200607214550.GA22200@szeder.dev> (raw)
In-Reply-To: <ba0d8c1f539751d9cfd556dac4eeda97758cf12e.1589198180.git.gitgitgadget@gmail.com>

On Mon, May 11, 2020 at 11:56:12AM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
> 
> When computing a changed-path Bloom filter, we need to take the
> files that changed from the diff computation and extract the parent
> directories. That way, a directory pathspec such as "Documentation"
> could match commits that change "Documentation/git.txt".
> 
> However, the current code does a poor job of this process. The paths
> are added to a hashmap, but we do not check if an entry already
> exists with that path. This can create many duplicate entries and
> cause the filter to have a much larger length than it should. This
> means that the filter is more sparse than intended, which helps the
> false positive rate, but wastes a lot of space.
> 
> Properly use hashmap_get() before hashmap_add(). Also be sure to
> include a comparison function so these can be matched correctly.
> 
> This has an effect on a test in t0095-bloom.sh. This makes sense,
> there are ten changes inside "smallDir" so the total number of
> paths in the filter should be 11. This would result in 11 * 10 bits
> required, and with 8 bits per byte, this results in 14 bytes.

This is the first patch where the chunk format and the specs are in
sync and the Bloom filters are as large as promised, so it would have
been nice to see how the false positive rate, the runtime of
pathspec-limited revision walks, and the size of the Bloom filters
chunk or the commit-graph file turned out...

> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  bloom.c          | 35 ++++++++++++++++++++++++++---------
>  t/t0095-bloom.sh |  4 ++--
>  2 files changed, 28 insertions(+), 11 deletions(-)
> 
> diff --git a/bloom.c b/bloom.c
> index 96792782719..196cda0a1bd 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -158,6 +158,19 @@ void init_bloom_filters(void)
>  	init_bloom_filter_slab(&bloom_filters);
>  }
>  
> +static int pathmap_cmp(const void *hashmap_cmp_fn_data,
> +		       const struct hashmap_entry *eptr,
> +		       const struct hashmap_entry *entry_or_key,
> +		       const void *keydata)
> +{
> +	const struct pathmap_hash_entry *e1, *e2;
> +
> +	e1 = container_of(eptr, const struct pathmap_hash_entry, entry);
> +	e2 = container_of(entry_or_key, const struct pathmap_hash_entry, entry);
> +
> +	return strcmp(e1->path, e2->path);
> +}
> +
>  struct bloom_filter *get_bloom_filter(struct repository *r,
>  				      struct commit *c,
>  				      int compute_if_not_present)
> @@ -206,25 +219,29 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>  		struct hashmap pathmap;
>  		struct pathmap_hash_entry *e;
>  		struct hashmap_iter iter;
> -		hashmap_init(&pathmap, NULL, NULL, 0);
> +		hashmap_init(&pathmap, pathmap_cmp, NULL, 0);
>  
>  		for (i = 0; i < diff_queued_diff.nr; i++) {
>  			const char *path = diff_queued_diff.queue[i]->two->path;
>  
>  			/*
> -			* Add each leading directory of the changed file, i.e. for
> -			* 'dir/subdir/file' add 'dir' and 'dir/subdir' as well, so
> -			* the Bloom filter could be used to speed up commands like
> -			* 'git log dir/subdir', too.
> -			*
> -			* Note that directories are added without the trailing '/'.
> -			*/
> +			 * Add each leading directory of the changed file, i.e. for
> +			 * 'dir/subdir/file' add 'dir' and 'dir/subdir' as well, so
> +			 * the Bloom filter could be used to speed up commands like
> +			 * 'git log dir/subdir', too.
> +			 *
> +			 * Note that directories are added without the trailing '/'.
> +			 */
>  			do {
>  				char *last_slash = strrchr(path, '/');
>  
>  				FLEX_ALLOC_STR(e, path, path);

Allocating a hashmap entry each time we query whether a path is
already present in the hashmap has some overhead.  It's cheaper to
create an entry on the stack just for the query and allocate an entry
on the heap only if the path isn't in the map.

Deduplicating without a hashmap is faster still.

>  				hashmap_entry_init(&e->entry, strhash(path));
> -				hashmap_add(&pathmap, &e->entry);
> +
> +				if (!hashmap_get(&pathmap, &e->entry, NULL))
> +					hashmap_add(&pathmap, &e->entry);
> +				else
> +					free(e);
>  
>  				if (!last_slash)
>  					last_slash = (char*)path;
> diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
> index 8f9eef116dc..6defeb544f1 100755
> --- a/t/t0095-bloom.sh
> +++ b/t/t0095-bloom.sh
> @@ -89,8 +89,8 @@ test_expect_success 'get bloom filter for commit with 10 changes' '
>  	git add smallDir &&
>  	git commit -m "commit with 10 changes" &&
>  	cat >expect <<-\EOF &&
> -	Filter_Length:25
> -	Filter_Data:82|a0|65|47|0c|92|90|c0|a1|40|02|a0|e2|40|e0|04|0a|9a|66|cf|80|19|85|42|23|
> +	Filter_Length:14
> +	Filter_Data:02|b3|c4|a0|34|e7|fe|eb|cb|47|fe|a0|e8|72|
>  	EOF
>  	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
>  	test_cmp expect actual
> -- 
> gitgitgadget
> 

  reply	other threads:[~2020-06-07 21:46 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-05-01 15:30 [PATCH 00/12] Integrating line-log and changed-path Bloom filters Derrick Stolee via GitGitGadget
2020-05-01 15:30 ` [PATCH 01/12] bloom: fix whitespace around tab length Derrick Stolee via GitGitGadget
2020-05-01 22:51   ` Taylor Blau
2020-05-01 15:30 ` [PATCH 02/12] test-bloom: fix usage typo Derrick Stolee via GitGitGadget
2020-05-01 22:51   ` Taylor Blau
2020-05-01 15:30 ` [PATCH 03/12] Documentation: changed-path Bloom filters use byte words Derrick Stolee via GitGitGadget
2020-05-01 22:55   ` Taylor Blau
2020-05-01 15:30 ` [PATCH 04/12] bloom: de-duplicate directory entries Derrick Stolee via GitGitGadget
2020-05-01 23:06   ` Taylor Blau
2020-05-01 15:30 ` [PATCH 05/12] bloom: parse commit before computing filters Derrick Stolee via GitGitGadget
2020-05-01 23:10   ` Taylor Blau
2020-05-01 15:30 ` [PATCH 06/12] bloom: use num_changes not nr for limit detection Derrick Stolee via GitGitGadget
2020-05-01 23:12   ` Taylor Blau
2020-05-01 15:30 ` [PATCH 07/12] completion: offer '--(no-)patch' among 'git log' options SZEDER Gábor via GitGitGadget
2020-05-01 23:44   ` Taylor Blau
2020-05-01 15:30 ` [PATCH 08/12] line-log: remove unused fields from 'struct line_log_data' SZEDER Gábor via GitGitGadget
2020-05-01 23:46   ` Taylor Blau
2020-05-01 15:30 ` [PATCH 09/12] t4211-line-log: add tests for parent oids SZEDER Gábor via GitGitGadget
2020-05-04 17:31   ` Taylor Blau
2020-05-01 15:30 ` [PATCH 10/12] line-log: more responsive, incremental 'git log -L' SZEDER Gábor via GitGitGadget
2020-05-04 17:52   ` Taylor Blau
2020-05-04 17:55     ` Derrick Stolee
2020-05-01 15:30 ` [PATCH 11/12] line-log: try to use generation number-based topo-ordering SZEDER Gábor via GitGitGadget
2020-05-04 21:25   ` Taylor Blau
2020-05-01 15:30 ` [PATCH 12/12] line-log: integrate with changed-path Bloom filters Derrick Stolee via GitGitGadget
2020-05-04 21:50   ` Taylor Blau
2020-05-01 17:34 ` [PATCH 00/12] Integrating line-log and " Junio C Hamano
2020-05-11 11:56 ` [PATCH v2 " Derrick Stolee via GitGitGadget
2020-05-11 11:56   ` [PATCH v2 01/12] bloom: fix whitespace around tab length Derrick Stolee via GitGitGadget
2020-05-11 11:56   ` [PATCH v2 02/12] test-bloom: fix usage typo Derrick Stolee via GitGitGadget
2020-05-11 11:56   ` [PATCH v2 03/12] bloom: parse commit before computing filters Derrick Stolee via GitGitGadget
2020-05-11 11:56   ` [PATCH v2 04/12] Documentation: changed-path Bloom filters use byte words Derrick Stolee via GitGitGadget
2020-05-11 11:56   ` [PATCH v2 05/12] bloom: de-duplicate directory entries Derrick Stolee via GitGitGadget
2020-06-07 21:45     ` SZEDER Gábor [this message]
2020-05-11 11:56   ` [PATCH v2 06/12] bloom: use num_changes not nr for limit detection Derrick Stolee via GitGitGadget
2020-08-04 14:51     ` SZEDER Gábor
2020-05-11 11:56   ` [PATCH v2 07/12] completion: offer '--(no-)patch' among 'git log' options SZEDER Gábor via GitGitGadget
2020-05-11 11:56   ` [PATCH v2 08/12] line-log: remove unused fields from 'struct line_log_data' SZEDER Gábor via GitGitGadget
2020-05-11 11:56   ` [PATCH v2 09/12] t4211-line-log: add tests for parent oids SZEDER Gábor via GitGitGadget
2020-05-11 11:56   ` [PATCH v2 10/12] line-log: more responsive, incremental 'git log -L' SZEDER Gábor via GitGitGadget
2020-05-11 11:56   ` [PATCH v2 11/12] line-log: try to use generation number-based topo-ordering SZEDER Gábor via GitGitGadget
2020-05-11 11:56   ` [PATCH v2 12/12] line-log: integrate with changed-path Bloom filters Derrick Stolee via GitGitGadget

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200607214550.GA22200@szeder.dev \
    --to=szeder.dev@gmail.com \
    --cc=dstolee@microsoft.com \
    --cc=garimasigit@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=jnareb@gmail.com \
    --cc=me@ttaylorr.com \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).