git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Derrick Stolee <stolee@gmail.com>
To: "Đoàn Trần Công Danh" <congdanhqx@gmail.com>, git@vger.kernel.org
Subject: Re: test-tool: bloom: generate_filter for multiple string?
Date: Thu, 31 Dec 2020 06:31:52 -0500	[thread overview]
Message-ID: <dfe3fbf9-4708-cd53-d78e-780375224b8e@gmail.com> (raw)
In-Reply-To: <20201231035438.22533-1-congdanhqx@gmail.com>

On 12/30/2020 10:54 PM, Đoàn Trần Công Danh wrote:
> 
> Hello,
> 
> I'm reading the code for Bloom Filter to see if arXiv:2012.00472
> could be an improvement.
> 
> I'm not sure if I'm missing something or it's our test-bloom generate_filter
> doesn't really support testing for multiple inputs.

It doesn't support creating a _realistic_ filter of the given size. The tests
at this level are more about keeping the filters consistent so we can trust
that we are not accidentally changing the hashing algorithm and its file format.

The test works if we provide multiple inputs, the problem is that the resulting
filter has a higher density of bits than we expect, since we didn't size it
according to the number of inputs.

> If I understand correctly, we should either:
> * allocate more entry for inputs; or
> * abort if argc != 3

Either approach would be fine. I wonder what your goal is for testing the
multiple inputs. Are you expecting a larger filter to demonstrate some
behavior that is not tested by a small filter?

> diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
> index 46e97b04eb..1026facc59 100644
> --- a/t/helper/test-bloom.c
> +++ b/t/helper/test-bloom.c
> @@ -68,12 +68,14 @@ int cmd__bloom(int argc, const char **argv)
>  	if (!strcmp(argv[1], "generate_filter")) {
>  		struct bloom_filter filter;
>  		int i = 2;
> -		filter.len =  (settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
> -		filter.data = xcalloc(filter.len, sizeof(unsigned char));
>  
>  		if (argc - 1 < i)
>  			usage(bloom_usage);
>  
> +		filter.len = (settings.bits_per_entry * (argc - 2) + BITS_PER_WORD - 1)
> +			/ BITS_PER_WORD;
> +		filter.data = xcalloc(filter.len, sizeof(unsigned char));
> +

Whitespace issues aside, this is the right approach to make the filter grow with
the input.

>  		while (argv[i]) {
>  			add_string_to_filter(argv[i], &filter);
>  			i++;
> diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
> index 7e4ab1795f..6d83927c86 100755
> --- a/t/t0095-bloom.sh
> +++ b/t/t0095-bloom.sh
> @@ -67,6 +67,17 @@ test_expect_success 'compute bloom key for test string 2' '
>  	test_cmp expect actual
>  '
>  
> +test_expect_success 'compute bloom key for multiple string' '
> +	cat >expect <<-\EOF &&
> +	Hashes:0xb270de9b|0x1bb6f26e|0x84fd0641|0xee431a14|0x57892de7|0xc0cf41ba|0x2a15558d|
> +	Hashes:0x20ab385b|0xf5237fe2|0xc99bc769|0x9e140ef0|0x728c5677|0x47049dfe|0x1b7ce585|

Multiple hash outputs work already, helping us build confidence that our
hash algorithm hasn't changed.

> +	Filter_Length:3
> +	Filter_Data:45|ba|ac|

And this is the part of the output that you are changing.

> +	EOF
> +	test-tool bloom generate_filter "Hello world!" file.txt >actual &&
> +	test_cmp expect actual
> +'
> +

So, your suggested change has merit. I just wonder what value it provides on top
of the current implementation? If your goal is to create a way to inspect a full-sized
filter, then that would be interesting to explore.

Thanks,
-Stolee


  reply	other threads:[~2020-12-31 11:38 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-12-31  3:54 test-tool: bloom: generate_filter for multiple string? Đoàn Trần Công Danh
2020-12-31 11:31 ` Derrick Stolee [this message]
2021-01-05 13:34   ` Đoàn Trần Công Danh
2021-01-12 19:53 ` Taylor Blau
2021-01-13 11:59   ` Đoàn Trần Công Danh
2021-01-13 12:06     ` Derrick Stolee
2021-01-13 12:13       ` Đoàn Trần Công Danh
2021-01-13 15:13     ` Taylor Blau

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=dfe3fbf9-4708-cd53-d78e-780375224b8e@gmail.com \
    --to=stolee@gmail.com \
    --cc=congdanhqx@gmail.com \
    --cc=git@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).