From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net X-Spam-Level: X-Spam-ASN: AS31976 209.132.180.0/23 X-Spam-Status: No, score=-3.8 required=3.0 tests=AWL,BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI, RCVD_IN_MSPIKE_H3,RCVD_IN_MSPIKE_WL,SPF_HELO_NONE,SPF_NONE shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by dcvr.yhbt.net (Postfix) with ESMTP id E9D5E1F45E for ; Thu, 20 Feb 2020 18:48:34 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728410AbgBTSse (ORCPT ); Thu, 20 Feb 2020 13:48:34 -0500 Received: from mail-lj1-f196.google.com ([209.85.208.196]:41155 "EHLO mail-lj1-f196.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726959AbgBTSsd (ORCPT ); Thu, 20 Feb 2020 13:48:33 -0500 Received: by mail-lj1-f196.google.com with SMTP id h23so5317700ljc.8 for ; Thu, 20 Feb 2020 10:48:30 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:references:date:in-reply-to:message-id :user-agent:mime-version:content-transfer-encoding; bh=/HKCPnG33wysgVc9Ent3dFfIj+AIAZEEt/Nl/qYv6EI=; b=k+ULatVAecs7qZ1H0hT7qDOVa1LVMrm9qrJJp4XKLz/+CObh/2HAyx2waiVM2Yyayz 8bLg30uh2VeXnvQ3GCqOt4nrYXygKrAVUuYazMmTMQO8bilubnGSZQx6mxq1Ij0PTkFf OrFoq2210i7XbcmP1Lq0coF0BnP3UkkqgGOlJoZjmBU5P2vh+qBK+r8BWfbkUOj/XufA kYaoePBpoJ9yOXdiarimkacX41Vz4ERmMOslvEHz0rGHKC8iOWrYW39zVj6LMqmiWi+n bojRsRgC5ZRdiRpPL3uMeGv0o6MnZPwx30iTpnqCRXiGRwTIwD1EbGq5mOqCpc54QQHB 8oHQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:references:date:in-reply-to :message-id:user-agent:mime-version:content-transfer-encoding; bh=/HKCPnG33wysgVc9Ent3dFfIj+AIAZEEt/Nl/qYv6EI=; b=jAUDtlhlddRgjXaIVPAN6QRi1jvOERegk7+O47Z7irQx6zTYObmHIkjgIQvZU+ALdG tdHg03xvzHHPsQ7tFej/45pyyLzh0Qqi389O+E66N276gVY5vNu/r/JzXiQXPP1jbbtO 3VH7AnS/NcAXRR5LPnEtHosj2DLDhWJMd5LWxTgUIEAihaCy74+B+/0IgLcIyFdYF4T5 HtvZjKdVs2hnr7hKXQp5cu26JzJP9L6XnxLNzG/kIJRxguDlHU3osxC021PI0CiltFEQ qkbgOAHNXWwXR+KilawvYJfo0uu+atUSFzdHMCzxlmQrl8Te4qkaOgRAxLr8MLhZyhfa rYRQ== X-Gm-Message-State: APjAAAUM/PwcXZ36bl4RmAtgEkLUKg0SvHCrfQRi1kyfh1FmAMjUW6y6 HrG4q0N8+xzBrnzt1hmhi4s= X-Google-Smtp-Source: APXvYqyKlppISyL6Ij3RqyJbXDfPf3mI2LEpRALBxdge50VWN9/ca+Mbzs+1itNf0+hLmHrF177QBQ== X-Received: by 2002:a2e:868c:: with SMTP id l12mr19763316lji.194.1582224509750; Thu, 20 Feb 2020 10:48:29 -0800 (PST) Received: from Laptop-Acer-Aspire-F15 (host-89-229-7-83.dynamic.mm.pl. [89.229.7.83]) by smtp.gmail.com with ESMTPSA id b13sm189345lfi.77.2020.02.20.10.48.27 (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256); Thu, 20 Feb 2020 10:48:28 -0800 (PST) From: Jakub Narebski To: "Garima Singh via GitGitGadget" Cc: git@vger.kernel.org, Derrick Stolee , SZEDER =?utf-8?Q?G=C3=A1bor?= , Jonathan Tan , Jeff Hostetler , Taylor Blau , Jeff King , Garima Singh , Christian Couder , Emily Shaffer , Junio C Hamano , Garima Singh Subject: Re: [PATCH v2 08/11] commit-graph: reuse existing Bloom filters during write. References: Date: Thu, 20 Feb 2020 19:48:21 +0100 In-Reply-To: (Garima Singh via GitGitGadget's message of "Wed, 05 Feb 2020 22:56:27 +0000") Message-ID: <86r1ypf62y.fsf@gmail.com> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.2 (windows-nt) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org "Garima Singh via GitGitGadget" writes: > From: Garima Singh > > Read previously computed Bloom filters from the commit-graph file if > possible to avoid recomputing during commit-graph write. All right, what is written makes sense for this point in patch series. But it my opinion it is more important to state that this commit adds "parsing" of the Bloom filter data from commit-graph file. This means that it needs to be calculated only once, then stored in commit-graph, ready to be re-used. > > See Documentation/technical/commit-graph-format for the format in which > the Bloom filter information is written to the commit graph file. > > To read Bloom filter for a given commit with lexicographic position > 'i' we need to: > 1. Read BIDX[i] which essentially gives us the starting index in BDAT for > filter of commit i+1. It is essentially the index past the end > of the filter of commit i. It is called end_index in the code. > > 2. For i>0, read BIDX[i-1] which will give us the starting index in BDAT > for filter of commit i. It is called the start_index in the code. > For the first commit, where i =3D 0, Bloom filter data starts at the > beginning, just past the header in the BDAT chunk. Hence, start_index > will be 0. > > 3. The length of the filter will be end_index - start_index, because > BIDX[i] gives the cumulative 8-byte words including the ith > commit's filter. > > We toggle whether Bloom filters should be recomputed based on the > compute_if_null flag. Nitpick: the flag (the parameter) is called compute_if_not_present, not compute_if_null. All right, this explanation is nice and clear. > > Helped-by: Derrick Stolee > Signed-off-by: Garima Singh > --- > bloom.c | 49 ++++++++++++++++++++++++++++++++++++++++++- > bloom.h | 4 +++- > commit-graph.c | 7 ++++--- > t/helper/test-bloom.c | 2 +- > 4 files changed, 56 insertions(+), 6 deletions(-) > > diff --git a/bloom.c b/bloom.c > index 818382c03b..90d84dc713 100644 > --- a/bloom.c > +++ b/bloom.c > @@ -1,5 +1,7 @@ > #include "git-compat-util.h" > #include "bloom.h" > +#include "commit.h" > +#include "commit-slab.h" > #include "commit-graph.h" > #include "object-store.h" > #include "diff.h" > @@ -127,8 +129,39 @@ void add_key_to_filter(struct bloom_key *key, > } > } >=20=20 > +static int load_bloom_filter_from_graph(struct commit_graph *g, > + struct bloom_filter *filter, > + struct commit *c) > +{ > + uint32_t lex_pos, start_index, end_index; > + > + while (c->graph_pos < g->num_commits_in_base) > + g =3D g->base_graph; > + > + /* The commit graph commit 'c' lives in doesn't carry bloom filters. */ > + if (!g->chunk_bloom_indexes) > + return 0; > + > + lex_pos =3D c->graph_pos - g->num_commits_in_base; All right, this finds lexicographical position of the commit following the chain of incremental commit-graph files, and also check if the commit-graph fragment that contains the commit in question has Bloom filter data included. > + > + end_index =3D get_be32(g->chunk_bloom_indexes + 4 * lex_pos); > + > + if (lex_pos) Wouldn't it be better to be more explicit, and write + if (lex_pos > 0) > + start_index =3D get_be32(g->chunk_bloom_indexes + 4 * (lex_pos - 1)); > + else > + start_index =3D 0; All right, here we find start_index and end_index. It might be good idea to at least assert() that start_index <=3D end_index, though that should not happen (that is why I propose for this check to be compiled on only for debug builds). > + > + filter->len =3D end_index - start_index; > + filter->data =3D (uint64_t *)(g->chunk_bloom_data + > + sizeof(uint64_t) * start_index + > + BLOOMDATA_CHUNK_HEADER_SIZE); All right, nice use of constant. > + > + return 1; > +} > + > struct bloom_filter *get_bloom_filter(struct repository *r, > - struct commit *c) > + struct commit *c, > + int compute_if_not_present) > { > struct bloom_filter *filter; > struct bloom_filter_settings settings =3D DEFAULT_BLOOM_FILTER_SETTINGS; > @@ -141,6 +174,20 @@ struct bloom_filter *get_bloom_filter(struct reposit= ory *r, >=20=20 > filter =3D bloom_filter_slab_at(&bloom_filters, c); >=20=20 > + if (!filter->data) { > + load_commit_graph_info(r, c); > + if (c->graph_pos !=3D COMMIT_NOT_FROM_GRAPH && > + r->objects->commit_graph->chunk_bloom_indexes) { All right, the limitation that the top layer of incremental commit graph needs to have Bloom filters enabled for it to be even considered is reasonable tradeoff, in my opinion. > + if (load_bloom_filter_from_graph(r->objects->commit_graph, filter, c)) > + return filter; > + else > + return NULL; If it should have filter, return it, otherwise return NULL. I wonder however when it can return NULL (and whether it should compute Bloom filters if required instead). > + } > + } > + > + if (filter->data || !compute_if_not_present) > + return filter; If we have filter from slab, return it. All right. However, according to documentation contained in comments in commit-slab.h, bloom_filter_slab_at() will allocate the location to store the data, and return freshly allocated memory... fortunately it uses xcalloc() so returned bloom_filter would have ->len =3D=3D 0 and ->data =3D=3D 0. > + > repo_diff_setup(r, &diffopt); > diffopt.flags.recursive =3D 1; > diffopt.max_changes =3D max_changes; > diff --git a/bloom.h b/bloom.h > index 7f40c751f7..76f8a9ad0c 100644 > --- a/bloom.h > +++ b/bloom.h > @@ -13,6 +13,7 @@ struct bloom_filter_settings { >=20=20 > #define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 } > #define BITS_PER_WORD 64 > +#define BLOOMDATA_CHUNK_HEADER_SIZE 3*sizeof(uint32_t) All right. >=20=20 > /* > * A bloom_filter struct represents a data segment to > @@ -47,7 +48,8 @@ void add_key_to_filter(struct bloom_key *key, > struct bloom_filter_settings *settings); >=20=20 > struct bloom_filter *get_bloom_filter(struct repository *r, > - struct commit *c); > + struct commit *c, > + int compute_if_not_present); > All right, adding new parameter (changing function signature). > int bloom_filter_contains(struct bloom_filter *filter, > struct bloom_key *key, > diff --git a/commit-graph.c b/commit-graph.c > index 4585b3b702..c0e9834bf2 100644 > --- a/commit-graph.c > +++ b/commit-graph.c > @@ -1094,7 +1094,7 @@ static void write_graph_chunk_bloom_indexes(struct = hashfile *f, > ctx->commits.nr); >=20=20 > while (list < last) { > - struct bloom_filter *filter =3D get_bloom_filter(ctx->r, *list); > + struct bloom_filter *filter =3D get_bloom_filter(ctx->r, *list, 0); > cur_pos +=3D filter->len; > display_progress(progress, ++i); > hashwrite_be32(f, cur_pos); > @@ -1123,7 +1123,7 @@ static void write_graph_chunk_bloom_data(struct has= hfile *f, > hashwrite_be32(f, settings->bits_per_entry); >=20=20 > while (list < last) { > - struct bloom_filter *filter =3D get_bloom_filter(ctx->r, *list); > + struct bloom_filter *filter =3D get_bloom_filter(ctx->r, *list, 0); > display_progress(progress, ++i); > hashwrite(f, filter->data, filter->len * sizeof(uint64_t)); > list++; All right, if needed (that is, if '--changed-path' option from the future commit is provided to 'git commit-graph write'), compute_bloom_filters() would be called befor write_commit_graph_file(), which in turn runs write_graph_chunk_bloom_index() and *_data(). Actually, when writing Bloom data chunks (BIDX and BDAT) we could have requested recomputing filters if necessary: slab storage works as memoization, so you would calculate Bloom filter data for each commit in the commit-graph only once. And write_graph_chunk_bloom_indexes() and write_graph_chunk_bloom_data() are called only if ctx->changed_paths is true. So it would work with + struct bloom_filter *filter =3D get_bloom_filter(ctx->r, *list, 1); Only in the future we would really need to call with compute_if_not_present parameter set to falsy value. > @@ -1304,7 +1304,7 @@ static void compute_bloom_filters(struct write_comm= it_graph_context *ctx) >=20=20 > for (i =3D 0; i < ctx->commits.nr; i++) { > struct commit *c =3D sorted_by_pos[i]; > - struct bloom_filter *filter =3D get_bloom_filter(ctx->r, c); > + struct bloom_filter *filter =3D get_bloom_filter(ctx->r, c, 1); > ctx->total_bloom_filter_data_size +=3D sizeof(uint64_t) * filter->len; > display_progress(progress, i + 1); > } > @@ -2314,6 +2314,7 @@ void free_commit_graph(struct commit_graph *g) > g->data =3D NULL; > close(g->graph_fd); > } > + free(g->bloom_filter_settings); > free(g->filename); > free(g); Shouldn't this fixup be added to earlier commit? > } > diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c > index 331957011b..9b4be97f75 100644 > --- a/t/helper/test-bloom.c > +++ b/t/helper/test-bloom.c > @@ -47,7 +47,7 @@ static void get_bloom_filter_for_commit(const struct ob= ject_id *commit_oid) > struct bloom_filter *filter; > setup_git_directory(); > c =3D lookup_commit(the_repository, commit_oid); > - filter =3D get_bloom_filter(the_repository, c); > + filter =3D get_bloom_filter(the_repository, c, 1); > print_bloom_filter(filter); > } I would like to see some tests, but that needs to wait for patch that adds --changed-paths option to the 'write' subcommand. Things to be tested: 1. That after reading commit-graph with Bloom filter: - that commit(s) in commit-graph have Bloom filter - that commits outside commit-graph do not have Bloom filter 2. That incremental commit-graph feature works: - for commits in deeper layer that have Bloom filter chunks - for commits in deeper layer that do not have Bloom filter chunks Best, --=20 Jakub Nar=C4=99bski