From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net X-Spam-Level: X-Spam-ASN: AS31976 209.132.180.0/23 X-Spam-Status: No, score=-3.8 required=3.0 tests=AWL,BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI, RCVD_IN_MSPIKE_H3,RCVD_IN_MSPIKE_WL,SPF_HELO_NONE,SPF_NONE shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by dcvr.yhbt.net (Postfix) with ESMTP id CC6591F880 for ; Thu, 9 Jan 2020 19:12:21 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2388755AbgAITMV (ORCPT ); Thu, 9 Jan 2020 14:12:21 -0500 Received: from mail-lj1-f196.google.com ([209.85.208.196]:35471 "EHLO mail-lj1-f196.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2388728AbgAITMU (ORCPT ); Thu, 9 Jan 2020 14:12:20 -0500 Received: by mail-lj1-f196.google.com with SMTP id j1so8455393lja.2 for ; Thu, 09 Jan 2020 11:12:18 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:references:date:in-reply-to:message-id :user-agent:mime-version:content-transfer-encoding; bh=4GFDB2XJ8BgW0509NLWOTt7ASwwebQ8yXPTNcKF/Nvk=; b=JobJB4pVzGNZHWbtKl5GeVD4lih4E5syBPd2QSMd5+tMDok4P5oghXQ+LzaxvCEuDk dsGssbo7oE5W34JLg3V32P3lTSpKxv5LsrFybFFsQztJ0pKCTBCCCa6JZX+mseOD6by3 WNjRgV/g3rYcdiOcs288Jt0nJrVseConvjP3HiZ4OVTvC9/pI8+cyGHr92Tcb33rNVyp IQNgAuxxs35439RlPBwRr+/N/Mi14kTcpcSbaPCbExpL40mCz5eOFKwOsbLf+pEzETiY 1d3asO5baw1x7+jYn9jJHpOg1k/soAOGydbkpou5f5jkITbYhDSAuFGIHRF45HsdXvl4 31aQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:references:date:in-reply-to :message-id:user-agent:mime-version:content-transfer-encoding; bh=4GFDB2XJ8BgW0509NLWOTt7ASwwebQ8yXPTNcKF/Nvk=; b=PtbT1c9eGu7D9IWU81fovg+8DD2354iVjiJzYntPABOnaa/P8w9Xj8XhyTPsCmECic QMXz9zfM69TI+Y7v9LypQcbCg0WqPWo45mfhd7A4A9KvHSpM5xWyyLa493YuW+wBOur/ Z+qW/MMhef2vovHFSFbT4H5uHtp0KGyWsF+vKgeQbQPO//CZnjvilNfuVHh/PNug9sqh WODpa53NCpcNRW0HKtchyEPiliZ9hIGdUJkrfNfqplPtEBD1d/3Vj0DC74uMWXpH/BlH GV/RO0qwTxQTFhMg6NdIbA0MnJFko8FDamhNySi7Lg4b5mtBMmBckFH86wIck7m2gTv/ i5qg== X-Gm-Message-State: APjAAAVdK5xMT17R+yJK+d+J5c/9d47Ani29gVCupsD40bhMji6+lVXq cR0A+dpGrqCAkAmMrb38iis= X-Google-Smtp-Source: APXvYqzewVmVOBpvH9m71Zo+3/oh16eDCk+HolFJiDOCxf96HJDscNjEuPdTVfPDTE3o1RYNnYx2BA== X-Received: by 2002:a05:651c:1110:: with SMTP id d16mr7775592ljo.86.1578597137840; Thu, 09 Jan 2020 11:12:17 -0800 (PST) Received: from Laptop-Acer-Aspire-F15 (host-89-229-7-83.dynamic.mm.pl. [89.229.7.83]) by smtp.gmail.com with ESMTPSA id n30sm3988566lfi.54.2020.01.09.11.12.16 (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256); Thu, 09 Jan 2020 11:12:17 -0800 (PST) From: Jakub Narebski To: "Garima Singh via GitGitGadget" Cc: git@vger.kernel.org, Derrick Stolee , SZEDER =?utf-8?Q?G=C3=A1bor?= , Jonathan Tan , Jeff Hostetler , Taylor Blau , Jeff King , Junio C Hamano , Garima Singh Subject: Re: [PATCH 7/9] commit-graph: reuse existing bloom filters during write. References: <1e2acb37ad710cb0d1c09ed163fdd4473a27335c.1576879520.git.gitgitgadget@gmail.com> Date: Thu, 09 Jan 2020 20:12:07 +0100 In-Reply-To: <1e2acb37ad710cb0d1c09ed163fdd4473a27335c.1576879520.git.gitgitgadget@gmail.com> (Garima Singh via GitGitGadget's message of "Fri, 20 Dec 2019 22:05:18 +0000") Message-ID: <86r208v3zs.fsf@gmail.com> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.2 (windows-nt) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org "Garima Singh via GitGitGadget" writes: > From: Garima Singh > Overly long lines in the commit message. > Read previously computed bloom filters from the commit-graph file if poss= ible > to avoid recomputing during commit-graph write. Hmmm. This fixes (somewhat) the problem that I have noticed in previous patch that the Bloom filter was computed at least twice, once for BIDX chunk, once fo BDAT chunk. I think the order should be: - use Bloom filter on slab, if present - fill it from commit graph, if saved there - if needed, compute it from scratch (expensive operation!) If I understand it correctly, it now does it... though possibly with unnecessary memory allocation if commit-graph file does not include Bloomm filters data, and (re)computing is not requested (see later). But I might be wrong here. > > Reading from the commit-graph is based on the format in which bloom filte= rs are > written in the commit graph file. See method `fill_filter_from_graph` in = bloom.c This description reads a bit strange; it looks like it states a truism (we read in the format we wrote). It think it should be rephrased in different way for better readability. > > For reading the bloom filter for commit at lexicographic position i: I think it would better read as: To read Bloom filter for a given commit with lexicographic position 'i' we need to: > 1. Read BIDX[i] which essentially gives us the starting index in BDAT for= filter > of commit i+1 (called the next_index in the code) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | \- I think would be not needed with better var name I would also add that it gives the position [one past] the end of Bloom filter data for i-th commit. > > 2. For i>0, read BIDX[i-1] which will give us the starting index in BDAT = for > filter of commit i (called the prev_index in the code) Minor nitpick: Full stops are missing. Why it is called prev_index and next_index, while it is either curr_index and next_index or prev_index and curr_index, or maybe even better beg_index and end_index? > For i =3D 0, prev_index will be 0. The first lexicographic commit's fi= lter will > start at BDAT. I would state it For first commit, with i =3D 0, Bloom filter data starts at the beginning, just past the header in BDAT chunk. > > 3. The length of the filter will be next_index - prev_index, because BIDX= [i] > gives the cumulative 8-byte words including the ith commit's filter. > > We toggle whether bloom filters should be recomputed based on the compute= _if_null > flag. All right. > > Helped-by: Derrick Stolee > Signed-off-by: Garima Singh > --- > bloom.c | 40 ++++++++++++++++++++++++++++++++++++++-- > bloom.h | 3 ++- > commit-graph.c | 6 +++--- > 3 files changed, 43 insertions(+), 6 deletions(-) > > diff --git a/bloom.c b/bloom.c > index 08328cc381..86b1005802 100644 > --- a/bloom.c > +++ b/bloom.c > @@ -1,5 +1,7 @@ > #include "git-compat-util.h" > #include "bloom.h" > +#include "commit.h" > +#include "commit-slab.h" > #include "commit-graph.h" > #include "object-store.h" > #include "diff.h" > @@ -119,13 +121,35 @@ static void add_key_to_filter(struct bloom_key *key, > } > } >=20=20 > +static void fill_filter_from_graph(struct commit_graph *g, > + struct bloom_filter *filter, > + struct commit *c) > +{ > + uint32_t lex_pos, prev_index, next_index; > + > + while (c->graph_pos < g->num_commits_in_base) > + g =3D g->base_graph; > + > + lex_pos =3D c->graph_pos - g->num_commits_in_base; This part shares common code with load_oid_from_graph(), only without some of error checking; perhaps it might be good to extract it into a separate helper function, e.g. `lex_index(&g, c->graph_pos)`. Minor nitpick about the consistency of function names: why load_oid_from_graph(), but fill_filter_from_graph(), and not load_filter_from_graph() / load_bloom_from_graph()? > + > + next_index =3D get_be32(g->chunk_bloom_indexes + 4 * lex_pos); > + if (lex_pos) I think using + if (lex_pos > 0) or + if (lex_pos >=3D 0) might be easier to reason about. > + prev_index =3D get_be32(g->chunk_bloom_indexes + 4 * (lex_pos - 1)); > + else > + prev_index =3D 0; > + > + filter->len =3D next_index - prev_index; The above command reads a bit strange: next - prev? not next - curr, or curr - prev? Wouldn't it be better to name it begin_index and end_index, or beg_index and end_index for brevity? > + filter->data =3D (uint64_t *)(g->chunk_bloom_data + 8 * prev_index + 12= ); Please do not use magic constants; use instead something like: + filter->data =3D (uint64_t *)(g->chunk_bloom_data + + sizeof(uint64_t) * prev_index + + BLOOMDATA_CHUNK_HEADER_SIZE); Perhaps using `3*sizeof(unit32_t)` instead of magic value 12 would be enough; but having symbolic name for BDAT chunk header size is better, I think. > +} > + > void load_bloom_filters(void) > { > init_bloom_filter_slab(&bloom_filters); > } >=20=20 > struct bloom_filter *get_bloom_filter(struct repository *r, > - struct commit *c) > + struct commit *c, > + int compute_if_null) I'm not sure about `compute_if_null` name... > { > struct bloom_filter *filter; > struct bloom_filter_settings settings =3D DEFAULT_BLOOM_FILTER_SETTINGS; > @@ -134,6 +158,18 @@ struct bloom_filter *get_bloom_filter(struct reposit= ory *r, > const char *revs_argv[] =3D {NULL, "HEAD", NULL}; >=20=20 > filter =3D bloom_filter_slab_at(&bloom_filters, c); Note that the documentation for `slab_at(slab, commit)` is documented in `commit-slab.h` as * This function locates the data associated with the given commit in * the indegree slab, and returns the pointer to it. The location to * store the data is allocated as necessary. ~~~~~~~~~~~~~~~~~~~~~~ Should we worry about this possibly unnecessary allocation (if there is no Bloom filter chunk in the commit-graph, and we are not recomputing it)? There is `slab_peek(slab_commit)` with the following properties: * This function is similar to indegree_at(), but it will return NULL * until a call to indegree_at() was made for the commit. > + > + if (!filter->data) { What does the Bloom filter for a commit with no changes looks like? What about for a commit with more than 512 changes? Do, in either of those cases, filter->len is 0? If yes, what about filter->data? > + load_commit_graph_info(r, c); > + if (c->graph_pos !=3D COMMIT_NOT_FROM_GRAPH && r->objects->commit_grap= h->chunk_bloom_indexes) { Please wrap overly long lines (109 characters seems too long; the CodingGuidelines states: - We try to keep to at most 80 characters per line. + if (c->graph_pos !=3D COMMIT_NOT_FROM_GRAPH && + r->objects->commit_graph->chunk_bloom_indexes) { You seem to assume here that in the chain of commit-graph files either all of them would have Bloom filters, or all of them would be missing Bloom filters. Isn't it however possible for only some of the commit-graph files in chain to include Bloom filter data chunks? In such case, the top commit-graph file may have BIDX chunk (bloom_indexes), but the commit-graph with the commit 'c' might not have it. Or the top commit-graph file may be missing BIDX chunk, so Git would recompute it even if the commit-graph file for commit 'c' includes it. If such situation is forbidden, how the restriction is managed? Note: in any case, this needs to be tested! > + fill_filter_from_graph(r->objects->commit_graph, filter, c); > + return filter; > + } > + } All right, if it is not in slab, we try to read if from the commit graph. Looks all right. > + > + if (filter->data || !compute_if_null) > + return filter; ^^^^^^^^ | \- one tab too many If we have found existing filter (on slab or in the commit-graph), or if we won't be recomputing it, return it. O.K. > + > init_revisions(&revs, NULL); > revs.diffopt.flags.recursive =3D 1; >=20=20 > @@ -198,4 +234,4 @@ struct bloom_filter *get_bloom_filter(struct reposito= ry *r, > DIFF_QUEUE_CLEAR(&diff_queued_diff); >=20=20 > return filter; > -} > \ No newline at end of file > +} Accidental change. > diff --git a/bloom.h b/bloom.h > index ba8ae70b67..101d689bbd 100644 > --- a/bloom.h > +++ b/bloom.h > @@ -36,7 +36,8 @@ struct bloom_key { > void load_bloom_filters(void); >=20=20 > struct bloom_filter *get_bloom_filter(struct repository *r, > - struct commit *c); > + struct commit *c, > + int compute_if_null); > All right, this is just update of the function signature. > void fill_bloom_key(const char *data, > int len, > diff --git a/commit-graph.c b/commit-graph.c > index def2ade166..0580ce75d5 100644 > --- a/commit-graph.c > +++ b/commit-graph.c > @@ -1032,7 +1032,7 @@ static void write_graph_chunk_bloom_indexes(struct = hashfile *f, > uint32_t cur_pos =3D 0; >=20=20 > while (list < last) { > - struct bloom_filter *filter =3D get_bloom_filter(ctx->r, *list); > + struct bloom_filter *filter =3D get_bloom_filter(ctx->r, *list, 0); > cur_pos +=3D filter->len; > hashwrite_be32(f, cur_pos); > list++; > @@ -1051,7 +1051,7 @@ static void write_graph_chunk_bloom_data(struct has= hfile *f, > hashwrite_be32(f, settings->bits_per_entry); >=20=20 > while (first < last) { > - struct bloom_filter *filter =3D get_bloom_filter(ctx->r, *first); > + struct bloom_filter *filter =3D get_bloom_filter(ctx->r, *first, 0); > hashwrite(f, filter->data, filter->len * sizeof(uint64_t)); > first++; > } O.K., so those two do not compute Bloom filters, but they are called from write_commit_graph_file(), which in turn is called in write_commit_graph() *after* running compute_bloom_filters(). Looks good to me, then. > @@ -1218,7 +1218,7 @@ static void compute_bloom_filters(struct write_comm= it_graph_context *ctx) >=20=20 > for (i =3D 0; i < ctx->commits.nr; i++) { > struct commit *c =3D ctx->commits.list[i]; > - struct bloom_filter *filter =3D get_bloom_filter(ctx->r, c); > + struct bloom_filter *filter =3D get_bloom_filter(ctx->r, c, 1); > ctx->total_bloom_filter_size +=3D sizeof(uint64_t) * filter->len; > display_progress(progress, i + 1); > } All right, so compute_bloom_filters() ensures that it is actually computed (if needed). Looks good to me. Regards, --=20 Jakub Nar=C4=99bski