git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Taylor Blau <me@ttaylorr.com>
To: Derrick Stolee <stolee@gmail.com>
Cc: Taylor Blau <me@ttaylorr.com>,
	git@vger.kernel.org, peff@peff.net, dstolee@microsoft.com,
	szeder.dev@gmail.com, gitster@pobox.com
Subject: Re: [PATCH v3 12/14] commit-graph: add large-filters bitmap chunk
Date: Tue, 11 Aug 2020 17:18:15 -0400	[thread overview]
Message-ID: <20200811211815.GA65971@syl.lan> (raw)
In-Reply-To: <4f487dca-a938-b4c4-8b77-a202209459cb@gmail.com>

On Tue, Aug 11, 2020 at 05:11:49PM -0400, Derrick Stolee wrote:
> On 8/11/20 4:52 PM, Taylor Blau wrote:
> > To allow using the existing bitmap code with 64-bit words, we write the
> > data in network byte order from the 64-bit words. This means we also
> > need to read the array from the commit-graph file by translating each
> > word from network byte order using get_be64() when loading the commit
> > graph. (Note that this *could* be delayed until first-use, but a later
> > patch will rely on this being initialized early, so we assume the
> > up-front cost when parsing instead of delaying initialization).
>
> I don't think this is correct to do. This means that every commit-graph
> load will load the entire large bloom filter chunk into memory before
> allowing a single commit to be read from the graph.
>
> In the case of a very large commit-graph (> 1 million commits), this
> would cause a significant initial cost that is not necessarily needed
> for anything. For example, "git log -1" will be delayed by this action.
>
> If I understand correctly, this bloom_large bitmap is only used when
> writing Bloom filters. At that point, reading the entire bitmap from
> disk into memory is inexpensive compared to the time saved by the
> feature.

That's not quite correct. By this point in the patch series, we only use
this bitmap for writing, but in the final patch, we will use it in
'load_bloom_filter_from_graph()' (see that patch for the details of why,
there is an explanatory comment to that effect).

So, the way that this lazy initialization was written was subtly
incorrect, since calling 'load_bloom_filter_from_graph' would return the
wrong value depending on what was or wasn't called before it (namely
whether or not we had initialized the bitmap).

> > @@ -429,6 +430,24 @@ struct commit_graph *parse_commit_graph(struct repository *r,
> >  				graph->bloom_filter_settings->bits_per_entry = get_be32(data + chunk_offset + 8);
> >  			}
> >  			break;
> > +
> > +		case GRAPH_CHUNKID_BLOOMLARGE:
> > +			if (graph->chunk_bloom_large_filters)
> > +				chunk_repeated = 1;
> > +			else if (r->settings.commit_graph_read_changed_paths) {
>
> This is guarded against commitGraph.readChangedPaths, which is good,
> but that's not enough to claim that we need the bloom_large bitmap
> in this process.
>
> > +				size_t alloc = get_be64(chunk_lookup + 4) - chunk_offset - sizeof(uint32_t);
>
> If we store this inside the commit_graph struct, we can save this
> size for later so...
>
> > +				graph->chunk_bloom_large_filters = data + chunk_offset + sizeof(uint32_t);
> > +				graph->bloom_filter_settings->max_changed_paths = get_be32(data + chunk_offset);
>
> ...this portion can be done only when we are about to read from the
> bitmap.

Right, that all is clear.

>
> > +				if (alloc) {
> > +					size_t j;
> > +					graph->bloom_large = bitmap_word_alloc(alloc);
> > +
> > +					for (j = 0; j < graph->bloom_large->word_alloc; j++)
> > +						graph->bloom_large->words[j] = get_be64(
> > +							graph->chunk_bloom_large_filters + j * sizeof(eword_t));
> > +				}
>
>
>
> > +			}
> > +			break;
> >  		}
> >
> >  		if (chunk_repeated) {
> > @@ -443,7 +462,9 @@ struct commit_graph *parse_commit_graph(struct repository *r,
> >  		/* We need both the bloom chunks to exist together. Else ignore the data */
> >  		graph->chunk_bloom_indexes = NULL;
> >  		graph->chunk_bloom_data = NULL;
> > +		graph->chunk_bloom_large_filters = NULL;
> >  		FREE_AND_NULL(graph->bloom_filter_settings);
> > +		bitmap_free(graph->bloom_large);
>
> Perhaps this bitmap_free() needs a check to see if we
> initialized it (with my recommended lazy-loading), but
> maybe not?

It does not ('bitmap_free()' understands that a NULL argument is a noop,
like 'free()').
>
> >  	}
> >
> >  	hashcpy(graph->oid.hash, graph->data + graph->data_len - graph->hash_len);
> > @@ -932,6 +953,20 @@ struct tree *get_commit_tree_in_graph(struct repository *r, const struct commit
> >  	return get_commit_tree_in_graph_one(r, r->objects->commit_graph, c);
> >  }
> >
> > +static int get_bloom_filter_large_in_graph(struct commit_graph *g,
> > +					   const struct commit *c)
> > +{
> > +	uint32_t graph_pos = commit_graph_position(c);
> > +	if (graph_pos == COMMIT_NOT_FROM_GRAPH)
> > +		return 0;
> > +
> > +	while (g && graph_pos < g->num_commits_in_base)
> > +		g = g->base_graph;
> > +
> > +	if (!(g && g->bloom_large))
> > +		return 0;
>
> Here is where we can check if the size of the chunk is
> non-zero and if the g->bloom_large bitmap is uninitialized.
> Since we are caring about this information, it is now time
> to do the network-byte transition of the full bitmap.

Yes, that's right. But again, we'll need to move this out to a
non-static helper function so that we can call it from bloom.c in the
final patch.

I'm admittedly a little unsure of what to do next, since the changes are
scoped to this and the remaining patches (really just this and the final
patch).

I guess I'll try the other approach of replacing these two patches
by sending emails in response so that Junio can take those (I'll avoid
sending a fixup patch).

> > +	return bitmap_get(g->bloom_large, graph_pos - g->num_commits_in_base);
> > +}
> >
> Thanks,
> -Stolee

Thanks,
Taylor

  reply	other threads:[~2020-08-11 21:18 UTC|newest]

Thread overview: 117+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-08-03 18:57 [PATCH 00/10] more miscellaneous Bloom filter improvements Taylor Blau
2020-08-03 18:57 ` [PATCH 01/10] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
2020-08-04  7:24   ` Jeff King
2020-08-04 20:08     ` Taylor Blau
2020-08-03 18:57 ` [PATCH 02/10] commit-graph: pass a 'struct repository *' in more places Taylor Blau
2020-08-03 18:57 ` [PATCH 03/10] t4216: use an '&&'-chain Taylor Blau
2020-08-03 18:57 ` [PATCH 04/10] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
2020-08-03 18:57 ` [PATCH 05/10] commit-graph: respect 'commitgraph.readChangedPaths' Taylor Blau
2020-08-03 18:57 ` [PATCH 06/10] commit-graph.c: sort index into commits list Taylor Blau
2020-08-04 12:31   ` Derrick Stolee
2020-08-04 20:10     ` Taylor Blau
2020-08-03 18:57 ` [PATCH 07/10] commit-graph: add large-filters bitmap chunk Taylor Blau
2020-08-03 18:59   ` Taylor Blau
2020-08-04 12:57   ` Derrick Stolee
2020-08-03 18:57 ` [PATCH 08/10] bloom: split 'get_bloom_filter()' in two Taylor Blau
2020-08-04 13:00   ` Derrick Stolee
2020-08-04 20:12     ` Taylor Blau
2020-08-03 18:57 ` [PATCH 09/10] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
2020-08-03 18:57 ` [PATCH 10/10] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
2020-08-04 13:03   ` Derrick Stolee
2020-08-04 20:14     ` Taylor Blau
2020-08-05 17:01 ` [PATCH v2 00/14] more miscellaneous Bloom filter improvements Taylor Blau
2020-08-05 17:01   ` [PATCH v2 01/14] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
2020-08-05 17:02   ` [PATCH v2 02/14] t4216: use an '&&'-chain Taylor Blau
2020-08-05 17:02   ` [PATCH v2 03/14] commit-graph: pass a 'struct repository *' in more places Taylor Blau
2020-08-05 17:02   ` [PATCH v2 04/14] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
2020-08-05 17:02   ` [PATCH v2 05/14] commit-graph: respect 'commitGraph.readChangedPaths' Taylor Blau
2020-08-05 17:02   ` [PATCH v2 06/14] commit-graph.c: store maximum changed paths Taylor Blau
2020-08-05 17:02   ` [PATCH v2 07/14] bloom: split 'get_bloom_filter()' in two Taylor Blau
2020-08-05 17:02   ` [PATCH v2 08/14] bloom: use provided 'struct bloom_filter_settings' Taylor Blau
2020-08-05 17:02   ` [PATCH v2 09/14] bloom/diff: properly short-circuit on max_changes Taylor Blau
2020-08-05 17:02   ` [PATCH v2 10/14] commit-graph.c: sort index into commits list Taylor Blau
2020-08-05 17:02   ` [PATCH v2 11/14] csum-file.h: introduce 'hashwrite_be64()' Taylor Blau
2020-08-05 17:02   ` [PATCH v2 12/14] commit-graph: add large-filters bitmap chunk Taylor Blau
2020-08-05 21:01     ` Junio C Hamano
2020-08-05 21:17       ` Taylor Blau
2020-08-05 22:21         ` Junio C Hamano
2020-08-05 22:25           ` Taylor Blau
2020-08-11 13:48             ` Taylor Blau
2020-08-11 18:59               ` Junio C Hamano
2020-08-05 17:03   ` [PATCH v2 13/14] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
2020-08-05 17:03   ` [PATCH v2 14/14] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
2020-08-11 20:51 ` [PATCH v3 00/14] more miscellaneous Bloom filter improvements Taylor Blau
2020-08-11 20:51   ` [PATCH v3 01/14] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
2020-08-11 21:18     ` SZEDER Gábor
2020-08-11 21:21       ` Taylor Blau
2020-08-11 21:27         ` SZEDER Gábor
2020-08-11 21:34           ` Taylor Blau
2020-08-11 23:55             ` SZEDER Gábor
2020-08-12 11:48               ` Derrick Stolee
2020-08-14 20:17                 ` Taylor Blau
2020-08-11 20:51   ` [PATCH v3 02/14] t4216: use an '&&'-chain Taylor Blau
2020-08-11 20:51   ` [PATCH v3 03/14] commit-graph: pass a 'struct repository *' in more places Taylor Blau
2020-08-11 20:51   ` [PATCH v3 04/14] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
2020-08-11 20:51   ` [PATCH v3 05/14] commit-graph: respect 'commitGraph.readChangedPaths' Taylor Blau
2020-08-11 20:51   ` [PATCH v3 06/14] commit-graph.c: store maximum changed paths Taylor Blau
2020-08-11 20:51   ` [PATCH v3 07/14] bloom: split 'get_bloom_filter()' in two Taylor Blau
2020-08-11 20:51   ` [PATCH v3 11/14] csum-file.h: introduce 'hashwrite_be64()' Taylor Blau
2020-08-11 20:51   ` [PATCH v3 08/14] bloom: use provided 'struct bloom_filter_settings' Taylor Blau
2020-08-11 20:51   ` [PATCH v3 09/14] bloom/diff: properly short-circuit on max_changes Taylor Blau
2020-08-11 20:52   ` [PATCH v3 10/14] commit-graph.c: sort index into commits list Taylor Blau
2020-08-11 20:52   ` [PATCH v3 12/14] commit-graph: add large-filters bitmap chunk Taylor Blau
2020-08-11 21:11     ` Derrick Stolee
2020-08-11 21:18       ` Taylor Blau [this message]
2020-08-11 22:05         ` Taylor Blau
2020-08-19 13:35     ` SZEDER Gábor
2020-09-02 20:23       ` Taylor Blau
2020-09-01 14:35     ` SZEDER Gábor
2020-09-02 20:40       ` Taylor Blau
2020-08-11 20:52   ` [PATCH v3 13/14] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
2020-08-19  9:56     ` SZEDER Gábor
2020-09-02 21:02       ` Taylor Blau
2020-08-11 20:52   ` [PATCH v3 14/14] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
2020-08-12 11:49     ` SZEDER Gábor
2020-08-14 20:20       ` Taylor Blau
2020-08-17 22:50         ` SZEDER Gábor
2020-09-02 21:03           ` Taylor Blau
2020-08-12 12:29     ` Derrick Stolee
2020-08-14 20:10       ` Taylor Blau
2020-08-18 22:23     ` SZEDER Gábor
2020-09-03 16:35       ` Taylor Blau
2020-08-19  8:20     ` SZEDER Gábor
2020-09-03 16:42       ` Taylor Blau
2020-09-04  8:50         ` SZEDER Gábor
2020-09-01 14:36     ` SZEDER Gábor
2020-09-03 18:49       ` Taylor Blau
2020-09-03 21:45   ` [PATCH v3 00/14] more miscellaneous Bloom filter improvements Junio C Hamano
2020-09-03 22:33     ` Taylor Blau
2020-09-03 22:45 ` [PATCH v4 " Taylor Blau
2020-09-03 22:46   ` [PATCH v4 01/14] commit-graph: introduce 'get_bloom_filter_settings()' Taylor Blau
2020-09-03 22:46   ` [PATCH v4 02/14] t4216: use an '&&'-chain Taylor Blau
2020-09-03 22:46   ` [PATCH v4 03/14] commit-graph: pass a 'struct repository *' in more places Taylor Blau
2020-09-03 22:46   ` [PATCH v4 04/14] t/helper/test-read-graph.c: prepare repo settings Taylor Blau
2020-09-03 22:46   ` [PATCH v4 05/14] commit-graph: respect 'commitGraph.readChangedPaths' Taylor Blau
2020-09-03 22:46   ` [PATCH v4 06/14] commit-graph.c: store maximum changed paths Taylor Blau
2020-09-03 22:46   ` [PATCH v4 07/14] bloom: split 'get_bloom_filter()' in two Taylor Blau
2020-09-05 17:22     ` Jakub Narębski
2020-09-05 17:38       ` Taylor Blau
2020-09-05 17:50         ` Jakub Narębski
2020-09-05 18:01           ` Taylor Blau
2020-09-05 18:18             ` Jakub Narębski
2020-09-05 18:38               ` Taylor Blau
2020-09-05 18:55                 ` Taylor Blau
2020-09-05 19:04                   ` SZEDER Gábor
2020-09-05 19:49                     ` Taylor Blau
2020-09-06 21:52                       ` Junio C Hamano
2020-09-03 22:46   ` [PATCH v4 08/14] bloom: use provided 'struct bloom_filter_settings' Taylor Blau
2020-09-03 22:46   ` [PATCH v4 09/14] bloom/diff: properly short-circuit on max_changes Taylor Blau
2020-09-03 22:46   ` [PATCH v4 10/14] commit-graph.c: sort index into commits list Taylor Blau
2020-09-03 22:46   ` [PATCH v4 11/14] csum-file.h: introduce 'hashwrite_be64()' Taylor Blau
2020-09-04 20:18     ` René Scharfe
2020-09-04 20:22       ` Taylor Blau
2020-09-03 22:46   ` [PATCH v4 12/14] commit-graph: add large-filters bitmap chunk Taylor Blau
2020-09-03 22:46   ` [PATCH v4 13/14] commit-graph: rename 'split_commit_graph_opts' Taylor Blau
2020-09-04 15:20     ` Taylor Blau
2020-09-03 22:47   ` [PATCH v4 14/14] builtin/commit-graph.c: introduce '--max-new-filters=<n>' Taylor Blau
2020-09-04 14:39   ` [PATCH v4 00/14] more miscellaneous Bloom filter improvements Derrick Stolee

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200811211815.GA65971@syl.lan \
    --to=me@ttaylorr.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=peff@peff.net \
    --cc=stolee@gmail.com \
    --cc=szeder.dev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).