git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* Making split commit graphs pick up new options (namely --changed-paths)
@ 2021-06-10 10:40 Ævar Arnfjörð Bjarmason
  2021-06-10 17:22 ` Taylor Blau
  0 siblings, 1 reply; 10+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-06-10 10:40 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster, peff, szeder.dev


On Wed, Sep 16 2020, Taylor Blau wrote:

Replying to
http://lore.kernel.org/git/ccb6482feb8d8606d82b5ab97e33184f26d6c5b6.1600279373.git.me@ttaylorr.com
as a start-off point for discussion;

> Introduce a command-line flag to specify the maximum number of new Bloom
> filters that a 'git commit-graph write' is willing to compute from
> scratch.
>
> Prior to this patch, a commit-graph write with '--changed-paths' would
> compute Bloom filters for all selected commits which haven't already
> been computed (i.e., by a previous commit-graph write with '--split'
> such that a roll-up or replacement is performed).
>
> This behavior can cause prohibitively-long commit-graph writes for a
> variety of reasons:
>
>   * There may be lots of filters whose diffs take a long time to
>     generate (for example, they have close to the maximum number of
>     changes, diffing itself takes a long time, etc).
>
>   * Old-style commit-graphs (which encode filters with too many entries
>     as not having been computed at all) cause us to waste time
>     recomputing filters that appear to have not been computed only to
>     discover that they are too-large.
>
> This can make the upper-bound of the time it takes for 'git commit-graph
> write --changed-paths' to be rather unpredictable.
>
> To make this command behave more predictably, introduce
> '--max-new-filters=<n>' to allow computing at most '<n>' Bloom filters
> from scratch. This lets "computing" already-known filters proceed
> quickly, while bounding the number of slow tasks that Git is willing to
> do.
> [...]
> @@ -67,6 +67,11 @@ this option is given, future commit-graph writes will automatically assume
>  that this option was intended. Use `--no-changed-paths` to stop storing this
>  data.
>  +
> +With the `--max-new-filters=<n>` option, generate at most `n` new Bloom
> +filters (if `--changed-paths` is specified). If `n` is `-1`, no limit is
> +enforced. Commits whose filters are not calculated are stored as a
> +length zero Bloom filter.
> ++
> [...]

Is there any way with an existing --split setup that introduces a
--changed-paths to make the "add bloom filters to the graph" eventually
consistent, or is some one-off --split=replace the only way to
grandfather in such a feature?

Reading the code there seems to be no way to do that, and we have the
"chunk_bloom_data" in the graph, as well as "bloom_filter_settings".

I'd expect some way to combine the "max_new_filters" and --split with
some eventual-consistency logic so that graphs not matching our current
settings are replaced, or replaced some <limit> at a time.

Also, am I reading the expire_commit_graphs() logic correctly that we
first write the split graph, and then unlink() things that are too old?
I.e. if you rely on the commit-graph to optimize things this will make
things slower until the next run of writing the graph?

I expected to find something more gentle there, i.e. marking that file
as obsolete, not making it part of the new chain (replacing it), and
then unlinking only things not part of the current chain of data that
are too old. But perhaps I'm just misreading or misunderstanding the
behavior...

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Making split commit graphs pick up new options (namely --changed-paths)
  2021-06-10 10:40 Making split commit graphs pick up new options (namely --changed-paths) Ævar Arnfjörð Bjarmason
@ 2021-06-10 17:22 ` Taylor Blau
  2021-06-10 18:21   ` Derrick Stolee
  2021-06-10 23:56   ` Ævar Arnfjörð Bjarmason
  0 siblings, 2 replies; 10+ messages in thread
From: Taylor Blau @ 2021-06-10 17:22 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Taylor Blau, git, dstolee, gitster, peff, szeder.dev

On Thu, Jun 10, 2021 at 12:40:33PM +0200, Ævar Arnfjörð Bjarmason wrote:
> Is there any way with an existing --split setup that introduces a
> --changed-paths to make the "add bloom filters to the graph" eventually
> consistent, or is some one-off --split=replace the only way to
> grandfather in such a feature?

I'm assuming what you mean is "can I introduce changed-path Bloom
filters into an existing split commit-graph with many layers without
having to recompute the whole thing at once?" If so, then the answer is
yes.

Passing --changed-paths causes the commit-graph machinery to compute
missing Bloom filters for every commit in the graph layer it is writing.
So, if you do something like:

  git commit-graph write --split --reachable --size-multiple=2 \
    --changed-paths

(--size-multiple=2 is the default, but I'm including it for clarity),
then you'll get changed-path Bloom filters for all commits in the new
layer, including any layers which may have been merged to create that
layer.

That all still respects `--max-new-filters`, with preference going to
commits with lower generation numbers before higher ones (unless
including commits from packs explicitly with --stdin-packs, in which
case preference is given in pack order; see
commit-graph.c:commit_pos_cmp() for details).

t4216 shows this for --split=replace, but you could just as easily
imagine a test like this:

    #!/bin/sh

    rm -fr repo
    git init repo
    cd repo

    commit () {
      >$1
      git add $1
      git commit -m "$1"
    }

    # no changed-path Bloom filter
    commit missing
    git commit-graph write --split --reachable --no-changed-paths

    missing="$(git rev-parse HEAD)"
    ~/src/git/t/helper/test-tool bloom get_filter_for_commit "$missing"

    # >= 2x the size of the previous layer, so they will be merged
    commit bloom1
    commit bloom2
    git commit-graph write --split --reachable --changed-paths

    # and the $missing commit has a Bloom filter
    ~/src/git/t/helper/test-tool bloom get_filter_for_commit "$missing"

(One caveat is that if you run that script unmodified, you'll get a
filter on both invcations of the test-tool: that's because it computes
filters on the fly if they are missing. You can change that by s/1/0 in
the call to get_or_compute_bloom_filter()).

> Reading the code there seems to be no way to do that, and we have the
> "chunk_bloom_data" in the graph, as well as "bloom_filter_settings".
>
> I'd expect some way to combine the "max_new_filters" and --split with
> some eventual-consistency logic so that graphs not matching our current
> settings are replaced, or replaced some <limit> at a time.

This is asking about something slightly different, Bloom filter
settings rather than the existence of chagned-path Bloom filters
themselves. The Bloom settings aren't written to the commit-graph
although there has been some discussion about doing this in the past.

If we ever did encode the Bloom settings, I imagine that accomplishing a
sort of "eventually replace all changed-path Bloom filters with these
new settings" would be as simple as considering all filters computed
under different settings to be "uncomputed".

> Also, am I reading the expire_commit_graphs() logic correctly that we
> first write the split graph, and then unlink() things that are too old?
> I.e. if you rely on the commit-graph to optimize things this will make
> things slower until the next run of writing the graph?

Before expire_commit_graphs() is called, we call mark_commit_graphs()
which freshens the mtimes of all surviving commit-graph layers, and then
expire_commit_graphs() removes the stale layers. I'm not sure what
things getting slower is referring to since the resulting commit-graph
has at least as many commits as the commit-graph that existed prior to
the write.

> I expected to find something more gentle there [...]

FWIW, I also find this "expire based on mtimes" thing a little odd for
writing split commit-graphs because we know exactly which layers we want
to get rid of. I suspect that the reuse comes from wanting to unify the
logic for handling '--expire-time' with the expiration that happens
after writing a split commit-graph that merged two or more previous
layers.

I would probably change mark_commit_graphs() to remove those merged
layers explicitly (but still run expire_commit_graphs() to handle
--expire-time). But, come to think of it... if merging >2 layers already
causes the merged layers to be removed, then why would you ever set an
--expire-time yourself?

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Making split commit graphs pick up new options (namely --changed-paths)
  2021-06-10 17:22 ` Taylor Blau
@ 2021-06-10 18:21   ` Derrick Stolee
  2021-06-10 23:56   ` Ævar Arnfjörð Bjarmason
  1 sibling, 0 replies; 10+ messages in thread
From: Derrick Stolee @ 2021-06-10 18:21 UTC (permalink / raw)
  To: Taylor Blau, Ævar Arnfjörð Bjarmason
  Cc: git, dstolee, gitster, peff, szeder.dev

On 6/10/2021 1:22 PM, Taylor Blau wrote:
> On Thu, Jun 10, 2021 at 12:40:33PM +0200, Ævar Arnfjörð Bjarmason wrote:
...
>> Reading the code there seems to be no way to do that, and we have the
>> "chunk_bloom_data" in the graph, as well as "bloom_filter_settings".
>>
>> I'd expect some way to combine the "max_new_filters" and --split with
>> some eventual-consistency logic so that graphs not matching our current
>> settings are replaced, or replaced some <limit> at a time.
> 
> This is asking about something slightly different, Bloom filter
> settings rather than the existence of chagned-path Bloom filters
> themselves. The Bloom settings aren't written to the commit-graph
> although there has been some discussion about doing this in the past.

Some of the settings are included, but not the "maximum size" of a
filter. Thus, if that maximum size changes we do not have a way to
determine if a missing filter is larger than that limit or not.
Further, the existing filters might be larger than the new maximum
which means we would need to check if some of those filters should
be dropped.

Here is the spec of the BDAT chunk:

Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]
    * It starts with header consisting of three unsigned 32-bit integers:
      - Version of the hash algorithm being used. We currently only support
	value 1 which corresponds to the 32-bit version of the murmur3 hash
	implemented exactly as described in
	https://en.wikipedia.org/wiki/MurmurHash#Algorithm and the double
	hashing technique using seed values 0x293ae76f and 0x7e646e2 as
	described in https://doi.org/10.1007/978-3-540-30494-4_26 "Bloom Filters
	in Probabilistic Verification"
      - The number of times a path is hashed and hence the number of bit positions
	      that cumulatively determine whether a file is present in the commit.
      - The minimum number of bits 'b' per entry in the Bloom filter. If the filter
	      contains 'n' entries, then the filter size is the minimum number of 64-bit
	      words that contain n*b bits.
    * The rest of the chunk is the concatenation of all the computed Bloom
      filters for the commits in lexicographic order.

> If we ever did encode the Bloom settings, I imagine that accomplishing a
> sort of "eventually replace all changed-path Bloom filters with these
> new settings" would be as simple as considering all filters computed
> under different settings to be "uncomputed".
> 
>> Also, am I reading the expire_commit_graphs() logic correctly that we
>> first write the split graph, and then unlink() things that are too old?
>> I.e. if you rely on the commit-graph to optimize things this will make
>> things slower until the next run of writing the graph?
> 
> Before expire_commit_graphs() is called, we call mark_commit_graphs()
> which freshens the mtimes of all surviving commit-graph layers, and then
> expire_commit_graphs() removes the stale layers. I'm not sure what
> things getting slower is referring to since the resulting commit-graph
> has at least as many commits as the commit-graph that existed prior to
> the write.
> 
>> I expected to find something more gentle there [...]
> 
> FWIW, I also find this "expire based on mtimes" thing a little odd for
> writing split commit-graphs because we know exactly which layers we want
> to get rid of. I suspect that the reuse comes from wanting to unify the
> logic for handling '--expire-time' with the expiration that happens
> after writing a split commit-graph that merged two or more previous
> layers.
> 
> I would probably change mark_commit_graphs() to remove those merged
> layers explicitly (but still run expire_commit_graphs() to handle
> --expire-time). But, come to think of it... if merging >2 layers already
> causes the merged layers to be removed, then why would you ever set an
> --expire-time yourself?

The --expire-time is intended to leave the old layers around a while
so concurrent processes who already parsed the commit-graph-chain file
can discover the layers it referenced. It's not a perfect mechanism, so
there is room for improvement here.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Making split commit graphs pick up new options (namely --changed-paths)
  2021-06-10 17:22 ` Taylor Blau
  2021-06-10 18:21   ` Derrick Stolee
@ 2021-06-10 23:56   ` Ævar Arnfjörð Bjarmason
  2021-06-11  0:50     ` Taylor Blau
  1 sibling, 1 reply; 10+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-06-10 23:56 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, dstolee, gitster, peff, szeder.dev


On Thu, Jun 10 2021, Taylor Blau wrote:

> On Thu, Jun 10, 2021 at 12:40:33PM +0200, Ævar Arnfjörð Bjarmason wrote:
>> Is there any way with an existing --split setup that introduces a
>> --changed-paths to make the "add bloom filters to the graph" eventually
>> consistent, or is some one-off --split=replace the only way to
>> grandfather in such a feature?
>
> I'm assuming what you mean is "can I introduce changed-path Bloom
> filters into an existing split commit-graph with many layers without
> having to recompute the whole thing at once?" If so, then the answer is
> yes.
>
> Passing --changed-paths causes the commit-graph machinery to compute
> missing Bloom filters for every commit in the graph layer it is writing.
> So, if you do something like:
>
>   git commit-graph write --split --reachable --size-multiple=2 \
>     --changed-paths
>
> (--size-multiple=2 is the default, but I'm including it for clarity),
> then you'll get changed-path Bloom filters for all commits in the new
> layer, including any layers which may have been merged to create that
> layer.
>
> That all still respects `--max-new-filters`, with preference going to
> commits with lower generation numbers before higher ones (unless
> including commits from packs explicitly with --stdin-packs, in which
> case preference is given in pack order; see
> commit-graph.c:commit_pos_cmp() for details).
>
> t4216 shows this for --split=replace, but you could just as easily
> imagine a test like this:
>
>     #!/bin/sh
>
>     rm -fr repo
>     git init repo
>     cd repo
>
>     commit () {
>       >$1
>       git add $1
>       git commit -m "$1"
>     }
>
>     # no changed-path Bloom filter
>     commit missing
>     git commit-graph write --split --reachable --no-changed-paths
>
>     missing="$(git rev-parse HEAD)"
>     ~/src/git/t/helper/test-tool bloom get_filter_for_commit "$missing"
>
>     # >= 2x the size of the previous layer, so they will be merged
>     commit bloom1
>     commit bloom2
>     git commit-graph write --split --reachable --changed-paths
>
>     # and the $missing commit has a Bloom filter
>     ~/src/git/t/helper/test-tool bloom get_filter_for_commit "$missing"
>
> (One caveat is that if you run that script unmodified, you'll get a
> filter on both invcations of the test-tool: that's because it computes
> filters on the fly if they are missing. You can change that by s/1/0 in
> the call to get_or_compute_bloom_filter()).

I was assuming it would not be added, but it seems I was just wrong
about that.

What prompted this question from me was reviewing a MR dealing with the
format, where the code is parsing graph files checking if they have
bloom filters, and then using --split or --split=replace:
https://gitlab.com/gitlab-org/gitaly/-/merge_requests/3545#note_596862825

>> Reading the code there seems to be no way to do that, and we have the
>> "chunk_bloom_data" in the graph, as well as "bloom_filter_settings".
>>
>> I'd expect some way to combine the "max_new_filters" and --split with
>> some eventual-consistency logic so that graphs not matching our current
>> settings are replaced, or replaced some <limit> at a time.
>
> This is asking about something slightly different, Bloom filter
> settings rather than the existence of chagned-path Bloom filters
> themselves. The Bloom settings aren't written to the commit-graph
> although there has been some discussion about doing this in the past.
>
> If we ever did encode the Bloom settings, I imagine that accomplishing a
> sort of "eventually replace all changed-path Bloom filters with these
> new settings" would be as simple as considering all filters computed
> under different settings to be "uncomputed".

Yes, I don't have or know of a use-case for this now.

But perhaps as the commit-graph format gets used more as a generic
container for generated data about commits a thing like this would be
useful, i.e. regenerating that part of the graph if the settings are
different, particularly (as has been said) if it's not easy to discover
the input setting from the data itself.

>> Also, am I reading the expire_commit_graphs() logic correctly that we
>> first write the split graph, and then unlink() things that are too old?
>> I.e. if you rely on the commit-graph to optimize things this will make
>> things slower until the next run of writing the graph?
>
> Before expire_commit_graphs() is called, we call mark_commit_graphs()
> which freshens the mtimes of all surviving commit-graph layers, and then
> expire_commit_graphs() removes the stale layers. I'm not sure what
> things getting slower is referring to since the resulting commit-graph
> has at least as many commits as the commit-graph that existed prior to
> the write.

I was assuming we'd be left without a segment of the data, but now I
seem to recall that the commit-graph can't have missing parts of the
graph. So this was doubly a brainfart on my part.

>> I expected to find something more gentle there [...]
>
> FWIW, I also find this "expire based on mtimes" thing a little odd for
> writing split commit-graphs because we know exactly which layers we want
> to get rid of. I suspect that the reuse comes from wanting to unify the
> logic for handling '--expire-time' with the expiration that happens
> after writing a split commit-graph that merged two or more previous
> layers.
>
> I would probably change mark_commit_graphs() to remove those merged
> layers explicitly (but still run expire_commit_graphs() to handle
> --expire-time). But, come to think of it... if merging >2 layers already
> causes the merged layers to be removed, then why would you ever set an
> --expire-time yourself?

What Derrick said downthread, but you might also find this discussion
(and upthread of that) on the split graph format interesting:
https://lore.kernel.org/git/874l63xpwx.fsf@evledraar.gmail.com/

I.e. the initial idea was to have a numeric split
(commit-graph-{1,2,3,...}, and juggle those "chunks". It seems I managed
to convince Derrick to go with the current "chained hash" approach.

That's related to the way we want to do expiry. I.e. you might think
that layer isn't used, but a concurrent process working over NFS may
still want it, especially if that process itself is a long-running
command.

Reading the code now I think it can be much more aggressive than it
currently is, even say to the point of the "commit-graph write"
optionally forking to the background and doing the unlink() itself in
10-60 seconds or something.

I.e. in load_commit_graph_chain() we always load full chain once we
start doing so, we don't load one part of it, find the data we want, and
then go to a bunch of more other work outside of commit-graph.c, only to
return to loading the next segment in the chain much later. It happens
as fast as the loop around load_commit_graph_one will run.

That loop would be less racy if it pre-opened all the files it wants, so
the consuming process would hang onto them past an unlink() by a
concurrent process.

It's probably a non-issue in practice, but one can imagine e.g. a set of
10x 10MB commit-graph files being loaded over a 10KB/s NFS connection or
something. Opening and parsing all the files could take a while. By the
time we open() the last one an aggressive expiry policy might have
removed it from under us.

Anyway, even then the worst we'll do is warn about not finding the
segment in the chain and return the graph up until that point. I.e.:

    warning(_("unable to find all commit-graph files"));

So yeah, maybe we can just unlink() them right away, or another way to
handle the race is that load_commit_graph_chain() could just try again
from the beginning in such a case, and presumably picking up the fresh
just-rewritten chain.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Making split commit graphs pick up new options (namely --changed-paths)
  2021-06-10 23:56   ` Ævar Arnfjörð Bjarmason
@ 2021-06-11  0:50     ` Taylor Blau
  2021-06-11 17:47       ` Derrick Stolee
  0 siblings, 1 reply; 10+ messages in thread
From: Taylor Blau @ 2021-06-11  0:50 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, dstolee, gitster, peff, szeder.dev

On Fri, Jun 11, 2021 at 01:56:31AM +0200, Ævar Arnfjörð Bjarmason wrote:
> >> Reading the code there seems to be no way to do that, and we have the
> >> "chunk_bloom_data" in the graph, as well as "bloom_filter_settings".
> >>
> >> I'd expect some way to combine the "max_new_filters" and --split with
> >> some eventual-consistency logic so that graphs not matching our current
> >> settings are replaced, or replaced some <limit> at a time.
> >
> > This is asking about something slightly different, Bloom filter
> > settings rather than the existence of chagned-path Bloom filters
> > themselves. The Bloom settings aren't written to the commit-graph
> > although there has been some discussion about doing this in the past.
> >
> > If we ever did encode the Bloom settings, I imagine that accomplishing a
> > sort of "eventually replace all changed-path Bloom filters with these
> > new settings" would be as simple as considering all filters computed
> > under different settings to be "uncomputed".
>
> Yes, I don't have or know of a use-case for this now.
>
> But perhaps as the commit-graph format gets used more as a generic
> container for generated data about commits a thing like this would be
> useful, i.e. regenerating that part of the graph if the settings are
> different, particularly (as has been said) if it's not easy to discover
> the input setting from the data itself.

I wasn't completely correct about this (as Stolee noted and you ACK
below): we do store *some* settings in the format, but we don't store
the maximum number of entries in each filter. So today we would indicate
"computed, too large" for a commit that touches, say, 513 paths (the
limit is 512). But if you later ran:

  GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=514 \
    git commit-graph write --split=replace --changed-paths

we would propagate the "too large" status forward and not bother to
recompute the filter (even though we could represent it if we did bother
to recompute it).

In the future we should probably include this as part of the filter
settings and recompute all filters in the commit-graph we're writing if
the value differs from the previous commit-graph. We have to recompute
it even if the maximum decreases since we can't ask the filter directly
"how many entries do you have?".

...and we have to assume that the limit is 512 on all commit-graphs that
exist in the wild today which do not explicit include this default
value. (That's not totally correct, of course, because you can set the
GIT_TEST_ variable, but it's close enough that I'd feel comfortable with
that rule in practice).

> So yeah, maybe we can just unlink() them right away, or another way to
> handle the race is that load_commit_graph_chain() could just try again
> from the beginning in such a case, and presumably picking up the fresh
> just-rewritten chain.

I'd probably be in favor of the latter.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Making split commit graphs pick up new options (namely --changed-paths)
  2021-06-11  0:50     ` Taylor Blau
@ 2021-06-11 17:47       ` Derrick Stolee
  2021-06-11 19:01         ` Taylor Blau
  0 siblings, 1 reply; 10+ messages in thread
From: Derrick Stolee @ 2021-06-11 17:47 UTC (permalink / raw)
  To: Taylor Blau, Ævar Arnfjörð Bjarmason
  Cc: git, dstolee, gitster, peff, szeder.dev

On 6/10/2021 8:50 PM, Taylor Blau wrote:
> On Fri, Jun 11, 2021 at 01:56:31AM +0200, Ævar Arnfjörð Bjarmason wrote:
>> So yeah, maybe we can just unlink() them right away, or another way to
>> handle the race is that load_commit_graph_chain() could just try again
>> from the beginning in such a case, and presumably picking up the fresh
>> just-rewritten chain.
> 
> I'd probably be in favor of the latter.

I want to point out that on Windows we cannot successfully unlink()
a layer that is currently being read by another Git process. That
will not affect server scenarios (to the best of my knowledge) but
is important to many end users.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Making split commit graphs pick up new options (namely --changed-paths)
  2021-06-11 17:47       ` Derrick Stolee
@ 2021-06-11 19:01         ` Taylor Blau
  2021-06-15 14:21           ` Derrick Stolee
  0 siblings, 1 reply; 10+ messages in thread
From: Taylor Blau @ 2021-06-11 19:01 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Taylor Blau, Ævar Arnfjörð Bjarmason, git, dstolee,
	gitster, peff, szeder.dev

On Fri, Jun 11, 2021 at 01:47:28PM -0400, Derrick Stolee wrote:
> On 6/10/2021 8:50 PM, Taylor Blau wrote:
> > On Fri, Jun 11, 2021 at 01:56:31AM +0200, Ævar Arnfjörð Bjarmason wrote:
> >> So yeah, maybe we can just unlink() them right away, or another way to
> >> handle the race is that load_commit_graph_chain() could just try again
> >> from the beginning in such a case, and presumably picking up the fresh
> >> just-rewritten chain.
> >
> > I'd probably be in favor of the latter.
>
> I want to point out that on Windows we cannot successfully unlink()
> a layer that is currently being read by another Git process. That
> will not affect server scenarios (to the best of my knowledge) but
> is important to many end users.

Right, but isn't this already a problem today? Since the expiration
window is zero we are already effectively trying to unlink all merged
layers immediately:

  - Marking merged commit-graph layers as expired via
    mark_commit_graphs() by setting their mtime to "now", and then

  - Immediately removing all layers which have mtime older than an
    instant later in expire_commit_graphs().

(I almost suggested that a race already exists between multiple writers
that merge multiple layers of the commit-graph, but that race doesn't
exist because the commit-graph chain is written before other layers are
marked and expired.)

In any case, it seems like the return value from unlink() is
deliberately ignored in case another process is holding an expired layer
open when we try to unlink it. So we'll eventually clean up all layers
that don't belong to the commit-graph-chain, but at the granularity of
new writes.

(FWIW, I had to re-read 8d84097f96 (commit-graph: expire commit-graph
files, 2019-06-18) which mentions that a configuration variable would be
introduced to change the expiration window, but we don't have any such
configuration option. It also doesn't make any mention of handling this
problem on Windows, which made me think that the unlink() calls weren't
checking their return values by accident when in fact it was probably on
purpose.)

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Making split commit graphs pick up new options (namely --changed-paths)
  2021-06-11 19:01         ` Taylor Blau
@ 2021-06-15 14:21           ` Derrick Stolee
  2021-06-15 14:35             ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 10+ messages in thread
From: Derrick Stolee @ 2021-06-15 14:21 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Ævar Arnfjörð Bjarmason, git, dstolee, gitster,
	peff, szeder.dev

On 6/11/2021 3:01 PM, Taylor Blau wrote:
> On Fri, Jun 11, 2021 at 01:47:28PM -0400, Derrick Stolee wrote:
>> On 6/10/2021 8:50 PM, Taylor Blau wrote:
>>> On Fri, Jun 11, 2021 at 01:56:31AM +0200, Ævar Arnfjörð Bjarmason wrote:
>>>> So yeah, maybe we can just unlink() them right away, or another way to
>>>> handle the race is that load_commit_graph_chain() could just try again
>>>> from the beginning in such a case, and presumably picking up the fresh
>>>> just-rewritten chain.
>>>
>>> I'd probably be in favor of the latter.
>>
>> I want to point out that on Windows we cannot successfully unlink()
>> a layer that is currently being read by another Git process. That
>> will not affect server scenarios (to the best of my knowledge) but
>> is important to many end users.
> 
> Right, but isn't this already a problem today? Since the expiration
> window is zero we are already effectively trying to unlink all merged
> layers immediately:
> 
>   - Marking merged commit-graph layers as expired via
>     mark_commit_graphs() by setting their mtime to "now", and then
> 
>   - Immediately removing all layers which have mtime older than an
>     instant later in expire_commit_graphs().

The commit-graph management we built in Scalar and VFS for Git uses
a non-zero expire time to avoid this problem. However, since we will
clean up the ones that fail to unlink() in a future cleanup, we did
not do a similar thing in Git's background maintenance and have not
had any issues.

> (I almost suggested that a race already exists between multiple writers
> that merge multiple layers of the commit-graph, but that race doesn't
> exist because the commit-graph chain is written before other layers are
> marked and expired.)

And the maintenance buitin solves this by using a maintenance.lock
file to avoid concurrent processes operating on the same repo.

> In any case, it seems like the return value from unlink() is
> deliberately ignored in case another process is holding an expired layer
> open when we try to unlink it. So we'll eventually clean up all layers
> that don't belong to the commit-graph-chain, but at the granularity of
> new writes.

Correct. It also requires new data for the write, or the commit-graph
write will leave early and skip the cleanup IIRC.

> (FWIW, I had to re-read 8d84097f96 (commit-graph: expire commit-graph
> files, 2019-06-18) which mentions that a configuration variable would be
> introduced to change the expiration window, but we don't have any such
> configuration option. It also doesn't make any mention of handling this
> problem on Windows, which made me think that the unlink() calls weren't
> checking their return values by accident when in fact it was probably on
> purpose.)

That config option never appeared, probably because ignoring the
unlink() return was sufficient to get around this problem. Thanks
for digging in and making sure I remembered this correctly.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Making split commit graphs pick up new options (namely --changed-paths)
  2021-06-15 14:21           ` Derrick Stolee
@ 2021-06-15 14:35             ` Ævar Arnfjörð Bjarmason
  2021-06-16  1:45               ` Junio C Hamano
  0 siblings, 1 reply; 10+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-06-15 14:35 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Taylor Blau, git, dstolee, gitster, peff, szeder.dev


On Tue, Jun 15 2021, Derrick Stolee wrote:

> On 6/11/2021 3:01 PM, Taylor Blau wrote:
>> On Fri, Jun 11, 2021 at 01:47:28PM -0400, Derrick Stolee wrote:
>> (FWIW, I had to re-read 8d84097f96 (commit-graph: expire commit-graph
>> files, 2019-06-18) which mentions that a configuration variable would be
>> introduced to change the expiration window, but we don't have any such
>> configuration option. It also doesn't make any mention of handling this
>> problem on Windows, which made me think that the unlink() calls weren't
>> checking their return values by accident when in fact it was probably on
>> purpose.)
>
> That config option never appeared, probably because ignoring the
> unlink() return was sufficient to get around this problem. Thanks
> for digging in and making sure I remembered this correctly.

Isn't the whole ignoring the return value of unlink() Windows-specific
code? There's no issue with unlinking a file someone else has open on
POSIX systems, indeed unlinking a file you just created (but hold a FD
to) is a common pattern for getting a temporary file that you don't need
to unlink on atexit(). It's just not used in e.g. Git's codebase because
of portability concerns.

So not a big deal at all, but I wonder if there should be a warning
there on !Windows, if you can't unlink a file on a POSIX system that
suggests e.g. a persistent permission problem that won't be going away
if you ignore it.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Making split commit graphs pick up new options (namely --changed-paths)
  2021-06-15 14:35             ` Ævar Arnfjörð Bjarmason
@ 2021-06-16  1:45               ` Junio C Hamano
  0 siblings, 0 replies; 10+ messages in thread
From: Junio C Hamano @ 2021-06-16  1:45 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Derrick Stolee, Taylor Blau, git, dstolee, peff, szeder.dev

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> ... indeed unlinking a file you just created (but hold a FD
> to) is a common pattern for getting a temporary file that you don't need
> to unlink on atexit(). It's just not used in e.g. Git's codebase because
> of portability concerns.

Do we open one r/w, write into it and later read back from it,
without wanting the file to persist after we exit?  That is more
suitable for say Editor's "swp" file but I do not think of an
instance of our use of temporary file that would benefit from that
pattern.  IOW, it's not used in our codebase because we have no need
for the pattern, not due to portability concerns.

But that distinction is not the primary point in this topic.

> So not a big deal at all, but I wonder if there should be a warning
> there on !Windows, if you can't unlink a file on a POSIX system that
> suggests e.g. a persistent permission problem that won't be going away
> if you ignore it.

I do think we should catch failure from unlink and probably should
error out, unless we anticipate that somebody else might be removing
the same file and we got ENOENT.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2021-06-16  1:45 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-10 10:40 Making split commit graphs pick up new options (namely --changed-paths) Ævar Arnfjörð Bjarmason
2021-06-10 17:22 ` Taylor Blau
2021-06-10 18:21   ` Derrick Stolee
2021-06-10 23:56   ` Ævar Arnfjörð Bjarmason
2021-06-11  0:50     ` Taylor Blau
2021-06-11 17:47       ` Derrick Stolee
2021-06-11 19:01         ` Taylor Blau
2021-06-15 14:21           ` Derrick Stolee
2021-06-15 14:35             ` Ævar Arnfjörð Bjarmason
2021-06-16  1:45               ` Junio C Hamano

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).