On Tue, Nov 01, 2022 at 05:00:22AM -0400, Jeff King wrote:
> On Fri, Oct 28, 2022 at 04:42:19PM +0200, Patrick Steinhardt wrote:
> 
> >     - A client shouldn't assume objects to exist that have not been part
> >       of the reference advertisement. But if it excluded an object from
> >       the packfile that is reachable via any ref that is excluded from
> >       the reference advertisement due to `transfer.hideRefs` we'd have
> >       accepted the push anyway. I'd argue that this is a bug in the
> >       current implementation.
> 
> Like others, I don't think this is a bug exactly. We'd never introduce a
> corruption. We're just more lenient with clients than we need to be.
> 
> But I don't think your scheme changes that. In a sense, the tips used by
> "rev-list --not --all" are really an optimization. We will walk the
> history from the to-be-updated ref tips all the way down to the roots if
> we have to. So imagine that I have object X which is not referenced at
> all (neither hidden nor visible ref). We obviously do not advertise it
> to the client, but let's further imagine that a client sends us a pack
> with X..Y, and a request to update some ref to Y.
> 
> Both before and after your code, if rev-list is able to walk down from Y
> until either we hit all roots or all UNINTERESTING commits, it will be
> satisfied. So as long as the receiving repo actually has all of the
> history leading up to X, it will allow the push, regardless of your
> patch.

Oh, right! Now I see where my thinko was, which means both you and
Taylor are correct. I somehow assumed that we'd fail the connectivity
check in that case, but all it means is that we now potentially walk
more objects than we'd have done if we used `--not --all`.

> If we wanted to stop being lenient, we'd have to actually check that
> every object we traverse is either reachable, or came from the
> just-pushed pack.

Yes, indeed.

> There's also a subtle timing issue here. Our connectivity check happens
> after we've finished receiving the pack. So not only are we including
> hidden refs, but we are using the ref state at the end of the push
> (after receiving and processing the incoming pack), rather than the
> beginning.
> 
> From the same "leniency" lens this seems like the wrong thing. But as
> above, it doesn't matter in practice, because these tips are really an
> optimization to tell rev-list that it can stop traversing.
> 
> If you think of the connectivity check less as "did the client try to
> cheat" and more as "is it OK to update these refs without introducing a
> corruption", then it makes sense that you'd want to do read the inputs
> to the check as close to the ref update as possible, because it shrinks
> the race window which could introduce corruption.

Agreed.

> Imagine a situation like this:
> 
>   0. We advertise to client that we have commit X.
> 
>   1. Client starts pushing up a pack with X..Y and asks to update some
>      branch to Y.
> 
>   2. Meanwhile, the branch with X is deleted, and X is pruned.
> 
>   3. Server finishes receiving the pack. All looks good, and then we
>      start a connectivity check.
> 
> In the current code, that check starts with the current ref state (with
> X deleted) as a given, and makes sure that we have the objects we need
> to update the refs. After your patches, it would take X as a given, and
> stop traversing when we see it.
> 
> That same race exists before your patch, but it's between the time of
> "rev-list --not --all" running and the ref update. After your patch,
> it's between the advertisement and the ref update, which can be a long
> time (hours or even days, if the client is very slow).
> 
> In practice I'm not sure how big a deal this is. If we feed the
> now-pruned X to rev-list, it may notice that X went away, though we've
> been reducing the number of checks there in the name of efficiency
> (e.g., if it's still in the commit graph, we'd say "OK, good enough"
> these days, even if we don't have it on disk anymore).
> 
> But it feels like a wrong direction to make that race longer if there's
> no need to.

Good point.

> So all that said...
> 
> >     - Second, by using advertised refs as inputs instead of `git
> >       rev-list --not --all` we avoid looking up all refs that are
> >       irrelevant to the current push. This can be a huge performance
> >       improvement in repos that have a huge amount of internal, hidden
> >       refs. In one of our repos with 7m refs, of which 6.8m are hidden,
> >       this speeds up pushes from ~30s to ~4.5s.
> 
> I like the general direction here of avoiding the hidden refs. The
> client _shouldn't_ have been using them, so we can optimistically assume
> they're useless (and in the case of races or other weirdness, rev-list
> just ends up traversing a bit further).
> 
> But we can split the two ideas in your series:
> 
>   1. Feed the advertised tips from receive-pack to rev-list.
> 
>   2. Avoid traversing from the hidden tips.
> 
> Doing (1) gets you (2) for free. But if we don't want to do (1), and I
> don't think we do, we can get (2) by just teaching rev-list to narrow
> the check.
> 
> I see some discussion in the other part of the thread, and we may need a
> new rev-list option to do this, as mentioned there. However, you _might_
> be able to do it the existing --exclude mechanism. I.e., something like:
> 
>   rev-list --stdin --not --exclude 'refs/hidden/*' --all

Yeah, Taylor proposed to add a new `--visible-refs=receive` option that
lets git-rev-list(1) automatically add all references that are visible
when paying attention to `receive.hideRefs`. I actually like this idea
and will likely have a look at how easy or hard it is to implement.

> The gotchas are:
> 
>   - I'm not 100% sure that --exclude globbing and transfer.hideRefs
>     syntax are compatible. You'd want to check.
> 
>   - these would have to come on the command line (at least with the
>     current code). Probably nobody has enough hiderefs patterns for that
>     to be a problem (and remember we are passing the glob pattern here,
>     not the 6.8M refs themselves). But it could bite somebody in a
>     pathological case.
> 
> -Peff

Well, we can avoid these gotchas if we used `--visible-refs`.

Patrick