On Tue, Nov 01, 2022 at 05:00:22AM -0400, Jeff King wrote: > On Fri, Oct 28, 2022 at 04:42:19PM +0200, Patrick Steinhardt wrote: > > > - A client shouldn't assume objects to exist that have not been part > > of the reference advertisement. But if it excluded an object from > > the packfile that is reachable via any ref that is excluded from > > the reference advertisement due to `transfer.hideRefs` we'd have > > accepted the push anyway. I'd argue that this is a bug in the > > current implementation. > > Like others, I don't think this is a bug exactly. We'd never introduce a > corruption. We're just more lenient with clients than we need to be. > > But I don't think your scheme changes that. In a sense, the tips used by > "rev-list --not --all" are really an optimization. We will walk the > history from the to-be-updated ref tips all the way down to the roots if > we have to. So imagine that I have object X which is not referenced at > all (neither hidden nor visible ref). We obviously do not advertise it > to the client, but let's further imagine that a client sends us a pack > with X..Y, and a request to update some ref to Y. > > Both before and after your code, if rev-list is able to walk down from Y > until either we hit all roots or all UNINTERESTING commits, it will be > satisfied. So as long as the receiving repo actually has all of the > history leading up to X, it will allow the push, regardless of your > patch. Oh, right! Now I see where my thinko was, which means both you and Taylor are correct. I somehow assumed that we'd fail the connectivity check in that case, but all it means is that we now potentially walk more objects than we'd have done if we used `--not --all`. > If we wanted to stop being lenient, we'd have to actually check that > every object we traverse is either reachable, or came from the > just-pushed pack. Yes, indeed. > There's also a subtle timing issue here. Our connectivity check happens > after we've finished receiving the pack. So not only are we including > hidden refs, but we are using the ref state at the end of the push > (after receiving and processing the incoming pack), rather than the > beginning. > > From the same "leniency" lens this seems like the wrong thing. But as > above, it doesn't matter in practice, because these tips are really an > optimization to tell rev-list that it can stop traversing. > > If you think of the connectivity check less as "did the client try to > cheat" and more as "is it OK to update these refs without introducing a > corruption", then it makes sense that you'd want to do read the inputs > to the check as close to the ref update as possible, because it shrinks > the race window which could introduce corruption. Agreed. > Imagine a situation like this: > > 0. We advertise to client that we have commit X. > > 1. Client starts pushing up a pack with X..Y and asks to update some > branch to Y. > > 2. Meanwhile, the branch with X is deleted, and X is pruned. > > 3. Server finishes receiving the pack. All looks good, and then we > start a connectivity check. > > In the current code, that check starts with the current ref state (with > X deleted) as a given, and makes sure that we have the objects we need > to update the refs. After your patches, it would take X as a given, and > stop traversing when we see it. > > That same race exists before your patch, but it's between the time of > "rev-list --not --all" running and the ref update. After your patch, > it's between the advertisement and the ref update, which can be a long > time (hours or even days, if the client is very slow). > > In practice I'm not sure how big a deal this is. If we feed the > now-pruned X to rev-list, it may notice that X went away, though we've > been reducing the number of checks there in the name of efficiency > (e.g., if it's still in the commit graph, we'd say "OK, good enough" > these days, even if we don't have it on disk anymore). > > But it feels like a wrong direction to make that race longer if there's > no need to. Good point. > So all that said... > > > - Second, by using advertised refs as inputs instead of `git > > rev-list --not --all` we avoid looking up all refs that are > > irrelevant to the current push. This can be a huge performance > > improvement in repos that have a huge amount of internal, hidden > > refs. In one of our repos with 7m refs, of which 6.8m are hidden, > > this speeds up pushes from ~30s to ~4.5s. > > I like the general direction here of avoiding the hidden refs. The > client _shouldn't_ have been using them, so we can optimistically assume > they're useless (and in the case of races or other weirdness, rev-list > just ends up traversing a bit further). > > But we can split the two ideas in your series: > > 1. Feed the advertised tips from receive-pack to rev-list. > > 2. Avoid traversing from the hidden tips. > > Doing (1) gets you (2) for free. But if we don't want to do (1), and I > don't think we do, we can get (2) by just teaching rev-list to narrow > the check. > > I see some discussion in the other part of the thread, and we may need a > new rev-list option to do this, as mentioned there. However, you _might_ > be able to do it the existing --exclude mechanism. I.e., something like: > > rev-list --stdin --not --exclude 'refs/hidden/*' --all Yeah, Taylor proposed to add a new `--visible-refs=receive` option that lets git-rev-list(1) automatically add all references that are visible when paying attention to `receive.hideRefs`. I actually like this idea and will likely have a look at how easy or hard it is to implement. > The gotchas are: > > - I'm not 100% sure that --exclude globbing and transfer.hideRefs > syntax are compatible. You'd want to check. > > - these would have to come on the command line (at least with the > current code). Probably nobody has enough hiderefs patterns for that > to be a problem (and remember we are passing the glob pattern here, > not the 6.8M refs themselves). But it could bite somebody in a > pathological case. > > -Peff Well, we can avoid these gotchas if we used `--visible-refs`. Patrick