On Wed, Nov 10, 2021 at 03:36:59AM -0500, Jeff King wrote:
> On Tue, Nov 09, 2021 at 12:25:46PM +0100, Patrick Steinhardt wrote:
>
> > So I've finally found the time to have another look at massaging this
> > into the ref_transaction mechanism. If we do want to batch the
> > fsync(3P) calls, then we basically have two different alternatives:
> >
> >   1. We first lock all loose refs by creating the respective lock
> >      files and writing the updated ref into that file. We keep the
> >      file descriptor open such that we can then flush them all in
> >      one go.
> >
> >   2. Same as before, we lock all refs and write the updated pointers
> >      into the lockfiles, but this time we close each lockfile after
> >      having written to it. Later, we reopen them all to fsync(3P)
> >      them to disk.
> >
> > I'm afraid neither alternative is any good: the first risks running
> > out of file descriptors if you queue up lots of refs, and the second
> > is going to be slow, especially so on Windows if I'm not mistaken.
>
> I agree the first is a dead end. I had imagined something like the
> second, but I agree that we'd have to measure the cost of re-opening.
> It's too bad there is no syscall to sync a particular set of paths
> (or even better, a directory tree recursively).
>
> There is another option, though: the batch-fsync code for objects
> does a "cheap" fsync while we have the descriptor open, and then
> later triggers a to-disk sync on a separate file. My understanding is
> that this works because modern filesystems will make sure the data
> write is in the journal on the cheap sync, and then the separate-file
> sync makes sure the journal goes to disk.
>
> We could do something like that here. In fact, if you don't care
> about durability but only about filesystem corruption, then you only
> care about the first sync (because the bad case is when the rename
> gets journaled but the data write doesn't).

Ah, interesting. That does sound like a good way forward to me, thanks
for the pointers!

Patrick

> In fact, even if you did choose to re-open and fsync each one, that's
> still sequential. We'd need some way to tell the kernel to sync them
> all at once. The batch-fsync trickery above is one such way (I
> haven't tried, but I wonder if making a bunch of fsync calls in
> parallel would work similarly).
>
> > So with both not being feasible, we'll likely have to come up with
> > a more complex scheme if we want to batch-sync files. One idea
> > would be to fsync(3P) all lockfiles every $n refs, but it adds
> > complexity in a place where I'd really like to have things as
> > simple as possible. It also raises the question of what $n would
> > have to be.
>
> I do think syncing every $n would not be too hard to implement. It
> could all be hidden behind a state machine API that collects lock
> files and flushes when it sees fit. You'd just call a magic
> "batch_fsync_and_close" instead of "fsync() and close()", though you
> would have to remember to do a final flush call to tell it there are
> no more coming.
>
> -Peff
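
To make the "cheap sync, then one hard sync" trick described above
concrete, here is a rough Linux-only sketch. It is not code from
git.git or from this thread: sync_file_range(2), fsync(2) and
mkstemp(3) are real interfaces, but the helper names and the dummy
file are assumptions for illustration.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    /*
     * Cheap per-file flush: start writeback of the lockfile's data
     * while its descriptor is still open, without forcing a journal
     * commit.
     */
    static int cheap_flush(int fd)
    {
        return sync_file_range(fd, 0, 0,
                               SYNC_FILE_RANGE_WAIT_BEFORE |
                               SYNC_FILE_RANGE_WRITE |
                               SYNC_FILE_RANGE_WAIT_AFTER);
    }

    /*
     * One "hard" sync for the whole batch: fsync a freshly created
     * dummy file, which forces the journal (and with it the cheap
     * flushes above) to disk. In git this file would have to live
     * under $GIT_DIR so that it shares a filesystem journal with the
     * refs; /tmp here is only a placeholder.
     */
    static int flush_batch(void)
    {
        char path[] = "/tmp/bulk-fsync-XXXXXX";
        int fd = mkstemp(path);
        int ret;

        if (fd < 0)
            return -1;
        ret = fsync(fd);
        unlink(path);
        close(fd);
        return ret;
    }

The durability-versus-corruption point then falls out naturally: if
you only want to rule out the journaled-rename-without-data case,
cheap_flush() per lockfile is enough and flush_batch() can be skipped.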
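
And a minimal sketch of the state machine described at the end,
assuming a fixed flush threshold. "batch_fsync_and_close" is the
hypothetical name from the mail; the struct and everything else are
made up here:

    #include <stddef.h>
    #include <unistd.h>

    #define BATCH_SIZE 32 /* the "$n" from above; would need measuring */

    struct fsync_batch {
        int fds[BATCH_SIZE];
        size_t nr;
    };

    /*
     * Fsync and close everything collected so far. The caller must
     * invoke this one final time after the last ref, since the
     * machine cannot know on its own that no more files are coming.
     */
    static int batch_flush(struct fsync_batch *b)
    {
        int ret = 0;
        size_t i;

        for (i = 0; i < b->nr; i++) {
            if (fsync(b->fds[i]) < 0)
                ret = -1;
            close(b->fds[i]);
        }
        b->nr = 0;
        return ret;
    }

    /*
     * Hand over a fully written lock file; flushes whenever the batch
     * is full, so at most BATCH_SIZE descriptors are ever held open.
     */
    static int batch_fsync_and_close(struct fsync_batch *b, int fd)
    {
        b->fds[b->nr++] = fd;
        if (b->nr == BATCH_SIZE)
            return batch_flush(b);
        return 0;
    }

Capping the batch at BATCH_SIZE descriptors also sidesteps the
fd-exhaustion problem of alternative 1, at the cost of the fsyncs
within each batch still being sequential.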