On Mon, Feb 19, 2024 at 04:14:21PM -0800, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> 
> > Refactor the code to call `check_refname_format()` directly instead of
> > trying to resolve the ref. This is significantly more efficient given
> > that we don't have to hit the object database anymore to list reflogs.
> > And second, it ensures that we end up showing reflogs of broken refs,
> > which will help to make the reflog more useful.
> 
> And the user would notice corrupt ones among those reflogs listed
> when using "rev-list -g" on the reflog anyway?  Which sounds like a
> sensible thing to do.

Yeah. Overall the user experience is still quite lacking when you have
such "funny" reflogs. Corrupted ones would result in errors as you
mentioned, and that's to be expected in my opinion.

The more dubious behaviour is that `git reflog show $REFLOG` refuses to
show the reflog when the corresponding ref is missing. This is something
I plan to address in a follow-up patch series.

> > Note that this really only impacts the case where the corresponding ref
> > is corrupt. Reflogs for nonexistent refs would have been returned to the
> > caller beforehand already as we did not pass `RESOLVE_REF_READING` to
> > the function, and thus `refs_resolve_ref_unsafe()` would have returned
> > successfully in that case.
> 
> What do "Reflogs for nonexistent refs" really mean?  With the files
> backend, if "git branch -d main" that removed the "main" branch
> somehow forgot to remove the ".git/logs/refs/heads/main" file, the
> reflog entries in such a file is for nonexistent ref.  Is that what
> you meant?

Yes. Would "Reflogs which do not have a corresponding ref with the same
name" be clearer?

> As a tool to help diagnosing and correcting minor repo
> breakages, finding such a leftover file that should not exist is a
> good idea, I would think.
> 
> Would we see missing reflog for a ref that exists in the iteration?
> I guess we shouldn't, as the reflog iterator that recursively
> enumerates files under "$GIT_DIR/logs/" would not see such a missing
> reflog by definition.

No, and I'd claim we shouldn't. The reflog mechanism gives the user
control over which reflogs should and which shouldn't exist. For one,
`core.logAllRefUpdates` allows the user to either enable or disable the
reflog mechanism. If set to "false" then no reflogs are created, with
"true" some are created, and with "always" we always end up creating
reflogs. So depending on this setting it's expected that a subset of
reflogs do not exist.

But that'also not the whole story yet. Theoretically speaking, reflogs
have a subtle opt-in mechanism: once a reflog is created, we will
continue writing to it no matter what `core.logAllRefUpdates` says. So
it's feasible to have `core.logAllRefUpdates=false`, but then explicitly
create a specific reflog so that you log changes to a specific ref.

With this behaviour in mind I'd say that we shouldn't log missing
reflogs.

> > diff --git a/refs/files-backend.c b/refs/files-backend.c
> > index 2b3c99b00d..741148087d 100644
> > --- a/refs/files-backend.c
> > +++ b/refs/files-backend.c
> > @@ -2130,17 +2130,9 @@ static int files_reflog_iterator_advance(struct ref_iterator *ref_iterator)
> >  	while ((ok = dir_iterator_advance(diter)) == ITER_OK) {
> >  		if (!S_ISREG(diter->st.st_mode))
> >  			continue;
> > -		if (diter->basename[0] == '.')
> > +		if (check_refname_format(diter->basename,
> > +					 REFNAME_ALLOW_ONELEVEL))
> >  			continue;
> 
> A tangent.
> 
> I've never liked the code arrangement in the check_refname_format()
> that assumes that each level can be separately checked with exactly
> the same logic, and the only thing ALLOW_ONELEVEL does is to include
> pseudorefs and HEAD; this makes such assumption even more ingrained.
> I am not sure what to think about it, but let's keep reading.

Yeah. This code here is basically just copied over from
`refs_resolve_ref_unsafe()` to ensure that it remains compatible. In a
future patch series we might include a new option `--include-broken`
that would also surface broken-but-safe reflog names.

But going down the tangent even more: one think I've noticed is that the
way `check_refname_format()` is structured is also wildly inefficient.
It's quite astonishing that when iterating over refs, we spend _more_
time in `check_refname_format()` than reading the refs from disk,
parsing them and massaging them into their final representation.

Overall, the whole infra to check refnames could use some improvement.
But this has already been discussed in other threads recently.

Patrick

> > -		if (ends_with(diter->basename, ".lock"))
> > -			continue;
> 
> This can safely go, as it is rejected by check_refname_format().
> 
> > -		if (!refs_resolve_ref_unsafe(iter->ref_store,
> > -					     diter->relative_path, 0,
> > -					     NULL, NULL)) {
> > -			error("bad ref for %s", diter->path.buf);
> > -			continue;
> > -		}
> 
> This is the focus of this step in the series.  We did not abort the
> iteration before, but now we no longer issue any error message.
> 
> >  		iter->base.refname = diter->relative_path;
> >  		return ITER_OK;
> > diff --git a/refs/reftable-backend.c b/refs/reftable-backend.c
> > index 889bb1f1ba..efbbf23c72 100644
> > --- a/refs/reftable-backend.c
> > +++ b/refs/reftable-backend.c
> > @@ -1659,11 +1659,9 @@ static int reftable_reflog_iterator_advance(struct ref_iterator *ref_iterator)
> >  		if (iter->last_name && !strcmp(iter->log.refname, iter->last_name))
> >  			continue;
> >  
> > -		if (!refs_resolve_ref_unsafe(&iter->refs->base, iter->log.refname,
> > -					     0, NULL, NULL)) {
> > -			error(_("bad ref for %s"), iter->log.refname);
> > +		if (check_refname_format(iter->log.refname,
> > +					 REFNAME_ALLOW_ONELEVEL))
> >  			continue;
> > -		}
> 
> This side is much more straight-forward.  Looking good.
> 
> >  
> >  		free(iter->last_name);
> >  		iter->last_name = xstrdup(iter->log.refname);