Distinguishing FF vs non-FF updates in the reflog?

git@vger.kernel.org mailing list mirror (one of many)

* Distinguishing FF vs non-FF updates in the reflog?
@ 2021-03-17 20:06 Han-Wen Nienhuys
  2021-03-17 21:21 ` Martin Fick
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Han-Wen Nienhuys @ 2021-03-17 20:06 UTC (permalink / raw)
  To: git

Hi there,

I'm working on some extensions to Gerrit for which it would be very
beneficial if we could tell from the reflog if an update is a
fast-forward or not: if we find a SHA1 in the reflog, and see there
were only FF updates since, we can be sure that the SHA1 is reachable
from the branch, without having to open packfiles and decode commits.
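
A minimal sketch of that check, assuming each reflog entry already carries a
fast-forward flag. The `is_ff` field is hypothetical (today's reflog records
no such thing), and entries are listed oldest first:

```python
from typing import NamedTuple, Sequence


class AnnotatedEntry(NamedTuple):
    old_id: str
    new_id: str
    is_ff: bool  # hypothetical annotation; not part of today's reflog


def reachable_per_reflog(entries: Sequence[AnnotatedEntry], sha: str) -> bool:
    """Return True if the reflog alone proves `sha` is reachable from the tip.

    Scans from the newest entry backwards. While every update seen so far
    was a fast-forward, each older id is an ancestor of the current tip.
    A False result only means "not proven"; the caller would still need a
    real ancestry walk to decide.
    """
    for entry in reversed(entries):
        if entry.new_id == sha:
            return True
        if not entry.is_ff:
            # A non-FF update breaks the chain: older ids may be unreachable.
            return False
        if entry.old_id == sha:
            # FF update: old_id is an ancestor of new_id.
            return True
    return False
```

A False answer is deliberately conservative: it never claims unreachability,
only that the reflog cannot prove reachability on its own.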

For the reftable format, I think we could store this easily by
introducing more record types. Today we have 0 = deletion, 1 = update,
and we could add 2 = FF update, 3 = non-FF update.

However, the textual reflog format doesn't easily allow for this.
Instead, we might add a convention, e.g. have the message start with
'FF' or 'NFF' depending on the nature of the update.

Does this make sense, and if yes is it worth proposing a change?

thanks,
-- 
Han-Wen Nienhuys - Google Munich
I work 80%. Don't expect answers from me on Fridays.
--

Google Germany GmbH, Erika-Mann-Strasse 33, 80636 Munich

Registergericht und -nummer: Hamburg, HRB 86891

Sitz der Gesellschaft: Hamburg

Geschäftsführer: Paul Manicle, Halimah DeLaine Prado

* Re: Distinguishing FF vs non-FF updates in the reflog?
  2021-03-17 20:06 Distinguishing FF vs non-FF updates in the reflog? Han-Wen Nienhuys
@ 2021-03-17 21:21 ` Martin Fick
  2021-03-18  8:58   ` Han-Wen Nienhuys
  2021-03-18 19:47 ` Jeff King
  2021-03-22 13:26 ` Ævar Arnfjörð Bjarmason
  2 siblings, 1 reply; 19+ messages in thread
From: Martin Fick @ 2021-03-17 21:21 UTC (permalink / raw)
  To: Han-Wen Nienhuys; +Cc: git

On Wednesday, March 17, 2021 9:06:06 PM MDT Han-Wen Nienhuys wrote:
> I'm working on some extensions to Gerrit for which it would be very
> beneficial if we could tell from the reflog if an update is a
> fast-forward or not: if we find a SHA1 in the reflog, and see there
> were only FF updates since, we can be sure that the SHA1 is reachable
> from the branch, without having to open packfiles and decode commits.

I don't think this would be reliable.

1) Not all updates make it to the reflogs
2) Reflogs can be edited or mucked with
3) On NFS reflogs can outright be wrong even when used properly as there are 
caching issues. We specifically have seen entries that appear to be FFs that 
were not.

I believe that today git can do very fast reachability checks without opening
pack files by using some of its indexes (bitmap code or
https://git-scm.com/docs/commit-graph ?). It probably makes sense to add this
ability to jgit if that is what you need?

-Martin 

-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation

* Re: Distinguishing FF vs non-FF updates in the reflog?
  2021-03-17 21:21 ` Martin Fick
@ 2021-03-18  8:58   ` Han-Wen Nienhuys
  2021-03-18 19:35     ` Jeff King
                       ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Han-Wen Nienhuys @ 2021-03-18  8:58 UTC (permalink / raw)
  To: Martin Fick; +Cc: git

On Wed, Mar 17, 2021 at 10:22 PM Martin Fick <mfick@codeaurora.org> wrote:
>
> On Wednesday, March 17, 2021 9:06:06 PM MDT Han-Wen Nienhuys wrote:
> > I'm working on some extensions to Gerrit for which it would be very
> > beneficial if we could tell from the reflog if an update is a
> > fast-forward or not: if we find a SHA1 in the reflog, and see there
> > were only FF updates since, we can be sure that the SHA1 is reachable
> > from the branch, without having to open packfiles and decode commits.
>
> I don't think this would be reliable.
>
> 1) Not all updates make it to the reflogs
> 2) Reflogs can be edited or mucked with
> 3) On NFS reflogs can outright be wrong even when used properly as there are
> caching issues. We specifically have seen entries that appear to be FFs that
> were not.

Can you tell me a little more about 3)? Since we don't annotate non-FF
vs FF today, what does "appear to be FFs" mean?

But you are right: since the reflog for a branch is in a different
file from the branch head, there is no way to do an update to both of
them at the same time. I guess this will have to be a reftable-only
feature.

> I believe that today git can do very fast reachability checks without opening
> pack files by using some of its indexes (bitmap code or
> https://git-scm.com/docs/commit-graph ?). It probably makes sense to add this
> ability to jgit if that is what you need?

The bitmaps are generated by GC, and you can't GC all the time. JGit
has support for bitmaps, and its support actually predates C-Git's
support for it. (It was added to JGit by Colby Ranger who worked in
Shawn's team).

I expect that the commit graph doesn't work for my intended use-case.

-- 
Han-Wen Nienhuys - Google Munich
I work 80%. Don't expect answers from me on Fridays.

* Re: Distinguishing FF vs non-FF updates in the reflog?
  2021-03-18  8:58   ` Han-Wen Nienhuys
@ 2021-03-18 19:35     ` Jeff King
  2021-03-18 22:24     ` Martin Fick
  2021-03-18 22:31     ` Martin Fick
  2 siblings, 0 replies; 19+ messages in thread
From: Jeff King @ 2021-03-18 19:35 UTC (permalink / raw)
  To: Han-Wen Nienhuys; +Cc: Martin Fick, git

On Thu, Mar 18, 2021 at 09:58:56AM +0100, Han-Wen Nienhuys wrote:

> > 1) Not all updates make it to the reflogs
> > 2) Reflogs can be edited or mucked with
> > 3) On NFS reflogs can outright be wrong even when used properly as there are
> > caching issues. We specifically have seen entries that appear to be FFs that
> > were not.
> 
> Can you tell me a little more about 3)? Since we don't annotate non-FF
> vs FF today, what does "appear to be FFs" mean?
> 
> But you are right: since the reflog for a branch is in a different
> file from the branch head, there is no way to do an update to both of
> them at the same time. I guess this will have to be a reftable-only
> feature.

Each individual reflog entry (in the branch reflog and the HEAD reflog)
should still be consistent, though. They give the "before" and "after"
object ids, and the ff-ness is an immutable property of those commit
ids.
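
To illustrate with a toy commit graph: the sketch below decides ff-ness purely
from the two ids and the parent links. The `parents` map is a hypothetical
in-memory stand-in for the object store, not a git API:

```python
from collections import deque
from typing import Dict, List


def is_fast_forward(parents: Dict[str, List[str]], old_id: str, new_id: str) -> bool:
    """Return True if old_id is an ancestor of (or equal to) new_id."""
    seen = {new_id}
    queue = deque([new_id])
    while queue:
        commit = queue.popleft()
        if commit == old_id:
            return True
        # Follow parent pointers breadth-first, visiting each commit once.
        for parent in parents.get(commit, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return False
```

The answer depends only on the graph below `new_id`, which never changes
once the commits exist, so it can be computed at any later time as long as
the objects are still available.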

> > I believe that today git can do very fast reachability checks without opening
> > pack files by using some of its indexes (bitmap code or
> > https://git-scm.com/docs/commit-graph ?). It probably makes sense to add this
> > ability to jgit if that is what you need?
> 
> The bitmaps are generated by GC, and you can't GC all the time. JGit
> has support for bitmaps, and its support actually predates C-Git's
> support for it. (It was added to JGit by Colby Ranger who worked in
> Shawn's team).

Bitmaps can help with these checks, but we don't actually look at them
in most of the algorithms one might use for computing ancestry. One of
the reasons for that is that they often backfire as an optimization,
because:

  - as you note, they are often not up to date because they require a
    repack. So they won't help when asking about very recently added
    commits (which people tend to ask about more than ancient ones).

  - the bitmap file format doesn't have any index. So a reader has to
    scan the whole thing upon opening to decide which commits have
    bitmaps.

For several years we had a patch at GitHub that checked for bitmaps
during "--contains" traversals. Even though it did sometimes backfire,
it was enough of a net win to be worth keeping, compared to actually
opening commit objects to follow their parent pointers. But with
commit-graphs, it was a strict loss, and we stopped using it entirely
last year. (We do still look at bitmaps for our branch ahead/behind
checks using a custom patch; I'm suspicious of its performance for the
same reasons, but we haven't dug carefully into it).

But...

> I expect that the commit graph doesn't work for my intended use-case.

...I think commit-graphs are a big win here. They are more often kept up
to date, because they can be generated incrementally with effort
proportional to the number of new commits. And they make a big
difference if the traversal has to cover a lot of commits. E.g., here's
the most extreme case in git.git, checking ancestry of the oldest
commit:

  $ time git merge-base --is-ancestor e83c5163316f89bfbde7d9ab23ca2e25604af290 HEAD; echo $?

  real	0m0.014s
  user	0m0.008s
  sys	0m0.005s
  0

  $ time git -c core.commitgraph=false merge-base --is-ancestor e83c5163316f89bfbde7d9ab23ca2e25604af290 HEAD; echo $?

  real	0m0.398s
  user	0m0.369s
  sys	0m0.028s
  0

Of course most results won't be so dramatic, because they wouldn't have
to traverse many commits in the first place (so they are already pretty
fast with or without the commit-graph).  But that 14ms should be an
upper bound for this repo. And naturally that scales with the number of
commits (in linux.git it's 43ms, compared to 8.7s without commit-graphs).

-Peff

* Re: Distinguishing FF vs non-FF updates in the reflog?
  2021-03-17 20:06 Distinguishing FF vs non-FF updates in the reflog? Han-Wen Nienhuys
  2021-03-17 21:21 ` Martin Fick
@ 2021-03-18 19:47 ` Jeff King
  2021-03-22 14:40   ` Han-Wen Nienhuys
  2021-03-22 13:26 ` Ævar Arnfjörð Bjarmason
  2 siblings, 1 reply; 19+ messages in thread
From: Jeff King @ 2021-03-18 19:47 UTC (permalink / raw)
  To: Han-Wen Nienhuys; +Cc: git

On Wed, Mar 17, 2021 at 09:06:06PM +0100, Han-Wen Nienhuys wrote:

> I'm working on some extensions to Gerrit for which it would be very
> beneficial if we could tell from the reflog if an update is a
> fast-forward or not: if we find a SHA1 in the reflog, and see there
> were only FF updates since, we can be sure that the SHA1 is reachable
> from the branch, without having to open packfiles and decode commits.

I left some numbers in another part of the thread, but IMHO performance
isn't that compelling a reason to do this these days, if you are using
commit-graphs.

Just walking the reflog might be _slightly_ faster, though not
necessarily (it depends on whether the depth of the object graph or the
depth of the reflog chain is deeper). It might matter more if you are
using a more exotic storage scheme, where switching from accessing
reflogs to objects implies extra round-trips to a server (e.g., custom
storage backends with JGit; I don't know the state of the art in what
Google is doing there).

> For the reftable format, I think we could store this easily by
> introducing more record types. Today we have 0 = deletion, 1 = update,
> and we could add 2 = FF update, 3 = non-FF update.
> 
> However, the textual reflog format doesn't easily allow for this.
> Instead, we might add a convention, e.g. have the message start with
> 'FF' or 'NFF' depending on the nature of the update.
> 
> Does this make sense, and if yes is it worth proposing a change?

At GitHub we do something similar. We don't generally use reflogs much
at all, but we keep a custom "audit log": a single append-only file that
records every ref update in the repository. And its format just happens
to be one reflog entry per line, prefixed by the updated ref.

And there we do generally annotate the FF-ness of an update by stuffing
it into the free-form message field (in fact, we shove in a small JSON
object, so we record multiple fields like the pushing id, IP, etc).
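
Roughly, such a line could be produced and read back as below. This is only a
sketch: the exact layout and the JSON field names (`ff`, `pusher`) are
assumptions for illustration, not the real format:

```python
import json


def format_audit_line(ref, old_id, new_id, identity, meta):
    """Build one audit-log line: the ref name, then a reflog-style entry.

    `meta` is a small dict stored as JSON in the free-form message field.
    """
    message = json.dumps(meta, sort_keys=True)
    return f"{ref} {old_id} {new_id} {identity}\t{message}"


def parse_audit_line(line):
    """Split a line made by format_audit_line back into its fields."""
    header, _, message = line.rstrip("\n").partition("\t")
    # Ref names cannot contain spaces, so the first three splits are safe;
    # the remainder is the "Name <email> timestamp tz" identity.
    ref, old_id, new_id, identity = header.split(" ", 3)
    return ref, old_id, new_id, identity, json.loads(message)
```

Because `json.dumps` escapes tabs and newlines, the message stays on one line
and the first tab still separates header from message.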

But the main goal there isn't performance (and in fact we don't
generally consult it for anything outside of debugging). The reason we
record FF-ness is for later debugging or analysis. We don't prune from
the audit log, and we don't consider it for reachability when we prune
objects (since otherwise you'd never be able to prune anything!). So the
objects sometimes aren't available later to compute, but we still want
to know if the user did a force-push, etc.

I don't think that really applies to regular reflogs, because they do
imply reachability (and they are not great for later analysis, because
we may selectively expire unreachable entries).

-Peff

* Re: Distinguishing FF vs non-FF updates in the reflog?
  2021-03-18  8:58   ` Han-Wen Nienhuys
  2021-03-18 19:35     ` Jeff King
@ 2021-03-18 22:24     ` Martin Fick
  2021-03-22 12:31       ` Han-Wen Nienhuys
  2021-03-18 22:31     ` Martin Fick
  2 siblings, 1 reply; 19+ messages in thread
From: Martin Fick @ 2021-03-18 22:24 UTC (permalink / raw)
  To: Han-Wen Nienhuys; +Cc: git

On Thursday, March 18, 2021 9:58:56 AM MDT Han-Wen Nienhuys wrote:
> On Wed, Mar 17, 2021 at 10:22 PM Martin Fick <mfick@codeaurora.org> wrote:
> > On Wednesday, March 17, 2021 9:06:06 PM MDT Han-Wen Nienhuys wrote:
> > > I'm working on some extensions to Gerrit for which it would be very
> > > beneficial if we could tell from the reflog if an update is a
> > > fast-forward or not: if we find a SHA1 in the reflog, and see there
> > > were only FF updates since, we can be sure that the SHA1 is reachable
> > > from the branch, without having to open packfiles and decode commits.
> > 
> > I don't think this would be reliable.
> > 
> > 1) Not all updates make it to the reflogs
> > 2) Reflogs can be edited or mucked with
> > 3) On NFS reflogs can outright be wrong even when used properly as there
> > are caching issues. We specifically have seen entries that appear to be
> > FFs that were not.
> 
> Can you tell me a little more about 3)? Since we don't annotate non-FF
> vs FF today, what does "appear to be FFs" mean?

To be honest I don't recall for sure, but I will describe what I think has 
happened. I think that we have seen a server(A) update a branch from
C1 to C2A, and then later another server(B) update the same branch from C1 to 
C2B. Obviously the move from C2A to C2B is not a FF, but that move is not what 
is recorded. Each of those updates was a FF when viewed as separate entries, 
but if you look at both lines you can see that the second entry does not start 
where the first one left off. This would be detectable, it would appear as if 
there was a missed entry that did a rewind from C2A to C1, but that rewind 
presumably actually came in as part of the second update from server (B)  as 
it had a cached version of the branch and believed it still pointed to C1 when 
it made its update,

-Martin

* Re: Distinguishing FF vs non-FF updates in the reflog?
  2021-03-18  8:58   ` Han-Wen Nienhuys
  2021-03-18 19:35     ` Jeff King
  2021-03-18 22:24     ` Martin Fick
@ 2021-03-18 22:31     ` Martin Fick
  2021-03-18 22:54       ` Jeff King
  2 siblings, 1 reply; 19+ messages in thread
From: Martin Fick @ 2021-03-18 22:31 UTC (permalink / raw)
  To: Han-Wen Nienhuys; +Cc: git

On Thursday, March 18, 2021 9:58:56 AM MDT Han-Wen Nienhuys wrote:
> The bitmaps are generated by GC, and you can't GC all the time. 

I believe that I recently saw an effort to make this incremental, perhaps 
related to the geometric repacking series? If that were the case, you could gc 
much more often cheaply. Perhaps it could be something done on every upload at 
some point the way that reflog effectively does on every update?

As Peff pointed out though, commit-graphs would still be a better way to go 
anyway since that is closer to their intent,

-Martin

* Re: Distinguishing FF vs non-FF updates in the reflog?
  2021-03-18 22:31     ` Martin Fick
@ 2021-03-18 22:54       ` Jeff King
  0 siblings, 0 replies; 19+ messages in thread
From: Jeff King @ 2021-03-18 22:54 UTC (permalink / raw)
  To: Martin Fick; +Cc: Han-Wen Nienhuys, git

On Thu, Mar 18, 2021 at 04:31:24PM -0600, Martin Fick wrote:

> On Thursday, March 18, 2021 9:58:56 AM MDT Han-Wen Nienhuys wrote:
> > The bitmaps are generated by GC, and you can't GC all the time. 
> 
> I believe that I recently saw an effort to make this incremental, perhaps 
> related to the geometric repacking series? If that were the case, you could gc 
> much more often cheaply. Perhaps it could be something done on every upload at 
> some point the way that reflog effectively does on every update?

That geometric repacking work is leading up to having a bitmap for a
multi-pack-index. Which will make them _cheaper_, but still not
especially cheap (because we've reordered the objects corresponding to
each bit, and also because our writing process still does a lot of
O(nr_commits) work).

In the very long run, I think the way out would be to stop using pack or
midx ordering as the basis of the bitmap, and instead have a stable
object ordering that can be appended to. That would allow true
incremental generation of the bitmaps (leaving old ones in place, and
just adding new ones to represent new commits). But that's such a big
departure from the status quo that having a midx bitmap seemed like a
more attainable middle ground in the meantime.

-Peff

* Re: Distinguishing FF vs non-FF updates in the reflog?
  2021-03-18 22:24     ` Martin Fick
@ 2021-03-22 12:31       ` Han-Wen Nienhuys
  2021-03-22 17:45         ` Martin Fick
  0 siblings, 1 reply; 19+ messages in thread
From: Han-Wen Nienhuys @ 2021-03-22 12:31 UTC (permalink / raw)
  To: Martin Fick; +Cc: git

On Thu, Mar 18, 2021 at 11:24 PM Martin Fick <mfick@codeaurora.org> wrote:
>
> On Thursday, March 18, 2021 9:58:56 AM MDT Han-Wen Nienhuys wrote:
> > On Wed, Mar 17, 2021 at 10:22 PM Martin Fick <mfick@codeaurora.org> wrote:
> > > On Wednesday, March 17, 2021 9:06:06 PM MDT Han-Wen Nienhuys wrote:
> > > > I'm working on some extensions to Gerrit for which it would be very
> > > > beneficial if we could tell from the reflog if an update is a
> > > > fast-forward or not: if we find a SHA1 in the reflog, and see there
> > > > were only FF updates since, we can be sure that the SHA1 is reachable
> > > > from the branch, without having to open packfiles and decode commits.
> > >
> > > I don't think this would be reliable.
> > >
> > > 1) Not all updates make it to the reflogs
> > > 2) Reflogs can be edited or mucked with
> > > > 3) On NFS reflogs can outright be wrong even when used properly as there
> > > are caching issues. We specifically have seen entries that appear to be
> > > FFs that were not.
> >
> > > Can you tell me a little more about 3)? Since we don't annotate non-FF
> > vs FF today, what does "appear to be FFs" mean?
>
> To be honest I don't recall for sure, but I will describe what I think has
> happened. I think that we have seen a server(A) update a branch from
> C1 to C2A, and then later another server(B) update the same branch from C1 to
> C2B. Obviously the move from C2A to C2B is not a FF, but that move is not what
> is recorded. Each of those updates was a FF when viewed as separate entries,

I think those would fail with the way that Gerrit uses JGit, because
C1 -> C2B would fail with LOCK_ERROR. I guess there are code paths in
Git (?) that will execute force-push without checking if the update is
FF or not.

-- 
Han-Wen Nienhuys - Google Munich
I work 80%. Don't expect answers from me on Fridays.

* Re: Distinguishing FF vs non-FF updates in the reflog?
  2021-03-17 20:06 Distinguishing FF vs non-FF updates in the reflog? Han-Wen Nienhuys
  2021-03-17 21:21 ` Martin Fick
  2021-03-18 19:47 ` Jeff King
@ 2021-03-22 13:26 ` Ævar Arnfjörð Bjarmason
  2021-03-22 14:59   ` Han-Wen Nienhuys
  2 siblings, 1 reply; 19+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-03-22 13:26 UTC (permalink / raw)
  To: Han-Wen Nienhuys; +Cc: git, Jeff King, Martin Fick

On Wed, Mar 17 2021, Han-Wen Nienhuys wrote:

> Hi there,
>
> I'm working on some extensions to Gerrit for which it would be very
> beneficial if we could tell from the reflog if an update is a
> fast-forward or not: if we find a SHA1 in the reflog, and see there
> were only FF updates since, we can be sure that the SHA1 is reachable
> from the branch, without having to open packfiles and decode commits.
>
> For the reftable format, I think we could store this easily by
> introducing more record types. [snip].

Aside from what others have mentioned here, you're talking about the
log_type field are you not? I.e.:
https://googlers.googlesource.com/sop/jgit/+/reftable/Documentation/technical/reftable.md#log-block-format

Has that "log_type = 0x0" tombstone proven to be a worthwhile
optimization past the stash case mentioned there (which is presumably not
relevant to the vast majority of Google's use-cases)?

I.e. it's redundant to looking at the record and seeing if new_id =
ZERO_OID.

Similarly, can't ff vs. non-ff be deduced unambiguously by looking ahead
to the next record, and seeing if the current record's "old_id" matches
that of the last record's "new_id"? If it does it's a FF, if not it's a
non-FF (or a create/delete).

I'm not arguing that a quicker lookup isn't needed, I'm just trying to
dig at what "beneficial" here is. The format is ordered, and the common
case is that the page we have in memory has the last record.

What sort of case are we talking about where not unpacking the log_data
segment is making a difference?

> However, the textual reflog format doesn't easily allow for this.
> Instead, we might add a convention, e.g. have the message start with
> 'FF' or 'NFF' depending on the nature of the update.

Maybe a bit ugly, but a ".." and "..." prefix would at least be
consistent with "fetch" output. Or e.g. "commit:" and "+commit:" for ff
and non-ff (and we could make it "\t commit:" vs. "\t+commit:"
vs. the current "\tcommit:" to distinguish all three in the current
text-based format), per "OUTPUT" in git-fetch(1).
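
A reader for the second variant might look like the sketch below. It assumes
the convention proposed here (a leading space for ff, a leading "+" for
non-ff), which no released git writes today; the function name is
illustrative:

```python
def classify_reflog_line(line: str) -> str:
    """Classify one textual reflog line by the first character of its message.

    Returns "ff", "non-ff", or "unannotated" (the current format, or a line
    that has no message at all).
    """
    # The message is everything after the first tab on the line.
    _header, _sep, message = line.rstrip("\n").partition("\t")
    if message.startswith("+"):
        return "non-ff"
    if message.startswith(" "):
        return "ff"
    return "unannotated"
```

Existing entries fall into the "unannotated" bucket, so old and new lines can
coexist in one file.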

> [Ævar: snipped from earlier] Today we have 0 = deletion, 1 = update,
> and we could add 2 = FF update, 3 = non-FF update.

I've written log table implementations (a site table in a RDBMS) for git
(one table for refs) which had:

    create, ff, non-ff, delete

I wonder if that quad-state would be useful for reftable too, with this
proposed change you'd still need to unpack the record and see if the
old_id is ZERO_OID to check if it's a creation, would you not?

I also wonder if it couldn't be:

    0 = deletion, 1 = non-ff-update, 2 = ff-update, 4 = creation

So the format wouldn't forever carry the historical wart of this not
having been considered from the beginning.

It would mean that the few current reftable users (just Google?) would
have to look at the record to see if it's *really* a non-ff-update, but
presumably they need to do so now for ff v.s. non-ff, so they're no
worse off than they are now.

Then when those users know they're on a version that distinguishes these
they can hard rely on 1 not being a "ff for sure", not a "maybe" status
for new updates. Presumably they either don't care about ancient reflog
records, or a one-off migration of rewriting the records for older
entries could be done.

Also between my [1] and this proposal we have at least a reftable v1.01
in the wild (the filename locking behavior change discussed in [1]), and
this would make it v1.02, but the only up-to-date spec is for v1.00 (and
maybe JGit has other changes I haven't tracked).

That [1] change is minor, but still, a spec change.

So just a *poke* that having some version where the spec is kept
up-to-date with that and this change if it happens would be very useful,
especially if the reftable-in-git.git lands one of these days.

1. https://lore.kernel.org/git/87k0tzulf1.fsf@evledraar.gmail.com/

* Re: Distinguishing FF vs non-FF updates in the reflog?
  2021-03-18 19:47 ` Jeff King
@ 2021-03-22 14:40   ` Han-Wen Nienhuys
  2021-03-26  7:43     ` Jeff King
  0 siblings, 1 reply; 19+ messages in thread
From: Han-Wen Nienhuys @ 2021-03-22 14:40 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Thu, Mar 18, 2021 at 8:47 PM Jeff King <peff@peff.net> wrote:
> > I'm working on some extensions to Gerrit for which it would be very
> > beneficial if we could tell from the reflog if an update is a
> > fast-forward or not: if we find a SHA1 in the reflog, and see there
> > were only FF updates since, we can be sure that the SHA1 is reachable
> > from the branch, without having to open packfiles and decode commits.
>
> I left some numbers in another part of the thread, but IMHO performance
> isn't that compelling a reason to do this these days, if you are using
> commit-graphs.
>
> Just walking the reflog might be _slightly_ faster, though not
> necessarily (it depends on whether the depth of the object graph or the
> depth of the reflog chain is deeper). It might matter more if you are
> using a more exotic storage scheme, where switching from accessing
> reflogs to objects implies extra round-trips to a server (e.g., custom
> storage backends with JGit; I don't know the state of the art in what
> Google is doing there).

JGit doesn't currently support commit-graph, so it's hard to predict
what performance will be like, but isn't commit-graph keyed by
SHA1? That makes it hard to do caching, especially when considering
large repositories.

AFAIU, commit-graph would help speed up reachability checks, by being
able to shortcut cases where the commit number proves that some commit
is not an ancestor of the other, but you still have to do a revwalk to
conclusively prove reachability.

In our storage system, the revwalk runs on top of packfile data that
must be faulted-in (slow!) from datacenter-wide storage. It's made
worse because we don't support midx yet.

The application that I'm thinking of is providing a way for automation to
deal with lagging replicas. This could be done by specifying a

  X-Need-GitRef: $repositoryname~$refname~$SHA1

header on Gerrit requests, which specifies that the given $SHA1 must have
been in a recent ref update, and be reachable from $refname. The
reflog has this information organized in a form that is suited very well
to answering these questions quickly (assuming the reflog is annotated
such that we can distinguish FF and non-FF updates).
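
A minimal sketch of parsing such a header value, assuming a 40-hex-digit SHA1
and splitting from the right (ref names cannot contain "~", but a repository
name might). The class and function names are illustrative, not Gerrit APIs:

```python
from typing import NamedTuple


class NeedGitRef(NamedTuple):
    repository: str
    refname: str
    sha1: str


def parse_need_gitref(value: str) -> NeedGitRef:
    """Parse "$repositoryname~$refname~$SHA1" into its three parts."""
    parts = value.strip().rsplit("~", 2)
    if len(parts) != 3:
        raise ValueError("expected repository~refname~sha1")
    repository, refname, sha1 = parts
    if not repository or not refname:
        raise ValueError("empty repository or ref name")
    if len(sha1) != 40 or any(c not in "0123456789abcdef" for c in sha1):
        raise ValueError("sha1 must be 40 lowercase hex digits")
    return NeedGitRef(repository, refname, sha1)
```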

> > Does this make sense, and if yes is it worth proposing a change?
>
> At GitHub we do something similar. We don't generally use reflogs much
> at all, but we keep a custom "audit log": a single append-only file that
> records every ref update in the repository. And its format just happens
> to be one reflog entry per line, prefixed by the updated ref.

The interest of having a standard/convention in Git would be to not
require reftable for folks that want to use this feature.

> And there we do generally annotate the FF-ness of an update by stuffing
> it into the free-form message field (in fact, we shove in a small JSON
> object, so we record multiple fields like the pushing id, IP, etc).
>
> But the main goal there isn't performance (and in fact we don't
> generally consult it for anything outside of debugging). The reason we
> record FF-ness is for later debugging or analysis. We don't prune from
> the audit log, and we don't consider it for reachability when we prune
> objects (since otherwise you'd never be able to prune anything!). So the
> objects sometimes aren't available later to compute, but we still want
> to know if the user did a force-push, etc.

We store reflogs in a global database table, which has this kind of
information, but the Google-specific format is harder to make work
with Gerrit, which is open source.

-- 
Han-Wen Nienhuys - Google Munich
I work 80%. Don't expect answers from me on Fridays.
--
Google Germany GmbH, Erika-Mann-Strasse 33, 80636 Munich
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg
Geschäftsführer: Paul Manicle, Halimah DeLaine Prado

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Distinguishing FF vs non-FF updates in the reflog?
  2021-03-22 13:26 ` Ævar Arnfjörð Bjarmason
@ 2021-03-22 14:59   ` Han-Wen Nienhuys
  2021-03-22 15:39     ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 19+ messages in thread
From: Han-Wen Nienhuys @ 2021-03-22 14:59 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: git, Jeff King, Martin Fick

On Mon, Mar 22, 2021 at 2:26 PM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
> > I'm working on some extensions to Gerrit for which it would be very
> > beneficial if we could tell from the reflog if an update is a
> > fast-forward or not: if we find a SHA1 in the reflog, and see there
> > were only FF updates since, we can be sure that the SHA1 is reachable
> > from the branch, without having to open packfiles and decode commits.
> >
> > For the reftable format, I think we could store this easily by
> > introducing more record types. [snip].
>
> Aside from what others have mentioned here, you're talking about the
> log_type field are you not? I.e.:
> https://googlers.googlesource.com/sop/jgit/+/reftable/Documentation/technical/reftable.md#log-block-format

Correct.

> Has that "log_type = 0x0" tombstone proven to be a worthwhile
> optimization past the stash case mentioned there (which is presumably not
> relevant to the vast majority of Google's use-cases)?

I've never really understood the log_type=0x0 use case. I think it was
added solely to cater for a use case in CGit's stash command.

> I.e. it's redundant to looking at the record and seeing if new_id =
> ZERO_OID.
>
> Similarly, can't ff vs. non-ff be deduced unambiguously by looking ahead
> to the next record, and seeing if the current record's "old_id" matches
> that of the last record's "new_id"? If it does it's a FF, if not it's a
> non-FF (or a create/delete).

I don't see how that will tell you FF vs non-FF-ness.  Both an FF
update and a non-FF  update look like 'new_oid = 20-random-bytes'.
Barring further info, you have to lookup the commit object for those
bytes, and then walk back to see if you pass old_oid.

AFAICT, a correct sequence of ref updates (FF or not) always has
prev.new_oid = current.old_oid.
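
As a sketch, that invariant can be checked over a list of (old_oid, new_oid)
pairs, oldest first. It flags broken chains such as the NFS case described
elsewhere in the thread, but says nothing about FF-ness. The function name is
illustrative:

```python
from typing import List, Sequence, Tuple


def find_chain_breaks(entries: Sequence[Tuple[str, str]]) -> List[int]:
    """Return each index i where entries[i] does not start at the previous new_oid."""
    return [
        i
        for i in range(1, len(entries))
        if entries[i][0] != entries[i - 1][1]
    ]
```

For the updates C1 -> C2A followed by C1 -> C2B, the second entry is reported
because its old_oid is C1 rather than C2A.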

> > [Ævar: snipped from earlier] Today we have 0 = deletion, 1 = update,
> > and we could add 2 = FF update, 3 = non-FF update.
>
> I've written log table implementations (a site table in a RDBMS) for git
> (one table for refs) which had:
>
>     create, ff, non-ff, delete
>
> I wonder if that quad-state would be useful for reftable too, with this
> proposed change you'd still need to unpack the record and see if the
> old_id is ZERO_OID to check if it's a creation, would you not?

Delete & create are handled with ZERO_OID.

The reftable format makes it so that you have to decode a record in
order to read past it (there is no size framing at the table entry
level), so there is no big performance advantage in encoding this
information in the log_type. You merely use a log_type bit rather than
a 20 byte raw ID. Since log records are zlib compressed anyway, it
probably also makes no space difference.

> I also wonder if it couldn't be:
>
>     0 = deletion, 1 = non-ff-update, 2 = ff-update, 4 = creation
>
> So the format wouldn't forever carry the historical wart of this not
> having been considered from the beginning.

If you do it like this, you will force all implementations to
compute whether a (forced) update is a FF or not. I don't know
if that is a problem. A 'maybe non-FF' value would be useful. Perhaps
we could even do simply

     0 = deletion, 1 = maybe-ff-update, 2 = guaranteed-ff-update
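
A rough sketch of what that three-value scheme could mean for writers and readers. The enum and function names below are made up for illustration, and value 2 is only the proposal from this thread, not part of any published spec.

```python
from enum import IntEnum

class LogType(IntEnum):
    DELETION = 0              # value as quoted in this thread
    MAYBE_FF_UPDATE = 1       # today's plain "update"
    GUARANTEED_FF_UPDATE = 2  # proposed here, hypothetical

def log_type_for_update(known_ff=False):
    """Writers that never check ancestry simply keep emitting 1."""
    if known_ff:
        return LogType.GUARANTEED_FF_UPDATE
    return LogType.MAYBE_FF_UPDATE

def needs_graph_walk(log_type):
    """Only a 'maybe' record forces the reader to look at commits."""
    return log_type == LogType.MAYBE_FF_UPDATE

print(log_type_for_update().name)
print(log_type_for_update(known_ff=True).name)
```

The point of the "maybe" value is that existing writers stay correct without any extra work.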

> It would mean that the few current reftable users (just Google?) would
> have to look at the record to see if it's *really* a non-ff-update, but
> presumably they need to do so now for ff v.s. non-ff, so they're no
> worse off than they are now.

At Google, we currently don't record log records in reftable yet. From
our perspective, we could probably change the standard 'in place'.
JGit has supported reftable since Nov 2019, but I'm unaware of users;
I did hear about GerritForge wanting to try it out in production this
year.

> Then when those users know they're on a version that distinguishes these
> they can hard rely on 1 not being a "ff for sure", not a "maybe" status
> for new updates. Presumably they either don't care about ancient reflog
> records, or a one-off migration of rewriting the records for older
> entries could be done.
>
> Also between my [1] and this proposal we have at least a reftable v1.01
> in the wild (the filename locking behavior change discussed in [1]), and
> this would make it v1.02, but the only up-to-date spec is for v1.00 (and
> maybe JGit has other changes I haven't tracked).

The file locking update has been added to the standard,
https://github.com/git/git/pull/951.

-- 
Han-Wen Nienhuys - Google Munich
I work 80%. Don't expect answers from me on Fridays.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Distinguishing FF vs non-FF updates in the reflog?
  2021-03-22 14:59   ` Han-Wen Nienhuys
@ 2021-03-22 15:39     ` Ævar Arnfjörð Bjarmason
  2021-03-22 15:56       ` Han-Wen Nienhuys
  0 siblings, 1 reply; 19+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-03-22 15:39 UTC (permalink / raw)
  To: Han-Wen Nienhuys; +Cc: git, Jeff King, Martin Fick


On Mon, Mar 22 2021, Han-Wen Nienhuys wrote:

> On Mon, Mar 22, 2021 at 2:26 PM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>> > I'm working on some extensions to Gerrit for which it would be very
>> > beneficial if we could tell from the reflog if an update is a
>> > fast-forward or not: if we find a SHA1 in the reflog, and see there
>> > were only FF updates since, we can be sure that the SHA1 is reachable
>> > from the branch, without having to open packfiles and decode commits.
>> >
>> > For the reftable format, I think we could store this easily by
>> > introducing more record types. [snip].
>>
>> Aside from what others have mentioned here, you're talking about the
>> log_type field are you not? I.e.:
>> https://googlers.googlesource.com/sop/jgit/+/reftable/Documentation/technical/reftable.md#log-block-format
>
> Correct.
>
>> Has that "log_type = 0x0" tombstone proven to be a worthwhile
>> optimization past the stash case mention there (which is presumably not
>> relevant to the vast majority of Google's use-cases).
>
> I've never really understood the log_type=0x0 use case. I think it was
> added solely to cater for a use case in CGit's stash command.
>
>> I.e. it's redundant to looking at the record and seeing if new_id =
>> ZERO_OID.
>>
>> Similarly can't ff v.s. non-ff be deduced unambiguously by looking ahead
>> to the next record, and seeing if the current record's "old_id" matches
>> that of the last record's "new_id". If it does it's a FF, if not it's a
>> non-FF (or a create/delete).
>
> I don't see how that will tell you FF vs non-FF-ness.  Both an FF
> update and a non-FF  update look like 'new_oid = 20-random-bytes'.
> Barring further info, you have to lookup the commit object for those
> bytes, and then walk back to see if you pass old_oid.

Because both the text reflog format and reftable's have the update-ref
<newvalue> <oldvalue> for each update. So:

    $ cut -d ' ' -f1-2 .git/logs/refs/remotes/origin/master | head -n 2
    1c52ecf4ba0f4f7af72775695fee653f50737c71 ba2aa15129e59f248d8cdd30404bc78b5178f61d
    ba2aa15129e59f248d8cdd30404bc78b5178f61d 6d3ef5b467eccd2769f1aa1c555d317d3c8dc707

We can know with !strcmp(rows[0][1], rows[1][0]) whether the latest
update is a ff or non-ff (it is). As opposed to:

    $ cut -d ' ' -f1-2 .git/logs/refs/remotes/origin/seen | head -n 2
    8b26a41f4bf7e7c0f097cb91012d08fe8ae30e7f 0f3a981cbd5be5f97e9504ab770cd88f988fe820
    0f3a981cbd5be5f97e9504ab770cd88f988fe820 fdd019edfe6bc60d0100d5751c41e4f6ad28a2ef

Where the rows[0][1] value is not the same as rows[1][0].

What you can't do is find whether any given commit(s) in those ranges
are part of the FF or not, since you just have the start/end points.

But I'm vaguely paranoid that we're talking past one another here and
I've misunderstood what you want...

> AFAICT, a correct sequence of ref updates (FF or not) always has
> prev.new_oid = current.old_oid.
>
>> > [Ævar: snipped from earlier] Today we have 0 = deletion, 1 = update,
>> > and we could add 2 = FF update, 3 = non-FF update.
>>
>> I've written log table implementations (a site table in a RDBMS) for git
>> (one table for refs) which had:
>>
>>     create, ff, non-ff, delete
>>
>> I wonder if that quad-state would be useful for reftable too, with this
>> proposed change you'd still need to unpack the record and see if the
>> old_id is ZERO_OID to check if it's a creation, would you not?
>
> Delete & create are handled with ZERO_OID.
>
> The reftable format makes it so that you have to decode a record in
> order to read past it (there is no size framing the table entry
> level), so there is no big performance advantage in encoding this
> information in the log_type. You merely use a log_type bit rather than
> a 20 byte raw ID. Since log records are zlib compressed anyway, it
> probably also makes no space difference.
>
>> I also wonder if it couldn't be:
>>
>>     0 = deletion, 1 = non-ff-update, 2 = ff-update, 4 = creation
>>
>> So the format wouldn't forever carry the historical wart of this not
>> having been considered from the beginning.
>
> If you do it like this, you will force that all implementations to
> have to compute whether a (forced) update is a FF or not. I don't know
> if that is a problem. A 'maybe non-FF' value would be useful. Perhaps
> we could even do simply
>
>      0 = deletion, 1 = maybe-ff-update, 2 = guaranteed-ff-update

The point is that nobody relies on this now, but also that you'd of
course want 1 = i-know-for-sure-non-ff-update, 2 =
i-know-for-sure-ff-update.

But since logs are transitory (aren't they also expired in reftable?
I can't see that from skimming the spec) and this is just a helpful
side-index, I think a one-time migration for the tiny minority of
reftable users who'd care would be preferable to the vast majority of
future reftable users (if it ever lands in git.git) having to deal with
this (albeit small) special-case forever.

>> It would mean that the few current reftable users (just Google?) would
>> have to look at the record to see if it's *really* a non-ff-update, but
>> presumably they need to do so now for ff v.s. non-ff, so they're no
>> worse off than they are now.
>
> At Google, we currently don't record log records in reftable yet. From
> our perspective, we could probably change the standard 'in place'.
> JGit has supported reftable since Nov 2019, but I'm unaware of users;
> I did hear about GerritForge wanting to try it out in production this
> year.

Ah, so not even Google's using it, the backcompat is just for good
measure...

>> Then when those users know they're on a version that distinguishes these
>> they can hard rely on 1 not being a "ff for sure", not a "maybe" status
>> for new updates. Presumably they either don't care about ancient reflog
>> records, or a one-off migration of rewriting the records for older
>> entries could be done.
>>
>> Also between my [1] and this proposal we have at least a reftable v1.01
>> in the wild (the filename locking behavior change discussed in [1]), and
>> this would make it v1.02, but the only up-to-date spec is for v1.00 (and
>> maybe JGit has other changes I haven't tracked).
>
> The file locking update has been added to the standard,
> https://github.com/git/git/pull/951.

Ah yes. Now I remember. I managed to miss/not recall that. Thanks.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Distinguishing FF vs non-FF updates in the reflog?
  2021-03-22 15:39     ` Ævar Arnfjörð Bjarmason
@ 2021-03-22 15:56       ` Han-Wen Nienhuys
  2021-03-22 16:40         ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 19+ messages in thread
From: Han-Wen Nienhuys @ 2021-03-22 15:56 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: git, Jeff King, Martin Fick

On Mon, Mar 22, 2021 at 4:39 PM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
> We can know with !strcmp(rows[0][1], rows[1][0]) whether the latest
> update is a ff or non-ff (it is). As opposed to:
>
>     $ cut -d ' ' -f1-2 .git/logs/refs/remotes/origin/seen | head -n 2
>     8b26a41f4bf7e7c0f097cb91012d08fe8ae30e7f 0f3a981cbd5be5f97e9504ab770cd88f988fe820
>     0f3a981cbd5be5f97e9504ab770cd88f988fe820 fdd019edfe6bc60d0100d5751c41e4f6ad28a2ef
>
> Where the rows[0][1] value is not the same as rows[1][0].

I'm confused.

rows[0][1] == "0f3a981cbd5be5f97e9504ab770cd88f988fe820"
rows[1][0] == "0f3a981cbd5be5f97e9504ab770cd88f988fe820"

they are the same. I don't understand your argument.


-- 
Han-Wen Nienhuys - Google Munich
I work 80%. Don't expect answers from me on Fridays.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Distinguishing FF vs non-FF updates in the reflog?
  2021-03-22 15:56       ` Han-Wen Nienhuys
@ 2021-03-22 16:40         ` Ævar Arnfjörð Bjarmason
  2021-03-22 17:12           ` Han-Wen Nienhuys
  2021-03-22 18:36           ` Junio C Hamano
  0 siblings, 2 replies; 19+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-03-22 16:40 UTC (permalink / raw)
  To: Han-Wen Nienhuys; +Cc: git, Jeff King, Martin Fick

On Mon, Mar 22 2021, Han-Wen Nienhuys wrote:

> On Mon, Mar 22, 2021 at 4:39 PM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>> We can know with !strcmp(rows[0][1], rows[1][0]) whether the latest
>> update is a ff or non-ff (it is). As opposed to:
>>
>>     $ cut -d ' ' -f1-2 .git/logs/refs/remotes/origin/seen | head -n 2
>>     8b26a41f4bf7e7c0f097cb91012d08fe8ae30e7f 0f3a981cbd5be5f97e9504ab770cd88f988fe820
>>     0f3a981cbd5be5f97e9504ab770cd88f988fe820 fdd019edfe6bc60d0100d5751c41e4f6ad28a2ef
>>
>> Where the rows[0][1] value is not the same as rows[1][0].
>
> I'm confused.
>
> rows[0][1] == "0f3a981cbd5be5f97e9504ab770cd88f988fe820"
> rows[1][0] == "0f3a981cbd5be5f97e9504ab770cd88f988fe820"
>
> they are the same. I don't understand your argument.

Sorry, I mean same = ff update, not the same = non-ff. So I flipped
those around in describing it.

But in any case, the point is that you can reliably tell from a log of
updates whether individual updates are ff or non-ff, as long as you:

1. Look at two records but not one, or rather the ff-ness of the update
   you're looking at now depends on <oldvalue> of that update matching
   the <newvalue> of the update before that.

2. The log is guaranteed to be in the same order as the update-ref
   calls, and not to be pruned in the middle.

3. Don't care about the FF-ness of the first entry in the reflog
   (e.g. prune it at N entries or 2 weeks, you can't see if the oldest
   entry you have is a FF or not)
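
The adjacent-record comparison from point 1 can be sketched as below, using the two `origin/seen` rows quoted earlier in this thread. The function name is made up for illustration. Note that the check only establishes whether neighbouring entries chain together; it never looks at commit ancestry.

```python
def chain_breaks(rows):
    """Indices i where rows[i]'s second field differs from rows[i+1]'s first."""
    return [i for i in range(len(rows) - 1) if rows[i][1] != rows[i + 1][0]]

seen_log = """\
8b26a41f4bf7e7c0f097cb91012d08fe8ae30e7f 0f3a981cbd5be5f97e9504ab770cd88f988fe820
0f3a981cbd5be5f97e9504ab770cd88f988fe820 fdd019edfe6bc60d0100d5751c41e4f6ad28a2ef
"""
rows = [tuple(line.split()[:2]) for line in seen_log.splitlines()]
print(chain_breaks(rows))  # [] : the two entries chain without a gap
```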

So as I noted I've materialized this in a RDBMS before as something like
ENUM('delete', 'create', 'ff', 'non-ff'), but that was:

A. In the delete/create case as much about the convenience of
   selecting/grouping as anything else, easier to select operation =
   "delete" than {old,new} = "<copy paste or carefully type exactly 40 x
   '0'>"

B. In SQL it's a PITA both typing and index-wise to get information
   about a record based on a "previous" record when the relation between
   the two is a UNIQUE INDEX and AUTO_INCREMENT id. I.e. for an
   AUTO_INCREMENT UNIQUE INDEX of repo,refname,id to get the "last"
   record you need to select where repo && refname is the same, and id <
   current_id LIMIT 1.

C. You're accessing the data ad-hoc & manually via SQL, not some
   Git-specific query language/wrapper.

But in the case of reftable I'm wondering what the use-case is, since
presumably whatever API now serves up the reflog from it can just as
well look at one more record and fill in a materialized "ff" or "non-ff"
field for the current record based on that.
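
As a sketch of such a materialized field: create and delete fall out of the all-zero ID alone, while "ff" versus "non-ff" needs some ancestry oracle. The `is_ancestor` callback below is an assumption for illustration, as are the function name and the placeholder IDs.

```python
ZERO_OID = "0" * 40

def classify(old_oid, new_oid, is_ancestor=None):
    """Return 'create', 'delete', 'ff', 'non-ff', or 'update' if unknown."""
    if old_oid == ZERO_OID:
        return "create"
    if new_oid == ZERO_OID:
        return "delete"
    if is_ancestor is None:
        return "update"  # nothing in the record itself settles ff-ness
    return "ff" if is_ancestor(old_oid, new_oid) else "non-ff"

a, b = "a" * 40, "b" * 40  # placeholder object IDs
print(classify(ZERO_OID, a))
print(classify(a, ZERO_OID))
print(classify(a, b))
print(classify(a, b, is_ancestor=lambda old, new: True))
```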

Also: If you're adding more values a genuinely useful value to log (that
I've seen custom logging for) is "was this a forced push? || update-ref
without the <oldvalue>?", which is *not* the same as "not a
fast-forward". I.e. it's the difference between "update this with this
<oldvalue>" (ff or non-ff) v.s. "I don't care what the oldvalue is, just
update it" (force or not).

(To complicate matters and make them even harder to explain, the
whole "force with lease" thing is not actually a "force push" in that
sense, it's just an update-ref with an oldvalue you expect to result in
a non-ff update).

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Distinguishing FF vs non-FF updates in the reflog?
  2021-03-22 16:40         ` Ævar Arnfjörð Bjarmason
@ 2021-03-22 17:12           ` Han-Wen Nienhuys
  2021-03-22 18:36           ` Junio C Hamano
  1 sibling, 0 replies; 19+ messages in thread
From: Han-Wen Nienhuys @ 2021-03-22 17:12 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: git, Jeff King, Martin Fick

On Mon, Mar 22, 2021 at 5:40 PM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
>
> On Mon, Mar 22 2021, Han-Wen Nienhuys wrote:
>
> > On Mon, Mar 22, 2021 at 4:39 PM Ævar Arnfjörð Bjarmason
> > <avarab@gmail.com> wrote:
> >> We can know with !strcmp(rows[0][1], rows[1][0]) whether the latest
> >> update is a ff or non-ff (it is). As opposed to:
> >>
> >>     $ cut -d ' ' -f1-2 .git/logs/refs/remotes/origin/seen | head -n 2
> >>     8b26a41f4bf7e7c0f097cb91012d08fe8ae30e7f 0f3a981cbd5be5f97e9504ab770cd88f988fe820
> >>     0f3a981cbd5be5f97e9504ab770cd88f988fe820 fdd019edfe6bc60d0100d5751c41e4f6ad28a2ef
> >>
> >> Where the rows[0][1] value is not the same as rows[1][0].
> >
> > I'm confused.
> >
> > rows[0][1] == "0f3a981cbd5be5f97e9504ab770cd88f988fe820"
> > rows[1][0] == "0f3a981cbd5be5f97e9504ab770cd88f988fe820"
> >
> > they are the same. I don't understand your argument.
>
> Sorry, I mean same = ff update, not the same = non-ff. So I flipped
> those around in describing it.
>
> But in any case, the point is that you can reliably tell from a log of
> updates whether individual updates are ff or non-ff, as long as you:

?


$ git checkout -b non-ff origin/master
Previous HEAD position was 59f4325222 Add "test-tool dump-reftable" command.
Branch 'non-ff' set up to track remote branch 'master' from 'origin'.
Switched to a new branch 'non-ff'

$ git reset --hard HEAD^
HEAD is now at 56a57652ef Sync with Git 2.30.2 for CVE-2021-21300

$ cat .git/logs/refs/heads/non-ff
0000000000000000000000000000000000000000 13d7ab6b5d7929825b626f050b62a11241ea4945 Han-Wen Nienhuys <hanwen@google.com> 1616432938 +0100 branch: Created from origin/master
13d7ab6b5d7929825b626f050b62a11241ea4945 56a57652ef8e4ca2f108a8719b8caeed5e153c95 Han-Wen Nienhuys <hanwen@google.com> 1616432948 +0100 reset: moving to HEAD^

How can I tell from this reflog that "moving to HEAD^" is a non-FF update?
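
For reference, a small sketch that splits such an entry into its fields. The regular expression assumes the on-disk layout (two 40-hex IDs, identity, timestamp, timezone, then a tab and the message); the function name is made up for illustration.

```python
import re

REFLOG_RE = re.compile(
    r"^(?P<old>[0-9a-f]{40}) (?P<new>[0-9a-f]{40}) "
    r"(?P<name>.*) <(?P<email>[^>]*)> (?P<time>\d+) (?P<tz>[+-]\d{4})"
    r"(?:\t(?P<msg>.*))?$"
)

def parse_reflog_line(line):
    """Return the fields of one reflog line as a dict, or None if malformed."""
    m = REFLOG_RE.match(line.rstrip("\n"))
    return m.groupdict() if m else None

line = (
    "13d7ab6b5d7929825b626f050b62a11241ea4945 "
    "56a57652ef8e4ca2f108a8719b8caeed5e153c95 "
    "Han-Wen Nienhuys <hanwen@google.com> 1616432948 +0100"
    "\treset: moving to HEAD^"
)
entry = parse_reflog_line(line)
print(sorted(entry))
```

None of the parsed fields says whether the new ID descends from the old one; the free-form message is the only hint.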

-- 
Han-Wen Nienhuys - Google Munich
I work 80%. Don't expect answers from me on Fridays.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Distinguishing FF vs non-FF updates in the reflog?
  2021-03-22 12:31       ` Han-Wen Nienhuys
@ 2021-03-22 17:45         ` Martin Fick
  0 siblings, 0 replies; 19+ messages in thread
From: Martin Fick @ 2021-03-22 17:45 UTC (permalink / raw)
  To: Han-Wen Nienhuys; +Cc: git

On Monday, March 22, 2021 1:31:25 PM MDT Han-Wen Nienhuys wrote:
> On Thu, Mar 18, 2021 at 11:24 PM Martin Fick <mfick@codeaurora.org> wrote:
> > On Thursday, March 18, 2021 9:58:56 AM MDT Han-Wen Nienhuys wrote:
> > > On Wed, Mar 17, 2021 at 10:22 PM Martin Fick <mfick@codeaurora.org> wrote:
> > > > On Wednesday, March 17, 2021 9:06:06 PM MDT Han-Wen Nienhuys wrote:
> > > > > I'm working on some extensions to Gerrit for which it would be very
> > > > > beneficial if we could tell from the reflog if an update is a
> > > > > fast-forward or not: if we find a SHA1 in the reflog, and see there
> > > > > were only FF updates since, we can be sure that the SHA1 is reachable
> > > > > from the branch, without having to open packfiles and decode commits.
> > > > 
> > > > I don't think this would be reliable.
> > > > 
> > > > 1) Not all updates make it to the reflogs
> > > > 2) Reflogs can be edited or mucked with
> > > > 3) On NFS reflogs can outright be wrong even when used properly as there
> > > > are caching issues. We specifically have seen entries that appear to be
> > > > FFs that were not.
> > > 
> > > Can you tell a little more about 3) ? Since we don't annotate non-FF
> > > vs FF today, what does "appear to be FFs" mean?
> > 
> > To be honest I don't recall for sure, but I will describe what I think has
> > happened. I think that we have seen a server(A) update a branch from
> > C1 to C2A, and then later another server(B) update the same branch from C1
> > to C2B. Obviously the move from C2A to C2B is not a FF, but that move is
> > not what is recorded. Each of those updates was a FF when viewed as
> > separate entries,
> I think those would fail with the way that Gerrit uses JGit, because
> C1 -> C2B would fail with LOCK_ERROR.

If jgit knew that the branch no longer pointed to C1, then yes it should fail 
with LOCK_ERROR. However this situation is believed to arise due to jgit's 
caching which could make it think that the branch still points to C1 even 
though it has already been advanced to C2A! :(
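
A toy model of that suspected sequence, in Python. This is not JGit code; the `Server` class and its per-server `cache` are assumptions made up purely to illustrate how a stale view could yield two log entries that each look fine on their own.

```python
class Server:
    def __init__(self, name, shared_refs, shared_log):
        self.name = name
        self.refs = shared_refs          # shared storage, e.g. on NFS
        self.log = shared_log            # shared reflog
        self.cache = dict(shared_refs)   # private snapshot, may go stale

    def update(self, ref, new_oid):
        old_oid = self.cache[ref]        # modelled bug: trusts the cache
        self.refs[ref] = new_oid
        self.cache[ref] = new_oid
        self.log.append((old_oid, new_oid, self.name))

refs = {"refs/heads/main": "C1"}
log = []
a = Server("A", refs, log)
b = Server("B", refs, log)   # B's snapshot is taken while the ref is at C1
a.update("refs/heads/main", "C2A")
b.update("refs/heads/main", "C2B")
print(log)  # [('C1', 'C2A', 'A'), ('C1', 'C2B', 'B')]
```

The move from C2A to C2B never appears as an entry of its own, which is why the log cannot be trusted to show it.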

-Martin

-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Distinguishing FF vs non-FF updates in the reflog?
  2021-03-22 16:40         ` Ævar Arnfjörð Bjarmason
  2021-03-22 17:12           ` Han-Wen Nienhuys
@ 2021-03-22 18:36           ` Junio C Hamano
  1 sibling, 0 replies; 19+ messages in thread
From: Junio C Hamano @ 2021-03-22 18:36 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Han-Wen Nienhuys, git, Jeff King, Martin Fick

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

>> I'm confused.
>>
>> rows[0][1] == "0f3a981cbd5be5f97e9504ab770cd88f988fe820"
>> rows[1][0] == "0f3a981cbd5be5f97e9504ab770cd88f988fe820"
>>
>> they are the same. I don't understand your argument.
>
> Sorry, I mean same = ff update, not the same = non-ff. So I flipped
> those around in describing it.

I am confused too.  Are you tracking something else, a gap in a run
of reflog entries?  If I go from commit A to B to C, the first log
entry would record the transition from A->B, and the second entry
would record the transition from B->C, and the lack of gap does not
say anything about the relationship between A and B, or B and C.  A
can be, and does not have to be, an ancestor of B, and B can be, and
does not have to be, an ancestor of C.  Hopping from A to B to C would
leave the same pair of reflog records and I do not think you can tell
the reachability among A and B and C from them.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Distinguishing FF vs non-FF updates in the reflog?
  2021-03-22 14:40   ` Han-Wen Nienhuys
@ 2021-03-26  7:43     ` Jeff King
  0 siblings, 0 replies; 19+ messages in thread
From: Jeff King @ 2021-03-26  7:43 UTC (permalink / raw)
  To: Han-Wen Nienhuys; +Cc: git

On Mon, Mar 22, 2021 at 03:40:46PM +0100, Han-Wen Nienhuys wrote:

> > I left some numbers in another part of the thread, but IMHO performance
> > isn't that compelling a reason to do this these days, if you are using
> > commit-graphs.
> >
> > Just walking the reflog might be _slightly_ faster, though not
> > necessarily (it depends on whether the depth of the object graph or the
> > depth of the reflog chain is deeper). It might matter more if you are
> > using a more exotic storage scheme, where switching from accessing
> > reflogs to objects implies extra round-trips to a server (e.g., custom
> > storage backends with JGit; I don't know the state of the art in what
> > Google is doing there).
> 
> JGit doesn't currently support commit-graph, so it's hard to predict
> what performance will be like, but isn't commit-graph keyed by
> SHA1? That makes it hard to do caching, especially when considering
> large repositories.

Yes, it's keyed by sha1. It's essentially replacing "inflate the commit
object and parse it" with "here are the parsed values as mmap-able
32-bit integer fields" (there's some other stuff with generation
numbers, too, but the main speedup is simply that accessing each commit
is orders of magnitude cheaper).

It caches well, because those properties of the commit are immutable.
But if you meant "when pulling data from the commit-graph file, is it
friendly to block cache", then no, it's not linear. You'd binary search
within it to find each commit, just as you would a pack .idx (and just
like a .idx, I'd expect a system that is pulling data from a network
source to want to grab the whole commit-graph file. They tend to be much
smaller than the main .idx for a given repo).
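
To illustrate the access pattern (not the actual file format), here is a minimal Python sketch of a sorted ID table with a parallel array of pre-parsed records. The short IDs, the record fields, and the function name are all made up for illustration.

```python
from bisect import bisect_left

# Toy table: IDs kept sorted, records stored at the same position.
oids = ["1111", "5555", "9999", "cccc"]
records = [
    {"parents": [], "generation": 1},
    {"parents": [0], "generation": 2},   # parents refer to table positions
    {"parents": [1], "generation": 3},
    {"parents": [0], "generation": 2},
]

def lookup(oid):
    """Binary-search the sorted ID table; return the record or None."""
    i = bisect_left(oids, oid)
    if i < len(oids) and oids[i] == oid:
        return records[i]
    return None

print(lookup("9999"))
print(lookup("abcd"))
```

Each lookup jumps around the table rather than reading it front to back, which is the "not linear" behaviour described above.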

> AFAIU, commit-graph would help speed up reachability checks, by being
> able to shortcut cases where the commit number proves that some commit
> is not ancestor of the other, but you still have to do a revwalk to
> conclusively prove reachability.

Right. You'll still walk a lot of the commits, but you'll do so much
faster (the generation numbers can also help prune some uninteresting
side paths, but again, I think the main value for this operation is just
getting the parent info much faster).
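
A small sketch of how generation numbers can cut off side paths during such a walk. It assumes each commit's generation is one more than the largest generation among its parents; the `commits` map and its single-letter IDs are made up for illustration.

```python
def can_reach(commits, start, target):
    """commits maps oid -> (generation, [parent oids])."""
    target_gen = commits[target][0]
    stack, seen = [start], set()
    while stack:
        oid = stack.pop()
        if oid == target:
            return True
        if oid in seen:
            continue
        seen.add(oid)
        gen, parents = commits[oid]
        if gen <= target_gen:
            # Every ancestor has a strictly smaller generation, so the
            # target cannot be found below this commit.
            continue
        stack.extend(parents)
    return False

# Hypothetical history: A <- B <- C and A <- X <- Y.
commits = {
    "A": (1, []),
    "B": (2, ["A"]),
    "C": (3, ["B"]),
    "X": (2, ["A"]),
    "Y": (3, ["X"]),
}
print(can_reach(commits, "C", "B"))  # True
print(can_reach(commits, "Y", "B"))  # False, and A is never visited
```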

-Peff

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2021-03-26  7:44 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-17 20:06 Distinguishing FF vs non-FF updates in the reflog? Han-Wen Nienhuys
2021-03-17 21:21 ` Martin Fick
2021-03-18  8:58   ` Han-Wen Nienhuys
2021-03-18 19:35     ` Jeff King
2021-03-18 22:24     ` Martin Fick
2021-03-22 12:31       ` Han-Wen Nienhuys
2021-03-22 17:45         ` Martin Fick
2021-03-18 22:31     ` Martin Fick
2021-03-18 22:54       ` Jeff King
2021-03-18 19:47 ` Jeff King
2021-03-22 14:40   ` Han-Wen Nienhuys
2021-03-26  7:43     ` Jeff King
2021-03-22 13:26 ` Ævar Arnfjörð Bjarmason
2021-03-22 14:59   ` Han-Wen Nienhuys
2021-03-22 15:39     ` Ævar Arnfjörð Bjarmason
2021-03-22 15:56       ` Han-Wen Nienhuys
2021-03-22 16:40         ` Ævar Arnfjörð Bjarmason
2021-03-22 17:12           ` Han-Wen Nienhuys
2021-03-22 18:36           ` Junio C Hamano
