Feature request: provide a persistent IDs on a commit

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* Feature request: provide a persistent IDs on a commit
@ 2022-07-18 17:18 Stephen Finucane
  2022-07-18 17:35 ` Konstantin Ryabitsev
  2022-07-18 18:50 ` Ævar Arnfjörð Bjarmason
  0 siblings, 2 replies; 29+ messages in thread
From: Stephen Finucane @ 2022-07-18 17:18 UTC (permalink / raw)
  To: git

...to track evolution of a patch through time.

tl;dr: How hard would it be to retrofit an 'ChangeID' concept à la the 'Change-
ID' trailer used by Gerrit into git core?

Firstly, apologies in advance if this is the wrong forum to post a feature
request. I help maintain the Patchwork project [1], which a web-based tool that
provides a mechanism to track the state of patches submitted to a mailing list
and make sure stuff doesn't slip through the crack. One of our long-term goals
has been to track the evolution of an individual patch through multiple
revisions. This is surprisingly hard goal because oftentimes there isn't a whole
lot to work with. One can try to guess whether things are the same by inspecting
the metadata of the commit (subject, author, commit message, and the diff
itself) but each of these metadata items are subject to arbitrary changes and
are therefore fallible.

One of the mechanisms I've seen used to address this is the 'Change-ID' trailer
used by Gerrit. For anyone that hasn't seen this, the Gerrit server provides a
git commit hook that you can install locally. When installed, this appends a
'Change-ID' trailer to each and every commit message. In this way, the evolution
of a patch (or a "change", in Gerrit parlance) can be tracked through time since
the Change ID provides an authoritative answer to the question "is this still
the same patch". Unfortunately, there are still some obvious downside to this
approach. Not only does this additional trailer clutter your commit messages but
it's also something the user must install themselves. While Gerrit can insist
that this is installed before pushing a change, this isn't an option for any of
the common forges nor is it something git-send-email supports.

I imagine most people working with mailing list based workflows have their own
client side tooling to support this while software forges like GitHub and GitLab
simply don't bother tracking version history between individual commits in a
pull/merge request. IMO though, it would be fantastic if third party tools
weren't necessary though. What I suspect we want is a persistent ID (or rather
UUID) that never changes regardless of how many times a patch is cherry-picked,
rebased, or otherwise modified, similar to the Author and AuthorDate fields.
Like Author and AuthorDate, it would be part of the core git commit metadata
rather than something in the commit message like Signed-Off-By or Change-ID.

Has such an idea ever been explored? Is it even possible? Would it be broadly
useful?

Cheers,
Stephen

[1] github.com/getpatchwork/patchwork/

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Feature request: provide a persistent IDs on a commit
  2022-07-18 17:18 Feature request: provide a persistent IDs on a commit Stephen Finucane
@ 2022-07-18 17:35 ` Konstantin Ryabitsev
  2022-07-18 19:04   ` Michal Suchánek
  2022-07-18 21:24   ` Glen Choo
  2022-07-18 18:50 ` Ævar Arnfjörð Bjarmason
  1 sibling, 2 replies; 29+ messages in thread
From: Konstantin Ryabitsev @ 2022-07-18 17:35 UTC (permalink / raw)
  To: Stephen Finucane; +Cc: git

On Mon, Jul 18, 2022 at 06:18:11PM +0100, Stephen Finucane wrote:
> ...to track evolution of a patch through time.
> 
> tl;dr: How hard would it be to retrofit an 'ChangeID' concept à la the 'Change-
> ID' trailer used by Gerrit into git core?

I just started working on this for b4, with the notable difference that the
change-id trailer is used in the cover letter instead of in individual
commits, which moves the concept of "change" from a single commit to a series
of commits. IMO, it's much more useful in that scope, because as series are
reviewed and iterated, individual patches can get squashed, split up or
otherwise transformed.

You can see my test commits here:
https://lore.kernel.org/linux-patches/20220707-my-new-branch-v1-0-8d355bae1bb5@linuxfoundation.org/

You will notice that each cover letter has the following in the basement:

    ---
    base-commit: 88084a3df1672e131ddc1b4e39eeacfd39864acf
    change-id: 20220707-my-new-branch-[uniquerandomstr]

There are 3 revisions of the series and you can locate all of them by
searching for that trailer:
https://lore.kernel.org/linux-patches/?q=%22change-id%3A+20220707-my-new-branch-1325e0e7fd1c%22

Note, that "b4 submit" is in the early experimental stage and will likely
undergo significant changes in the next few weeks, so I wouldn't treat it as
any more than curiosity at this point.

-K

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Feature request: provide a persistent IDs on a commit
  2022-07-18 17:18 Feature request: provide a persistent IDs on a commit Stephen Finucane
  2022-07-18 17:35 ` Konstantin Ryabitsev
@ 2022-07-18 18:50 ` Ævar Arnfjörð Bjarmason
  2022-07-19 10:47   ` Stephen Finucane
  2022-07-21 16:18   ` Phillip Susi
  1 sibling, 2 replies; 29+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-07-18 18:50 UTC (permalink / raw)
  To: Stephen Finucane; +Cc: git


On Mon, Jul 18 2022, Stephen Finucane wrote:

> ...to track evolution of a patch through time.
>
> tl;dr: How hard would it be to retrofit an 'ChangeID' concept à la the 'Change-
> ID' trailer used by Gerrit into git core?
>
> Firstly, apologies in advance if this is the wrong forum to post a feature
> request. I help maintain the Patchwork project [1], which a web-based tool that
> provides a mechanism to track the state of patches submitted to a mailing list
> and make sure stuff doesn't slip through the crack. One of our long-term goals
> has been to track the evolution of an individual patch through multiple
> revisions. This is surprisingly hard goal because oftentimes there isn't a whole
> lot to work with. One can try to guess whether things are the same by inspecting
> the metadata of the commit (subject, author, commit message, and the diff
> itself) but each of these metadata items are subject to arbitrary changes and
> are therefore fallible.
>
> One of the mechanisms I've seen used to address this is the 'Change-ID' trailer
> used by Gerrit. For anyone that hasn't seen this, the Gerrit server provides a
> git commit hook that you can install locally. When installed, this appends a
> 'Change-ID' trailer to each and every commit message. In this way, the evolution
> of a patch (or a "change", in Gerrit parlance) can be tracked through time since
> the Change ID provides an authoritative answer to the question "is this still
> the same patch". Unfortunately, there are still some obvious downside to this
> approach. Not only does this additional trailer clutter your commit messages but
> it's also something the user must install themselves. While Gerrit can insist
> that this is installed before pushing a change, this isn't an option for any of
> the common forges nor is it something git-send-email supports.

git format-patch+send-email will send your trailers along as-is, how
doesn't it support Change-Id. Does it need some support that any other
made-up trailer doesn't?

> I imagine most people working with mailing list based workflows have their own
> client side tooling to support this while software forges like GitHub and GitLab
> simply don't bother tracking version history between individual commits in a
> pull/merge request.

It's far from ideal, but at least GitLab shows a diff on a push to a MR,
including if it's force-pushed. I'm not sure about GitHub.

> IMO though, it would be fantastic if third party tools
> weren't necessary though. What I suspect we want is a persistent ID (or rather
> UUID) that never changes regardless of how many times a patch is cherry-picked,
> rebased, or otherwise modified, similar to the Author and AuthorDate fields.
> Like Author and AuthorDate, it would be part of the core git commit metadata
> rather than something in the commit message like Signed-Off-By or Change-ID.
>
> Has such an idea ever been explored? Is it even possible? Would it be broadly
> useful?

This has come up a bunch of times. I think that the thing git itself
should be doing is to lean into the same notion that we use for tracking
renames. I.e. we don't, we analyze history after-the-fact and spot the
renames for you.

We have some of that in git already, as git-patch-id, and more recently
git-range-diff. Both are flawed in a bunch of ways, and it's easy to run
into edge cases where they don't spot something that they "should"
have. Where "should" exists in the mind of the user.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Feature request: provide a persistent IDs on a commit
  2022-07-18 17:35 ` Konstantin Ryabitsev
@ 2022-07-18 19:04   ` Michal Suchánek
  2022-07-19 10:57     ` Stephen Finucane
  2022-07-18 21:24   ` Glen Choo
  1 sibling, 1 reply; 29+ messages in thread
From: Michal Suchánek @ 2022-07-18 19:04 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: Stephen Finucane, git

On Mon, Jul 18, 2022 at 01:35:11PM -0400, Konstantin Ryabitsev wrote:
> On Mon, Jul 18, 2022 at 06:18:11PM +0100, Stephen Finucane wrote:
> > ...to track evolution of a patch through time.
> > 
> > tl;dr: How hard would it be to retrofit an 'ChangeID' concept à la the 'Change-
> > ID' trailer used by Gerrit into git core?
> 
> I just started working on this for b4, with the notable difference that the
> change-id trailer is used in the cover letter instead of in individual
> commits, which moves the concept of "change" from a single commit to a series
> of commits. IMO, it's much more useful in that scope, because as series are
> reviewed and iterated, individual patches can get squashed, split up or
> otherwise transformed.

You can turn that around and say that IDs of individual commits are more
powerful because they are preserved as series are reviewed, split,
merged, and commits cherry-picked.

Thanks

Michal

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Feature request: provide a persistent IDs on a commit
  2022-07-18 17:35 ` Konstantin Ryabitsev
  2022-07-18 19:04   ` Michal Suchánek
@ 2022-07-18 21:24   ` Glen Choo
  2022-07-20 19:21     ` Konstantin Ryabitsev
  2022-07-24  5:09     ` Elijah Newren
  1 sibling, 2 replies; 29+ messages in thread
From: Glen Choo @ 2022-07-18 21:24 UTC (permalink / raw)
  To: Konstantin Ryabitsev, Stephen Finucane; +Cc: git

Konstantin Ryabitsev <konstantin@linuxfoundation.org> writes:

> On Mon, Jul 18, 2022 at 06:18:11PM +0100, Stephen Finucane wrote:
>> ...to track evolution of a patch through time.
>> 
>> tl;dr: How hard would it be to retrofit an 'ChangeID' concept à la the 'Change-
>> ID' trailer used by Gerrit into git core?
>
> I just started working on this for b4, with the notable difference that the
> change-id trailer is used in the cover letter instead of in individual
> commits, which moves the concept of "change" from a single commit to a series
> of commits. IMO, it's much more useful in that scope, because as series are
> reviewed and iterated, individual patches can get squashed, split up or
> otherwise transformed.

My 2 cents, since I used to use Gerrit a lot :)

I find persistent per-commit ids really useful, even when patches get
moved around. E.g. Gerrit can show and diff previous versions of the
patch, which makes it really easy to tell how the patch has evolved
over time.

That's not to say that we don't need per-topic ids though ;) E.g. Gerrit
is pretty bad at handling whole topics - it does naive mapping on a
per-commit level, so it has no concept of "these (n - 1) patches should
replace these n patches".

I, for one, would love to see some kind of "rewrite tracking" in Git.
One use case that comes up often is downstream patches, where patches
are continuously rebased onto a new upstream; in those cases, it's
pretty hard to keep track of how the patch has changed over time

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Feature request: provide a persistent IDs on a commit
  2022-07-18 18:50 ` Ævar Arnfjörð Bjarmason
@ 2022-07-19 10:47   ` Stephen Finucane
  2022-07-19 11:09     ` Ævar Arnfjörð Bjarmason
  2022-07-21 16:18   ` Phillip Susi
  1 sibling, 1 reply; 29+ messages in thread
From: Stephen Finucane @ 2022-07-19 10:47 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: git

On Mon, 2022-07-18 at 20:50 +0200, Ævar Arnfjörð Bjarmason wrote:
> On Mon, Jul 18 2022, Stephen Finucane wrote:
> 
> > ...to track evolution of a patch through time.
> > 
> > tl;dr: How hard would it be to retrofit an 'ChangeID' concept à la the 'Change-
> > ID' trailer used by Gerrit into git core?
> > 
> > Firstly, apologies in advance if this is the wrong forum to post a feature
> > request. I help maintain the Patchwork project [1], which a web-based tool that
> > provides a mechanism to track the state of patches submitted to a mailing list
> > and make sure stuff doesn't slip through the crack. One of our long-term goals
> > has been to track the evolution of an individual patch through multiple
> > revisions. This is surprisingly hard goal because oftentimes there isn't a whole
> > lot to work with. One can try to guess whether things are the same by inspecting
> > the metadata of the commit (subject, author, commit message, and the diff
> > itself) but each of these metadata items are subject to arbitrary changes and
> > are therefore fallible.
> > 
> > One of the mechanisms I've seen used to address this is the 'Change-ID' trailer
> > used by Gerrit. For anyone that hasn't seen this, the Gerrit server provides a
> > git commit hook that you can install locally. When installed, this appends a
> > 'Change-ID' trailer to each and every commit message. In this way, the evolution
> > of a patch (or a "change", in Gerrit parlance) can be tracked through time since
> > the Change ID provides an authoritative answer to the question "is this still
> > the same patch". Unfortunately, there are still some obvious downside to this
> > approach. Not only does this additional trailer clutter your commit messages but
> > it's also something the user must install themselves. While Gerrit can insist
> > that this is installed before pushing a change, this isn't an option for any of
> > the common forges nor is it something git-send-email supports.
> 
> git format-patch+send-email will send your trailers along as-is, how
> doesn't it support Change-Id. Does it need some support that any other
> made-up trailer doesn't?

It supports sending the trailers, sure. What it doesn't support is insisting you
send this specific trailer (Change-Id). Only Gerrit can do this (server side,
thankfully, which means you don't need to ask all contributors to install this
hook if you want to rely on it for tooling, CI, etc.).

> > I imagine most people working with mailing list based workflows have their own
> > client side tooling to support this while software forges like GitHub and GitLab
> > simply don't bother tracking version history between individual commits in a
> > pull/merge request.
> 
> It's far from ideal, but at least GitLab shows a diff on a push to a MR,
> including if it's force-pushed. I'm not sure about GitHub.

GitHub does not. Simply piling multiple additional "fix" commits onto the PR
branch results in a less horrible review experience since you can maintain
context, alas at the cost of a rotten git log. We don't need to debate the pros
and cons of the various forges though :)

> 
> > IMO though, it would be fantastic if third party tools
> > weren't necessary though. What I suspect we want is a persistent ID (or rather
> > UUID) that never changes regardless of how many times a patch is cherry-picked,
> > rebased, or otherwise modified, similar to the Author and AuthorDate fields.
> > Like Author and AuthorDate, it would be part of the core git commit metadata
> > rather than something in the commit message like Signed-Off-By or Change-ID.
> > 
> > Has such an idea ever been explored? Is it even possible? Would it be broadly
> > useful?
> 
> This has come up a bunch of times. I think that the thing git itself
> should be doing is to lean into the same notion that we use for tracking
> renames. I.e. we don't, we analyze history after-the-fact and spot the
> renames for you.

Any idea where I'd find previous discussions on this? I did look, and the only
proposal I found was an old one that seemed to suggest including the Change-Id
commit-msg hook with git itself which is not what I'm suggesting here.

> We have some of that in git already, as git-patch-id, and more recently
> git-range-diff. Both are flawed in a bunch of ways, and it's easy to run
> into edge cases where they don't spot something that they "should"
> have. Where "should" exists in the mind of the user.

That's a fair point and is of course what we (Patchwork) have to do currently.
Patchwork can track relations between individual patches but doesn't attempt to
generate these relations itself. Instead, we rely on third-party tooling. The
PaStA tool was one such example of a tool that could do this [1]. I can't
imagine a tool like Gerrit would ever work without this concept of an
authoritative (and arbitrary) identifier to track a patch's identity through
time, hence its reliance on the Change-Id trailer.

Perhaps we could flip this on its head. What would be the _downsides_ of
providing a persistent, arbitrary identifier on a commit similar to Author and
AuthorDate fields? There's obviously some work involved in implementing it but
assuming that was already done, what would break/be worse as a result?

Stephen

[1] https://rsarky.github.io/2020/08/10/pasta-patchwork.html

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Feature request: provide a persistent IDs on a commit
  2022-07-18 19:04   ` Michal Suchánek
@ 2022-07-19 10:57     ` Stephen Finucane
  0 siblings, 0 replies; 29+ messages in thread
From: Stephen Finucane @ 2022-07-19 10:57 UTC (permalink / raw)
  To: Michal Suchánek, Konstantin Ryabitsev; +Cc: git

On Mon, 2022-07-18 at 21:04 +0200, Michal Suchánek wrote:
> On Mon, Jul 18, 2022 at 01:35:11PM -0400, Konstantin Ryabitsev wrote:
> > On Mon, Jul 18, 2022 at 06:18:11PM +0100, Stephen Finucane wrote:
> > > ...to track evolution of a patch through time.
> > > 
> > > tl;dr: How hard would it be to retrofit an 'ChangeID' concept à la the 'Change-
> > > ID' trailer used by Gerrit into git core?
> > 
> > I just started working on this for b4, with the notable difference that the
> > change-id trailer is used in the cover letter instead of in individual
> > commits, which moves the concept of "change" from a single commit to a series
> > of commits. IMO, it's much more useful in that scope, because as series are
> > reviewed and iterated, individual patches can get squashed, split up or
> > otherwise transformed.
> 
> You can turn that around and say that IDs of individual commits are more
> powerful because they are preserved as series are reviewed, split,
> merged, and commits cherry-picked.

There's also the fact that many communities insist on small, atomic commits:
they're much easier to review. It stands to reason that reviewing a series on a
patch-by-patch basis is also much easier, as is reviewing a series _revision_ on
a patch-by-patch basis. To be able to do this though, you need to be able to map
patch revisions to their predecessors/successors and well as the series
revisions. I don't see how you realistically rely on a series-only identifier.

There's no reason 'git-format-patch' couldn't allow you to set an
AuthorID/ChangeID/<whatever we want to call this field> value for a cover
letter, though it obviously would need to be done manually since cover letters
aren't git objects.

  git send-email \
    --reroll-count 2 \
    --series-id 300628e5-8b27-45fe-be71-95417f7ccd6f
    main

Stephen

> 
> Thanks
> 
> Michal

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Feature request: provide a persistent IDs on a commit
  2022-07-19 10:47   ` Stephen Finucane
@ 2022-07-19 11:09     ` Ævar Arnfjörð Bjarmason
  2022-07-19 11:57       ` Michal Suchánek
  2022-07-29 12:11       ` Stephen Finucane
  0 siblings, 2 replies; 29+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-07-19 11:09 UTC (permalink / raw)
  To: Stephen Finucane; +Cc: git

On Tue, Jul 19 2022, Stephen Finucane wrote:

> On Mon, 2022-07-18 at 20:50 +0200, Ævar Arnfjörð Bjarmason wrote:
>> On Mon, Jul 18 2022, Stephen Finucane wrote:
>> 
>> > ...to track evolution of a patch through time.
>> > 
>> > tl;dr: How hard would it be to retrofit an 'ChangeID' concept à la the 'Change-
>> > ID' trailer used by Gerrit into git core?
>> > 
>> > Firstly, apologies in advance if this is the wrong forum to post a feature
>> > request. I help maintain the Patchwork project [1], which a web-based tool that
>> > provides a mechanism to track the state of patches submitted to a mailing list
>> > and make sure stuff doesn't slip through the crack. One of our long-term goals
>> > has been to track the evolution of an individual patch through multiple
>> > revisions. This is surprisingly hard goal because oftentimes there isn't a whole
>> > lot to work with. One can try to guess whether things are the same by inspecting
>> > the metadata of the commit (subject, author, commit message, and the diff
>> > itself) but each of these metadata items are subject to arbitrary changes and
>> > are therefore fallible.
>> > 
>> > One of the mechanisms I've seen used to address this is the 'Change-ID' trailer
>> > used by Gerrit. For anyone that hasn't seen this, the Gerrit server provides a
>> > git commit hook that you can install locally. When installed, this appends a
>> > 'Change-ID' trailer to each and every commit message. In this way, the evolution
>> > of a patch (or a "change", in Gerrit parlance) can be tracked through time since
>> > the Change ID provides an authoritative answer to the question "is this still
>> > the same patch". Unfortunately, there are still some obvious downside to this
>> > approach. Not only does this additional trailer clutter your commit messages but
>> > it's also something the user must install themselves. While Gerrit can insist
>> > that this is installed before pushing a change, this isn't an option for any of
>> > the common forges nor is it something git-send-email supports.
>> 
>> git format-patch+send-email will send your trailers along as-is, how
>> doesn't it support Change-Id. Does it need some support that any other
>> made-up trailer doesn't?
>
> It supports sending the trailers, sure. What it doesn't support is insisting you
> send this specific trailer (Change-Id). Only Gerrit can do this (server side,
> thankfully, which means you don't need to ask all contributors to install this
> hook if you want to rely on it for tooling, CI, etc.).

Ah, it's still unclear to me what you're proposing here though. That
send-email always (generates?) or otherwise insists on the trailer, that
it can be configured ot add it?

That send-email have some "pre-send-email" hook? Something else?

I'd think for projects that care about this they're likely to have a
centralized enough workflow that it can be checked on the remote side,
whether that's some sanity check on the applier's "git am" pipeline, or
a "pre-receive" hook.

>> > I imagine most people working with mailing list based workflows have their own
>> > client side tooling to support this while software forges like GitHub and GitLab
>> > simply don't bother tracking version history between individual commits in a
>> > pull/merge request.
>> 
>> It's far from ideal, but at least GitLab shows a diff on a push to a MR,
>> including if it's force-pushed. I'm not sure about GitHub.
>
> GitHub does not. Simply piling multiple additional "fix" commits onto the PR
> branch results in a less horrible review experience since you can maintain
> context, alas at the cost of a rotten git log. We don't need to debate the pros
> and cons of the various forges though :)

Yes, I'm only mentioning it because it's worth looking at existing
"solutions" that are in use in the wild, however flawed those may be.

>> > IMO though, it would be fantastic if third party tools
>> > weren't necessary though. What I suspect we want is a persistent ID (or rather
>> > UUID) that never changes regardless of how many times a patch is cherry-picked,
>> > rebased, or otherwise modified, similar to the Author and AuthorDate fields.
>> > Like Author and AuthorDate, it would be part of the core git commit metadata
>> > rather than something in the commit message like Signed-Off-By or Change-ID.
>> > 
>> > Has such an idea ever been explored? Is it even possible? Would it be broadly
>> > useful?
>> 
>> This has come up a bunch of times. I think that the thing git itself
>> should be doing is to lean into the same notion that we use for tracking
>> renames. I.e. we don't, we analyze history after-the-fact and spot the
>> renames for you.
>
> Any idea where I'd find previous discussions on this? I did look, and the only
> proposal I found was an old one that seemed to suggest including the Change-Id
> commit-msg hook with git itself which is not what I'm suggesting here.

At the time I was punting on finding the links, and just working off
vague recollection, and hoping you'd go list spelunking.

But I since recalled some details, I think the most relevant thing is
this discussion about a "git evolve":

    https://lore.kernel.org/git/CAPL8ZivFmHqS2y+WmNR6faRMnuahiqwPVYsV99NiJ1QLHOs9fQ@mail.gmail.com/

Which I think you'll find useful, especially as mercurial has an
existing implementation. The wider context for that "git evolve" is (I
believe) people at Google who maintain Gerrit trying to "upstream" the
Change-Id.

Now, it hasn't landed in git.git, and it's been a few years, but going
through the details of why it fizzled out will be useful to you, if
you're interested in driving something like this forward.

There's also these two proposals from Eric Raymond:

	https://lore.kernel.org/git/20190515191605.21D394703049@snark.thyrsus.com/
	https://lore.kernel.org/git/20190521013250.3506B470485F@snark.thyrsus.com/

Which I'm linking to here not because I think they're viable, as you can
see from my participation in those threads I think what he suggested is
an architectural dead end as far as git is concerned.

But rather because it's conceptually adjacent (you could in principle
use nanosecond timestamps as a poor man's UUID), and much of the
follow-up discussion is about format changes in general, and if/when
those might be viable.

>> We have some of that in git already, as git-patch-id, and more recently
>> git-range-diff. Both are flawed in a bunch of ways, and it's easy to run
>> into edge cases where they don't spot something that they "should"
>> have. Where "should" exists in the mind of the user.
>
> That's a fair point and is of course what we (Patchwork) have to do currently.
> Patchwork can track relations between individual patches but doesn't attempt to
> generate these relations itself. Instead, we rely on third-party tooling. The
> PaStA tool was one such example of a tool that could do this [1]. I can't
> imagine a tool like Gerrit would ever work without this concept of an
> authoritative (and arbitrary) identifier to track a patch's identity through
> time, hence its reliance on the Change-Id trailer.

I haven't used Gerrit or Patchwork, so much of this is from ignorance on
that front, but I have spent a lot of time thinking about this in the
context of git in general.

I think as users of git go the git project itself makes very heavy use
of this, i.e. sequences of patches are substantially rewritten, split,
squashed etc. all the time, or even split into two or more sets of
submissions.

Having said all that I can't see how a Change-Id isn't a Bad Idea(TM)
for all the same reasons that pre-git SCMs file formats that track
renames explicitly were a bad idea.

I.e. yes you can come up with cases where that's "better" than what git
does, but they didn't handle splitting/merging files etc.

Similarly what happens when you have 3 patches each with their own
Change-Id and you split them into 4 patches. Is the Change-Id 1=1 or
1=many. I'm suggesting that you'd want a solution that can be many=many.

And also, that those many=many should be dynamically configurable and
inferred after the fact. E.g. range-diff will commits that are similar
enough that two authors with no knowledge of each other independently
came up with.

I think that range-diff is still lacking in a lot of ways, in particular:

 * It matches entire commits (log + diff) on a similarity score, I've
   often wanted a way to "weigh" it, so e.g. a matching hunk would have
   3x the matching score of a matching commit message.

   Now it often "gives up", you can give it a higher --creation-factor,
   but that's "global", so for a large range you'll often start
   including irrelevant things as well.

 * It only does 1=1 attribution, and e.g. currently can't find/represent
   a case where a commit with 3 hunks got split into two commits, with 2
   and 1 hunks, respectively. It'll (usually) show a diff to the new 2
   hunk commit, but the "new" 1 hunk will be shown as new.

   We could continue to drill down and find such "unattributed" hunks.

> Perhaps we could flip this on its head. What would be the _downsides_ of
> providing a persistent, arbitrary identifier on a commit similar to Author and
> AuthorDate fields? There's obviously some work involved in implementing it but
> assuming that was already done, what would break/be worse as a result?

That "Repository formats matter", to borrow a phrase from a classic post
about git[1]. Once you provide a way to do something it will be used,
and when that something has inherent limitations (think SCM rename
tracking) used to the exclusion of others.

You can't provide something like that as an opt-in and "upstream" it
without it inevetably trickling into a lot of areas of Git's UX.

To continue the rename example, now you can just re-arrange your source
tree and not worry about micro-managing it with "git mv" (in the "svn
mv" sense), git will figure it out after the fact.

That's a sinificant UX benefit, we can provide a *much simpler* UX as a
result.

What would be the harm of an optional "rename tracking" header? After
all the heuristic sometimes "fails".

The harm would be that if you really wanted to lean into that (even
optionally) you'd be forced to add that to all sorts of tooling, not
just the cheap convenience that is "git mv" currently.

Likewise everything from "cherry-pick" to "rebase" to "commit" would
inevitably have to learn some way to know about, carry forward and ask
the user about Change-Id's and their preservation. Don't you think so?

Otherwise they'd be much too easy to lose track of, and if they only
reason we did all that is because we didn't think enough about the "work
it out after" approach that would be a bad investment of time.

But I may be wrong about all of that, I think one thing that would
really help clarify this & similar proposals is if people pushing it
forward came up with some basic tests for it, i.e. just something like
a:

    series-v1/
    series-v2/

Where those two directories would be the "git format-patch" output (or
whatever) of two versions of a series that Gerrit or Patchwork are now
managing, along with some (plain text?) manual mapping of which things
in v1 correspond to v2.

We could then compare how that manual attribution performs v.s. trying
to find which things match (range-diff) afterwards.

1. https://keithp.com/blog/Repository_Formats_Matter/

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Feature request: provide a persistent IDs on a commit
  2022-07-19 11:09     ` Ævar Arnfjörð Bjarmason
@ 2022-07-19 11:57       ` Michal Suchánek
  2022-07-29 12:11       ` Stephen Finucane
  1 sibling, 0 replies; 29+ messages in thread
From: Michal Suchánek @ 2022-07-19 11:57 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Stephen Finucane, git

On Tue, Jul 19, 2022 at 01:09:02PM +0200, Ævar Arnfjörð Bjarmason wrote:
> 
> On Tue, Jul 19 2022, Stephen Finucane wrote:
> 
> > On Mon, 2022-07-18 at 20:50 +0200, Ævar Arnfjörð Bjarmason wrote:
> >> On Mon, Jul 18 2022, Stephen Finucane wrote:
> >> 
> >> > ...to track evolution of a patch through time.
> >> > 
> >> > tl;dr: How hard would it be to retrofit an 'ChangeID' concept à la the 'Change-
> >> > ID' trailer used by Gerrit into git core?
> >> > 
> >> > Firstly, apologies in advance if this is the wrong forum to post a feature
> >> > request. I help maintain the Patchwork project [1], which a web-based tool that
> >> > provides a mechanism to track the state of patches submitted to a mailing list
> >> > and make sure stuff doesn't slip through the crack. One of our long-term goals
> >> > has been to track the evolution of an individual patch through multiple
> >> > revisions. This is surprisingly hard goal because oftentimes there isn't a whole
> >> > lot to work with. One can try to guess whether things are the same by inspecting
> >> > the metadata of the commit (subject, author, commit message, and the diff
> >> > itself) but each of these metadata items are subject to arbitrary changes and
> >> > are therefore fallible.
> >> > 
> >> > One of the mechanisms I've seen used to address this is the 'Change-ID' trailer
> >> > used by Gerrit. For anyone that hasn't seen this, the Gerrit server provides a
> >> > git commit hook that you can install locally. When installed, this appends a
> >> > 'Change-ID' trailer to each and every commit message. In this way, the evolution
> >> > of a patch (or a "change", in Gerrit parlance) can be tracked through time since
> >> > the Change ID provides an authoritative answer to the question "is this still
> >> > the same patch". Unfortunately, there are still some obvious downside to this
> >> > approach. Not only does this additional trailer clutter your commit messages but
> >> > it's also something the user must install themselves. While Gerrit can insist
> >> > that this is installed before pushing a change, this isn't an option for any of
> >> > the common forges nor is it something git-send-email supports.
> >> 
> >> git format-patch+send-email will send your trailers along as-is, how
> >> doesn't it support Change-Id. Does it need some support that any other
> >> made-up trailer doesn't?
> >
> > It supports sending the trailers, sure. What it doesn't support is insisting you
> > send this specific trailer (Change-Id). Only Gerrit can do this (server side,
> > thankfully, which means you don't need to ask all contributors to install this
> > hook if you want to rely on it for tooling, CI, etc.).
> 
> Ah, it's still unclear to me what you're proposing here though. That
> send-email always (generates?) or otherwise insists on the trailer, that
> it can be configured ot add it?

And isn't send-email time too late?

That would mean that you get new ID for every version of the patch sent.

Thanks

Michal

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Feature request: provide a persistent IDs on a commit
  2022-07-18 21:24   ` Glen Choo
@ 2022-07-20 19:21     ` Konstantin Ryabitsev
  2022-07-20 19:30       ` Michal Suchánek
  2022-07-20 22:10       ` Theodore Ts'o
  2022-07-24  5:09     ` Elijah Newren
  1 sibling, 2 replies; 29+ messages in thread
From: Konstantin Ryabitsev @ 2022-07-20 19:21 UTC (permalink / raw)
  To: Glen Choo; +Cc: Stephen Finucane, git

On Mon, Jul 18, 2022 at 02:24:07PM -0700, Glen Choo wrote:
> > I just started working on this for b4, with the notable difference that the
> > change-id trailer is used in the cover letter instead of in individual
> > commits, which moves the concept of "change" from a single commit to a series
> > of commits. IMO, it's much more useful in that scope, because as series are
> > reviewed and iterated, individual patches can get squashed, split up or
> > otherwise transformed.
> 
> My 2 cents, since I used to use Gerrit a lot :)
> 
> I find persistent per-commit ids really useful, even when patches get
> moved around. E.g. Gerrit can show and diff previous versions of the
> patch, which makes it really easy to tell how the patch has evolved
> over time.

The kernel community has repeatedly rejected per-patch Change-id trailers
because they carry no meaningful information outside of the gerrit system on
which they were created. Seeing a Change-Id trailer in a commit tells you
nothing about the history of that commit unless you know the gerrit system on
which this patch was reviewed (and have access to it, which is not a given).
This is not as opaque as it used to be now that Gerrit provided ability to
clone the underlying notedb, but this still fails on commits that were
contributed to an upstream that doesn't use Gerrit.

The current recommended strategy for the kernel is to put any historical
information (including any links to archival sites, etc) into the merge
commit and only keep chain-of-custody and code-review trailers in actual
code commits. For this reason, I opted to use change-ids in the cover letter
only.

-Konstantin

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Feature request: provide a persistent IDs on a commit
  2022-07-20 19:21     ` Konstantin Ryabitsev
@ 2022-07-20 19:30       ` Michal Suchánek
  2022-07-20 22:10       ` Theodore Ts'o
  1 sibling, 0 replies; 29+ messages in thread
From: Michal Suchánek @ 2022-07-20 19:30 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: Glen Choo, Stephen Finucane, git

On Wed, Jul 20, 2022 at 03:21:44PM -0400, Konstantin Ryabitsev wrote:
> On Mon, Jul 18, 2022 at 02:24:07PM -0700, Glen Choo wrote:
> > > I just started working on this for b4, with the notable difference that the
> > > change-id trailer is used in the cover letter instead of in individual
> > > commits, which moves the concept of "change" from a single commit to a series
> > > of commits. IMO, it's much more useful in that scope, because as series are
> > > reviewed and iterated, individual patches can get squashed, split up or
> > > otherwise transformed.
> > 
> > My 2 cents, since I used to use Gerrit a lot :)
> > 
> > I find persistent per-commit ids really useful, even when patches get
> > moved around. E.g. Gerrit can show and diff previous versions of the
> > patch, which makes it really easy to tell how the patch has evolved
> > over time.
> 
> The kernel community has repeatedly rejected per-patch Change-id trailers
> because they carry no meaningful information outside of the gerrit system on
> which they were created. Seeing a Change-Id trailer in a commit tells you
> nothing about the history of that commit unless you know the gerrit system on

Unless you happen to see another patch with the same ID, and for that to
happen the ID needs to be generated when the commit is created (not when
it's uploaded to gerrit or sent to a mailing list), and preserved by
default in all processing of the commit.

Then you can actually track the commit as it evolves in tools like
patchwork, in theory.

Thanks

Michal

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Feature request: provide a persistent IDs on a commit
  2022-07-20 19:21     ` Konstantin Ryabitsev
  2022-07-20 19:30       ` Michal Suchánek
@ 2022-07-20 22:10       ` Theodore Ts'o
  2022-07-21 11:57         ` Han-Wen Nienhuys
  1 sibling, 1 reply; 29+ messages in thread
From: Theodore Ts'o @ 2022-07-20 22:10 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: Glen Choo, Stephen Finucane, git

On Wed, Jul 20, 2022 at 03:21:44PM -0400, Konstantin Ryabitsev wrote:
> The kernel community has repeatedly rejected per-patch Change-id trailers
> because they carry no meaningful information outside of the gerrit system on
> which they were created. Seeing a Change-Id trailer in a commit tells you
> nothing about the history of that commit unless you know the gerrit system on
> which this patch was reviewed (and have access to it, which is not a given).

The "no meaningful information outside of the gerrit system" is the
key.  This was extensively discussed in the
ksummit-discuss@lists.linux-foundation.org mailing list in late August
2019, subject line "Allowing something Change-Id (or something like
it) in kernel commits".  Quoting from Linus Torvalds:

    From: Linus Torvalds
    Date: Thu, 22 Aug 2019 17:17:05 -0700
    Message-Id: CAHk-=whFbgy4RXG11c_=S7O-248oWmwB_aZOcWzWMVh3w7=RCw@mail.gmail.com

    No. That's not it at all. It's not "dislike gerrit".

    It's "dislike pointless garbage".

    If the gerrit database is public and searchable using the uuid, then
    that would make the uuid useful to outsiders. And instead of just
    putting a UUID (which is hard to look up unless you know where it came
    from), make it be that "Link:" that gives not just the UUID, but also
    gives you the metadata for that UUID to be looked up.

    But so far, in every single case the uuid's I've ever seen have been
    pointless garbage, that aren't useful in general to public open source
    developers, and as such shouldn't be in the git tree.

    See the difference?

    So if you guys make the gerrit database actually public, and then
    start adding "Link: ..." tags so that we can see what they point to, I
    think people will be more than supportive of it.

    But if it's some stupid and pointless UUID that is useful to nobody
    outside of google (or special magical groups of people associated with
    it), then I will personally continue to be very much against it.

So....  imagine if we had some kind of search service, maybe homed at
lore.kernel.org, where given a particular "Change Id" --- and it could
look either like a Gerrit-style Change-Id or something else like a URL
or URL-like (it matters not) the search service could give you a list
of:

  * All mailing list threads where the body contained the "Change-Id:
    XXX" id, so we could find the previous versions of the commit, and
    the reviews that took place on a mailing list.  (And this could be
    either a pointer to lore.kernel.org and/or a patchwork URL.)

  * All URL's to public gerrit servers where that patch may have been reviewed.

  * A list of git Commit ID's from a set of "interesting" git trees
    (e.g., the upstream Linux tree, the Long Term Stable trees, maybe
    some other interesting trees ala Android Common, etc.

If we had such a thing, as opposed to something that only worked in a
closed private garden like an internal Gerrit server sitting behind a
corporate firewall, even if the patch initially was developed in a
closed private Gerrit ecosystem --- if the moment it was published for
external upstream review, it would get captured by this search
service, then the Change ID would be useful.  And if that Change ID
could also be used to find out how the patch was ported to various
stable or productg trees, then it would be even more useful --- and
then people would probably find it to be useful, and resistance to
having a per-commit Change-ID would probably drop, or perhaps, even
enthusiastically embraced, because people could actually see the
*value* behind it.

To do this, we would need to have various tools, such as Patchwork,
Gerrit, Git, public-inbox, etc., treat Change-ID as a first-class
indexed object, so that you could quickly map from a Change-ID to a
git commit in a particular git tree (if present), or to set of
public-inbox URL's, or a set of patchwork URL's, etc.

And then we would need some kind of aggregation service which would
aggregate the information from all of the various sources
(public-inbox, git, Gerrit, Patchwork, etc.) and then gave users a
single "front door" where they could submit a Change-Id, and find all
the patch history, patch review comments, and later, patch backports
and forward ports.

The question is ---- is this doable?   And who will do the work?   :-)

    	     	     	     	       - Ted

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Feature request: provide a persistent IDs on a commit
  2022-07-20 22:10       ` Theodore Ts'o
@ 2022-07-21 11:57         ` Han-Wen Nienhuys
  0 siblings, 0 replies; 29+ messages in thread
From: Han-Wen Nienhuys @ 2022-07-21 11:57 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Konstantin Ryabitsev, Glen Choo, Stephen Finucane, git

On Thu, Jul 21, 2022 at 12:10 AM Theodore Ts'o <tytso@mit.edu> wrote:
> On Wed, Jul 20, 2022 at 03:21:44PM -0400, Konstantin Ryabitsev wrote:
> > The kernel community has repeatedly rejected per-patch Change-id trailers
> > because they carry no meaningful information outside of the gerrit system on
> > which they were created. Seeing a Change-Id trailer in a commit tells you
> > nothing about the history of that commit unless you know the gerrit system on
> > which this patch was reviewed (and have access to it, which is not a given).
>
> The "no meaningful information outside of the gerrit system" is the
> key.  This was extensively discussed in the
> ksummit-discuss@lists.linux-foundation.org mailing list in late August
> 2019, subject line "Allowing something Change-Id (or something like
> it) in kernel commits".  Quoting from Linus Torvalds:
>
>     From: Linus Torvalds
>     Date: Thu, 22 Aug 2019 17:17:05 -0700
>     Message-Id: CAHk-=whFbgy4RXG11c_=S7O-248oWmwB_aZOcWzWMVh3w7=RCw@mail.gmail.com
>
>     No. That's not it at all. It's not "dislike gerrit".
>
>     It's "dislike pointless garbage".
>
>     If the gerrit database is public and searchable using the uuid, then
>     that would make the uuid useful to outsiders. And instead of just
>     putting a UUID (which is hard to look up unless you know where it came
>     from), make it be that "Link:" that gives not just the UUID, but also
>     gives you the metadata for that UUID to be looked up.
>..
>     So if you guys make the gerrit database actually public, and then
>     start adding "Link: ..." tags so that we can see what they point to, I
>     think people will be more than supportive of it.

Support for the "Link:" footer as a change ID has been implemented in
Gerrit as of https://gerrit.googlesource.com/gerrit/+/8cab93302d9c35316d691e848b67e687a68182b5
(available in Gerrit 3.3 and onwards).  I'm not sure if it has seen
much use, though.

-- 
Han-Wen Nienhuys - Google Munich
I work 80%. Don't expect answers from me on Fridays.
--
Google Germany GmbH, Erika-Mann-Strasse 33, 80636 Munich
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg
Geschäftsführer: Paul Manicle, Liana Sebastian

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Feature request: provide a persistent IDs on a commit
  2022-07-18 18:50 ` Ævar Arnfjörð Bjarmason
  2022-07-19 10:47   ` Stephen Finucane
@ 2022-07-21 16:18   ` Phillip Susi
  2022-07-21 18:58     ` Hilco Wijbenga
  1 sibling, 1 reply; 29+ messages in thread
From: Phillip Susi @ 2022-07-21 16:18 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Stephen Finucane, git


Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> This has come up a bunch of times. I think that the thing git itself
> should be doing is to lean into the same notion that we use for tracking
> renames. I.e. we don't, we analyze history after-the-fact and spot the
> renames for you.

I've never been a big fan of that quality of git because it is
inherently unreliable.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Feature request: provide a persistent IDs on a commit
  2022-07-21 16:18   ` Phillip Susi
@ 2022-07-21 18:58     ` Hilco Wijbenga
  2022-07-22 20:08       ` Philip Oakley
  0 siblings, 1 reply; 29+ messages in thread
From: Hilco Wijbenga @ 2022-07-21 18:58 UTC (permalink / raw)
  To: Phillip Susi
  Cc: Ævar Arnfjörð Bjarmason, Stephen Finucane,
	Git Users

On Thu, Jul 21, 2022 at 9:39 AM Phillip Susi <phill@thesusis.net> wrote:
> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>
> > This has come up a bunch of times. I think that the thing git itself
> > should be doing is to lean into the same notion that we use for tracking
> > renames. I.e. we don't, we analyze history after-the-fact and spot the
> > renames for you.
>
> I've never been a big fan of that quality of git because it is
> inherently unreliable.

Indeed, which would be fine ... if there were a way to tell Git, "no
this is not a rename" or "hey, you missed this rename" but there
isn't.

Reading previous messages, it seems like the
after-the-fact-rename-heuristic makes the Git code simpler. That is a
perfectly valid argument for not supporting "explicit" renames but I
have seen several messages from which I inferred that rename handling
was deemed a "solved problem". And _that_, at least in my experience,
is definitely not the case.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Feature request: provide a persistent IDs on a commit
  2022-07-21 18:58     ` Hilco Wijbenga
@ 2022-07-22 20:08       ` Philip Oakley
  2022-07-22 20:36         ` Michal Suchánek
  0 siblings, 1 reply; 29+ messages in thread
From: Philip Oakley @ 2022-07-22 20:08 UTC (permalink / raw)
  To: Hilco Wijbenga, Phillip Susi
  Cc: Ævar Arnfjörð Bjarmason, Stephen Finucane,
	Git Users

On 21/07/2022 19:58, Hilco Wijbenga wrote:
> On Thu, Jul 21, 2022 at 9:39 AM Phillip Susi <phill@thesusis.net> wrote:
>> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>>
>>> This has come up a bunch of times. I think that the thing git itself
>>> should be doing is to lean into the same notion that we use for tracking
>>> renames. I.e. we don't, we analyze history after-the-fact and spot the
>>> renames for you.
>> I've never been a big fan of that quality of git because it is
>> inherently unreliable.
> Indeed, which would be fine ... if there were a way to tell Git, "no
> this is not a rename" or "hey, you missed this rename" but there
> isn't.
>
> Reading previous messages, it seems like the
> after-the-fact-rename-heuristic makes the Git code simpler. That is a
> perfectly valid argument for not supporting "explicit" renames but I
> have seen several messages from which I inferred that rename handling
> was deemed a "solved problem". And _that_, at least in my experience,
> is definitely not the case.

Part of the rename problem is that there can be many different routes to
the same result, and often the route used isn't the one 'specified' by
those who wish a complicated rename process to have happened 'their
way', plus people forget to record what they actually did. Attempting to
capture what happened still results major gaps in the record.

It's nice to believe that in software we could perfectly capture the
copy/edit/rename processes between revisions, but with humans in the
loop it just doesn't work as planned.

Hence the value of Git is that it does record faithfully the end points,
and allows a moderately standardised way of viewing the perceived rename
process. It also removes the external attempts at 'control' of the
revision record that bedevil other approaches.

Philip

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Feature request: provide a persistent IDs on a commit
  2022-07-22 20:08       ` Philip Oakley
@ 2022-07-22 20:36         ` Michal Suchánek
  2022-07-22 22:46           ` Jacob Keller
  2022-07-24  5:10           ` Elijah Newren
  0 siblings, 2 replies; 29+ messages in thread
From: Michal Suchánek @ 2022-07-22 20:36 UTC (permalink / raw)
  To: Philip Oakley
  Cc: Hilco Wijbenga, Phillip Susi,
	Ævar Arnfjörð Bjarmason, Stephen Finucane,
	Git Users

On Fri, Jul 22, 2022 at 09:08:56PM +0100, Philip Oakley wrote:
> On 21/07/2022 19:58, Hilco Wijbenga wrote:
> > On Thu, Jul 21, 2022 at 9:39 AM Phillip Susi <phill@thesusis.net> wrote:
> >> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
> >>
> >>> This has come up a bunch of times. I think that the thing git itself
> >>> should be doing is to lean into the same notion that we use for tracking
> >>> renames. I.e. we don't, we analyze history after-the-fact and spot the
> >>> renames for you.
> >> I've never been a big fan of that quality of git because it is
> >> inherently unreliable.
> > Indeed, which would be fine ... if there were a way to tell Git, "no
> > this is not a rename" or "hey, you missed this rename" but there
> > isn't.
> >
> > Reading previous messages, it seems like the
> > after-the-fact-rename-heuristic makes the Git code simpler. That is a
> > perfectly valid argument for not supporting "explicit" renames but I
> > have seen several messages from which I inferred that rename handling
> > was deemed a "solved problem". And _that_, at least in my experience,
> > is definitely not the case.
> 
> Part of the rename problem is that there can be many different routes to
> the same result, and often the route used isn't the one 'specified' by
> those who wish a complicated rename process to have happened 'their
> way', plus people forget to record what they actually did. Attempting to
> capture what happened still results major gaps in the record.

Doesn't git have rebase?

It is not required that the rename is captured perfectly every time so
long as it can be amended later.

Thanks

Michal

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Feature request: provide a persistent IDs on a commit
  2022-07-22 20:36         ` Michal Suchánek
@ 2022-07-22 22:46           ` Jacob Keller
  2022-07-23  7:00             ` Michal Suchánek
  2022-07-24  5:10           ` Elijah Newren
  1 sibling, 1 reply; 29+ messages in thread
From: Jacob Keller @ 2022-07-22 22:46 UTC (permalink / raw)
  To: Michal Suchánek
  Cc: Philip Oakley, Hilco Wijbenga, Phillip Susi,
	Ævar Arnfjörð Bjarmason, Stephen Finucane,
	Git Users

On Fri, Jul 22, 2022 at 1:42 PM Michal Suchánek <msuchanek@suse.de> wrote:
>
> On Fri, Jul 22, 2022 at 09:08:56PM +0100, Philip Oakley wrote:
> > On 21/07/2022 19:58, Hilco Wijbenga wrote:
> > > On Thu, Jul 21, 2022 at 9:39 AM Phillip Susi <phill@thesusis.net> wrote:
> > >> Ęvar Arnfjörš Bjarmason <avarab@gmail.com> writes:
> > >>
> > >>> This has come up a bunch of times. I think that the thing git itself
> > >>> should be doing is to lean into the same notion that we use for tracking
> > >>> renames. I.e. we don't, we analyze history after-the-fact and spot the
> > >>> renames for you.
> > >> I've never been a big fan of that quality of git because it is
> > >> inherently unreliable.
> > > Indeed, which would be fine ... if there were a way to tell Git, "no
> > > this is not a rename" or "hey, you missed this rename" but there
> > > isn't.
> > >
> > > Reading previous messages, it seems like the
> > > after-the-fact-rename-heuristic makes the Git code simpler. That is a
> > > perfectly valid argument for not supporting "explicit" renames but I
> > > have seen several messages from which I inferred that rename handling
> > > was deemed a "solved problem". And _that_, at least in my experience,
> > > is definitely not the case.
> >
> > Part of the rename problem is that there can be many different routes to
> > the same result, and often the route used isn't the one 'specified' by
> > those who wish a complicated rename process to have happened 'their
> > way', plus people forget to record what they actually did. Attempting to
> > capture what happened still results major gaps in the record.
>
> Doesn't git have rebase?
>
> It is not required that the rename is captured perfectly every time so
> long as it can be amended later.
>
> Thanks
>
> Michal

Rebase is typically reserved only to modify commits which are not yet
"permanent". Once a commit starts being referenced by many others it
becomes more and more difficult to rebase it. Any rebase effectively
creates a new commit.

There are multiple threads discussing renames and handling them in git
in the past which are worth re-reading, including at least

https://public-inbox.org/git/Pine.LNX.4.58.0504141102430.7211@ppc970.osdl.org/

A fuller analysis here too:
https://public-inbox.org/git/Pine.LNX.4.64.0510221251330.10477@g5.osdl.org/

As mentioned above in this thread, depending on what context you are
using, a change to a commit could be many to many: i.e. a commit which
splits into 2, or 3 commits merging into one, or 3 commits splitting
apart and then becoming 2 commits. When that happens, what "change id"
do you use for each commit?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Feature request: provide a persistent IDs on a commit
  2022-07-22 22:46           ` Jacob Keller
@ 2022-07-23  7:00             ` Michal Suchánek
  2022-07-24  5:23               ` Elijah Newren
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Suchánek @ 2022-07-23  7:00 UTC (permalink / raw)
  To: Jacob Keller
  Cc: Philip Oakley, Hilco Wijbenga, Phillip Susi,
	Ævar Arnfjörð Bjarmason, Stephen Finucane,
	Git Users

On Fri, Jul 22, 2022 at 03:46:22PM -0700, Jacob Keller wrote:
> On Fri, Jul 22, 2022 at 1:42 PM Michal Suchánek <msuchanek@suse.de> wrote:
> >
> > On Fri, Jul 22, 2022 at 09:08:56PM +0100, Philip Oakley wrote:
> > > On 21/07/2022 19:58, Hilco Wijbenga wrote:
> > > > On Thu, Jul 21, 2022 at 9:39 AM Phillip Susi <phill@thesusis.net> wrote:
> > > >> Ęvar Arnfjörš Bjarmason <avarab@gmail.com> writes:
> > > >>
> > > >>> This has come up a bunch of times. I think that the thing git itself
> > > >>> should be doing is to lean into the same notion that we use for tracking
> > > >>> renames. I.e. we don't, we analyze history after-the-fact and spot the
> > > >>> renames for you.
> > > >> I've never been a big fan of that quality of git because it is
> > > >> inherently unreliable.
> > > > Indeed, which would be fine ... if there were a way to tell Git, "no
> > > > this is not a rename" or "hey, you missed this rename" but there
> > > > isn't.
> > > >
> > > > Reading previous messages, it seems like the
> > > > after-the-fact-rename-heuristic makes the Git code simpler. That is a
> > > > perfectly valid argument for not supporting "explicit" renames but I
> > > > have seen several messages from which I inferred that rename handling
> > > > was deemed a "solved problem". And _that_, at least in my experience,
> > > > is definitely not the case.
> > >
> > > Part of the rename problem is that there can be many different routes to
> > > the same result, and often the route used isn't the one 'specified' by
> > > those who wish a complicated rename process to have happened 'their
> > > way', plus people forget to record what they actually did. Attempting to
> > > capture what happened still results major gaps in the record.
> >
> > Doesn't git have rebase?
> >
> > It is not required that the rename is captured perfectly every time so
> > long as it can be amended later.
> >
> > Thanks
> >
> > Michal
> 
> Rebase is typically reserved only to modify commits which are not yet
> "permanent". Once a commit starts being referenced by many others it
> becomes more and more difficult to rebase it. Any rebase effectively
> creates a new commit.
> 
> There are multiple threads discussing renames and handling them in git
> in the past which are worth re-reading, including at least
> 
> https://public-inbox.org/git/Pine.LNX.4.58.0504141102430.7211@ppc970.osdl.org/
> 
> A fuller analysis here too:
> https://public-inbox.org/git/Pine.LNX.4.64.0510221251330.10477@g5.osdl.org/
> 
> As mentioned above in this thread, depending on what context you are
> using, a change to a commit could be many to many: i.e. a commit which
> splits into 2, or 3 commits merging into one, or 3 commits splitting
> apart and then becoming 2 commits. When that happens, what "change id"
> do you use for each commit?

Same as commit message and any trailers you might have - they are
preserved, concatenated, and can be regenerated.

Thanks

Michal

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Feature request: provide a persistent IDs on a commit
  2022-07-18 21:24   ` Glen Choo
  2022-07-20 19:21     ` Konstantin Ryabitsev
@ 2022-07-24  5:09     ` Elijah Newren
  1 sibling, 0 replies; 29+ messages in thread
From: Elijah Newren @ 2022-07-24  5:09 UTC (permalink / raw)
  To: Glen Choo; +Cc: Konstantin Ryabitsev, Stephen Finucane, Git Mailing List

On Mon, Jul 18, 2022 at 2:29 PM Glen Choo <chooglen@google.com> wrote:
>
> Konstantin Ryabitsev <konstantin@linuxfoundation.org> writes:
>
> > On Mon, Jul 18, 2022 at 06:18:11PM +0100, Stephen Finucane wrote:
> >> ...to track evolution of a patch through time.
> >>
> >> tl;dr: How hard would it be to retrofit an 'ChangeID' concept à la the 'Change-
> >> ID' trailer used by Gerrit into git core?
> >
> > I just started working on this for b4, with the notable difference that the
> > change-id trailer is used in the cover letter instead of in individual
> > commits, which moves the concept of "change" from a single commit to a series
> > of commits. IMO, it's much more useful in that scope, because as series are
> > reviewed and iterated, individual patches can get squashed, split up or
> > otherwise transformed.
>
> My 2 cents, since I used to use Gerrit a lot :)
>
> I find persistent per-commit ids really useful, even when patches get
> moved around. E.g. Gerrit can show and diff previous versions of the
> patch, which makes it really easy to tell how the patch has evolved
> over time.
>
> That's not to say that we don't need per-topic ids though ;) E.g. Gerrit
> is pretty bad at handling whole topics - it does naive mapping on a
> per-commit level, so it has no concept of "these (n - 1) patches should
> replace these n patches".
>
> I, for one, would love to see some kind of "rewrite tracking" in Git.
> One use case that comes up often is downstream patches, where patches
> are continuously rebased onto a new upstream; in those cases, it's
> pretty hard to keep track of how the patch has changed over time

Two angles I can think of that partially address this:

1) If you have the old commits still around and know what they were,
you can run range-diff to see differences between any pair of versions
of the commits.

2) cherry-picks and reverts might already include a link to an "old"
commit for you in the commit message ("cherry picked from commit
<hash>" or "This reverts <hash>").  Those could be used to show how
the new commit differs from what would have been done with an
automatic cherry-pick or automatic revert.  (By "automatic", I
basically mean what the state of files in the working tree would be
when the operation stops to allow users to resolve conflicts.)  In
fact, I wrote some patches to do precisely this quite a while ago
which are up at https://github.com/gitgitgadget/git/pull/1151 if
you're curious.  But this approach is not useful for general rebasing,
because there's no automated way to find out what the original commit
was so that you can take a look at such a difference.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Feature request: provide a persistent IDs on a commit
  2022-07-22 20:36         ` Michal Suchánek
  2022-07-22 22:46           ` Jacob Keller
@ 2022-07-24  5:10           ` Elijah Newren
  2022-07-24  8:59             ` Michal Suchánek
  1 sibling, 1 reply; 29+ messages in thread
From: Elijah Newren @ 2022-07-24  5:10 UTC (permalink / raw)
  To: Michal Suchánek
  Cc: Philip Oakley, Hilco Wijbenga, Phillip Susi,
	Ævar Arnfjörð Bjarmason, Stephen Finucane,
	Git Users

On Fri, Jul 22, 2022 at 1:42 PM Michal Suchánek <msuchanek@suse.de> wrote:
>
> On Fri, Jul 22, 2022 at 09:08:56PM +0100, Philip Oakley wrote:
> > On 21/07/2022 19:58, Hilco Wijbenga wrote:
> > > On Thu, Jul 21, 2022 at 9:39 AM Phillip Susi <phill@thesusis.net> wrote:
> > >> Ęvar Arnfjörš Bjarmason <avarab@gmail.com> writes:
> > >>
> > >>> This has come up a bunch of times. I think that the thing git itself
> > >>> should be doing is to lean into the same notion that we use for tracking
> > >>> renames. I.e. we don't, we analyze history after-the-fact and spot the
> > >>> renames for you.
> > >> I've never been a big fan of that quality of git because it is
> > >> inherently unreliable.
> > > Indeed, which would be fine ... if there were a way to tell Git, "no
> > > this is not a rename" or "hey, you missed this rename" but there
> > > isn't.
> > >
> > > Reading previous messages, it seems like the
> > > after-the-fact-rename-heuristic makes the Git code simpler. That is a
> > > perfectly valid argument for not supporting "explicit" renames but I
> > > have seen several messages from which I inferred that rename handling
> > > was deemed a "solved problem". And _that_, at least in my experience,
> > > is definitely not the case.
> >
> > Part of the rename problem is that there can be many different routes to
> > the same result, and often the route used isn't the one 'specified' by
> > those who wish a complicated rename process to have happened 'their
> > way', plus people forget to record what they actually did. Attempting to
> > capture what happened still results major gaps in the record.
>
> Doesn't git have rebase?
>
> It is not required that the rename is captured perfectly every time so
> long as it can be amended later.

"so long as".  Therefore, since it can't be amended after the commit
is accepted/merged, it is required that this auxiliary data be
captured perfectly before that time if it's going to be captured at
all.

Did I read that right?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Feature request: provide a persistent IDs on a commit
  2022-07-23  7:00             ` Michal Suchánek
@ 2022-07-24  5:23               ` Elijah Newren
  2022-07-24  8:54                 ` Michal Suchánek
  2022-07-25 21:47                 ` Jacob Keller
  0 siblings, 2 replies; 29+ messages in thread
From: Elijah Newren @ 2022-07-24  5:23 UTC (permalink / raw)
  To: Michal Suchánek
  Cc: Jacob Keller, Philip Oakley, Hilco Wijbenga, Phillip Susi,
	Ævar Arnfjörð Bjarmason, Stephen Finucane,
	Git Users

On Sat, Jul 23, 2022 at 12:44 AM Michal Suchánek <msuchanek@suse.de> wrote:
>
> On Fri, Jul 22, 2022 at 03:46:22PM -0700, Jacob Keller wrote:
> > On Fri, Jul 22, 2022 at 1:42 PM Michal Suchánek <msuchanek@suse.de> wrote:
> > >
> > > On Fri, Jul 22, 2022 at 09:08:56PM +0100, Philip Oakley wrote:
[...]
> > > > Part of the rename problem is that there can be many different routes to
> > > > the same result, and often the route used isn't the one 'specified' by
> > > > those who wish a complicated rename process to have happened 'their
> > > > way', plus people forget to record what they actually did. Attempting to
> > > > capture what happened still results major gaps in the record.
> > >
> > > Doesn't git have rebase?
> > >
> > > It is not required that the rename is captured perfectly every time so
> > > long as it can be amended later.
> > >
> >
> > Rebase is typically reserved only to modify commits which are not yet
> > "permanent". Once a commit starts being referenced by many others it
> > becomes more and more difficult to rebase it. Any rebase effectively
> > creates a new commit.
> >
> > There are multiple threads discussing renames and handling them in git
> > in the past which are worth re-reading, including at least
> >
> > https://public-inbox.org/git/Pine.LNX.4.58.0504141102430.7211@ppc970.osdl.org/
> >
> > A fuller analysis here too:
> > https://public-inbox.org/git/Pine.LNX.4.64.0510221251330.10477@g5.osdl.org/
> >
> > As mentioned above in this thread, depending on what context you are
> > using, a change to a commit could be many to many: i.e. a commit which
> > splits into 2, or 3 commits merging into one, or 3 commits splitting
> > apart and then becoming 2 commits. When that happens, what "change id"
> > do you use for each commit?
>
> Same as commit message and any trailers you might have - they are
> preserved, concatenated

Exactly how are they concatenated?  Is that a user operation, or
something a Git command does automatically?  Which commands and which
circumstances?  If users do it, what's the UI for them to discover
what the fields are, for them to discover whether such a thing might
be needed or beneficial, and the UI for them to change these fields?
This sounds like a massive UX/UI issue that I don't have a clue how to
tackle (assuming I wanted to).

> and can be regenerated.

"can be".  But generally won't be even when it should be, right?

Committer name/email/date basically don't even exist as far as many
Git users are concerned.  They aren't shown in the default log output
(which greatly saddens me), and even after attempting to educate users
for well over a decade now, I still routinely find developers who are
surprised that these things exist.

Given that committer name/email/date aren't shown with --pretty=full
but with the lame option name --pretty=fuller, I can't see why it'd
make any sense to show Change-Ids in the log output by default.

But if it's not shown -- and by default -- then it doesn't exist for
many users.  And if it doesn't exist, users aren't going to fix it
when they need to.

(Even if it were shown by default, it's not clear to me that users
would know when to fix it, or how to fix it, or even care to fix it
and instead view it as a pedantic requirement being foisted on them.)

I think the "many-to-many issue" others have raised in this thread is
an important, big, and thorny problem.  I think it has the potential
to be a minefield of UX and a steady stream of bug reports.  And
seeing proponents of Change-Id just dismissing the issue makes me all
the more suspicious of the proposal in the first place.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Feature request: provide a persistent IDs on a commit
  2022-07-24  5:23               ` Elijah Newren
@ 2022-07-24  8:54                 ` Michal Suchánek
  2022-07-25 21:47                 ` Jacob Keller
  1 sibling, 0 replies; 29+ messages in thread
From: Michal Suchánek @ 2022-07-24  8:54 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Jacob Keller, Philip Oakley, Hilco Wijbenga, Phillip Susi,
	Ævar Arnfjörð Bjarmason, Stephen Finucane,
	Git Users

On Sat, Jul 23, 2022 at 10:23:09PM -0700, Elijah Newren wrote:
> On Sat, Jul 23, 2022 at 12:44 AM Michal Suchánek <msuchanek@suse.de> wrote:
> >
> > On Fri, Jul 22, 2022 at 03:46:22PM -0700, Jacob Keller wrote:
> > > On Fri, Jul 22, 2022 at 1:42 PM Michal Suchánek <msuchanek@suse.de> wrote:
> > > >
> > > > On Fri, Jul 22, 2022 at 09:08:56PM +0100, Philip Oakley wrote:
> [...]
> > > > > Part of the rename problem is that there can be many different routes to
> > > > > the same result, and often the route used isn't the one 'specified' by
> > > > > those who wish a complicated rename process to have happened 'their
> > > > > way', plus people forget to record what they actually did. Attempting to
> > > > > capture what happened still results major gaps in the record.
> > > >
> > > > Doesn't git have rebase?
> > > >
> > > > It is not required that the rename is captured perfectly every time so
> > > > long as it can be amended later.
> > > >
> > >
> > > Rebase is typically reserved only to modify commits which are not yet
> > > "permanent". Once a commit starts being referenced by many others it
> > > becomes more and more difficult to rebase it. Any rebase effectively
> > > creates a new commit.
> > >
> > > There are multiple threads discussing renames and handling them in git
> > > in the past which are worth re-reading, including at least
> > >
> > > https://public-inbox.org/git/Pine.LNX.4.58.0504141102430.7211@ppc970.osdl.org/
> > >
> > > A fuller analysis here too:
> > > https://public-inbox.org/git/Pine.LNX.4.64.0510221251330.10477@g5.osdl.org/
> > >
> > > As mentioned above in this thread, depending on what context you are
> > > using, a change to a commit could be many to many: i.e. a commit which
> > > splits into 2, or 3 commits merging into one, or 3 commits splitting
> > > apart and then becoming 2 commits. When that happens, what "change id"
> > > do you use for each commit?
> >
> > Same as commit message and any trailers you might have - they are
> > preserved, concatenated
> 
> Exactly how are they concatenated?  Is that a user operation, or
> something a Git command does automatically?  Which commands and which
> circumstances?  If users do it, what's the UI for them to discover
> what the fields are, for them to discover whether such a thing might
> be needed or beneficial, and the UI for them to change these fields?
> This sounds like a massive UX/UI issue that I don't have a clue how to
> tackle (assuming I wanted to).

Currently when you squash commits you get both commit messages
concatenated, including any trailers.

You are free to adjust as you see fit.

> 
> > and can be regenerated.
> 
> "can be".  But generally won't be even when it should be, right?

"when it should" is not something that can be programmatically
derrmined, so it's up to the user, and sure, there will be cases where
somebody thinks it "should" but it has not. Then they can complain, just
like with any other trailer we already have today.

> I think the "many-to-many issue" others have raised in this thread is
> an important, big, and thorny problem.  I think it has the potential
> to be a minefield of UX and a steady stream of bug reports.  And
> seeing proponents of Change-Id just dismissing the issue makes me all
> the more suspicious of the proposal in the first place.

And how do you get this many to many situation in the first place?

You reset to a base before your changes and create completely new series
with completely new messages and everything?

Then of course you get completely new trailers as well unless you
somehow fish out some metadata from the old commits and manually apply
them.

I don't see any functionality in git that does many to many commits
transform in one step. It's always just split/merge.

Thanks

Michal

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Feature request: provide a persistent IDs on a commit
  2022-07-24  5:10           ` Elijah Newren
@ 2022-07-24  8:59             ` Michal Suchánek
  0 siblings, 0 replies; 29+ messages in thread
From: Michal Suchánek @ 2022-07-24  8:59 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Philip Oakley, Hilco Wijbenga, Phillip Susi,
	Ævar Arnfjörð Bjarmason, Stephen Finucane,
	Git Users

On Sat, Jul 23, 2022 at 10:10:11PM -0700, Elijah Newren wrote:
> On Fri, Jul 22, 2022 at 1:42 PM Michal Suchánek <msuchanek@suse.de> wrote:
> >
> > On Fri, Jul 22, 2022 at 09:08:56PM +0100, Philip Oakley wrote:
> > > On 21/07/2022 19:58, Hilco Wijbenga wrote:
> > > > On Thu, Jul 21, 2022 at 9:39 AM Phillip Susi <phill@thesusis.net> wrote:
> > > >> Ęvar Arnfjörš Bjarmason <avarab@gmail.com> writes:
> > > >>
> > > >>> This has come up a bunch of times. I think that the thing git itself
> > > >>> should be doing is to lean into the same notion that we use for tracking
> > > >>> renames. I.e. we don't, we analyze history after-the-fact and spot the
> > > >>> renames for you.
> > > >> I've never been a big fan of that quality of git because it is
> > > >> inherently unreliable.
> > > > Indeed, which would be fine ... if there were a way to tell Git, "no
> > > > this is not a rename" or "hey, you missed this rename" but there
> > > > isn't.
> > > >
> > > > Reading previous messages, it seems like the
> > > > after-the-fact-rename-heuristic makes the Git code simpler. That is a
> > > > perfectly valid argument for not supporting "explicit" renames but I
> > > > have seen several messages from which I inferred that rename handling
> > > > was deemed a "solved problem". And _that_, at least in my experience,
> > > > is definitely not the case.
> > >
> > > Part of the rename problem is that there can be many different routes to
> > > the same result, and often the route used isn't the one 'specified' by
> > > those who wish a complicated rename process to have happened 'their
> > > way', plus people forget to record what they actually did. Attempting to
> > > capture what happened still results major gaps in the record.
> >
> > Doesn't git have rebase?
> >
> > It is not required that the rename is captured perfectly every time so
> > long as it can be amended later.
> 
> "so long as".  Therefore, since it can't be amended after the commit
> is accepted/merged, it is required that this auxiliary data be
> captured perfectly before that time if it's going to be captured at
> all.
> 
> Did I read that right?

Or it will be broken after it is merged, just as many other things in
commits that are accepted into history that is not to be modified
anymore.

The only point I can see here is that if there is any user-crafted
metadata that describes renames then it should be considered advisory,
and an option to override it should exist because it may be wrong.

Nonetheless, if such feature existed users that are willing to generate
such metadata and review it before it gets merged may get more out of
the rename tracking than can be done automatically today.

Thanks

Michal

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Feature request: provide a persistent IDs on a commit
  2022-07-24  5:23               ` Elijah Newren
  2022-07-24  8:54                 ` Michal Suchánek
@ 2022-07-25 21:47                 ` Jacob Keller
  2022-07-26  3:49                   ` Elijah Newren
  1 sibling, 1 reply; 29+ messages in thread
From: Jacob Keller @ 2022-07-25 21:47 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Michal Suchánek, Philip Oakley, Hilco Wijbenga, Phillip Susi,
	Ævar Arnfjörð Bjarmason, Stephen Finucane,
	Git Users

On Sat, Jul 23, 2022 at 10:23 PM Elijah Newren <newren@gmail.com> wrote:
>
> On Sat, Jul 23, 2022 at 12:44 AM Michal Suchánek <msuchanek@suse.de> wrote:
> >
> > On Fri, Jul 22, 2022 at 03:46:22PM -0700, Jacob Keller wrote:
> > > On Fri, Jul 22, 2022 at 1:42 PM Michal Suchánek <msuchanek@suse.de> wrote:
> > > >
> > > > On Fri, Jul 22, 2022 at 09:08:56PM +0100, Philip Oakley wrote:
> [...]
> > > > > Part of the rename problem is that there can be many different routes to
> > > > > the same result, and often the route used isn't the one 'specified' by
> > > > > those who wish a complicated rename process to have happened 'their
> > > > > way', plus people forget to record what they actually did. Attempting to
> > > > > capture what happened still results major gaps in the record.
> > > >
> > > > Doesn't git have rebase?
> > > >
> > > > It is not required that the rename is captured perfectly every time so
> > > > long as it can be amended later.
> > > >
> > >
> > > Rebase is typically reserved only to modify commits which are not yet
> > > "permanent". Once a commit starts being referenced by many others it
> > > becomes more and more difficult to rebase it. Any rebase effectively
> > > creates a new commit.
> > >
> > > There are multiple threads discussing renames and handling them in git
> > > in the past which are worth re-reading, including at least
> > >
> > > https://public-inbox.org/git/Pine.LNX.4.58.0504141102430.7211@ppc970.osdl.org/
> > >
> > > A fuller analysis here too:
> > > https://public-inbox.org/git/Pine.LNX.4.64.0510221251330.10477@g5.osdl.org/
> > >
> > > As mentioned above in this thread, depending on what context you are
> > > using, a change to a commit could be many to many: i.e. a commit which
> > > splits into 2, or 3 commits merging into one, or 3 commits splitting
> > > apart and then becoming 2 commits. When that happens, what "change id"
> > > do you use for each commit?
> >
> > Same as commit message and any trailers you might have - they are
> > preserved, concatenated
>
> Exactly how are they concatenated?  Is that a user operation, or
> something a Git command does automatically?  Which commands and which
> circumstances?  If users do it, what's the UI for them to discover
> what the fields are, for them to discover whether such a thing might
> be needed or beneficial, and the UI for them to change these fields?
> This sounds like a massive UX/UI issue that I don't have a clue how to
> tackle (assuming I wanted to).
>
> > and can be regenerated.
>
> "can be".  But generally won't be even when it should be, right?
>
> Committer name/email/date basically don't even exist as far as many
> Git users are concerned.  They aren't shown in the default log output
> (which greatly saddens me), and even after attempting to educate users
> for well over a decade now, I still routinely find developers who are
> surprised that these things exist.
>
> Given that committer name/email/date aren't shown with --pretty=full
> but with the lame option name --pretty=fuller, I can't see why it'd
> make any sense to show Change-Ids in the log output by default.
>
> But if it's not shown -- and by default -- then it doesn't exist for
> many users.  And if it doesn't exist, users aren't going to fix it
> when they need to.
>
> (Even if it were shown by default, it's not clear to me that users
> would know when to fix it, or how to fix it, or even care to fix it
> and instead view it as a pedantic requirement being foisted on them.)
>
> I think the "many-to-many issue" others have raised in this thread is
> an important, big, and thorny problem.  I think it has the potential
> to be a minefield of UX and a steady stream of bug reports.  And
> seeing proponents of Change-Id just dismissing the issue makes me all
> the more suspicious of the proposal in the first place.

I do think there is some value in having a sort of generic id like
change-id, but I do think we want to be careful about how exactly we
handle it.

As you say, if we hide it then users may not be aware of it, and if we
make it visible users who don't care may be annoyed. I don't think we
can fully automate it because of the nature of combining changes and
splitting changes require humans to decide which change keeps which
ID. Its not even clear when rebasing whether a split is going to
happen. A combine operation is easier to detect in rebase
(fixup/squash), but determining which id to keep is not. Would we even
want to have support for "this commit merges two and is now one, but
we keep both IDs because it really is both commits"? That gets messy
pretty fast.

Users such as gerrit already simply use the trailer with Change-id and
manage to make it work by enforcing some constraints and assuming
users will know what to do (because otherwise they fail to interact
with gerrit servers).

For cases where it helps, I think its very valuable. Being able to
track revisions of a series or a patch is super useful. Getting
external tooling like public-inbox, patchworks, etc to use this would
also be useful. But I think we would want to sort out the situation a
bit for how and when are they generated, when are they
replaced/re-generated, how this interacts with mailing etc.

Should rebase just always regenerate? that loses a lot of value. I
guess squashing could offer users a choice of which to keep? Fixup
would always keep the same one. And otherwise it becomes up to users
to know when they need to copy from an old commit or refresh an
existing commit... Thats pretty much what gerrit does these days, if a
commit doesn't have the trailer it gets added, and if it does, its up
to the user to know when to remove it or regenerate it... Since its a
commit message trailer it gets sent implicitly through the mailing
list unless removed.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Feature request: provide a persistent IDs on a commit
  2022-07-25 21:47                 ` Jacob Keller
@ 2022-07-26  3:49                   ` Elijah Newren
  2022-07-26  8:43                     ` Michal Suchánek
  0 siblings, 1 reply; 29+ messages in thread
From: Elijah Newren @ 2022-07-26  3:49 UTC (permalink / raw)
  To: Jacob Keller
  Cc: Michal Suchánek, Philip Oakley, Hilco Wijbenga, Phillip Susi,
	Ævar Arnfjörð Bjarmason, Stephen Finucane,
	Git Users

On Mon, Jul 25, 2022 at 2:47 PM Jacob Keller <jacob.keller@gmail.com> wrote:
>
> On Sat, Jul 23, 2022 at 10:23 PM Elijah Newren <newren@gmail.com> wrote:
> >
> > On Sat, Jul 23, 2022 at 12:44 AM Michal Suchánek <msuchanek@suse.de> wrote:
> > >
> > > On Fri, Jul 22, 2022 at 03:46:22PM -0700, Jacob Keller wrote:
> > > > On Fri, Jul 22, 2022 at 1:42 PM Michal Suchánek <msuchanek@suse.de> wrote:
> > > > >
> > > > > On Fri, Jul 22, 2022 at 09:08:56PM +0100, Philip Oakley wrote:
> > [...]
> > > > > > Part of the rename problem is that there can be many different routes to
> > > > > > the same result, and often the route used isn't the one 'specified' by
> > > > > > those who wish a complicated rename process to have happened 'their
> > > > > > way', plus people forget to record what they actually did. Attempting to
> > > > > > capture what happened still results major gaps in the record.
> > > > >
> > > > > Doesn't git have rebase?
> > > > >
> > > > > It is not required that the rename is captured perfectly every time so
> > > > > long as it can be amended later.
> > > > >
> > > >
> > > > Rebase is typically reserved only to modify commits which are not yet
> > > > "permanent". Once a commit starts being referenced by many others it
> > > > becomes more and more difficult to rebase it. Any rebase effectively
> > > > creates a new commit.
> > > >
> > > > There are multiple threads discussing renames and handling them in git
> > > > in the past which are worth re-reading, including at least
> > > >
> > > > https://public-inbox.org/git/Pine.LNX.4.58.0504141102430.7211@ppc970.osdl.org/
> > > >
> > > > A fuller analysis here too:
> > > > https://public-inbox.org/git/Pine.LNX.4.64.0510221251330.10477@g5.osdl.org/
> > > >
> > > > As mentioned above in this thread, depending on what context you are
> > > > using, a change to a commit could be many to many: i.e. a commit which
> > > > splits into 2, or 3 commits merging into one, or 3 commits splitting
> > > > apart and then becoming 2 commits. When that happens, what "change id"
> > > > do you use for each commit?
> > >
> > > Same as commit message and any trailers you might have - they are
> > > preserved, concatenated
> >
> > Exactly how are they concatenated?  Is that a user operation, or
> > something a Git command does automatically?  Which commands and which
> > circumstances?  If users do it, what's the UI for them to discover
> > what the fields are, for them to discover whether such a thing might
> > be needed or beneficial, and the UI for them to change these fields?
> > This sounds like a massive UX/UI issue that I don't have a clue how to
> > tackle (assuming I wanted to).
> >
> > > and can be regenerated.
> >
> > "can be".  But generally won't be even when it should be, right?
> >
> > Committer name/email/date basically don't even exist as far as many
> > Git users are concerned.  They aren't shown in the default log output
> > (which greatly saddens me), and even after attempting to educate users
> > for well over a decade now, I still routinely find developers who are
> > surprised that these things exist.
> >
> > Given that committer name/email/date aren't shown with --pretty=full
> > but with the lame option name --pretty=fuller, I can't see why it'd
> > make any sense to show Change-Ids in the log output by default.
> >
> > But if it's not shown -- and by default -- then it doesn't exist for
> > many users.  And if it doesn't exist, users aren't going to fix it
> > when they need to.
> >
> > (Even if it were shown by default, it's not clear to me that users
> > would know when to fix it, or how to fix it, or even care to fix it
> > and instead view it as a pedantic requirement being foisted on them.)
> >
> > I think the "many-to-many issue" others have raised in this thread is
> > an important, big, and thorny problem.  I think it has the potential
> > to be a minefield of UX and a steady stream of bug reports.  And
> > seeing proponents of Change-Id just dismissing the issue makes me all
> > the more suspicious of the proposal in the first place.
>
> I do think there is some value in having a sort of generic id like
> change-id, but I do think we want to be careful about how exactly we
> handle it.
>
> As you say, if we hide it then users may not be aware of it, and if we
> make it visible users who don't care may be annoyed. I don't think we
> can fully automate it because of the nature of combining changes and
> splitting changes require humans to decide which change keeps which
> ID. Its not even clear when rebasing whether a split is going to
> happen. A combine operation is easier to detect in rebase
> (fixup/squash), but determining which id to keep is not. Would we even
> want to have support for "this commit merges two and is now one, but
> we keep both IDs because it really is both commits"? That gets messy
> pretty fast.
>
> Users such as gerrit already simply use the trailer with Change-id and
> manage to make it work by enforcing some constraints and assuming
> users will know what to do (because otherwise they fail to interact
> with gerrit servers).
>
> For cases where it helps, I think its very valuable. Being able to
> track revisions of a series or a patch is super useful. Getting
> external tooling like public-inbox, patchworks, etc to use this would
> also be useful. But I think we would want to sort out the situation a
> bit for how and when are they generated, when are they
> replaced/re-generated, how this interacts with mailing etc.
>
> Should rebase just always regenerate? that loses a lot of value. I
> guess squashing could offer users a choice of which to keep? Fixup
> would always keep the same one. And otherwise it becomes up to users
> to know when they need to copy from an old commit or refresh an
> existing commit... Thats pretty much what gerrit does these days, if a
> commit doesn't have the trailer it gets added, and if it does, its up
> to the user to know when to remove it or regenerate it... Since its a
> commit message trailer it gets sent implicitly through the mailing
> list unless removed.

Yes, I fully agree it needs to be spelled out a lot more.  And not
just obvious commands (everyone seems to focus on commit, cherry-pick,
and rebase), but what about e.g. `git merge --squash`?

Also, as far as value goes, I have an interesting story related to
Change-Ids (read the last sentence if you only want the summary):

<long story>

I have used Gerrit fairly heavily.  I maintained an instance for a few
hundred developers for several years (inheriting it from others), and
was responsible for various build & release stuff related to one of
the larger products tracked in it (an approximately
linux-kernel-sized) product.  I also attended a Gerrit conference or
two and submitted a few patches for Gerrit that were accepted.  So,
for context, I'm clearly not a Gerrit developer since I only submitted
a few patches, but I was the clear expert within my company on Gerrit.
So that's my background.

Some background on the (insane) project management of the time
(unrelated to Git or Gerrit or Change-Ids) is also important to
understand this story:  Years ago, this project had well over 100
active branches (!!).  And yes, branches were being aggressively
retired, but I still remember when we finally managed to get the count
under 200.  Each branch had important patches, and hundreds of patches
on these branches was not uncommon.  Yes, it was insane, and yes, I
and many others were really happy when we eventually reached the land
of sanity with just a few active branches (and all but the main one
only gets backport fixes).

One of the things I did to help us move in the direction towards
sanity was a "snowflake report" -- a report that would help people
determine which patches had not already been upstreamed (i.e. not
included in the main development branch of the same repository), and
which still needed to be.  (The terminology for the report was that
unique patches that weren't upstream were "snowflakes" which we were
trying to pick out of a "blizzard of commits").  Anyway, the immediate
impetus for this report was that after enough disasters from
forgetting to upstream important patches, people started asking around
about how to avoid another repeat.  I thought this was a trivial
question to answer at first, but...I was wrong.

Now, as I said before, this product was tracked in Gerrit.  Also,
direct pushing to bypass reviews was disabled (with _very_ rare
exceptions, that essentially didn't affect the quality of the report
mentioned below at all), and Change-Ids were required for all commits
pushed up for code reviews.  So, yes, people had the Gerrit-suggested
hook installed, and yes virtually all commits had Change-Ids.

So, what was implemented to answer "which patches in this branch have
been upstreamed (i.e. included in the main development branch)"?  A
variety of checks.  One of which was fully reliable:

    * "git cherry" to catch the "100% certainty cherry-picks" (only
caught maybe 5% of the upstreamed patches, but still useful).

And a bunch of other checks that were just heuristics:

    * (author name, author email, author date) triples matching
    * commit message exactly matching
    * "(cherry-picked from commit <HASH>)" footers from one commit
referring to another (or there being a transitive chain that wasn't
too long, or there being a transitive tree with a path between the
commits -- think for example of both commits being a cherry-pick of
the same thing)
    * patch-hunk-ids matching (instead of git-patch-id which computes
a patch-id for the overall patch, compute one for each hunk.  Then
look for other commits that have one of their patch hunks match. This
could result in a many-to-many relationship between downstream and
upstream patches)

We generated a report based on this (on a wiki so folks could edit and
add notes), with html links and such.  For each downstream commit, we
added links to all potential upstream commit(s) and included reasons
why each commit was thought to be a potential match.  Further, for any
potential upstream commit that might match (and which wasn't a 100%
certainty pick from "git cherry"), we also looked at which filename(s)
were modified in both commits and reported on the number of matching
and non-matching filename(s) between the pair in addition to the
number of matching and non-matching patch hunks.

Basically, it was a large amount of information to allow humans to
review how similar the commits were to help them determine if the
changes from the commit were already (partially or fully) included
upstream.

I got lots of requests to find as many links as possible, because with
unfortunately frequent regularity, a new report would be created for
some branch and dozens of teams of developers were being forced to
review the reports and sign off on every single patch and whether it
needed to be upstreamed, or even finish being upstreamed (and to do
the upstreaming work if so).  I got a fair number of comments and
questions on the reports as a result.

I'm glad we no longer need this report due to switching to a much more
sane testing/backporting/delivery/branching/etc. story, but years ago
this report was helpful.  Anyway...

A few surprises I found:
  * Patches could be partially upstreamed.  For example, someone
cherry-picked something from a newer version of upstream than their
current branch was based off of, then found and amended important
fixes into that commit.  So although the patch might appear to be
upstream (because the commit messages matched, the author
name/email/dates match, etc.), important fixes in it were not.
  * I was surprised by the number of cases that were not one-to-one
mappings.  One patch might be upstreamed via several patches on the
main branch.  Or parts of several patches may have been upstreamed as
one big combination commit upstream.  I knew theoretically there could
be some like this, and perhaps wasn't too surprised that there would
be at least one, but there were quite a few more than I expected.
  * One of the weirder cases I remember: Someone had created changes
by going into the Gerrit UI, picking an existing but obsolete/closed
code review, editing the code in ways that had absolutely nothing to
do with the original commit they started on (and perhaps on some other
branch?), and left the Change-Id around because it looked valid and
they didn't know what it meant anyway.  (Something about their laptop
being broken and being unable to edit code locally, so they just
edited "in the cloud" and let the automated regression tests verify
their changes; Gerrit either didn't have a way to start a new commit
at the time or the developer didn't find it, so they just edited
something else.)  I knew about it because I found weird pairs of
non-matching things and this developer left a note about it in their
edited commit message just in case anything weird happened.

But let me be more explicit about Change-Ids: we didn't use them at
all.  It came up multiple times as a question, but in my looking into
useful factors, to me they seemed to provide no extra value and I had
found multiple cases where they seemed to be misleading.  The purpose
of the report was to avoid more disasters from forgetting to backport
commits.  This report seems like the kind of thing that Change-Ids
were invented for, and we already had Change-Ids in all commits due to
using Gerrit, but from my exploration I didn't trust the Change-Ids to
provide net-positive value, so I simply didn't use them at all.

</long story>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Feature request: provide a persistent IDs on a commit
  2022-07-26  3:49                   ` Elijah Newren
@ 2022-07-26  8:43                     ` Michal Suchánek
  0 siblings, 0 replies; 29+ messages in thread
From: Michal Suchánek @ 2022-07-26  8:43 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Jacob Keller, Philip Oakley, Hilco Wijbenga, Phillip Susi,
	Ævar Arnfjörð Bjarmason, Stephen Finucane,
	Git Users

Hello,

On Mon, Jul 25, 2022 at 08:49:30PM -0700, Elijah Newren wrote:
> On Mon, Jul 25, 2022 at 2:47 PM Jacob Keller <jacob.keller@gmail.com> wrote:
> >
> > On Sat, Jul 23, 2022 at 10:23 PM Elijah Newren <newren@gmail.com> wrote:
> > >
> > > On Sat, Jul 23, 2022 at 12:44 AM Michal Suchánek <msuchanek@suse.de> wrote:
> > > >
> > > > On Fri, Jul 22, 2022 at 03:46:22PM -0700, Jacob Keller wrote:
> > > > > On Fri, Jul 22, 2022 at 1:42 PM Michal Suchánek <msuchanek@suse.de> wrote:
> > > > > >
> > > > > > On Fri, Jul 22, 2022 at 09:08:56PM +0100, Philip Oakley wrote:
> > > [...]
> > > > > > > Part of the rename problem is that there can be many different routes to
> > > > > > > the same result, and often the route used isn't the one 'specified' by
> > > > > > > those who wish a complicated rename process to have happened 'their
> > > > > > > way', plus people forget to record what they actually did. Attempting to
> > > > > > > capture what happened still results major gaps in the record.
> > > > > >
> > > > > > Doesn't git have rebase?
> > > > > >
> > > > > > It is not required that the rename is captured perfectly every time so
> > > > > > long as it can be amended later.
> > > > > >
> > > > >
> > > > > Rebase is typically reserved only to modify commits which are not yet
> > > > > "permanent". Once a commit starts being referenced by many others it
> > > > > becomes more and more difficult to rebase it. Any rebase effectively
> > > > > creates a new commit.
> > > > >
> > > > > There are multiple threads discussing renames and handling them in git
> > > > > in the past which are worth re-reading, including at least
> > > > >
> > > > > https://public-inbox.org/git/Pine.LNX.4.58.0504141102430.7211@ppc970.osdl.org/
> > > > >
> > > > > A fuller analysis here too:
> > > > > https://public-inbox.org/git/Pine.LNX.4.64.0510221251330.10477@g5.osdl.org/
> > > > >
> > > > > As mentioned above in this thread, depending on what context you are
> > > > > using, a change to a commit could be many to many: i.e. a commit which
> > > > > splits into 2, or 3 commits merging into one, or 3 commits splitting
> > > > > apart and then becoming 2 commits. When that happens, what "change id"
> > > > > do you use for each commit?
> > > >
> > > > Same as commit message and any trailers you might have - they are
> > > > preserved, concatenated
> > >
> > > Exactly how are they concatenated?  Is that a user operation, or
> > > something a Git command does automatically?  Which commands and which
> > > circumstances?  If users do it, what's the UI for them to discover
> > > what the fields are, for them to discover whether such a thing might
> > > be needed or beneficial, and the UI for them to change these fields?
> > > This sounds like a massive UX/UI issue that I don't have a clue how to
> > > tackle (assuming I wanted to).
> > >
> > > > and can be regenerated.
> > >
> > > "can be".  But generally won't be even when it should be, right?
> > >
> > > Committer name/email/date basically don't even exist as far as many
> > > Git users are concerned.  They aren't shown in the default log output
> > > (which greatly saddens me), and even after attempting to educate users
> > > for well over a decade now, I still routinely find developers who are
> > > surprised that these things exist.
> > >
> > > Given that committer name/email/date aren't shown with --pretty=full
> > > but with the lame option name --pretty=fuller, I can't see why it'd
> > > make any sense to show Change-Ids in the log output by default.
> > >
> > > But if it's not shown -- and by default -- then it doesn't exist for
> > > many users.  And if it doesn't exist, users aren't going to fix it
> > > when they need to.
> > >
> > > (Even if it were shown by default, it's not clear to me that users
> > > would know when to fix it, or how to fix it, or even care to fix it
> > > and instead view it as a pedantic requirement being foisted on them.)
> > >
> > > I think the "many-to-many issue" others have raised in this thread is
> > > an important, big, and thorny problem.  I think it has the potential
> > > to be a minefield of UX and a steady stream of bug reports.  And
> > > seeing proponents of Change-Id just dismissing the issue makes me all
> > > the more suspicious of the proposal in the first place.
> >
> > I do think there is some value in having a sort of generic id like
> > change-id, but I do think we want to be careful about how exactly we
> > handle it.
> >
> > As you say, if we hide it then users may not be aware of it, and if we
> > make it visible users who don't care may be annoyed. I don't think we
> > can fully automate it because of the nature of combining changes and
> > splitting changes require humans to decide which change keeps which
> > ID. Its not even clear when rebasing whether a split is going to
> > happen. A combine operation is easier to detect in rebase
> > (fixup/squash), but determining which id to keep is not. Would we even
> > want to have support for "this commit merges two and is now one, but
> > we keep both IDs because it really is both commits"? That gets messy
> > pretty fast.
> >
> > Users such as gerrit already simply use the trailer with Change-id and
> > manage to make it work by enforcing some constraints and assuming
> > users will know what to do (because otherwise they fail to interact
> > with gerrit servers).
> >
> > For cases where it helps, I think its very valuable. Being able to
> > track revisions of a series or a patch is super useful. Getting
> > external tooling like public-inbox, patchworks, etc to use this would
> > also be useful. But I think we would want to sort out the situation a
> > bit for how and when are they generated, when are they
> > replaced/re-generated, how this interacts with mailing etc.
> >
> > Should rebase just always regenerate? that loses a lot of value. I
> > guess squashing could offer users a choice of which to keep? Fixup
> > would always keep the same one. And otherwise it becomes up to users
> > to know when they need to copy from an old commit or refresh an
> > existing commit... Thats pretty much what gerrit does these days, if a
> > commit doesn't have the trailer it gets added, and if it does, its up
> > to the user to know when to remove it or regenerate it... Since its a
> > commit message trailer it gets sent implicitly through the mailing
> > list unless removed.
> 
> Yes, I fully agree it needs to be spelled out a lot more.  And not
> just obvious commands (everyone seems to focus on commit, cherry-pick,
> and rebase), but what about e.g. `git merge --squash`?
> 
> Also, as far as value goes, I have an interesting story related to
> Change-Ids (read the last sentence if you only want the summary):
> 
> 
> <long story>
> 
> I have used Gerrit fairly heavily.  I maintained an instance for a few
> hundred developers for several years (inheriting it from others), and
> was responsible for various build & release stuff related to one of
> the larger products tracked in it (an approximately
> linux-kernel-sized) product.  I also attended a Gerrit conference or
> two and submitted a few patches for Gerrit that were accepted.  So,
> for context, I'm clearly not a Gerrit developer since I only submitted
> a few patches, but I was the clear expert within my company on Gerrit.
> So that's my background.
> 
> Some background on the (insane) project management of the time
> (unrelated to Git or Gerrit or Change-Ids) is also important to
> understand this story:  Years ago, this project had well over 100
> active branches (!!).  And yes, branches were being aggressively
> retired, but I still remember when we finally managed to get the count
> under 200.  Each branch had important patches, and hundreds of patches
> on these branches was not uncommon.  Yes, it was insane, and yes, I
> and many others were really happy when we eventually reached the land
> of sanity with just a few active branches (and all but the main one
> only gets backport fixes).
> 
> One of the things I did to help us move in the direction towards
> sanity was a "snowflake report" -- a report that would help people
> determine which patches had not already been upstreamed (i.e. not
> included in the main development branch of the same repository), and
> which still needed to be.  (The terminology for the report was that
> unique patches that weren't upstream were "snowflakes" which we were
> trying to pick out of a "blizzard of commits").  Anyway, the immediate
> impetus for this report was that after enough disasters from
> forgetting to upstream important patches, people started asking around
> about how to avoid another repeat.  I thought this was a trivial
> question to answer at first, but...I was wrong.
> 
> Now, as I said before, this product was tracked in Gerrit.  Also,
> direct pushing to bypass reviews was disabled (with _very_ rare
> exceptions, that essentially didn't affect the quality of the report
> mentioned below at all), and Change-Ids were required for all commits
> pushed up for code reviews.  So, yes, people had the Gerrit-suggested
> hook installed, and yes virtually all commits had Change-Ids.
> 
> So, what was implemented to answer "which patches in this branch have
> been upstreamed (i.e. included in the main development branch)"?  A
> variety of checks.  One of which was fully reliable:
> 
>     * "git cherry" to catch the "100% certainty cherry-picks" (only
> caught maybe 5% of the upstreamed patches, but still useful).
> 
> And a bunch of other checks that were just heuristics:
> 
>     * (author name, author email, author date) triples matching
>     * commit message exactly matching
>     * "(cherry-picked from commit <HASH>)" footers from one commit
> referring to another (or there being a transitive chain that wasn't
> too long, or there being a transitive tree with a path between the
> commits -- think for example of both commits being a cherry-pick of
> the same thing)
>     * patch-hunk-ids matching (instead of git-patch-id which computes
> a patch-id for the overall patch, compute one for each hunk.  Then
> look for other commits that have one of their patch hunks match. This
> could result in a many-to-many relationship between downstream and
> upstream patches)
> 
> We generated a report based on this (on a wiki so folks could edit and
> add notes), with html links and such.  For each downstream commit, we
> added links to all potential upstream commit(s) and included reasons
> why each commit was thought to be a potential match.  Further, for any
> potential upstream commit that might match (and which wasn't a 100%
> certainty pick from "git cherry"), we also looked at which filename(s)
> were modified in both commits and reported on the number of matching
> and non-matching filename(s) between the pair in addition to the
> number of matching and non-matching patch hunks.
> 
> Basically, it was a large amount of information to allow humans to
> review how similar the commits were to help them determine if the
> changes from the commit were already (partially or fully) included
> upstream.
> 
> I got lots of requests to find as many links as possible, because with
> unfortunately frequent regularity, a new report would be created for
> some branch and dozens of teams of developers were being forced to
> review the reports and sign off on every single patch and whether it
> needed to be upstreamed, or even finish being upstreamed (and to do
> the upstreaming work if so).  I got a fair number of comments and
> questions on the reports as a result.
> 
> I'm glad we no longer need this report due to switching to a much more
> sane testing/backporting/delivery/branching/etc. story, but years ago
> this report was helpful.  Anyway...
> 
> A few surprises I found:
>   * Patches could be partially upstreamed.  For example, someone
> cherry-picked something from a newer version of upstream than their
> current branch was based off of, then found and amended important
> fixes into that commit.  So although the patch might appear to be
> upstream (because the commit messages matched, the author
> name/email/dates match, etc.), important fixes in it were not.
>   * I was surprised by the number of cases that were not one-to-one
> mappings.  One patch might be upstreamed via several patches on the
> main branch.  Or parts of several patches may have been upstreamed as
> one big combination commit upstream.  I knew theoretically there could
> be some like this, and perhaps wasn't too surprised that there would
> be at least one, but there were quite a few more than I expected.
>   * One of the weirder cases I remember: Someone had created changes
> by going into the Gerrit UI, picking an existing but obsolete/closed
> code review, editing the code in ways that had absolutely nothing to
> do with the original commit they started on (and perhaps on some other
> branch?), and left the Change-Id around because it looked valid and
> they didn't know what it meant anyway.  (Something about their laptop
> being broken and being unable to edit code locally, so they just
> edited "in the cloud" and let the automated regression tests verify
> their changes; Gerrit either didn't have a way to start a new commit
> at the time or the developer didn't find it, so they just edited
> something else.)  I knew about it because I found weird pairs of
> non-matching things and this developer left a note about it in their
> edited commit message just in case anything weird happened.
> 
> But let me be more explicit about Change-Ids: we didn't use them at
> all.  It came up multiple times as a question, but in my looking into
> useful factors, to me they seemed to provide no extra value and I had
> found multiple cases where they seemed to be misleading.  The purpose
> of the report was to avoid more disasters from forgetting to backport
> commits.  This report seems like the kind of thing that Change-Ids
> were invented for, and we already had Change-Ids in all commits due to
> using Gerrit, but from my exploration I didn't trust the Change-Ids to
> provide net-positive value, so I simply didn't use them at all.
> 
> </long story>

if you are into long stories I have enother one.

<long story>
I am maintaining a code review system that I inherited that comes from
time when code review systems weren't cool, or even a thing at all. It's
basically a bunch of perls scripts that run as git hooks, cron jobs, and
CGI pages.

It is used to maintain some backports and original development on top of
upstream kernel releases.

The design is in some ways interesting. One particular interesting
decision which was made some time in ancient past is that the changes
are maintained as quilt patch series rather than modified sources.

Not sure what was the rationale behind the decision. Surely you could
save a lot of space if you had few changes. Also pre-git you would get
the release tarball, the point release patches, and any local changes as
patches on top so it makes sense to track the files you have together
with the series file that says how to use them, and you cannot get
easily away from tracking patches because that's what you get from
upstream.

This leads to an interesting pattern - you are not tracking changes to
source code but changes to history of changes to source code. At any
time you can render the quilt series as a git history that starts with
the upstream release and adds commits on top, in order specified in the
series file. You can add and drop commits in the middle, update tags,
whatever. And it can all be nice linear development in the repository
that tracks the quilt series.

Initially you would get a lot of big patches similar to the upstream
point releasee patches - like a patch that diffs the changes in a
particular driver between two kernel versions, with some additional
fixes on top. Clearly an unamanageable, untrackable, many-to-many patch
relationship to the upstream development.

However, over time this became a problem with more and more patches
piling up, base kernel version updates causing regressions, etc, etc

So over time some rules emerged. When backporting something from
upstream add each upstream git commit as separate patch file. Tag it
with the upstream git commit ID (change ID, yay). Add additional fixes
as separate patch files, and send them upstream. etc, etc.

In the end 1:1 relationship emerged with some rare exceptions.

Sure, there are upstream subsystems that routinely cherry-pick patches

Sure, there are upstream merge commits that introduce random unrelated
changes.

Nonetheless, the saner the upstream subsystem maintenanace and the the
saner the downstream branch maintenance the closer you get to 1:1 patch
relationship.
</long story>

If I can derive anything from these stories it is that while arbitrary
many to many relationsips between patches are possible they are not at
all desirable. Then it is not a bug for change tracking to not support
such relationship, you can even see it as a safety against people
shooting themselves in the foot.

If you are developing a feature over many revisions before it gets
merged upstream it can happen that you squash some patches that were
initially separate, and later split them differently. However, it does
not happen as one operation. So the many to many problem is not
something worth solving, at least for sane workflows.

Thanks

Michal

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Feature request: provide a persistent IDs on a commit
  2022-07-19 11:09     ` Ævar Arnfjörð Bjarmason
  2022-07-19 11:57       ` Michal Suchánek
@ 2022-07-29 12:11       ` Stephen Finucane
  2022-07-29 12:40         ` Jason Pyeron
  1 sibling, 1 reply; 29+ messages in thread
From: Stephen Finucane @ 2022-07-29 12:11 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: git

On Tue, 2022-07-19 at 13:09 +0200, Ævar Arnfjörð Bjarmason wrote:
> On Tue, Jul 19 2022, Stephen Finucane wrote:
> 
> > On Mon, 2022-07-18 at 20:50 +0200, Ævar Arnfjörð Bjarmason wrote:
> > > On Mon, Jul 18 2022, Stephen Finucane wrote:
> > > 
> > > > ...to track evolution of a patch through time.
> > > > 
> > > > tl;dr: How hard would it be to retrofit an 'ChangeID' concept à la the 'Change-
> > > > ID' trailer used by Gerrit into git core?
> > > > 
> > > > Firstly, apologies in advance if this is the wrong forum to post a feature
> > > > request. I help maintain the Patchwork project [1], which a web-based tool that
> > > > provides a mechanism to track the state of patches submitted to a mailing list
> > > > and make sure stuff doesn't slip through the crack. One of our long-term goals
> > > > has been to track the evolution of an individual patch through multiple
> > > > revisions. This is surprisingly hard goal because oftentimes there isn't a whole
> > > > lot to work with. One can try to guess whether things are the same by inspecting
> > > > the metadata of the commit (subject, author, commit message, and the diff
> > > > itself) but each of these metadata items are subject to arbitrary changes and
> > > > are therefore fallible.
> > > > 
> > > > One of the mechanisms I've seen used to address this is the 'Change-ID' trailer
> > > > used by Gerrit. For anyone that hasn't seen this, the Gerrit server provides a
> > > > git commit hook that you can install locally. When installed, this appends a
> > > > 'Change-ID' trailer to each and every commit message. In this way, the evolution
> > > > of a patch (or a "change", in Gerrit parlance) can be tracked through time since
> > > > the Change ID provides an authoritative answer to the question "is this still
> > > > the same patch". Unfortunately, there are still some obvious downside to this
> > > > approach. Not only does this additional trailer clutter your commit messages but
> > > > it's also something the user must install themselves. While Gerrit can insist
> > > > that this is installed before pushing a change, this isn't an option for any of
> > > > the common forges nor is it something git-send-email supports.
> > > 
> > > git format-patch+send-email will send your trailers along as-is, how
> > > doesn't it support Change-Id. Does it need some support that any other
> > > made-up trailer doesn't?
> > 
> > It supports sending the trailers, sure. What it doesn't support is insisting you
> > send this specific trailer (Change-Id). Only Gerrit can do this (server side,
> > thankfully, which means you don't need to ask all contributors to install this
> > hook if you want to rely on it for tooling, CI, etc.).
> 
> Ah, it's still unclear to me what you're proposing here though. That
> send-email always (generates?) or otherwise insists on the trailer, that
> it can be configured ot add it?
>
> That send-email have some "pre-send-email" hook? Something else?
 
(Apologies for the delayed response: I was on holiday).

I'm afraid I don't have the correct terminology to describe what I'm suggesting
so I'll show an example instead.

I have configured the 'fuller' pretty formatter locally:

   $ git config format.pretty
   fuller

When I do git log on e.g. the openstack nova repo, I see:

   commit 2709e30956b53be1dca91eec801220f0efbaed93
   Author:     Stephen Finucane <sfinucan@redhat.com>
   AuthorDate: Thu Jul 14 15:43:40 2022 +0100
   Commit:     Stephen Finucane <sfinucan@redhat.com>
   CommitDate: Mon Jul 18 12:30:25 2022 +0100
   
       Fix compatibility with jsonschema 4.x
       
       This changed one of the error messages we depend on [1].
       
       [1] https://github.com/python-jsonschema/jsonschema/commit/641e9b8c
       
       Change-Id: I643ec568ee2eb2ec1a555f813fd2f1acff915afa
       Signed-off-by: Stephen Finucane <sfinucan@redhat.com>

(Side note: What *is the term for the "Author", "AuthorDate", "Commit" and
"CommitDate" fields? Commit header? Commit metadata? Something else?)

My thinking is there are two types of information here: information that relates
to the "commiting" of this change and information that relates to the
"authorship" of the this change. The commit ID, 'Commit' and 'CommitDate' fields
clearly form the commit parts. I'm arguing that it would be good to have an
equivalent to the commit ID field for the authorship-type metadata.
   
   commit 2709e30956b53be1dca91eec801220f0efbaed93
   Author:     Stephen Finucane <sfinucan@redhat.com>
   AuthorDate: Thu Jul 14 15:43:40 2022 +0100
   AuthorID:   I643ec568ee2eb2ec1a555f813fd2f1acff915afa
   Commit:     Stephen Finucane <sfinucan@redhat.com>
   CommitDate: Mon Jul 18 12:30:25 2022 +0100
   
       Fix compatibility with jsonschema 4.x
       
       This changed one of the error messages we depend on [1].
       
       [1] https://github.com/python-jsonschema/jsonschema/commit/641e9b8c
       
       Signed-off-by: Stephen Finucane <sfinucan@redhat.com>

At risk of repeating myself, I think this information would be valuable to allow
me to answer the question "is this the same[*] commit?". During code review,
this would allow me to track the evolution of an individual patch. Once a patch
is merged, it would allow me to track the backporting or cherry-picking of that
patch between branches (in a more reliable fashion than the "cherry picked from"
trailer that one can add with the '-x' flag).

Now I do realize that there will be issues with this. As has been noted
elsewhere in the thread, people do split patches up or merge them together, and
a patch can change so drastically during review that it doesn't resemble the
original patch in any way. However, I'd argue that in both cases the presence of
these persistent IDs would at least leave a breadcrumb trail for either tooling
or humans to follow. Similarly, it is possible for users to mess things up by
resetting or reusing the persistent ID fields, but as has been noted elsewhere
in this thread this is already an issue with the existing Author* fields (which
many users likely don't know about) yet I couldn't imagine anyone wanting to get
rid of these. It's an education thing.

> I'd think for projects that care about this they're likely to have a
> centralized enough workflow that it can be checked on the remote side,
> whether that's some sanity check on the applier's "git am" pipeline, or
> a "pre-receive" hook.

Yeah, as above I'm hoping this would form part of the core metadata of a commit
rather than a trailer or something. Tools like Gerrit could of course do
validation on this but that's outside the scope of what I'm looking at.

> > > > I imagine most people working with mailing list based workflows have their own
> > > > client side tooling to support this while software forges like GitHub and GitLab
> > > > simply don't bother tracking version history between individual commits in a
> > > > pull/merge request.
> > > 
> > > It's far from ideal, but at least GitLab shows a diff on a push to a MR,
> > > including if it's force-pushed. I'm not sure about GitHub.
> > 
> > GitHub does not. Simply piling multiple additional "fix" commits onto the PR
> > branch results in a less horrible review experience since you can maintain
> > context, alas at the cost of a rotten git log. We don't need to debate the pros
> > and cons of the various forges though :)
> 
> Yes, I'm only mentioning it because it's worth looking at existing
> "solutions" that are in use in the wild, however flawed those may be.
> 
> > > > IMO though, it would be fantastic if third party tools
> > > > weren't necessary though. What I suspect we want is a persistent ID (or rather
> > > > UUID) that never changes regardless of how many times a patch is cherry-picked,
> > > > rebased, or otherwise modified, similar to the Author and AuthorDate fields.
> > > > Like Author and AuthorDate, it would be part of the core git commit metadata
> > > > rather than something in the commit message like Signed-Off-By or Change-ID.
> > > > 
> > > > Has such an idea ever been explored? Is it even possible? Would it be broadly
> > > > useful?
> > > 
> > > This has come up a bunch of times. I think that the thing git itself
> > > should be doing is to lean into the same notion that we use for tracking
> > > renames. I.e. we don't, we analyze history after-the-fact and spot the
> > > renames for you.
> > 
> > Any idea where I'd find previous discussions on this? I did look, and the only
> > proposal I found was an old one that seemed to suggest including the Change-Id
> > commit-msg hook with git itself which is not what I'm suggesting here.
> 
> At the time I was punting on finding the links, and just working off
> vague recollection, and hoping you'd go list spelunking.
> 
> But I since recalled some details, I think the most relevant thing is
> this discussion about a "git evolve":
> 
>     https://lore.kernel.org/git/CAPL8ZivFmHqS2y+WmNR6faRMnuahiqwPVYsV99NiJ1QLHOs9fQ@mail.gmail.com/
> 
> Which I think you'll find useful, especially as mercurial has an
> existing implementation. The wider context for that "git evolve" is (I
> believe) people at Google who maintain Gerrit trying to "upstream" the
> Change-Id.
> 
> Now, it hasn't landed in git.git, and it's been a few years, but going
> through the details of why it fizzled out will be useful to you, if
> you're interested in driving something like this forward.

Yeah to be clear I'm not suggesting tracking anything like this in Git core. My
main request is here is a persistent Author ID field. Commits as they are would
remain the same: we'd just be able to show the evolution of a "change" in
external tooling without the need for separate trailers.

> There's also these two proposals from Eric Raymond:
> 
> 	https://lore.kernel.org/git/20190515191605.21D394703049@snark.thyrsus.com/

This however, looks more similar to what I'm proposing. If understand this
correctly (I'm still reading the full thread), Eric is proposing allowing two
ways to reference a commit: the hash and a sort of alias. There would still be a
1:1 mapping though, which is explicitly not what I want. I'm also not suggesting
generating this stuff server-side. It should be part of the commit when
initially created, just like Author and AuthorDate.

> 	https://lore.kernel.org/git/20190521013250.3506B470485F@snark.thyrsus.com/
> 
> Which I'm linking to here not because I think they're viable, as you can
> see from my participation in those threads I think what he suggested is
> an architectural dead end as far as git is concerned.
> 
> But rather because it's conceptually adjacent (you could in principle
> use nanosecond timestamps as a poor man's UUID), and much of the
> follow-up discussion is about format changes in general, and if/when
> those might be viable.
> 
> > > We have some of that in git already, as git-patch-id, and more recently
> > > git-range-diff. Both are flawed in a bunch of ways, and it's easy to run
> > > into edge cases where they don't spot something that they "should"
> > > have. Where "should" exists in the mind of the user.
> > 
> > That's a fair point and is of course what we (Patchwork) have to do currently.
> > Patchwork can track relations between individual patches but doesn't attempt to
> > generate these relations itself. Instead, we rely on third-party tooling. The
> > PaStA tool was one such example of a tool that could do this [1]. I can't
> > imagine a tool like Gerrit would ever work without this concept of an
> > authoritative (and arbitrary) identifier to track a patch's identity through
> > time, hence its reliance on the Change-Id trailer.
> 
> I haven't used Gerrit or Patchwork, so much of this is from ignorance on
> that front, but I have spent a lot of time thinking about this in the
> context of git in general.
> 
> I think as users of git go the git project itself makes very heavy use
> of this, i.e. sequences of patches are substantially rewritten, split,
> squashed etc. all the time, or even split into two or more sets of
> submissions.
> 
> Having said all that I can't see how a Change-Id isn't a Bad Idea(TM)
> for all the same reasons that pre-git SCMs file formats that track
> renames explicitly were a bad idea.
> 
> I.e. yes you can come up with cases where that's "better" than what git
> does, but they didn't handle splitting/merging files etc.
> 
> Similarly what happens when you have 3 patches each with their own
> Change-Id and you split them into 4 patches. Is the Change-Id 1=1 or
> 1=many. I'm suggesting that you'd want a solution that can be many=many.
> 
> And also, that those many=many should be dynamically configurable and
> inferred after the fact. E.g. range-diff will commits that are similar
> enough that two authors with no knowledge of each other independently
> came up with.

I touched on the splitting/merging of changes above but just to reiterate, I
don't think this is an issue. I'm using Gerrit for OpenStack-related efforts nad
mailing lists (with Patchwork tracking submissions) elsewhere. Patches are
frequently split and merged as part of a review process and can often be merged
as part of a backport (I've yet to see a patch split up when backporting but it
could happen too). If a patch is split, the original patch retains the 'Change-
ID' as well as 'Author' and 'AuthorDate' fields while the split out patch(es)
get new versions of these. If one or more patches are squashed, you get the
'Change-ID' and 'Author'/'AuthorDate' of the first patch in the series of
squashed patches. In both cases though, some Change-ID persists which means you
can track the evolution of a patch or series through time. These are all
extremely helpful breadcrumbs for reviewers.

Regarding the rename issue, I agree that this isn't something Git should do
either. As you note, it's too hard to do 100% reliably, which would be expected
goal. I'm not looking for 100% reliability here. I just want a better breadcrumb
than e.g. range-diff currently provides.

> I think that range-diff is still lacking in a lot of ways, in particular:
> 
>  * It matches entire commits (log + diff) on a similarity score, I've
>    often wanted a way to "weigh" it, so e.g. a matching hunk would have
>    3x the matching score of a matching commit message.
> 
>    Now it often "gives up", you can give it a higher --creation-factor,
>    but that's "global", so for a large range you'll often start
>    including irrelevant things as well.
> 
>  * It only does 1=1 attribution, and e.g. currently can't find/represent
>    a case where a commit with 3 hunks got split into two commits, with 2
>    and 1 hunks, respectively. It'll (usually) show a diff to the new 2
>    hunk commit, but the "new" 1 hunk will be shown as new.
> 
>    We could continue to drill down and find such "unattributed" hunks.
> 
> > Perhaps we could flip this on its head. What would be the _downsides_ of
> > providing a persistent, arbitrary identifier on a commit similar to Author and
> > AuthorDate fields? There's obviously some work involved in implementing it but
> > assuming that was already done, what would break/be worse as a result?
> 
> That "Repository formats matter", to borrow a phrase from a classic post
> about git[1]. Once you provide a way to do something it will be used,
> and when that something has inherent limitations (think SCM rename
> tracking) used to the exclusion of others.
> 
> You can't provide something like that as an opt-in and "upstream" it
> without it inevetably trickling into a lot of areas of Git's UX.
> 
> To continue the rename example, now you can just re-arrange your source
> tree and not worry about micro-managing it with "git mv" (in the "svn
> mv" sense), git will figure it out after the fact.
> 
> That's a sinificant UX benefit, we can provide a *much simpler* UX as a
> result.
> 
> What would be the harm of an optional "rename tracking" header? After
> all the heuristic sometimes "fails".
> 
> The harm would be that if you really wanted to lean into that (even
> optionally) you'd be forced to add that to all sorts of tooling, not
> just the cheap convenience that is "git mv" currently.
> 
> Likewise everything from "cherry-pick" to "rebase" to "commit" would
> inevitably have to learn some way to know about, carry forward and ask
> the user about Change-Id's and their preservation. Don't you think so?

These are all valid points. Hopefully my points above regarding the similarity
to the Author and AuthorDate fields helps though.

> Otherwise they'd be much too easy to lose track of, and if they only
> reason we did all that is because we didn't think enough about the "work
> it out after" approach that would be a bad investment of time.
> 
> But I may be wrong about all of that, I think one thing that would
> really help clarify this & similar proposals is if people pushing it
> forward came up with some basic tests for it, i.e. just something like
> a:
> 
>     series-v1/
>     series-v2/
> 
> Where those two directories would be the "git format-patch" output (or
> whatever) of two versions of a series that Gerrit or Patchwork are now
> managing, along with some (plain text?) manual mapping of which things
> in v1 correspond to v2.
> 
> We could then compare how that manual attribution performs v.s. trying
> to find which things match (range-diff) afterwards.

I hope my examples above helped with this, but I can prepare a sample series
(including a sample 'git log' output) if you'd like. Just let me know where
you'd like it sent.

Cheers,
Stephen


> 
> 1. https://keithp.com/blog/Repository_Formats_Matter/
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: Feature request: provide a persistent IDs on a commit
  2022-07-29 12:11       ` Stephen Finucane
@ 2022-07-29 12:40         ` Jason Pyeron
  0 siblings, 0 replies; 29+ messages in thread
From: Jason Pyeron @ 2022-07-29 12:40 UTC (permalink / raw)
  To: 'Stephen Finucane',
	'Ævar Arnfjörð Bjarmason'
  Cc: git

> From: Stephen Finucane
> Sent: Friday, July 29, 2022 8:11 AM
> 
> On Tue, 2022-07-19 at 13:09 +0200, Ævar Arnfjörð Bjarmason wrote:
> > On Tue, Jul 19 2022, Stephen Finucane wrote:
> >
> > > On Mon, 2022-07-18 at 20:50 +0200, Ævar Arnfjörð Bjarmason wrote:
> > > > On Mon, Jul 18 2022, Stephen Finucane wrote:
> > > >
> > > > > ...to track evolution of a patch through time.
> > > > >
> > > > > tl;dr: How hard would it be to retrofit an 'ChangeID' concept à la the 'Change-
> > > > > ID' trailer used by Gerrit into git core?
> > > > >
> > > > > Firstly, apologies in advance if this is the wrong forum to post a feature
> > > > > request. I help maintain the Patchwork project [1], which a web-based tool that
> > > > > provides a mechanism to track the state of patches submitted to a mailing list
> > > > > and make sure stuff doesn't slip through the crack. One of our long-term goals
> > > > > has been to track the evolution of an individual patch through multiple
> > > > > revisions. This is surprisingly hard goal because oftentimes there isn't a whole
> > > > > lot to work with. One can try to guess whether things are the same by inspecting
> > > > > the metadata of the commit (subject, author, commit message, and the diff
> > > > > itself) but each of these metadata items are subject to arbitrary changes and
> > > > > are therefore fallible.
> > > > >
> > > > > One of the mechanisms I've seen used to address this is the 'Change-ID' trailer
> > > > > used by Gerrit. For anyone that hasn't seen this, the Gerrit server provides a
> > > > > git commit hook that you can install locally. When installed, this appends a
> > > > > 'Change-ID' trailer to each and every commit message. In this way, the evolution
> > > > > of a patch (or a "change", in Gerrit parlance) can be tracked through time since
> > > > > the Change ID provides an authoritative answer to the question "is this still
> > > > > the same patch". Unfortunately, there are still some obvious downside to this
> > > > > approach. Not only does this additional trailer clutter your commit messages but
> > > > > it's also something the user must install themselves. While Gerrit can insist
> > > > > that this is installed before pushing a change, this isn't an option for any of
> > > > > the common forges nor is it something git-send-email supports.
> > > >
> > > > git format-patch+send-email will send your trailers along as-is, how
> > > > doesn't it support Change-Id. Does it need some support that any other
> > > > made-up trailer doesn't?
> > >
> > > It supports sending the trailers, sure. What it doesn't support is insisting you
> > > send this specific trailer (Change-Id). Only Gerrit can do this (server side,
> > > thankfully, which means you don't need to ask all contributors to install this
> > > hook if you want to rely on it for tooling, CI, etc.).
> >
> > Ah, it's still unclear to me what you're proposing here though. That
> > send-email always (generates?) or otherwise insists on the trailer, that
> > it can be configured ot add it?
> >
> > That send-email have some "pre-send-email" hook? Something else?
> 
> (Apologies for the delayed response: I was on holiday).
> 
> I'm afraid I don't have the correct terminology to describe what I'm suggesting
> so I'll show an example instead.
> 
> I have configured the 'fuller' pretty formatter locally:
> 
>    $ git config format.pretty
>    fuller
> 
> When I do git log on e.g. the openstack nova repo, I see:
> 
>    commit 2709e30956b53be1dca91eec801220f0efbaed93
>    Author:     Stephen Finucane <sfinucan@redhat.com>
>    AuthorDate: Thu Jul 14 15:43:40 2022 +0100
>    Commit:     Stephen Finucane <sfinucan@redhat.com>
>    CommitDate: Mon Jul 18 12:30:25 2022 +0100
> 
>        Fix compatibility with jsonschema 4.x
> 
>        This changed one of the error messages we depend on [1].
> 
>        [1] https://github.com/python-jsonschema/jsonschema/commit/641e9b8c
> 
>        Change-Id: I643ec568ee2eb2ec1a555f813fd2f1acff915afa
>        Signed-off-by: Stephen Finucane <sfinucan@redhat.com>
> 
> (Side note: What *is the term for the "Author", "AuthorDate", "Commit" and
> "CommitDate" fields? Commit header? Commit metadata? Something else?)
> 
> My thinking is there are two types of information here: information that relates
> to the "commiting" of this change and information that relates to the
> "authorship" of the this change. The commit ID, 'Commit' and 'CommitDate' fields
> clearly form the commit parts. I'm arguing that it would be good to have an
> equivalent to the commit ID field for the authorship-type metadata.
> 
>    commit 2709e30956b53be1dca91eec801220f0efbaed93
>    Author:     Stephen Finucane <sfinucan@redhat.com>
>    AuthorDate: Thu Jul 14 15:43:40 2022 +0100
>    AuthorID:   I643ec568ee2eb2ec1a555f813fd2f1acff915afa
>    Commit:     Stephen Finucane <sfinucan@redhat.com>
>    CommitDate: Mon Jul 18 12:30:25 2022 +0100
> 
>        Fix compatibility with jsonschema 4.x
> 
>        This changed one of the error messages we depend on [1].
> 
>        [1] https://github.com/python-jsonschema/jsonschema/commit/641e9b8c
> 
>        Signed-off-by: Stephen Finucane <sfinucan@redhat.com>
> 
> At risk of repeating myself, I think this information would be valuable to allow
> me to answer the question "is this the same[*] commit?". During code review,
> this would allow me to track the evolution of an individual patch. Once a patch
> is merged, it would allow me to track the backporting or cherry-picking of that

We have been toying with this. We are looking at a field (behaves like parent) to track "original commit".

This value would be set on first rebase, amend, cherry-pick, etc.

The bonus for us will be when we patch gerrit to consume it and git log --graph --somenewoption to use it.

It would be nice if git core did add such value.

-Jason


^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2022-07-29 12:50 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-18 17:18 Feature request: provide a persistent IDs on a commit Stephen Finucane
2022-07-18 17:35 ` Konstantin Ryabitsev
2022-07-18 19:04   ` Michal Suchánek
2022-07-19 10:57     ` Stephen Finucane
2022-07-18 21:24   ` Glen Choo
2022-07-20 19:21     ` Konstantin Ryabitsev
2022-07-20 19:30       ` Michal Suchánek
2022-07-20 22:10       ` Theodore Ts'o
2022-07-21 11:57         ` Han-Wen Nienhuys
2022-07-24  5:09     ` Elijah Newren
2022-07-18 18:50 ` Ævar Arnfjörð Bjarmason
2022-07-19 10:47   ` Stephen Finucane
2022-07-19 11:09     ` Ævar Arnfjörð Bjarmason
2022-07-19 11:57       ` Michal Suchánek
2022-07-29 12:11       ` Stephen Finucane
2022-07-29 12:40         ` Jason Pyeron
2022-07-21 16:18   ` Phillip Susi
2022-07-21 18:58     ` Hilco Wijbenga
2022-07-22 20:08       ` Philip Oakley
2022-07-22 20:36         ` Michal Suchánek
2022-07-22 22:46           ` Jacob Keller
2022-07-23  7:00             ` Michal Suchánek
2022-07-24  5:23               ` Elijah Newren
2022-07-24  8:54                 ` Michal Suchánek
2022-07-25 21:47                 ` Jacob Keller
2022-07-26  3:49                   ` Elijah Newren
2022-07-26  8:43                     ` Michal Suchánek
2022-07-24  5:10           ` Elijah Newren
2022-07-24  8:59             ` Michal Suchánek

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).