git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Stephen Finucane <stephen@that.guru>
To: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: Feature request: provide a persistent IDs on a commit
Date: Fri, 29 Jul 2022 13:11:24 +0100	[thread overview]
Message-ID: <bf871a430177ced6d628641eac9d478389fb6c2b.camel@that.guru> (raw)
In-Reply-To: <220719.86y1wpuy5o.gmgdl@evledraar.gmail.com>

On Tue, 2022-07-19 at 13:09 +0200, Ævar Arnfjörð Bjarmason wrote:
> On Tue, Jul 19 2022, Stephen Finucane wrote:
> 
> > On Mon, 2022-07-18 at 20:50 +0200, Ævar Arnfjörð Bjarmason wrote:
> > > On Mon, Jul 18 2022, Stephen Finucane wrote:
> > > 
> > > > ...to track evolution of a patch through time.
> > > > 
> > > > tl;dr: How hard would it be to retrofit an 'ChangeID' concept à la the 'Change-
> > > > ID' trailer used by Gerrit into git core?
> > > > 
> > > > Firstly, apologies in advance if this is the wrong forum to post a feature
> > > > request. I help maintain the Patchwork project [1], which a web-based tool that
> > > > provides a mechanism to track the state of patches submitted to a mailing list
> > > > and make sure stuff doesn't slip through the crack. One of our long-term goals
> > > > has been to track the evolution of an individual patch through multiple
> > > > revisions. This is surprisingly hard goal because oftentimes there isn't a whole
> > > > lot to work with. One can try to guess whether things are the same by inspecting
> > > > the metadata of the commit (subject, author, commit message, and the diff
> > > > itself) but each of these metadata items are subject to arbitrary changes and
> > > > are therefore fallible.
> > > > 
> > > > One of the mechanisms I've seen used to address this is the 'Change-ID' trailer
> > > > used by Gerrit. For anyone that hasn't seen this, the Gerrit server provides a
> > > > git commit hook that you can install locally. When installed, this appends a
> > > > 'Change-ID' trailer to each and every commit message. In this way, the evolution
> > > > of a patch (or a "change", in Gerrit parlance) can be tracked through time since
> > > > the Change ID provides an authoritative answer to the question "is this still
> > > > the same patch". Unfortunately, there are still some obvious downside to this
> > > > approach. Not only does this additional trailer clutter your commit messages but
> > > > it's also something the user must install themselves. While Gerrit can insist
> > > > that this is installed before pushing a change, this isn't an option for any of
> > > > the common forges nor is it something git-send-email supports.
> > > 
> > > git format-patch+send-email will send your trailers along as-is, how
> > > doesn't it support Change-Id. Does it need some support that any other
> > > made-up trailer doesn't?
> > 
> > It supports sending the trailers, sure. What it doesn't support is insisting you
> > send this specific trailer (Change-Id). Only Gerrit can do this (server side,
> > thankfully, which means you don't need to ask all contributors to install this
> > hook if you want to rely on it for tooling, CI, etc.).
> 
> Ah, it's still unclear to me what you're proposing here though. That
> send-email always (generates?) or otherwise insists on the trailer, that
> it can be configured ot add it?
>
> That send-email have some "pre-send-email" hook? Something else?
 
(Apologies for the delayed response: I was on holiday).

I'm afraid I don't have the correct terminology to describe what I'm suggesting
so I'll show an example instead.

I have configured the 'fuller' pretty formatter locally:

   $ git config format.pretty
   fuller

When I do git log on e.g. the openstack nova repo, I see:

   commit 2709e30956b53be1dca91eec801220f0efbaed93
   Author:     Stephen Finucane <sfinucan@redhat.com>
   AuthorDate: Thu Jul 14 15:43:40 2022 +0100
   Commit:     Stephen Finucane <sfinucan@redhat.com>
   CommitDate: Mon Jul 18 12:30:25 2022 +0100
   
       Fix compatibility with jsonschema 4.x
       
       This changed one of the error messages we depend on [1].
       
       [1] https://github.com/python-jsonschema/jsonschema/commit/641e9b8c
       
       Change-Id: I643ec568ee2eb2ec1a555f813fd2f1acff915afa
       Signed-off-by: Stephen Finucane <sfinucan@redhat.com>

(Side note: What *is the term for the "Author", "AuthorDate", "Commit" and
"CommitDate" fields? Commit header? Commit metadata? Something else?)

My thinking is there are two types of information here: information that relates
to the "commiting" of this change and information that relates to the
"authorship" of the this change. The commit ID, 'Commit' and 'CommitDate' fields
clearly form the commit parts. I'm arguing that it would be good to have an
equivalent to the commit ID field for the authorship-type metadata.
   
   commit 2709e30956b53be1dca91eec801220f0efbaed93
   Author:     Stephen Finucane <sfinucan@redhat.com>
   AuthorDate: Thu Jul 14 15:43:40 2022 +0100
   AuthorID:   I643ec568ee2eb2ec1a555f813fd2f1acff915afa
   Commit:     Stephen Finucane <sfinucan@redhat.com>
   CommitDate: Mon Jul 18 12:30:25 2022 +0100
   
       Fix compatibility with jsonschema 4.x
       
       This changed one of the error messages we depend on [1].
       
       [1] https://github.com/python-jsonschema/jsonschema/commit/641e9b8c
       
       Signed-off-by: Stephen Finucane <sfinucan@redhat.com>

At risk of repeating myself, I think this information would be valuable to allow
me to answer the question "is this the same[*] commit?". During code review,
this would allow me to track the evolution of an individual patch. Once a patch
is merged, it would allow me to track the backporting or cherry-picking of that
patch between branches (in a more reliable fashion than the "cherry picked from"
trailer that one can add with the '-x' flag).

Now I do realize that there will be issues with this. As has been noted
elsewhere in the thread, people do split patches up or merge them together, and
a patch can change so drastically during review that it doesn't resemble the
original patch in any way. However, I'd argue that in both cases the presence of
these persistent IDs would at least leave a breadcrumb trail for either tooling
or humans to follow. Similarly, it is possible for users to mess things up by
resetting or reusing the persistent ID fields, but as has been noted elsewhere
in this thread this is already an issue with the existing Author* fields (which
many users likely don't know about) yet I couldn't imagine anyone wanting to get
rid of these. It's an education thing.

> I'd think for projects that care about this they're likely to have a
> centralized enough workflow that it can be checked on the remote side,
> whether that's some sanity check on the applier's "git am" pipeline, or
> a "pre-receive" hook.

Yeah, as above I'm hoping this would form part of the core metadata of a commit
rather than a trailer or something. Tools like Gerrit could of course do
validation on this but that's outside the scope of what I'm looking at.

> > > > I imagine most people working with mailing list based workflows have their own
> > > > client side tooling to support this while software forges like GitHub and GitLab
> > > > simply don't bother tracking version history between individual commits in a
> > > > pull/merge request.
> > > 
> > > It's far from ideal, but at least GitLab shows a diff on a push to a MR,
> > > including if it's force-pushed. I'm not sure about GitHub.
> > 
> > GitHub does not. Simply piling multiple additional "fix" commits onto the PR
> > branch results in a less horrible review experience since you can maintain
> > context, alas at the cost of a rotten git log. We don't need to debate the pros
> > and cons of the various forges though :)
> 
> Yes, I'm only mentioning it because it's worth looking at existing
> "solutions" that are in use in the wild, however flawed those may be.
> 
> > > > IMO though, it would be fantastic if third party tools
> > > > weren't necessary though. What I suspect we want is a persistent ID (or rather
> > > > UUID) that never changes regardless of how many times a patch is cherry-picked,
> > > > rebased, or otherwise modified, similar to the Author and AuthorDate fields.
> > > > Like Author and AuthorDate, it would be part of the core git commit metadata
> > > > rather than something in the commit message like Signed-Off-By or Change-ID.
> > > > 
> > > > Has such an idea ever been explored? Is it even possible? Would it be broadly
> > > > useful?
> > > 
> > > This has come up a bunch of times. I think that the thing git itself
> > > should be doing is to lean into the same notion that we use for tracking
> > > renames. I.e. we don't, we analyze history after-the-fact and spot the
> > > renames for you.
> > 
> > Any idea where I'd find previous discussions on this? I did look, and the only
> > proposal I found was an old one that seemed to suggest including the Change-Id
> > commit-msg hook with git itself which is not what I'm suggesting here.
> 
> At the time I was punting on finding the links, and just working off
> vague recollection, and hoping you'd go list spelunking.
> 
> But I since recalled some details, I think the most relevant thing is
> this discussion about a "git evolve":
> 
>     https://lore.kernel.org/git/CAPL8ZivFmHqS2y+WmNR6faRMnuahiqwPVYsV99NiJ1QLHOs9fQ@mail.gmail.com/
> 
> Which I think you'll find useful, especially as mercurial has an
> existing implementation. The wider context for that "git evolve" is (I
> believe) people at Google who maintain Gerrit trying to "upstream" the
> Change-Id.
> 
> Now, it hasn't landed in git.git, and it's been a few years, but going
> through the details of why it fizzled out will be useful to you, if
> you're interested in driving something like this forward.

Yeah to be clear I'm not suggesting tracking anything like this in Git core. My
main request is here is a persistent Author ID field. Commits as they are would
remain the same: we'd just be able to show the evolution of a "change" in
external tooling without the need for separate trailers.

> There's also these two proposals from Eric Raymond:
> 
> 	https://lore.kernel.org/git/20190515191605.21D394703049@snark.thyrsus.com/

This however, looks more similar to what I'm proposing. If understand this
correctly (I'm still reading the full thread), Eric is proposing allowing two
ways to reference a commit: the hash and a sort of alias. There would still be a
1:1 mapping though, which is explicitly not what I want. I'm also not suggesting
generating this stuff server-side. It should be part of the commit when
initially created, just like Author and AuthorDate.

> 	https://lore.kernel.org/git/20190521013250.3506B470485F@snark.thyrsus.com/
> 
> Which I'm linking to here not because I think they're viable, as you can
> see from my participation in those threads I think what he suggested is
> an architectural dead end as far as git is concerned.
> 
> But rather because it's conceptually adjacent (you could in principle
> use nanosecond timestamps as a poor man's UUID), and much of the
> follow-up discussion is about format changes in general, and if/when
> those might be viable.
> 
> > > We have some of that in git already, as git-patch-id, and more recently
> > > git-range-diff. Both are flawed in a bunch of ways, and it's easy to run
> > > into edge cases where they don't spot something that they "should"
> > > have. Where "should" exists in the mind of the user.
> > 
> > That's a fair point and is of course what we (Patchwork) have to do currently.
> > Patchwork can track relations between individual patches but doesn't attempt to
> > generate these relations itself. Instead, we rely on third-party tooling. The
> > PaStA tool was one such example of a tool that could do this [1]. I can't
> > imagine a tool like Gerrit would ever work without this concept of an
> > authoritative (and arbitrary) identifier to track a patch's identity through
> > time, hence its reliance on the Change-Id trailer.
> 
> I haven't used Gerrit or Patchwork, so much of this is from ignorance on
> that front, but I have spent a lot of time thinking about this in the
> context of git in general.
> 
> I think as users of git go the git project itself makes very heavy use
> of this, i.e. sequences of patches are substantially rewritten, split,
> squashed etc. all the time, or even split into two or more sets of
> submissions.
> 
> Having said all that I can't see how a Change-Id isn't a Bad Idea(TM)
> for all the same reasons that pre-git SCMs file formats that track
> renames explicitly were a bad idea.
> 
> I.e. yes you can come up with cases where that's "better" than what git
> does, but they didn't handle splitting/merging files etc.
> 
> Similarly what happens when you have 3 patches each with their own
> Change-Id and you split them into 4 patches. Is the Change-Id 1=1 or
> 1=many. I'm suggesting that you'd want a solution that can be many=many.
> 
> And also, that those many=many should be dynamically configurable and
> inferred after the fact. E.g. range-diff will commits that are similar
> enough that two authors with no knowledge of each other independently
> came up with.

I touched on the splitting/merging of changes above but just to reiterate, I
don't think this is an issue. I'm using Gerrit for OpenStack-related efforts nad
mailing lists (with Patchwork tracking submissions) elsewhere. Patches are
frequently split and merged as part of a review process and can often be merged
as part of a backport (I've yet to see a patch split up when backporting but it
could happen too). If a patch is split, the original patch retains the 'Change-
ID' as well as 'Author' and 'AuthorDate' fields while the split out patch(es)
get new versions of these. If one or more patches are squashed, you get the
'Change-ID' and 'Author'/'AuthorDate' of the first patch in the series of
squashed patches. In both cases though, some Change-ID persists which means you
can track the evolution of a patch or series through time. These are all
extremely helpful breadcrumbs for reviewers.

Regarding the rename issue, I agree that this isn't something Git should do
either. As you note, it's too hard to do 100% reliably, which would be expected
goal. I'm not looking for 100% reliability here. I just want a better breadcrumb
than e.g. range-diff currently provides.

> I think that range-diff is still lacking in a lot of ways, in particular:
> 
>  * It matches entire commits (log + diff) on a similarity score, I've
>    often wanted a way to "weigh" it, so e.g. a matching hunk would have
>    3x the matching score of a matching commit message.
> 
>    Now it often "gives up", you can give it a higher --creation-factor,
>    but that's "global", so for a large range you'll often start
>    including irrelevant things as well.
> 
>  * It only does 1=1 attribution, and e.g. currently can't find/represent
>    a case where a commit with 3 hunks got split into two commits, with 2
>    and 1 hunks, respectively. It'll (usually) show a diff to the new 2
>    hunk commit, but the "new" 1 hunk will be shown as new.
> 
>    We could continue to drill down and find such "unattributed" hunks.
> 
> > Perhaps we could flip this on its head. What would be the _downsides_ of
> > providing a persistent, arbitrary identifier on a commit similar to Author and
> > AuthorDate fields? There's obviously some work involved in implementing it but
> > assuming that was already done, what would break/be worse as a result?
> 
> That "Repository formats matter", to borrow a phrase from a classic post
> about git[1]. Once you provide a way to do something it will be used,
> and when that something has inherent limitations (think SCM rename
> tracking) used to the exclusion of others.
> 
> You can't provide something like that as an opt-in and "upstream" it
> without it inevetably trickling into a lot of areas of Git's UX.
> 
> To continue the rename example, now you can just re-arrange your source
> tree and not worry about micro-managing it with "git mv" (in the "svn
> mv" sense), git will figure it out after the fact.
> 
> That's a sinificant UX benefit, we can provide a *much simpler* UX as a
> result.
> 
> What would be the harm of an optional "rename tracking" header? After
> all the heuristic sometimes "fails".
> 
> The harm would be that if you really wanted to lean into that (even
> optionally) you'd be forced to add that to all sorts of tooling, not
> just the cheap convenience that is "git mv" currently.
> 
> Likewise everything from "cherry-pick" to "rebase" to "commit" would
> inevitably have to learn some way to know about, carry forward and ask
> the user about Change-Id's and their preservation. Don't you think so?

These are all valid points. Hopefully my points above regarding the similarity
to the Author and AuthorDate fields helps though.

> Otherwise they'd be much too easy to lose track of, and if they only
> reason we did all that is because we didn't think enough about the "work
> it out after" approach that would be a bad investment of time.
> 
> But I may be wrong about all of that, I think one thing that would
> really help clarify this & similar proposals is if people pushing it
> forward came up with some basic tests for it, i.e. just something like
> a:
> 
>     series-v1/
>     series-v2/
> 
> Where those two directories would be the "git format-patch" output (or
> whatever) of two versions of a series that Gerrit or Patchwork are now
> managing, along with some (plain text?) manual mapping of which things
> in v1 correspond to v2.
> 
> We could then compare how that manual attribution performs v.s. trying
> to find which things match (range-diff) afterwards.

I hope my examples above helped with this, but I can prepare a sample series
(including a sample 'git log' output) if you'd like. Just let me know where
you'd like it sent.

Cheers,
Stephen


> 
> 1. https://keithp.com/blog/Repository_Formats_Matter/
> 


  parent reply	other threads:[~2022-07-29 12:11 UTC|newest]

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-07-18 17:18 Feature request: provide a persistent IDs on a commit Stephen Finucane
2022-07-18 17:35 ` Konstantin Ryabitsev
2022-07-18 19:04   ` Michal Suchánek
2022-07-19 10:57     ` Stephen Finucane
2022-07-18 21:24   ` Glen Choo
2022-07-20 19:21     ` Konstantin Ryabitsev
2022-07-20 19:30       ` Michal Suchánek
2022-07-20 22:10       ` Theodore Ts'o
2022-07-21 11:57         ` Han-Wen Nienhuys
2022-07-24  5:09     ` Elijah Newren
2022-07-18 18:50 ` Ævar Arnfjörð Bjarmason
2022-07-19 10:47   ` Stephen Finucane
2022-07-19 11:09     ` Ævar Arnfjörð Bjarmason
2022-07-19 11:57       ` Michal Suchánek
2022-07-29 12:11       ` Stephen Finucane [this message]
2022-07-29 12:40         ` Jason Pyeron
2022-07-21 16:18   ` Phillip Susi
2022-07-21 18:58     ` Hilco Wijbenga
2022-07-22 20:08       ` Philip Oakley
2022-07-22 20:36         ` Michal Suchánek
2022-07-22 22:46           ` Jacob Keller
2022-07-23  7:00             ` Michal Suchánek
2022-07-24  5:23               ` Elijah Newren
2022-07-24  8:54                 ` Michal Suchánek
2022-07-25 21:47                 ` Jacob Keller
2022-07-26  3:49                   ` Elijah Newren
2022-07-26  8:43                     ` Michal Suchánek
2022-07-24  5:10           ` Elijah Newren
2022-07-24  8:59             ` Michal Suchánek

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=bf871a430177ced6d628641eac9d478389fb6c2b.camel@that.guru \
    --to=stephen@that.guru \
    --cc=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).