[idea] File history tracking hints

git@vger.kernel.org mailing list mirror (one of many)

* [idea] File history tracking hints
@ 2017-09-11  7:11 Pavel Kretov
  2017-09-11 18:11 ` Stefan Beller
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Pavel Kretov @ 2017-09-11  7:11 UTC (permalink / raw)
  To: git

Hi all,

Excuse me if the topic I'm going to raise here has already been discussed
on the mailing list, forums, or IRC, but I couldn't find anything related.

The problem:

Git, being "a stupid content tracker", doesn't try to keep an eye on
operations that happen to individual files; things like file renames
aren't recorded at commit time, but are heuristically detected later.

Unfortunately, the heuristic can only deal with simple file renames with
no substantial content changes; it's helpless when you:

 - rename a file and change its content significantly;
 - split a single file into several files;
 - merge several files into one;
 - copy an entire file from another commit, or do other things like these.

However, if we're able to preserve this information, it's possible
not only to do more accurate 'git blame', but also merge revisions with
fewer conflicts.

The proposal:

The idea is to let the user give hints about what was changed in
the commit. For example, if the user did a rename which wasn't automatically
detected, they would append something like the following to the commit
message:

    Tracking-hint: rename dev-vcs/git/git-1.0.ebuild -> dev-vcs/git/git-2.0.ebuild

or (if full paths of affected files can be unambiguously omitted):

    Tracking-hint: rename git-1.0.ebuild -> git-2.0.ebuild

There may be other hint types:

    Tracking-hint: recreate LICENSE.txt
    Tracking-hint: split main.c -> main.c cmdline.c
    Tracking-hint: merge linalg.py <- vector.py matrix.py

or even something like this:

    Tracking-hint: copy json.py <- libs/json.py@4db88291251151d8c5c8e4f20430fa4def2cb2ed

If a file transformation cannot be described by a single tracking hint, it
should be possible to specify a sequence of hints at once:

    Tracking-hint:
        split Utils.java -> AppHelpers.java StringHelpers.java
        recreate Utils.java

Note that in the above example the order of operations really matters, so
both lines have to reside in one 'Tracking-hint' block.
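
To make the syntax above concrete, here is a minimal sketch of how such hints
could be read back out of a commit message. It is only an illustration: the
grammar is inferred from the examples in this mail, the names `Hint`,
`parse_hint` and `parse_message` are made up, and nothing like this exists in
Git. Paths containing spaces or arrows are not handled.

```python
from dataclasses import dataclass

KEY = "Tracking-hint:"  # key name taken from the proposal, not a Git feature


@dataclass
class Hint:
    op: str         # e.g. "rename", "split", "merge", "copy", "recreate"
    sources: list   # paths the content comes from
    dests: list     # paths the content ends up in


def parse_hint(text):
    """Parse one hint such as 'split main.c -> main.c cmdline.c'."""
    op, _, rest = text.strip().partition(" ")
    if "->" in rest:                      # sources on the left
        left, right = rest.split("->", 1)
        sources, dests = left.split(), right.split()
    elif "<-" in rest:                    # destinations on the left
        left, right = rest.split("<-", 1)
        dests, sources = left.split(), right.split()
    else:                                 # e.g. 'recreate LICENSE.txt'
        sources, dests = [], rest.split()
    return Hint(op, sources, dests)


def parse_message(message):
    """Return a list of hint blocks; each block is an ordered list of Hints."""
    blocks, in_block = [], False
    for line in message.splitlines():
        if line.startswith(KEY):
            blocks.append([])
            in_block = True
            first = line[len(KEY):].strip()
            if first:
                blocks[-1].append(parse_hint(first))
        elif in_block and line[:1].isspace() and line.strip():
            # Indented continuation line: another hint in the same block.
            blocks[-1].append(parse_hint(line))
        else:
            in_block = False
    return blocks
```

Keeping each block as an ordered list preserves the "split, then recreate"
ordering that the Utils.java example depends on.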

* * *

What do you think, is this idea worth implementing?
Any other thoughts on this?

-- Pavel Kretov.

* Re: [idea] File history tracking hints
  2017-09-11  7:11 [idea] File history tracking hints Pavel Kretov
@ 2017-09-11 18:11 ` Stefan Beller
  2017-09-11 18:47   ` Jacob Keller
  2017-09-11 18:41 ` Jeff King
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 18+ messages in thread
From: Stefan Beller @ 2017-09-11 18:11 UTC (permalink / raw)
  To: Pavel Kretov; +Cc: git@vger.kernel.org

On Mon, Sep 11, 2017 at 12:11 AM, Pavel Kretov <firegurafiku@gmail.com> wrote:
> [...]

This was discussed a couple of times on the mailing list
(though not recently).

I searched for "rename tracking files site:public-inbox.org/git"
and came up with
https://public-inbox.org/git/Pine.LNX.4.58.0504141102430.7211@ppc970.osdl.org/
(the nearby emails seem to also be relevant to this discussion)

tl;dr: When encoding these hints, you do it at commit time,
but the heuristic can be improved upon later.
So you can assume the heuristic is better for the
common case, as someone will fix the heuristic for the
common case. Also, Git's model is to track objects.

* Re: [idea] File history tracking hints
  2017-09-11  7:11 [idea] File history tracking hints Pavel Kretov
  2017-09-11 18:11 ` Stefan Beller
@ 2017-09-11 18:41 ` Jeff King
  2017-09-11 20:09 ` Igor Djordjevic
  2017-09-11 21:48 ` Philip Oakley
  3 siblings, 0 replies; 18+ messages in thread
From: Jeff King @ 2017-09-11 18:41 UTC (permalink / raw)
  To: Pavel Kretov; +Cc: git

On Mon, Sep 11, 2017 at 10:11:31AM +0300, Pavel Kretov wrote:

> Unfortunately, the heuristic can only deal with simple file renames with
> no substantial content changes; it's helpless when you:
> 
>  - rename file and change it's content significantly;
>  - split single file into several files;
>  - merge several files into another;
>  - copy entire file from another commit, and do other things like these.
> 
> However, if we're able to preserve this information, it's possible
> not only to do more accurate 'git blame', but also merge revisions with
> fewer conflicts.

This is definitely something that's been discussed before on the list
(though I'm not sure of the best keywords to dig for; Stefan found one
thread but I know there have been others).

And I don't think it's a totally unreasonable idea, but there are some
complications. The biggest one is that renames are really part of a
_diff_ between two endpoints. We think of them as attached to a commit
because we tend to talk about commits as a diff from state A to state B.

So obviously in the diff HEAD^ versus HEAD, we can look at the hints for
HEAD. But what about "git diff v1.0 v1.1", that may cover multiple
commits? Right now Git doesn't look at the intermediate commits at all.
And in fact we may not even know what they are, if the command is fed
two trees. Or the two endpoints may not have a sensible history (e.g.,
consider diffing between two branches, one of which has been rebased).

But even if we had a sensible set of commits to pull hints from (e.g.,
if v1.0 and v1.1 were in a linear relationship), it's not clear to me
how you would want to apply them to an end-to-end diff.

So I don't think that these kinds of tracking hints make sense for a lot
of diffs (including merges, which use diffs between the endpoints and
the merge base).

Which isn't to say that they're useless. I agree that something like
"--follow" could benefit from an annotation that tells us when and how
to pick up the next step in the traversal. But of course somebody has to
make those annotations. If we had a tool to do it automatically, then we
could apply the same tool at run-time later.

But maybe if it were an optional annotation, people would want to use it
when the normal rename logic doesn't kick in. So perhaps a baby step in
this direction would be to teach something like "--follow" to "jump"
across a non-rename when it sees a special marking in the commit
message.
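
As a rough illustration of that baby step, here is a toy model of a
"--follow"-style traversal that jumps to another path when a commit carries a
hint. This is a sketch under stated assumptions, not how Git's revision walk
works: the history is a plain newest-first list, and the 'id', 'touched' and
'hints' keys are hypothetical stand-ins for data that would come from the
commit and its message.

```python
def follow(history, path):
    """Yield (commit_id, path) for each commit touching the followed path.

    `history` is a newest-first list of dicts with hypothetical keys:
      'id'      - commit identifier
      'touched' - set of paths changed by the commit
      'hints'   - mapping of new path -> old path, taken from the message
    """
    for commit in history:
        if path in commit["touched"]:
            yield commit["id"], path
            # If the commit says where this content came from, keep
            # following the old path in older commits.
            path = commit["hints"].get(path, path)


history = [
    {"id": "c3", "touched": {"cmdline.c"}, "hints": {}},
    {"id": "c2", "touched": {"main.c", "cmdline.c"},
     "hints": {"cmdline.c": "main.c"}},   # cmdline.c was split out of main.c
    {"id": "c1", "touched": {"main.c"}, "hints": {}},
]

for commit_id, name in follow(history, "cmdline.c"):
    print(commit_id, name)
```

The hint is only consulted at the commit that carries it, so this sidesteps
the endpoint-diff problem described above: it never has to explain an
arbitrary diff between two trees.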

-Peff

* Re: [idea] File history tracking hints
  2017-09-11 18:11 ` Stefan Beller
@ 2017-09-11 18:47   ` Jacob Keller
  0 siblings, 0 replies; 18+ messages in thread
From: Jacob Keller @ 2017-09-11 18:47 UTC (permalink / raw)
  To: Stefan Beller; +Cc: Pavel Kretov, git@vger.kernel.org

On Mon, Sep 11, 2017 at 11:11 AM, Stefan Beller <sbeller@google.com> wrote:
> On Mon, Sep 11, 2017 at 12:11 AM, Pavel Kretov <firegurafiku@gmail.com> wrote:
>> [...]
>
> This was discussed a couple of times on the mailing list
> (though not recently).
>
> I searched for "rename tracking files site:public-inbox.org/git"
> and came up with
> https://public-inbox.org/git/Pine.LNX.4.58.0504141102430.7211@ppc970.osdl.org/
> (the nearby emails seem to also be relevant to this discussion)
>
> tl:dr: When encoding these hints, you do it at commit time,
> but the heuristic can be improved upon later.
> So you can assume the heuristic is better for the
> common case, as someone will fix the heuristic for the
> common case. Also Gits model is to track objects.

Linus has a pretty long post about this, it's somewhere in that
discussion. Essentially, if you bake in rename detection (or other
hints) at commit time, then you're stuck with it forever.

Additionally, there are similar but not *quite* identical operations
which you probably wouldn't bake in at the start, and the types of
questions a user wants to ask aren't known at commit time, but rather
at *debug* time in the future when you're digging up history. At that
point, the user does know what to care about and what kind of questions
to ask, so it's natural to ask them then.

Additionally, if you have to run the heuristic at every commit,
you're increasing time "wasted" on every commit, whereas doing the
lookup later, when a user starts asking questions during blame or
diff, would only add time to an operation the user already expects
to take some time.

Thanks,
Jake

* Re: [idea] File history tracking hints
  2017-09-11  7:11 [idea] File history tracking hints Pavel Kretov
  2017-09-11 18:11 ` Stefan Beller
  2017-09-11 18:41 ` Jeff King
@ 2017-09-11 20:09 ` Igor Djordjevic
  2017-09-11 21:48 ` Philip Oakley
  3 siblings, 0 replies; 18+ messages in thread
From: Igor Djordjevic @ 2017-09-11 20:09 UTC (permalink / raw)
  To: Pavel Kretov, git

Hi Pavel,

On 11/09/2017 09:11, Pavel Kretov wrote:
> [...]

Here[1] you can find Linus' reply (from 2005-04-15) to the "rename
tracking" discussion, usually quoted to explain the Git philosophy on
this point, and even referred to as "one of the most important messages
in the list archive"[2] by Junio himself.

[1] https://public-inbox.org/git/Pine.LNX.4.58.0504150753440.7211@ppc970.osdl.org/
[2] https://public-inbox.org/git/xmqqr30qflk9.fsf@gitster.mtv.corp.google.com/

Regards,
Buga

* Re: [idea] File history tracking hints
  2017-09-11  7:11 [idea] File history tracking hints Pavel Kretov
                   ` (2 preceding siblings ...)
  2017-09-11 20:09 ` Igor Djordjevic
@ 2017-09-11 21:48 ` Philip Oakley
  2017-09-13 11:38   ` Johannes Schindelin
  3 siblings, 1 reply; 18+ messages in thread
From: Philip Oakley @ 2017-09-11 21:48 UTC (permalink / raw)
  To: Pavel Kretov, git

From: "Pavel Kretov" <firegurafiku@gmail.com>
> [...]

Maybe use the "interpret-trailers" methods for standardising your hints
locally (in your team / workplace) to see how it goes and flesh out what
works and what doesn't. Trying to decide, a priori, what the right hints
are is likely to be the hard part.
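
A minimal sketch of what that local experiment could look like, driving
`git interpret-trailers` from a small script. The `--trailer` and `--parse`
options are existing ones (`--parse` needs a reasonably recent Git); the
`Tracking-hint` token and the helper names `add_hint` and `read_hints` are
assumptions made for this example only.

```python
import subprocess

TOKEN = "Tracking-hint"  # trailer name from the proposal; Git does not know it


def add_hint(message, hint):
    """Return `message` with one Tracking-hint trailer appended by Git."""
    done = subprocess.run(
        ["git", "interpret-trailers", "--trailer", f"{TOKEN}: {hint}"],
        input=message, capture_output=True, text=True, check=True,
    )
    return done.stdout


def read_hints(message):
    """Return the values of all Tracking-hint trailers found in `message`."""
    done = subprocess.run(
        ["git", "interpret-trailers", "--parse"],
        input=message, capture_output=True, text=True, check=True,
    )
    prefix = TOKEN + ": "
    return [line[len(prefix):]
            for line in done.stdout.splitlines()
            if line.startswith(prefix)]
```

Something like `add_hint` could be called from a commit-msg hook, and
`read_hints` from whatever tool wants to consume the hints later.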
--
Philip 


* Re: [idea] File history tracking hints
  2017-09-11 21:48 ` Philip Oakley
@ 2017-09-13 11:38   ` Johannes Schindelin
  2017-09-14 23:22     ` Philip Oakley
  0 siblings, 1 reply; 18+ messages in thread
From: Johannes Schindelin @ 2017-09-13 11:38 UTC (permalink / raw)
  To: Philip Oakley; +Cc: Pavel Kretov, git

Hi Philip,

On Mon, 11 Sep 2017, Philip Oakley wrote:

> From: "Pavel Kretov" <firegurafiku@gmail.com>
> > [...]
> 
> Maybe use the "interpret-trailers" methods for standardising your hints
> locally (in your team / workplace) to see how it goes and flesh out what works
> and what doesn't. Trying to decide, a-priori, what are the right hints is
> likely to be the hard part.

I think this adds a very valuable insight to this discussion: the current
state of Git's rename handling is based on the idea that you either record
the renames, or you detect them. Like, there is either "on" or "off". No
middle ground.

However, if you understand that there is also the possibility of hints
that can help any erroneous rename detection (and *everybody* who
seriously worked on a massive code base has seen that rename detection
fail in the most inopportune ways [*1*]), then you are on to something.

So I totally like the idea of introducing hints, possibly as trailers in
the commit message (or as refs/notes/rename/* or whatever) that can be
picked up by Git versions that know about them, and can be ignored by Git
versions that insist on the rename detection du jour. With a config option
to control the behavior, maybe, too.

Ciao,
Dscho

Footnote *1*: Just to name a couple of examples from my personal
experience, off the top of my head:

- license boilerplates often let Git detect renames/copies where there
  are none,

- even something as trivial as moving Java classes (and their dependent
  classes) between packages changes every line referring to said packages,
  causing Git's rename detection to go for a drink instead of doing its
  job,

- indentation changes overwhelm Git's rename detection,

- when rename detection would matter most, like, really a lot, to lift the
  burden of the human beings in front of the computer poring over
  hundreds of thousands of files moved from one directory tree to another,
  that's exactly when Git's rename detection says that there are too many
  files, here are my union rights, I am going home, good luck to you.

In light of such experiences, I have to admit that the notion that the
rename detection can always be improved in hindsight adds quite a bit of
insult to injury for those developers who are bitten by it.

* Re: [idea] File history tracking hints
  2017-09-13 11:38   ` Johannes Schindelin
@ 2017-09-14 23:22     ` Philip Oakley
  2017-09-29 23:12       ` Johannes Schindelin
  0 siblings, 1 reply; 18+ messages in thread
From: Philip Oakley @ 2017-09-14 23:22 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Pavel Kretov, git

From: "Johannes Schindelin" <Johannes.Schindelin@gmx.de>
> Hi Philip,
>
> On Mon, 11 Sep 2017, Philip Oakley wrote:
>
>> From: "Pavel Kretov" <firegurafiku@gmail.com>
>> > [...]
>>
>> Maybe use the "interpret-trailers" methods for standardising your hints
>> locally (in your team / workplace) to see how it goes and flesh out what 
>> works
>> and what doesn't. Trying to decide, a-priori, what are the right hints is
>> likely to be the hard part.
>
> I think this adds a very valuable insight to this discussion: the current
> state of Git's rename handling is based on the idea that you either record
> the renames, or you detect them. Like, there is either "on" or "off". No
> middle ground.
>
> However, if you understand that there is also the possibility of hints
> that can help any erroneous rename detection (and *everybody* who
> seriously worked on a massive code base has seen that rename detection
> fail in the most inopportune ways [*1*]), then you are on to something.
>
> So I totally like the idea of introducing hints, possibly as trailers in
> the commit message (or as refs/notes/rename/* or whatever) that can be
> picked up by Git versions that know about them, and can be ignored by Git
> versions that insist on the rename detection du jour. With a config option
> to control the behavior, maybe, too.
>
> Ciao,
> Dscho
>
> Footnote *1*: Just to name a couple of examples from my personal
> experience, off the top of my head:
>
> - license boiler plates often let Git detect renames/copies where there
>  are none,
>
> - even something as trivial as moving Java classes (and their dependent
>  classes) between packages changes every line referring to said packages,
>  causing Git's rename detection to go for a drink instead of doing its
>  job,
>
> - indentation changes overwhelm Git's rename detection,
>
> - when rename detection would matter most, like, really a lot, to lift the
>  burden of the human beings in front of the computer pouring over
>  hundreds of thousands of files moved from one directory tree to another,
>  that's exactly when Git's rename detection says that there are too many
>  files, here are my union rights, I am going home, good luck to you.
>
> In light of such experiences, I have to admit that the notion that the
> rename detection can always be improved in hindsight puts quite a bit of
> insult to injury for those developers who are bitten by it.

Your list made me think that the hints should be directed toward what may be 
considered existing solutions for those specific awkward cases.

So the hints could be (by type):
- template;licence;boiler-plate;standard;reference :: copy
- word-rename
- regex for word substitution changes (e.g. which chars are within 'Word-_0')
- regex for white-space changes (i.e. which chars are considered whitespace)
- move-dir path/glob spec
- move-file path/glob spec
(maybe list each 'group' of moves, so that once found the rest of the rename
detection follows the group.)

Once the particular hint is detected (path qualified) then the clue/hint is 
used to assist in parsing the files to simplify the comparison task and 
locate common lines, or common word patterns.

The first example is just a set of alternate terms folk use for the new
duplicate-file case.

The second is a hint that there have been a number of fairly global name
changes in the files, so not only do a word diff but also detect and
summarise those global changes (your class move example).

The third is the simpler global word changes, based on a limited char
set for a 'word' token list.

The fourth is where we are focussed on the white-space part (complementing
the word token viewpoint).
The move hints are lists of path specs that each have distinctly moved.

It may be possible to order the hints as well, so that the detections work 
in the right order, giving the heuristics a better chance!
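
To show the kind of assistance meant here, a small sketch: normalise both
versions according to the hints, then compare. This is only an illustration.
`difflib` is a stand-in for Git's own similarity scoring, and the `similarity`
function with its `ignore_indent` and `word_map` parameters is hypothetical.

```python
import difflib
import re


def similarity(old, new, *, ignore_indent=False, word_map=None):
    """Line-based similarity in [0, 1] after applying optional hints.

    ignore_indent - treat leading/trailing whitespace as insignificant
    word_map      - mapping of old word -> new word (a 'word-rename' hint)
    """
    def normalise(text):
        lines = text.splitlines()
        if ignore_indent:
            lines = [line.strip() for line in lines]
        if word_map:
            pattern = re.compile(
                r"\b(" + "|".join(map(re.escape, word_map)) + r")\b")
            lines = [pattern.sub(lambda m: word_map[m.group(1)], line)
                     for line in lines]
        return lines

    return difflib.SequenceMatcher(
        None, normalise(old), normalise(new)).ratio()


old = "package com.old;\nclass A {\n  int x;\n}\n"
new = "package com.new;\nclass A {\n    int x;\n}\n"

plain = similarity(old, new)
hinted = similarity(old, new, ignore_indent=True,
                    word_map={"com.old": "com.new"})
print(plain, hinted)
```

With both hints applied, the package move and the re-indentation no longer
count as differences, so the two versions compare as identical.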

--
Philip
 


* Re: [idea] File history tracking hints
  2017-09-14 23:22     ` Philip Oakley
@ 2017-09-29 23:12       ` Johannes Schindelin
  2017-09-30  8:02         ` Jeff Hostetler
  0 siblings, 1 reply; 18+ messages in thread
From: Johannes Schindelin @ 2017-09-29 23:12 UTC (permalink / raw)
  To: Philip Oakley; +Cc: Pavel Kretov, git

Hi Philip,

On Fri, 15 Sep 2017, Philip Oakley wrote:

> From: "Johannes Schindelin" <Johannes.Schindelin@gmx.de>
>
> > In light of such experiences, I have to admit that the notion that the
> > rename detection can always be improved in hindsight puts quite a bit of
> > insult to injury for those developers who are bitten by it.
> 
> Your list made me think that the hints should be directed toward what may be
> considered existing solutions for those specific awkward cases.
> 
> So the hints could be (by type):
> - template;licence;boiler-plate;standard;reference :: copy
> - word-rename
> - regex for word substitution changes (e.g. which chars are within 'Word-_0`)
> - regex for white-space changes (i.e. which chars are considered whitespace.)
> - move-dir path/glob spec
> - move-file path/glob spec
> (maybe list each 'group' of moves, so that once found the rest of the rename
> detection follows the group.)
> 
> Once the particular hint is detected (path qualified) then the clue/hint is
> used to assist in parsing the files to simplify the comparison task and locate
> common lines, or common word patterns.
> 
> The first example is just a set of alternate terms folk use for the new
> duplicate file file case.
> 
> The second is a hint that there has been a number of fairly global name
> changes in the files. so not only do a word diff but detect & sumarise those
> global changes. (your class move example)
> 
> The third is the more simple global word changes, based on a limited char set
> for a 'word' token list.
> The fourth is where we are focussed on the white space part (complementing the
> word token viewpoint)
> 
> The move hints are lists of path specs that each have distinctly moved.
> 
> It may be possible to order the hints as well, so that the detections work in
> the right order, giving the heuristics a better chance!

I think my point was: no matter how likely we thought any heuristic rename
detection can be perfected over time, history proved that suspicion
incorrect.

Therefore, it would be good to have a way to tell Git about renames
explicitly so that it does not even need to use its heuristics.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [idea] File history tracking hints
  2017-09-29 23:12       ` Johannes Schindelin
@ 2017-09-30  8:02         ` Jeff Hostetler
  2017-09-30 15:11           ` Johannes Schindelin
  2017-10-01  3:27           ` Junio C Hamano
  0 siblings, 2 replies; 18+ messages in thread
From: Jeff Hostetler @ 2017-09-30  8:02 UTC (permalink / raw)
  To: Johannes Schindelin, Philip Oakley; +Cc: Pavel Kretov, git



On 9/29/2017 7:12 PM, Johannes Schindelin wrote:
> Hi Philip,
> 
> On Fri, 15 Sep 2017, Philip Oakley wrote:
> 
>> From: "Johannes Schindelin" <Johannes.Schindelin@gmx.de>
>>
>>> In light of such experiences, I have to admit that the notion that the
>>> rename detection can always be improved in hindsight puts quite a bit of
>>> insult to injury for those developers who are bitten by it.
>>
>> Your list made me think that the hints should be directed toward what may be
>> considered existing solutions for those specific awkward cases.
>>
>> So the hints could be (by type):
>> - template;licence;boiler-plate;standard;reference :: copy
>> - word-rename
>> - regex for word substitution changes (e.g. which chars are within 'Word-_0`)
>> - regex for white-space changes (i.e. which chars are considered whitespace.)
>> - move-dir path/glob spec
>> - move-file path/glob spec
>> (maybe list each 'group' of moves, so that once found the rest of the rename
>> detection follows the group.)
>>
>> Once the particular hint is detected (path qualified) then the clue/hint is
>> used to assist in parsing the files to simplify the comparison task and locate
>> common lines, or common word patterns.
>>
>> The first example is just a set of alternate terms folk use for the new
>> duplicate file file case.
>>
>> The second is a hint that there has been a number of fairly global name
>> changes in the files. so not only do a word diff but detect & sumarise those
>> global changes. (your class move example)
>>
>> The third is the more simple global word changes, based on a limited char set
>> for a 'word' token list.
>> The fourth is where we are focussed on the white space part (complementing the
>> word token viewpoint)
>>
>> The move hints are lists of path specs that each have distinctly moved.
>>
>> It may be possible to order the hints as well, so that the detections work in
>> the right order, giving the heuristics a better chance!
> 
> I think my point was: no matter how likely we thought any heuristic rename
> detection can be perfected over time, history proved that suspicion
> incorrect.
> 
> Therefore, it would be good to have a way to tell Git about renames
> explicitly so that it does not even need to use its heuristics.

Agreed.

It would be nice if every file (and tree) had a permanent GUID
associated with it.  Then the filename/pathname becomes a property
of the GUIDs.  Then you can exactly know about moves/renames with
minimal effort (and no guessing).  But I suppose that ship has sailed...

Jeff


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [idea] File history tracking hints
  2017-09-30  8:02         ` Jeff Hostetler
@ 2017-09-30 15:11           ` Johannes Schindelin
  2017-10-01  3:27           ` Junio C Hamano
  1 sibling, 0 replies; 18+ messages in thread
From: Johannes Schindelin @ 2017-09-30 15:11 UTC (permalink / raw)
  To: Jeff Hostetler; +Cc: Philip Oakley, Pavel Kretov, git

Hi Jeff,

On Sat, 30 Sep 2017, Jeff Hostetler wrote:

> On 9/29/2017 7:12 PM, Johannes Schindelin wrote:
>
> > Therefore, it would be good to have a way to tell Git about renames
> > explicitly so that it does not even need to use its heuristics.
> 
> Agreed.
> 
> It would be nice if every file (and tree) had a permanent GUID
> associated with it.  Then the filename/pathname becomes a property
> of the GUIDs.  Then you can exactly know about moves/renames with
> minimal effort (and no guessing).  But I suppose that ship has sailed...

Yes, that ship has sailed.

But we still could teach Git to understand certain "hints" (that would be
really more like "cluebats").

So while we cannot have any GUIDs that are persistent across renames/moves
(and which users would probably get wrong all the time by using
third-party tools that are not Git-rename aware), we have unique
identifiers: the object names.

And we could easily have a lookup table of pairs of object names, telling
Git that they were source and target of a rename. When Git would try to
figure out whether anything was renamed, it would first look at that
lookup table and save itself a lot of work (and opportunity to fail) and
short-cut the rename detection.
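
A minimal sketch of such a lookup table, in Python. The names
`detect_renames`, `hints` and `heuristic` are hypothetical and the object
names are placeholders; nothing here is an existing Git interface. The
recorded pairs are consulted first, and the similarity heuristic only sees
whatever is left over.

```python
def detect_renames(deleted, added, hints, heuristic):
    """Pair up renamed paths, consulting explicit hints before guessing.

    deleted / added: dict mapping path -> blob object name
    hints:           set of (source blob name, target blob name) pairs
    heuristic:       fallback callable taking the unmatched dicts and
                     returning a list of (old path, new path) pairs
    """
    pairs = []
    remaining_deleted = dict(deleted)
    remaining_added = dict(added)
    for old_path, old_blob in deleted.items():
        for new_path, new_blob in list(remaining_added.items()):
            if (old_blob, new_blob) in hints:
                # A recorded pair: no similarity computation needed.
                pairs.append((old_path, new_path))
                del remaining_deleted[old_path]
                del remaining_added[new_path]
                break
    # Only the paths that no hint explained reach the heuristic.
    pairs.extend(heuristic(remaining_deleted, remaining_added))
    return pairs
```

Because the table is keyed by object names rather than paths or commits,
the same entry applies wherever that pair of blobs shows up.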

Ciao,
Johannes

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [idea] File history tracking hints
  2017-09-30  8:02         ` Jeff Hostetler
  2017-09-30 15:11           ` Johannes Schindelin
@ 2017-10-01  3:27           ` Junio C Hamano
  2017-10-02 17:41             ` Stefan Beller
  1 sibling, 1 reply; 18+ messages in thread
From: Junio C Hamano @ 2017-10-01  3:27 UTC (permalink / raw)
  To: Jeff Hostetler; +Cc: Johannes Schindelin, Philip Oakley, Pavel Kretov, git

Jeff Hostetler <git@jeffhostetler.com> writes:

> On 9/29/2017 7:12 PM, Johannes Schindelin wrote:
>
>> Therefore, it would be good to have a way to tell Git about renames
>> explicitly so that it does not even need to use its heuristics.
>
> Agreed.
>
> It would be nice if every file (and tree) had a permanent GUID
> associated with it.  Then the filename/pathname becomes a property
> of the GUIDs.  Then you can exactly know about moves/renames with
> minimal effort (and no guessing).

I actually like the idea to have a mechanism where the user can give
hint to influence, or instruction to dictate, how Git determines
"this old path moved to this new path" when comparing two trees.  A
human would not consider a new file (e.g. header file) that begins
with a few dozen commonly-seen boilerplate lines (e.g. copyright
statement) followed by several lines unique to the new contents to
be a rename of a disappearing old file that begins with the same
boilerplate followed by several lines that are different from what
is in the new file, but Git's algorithm would give equal weight to
all of these lines when deciding how similar the new file is to the
old file, and can misidentify a new file to be a rename of an old
file that is unrelated.  Even when Git can and does determine the
pairing correctly, it would be a win if we do not have to recompute
the same pairing every time.  So both as hint and as cache, such a
mechanism would make sense [*1*].
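
The boilerplate problem can be illustrated with a deliberately simplified
score, sketched below in Python. This is an assumption-laden stand-in that
counts shared lines with equal weight; it is not Git's actual similarity
computation, and the file contents are made up.

```python
from collections import Counter


def similarity(old_lines, new_lines):
    """Fraction of lines shared by two files, every line weighted equally."""
    common = sum((Counter(old_lines) & Counter(new_lines)).values())
    return common / max(len(old_lines), len(new_lines))


# Two dozen boilerplate lines (e.g. a copyright statement) ...
boilerplate = [f"/* copyright line {i} */" for i in range(24)]
# ... followed by a few lines unique to each, otherwise unrelated, header.
old_file = boilerplate + ["int foo(void);", "int bar(void);", "int baz(void);"]
new_file = boilerplate + ["void qux(int);", "void quux(int);", "void corge(int);"]

score = similarity(old_file, new_file)
unique_only = similarity(old_file[24:], new_file[24:])
```

Here 24 of 27 lines match, so the score is well above a 50% threshold even
though the unique parts of the two files have nothing in common.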

But "file ID" does not have any place to contribute to such a
mechanism.  Each of two developers working on the same project in a
distributed environment can grab the same gist and create a new file
in his or her tree, perhaps at the same path or at a different
path.  At the time of such an addition, there is no way for each of
them to give these two files the same "file ID" (that is how the
world works in the distributed environment after all)---which "file
ID" should survive when their two histories finally meet and result
in a single file after a merge?  A file with "file ID" may not be
renamed but may be copied and evolve separately and differently.
Which one should inherit its original "file ID" and how does having
"file ID" help us identify the other one is equally related to the
original file?  These two are merely examples of the problems "file ID"s
would cause while solving "only" what can be expressed in "git diff -M"
output (the latter illustrates that it does not even help showing
"git diff -C").

And when we stop limiting ourselves to the whole-file renames and
copies (which can be expressed in "git diff" output) but also want
to help finer-grained operation like "git blame", we'd want to have
something that helps in situations like a single file's contents
split into multiple files and multiple files' contents concatenated
into a single new file, both of which happen during code
refactoring.  "file ID" would not contribute an iota in helping
these situations.

I've said this a number of times, and I'll say this again, but one of
the most important messages in our list archive is gmane:217 aka

https://public-inbox.org/git/Pine.LNX.4.58.0504150753440.7211@ppc970.osdl.org/

I'd encourage people to read and re-read that message until they can
recite it by heart.

Linus mentions "CVS annotate"; the message was written long before
we had "git blame", and it served as a guide when designing how we
dig contents movement in various parts of the system.

[Footnote]

*1* There are many possible implementations; the most obvious would
    be to record a pair of blob object names and instruct Git that
    when it sees one side of a pair disappearing and the other side
    of the pair appearing, it should take the pair as a rename.  And
    that would be sufficient for "git log -M".

    Such a cache/hint alone however would not help much in "git
    merge" without further work, as we merge using only the tree
    state of the three points in the history (i.e. the common
    ancestor and two tips).  merge-recursive needs to be taught to
    find the renames at each commit it finds throughout the history
    from the ancestor and each tip and carry its finding through if
    it wants to take advantage of such hint/cache.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [idea] File history tracking hints
  2017-10-01  3:27           ` Junio C Hamano
@ 2017-10-02 17:41             ` Stefan Beller
  2017-10-02 18:51               ` Jeff Hostetler
  2017-10-03  0:45               ` Junio C Hamano
  0 siblings, 2 replies; 18+ messages in thread
From: Stefan Beller @ 2017-10-02 17:41 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff Hostetler, Johannes Schindelin, Philip Oakley, Pavel Kretov,
	git@vger.kernel.org

>> It would be nice if every file (and tree) had a permanent GUID
>> associated with it.  Then the filename/pathname becomes a property
>> of the GUIDs.  Then you can exactly know about moves/renames with
>> minimal effort (and no guessing).
>
...

> https://public-inbox.org/git/Pine.LNX.4.58.0504150753440.7211@ppc970.osdl.org/
>
> I'd encourge people to read and re-read that message until they can
> recite it by heart.

I have rethought about the idea of GUIDs as proposed by Jeff and wanted
to give a reply. After rereading this message, I think my thoughts are
already included via:

  - you're doing the work at the wrong point for _another_ reason. You're
     freezing your (crappy) algorithm at tree creation time, and basically
     making it pointless to ever create something better later, because even
     if hardware and software improves, you've codified that "we have to
     have crappy information".

--
My design proposal for these "rename hints" would be a special trailer,
roughly:

    Rename: LICENSE -> legal.txt
    Rename: t/* -> tests/*

or more generally:

    Rename: <pathspec> <delim> <pathspec>

This however has multiple issues due to potential
human inaccuracies:
(A) typos in the trailer key or in the pathspec
   (resulting in different error modes)
(B) partial hints (We currently have a world of
   completely missing hints, so I would not expect it to
   be worse?)
(C) wrong hints. This ought to be no problem as Git would
   take some CPU time to conclude the hint was bogus.
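
A minimal Python sketch of reading such trailers, assuming the "->"
delimiter from the examples above. `parse_rename_hints` is a hypothetical
helper, not an existing Git facility; near-miss keys are flagged with
`difflib` so that error mode (A) at least becomes visible.

```python
import difflib
import re

# "Key: value" lines; the key is letters and dashes only.
TRAILER_RE = re.compile(r"^(?P<key>[A-Za-z-]+):\s*(?P<value>.+)$")


def parse_rename_hints(message, delim="->"):
    """Return (hints, suspicious) found in a commit message.

    hints:      list of (source pathspec, target pathspec)
    suspicious: lines that look like a mistyped or malformed hint
    """
    hints, suspicious = [], []
    for raw in message.splitlines():
        line = raw.strip()
        match = TRAILER_RE.match(line)
        if not match:
            continue
        key, value = match.group("key"), match.group("value")
        if key == "Rename":
            if delim in value:
                src, dst = (part.strip() for part in value.split(delim, 1))
                hints.append((src, dst))
            else:
                suspicious.append(line)  # right key, no delimiter
        elif difflib.get_close_matches(key, ["Rename"], n=1, cutoff=0.8):
            suspicious.append(line)  # key is probably a typo of "Rename"
    return hints, suspicious
```

A typo inside a pathspec cannot be caught this way; that would only show
up when the hint fails to match anything in the trees being compared.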

For (A), I would imagine we want a mechanism (e.g. notes)
to "correct" the hints. This is a similar issue to a typo in a
commit message, which we currently just ignore if the
commit has been merged to e.g. master.

So maybe we'd just design around that, giving the option
to give the correct hints via command line.

So if the commit has the typo'd hint

    Remame:  t/* -> tests/*

the human would see that (and also conclude that by
the commit message), and then invoke

git log -C -C-hint="t/* -> tests/*" ...

which would have the corrected hint and hence deliver
the best output.

Maybe the "-C-hint" flag is the best starting point when
going in that direction?

Thanks,
Stefan

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [idea] File history tracking hints
  2017-10-02 17:41             ` Stefan Beller
@ 2017-10-02 18:51               ` Jeff Hostetler
  2017-10-02 19:18                 ` Stefan Beller
  2017-10-03  0:45               ` Junio C Hamano
  1 sibling, 1 reply; 18+ messages in thread
From: Jeff Hostetler @ 2017-10-02 18:51 UTC (permalink / raw)
  To: Stefan Beller, Junio C Hamano
  Cc: Johannes Schindelin, Philip Oakley, Pavel Kretov,
	git@vger.kernel.org

On 10/2/2017 1:41 PM, Stefan Beller wrote:
>>> It would be nice if every file (and tree) had a permanent GUID
>>> associated with it.  Then the filename/pathname becomes a property
>>> of the GUIDs.  Then you can exactly know about moves/renames with
>>> minimal effort (and no guessing).
>>
> ...
> 
>> https://public-inbox.org/git/Pine.LNX.4.58.0504150753440.7211@ppc970.osdl.org/
>>
>> I'd encourge people to read and re-read that message until they can
>> recite it by heart.
> 
> I have rethought about the idea of GUIDs as proposed by Jeff and wanted
> to give a reply. After rereading this message, I think my thoughts are
> already included via:
> 
>    - you're doing the work at the wrong point for _another_ reason. You're
>       freezing your (crappy) algorithm at tree creation time, and basically
>       making it pointless to ever create something better later, because even
>       if hardware and software improves, you've codified that "we have to
>       have crappy information".
> 
> --
> My design proposal for these "rename hints" would be a special trailer,
> roughly:
> 
>      Rename: LICENSE -> legal.txt
>      Rename: t/* -> tests/*
> 
> or more generally:
> 
>      Rename: <pathspec> <delim> <pathspec>
> 
> This however has multiple issues due to potential
> human inaccuracies:
> (A) typos in the trailer key or in the pathspec
>     (resulting in different error modes)
> (B) partial hints (We currently have a world of
>     completely missing hints, so I would not expect it to
>     be worse?)
> (C) wrong hints. This ought to be no problem as Git would
>     take some CPU time to conclude the hint was bogus.
> 
> For (A), I would imagine we want a mechanism (e.g. notes)
> to "correct" the hints. This is the similar issue as a typo in a
> commit message, which we currently just ignore if the
> commit has been merged to e.g. master.
> 
> So maybe we'd just design around that, giving the option
> to give the correct hints via command line.
> 
> So if the commit has the typo'd hint
> 
>      Remame:  t/* -> tests/*
> 
> the human would see that (and also conclude that by
> the commit message), and then invoke
> 
> git log -C -C-hint="t/* -> tests/*" ...
> 
> which would have the corrected hint and hence deliver
> the best output.
> 
> Maybe the "-C-hint" flag is the best starting point when
> going in that direction?
> 
> Thanks,
> Stefan
> 

Sorry to re-re-...-re-stir up such an old topic.

I wasn't really thinking about commit-to-commit hints.
I think these have lots of problems.  (If commit A->B does
"t/* -> tests/*" and commit B->C does "test/*.c -> xyx/*",
then you need a way to compute a transitive closure to see
the net-net hints for A->C.  I think that quickly spirals
out of control.)
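
To make the chaining concrete, here is a minimal Python sketch that
follows one concrete path through per-commit hints. It assumes a toy hint
syntax with at most one "*" per side; `apply_hint`, `follow` and the hints
themselves are hypothetical.

```python
import re


def apply_hint(path, hint):
    """Apply one (src, dst) hint to a path; return it unchanged on no match."""
    src, dst = hint
    # Turn the single "*" in src into a capturing group.
    pattern = "^" + re.escape(src).replace(r"\*", "(.*)") + "$"
    match = re.match(pattern, path)
    if not match:
        return path
    captured = match.group(1) if match.groups() else ""
    return dst.replace("*", captured, 1)


def follow(path, steps):
    """Follow a path through a chain of commits, each with its own hints."""
    for hints in steps:
        for hint in hints:
            renamed = apply_hint(path, hint)
            if renamed != path:
                path = renamed
                break  # at most one hint applies per commit
    return path


# Hints for commit A->B, then for commit B->C (illustrative only).
steps = [[("t/*", "tests/*")], [("tests/*.c", "unit/*.c")]]
```

With these steps, `follow("t/helper.c", steps)` ends up at
"unit/helper.c" while "t/README" stops at "tests/README". Tracing single
paths like this is the easy part; reducing the chain to one net set of
glob hints for A->C is where it gets complicated.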

No, I was going in another direction.  For example, if a
tree-entry contains { file-guid, file-name, file-sha, ... }
then when diffing any 2 commits, you can match up files
(and folders) by their guids.  Renames pop out trivially when
their file-names don't match.  File moves pop out when the
file-guids appear in different trees.  Adds and deletes pop
out when file-guids don't have a peer. (I'm glossing over some
of the details, but you get the idea.)  To address Junio's
question, independently added files with the same name will
have 2 different file-guids.  We amend the merge rules to
handle this case and pick one of them (say, the one that
sorts less than the other) as the winner and go on.
All-in-all the solution is not trivial (as there are a few
edge cases to deal with), but it better matches the (casual)
user's perception of what happened to their tree over time.
It also doesn't require expensive code to sniff for renames
on every command (which doesn't scale on really large repos).
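
A conceptual Python sketch of that matching, assuming flattened tree
entries of the form { guid, path, sha }. `Entry` and `diff_by_guid` are
hypothetical, and renames and moves are collapsed into "the path changed"
to keep it short.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    guid: str  # hypothetical permanent file identifier
    path: str  # full path within the tree
    sha: str   # blob object name


def diff_by_guid(old, new):
    """Classify changes between two entry lists purely by GUID."""
    old_by_guid = {e.guid: e for e in old}
    new_by_guid = {e.guid: e for e in new}
    shared = old_by_guid.keys() & new_by_guid.keys()
    # Same GUID on both sides but a different path: a rename or move.
    renames = sorted(
        (old_by_guid[g].path, new_by_guid[g].path)
        for g in shared
        if old_by_guid[g].path != new_by_guid[g].path
    )
    # GUIDs without a peer are adds or deletes.
    adds = sorted(new_by_guid[g].path for g in new_by_guid.keys() - shared)
    deletes = sorted(old_by_guid[g].path for g in old_by_guid.keys() - shared)
    return renames, adds, deletes
```

No content comparison happens at all, which is both the appeal (cheap,
exact) and the limitation (it says nothing about copies, splits or
independently created files).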

But as I said before, that ship has sailed...
Jeff

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [idea] File history tracking hints
  2017-10-02 18:51               ` Jeff Hostetler
@ 2017-10-02 19:18                 ` Stefan Beller
  2017-10-02 20:02                   ` Jeff Hostetler
  0 siblings, 1 reply; 18+ messages in thread
From: Stefan Beller @ 2017-10-02 19:18 UTC (permalink / raw)
  To: Jeff Hostetler
  Cc: Junio C Hamano, Johannes Schindelin, Philip Oakley, Pavel Kretov,
	git@vger.kernel.org

On Mon, Oct 2, 2017 at 11:51 AM, Jeff Hostetler <git@jeffhostetler.com> wrote:

> Sorry to re-re-...-re-stir up such an old topic.
>
> I wasn't really thinking about commit-to-commit hints.
> I think these have lots of problems.  (If commit A->B does
> "t/* -> tests/*" and commit B->C does "test/*.c -> xyx/*",
> then you need a way to compute a transitive closure to see
> the net-net hints for A->C.  I think that quickly spirals
> out of control.)

I agree. Though as a human I can still look at
A..C giving the hint that t/*.c and xyz/*.c ought to
be taken into account for rename detection.
(which is currently done with -M -C --find-copies-harder
as a generic "there are renamed things", and not the very
specific rule, that may be cheaper to examine compared to
these generic rules)

> No, I was going in another direction.  For example, if a
> tree-entry contains { file-guid, file-name, file-sha, ... }
> then when diffing any 2 commits, you can match up files
> (and folders) by their guids.  Renames pop out trivially when
> their file-names don't match.  File moves pop out when the
> file-guids appear in different trees.  Adds and deletes pop
> out when file-guids don't have a peer. (I'm glossing over some
> of the details, but you get the idea.)

How do you know when a guid needs adaption?

(c.f. origin/jt/packmigrate)
If a commit moves a function out of a file into a new file,
the ideal version control could notice that the function
was moved into a new file and still attribute the original
authors by ignoring the move commit.

Another series in flight could have modified that
function slightly (fixed a bug), such that it's hard to
reason about these things.

For guids I imagine the new file gets a new guid, such that
tracking the function becomes harder?

> To address Junio's
> question, independently added files with the same name will
> have 2 different file-guids.  We amend the merge rules to
> handle this case and pick one of them (say, the one that
> is sorts less than the other) as the winner and go on.
> All-in-all the solution is not trivial (as there are a few
> edge cases to deal with), but it better matches the (casual)
> user's perception of what happened to their tree over time.

The GUID would be made up at creation time, I assume?
Is there any input other than the file itself? (I assumed so
initially, such that:
  By having a GUID in the tree, we would divorce from the notion
  of a "content addressable file system" quickly, as we both could
  create the same tree locally (containing the same blobs) and
  yet the trees would have different names due to having different
  GUIDs in them
), which I'd find undesirable.
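
The concern can be shown with a small Python sketch. It assumes a
simplified tree serialization (not Git's real tree format) and random
GUIDs from `uuid.uuid4()`; `tree_name` is a hypothetical helper.

```python
import hashlib
import uuid


def tree_name(entries):
    """Hash a simplified tree: one line per entry, fields joined by NUL."""
    digest = hashlib.sha1()
    for fields in sorted(entries):
        digest.update("\0".join(fields).encode() + b"\n")
    return digest.hexdigest()


blob = hashlib.sha1(b"hello\n").hexdigest()

# Without GUIDs, two people adding the same content get the same tree name.
alice_plain = [("README", blob)]
bob_plain = [("README", blob)]

# With a GUID made up at creation time, the tree names diverge.
alice_guid = [("README", blob, str(uuid.uuid4()))]
bob_guid = [("README", blob, str(uuid.uuid4()))]
```

Comparing `tree_name(alice_plain)` with `tree_name(bob_plain)` gives equal
names, while the two GUID-carrying trees hash differently despite holding
identical blobs.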

> It also doesn't require expensive code to sniff for renames
> on every command (which doesn't scale on really large repos).

I wonder if the rename detection could be offloaded to a server
(which scales) that provides a "hint file" to clients, such that the
clients can then cheaply make use of these specific hints.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [idea] File history tracking hints
  2017-10-02 19:18                 ` Stefan Beller
@ 2017-10-02 20:02                   ` Jeff Hostetler
  2017-10-03  0:52                     ` Junio C Hamano
  0 siblings, 1 reply; 18+ messages in thread
From: Jeff Hostetler @ 2017-10-02 20:02 UTC (permalink / raw)
  To: Stefan Beller
  Cc: Junio C Hamano, Johannes Schindelin, Philip Oakley, Pavel Kretov,
	git@vger.kernel.org



On 10/2/2017 3:18 PM, Stefan Beller wrote:
> On Mon, Oct 2, 2017 at 11:51 AM, Jeff Hostetler <git@jeffhostetler.com> wrote:
> 
>> Sorry to re-re-...-re-stir up such an old topic.
>>
>> I wasn't really thinking about commit-to-commit hints.
>> I think these have lots of problems.  (If commit A->B does
>> "t/* -> tests/*" and commit B->C does "test/*.c -> xyx/*",
>> then you need a way to compute a transitive closure to see
>> the net-net hints for A->C.  I think that quickly spirals
>> out of control.)
> 
> I agree. Though as a human I can still look at
> A..C giving the hint that t/*.c and xyz/*.c ought to
> be taken into account for rename detection.
> (which is currently done with -M -C --find-copies-harder
> as a generic "there are renamed things", and not the very
> specific rule, that may be cheaper to examine compared to
> these generic rules)
> 
>> No, I was going in another direction.  For example, if a
>> tree-entry contains { file-guid, file-name, file-sha, ... }
>> then when diffing any 2 commits, you can match up files
>> (and folders) by their guids.  Renames pop out trivially when
>> their file-names don't match.  File moves pop out when the
>> file-guids appear in different trees.  Adds and deletes pop
>> out when file-guids don't have a peer. (I'm glossing over some
>> of the details, but you get the idea.)
> 
> How do you know when a guid needs adaption?

I'm not sure I know what you mean by "adaption".

> 
> (c.f. origin/jt/packmigrate)
> If a commit moves a function out of a file into a new file,
> the ideal version control could notice that the function
> was moved into a new file and still attribute the original
> authors by ignoring the move commit.

I think that's an orthogonal problem.  I could move a function
from one file to an existing file or to a new file; it doesn't
matter.  Attributing those lines back to the original author
(rather than the mover) is a bit of a pipe dream IMHO.  And I
have to wonder if it is always the correct thing to do?  I can
see scenarios where you'd want the mover.

I guess there's nothing stopping the "ideal VC system" from
doing all this line-based analysis, but that shouldn't make
file renames expensive to detect (since that is the granularity
that people and most tools expect the system to work with).

> 
> Another series in flight could have modified that
> function slightly (fixed a bug), such that it's hard to
> reason about these things.
> 
> For guids I imagine the new file gets a new guid, such that
> tracking the function becomes harder?
> 

Yeah, I'm not thinking about tracking individual functions.

> 
>> To address Junio's
>> question, independently added files with the same name will
>> have 2 different file-guids.  We amend the merge rules to
>> handle this case and pick one of them (say, the one that
>> is sorts less than the other) as the winner and go on.
>> All-in-all the solution is not trivial (as there are a few
>> edge cases to deal with), but it better matches the (casual)
>> user's perception of what happened to their tree over time.
> 
> The GUID would be made up at creation time, I assume?
> Is there any input other than the file itself? (I assumed so
> initially, such that:
>    By having a GUID in the tree, we would divorce from the notion
>    of a "content addressable file system" quickly, as we both could
>    create the same tree locally (containing the same blobs) and
>    yet the trees would have different names due to having different
>    GUIDs in them
> ), which I'd find undesirable.

Right.  A real solution would store the guid data slightly
differently so we could preserve the existing SHA properties.
My example was more conceptual.

> 
>> It also doesn't require expensive code to sniff for renames
>> on every command (which doesn't scale on really large repos).
> 
> I wonder if the rename detection could be offloaded to a server
> (which scales) that provides a "hint file" to clients, such that the
> clients can then cheaply make use of these specific hints.
> 

I don't know.  Might be easier to add that computation to the
occasional client-side housekeeping (somewhat like the commit
generation number computation we keep talking about).

Thanks
Jeff

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [idea] File history tracking hints
  2017-10-02 17:41             ` Stefan Beller
  2017-10-02 18:51               ` Jeff Hostetler
@ 2017-10-03  0:45               ` Junio C Hamano
  1 sibling, 0 replies; 18+ messages in thread
From: Junio C Hamano @ 2017-10-03  0:45 UTC (permalink / raw)
  To: Stefan Beller
  Cc: Jeff Hostetler, Johannes Schindelin, Philip Oakley, Pavel Kretov,
	git@vger.kernel.org

Stefan Beller <sbeller@google.com> writes:

> I have rethought about the idea of GUIDs as proposed by Jeff and wanted
> to give a reply. After rereading this message, I think my thoughts are
> already included via:
>
>   - you're doing the work at the wrong point for _another_ reason. You're
>      freezing your (crappy) algorithm at tree creation time, and basically
>      making it pointless to ever create something better later, because even
>      if hardware and software improves, you've codified that "we have to
>      have crappy information".
>
> --
> My design proposal for these "rename hints" would be a special trailer,
> roughly:
>
>     Rename: LICENSE -> legal.txt
>     Rename: t/* -> tests/*
>
> or more generally:
>
>     Rename: <pathspec> <delim> <pathspec>

Yes, it is a non-starter to have that baked in the log message of a
commit object.  The principle Linus lays out in the message does not
reject such hints stored outside baked-in data structure, which
allows mistakes to be corrected without affecting the real history,
though.

Another thing that makes what you wrote above of dubious value is
that it attaches such hints to "a commit" (whether baked inside the
log message, or as some form of "notes" that can be associated with
a specific commit); it adds hints at a wrong place.

Given an identical pair of trees <X,Y> that are wrapped in two pairs of
commits <A> and <B> where A^{tree}=B^{tree} and A^^{tree}=B^^{tree},
we do not want to have to give duplicated hints for A and B, to help
"git show A" and "git show B" to behave the same.

Rather, if we said "these two blobs A and B are similar and we want
diffcore-rename to pair them, no matter where they appear in any two
trees", then "git diff -M X Y", where X and Y may not have any
ancestry relationship (they may not even be commits) can be told
that the blob A that is in tree X and the blob B that is in tree Y
are renames or copies, no matter where in these trees the pair of
blobs appear, and no matter how X and Y are related (or unrelated)
in the history.

That is a bigger reason why annotating a commit may be a bad way to
go.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [idea] File history tracking hints
  2017-10-02 20:02                   ` Jeff Hostetler
@ 2017-10-03  0:52                     ` Junio C Hamano
  0 siblings, 0 replies; 18+ messages in thread
From: Junio C Hamano @ 2017-10-03  0:52 UTC (permalink / raw)
  To: Jeff Hostetler
  Cc: Stefan Beller, Johannes Schindelin, Philip Oakley, Pavel Kretov,
	git@vger.kernel.org

Jeff Hostetler <git@jeffhostetler.com> writes:

>> How do you know when a guid needs adaption?
>
> I'm not sure I know what you mean by "adaption".

I think he meant adapting, and I think he is referring to what I
wrote in the message upthread to explain why "file ID" would not
help.

It seems to me, from reading the remainder of your message, that it
is also becoming clear to you that "file ID" would not help and that
your conceptual thing was merely hand-waving, with no clear way to
make it into a concrete working design?  Hopefully we can
converge on a workable design that does not involve "file ID", and
that would be a good outcome.

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2017-10-03  0:52 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-11  7:11 [idea] File history tracking hints Pavel Kretov
2017-09-11 18:11 ` Stefan Beller
2017-09-11 18:47   ` Jacob Keller
2017-09-11 18:41 ` Jeff King
2017-09-11 20:09 ` Igor Djordjevic
2017-09-11 21:48 ` Philip Oakley
2017-09-13 11:38   ` Johannes Schindelin
2017-09-14 23:22     ` Philip Oakley
2017-09-29 23:12       ` Johannes Schindelin
2017-09-30  8:02         ` Jeff Hostetler
2017-09-30 15:11           ` Johannes Schindelin
2017-10-01  3:27           ` Junio C Hamano
2017-10-02 17:41             ` Stefan Beller
2017-10-02 18:51               ` Jeff Hostetler
2017-10-02 19:18                 ` Stefan Beller
2017-10-02 20:02                   ` Jeff Hostetler
2017-10-03  0:52                     ` Junio C Hamano
2017-10-03  0:45               ` Junio C Hamano
