Re: [idea] File history tracking hints

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

From: Junio C Hamano <gitster@pobox.com>
To: Jeff Hostetler <git@jeffhostetler.com>
Cc: Johannes Schindelin <Johannes.Schindelin@gmx.de>,
	Philip Oakley <philipoakley@iee.org>,
	Pavel Kretov <firegurafiku@gmail.com>,
	git@vger.kernel.org
Subject: Re: [idea] File history tracking hints
Date: Sun, 01 Oct 2017 12:27:04 +0900	[thread overview]
Message-ID: <xmqqtvzjv987.fsf@gitster.mtv.corp.google.com> (raw)
In-Reply-To: <5fb263a8-d83b-64a7-812f-fd8e3748feb6@jeffhostetler.com> (Jeff Hostetler's message of "Sat, 30 Sep 2017 04:02:35 -0400")

Jeff Hostetler <git@jeffhostetler.com> writes:

> On 9/29/2017 7:12 PM, Johannes Schindelin wrote:
>
>> Therefore, it would be good to have a way to tell Git about renames
>> explicitly so that it does not even need to use its heuristics.
>
> Agreed.
>
> It would be nice if every file (and tree) had a permanent GUID
> associated with it.  Then the filename/pathname becomes a property
> of the GUIDs.  Then you can exactly know about moves/renames with
> minimal effort (and no guessing).

I actually like the idea to have a mechanism where the user can give
hint to influence, or instruction to dictate, how Git determines
"this old path moved to this new path" when comparing two trees.  A
human would not consider a new file (e.g. header file) that begins
with a few dozen commonly-seen boilerplate lines (e.g. copyright
statement) followed by several lines unique to the new contents to
be a rename of a disappearing old file that begins with the same
boilerplate followed by several lines that are different from what
is in the new file, but Git's algorithm would give equal weight to
all of these lines when deciding how similar the new file is to the
old file, and can misidentify a new file to be a rename of an old
file that is unrelated.  Even when Git can and does determine the
pairing correctly, it would be a win if we do not have to recompute
the same pairing every time.  So both as hint and as cache, such a
mechanism would make sense [*1*].

But "file ID" does not have any place to contribute to such a
mechanism.  Each of two developers working on the same project in a
disributed environment can grab the same gist and create a new file
in his or her tree, perhaps at the same path or at a different
path.  At the time of such an addition, there is no way for each of
them to give these two files the same "file ID" (that is how the
world works in the distributed environment after all)---which "file
ID" should survive when their two histories finally meet and results
in a single file after a merge?  A file with "file ID" may not be
renamed but may be copied and evolve separately and differently.
Which one should inherit its original "file ID" and how does having
"file ID" help us identify the other one is equally related to the
original file?  These two are merely examples that "file ID"s would
cause while solving "only" what can be expressed in "git diff -M"
output (the latter illustrates that it does not even help showing
"git diff -C").

And when we stop limiting ourselves to the whole-file renames and
copies (which can be expressed in "git diff" output) but also want
to help finer-grained operation like "git blame", we'd want to have
something that helps in situations like a single file's contents
split into multiple files and multiple files' contents concatenated
into a single new file, both of which happens during code
refactoring.  "file ID" would not contribute an iota in helping
these situations.  

I've said this number of times, and I'll say this again, but one of
the most important message in our list archive is gmane:217 aka

https://public-inbox.org/git/Pine.LNX.4.58.0504150753440.7211@ppc970.osdl.org/

I'd encourge people to read and re-read that message until they can
recite it by heart.

Linus mentions "CVS annotate"; the message was written long before
we had "git blame", and it served as a guide when desiging how we
dig contents movement in various parts of the system.

[Footnote]

*1* There are many possible implementations; the most obvious would
    be to record a pair of blob object names and instruct Git when
    it seems one side of a pair disappearing and the other side of
    the pair appearing, take the pair as a rename.  And that would
    be sufficient for "git log -M".  

    Such a cache/hint alone however would not help much in "git
    merge" without further work, as we merge using only the tree
    state of the three points in the history (i.e. the common
    ancestor and two tips).  merge-recursive needs to be taught to
    find the renames at each commit it finds throughout the history
    from the ancestor and each tip and carry its finding through if
    it wants to take advantage of such hint/cache.

next prev parent reply	other threads:[~2017-10-01  3:27 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-09-11  7:11 [idea] File history tracking hints Pavel Kretov
2017-09-11 18:11 ` Stefan Beller
2017-09-11 18:47   ` Jacob Keller
2017-09-11 18:41 ` Jeff King
2017-09-11 20:09 ` Igor Djordjevic
2017-09-11 21:48 ` Philip Oakley
2017-09-13 11:38   ` Johannes Schindelin
2017-09-14 23:22     ` Philip Oakley
2017-09-29 23:12       ` Johannes Schindelin
2017-09-30  8:02         ` Jeff Hostetler
2017-09-30 15:11           ` Johannes Schindelin
2017-10-01  3:27           ` Junio C Hamano [this message]
2017-10-02 17:41             ` Stefan Beller
2017-10-02 18:51               ` Jeff Hostetler
2017-10-02 19:18                 ` Stefan Beller
2017-10-02 20:02                   ` Jeff Hostetler
2017-10-03  0:52                     ` Junio C Hamano
2017-10-03  0:45               ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=xmqqtvzjv987.fsf@gitster.mtv.corp.google.com \
    --to=gitster@pobox.com \
    --cc=Johannes.Schindelin@gmx.de \
    --cc=firegurafiku@gmail.com \
    --cc=git@jeffhostetler.com \
    --cc=git@vger.kernel.org \
    --cc=philipoakley@iee.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).