git@vger.kernel.org mailing list mirror (one of many)
From: Jeff Hostetler <git@jeffhostetler.com>
To: Stefan Beller <sbeller@google.com>
Cc: Junio C Hamano <gitster@pobox.com>,
	Johannes Schindelin <Johannes.Schindelin@gmx.de>,
	Philip Oakley <philipoakley@iee.org>,
	Pavel Kretov <firegurafiku@gmail.com>,
	"git@vger.kernel.org" <git@vger.kernel.org>
Subject: Re: [idea] File history tracking hints
Date: Mon, 2 Oct 2017 16:02:09 -0400
Message-ID: <f9b722d9-cd37-40f3-7ae4-6f7f3d90de83@jeffhostetler.com>
In-Reply-To: <CAGZ79kbjfXC3CxMDouUrCUVt-OJXckDtg9U_7=R=FM-eon4ikA@mail.gmail.com>



On 10/2/2017 3:18 PM, Stefan Beller wrote:
> On Mon, Oct 2, 2017 at 11:51 AM, Jeff Hostetler <git@jeffhostetler.com> wrote:
> 
>> Sorry to re-re-...-re-stir up such an old topic.
>>
>> I wasn't really thinking about commit-to-commit hints.
>> I think these have lots of problems.  (If commit A->B does
>> "t/* -> tests/*" and commit B->C does "tests/*.c -> xyz/*",
>> then you need a way to compute a transitive closure to see
>> the net-net hints for A->C.  I think that quickly spirals
>> out of control.)
> 
> I agree. Though as a human I can still look at
> A..C giving the hint that t/*.c and xyz/*.c ought to
> be taken into account for rename detection.
> (which is currently done with -M -C --find-copies-harder
> as a generic "there are renamed things", and not the very
> specific rule, that may be cheaper to examine compared to
> these generic rules)
> 
>> No, I was going in another direction.  For example, if a
>> tree-entry contains { file-guid, file-name, file-sha, ... }
>> then when diffing any 2 commits, you can match up files
>> (and folders) by their guids.  Renames pop out trivially when
>> their file-names don't match.  File moves pop out when the
>> file-guids appear in different trees.  Adds and deletes pop
>> out when file-guids don't have a peer. (I'm glossing over some
>> of the details, but you get the idea.)
> 
> How do you know when a guid needs adaption?

I'm not sure I know what you mean by "adaption".

> 
> (c.f. origin/jt/packmigrate)
> If a commit moves a function out of a file into a new file,
> the ideal version control could notice that the function
> was moved into a new file and still attribute the original
> authors by ignoring the move commit.

I think that's an orthogonal problem.  I could move a function
from one file to an existing file or to a new file; it doesn't
matter.  Attributing those lines back to the original author
(rather than the mover) is a bit of a pipe dream IMHO.  And I
have to wonder if it is always the correct thing to do?  I can
see scenarios where you'd want the mover.

I guess there's nothing stopping the "ideal VC system" from
doing all this line-based analysis, but that shouldn't make
file renames expensive to detect (since that is the granularity
that people and most tools expect the system to work with).

> 
> Another series in flight could have modified that
> function slightly (fixed a bug), such that it's hard to
> reason about these things.
> 
> For guids I imagine the new file gets a new guid, such that
> tracking the function becomes harder?
> 

Yeah, I'm not thinking about tracking individual functions.

> 
>> To address Junio's
>> question, independently added files with the same name will
>> have 2 different file-guids.  We amend the merge rules to
>> handle this case and pick one of them (say, the one that
>> sorts less than the other) as the winner and go on.
>> All-in-all the solution is not trivial (as there are a few
>> edge cases to deal with), but it better matches the (casual)
>> user's perception of what happened to their tree over time.
> 
> The GUID would be made up at creation time, I assume?
> Is there any input other than the file itself? (I assumed so
> initially, such that:
>    By having a GUID in the tree, we would divorce from the notion
>    of a "content addressable file system" quickly, as we both could
>    create the same tree locally (containing the same blobs) and
>    yet the trees would have different names due to having different
>    GUIDs in them
> ), which I'd find undesirable.

Right.  A real solution would store the guid data slightly
differently so we could preserve the existing SHA properties.
My example was more conceptual.

> 
>> It also doesn't require expensive code to sniff for renames
>> on every command (which doesn't scale on really large repos).
> 
> I wonder if the rename detection could be offloaded to a server
> (which scales) that provides a "hint file" to clients, such that the
> clients can then cheaply make use of these specific hints.
> 

I don't know.  Might be easier to add that computation to the
occasional client-side housekeeping (somewhat like the commit
generation number computation we keep talking about).

Thanks
Jeff
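
For illustration only, the guid-matching diff described in the
quoted paragraph above (match tree entries by file-guid, then
compare names and containing trees) could be sketched roughly as
below.  This is a conceptual sketch, not anything Git implements:
the Entry record, its field names, and diff_by_guid() are all
hypothetical, and a flat list of full paths stands in for real
tree objects.

```python
import posixpath
from typing import Dict, List, NamedTuple


class Entry(NamedTuple):
    """Hypothetical tree entry; Git's real tree entries have no guid."""
    guid: str  # assumed stable per-file identifier
    path: str  # full path of the file within the commit
    sha: str   # blob id of the file content


def diff_by_guid(old: List[Entry], new: List[Entry]) -> Dict[str, list]:
    """Classify changes between two commits by pairing entries on guid."""
    old_by_guid = {e.guid: e for e in old}
    new_by_guid = {e.guid: e for e in new}
    result: Dict[str, list] = {
        "renamed": [], "moved": [], "modified": [], "added": [], "deleted": [],
    }

    for guid, o in old_by_guid.items():
        n = new_by_guid.get(guid)
        if n is None:
            # The guid has no peer in the new commit: a delete.
            result["deleted"].append(o.path)
            continue
        if o.path != n.path:
            # Same directory means only the name changed (a rename);
            # a different directory means the file changed trees (a move).
            same_dir = posixpath.dirname(o.path) == posixpath.dirname(n.path)
            result["renamed" if same_dir else "moved"].append((o.path, n.path))
        if o.sha != n.sha:
            result["modified"].append(n.path)

    for guid, n in new_by_guid.items():
        if guid not in old_by_guid:
            # The guid has no peer in the old commit: an add.
            result["added"].append(n.path)

    return result
```

Note that no content similarity scoring is involved: each lookup
is a dictionary hit on the guid, which is the property that would
keep rename detection cheap on large repos.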
