Ascertaining amount of "original" code across files/repo

git@vger.kernel.org mailing list mirror (one of many)

* Ascertaining amount of "original" code across files/repo
@ 2017-10-22 21:25 Thomas Adam
  2017-10-23  2:04 ` Junio C Hamano
  0 siblings, 1 reply; 2+ messages in thread
From: Thomas Adam @ 2017-10-22 21:25 UTC (permalink / raw)
  To: git

Hi all,

I was recently left with an interesting problem: finding a heuristic to
determine how much original code was left in a repository.  Or, to put it
another way, how much the code had changed since.  In my case "original code"
means "since the initial commit", as this code base had been imported from CVS
long ago, and that was the correct starting point.

What I came up with was the following heuristic.  What I'm curious to know is
whether there's an alternative way to look at this and/or if such tooling
already exists.

What I did was first of all ascertain the number of original lines in each of
the files I was interested in:

	for i in *.[ch]
	do
		c="$(git --no-pager blame "$i" | grep -c '^\^')"
		[ $c -gt 0 ] && echo "$i:$c"
	done | sort -t':' -k2 -nr

Given this, I then did some maths on the total lines from each of those files
to work out a percentage by file, and overall.
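
A minimal sketch of that arithmetic, assuming it is run inside the work tree,
that file names contain no whitespace, and that lines blamed to the initial
commit start with '^' as in the loop above.  The helper names count_original
and report_original are made up for illustration:

```shell
#!/bin/sh
# Read "git blame" output on stdin and print "<original> <total>".
# Lines blamed to the boundary (initial) commit start with '^'.
count_original() {
	awk '/^\^/ { orig++ } END { printf("%d %d\n", orig, NR) }'
}

# Print the percentage of original lines per file, then overall.
# Hypothetical usage, from inside the work tree: report_original *.[ch]
report_original() {
	for f in "$@"
	do
		echo "$f $(git --no-pager blame "$f" | count_original)"
	done | awk '
		function pct(part, whole) { return whole ? 100 * part / whole : 0 }
		{
			printf("%s: %d/%d (%.1f%%)\n", $1, $2, $3, pct($2, $3))
			orig += $2
			total += $3
		}
		END { printf("overall: %d/%d (%.1f%%)\n", orig, total, pct(orig, total)) }'
}
```

Keeping the counting separate from the blame call means the same helper can
be fed blame output produced with other options.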

What I'm curious to know is whether this approach of using "git blame" is a
good approach or not.

Thanks for your time.

-- Thomas Adam

* Re: Ascertaining amount of "original" code across files/repo
  2017-10-22 21:25 Ascertaining amount of "original" code across files/repo Thomas Adam
@ 2017-10-23  2:04 ` Junio C Hamano
  0 siblings, 0 replies; 2+ messages in thread
From: Junio C Hamano @ 2017-10-23  2:04 UTC (permalink / raw)
  To: Thomas Adam; +Cc: git

Thomas Adam <thomas@xteddy.org> writes:

> What I did was first of all ascertain the number of original lines in each of
> the files I was interested in:
>
> 	for i in *.[ch]
> 	do
> 		c="$(git --no-pager blame "$i" | grep -c '^\^')"
> 		[ $c -gt 0 ] && echo "$i:$c"
> 	done | sort -t':' -k2 -nr

Another approach, which I used when I was curious how many of the
1244 lines Linus originally wrote for Git in 2005 remain in today's
codebase, goes the other way [*1*].

The "reverse" approach makes use of the -S option of blame to
fabricate a hypothetical history where the very initial version of
Git is today's version, and then there is another version that was
built on it (eh, rather reduced out of it) which is Linus's
original.

	$ git tag initial e83c5163316f89
	$ cat >fake-history <<EOF
	$(git rev-parse initial) $(git rev-parse master)
	$(git rev-parse master)
	EOF

The list of files that Linus had in his original can be found out
with:

	$ git ls-tree -r --name-only initial

and you can iterate over them with a command like this:

	$ git blame -Sfake-history -s -b initial -- cache.h

A brief commentary on the options:

 * "-Sfake-history" option points at a fake-history file, which uses
   the same format as the "graft" file, to establish the fake
   ancestry.  The first line claims that Linus's 'initial'
   version has only one parent, which is our current version
   'master' (in reality, Linus's 'initial' version did not have any
   parent, of course).  The second line claims that our current
   version 'master' is a root commit without any parent.

 * "-s" squelches all metainformation other than commit object name
   from the prefix of each line; "-b" further blanks out the commit
   object name of the "root" commit---note that in this fake
   history, our current state in 'master' is what is blanked out.

The output may start like so:

                     1) #ifndef CACHE_H
                     2) #define CACHE_H
                     3) 
        e83c5163316  4) #include <stdio.h>
        e83c5163316  5) #include <sys/stat.h>
        e83c5163316  6) #include <fcntl.h>
        e83c5163316  7) #include <stddef.h>

The idea is that a line that is blamed to the "root" commit
(i.e. blank prefix) is what survived from Linus's version down to
our current version.  In the fake world, Linus started from
today's version and ended up with the same result in his version for
these lines.  A line that is blamed to e83c516 is something we do
not have in today's version that is "added" by Linus in this
fake world---in reality, it is what we "lost" from Linus's original
over time.
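
As a rough sketch of how the iteration and the counting could be put
together, assuming the 'initial' tag and the fake-history file from above
already exist and that every file in 'initial' is text.  The function names
tally_reverse_blame and reverse_blame_all are hypothetical:

```shell
#!/bin/sh
# Tally reverse-blame output ("git blame -S<file> -s -b") from stdin.
# A line starting with a hex digit is blamed to 'initial', i.e. lost;
# a line with a blank prefix is blamed to the fake root, i.e. survived.
tally_reverse_blame() {
	awk '
		/^[0-9a-f]/ { lost++; next }
		{ kept++ }
		END { printf("kept=%d lost=%d\n", kept, lost) }'
}

# Run the reverse blame over every file in 'initial' and tally the result.
# Assumes the 'initial' tag and the fake-history file already exist.
reverse_blame_all() {
	git ls-tree -r --name-only initial |
	while IFS= read -r path
	do
		git --no-pager blame -Sfake-history -s -b initial -- "$path"
	done | tally_reverse_blame
}
```

Here "kept" plus "lost" adds up to the number of lines in the 'initial'
version of the files, so "kept" over that sum is the surviving fraction.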

By adding -M and -C on "git blame" command line, you'll find more
lines that survived over time from Linus's original by getting moved
around inside the same file and across file boundaries.  By adding -w,
whitespace-only changes such as re-indentation would also be ignored.
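
To see how much difference these options make for one file, something along
these lines might do.  It is only a sketch: kept_lines is a made-up helper,
and it assumes the 'initial' tag and the fake-history file from above:

```shell
#!/bin/sh
# Count the lines of one file that reverse blame attributes to the
# blanked-out fake root, i.e. lines that survive in 'master'.
# $1 is a path in 'initial'; remaining arguments are extra blame options.
kept_lines() {
	path=$1
	shift
	git --no-pager blame -Sfake-history -s -b "$@" initial -- "$path" |
	grep -c '^ '
}

# Hypothetical comparison, run where 'initial' and fake-history exist:
#   kept_lines cache.h
#   kept_lines cache.h -w -M -C
```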

I am not judging which is more correct, going in the forward
direction as your approach does or going in the reverse, as I
haven't thought about it deeply enough.

[Reference]

*1* https://docs.google.com/file/d/0Bw3FApcOlPDhMFR3UldGSHFGcjQ/view

    Slide #11 was created using the above method.
