git@vger.kernel.org mailing list mirror (one of many)
* [PATCH 0/5] git.git .mailmap cleanups
From: Jeff King @ 2012-12-12 11:30 UTC
  To: git

I noticed a few obvious problems in the output of "git shortlog -nse" on
git.git. So I wrote an analysis script to find more, and of course there
were lots.

This series tries to clean up the low-hanging fruit. The first two
commits fix multiple names matching a single email. Hopefully not too
contentious, but I'll cc all involved parties to confirm. The second has
a different root cause, so I've broken it out into its own commit.

  [1/5]: .mailmap: match up some obvious names/emails
  [2/5]: .mailmap: fix broken entry for Martin Langhoff

Next up are multiple emails which match a single name. There are over a
hundred of these, and they are much less obvious to fix. They really
need individuals to post patches to fix their own identities (and some
may not want fixing at all, if they used different emails to have
meaningful different identities).

So I've left these untouched except for:

  [3/5]: .mailmap: normalize emails for Jeff King

I am allowed to fix my own. :)

  [4/5]: .mailmap: normalize emails for Linus Torvalds

As the benevolent dictator, Linus has underlings to fix such things for
him.

Also, his entry was the original reason I started looking at the data.
He fares quite poorly in "shortlog -nse" because his commits are
scattered across many addresses.

  [5/5]: contrib: update stats/mailmap script

This replaces the current mailmap script in contrib, which has a bug and
lacks some of the features of my new script.

-Peff


* [PATCH 1/5] .mailmap: match up some obvious names/emails
From: Jeff King @ 2012-12-12 11:36 UTC
  To: git
  Cc: Cheng Renquan, Dan Johnson, Eric S. Raymond,
	Frédéric Heitzmann, Jakub Narębski, Kevin Leung,
	Marc-André Lureau, Mark Rada, Robert Zeh, Tay Ray Chuan

This patch updates git's .mailmap in cases where multiple
names are matched to a single email. The "master" name for
each email was chosen by:

  1. If the only difference is in the presence or absence
     of accented characters, the accented form is chosen
     (under the assumption that it is the natural spelling,
     and accents are sometimes stripped in email).

  2. Otherwise, the most commonly used name is chosen.

  3. If all names are equally common, the most recently used name is
     chosen.
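
The three rules above can be sketched roughly as follows. This is only
an illustration in Python, not the script used to prepare the patch: the
NameUse record, the helper names, and the tie-breaking on (count,
last_used) are assumptions made for the example, and accent-only
differences are detected by stripping Unicode combining marks.

```python
import unicodedata
from dataclasses import dataclass


@dataclass
class NameUse:
    """Hypothetical record of how one spelling of a name was used."""
    name: str       # author name as recorded in commits
    count: int      # number of commits using this spelling
    last_used: int  # Unix timestamp of the newest such commit


def strip_accents(text):
    """Return text with combining marks removed (e.g. 'é' becomes 'e')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def pick_master_name(uses):
    """Pick one spelling for an email that has several names."""
    names = {u.name for u in uses}
    # Rule 1: if the names differ only by accents, prefer an accented form.
    if len({strip_accents(n) for n in names}) == 1:
        accented = [u for u in uses if strip_accents(u.name) != u.name]
        if accented:
            uses = accented
    # Rules 2 and 3: most commits first, most recent use breaks ties.
    best = max(uses, key=lambda u: (u.count, u.last_used))
    return best.name
```

Comparing on the tuple (count, last_used) folds rules 2 and 3 into a
single step; it also breaks ties among only the most common names, which
is a slight generalization of rule 3 as worded.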

Signed-off-by: Jeff King <peff@peff.net>
---
I'm cc-ing all involved authors. If you object or want to normalize your
name in some other way, please let me know.

 .mailmap | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/.mailmap b/.mailmap
index bcf4f87..69301bd 100644
--- a/.mailmap
+++ b/.mailmap
@@ -9,7 +9,9 @@ Chris Shoemaker <c.shoemaker@cox.net>
 Alexander Gavrilov <angavrilov@gmail.com>
 Aneesh Kumar K.V <aneesh.kumar@gmail.com>
 Brian M. Carlson <sandals@crustytoothpaste.ath.cx>
+Cheng Renquan <crquan@gmail.com>
 Chris Shoemaker <c.shoemaker@cox.net>
+Dan Johnson <computerdruid@gmail.com>
 Dana L. How <danahow@gmail.com>
 Dana L. How <how@deathvalley.cswitch.com>
 Daniel Barkalow <barkalow@iabervon.org>
@@ -18,13 +20,16 @@ Horst H. von Brand <vonbrand@inf.utfsm.cl>
 David S. Miller <davem@davemloft.net>
 Deskin Miller <deskinm@umich.edu>
 Dirk Süsserott <newsletter@dirk.my1.cc>
+Eric S. Raymond <esr@thyrsus.com>
 Erik Faye-Lund <kusmabite@gmail.com> <kusmabite@googlemail.com>
 Fredrik Kuivinen <freku045@student.liu.se>
+Frédéric Heitzmann <frederic.heitzmann@gmail.com>
 H. Peter Anvin <hpa@bonde.sc.orionmulti.com>
 H. Peter Anvin <hpa@tazenda.sc.orionmulti.com>
 H. Peter Anvin <hpa@trantor.hos.anvin.org>
 Horst H. von Brand <vonbrand@inf.utfsm.cl>
 İsmail Dönmez <ismail@pardus.org.tr>
+Jakub Narębski <jnareb@gmail.com>
 Jay Soffian <jaysoffian+git@gmail.com>
 Joachim Berdal Haga <cjhaga@fys.uio.no>
 Johannes Sixt <j6t@kdbg.org> <johannes.sixt@telecom.at>
@@ -41,11 +46,14 @@ Lukas Sandström <lukass@etek.chalmers.se>
 Junio C Hamano <gitster@pobox.com> <junio@kernel.org>
 Junio C Hamano <gitster@pobox.com> <junkio@cox.net>
 Karl Hasselström <kha@treskal.com>
+Kevin Leung <kevinlsk@gmail.com>
 Kent Engstrom <kent@lysator.liu.se>
 Lars Doelle <lars.doelle@on-line ! de>
 Lars Doelle <lars.doelle@on-line.de>
 Li Hong <leehong@pku.edu.cn>
 Lukas Sandström <lukass@etek.chalmers.se>
+Marc-André Lureau <marcandre.lureau@gmail.com>
+Mark Rada <marada@uwaterloo.ca>
 Martin Langhoff <martin@laptop.org>
 Martin von Zweigbergk <martinvonz@gmail.com> <martin.von.zweigbergk@gmail.com>
 Michael Coleman <tutufan@gmail.com>
@@ -63,11 +71,13 @@ Steven Grimm <koreth@midwinter.com>
 Ramsay Allan Jones <ramsay@ramsay1.demon.co.uk>
 René Scharfe <rene.scharfe@lsrfire.ath.cx>
 Robert Fitzsimons <robfitz@273k.net>
+Robert Zeh <robert.a.zeh@gmail.com>
 Sam Vilain <sam@vilain.net>
 Santi Béjar <sbejar@gmail.com>
 Sean Estabrooks <seanlkml@sympatico.ca>
 Shawn O. Pearce <spearce@spearce.org>
 Steven Grimm <koreth@midwinter.com>
+Tay Ray Chuan <rctay89@gmail.com>
 Theodore Ts'o <tytso@mit.edu>
 Thomas Rast <trast@inf.ethz.ch> <trast@student.ethz.ch>
 Tony Luck <tony.luck@intel.com>
-- 
1.8.0.2.4.g59402aa


* [PATCH 2/5] .mailmap: fix broken entry for Martin Langhoff
From: Jeff King @ 2012-12-12 11:38 UTC
  To: git; +Cc: Martin Langhoff

Commit adc3192 (Martin Langhoff has a new e-mail address,
2010-10-05) added a mailmap entry, but forgot that both the
old and new email addresses need to appear for one to be
mapped to the other (i.e., we do not key mailmap emails by
name).
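
To illustrate why the single-address entry had no effect, here is a
much-simplified model of a mailmap lookup in Python. It is a sketch
under assumptions, not git's implementation: parse_mailmap and resolve
are hypothetical helpers, and only the two entry forms touched by this
series ("Name <email>" and "Name <proper> <commit>") are handled.

```python
import re

# "Name <first>" optionally followed by a second "<second>" address.
ENTRY = re.compile(
    r"^(?P<name>[^<]*?)\s*<(?P<first>[^>]*)>(?:\s*<(?P<second>[^>]*)>)?\s*$"
)


def parse_mailmap(text):
    """Map each commit email to the (name, email) it should display as."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = ENTRY.match(line)
        if not m:
            continue
        if m.group("second") is None:
            # "Name <email>": only the name is rewritten for that email.
            proper_email = commit_email = m.group("first")
        else:
            # "Name <proper> <commit>": name and email are both rewritten.
            proper_email, commit_email = m.group("first"), m.group("second")
        # The lookup key is always the commit email, never the name.
        table[commit_email.lower()] = (m.group("name"), proper_email)
    return table


def resolve(table, name, email):
    """Return the mapped identity, or the original one if nothing matches."""
    return table.get(email.lower(), (name, email))
```

With only "Martin Langhoff <martin@laptop.org>" in the table, a commit
from martin@catalyst.net.nz finds no key and is left unchanged; once the
old address appears as the second address, the lookup succeeds.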

Signed-off-by: Jeff King <peff@peff.net>
---
 .mailmap | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.mailmap b/.mailmap
index 69301bd..e370e86 100644
--- a/.mailmap
+++ b/.mailmap
@@ -54,7 +54,7 @@ Mark Rada <marada@uwaterloo.ca>
 Lukas Sandström <lukass@etek.chalmers.se>
 Marc-André Lureau <marcandre.lureau@gmail.com>
 Mark Rada <marada@uwaterloo.ca>
-Martin Langhoff <martin@laptop.org>
+Martin Langhoff <martin@laptop.org> <martin@catalyst.net.nz>
 Martin von Zweigbergk <martinvonz@gmail.com> <martin.von.zweigbergk@gmail.com>
 Michael Coleman <tutufan@gmail.com>
 Michael J Gruber <git@drmicha.warpmail.net> <michaeljgruber+gmane@fastmail.fm>
-- 
1.8.0.2.4.g59402aa


* [PATCH 3/5] .mailmap: normalize emails for Jeff King
From: Jeff King @ 2012-12-12 11:38 UTC
  To: git

I never meant anything special by using my @github.com
address; it is merely a mistake that it has sometimes bled
through to patches.

Signed-off-by: Jeff King <peff@peff.net>
---
 .mailmap | 1 +
 1 file changed, 1 insertion(+)

diff --git a/.mailmap b/.mailmap
index e370e86..4a27b7f 100644
--- a/.mailmap
+++ b/.mailmap
@@ -31,6 +31,7 @@ Jay Soffian <jaysoffian+git@gmail.com>
 İsmail Dönmez <ismail@pardus.org.tr>
 Jakub Narębski <jnareb@gmail.com>
 Jay Soffian <jaysoffian+git@gmail.com>
+Jeff King <peff@peff.net> <peff@github.com>
 Joachim Berdal Haga <cjhaga@fys.uio.no>
 Johannes Sixt <j6t@kdbg.org> <johannes.sixt@telecom.at>
 Johannes Sixt <j6t@kdbg.org> <j.sixt@viscovery.net>
-- 
1.8.0.2.4.g59402aa


* [PATCH 4/5] .mailmap: normalize emails for Linus Torvalds
From: Jeff King @ 2012-12-12 11:41 UTC
  To: git; +Cc: Linus Torvalds

Linus used a lot of different per-machine email addresses in
the early days. This means that "git shortlog -nse" does not
aggregate his counts, and he is listed well below where he
should be (8th instead of 3rd).
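
The effect on the ranking can be sketched with a toy tally in Python.
This only models the idea of counting commits per (name, email) pair; it
is not how shortlog is implemented, shortlog_counts is a hypothetical
helper, and any commit list fed to it in an example would be made up.
The alias table uses the addresses from the diff below.

```python
from collections import Counter

CANONICAL = "torvalds@linux-foundation.org"

# Old per-machine addresses folded into the canonical one by this patch.
ALIASES = {
    "torvalds@woody.linux-foundation.org": CANONICAL,
    "torvalds@osdl.org": CANONICAL,
    "torvalds@g5.osdl.org": CANONICAL,
    "torvalds@evo.osdl.org": CANONICAL,
    "torvalds@ppc970.osdl.org": CANONICAL,
    "torvalds@ppc970.osdl.org.(none)": CANONICAL,
}


def shortlog_counts(commits, aliases=None):
    """Count commits per (name, email), most frequent first.

    `commits` is an iterable of (name, email) pairs; `aliases` maps an
    old email to the address it should be counted under.
    """
    aliases = aliases or {}
    counts = Counter()
    for name, email in commits:
        counts[(name, aliases.get(email, email))] += 1
    return counts.most_common()
```

Without the alias table each address is its own row, so an author with
many addresses is split into several small counts; with it, those rows
collapse into one and the author moves up the list.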

Signed-off-by: Jeff King <peff@peff.net>
---
Linus,

I recall you considered "email ident from random machine" as a feature
very early on in git's history, but you seem to have settled on using
the linux-foundation address pretty consistently these days. Please let
me know if you object to normalizing your entries in this way.

 .mailmap | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/.mailmap b/.mailmap
index 4a27b7f..c7e8618 100644
--- a/.mailmap
+++ b/.mailmap
@@ -52,6 +52,12 @@ Li Hong <leehong@pku.edu.cn>
 Lars Doelle <lars.doelle@on-line ! de>
 Lars Doelle <lars.doelle@on-line.de>
 Li Hong <leehong@pku.edu.cn>
+Linus Torvalds <torvalds@linux-foundation.org> <torvalds@woody.linux-foundation.org>
+Linus Torvalds <torvalds@linux-foundation.org> <torvalds@osdl.org>
+Linus Torvalds <torvalds@linux-foundation.org> <torvalds@g5.osdl.org>
+Linus Torvalds <torvalds@linux-foundation.org> <torvalds@evo.osdl.org>
+Linus Torvalds <torvalds@linux-foundation.org> <torvalds@ppc970.osdl.org>
+Linus Torvalds <torvalds@linux-foundation.org> <torvalds@ppc970.osdl.org.(none)>
 Lukas Sandström <lukass@etek.chalmers.se>
 Marc-André Lureau <marcandre.lureau@gmail.com>
 Mark Rada <marada@uwaterloo.ca>
-- 
1.8.0.2.4.g59402aa


* [PATCH 5/5] contrib: update stats/mailmap script
From: Jeff King @ 2012-12-12 11:41 UTC
  To: git

This version changes quite a few things:

  1. The original parsed the mailmap file itself, and it did
     it wrong (it did not understand entries with an extra
     email key).

     Instead, this version uses git's "%aE" and "%aN"
     formats to have git perform the mapping, meaning we do
     not have to read .mailmap at all, but still operate on
     the current state that git sees (and it also works
     properly from subdirs).

  2. The original would find multiple names for an email,
     but not the other way around.

     This version can do either or both. If we find multiple
     emails for a name, the resolution is less obvious than
     the other way around. However, it can still be a
     starting point for a human to investigate.

  3. The original would order only by count, not by recency.

     This version can do either. Combined with showing the
     counts, it can be easier to decide how to resolve.

  4. This version shows similar entries in a blank-delimited
     stanza, which makes it more clear which options you are
     picking from.
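
For readers who do not read Perl, the bookkeeping described in points 2
to 4 can be approximated in Python as below. This is a loose port for
illustration only, not part of the patch: collect and stanzas are
hypothetical names, and the input is assumed to be lines in the
"%at <%aE> %aN" layout that the script asks git to print.

```python
import re
from collections import defaultdict

# One line per commit: "<timestamp> <<email>> <name>".
LINE = re.compile(r"(\S+) <([^>]*)> (.*)")


def new_table():
    """key -> value -> {'count': commits seen, 'stamp': newest timestamp}."""
    return defaultdict(lambda: defaultdict(lambda: {"count": 0, "stamp": 0}))


def mark(table, key, value, stamp):
    """Record one commit that pairs `key` with `value`."""
    entry = table[key][value]
    entry["count"] += 1
    entry["stamp"] = max(entry["stamp"], stamp)


def collect(lines):
    """Build both views: names per email, and emails per name."""
    by_email, by_name = new_table(), new_table()
    for line in lines:
        m = LINE.match(line.strip())
        if not m:
            continue  # skip lines that do not have the expected layout
        stamp, email, name = int(m.group(1)), m.group(2), m.group(3)
        mark(by_email, email, name, stamp)
        mark(by_name, name, email, stamp)
    return by_email, by_name


def stanzas(table, order_by="count"):
    """Yield (key, values) for keys with more than one value.

    Values are sorted by `order_by` ('count' or 'stamp'), highest first.
    """
    for key, values in table.items():
        if len(values) < 2:
            continue
        ordered = sorted(values, key=lambda v: values[v][order_by], reverse=True)
        yield key, ordered
```

Passing by_email to stanzas corresponds to looking for multiple names on
one email, and by_name to the reverse; order_by switches between
ordering by commit count and by recency.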

Signed-off-by: Jeff King <peff@peff.net>
---
 contrib/stats/mailmap.pl | 108 ++++++++++++++++++++++++++++++-----------------
 1 file changed, 70 insertions(+), 38 deletions(-)
 rewrite contrib/stats/mailmap.pl (97%)

diff --git a/contrib/stats/mailmap.pl b/contrib/stats/mailmap.pl
dissimilarity index 97%
index 4b852e2..9513f5e 100755
--- a/contrib/stats/mailmap.pl
+++ b/contrib/stats/mailmap.pl
@@ -1,38 +1,70 @@
-#!/usr/bin/perl -w
-my %mailmap = ();
-open I, "<", ".mailmap";
-while (<I>) {
-	chomp;
-	next if /^#/;
-	if (my ($author, $mail) = /^(.*?)\s+<(.+)>$/) {
-		$mailmap{$mail} = $author;
-	}
-}
-close I;
-
-my %mail2author = ();
-open I, "git log --pretty='format:%ae	%an' |";
-while (<I>) {
-	chomp;
-	my ($mail, $author) = split(/\t/, $_);
-	next if exists $mailmap{$mail};
-	$mail2author{$mail} ||= {};
-	$mail2author{$mail}{$author} ||= 0;
-	$mail2author{$mail}{$author}++;
-}
-close I;
-
-while (my ($mail, $authorcount) = each %mail2author) {
-	# %$authorcount is ($author => $count);
-	# sort and show the names from the most frequent ones.
-	my @names = (map { $_->[0] }
-		sort { $b->[1] <=> $a->[1] }
-		map { [$_, $authorcount->{$_}] }
-		keys %$authorcount);
-	if (1 < @names) {
-		for (@names) {
-			print "$_ <$mail>\n";
-		}
-	}
-}
-
+#!/usr/bin/perl
+
+use warnings 'all';
+use strict;
+use Getopt::Long;
+
+my $match_emails;
+my $match_names;
+my $order_by = 'count';
+Getopt::Long::Configure(qw(bundling));
+GetOptions(
+	'emails|e!' => \$match_emails,
+	'names|n!'  => \$match_names,
+	'count|c'   => sub { $order_by = 'count' },
+	'time|t'    => sub { $order_by = 'stamp' },
+) or exit 1;
+$match_emails = 1 unless $match_names;
+
+my $email = {};
+my $name = {};
+
+open(my $fh, '-|', "git log --format='%at <%aE> %aN'") or die "cannot run git log: $!";
+while(<$fh>) {
+	my ($t, $e, $n) = /(\S+) <([^>]*)> (.*)/ or next; # skip unparsable lines
+	mark($email, $e, $n, $t);
+	mark($name, $n, $e, $t);
+}
+close($fh);
+
+if ($match_emails) {
+	foreach my $e (dups($email)) {
+		foreach my $n (vals($email->{$e})) {
+			show($n, $e, $email->{$e}->{$n});
+		}
+		print "\n";
+	}
+}
+if ($match_names) {
+	foreach my $n (dups($name)) {
+		foreach my $e (vals($name->{$n})) {
+			show($n, $e, $name->{$n}->{$e});
+		}
+		print "\n";
+	}
+}
+exit 0;
+
+sub mark {
+	my ($h, $k, $v, $t) = @_;
+	my $e = $h->{$k}->{$v} ||= { count => 0, stamp => 0 };
+	$e->{count}++;
+	$e->{stamp} = $t unless $t < $e->{stamp};
+}
+
+sub dups {
+	my $h = shift;
+	return grep { keys(%{$h->{$_}}) > 1 } keys(%$h);
+}
+
+sub vals {
+	my $h = shift;
+	return sort {
+		$h->{$b}->{$order_by} <=> $h->{$a}->{$order_by}
+	} keys(%$h);
+}
+
+sub show {
+	my ($n, $e, $h) = @_;
+	print "$n <$e> ($h->{$order_by})\n";
+}
-- 
1.8.0.2.4.g59402aa
