[PATCH v2 0/5] Hashed mailmap

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* [PATCH v2 0/5] Hashed mailmap
@ 2021-01-03 21:18 brian m. carlson
  2021-01-03 21:18 ` [PATCH v2 1/5] mailmap: add a function to inspect the number of entries brian m. carlson
                   ` (4 more replies)
  0 siblings, 5 replies; 19+ messages in thread
From: brian m. carlson @ 2021-01-03 21:18 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Ævar Arnfjörð Bjarmason, Phillip Wood

Many people, through the course of their lives, will change either a
name or an email address.  For this reason, we have the mailmap, to map
from a user's former name or email address to their current, canonical
forms.  Normally, this works well as it is.

However, sometimes people change a name (or an email) and want to
completely cease use of the former name or email.  This could be because
a transgender person has transitioned, because a person has left an
abusive partner or broken ties with an abusive family member, or for any
other number of good and valuable reasons.  In these cases, placing the
former name in the .mailmap may be undesirable.

For those situations, let's introduce a hashed mailmap, where the user's
former name or email address can be in the form @sha256:<hash>.  This
obscures the former name or email.

In the course of experimenting with some solutions for v2, I noticed
that our mailmap support has a bunch of problems with case sensitivity.
Notably, it treats local-parts of email addresses in a case-insensitive
way, when the RFC specifically says that they are case sensitive, and we
also treat names case insensitively, but only for ASCII characters.
Both of those have been fixed here, and the commit messages explain in
lurid detail why, while incompatible, this is the correct behavior.

I've also added some performance numbers and explained some alternate
solutions in the commit message for the final patch.  That's in addition
to the performance improvements I've done so that the feature is both
cheaper for users and nearly invisible for non-users.  That isn't quite
the same as adding a perf test, which I haven't done, but I think this
explains the situation quite well.  If folks are still dying for a perf
test, I can add one in v3.

I will point out that fully hashing a mailmap isn't necessarily cheap,
but how expensive it is depends on the weighting of current and former
members of the project.  As mentioned in the original thread, I think a
hash rather than an encoding is the right choice here.  It is likely
that in a few iterations of hardware, all users will have accelerated
SHA-256 and the cost will end up being a handful of cycles per name
overall.

Changes from v1:
* Fix case-sensitivity problems in the mailmap.
* Add documentation.
* Add explanation of how to compute the value.
* Add some optimizations to improve performance.
* Improve commit message to discuss performance numbers and explain
  rationale better.

brian m. carlson (5):
  mailmap: add a function to inspect the number of entries
  mailmap: switch to opaque struct
  t4203: add failing test for case-sensitive local-parts and names
  mailmap: use case-sensitive comparisons for local-parts and names
  mailmap: support hashed entries in mailmaps

 Documentation/mailmap.txt |  28 ++++++++
 builtin/blame.c           |   2 +-
 builtin/check-mailmap.c   |   4 +-
 builtin/commit.c          |   2 +-
 mailmap.c                 | 139 +++++++++++++++++++++++++++++++++-----
 mailmap.h                 |  15 ++--
 pretty.c                  |   4 +-
 pretty.h                  |   2 +-
 revision.c                |   2 +-
 revision.h                |   3 +-
 shortlog.h                |   3 +-
 t/t4203-mailmap.sh        |  64 +++++++++++++++++-
 12 files changed, 236 insertions(+), 32 deletions(-)

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v2 1/5] mailmap: add a function to inspect the number of entries
  2021-01-03 21:18 [PATCH v2 0/5] Hashed mailmap brian m. carlson
@ 2021-01-03 21:18 ` brian m. carlson
  2021-01-04 15:14   ` Ævar Arnfjörð Bjarmason
  2021-01-04 17:04   ` René Scharfe
  2021-01-03 21:18 ` [PATCH v2 2/5] mailmap: switch to opaque struct brian m. carlson
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 19+ messages in thread
From: brian m. carlson @ 2021-01-03 21:18 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Ævar Arnfjörð Bjarmason, Phillip Wood

We're soon going to change the type of our mailmap into an opaque struct
so we can add features and improve performance.  When we do so, it won't
be possible for users to inspect its internals to determine how many
items are present, so let's introduce a function that lets users inquire
how many objects are in the mailmap and use it where we want this
information.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 mailmap.c | 5 +++++
 mailmap.h | 1 +
 pretty.c  | 2 +-
 3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/mailmap.c b/mailmap.c
index 962fd86d6d..c9a538f4e2 100644
--- a/mailmap.c
+++ b/mailmap.c
@@ -361,3 +361,8 @@ int map_user(struct string_list *map,
 	debug_mm("map_user:  --\n");
 	return 0;
 }
+
+int mailmap_entries(struct string_list *map)
+{
+	return map->nr;
+}
diff --git a/mailmap.h b/mailmap.h
index d0e65646cb..ff57b05a15 100644
--- a/mailmap.h
+++ b/mailmap.h
@@ -5,6 +5,7 @@ struct string_list;
 
 int read_mailmap(struct string_list *map, char **repo_abbrev);
 void clear_mailmap(struct string_list *map);
+int mailmap_entries(struct string_list *map);
 
 int map_user(struct string_list *map,
 			 const char **email, size_t *emaillen, const char **name, size_t *namelen);
diff --git a/pretty.c b/pretty.c
index 7a7708a0ea..43a0039870 100644
--- a/pretty.c
+++ b/pretty.c
@@ -681,7 +681,7 @@ static int mailmap_name(const char **email, size_t *email_len,
 		mail_map = xcalloc(1, sizeof(*mail_map));
 		read_mailmap(mail_map, NULL);
 	}
-	return mail_map->nr && map_user(mail_map, email, email_len, name, name_len);
+	return mailmap_entries(mail_map) && map_user(mail_map, email, email_len, name, name_len);
 }
 
 static size_t format_person_part(struct strbuf *sb, char part,

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v2 2/5] mailmap: switch to opaque struct
  2021-01-03 21:18 [PATCH v2 0/5] Hashed mailmap brian m. carlson
  2021-01-03 21:18 ` [PATCH v2 1/5] mailmap: add a function to inspect the number of entries brian m. carlson
@ 2021-01-03 21:18 ` brian m. carlson
  2021-01-04 15:17   ` Ævar Arnfjörð Bjarmason
  2021-01-03 21:18 ` [PATCH v2 3/5] t4203: add failing test for case-sensitive local-parts and names brian m. carlson
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: brian m. carlson @ 2021-01-03 21:18 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Ævar Arnfjörð Bjarmason, Phillip Wood

Currently a mailmap is a simple sorted string list.  However, we'll soon
want to add additional data to our mailmap, and in order to do so
effectively, we'll need a struct of our own.  To do that, let's create a
struct mailmap and use that everywhere we're creating a mailmap.

Note that we no longer explicitly initialize our mailmaps, since
read_mailmap will do that for us.  We also don't need to explicitly set
the flag to strdup strings, since we've passed that argument when we
initialize the string list.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 builtin/blame.c         |  2 +-
 builtin/check-mailmap.c |  4 ++--
 builtin/commit.c        |  2 +-
 mailmap.c               | 20 +++++++++++++-------
 mailmap.h               | 14 +++++++++-----
 pretty.c                |  2 +-
 pretty.h                |  2 +-
 revision.c              |  2 +-
 revision.h              |  3 ++-
 shortlog.h              |  3 ++-
 10 files changed, 33 insertions(+), 21 deletions(-)

diff --git a/builtin/blame.c b/builtin/blame.c
index 6f7e32411a..d48dbbd005 100644
--- a/builtin/blame.c
+++ b/builtin/blame.c
@@ -60,7 +60,7 @@ static int mark_ignored_lines;
 static struct date_mode blame_date_mode = { DATE_ISO8601 };
 static size_t blame_date_width;
 
-static struct string_list mailmap = STRING_LIST_INIT_NODUP;
+static struct mailmap mailmap;
 
 #ifndef DEBUG_BLAME
 #define DEBUG_BLAME 0
diff --git a/builtin/check-mailmap.c b/builtin/check-mailmap.c
index cdce144f3b..ad155e9092 100644
--- a/builtin/check-mailmap.c
+++ b/builtin/check-mailmap.c
@@ -15,7 +15,7 @@ static const struct option check_mailmap_options[] = {
 	OPT_END()
 };
 
-static void check_mailmap(struct string_list *mailmap, const char *contact)
+static void check_mailmap(struct mailmap *mailmap, const char *contact)
 {
 	const char *name, *mail;
 	size_t namelen, maillen;
@@ -39,7 +39,7 @@ static void check_mailmap(struct string_list *mailmap, const char *contact)
 int cmd_check_mailmap(int argc, const char **argv, const char *prefix)
 {
 	int i;
-	struct string_list mailmap = STRING_LIST_INIT_NODUP;
+	struct mailmap mailmap;
 
 	git_config(git_default_config, NULL);
 	argc = parse_options(argc, argv, prefix, check_mailmap_options,
diff --git a/builtin/commit.c b/builtin/commit.c
index 505fe60956..2d69847c49 100644
--- a/builtin/commit.c
+++ b/builtin/commit.c
@@ -1027,7 +1027,7 @@ static const char *find_author_by_nickname(const char *name)
 	struct rev_info revs;
 	struct commit *commit;
 	struct strbuf buf = STRBUF_INIT;
-	struct string_list mailmap = STRING_LIST_INIT_NODUP;
+	struct mailmap mailmap;
 	const char *av[20];
 	int ac = 0;
 
diff --git a/mailmap.c b/mailmap.c
index c9a538f4e2..d3287b409a 100644
--- a/mailmap.c
+++ b/mailmap.c
@@ -237,11 +237,14 @@ static int read_mailmap_blob(struct string_list *map,
 	return 0;
 }
 
-int read_mailmap(struct string_list *map, char **repo_abbrev)
+int read_mailmap(struct mailmap *mailmap, char **repo_abbrev)
 {
 	int err = 0;
+	struct string_list *map;
+
+	map = mailmap->mailmap = malloc(sizeof(*mailmap->mailmap));
+	string_list_init(map, 1);
 
-	map->strdup_strings = 1;
 	map->cmp = namemap_cmp;
 
 	if (!git_mailmap_blob && is_bare_repository())
@@ -254,11 +257,14 @@ int read_mailmap(struct string_list *map, char **repo_abbrev)
 	return err;
 }
 
-void clear_mailmap(struct string_list *map)
+void clear_mailmap(struct mailmap *mailmap)
 {
+	struct string_list *map = mailmap->mailmap;
 	debug_mm("mailmap: clearing %d entries...\n", map->nr);
 	map->strdup_strings = 1;
 	string_list_clear_func(map, free_mailmap_entry);
+	string_list_clear(map, 1);
+	free(map);
 	debug_mm("mailmap: cleared\n");
 }
 
@@ -313,7 +319,7 @@ static struct string_list_item *lookup_prefix(struct string_list *map,
 	return NULL;
 }
 
-int map_user(struct string_list *map,
+int map_user(struct mailmap *map,
 	     const char **email, size_t *emaillen,
 	     const char **name, size_t *namelen)
 {
@@ -324,7 +330,7 @@ int map_user(struct string_list *map,
 		 (int)*namelen, debug_str(*name),
 		 (int)*emaillen, debug_str(*email));
 
-	item = lookup_prefix(map, *email, *emaillen);
+	item = lookup_prefix(map->mailmap, *email, *emaillen);
 	if (item != NULL) {
 		me = (struct mailmap_entry *)item->util;
 		if (me->namemap.nr) {
@@ -362,7 +368,7 @@ int map_user(struct string_list *map,
 	return 0;
 }
 
-int mailmap_entries(struct string_list *map)
+int mailmap_entries(struct mailmap *map)
 {
-	return map->nr;
+	return map->mailmap->nr;
 }
diff --git a/mailmap.h b/mailmap.h
index ff57b05a15..4cdce3b064 100644
--- a/mailmap.h
+++ b/mailmap.h
@@ -1,13 +1,17 @@
 #ifndef MAILMAP_H
 #define MAILMAP_H
 
-struct string_list;
+#include "string-list.h"
 
-int read_mailmap(struct string_list *map, char **repo_abbrev);
-void clear_mailmap(struct string_list *map);
-int mailmap_entries(struct string_list *map);
+struct mailmap {
+	struct string_list *mailmap;
+};
 
-int map_user(struct string_list *map,
+int read_mailmap(struct mailmap *map, char **repo_abbrev);
+void clear_mailmap(struct mailmap *map);
+int mailmap_entries(struct mailmap *map);
+
+int map_user(struct mailmap *map,
 			 const char **email, size_t *emaillen, const char **name, size_t *namelen);
 
 #endif
diff --git a/pretty.c b/pretty.c
index 43a0039870..0dc2c98e4a 100644
--- a/pretty.c
+++ b/pretty.c
@@ -676,7 +676,7 @@ const char *repo_logmsg_reencode(struct repository *r,
 static int mailmap_name(const char **email, size_t *email_len,
 			const char **name, size_t *name_len)
 {
-	static struct string_list *mail_map;
+	static struct mailmap *mail_map;
 	if (!mail_map) {
 		mail_map = xcalloc(1, sizeof(*mail_map));
 		read_mailmap(mail_map, NULL);
diff --git a/pretty.h b/pretty.h
index 7ce6c0b437..15735a4c51 100644
--- a/pretty.h
+++ b/pretty.h
@@ -40,7 +40,7 @@ struct pretty_print_context {
 	struct reflog_walk_info *reflog_info;
 	struct rev_info *rev;
 	const char *output_encoding;
-	struct string_list *mailmap;
+	struct mailmap *mailmap;
 	int color;
 	struct ident_split *from_ident;
 	unsigned encode_email_headers:1;
diff --git a/revision.c b/revision.c
index 9dff845bed..848f43d88b 100644
--- a/revision.c
+++ b/revision.c
@@ -3659,7 +3659,7 @@ int rewrite_parents(struct rev_info *revs, struct commit *commit,
 	return 0;
 }
 
-static int commit_rewrite_person(struct strbuf *buf, const char *what, struct string_list *mailmap)
+static int commit_rewrite_person(struct strbuf *buf, const char *what, struct mailmap *mailmap)
 {
 	char *person, *endp;
 	size_t len, namelen, maillen;
diff --git a/revision.h b/revision.h
index 086ff10280..06bc127e90 100644
--- a/revision.h
+++ b/revision.h
@@ -7,6 +7,7 @@
 #include "notes.h"
 #include "pretty.h"
 #include "diff.h"
+#include "mailmap.h"
 #include "commit-slab-decl.h"
 
 /**
@@ -241,7 +242,7 @@ struct rev_info {
 	int		patch_name_max;
 	int		no_inline;
 	int		show_log_size;
-	struct string_list *mailmap;
+	struct mailmap *mailmap;
 
 	/* Filter by commit log message */
 	struct grep_opt	grep_filter;
diff --git a/shortlog.h b/shortlog.h
index 64be879b24..9ebf7bbb9c 100644
--- a/shortlog.h
+++ b/shortlog.h
@@ -2,6 +2,7 @@
 #define SHORTLOG_H
 
 #include "string-list.h"
+#include "mailmap.h"
 
 struct commit;
 
@@ -25,7 +26,7 @@ struct shortlog {
 
 	char *common_repo_prefix;
 	int email;
-	struct string_list mailmap;
+	struct mailmap mailmap;
 	FILE *file;
 };
 

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v2 3/5] t4203: add failing test for case-sensitive local-parts and names
  2021-01-03 21:18 [PATCH v2 0/5] Hashed mailmap brian m. carlson
  2021-01-03 21:18 ` [PATCH v2 1/5] mailmap: add a function to inspect the number of entries brian m. carlson
  2021-01-03 21:18 ` [PATCH v2 2/5] mailmap: switch to opaque struct brian m. carlson
@ 2021-01-03 21:18 ` brian m. carlson
  2021-01-03 21:18 ` [PATCH v2 4/5] mailmap: use case-sensitive comparisons for " brian m. carlson
  2021-01-03 21:18 ` [PATCH v2 5/5] mailmap: support hashed entries in mailmaps brian m. carlson
  4 siblings, 0 replies; 19+ messages in thread
From: brian m. carlson @ 2021-01-03 21:18 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Ævar Arnfjörð Bjarmason, Phillip Wood

Currently, Git always looks up entries in the mailmap in a
case-insensitive way, both for names and addresses, which is, as
explained below, suboptimal.

First, for email addresses, RFC 5321 is clear that only domains are case
insensitive; local-parts (the portion before the at sign) are not.  It
states this:

  The local-part of a mailbox MUST BE treated as case sensitive.
  Therefore, SMTP implementations MUST take care to preserve the case of
  mailbox local-parts.

There exist systems today where local-parts remain case sensitive (and
this author has one), and as such, it's incorrect for us to case fold
them in any way.  Let's add a failing test that indicates this is a
problem, while still keeping the test for case-insensitive domains.

Note that it's also incorrect for us to case-fold names because we don't
guarantee that we're using the locale of the author, and it's impossible
to case-fold names in a locale-insensitive way.  Turkish and Azeri
contain both a dotted and dotless I, and the uppercase ASCII I folds not
to the lowercase ASCII I, but to a dotless version, and vice versa with
the lowercase I.  There are many words in Turkish which differ only in
the dottedness of the I, so it is likely that there are also personal
names which differ in the same way.

That would be a problem even if our implementation were perfect, which
it is not.  We currently fold only ASCII characters, so this feature has
never worked correctly for the vast majority of the users on the planet,
regardless of the locale.  For example, on Linux, even in a Spanish
locale, we don't handle "Simón" properly.  Even if we did handle that,
we'd probably also want to implement Unicode normalization, which we
don't.

In general, case-folding text is extremely language- and locale-specific
and requires intimacy with the spelling and grammar of the language in
question and careful attention to the Unicode details in order to
produce a result that is meaningful to humans and conforms with
linguistic and societal norms.

Because we do not have any of the required context with a plain personal
name, we cannot hope to possibly case-fold personal names correctly.  We
should stop trying to do so and just treat them as a series of bytes, so
let's add a test that we don't case-fold personal names as well.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 t/t4203-mailmap.sh | 29 +++++++++++++++++++++++++++--
 1 file changed, 27 insertions(+), 2 deletions(-)

diff --git a/t/t4203-mailmap.sh b/t/t4203-mailmap.sh
index 586c3a86b1..32e849504c 100755
--- a/t/t4203-mailmap.sh
+++ b/t/t4203-mailmap.sh
@@ -170,10 +170,35 @@ Repo Guy (1):

 EOF

-test_expect_success 'name entry after email entry, case-insensitive' '
+test_expect_success 'name entry after email entry, case-insensitive domain' '
 	mkdir -p internal_mailmap &&
 	echo "<bugs@company.xy> <bugs@company.xx>" >internal_mailmap/.mailmap &&
-	echo "Internal Guy <BUGS@Company.xx>" >>internal_mailmap/.mailmap &&
+	echo "Internal Guy <bugs@Company.xx>" >>internal_mailmap/.mailmap &&
+	git shortlog HEAD >actual &&
+	test_cmp expect actual
+'
+
+cat >expect <<\EOF
+Repo Guy (1):
+      initial
+
+nick1 (1):
+      second
+
+EOF
+
+test_expect_failure 'name entry after email entry, case-sensitive local-part' '
+	mkdir -p internal_mailmap &&
+	echo "<bugs@company.xy> <bugs@company.xx>" >internal_mailmap/.mailmap &&
+	echo "Internal Guy <BUGS@company.xx>" >>internal_mailmap/.mailmap &&
+	git shortlog HEAD >actual &&
+	test_cmp expect actual
+'
+
+test_expect_failure 'name entry after email entry, case-sensitive personal name' '
+	mkdir -p internal_mailmap &&
+	echo "<bugs@company.xy> <bugs@company.xx>" >internal_mailmap/.mailmap &&
+	echo "Nick1 <bugs@company.xx> NICK1 <bugs@company.xx>" >internal_mailmap/.mailmap &&
 	git shortlog HEAD >actual &&
 	test_cmp expect actual
 '

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v2 4/5] mailmap: use case-sensitive comparisons for local-parts and names
  2021-01-03 21:18 [PATCH v2 0/5] Hashed mailmap brian m. carlson
                   ` (2 preceding siblings ...)
  2021-01-03 21:18 ` [PATCH v2 3/5] t4203: add failing test for case-sensitive local-parts and names brian m. carlson
@ 2021-01-03 21:18 ` brian m. carlson
  2021-01-04 16:10   ` Ævar Arnfjörð Bjarmason
  2021-01-03 21:18 ` [PATCH v2 5/5] mailmap: support hashed entries in mailmaps brian m. carlson
  4 siblings, 1 reply; 19+ messages in thread
From: brian m. carlson @ 2021-01-03 21:18 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Ævar Arnfjörð Bjarmason, Phillip Wood

RFC 5321 is clear that the local-part of an email address (the part
before the at sign) is case sensitive, and this has been the case since
the original RFC 821.  It directs us that "the local-part MUST be
interpreted and assigned semantics only by the host specified in the
domain part of the address."

Since we are not that party, it's not correct for us to compare them
case insensitively.  However, we do still want to compare the domain
parts case insensitively, so let's add a helper that downcases the
domain and then compare byte by byte.

Similarly, it's not possible for us to correctly case-fold text in a
locale-insensitive way, so our handling of personal names has also been
subject to bugs.  Additionally, we've never handled non-ASCII characters
correctly, which means that our previous comparisons really only worked
well for a fraction of the people on the planet.  Since our code wasn't
right and it's basically impossible to compare personal names without
regard to case, let's also switch our matching of names to be byte by
byte.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 mailmap.c          | 27 ++++++++++++++++++++++++---
 t/t4203-mailmap.sh |  4 ++--
 2 files changed, 26 insertions(+), 5 deletions(-)

diff --git a/mailmap.c b/mailmap.c
index d3287b409a..5c52dbb7e0 100644
--- a/mailmap.c
+++ b/mailmap.c
@@ -64,7 +64,22 @@ static void free_mailmap_entry(void *p, const char *s)
  */
 static int namemap_cmp(const char *a, const char *b)
 {
-	return strcasecmp(a, b);
+	return strcmp(a, b);
+}
+
+/*
+ * Lowercases the domain (and only the domain) part of an email address.  The
+ * local-part, which is defined by RFC 5321 to be case sensitive, is not
+ * affected.
+ */
+static char *lowercase_email(char *s)
+{
+	char *end = strchrnul(s, '@');
+	while (*end) {
+		*end = tolower(*end);
+		end++;
+	}
+	return s;
 }
 
 static void add_mapping(struct string_list *map,
@@ -74,9 +89,13 @@ static void add_mapping(struct string_list *map,
 	struct mailmap_entry *me;
 	struct string_list_item *item;
 
+	lowercase_email(new_email);
+
 	if (old_email == NULL) {
 		old_email = new_email;
 		new_email = NULL;
+	} else {
+		lowercase_email(old_email);
 	}
 
 	item = string_list_insert(map, old_email);
@@ -300,7 +319,7 @@ static struct string_list_item *lookup_prefix(struct string_list *map,
 	 * real location of the key if one exists.
 	 */
 	while (0 <= --i && i < map->nr) {
-		int cmp = strncasecmp(map->items[i].string, string, len);
+		int cmp = strncmp(map->items[i].string, string, len);
 		if (cmp < 0)
 			/*
 			 * "i" points at a key definitely below the prefix;
@@ -323,6 +342,7 @@ int map_user(struct mailmap *map,
 	     const char **email, size_t *emaillen,
 	     const char **name, size_t *namelen)
 {
+	char *searchable_email = xstrndup(*email, *emaillen);
 	struct string_list_item *item;
 	struct mailmap_entry *me;
 
@@ -330,7 +350,8 @@ int map_user(struct mailmap *map,
 		 (int)*namelen, debug_str(*name),
 		 (int)*emaillen, debug_str(*email));
 
-	item = lookup_prefix(map->mailmap, *email, *emaillen);
+	item = lookup_prefix(map->mailmap, searchable_email, *emaillen);
+	free(searchable_email);
 	if (item != NULL) {
 		me = (struct mailmap_entry *)item->util;
 		if (me->namemap.nr) {
diff --git a/t/t4203-mailmap.sh b/t/t4203-mailmap.sh
index 32e849504c..df4a0e03cc 100755
--- a/t/t4203-mailmap.sh
+++ b/t/t4203-mailmap.sh
@@ -187,7 +187,7 @@ nick1 (1):
 
 EOF
 
-test_expect_failure 'name entry after email entry, case-sensitive local-part' '
+test_expect_success 'name entry after email entry, case-sensitive local-part' '
 	mkdir -p internal_mailmap &&
 	echo "<bugs@company.xy> <bugs@company.xx>" >internal_mailmap/.mailmap &&
 	echo "Internal Guy <BUGS@company.xx>" >>internal_mailmap/.mailmap &&
@@ -195,7 +195,7 @@ test_expect_failure 'name entry after email entry, case-sensitive local-part' '
 	test_cmp expect actual
 '
 
-test_expect_failure 'name entry after email entry, case-sensitive personal name' '
+test_expect_success 'name entry after email entry, case-sensitive personal name' '
 	mkdir -p internal_mailmap &&
 	echo "<bugs@company.xy> <bugs@company.xx>" >internal_mailmap/.mailmap &&
 	echo "Nick1 <bugs@company.xx> NICK1 <bugs@company.xx>" >internal_mailmap/.mailmap &&

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v2 5/5] mailmap: support hashed entries in mailmaps
  2021-01-03 21:18 [PATCH v2 0/5] Hashed mailmap brian m. carlson
                   ` (3 preceding siblings ...)
  2021-01-03 21:18 ` [PATCH v2 4/5] mailmap: use case-sensitive comparisons for " brian m. carlson
@ 2021-01-03 21:18 ` brian m. carlson
  2021-01-05 14:21   ` Ævar Arnfjörð Bjarmason
  2021-01-05 20:05   ` Junio C Hamano
  4 siblings, 2 replies; 19+ messages in thread
From: brian m. carlson @ 2021-01-03 21:18 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Ævar Arnfjörð Bjarmason, Phillip Wood

Many people, through the course of their lives, will change either a
name or an email address.  For this reason, we have the mailmap, to map
from a user's former name or email address to their current, canonical
forms.  Normally, this works well as it is.

However, sometimes people change a name or an email address and wish to
wholly disassociate themselves from that former name or email address.
For example, a person may transition from one gender to another,
changing their name, or they may have changed their name to disassociate
themselves from an abusive family or partner.  In such a case, using the
former name or address in any way may be undesirable and the person may
wish to replace it as completely as possible.

For projects which wish to support this, introduce hashed forms into the
mailmap.  These forms, which start with "@sha256:" followed by a SHA-256
hash of the entry, can be used in place of the form used in the commit
field.  This form is intentionally designed to be unlikely to conflict
with legitimate use cases.  For example, this is not a valid email
address according to RFC 5322.  In the unlikely event that a user has
put such a form into the actual commit as their name, we will accept it.

While the form of the data is designed to accept multiple hash
algorithms, we intentionally do not support SHA-1.  There is little
reason to support such a weak algorithm in new use cases and no
backwards compatibility to consider.  Moreover, SHA-256 is faster than
the SHA1DC implementation we use, so this not only improves performance,
but simplifies the current implementation somewhat as well.

Note that it is, of course, possible to perform a lookup on all commit
objects to determine the actual entry which matches the hashed form of
the data.  However, this is an improvement over the status quo.

The performance of this patch with no hashed entries is very similar to
the performance without this patch.  Considering a git log command to
look up author and committer information on 981,680 commits in the Linux
kernel history, either with an unhashed mailmap or a mailmap with all
old values hashed:

                                   Shortest  Longest  Average  Change
  Git 2.30                         7.876     8.297    8.143
  This patch, unhashed             7.923     8.484    8.237    + 1.15%
  This patch, hashed               14.510    14.783   14.672   +80.17%
  This patch, hashed, unoptimized  15.425    16.318   15.901   +95.27%

Thus, the average performance after this patch is within normal
variation of the pre-patch performance.  It's unlikely that users will
notice the difference in practice, even on much larger
repositories, unless they're using the new feature.

To minimize the performance impact of the hashing process, we maintain a
reference count of each mailmap entry and when we encounter an entry we
must hash, we insert the same object under the unhashed key as well.  We
also keep a count of the number of hashed entries.  This means we must
hash an object at most once and once we've seen all the hashed objects,
we won't hash any more objects.  Times without this optimization are
listed above in the unoptimized entry.

This has the potential to cause a performance problem as we insert items
into a sorted list, but changing the implementation to use a khash map
instead does not result in a significantly faster implementation,
despite the improved insertion speed.  Performance in the unhashed case
is slightly worse, so this approach was not adopted since it provides
few benefits.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 Documentation/mailmap.txt | 28 +++++++++++
 mailmap.c                 | 99 ++++++++++++++++++++++++++++++++++-----
 mailmap.h                 |  2 +
 t/t4203-mailmap.sh        | 35 ++++++++++++++
 4 files changed, 152 insertions(+), 12 deletions(-)

diff --git a/Documentation/mailmap.txt b/Documentation/mailmap.txt
index 4a8c276529..b21194bf3e 100644
--- a/Documentation/mailmap.txt
+++ b/Documentation/mailmap.txt
@@ -73,3 +73,31 @@ Santa Claus <santa.claus@northpole.xx> <me@company.xx>
 
 Use hash '#' for comments that are either on their own line, or after
 the email address.
+
+In addition to specifying a former name or email literally, it is also possible
+to specify it in a hashed form, which consists of the string `@sha256:`,
+followed by an all-lowercase SHA-256 hash of the entry in hexadecimal.  For
+example, to take the example above, instead of specifying the replacement for
+"Some Dude" as such, you could specify one of these lines:
+
+------------
+Some Dude <some@dude.xx> nick1 <@sha256:bee4fdd8c5e2e85009c8ae231d5a395adb24d5a597f2b75489926460680b8ce1>
+Some Dude <some@dude.xx> @sha256:56030827e2765e8878c94c4cc43f5410b22f3b8c2b1ef8f631ac3953f8299279 <bugs@company.xx>
+Some Dude <some@dude.xx> @sha256:56030827e2765e8878c94c4cc43f5410b22f3b8c2b1ef8f631ac3953f8299279 <@sha256:bee4fdd8c5e2e85009c8ae231d5a395adb24d5a597f2b75489926460680b8ce1>
+------------
+
+These hash is a hash of the literal name or email without any trailing newlines.
+For example, you can compute the values above like so, using the Perl `shasum`
+command (or a similar command of your choice):
+
+------------
+$ printf '%s' bugs@company.xx | shasum -a 256
+bee4fdd8c5e2e85009c8ae231d5a395adb24d5a597f2b75489926460680b8ce1  -
+------------
+
+SHA-1 is not accepted as a hash algorithm in mailmaps.
+
+Using the hashed form may be desirable to obscure one's former name or email,
+but be aware that it is just obfuscation: it's still possible for someone with
+access to the repository to iterate through all authors and committers and map
+the hashed values to unhashed ones.
diff --git a/mailmap.c b/mailmap.c
index 5c52dbb7e0..ed401bb1e4 100644
--- a/mailmap.c
+++ b/mailmap.c
@@ -18,6 +18,8 @@ const char *git_mailmap_blob;
 struct mailmap_info {
 	char *name;
 	char *email;
+
+	unsigned refcount;
 };
 
 struct mailmap_entry {
@@ -25,6 +27,10 @@ struct mailmap_entry {
 	char *name;
 	char *email;
 
+	unsigned refcount;
+	unsigned hashed_count;
+	unsigned hashed_seen;
+
 	/* name and email for the complex mail and name matching case */
 	struct string_list namemap;
 };
@@ -32,6 +38,9 @@ struct mailmap_entry {
 static void free_mailmap_info(void *p, const char *s)
 {
 	struct mailmap_info *mi = (struct mailmap_info *)p;
+	if (--mi->refcount)
+		return;
+
 	debug_mm("mailmap: -- complex: '%s' -> '%s' <%s>\n",
 		 s, debug_str(mi->name), debug_str(mi->email));
 	free(mi->name);
@@ -41,6 +50,9 @@ static void free_mailmap_info(void *p, const char *s)
 static void free_mailmap_entry(void *p, const char *s)
 {
 	struct mailmap_entry *me = (struct mailmap_entry *)p;
+	if (--me->refcount)
+		return;
+
 	debug_mm("mailmap: removing entries for <%s>, with %d sub-entries\n",
 		 s, me->namemap.nr);
 	debug_mm("mailmap: - simple: '%s' <%s>\n",
@@ -82,10 +94,17 @@ static char *lowercase_email(char *s)
 	return s;
 }
 
-static void add_mapping(struct string_list *map,
+static int is_hashed(const char *s)
+{
+	const char *prefix = "@sha256:";
+	return strncmp(s, prefix, strlen(prefix)) == 0;
+}
+
+static void add_mapping(struct mailmap *mailmap,
 			char *new_name, char *new_email,
 			char *old_name, char *old_email)
 {
+	struct string_list *map = mailmap->mailmap;
 	struct mailmap_entry *me;
 	struct string_list_item *item;
 
@@ -95,7 +114,10 @@ static void add_mapping(struct string_list *map,
 		old_email = new_email;
 		new_email = NULL;
 	} else {
-		lowercase_email(old_email);
+		if (is_hashed(old_email))
+			mailmap->hashed_count++;
+		else
+			lowercase_email(old_email);
 	}
 
 	item = string_list_insert(map, old_email);
@@ -105,6 +127,7 @@ static void add_mapping(struct string_list *map,
 		me = xcalloc(1, sizeof(struct mailmap_entry));
 		me->namemap.strdup_strings = 1;
 		me->namemap.cmp = namemap_cmp;
+		me->refcount = 1;
 		item->util = me;
 	}
 
@@ -125,6 +148,9 @@ static void add_mapping(struct string_list *map,
 		debug_mm("mailmap: adding (complex) entry for '%s'\n", old_email);
 		mi->name = xstrdup_or_null(new_name);
 		mi->email = xstrdup_or_null(new_email);
+		mi->refcount = 1;
+		if (is_hashed(old_name))
+			me->hashed_count++;
 		string_list_insert(&me->namemap, old_name)->util = mi;
 	}
 
@@ -162,7 +188,7 @@ static char *parse_name_and_email(char *buffer, char **name,
 	return (*right == '\0' ? NULL : right);
 }
 
-static void read_mailmap_line(struct string_list *map, char *buffer,
+static void read_mailmap_line(struct mailmap *map, char *buffer,
 			      char **repo_abbrev)
 {
 	char *name1 = NULL, *email1 = NULL, *name2 = NULL, *email2 = NULL;
@@ -194,7 +220,7 @@ static void read_mailmap_line(struct string_list *map, char *buffer,
 		add_mapping(map, name1, email1, name2, email2);
 }
 
-static int read_mailmap_file(struct string_list *map, const char *filename,
+static int read_mailmap_file(struct mailmap *map, const char *filename,
 			     char **repo_abbrev)
 {
 	char buffer[1024];
@@ -216,7 +242,7 @@ static int read_mailmap_file(struct string_list *map, const char *filename,
 	return 0;
 }
 
-static void read_mailmap_string(struct string_list *map, char *buf,
+static void read_mailmap_string(struct mailmap *map, char *buf,
 				char **repo_abbrev)
 {
 	while (*buf) {
@@ -230,7 +256,7 @@ static void read_mailmap_string(struct string_list *map, char *buf,
 	}
 }
 
-static int read_mailmap_blob(struct string_list *map,
+static int read_mailmap_blob(struct mailmap *map,
 			     const char *name,
 			     char **repo_abbrev)
 {
@@ -269,10 +295,10 @@ int read_mailmap(struct mailmap *mailmap, char **repo_abbrev)
 	if (!git_mailmap_blob && is_bare_repository())
 		git_mailmap_blob = "HEAD:.mailmap";
 
-	err |= read_mailmap_file(map, ".mailmap", repo_abbrev);
+	err |= read_mailmap_file(mailmap, ".mailmap", repo_abbrev);
 	if (startup_info->have_repository)
-		err |= read_mailmap_blob(map, git_mailmap_blob, repo_abbrev);
-	err |= read_mailmap_file(map, git_mailmap_file, repo_abbrev);
+		err |= read_mailmap_blob(mailmap, git_mailmap_blob, repo_abbrev);
+	err |= read_mailmap_file(mailmap, git_mailmap_file, repo_abbrev);
 	return err;
 }
 
@@ -282,7 +308,7 @@ void clear_mailmap(struct mailmap *mailmap)
 	debug_mm("mailmap: clearing %d entries...\n", map->nr);
 	map->strdup_strings = 1;
 	string_list_clear_func(map, free_mailmap_entry);
-	string_list_clear(map, 1);
+	string_list_clear(map, 0);
 	free(map);
 	debug_mm("mailmap: cleared\n");
 }
@@ -338,6 +364,55 @@ static struct string_list_item *lookup_prefix(struct string_list *map,
 	return NULL;
 }
 
+/*
+ * Convert an email or name into a hashed form for comparison.  The hashed form
+ * will be created in the form
+ * @sha256:c68b7a430ac8dee9676ec77a387194e23f234d024e03d844050cf6c01775c8f6,
+ * which would be the hashed form for "doe@example.com".
+ */
+static char *hashed_form(struct strbuf *buf, const struct git_hash_algo *algop, const char *key, size_t keylen)
+{
+	git_hash_ctx ctx;
+	unsigned char hashbuf[GIT_MAX_RAWSZ];
+	char hexbuf[GIT_MAX_HEXSZ + 1];
+
+	algop->init_fn(&ctx);
+	algop->update_fn(&ctx, key, keylen);
+	algop->final_fn(hashbuf, &ctx);
+	hash_to_hex_algop_r(hexbuf, hashbuf, algop);
+
+	strbuf_addf(buf, "@%s:%s", algop->name, hexbuf);
+	return buf->buf;
+}
+
+static struct string_list_item *lookup_one(struct string_list *map,
+					   const char *string, size_t len,
+					   unsigned hashed_count,
+					   unsigned *hashed_seen)
+{
+	struct strbuf buf = STRBUF_INIT;
+	struct string_list_item *item = lookup_prefix(map, string, len);
+	if (item || !hashed_count || hashed_count == *hashed_seen)
+		return item;
+
+	hashed_form(&buf, &hash_algos[GIT_HASH_SHA256], string, len);
+	item = lookup_prefix(map, buf.buf, buf.len);
+	if (item) {
+		struct mailmap_info *mi = (struct mailmap_info *)item->util;
+		char *s = xstrndup(string, len);
+		map->strdup_strings = 0;
+		item = string_list_insert(map, s);
+		map->strdup_strings = 1;
+		if (!item->util) {
+			item->util = mi;
+			mi->refcount++;
+			(*hashed_seen)++;
+		}
+	}
+	strbuf_release(&buf);
+	return item;
+}
+
 int map_user(struct mailmap *map,
 	     const char **email, size_t *emaillen,
 	     const char **name, size_t *namelen)
@@ -350,7 +425,7 @@ int map_user(struct mailmap *map,
 		 (int)*namelen, debug_str(*name),
 		 (int)*emaillen, debug_str(*email));
 
-	item = lookup_prefix(map->mailmap, searchable_email, *emaillen);
+	item = lookup_one(map->mailmap, searchable_email, *emaillen, map->hashed_count, &map->hashed_seen);
 	free(searchable_email);
 	if (item != NULL) {
 		me = (struct mailmap_entry *)item->util;
@@ -361,7 +436,7 @@ int map_user(struct mailmap *map,
 			 * simple entry.
 			 */
 			struct string_list_item *subitem;
-			subitem = lookup_prefix(&me->namemap, *name, *namelen);
+			subitem = lookup_one(&me->namemap, *name, *namelen, me->hashed_count, &me->hashed_seen);
 			if (subitem)
 				item = subitem;
 		}
diff --git a/mailmap.h b/mailmap.h
index 4cdce3b064..69f8be5705 100644
--- a/mailmap.h
+++ b/mailmap.h
@@ -5,6 +5,8 @@
 
 struct mailmap {
 	struct string_list *mailmap;
+	unsigned hashed_count;
+	unsigned hashed_seen;
 };
 
 int read_mailmap(struct mailmap *map, char **repo_abbrev);
diff --git a/t/t4203-mailmap.sh b/t/t4203-mailmap.sh
index df4a0e03cc..004b4a3d40 100755
--- a/t/t4203-mailmap.sh
+++ b/t/t4203-mailmap.sh
@@ -62,6 +62,41 @@ test_expect_success 'check-mailmap --stdin arguments' '
 	test_cmp expect actual
 '
 
+test_expect_success 'hashed mailmap' '
+	test_config mailmap.file ./hashed &&
+	hashed_author_name="@sha256:$(printf "$GIT_AUTHOR_NAME" | test-tool sha256)" &&
+	hashed_author_email="@sha256:$(printf "$GIT_AUTHOR_EMAIL" | test-tool sha256)" &&
+	cat >expect <<-EOF &&
+	$GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL>
+	EOF
+
+	cat >hashed <<-EOF &&
+	$GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $hashed_author_name <$GIT_AUTHOR_EMAIL>
+	EOF
+	git check-mailmap "$GIT_AUTHOR_NAME <$GIT_AUTHOR_EMAIL>" >actual &&
+	test_cmp expect actual &&
+
+	cat >hashed <<-EOF &&
+	Wrong <wrong@example.org> $GIT_AUTHOR_NAME <$GIT_AUTHOR_EMAIL>
+	$GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $hashed_author_name <$GIT_AUTHOR_EMAIL>
+	EOF
+	# Check that we prefer literal matches over hashed names.
+	git check-mailmap "$hashed_author_name <$GIT_AUTHOR_EMAIL>" >actual &&
+	test_cmp expect actual &&
+
+	cat >hashed <<-EOF &&
+	$GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $hashed_author_name <$hashed_author_email>
+	EOF
+	git check-mailmap "$GIT_AUTHOR_NAME <$GIT_AUTHOR_EMAIL>" >actual &&
+	test_cmp expect actual &&
+
+	cat >hashed <<-EOF &&
+	$GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> <$hashed_author_email>
+	EOF
+	git check-mailmap "$GIT_AUTHOR_NAME <$GIT_AUTHOR_EMAIL>" >actual &&
+	test_cmp expect actual
+'
+
 test_expect_success 'check-mailmap bogus contact' '
 	test_must_fail git check-mailmap bogus
 '

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 1/5] mailmap: add a function to inspect the number of entries
  2021-01-03 21:18 ` [PATCH v2 1/5] mailmap: add a function to inspect the number of entries brian m. carlson
@ 2021-01-04 15:14   ` Ævar Arnfjörð Bjarmason
  2021-01-04 17:04   ` René Scharfe
  1 sibling, 0 replies; 19+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-01-04 15:14 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git, Jeff King, Phillip Wood


On Sun, Jan 03 2021, brian m. carlson wrote:

> We're soon going to change the type of our mailmap into an opaque struct
> so we can add features and improve performance.  When we do so, it won't
> be possible for users to inspect its internals to determine how many
> items are present, so let's introduce a function that lets users inquire
> how many objects are in the mailmap and use it where we want this
> information.
>
> Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
> ---
>  mailmap.c | 5 +++++
>  mailmap.h | 1 +
>  pretty.c  | 2 +-
>  3 files changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/mailmap.c b/mailmap.c
> index 962fd86d6d..c9a538f4e2 100644
> --- a/mailmap.c
> +++ b/mailmap.c
> @@ -361,3 +361,8 @@ int map_user(struct string_list *map,
>  	debug_mm("map_user:  --\n");
>  	return 0;
>  }
> +
> +int mailmap_entries(struct string_list *map)

We're returning ->nr here from string_list, whose "nr" is unsigned
int. Wouldn't it make more sense to do the same here?...

> +{
> +	return map->nr;
> +}
> diff --git a/mailmap.h b/mailmap.h
> index d0e65646cb..ff57b05a15 100644
> --- a/mailmap.h
> +++ b/mailmap.h
> @@ -5,6 +5,7 @@ struct string_list;
>  
>  int read_mailmap(struct string_list *map, char **repo_abbrev);
>  void clear_mailmap(struct string_list *map);
> +int mailmap_entries(struct string_list *map);
>  
>  int map_user(struct string_list *map,
>  			 const char **email, size_t *emaillen, const char **name, size_t *namelen);
> diff --git a/pretty.c b/pretty.c
> index 7a7708a0ea..43a0039870 100644
> --- a/pretty.c
> +++ b/pretty.c
> @@ -681,7 +681,7 @@ static int mailmap_name(const char **email, size_t *email_len,
>  		mail_map = xcalloc(1, sizeof(*mail_map));
>  		read_mailmap(mail_map, NULL);
>  	}
> -	return mail_map->nr && map_user(mail_map, email, email_len, name, name_len);
> +	return mailmap_entries(mail_map) && map_user(mail_map, email, email_len, name, name_len);

...and isn't this introducing a bug where the number of entries is
beyond "signed int" but not "unsigned int" by overflowing?

Bikeshedding: For both hashmap and strmap we have a
{hashmap,strmap}_get_size() function. I think a "static inline unsigned
int mailmap_get_size()" here would be more consistent & obvious.

>  }
>  
>  static size_t format_person_part(struct strbuf *sb, char part,


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/5] mailmap: switch to opaque struct
  2021-01-03 21:18 ` [PATCH v2 2/5] mailmap: switch to opaque struct brian m. carlson
@ 2021-01-04 15:17   ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 19+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-01-04 15:17 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git, Jeff King, Phillip Wood


On Sun, Jan 03 2021, brian m. carlson wrote:

> -void clear_mailmap(struct string_list *map)
> +void clear_mailmap(struct mailmap *mailmap)
>  {
> +	struct string_list *map = mailmap->mailmap;
>  	debug_mm("mailmap: clearing %d entries...\n", map->nr);

Re my comment on 1/5 wouldn't it make more sense to use the new API
here:

    -       debug_mm("mailmap: clearing %d entries...\n", map->nr);
    +       debug_mm("mailmap: clearing %u entries...\n", mailmap_entries(mailmap));

I changed %d to %u here, maybe we'd also want to do that earlier while
we're at it (not code you're changing):

    -       debug_mm("mailmap: removing entries for <%s>, with %d sub-entries\n",
    +       debug_mm("mailmap: removing entries for <%s>, with %u sub-entries\n",

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 4/5] mailmap: use case-sensitive comparisons for local-parts and names
  2021-01-03 21:18 ` [PATCH v2 4/5] mailmap: use case-sensitive comparisons for " brian m. carlson
@ 2021-01-04 16:10   ` Ævar Arnfjörð Bjarmason
  2021-01-06  0:46     ` Junio C Hamano
  0 siblings, 1 reply; 19+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-01-04 16:10 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git, Jeff King, Phillip Wood, Johannes Schindelin


On Sun, Jan 03 2021, brian m. carlson wrote:

[For the purposes of this reply I'm also replying to your commit message
in 3/5, which I think makes things less confusing than two E-Mails]

> Currently, Git always looks up entries in the mailmap in a
> case-insensitive way, both for names and addresses, which is, as
> explained below, suboptimal.
>
> First, for email addresses, RFC 5321 is clear that only domains are case
> insensitive; local-parts (the portion before the at sign) are not.  It
> states this:
> 
>   The local-part of a mailbox MUST BE treated as case sensitive.
>   Therefore, SMTP implementations MUST take care to preserve the case of
>   mailbox local-parts.

It seems better to quote the part of this where it goes on to say:

    However, exploiting the case sensitivity of mailbox local-parts
    impedes interoperability and is discouraged.  Mailbox domains follow
    normal DNS rules and are hence not case sensitive.

And also that RFC 5321 says earlier

    "a host that expects to receive mail SHOULD avoid defining mailboxes
    where the Local-part requires (or uses) the Quoted-string form or
    where the Local-part is case- sensitive.  For any purposes that
    require generating or comparing Local-parts (e.g., to specific
    mailbox names), all quoted forms MUST be treated as equivalent,".

So if we're taking RFC 5321 MUST requirements as something we've got to
do we're still not following it. But as I'll go on to note I think
rationale by RFC 5321 is perhaps not the best thing to do here...

> There exist systems today where local-parts remain case sensitive (and
> this author has one), and as such, it's incorrect for us to case fold
> them in any way.  Let's add a failing test that indicates this is a
> problem, while still keeping the test for case-insensitive domains.

I don't really care much about this feature of mailmap, but I'm
struggling a bit to understand this part.

We're not an SMTP server or SMTP server library, so we don't have to
worry about mail mis-routing or whatever.

We just have to worry about cases where you're not all of these people
in one project's commit metadata and/or .mailmap, and thus mailmap rules
would match too generously:

    "brian m. carlson" <sandals@crustytoothpaste.net>
    "brian m. carlson" <SANDALS@crustytoothpaste.net>
    "BRIAN M. CARLSON" <sandals@crustytoothpaste.net>
    "BRIAN M. CARLSON" <SANDALS@crustytoothpaste.net>

Is that really plausible? In any case, neither of these two patches make
reference to us already having changed this in the past in 1.6.2 and &
there being reports on the ML about the bug & us changing it back. See
https://lore.kernel.org/git/f182fb1700e8dea15459fd02ced2a6e5797bec99.1238458535u.git.johannes.schindelin@gmx.de/

Maybe we should still do this, but I think for a v3 it makes sense to
summarize that discussion etc.

> Note that it's also incorrect for us to case-fold names because we don't
> guarantee that we're using the locale of the author, and it's impossible
> to case-fold names in a locale-insensitive way.  Turkish and Azeri
> contain both a dotted and dotless I, and the uppercase ASCII I folds not
> to the lowercase ASCII I, but to a dotless version, and vice versa with
> the lowercase I.  There are many words in Turkish which differ only in
> the dottedness of the I, so it is likely that there are also personal
> names which differ in the same way.
> 
> That would be a problem even if our implementation were perfect, which
> it is not.  We currently fold only ASCII characters, so this feature has
> never worked correctly for the vast majority of the users on the planet,
> regardless of the locale.  For example, on Linux, even in a Spanish
> locale, we don't handle "Simón" properly.  Even if we did handle that,
> we'd probably also want to implement Unicode normalization, which we
> don't.

As one of those people, aren't you confusing two things here. I daresay
that almost everyone with a non-ASCII name doesn't have a non-ASCII
E-Mail address, sure some do, but treating those as one and the same
doesn't seem to make sense.

Yes you can have non-ASCII on the LHS of @ in an E-Mail address, it just
seems to me that's very rare.

Do we have known cases where we're making use of this case-insensitive
matching of the *name* part? Showing some of those non-ASCII cases in
the tests in 3/5 would be nice.

> IN general, case-folding text is extremely language- and locale-specific
> and requires intimacy with the spelling and grammar of the language in
> question and careful attention to the Unicode details in order to
> produce a result that is meaningful to humans and conforms with
> linguistic and societal norms.
> 
> Because we do not have any of the required context with a plain personal
> name, we cannot hope to possibly case-fold personal names correctly.  We
> should stop trying to do so and just treat them as a series of bytes, so
> let's add a test that we don't case-fold personal names as well.

[... end <snip> of commit 3/5 ...]

> RFC 5321 is clear that the local-part of an email address (the part
> before the at sign) is case sensitive, and this has been the case since
> the original RFC 821.  It directs us that "the local-part MUST be
> interpreted and assigned semantics only by the host specified in the
> domain part of the address."
>
> Since we are not that party, it's not correct for us to compare them
> case insensitively.  However, we do still want to compare the domain
> parts case insensitively, so let's add a helper that downcases the
> domain and then compare byte by byte.
>
> Similarly, it's not possible for us to correctly case-fold text in a
> locale-insensitive way, so our handling of personal names has also been
> subject to bugs.  Additionally, we've never handled non-ASCII characters
> correctly, which means that our previous comparisons really only worked
> well for a fraction of the people on the planet.  Since our code wasn't
> right and it's basically impossible to compare personal names without
> regard to case, let's also switch our matching of names to be byte by
> byte.

I'm still undecided about whether we should be calling strcasecmp() to
begin with, but I don't really find this convincing. Just because we
only support brian@ and BRIAN@ being the same but not ævar@ and ÆVAR@ we
should break the existing behavior for everyone?

> Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
> ---
>  mailmap.c          | 27 ++++++++++++++++++++++++---
>  t/t4203-mailmap.sh |  4 ++--
>  2 files changed, 26 insertions(+), 5 deletions(-)
>
> diff --git a/mailmap.c b/mailmap.c
> index d3287b409a..5c52dbb7e0 100644
> --- a/mailmap.c
> +++ b/mailmap.c
> @@ -64,7 +64,22 @@ static void free_mailmap_entry(void *p, const char *s)
>   */
>  static int namemap_cmp(const char *a, const char *b)
>  {
> -	return strcasecmp(a, b);
> +	return strcmp(a, b);
> +}

It seems to me if we're not going to use strcasecmp() we can get rid of
this whole namemap_cmp() function. See the comment just above it &
de2f95ebed2 (mailmap: work around implementations with pure inline
strcasecmp, 2013-09-12). I.e. we had a wrapper function here just
because we were using strcasecmp().

> +/*
> + * Lowercases the domain (and only the domain) part of an email address.  The
> + * local-part, which is defined by RFC 5321 to be case sensitive, is not
> + * affected.
> + */
> +static char *lowercase_email(char *s)
> +{
> +	char *end = strchrnul(s, '@');

Speaking of pedantic readings of RFC 5321, aren't we closing the door
here to proper DQUOTE handling? I.e. the local-part containing a @
within quotes :)


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 1/5] mailmap: add a function to inspect the number of entries
  2021-01-03 21:18 ` [PATCH v2 1/5] mailmap: add a function to inspect the number of entries brian m. carlson
  2021-01-04 15:14   ` Ævar Arnfjörð Bjarmason
@ 2021-01-04 17:04   ` René Scharfe
  1 sibling, 0 replies; 19+ messages in thread
From: René Scharfe @ 2021-01-04 17:04 UTC (permalink / raw)
  To: brian m. carlson, git
  Cc: Jeff King, Ævar Arnfjörð Bjarmason, Phillip Wood

Am 03.01.21 um 22:18 schrieb brian m. carlson:
> We're soon going to change the type of our mailmap into an opaque struct
> so we can add features and improve performance.  When we do so, it won't
> be possible for users to inspect its internals to determine how many
> items are present, so let's introduce a function that lets users inquire
> how many objects are in the mailmap and use it where we want this
> information.
>
> Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
> ---
>  mailmap.c | 5 +++++
>  mailmap.h | 1 +
>  pretty.c  | 2 +-
>  3 files changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/mailmap.c b/mailmap.c
> index 962fd86d6d..c9a538f4e2 100644
> --- a/mailmap.c
> +++ b/mailmap.c
> @@ -361,3 +361,8 @@ int map_user(struct string_list *map,
>  	debug_mm("map_user:  --\n");
>  	return 0;
>  }
> +
> +int mailmap_entries(struct string_list *map)
> +{
> +	return map->nr;
> +}
> diff --git a/mailmap.h b/mailmap.h
> index d0e65646cb..ff57b05a15 100644
> --- a/mailmap.h
> +++ b/mailmap.h
> @@ -5,6 +5,7 @@ struct string_list;
>
>  int read_mailmap(struct string_list *map, char **repo_abbrev);
>  void clear_mailmap(struct string_list *map);
> +int mailmap_entries(struct string_list *map);
>
>  int map_user(struct string_list *map,
>  			 const char **email, size_t *emaillen, const char **name, size_t *namelen);
> diff --git a/pretty.c b/pretty.c
> index 7a7708a0ea..43a0039870 100644
> --- a/pretty.c
> +++ b/pretty.c
> @@ -681,7 +681,7 @@ static int mailmap_name(const char **email, size_t *email_len,
>  		mail_map = xcalloc(1, sizeof(*mail_map));
>  		read_mailmap(mail_map, NULL);
>  	}
> -	return mail_map->nr && map_user(mail_map, email, email_len, name, name_len);
> +	return mailmap_entries(mail_map) && map_user(mail_map, email, email_len, name, name_len);

This seems to be the only caller of the new function.  It uses it to
avoid calling map_user() in case the mailmap is empty.  Since you're
doing performance measurements: How bad would it be to simply remove
that check?

René

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 5/5] mailmap: support hashed entries in mailmaps
  2021-01-03 21:18 ` [PATCH v2 5/5] mailmap: support hashed entries in mailmaps brian m. carlson
@ 2021-01-05 14:21   ` Ævar Arnfjörð Bjarmason
  2021-01-06  0:24     ` brian m. carlson
  2021-01-05 20:05   ` Junio C Hamano
  1 sibling, 1 reply; 19+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-01-05 14:21 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git, Jeff King, Phillip Wood

On Sun, Jan 03 2021, brian m. carlson wrote:

I think it makes sense to split up 1-4/5 here from 5/5 in this series
since they're really unrelated changes, although due to the changes in
1-4 they'll conflict.

> Many people, through the course of their lives, will change either a
> name or an email address.  For this reason, we have the mailmap, to map
> from a user's former name or email address to their current, canonical
> forms.  Normally, this works well as it is.
>
> However, sometimes people change a name or an email address and wish to
> wholly disassociate themselves from that former name or email address.
> For example, a person may transition from one gender to another,
> changing their name, or they may have changed their name to disassociate
> themselves from an abusive family or partner.  In such a case, using the
> former name or address in any way may be undesirable and the person may
> wish to replace it as completely as possible.

The cover letter noted "As mentioned in the original thread, I think a
hash rather than an encoding is the right choice here.". Reading the v1
I think you're referring to
https://lore.kernel.org/git/X9wUGaR3IXcpV0nT@camp.crustytoothpaste.net/

In v1 I pointed out you needed to read some combination of the cover
letter & the patch to see what this was intended for (see [1]). I think
for v3 the commit itself should summarize the trade-offs & design
choices.

> For projects which wish to support this, introduce hashed forms into the
> mailmap.  These forms, which start with "@sha256:" followed by a SHA-256
> hash of the entry, can be used in place of the form used in the commit
> field.  This form is intentionally designed to be unlikely to conflict
> with legitimate use cases.  For example, this is not a valid email
> address according to RFC 5322.  In the unlikely event that a user has
> put such a form into the actual commit as their name, we will accept it.

We'll emit the commit author information as-is in that case under "git
show", or run the mapping and map it via mailmap? Anyway, it seems
there's a test for this. Probably better to just point to it.

> While the form of the data is designed to accept multiple hash
> algorithms, we intentionally do not support SHA-1.  There is little
> reason to support such a weak algorithm in new use cases and no
> backwards compatibility to consider.  Moreover, SHA-256 is faster than
> the SHA1DC implementation we use, so this not only improves performance,
> but simplifies the current implementation somewhat as well.

I agree with most of this aside from the "weak algorithm" part. That
seems like an irrelevant aside for this specific use of a hashing
algorithm, no? We could even use MD5 here, so SHA256-only is just
setting is up for not needing to deal with SHA1 forever in this one
place in some SHA256 future repo.

> Note that it is, of course, possible to perform a lookup on all commit
> objects to determine the actual entry which matches the hashed form of
> the data.  However, this is an improvement over the status quo.
>
> The performance of this patch with no hashed entries is very similar to
> the performance without this patch.  Considering a git log command to
> look up author and committer information on 981,680 commits in the Linux
> kernel history, either with an unhashed mailmap or a mailmap with all
> old values hashed:
>
>                                    Shortest  Longest  Average  Change
>   Git 2.30                         7.876     8.297    8.143
>   This patch, unhashed             7.923     8.484    8.237    + 1.15%
>   This patch, hashed               14.510    14.783   14.672   +80.17%
>   This patch, hashed, unoptimized  15.425    16.318   15.901   +95.27%
>
> Thus, the average performance after this patch is within normal
> variation of the pre-patch performance.  It's unlikely that users will
> notice the difference in practice, even on much larger
> repositories, unless they're using the new feature.

Am I reading this right that if there's a single hashed entry in
.mailmap anything using %aE or %aN is around 2x as slow?

Your v1 mentioned that a project might "insert entries for many
contributors in order to make discovery of "interesting" entries
significantly less convenient." which is gone in the v2 patch. As noted
in [1] I don't see how it helps the obscurity much, but if that's still
the intended use we'd expect to get more slowdowns in the wild if users
intend to convert their whole mailmap to this form if they want a single
entry to use the form.

Anyway, as you might have guessed I'm still not a fan of this direction.

But most of it is because I honestly don't get why this specific
approach is required to achieve the stated aims, there's a few of them,
so here's an attempt to break them down:

1. User changed their name and doesn't want themselves or others to see
  their old name

For the case where Joe Developer is now known as Jane Doe in most cases
you don't need to put the old name at all into the .mailmap. E.g. for
git.git this patch to our .mailmap produces the same output for `log
--all --pretty="%h %an%ae%aN%aE"`:

     brian m. carlson <sandals@crustytoothpaste.net>
    -brian m. carlson <sandals@crustytoothpaste.net> <sandals@crustytoothpaste.ath.cx>
    -brian m. carlson <sandals@crustytoothpaste.net> <bk2204@github.com>
    +<sandals@crustytoothpaste.net> <sandals@crustytoothpaste.ath.cx>
    +<sandals@crustytoothpaste.net> <bk2204@github.com>

So the new->name/email mapping (as opposed to new->email) is really only
needed for some really obscure cases where two people shared an E-Mail
or something.

So we're talking about hiding the old E-Mail, presumably because it was
joe@ intsead of jane@, so in that case we could just support URI
encoding:

    Jane Doe <jane@example.com>
    <jane@example.com> <%6A%6F%65@%64%65%76%65%6C%6F%70%65%72.%63%6F%6D>

Made via:

    $ perl -MURI::Escape=uri_escape -wE 'say uri_escape q[joe@developer.com], "^@."'
    %6A%6F%65@%64%65%76%65%6C%6F%70%65%72.%63%6F%6D

Which also has the nice attribute that people can make it obvious what
part they want to hide, since this is really a feature to enable social
politeness & consideration:

    Jane Doe <jane@example.com>
    # I don't want to be known by my old name, thanks
    <jane@example.com> <%6A%6F%65@developer.com>

2. Hiding from your enemies

For the other use-case of "abusive family or partner" I had the comment
in v1 of "but not so much that you'd still take the risk of submitting a
patch to .mailmap?".

Now that's obviously phrased in an off-the-cuff manner, but I'm
serious.

I think it is important that the non-security of this feature obviously
looks like some trivial encoding, because that's what it is. People get
lulled into a false sense of security with these things all the time
(e.g. thinking their "Authorization" HTTP header is safe to post on a
public pastebin). So we should as much as possible make this look like
the non-security it is.

3. Enabling people not to treat .mailmap as binary or a multi-encoding
file.

I mentioned this in my [1]. Your implementation doesn't do this, but
e.g. it would be very nice for a project that switched from latin-1 to
utf-8 to be able to do, in some cases:

    # Made with: perl -MURI::Escape=uri_escape -wE 'say uri_escape "@ARGV", "^a-z@. "' $(echo Ævar Arnfjörð Bjarmason | iconv -f utf-8 -t iso-8859-1)
    #

    Ævar Arnfjörð Bjarmason <avarab@gmail.com> %C6var %41rnfj%F6r%F0 %42jarmason <avarab@gmail.com>

Or some combination thereof, so e.g. previously Big5/latin1 who migrated
to UTF-8 don't need to have non-valid UTF-8 in .mailmap

4. Spam

You mentioned this in your [2] (but not as a use-case in the v2
re-rolled commit message):

    And we know that spammers and recruiters (which, in this case, are
    also spammers) do indeed scrape repositories via the repository web
    interfaces.

Surely these people are most interested in the current E-Mail addresses,
which if they're scraping the common web interfaces (e.g. Github,
GitLab) are easily accessible there. It doesn't seem very plausible that
someone would care enough to scrape .mailmap for old addresses but not
just update their scraper to clone & run "git log" for the purposes of
e.g. their recruitment E-Mails.

5. Interaction with other systems

Something I mentioned in the last 3 paragraphs of my [1]. I think you're
only considering the cases where git itself does the mailmap
translation, but we have 3rd party systems that make use of the format
in good ways (also doing the Joe->Jane mapping). Making it a hassle for
those systems makes it more likely that Jane doesn't get the mapping she
wants.

1. https://lore.kernel.org/git/87eejswql6.fsf@evledraar.gmail.com/
2. https://lore.kernel.org/git/X9wUGaR3IXcpV0nT@camp.crustytoothpaste.net/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 5/5] mailmap: support hashed entries in mailmaps
  2021-01-03 21:18 ` [PATCH v2 5/5] mailmap: support hashed entries in mailmaps brian m. carlson
  2021-01-05 14:21   ` Ævar Arnfjörð Bjarmason
@ 2021-01-05 20:05   ` Junio C Hamano
  2021-01-06  0:28     ` brian m. carlson
  1 sibling, 1 reply; 19+ messages in thread
From: Junio C Hamano @ 2021-01-05 20:05 UTC (permalink / raw)
  To: brian m. carlson
  Cc: git, Jeff King, Ævar Arnfjörð Bjarmason,
	Phillip Wood

"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> For example, a person may transition from one gender to another,
> changing their name, or they may have changed their name to disassociate
> themselves from an abusive family or partner.  In such a case, using the
> former name or address in any way may be undesirable and the person may
> wish to replace it as completely as possible.

I am not sure if we want to even mention the "for example" here.

These are certainly all legitimate reasons to want this feature, but
after reading the "for example", lack of a corresponding negative
statement (e.g. sometimes people also change their name or address
to hide their bad behaviour in the past that is associated with
these names) needlessly stood out and made me wonder if we need to
somehow defend the feature with "...but we do not mean to abet
people in hiding their past bad behaviour with this mechanism".  I'd
prefer us not forced to defend the mechanism if we did not have to.

> Note that it is, of course, possible to perform a lookup on all commit
> objects to determine the actual entry which matches the hashed form of
> the data.  However, this is an improvement over the status quo.

There were suggestions to use reversible encoding, IIRC, just for
obscurity.  I do not have a strong preference either way myself, but
because such an approach would give the same improvement over the
status quo, would be simpler, more performant and most importantly,
it makes it clear that this is not serious security but casual
obscurity, I'd want to be convinced why we want to use a hash here
a bit more strongly.

> +In addition to specifying a former name or email literally, it is also possible
> +to specify it in a hashed form, which consists of the string `@sha256:`,
> +followed by an all-lowercase SHA-256 hash of the entry in hexadecimal.  For
> +example, to take the example above, instead of specifying the replacement for
> +"Some Dude" as such, you could specify one of these lines:
> ...

> +SHA-1 is not accepted as a hash algorithm in mailmaps.

Is this needed to be said?  After all, we won't take @md5: or
@blake2: or anything other than @sha256: in this version (and
probably any forseeable versions).  Unless we offer a way to plug-in
algos of projects' choice, that is, and at that point, "SHA-1 is not
accepted" is a statement too strong for us to make.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 5/5] mailmap: support hashed entries in mailmaps
  2021-01-05 14:21   ` Ævar Arnfjörð Bjarmason
@ 2021-01-06  0:24     ` brian m. carlson
  2021-01-10 19:24       ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 19+ messages in thread
From: brian m. carlson @ 2021-01-06  0:24 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: git, Jeff King, Phillip Wood

[-- Attachment #1: Type: text/plain, Size: 11147 bytes --]

On 2021-01-05 at 14:21:40, Ævar Arnfjörð Bjarmason wrote:
> 
> On Sun, Jan 03 2021, brian m. carlson wrote:
> 
> I think it makes sense to split up 1-4/5 here from 5/5 in this series
> since they're really unrelated changes, although due to the changes in
> 1-4 they'll conflict.

Okay, I'll drop them.

> In v1 I pointed out you needed to read some combination of the cover
> letter & the patch to see what this was intended for (see [1]). I think
> for v3 the commit itself should summarize the trade-offs & design
> choices.

I can do that.  It's a very long commit message anyway, but if you think
it would be better in the commit message, I can add it.

> > For projects which wish to support this, introduce hashed forms into the
> > mailmap.  These forms, which start with "@sha256:" followed by a SHA-256
> > hash of the entry, can be used in place of the form used in the commit
> > field.  This form is intentionally designed to be unlikely to conflict
> > with legitimate use cases.  For example, this is not a valid email
> > address according to RFC 5322.  In the unlikely event that a user has
> > put such a form into the actual commit as their name, we will accept it.
> 
> We'll emit the commit author information as-is in that case under "git
> show", or run the mapping and map it via mailmap? Anyway, it seems
> there's a test for this. Probably better to just point to it.

It will be handled correctly via the mailmap code, in which case we'll
make a no-op transformation.  If the user is not using the mailmap, then
it will be handled trivially.

> > While the form of the data is designed to accept multiple hash
> > algorithms, we intentionally do not support SHA-1.  There is little
> > reason to support such a weak algorithm in new use cases and no
> > backwards compatibility to consider.  Moreover, SHA-256 is faster than
> > the SHA1DC implementation we use, so this not only improves performance,
> > but simplifies the current implementation somewhat as well.
> 
> I agree with most of this aside from the "weak algorithm" part. That
> seems like an irrelevant aside for this specific use of a hashing
> algorithm, no? We could even use MD5 here, so SHA256-only is just
> setting is up for not needing to deal with SHA1 forever in this one
> place in some SHA256 future repo.

One should avoid the use of weak algorithms when possible even if they
are not being used in a way that makes them weak because it incentivizes
others to use them, often in a way that is insecure.  I had a
conversation with a junior candidate during an interview who said they
used SHA-1 in a particular case "because Git uses it."  That's why I
mentioned it.

> > Note that it is, of course, possible to perform a lookup on all commit
> > objects to determine the actual entry which matches the hashed form of
> > the data.  However, this is an improvement over the status quo.
> >
> > The performance of this patch with no hashed entries is very similar to
> > the performance without this patch.  Considering a git log command to
> > look up author and committer information on 981,680 commits in the Linux
> > kernel history, either with an unhashed mailmap or a mailmap with all
> > old values hashed:
> >
> >                                    Shortest  Longest  Average  Change
> >   Git 2.30                         7.876     8.297    8.143
> >   This patch, unhashed             7.923     8.484    8.237    + 1.15%
> >   This patch, hashed               14.510    14.783   14.672   +80.17%
> >   This patch, hashed, unoptimized  15.425    16.318   15.901   +95.27%
> >
> > Thus, the average performance after this patch is within normal
> > variation of the pre-patch performance.  It's unlikely that users will
> > notice the difference in practice, even on much larger
> > repositories, unless they're using the new feature.
> 
> Am I reading this right that if there's a single hashed entry in
> .mailmap anything using %aE or %aN is around 2x as slow?

No, that's not the case.  As soon as we see every hashed entry, we will
stop hashing new entries.  Linux is not necessarily the best case for
this because it has a long history with many one-off contributors long
ago in the history.

I'll explain that further in the commit message and add some more
metrics.

> Your v1 mentioned that a project might "insert entries for many
> contributors in order to make discovery of "interesting" entries
> significantly less convenient." which is gone in the v2 patch. As noted
> in [1] I don't see how it helps the obscurity much, but if that's still
> the intended use we'd expect to get more slowdowns in the wild if users
> intend to convert their whole mailmap to this form if they want a single
> entry to use the form.

Peff objected to that text, so I removed it.

As mentioned above, it depends on who you put in the mailmap.  If
they're the most recent 50 contributors, it'll probably be pretty cheap.
If you put the oldest contributors in there and they've not sent any
recent commits, it will be more expensive.

> Anyway, as you might have guessed I'm still not a fan of this direction.

I've got that impression pretty strongly.

I do want to point out that generally I'm pretty willing to change
approaches and do things differently.  I've completely redone a decent
number of patches in the past in response to feedback on the list.

I'm not changing the approach here because, as mentioned below, I don't
think that just encoding meets the use cases I'm targeting here.  So I
have heard your suggestions and to be clear, I do value your input on
this (and on other topics), it's just that I disagree that such a change
is one I should make.

> So the new->name/email mapping (as opposed to new->email) is really only
> needed for some really obscure cases where two people shared an E-Mail
> or something.

That's unlikely, but it does happen.  That's why we have it.

> So we're talking about hiding the old E-Mail, presumably because it was
> joe@ intsead of jane@, so in that case we could just support URI
> encoding:
> 
>     Jane Doe <jane@example.com>
>     <jane@example.com> <%6A%6F%65@%64%65%76%65%6C%6F%70%65%72.%63%6F%6D>
> 
> Made via:
> 
>     $ perl -MURI::Escape=uri_escape -wE 'say uri_escape q[joe@developer.com], "^@."'
>     %6A%6F%65@%64%65%76%65%6C%6F%70%65%72.%63%6F%6D
> 
> Which also has the nice attribute that people can make it obvious what
> part they want to hide, since this is really a feature to enable social
> politeness & consideration:
> 
>     Jane Doe <jane@example.com>
>     # I don't want to be known by my old name, thanks
>     <jane@example.com> <%6A%6F%65@developer.com>

I don't think this feature is going to get used if we just encode names
or email addresses.  In the United States, when someone transitions,
they get a court order to change their name.  I don't think a lot of
corporate environments are going to want to just encode an old name or
email address in a trivially invertible way given that.  This is
typically a topic handled with some sensitivity in most companies.

I will tell you that I would not just use an encoded version if I were
changing my name for any of the reasons I've mentioned.  That wouldn't
cut it for me, and I wouldn't use such a feature.  The feature I'm
implementing is a feature I've talked with trans folks about, and that's
why I'm implementing this as it is.  The response I got was essentially,
"It's not everything I want, but it's an improvement."

If the decision is that we want to go with encoding instead of hashing,
then I'll drop this patch.  I'm not going to put my name or sign-off on
that because I don't think it meets the need I'm addressing here.

The entire problem, of course, is that we bake a human's personal name
and email address immutably into a Merkle tree.  We know full well that
people do change their names and email addresses all the time (e.g.,
marriage, job changes), and yet we have this design.  In retrospect, we
should have done something different, but hindsight is 20/20 and I'm
just trying to do the best we can with what we've got.

> 2. Hiding from your enemies
> 
> For the other use-case of "abusive family or partner" I had the comment
> in v1 of "but not so much that you'd still take the risk of submitting a
> patch to .mailmap?".

No, my use case isn't "hiding from an abusive family or partner".  It's
"I'm finally free of that **** and I never want to hear their name
again."  (I've known people in this situation.)  Also, the similar use
case of, "my family member, with whom I share an uncommon name, murdered
someone, which I obviously found abhorrent, and I would like to not be
associated with them when my name is Googled."  And yes, I knew an
acquaintance many years ago whose family member murdered someone.

In other words, the person changed their name to disassociate
themselves, not to hide from their abuser.

> 4. Spam
> 
> You mentioned this in your [2] (but not as a use-case in the v2
> re-rolled commit message):
> 
>     And we know that spammers and recruiters (which, in this case, are
>     also spammers) do indeed scrape repositories via the repository web
>     interfaces.
> 
> Surely these people are most interested in the current E-Mail addresses,
> which if they're scraping the common web interfaces (e.g. Github,
> GitLab) are easily accessible there. It doesn't seem very plausible that
> someone would care enough to scrape .mailmap for old addresses but not
> just update their scraper to clone & run "git log" for the purposes of
> e.g. their recruitment E-Mails.

Unless the user is using the GitHub-provided noreply address or a
similar address, which is common.  This allows people to map all of
their old addresses to such an address, which, judging from
StackOverflow, is a thing people want to do.

I can tell you from dealing with abuse that raising the bar even the
tiniest bit is very significant to stopping it.  Most recruiters are not
developers and they and spammers don't have Git installed.  They're
going to rely on Googling or other public search functionality, and this
makes that harder.

Greylisting is exactly raising the bar the tiniest amount and it's
extraordinarily effective.

> 5. Interaction with other systems
> 
> Something I mentioned in the last 3 paragraphs of my [1]. I think you're
> only considering the cases where git itself does the mailmap
> translation, but we have 3rd party systems that make use of the format
> in good ways (also doing the Joe->Jane mapping). Making it a hassle for
> those systems makes it more likely that Jane doesn't get the mapping she
> wants.

This is an argument for never changing the format.  Sometimes things
change, and I don't want to avoid making a change because other
implementations haven't implemented it yet.  Under that approach, we'd
never have the SHA-256 work.
-- 
brian m. carlson (he/him or they/them)
Houston, Texas, US

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 5/5] mailmap: support hashed entries in mailmaps
  2021-01-05 20:05   ` Junio C Hamano
@ 2021-01-06  0:28     ` brian m. carlson
  2021-01-06  1:50       ` Junio C Hamano
  0 siblings, 1 reply; 19+ messages in thread
From: brian m. carlson @ 2021-01-06  0:28 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Jeff King, Ævar Arnfjörð Bjarmason,
	Phillip Wood

[-- Attachment #1: Type: text/plain, Size: 1896 bytes --]

On 2021-01-05 at 20:05:22, Junio C Hamano wrote:
> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
> 
> > For example, a person may transition from one gender to another,
> > changing their name, or they may have changed their name to disassociate
> > themselves from an abusive family or partner.  In such a case, using the
> > former name or address in any way may be undesirable and the person may
> > wish to replace it as completely as possible.
> 
> I am not sure if we want to even mention the "for example" here.
> 
> These are certainly all legitimate reasons to want this feature, but
> after reading the "for example", lack of a corresponding negative
> statement (e.g. sometimes people also change their name or address
> to hide their bad behaviour in the past that is associated with
> these names) needlessly stood out and made me wonder if we need to
> somehow defend the feature with "...but we do not mean to abet
> people in hiding their past bad behaviour with this mechanism".  I'd
> prefer us not forced to defend the mechanism if we did not have to.

I added it because I imagine the use cases for this feature aren't
immediately obvious to a lot of people and the general rule is that
commit messages explain why we would implement such a feature.  If you'd
prefer I drop it and leave it up to the imagination (or to the list
archives), I can do that.

> > +SHA-1 is not accepted as a hash algorithm in mailmaps.
> 
> Is this needed to be said?  After all, we won't take @md5: or
> @blake2: or anything other than @sha256: in this version (and
> probably any forseeable versions).  Unless we offer a way to plug-in
> algos of projects' choice, that is, and at that point, "SHA-1 is not
> accepted" is a statement too strong for us to make.

I'll drop that line.
-- 
brian m. carlson (he/him or they/them)
Houston, Texas, US

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 4/5] mailmap: use case-sensitive comparisons for local-parts and names
  2021-01-04 16:10   ` Ævar Arnfjörð Bjarmason
@ 2021-01-06  0:46     ` Junio C Hamano
  2021-01-12 14:08       ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 19+ messages in thread
From: Junio C Hamano @ 2021-01-06  0:46 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: brian m. carlson, git, Jeff King, Phillip Wood,
	Johannes Schindelin

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> On Sun, Jan 03 2021, brian m. carlson wrote:
>
> We just have to worry about cases where you're not all of these people
> in one project's commit metadata and/or .mailmap, and thus mailmap rules
> would match too generously:
>
>     "brian m. carlson" <sandals@crustytoothpaste.net>
>     "brian m. carlson" <SANDALS@crustytoothpaste.net>
>     "BRIAN M. CARLSON" <sandals@crustytoothpaste.net>
>     "BRIAN M. CARLSON" <SANDALS@crustytoothpaste.net>
>
> Is that really plausible? In any case, neither of these two patches make
> reference to us already having changed this in the past in 1.6.2 and &
> there being reports on the ML about the bug & us changing it back. See
> https://lore.kernel.org/git/f182fb1700e8dea15459fd02ced2a6e5797bec99.1238458535u.git.johannes.schindelin@gmx.de/
>
> Maybe we should still do this, but I think for a v3 it makes sense to
> summarize that discussion etc.

After reading the old discussion again, I am not sure if this is
worth doing.  To many people, it is a promise we've made and kept
that we treat addresses including the local part case-insensitively
when summarising commits by ident strings.

I'd really wish that this series were structured to have 5/5 early
and 3&4/5 squashed into a single final patch.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 5/5] mailmap: support hashed entries in mailmaps
  2021-01-06  0:28     ` brian m. carlson
@ 2021-01-06  1:50       ` Junio C Hamano
  0 siblings, 0 replies; 19+ messages in thread
From: Junio C Hamano @ 2021-01-06  1:50 UTC (permalink / raw)
  To: brian m. carlson
  Cc: git, Jeff King, Ævar Arnfjörð Bjarmason,
	Phillip Wood

"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> I added it because I imagine the use cases for this feature aren't
> immediately obvious to a lot of people and the general rule is that
> commit messages explain why we would implement such a feature.

Yeah, I understand that.

>> > +SHA-1 is not accepted as a hash algorithm in mailmaps.
>>  ...
> I'll drop that line.

Thanks.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 5/5] mailmap: support hashed entries in mailmaps
  2021-01-06  0:24     ` brian m. carlson
@ 2021-01-10 19:24       ` Ævar Arnfjörð Bjarmason
  2021-01-10 21:26         ` brian m. carlson
  0 siblings, 1 reply; 19+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-01-10 19:24 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git, Jeff King, Phillip Wood

On Wed, Jan 06 2021, brian m. carlson wrote:

> On 2021-01-05 at 14:21:40, Ævar Arnfjörð Bjarmason wrote:
>> 
>> On Sun, Jan 03 2021, brian m. carlson wrote:
>> 
>> I think it makes sense to split up 1-4/5 here from 5/5 in this series
>> since they're really unrelated changes, although due to the changes in
>> 1-4 they'll conflict.
>
> Okay, I'll drop them.

Not replying to most of this E-Mail because I think there's nothing left
to add / you clarified things for me in those cases / we respectfully
disagree / any outstanding points we can pick up in your re-roll /
whatever :)

>> So we're talking about hiding the old E-Mail, presumably because it was
>> joe@ intsead of jane@, so in that case we could just support URI
>> encoding:
>> 
>>     Jane Doe <jane@example.com>
>>     <jane@example.com> <%6A%6F%65@%64%65%76%65%6C%6F%70%65%72.%63%6F%6D>
>> 
>> Made via:
>> 
>>     $ perl -MURI::Escape=uri_escape -wE 'say uri_escape q[joe@developer.com], "^@."'
>>     %6A%6F%65@%64%65%76%65%6C%6F%70%65%72.%63%6F%6D
>> 
>> Which also has the nice attribute that people can make it obvious what
>> part they want to hide, since this is really a feature to enable social
>> politeness & consideration:
>> 
>>     Jane Doe <jane@example.com>
>>     # I don't want to be known by my old name, thanks
>>     <jane@example.com> <%6A%6F%65@developer.com>
>
> I don't think this feature is going to get used if we just encode names
> or email addresses.  In the United States, when someone transitions,
> they get a court order to change their name.  I don't think a lot of
> corporate environments are going to want to just encode an old name or
> email address in a trivially invertible way given that.  This is
> typically a topic handled with some sensitivity in most companies.
>
> I will tell you that I would not just use an encoded version if I were
> changing my name for any of the reasons I've mentioned.  That wouldn't
> cut it for me, and I wouldn't use such a feature.  The feature I'm
> implementing is a feature I've talked with trans folks about, and that's
> why I'm implementing this as it is.  The response I got was essentially,
> "It's not everything I want, but it's an improvement."
>
> If the decision is that we want to go with encoding instead of hashing,
> then I'll drop this patch.  I'm not going to put my name or sign-off on
> that because I don't think it meets the need I'm addressing here.
>
> The entire problem, of course, is that we bake a human's personal name
> and email address immutably into a Merkle tree.  We know full well that
> people do change their names and email addresses all the time (e.g.,
> marriage, job changes), and yet we have this design.  In retrospect, we
> should have done something different, but hindsight is 20/20 and I'm
> just trying to do the best we can with what we've got.

Doesn't the difference in some sense boil down to either an implicit
promise or an implicit assumption that the hashed version is forever
going to be protected by some security-through-obscurity/inconvenience
when it comes to git.git & its default tooling?

And would those users be as comfortable with the difference between
encoded v.s. hashed if e.g. "git check-mailmap" learned to read the
.mailmap and search-replace all the hashed versions with their
materialized values, or if popular tools like Emacs learned to via a Git
.mailmap in a "need translation" similar to *.gpg and *.gz. How about if
popular web views of Git served up that materialized "check-mailmap"
output by default?

None of which I think is implausible that we'll get as follow-up
patches, I might even submit some at some point, not out of some spite.
Just because I don't want to maintain out-of-tree code for an
out-of-tree program that understands a Git .mailmap today, but where I'd
need to search-replace the hashed versions.

Ditto it being very likely that popular editors or web viewers will gain
support for this, just because it's tedious to manually hash &
copy/paste & validate values.

In looking at some of the fsck code recently & having some
yet-unsubmitted patches I thought of trying to compine it with
mailmap. I.e. it seems like a natural feature for fsck to gain to warn
you about unused mailmap entries, just like it can warn about
unreachable/dangling objects. After all these are really just sort-of
pointers into our Merkle tree. Spewing out all the mappings seems like
an obvious addition to that, e.g. in spewing out an
"optimized/non-redundant" (plain or hashed) mailmap to re-commit.

That's the main reason I'm uncomfortable with this approach, because it
seems to me to implicitly rely on things that are tedious now, but which
the march of history all but inevitably should make trivial if we were
to integrate it. Unless we're *also* promising to forever intentionally
(and artificially) keep it inconvenient.

E.g. the example of how long it takes to clone & extract this info from
chromium.git in the v1 thread.

It seems like a fair assumption that we'll have some future version of
git where you can ask a remote server about that sort of thing in
milliseconds.

Not because of this hashed .mailmap thing in particular, just as an
emergent effect that it's happy to serve up things it knows about the
DAG from having walked & cached it in general. E.g. info from the
commit-graph, what hash is contained in what ref, or how one value (such
as a .mailmap entry) maps to another etc.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 5/5] mailmap: support hashed entries in mailmaps
  2021-01-10 19:24       ` Ævar Arnfjörð Bjarmason
@ 2021-01-10 21:26         ` brian m. carlson
  0 siblings, 0 replies; 19+ messages in thread
From: brian m. carlson @ 2021-01-10 21:26 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: git, Jeff King, Phillip Wood

[-- Attachment #1: Type: text/plain, Size: 1317 bytes --]

On 2021-01-10 at 19:24:34, Ævar Arnfjörð Bjarmason wrote:
> Doesn't the difference in some sense boil down to either an implicit
> promise or an implicit assumption that the hashed version is forever
> going to be protected by some security-through-obscurity/inconvenience
> when it comes to git.git & its default tooling?
> 
> And would those users be as comfortable with the difference between
> encoded v.s. hashed if e.g. "git check-mailmap" learned to read the
> .mailmap and search-replace all the hashed versions with their
> materialized values, or if popular tools like Emacs learned to via a Git
> .mailmap in a "need translation" similar to *.gpg and *.gz. How about if
> popular web views of Git served up that materialized "check-mailmap"
> output by default?
> 
> None of which I think is implausible that we'll get as follow-up
> patches, I might even submit some at some point, not out of some spite.
> Just because I don't want to maintain out-of-tree code for an
> out-of-tree program that understands a Git .mailmap today, but where I'd
> need to search-replace the hashed versions.

Yes, I think we do rely on this being inconvenient.  If you plan to
submit such a patch, I'm going to let this series drop.
-- 
brian m. carlson (he/him or they/them)
Houston, Texas, US

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 4/5] mailmap: use case-sensitive comparisons for local-parts and names
  2021-01-06  0:46     ` Junio C Hamano
@ 2021-01-12 14:08       ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 19+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-01-12 14:08 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: brian m. carlson, git, Jeff King, Phillip Wood,
	Johannes Schindelin


On Wed, Jan 06 2021, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>
>> On Sun, Jan 03 2021, brian m. carlson wrote:
>>
>> We just have to worry about cases where you're not all of these people
>> in one project's commit metadata and/or .mailmap, and thus mailmap rules
>> would match too generously:
>>
>>     "brian m. carlson" <sandals@crustytoothpaste.net>
>>     "brian m. carlson" <SANDALS@crustytoothpaste.net>
>>     "BRIAN M. CARLSON" <sandals@crustytoothpaste.net>
>>     "BRIAN M. CARLSON" <SANDALS@crustytoothpaste.net>
>>
>> Is that really plausible? In any case, neither of these two patches make
>> reference to us already having changed this in the past in 1.6.2 and &
>> there being reports on the ML about the bug & us changing it back. See
>> https://lore.kernel.org/git/f182fb1700e8dea15459fd02ced2a6e5797bec99.1238458535u.git.johannes.schindelin@gmx.de/
>>
>> Maybe we should still do this, but I think for a v3 it makes sense to
>> summarize that discussion etc.
>
> After reading the old discussion again, I am not sure if this is
> worth doing.  To many people, it is a promise we've made and kept
> that we treat addresses including the local part case-insensitively
> when summarising commits by ident strings.
>
> I'd really wish that this series were structured to have 5/5 early
> and 3&4/5 squashed into a single final patch.

Something that I only realized after I sent
<87czykvg19.fsf@evledraar.gmail.com>: Any problems .mailmap has with
Turkish "dotless I" we have already with "git log --author=<name> -i".

Not to say that there isn't some problem to solve here, just that if we
do it's a more general issue than mailmap.

As can be inferred from my upthread reply I thought that was ASCII-only,
but it turns out we do set LC_CTYPE based on the user's locale, and it
also applies for English-speakers. E.g. in LANG=en_US.UTF-8
"--author=ævar -i" will work.

Of course that doesn't address the point of whether we should be
DWYM-ing E-Mail addresses, just the sub-claim that one reason we
shouldn't is because of impossible-to-solve Unicode edge cases.

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2021-01-12 14:10 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-03 21:18 [PATCH v2 0/5] Hashed mailmap brian m. carlson
2021-01-03 21:18 ` [PATCH v2 1/5] mailmap: add a function to inspect the number of entries brian m. carlson
2021-01-04 15:14   ` Ævar Arnfjörð Bjarmason
2021-01-04 17:04   ` René Scharfe
2021-01-03 21:18 ` [PATCH v2 2/5] mailmap: switch to opaque struct brian m. carlson
2021-01-04 15:17   ` Ævar Arnfjörð Bjarmason
2021-01-03 21:18 ` [PATCH v2 3/5] t4203: add failing test for case-sensitive local-parts and names brian m. carlson
2021-01-03 21:18 ` [PATCH v2 4/5] mailmap: use case-sensitive comparisons for " brian m. carlson
2021-01-04 16:10   ` Ævar Arnfjörð Bjarmason
2021-01-06  0:46     ` Junio C Hamano
2021-01-12 14:08       ` Ævar Arnfjörð Bjarmason
2021-01-03 21:18 ` [PATCH v2 5/5] mailmap: support hashed entries in mailmaps brian m. carlson
2021-01-05 14:21   ` Ævar Arnfjörð Bjarmason
2021-01-06  0:24     ` brian m. carlson
2021-01-10 19:24       ` Ævar Arnfjörð Bjarmason
2021-01-10 21:26         ` brian m. carlson
2021-01-05 20:05   ` Junio C Hamano
2021-01-06  0:28     ` brian m. carlson
2021-01-06  1:50       ` Junio C Hamano

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).