git@vger.kernel.org mailing list mirror (one of many)
* Wildcards in mailmap to hide transgender people's deadnames
@ 2022-09-13 21:53 Florine W. Dekker
  2022-09-14  7:40 ` René Scharfe
  0 siblings, 1 reply; 14+ messages in thread
From: Florine W. Dekker @ 2022-09-13 21:53 UTC (permalink / raw)
  To: git

Hi! I would like to suggest that the mailmap feature accepts (a limited 
form of) wildcards for matching email addresses, which helps transgender 
users configure the mailmap to map their old name ("deadname") and email 
to their new name and email without revealing the old info in the 
mailmap config itself.

For example, consider a user who changed their name from Jane Doe to 
John Doe, and their email from jane.doe@example.com to 
john.doe@example.com. John wants to prevent others from learning their 
old name, but sometimes it's not feasible to rewrite the entire history 
of the repository (e.g. because there are thousands of commits, or 
because this would mess up references between commits). In this case, 
mailmap seems like a good way to prevent people from finding out the old 
name by accident: Just add the line `John Doe <john.doe@example.com> 
<jane.doe@example.com>` to the mailmap config. However, this has the 
unfortunate effect that readers may now accidentally find John's old 
name if they look at the mailmap config.

I suggest that mailmap config files support wildcards in the email 
address. This helps people who have changed their name to specify a 
mapping without revealing their old name in the definition of this 
mapping. Because the * symbol is valid in an email address, I suggest 
the sequence \* to be the wildcard symbol, meaning "0 or more symbols". 
This cannot be misinterpreted in an RFC5322-valid email address, because 
this sequence is not legal in the domain part, is not legal in an 
unquoted local part, and is not legal in a quoted local part unless 
preceded by an unescaped backslash (that is, "jo\\*hn"@doe.com does not 
contain a wildcard). In short, if mailmap encounters the sequence \* in 
an email address, it should interpret the sequence as a wildcard if and 
only if it is not directly preceded by an odd number of backslashes 
regardless of whether the local part is quoted (so \* is a wildcard, \\* 
is not, \\\* is, \\\\* is not).
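
As a rough illustration of the parity rule above (a sketch only, not part of Git; the function name `wildcard_positions` is made up for this example), the check amounts to counting the run of backslashes directly before each `*`:

```python
def wildcard_positions(pattern: str) -> list[int]:
    """Return the index of every '*' that would act as a wildcard."""
    positions = []
    for i, ch in enumerate(pattern):
        if ch != "*":
            continue
        # Count the unbroken run of backslashes immediately before this '*'.
        run = 0
        j = i - 1
        while j >= 0 and pattern[j] == "\\":
            run += 1
            j -= 1
        # An odd run means the final backslash is itself unescaped,
        # so the pair "\*" is read as a wildcard.
        if run % 2 == 1:
            positions.append(i)
    return positions


for example in [r"\*", r"\\*", r"\\\*", r"\\\\*", r"\*.doe@example.com"]:
    print(example, wildcard_positions(example))
```

Under this reading a bare `*` stays a literal character, and only an odd run of backslashes before it turns it into a wildcard.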

John can now add the following line to their mailmap config: `John 
Doe <john.doe@example.com> <\*.doe@example.com>`, which does not reveal 
their old name. Someone could always spend more effort to uncover the 
name using more advanced tools, but the point of this feature is to 
prevent accidental discovery of the name in cases where completely 
hiding the name is not feasible.

If you have feedback or comments on this suggestion, please let me know.

- Florine




* Re: Wildcards in mailmap to hide transgender people's deadnames
  2022-09-13 21:53 Wildcards in mailmap to hide transgender people's deadnames Florine W. Dekker
@ 2022-09-14  7:40 ` René Scharfe
  2022-09-14  9:07   ` Florine W. Dekker
       [not found]   ` <CANgJU+Wt_yjv1phwiSUtLLZ=JKA9LvS=0UcBYNu+nxdJ_7d_Ew@mail.gmail.com>
  0 siblings, 2 replies; 14+ messages in thread
From: René Scharfe @ 2022-09-14  7:40 UTC (permalink / raw)
  To: Florine W. Dekker, git; +Cc: brian m . carlson

Am 13.09.22 um 23:53 schrieb Florine W. Dekker:
> John can now add the following line to their mailmap config:
> `John Doe <john.doe@example.com> <\*.doe@example.com>`, which does
> not reveal their old name.

That would falsely attribute the work of possible future developers
ann.doe@example.com and bob.doe@example.com to John as well.
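
To make the concern concrete, here is a small sketch. It rests on assumptions: the helper `pattern_to_regex` is hypothetical, and it treats every `\*` as a wildcard without handling escaped backslashes.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    # Simplification: split on the two-character sequence "\*" and
    # join the literal pieces with ".*" ("0 or more symbols").
    parts = pattern.split("\\*")
    return re.compile(".*".join(re.escape(part) for part in parts))


rule = pattern_to_regex(r"\*.doe@example.com")
addresses = [
    "jane.doe@example.com",
    "ann.doe@example.com",
    "bob.doe@example.com",
    "ann.roe@example.com",
]
# Map each address to whether the wildcard entry would claim it.
matches = {addr: bool(rule.fullmatch(addr)) for addr in addresses}
for addr, hit in matches.items():
    print(addr, hit)
```

Every `*.doe@example.com` address matches, so commits by ann.doe and bob.doe would be rewritten to John along with the intended ones.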

Supporting hashed entries would allow for a more targeted obfuscation.
That was discussed a while ago:
https://lore.kernel.org/git/20210103211849.2691287-1-sandals@crustytoothpaste.net/

> Someone could always spend more effort to uncover the name using more
> advanced tools, but the point of this feature is to prevent
> accidental discovery of the name in cases where completely hiding the
> name is not feasible.

Extracting old email addresses from a repository is easy by comparing
authors' email addresses without and with mailmap applied, no advanced
tools required.  Here's mine from Git's own repo:

   $ git log --format='%ae %aE' |
     awk '$1 != $2 && !a[$0] {a[$0] = 1; print}' |
     grep -F l.s.r@web.de
   rene.scharfe@lsrfire.ath.cx l.s.r@web.de

The same can be done with names (%an/%aN).
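
The same comparison can be done outside the shell. A minimal Python sketch, assuming its input is the line-by-line output of the `git log --format='%ae %aE'` command above (the function name `mapped_pairs` is made up here):

```python
def mapped_pairs(log_lines):
    """Return distinct (raw, mapped) address pairs where mailmap changed the value."""
    seen = set()
    pairs = []
    for line in log_lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip anything that is not "<raw> <mapped>"
        raw, mapped = fields
        pair = (raw, mapped)
        # Same idea as the awk filter: differing fields, first occurrence only.
        if raw != mapped and pair not in seen:
            seen.add(pair)
            pairs.append(pair)
    return pairs


sample = [
    "old@example.org new@example.org",
    "old@example.org new@example.org",
    "same@example.org same@example.org",
]
print(mapped_pairs(sample))
```

In a real repository the lines could come from `subprocess.run(["git", "log", "--format=%ae %aE"], capture_output=True, text=True, check=True).stdout.splitlines()`.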

René



* Re: Wildcards in mailmap to hide transgender people's deadnames
  2022-09-14  7:40 ` René Scharfe
@ 2022-09-14  9:07   ` Florine W. Dekker
  2022-09-19 11:20     ` Ævar Arnfjörð Bjarmason
       [not found]   ` <CANgJU+Wt_yjv1phwiSUtLLZ=JKA9LvS=0UcBYNu+nxdJ_7d_Ew@mail.gmail.com>
  1 sibling, 1 reply; 14+ messages in thread
From: Florine W. Dekker @ 2022-09-14  9:07 UTC (permalink / raw)
  To: René Scharfe, git; +Cc: brian m . carlson

On 14/09/2022 09:40, René Scharfe wrote:
> Am 13.09.22 um 23:53 schrieb Florine W. Dekker:
>> John can now add the following line to their mailmap config:
>> `John Doe <john.doe@example.com> <\*.doe@example.com>`, which does
>> not reveal their old name.
> That would falsely attribute the work of possible future developers
> ann.doe@example.com and bob.doe@example.com to John as well.

Good point. I assumed such false positives would be unlikely because I 
was considering very-small-scale projects, but I agree that using 
wildcards is not at all feasible for larger projects.

> Supporting hashed entries would allow for a more targeted obfuscation.
> That was discussed a while ago:
> https://lore.kernel.org/git/20210103211849.2691287-1-sandals@crustytoothpaste.net/

That was an interesting read. I agree with Ævar in that thread in that I 
think URL encoding is sufficient. I think it meets Brian's use case of 
never having to see the old name again, and my use case of obfuscating 
it from accidental discovery by friendly collaborators. While a hash 
certainly gives a stronger sense of security, I think it's a false sense 
of security, because, as you note below, recovering old email addresses 
from the tree is not much harder than reversing the encoding. And 
either way, a sha256 hash can easily be inverted in a few days(?) using 
a dictionary attack with email addresses from data breaches. As someone 
who has changed her name, I would be content with using a simple URL 
encoding.
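
For illustration, over-encoding an address so that no plain letters remain could look like the sketch below. This is only an assumption about what such entries might look like: Git's mailmap does not decode percent-encoding today, and `percent_encode_all` is a made-up helper. Decoding uses the standard `urllib.parse.unquote`.

```python
from urllib.parse import unquote


def percent_encode_all(text: str) -> str:
    # Encode every byte, including plain ASCII letters, as %XX.
    return "".join(f"%{byte:02X}" for byte in text.encode("utf-8"))


encoded = percent_encode_all("jane.doe@example.com")
print(encoded)           # unreadable at a glance
print(unquote(encoded))  # trivially reversible by anyone who tries
```

The second print shows the trade-off discussed here: the encoding hides the value from a casual reader but offers no protection against anyone who decodes it.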

>> Someone could always spend more effort to uncover the name using more
>> advanced tools, but the point of this feature is to prevent
>> accidental discovery of the name in cases where completely hiding the
>> name is not feasible.
> Extracting old email addresses from a repository is easy by comparing
> authors' email addresses without and with mailmap applied, no advanced
> tools required.  Here's mine from Git's own repo:
>
>     $ git log --format='%ae %aE' |
>       awk '$1 != $2 && !a[$0] {a[$0] = 1; print}' |
>       grep -F l.s.r@web.de
>     rene.scharfe@lsrfire.ath.cx l.s.r@web.de
>
> The same can be done with names (%an/%aN).

You're absolutely right. With "advanced tools" I was referring to 
anything more advanced than a plain `git log` ;-)

- Florine




* Re: Wildcards in mailmap to hide transgender people's deadnames
       [not found]   ` <CANgJU+Wt_yjv1phwiSUtLLZ=JKA9LvS=0UcBYNu+nxdJ_7d_Ew@mail.gmail.com>
@ 2022-09-16 16:59     ` Florine W. Dekker
  2022-09-20  0:32       ` brian m. carlson
  0 siblings, 1 reply; 14+ messages in thread
From: Florine W. Dekker @ 2022-09-16 16:59 UTC (permalink / raw)
  To: demerphq, René Scharfe; +Cc: Git, brian m . carlson

On 14/09/2022 11:58, demerphq wrote:
> Yes. The way that git models identity is flawed as it makes the 
> mistake in assuming names are constant attributes of a person. Of 
> course this is not true at all, people change names for all kinds of 
> reasons and in some countries close to half the population will change 
> their name over the course of their lifetime at least once, when they 
> marry. So this is not some woke issue, it's a long standing issue in 
> how men traditionally model identity in software systems. I'm a man 
> and I've made that mistake myself, it's a common blindspot.
>
> Git really should use some level of non cleartext indirection on 
> identity, and store that data outside of the change log. Then history 
> wouldn't need to be written to update someone's particulars and many 
> identity concerns would just go away.
>
> Arguably .mailmap is just a workaround for the mismatch between model 
> and reality and doesn't really solve the problems of names changing and 
> actually makes it worse. Really this should be fixed at a deeper 
> level. The trick I guess is how would one do that in a back compatible 
> way.
>
> Yves

I understand what you mean, and agree that mailmap is just a workaround 
for this issue, having been designed to unify a user's multiple 
identifiers, rather than helping move on from a now-invalid identifier. 
Being completely new to this mailing list, however, I feel that solving 
the issues you raise might be a bit much for me to take on.

Instead, for now, I'm interested to see what we can do with mailmap as a 
workaround. I like the idea of using URL encoding, and would like to 
hear others' opinions on doing so. I think it provides a social signal 
on its obfuscated state, it prevents people from accidentally finding 
out, and is easy and efficient to execute.

- Florine




* Re: Wildcards in mailmap to hide transgender people's deadnames
  2022-09-14  9:07   ` Florine W. Dekker
@ 2022-09-19 11:20     ` Ævar Arnfjörð Bjarmason
  2022-09-19 12:27       ` rsbecker
  2022-09-19 15:19       ` brian m. carlson
  0 siblings, 2 replies; 14+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-09-19 11:20 UTC (permalink / raw)
  To: Florine W. Dekker; +Cc: René Scharfe, git, brian m . carlson


On Wed, Sep 14 2022, Florine W. Dekker wrote:

> On 14/09/2022 09:40, René Scharfe wrote:
>> Am 13.09.22 um 23:53 schrieb Florine W. Dekker:
>>> John can now add the following line to their mailmap config:
>>> `John Doe <john.doe@example.com> <\*.doe@example.com>`, which does
>>> not reveal their old name.
>> That would falsely attribute the work of possible future developers
>> ann.doe@example.com and bob.doe@example.com to John as well.

First, I'm very happy to see that someone has picked up the thread on
this again.

> Good point. I assumed such false positives would be unlikely because I
> was considering very-small-scale projects, but I agree that using 
> wildcards is not at all feasible for larger projects.

Yes, please, making the mapping fuzzy in any way is really going against
the core design of the mailmap mechanism, it should be unambiguous,
*also* for commits going forward.

>> Supporting hashed entries would allow for a more targeted obfuscation.
>> That was discussed a while ago:
>> https://lore.kernel.org/git/20210103211849.2691287-1-sandals@crustytoothpaste.net/
>
> That was an interesting read. I agree with Ævar in that thread in that
> I think URL encoding is sufficient. I think it meets Brian's use case
> of never having to see the old name again, and my use case of
> obfuscating it from accidental discovery by friendly
> collaborators.

The question that was left open in my mind after that previous
discussion was whether people who wanted the "deadname" feature would
find this acceptable, I don't think we got any explicit ACK/NACK on that
(but I may be misrecalling, and didn't go back & re-read the whole
thing).

I'm happy that there's at least one ACK to it here in the form of your
reply, and hopefully that represents what a wider audience would prefer.

> While a hash certainly gives a stronger sense of
> security, I think it's a false sense of security, because, as you note
> below, recovering old email addresses from the tree is not much harder
> than reversing the encoding. And either way, a sha256 hash can
> easily be inverted in a few days(?) using a dictionary attack with
> email addresses from data breaches.

It's going to be "milliseconds", not "days". Brute-forcing a SHA-256 to
find an unknown E-Mail address might take longer, but by definition for
a .mailmap entry you already have both sides.

So "brute-forcing" is just a matter of hashing authors & E-Mails in our
history, and seeing if they correspond to .mailmap entries.
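
A sketch of why this is fast, assuming (hypothetically) that a hashed entry stores the hex SHA-256 of the old address; the actual series may use a different format. Every candidate address already sits in the commit history, so reversing an entry costs one hash per known address:

```python
import hashlib


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def reverse_hashed_entries(hashed_entries, history_emails):
    """Map each hashed mailmap entry back to the address it was derived from."""
    # Hash every address seen in history once, then look entries up directly.
    lookup = {sha256_hex(email): email for email in set(history_emails)}
    return {h: lookup[h] for h in hashed_entries if h in lookup}


history = ["jane.doe@example.com", "john.doe@example.com", "ann@example.org"]
hashed = [sha256_hex("jane.doe@example.com")]
print(reverse_hashed_entries(hashed, history))
```

No dictionary of leaked addresses is needed, because the repository itself supplies the candidates.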

> As someone who has changed her name, I would be content with using a
> simple URL encoding.

I'd be happy to have that as a feature, in particular because (as I
pointed out in the previous discussion) it has a large use-case outside
of this .mailmap topic, namely wanting to map e.g. mis-encoded author
names in past commits to the right encoding (which I've personally had
some use-cases for).

There might be other "bonus" use-cases I've missed. E.g. is ">" or "<"
allowed in obscure E-Mail addresses (maybe within quotes?), our current
parser would barf on it, but being able to URI-encode it would work
around that. I don't know offhand to what extent there's an overlap with
various RFC-pedantic E-Mail addresses one could come up with, and what
we'd accept in commit objects with "fsck".

In any case, I think that an implementation of this & patch to
gitmailmap(5) should explain this sort of feature in those terms. If
some people then find it useful to encode things in the ASCII-space for
some reason (e.g. the social "deadname" reason) that would also be
useful.

But in terms of the docs I don't think it should be documented in that
way. Git just needs to provide the feature, we don't need to dictate how
& why someone might use it.

>> [...]
>>     $ git log --format='%ae %aE' |
>>       awk '$1 != $2 && !a[$0] {a[$0] = 1; print}' |
>>       grep -F l.s.r@web.de
>>     rene.scharfe@lsrfire.ath.cx l.s.r@web.de
>>
>> The same can be done with names (%an/%aN).
>
> You're absolutely right. With "advanced tools" I was referring to
> anything more advanced than a plain `git log` ;-)

The thing that still makes me a bit nervous on this topic is that we
need to make it really clear that we're *not* providing some promise of
obscuring these values going forward, but just providing a feature that
some people might rely on as a combined social mechanism, and with the
assumption that the defaults of the "git log" view are unlikely to
change.

I.e. I think a "deadname" use-case of this would probably:

* Have some comment at the top of .mailmap about why some values are
  over-encoded (or perhaps it would be obvious to everyone working on
  that repo why someone was encoding the "plain ASCII" A-Za-z0-9 space).

* Use the default "git log" view, where we happen to map these (given
  the right options, config etc.)

But should not:

* Assume that other tools such as "fsck", "check-mailmap" or even "log"
  won't have future features that make de-obscuring these values easier,
  or something that's part of a normal workflow.

  E.g. I've wanted a "fsck for mailmap" for a while, i.e. to scan the
  file, parse our history, and see which entries are redundant or even
  potentially missing (based on e.g. names matching, but having
  different E-Mail addresses).

  It would be hard not to de-obscure URI encoded values for some
  features like that, e.g. if "log" adds the ability to say "this name X
  was mapped from Y".

* In general pretend that the mailmap is anything but a *public* and
  easily readable mapping. It's inherent in the feature that the
  consumer of it will know that X used to be Y.
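
To give a feel for the "fsck for mailmap" idea mentioned in the list above, here is a minimal sketch of just the "redundant entry" check. The function names are made up, the parsing is deliberately simplistic (it takes the last `<...>` on each line as the commit address), and nothing like this exists in Git as far as this thread shows:

```python
import re


def commit_emails(mailmap_text: str) -> list[str]:
    """Collect the commit-side address (the last <...>) from each mailmap line."""
    emails = []
    for line in mailmap_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no mapping
        found = re.findall(r"<([^>]*)>", line)
        if found:
            emails.append(found[-1])
    return emails


def unused_entries(mailmap_text: str, history_emails) -> list[str]:
    """Return commit-side addresses that never occur in the given history."""
    # Assumption: addresses are compared case-insensitively.
    seen = {email.lower() for email in history_emails}
    return [e for e in commit_emails(mailmap_text) if e.lower() not in seen]


mailmap = """# example mailmap
John Doe <john.doe@example.com> <jane.doe@example.com>
Ann Roe <ann@example.org> <old-ann@example.org>
"""
print(unused_entries(mailmap, ["jane.doe@example.com", "ann@example.org"]))
```

Even this small check has to pair each entry with the raw addresses in history, which is why a tool of this kind would tend to surface the very mapping an encoded entry tries to keep out of sight.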

The last thing we want is to create some feature that effectively ends
up being some self-doxxing (or self-"de-deadnaming"?) mechanism, because
we've left a gap between user expectations and what we can realistically
provide.


* RE: Wildcards in mailmap to hide transgender people's deadnames
  2022-09-19 11:20     ` Ævar Arnfjörð Bjarmason
@ 2022-09-19 12:27       ` rsbecker
  2022-09-19 15:19       ` brian m. carlson
  1 sibling, 0 replies; 14+ messages in thread
From: rsbecker @ 2022-09-19 12:27 UTC (permalink / raw)
  To: 'Ævar Arnfjörð Bjarmason',
	'Florine W. Dekker'
  Cc: 'René Scharfe', git, 'brian m . carlson'

On September 19, 2022 7:20 AM, Ævar Arnfjörð Bjarmason wrote:
>On Wed, Sep 14 2022, Florine W. Dekker wrote:
>
>> On 14/09/2022 09:40, René Scharfe wrote:
>>> Am 13.09.22 um 23:53 schrieb Florine W. Dekker:
>>>> John can now add the following line to their mailmap config:
>>>> `John Doe <john.doe@example.com> <\*.doe@example.com>`, which does
>>>> not reveal their old name.
>>> That would falsely attribute the work of possible future developers
>>> ann.doe@example.com and bob.doe@example.com to John as well.
>
>First, I'm very happy to see that someone has picked up the thread on this again.
>
>> Good point. I assumed such false positives would be unlikely because I
>> was considering very-small-scale projects, but I agree that using
>> wildcards is not at all feasible for larger projects.
>
>Yes, please, making the mapping fuzzy in any way is really going against the core
>design of the mailmap mechanism, it should be unambiguous,
>*also* for commits going forward.
>
>>> Supporting hashed entries would allow for a more targeted obfuscation.
>>> That was discussed a while ago:
>>> https://lore.kernel.org/git/20210103211849.2691287-1-sandals@crustytoothpaste.net/
>>
>> That was an interesting read. I agree with Ævar in that thread in that
>> I think URL encoding is sufficient. I think it meets Brian's use case
>> of never having to see the old name again, and my use case of
>> obfuscating it from accidental discovery by friendly collaborators.
>
>The question that was left open in my mind after that previous discussion was
>whether people who wanted the "deadname" feature would find this acceptable,
>I don't think we got any explicit ACK/NACK on that (but I may be misrecalling, and
>didn't go back & re-read the whole thing).
>
>I'm happy that there's at least one ACK to it here in the form of your reply, and
>hopefully that represents what a wider audience would prefer.
>
>> While a hash certainly gives a stronger sense of security, I think
>> it's a false sense of security, because, as you note below, recovering
>> old email addresses from the tree is not much harder than
>> reversing the encoding. And either way, a sha256 hash can easily be
>> inverted in a few days(?) using a dictionary attack with email
>> addresses from data breaches.
>
>It's going to be "milliseconds", not "days". Brute-forcing a SHA-256 to find an
>unknown E-Mail address might take longer, but by definition for a .mailmap entry
>you already have both sides.
>
>So "brute-forcing" is just a matter of hashing authors & E-Mails in our history, and
>seeing if they correspond to .mailmap entries.
>
>> As someone who has changed her name, I would be content with using a
>> simple URL encoding.
>
>I'd be happy to have that as a feature, in particular because (as I pointed out in the
>previous discussion) it has a large use-case outside of this .mailmap topic, namely
>wanting to map e.g. mis-encoded author names in past commits to the right
>encoding (which I've personally had some use-cases for).
>
>There might be other "bonus" use-cases I've missed. E.g. is ">" or "<"
>allowed in obscure E-Mail addresses (maybe within quotes?), our current parser
>would barf on it, but being able to URI-encode it would work around that. I don't
>know offhand to what extent there's an overlap with various RFC-pedantic E-Mail
>addresses one could come up with, and what we'd accept in commit objects with
>"fsck".
>
>In any case, I think that an implementation of this & patch to
>gitmailmap(5) should explain this sort of feature in those terms. If some people
>then find it useful to encode things in the ASCII-space for some reason (e.g. the
>social "deadname" reason) that would also be useful.
>
>But in terms of the docs I don't think it should be documented in that way. Git just
>needs to provide the feature, we don't need to dictate how & why someone
>might use it.
>
>>> [...]
>>>     $ git log --format='%ae %aE' |
>>>       awk '$1 != $2 && !a[$0] {a[$0] = 1; print}' |
>>>       grep -F l.s.r@web.de
>>>     rene.scharfe@lsrfire.ath.cx l.s.r@web.de
>>>
>>> The same can be done with names (%an/%aN).
>>
>> You're absolutely right. With "advanced tools" I was referring to
>> anything more advanced than a plain `git log` ;-)
>
>The thing that still makes me a bit nervous on this topic is that we need to make it
>really clear that we're *not* providing some promise of obscuring these values
>going forward, but just providing a feature that some people might rely on as a
>combined social mechanism, and with the assumption that the defaults of the "git
>log" view are unlikely to change.
>
>I.e. I think a "deadname" use-case of this would probably:
>
>* Have some comment at the top of .mailmap about why some values are
>  over-encoded (or perhaps it would be obvious to everyone working on
>  that repo why someone was encoding the "plain ASCII" A-Za-z0-9 space).
>
>* Use the default "git log" view, where we happen to map these (given
>  the right options, config etc.)
>
>But should not:
>
>* Assume that other tools such as "fsck", "check-mailmap" or even "log"
>  won't have future features that make de-obscuring these values easier,
>  or something that's part of a normal workflow.
>
>  E.g. I've wanted a "fsck for mailmap" for a while, i.e. to scan the
>  file, parse our history, and see which entries are redundant or even
>  potentially missing (based on e.g. names matching, but having
>  different E-Mail addresses).
>
>  It would be hard not to de-obscure URI encoded values for some
>  features like that, e.g. if "log" adds the ability to say "this name X
>  was mapped from Y".
>
>* In general pretend that the mailmap is anything but a *public* and
>  easily readable mapping. It's inherent in the feature that the
>  consumer of it will know that X used to be Y.
>
>The last thing we want is to create some feature that effectively ends up being
>some self-doxxing (or self-"de-deadnaming"?) mechanism, because we've left a
>gap between user expectations and what we can realistically provide.

As a side topic, which I brought up about 2 years ago, there are other reasons to do this, including GDPR-like rules, to obfuscate identity information. A solution to obfuscation could provide a mechanism to change the attribution. My team has experience in this domain. Do we want to reopen that discussion?

-Randall



* Re: Wildcards in mailmap to hide transgender people's deadnames
  2022-09-19 11:20     ` Ævar Arnfjörð Bjarmason
  2022-09-19 12:27       ` rsbecker
@ 2022-09-19 15:19       ` brian m. carlson
  2022-09-19 16:31         ` Junio C Hamano
  2022-09-20 10:23         ` Ævar Arnfjörð Bjarmason
  1 sibling, 2 replies; 14+ messages in thread
From: brian m. carlson @ 2022-09-19 15:19 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Florine W. Dekker, René Scharfe, git


On 2022-09-19 at 11:20:13, Ævar Arnfjörð Bjarmason wrote:
> I.e. I think a "deadname" use-case of this would probably:
> 
> * Have some comment at the top of .mailmap about why some values are
>   over-encoded (or perhaps it would be obvious to everyone working on
>   that repo why someone was encoding the "plain ASCII" A-Za-z0-9 space).

I don't think we need to do this.  First of all, it makes people curious
and nosy, and it draws attention to the situation when in many cases,
other contributors might not even notice as they're updating the
mailmap.  Adding lots of attention is going to add the potential for
harassment.

> But should not:
> 
> * Assume that other tools such as "fsck", "check-mailmap" or even "log"
>   won't have future features that make de-obscuring these values easier,
>   or something that's part of a normal workflow.

Your statement that you intended to write exactly such a feature was the
main reason I dropped the SHA-256 hashed mailmap series.  I don't think
it's constructive to offer or propose to offer such a feature in Git if
we're trying to obscure people's names in the mailmap, and as such I
would want to see a guarantee that we wouldn't implement or accept such
a feature.  I don't see the point of obscuring names in the mailmap if
we're just going to print them next to each other in the future, and I
don't think it's moving us towards a solution to suggest that we might
do that in the future.

I'm happy to resurrect my SHA-256 hashed mailmap series if we're
all willing to agree to not implement trivial decoding features.

I also have an alternate proposal which I pitched to some folks at Git
Merge and which I just finished writing up that basically moves personal
names and emails out of commits, replacing them with opaque identifiers,
and using a constantly squashed mailmap commit in a special ref to store
the mapping.  This doesn't address changing identities in existing
commits, which as we've seen are nearly impossible to fix, but it does
address new ones.  I've sent it out at
https://lore.kernel.org/git/20220919145231.48245-1-sandals@crustytoothpaste.net/.
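
As a loose illustration of the indirection idea only (this is not the format from the linked proposal; the class name, the random hex identifiers, and the in-memory table are all assumptions made for the example):

```python
import uuid


class IdentityTable:
    """Maps opaque identifiers to a person's current name and email."""

    def __init__(self):
        self._current = {}

    def register(self, name: str, email: str) -> str:
        # The identifier carries no personal information.
        ident = uuid.uuid4().hex
        self._current[ident] = (name, email)
        return ident

    def update(self, ident: str, name: str, email: str) -> None:
        if ident not in self._current:
            raise KeyError(ident)
        # Only the latest value is kept, loosely mirroring a squashed mapping.
        self._current[ident] = (name, email)

    def render(self, ident: str) -> str:
        name, email = self._current[ident]
        return f"{name} <{email}>"


table = IdentityTable()
author = table.register("Jane Doe", "jane.doe@example.com")
commits = [{"author": author, "subject": "Initial commit"}]

table.update(author, "John Doe", "john.doe@example.com")
print(table.render(commits[0]["author"]))
```

The commit record never changes; only the table entry does, so the old name is not stored anywhere once the mapping is updated.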

We may in fact want to do both of these things (hashed or encoded
mailmap and opaque identifiers with squashed mailmap) at once.
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA



* Re: Wildcards in mailmap to hide transgender people's deadnames
  2022-09-19 15:19       ` brian m. carlson
@ 2022-09-19 16:31         ` Junio C Hamano
  2022-09-19 17:26           ` brian m. carlson
  2022-09-20 10:23         ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 14+ messages in thread
From: Junio C Hamano @ 2022-09-19 16:31 UTC (permalink / raw)
  To: brian m. carlson
  Cc: Ævar Arnfjörð Bjarmason, Florine W. Dekker,
	René Scharfe, git

"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> On 2022-09-19 at 11:20:13, Ævar Arnfjörð Bjarmason wrote:
>> I.e. I think a "deadname" use-case of this would probably:
>> 
>> * Have some comment at the top of .mailmap about why some values are
>>   over-encoded (or perhaps it would be obvious to everyone working on
>>   that repo why someone was encoding the "plain ASCII" A-Za-z0-9 space).
>
> I don't think we need to do this.  First of all, it makes people curious
> and nosy, and it draws attention to the situation when in many cases,
> other contributors might not even notice as they're updating the
> mailmap.  Adding lots of attention is going to add the potential for
> harassment.
>
>> But should not:
>> 
>> * Assume that other tools such as "fsck", "check-mailmap" or even "log"
>>   won't have future features that make de-obscuring these values easier,
>>   or something that's part of a normal workflow.
>
> Your statement that you intended to write exactly such a feature was the
> main reason I dropped the SHA-256 hashed mailmap series.  I don't think
> it's constructive to offer or propose to offer such a feature in Git if
> we're trying to obscure people's names in the mailmap, ...

Yes, I remember that exchange, and I find your position reasonable.
Yes, we all know how to build such a feature.  Yes, we know a
third-party implementation of such a feature may materialize.

But we do not have to be the ones to encourage use of such a
feature.

> I also have an alternate proposal which I pitched to some folks at Git
> Merge and which I just finished writing up that basically moves personal
> names and emails out of commits, replacing them with opaque identifiers,

That part I can agree with.

> and using a constantly squashed mailmap commit in a special ref to store
> the mapping.

This part only half (the "special ref" half, not "constantly
squashed" part, even though I know why it matters more to your
goal).  My gut feeling is that auditing and merging will become a
nightmare.




* Re: Wildcards in mailmap to hide transgender people's deadnames
  2022-09-19 16:31         ` Junio C Hamano
@ 2022-09-19 17:26           ` brian m. carlson
  0 siblings, 0 replies; 14+ messages in thread
From: brian m. carlson @ 2022-09-19 17:26 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Ævar Arnfjörð Bjarmason, Florine W. Dekker,
	René Scharfe, git


On 2022-09-19 at 16:31:25, Junio C Hamano wrote:
> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
> 
> > Your statement that you intended to write exactly such a feature was the
> > main reason I dropped the SHA-256 hashed mailmap series.  I don't think
> > it's constructive to offer or propose to offer such a feature in Git if
> > we're trying to obscure people's names in the mailmap, ...
> 
> Yes, I remember that exchange, and I find your position reasonable.
> Yes, we all know how to build such a feature.  Yes, we know a
> third-party implementation of such a feature may materialize.
> 
> But we do not have to be the ones to encourage use of such a
> feature.

Sure.  The goal is to make the tool more friendly (at least to some
folks) to use.  Other people can worsen the experience; we don't have to
do that.

As I said, if we're willing to commit to not add such a decoding feature
to Git, I'm happy to resurrect my hashed mailmap approach with or
without changes and get it ready to merge.  It sounds like that might be
an approach we're comfortable with here.

> > I also have an alternate proposal which I pitched to some folks at Git
> > Merge and which I just finished writing up that basically moves personal
> > names and emails out of commits, replacing them with opaque identifiers,
> 
> That part I can agree with.
> 
> > and using a constantly squashed mailmap commit in a special ref to store
> > the mapping.
> 
> This part only half (the "special ref" half, not "constantly
> squashed" part, even though I know why it matters more to your
> goal).  My gut feeling is that auditing and merging will become a
> nightmare.

Since it's not clear to me, you're saying you think a special ref is
fine, but having it be constantly squashed is not?

If so, I will say that my proposal in the other thread will let folks
keep a history if they want with a config option (although that means
you may need to rewrite history once in a while if someone changes their
name).  In my workflow and in the workflow of folks who primarily work
with forges, that isn't necessary, and the mailmap, if it's even
required, can be maintained independently or even automatically.  For
example, I could imagine GitHub writing my display name into the mailmap
file automatically when one of my pull requests is merged if I and the
repository owner have such an option configured.

However, in my proposed patch workflow, git am does the work for you by
updating the ref automatically, so all you need to do is literally apply
patches with mailmap headers and then push the ref once in a while.

I'm definitely open to discussing this approach more if we think it can
be formed into something viable.
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA



* Re: Wildcards in mailmap to hide transgender people's deadnames
  2022-09-16 16:59     ` Florine W. Dekker
@ 2022-09-20  0:32       ` brian m. carlson
  0 siblings, 0 replies; 14+ messages in thread
From: brian m. carlson @ 2022-09-20  0:32 UTC (permalink / raw)
  To: Florine W. Dekker; +Cc: demerphq, René Scharfe, Git

[-- Attachment #1: Type: text/plain, Size: 1423 bytes --]

On 2022-09-16 at 16:59:23, Florine W. Dekker wrote:
> I understand what you mean, and agree that mailmap is just a workaround for
> this issue, having been designed to unify a user's multiple identifiers,
> rather than helping move on from a now-invalid identifier. Being completely
> new to this mailing list, however, I feel that solving the issues you raise
> might be a mite much for me to take on.

I agree this is a bigger, separate issue that we should address, but it
shouldn't prevent us from doing what improvements we can to the mailmap.

> Instead, for now, I'm interested to see what we can do with mailmap as a
> workaround. I like the idea of using URL encoding, and would like to hear
> others' opinions on doing so. I think it provides a social signal on its
> obfuscated state, it prevents people from accidentally finding out, and is
> easy and efficient to execute.

I think this would be a fine solution.  If folks think the hashed
mailmap would be better, I can resend that, or if we like the
URL-encoded option, that shouldn't be too difficult to implement.

I do appreciate you taking the time to bring this up since I think this
is an important issue to address and it's come up a couple of times.  I
hope that this time we can come up with some sort of improvement with
the mailmap we're willing to take.
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Wildcards in mailmap to hide transgender people's deadnames
  2022-09-19 15:19       ` brian m. carlson
  2022-09-19 16:31         ` Junio C Hamano
@ 2022-09-20 10:23         ` Ævar Arnfjörð Bjarmason
  2022-09-20 14:58           ` Florine W. Dekker
  2022-09-21 16:42           ` Junio C Hamano
  1 sibling, 2 replies; 14+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-09-20 10:23 UTC (permalink / raw)
  To: brian m. carlson; +Cc: Florine W. Dekker, René Scharfe, git


On Mon, Sep 19 2022, brian m. carlson wrote:

> On 2022-09-19 at 11:20:13, Ævar Arnfjörð Bjarmason wrote:
>> I.e. I think a "deadname" use-case of this would probably:
>> 
>> * Have some comment at the top of .mailmap about why some values are
>>   over-encoded (or perhaps it would be obvious to everyone working on
>>   that repo why someone was encoding the "plain ASCII" A-Za-z0-9 space).
>
> I don't think we need to do this.  First of all, it makes people curious
> and nosy, and it draws attention to the situation when in many cases,
> other contributors might not even notice as they're updating the
> mailmap.  

Sure, to clarify I meant this is something that a downstream project
using the .mailmap might want to add, or they might not.

> Adding lots of attention is going to add the potential for
> harassment.

I'm in no way minimizing that potential for harassment, doxxing etc., in
fact I'm vehemently agreeing with that point. But I think this gets to
the crux of our disagreement.

I think it would be irresponsible of us to provide a feature that looks
as though it can in any way mitigate those concerns.

If you're someone that's worried about being harassed if someone makes
the link from your previous identity Y to your current identity X where
you already have Y as part of a public git history, the right answer is
to not submit a change to the .mailmap to explicitly connect the two.

>> But should not:
>> 
>> * Assume that other tools such as "fsck", "check-mailmap" or even "log"
>>   won't have future features that make de-obscuring these values easier,
>>   or something that's part of a normal workflow.
>
> Your statement that you intended to write exactly such a feature was the
> main reason I dropped the SHA-256 hashed mailmap series.  I don't think
> it's constructive to offer or propose to offer such a feature in Git if
> we're trying to obscure people's names in the mailmap, and as such I
> would want to see a guarantee that we wouldn't implement or accept such
> a feature.  I don't see the point of obscuring names in the mailmap if
> we're just going to print them next to each other in the future, and I
> don't think it's moving us towards a solution to suggest that we might
> do that in the future.

I haven't gone back and re-read that whole thread, but I think I was
mainly pointing out that we or someone else can and probably will write
the trivial reverse mapping.

Hence my point above, even if we carefully scrutinize every change to
git.git to ensure that we never implement a feature that de-hashes the
hashes you proposed all it'll take to defeat the entire mechanism is
something trivial like:

	diff -u <(git log) <(git log --no-mailmap)
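
The comparison that one-liner performs can be sketched in Python. This
is only an illustration: rewritten_identities is a hypothetical helper,
and it assumes the two lists hold one author string per commit in the
same commit order (for example, as produced by "git log --no-mailmap"
and "git log" with a format such as "%an <%ae>" versus "%aN <%aE>").

```python
def rewritten_identities(raw_authors, mapped_authors):
    """Return the (raw, mapped) author pairs that differ between two logs.

    Hypothetical helper: both lists are assumed to contain one author
    string per commit, in the same commit order.
    """
    pairs = set()
    for raw, mapped in zip(raw_authors, mapped_authors):
        if raw != mapped:
            # The mailmap rewrote this author, so the pair links both names.
            pairs.add((raw, mapped))
    return sorted(pairs)


if __name__ == "__main__":
    raw = ["Jane Doe <jane.doe@example.com>", "Bob <bob@example.com>"]
    mapped = ["John Doe <john.doe@example.com>", "Bob <bob@example.com>"]
    for old, new in rewritten_identities(raw, mapped):
        print(f"{old} -> {new}")
```

Nothing here depends on how the mailmap entry itself is written, which
is why an encoded or hashed entry does not prevent the linkage.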

> I'm happy to resurrect my SHA-256 hashed mailmap series if we're
> all willing to agree to not implement trivial decoding features.

I'd think you'd want to be really clear about what that forward promise
would entail. E.g. I've sometimes wanted a way for "git log" to report
when it munges commits due to adding notes, re-encoding the data etc. If
someone submits that sort of feature should it always explicitly leave
out mailmap-related rewrites?

And even if it does, who do we think we're really helping in the end,
given the trivial way you could get that with an external "diff" with
the one-liner above?

> I also have an alternate proposal which I pitched to some folks at Git
> Merge and which I just finished writing up that basically moves personal
> names and emails out of commits, replacing them with opaque identifiers,
> and using a constantly squashed mailmap commit in a special ref to store
> the mapping.  This doesn't address changing identities in existing
> commits, which as we've seen are nearly impossible to fix, but it does
> address new ones.  I've sent it out at
> https://lore.kernel.org/git/20220919145231.48245-1-sandals@crustytoothpaste.net/.

As I understand the difference in this scenario a hypothetical future
repo's Y commit's authorship would have been opaque in the first place
using this mechanism, and via your "refs/mailmap" you'd have mapped
Y=Bob.

You then make a future X commit, and map X=Alice, and have a .mailmap
entry which mapped Y=X, but that entry would refer to the opaque value.

That certainly changes things in a fundamental way, and goes most or all
of the way to mitigating what I've been pointing out as a flaw in these
proposals.

I'd still be very much on the fence about whether we'd ever want to
recommend that to someone concerned with "harassment" and the like (as
opposed to a milder social preference), as all it would take to get to
that point is someone having a copy of the older "refs/mailmap" to
unmask the previous "Y".

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Wildcards in mailmap to hide transgender people's deadnames
  2022-09-20 10:23         ` Ævar Arnfjörð Bjarmason
@ 2022-09-20 14:58           ` Florine W. Dekker
  2022-09-21 16:42           ` Junio C Hamano
  1 sibling, 0 replies; 14+ messages in thread
From: Florine W. Dekker @ 2022-09-20 14:58 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, brian m. carlson
  Cc: René Scharfe, git

On 20/09/2022 12:23, Ævar Arnfjörð Bjarmason wrote:
>> I'm happy to resurrect my SHA-256 hashed mailmap series if we're
>> all willing to agree to not implement trivial decoding features.
> I'd think you'd want to be really clear about what that forward promise
> would entail. E.g. I've sometimes wanted a way for "git log" to report
> when it munges commits due to adding notes, re-encoding the data etc. If
> someone submits that sort of feature should it always explicitly leave
> out mailmap-related rewrites?
>
> And even if it does, who do we think we're really helping in the end,
> given the trivial way you could get that with an external "diff" with
> the one-liner above?

I think the most important thing here is that the mailmap should not 
allow for even-more-trivial ways to discover old names than currently 
already exist. I've thought more about what you said, Ævar, and now I'm 
wary of a mailmap implementation that would entail having my old and new 
information next to each other, even if encoded (doesn't matter if it's 
URL-encoded or base64-encoded), because I think it's likely some 
external data mining tool will decode the address and place them next to 
each other, so that if you search for the email address in a search 
engine you'll also see the other address. I think a hash encoding will 
prevent these automated miners from doing that, since reversing a hash 
is too much effort for an untargeted attack (right? if you disagree, how 
about a salted hash?).
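
To illustrate why an encoded entry is easy for an automated tool to
reverse, here is a minimal Python sketch. It assumes "URL encoding"
means percent-encoding every byte of the address; over_encode is a
hypothetical helper, not anything Git provides.

```python
from urllib.parse import unquote


def over_encode(text: str) -> str:
    """Percent-encode every byte, including plain ASCII letters.

    Hypothetical helper for illustration only.
    """
    return "".join(f"%{byte:02X}" for byte in text.encode("utf-8"))


if __name__ == "__main__":
    encoded = over_encode("jane.doe@example.com")
    print(encoded)
    # A single standard-library call undoes the encoding.
    print(unquote(encoded))
```

The encoded form is not readable at a glance, but any tool that knows
the convention recovers the address without needing extra information.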

Either way, I think any mailmap-based solution will allow the old and 
new name to be linked to each other by an adversary, as you showed with 
your neat one-liner. However, I think a (salted?) hash in the mailmap 
will be sufficient for casual obfuscation where harassment is unlikely, 
but the user wants to prevent accidental disclosure or plain linkage.

>> I also have an alternate proposal which I pitched to some folks at Git
>> Merge and which I just finished writing up that basically moves personal
>> names and emails out of commits, replacing them with opaque identifiers,
>> and using a constantly squashed mailmap commit in a special ref to store
>> the mapping.  This doesn't address changing identities in existing
>> commits, which as we've seen are nearly impossible to fix, but it does
>> address new ones.  I've sent it out at
>> https://lore.kernel.org/git/20220919145231.48245-1-sandals@crustytoothpaste.net/.
> As I understand the difference in this scenario a hypothetical future
> repo's Y commit's authorship would have been opaque in the first place
> using this mechanism, and via your "refs/mailmap" you'd have mapped
> Y=Bob.
>
> You then make a future X commit, and map X=Alice, and have a .mailmap
> entry which mapped Y=X, but that entry would refer to the opaque value.
>
> That certainly changes things in a fundamental way, and goes most or all
> of the way to mitigating what I've been pointing out as a flaw in these
> proposals.
>
> I'd still be very much on the fence about whether we'd ever want to
> recommend that to someone concerned with "harassment" and the like (as
> opposed to a milder social preference), as all it would take to get to
> that point is someone having a copy of the older "refs/mailmap" to
> unmask the previous "Y".

I first want to say that I really like your proposal, Brian! I didn't 
think this subject would get the attention it did, but I'm happy it's 
being picked up the way it is, and to see this lively discussion going 
on between y'all!

And Ævar, you're right that having an older copy would allow one to 
discover a mapping from the old to the new name. But this will happen in 
any way we can conceivably implement this because the adversary can 
always keep an old copy of the entire repo, clone the new one, and 
compare the two logs. (You can probably come up with a neat one-liner, 
but that's beside the point ;-).) I think that the most appropriate 
threat model here is to assume that everyone who has accessed the repo 
before the name change will notice the name change and will be able to 
create a mapping. Instead, our goal should be to create a system that 
ensures that people who first access the repo after the name change are 
unable to find the old name at all. I think Brian's proposal achieves 
this. This is analogous to the real world where people who knew me 
before my transition will probably never (completely) forget my old 
name, and it's useless to try to make that happen, but at least I can 
prevent new people I meet from finding out the old name.

- Florine



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Wildcards in mailmap to hide transgender people's deadnames
  2022-09-20 10:23         ` Ævar Arnfjörð Bjarmason
  2022-09-20 14:58           ` Florine W. Dekker
@ 2022-09-21 16:42           ` Junio C Hamano
  2022-09-26  9:14             ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 14+ messages in thread
From: Junio C Hamano @ 2022-09-21 16:42 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: brian m. carlson, Florine W. Dekker, René Scharfe, git

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> I think it would be irresponsible of us to provide a feature that looks
> as though it can in any way mitigate those concerns.
>
> If you're someone that's worried about being harassed if someone makes
> the link from your previous identity Y to your current identity X where
> you already have Y as part of a public git history, the right answer is
> to not submit a change to the .mailmap to explicitly connect the two.

While I agree with the sentiment "You are in control if your three
names appear to refer to the same person" (and "On the Internet
nobody knows you're a dog"), I wish the world were so black and
white.

Many people change their names over the course of their life, and
some do not want the linkage to their past revealed.  Many of them
have nothing to be ashamed of themselves but do so due to risk of
discrimination, while some of them may do so to hide inconvenient
facts about their past.  While I have no sympathy to the latter, I
do not think it is unreasonable for the folks in the former camp to
also want recognition for the achievement made under their old as
well as their current identity.  And "pretend you have nothing to do
with that identity you used in the past life" goes directly against
the idea of taking credit for what you did in the past.

As the expertise you demonstrated under your old name will not
help others find you as an expert in an area, until your new name
starts being associated with your newly earned recognition, it is
also a loss for the development community.

I dunno.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Wildcards in mailmap to hide transgender people's deadnames
  2022-09-21 16:42           ` Junio C Hamano
@ 2022-09-26  9:14             ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 14+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-09-26  9:14 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: brian m. carlson, Florine W. Dekker, René Scharfe, git


On Wed, Sep 21 2022, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>
>> I think it would be irresponsible of us to provide a feature that looks
>> as though it can in any way mitigate those concerns.
>>
>> If you're someone that's worried about being harassed if someone makes
>> the link from your previous identity Y to your current identity X where
>> you already have Y as part of a public git history, the right answer is
>> to not submit a change to the .mailmap to explicitly connect the two.
>
> While I agree with the sentiment "You are in control if your three
> names appear to refer to the same person" (and "On the Internet
> nobody knows you're a dog"), I wish the world were so black and
> white.
>
> Many people change their names over the course of their life, and
> some do not want the linkage to their past revealed.  Many of them
> have nothing to be ashamed of themselves but do so due to risk of
> discrimination, while some of them may do so to hide inconvenient
> facts about their past.  While I have no sympathy to the latter, [...]

Just on the "no sympathy for the latter" I just want to point out that
this topic is a subject of fundamental disagreement between how EU & US
legislature views this, re the recent "right to be forgotten"
developments in EU law as they relate to "directory" searches[1].

> I do not think it is unreasonable for the folks in the former camp to
> also want recognition for the achievement made under their old as
> well as their current identity.  And "pretend you have nothing to do
> with that identity you used in the past life" goes directly against
> the idea of taking credit for what you did in the past.
> [...]
> As the expertise you demonstrated under your old name will not
> help others find you as an expert in an area, until your new name
> starts being associated with your newly earned recognition, it is
> also a loss for the development community.

Indeed, I think that most people who change their name for whatever
reason on a project they contributed to before & after that change will
probably want a .mailmap entry.

I was narrowly responding to the "harassment" aspect of this. I.e. that
it's a fundamental aspect of how our object graph & git is currently
implemented that you'll be giving someone "both names" as it were.

I think that if some users want their name not to be trivially
discoverable by e.g. grepping we could cater to that & other use-cases
with something like optional URI encoding.

But I think it's equally important that we don't present something that
looks like a strong password hash to a novice user (the sha256-ing), but
which due to the party reading the data already having "both names" can
be trivially brute-forced in the time it takes to run a "git log --all
--use-mailmap", or equivalent.
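
A minimal Python sketch of that brute force, under stated assumptions:
the hashing scheme below (SHA-256 of the lowercased address) is
hypothetical and is not the format of any actual patch series, and
hash_identity and unmask are illustrative names.

```python
import hashlib


def hash_identity(email: str) -> str:
    # Hypothetical scheme: SHA-256 over the lowercased address.
    return hashlib.sha256(email.lower().encode("utf-8")).hexdigest()


def unmask(hashed_entries, candidate_emails):
    """Map each hashed mailmap entry back to a candidate address.

    The candidates are the addresses already visible in the history, so
    the search space is only as large as the list of known authors.
    """
    lookup = {hash_identity(email): email for email in candidate_emails}
    return {h: lookup[h] for h in hashed_entries if h in lookup}


if __name__ == "__main__":
    authors_in_history = ["john.doe@example.com", "jane.doe@example.com"]
    hashed = [hash_identity("jane.doe@example.com")]
    print(unmask(hashed, authors_in_history))
```

Because every candidate is already present in the commit data, the work
is one hash per known author rather than a search over all addresses.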

I also think it's important that we keep .mailmap something where we're
explicitly giving *other people* the "both names", and for ourselves &
third-party systems make it easy to use the data.

I pointed out in previous discussions how e.g. the sha-256 proposal
would require rev walking & "brute forcing" for some workflows, such as
scraping a .mailmap to insert into a relational DB, in order to make the
same association there (and I've implemented a system like this at a
past job).

I wonder to what extent concerns about the deadname use-case would be
mitigated if we added support for a .git/info/mailmap, similar to
.git/info/exclude. I.e. we now have mailmap.file, but not a way to
supplement an in-repo .mailmap. This would help the users who want to
avoid seeing their own "deadname", but who would also like to avoid
making the association to their new name part of the public record.

1. https://en.wikipedia.org/wiki/Right_to_be_forgotten

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2022-09-26  9:34 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-13 21:53 Wildcards in mailmap to hide transgender people's deadnames Florine W. Dekker
2022-09-14  7:40 ` René Scharfe
2022-09-14  9:07   ` Florine W. Dekker
2022-09-19 11:20     ` Ævar Arnfjörð Bjarmason
2022-09-19 12:27       ` rsbecker
2022-09-19 15:19       ` brian m. carlson
2022-09-19 16:31         ` Junio C Hamano
2022-09-19 17:26           ` brian m. carlson
2022-09-20 10:23         ` Ævar Arnfjörð Bjarmason
2022-09-20 14:58           ` Florine W. Dekker
2022-09-21 16:42           ` Junio C Hamano
2022-09-26  9:14             ` Ævar Arnfjörð Bjarmason
     [not found]   ` <CANgJU+Wt_yjv1phwiSUtLLZ=JKA9LvS=0UcBYNu+nxdJ_7d_Ew@mail.gmail.com>
2022-09-16 16:59     ` Florine W. Dekker
2022-09-20  0:32       ` brian m. carlson
