From: Junio C Hamano <gitster@pobox.com>
To: Jeff King <peff@peff.net>
Cc: Duy Nguyen <pclouds@gmail.com>, Git Mailing List <git@vger.kernel.org>
Subject: Re: [PATCH v3] teach fast-export an --anonymize option
Date: Thu, 28 Aug 2014 11:11:47 -0700
Message-ID: <xmqqwq9sa3h8.fsf@gitster.dls.corp.google.com>
In-Reply-To: <20140828123257.GA18642@peff.net> (Jeff King's message of "Thu, 28 Aug 2014 08:32:58 -0400")

Jeff King <peff@peff.net> writes:

> Subject: docs/fast-export: explain --anonymize more completely
>
> The original commit made mention of this option, but not why
> one might want it or how they might use it. Let's try to be
> a little more thorough, and also explain how to confirm that
> the output really is anonymous.
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  Documentation/git-fast-export.txt | 63 ++++++++++++++++++++++++++++++++++++---
>  1 file changed, 59 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
> index 52831fa..dbe9a46 100644
> --- a/Documentation/git-fast-export.txt
> +++ b/Documentation/git-fast-export.txt
> @@ -106,10 +106,9 @@ marks the same across runs.
>  	different from the commit's first parent).
>  
>  --anonymize::
> -	Replace all refnames, paths, blob contents, commit and tag
> -	messages, names, and email addresses in the output with
> -	anonymized data, while still retaining the shape of history and
> -	of the stored tree.
> +	Anonymize the contents of the repository while still retaining
> +	the shape of the history and stored tree.  See the section on
> +	`ANONYMIZING` below.

Technically s/tree/trees/, I would think.  For a repository with
multiple branches, perhaps s/history/histories/, too, but I would
not insist on that ;-).

> +ANONYMIZING
> +-----------
> +
> +If the `--anonymize` option is given, git will attempt to remove all
> +identifying information from the repository while still retaining enough
> +of the original tree and history patterns to reproduce some bugs. The
> +goal is that a git bug which is found on a private repository will
> +persist in the anonymized repository, and the latter can be shared with
> +git developers to help solve the bug.
> +
> +With this option, git will replace all refnames, paths, blob contents,
> +commit and tag messages, names, and email addresses in the output with
> +anonymized data.  Two instances of the same string will be replaced
> +equivalently (e.g., two commits with the same author will have the same
> +anonymized author in the output, but bear no resemblance to the original
> +author string). The relationship between commits, branches, and tags is
> +retained, as well as the commit timestamps (but the commit messages and
> +refnames bear no resemblance to the originals). The relative makeup of
> +the tree is retained (e.g., if you have a root tree with 10 files and 3
> +trees, so will the output), but their names and the contents of the
> +files will be replaced.

While I do not think I or anybody who would ask other people to use
this option would be confused, the phrase "the same string" may risk
unnecessary worries from those who are asked to trust this option.

I am not yet convinced that it is unlikely for the reader to read
the above and imagine that the anonymiser may go word by word,
replacing "the same string" with the same anonymised gibberish
(which would be susceptible to old-school cryptanalysis
techniques).

Among the ones listed, refnames, blob contents, commit messages
and tag messages are converted as a single "string" and I wish I
could think of phrasing to stress that point somehow.

Each path component in paths is converted as a single "string", so
we can tell from two anonymised paths whether they refer to blobs in
the same directory in the original.  This is a good thing, of course,
but it shows that among those listed in "refnames, paths, blob
contents, ..." in a flat sentence, some are treated as a single
token for replacement but not others, and it is hard to tell for a
reader which one is which, unless the reader knows the internals of
Git, i.e. what kind of things we as the debuggers-of-Git would want
to preserve.
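
The difference in replacement unit described above can be sketched
in a few lines of Python.  This is only an illustration under stated
assumptions, not how fast-export implements it: the `Anonymizer`
class, the `anonymize_whole()` and `anonymize_path()` helpers, and
the "message$n" / "path$n" output shapes are all hypothetical names
made up for this example.

```python
from itertools import count


class Anonymizer:
    """Hypothetical helper: map each distinct token to a stable,
    meaningless replacement such as "path0", "path1", ..."""

    def __init__(self, prefix):
        self.prefix = prefix
        self.seen = {}
        self.counter = count()

    def token(self, original):
        # The same input always yields the same output; a new input
        # gets the next unused number.
        if original not in self.seen:
            self.seen[original] = f"{self.prefix}{next(self.counter)}"
        return self.seen[original]


def anonymize_whole(anon, text):
    # Refnames, blob contents and messages: the entire value is one
    # token, so two messages sharing words share nothing afterwards.
    return anon.token(text)


def anonymize_path(anon, path):
    # Paths: each component is its own token, so a shared leading
    # directory stays recognisable as shared.
    return "/".join(anon.token(part) for part in path.split("/"))


messages = Anonymizer("message")
paths = Anonymizer("path")

msg_a = anonymize_whole(messages, "fix the bug in the parser")
msg_b = anonymize_whole(messages, "fix the bug in the lexer")
path_a = anonymize_path(paths, "src/parser/lex.c")
path_b = anonymize_path(paths, "src/parser/yacc.c")

print(msg_a, msg_b)
print(path_a, path_b)
```

In this sketch the two messages come out as unrelated tokens even
though they share most of their words, while the two paths keep a
common two-component prefix and differ only in the last component.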

Isn't the unit for human identity anonymisation even more coarse?
If it is not, should it be?

In other words, do "Junio C Hamano <junio@pobox.com>" and "Junio C
Hamano <gitster@pobox.com>" map to one gibberish human readable name
with two gibberish e-mail addresses, or 2 "User$n <user$n>"?  Is the
fact that this organization seems to allocate two e-mails to each
developer something this organization may want to hide from the
public (and something we as the Git debuggers would not benefit from
knowing)?
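
The two possible outcomes in that question can be made concrete
with a small Python sketch.  It does not claim to show what the
patch does; `make_mapper()` and `split_ident()` are hypothetical
helpers, and the "User$n", "Name$n" and "Email$n" shapes are
assumptions chosen only to contrast the two granularities.

```python
def make_mapper(prefix):
    """Hypothetical helper: return a function that maps each distinct
    string to "<prefix>0", "<prefix>1", ... in order of appearance."""
    seen = {}

    def lookup(original):
        if original not in seen:
            seen[original] = f"{prefix}{len(seen)}"
        return seen[original]

    return lookup


def split_ident(ident):
    # "Name <email>" -> ("Name", "email")
    name, _, rest = ident.partition(" <")
    return name, rest.rstrip(">")


idents = [
    "Junio C Hamano <junio@pobox.com>",
    "Junio C Hamano <gitster@pobox.com>",
]

# Coarse unit: the whole ident string is one token, so the two
# idents become two unrelated users.
whole = make_mapper("User")
coarse = [whole(ident) for ident in idents]

# Finer unit: name and e-mail are mapped separately, so the output
# still reveals one name with two addresses.
names = make_mapper("Name")
emails = make_mapper("Email")
fine = [(names(n), emails(e)) for n, e in map(split_ident, idents)]

print(coarse)
print(fine)
```

With the coarse unit, nothing in the output links the two idents;
with the finer unit, a reader can still see that one person used two
addresses, which is exactly the detail an organization may or may
not want to expose.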
