git@vger.kernel.org mailing list mirror (one of many)
From: "Philip Oakley" <philipoakley@iee.org>
To: "Jeff King" <peff@peff.net>, "Junio C Hamano" <gitster@pobox.com>
Cc: <git@vger.kernel.org>, "Duy Nguyen" <pclouds@gmail.com>
Subject: Re: [PATCH v2] teach fast-export an --anonymize option
Date: Fri, 22 Aug 2014 19:39:59 +0100
Message-ID: <9E69746EFF0048FF84631A9794206566@PhilipOakley>
In-Reply-To: 20140821232100.GA27849@peff.net

From: "Jeff King" <peff@peff.net>: Friday, August 22, 2014 12:21 AM
> On Thu, Aug 21, 2014 at 06:49:10PM -0400, Jeff King wrote:
>
>> The few things I don't anonymize are:
>>
>>   1. ref prefixes. We see the same distribution of refs/heads vs
>>      refs/tags, etc.
>>
>>   2. refs/heads/master is left untouched, for convenience (and because
>>      it's not really a secret). The implementation is lazy, though, and
>>      would leave "refs/heads/master-supersecret", as well. I can tighten
>>      that if we really want to be careful.
>>
>>   3. gitlinks are left untouched, since sha1s cannot be reversed. This
>>      could leak some information (if your private repo points to a
>>      public, I can find out you have it as submodule). I doubt it
>>      matters, but we can also scramble the sha1s.
>
> Here's a re-roll that addresses the latter two. I don't think any are a
> big deal, but it's much easier to say "it's handled" than try to figure
> out whether and when it's important.
>
> This also includes the documentation update I sent earlier. The
> interdiff is a bit noisy, as I also converted the anonymize_mem function
> to take void pointers (since it doesn't know or care what it's storing,
> and this makes storing unsigned chars for sha1s easier).
>

Just a bit of bikeshedding for future improvements.

The .gitignore file is another potential problem area for users, and
debugging may benefit from it not being anonymised when problems strike.
For example, there's a current thread on the git-users list,
https://groups.google.com/forum/#!topic/git-users/JJFIEsI5HRQ, about "git
clean vs git status re .gitignore". That in turn raises the question of
whether to retain file extensions/suffixes (.txt, .o, .c, etc.).

I've had a similar problem with an overzealous file compare routine,
where the same trade-off between hiding too much and too little was an
issue.

One thought is that the user could optionally select the number of
initial characters retained from each filename, and likewise opt to
retain the file extension and possibly the directory names. The full
.gitignore would then still work in most cases, and the sort order would
be preserved (as far as the retained characters determine it).
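
To make the idea concrete, here is a rough sketch in Python, purely for
illustration. Nothing here is part of the patch: make_anonymizer,
keep_prefix and keep_ext are hypothetical names, and the "pathN" token
format is only an assumption about what a numbered replacement might
look like.

```python
import posixpath


def make_anonymizer(keep_prefix=0, keep_ext=False):
    """Return a function that anonymizes slash-separated paths.

    Hypothetical sketch: keep_prefix is the number of leading characters
    retained from each path component, keep_ext retains the extension.
    """
    seen = {}  # original component -> anonymized component

    def anonymize_component(name):
        # Reuse the earlier token so the same name always maps the same way.
        if name in seen:
            return seen[name]
        stem, ext = posixpath.splitext(name)
        token = stem[:keep_prefix] + "path" + str(len(seen))
        if keep_ext:
            token += ext
        seen[name] = token
        return token

    def anonymize_path(path):
        return "/".join(anonymize_component(c) for c in path.split("/"))

    return anonymize_path


anon = make_anonymizer(keep_prefix=2, keep_ext=True)
print(anon("src/main.c"))  # srpath0/mapath1.c
print(anon("src/util.c"))  # srpath0/utpath2.c
```

With keep_ext set, a pattern such as "*.c" in .gitignore would still
match the anonymized names, and the retained prefix keeps names sorted
by their first characters. The obvious cost is that every retained
character leaks a little more of the original name.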

All things for future improvers to consider.

Philip

> -- >8 --
> Subject: teach fast-export an --anonymize option
>
> Sometimes users want to report a bug they experience on
> their repository, but they are not at liberty to share the
> contents of the repository. It would be useful if they could
> produce a repository that has a similar shape to its history
> and tree, but without leaking any information. This
> "anonymized" repository could then be shared with developers
> (assuming it still replicates the original problem).
>
> This patch implements an "--anonymize" option to
> fast-export, which generates a stream that can recreate such
> a repository. Producing a single stream makes it easy for
> the caller to verify that they are not leaking any useful
> information. You can get an overview of what will be shared
> by running a command like:
>
>  git fast-export --anonymize --all |
>  perl -pe 's/\d+/X/g' |
>  sort -u |
>  less
>
> which will show every unique line we generate, modulo any
> numbers (each anonymized token is assigned a number, like
> "User 0", and we replace it consistently in the output).
>
> In addition to anonymizing, this produces test cases that
> are relatively small (compared to the original repository)
> and fast to generate (compared to using filter-branch, or
> modifying the output of fast-export yourself). Here are
> numbers for git.git:
>
>  $ time git fast-export --anonymize --all \
>         --tag-of-filtered-object=drop >output
>  real    0m2.883s
>  user    0m2.828s
>  sys     0m0.052s
>
>  $ gzip output
>  $ ls -lh output.gz | awk '{print $5}'
>  2.9M
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
[...] 


Thread overview: 21+ messages
2014-08-21  7:01 [PATCH] teach fast-export an --anonymize option Jeff King
2014-08-21 20:15 ` Junio C Hamano
2014-08-21 22:41   ` Jeff King
2014-08-21 21:57 ` Junio C Hamano
2014-08-21 22:49   ` Jeff King
2014-08-21 23:21     ` [PATCH v2] " Jeff King
2014-08-22 13:06       ` Duy Nguyen
2014-08-22 18:39       ` Philip Oakley [this message]
2014-08-23  6:19         ` Jeff King
2014-08-27 16:01       ` Junio C Hamano
2014-08-27 16:58         ` Jeff King
2014-08-27 17:01           ` [PATCH v3] " Jeff King
2014-08-28 10:30             ` Duy Nguyen
2014-08-28 12:32               ` Jeff King
2014-08-28 16:46                 ` Ramsay Jones
2014-08-28 18:43                   ` Junio C Hamano
2014-08-28 18:50                   ` Jeff King
2014-08-28 18:11                 ` Junio C Hamano
2014-08-28 19:04                   ` Jeff King
2014-08-31 10:34                 ` Eric Sunshine
2014-08-31 15:53                   ` Jeff King
