[PATCH] teach fast-export an --anonymize option

* [PATCH] teach fast-export an --anonymize option
@ 2014-08-21  7:01 Jeff King
  2014-08-21 20:15 ` Junio C Hamano
  2014-08-21 21:57 ` Junio C Hamano
  0 siblings, 2 replies; 21+ messages in thread
From: Jeff King @ 2014-08-21  7:01 UTC
  To: git; +Cc: Duy Nguyen

Sometimes users want to report a bug they experience on
their repository, but they are not at liberty to share the
contents of the repository. It would be useful if they could
produce a repository that has a similar shape to its history
and tree, but without leaking any information. This
"anonymized" repository could then be shared with developers
(assuming it still replicates the original problem).

This patch implements an "--anonymize" option to
fast-export, which generates a stream that can recreate such
a repository. Producing a single stream makes it easy for
the caller to verify that they are not leaking any useful
information. You can get an overview of what will be shared
by running a command like:

  git fast-export --anonymize --all |
  perl -pe 's/\d+/X/g' |
  sort -u |
  less

which will show every unique line we generate, modulo any
numbers (each anonymized token is assigned a number, like
"User 0", and we replace it consistently in the output).

In addition to anonymizing, this produces test cases that
are relatively small (compared to the original repository)
and fast to generate (compared to using filter-branch, or
modifying the output of fast-export yourself). Here are
numbers for git.git:

  $ time git fast-export --anonymize --all \
         --tag-of-filtered-object=drop >output
  real    0m2.883s
  user    0m2.828s
  sys     0m0.052s

  $ gzip output
  $ ls -lh output.gz | awk '{print $5}'
  2.9M

Signed-off-by: Jeff King <peff@peff.net>
---
I haven't used this for anything real yet. It was a fun exercise, and I
do think it should work in practice. I'd be curious to hear a success
report of somebody actually debugging something with this.

In theory we could anonymize in a reversible way (e.g., by encrypting
each token with a key, and then not sharing the key), but it's a lot
more complicated and I don't think it buys us much. The one thing I'd
really like is to be able to test packing on an anonymized repository,
but two objects which delta well against each other will not delta well
once encrypted (unless you use something weak like ECB mode, in which
case the contents are not really as anonymized as you would hope).

I think most interesting cases involve things like commit traversal, and
that should still work here, even with made-up contents. Some weird
cases involving trees would not work if they depend on the filenames
(e.g., things that impact sort order). We could allow finer-grained
control, like "--anonymize=commits,blobs" if somebody was OK sharing
their filenames. I did not go that far here, but it should be pretty
easy to build on top.

 Documentation/git-fast-export.txt |   6 +
 builtin/fast-export.c             | 280 ++++++++++++++++++++++++++++++++++++--
 t/t9351-fast-export-anonymize.sh  | 117 ++++++++++++++++
 3 files changed, 392 insertions(+), 11 deletions(-)
 create mode 100755 t/t9351-fast-export-anonymize.sh

diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
index 221506b..0ec7cad 100644
--- a/Documentation/git-fast-export.txt
+++ b/Documentation/git-fast-export.txt
@@ -105,6 +105,12 @@ marks the same across runs.
 	in the commit (as opposed to just listing the files which are
 	different from the commit's first parent).
 
+--anonymize::
+	Replace all paths, blob contents, commit and tag messages,
+	names, and email addresses in the output with anonymized data,
+	while still retaining the shape of history and of the stored
+	tree.
+
 --refspec::
 	Apply the specified refspec to each ref exported. Multiple of them can
 	be specified.
diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 92b4624..acd2838 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -18,6 +18,7 @@
 #include "parse-options.h"
 #include "quote.h"
 #include "remote.h"
+#include "blob.h"
 
 static const char *fast_export_usage[] = {
 	N_("git fast-export [rev-list-opts]"),
@@ -34,6 +35,7 @@ static int full_tree;
 static struct string_list extra_refs = STRING_LIST_INIT_NODUP;
 static struct refspec *refspecs;
 static int refspecs_nr;
+static int anonymize;
 
 static int parse_opt_signed_tag_mode(const struct option *opt,
 				     const char *arg, int unset)
@@ -81,6 +83,76 @@ static int has_unshown_parent(struct commit *commit)
 	return 0;
 }
 
+struct anonymized_entry {
+	struct hashmap_entry hash;
+	const char *orig;
+	size_t orig_len;
+	const char *anon;
+	size_t anon_len;
+};
+
+static int anonymized_entry_cmp(const void *va, const void *vb,
+				const void *data)
+{
+	const struct anonymized_entry *a = va, *b = vb;
+	return a->orig_len != b->orig_len ||
+		memcmp(a->orig, b->orig, a->orig_len);
+}
+
+/*
+ * Basically keep a cache of X->Y so that we can repeatedly replace
+ * the same anonymized string with another. The actual generation
+ * is farmed out to the generate function.
+ */
+static const char *anonymize_mem(struct hashmap *map,
+				 char *(*generate)(const char *, size_t *),
+				 const char *orig, size_t *len)
+{
+	struct anonymized_entry key, *ret;
+
+	if (!map->cmpfn)
+		hashmap_init(map, anonymized_entry_cmp, 0);
+
+	hashmap_entry_init(&key, memhash(orig, *len));
+	key.orig = orig;
+	key.orig_len = *len;
+	ret = hashmap_get(map, &key, NULL);
+
+	if (!ret) {
+		ret = xmalloc(sizeof(*ret));
+		hashmap_entry_init(&ret->hash, key.hash.hash);
+		ret->orig = xstrdup(orig);
+		ret->orig_len = *len;
+		ret->anon = generate(orig, len);
+		ret->anon_len = *len;
+		hashmap_put(map, ret);
+	}
+
+	*len = ret->anon_len;
+	return ret->anon;
+}
+
+/*
+ * We anonymize each component of a path individually,
+ * so that paths a/b and a/c will share a common root.
+ * The paths are cached via anonymize_mem so that repeated
+ * lookups for "a" will yield the same value.
+ */
+static void anonymize_path(struct strbuf *out, const char *path,
+			   struct hashmap *map,
+			   char *(*generate)(const char *, size_t *))
+{
+	while (*path) {
+		const char *end_of_component = strchrnul(path, '/');
+		size_t len = end_of_component - path;
+		const char *c = anonymize_mem(map, generate, path, &len);
+		strbuf_add(out, c, len);
+		path = end_of_component;
+		if (*path)
+			strbuf_addch(out, *path++);
+	}
+}
+
 /* Since intptr_t is C99, we do not use it here */
 static inline uint32_t *mark_to_ptr(uint32_t mark)
 {
@@ -119,6 +191,26 @@ static void show_progress(void)
 		printf("progress %d objects\n", counter);
 }
 
+/*
+ * Ideally we would want some transformation of the blob data here
+ * that is unreversible, but would still be the same size and have
+ * the same data relationship to other blobs (so that we get the same
+ * delta and packing behavior as the original). But the first and last
+ * requirements there are probably mutually exclusive, so let's take
+ * the easy way out for now, and just generate arbitrary content.
+ *
+ * There's no need to cache this result with anonymize_mem, since
+ * we already handle blob content caching with marks.
+ */
+static char *anonymize_blob(unsigned long *size)
+{
+	static int counter;
+	struct strbuf out = STRBUF_INIT;
+	strbuf_addf(&out, "anonymous blob %d", counter++);
+	*size = out.len;
+	return strbuf_detach(&out, NULL);
+}
+
 static void export_blob(const unsigned char *sha1)
 {
 	unsigned long size;
@@ -137,12 +229,19 @@ static void export_blob(const unsigned char *sha1)
 	if (object && object->flags & SHOWN)
 		return;
 
-	buf = read_sha1_file(sha1, &type, &size);
-	if (!buf)
-		die ("Could not read blob %s", sha1_to_hex(sha1));
-	if (check_sha1_signature(sha1, buf, size, typename(type)) < 0)
-		die("sha1 mismatch in blob %s", sha1_to_hex(sha1));
-	object = parse_object_buffer(sha1, type, size, buf, &eaten);
+	if (anonymize) {
+		buf = anonymize_blob(&size);
+		object = (struct object *)lookup_blob(sha1);
+		eaten = 0;
+	} else {
+		buf = read_sha1_file(sha1, &type, &size);
+		if (!buf)
+			die ("Could not read blob %s", sha1_to_hex(sha1));
+		if (check_sha1_signature(sha1, buf, size, typename(type)) < 0)
+			die("sha1 mismatch in blob %s", sha1_to_hex(sha1));
+		object = parse_object_buffer(sha1, type, size, buf, &eaten);
+	}
+
 	if (!object)
 		die("Could not read blob %s", sha1_to_hex(sha1));
 
@@ -190,7 +289,7 @@ static int depth_first(const void *a_, const void *b_)
 	return (a->status == 'R') - (b->status == 'R');
 }
 
-static void print_path(const char *path)
+static void print_path_1(const char *path)
 {
 	int need_quote = quote_c_style(path, NULL, NULL, 0);
 	if (need_quote)
@@ -201,6 +300,28 @@ static void print_path(const char *path)
 		printf("%s", path);
 }
 
+static char *anonymize_path_component(const char *path, size_t *len)
+{
+	static int counter;
+	struct strbuf out = STRBUF_INIT;
+	strbuf_addf(&out, "path%d", counter++);
+	return strbuf_detach(&out, len);
+}
+
+static void print_path(const char *path)
+{
+	if (!anonymize)
+		print_path_1(path);
+	else {
+		static struct hashmap paths;
+		static struct strbuf anon = STRBUF_INIT;
+
+		anonymize_path(&anon, path, &paths, anonymize_path_component);
+		print_path_1(anon.buf);
+		strbuf_reset(&anon);
+	}
+}
+
 static void show_filemodify(struct diff_queue_struct *q,
 			    struct diff_options *options, void *data)
 {
@@ -241,7 +362,9 @@ static void show_filemodify(struct diff_queue_struct *q,
 		case DIFF_STATUS_ADDED:
 			/*
 			 * Links refer to objects in another repositories;
-			 * output the SHA-1 verbatim.
+			 * output the SHA-1 verbatim. We don't anonymize
+			 * these at all; they are not reversible and
+			 * probably not a big deal to share.
 			 */
 			if (no_data || S_ISGITLINK(spec->mode))
 				printf("M %06o %s ", spec->mode,
@@ -279,6 +402,109 @@ static const char *find_encoding(const char *begin, const char *end)
 	return bol;
 }
 
+static char *anonymize_ref_component(const char *old, size_t *len)
+{
+	static int counter;
+	struct strbuf out = STRBUF_INIT;
+	strbuf_addf(&out, "ref%d", counter++);
+	return strbuf_detach(&out, len);
+}
+
+static const char *anonymize_refname(const char *refname)
+{
+	/*
+	 * If any of these prefixes is found, we will leave it intact
+	 * so that tags remain tags and so forth. We also leave "master"
+	 * as a special case, since it does not reveal anything interesting.
+	 */
+	static const char *prefixes[] = {
+		"refs/heads/master",
+		"refs/heads/",
+		"refs/tags/",
+		"refs/remotes/",
+		"refs/"
+	};
+	static struct hashmap refs;
+	static struct strbuf anon = STRBUF_INIT;
+	int i;
+
+	strbuf_reset(&anon);
+	for (i = 0; i < ARRAY_SIZE(prefixes); i++) {
+		if (skip_prefix(refname, prefixes[i], &refname)) {
+			strbuf_addstr(&anon, prefixes[i]);
+			break;
+		}
+	}
+
+	anonymize_path(&anon, refname, &refs, anonymize_ref_component);
+	return anon.buf;
+}
+
+/*
+ * We do not even bother to cache commit messages, as they are unlikely
+ * to be repeated verbatim, and it is not that interesting when they are.
+ */
+static char *anonymize_commit_message(const char *old)
+{
+	static int counter;
+	return xstrfmt("subject %d\n\nbody\n", counter++);
+}
+
+static struct hashmap idents;
+static char *anonymize_ident(const char *old, size_t *len)
+{
+	static int counter;
+	struct strbuf out = STRBUF_INIT;
+	strbuf_addf(&out, "User %d <user%d@example.com>", counter, counter);
+	counter++;
+	return strbuf_detach(&out, len);
+}
+
+/*
+ * Our strategy here is to anonymize the names and email addresses,
+ * but keep timestamps intact, as they influence things like traversal
+ * order (and by themselves should not be too revealing).
+ */
+static void anonymize_ident_line(const char **beg, const char **end)
+{
+	static struct strbuf buffers[] = { STRBUF_INIT, STRBUF_INIT };
+	static unsigned which_buffer;
+
+	struct strbuf *out;
+	struct ident_split split;
+	const char *end_of_header;
+
+	out = &buffers[which_buffer++];
+	which_buffer %= ARRAY_SIZE(buffers);
+	strbuf_reset(out);
+
+	/* skip "committer", "author", "tagger", etc */
+	end_of_header = strchr(*beg, ' ');
+	if (!end_of_header)
+		die("BUG: malformed line fed to anonymize_ident_line: %.*s",
+		    (int)(*end - *beg), *beg);
+	end_of_header++;
+	strbuf_add(out, *beg, end_of_header - *beg);
+
+	if (!split_ident_line(&split, end_of_header, *end - end_of_header) &&
+	    split.date_begin) {
+		const char *ident;
+		size_t len;
+
+		len = split.mail_end - split.name_begin;
+		ident = anonymize_mem(&idents, anonymize_ident,
+				      split.name_begin, &len);
+		strbuf_add(out, ident, len);
+		strbuf_addch(out, ' ');
+		strbuf_add(out, split.date_begin, split.tz_end - split.date_begin);
+	} else {
+		strbuf_addstr(out, "Malformed Ident <malformed@example.com> 0 -0000");
+	}
+
+	*beg = out->buf;
+	*end = out->buf + out->len;
+}
+
 static void handle_commit(struct commit *commit, struct rev_info *rev)
 {
 	int saved_output_format = rev->diffopt.output_format;
@@ -287,6 +513,7 @@ static void handle_commit(struct commit *commit, struct rev_info *rev)
 	const char *encoding, *message;
 	char *reencoded = NULL;
 	struct commit_list *p;
+	const char *refname;
 	int i;
 
 	rev->diffopt.output_format = DIFF_FORMAT_CALLBACK;
@@ -326,13 +553,22 @@ static void handle_commit(struct commit *commit, struct rev_info *rev)
 		if (!S_ISGITLINK(diff_queued_diff.queue[i]->two->mode))
 			export_blob(diff_queued_diff.queue[i]->two->sha1);
 
+	refname = commit->util;
+	if (anonymize) {
+		refname = anonymize_refname(refname);
+		anonymize_ident_line(&committer, &committer_end);
+		anonymize_ident_line(&author, &author_end);
+	}
+
 	mark_next_object(&commit->object);
-	if (!is_encoding_utf8(encoding))
+	if (anonymize)
+		reencoded = anonymize_commit_message(message);
+	else if (!is_encoding_utf8(encoding))
 		reencoded = reencode_string(message, "UTF-8", encoding);
 	if (!commit->parents)
-		printf("reset %s\n", (const char*)commit->util);
+		printf("reset %s\n", refname);
 	printf("commit %s\nmark :%"PRIu32"\n%.*s\n%.*s\ndata %u\n%s",
-	       (const char *)commit->util, last_idnum,
+	       refname, last_idnum,
 	       (int)(author_end - author), author,
 	       (int)(committer_end - committer), committer,
 	       (unsigned)(reencoded
@@ -363,6 +599,14 @@ static void handle_commit(struct commit *commit, struct rev_info *rev)
 	show_progress();
 }
 
+static char *anonymize_tag(const char *old, size_t *len)
+{
+	static int counter;
+	struct strbuf out = STRBUF_INIT;
+	strbuf_addf(&out, "tag message %d", counter++);
+	return strbuf_detach(&out, len);
+}
+
 static void handle_tail(struct object_array *commits, struct rev_info *revs)
 {
 	struct commit *commit;
@@ -419,6 +663,17 @@ static void handle_tag(const char *name, struct tag *tag)
 	} else {
 		tagger++;
 		tagger_end = strchrnul(tagger, '\n');
+		if (anonymize)
+			anonymize_ident_line(&tagger, &tagger_end);
+	}
+
+	if (anonymize) {
+		name = anonymize_refname(name);
+		if (message) {
+			static struct hashmap tags;
+			message = anonymize_mem(&tags, anonymize_tag,
+						message, &message_size);
+		}
 	}
 
 	/* handle signed tags */
@@ -584,6 +839,8 @@ static void handle_tags_and_duplicates(void)
 			handle_tag(name, (struct tag *)object);
 			break;
 		case OBJ_COMMIT:
+			if (anonymize)
+				name = anonymize_refname(name);
 			/* create refs pointing to already seen commits */
 			commit = (struct commit *)object;
 			printf("reset %s\nfrom :%d\n\n", name,
@@ -719,6 +976,7 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix)
 		OPT_BOOL(0, "no-data", &no_data, N_("Skip output of blob data")),
 		OPT_STRING_LIST(0, "refspec", &refspecs_list, N_("refspec"),
 			     N_("Apply refspec to exported refs")),
+		OPT_BOOL(0, "anonymize", &anonymize, N_("anonymize output")),
 		OPT_END()
 	};
 
diff --git a/t/t9351-fast-export-anonymize.sh b/t/t9351-fast-export-anonymize.sh
new file mode 100755
index 0000000..f76ffe4
--- /dev/null
+++ b/t/t9351-fast-export-anonymize.sh
@@ -0,0 +1,117 @@
+#!/bin/sh
+
+test_description='basic tests for fast-export --anonymize'
+. ./test-lib.sh
+
+test_expect_success 'setup simple repo' '
+	test_commit base &&
+	test_commit foo &&
+	git checkout -b other HEAD^ &&
+	mkdir subdir &&
+	test_commit subdir/bar &&
+	test_commit subdir/xyzzy &&
+	git tag -m "annotated tag" mytag
+'
+
+test_expect_success 'export anonymized stream' '
+	git fast-export --anonymize --all >stream
+'
+
+# this also covers commit messages
+test_expect_success 'stream omits path names' '
+	! fgrep base stream &&
+	! fgrep foo stream &&
+	! fgrep subdir stream &&
+	! fgrep bar stream &&
+	! fgrep xyzzy stream
+'
+
+test_expect_success 'stream allows master as refname' '
+	fgrep master stream
+'
+
+test_expect_success 'stream omits other refnames' '
+	! fgrep other stream
+'
+
+test_expect_success 'stream omits identities' '
+	! fgrep "$GIT_COMMITTER_NAME" stream &&
+	! fgrep "$GIT_COMMITTER_EMAIL" stream &&
+	! fgrep "$GIT_AUTHOR_NAME" stream &&
+	! fgrep "$GIT_AUTHOR_EMAIL" stream
+'
+
+test_expect_success 'stream omits tag message' '
+	! fgrep "annotated tag" stream
+'
+
+# NOTE: we chdir to the new, anonymized repository
+# after this. All further tests should assume this.
+test_expect_success 'import stream to new repository' '
+	git init new &&
+	cd new &&
+	git fast-import <../stream
+'
+
+test_expect_success 'result has two branches' '
+	git for-each-ref --format="%(refname)" refs/heads >branches &&
+	test_line_count = 2 branches &&
+	other_branch=$(grep -v refs/heads/master branches)
+'
+
+test_expect_success 'repo has original shape' '
+	cat >expect <<-\EOF &&
+	> subject 3
+	> subject 2
+	< subject 1
+	- subject 0
+	EOF
+	git log --format="%m %s" --left-right --boundary \
+		master...$other_branch >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'root tree has original shape' '
+	cat >expect <<-\EOF &&
+	blob
+	tree
+	EOF
+	git ls-tree $other_branch >root &&
+	cut -d" " -f2 <root >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'paths in subdir ended up in one tree' '
+	cat >expect <<-\EOF &&
+	blob
+	blob
+	EOF
+	tree=$(grep tree root | cut -f2) &&
+	git ls-tree $other_branch:$tree >tree &&
+	cut -d" " -f2 <tree >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'tag points to branch tip' '
+	git rev-parse $other_branch >expect &&
+	git for-each-ref --format="%(*objectname)" | grep . >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'idents are shared' '
+	git log --all --format="%an <%ae>" >authors &&
+	sort -u authors >unique &&
+	test_line_count = 1 unique &&
+	git log --all --format="%cn <%ce>" >committers &&
+	sort -u committers >unique &&
+	test_line_count = 1 unique &&
+	! test_cmp authors committers
+'
+
+test_expect_success 'commit timestamps are retained' '
+	git log --all --format="%ct" >timestamps &&
+	sort -u timestamps >unique &&
+	test_line_count = 4 unique
+'
+
+test_done
-- 
2.1.0.346.ga0367b9

* Re: [PATCH] teach fast-export an --anonymize option
  2014-08-21  7:01 [PATCH] teach fast-export an --anonymize option Jeff King
@ 2014-08-21 20:15 ` Junio C Hamano
  2014-08-21 22:41   ` Jeff King
  2014-08-21 21:57 ` Junio C Hamano
  1 sibling, 1 reply; 21+ messages in thread
From: Junio C Hamano @ 2014-08-21 20:15 UTC
  To: Jeff King; +Cc: git, Duy Nguyen

Jeff King <peff@peff.net> writes:

> +/*
> + * We anonymize each component of a path individually,
> + * so that paths a/b and a/c will share a common root.
> + * The paths are cached via anonymize_mem so that repeated
> + * lookups for "a" will yield the same value.
> + */
> +static void anonymize_path(struct strbuf *out, const char *path,
> +			   struct hashmap *map,
> +			   char *(*generate)(const char *, size_t *))
> +{
> +	while (*path) {
> +		const char *end_of_component = strchrnul(path, '/');
> +		size_t len = end_of_component - path;
> +		const char *c = anonymize_mem(map, generate, path, &len);
> +		strbuf_add(out, c, len);
> +		path = end_of_component;
> +		if (*path)
> +			strbuf_addch(out, *path++);
> +	}
> +}

Do two paths sort the same way before and after anonymisation?  For
example, if generate() works as a simple substitution, it should map
a character that sorts before (or after) '/' with another that also
sorts before (or after) '/' for us to be able to diagnose an error
that comes from D/F sort order confusion.
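
To make the D/F concern concrete, here is a minimal sketch, assuming a
simplified model of tree ordering in which names are compared bytewise
and a directory is treated as if its name ended in '/'. The function
name tree_entry_cmp is made up for illustration and is not taken from
git's source.

```c
#include <assert.h>
#include <string.h>

/*
 * Simplified model of ordering two entries within one tree:
 * compare the names bytewise, but treat a directory as if its
 * name had a trailing '/'. Returns -1, 0, or 1.
 */
static int tree_entry_cmp(const char *a, int a_is_dir,
			  const char *b, int b_is_dir)
{
	size_t alen, blen, len;
	unsigned char ca, cb;
	int cmp;

	assert(a && b);
	alen = strlen(a);
	blen = strlen(b);
	len = alen < blen ? alen : blen;

	cmp = memcmp(a, b, len);
	if (cmp)
		return cmp < 0 ? -1 : 1;

	/* one name is a prefix of the other; look at the next byte */
	ca = (unsigned char)a[len];	/* NUL if "a" ended here */
	cb = (unsigned char)b[len];
	if (!ca && a_is_dir)
		ca = '/';
	if (!cb && b_is_dir)
		cb = '/';
	return (ca > cb) - (ca < cb);
}
```

Under this model a directory "foo" sorts after a file "foo.c", because
'/' sorts after '.'. If the two names become "path0" and "path1", the
directory sorts first, so a bug that depends on the original relative
order would no longer reproduce.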

* Re: [PATCH] teach fast-export an --anonymize option
  2014-08-21  7:01 [PATCH] teach fast-export an --anonymize option Jeff King
  2014-08-21 20:15 ` Junio C Hamano
@ 2014-08-21 21:57 ` Junio C Hamano
  2014-08-21 22:49   ` Jeff King
  1 sibling, 1 reply; 21+ messages in thread
From: Junio C Hamano @ 2014-08-21 21:57 UTC
  To: Jeff King; +Cc: git, Duy Nguyen

Jeff King <peff@peff.net> writes:

> +--anonymize::
> +	Replace all paths, blob contents, commit and tag messages,
> +	names, and email addresses in the output with anonymized data,
> +	while still retaining the shape of history and of the stored
> +	tree.

Sometimes branch names can contain codenames the project may prefer
to hide from the general public, so they may need to be anonymised
as well.

* Re: [PATCH] teach fast-export an --anonymize option
  2014-08-21 20:15 ` Junio C Hamano
@ 2014-08-21 22:41   ` Jeff King
  0 siblings, 0 replies; 21+ messages in thread
From: Jeff King @ 2014-08-21 22:41 UTC
  To: Junio C Hamano; +Cc: git, Duy Nguyen

On Thu, Aug 21, 2014 at 01:15:10PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > +/*
> > + * We anonymize each component of a path individually,
> > + * so that paths a/b and a/c will share a common root.
> > + * The paths are cached via anonymize_mem so that repeated
> > + * lookups for "a" will yield the same value.
> > + */
> > +static void anonymize_path(struct strbuf *out, const char *path,
> > +			   struct hashmap *map,
> > +			   char *(*generate)(const char *, size_t *))
> > +{
> > +	while (*path) {
> > +		const char *end_of_component = strchrnul(path, '/');
> > +		size_t len = end_of_component - path;
> > +		const char *c = anonymize_mem(map, generate, path, &len);
> > +		strbuf_add(out, c, len);
> > +		path = end_of_component;
> > +		if (*path)
> > +			strbuf_addch(out, *path++);
> > +	}
> > +}
> 
> Do two paths sort the same way before and after anonymisation?  For
> example, if generate() works as a simple substitution, it should map
> a character that sorts before (or after) '/' with another that also
> sorts before (or after) '/' for us to be able to diagnose an error
> that comes from D/F sort order confusion.

No, the sort order is totally lost. I'd be afraid that a general scheme
would end up leaking information about what was in the filenames. It
might be acceptable to leak some information here, though, if it adds to
the realism of the result.
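
A minimal sketch of why the order is lost, assuming names are handed
out from a counter in first-seen order, as the "path%d" generator in
the patch does. The token_map type and the helper names are
hypothetical, and the linear lookup merely stands in for the hashmap.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_TOKENS 64

/* remembers every original name we have seen, in first-seen order */
struct token_map {
	const char *orig[MAX_TOKENS];
	int nr;
};

/*
 * Return the number assigned to "name", assigning the next free
 * number if it has not been seen before. The caller must keep
 * "name" alive, since only the pointer is stored.
 */
static int token_id(struct token_map *map, const char *name)
{
	int i;

	for (i = 0; i < map->nr; i++)
		if (!strcmp(map->orig[i], name))
			return i;
	assert(map->nr < MAX_TOKENS);
	map->orig[map->nr] = name;
	return map->nr++;
}

/* write the anonymized form of "name" into "out" */
static void anon_name(struct token_map *map, const char *name,
		      char *out, size_t outlen)
{
	snprintf(out, outlen, "path%d", token_id(map, name));
}
```

If "zebra" is seen before "apple", they become "path0" and "path1".
Repeated lookups stay consistent, but the new names sort in the order
they were encountered rather than in the order of the originals.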

I tried here to lay the basic infrastructure and do the simplest thing
that might work, so we could evaluate proposals like that independently
(and also because I didn't come up with a clever enough algorithm to do
what you're asking).  Patches welcome on top. :)

-Peff

* Re: [PATCH] teach fast-export an --anonymize option
  2014-08-21 21:57 ` Junio C Hamano
@ 2014-08-21 22:49   ` Jeff King
  2014-08-21 23:21     ` [PATCH v2] " Jeff King
  0 siblings, 1 reply; 21+ messages in thread
From: Jeff King @ 2014-08-21 22:49 UTC
  To: Junio C Hamano; +Cc: git, Duy Nguyen

On Thu, Aug 21, 2014 at 02:57:22PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > +--anonymize::
> > +	Replace all paths, blob contents, commit and tag messages,
> > +	names, and email addresses in the output with anonymized data,
> > +	while still retaining the shape of history and of the stored
> > +	tree.
> 
> Sometimes branch names can contain codenames the project may prefer
> to hide from the general public, so they may need to be anonymised
> as well.

Yes, I do anonymize them (and check it in the tests). See
anonymize_refname. I just forgot to include it in the list. Trivial
squashable patch is below.

The few things I don't anonymize are:

  1. ref prefixes. We see the same distribution of refs/heads vs
     refs/tags, etc.

  2. refs/heads/master is left untouched, for convenience (and because
     it's not really a secret). The implementation is lazy, though, and
     would leave "refs/heads/master-supersecret", as well. I can tighten
     that if we really want to be careful.

  3. gitlinks are left untouched, since sha1s cannot be reversed. This
     could leak some information (if your private repo points to a
     public one, I can find out that you have it as a submodule). I
     doubt it matters, but we can also scramble the sha1s.
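
The laziness in item 2 can be sketched as follows, assuming a plain
prefix comparison. The helper names here are hypothetical (the patch
itself uses skip_prefix()); the second helper shows one possible way
to tighten the check, not necessarily the one a re-roll would use.

```c
#include <assert.h>
#include <string.h>

/* lazy check: true for any refname that merely begins with "prefix" */
static int starts_with_prefix(const char *refname, const char *prefix)
{
	assert(refname && prefix);
	return !strncmp(refname, prefix, strlen(prefix));
}

/* tighter check: true only when the whole refname matches */
static int is_exactly(const char *refname, const char *special)
{
	assert(refname && special);
	return !strcmp(refname, special);
}
```

With the lazy check, "refs/heads/master-supersecret" matches the
"refs/heads/master" entry, so only the "-supersecret" remainder is
handed to the anonymizer and the word "master" is emitted as-is.
Requiring an exact match lets "refs/heads/master" through and sends
everything else down the normal "refs/heads/" path.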

---
diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
index 0ec7cad..52831fa 100644
--- a/Documentation/git-fast-export.txt
+++ b/Documentation/git-fast-export.txt
@@ -106,10 +106,10 @@ marks the same across runs.
 	different from the commit's first parent).
 
 --anonymize::
-	Replace all paths, blob contents, commit and tag messages,
-	names, and email addresses in the output with anonymized data,
-	while still retaining the shape of history and of the stored
-	tree.
+	Replace all refnames, paths, blob contents, commit and tag
+	messages, names, and email addresses in the output with
+	anonymized data, while still retaining the shape of history and
+	of the stored tree.
 
 --refspec::
 	Apply the specified refspec to each ref exported. Multiple of them can

* [PATCH v2] teach fast-export an --anonymize option
  2014-08-21 22:49   ` Jeff King
@ 2014-08-21 23:21     ` Jeff King
  2014-08-22 13:06       ` Duy Nguyen
                         ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Jeff King @ 2014-08-21 23:21 UTC
  To: Junio C Hamano; +Cc: git, Duy Nguyen

On Thu, Aug 21, 2014 at 06:49:10PM -0400, Jeff King wrote:

> The few things I don't anonymize are:
> 
>   1. ref prefixes. We see the same distribution of refs/heads vs
>      refs/tags, etc.
> 
>   2. refs/heads/master is left untouched, for convenience (and because
>      it's not really a secret). The implementation is lazy, though, and
>      would leave "refs/heads/master-supersecret", as well. I can tighten
>      that if we really want to be careful.
> 
>   3. gitlinks are left untouched, since sha1s cannot be reversed. This
>      could leak some information (if your private repo points to a
>      public one, I can find out that you have it as a submodule). I
>      doubt it matters, but we can also scramble the sha1s.

Here's a re-roll that addresses the latter two. I don't think any are a
big deal, but it's much easier to say "it's handled" than to try to
figure out whether and when it's important.

This also includes the documentation update I sent earlier. The
interdiff is a bit noisy, as I also converted the anonymize_mem function
to take void pointers (since it doesn't know or care what it's storing,
and this makes storing unsigned chars for sha1s easier).
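
As a rough illustration of why a pointer-plus-length interface suits
binary sha1s, here is a small sketch. The helper name mem_key_equal is
made up for this example; it only mirrors the length-then-memcmp idea
used by the comparison function in the patch.

```c
#include <assert.h>
#include <string.h>

/*
 * Compare two keys by explicit length and raw bytes. Unlike a
 * string comparison, this does not stop at an embedded NUL, which
 * matters for binary data such as a raw sha1.
 */
static int mem_key_equal(const void *a, size_t alen,
			 const void *b, size_t blen)
{
	assert(a && b);
	return alen == blen && !memcmp(a, b, alen);
}
```

Two buffers that share a first byte followed by a NUL look identical
to strcmp() yet differ under this comparison, which is the behavior a
cache keyed on raw object names needs.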

-- >8 --
Subject: teach fast-export an --anonymize option

Sometimes users want to report a bug they experience on
their repository, but they are not at liberty to share the
contents of the repository. It would be useful if they could
produce a repository that has a similar shape to its history
and tree, but without leaking any information. This
"anonymized" repository could then be shared with developers
(assuming it still replicates the original problem).

This patch implements an "--anonymize" option to
fast-export, which generates a stream that can recreate such
a repository. Producing a single stream makes it easy for
the caller to verify that they are not leaking any useful
information. You can get an overview of what will be shared
by running a command like:

  git fast-export --anonymize --all |
  perl -pe 's/\d+/X/g' |
  sort -u |
  less

which will show every unique line we generate, modulo any
numbers (each anonymized token is assigned a number, like
"User 0", and we replace it consistently in the output).

In addition to anonymizing, this produces test cases that
are relatively small (compared to the original repository)
and fast to generate (compared to using filter-branch, or
modifying the output of fast-export yourself). Here are
numbers for git.git:

  $ time git fast-export --anonymize --all \
         --tag-of-filtered-object=drop >output
  real    0m2.883s
  user    0m2.828s
  sys     0m0.052s

  $ gzip output
  $ ls -lh output.gz | awk '{print $5}'
  2.9M

Signed-off-by: Jeff King <peff@peff.net>
---
 Documentation/git-fast-export.txt |   6 +
 builtin/fast-export.c             | 300 ++++++++++++++++++++++++++++++++++++--
 t/t9351-fast-export-anonymize.sh  | 117 +++++++++++++++
 3 files changed, 412 insertions(+), 11 deletions(-)
 create mode 100755 t/t9351-fast-export-anonymize.sh

diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
index 221506b..52831fa 100644
--- a/Documentation/git-fast-export.txt
+++ b/Documentation/git-fast-export.txt
@@ -105,6 +105,12 @@ marks the same across runs.
 	in the commit (as opposed to just listing the files which are
 	different from the commit's first parent).
 
+--anonymize::
+	Replace all refnames, paths, blob contents, commit and tag
+	messages, names, and email addresses in the output with
+	anonymized data, while still retaining the shape of history and
+	of the stored tree.
+
 --refspec::
 	Apply the specified refspec to each ref exported. Multiple of them can
 	be specified.
diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 92b4624..b8182c2 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -18,6 +18,7 @@
 #include "parse-options.h"
 #include "quote.h"
 #include "remote.h"
+#include "blob.h"
 
 static const char *fast_export_usage[] = {
 	N_("git fast-export [rev-list-opts]"),
@@ -34,6 +35,7 @@ static int full_tree;
 static struct string_list extra_refs = STRING_LIST_INIT_NODUP;
 static struct refspec *refspecs;
 static int refspecs_nr;
+static int anonymize;
 
 static int parse_opt_signed_tag_mode(const struct option *opt,
 				     const char *arg, int unset)
@@ -81,6 +83,76 @@ static int has_unshown_parent(struct commit *commit)
 	return 0;
 }
 
+struct anonymized_entry {
+	struct hashmap_entry hash;
+	const char *orig;
+	size_t orig_len;
+	const char *anon;
+	size_t anon_len;
+};
+
+static int anonymized_entry_cmp(const void *va, const void *vb,
+				const void *data)
+{
+	const struct anonymized_entry *a = va, *b = vb;
+	return a->orig_len != b->orig_len ||
+		memcmp(a->orig, b->orig, a->orig_len);
+}
+
+/*
+ * Basically keep a cache of X->Y so that we can repeatedly replace
+ * the same anonymized string with another. The actual generation
+ * is farmed out to the generate function.
+ */
+static const void *anonymize_mem(struct hashmap *map,
+				 void *(*generate)(const void *, size_t *),
+				 const void *orig, size_t *len)
+{
+	struct anonymized_entry key, *ret;
+
+	if (!map->cmpfn)
+		hashmap_init(map, anonymized_entry_cmp, 0);
+
+	hashmap_entry_init(&key, memhash(orig, *len));
+	key.orig = orig;
+	key.orig_len = *len;
+	ret = hashmap_get(map, &key, NULL);
+
+	if (!ret) {
+		ret = xmalloc(sizeof(*ret));
+		hashmap_entry_init(&ret->hash, key.hash.hash);
+		ret->orig = xstrdup(orig);
+		ret->orig_len = *len;
+		ret->anon = generate(orig, len);
+		ret->anon_len = *len;
+		hashmap_put(map, ret);
+	}
+
+	*len = ret->anon_len;
+	return ret->anon;
+}
+
+/*
+ * We anonymize each component of a path individually,
+ * so that paths a/b and a/c will share a common root.
+ * The paths are cached via anonymize_mem so that repeated
+ * lookups for "a" will yield the same value.
+ */
+static void anonymize_path(struct strbuf *out, const char *path,
+			   struct hashmap *map,
+			   void *(*generate)(const void *, size_t *))
+{
+	while (*path) {
+		const char *end_of_component = strchrnul(path, '/');
+		size_t len = end_of_component - path;
+		const char *c = anonymize_mem(map, generate, path, &len);
+		strbuf_add(out, c, len);
+		path = end_of_component;
+		if (*path)
+			strbuf_addch(out, *path++);
+	}
+}
+
 /* Since intptr_t is C99, we do not use it here */
 static inline uint32_t *mark_to_ptr(uint32_t mark)
 {
@@ -119,6 +191,26 @@ static void show_progress(void)
 		printf("progress %d objects\n", counter);
 }
 
+/*
+ * Ideally we would want some transformation of the blob data here
+ * that is unreversible, but would still be the same size and have
+ * the same data relationship to other blobs (so that we get the same
+ * delta and packing behavior as the original). But the first and last
+ * requirements there are probably mutually exclusive, so let's take
+ * the easy way out for now, and just generate arbitrary content.
+ *
+ * There's no need to cache this result with anonymize_mem, since
+ * we already handle blob content caching with marks.
+ */
+static char *anonymize_blob(unsigned long *size)
+{
+	static int counter;
+	struct strbuf out = STRBUF_INIT;
+	strbuf_addf(&out, "anonymous blob %d", counter++);
+	*size = out.len;
+	return strbuf_detach(&out, NULL);
+}
+
 static void export_blob(const unsigned char *sha1)
 {
 	unsigned long size;
@@ -137,12 +229,19 @@ static void export_blob(const unsigned char *sha1)
 	if (object && object->flags & SHOWN)
 		return;
 
-	buf = read_sha1_file(sha1, &type, &size);
-	if (!buf)
-		die ("Could not read blob %s", sha1_to_hex(sha1));
-	if (check_sha1_signature(sha1, buf, size, typename(type)) < 0)
-		die("sha1 mismatch in blob %s", sha1_to_hex(sha1));
-	object = parse_object_buffer(sha1, type, size, buf, &eaten);
+	if (anonymize) {
+		buf = anonymize_blob(&size);
+		object = (struct object *)lookup_blob(sha1);
+		eaten = 0;
+	} else {
+		buf = read_sha1_file(sha1, &type, &size);
+		if (!buf)
+			die ("Could not read blob %s", sha1_to_hex(sha1));
+		if (check_sha1_signature(sha1, buf, size, typename(type)) < 0)
+			die("sha1 mismatch in blob %s", sha1_to_hex(sha1));
+		object = parse_object_buffer(sha1, type, size, buf, &eaten);
+	}
+
 	if (!object)
 		die("Could not read blob %s", sha1_to_hex(sha1));
 
@@ -190,7 +289,7 @@ static int depth_first(const void *a_, const void *b_)
 	return (a->status == 'R') - (b->status == 'R');
 }
 
-static void print_path(const char *path)
+static void print_path_1(const char *path)
 {
 	int need_quote = quote_c_style(path, NULL, NULL, 0);
 	if (need_quote)
@@ -201,6 +300,43 @@ static void print_path(const char *path)
 		printf("%s", path);
 }
 
+static void *anonymize_path_component(const void *path, size_t *len)
+{
+	static int counter;
+	struct strbuf out = STRBUF_INIT;
+	strbuf_addf(&out, "path%d", counter++);
+	return strbuf_detach(&out, len);
+}
+
+static void print_path(const char *path)
+{
+	if (!anonymize)
+		print_path_1(path);
+	else {
+		static struct hashmap paths;
+		static struct strbuf anon = STRBUF_INIT;
+
+		anonymize_path(&anon, path, &paths, anonymize_path_component);
+		print_path_1(anon.buf);
+		strbuf_reset(&anon);
+	}
+}
+
+static void *generate_fake_sha1(const void *old, size_t *len)
+{
+	static uint32_t counter = 1; /* avoid null sha1 */
+	unsigned char *out = xcalloc(20, 1);
+	put_be32(out + 16, counter++);
+	return out;
+}
+
+static const unsigned char *anonymize_sha1(const unsigned char *sha1)
+{
+	static struct hashmap sha1s;
+	size_t len = 20;
+	return anonymize_mem(&sha1s, generate_fake_sha1, sha1, &len);
+}
+
 static void show_filemodify(struct diff_queue_struct *q,
 			    struct diff_options *options, void *data)
 {
@@ -245,7 +381,9 @@ static void show_filemodify(struct diff_queue_struct *q,
 			 */
 			if (no_data || S_ISGITLINK(spec->mode))
 				printf("M %06o %s ", spec->mode,
-				       sha1_to_hex(spec->sha1));
+				       sha1_to_hex(anonymize ?
+						   anonymize_sha1(spec->sha1) :
+						   spec->sha1));
 			else {
 				struct object *object = lookup_object(spec->sha1);
 				printf("M %06o :%d ", spec->mode,
@@ -279,6 +417,114 @@ static const char *find_encoding(const char *begin, const char *end)
 	return bol;
 }
 
+static void *anonymize_ref_component(const void *old, size_t *len)
+{
+	static int counter;
+	struct strbuf out = STRBUF_INIT;
+	strbuf_addf(&out, "ref%d", counter++);
+	return strbuf_detach(&out, len);
+}
+
+static const char *anonymize_refname(const char *refname)
+{
+	/*
+	 * If any of these prefixes is found, we will leave it intact
+	 * so that tags remain tags and so forth.
+	 */
+	static const char *prefixes[] = {
+		"refs/heads/",
+		"refs/tags/",
+		"refs/remotes/",
+		"refs/"
+	};
+	static struct hashmap refs;
+	static struct strbuf anon = STRBUF_INIT;
+	int i;
+
+	/*
+	 * We also leave "master" as a special case, since it does not reveal
+	 * anything interesting.
+	 */
+	if (!strcmp(refname, "refs/heads/master"))
+		return refname;
+
+	strbuf_reset(&anon);
+	for (i = 0; i < ARRAY_SIZE(prefixes); i++) {
+		if (skip_prefix(refname, prefixes[i], &refname)) {
+			strbuf_addstr(&anon, prefixes[i]);
+			break;
+		}
+	}
+
+	anonymize_path(&anon, refname, &refs, anonymize_ref_component);
+	return anon.buf;
+}
+
+/*
+ * We do not even bother to cache commit messages, as they are unlikely
+ * to be repeated verbatim, and it is not that interesting when they are.
+ */
+static char *anonymize_commit_message(const char *old)
+{
+	static int counter;
+	return xstrfmt("subject %d\n\nbody\n", counter++);
+}
+
+static struct hashmap idents;
+static void *anonymize_ident(const void *old, size_t *len)
+{
+	static int counter;
+	struct strbuf out = STRBUF_INIT;
+	strbuf_addf(&out, "User %d <user%d@example.com>", counter, counter);
+	counter++;
+	return strbuf_detach(&out, len);
+}
+
+/*
+ * Our strategy here is to anonymize the names and email addresses,
+ * but keep timestamps intact, as they influence things like traversal
+ * order (and by themselves should not be too revealing).
+ */
+static void anonymize_ident_line(const char **beg, const char **end)
+{
+	static struct strbuf buffers[] = { STRBUF_INIT, STRBUF_INIT };
+	static unsigned which_buffer;
+
+	struct strbuf *out;
+	struct ident_split split;
+	const char *end_of_header;
+
+	out = &buffers[which_buffer++];
+	which_buffer %= ARRAY_SIZE(buffers);
+	strbuf_reset(out);
+
+	/* skip "committer", "author", "tagger", etc */
+	end_of_header = strchr(*beg, ' ');
+	if (!end_of_header)
+		die("BUG: malformed line fed to anonymize_ident_line: %.*s",
+		    (int)(*end - *beg), *beg);
+	end_of_header++;
+	strbuf_add(out, *beg, end_of_header - *beg);
+
+	if (!split_ident_line(&split, end_of_header, *end - end_of_header) &&
+	    split.date_begin) {
+		const char *ident;
+		size_t len;
+
+		len = split.mail_end - split.name_begin;
+		ident = anonymize_mem(&idents, anonymize_ident,
+				      split.name_begin, &len);
+		strbuf_add(out, ident, len);
+		strbuf_addch(out, ' ');
+		strbuf_add(out, split.date_begin, split.tz_end - split.date_begin);
+	} else {
+		strbuf_addstr(out, "Malformed Ident <malformed@example.com> 0 -0000");
+	}
+
+	*beg = out->buf;
+	*end = out->buf + out->len;
+}
+
 static void handle_commit(struct commit *commit, struct rev_info *rev)
 {
 	int saved_output_format = rev->diffopt.output_format;
@@ -287,6 +533,7 @@ static void handle_commit(struct commit *commit, struct rev_info *rev)
 	const char *encoding, *message;
 	char *reencoded = NULL;
 	struct commit_list *p;
+	const char *refname;
 	int i;
 
 	rev->diffopt.output_format = DIFF_FORMAT_CALLBACK;
@@ -326,13 +573,22 @@ static void handle_commit(struct commit *commit, struct rev_info *rev)
 		if (!S_ISGITLINK(diff_queued_diff.queue[i]->two->mode))
 			export_blob(diff_queued_diff.queue[i]->two->sha1);
 
+	refname = commit->util;
+	if (anonymize) {
+		refname = anonymize_refname(refname);
+		anonymize_ident_line(&committer, &committer_end);
+		anonymize_ident_line(&author, &author_end);
+	}
+
 	mark_next_object(&commit->object);
-	if (!is_encoding_utf8(encoding))
+	if (anonymize)
+		reencoded = anonymize_commit_message(message);
+	else if (!is_encoding_utf8(encoding))
 		reencoded = reencode_string(message, "UTF-8", encoding);
 	if (!commit->parents)
-		printf("reset %s\n", (const char*)commit->util);
+		printf("reset %s\n", refname);
 	printf("commit %s\nmark :%"PRIu32"\n%.*s\n%.*s\ndata %u\n%s",
-	       (const char *)commit->util, last_idnum,
+	       refname, last_idnum,
 	       (int)(author_end - author), author,
 	       (int)(committer_end - committer), committer,
 	       (unsigned)(reencoded
@@ -363,6 +619,14 @@ static void handle_commit(struct commit *commit, struct rev_info *rev)
 	show_progress();
 }
 
+static void *anonymize_tag(const void *old, size_t *len)
+{
+	static int counter;
+	struct strbuf out = STRBUF_INIT;
+	strbuf_addf(&out, "tag message %d", counter++);
+	return strbuf_detach(&out, len);
+}
+
 static void handle_tail(struct object_array *commits, struct rev_info *revs)
 {
 	struct commit *commit;
@@ -419,6 +683,17 @@ static void handle_tag(const char *name, struct tag *tag)
 	} else {
 		tagger++;
 		tagger_end = strchrnul(tagger, '\n');
+		if (anonymize)
+			anonymize_ident_line(&tagger, &tagger_end);
+	}
+
+	if (anonymize) {
+		name = anonymize_refname(name);
+		if (message) {
+			static struct hashmap tags;
+			message = anonymize_mem(&tags, anonymize_tag,
+						message, &message_size);
+		}
 	}
 
 	/* handle signed tags */
@@ -584,6 +859,8 @@ static void handle_tags_and_duplicates(void)
 			handle_tag(name, (struct tag *)object);
 			break;
 		case OBJ_COMMIT:
+			if (anonymize)
+				name = anonymize_refname(name);
 			/* create refs pointing to already seen commits */
 			commit = (struct commit *)object;
 			printf("reset %s\nfrom :%d\n\n", name,
@@ -719,6 +996,7 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix)
 		OPT_BOOL(0, "no-data", &no_data, N_("Skip output of blob data")),
 		OPT_STRING_LIST(0, "refspec", &refspecs_list, N_("refspec"),
 			     N_("Apply refspec to exported refs")),
+		OPT_BOOL(0, "anonymize", &anonymize, N_("anonymize output")),
 		OPT_END()
 	};
 
diff --git a/t/t9351-fast-export-anonymize.sh b/t/t9351-fast-export-anonymize.sh
new file mode 100755
index 0000000..f76ffe4
--- /dev/null
+++ b/t/t9351-fast-export-anonymize.sh
@@ -0,0 +1,117 @@
+#!/bin/sh
+
+test_description='basic tests for fast-export --anonymize'
+. ./test-lib.sh
+
+test_expect_success 'setup simple repo' '
+	test_commit base &&
+	test_commit foo &&
+	git checkout -b other HEAD^ &&
+	mkdir subdir &&
+	test_commit subdir/bar &&
+	test_commit subdir/xyzzy &&
+	git tag -m "annotated tag" mytag
+'
+
+test_expect_success 'export anonymized stream' '
+	git fast-export --anonymize --all >stream
+'
+
+# this also covers commit messages
+test_expect_success 'stream omits path names' '
+	! fgrep base stream &&
+	! fgrep foo stream &&
+	! fgrep subdir stream &&
+	! fgrep bar stream &&
+	! fgrep xyzzy stream
+'
+
+test_expect_success 'stream allows master as refname' '
+	fgrep master stream
+'
+
+test_expect_success 'stream omits other refnames' '
+	! fgrep other stream
+'
+
+test_expect_success 'stream omits identities' '
+	! fgrep "$GIT_COMMITTER_NAME" stream &&
+	! fgrep "$GIT_COMMITTER_EMAIL" stream &&
+	! fgrep "$GIT_AUTHOR_NAME" stream &&
+	! fgrep "$GIT_AUTHOR_EMAIL" stream
+'
+
+test_expect_success 'stream omits tag message' '
+	! fgrep "annotated tag" stream
+'
+
+# NOTE: we chdir to the new, anonymized repository
+# after this. All further tests should assume this.
+test_expect_success 'import stream to new repository' '
+	git init new &&
+	cd new &&
+	git fast-import <../stream
+'
+
+test_expect_success 'result has two branches' '
+	git for-each-ref --format="%(refname)" refs/heads >branches &&
+	test_line_count = 2 branches &&
+	other_branch=$(grep -v refs/heads/master branches)
+'
+
+test_expect_success 'repo has original shape' '
+	cat >expect <<-\EOF &&
+	> subject 3
+	> subject 2
+	< subject 1
+	- subject 0
+	EOF
+	git log --format="%m %s" --left-right --boundary \
+		master...$other_branch >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'root tree has original shape' '
+	cat >expect <<-\EOF &&
+	blob
+	tree
+	EOF
+	git ls-tree $other_branch >root &&
+	cut -d" " -f2 <root >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'paths in subdir ended up in one tree' '
+	cat >expect <<-\EOF &&
+	blob
+	blob
+	EOF
+	tree=$(grep tree root | cut -f2) &&
+	git ls-tree $other_branch:$tree >tree &&
+	cut -d" " -f2 <tree >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'tag points to branch tip' '
+	git rev-parse $other_branch >expect &&
+	git for-each-ref --format="%(*objectname)" | grep . >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'idents are shared' '
+	git log --all --format="%an <%ae>" >authors &&
+	sort -u authors >unique &&
+	test_line_count = 1 unique &&
+	git log --all --format="%cn <%ce>" >committers &&
+	sort -u committers >unique &&
+	test_line_count = 1 unique &&
+	! test_cmp authors committers
+'
+
+test_expect_success 'commit timestamps are retained' '
+	git log --all --format="%ct" >timestamps &&
+	sort -u timestamps >unique &&
+	test_line_count = 4 unique
+'
+
+test_done
-- 
2.1.0.346.ga0367b9


* Re: [PATCH v2] teach fast-export an --anonymize option
  2014-08-21 23:21     ` [PATCH v2] " Jeff King
@ 2014-08-22 13:06       ` Duy Nguyen
  2014-08-22 18:39       ` Philip Oakley
  2014-08-27 16:01       ` Junio C Hamano
  2 siblings, 0 replies; 21+ messages in thread
From: Duy Nguyen @ 2014-08-22 13:06 UTC (permalink / raw)
  To: Jeff King, Steven Evergreen; +Cc: Junio C Hamano, Git Mailing List

On Fri, Aug 22, 2014 at 6:21 AM, Jeff King <peff@peff.net> wrote:
> -- >8 --
> Subject: teach fast-export an --anonymize option
>
> Sometimes users want to report a bug they experience on
> their repository, but they are not at liberty to share the
> contents of the repository. It would be useful if they could
> produce a repository that has a similar shape to its history
> and tree, but without leaking any information. This
> "anonymized" repository could then be shared with developers
> (assuming it still replicates the original problem).

This is cool. Thanks, Jeff. Steven, could you try this with the repo
whose shallow clone with --no-single-branch failed the other day?
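
A minimal sketch of what that trial could look like, assuming a git
build that includes this patch. The helper name make_anonymized_copy
and its two directory arguments are hypothetical; the export and
import commands are the ones the patch's own test script exercises.

```shell
# Hypothetical helper: build an anonymized copy of a repository.
# Usage: make_anonymized_copy <source-repo> <new-repo>
make_anonymized_copy () {
	src=$1 dst=$2 &&
	# Create an empty repository to receive the stream.
	git init -q "$dst" &&
	# Export every ref with --anonymize and import it directly.
	git -C "$src" fast-export --anonymize --all |
	git -C "$dst" fast-import --quiet
}
```

The failing clone can then be retried against the copy, for example
with "git clone --depth=1 --no-single-branch file://$PWD/anon out"
(a file:// URL is needed because --depth is ignored for plain local
paths). Whether the problem still reproduces is what needs checking.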

>
> This patch implements an "--anonymize" option to
> fast-export, which generates a stream that can recreate such
> a repository. Producing a single stream makes it easy for
> the caller to verify that they are not leaking any useful
> information. You can get an overview of what will be shared
> by running a command like:
>
>   git fast-export --anonymize --all |
>   perl -pe 's/\d+/X/g' |
>   sort -u |
>   less
>
> which will show every unique line we generate, modulo any
> numbers (each anonymized token is assigned a number, like
> "User 0", and we replace it consistently in the output).
>
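
The review step quoted above could be wrapped in a small helper. This
is only a sketch: summarize_stream is a hypothetical name, and it uses
sed in place of the perl one-liner, which should give the same result
for plain decimal digits.

```shell
# Hypothetical helper: collapse every run of digits to "X" and print
# each distinct line once, so the stream can be skimmed for leaks.
summarize_stream () {
	sed 's/[0-9][0-9]*/X/g' | LC_ALL=C sort -u
}
```

It would be used as
"git fast-export --anonymize --all | summarize_stream | less".
Anything in that output that is not an obviously generated token
deserves a closer look before the stream is shared.
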
> In addition to anonymizing, this produces test cases that
> are relatively small (compared to the original repository)
> and fast to generate (compared to using filter-branch, or
> modifying the output of fast-export yourself). Here are
> numbers for git.git:
>
>   $ time git fast-export --anonymize --all \
>          --tag-of-filtered-object=drop >output
>   real    0m2.883s
>   user    0m2.828s
>   sys     0m0.052s
>
>   $ gzip output
>   $ ls -lh output.gz | awk '{print $5}'
>   2.9M
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  Documentation/git-fast-export.txt |   6 +
>  builtin/fast-export.c             | 300 ++++++++++++++++++++++++++++++++++++--
>  t/t9351-fast-export-anonymize.sh  | 117 +++++++++++++++
>  3 files changed, 412 insertions(+), 11 deletions(-)
>  create mode 100755 t/t9351-fast-export-anonymize.sh
>
> diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
> index 221506b..52831fa 100644
> --- a/Documentation/git-fast-export.txt
> +++ b/Documentation/git-fast-export.txt
> @@ -105,6 +105,12 @@ marks the same across runs.
>         in the commit (as opposed to just listing the files which are
>         different from the commit's first parent).
>
> +--anonymize::
> +       Replace all refnames, paths, blob contents, commit and tag
> +       messages, names, and email addresses in the output with
> +       anonymized data, while still retaining the shape of history and
> +       of the stored tree.
> +
>  --refspec::
>         Apply the specified refspec to each ref exported. Multiple of them can
>         be specified.
> diff --git a/builtin/fast-export.c b/builtin/fast-export.c
> index 92b4624..b8182c2 100644
> --- a/builtin/fast-export.c
> +++ b/builtin/fast-export.c
> @@ -18,6 +18,7 @@
>  #include "parse-options.h"
>  #include "quote.h"
>  #include "remote.h"
> +#include "blob.h"
>
>  static const char *fast_export_usage[] = {
>         N_("git fast-export [rev-list-opts]"),
> @@ -34,6 +35,7 @@ static int full_tree;
>  static struct string_list extra_refs = STRING_LIST_INIT_NODUP;
>  static struct refspec *refspecs;
>  static int refspecs_nr;
> +static int anonymize;
>
>  static int parse_opt_signed_tag_mode(const struct option *opt,
>                                      const char *arg, int unset)
> @@ -81,6 +83,76 @@ static int has_unshown_parent(struct commit *commit)
>         return 0;
>  }
>
> +struct anonymized_entry {
> +       struct hashmap_entry hash;
> +       const char *orig;
> +       size_t orig_len;
> +       const char *anon;
> +       size_t anon_len;
> +};
> +
> +static int anonymized_entry_cmp(const void *va, const void *vb,
> +                               const void *data)
> +{
> +       const struct anonymized_entry *a = va, *b = vb;
> +       return a->orig_len != b->orig_len ||
> +               memcmp(a->orig, b->orig, a->orig_len);
> +}
> +
> +/*
> + * Basically keep a cache of X->Y so that we can consistently replace
> + * the same original string with the same anonymized one. The actual
> + * generation is farmed out to the generate function.
> + */
> +static const void *anonymize_mem(struct hashmap *map,
> +                                void *(*generate)(const void *, size_t *),
> +                                const void *orig, size_t *len)
> +{
> +       struct anonymized_entry key, *ret;
> +
> +       if (!map->cmpfn)
> +               hashmap_init(map, anonymized_entry_cmp, 0);
> +
> +       hashmap_entry_init(&key, memhash(orig, *len));
> +       key.orig = orig;
> +       key.orig_len = *len;
> +       ret = hashmap_get(map, &key, NULL);
> +
> +       if (!ret) {
> +               ret = xmalloc(sizeof(*ret));
> +               hashmap_entry_init(&ret->hash, key.hash.hash);
> +               ret->orig = xmemdupz(orig, *len);
> +               ret->orig_len = *len;
> +               ret->anon = generate(orig, len);
> +               ret->anon_len = *len;
> +               hashmap_put(map, ret);
> +       }
> +
> +       *len = ret->anon_len;
> +       return ret->anon;
> +}
> +
> +/*
> + * We anonymize each component of a path individually,
> + * so that paths a/b and a/c will share a common root.
> + * The paths are cached via anonymize_mem so that repeated
> + * lookups for "a" will yield the same value.
> + */
> +static void anonymize_path(struct strbuf *out, const char *path,
> +                          struct hashmap *map,
> +                          void *(*generate)(const void *, size_t *))
> +{
> +       while (*path) {
> +               const char *end_of_component = strchrnul(path, '/');
> +               size_t len = end_of_component - path;
> +               const char *c = anonymize_mem(map, generate, path, &len);
> +               strbuf_add(out, c, len);
> +               path = end_of_component;
> +               if (*path)
> +                       strbuf_addch(out, *path++);
> +       }
> +}
> +
>  /* Since intptr_t is C99, we do not use it here */
>  static inline uint32_t *mark_to_ptr(uint32_t mark)
>  {
> @@ -119,6 +191,26 @@ static void show_progress(void)
>                 printf("progress %d objects\n", counter);
>  }
>
> +/*
> + * Ideally we would want some transformation of the blob data here
> + * that is unreversible, but would still be the same size and have
> + * the same data relationship to other blobs (so that we get the same
> + * delta and packing behavior as the original). But the first and last
> + * requirements there are probably mutually exclusive, so let's take
> + * the easy way out for now, and just generate arbitrary content.
> + *
> + * There's no need to cache this result with anonymize_mem, since
> + * we already handle blob content caching with marks.
> + */
> +static char *anonymize_blob(unsigned long *size)
> +{
> +       static int counter;
> +       struct strbuf out = STRBUF_INIT;
> +       strbuf_addf(&out, "anonymous blob %d", counter++);
> +       *size = out.len;
> +       return strbuf_detach(&out, NULL);
> +}
> +
>  static void export_blob(const unsigned char *sha1)
>  {
>         unsigned long size;
> @@ -137,12 +229,19 @@ static void export_blob(const unsigned char *sha1)
>         if (object && object->flags & SHOWN)
>                 return;
>
> -       buf = read_sha1_file(sha1, &type, &size);
> -       if (!buf)
> -               die ("Could not read blob %s", sha1_to_hex(sha1));
> -       if (check_sha1_signature(sha1, buf, size, typename(type)) < 0)
> -               die("sha1 mismatch in blob %s", sha1_to_hex(sha1));
> -       object = parse_object_buffer(sha1, type, size, buf, &eaten);
> +       if (anonymize) {
> +               buf = anonymize_blob(&size);
> +               object = (struct object *)lookup_blob(sha1);
> +               eaten = 0;
> +       } else {
> +               buf = read_sha1_file(sha1, &type, &size);
> +               if (!buf)
> +                       die ("Could not read blob %s", sha1_to_hex(sha1));
> +               if (check_sha1_signature(sha1, buf, size, typename(type)) < 0)
> +                       die("sha1 mismatch in blob %s", sha1_to_hex(sha1));
> +               object = parse_object_buffer(sha1, type, size, buf, &eaten);
> +       }
> +
>         if (!object)
>                 die("Could not read blob %s", sha1_to_hex(sha1));
>
> @@ -190,7 +289,7 @@ static int depth_first(const void *a_, const void *b_)
>         return (a->status == 'R') - (b->status == 'R');
>  }
>
> -static void print_path(const char *path)
> +static void print_path_1(const char *path)
>  {
>         int need_quote = quote_c_style(path, NULL, NULL, 0);
>         if (need_quote)
> @@ -201,6 +300,43 @@ static void print_path(const char *path)
>                 printf("%s", path);
>  }
>
> +static void *anonymize_path_component(const void *path, size_t *len)
> +{
> +       static int counter;
> +       struct strbuf out = STRBUF_INIT;
> +       strbuf_addf(&out, "path%d", counter++);
> +       return strbuf_detach(&out, len);
> +}
> +
> +static void print_path(const char *path)
> +{
> +       if (!anonymize)
> +               print_path_1(path);
> +       else {
> +               static struct hashmap paths;
> +               static struct strbuf anon = STRBUF_INIT;
> +
> +               anonymize_path(&anon, path, &paths, anonymize_path_component);
> +               print_path_1(anon.buf);
> +               strbuf_reset(&anon);
> +       }
> +}
> +
> +static void *generate_fake_sha1(const void *old, size_t *len)
> +{
> +       static uint32_t counter = 1; /* avoid null sha1 */
> +       unsigned char *out = xcalloc(20, 1);
> +       put_be32(out + 16, counter++);
> +       return out;
> +}
> +
> +static const unsigned char *anonymize_sha1(const unsigned char *sha1)
> +{
> +       static struct hashmap sha1s;
> +       size_t len = 20;
> +       return anonymize_mem(&sha1s, generate_fake_sha1, sha1, &len);
> +}
> +
>  static void show_filemodify(struct diff_queue_struct *q,
>                             struct diff_options *options, void *data)
>  {
> @@ -245,7 +381,9 @@ static void show_filemodify(struct diff_queue_struct *q,
>                          */
>                         if (no_data || S_ISGITLINK(spec->mode))
>                                 printf("M %06o %s ", spec->mode,
> -                                      sha1_to_hex(spec->sha1));
> +                                      sha1_to_hex(anonymize ?
> +                                                  anonymize_sha1(spec->sha1) :
> +                                                  spec->sha1));
>                         else {
>                                 struct object *object = lookup_object(spec->sha1);
>                                 printf("M %06o :%d ", spec->mode,
> @@ -279,6 +417,114 @@ static const char *find_encoding(const char *begin, const char *end)
>         return bol;
>  }
>
> +static void *anonymize_ref_component(const void *old, size_t *len)
> +{
> +       static int counter;
> +       struct strbuf out = STRBUF_INIT;
> +       strbuf_addf(&out, "ref%d", counter++);
> +       return strbuf_detach(&out, len);
> +}
> +
> +static const char *anonymize_refname(const char *refname)
> +{
> +       /*
> +        * If any of these prefixes is found, we will leave it intact
> +        * so that tags remain tags and so forth.
> +        */
> +       static const char *prefixes[] = {
> +               "refs/heads/",
> +               "refs/tags/",
> +               "refs/remotes/",
> +               "refs/"
> +       };
> +       static struct hashmap refs;
> +       static struct strbuf anon = STRBUF_INIT;
> +       int i;
> +
> +       /*
> +        * We also leave "master" as a special case, since it does not reveal
> +        * anything interesting.
> +        */
> +       if (!strcmp(refname, "refs/heads/master"))
> +               return refname;
> +
> +       strbuf_reset(&anon);
> +       for (i = 0; i < ARRAY_SIZE(prefixes); i++) {
> +               if (skip_prefix(refname, prefixes[i], &refname)) {
> +                       strbuf_addstr(&anon, prefixes[i]);
> +                       break;
> +               }
> +       }
> +
> +       anonymize_path(&anon, refname, &refs, anonymize_ref_component);
> +       return anon.buf;
> +}
> +
> +/*
> + * We do not even bother to cache commit messages, as they are unlikely
> + * to be repeated verbatim, and it is not that interesting when they are.
> + */
> +static char *anonymize_commit_message(const char *old)
> +{
> +       static int counter;
> +       return xstrfmt("subject %d\n\nbody\n", counter++);
> +}
> +
> +static struct hashmap idents;
> +static void *anonymize_ident(const void *old, size_t *len)
> +{
> +       static int counter;
> +       struct strbuf out = STRBUF_INIT;
> +       strbuf_addf(&out, "User %d <user%d@example.com>", counter, counter);
> +       counter++;
> +       return strbuf_detach(&out, len);
> +}
> +
> +/*
> + * Our strategy here is to anonymize the names and email addresses,
> + * but keep timestamps intact, as they influence things like traversal
> + * order (and by themselves should not be too revealing).
> + */
> +static void anonymize_ident_line(const char **beg, const char **end)
> +{
> +       static struct strbuf buffers[] = { STRBUF_INIT, STRBUF_INIT };
> +       static unsigned which_buffer;
> +
> +       struct strbuf *out;
> +       struct ident_split split;
> +       const char *end_of_header;
> +
> +       out = &buffers[which_buffer++];
> +       which_buffer %= ARRAY_SIZE(buffers);
> +       strbuf_reset(out);
> +
> +       /* skip "committer", "author", "tagger", etc */
> +       end_of_header = strchr(*beg, ' ');
> +       if (!end_of_header)
> +               die("BUG: malformed line fed to anonymize_ident_line: %.*s",
> +                   (int)(*end - *beg), *beg);
> +       end_of_header++;
> +       strbuf_add(out, *beg, end_of_header - *beg);
> +
> +       if (!split_ident_line(&split, end_of_header, *end - end_of_header) &&
> +           split.date_begin) {
> +               const char *ident;
> +               size_t len;
> +
> +               len = split.mail_end - split.name_begin;
> +               ident = anonymize_mem(&idents, anonymize_ident,
> +                                     split.name_begin, &len);
> +               strbuf_add(out, ident, len);
> +               strbuf_addch(out, ' ');
> +               strbuf_add(out, split.date_begin, split.tz_end - split.date_begin);
> +       } else {
> +               strbuf_addstr(out, "Malformed Ident <malformed@example.com> 0 -0000");
> +       }
> +
> +       *beg = out->buf;
> +       *end = out->buf + out->len;
> +}
> +
>  static void handle_commit(struct commit *commit, struct rev_info *rev)
>  {
>         int saved_output_format = rev->diffopt.output_format;
> @@ -287,6 +533,7 @@ static void handle_commit(struct commit *commit, struct rev_info *rev)
>         const char *encoding, *message;
>         char *reencoded = NULL;
>         struct commit_list *p;
> +       const char *refname;
>         int i;
>
>         rev->diffopt.output_format = DIFF_FORMAT_CALLBACK;
> @@ -326,13 +573,22 @@ static void handle_commit(struct commit *commit, struct rev_info *rev)
>                 if (!S_ISGITLINK(diff_queued_diff.queue[i]->two->mode))
>                         export_blob(diff_queued_diff.queue[i]->two->sha1);
>
> +       refname = commit->util;
> +       if (anonymize) {
> +               refname = anonymize_refname(refname);
> +               anonymize_ident_line(&committer, &committer_end);
> +               anonymize_ident_line(&author, &author_end);
> +       }
> +
>         mark_next_object(&commit->object);
> -       if (!is_encoding_utf8(encoding))
> +       if (anonymize)
> +               reencoded = anonymize_commit_message(message);
> +       else if (!is_encoding_utf8(encoding))
>                 reencoded = reencode_string(message, "UTF-8", encoding);
>         if (!commit->parents)
> -               printf("reset %s\n", (const char*)commit->util);
> +               printf("reset %s\n", refname);
>         printf("commit %s\nmark :%"PRIu32"\n%.*s\n%.*s\ndata %u\n%s",
> -              (const char *)commit->util, last_idnum,
> +              refname, last_idnum,
>                (int)(author_end - author), author,
>                (int)(committer_end - committer), committer,
>                (unsigned)(reencoded
> @@ -363,6 +619,14 @@ static void handle_commit(struct commit *commit, struct rev_info *rev)
>         show_progress();
>  }
>
> +static void *anonymize_tag(const void *old, size_t *len)
> +{
> +       static int counter;
> +       struct strbuf out = STRBUF_INIT;
> +       strbuf_addf(&out, "tag message %d", counter++);
> +       return strbuf_detach(&out, len);
> +}
> +
>  static void handle_tail(struct object_array *commits, struct rev_info *revs)
>  {
>         struct commit *commit;
> @@ -419,6 +683,17 @@ static void handle_tag(const char *name, struct tag *tag)
>         } else {
>                 tagger++;
>                 tagger_end = strchrnul(tagger, '\n');
> +               if (anonymize)
> +                       anonymize_ident_line(&tagger, &tagger_end);
> +       }
> +
> +       if (anonymize) {
> +               name = anonymize_refname(name);
> +               if (message) {
> +                       static struct hashmap tags;
> +                       message = anonymize_mem(&tags, anonymize_tag,
> +                                               message, &message_size);
> +               }
>         }
>
>         /* handle signed tags */
> @@ -584,6 +859,8 @@ static void handle_tags_and_duplicates(void)
>                         handle_tag(name, (struct tag *)object);
>                         break;
>                 case OBJ_COMMIT:
> +                       if (anonymize)
> +                               name = anonymize_refname(name);
>                         /* create refs pointing to already seen commits */
>                         commit = (struct commit *)object;
>                         printf("reset %s\nfrom :%d\n\n", name,
> @@ -719,6 +996,7 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix)
>                 OPT_BOOL(0, "no-data", &no_data, N_("Skip output of blob data")),
>                 OPT_STRING_LIST(0, "refspec", &refspecs_list, N_("refspec"),
>                              N_("Apply refspec to exported refs")),
> +               OPT_BOOL(0, "anonymize", &anonymize, N_("anonymize output")),
>                 OPT_END()
>         };
>
> diff --git a/t/t9351-fast-export-anonymize.sh b/t/t9351-fast-export-anonymize.sh
> new file mode 100755
> index 0000000..f76ffe4
> --- /dev/null
> +++ b/t/t9351-fast-export-anonymize.sh
> @@ -0,0 +1,117 @@
> +#!/bin/sh
> +
> +test_description='basic tests for fast-export --anonymize'
> +. ./test-lib.sh
> +
> +test_expect_success 'setup simple repo' '
> +       test_commit base &&
> +       test_commit foo &&
> +       git checkout -b other HEAD^ &&
> +       mkdir subdir &&
> +       test_commit subdir/bar &&
> +       test_commit subdir/xyzzy &&
> +       git tag -m "annotated tag" mytag
> +'
> +
> +test_expect_success 'export anonymized stream' '
> +       git fast-export --anonymize --all >stream
> +'
> +
> +# this also covers commit messages
> +test_expect_success 'stream omits path names' '
> +       ! fgrep base stream &&
> +       ! fgrep foo stream &&
> +       ! fgrep subdir stream &&
> +       ! fgrep bar stream &&
> +       ! fgrep xyzzy stream
> +'
> +
> +test_expect_success 'stream allows master as refname' '
> +       fgrep master stream
> +'
> +
> +test_expect_success 'stream omits other refnames' '
> +       ! fgrep other stream
> +'
> +
> +test_expect_success 'stream omits identities' '
> +       ! fgrep "$GIT_COMMITTER_NAME" stream &&
> +       ! fgrep "$GIT_COMMITTER_EMAIL" stream &&
> +       ! fgrep "$GIT_AUTHOR_NAME" stream &&
> +       ! fgrep "$GIT_AUTHOR_EMAIL" stream
> +'
> +
> +test_expect_success 'stream omits tag message' '
> +       ! fgrep "annotated tag" stream
> +'
> +
> +# NOTE: we chdir to the new, anonymized repository
> +# after this. All further tests should assume this.
> +test_expect_success 'import stream to new repository' '
> +       git init new &&
> +       cd new &&
> +       git fast-import <../stream
> +'
> +
> +test_expect_success 'result has two branches' '
> +       git for-each-ref --format="%(refname)" refs/heads >branches &&
> +       test_line_count = 2 branches &&
> +       other_branch=$(grep -v refs/heads/master branches)
> +'
> +
> +test_expect_success 'repo has original shape' '
> +       cat >expect <<-\EOF &&
> +       > subject 3
> +       > subject 2
> +       < subject 1
> +       - subject 0
> +       EOF
> +       git log --format="%m %s" --left-right --boundary \
> +               master...$other_branch >actual &&
> +       test_cmp expect actual
> +'
> +
> +test_expect_success 'root tree has original shape' '
> +       cat >expect <<-\EOF &&
> +       blob
> +       tree
> +       EOF
> +       git ls-tree $other_branch >root &&
> +       cut -d" " -f2 <root >actual &&
> +       test_cmp expect actual
> +'
> +
> +test_expect_success 'paths in subdir ended up in one tree' '
> +       cat >expect <<-\EOF &&
> +       blob
> +       blob
> +       EOF
> +       tree=$(grep tree root | cut -f2) &&
> +       git ls-tree $other_branch:$tree >tree &&
> +       cut -d" " -f2 <tree >actual &&
> +       test_cmp expect actual
> +'
> +
> +test_expect_success 'tag points to branch tip' '
> +       git rev-parse $other_branch >expect &&
> +       git for-each-ref --format="%(*objectname)" | grep . >actual &&
> +       test_cmp expect actual
> +'
> +
> +test_expect_success 'idents are shared' '
> +       git log --all --format="%an <%ae>" >authors &&
> +       sort -u authors >unique &&
> +       test_line_count = 1 unique &&
> +       git log --all --format="%cn <%ce>" >committers &&
> +       sort -u committers >unique &&
> +       test_line_count = 1 unique &&
> +       ! test_cmp authors committers
> +'
> +
> +test_expect_success 'commit timestamps are retained' '
> +       git log --all --format="%ct" >timestamps &&
> +       sort -u timestamps >unique &&
> +       test_line_count = 4 unique
> +'
> +
> +test_done
> --
> 2.1.0.346.ga0367b9
>



-- 
Duy


* Re: [PATCH v2] teach fast-export an --anonymize option
  2014-08-21 23:21     ` [PATCH v2] " Jeff King
  2014-08-22 13:06       ` Duy Nguyen
@ 2014-08-22 18:39       ` Philip Oakley
  2014-08-23  6:19         ` Jeff King
  2014-08-27 16:01       ` Junio C Hamano
  2 siblings, 1 reply; 21+ messages in thread
From: Philip Oakley @ 2014-08-22 18:39 UTC (permalink / raw)
  To: Jeff King, Junio C Hamano; +Cc: git, Duy Nguyen

From: "Jeff King" <peff@peff.net>: Friday, August 22, 2014 12:21 AM
> On Thu, Aug 21, 2014 at 06:49:10PM -0400, Jeff King wrote:
>
>> The few things I don't anonymize are:
>>
>>   1. ref prefixes. We see the same distribution of refs/heads vs
>>      refs/tags, etc.
>>
>>   2. refs/heads/master is left untouched, for convenience (and 
>> because
>>      it's not really a secret). The implementation is lazy, though, 
>> and
>>      would leave "refs/heads/master-supersecret", as well. I can 
>> tighten
>>      that if we really want to be careful.
>>
>>   3. gitlinks are left untouched, since sha1s cannot be reversed. 
>> This
>>      could leak some information (if your private repo points to a
>>      public, I can find out you have it as submodule). I doubt it
>>      matters, but we can also scramble the sha1s.
>
> Here's a re-roll that addresses the latter two. I don't think any are 
> a
> big deal, but it's much easier to say "it's handled" than try to 
> figure
> out whether and when it's important.
>
> This also includes the documentation update I sent earlier. The
> interdiff is a bit noisy, as I also converted the anonymize_mem 
> function
> to take void pointers (since it doesn't know or care what it's 
> storing,
> and this makes storing unsigned chars for sha1s easier).
>

Just a bit of bikeshedding for future improvements..

The .gitignore is another potential user problem area that may benefit 
from not being anonymised when problems strike. For example, there's a 
current problem on the git-users list 
https://groups.google.com/forum/#!topic/git-users/JJFIEsI5HRQ about "git 
clean vs git status re .gitignore", which would then also beg questions 
about retaining file extensions/suffixes (.txt, .o, .c, etc).

I've had a similar problem with an overzealous file compare routine 
where the same too much vs too little was an issue.

One thought is that the user should be able to, as an option, select the 
number of initial characters retained from filenames, and similarly, the 
option to retain the file extension, and possibly directory names, such 
that the full .gitignore still works in most cases, and the sort order 
works (as far as it goes on number of characters).

All things for future improvers to consider.

Philip

> -- >8 --
> Subject: teach fast-export an --anonymize option
>
> Sometimes users want to report a bug they experience on
> their repository, but they are not at liberty to share the
> contents of the repository. It would be useful if they could
> produce a repository that has a similar shape to its history
> and tree, but without leaking any information. This
> "anonymized" repository could then be shared with developers
> (assuming it still replicates the original problem).
>
> This patch implements an "--anonymize" option to
> fast-export, which generates a stream that can recreate such
> a repository. Producing a single stream makes it easy for
> the caller to verify that they are not leaking any useful
> information. You can get an overview of what will be shared
> by running a command like:
>
>  git fast-export --anonymize --all |
>  perl -pe 's/\d+/X/g' |
>  sort -u |
>  less
>
> which will show every unique line we generate, modulo any
> numbers (each anonymized token is assigned a number, like
> "User 0", and we replace it consistently in the output).
>
> In addition to anonymizing, this produces test cases that
> are relatively small (compared to the original repository)
> and fast to generate (compared to using filter-branch, or
> modifying the output of fast-export yourself). Here are
> numbers for git.git:
>
>  $ time git fast-export --anonymize --all \
>         --tag-of-filtered-object=drop >output
>  real    0m2.883s
>  user    0m2.828s
>  sys     0m0.052s
>
>  $ gzip output
>  $ ls -lh output.gz | awk '{print $5}'
>  2.9M
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
[...] 


* Re: [PATCH v2] teach fast-export an --anonymize option
  2014-08-22 18:39       ` Philip Oakley
@ 2014-08-23  6:19         ` Jeff King
  0 siblings, 0 replies; 21+ messages in thread
From: Jeff King @ 2014-08-23  6:19 UTC (permalink / raw)
  To: Philip Oakley; +Cc: Junio C Hamano, git, Duy Nguyen

On Fri, Aug 22, 2014 at 07:39:59PM +0100, Philip Oakley wrote:

> Just a bit of bikeshedding for future improvements..
> 
> The .gitignore is another potential user problem area that may benefit from
> not being anonymised when problems strike.

Thanks, I had meant to mention some implications for .gitmodules here,
but forgot about .gitignore (and .gitattributes!).

For any git-specific files like this, we have two challenges:

  1. We've munged their filenames (so .gitignore is probably path123
     now).

  2. We'll have munged their contents. So even if we left the file as
     .gitignore, it will have junk in it.

Fixing (1) is pretty easy. I structured all of the anonymizing functions
to take the old values, even though most of them just throw it away
entirely (which is a good way to be sure you're not leaking anything!).
But we could pass through a few specific ones.
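
To make that concrete, here is a rough standalone sketch (not part of
the patch) of a generate-style callback with the same signature that
anonymize_mem() expects, which lets a small list of names through
untouched. The function name, the list of names, and the "path%d"
format are all assumptions for illustration, and it uses plain malloc
rather than git's strbuf so that it compiles on its own:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical list of names to pass through unchanged. */
static const char *keep_names[] = {
	".gitignore", ".gitattributes", ".gitmodules"
};

/*
 * Same shape as the generate callbacks in the patch: takes the old
 * value and its length, returns a newly allocated replacement, and
 * stores the length of the replacement in *len.
 */
static void *anonymize_component_keep_special(const void *old, size_t *len)
{
	static int counter;
	size_t i;
	char *out;

	assert(old && len);
	for (i = 0; i < sizeof(keep_names) / sizeof(*keep_names); i++) {
		size_t n = strlen(keep_names[i]);
		if (*len == n && !memcmp(old, keep_names[i], n)) {
			/* known git-specific name: copy it verbatim */
			out = malloc(n + 1);
			if (!out)
				abort();
			memcpy(out, old, n);
			out[n] = '\0';
			return out;
		}
	}

	/* everything else gets a numbered placeholder */
	out = malloc(32);
	if (!out)
		abort();
	*len = (size_t)snprintf(out, 32, "path%d", counter++);
	return out;
}
```

This only addresses the filename half of the problem; the contents of
such a file would still be replaced.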

However, that doesn't help us if the contents are still munged (in fact
it's worse, because git will be annoyed that your .gitmodules file
contains unparseable crap). So how do we munge those files?

It depends on the individual file, I think, and what the user wants to
protect.

For .gitignore and .gitattributes, we can translate the pathnames
contained in the file. But that doesn't work in the general case,
because the file could have wildcards or other non-literal syntax.

For .gitmodules, I think it's all-or-nothing. Either the user is OK
sharing the URLs of their submodules or not (we could munge _just_ the
URLs, but it's not like the result would be remotely functional).

So while we might be able to get some things working on the .gitignore
side, I kind of think the simplest way forward is just adding finer
granularity for the user. Let them say "my filenames are OK to share
because they're part of the problem, but just make sure you hide my
commit messages and file contents".

And then if you're not munging filenames, we would turn off .gitignore
and .gitattributes munging. The implementation is not too hard.
export_blob does not have the path of the blob, but we generate the list
of blobs to export from a diff, so we can feed the path that way.  That
technically misses a case where you have a blob at path "X", we
anonymize it, and then you later move it to ".gitignore", which would
not be anonymized. But that is unlikely enough that it is probably not
worth worrying about.

> For example, there's a current
> problem on the git-users list
> https://groups.google.com/forum/#!topic/git-users/JJFIEsI5HRQ about "git
> clean vs git status re .gitignore", which would then also beg questions
> about retaining file extensions/suffixes (.txt, .o, .c, etc).

Yeah, I think retaining extensions would be a reasonable option (and you
would probably use it with an option to retain .gitattributes or
.gitignore whole if you were confident that those files did not have
anything private and just used extension wildcards).
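
As a minimal sketch of what retaining extensions might look like (again
not part of the patch; the function name and the "path%d" format are
made up for illustration, and plain malloc stands in for strbuf), a
generate-style callback could replace only the stem of each path
component:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical generate-style callback: replace the stem of a path
 * component with a numbered placeholder, but keep the final ".ext"
 * suffix. A leading dot (as in ".gitignore") is not treated as the
 * start of an extension.
 */
static void *anonymize_component_keep_ext(const void *old, size_t *len)
{
	static int counter;
	const char *name = old;
	const char *ext = NULL;
	size_t i, ext_len;
	char *out;
	int n;

	assert(old && len);
	/* find the last '.' that is not the first byte of the name */
	for (i = *len; i > 1; i--) {
		if (name[i - 1] == '.') {
			ext = name + i - 1;
			break;
		}
	}
	ext_len = ext ? *len - (size_t)(ext - name) : 0;

	out = malloc(32 + ext_len);
	if (!out)
		abort();
	n = snprintf(out, 32, "path%d", counter++);
	if (ext_len)
		memcpy(out + n, ext, ext_len);
	out[n + ext_len] = '\0';
	*len = (size_t)n + ext_len;
	return out;
}
```

With this, "foo.c" would come out as something like "path0.c", so a
"*.c" pattern in a retained .gitignore would still match it.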

> One thought is that the user should be able to, as an option, select the
> number of initial characters retained from filenames, and similarly, the
> option to retain the file extension, and possibly directory names, such that
> the full .gitignore still works in most cases, and the sort order works (as
> far as it goes on number of characters).

Yeah, those all seem reasonable.
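
For the "retain some initial characters" idea, a rough sketch (not part
of the patch; the keep_prefix_len knob, the function name, and the
"-path%d" format are assumptions) could look like the following. Since
the generate callback signature has no room for extra arguments, the
count is kept in a file-level variable here:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical knob: how many leading bytes of each name to keep. */
static size_t keep_prefix_len = 2;

/*
 * Hypothetical generate-style callback: keep the first few bytes of
 * the original name and replace the rest with a numbered placeholder.
 * Note that this counts bytes, not characters, so it could split a
 * multi-byte UTF-8 sequence.
 */
static void *anonymize_component_keep_prefix(const void *old, size_t *len)
{
	static int counter;
	size_t keep;
	char *out;
	int n;

	assert(old && len);
	keep = *len < keep_prefix_len ? *len : keep_prefix_len;
	out = malloc(keep + 32);
	if (!out)
		abort();
	memcpy(out, old, keep);
	n = snprintf(out + keep, 32, "-path%d", counter++);
	*len = keep + (size_t)n;
	return out;
}
```

The retained prefix comes first in the output, so the sort order is
preserved as far as those bytes go. The obvious cost is that the
prefix leaks part of the original name.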

> All things for future improvers to consider.

Agreed. I wanted to go through your list not because I want to implement
any of those things right now, but because I wanted to make sure that
there was nothing in my approach that would preclude us from building
those things later. And I don't think there is (and I'd be happy if
somebody else felt like building them on top, now or later).

-Peff


* Re: [PATCH v2] teach fast-export an --anonymize option
  2014-08-21 23:21     ` [PATCH v2] " Jeff King
  2014-08-22 13:06       ` Duy Nguyen
  2014-08-22 18:39       ` Philip Oakley
@ 2014-08-27 16:01       ` Junio C Hamano
  2014-08-27 16:58         ` Jeff King
  2 siblings, 1 reply; 21+ messages in thread
From: Junio C Hamano @ 2014-08-27 16:01 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Duy Nguyen

Jeff King <peff@peff.net> writes:

> diff --git a/t/t9351-fast-export-anonymize.sh b/t/t9351-fast-export-anonymize.sh
> new file mode 100755
> index 0000000..f76ffe4
> --- /dev/null
> +++ b/t/t9351-fast-export-anonymize.sh
> @@ -0,0 +1,117 @@
> +#!/bin/sh
> +
> +test_description='basic tests for fast-export --anonymize'
> +. ./test-lib.sh
> +
> +test_expect_success 'setup simple repo' '
> +	test_commit base &&
> +	test_commit foo &&
> +	git checkout -b other HEAD^ &&
> +	mkdir subdir &&
> +	test_commit subdir/bar &&
> +	test_commit subdir/xyzzy &&
> +	git tag -m "annotated tag" mytag
> +'
> +
> +test_expect_success 'export anonymized stream' '
> +	git fast-export --anonymize --all >stream
> +'
> +
> +# this also covers commit messages
> +test_expect_success 'stream omits path names' '
> +	! fgrep base stream &&
> +	! fgrep foo stream &&
> +	! fgrep subdir stream &&
> +	! fgrep bar stream &&
> +	! fgrep xyzzy stream
> +'

I know there are a few isolated places that already use "fgrep", but
let's not spread the disease. Neither "fgrep" nor "egrep" appears in
POSIX and they can easily be spelled more portably as "grep -F" and
"grep -E", respectively.

> +test_expect_success 'stream allows master as refname' '
> +	fgrep master stream
> +'
> +
> +test_expect_success 'stream omits other refnames' '
> +	! fgrep other stream
> +'

What should happen to mytag?

> +
> +test_expect_success 'stream omits identities' '
> +	! fgrep "$GIT_COMMITTER_NAME" stream &&
> +	! fgrep "$GIT_COMMITTER_EMAIL" stream &&
> +	! fgrep "$GIT_AUTHOR_NAME" stream &&
> +	! fgrep "$GIT_AUTHOR_EMAIL" stream
> +'
> +
> +test_expect_success 'stream omits tag message' '
> +	! fgrep "annotated tag" stream
> +'
> +
> +# NOTE: we chdir to the new, anonymized repository
> +# after this. All further tests should assume this.
> +test_expect_success 'import stream to new repository' '
> +	git init new &&
> +	cd new &&
> +	git fast-import <../stream
> +'
> +
> +test_expect_success 'result has two branches' '
> +	git for-each-ref --format="%(refname)" refs/heads >branches &&
> +	test_line_count = 2 branches &&
> +	other_branch=$(grep -v refs/heads/master branches)
> +'
> +
> +test_expect_success 'repo has original shape' '
> +	cat >expect <<-\EOF &&
> +	> subject 3
> +	> subject 2
> +	< subject 1
> +	- subject 0
> +	EOF
> +	git log --format="%m %s" --left-right --boundary \
> +		master...$other_branch >actual &&
> +	test_cmp expect actual
> +'

Yuck and Hmph.  Doing a shape-preserving conversion is very
important, but I wonder if we can verify without having to cast a
particular rewrite rule in stone.  We know we want to preserve
relative order of committer timestamps (to reproduce bugs that
depend on the traversal order), and it may be OK to reuse
exactly the same committer timestamps from the original, in which
case we can make sure that we create the original history with
appropriate "test_tick"s (I think test_commit does that for us) and
use "%ct" instead of "%s" here, perhaps?  That way we can later
change the rewrite rules of commit object payload without having to
adjust this test.

> +
> +test_expect_success 'root tree has original shape' '
> +	cat >expect <<-\EOF &&
> +	blob
> +	tree
> +	EOF
> +	git ls-tree $other_branch >root &&
> +	cut -d" " -f2 <root >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'paths in subdir ended up in one tree' '
> +	cat >expect <<-\EOF &&
> +	blob
> +	blob
> +	EOF
> +	tree=$(grep tree root | cut -f2) &&
> +	git ls-tree $other_branch:$tree >tree &&
> +	cut -d" " -f2 <tree >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'tag points to branch tip' '
> +	git rev-parse $other_branch >expect &&
> +	git for-each-ref --format="%(*objectname)" | grep . >actual &&
> +	test_cmp expect actual
> +'

I notice you haven't checked how many tags you have in the
repository, unlike the number of branches which you counted
earlier.

> +test_expect_success 'idents are shared' '
> +	git log --all --format="%an <%ae>" >authors &&
> +	sort -u authors >unique &&
> +	test_line_count = 1 unique &&
> +	git log --all --format="%cn <%ce>" >committers &&
> +	sort -u committers >unique &&
> +	test_line_count = 1 unique &&
> +	! test_cmp authors committers
> +'

Two commits by the same author must convert to two commits by the
same anonymized author, but that is not tested here; the history
made in the 'setup simple repo' step is a bit too simple to do that
anyway, though ;-).

> +test_expect_success 'commit timestamps are retained' '
> +	git log --all --format="%ct" >timestamps &&
> +	sort -u timestamps >unique &&
> +	test_line_count = 4 unique
> +'
> +
> +test_done


* Re: [PATCH v2] teach fast-export an --anonymize option
  2014-08-27 16:01       ` Junio C Hamano
@ 2014-08-27 16:58         ` Jeff King
  2014-08-27 17:01           ` [PATCH v3] " Jeff King
  0 siblings, 1 reply; 21+ messages in thread
From: Jeff King @ 2014-08-27 16:58 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Duy Nguyen

On Wed, Aug 27, 2014 at 09:01:02AM -0700, Junio C Hamano wrote:

> > +# this also covers commit messages
> > +test_expect_success 'stream omits path names' '
> > +	! fgrep base stream &&
> > +	! fgrep foo stream &&
> > +	! fgrep subdir stream &&
> > +	! fgrep bar stream &&
> > +	! fgrep xyzzy stream
> > +'
> 
> I know there are a few isolated places that already use "fgrep", but
> let's not spread the disease. Neither "fgrep" nor "egrep" appears in
> POSIX and they can easily be spelled more portably as "grep -F" and
> "grep -E", respectively.

I actually specifically used "fgrep" rather than "grep -F" because some
platforms have the former but not the latter. It has been a while since
commit 8753941, but I think Solaris was such a platform (and maybe AIX,
too, based on [1]).

[1] http://article.gmane.org/gmane.comp.version-control.git/97076

They could actually just be "grep" in this case, since we know the
patterns do not have any metacharacters. I was thinking that our
$GIT_COMMITTER_NAME (grepped below) did, but it is "C O Mitter", not "C.
O. Mitter".

I'll switch that in the re-roll.

> > +test_expect_success 'stream allows master as refname' '
> > +	fgrep master stream
> > +'
> > +
> > +test_expect_success 'stream omits other refnames' '
> > +	! fgrep other stream
> > +'
> 
> What should happen to mytag?

I added the tag test at the very end, and you can see that it did not
get as much attention. :) We should check "! grep mytag stream" here.
Will be in the re-roll.

> > +test_expect_success 'repo has original shape' '
> > +	cat >expect <<-\EOF &&
> > +	> subject 3
> > +	> subject 2
> > +	< subject 1
> > +	- subject 0
> > +	EOF
> > +	git log --format="%m %s" --left-right --boundary \
> > +		master...$other_branch >actual &&
> > +	test_cmp expect actual
> > +'
> 
> Yuck and Hmph.  Doing a shape-preserving conversion is very
> important, but I wonder if we can verify without having to cast a
> particular rewrite rule in stone.  We know we want to preserve
> relative order of committer timestamps (to reproduce bugs that
> depend on the traversal order), and it may be OK to reuse
> exactly the same committer timestamps from the original, in which
> case we can make sure that we create the original history with
> appropriate "test_tick"s (I think test_commit does that for us) and
> use "%ct" instead of "%s" here, perhaps?  That way we can later
> change the rewrite rules of commit object payload without having to
> adjust this test.

Yeah, everything is lost except the shape and committer timestamps. So
our choices are basically "%m" or "%m %ct". I think the latter is
probably the least bad choice. Will switch.

There's a potential problem with picking the same branches in the same
order. We can cheat a little here, though, because "master" retains its
same name. Since there's only one other branch, it's always the other
one we want.

The trees are not necessarily so lucky. We check that the root tree ends
up with a blob and a tree. But the anonymization does not necessarily
have to preserve their order (and it probably wouldn't under many
schemes). I think we can get away with just sorting the type list.

> > +test_expect_success 'tag points to branch tip' '
> > +	git rev-parse $other_branch >expect &&
> > +	git for-each-ref --format="%(*objectname)" | grep . >actual &&
> > +	test_cmp expect actual
> > +'
> 
> I notice you haven't checked how many tags you have in the
> repository, unlike the number of branches which you counted
> earlier.

Yes, because test_commit makes a bunch of extraneous tags, which I
thought made the test a little brittle.

> > +test_expect_success 'idents are shared' '
> > +	git log --all --format="%an <%ae>" >authors &&
> > +	sort -u authors >unique &&
> > +	test_line_count = 1 unique &&
> > +	git log --all --format="%cn <%ce>" >committers &&
> > +	sort -u committers >unique &&
> > +	test_line_count = 1 unique &&
> > +	! test_cmp authors committers
> > +'
> 
> Two commits by the same author must convert to two commits by the
> same anonymized author, but that is not tested here; the history
> made in the 'setup simple repo' step is a bit too simple to do that
> anyway, though ;-).

I think we do check that. The commits are all by the same author, so the
fact that "sort -u authors" has one line means that they all got
anonymized to the same author. What we don't check is that two
_different_ authors get different idents. I check that the two idents
get different values (authors != committers), but it is not clear from a
blackbox test that it is because the anonymization is working (it might
be because authors and committers come from a different pool of strings,
though I know having written the code that that is not the case).

I also do not test that the same ident as an author and as a committer
ends up the same. Honestly, I didn't really feel that it was worth much
bother. The shape of history and the tree is the most interesting thing
here, and the primary thing about idents is that we wipe them.

Maybe it was just me being lazy, though.

> > +test_expect_success 'commit timestamps are retained' '
> > +	git log --all --format="%ct" >timestamps &&
> > +	sort -u timestamps >unique &&
> > +	test_line_count = 4 unique
> > +'

I think we can drop this one if we check %ct in the graph-shape test
above. It's redundant.

I'll send v3 in a minute. Here's the interdiff.

diff --git a/t/t9351-fast-export-anonymize.sh b/t/t9351-fast-export-anonymize.sh
index f76ffe4..897dc50 100755
--- a/t/t9351-fast-export-anonymize.sh
+++ b/t/t9351-fast-export-anonymize.sh
@@ -19,30 +19,31 @@ test_expect_success 'export anonymized stream' '
 
 # this also covers commit messages
 test_expect_success 'stream omits path names' '
-	! fgrep base stream &&
-	! fgrep foo stream &&
-	! fgrep subdir stream &&
-	! fgrep bar stream &&
-	! fgrep xyzzy stream
+	! grep base stream &&
+	! grep foo stream &&
+	! grep subdir stream &&
+	! grep bar stream &&
+	! grep xyzzy stream
 '
 
 test_expect_success 'stream allows master as refname' '
-	fgrep master stream
+	grep master stream
 '
 
 test_expect_success 'stream omits other refnames' '
-	! fgrep other stream
+	! grep other stream &&
+	! grep mytag stream
 '
 
 test_expect_success 'stream omits identities' '
-	! fgrep "$GIT_COMMITTER_NAME" stream &&
-	! fgrep "$GIT_COMMITTER_EMAIL" stream &&
-	! fgrep "$GIT_AUTHOR_NAME" stream &&
-	! fgrep "$GIT_AUTHOR_EMAIL" stream
+	! grep "$GIT_COMMITTER_NAME" stream &&
+	! grep "$GIT_COMMITTER_EMAIL" stream &&
+	! grep "$GIT_AUTHOR_NAME" stream &&
+	! grep "$GIT_AUTHOR_EMAIL" stream
 '
 
 test_expect_success 'stream omits tag message' '
-	! fgrep "annotated tag" stream
+	! grep "annotated tag" stream
 '
 
 # NOTE: we chdir to the new, anonymized repository
@@ -59,25 +60,25 @@ test_expect_success 'result has two branches' '
 	other_branch=$(grep -v refs/heads/master branches)
 '
 
-test_expect_success 'repo has original shape' '
-	cat >expect <<-\EOF &&
-	> subject 3
-	> subject 2
-	< subject 1
-	- subject 0
-	EOF
-	git log --format="%m %s" --left-right --boundary \
-		master...$other_branch >actual &&
+test_expect_success 'repo has original shape and timestamps' '
+	shape () {
+		git log --format="%m %ct" --left-right --boundary "$@"
+	} &&
+	(cd .. && shape master...other) >expect &&
+	shape master...$other_branch >actual &&
 	test_cmp expect actual
 '
 
 test_expect_success 'root tree has original shape' '
+	# the output entries are not necessarily in the same
+	# order, but we know at least that we will have one tree
+	# and one blob, so just check the sorted order
 	cat >expect <<-\EOF &&
 	blob
 	tree
 	EOF
 	git ls-tree $other_branch >root &&
-	cut -d" " -f2 <root >actual &&
+	cut -d" " -f2 <root | sort >actual &&
 	test_cmp expect actual
 '
 
@@ -108,10 +109,4 @@ test_expect_success 'idents are shared' '
 	! test_cmp authors committers
 '
 
-test_expect_success 'commit timestamps are retained' '
-	git log --all --format="%ct" >timestamps &&
-	sort -u timestamps >unique &&
-	test_line_count = 4 unique
-'
-
 test_done


* [PATCH v3] teach fast-export an --anonymize option
  2014-08-27 16:58         ` Jeff King
@ 2014-08-27 17:01           ` Jeff King
  2014-08-28 10:30             ` Duy Nguyen
  0 siblings, 1 reply; 21+ messages in thread
From: Jeff King @ 2014-08-27 17:01 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Duy Nguyen

Sometimes users want to report a bug they experience on
their repository, but they are not at liberty to share the
contents of the repository. It would be useful if they could
produce a repository that has a similar shape to its history
and tree, but without leaking any information. This
"anonymized" repository could then be shared with developers
(assuming it still replicates the original problem).

This patch implements an "--anonymize" option to
fast-export, which generates a stream that can recreate such
a repository. Producing a single stream makes it easy for
the caller to verify that they are not leaking any useful
information. You can get an overview of what will be shared
by running a command like:

  git fast-export --anonymize --all |
  perl -pe 's/\d+/X/g' |
  sort -u |
  less

which will show every unique line we generate, modulo any
numbers (each anonymized token is assigned a number, like
"User 0", and we replace it consistently in the output).

In addition to anonymizing, this produces test cases that
are relatively small (compared to the original repository)
and fast to generate (compared to using filter-branch, or
modifying the output of fast-export yourself). Here are
numbers for git.git:

  $ time git fast-export --anonymize --all \
         --tag-of-filtered-object=drop >output
  real    0m2.883s
  user    0m2.828s
  sys     0m0.052s

  $ gzip output
  $ ls -lh output.gz | awk '{print $5}'
  2.9M

Signed-off-by: Jeff King <peff@peff.net>
---
This has all of the test tweaks I responded about. The only thing I
didn't do is beef up the ident tests, but I'm not really sure it's worth
the trouble.

 Documentation/git-fast-export.txt |   6 +
 builtin/fast-export.c             | 300 ++++++++++++++++++++++++++++++++++++--
 t/t9351-fast-export-anonymize.sh  | 112 ++++++++++++++
 3 files changed, 407 insertions(+), 11 deletions(-)
 create mode 100755 t/t9351-fast-export-anonymize.sh

diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
index 221506b..52831fa 100644
--- a/Documentation/git-fast-export.txt
+++ b/Documentation/git-fast-export.txt
@@ -105,6 +105,12 @@ marks the same across runs.
 	in the commit (as opposed to just listing the files which are
 	different from the commit's first parent).
 
+--anonymize::
+	Replace all refnames, paths, blob contents, commit and tag
+	messages, names, and email addresses in the output with
+	anonymized data, while still retaining the shape of history and
+	of the stored tree.
+
 --refspec::
 	Apply the specified refspec to each ref exported. Multiple of them can
 	be specified.
diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 92b4624..b8182c2 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -18,6 +18,7 @@
 #include "parse-options.h"
 #include "quote.h"
 #include "remote.h"
+#include "blob.h"
 
 static const char *fast_export_usage[] = {
 	N_("git fast-export [rev-list-opts]"),
@@ -34,6 +35,7 @@ static int full_tree;
 static struct string_list extra_refs = STRING_LIST_INIT_NODUP;
 static struct refspec *refspecs;
 static int refspecs_nr;
+static int anonymize;
 
 static int parse_opt_signed_tag_mode(const struct option *opt,
 				     const char *arg, int unset)
@@ -81,6 +83,76 @@ static int has_unshown_parent(struct commit *commit)
 	return 0;
 }
 
+struct anonymized_entry {
+	struct hashmap_entry hash;
+	const char *orig;
+	size_t orig_len;
+	const char *anon;
+	size_t anon_len;
+};
+
+static int anonymized_entry_cmp(const void *va, const void *vb,
+				const void *data)
+{
+	const struct anonymized_entry *a = va, *b = vb;
+	return a->orig_len != b->orig_len ||
+		memcmp(a->orig, b->orig, a->orig_len);
+}
+
+/*
+ * Basically keep a cache of X->Y so that we can repeatedly replace
+ * the same anonymized string with another. The actual generation
+ * is farmed out to the generate function.
+ */
+static const void *anonymize_mem(struct hashmap *map,
+				 void *(*generate)(const void *, size_t *),
+				 const void *orig, size_t *len)
+{
+	struct anonymized_entry key, *ret;
+
+	if (!map->cmpfn)
+		hashmap_init(map, anonymized_entry_cmp, 0);
+
+	hashmap_entry_init(&key, memhash(orig, *len));
+	key.orig = orig;
+	key.orig_len = *len;
+	ret = hashmap_get(map, &key, NULL);
+
+	if (!ret) {
+		ret = xmalloc(sizeof(*ret));
+		hashmap_entry_init(&ret->hash, key.hash.hash);
+		ret->orig = xstrdup(orig);
+		ret->orig_len = *len;
+		ret->anon = generate(orig, len);
+		ret->anon_len = *len;
+		hashmap_put(map, ret);
+	}
+
+	*len = ret->anon_len;
+	return ret->anon;
+}
+
+/*
+ * We anonymize each component of a path individually,
+ * so that paths a/b and a/c will share a common root.
+ * The paths are cached via anonymize_mem so that repeated
+ * lookups for "a" will yield the same value.
+ */
+static void anonymize_path(struct strbuf *out, const char *path,
+			   struct hashmap *map,
+			   void *(*generate)(const void *, size_t *))
+{
+	while (*path) {
+		const char *end_of_component = strchrnul(path, '/');
+		size_t len = end_of_component - path;
+		const char *c = anonymize_mem(map, generate, path, &len);
+		strbuf_add(out, c, len);
+		path = end_of_component;
+		if (*path)
+			strbuf_addch(out, *path++);
+	}
+}
+
 /* Since intptr_t is C99, we do not use it here */
 static inline uint32_t *mark_to_ptr(uint32_t mark)
 {
@@ -119,6 +191,26 @@ static void show_progress(void)
 		printf("progress %d objects\n", counter);
 }
 
+/*
+ * Ideally we would want some transformation of the blob data here
+ * that is irreversible, but would still be the same size and have
+ * the same data relationship to other blobs (so that we get the same
+ * delta and packing behavior as the original). But the first and last
+ * requirements there are probably mutually exclusive, so let's take
+ * the easy way out for now, and just generate arbitrary content.
+ *
+ * There's no need to cache this result with anonymize_mem, since
+ * we already handle blob content caching with marks.
+ */
+static char *anonymize_blob(unsigned long *size)
+{
+	static int counter;
+	struct strbuf out = STRBUF_INIT;
+	strbuf_addf(&out, "anonymous blob %d", counter++);
+	*size = out.len;
+	return strbuf_detach(&out, NULL);
+}
+
 static void export_blob(const unsigned char *sha1)
 {
 	unsigned long size;
@@ -137,12 +229,19 @@ static void export_blob(const unsigned char *sha1)
 	if (object && object->flags & SHOWN)
 		return;
 
-	buf = read_sha1_file(sha1, &type, &size);
-	if (!buf)
-		die ("Could not read blob %s", sha1_to_hex(sha1));
-	if (check_sha1_signature(sha1, buf, size, typename(type)) < 0)
-		die("sha1 mismatch in blob %s", sha1_to_hex(sha1));
-	object = parse_object_buffer(sha1, type, size, buf, &eaten);
+	if (anonymize) {
+		buf = anonymize_blob(&size);
+		object = (struct object *)lookup_blob(sha1);
+		eaten = 0;
+	} else {
+		buf = read_sha1_file(sha1, &type, &size);
+		if (!buf)
+			die ("Could not read blob %s", sha1_to_hex(sha1));
+		if (check_sha1_signature(sha1, buf, size, typename(type)) < 0)
+			die("sha1 mismatch in blob %s", sha1_to_hex(sha1));
+		object = parse_object_buffer(sha1, type, size, buf, &eaten);
+	}
+
 	if (!object)
 		die("Could not read blob %s", sha1_to_hex(sha1));
 
@@ -190,7 +289,7 @@ static int depth_first(const void *a_, const void *b_)
 	return (a->status == 'R') - (b->status == 'R');
 }
 
-static void print_path(const char *path)
+static void print_path_1(const char *path)
 {
 	int need_quote = quote_c_style(path, NULL, NULL, 0);
 	if (need_quote)
@@ -201,6 +300,43 @@ static void print_path(const char *path)
 		printf("%s", path);
 }
 
+static void *anonymize_path_component(const void *path, size_t *len)
+{
+	static int counter;
+	struct strbuf out = STRBUF_INIT;
+	strbuf_addf(&out, "path%d", counter++);
+	return strbuf_detach(&out, len);
+}
+
+static void print_path(const char *path)
+{
+	if (!anonymize)
+		print_path_1(path);
+	else {
+		static struct hashmap paths;
+		static struct strbuf anon = STRBUF_INIT;
+
+		anonymize_path(&anon, path, &paths, anonymize_path_component);
+		print_path_1(anon.buf);
+		strbuf_reset(&anon);
+	}
+}
+
+static void *generate_fake_sha1(const void *old, size_t *len)
+{
+	static uint32_t counter = 1; /* avoid null sha1 */
+	unsigned char *out = xcalloc(20, 1);
+	put_be32(out + 16, counter++);
+	return out;
+}
+
+static const unsigned char *anonymize_sha1(const unsigned char *sha1)
+{
+	static struct hashmap sha1s;
+	size_t len = 20;
+	return anonymize_mem(&sha1s, generate_fake_sha1, sha1, &len);
+}
+
 static void show_filemodify(struct diff_queue_struct *q,
 			    struct diff_options *options, void *data)
 {
@@ -245,7 +381,9 @@ static void show_filemodify(struct diff_queue_struct *q,
 			 */
 			if (no_data || S_ISGITLINK(spec->mode))
 				printf("M %06o %s ", spec->mode,
-				       sha1_to_hex(spec->sha1));
+				       sha1_to_hex(anonymize ?
+						   anonymize_sha1(spec->sha1) :
+						   spec->sha1));
 			else {
 				struct object *object = lookup_object(spec->sha1);
 				printf("M %06o :%d ", spec->mode,
@@ -279,6 +417,114 @@ static const char *find_encoding(const char *begin, const char *end)
 	return bol;
 }
 
+static void *anonymize_ref_component(const void *old, size_t *len)
+{
+	static int counter;
+	struct strbuf out = STRBUF_INIT;
+	strbuf_addf(&out, "ref%d", counter++);
+	return strbuf_detach(&out, len);
+}
+
+static const char *anonymize_refname(const char *refname)
+{
+	/*
+	 * If any of these prefixes is found, we will leave it intact
+	 * so that tags remain tags and so forth.
+	 */
+	static const char *prefixes[] = {
+		"refs/heads/",
+		"refs/tags/",
+		"refs/remotes/",
+		"refs/"
+	};
+	static struct hashmap refs;
+	static struct strbuf anon = STRBUF_INIT;
+	int i;
+
+	/*
+	 * We also leave "master" as a special case, since it does not reveal
+	 * anything interesting.
+	 */
+	if (!strcmp(refname, "refs/heads/master"))
+		return refname;
+
+	strbuf_reset(&anon);
+	for (i = 0; i < ARRAY_SIZE(prefixes); i++) {
+		if (skip_prefix(refname, prefixes[i], &refname)) {
+			strbuf_addstr(&anon, prefixes[i]);
+			break;
+		}
+	}
+
+	anonymize_path(&anon, refname, &refs, anonymize_ref_component);
+	return anon.buf;
+}
+
+/*
+ * We do not even bother to cache commit messages, as they are unlikely
+ * to be repeated verbatim, and it is not that interesting when they are.
+ */
+static char *anonymize_commit_message(const char *old)
+{
+	static int counter;
+	return xstrfmt("subject %d\n\nbody\n", counter++);
+}
+
+static struct hashmap idents;
+static void *anonymize_ident(const void *old, size_t *len)
+{
+	static int counter;
+	struct strbuf out = STRBUF_INIT;
+	strbuf_addf(&out, "User %d <user%d@example.com>", counter, counter);
+	counter++;
+	return strbuf_detach(&out, len);
+}
+
+/*
+ * Our strategy here is to anonymize the names and email addresses,
+ * but keep timestamps intact, as they influence things like traversal
+ * order (and by themselves should not be too revealing).
+ */
+static void anonymize_ident_line(const char **beg, const char **end)
+{
+	static struct strbuf buffers[] = { STRBUF_INIT, STRBUF_INIT };
+	static unsigned which_buffer;
+
+	struct strbuf *out;
+	struct ident_split split;
+	const char *end_of_header;
+
+	out = &buffers[which_buffer++];
+	which_buffer %= ARRAY_SIZE(buffers);
+	strbuf_reset(out);
+
+	/* skip "committer", "author", "tagger", etc */
+	end_of_header = strchr(*beg, ' ');
+	if (!end_of_header)
+		die("BUG: malformed line fed to anonymize_ident_line: %.*s",
+		    (int)(*end - *beg), *beg);
+	end_of_header++;
+	strbuf_add(out, *beg, end_of_header - *beg);
+
+	if (!split_ident_line(&split, end_of_header, *end - end_of_header) &&
+	    split.date_begin) {
+		const char *ident;
+		size_t len;
+
+		len = split.mail_end - split.name_begin;
+		ident = anonymize_mem(&idents, anonymize_ident,
+				      split.name_begin, &len);
+		strbuf_add(out, ident, len);
+		strbuf_addch(out, ' ');
+		strbuf_add(out, split.date_begin, split.tz_end - split.date_begin);
+	} else {
+		strbuf_addstr(out, "Malformed Ident <malformed@example.com> 0 -0000");
+	}
+
+	*beg = out->buf;
+	*end = out->buf + out->len;
+}
+
 static void handle_commit(struct commit *commit, struct rev_info *rev)
 {
 	int saved_output_format = rev->diffopt.output_format;
@@ -287,6 +533,7 @@ static void handle_commit(struct commit *commit, struct rev_info *rev)
 	const char *encoding, *message;
 	char *reencoded = NULL;
 	struct commit_list *p;
+	const char *refname;
 	int i;
 
 	rev->diffopt.output_format = DIFF_FORMAT_CALLBACK;
@@ -326,13 +573,22 @@ static void handle_commit(struct commit *commit, struct rev_info *rev)
 		if (!S_ISGITLINK(diff_queued_diff.queue[i]->two->mode))
 			export_blob(diff_queued_diff.queue[i]->two->sha1);
 
+	refname = commit->util;
+	if (anonymize) {
+		refname = anonymize_refname(refname);
+		anonymize_ident_line(&committer, &committer_end);
+		anonymize_ident_line(&author, &author_end);
+	}
+
 	mark_next_object(&commit->object);
-	if (!is_encoding_utf8(encoding))
+	if (anonymize)
+		reencoded = anonymize_commit_message(message);
+	else if (!is_encoding_utf8(encoding))
 		reencoded = reencode_string(message, "UTF-8", encoding);
 	if (!commit->parents)
-		printf("reset %s\n", (const char*)commit->util);
+		printf("reset %s\n", refname);
 	printf("commit %s\nmark :%"PRIu32"\n%.*s\n%.*s\ndata %u\n%s",
-	       (const char *)commit->util, last_idnum,
+	       refname, last_idnum,
 	       (int)(author_end - author), author,
 	       (int)(committer_end - committer), committer,
 	       (unsigned)(reencoded
@@ -363,6 +619,14 @@ static void handle_commit(struct commit *commit, struct rev_info *rev)
 	show_progress();
 }
 
+static void *anonymize_tag(const void *old, size_t *len)
+{
+	static int counter;
+	struct strbuf out = STRBUF_INIT;
+	strbuf_addf(&out, "tag message %d", counter++);
+	return strbuf_detach(&out, len);
+}
+
 static void handle_tail(struct object_array *commits, struct rev_info *revs)
 {
 	struct commit *commit;
@@ -419,6 +683,17 @@ static void handle_tag(const char *name, struct tag *tag)
 	} else {
 		tagger++;
 		tagger_end = strchrnul(tagger, '\n');
+		if (anonymize)
+			anonymize_ident_line(&tagger, &tagger_end);
+	}
+
+	if (anonymize) {
+		name = anonymize_refname(name);
+		if (message) {
+			static struct hashmap tags;
+			message = anonymize_mem(&tags, anonymize_tag,
+						message, &message_size);
+		}
 	}
 
 	/* handle signed tags */
@@ -584,6 +859,8 @@ static void handle_tags_and_duplicates(void)
 			handle_tag(name, (struct tag *)object);
 			break;
 		case OBJ_COMMIT:
+			if (anonymize)
+				name = anonymize_refname(name);
 			/* create refs pointing to already seen commits */
 			commit = (struct commit *)object;
 			printf("reset %s\nfrom :%d\n\n", name,
@@ -719,6 +996,7 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix)
 		OPT_BOOL(0, "no-data", &no_data, N_("Skip output of blob data")),
 		OPT_STRING_LIST(0, "refspec", &refspecs_list, N_("refspec"),
 			     N_("Apply refspec to exported refs")),
+		OPT_BOOL(0, "anonymize", &anonymize, N_("anonymize output")),
 		OPT_END()
 	};
 
diff --git a/t/t9351-fast-export-anonymize.sh b/t/t9351-fast-export-anonymize.sh
new file mode 100755
index 0000000..897dc50
--- /dev/null
+++ b/t/t9351-fast-export-anonymize.sh
@@ -0,0 +1,112 @@
+#!/bin/sh
+
+test_description='basic tests for fast-export --anonymize'
+. ./test-lib.sh
+
+test_expect_success 'setup simple repo' '
+	test_commit base &&
+	test_commit foo &&
+	git checkout -b other HEAD^ &&
+	mkdir subdir &&
+	test_commit subdir/bar &&
+	test_commit subdir/xyzzy &&
+	git tag -m "annotated tag" mytag
+'
+
+test_expect_success 'export anonymized stream' '
+	git fast-export --anonymize --all >stream
+'
+
+# this also covers commit messages
+test_expect_success 'stream omits path names' '
+	! grep base stream &&
+	! grep foo stream &&
+	! grep subdir stream &&
+	! grep bar stream &&
+	! grep xyzzy stream
+'
+
+test_expect_success 'stream allows master as refname' '
+	grep master stream
+'
+
+test_expect_success 'stream omits other refnames' '
+	! grep other stream &&
+	! grep mytag stream
+'
+
+test_expect_success 'stream omits identities' '
+	! grep "$GIT_COMMITTER_NAME" stream &&
+	! grep "$GIT_COMMITTER_EMAIL" stream &&
+	! grep "$GIT_AUTHOR_NAME" stream &&
+	! grep "$GIT_AUTHOR_EMAIL" stream
+'
+
+test_expect_success 'stream omits tag message' '
+	! grep "annotated tag" stream
+'
+
+# NOTE: we chdir to the new, anonymized repository
+# after this. All further tests should assume this.
+test_expect_success 'import stream to new repository' '
+	git init new &&
+	cd new &&
+	git fast-import <../stream
+'
+
+test_expect_success 'result has two branches' '
+	git for-each-ref --format="%(refname)" refs/heads >branches &&
+	test_line_count = 2 branches &&
+	other_branch=$(grep -v refs/heads/master branches)
+'
+
+test_expect_success 'repo has original shape and timestamps' '
+	shape () {
+		git log --format="%m %ct" --left-right --boundary "$@"
+	} &&
+	(cd .. && shape master...other) >expect &&
+	shape master...$other_branch >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'root tree has original shape' '
+	# the output entries are not necessarily in the same
+	# order, but we know at least that we will have one tree
+	# and one blob, so just check the sorted order
+	cat >expect <<-\EOF &&
+	blob
+	tree
+	EOF
+	git ls-tree $other_branch >root &&
+	cut -d" " -f2 <root | sort >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'paths in subdir ended up in one tree' '
+	cat >expect <<-\EOF &&
+	blob
+	blob
+	EOF
+	tree=$(grep tree root | cut -f2) &&
+	git ls-tree $other_branch:$tree >tree &&
+	cut -d" " -f2 <tree >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'tag points to branch tip' '
+	git rev-parse $other_branch >expect &&
+	git for-each-ref --format="%(*objectname)" | grep . >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'idents are shared' '
+	git log --all --format="%an <%ae>" >authors &&
+	sort -u authors >unique &&
+	test_line_count = 1 unique &&
+	git log --all --format="%cn <%ce>" >committers &&
+	sort -u committers >unique &&
+	test_line_count = 1 unique &&
+	! test_cmp authors committers
+'
+
+test_done
-- 
2.1.0.346.ga0367b9
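
The X->Y cache and the fake SHA-1 generator from the patch can be approximated in standalone C for readers without a git checkout. This is only a sketch under stated assumptions: the fixed-size linear table, `MAX_ENTRIES`, `fill_fake_sha1()` and `fake_sha1_for()` are hypothetical stand-ins for git's hashmap-based `anonymize_mem()` and `put_be32()`, not the code in the patch. It shows two properties the patch relies on: the same 20-byte input always maps to the same fake id, and keys are compared by length rather than as C strings, since a binary id may contain NUL bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SHA1_LEN 20
#define MAX_ENTRIES 64 /* hypothetical fixed capacity; the patch uses a hashmap */

struct sha1_entry {
	unsigned char orig[SHA1_LEN];
	unsigned char anon[SHA1_LEN];
};

static struct sha1_entry table[MAX_ENTRIES];
static size_t table_nr;

/* Zero the id, then store a 32-bit counter big-endian in its last four bytes. */
static void fill_fake_sha1(unsigned char *out, uint32_t n)
{
	memset(out, 0, SHA1_LEN);
	out[16] = (n >> 24) & 0xff;
	out[17] = (n >> 16) & 0xff;
	out[18] = (n >> 8) & 0xff;
	out[19] = n & 0xff;
}

/*
 * Return the fake id for a 20-byte original, creating one on first use.
 * Keys are compared with memcmp() over all 20 bytes because a binary
 * id may contain NUL bytes. Returns NULL when the table is full.
 */
static const unsigned char *fake_sha1_for(const unsigned char *orig)
{
	static uint32_t counter = 1; /* start at 1 to avoid the all-zero id */
	size_t i;

	for (i = 0; i < table_nr; i++)
		if (!memcmp(table[i].orig, orig, SHA1_LEN))
			return table[i].anon;
	if (table_nr == MAX_ENTRIES)
		return NULL;
	memcpy(table[table_nr].orig, orig, SHA1_LEN);
	fill_fake_sha1(table[table_nr].anon, counter++);
	return table[table_nr++].anon;
}
```

Calling `fake_sha1_for()` twice with the same bytes returns the same pointer, while two ids that differ only after an embedded NUL byte still receive distinct fake ids.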

* Re: [PATCH v3] teach fast-export an --anonymize option
  2014-08-27 17:01           ` [PATCH v3] " Jeff King
@ 2014-08-28 10:30             ` Duy Nguyen
  2014-08-28 12:32               ` Jeff King
  0 siblings, 1 reply; 21+ messages in thread
From: Duy Nguyen @ 2014-08-28 10:30 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Git Mailing List

On Thu, Aug 28, 2014 at 12:01 AM, Jeff King <peff@peff.net> wrote:
> You can get an overview of what will be shared
> by running a command like:
>
>   git fast-export --anonymize --all |
>   perl -pe 's/\d+/X/g' |
>   sort -u |
>   less
>
> which will show every unique line we generate, modulo any
> numbers (each anonymized token is assigned a number, like
> "User 0", and we replace it consistently in the output).

I feel like this should be part of git-fast-export.txt, just to
increase the user's confidence in the tool (and I don't expect most
users to read this commit message).
-- 
Duy

* Re: [PATCH v3] teach fast-export an --anonymize option
  2014-08-28 10:30             ` Duy Nguyen
@ 2014-08-28 12:32               ` Jeff King
  2014-08-28 16:46                 ` Ramsay Jones
                                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Jeff King @ 2014-08-28 12:32 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Junio C Hamano, Git Mailing List

On Thu, Aug 28, 2014 at 05:30:44PM +0700, Duy Nguyen wrote:

> On Thu, Aug 28, 2014 at 12:01 AM, Jeff King <peff@peff.net> wrote:
> > You can get an overview of what will be shared
> > by running a command like:
> >
> >   git fast-export --anonymize --all |
> >   perl -pe 's/\d+/X/g' |
> >   sort -u |
> >   less
> >
> > which will show every unique line we generate, modulo any
> > numbers (each anonymized token is assigned a number, like
> > "User 0", and we replace it consistently in the output).
> 
> I feel like this should be part of git-fast-export.txt, just to
> increase the user's confidence in the tool (and I don't expect most
> users to read this commit message).

Hmph. Whenever I say "I think this patch is done", suddenly the comments
start pouring in. :)

I think you are right, though, and we could stand to explain
the feature a little more in the documentation in general.
How about this patch on top (or squashed in):

-- >8 --
Subject: docs/fast-export: explain --anonymize more completely

The original commit made mention of this option, but not why
one might want it or how they might use it. Let's try to be
a little more thorough, and also explain how to confirm that
the output really is anonymous.

Signed-off-by: Jeff King <peff@peff.net>
---
 Documentation/git-fast-export.txt | 63 ++++++++++++++++++++++++++++++++++++---
 1 file changed, 59 insertions(+), 4 deletions(-)

diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
index 52831fa..dbe9a46 100644
--- a/Documentation/git-fast-export.txt
+++ b/Documentation/git-fast-export.txt
@@ -106,10 +106,9 @@ marks the same across runs.
 	different from the commit's first parent).
 
 --anonymize::
-	Replace all refnames, paths, blob contents, commit and tag
-	messages, names, and email addresses in the output with
-	anonymized data, while still retaining the shape of history and
-	of the stored tree.
+	Anonymize the contents of the repository while still retaining
+	the shape of the history and stored tree.  See the section on
+	`ANONYMIZING` below.
 
 --refspec::
 	Apply the specified refspec to each ref exported. Multiple of them can
@@ -147,6 +146,62 @@ referenced by that revision range contains the string
 'refs/heads/master'.
 
 
+ANONYMIZING
+-----------
+
+If the `--anonymize` option is given, git will attempt to remove all
+identifying information from the repository while still retaining enough
+of the original tree and history patterns to reproduce some bugs. The
+goal is that a git bug which is found on a private repository will
+persist in the anonymized repository, and the latter can be shared with
+git developers to help solve the bug.
+
+With this option, git will replace all refnames, paths, blob contents,
+commit and tag messages, names, and email addresses in the output with
+anonymized data.  Two instances of the same string will be replaced
+equivalently (e.g., two commits with the same author will have the same
+anonymized author in the output, but bear no resemblance to the original
+author string). The relationship between commits, branches, and tags is
+retained, as well as the commit timestamps (but the commit messages and
+refnames bear no resemblance to the originals). The relative makeup of
+the tree is retained (e.g., if you have a root tree with 10 files and 3
+trees, so will the output), but their names and the contents of the
+files will be replaced.
+
+If you think you have found a git bug, you can start by exporting an
+anonymized stream of the whole repository:
+
+---------------------------------------------------
+$ git fast-export --anonymize --all >anon-stream
+---------------------------------------------------
+
+Then confirm that the bug persists in a repository created from that
+stream (many bugs will not, as they really do depend on the exact
+repository contents):
+
+---------------------------------------------------
+$ git init anon-repo
+$ cd anon-repo
+$ git fast-import <../anon-stream
+$ ... test your bug ...
+---------------------------------------------------
+
+If the anonymized repository shows the bug, it may be worth sharing
+`anon-stream` along with a regular bug report. Note that the anonymized
+stream compresses very well, so gzipping it is encouraged. If you want
+to examine the stream to see that it does not contain any private data,
+you can peruse it directly before sending. You may also want to try:
+
+---------------------------------------------------
+$ perl -pe 's/\d+/X/g' <anon-stream | sort -u | less
+---------------------------------------------------
+
+which shows all of the unique lines (with numbers converted to "X", to
+collapse "User 0", "User 1", etc into "User X"). This produces a much
+smaller output, and it is usually easy to quickly confirm that there is
+no private data in the stream.
+
+
 Limitations
 -----------
 
-- 
2.1.0.346.ga0367b9

* Re: [PATCH v3] teach fast-export an --anonymize option
  2014-08-28 12:32               ` Jeff King
@ 2014-08-28 16:46                 ` Ramsay Jones
  2014-08-28 18:43                   ` Junio C Hamano
  2014-08-28 18:50                   ` Jeff King
  2014-08-28 18:11                 ` Junio C Hamano
  2014-08-31 10:34                 ` Eric Sunshine
  2 siblings, 2 replies; 21+ messages in thread
From: Ramsay Jones @ 2014-08-28 16:46 UTC (permalink / raw)
  To: Jeff King, Duy Nguyen; +Cc: Junio C Hamano, Git Mailing List

On 28/08/14 13:32, Jeff King wrote:
> On Thu, Aug 28, 2014 at 05:30:44PM +0700, Duy Nguyen wrote:
> 
>> On Thu, Aug 28, 2014 at 12:01 AM, Jeff King <peff@peff.net> wrote:
>>> You can get an overview of what will be shared
>>> by running a command like:
>>>
>>>   git fast-export --anonymize --all |
>>>   perl -pe 's/\d+/X/g' |
>>>   sort -u |
>>>   less
>>>
>>> which will show every unique line we generate, modulo any
>>> numbers (each anonymized token is assigned a number, like
>>> "User 0", and we replace it consistently in the output).
>>
>> I feel like this should be part of git-fast-export.txt, just to
>> increase the user's confidence in the tool (and I don't expect most
>> users to read this commit message).
> 
> Hmph. Whenever I say "I think this patch is done", suddenly the comments
> start pouring in. :)

:-D

> I think you are right, though, and we could stand to explain
> the feature a little more in the documentation in general.
> How about this patch on top (or squashed in):
> 
> -- >8 --
> Subject: docs/fast-export: explain --anonymize more completely
> 
> The original commit made mention of this option, but not why
> one might want it or how they might use it. Let's try to be
> a little more thorough, and also explain how to confirm that
> the output really is anonymous.
> 
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  Documentation/git-fast-export.txt | 63 ++++++++++++++++++++++++++++++++++++---
>  1 file changed, 59 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
> index 52831fa..dbe9a46 100644
> --- a/Documentation/git-fast-export.txt
> +++ b/Documentation/git-fast-export.txt
> @@ -106,10 +106,9 @@ marks the same across runs.
>  	different from the commit's first parent).
>  
>  --anonymize::
> -	Replace all refnames, paths, blob contents, commit and tag
> -	messages, names, and email addresses in the output with
> -	anonymized data, while still retaining the shape of history and
> -	of the stored tree.
> +	Anonymize the contents of the repository while still retaining
> +	the shape of the history and stored tree.  See the section on
> +	`ANONYMIZING` below.
>  
>  --refspec::
>  	Apply the specified refspec to each ref exported. Multiple of them can
> @@ -147,6 +146,62 @@ referenced by that revision range contains the string
>  'refs/heads/master'.
>  
>  
> +ANONYMIZING
> +-----------
> +
> +If the `--anonymize` option is given, git will attempt to remove all
> +identifying information from the repository while still retaining enough
> +of the original tree and history patterns to reproduce some bugs. The
> +goal is that a git bug which is found on a private repository will

s/goal/hope/ ;-)

> +persist in the anonymized repository, and the latter can be shared with
> +git developers to help solve the bug.
> +
> +With this option, git will replace all refnames, paths, blob contents,
> +commit and tag messages, names, and email addresses in the output with
> +anonymized data.  Two instances of the same string will be replaced
> +equivalently (e.g., two commits with the same author will have the same
> +anonymized author in the output, but bear no resemblance to the original
> +author string). The relationship between commits, branches, and tags is
> +retained, as well as the commit timestamps (but the commit messages and
> +refnames bear no resemblance to the originals). The relative makeup of
> +the tree is retained (e.g., if you have a root tree with 10 files and 3
> +trees, so will the output), but their names and the contents of the
> +files will be replaced.
> +
> +If you think you have found a git bug, you can start by exporting an
> +anonymized stream of the whole repository:
> +
> +---------------------------------------------------
> +$ git fast-export --anonymize --all >anon-stream
> +---------------------------------------------------
> +
> +Then confirm that the bug persists in a repository created from that
> +stream (many bugs will not, as they really do depend on the exact
> +repository contents):

Dumb question (I have not even read the patch, so please just ignore me
if this is indeed dumb!): Is the map of <original-name, anonymized-name>
available to the user while he attempts to confirm that the bug is still
present?

For example, if I anonymized git.git, and did 'git branch -v' (say), how
easy would it be for me to recognise which branch was 'next'?

ATB,
Ramsay Jones

* Re: [PATCH v3] teach fast-export an --anonymize option
  2014-08-28 12:32               ` Jeff King
  2014-08-28 16:46                 ` Ramsay Jones
@ 2014-08-28 18:11                 ` Junio C Hamano
  2014-08-28 19:04                   ` Jeff King
  2014-08-31 10:34                 ` Eric Sunshine
  2 siblings, 1 reply; 21+ messages in thread
From: Junio C Hamano @ 2014-08-28 18:11 UTC (permalink / raw)
  To: Jeff King; +Cc: Duy Nguyen, Git Mailing List

Jeff King <peff@peff.net> writes:

> Subject: docs/fast-export: explain --anonymize more completely
>
> The original commit made mention of this option, but not why
> one might want it or how they might use it. Let's try to be
> a little more thorough, and also explain how to confirm that
> the output really is anonymous.
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  Documentation/git-fast-export.txt | 63 ++++++++++++++++++++++++++++++++++++---
>  1 file changed, 59 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
> index 52831fa..dbe9a46 100644
> --- a/Documentation/git-fast-export.txt
> +++ b/Documentation/git-fast-export.txt
> @@ -106,10 +106,9 @@ marks the same across runs.
>  	different from the commit's first parent).
>  
>  --anonymize::
> -	Replace all refnames, paths, blob contents, commit and tag
> -	messages, names, and email addresses in the output with
> -	anonymized data, while still retaining the shape of history and
> -	of the stored tree.
> +	Anonymize the contents of the repository while still retaining
> +	the shape of the history and stored tree.  See the section on
> +	`ANONYMIZING` below.

Technically s/tree/trees/, I would think.  For a repository with
multiple branches, perhaps s/history/histories/, too, but I would
not insist on that ;-).

> +ANONYMIZING
> +-----------
> +
> +If the `--anonymize` option is given, git will attempt to remove all
> +identifying information from the repository while still retaining enough
> +of the original tree and history patterns to reproduce some bugs. The
> +goal is that a git bug which is found on a private repository will
> +persist in the anonymized repository, and the latter can be shared with
> +git developers to help solve the bug.
> +
> +With this option, git will replace all refnames, paths, blob contents,
> +commit and tag messages, names, and email addresses in the output with
> +anonymized data.  Two instances of the same string will be replaced
> +equivalently (e.g., two commits with the same author will have the same
> +anonymized author in the output, but bear no resemblance to the original
> +author string). The relationship between commits, branches, and tags is
> +retained, as well as the commit timestamps (but the commit messages and
> +refnames bear no resemblance to the originals). The relative makeup of
> +the tree is retained (e.g., if you have a root tree with 10 files and 3
> +trees, so will the output), but their names and the contents of the
> +files will be replaced.

While I do not think I or anybody who would ask other people to use
this option would be confused, the phrase "the same string" may risk
unnecessary worries from those who are asked to trust this option.

I am not yet convinced that it is unlikely for the reader to read
the above and imagine that the anonymiser may go word by word,
replacing "the same string" with the same anonymised gibberish
(which would be susceptible to old-school cryptanalysis
techniques).

Among the ones listed, refnames, blob contents, commit messages
and tag messages are converted as a single "string" and I wish I
could think of phrasing to stress that point somehow.

Each path component in paths is converted as a single "string", so
we can read from two anonymised paths if they refer to blobs in the
same directory in the original.  This is a good thing, of course,
but it shows that among those listed in "refnames, paths, blob
contents, ..." in a flat sentence, some are treated as a single
token for replacement but not others, and it is hard to tell for a
reader which one is which, unless the reader knows the internals of
Git, i.e. what kind of things we as the debuggers-of-Git would want
to preserve.
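
The per-component treatment of paths described above can be illustrated with a standalone sketch. It rests on assumptions: the linear table, `MAX_COMPONENTS`, `anon_component()` and `anon_path()` are hypothetical names standing in for the patch's `anonymize_mem()` and `anonymize_path()`, and `snprintf()` into a caller-supplied buffer replaces git's strbuf. Only the splitting rule is meant to match: each slash-separated component is looked up on its own, so two paths in the same directory keep a common anonymized prefix.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_COMPONENTS 128 /* hypothetical cap; the patch uses a hashmap */

struct component {
	char *orig;
	size_t orig_len;
	char anon[32];
};

static struct component seen[MAX_COMPONENTS];
static size_t seen_nr;

/* Look up one component by (pointer, length); assign "pathN" on first use. */
static const char *anon_component(const char *s, size_t len)
{
	size_t i;

	for (i = 0; i < seen_nr; i++)
		if (seen[i].orig_len == len && !memcmp(seen[i].orig, s, len))
			return seen[i].anon;
	if (seen_nr == MAX_COMPONENTS)
		return NULL;
	/* copy exactly len bytes; the component is not NUL-terminated */
	seen[seen_nr].orig = malloc(len + 1);
	if (!seen[seen_nr].orig)
		return NULL;
	memcpy(seen[seen_nr].orig, s, len);
	seen[seen_nr].orig[len] = '\0';
	seen[seen_nr].orig_len = len;
	snprintf(seen[seen_nr].anon, sizeof(seen[seen_nr].anon),
		 "path%zu", seen_nr);
	return seen[seen_nr++].anon;
}

/*
 * Anonymize "a/b/c" one component at a time, keeping the slashes.
 * Returns 0 on success, -1 if a table or the output buffer is too small.
 */
static int anon_path(char *out, size_t outsz, const char *path)
{
	size_t used = 0;

	if (!outsz)
		return -1;
	out[0] = '\0';
	while (*path) {
		size_t len = strcspn(path, "/");
		const char *c = anon_component(path, len);
		int n;

		if (!c)
			return -1;
		n = snprintf(out + used, outsz - used, "%s%s",
			     c, path[len] ? "/" : "");
		if (n < 0 || (size_t)n >= outsz - used)
			return -1;
		used += (size_t)n;
		path += len;
		if (*path)
			path++; /* skip the slash we just emitted */
	}
	return 0;
}
```

Feeding it `a/b` and then `a/c` yields `path0/path1` and `path0/path2`: the shared directory is still visible, but neither name is.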

Isn't the unit for human identity anonymisation even more coarse?
If it is not, should it be?

In other words, do "Junio C Hamano <junio@pobox.com>" and "Junio C
Hamano <gitster@pobox.com>" map to one gibberish human readable name
with two gibberish e-mail addresses, or 2 "User$n <user$n>"?  Is the
fact that this organization seems to allocate two e-mails to each
developer something this organization may want to hide from the
public (and something we as the Git debuggers would not benefit from
knowing)?
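
Reading the patch as posted in this thread, the cache key for an identity is the whole span from the start of the name to the end of the email (`split.mail_end - split.name_begin`), so the same name with two different addresses would come out as two unrelated `User N` entries. The sketch below illustrates that keying under assumptions: `anon_ident()`, `MAX_IDENTS` and `MAX_KEY` are hypothetical, and the parsing (first space to first `>`) is a simplification of git's `split_ident_line()`, not a copy of it.

```c
#include <stdio.h>
#include <string.h>

#define MAX_IDENTS 64 /* hypothetical cap; the patch uses a hashmap */
#define MAX_KEY 256   /* hypothetical limit on "Name <email>" length */

struct ident {
	char key[MAX_KEY];
	size_t key_len;
	char anon[64];
};

static struct ident idents[MAX_IDENTS];
static size_t idents_nr;

/*
 * Given a header line such as "author Name <email> 1409249507 -0700",
 * return the anonymized "User N <userN@example.com>" for the
 * "Name <email>" span, or NULL if the line cannot be handled.
 */
static const char *anon_ident(const char *line)
{
	const char *begin = strchr(line, ' ');
	const char *end;
	size_t len, i;

	if (!begin)
		return NULL;
	begin++; /* skip "author ", "committer ", etc. */
	end = strchr(begin, '>');
	if (!end)
		return NULL;
	len = (size_t)(end + 1 - begin);
	if (len >= MAX_KEY)
		return NULL;

	/* name and email are compared together as one key */
	for (i = 0; i < idents_nr; i++)
		if (idents[i].key_len == len &&
		    !memcmp(idents[i].key, begin, len))
			return idents[i].anon;
	if (idents_nr == MAX_IDENTS)
		return NULL;
	memcpy(idents[idents_nr].key, begin, len);
	idents[idents_nr].key[len] = '\0';
	idents[idents_nr].key_len = len;
	snprintf(idents[idents_nr].anon, sizeof(idents[idents_nr].anon),
		 "User %zu <user%zu@example.com>", idents_nr, idents_nr);
	return idents[idents_nr++].anon;
}
```

With this keying, the two addresses in the example above become `User 0` and `User 1`, and nothing in the output links them. The timestamp does not take part in the key, so the same identity on another commit maps back to the same entry.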

* Re: [PATCH v3] teach fast-export an --anonymize option
  2014-08-28 16:46                 ` Ramsay Jones
@ 2014-08-28 18:43                   ` Junio C Hamano
  2014-08-28 18:50                   ` Jeff King
  1 sibling, 0 replies; 21+ messages in thread
From: Junio C Hamano @ 2014-08-28 18:43 UTC (permalink / raw)
  To: Ramsay Jones; +Cc: Jeff King, Duy Nguyen, Git Mailing List

Ramsay Jones <ramsay@ramsay1.demon.co.uk> writes:

> Dumb question (I have not even read the patch, so please just ignore me
> if this is indeed dumb!): Is the map of <original-name, anonymized-name>
> available to the user while he attempts to confirm that the bug is still
> present?
>
> For example, if I anonymized git.git, and did 'git branch -v' (say), how
> easy would it be for me to recognise which branch was 'next'?

It is not dumb but actually is a very good point.

There needs to be an easy way for the reporting user to turn an
observation such as "When I do 'git log master..next' I see this one
extraneous commit shown" into a corresponding statement to accompany
the anonymised output.  The user needs it to make sure that the
symptom reproduces in the anonymised repository in order to decide
if it is even worthwhile to send the output for analysis in the
first place.

* Re: [PATCH v3] teach fast-export an --anonymize option
  2014-08-28 16:46                 ` Ramsay Jones
  2014-08-28 18:43                   ` Junio C Hamano
@ 2014-08-28 18:50                   ` Jeff King
  1 sibling, 0 replies; 21+ messages in thread
From: Jeff King @ 2014-08-28 18:50 UTC (permalink / raw)
  To: Ramsay Jones; +Cc: Duy Nguyen, Junio C Hamano, Git Mailing List

On Thu, Aug 28, 2014 at 05:46:15PM +0100, Ramsay Jones wrote:

> Dumb question (I have not even read the patch, so please just ignore me
> if this is indeed dumb!): Is the map of <original-name, anonymized-name>
> available to the user while he attempts to confirm that the bug is still
> present?

No, it's not.

> For example, if I anonymized git.git, and did 'git branch -v' (say), how
> easy would it be for me to recognise which branch was 'next'?

You can't, really. The simplest thing would be to pare down your
repository to the minimum number of branches before anonymizing.

It might make sense to have an option to dump the maps we've stored to a
separate file (in theory, you could even load them back in and do an
incremental anonymized export[1]). I think I'd rather wait on
implementing that until we see more real-world use cases (but as always,
I'm happy to review if somebody wants to pick it up).

-Peff

[1] Incremental anonymization is not something I think is worth
    supporting by itself. However, there may be some value in being able
    to anonymize two similar repositories using the same mappings. For
    instance, a repository and its clone.


* Re: [PATCH v3] teach fast-export an --anonymize option
  2014-08-28 18:11                 ` Junio C Hamano
@ 2014-08-28 19:04                   ` Jeff King
  0 siblings, 0 replies; 21+ messages in thread
From: Jeff King @ 2014-08-28 19:04 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Duy Nguyen, Git Mailing List

On Thu, Aug 28, 2014 at 11:11:47AM -0700, Junio C Hamano wrote:

> > +	Anonymize the contents of the repository while still retaining
> > +	the shape of the history and stored tree.  See the section on
> > +	`ANONYMIZING` below.
> 
> Technically s/tree/trees/, I would think.  For a repository with
> multiple branches, perhaps s/history/histories/, too, but I would
> not insist on that ;-).

Sure, I think both of those are fine (I meant "tree" here to refer to
the general notion of a set of paths over time, not a particular tree
object).

> > +With this option, git will replace all refnames, paths, blob contents,
> > +commit and tag messages, names, and email addresses in the output with
> > +anonymized data.  Two instances of the same string will be replaced
> > +equivalently (e.g., two commits with the same author will have the same
> > +anonymized author in the output, but bear no resemblance to the original
> > +author string). The relationship between commits, branches, and tags is
> > +retained, as well as the commit timestamps (but the commit messages and
> > +refnames bear no resemblance to the originals). The relative makeup of
> > +the tree is retained (e.g., if you have a root tree with 10 files and 3
> > +trees, so will the output), but their names and the contents of the
> > +files will be replaced.
> 
> While I do not think I or anybody who would ask other people to use
> this option would be confused, the phrase "the same string" may risk
> unnecessary worries from those who are asked to trust this option.
> 
> I am not yet convinced that it is unlikely for the reader to read
> the above and imagine that the anonymiser may go word by word,
> replacing "the same string" with the same anonymised gibberish
> (which would be susceptible to old-school cryptanalysis
> techniques).

I tried to use phrases like "bears no resemblance" to indicate that the
mapping was not leaking information. Does it bear a separate paragraph
explaining the transformation (I was trying to avoid that because it is
necessarily intimately linked with the particular implementation
chosen)?

> Among the ones that are listed, refnames, blob contents, commit messages
> and tag messages are converted as a single "string" and I wish I
> could think of phrasing to stress that point somehow.

Maybe a separate paragraph like:

  Note that the replacement strings are chosen with no input from the
  original strings. There is no cryptography or other tricks involved,
  but rather we make up a new string like "message 123", replace a
  particular commit message with it, and then use the mapping between
  the two for the rest of the output. Thus, no information about the
  original commit message is leaked, and only the internal mapping
  (which is not part of the output stream) could reverse the
  transformation.
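
To make the scheme in that paragraph concrete, here is a minimal sketch in
Python (not the C of the patch). The class name `Anonymizer` and the sample
commit messages are hypothetical; the only things taken from the text above
are the "message 123" token shape and the idea of an internal mapping that
never appears in the output stream.

```python
class Anonymizer:
    """Map each distinct original string to a freshly made-up token."""

    def __init__(self, prefix):
        self.prefix = prefix
        self.seen = {}  # original string -> replacement token (kept private)

    def anonymize(self, original):
        # The token is derived only from how many distinct strings were
        # seen before, never from the content of `original`.
        if original not in self.seen:
            self.seen[original] = f"{self.prefix} {len(self.seen)}"
        return self.seen[original]


messages = Anonymizer("message")
first = messages.anonymize("Fix the secret frobnicator")
second = messages.anonymize("Add payroll export")
again = messages.anonymize("Fix the secret frobnicator")
```

Here `first` and `again` are the same token, `second` is a different one, and
none of them carries any part of the original text; only `messages.seen`
could reverse the transformation.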

> Each path component in paths is converted as a single "string", so
> we can read from two anonymised paths if they refer to blobs in the
> same directory in the original.  This is a good thing, of course,
> but it shows that among those listed in "refnames, paths, blob
> contents, ..." in a flat sentence, some are treated as a single
> token for replacement but not others, and it is hard to tell for a
> reader which one is which, unless the reader knows the internals of
> Git, i.e. what kind of things we as the debuggers-of-Git would want
> to preserve.

Yes, I was really trying not to get into those details, because I do not
think they matter to most callers and are subject to change as we come
up with better heuristics. I do not even want to promise an
implementation like "no tricky cryptography" above, because we may think
of a more interesting way to transform components.
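
As an illustration only of the per-component behavior Junio describes, and
with the caveat above that none of this is promised, a sketch in Python might
look like the following. The function name `anonymize_path`, the `pathN`
token shape, and the sample paths are assumptions made for the example.

```python
def anonymize_path(path, seen):
    """Replace each path component independently, reusing earlier tokens."""
    parts = []
    for component in path.split("/"):
        if component not in seen:
            # Token depends only on the order of first appearance.
            seen[component] = f"path{len(seen)}"
        parts.append(seen[component])
    return "/".join(parts)


seen = {}
a = anonymize_path("src/secret/payroll.c", seen)
b = anonymize_path("src/secret/bonus.c", seen)
```

Because `src` and `secret` map to the same tokens both times, `a` and `b`
share a leading directory in the output, which preserves the shape of the
tree without revealing any name.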

> Isn't the unit for human identity anonymisation even more coarse?
> If it is not, should it be?
> 
> In other words, do "Junio C Hamano <junio@pobox.com>" and "Junio C
> Hamano <gitster@pobox.com>" map to one gibberish human readable name
> with two gibberish e-mail addresses, or 2 "User$n <user$n>"?  Is the
> fact that this organization seems to allocate two e-mails to each
> developer something this organization may want to hide from the
> public (and something we as the Git debuggers would not benefit from
> knowing)?

The ident mapping takes a single "Name <email>" string and converts it
into a "User X <userX@example.com>" string. So no, we are not leaking
the fact that one name has multiple emails. I actually started down that
path, but gave it up, as it could produce entries like "User 3
<email5@example.com>" which were downright confusing. Plus I did not
think that would be a useful thing for debuggers to know, and replacing
the whole string is simpler (I also entertained the idea of just
blanking _all_ idents; what I expect to be of primary use here is the
history shape, and I doubt that a bug would be triggered by the pattern
of usernames but not their actual content).
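
A rough sketch of the whole-string ident mapping, in Python rather than the
patch's C: the function name `anonymize_ident` is hypothetical, and the two
sample idents are the ones from Junio's question. The output format follows
the "User X <userX@example.com>" shape described above.

```python
def anonymize_ident(ident, seen):
    """Map a complete "Name <email>" string to one made-up identity."""
    if ident not in seen:
        n = len(seen)
        # Name and email are replaced together, so they always agree.
        seen[ident] = f"User {n} <user{n}@example.com>"
    return seen[ident]


seen = {}
one = anonymize_ident("Junio C Hamano <junio@pobox.com>", seen)
two = anonymize_ident("Junio C Hamano <gitster@pobox.com>", seen)
same = anonymize_ident("Junio C Hamano <junio@pobox.com>", seen)
```

The two addresses come out as two unrelated users, so the output does not
reveal that one person has several emails, and a mismatched pair such as
"User 3 <email5@example.com>" cannot occur.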

-Peff


* Re: [PATCH v3] teach fast-export an --anonymize option
  2014-08-28 12:32               ` Jeff King
  2014-08-28 16:46                 ` Ramsay Jones
  2014-08-28 18:11                 ` Junio C Hamano
@ 2014-08-31 10:34                 ` Eric Sunshine
  2014-08-31 15:53                   ` Jeff King
  2 siblings, 1 reply; 21+ messages in thread
From: Eric Sunshine @ 2014-08-31 10:34 UTC (permalink / raw)
  To: Jeff King; +Cc: Duy Nguyen, Junio C Hamano, Git Mailing List

On Thu, Aug 28, 2014 at 8:32 AM, Jeff King <peff@peff.net> wrote:
> On Thu, Aug 28, 2014 at 05:30:44PM +0700, Duy Nguyen wrote:
>
>> On Thu, Aug 28, 2014 at 12:01 AM, Jeff King <peff@peff.net> wrote:
>> > You can get an overview of what will be shared
>> > by running a command like:
>> >
>> >   git fast-export --anonymize --all |
>> >   perl -pe 's/\d+/X/g' |
>> >   sort -u |
>> >   less
>> >
>> > which will show every unique line we generate, modulo any
>> > numbers (each anonymized token is assigned a number, like
>> > "User 0", and we replace it consistently in the output).
>>
>> I feel like this should be part of git-fast-export.txt, just to
>> increase the user's confidence in the tool (and I don't expect most
>> users to read this commit message).
>
> Hmph. Whenever I say "I think this patch is done", suddenly the comments
> start pouring in. :)

Considering that the value of --anonymize is not yet known, is such an
invasive change to fast-export.c warranted? Would it make sense
instead to provide "anonymize" functionality as a contrib/ script or a
distinct git-anonymize-foo command which accepts a fast-import stream
as input and anonymizes it as output?


* Re: [PATCH v3] teach fast-export an --anonymize option
  2014-08-31 10:34                 ` Eric Sunshine
@ 2014-08-31 15:53                   ` Jeff King
  0 siblings, 0 replies; 21+ messages in thread
From: Jeff King @ 2014-08-31 15:53 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: Duy Nguyen, Junio C Hamano, Git Mailing List

On Sun, Aug 31, 2014 at 06:34:08AM -0400, Eric Sunshine wrote:

> >> I feel like this should be part of git-fast-export.txt, just to
> >> increase the user's confidence in the tool (and I don't expect most
> >> users to read this commit message).
> >
> > Hmph. Whenever I say "I think this patch is done", suddenly the comments
> > start pouring in. :)
> 
> Considering that the value of --anonymize is not yet known, is such an
> invasive change to fast-export.c warranted? Would it make sense
> instead to provide "anonymize" functionality as a contrib/ script or a
> distinct git-anonymize-foo command which accepts a fast-import stream
> as input and anonymizes it as output?

I considered that, but there's a non-trivial amount of work in the
parsing of the stream (I had originally thought to just ship a perl
script to operate on the stream). And while there's a fair bit of code
added to fast-export.c, none of it is ever called unless --anonymize is
set.

So while I am not 100% sure that the idea is a good one, I do not think
it is hurting the current fast-export in any meaningful way. Two things
we could do to minimize that are:

  1. Move the anonymization code into a separate C file to keep the
     fast-export source a little more pristine. I avoided doing this
     just because the interfaces to the functions are fairly tailored to
     what fast-export wants.

  2. Have a separate git-anonymize command which is basically running
     "git fast-export --anonymize" under the hood. This avoids polluting
     fast-export from the user's perspective (they do not need to care
     that it is running fast-export under the hood).
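
For what option 2 could look like, here is a small hypothetical wrapper
sketched in Python. No `git-anonymize` command exists in this series; the
function names and the default of `--all` are assumptions for illustration.
The only real interface used is `git fast-export --anonymize` itself.

```python
import subprocess


def build_command(extra_args):
    """Return the fast-export invocation the wrapper would run."""
    # Default to exporting every ref when the caller names none.
    args = list(extra_args) if extra_args else ["--all"]
    return ["git", "fast-export", "--anonymize", *args]


def run(extra_args):
    """Run fast-export, passing its stream through to our stdout."""
    return subprocess.call(build_command(extra_args))
```

A caller would then use something like `run(["--all"])` and redirect the
output to a file, without needing to know that fast-export does the work.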

-Peff


end of thread, other threads:[~2014-08-31 15:54 UTC | newest]

Thread overview: 21+ messages
2014-08-21  7:01 [PATCH] teach fast-export an --anonymize option Jeff King
2014-08-21 20:15 ` Junio C Hamano
2014-08-21 22:41   ` Jeff King
2014-08-21 21:57 ` Junio C Hamano
2014-08-21 22:49   ` Jeff King
2014-08-21 23:21     ` [PATCH v2] " Jeff King
2014-08-22 13:06       ` Duy Nguyen
2014-08-22 18:39       ` Philip Oakley
2014-08-23  6:19         ` Jeff King
2014-08-27 16:01       ` Junio C Hamano
2014-08-27 16:58         ` Jeff King
2014-08-27 17:01           ` [PATCH v3] " Jeff King
2014-08-28 10:30             ` Duy Nguyen
2014-08-28 12:32               ` Jeff King
2014-08-28 16:46                 ` Ramsay Jones
2014-08-28 18:43                   ` Junio C Hamano
2014-08-28 18:50                   ` Jeff King
2014-08-28 18:11                 ` Junio C Hamano
2014-08-28 19:04                   ` Jeff King
2014-08-31 10:34                 ` Eric Sunshine
2014-08-31 15:53                   ` Jeff King
