From: Jeff King <peff@peff.net>
To: git@vger.kernel.org
Cc: Eric Sunshine <sunshine@sunshineco.com>,
	Junio C Hamano <gitster@pobox.com>,
	Johannes Schindelin <Johannes.Schindelin@gmx.de>
Subject: [PATCH v2 0/4] fast-export: allow dumping anonymization mappings
Date: Mon, 22 Jun 2020 17:47:45 -0400
Message-ID: <20200622214745.GA3302779@coredump.intra.peff.net>
In-Reply-To: <20200619132304.GA2540657@coredump.intra.peff.net>

On Fri, Jun 19, 2020 at 09:23:04AM -0400, Jeff King wrote:

> This series gives an alternate way to achieve the same effect, but much
> better in that it works for _any_ ref (so if you are trying to reproduce
> the effect of "rev-list origin/foo..bar" in the anonymized repo, you can
> easily do so). Ditto for paths, so that "rev-list -- foo.c" can be
> reproduced in the anonymized repo.

Here's a v2 which I think addresses all of the comments. I have to admit
that after writing my last email to Junio, I am wondering whether it
would be sufficient and simpler to let the user specify a static mapping
of tokens (that could just be applied anywhere).

I'll take a look at that, but since I worked up this version, here it is
in the meantime.

The interesting changes are:

  - path output is now quoted, making it unambiguous. The intent is for
    humans to look at it, but it's not much extra work to make it
    machine readable, too.

  - the path dumping was in the wrong spot. It was happening in the
    generic function that's used for "path-like" things, including
    refnames. So the path mapping dump had extra cruft in it.

  - got rid of the maybe_dump_anon() helper

  - tests now avoid hard-coding expected counts

  - the path-dump test now checks the expected count

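As a rough illustration of how the dumped refname mapping might be
consumed (this is not part of the series): the refname dump writes one
"<original> <anonymized>" pair per line, and refnames cannot contain
spaces, so a whitespace split is enough. The helper name anon_ref and
the file name refs.out below are assumptions made for the sketch. It
does not apply to the path mapping, whose entries may be quoted.

```shell
# Hypothetical helper (not from the patches): print the anonymized
# name recorded for a ref in a dumped mapping file.
#   $1 = mapping file, one "<original> <anonymized>" pair per line
#   $2 = original refname to look up
# Exits non-zero if the ref does not appear in the mapping.
anon_ref () {
	awk -v want="$2" '
		$1 == want { print $2; found = 1; exit }
		END { exit !found }
	' "$1"
}

# Possible use, run inside the anonymized repository, to reproduce
# "rev-list origin/foo..bar" from the original one:
#   git rev-list "$(anon_ref refs.out refs/remotes/origin/foo)..$(anon_ref refs.out refs/heads/bar)"
```
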
  [1/4]: fast-export: allow dumping the refname mapping
  [2/4]: fast-export: anonymize "master" refname
  [3/4]: fast-export: refactor path printing to not rely on stdout
  [4/4]: fast-export: allow dumping the path mapping

 Documentation/git-fast-export.txt | 34 +++++++++++++++
 builtin/fast-export.c             | 69 +++++++++++++++++++++++++------
 t/t9351-fast-export-anonymize.sh  | 44 ++++++++++++++++----
 3 files changed, 125 insertions(+), 22 deletions(-)

Range-diff from v1:

1:  82a17ae976 ! 1:  7ba5582d66 fast-export: allow dumping the refname mapping
    @@ builtin/fast-export.c: static int has_unshown_parent(struct commit *commit)
     +	kh_put_strset(seen->set, xstrdup(str), &hashret);
     +	return 0;
     +}
    -+
    -+static void maybe_dump_anon(FILE *out, struct seen_set *seen,
    -+			    const char *orig, const char *anon)
    -+{
    -+	if (!out)
    -+		return;
    -+	if (!check_and_mark_seen(seen, orig))
    -+		fprintf(out, "%s %s\n", orig, anon);
    -+}
     +
      struct anonymized_entry {
      	struct hashmap_entry hash;
    @@ builtin/fast-export.c: static const char *anonymize_refname(const char *refname)
      	}
      
      	anonymize_path(&anon, refname, &refs, anonymize_ref_component);
    -+	maybe_dump_anon(anonymized_refnames_handle, &seen,
    ++
    ++	if (anonymized_refnames_handle &&
    ++	    !check_and_mark_seen(&seen, full_refname))
    ++		fprintf(anonymized_refnames_handle, "%s %s\n",
     +			full_refname, anon.buf);
    ++
      	return anon.buf;
      }
      
    @@ t/t9351-fast-export-anonymize.sh: test_expect_success 'stream omits tag message'
     +	# we make no guarantees of the exact anonymized names,
     +	# so just check that we have the right number and
     +	# that a sample line looks sane.
    ++	expected_count=$(git for-each-ref | wc -l) &&
     +	# Note that master is not anonymized, and so not included
     +	# in the mapping.
    -+	test_line_count = 6 refs.out &&
    ++	expected_count=$((expected_count - 1)) &&
    ++	test_line_count = $expected_count refs.out &&
     +	grep "^refs/heads/other refs/heads/" refs.out
     +'
     +
2:  be56b375cc ! 2:  d88f7c83a5 fast-export: anonymize "master" refname
    @@ t/t9351-fast-export-anonymize.sh: test_expect_success 'stream omits path names'
      	! grep mytag stream
      '
     @@ t/t9351-fast-export-anonymize.sh: test_expect_success 'refname mapping can be dumped' '
    - 	# we make no guarantees of the exact anonymized names,
      	# so just check that we have the right number and
      	# that a sample line looks sane.
    + 	expected_count=$(git for-each-ref | wc -l) &&
     -	# Note that master is not anonymized, and so not included
     -	# in the mapping.
    --	test_line_count = 6 refs.out &&
    -+	test_line_count = 7 refs.out &&
    +-	expected_count=$((expected_count - 1)) &&
    + 	test_line_count = $expected_count refs.out &&
      	grep "^refs/heads/other refs/heads/" refs.out
      '
    - 
     @@ t/t9351-fast-export-anonymize.sh: test_expect_success 'import stream to new repository' '
      test_expect_success 'result has two branches' '
      	git for-each-ref --format="%(refname)" refs/heads >branches &&
-:  ---------- > 3:  164f1e1eab fast-export: refactor path printing to not rely on stdout
3:  a4e9f1f2ac ! 4:  b0aa59f07e fast-export: allow dumping the path mapping
    @@ Commit message
     
         We recently taught fast-export to dump the refname mapping. Let's do the
         same thing for paths, which can reuse most of the same infrastructure.
    -    Note that the output format isn't unambiguous here (because paths could
    -    contain spaces). That's OK because this is meant to be examined by a
    -    human.
     
         We could also just introduce a "dump mapping" file that shows every
         mapping we make. But it would be a bit more awkward to work with, as the
    @@ Documentation/git-fast-export.txt: by keeping the marks the same across runs.
     +	Output the mapping of real paths to anonymized paths to <file>.
     +	The output will contain one line per path that appears in the
     +	output stream, with the original path, a space, and its
    -+	anonymized counterpart. See the section on `ANONYMIZING` below.
    ++	anonymized counterpart. Paths may be quoted if they contain a
    ++	space, or unusual characters; see `core.quotePath` in
    ++	linkgit:git-config(1). See also `ANONYMIZING` below.
     +
      --reference-excluded-parents::
      	By default, running a command such as `git fast-export
    @@ builtin/fast-export.c: static struct string_list tag_refs = STRING_LIST_INIT_NOD
      static struct revision_sources revision_sources;
      
      static int parse_opt_signed_tag_mode(const struct option *opt,
    -@@ builtin/fast-export.c: static void anonymize_path(struct strbuf *out, const char *path,
    - 			   struct hashmap *map,
    - 			   void *(*generate)(const void *, size_t *))
    - {
    -+	static struct seen_set seen;
    -+	const char *full_path = path;
    +@@ builtin/fast-export.c: static void print_path(const char *path)
    + 		print_path_1(stdout, path);
    + 	else {
    + 		static struct hashmap paths;
    ++		static struct seen_set seen;
    + 		static struct strbuf anon = STRBUF_INIT;
    + 
    + 		anonymize_path(&anon, path, &paths, anonymize_path_component);
    ++		if (anonymized_paths_handle &&
    ++		    !check_and_mark_seen(&seen, path)) {
    ++			print_path_1(anonymized_paths_handle, path);
    ++			fputc(' ', anonymized_paths_handle);
    ++			print_path_1(anonymized_paths_handle, anon.buf);
    ++			fputc('\n', anonymized_paths_handle);
    ++		}
     +
    - 	while (*path) {
    - 		const char *end_of_component = strchrnul(path, '/');
    - 		size_t len = end_of_component - path;
    -@@ builtin/fast-export.c: static void anonymize_path(struct strbuf *out, const char *path,
    - 		if (*path)
    - 			strbuf_addch(out, *path++);
    + 		print_path_1(stdout, anon.buf);
    + 		strbuf_reset(&anon);
      	}
    -+
    -+	maybe_dump_anon(anonymized_paths_handle, &seen, full_path, out->buf);
    - }
    - 
    - static inline void *mark_to_ptr(uint32_t mark)
     @@ builtin/fast-export.c: int cmd_fast_export(int argc, const char **argv, const char *prefix)
      	     *import_filename = NULL,
      	     *import_filename_if_exists = NULL;
    @@ builtin/fast-export.c: int cmd_fast_export(int argc, const char **argv, const ch
      		printf("feature done\n");
     
      ## t/t9351-fast-export-anonymize.sh ##
    +@@ t/t9351-fast-export-anonymize.sh: test_expect_success 'setup simple repo' '
    + 	git checkout -b other HEAD^ &&
    + 	mkdir subdir &&
    + 	test_commit subdir/bar &&
    +-	test_commit subdir/xyzzy &&
    ++	test_commit quoting "subdir/this needs quoting" &&
    + 	git tag -m "annotated tag" mytag
    + '
    + 
    +@@ t/t9351-fast-export-anonymize.sh: test_expect_success 'stream omits path names' '
    + 	! grep foo stream &&
    + 	! grep subdir stream &&
    + 	! grep bar stream &&
    +-	! grep xyzzy stream
    ++	! grep quoting stream
    + '
    + 
    + test_expect_success 'stream omits refnames' '
     @@ t/t9351-fast-export-anonymize.sh: test_expect_success 'refname mapping can be dumped' '
      	grep "^refs/heads/other refs/heads/" refs.out
      '
      
     +test_expect_success 'path mapping can be dumped' '
     +	git fast-export --anonymize --all \
     +		--dump-anonymized-paths=paths.out >/dev/null &&
    -+	# do not assume a particular anonymization scheme or order;
    -+	# just sanity check that a sample line looks sensible.
    -+	grep "^foo " paths.out
    ++	# as above, avoid depending on the exact scheme, but
    ++	# check that we have the right number of mappings,
    ++	# and spot-check one sample.
    ++	expected_count=$(
    ++		git rev-list --objects --all |
    ++		git cat-file --batch-check="%(objecttype) %(rest)" |
    ++		sed -ne "s/^blob //p" |
    ++		sort -u |
    ++		wc -l
    ++	) &&
    ++	test_line_count = $expected_count paths.out &&
    ++	grep "^\"subdir/this needs quoting\" " paths.out
     +'
     +
      # NOTE: we chdir to the new, anonymized repository

