git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH v2 0/5] Fix and extend encoding handling in fast export/import
@ 2019-04-30 18:25 Elijah Newren
  2019-04-30 18:25 ` [PATCH v2 1/5] t9350: fix encoding test to actually test reencoding Elijah Newren
                   ` (5 more replies)
  0 siblings, 6 replies; 45+ messages in thread
From: Elijah Newren @ 2019-04-30 18:25 UTC (permalink / raw)
  To: gitster; +Cc: git, Eric Sunshine, Elijah Newren

While stress testing `git filter-repo`, I noticed an issue with
encoding; further digging led to the fixes and features in this series.
See the individual commit messages for details.

Changes since v1 (full range-diff below):
  * Applied style fixes Eric pointed out in his review (thanks!)
  * Rebased on latest master (83232e38, "The seventh batch"), resolving
    a trivial merge conflict.  Now merges cleanly with next and pu as
    well.

I'm a bit under the weather so I may be slow to respond...

Elijah Newren (5):
  t9350: fix encoding test to actually test reencoding
  fast-import: support 'encoding' commit header
  fast-export: avoid stripping encoding header if we cannot reencode
  fast-export: differentiate between explicitly utf-8 and implicitly
    utf-8
  fast-export: do automatic reencoding of commit messages only if
    requested

 Documentation/git-fast-import.txt |  7 ++++
 builtin/fast-export.c             | 44 +++++++++++++++++++++----
 fast-import.c                     | 11 +++++--
 t/t9300-fast-import.sh            | 20 ++++++++++++
 t/t9350-fast-export.sh            | 53 +++++++++++++++++++++++++------
 5 files changed, 118 insertions(+), 17 deletions(-)

Range-diff:
1:  d6efd05142 ! 1:  9cc04242bd t9350: fix encoding test to actually test reencoding
    @@ -26,8 +26,7 @@
     -	# use author and committer name in ISO-8859-1 to match it.
     -	. "$TEST_DIRECTORY"/t3901/8859-1.txt &&
     +	test_when_finished "git reset --hard HEAD~1" &&
    -+	test_when_finished "git config --unset i18n.commitencoding" &&
    -+	git config i18n.commitencoding iso-8859-7 &&
    ++	test_config i18n.commitencoding iso-8859-7 &&
      	test_tick &&
      	echo rosten >file &&
     -	git commit -s -m den file &&
2:  02f48c7559 ! 2:  0cd023ac7a fast-import: support 'encoding' commit header
    @@ -51,9 +51,8 @@
      	}
      	if (!committer)
      		die("Expected committer but didn't get one");
    -+	if (skip_prefix(command_buf.buf, "encoding ", &encoding)) {
    ++	if (skip_prefix(command_buf.buf, "encoding ", &encoding))
     +		read_next_command();
    -+	}
      	parse_data(&msg, 0, NULL);
      	read_next_command();
      	parse_from(b);
    @@ -69,7 +68,7 @@
     +		strbuf_addf(&new_data,
     +			"encoding %s\n",
     +			encoding);
    -+	strbuf_addf(&new_data, "\n");
    ++	strbuf_addch(&new_data, '\n');
      	strbuf_addbuf(&new_data, &msg);
      	free(author);
      	free(committer);
    @@ -78,14 +77,14 @@
      --- a/t/t9300-fast-import.sh
      +++ b/t/t9300-fast-import.sh
     @@
    - 	background_import_still_running
    + 	sed -e s/LFs/LLL/ W-input | tr L "\n" | test_must_fail git fast-import
      '
      
     +###
    -+### series W (other new features)
    ++### series X (other new features)
     +###
     +
    -+test_expect_success 'W: handling encoding' '
    ++test_expect_success 'X: handling encoding' '
     +	test_tick &&
     +	cat >input <<-INPUT_END &&
     +	commit refs/heads/encoding
3:  86c348402d ! 3:  1fddf51402 fast-export: avoid stripping encoding header if we cannot reencode
    @@ -41,8 +41,7 @@
     +test_expect_success 'encoding preserved if reencoding fails' '
     +
     +	test_when_finished "git reset --hard HEAD~1" &&
    -+	test_when_finished "git config --unset i18n.commitencoding" &&
    -+	git config i18n.commitencoding iso-8859-7 &&
    ++	test_config i18n.commitencoding iso-8859-7 &&
     +	echo rosten >file &&
     +	git commit -s -m "$(printf "Pi: \360; Invalid: \377")" file &&
     +	git fast-export wer^..wer >iso-8859-7.fi &&
4:  c09b23bc59 = 4:  4a2e04b3ae fast-export: differentiate between explicitly utf-8 and implicitly utf-8
5:  24b69a0db9 ! 5:  44aacb1a0b fast-export: do automatic reencoding of commit messages only if requested
    @@ -92,8 +92,7 @@
     +test_expect_success 'reencoding iso-8859-7' '
      
      	test_when_finished "git reset --hard HEAD~1" &&
    - 	test_when_finished "git config --unset i18n.commitencoding" &&
    -@@
    + 	test_config i18n.commitencoding iso-8859-7 &&
      	test_tick &&
      	echo rosten >file &&
      	git commit -s -m "$(printf "Pi: \360")" file &&
    @@ -109,8 +108,7 @@
     +test_expect_success 'aborting on iso-8859-7' '
     +
     +	test_when_finished "git reset --hard HEAD~1" &&
    -+	test_when_finished "git config --unset i18n.commitencoding" &&
    -+	git config i18n.commitencoding iso-8859-7 &&
    ++	test_config i18n.commitencoding iso-8859-7 &&
     +	echo rosten >file &&
     +	git commit -s -m "$(printf "Pi: \360")" file &&
     +	test_must_fail git fast-export --reencode=abort wer^..wer >iso-8859-7.fi
    @@ -119,8 +117,7 @@
     +test_expect_success 'preserving iso-8859-7' '
     +
     +	test_when_finished "git reset --hard HEAD~1" &&
    -+	test_when_finished "git config --unset i18n.commitencoding" &&
    -+	git config i18n.commitencoding iso-8859-7 &&
    ++	test_config i18n.commitencoding iso-8859-7 &&
     +	echo rosten >file &&
     +	git commit -s -m "$(printf "Pi: \360")" file &&
     +	git fast-export --reencode=no wer^..wer >iso-8859-7.fi &&
    @@ -134,8 +131,7 @@
      test_expect_success 'encoding preserved if reencoding fails' '
      
      	test_when_finished "git reset --hard HEAD~1" &&
    -@@
    - 	git config i18n.commitencoding iso-8859-7 &&
    + 	test_config i18n.commitencoding iso-8859-7 &&
      	echo rosten >file &&
      	git commit -s -m "$(printf "Pi: \360; Invalid: \377")" file &&
     -	git fast-export wer^..wer >iso-8859-7.fi &&
-- 
2.21.0.782.g44aacb1a0b


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v2 1/5] t9350: fix encoding test to actually test reencoding
  2019-04-30 18:25 [PATCH v2 0/5] Fix and extend encoding handling in fast export/import Elijah Newren
@ 2019-04-30 18:25 ` Elijah Newren
  2019-04-30 18:25 ` [PATCH v2 2/5] fast-import: support 'encoding' commit header Elijah Newren
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 45+ messages in thread
From: Elijah Newren @ 2019-04-30 18:25 UTC (permalink / raw)
  To: gitster; +Cc: git, Eric Sunshine, Elijah Newren

This test used an author with non-ascii characters in the name, but
no special commit message.  It then grep'ed for those non-ascii
characters, but those are guaranteed to exist regardless of the
reencoding process since the reencoding only affects the commit message,
not the author or committer names.  As such, the test would work even if
the re-encoding process simply stripped the commit message entirely.
Modify the test to actually check that the reencoding in utf-8 worked.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 t/t9350-fast-export.sh | 18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
index 5690fe2810..f55759651a 100755
--- a/t/t9350-fast-export.sh
+++ b/t/t9350-fast-export.sh
@@ -94,22 +94,21 @@ test_expect_success 'fast-export --show-original-ids | git fast-import' '
 	test $MUSS = $(git rev-parse --verify refs/tags/muss)
 '
 
-test_expect_success 'iso-8859-1' '
+test_expect_success 'iso-8859-7' '
 
-	git config i18n.commitencoding ISO8859-1 &&
-	# use author and committer name in ISO-8859-1 to match it.
-	. "$TEST_DIRECTORY"/t3901/8859-1.txt &&
+	test_when_finished "git reset --hard HEAD~1" &&
+	test_config i18n.commitencoding iso-8859-7 &&
 	test_tick &&
 	echo rosten >file &&
-	git commit -s -m den file &&
-	git fast-export wer^..wer >iso8859-1.fi &&
-	sed "s/wer/i18n/" iso8859-1.fi |
+	git commit -s -m "$(printf "Pi: \360")" file &&
+	git fast-export wer^..wer >iso-8859-7.fi &&
+	sed "s/wer/i18n/" iso-8859-7.fi |
 		(cd new &&
 		 git fast-import &&
 		 git cat-file commit i18n >actual &&
-		 grep "Áéí óú" actual)
-
+		 grep $(printf "\317\200") actual)
 '
+
 test_expect_success 'import/export-marks' '
 
 	git checkout -b marks master &&
@@ -224,7 +223,6 @@ GIT_COMMITTER_NAME='C O Mitter'; export GIT_COMMITTER_NAME
 
 test_expect_success 'setup copies' '
 
-	git config --unset i18n.commitencoding &&
 	git checkout -b copy rein &&
 	git mv file file3 &&
 	git commit -m move1 &&
-- 
2.21.0.782.g44aacb1a0b


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v2 2/5] fast-import: support 'encoding' commit header
  2019-04-30 18:25 [PATCH v2 0/5] Fix and extend encoding handling in fast export/import Elijah Newren
  2019-04-30 18:25 ` [PATCH v2 1/5] t9350: fix encoding test to actually test reencoding Elijah Newren
@ 2019-04-30 18:25 ` Elijah Newren
  2019-04-30 18:25 ` [PATCH v2 3/5] fast-export: avoid stripping encoding header if we cannot reencode Elijah Newren
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 45+ messages in thread
From: Elijah Newren @ 2019-04-30 18:25 UTC (permalink / raw)
  To: gitster; +Cc: git, Eric Sunshine, Elijah Newren

Since git supports commit messages with an encoding other than utf-8,
allow fast-import to import such commits.  This may be useful for folks
who do not want to reencode commit messages from an external system, and
may also be useful to achieve reversible history rewrites (e.g. sha1sum
<-> sha256sum transitions or subtree work) with git repositories that
have used specialized encodings in their commit history.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 Documentation/git-fast-import.txt |  7 +++++++
 fast-import.c                     | 11 +++++++++--
 t/t9300-fast-import.sh            | 20 ++++++++++++++++++++
 3 files changed, 36 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-fast-import.txt b/Documentation/git-fast-import.txt
index d65cdb3d08..7baf9e47b5 100644
--- a/Documentation/git-fast-import.txt
+++ b/Documentation/git-fast-import.txt
@@ -388,6 +388,7 @@ change to the project.
 	original-oid?
 	('author' (SP <name>)? SP LT <email> GT SP <when> LF)?
 	'committer' (SP <name>)? SP LT <email> GT SP <when> LF
+	('encoding' SP <encoding>)?
 	data
 	('from' SP <commit-ish> LF)?
 	('merge' SP <commit-ish> LF)?
@@ -455,6 +456,12 @@ that was selected by the --date-format=<fmt> command-line option.
 See ``Date Formats'' above for the set of supported formats, and
 their syntax.
 
+`encoding`
+^^^^^^^^^^
+The optional `encoding` command indicates the encoding of the commit
+message.  Most commits are UTF-8 and the encoding is omitted, but this
+allows importing commit messages into git without first reencoding them.
+
 `from`
 ^^^^^^
 The `from` command is used to specify the commit to initialize
diff --git a/fast-import.c b/fast-import.c
index f38d04fa58..76a7bd3699 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -2585,6 +2585,7 @@ static void parse_new_commit(const char *arg)
 	struct branch *b;
 	char *author = NULL;
 	char *committer = NULL;
+	const char *encoding = NULL;
 	struct hash_list *merge_list = NULL;
 	unsigned int merge_count;
 	unsigned char prev_fanout, new_fanout;
@@ -2607,6 +2608,8 @@ static void parse_new_commit(const char *arg)
 	}
 	if (!committer)
 		die("Expected committer but didn't get one");
+	if (skip_prefix(command_buf.buf, "encoding ", &encoding))
+		read_next_command();
 	parse_data(&msg, 0, NULL);
 	read_next_command();
 	parse_from(b);
@@ -2670,9 +2673,13 @@ static void parse_new_commit(const char *arg)
 	}
 	strbuf_addf(&new_data,
 		"author %s\n"
-		"committer %s\n"
-		"\n",
+		"committer %s\n",
 		author ? author : committer, committer);
+	if (encoding)
+		strbuf_addf(&new_data,
+			"encoding %s\n",
+			encoding);
+	strbuf_addch(&new_data, '\n');
 	strbuf_addbuf(&new_data, &msg);
 	free(author);
 	free(committer);
diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
index 3668263c40..141b7fa35e 100755
--- a/t/t9300-fast-import.sh
+++ b/t/t9300-fast-import.sh
@@ -3299,4 +3299,24 @@ test_expect_success !MINGW 'W: get-mark & empty orphan commit with erroneous thi
 	sed -e s/LFs/LLL/ W-input | tr L "\n" | test_must_fail git fast-import
 '
 
+###
+### series X (other new features)
+###
+
+test_expect_success 'X: handling encoding' '
+	test_tick &&
+	cat >input <<-INPUT_END &&
+	commit refs/heads/encoding
+	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+	encoding iso-8859-7
+	data <<COMMIT
+	INPUT_END
+
+	printf "Pi: \360\nCOMMIT\n" >>input &&
+
+	git fast-import <input &&
+	git cat-file -p encoding | grep $(printf "\360") &&
+	git log -1 --format=%B encoding | grep $(printf "\317\200")
+'
+
 test_done
-- 
2.21.0.782.g44aacb1a0b


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v2 3/5] fast-export: avoid stripping encoding header if we cannot reencode
  2019-04-30 18:25 [PATCH v2 0/5] Fix and extend encoding handling in fast export/import Elijah Newren
  2019-04-30 18:25 ` [PATCH v2 1/5] t9350: fix encoding test to actually test reencoding Elijah Newren
  2019-04-30 18:25 ` [PATCH v2 2/5] fast-import: support 'encoding' commit header Elijah Newren
@ 2019-04-30 18:25 ` Elijah Newren
  2019-04-30 18:25 ` [PATCH v2 4/5] fast-export: differentiate between explicitly utf-8 and implicitly utf-8 Elijah Newren
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 45+ messages in thread
From: Elijah Newren @ 2019-04-30 18:25 UTC (permalink / raw)
  To: gitster; +Cc: git, Eric Sunshine, Elijah Newren

When fast-export encounters a commit with an 'encoding' header, it tries
to reencode in utf-8 and then drops the encoding header.  However, if it
fails to reencode in utf-8 because e.g. one of the characters in the
commit message was invalid in the old encoding, then we need to retain
the original encoding or otherwise we lose information needed to
understand all the other (valid) characters in the original commit
message.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 builtin/fast-export.c  |  7 +++++--
 t/t9350-fast-export.sh | 14 ++++++++++++++
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 9e283482ef..7734a9f5a5 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -642,9 +642,12 @@ static void handle_commit(struct commit *commit, struct rev_info *rev,
 	printf("commit %s\nmark :%"PRIu32"\n", refname, last_idnum);
 	if (show_original_ids)
 		printf("original-oid %s\n", oid_to_hex(&commit->object.oid));
-	printf("%.*s\n%.*s\ndata %u\n%s",
+	printf("%.*s\n%.*s\n",
 	       (int)(author_end - author), author,
-	       (int)(committer_end - committer), committer,
+	       (int)(committer_end - committer), committer);
+	if (!reencoded && encoding)
+		printf("encoding %s\n", encoding);
+	printf("data %u\n%s",
 	       (unsigned)(reencoded
 			  ? strlen(reencoded) : message
 			  ? strlen(message) : 0),
diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
index f55759651a..67dd7ac7f4 100755
--- a/t/t9350-fast-export.sh
+++ b/t/t9350-fast-export.sh
@@ -109,6 +109,20 @@ test_expect_success 'iso-8859-7' '
 		 grep $(printf "\317\200") actual)
 '
 
+test_expect_success 'encoding preserved if reencoding fails' '
+
+	test_when_finished "git reset --hard HEAD~1" &&
+	test_config i18n.commitencoding iso-8859-7 &&
+	echo rosten >file &&
+	git commit -s -m "$(printf "Pi: \360; Invalid: \377")" file &&
+	git fast-export wer^..wer >iso-8859-7.fi &&
+	sed "s/wer/i18n-invalid/" iso-8859-7.fi |
+		(cd new &&
+		 git fast-import &&
+		 git cat-file commit i18n-invalid >actual &&
+		 grep ^encoding actual)
+'
+
 test_expect_success 'import/export-marks' '
 
 	git checkout -b marks master &&
-- 
2.21.0.782.g44aacb1a0b


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v2 4/5] fast-export: differentiate between explicitly utf-8 and implicitly utf-8
  2019-04-30 18:25 [PATCH v2 0/5] Fix and extend encoding handling in fast export/import Elijah Newren
                   ` (2 preceding siblings ...)
  2019-04-30 18:25 ` [PATCH v2 3/5] fast-export: avoid stripping encoding header if we cannot reencode Elijah Newren
@ 2019-04-30 18:25 ` Elijah Newren
  2019-04-30 18:25 ` [PATCH v2 5/5] fast-export: do automatic reencoding of commit messages only if requested Elijah Newren
  2019-05-10 20:53 ` [PATCH v3 0/5] Fix and extend encoding handling in fast export/import Elijah Newren
  5 siblings, 0 replies; 45+ messages in thread
From: Elijah Newren @ 2019-04-30 18:25 UTC (permalink / raw)
  To: gitster; +Cc: git, Eric Sunshine, Elijah Newren

The find_encoding() function returned the encoding used by a commit
message, returning a default of git_commit_encoding (usually utf-8).
Although the current code does not differentiate between a commit which
explicitly requested utf-8 and one where we just assume utf-8 because no
encoding is set, it will become important when we try to preserve the
encoding header.  Since is_encoding_utf8() returns true when passed
NULL, we can just return NULL from find_encoding() instead of returning
git_commit_encoding.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 builtin/fast-export.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 7734a9f5a5..66331fa401 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -453,7 +453,7 @@ static const char *find_encoding(const char *begin, const char *end)
 	bol = memmem(begin, end ? end - begin : strlen(begin),
 		     needle, strlen(needle));
 	if (!bol)
-		return git_commit_encoding;
+		return NULL;
 	bol += strlen(needle);
 	eol = strchrnul(bol, '\n');
 	*eol = '\0';
-- 
2.21.0.782.g44aacb1a0b


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v2 5/5] fast-export: do automatic reencoding of commit messages only if requested
  2019-04-30 18:25 [PATCH v2 0/5] Fix and extend encoding handling in fast export/import Elijah Newren
                   ` (3 preceding siblings ...)
  2019-04-30 18:25 ` [PATCH v2 4/5] fast-export: differentiate between explicitly utf-8 and implicitly utf-8 Elijah Newren
@ 2019-04-30 18:25 ` Elijah Newren
  2019-05-10 20:53 ` [PATCH v3 0/5] Fix and extend encoding handling in fast export/import Elijah Newren
  5 siblings, 0 replies; 45+ messages in thread
From: Elijah Newren @ 2019-04-30 18:25 UTC (permalink / raw)
  To: gitster; +Cc: git, Eric Sunshine, Elijah Newren

Automatic re-encoding of commit messages (and dropping of the encoding
header) hurts attempts to do reversible history rewrites (e.g. sha1sum
<-> sha256sum transitions, some subtree rewrites), and seems
inconsistent with the general principle followed elsewhere in
fast-export of requiring explicit user requests to modify the output
(e.g. --signed-tags=strip, --tag-of-filtered-object=rewrite).  Add a
--reencode flag that the user can use to specify, and like other
fast-export flags, default it to 'abort'.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 builtin/fast-export.c  | 35 ++++++++++++++++++++++++++++++++---
 t/t9350-fast-export.sh | 29 ++++++++++++++++++++++++++---
 2 files changed, 58 insertions(+), 6 deletions(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 66331fa401..43cc52331c 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -33,6 +33,7 @@ static const char *fast_export_usage[] = {
 static int progress;
 static enum { SIGNED_TAG_ABORT, VERBATIM, WARN, WARN_STRIP, STRIP } signed_tag_mode = SIGNED_TAG_ABORT;
 static enum { TAG_FILTERING_ABORT, DROP, REWRITE } tag_of_filtered_mode = TAG_FILTERING_ABORT;
+static enum { REENCODE_ABORT, REENCODE_PLEASE, REENCODE_NEVER } reencode_mode = REENCODE_ABORT;
 static int fake_missing_tagger;
 static int use_done_feature;
 static int no_data;
@@ -77,6 +78,20 @@ static int parse_opt_tag_of_filtered_mode(const struct option *opt,
 	return 0;
 }
 
+static int parse_opt_reencode_mode(const struct option *opt,
+				   const char *arg, int unset)
+{
+	if (unset || !strcmp(arg, "abort"))
+		reencode_mode = REENCODE_ABORT;
+	else if (!strcmp(arg, "yes"))
+		reencode_mode = REENCODE_PLEASE;
+	else if (!strcmp(arg, "no"))
+		reencode_mode = REENCODE_NEVER;
+	else
+		return error("Unknown reencoding mode: %s", arg);
+	return 0;
+}
+
 static struct decoration idnums;
 static uint32_t last_idnum;
 
@@ -633,10 +648,21 @@ static void handle_commit(struct commit *commit, struct rev_info *rev,
 	}
 
 	mark_next_object(&commit->object);
-	if (anonymize)
+	if (anonymize) {
 		reencoded = anonymize_commit_message(message);
-	else if (!is_encoding_utf8(encoding))
-		reencoded = reencode_string(message, "UTF-8", encoding);
+	} else if (encoding) {
+		switch(reencode_mode) {
+		case REENCODE_PLEASE:
+			reencoded = reencode_string(message, "UTF-8", encoding);
+			break;
+		case REENCODE_NEVER:
+			break;
+		case REENCODE_ABORT:
+			die("Encountered commit-specific encoding %s in commit "
+			    "%s; use --reencode=<mode> to handle it",
+			    encoding, oid_to_hex(&commit->object.oid));
+		}
+	}
 	if (!commit->parents)
 		printf("reset %s\n", refname);
 	printf("commit %s\nmark :%"PRIu32"\n", refname, last_idnum);
@@ -1091,6 +1117,9 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix)
 		OPT_CALLBACK(0, "tag-of-filtered-object", &tag_of_filtered_mode, N_("mode"),
 			     N_("select handling of tags that tag filtered objects"),
 			     parse_opt_tag_of_filtered_mode),
+		OPT_CALLBACK(0, "reencode", &reencode_mode, N_("mode"),
+			     N_("select handling of commit messages in an alternate encoding"),
+			     parse_opt_reencode_mode),
 		OPT_STRING(0, "export-marks", &export_filename, N_("file"),
 			     N_("Dump marks to this file")),
 		OPT_STRING(0, "import-marks", &import_filename, N_("file"),
diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
index 67dd7ac7f4..92cfeb6cfc 100755
--- a/t/t9350-fast-export.sh
+++ b/t/t9350-fast-export.sh
@@ -94,14 +94,14 @@ test_expect_success 'fast-export --show-original-ids | git fast-import' '
 	test $MUSS = $(git rev-parse --verify refs/tags/muss)
 '
 
-test_expect_success 'iso-8859-7' '
+test_expect_success 'reencoding iso-8859-7' '
 
 	test_when_finished "git reset --hard HEAD~1" &&
 	test_config i18n.commitencoding iso-8859-7 &&
 	test_tick &&
 	echo rosten >file &&
 	git commit -s -m "$(printf "Pi: \360")" file &&
-	git fast-export wer^..wer >iso-8859-7.fi &&
+	git fast-export --reencode=yes wer^..wer >iso-8859-7.fi &&
 	sed "s/wer/i18n/" iso-8859-7.fi |
 		(cd new &&
 		 git fast-import &&
@@ -109,13 +109,36 @@ test_expect_success 'iso-8859-7' '
 		 grep $(printf "\317\200") actual)
 '
 
+test_expect_success 'aborting on iso-8859-7' '
+
+	test_when_finished "git reset --hard HEAD~1" &&
+	test_config i18n.commitencoding iso-8859-7 &&
+	echo rosten >file &&
+	git commit -s -m "$(printf "Pi: \360")" file &&
+	test_must_fail git fast-export --reencode=abort wer^..wer >iso-8859-7.fi
+'
+
+test_expect_success 'preserving iso-8859-7' '
+
+	test_when_finished "git reset --hard HEAD~1" &&
+	test_config i18n.commitencoding iso-8859-7 &&
+	echo rosten >file &&
+	git commit -s -m "$(printf "Pi: \360")" file &&
+	git fast-export --reencode=no wer^..wer >iso-8859-7.fi &&
+	sed "s/wer/i18n-no-recoding/" iso-8859-7.fi |
+		(cd new &&
+		 git fast-import &&
+		 git cat-file commit i18n-no-recoding >actual &&
+		 grep $(printf "\360") actual)
+'
+
 test_expect_success 'encoding preserved if reencoding fails' '
 
 	test_when_finished "git reset --hard HEAD~1" &&
 	test_config i18n.commitencoding iso-8859-7 &&
 	echo rosten >file &&
 	git commit -s -m "$(printf "Pi: \360; Invalid: \377")" file &&
-	git fast-export wer^..wer >iso-8859-7.fi &&
+	git fast-export --reencode=yes wer^..wer >iso-8859-7.fi &&
 	sed "s/wer/i18n-invalid/" iso-8859-7.fi |
 		(cd new &&
 		 git fast-import &&
-- 
2.21.0.782.g44aacb1a0b


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 0/5] Fix and extend encoding handling in fast export/import
  2019-04-30 18:25 [PATCH v2 0/5] Fix and extend encoding handling in fast export/import Elijah Newren
                   ` (4 preceding siblings ...)
  2019-04-30 18:25 ` [PATCH v2 5/5] fast-export: do automatic reencoding of commit messages only if requested Elijah Newren
@ 2019-05-10 20:53 ` Elijah Newren
  2019-05-10 20:53   ` [PATCH v3 1/5] t9350: fix encoding test to actually test reencoding Elijah Newren
                     ` (6 more replies)
  5 siblings, 7 replies; 45+ messages in thread
From: Elijah Newren @ 2019-05-10 20:53 UTC (permalink / raw)
  To: gitster
  Cc: git, Eric Sunshine, Johannes Schindelin, Johannes Sixt,
	Elijah Newren

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=UTF-8, Size: 8203 bytes --]

While stress testing `git filter-repo`, I noticed an issue with
encoding; further digging led to the fixes and features in this series.
See the individual commit messages for details.

Changes since v2 (full range-diff below):
  * Modified the testcases to pass on Windows[1], as verified via
    gitgitgadget pull request[2].  Required adding a couple new files
    (which store the desired bytes) and checking the size of the output
    instead of checking for particular bytes (but the lengths of the
    expected byte sequences differ so this works fine...).

[1] Failures of previous patchset on Windows noticed and reported by Dscho;
    explanation from Hannes is that Windows munges users' command lines to
    force them to be characters instead of bytes.
[2] https://github.com/gitgitgadget/git/pull/187

Elijah Newren (5):
t9350: fix encoding test to actually test reencoding
fast-import: support 'encoding' commit header
fast-export: avoid stripping encoding header if we cannot reencode
fast-export: differentiate between explicitly utf-8 and implicitly
utf-8
fast-export: do automatic reencoding of commit messages only if
requested

Documentation/git-fast-import.txt            |  7 ++
builtin/fast-export.c                        | 44 ++++++++++--
fast-import.c                                | 11 ++-
t/t9300-fast-import.sh                       | 20 ++++++
t/t9350-fast-export.sh                       | 75 +++++++++++++++++---
t/t9350/broken-iso-8859-7-commit-message.txt |  1 +
t/t9350/simple-iso-8859-7-commit-message.txt |  1 +
7 files changed, 142 insertions(+), 17 deletions(-)
create mode 100644 t/t9350/broken-iso-8859-7-commit-message.txt
create mode 100644 t/t9350/simple-iso-8859-7-commit-message.txt

Range-diff:
1:  9cc04242bd ! 1:  2d7bb64acf t9350: fix encoding test to actually test reencoding
    @@ -32,15 +32,26 @@
     -	git commit -s -m den file &&
     -	git fast-export wer^..wer >iso8859-1.fi &&
     -	sed "s/wer/i18n/" iso8859-1.fi |
    -+	git commit -s -m "$(printf "Pi: \360")" file &&
    ++	git commit -s -F "$TEST_DIRECTORY/t9350/simple-iso-8859-7-commit-message.txt" file &&
     +	git fast-export wer^..wer >iso-8859-7.fi &&
     +	sed "s/wer/i18n/" iso-8859-7.fi |
      		(cd new &&
      		 git fast-import &&
    ++		 # The commit object, if not re-encoded, would be 240 bytes.
    ++		 # Removing the "encoding iso-8859-7\n" header drops 20 bytes.
    ++		 # Re-encoding the Pi character from \xF0 in iso-8859-7 to
    ++		 # \xCF\x80 in utf-8 adds a byte.  Grepping for specific bytes
    ++		 # would be nice, but Windows apparently munges user data
    ++		 # in the form of bytes on the command line to force them to
    ++		 # be characters instead, so we are limited for portability
    ++		 # reasons in subsequent similar tests in this file to check
    ++		 # for size rather than what bytes are present.
    ++		 test 221 -eq "$(git cat-file -s i18n)" &&
    ++		 # Also make sure the commit does not have the "encoding" header
      		 git cat-file commit i18n >actual &&
     -		 grep "Áéí óú" actual)
     -
    -+		 grep $(printf "\317\200") actual)
    ++		 ! grep ^encoding actual)
      '
     +
      test_expect_success 'import/export-marks' '
    @@ -54,3 +65,11 @@
      	git checkout -b copy rein &&
      	git mv file file3 &&
      	git commit -m move1 &&
    +
    + diff --git a/t/t9350/simple-iso-8859-7-commit-message.txt b/t/t9350/simple-iso-8859-7-commit-message.txt
    + new file mode 100644
    + --- /dev/null
    + +++ b/t/t9350/simple-iso-8859-7-commit-message.txt
    +@@
    ++Pi: ð
    + \ No newline at end of file
2:  0cd023ac7a = 2:  9fa5695017 fast-import: support 'encoding' commit header
3:  1fddf51402 ! 3:  dfc76573e9 fast-export: avoid stripping encoding header if we cannot reencode
    @@ -35,7 +35,7 @@
      --- a/t/t9350-fast-export.sh
      +++ b/t/t9350-fast-export.sh
     @@
    - 		 grep $(printf "\317\200") actual)
    + 		 ! grep ^encoding actual)
      '
      
     +test_expect_success 'encoding preserved if reencoding fails' '
    @@ -43,15 +43,26 @@
     +	test_when_finished "git reset --hard HEAD~1" &&
     +	test_config i18n.commitencoding iso-8859-7 &&
     +	echo rosten >file &&
    -+	git commit -s -m "$(printf "Pi: \360; Invalid: \377")" file &&
    ++	git commit -s -F "$TEST_DIRECTORY/t9350/broken-iso-8859-7-commit-message.txt" file &&
     +	git fast-export wer^..wer >iso-8859-7.fi &&
     +	sed "s/wer/i18n-invalid/" iso-8859-7.fi |
     +		(cd new &&
     +		 git fast-import &&
     +		 git cat-file commit i18n-invalid >actual &&
    -+		 grep ^encoding actual)
    ++		 grep ^encoding actual &&
    ++		 # Also verify that the commit has the expected size; i.e.
    ++		 # that no bytes were re-encoded to a different encoding.
    ++		 test 252 -eq "$(git cat-file -s i18n-invalid)")
     +'
     +
      test_expect_success 'import/export-marks' '
      
      	git checkout -b marks master &&
    +
    + diff --git a/t/t9350/broken-iso-8859-7-commit-message.txt b/t/t9350/broken-iso-8859-7-commit-message.txt
    + new file mode 100644
    + --- /dev/null
    + +++ b/t/t9350/broken-iso-8859-7-commit-message.txt
    +@@
    ++Pi: ð; Invalid: ÿ
    + \ No newline at end of file
4:  4a2e04b3ae = 4:  83b3656b76 fast-export: differentiate between explicitly utf-8 and implicitly utf-8
5:  44aacb1a0b ! 5:  2063122293 fast-export: do automatic reencoding of commit messages only if requested
    @@ -95,14 +95,14 @@
      	test_config i18n.commitencoding iso-8859-7 &&
      	test_tick &&
      	echo rosten >file &&
    - 	git commit -s -m "$(printf "Pi: \360")" file &&
    + 	git commit -s -F "$TEST_DIRECTORY/t9350/simple-iso-8859-7-commit-message.txt" file &&
     -	git fast-export wer^..wer >iso-8859-7.fi &&
     +	git fast-export --reencode=yes wer^..wer >iso-8859-7.fi &&
      	sed "s/wer/i18n/" iso-8859-7.fi |
      		(cd new &&
      		 git fast-import &&
     @@
    - 		 grep $(printf "\317\200") actual)
    + 		 ! grep ^encoding actual)
      '
      
     +test_expect_success 'aborting on iso-8859-7' '
    @@ -110,7 +110,7 @@
     +	test_when_finished "git reset --hard HEAD~1" &&
     +	test_config i18n.commitencoding iso-8859-7 &&
     +	echo rosten >file &&
    -+	git commit -s -m "$(printf "Pi: \360")" file &&
    ++	git commit -s -F "$TEST_DIRECTORY/t9350/simple-iso-8859-7-commit-message.txt" file &&
     +	test_must_fail git fast-export --reencode=abort wer^..wer >iso-8859-7.fi
     +'
     +
    @@ -119,13 +119,21 @@
     +	test_when_finished "git reset --hard HEAD~1" &&
     +	test_config i18n.commitencoding iso-8859-7 &&
     +	echo rosten >file &&
    -+	git commit -s -m "$(printf "Pi: \360")" file &&
    ++	git commit -s -F "$TEST_DIRECTORY/t9350/simple-iso-8859-7-commit-message.txt" file &&
     +	git fast-export --reencode=no wer^..wer >iso-8859-7.fi &&
     +	sed "s/wer/i18n-no-recoding/" iso-8859-7.fi |
     +		(cd new &&
     +		 git fast-import &&
    ++		 # The commit object, if not re-encoded, is 240 bytes.
    ++		 # Removing the "encoding iso-8859-7\n" header would drops 20
    ++		 # bytes.  Re-encoding the Pi character from \xF0 in
    ++		 # iso-8859-7 to \xCF\x80 in utf-8 would add a byte.  I would
    ++		 # grep for the # specific bytes, but Windows lamely does not
    ++		 # allow that, so just search for the expected size.
    ++		 test 240 -eq "$(git cat-file -s i18n-no-recoding)" &&
    ++		 # Also make sure the commit has the "encoding" header
     +		 git cat-file commit i18n-no-recoding >actual &&
    -+		 grep $(printf "\360") actual)
    ++		 grep ^encoding actual)
     +'
     +
      test_expect_success 'encoding preserved if reencoding fails' '
    @@ -133,7 +141,7 @@
      	test_when_finished "git reset --hard HEAD~1" &&
      	test_config i18n.commitencoding iso-8859-7 &&
      	echo rosten >file &&
    - 	git commit -s -m "$(printf "Pi: \360; Invalid: \377")" file &&
    + 	git commit -s -F "$TEST_DIRECTORY/t9350/broken-iso-8859-7-commit-message.txt" file &&
     -	git fast-export wer^..wer >iso-8859-7.fi &&
     +	git fast-export --reencode=yes wer^..wer >iso-8859-7.fi &&
      	sed "s/wer/i18n-invalid/" iso-8859-7.fi |

-- 
2.21.0.782.g2063122293


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v3 1/5] t9350: fix encoding test to actually test reencoding
  2019-05-10 20:53 ` [PATCH v3 0/5] Fix and extend encoding handling in fast export/import Elijah Newren
@ 2019-05-10 20:53   ` Elijah Newren
  2019-05-10 20:53   ` [PATCH v3 2/5] fast-import: support 'encoding' commit header Elijah Newren
                     ` (5 subsequent siblings)
  6 siblings, 0 replies; 45+ messages in thread
From: Elijah Newren @ 2019-05-10 20:53 UTC (permalink / raw)
  To: gitster
  Cc: git, Eric Sunshine, Johannes Schindelin, Johannes Sixt,
	Elijah Newren

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=UTF-8, Size: 3169 bytes --]

This test used an author with non-ascii characters in the name, but
no special commit message.  It then grep'ed for those non-ascii
characters, but those are guaranteed to exist regardless of the
reencoding process since the reencoding only affects the commit message,
not the author or committer names.  As such, the test would work even if
the re-encoding process simply stripped the commit message entirely.
Modify the test to actually check that the reencoding in utf-8 worked.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 t/t9350-fast-export.sh                       | 29 +++++++++++++-------
 t/t9350/simple-iso-8859-7-commit-message.txt |  1 +
 2 files changed, 20 insertions(+), 10 deletions(-)
 create mode 100644 t/t9350/simple-iso-8859-7-commit-message.txt

diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
index 5690fe2810..ef9b1aa20b 100755
--- a/t/t9350-fast-export.sh
+++ b/t/t9350-fast-export.sh
@@ -94,22 +94,32 @@ test_expect_success 'fast-export --show-original-ids | git fast-import' '
 	test $MUSS = $(git rev-parse --verify refs/tags/muss)
 '
 
-test_expect_success 'iso-8859-1' '
+test_expect_success 'iso-8859-7' '
 
-	git config i18n.commitencoding ISO8859-1 &&
-	# use author and committer name in ISO-8859-1 to match it.
-	. "$TEST_DIRECTORY"/t3901/8859-1.txt &&
+	test_when_finished "git reset --hard HEAD~1" &&
+	test_config i18n.commitencoding iso-8859-7 &&
 	test_tick &&
 	echo rosten >file &&
-	git commit -s -m den file &&
-	git fast-export wer^..wer >iso8859-1.fi &&
-	sed "s/wer/i18n/" iso8859-1.fi |
+	git commit -s -F "$TEST_DIRECTORY/t9350/simple-iso-8859-7-commit-message.txt" file &&
+	git fast-export wer^..wer >iso-8859-7.fi &&
+	sed "s/wer/i18n/" iso-8859-7.fi |
 		(cd new &&
 		 git fast-import &&
+		 # The commit object, if not re-encoded, would be 240 bytes.
+		 # Removing the "encoding iso-8859-7\n" header drops 20 bytes.
+		 # Re-encoding the Pi character from \xF0 in iso-8859-7 to
+		 # \xCF\x80 in utf-8 adds a byte.  Grepping for specific bytes
+		 # would be nice, but Windows apparently munges user data
+		 # in the form of bytes on the command line to force them to
+		 # be characters instead, so we are limited for portability
+		 # reasons in subsequent similar tests in this file to check
+		 # for size rather than what bytes are present.
+		 test 221 -eq "$(git cat-file -s i18n)" &&
+		 # Also make sure the commit does not have the "encoding" header
 		 git cat-file commit i18n >actual &&
-		 grep "Áéí óú" actual)
-
+		 ! grep ^encoding actual)
 '
+
 test_expect_success 'import/export-marks' '
 
 	git checkout -b marks master &&
@@ -224,7 +234,6 @@ GIT_COMMITTER_NAME='C O Mitter'; export GIT_COMMITTER_NAME
 
 test_expect_success 'setup copies' '
 
-	git config --unset i18n.commitencoding &&
 	git checkout -b copy rein &&
 	git mv file file3 &&
 	git commit -m move1 &&
diff --git a/t/t9350/simple-iso-8859-7-commit-message.txt b/t/t9350/simple-iso-8859-7-commit-message.txt
new file mode 100644
index 0000000000..8b3f0c3dba
--- /dev/null
+++ b/t/t9350/simple-iso-8859-7-commit-message.txt
@@ -0,0 +1 @@
+Pi: ð
\ No newline at end of file
-- 
2.21.0.782.g2063122293


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 2/5] fast-import: support 'encoding' commit header
  2019-05-10 20:53 ` [PATCH v3 0/5] Fix and extend encoding handling in fast export/import Elijah Newren
  2019-05-10 20:53   ` [PATCH v3 1/5] t9350: fix encoding test to actually test reencoding Elijah Newren
@ 2019-05-10 20:53   ` Elijah Newren
  2019-05-10 20:53   ` [PATCH v3 3/5] fast-export: avoid stripping encoding header if we cannot reencode Elijah Newren
                     ` (4 subsequent siblings)
  6 siblings, 0 replies; 45+ messages in thread
From: Elijah Newren @ 2019-05-10 20:53 UTC (permalink / raw)
  To: gitster
  Cc: git, Eric Sunshine, Johannes Schindelin, Johannes Sixt,
	Elijah Newren

Since git supports commit messages with an encoding other than utf-8,
allow fast-import to import such commits.  This may be useful for folks
who do not want to reencode commit messages from an external system, and
may also be useful to achieve reversible history rewrites (e.g. sha1sum
<-> sha256sum transitions or subtree work) with git repositories that
have used specialized encodings in their commit history.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 Documentation/git-fast-import.txt |  7 +++++++
 fast-import.c                     | 11 +++++++++--
 t/t9300-fast-import.sh            | 20 ++++++++++++++++++++
 3 files changed, 36 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-fast-import.txt b/Documentation/git-fast-import.txt
index d65cdb3d08..7baf9e47b5 100644
--- a/Documentation/git-fast-import.txt
+++ b/Documentation/git-fast-import.txt
@@ -388,6 +388,7 @@ change to the project.
 	original-oid?
 	('author' (SP <name>)? SP LT <email> GT SP <when> LF)?
 	'committer' (SP <name>)? SP LT <email> GT SP <when> LF
+	('encoding' SP <encoding>)?
 	data
 	('from' SP <commit-ish> LF)?
 	('merge' SP <commit-ish> LF)?
@@ -455,6 +456,12 @@ that was selected by the --date-format=<fmt> command-line option.
 See ``Date Formats'' above for the set of supported formats, and
 their syntax.
 
+`encoding`
+^^^^^^^^^^
+The optional `encoding` command indicates the encoding of the commit
+message.  Most commits are UTF-8 and the encoding is omitted, but this
+allows importing commit messages into git without first reencoding them.
+
 `from`
 ^^^^^^
 The `from` command is used to specify the commit to initialize
diff --git a/fast-import.c b/fast-import.c
index f38d04fa58..76a7bd3699 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -2585,6 +2585,7 @@ static void parse_new_commit(const char *arg)
 	struct branch *b;
 	char *author = NULL;
 	char *committer = NULL;
+	const char *encoding = NULL;
 	struct hash_list *merge_list = NULL;
 	unsigned int merge_count;
 	unsigned char prev_fanout, new_fanout;
@@ -2607,6 +2608,8 @@ static void parse_new_commit(const char *arg)
 	}
 	if (!committer)
 		die("Expected committer but didn't get one");
+	if (skip_prefix(command_buf.buf, "encoding ", &encoding))
+		read_next_command();
 	parse_data(&msg, 0, NULL);
 	read_next_command();
 	parse_from(b);
@@ -2670,9 +2673,13 @@ static void parse_new_commit(const char *arg)
 	}
 	strbuf_addf(&new_data,
 		"author %s\n"
-		"committer %s\n"
-		"\n",
+		"committer %s\n",
 		author ? author : committer, committer);
+	if (encoding)
+		strbuf_addf(&new_data,
+			"encoding %s\n",
+			encoding);
+	strbuf_addch(&new_data, '\n');
 	strbuf_addbuf(&new_data, &msg);
 	free(author);
 	free(committer);
diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
index 3668263c40..141b7fa35e 100755
--- a/t/t9300-fast-import.sh
+++ b/t/t9300-fast-import.sh
@@ -3299,4 +3299,24 @@ test_expect_success !MINGW 'W: get-mark & empty orphan commit with erroneous thi
 	sed -e s/LFs/LLL/ W-input | tr L "\n" | test_must_fail git fast-import
 '
 
+###
+### series X (other new features)
+###
+
+test_expect_success 'X: handling encoding' '
+	test_tick &&
+	cat >input <<-INPUT_END &&
+	commit refs/heads/encoding
+	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+	encoding iso-8859-7
+	data <<COMMIT
+	INPUT_END
+
+	printf "Pi: \360\nCOMMIT\n" >>input &&
+
+	git fast-import <input &&
+	git cat-file -p encoding | grep $(printf "\360") &&
+	git log -1 --format=%B encoding | grep $(printf "\317\200")
+'
+
 test_done
-- 
2.21.0.782.g2063122293


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 3/5] fast-export: avoid stripping encoding header if we cannot reencode
  2019-05-10 20:53 ` [PATCH v3 0/5] Fix and extend encoding handling in fast export/import Elijah Newren
  2019-05-10 20:53   ` [PATCH v3 1/5] t9350: fix encoding test to actually test reencoding Elijah Newren
  2019-05-10 20:53   ` [PATCH v3 2/5] fast-import: support 'encoding' commit header Elijah Newren
@ 2019-05-10 20:53   ` Elijah Newren
  2019-05-10 20:53   ` [PATCH v3 4/5] fast-export: differentiate between explicitly utf-8 and implicitly utf-8 Elijah Newren
                     ` (3 subsequent siblings)
  6 siblings, 0 replies; 45+ messages in thread
From: Elijah Newren @ 2019-05-10 20:53 UTC (permalink / raw)
  To: gitster
  Cc: git, Eric Sunshine, Johannes Schindelin, Johannes Sixt,
	Elijah Newren

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=UTF-8, Size: 2871 bytes --]

When fast-export encounters a commit with an 'encoding' header, it tries
to reencode in utf-8 and then drops the encoding header.  However, if it
fails to reencode in utf-8 because e.g. one of the characters in the
commit message was invalid in the old encoding, then we need to retain
the original encoding or otherwise we lose information needed to
understand all the other (valid) characters in the original commit
message.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 builtin/fast-export.c                        |  7 +++++--
 t/t9350-fast-export.sh                       | 17 +++++++++++++++++
 t/t9350/broken-iso-8859-7-commit-message.txt |  1 +
 3 files changed, 23 insertions(+), 2 deletions(-)
 create mode 100644 t/t9350/broken-iso-8859-7-commit-message.txt

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 9e283482ef..7734a9f5a5 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -642,9 +642,12 @@ static void handle_commit(struct commit *commit, struct rev_info *rev,
 	printf("commit %s\nmark :%"PRIu32"\n", refname, last_idnum);
 	if (show_original_ids)
 		printf("original-oid %s\n", oid_to_hex(&commit->object.oid));
-	printf("%.*s\n%.*s\ndata %u\n%s",
+	printf("%.*s\n%.*s\n",
 	       (int)(author_end - author), author,
-	       (int)(committer_end - committer), committer,
+	       (int)(committer_end - committer), committer);
+	if (!reencoded && encoding)
+		printf("encoding %s\n", encoding);
+	printf("data %u\n%s",
 	       (unsigned)(reencoded
 			  ? strlen(reencoded) : message
 			  ? strlen(message) : 0),
diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
index ef9b1aa20b..fa124f4842 100755
--- a/t/t9350-fast-export.sh
+++ b/t/t9350-fast-export.sh
@@ -120,6 +120,23 @@ test_expect_success 'iso-8859-7' '
 		 ! grep ^encoding actual)
 '
 
+test_expect_success 'encoding preserved if reencoding fails' '
+
+	test_when_finished "git reset --hard HEAD~1" &&
+	test_config i18n.commitencoding iso-8859-7 &&
+	echo rosten >file &&
+	git commit -s -F "$TEST_DIRECTORY/t9350/broken-iso-8859-7-commit-message.txt" file &&
+	git fast-export wer^..wer >iso-8859-7.fi &&
+	sed "s/wer/i18n-invalid/" iso-8859-7.fi |
+		(cd new &&
+		 git fast-import &&
+		 git cat-file commit i18n-invalid >actual &&
+		 grep ^encoding actual &&
+		 # Also verify that the commit has the expected size; i.e.
+		 # that no bytes were re-encoded to a different encoding.
+		 test 252 -eq "$(git cat-file -s i18n-invalid)")
+'
+
 test_expect_success 'import/export-marks' '
 
 	git checkout -b marks master &&
diff --git a/t/t9350/broken-iso-8859-7-commit-message.txt b/t/t9350/broken-iso-8859-7-commit-message.txt
new file mode 100644
index 0000000000..d06ad75b44
--- /dev/null
+++ b/t/t9350/broken-iso-8859-7-commit-message.txt
@@ -0,0 +1 @@
+Pi: ð; Invalid: ÿ
\ No newline at end of file
-- 
2.21.0.782.g2063122293


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 4/5] fast-export: differentiate between explicitly utf-8 and implicitly utf-8
  2019-05-10 20:53 ` [PATCH v3 0/5] Fix and extend encoding handling in fast export/import Elijah Newren
                     ` (2 preceding siblings ...)
  2019-05-10 20:53   ` [PATCH v3 3/5] fast-export: avoid stripping encoding header if we cannot reencode Elijah Newren
@ 2019-05-10 20:53   ` Elijah Newren
  2019-05-10 20:53   ` [PATCH v3 5/5] fast-export: do automatic reencoding of commit messages only if requested Elijah Newren
                     ` (2 subsequent siblings)
  6 siblings, 0 replies; 45+ messages in thread
From: Elijah Newren @ 2019-05-10 20:53 UTC (permalink / raw)
  To: gitster
  Cc: git, Eric Sunshine, Johannes Schindelin, Johannes Sixt,
	Elijah Newren

The find_encoding() function returned the encoding used by a commit
message, returning a default of git_commit_encoding (usually utf-8).
Although the current code does not differentiate between a commit which
explicitly requested utf-8 and one where we just assume utf-8 because no
encoding is set, it will become important when we try to preserve the
encoding header.  Since is_encoding_utf8() returns true when passed
NULL, we can just return NULL from find_encoding() instead of returning
git_commit_encoding.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 builtin/fast-export.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 7734a9f5a5..66331fa401 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -453,7 +453,7 @@ static const char *find_encoding(const char *begin, const char *end)
 	bol = memmem(begin, end ? end - begin : strlen(begin),
 		     needle, strlen(needle));
 	if (!bol)
-		return git_commit_encoding;
+		return NULL;
 	bol += strlen(needle);
 	eol = strchrnul(bol, '\n');
 	*eol = '\0';
-- 
2.21.0.782.g2063122293


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v3 5/5] fast-export: do automatic reencoding of commit messages only if requested
  2019-05-10 20:53 ` [PATCH v3 0/5] Fix and extend encoding handling in fast export/import Elijah Newren
                     ` (3 preceding siblings ...)
  2019-05-10 20:53   ` [PATCH v3 4/5] fast-export: differentiate between explicitly utf-8 and implicitly utf-8 Elijah Newren
@ 2019-05-10 20:53   ` Elijah Newren
  2019-05-11 21:07     ` Torsten Bögershausen
  2019-05-13 10:14   ` [PATCH v3 0/5] Fix and extend encoding handling in fast export/import Johannes Schindelin
  2019-05-13 16:47   ` [PATCH v4 " Elijah Newren
  6 siblings, 1 reply; 45+ messages in thread
From: Elijah Newren @ 2019-05-10 20:53 UTC (permalink / raw)
  To: gitster
  Cc: git, Eric Sunshine, Johannes Schindelin, Johannes Sixt,
	Elijah Newren

Automatic re-encoding of commit messages (and dropping of the encoding
header) hurts attempts to do reversible history rewrites (e.g. sha1sum
<-> sha256sum transitions, some subtree rewrites), and seems
inconsistent with the general principle followed elsewhere in
fast-export of requiring explicit user requests to modify the output
(e.g. --signed-tags=strip, --tag-of-filtered-object=rewrite).  Add a
--reencode flag that the user can use to specify, and like other
fast-export flags, default it to 'abort'.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 builtin/fast-export.c  | 35 ++++++++++++++++++++++++++++++++---
 t/t9350-fast-export.sh | 37 ++++++++++++++++++++++++++++++++++---
 2 files changed, 66 insertions(+), 6 deletions(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 66331fa401..43cc52331c 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -33,6 +33,7 @@ static const char *fast_export_usage[] = {
 static int progress;
 static enum { SIGNED_TAG_ABORT, VERBATIM, WARN, WARN_STRIP, STRIP } signed_tag_mode = SIGNED_TAG_ABORT;
 static enum { TAG_FILTERING_ABORT, DROP, REWRITE } tag_of_filtered_mode = TAG_FILTERING_ABORT;
+static enum { REENCODE_ABORT, REENCODE_PLEASE, REENCODE_NEVER } reencode_mode = REENCODE_ABORT;
 static int fake_missing_tagger;
 static int use_done_feature;
 static int no_data;
@@ -77,6 +78,20 @@ static int parse_opt_tag_of_filtered_mode(const struct option *opt,
 	return 0;
 }
 
+static int parse_opt_reencode_mode(const struct option *opt,
+				   const char *arg, int unset)
+{
+	if (unset || !strcmp(arg, "abort"))
+		reencode_mode = REENCODE_ABORT;
+	else if (!strcmp(arg, "yes"))
+		reencode_mode = REENCODE_PLEASE;
+	else if (!strcmp(arg, "no"))
+		reencode_mode = REENCODE_NEVER;
+	else
+		return error("Unknown reencoding mode: %s", arg);
+	return 0;
+}
+
 static struct decoration idnums;
 static uint32_t last_idnum;
 
@@ -633,10 +648,21 @@ static void handle_commit(struct commit *commit, struct rev_info *rev,
 	}
 
 	mark_next_object(&commit->object);
-	if (anonymize)
+	if (anonymize) {
 		reencoded = anonymize_commit_message(message);
-	else if (!is_encoding_utf8(encoding))
-		reencoded = reencode_string(message, "UTF-8", encoding);
+	} else if (encoding) {
+		switch(reencode_mode) {
+		case REENCODE_PLEASE:
+			reencoded = reencode_string(message, "UTF-8", encoding);
+			break;
+		case REENCODE_NEVER:
+			break;
+		case REENCODE_ABORT:
+			die("Encountered commit-specific encoding %s in commit "
+			    "%s; use --reencode=<mode> to handle it",
+			    encoding, oid_to_hex(&commit->object.oid));
+		}
+	}
 	if (!commit->parents)
 		printf("reset %s\n", refname);
 	printf("commit %s\nmark :%"PRIu32"\n", refname, last_idnum);
@@ -1091,6 +1117,9 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix)
 		OPT_CALLBACK(0, "tag-of-filtered-object", &tag_of_filtered_mode, N_("mode"),
 			     N_("select handling of tags that tag filtered objects"),
 			     parse_opt_tag_of_filtered_mode),
+		OPT_CALLBACK(0, "reencode", &reencode_mode, N_("mode"),
+			     N_("select handling of commit messages in an alternate encoding"),
+			     parse_opt_reencode_mode),
 		OPT_STRING(0, "export-marks", &export_filename, N_("file"),
 			     N_("Dump marks to this file")),
 		OPT_STRING(0, "import-marks", &import_filename, N_("file"),
diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
index fa124f4842..3cf4c7fc0c 100755
--- a/t/t9350-fast-export.sh
+++ b/t/t9350-fast-export.sh
@@ -94,14 +94,14 @@ test_expect_success 'fast-export --show-original-ids | git fast-import' '
 	test $MUSS = $(git rev-parse --verify refs/tags/muss)
 '
 
-test_expect_success 'iso-8859-7' '
+test_expect_success 'reencoding iso-8859-7' '
 
 	test_when_finished "git reset --hard HEAD~1" &&
 	test_config i18n.commitencoding iso-8859-7 &&
 	test_tick &&
 	echo rosten >file &&
 	git commit -s -F "$TEST_DIRECTORY/t9350/simple-iso-8859-7-commit-message.txt" file &&
-	git fast-export wer^..wer >iso-8859-7.fi &&
+	git fast-export --reencode=yes wer^..wer >iso-8859-7.fi &&
 	sed "s/wer/i18n/" iso-8859-7.fi |
 		(cd new &&
 		 git fast-import &&
@@ -120,13 +120,44 @@ test_expect_success 'iso-8859-7' '
 		 ! grep ^encoding actual)
 '
 
+test_expect_success 'aborting on iso-8859-7' '
+
+	test_when_finished "git reset --hard HEAD~1" &&
+	test_config i18n.commitencoding iso-8859-7 &&
+	echo rosten >file &&
+	git commit -s -F "$TEST_DIRECTORY/t9350/simple-iso-8859-7-commit-message.txt" file &&
+	test_must_fail git fast-export --reencode=abort wer^..wer >iso-8859-7.fi
+'
+
+test_expect_success 'preserving iso-8859-7' '
+
+	test_when_finished "git reset --hard HEAD~1" &&
+	test_config i18n.commitencoding iso-8859-7 &&
+	echo rosten >file &&
+	git commit -s -F "$TEST_DIRECTORY/t9350/simple-iso-8859-7-commit-message.txt" file &&
+	git fast-export --reencode=no wer^..wer >iso-8859-7.fi &&
+	sed "s/wer/i18n-no-recoding/" iso-8859-7.fi |
+		(cd new &&
+		 git fast-import &&
+		 # The commit object, if not re-encoded, is 240 bytes.
+		 # Removing the "encoding iso-8859-7\n" header would drops 20
+		 # bytes.  Re-encoding the Pi character from \xF0 in
+		 # iso-8859-7 to \xCF\x80 in utf-8 would add a byte.  I would
+		 # grep for the # specific bytes, but Windows lamely does not
+		 # allow that, so just search for the expected size.
+		 test 240 -eq "$(git cat-file -s i18n-no-recoding)" &&
+		 # Also make sure the commit has the "encoding" header
+		 git cat-file commit i18n-no-recoding >actual &&
+		 grep ^encoding actual)
+'
+
 test_expect_success 'encoding preserved if reencoding fails' '
 
 	test_when_finished "git reset --hard HEAD~1" &&
 	test_config i18n.commitencoding iso-8859-7 &&
 	echo rosten >file &&
 	git commit -s -F "$TEST_DIRECTORY/t9350/broken-iso-8859-7-commit-message.txt" file &&
-	git fast-export wer^..wer >iso-8859-7.fi &&
+	git fast-export --reencode=yes wer^..wer >iso-8859-7.fi &&
 	sed "s/wer/i18n-invalid/" iso-8859-7.fi |
 		(cd new &&
 		 git fast-import &&
-- 
2.21.0.782.g2063122293


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 5/5] fast-export: do automatic reencoding of commit messages only if requested
  2019-05-10 20:53   ` [PATCH v3 5/5] fast-export: do automatic reencoding of commit messages only if requested Elijah Newren
@ 2019-05-11 21:07     ` Torsten Bögershausen
  2019-05-11 21:42       ` Elijah Newren
  0 siblings, 1 reply; 45+ messages in thread
From: Torsten Bögershausen @ 2019-05-11 21:07 UTC (permalink / raw)
  To: Elijah Newren
  Cc: gitster, git, Eric Sunshine, Johannes Schindelin, Johannes Sixt

On Fri, May 10, 2019 at 01:53:35PM -0700, Elijah Newren wrote:
> Automatic re-encoding of commit messages (and dropping of the encoding
> header) hurts attempts to do reversible history rewrites (e.g. sha1sum
> <-> sha256sum transitions, some subtree rewrites), and seems
> inconsistent with the general principle followed elsewhere in
> fast-export of requiring explicit user requests to modify the output
> (e.g. --signed-tags=strip, --tag-of-filtered-object=rewrite).  Add a
> --reencode flag that the user can use to specify, and like other
> fast-export flags, default it to 'abort'.
>
> Signed-off-by: Elijah Newren <newren@gmail.com>
> ---
>  builtin/fast-export.c  | 35 ++++++++++++++++++++++++++++++++---
>  t/t9350-fast-export.sh | 37 ++++++++++++++++++++++++++++++++++---
>  2 files changed, 66 insertions(+), 6 deletions(-)
>
> diff --git a/builtin/fast-export.c b/builtin/fast-export.c
> index 66331fa401..43cc52331c 100644
> --- a/builtin/fast-export.c
> +++ b/builtin/fast-export.c
> @@ -33,6 +33,7 @@ static const char *fast_export_usage[] = {
>  static int progress;
>  static enum { SIGNED_TAG_ABORT, VERBATIM, WARN, WARN_STRIP, STRIP } signed_tag_mode = SIGNED_TAG_ABORT;
>  static enum { TAG_FILTERING_ABORT, DROP, REWRITE } tag_of_filtered_mode = TAG_FILTERING_ABORT;
> +static enum { REENCODE_ABORT, REENCODE_PLEASE, REENCODE_NEVER } reencode_mode = REENCODE_ABORT;
>  static int fake_missing_tagger;
>  static int use_done_feature;
>  static int no_data;
> @@ -77,6 +78,20 @@ static int parse_opt_tag_of_filtered_mode(const struct option *opt,
>  	return 0;
>  }
>
> +static int parse_opt_reencode_mode(const struct option *opt,
> +				   const char *arg, int unset)
> +{

This one is good:
> +	if (unset || !strcmp(arg, "abort"))
> +		reencode_mode = REENCODE_ABORT;

But here: does it make sense to use REENCODE_YES/NO to be more consistant ?
> +	else if (!strcmp(arg, "yes"))
> +		reencode_mode = REENCODE_PLEASE;
> +	else if (!strcmp(arg, "no"))
> +		reencode_mode = REENCODE_NEVER;
> +	else
> +		return error("Unknown reencoding mode: %s", arg);
> +	return 0;
> +}
> +
>  static struct decoration idnums;
>  static uint32_t last_idnum;
>
> @@ -633,10 +648,21 @@ static void handle_commit(struct commit *commit, struct rev_info *rev,
>  	}
>
>  	mark_next_object(&commit->object);
> -	if (anonymize)
> +	if (anonymize) {
>  		reencoded = anonymize_commit_message(message);
> -	else if (!is_encoding_utf8(encoding))
> -		reencoded = reencode_string(message, "UTF-8", encoding);
> +	} else if (encoding) {
> +		switch(reencode_mode) {
> +		case REENCODE_PLEASE:
> +			reencoded = reencode_string(message, "UTF-8", encoding);
> +			break;
> +		case REENCODE_NEVER:
> +			break;
> +		case REENCODE_ABORT:
> +			die("Encountered commit-specific encoding %s in commit "
> +			    "%s; use --reencode=<mode> to handle it",
Should we be more helpfull and say !use --reencode=[yes|no] to handle it ?

> +			    encoding, oid_to_hex(&commit->object.oid));
> +		}
> +	}
>  	if (!commit->parents)
>  		printf("reset %s\n", refname);
>  	printf("commit %s\nmark :%"PRIu32"\n", refname, last_idnum);
> @@ -1091,6 +1117,9 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix)
>  		OPT_CALLBACK(0, "tag-of-filtered-object", &tag_of_filtered_mode, N_("mode"),
>  			     N_("select handling of tags that tag filtered objects"),
>  			     parse_opt_tag_of_filtered_mode),
> +		OPT_CALLBACK(0, "reencode", &reencode_mode, N_("mode"),
> +			     N_("select handling of commit messages in an alternate encoding"),
> +			     parse_opt_reencode_mode),
>  		OPT_STRING(0, "export-marks", &export_filename, N_("file"),
>  			     N_("Dump marks to this file")),
>  		OPT_STRING(0, "import-marks", &import_filename, N_("file"),
> diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
> index fa124f4842..3cf4c7fc0c 100755
> --- a/t/t9350-fast-export.sh
> +++ b/t/t9350-fast-export.sh
> @@ -94,14 +94,14 @@ test_expect_success 'fast-export --show-original-ids | git fast-import' '
>  	test $MUSS = $(git rev-parse --verify refs/tags/muss)
>  '
>
> -test_expect_success 'iso-8859-7' '
> +test_expect_success 'reencoding iso-8859-7' '
>
>  	test_when_finished "git reset --hard HEAD~1" &&
>  	test_config i18n.commitencoding iso-8859-7 &&
>  	test_tick &&
>  	echo rosten >file &&
>  	git commit -s -F "$TEST_DIRECTORY/t9350/simple-iso-8859-7-commit-message.txt" file &&
> -	git fast-export wer^..wer >iso-8859-7.fi &&
> +	git fast-export --reencode=yes wer^..wer >iso-8859-7.fi &&
>  	sed "s/wer/i18n/" iso-8859-7.fi |
>  		(cd new &&
>  		 git fast-import &&
> @@ -120,13 +120,44 @@ test_expect_success 'iso-8859-7' '
>  		 ! grep ^encoding actual)
>  '
>
> +test_expect_success 'aborting on iso-8859-7' '
> +
> +	test_when_finished "git reset --hard HEAD~1" &&
> +	test_config i18n.commitencoding iso-8859-7 &&
> +	echo rosten >file &&
> +	git commit -s -F "$TEST_DIRECTORY/t9350/simple-iso-8859-7-commit-message.txt" file &&
> +	test_must_fail git fast-export --reencode=abort wer^..wer >iso-8859-7.fi
> +'
> +
> +test_expect_success 'preserving iso-8859-7' '
> +
> +	test_when_finished "git reset --hard HEAD~1" &&
> +	test_config i18n.commitencoding iso-8859-7 &&
> +	echo rosten >file &&
> +	git commit -s -F "$TEST_DIRECTORY/t9350/simple-iso-8859-7-commit-message.txt" file &&
> +	git fast-export --reencode=no wer^..wer >iso-8859-7.fi &&
> +	sed "s/wer/i18n-no-recoding/" iso-8859-7.fi |
> +		(cd new &&
> +		 git fast-import &&
> +		 # The commit object, if not re-encoded, is 240 bytes.
> +		 # Removing the "encoding iso-8859-7\n" header would drops 20
> +		 # bytes.  Re-encoding the Pi character from \xF0 in
> +		 # iso-8859-7 to \xCF\x80 in utf-8 would add a byte.  I would
> +		 # grep for the # specific bytes, but Windows lamely does not
This is somewhat unclear to me. What does Windows not allow ?

> +		 # allow that, so just search for the expected size.
> +		 test 240 -eq "$(git cat-file -s i18n-no-recoding)" &&
> +		 # Also make sure the commit has the "encoding" header
> +		 git cat-file commit i18n-no-recoding >actual &&
> +		 grep ^encoding actual)
> +'
> +
>  test_expect_success 'encoding preserved if reencoding fails' '
>
>  	test_when_finished "git reset --hard HEAD~1" &&
>  	test_config i18n.commitencoding iso-8859-7 &&
>  	echo rosten >file &&
>  	git commit -s -F "$TEST_DIRECTORY/t9350/broken-iso-8859-7-commit-message.txt" file &&
> -	git fast-export wer^..wer >iso-8859-7.fi &&
> +	git fast-export --reencode=yes wer^..wer >iso-8859-7.fi &&
>  	sed "s/wer/i18n-invalid/" iso-8859-7.fi |
>  		(cd new &&
>  		 git fast-import &&
> --
> 2.21.0.782.g2063122293
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 5/5] fast-export: do automatic reencoding of commit messages only if requested
  2019-05-11 21:07     ` Torsten Bögershausen
@ 2019-05-11 21:42       ` Elijah Newren
  2019-05-13  7:48         ` Junio C Hamano
  2019-05-13 10:23         ` Johannes Schindelin
  0 siblings, 2 replies; 45+ messages in thread
From: Elijah Newren @ 2019-05-11 21:42 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: Junio C Hamano, Git Mailing List, Eric Sunshine,
	Johannes Schindelin, Johannes Sixt

On Sat, May 11, 2019 at 2:07 PM Torsten Bögershausen <tboegi@web.de> wrote:
> On Fri, May 10, 2019 at 01:53:35PM -0700, Elijah Newren wrote:

> This one is good:
> > +     if (unset || !strcmp(arg, "abort"))
> > +             reencode_mode = REENCODE_ABORT;
>
> But here: does it make sense to use REENCODE_YES/NO to be more consistant ?
> > +     else if (!strcmp(arg, "yes"))
> > +             reencode_mode = REENCODE_PLEASE;
> > +     else if (!strcmp(arg, "no"))
> > +             reencode_mode = REENCODE_NEVER;

Didn't realize there was any such convention, and even have difficulty
finding it with grep (CONTAINS_{YES,NO} appears to be the only example
I can find), but the alternate wording seems fine; I'm happy to adopt
it.

> > +             case REENCODE_ABORT:
> > +                     die("Encountered commit-specific encoding %s in commit "
> > +                         "%s; use --reencode=<mode> to handle it",
> Should we be more helpfull and say !use --reencode=[yes|no] to handle it ?

Sounds like a good idea; I'll adjust it.


> > +     sed "s/wer/i18n-no-recoding/" iso-8859-7.fi |
> > +             (cd new &&
> > +              git fast-import &&
> > +              # The commit object, if not re-encoded, is 240 bytes.
> > +              # Removing the "encoding iso-8859-7\n" header would drops 20
> > +              # bytes.  Re-encoding the Pi character from \xF0 in
> > +              # iso-8859-7 to \xCF\x80 in utf-8 would add a byte.  I would
> > +              # grep for the specific bytes, but Windows lamely does not
> This is somewhat unclear to me. What does Windows not allow ?
> > +              # allow that, so just search for the expected size.
> > +              test 240 -eq "$(git cat-file -s i18n-no-recoding)" &&
> > +              # Also make sure the commit has the "encoding" header
> > +              git cat-file commit i18n-no-recoding >actual &&
> > +              grep ^encoding actual)
> > +'

Windows does not allow specifying the bytes I want to grep for on the
command line; it'll munge the command line, resulting in it searching
for something other than what I wanted to be searched for, and return
the wrong answer based on searching for the wrong thing.  See
https://public-inbox.org/git/f8eb246f-a936-e9df-4bb4-068b86a62baf@kdbg.org/
and https://public-inbox.org/git/nycvar.QRO.7.76.6.1905101551110.44@tvgsbejvaqbjf.bet/.

My comment was already pretty long because it looks like a crazy way
to run the test and thus it feels like I need to explain it.  And the
craziness is based on how Windows behaves; it seems insane to me that
Windows decides to munge user data (in the form of the command line
provided), so much so that it makes me wonder if I really understood
Hannes' and Dscho's explanations of what it is doing.  (How could
anyone have thought munging user data was a good idea?)  Anyway, long
story short, I'm not sure how to explain it correctly and succinctly.
Any suggestions?


Elijah

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 5/5] fast-export: do automatic reencoding of commit messages only if requested
  2019-05-11 21:42       ` Elijah Newren
@ 2019-05-13  7:48         ` Junio C Hamano
  2019-05-13 13:24           ` Elijah Newren
  2019-05-13 10:23         ` Johannes Schindelin
  1 sibling, 1 reply; 45+ messages in thread
From: Junio C Hamano @ 2019-05-13  7:48 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Torsten Bögershausen, Git Mailing List, Eric Sunshine,
	Johannes Schindelin, Johannes Sixt

Elijah Newren <newren@gmail.com> writes:

> On Sat, May 11, 2019 at 2:07 PM Torsten Bögershausen <tboegi@web.de> wrote:
>> On Fri, May 10, 2019 at 01:53:35PM -0700, Elijah Newren wrote:
>
>> This one is good:
>> > +     if (unset || !strcmp(arg, "abort"))
>> > +             reencode_mode = REENCODE_ABORT;
>>
>> But here: does it make sense to use REENCODE_YES/NO to be more consistant ?
>> > +     else if (!strcmp(arg, "yes"))
>> > +             reencode_mode = REENCODE_PLEASE;
>> > +     else if (!strcmp(arg, "no"))
>> > +             reencode_mode = REENCODE_NEVER;
>
> Didn't realize there was any such convention, and even have difficulty
> finding it with grep (CONTAINS_{YES,NO} appears to be the only example
> I can find), but the alternate wording seems fine; I'm happy to adopt
> it.

I am OK with Yes/No.  

Don't we want to treat this as "bool or literal 'abort'", though?
Other options that are "bool or something else" tend to accept
"true" as a synonym for "yes", and I am wondering if we want to
follow suit here, too.

Thanks.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 0/5] Fix and extend encoding handling in fast export/import
  2019-05-10 20:53 ` [PATCH v3 0/5] Fix and extend encoding handling in fast export/import Elijah Newren
                     ` (4 preceding siblings ...)
  2019-05-10 20:53   ` [PATCH v3 5/5] fast-export: do automatic reencoding of commit messages only if requested Elijah Newren
@ 2019-05-13 10:14   ` Johannes Schindelin
  2019-05-13 16:47   ` [PATCH v4 " Elijah Newren
  6 siblings, 0 replies; 45+ messages in thread
From: Johannes Schindelin @ 2019-05-13 10:14 UTC (permalink / raw)
  To: Elijah Newren; +Cc: gitster, git, Eric Sunshine, Johannes Sixt

Hi Elijah,

On Fri, 10 May 2019, Elijah Newren wrote:

>   * Modified the testcases to pass on Windows[1], as verified via
>     gitgitgadget pull request[2].  Required adding a couple new files
>     (which store the desired bytes) and checking the size of the output
>     instead of checking for particular bytes (but the lengths of the
>     expected byte sequences differ so this works fine...).
>
> [1] Failures of previous patchset on Windows noticed and reported by Dscho;
>     explanation from Hannes is that Windows munges users' command lines to
>     force them to be characters instead of bytes.
> [2] https://github.com/gitgitgadget/git/pull/187

Excellent! Thank you!

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 5/5] fast-export: do automatic reencoding of commit messages only if requested
  2019-05-11 21:42       ` Elijah Newren
  2019-05-13  7:48         ` Junio C Hamano
@ 2019-05-13 10:23         ` Johannes Schindelin
  2019-05-13 12:56           ` Torsten Bögershausen
  2019-05-13 16:41           ` Elijah Newren
  1 sibling, 2 replies; 45+ messages in thread
From: Johannes Schindelin @ 2019-05-13 10:23 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Torsten Bögershausen, Junio C Hamano, Git Mailing List,
	Eric Sunshine, Johannes Sixt

Hi Elijah,

On Sat, 11 May 2019, Elijah Newren wrote:

> [...] the craziness is based on how Windows behaves; it seems insane to
> me that Windows decides to munge user data (in the form of the command
> line provided), so much so that it makes me wonder if I really
> understood Hannes' and Dscho's explanations of what it is doing.

It is not the user data that is munged by *Windows*, but by *Git for
Windows*. The user data on Windows is encoded in UTF-16 (or some slight
variant thereof). Git *cannot* handle UTF-16. Git's test suite *cannot*
handle UTF-16. So we convert. That's all there is to it.

Ciao,
Dscho

P.S.: Of course it is not *all* there is to it. There is also a current
code page which depends on the current user's current locale. We can
definitely not rely on that, as Git has no idea about this and would quite
positively produce incorrect output because of it. So we really just use
the `*W()` functions of the Win32 API (i.e. the ones accepting wide
Unicode characters and strings, i.e. UTF-16). I don't think we can do
better than that.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 5/5] fast-export: do automatic reencoding of commit messages only if requested
  2019-05-13 10:23         ` Johannes Schindelin
@ 2019-05-13 12:56           ` Torsten Bögershausen
  2019-05-13 13:29             ` Elijah Newren
  2019-05-13 16:41           ` Elijah Newren
  1 sibling, 1 reply; 45+ messages in thread
From: Torsten Bögershausen @ 2019-05-13 12:56 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Elijah Newren, Junio C Hamano, Git Mailing List, Eric Sunshine,
	Johannes Sixt

On Mon, May 13, 2019 at 12:23:29PM +0200, Johannes Schindelin wrote:
> Hi Elijah,
>
> On Sat, 11 May 2019, Elijah Newren wrote:
>
> > [...] the craziness is based on how Windows behaves; it seems insane to
> > me that Windows decides to munge user data (in the form of the command
> > line provided), so much so that it makes me wonder if I really
> > understood Hannes' and Dscho's explanations of what it is doing.
>
> It is not the user data that is munged by *Windows*, but by *Git for
> Windows*. The user data on Windows is encoded in UTF-16 (or some slight
> variant thereof). Git *cannot* handle UTF-16. Git's test suite *cannot*
> handle UTF-16. So we convert. That's all there is to it.
>
> Ciao,
> Dscho
>
> P.S.: Of course it is not *all* there is to it. There is also a current
> code page which depends on the current user's current locale. We can
> definitely not rely on that, as Git has no idea about this and would quite
> positively produce incorrect output because of it. So we really just use
> the `*W()` functions of the Win32 API (i.e. the ones accepting wide
> Unicode characters and strings, i.e. UTF-16). I don't think we can do
> better than that.

We can actuall feed valid UTF-8 into a test case.
(Remember that shell scripts need this octal numbering, see
t/t0050)


See the "ä" code point:
$ auml=$(printf '\303\244')
$ printf $auml
ä


Now we can feed those 2 bytes (wich are valid UTF) into
Git and say "convert them from ISO-8859-1 into UTF-8,
resulting in 4 bytes.
Is my explanation clear enough ?
If not, plese tell me.



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 5/5] fast-export: do automatic reencoding of commit messages only if requested
  2019-05-13  7:48         ` Junio C Hamano
@ 2019-05-13 13:24           ` Elijah Newren
  0 siblings, 0 replies; 45+ messages in thread
From: Elijah Newren @ 2019-05-13 13:24 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Torsten Bögershausen, Git Mailing List, Eric Sunshine,
	Johannes Schindelin, Johannes Sixt

On Mon, May 13, 2019 at 12:48 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> > On Sat, May 11, 2019 at 2:07 PM Torsten Bögershausen <tboegi@web.de> wrote:
> >> On Fri, May 10, 2019 at 01:53:35PM -0700, Elijah Newren wrote:
> >
> >> This one is good:
> >> > +     if (unset || !strcmp(arg, "abort"))
> >> > +             reencode_mode = REENCODE_ABORT;
> >>
> >> But here: does it make sense to use REENCODE_YES/NO to be more consistant ?
> >> > +     else if (!strcmp(arg, "yes"))
> >> > +             reencode_mode = REENCODE_PLEASE;
> >> > +     else if (!strcmp(arg, "no"))
> >> > +             reencode_mode = REENCODE_NEVER;
> >
> > Didn't realize there was any such convention, and even have difficulty
> > finding it with grep (CONTAINS_{YES,NO} appears to be the only example
> > I can find), but the alternate wording seems fine; I'm happy to adopt
> > it.
>
> I am OK with Yes/No.
>
> Don't we want to treat this as "bool or literal 'abort'", though?
> Other options that are "bool or something else" tend to accept
> "true" as a synonym for "yes", and I am wondering if we want to
> follow suit here, too.

Makes sense; will do.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 5/5] fast-export: do automatic reencoding of commit messages only if requested
  2019-05-13 12:56           ` Torsten Bögershausen
@ 2019-05-13 13:29             ` Elijah Newren
  0 siblings, 0 replies; 45+ messages in thread
From: Elijah Newren @ 2019-05-13 13:29 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: Johannes Schindelin, Junio C Hamano, Git Mailing List,
	Eric Sunshine, Johannes Sixt

On Mon, May 13, 2019 at 5:56 AM Torsten Bögershausen <tboegi@web.de> wrote:
>
> On Mon, May 13, 2019 at 12:23:29PM +0200, Johannes Schindelin wrote:
> > Hi Elijah,
> >
> > On Sat, 11 May 2019, Elijah Newren wrote:
> >
> > > [...] the craziness is based on how Windows behaves; it seems insane to
> > > me that Windows decides to munge user data (in the form of the command
> > > line provided), so much so that it makes me wonder if I really
> > > understood Hannes' and Dscho's explanations of what it is doing.
> >
> > It is not the user data that is munged by *Windows*, but by *Git for
> > Windows*. The user data on Windows is encoded in UTF-16 (or some slight
> > variant thereof). Git *cannot* handle UTF-16. Git's test suite *cannot*
> > handle UTF-16. So we convert. That's all there is to it.
> >
> > Ciao,
> > Dscho
> >
> > P.S.: Of course it is not *all* there is to it. There is also a current
> > code page which depends on the current user's current locale. We can
> > definitely not rely on that, as Git has no idea about this and would quite
> > positively produce incorrect output because of it. So we really just use
> > the `*W()` functions of the Win32 API (i.e. the ones accepting wide
> > Unicode characters and strings, i.e. UTF-16). I don't think we can do
> > better than that.
>
> We can actuall feed valid UTF-8 into a test case.
> (Remember that shell scripts need this octal numbering, see
> t/t0050)

Sure, but that's not useful here.  I need to feed both valid and
invalid ISO-8859-7 (or anything *other* than UTF-8) into a test case,
in order to verify how git handles reencoding from something other
than utf-8.  I did something like what you proposed originally, but
since it wasn't utf-8 it caused test failures on Windows.

Elijah

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 5/5] fast-export: do automatic reencoding of commit messages only if requested
  2019-05-13 10:23         ` Johannes Schindelin
  2019-05-13 12:56           ` Torsten Bögershausen
@ 2019-05-13 16:41           ` Elijah Newren
  1 sibling, 0 replies; 45+ messages in thread
From: Elijah Newren @ 2019-05-13 16:41 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Torsten Bögershausen, Junio C Hamano, Git Mailing List,
	Eric Sunshine, Johannes Sixt

On Mon, May 13, 2019 at 3:23 AM Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
>
> Hi Elijah,
>
> On Sat, 11 May 2019, Elijah Newren wrote:
>
> > [...] the craziness is based on how Windows behaves; it seems insane to
> > me that Windows decides to munge user data (in the form of the command
> > line provided), so much so that it makes me wonder if I really
> > understood Hannes' and Dscho's explanations of what it is doing.
>
> It is not the user data that is munged by *Windows*, but by *Git for
> Windows*. The user data on Windows is encoded in UTF-16 (or some slight
> variant thereof). Git *cannot* handle UTF-16. Git's test suite *cannot*
> handle UTF-16. So we convert. That's all there is to it.

Ooh, it's Git for Windows doing the conversion?  That means I can test
for the expected bytes with printf and grep, I only need to feed
special bytes to git via file instead of command line.  That's better.
:-)

> Ciao,
> Dscho
>
> P.S.: Of course it is not *all* there is to it. There is also a current
> code page which depends on the current user's current locale. We can
> definitely not rely on that, as Git has no idea about this and would quite
> positively produce incorrect output because of it. So we really just use
> the `*W()` functions of the Win32 API (i.e. the ones accepting wide
> Unicode characters and strings, i.e. UTF-16). I don't think we can do
> better than that.

Going off on a tangent for a minute...okay, so you need to do some
kind of conversion for Windows, but why is it automatically UTF-8 ->
UTF-16?  In particular, if i18n.commitencoding configuration option is
set (to something other than UTF-8), then couldn't you instead convert
the commit message specified on the command line from $(git config
i18n.commitencoding) to UTF-16?  Or, perhaps convert from $(git config
i18n.commitencoding) to UTF-8 before the automatic UTF-8 -> UTF-16
conversion?  It doesn't matter for this series anymore since I've
worked around it (by passing the bytes via an external file as
suggested by Hannes), but it seems like Git For Windows might be able
to still do better here.  Or maybe I'm still not understanding the
full picture.  Anyway, thanks for all the explanations and the help
getting this fixed up.

Elijah

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v4 0/5] Fix and extend encoding handling in fast export/import
  2019-05-10 20:53 ` [PATCH v3 0/5] Fix and extend encoding handling in fast export/import Elijah Newren
                     ` (5 preceding siblings ...)
  2019-05-13 10:14   ` [PATCH v3 0/5] Fix and extend encoding handling in fast export/import Johannes Schindelin
@ 2019-05-13 16:47   ` Elijah Newren
  2019-05-13 16:47     ` [PATCH v4 1/5] t9350: fix encoding test to actually test reencoding Elijah Newren
                       ` (5 more replies)
  6 siblings, 6 replies; 45+ messages in thread
From: Elijah Newren @ 2019-05-13 16:47 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Eric Sunshine, Johannes Schindelin, Johannes Sixt,
	Torsten Bögershausen, Elijah Newren

While stress testing `git filter-repo`, I noticed an issue with
encoding; further digging led to the fixes and features in this series.
See the individual commit messages for details.

Changes since v3 (full range-diff below):
  * YES/NO changes suggested by Torsten
  * more boolean synonyms as suggested by Junio
  * check for the exact expected special bytes, in addition to the size
    (Dscho pointed out that it was GitForWindows that munged bytes, not
     Windows, so while I need to be careful in what I pass to git, printf
     and grep can work directly with the special bytes)
  * also checked on gitgitgadget that it passes on the major platforms

[1] https://github.com/gitgitgadget/git/pull/191

Elijah Newren (5):
  t9350: fix encoding test to actually test reencoding
  fast-import: support 'encoding' commit header
  fast-export: avoid stripping encoding header if we cannot reencode
  fast-export: differentiate between explicitly utf-8 and implicitly
    utf-8
  fast-export: do automatic reencoding of commit messages only if
    requested

 Documentation/git-fast-import.txt            |  7 ++
 builtin/fast-export.c                        | 44 +++++++++--
 fast-import.c                                | 11 ++-
 t/t9300-fast-import.sh                       | 20 +++++
 t/t9350-fast-export.sh                       | 78 +++++++++++++++++---
 t/t9350/broken-iso-8859-7-commit-message.txt |  1 +
 t/t9350/simple-iso-8859-7-commit-message.txt |  1 +
 7 files changed, 145 insertions(+), 17 deletions(-)
 create mode 100644 t/t9350/broken-iso-8859-7-commit-message.txt
 create mode 100644 t/t9350/simple-iso-8859-7-commit-message.txt

Range-diff:
1:  2d7bb64acf ! 1:  37a68a0ffd t9350: fix encoding test to actually test reencoding
    @@ -39,18 +39,16 @@
      		 git fast-import &&
     +		 # The commit object, if not re-encoded, would be 240 bytes.
     +		 # Removing the "encoding iso-8859-7\n" header drops 20 bytes.
    -+		 # Re-encoding the Pi character from \xF0 in iso-8859-7 to
    -+		 # \xCF\x80 in utf-8 adds a byte.  Grepping for specific bytes
    -+		 # would be nice, but Windows apparently munges user data
    -+		 # in the form of bytes on the command line to force them to
    -+		 # be characters instead, so we are limited for portability
    -+		 # reasons in subsequent similar tests in this file to check
    -+		 # for size rather than what bytes are present.
    ++		 # Re-encoding the Pi character from \xF0 (\360) in iso-8859-7
    ++		 # to \xCF\x80 (\317\200) in utf-8 adds a byte.  Check for
    ++		 # the expected size.
     +		 test 221 -eq "$(git cat-file -s i18n)" &&
    -+		 # Also make sure the commit does not have the "encoding" header
    ++		 # ...and for the expected translation of bytes.
      		 git cat-file commit i18n >actual &&
     -		 grep "Áéí óú" actual)
     -
    ++		 grep $(printf "\317\200") actual &&
    ++		 # Also make sure the commit does not have the "encoding" header
     +		 ! grep ^encoding actual)
      '
     +
2:  9fa5695017 = 2:  3d84f4613d fast-import: support 'encoding' commit header
3:  dfc76573e9 ! 3:  baa8394a3a fast-export: avoid stripping encoding header if we cannot reencode
    @@ -49,10 +49,14 @@
     +		(cd new &&
     +		 git fast-import &&
     +		 git cat-file commit i18n-invalid >actual &&
    ++		 # Make sure the commit still has the encoding header
     +		 grep ^encoding actual &&
    -+		 # Also verify that the commit has the expected size; i.e.
    ++		 # Verify that the commit has the expected size; i.e.
     +		 # that no bytes were re-encoded to a different encoding.
    -+		 test 252 -eq "$(git cat-file -s i18n-invalid)")
    ++		 test 252 -eq "$(git cat-file -s i18n-invalid)" &&
    ++		 # ...and check for the original special bytes
    ++		 grep $(printf "\360") actual &&
    ++		 grep $(printf "\377") actual)
     +'
     +
      test_expect_success 'import/export-marks' '
4:  83b3656b76 = 4:  49960164c6 fast-export: differentiate between explicitly utf-8 and implicitly utf-8
5:  2063122293 ! 5:  571613a09e fast-export: do automatic reencoding of commit messages only if requested
    @@ -20,7 +20,7 @@
      static int progress;
      static enum { SIGNED_TAG_ABORT, VERBATIM, WARN, WARN_STRIP, STRIP } signed_tag_mode = SIGNED_TAG_ABORT;
      static enum { TAG_FILTERING_ABORT, DROP, REWRITE } tag_of_filtered_mode = TAG_FILTERING_ABORT;
    -+static enum { REENCODE_ABORT, REENCODE_PLEASE, REENCODE_NEVER } reencode_mode = REENCODE_ABORT;
    ++static enum { REENCODE_ABORT, REENCODE_YES, REENCODE_NO } reencode_mode = REENCODE_ABORT;
      static int fake_missing_tagger;
      static int use_done_feature;
      static int no_data;
    @@ -33,10 +33,10 @@
     +{
     +	if (unset || !strcmp(arg, "abort"))
     +		reencode_mode = REENCODE_ABORT;
    -+	else if (!strcmp(arg, "yes"))
    -+		reencode_mode = REENCODE_PLEASE;
    -+	else if (!strcmp(arg, "no"))
    -+		reencode_mode = REENCODE_NEVER;
    ++	else if (!strcmp(arg, "yes") || !strcmp(arg, "true") || !strcmp(arg, "on"))
    ++		reencode_mode = REENCODE_YES;
    ++	else if (!strcmp(arg, "no") || !strcmp(arg, "false") || !strcmp(arg, "off"))
    ++		reencode_mode = REENCODE_NO;
     +	else
     +		return error("Unknown reencoding mode: %s", arg);
     +	return 0;
    @@ -56,14 +56,14 @@
     -		reencoded = reencode_string(message, "UTF-8", encoding);
     +	} else if (encoding) {
     +		switch(reencode_mode) {
    -+		case REENCODE_PLEASE:
    ++		case REENCODE_YES:
     +			reencoded = reencode_string(message, "UTF-8", encoding);
     +			break;
    -+		case REENCODE_NEVER:
    ++		case REENCODE_NO:
     +			break;
     +		case REENCODE_ABORT:
     +			die("Encountered commit-specific encoding %s in commit "
    -+			    "%s; use --reencode=<mode> to handle it",
    ++			    "%s; use --reencode=[yes|no] to handle it",
     +			    encoding, oid_to_hex(&commit->object.oid));
     +		}
     +	}
    @@ -126,13 +126,14 @@
     +		 git fast-import &&
     +		 # The commit object, if not re-encoded, is 240 bytes.
     +		 # Removing the "encoding iso-8859-7\n" header would drops 20
    -+		 # bytes.  Re-encoding the Pi character from \xF0 in
    -+		 # iso-8859-7 to \xCF\x80 in utf-8 would add a byte.  I would
    -+		 # grep for the # specific bytes, but Windows lamely does not
    -+		 # allow that, so just search for the expected size.
    ++		 # bytes.  Re-encoding the Pi character from \xF0 (\360) in
    ++		 # iso-8859-7 to \xCF\x80 (\317\200) in utf-8 adds a byte.
    ++		 # Check for the expected size...
     +		 test 240 -eq "$(git cat-file -s i18n-no-recoding)" &&
    -+		 # Also make sure the commit has the "encoding" header
    ++		 # ...as well as the expected byte.
     +		 git cat-file commit i18n-no-recoding >actual &&
    ++		 grep $(printf "\360") actual &&
    ++		 # Also make sure the commit has the "encoding" header
     +		 grep ^encoding actual)
     +'
     +

-- 
2.21.0.782.g571613a09e

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v4 1/5] t9350: fix encoding test to actually test reencoding
  2019-05-13 16:47   ` [PATCH v4 " Elijah Newren
@ 2019-05-13 16:47     ` Elijah Newren
  2019-05-13 16:47     ` [PATCH v4 2/5] fast-import: support 'encoding' commit header Elijah Newren
                       ` (4 subsequent siblings)
  5 siblings, 0 replies; 45+ messages in thread
From: Elijah Newren @ 2019-05-13 16:47 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Eric Sunshine, Johannes Schindelin, Johannes Sixt,
	Torsten Bögershausen, Elijah Newren

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=UTF-8, Size: 2983 bytes --]

This test used an author with non-ascii characters in the name, but
no special commit message.  It then grep'ed for those non-ascii
characters, but those are guaranteed to exist regardless of the
reencoding process since the reencoding only affects the commit message,
not the author or committer names.  As such, the test would work even if
the re-encoding process simply stripped the commit message entirely.
Modify the test to actually check that the reencoding in utf-8 worked.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 t/t9350-fast-export.sh                       | 27 ++++++++++++--------
 t/t9350/simple-iso-8859-7-commit-message.txt |  1 +
 2 files changed, 18 insertions(+), 10 deletions(-)
 create mode 100644 t/t9350/simple-iso-8859-7-commit-message.txt

diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
index 5690fe2810..c721026260 100755
--- a/t/t9350-fast-export.sh
+++ b/t/t9350-fast-export.sh
@@ -94,22 +94,30 @@ test_expect_success 'fast-export --show-original-ids | git fast-import' '
 	test $MUSS = $(git rev-parse --verify refs/tags/muss)
 '
 
-test_expect_success 'iso-8859-1' '
+test_expect_success 'iso-8859-7' '
 
-	git config i18n.commitencoding ISO8859-1 &&
-	# use author and committer name in ISO-8859-1 to match it.
-	. "$TEST_DIRECTORY"/t3901/8859-1.txt &&
+	test_when_finished "git reset --hard HEAD~1" &&
+	test_config i18n.commitencoding iso-8859-7 &&
 	test_tick &&
 	echo rosten >file &&
-	git commit -s -m den file &&
-	git fast-export wer^..wer >iso8859-1.fi &&
-	sed "s/wer/i18n/" iso8859-1.fi |
+	git commit -s -F "$TEST_DIRECTORY/t9350/simple-iso-8859-7-commit-message.txt" file &&
+	git fast-export wer^..wer >iso-8859-7.fi &&
+	sed "s/wer/i18n/" iso-8859-7.fi |
 		(cd new &&
 		 git fast-import &&
+		 # The commit object, if not re-encoded, would be 240 bytes.
+		 # Removing the "encoding iso-8859-7\n" header drops 20 bytes.
+		 # Re-encoding the Pi character from \xF0 (\360) in iso-8859-7
+		 # to \xCF\x80 (\317\200) in utf-8 adds a byte.  Check for
+		 # the expected size.
+		 test 221 -eq "$(git cat-file -s i18n)" &&
+		 # ...and for the expected translation of bytes.
 		 git cat-file commit i18n >actual &&
-		 grep "Áéí óú" actual)
-
+		 grep $(printf "\317\200") actual &&
+		 # Also make sure the commit does not have the "encoding" header
+		 ! grep ^encoding actual)
 '
+
 test_expect_success 'import/export-marks' '
 
 	git checkout -b marks master &&
@@ -224,7 +232,6 @@ GIT_COMMITTER_NAME='C O Mitter'; export GIT_COMMITTER_NAME
 
 test_expect_success 'setup copies' '
 
-	git config --unset i18n.commitencoding &&
 	git checkout -b copy rein &&
 	git mv file file3 &&
 	git commit -m move1 &&
diff --git a/t/t9350/simple-iso-8859-7-commit-message.txt b/t/t9350/simple-iso-8859-7-commit-message.txt
new file mode 100644
index 0000000000..8b3f0c3dba
--- /dev/null
+++ b/t/t9350/simple-iso-8859-7-commit-message.txt
@@ -0,0 +1 @@
+Pi: ð
\ No newline at end of file
-- 
2.21.0.782.g571613a09e


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v4 2/5] fast-import: support 'encoding' commit header
  2019-05-13 16:47   ` [PATCH v4 " Elijah Newren
  2019-05-13 16:47     ` [PATCH v4 1/5] t9350: fix encoding test to actually test reencoding Elijah Newren
@ 2019-05-13 16:47     ` Elijah Newren
  2019-05-13 16:47     ` [PATCH v4 3/5] fast-export: avoid stripping encoding header if we cannot reencode Elijah Newren
                       ` (3 subsequent siblings)
  5 siblings, 0 replies; 45+ messages in thread
From: Elijah Newren @ 2019-05-13 16:47 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Eric Sunshine, Johannes Schindelin, Johannes Sixt,
	Torsten Bögershausen, Elijah Newren

Since git supports commit messages with an encoding other than utf-8,
allow fast-import to import such commits.  This may be useful for folks
who do not want to reencode commit messages from an external system, and
may also be useful to achieve reversible history rewrites (e.g. sha1sum
<-> sha256sum transitions or subtree work) with git repositories that
have used specialized encodings in their commit history.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 Documentation/git-fast-import.txt |  7 +++++++
 fast-import.c                     | 11 +++++++++--
 t/t9300-fast-import.sh            | 20 ++++++++++++++++++++
 3 files changed, 36 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-fast-import.txt b/Documentation/git-fast-import.txt
index d65cdb3d08..7baf9e47b5 100644
--- a/Documentation/git-fast-import.txt
+++ b/Documentation/git-fast-import.txt
@@ -388,6 +388,7 @@ change to the project.
 	original-oid?
 	('author' (SP <name>)? SP LT <email> GT SP <when> LF)?
 	'committer' (SP <name>)? SP LT <email> GT SP <when> LF
+	('encoding' SP <encoding>)?
 	data
 	('from' SP <commit-ish> LF)?
 	('merge' SP <commit-ish> LF)?
@@ -455,6 +456,12 @@ that was selected by the --date-format=<fmt> command-line option.
 See ``Date Formats'' above for the set of supported formats, and
 their syntax.
 
+`encoding`
+^^^^^^^^^^
+The optional `encoding` command indicates the encoding of the commit
+message.  Most commits are UTF-8 and the encoding is omitted, but this
+allows importing commit messages into git without first reencoding them.
+
 `from`
 ^^^^^^
 The `from` command is used to specify the commit to initialize
diff --git a/fast-import.c b/fast-import.c
index f38d04fa58..76a7bd3699 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -2585,6 +2585,7 @@ static void parse_new_commit(const char *arg)
 	struct branch *b;
 	char *author = NULL;
 	char *committer = NULL;
+	const char *encoding = NULL;
 	struct hash_list *merge_list = NULL;
 	unsigned int merge_count;
 	unsigned char prev_fanout, new_fanout;
@@ -2607,6 +2608,8 @@ static void parse_new_commit(const char *arg)
 	}
 	if (!committer)
 		die("Expected committer but didn't get one");
+	if (skip_prefix(command_buf.buf, "encoding ", &encoding))
+		read_next_command();
 	parse_data(&msg, 0, NULL);
 	read_next_command();
 	parse_from(b);
@@ -2670,9 +2673,13 @@ static void parse_new_commit(const char *arg)
 	}
 	strbuf_addf(&new_data,
 		"author %s\n"
-		"committer %s\n"
-		"\n",
+		"committer %s\n",
 		author ? author : committer, committer);
+	if (encoding)
+		strbuf_addf(&new_data,
+			"encoding %s\n",
+			encoding);
+	strbuf_addch(&new_data, '\n');
 	strbuf_addbuf(&new_data, &msg);
 	free(author);
 	free(committer);
diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
index 3668263c40..141b7fa35e 100755
--- a/t/t9300-fast-import.sh
+++ b/t/t9300-fast-import.sh
@@ -3299,4 +3299,24 @@ test_expect_success !MINGW 'W: get-mark & empty orphan commit with erroneous thi
 	sed -e s/LFs/LLL/ W-input | tr L "\n" | test_must_fail git fast-import
 '
 
+###
+### series X (other new features)
+###
+
+test_expect_success 'X: handling encoding' '
+	test_tick &&
+	cat >input <<-INPUT_END &&
+	commit refs/heads/encoding
+	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+	encoding iso-8859-7
+	data <<COMMIT
+	INPUT_END
+
+	printf "Pi: \360\nCOMMIT\n" >>input &&
+
+	git fast-import <input &&
+	git cat-file -p encoding | grep $(printf "\360") &&
+	git log -1 --format=%B encoding | grep $(printf "\317\200")
+'
+
 test_done
-- 
2.21.0.782.g571613a09e


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v4 3/5] fast-export: avoid stripping encoding header if we cannot reencode
  2019-05-13 16:47   ` [PATCH v4 " Elijah Newren
  2019-05-13 16:47     ` [PATCH v4 1/5] t9350: fix encoding test to actually test reencoding Elijah Newren
  2019-05-13 16:47     ` [PATCH v4 2/5] fast-import: support 'encoding' commit header Elijah Newren
@ 2019-05-13 16:47     ` Elijah Newren
  2019-05-13 16:47     ` [PATCH v4 4/5] fast-export: differentiate between explicitly utf-8 and implicitly utf-8 Elijah Newren
                       ` (2 subsequent siblings)
  5 siblings, 0 replies; 45+ messages in thread
From: Elijah Newren @ 2019-05-13 16:47 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Eric Sunshine, Johannes Schindelin, Johannes Sixt,
	Torsten Bögershausen, Elijah Newren

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=UTF-8, Size: 3048 bytes --]

When fast-export encounters a commit with an 'encoding' header, it tries
to reencode in utf-8 and then drops the encoding header.  However, if it
fails to reencode in utf-8 because e.g. one of the characters in the
commit message was invalid in the old encoding, then we need to retain
the original encoding or otherwise we lose information needed to
understand all the other (valid) characters in the original commit
message.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 builtin/fast-export.c                        |  7 +++++--
 t/t9350-fast-export.sh                       | 21 ++++++++++++++++++++
 t/t9350/broken-iso-8859-7-commit-message.txt |  1 +
 3 files changed, 27 insertions(+), 2 deletions(-)
 create mode 100644 t/t9350/broken-iso-8859-7-commit-message.txt

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 9e283482ef..7734a9f5a5 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -642,9 +642,12 @@ static void handle_commit(struct commit *commit, struct rev_info *rev,
 	printf("commit %s\nmark :%"PRIu32"\n", refname, last_idnum);
 	if (show_original_ids)
 		printf("original-oid %s\n", oid_to_hex(&commit->object.oid));
-	printf("%.*s\n%.*s\ndata %u\n%s",
+	printf("%.*s\n%.*s\n",
 	       (int)(author_end - author), author,
-	       (int)(committer_end - committer), committer,
+	       (int)(committer_end - committer), committer);
+	if (!reencoded && encoding)
+		printf("encoding %s\n", encoding);
+	printf("data %u\n%s",
 	       (unsigned)(reencoded
 			  ? strlen(reencoded) : message
 			  ? strlen(message) : 0),
diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
index c721026260..4fd637312a 100755
--- a/t/t9350-fast-export.sh
+++ b/t/t9350-fast-export.sh
@@ -118,6 +118,27 @@ test_expect_success 'iso-8859-7' '
 		 ! grep ^encoding actual)
 '
 
+test_expect_success 'encoding preserved if reencoding fails' '
+
+	test_when_finished "git reset --hard HEAD~1" &&
+	test_config i18n.commitencoding iso-8859-7 &&
+	echo rosten >file &&
+	git commit -s -F "$TEST_DIRECTORY/t9350/broken-iso-8859-7-commit-message.txt" file &&
+	git fast-export wer^..wer >iso-8859-7.fi &&
+	sed "s/wer/i18n-invalid/" iso-8859-7.fi |
+		(cd new &&
+		 git fast-import &&
+		 git cat-file commit i18n-invalid >actual &&
+		 # Make sure the commit still has the encoding header
+		 grep ^encoding actual &&
+		 # Verify that the commit has the expected size; i.e.
+		 # that no bytes were re-encoded to a different encoding.
+		 test 252 -eq "$(git cat-file -s i18n-invalid)" &&
+		 # ...and check for the original special bytes
+		 grep $(printf "\360") actual &&
+		 grep $(printf "\377") actual)
+'
+
 test_expect_success 'import/export-marks' '
 
 	git checkout -b marks master &&
diff --git a/t/t9350/broken-iso-8859-7-commit-message.txt b/t/t9350/broken-iso-8859-7-commit-message.txt
new file mode 100644
index 0000000000..d06ad75b44
--- /dev/null
+++ b/t/t9350/broken-iso-8859-7-commit-message.txt
@@ -0,0 +1 @@
+Pi: ð; Invalid: ÿ
\ No newline at end of file
-- 
2.21.0.782.g571613a09e


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v4 4/5] fast-export: differentiate between explicitly utf-8 and implicitly utf-8
  2019-05-13 16:47   ` [PATCH v4 " Elijah Newren
                       ` (2 preceding siblings ...)
  2019-05-13 16:47     ` [PATCH v4 3/5] fast-export: avoid stripping encoding header if we cannot reencode Elijah Newren
@ 2019-05-13 16:47     ` Elijah Newren
  2019-05-13 16:47     ` [PATCH v4 5/5] fast-export: do automatic reencoding of commit messages only if requested Elijah Newren
  2019-05-13 23:17     ` [PATCH v5 0/5] Fix and extend encoding handling in fast export/import Elijah Newren
  5 siblings, 0 replies; 45+ messages in thread
From: Elijah Newren @ 2019-05-13 16:47 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Eric Sunshine, Johannes Schindelin, Johannes Sixt,
	Torsten Bögershausen, Elijah Newren

The find_encoding() function returned the encoding used by a commit
message, returning a default of git_commit_encoding (usually utf-8).
Although the current code does not differentiate between a commit which
explicitly requested utf-8 and one where we just assume utf-8 because no
encoding is set, it will become important when we try to preserve the
encoding header.  Since is_encoding_utf8() returns true when passed
NULL, we can just return NULL from find_encoding() instead of returning
git_commit_encoding.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 builtin/fast-export.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 7734a9f5a5..66331fa401 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -453,7 +453,7 @@ static const char *find_encoding(const char *begin, const char *end)
 	bol = memmem(begin, end ? end - begin : strlen(begin),
 		     needle, strlen(needle));
 	if (!bol)
-		return git_commit_encoding;
+		return NULL;
 	bol += strlen(needle);
 	eol = strchrnul(bol, '\n');
 	*eol = '\0';
-- 
2.21.0.782.g571613a09e


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v4 5/5] fast-export: do automatic reencoding of commit messages only if requested
  2019-05-13 16:47   ` [PATCH v4 " Elijah Newren
                       ` (3 preceding siblings ...)
  2019-05-13 16:47     ` [PATCH v4 4/5] fast-export: differentiate between explicitly utf-8 and implicitly utf-8 Elijah Newren
@ 2019-05-13 16:47     ` Elijah Newren
  2019-05-13 22:32       ` Junio C Hamano
  2019-05-13 23:17     ` [PATCH v5 0/5] Fix and extend encoding handling in fast export/import Elijah Newren
  5 siblings, 1 reply; 45+ messages in thread
From: Elijah Newren @ 2019-05-13 16:47 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Eric Sunshine, Johannes Schindelin, Johannes Sixt,
	Torsten Bögershausen, Elijah Newren

Automatic re-encoding of commit messages (and dropping of the encoding
header) hurts attempts to do reversible history rewrites (e.g. sha1sum
<-> sha256sum transitions, some subtree rewrites), and seems
inconsistent with the general principle followed elsewhere in
fast-export of requiring explicit user requests to modify the output
(e.g. --signed-tags=strip, --tag-of-filtered-object=rewrite).  Add a
--reencode flag that the user can use to specify, and like other
fast-export flags, default it to 'abort'.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 builtin/fast-export.c  | 35 ++++++++++++++++++++++++++++++++---
 t/t9350-fast-export.sh | 38 +++++++++++++++++++++++++++++++++++---
 2 files changed, 67 insertions(+), 6 deletions(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 66331fa401..4906b23248 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -33,6 +33,7 @@ static const char *fast_export_usage[] = {
 static int progress;
 static enum { SIGNED_TAG_ABORT, VERBATIM, WARN, WARN_STRIP, STRIP } signed_tag_mode = SIGNED_TAG_ABORT;
 static enum { TAG_FILTERING_ABORT, DROP, REWRITE } tag_of_filtered_mode = TAG_FILTERING_ABORT;
+static enum { REENCODE_ABORT, REENCODE_YES, REENCODE_NO } reencode_mode = REENCODE_ABORT;
 static int fake_missing_tagger;
 static int use_done_feature;
 static int no_data;
@@ -77,6 +78,20 @@ static int parse_opt_tag_of_filtered_mode(const struct option *opt,
 	return 0;
 }
 
+static int parse_opt_reencode_mode(const struct option *opt,
+				   const char *arg, int unset)
+{
+	if (unset || !strcmp(arg, "abort"))
+		reencode_mode = REENCODE_ABORT;
+	else if (!strcmp(arg, "yes") || !strcmp(arg, "true") || !strcmp(arg, "on"))
+		reencode_mode = REENCODE_YES;
+	else if (!strcmp(arg, "no") || !strcmp(arg, "false") || !strcmp(arg, "off"))
+		reencode_mode = REENCODE_NO;
+	else
+		return error("Unknown reencoding mode: %s", arg);
+	return 0;
+}
+
 static struct decoration idnums;
 static uint32_t last_idnum;
 
@@ -633,10 +648,21 @@ static void handle_commit(struct commit *commit, struct rev_info *rev,
 	}
 
 	mark_next_object(&commit->object);
-	if (anonymize)
+	if (anonymize) {
 		reencoded = anonymize_commit_message(message);
-	else if (!is_encoding_utf8(encoding))
-		reencoded = reencode_string(message, "UTF-8", encoding);
+	} else if (encoding) {
+		switch(reencode_mode) {
+		case REENCODE_YES:
+			reencoded = reencode_string(message, "UTF-8", encoding);
+			break;
+		case REENCODE_NO:
+			break;
+		case REENCODE_ABORT:
+			die("Encountered commit-specific encoding %s in commit "
+			    "%s; use --reencode=[yes|no] to handle it",
+			    encoding, oid_to_hex(&commit->object.oid));
+		}
+	}
 	if (!commit->parents)
 		printf("reset %s\n", refname);
 	printf("commit %s\nmark :%"PRIu32"\n", refname, last_idnum);
@@ -1091,6 +1117,9 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix)
 		OPT_CALLBACK(0, "tag-of-filtered-object", &tag_of_filtered_mode, N_("mode"),
 			     N_("select handling of tags that tag filtered objects"),
 			     parse_opt_tag_of_filtered_mode),
+		OPT_CALLBACK(0, "reencode", &reencode_mode, N_("mode"),
+			     N_("select handling of commit messages in an alternate encoding"),
+			     parse_opt_reencode_mode),
 		OPT_STRING(0, "export-marks", &export_filename, N_("file"),
 			     N_("Dump marks to this file")),
 		OPT_STRING(0, "import-marks", &import_filename, N_("file"),
diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
index 4fd637312a..d21d7bf9a9 100755
--- a/t/t9350-fast-export.sh
+++ b/t/t9350-fast-export.sh
@@ -94,14 +94,14 @@ test_expect_success 'fast-export --show-original-ids | git fast-import' '
 	test $MUSS = $(git rev-parse --verify refs/tags/muss)
 '
 
-test_expect_success 'iso-8859-7' '
+test_expect_success 'reencoding iso-8859-7' '
 
 	test_when_finished "git reset --hard HEAD~1" &&
 	test_config i18n.commitencoding iso-8859-7 &&
 	test_tick &&
 	echo rosten >file &&
 	git commit -s -F "$TEST_DIRECTORY/t9350/simple-iso-8859-7-commit-message.txt" file &&
-	git fast-export wer^..wer >iso-8859-7.fi &&
+	git fast-export --reencode=yes wer^..wer >iso-8859-7.fi &&
 	sed "s/wer/i18n/" iso-8859-7.fi |
 		(cd new &&
 		 git fast-import &&
@@ -118,13 +118,45 @@ test_expect_success 'iso-8859-7' '
 		 ! grep ^encoding actual)
 '
 
+test_expect_success 'aborting on iso-8859-7' '
+
+	test_when_finished "git reset --hard HEAD~1" &&
+	test_config i18n.commitencoding iso-8859-7 &&
+	echo rosten >file &&
+	git commit -s -F "$TEST_DIRECTORY/t9350/simple-iso-8859-7-commit-message.txt" file &&
+	test_must_fail git fast-export --reencode=abort wer^..wer >iso-8859-7.fi
+'
+
+test_expect_success 'preserving iso-8859-7' '
+
+	test_when_finished "git reset --hard HEAD~1" &&
+	test_config i18n.commitencoding iso-8859-7 &&
+	echo rosten >file &&
+	git commit -s -F "$TEST_DIRECTORY/t9350/simple-iso-8859-7-commit-message.txt" file &&
+	git fast-export --reencode=no wer^..wer >iso-8859-7.fi &&
+	sed "s/wer/i18n-no-recoding/" iso-8859-7.fi |
+		(cd new &&
+		 git fast-import &&
+		 # The commit object, if not re-encoded, is 240 bytes.
+		 # Removing the "encoding iso-8859-7\n" header would drops 20
+		 # bytes.  Re-encoding the Pi character from \xF0 (\360) in
+		 # iso-8859-7 to \xCF\x80 (\317\200) in utf-8 adds a byte.
+		 # Check for the expected size...
+		 test 240 -eq "$(git cat-file -s i18n-no-recoding)" &&
+		 # ...as well as the expected byte.
+		 git cat-file commit i18n-no-recoding >actual &&
+		 grep $(printf "\360") actual &&
+		 # Also make sure the commit has the "encoding" header
+		 grep ^encoding actual)
+'
+
 test_expect_success 'encoding preserved if reencoding fails' '
 
 	test_when_finished "git reset --hard HEAD~1" &&
 	test_config i18n.commitencoding iso-8859-7 &&
 	echo rosten >file &&
 	git commit -s -F "$TEST_DIRECTORY/t9350/broken-iso-8859-7-commit-message.txt" file &&
-	git fast-export wer^..wer >iso-8859-7.fi &&
+	git fast-export --reencode=yes wer^..wer >iso-8859-7.fi &&
 	sed "s/wer/i18n-invalid/" iso-8859-7.fi |
 		(cd new &&
 		 git fast-import &&
-- 
2.21.0.782.g571613a09e


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v4 5/5] fast-export: do automatic reencoding of commit messages only if requested
  2019-05-13 16:47     ` [PATCH v4 5/5] fast-export: do automatic reencoding of commit messages only if requested Elijah Newren
@ 2019-05-13 22:32       ` Junio C Hamano
  0 siblings, 0 replies; 45+ messages in thread
From: Junio C Hamano @ 2019-05-13 22:32 UTC (permalink / raw)
  To: Elijah Newren
  Cc: git, Eric Sunshine, Johannes Schindelin, Johannes Sixt,
	Torsten Bögershausen

Elijah Newren <newren@gmail.com> writes:

> +static int parse_opt_reencode_mode(const struct option *opt,
> +				   const char *arg, int unset)
> +{
> +	if (unset || !strcmp(arg, "abort"))
> +		reencode_mode = REENCODE_ABORT;
> +	else if (!strcmp(arg, "yes") || !strcmp(arg, "true") || !strcmp(arg, "on"))
> +		reencode_mode = REENCODE_YES;
> +	else if (!strcmp(arg, "no") || !strcmp(arg, "false") || !strcmp(arg, "off"))
> +		reencode_mode = REENCODE_NO;

Don't duplicate the logic; find existing callers of git_parse_maybe_bool(),
learn what they are using the helper for and mimic them, perhaps?

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v5 0/5] Fix and extend encoding handling in fast export/import
  2019-05-13 16:47   ` [PATCH v4 " Elijah Newren
                       ` (4 preceding siblings ...)
  2019-05-13 16:47     ` [PATCH v4 5/5] fast-export: do automatic reencoding of commit messages only if requested Elijah Newren
@ 2019-05-13 23:17     ` Elijah Newren
  2019-05-13 23:17       ` [PATCH v5 1/5] t9350: fix encoding test to actually test reencoding Elijah Newren
                         ` (5 more replies)
  5 siblings, 6 replies; 45+ messages in thread
From: Elijah Newren @ 2019-05-13 23:17 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Eric Sunshine, Johannes Schindelin, Johannes Sixt,
	Torsten Bögershausen, Elijah Newren

While stress testing `git filter-repo`, I noticed an issue with
encoding; further digging led to the fixes and features in this series.
See the individual commit messages for details.

Changes since v4 (full range-diff below):
  * Used git_parse_maybe_bool()
  * Updated Documentation/git-fast-export.txt to document the new option

Elijah Newren (5):
  t9350: fix encoding test to actually test reencoding
  fast-import: support 'encoding' commit header
  fast-export: avoid stripping encoding header if we cannot reencode
  fast-export: differentiate between explicitly utf-8 and implicitly
    utf-8
  fast-export: do automatic reencoding of commit messages only if
    requested

 Documentation/git-fast-export.txt            |  7 ++
 Documentation/git-fast-import.txt            |  7 ++
 builtin/fast-export.c                        | 55 ++++++++++++--
 fast-import.c                                | 11 ++-
 t/t9300-fast-import.sh                       | 20 +++++
 t/t9350-fast-export.sh                       | 78 +++++++++++++++++---
 t/t9350/broken-iso-8859-7-commit-message.txt |  1 +
 t/t9350/simple-iso-8859-7-commit-message.txt |  1 +
 8 files changed, 163 insertions(+), 17 deletions(-)
 create mode 100644 t/t9350/broken-iso-8859-7-commit-message.txt
 create mode 100644 t/t9350/simple-iso-8859-7-commit-message.txt

Range-diff:
1:  37a68a0ffd = 1:  37a68a0ffd t9350: fix encoding test to actually test reencoding
2:  3d84f4613d = 2:  3d84f4613d fast-import: support 'encoding' commit header
3:  baa8394a3a = 3:  baa8394a3a fast-export: avoid stripping encoding header if we cannot reencode
4:  49960164c6 = 4:  49960164c6 fast-export: differentiate between explicitly utf-8 and implicitly utf-8
5:  571613a09e ! 5:  d8be4ee826 fast-export: do automatic reencoding of commit messages only if requested
    @@ -13,6 +13,24 @@
     
         Signed-off-by: Elijah Newren <newren@gmail.com>
     
    + diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
    + --- a/Documentation/git-fast-export.txt
    + +++ b/Documentation/git-fast-export.txt
    +@@
    + 	for intermediary filters (e.g. for rewriting commit messages
    + 	which refer to older commits, or for stripping blobs by id).
    + 
    ++--reencode=(yes|no|abort)::
    ++	Specify how to handle `encoding` header in commit objects.  When
    ++	asking to 'abort' (which is the default), this program will die
    ++	when encountering such a commit object.  With 'yes', the commit
    ++	message will be reencoded into UTF-8.  With 'no', the original
    ++	encoding will be preserved.
    ++
    + --refspec::
    + 	Apply the specified refspec to each ref exported. Multiple of them can
    + 	be specified.
    +
      diff --git a/builtin/fast-export.c b/builtin/fast-export.c
      --- a/builtin/fast-export.c
      +++ b/builtin/fast-export.c
    @@ -31,14 +49,25 @@
     +static int parse_opt_reencode_mode(const struct option *opt,
     +				   const char *arg, int unset)
     +{
    -+	if (unset || !strcmp(arg, "abort"))
    ++	if (unset) {
     +		reencode_mode = REENCODE_ABORT;
    -+	else if (!strcmp(arg, "yes") || !strcmp(arg, "true") || !strcmp(arg, "on"))
    -+		reencode_mode = REENCODE_YES;
    -+	else if (!strcmp(arg, "no") || !strcmp(arg, "false") || !strcmp(arg, "off"))
    ++		return 0;
    ++	}
    ++
    ++	switch (git_parse_maybe_bool(arg)) {
    ++	case 0:
     +		reencode_mode = REENCODE_NO;
    -+	else
    -+		return error("Unknown reencoding mode: %s", arg);
    ++		break;
    ++	case 1:
    ++		reencode_mode = REENCODE_YES;
    ++		break;
    ++	default:
    ++		if (arg && !strcasecmp(arg, "abort"))
    ++			reencode_mode = REENCODE_ABORT;
    ++		else
    ++			return error("Unknown reencoding mode: %s", arg);
    ++	}
    ++
     +	return 0;
     +}
     +
-- 
2.21.0.782.gd8be4ee826


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v5 1/5] t9350: fix encoding test to actually test reencoding
  2019-05-13 23:17     ` [PATCH v5 0/5] Fix and extend encoding handling in fast export/import Elijah Newren
@ 2019-05-13 23:17       ` Elijah Newren
  2019-05-14  2:50         ` Torsten Bögershausen
  2019-05-13 23:17       ` [PATCH v5 2/5] fast-import: support 'encoding' commit header Elijah Newren
                         ` (4 subsequent siblings)
  5 siblings, 1 reply; 45+ messages in thread
From: Elijah Newren @ 2019-05-13 23:17 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Eric Sunshine, Johannes Schindelin, Johannes Sixt,
	Torsten Bögershausen, Elijah Newren

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=UTF-8, Size: 2983 bytes --]

This test used an author with non-ascii characters in the name, but
no special commit message.  It then grep'ed for those non-ascii
characters, but those are guaranteed to exist regardless of the
reencoding process since the reencoding only affects the commit message,
not the author or committer names.  As such, the test would work even if
the re-encoding process simply stripped the commit message entirely.
Modify the test to actually check that the reencoding in utf-8 worked.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 t/t9350-fast-export.sh                       | 27 ++++++++++++--------
 t/t9350/simple-iso-8859-7-commit-message.txt |  1 +
 2 files changed, 18 insertions(+), 10 deletions(-)
 create mode 100644 t/t9350/simple-iso-8859-7-commit-message.txt

diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
index 5690fe2810..c721026260 100755
--- a/t/t9350-fast-export.sh
+++ b/t/t9350-fast-export.sh
@@ -94,22 +94,30 @@ test_expect_success 'fast-export --show-original-ids | git fast-import' '
 	test $MUSS = $(git rev-parse --verify refs/tags/muss)
 '
 
-test_expect_success 'iso-8859-1' '
+test_expect_success 'iso-8859-7' '
 
-	git config i18n.commitencoding ISO8859-1 &&
-	# use author and committer name in ISO-8859-1 to match it.
-	. "$TEST_DIRECTORY"/t3901/8859-1.txt &&
+	test_when_finished "git reset --hard HEAD~1" &&
+	test_config i18n.commitencoding iso-8859-7 &&
 	test_tick &&
 	echo rosten >file &&
-	git commit -s -m den file &&
-	git fast-export wer^..wer >iso8859-1.fi &&
-	sed "s/wer/i18n/" iso8859-1.fi |
+	git commit -s -F "$TEST_DIRECTORY/t9350/simple-iso-8859-7-commit-message.txt" file &&
+	git fast-export wer^..wer >iso-8859-7.fi &&
+	sed "s/wer/i18n/" iso-8859-7.fi |
 		(cd new &&
 		 git fast-import &&
+		 # The commit object, if not re-encoded, would be 240 bytes.
+		 # Removing the "encoding iso-8859-7\n" header drops 20 bytes.
+		 # Re-encoding the Pi character from \xF0 (\360) in iso-8859-7
+		 # to \xCF\x80 (\317\200) in utf-8 adds a byte.  Check for
+		 # the expected size.
+		 test 221 -eq "$(git cat-file -s i18n)" &&
+		 # ...and for the expected translation of bytes.
 		 git cat-file commit i18n >actual &&
-		 grep "Áéí óú" actual)
-
+		 grep $(printf "\317\200") actual &&
+		 # Also make sure the commit does not have the "encoding" header
+		 ! grep ^encoding actual)
 '
+
 test_expect_success 'import/export-marks' '
 
 	git checkout -b marks master &&
@@ -224,7 +232,6 @@ GIT_COMMITTER_NAME='C O Mitter'; export GIT_COMMITTER_NAME
 
 test_expect_success 'setup copies' '
 
-	git config --unset i18n.commitencoding &&
 	git checkout -b copy rein &&
 	git mv file file3 &&
 	git commit -m move1 &&
diff --git a/t/t9350/simple-iso-8859-7-commit-message.txt b/t/t9350/simple-iso-8859-7-commit-message.txt
new file mode 100644
index 0000000000..8b3f0c3dba
--- /dev/null
+++ b/t/t9350/simple-iso-8859-7-commit-message.txt
@@ -0,0 +1 @@
+Pi: ð
\ No newline at end of file
-- 
2.21.0.782.gd8be4ee826


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 2/5] fast-import: support 'encoding' commit header
  2019-05-13 23:17     ` [PATCH v5 0/5] Fix and extend encoding handling in fast export/import Elijah Newren
  2019-05-13 23:17       ` [PATCH v5 1/5] t9350: fix encoding test to actually test reencoding Elijah Newren
@ 2019-05-13 23:17       ` Elijah Newren
  2019-05-13 23:17       ` [PATCH v5 3/5] fast-export: avoid stripping encoding header if we cannot reencode Elijah Newren
                         ` (3 subsequent siblings)
  5 siblings, 0 replies; 45+ messages in thread
From: Elijah Newren @ 2019-05-13 23:17 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Eric Sunshine, Johannes Schindelin, Johannes Sixt,
	Torsten Bögershausen, Elijah Newren

Since git supports commit messages with an encoding other than utf-8,
allow fast-import to import such commits.  This may be useful for folks
who do not want to reencode commit messages from an external system, and
may also be useful to achieve reversible history rewrites (e.g. sha1sum
<-> sha256sum transitions or subtree work) with git repositories that
have used specialized encodings in their commit history.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 Documentation/git-fast-import.txt |  7 +++++++
 fast-import.c                     | 11 +++++++++--
 t/t9300-fast-import.sh            | 20 ++++++++++++++++++++
 3 files changed, 36 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-fast-import.txt b/Documentation/git-fast-import.txt
index d65cdb3d08..7baf9e47b5 100644
--- a/Documentation/git-fast-import.txt
+++ b/Documentation/git-fast-import.txt
@@ -388,6 +388,7 @@ change to the project.
 	original-oid?
 	('author' (SP <name>)? SP LT <email> GT SP <when> LF)?
 	'committer' (SP <name>)? SP LT <email> GT SP <when> LF
+	('encoding' SP <encoding>)?
 	data
 	('from' SP <commit-ish> LF)?
 	('merge' SP <commit-ish> LF)?
@@ -455,6 +456,12 @@ that was selected by the --date-format=<fmt> command-line option.
 See ``Date Formats'' above for the set of supported formats, and
 their syntax.
 
+`encoding`
+^^^^^^^^^^
+The optional `encoding` command indicates the encoding of the commit
+message.  Most commits are UTF-8 and the encoding is omitted, but this
+allows importing commit messages into git without first reencoding them.
+
 `from`
 ^^^^^^
 The `from` command is used to specify the commit to initialize
diff --git a/fast-import.c b/fast-import.c
index f38d04fa58..76a7bd3699 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -2585,6 +2585,7 @@ static void parse_new_commit(const char *arg)
 	struct branch *b;
 	char *author = NULL;
 	char *committer = NULL;
+	const char *encoding = NULL;
 	struct hash_list *merge_list = NULL;
 	unsigned int merge_count;
 	unsigned char prev_fanout, new_fanout;
@@ -2607,6 +2608,8 @@ static void parse_new_commit(const char *arg)
 	}
 	if (!committer)
 		die("Expected committer but didn't get one");
+	if (skip_prefix(command_buf.buf, "encoding ", &encoding))
+		read_next_command();
 	parse_data(&msg, 0, NULL);
 	read_next_command();
 	parse_from(b);
@@ -2670,9 +2673,13 @@ static void parse_new_commit(const char *arg)
 	}
 	strbuf_addf(&new_data,
 		"author %s\n"
-		"committer %s\n"
-		"\n",
+		"committer %s\n",
 		author ? author : committer, committer);
+	if (encoding)
+		strbuf_addf(&new_data,
+			"encoding %s\n",
+			encoding);
+	strbuf_addch(&new_data, '\n');
 	strbuf_addbuf(&new_data, &msg);
 	free(author);
 	free(committer);
diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
index 3668263c40..141b7fa35e 100755
--- a/t/t9300-fast-import.sh
+++ b/t/t9300-fast-import.sh
@@ -3299,4 +3299,24 @@ test_expect_success !MINGW 'W: get-mark & empty orphan commit with erroneous thi
 	sed -e s/LFs/LLL/ W-input | tr L "\n" | test_must_fail git fast-import
 '
 
+###
+### series X (other new features)
+###
+
+test_expect_success 'X: handling encoding' '
+	test_tick &&
+	cat >input <<-INPUT_END &&
+	commit refs/heads/encoding
+	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+	encoding iso-8859-7
+	data <<COMMIT
+	INPUT_END
+
+	printf "Pi: \360\nCOMMIT\n" >>input &&
+
+	git fast-import <input &&
+	git cat-file -p encoding | grep $(printf "\360") &&
+	git log -1 --format=%B encoding | grep $(printf "\317\200")
+'
+
 test_done
-- 
2.21.0.782.gd8be4ee826


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 3/5] fast-export: avoid stripping encoding header if we cannot reencode
  2019-05-13 23:17     ` [PATCH v5 0/5] Fix and extend encoding handling in fast export/import Elijah Newren
  2019-05-13 23:17       ` [PATCH v5 1/5] t9350: fix encoding test to actually test reencoding Elijah Newren
  2019-05-13 23:17       ` [PATCH v5 2/5] fast-import: support 'encoding' commit header Elijah Newren
@ 2019-05-13 23:17       ` Elijah Newren
  2019-05-14  2:56         ` Torsten Bögershausen
  2019-05-13 23:17       ` [PATCH v5 4/5] fast-export: differentiate between explicitly utf-8 and implicitly utf-8 Elijah Newren
                         ` (2 subsequent siblings)
  5 siblings, 1 reply; 45+ messages in thread
From: Elijah Newren @ 2019-05-13 23:17 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Eric Sunshine, Johannes Schindelin, Johannes Sixt,
	Torsten Bögershausen, Elijah Newren

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=UTF-8, Size: 3048 bytes --]

When fast-export encounters a commit with an 'encoding' header, it tries
to reencode in utf-8 and then drops the encoding header.  However, if it
fails to reencode in utf-8 because e.g. one of the characters in the
commit message was invalid in the old encoding, then we need to retain
the original encoding or otherwise we lose information needed to
understand all the other (valid) characters in the original commit
message.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 builtin/fast-export.c                        |  7 +++++--
 t/t9350-fast-export.sh                       | 21 ++++++++++++++++++++
 t/t9350/broken-iso-8859-7-commit-message.txt |  1 +
 3 files changed, 27 insertions(+), 2 deletions(-)
 create mode 100644 t/t9350/broken-iso-8859-7-commit-message.txt

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 9e283482ef..7734a9f5a5 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -642,9 +642,12 @@ static void handle_commit(struct commit *commit, struct rev_info *rev,
 	printf("commit %s\nmark :%"PRIu32"\n", refname, last_idnum);
 	if (show_original_ids)
 		printf("original-oid %s\n", oid_to_hex(&commit->object.oid));
-	printf("%.*s\n%.*s\ndata %u\n%s",
+	printf("%.*s\n%.*s\n",
 	       (int)(author_end - author), author,
-	       (int)(committer_end - committer), committer,
+	       (int)(committer_end - committer), committer);
+	if (!reencoded && encoding)
+		printf("encoding %s\n", encoding);
+	printf("data %u\n%s",
 	       (unsigned)(reencoded
 			  ? strlen(reencoded) : message
 			  ? strlen(message) : 0),
diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
index c721026260..4fd637312a 100755
--- a/t/t9350-fast-export.sh
+++ b/t/t9350-fast-export.sh
@@ -118,6 +118,27 @@ test_expect_success 'iso-8859-7' '
 		 ! grep ^encoding actual)
 '
 
+test_expect_success 'encoding preserved if reencoding fails' '
+
+	test_when_finished "git reset --hard HEAD~1" &&
+	test_config i18n.commitencoding iso-8859-7 &&
+	echo rosten >file &&
+	git commit -s -F "$TEST_DIRECTORY/t9350/broken-iso-8859-7-commit-message.txt" file &&
+	git fast-export wer^..wer >iso-8859-7.fi &&
+	sed "s/wer/i18n-invalid/" iso-8859-7.fi |
+		(cd new &&
+		 git fast-import &&
+		 git cat-file commit i18n-invalid >actual &&
+		 # Make sure the commit still has the encoding header
+		 grep ^encoding actual &&
+		 # Verify that the commit has the expected size; i.e.
+		 # that no bytes were re-encoded to a different encoding.
+		 test 252 -eq "$(git cat-file -s i18n-invalid)" &&
+		 # ...and check for the original special bytes
+		 grep $(printf "\360") actual &&
+		 grep $(printf "\377") actual)
+'
+
 test_expect_success 'import/export-marks' '
 
 	git checkout -b marks master &&
diff --git a/t/t9350/broken-iso-8859-7-commit-message.txt b/t/t9350/broken-iso-8859-7-commit-message.txt
new file mode 100644
index 0000000000..d06ad75b44
--- /dev/null
+++ b/t/t9350/broken-iso-8859-7-commit-message.txt
@@ -0,0 +1 @@
+Pi: ð; Invalid: ÿ
\ No newline at end of file
-- 
2.21.0.782.gd8be4ee826


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 4/5] fast-export: differentiate between explicitly utf-8 and implicitly utf-8
  2019-05-13 23:17     ` [PATCH v5 0/5] Fix and extend encoding handling in fast export/import Elijah Newren
                         ` (2 preceding siblings ...)
  2019-05-13 23:17       ` [PATCH v5 3/5] fast-export: avoid stripping encoding header if we cannot reencode Elijah Newren
@ 2019-05-13 23:17       ` Elijah Newren
  2019-05-14  3:01         ` Torsten Bögershausen
  2019-05-13 23:17       ` [PATCH v5 5/5] fast-export: do automatic reencoding of commit messages only if requested Elijah Newren
  2019-05-14  4:30       ` [PATCH v6 0/5] Fix and extend encoding handling in fast export/import Elijah Newren
  5 siblings, 1 reply; 45+ messages in thread
From: Elijah Newren @ 2019-05-13 23:17 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Eric Sunshine, Johannes Schindelin, Johannes Sixt,
	Torsten Bögershausen, Elijah Newren

The find_encoding() function returned the encoding used by a commit
message, returning a default of git_commit_encoding (usually utf-8).
Although the current code does not differentiate between a commit which
explicitly requested utf-8 and one where we just assume utf-8 because no
encoding is set, it will become important when we try to preserve the
encoding header.  Since is_encoding_utf8() returns true when passed
NULL, we can just return NULL from find_encoding() instead of returning
git_commit_encoding.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 builtin/fast-export.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 7734a9f5a5..66331fa401 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -453,7 +453,7 @@ static const char *find_encoding(const char *begin, const char *end)
 	bol = memmem(begin, end ? end - begin : strlen(begin),
 		     needle, strlen(needle));
 	if (!bol)
-		return git_commit_encoding;
+		return NULL;
 	bol += strlen(needle);
 	eol = strchrnul(bol, '\n');
 	*eol = '\0';
-- 
2.21.0.782.gd8be4ee826


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 5/5] fast-export: do automatic reencoding of commit messages only if requested
  2019-05-13 23:17     ` [PATCH v5 0/5] Fix and extend encoding handling in fast export/import Elijah Newren
                         ` (3 preceding siblings ...)
  2019-05-13 23:17       ` [PATCH v5 4/5] fast-export: differentiate between explicitly utf-8 and implicitly utf-8 Elijah Newren
@ 2019-05-13 23:17       ` Elijah Newren
  2019-05-14  0:19         ` Eric Sunshine
  2019-05-14  4:30       ` [PATCH v6 0/5] Fix and extend encoding handling in fast export/import Elijah Newren
  5 siblings, 1 reply; 45+ messages in thread
From: Elijah Newren @ 2019-05-13 23:17 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Eric Sunshine, Johannes Schindelin, Johannes Sixt,
	Torsten Bögershausen, Elijah Newren

Automatic re-encoding of commit messages (and dropping of the encoding
header) hurts attempts to do reversible history rewrites (e.g. sha1sum
<-> sha256sum transitions, some subtree rewrites), and seems
inconsistent with the general principle followed elsewhere in
fast-export of requiring explicit user requests to modify the output
(e.g. --signed-tags=strip, --tag-of-filtered-object=rewrite).  Add a
--reencode flag that the user can use to specify, and like other
fast-export flags, default it to 'abort'.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 Documentation/git-fast-export.txt |  7 +++++
 builtin/fast-export.c             | 46 +++++++++++++++++++++++++++++--
 t/t9350-fast-export.sh            | 38 +++++++++++++++++++++++--
 3 files changed, 85 insertions(+), 6 deletions(-)

diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
index 64c01ba918..11427acdde 100644
--- a/Documentation/git-fast-export.txt
+++ b/Documentation/git-fast-export.txt
@@ -129,6 +129,13 @@ marks the same across runs.
 	for intermediary filters (e.g. for rewriting commit messages
 	which refer to older commits, or for stripping blobs by id).
 
+--reencode=(yes|no|abort)::
+	Specify how to handle `encoding` header in commit objects.  When
+	asking to 'abort' (which is the default), this program will die
+	when encountering such a commit object.  With 'yes', the commit
+	message will be reencoded into UTF-8.  With 'no', the original
+	encoding will be preserved.
+
 --refspec::
 	Apply the specified refspec to each ref exported. Multiple of them can
 	be specified.
diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 66331fa401..0bb65b3886 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -33,6 +33,7 @@ static const char *fast_export_usage[] = {
 static int progress;
 static enum { SIGNED_TAG_ABORT, VERBATIM, WARN, WARN_STRIP, STRIP } signed_tag_mode = SIGNED_TAG_ABORT;
 static enum { TAG_FILTERING_ABORT, DROP, REWRITE } tag_of_filtered_mode = TAG_FILTERING_ABORT;
+static enum { REENCODE_ABORT, REENCODE_YES, REENCODE_NO } reencode_mode = REENCODE_ABORT;
 static int fake_missing_tagger;
 static int use_done_feature;
 static int no_data;
@@ -77,6 +78,31 @@ static int parse_opt_tag_of_filtered_mode(const struct option *opt,
 	return 0;
 }
 
+static int parse_opt_reencode_mode(const struct option *opt,
+				   const char *arg, int unset)
+{
+	if (unset) {
+		reencode_mode = REENCODE_ABORT;
+		return 0;
+	}
+
+	switch (git_parse_maybe_bool(arg)) {
+	case 0:
+		reencode_mode = REENCODE_NO;
+		break;
+	case 1:
+		reencode_mode = REENCODE_YES;
+		break;
+	default:
+		if (arg && !strcasecmp(arg, "abort"))
+			reencode_mode = REENCODE_ABORT;
+		else
+			return error("Unknown reencoding mode: %s", arg);
+	}
+
+	return 0;
+}
+
 static struct decoration idnums;
 static uint32_t last_idnum;
 
@@ -633,10 +659,21 @@ static void handle_commit(struct commit *commit, struct rev_info *rev,
 	}
 
 	mark_next_object(&commit->object);
-	if (anonymize)
+	if (anonymize) {
 		reencoded = anonymize_commit_message(message);
-	else if (!is_encoding_utf8(encoding))
-		reencoded = reencode_string(message, "UTF-8", encoding);
+	} else if (encoding) {
+		switch(reencode_mode) {
+		case REENCODE_YES:
+			reencoded = reencode_string(message, "UTF-8", encoding);
+			break;
+		case REENCODE_NO:
+			break;
+		case REENCODE_ABORT:
+			die("Encountered commit-specific encoding %s in commit "
+			    "%s; use --reencode=[yes|no] to handle it",
+			    encoding, oid_to_hex(&commit->object.oid));
+		}
+	}
 	if (!commit->parents)
 		printf("reset %s\n", refname);
 	printf("commit %s\nmark :%"PRIu32"\n", refname, last_idnum);
@@ -1091,6 +1128,9 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix)
 		OPT_CALLBACK(0, "tag-of-filtered-object", &tag_of_filtered_mode, N_("mode"),
 			     N_("select handling of tags that tag filtered objects"),
 			     parse_opt_tag_of_filtered_mode),
+		OPT_CALLBACK(0, "reencode", &reencode_mode, N_("mode"),
+			     N_("select handling of commit messages in an alternate encoding"),
+			     parse_opt_reencode_mode),
 		OPT_STRING(0, "export-marks", &export_filename, N_("file"),
 			     N_("Dump marks to this file")),
 		OPT_STRING(0, "import-marks", &import_filename, N_("file"),
diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
index 4fd637312a..d21d7bf9a9 100755
--- a/t/t9350-fast-export.sh
+++ b/t/t9350-fast-export.sh
@@ -94,14 +94,14 @@ test_expect_success 'fast-export --show-original-ids | git fast-import' '
 	test $MUSS = $(git rev-parse --verify refs/tags/muss)
 '
 
-test_expect_success 'iso-8859-7' '
+test_expect_success 'reencoding iso-8859-7' '
 
 	test_when_finished "git reset --hard HEAD~1" &&
 	test_config i18n.commitencoding iso-8859-7 &&
 	test_tick &&
 	echo rosten >file &&
 	git commit -s -F "$TEST_DIRECTORY/t9350/simple-iso-8859-7-commit-message.txt" file &&
-	git fast-export wer^..wer >iso-8859-7.fi &&
+	git fast-export --reencode=yes wer^..wer >iso-8859-7.fi &&
 	sed "s/wer/i18n/" iso-8859-7.fi |
 		(cd new &&
 		 git fast-import &&
@@ -118,13 +118,45 @@ test_expect_success 'iso-8859-7' '
 		 ! grep ^encoding actual)
 '
 
+test_expect_success 'aborting on iso-8859-7' '
+
+	test_when_finished "git reset --hard HEAD~1" &&
+	test_config i18n.commitencoding iso-8859-7 &&
+	echo rosten >file &&
+	git commit -s -F "$TEST_DIRECTORY/t9350/simple-iso-8859-7-commit-message.txt" file &&
+	test_must_fail git fast-export --reencode=abort wer^..wer >iso-8859-7.fi
+'
+
+test_expect_success 'preserving iso-8859-7' '
+
+	test_when_finished "git reset --hard HEAD~1" &&
+	test_config i18n.commitencoding iso-8859-7 &&
+	echo rosten >file &&
+	git commit -s -F "$TEST_DIRECTORY/t9350/simple-iso-8859-7-commit-message.txt" file &&
+	git fast-export --reencode=no wer^..wer >iso-8859-7.fi &&
+	sed "s/wer/i18n-no-recoding/" iso-8859-7.fi |
+		(cd new &&
+		 git fast-import &&
+		 # The commit object, if not re-encoded, is 240 bytes.
+		 # Removing the "encoding iso-8859-7\n" header would drops 20
+		 # bytes.  Re-encoding the Pi character from \xF0 (\360) in
+		 # iso-8859-7 to \xCF\x80 (\317\200) in utf-8 adds a byte.
+		 # Check for the expected size...
+		 test 240 -eq "$(git cat-file -s i18n-no-recoding)" &&
+		 # ...as well as the expected byte.
+		 git cat-file commit i18n-no-recoding >actual &&
+		 grep $(printf "\360") actual &&
+		 # Also make sure the commit has the "encoding" header
+		 grep ^encoding actual)
+'
+
 test_expect_success 'encoding preserved if reencoding fails' '
 
 	test_when_finished "git reset --hard HEAD~1" &&
 	test_config i18n.commitencoding iso-8859-7 &&
 	echo rosten >file &&
 	git commit -s -F "$TEST_DIRECTORY/t9350/broken-iso-8859-7-commit-message.txt" file &&
-	git fast-export wer^..wer >iso-8859-7.fi &&
+	git fast-export --reencode=yes wer^..wer >iso-8859-7.fi &&
 	sed "s/wer/i18n-invalid/" iso-8859-7.fi |
 		(cd new &&
 		 git fast-import &&
-- 
2.21.0.782.gd8be4ee826


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 5/5] fast-export: do automatic reencoding of commit messages only if requested
  2019-05-13 23:17       ` [PATCH v5 5/5] fast-export: do automatic reencoding of commit messages only if requested Elijah Newren
@ 2019-05-14  0:19         ` Eric Sunshine
  0 siblings, 0 replies; 45+ messages in thread
From: Eric Sunshine @ 2019-05-14  0:19 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Junio C Hamano, Git List, Johannes Schindelin, Johannes Sixt,
	Torsten Bögershausen

On Mon, May 13, 2019 at 7:17 PM Elijah Newren <newren@gmail.com> wrote:
> Automatic re-encoding of commit messages (and dropping of the encoding
> header) hurts attempts to do reversible history rewrites (e.g. sha1sum
> <-> sha256sum transitions, some subtree rewrites), and seems
> inconsistent with the general principle followed elsewhere in
> fast-export of requiring explicit user requests to modify the output
> (e.g. --signed-tags=strip, --tag-of-filtered-object=rewrite).  Add a
> --reencode flag that the user can use to specify, and like other
> fast-export flags, default it to 'abort'.
>
> Signed-off-by: Elijah Newren <newren@gmail.com>
> ---
> diff --git a/builtin/fast-export.c b/builtin/fast-export.c
> @@ -77,6 +78,31 @@ static int parse_opt_tag_of_filtered_mode(const struct option *opt,
> +static int parse_opt_reencode_mode(const struct option *opt,
> +                                  const char *arg, int unset)
> +{
> +       if (unset) {
> +               reencode_mode = REENCODE_ABORT;
> +               return 0;
> +       }
> +
> +       switch (git_parse_maybe_bool(arg)) {
> +       case 0:
> +               reencode_mode = REENCODE_NO;
> +               break;
> +       case 1:
> +               reencode_mode = REENCODE_YES;
> +               break;
> +       default:
> +               if (arg && !strcasecmp(arg, "abort"))

If 'arg' is NULL, git_parse_maybe_bool() will have returned 1, which
is already handled above, so no need to check it here. So, dropping
the "arg &&" from the conditional...

> +                       reencode_mode = REENCODE_ABORT;
> +               else
> +                       return error("Unknown reencoding mode: %s", arg);

...makes it clear that you won't ever be passing NULL to "%s", which
is undefined behavior.

> +       }
> +
> +       return 0;
> +}

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 1/5] t9350: fix encoding test to actually test reencoding
  2019-05-13 23:17       ` [PATCH v5 1/5] t9350: fix encoding test to actually test reencoding Elijah Newren
@ 2019-05-14  2:50         ` Torsten Bögershausen
  0 siblings, 0 replies; 45+ messages in thread
From: Torsten Bögershausen @ 2019-05-14  2:50 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Junio C Hamano, git, Eric Sunshine, Johannes Schindelin,
	Johannes Sixt

On Mon, May 13, 2019 at 04:17:22PM -0700, Elijah Newren wrote:
> This test used an author with non-ascii characters in the name, but
> no special commit message.  It then grep'ed for those non-ascii
> characters, but those are guaranteed to exist regardless of the
> reencoding process since the reencoding only affects the commit message,
> not the author or committer names.  As such, the test would work even if
> the re-encoding process simply stripped the commit message entirely.
> Modify the test to actually check that the reencoding in utf-8 worked.

(Not a native english speaker here): Should that be
"...reencoding into utf-8 worked" ?

>
> Signed-off-by: Elijah Newren <newren@gmail.com>
> ---
>  t/t9350-fast-export.sh                       | 27 ++++++++++++--------
>  t/t9350/simple-iso-8859-7-commit-message.txt |  1 +
>  2 files changed, 18 insertions(+), 10 deletions(-)
>  create mode 100644 t/t9350/simple-iso-8859-7-commit-message.txt
>
> diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
> index 5690fe2810..c721026260 100755
> --- a/t/t9350-fast-export.sh
> +++ b/t/t9350-fast-export.sh
> @@ -94,22 +94,30 @@ test_expect_success 'fast-export --show-original-ids | git fast-import' '
>  	test $MUSS = $(git rev-parse --verify refs/tags/muss)
>  '
>
> -test_expect_success 'iso-8859-1' '
> +test_expect_success 'iso-8859-7' '
>
> -	git config i18n.commitencoding ISO8859-1 &&
> -	# use author and committer name in ISO-8859-1 to match it.
> -	. "$TEST_DIRECTORY"/t3901/8859-1.txt &&
> +	test_when_finished "git reset --hard HEAD~1" &&
> +	test_config i18n.commitencoding iso-8859-7 &&
>  	test_tick &&
>  	echo rosten >file &&
> -	git commit -s -m den file &&
> -	git fast-export wer^..wer >iso8859-1.fi &&
> -	sed "s/wer/i18n/" iso8859-1.fi |
> +	git commit -s -F "$TEST_DIRECTORY/t9350/simple-iso-8859-7-commit-message.txt" file &&
> +	git fast-export wer^..wer >iso-8859-7.fi &&
> +	sed "s/wer/i18n/" iso-8859-7.fi |
>  		(cd new &&
>  		 git fast-import &&
> +		 # The commit object, if not re-encoded, would be 240 bytes.
> +		 # Removing the "encoding iso-8859-7\n" header drops 20 bytes.
> +		 # Re-encoding the Pi character from \xF0 (\360) in iso-8859-7
> +		 # to \xCF\x80 (\317\200) in utf-8 adds a byte.  Check for
> +		 # the expected size.
> +		 test 221 -eq "$(git cat-file -s i18n)" &&
> +		 # ...and for the expected translation of bytes.
>  		 git cat-file commit i18n >actual &&
> -		 grep "Áéí óú" actual)
> -
> +		 grep $(printf "\317\200") actual &&
> +		 # Also make sure the commit does not have the "encoding" header
> +		 ! grep ^encoding actual)
>  '
> +
>  test_expect_success 'import/export-marks' '
>
>  	git checkout -b marks master &&
> @@ -224,7 +232,6 @@ GIT_COMMITTER_NAME='C O Mitter'; export GIT_COMMITTER_NAME
>
>  test_expect_success 'setup copies' '
>
> -	git config --unset i18n.commitencoding &&
>  	git checkout -b copy rein &&
>  	git mv file file3 &&
>  	git commit -m move1 &&
> diff --git a/t/t9350/simple-iso-8859-7-commit-message.txt b/t/t9350/simple-iso-8859-7-commit-message.txt
> new file mode 100644
> index 0000000000..8b3f0c3dba
> --- /dev/null
> +++ b/t/t9350/simple-iso-8859-7-commit-message.txt
> @@ -0,0 +1 @@
> +Pi: ?
> \ No newline at end of file
> --
> 2.21.0.782.gd8be4ee826
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 3/5] fast-export: avoid stripping encoding header if we cannot reencode
  2019-05-13 23:17       ` [PATCH v5 3/5] fast-export: avoid stripping encoding header if we cannot reencode Elijah Newren
@ 2019-05-14  2:56         ` Torsten Bögershausen
  0 siblings, 0 replies; 45+ messages in thread
From: Torsten Bögershausen @ 2019-05-14  2:56 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Junio C Hamano, git, Eric Sunshine, Johannes Schindelin,
	Johannes Sixt

On Mon, May 13, 2019 at 04:17:24PM -0700, Elijah Newren wrote:
> When fast-export encounters a commit with an 'encoding' header, it tries
> to reencode in utf-8 and then drops the encoding header.  However, if it
> fails to reencode in utf-8 because e.g. one of the characters in the
> commit message was invalid in the old encoding, then we need to retain
> the original encoding or otherwise we lose information needed to
> understand all the other (valid) characters in the original commit
> message.

Minor question: "utf-8" or "UTF-8" ?
Mostly we use UTF-8 in Git.

>
> Signed-off-by: Elijah Newren <newren@gmail.com>
> ---
>  builtin/fast-export.c                        |  7 +++++--
>  t/t9350-fast-export.sh                       | 21 ++++++++++++++++++++
>  t/t9350/broken-iso-8859-7-commit-message.txt |  1 +
>  3 files changed, 27 insertions(+), 2 deletions(-)
>  create mode 100644 t/t9350/broken-iso-8859-7-commit-message.txt
>
> diff --git a/builtin/fast-export.c b/builtin/fast-export.c
> index 9e283482ef..7734a9f5a5 100644
> --- a/builtin/fast-export.c
> +++ b/builtin/fast-export.c
> @@ -642,9 +642,12 @@ static void handle_commit(struct commit *commit, struct rev_info *rev,
>  	printf("commit %s\nmark :%"PRIu32"\n", refname, last_idnum);
>  	if (show_original_ids)
>  		printf("original-oid %s\n", oid_to_hex(&commit->object.oid));
> -	printf("%.*s\n%.*s\ndata %u\n%s",
> +	printf("%.*s\n%.*s\n",
>  	       (int)(author_end - author), author,
> -	       (int)(committer_end - committer), committer,
> +	       (int)(committer_end - committer), committer);
> +	if (!reencoded && encoding)
> +		printf("encoding %s\n", encoding);
> +	printf("data %u\n%s",
>  	       (unsigned)(reencoded
>  			  ? strlen(reencoded) : message
>  			  ? strlen(message) : 0),
> diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
> index c721026260..4fd637312a 100755
> --- a/t/t9350-fast-export.sh
> +++ b/t/t9350-fast-export.sh
> @@ -118,6 +118,27 @@ test_expect_success 'iso-8859-7' '
>  		 ! grep ^encoding actual)
>  '
>
> +test_expect_success 'encoding preserved if reencoding fails' '
> +
> +	test_when_finished "git reset --hard HEAD~1" &&
> +	test_config i18n.commitencoding iso-8859-7 &&
> +	echo rosten >file &&
> +	git commit -s -F "$TEST_DIRECTORY/t9350/broken-iso-8859-7-commit-message.txt" file &&
> +	git fast-export wer^..wer >iso-8859-7.fi &&
> +	sed "s/wer/i18n-invalid/" iso-8859-7.fi |
> +		(cd new &&
> +		 git fast-import &&
> +		 git cat-file commit i18n-invalid >actual &&
> +		 # Make sure the commit still has the encoding header
> +		 grep ^encoding actual &&
> +		 # Verify that the commit has the expected size; i.e.
> +		 # that no bytes were re-encoded to a different encoding.
> +		 test 252 -eq "$(git cat-file -s i18n-invalid)" &&
> +		 # ...and check for the original special bytes
> +		 grep $(printf "\360") actual &&
> +		 grep $(printf "\377") actual)
> +'
> +
>  test_expect_success 'import/export-marks' '
>
>  	git checkout -b marks master &&
> diff --git a/t/t9350/broken-iso-8859-7-commit-message.txt b/t/t9350/broken-iso-8859-7-commit-message.txt
> new file mode 100644
> index 0000000000..d06ad75b44
> --- /dev/null
> +++ b/t/t9350/broken-iso-8859-7-commit-message.txt
> @@ -0,0 +1 @@
> +Pi: ?; Invalid: ?
> \ No newline at end of file
> --
> 2.21.0.782.gd8be4ee826
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 4/5] fast-export: differentiate between explicitly utf-8 and implicitly utf-8
  2019-05-13 23:17       ` [PATCH v5 4/5] fast-export: differentiate between explicitly utf-8 and implicitly utf-8 Elijah Newren
@ 2019-05-14  3:01         ` Torsten Bögershausen
  0 siblings, 0 replies; 45+ messages in thread
From: Torsten Bögershausen @ 2019-05-14  3:01 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Junio C Hamano, git, Eric Sunshine, Johannes Schindelin,
	Johannes Sixt

On Mon, May 13, 2019 at 04:17:25PM -0700, Elijah Newren wrote:
> The find_encoding() function returned the encoding used by a commit
> message, returning a default of git_commit_encoding (usually utf-8).
I think "UTF-8" is preferred over "utf-8".
Unless it is a function name like is_encoding_utf8()

> Although the current code does not differentiate between a commit which
> explicitly requested utf-8 and one where we just assume utf-8 because no
> encoding is set, it will become important when we try to preserve the
> encoding header.  Since is_encoding_utf8() returns true when passed
> NULL, we can just return NULL from find_encoding() instead of returning
> git_commit_encoding.
>
> Signed-off-by: Elijah Newren <newren@gmail.com>
> ---
>  builtin/fast-export.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/builtin/fast-export.c b/builtin/fast-export.c
> index 7734a9f5a5..66331fa401 100644
> --- a/builtin/fast-export.c
> +++ b/builtin/fast-export.c
> @@ -453,7 +453,7 @@ static const char *find_encoding(const char *begin, const char *end)
>  	bol = memmem(begin, end ? end - begin : strlen(begin),
>  		     needle, strlen(needle));
>  	if (!bol)
> -		return git_commit_encoding;
> +		return NULL;
>  	bol += strlen(needle);
>  	eol = strchrnul(bol, '\n');
>  	*eol = '\0';
> --
> 2.21.0.782.gd8be4ee826
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v6 0/5] Fix and extend encoding handling in fast export/import
  2019-05-13 23:17     ` [PATCH v5 0/5] Fix and extend encoding handling in fast export/import Elijah Newren
                         ` (4 preceding siblings ...)
  2019-05-13 23:17       ` [PATCH v5 5/5] fast-export: do automatic reencoding of commit messages only if requested Elijah Newren
@ 2019-05-14  4:30       ` Elijah Newren
  2019-05-14  4:30         ` [PATCH v6 1/5] t9350: fix encoding test to actually test reencoding Elijah Newren
                           ` (5 more replies)
  5 siblings, 6 replies; 45+ messages in thread
From: Elijah Newren @ 2019-05-14  4:30 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Eric Sunshine, Johannes Schindelin, Johannes Sixt,
	Torsten Bögershausen, Elijah Newren

While stress testing `git filter-repo`, I noticed an issue with
encoding; further digging led to the fixes and features in this series.
See the individual commit messages for details.

Changes since v5 (full range-diff below):
  * s/utf-8/UTF-8/, as pointed out by Torsten (in commit messages and comments)
  * small code cleanup pointed out by Eric
  * rewrap the first commit message

Elijah Newren (5):
  t9350: fix encoding test to actually test reencoding
  fast-import: support 'encoding' commit header
  fast-export: avoid stripping encoding header if we cannot reencode
  fast-export: differentiate between explicitly UTF-8 and implicitly
    UTF-8
  fast-export: do automatic reencoding of commit messages only if
    requested

 Documentation/git-fast-export.txt            |  7 ++
 Documentation/git-fast-import.txt            |  7 ++
 builtin/fast-export.c                        | 55 ++++++++++++--
 fast-import.c                                | 11 ++-
 t/t9300-fast-import.sh                       | 20 +++++
 t/t9350-fast-export.sh                       | 78 +++++++++++++++++---
 t/t9350/broken-iso-8859-7-commit-message.txt |  1 +
 t/t9350/simple-iso-8859-7-commit-message.txt |  1 +
 8 files changed, 163 insertions(+), 17 deletions(-)
 create mode 100644 t/t9350/broken-iso-8859-7-commit-message.txt
 create mode 100644 t/t9350/simple-iso-8859-7-commit-message.txt

Range-diff:
1:  37a68a0ffd ! 1:  b5dcdab662 t9350: fix encoding test to actually test reencoding
    @@ -2,13 +2,13 @@
     
         t9350: fix encoding test to actually test reencoding
     
    -    This test used an author with non-ascii characters in the name, but
    -    no special commit message.  It then grep'ed for those non-ascii
    -    characters, but those are guaranteed to exist regardless of the
    -    reencoding process since the reencoding only affects the commit message,
    -    not the author or committer names.  As such, the test would work even if
    -    the re-encoding process simply stripped the commit message entirely.
    -    Modify the test to actually check that the reencoding in utf-8 worked.
    +    This test used an author with non-ascii characters in the name, but no
    +    special commit message.  It then grep'ed for those non-ascii characters,
    +    but those are guaranteed to exist regardless of the reencoding process
    +    since the reencoding only affects the commit message, not the author or
    +    committer names.  As such, the test would work even if the re-encoding
    +    process simply stripped the commit message entirely.  Modify the test to
    +    actually check that the reencoding into UTF-8 worked.
     
         Signed-off-by: Elijah Newren <newren@gmail.com>
     
    @@ -40,7 +40,7 @@
     +		 # The commit object, if not re-encoded, would be 240 bytes.
     +		 # Removing the "encoding iso-8859-7\n" header drops 20 bytes.
     +		 # Re-encoding the Pi character from \xF0 (\360) in iso-8859-7
    -+		 # to \xCF\x80 (\317\200) in utf-8 adds a byte.  Check for
    ++		 # to \xCF\x80 (\317\200) in UTF-8 adds a byte.  Check for
     +		 # the expected size.
     +		 test 221 -eq "$(git cat-file -s i18n)" &&
     +		 # ...and for the expected translation of bytes.
2:  3d84f4613d ! 2:  af7d4e18fa fast-import: support 'encoding' commit header
    @@ -2,7 +2,7 @@
     
         fast-import: support 'encoding' commit header
     
    -    Since git supports commit messages with an encoding other than utf-8,
    +    Since git supports commit messages with an encoding other than UTF-8,
         allow fast-import to import such commits.  This may be useful for folks
         who do not want to reencode commit messages from an external system, and
         may also be useful to achieve reversible history rewrites (e.g. sha1sum
3:  baa8394a3a ! 3:  d5b300692a fast-export: avoid stripping encoding header if we cannot reencode
    @@ -3,8 +3,8 @@
         fast-export: avoid stripping encoding header if we cannot reencode
     
         When fast-export encounters a commit with an 'encoding' header, it tries
    -    to reencode in utf-8 and then drops the encoding header.  However, if it
    -    fails to reencode in utf-8 because e.g. one of the characters in the
    +    to reencode in UTF-8 and then drops the encoding header.  However, if it
    +    fails to reencode in UTF-8 because e.g. one of the characters in the
         commit message was invalid in the old encoding, then we need to retain
         the original encoding or otherwise we lose information needed to
         understand all the other (valid) characters in the original commit
4:  49960164c6 ! 4:  2cef40c613 fast-export: differentiate between explicitly utf-8 and implicitly utf-8
    @@ -1,11 +1,11 @@
     Author: Elijah Newren <newren@gmail.com>
     
    -    fast-export: differentiate between explicitly utf-8 and implicitly utf-8
    +    fast-export: differentiate between explicitly UTF-8 and implicitly UTF-8
     
         The find_encoding() function returned the encoding used by a commit
    -    message, returning a default of git_commit_encoding (usually utf-8).
    +    message, returning a default of git_commit_encoding (usually UTF-8).
         Although the current code does not differentiate between a commit which
    -    explicitly requested utf-8 and one where we just assume utf-8 because no
    +    explicitly requested UTF-8 and one where we just assume UTF-8 because no
         encoding is set, it will become important when we try to preserve the
         encoding header.  Since is_encoding_utf8() returns true when passed
         NULL, we can just return NULL from find_encoding() instead of returning
5:  d8be4ee826 ! 5:  d18f03d1bf fast-export: do automatic reencoding of commit messages only if requested
    @@ -62,7 +62,7 @@
     +		reencode_mode = REENCODE_YES;
     +		break;
     +	default:
    -+		if (arg && !strcasecmp(arg, "abort"))
    ++		if (!strcasecmp(arg, "abort"))
     +			reencode_mode = REENCODE_ABORT;
     +		else
     +			return error("Unknown reencoding mode: %s", arg);
    @@ -156,7 +156,7 @@
     +		 # The commit object, if not re-encoded, is 240 bytes.
     +		 # Removing the "encoding iso-8859-7\n" header would drops 20
     +		 # bytes.  Re-encoding the Pi character from \xF0 (\360) in
    -+		 # iso-8859-7 to \xCF\x80 (\317\200) in utf-8 adds a byte.
    ++		 # iso-8859-7 to \xCF\x80 (\317\200) in UTF-8 adds a byte.
     +		 # Check for the expected size...
     +		 test 240 -eq "$(git cat-file -s i18n-no-recoding)" &&
     +		 # ...as well as the expected byte.
-- 
2.21.0.782.gd18f03d1bf


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v6 1/5] t9350: fix encoding test to actually test reencoding
  2019-05-14  4:30       ` [PATCH v6 0/5] Fix and extend encoding handling in fast export/import Elijah Newren
@ 2019-05-14  4:30         ` Elijah Newren
  2019-05-14  4:30         ` [PATCH v6 2/5] fast-import: support 'encoding' commit header Elijah Newren
                           ` (4 subsequent siblings)
  5 siblings, 0 replies; 45+ messages in thread
From: Elijah Newren @ 2019-05-14  4:30 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Eric Sunshine, Johannes Schindelin, Johannes Sixt,
	Torsten Bögershausen, Elijah Newren

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=UTF-8, Size: 2986 bytes --]

This test used an author with non-ascii characters in the name, but no
special commit message.  It then grep'ed for those non-ascii characters,
but those are guaranteed to exist regardless of the reencoding process
since the reencoding only affects the commit message, not the author or
committer names.  As such, the test would work even if the re-encoding
process simply stripped the commit message entirely.  Modify the test to
actually check that the reencoding into UTF-8 worked.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 t/t9350-fast-export.sh                       | 27 ++++++++++++--------
 t/t9350/simple-iso-8859-7-commit-message.txt |  1 +
 2 files changed, 18 insertions(+), 10 deletions(-)
 create mode 100644 t/t9350/simple-iso-8859-7-commit-message.txt

diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
index 5690fe2810..756d6a9905 100755
--- a/t/t9350-fast-export.sh
+++ b/t/t9350-fast-export.sh
@@ -94,22 +94,30 @@ test_expect_success 'fast-export --show-original-ids | git fast-import' '
 	test $MUSS = $(git rev-parse --verify refs/tags/muss)
 '
 
-test_expect_success 'iso-8859-1' '
+test_expect_success 'iso-8859-7' '
 
-	git config i18n.commitencoding ISO8859-1 &&
-	# use author and committer name in ISO-8859-1 to match it.
-	. "$TEST_DIRECTORY"/t3901/8859-1.txt &&
+	test_when_finished "git reset --hard HEAD~1" &&
+	test_config i18n.commitencoding iso-8859-7 &&
 	test_tick &&
 	echo rosten >file &&
-	git commit -s -m den file &&
-	git fast-export wer^..wer >iso8859-1.fi &&
-	sed "s/wer/i18n/" iso8859-1.fi |
+	git commit -s -F "$TEST_DIRECTORY/t9350/simple-iso-8859-7-commit-message.txt" file &&
+	git fast-export wer^..wer >iso-8859-7.fi &&
+	sed "s/wer/i18n/" iso-8859-7.fi |
 		(cd new &&
 		 git fast-import &&
+		 # The commit object, if not re-encoded, would be 240 bytes.
+		 # Removing the "encoding iso-8859-7\n" header drops 20 bytes.
+		 # Re-encoding the Pi character from \xF0 (\360) in iso-8859-7
+		 # to \xCF\x80 (\317\200) in UTF-8 adds a byte.  Check for
+		 # the expected size.
+		 test 221 -eq "$(git cat-file -s i18n)" &&
+		 # ...and for the expected translation of bytes.
 		 git cat-file commit i18n >actual &&
-		 grep "Áéí óú" actual)
-
+		 grep $(printf "\317\200") actual &&
+		 # Also make sure the commit does not have the "encoding" header
+		 ! grep ^encoding actual)
 '
+
 test_expect_success 'import/export-marks' '
 
 	git checkout -b marks master &&
@@ -224,7 +232,6 @@ GIT_COMMITTER_NAME='C O Mitter'; export GIT_COMMITTER_NAME
 
 test_expect_success 'setup copies' '
 
-	git config --unset i18n.commitencoding &&
 	git checkout -b copy rein &&
 	git mv file file3 &&
 	git commit -m move1 &&
diff --git a/t/t9350/simple-iso-8859-7-commit-message.txt b/t/t9350/simple-iso-8859-7-commit-message.txt
new file mode 100644
index 0000000000..8b3f0c3dba
--- /dev/null
+++ b/t/t9350/simple-iso-8859-7-commit-message.txt
@@ -0,0 +1 @@
+Pi: ð
\ No newline at end of file
-- 
2.21.0.782.gd18f03d1bf


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v6 2/5] fast-import: support 'encoding' commit header
  2019-05-14  4:30       ` [PATCH v6 0/5] Fix and extend encoding handling in fast export/import Elijah Newren
  2019-05-14  4:30         ` [PATCH v6 1/5] t9350: fix encoding test to actually test reencoding Elijah Newren
@ 2019-05-14  4:30         ` Elijah Newren
  2019-05-14  4:31         ` [PATCH v6 3/5] fast-export: avoid stripping encoding header if we cannot reencode Elijah Newren
                           ` (3 subsequent siblings)
  5 siblings, 0 replies; 45+ messages in thread
From: Elijah Newren @ 2019-05-14  4:30 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Eric Sunshine, Johannes Schindelin, Johannes Sixt,
	Torsten Bögershausen, Elijah Newren

Since git supports commit messages with an encoding other than UTF-8,
allow fast-import to import such commits.  This may be useful for folks
who do not want to reencode commit messages from an external system, and
may also be useful to achieve reversible history rewrites (e.g. sha1sum
<-> sha256sum transitions or subtree work) with git repositories that
have used specialized encodings in their commit history.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 Documentation/git-fast-import.txt |  7 +++++++
 fast-import.c                     | 11 +++++++++--
 t/t9300-fast-import.sh            | 20 ++++++++++++++++++++
 3 files changed, 36 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-fast-import.txt b/Documentation/git-fast-import.txt
index d65cdb3d08..7baf9e47b5 100644
--- a/Documentation/git-fast-import.txt
+++ b/Documentation/git-fast-import.txt
@@ -388,6 +388,7 @@ change to the project.
 	original-oid?
 	('author' (SP <name>)? SP LT <email> GT SP <when> LF)?
 	'committer' (SP <name>)? SP LT <email> GT SP <when> LF
+	('encoding' SP <encoding>)?
 	data
 	('from' SP <commit-ish> LF)?
 	('merge' SP <commit-ish> LF)?
@@ -455,6 +456,12 @@ that was selected by the --date-format=<fmt> command-line option.
 See ``Date Formats'' above for the set of supported formats, and
 their syntax.
 
+`encoding`
+^^^^^^^^^^
+The optional `encoding` command indicates the encoding of the commit
+message.  Most commits are UTF-8 and the encoding is omitted, but this
+allows importing commit messages into git without first reencoding them.
+
 `from`
 ^^^^^^
 The `from` command is used to specify the commit to initialize
diff --git a/fast-import.c b/fast-import.c
index f38d04fa58..76a7bd3699 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -2585,6 +2585,7 @@ static void parse_new_commit(const char *arg)
 	struct branch *b;
 	char *author = NULL;
 	char *committer = NULL;
+	const char *encoding = NULL;
 	struct hash_list *merge_list = NULL;
 	unsigned int merge_count;
 	unsigned char prev_fanout, new_fanout;
@@ -2607,6 +2608,8 @@ static void parse_new_commit(const char *arg)
 	}
 	if (!committer)
 		die("Expected committer but didn't get one");
+	if (skip_prefix(command_buf.buf, "encoding ", &encoding))
+		read_next_command();
 	parse_data(&msg, 0, NULL);
 	read_next_command();
 	parse_from(b);
@@ -2670,9 +2673,13 @@ static void parse_new_commit(const char *arg)
 	}
 	strbuf_addf(&new_data,
 		"author %s\n"
-		"committer %s\n"
-		"\n",
+		"committer %s\n",
 		author ? author : committer, committer);
+	if (encoding)
+		strbuf_addf(&new_data,
+			"encoding %s\n",
+			encoding);
+	strbuf_addch(&new_data, '\n');
 	strbuf_addbuf(&new_data, &msg);
 	free(author);
 	free(committer);
diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
index 3668263c40..141b7fa35e 100755
--- a/t/t9300-fast-import.sh
+++ b/t/t9300-fast-import.sh
@@ -3299,4 +3299,24 @@ test_expect_success !MINGW 'W: get-mark & empty orphan commit with erroneous thi
 	sed -e s/LFs/LLL/ W-input | tr L "\n" | test_must_fail git fast-import
 '
 
+###
+### series X (other new features)
+###
+
+test_expect_success 'X: handling encoding' '
+	test_tick &&
+	cat >input <<-INPUT_END &&
+	commit refs/heads/encoding
+	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+	encoding iso-8859-7
+	data <<COMMIT
+	INPUT_END
+
+	printf "Pi: \360\nCOMMIT\n" >>input &&
+
+	git fast-import <input &&
+	git cat-file -p encoding | grep $(printf "\360") &&
+	git log -1 --format=%B encoding | grep $(printf "\317\200")
+'
+
 test_done
-- 
2.21.0.782.gd18f03d1bf


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v6 3/5] fast-export: avoid stripping encoding header if we cannot reencode
  2019-05-14  4:30       ` [PATCH v6 0/5] Fix and extend encoding handling in fast export/import Elijah Newren
  2019-05-14  4:30         ` [PATCH v6 1/5] t9350: fix encoding test to actually test reencoding Elijah Newren
  2019-05-14  4:30         ` [PATCH v6 2/5] fast-import: support 'encoding' commit header Elijah Newren
@ 2019-05-14  4:31         ` Elijah Newren
  2019-05-14  4:31         ` [PATCH v6 4/5] fast-export: differentiate between explicitly UTF-8 and implicitly UTF-8 Elijah Newren
                           ` (2 subsequent siblings)
  5 siblings, 0 replies; 45+ messages in thread
From: Elijah Newren @ 2019-05-14  4:31 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Eric Sunshine, Johannes Schindelin, Johannes Sixt,
	Torsten Bögershausen, Elijah Newren

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=UTF-8, Size: 3048 bytes --]

When fast-export encounters a commit with an 'encoding' header, it tries
to reencode in UTF-8 and then drops the encoding header.  However, if it
fails to reencode in UTF-8 because e.g. one of the characters in the
commit message was invalid in the old encoding, then we need to retain
the original encoding or otherwise we lose information needed to
understand all the other (valid) characters in the original commit
message.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 builtin/fast-export.c                        |  7 +++++--
 t/t9350-fast-export.sh                       | 21 ++++++++++++++++++++
 t/t9350/broken-iso-8859-7-commit-message.txt |  1 +
 3 files changed, 27 insertions(+), 2 deletions(-)
 create mode 100644 t/t9350/broken-iso-8859-7-commit-message.txt

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 9e283482ef..7734a9f5a5 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -642,9 +642,12 @@ static void handle_commit(struct commit *commit, struct rev_info *rev,
 	printf("commit %s\nmark :%"PRIu32"\n", refname, last_idnum);
 	if (show_original_ids)
 		printf("original-oid %s\n", oid_to_hex(&commit->object.oid));
-	printf("%.*s\n%.*s\ndata %u\n%s",
+	printf("%.*s\n%.*s\n",
 	       (int)(author_end - author), author,
-	       (int)(committer_end - committer), committer,
+	       (int)(committer_end - committer), committer);
+	if (!reencoded && encoding)
+		printf("encoding %s\n", encoding);
+	printf("data %u\n%s",
 	       (unsigned)(reencoded
 			  ? strlen(reencoded) : message
 			  ? strlen(message) : 0),
diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
index 756d6a9905..e2ab8eddc0 100755
--- a/t/t9350-fast-export.sh
+++ b/t/t9350-fast-export.sh
@@ -118,6 +118,27 @@ test_expect_success 'iso-8859-7' '
 		 ! grep ^encoding actual)
 '
 
+test_expect_success 'encoding preserved if reencoding fails' '
+
+	test_when_finished "git reset --hard HEAD~1" &&
+	test_config i18n.commitencoding iso-8859-7 &&
+	echo rosten >file &&
+	git commit -s -F "$TEST_DIRECTORY/t9350/broken-iso-8859-7-commit-message.txt" file &&
+	git fast-export wer^..wer >iso-8859-7.fi &&
+	sed "s/wer/i18n-invalid/" iso-8859-7.fi |
+		(cd new &&
+		 git fast-import &&
+		 git cat-file commit i18n-invalid >actual &&
+		 # Make sure the commit still has the encoding header
+		 grep ^encoding actual &&
+		 # Verify that the commit has the expected size; i.e.
+		 # that no bytes were re-encoded to a different encoding.
+		 test 252 -eq "$(git cat-file -s i18n-invalid)" &&
+		 # ...and check for the original special bytes
+		 grep $(printf "\360") actual &&
+		 grep $(printf "\377") actual)
+'
+
 test_expect_success 'import/export-marks' '
 
 	git checkout -b marks master &&
diff --git a/t/t9350/broken-iso-8859-7-commit-message.txt b/t/t9350/broken-iso-8859-7-commit-message.txt
new file mode 100644
index 0000000000..d06ad75b44
--- /dev/null
+++ b/t/t9350/broken-iso-8859-7-commit-message.txt
@@ -0,0 +1 @@
+Pi: ð; Invalid: ÿ
\ No newline at end of file
-- 
2.21.0.782.gd18f03d1bf


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v6 4/5] fast-export: differentiate between explicitly UTF-8 and implicitly UTF-8
  2019-05-14  4:30       ` [PATCH v6 0/5] Fix and extend encoding handling in fast export/import Elijah Newren
                           ` (2 preceding siblings ...)
  2019-05-14  4:31         ` [PATCH v6 3/5] fast-export: avoid stripping encoding header if we cannot reencode Elijah Newren
@ 2019-05-14  4:31         ` Elijah Newren
  2019-05-14  4:31         ` [PATCH v6 5/5] fast-export: do automatic reencoding of commit messages only if requested Elijah Newren
  2019-05-16 18:15         ` [PATCH v6 0/5] Fix and extend encoding handling in fast export/import Torsten Bögershausen
  5 siblings, 0 replies; 45+ messages in thread
From: Elijah Newren @ 2019-05-14  4:31 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Eric Sunshine, Johannes Schindelin, Johannes Sixt,
	Torsten Bögershausen, Elijah Newren

The find_encoding() function returned the encoding used by a commit
message, returning a default of git_commit_encoding (usually UTF-8).
Although the current code does not differentiate between a commit which
explicitly requested UTF-8 and one where we just assume UTF-8 because no
encoding is set, it will become important when we try to preserve the
encoding header.  Since is_encoding_utf8() returns true when passed
NULL, we can just return NULL from find_encoding() instead of returning
git_commit_encoding.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 builtin/fast-export.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 7734a9f5a5..66331fa401 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -453,7 +453,7 @@ static const char *find_encoding(const char *begin, const char *end)
 	bol = memmem(begin, end ? end - begin : strlen(begin),
 		     needle, strlen(needle));
 	if (!bol)
-		return git_commit_encoding;
+		return NULL;
 	bol += strlen(needle);
 	eol = strchrnul(bol, '\n');
 	*eol = '\0';
-- 
2.21.0.782.gd18f03d1bf


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v6 5/5] fast-export: do automatic reencoding of commit messages only if requested
  2019-05-14  4:30       ` [PATCH v6 0/5] Fix and extend encoding handling in fast export/import Elijah Newren
                           ` (3 preceding siblings ...)
  2019-05-14  4:31         ` [PATCH v6 4/5] fast-export: differentiate between explicitly UTF-8 and implicitly UTF-8 Elijah Newren
@ 2019-05-14  4:31         ` Elijah Newren
  2019-05-16 18:15         ` [PATCH v6 0/5] Fix and extend encoding handling in fast export/import Torsten Bögershausen
  5 siblings, 0 replies; 45+ messages in thread
From: Elijah Newren @ 2019-05-14  4:31 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Eric Sunshine, Johannes Schindelin, Johannes Sixt,
	Torsten Bögershausen, Elijah Newren

Automatic re-encoding of commit messages (and dropping of the encoding
header) hurts attempts to do reversible history rewrites (e.g. sha1sum
<-> sha256sum transitions, some subtree rewrites), and seems
inconsistent with the general principle followed elsewhere in
fast-export of requiring explicit user requests to modify the output
(e.g. --signed-tags=strip, --tag-of-filtered-object=rewrite).  Add a
--reencode flag that the user can use to specify, and like other
fast-export flags, default it to 'abort'.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 Documentation/git-fast-export.txt |  7 +++++
 builtin/fast-export.c             | 46 +++++++++++++++++++++++++++++--
 t/t9350-fast-export.sh            | 38 +++++++++++++++++++++++--
 3 files changed, 85 insertions(+), 6 deletions(-)

diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
index 64c01ba918..11427acdde 100644
--- a/Documentation/git-fast-export.txt
+++ b/Documentation/git-fast-export.txt
@@ -129,6 +129,13 @@ marks the same across runs.
 	for intermediary filters (e.g. for rewriting commit messages
 	which refer to older commits, or for stripping blobs by id).
 
+--reencode=(yes|no|abort)::
+	Specify how to handle `encoding` header in commit objects.  When
+	asking to 'abort' (which is the default), this program will die
+	when encountering such a commit object.  With 'yes', the commit
+	message will be reencoded into UTF-8.  With 'no', the original
+	encoding will be preserved.
+
 --refspec::
 	Apply the specified refspec to each ref exported. Multiple of them can
 	be specified.
diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 66331fa401..c22cef3b2f 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -33,6 +33,7 @@ static const char *fast_export_usage[] = {
 static int progress;
 static enum { SIGNED_TAG_ABORT, VERBATIM, WARN, WARN_STRIP, STRIP } signed_tag_mode = SIGNED_TAG_ABORT;
 static enum { TAG_FILTERING_ABORT, DROP, REWRITE } tag_of_filtered_mode = TAG_FILTERING_ABORT;
+static enum { REENCODE_ABORT, REENCODE_YES, REENCODE_NO } reencode_mode = REENCODE_ABORT;
 static int fake_missing_tagger;
 static int use_done_feature;
 static int no_data;
@@ -77,6 +78,31 @@ static int parse_opt_tag_of_filtered_mode(const struct option *opt,
 	return 0;
 }
 
+static int parse_opt_reencode_mode(const struct option *opt,
+				   const char *arg, int unset)
+{
+	if (unset) {
+		reencode_mode = REENCODE_ABORT;
+		return 0;
+	}
+
+	switch (git_parse_maybe_bool(arg)) {
+	case 0:
+		reencode_mode = REENCODE_NO;
+		break;
+	case 1:
+		reencode_mode = REENCODE_YES;
+		break;
+	default:
+		if (!strcasecmp(arg, "abort"))
+			reencode_mode = REENCODE_ABORT;
+		else
+			return error("Unknown reencoding mode: %s", arg);
+	}
+
+	return 0;
+}
+
 static struct decoration idnums;
 static uint32_t last_idnum;
 
@@ -633,10 +659,21 @@ static void handle_commit(struct commit *commit, struct rev_info *rev,
 	}
 
 	mark_next_object(&commit->object);
-	if (anonymize)
+	if (anonymize) {
 		reencoded = anonymize_commit_message(message);
-	else if (!is_encoding_utf8(encoding))
-		reencoded = reencode_string(message, "UTF-8", encoding);
+	} else if (encoding) {
+		switch(reencode_mode) {
+		case REENCODE_YES:
+			reencoded = reencode_string(message, "UTF-8", encoding);
+			break;
+		case REENCODE_NO:
+			break;
+		case REENCODE_ABORT:
+			die("Encountered commit-specific encoding %s in commit "
+			    "%s; use --reencode=[yes|no] to handle it",
+			    encoding, oid_to_hex(&commit->object.oid));
+		}
+	}
 	if (!commit->parents)
 		printf("reset %s\n", refname);
 	printf("commit %s\nmark :%"PRIu32"\n", refname, last_idnum);
@@ -1091,6 +1128,9 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix)
 		OPT_CALLBACK(0, "tag-of-filtered-object", &tag_of_filtered_mode, N_("mode"),
 			     N_("select handling of tags that tag filtered objects"),
 			     parse_opt_tag_of_filtered_mode),
+		OPT_CALLBACK(0, "reencode", &reencode_mode, N_("mode"),
+			     N_("select handling of commit messages in an alternate encoding"),
+			     parse_opt_reencode_mode),
 		OPT_STRING(0, "export-marks", &export_filename, N_("file"),
 			     N_("Dump marks to this file")),
 		OPT_STRING(0, "import-marks", &import_filename, N_("file"),
diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
index e2ab8eddc0..b4004e05c2 100755
--- a/t/t9350-fast-export.sh
+++ b/t/t9350-fast-export.sh
@@ -94,14 +94,14 @@ test_expect_success 'fast-export --show-original-ids | git fast-import' '
 	test $MUSS = $(git rev-parse --verify refs/tags/muss)
 '
 
-test_expect_success 'iso-8859-7' '
+test_expect_success 'reencoding iso-8859-7' '
 
 	test_when_finished "git reset --hard HEAD~1" &&
 	test_config i18n.commitencoding iso-8859-7 &&
 	test_tick &&
 	echo rosten >file &&
 	git commit -s -F "$TEST_DIRECTORY/t9350/simple-iso-8859-7-commit-message.txt" file &&
-	git fast-export wer^..wer >iso-8859-7.fi &&
+	git fast-export --reencode=yes wer^..wer >iso-8859-7.fi &&
 	sed "s/wer/i18n/" iso-8859-7.fi |
 		(cd new &&
 		 git fast-import &&
@@ -118,13 +118,45 @@ test_expect_success 'iso-8859-7' '
 		 ! grep ^encoding actual)
 '
 
+test_expect_success 'aborting on iso-8859-7' '
+
+	test_when_finished "git reset --hard HEAD~1" &&
+	test_config i18n.commitencoding iso-8859-7 &&
+	echo rosten >file &&
+	git commit -s -F "$TEST_DIRECTORY/t9350/simple-iso-8859-7-commit-message.txt" file &&
+	test_must_fail git fast-export --reencode=abort wer^..wer >iso-8859-7.fi
+'
+
+test_expect_success 'preserving iso-8859-7' '
+
+	test_when_finished "git reset --hard HEAD~1" &&
+	test_config i18n.commitencoding iso-8859-7 &&
+	echo rosten >file &&
+	git commit -s -F "$TEST_DIRECTORY/t9350/simple-iso-8859-7-commit-message.txt" file &&
+	git fast-export --reencode=no wer^..wer >iso-8859-7.fi &&
+	sed "s/wer/i18n-no-recoding/" iso-8859-7.fi |
+		(cd new &&
+		 git fast-import &&
+		 # The commit object, if not re-encoded, is 240 bytes.
+		 # Removing the "encoding iso-8859-7\n" header would drops 20
+		 # bytes.  Re-encoding the Pi character from \xF0 (\360) in
+		 # iso-8859-7 to \xCF\x80 (\317\200) in UTF-8 adds a byte.
+		 # Check for the expected size...
+		 test 240 -eq "$(git cat-file -s i18n-no-recoding)" &&
+		 # ...as well as the expected byte.
+		 git cat-file commit i18n-no-recoding >actual &&
+		 grep $(printf "\360") actual &&
+		 # Also make sure the commit has the "encoding" header
+		 grep ^encoding actual)
+'
+
 test_expect_success 'encoding preserved if reencoding fails' '
 
 	test_when_finished "git reset --hard HEAD~1" &&
 	test_config i18n.commitencoding iso-8859-7 &&
 	echo rosten >file &&
 	git commit -s -F "$TEST_DIRECTORY/t9350/broken-iso-8859-7-commit-message.txt" file &&
-	git fast-export wer^..wer >iso-8859-7.fi &&
+	git fast-export --reencode=yes wer^..wer >iso-8859-7.fi &&
 	sed "s/wer/i18n-invalid/" iso-8859-7.fi |
 		(cd new &&
 		 git fast-import &&
-- 
2.21.0.782.gd18f03d1bf


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v6 0/5] Fix and extend encoding handling in fast export/import
  2019-05-14  4:30       ` [PATCH v6 0/5] Fix and extend encoding handling in fast export/import Elijah Newren
                           ` (4 preceding siblings ...)
  2019-05-14  4:31         ` [PATCH v6 5/5] fast-export: do automatic reencoding of commit messages only if requested Elijah Newren
@ 2019-05-16 18:15         ` Torsten Bögershausen
  5 siblings, 0 replies; 45+ messages in thread
From: Torsten Bögershausen @ 2019-05-16 18:15 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Junio C Hamano, git, Eric Sunshine, Johannes Schindelin,
	Johannes Sixt

On Mon, May 13, 2019 at 09:30:57PM -0700, Elijah Newren wrote:
> While stress testing `git filter-repo`, I noticed an issue with
> encoding; further digging led to the fixes and features in this series.
> See the individual commit messages for details.
>

No more comments from my side, thanks for the patience.

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2019-05-16 18:15 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-30 18:25 [PATCH v2 0/5] Fix and extend encoding handling in fast export/import Elijah Newren
2019-04-30 18:25 ` [PATCH v2 1/5] t9350: fix encoding test to actually test reencoding Elijah Newren
2019-04-30 18:25 ` [PATCH v2 2/5] fast-import: support 'encoding' commit header Elijah Newren
2019-04-30 18:25 ` [PATCH v2 3/5] fast-export: avoid stripping encoding header if we cannot reencode Elijah Newren
2019-04-30 18:25 ` [PATCH v2 4/5] fast-export: differentiate between explicitly utf-8 and implicitly utf-8 Elijah Newren
2019-04-30 18:25 ` [PATCH v2 5/5] fast-export: do automatic reencoding of commit messages only if requested Elijah Newren
2019-05-10 20:53 ` [PATCH v3 0/5] Fix and extend encoding handling in fast export/import Elijah Newren
2019-05-10 20:53   ` [PATCH v3 1/5] t9350: fix encoding test to actually test reencoding Elijah Newren
2019-05-10 20:53   ` [PATCH v3 2/5] fast-import: support 'encoding' commit header Elijah Newren
2019-05-10 20:53   ` [PATCH v3 3/5] fast-export: avoid stripping encoding header if we cannot reencode Elijah Newren
2019-05-10 20:53   ` [PATCH v3 4/5] fast-export: differentiate between explicitly utf-8 and implicitly utf-8 Elijah Newren
2019-05-10 20:53   ` [PATCH v3 5/5] fast-export: do automatic reencoding of commit messages only if requested Elijah Newren
2019-05-11 21:07     ` Torsten Bögershausen
2019-05-11 21:42       ` Elijah Newren
2019-05-13  7:48         ` Junio C Hamano
2019-05-13 13:24           ` Elijah Newren
2019-05-13 10:23         ` Johannes Schindelin
2019-05-13 12:56           ` Torsten Bögershausen
2019-05-13 13:29             ` Elijah Newren
2019-05-13 16:41           ` Elijah Newren
2019-05-13 10:14   ` [PATCH v3 0/5] Fix and extend encoding handling in fast export/import Johannes Schindelin
2019-05-13 16:47   ` [PATCH v4 " Elijah Newren
2019-05-13 16:47     ` [PATCH v4 1/5] t9350: fix encoding test to actually test reencoding Elijah Newren
2019-05-13 16:47     ` [PATCH v4 2/5] fast-import: support 'encoding' commit header Elijah Newren
2019-05-13 16:47     ` [PATCH v4 3/5] fast-export: avoid stripping encoding header if we cannot reencode Elijah Newren
2019-05-13 16:47     ` [PATCH v4 4/5] fast-export: differentiate between explicitly utf-8 and implicitly utf-8 Elijah Newren
2019-05-13 16:47     ` [PATCH v4 5/5] fast-export: do automatic reencoding of commit messages only if requested Elijah Newren
2019-05-13 22:32       ` Junio C Hamano
2019-05-13 23:17     ` [PATCH v5 0/5] Fix and extend encoding handling in fast export/import Elijah Newren
2019-05-13 23:17       ` [PATCH v5 1/5] t9350: fix encoding test to actually test reencoding Elijah Newren
2019-05-14  2:50         ` Torsten Bögershausen
2019-05-13 23:17       ` [PATCH v5 2/5] fast-import: support 'encoding' commit header Elijah Newren
2019-05-13 23:17       ` [PATCH v5 3/5] fast-export: avoid stripping encoding header if we cannot reencode Elijah Newren
2019-05-14  2:56         ` Torsten Bögershausen
2019-05-13 23:17       ` [PATCH v5 4/5] fast-export: differentiate between explicitly utf-8 and implicitly utf-8 Elijah Newren
2019-05-14  3:01         ` Torsten Bögershausen
2019-05-13 23:17       ` [PATCH v5 5/5] fast-export: do automatic reencoding of commit messages only if requested Elijah Newren
2019-05-14  0:19         ` Eric Sunshine
2019-05-14  4:30       ` [PATCH v6 0/5] Fix and extend encoding handling in fast export/import Elijah Newren
2019-05-14  4:30         ` [PATCH v6 1/5] t9350: fix encoding test to actually test reencoding Elijah Newren
2019-05-14  4:30         ` [PATCH v6 2/5] fast-import: support 'encoding' commit header Elijah Newren
2019-05-14  4:31         ` [PATCH v6 3/5] fast-export: avoid stripping encoding header if we cannot reencode Elijah Newren
2019-05-14  4:31         ` [PATCH v6 4/5] fast-export: differentiate between explicitly UTF-8 and implicitly UTF-8 Elijah Newren
2019-05-14  4:31         ` [PATCH v6 5/5] fast-export: do automatic reencoding of commit messages only if requested Elijah Newren
2019-05-16 18:15         ` [PATCH v6 0/5] Fix and extend encoding handling in fast export/import Torsten Bögershausen

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).