git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH 0/5] Fix and extend encoding handling in fast export/import
@ 2019-04-25 15:51 Elijah Newren
  2019-04-25 15:51 ` [PATCH 1/5] t9350: fix encoding test to actually test reencoding Elijah Newren
                   ` (5 more replies)
  0 siblings, 6 replies; 13+ messages in thread
From: Elijah Newren @ 2019-04-25 15:51 UTC (permalink / raw)
  To: git; +Cc: gitster, johannes.schindelin, Elijah Newren

While stress testing `git filter-repo`, I noticed an issue with
encoding; further digging led to the fixes and features in this series.
See the individual commit messages for details.

Elijah Newren (5):
  t9350: fix encoding test to actually test reencoding
  fast-import: support 'encoding' commit header
  fast-export: avoid stripping encoding header if we cannot reencode
  fast-export: differentiate between explicitly utf-8 and implicitly
    utf-8
  fast-export: do automatic reencoding of commit messages only if
    requested

 Documentation/git-fast-import.txt |  7 ++++
 builtin/fast-export.c             | 44 ++++++++++++++++++++----
 fast-import.c                     | 12 +++++--
 t/t9300-fast-import.sh            | 20 +++++++++++
 t/t9350-fast-export.sh            | 57 ++++++++++++++++++++++++++-----
 5 files changed, 123 insertions(+), 17 deletions(-)

-- 
2.21.0.779.g2f4b9c5032

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/5] t9350: fix encoding test to actually test reencoding
  2019-04-25 15:51 [PATCH 0/5] Fix and extend encoding handling in fast export/import Elijah Newren
@ 2019-04-25 15:51 ` Elijah Newren
  2019-04-25 17:52   ` Eric Sunshine
  2019-04-25 15:51 ` [PATCH 2/5] fast-import: support 'encoding' commit header Elijah Newren
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 13+ messages in thread
From: Elijah Newren @ 2019-04-25 15:51 UTC (permalink / raw)
  To: git; +Cc: gitster, johannes.schindelin, Elijah Newren

This test used an author with non-ascii characters in the name, but
no special commit message.  It then grep'ed for those non-ascii
characters, but those are guaranteed to exist regardless of the
reencoding process since the reencoding only affects the commit message,
not the author or committer names.  As such, the test would work even if
the re-encoding process simply stripped the commit message entirely.
Modify the test to actually check that the reencoding in utf-8 worked.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 t/t9350-fast-export.sh | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
index 5690fe2810..6c07f910eb 100755
--- a/t/t9350-fast-export.sh
+++ b/t/t9350-fast-export.sh
@@ -94,22 +94,22 @@ test_expect_success 'fast-export --show-original-ids | git fast-import' '
 	test $MUSS = $(git rev-parse --verify refs/tags/muss)
 '
 
-test_expect_success 'iso-8859-1' '
+test_expect_success 'iso-8859-7' '
 
-	git config i18n.commitencoding ISO8859-1 &&
-	# use author and committer name in ISO-8859-1 to match it.
-	. "$TEST_DIRECTORY"/t3901/8859-1.txt &&
+	test_when_finished "git reset --hard HEAD~1" &&
+	test_when_finished "git config --unset i18n.commitencoding" &&
+	git config i18n.commitencoding iso-8859-7 &&
 	test_tick &&
 	echo rosten >file &&
-	git commit -s -m den file &&
-	git fast-export wer^..wer >iso8859-1.fi &&
-	sed "s/wer/i18n/" iso8859-1.fi |
+	git commit -s -m "$(printf "Pi: \360")" file &&
+	git fast-export wer^..wer >iso-8859-7.fi &&
+	sed "s/wer/i18n/" iso-8859-7.fi |
 		(cd new &&
 		 git fast-import &&
 		 git cat-file commit i18n >actual &&
-		 grep "Áéí óú" actual)
-
+		 grep $(printf "\317\200") actual)
 '
+
 test_expect_success 'import/export-marks' '
 
 	git checkout -b marks master &&
@@ -224,7 +224,6 @@ GIT_COMMITTER_NAME='C O Mitter'; export GIT_COMMITTER_NAME
 
 test_expect_success 'setup copies' '
 
-	git config --unset i18n.commitencoding &&
 	git checkout -b copy rein &&
 	git mv file file3 &&
 	git commit -m move1 &&
-- 
2.21.0.779.g2f4b9c5032


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 2/5] fast-import: support 'encoding' commit header
  2019-04-25 15:51 [PATCH 0/5] Fix and extend encoding handling in fast export/import Elijah Newren
  2019-04-25 15:51 ` [PATCH 1/5] t9350: fix encoding test to actually test reencoding Elijah Newren
@ 2019-04-25 15:51 ` Elijah Newren
  2019-04-25 19:36   ` Eric Sunshine
  2019-04-25 15:51 ` [PATCH 3/5] fast-export: avoid stripping encoding header if we cannot reencode Elijah Newren
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 13+ messages in thread
From: Elijah Newren @ 2019-04-25 15:51 UTC (permalink / raw)
  To: git; +Cc: gitster, johannes.schindelin, Elijah Newren

Since git supports commit messages with an encoding other than utf-8,
allow fast-import to import such commits.  This may be useful for folks
who do not want to reencode commit messages from an external system, and
may also be useful to achieve reversible history rewrites (e.g. sha1sum
<-> sha256sum transitions or subtree work) with git repositories that
have used specialized encodings in their commit history.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 Documentation/git-fast-import.txt |  7 +++++++
 fast-import.c                     | 12 ++++++++++--
 t/t9300-fast-import.sh            | 20 ++++++++++++++++++++
 3 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-fast-import.txt b/Documentation/git-fast-import.txt
index d65cdb3d08..7baf9e47b5 100644
--- a/Documentation/git-fast-import.txt
+++ b/Documentation/git-fast-import.txt
@@ -388,6 +388,7 @@ change to the project.
 	original-oid?
 	('author' (SP <name>)? SP LT <email> GT SP <when> LF)?
 	'committer' (SP <name>)? SP LT <email> GT SP <when> LF
+	('encoding' SP <encoding>)?
 	data
 	('from' SP <commit-ish> LF)?
 	('merge' SP <commit-ish> LF)?
@@ -455,6 +456,12 @@ that was selected by the --date-format=<fmt> command-line option.
 See ``Date Formats'' above for the set of supported formats, and
 their syntax.
 
+`encoding`
+^^^^^^^^^^
+The optional `encoding` command indicates the encoding of the commit
+message.  Most commits are UTF-8 and the encoding is omitted, but this
+allows importing commit messages into git without first reencoding them.
+
 `from`
 ^^^^^^
 The `from` command is used to specify the commit to initialize
diff --git a/fast-import.c b/fast-import.c
index f38d04fa58..25026c068a 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -2585,6 +2585,7 @@ static void parse_new_commit(const char *arg)
 	struct branch *b;
 	char *author = NULL;
 	char *committer = NULL;
+	const char *encoding = NULL;
 	struct hash_list *merge_list = NULL;
 	unsigned int merge_count;
 	unsigned char prev_fanout, new_fanout;
@@ -2607,6 +2608,9 @@ static void parse_new_commit(const char *arg)
 	}
 	if (!committer)
 		die("Expected committer but didn't get one");
+	if (skip_prefix(command_buf.buf, "encoding ", &encoding)) {
+		read_next_command();
+	}
 	parse_data(&msg, 0, NULL);
 	read_next_command();
 	parse_from(b);
@@ -2670,9 +2674,13 @@ static void parse_new_commit(const char *arg)
 	}
 	strbuf_addf(&new_data,
 		"author %s\n"
-		"committer %s\n"
-		"\n",
+		"committer %s\n",
 		author ? author : committer, committer);
+	if (encoding)
+		strbuf_addf(&new_data,
+			"encoding %s\n",
+			encoding);
+	strbuf_addf(&new_data, "\n");
 	strbuf_addbuf(&new_data, &msg);
 	free(author);
 	free(committer);
diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
index 3668263c40..141b7fa35e 100755
--- a/t/t9300-fast-import.sh
+++ b/t/t9300-fast-import.sh
@@ -3299,4 +3299,24 @@ test_expect_success !MINGW 'W: get-mark & empty orphan commit with erroneous thi
 	sed -e s/LFs/LLL/ W-input | tr L "\n" | test_must_fail git fast-import
 '
 
+###
+### series X (other new features)
+###
+
+test_expect_success 'X: handling encoding' '
+	test_tick &&
+	cat >input <<-INPUT_END &&
+	commit refs/heads/encoding
+	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+	encoding iso-8859-7
+	data <<COMMIT
+	INPUT_END
+
+	printf "Pi: \360\nCOMMIT\n" >>input &&
+
+	git fast-import <input &&
+	git cat-file -p encoding | grep $(printf "\360") &&
+	git log -1 --format=%B encoding | grep $(printf "\317\200")
+'
+
 test_done
-- 
2.21.0.779.g2f4b9c5032


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 3/5] fast-export: avoid stripping encoding header if we cannot reencode
  2019-04-25 15:51 [PATCH 0/5] Fix and extend encoding handling in fast export/import Elijah Newren
  2019-04-25 15:51 ` [PATCH 1/5] t9350: fix encoding test to actually test reencoding Elijah Newren
  2019-04-25 15:51 ` [PATCH 2/5] fast-import: support 'encoding' commit header Elijah Newren
@ 2019-04-25 15:51 ` Elijah Newren
  2019-04-25 15:51 ` [PATCH 4/5] fast-export: differentiate between explicitly utf-8 and implicitly utf-8 Elijah Newren
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 13+ messages in thread
From: Elijah Newren @ 2019-04-25 15:51 UTC (permalink / raw)
  To: git; +Cc: gitster, johannes.schindelin, Elijah Newren

When fast-export encounters a commit with an 'encoding' header, it tries
to reencode in utf-8 and then drops the encoding header.  However, if it
fails to reencode in utf-8 because e.g. one of the characters in the
commit message was invalid in the old encoding, then we need to retain
the original encoding or otherwise we lose information needed to
understand all the other (valid) characters in the original commit
message.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 builtin/fast-export.c  |  7 +++++--
 t/t9350-fast-export.sh | 15 +++++++++++++++
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 9e283482ef..7734a9f5a5 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -642,9 +642,12 @@ static void handle_commit(struct commit *commit, struct rev_info *rev,
 	printf("commit %s\nmark :%"PRIu32"\n", refname, last_idnum);
 	if (show_original_ids)
 		printf("original-oid %s\n", oid_to_hex(&commit->object.oid));
-	printf("%.*s\n%.*s\ndata %u\n%s",
+	printf("%.*s\n%.*s\n",
 	       (int)(author_end - author), author,
-	       (int)(committer_end - committer), committer,
+	       (int)(committer_end - committer), committer);
+	if (!reencoded && encoding)
+		printf("encoding %s\n", encoding);
+	printf("data %u\n%s",
 	       (unsigned)(reencoded
 			  ? strlen(reencoded) : message
 			  ? strlen(message) : 0),
diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
index 6c07f910eb..975c8c4014 100755
--- a/t/t9350-fast-export.sh
+++ b/t/t9350-fast-export.sh
@@ -110,6 +110,21 @@ test_expect_success 'iso-8859-7' '
 		 grep $(printf "\317\200") actual)
 '
 
+test_expect_success 'encoding preserved if reencoding fails' '
+
+	test_when_finished "git reset --hard HEAD~1" &&
+	test_when_finished "git config --unset i18n.commitencoding" &&
+	git config i18n.commitencoding iso-8859-7 &&
+	echo rosten >file &&
+	git commit -s -m "$(printf "Pi: \360; Invalid: \377")" file &&
+	git fast-export wer^..wer >iso-8859-7.fi &&
+	sed "s/wer/i18n-invalid/" iso-8859-7.fi |
+		(cd new &&
+		 git fast-import &&
+		 git cat-file commit i18n-invalid >actual &&
+		 grep ^encoding actual)
+'
+
 test_expect_success 'import/export-marks' '
 
 	git checkout -b marks master &&
-- 
2.21.0.779.g2f4b9c5032


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 4/5] fast-export: differentiate between explicitly utf-8 and implicitly utf-8
  2019-04-25 15:51 [PATCH 0/5] Fix and extend encoding handling in fast export/import Elijah Newren
                   ` (2 preceding siblings ...)
  2019-04-25 15:51 ` [PATCH 3/5] fast-export: avoid stripping encoding header if we cannot reencode Elijah Newren
@ 2019-04-25 15:51 ` Elijah Newren
  2019-04-25 15:51 ` [PATCH 5/5] fast-export: do automatic reencoding of commit messages only if requested Elijah Newren
  2019-04-25 15:55 ` [PATCH 0/5] Fix and extend encoding handling in fast export/import Elijah Newren
  5 siblings, 0 replies; 13+ messages in thread
From: Elijah Newren @ 2019-04-25 15:51 UTC (permalink / raw)
  To: git; +Cc: gitster, johannes.schindelin, Elijah Newren

The find_encoding() function returned the encoding used by a commit
message, returning a default of git_commit_encoding (usually utf-8).
Although the current code does not differentiate between a commit which
explicitly requested utf-8 and one where we just assume utf-8 because no
encoding is set, it will become important when we try to preserve the
encoding header.  Since is_encoding_utf8() returns true when passed
NULL, we can just return NULL from find_encoding() instead of returning
git_commit_encoding.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 builtin/fast-export.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 7734a9f5a5..66331fa401 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -453,7 +453,7 @@ static const char *find_encoding(const char *begin, const char *end)
 	bol = memmem(begin, end ? end - begin : strlen(begin),
 		     needle, strlen(needle));
 	if (!bol)
-		return git_commit_encoding;
+		return NULL;
 	bol += strlen(needle);
 	eol = strchrnul(bol, '\n');
 	*eol = '\0';
-- 
2.21.0.779.g2f4b9c5032


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 5/5] fast-export: do automatic reencoding of commit messages only if requested
  2019-04-25 15:51 [PATCH 0/5] Fix and extend encoding handling in fast export/import Elijah Newren
                   ` (3 preceding siblings ...)
  2019-04-25 15:51 ` [PATCH 4/5] fast-export: differentiate between explicitly utf-8 and implicitly utf-8 Elijah Newren
@ 2019-04-25 15:51 ` Elijah Newren
  2019-04-25 15:55 ` [PATCH 0/5] Fix and extend encoding handling in fast export/import Elijah Newren
  5 siblings, 0 replies; 13+ messages in thread
From: Elijah Newren @ 2019-04-25 15:51 UTC (permalink / raw)
  To: git; +Cc: gitster, johannes.schindelin, Elijah Newren

Automatic re-encoding of commit messages (and dropping of the encoding
header) hurts attempts to do reversible history rewrites (e.g. sha1sum
<-> sha256sum transitions, some subtree rewrites), and seems
inconsistent with the general principle followed elsewhere in
fast-export of requiring explicit user requests to modify the output
(e.g. --signed-tags=strip, --tag-of-filtered-object=rewrite).  Add a
--reencode flag that the user can use to specify, and like other
fast-export flags, default it to 'abort'.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 builtin/fast-export.c  | 35 ++++++++++++++++++++++++++++++++---
 t/t9350-fast-export.sh | 31 ++++++++++++++++++++++++++++---
 2 files changed, 60 insertions(+), 6 deletions(-)

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 66331fa401..43cc52331c 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -33,6 +33,7 @@ static const char *fast_export_usage[] = {
 static int progress;
 static enum { SIGNED_TAG_ABORT, VERBATIM, WARN, WARN_STRIP, STRIP } signed_tag_mode = SIGNED_TAG_ABORT;
 static enum { TAG_FILTERING_ABORT, DROP, REWRITE } tag_of_filtered_mode = TAG_FILTERING_ABORT;
+static enum { REENCODE_ABORT, REENCODE_PLEASE, REENCODE_NEVER } reencode_mode = REENCODE_ABORT;
 static int fake_missing_tagger;
 static int use_done_feature;
 static int no_data;
@@ -77,6 +78,20 @@ static int parse_opt_tag_of_filtered_mode(const struct option *opt,
 	return 0;
 }
 
+static int parse_opt_reencode_mode(const struct option *opt,
+				   const char *arg, int unset)
+{
+	if (unset || !strcmp(arg, "abort"))
+		reencode_mode = REENCODE_ABORT;
+	else if (!strcmp(arg, "yes"))
+		reencode_mode = REENCODE_PLEASE;
+	else if (!strcmp(arg, "no"))
+		reencode_mode = REENCODE_NEVER;
+	else
+		return error("Unknown reencoding mode: %s", arg);
+	return 0;
+}
+
 static struct decoration idnums;
 static uint32_t last_idnum;
 
@@ -633,10 +648,21 @@ static void handle_commit(struct commit *commit, struct rev_info *rev,
 	}
 
 	mark_next_object(&commit->object);
-	if (anonymize)
+	if (anonymize) {
 		reencoded = anonymize_commit_message(message);
-	else if (!is_encoding_utf8(encoding))
-		reencoded = reencode_string(message, "UTF-8", encoding);
+	} else if (encoding) {
+		switch(reencode_mode) {
+		case REENCODE_PLEASE:
+			reencoded = reencode_string(message, "UTF-8", encoding);
+			break;
+		case REENCODE_NEVER:
+			break;
+		case REENCODE_ABORT:
+			die("Encountered commit-specific encoding %s in commit "
+			    "%s; use --reencode=<mode> to handle it",
+			    encoding, oid_to_hex(&commit->object.oid));
+		}
+	}
 	if (!commit->parents)
 		printf("reset %s\n", refname);
 	printf("commit %s\nmark :%"PRIu32"\n", refname, last_idnum);
@@ -1091,6 +1117,9 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix)
 		OPT_CALLBACK(0, "tag-of-filtered-object", &tag_of_filtered_mode, N_("mode"),
 			     N_("select handling of tags that tag filtered objects"),
 			     parse_opt_tag_of_filtered_mode),
+		OPT_CALLBACK(0, "reencode", &reencode_mode, N_("mode"),
+			     N_("select handling of commit messages in an alternate encoding"),
+			     parse_opt_reencode_mode),
 		OPT_STRING(0, "export-marks", &export_filename, N_("file"),
 			     N_("Dump marks to this file")),
 		OPT_STRING(0, "import-marks", &import_filename, N_("file"),
diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
index 975c8c4014..4774926bb6 100755
--- a/t/t9350-fast-export.sh
+++ b/t/t9350-fast-export.sh
@@ -94,7 +94,7 @@ test_expect_success 'fast-export --show-original-ids | git fast-import' '
 	test $MUSS = $(git rev-parse --verify refs/tags/muss)
 '
 
-test_expect_success 'iso-8859-7' '
+test_expect_success 'reencoding iso-8859-7' '
 
 	test_when_finished "git reset --hard HEAD~1" &&
 	test_when_finished "git config --unset i18n.commitencoding" &&
@@ -102,7 +102,7 @@ test_expect_success 'iso-8859-7' '
 	test_tick &&
 	echo rosten >file &&
 	git commit -s -m "$(printf "Pi: \360")" file &&
-	git fast-export wer^..wer >iso-8859-7.fi &&
+	git fast-export --reencode=yes wer^..wer >iso-8859-7.fi &&
 	sed "s/wer/i18n/" iso-8859-7.fi |
 		(cd new &&
 		 git fast-import &&
@@ -110,6 +110,31 @@ test_expect_success 'iso-8859-7' '
 		 grep $(printf "\317\200") actual)
 '
 
+test_expect_success 'aborting on iso-8859-7' '
+
+	test_when_finished "git reset --hard HEAD~1" &&
+	test_when_finished "git config --unset i18n.commitencoding" &&
+	git config i18n.commitencoding iso-8859-7 &&
+	echo rosten >file &&
+	git commit -s -m "$(printf "Pi: \360")" file &&
+	test_must_fail git fast-export --reencode=abort wer^..wer >iso-8859-7.fi
+'
+
+test_expect_success 'preserving iso-8859-7' '
+
+	test_when_finished "git reset --hard HEAD~1" &&
+	test_when_finished "git config --unset i18n.commitencoding" &&
+	git config i18n.commitencoding iso-8859-7 &&
+	echo rosten >file &&
+	git commit -s -m "$(printf "Pi: \360")" file &&
+	git fast-export --reencode=no wer^..wer >iso-8859-7.fi &&
+	sed "s/wer/i18n-no-recoding/" iso-8859-7.fi |
+		(cd new &&
+		 git fast-import &&
+		 git cat-file commit i18n-no-recoding >actual &&
+		 grep $(printf "\360") actual)
+'
+
 test_expect_success 'encoding preserved if reencoding fails' '
 
 	test_when_finished "git reset --hard HEAD~1" &&
@@ -117,7 +142,7 @@ test_expect_success 'encoding preserved if reencoding fails' '
 	git config i18n.commitencoding iso-8859-7 &&
 	echo rosten >file &&
 	git commit -s -m "$(printf "Pi: \360; Invalid: \377")" file &&
-	git fast-export wer^..wer >iso-8859-7.fi &&
+	git fast-export --reencode=yes wer^..wer >iso-8859-7.fi &&
 	sed "s/wer/i18n-invalid/" iso-8859-7.fi |
 		(cd new &&
 		 git fast-import &&
-- 
2.21.0.779.g2f4b9c5032


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/5] Fix and extend encoding handling in fast export/import
  2019-04-25 15:51 [PATCH 0/5] Fix and extend encoding handling in fast export/import Elijah Newren
                   ` (4 preceding siblings ...)
  2019-04-25 15:51 ` [PATCH 5/5] fast-export: do automatic reencoding of commit messages only if requested Elijah Newren
@ 2019-04-25 15:55 ` Elijah Newren
  2019-04-25 15:57   ` Elijah Newren
  2019-04-26 21:32   ` brian m. carlson
  5 siblings, 2 replies; 13+ messages in thread
From: Elijah Newren @ 2019-04-25 15:55 UTC (permalink / raw)
  To: Git Mailing List; +Cc: Junio C Hamano, Johannes Schindelin, brian m. carlson

On Thu, Apr 25, 2019 at 9:51 AM Elijah Newren <newren@gmail.com> wrote:
>
> While stress testing `git filter-repo`, I noticed an issue with
> encoding; further digging led to the fixes and features in this series.
> See the individual commit messages for details.

Whoops, forgot to cc Brian; I'm curious if my understanding is correct
about the sha256sum transition plans that the intent in the short term
is using fast-export & fast-import to transition to-and-from a
sha256sum repo on the fly; if so, I believe that transition work
should use the new --reencode=yes option in patch five.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/5] Fix and extend encoding handling in fast export/import
  2019-04-25 15:55 ` [PATCH 0/5] Fix and extend encoding handling in fast export/import Elijah Newren
@ 2019-04-25 15:57   ` Elijah Newren
  2019-04-26 21:32   ` brian m. carlson
  1 sibling, 0 replies; 13+ messages in thread
From: Elijah Newren @ 2019-04-25 15:57 UTC (permalink / raw)
  To: Git Mailing List; +Cc: Junio C Hamano, Johannes Schindelin, brian m. carlson

On Thu, Apr 25, 2019 at 9:55 AM Elijah Newren <newren@gmail.com> wrote:
>
> On Thu, Apr 25, 2019 at 9:51 AM Elijah Newren <newren@gmail.com> wrote:
> >
> > While stress testing `git filter-repo`, I noticed an issue with
> > encoding; further digging led to the fixes and features in this series.
> > See the individual commit messages for details.
>
> Whoops, forgot to cc Brian; I'm curious if my understanding is correct
> about the sha256sum transition plans that the intent in the short term
> is using fast-export & fast-import to transition to-and-from a
> sha256sum repo on the fly; if so, I believe that transition work
> should use the new --reencode=yes option in patch five.

I seem to be struggling with distractions this morning; I mean the
`--reencode=no` option from patch 5/5.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/5] t9350: fix encoding test to actually test reencoding
  2019-04-25 15:51 ` [PATCH 1/5] t9350: fix encoding test to actually test reencoding Elijah Newren
@ 2019-04-25 17:52   ` Eric Sunshine
  2019-04-26  2:40     ` Junio C Hamano
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Sunshine @ 2019-04-25 17:52 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Git List, Junio C Hamano, Johannes Schindelin

On Thu, Apr 25, 2019 at 11:51 AM Elijah Newren <newren@gmail.com> wrote:
> This test used an author with non-ascii characters in the name, but
> no special commit message.  It then grep'ed for those non-ascii
> characters, but those are guaranteed to exist regardless of the
> reencoding process since the reencoding only affects the commit message,
> not the author or committer names.  As such, the test would work even if
> the re-encoding process simply stripped the commit message entirely.
> Modify the test to actually check that the reencoding in utf-8 worked.
>
> Signed-off-by: Elijah Newren <newren@gmail.com>
> ---
> diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
> @@ -94,22 +94,22 @@ test_expect_success 'fast-export --show-original-ids | git fast-import' '
> +test_expect_success 'iso-8859-7' '
> +       test_when_finished "git config --unset i18n.commitencoding" &&
> +       git config i18n.commitencoding iso-8859-7 &&

Aren't these two lines are pretty much equivalent to this single line?

    test_config i18n.commitencoding iso-8859-7 &&

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/5] fast-import: support 'encoding' commit header
  2019-04-25 15:51 ` [PATCH 2/5] fast-import: support 'encoding' commit header Elijah Newren
@ 2019-04-25 19:36   ` Eric Sunshine
  2019-04-26 11:39     ` Elijah Newren
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Sunshine @ 2019-04-25 19:36 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Git List, Junio C Hamano, Johannes Schindelin

On Thu, Apr 25, 2019 at 11:51 AM Elijah Newren <newren@gmail.com> wrote:
> Since git supports commit messages with an encoding other than utf-8,
> allow fast-import to import such commits.  This may be useful for folks
> who do not want to reencode commit messages from an external system, and
> may also be useful to achieve reversible history rewrites (e.g. sha1sum
> <-> sha256sum transitions or subtree work) with git repositories that
> have used specialized encodings in their commit history.
>
> Signed-off-by: Elijah Newren <newren@gmail.com>
> ---
> diff --git a/fast-import.c b/fast-import.c
> @@ -2607,6 +2608,9 @@ static void parse_new_commit(const char *arg)
>         if (!committer)
>                 die("Expected committer but didn't get one");
> +       if (skip_prefix(command_buf.buf, "encoding ", &encoding)) {
> +               read_next_command();
> +       }

Style nit: unnecessary braces

> @@ -2670,9 +2674,13 @@ static void parse_new_commit(const char *arg)
>         strbuf_addf(&new_data,
>                 "author %s\n"
> -               "committer %s\n"
> -               "\n",
> +               "committer %s\n",
>                 author ? author : committer, committer);
> +       if (encoding)
> +               strbuf_addf(&new_data,
> +                       "encoding %s\n",
> +                       encoding);
> +       strbuf_addf(&new_data, "\n");

Alternately:

    strbuf_addch(&new_data, '\n');

> diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
> @@ -3299,4 +3299,24 @@ test_expect_success !MINGW 'W: get-mark & empty orphan commit with erroneous thi
> +test_expect_success 'X: handling encoding' '
> +       test_tick &&
> +       [...]
> +       git cat-file -p encoding | grep $(printf "\360") &&
> +       git log -1 --format=%B encoding | grep $(printf "\317\200")

This script is already full of instances of Git commands upstream of
pipes, so this usage is consistent (despite recent work to eliminate
such situations). Okay.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/5] t9350: fix encoding test to actually test reencoding
  2019-04-25 17:52   ` Eric Sunshine
@ 2019-04-26  2:40     ` Junio C Hamano
  0 siblings, 0 replies; 13+ messages in thread
From: Junio C Hamano @ 2019-04-26  2:40 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: Elijah Newren, Git List, Johannes Schindelin

Eric Sunshine <sunshine@sunshineco.com> writes:

> On Thu, Apr 25, 2019 at 11:51 AM Elijah Newren <newren@gmail.com> wrote:
>> This test used an author with non-ascii characters in the name, but
>> no special commit message.  It then grep'ed for those non-ascii
>> characters, but those are guaranteed to exist regardless of the
>> reencoding process since the reencoding only affects the commit message,
>> not the author or committer names.  As such, the test would work even if
>> the re-encoding process simply stripped the commit message entirely.
>> Modify the test to actually check that the reencoding in utf-8 worked.
>>
>> Signed-off-by: Elijah Newren <newren@gmail.com>
>> ---
>> diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
>> @@ -94,22 +94,22 @@ test_expect_success 'fast-export --show-original-ids | git fast-import' '
>> +test_expect_success 'iso-8859-7' '
>> +       test_when_finished "git config --unset i18n.commitencoding" &&
>> +       git config i18n.commitencoding iso-8859-7 &&
>
> Aren't these two lines are pretty much equivalent to this single line?
>
>     test_config i18n.commitencoding iso-8859-7 &&

Yes.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/5] fast-import: support 'encoding' commit header
  2019-04-25 19:36   ` Eric Sunshine
@ 2019-04-26 11:39     ` Elijah Newren
  0 siblings, 0 replies; 13+ messages in thread
From: Elijah Newren @ 2019-04-26 11:39 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: Git List, Junio C Hamano, Johannes Schindelin

Hi Eric,

On Thu, Apr 25, 2019 at 1:37 PM Eric Sunshine <sunshine@sunshineco.com> wrote:
>
> On Thu, Apr 25, 2019 at 11:51 AM Elijah Newren <newren@gmail.com> wrote:
> > Since git supports commit messages with an encoding other than utf-8,
> > allow fast-import to import such commits.  This may be useful for folks
> > who do not want to reencode commit messages from an external system, and
> > may also be useful to achieve reversible history rewrites (e.g. sha1sum
> > <-> sha256sum transitions or subtree work) with git repositories that
> > have used specialized encodings in their commit history.
> >
> > Signed-off-by: Elijah Newren <newren@gmail.com>
> > ---
> > diff --git a/fast-import.c b/fast-import.c
> > @@ -2607,6 +2608,9 @@ static void parse_new_commit(const char *arg)
> >         if (!committer)
> >                 die("Expected committer but didn't get one");
> > +       if (skip_prefix(command_buf.buf, "encoding ", &encoding)) {
> > +               read_next_command();
> > +       }
>
> Style nit: unnecessary braces
>
> > @@ -2670,9 +2674,13 @@ static void parse_new_commit(const char *arg)
> >         strbuf_addf(&new_data,
> >                 "author %s\n"
> > -               "committer %s\n"
> > -               "\n",
> > +               "committer %s\n",
> >                 author ? author : committer, committer);
> > +       if (encoding)
> > +               strbuf_addf(&new_data,
> > +                       "encoding %s\n",
> > +                       encoding);
> > +       strbuf_addf(&new_data, "\n");
>
> Alternately:
>
>     strbuf_addch(&new_data, '\n');

Thanks for taking a look.  I'll fix both of these items you
highlighted and the test_config item you pointed out in 1/5 in the
next re-roll.

> > diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
> > @@ -3299,4 +3299,24 @@ test_expect_success !MINGW 'W: get-mark & empty orphan commit with erroneous thi
> > +test_expect_success 'X: handling encoding' '
> > +       test_tick &&
> > +       [...]
> > +       git cat-file -p encoding | grep $(printf "\360") &&
> > +       git log -1 --format=%B encoding | grep $(printf "\317\200")
>
> This script is already full of instances of Git commands upstream of
> pipes, so this usage is consistent (despite recent work to eliminate
> such situations). Okay.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/5] Fix and extend encoding handling in fast export/import
  2019-04-25 15:55 ` [PATCH 0/5] Fix and extend encoding handling in fast export/import Elijah Newren
  2019-04-25 15:57   ` Elijah Newren
@ 2019-04-26 21:32   ` brian m. carlson
  1 sibling, 0 replies; 13+ messages in thread
From: brian m. carlson @ 2019-04-26 21:32 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Git Mailing List, Junio C Hamano, Johannes Schindelin

[-- Attachment #1: Type: text/plain, Size: 1259 bytes --]

On Thu, Apr 25, 2019 at 09:55:11AM -0600, Elijah Newren wrote:
> On Thu, Apr 25, 2019 at 9:51 AM Elijah Newren <newren@gmail.com> wrote:
> >
> > While stress testing `git filter-repo`, I noticed an issue with
> > encoding; further digging led to the fixes and features in this series.
> > See the individual commit messages for details.
> 
> Whoops, forgot to cc Brian; I'm curious if my understanding is correct
> about the sha256sum transition plans that the intent in the short term
> is using fast-export & fast-import to transition to-and-from a
> sha256sum repo on the fly; if so, I believe that transition work
> should use the new --reencode=yes option in patch five.

The plan is to convert using fast-import and fast-export, yes, but
on the fly, no. You'll convert your repository up front using
fast-import and fast-export and then conversion will happen on the fly
as needed internally. The latter is a thing I'm working on.

So individual users will want to use the --reencode option, but
internally we probably won't get as far as actually decoding most of the
commit object, so we'll keep the bytes in place.

I do appreciate the CC, though.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 868 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2019-04-26 21:32 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-25 15:51 [PATCH 0/5] Fix and extend encoding handling in fast export/import Elijah Newren
2019-04-25 15:51 ` [PATCH 1/5] t9350: fix encoding test to actually test reencoding Elijah Newren
2019-04-25 17:52   ` Eric Sunshine
2019-04-26  2:40     ` Junio C Hamano
2019-04-25 15:51 ` [PATCH 2/5] fast-import: support 'encoding' commit header Elijah Newren
2019-04-25 19:36   ` Eric Sunshine
2019-04-26 11:39     ` Elijah Newren
2019-04-25 15:51 ` [PATCH 3/5] fast-export: avoid stripping encoding header if we cannot reencode Elijah Newren
2019-04-25 15:51 ` [PATCH 4/5] fast-export: differentiate between explicitly utf-8 and implicitly utf-8 Elijah Newren
2019-04-25 15:51 ` [PATCH 5/5] fast-export: do automatic reencoding of commit messages only if requested Elijah Newren
2019-04-25 15:55 ` [PATCH 0/5] Fix and extend encoding handling in fast export/import Elijah Newren
2019-04-25 15:57   ` Elijah Newren
2019-04-26 21:32   ` brian m. carlson

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).