From: Elijah Newren <newren@gmail.com>
To: gitster@pobox.com
Cc: git@vger.kernel.org, Eric Sunshine <sunshine@sunshineco.com>,
	Johannes Schindelin <Johannes.Schindelin@gmx.de>,
	Johannes Sixt <j6t@kdbg.org>, Elijah Newren <newren@gmail.com>
Subject: [PATCH v3 0/5] Fix and extend encoding handling in fast export/import
Date: Fri, 10 May 2019 13:53:30 -0700
Message-ID: <20190510205335.19968-1-newren@gmail.com>
In-Reply-To: <20190430182523.3339-1-newren@gmail.com>


While stress testing `git filter-repo`, I noticed an issue with
encoding; further digging led to the fixes and features in this series.
See the individual commit messages for details.

Changes since v2 (full range-diff below):
  * Modified the test cases to pass on Windows[1], as verified via a
    gitgitgadget pull request[2].  This required adding a couple of new
    files (which store the desired bytes) and checking the size of the
    output instead of checking for particular bytes.  The expected byte
    sequences differ in length, so the size check still tells the cases
    apart.

[1] The failures of the previous patch set on Windows were noticed and
    reported by Dscho; Hannes explained that Windows munges users'
    command lines to force them to be characters instead of bytes.
[2] https://github.com/gitgitgadget/git/pull/187
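
The size-based checks can be reproduced with a little byte arithmetic.
The following is only a sketch: the 240-byte base size, the header text
and the byte values come from the test comments in the range-diff
below, while the variable names are illustrative.  The last figure also
assumes that the commit with the unconvertible message differs from the
simple one only in its message.

```shell
# Size of the commit object when it still carries the encoding header
# (figure quoted in the test comments).
raw_size=240
header='encoding iso-8859-7'
header_len=$(( ${#header} + 1 ))    # header text plus its trailing newline

# Byte length of the Pi character in each encoding.
pi_iso=$(( $(printf '\360' | wc -c) ))       # \xF0 in iso-8859-7
pi_utf8=$(( $(printf '\317\200' | wc -c) ))  # \xCF\x80 in utf-8

# Re-encoded commit: the header is dropped and Pi grows by one byte.
reencoded_size=$(( raw_size - header_len + pi_utf8 - pi_iso ))

# Commit whose message cannot be re-encoded: header and bytes are kept,
# so only the longer message changes the size (assumption, see above).
simple_len=$(( $(printf 'Pi: \360' | wc -c) ))
broken_len=$(( $(printf 'Pi: \360; Invalid: \377' | wc -c) ))
broken_size=$(( raw_size + broken_len - simple_len ))

echo "re-encoded: $reencoded_size"
echo "not re-encodable: $broken_size"
```

Under those assumptions the three outcomes (240 bytes untouched, 221
re-encoded, 252 for the unconvertible message) are all distinct, which
is why comparing sizes can stand in for grepping for specific bytes.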

Elijah Newren (5):
  t9350: fix encoding test to actually test reencoding
  fast-import: support 'encoding' commit header
  fast-export: avoid stripping encoding header if we cannot reencode
  fast-export: differentiate between explicitly utf-8 and implicitly
    utf-8
  fast-export: do automatic reencoding of commit messages only if
    requested

 Documentation/git-fast-import.txt            |  7 ++
 builtin/fast-export.c                        | 44 ++++++++++--
 fast-import.c                                | 11 ++-
 t/t9300-fast-import.sh                       | 20 ++++++
 t/t9350-fast-export.sh                       | 75 +++++++++++++++++---
 t/t9350/broken-iso-8859-7-commit-message.txt |  1 +
 t/t9350/simple-iso-8859-7-commit-message.txt |  1 +
 7 files changed, 142 insertions(+), 17 deletions(-)
 create mode 100644 t/t9350/broken-iso-8859-7-commit-message.txt
 create mode 100644 t/t9350/simple-iso-8859-7-commit-message.txt

Range-diff:
1:  9cc04242bd ! 1:  2d7bb64acf t9350: fix encoding test to actually test reencoding
    @@ -32,15 +32,26 @@
     -	git commit -s -m den file &&
     -	git fast-export wer^..wer >iso8859-1.fi &&
     -	sed "s/wer/i18n/" iso8859-1.fi |
    -+	git commit -s -m "$(printf "Pi: \360")" file &&
    ++	git commit -s -F "$TEST_DIRECTORY/t9350/simple-iso-8859-7-commit-message.txt" file &&
     +	git fast-export wer^..wer >iso-8859-7.fi &&
     +	sed "s/wer/i18n/" iso-8859-7.fi |
      		(cd new &&
      		 git fast-import &&
    ++		 # The commit object, if not re-encoded, would be 240 bytes.
    ++		 # Removing the "encoding iso-8859-7\n" header drops 20 bytes.
    ++		 # Re-encoding the Pi character from \xF0 in iso-8859-7 to
    ++		 # \xCF\x80 in utf-8 adds a byte.  Grepping for specific bytes
    ++		 # would be nice, but Windows apparently munges user data
    ++		 # in the form of bytes on the command line to force them to
    ++		 # be characters instead, so we are limited for portability
    ++		 # reasons in subsequent similar tests in this file to check
    ++		 # for size rather than what bytes are present.
    ++		 test 221 -eq "$(git cat-file -s i18n)" &&
    ++		 # Also make sure the commit does not have the "encoding" header
      		 git cat-file commit i18n >actual &&
     -		 grep "Áéí óú" actual)
     -
    -+		 grep $(printf "\317\200") actual)
    ++		 ! grep ^encoding actual)
      '
     +
      test_expect_success 'import/export-marks' '
    @@ -54,3 +65,11 @@
      	git checkout -b copy rein &&
      	git mv file file3 &&
      	git commit -m move1 &&
    +
    + diff --git a/t/t9350/simple-iso-8859-7-commit-message.txt b/t/t9350/simple-iso-8859-7-commit-message.txt
    + new file mode 100644
    + --- /dev/null
    + +++ b/t/t9350/simple-iso-8859-7-commit-message.txt
    +@@
    ++Pi: ð
    + \ No newline at end of file
2:  0cd023ac7a = 2:  9fa5695017 fast-import: support 'encoding' commit header
3:  1fddf51402 ! 3:  dfc76573e9 fast-export: avoid stripping encoding header if we cannot reencode
    @@ -35,7 +35,7 @@
      --- a/t/t9350-fast-export.sh
      +++ b/t/t9350-fast-export.sh
     @@
    - 		 grep $(printf "\317\200") actual)
    + 		 ! grep ^encoding actual)
      '
      
     +test_expect_success 'encoding preserved if reencoding fails' '
    @@ -43,15 +43,26 @@
     +	test_when_finished "git reset --hard HEAD~1" &&
     +	test_config i18n.commitencoding iso-8859-7 &&
     +	echo rosten >file &&
    -+	git commit -s -m "$(printf "Pi: \360; Invalid: \377")" file &&
    ++	git commit -s -F "$TEST_DIRECTORY/t9350/broken-iso-8859-7-commit-message.txt" file &&
     +	git fast-export wer^..wer >iso-8859-7.fi &&
     +	sed "s/wer/i18n-invalid/" iso-8859-7.fi |
     +		(cd new &&
     +		 git fast-import &&
     +		 git cat-file commit i18n-invalid >actual &&
    -+		 grep ^encoding actual)
    ++		 grep ^encoding actual &&
    ++		 # Also verify that the commit has the expected size; i.e.
    ++		 # that no bytes were re-encoded to a different encoding.
    ++		 test 252 -eq "$(git cat-file -s i18n-invalid)")
     +'
     +
      test_expect_success 'import/export-marks' '
      
      	git checkout -b marks master &&
    +
    + diff --git a/t/t9350/broken-iso-8859-7-commit-message.txt b/t/t9350/broken-iso-8859-7-commit-message.txt
    + new file mode 100644
    + --- /dev/null
    + +++ b/t/t9350/broken-iso-8859-7-commit-message.txt
    +@@
    ++Pi: ð; Invalid: ÿ
    + \ No newline at end of file
4:  4a2e04b3ae = 4:  83b3656b76 fast-export: differentiate between explicitly utf-8 and implicitly utf-8
5:  44aacb1a0b ! 5:  2063122293 fast-export: do automatic reencoding of commit messages only if requested
    @@ -95,14 +95,14 @@
      	test_config i18n.commitencoding iso-8859-7 &&
      	test_tick &&
      	echo rosten >file &&
    - 	git commit -s -m "$(printf "Pi: \360")" file &&
    + 	git commit -s -F "$TEST_DIRECTORY/t9350/simple-iso-8859-7-commit-message.txt" file &&
     -	git fast-export wer^..wer >iso-8859-7.fi &&
     +	git fast-export --reencode=yes wer^..wer >iso-8859-7.fi &&
      	sed "s/wer/i18n/" iso-8859-7.fi |
      		(cd new &&
      		 git fast-import &&
     @@
    - 		 grep $(printf "\317\200") actual)
    + 		 ! grep ^encoding actual)
      '
      
     +test_expect_success 'aborting on iso-8859-7' '
    @@ -110,7 +110,7 @@
     +	test_when_finished "git reset --hard HEAD~1" &&
     +	test_config i18n.commitencoding iso-8859-7 &&
     +	echo rosten >file &&
    -+	git commit -s -m "$(printf "Pi: \360")" file &&
    ++	git commit -s -F "$TEST_DIRECTORY/t9350/simple-iso-8859-7-commit-message.txt" file &&
     +	test_must_fail git fast-export --reencode=abort wer^..wer >iso-8859-7.fi
     +'
     +
    @@ -119,13 +119,21 @@
     +	test_when_finished "git reset --hard HEAD~1" &&
     +	test_config i18n.commitencoding iso-8859-7 &&
     +	echo rosten >file &&
    -+	git commit -s -m "$(printf "Pi: \360")" file &&
    ++	git commit -s -F "$TEST_DIRECTORY/t9350/simple-iso-8859-7-commit-message.txt" file &&
     +	git fast-export --reencode=no wer^..wer >iso-8859-7.fi &&
     +	sed "s/wer/i18n-no-recoding/" iso-8859-7.fi |
     +		(cd new &&
     +		 git fast-import &&
    ++		 # The commit object, if not re-encoded, is 240 bytes.
    ++		 # Removing the "encoding iso-8859-7\n" header would drop 20
    ++		 # bytes.  Re-encoding the Pi character from \xF0 in
    ++		 # iso-8859-7 to \xCF\x80 in utf-8 would add a byte.  I would
    ++		 # grep for the specific bytes, but Windows lamely does not
    ++		 # allow that, so just search for the expected size.
    ++		 test 240 -eq "$(git cat-file -s i18n-no-recoding)" &&
    ++		 # Also make sure the commit has the "encoding" header
     +		 git cat-file commit i18n-no-recoding >actual &&
    -+		 grep $(printf "\360") actual)
    ++		 grep ^encoding actual)
     +'
     +
      test_expect_success 'encoding preserved if reencoding fails' '
    @@ -133,7 +141,7 @@
      	test_when_finished "git reset --hard HEAD~1" &&
      	test_config i18n.commitencoding iso-8859-7 &&
      	echo rosten >file &&
    - 	git commit -s -m "$(printf "Pi: \360; Invalid: \377")" file &&
    + 	git commit -s -F "$TEST_DIRECTORY/t9350/broken-iso-8859-7-commit-message.txt" file &&
     -	git fast-export wer^..wer >iso-8859-7.fi &&
     +	git fast-export --reencode=yes wer^..wer >iso-8859-7.fi &&
      	sed "s/wer/i18n-invalid/" iso-8859-7.fi |

-- 
2.21.0.782.g2063122293

