From: Elijah Newren <newren@gmail.com>
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org, "Eric Sunshine" <sunshine@sunshineco.com>,
"Johannes Schindelin" <Johannes.Schindelin@gmx.de>,
"Johannes Sixt" <j6t@kdbg.org>,
"Torsten Bögershausen" <tboegi@web.de>,
"Elijah Newren" <newren@gmail.com>
Subject: [PATCH v4 0/5] Fix and extend encoding handling in fast export/import
Date: Mon, 13 May 2019 09:47:17 -0700
Message-ID: <20190513164722.31534-1-newren@gmail.com>
In-Reply-To: <20190510205335.19968-1-newren@gmail.com>
While stress testing `git filter-repo`, I noticed an issue with
encoding; further digging led to the fixes and features in this series.
See the individual commit messages for details.
Changes since v3 (full range-diff below):
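* renamed REENCODE_PLEASE/REENCODE_NEVER to REENCODE_YES/REENCODE_NO, as
  suggested by Torsten
* accepted "true"/"on" and "false"/"off" as synonyms for "yes"/"no" in
  --reencode, as suggested by Junio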
* check for the exact expected special bytes, in addition to the size
  (Dscho pointed out that it was Git for Windows that munged bytes, not
  Windows itself, so while I need to be careful in what I pass to git,
  printf and grep can work directly with the special bytes)
* also checked on gitgitgadget [1] that it passes on the major platforms
[1] https://github.com/gitgitgadget/git/pull/191
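
As a rough illustration of the size and byte checks mentioned in the list
above, here is a minimal shell sketch. It only uses printf, wc and grep and
does not touch a repository. The variable names are illustrative, and 240 is
the size of the un-reencoded test commit quoted in the test comments of this
series.

```shell
#!/bin/sh
# Sketch only: variable names are illustrative, sizes come from the
# test comments in this series.
orig_size=240

# Length of the "encoding iso-8859-7\n" header that is stripped when
# the message is reencoded.  $(( ... )) strips any padding from wc.
header_len=$(( $(printf 'encoding iso-8859-7\n' | wc -c) ))

# Pi is one byte (\360) in iso-8859-7 and two bytes (\317\200) in utf-8.
pi_latin_len=$(( $(printf '\360' | wc -c) ))
pi_utf8_len=$(( $(printf '\317\200' | wc -c) ))

reencoded_size=$(( orig_size - header_len + pi_utf8_len - pi_latin_len ))
echo "header=$header_len reencoded=$reencoded_size"

# printf can hand the raw bytes to grep directly; LC_ALL=C keeps grep
# from interpreting them according to the current locale.
if printf 'pi: \317\200\n' | LC_ALL=C grep -q "$(printf '\317\200')"
then
	echo "utf-8 pi found"
fi
```

This prints "header=20 reencoded=221", matching the 221 that the reencoding
test expects, followed by "utf-8 pi found".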
Elijah Newren (5):
  t9350: fix encoding test to actually test reencoding
  fast-import: support 'encoding' commit header
  fast-export: avoid stripping encoding header if we cannot reencode
  fast-export: differentiate between explicitly utf-8 and implicitly
    utf-8
  fast-export: do automatic reencoding of commit messages only if
    requested
Documentation/git-fast-import.txt | 7 ++
builtin/fast-export.c | 44 +++++++++--
fast-import.c | 11 ++-
t/t9300-fast-import.sh | 20 +++++
t/t9350-fast-export.sh | 78 +++++++++++++++++---
t/t9350/broken-iso-8859-7-commit-message.txt | 1 +
t/t9350/simple-iso-8859-7-commit-message.txt | 1 +
7 files changed, 145 insertions(+), 17 deletions(-)
create mode 100644 t/t9350/broken-iso-8859-7-commit-message.txt
create mode 100644 t/t9350/simple-iso-8859-7-commit-message.txt
Range-diff:
1: 2d7bb64acf ! 1: 37a68a0ffd t9350: fix encoding test to actually test reencoding
@@ -39,18 +39,16 @@
git fast-import &&
+ # The commit object, if not re-encoded, would be 240 bytes.
+ # Removing the "encoding iso-8859-7\n" header drops 20 bytes.
-+ # Re-encoding the Pi character from \xF0 in iso-8859-7 to
-+ # \xCF\x80 in utf-8 adds a byte. Grepping for specific bytes
-+ # would be nice, but Windows apparently munges user data
-+ # in the form of bytes on the command line to force them to
-+ # be characters instead, so we are limited for portability
-+ # reasons in subsequent similar tests in this file to check
-+ # for size rather than what bytes are present.
++ # Re-encoding the Pi character from \xF0 (\360) in iso-8859-7
++ # to \xCF\x80 (\317\200) in utf-8 adds a byte. Check for
++ # the expected size.
+ test 221 -eq "$(git cat-file -s i18n)" &&
-+ # Also make sure the commit does not have the "encoding" header
++ # ...and for the expected translation of bytes.
git cat-file commit i18n >actual &&
- grep "Áéí óú" actual)
-
++ grep $(printf "\317\200") actual &&
++ # Also make sure the commit does not have the "encoding" header
+ ! grep ^encoding actual)
'
+
2: 9fa5695017 = 2: 3d84f4613d fast-import: support 'encoding' commit header
3: dfc76573e9 ! 3: baa8394a3a fast-export: avoid stripping encoding header if we cannot reencode
@@ -49,10 +49,14 @@
+ (cd new &&
+ git fast-import &&
+ git cat-file commit i18n-invalid >actual &&
++ # Make sure the commit still has the encoding header
+ grep ^encoding actual &&
-+ # Also verify that the commit has the expected size; i.e.
++ # Verify that the commit has the expected size; i.e.
+ # that no bytes were re-encoded to a different encoding.
-+ test 252 -eq "$(git cat-file -s i18n-invalid)")
++ test 252 -eq "$(git cat-file -s i18n-invalid)" &&
++ # ...and check for the original special bytes
++ grep $(printf "\360") actual &&
++ grep $(printf "\377") actual)
+'
+
test_expect_success 'import/export-marks' '
4: 83b3656b76 = 4: 49960164c6 fast-export: differentiate between explicitly utf-8 and implicitly utf-8
5: 2063122293 ! 5: 571613a09e fast-export: do automatic reencoding of commit messages only if requested
@@ -20,7 +20,7 @@
static int progress;
static enum { SIGNED_TAG_ABORT, VERBATIM, WARN, WARN_STRIP, STRIP } signed_tag_mode = SIGNED_TAG_ABORT;
static enum { TAG_FILTERING_ABORT, DROP, REWRITE } tag_of_filtered_mode = TAG_FILTERING_ABORT;
-+static enum { REENCODE_ABORT, REENCODE_PLEASE, REENCODE_NEVER } reencode_mode = REENCODE_ABORT;
++static enum { REENCODE_ABORT, REENCODE_YES, REENCODE_NO } reencode_mode = REENCODE_ABORT;
static int fake_missing_tagger;
static int use_done_feature;
static int no_data;
@@ -33,10 +33,10 @@
+{
+ if (unset || !strcmp(arg, "abort"))
+ reencode_mode = REENCODE_ABORT;
-+ else if (!strcmp(arg, "yes"))
-+ reencode_mode = REENCODE_PLEASE;
-+ else if (!strcmp(arg, "no"))
-+ reencode_mode = REENCODE_NEVER;
++ else if (!strcmp(arg, "yes") || !strcmp(arg, "true") || !strcmp(arg, "on"))
++ reencode_mode = REENCODE_YES;
++ else if (!strcmp(arg, "no") || !strcmp(arg, "false") || !strcmp(arg, "off"))
++ reencode_mode = REENCODE_NO;
+ else
+ return error("Unknown reencoding mode: %s", arg);
+ return 0;
@@ -56,14 +56,14 @@
- reencoded = reencode_string(message, "UTF-8", encoding);
+ } else if (encoding) {
+ switch(reencode_mode) {
-+ case REENCODE_PLEASE:
++ case REENCODE_YES:
+ reencoded = reencode_string(message, "UTF-8", encoding);
+ break;
-+ case REENCODE_NEVER:
++ case REENCODE_NO:
+ break;
+ case REENCODE_ABORT:
+ die("Encountered commit-specific encoding %s in commit "
-+ "%s; use --reencode=<mode> to handle it",
++ "%s; use --reencode=[yes|no] to handle it",
+ encoding, oid_to_hex(&commit->object.oid));
+ }
+ }
@@ -126,13 +126,14 @@
+ git fast-import &&
+ # The commit object, if not re-encoded, is 240 bytes.
+ # Removing the "encoding iso-8859-7\n" header would drops 20
-+ # bytes. Re-encoding the Pi character from \xF0 in
-+ # iso-8859-7 to \xCF\x80 in utf-8 would add a byte. I would
-+ # grep for the # specific bytes, but Windows lamely does not
-+ # allow that, so just search for the expected size.
++ # bytes. Re-encoding the Pi character from \xF0 (\360) in
++ # iso-8859-7 to \xCF\x80 (\317\200) in utf-8 adds a byte.
++ # Check for the expected size...
+ test 240 -eq "$(git cat-file -s i18n-no-recoding)" &&
-+ # Also make sure the commit has the "encoding" header
++ # ...as well as the expected byte.
+ git cat-file commit i18n-no-recoding >actual &&
++ grep $(printf "\360") actual &&
++ # Also make sure the commit has the "encoding" header
+ grep ^encoding actual)
+'
+
--
2.21.0.782.g571613a09e
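
For readers skimming the range-diff of patch 5, the mapping from --reencode
arguments to modes can be sketched in shell as follows. This is a
hypothetical helper that mirrors the strcmp() chain shown above; it is not
part of git, and it does not model the "unset" case that the C code also
maps to abort.

```shell
#!/bin/sh
# Hypothetical helper, not part of git: maps a --reencode argument to
# the mode name used in the range-diff above.
reencode_mode () {
	case "$1" in
	abort)
		echo REENCODE_ABORT ;;
	yes|true|on)
		echo REENCODE_YES ;;
	no|false|off)
		echo REENCODE_NO ;;
	*)
		echo "Unknown reencoding mode: $1" >&2
		return 1 ;;
	esac
}

reencode_mode on
reencode_mode false
```

The two calls at the end print REENCODE_YES and REENCODE_NO; any other
spelling, such as "maybe", is rejected with a non-zero status.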