From: lars.schneider@autodesk.com
To: git@vger.kernel.org
Cc: gitster@pobox.com, tboegi@web.de, j6t@kdbg.org,
sunshine@sunshineco.com, peff@peff.net,
ramsay@ramsayjones.plus.com, Johannes.Schindelin@gmx.de,
Lars Schneider <larsxschneider@gmail.com>
Subject: [PATCH v10 0/9] convert: add support for different encodings
Date: Wed, 7 Mar 2018 18:30:17 +0100 [thread overview]
Message-ID: <20180307173026.30058-1-lars.schneider@autodesk.com> (raw)
From: Lars Schneider <larsxschneider@gmail.com>
Hi,
Patches 1-5,8 are preparation and helper functions. Patch 3 is new.
Patch 6,7,9 are the actual change.
This series depends on Torsten's 8462ff43e4 (convert_to_git():
safe_crlf/checksafe becomes int conv_flags, 2018-01-13) which is
already in master.
Changes since v9:
* make has_bom_prefix() / is_missing_required_utf_bom() more lenient in
what they accept (ignore casing, accept UTF?? and UTF-?? , Junio)
* replace memcmp() which does not check the length of the strings with
a case insensitive variant of starts_with() (Junio)
* do not convert encoding names to uppercase
(this fixes a leak introduced in the last iteration, Eric)
* do not cleanup test files that the test did not create (Eric)
* do not cleanup err.out files in tests (Eric)
I did not address Eric's feedback to make validate_encoding()
cleaner [1] as I want to stabilize the series and Eric wrote
that we can clean this up later:
http://public-inbox.org/git/CAPig+cSoka-yBTYBz42JGQTyCH7LDWnToeOvdZfG0_64o9QnBQ@mail.gmail.com
Thanks,
Lars
RFC: https://public-inbox.org/git/BDB9B884-6D17-4BE3-A83C-F67E2AFA2B46@gmail.com/
v1: https://public-inbox.org/git/20171211155023.1405-1-lars.schneider@autodesk.com/
v2: https://public-inbox.org/git/20171229152222.39680-1-lars.schneider@autodesk.com/
v3: https://public-inbox.org/git/20180106004808.77513-1-lars.schneider@autodesk.com/
v4: https://public-inbox.org/git/20180120152418.52859-1-lars.schneider@autodesk.com/
v5: https://public-inbox.org/git/20180129201855.9182-1-tboegi@web.de/
v6: https://public-inbox.org/git/20180209132830.55385-1-lars.schneider@autodesk.com/
v7: https://public-inbox.org/git/20180215152711.158-1-lars.schneider@autodesk.com/
v8: https://public-inbox.org/git/20180224162801.98860-1-lars.schneider@autodesk.com/
v9: https://public-inbox.org/git/20180304201418.60958-1-lars.schneider@autodesk.com/
Base Ref:
Web-Diff: https://github.com/larsxschneider/git/commit/a602b8dcef
Checkout: git fetch https://github.com/larsxschneider/git encoding-v10 && git checkout a602b8dcef
### Interdiff (v9..v10):
diff --git a/convert.c b/convert.c
index 6cbb2b2618..e861f1abbc 100644
--- a/convert.c
+++ b/convert.c
@@ -269,7 +269,8 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
static int validate_encoding(const char *path, const char *enc,
const char *data, size_t len, int die_on_error)
{
- if (!memcmp("UTF-", enc, 4)) {
+ /* We only check for UTF here as UTF?? can be an alias for UTF-?? */
+ if (startscase_with(enc, "UTF")) {
/*
* Check for detectable errors in UTF encodings
*/
@@ -277,16 +278,18 @@ static int validate_encoding(const char *path, const char *enc,
const char *error_msg = _(
"BOM is prohibited in '%s' if encoded as %s");
/*
- * This advice is shown for UTF-??BE and UTF-??LE
- * encodings. We truncate the encoding name to 6
- * chars with %.6s to cut off the last two "byte
- * order" characters.
+ * This advice is shown for UTF-??BE and UTF-??LE encodings.
+ * We cut off the last two characters of the encoding name
+ # to generate the encoding name suitable for BOMs.
*/
const char *advise_msg = _(
"The file '%s' contains a byte order "
- "mark (BOM). Please use %.6s as "
+ "mark (BOM). Please use %s as "
"working-tree-encoding.");
- advise(advise_msg, path, enc);
+ char *upper_enc = xstrdup_toupper(enc);
+ upper_enc[strlen(upper_enc)-2] = '\0';
+ advise(advise_msg, path, upper_enc);
+ free(upper_enc);
if (die_on_error)
die(error_msg, path, enc);
else {
@@ -301,7 +304,9 @@ static int validate_encoding(const char *path, const char *enc,
"mark (BOM). Please use %sBE or %sLE "
"(depending on the byte order) as "
"working-tree-encoding.");
- advise(advise_msg, path, enc, enc);
+ char *upper_enc = xstrdup_toupper(enc);
+ advise(advise_msg, path, upper_enc, upper_enc);
+ free(upper_enc);
if (die_on_error)
die(error_msg, path, enc);
else {
@@ -1216,11 +1221,7 @@ static const char *git_path_check_encoding(struct attr_check_item *check)
if (!strcasecmp(value, default_encoding))
return NULL;
- /*
- * Ensure encoding names are always upper case (e.g. UTF-8) to
- * simplify subsequent string comparisons.
- */
- return xstrdup_toupper(value);
+ return value;
}
static enum crlf_action git_path_check_crlf(struct attr_check_item *check)
diff --git a/git-compat-util.h b/git-compat-util.h
index 68b2ad531e..f648da0c11 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -455,6 +455,7 @@ extern void (*get_warn_routine(void))(const char *warn, va_list params);
extern void set_die_is_recursing_routine(int (*routine)(void));
extern int starts_with(const char *str, const char *prefix);
+extern int startscase_with(const char *str, const char *prefix);
/*
* If the string "str" begins with the string found in "prefix", return 1.
diff --git a/strbuf.c b/strbuf.c
index b635f0bdc4..5779a2d591 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -11,6 +11,15 @@ int starts_with(const char *str, const char *prefix)
return 0;
}
+int startscase_with(const char *str, const char *prefix)
+{
+ for (; ; str++, prefix++)
+ if (!*prefix)
+ return 1;
+ else if (tolower(*str) != tolower(*prefix))
+ return 0;
+}
+
int skip_to_optional_arg_default(const char *str, const char *prefix,
const char **arg, const char *def)
{
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index 23e89ae623..7cff41a350 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -52,7 +52,7 @@ test_expect_success 're-encode to UTF-16 on checkout' '
'
test_expect_success 'check $GIT_DIR/info/attributes support' '
- test_when_finished "rm -f test.utf8.raw test.utf32.raw test.utf32.git" &&
+ test_when_finished "rm -f test.utf32.git" &&
test_when_finished "git reset --hard HEAD" &&
echo "*.utf32 text working-tree-encoding=utf-32" >.git/info/attributes &&
@@ -75,19 +75,19 @@ do
# In these cases the BOM is prohibited.
cp bebom.utf${i}be.raw bebom.utf${i}be &&
test_must_fail git add bebom.utf${i}be 2>err.out &&
- test_i18ngrep "fatal: BOM is prohibited .* UTF-${i}BE" err.out &&
+ test_i18ngrep "fatal: BOM is prohibited .* utf-${i}be" err.out &&
cp lebom.utf${i}le.raw lebom.utf${i}be &&
test_must_fail git add lebom.utf${i}be 2>err.out &&
- test_i18ngrep "fatal: BOM is prohibited .* UTF-${i}BE" err.out &&
+ test_i18ngrep "fatal: BOM is prohibited .* utf-${i}be" err.out &&
cp bebom.utf${i}be.raw bebom.utf${i}le &&
test_must_fail git add bebom.utf${i}le 2>err.out &&
- test_i18ngrep "fatal: BOM is prohibited .* UTF-${i}LE" err.out &&
+ test_i18ngrep "fatal: BOM is prohibited .* utf-${i}le" err.out &&
cp lebom.utf${i}le.raw lebom.utf${i}le &&
test_must_fail git add lebom.utf${i}le 2>err.out &&
- test_i18ngrep "fatal: BOM is prohibited .* UTF-${i}LE" err.out
+ test_i18ngrep "fatal: BOM is prohibited .* utf-${i}le" err.out
'
test_expect_success "check required UTF-${i} BOM" '
@@ -97,11 +97,11 @@ do
cp nobom.utf${i}be.raw nobom.utf${i} &&
test_must_fail git add nobom.utf${i} 2>err.out &&
- test_i18ngrep "fatal: BOM is required .* UTF-${i}" err.out &&
+ test_i18ngrep "fatal: BOM is required .* utf-${i}" err.out &&
cp nobom.utf${i}le.raw nobom.utf${i} &&
test_must_fail git add nobom.utf${i} 2>err.out &&
- test_i18ngrep "fatal: BOM is required .* UTF-${i}" err.out
+ test_i18ngrep "fatal: BOM is required .* utf-${i}" err.out
'
test_expect_success "eol conversion for UTF-${i} encoded files on checkout" '
@@ -141,7 +141,6 @@ do
done
test_expect_success 'check unsupported encodings' '
- test_when_finished "rm -f err.out" &&
test_when_finished "git reset --hard HEAD" &&
echo "*.nothing text working-tree-encoding=" >>.gitattributes &&
@@ -156,7 +155,6 @@ test_expect_success 'check unsupported encodings' '
test_expect_success 'error if encoding round trip is not the same during refresh' '
BEFORE_STATE=$(git rev-parse HEAD) &&
- test_when_finished "rm -f err.out" &&
test_when_finished "git reset --hard $BEFORE_STATE" &&
# Add and commit a UTF-16 file but skip the "working-tree-encoding"
@@ -176,7 +174,6 @@ test_expect_success 'error if encoding round trip is not the same during refresh
test_expect_success 'error if encoding garbage is already in Git' '
BEFORE_STATE=$(git rev-parse HEAD) &&
- test_when_finished "rm -f err.out" &&
test_when_finished "git reset --hard $BEFORE_STATE" &&
# Skip the UTF-16 filter for the added file
@@ -219,14 +216,14 @@ test_expect_success 'check roundtrip encoding' '
# ... unless we tell Git to check it!
GIT_TRACE=1 git -c core.checkRoundtripEncoding="UTF-16, UTF-32" \
add roundtrip.utf16 2>&1 |
- grep "Checking roundtrip encoding for UTF-16" &&
+ grep "Checking roundtrip encoding for utf-16" &&
git reset &&
# ... unless we tell Git to check it!
# (here we also check that the casing of the encoding is irrelevant)
GIT_TRACE=1 git -c core.checkRoundtripEncoding="UTF-32, utf-16" \
add roundtrip.utf16 2>&1 |
- grep "Checking roundtrip encoding for UTF-16" &&
+ grep "Checking roundtrip encoding for utf-16" &&
git reset
'
diff --git a/utf8.c b/utf8.c
index 5113d26e56..81c6678df1 100644
--- a/utf8.c
+++ b/utf8.c
@@ -552,11 +552,13 @@ static const char utf32_le_bom[] = {0xFF, 0xFE, 0x00, 0x00};
int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
{
return (
- (!strcmp(enc, "UTF-16BE") || !strcmp(enc, "UTF-16LE")) &&
+ (!strcasecmp(enc, "UTF-16BE") || !strcasecmp(enc, "UTF-16LE") ||
+ !strcasecmp(enc, "UTF16BE") || !strcasecmp(enc, "UTF16LE")) &&
(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
) || (
- (!strcmp(enc, "UTF-32BE") || !strcmp(enc, "UTF-32LE")) &&
+ (!strcasecmp(enc, "UTF-32BE") || !strcasecmp(enc, "UTF-32LE") ||
+ !strcasecmp(enc, "UTF32BE") || !strcasecmp(enc, "UTF32LE")) &&
(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
);
@@ -565,11 +567,11 @@ int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
int is_missing_required_utf_bom(const char *enc, const char *data, size_t len)
{
return (
- !strcmp(enc, "UTF-16") &&
+ (!strcasecmp(enc, "UTF-16") || !strcasecmp(enc, "UTF16")) &&
!(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
) || (
- !strcmp(enc, "UTF-32") &&
+ (!strcasecmp(enc, "UTF-32") || !strcasecmp(enc, "UTF32")) &&
!(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
);
### Patches
Lars Schneider (9):
strbuf: remove unnecessary NUL assignment in xstrdup_tolower()
strbuf: add xstrdup_toupper()
strbuf: add a case insensitive starts_with()
utf8: add function to detect prohibited UTF-16/32 BOM
utf8: add function to detect a missing UTF-16/32 BOM
convert: add 'working-tree-encoding' attribute
convert: check for detectable errors in UTF encodings
convert: add tracing for 'working-tree-encoding' attribute
convert: add round trip check based on 'core.checkRoundtripEncoding'
Documentation/config.txt | 6 +
Documentation/gitattributes.txt | 88 +++++++++++++
config.c | 5 +
convert.c | 268 ++++++++++++++++++++++++++++++++++++++-
convert.h | 2 +
environment.c | 1 +
git-compat-util.h | 1 +
sha1_file.c | 2 +-
strbuf.c | 22 +++-
strbuf.h | 1 +
t/t0028-working-tree-encoding.sh | 230 +++++++++++++++++++++++++++++++++
utf8.c | 39 ++++++
utf8.h | 28 ++++
13 files changed, 690 insertions(+), 3 deletions(-)
create mode 100755 t/t0028-working-tree-encoding.sh
base-commit: 8a2f0888555ce46ac87452b194dec5cb66fb1417
--
2.16.2
next reply other threads:[~2018-03-07 17:31 UTC|newest]
Thread overview: 32+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-03-07 17:30 lars.schneider [this message]
2018-03-07 17:30 ` [PATCH v10 1/9] strbuf: remove unnecessary NUL assignment in xstrdup_tolower() lars.schneider
2018-03-07 17:30 ` [PATCH v10 2/9] strbuf: add xstrdup_toupper() lars.schneider
2018-03-07 17:30 ` [PATCH v10 3/9] strbuf: add a case insensitive starts_with() lars.schneider
2018-03-08 0:31 ` Duy Nguyen
2018-03-08 23:12 ` Junio C Hamano
2018-03-09 15:54 ` Lars Schneider
2018-03-09 17:20 ` Junio C Hamano
2018-03-09 19:06 ` Ævar Arnfjörð Bjarmason
2018-03-07 17:30 ` [PATCH v10 4/9] utf8: add function to detect prohibited UTF-16/32 BOM lars.schneider
2018-03-07 17:30 ` [PATCH v10 5/9] utf8: add function to detect a missing " lars.schneider
2018-03-07 17:30 ` [PATCH v10 6/9] convert: add 'working-tree-encoding' attribute lars.schneider
2018-03-07 17:54 ` Eric Sunshine
2018-03-07 22:56 ` Lars Schneider
2018-03-07 22:57 ` Junio C Hamano
2018-03-07 19:35 ` Junio C Hamano
2018-03-07 17:30 ` [PATCH v10 7/9] convert: check for detectable errors in UTF encodings lars.schneider
2018-03-07 18:04 ` Eric Sunshine
2018-03-09 17:02 ` Lars Schneider
2018-03-07 19:49 ` Junio C Hamano
2018-03-07 22:12 ` Lars Schneider
2018-03-07 22:32 ` Junio C Hamano
2018-03-07 22:49 ` Lars Schneider
2018-03-07 22:57 ` Junio C Hamano
2018-03-07 23:19 ` Lars Schneider
2018-03-07 23:34 ` Junio C Hamano
2018-03-07 17:30 ` [PATCH v10 8/9] convert: add tracing for 'working-tree-encoding' attribute lars.schneider
2018-03-07 17:30 ` [PATCH v10 9/9] convert: add round trip check based on 'core.checkRoundtripEncoding' lars.schneider
2018-03-07 19:59 ` Junio C Hamano
2018-03-07 22:44 ` Lars Schneider
2018-03-07 22:52 ` Junio C Hamano
2018-03-07 22:58 ` Lars Schneider
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: http://vger.kernel.org/majordomo-info.html
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20180307173026.30058-1-lars.schneider@autodesk.com \
--to=lars.schneider@autodesk.com \
--cc=Johannes.Schindelin@gmx.de \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=j6t@kdbg.org \
--cc=larsxschneider@gmail.com \
--cc=peff@peff.net \
--cc=ramsay@ramsayjones.plus.com \
--cc=sunshine@sunshineco.com \
--cc=tboegi@web.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).