From: lars.schneider@autodesk.com
To: git@vger.kernel.org
Cc: gitster@pobox.com, tboegi@web.de, j6t@kdbg.org,
sunshine@sunshineco.com, peff@peff.net,
ramsay@ramsayjones.plus.com, Johannes.Schindelin@gmx.de,
Lars Schneider <larsxschneider@gmail.com>
Subject: [PATCH v8 0/7] convert: add support for different encodings
Date: Sat, 24 Feb 2018 17:27:54 +0100
Message-ID: <20180224162801.98860-1-lars.schneider@autodesk.com>
From: Lars Schneider <larsxschneider@gmail.com>
Hi,
Patches 1-4 and 6 are preparation and helper functions.
Patches 5 and 7 are the actual change.
This series depends on Torsten's 8462ff43e4 (convert_to_git():
safe_crlf/checksafe becomes int conv_flags, 2018-01-13) which is
already in master.
Changes since v7:
* make it clearer in the documentation that Git stores content "as-is"
by default. Content is only stored in UTF-8 if the
`working-tree-encoding` attribute (w-t-e) is used (Junio)
* add test case for $GIT_DIR/info/attributes support (Junio)
Thanks,
Lars
RFC: https://public-inbox.org/git/BDB9B884-6D17-4BE3-A83C-F67E2AFA2B46@gmail.com/
v1: https://public-inbox.org/git/20171211155023.1405-1-lars.schneider@autodesk.com/
v2: https://public-inbox.org/git/20171229152222.39680-1-lars.schneider@autodesk.com/
v3: https://public-inbox.org/git/20180106004808.77513-1-lars.schneider@autodesk.com/
v4: https://public-inbox.org/git/20180120152418.52859-1-lars.schneider@autodesk.com/
v5: https://public-inbox.org/git/20180129201855.9182-1-tboegi@web.de/
v6: https://public-inbox.org/git/20180209132830.55385-1-lars.schneider@autodesk.com/
v7: https://public-inbox.org/git/20180215152711.158-1-lars.schneider@autodesk.com/
Base Ref:
Web-Diff: https://github.com/larsxschneider/git/commit/2758a2da29
Checkout: git fetch https://github.com/larsxschneider/git encoding-v8 && git checkout 2758a2da29
### Interdiff (v7..v8):
diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 10cb37795d..11315054f4 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -275,11 +275,11 @@ few exceptions. Even though...
`working-tree-encoding`
^^^^^^^^^^^^^^^^^^^^^^^
-Git recognizes files encoded with ASCII or one of its supersets (e.g.
-UTF-8 or ISO-8859-1) as text files. All other encodings are usually
-interpreted as binary and consequently built-in Git text processing
-tools (e.g. 'git diff') as well as most Git web front ends do not
-visualize the content.
+Git recognizes files encoded in ASCII or one of its supersets (e.g.
+UTF-8, ISO-8859-1, ...) as text files. Files encoded in certain other
+encodings (e.g. UTF-16) are interpreted as binary and consequently
+built-in Git text processing tools (e.g. 'git diff') as well as most Git
+web front ends do not visualize the contents of these files by default.
In these cases you can tell Git the encoding of a file in the working
directory with the `working-tree-encoding` attribute. If a file with this
@@ -291,12 +291,24 @@ the content is reencoded back to the specified encoding.
Please note that using the `working-tree-encoding` attribute may have a
number of pitfalls:
-- Third party Git implementations that do not support the
- `working-tree-encoding` attribute will checkout the respective files
- UTF-8 encoded and not in the expected encoding. Consequently, these
- files will appear different which typically causes trouble. This is
- in particular the case for older Git versions and alternative Git
- implementations such as JGit or libgit2 (as of February 2018).
+- Alternative Git implementations (e.g. JGit or libgit2) and older Git
+ versions (as of March 2018) do not support the `working-tree-encoding`
+ attribute. If you decide to use the `working-tree-encoding` attribute
+ in your repository, then it is strongly recommended to ensure that all
+ clients working with the repository support it.
+
+ If you declare `*.proj` files as UTF-16 and you add `foo.proj` with an
+ `working-tree-encoding` enabled Git client, then `foo.proj` will be
+ stored as UTF-8 internally. A client without `working-tree-encoding`
+ support will checkout `foo.proj` as UTF-8 encoded file. This will
+ typically cause trouble for the users of this file.
+
+ If a Git client, that does not support the `working-tree-encoding`
+ attribute, adds a new file `bar.proj`, then `bar.proj` will be
+ stored "as-is" internally (in this example probably as UTF-16).
+ A client with `working-tree-encoding` support will interpret the
+ internal contents as UTF-8 and try to convert it to UTF-16 on checkout.
+ That operation will fail and cause an error.
- Reencoding content to non-UTF encodings can cause errors as the
conversion might not be UTF-8 round trip safe. If you suspect your
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index e4717402a5..e34c21eb29 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -13,8 +13,11 @@ test_expect_success 'setup test repo' '
echo "*.utf16 text working-tree-encoding=utf-16" >.gitattributes &&
printf "$text" >test.utf8.raw &&
printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
+ printf "$text" | iconv -f UTF-8 -t UTF-32 >test.utf32.raw &&
cp test.utf16.raw test.utf16 &&
+ cp test.utf32.raw test.utf32 &&
+ # Add only UTF-16 file, we will add the UTF-32 file later
git add .gitattributes test.utf16 &&
git commit -m initial
'
@@ -24,7 +27,7 @@ test_expect_success 'ensure UTF-8 is stored in Git' '
test_cmp_bin test.utf8.raw test.utf16.git &&
# cleanup
- rm test.utf8.raw test.utf16.git
+ rm test.utf16.git
'
test_expect_success 're-encode to UTF-16 on checkout' '
@@ -36,6 +39,19 @@ test_expect_success 're-encode to UTF-16 on checkout' '
rm test.utf16.raw
'
+test_expect_success 'check $GIT_DIR/info/attributes support' '
+ echo "*.utf32 text working-tree-encoding=utf-32" >.git/info/attributes &&
+
+ git add test.utf32 &&
+
+ git cat-file -p :test.utf32 >test.utf32.git &&
+ test_cmp_bin test.utf8.raw test.utf32.git &&
+
+ # cleanup
+ git reset --hard HEAD &&
+ rm test.utf8.raw test.utf32.raw test.utf32.git
+'
+
test_expect_success 'check prohibited UTF BOM' '
printf "\0a\0b\0c" >nobom.utf16be.raw &&
printf "a\0b\0c\0" >nobom.utf16le.raw &&
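The second pitfall described in the gitattributes hunk above can be
sketched without Git at all. This is only an illustration under stated
assumptions: it uses `iconv` to mimic the checkout-time reencoding, it
assumes an `iconv` that knows UTF-8 and UTF-16, and all file names are
hypothetical. Content that was stored "as-is" as UTF-16 is not valid
UTF-8, so treating it as UTF-8 and converting it to UTF-16 is expected
to fail, while content that was stored as UTF-8 converts cleanly.

```shell
#!/bin/sh
# Sketch only: mimics the checkout-time reencoding with iconv, not with Git.
# All file names here are hypothetical.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# A client with working-tree-encoding support stores UTF-8 internally.
printf 'hallo there!\ncan you read me?\n' >stored.utf8

# A client without support stores the working tree file "as-is" (UTF-16).
iconv -f UTF-8 -t UTF-16 <stored.utf8 >stored.utf16

# On checkout the stored content is treated as UTF-8 and converted to
# UTF-16. For properly stored content this succeeds.
if iconv -f UTF-8 -t UTF-16 <stored.utf8 >checkout.good 2>/dev/null
then
	good=ok
else
	good=failed
fi

# The same conversion applied to "as-is" UTF-16 content is expected to
# fail, because the UTF-16 BOM bytes are not valid UTF-8.
if iconv -f UTF-8 -t UTF-16 <stored.utf16 >checkout.bad 2>/dev/null
then
	bad=ok
else
	bad=failed
fi

echo "stored as UTF-8:  $good"
echo "stored as UTF-16: $bad"
```

The real series performs the conversion inside convert.c rather than
through the `iconv` command line tool, so the exact error reporting
differs. The sketch only shows why mixing clients with and without
`working-tree-encoding` support leads to content that cannot be
reencoded.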
### Patches
Lars Schneider (7):
  strbuf: remove unnecessary NUL assignment in xstrdup_tolower()
  strbuf: add xstrdup_toupper()
  utf8: add function to detect prohibited UTF-16/32 BOM
  utf8: add function to detect a missing UTF-16/32 BOM
  convert: add 'working-tree-encoding' attribute
  convert: add tracing for 'working-tree-encoding' attribute
  convert: add round trip check based on 'core.checkRoundtripEncoding'

 Documentation/config.txt         |   6 +
 Documentation/gitattributes.txt  |  86 +++++++++++++
 config.c                         |   5 +
 convert.c                        | 256 ++++++++++++++++++++++++++++++++++++-
 convert.h                        |   2 +
 environment.c                    |   1 +
 sha1_file.c                      |   2 +-
 strbuf.c                         |  13 +-
 strbuf.h                         |   1 +
 t/t0028-working-tree-encoding.sh | 269 +++++++++++++++++++++++++++++++++++++++
 utf8.c                           |  37 ++++++
 utf8.h                           |  25 ++++
 12 files changed, 700 insertions(+), 3 deletions(-)
 create mode 100755 t/t0028-working-tree-encoding.sh
base-commit: 8a2f0888555ce46ac87452b194dec5cb66fb1417
--
2.16.1