Re: [PATCH v8 5/7] convert: add 'working-tree-encoding' attribute

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

From: Eric Sunshine <sunshine@sunshineco.com>
To: Lars Schneider <lars.schneider@autodesk.com>
Cc: "Git List" <git@vger.kernel.org>,
	"Junio C Hamano" <gitster@pobox.com>,
	"Torsten Bögershausen" <tboegi@web.de>,
	"Johannes Sixt" <j6t@kdbg.org>, "Jeff King" <peff@peff.net>,
	"Ramsay Jones" <ramsay@ramsayjones.plus.com>,
	"Johannes Schindelin" <Johannes.Schindelin@gmx.de>,
	"Lars Schneider" <larsxschneider@gmail.com>
Subject: Re: [PATCH v8 5/7] convert: add 'working-tree-encoding' attribute
Date: Sun, 25 Feb 2018 02:15:30 -0500	[thread overview]
Message-ID: <CAPig+cRtp5g0iDd3pHybZXd_VkGG2-a4CdCZdiqN55Cr1T5Zyg@mail.gmail.com> (raw)
In-Reply-To: <20180224162801.98860-6-lars.schneider@autodesk.com>

On Sat, Feb 24, 2018 at 11:27 AM,  <lars.schneider@autodesk.com> wrote:
> Git recognizes files encoded with ASCII or one of its supersets (e.g.
> UTF-8 or ISO-8859-1) as text files. All other encodings are usually
> interpreted as binary and consequently built-in Git text processing
> tools (e.g. 'git diff') as well as most Git web front ends do not
> visualize the content.
>
> Add an attribute to tell Git what encoding the user has defined for a
> given file. If the content is added to the index, then Git converts the
> content to a canonical UTF-8 representation. On checkout Git will
> reverse the conversion.
>
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> ---
> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
> @@ -272,6 +272,84 @@ few exceptions.  Even though...
> +Please note that using the `working-tree-encoding` attribute may have a
> +number of pitfalls:
> +
> +- Alternative Git implementations (e.g. JGit or libgit2) and older Git
> +  versions (as of March 2018) do not support the `working-tree-encoding`
> +  attribute. If you decide to use the `working-tree-encoding` attribute
> +  in your repository, then it is strongly recommended to ensure that all
> +  clients working with the repository support it.
> +
> +  If you declare `*.proj` files as UTF-16 and you add `foo.proj` with an
> +  `working-tree-encoding` enabled Git client, then `foo.proj` will be
> +  stored as UTF-8 internally. A client without `working-tree-encoding`
> +  support will checkout `foo.proj` as UTF-8 encoded file. This will
> +  typically cause trouble for the users of this file.

The above paragraph is giving an example of the scenario described in
the paragraph above it. It might, therefore, be clearer to start this
paragraph with "For example,...". Also, as an aid to non-Windows
developers, it might be helpful to introduce ".proj" files as
"Microsoft Visual Studio `*.proj` files...".

> diff --git a/convert.c b/convert.c
> @@ -265,6 +266,110 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
> +static struct encoding {
> +       const char *name;
> +       struct encoding *next;
> +} *encoding, **encoding_tail;

Why does this struct need to be a linked-list since it contains only a
name? In fact, why do the struct and linked list exist at all? Back in
v1 when the struct held more members, it made sense to place them in a
collection so they could be re-used, but today it just seems an
unnecessary artifact which buys you nothing.

Is the plan that some future patch series will add members to the
struct? If not, then it seems that it should be retired altogether.

> +static const char *default_encoding = "UTF-8";
> +
> +static int encode_to_git(const char *path, const char *src, size_t src_len,
> +                        struct strbuf *buf, struct encoding *enc, int conv_flags)
> +{
> +       [...]
> +       if (has_prohibited_utf_bom(enc->name, src, src_len)) {
> +               const char *error_msg = _(
> +                       "BOM is prohibited in '%s' if encoded as %s");
> +               /*
> +                * This advise is shown for UTF-??BE and UTF-??LE encodings.

s/advise/advice/

> +                * We truncate the encoding name to 6 chars with %.6s to cut
> +                * off the last two "byte order" characters.
> +                */
> +               const char *advise_msg = _(
> +                       "The file '%s' contains a byte order mark (BOM). "
> +                       "Please use %.6s as working-tree-encoding.");
> +               advise(advise_msg, path, enc->name);
> +               if (conv_flags & CONV_WRITE_OBJECT)
> +                       die(error_msg, path, enc->name);
> +               else
> +                       error(error_msg, path, enc->name);

What is the intended behavior in the non-die() case? As implemented,
this is still going to attempt the conversion even though it threw an
error. Should it be returning 0 here instead? Same question for the
couple additional cases below.

> +
> +       } else if (is_missing_required_utf_bom(enc->name, src, src_len)) {
> +               const char *error_msg = _(
> +                       "BOM is required in '%s' if encoded as %s");
> +               const char *advise_msg = _(
> +                       "The file '%s' is missing a byte order mark (BOM). "
> +                       "Please use %sBE or %sLE (depending on the byte order) "
> +                       "as working-tree-encoding.");
> +               advise(advise_msg, path, enc->name, enc->name);
> +               if (conv_flags & CONV_WRITE_OBJECT)
> +                       die(error_msg, path, enc->name);
> +               else
> +                       error(error_msg, path, enc->name);
> +       }

For a function which presumably is agnostic about the working-tree
encoding (supporting whatever iconv does), it nevertheless has
unusually intimate knowledge about UTF BOMs, in general, and the
internals of has_prohibited_utf_bom() and
is_missing_required_utf_bom(), in particular. It might be cleaner, and
would simplify this function, if all this UTF BOM-specific gunk was
moved elsewhere and generalized into a "validate conversion" function.

As this series is already at v8, perhaps such cleanup could be done as
a follow-on series (by someone) rather than re-rolling.

> +       dst = reencode_string_len(src, src_len, default_encoding, enc->name,
> +                                 &dst_len);
> +       if (!dst) {
> +               /*
> +                * We could add the blob "as-is" to Git. However, on checkout
> +                * we would try to reencode to the original encoding. This
> +                * would fail and we would leave the user with a messed-up
> +                * working tree. Let's try to avoid this by screaming loud.
> +                */
> +               const char* msg = _("failed to encode '%s' from %s to %s");
> +               if (conv_flags & CONV_WRITE_OBJECT)
> +                       die(msg, path, enc->name, default_encoding);
> +               else
> +                       error(msg, path, enc->name, default_encoding);
> +       }
> +
> +       strbuf_attach(buf, dst, dst_len, dst_len + 1);
> +       return 1;
> +}
> +
> +static int encode_to_worktree(const char *path, const char *src, size_t src_len,
> +                             struct strbuf *buf, struct encoding *enc)
> +{
> +       [...]
> +       dst = reencode_string_len(src, src_len, enc->name, default_encoding,
> +                                 &dst_len);
> +       if (!dst) {
> +               error("failed to encode '%s' from %s to %s",
> +                       path, enc->name, default_encoding);

The last two arguments need to be swapped.

> +               return 0;
> +       }
> +
> +       strbuf_attach(buf, dst, dst_len, dst_len + 1);
> +       return 1;
> +}
> @@ -978,6 +1083,35 @@ static int ident_to_worktree(const char *path, const char *src, size_t len,
> +static struct encoding *git_path_check_encoding(struct attr_check_item *check)
> +{
> +       if (ATTR_TRUE(value) || ATTR_FALSE(value) || ATTR_UNSET(value) ||
> +           !strlen(value))
> +               return NULL;
> +
> +       for (enc = encoding; enc; enc = enc->next)
> +               if (!strcasecmp(value, enc->name))
> +                       return enc;
> +
> +       /* Don't encode to the default encoding */
> +       if (!strcasecmp(value, default_encoding))
> +               return NULL;

You could handle this easy-to-detect case before attempting the more
expensive loop just above. (Not necessarily worth a re-roll.)

> +       enc = xcalloc(1, sizeof(*enc));
> +       /*
> +        * Ensure encoding names are always upper case (e.g. UTF-8) to
> +        * simplify subsequent string comparisons.
> +        */
> +       enc->name = xstrdup_toupper(value);
> +       *encoding_tail = enc;
> +       encoding_tail = &(enc->next);
> +
> +       return enc;
> +}
> diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
> @@ -0,0 +1,226 @@
> +test_expect_success 'ensure UTF-8 is stored in Git' '
> +       git cat-file -p :test.utf16 >test.utf16.git &&
> +       test_cmp_bin test.utf8.raw test.utf16.git &&
> +
> +       # cleanup
> +       rm test.utf16.git
> +'

This cleanup won't take place if test_cmp_bin() fails. Instead,
cleanups are typically handled by test_when_finished() to ensure that
the cleanup action occurs even if the test fails.

    test_expect_success 'ensure UTF-8 is stored in Git' '
        test_when_finished "rm -f  test.utf16.git" &&
        ...

Same comment for remaining tests.

> +test_expect_success 'check prohibited UTF BOM' '
> +       printf "\0a\0b\0c"                         >nobom.utf16be.raw &&
> +       printf "a\0b\0c\0"                         >nobom.utf16le.raw &&
> +       printf "\376\777\0a\0b\0c"                 >bebom.utf16be.raw &&
> +       printf "\777\376a\0b\0c\0"                 >lebom.utf16le.raw &&
> +
> +       printf "\0\0\0a\0\0\0b\0\0\0c"             >nobom.utf32be.raw &&
> +       printf "a\0\0\0b\0\0\0c\0\0\0"             >nobom.utf32le.raw &&
> +       printf "\0\0\376\777\0\0\0a\0\0\0b\0\0\0c" >bebom.utf32be.raw &&
> +       printf "\777\376\0\0a\0\0\0b\0\0\0c\0\0\0" >lebom.utf32le.raw &&

These resources seem to be used by multiple tests. Perhaps create them
in a distinct "setup" test instead of bundling them in this test?

> +       echo "*.utf16be text working-tree-encoding=utf-16be" >>.gitattributes &&
> +       echo "*.utf16le text working-tree-encoding=utf-16le" >>.gitattributes &&
> +       echo "*.utf32be text working-tree-encoding=utf-32be" >>.gitattributes &&
> +       echo "*.utf32le text working-tree-encoding=utf-32le" >>.gitattributes &&
> +
> +       # Here we add a UTF-16 files with BOM (big-endian and little-endian)
> +       # but we tell Git to treat it as UTF-16BE/UTF-16LE. In these cases
> +       # the BOM is prohibited.
> +       cp bebom.utf16be.raw bebom.utf16be &&
> +       test_must_fail git add bebom.utf16be 2>err.out &&
> +       test_i18ngrep "fatal: BOM is prohibited .* UTF-16BE" err.out &&
> +
> +       cp lebom.utf16le.raw lebom.utf16be &&
> +       test_must_fail git add lebom.utf16be 2>err.out &&
> +       test_i18ngrep "fatal: BOM is prohibited .* UTF-16BE" err.out &&
> +
> +       cp bebom.utf16be.raw bebom.utf16le &&
> +       test_must_fail git add bebom.utf16le 2>err.out &&
> +       test_i18ngrep "fatal: BOM is prohibited .* UTF-16LE" err.out &&
> +
> +       cp lebom.utf16le.raw lebom.utf16le &&
> +       test_must_fail git add lebom.utf16le 2>err.out &&
> +       test_i18ngrep "fatal: BOM is prohibited .* UTF-16LE" err.out &&
> +
> +       # ... and the same for UTF-32
> +       cp bebom.utf32be.raw bebom.utf32be &&
> +       test_must_fail git add bebom.utf32be 2>err.out &&
> +       test_i18ngrep "fatal: BOM is prohibited .* UTF-32BE" err.out &&
> +
> +       cp lebom.utf32le.raw lebom.utf32be &&
> +       test_must_fail git add lebom.utf32be 2>err.out &&
> +       test_i18ngrep "fatal: BOM is prohibited .* UTF-32BE" err.out &&
> +
> +       cp bebom.utf32be.raw bebom.utf32le &&
> +       test_must_fail git add bebom.utf32le 2>err.out &&
> +       test_i18ngrep "fatal: BOM is prohibited .* UTF-32LE" err.out &&
> +
> +       cp lebom.utf32le.raw lebom.utf32le &&
> +       test_must_fail git add lebom.utf32le 2>err.out &&
> +       test_i18ngrep "fatal: BOM is prohibited .* UTF-32LE" err.out &&
> +
> +       # cleanup
> +       git reset --hard HEAD
> +'

Clumping all these checks into the same test makes it difficult to
determine which one failed (if one does fail). Also, the amount of
copy/paste code is easy to get wrong and a bit onerous to review.
Automating via for-loops would address these concerns. For instance,
to check all combinations (not just 8 combinations tested above):

    for i in 16be 16le 32be 32le
    do
        for j in 16be 16le 32be 32le
        do
            test_expect_success "check prohibited UTF BOM utf$i/utf$j" '
                test_when_finished "git reset --hard HEAD" &&
                cp bebom.utf$i.raw bebom.utf$j &&
                test_must_fail git add bebom.utf$j 2>err.out &&
                J=$(echo $j | tr bel BEL) &&
                test_i18ngrep "fatal: BOM is prohibited .* UTF-$J" err.out
            '
        done
    done

Alternately, the test could be moved to a function and called for each
combination:

    check_prohibited_bom () {
        i=$1
        j=$2
        test_expect_success "check prohibited UTF BOM utf$i/utf$j" '
            ...
        '
    }
    check_prohibited_bom 16be 16be
    check_prohibited_bom 16be 16le
    ...

Same comment regarding copy/paste in tests below.

Probably not worth a re-roll.

> +test_expect_success 'check required UTF BOM' '
> +       echo "*.utf32 text working-tree-encoding=utf-32" >>.gitattributes &&
> +
> +       cp nobom.utf16be.raw nobom.utf16 &&
> +       test_must_fail git add nobom.utf16 2>err.out &&
> +       test_i18ngrep "fatal: BOM is required .* UTF-16" err.out &&
> +
> +       cp nobom.utf16le.raw nobom.utf16 &&
> +       test_must_fail git add nobom.utf16 2>err.out &&
> +       test_i18ngrep "fatal: BOM is required .* UTF-16" err.out &&
> +
> +       cp nobom.utf32be.raw nobom.utf32 &&
> +       test_must_fail git add nobom.utf32 2>err.out &&
> +       test_i18ngrep "fatal: BOM is required .* UTF-32" err.out &&
> +
> +       cp nobom.utf32le.raw nobom.utf32 &&
> +       test_must_fail git add nobom.utf32 2>err.out &&
> +       test_i18ngrep "fatal: BOM is required .* UTF-32" err.out &&
> +
> +       # cleanup
> +       rm nobom.utf16 nobom.utf32 &&
> +       git reset --hard HEAD
> +'

next prev parent reply	other threads:[~2018-02-25  7:15 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-02-24 16:27 [PATCH v8 0/7] convert: add support for different encodings lars.schneider
2018-02-24 16:27 ` [PATCH v8 1/7] strbuf: remove unnecessary NUL assignment in xstrdup_tolower() lars.schneider
2018-02-24 16:27 ` [PATCH v8 2/7] strbuf: add xstrdup_toupper() lars.schneider
2018-02-24 16:27 ` [PATCH v8 3/7] utf8: add function to detect prohibited UTF-16/32 BOM lars.schneider
2018-02-25  3:41   ` Eric Sunshine
2018-02-25 11:35     ` Lars Schneider
2018-02-27  5:17       ` Eric Sunshine
2018-02-28 21:34         ` Lars Schneider
2018-02-24 16:27 ` [PATCH v8 4/7] utf8: add function to detect a missing " lars.schneider
2018-02-25  3:52   ` Eric Sunshine
2018-02-25 11:41     ` Lars Schneider
2018-02-24 16:27 ` [PATCH v8 5/7] convert: add 'working-tree-encoding' attribute lars.schneider
2018-02-25  7:15   ` Eric Sunshine [this message]
2018-02-27 11:16     ` Lars Schneider
2018-02-28 21:20       ` Eric Sunshine
2018-02-24 16:28 ` [PATCH v8 6/7] convert: add tracing for " lars.schneider
2018-02-24 16:28 ` [PATCH v8 7/7] convert: add round trip check based on 'core.checkRoundtripEncoding' lars.schneider
2018-02-25 19:50   ` Eric Sunshine
2018-03-04 19:08     ` Lars Schneider
2018-03-04 19:58       ` Eric Sunshine

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAPig+cRtp5g0iDd3pHybZXd_VkGG2-a4CdCZdiqN55Cr1T5Zyg@mail.gmail.com \
    --to=sunshine@sunshineco.com \
    --cc=Johannes.Schindelin@gmx.de \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=j6t@kdbg.org \
    --cc=lars.schneider@autodesk.com \
    --cc=larsxschneider@gmail.com \
    --cc=peff@peff.net \
    --cc=ramsay@ramsayjones.plus.com \
    --cc=tboegi@web.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).