git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Luke Diamand <luke@diamand.org>
To: Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com>,
	Junio C Hamano <gitster@pobox.com>,
	Eric Sunshine <sunshine@sunshineco.com>,
	Git List <git@vger.kernel.org>,
	Johannes Schindelin <Johannes.Schindelin@gmx.de>,
	Pete Wyckoff <pw@padd.com>, Anders Bakken <abakken@netflix.com>
Cc: Andrew Oakley <andrew@adoakley.name>
Subject: Re: [PATCH v6] Add git-p4.fallbackEncoding
Date: Thu, 29 Apr 2021 08:36:20 +0000	[thread overview]
Message-ID: <179cd4f0-def6-1b3a-2802-139b19d3301d@diamand.org> (raw)
In-Reply-To: <20210429073905.837-1-tzadik.vanderhoof@gmail.com>



On 29/04/2021 07:39, Tzadik Vanderhoof wrote:
> Add git-p4.fallbackEncoding config variable, to prevent git-p4 from
> crashing on non UTF-8 changeset descriptions.
> 
> When git-p4 reads the output from a p4 command, it assumes it will
> be 100% UTF-8. If even one character in the output of one p4 command is
> not UTF-8, git-p4 crashes with:
> 
>      File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList
>          value = value.decode() UnicodeDecodeError: 'utf-8' codec can't
>          decode byte Ox93 in position 42: invalid start byte
> 
> This is especially a problem for the "git p4 clone ... @all" command,
> where git-p4 needs to read thousands of changeset descriptions, one of
> which may have a stray smart quote, causing the whole clone operation
> to fail.
> 
> Add a new config setting, allowing git-p4 to try a fallback encoding
> (for example, "cp1252") and/or use the Unicode replacement character,
> to prevent the whole program from crashing on such a minor problem.

I think Andrew Oakley pointed this out earlier - but in the days of 
Python2 this was (I think) never a problem. Python2 just took in the 
binary data, in whatever encoding, and passed it untouched on to git, 
which in turn just stored it.

https://lore.kernel.org/git/20210412085251.51475-1-andrew@adoakley.name/

It was only whatever was trying to render the bytestream that needed to 
worry about the encoding.

Now we're making the decision in git-p4 when we ingest it - did we 
consider just passing it along untouched?

The problem at hand is that git-p4 is trying to store it internally as a 
`string' which now is unicode-aware, when perhaps it should not be.

It's going to get very confusing if anyone ingests something from 
Perforce having set the encoding to one thing, and it turns out to be a 
different encoding, or worse, multiple encodings for the same repo.

I also worry that if someone has connected to a Unicode-aware Perforce 
server, and then unwittingly set P4CHARSET, then are we going to end up 
silently scrambling everything?

I'm not sure, encodings make my head hurt.

Luke



> 
> Signed-off-by: Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com>
> ---
>   Documentation/git-p4.txt                   |  9 ++
>   git-p4.py                                  | 11 ++-
>   t/t9836-git-p4-config-fallback-encoding.sh | 98 ++++++++++++++++++++++
>   3 files changed, 117 insertions(+), 1 deletion(-)
>   create mode 100755 t/t9836-git-p4-config-fallback-encoding.sh
> 
> diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt
> index f89e68b424..86d3ffa644 100644
> --- a/Documentation/git-p4.txt
> +++ b/Documentation/git-p4.txt
> @@ -638,6 +638,15 @@ git-p4.pathEncoding::
>   	to transcode the paths to UTF-8. As an example, Perforce on Windows
>   	often uses "cp1252" to encode path names.
>   
> +git-p4.fallbackEncoding::
> +	Perforce changeset descriptions can be stored in any encoding.
> +	Git-p4 first tries to interpret each description as UTF-8. If that
> +	fails, this config allows another encoding to be tried. You can specify,
> +	for example, "cp1252". If git-p4.fallbackEncoding is "replace", UTF-8 will
> +	be used, with invalid UTF-8 characters replaced by the Unicode replacement
> +	character. The default is "none": there is no fallback, and any non UTF-8
> +	character will cause git-p4 to immediately fail.
> +
>   git-p4.largeFileSystem::
>   	Specify the system that is used for large (binary) files. Please note
>   	that large file systems do not support the 'git p4 submit' command.
> diff --git a/git-p4.py b/git-p4.py
> index 09c9e93ac4..202fb01bdf 100755
> --- a/git-p4.py
> +++ b/git-p4.py
> @@ -771,7 +771,16 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False,
>                   for key, value in entry.items():
>                       key = key.decode()
>                       if isinstance(value, bytes) and not (key in ('data', 'path', 'clientFile') or key.startswith('depotFile')):
> -                        value = value.decode()
> +                        try:
> +                            value = value.decode()
> +                        except UnicodeDecodeError:
> +                            fallbackEncoding = gitConfig("git-p4.fallbackEncoding").lower() or 'none'
> +                            if fallbackEncoding == 'none':
> +                                raise Exception("UTF-8 decoding failed. Consider using git config git-p4.fallbackEncoding")
> +                            elif fallbackEncoding == 'replace':
> +                                value = value.decode(errors='replace')
> +                            else:
> +                                value = value.decode(encoding=fallbackEncoding)
>                       decoded_entry[key] = value
>                   # Parse out data if it's an error response
>                   if decoded_entry.get('code') == 'error' and 'data' in decoded_entry:
> diff --git a/t/t9836-git-p4-config-fallback-encoding.sh b/t/t9836-git-p4-config-fallback-encoding.sh
> new file mode 100755
> index 0000000000..901bb3759d
> --- /dev/null
> +++ b/t/t9836-git-p4-config-fallback-encoding.sh
> @@ -0,0 +1,98 @@
> +#!/bin/sh
> +
> +test_description='test git-p4.fallbackEncoding config'
> +
> +GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
> +export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
> +
> +. ./lib-git-p4.sh
> +
> +test_expect_success 'start p4d' '
> +	start_p4d
> +'
> +
> +test_expect_success 'add Unicode description' '
> +	cd "$cli" &&
> +	echo file1 >file1 &&
> +	p4 add file1 &&
> +	p4 submit -d documentación
> +'
> +
> +# Unicode descriptions cause "git p4 clone" to crash with a UnicodeDecodeError in some
> +# environments. This test determines if that is the case in our environment. If so,
> +# we create a file called "clone_fails". In subsequent tests, we check whether that
> +# file exists to determine what behavior to expect.
> +
> +clone_fails="$TRASH_DIRECTORY/clone_fails"
> +
> +# If clone fails with git-p4.fallbackEncoding set to "none", create the "clone_fails" file,
> +# and make sure the error message is correct
> +
> +test_expect_success 'clone with git-p4.fallbackEncoding set to "none"' '
> +	git config --global git-p4.fallbackEncoding none &&
> +	test_when_finished cleanup_git && {
> +		git p4 clone --dest="$git" //depot@all 2>error || (
> +			>"$clone_fails" &&
> +			grep "UTF-8 decoding failed. Consider using git config git-p4.fallbackEncoding" error
> +		)
> +	}
> +'
> +
> +# If clone fails with git-p4.fallbackEncoding set to "none", it should also fail when it's unset,
> +# also with the correct error message.  Otherwise the clone should succeed.
> +
> +test_expect_success 'clone with git-p4.fallbackEncoding unset' '
> +	git config --global --unset git-p4.fallbackEncoding &&
> +	test_when_finished cleanup_git && {
> +		(
> +			test -f "$clone_fails" &&
> +			test_must_fail git p4 clone --dest="$git" //depot@all 2>error &&
> +			grep "UTF-8 decoding failed. Consider using git config git-p4.fallbackEncoding" error
> +		) ||
> +		(
> +			! test -f "$clone_fails" &&
> +			git p4 clone --dest="$git" //depot@all 2>error
> +		)
> +	}
> +'
> +
> +# Whether or not "clone_fails" exists, setting git-p4.fallbackEncoding
> +# to "cp1252" should cause clone to succeed and get the right description
> +
> +test_expect_success 'clone with git-p4.fallbackEncoding set to "cp1252"' '
> +	git config --global git-p4.fallbackEncoding cp1252 &&
> +	test_when_finished cleanup_git &&
> +	(
> +		git p4 clone --dest="$git" //depot@all &&
> +		cd "$git" &&
> +		git log --oneline >log &&
> +		desc=$(head -1 log | cut -d" " -f2) &&
> +		test "$desc" = "documentación"
> +	)
> +'
> +
> +# Setting git-p4.fallbackEncoding to "replace" should always cause clone to succeed.
> +# If "clone_fails" exists, the description should contain the Unicode replacement
> +# character, otherwise the description should be correct (since we're on a system that
> +# doesn't have the Unicode issue)
> +
> +test_expect_success 'clone with git-p4.fallbackEncoding set to "replace"' '
> +	git config --global git-p4.fallbackEncoding replace &&
> +	test_when_finished cleanup_git &&
> +	(
> +		git p4 clone --dest="$git" //depot@all &&
> +		cd "$git" &&
> +		git log --oneline >log &&
> +		desc=$(head -1 log | cut -d" " -f2) &&
> +		{
> +			(test -f "$clone_fails" &&
> +				test "$desc" = "documentaci�n"
> +			) ||
> +			(! test -f "$clone_fails" &&
> +				test "$desc" = "documentación"
> +			)
> +		}
> +	)
> +'
> +
> +test_done
> 

  reply	other threads:[~2021-04-29  8:36 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-04-08 19:28 git-p4 crashes on non UTF-8 output from p4 Tzadik Vanderhoof
2021-04-09 15:38 ` Torsten Bögershausen
2021-04-11  7:16   ` Tzadik Vanderhoof
2021-04-11  9:37     ` Torsten Bögershausen
2021-04-11 20:21       ` Tzadik Vanderhoof
2021-04-12  4:06         ` Torsten Bögershausen
2021-04-21  8:46           ` [PATCH] add git-p4.fallbackEncoding config variable, to prevent git-p4 from crashing on non UTF-8 changeset descriptions Tzadik Vanderhoof
2021-04-21  8:55             ` Tzadik Vanderhoof
2021-04-22  5:05               ` [PATCH v3] add git-p4.fallbackEncoding config setting, " Tzadik Vanderhoof
2021-04-22 15:50                 ` Torsten Bögershausen
2021-04-22 16:17                   ` Eric Sunshine
2021-04-22 22:33                     ` Eric Sunshine
2021-04-23  6:36                       ` [PATCH] add git-p4.fallbackEncoding config variable, " Tzadik Vanderhoof
2021-04-23  6:44                         ` Tzadik Vanderhoof
2021-04-23 19:08                           ` Tzadik Vanderhoof
2021-04-24  8:14                             ` Torsten Bögershausen
2021-04-27  5:39                               ` [PATCH v5] " Tzadik Vanderhoof
2021-04-27  5:45                                 ` Tzadik Vanderhoof
2021-04-28  4:39                                 ` Junio C Hamano
2021-04-28 14:58                                   ` Torsten Bögershausen
2021-04-29  7:39                                     ` [PATCH v6] Add git-p4.fallbackEncoding Tzadik Vanderhoof
2021-04-29  8:36                                       ` Luke Diamand [this message]
2021-04-29 17:29                                         ` Tzadik Vanderhoof
     [not found]                                     ` <20210429074458.891-1-tzadik.vanderhoof@gmail.com>
     [not found]                                       ` <c4c48615-d1f4-fd37-0960-979535907f15@web.de>
2021-04-29 17:14                                         ` Tzadik Vanderhoof

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=179cd4f0-def6-1b3a-2802-139b19d3301d@diamand.org \
    --to=luke@diamand.org \
    --cc=Johannes.Schindelin@gmx.de \
    --cc=abakken@netflix.com \
    --cc=andrew@adoakley.name \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=pw@padd.com \
    --cc=sunshine@sunshineco.com \
    --cc=tzadik.vanderhoof@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).