git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "Torsten Bögershausen" <tboegi@web.de>
To: Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: [PATCH v3] add git-p4.fallbackEncoding config setting, to prevent git-p4 from crashing on non UTF-8 changeset descriptions
Date: Thu, 22 Apr 2021 17:50:47 +0200	[thread overview]
Message-ID: <20210422155047.3unltvv3mh5uq7wp@tb-raspi4> (raw)
In-Reply-To: <20210422050504.441-1-tzadik.vanderhoof@gmail.com>

On Wed, Apr 21, 2021 at 10:05:04PM -0700, Tzadik Vanderhoof wrote:

Thanks for V3, please see some comments inline.

> When git-p4 reads the output from a p4 command, it assumes it will be
> 100% UTF-8. If even one character in the output of one p4 command is
> not UTF-8, git-p4 crashes with:
>
> File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList
>     value = value.decode() UnicodeDecodeError: 'utf-8' codec can't
> decode byte Ox93 in position 42: invalid start byte
>
> This is especially a problem on the "git p4 clone ... @all" command,
> where git-p4 needs to read thousands of changeset descriptions, one of
> which may have a stray smart quote, causing the whole clone operation
> to fail.
>

> This pull request adds a new config setting, allowing git-p4 to try
> another encoding (for example, "cp1252") and/or use the Unicode replacement
> character, to prevent the whole program from crashing on such a minor problem.

"This pull request" is somewhat superflous wording.
How about:

Add a new config setting, allowing git-p4 to try a fallback encoding
(for example, "cp1252") and/or use the Unicode replacement character,
to prevent the whole program from crashing on such a minor problem.

Documentation is good (and needed, and neccessary).
Probably this is then not needed:
> See the documentation included in the patch for more details of how
> the new config setting works.

>
> Signed-off-by: Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com>
> ---
>  Documentation/git-p4.txt                   | 10 ++++
>  git-p4.py                                  | 11 +++-
>  t/t9835-git-p4-config-fallback-encoding.sh | 65 ++++++++++++++++++++++
>  3 files changed, 85 insertions(+), 1 deletion(-)
>  create mode 100755 t/t9835-git-p4-config-fallback-encoding.sh
>
> diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt
> index f89e68b..e0131a9 100644
> --- a/Documentation/git-p4.txt
> +++ b/Documentation/git-p4.txt
> @@ -638,6 +638,16 @@ git-p4.pathEncoding::
>  	to transcode the paths to UTF-8. As an example, Perforce on Windows
>  	often uses "cp1252" to encode path names.
>
> +git-p4.fallbackEncoding::
> +	Perforce changeset descriptions can be in a mixture of encodings.
> +	Git-p4 first tries to interpret each description as UTF-8. If that
> +	fails, this config allows another encoding to be tried. You
> +	can specify, for example, "cp1252".

That looks OK according to
https://docs.python.org/3/library/codecs.html#standard-encodings

> + If instead of an encoding,
> +	you specify "replace", UTF-8 will be used, with invalid UTF-8
> +	characters replaced by the Unicode replacement character. If you
> +	specify "none" (the default), there is no fallback, and any non
> +	UTF-8 character will cause git-p4 to immediately fail.
> +

May be, that is a matter of taste:

> + If git-p4.fallbackEncoding is "replace" ", UTF-8 will be used, with invalid UTF-8
> +	characters replaced by the Unicode replacement character.
> +	The default is "none": there is no fallback, and any non
> +	UTF-8 character will cause git-p4 to immediately fail.
> +



>  git-p4.largeFileSystem::
>  	Specify the system that is used for large (binary) files. Please note
>  	that large file systems do not support the 'git p4 submit' command.
> diff --git a/git-p4.py b/git-p4.py
> index 09c9e93..3364287 100755
> --- a/git-p4.py
> +++ b/git-p4.py
> @@ -771,7 +771,16 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False,
>                  for key, value in entry.items():
>                      key = key.decode()
>                      if isinstance(value, bytes) and not (key in ('data', 'path', 'clientFile') or key.startswith('depotFile')):
> -                        value = value.decode()
> +                        try:
> +                            value = value.decode()
> +                        except UnicodeDecodeError as ex:
> +                            fallbackEncoding = gitConfig("git-p4.fallbackEncoding").lower() or 'none'
> +                            if fallbackEncoding == 'none':
> +                                raise Exception("UTF8 decoding failed. Consider using git config git-p4.fallbackEncoding")
> +                            elif fallbackEncoding == 'replace':
> +                                value = value.decode(errors='replace')
> +                            else:
> +                                value = value.decode(encoding=fallbackEncoding)
>                      decoded_entry[key] = value
>                  # Parse out data if it's an error response
>                  if decoded_entry.get('code') == 'error' and 'data' in decoded_entry:
> diff --git a/t/t9835-git-p4-config-fallback-encoding.sh b/t/t9835-git-p4-config-fallback-encoding.sh
> new file mode 100755
> index 0000000..56a245e
> --- /dev/null
> +++ b/t/t9835-git-p4-config-fallback-encoding.sh
> @@ -0,0 +1,65 @@
> +#!/bin/sh
> +
> +test_description='test git-p4.fallbackEncoding config'
> +
> +GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
> +export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
> +
> +. ./lib-git-p4.sh
> +
> +if test_have_prereq !MINGW,!CYGWIN; then
> +	skip_all='This system is not subject to encoding failures in "git p4 clone"'
> +	test_done
> +fi

Out of curiosity: Why are Windows versions (MINGW, CYGWIN) excluded ?

> +
> +test_expect_success 'start p4d' '
> +	start_p4d
> +'
> +
> +test_expect_success 'add cp1252 description' '
> +	cd "$cli" &&
> +	echo file1 >file1 &&
> +	p4 add file1 &&
> +	p4 submit -d documentación
> +'
> +
> +test_expect_success 'clone fails with git-p4.fallbackEncoding unset' '
> +	test_might_fail git config --global --unset git-p4.fallbackEncoding &&
> +	test_when_finished cleanup_git &&
> +	(
> +		test_must_fail git p4 clone --dest="$git" //depot@all 2>> actual &&
Should this be >actual ?
And please no ' ' between the '>' and the filename.

> +		grep "UTF8 decoding failed. Consider using git config git-p4.fallbackEncoding" actual
> +	)
> +'
> +test_expect_success 'clone fails with git-p4.fallbackEncoding set to "none"' '
> +	git config --global git-p4.fallbackEncoding none &&
> +	test_when_finished cleanup_git &&
> +	(
> +		test_must_fail git p4 clone --dest="$git" //depot@all 2>> actual &&
Same here
2 >actual

> +		grep "UTF8 decoding failed. Consider using git config git-p4.fallbackEncoding" actual
> +	)
> +'
> +
> +test_expect_success 'clone succeeds with git-p4.fallbackEncoding set to "cp1252"' '
> +	git config --global git-p4.fallbackEncoding cp1252 &&
> +	test_when_finished cleanup_git &&
> +	(
> +		git p4 clone --dest="$git" //depot@all &&
> +		cd "$git" &&
> +		git log --oneline >log &&
> +		desc=$(head -1 log | awk '\''{print $2}'\'') &&	[ "$desc" = "documentación" ]

Style nit:
See Documentation/CodingGuidelines: - We prefer "test" over "[ ... ]".

desc=$(head -1 log | awk '\''{print $2}'\'') &&	test "$desc" = "documentación"

> +	)
> +'
> +
> +test_expect_success 'clone succeeds with git-p4.fallbackEncoding set to "replace"' '
> +	git config --global git-p4.fallbackEncoding replace &&
> +	test_when_finished cleanup_git &&
> +	(
> +		git p4 clone --dest="$git" //depot@all &&
> +		cd "$git" &&
> +		git log --oneline >log &&
> +		desc=$(head -1 log | awk '\''{print $2}'\'') &&	[ "$desc" = "documentaci�n" ]
> +	)
> +'
> +
> +test_done
> --
> 2.31.1.windows.1
>

Are there any more comments from the p4 experts ?

  reply	other threads:[~2021-04-22 15:50 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-04-08 19:28 git-p4 crashes on non UTF-8 output from p4 Tzadik Vanderhoof
2021-04-09 15:38 ` Torsten Bögershausen
2021-04-11  7:16   ` Tzadik Vanderhoof
2021-04-11  9:37     ` Torsten Bögershausen
2021-04-11 20:21       ` Tzadik Vanderhoof
2021-04-12  4:06         ` Torsten Bögershausen
2021-04-21  8:46           ` [PATCH] add git-p4.fallbackEncoding config variable, to prevent git-p4 from crashing on non UTF-8 changeset descriptions Tzadik Vanderhoof
2021-04-21  8:55             ` Tzadik Vanderhoof
2021-04-22  5:05               ` [PATCH v3] add git-p4.fallbackEncoding config setting, " Tzadik Vanderhoof
2021-04-22 15:50                 ` Torsten Bögershausen [this message]
2021-04-22 16:17                   ` Eric Sunshine
2021-04-22 22:33                     ` Eric Sunshine
2021-04-23  6:36                       ` [PATCH] add git-p4.fallbackEncoding config variable, " Tzadik Vanderhoof
2021-04-23  6:44                         ` Tzadik Vanderhoof
2021-04-23 19:08                           ` Tzadik Vanderhoof
2021-04-24  8:14                             ` Torsten Bögershausen
2021-04-27  5:39                               ` [PATCH v5] " Tzadik Vanderhoof
2021-04-27  5:45                                 ` Tzadik Vanderhoof
2021-04-28  4:39                                 ` Junio C Hamano
2021-04-28 14:58                                   ` Torsten Bögershausen
2021-04-29  7:39                                     ` [PATCH v6] Add git-p4.fallbackEncoding Tzadik Vanderhoof
2021-04-29  8:36                                       ` Luke Diamand
2021-04-29 17:29                                         ` Tzadik Vanderhoof
     [not found]                                     ` <20210429074458.891-1-tzadik.vanderhoof@gmail.com>
     [not found]                                       ` <c4c48615-d1f4-fd37-0960-979535907f15@web.de>
2021-04-29 17:14                                         ` Tzadik Vanderhoof

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210422155047.3unltvv3mh5uq7wp@tb-raspi4 \
    --to=tboegi@web.de \
    --cc=git@vger.kernel.org \
    --cc=tzadik.vanderhoof@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).