git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Luke Diamand <luke@diamand.org>
To: Feiyang Xue <me@feiyangxue.com>
Cc: Junio C Hamano <gitster@pobox.com>,
	Feiyang via GitGitGadget <gitgitgadget@gmail.com>,
	Git Users <git@vger.kernel.org>, Feiyang <github@feiyangxue.com>,
	Ben Keene <seraphire@gmail.com>,
	Yang Zhao <yang.zhao@skyboxlabs.com>,
	Scott Lamb <slamb@slamb.org>
Subject: Re: [PATCH] git-p4: handle non-unicode characters in p4 cl
Date: Thu, 4 Feb 2021 00:11:20 +0000	[thread overview]
Message-ID: <CAE5ih7-F9efsiV5AQmw3ocjiy+BT6ZAT5fA0Lx0OSkVTO8Kqjg@mail.gmail.com> (raw)
In-Reply-To: <BD039BE8-643F-4F61-A0DB-E3581C6B6B10@feiyangxue.com>

We've started getting this quite a lot as we switched to a new P4
server and I suspect that the i18N options are incorrect. So I think
this would be welcome.

A test case would be useful, as debugging these decoding problems is a
bit of a nightmare.

Luke

On Wed, 3 Feb 2021 at 22:42, Feiyang Xue <me@feiyangxue.com> wrote:
>
>
>
> On Feb 3, 2021, at 3:44 PM, Junio C Hamano <gitster@pobox.com> wrote:
>
> "Feiyang via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> From: Feiynag Xue <fxue@roku.com>
>
> P4 allows non-unicode characters in changelist description body,
> so git-p4 needs to be character encoding aware when reading p4 cl
>
> This change adds 2 config options, one specifies encoding,
> the other specifies erro handling upon unrecognized character.
> Those configs  apply when it reads p4 description text, mostly
> from commands "p4 describe" and "p4 changes".
>
> Signed-off-by: Feiynag Xue <fxue@roku.com>
> ---
>
>
> Adding a few people who had meaningful (read: needs some Perforce
> knowledge) changes to this part of the codebase to Cc: to ask for
> their reviews.
>
>
> Adding Yang Zhao <yang.zhao@skyboxlabs.com>, who had made character
> encodings related changes for paths.
>
> Adding Scott Lamb <slamb@slamb.org>, who had made changes to this
> “p4CmdList()” method.
>
>
>
>    git-p4: handle non-unicode characters in p4 changelist description
>
>    P4 allows non-unicode characters in changelist description body, so
>    git-p4 needs to be character encoding aware when reading p4 cl.
>
>    This change adds 2 config options: one specifies encoding, the other
>    specifies erro handling upon unrecognized character. Those configs apply
>    when it reads p4 description text, mostly from commands "p4 describe"
>    and "p4 changes".
>
>    ------------------------------------------------------------------------
>
>    I have an open question in mind: what might be the best default config
>    to use?
>
>    Currently the python's bytes.decode() is called with default utf-8 and
>    strict error handling, so git-p4 pukes on non-unicode characters. I
>    encountered it when git p4 sync attempts to ingest a certain CL.
>
>    It seems to make sense to default to replace so that it gets rid of
>    non-unicode chars while trying to retain information. However, i am
>    uncertain on if we have use cases where it relies on the
>    stop-on-non-unicode behavior. (Hypothetically say an automation that's
>    expected to return error on non-unicode char in order to stop them from
>    propagating further?)
>
>    ------------------------------------------------------------------------
>
>    I tested it with git p4 sync to a P4 CL that somehow has non-unicode
>    control character in description. With
>    git-p4.cldescencodingerrhandling=ignore, it proceeded without error.
>
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-864%2Ffeiyeung%2Fdescription-text-encoding-handling-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-864/feiyeung/description-text-encoding-handling-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/864
>
> Documentation/git-p4.txt | 13 +++++++++++++
> git-p4.py                | 12 +++++++++++-
> 2 files changed, 24 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt
> index f89e68b424c..01a0e0b1067 100644
> --- a/Documentation/git-p4.txt
> +++ b/Documentation/git-p4.txt
> @@ -638,6 +638,19 @@ git-p4.pathEncoding::
> to transcode the paths to UTF-8. As an example, Perforce on Windows
> often uses "cp1252" to encode path names.
>
> +git-p4.clDescEncoding::
> + Perforce allows non-unicode characters in changelist description.
> + Use this config to tell git-p4 what encoding Perforce had used for
> + description text. This encoding is used to transcode the text to
> + UTF-8. Defaults to "utf_8".
>
>
> Would it still work if you replaced "utf_8" here with "UTF-8"?  If
> we can use "UTF-8", this description (and the code that does so)
> would read much less awkward, I would think.
>
>
> I doubt “UTF-8” would work; I do believe the lower case “utf-8” would.
>
> Looking at Python3 documentation on encodings, UTF-8 is specified as “utf_8”.
> It allows aliases of using dash to replace underscore, as pointed out by the
> samge page: https://docs.python.org/3/library/codecs.html#standard-encodings
> > Notice that spelling alternatives that only differ in case or use a hyphen
> > instead of an underscore are also valid aliases; therefore, e.g. 'utf-8’
> > is a valid alias for the 'utf_8' codec.
>
> I used underscore one “utf_8” for consistency reason: this file already has
> uses of “utf_8”.
>
>
>
> +git-p4.clDescNonUnicodeHandling::
> + Perforce allows non-unicode characters in changelist description.
> + Use this config to tell git-p4 what to do when it does not recognize
> + the character encoding in description body. Defaults to "strict" for
> + stopping upon encounter. "ignore" for skipping unrecognized
> + characters; "replace" for attempting to convert into UTF-8.
> +
> git-p4.largeFileSystem::
> Specify the system that is used for large (binary) files. Please note
> that large file systems do not support the 'git p4 submit' command.
> diff --git a/git-p4.py b/git-p4.py
> index 09c9e93ac40..abbeb9156bd 100755
> --- a/git-p4.py
> +++ b/git-p4.py
> @@ -206,6 +206,13 @@ def decode_path(path):
>                 print('Path with non-ASCII characters detected. Used {} to decode: {}'.format(encoding, path))
>         return path
>
> +def decode_changlist_description(text):
> +    """Decode bytes or bytearray using configured changelist description encoding options
> +    """
> +    encoding = gitConfig('git-p4.clDescEncoding') or 'utf_8'
> +    err_handling = gitConfig('git-p4.clDescEncodingErrHandling') or 'strict'
> +    return text.decode(encoding, err_handling)
> +
> def run_git_hook(cmd, param=[]):
>     """Execute a hook if the hook exists."""
>     if verbose:
> @@ -771,7 +778,10 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False,
>                 for key, value in entry.items():
>                     key = key.decode()
>                     if isinstance(value, bytes) and not (key in ('data', 'path', 'clientFile') or key.startswith('depotFile')):
> -                        value = value.decode()
> +                        if key == 'desc':
> +                            value = decode_changlist_description(value)
> +                        else:
> +                            value = value.decode()
>                     decoded_entry[key] = value
>                 # Parse out data if it's an error response
>                 if decoded_entry.get('code') == 'error' and 'data' in decoded_entry:
>
> base-commit: e6362826a0409539642a5738db61827e5978e2e4
>
>

  parent reply	other threads:[~2021-02-04  0:14 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-02-03 16:59 [PATCH] git-p4: handle non-unicode characters in p4 cl Feiyang via GitGitGadget
2021-02-03 21:44 ` Junio C Hamano
     [not found]   ` <BD039BE8-643F-4F61-A0DB-E3581C6B6B10@feiyangxue.com>
2021-02-04  0:11     ` Luke Diamand [this message]
2021-02-04 18:45 ` Andrew Oakley

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAE5ih7-F9efsiV5AQmw3ocjiy+BT6ZAT5fA0Lx0OSkVTO8Kqjg@mail.gmail.com \
    --to=luke@diamand.org \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=github@feiyangxue.com \
    --cc=gitster@pobox.com \
    --cc=me@feiyangxue.com \
    --cc=seraphire@gmail.com \
    --cc=slamb@slamb.org \
    --cc=yang.zhao@skyboxlabs.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).