git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "Feiyang via GitGitGadget" <gitgitgadget@gmail.com>
To: git@vger.kernel.org
Cc: Feiyang <github@feiyangxue.com>, Feiynag Xue <fxue@roku.com>
Subject: [PATCH] git-p4: handle non-unicode characters in p4 cl
Date: Wed, 03 Feb 2021 16:59:59 +0000	[thread overview]
Message-ID: <pull.864.git.1612371600332.gitgitgadget@gmail.com> (raw)

From: Feiynag Xue <fxue@roku.com>

P4 allows non-unicode characters in changelist description body,
so git-p4 needs to be character encoding aware when reading p4 cl

This change adds 2 config options, one specifies encoding,
the other specifies erro handling upon unrecognized character.
Those configs  apply when it reads p4 description text, mostly
from commands "p4 describe" and "p4 changes".

Signed-off-by: Feiynag Xue <fxue@roku.com>
---
    git-p4: handle non-unicode characters in p4 changelist description
    
    P4 allows non-unicode characters in changelist description body, so
    git-p4 needs to be character encoding aware when reading p4 cl.
    
    This change adds 2 config options: one specifies encoding, the other
    specifies erro handling upon unrecognized character. Those configs apply
    when it reads p4 description text, mostly from commands "p4 describe"
    and "p4 changes".
    
    ------------------------------------------------------------------------
    
    I have an open question in mind: what might be the best default config
    to use?
    
    Currently the python's bytes.decode() is called with default utf-8 and
    strict error handling, so git-p4 pukes on non-unicode characters. I
    encountered it when git p4 sync attempts to ingest a certain CL.
    
    It seems to make sense to default to replace so that it gets rid of
    non-unicode chars while trying to retain information. However, i am
    uncertain on if we have use cases where it relies on the
    stop-on-non-unicode behavior. (Hypothetically say an automation that's
    expected to return error on non-unicode char in order to stop them from
    propagating further?)
    
    ------------------------------------------------------------------------
    
    I tested it with git p4 sync to a P4 CL that somehow has non-unicode
    control character in description. With
    git-p4.cldescencodingerrhandling=ignore, it proceeded without error.

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-864%2Ffeiyeung%2Fdescription-text-encoding-handling-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-864/feiyeung/description-text-encoding-handling-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/864

 Documentation/git-p4.txt | 13 +++++++++++++
 git-p4.py                | 12 +++++++++++-
 2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt
index f89e68b424c..01a0e0b1067 100644
--- a/Documentation/git-p4.txt
+++ b/Documentation/git-p4.txt
@@ -638,6 +638,19 @@ git-p4.pathEncoding::
 	to transcode the paths to UTF-8. As an example, Perforce on Windows
 	often uses "cp1252" to encode path names.
 
+git-p4.clDescEncoding::
+	Perforce allows non-unicode characters in changelist description. 
+	Use this config to tell git-p4 what encoding Perforce had used for 
+	description text. This encoding is used to transcode the text to
+	UTF-8. Defaults to "utf_8".
+
+git-p4.clDescNonUnicodeHandling::
+	Perforce allows non-unicode characters in changelist description. 
+	Use this config to tell git-p4 what to do when it does not recognize 
+	the character encoding in description body. Defaults to "strict" for 
+	stopping upon encounter. "ignore" for skipping unrecognized
+	characters; "replace" for attempting to convert into UTF-8. 
+
 git-p4.largeFileSystem::
 	Specify the system that is used for large (binary) files. Please note
 	that large file systems do not support the 'git p4 submit' command.
diff --git a/git-p4.py b/git-p4.py
index 09c9e93ac40..abbeb9156bd 100755
--- a/git-p4.py
+++ b/git-p4.py
@@ -206,6 +206,13 @@ def decode_path(path):
                 print('Path with non-ASCII characters detected. Used {} to decode: {}'.format(encoding, path))
         return path
 
+def decode_changlist_description(text):
+    """Decode bytes or bytearray using configured changelist description encoding options
+    """
+    encoding = gitConfig('git-p4.clDescEncoding') or 'utf_8'
+    err_handling = gitConfig('git-p4.clDescEncodingErrHandling') or 'strict'
+    return text.decode(encoding, err_handling)
+
 def run_git_hook(cmd, param=[]):
     """Execute a hook if the hook exists."""
     if verbose:
@@ -771,7 +778,10 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False,
                 for key, value in entry.items():
                     key = key.decode()
                     if isinstance(value, bytes) and not (key in ('data', 'path', 'clientFile') or key.startswith('depotFile')):
-                        value = value.decode()
+                        if key == 'desc':
+                            value = decode_changlist_description(value)
+                        else:
+                            value = value.decode()
                     decoded_entry[key] = value
                 # Parse out data if it's an error response
                 if decoded_entry.get('code') == 'error' and 'data' in decoded_entry:

base-commit: e6362826a0409539642a5738db61827e5978e2e4
-- 
gitgitgadget

             reply	other threads:[~2021-02-03 17:03 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-02-03 16:59 Feiyang via GitGitGadget [this message]
2021-02-03 21:44 ` [PATCH] git-p4: handle non-unicode characters in p4 cl Junio C Hamano
     [not found]   ` <BD039BE8-643F-4F61-A0DB-E3581C6B6B10@feiyangxue.com>
2021-02-04  0:11     ` Luke Diamand
2021-02-04 18:45 ` Andrew Oakley

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=pull.864.git.1612371600332.gitgitgadget@gmail.com \
    --to=gitgitgadget@gmail.com \
    --cc=fxue@roku.com \
    --cc=git@vger.kernel.org \
    --cc=github@feiyangxue.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).