git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [RFC PATCH 0/4] git-p4: python 3 compatability
@ 2019-11-28  1:28 Yang Zhao
  2019-11-28  1:28 ` [RFC PATCH 1/4] git-p4: decode response from p4 to str for python3 Yang Zhao
                   ` (4 more replies)
  0 siblings, 5 replies; 7+ messages in thread
From: Yang Zhao @ 2019-11-28  1:28 UTC (permalink / raw)
  To: git; +Cc: Yang Zhao

This patchset the necessary changes to have git-p4 run under python 3.

I'm submitting this as an RFC as the changes are not exhaustive. I do not have
p4d available, and therefore cannot run the tests locally. I have made
best-effort attempts to test the changes against a repo I do have read access to.

Yang Zhao (4):
  git-p4: decode response from p4 to str for python3
  git-p4: properly encode/decode communication with git for python 3
  git-p4: open .gitp4-usercache.txt in text mode
  git-p4: use utf-8 encoding for file paths throughout

 git-p4.py | 58 +++++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 41 insertions(+), 17 deletions(-)

-- 
2.24.0.windows.2


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [RFC PATCH 1/4] git-p4: decode response from p4 to str for python3
  2019-11-28  1:28 [RFC PATCH 0/4] git-p4: python 3 compatability Yang Zhao
@ 2019-11-28  1:28 ` Yang Zhao
  2019-11-28  1:28 ` [RFC PATCH 2/4] git-p4: properly encode/decode communication with git for python 3 Yang Zhao
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 7+ messages in thread
From: Yang Zhao @ 2019-11-28  1:28 UTC (permalink / raw)
  To: git; +Cc: Yang Zhao

The marshalled dict in the response given on STDOUT by p4 uses `str` for
keys and string values. When run using python3, these values are
deserialized as `bytes`, leading to a whole host of problems as the rest
of the code assumes `str` is used throughout.

An exception is made for the `data` field as it may contain arbitrary
binary data that is not text, as well as `depotFile` which may contain
text encoded with something other than ASCII or UTF-8.

Signed-off-by: Yang Zhao <yang.zhao@skyboxlabs.com>
---
 git-p4.py | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/git-p4.py b/git-p4.py
index 60c73b6a37..ead9d816e1 100755
--- a/git-p4.py
+++ b/git-p4.py
@@ -36,6 +36,7 @@
     unicode = str
     bytes = bytes
     basestring = (str,bytes)
+    use_encoded_streams = True
 else:
     # 'unicode' exists, must be Python 2
     str = str
@@ -643,6 +644,15 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False,
     try:
         while True:
             entry = marshal.load(p4.stdout)
+            if use_encoded_streams:
+                # Decode unmarshalled dict to use str keys and values, except for:
+                #   - `data` which may contain arbitrary binary data
+                #   - `depotFile` which may contain non-UTF8 encoded text
+                decoded_entry = {}
+                for key, value in entry.items():
+                    key = key.decode()
+                    decoded_entry[key] = value.decode() if not (key in ['data', 'depotFile'] or isinstance(value, str)) else value
+                entry = decoded_entry
             if skip_info:
                 if 'code' in entry and entry['code'] == 'info':
                     continue
-- 
2.24.0.windows.2


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [RFC PATCH 2/4] git-p4: properly encode/decode communication with git for python 3
  2019-11-28  1:28 [RFC PATCH 0/4] git-p4: python 3 compatability Yang Zhao
  2019-11-28  1:28 ` [RFC PATCH 1/4] git-p4: decode response from p4 to str for python3 Yang Zhao
@ 2019-11-28  1:28 ` Yang Zhao
  2019-11-28  1:28 ` [RFC PATCH 3/4] git-p4: open .gitp4-usercache.txt in text mode Yang Zhao
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 7+ messages in thread
From: Yang Zhao @ 2019-11-28  1:28 UTC (permalink / raw)
  To: git; +Cc: Yang Zhao

Under python3, calls to write() on the stream to `git fast-import` must
be encoded.  This patch wraps the IO object such that this encoding is
done transparently.

Conversely, any text data read from subprocesses must also be decoded
before running through the rest of the pipeline.

Signed-off-by: Yang Zhao <yang.zhao@skyboxlabs.com>
---
 git-p4.py | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/git-p4.py b/git-p4.py
index ead9d816e1..d79cbf7005 100755
--- a/git-p4.py
+++ b/git-p4.py
@@ -237,6 +237,9 @@ def read_pipe_lines(c):
     if pipe.close() or p.wait():
         die('Command failed: %s' % str(c))
 
+    if use_encoded_streams:
+        val = [line.decode() for line in val]
+
     return val
 
 def p4_read_pipe_lines(c):
@@ -631,7 +634,7 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False,
             stdin_file.write(stdin)
         else:
             for i in stdin:
-                stdin_file.write(i + '\n')
+                stdin_file.write((i + '\n').encode())
         stdin_file.flush()
         stdin_file.seek(0)
 
@@ -3547,6 +3550,15 @@ def openStreams(self):
         self.gitStream = self.importProcess.stdin
         self.gitError = self.importProcess.stderr
 
+        if use_encoded_streams:
+            # Wrap gitStream.write() so that it can be called using `str` arguments
+            def make_encoded_write(write):
+                def encoded_write(s):
+                    return write(s.encode() if isinstance(s, str) else s)
+                return encoded_write
+
+            self.gitStream.write = make_encoded_write(self.gitStream.write)
+
     def closeStreams(self):
         self.gitStream.close()
         if self.importProcess.wait() != 0:
-- 
2.24.0.windows.2


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [RFC PATCH 3/4] git-p4: open .gitp4-usercache.txt in text mode
  2019-11-28  1:28 [RFC PATCH 0/4] git-p4: python 3 compatability Yang Zhao
  2019-11-28  1:28 ` [RFC PATCH 1/4] git-p4: decode response from p4 to str for python3 Yang Zhao
  2019-11-28  1:28 ` [RFC PATCH 2/4] git-p4: properly encode/decode communication with git for python 3 Yang Zhao
@ 2019-11-28  1:28 ` Yang Zhao
  2019-11-28  1:28 ` [RFC PATCH 4/4] git-p4: use utf-8 encoding for file paths throughout Yang Zhao
  2019-11-28 12:54 ` [RFC PATCH 0/4] git-p4: python 3 compatability Johannes Schindelin
  4 siblings, 0 replies; 7+ messages in thread
From: Yang Zhao @ 2019-11-28  1:28 UTC (permalink / raw)
  To: git; +Cc: Yang Zhao

Opening .gitp4-usercache.txt in text mode makes python 3 happy without
explicitly adding encoding and decoding.

Signed-off-by: Yang Zhao <yang.zhao@skyboxlabs.com>
---
 git-p4.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/git-p4.py b/git-p4.py
index d79cbf7005..6821d6aafd 100755
--- a/git-p4.py
+++ b/git-p4.py
@@ -1387,14 +1387,14 @@ def getUserMapFromPerforceServer(self):
         for (key, val) in self.users.items():
             s += "%s\t%s\n" % (key.expandtabs(1), val.expandtabs(1))
 
-        open(self.getUserCacheFilename(), "wb").write(s)
+        open(self.getUserCacheFilename(), 'w', encoding='utf-8').write(s)
         self.userMapFromPerforceServer = True
 
     def loadUserMapFromCache(self):
         self.users = {}
         self.userMapFromPerforceServer = False
         try:
-            cache = open(self.getUserCacheFilename(), "rb")
+            cache = open(self.getUserCacheFilename(), 'r', encoding='utf-8')
             lines = cache.readlines()
             cache.close()
             for line in lines:
-- 
2.24.0.windows.2


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [RFC PATCH 4/4] git-p4: use utf-8 encoding for file paths throughout
  2019-11-28  1:28 [RFC PATCH 0/4] git-p4: python 3 compatability Yang Zhao
                   ` (2 preceding siblings ...)
  2019-11-28  1:28 ` [RFC PATCH 3/4] git-p4: open .gitp4-usercache.txt in text mode Yang Zhao
@ 2019-11-28  1:28 ` Yang Zhao
  2019-11-28  2:57   ` Elijah Newren
  2019-11-28 12:54 ` [RFC PATCH 0/4] git-p4: python 3 compatability Johannes Schindelin
  4 siblings, 1 reply; 7+ messages in thread
From: Yang Zhao @ 2019-11-28  1:28 UTC (permalink / raw)
  To: git; +Cc: Yang Zhao

Try to decode file paths in responses from p4 as soon as possible so
that we are working with unicode string throughout the rest of the flow.
This makes python 3 a lot happier.

Signed-off-by: Yang Zhao <yang.zhao@skyboxlabs.com>
---

This is probably the most risky patch out of the set. It's very likely
that I've neglected to consider certain corner cases with decoding of
path data.

 git-p4.py | 34 ++++++++++++++++++----------------
 1 file changed, 18 insertions(+), 16 deletions(-)

diff --git a/git-p4.py b/git-p4.py
index 6821d6aafd..bd693e1404 100755
--- a/git-p4.py
+++ b/git-p4.py
@@ -650,11 +650,27 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False,
             if use_encoded_streams:
                 # Decode unmarshalled dict to use str keys and values, except for:
                 #   - `data` which may contain arbitrary binary data
-                #   - `depotFile` which may contain non-UTF8 encoded text
+                #   - `depotFile` which may contain non-UTF8 encoded text, and is decoded
+                #     according to git-p4.pathEncoding config
                 decoded_entry = {}
                 for key, value in entry.items():
                     key = key.decode()
-                    decoded_entry[key] = value.decode() if not (key in ['data', 'depotFile'] or isinstance(value, str)) else value
+                    if key == 'data':
+                        pass
+                    elif key == 'depotFile':
+                        try:
+                            value = value.decode('ascii')
+                        except:
+                            encoding = 'utf-8'
+                            if gitConfig('git-p4.pathEncoding'):
+                                encoding = gitConfig('git-p4.pathEncoding')
+                            path = path.decode(encoding, 'replace')
+                            if verbose:
+                                print('Path with non-ASCII characters detected. Used %s to decode: %s ' % (encoding, path))
+                    elif not isinstance(value, str):
+                        value = value.decode()
+
+                    decoded_entry[key] = value
                 entry = decoded_entry
             if skip_info:
                 if 'code' in entry and entry['code'] == 'info':
@@ -2758,24 +2774,11 @@ def writeToGitStream(self, gitMode, relPath, contents):
             self.gitStream.write(d)
         self.gitStream.write('\n')
 
-    def encodeWithUTF8(self, path):
-        try:
-            path.decode('ascii')
-        except:
-            encoding = 'utf8'
-            if gitConfig('git-p4.pathEncoding'):
-                encoding = gitConfig('git-p4.pathEncoding')
-            path = path.decode(encoding, 'replace').encode('utf8', 'replace')
-            if self.verbose:
-                print('Path with non-ASCII characters detected. Used %s to encode: %s ' % (encoding, path))
-        return path
-
     # output one file from the P4 stream
     # - helper for streamP4Files
 
     def streamOneP4File(self, file, contents):
         relPath = self.stripRepoPath(file['depotFile'], self.branchPrefixes)
-        relPath = self.encodeWithUTF8(relPath)
         if verbose:
             if 'fileSize' in self.stream_file:
                 size = int(self.stream_file['fileSize'])
@@ -2858,7 +2861,6 @@ def streamOneP4File(self, file, contents):
 
     def streamOneP4Deletion(self, file):
         relPath = self.stripRepoPath(file['path'], self.branchPrefixes)
-        relPath = self.encodeWithUTF8(relPath)
         if verbose:
             sys.stdout.write("delete %s\n" % relPath)
             sys.stdout.flush()
-- 
2.24.0.windows.2


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [RFC PATCH 4/4] git-p4: use utf-8 encoding for file paths throughout
  2019-11-28  1:28 ` [RFC PATCH 4/4] git-p4: use utf-8 encoding for file paths throughout Yang Zhao
@ 2019-11-28  2:57   ` Elijah Newren
  0 siblings, 0 replies; 7+ messages in thread
From: Elijah Newren @ 2019-11-28  2:57 UTC (permalink / raw)
  To: Yang Zhao; +Cc: Git Mailing List

On Wed, Nov 27, 2019 at 5:32 PM Yang Zhao <yang.zhao@skyboxlabs.com> wrote:
>
> Try to decode file paths in responses from p4 as soon as possible so
> that we are working with unicode string throughout the rest of the flow.
> This makes python 3 a lot happier.
>
> Signed-off-by: Yang Zhao <yang.zhao@skyboxlabs.com>
> ---
>
> This is probably the most risky patch out of the set. It's very likely
> that I've neglected to consider certain corner cases with decoding of
> path data.

Yes, this does seem somewhat risky to me.  It may go well on platforms
that require all filenames to be unicode.  And it may work for users
who happen to restrict their filenames to valid utf-8.  But this
abstraction doesn't fit the general problem, so some users may be left
out in the cold.

I tried multiple times while switching git-filter-repo from python2 to
python3, at different levels of pervasiveness, to use unicode more
generally.  But I mostly gave up; everyone knows files won't
necessarily be unicode, but you just can't assume filenames or commit
messages or branch or tag names (and perhaps a few other things I'm
forgetting) are either.  I ended up using bytestrings everywhere
except messages displayed to the user, and I only decode at that
point.


Of course, if perforce happens to only work with unicode filenames
then you'll be fine.  And perhaps you don't want or need to be as
paranoid as I was about what people could do.  So I don't know if my
experience applies in your case (I've never used perforce myself), but
I just thought I'd mention it in case it's useful.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC PATCH 0/4] git-p4: python 3 compatability
  2019-11-28  1:28 [RFC PATCH 0/4] git-p4: python 3 compatability Yang Zhao
                   ` (3 preceding siblings ...)
  2019-11-28  1:28 ` [RFC PATCH 4/4] git-p4: use utf-8 encoding for file paths throughout Yang Zhao
@ 2019-11-28 12:54 ` Johannes Schindelin
  4 siblings, 0 replies; 7+ messages in thread
From: Johannes Schindelin @ 2019-11-28 12:54 UTC (permalink / raw)
  To: Yang Zhao; +Cc: git

Hi,

On Wed, 27 Nov 2019, Yang Zhao wrote:

> This patchset the necessary changes to have git-p4 run under python 3.
>
> I'm submitting this as an RFC as the changes are not exhaustive. I do not have
> p4d available, and therefore cannot run the tests locally. I have made
> best-effort attempts to test the changes against a repo I do have read access to.

How about opening a PR at https://github.com/git/git? This will run our
Travis and Azure Pipelines builds, and I am sure that you can add a
debug-only patch to force-use Python3 on the Perforce tests (and if I were
you, I would also restrict the run to those tests by editing `.travis.yml`
and `azure-pipelines.yml`, both of which files should be easy enough to
understand without additional documentation).

Ciao,
Johannes

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2019-11-28 12:55 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-28  1:28 [RFC PATCH 0/4] git-p4: python 3 compatability Yang Zhao
2019-11-28  1:28 ` [RFC PATCH 1/4] git-p4: decode response from p4 to str for python3 Yang Zhao
2019-11-28  1:28 ` [RFC PATCH 2/4] git-p4: properly encode/decode communication with git for python 3 Yang Zhao
2019-11-28  1:28 ` [RFC PATCH 3/4] git-p4: open .gitp4-usercache.txt in text mode Yang Zhao
2019-11-28  1:28 ` [RFC PATCH 4/4] git-p4: use utf-8 encoding for file paths throughout Yang Zhao
2019-11-28  2:57   ` Elijah Newren
2019-11-28 12:54 ` [RFC PATCH 0/4] git-p4: python 3 compatability Johannes Schindelin

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).