git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH v2 0/3] Transition git-p4.py to support Python 3 only
@ 2021-12-10 15:30 Joel Holdsworth
  2021-12-10 15:30 ` [PATCH v2 1/3] git-p4: remove support for Python 2 Joel Holdsworth
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Joel Holdsworth @ 2021-12-10 15:30 UTC (permalink / raw)
  To: git
  Cc: Tzadik Vanderhoof, Dorgon Chang, Joachim Kuebart, Daniel Levin,
	Johannes Schindelin, Luke Diamand, Ben Keene, Andrew Oakley,
	Joel Holdsworth

The git-p4.py script currently implements code paths for both Python 2
and 3.

Python 2 was discontinued in 2020, and there is no longer any officially
supported interpreter. Further development of git-p4.py will require
would-be developers to test their changes with all supported dialects of
the language. However, if there is no longer any supported runtime
environment available, this places an unreasonable burden on the Git
project to maintain support for an obselete dialect of the language.

The patch-set removes all Python 2-specific code paths.

This second revisision of the patch-set is more tightly focussed on the
task of retiring Python 2, and excludes previously submitted patches
that contain other tidy-ups and bug fixes.

Joel Holdsworth (3):
  git-p4: remove support for Python 2
  git-p4: eliminate decode_stream and encode_stream
  git-p4: add "Nvidia Corporation" to copyright header

 git-p4.py | 132 +++++++++++++++++++-----------------------------------
 1 file changed, 45 insertions(+), 87 deletions(-)

-- 
2.33.0


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH v2 1/3] git-p4: remove support for Python 2
  2021-12-10 15:30 [PATCH v2 0/3] Transition git-p4.py to support Python 3 only Joel Holdsworth
@ 2021-12-10 15:30 ` Joel Holdsworth
  2021-12-12 17:50   ` Andrew Oakley
  2021-12-10 15:31 ` [PATCH v2 2/3] git-p4: eliminate decode_stream and encode_stream Joel Holdsworth
  2021-12-10 15:31 ` [PATCH v2 3/3] git-p4: add "Nvidia Corporation" to copyright header Joel Holdsworth
  2 siblings, 1 reply; 9+ messages in thread
From: Joel Holdsworth @ 2021-12-10 15:30 UTC (permalink / raw)
  To: git
  Cc: Tzadik Vanderhoof, Dorgon Chang, Joachim Kuebart, Daniel Levin,
	Johannes Schindelin, Luke Diamand, Ben Keene, Andrew Oakley,
	Joel Holdsworth

git-p4 previously contained seperate code-paths for Python 2 and 3 to
abstract away the differences in string handling behaviour between the
two platforms.

This patch removes the Python 2 code-paths within this abstraction
without removing the abstractions themselves. These will be removed in
later patches to further modernise the script.

The motivation for this change is that there is a family of issues with
git-p4's handling of incoming text data when it contains bytes which
cannot be decoded into UTF-8 characters. For text files created in
Windows, CP1252 Smart Quote Characters (0x93 and 0x94) are seen fairly
frequently. These codes are invalid in UTF-8, so if the script
encounters any file or file name containing them, on Python 2 the
symbols will be corrupted, and on Python 3 the script will fail with an
exception.

In order to address these issues it will be necessary to overhaul
git-p4's handling of incoming data. Keeping a clean separation between
encoded bytes and decoded text is much easier to do in Python 3. If
Python 2 support must be maintained, this will require careful testing
of the separate code paths for each platform, which is unreasonable
given that Python 2 is now thoroughly deprecated.

The minimum supported Python version has been set to 3.6. This version
is no longer supported by the Python project, however at the current
time it is still available for use in RHEL 8. No features from newer
versions of Python are currently required.

Signed-off-by: Joel Holdsworth <jholdsworth@nvidia.com>
---
 git-p4.py | 90 ++++++++++++++++++-------------------------------------
 1 file changed, 29 insertions(+), 61 deletions(-)

diff --git a/git-p4.py b/git-p4.py
index 2b4500226a..e3fe86e4f2 100755
--- a/git-p4.py
+++ b/git-p4.py
@@ -1,4 +1,4 @@
-#!/usr/bin/env python
+#!/usr/bin/env python3
 #
 # git-p4.py -- A tool for bidirectional operation between a Perforce depot and git.
 #
@@ -16,8 +16,9 @@
 # pylint: disable=too-many-branches,too-many-nested-blocks
 #
 import sys
-if sys.version_info.major < 3 and sys.version_info.minor < 7:
-    sys.stderr.write("git-p4: requires Python 2.7 or later.\n")
+if (sys.version_info.major < 3 or
+    (sys.version_info.major == 3 and sys.version_info.minor < 6)):
+    sys.stderr.write("git-p4: requires Python 3.6 or later.\n")
     sys.exit(1)
 import os
 import optparse
@@ -36,16 +37,6 @@
 import errno
 import glob
 
-# On python2.7 where raw_input() and input() are both availble,
-# we want raw_input's semantics, but aliased to input for python3
-# compatibility
-# support basestring in python3
-try:
-    if raw_input and input:
-        input = raw_input
-except:
-    pass
-
 verbose = False
 
 # Only labels/tags matching this will be imported/exported
@@ -176,35 +167,16 @@ def prompt(prompt_text):
         if response in choices:
             return response
 
-# We need different encoding/decoding strategies for text data being passed
-# around in pipes depending on python version
-if bytes is not str:
-    # For python3, always encode and decode as appropriate
-    def decode_text_stream(s):
-        return s.decode() if isinstance(s, bytes) else s
-    def encode_text_stream(s):
-        return s.encode() if isinstance(s, str) else s
-else:
-    # For python2.7, pass read strings as-is, but also allow writing unicode
-    def decode_text_stream(s):
-        return s
-    def encode_text_stream(s):
-        return s.encode('utf_8') if isinstance(s, unicode) else s
+def decode_text_stream(s):
+    return s.decode() if isinstance(s, bytes) else s
+def encode_text_stream(s):
+    return s.encode() if isinstance(s, str) else s
 
 def decode_path(path):
     """Decode a given string (bytes or otherwise) using configured path encoding options
     """
     encoding = gitConfig('git-p4.pathEncoding') or 'utf_8'
-    if bytes is not str:
-        return path.decode(encoding, errors='replace') if isinstance(path, bytes) else path
-    else:
-        try:
-            path.decode('ascii')
-        except:
-            path = path.decode(encoding, errors='replace')
-            if verbose:
-                print('Path with non-ASCII characters detected. Used {} to decode: {}'.format(encoding, path))
-        return path
+    return path.decode(encoding, errors='replace') if isinstance(path, bytes) else path
 
 def run_git_hook(cmd, param=[]):
     """Execute a hook if the hook exists."""
@@ -289,8 +261,8 @@ def write_pipe(c, stdin):
 
 def p4_write_pipe(c, stdin):
     real_cmd = p4_build_cmd(c)
-    if bytes is not str and isinstance(stdin, str):
-        stdin = encode_text_stream(stdin)
+    if isinstance(stdin, str):
+        stdin = stdin.encode()
     return write_pipe(real_cmd, stdin)
 
 def read_pipe_full(c):
@@ -762,21 +734,18 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False,
     result = []
     try:
         while True:
-            entry = marshal.load(p4.stdout)
-            if bytes is not str:
-                # Decode unmarshalled dict to use str keys and values, except for:
-                #   - `data` which may contain arbitrary binary data
-                #   - `depotFile[0-9]*`, `path`, or `clientFile` which may contain non-UTF8 encoded text
-                decoded_entry = {}
-                for key, value in entry.items():
-                    key = key.decode()
-                    if isinstance(value, bytes) and not (key in ('data', 'path', 'clientFile') or key.startswith('depotFile')):
-                        value = value.decode()
-                    decoded_entry[key] = value
-                # Parse out data if it's an error response
-                if decoded_entry.get('code') == 'error' and 'data' in decoded_entry:
-                    decoded_entry['data'] = decoded_entry['data'].decode()
-                entry = decoded_entry
+            # Decode unmarshalled dict to use str keys and values, except for:
+            #   - `data` which may contain arbitrary binary data
+            #   - `depotFile[0-9]*`, `path`, or `clientFile` which may contain non-UTF8 encoded text
+            entry = {}
+            for key, value in marshal.load(p4.stdout).items():
+                key = key.decode()
+                if isinstance(value, bytes) and not (key in ('data', 'path', 'clientFile') or key.startswith('depotFile')):
+                    value = value.decode()
+                entry[key] = value
+            # Parse out data if it's an error response
+            if entry.get('code') == 'error' and 'data' in entry:
+                entry['data'] = entry['data'].decode()
             if skip_info:
                 if 'code' in entry and entry['code'] == 'info':
                     continue
@@ -3840,14 +3809,13 @@ def openStreams(self):
         self.gitStream = self.importProcess.stdin
         self.gitError = self.importProcess.stderr
 
-        if bytes is not str:
-            # Wrap gitStream.write() so that it can be called using `str` arguments
-            def make_encoded_write(write):
-                def encoded_write(s):
-                    return write(s.encode() if isinstance(s, str) else s)
-                return encoded_write
+        # Wrap gitStream.write() so that it can be called using `str` arguments
+        def make_encoded_write(write):
+            def encoded_write(s):
+                return write(s.encode() if isinstance(s, str) else s)
+            return encoded_write
 
-            self.gitStream.write = make_encoded_write(self.gitStream.write)
+        self.gitStream.write = make_encoded_write(self.gitStream.write)
 
     def closeStreams(self):
         if self.gitStream is None:
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v2 2/3] git-p4: eliminate decode_stream and encode_stream
  2021-12-10 15:30 [PATCH v2 0/3] Transition git-p4.py to support Python 3 only Joel Holdsworth
  2021-12-10 15:30 ` [PATCH v2 1/3] git-p4: remove support for Python 2 Joel Holdsworth
@ 2021-12-10 15:31 ` Joel Holdsworth
  2021-12-10 15:31 ` [PATCH v2 3/3] git-p4: add "Nvidia Corporation" to copyright header Joel Holdsworth
  2 siblings, 0 replies; 9+ messages in thread
From: Joel Holdsworth @ 2021-12-10 15:31 UTC (permalink / raw)
  To: git
  Cc: Tzadik Vanderhoof, Dorgon Chang, Joachim Kuebart, Daniel Levin,
	Johannes Schindelin, Luke Diamand, Ben Keene, Andrew Oakley,
	Joel Holdsworth

The decode_stream and encode_stream functions previously abstracted away
the differences in string encode/decode behaviour between Python 2 and
Python 3.

Given that Python 2 is no longer supported, and the code paths for it
have been removed, the abstraction is no longer necessary, and the
script can therefore be simplified by eliminating these functions.

Signed-off-by: Joel Holdsworth <jholdsworth@nvidia.com>
---
 git-p4.py | 49 +++++++++++++++++++------------------------------
 1 file changed, 19 insertions(+), 30 deletions(-)

diff --git a/git-p4.py b/git-p4.py
index e3fe86e4f2..5568d44c72 100755
--- a/git-p4.py
+++ b/git-p4.py
@@ -167,11 +167,6 @@ def prompt(prompt_text):
         if response in choices:
             return response
 
-def decode_text_stream(s):
-    return s.decode() if isinstance(s, bytes) else s
-def encode_text_stream(s):
-    return s.encode() if isinstance(s, str) else s
-
 def decode_path(path):
     """Decode a given string (bytes or otherwise) using configured path encoding options
     """
@@ -276,7 +271,7 @@ def read_pipe_full(c):
     expand = not isinstance(c, list)
     p = subprocess.Popen(c, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=expand)
     (out, err) = p.communicate()
-    return (p.returncode, out, decode_text_stream(err))
+    return (p.returncode, out, err.decode())
 
 def read_pipe(c, ignore_error=False, raw=False):
     """ Read output from  command. Returns the output text on
@@ -288,22 +283,17 @@ def read_pipe(c, ignore_error=False, raw=False):
     (retcode, out, err) = read_pipe_full(c)
     if retcode != 0:
         if ignore_error:
-            out = ""
+            out = b""
         else:
             die('Command failed: %s\nError: %s' % (str(c), err))
-    if not raw:
-        out = decode_text_stream(out)
-    return out
+    return out if raw else out.decode()
 
 def read_pipe_text(c):
     """ Read output from a command with trailing whitespace stripped.
         On error, returns None.
     """
     (retcode, out, err) = read_pipe_full(c)
-    if retcode != 0:
-        return None
-    else:
-        return decode_text_stream(out).rstrip()
+    return out.decode().rstrip() if retcode == 0 else None
 
 def p4_read_pipe(c, ignore_error=False, raw=False):
     real_cmd = p4_build_cmd(c)
@@ -316,7 +306,7 @@ def read_pipe_lines(c):
     expand = not isinstance(c, list)
     p = subprocess.Popen(c, stdout=subprocess.PIPE, shell=expand)
     pipe = p.stdout
-    val = [decode_text_stream(line) for line in pipe.readlines()]
+    val = [line.decode() for line in pipe.readlines()]
     if pipe.close() or p.wait():
         die('Command failed: %s' % str(c))
     return val
@@ -346,7 +336,7 @@ def p4_has_move_command():
     cmd = p4_build_cmd(["move", "-k", "@from", "@to"])
     p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
     (out, err) = p.communicate()
-    err = decode_text_stream(err)
+    err = err.decode()
     # return code will be 1 in either case
     if err.find("Invalid option") >= 0:
         return False
@@ -721,7 +711,7 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False,
             stdin_file.write(stdin)
         else:
             for i in stdin:
-                stdin_file.write(encode_text_stream(i))
+                stdin_file.write(i.encode() if isinstance(i, str) else i)
                 stdin_file.write(b'\n')
         stdin_file.flush()
         stdin_file.seek(0)
@@ -963,8 +953,7 @@ def branch_exists(branch):
 
     cmd = [ "git", "rev-parse", "--symbolic", "--verify", branch ]
     p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
-    out, _ = p.communicate()
-    out = decode_text_stream(out)
+    out = p.communicate()[0].decode()
     if p.returncode:
         return False
     # expect exactly one line of output: the branch name
@@ -1349,7 +1338,7 @@ def generatePointer(self, contentFile):
             ['git', 'lfs', 'pointer', '--file=' + contentFile],
             stdout=subprocess.PIPE
         )
-        pointerFile = decode_text_stream(pointerProcess.stdout.read())
+        pointerFile = pointerProcess.stdout.read().decode()
         if pointerProcess.wait():
             os.remove(contentFile)
             die('git-lfs pointer command failed. Did you install the extension?')
@@ -2148,7 +2137,7 @@ def applyCommit(self, id):
         tmpFile = os.fdopen(handle, "w+b")
         if self.isWindows:
             submitTemplate = submitTemplate.replace("\n", "\r\n")
-        tmpFile.write(encode_text_stream(submitTemplate))
+        tmpFile.write(submitTemplate.encode())
         tmpFile.close()
 
         submitted = False
@@ -2204,8 +2193,8 @@ def applyCommit(self, id):
                         return False
 
                 # read the edited message and submit
-                tmpFile = open(fileName, "rb")
-                message = decode_text_stream(tmpFile.read())
+                with open(fileName, "r") as tmpFile:
+                    message = tmpFile.read()
                 tmpFile.close()
                 if self.isWindows:
                     message = message.replace("\r\n", "\n")
@@ -2905,7 +2894,7 @@ def splitFilesIntoBranches(self, commit):
         return branches
 
     def writeToGitStream(self, gitMode, relPath, contents):
-        self.gitStream.write(encode_text_stream(u'M {} inline {}\n'.format(gitMode, relPath)))
+        self.gitStream.write('M {} inline {}\n'.format(gitMode, relPath))
         self.gitStream.write('data %d\n' % sum(len(d) for d in contents))
         for d in contents:
             self.gitStream.write(d)
@@ -2947,7 +2936,7 @@ def streamOneP4File(self, file, contents):
             git_mode = "120000"
             # p4 print on a symlink sometimes contains "target\n";
             # if it does, remove the newline
-            data = ''.join(decode_text_stream(c) for c in contents)
+            data = ''.join(c.decode() for c in contents)
             if not data:
                 # Some version of p4 allowed creating a symlink that pointed
                 # to nothing.  This causes p4 errors when checking out such
@@ -3001,9 +2990,9 @@ def streamOneP4File(self, file, contents):
         pattern = p4_keywords_regexp_for_type(type_base, type_mods)
         if pattern:
             regexp = re.compile(pattern, re.VERBOSE)
-            text = ''.join(decode_text_stream(c) for c in contents)
+            text = ''.join(c.decode() for c in contents)
             text = regexp.sub(r'$\1$', text)
-            contents = [ encode_text_stream(text) ]
+            contents = [text.encode()]
 
         if self.largeFileSystem:
             (git_mode, contents) = self.largeFileSystem.processContent(git_mode, relPath, contents)
@@ -3015,7 +3004,7 @@ def streamOneP4Deletion(self, file):
         if verbose:
             sys.stdout.write("delete %s\n" % relPath)
             sys.stdout.flush()
-        self.gitStream.write(encode_text_stream(u'D {}\n'.format(relPath)))
+        self.gitStream.write('D {}\n'.format(relPath))
 
         if self.largeFileSystem and self.largeFileSystem.isLargeFile(relPath):
             self.largeFileSystem.removeLargeFile(relPath)
@@ -3115,9 +3104,9 @@ def streamP4FilesCbSelf(entry):
                 if 'shelved_cl' in f:
                     # Handle shelved CLs using the "p4 print file@=N" syntax to print
                     # the contents
-                    fileArg = f['path'] + encode_text_stream('@={}'.format(f['shelved_cl']))
+                    fileArg = f['path'] + '@={}'.format(f['shelved_cl']).encode()
                 else:
-                    fileArg = f['path'] + encode_text_stream('#{}'.format(f['rev']))
+                    fileArg = f['path'] + '#{}'.format(f['rev']).encode()
 
                 fileArgs.append(fileArg)
 
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v2 3/3] git-p4: add "Nvidia Corporation" to copyright header
  2021-12-10 15:30 [PATCH v2 0/3] Transition git-p4.py to support Python 3 only Joel Holdsworth
  2021-12-10 15:30 ` [PATCH v2 1/3] git-p4: remove support for Python 2 Joel Holdsworth
  2021-12-10 15:31 ` [PATCH v2 2/3] git-p4: eliminate decode_stream and encode_stream Joel Holdsworth
@ 2021-12-10 15:31 ` Joel Holdsworth
  2021-12-10 18:57   ` Luke Diamand
  2021-12-11 21:19   ` Elijah Newren
  2 siblings, 2 replies; 9+ messages in thread
From: Joel Holdsworth @ 2021-12-10 15:31 UTC (permalink / raw)
  To: git
  Cc: Tzadik Vanderhoof, Dorgon Chang, Joachim Kuebart, Daniel Levin,
	Johannes Schindelin, Luke Diamand, Ben Keene, Andrew Oakley,
	Joel Holdsworth

The inclusion of the coorporate copyright is a stipulation of the
company code release process.
---
 git-p4.py | 1 +
 1 file changed, 1 insertion(+)

diff --git a/git-p4.py b/git-p4.py
index 5568d44c72..17e18265dc 100755
--- a/git-p4.py
+++ b/git-p4.py
@@ -5,6 +5,7 @@
 # Author: Simon Hausmann <simon@lst.de>
 # Copyright: 2007 Simon Hausmann <simon@lst.de>
 #            2007 Trolltech ASA
+#            2021 Nvidia Corporation
 # License: MIT <http://www.opensource.org/licenses/mit-license.php>
 #
 # pylint: disable=invalid-name,missing-docstring,too-many-arguments,broad-except
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH v2 3/3] git-p4: add "Nvidia Corporation" to copyright header
  2021-12-10 15:31 ` [PATCH v2 3/3] git-p4: add "Nvidia Corporation" to copyright header Joel Holdsworth
@ 2021-12-10 18:57   ` Luke Diamand
  2021-12-11 21:19   ` Elijah Newren
  1 sibling, 0 replies; 9+ messages in thread
From: Luke Diamand @ 2021-12-10 18:57 UTC (permalink / raw)
  To: Joel Holdsworth
  Cc: Git Users, Tzadik Vanderhoof, Dorgon Chang, Joachim Kuebart,
	Daniel Levin, Johannes Schindelin, Ben Keene, Andrew Oakley

On Fri, 10 Dec 2021 at 15:31, Joel Holdsworth <jholdsworth@nvidia.com> wrote:
>
> The inclusion of the coorporate copyright is a stipulation of the
> company code release process.

This doesn't seem right to me.

It seems very odd that there's a patch that just adds a copyright
notice, with no further notice.

What code does this cover and is now copyrighted? What are the license
terms of the copyright holder? I think these things need to be made
clear before this could be accepted.

> ---
>  git-p4.py | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/git-p4.py b/git-p4.py
> index 5568d44c72..17e18265dc 100755
> --- a/git-p4.py
> +++ b/git-p4.py
> @@ -5,6 +5,7 @@
>  # Author: Simon Hausmann <simon@lst.de>
>  # Copyright: 2007 Simon Hausmann <simon@lst.de>
>  #            2007 Trolltech ASA
> +#            2021 Nvidia Corporation
>  # License: MIT <http://www.opensource.org/licenses/mit-license.php>
>  #
>  # pylint: disable=invalid-name,missing-docstring,too-many-arguments,broad-except
> --
> 2.33.0
>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2 3/3] git-p4: add "Nvidia Corporation" to copyright header
  2021-12-10 15:31 ` [PATCH v2 3/3] git-p4: add "Nvidia Corporation" to copyright header Joel Holdsworth
  2021-12-10 18:57   ` Luke Diamand
@ 2021-12-11 21:19   ` Elijah Newren
  2021-12-13  0:11     ` brian m. carlson
  1 sibling, 1 reply; 9+ messages in thread
From: Elijah Newren @ 2021-12-11 21:19 UTC (permalink / raw)
  To: Joel Holdsworth
  Cc: Git Mailing List, Tzadik Vanderhoof, Dorgon Chang,
	Joachim Kuebart, Daniel Levin, Johannes Schindelin, Luke Diamand,
	Ben Keene, Andrew Oakley

On Fri, Dec 10, 2021 at 12:30 PM Joel Holdsworth <jholdsworth@nvidia.com> wrote:
>
> The inclusion of the coorporate copyright is a stipulation of the
> company code release process.
> ---
>  git-p4.py | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/git-p4.py b/git-p4.py
> index 5568d44c72..17e18265dc 100755
> --- a/git-p4.py
> +++ b/git-p4.py
> @@ -5,6 +5,7 @@
>  # Author: Simon Hausmann <simon@lst.de>
>  # Copyright: 2007 Simon Hausmann <simon@lst.de>
>  #            2007 Trolltech ASA
> +#            2021 Nvidia Corporation
>  # License: MIT <http://www.opensource.org/licenses/mit-license.php>
>  #
>  # pylint: disable=invalid-name,missing-docstring,too-many-arguments,broad-except
> --
> 2.33.0

Can we just get rid of all these copyright notices from all files in
Git?  They're obviously out-of-date and not even close to an accurate
indicator of authorship.  For example, builtin/branch.c has:

 * Copyright (c) 2006 Kristian Høgsberg <krh@redhat.com>
 * Based on git-branch.sh by Junio C Hamano.

Kristian only authored 1 patch for this file (though that one patch
was submitted and attributed to Lars Hjemli in c31820c26b ("Make
git-branch a builtin", 2006-10-23) with a note in the commit message
about Kristian being the real author).  I added a simple replace
object to change the author attribution on Kristian's commit back to
him, and then...

Looking at shortlog, there are 86 different authors, with the top 7
having this many commits:

$ git shortlog -sn --no-merges -- builtin/branch.c builtin-branch.c | head -n 7
    42 Junio C Hamano
    37 Jeff King
    30 Nguyễn Thái Ngọc Duy
    16 Karthik Nayak
    12 René Scharfe
    11 Lars Hjemli
    11 Ævar Arnfjörð Bjarmason

Looking at git blame, there are 61 different authors who have lines of
code surviving until today, with the top 7 being:

$ git blame -C -C builtin/branch.c | awk '{print $3 " " $4}' | sort |
uniq -c | sort -rn | head -n 7
    139 (Junio C
    129 (Nguyễn Thái
    122 (Karthik Nayak
     40 (Jeff King
     38 (Kristian H�gsberg
     36 (Sahil Dua
     35 (René Scharfe

So the copyright notice is horribly misleading at best.  It also seems
like the wrong way to figure out the answer to _any_ question I can
think of.  (Some examples: "Who can review my changes to this file?",
"Who do I need to contact for permission to relicense?", "Who should I
praise for doing the work of making this code so great for me?", etc.)
-- in all cases, shortlog, log, and blame are better tools.

Can we just git rid of these lines entirely?

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2 1/3] git-p4: remove support for Python 2
  2021-12-10 15:30 ` [PATCH v2 1/3] git-p4: remove support for Python 2 Joel Holdsworth
@ 2021-12-12 17:50   ` Andrew Oakley
  2021-12-12 22:01     ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 9+ messages in thread
From: Andrew Oakley @ 2021-12-12 17:50 UTC (permalink / raw)
  To: Joel Holdsworth
  Cc: git, Tzadik Vanderhoof, Dorgon Chang, Joachim Kuebart,
	Daniel Levin, Johannes Schindelin, Luke Diamand, Ben Keene

On Fri, 10 Dec 2021 15:30:59 +0000
Joel Holdsworth <jholdsworth@nvidia.com> wrote:
> The motivation for this change is that there is a family of issues
> with git-p4's handling of incoming text data when it contains bytes
> which cannot be decoded into UTF-8 characters. For text files created
> in Windows, CP1252 Smart Quote Characters (0x93 and 0x94) are seen
> fairly frequently. These codes are invalid in UTF-8, so if the script
> encounters any file or file name containing them, on Python 2 the
> symbols will be corrupted, and on Python 3 the script will fail with
> an exception.

As I've pointed out previously, peforce fails to store the encoding of
text like commit messages.  With Windows perforce clients, the encoding
used seems to be based on the current code page on the client which
made the commit.  If you're part of a global organisation with people
in different locales making commits then you will find that there is
not a consistent encoding for commit messages.

Given that you don't know the encoding of the text, what's the best
thing to do with the data?  Options I can see are:

- Feed the raw bytes directly into git.  The i18n.commitEncoding config
  option can be set by the user if they want to attempt to decode the
  commit messages in something other than UTF-8.
- Attempt to detect the encoding somehow, feed the raw bytes directly
  into git and set the encoding on the commit.
- Attempt to dedect the encoding somehow and reencode everything into
  UTF-8.

Right now, if you use python2 then you get the behaviour as described
in the first of these options.  It doesn't "corrupt" anything, it just
transfers the bytes from perforce into git.  If you use python3 then
git-p4 is unusable because it throws exceptions trying to decode things.

It's not clear to me how "attempt to detect the encoding somehow" would
work.  The first option therefore seems like the best choice.

I think that this is the problem which really needs solving.  Dropping
support for python2 doesn't make the issue go away (although it might
make it slightly easier to write the code).  I think that the python2
compatibility should be maintained at least until the encoding problems
have been solved for python3.

I previously wrote some patches which attempt to move in what I think is
the right direction, but unfortunately they never got upstreamed:

https://lore.kernel.org/git/20210412085251.51475-1-andrew@adoakley.name/

Your comments elsewhere that git-p4 could benifit from some clean-up
seem accurate to me, and it would be good to see that kind of change.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2 1/3] git-p4: remove support for Python 2
  2021-12-12 17:50   ` Andrew Oakley
@ 2021-12-12 22:01     ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 9+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-12-12 22:01 UTC (permalink / raw)
  To: Andrew Oakley
  Cc: Joel Holdsworth, git, Tzadik Vanderhoof, Dorgon Chang,
	Joachim Kuebart, Daniel Levin, Johannes Schindelin, Luke Diamand,
	Ben Keene


On Sun, Dec 12 2021, Andrew Oakley wrote:

> On Fri, 10 Dec 2021 15:30:59 +0000
> Joel Holdsworth <jholdsworth@nvidia.com> wrote:
>> The motivation for this change is that there is a family of issues
>> with git-p4's handling of incoming text data when it contains bytes
>> which cannot be decoded into UTF-8 characters. For text files created
>> in Windows, CP1252 Smart Quote Characters (0x93 and 0x94) are seen
>> fairly frequently. These codes are invalid in UTF-8, so if the script
>> encounters any file or file name containing them, on Python 2 the
>> symbols will be corrupted, and on Python 3 the script will fail with
>> an exception.
>
> As I've pointed out previously, peforce fails to store the encoding of
> text like commit messages.  With Windows perforce clients, the encoding
> used seems to be based on the current code page on the client which
> made the commit.  If you're part of a global organisation with people
> in different locales making commits then you will find that there is
> not a consistent encoding for commit messages.
>
> Given that you don't know the encoding of the text, what's the best
> thing to do with the data?  Options I can see are:
>
> - Feed the raw bytes directly into git.  The i18n.commitEncoding config
>   option can be set by the user if they want to attempt to decode the
>   commit messages in something other than UTF-8.
> - Attempt to detect the encoding somehow, feed the raw bytes directly
>   into git and set the encoding on the commit.
> - Attempt to dedect the encoding somehow and reencode everything into
>   UTF-8.
>
> Right now, if you use python2 then you get the behaviour as described
> in the first of these options.  It doesn't "corrupt" anything, it just
> transfers the bytes from perforce into git.  If you use python3 then
> git-p4 is unusable because it throws exceptions trying to decode things.
>
> [...]
>
> I think that this is the problem which really needs solving.  Dropping
> support for python2 doesn't make the issue go away (although it might
> make it slightly easier to write the code).  I think that the python2
> compatibility should be maintained at least until the encoding problems
> have been solved for python3.
>
> I previously wrote some patches which attempt to move in what I think is
> the right direction, but unfortunately they never got upstreamed:
>
> https://lore.kernel.org/git/20210412085251.51475-1-andrew@adoakley.name/
>
> Your comments elsewhere that git-p4 could benifit from some clean-up
> seem accurate to me, and it would be good to see that kind of change.

This summary makes sense, i.e. if the original SCM doesn't have a
declared or consistent encoding then having no "encoding" header etc. in
git likewise makes sense, and we should be trying to handle it in our
output layer.

[Snipped from above]:

> It's not clear to me how "attempt to detect the encoding somehow" would
> work.  The first option therefore seems like the best choice.

This really isn't possible to do in the general case, but you can get
pretty far with heuristics.

The best way to do this that I'm aware of is Mozilla's character
detection library:

    https://www-archive.mozilla.org/projects/intl/chardetinterface

Here's an old (maybe/probably not up-to-date) copy of its sources that
I've worked with:

    https://metacpan.org/release/JGMYERS/Encode-Detect-1.01/source/src?sort=[[2,1]]

I.e. if you get arbitrary text the best you can do if you're going in
blind is to have various character/language frequency tables to try to
guess at what encoding is the most plausible, but even then you might
still be wrong.

I'd think for git-p4 (and git-svn, how do we handle this there?) a
sensible first approximation would be to use UTF-8, and if we encounter
data that doesn't conform die.

Then offer the user to manually configure a "fallback" encoding. I think
most real-world projects only have two of those, e.g. some old latin1
data before a UTF-8 migration.

And maybe start with a "don't even try" mode, which AFAICT is what
you're describing Python 2 doing.

If we change the Python 3 code to do what the Python 2 code does now,
will we pull the rug from under users who have only ever used the Python
3 code, and are relying on those semantics somehow?

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2 3/3] git-p4: add "Nvidia Corporation" to copyright header
  2021-12-11 21:19   ` Elijah Newren
@ 2021-12-13  0:11     ` brian m. carlson
  0 siblings, 0 replies; 9+ messages in thread
From: brian m. carlson @ 2021-12-13  0:11 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Joel Holdsworth, Git Mailing List, Tzadik Vanderhoof,
	Dorgon Chang, Joachim Kuebart, Daniel Levin, Johannes Schindelin,
	Luke Diamand, Ben Keene, Andrew Oakley

[-- Attachment #1: Type: text/plain, Size: 2350 bytes --]

On 2021-12-11 at 21:19:18, Elijah Newren wrote:
> On Fri, Dec 10, 2021 at 12:30 PM Joel Holdsworth <jholdsworth@nvidia.com> wrote:
> >
> > The inclusion of the coorporate copyright is a stipulation of the
> > company code release process.
> > ---
> >  git-p4.py | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/git-p4.py b/git-p4.py
> > index 5568d44c72..17e18265dc 100755
> > --- a/git-p4.py
> > +++ b/git-p4.py
> > @@ -5,6 +5,7 @@
> >  # Author: Simon Hausmann <simon@lst.de>
> >  # Copyright: 2007 Simon Hausmann <simon@lst.de>
> >  #            2007 Trolltech ASA
> > +#            2021 Nvidia Corporation
> >  # License: MIT <http://www.opensource.org/licenses/mit-license.php>
> >  #
> >  # pylint: disable=invalid-name,missing-docstring,too-many-arguments,broad-except
> > --
> > 2.33.0
> 
> Can we just git rid of these lines entirely?

In the case of the MIT License, it is a condition of the license that
the copyright notices be preserved, so no, we cannot remove them.
Specifically, the first paragraph, which grants permissions states that
they are "subject to the following conditions", one of which is as
follows:

  The above copyright notice and this permission notice shall be
  included in all copies or substantial portions of the Software.

The other is the total exclusion of warranty or liability.

As for the rest of the codebase, the GPL v2 states that the exercise of
copying and distribution is permitted, "provided that you conspicuously
and appropriately publish on each copy an appropriate copyright notice
and disclaimer of warranty."  I am not an attorney, but I'm pretty sure
that it would not be permissible to remove a copyright notice unless the
code to which it referred were no longer present and that would not
count as publishing an appropriate copyright notice.

However, that doesn't mean we need to add to them, but I will state that
as a contributor who primarily contributes on his own time, I don't
think it's unreasonable for a contributor to request that a copyright
notice be applied where applicable as attribution, since that's the only
compensation one receives for one's contributions.  Such copyright
notices could live in a central file for convenience, however.
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2021-12-13  0:12 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-10 15:30 [PATCH v2 0/3] Transition git-p4.py to support Python 3 only Joel Holdsworth
2021-12-10 15:30 ` [PATCH v2 1/3] git-p4: remove support for Python 2 Joel Holdsworth
2021-12-12 17:50   ` Andrew Oakley
2021-12-12 22:01     ` Ævar Arnfjörð Bjarmason
2021-12-10 15:31 ` [PATCH v2 2/3] git-p4: eliminate decode_stream and encode_stream Joel Holdsworth
2021-12-10 15:31 ` [PATCH v2 3/3] git-p4: add "Nvidia Corporation" to copyright header Joel Holdsworth
2021-12-10 18:57   ` Luke Diamand
2021-12-11 21:19   ` Elijah Newren
2021-12-13  0:11     ` brian m. carlson

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).